LW ROUNDTABLE: CrowdStrike outage reveals long road ahead to achieve digital resiliency

by bacohido on July 25, 2024

By Byron V. Acohido

Last week, CrowdStrike, one of the cybersecurity industry’s most reputable solution providers, inadvertently caused more disruption across the Internet than all the threat actors active online at the time.

A flawed update to CrowdStrike’s Falcon security software caused millions of computers running Microsoft Windows to display the infamous blue screen of death. More than 5,000 flights and an untold number of hospital procedures were canceled, and banking services were knocked offline across the globe. To restore normal operations, organizations had to reboot in safe mode or use the Windows Recovery Environment.

While inexcusable, the CrowdStrike outage was not terribly surprising. It falls right in line with a seemingly never-ending series of major cyber incidents that continue to expose the stark fragility of our digital infrastructure.

Elusive resiliency

Unless things change, digital resiliency will remain elusive. This is because the pace of digitizing business operations continues to intensify, resulting in new services – and fresh exposures.

All companies – including cybersecurity vendors – are racing to leverage automation and AI to boost innovation, i.e., to increase revenue. The new attack vectors that spin out of this chase entice cyber adversaries to continually iterate and improve upon tried-and-true cyber attack tools and techniques, with the goal of gaining unauthorized network access.

Lest we forget, catastrophes of the CrowdStrike outage class abound, including:

•SolarWinds supply chain debacle

•Colonial Pipeline ransomware attack

•Microsoft Exchange Server hack

•JBS Foods ransomware attack

•Kaseya VSA ransomware attack

•Facebook users’ data leak

•Log4j/Log4Shell vulnerability

•T-Mobile users’ data breach

The CrowdStrike outage drips with irony because the culprit is not a criminal hacking collective or a nation state actor; it’s a marquee cybersecurity vendor. CrowdStrike abjectly failed to recognize a gaping exposure lurking in its automated update delivery system – a flaw with the potential to cause catastrophic disruption up and down its supply chain, which is exactly what happened.

SolarWinds redux

In many ways, CrowdStrike was a repeat of the SolarWinds supply chain hack. In the latter, a threat actor purposefully identified and exploited a soft spot in SolarWinds’ automated software update service.

This time around, CrowdStrike internally got tripped up by an automation flaw that it very well should have sussed out before delivering anything to its customers.

Thus, once again we’re reminded that taming digital complexity remains a huge challenge, and that digital resiliency remains as elusive as ever. With this in mind, Last Watchdog sought commentary from technology thought leaders about what the CrowdStrike outage says about the state of digital resiliency. Responses edited for clarity and length:

Geoffrey Mattson, CEO, Xage

Like many vendors of their vintage, CrowdStrike built their product on an “agent” that must be installed deep in each laptop or server. This introduces complexity and allows a bug to have the type of deep impact that we’ve witnessed with this incident . . . Since the agent had not been vetted, it inflicted the same damage as malware would have. Implementing zero trust across the entirety of the technology stack would go a long way toward increasing resilience against events like this.

Justin Endres, CRO, Seclore

Fallout from the recent disruption caused by a CrowdStrike update highlights how widespread reliance on any one solution can lead to global outages. This incident underscores the critical importance of diversifying our digital infrastructure . . . If this had been a cyber attack exploiting a nearly universal vulnerability, the implications could have been far worse. Recovery will be measured in weeks, not hours, as many of the impacted systems will need to be rebuilt manually. Clearly, Microsoft must also enhance its operating system so Windows can automatically recover from this type of system error.

Tamara Nolan, Cyber & Operational Resilience Practice Leader, MorganFranklin

This event was only possible because of the number of organizations that updated to the latest version of their security software. They were following best-practice advice, and now they must face the fact that even good advice can lead to bad outcomes . . . Should the affected organizations have accepted CrowdStrike’s auto updates without scrutiny? Should security leaders trust vendors’ representations about QA/QC in general? Due to the nature of the fix, IT personnel will need to physically access each affected machine. This means that the recovery process may take some time.

Neatsun Ziv, CEO, OX Security

While this is not a cyberattack, the downstream effects are comparable in that one action impacts another, which then impacts another dependency, and so on. As illustrated here, deployment and management of agents is problematic at scale. Ensuring consistent agent configurations and updates across the entire ecosystem is extremely challenging. We need to address system reliability, and how best to avoid a single point of failure. Using agentless updates, as opposed to automatically updating agents on the endpoint servers, is a good first step.

Charles Henderson, EVP, Coalfire

Successful recovery of IT systems will only be the first hurdle for security organizations. As we’ve seen countless times over the years, attackers will certainly use the disruption and public awareness of this issue to further their attacks . . . Starting now and for at least the next month, all organizations should be in a heightened state of vigilance for phishing emails purporting to be from, or affiliated with, CrowdStrike.

Dimitri Chichlo, CSO, BforeAI

Our networks remain fragile because of interdependence and the assumption that technology always works. When your executive committee or board of directors has little appetite for IT matters, little effort will go into making infrastructure resilient. IT teams must stop considering security a separate discipline from IT and instead embed security and resilience by default into their day-to-day activities.

Steve Hahn, EVP, BullWall

It will be interesting to see if we have a ripple of downstream consequences. Right now we are dealing with outages at airlines and other critical businesses. Will we also see a wave of ransomware attacks that follow? Time will tell.

This event, more than any other, is precisely why companies need a defense-in-depth strategy. Ransomware uses endpoints and other attack vectors as the launch mechanism for its attacks, and you need layers of security over your critical data and file shares.

Bruno Kurtic, CEO, Bedrock Security

CrowdStrike initially stated there was no security breach, only a software defect that led to a disruption. This underscores the need for cybersecurity vendors to recognize their role in maintaining business and societal functions and their responsibility for resilience . . . . While running two EDR software applications on the same system isn’t feasible, businesses can protect cloud, data, and other assets with different vendors. Additionally, conducting tabletop exercises for catastrophic failures and analyzing supply chain risks are crucial.

Evan Dornbush, former NSA cybersecurity expert

This is, of course, a phishing attack opportunity. Don’t make a bad situation worse. Only follow recommended instructions direct from your CrowdStrike rep. There will be a lot of misinformation about how to reconfigure your computers or which critical system files to delete. Don’t fall victim to downloading phony solutions.

Dylan Owen, CISO, Nightwing

Now is a good time to review incident response plans and identify any weak spots, like missing backups. Learning from this event can be critical to reducing the recovery time from major outages to come. Ultimately, organizations need to take a measured approach. Continuing to follow cyber hygiene best practices can more likely create those constant invisible benefits that keep organizations from falling victim to a ransomware event or other compromises and ultimately bolster their digital resiliency.

Ted Miracco, CEO, Approov

This outage shows the severe consequences of having a single point of failure, the misguided notion that bigger is better with cybersecurity, and the dangers of down-selecting to a single vendor for each service. The most economical way to support complex systems might be to have multiple overlapping solutions that provide redundancy. This includes adopting multi-cloud strategies and decentralizing critical functions to ensure continuous operation during localized failures. Sandboxing updates and rigorously testing upgrades before releasing them to all users are crucial practices for enhancing digital resilience.
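
For readers who want to picture what testing an update before releasing it to all users can look like, here is a minimal sketch of a staged ("ring") rollout. It is an illustration only, not a description of how CrowdStrike or any other vendor ships updates. The function name `staged_rollout`, the `apply_update` callback, the ring sizes and the failure threshold are all hypothetical choices made for this example.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], bool],
    ring_fractions: Sequence[float] = (0.01, 0.10, 1.0),
    max_failure_rate: float = 0.02,
) -> dict:
    """Push an update ring by ring and stop as soon as a ring looks unhealthy.

    apply_update(host) is a hypothetical callback: it installs the update on
    one host and returns True if that host is still healthy afterwards.
    """
    updated: list[str] = []
    failed: list[str] = []
    done = 0  # number of hosts attempted so far

    for fraction in ring_fractions:
        # Each ring widens cumulative coverage to `fraction` of the fleet,
        # always adding at least one new host while any remain.
        target = min(len(hosts), max(done + 1, round(len(hosts) * fraction)))
        ring = hosts[done:target]
        done = target

        ring_failures = 0
        for host in ring:
            if apply_update(host):
                updated.append(host)
            else:
                failed.append(host)
                ring_failures += 1

        # Halt before the next, larger ring if this one failed too often.
        if ring and ring_failures / len(ring) > max_failure_rate:
            return {"halted": True, "updated": updated, "failed": failed,
                    "untouched": list(hosts[done:])}

    return {"halted": False, "updated": updated, "failed": failed,
            "untouched": []}
```

Under these assumptions, a defective update that breaks every host it touches would stop after the first small ring, leaving the rest of the fleet untouched, rather than reaching every machine at once.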

Irfan Shakeel, Vice President Training & Certification Services, OPSWAT

Deployment of a faulty update to a production environment without adequate testing underscores a critical flaw in resilience planning. Disruption can cascade through interconnected systems, causing widespread chaos. Such incidents emphasize the urgent need to diversify IT infrastructure, implement incident response plans and prioritize thorough testing. This will strengthen digital resilience.

Dan Potter, Director of Operational Resilience, Immersive Labs

This crisis highlights a pressing concern — the over-reliance on digital systems, which have inherent limitations and vulnerabilities. Moving forward, organizations need to both defend against increasing cyber threats and optimize business response to disruptions, including having the capacity to revert swiftly to manual processes. This requires having the people that give your organization the human edge in responding to any unexpected event.

Willy Leichter, CMO, AppSOC

The Irish potato blight and famine showed that being completely dependent on a single food source for millions of people could be catastrophic . . . It’s understandable that many organizations have set a goal of reducing the number of security tools they use, while standardizing on a few giant vendors to build and manage their infrastructure. Yet there is significant value in having some diversity of tools, and a non-homogeneous approach to security.

Scott Kannry, CEO, Axio

The outage reinforces the need for organizations to better understand how the failure or loss of key technological dependencies can impede their business operations . . . Achieving digital resiliency is a manageable process, commonly practiced as part of enterprise risk management in large enterprises. Start by identifying the core products and services that form the business’s lifeblood. Explore alternative operational enablers, both technological and non-technological, and evaluate their costs and investment thresholds.

Pukar Hamal, CEO, SecurityPal

A proactive and comprehensive risk assessment and management approach and constant re-evaluation of compliance postures are necessary. Adopting advanced technologies or models, like AI, cloud computing, or low-code/no-code models, can be helpful for early issue detection and automated responses. The key point is that digital resiliency is about redundancy at the people, process, and technology layers.

Pulitzer Prize-winning business journalist Byron V. Acohido is dedicated to fostering public awareness about how to make the Internet as private and secure as it ought to be.