MELTDOWN: Single Software Update Causes Largest IT Outage in History

July 22, 2024

Pete Goldin

Editor and Publisher

APMdigest

A defective software update caused what some experts are calling the largest IT outage in history on Friday, July 19. The impact reverberated through multiple industries around the world. Thousands of flights were canceled. TV stations went offline. Some 911 systems were down. Hospital operations were disrupted. Bank accounts were inaccessible. Many businesses and government services were unable to function.

The problem started with a bug in an automatic update for CrowdStrike's Falcon sensor — which is used to block online cyberattacks — and quickly escalated globally, causing Microsoft Windows systems to crash. CrowdStrike confirmed that the cause was a defect in a single content update for Windows hosts, not a security incident or cyberattack.

The Automation Challenge

"As companies transition to products with fully automated updates, they gain touchless update and patch remediation. However, automation is useless if it's supplied with bad content or configuration," said Kent Feid, Senior Director of Product Management at Quest.

"This event demonstrates that even the best companies can push out patches that cripple environments and, at times, entire essential service industries, and highlights the need for a balance between control and automation when it comes to software releases. While automation is necessary, it is the balanced approach that provides the best control and minimizes risk."

The issue also shines a spotlight on quality assurance. "A simple defect found in a single content update for Windows hosts was enough to cause havoc globally. The lesson to be learned is to integrate quality assurance into the software development lifecycle and to assure business outcomes not just technology," said Tom Reuner, Executive Research Leader, HFS Research.

Managing and Controlling Change

This massive outage shows how relying on outside services can cause major problems — something Catchpoint has been warning companies about for a long time.

"The scale of today's global IT outage is unparalleled in recent history. It serves as a stark reminder that our entire world is powered by digital experiences and that the internet is neither magically infallible nor inherently resilient. This is a reminder you need to manage and control change: Don't blindly update software or change configuration," Mehdi Daoudi, CEO of Catchpoint, said on Friday. "At any moment, even the smallest oversight or piece of unpreparedness can bring systems — and consequently businesses — down."

Daoudi continued, "Preparation and visibility are key, not just to prevent such outages but to mitigate the vast financial risks they pose. The fallout from today's event will likely be measured not just in the disruption of services but in exponential financial losses worldwide, potentially amounting to millions or even billions in lost revenue. It highlights a critical vulnerability: our increasing dependency on digital infrastructure can translate into staggering costs when that infrastructure fails."

Real-Time Observability

"The massive Microsoft outage, caused by a faulty CrowdStrike update, underscores the new reality companies face: globally distributed software platforms that drive business today are a complex web of interdependencies, not all of which are under any one actor's control," explained Antony Falco, VP at Hydrolix.

"A modest mistake can literally grind global business to a halt. The monitoring and observability solutions we rely on to spot these modest mistakes and critical issues have struggled to keep up, even with systems of smaller scale. Clearly we need a new approach to observability — one that is real-time and can simplify the management of tremendous volumes of data streaming in from myriad sources so events can be detected and mitigated before they spread."

Redundancy and Diversity

In addition, this type of event demonstrates that for critical services, redundancy and diversity are key, according to Olaf Kolkman, Principal - Internet Technology, Policy, and Advocacy, and Dan York, Director, Internet Technology, both from the Internet Society. "We need diversity across all aspects of tech, including the operating systems. For example, systems using Linux or Mac OS were not affected by this particular issue. We need to ensure that our systems and networks use a range of different products and services so that an issue with one system will not bring them all down."

They added, "The reality is that in our world of complex, interconnected systems, incidents like this happen. They have happened in the past and they will happen in the future. The important part is how we learn from them and how we improve the resilience of our systems, so that similar issues do not happen again."

The Cost of Downtime

As a final thought, several recent reports show that downtime is costly and affects companies in many ways. Catchpoint's Internet Resilience Report 2024 found that almost half of survey respondents said outages cost them from $1 million to $10 million every month.

Similarly, Splunk's recent report, The Hidden Costs of Downtime, calculates that lost revenue due to downtime averages $49 million, regulatory fines average $22 million, and missed SLA penalties average $16 million annually.

Downtime also negatively impacts customer experience, employee productivity, innovation, brand reputation and even share value. In fact, AP reported that shares of CrowdStrike stock fell nearly 10% on Friday, and Microsoft stock fell more than 3%. These numbers speak louder than words.

Pete Goldin is Editor and Publisher of APMdigest


