Skip to main content

MELTDOWN: Single Software Update Causes Largest IT Outage in History

Pete Goldin
Editor and Publisher
APMdigest

A defective software update caused what some experts are calling the largest IT outage in history on Friday, July 19. The impact reverberated through multiple industries around the world. Thousands of flights were canceled. TV stations went offline. Some 911 systems were down. Hospital operations were disrupted. Bank accounts were inaccessible. Many businesses and government services were unable to function.

The problem started with a bug in an automatic update for CrowdStrike's Falcon sensor — which is used to block online cyberattacks — and quickly escalated globally, causing Microsoft Windows systems to crash. CrowdStrike confirmed that the cause was a defect in a single content update for Windows hosts, not a security incident or cyberattack.


The Automation Challenge

"As companies transition to products with fully automated updates, they gain touchless update and patch remediation. However, automation is useless if it's supplied with bad content or configuration," said Kent Feid, Senior Director of Product Management at Quest.

"This event demonstrates that even the best companies can push out patches that cripple environments and, at times, entire essential service industries, and highlights the need for a balance between control and automation when it comes to software releases. While automation is necessary, it is the balanced approach that provides the best control and minimizes risk."

The issue also shines a spotlight on quality assurance. "A simple defect found in a single content update for Windows hosts was enough to cause havoc globally. The lesson to be learned is to integrate quality assurance into the software development lifecycle and to assure business outcomes not just technology," said Tom Reuner, Executive Research Leader, HFS Research.

Managing and Controlling Change

This massive outage shows how relying on outside services can cause major problems — something Catchpoint has been warning companies about for a long time.

At any moment, even the smallest oversight or piece of unpreparedness can bring systems — and consequently businesses — down

"The scale of today's global IT outage is unparalleled in recent history. It serves as a stark reminder that our entire world is powered by digital experiences and that the internet is neither magically infallible nor inherently resilient. This is a reminder you need to manage and control change: Don't blindly update software or change configuration," Mehdi Daoudi, CEO of Catchpoint, said on Friday. "At any moment, even the smallest oversight or piece of unpreparedness can bring systems — and consequently businesses — down."

Image removed.

Daoudi continued, "Preparation and visibility are key, not just to prevent such outages but to mitigate the vast financial risks they pose. The fallout from today's event will likely be measured not just in the disruption of services but in exponential financial losses worldwide, potentially amounting to millions or even billions in lost revenue. It highlights a critical vulnerability: our increasing dependency on digital infrastructure can translate into staggering costs when that infrastructure fails."

Real-Time Observability

"The massive Microsoft outage, caused by a faulty CrowdStrike update, underscores the new reality companies face: globally distributed software platforms that drive business today are a complex web of interdependencies, not all of which are under any one actor's control," explained Antony Falco, VP at Hydrolix.

"A modest mistake can literally grind global business to a halt. The monitoring and observability solutions we rely on to spot these modest mistakes and critical issues have struggled to keep up, even with systems of smaller scale. Clearly we need a new approach to observability — one that is real-time and can simplify the management of tremendous volumes of data streaming in from myriad sources so events can be detected and mitigated before they spread."

Redundancy and Diversity

In addition, this type of event demonstrates that for critical services, redundancy and diversity are key, according to Olaf Kolkman, Principal - Internet Technology, Policy, and Advocacy, and Dan York, Director, Internet Technology, both from the Internet Society. "We need diversity across all aspects of tech, including the operating systems. For example, systems using Linux or Mac OS were not affected by this particular issue. We need to ensure that our systems and networks use a range of different products and services so that an issue with one system will not bring them all down."

They added, "The reality is that in our world of complex, interconnected systems, incidents like this happen. They have happened in the past and they will happen in the future. The important part is how we learn from them and how we improve the resilience of our systems, so that similar issues do not happen again."

The Cost of Downtime

Just as a final thought, I would point out that several recent reports have shown that the cost of downtime is high, and downtime can impact companies in many ways. Catchpoint's Internet Resilience Report 2024 found that almost half of survey respondents said outages cost them from $1 million to $10 million every month.

Similarly, Splunk's recent report, The Hidden Costs of Downtime calculates lost revenue due to downtime averages $49 million, regulatory fines average $22 million, and missed SLA penalties average $16 million annually.

Downtime also negatively impacts customer experience, employee productivity, innovation, brand reputation and even share value. In fact, AP reported that shares of CrowdStrike stock fell nearly 10% on Friday, and Microsoft stock fell more than 3%. These numbers speak louder than words.

Pete Goldin is Editor and Publisher of APMdigest

Hot Topics

The Latest

In MEAN TIME TO INSIGHT Episode 13, Shamus McGillicuddy, VP of Research, Network Infrastructure and Operations, at EMA discusses hybrid multi-cloud networking strategy ... 

In high-traffic environments, the sheer volume and unpredictable nature of network incidents can quickly overwhelm even the most skilled teams, hindering their ability to react swiftly and effectively, potentially impacting service availability and overall business performance. This is where closed-loop remediation comes into the picture: an IT management concept designed to address the escalating complexity of modern networks ...

In 2025, enterprise workflows are undergoing a seismic shift. Propelled by breakthroughs in generative AI (GenAI), large language models (LLMs), and natural language processing (NLP), a new paradigm is emerging — agentic AI. This technology is not just automating tasks; it's reimagining how organizations make decisions, engage customers, and operate at scale ...

In the early days of the cloud revolution, business leaders perceived cloud services as a means of sidelining IT organizations. IT was too slow, too expensive, or incapable of supporting new technologies. With a team of developers, line of business managers could deploy new applications and services in the cloud. IT has been fighting to retake control ever since. Today, IT is back in the driver's seat, according to new research by Enterprise Management Associates (EMA) ...

In today's fast-paced and increasingly complex network environments, Network Operations Centers (NOCs) are the backbone of ensuring continuous uptime, smooth service delivery, and rapid issue resolution. However, the challenges faced by NOC teams are only growing. In a recent study, 78% state network complexity has grown significantly over the last few years while 84% regularly learn about network issues from users. It is imperative we adopt a new approach to managing today's network experiences ...

Image
Broadcom

From growing reliance on FinOps teams to the increasing attention on artificial intelligence (AI), and software licensing, the Flexera 2025 State of the Cloud Report digs into how organizations are improving cloud spend efficiency, while tackling the complexities of emerging technologies ...

Today, organizations are generating and processing more data than ever before. From training AI models to running complex analytics, massive datasets have become the backbone of innovation. However, as businesses embrace the cloud for its scalability and flexibility, a new challenge arises: managing the soaring costs of storing and processing this data ...

Despite the frustrations, every engineer we spoke with ultimately affirmed the value and power of OpenTelemetry. The "sucks" moments are often the flip side of its greatest strengths ... Part 2 of this blog covers the powerful advantages and breakthroughs — the "OTel Rocks" moments ...

OpenTelemetry (OTel) arrived with a grand promise: a unified, vendor-neutral standard for observability data (traces, metrics, logs) that would free engineers from vendor lock-in and provide deeper insights into complex systems ... No powerful technology comes without its challenges, and OpenTelemetry is no exception. The engineers we spoke with were frank about the friction points they've encountered ...

Enterprises are turning to AI-powered software platforms to make IT management more intelligent and ensure their systems and technology meet business needs for efficiency, lowers costs and innovation, according to new research from Information Services Group ...

MELTDOWN: Single Software Update Causes Largest IT Outage in History

Pete Goldin
Editor and Publisher
APMdigest

A defective software update caused what some experts are calling the largest IT outage in history on Friday, July 19. The impact reverberated through multiple industries around the world. Thousands of flights were canceled. TV stations went offline. Some 911 systems were down. Hospital operations were disrupted. Bank accounts were inaccessible. Many businesses and government services were unable to function.

The problem started with a bug in an automatic update for CrowdStrike's Falcon sensor — which is used to block online cyberattacks — and quickly escalated globally, causing Microsoft Windows systems to crash. CrowdStrike confirmed that the cause was a defect in a single content update for Windows hosts, not a security incident or cyberattack.


The Automation Challenge

"As companies transition to products with fully automated updates, they gain touchless update and patch remediation. However, automation is useless if it's supplied with bad content or configuration," said Kent Feid, Senior Director of Product Management at Quest.

"This event demonstrates that even the best companies can push out patches that cripple environments and, at times, entire essential service industries, and highlights the need for a balance between control and automation when it comes to software releases. While automation is necessary, it is the balanced approach that provides the best control and minimizes risk."

The issue also shines a spotlight on quality assurance. "A simple defect found in a single content update for Windows hosts was enough to cause havoc globally. The lesson to be learned is to integrate quality assurance into the software development lifecycle and to assure business outcomes not just technology," said Tom Reuner, Executive Research Leader, HFS Research.

Managing and Controlling Change

This massive outage shows how relying on outside services can cause major problems — something Catchpoint has been warning companies about for a long time.

At any moment, even the smallest oversight or piece of unpreparedness can bring systems — and consequently businesses — down

"The scale of today's global IT outage is unparalleled in recent history. It serves as a stark reminder that our entire world is powered by digital experiences and that the internet is neither magically infallible nor inherently resilient. This is a reminder you need to manage and control change: Don't blindly update software or change configuration," Mehdi Daoudi, CEO of Catchpoint, said on Friday. "At any moment, even the smallest oversight or piece of unpreparedness can bring systems — and consequently businesses — down."

Image removed.

Daoudi continued, "Preparation and visibility are key, not just to prevent such outages but to mitigate the vast financial risks they pose. The fallout from today's event will likely be measured not just in the disruption of services but in exponential financial losses worldwide, potentially amounting to millions or even billions in lost revenue. It highlights a critical vulnerability: our increasing dependency on digital infrastructure can translate into staggering costs when that infrastructure fails."

Real-Time Observability

"The massive Microsoft outage, caused by a faulty CrowdStrike update, underscores the new reality companies face: globally distributed software platforms that drive business today are a complex web of interdependencies, not all of which are under any one actor's control," explained Antony Falco, VP at Hydrolix.

"A modest mistake can literally grind global business to a halt. The monitoring and observability solutions we rely on to spot these modest mistakes and critical issues have struggled to keep up, even with systems of smaller scale. Clearly we need a new approach to observability — one that is real-time and can simplify the management of tremendous volumes of data streaming in from myriad sources so events can be detected and mitigated before they spread."

Redundancy and Diversity

In addition, this type of event demonstrates that for critical services, redundancy and diversity are key, according to Olaf Kolkman, Principal - Internet Technology, Policy, and Advocacy, and Dan York, Director, Internet Technology, both from the Internet Society. "We need diversity across all aspects of tech, including the operating systems. For example, systems using Linux or Mac OS were not affected by this particular issue. We need to ensure that our systems and networks use a range of different products and services so that an issue with one system will not bring them all down."

They added, "The reality is that in our world of complex, interconnected systems, incidents like this happen. They have happened in the past and they will happen in the future. The important part is how we learn from them and how we improve the resilience of our systems, so that similar issues do not happen again."

The Cost of Downtime

Just as a final thought, I would point out that several recent reports have shown that the cost of downtime is high, and downtime can impact companies in many ways. Catchpoint's Internet Resilience Report 2024 found that almost half of survey respondents said outages cost them from $1 million to $10 million every month.

Similarly, Splunk's recent report, The Hidden Costs of Downtime calculates lost revenue due to downtime averages $49 million, regulatory fines average $22 million, and missed SLA penalties average $16 million annually.

Downtime also negatively impacts customer experience, employee productivity, innovation, brand reputation and even share value. In fact, AP reported that shares of CrowdStrike stock fell nearly 10% on Friday, and Microsoft stock fell more than 3%. These numbers speak louder than words.

Pete Goldin is Editor and Publisher of APMdigest

Hot Topics

The Latest

In MEAN TIME TO INSIGHT Episode 13, Shamus McGillicuddy, VP of Research, Network Infrastructure and Operations, at EMA discusses hybrid multi-cloud networking strategy ... 

In high-traffic environments, the sheer volume and unpredictable nature of network incidents can quickly overwhelm even the most skilled teams, hindering their ability to react swiftly and effectively, potentially impacting service availability and overall business performance. This is where closed-loop remediation comes into the picture: an IT management concept designed to address the escalating complexity of modern networks ...

In 2025, enterprise workflows are undergoing a seismic shift. Propelled by breakthroughs in generative AI (GenAI), large language models (LLMs), and natural language processing (NLP), a new paradigm is emerging — agentic AI. This technology is not just automating tasks; it's reimagining how organizations make decisions, engage customers, and operate at scale ...

In the early days of the cloud revolution, business leaders perceived cloud services as a means of sidelining IT organizations. IT was too slow, too expensive, or incapable of supporting new technologies. With a team of developers, line of business managers could deploy new applications and services in the cloud. IT has been fighting to retake control ever since. Today, IT is back in the driver's seat, according to new research by Enterprise Management Associates (EMA) ...

In today's fast-paced and increasingly complex network environments, Network Operations Centers (NOCs) are the backbone of ensuring continuous uptime, smooth service delivery, and rapid issue resolution. However, the challenges faced by NOC teams are only growing. In a recent study, 78% state network complexity has grown significantly over the last few years while 84% regularly learn about network issues from users. It is imperative we adopt a new approach to managing today's network experiences ...

Image
Broadcom

From growing reliance on FinOps teams to the increasing attention on artificial intelligence (AI), and software licensing, the Flexera 2025 State of the Cloud Report digs into how organizations are improving cloud spend efficiency, while tackling the complexities of emerging technologies ...

Today, organizations are generating and processing more data than ever before. From training AI models to running complex analytics, massive datasets have become the backbone of innovation. However, as businesses embrace the cloud for its scalability and flexibility, a new challenge arises: managing the soaring costs of storing and processing this data ...

Despite the frustrations, every engineer we spoke with ultimately affirmed the value and power of OpenTelemetry. The "sucks" moments are often the flip side of its greatest strengths ... Part 2 of this blog covers the powerful advantages and breakthroughs — the "OTel Rocks" moments ...

OpenTelemetry (OTel) arrived with a grand promise: a unified, vendor-neutral standard for observability data (traces, metrics, logs) that would free engineers from vendor lock-in and provide deeper insights into complex systems ... No powerful technology comes without its challenges, and OpenTelemetry is no exception. The engineers we spoke with were frank about the friction points they've encountered ...

Enterprises are turning to AI-powered software platforms to make IT management more intelligent and ensure their systems and technology meet business needs for efficiency, lowers costs and innovation, according to new research from Information Services Group ...