China's Web Outage: The Latest Earthquake to Rock the Internet

February 18, 2014

Heiko Specht

On Tuesday, January 21, the Internet in China suffered one of the biggest outages in history, if not the largest. The web was essentially unavailable for one full business day in one of the world's strongest and fastest-growing economies.

The initial reaction of the international press was somewhat lax. After all, the event was only marginally important to web users outside China. But the fact is that approximately 500 million Chinese web users were affected, and every company that does online business in China was hurt as well.

Consider a company like Porsche, which has been experiencing double-digit revenue growth in China over the past few months. The hit on revenues and even brand image for such companies was likely significant, even though the outage was completely beyond their control. In addition, major global businesses advertising on Chinese sites forfeited hefty investments that day.

A Look Inside the Outage

So what exactly happened? At around 3 p.m. local time on January 21, two-thirds of all domain requests in China were routed to a single IP address in Wyoming, which promptly collapsed under load. This is believed to have been a domain name system (DNS) attack, the biggest of its type in history. Not all domains were affected: those ending in .com and .net bore the brunt, while those ending in .com.cn were partially affected.

Unfortunately, most of the Chinese websites that were not directly impacted ended up going down as well. Here's why: many of the affected domains hosted third-party services relied upon by thousands of Chinese websites.

One example is analytics engines. Never mind that the analytics engines weren't working, meaning that companies lost out on a whole day's worth of data that could have been used to increase conversions. That was just the collateral damage. Like dominoes, these "poisoned" third-party services brought down the websites integrating them, even those websites that were not directly affected by the attack.

Another third-party service that went dark was PayPal. This meant that any website integrating PayPal on its back-end could not process transactions for a full eight hours – which was a moot point anyway, because these websites were likely inaccessible.

In this sense, the Chinese outage was a perfect case in point of what Compuware APM has been evangelizing for a long time: the increased complexity and interdependency of the modern web can turn even the most well-run and well-developed website into a house of cards, on the verge of collapse at any moment.

But these days, reliance on third-party services is a way of life. These services enable website and web application developers to bring cutting-edge services to market quickly and cost-effectively, without the burden of having to develop them from scratch. However, the China example highlights how this reliance comes with the downside of increased vulnerability and fragility.

Lessons Learned

In this era of increased interdependency, what can an organization do to better protect and insulate its web performance?

Organizations need to be better about getting ahead of website performance issues: Given all the performance-impacting elements standing between the data center and the end user, such as the cloud, CDNs, ISPs, devices and browsers, the end-user perspective is the only reliable vantage point from which to gauge performance. Next-generation application performance management (APM) tools can deliver this view, and it's important to work with technology providers that offer performance views across key geographies and user segments.

Organizations must closely evaluate and monitor third-party services: Before a third-party service is enlisted, organizations should carefully test its performance. One way is to compare website performance before and after the service is added, to gauge its overall performance impact. If a performance degradation is identified, organizations must work with the third-party provider to fix the problem before the service is implemented.
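As a rough illustration of the before-and-after comparison, the Python sketch below compares median page load times measured without and with a third-party service. The function name `performance_impact`, the 10 percent threshold and the sample timings are assumptions made for illustration, not values from any particular APM tool.

```python
from statistics import median


def performance_impact(baseline_ms, with_service_ms, threshold_pct=10.0):
    """Compare median load times before and after adding a third-party service.

    baseline_ms / with_service_ms: page load times in milliseconds.
    threshold_pct: assumed tolerance; a larger slowdown counts as a degradation.
    """
    before = median(baseline_ms)
    after = median(with_service_ms)
    # Multiply before dividing to keep the arithmetic exact for round numbers.
    change_pct = (after - before) * 100.0 / before
    return {
        "before_ms": before,
        "after_ms": after,
        "change_pct": change_pct,
        "degraded": change_pct > threshold_pct,
    }


if __name__ == "__main__":
    # Hypothetical measurements taken from the same location and browser.
    result = performance_impact([1000, 1100, 900], [1300, 1250, 1400])
    print(result)
```

The median is used here rather than the mean so that a single slow outlier does not dominate the comparison; other statistics, such as a high percentile, may suit some sites better.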

Monitoring third-party services in production is also important, not only to validate SLAs but also to identify third-party performance issues as they occur and take appropriate action.
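One way to picture SLA validation is a small check over monitoring samples, sketched below in Python. The sample format (a success flag plus a latency), the function name `check_sla` and the targets of 99.9 percent availability and a 500 ms 95th-percentile latency are all assumptions; real SLA terms vary by provider.

```python
def check_sla(samples, min_availability_pct=99.9, max_p95_ms=500.0):
    """Check observed third-party calls against assumed SLA targets.

    samples: list of (ok, latency_ms) tuples, one per monitored call.
    """
    if not samples:
        raise ValueError("no samples to evaluate")
    ok_latencies = sorted(latency for ok, latency in samples if ok)
    availability_pct = len(ok_latencies) * 100.0 / len(samples)
    if ok_latencies:
        # Nearest-rank 95th percentile; integer math avoids rounding surprises.
        rank = (95 * len(ok_latencies) + 99) // 100
        p95_ms = ok_latencies[rank - 1]
    else:
        p95_ms = None  # every call failed, so there is no latency to report
    return {
        "availability_pct": availability_pct,
        "p95_ms": p95_ms,
        "availability_met": availability_pct >= min_availability_pct,
        "latency_met": p95_ms is not None and p95_ms <= max_p95_ms,
    }


if __name__ == "__main__":
    # Hypothetical samples: 18 fast calls, one slow call, one failure.
    observed = [(True, 100.0)] * 18 + [(True, 900.0), (False, 0.0)]
    print(check_sla(observed))
```

Running a check like this on a schedule, from the locations where customers actually are, gives both the SLA record and an early signal that a third-party service is misbehaving.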

As the China example illustrates, the "ripple effect" of third-party performance issues is often unavoidable. But that doesn't mean the impact can't be thwarted or minimized. That is, when a serious performance problem is detected, organizations should have contingency plans in place so that offending third-party services can quickly be removed. While they can be extremely valuable when performing well, many third-party services (such as analytics) are not worth having if it means frustrating customers.
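A contingency plan of this kind can be as simple as a switch that stops including a third-party service once it fails repeatedly. The Python sketch below shows the idea; the class name `ThirdPartySwitch`, the threshold of three consecutive failures and the example script tag are hypothetical, and a production setup would likely also need a way to re-enable the service after it recovers.

```python
class ThirdPartySwitch:
    """Disable a third-party service after repeated failed health checks."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold  # assumed tolerance
        self.consecutive_failures = 0
        self.enabled = True

    def record(self, ok):
        """Record the outcome of one health check of the service."""
        if ok:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.enabled = False  # pull the service from the page

    def enable(self):
        """Manually restore the service once it is healthy again."""
        self.consecutive_failures = 0
        self.enabled = True

    def render(self, tag_html):
        """Return the service's markup, or nothing while it is disabled."""
        return tag_html if self.enabled else ""


if __name__ == "__main__":
    switch = ThirdPartySwitch()
    for outcome in (True, False, False, False):
        switch.record(outcome)
    # Hypothetical analytics tag; it is omitted once the switch has tripped.
    print(repr(switch.render('<script src="https://analytics.example/a.js"></script>')))
```

The point of the design is that removing the service becomes a routine, tested operation rather than an emergency code change made while the site is already suffering.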

The end-user experience needs to be top-of-mind in all third-party service decisions: In general, websites should keep third-party services to a minimum. Before adding a third-party service, organizations should always ask themselves whether the added feature or functionality is worth the potential increase in overall vulnerability and lost conversions.

In this vein, there needs to be constant communication between performance monitoring teams and the teams that request and depend on these third-party services. This is the key to making the smartest decisions that will protect and promote revenues above all else.

Additionally, when a third-party service is implemented, there are design steps organizations can take to proactively reduce risk exposure. For example, by understanding the load order of elements on a site and making sure third-party services and applications are at the bottom, organizations can protect and enhance perceived customer load time, even when a third-party service does suddenly go awry.
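To make the load-order idea concrete, the Python sketch below reorders a page's script URLs so that first-party scripts come first and third-party scripts come last. The function name `order_scripts` and the example hosts are hypothetical, and the host comparison is deliberately simple: it treats any different hostname, including a subdomain, as third-party.

```python
from urllib.parse import urlparse


def order_scripts(page_host, script_urls):
    """Return script URLs with first-party scripts ahead of third-party ones.

    Relative URLs have no hostname and are treated as first-party.
    The sort is stable, so the original order is kept within each group.
    """
    def is_third_party(url):
        host = urlparse(url).hostname
        return host is not None and host != page_host

    # False sorts before True, so first-party scripts end up at the top.
    return sorted(script_urls, key=is_third_party)


if __name__ == "__main__":
    scripts = [
        "https://analytics.example.net/a.js",
        "/js/app.js",
        "https://shop.example/js/cart.js",
    ]
    for url in order_scripts("shop.example", scripts):
        print(url)
```

Reordering is only safe when first-party scripts do not depend on a third-party script having loaded earlier, so any change like this should be verified against the site's actual dependencies.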

As a final note here, to ensure better performance for feature-rich websites and applications, many organizations rely on content delivery networks (CDNs) strategically located in key geographies. Ironically, CDNs represent another third-party service and another potential point of failure. Here, again, measuring performance from the true end-user perspective, on the other side of a CDN, is critical to protecting and maximizing these investments.

Leverage industry resources: Look for free services that identify third-party service outages and the corresponding regional impacts. Services like these may not prevent major outages from happening, but they can at least help organizations see when a widespread performance issue is not their own, and give them a head start in putting contingency plans into place and communicating proactively with customers.

Conclusion

In summary, to a certain extent, major web events like the one that just happened in China are unavoidable. But in many cases, the corresponding impact on modern websites can be anticipated, contained and minimized with the right approaches.

As a first step, organizations must understand the true end-user experience and the resulting business impact, so performance problems can be prioritized for remediation. From there, organizations must be able to correlate performance issues to the broadest possible range of variables both within and outside the firewall, including third-party services, and take appropriate action. It is critical to understand what can and cannot be controlled, and focus on addressing and fixing what is possible. In many cases, this can help organizations avoid going down with the proverbial ship.

Heiko Specht is a Technology Expert at the Compuware APM Center of Excellence.
