On Tuesday, January 21, China's Internet suffered one of the biggest outages in history, if not the largest. The web was essentially unavailable to one of the world's strongest and fastest-growing economies for a full business day.
The initial reaction of the international press was somewhat muted – after all, the event seemed only marginally important to web users outside China. But while approximately 500 million Chinese web users were undoubtedly affected, every company that does online business in China was hurt as well.
Consider a company like Porsche, which has been experiencing double-digit revenue growth in China over the past few months. The hit on revenues and even brand image for such companies – even though the outage was completely beyond their control – was likely significant. Not to mention, major global businesses advertising on Chinese sites forfeited hefty investments that day.
A Look Inside the Outage
So what exactly happened? At around 3 p.m. local time on January 21, two-thirds of all domain requests in China were routed to a single IP address in Wyoming, which promptly collapsed under load. This was believed to be a domain name system (DNS) attack, the biggest of its type in history. Not all domains were affected; mainly it was those ending in .com and .net, while those ending in .com.cn were partially affected.
Unfortunately, most of the Chinese websites that were not directly impacted ended up going down as well. Here's why: many of the affected domains hosted third-party services relied upon by thousands of Chinese websites.
One example is analytics engines. Never mind that the analytics engines weren't working, meaning that companies lost out on a whole day's worth of data that could have been used to increase conversions. That was just the collateral damage. Like dominoes, these "poisoned" third-party services brought down the websites integrating them, even those websites that were not directly affected by the attack.
Another third-party service that went dark was PayPal. This meant that any website integrating PayPal on its back-end could not process transactions for a full eight hours – which was a moot point anyway, because these websites were likely inaccessible.
In this sense, the Chinese outage was a perfect case in point of what Compuware APM has been evangelizing for a long time: the increased complexity and interdependency of the modern web can turn even the most well-run and well-developed website into a house of cards, on the verge of collapse at any moment.
But these days, reliance on third-party services is a way of life. These services enable website and web application developers to bring cutting-edge services to market quickly and cost-effectively, without the burden of developing them from scratch. However, the China example highlights how that reliance comes with the downside of increased vulnerability and fragility.
Lessons Learned
In this era of increased interdependency, what can an organization do to better protect and insulate its web performance?
Organizations need to get better at staying ahead of website performance issues: Given all the performance-impacting elements standing between the data center and the end user – i.e. the cloud, CDNs, ISPs, devices and browsers – the end-user perspective is the only reliable vantage point from which to gauge performance. Next-generation application performance management (APM) tools can deliver this view, and it's important to work with technology providers that offer performance views across key geographies and user segments.
Organizations must closely evaluate and monitor third-party services: Before a third-party service is enlisted, organizations should carefully test its performance. One way is to measure website performance before and after the third-party service is added, and use the difference to gauge the overall performance impact. If a performance degradation is identified, organizations must work with the third-party provider to fix the problem before the service is implemented.
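The before-and-after comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names, the use of the median as the summary statistic, and the 10 percent budget are all assumptions made for the example, and the load-time samples would come from whatever measurement tool an organization already uses.

```python
from statistics import median


def performance_impact(before_ms, after_ms):
    """Compare page load samples (in ms) taken before and after adding a service.

    Hypothetical helper: the median is used as the summary statistic because
    it is less sensitive to a few slow outliers than the mean.
    """
    if not before_ms or not after_ms:
        raise ValueError("both sample lists must be non-empty")
    baseline = median(before_ms)
    with_service = median(after_ms)
    delta = with_service - baseline
    return {
        "baseline_ms": baseline,
        "with_service_ms": with_service,
        "delta_ms": delta,
        "delta_pct": 100.0 * delta / baseline,
    }


def exceeds_budget(impact, max_delta_pct=10.0):
    """Return True if the slowdown is larger than the allowed budget.

    The 10 percent default is an assumed threshold, not an industry standard.
    """
    return impact["delta_pct"] > max_delta_pct


# Illustrative samples only; real numbers would come from end-user measurements.
before = [1000, 1100, 900, 1050, 950]
after = [1200, 1300, 1150, 1250, 1400]
impact = performance_impact(before, after)
print(impact["delta_ms"], impact["delta_pct"], exceeds_budget(impact))
```

With these made-up samples the median rises from 1000 ms to 1250 ms, a 25 percent slowdown, which would exceed the assumed budget and prompt a conversation with the provider before launch.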
Monitoring third-party services in production is also important, not only to validate SLAs but also to identify third-party performance issues as they occur and take appropriate action.
As the China example illustrates, the "ripple effect" of third-party performance issues is often unavoidable. But that doesn't mean the impact can't be thwarted or minimized. That is, when a serious performance problem is detected, organizations should have contingency plans in place so that offending third-party services can quickly be removed. While they can be extremely valuable when performing well, many third-party services (such as analytics) are not worth having if it means frustrating customers.
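One way to picture such a contingency plan is a simple "kill switch" that tracks a third-party service against its SLA and disables it after repeated breaches. The sketch below is an assumption-laden illustration: the class name `ThirdPartyGuard`, the SLA value, and the rule of three consecutive breaches are hypothetical choices, and a real deployment would feed it measurements from production monitoring.

```python
class ThirdPartyGuard:
    """Hypothetical kill switch for one third-party service.

    Disables the service after `max_breaches` consecutive measurements that
    either fail outright or exceed the SLA response time.
    """

    def __init__(self, sla_ms, max_breaches=3):
        self.sla_ms = sla_ms
        self.max_breaches = max_breaches
        self.consecutive_breaches = 0
        self.enabled = True

    def record(self, response_ms):
        """Record one measurement; pass None for a failed or timed-out request.

        Returns whether the service should still be included on the page.
        """
        breached = response_ms is None or response_ms > self.sla_ms
        if breached:
            self.consecutive_breaches += 1
        else:
            self.consecutive_breaches = 0
        if self.consecutive_breaches >= self.max_breaches:
            # Stay disabled until someone deliberately re-enables the service.
            self.enabled = False
        return self.enabled

    def reset(self):
        """Re-enable the service once the provider has recovered."""
        self.consecutive_breaches = 0
        self.enabled = True


guard = ThirdPartyGuard(sla_ms=500, max_breaches=3)
for sample in [200, 800, None, 900]:
    guard.record(sample)
print(guard.enabled)
```

In this run the last three measurements all breach the assumed 500 ms SLA, so the guard switches the service off. Keeping it off until an explicit `reset()` is a deliberate design choice: it avoids flapping when a struggling service recovers briefly and then fails again.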
The end-user experience needs to be top-of-mind in all third-party service decisions: In general, websites should keep third-party services to a minimum. Before adding a third-party service, organizations should always ask themselves whether the added feature or functionality is worth the potential increase in overall vulnerability and lost conversions.
In this vein, there needs to be constant communication between performance monitoring teams and the teams that request and depend on these third-party services. This is the key to making the smartest decisions that will protect and promote revenues above all else.
Additionally, when a third-party service is implemented, there are design steps organizations can take to proactively reduce risk exposure. For example, by understanding the load order of elements on a site and making sure third-party services and applications load last, at the bottom of the page, organizations can protect and enhance perceived customer load time, even when a third-party service does suddenly go awry.
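The effect of load order can be illustrated with a deliberately simplified model. The sketch below assumes resources load strictly one after another and treats "perceived load time" as the moment the last first-party resource finishes; real browsers load many resources in parallel, so the numbers are illustrative only. The function names and resource list are hypothetical.

```python
def perceived_load_ms(resources):
    """Time until all first-party content has loaded, in a sequential model.

    `resources` is a list of (name, load_ms, is_third_party) tuples in load
    order. Assumption: each resource blocks the next one until it finishes.
    """
    elapsed = 0
    perceived = 0
    for _name, load_ms, is_third_party in resources:
        elapsed += load_ms
        if not is_third_party:
            perceived = elapsed
    return perceived


def third_party_last(resources):
    """Move third-party resources to the end, keeping relative order otherwise.

    sorted() is stable and False sorts before True, so first-party items
    keep their original order and come first.
    """
    return sorted(resources, key=lambda r: r[2])


# A stalled analytics script (5 seconds) placed at the top of the page.
page = [
    ("analytics.js", 5000, True),
    ("styles.css", 200, False),
    ("content.html", 300, False),
]
print(perceived_load_ms(page))
print(perceived_load_ms(third_party_last(page)))
```

Under these assumptions, the stalled script at the top delays the visible content to 5500 ms, while moving it to the bottom brings perceived load time down to 500 ms, even though the total load time is unchanged.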
As a final note here, to ensure better performance for feature-rich websites and applications, many organizations rely on content delivery networks (CDNs) strategically located in key geographies. Ironically, CDNs represent another third-party service and another potential point of failure. Here, again, measuring performance from the true end-user perspective, on the other side of a CDN, is critical to protecting and maximizing these investments.
Leverage industry resources: Look for free services that identify third-party service outages and the corresponding regional impacts. Services like these may not prevent major outages from happening, but they can at least help organizations see when a widespread performance issue is not their own, and give them a head start in putting contingency plans into place and communicating proactively with customers.
Conclusion
In summary, to a certain extent, major web events like the one that just happened in China are unavoidable. But in many cases, the corresponding impact on modern websites can be anticipated, contained and minimized with the right approaches.
As a first step, organizations must understand the true end-user experience and the resulting business impact, so performance problems can be prioritized for remediation. From there, organizations must be able to correlate performance issues to the broadest possible range of variables both within and outside the firewall, including third-party services, and take appropriate action. It is critical to understand what can and cannot be controlled, and focus on addressing and fixing what is possible. In many cases, this can help organizations avoid going down with the proverbial ship.
Heiko Specht is a Technology Expert at the Compuware APM Center of Excellence.