The headlines are filled with news of retail website failures and crashes, most recently the launch of Obamacare and the continuing healthcare.gov crashes due to high visitor load. Some of this attention is due to the media's insatiable appetite for bad news and some is fueled by massive user dissatisfaction, but for the most part websites are simply failing more.
Load-driven performance issues aside, many of the causes of failure are unavoidable. Malicious attacks are getting more sophisticated, and natural disasters take out datacenters, as we saw with Sandy. Perfection is unattainable, so human error will always be a factor and, as we heard at Yahoo, sometimes even a single squirrel can bring business to a halt.
Quite often, however, sites go down because organizations are not sufficiently prepared to manage the risks that come with the complexity surrounding their sites. Most websites are intricate ecosystems of different services, tools and platforms. More players than ever are involved in creating a rich, engaging and profitable experience.
Operations must worry not only about the health of the infrastructure and applications they own and manage, but also about those of their vendors, their vendors’ vendors and so on. Just one broken component in the delivery chain of a website can take down the entire service when that component is a single point of failure (SPoF).
With all of this in mind, companies need to accept that failure will happen and plan for it in order to minimize its negative business and branding impacts. As the saying attributed to Benjamin Franklin goes, "By failing to prepare, you are preparing to fail." By planning, you can get creative, as did the New York Times when it took to social media to keep pushing the news when its site went down in August.
Prevention and Readiness
So, how to plan?
1. Identify every situation that can make your business fail - Dig through every part of your infrastructure and applications, identify who your vendors are and determine how each one affects your service.
2. Monitor every aspect of your site's availability on a regular basis - Keep an eye on your partners’ servers as well as your own to truly understand the availability of your site.
3. Do capacity testing on all of your servers - Test load balancers, front end, back end, edge servers, vendors and everything else in the delivery chain.
4. Design your strategy for each case of failure - Ensure you have a capacity plan for the worst-case scenario and build it into your release cycle. A capacity plan is especially important before an event or promotion when you expect a lot of traffic to come to your site. Smart companies will stagger promotions to prevent drastic spikes in traffic.
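The monitoring step above can be sketched in a few lines of Python. This is only an illustration, not a replacement for a real monitoring service: the endpoint names and URLs in the usage note are hypothetical, and treating any status below 400 as "healthy" is an assumption you may want to tighten.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List
import urllib.error
import urllib.request


@dataclass
class CheckResult:
    name: str    # label for the endpoint (yours or a vendor's)
    url: str
    ok: bool
    detail: str  # status code or the error that was raised


def http_probe(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status


def check_endpoints(
    endpoints: Dict[str, str],
    probe: Callable[[str], int] = http_probe,
) -> List[CheckResult]:
    """Probe every endpoint once and report which ones look healthy."""
    results = []
    for name, url in endpoints.items():
        try:
            status = probe(url)
            ok = 200 <= status < 400  # assumption: anything below 400 is healthy
            detail = f"HTTP {status}"
        except Exception as exc:  # DNS failure, timeout, refused connection, ...
            ok = False
            detail = f"{type(exc).__name__}: {exc}"
        results.append(CheckResult(name, url, ok, detail))
    return results


def failing(results: List[CheckResult]) -> List[str]:
    """Names of the endpoints that failed their check."""
    return [r.name for r in results if not r.ok]
```

Calling `check_endpoints({"origin": "https://www.example.com/", "cdn": "https://cdn.example.net/"})` on a schedule and alerting whenever `failing(...)` is non-empty gives a first view of both your own servers and your partners'.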
As a backup plan, have a lightweight site ready and on hand if your business requires 100 percent uptime. Even if it's simply a bunch of Apache servers hosted in the cloud, have one ready. Include absolutely no third parties or personalization; keep it bare-bones so it can be turned on during any and all types of downtime.
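One way to decide when to turn the lightweight site on, sketched here under stated assumptions: switch to the fallback only after several consecutive failed health checks, and switch back only after several consecutive healthy ones, so that a single blip does not flip traffic back and forth. The class name and the threshold values are hypothetical, and how traffic is actually redirected (DNS, load balancer, CDN rule) is left out.

```python
class FailoverSwitch:
    """Tracks health checks and decides whether the fallback site should serve traffic."""

    def __init__(self, fail_threshold: int = 3, recover_threshold: int = 5):
        self.fail_threshold = fail_threshold        # failures needed to switch to fallback
        self.recover_threshold = recover_threshold  # successes needed to switch back
        self.on_fallback = False
        self._streak = 0  # consecutive checks that contradict the current state

    def record(self, primary_healthy: bool) -> bool:
        """Record one health check; return True if the fallback should be live."""
        # A check "contradicts" the current state when the primary is down while
        # we serve from it, or healthy while we serve from the fallback.
        contradicts = primary_healthy if self.on_fallback else not primary_healthy
        self._streak = self._streak + 1 if contradicts else 0

        limit = self.recover_threshold if self.on_fallback else self.fail_threshold
        if self._streak >= limit:
            self.on_fallback = not self.on_fallback
            self._streak = 0
        return self.on_fallback
```

Requiring more healthy checks to recover than failed checks to fail over is a deliberate bias here: staying on the bare-bones site a little too long is usually cheaper than sending users back to a primary site that is still unstable.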
Creative Response to Failures
When you do fail, make it fun and give what could be a frustrated user a chuckle. This leaves a happy memory of your page even though they were unable to access it, and makes it more likely that they will return.
A good error page is like a good airport bar. You are still stuck at the airport, but at least you are enjoying yourself.
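As a rough sketch of what serving such a page could look like, the following uses only Python's standard library. The page copy, the port and the five-minute `Retry-After` value are all placeholders. Returning HTTP 503 rather than 200 is the one choice worth keeping, since it tells crawlers and clients that the outage is temporary.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder copy; swap in something that fits your brand's sense of humor.
OUTAGE_HTML = """<!doctype html>
<html><head><meta charset="utf-8"><title>We'll be right back</title></head>
<body>
<h1>Well, this is awkward.</h1>
<p>Our site is taking an unscheduled coffee break. We're on it, and we'll be back shortly.</p>
</body></html>
"""


def build_outage_response(retry_after_seconds: int = 300):
    """Return (status, headers, body) for a friendly, static outage page."""
    body = OUTAGE_HTML.encode("utf-8")
    headers = {
        "Content-Type": "text/html; charset=utf-8",
        "Content-Length": str(len(body)),
        "Retry-After": str(retry_after_seconds),  # hint for clients and crawlers
        "Cache-Control": "no-store",              # do not cache the outage page
    }
    return 503, headers, body


class OutageHandler(BaseHTTPRequestHandler):
    """Answers every GET request with the outage page."""

    def do_GET(self):
        status, headers, body = build_outage_response()
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Run the outage server until interrupted (port is an arbitrary choice)."""
    HTTPServer(("0.0.0.0", port), OutageHandler).serve_forever()
```

Running `serve()` on a small cloud instance gives you a page with no third parties, no personalization and nothing else to break.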
Recovery
If you do experience a site crash:
1. Offer some incentive for your customers to come back and revisit the site once it's back up - Offer a "failure discount" to keep a customer from immediately going to a competing site to purchase the power drill they originally intended to buy from you.
2. Collect data during the outage - Monitor and understand what is going on to determine the root cause and analyze the events leading up to the downtime.
3. Ask questions - Have we experienced this before? Was my infrastructure at fault? Could this have been avoided? Understanding the failure allows you to adjust your disaster plans accordingly.
4. Share your post-mortem analysis both internally and externally - Let everyone learn what you learned; sharing knowledge is the best way to make the web better, stronger and faster for everyone.
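For the data-collection step, even a simple log of timestamped health checks goes a long way in a post-mortem. The sketch below assumes such a log already exists as a list of `(timestamp, ok)` pairs sorted by time; the function names are illustrative. It turns that log into outage windows and a total downtime figure.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Window = Tuple[datetime, Optional[datetime]]


def outage_windows(samples: List[Tuple[datetime, bool]]) -> List[Window]:
    """Collapse time-ordered (timestamp, ok) samples into (start, end) outage windows.

    `end` is None when the site was still down at the last sample.
    """
    windows: List[Window] = []
    start: Optional[datetime] = None
    for timestamp, ok in samples:
        if not ok and start is None:
            start = timestamp                    # first failed check of an outage
        elif ok and start is not None:
            windows.append((start, timestamp))   # first healthy check ends it
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


def total_downtime(windows: List[Window], now: datetime) -> timedelta:
    """Sum the window lengths, counting open windows as lasting until `now`."""
    return sum(((end or now) - start for start, end in windows), timedelta())
```

Because downtime is measured between checks, the result is only as precise as the check interval, which is worth stating when you share the numbers.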