The Perils of Downtime in the Cloud
October 23, 2014

Cliff Moon
Boundary


For the longest time, the mantra for developers at Facebook has been "move fast and break things". The idea behind this philosophy is that the stigma around screwing up and breaking production slows down feature development; remove the stigma from breakage, and more agility will result. The cloud readily embodies this philosophy, since it is explicitly made of unreliable components. The challenge for the enterprise embracing the cloud is to build up the processes and resiliency necessary to build reliable systems from unreliable components. Otherwise, moving to the cloud will mean that your customers are the first people to notice when you are experiencing downtime.

So what changes are necessary to remove the costs of downtime in the cloud? First and foremost, what is needed is a move to a more resilient architecture. The health of the service as a whole cannot rely on any single node. This means no special nodes: everything gets installed onto multiple instances with active-active load balancing between identical services. Not only that, but any service with a dependency must be able to survive that dependency going away. Writing code that is resilient to the myriad failures that may happen in the cloud is an art unto itself. No one will be good at it to start. This is where process and culture modifications come in.

It turns out that if you want programmers to write code that behaves well in production, an effective way to achieve that is to make them responsible for the behavior of their code in production. The individual programmers go on pager rotation, and because they have to work side by side with the other people on rotation, they are held accountable for the code they write. It should never be an option to point to the failure of another service as the cause of your own service's failure. The writers of each discrete service should be encouraged to own their availability by measuring it separately from that of their dependencies. Techniques like serving stale data from cache, graceful degradation of ancillary features, and well-reasoned timeout settings are all useful for staying resilient while still relying on unreliable dependencies.
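As one illustration of serving stale data from cache, here is a minimal Python sketch. The `StaleCache` class, its status strings, and the injected `fetch` callable are all hypothetical names chosen for this example; the sketch also assumes that `fetch` enforces its own timeout and raises an exception when the dependency is slow or down.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class StaleCache:
    """Wraps a fetch function and falls back to the last good value on failure.

    Hypothetical helper for illustration; `fetch` is assumed to apply its own
    timeout and raise an exception when the dependency is unavailable.
    """

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (time the value was stored, value)
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Tuple[Optional[str], str]:
        """Return (value, status); status is 'fresh', 'stale', or 'unavailable'."""
        entry = self._entries.get(key)
        now = self._clock()

        # Within the TTL, answer from cache without touching the dependency.
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1], "fresh"

        try:
            value = self._fetch(key)
        except Exception:
            # Dependency failed: degrade to the last known value if there is one.
            if entry is not None:
                return entry[1], "stale"
            return None, "unavailable"

        self._entries[key] = (now, value)
        return value, "fresh"
```

Returning the status alongside the value lets the caller decide how to degrade, for example by hiding an ancillary feature when the answer is `unavailable`, and makes it possible to count stale responses separately when measuring the service's own availability.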

If your developers are on pager rotation, then there should be something to page them about. This is where monitoring comes in. Monitoring alerts come in two basic flavors: noise and signal. Monitoring setups with too many alerts configured will tend to be noisy, which leads to alert fatigue.

A good rule of thumb for any alerts you have set up is that they be actionable, impacting, and imminent. By actionable, I mean that there is a clear set of steps for resolving the issue. An actionable alert would tell you that a service has gone down. A less actionable alert would tell you that latencies are up, since it isn't clear what, if anything, you could do about that.

Impacting means that without human intervention the underlying condition will either cause or continue to cause customer impact.

And imminent means that the alert requires immediate intervention to alleviate service disruption. An example of a non-imminent alert would be alerting that your SSL certificates were due to expire in a month. Impactful and actionable, absolutely. But it doesn't warrant getting out of bed in the middle of the night.
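The three criteria can be read as a simple routing rule: only an alert that is actionable, impacting, and imminent should page a human. The Python sketch below is one hypothetical way to encode that; the `Alert` type, the `route` function, and the non-paging destinations (`ticket`, `dashboard`) are assumptions made for illustration, not something the rule of thumb prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record carrying the three rule-of-thumb flags."""
    name: str
    actionable: bool  # there is a clear set of steps to resolve it
    impacting: bool   # customers are or will be affected without intervention
    imminent: bool    # it needs intervention right now


def route(alert: Alert) -> str:
    """Decide where an alert goes; only all-three alerts page someone."""
    if alert.actionable and alert.impacting and alert.imminent:
        return "page"
    if alert.actionable and alert.impacting:
        # Real work, but it can wait for business hours (assumed destination).
        return "ticket"
    # Everything else is kept visible without interrupting anyone.
    return "dashboard"


if __name__ == "__main__":
    for alert in (
        Alert("service down", actionable=True, impacting=True, imminent=True),
        Alert("SSL certificate expires in a month",
              actionable=True, impacting=True, imminent=False),
        Alert("latencies are up", actionable=False, impacting=True, imminent=True),
    ):
        print(alert.name, "->", route(alert))
```

With this rule, the expiring-certificate example becomes a ticket rather than a page, which matches the point that it doesn't warrant getting out of bed in the middle of the night.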

At the end of the day, adopting the cloud alone isn't going to be the silver bullet that automatically injects agility into your team. The culture and structure of the team must be adapted to fit the tools and platforms they use in order to get the most out of them. Otherwise, you're going to be having a lot of downtime in the cloud.

Cliff Moon is CTO and Founder of Boundary.
