The recent outage of the Amazon EC2 cloud disrupted service for Netflix, Instagram and other major sites and applications. Last year, after the first major Amazon outage, I published a post on this site: After Amazon: 5 Ways BSM Can Protect You from the Next Cloud Outage. The points are worth repeating, since we see that cloud computing is no guarantee of better application performance:
The Amazon cloud outage is a wake-up call for IT staff that are not adequately prepared for the journey to the cloud. Planning for migration of applications to any type of cloud – public or private, on-premise or off-premise – requires appropriate service management processes and infrastructure. Otherwise, you risk being unable to manage, or even understand, the business impact of future cloud outages.
When talking about business services in the cloud, it's almost impossible to avoid the obvious play on words: when you move to the cloud, you lose visibility. In order to meet SLAs, maintain a quality user experience, and resolve problems quickly, you need a clear picture of your services as they traverse each hop of the infrastructure. But in the cloud, where resources are virtualized and allocated dynamically, you often have little idea where services are running.
The Amazon cloud outage demonstrates the point. When the outage occurred, the EC2 dashboard could not tell customers how their applications and services were performing. It did not provide round-trip transaction times or report on the user experience. Instead, it reported various problems with latency and errors that were eventually linked to the cloud storage service. Those KPIs did not tell EC2 customers how the outage was affecting their business. In fact, according to Amazon, the outage was not even a violation of customer SLAs – even though many sites went down completely.
Cloud computing requires a sophisticated approach to Business Service Management that enables you to track services from the data center and into the cloud. This post looks at 5 key capabilities that organizations must have in order to maintain visibility and control in the cloud:
In the cloud more than ever, you need a top-down, end-to-end view of your business services. The service cannot be a black box; instead, you need a topological map that shows the execution of each service – also called a business transaction – as it traverses every server in the private and public cloud. As we saw recently, it is critical to build redundancy and not to rely on a single cloud provider for all of your needs, so you need a solution that can track complex hybrid architectures, even between clouds.
You need to see the performance not only round-trip, but on each leg of the journey. This is the only way to assure SLAs on the one hand, and to quickly identify the source of performance degradation on the other. Ideally, your solution will also provide some deep-dive capabilities so that in addition to identifying the problem tier, it will also lead you to the source of the problem.
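As a rough illustration of measuring each leg as well as the round trip, here is a minimal Python sketch. The `Hop` record, the tier names, and the 300 ms SLA are all hypothetical; a real monitoring product would collect this data for you.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    tier: str           # hypothetical tier label, e.g. "web", "app", "storage"
    duration_ms: float  # time one transaction spent in this tier


def round_trip_ms(hops):
    """Total time across every leg of the transaction."""
    return sum(h.duration_ms for h in hops)


def slowest_tier(hops):
    """Tier that accounts for the most time; a tier may be visited more than once."""
    totals = {}
    for h in hops:
        totals[h.tier] = totals.get(h.tier, 0.0) + h.duration_ms
    return max(totals, key=totals.get)


def sla_met(hops, sla_ms):
    """True when the round-trip time is within the (assumed) SLA target."""
    return round_trip_ms(hops) <= sla_ms


# Illustrative numbers only, not measurements from any real outage.
hops = [Hop("web", 20), Hop("app", 35), Hop("storage", 400), Hop("app", 15)]
print(round_trip_ms(hops), slowest_tier(hops), sla_met(hops, sla_ms=300))
```

The round-trip figure alone says only that the SLA was missed; the per-tier totals point to where the time went.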
Since dynamic resource allocation is a cornerstone of the cloud ROI model, the path of a service or transaction in the cloud will change constantly. If your monitoring solution requires manual definition of services, it is very likely that it will not work properly in this type of environment.
To ensure accuracy and to save valuable time, it is important to choose a solution that automatically identifies business services and maintains a dynamic picture of service delivery.
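One way to picture automatic discovery is to derive the service map from the hops that transactions are actually observed to take, rather than from a hand-maintained definition. The sketch below assumes a hypothetical `traces` mapping from transaction ID to the ordered list of tiers it visited; it is not the method of any particular product.

```python
from collections import defaultdict


def build_service_map(traces):
    """Derive tier-to-tier edges from observed transaction paths.

    `traces` is an assumed input shape: {transaction_id: [tier, tier, ...]}.
    """
    graph = defaultdict(set)
    for path in traces.values():
        # Each consecutive pair of tiers is one observed call edge.
        for src, dst in zip(path, path[1:]):
            graph[src].add(dst)
    return {node: sorted(targets) for node, targets in graph.items()}


# Hypothetical traces; re-running this on fresh traces refreshes the map.
traces = {
    "t1": ["web", "app", "db"],
    "t2": ["web", "app", "cache"],
}
print(build_service_map(traces))
```

Because the map is rebuilt from live traffic, it follows the service as resources are reallocated, with no manual redefinition.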
One of the most important indicators of application health is the experience of real end-users. Synthetic transactions can provide an important indicator during quiet times, but they cannot tell you what all of your users are experiencing, all of the time. Setting up a real-user monitoring solution in the cloud can be complicated, since you do not necessarily control the point on the network between the application and your users. You should make sure that your monitoring solution can track real-user transactions in any cloud configuration. This is a crucial piece of information that puts the technical information from your cloud services provider into business context.
Even in the data center, change is probably the greatest risk to service stability. That risk is magnified in the cloud, where any change to code, hardware, or configuration can affect the behavior and performance of business services in unpredictable ways. Again, the Amazon outage shows us that even in the cloud, you may have to make some fast decisions and changes in order to keep your critical services online.
To mitigate the danger, you need a monitoring solution that can baseline service performance and analyze the impact of change on a wide variety of parameters. It's important to choose a solution that captures all transaction instances – and does not rely on sampling – so that you can accurately analyze problems and find root causes that occurred before a service level alarm would have been triggered.
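A simple way to picture baselining is to compare response times after a change against the mean and standard deviation recorded before it. The sketch below is only an illustration under stated assumptions: the function names and the three-sigma threshold are hypothetical, and it treats the input lists as every captured transaction rather than a sample.

```python
from statistics import mean, stdev


def baseline(times_ms):
    """Mean and standard deviation of response times before a change."""
    return mean(times_ms), stdev(times_ms)


def deviates(baseline_stats, after_ms, threshold=3.0):
    """True when the post-change mean drifts beyond `threshold` baseline deviations.

    The threshold of 3.0 is an assumed default, not a recommendation.
    """
    mu, sigma = baseline_stats
    if sigma == 0:
        return mean(after_ms) != mu
    return abs(mean(after_ms) - mu) / sigma > threshold


# Illustrative numbers only.
before = [100, 102, 98, 101, 99]
stats = baseline(before)
print(deviates(stats, [120, 118, 122]))  # shifted well away from the baseline
print(deviates(stats, [100, 101, 99]))   # in line with the baseline
```

A check like this can flag a shift caused by a change even while every transaction is still under the service level alarm threshold.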
One of the biggest obstacles to the cloud is the – understandable – fear of business owners that performance and usability will decline. Many application owners are concerned about the risks of sharing resources and are reluctant to accept the standardization and loss of control inherent in the cloud model. Unfortunately, well-publicized events such as the Amazon outage will only exacerbate those fears.
Yet the benefits of the cloud are real, and IT must be able not only to mitigate the risks of outages, but also to demonstrate the benefits to a business audience. You need a solution that measures performance and user experience, and can communicate them in a clear and intuitive fashion.
Russell Rothstein is Founder and CEO, IT Central Station.