Elasticity is one of the key components of the cloud environment; in fact, it is its hallmark. While elasticity makes the cloud viable, it has a hidden side effect on those of us responsible for ensuring application performance. In cloud environments, where volatility is the norm, it becomes ever more difficult to predict application behavior and measure performance. And, lest we forget, this was hard enough when everything was contained in the data center. Now we are chasing a moving target, which adds to the difficulty. Are you up for the challenge? Then read on.
What does managing application performance in the cloud mean?
Let’s enumerate some of the parts, as cloud architecture is often viewed as a composite of several stacks. At the top we have the end user, who could be someone using a web browser. Beneath the end user is Application as a Service (AaaS), more commonly called Software as a Service (SaaS), followed by Platform as a Service (PaaS), and, at the bottom, Infrastructure as a Service (IaaS). Monitoring the Infrastructure and Platform stacks is relatively straightforward.
With applications, however, this is not the case at all. They can do anything: developers can write whatever they want in order to deliver on their requirements. Unfortunately, this makes monitoring with traditional methods quite difficult, as the behavior of these applications appears unpredictable.
When we are trying to handle a performance problem, the closer to the end user it occurs in the cloud stack, the easier it is to determine the root cause. The complexity comes when we try to track the propagation of an issue through the various stacks up to the user. When a simple glitch starts in the infrastructure, percolates up to create multiple glitches in the platform and then causes a variety of issues in the application, we end up with very unhappy users. But if you can mitigate the risks in each of these stacks, you can hide a lot of the complexity and avoid some of the problems that impact end users.
The first option is to monitor each stack in the set separately. In this approach, indicators such as CPU, memory, I/O, storage and network are monitored in IaaS.
In PaaS, the indicators monitored include the JVM or CLR performance, Servlets, Web service response, GC behavior, resource pool utilization, clustering, replication and many others. These are all fairly well defined.
And finally, in AaaS, the indicators monitored include user experience, orders filled or missed, revenue booked, Web page response time and many others. In AaaS, the indicators are not as well defined as in the other layers. This approach works to a point, but doesn’t handle all situations.
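The per-stack approach described above can be sketched in a few lines of Python. This is only an illustration: the indicator names, values and limits are invented for the example, and a real monitoring tool would collect them from agents rather than from a hard-coded list.

```python
from dataclasses import dataclass


@dataclass
class Indicator:
    stack: str    # "IaaS", "PaaS" or "AaaS"
    name: str     # hypothetical indicator name
    value: float  # latest reading
    limit: float  # hypothetical upper bound for "normal"

    def is_normal(self) -> bool:
        return self.value <= self.limit


def stack_status(indicators):
    """Return {stack: True if every indicator in that stack is normal}."""
    status = {}
    for ind in indicators:
        status[ind.stack] = status.get(ind.stack, True) and ind.is_normal()
    return status


# Invented sample readings, one or two per stack.
readings = [
    Indicator("IaaS", "cpu_percent", 42.0, 85.0),
    Indicator("IaaS", "disk_io_wait_ms", 3.0, 20.0),
    Indicator("PaaS", "gc_pause_ms", 35.0, 200.0),
    Indicator("PaaS", "pool_utilization", 0.60, 0.90),
    Indicator("AaaS", "page_response_s", 1.2, 3.0),
]

print(stack_status(readings))
```

Each stack is judged only on its own indicators, which is exactly why this view can look healthy while something is going wrong between stacks.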
Another approach is needed when problems occur between stacks: all the IaaS, PaaS and SaaS indicators are normal, yet the user experiences a timeout. Looking at the individual stacks does not help us understand what the user is experiencing. This can happen when, for example, there is a slow query or a deadlock in the database.
A better approach might be to monitor each stack as well as the transactions that flow between them (a.k.a. transaction profiling or message tracking). All the IaaS, PaaS and SaaS indicators may still be normal, but now we might have an indicator showing that a transaction for a specific user timed out. By correlating transactions, we could learn that the cause of the timeout is the slow query on the database. This answers the question that the previous approach could not. This is a good approach. But is it good enough? The user still had a problem.
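To make the idea of correlating transactions concrete, here is a minimal Python sketch. It assumes each component records a timing entry tagged with a shared transaction ID; the record layout, the component names and the `slow_transactions` helper are all hypothetical and not taken from any particular APM product.

```python
from collections import defaultdict

# Hypothetical timing records: (transaction_id, component, elapsed_seconds)
records = [
    ("tx-1", "checkout_page", 0.20),
    ("tx-1", "order_servlet", 0.15),
    ("tx-1", "db_query", 0.10),
    ("tx-2", "checkout_page", 0.25),
    ("tx-2", "order_servlet", 0.20),
    ("tx-2", "db_query", 9.80),
]


def slow_transactions(records, timeout_s):
    """Group records by transaction ID and, for each transaction whose
    total time exceeds timeout_s, report its slowest component."""
    by_tx = defaultdict(list)
    for tx_id, component, elapsed in records:
        by_tx[tx_id].append((component, elapsed))

    report = {}
    for tx_id, steps in by_tx.items():
        total = sum(elapsed for _, elapsed in steps)
        if total > timeout_s:
            # The slowest step is the first place to look for the cause.
            report[tx_id] = max(steps, key=lambda step: step[1])
    return report


print(slow_transactions(records, timeout_s=5.0))
```

With the sample data, only the second transaction exceeds the timeout, and its slowest step is the database query, which mirrors the slow-query scenario in the text.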
The best approach would be to prevent the problem. Can that be done? Yes, it can. In this scenario, all the IaaS, PaaS and SaaS indicators are normal and the users are happily buying things. Yet trouble lurks. Over time, there is a slow degradation in the time it takes for a query to execute. No one has a problem at the moment, but if we extrapolate from what we see, they eventually will. Why not fix the problem before users complain?
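The extrapolation mentioned above can be illustrated with a simple straight-line fit. This is a sketch under stated assumptions: the daily query times and the 5-second limit are invented, and a linear trend is only one of many ways to project a degradation forward.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_limit(days, query_seconds, limit_s):
    """Estimate how many days after the last sample the fitted trend
    reaches limit_s. Returns None if the trend is flat or improving."""
    slope, intercept = fit_line(days, query_seconds)
    if slope <= 0:
        return None
    crossing_day = (limit_s - intercept) / slope
    return max(0.0, crossing_day - days[-1])


# Invented samples: average query time (seconds) on six consecutive days.
days = [0, 1, 2, 3, 4, 5]
query_seconds = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0]

# Assumed limit: users start timing out once a query takes 5 seconds.
print(days_until_limit(days, query_seconds, limit_s=5.0))
```

In this example every reading is still well under the limit, yet the trend suggests the limit will be reached in roughly two weeks, which leaves time to fix the query before anyone complains.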
The old-school method for monitoring performance was to set thresholds. But in our elastic environment, this approach doesn’t work. The VMs can hop from server to server, the number of CPUs can vary and a variety of different indicators can be changing. No one knows what the thresholds should be, so they become useless. Plus, while the application development group understands its creations, the groups supporting those applications may not share that knowledge or understand how the applications behave.
In the proactive approach, we use heuristics to figure out what is normal and abnormal for our applications and to predict problems before they have an impact. The challenge is that the problem may lie in the business logic, in which case the only way to see ahead is to follow trends, as in our example of the SQL query that keeps taking longer and longer.
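The post does not say which heuristics are used, so the following Python sketch substitutes one simple, well-known stand-in: learn a baseline from recent history and flag readings that fall more than a few standard deviations away from it. The function name, the sample history and the factor `k=3.0` are assumptions made for illustration.

```python
from statistics import mean, stdev


def is_abnormal(history, value, k=3.0):
    """Return True if value lies more than k standard deviations
    from the mean of history (the learned 'normal')."""
    if len(history) < 2:
        return False  # not enough data to define normal yet
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) > k * sigma


# Invented history of response times (seconds) for one transaction type.
history = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0]

print(is_abnormal(history, 1.1))  # within the usual spread
print(is_abnormal(history, 2.5))  # far outside the usual spread
```

Because the baseline is derived from the application's own recent behavior rather than from a fixed number, it can be recomputed as the environment changes.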
In my next post, Chasing a Moving Target: APM in the Cloud - Part 2, I will move on to the next steps in the process – detection, analysis and action.
Albert Mavashev is Chief Technology Officer at Nastel Technologies.