One of the common questions that every IT manager asks on a regular basis is, “Why is my application so slow today when everything was fine yesterday?” Application Performance Management (APM) is the only way to truly answer that question, and it is one of the must-have tools for every IT manager.
With this APM imperative in mind, the following are 10 capabilities every IT manager should look for when choosing an APM solution:
Real-time monitoring is a must. When digging into a problem, tracking events in real-time as they occur is by far more effective than doing so via “post-mortem” analysis. There are many APM vendors that claim to provide real-time monitoring but sometimes they really mean “near real-time”, with delays from 30 seconds to five minutes, typically. This restricts your ability to analyze and react to events in real-time. Make sure real-time is truly real-time. Real-time monitoring should provide you with important metrics such as: Who is doing what, how much resources are being taken, and who is affecting who right now?
Sometimes you get lucky and witness a problem in real-time. But in most cases, this doesn’t happen. This is why a good APM solution must be able to collect all transaction activity and performance metrics into a rich, but light-weight repository.
Some APM vendors store the statistics they gather but they aggregate it to save disk space or because they just can’t handle too much data in a reasonable amount of time. Analyzing performance incidents based on aggregated data is similar to assessing a book by reading only its rear cover. You get the general idea but you have no ability to understand what really happened. That’s why good APM solutions must give you all of the granular information including individual transactions and their characteristics, resource consumption, traffic order (chain of events) etc.
APM solutions are designed to improve the end user experience. Improving user experience starts by measuring it and identifying QoS and SLA anomalies. Only then can you make informed decisions and take action. You should also have the ability to compare user experience before and after a change is applied to your systems.
Some APM solutions enable users to analyze performance data and identify root problems retroactively, but do nothing to enable real-time resolution of performance issues. Because these solutions are fundamentally passive by nature, you have no choice but to wait for application performance to nosedive before corrective action can be taken. And in these cases, the wait time from issue identification to resolution can be hours or even days. Avoiding QoS problems can be achieved only if you take proactive steps. Proactive APM solution can turn this: “I got a text message at 2:00AM from our APM tool that indicated that we had a QoS problem so I logged into the system and solved it,” into: “I got a text message at 8:00 AM from our APM tool letting me know that at 1:50 AM a QoS problem was about to occur and it took care of it automatically.” Being proactivite can be achieved in many ways: by activating automatic scripts, managing system resources, and triggering third party tools, etc.
If an APM tool only notifies you that you ran out of system resources because of job X, then you don’t really have root cause analysis capabilities. Root cause analysis is when your APM tool tells you that this job usually runs at 8:00 PM but because of problem on a secondary system, it has started 1 hour later and collided with another job that was scheduled to run at the same time. APM tools must do the hard work of correlating many little pieces of data so that you can get to the source of the problem. Otherwise you will find yourself trying to assemble a 1,000 piece puzzle while your CEO knocks on your door every 5 minutes looking for answers.
Analyzing a problem can take many shapes. The conventional way is by digging into the top-10 hit lists. But those top-10 lists always miss something - the chain of events. Who came first, who came after, “it was all fine until this transaction came in”, etc. Analyzing the chain of events before the system crashed is crucial if you wish to avoid this problem in the future. An APM tool should give you the ability to travel back in time and look into the granular metrics second by second as if you were watching a movie in slow motion. This is possible only if the APM tool collects data at a very high level of granularity and does not lose it over time (i.e. it retains the raw collected metrics).
There are two main performance troubleshooting approaches that an APM tool should support. Performance drill downs to a specific period of time, and performance comparison. If you have a performance problem now, but all was fine yesterday, you must assume that something has changed. Hunting for those changes will lead you to the root cause much quicker than a conventional drill down into the current problem's performance metrics. You should have the ability to answer questions like these in seconds: “Is this new storage system I just implemented faster than the old one we had?” and “why is it working very well in QA but not in production?” If your APM tool collects and stores raw performance metrics, by comparing those metrics you can easily answer all these questions and dramatically shorten your mean time to recovery.
When an APM tool stores millions of pieces of raw (and aggregated) data, it should also deliver a convenient way to slice and dice this data. Some APM tools will decide for you the best way to process this data by providing a pre-defined set of graph and report templates. A good APM tool will let you decide how you want to slice and dice this data by giving you a flexible and easy to use BI-like dashboard where you can drag and drop dimensions and drill down by double clicking in order to answer questions like, “What user consumed most of my CPU and what is the top program he/she has been using that caused the most impact?”
Bad performance usually starts with bad design or bad coding and very rarely stems from hardware faults. If a developer writes a poor piece of code, the IT division needs to spend more money on hardware or software licenses to deal with it. This is why it’s becoming popular in many organizations to turn this dynamic upside down - here the annual budgets are distributed between the application development divisions that use this money to buy IT services from their IT division. If they write poor code they ultimately need to pay more. This is workable only if the IT department has an APM tool that can measure and enforce resources usage by ‘tenant’. This approach has proven to be effective in helping companies reduce their IT budget quite significantly.
Irad Deutsch is a CTO at Veracity group, an international software infrastructure integrator. Irad is also the CTO of MORE IT Resources - MoreVRP, a provider of application and database performance optimization solutions.