Root-Cause Analysis of Application Performance Problems
November 12, 2013

Charley Rich
Nastel Technologies

I first came upon the term root-cause analysis (RCA) while working at a network management startup. The concept was to determine why a problem occurred so that repair could happen sooner and service could be restored. Doing this required discovering the topology of a network and its devices in order to understand where a problem could occur and how the various parts related to one another. Monitoring was necessary to identify that a failure had occurred and to provide notification.

However, the challenge was that many failure events are received in seemingly random order; thus, it is very difficult to differentiate which events signify symptoms of the problem and which event represents the actual cause. To resolve this, some solutions constructed elaborate causality chains in the hope that you could follow them backwards in time to the "root cause". This is akin to following smoke and having it lead you to the fire. Well, it does work, if you do it fast enough and before the whole forest is in flames.
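To make the causality-chain idea concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular product works: the event names and the CAUSED_BY map are hypothetical, and it assumes each symptom has at most one known upstream cause. Because the observed events are treated as a set, the order in which they arrive does not matter.

```python
from typing import Dict, Set

# Hypothetical causality map: each event points to the event that can cause it.
CAUSED_BY: Dict[str, str] = {
    "web-timeout": "app-server-slow",
    "app-server-slow": "db-pool-exhausted",
    "db-pool-exhausted": "disk-full",
}


def find_root_causes(observed: Set[str], caused_by: Dict[str, str]) -> Set[str]:
    """Follow each observed event backwards along the chain and return
    the furthest-upstream events that were actually observed."""
    roots: Set[str] = set()
    for event in observed:
        current = event
        visited = {current}
        while True:
            parent = caused_by.get(current)
            # Stop when there is no known cause, the cause was not observed,
            # or the map contains a cycle.
            if parent is None or parent not in observed or parent in visited:
                break
            visited.add(parent)
            current = parent
        roots.add(current)
    return roots


if __name__ == "__main__":
    alarms = {"app-server-slow", "web-timeout", "db-pool-exhausted"}
    print(find_root_causes(alarms, CAUSED_BY))
```

Here the three alarms collapse into a single candidate cause. The weakness is the one described above: the result is only as good as the causality map, and the map has to be walked before the whole forest is in flames.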

The obvious next thing to do was to apply this to applications. It certainly seemed like a good idea at the time ... but it turned out to be much harder than expected. Why harder? Applications are far more complex than networks, with many more variations in behavior and relationships. So, instead, monitoring systems were applied to the various silos of the application architecture, such as web servers, application servers, middleware, databases and others.

For many years the focus of APM was on making the application server run better. And from that perspective, it was successful. However, while the application server became more reliable and ran faster, the two key capabilities IT Operations management desires, getting alerted to problems before the end user is affected and being pointed in the right direction, have not improved much.

Part of the difficulty in this world of multiple monitoring tools is that there are so many sources of events and so many moving parts. Is the cause capacity, a stuck message, a configuration issue or, even worse, a misunderstanding of business requirements? Perhaps the application is running just fine with all indicators green, but the results aren't what the business expected. Or it works fine for users in one group, but not for those in another. These are very difficult problems to unravel.

An approach Forrester Research suggests is to bring the events from the various sources to a single pane of glass and perform a root-cause analysis there. The suggestion is to use a technology called Complex Event Processing (CEP) to search in real time for patterns of events from multiple sources that together describe a problem.

CEP is very good at identifying situations spanning multiple event streams, correlating the individual events into the "big picture": the situation. This is analogous to the QA concept of test cases. Think of situations as the test cases that occur spontaneously in production. APM is not for the faint of heart.

CEP can tie the seemingly unrelated events together into a picture that tells a story: what happened and what triggered it. Because CEP uses rules, it is of course dependent on the quality and completeness of those rules. But that is something that grows better over time. Once a new situation has been described in a rule, it can be caught before it causes harm again. Without the relationship between the events from the various sources, that would not be possible. We would just be fixing the web server or the database or the application server. With this approach, we are fixing the problem.
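As a rough illustration of what such a rule might look like, the Python sketch below raises a "situation" when a given set of events from different sources all appear within one time window. It is a simplified stand-in and not the API of any real CEP engine: the Event and SituationRule names, the source and event names, and the 60-second window are all assumptions made for the example, and it assumes events are fed in timestamp order.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds
    source: str       # e.g. "middleware", "app_server" (hypothetical names)
    kind: str         # e.g. "queue_depth_high"


class SituationRule:
    """Fires when every required (source, kind) pair is seen within the window."""

    def __init__(self, name: str, required: Iterable[Tuple[str, str]],
                 window_seconds: float) -> None:
        self.name = name
        self.required = frozenset(required)
        self.window_seconds = window_seconds
        self._recent: Deque[Event] = deque()

    def feed(self, event: Event) -> Optional[str]:
        """Add one event; return the situation name if the rule matches."""
        self._recent.append(event)
        # Drop events that fall outside the window ending at the newest event.
        while event.timestamp - self._recent[0].timestamp > self.window_seconds:
            self._recent.popleft()
        seen = {(e.source, e.kind) for e in self._recent}
        if self.required <= seen:
            self._recent.clear()  # avoid reporting the same situation twice
            return self.name
        return None


if __name__ == "__main__":
    rule = SituationRule(
        "stuck-message-backlog",
        [("middleware", "queue_depth_high"),
         ("app_server", "thread_pool_exhausted"),
         ("web_server", "http_503")],
        window_seconds=60,
    )
    stream = [
        Event(0, "middleware", "queue_depth_high"),
        Event(20, "app_server", "thread_pool_exhausted"),
        Event(45, "web_server", "http_503"),
    ]
    for ev in stream:
        situation = rule.feed(ev)
        if situation:
            print("Situation detected:", situation)
```

Each monitoring silo sees only one of these three events. The rule is what names the situation, and adding a rule for each newly understood situation is how the rule set improves over time.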

CEP represents an actionable form of analytics. Because it is inherently a multi-source approach, you can add CEP analytics to your APM, including your currently deployed monitoring solutions. Using it to deliver root-cause analysis can improve your incident management process. It can help you achieve the IT Ops goals of getting alerted to problems before the end user is affected and being pointed in the right direction.

Charley Rich is VP Product Management and Marketing at Nastel Technologies.

Related Links:

For more information on this methodology, see the Forrester document:
Technology Spotlight: Application Performance Management And Complex Event Processing

www.nastel.com
