Root-Cause Analysis of Application Performance Problems
November 12, 2013

Charley Rich
Nastel Technologies


I first came upon the term root-cause analysis (RCA) while working at a network management startup. The concept was to determine why a problem occurred so that repair could happen sooner and service could be restored. Doing this required discovering the topology of a network and its devices in order to understand where a problem could occur and how the various parts related to one another. Monitoring was necessary to detect that a failure had occurred and to provide notification.

However, the challenge was that many failure events are received in seemingly random order; thus, it is very difficult to differentiate which events signify symptoms of the problem and which event represents the actual cause. To resolve this, some solutions constructed elaborate causality chains in the hope that you could follow them backwards in time to the "root cause". This is akin to following smoke and having it lead you to the fire. Well, it does work, if you do it fast enough and before the whole forest is in flames.
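To make the smoke-and-fire idea concrete, here is a minimal sketch of topology-based root-cause filtering. It is not any particular product's algorithm: the `DEPENDS_ON` map, the component names, and the `root_cause_candidates` helper are all assumptions made up for illustration. The idea is simply that a failed component whose own dependencies are healthy is a likelier cause than one sitting downstream of another failure.

```python
# Hypothetical topology: each component maps to the components it depends on.
DEPENDS_ON = {
    "web-1": ["switch-a"],
    "web-2": ["switch-a"],
    "db-1": ["switch-b"],
    "switch-a": ["router-1"],
    "switch-b": ["router-1"],
    "router-1": [],
}


def root_cause_candidates(failed, depends_on):
    """Return failed components none of whose direct dependencies also failed."""
    failed = set(failed)
    return sorted(
        node
        for node in failed
        if not any(dep in failed for dep in depends_on.get(node, []))
    )


# Failure events in the seemingly random order in which they arrived.
events = ["web-2", "db-1", "switch-a", "web-1", "router-1", "switch-b"]
print(root_cause_candidates(events, DEPENDS_ON))
```

Arrival order does not matter here, because the topology rather than the timestamps separates symptom from cause. The sketch only checks direct dependencies and assumes the topology is accurate and every failure is reported, which is part of why the real thing is harder than it looks.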

The obvious next thing to do was apply this to applications. It certainly seemed like a good idea at the time ... but it turned out to be much harder than expected. Why harder? Applications are far more complex than networks, with many more variations in behavior and relationships. So instead, monitoring systems were applied to the various silos of the application architecture, such as web servers, application servers, middleware, databases and others.

For many years the focus of APM was on making the application server run better. And from that perspective, it was successful. However, while the application server became more reliable and ran faster, the two key capabilities IT Operations management desires, getting alerted to problems before the end user is affected and being pointed in the right direction, have not improved much.

Part of the difficulty in this world of many monitoring tools is that there are so many sources of events and so many moving parts. Is the cause capacity, a stuck message, configuration issues or, even worse, a misunderstanding of business requirements? Perhaps the application is running just fine with all indicators green, but the results aren't what the business expected. Or it works fine for users in one group, but not for another. These are very difficult problems to unravel.

An approach Forrester Research suggests is to bring the events from the various sources to a single pane of glass and perform a root-cause analysis. The suggestion is to use a technology called Complex Event Processing (CEP) to search in real time for patterns, based on events from multiple sources, that together describe a problem.

CEP is very good at identifying situations spanning multiple event streams, correlating the individual events together into the "big picture", the situation. Analogous to this is the concept in QA of test cases. Think of situations as the test cases that occur spontaneously in production. APM is not for the faint of heart.

CEP can tie the seemingly unrelated events together into a picture that tells a story: what happened and what triggered it. CEP, using rules, is of course dependent on the quality and completeness of those rules. But that is something that grows ever better over time. A new situation can be described and prevented from ever causing harm again. Without the relationship between the events from the various sources, that would not be possible. We would just be fixing the web server or the database or the application server. With this approach, we are fixing the problem.
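As a rough illustration of what such a rule might look like, here is a toy sketch of a CEP-style "situation" that fires when events from several sources all show up within a time window. It is not the API of any real CEP engine: the `Event` and `Rule` classes, the `detect` function, and the event names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since some epoch
    source: str       # monitoring silo, e.g. "web", "middleware", "database"
    kind: str         # what was observed, e.g. "slow_response"


@dataclass(frozen=True)
class Rule:
    name: str
    required: frozenset  # (source, kind) pairs that together form the situation
    window: float        # all pairs must be seen within this many seconds


def detect(events, rule):
    """Return the timestamps at which the rule's situation became complete."""
    last_seen = {}
    hits = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.source, ev.kind)
        if key not in rule.required:
            continue  # irrelevant to this rule
        last_seen[key] = ev.timestamp
        complete = len(last_seen) == len(rule.required)
        if complete and ev.timestamp - min(last_seen.values()) <= rule.window:
            hits.append(ev.timestamp)
            last_seen.clear()  # start looking for the next occurrence
    return hits


# Hypothetical rule: three silos each report a symptom within 30 seconds.
stuck_message = Rule(
    name="stuck message",
    required=frozenset({
        ("middleware", "queue_depth_high"),
        ("web", "slow_response"),
        ("database", "lock_wait"),
    }),
    window=30.0,
)

events = [
    Event(100.0, "database", "lock_wait"),
    Event(12.0, "middleware", "queue_depth_high"),
    Event(15.0, "web", "slow_response"),
    Event(10.0, "database", "lock_wait"),
    Event(300.0, "web", "slow_response"),
]
print(detect(events, stuck_message))
```

Each silo's event on its own looks like an isolated warning; only the combination inside the window is treated as the situation. A production CEP engine would work on live streams rather than a sorted list, but the shape of the rule is the point: the quality of the detection depends entirely on how well `required` and `window` describe the real problem.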

CEP represents an actionable form of analytics. Because it is inherently a multi-source approach, you can add CEP analytics to your APM, including your currently deployed monitoring solutions. Using it to deliver root-cause analysis can improve your incident management process. It can help you achieve the IT Ops goals of getting alerted to problems before the end user is affected and being pointed in the right direction.

Charley Rich is VP Product Management and Marketing at Nastel Technologies.

Related Links:

For more information on this methodology see the Forrester document:
Technology Spotlight: Application Performance Management And Complex Event Processing
