Root-Cause Analysis of Application Performance Problems
November 12, 2013

Charley Rich
Nastel Technologies

I first came upon the term root-cause analysis (RCA) while working at a network management startup. The concept was to determine why a problem occurred so that repair could happen sooner and service could be restored. Doing this required discovering the topology of a network and its devices in order to understand where a problem could occur and how the various parts were related. Monitoring was necessary in order to identify that a failure had occurred and to provide notification.

However, the challenge was that many failure events are received in seemingly random order; thus, it is very difficult to differentiate which events signify symptoms of the problem and which event represents the actual cause. To resolve this, some solutions constructed elaborate causality chains in the hope that you could follow them backwards in time to the "root cause". This is akin to following smoke and having it lead you to the fire. Well, it does work, if you do it fast enough and before the whole forest is in flames.
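As a rough illustration of following a causality chain back to its origin, the sketch below assumes a known dependency topology and treats a failed component as a root-cause candidate only if none of the components it depends on has also failed. The component names, the `DEPENDS_ON` map and the `root_cause_candidates` function are hypothetical and are not taken from any particular product.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical topology: each component maps to the components it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "web-server": ["app-server"],
    "app-server": ["database", "message-queue"],
    "database": ["storage"],
    "message-queue": [],
    "storage": [],
}


def root_cause_candidates(
    failed: Iterable[str], depends_on: Dict[str, List[str]]
) -> Set[str]:
    """Return failed components that have no failed dependency of their own."""
    failed_set = set(failed)
    return {
        component
        for component in failed_set
        if not any(dep in failed_set for dep in depends_on.get(component, []))
    }


# Failure events arrive in arbitrary order; the order does not affect the result.
events = ["web-server", "database", "app-server", "storage"]
print(sorted(root_cause_candidates(events, DEPENDS_ON)))  # prints ['storage']
```

Here the web server, application server and database failures are treated as symptoms because each depends on something else that failed; only the storage failure has no failed dependency beneath it. The approach is only as good as the topology it is given.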

The obvious next thing to do was apply this to applications. It certainly seemed like a good idea at the time ... but it turned out to be much harder than expected. Why harder? Applications are far more complex than networks, with many more variations in behavior and relationships. So instead, monitoring systems were applied to the various silos of the application architecture, such as web servers, application servers, middleware, databases and others.

For many years the focus of APM was on making the application server run better. And from that perspective, it was successful. However, while the application server became more reliable and ran faster, the two key capabilities IT Operations management desires, getting alerted to problems before the end user is affected and being pointed in the right direction, have not improved much.

Part of the difficulty in this world of multiple monitoring tools is that there are so many sources of events and so many moving parts. Is the cause capacity, a stuck message, configuration issues or, even worse, a misunderstanding of business requirements? Perhaps the application is running just fine with all indicators green, but the results aren't what the business expected. Or it works fine for users in one group, but not for another. These are very difficult problems to unravel.

An approach Forrester Research suggests is to bring the events from the various sources to a single pane of glass and perform a root-cause analysis there. The suggestion is to use a technology called Complex Event Processing (CEP) to search in real time for patterns of events from multiple sources that together describe a problem.

CEP is very good at identifying situations spanning multiple event streams, correlating the individual events into the "big picture": the situation. The QA concept of test cases is analogous. Think of situations as the test cases that occur spontaneously in production. APM is not for the faint of heart.

CEP can tie the seemingly unrelated events together into a picture that tells a story: what happened and what triggered it. Because CEP uses rules, it is of course dependent on the quality and completeness of those rules. But that is something that grows ever better over time. A new situation can be described and prevented from ever causing harm again. Without the relationship between the events from the various sources, that would not be possible. We would just be fixing the web server or the database or the application server. With this approach, we are fixing the problem.
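To make the idea of a rule that spans several event sources more concrete, here is a minimal sketch of one kind of CEP-style rule: a situation is detected when every required (source, kind) pair has been seen within a time window. The `Event`, `Rule` and `detect` names, the event kinds and the 60-second window are all assumptions made for illustration; real CEP engines offer far richer pattern languages than this.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds
    source: str       # e.g. "database", "middleware"
    kind: str         # e.g. "slow_query"


@dataclass(frozen=True)
class Rule:
    name: str
    required: FrozenSet[Tuple[str, str]]  # (source, kind) pairs that must all occur
    window_seconds: float


def detect(events: List[Event], rule: Rule) -> Optional[Tuple[float, float]]:
    """Return (start, end) of the first window containing every required pair, else None."""
    last_seen: Dict[Tuple[str, str], float] = {}
    for event in sorted(events, key=lambda e: e.timestamp):
        key = (event.source, event.kind)
        if key not in rule.required:
            continue  # event is not part of this situation
        last_seen[key] = event.timestamp
        if len(last_seen) == len(rule.required):
            start = min(last_seen.values())
            if event.timestamp - start <= rule.window_seconds:
                return (start, event.timestamp)
    return None


# Hypothetical rule: three symptoms from three silos within 60 seconds.
rule = Rule(
    name="backend-slowdown",
    required=frozenset({
        ("middleware", "queue_depth_high"),
        ("database", "slow_query"),
        ("web-server", "response_time_high"),
    }),
    window_seconds=60.0,
)

events = [
    Event(130.0, "web-server", "response_time_high"),
    Event(10.0, "database", "slow_query"),
    Event(105.0, "middleware", "queue_depth_high"),
    Event(120.0, "database", "slow_query"),
    Event(125.0, "app-server", "gc_pause"),
]

print(detect(events, rule))  # prints (105.0, 130.0)
```

The early slow query at 10 seconds does not trigger the rule on its own; the situation is only recognized once the later events from all three sources fall inside the same window. Keeping rules as data, as sketched here, is one way a newly observed situation could be added without changing the matching logic.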

CEP represents an actionable form of analytics. Because it is inherently a multi-source approach, you can add CEP analytics to your APM, including your currently deployed monitoring solutions. Using it to deliver root-cause analysis can improve your incident management process. It can help you achieve the IT Ops goals of getting alerted to problems before the end user is affected and being pointed in the right direction.

Charley Rich is VP Product Management and Marketing at Nastel Technologies.

Related Links:

For more information on this methodology see the Forrester document:
Technology Spotlight: Application Performance Management And Complex Event Processing

www.nastel.com
