Root-Cause Analysis of Application Performance Problems
November 12, 2013

Charley Rich
Nastel Technologies

Share this

I first came upon the term root-cause analysis (RCA) while working at a network management startup. The concept was to determine why a problem occurred so that repair could happen sooner and service restored. To do this required a discovery of the topology of a network and its devices in order to understand where a problem could occur and the relationship between the various parts. Monitoring was necessary in order to identify that a failure occurred and provide notification.

However, the challenge in doing this was that many failure events are received in seemingly random order; thus, it is very difficult to differentiate which events signified symptoms of the problem and which event represented the actual cause. To resolve this, some solutions constructed elaborate causality chains in the hope you could follow them backwards in time to the "root-cause". This is akin to following smoke and having it lead you to the fire. Well it does work, if you do it fast enough and before the whole forest is in flames.

The obvious next thing to do was apply this to applications. It certainly seemed like a good idea at the time ... but it turned out to be much harder than expected. Why harder? Applications are far more complex than networks with many more variations in behavior and relationship. So, instead monitoring systems were applied to the various silos of application architecture such as web servers, application servers, middleware, databases and others.

For many years the focus of APM was on making the application server run better. And from that perspective, it was successful. However, while the application server became more reliable and ran faster, the two key features IT Operations management desire: getting alerted to problems before the end user is affected and being pointed in the right direction have not improved much.

Part of the difficulty in this sort of multiplicity of monitoring tools world is that there so many sources of events and so many moving parts. Is the cause capacity, a stuck message, configuration issues or even worse a misunderstanding of business requirements? Perhaps, the application is running just fine with all indicators green, but the results aren't what the business expected. Or it works fine for users in one group, but not for another. These are very difficult problems to unravel.

An approach Forrester Research suggests is to bring the events from the various sources to a single pain of glass and perform a root-cause analysis. The suggestion is made to use a technology called Complex Event Processing (CEP) to search in real-time for patterns based on events from multiple sources that together describe a problem.

CEP is very good at identifying situations spanning multiple event streams, correlating the individual events together into the "big picture", the situation. Analogous to this is the concept in QA of test cases. Think of situations as the test cases that occur spontaneously in production. APM is not for the faint of heart.

CEP can tie the seemingly unrelated events together into a picture that tells a story, what happened and what triggered it. CEP, using rules is of course dependent on the quality and completeness of those rules. But, that is something that grows ever better over time. A new situation can be described and prevented from ever causing harm again. Without the relationship between the events from the various sources, that would not be possible. We would just be fixing the web server or the database or the application server. With this approach, we are fixing the problem.

CEP represents an actionable form of analytics. You can add CEP analytics to your APM including your currently deployed monitoring solutions as it is inherently a multi-source approach. Utilizing this and delivering root-cause analysis can improve your incident management process. It can help you achieve the IT Ops goals of: getting alerted to problems before the end user is affected and being pointed in the right direction.

Charley Rich is VP Product Management and Marketing at Nastel Technologies.

Share this

The Latest

March 24, 2017

A growing IT delivery gap is slowing down the majority of the businesses surveyed and directly putting revenue at risk, according to MuleSoft's 2017 Connectivity Benchmark Report on digital transformation initiatives and the business impact of APIs ...

March 23, 2017

Why containers are growing in popularity is no surprise — they’re extremely easy to spin up or down, but come with an unforeseen issue. Without the right foresight, DevOps and IT teams may lose a lot of visibility into these containers resulting in operational blind spots and even more haystacks to find the presumptive performance issue needle ...

March 22, 2017

Much emphasis is placed on servers and storage when discussing Application Performance, mainly because the application lives on a server and uses storage. However, the network has considerable importance, certainly in the case of WANs where there are ways of speeding up the transmission of data of a network ...

March 21, 2017

The majority of IT executives believe investment in IT Service Management (ITSM) is important to gain the agility needed to compete in an era of global, cross-industry disruption and digital transformation, according to Delivering Value to Today’s Digital Enterprise: The State of IT Service Management 2017, a report by BMC, conducted in association with Forbes ...

March 17, 2017

Let’s say your company has examined all the potential pros and cons, and moved your critical business applications to the cloud. The advertised benefits of the cloud seem like they’ll work out great. And in many ways, life is easier for you now. But as often happens when things seem too good to be true, reality has a way of kicking in to reveal just exactly how many things can go wrong with your cloud setup – things that can directly impact your business ...

March 16, 2017

IT leadership is more driven to be innovative than ever, but also more in need of justifying costs and showing value than ever. Combining the two is no mean feat, especially when individual technologies are put forward as the single tantalizing answer ...

March 15, 2017

The move to Citrix 7.X is in full swing. This has improved the centralizing of Management and reduction of costs, but End User Experience is becoming top of the business objectives list. However, delivering that is not something to be considered after the upgrade ...

March 14, 2017

As organizations understand the findings of the Cyber Monday Web Performance Index and look to improve their site performance for the next Cyber Monday shopping day, I wanted to offer a few recommendations to help any organization improve in 2017 ...

March 13, 2017

Online retailers stand to make a lot of money on Cyber Monday as long as their infrastructure can keep up with customers. If your company's site goes offline or substantially slows down, you're going to lose sales. And even top ecommerce sites experience performance or stability issues at peak loads, like Cyber Monday, according to Apica's Cyber Monday Web Performance Index ...

March 10, 2017

Applications and infrastructure are being deployed and commissioned at a faster rate than ever before, the number of tools it takes to effectively manage these services is multiplying, and the expectations placed on IT to ensure customer satisfaction is increasing, according to The State of Monitoring 2017 report from BigPanda ...