Root-Cause Analysis of Application Performance Problems
November 12, 2013

Charley Rich
Nastel Technologies

I first came upon the term root-cause analysis (RCA) while working at a network management startup. The concept was to determine why a problem occurred so that repair could happen sooner and service could be restored. Doing this required discovering the topology of a network and its devices in order to understand where a problem could occur and how the various parts related to one another. Monitoring was necessary in order to identify that a failure had occurred and to provide notification.

However, the challenge was that many failure events are received in seemingly random order; thus, it is very difficult to differentiate which events signify symptoms of the problem and which event represents the actual cause. To resolve this, some solutions constructed elaborate causality chains in the hope that you could follow them backwards in time to the "root cause". This is akin to following smoke and having it lead you to the fire. Well, it does work, if you do it fast enough and before the whole forest is in flames.
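To make the idea of following a causality chain backwards concrete, here is a minimal sketch in Python. The event names and the `CAUSED_BY` map are hypothetical illustrations, not taken from any particular product; a real network management system would derive such links from discovered topology rather than a hand-written table.

```python
# Hypothetical causality map: each symptom event points to the event
# believed to have caused it. All event names are illustrative only.
CAUSED_BY = {
    "web_timeout": "app_server_slow",
    "app_server_slow": "db_connection_pool_exhausted",
    "db_connection_pool_exhausted": "db_disk_full",
}


def find_root_cause(event, caused_by):
    """Walk the chain from a symptom back to the event that has no known cause."""
    seen = set()
    while event in caused_by:
        if event in seen:
            # Guard against a badly built chain that loops back on itself.
            raise ValueError("causality chain contains a cycle")
        seen.add(event)
        event = caused_by[event]
    return event


root = find_root_cause("web_timeout", CAUSED_BY)
```

The walk only works if every link in the chain is already known and correct, which is one reason the approach struggles when events arrive out of order or the map is incomplete.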

The obvious next thing to do was apply this to applications. It certainly seemed like a good idea at the time ... but it turned out to be much harder than expected. Why harder? Applications are far more complex than networks, with many more variations in behavior and relationships. So instead, monitoring systems were applied to the various silos of the application architecture, such as web servers, application servers, middleware, databases and others.

For many years the focus of APM was on making the application server run better. And from that perspective, it was successful. However, while the application server became more reliable and ran faster, the two key capabilities IT Operations management desires, getting alerted to problems before the end user is affected and being pointed in the right direction, have not improved much.

Part of the difficulty in this world of multiple monitoring tools is that there are so many sources of events and so many moving parts. Is the cause capacity, a stuck message, configuration issues or, even worse, a misunderstanding of business requirements? Perhaps the application is running just fine with all indicators green, but the results aren't what the business expected. Or it works fine for users in one group, but not for those in another. These are very difficult problems to unravel.

An approach Forrester Research suggests is to bring the events from the various sources to a single pane of glass and perform a root-cause analysis. The suggestion is to use a technology called Complex Event Processing (CEP) to search in real time for patterns, based on events from multiple sources, that together describe a problem.

CEP is very good at identifying situations spanning multiple event streams, correlating the individual events together into the "big picture", the situation. Analogous to this is the concept in QA of test cases. Think of situations as the test cases that occur spontaneously in production. APM is not for the faint of heart.

CEP can tie the seemingly unrelated events together into a picture that tells a story: what happened and what triggered it. CEP, using rules, is of course dependent on the quality and completeness of those rules. But that is something that grows ever better over time. A new situation can be described and prevented from ever causing harm again. Without the relationship between the events from the various sources, that would not be possible. We would just be fixing the web server or the database or the application server. With this approach, we are fixing the problem.
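As a rough illustration of how a rule can tie events from several sources into one situation, here is a minimal sketch in plain Python. It is not the API of any CEP product: the `Event` tuple, the `RULE` dictionary, the event names and the 60-second window are all assumptions made for this example, and treating the earliest matching event as the likely trigger is a simplifying heuristic.

```python
from collections import deque, namedtuple

# Hypothetical event shape: timestamp in seconds, originating silo, event kind.
Event = namedtuple("Event", ["timestamp", "source", "kind"])

# Hypothetical rule: a situation exists when all of these events, from
# three different monitoring sources, occur within one time window.
RULE = {
    "name": "queue_backlog_slows_checkout",
    "required": {
        ("middleware", "queue_depth_high"),
        ("app_server", "thread_pool_saturated"),
        ("web_server", "response_slow"),
    },
    "window_seconds": 60,
}


def detect_situations(events, rule):
    """Return (rule name, earliest matching event) for each completed pattern."""
    window = deque()
    situations = []
    # Events may arrive in any order, so order them by time first.
    for event in sorted(events, key=lambda e: e.timestamp):
        if (event.source, event.kind) not in rule["required"]:
            continue  # not relevant to this rule
        window.append(event)
        # Drop relevant events that have fallen out of the time window.
        while event.timestamp - window[0].timestamp > rule["window_seconds"]:
            window.popleft()
        seen = {(e.source, e.kind) for e in window}
        if seen == rule["required"]:
            # Assumption: the earliest event in the window is the likely trigger.
            situations.append((rule["name"], window[0]))
            window.clear()
    return situations


sample_events = [
    Event(105, "web_server", "response_slow"),
    Event(100, "middleware", "queue_depth_high"),
    Event(102, "database", "slow_query"),
    Event(103, "app_server", "thread_pool_saturated"),
    Event(500, "web_server", "response_slow"),
]
found = detect_situations(sample_events, RULE)
```

With the sample data, the three required events between timestamps 100 and 105 form one situation, while the isolated slow response at 500 does not match on its own. Describing a newly observed problem then amounts to adding another rule.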

CEP represents an actionable form of analytics. You can add CEP analytics to your APM, including your currently deployed monitoring solutions, as it is inherently a multi-source approach. Utilizing this and delivering root-cause analysis can improve your incident management process. It can help you achieve the IT Ops goals of getting alerted to problems before the end user is affected and being pointed in the right direction.

Charley Rich is VP Product Management and Marketing at Nastel Technologies.

Related Links:

For more information on this methodology see the Forrester document:
Technology Spotlight: Application Performance Management And Complex Event Processing

www.nastel.com
