Root-Cause Analysis of Application Performance Problems
November 12, 2013

Charley Rich
Nastel Technologies

I first came upon the term root-cause analysis (RCA) while working at a network management startup. The concept was to determine why a problem occurred so that repair could happen sooner and service could be restored. Doing this required discovering the topology of a network and its devices in order to understand where a problem could occur and how the various parts related to one another. Monitoring was necessary to detect that a failure had occurred and to provide notification.

However, the challenge was that many failure events are received in seemingly random order, so it is very difficult to tell which events signify symptoms of the problem and which event represents the actual cause. To resolve this, some solutions constructed elaborate causality chains in the hope that you could follow them backwards in time to the "root cause". This is akin to following smoke and having it lead you to the fire. It does work, if you do it fast enough and before the whole forest is in flames.
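To make the idea of following a causality chain backwards concrete, here is a minimal sketch in Python. It is only an illustration: the event names and the `CAUSED_BY` map are hypothetical, and it assumes the hard part (knowing which event caused which) has already been solved by topology discovery.

```python
from typing import Dict, List, Optional

# Hypothetical causality map: each event points at the event believed
# to have caused it. None marks an event with no known cause.
CAUSED_BY: Dict[str, Optional[str]] = {
    "web_timeout": "app_server_slow",
    "app_server_slow": "db_pool_exhausted",
    "db_pool_exhausted": "switch_port_down",
    "switch_port_down": None,
}


def trace_root_cause(event: str, caused_by: Dict[str, Optional[str]]) -> List[str]:
    """Follow the chain from a symptom back to the earliest known cause.

    Returns the chain in the order it was walked: symptom first,
    suspected root cause last.
    """
    chain = [event]
    seen = {event}
    while True:
        parent = caused_by.get(chain[-1])
        # Stop when there is no recorded cause, or when the map loops.
        if parent is None or parent in seen:
            return chain
        chain.append(parent)
        seen.add(parent)


chain = trace_root_cause("web_timeout", CAUSED_BY)
print(" <- ".join(chain))
```

The walk itself is trivial; the difficulty described above lies in building a trustworthy map while events arrive out of order.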

The obvious next thing to do was apply this to applications. It certainly seemed like a good idea at the time ... but it turned out to be much harder than expected. Why harder? Applications are far more complex than networks, with many more variations in behavior and relationships. So instead, monitoring systems were applied to the various silos of the application architecture, such as web servers, application servers, middleware, databases and others.

For many years the focus of APM was on making the application server run better. And from that perspective, it was successful. However, while the application server became more reliable and ran faster, the two capabilities IT Operations management wants most, getting alerted to problems before the end user is affected and being pointed in the right direction, have not improved much.

Part of the difficulty in a world with this multiplicity of monitoring tools is that there are so many sources of events and so many moving parts. Is the cause capacity, a stuck message, a configuration issue or, even worse, a misunderstanding of business requirements? Perhaps the application is running just fine with all indicators green, but the results aren't what the business expected. Or it works fine for users in one group but not for those in another. These are very difficult problems to unravel.

An approach Forrester Research suggests is to bring the events from the various sources to a single pane of glass and perform a root-cause analysis there. The suggestion is to use a technology called Complex Event Processing (CEP) to search in real time for patterns of events from multiple sources that together describe a problem.

CEP is very good at identifying situations spanning multiple event streams, correlating the individual events into the "big picture": the situation. This is analogous to the concept of test cases in QA. Think of situations as the test cases that occur spontaneously in production. APM is not for the faint of heart.

CEP can tie the seemingly unrelated events together into a picture that tells a story: what happened and what triggered it. Because CEP uses rules, it is of course dependent on the quality and completeness of those rules, but that is something that improves over time. A new situation can be described and prevented from ever causing harm again. Without the relationship between the events from the various sources, that would not be possible. We would just be fixing the web server or the database or the application server. With this approach, we are fixing the problem.
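As a rough illustration of what such a rule might look like, the Python sketch below correlates events from several monitoring sources inside a sliding time window. It is not how any particular CEP product works: the `Event` and `SituationRule` names, the source and event names, and the 10-second window are all assumptions made for the example, and it assumes events are fed in timestamp order.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds; assumed non-decreasing as events arrive
    source: str       # which monitoring silo reported it, e.g. "db"
    kind: str         # what was reported, e.g. "slow_query"


class SituationRule:
    """Fires when every required (source, kind) pair is seen within the window."""

    def __init__(self, name: str,
                 required: FrozenSet[Tuple[str, str]],
                 window_seconds: float) -> None:
        self.name = name
        self.required = required
        self.window_seconds = window_seconds
        self._recent: Deque[Event] = deque()

    def observe(self, event: Event) -> Optional[str]:
        """Add one event; return the situation name if the pattern is complete."""
        self._recent.append(event)
        # Discard events that fell out of the window ending at this event.
        cutoff = event.timestamp - self.window_seconds
        while self._recent and self._recent[0].timestamp < cutoff:
            self._recent.popleft()
        seen = {(e.source, e.kind) for e in self._recent}
        return self.name if self.required <= seen else None


# Hypothetical rule: three silos each report a symptom within 10 seconds.
rule = SituationRule(
    name="order_flow_stalled",
    required=frozenset({
        ("mq", "queue_depth_high"),
        ("db", "slow_query"),
        ("web", "http_5xx"),
    }),
    window_seconds=10.0,
)

stream = [
    Event(0.0, "mq", "queue_depth_high"),
    Event(5.0, "db", "slow_query"),
    Event(8.0, "web", "http_5xx"),
]
for ev in stream:
    situation = rule.observe(ev)
    if situation:
        print(f"situation detected at t={ev.timestamp}: {situation}")
```

Each silo's tool would see only one of these three events. The rule matters because it names the combination, which is the point made above about fixing the problem rather than the individual server.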

CEP represents an actionable form of analytics. Because it is inherently a multi-source approach, you can add CEP analytics to your APM, including your currently deployed monitoring solutions. Using it to deliver root-cause analysis can improve your incident management process. It can help you achieve the IT Ops goals of getting alerted to problems before the end user is affected and being pointed in the right direction.

Charley Rich is VP Product Management and Marketing at Nastel Technologies.

Related Links:

For more information on this methodology see the Forrester document:
Technology Spotlight: Application Performance Management And Complex Event Processing

www.nastel.com
