Root-Cause Analysis of Application Performance Problems
November 12, 2013

Charley Rich
Nastel Technologies

I first came upon the term root-cause analysis (RCA) while working at a network management startup. The concept was to determine why a problem occurred so that repair could happen sooner and service could be restored. Doing this required discovering the topology of a network and its devices in order to understand where a problem could occur and how the various parts related to one another. Monitoring was necessary to identify that a failure had occurred and to provide notification.

However, the challenge in doing this was that many failure events are received in seemingly random order; thus, it is very difficult to differentiate which events signify symptoms of the problem and which event represents the actual cause. To resolve this, some solutions constructed elaborate causality chains in the hope that you could follow them backwards in time to the "root cause". This is akin to following smoke and having it lead you to the fire. Well, it does work, if you do it fast enough and before the whole forest is in flames.
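The idea of following a causality chain back to its origin can be illustrated with a minimal sketch. It assumes a known dependency topology and a set of components that have reported failures, and it treats as a candidate root cause any failing component that has no failing dependency of its own. The component names and the `root_cause_candidates` helper are hypothetical and not taken from any particular product.

```python
from typing import Dict, List, Set


def root_cause_candidates(
    depends_on: Dict[str, List[str]], failing: Set[str]
) -> Set[str]:
    """Return failing components none of whose dependencies are failing.

    A failing component with a failing dependency is treated as a symptom;
    one without is treated as a candidate root cause.
    """
    return {
        node
        for node in failing
        if not any(dep in failing for dep in depends_on.get(node, []))
    }


# Hypothetical topology: each component maps to what it depends on.
topology = {
    "web-frontend": ["app-server"],
    "app-server": ["database", "message-queue"],
    "database": ["switch-7"],
    "message-queue": ["switch-7"],
    "switch-7": [],
}

# Failure events arrive in no particular order.
events = ["web-frontend", "database", "switch-7", "app-server", "message-queue"]

candidates = root_cause_candidates(topology, set(events))
print(candidates)  # {'switch-7'}
```

The arrival order of the events does not matter here because the topology, not the timing, decides which failure is a symptom and which is a cause. That is also why the approach depends so heavily on discovering the topology first.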

The obvious next thing to do was apply this to applications. It certainly seemed like a good idea at the time ... but it turned out to be much harder than expected. Why harder? Applications are far more complex than networks, with many more variations in behavior and relationships. So instead, monitoring systems were applied to the various silos of application architecture, such as web servers, application servers, middleware, databases and others.

For many years the focus of APM was on making the application server run better. And from that perspective, it was successful. However, while the application server became more reliable and ran faster, the two key capabilities IT Operations management desires, getting alerted to problems before the end user is affected and being pointed in the right direction, have not improved much.

Part of the difficulty in a world with this many monitoring tools is that there are so many sources of events and so many moving parts. Is the cause capacity, a stuck message, configuration issues or, even worse, a misunderstanding of business requirements? Perhaps the application is running just fine with all indicators green, but the results aren't what the business expected. Or it works fine for users in one group, but not for those in another. These are very difficult problems to unravel.

An approach Forrester Research suggests is to bring the events from the various sources to a single pane of glass and perform a root-cause analysis. The suggestion is to use a technology called Complex Event Processing (CEP) to search in real time for patterns, based on events from multiple sources, that together describe a problem.

CEP is very good at identifying situations spanning multiple event streams, correlating the individual events together into the "big picture", the situation. Analogous to this is the concept in QA of test cases. Think of situations as the test cases that occur spontaneously in production. APM is not for the faint of heart.

CEP can tie the seemingly unrelated events together into a picture that tells a story: what happened and what triggered it. CEP, being rule-based, is of course dependent on the quality and completeness of those rules. But that is something that grows ever better over time. A new situation can be described and prevented from ever causing harm again. Without the relationship between the events from the various sources, that would not be possible. We would just be fixing the web server or the database or the application server. With this approach, we are fixing the problem.
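To make the notion of a rule that describes a situation more concrete, here is a minimal sketch of the idea, not a real CEP engine. It assumes events are fed in timestamp order, and it reports a situation when every event kind named in the rule has been seen within a sliding time window, regardless of which monitoring source emitted it. The `Event` and `SituationRule` classes, the event kinds and the source names are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Iterable, List, Optional


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since some reference point
    source: str       # which monitoring silo reported it
    kind: str         # what was observed


class SituationRule:
    """Matches when all required event kinds occur within one time window."""

    def __init__(self, name: str, required_kinds: Iterable[str], window_seconds: float):
        self.name = name
        self.required = frozenset(required_kinds)
        self.window = window_seconds
        self.recent: Deque[Event] = deque()

    def feed(self, event: Event) -> Optional[List[Event]]:
        """Add one event; return the correlated events if the rule matches."""
        self.recent.append(event)
        # Evict events that have fallen out of the window (assumes ordered input).
        while event.timestamp - self.recent[0].timestamp > self.window:
            self.recent.popleft()
        seen = {e.kind for e in self.recent}
        if self.required <= seen:
            matched = [e for e in self.recent if e.kind in self.required]
            self.recent.clear()  # start fresh so one situation is reported once
            return matched
        return None


# Hypothetical rule: three symptoms from three silos that together tell one story.
rule = SituationRule(
    "order-pipeline-stalled",
    ["http_5xx_spike", "queue_depth_high", "thread_pool_exhausted"],
    window_seconds=10.0,
)

stream = [
    Event(0.0, "web-server", "http_5xx_spike"),
    Event(2.0, "database", "slow_query"),  # unrelated to this rule
    Event(4.0, "middleware", "queue_depth_high"),
    Event(5.0, "app-server", "thread_pool_exhausted"),
]

detected = []
for ev in stream:
    match = rule.feed(ev)
    if match is not None:
        detected.append(match)
        print(rule.name, "<-", [f"{e.source}:{e.kind}" for e in match])
```

A production CEP engine would also need to handle out-of-order arrival, event sequencing and many rules at once; the sketch only shows how a single rule spanning several sources turns isolated alerts into one named situation.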

CEP represents an actionable form of analytics. Because it is inherently a multi-source approach, you can add CEP analytics to your APM, including your currently deployed monitoring solutions. Utilizing this and delivering root-cause analysis can improve your incident management process. It can help you achieve the IT Ops goals of getting alerted to problems before the end user is affected and being pointed in the right direction.

Charley Rich is VP Product Management and Marketing at Nastel Technologies.

Related Links:

For more information on this methodology see the Forrester document:
Technology Spotlight: Application Performance Management And Complex Event Processing

www.nastel.com
