Root-Cause Analysis of Application Performance Problems
November 12, 2013

Charley Rich
Nastel Technologies


I first came upon the term root-cause analysis (RCA) while working at a network management startup. The concept was to determine why a problem occurred so that repair could happen sooner and service could be restored. Doing this required discovering the topology of a network and its devices in order to understand where a problem could occur and how the various parts related to one another. Monitoring was necessary in order to identify that a failure had occurred and to provide notification.

However, the challenge was that many failure events are received in seemingly random order; thus, it is very difficult to differentiate which events signify symptoms of the problem and which event represents the actual cause. To resolve this, some solutions constructed elaborate causality chains in the hope that you could follow them backwards in time to the "root cause". This is akin to following smoke and having it lead you to the fire. Well, it does work, if you do it fast enough and before the whole forest is in flames.
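The idea of following a causality chain backwards can be sketched roughly as below. This is only an illustration, not how any particular product works: the event names and the `CAUSED_BY` mapping are hypothetical, and real tools would have to infer these links from topology and timing rather than being handed them.

```python
from typing import Dict, Optional

# Hypothetical causality links: each event maps to the event assumed to
# have caused it, or None when no earlier cause is known.
CAUSED_BY: Dict[str, Optional[str]] = {
    "checkout_timeout": "app_server_slow",
    "app_server_slow": "db_connection_pool_exhausted",
    "db_connection_pool_exhausted": "switch_port_down",
    "switch_port_down": None,
}


def find_root_cause(event: str, caused_by: Dict[str, Optional[str]]) -> str:
    """Walk the chain backwards from a symptom until no earlier cause is known."""
    seen = set()
    current = event
    while caused_by.get(current) is not None:
        if current in seen:
            # A loop in the links means the chain cannot be trusted.
            raise ValueError(f"causality cycle detected at {current!r}")
        seen.add(current)
        current = caused_by[current]
    return current
```

Calling `find_root_cause("checkout_timeout", CAUSED_BY)` walks from the user-visible symptom back to `"switch_port_down"`. The hard part in practice is building the links quickly enough, which is the "before the whole forest is in flames" problem.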

The obvious next thing to do was apply this to applications. It certainly seemed like a good idea at the time ... but it turned out to be much harder than expected. Why harder? Applications are far more complex than networks, with many more variations in behavior and relationships. So, instead, monitoring systems were applied to the various silos of the application architecture, such as web servers, application servers, middleware, databases and others.

For many years the focus of APM was on making the application server run better. And from that perspective, it was successful. However, while the application server became more reliable and ran faster, the two key capabilities IT Operations management desires, getting alerted to problems before the end user is affected and being pointed in the right direction, have not improved much.

Part of the difficulty in a world with this multiplicity of monitoring tools is that there are so many sources of events and so many moving parts. Is the cause capacity, a stuck message, configuration issues or, even worse, a misunderstanding of business requirements? Perhaps the application is running just fine with all indicators green, but the results aren't what the business expected. Or it works fine for users in one group, but not for those in another. These are very difficult problems to unravel.

An approach Forrester Research suggests is to bring the events from the various sources to a single pane of glass and perform a root-cause analysis. The suggestion is to use a technology called Complex Event Processing (CEP) to search in real time for patterns based on events from multiple sources that together describe a problem.

CEP is very good at identifying situations spanning multiple event streams, correlating the individual events together into the "big picture", the situation. Analogous to this is the concept in QA of test cases. Think of situations as the test cases that occur spontaneously in production. APM is not for the faint of heart.
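As a rough illustration of what "a situation spanning multiple event streams" might look like in code, here is a minimal sketch of a single correlation rule. It is not a real CEP engine or any vendor's API: the `Event` type, the event kinds, and the `detect_situation` function are all hypothetical, and the rule simply asks whether every required kind of event was seen within one time window.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since some common reference point
    source: str       # hypothetical monitoring source, e.g. "web", "mq", "db"
    kind: str         # hypothetical event kind, e.g. "http_5xx"


def detect_situation(
    events: Iterable[Event],
    required_kinds: Set[str],
    window_seconds: float,
) -> List[Event]:
    """Return the first group of events covering every required kind
    within window_seconds of each other, or an empty list if none exists."""
    window: List[Event] = []
    # Events may arrive out of order, so sort by time before correlating.
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.kind not in required_kinds:
            continue  # irrelevant to this rule
        window.append(event)
        # Keep only events that are still inside the sliding time window.
        window = [
            e for e in window
            if event.timestamp - e.timestamp <= window_seconds
        ]
        if {e.kind for e in window} == set(required_kinds):
            return window
    return []
```

For example, a rule requiring `{"http_5xx", "queue_depth_high", "slow_query"}` within 30 seconds would fire only when the web server, the messaging layer and the database all report trouble close together, which is the "big picture" that no single silo's monitor would show on its own.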

CEP can tie the seemingly unrelated events together into a picture that tells a story: what happened and what triggered it. CEP, using rules, is of course dependent on the quality and completeness of those rules. But that is something that grows ever better over time. A new situation can be described and prevented from ever causing harm again. Without the relationship between the events from the various sources, that would not be possible. We would just be fixing the web server or the database or the application server. With this approach, we are fixing the problem.

CEP represents an actionable form of analytics. You can add CEP analytics to your APM, including your currently deployed monitoring solutions, as it is inherently a multi-source approach. Utilizing this and delivering root-cause analysis can improve your incident management process. It can help you achieve the IT Ops goals of getting alerted to problems before the end user is affected and being pointed in the right direction.

Charley Rich is VP Product Management and Marketing at Nastel Technologies.

Charley has over 28 years of technical, hands-on experience working with large-scale customers to meet their application and systems management requirements. Prior to joining Nastel, Charley was Product Manager for IBM's Tivoli Application Dependency Discovery Manager software, where he co-authored an IBM Redbook, charted the product roadmap, managed an agile requirements process and was recognized for his accomplishments by winning the Tivoli General Manager's Award. Recently, Charley was granted a patent for an Application Discovery and Monitoring process.