Root-Cause Analysis of Application Performance Problems
November 12, 2013

Charley Rich
Nastel Technologies

I first came upon the term root-cause analysis (RCA) while working at a network management startup. The concept was to determine why a problem occurred so that repair could happen sooner and service could be restored. Doing this required discovering the topology of a network and its devices in order to understand where a problem could occur and how the various parts related to one another. Monitoring was necessary in order to identify that a failure had occurred and to provide notification.

However, the challenge in doing this was that many failure events are received in seemingly random order; thus, it is very difficult to differentiate which events signify symptoms of the problem and which event represents the actual cause. To resolve this, some solutions constructed elaborate causality chains in the hope that you could follow them backwards in time to the "root cause". This is akin to following smoke and having it lead you to the fire. Well, it does work, if you do it fast enough and before the whole forest is in flames.
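One simple way to picture following a causality chain backwards is sketched below. It is a minimal illustration, not how any particular product works: the topology in `DEPENDS_ON`, the component names, and the `root_causes` helper are all hypothetical. A failed component is treated as a symptom if something it depends on has also failed, and as a root-cause candidate otherwise.

```python
# Hypothetical topology: each component maps to the components it depends on.
DEPENDS_ON = {
    "web-server": ["app-server"],
    "app-server": ["database", "message-queue"],
    "database": ["storage"],
    "message-queue": [],
    "storage": [],
}


def root_causes(failed, depends_on):
    """Return the failed components whose own dependencies are all healthy."""
    failed = set(failed)
    return sorted(
        component
        for component in failed
        # A component with a failed dependency is treated as a symptom.
        if not any(dep in failed for dep in depends_on.get(component, []))
    )


# Failure events arrive in arbitrary order, so arrival order alone says
# nothing about which one is the cause.
events = ["web-server", "storage", "app-server", "database"]
print(root_causes(events, DEPENDS_ON))
```

The sketch only works when the discovered topology is accurate and complete, which is exactly the part that proves hard once the same idea is applied to applications.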

The obvious next thing to do was apply this to applications. It certainly seemed like a good idea at the time ... but it turned out to be much harder than expected. Why harder? Applications are far more complex than networks, with many more variations in behavior and relationships. So instead, monitoring systems were applied to the various silos of application architecture, such as web servers, application servers, middleware, databases and others.

For many years the focus of APM was on making the application server run better. And from that perspective, it was successful. However, while the application server became more reliable and ran faster, the two key capabilities IT Operations management desires, getting alerted to problems before the end user is affected and being pointed in the right direction, have not improved much.

Part of the difficulty in a world with this multiplicity of monitoring tools is that there are so many sources of events and so many moving parts. Is the cause capacity, a stuck message, configuration issues or, even worse, a misunderstanding of business requirements? Perhaps the application is running just fine with all indicators green, but the results aren't what the business expected. Or it works fine for users in one group, but not for those in another. These are very difficult problems to unravel.

An approach Forrester Research suggests is to bring the events from the various sources to a single pane of glass and perform a root-cause analysis. The suggestion is to use a technology called Complex Event Processing (CEP) to search in real time for patterns of events from multiple sources that together describe a problem.

CEP is very good at identifying situations that span multiple event streams, correlating the individual events into the "big picture": the situation. This is analogous to the QA concept of test cases. Think of situations as test cases that occur spontaneously in production. APM is not for the faint of heart.

CEP can tie the seemingly unrelated events together into a picture that tells a story: what happened and what triggered it. Because CEP uses rules, it is of course dependent on the quality and completeness of those rules. But that is something that grows ever better over time. A new situation can be described and prevented from ever causing harm again. Without the relationship between the events from the various sources, that would not be possible. We would just be fixing the web server or the database or the application server. With this approach, we are fixing the problem.
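To make the idea of a rule that spans several event sources concrete, here is a minimal sketch in Python. It is not the API of any CEP engine: the `Event`, `Rule` and `Correlator` names, the source and event names, and the 60-second window are all assumptions made for illustration, and the sketch assumes events are fed in timestamp order. A rule matches when every (source, kind) pair it requires has been seen within its time window.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds; events are assumed to be fed in time order
    source: str       # e.g. "web-server", "middleware" (hypothetical names)
    kind: str         # e.g. "http_503", "queue_depth_high" (hypothetical)


@dataclass(frozen=True)
class Rule:
    name: str
    required: frozenset  # (source, kind) pairs that together form a situation
    window: float        # seconds within which all of them must occur


class Correlator:
    """Keeps recent events from all sources and checks rules against them."""

    def __init__(self, rules):
        self.rules = list(rules)
        self.max_window = max(rule.window for rule in self.rules)
        self.recent = deque()

    def feed(self, event):
        """Add one event and return the names of the rules it completes."""
        self.recent.append(event)
        # Drop events that are too old for any rule to use.
        while event.timestamp - self.recent[0].timestamp > self.max_window:
            self.recent.popleft()
        matched = []
        for rule in self.rules:
            seen = {
                (e.source, e.kind)
                for e in self.recent
                if event.timestamp - e.timestamp <= rule.window
            }
            # Fire only when the new event is part of the rule and every
            # required pair is present inside the rule's window.
            if (event.source, event.kind) in rule.required and rule.required <= seen:
                matched.append(rule.name)
        return matched


rule = Rule(
    name="stuck-message-slows-checkout",
    required=frozenset({
        ("middleware", "queue_depth_high"),
        ("app-server", "slow_response"),
        ("web-server", "http_503"),
    }),
    window=60.0,
)

stream = [
    Event(0.0, "database", "cpu_high"),
    Event(10.0, "middleware", "queue_depth_high"),
    Event(25.0, "app-server", "slow_response"),
    Event(40.0, "web-server", "http_503"),
]

correlator = Correlator([rule])
fired = []
for event in stream:
    fired.extend(correlator.feed(event))
print(fired)
```

Each of the three required events, taken alone, looks like a problem in a single silo. The rule only fires once all three appear together, which is what lets the alert name the situation rather than one component. A real deployment would also need to handle out-of-order events and repeated firings, which this sketch leaves out.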

CEP represents an actionable form of analytics. Because it is inherently a multi-source approach, you can add CEP analytics to your APM alongside your currently deployed monitoring solutions. Using it to deliver root-cause analysis can improve your incident management process. It can help you achieve the two IT Ops goals: getting alerted to problems before the end user is affected, and being pointed in the right direction.

Charley Rich is VP Product Management and Marketing at Nastel Technologies.

Related Links:

For more information on this methodology see the Forrester document:
Technology Spotlight: Application Performance Management And Complex Event Processing

www.nastel.com
