Root-Cause Analysis of Application Performance Problems
November 12, 2013

Charley Rich
Nastel Technologies

Share this

I first came upon the term root-cause analysis (RCA) while working at a network management startup. The concept was to determine why a problem occurred so that repair could happen sooner and service restored. To do this required a discovery of the topology of a network and its devices in order to understand where a problem could occur and the relationship between the various parts. Monitoring was necessary in order to identify that a failure occurred and provide notification.

However, the challenge in doing this was that many failure events are received in seemingly random order; thus, it is very difficult to differentiate which events signified symptoms of the problem and which event represented the actual cause. To resolve this, some solutions constructed elaborate causality chains in the hope you could follow them backwards in time to the "root-cause". This is akin to following smoke and having it lead you to the fire. Well it does work, if you do it fast enough and before the whole forest is in flames.

The obvious next thing to do was apply this to applications. It certainly seemed like a good idea at the time ... but it turned out to be much harder than expected. Why harder? Applications are far more complex than networks with many more variations in behavior and relationship. So, instead monitoring systems were applied to the various silos of application architecture such as web servers, application servers, middleware, databases and others.

For many years the focus of APM was on making the application server run better. And from that perspective, it was successful. However, while the application server became more reliable and ran faster, the two key features IT Operations management desire: getting alerted to problems before the end user is affected and being pointed in the right direction have not improved much.

Part of the difficulty in this sort of multiplicity of monitoring tools world is that there so many sources of events and so many moving parts. Is the cause capacity, a stuck message, configuration issues or even worse a misunderstanding of business requirements? Perhaps, the application is running just fine with all indicators green, but the results aren't what the business expected. Or it works fine for users in one group, but not for another. These are very difficult problems to unravel.

An approach Forrester Research suggests is to bring the events from the various sources to a single pain of glass and perform a root-cause analysis. The suggestion is made to use a technology called Complex Event Processing (CEP) to search in real-time for patterns based on events from multiple sources that together describe a problem.

CEP is very good at identifying situations spanning multiple event streams, correlating the individual events together into the "big picture", the situation. Analogous to this is the concept in QA of test cases. Think of situations as the test cases that occur spontaneously in production. APM is not for the faint of heart.

CEP can tie the seemingly unrelated events together into a picture that tells a story, what happened and what triggered it. CEP, using rules is of course dependent on the quality and completeness of those rules. But, that is something that grows ever better over time. A new situation can be described and prevented from ever causing harm again. Without the relationship between the events from the various sources, that would not be possible. We would just be fixing the web server or the database or the application server. With this approach, we are fixing the problem.

CEP represents an actionable form of analytics. You can add CEP analytics to your APM including your currently deployed monitoring solutions as it is inherently a multi-source approach. Utilizing this and delivering root-cause analysis can improve your incident management process. It can help you achieve the IT Ops goals of: getting alerted to problems before the end user is affected and being pointed in the right direction.

Charley Rich is VP Product Management and Marketing at Nastel Technologies.

Related Links:

For more information on this methodology see the Forrester document:
Technology Spotlight: Application Performance Management And Complex Event Processing

Share this

The Latest

January 17, 2017

Choosing an application performance monitoring (APM) solution can be a daunting task. A quick Google search will show popular products, but there's also a long list of less-well-known open source products available, too. So how do you choose the right solution? ...

January 13, 2017

Digital transformation is a key initiative for enterprises that want to reach new customers and offer greater value via technology. Changing user expectations, new modes of engagement and the need to improve responsiveness are the main factors driving companies to update outdated processes and develop new applications as part of a digital transformation strategy. But in order to deliver on the promise of digital transformation, organizations must also modernize their IT infrastructure to support speed, scale and change ...

January 12, 2017

Digital transformation is evolving the enterprise to one in which high performance applications are now the norm as organizations use video, graphics and other information intensive multimedia to populate these new channels of engagement. Digital technologies, and high performance applications, create further pressure on IT staffs which are grappling with PCs that are past their optimum performance. As a result, IT is looking at alternatives to swapping out PCs and investing in more costly equipment that will inevitably have an expiration date. One solution is to build on virtualization solutions that incorporate high-performance thin clients ...

January 11, 2017

If your business depends on mission-critical web or legacy applications, then monitoring how your end users interact with your applications is critical. Most monitoring solutions try to infer the end-user experience based on resource utilization. However, resource utilization cannot provide meaningful results on how the end-user is experiencing an interaction with an application. The true measurement of end-user experience is availability and response time of the application, end-to-end and hop-by-hop ...

January 10, 2017

There's nothing like a major web outage to remind us how much our applications rely on other web services and technologies to function. In late October of last year, a Distributed Denial of Service (DDoS) attack on Dyn, one of the largest Domain Name Service (DNS) providers on the internet, disrupted service for consumer and business applications across the web. This attack shed light on the delicate interdependent nature of the web as productivity and uptime across the world was effected ...

January 09, 2017

As an IT professional, I'm used to words that mean different things to different people. For example, "log monitoring" could mean anything from simple text files to logfile aggregation systems. "Uptime" is also notoriously hard to nail down. Heck, even the word "monitoring" itself can be obscure. This is why I'm not surprised that application performance monitoring (APM) can mean so many different things depending on the context ...

January 06, 2017

Big data continues to be the fastest-growing segment of the information management software market. New findings released by Ovum estimate that the big data market will grow from $1.7bn in 2016 to $9.4bn by 2020, comprising 10% of the overall market for information management tooling ...

January 05, 2017

Apica highlights the following trends and predictions for 2017, covering the API Economy, ecommerce, analytics and more ...

January 04, 2017

IDC has been chronicling the emergence and evolution of the 3rd Platform, built on cloud, mobile, big data/analytics, and social technologies, for nearly a decade. Over the past several years, the adoption of these technologies has accelerated as enterprises commit to the 3rd Platform and undergo digital transformation (DX) on a massive scale. Looking forward, IDC predicts that digital transformation will attain macroeconomic scale over the next three to four years, changing the way enterprises operate and reshaping the global economy. This is the dawn of the "DX Economy" ...

January 03, 2017

Riverbed highlights the following trends and predictions for 2017, covering digital transformation, software-defined everything and more ...