Root-Cause Analysis of Application Performance Problems
November 12, 2013

Charley Rich
Nastel Technologies

I first came upon the term root-cause analysis (RCA) while working at a network management startup. The concept was to determine why a problem occurred so that repair could happen sooner and service could be restored. Doing this required discovering the topology of a network and its devices in order to understand where a problem could occur and how the various parts related to one another. Monitoring was necessary to identify that a failure had occurred and to provide notification.

However, the challenge was that many failure events are received in seemingly random order, so it is very difficult to tell which events signify symptoms of the problem and which event represents the actual cause. To resolve this, some solutions constructed elaborate causality chains in the hope that you could follow them backwards in time to the root cause. This is akin to following smoke and having it lead you to the fire. It does work, if you do it fast enough, before the whole forest is in flames.
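
As a rough illustration of following a causality chain, here is a minimal Python sketch. It assumes a discovered topology is available as a simple "depends on" map and that the monitoring system has already reported which components failed. The component names and the `root_causes` helper are hypothetical, and real products used far more elaborate models.

```python
def root_causes(depends_on, failed):
    """Return failed components that have no failed dependency.

    depends_on: dict mapping a component to the components it relies on
                (a hypothetical stand-in for a discovered topology).
    failed:     components with failure events, in any arrival order.
    """
    failed = set(failed)
    return sorted(
        node
        for node in failed
        # A failed node with a failed dependency is treated as a symptom.
        if not any(dep in failed for dep in depends_on.get(node, ()))
    )


# Hypothetical topology: web server -> app server -> database/queue -> switch.
depends_on = {
    "web-1": ["app-1"],
    "app-1": ["db-1", "mq-1"],
    "db-1": ["switch-7"],
    "mq-1": ["switch-7"],
}

# Failure events arrive in seemingly random order.
failed = ["web-1", "db-1", "app-1", "switch-7"]

print(root_causes(depends_on, failed))  # prints ['switch-7']
```

The order in which the events arrive does not matter here, only the dependency relationships, which is why topology discovery had to come first.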

The obvious next thing to do was to apply this to applications. It certainly seemed like a good idea at the time ... but it turned out to be much harder than expected. Why harder? Applications are far more complex than networks, with many more variations in behavior and relationships. So instead, monitoring systems were applied to the individual silos of the application architecture, such as web servers, application servers, middleware, databases and others.

For many years the focus of APM was on making the application server run better. And from that perspective, it was successful. However, while the application server became more reliable and ran faster, the two key capabilities IT Operations management desires, getting alerted to problems before the end user is affected and being pointed in the right direction, have not improved much.

Part of the difficulty in a world with a multiplicity of monitoring tools is that there are so many sources of events and so many moving parts. Is the cause capacity, a stuck message, a configuration issue or, even worse, a misunderstanding of business requirements? Perhaps the application is running just fine with all indicators green, but the results aren't what the business expected. Or it works fine for users in one group, but not for those in another. These are very difficult problems to unravel.

An approach Forrester Research suggests is to bring the events from the various sources to a single pane of glass and perform a root-cause analysis there. The suggestion is to use a technology called Complex Event Processing (CEP) to search in real time for patterns of events from multiple sources that together describe a problem.

CEP is very good at identifying situations spanning multiple event streams, correlating the individual events into the "big picture": the situation. An analogous concept in QA is the test case. Think of situations as test cases that occur spontaneously in production. APM is not for the faint of heart.

CEP can tie seemingly unrelated events together into a picture that tells a story: what happened and what triggered it. Because CEP uses rules, it is of course dependent on the quality and completeness of those rules. But that is something that improves over time. A new situation can be described and prevented from ever causing harm again. Without the relationship between the events from the various sources, that would not be possible. We would just be fixing the web server or the database or the application server. With this approach, we are fixing the problem.
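
To make the idea of a rule concrete, here is a minimal Python sketch of a CEP-style rule that fires when a set of events from different sources all occur within a time window. It illustrates the general technique only, not any particular CEP engine. The `Event` and `Rule` types, the `detect` function, and the source and event names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    ts: float      # seconds since some reference point
    source: str    # which monitoring silo reported it (hypothetical names)
    kind: str      # what was observed


@dataclass(frozen=True)
class Rule:
    name: str
    required: frozenset  # set of (source, kind) pairs that define the situation
    window: float        # all required events must fall within this many seconds


def detect(events, rule):
    """Return the timestamps at which the rule's situation is recognized."""
    last_seen = {}  # most recent timestamp for each required (source, kind)
    hits = []
    for ev in sorted(events, key=lambda e: e.ts):
        key = (ev.source, ev.kind)
        if key not in rule.required:
            continue  # event is unrelated to this rule
        last_seen[key] = ev.ts
        all_seen = len(last_seen) == len(rule.required)
        if all_seen and ev.ts - min(last_seen.values()) <= rule.window:
            hits.append(ev.ts)
            last_seen.clear()  # start looking for the next occurrence
    return hits


# Hypothetical rule: a backed-up queue, slow queries and HTTP errors together.
rule = Rule(
    name="order-pipeline-stall",
    required=frozenset({
        ("middleware", "queue_depth_high"),
        ("database", "slow_query"),
        ("webserver", "http_5xx"),
    }),
    window=30.0,
)

events = [
    Event(5.0, "database", "slow_query"),
    Event(0.0, "middleware", "queue_depth_high"),
    Event(3.0, "appserver", "gc_pause"),  # not part of this rule
    Event(8.0, "webserver", "http_5xx"),
]

print(detect(events, rule))  # prints [8.0]
```

Each event on its own would only point at one silo. The rule names the combination, and describing a new situation amounts to adding another `Rule`.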

CEP represents an actionable form of analytics. Because it is inherently a multi-source approach, you can add CEP analytics to your APM, including your currently deployed monitoring solutions. Using it to deliver root-cause analysis can improve your incident management process. It can help you achieve the IT Ops goals of getting alerted to problems before the end user is affected and being pointed in the right direction.

Charley Rich is VP Product Management and Marketing at Nastel Technologies.

Related Links:

For more information on this methodology see the Forrester document:
Technology Spotlight: Application Performance Management And Complex Event Processing

www.nastel.com
