Root-Cause Analysis of Application Performance Problems
November 12, 2013

Charley Rich
Nastel Technologies


I first came upon the term root-cause analysis (RCA) while working at a network management startup. The concept was to determine why a problem occurred so that repair could happen sooner and service could be restored. Doing this required discovering the topology of a network and its devices in order to understand where a problem could occur and how the various parts relate to one another. Monitoring was necessary to identify that a failure had occurred and to provide notification.

However, the challenge was that many failure events are received in seemingly random order; thus, it is very difficult to differentiate which events signify symptoms of the problem and which event represents the actual cause. To resolve this, some solutions constructed elaborate causality chains in the hope that you could follow them backwards in time to the "root cause". This is akin to following smoke and having it lead you to the fire. Well, it does work, if you do it fast enough and before the whole forest is in flames.
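As a rough illustration of following the smoke back to the fire, the sketch below uses a discovered topology to separate symptoms from likely causes: a failed component is treated as a root-cause candidate only if none of the components it depends on has also failed. The topology, the component names and the `root_causes` function are all hypothetical, and real products use far richer causality models than this.

```python
def root_causes(depends_on, failed):
    """Return failed components none of whose dependencies have also failed."""
    failed = set(failed)
    return sorted(
        component
        for component in failed
        if not any(dep in failed for dep in depends_on.get(component, ()))
    )


# Hypothetical topology: each component maps to the components it depends on.
topology = {
    "web": ["app"],
    "app": ["db", "mq"],
    "db": ["switch-1"],
    "mq": ["switch-1"],
    "switch-1": [],
}

# Failure events arrive in seemingly random order.
failure_events = ["web", "db", "switch-1", "app"]

print(root_causes(topology, failure_events))  # ['switch-1']
```

Here the web, application and database failures are all explained by a failed dependency, so only the switch is reported.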

The obvious next thing to do was apply this to applications. It certainly seemed like a good idea at the time ... but it turned out to be much harder than expected. Why harder? Applications are far more complex than networks, with many more variations in behavior and relationships. So instead, monitoring systems were applied to the various silos of the application architecture, such as web servers, application servers, middleware, databases and others.

For many years the focus of APM was on making the application server run better. And from that perspective, it was successful. However, while the application server became more reliable and ran faster, the two key capabilities IT Operations management desires, getting alerted to problems before the end user is affected and being pointed in the right direction, have not improved much.

Part of the difficulty in this world of multiple monitoring tools is that there are so many sources of events and so many moving parts. Is the cause capacity, a stuck message, configuration issues or, even worse, a misunderstanding of business requirements? Perhaps the application is running just fine with all indicators green, but the results aren't what the business expected. Or it works fine for users in one group, but not for another. These are very difficult problems to unravel.

An approach Forrester Research suggests is to bring the events from the various sources to a single pane of glass and perform a root-cause analysis. The suggestion is to use a technology called Complex Event Processing (CEP) to search in real time for patterns of events from multiple sources that together describe a problem.

CEP is very good at identifying situations spanning multiple event streams, correlating the individual events into the "big picture": the situation. Analogous to this is the concept of test cases in QA. Think of situations as the test cases that occur spontaneously in production. APM is not for the faint of heart.

CEP can tie seemingly unrelated events together into a picture that tells a story: what happened and what triggered it. CEP, being rule-based, is of course dependent on the quality and completeness of those rules. But that is something that grows better over time. A new situation can be described and prevented from ever causing harm again. Without the relationship between the events from the various sources, that would not be possible. We would just be fixing the web server or the database or the application server. With this approach, we are fixing the problem.
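To make the idea of a rule describing a situation more concrete, here is a minimal sketch of one such rule, assuming events arrive in timestamp order as (timestamp, source, event type) records. The rule fires when every required event, each from a different monitoring source, has been seen within a sliding time window. The `SituationRule` class, the source names and the event types are hypothetical illustrations, not the API of any particular CEP product.

```python
from collections import deque


class SituationRule:
    """Fires when all required (source, event_type) pairs occur within a time window."""

    def __init__(self, name, required, window_seconds):
        self.name = name
        self.required = set(required)
        self.window = window_seconds
        self.recent = deque()  # (timestamp, source, event_type), oldest first

    def feed(self, timestamp, source, event_type):
        """Add one event; return the matching events if the situation is detected, else None."""
        self.recent.append((timestamp, source, event_type))
        # Drop events that have fallen out of the sliding window.
        while self.recent and timestamp - self.recent[0][0] > self.window:
            self.recent.popleft()
        seen = {(src, evt) for _, src, evt in self.recent}
        if self.required <= seen:
            matched = [e for e in self.recent if (e[1], e[2]) in self.required]
            self.recent.clear()  # avoid reporting the same situation twice
            return matched
        return None


# Hypothetical situation spanning three monitoring silos.
rule = SituationRule(
    "stuck-order-flow",
    required=[
        ("middleware", "queue_depth_high"),
        ("database", "lock_wait"),
        ("app_server", "slow_response"),
    ],
    window_seconds=60,
)

stream = [
    (0, "web_server", "http_500"),
    (10, "middleware", "queue_depth_high"),
    (25, "database", "lock_wait"),
    (40, "app_server", "slow_response"),
]

detections = []
for ts, source, event_type in stream:
    matched = rule.feed(ts, source, event_type)
    if matched:
        detections.append((rule.name, matched))

for name, events in detections:
    print(name, events)
```

Each new situation encountered in production would become another rule of this kind, which is how the rule set grows more complete over time.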

CEP represents an actionable form of analytics. You can add CEP analytics to your APM, including your currently deployed monitoring solutions, as it is inherently a multi-source approach. Utilizing this and delivering root-cause analysis can improve your incident management process. It can help you achieve the IT Ops goals of getting alerted to problems before the end user is affected and being pointed in the right direction.

Charley Rich is VP Product Management and Marketing at Nastel Technologies.

Related Links:

For more information on this methodology see the Forrester document:
Technology Spotlight: Application Performance Management And Complex Event Processing

www.nastel.com
