Root-Cause Analysis of Application Performance Problems
November 12, 2013

Charley Rich
Nastel Technologies


I first came upon the term root-cause analysis (RCA) while working at a network management startup. The concept was to determine why a problem occurred so that it could be repaired sooner and service restored. Doing this required discovering the topology of the network and its devices in order to understand where a problem could occur and how the various parts related to one another. Monitoring was necessary to detect that a failure had occurred and to provide notification.

However, the challenge was that many failure events are received in seemingly random order, so it is very difficult to differentiate which events signify symptoms of the problem and which event represents the actual cause. To resolve this, some solutions constructed elaborate causality chains in the hope that you could follow them backwards in time to the "root cause". This is akin to following smoke and having it lead you to the fire. It does work, if you do it fast enough and before the whole forest is in flames.
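As a rough illustration of following a causality chain backwards, here is a minimal Python sketch. It assumes a known dependency topology and a set of components that are currently raising failure events; the component names and the `root_cause_candidates` helper are hypothetical and not taken from any particular product. The rule it applies is simple: a failing component whose dependencies are all healthy is a candidate root cause, while a failing component that depends on another failing component is treated as a symptom.

```python
def root_cause_candidates(depends_on, failing):
    """Return failing components that have no failing dependency.

    depends_on maps each component to the components it relies on.
    failing is the collection of components currently reporting failures.
    """
    failing = set(failing)
    return sorted(
        component
        for component in failing
        if not any(dep in failing for dep in depends_on.get(component, ()))
    )


# Hypothetical topology: the web tier calls the app tier, which uses a
# database and a message queue.
depends_on = {
    "web": ["app"],
    "app": ["db", "mq"],
    "db": [],
    "mq": [],
}

# Failure events arrive in arbitrary order.
alerts = ["web", "db", "app"]

candidates = root_cause_candidates(depends_on, alerts)
print(candidates)
```

With this topology the web and app alerts are classified as symptoms, because each depends on something else that is failing, and only the database remains as a candidate. Real environments are messier: topology data is often incomplete, and timing matters as much as structure.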

The obvious next thing to do was apply this to applications. It certainly seemed like a good idea at the time ... but it turned out to be much harder than expected. Why harder? Applications are far more complex than networks, with many more variations in behavior and relationship. So instead, monitoring systems were applied to the various silos of the application architecture, such as web servers, application servers, middleware, databases and others.

For many years the focus of APM was on making the application server run better. And from that perspective, it was successful. However, while the application server became more reliable and ran faster, the two key capabilities IT Operations management desires, getting alerted to problems before the end user is affected and being pointed in the right direction, have not improved much.

Part of the difficulty in this world of multiple monitoring tools is that there are so many sources of events and so many moving parts. Is the cause capacity, a stuck message, a configuration issue or, even worse, a misunderstanding of business requirements? Perhaps the application is running just fine with all indicators green, but the results aren't what the business expected. Or it works fine for users in one group, but not for another. These are very difficult problems to unravel.

An approach Forrester Research suggests is to bring the events from the various sources to a single pane of glass and perform a root-cause analysis. The suggestion is to use a technology called Complex Event Processing (CEP) to search in real time for patterns, based on events from multiple sources, that together describe a problem.

CEP is very good at identifying situations spanning multiple event streams, correlating the individual events together into the "big picture", the situation. Analogous to this is the concept in QA of test cases. Think of situations as the test cases that occur spontaneously in production. APM is not for the faint of heart.

CEP can tie the seemingly unrelated events together into a picture that tells a story: what happened and what triggered it. Because CEP uses rules, it is of course dependent on the quality and completeness of those rules. But that is something that grows better over time. A new situation can be described and prevented from ever causing harm again. Without the relationship between the events from the various sources, that would not be possible. We would just be fixing the web server or the database or the application server. With this approach, we are fixing the problem.
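To make the idea of a rule spanning multiple event streams more concrete, here is a minimal Python sketch. It is not how any particular CEP engine works; the `Event` type, the `detect_situations` function, the event names and the 30-second window are all assumptions chosen for illustration. The rule says: when a given set of (source, kind) events all occur within one time window, report them together as a single situation.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds
    source: str       # monitoring silo that emitted the event
    kind: str         # what was observed


def detect_situations(events, required, window_seconds):
    """Return groups of events in which every required (source, kind)
    pair occurred within window_seconds of each other."""
    recent = deque()
    situations = []
    for event in sorted(events, key=lambda e: e.timestamp):
        recent.append(event)
        # Drop events that have fallen out of the time window.
        while event.timestamp - recent[0].timestamp > window_seconds:
            recent.popleft()
        seen = {(e.source, e.kind) for e in recent}
        if required <= seen:
            # Everything still in the window is reported as context.
            situations.append(list(recent))
            recent.clear()
    return situations


# Hypothetical rule: slow queries, a growing queue and HTTP errors,
# all within 30 seconds, describe one situation.
rule = {
    ("db", "slow_query"),
    ("mq", "queue_depth_high"),
    ("web", "http_5xx"),
}

stream = [
    Event(8.0, "web", "http_5xx"),
    Event(0.0, "db", "slow_query"),
    Event(5.0, "mq", "queue_depth_high"),
    Event(300.0, "web", "http_5xx"),
]

found = detect_situations(stream, rule, window_seconds=30)
print(len(found), [e.source for e in found[0]])
```

The first three events arrive out of order but fall inside one window, so they are reported together as one situation; the later, isolated web error matches nothing. A production rule set would also need to handle duplicates, missing events and clock skew between sources.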

CEP represents an actionable form of analytics. Because it is inherently a multi-source approach, you can add CEP analytics to your APM, including your currently deployed monitoring solutions. Using it to deliver root-cause analysis can improve your incident management process. It can help you achieve the IT Ops goals of getting alerted to problems before the end user is affected and being pointed in the right direction.

Charley Rich is VP Product Management and Marketing at Nastel Technologies.

Related Links:

For more information on this methodology see the Forrester document:
Technology Spotlight: Application Performance Management And Complex Event Processing

www.nastel.com
