Root-Cause Analysis of Application Performance Problems
November 12, 2013

Charley Rich
Nastel Technologies

I first came upon the term root-cause analysis (RCA) while working at a network management startup. The concept was to determine why a problem occurred so that repair could happen sooner and service could be restored. Doing this required discovering the topology of a network and its devices in order to understand where a problem could occur and how the various parts relate to one another. Monitoring was necessary to identify that a failure had occurred and to provide notification.

However, the challenge was that many failure events are received in seemingly random order; thus, it is very difficult to differentiate which events signify symptoms of the problem and which event represents the actual cause. To resolve this, some solutions constructed elaborate causality chains in the hope that you could follow them backwards in time to the "root cause". This is akin to following smoke and having it lead you to the fire. Well, it does work, if you do it fast enough and before the whole forest is in flames.
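Following a causality chain backwards can be pictured with a minimal sketch. It assumes the chain is already known and stored as a simple mapping from each event to the events that caused it; the event names and the `find_root_causes` helper are hypothetical and not taken from any particular product.

```python
from typing import Dict, List


def find_root_causes(caused_by: Dict[str, List[str]], symptom: str) -> List[str]:
    """Follow 'caused by' links backwards from a symptom.

    An event with no recorded cause is treated as a root cause.
    """
    roots: List[str] = []
    stack = [symptom]
    seen = set()
    while stack:
        event = stack.pop()
        if event in seen:
            continue  # guard against cycles in the chain
        seen.add(event)
        causes = caused_by.get(event, [])
        if causes:
            stack.extend(causes)  # keep walking backwards
        else:
            roots.append(event)  # nothing earlier is known: a root
    return sorted(roots)


# Hypothetical causality chain: symptom -> list of direct causes.
caused_by = {
    "web_timeout": ["app_server_slow"],
    "queue_backlog": ["app_server_slow"],
    "app_server_slow": ["db_connection_pool_exhausted"],
    "db_connection_pool_exhausted": ["disk_full"],
}

print(find_root_causes(caused_by, "web_timeout"))  # ['disk_full']
```

The walk itself is the easy part. The hard part, as noted above, is building a trustworthy chain from events that arrive in no particular order.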

The obvious next thing to do was apply this to applications. It certainly seemed like a good idea at the time ... but it turned out to be much harder than expected. Why harder? Applications are far more complex than networks, with many more variations in behavior and relationships. So instead, monitoring systems were applied to the various silos of application architecture, such as web servers, application servers, middleware, databases and others.

For many years the focus of APM was on making the application server run better. And from that perspective, it was successful. However, while the application server became more reliable and ran faster, the two key capabilities IT Operations management desires, getting alerted to problems before the end user is affected and being pointed in the right direction, have not improved much.

Part of the difficulty in a world with a multiplicity of monitoring tools is that there are so many sources of events and so many moving parts. Is the cause capacity, a stuck message, configuration issues or, even worse, a misunderstanding of business requirements? Perhaps the application is running just fine with all indicators green, but the results aren't what the business expected. Or it works fine for users in one group, but not for another. These are very difficult problems to unravel.

An approach Forrester Research suggests is to bring the events from the various sources to a single pane of glass and perform a root-cause analysis. The suggestion is to use a technology called Complex Event Processing (CEP) to search in real time for patterns of events from multiple sources that together describe a problem.

CEP is very good at identifying situations spanning multiple event streams, correlating the individual events together into the "big picture", the situation. The QA concept of test cases is analogous: think of situations as the test cases that occur spontaneously in production. APM is not for the faint of heart.

CEP can tie the seemingly unrelated events together into a picture that tells a story: what happened and what triggered it. CEP, using rules, is of course dependent on the quality and completeness of those rules. But that is something that grows ever better over time. A new situation can be described and prevented from ever causing harm again. Without the relationship between the events from the various sources, that would not be possible. We would just be fixing the web server or the database or the application server. With this approach, we are fixing the problem.
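To make the idea of a rule that describes a situation concrete, here is a minimal sketch in plain Python rather than a real CEP engine. It assumes a situation can be described as a set of (source, event kind) pairs that must all appear within a time window, and that events are fed in roughly timestamp order. The `Event` and `SituationRule` names, the sources and the event kinds are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since some reference point
    source: str       # which monitoring silo reported it
    kind: str         # what was observed


class SituationRule:
    """Fires when every required (source, kind) pair is seen within the window."""

    def __init__(self, name: str, required: FrozenSet[Tuple[str, str]],
                 window_seconds: float) -> None:
        self.name = name
        self.required = required
        self.window_seconds = window_seconds
        self.recent: Deque[Event] = deque()

    def feed(self, event: Event) -> Optional[str]:
        self.recent.append(event)
        # Drop events that have fallen out of the time window.
        while self.recent and (
            event.timestamp - self.recent[0].timestamp > self.window_seconds
        ):
            self.recent.popleft()
        seen = {(e.source, e.kind) for e in self.recent}
        if self.required <= seen:
            self.recent.clear()  # report the situation once, then start over
            return self.name
        return None


rule = SituationRule(
    name="checkout_degraded",
    required=frozenset({
        ("database", "slow_query"),
        ("middleware", "queue_depth_high"),
        ("web_server", "http_503"),
    }),
    window_seconds=30.0,
)

# The symptom (http_503) arrives first; order inside the window does not matter.
stream = [
    Event(0.0, "web_server", "http_503"),
    Event(4.0, "database", "slow_query"),
    Event(9.0, "middleware", "queue_depth_high"),
]
detected = [rule.feed(e) for e in stream]
print(detected)  # [None, None, 'checkout_degraded']
```

Each event alone would only point at its own silo. The rule fires only when the three are seen together, which is the relationship between sources described above. A real CEP engine would add far richer pattern matching, but the dependence on the quality of the rules is the same.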

CEP represents an actionable form of analytics. You can add CEP analytics to your APM, including your currently deployed monitoring solutions, as it is inherently a multi-source approach. Utilizing this and delivering root-cause analysis can improve your incident management process. It can help you achieve the IT Ops goals of getting alerted to problems before the end user is affected and being pointed in the right direction.

Charley Rich is VP Product Management and Marketing at Nastel Technologies.

Related Links:

For more information on this methodology see the Forrester document:
Technology Spotlight: Application Performance Management And Complex Event Processing
