Root-Cause Analysis of Application Performance Problems
November 12, 2013

Charley Rich
Nastel Technologies

I first came upon the term root-cause analysis (RCA) while working at a network management startup. The concept was to determine why a problem occurred so that repair could happen sooner and service could be restored. Doing this required discovering the topology of a network and its devices in order to understand where a problem could occur and how the various parts related to one another. Monitoring was necessary in order to identify that a failure had occurred and to provide notification.

However, the challenge in doing this was that many failure events are received in seemingly random order; thus, it is very difficult to differentiate which events signify symptoms of the problem and which event represents the actual cause. To resolve this, some solutions constructed elaborate causality chains in the hope that you could follow them backwards in time to the "root cause". This is akin to following smoke and having it lead you to the fire. Well, it does work, if you do it fast enough and before the whole forest is in flames.
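As a rough illustration of following a causality chain backwards, here is a minimal Python sketch. The event names and the `caused_by` mapping are hypothetical, invented for this example; real products build such chains from discovered topology and are far more elaborate.

```python
def find_root_cause(event, caused_by):
    """Walk a causality chain backwards from a symptom to its root cause.

    `caused_by` maps each event to the event believed to have caused it
    (a hypothetical structure used only for this illustration).
    """
    seen = set()
    current = event
    # Stop when the current event has no known cause, or when a cycle appears.
    while current in caused_by and current not in seen:
        seen.add(current)
        current = caused_by[current]
    return current


# Hypothetical chain: each symptom points at its suspected cause.
causes = {
    "checkout page timeout": "app server thread pool exhausted",
    "app server thread pool exhausted": "database connection stalls",
    "database connection stalls": "disk latency on db host",
}

root = find_root_cause("checkout page timeout", causes)
print(root)
```

The sketch assumes the chain is already known and correct. In practice, building that chain from events that arrive out of order is the hard part.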

The obvious next thing to do was apply this to applications. It certainly seemed like a good idea at the time ... but it turned out to be much harder than expected. Why harder? Applications are far more complex than networks, with many more variations in behavior and relationships. So instead, monitoring systems were applied to the various silos of application architecture, such as web servers, application servers, middleware, databases and others.

For many years the focus of APM was on making the application server run better. And from that perspective, it was successful. However, while the application server became more reliable and ran faster, the two key capabilities IT Operations management desires have not improved much: getting alerted to problems before the end user is affected, and being pointed in the right direction.

Part of the difficulty in a world with this multiplicity of monitoring tools is that there are so many sources of events and so many moving parts. Is the cause capacity, a stuck message, configuration issues or, even worse, a misunderstanding of business requirements? Perhaps the application is running just fine with all indicators green, but the results aren't what the business expected. Or it works fine for users in one group, but not for another. These are very difficult problems to unravel.

An approach Forrester Research suggests is to bring the events from the various sources to a single pane of glass and perform a root-cause analysis. The suggestion is to use a technology called Complex Event Processing (CEP) to search in real time for patterns, based on events from multiple sources, that together describe a problem.

CEP is very good at identifying situations spanning multiple event streams, correlating the individual events together into the "big picture", the situation. Analogous to this is the concept in QA of test cases. Think of situations as the test cases that occur spontaneously in production. APM is not for the faint of heart.

CEP can tie the seemingly unrelated events together into a picture that tells a story: what happened and what triggered it. CEP, being rule-based, is of course dependent on the quality and completeness of those rules. But that is something that grows ever better over time. A new situation can be described and prevented from ever causing harm again. Without the relationship between the events from the various sources, that would not be possible. We would just be fixing the web server or the database or the application server. With this approach, we are fixing the problem.
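To make the idea of a rule spanning multiple event streams concrete, here is a minimal Python sketch. It is not how any particular CEP engine works. The `Event` type, the `detect_situation` function, the source and event names, and the 60-second window are all assumptions chosen for illustration. The rule fires when every required kind of event, each from its own source, has been seen within one time window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since some reference point
    source: str       # which monitoring silo reported it (hypothetical names)
    kind: str         # what was reported


def detect_situation(events, required, window_seconds):
    """Return the first set of events covering every (source, kind) pair in
    `required` within `window_seconds`, or None if the rule never fires."""
    latest = {}  # most recent matching event for each required pair
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.source, ev.kind)
        if key not in required:
            continue  # event is irrelevant to this rule
        latest[key] = ev
        if len(latest) == len(required):
            oldest = min(e.timestamp for e in latest.values())
            if ev.timestamp - oldest <= window_seconds:
                return sorted(latest.values(), key=lambda e: e.timestamp)
    return None


# Hypothetical rule: slow database, backed-up queue and slow web responses
# occurring together describe one situation rather than three problems.
required = {
    ("database", "slow_query"),
    ("middleware", "queue_depth_high"),
    ("web", "response_time_high"),
}

events = [
    Event(0, "middleware", "queue_depth_high"),
    Event(100, "database", "slow_query"),
    Event(110, "middleware", "queue_depth_high"),
    Event(125, "web", "response_time_high"),
]

match = detect_situation(events, required, window_seconds=60)
print([f"{e.source}:{e.kind}" for e in match] if match else "no situation")
```

Describing a newly discovered situation then amounts to adding another `required` set, which is one way to picture how a rule base improves over time.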

CEP represents an actionable form of analytics. Because it is inherently a multi-source approach, you can add CEP analytics to your APM, including your currently deployed monitoring solutions. Utilizing this and delivering root-cause analysis can improve your incident management process. It can help you achieve the IT Ops goals of getting alerted to problems before the end user is affected and being pointed in the right direction.

Charley Rich is VP Product Management and Marketing at Nastel Technologies.

Related Links:

For more information on this methodology see the Forrester document:
Technology Spotlight: Application Performance Management And Complex Event Processing

www.nastel.com
