Weathering Event Storms and Alert Floods
Actionable Alerting for the Cloud and Dynamic Datacenter
August 23, 2011
Floyd Strimling
Share this

It seems that everyone in IT has caught “Cloud-fever,” as Enterprises and Service Providers alike race to revamp their architectures and offerings to take advantage of this great IT inflection point. However, lost within the technology is the reality that someone is responsible for keeping the Cloud up and running. That someone is usually Operations personnel along with their fellow Systems, Network, Storage, and Security Engineers. The lifeline of these dedicated individuals is a unified monitoring and eventing system with a goal of providing relevant, functional, and timely alerts.

To accomplish this goal, IT Operations must have the ability to effectively monitor the entire datacenter, and to provide high-quality data to the eventing system. As the saying goes, “garbage-in, garbage-out,” and no degree of filtering or pre-processing will alleviate this problem. In the end, the monitoring data that is collected is turned into events that are processed by the eventing system independent of the alerting mechanisms. This allows common techniques such as correlation, filtering, and suppression to take place prior to an alert being issued.

Herein lies the first challenge. How do you take an event storm with tens, hundreds, or even thousands of events and turn it into a single relevant event and subsequent alert? Rules-based correlation engines of the past cannot keep pace with the high rate of change within the dynamic datacenter. Instead, a new approach is needed that views the infrastructure as services instead of individually monitored components, and provides a service assurance layer to IT Operations and other business stakeholders. Assuming that the first challenge is overcome, it is time to design an alerting solution.

Careful consideration must be made to the purpose of the alert being processed. For example, is it an informational alert to the customer regarding a service issue, or is it an operational alert to a system administrator to fix an issue? Are any automated actions being used such as restarting a Windows service or Linux process? Is there integration to a service desk such as ServiceNow? Is the alert a high priority issue for revenue generation such as a customer issue or an internal issue?

Herein lies the second challenge -- alert floods. Alert floods fill your pager/email/phone with alerts that have either already been acknowledged or are irrelevant. Perhaps there is nothing more frustrating than getting an alert from a device that you are in the process of working on or have placed into maintenance. Many Operations personnel have a special folder or rule to take care of this, but this may actually cause them to miss relevant alerts. Operations personnel must trust that the alerts they receive are valid and require their immediate attention.

To accomplish this, only an intelligent solution that provides granular control over the alerts will eliminate this issue. Unlike the event storms discussed earlier, alerting lends itself to granular filtering, time-based policies, and escalation rules. The key is to have an eventing system that provides well-formed events that can be filtered against via a set of flexible and powerful rules. For example, an alert is only sent out if the automated action failed and the event has not been acknowledged for ten minutes. If the subsequent alert is not cleared within another ten minutes, the alert is resent only this time it goes to operations management. Finally, alerts should have the ability to be subscribed to and shared among your IT staff.

Alerting for the Cloud and dynamic datacenter requires IT organizations to re-examine how they deliver, monitor, and alert on vital services. IT Operations has minutes to respond to issues that could take down tens, hundreds, or thousands of virtual servers, impacting the business in ways we have never seen before. Accepting a console full of “Red” or a pager/phone/email full of useless alerts is a recipe for disaster. However, with proper planning and re-evaluation of your current People, Process, and Solutions, IT Operations will be able to meet demands and keep the Cloud running.

About Floyd Strimling

Floyd Strimling is a Technology Evangelist at Zenoss, who enjoys creating, debating, and following technology trends with the goal of making them a reality. Strimling’s unique background spans both hardware and software environments with experience in Cloud Computing/Autonomic Computing, Datacenter Automation, Virtualization, Networking and Security.

Related Links:

www.zenoss.com

Zenoss Service Dynamics Now Supports IPv6

Share this

The Latest

November 21, 2024

Broad proliferation of cloud infrastructure combined with continued support for remote workers is driving increased complexity and visibility challenges for network operations teams, according to new research conducted by Dimensional Research and sponsored by Broadcom ...

November 20, 2024

New research from ServiceNow and ThoughtLab reveals that less than 30% of banks feel their transformation efforts are meeting evolving customer digital needs. Additionally, 52% say they must revamp their strategy to counter competition from outside the sector. Adapting to these challenges isn't just about staying competitive — it's about staying in business ...

November 19, 2024

Leaders in the financial services sector are bullish on AI, with 95% of business and IT decision makers saying that AI is a top C-Suite priority, and 96% of respondents believing it provides their business a competitive advantage, according to Riverbed's Global AI and Digital Experience Survey ...

November 18, 2024

SLOs have long been a staple for DevOps teams to monitor the health of their applications and infrastructure ... Now, as digital trends have shifted, more and more teams are looking to adapt this model for the mobile environment. This, however, is not without its challenges ...

November 14, 2024

Modernizing IT infrastructure has become essential for organizations striving to remain competitive. This modernization extends beyond merely upgrading hardware or software; it involves strategically leveraging new technologies like AI and cloud computing to enhance operational efficiency, increase data accessibility, and improve the end-user experience ...