Using Machine Learning Analytics to Deliver Service Levels
September 21, 2016

Jerry Melnick
SIOS Technology

While the layers of abstraction created in virtualized environments afford numerous advantages, they can also obscure how the virtual resources are best allocated and how physical resources are performing. This can make maintaining optimal application performance a never-ending exercise in trial-and-error.

This post highlights some of the challenges encountered when using traditional monitoring and analytics tools, and describes how machine learning, as a next-generation analytics platform, provides a better way to meet SLAs by finding and fixing issues before they become performance problems. A future post will describe how machine learning analytics can also be used to allocate resources for optimal performance and cost-saving efficiency.

Most IT departments identify performance problems with tools that monitor a variety of discrete events against preset thresholds. For example, they set a specific threshold for CPU utilization, and whenever that threshold is exceeded, the tool fires off alerts. But the use of thresholds presents several challenges. Thresholds do not account for the interrelated nature of resources in virtualized environments, where a change in one resource can have a significant impact on another. Such interrelationships exist both within and across silos. Without a complete understanding of the environment across silos, users of threshold-based tools frequently discover that their attempts to solve a problem have simply moved it to a different silo.
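To make the limitation concrete, here is a minimal sketch of threshold-based alerting. The metric names, threshold values, and host names are hypothetical and chosen only for illustration; they do not come from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    host: str      # hypothetical VM or host name
    metric: str    # hypothetical metric name
    value: float


# Hypothetical preset thresholds, one per metric, evaluated in isolation.
THRESHOLDS = {"cpu_pct": 90.0, "datastore_latency_ms": 30.0}


def threshold_alerts(samples, thresholds=THRESHOLDS):
    """Return one alert string for each sample that exceeds its metric's threshold."""
    alerts = []
    for s in samples:
        limit = thresholds.get(s.metric)
        if limit is not None and s.value > limit:
            alerts.append(f"{s.host}: {s.metric}={s.value} exceeds {limit}")
    return alerts


samples = [
    Sample("vm-a", "cpu_pct", 95.0),
    Sample("vm-b", "cpu_pct", 88.0),
    Sample("vm-b", "datastore_latency_ms", 28.0),
]
print(threshold_alerts(samples))
# prints ['vm-a: cpu_pct=95.0 exceeds 90.0']
```

Each metric is checked on its own, so vm-b, which sits just under both limits at the same time, produces no alert, and nothing in the output relates vm-a's CPU spike to anything else in the environment.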

Thresholds often generate "alert storms" of meaningless data and miss important correlations that might indicate that a severe problem exists. They are ineffective in detecting the symptoms of subtle issues that may signal a significant imminent problem, such as "noisy neighbors" or datastore latency issues. These subtle issues may not exceed a threshold related to the root cause, or may exceed a threshold only in short, random intervals, producing alerts that are frequently lost amid the "noise" of alert storms.

Even the so-called dynamic thresholds cannot accommodate the constant change in dynamic environments and, as a result, require significant ongoing IT intervention. And finally, while they may alert IT to an issue, they rarely provide sufficiently actionable information for resolving it. The exponential growth in the size and complexity of virtual environments has outstripped the ability of IT staff to set, manage, and continuously adjust threshold-based tools effectively. The time for an automated solution has come.

Advanced machine learning-based analytics software overcomes these and other challenges by continuously learning the many complex behaviors and interactions among interrelated objects (CPU, storage, network, applications) across the infrastructure. This growing knowledge enables machine learning-based IT analytics solutions, unlike threshold-based tools, to identify the root cause(s) of performance problems with high accuracy and to make specific recommendations for resolving them cost-effectively.

The ability to aggregate, normalize, and then correlate and analyze hundreds of thousands of data points from different monitoring and management systems enables machine learning analytics solutions to transform massive volumes of data into meaningful insights across applications, servers and hosts, and storage and network infrastructures.
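As a rough illustration of the normalize-then-correlate step, the sketch below converts two metric series to z-scores and computes their Pearson correlation. It is a deliberately simple stand-in, not the method used by any specific product; the series names and values are invented for the example.

```python
from statistics import mean, pstdev


def zscores(values):
    """Normalize a series to zero mean and unit (population) standard deviation."""
    m = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]  # a constant series carries no signal
    return [(v - m) / sd for v in values]


def correlation(xs, ys):
    """Pearson correlation of two equal-length series, via their z-scores."""
    if len(xs) != len(ys):
        raise ValueError("series must have the same length")
    zx, zy = zscores(xs), zscores(ys)
    return mean(a * b for a, b in zip(zx, zy))


# Hypothetical samples from two different tools, aligned on the same timestamps.
cpu_ready_ms = [5, 7, 6, 20, 22, 21]          # from a hypervisor monitor
datastore_latency_ms = [3, 4, 3, 15, 16, 15]  # from a storage monitor

print(round(correlation(cpu_ready_ms, datastore_latency_ms), 3))
```

Normalizing first puts metrics with different units and scales on a common footing, so a strong correlation between series from separate silos can be surfaced even when neither series would trip its own threshold.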

As it gathers and analyzes this wealth of data, the machine learning analytics (MLA) system learns what constitutes normal behavior, and it is this baseline that gives the system the ability to detect anomalies and find root causes automatically.
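The idea of a learned baseline can be sketched with a very simple statistical model: estimate the mean and standard deviation of a metric from its history, then flag values that fall far outside that range. This is an assumption-laden toy, far simpler than the learning described above, and the function names, the cutoff `k`, and the sample data are all hypothetical.

```python
from statistics import mean, stdev


def learn_baseline(history):
    """Summarize 'normal' behavior as the (mean, sample standard deviation) of past values."""
    return mean(history), stdev(history)


def is_anomaly(value, baseline, k=3.0):
    """Flag a value that lies more than k standard deviations from the baseline mean."""
    m, sd = baseline
    if sd == 0:
        return value != m
    return abs(value - m) > k * sd


# Hypothetical datastore latency history in milliseconds.
history = [10, 12, 11, 13, 12, 11, 10, 12]
baseline = learn_baseline(history)

print(is_anomaly(12, baseline))  # within the learned range
print(is_anomaly(25, baseline))  # far outside the learned range
```

Because the baseline comes from observed data rather than a preset limit, it can be recomputed as new samples arrive, which is the property that removes the need to hand-tune thresholds.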

In addition to identifying root causes, advanced machine learning-based analytics solutions are able to simulate and predict the impact of making certain changes in resources and their allocations, which can be particularly useful for optimizing resource utilization and planning for expansion. This capability can also be useful for assessing whether there is adequate capacity to handle a partial or complete failover. These are topics worthy of a deeper dive in a future post.

Jerry Melnick is President and CEO of SIOS Technology.
