The Case Against AIOps
November 02, 2023

Phillip Carter
Honeycomb

Over the last couple of weeks, APMdigest has posted a series of blogs about AIOps that included my commentary. In this blog, I present the case against AIOps.

In theory, the ideas behind AIOps features are sound, but the machine learning (ML) systems involved aren't sophisticated enough to be effective or trustworthy.

AIOps is relatively mature today, at least in its current form. The ML models companies use for AIOps tasks work as well as they can, and the features that wrap them are fairly stable. That said, maturity is largely orthogonal to usefulness.

Despite being based on mature tech, AIOps features aren't widely used because they often don't seem to help with the problems people have in practice. It's as if you were struggling to cook a meal where the main challenge is adding all the ingredients at the right time, and someone offered you a better way to chop the vegetables. Does chopping vegetables more efficiently help? Maybe, but it doesn't solve the difficulty of timing your ingredients.

In addition, AIOps adoption is a big challenge for teams. Organizations constrained by their budgets may not be able to implement these features because of their cost. AIOps often comes bundled with several other features, all with a steep learning curve, and very few work as a turnkey solution. It's yet another thing for busy teams to learn, and it's not likely to be high on their priority list.

AIOps Does Not Provide Actionable Insights

AIOps arguably doesn't provide actionable insights. Sure, there are examples of teams reducing false positives and using anomaly detection to identify something worth investigating. Still, teams have been able to reduce false positives and identify uniquely interesting patterns in data long before AIOps, and typically do this today without AIOps features.

For example, you don't need ML models to tell you that a particular measure crosses a threshold. Furthermore, these models work only with past behavior as context. They can't predict future behavior, especially for services with irregular traffic patterns. And it's services with irregular traffic patterns that actually present the most problems (and thus time spent debugging) in the first place.
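
As a minimal sketch of that point, a plain threshold check can be written in a few lines with no model involved. Everything here is hypothetical and for illustration only: the `Sample` type, the `breaches` function, the 500 ms threshold, and the latency values are assumptions, not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: int      # hypothetical: seconds since some start time
    latency_ms: float   # the measure being watched


def breaches(samples, threshold_ms, min_consecutive=3):
    """Return the timestamps at which the measure has stayed above
    threshold_ms for exactly min_consecutive samples in a row,
    so each sustained breach is reported once."""
    alerts = []
    run = 0  # length of the current streak above the threshold
    for s in samples:
        run = run + 1 if s.latency_ms > threshold_ms else 0
        if run == min_consecutive:
            alerts.append(s.timestamp)
    return alerts


# Hypothetical latency readings, one per time step.
values = [120, 130, 620, 640, 700, 710, 140, 650]
samples = [Sample(t, v) for t, v in enumerate(values)]
print(breaches(samples, threshold_ms=500))
```

Requiring a few consecutive breaches is one simple, human-chosen way to cut down on noisy alerts; the right threshold and streak length are judgment calls for the team that owns the service.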

One use case that can be helpful in understanding this problem is analyzing a giant bucket of data that hasn't been organized. When organizations treat operations data as a dumping ground, using an ML model to perform pattern analysis and separate usable from unusable data can be helpful. However, it's only treating a symptom and not the root cause.

And when there are issues that AIOps features can't help identify, you're back to an extremely long time spent figuring out what's wrong in a system.

Facing Your Organizational Issues

The advantages of AIOps are insignificant because AIOps features primarily exist to patch organizational and technical failures. The long-term solution is to invest in your organization and empower your teams to pick quality tools, rather than being sold on the flashy promises of a quick AI fix.

I wouldn't suggest that users go looking for an AIOps-specific provider; they should instead leverage their team's expertise. For these specific use cases, humans are far better at making critical judgment calls than the ML models on the market today. Deciding what's worth looking at and alerting on is the best possible use of human time.

Most of the problems that AIOps purports to solve are organizational issues. Fix your organizational and technical issues by giving your teams the agency to fix things in the first place.

If you have problems with noise in your data, look at how you generate telemetry and prioritize working to improve it. Lead a culture shift by enforcing the principle that good telemetry is a concern for application developers, not just ops teams.

If your alerts are out of order, have your team look at what they're alerting on and make the necessary adjustments. If you have noisy alerts, talk to the people who are getting alerted to find out why things are too noisy. Take on-call engineers very seriously, check in with them regularly, and ensure they're not burning out. Some vendors will try to sell you on ML models that will magically solve alert fatigue, but be cautious: there is no magic, and your problems won't get solved by ML models.

If your organization doesn't have development teams prioritizing good telemetry, incentivize them to care about it.

LLMs for Observability

Can you tell I'm not particularly bullish on AIOps? I am incredibly bullish on LLMs for observability, though. LLMs do a great job of taking natural language inputs and producing things like queries on data, analyzing data relevant to a query, and generating material that helps teach people how to use a product. We'll uncover more use cases, but right now LLMs are best at reducing toil and lowering the bar to learning how to analyze your production data in the first place.

While I'm not too hopeful about the future of AIOps, I am optimistic about how AI will continue to integrate into operations. LLMs present novel ways for us to interact with systems that were previously impossible. For example, observability vendors are releasing AI features that lower the barrier for developers to access and make the most out of their observability tools. Innovations like this will continue to enhance developer workflows and transform the way we work for the better.

Phillip Carter is Principal Product Manager at Honeycomb