Imagine you're blindfolded and dropped into the Marina District of San Francisco. Upon removing your blindfold, you would probably first look around to get your bearings. You might see the Golden Gate Bridge to the northwest, the Transamerica Pyramid to the southeast and Golden Gate Park to the southwest. Based on your perspective, you'd probably be able to deduce your approximate location by correlating multiple familiar data points.
Now imagine you're dropped into an entirely unfamiliar city and given two photos to help you figure out where you are. The photos are grainy and dark images of places you've never seen before. With so little to go on, your chances of success are next to zero.
The second scenario is an apt analogy for the way most site reliability engineering teams operate today. They use a collection of disconnected tools to try to diagnose problems they've never seen before. Each day is a new unknown city or unfamiliar neighborhood.
In the same way a city is a sum of its districts and neighborhoods, complex IT systems are made of many components that continually interact. Observability — the practice of collecting data from various aspects of a computer system, application, or infrastructure to understand its performance and identify and resolve issues — requires a comprehensive and connected view of all aspects of the system, including even some that don't directly relate to its technological innards.
Busting Silos
Observability has traditionally been about correlating the "Three Pillars": machine-generated logs, metrics and traces. Over the years, vendors of observability suites have pieced together point tools to measure these elements, often through acquisitions and siloed development projects. The result is a mishmash of isolated data points connected loosely through dashboards and broken up into more than a dozen discrete practices.
Each tool is designed to operate on a specific type of data, and the tools often don't communicate well with each other. For example, a spike in error logs can tell you that something is wrong, but it won't necessarily give you the contextual information to understand the root cause of the issue. Humans must do that.
In a typical observability scenario, site reliability engineers (SREs), DevOps engineers and administrators pore over their tool of choice and cut and paste what they see to an incident channel on Slack. Then, a person with a big brain — every company has one — tries to connect the dots across multiple screenshots to get at the root cause.
This is madness. Cloud-native applications are composed of independently built and deployed microservices that change daily or even multiple times per day. Many of the problems SREs wrestle with have never been seen before. There is no dashboard or alert for an "unknown" problem, just symptoms with little context. Troubleshooting has never been harder.
To investigate unknown problems, SREs must be able to quickly correlate data points for symptoms they are seeing. Traditional methods of correlating data, such as tagging, simply don't work with complex distributed architectures. Tags are not maintainable at any kind of scale and, even if they were, cardinality issues quickly ensue when, for example, customer counts reach tens or hundreds of thousands. This typically breaks any traditional tooling based on in-memory databases or, even if it doesn't, causes tooling costs to explode.
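The cardinality problem comes down to simple multiplication. The sketch below is illustrative only: the label names and counts are hypothetical, and it computes a worst-case upper bound (every combination of label values actually occurring) rather than what any particular tool would store.

```python
from math import prod


def series_cardinality(label_value_counts: dict[str, int]) -> int:
    """Upper bound on distinct time series for one metric.

    Each unique combination of label values is its own series, so the
    bound is the product of the number of distinct values per label.
    """
    return prod(label_value_counts.values())


# Hypothetical label sets for a single metric.
base_labels = {"service": 50, "endpoint": 20, "status_code": 5}
with_customer = {**base_labels, "customer_id": 100_000}

base = series_cardinality(base_labels)
tagged = series_cardinality(with_customer)

print(f"without customer tag: {base:,} series")
print(f"with customer tag:    {tagged:,} series")
```

Under these assumed numbers, adding a single per-customer tag multiplies the series count by the number of customers, which is why tooling that holds every series in memory struggles or becomes expensive.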
That's why, despite the $17 billion organizations pour into monitoring, logging and application performance management tools each year, mean time to resolution (MTTR) has barely budged.
Beyond the Obvious
The whole point of observability is to investigate unknown issues by seeing non-obvious relationships between data elements. You can't do that with siloed data, even if you have the requisite logs, metrics and traces.
To return to our San Francisco analogy, a tranquil day in Golden Gate Park doesn't explain why there's a traffic jam on the Golden Gate Bridge. The two may be related, but looking at one in isolation doesn't reveal the root cause. The gridlock may be caused by a breakdown on Highway 101 three miles downstream, a protest march, a fog bank, or police action in the Presidio. Identifying the root cause of such a complex problem requires collecting more than just data about known traffic patterns. In the same way, troubleshooting outages and performance problems in complex IT environments requires collecting non-traditional data, such as which customers are affected, what's going on elsewhere in the company, and how consequential the problem is to the business. Those seemingly unrelated variables need to be integrated with the Three Pillars and presented in a comprehensive view.
Traditional observability suites don't deliver the integrated view organizations need to see the big picture of their application and infrastructure estates. However, modern data lakes and elastic compute engines make it possible at a fraction of the cost of just a few years ago.
More Than Three Pillars
Organizations need to think beyond the traditional framework and adopt a more holistic approach to observability. A unified observability offering breaks down silos by integrating logs, metrics and traces in a single platform. But it doesn't stop there. Using a modern data lake, it can incorporate any information that may be relevant to troubleshooting teams and even fold in non-obvious contextual data such as user behavior, business metrics, and code deployments.
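As a rough illustration of what folding contextual data into one view can mean in practice, the sketch below gathers everything that happened shortly before a symptom, regardless of which source it came from. The `Event` type, the `context_for` helper, the source names and the sample data are all hypothetical; a real platform would run this kind of query against a data lake rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    source: str          # hypothetical source name, e.g. "logs", "deploys", "business"
    timestamp: datetime
    service: str
    detail: str


def context_for(symptom: Event, events: list[Event], window: timedelta) -> list[Event]:
    """Return events from any source that occurred within `window`
    before the symptom, newest first. The symptom itself is excluded."""
    start = symptom.timestamp - window
    related = [
        e for e in events
        if e is not symptom and start <= e.timestamp <= symptom.timestamp
    ]
    return sorted(related, key=lambda e: e.timestamp, reverse=True)


# Hypothetical sample data.
t0 = datetime(2024, 1, 1, 12, 0)
symptom = Event("logs", t0, "checkout", "error spike")
events = [
    Event("deploys", t0 - timedelta(minutes=10), "checkout", "new version rolled out"),
    Event("business", t0 - timedelta(minutes=5), "checkout", "promotion launched"),
    Event("deploys", t0 - timedelta(hours=3), "search", "index rebuild"),
    symptom,
]

related = context_for(symptom, events, timedelta(minutes=30))
for e in related:
    print(e.timestamp.isoformat(), e.source, e.service, e.detail)
```

The point of the sketch is that the deployment and the business event surface next to the error spike only because all three live in the same store and can be queried together.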
Cloud-native solutions adapt as environments grow and change. Real-time data collection ensures that engineers always have access to the latest version of the truth. Generative AI simplifies queries and can dynamically generate "next steps" that should be taken to investigate and resolve incidents.
Monitoring modern distributed systems with siloed legacy tools is about as effective as summing up the grandeur of a world-class city in a few snapshots. Success means widening your aperture, stepping back, and taking in a panoramic view.