Fault Domain Isolation Key to Avoiding Network Blame Game - Part 1
April 13, 2015

Jeff Brown
Emulex

The team-of-experts approach to incident response was effective when network problems were less complex and everyone was part of the same organization. However, in recent years the process required for Root Cause Analysis (RCA) of network events and business application performance issues has become more difficult, obscured by infrastructural cloudiness and stakeholders residing in disparate departments, companies and geographies. 
 
For many organizations, the task of quickly identifying root cause has become paramount to meeting Service Level Agreements (SLAs) and preventing customer churn. Yet, according to the Emulex Visibility Study, 79 percent of organizations have had events attributed to the wrong IT group, adding confusion and delays to the resolution of these issues.
 
This two-part series will explain a more fact-based, packet-analysis driven approach to Fault Domain Isolation (FDI), which is helping organizations troubleshoot and resolve network and application performance incidents.

Outsourcing Takes Over

It was hard enough getting visibility into what was actually happening when the entire infrastructure was owned and controlled by a single organization. With the rapid expansion of outsourcing, a growing number of blind spots are developing throughout end-to-end business applications. When an entire technology tier is outsourced, the result is a massive blind spot that prevents root cause analysis within that technology domain. To accommodate outsourced technology, organizations must clearly define the purpose and requirements of the Fault Domain Isolation stage of the incident response workflow compared to the Root Cause Analysis stage.

Understanding FDI

The motivation behind FDI is easy to understand because anyone who’s gone to the doctor has seen it in action. An “incident investigation” in healthcare typically starts with a process that is essentially FDI. A general practitioner performs an initial assessment, orders diagnostic tests, and evaluates the results. The patient is sent to a specialist for additional diagnosis and treatment only if there is sufficient evidence to justify it. Facts, not guesswork, drive the diagnostic process.

Organizations that deploy FDI seek to minimize the number and type of technology experts involved in each incident, which is why FDI should precede RCA. The goal is to identify exactly one suspect technology tier before starting the deep dive search for root cause.

Why isolate by technology? Because that is how departments (and outsourcing) are typically organized, and how you quickly reduce the number of people involved. By implicating just one fault domain, you eliminate entire departments and external organizations from being tied up in the investigation, just as you wouldn't pull in a neurosurgeon to examine a broken toe.

A key goal of FDI is to stop the "passing the buck" phenomenon in its tracks. For FDI to be effective, it must provide irrefutable evidence that root cause lies in the "suspect" sub-system or technology tier and, just as importantly, that the same evidence confirms root cause is highly unlikely to lie anywhere else. This is especially important when the fault domain lies in an outsourced technology.

When handing the problem over to the responsible team or service provider, effective FDI also provides technology-specific, actionable data. It supplies the context, symptoms, and information needed for the technology team to immediately begin their deep dive search for root cause within the system for which they are responsible.
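
To make that handoff concrete, here is a minimal sketch of what such a technology-specific package might contain. Every field name and value is an illustrative assumption, not a standard format or any particular vendor's tooling.

```python
# Illustrative sketch of an FDI handoff record; all names and values here are
# hypothetical examples, not a standard or a product format.
fdi_handoff = {
    "incident_id": "INC-0423",                    # hypothetical ticket reference
    "suspect_fault_domain": "application tier",   # exactly one implicated tier
    "tap_point": "web tier <-> app tier",         # where the packets were recorded
    "evidence": "app-tier response time 4.2 s vs. 3 ms network round trip",
    "capture_window": ("09:14:00Z", "09:29:00Z"), # when the symptoms were observed
    "affected_transactions": ["POST /checkout", "GET /cart"],
    "pcap_reference": "recorder-02:/captures/inc-0423.pcap",
}
```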

Exactly One Set of Facts

In order to be efficient and effective, FDI requires its analysis to be based on the actual packet data exchanged between the technology tiers. Packets don’t lie, nor do they obscure the critical details in averages or statistics. And having the underlying packets as evidence ensures the FDI process assigns irrefutable responsibility to the faulty technology tier.
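
As a simple illustration of how packet timing can isolate a fault domain, consider a capture taken at the boundary between the web tier and the application tier: the TCP handshake round trip approximates the network's contribution to delay, while the gap between the last request packet and the first response byte approximates the server tier's contribution. The sketch below, which assumes the scapy library, an illustrative capture file name, and an illustrative server port, shows one simplified way to compute that split; it is not any particular vendor's analysis engine.

```python
# Sketch: attribute delay observed at a tier boundary to either the network
# or the downstream server tier. Assumes scapy is installed and that the
# capture contains a single request/response TCP flow; the file name and
# port number are illustrative assumptions.
from scapy.all import rdpcap, IP, TCP

SERVER_PORT = 8080  # hypothetical port of the downstream (suspect) tier

pkts = [p for p in rdpcap("tier_boundary.pcap") if IP in p and TCP in p]

# Network contribution: TCP handshake round-trip time (SYN -> SYN/ACK).
syn     = next(p for p in pkts if p[TCP].flags & 0x02 and not (p[TCP].flags & 0x10))
syn_ack = next(p for p in pkts if p[TCP].flags & 0x02 and (p[TCP].flags & 0x10))
network_rtt = float(syn_ack.time - syn.time)

# Server-tier contribution: gap between the last request byte sent to the
# server and the first response byte coming back from it.
requests  = [p for p in pkts if p[TCP].dport == SERVER_PORT and len(p[TCP].payload) > 0]
responses = [p for p in pkts if p[TCP].sport == SERVER_PORT and len(p[TCP].payload) > 0]
last_request   = max(requests, key=lambda p: p.time)
first_response = min((p for p in responses if p.time > last_request.time),
                     key=lambda p: p.time)
server_time = float(first_response.time - last_request.time)

print(f"network RTT: {network_rtt * 1000:.1f} ms, "
      f"server think time: {server_time * 1000:.1f} ms")
print("suspect fault domain:",
      "server tier" if server_time > 10 * network_rtt else "network")
```

If the server-side gap dwarfs the handshake round trip, the evidence implicates the application tier and its owners; if the reverse is true, the network team takes the incident, and no one else needs to be pulled into the investigation.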

Primary FDI – the act of assigning the incident to a specific technology team or outsourced service provider – is exceedingly cost-effective to implement because its goal is relatively modest: to allocate incidents among a handful of departments or teams, plus any outsourced services. In practice, it involves relatively few technology tiers, a manageable number of tap points in the network, and a few network recorders monitoring between each technology tier.
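
For instance, a typical three-tier web application might need only a handful of tap points like those listed below, each feeding a network recorder whose captures drive the kind of timing analysis sketched above. The names are illustrative assumptions, not a reference architecture.

```python
# Illustrative only: the few tier boundaries a primary FDI deployment might
# monitor, each tapped and fed to a network recorder.
TAP_POINTS = [
    {"boundary": "client <-> web tier",   "recorder": "recorder-01"},
    {"boundary": "web tier <-> app tier", "recorder": "recorder-02"},
    {"boundary": "app tier <-> database", "recorder": "recorder-03"},
]
```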

Read Part 2 of this blog, which identifies some of the hang-ups of adopting FDI, as well as best practices.

Jeff Brown is Global Director of Training, NVP at Emulex.
