An independent survey of more than 500 IT Operations, DevOps, and Site Reliability Engineering (SRE) professionals, commissioned by Transposit for its inaugural State of DevOps Automation Report, uncovered a growing need for process automation as digital transformation initiatives converge with the remote/hybrid work policies brought on by the pandemic.
More than half of respondents reported that a lack of automation was the most common challenge they faced when taking action to resolve an incident. This combination of pressures means ITOps and software engineering teams, including DevOps engineers and SREs, face increasing complexity in their work, which leads to significantly more strain and application downtime unless preventive measures are taken.
Service Incidents and Remediation in a Pandemic-Influenced World
The vast majority of organizations surveyed have adopted remote/hybrid work policies and expanded their digital transformation initiatives since the start of the pandemic. At the same time, many have been hampered by longer incident resolution, inefficient processes, and a lack of automation.
The acceleration in digital transformation has resulted in an uptick in service incidents, putting a heavier burden on DevOps, SRE, and IT teams. The survey found that 9 out of 10 organizations experienced an increase in customer-affecting service incidents since the start of the pandemic, with nearly 60% of respondents observing an increase of at least 20%. Most respondents (93%) said incidents were taking longer to resolve while working remotely, and nearly 70% saw an increase in the cost of downtime since the pandemic began.
The survey results indicate that these findings stem from several factors. First, most organizations still rely on manual, repetitive DevOps processes that cause unnecessary toil.
Second, they are spending scarce resources on building custom in-house tools, which burdens all parts of the software stack, when those resources could instead go toward product innovation or customer service initiatives.
Still, organizations are motivated to put the right tools, processes, and reliable automation in place to keep pace with innovation and decrease mean time to resolution (MTTR). The majority of respondents believed that systematically mining insights from human data (such as archived Slack communications, postmortem interviews, and group feedback) could both improve future incident response and fuel operational excellence.
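MTTR is commonly computed as the total time spent resolving incidents divided by the number of incidents. The Python sketch below illustrates that arithmetic under stated assumptions: each incident is recorded as a hypothetical (detected, resolved) timestamp pair, and the function name `mean_time_to_resolution` is illustrative. Teams differ on where the clock starts (alert, customer report, or acknowledgment), so this is one possible definition, not necessarily the one used in the survey.

```python
from datetime import datetime, timedelta
from typing import Sequence, Tuple

def mean_time_to_resolution(incidents: Sequence[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - detected) over all incidents.

    Assumption: each incident is a (detected, resolved) pair; this is an
    illustrative record format, not a standard schema.
    """
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    # Sum the individual resolution durations, starting from a zero timedelta.
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)

# Two hypothetical incidents: one resolved in 90 minutes, one in 30 minutes.
incidents = [
    (datetime(2021, 3, 1, 9, 0), datetime(2021, 3, 1, 10, 30)),
    (datetime(2021, 3, 2, 14, 0), datetime(2021, 3, 2, 14, 30)),
]
print(mean_time_to_resolution(incidents))  # 1:00:00
```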
The Growing Popularity of Site Reliability Engineering
SREs play an essential role in solving infrastructure and operational problems in any organization, and the role is going mainstream. An overwhelming 94% of respondents said their organization had increased its focus on SRE practices in the past 12 months, and 86% of organizations plan to hire SREs in the next 12 months. While these numbers are high, they are not surprising given how far engineering and operations teams are being stretched. Investments in automation are a natural reaction to these circumstances.
Even in organizations without formal SRE roles, ITOps teams are adopting SRE practices. Almost all (98%) respondents in the "VP/Director/Manager IT Operations" role said their organization had increased its focus on SRE practices in the past 12 months, and 62.4% of IT Operations respondents plan to expand SRE efforts in 2021.
SREs are critical contributors to incident resolution and help teams work with complex distributed systems at scale. However, nearly 80% of respondents said the people responsible for reliability engineering face challenges when trying to resolve incidents as they occur.
Automation Drivers and Barriers
A key takeaway from the study is that automation is a highly valuable tool for engineering operations. Although the benefits of automation are well known, nearly half of respondents reported that their engineering operations are only 26-50% automated. Half (51.9%) cited inadequate documentation of institutional knowledge and existing processes as a barrier, followed by lack of clarity about what to automate (47.3%) and gaps in knowledge sharing (43.8%).
While organizations are still spending resources, time, and money on manual tasks when responding to incidents, they are aware that something needs to change. This is evidenced by the 40% of organizations that have one or more full-time engineers working on custom in-house tools or bots for automating incident response.
Most commercially available automation solutions take an "automate everything" approach and do not incorporate human-in-the-loop automation, which helps explain this finding. And humans aren't going anywhere: 9 out of 10 respondents believe that, to be more reliable and effective, automation should let humans apply their judgment at critical decision points.
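To make the human-in-the-loop idea concrete, the Python sketch below runs a runbook's steps automatically but pauses at steps flagged as critical decision points and asks a person to approve them. The `Step` class, the `run_runbook` function, and the `approve` callback are hypothetical names chosen for illustration; they are not Transposit's API or any particular tool's interface. In a real deployment, `approve` might prompt an on-call engineer in chat; here a simple function stands in.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    """One hypothetical runbook step."""
    name: str
    action: Callable[[], str]     # performs the step and returns a short result
    needs_approval: bool = False  # True marks a critical decision point for a human

def run_runbook(steps: List[Step], approve: Callable[[Step], bool]) -> List[str]:
    """Run steps in order, asking `approve` before any step that needs approval.

    If the human declines a critical step, the runbook halts and hands
    control back rather than continuing with later steps.
    """
    log: List[str] = []
    for step in steps:
        if step.needs_approval and not approve(step):
            log.append(f"{step.name}: declined by operator, runbook halted")
            break
        log.append(f"{step.name}: {step.action()}")
    return log

# Hypothetical incident runbook: the restart is the human decision point.
steps = [
    Step("collect diagnostics", lambda: "logs gathered"),
    Step("restart service", lambda: "restarted", needs_approval=True),
    Step("post status update", lambda: "posted"),
]

# Stand-in approver that always declines; a real one would ask a person.
for line in run_runbook(steps, approve=lambda step: False):
    print(line)
```

Halting on a declined step, rather than skipping it and continuing, is a deliberate choice in this sketch: later steps may assume the critical action happened, so control returns to the human.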
One simple yet effective beachhead for moving automation forward is documentation. Combining automated process documentation that keeps humans in the loop with actionable data on how to operate systems during and between incidents can improve MTTR, enhance service reliability, streamline operations, and lower the cost of downtime.