
Maximizing Resilience: Insights from the 2025 SRE Report

Leo Vasiliou
Catchpoint

As the digital landscape expands, the stakes for delivering reliable and seamless online experiences have never been higher. In the past year, site reliability engineering (SRE) has continued to evolve into a critical driver of operational success, shaping how organizations approach resilience, collaboration, and customer satisfaction.

The 2025 Catchpoint SRE Report dives into the forces transforming the SRE landscape, exploring both the challenges and opportunities ahead. Let's break down the key findings and what they mean for SRE professionals and the businesses relying on them.

Slow Is the New Down

Performance is about more than just uptime; it's also about speed. This year's report reveals that 53% of organizations believe poor performance is as harmful as downtime, making user experience a critical reliability metric.

What This Means for You: Organizations must elevate their performance monitoring strategies to include experience level objectives (XLOs) for ensuring fast and seamless digital interactions. Proactive performance tuning and real-time observability can mitigate the impact of "slow" on end users.
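As a rough illustration of what an experience level objective might look like in practice, the sketch below checks a batch of page load times against a speed target. The class name, the 1000 ms threshold, the 95% target, and the sample data are all assumptions made for this example; none of them come from the report.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class ExperienceObjective:
    """Hypothetical XLO: a share of page loads must finish within a threshold."""
    threshold_ms: float  # load time still perceived as "fast" (assumed value)
    target_ratio: float  # fraction of loads that must meet it (assumed value)


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def evaluate(samples, xlo):
    """Return the fraction of fast loads and whether the objective is met."""
    fast = sum(1 for s in samples if s <= xlo.threshold_ms)
    ratio = fast / len(samples)
    return ratio, ratio >= xlo.target_ratio


# Illustrative load times in milliseconds. Every request succeeded, so an
# uptime check would report 100%, yet two of the ten loads were slow.
load_times_ms = [420, 510, 380, 2900, 610, 450, 3300, 490, 530, 470]
xlo = ExperienceObjective(threshold_ms=1000, target_ratio=0.95)

ratio, met = evaluate(load_times_ms, xlo)
p95 = percentile(load_times_ms, 95)
print(f"fast loads: {ratio:.0%}, p95: {p95} ms, objective met: {met}")
```

In this made-up sample the service is never down, but only 80% of loads are fast, so the objective is missed. That is the "slow is the new down" case an uptime-only metric would not surface.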

Toil Levels Are Rising Despite AI

After years of decline, toil (the manual, repetitive tasks that consume engineering resources) has ticked upward. The median reported percentage of work spent on toil rose to 30%, up from 25% in 2024, leading us to ask whether AI is filling our time with more operational workload instead of less.

Why It Matters: This hypothesis suggests that while AI is improving specific workflows, it hasn't eliminated the burden of toil. Teams should evaluate their AI implementations to ensure they target high-impact areas and actively reduce manual effort. As Laura de Vesine, one of this year's report contributors, put it: AI is at best "a co-worker you can't trust." Even as AI tools become more integrated into workflows, human oversight and intervention remain critical to ensure these tools don't inadvertently add to the complexity of tasks.

Organizational Priorities Under Pressure

The tension between agility and stability persists. Over two-thirds of respondents reported feeling pressured to prioritize release schedules over reliability, highlighting the ongoing challenge of balancing speed with resilience.

Takeaway: Building a culture that values reliability alongside agility requires clear communication and alignment on priorities. Teams should integrate reliability metrics into performance evaluations and emphasize the long-term benefits of stable releases for both IT and the business.

Monitoring Tools: More Is More

The report found that most organizations use between 2 and 10 monitoring or observability tools, showing a "value over cost" mindset for effective oversight across complex technology stacks.

What This Means for You: While multiple tools can provide comprehensive coverage, they also introduce complexity. Organizations should focus on integrating these tools to provide unified visibility and actionable insights without overwhelming their teams.

AI Training Universally in High Demand, but Time-Constrained

As AI continues to shape the SRE landscape, 30% of respondents prioritized technical training on AI — a strong indicator of the desire to upskill. However, the top sentiment (37%) reflected caution, as teams balance enthusiasm for AI with practical implementation concerns.

Takeaway: Providing targeted, hands-on training programs can help bridge the knowledge gap and build confidence in AI's capabilities. Organizations should also set realistic expectations for AI adoption, ensuring a smooth transition into daily workflows.

Incidents Are a Certainty

Incident response remains a universal challenge, with 40% of respondents handling between 1 and 5 incidents in the last 30 days. Notably, incident management is a shared responsibility, with higher-level managers as involved as individual contributors.

Why This Matters: Teams should adopt a collaborative approach to incident response, leveraging diverse perspectives to address issues effectively. Implementing clear incident playbooks and blameless post-mortem practices can further enhance preparedness and learning.

Misalignment on Reliability Priorities

While the overall responses paint a positive picture of reliability practices, significant gaps emerge when analyzed by managerial responsibility. Misalignment on priorities and approaches remains a challenge.

Takeaway: Bridging this IT-to-business gap starts with acknowledging that it exists. Ongoing dialogue, alignment across all levels of the organization, and regularly revisiting and communicating reliability goals can help ensure everyone is pulling in the same direction.

Ownership and Action in SRE

The report shows just how important it is to connect technical work with the bigger picture. It all comes down to teams knowing how their efforts make a real difference and taking thoughtful steps to grab the opportunities in front of them. This year's report sheds light on the ongoing challenges that need attention, like making reliability a part of release planning, giving teams the tools and training they need to tackle incidents smoothly, and getting everyone on the same page, from leadership to contributors.

When it comes to AI, the focus should be on using it in practical ways that actually make work easier rather than more complicated. Building resilience and reliability isn't just about technical know-how. It's about clear goals, teamwork, and always looking for ways to improve. Companies that see SRE as a way to drive real outcomes, rather than just a set of technical tasks, will be in a great spot to succeed as the digital world keeps getting more complex and fast-paced.

Leo Vasiliou is Director of Product Marketing at Catchpoint

