As modern systems become more distributed, interconnected and dependent on automation, maintaining reliability without exhausting engineering teams becomes increasingly difficult. Site reliability engineering, or SRE, gives organizations a structured way to improve uptime, resiliency and incident response, but it is only effective when practices are focused, intentional and manageable.
The challenge isn’t simply adding more monitoring, processes or tools. It is helping teams identify what matters most and respond without unnecessary noise or complexity. Below, members of Forbes Technology Council share SRE practices that organizations can use to enhance reliability while keeping workloads sustainable.
Prioritize user-centric reliability metrics
Focus engineering effort on what really affects users. User-centric reliability metrics keep teams from being overloaded with low-impact alerts and create a common language between product, engineering and operations for discussing reliability tradeoffs. They also enable controlled innovation: teams can move faster when error budgets are healthy and slow down when risk increases. – Rahul Raj, Walmart
Establish clear system ownership
With clear ownership, dependency mapping and security guardrails, teams can standardize reliability work and deliver predictable uptime, faster recovery and stronger resiliency without adding operational load or overwhelming teams. When this foundation is missing, SRE teams end up rediscovering the system with every incident and change. – Rick Vanover, Veeam
Adopt AI-assisted root cause analysis
In cloud-native systems, incidents often span multiple layers. This forces teams to manually correlate signals across tools and domains, slowing analysis and increasing workload. Consider adopting AI-assisted SRE to coordinate analysis across the stack, identify root causes faster and automate investigation, improving reliability without adding to the workload. – Ben Ofiri, Komodor
Enforce error budgets across all teams
In SRE practice, what most teams need is not better monitoring. They need enforced error budgets with real consequences. More dashboards don’t reduce complexity. They add noise. Error budgets force honest conversations between engineering and product about reliability versus speed. When exhausting a budget actually halts releases, reliability stops being an operations problem and becomes everyone’s problem. – Hurrem Javed Mir, Quality Inc.
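To make the idea of a budget that can halt releases concrete, here is a minimal Python sketch. It assumes a request-based SLO measured over a fixed window; the names SloWindow, budget_remaining and release_allowed are hypothetical and do not come from the contributor or from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one SLO window (illustrative, hypothetical type)."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int
    failed_requests: int

    def error_budget(self) -> float:
        """Number of failed requests the SLO tolerates in this window."""
        return (1.0 - self.slo_target) * self.total_requests

    def budget_remaining(self) -> float:
        """Fraction of the budget still unspent; negative when overspent."""
        budget = self.error_budget()
        if budget == 0:
            return 0.0 if self.failed_requests == 0 else float("-inf")
        return 1.0 - self.failed_requests / budget


def release_allowed(window: SloWindow) -> bool:
    """Gate a release: block it once the error budget is used up."""
    return window.budget_remaining() > 0.0
```

A deploy pipeline could call release_allowed before promoting a build, so that an overspent budget blocks feature releases until reliability work restores it. How strict that policy should be is a decision for engineering and product to make together.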
Set clear service level objectives
Set and enforce service level objectives (SLOs). SLOs focus teams on what really affects reliability, reduce noise from non-critical issues and create clear priorities, improving system stability without overwhelming engineering teams. – Paul A Mohabir, Transservice Logistics
Build resilient observability practices
Organizations should prioritize observability built for resilience. Clear signals about system health, performance and failure modes enable teams to identify problems early and recover quickly. Strong observability supports proactive resiliency, improving reliability without overwhelming teams with alerts or operational noise. – Gary Diemer, InfusionPoints, LLC
Use AI to reduce operational toil
Cap toil at 50% and use AI to reclaim the rest. Google’s SRE book defines toil as manual, repetitive work that scales linearly with system size. The 50% cap was Google’s guardrail to keep engineers focused on engineering. In 2026, this is actionable: Route alerts through an LLM that deduplicates them, drafts runbook steps and escalates only truly new issues. – Ankit Narayan Singh, ParallelDots, Inc.
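The alert routing described above can be sketched as a small triage function. This is a simplified illustration under stated assumptions: alerts are plain dictionaries with "service" and "check" keys, a fingerprint of those two fields decides what counts as a duplicate, and draft_steps is a hypothetical stand-in for whatever LLM call would draft runbook steps. No real LLM API is used here.

```python
from typing import Callable, Iterable


def fingerprint(alert: dict) -> tuple:
    # Assumption: service name plus check name identifies "the same" alert.
    return (alert["service"], alert["check"])


def triage(alerts: Iterable[dict], known: set,
           draft_steps: Callable[[dict], str]) -> list:
    """Suppress duplicates and known issues; escalate only unseen alerts."""
    escalations = []
    for alert in alerts:
        key = fingerprint(alert)
        if key in known:
            continue  # duplicate or already-known issue: do not page anyone
        known.add(key)
        escalations.append({
            "alert": alert,
            # Hypothetical hook: in practice this might call an LLM.
            "draft_runbook": draft_steps(alert),
        })
    return escalations
```

Passing the drafting step in as a function keeps the deduplication logic deterministic and testable, while the model-backed part can be swapped or disabled independently.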
Simplify systems to reduce cognitive load
One effective way to improve reliability without overwhelming teams is to reduce their cognitive load. This means minimizing the architectural context that engineers need in order to understand issues and respond quickly and effectively. To achieve this, teams should measure service complexity, track dependencies and service surface area, and invest in automated runbooks and clear ownership. – Kostiantyn Gitko, Devox Software
Implement progressive rollout strategies
This comes down to release engineering. Even as teams start shipping faster, there is still a limit to how many changes a system can absorb at any given time, so releases must be planned and rolled out incrementally. Rolling out to a smaller group first lets teams see how the system behaves and address issues early, rather than exposing everyone to the issues at once. – Yuri Gubin, DataArt
Automate rollbacks for safer releases
Adopt progressive delivery with automated rollback. By releasing changes incrementally (canaries or feature flags) and linking them to real-time health metrics, teams can catch problems early and automatically roll back. This limits the blast radius, reduces firefighting and improves reliability without adding operational burden or overwhelming engineering teams. – Amirtha Saminathan, Lowe’s
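As a rough illustration of progressive delivery tied to a health metric, the Python sketch below shifts traffic in stages and rolls back on the first unhealthy reading. The callbacks set_traffic_percent and error_rate are hypothetical hooks into a load balancer and a metrics system, and the 1% error threshold is an arbitrary assumption.

```python
from typing import Callable, Sequence


def progressive_rollout(
    stages: Sequence[int],
    set_traffic_percent: Callable[[int], None],
    error_rate: Callable[[], float],
    max_error_rate: float = 0.01,
) -> bool:
    """Shift traffic stage by stage; roll back to 0% if a stage looks unhealthy."""
    for percent in stages:
        set_traffic_percent(percent)
        if error_rate() > max_error_rate:
            set_traffic_percent(0)  # automated rollback limits the blast radius
            return False
    return True
```

A real rollout would also wait for a soak period at each stage and look at more than one signal. The sketch only shows the control flow that turns a health check into an automatic rollback.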
Test control failures proactively
Teams should also consider proactive control-failure injection as another SRE practice. Unlike system-level chaos engineering, these are more precise, surgical tests that focus on core control conditions. Examples include scenarios that validate the effectiveness of core data controls, measure anomaly detection latency, and evaluate detection and response during peak volume to ensure stability. – James Gowen, Jr., Citi
Design systems with clear failure limits
As systems grow more complex, reliability is no longer a function of control but of design intent. The one practice to prioritize is setting clear failure thresholds: systems that know how to degrade rather than crash. In an innovative enterprise, resilience emerges when complexity is structured to absorb uncertainty rather than fight it. – Motaz Agamawi, PwC
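One common way to let a component degrade instead of crash is to wrap a fragile call with a cheaper fallback. The sketch below is a generic illustration, not the contributor's design; degrade_gracefully and both callables are hypothetical names.

```python
from typing import Any, Callable


def degrade_gracefully(primary: Callable[..., Any],
                       fallback: Callable[..., Any]) -> Callable[..., Any]:
    """Return a callable that serves the fallback whenever the primary raises."""
    def call(*args: Any, **kwargs: Any) -> Any:
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Degraded mode: for example cached or default data
            # instead of an error page.
            return fallback(*args, **kwargs)
    return call
```

For example, a recommendations call could fall back to a static list of popular items when its backing service times out, so the page still renders with reduced functionality.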
Require production readiness reviews
Prioritize production readiness reviews for new services. Before release, have teams answer simple questions about rollback plans, dependencies, load limits and on-call ownership. It’s a lightweight gate that keeps fragile systems from going live and saves far more time than it costs. – Dan Higham, AppMakers USA
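A readiness gate of this kind can be as small as a checklist validator. The sketch below assumes the review answers arrive as a dictionary; the key names are hypothetical and simply mirror the four questions mentioned above.

```python
# Hypothetical question keys, mirroring the four topics named in the text.
REQUIRED_ANSWERS = ("rollback_plan", "dependencies", "load_limits", "on_call_owner")


def missing_answers(review: dict) -> list:
    """Return the required questions that are absent or left blank."""
    return [
        key for key in REQUIRED_ANSWERS
        if not str(review.get(key) or "").strip()
    ]


def ready_for_production(review: dict) -> bool:
    """The service passes the gate only when every question has an answer."""
    return not missing_answers(review)
```

Returning the list of missing answers, rather than only a pass or fail, tells the owning team exactly what to fill in before the next attempt.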
Automate incident response with clear runbooks
Combine clear runbooks with automated incident response. When common issues have predefined steps, teams don’t have to reason from scratch under pressure. Automation can handle repetitive actions instantly, reducing response time. This keeps incidents manageable, reduces cognitive load, and improves reliability without increasing operational overhead or draining teams. – Kshitij Dixit, Zeo Route Planner
Apply behavioral testing to AI systems
Treat AI agents as a new tier of production code that needs behavioral testing, not just monitoring. As agents write or modify production systems, traditional SLOs and error budgets no longer cover nondeterministic failure modes. We built a test to formalize this: a tiered assurance architecture where cheap deterministic checks always run and expensive probabilistic checks run only when needed. – Nikhil Jathar, AvanSaber Technologies
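The tiering idea can be sketched in a few lines of Python. This is only an interpretation of the description above, not the contributor's actual architecture: the function name, the check signatures and the needs_deep_review trigger (standing in for "when needed") are all assumptions.

```python
from typing import Callable, Sequence

Check = Callable[[str], bool]


def tiered_assurance(
    output: str,
    deterministic_checks: Sequence[Check],
    probabilistic_checks: Sequence[Check],
    needs_deep_review: Callable[[str], bool],
) -> bool:
    """Run cheap checks always; run expensive checks only when triggered."""
    # Tier 1: cheap deterministic checks run on every output and fail fast.
    if not all(check(output) for check in deterministic_checks):
        return False
    # Tier 2: expensive probabilistic checks run only when the trigger fires.
    if needs_deep_review(output):
        return all(check(output) for check in probabilistic_checks)
    return True
```

Tier 1 might hold schema validation or lint rules, while tier 2 might hold sampled evaluations or model-graded reviews. Ordering them this way keeps the expensive checks off the hot path for outputs that already fail basic rules.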
Strengthen incident response discipline
Maintain strong incident response discipline. As systems become more complex, reliability is less about monitoring everything and more about knowing what matters, responding quickly and learning from failure. Clear runbooks, clear ownership and post-incident reviews reduce noise and improve resilience. – Rahul Saluja, WinWire
Improve visibility into automated systems
Make sure that, at every step, teams understand the automated processes, alerts and anything else that provides insight into their systems. The goal of SRE is to lighten the load as infrastructure evolves, so that the business can keep operating and responsibility for technology is shared across departments. – WaiJe Coler, InfoTracer
Align reliability goals with business priorities
Prioritize cost accountability at the service level. Link service level objectives (SLOs) to financial impact, such as outage costs, vendor penalties or lost revenue. This forces teams to focus on what really matters, not everything at once. Reliability improves when it is treated as a business decision, not just an engineering one. – Prajkta Waditwar, Box Inc.
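One simple way to attach a dollar figure to an SLO is to convert the target into allowed downtime and multiply by an estimated cost per minute. The sketch below assumes a time-based availability SLO and a 30-day window; the function names and the cost figure used in the usage note are hypothetical.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO tolerates over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def downtime_cost(slo_target: float, cost_per_minute: float,
                  window_days: int = 30) -> float:
    """Estimated cost if the full allowed downtime is actually consumed."""
    return allowed_downtime_minutes(slo_target, window_days) * cost_per_minute
```

For a 99.9% target over 30 days, the allowed downtime is about 43.2 minutes. At an assumed cost of $1,000 per minute, that is roughly $43,200 of exposure per window, a number product and finance can weigh against the engineering cost of a stricter target.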
Conduct cross-functional reliability assessments
Bring a multidisciplinary team to the table for a HAZOP-style reliability review and systematically ask what can go wrong. By involving engineering, operations, security and product teams, organizations can identify hidden risks early instead of leaving reliability solely to overburdened SRE teams after incidents occur. – Grigoris Shakhnovski, Modcon Systems Ltd.
