Google's SRE team has pioneered ways to keep failures rare by engineering reliability into every part of the stack. SREs have scaled techniques that have gotten us very far: Service Level Objectives (SLOs), error budgets, isolation strategies, thorough postmortems, progressive rollouts, and more. In the face of increasing system complexity and emerging challenges, we at Google are always asking ourselves: what's next? How can we continue to push the boundaries of reliability and safety?
Ideas like error budgets worked well when our products were largely stateless web services, but today our products face losses that must never occur: in effect, error budgets of zero. The types of failures we need to prevent have evolved beyond what error budgets can effectively address. Issues like privacy breaches, data loss, and regulatory noncompliance demand absolute prevention, not just low frequency and rapid mitigation. On top of these elevated expectations, our systems grow more complex every year: sophisticated automation has enabled us to scale, AI and ML are now core to almost every product we build, and cost and energy efficiency are as important as user-visible features.
To address these challenges, Google SRE has embraced systems theory and control theory. We have adopted the STAMP (System-Theoretic Accident Model and Processes) framework, developed by Professor Nancy Leveson at MIT, which shifts the focus from preventing individual component failures to understanding and managing complex system interactions. STAMP incorporates tools like Causal Analysis based on Systems Theory (CAST) for post-incident investigations and System-Theoretic Process Analysis (STPA) for hazard analysis.
A design might implement its requirements flawlessly. But what if the requirements necessary for the system to be safe are incorrect or, worse still, missing altogether?
Building upon the foundations laid by cybernetics pioneer Norbert Wiener and control theorists like Rudolf Kalman, Leveson recognized that safety is an emergent property that can only be analyzed at the system level, rather than an attribute of individual system components. STAMP applies control theory principles to safety engineering, viewing accidents not as a chain of events, but as the product of complex interactions between system components, including human operators and software.
Control Theory as a Foundation: The Four Conditions
"In order to control a process, four conditions are required:
Goal Condition: The controller must have a goal or goals (for example, to maintain the setpoint).
Action Condition: The controller must be able to affect the state of the system. In engineering, actuators implement control actions.
Model Condition: The controller must be (or contain) a model of the system.
Observability Condition: The controller must be able to ascertain the state of the system. In engineering terminology, observation of the state of the system is provided by sensors."
These four conditions provide a structured way to think about control in complex systems. When applying STAMP to our SRE practices, we can use these conditions as a checklist to ensure we have the necessary elements in place for effective control.
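To make the checklist concrete, here is a minimal sketch in Python of a hypothetical replica-count controller, with each of the four conditions labeled where it appears in the loop. The names (ReplicaController, FakeCluster) are illustrative inventions for this sketch, not a real Google system.

import dataclasses


@dataclasses.dataclass
class FakeCluster:
    """A stand-in for the controlled process, for demonstration only."""
    running: int

    def count_running_replicas(self) -> int:
        return self.running

    def start_replicas(self, n: int) -> None:
        self.running += n

    def stop_replicas(self, n: int) -> None:
        self.running -= n


class ReplicaController:
    """Keeps a service's replica count at a desired setpoint."""

    def __init__(self, setpoint: int):
        self.setpoint = setpoint    # 1. Goal Condition: the desired state.
        self.believed_replicas = 0  # 3. Model Condition: the controller's
                                    #    internal model of the process state.

    def observe(self, cluster: FakeCluster) -> None:
        # 4. Observability Condition: a "sensor" reads the actual state
        #    and updates the controller's model.
        self.believed_replicas = cluster.count_running_replicas()

    def act(self, cluster: FakeCluster) -> None:
        # 2. Action Condition: an "actuator" changes the state of the system.
        delta = self.setpoint - self.believed_replicas
        if delta > 0:
            cluster.start_replicas(delta)
        elif delta < 0:
            cluster.stop_replicas(-delta)


cluster = FakeCluster(running=1)
controller = ReplicaController(setpoint=3)
controller.observe(cluster)  # sensor: the feedback path
controller.act(cluster)      # actuator: the control path
assert cluster.count_running_replicas() == 3

Remove any one element (the setpoint, the actuator methods, the internal model, or the sensor read) and the controller can no longer reliably drive the system to its goal.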
Treating Accidents as a Control Problem
STAMP shifts our perspective on accidents from a linear chain of failure events to a control problem. We want our model to explain accidents that result from:
Component failures, like server crashes and buggy automation.
External disturbances, such as environmental factors in our datacenters or damage to subsea Internet cables.
Interactions between components of the system, including human-human, human-software, and software-software interactions.
Incorrect or inadequate behavior of individual system components, such as flawed algorithms or decision making.
Instead of asking "What software service failed?" we ask "What interactions between parts of the system were inadequately controlled?" In complex systems, most accidents result from interactions between components that are all functioning as designed, but collectively produce an unsafe state.
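The following deliberately tiny Python sketch illustrates the pattern with two hypothetical components: an autoscaler that correctly keeps replica counts low under low traffic, and a health checker that correctly evicts slow replicas. Both behave exactly as designed.

# Autoscaler, working as designed: traffic is low, so it has scaled the
# service down to a single replica.
replicas = [{"name": "r1", "slow": True}]

# Health checker, working as designed: evict any replica that is slow.
replicas = [r for r in replicas if not r["slow"]]

# No component failed, yet the service now has zero capacity: an unsafe
# system state produced purely by the interaction of two correct components.
assert len(replicas) == 0

A component-centric postmortem of this incident would find nothing broken; only an analysis of the interaction reveals the missing control, for example a minimum-capacity constraint that the health checker must respect.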
Hazard States Give You Time
"A hazard is a system state or set of conditions that, together with a particular set of worst-case environmental conditions, will lead to a loss [for one or more stakeholders in the system]."
Hazard states are not discrete events, and they do not describe anything at the level of individual system components. A hazard state is a property of the system as a whole, and the system can remain in a hazard state for a long time before an accident occurs: a bug is introduced but never triggered, an alert fires but no one receives it, a server is underprovisioned but suddenly receives traffic from a popular new product feature. That gives engineers a much larger target to aim at when trying to prevent outages. Rather than trying to eliminate every single failure that could occur anywhere in the system, we work to prevent the system from entering a hazard state. And if the system does enter a hazard state, detecting it and taking action to transition back to normal operations prevents the accident from occurring.
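As a concrete sketch, consider a hypothetical storage system whose safety constraint is "every blob has at least three replicas." Being below that threshold is a hazard state: nothing has been lost yet, but one more disk failure (the worst-case environmental condition) could turn it into an accident. The constant and function names below are assumptions for this example, not a real system.

MIN_REPLICAS = 3  # illustrative safety constraint


def blobs_in_hazard_state(replica_counts: dict[str, int]) -> list[str]:
    """Return blob IDs whose replica count violates the safety constraint.

    A nonempty result means the system is in a hazard state: no data has
    been lost yet, but a worst-case event (one more disk failure) could
    now turn the hazard into a loss.
    """
    return [blob for blob, n in replica_counts.items() if n < MIN_REPLICAS]


counts = {"blob-a": 3, "blob-b": 2, "blob-c": 1}
assert blobs_in_hazard_state(counts) == ["blob-b", "blob-c"]

Alerting on the hazard state itself, rather than on the loss event, is what buys engineers the time the heading promises.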
STPA analyzes each interaction in a system to comprehensively determine how the interaction must be controlled in order for the system to be safe. Unsafe control actions (UCAs) lead the system into one or more hazard states. There are only four possible types of UCA (a sketch follows the list):
A required control action is not provided.
An incorrect or inadequate control action is provided.
A control action is provided at the wrong time or in the wrong sequence.
A control action is stopped too soon or applied for too long.
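As a minimal sketch, the four UCA types can be written down as a checklist and applied to each control action under analysis. Here the control action ("drain traffic from a cluster") and the example hazards are hypothetical, chosen only to show the shape of the exercise.

from enum import Enum


class UCAType(Enum):
    NOT_PROVIDED = "required control action is not provided"
    INCORRECT = "incorrect or inadequate control action is provided"
    WRONG_TIMING = "provided at the wrong time or in the wrong sequence"
    WRONG_DURATION = "stopped too soon or applied for too long"


# For each control action under analysis, STPA asks how each of the four
# UCA types could put the system into a hazard state.
drain_traffic_ucas = {
    UCAType.NOT_PROVIDED: "cluster is failing but keeps serving users",
    UCAType.INCORRECT: "the wrong cluster is drained",
    UCAType.WRONG_TIMING: "drain starts before the destination has capacity",
    UCAType.WRONG_DURATION: "drain is abandoned halfway, splitting traffic",
}

for uca_type, example_hazard in drain_traffic_ucas.items():
    print(f"{uca_type.value}: e.g. {example_hazard}")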
This four-way enumeration highlights a major advantage of STPA: by looking at the system level and by modeling the system in terms of control-feedback loops, we find issues in both the control path and the feedback path. As we run STPA on more and more systems, we see that the feedback path is often less well understood than the control path, yet it is just as important from a system safety perspective.
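One recurring feedback-path issue is acting on stale observations. The sketch below shows one hypothetical guard: a controller that refuses to issue control actions when its feedback is older than a freshness bound. The threshold and function name are assumptions for illustration.

MAX_FEEDBACK_AGE_S = 120  # assumption: metrics older than this are unsafe to act on


def safe_to_act(metric_timestamp: float, now: float) -> bool:
    """Refuse to issue control actions when feedback is stale."""
    return (now - metric_timestamp) <= MAX_FEEDBACK_AGE_S


# Fresh feedback: the controller may act. Stale feedback: hold state and
# alert a human rather than control blindly.
assert safe_to_act(metric_timestamp=1000.0, now=1060.0)
assert not safe_to_act(metric_timestamp=1000.0, now=2000.0)

A controller acting on stale feedback is really controlling its own model of the system, not the system itself; holding state and alerting is usually safer than continuing to control blindly.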
As Leveson writes in Engineering a Safer World: "In [STAMP], understanding why an accident occurred requires determining why the control was ineffective. Preventing future accidents requires shifting from a focus on preventing failures to the broader goal of designing and implementing controls that will enforce the necessary constraints." This shift in perspective, from trying to prove the absence of problems to effectively managing known and potential hazards, is a key principle in our system safety approach.
Rather than treating complexity as a bug, SRE teams at Google are leveraging control theory and methods like STPA and CAST to build more comprehensive and proactive approaches to reliability, moving beyond simply reacting to failures toward actively designing safer systems from the ground up.