Reliability
# SLOs and error budgets: the spec for reliability
SLO = Service Level Objective: the reliability target your team commits to. Error budget = how much unreliability you can spend before halting feature work.

## Vocabulary

- **SLI**: a measurable indicator (success rate, latency)
- **SLO**: target for an SLI (e.g., 99.9% success over 30 days)
- **SLA**: contractual commitment (usually looser than the SLO, so you miss the SLO before you breach the contract)
- **Error budget**: 1 - SLO, applied to total events in the window

## Picking SLOs

1. What does the user actually care about? "Did the request succeed?" + "Was it fast enough?"
2. Pick a budget that has business meaning. 99.9% over 30 days allows about 43 minutes of full outage, or 1 failed request in every 1,000. Is that OK?
3. Pick a measurement window. A rolling 30 days is standard.
4. Pick a target you can MEASURE today.

## Example SLOs

```
Availability:
  Target: 99.9% over rolling 30 days
  Error budget: 43m 12s per 30 days

Latency:
  Target: 95% of requests under 500ms over 30 days
  Error budget: 5% of requests may be slower (36h of slow responses per 30 days at steady traffic)
```

## Error budget policy

Decide upfront what happens when you burn budget:

- **Burn fast (>50% in <1 week)**: page on-call, halt risky deploys
- **Burn moderate (>50% in 2-3 weeks)**: pause features, sprint on reliability
- **Burn slow**: business as usual

Without a written policy, SLOs are vibes. Write the policy.

## Multi-burn-rate alerts (modern pattern)

Burn rate is how fast you spend budget relative to plan: 1x spends exactly the whole budget over the SLO window.

- **Fast burn**: 14.4x budget rate over 1 hour (2% of a 30-day budget) → page immediately
- **Slow burn**: 6x budget rate over 6 hours (5% of a 30-day budget) → ticket / warn

The multi-burn-rate pattern is described in Google's Site Reliability Workbook. Single-threshold alerts either page too often or miss slow degradation.

## Avoiding SLO theater

- Don't set 99.99% for a service that doesn't need it. As a rough rule of thumb, each extra nine costs about 10x more, so five nines can cost around 100x what three nines does.
- Don't ignore the budget. If you never spend any of it, your SLO is too loose.
- Don't blame individuals for burning budget. Blame the system.
- The SLO is a PROMISE to users, even when only the SLA is legally binding. Treat it like a contract.
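
The error-budget figures in the Example SLOs section are plain arithmetic: budget = (1 - SLO) applied to the window or to the event count. Below is a minimal Python sketch of that calculation. The function names (`time_budget`, `event_budget`, `budget_remaining`) are hypothetical helpers made up for illustration, not part of any library, and the sketch assumes a strict 30-day window.

```python
from datetime import timedelta


def time_budget(slo: float, window: timedelta) -> timedelta:
    """Allowed 'bad' time in the window for a time-based reading of the SLO."""
    return timedelta(seconds=round((1.0 - slo) * window.total_seconds()))


def event_budget(slo: float, total_events: int) -> int:
    """Allowed bad events (failed or too-slow requests) out of total_events."""
    return round((1.0 - slo) * total_events)


def budget_remaining(slo: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed = (1.0 - slo) * total_events
    if allowed == 0:
        raise ValueError("no error budget: SLO is 100% or there were no events")
    return 1.0 - bad_events / allowed


window = timedelta(days=30)
print(time_budget(0.999, window))   # 0:43:12
print(time_budget(0.95, window))    # 1 day, 12:00:00  (36 hours)
print(event_budget(0.999, 1_000_000))  # 1000
```

The request-based form (`event_budget`) is usually the one to track in practice, because it weights busy periods more heavily than quiet ones; the time-based form is mainly useful for explaining the budget to non-engineers.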
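
The two thresholds from the Multi-burn-rate alerts section can be checked with a few lines of code. The sketch below assumes you can already query (bad, total) request counts per lookback window from your metrics system. The `BurnRule` class, the `RULES` tuple, and the function names are hypothetical, chosen only for this example; a real setup would normally express the same logic as alerting rules in the monitoring stack.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    name: str
    window_hours: int   # lookback window for the measurement
    threshold: float    # burn rate at or above which the rule fires
    action: str         # what to do when it fires


# Thresholds taken from the section above; names and actions are illustrative.
RULES = (
    BurnRule("fast", 1, 14.4, "page"),
    BurnRule("slow", 6, 6.0, "ticket"),
)


def burn_rate(bad: int, total: int, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total == 0:
        return 0.0  # no traffic means no budget was spent
    return (bad / total) / (1.0 - slo)


def budget_spent(rate: float, window_hours: float, slo_window_hours: float = 720) -> float:
    """Fraction of the whole budget consumed by burning at `rate` for the window."""
    return rate * window_hours / slo_window_hours


def evaluate(counts: dict[int, tuple[int, int]], slo: float) -> list[str]:
    """counts maps window_hours -> (bad, total); returns the actions that fire."""
    fired = []
    for rule in RULES:
        bad, total = counts[rule.window_hours]
        if burn_rate(bad, total, slo) >= rule.threshold:
            fired.append(rule.action)
    return fired


# 2% errors in the last hour is roughly a 20x burn against a 99.9% SLO.
print(evaluate({1: (20, 1000), 6: (30, 6000)}, slo=0.999))  # ['page']
```

Here `budget_spent(14.4, 1)` comes out at about 0.02 and `budget_spent(6.0, 6)` at about 0.05, which is where the "2% in an hour" and "5% in six hours" readings of these thresholds come from.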