Reliability

SLOs and error budgets — the spec for reliability

SLO = Service Level Objective. Reliability target your team commits to. Error budget = how much unreliability you can spend before halting feature work.

## Vocabulary

- **SLI**: a measurable thing (success rate, latency)
- **SLO**: target for an SLI (e.g., 99.9% success over 30 days)
- **SLA**: contractual commitment to customers, with penalties if missed (usually looser than the internal SLO, so you get warning before breaching it)
- **Error budget**: 1 - SLO, applied to total events
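
To make the error budget definition concrete, here is a minimal Python sketch of an event-based budget. The function names (`error_budget_events`, `budget_remaining`) and the sample traffic numbers are illustrative assumptions, not part of any library.

```python
def error_budget_events(slo: float, total_events: int) -> float:
    """Events allowed to fail in the window: (1 - SLO) * total events."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * total_events


def budget_remaining(slo: float, total_events: int, bad_events: int) -> float:
    """Fraction of the budget still unspent; negative means overspent."""
    return 1.0 - bad_events / error_budget_events(slo, total_events)


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests against a 99.9% SLO,
    # of which 250 failed.
    print(round(error_budget_events(0.999, 1_000_000)))       # allowed failures
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))  # share of budget left
```

With these made-up numbers the budget is about 1,000 failed requests, and 250 failures leave roughly three quarters of it unspent.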

## Picking SLOs

1. What does the user actually care about? "Did the request succeed?" + "was it fast enough?"
2. Pick a budget that has business meaning. 99.9% over 30 days allows about 43 minutes of full downtime (or 0.1% of requests failing). Is that OK?
3. Pick a measurement window. 30 days standard.
4. Pick a target you can MEASURE today.
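
For step 2, it can help to translate a candidate target into wall-clock time before committing to it. The sketch below does that arithmetic for a time-based SLO; `allowed_downtime` is a hypothetical helper name, and the 30-day default mirrors step 3.

```python
from datetime import timedelta


def allowed_downtime(slo: float, window_days: int = 30) -> timedelta:
    """Full-outage time a time-based SLO tolerates over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    # timedelta supports multiplication by a float (rounded to microseconds).
    return timedelta(days=window_days) * (1.0 - slo)


if __name__ == "__main__":
    # Compare a few candidate targets side by side.
    for target in (0.99, 0.999, 0.9999):
        print(f"{target:.2%} -> {allowed_downtime(target)}")
```

Each extra nine cuts the allowance by a factor of ten, which is a quick way to check whether a target has the business meaning you intend.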

## Example SLOs

```
Availability:
Target: 99.9% over rolling 30 days
Error budget: 43m 12s of downtime per 30 days

Latency:
Target: 95% of requests under 500ms over 30 days
Error budget: 5% of requests may be slower (about 36h of slow responses per 30 days at steady traffic)
```
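
As a rough illustration of how the two example SLOs could be checked, the sketch below computes both SLIs from a list of request records. The `Request` type, the function names, and the choice to count failed requests in the latency SLI are all assumptions made for this example; real pipelines usually compute these from metrics rather than raw records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    ok: bool           # did the request succeed?
    latency_ms: float  # how long the response took


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that succeeded."""
    if not requests:
        raise ValueError("need at least one request")
    return sum(r.ok for r in requests) / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float = 500.0) -> float:
    """Fraction of requests answered in under threshold_ms.

    Assumption: failed requests are counted too; some teams exclude them.
    """
    if not requests:
        raise ValueError("need at least one request")
    return sum(r.latency_ms < threshold_ms for r in requests) / len(requests)


def meets(sli: float, target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= target


if __name__ == "__main__":
    # Made-up sample: 960 fast successes, 39 slow successes, 1 fast failure.
    sample = (
        [Request(True, 120.0)] * 960
        + [Request(True, 800.0)] * 39
        + [Request(False, 30.0)]
    )
    print(meets(availability_sli(sample), 0.999))
    print(meets(latency_sli(sample), 0.95))
```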

## Error budget policy

Decide upfront what happens when you burn budget:
- **Burn fast (>50% in <1 week)**: page on-call, halt risky deploys
- **Burn moderate (>50% in 2-3 weeks)**: pause features, sprint on reliability
- **Burn slow**: business as usual

Without a written policy, SLOs are vibes. Write the policy.
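
One way to keep the policy from staying vibes is to encode it. The sketch below maps budget spent so far to the three tiers listed above. The `Action` enum and `policy_action` function are hypothetical names, and because the tiers leave the 1-2 week range unspecified, this sketch assumes anything from day 7 through day 21 counts as a moderate burn.

```python
from enum import Enum


class Action(Enum):
    PAGE_AND_HALT_DEPLOYS = "page on-call, halt risky deploys"
    PAUSE_FEATURES = "pause features, sprint on reliability"
    BUSINESS_AS_USUAL = "business as usual"


def policy_action(budget_spent: float, days_elapsed: float) -> Action:
    """Pick a policy tier from budget spent so far in the window.

    budget_spent is a fraction of the whole budget (0.5 = half gone).
    days_elapsed is how far into the measurement window we are.
    """
    if budget_spent > 0.5:
        if days_elapsed < 7:
            # More than half the budget gone in under a week.
            return Action.PAGE_AND_HALT_DEPLOYS
        if days_elapsed <= 21:
            # Assumed boundary: days 7 through 21 count as moderate.
            return Action.PAUSE_FEATURES
    return Action.BUSINESS_AS_USUAL


if __name__ == "__main__":
    print(policy_action(0.6, 3).value)
    print(policy_action(0.6, 15).value)
    print(policy_action(0.2, 3).value)
```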

## Multi-burn-rate alerts (modern pattern)

- **Fast burn**: 14.4x budget rate over 1 hour → page immediately
- **Slow burn**: 6x budget rate over 6 hours → ticket / warn

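This pattern is adapted from the multiwindow, multi-burn-rate alerts in Google's Site Reliability Workbook, which pages on both of these thresholds and adds a 1x burn over 3 days as its ticket tier. Single-threshold alerts either page too often or miss slow degradation.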
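
Burn rate is the observed error rate divided by the rate that would spend the budget exactly over the window, i.e. `error_rate / (1 - SLO)`. A minimal sketch of the two thresholds listed above follows; the function names and the `"page"` / `"ticket"` / `"none"` return values are illustrative assumptions, and a real setup would evaluate this in the monitoring system rather than in application code.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_rate / (1.0 - slo)


def budget_fraction_spent(rate: float, window_hours: float,
                          slo_window_days: int = 30) -> float:
    """Share of the whole budget consumed by burning at `rate` for the window."""
    return rate * window_hours / (slo_window_days * 24)


def alert_for(error_rate_1h: float, error_rate_6h: float, slo: float) -> str:
    """Apply the fast-burn check first, then the slow-burn check."""
    if burn_rate(error_rate_1h, slo) >= 14.4:
        return "page"
    if burn_rate(error_rate_6h, slo) >= 6.0:
        return "ticket"
    return "none"


if __name__ == "__main__":
    # Hypothetical readings against a 99.9% SLO.
    print(alert_for(error_rate_1h=0.02, error_rate_6h=0.002, slo=0.999))
    print(alert_for(error_rate_1h=0.005, error_rate_6h=0.007, slo=0.999))
    print(alert_for(error_rate_1h=0.001, error_rate_6h=0.001, slo=0.999))
```

The thresholds follow from the budget arithmetic: 14.4x for one hour spends about 2% of a 30-day budget, and 6x for six hours spends about 5%. Floating-point rounding can matter for readings that sit exactly on a threshold, so treat the boundaries as approximate.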

## Avoiding SLO theater

- Don't set 99.99% for a service that doesn't need it. A common rule of thumb is that each extra nine costs roughly 10x more, so five 9s can cost about 100x more than three.
- Don't ignore the budget. If you never spend it, either your SLO is too loose or you are shipping more cautiously than you need to.
- Don't blame individuals for burning budget. Blame the system.
- The SLO is a PROMISE to users, even though only the SLA is a legal contract. Treat it like one.