Observability
Grafana — dashboards, alerts, and panel hygiene

Grafana is the de facto visualization layer on top of Prometheus and other data sources. The trap is dashboard sprawl: build the small set that pays the rent.

## Dashboards every team needs

1. **Service overview** (rps, error rate, p99 latency, by service)
2. **Resource utilization** (CPU, memory, disk by node + container)
3. **Database performance** (qps, slow queries, connection count, replication lag)
4. **Business KPIs** (signups, orders, revenue)

Build these four. Skip the rest until someone asks.
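As a rough illustration of the first dashboard, the sketch below assembles a service overview skeleton as a Python dict and prints it as JSON. The metric names (`http_requests_total`, `http_request_duration_seconds_bucket`), the `service` and `status` labels, and the `service-overview` uid are assumptions for the example, not something Grafana or Prometheus mandates. It is a skeleton under those assumptions, not a complete importable dashboard.

```python
import json

# Assumed metric and label names; substitute whatever your services export.
SERVICE_OVERVIEW_QUERIES = {
    "Requests per second": "sum by (service) (rate(http_requests_total[5m]))",
    "Error rate": (
        'sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))'
        " / sum by (service) (rate(http_requests_total[5m]))"
    ),
    "p99 latency": (
        "histogram_quantile(0.99, sum by (service, le) "
        "(rate(http_request_duration_seconds_bucket[5m])))"
    ),
}

PANEL_WIDTH = 8    # Grafana's dashboard grid is 24 columns wide
PANEL_HEIGHT = 8
PANELS_PER_ROW = 24 // PANEL_WIDTH


def build_service_overview(queries):
    """Return a dashboard skeleton with one time series panel per query."""
    panels = []
    for index, (title, expr) in enumerate(queries.items()):
        panels.append({
            "id": index + 1,
            "title": title,  # one panel, one question
            "type": "timeseries",
            "gridPos": {
                "h": PANEL_HEIGHT,
                "w": PANEL_WIDTH,
                "x": (index % PANELS_PER_ROW) * PANEL_WIDTH,
                "y": (index // PANELS_PER_ROW) * PANEL_HEIGHT,
            },
            "targets": [{"refId": "A", "expr": expr}],
        })
    return {"uid": "service-overview", "title": "Service overview", "panels": panels}


dashboard = build_service_overview(SERVICE_OVERVIEW_QUERIES)
print(json.dumps(dashboard, indent=2))
```

Generating the skeleton from a small query table keeps the three panels consistent and makes the dashboard easy to review in a pull request.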

## Panel hygiene rules

- One panel = one question. Don't cram 4 metrics on a 200px-tall panel.
- Stat panels for "is this OK right now?" Graph panels for "how did we get here?"
- Always set the unit (ms, %, rps) in the panel's field options. Grafana cannot infer it from the query result, so without it the reader sees a bare number.
- Color-code by severity. Red = problem; yellow = watch; green = healthy.
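The unit and color rules above map onto the `fieldConfig` section of a panel's JSON. The following is a minimal sketch of a helper that builds a stat panel with an explicit unit and green/yellow/red thresholds. The helper name `stat_panel`, the query, and the threshold values are hypothetical. The unit id `s` (seconds) follows Grafana's unit naming as I understand it, so check the unit picker in your version.

```python
def stat_panel(title, expr, unit, warn_at, crit_at):
    """Return a stat panel dict with an explicit unit and severity colors."""
    if warn_at >= crit_at:
        raise ValueError("warn_at must be lower than crit_at")
    return {
        "title": title,
        "type": "stat",
        "targets": [{"refId": "A", "expr": expr}],
        "fieldConfig": {
            "defaults": {
                "unit": unit,  # never leave the reader guessing ms vs s
                "thresholds": {
                    "mode": "absolute",
                    "steps": [
                        {"color": "green", "value": None},      # base: healthy
                        {"color": "yellow", "value": warn_at},  # watch
                        {"color": "red", "value": crit_at},     # problem
                    ],
                },
            },
            "overrides": [],
        },
    }


# Hypothetical query and thresholds: p99 latency in seconds.
p99_panel = stat_panel(
    title="p99 latency (checkout)",
    expr=(
        "histogram_quantile(0.99, sum by (le) "
        '(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])))'
    ),
    unit="s",
    warn_at=0.3,
    crit_at=1.0,
)
```

Routing every panel through one helper means the unit and the severity colors cannot be forgotten on individual panels.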

## Alert design

- Alert on user-visible symptoms (high error rate, high p99 latency).
- Don't alert on causes (high CPU). Causes are for diagnosis and belong on dashboards.
- One alert per condition, so the same failure doesn't fire twice under different names.
- Include a runbook link in every alert.
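One way to keep these rules honest is to generate alert rules from code and lint them before they ship. The sketch below builds a symptom-based rule in the Prometheus rule-file shape (`groups`, `alert`, `expr`, `for`, `labels`, `annotations`) and flags any rule that lacks a runbook link. The `runbook_url` annotation name is a common convention rather than a Prometheus requirement. The metric names, threshold, and the `example.com` URL are placeholders.

```python
import json


def symptom_alert(name, expr, hold_for, severity, summary, runbook_url):
    """Return one alerting rule in the Prometheus rule-file shape."""
    return {
        "alert": name,
        "expr": expr,
        "for": hold_for,
        "labels": {"severity": severity},
        "annotations": {"summary": summary, "runbook_url": runbook_url},
    }


def missing_runbooks(rules):
    """Return the names of alerts that have no runbook_url annotation."""
    return [
        rule["alert"]
        for rule in rules
        if not rule.get("annotations", {}).get("runbook_url")
    ]


rules = [
    symptom_alert(
        name="HighErrorRate",
        # Assumed metric and labels; alerts on the symptom, not on CPU.
        expr=(
            'sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))'
            " / sum by (service) (rate(http_requests_total[5m])) > 0.05"
        ),
        hold_for="10m",
        severity="page",
        summary="More than 5% of requests are failing",
        runbook_url="https://example.com/runbooks/high-error-rate",  # placeholder
    ),
    # Deliberately incomplete rule to show what the lint catches.
    {"alert": "HighP99Latency", "expr": "vector(1)", "annotations": {}},
]

rule_file = {"groups": [{"name": "service-symptoms", "rules": rules}]}
print(json.dumps(rule_file, indent=2))
print("Alerts without a runbook:", missing_runbooks(rules))
```

A check like `missing_runbooks` fits naturally in CI. Validating the rendered file with `promtool check rules` before deploying it is still worthwhile, since this sketch does not verify the PromQL itself.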

## Variables (templating)

Make one dashboard work across environments instead of cloning it per environment. Add variables for namespace, service, env, and region.
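A sketch of how those variables might look in the dashboard's `templating` section follows. It assumes a Prometheus data source (where the `label_values(...)` variable query is available) and uses the `up` metric as a hypothetical source of label values. The field set is a minimal subset of what Grafana stores for a variable, so compare it with a dashboard exported from your own instance.

```python
def query_variable(name, query, multi=False):
    """Return a query-type dashboard variable definition."""
    return {
        "name": name,
        "type": "query",
        "query": query,
        "refresh": 1,          # re-run the query when the dashboard loads
        "multi": multi,        # allow selecting several values
        "includeAll": multi,   # offer an "All" option for multi-select
    }


# Assumed metric (`up`) and label names; later variables are chained to
# earlier ones so the service list narrows with the selected namespace.
variables = [
    query_variable("env", "label_values(up, env)"),
    query_variable("region", 'label_values(up{env="$env"}, region)'),
    query_variable("namespace", 'label_values(up{env="$env"}, namespace)'),
    query_variable(
        "service",
        'label_values(up{env="$env", namespace="$namespace"}, service)',
        multi=True,
    ),
]

templating = {"list": variables}

# Panel queries then reference the variables instead of hard-coded values.
panel_expr = (
    "sum by (service) (rate(http_requests_total"
    '{env="$env", namespace="$namespace", service=~"$service"}[5m]))'
)
```

The multi-select `service` variable is matched with `=~` because Grafana expands several selected values into a regex alternation for Prometheus queries.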

## Quick reference

```
Dashboard JSON: edit via UI, then export the JSON (dashboard settings "JSON Model", or the Export option) + commit to git
Provisioning: provisioning/dashboards/*.yaml
Annotations: deploys, incidents, feature releases on graphs
Alerting: prefer Prometheus rules over Grafana alerts (one source of truth)
```
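To make the annotations entry concrete, here is a hedged sketch of a deploy hook that prepares a request for Grafana's annotations HTTP API (`POST /api/annotations` with `time`, `tags`, and `text`). The base URL, the token, and the tag scheme are placeholders. The request is only constructed here, not sent, so the example runs without a Grafana instance.

```python
import json
import time
import urllib.request


def deploy_annotation(service, version, when_ms=None):
    """Return an annotation payload marking a deploy (time in epoch ms)."""
    return {
        "time": when_ms if when_ms is not None else int(time.time() * 1000),
        "tags": ["deploy", f"service:{service}"],  # hypothetical tag scheme
        "text": f"Deployed {service} {version}",
    }


def annotation_request(base_url, token, payload):
    """Build (but do not send) the POST request for the annotations API."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


payload = deploy_annotation("checkout", "v1.4.2", when_ms=1_700_000_000_000)
request = annotation_request("https://grafana.example.com/", "PLACEHOLDER", payload)
# In a real deploy hook: urllib.request.urlopen(request, timeout=5)
```

Because no dashboard is named in the payload, the annotation is organization-wide, and any dashboard can display it through an annotation query that filters on the `deploy` tag.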

## The "delete old dashboards" rule

Every quarter, archive dashboards that nobody has viewed in 60 days. Stale dashboards confuse on-call.
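The quarterly sweep can be reduced to a small filter once you have last-viewed timestamps. Where those timestamps come from depends on your setup (usage analytics, access logs, or similar), so the input mapping below is assumed, and the dashboard uids are made up for the example.

```python
from datetime import datetime, timedelta, timezone


def stale_dashboards(last_viewed, now, max_age_days=60):
    """Return sorted uids not viewed within max_age_days (None = never viewed)."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(
        uid
        for uid, seen in last_viewed.items()
        if seen is None or seen < cutoff
    )


# Hypothetical data: dashboard uid -> last time anyone opened it.
now = datetime(2024, 7, 1, tzinfo=timezone.utc)
last_viewed = {
    "service-overview": datetime(2024, 6, 28, tzinfo=timezone.utc),
    "database-performance": datetime(2024, 5, 20, tzinfo=timezone.utc),
    "old-migration": datetime(2024, 3, 1, tzinfo=timezone.utc),
    "never-opened": None,
}

to_archive = stale_dashboards(last_viewed, now)
print("Archive candidates:", to_archive)
```

Treating the output as a list of candidates rather than deleting automatically leaves room for a human to keep the rare dashboard that is only needed during incidents.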