Observability
# Grafana: dashboards, alerts, and panel hygiene
Grafana is the standard visualization layer on Prometheus + other sources. The trap is dashboard sprawl. Build the small set that pays the rent.

## Dashboards every team needs

1. **Service overview** (rps, error rate, p99 latency, by service)
2. **Resource utilization** (CPU, memory, disk by node + container)
3. **Database performance** (qps, slow queries, connection count, replication lag)
4. **Business KPIs** (signups, orders, revenue)

Build these four. Skip the rest until someone asks.

## Panel hygiene rules

- One panel = one question. Don't cram 4 metrics on a 200px-tall panel.
- Stat panels for "is this OK right now?" Graph panels for "how did we get here?"
- Always include the unit (ms, %, rps). Grafana doesn't know.
- Color-code by severity. Red = problem; yellow = watch; green = healthy.

## Alert design

- Alert on user-visible symptoms (high error rate, high p99).
- Not on causes (CPU high). Causes are diagnostic.
- One alert per condition.
- Include a runbook link in every alert.

## Variables (templating)

Make dashboards work across environments. Add variables for namespace, service, env, region.

## Quick reference

```
Dashboard JSON:  edit via UI, then "Save JSON" + commit to git
Provisioning:    provisioning/dashboards/*.yaml
Annotations:     deploys, incidents, feature releases on graphs
Alerting:        prefer Prometheus rules over Grafana alerts (one source)
```

## The "delete old dashboards" rule

Every quarter, archive dashboards not viewed in 60 days. They confuse on-call.
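
The quarterly cleanup can be sketched as a small filter over usage records. This is a minimal sketch, not a Grafana API client: `DashboardUsage` and `stale_dashboards` are hypothetical names, and the sample records are made up for illustration. How you collect last-viewed timestamps depends on your Grafana edition and setup, so that step is assumed to have happened already.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass(frozen=True)
class DashboardUsage:
    """Hypothetical record: one dashboard and when it was last opened."""
    uid: str
    title: str
    last_viewed: Optional[datetime]  # None means no recorded view


def stale_dashboards(
    usage: List[DashboardUsage], now: datetime, max_age_days: int = 60
) -> List[DashboardUsage]:
    """Return dashboards not viewed within max_age_days.

    Never-viewed dashboards come first, then the rest from oldest view
    to newest.
    """
    cutoff = now - timedelta(days=max_age_days)
    stale = [d for d in usage if d.last_viewed is None or d.last_viewed < cutoff]
    return sorted(
        stale,
        key=lambda d: (d.last_viewed is not None, d.last_viewed or now),
    )


# Illustrative data only; uids, titles, and dates are invented.
now = datetime(2024, 7, 1, tzinfo=timezone.utc)
usage = [
    DashboardUsage("svc-overview", "Service overview", now - timedelta(days=1)),
    DashboardUsage("old-migration", "Migration tracking", now - timedelta(days=200)),
    DashboardUsage("scratch", "Scratch panel test", None),
    DashboardUsage("db-perf", "Database performance", now - timedelta(days=59)),
]

for dashboard in stale_dashboards(usage, now):
    print(f"archive candidate: {dashboard.uid} ({dashboard.title})")
```

The function only produces a candidate list. Keeping the archive step manual, or at least reviewed, is a deliberate choice here: a dashboard that is rarely opened may still be the one on-call needs during an incident.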