Observability
Datadog — APM, logs, and metrics in one stack

Datadog is one of the most widely used SaaS observability platforms in 2026: APM, logs, metrics, and RUM in one UI. It is expensive at scale, but for many teams it pays for itself through lower incident MTTR.

## Four telemetry surfaces

- **Metrics** (system + custom)
- **Logs** (structured JSON ideally)
- **Traces** (distributed tracing)
- **RUM** (real user monitoring)

Get all four flowing. Debugging with three is harder than it sounds.

## APM (Traces) — highest leverage

```python
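from ddtrace import tracer

user_id = "u-123"  # placeholder; take the real ID from your request context
with tracer.trace("handle_request") as span:  # manual span around the handler
    span.set_tag("user_id", user_id)  # makes traces searchable by user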
```

Auto-instrumentation covers most web frameworks; add manual spans around business-critical functions.

## Log aggregation

- Structure logs as JSON in production. Don't fight the parser.
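- Include the trace ID in every log so logs ↔ traces link automatically. Datadog's default correlation looks for `dd.trace_id` (and `dd.span_id`); a plain `trace_id` field needs a trace ID remapper in the log pipeline.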

```json
{"level":"info","message":"order placed","trace_id":"abc","order_id":123}
```
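One way to produce log lines in this shape is a small JSON formatter on top of the standard `logging` module. This is a minimal sketch, not Datadog's own integration: `JsonFormatter`, `build_logger`, and the `order_id` field are illustrative names, and the trace ID is passed in by hand through `extra` (in a real service it would come from the active trace).

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            # Supplied by the caller via `extra`; None when no trace is active.
            "trace_id": getattr(record, "trace_id", None),
        }
        # Copy over hypothetical business fields passed via `extra`.
        for key in ("order_id",):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(name: str = "orders") -> logging.Logger:
    """Return a logger whose only handler writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger()
logger.info("order placed", extra={"trace_id": "abc", "order_id": 123})
# {"level": "info", "message": "order placed", "trace_id": "abc", "order_id": 123}
```

Keeping the field list explicit (rather than dumping the whole record) keeps the output stable for the parser and avoids leaking internal attributes into indexed logs.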

## Custom metrics

```python
from datadog import statsd  # DogStatsD client; sends to the local Agent

statsd.increment('orders.placed', tags=['region:us-east'])  # counter, +1 per order
statsd.histogram('checkout.duration_ms', 423)  # one duration sample, in ms
```

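Watch the custom-metric count. Each distinct combination of metric name and tag values counts as its own custom metric, so one high-cardinality tag (user ID, request ID) multiplies the bill: cardinality explosion = bill explosion.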
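To see how quickly tags multiply, here is a back-of-the-envelope sketch. It assumes the worst case, where every combination of tag values actually occurs; the function name and the tag cardinalities are made up for illustration.

```python
from math import prod


def worst_case_series(tag_cardinalities: dict[str, int]) -> int:
    """Upper bound on distinct tag combinations for one metric name."""
    return prod(tag_cardinalities.values())


# Two low-cardinality tags stay cheap.
low = worst_case_series({"region": 5, "env": 3})

# Adding a per-user tag multiplies the count by the number of users.
high = worst_case_series({"region": 5, "env": 3, "user_id": 10_000})

print(low, high)  # 15 150000
```

Real counts are usually lower because not every combination is emitted, but the multiplication is why unbounded values such as user or request IDs are best kept in logs and traces rather than in metric tags.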

## Monitors (alerts)

- Composite monitors: alert only when multiple conditions hold at the same time.
- Use "notify no data" sparingly; it produces false positives in flaky environments.
- Auto-resolve when the condition has stayed clear for N minutes.
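The composite and auto-resolve rules above can be sketched as a tiny state machine. This only illustrates the logic and is not how Datadog evaluates monitors; `CompositeMonitor`, its two boolean inputs, and the one-evaluation-per-minute cadence are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CompositeMonitor:
    """Alert when both conditions hold; resolve after N clear evaluations."""

    resolve_after: int = 3  # hypothetical N, one evaluation per minute
    alerting: bool = False
    clear_streak: int = 0

    def evaluate(self, error_rate_high: bool, latency_high: bool) -> bool:
        """Feed one minute of condition results; return the alert state."""
        breached = error_rate_high and latency_high
        if breached:
            self.alerting = True
            self.clear_streak = 0
        elif self.alerting:
            # Condition is clear; count toward auto-resolve.
            self.clear_streak += 1
            if self.clear_streak >= self.resolve_after:
                self.alerting = False
                self.clear_streak = 0
        return self.alerting


monitor = CompositeMonitor(resolve_after=2)
minutes = [(True, False), (True, True), (False, True), (False, False), (False, False)]
states = [monitor.evaluate(err, lat) for err, lat in minutes]
print(states)  # [False, True, True, False, False]
```

A single breached condition in the first minute does not page anyone, and the alert stays open through the first clear minute so a brief recovery does not cause flapping.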

## Cost control

- Log indexing rules: index high-value logs, archive the rest
- Sample traces: keep 100% of error traces, 5-10% of normal traffic
- Review custom metrics quarterly and drop the unused ones
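The trace sampling policy above can be sketched as a deterministic keep/drop decision. This is an illustration of the policy only, not the algorithm Datadog's tracer or Agent uses; `keep_trace` and its hash-based bucketing are assumptions for the example.

```python
import hashlib


def keep_trace(trace_id: str, is_error: bool, sample_rate: float = 0.10) -> bool:
    """Keep every error trace; keep a deterministic fraction of the rest."""
    if is_error:
        return True
    # Hash the trace ID to a 64-bit bucket so every service that sees the
    # same trace ID makes the same decision.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < sample_rate * 2**64


# Errors are always kept, regardless of the sample rate.
assert keep_trace("trace-1", is_error=True, sample_rate=0.0)

# Roughly 10% of normal traffic survives at the default rate.
kept = sum(keep_trace(f"trace-{i}", is_error=False) for i in range(10_000))
print(f"kept {kept} of 10000 normal traces")
```

Deciding from the trace ID rather than a random draw means a trace is kept or dropped as a whole, so sampled traces do not end up with missing spans.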