Observability
# Datadog: APM, logs, and metrics in one stack
Datadog is the dominant SaaS observability platform in 2026, with APM, logs, metrics, and RUM in one UI. It is expensive at scale, but for many teams it pays for itself through lower incident MTTR.
## Four telemetry surfaces
- **Metrics** (system + custom)
- **Logs** (structured JSON ideally)
- **Traces** (distributed tracing)
- **RUM** (real user monitoring)
Get all four flowing. Debugging with three is harder than it sounds.
## APM (Traces) — highest leverage
```python
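from ddtrace import tracer

user_id = "u-123"  # placeholder: take the real ID from the request context
with tracer.trace("handle_request") as span:
    # Tag the span so traces can be filtered by user
    span.set_tag("user_id", user_id)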
```
Auto-instrumentation covers most web frameworks. Add manual spans to business-critical functions.
## Log aggregation
- Structure logs as JSON in production. Don't fight the parser.
- Include trace_id in every log so logs ↔ traces link automatically.
```json
{"level":"info","message":"order placed","trace_id":"abc","order_id":123}
```
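One way to produce log lines of this shape is a small JSON formatter on top of the standard `logging` module. This is a minimal sketch, not Datadog's own integration: `JsonFormatter`, `build_logger`, and passing `trace_id` by hand through `extra=` are assumptions for illustration. In a real service the ID would come from the active trace rather than a literal.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            # Hypothetical: trace_id is supplied by the caller via `extra=`.
            "trace_id": getattr(record, "trace_id", None),
        }
        # Copy selected business fields if the caller attached them.
        for key in ("order_id",):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(name: str = "orders") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Usage would look like `build_logger().info("order placed", extra={"trace_id": "abc", "order_id": 123})`.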
## Custom metrics
```python
from datadog import statsd
statsd.increment('orders.placed', tags=['region:us-east'])
statsd.histogram('checkout.duration_ms', 423)
```
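Watch the custom-metric count. Each unique combination of metric name and tag values counts as its own custom metric, so a high-cardinality tag (user ID, request ID) multiplies the bill.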
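A rough way to reason about cardinality is to multiply the number of distinct values each tag can take. The sketch below is a back-of-the-envelope upper bound, not how Datadog actually computes billing. `estimate_series` and the tag counts are hypothetical.

```python
from math import prod


def estimate_series(tag_values: dict[str, int]) -> int:
    """Upper bound on distinct tag combinations for one metric name."""
    return prod(tag_values.values())


# Two low-cardinality tags stay small.
low = estimate_series({"region": 5, "env": 3})

# Adding a per-user tag multiplies the count by the number of users.
high = estimate_series({"region": 5, "env": 3, "user_id": 10_000})

print(low, high)  # 15 150000
```

The bound assumes every combination actually occurs. Real counts are usually lower, but the multiplicative growth is the point.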
## Monitors (alerts)
- Composite monitors: alert only when multiple conditions hit.
- Use "notify no data" sparingly: it produces false positives in flaky environments.
- Auto-resolve once the condition has stayed clear for N minutes.
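The composite and auto-resolve rules above can be modelled in a few lines to check the intended behaviour before configuring a monitor. This is a plain-Python simulation under assumed one-minute evaluations, not Datadog's monitor engine. `composite` and `alert_states` are hypothetical names.

```python
def composite(*conditions: list[bool]) -> list[bool]:
    """A minute breaches only if every input condition breaches in that minute."""
    return [all(minute) for minute in zip(*conditions)]


def alert_states(breaches: list[bool], resolve_after: int) -> list[bool]:
    """Return the alert state after each one-minute evaluation.

    The alert fires on the first breach and resolves only after
    `resolve_after` consecutive clear minutes.
    """
    alerting = False
    clear_streak = 0
    states = []
    for breached in breaches:
        if breached:
            alerting = True
            clear_streak = 0
        elif alerting:
            clear_streak += 1
            if clear_streak >= resolve_after:
                alerting = False
                clear_streak = 0
        states.append(alerting)
    return states


high_latency = [True, True, True, False, False, False]
high_errors = [True, False, True, False, False, False]
print(alert_states(composite(high_latency, high_errors), resolve_after=3))
# [True, True, True, True, True, False]
```

In this example the single clear minute in the middle does not resolve the alert, which is the flapping the N-minute window is meant to prevent.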
## Cost control
- Log indexing rules: index high-value, archive the rest
- Sample traces: 100% on errors, 5-10% on normal traffic
- Drop unused custom metrics quarterly
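The trace sampling rule can be sketched as a deterministic keep/drop decision. This assumes the error status is already known when the decision is made, that is, once the trace has finished. `keep_trace` and its parameters are hypothetical and do not reflect Datadog's sampling configuration.

```python
import zlib


def keep_trace(trace_id: str, is_error: bool, sample_percent: int = 10) -> bool:
    """Keep every error trace and roughly `sample_percent`% of the rest."""
    if is_error:
        return True
    # Hash the trace ID into one of 100 buckets so the same trace
    # always gets the same decision across services.
    bucket = zlib.crc32(trace_id.encode("utf-8")) % 100
    return bucket < sample_percent
```

Hashing the trace ID instead of calling a random number generator keeps the decision consistent for every span of the same trace.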