Incident response playbook: logging workflows that actually help
Timeline templates, LoggerMan features for war rooms, share links, audit trails, and post-incident log hygiene.
Incidents are a search problem
When production breaks, the bottleneck is rarely “lack of logs”; it is **finding the right twenty lines** among millions. LoggerMan is built for fast filtering, bulk triage, and shareable views (dashboard).
This playbook assumes you already ingest production traffic (getting started).
Minute 0–5: stabilize and scope
- Confirm the customer-facing symptom (status page, support queue).
- Open Logs on the affected project.
- Filter by `level:ERROR`, the relevant `source`, and a time window (start with the last 15 minutes).
- Pin one representative ERROR and annotate it with your working hypothesis (dashboard bulk actions).
Enable a maintenance window on alerts if deploy-related noise would page the team.
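The scoping filter above (level, source, time window) can also be reproduced offline when you are working from a batch of records instead of the dashboard. The following is a minimal sketch, assuming each record is a plain dictionary with `level`, `source`, `timestamp`, and `message` keys; those field names and the `recent_errors` helper are illustrative assumptions, not a LoggerMan API.

```python
from datetime import datetime, timedelta, timezone


def recent_errors(records, source, now, window_minutes=15):
    """Return ERROR records from `source` inside the last `window_minutes`."""
    cutoff = now - timedelta(minutes=window_minutes)
    return [
        r for r in records
        if r["level"] == "ERROR"
        and r["source"] == source
        and cutoff <= r["timestamp"] <= now
    ]


# Hypothetical sample data; timestamps are timezone-aware to avoid local-time mixups.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
records = [
    {"level": "ERROR", "source": "checkout", "timestamp": now - timedelta(minutes=3), "message": "payment timeout"},
    {"level": "INFO", "source": "checkout", "timestamp": now - timedelta(minutes=2), "message": "cart viewed"},
    {"level": "ERROR", "source": "checkout", "timestamp": now - timedelta(minutes=40), "message": "old failure"},
    {"level": "ERROR", "source": "search", "timestamp": now - timedelta(minutes=1), "message": "index miss"},
]

matches = recent_errors(records, "checkout", now)
for r in matches:
    print(r["timestamp"].isoformat(), r["message"])
```

Passing `now` explicitly keeps the function deterministic, which helps when you replay the same window later in the postmortem.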
Minute 5–30: correlate
Use metadata you standardized during instrumentation (SDK production guide):
- `requestId` / `traceId` across services
- `userId` (hashed) for account-specific failures
- `deploymentId` or `gitSha` after CI integration
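As an illustration of why a shared `traceId` matters, the sketch below groups records from several services by trace and keeps only the traces that contain an ERROR. It assumes records are dictionaries with a nested `metadata` dictionary and a sortable `timestamp`; the record shape and both helper names are hypothetical.

```python
from collections import defaultdict


def group_by_trace(records, key="traceId"):
    """Group records by a metadata key and order each group by timestamp."""
    groups = defaultdict(list)
    for r in records:
        trace = r.get("metadata", {}).get(key)
        if trace is not None:  # records without the key cannot be correlated
            groups[trace].append(r)
    for events in groups.values():
        events.sort(key=lambda r: r["timestamp"])
    return dict(groups)


def failing_traces(groups):
    """Keep only the traces in which at least one record is an ERROR."""
    return {
        trace: events
        for trace, events in groups.items()
        if any(r["level"] == "ERROR" for r in events)
    }


# Hypothetical records from three services; timestamps are simple integers here.
records = [
    {"timestamp": 2, "level": "ERROR", "source": "billing", "metadata": {"traceId": "t-1"}},
    {"timestamp": 1, "level": "INFO", "source": "gateway", "metadata": {"traceId": "t-1"}},
    {"timestamp": 1, "level": "INFO", "source": "gateway", "metadata": {"traceId": "t-2"}},
    {"timestamp": 3, "level": "INFO", "source": "cron", "metadata": {}},
]

failing = failing_traces(group_by_trace(records))
for trace, events in failing.items():
    print(trace, [(e["source"], e["level"]) for e in events])
```

Reading a failing trace in timestamp order shows which service logged the first sign of trouble, which is usually a better starting point than the loudest stack trace.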
Analytics helps when the incident is volume-shaped (“errors 4× baseline”) rather than a single stack trace.
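To make a phrase like “errors 4× baseline” concrete, the sketch below compares the error count in the incident window against the mean of comparable earlier windows. The function name and the sample counts are made up for illustration; how you choose the baseline windows (same hour on previous days, previous hour, and so on) is a judgment call.

```python
from statistics import mean


def error_ratio(current_count, baseline_counts):
    """Return current errors as a multiple of the mean baseline, or None if undefined."""
    if not baseline_counts:
        return None
    baseline = mean(baseline_counts)
    if baseline == 0:
        return None  # no meaningful multiple of a zero baseline
    return current_count / baseline


# Hypothetical counts: errors per 15-minute window on three earlier days vs. now.
ratio = error_ratio(120, [28, 30, 32])
print(f"errors are {ratio:.1f}x baseline")
```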
Share without exporting secrets
For vendors or PMs without accounts, create a **time-boxed share link** (projects → share links). Enable metadata redaction when customer payloads might appear.
Never paste raw tokens in Slack. If one leaks, rotate it via API keys.
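If you ever need to paste a record by hand into a ticket or chat, the same idea as metadata redaction can be applied locally first. This is a minimal sketch of key-based masking, not a description of how LoggerMan's redaction works; the `SENSITIVE_KEYS` set, the `redact` helper, and the sample record are assumptions for illustration.

```python
# Hypothetical list of keys to mask; adjust to your own metadata schema.
SENSITIVE_KEYS = {"authorization", "token", "password", "email"}


def redact(value, sensitive=SENSITIVE_KEYS, mask="[REDACTED]"):
    """Return a copy of `value` with sensitive dictionary keys masked, recursively."""
    if isinstance(value, dict):
        return {
            k: mask if str(k).lower() in sensitive else redact(v, sensitive, mask)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v, sensitive, mask) for v in value]
    return value  # scalars pass through unchanged


record = {
    "message": "payment failed",
    "metadata": {
        "requestId": "req-123",
        "Authorization": "Bearer abc.def",
        "customer": {"email": "a@example.com", "plan": "pro"},
    },
}

safe = redact(record)
print(safe)
```

Key-based masking only catches secrets stored under known keys. A token embedded in a free-text `message` would pass through, so it is still worth reading the snippet before sharing it.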
Communication templates
**Internal Slack (start)**
> Incident: <symptom> — severity S1/S2 — lead: @name — LoggerMan project: <name> — filter: <link or description>
**Customer-facing (later)**
> We detected elevated errors affecting <feature>. Engineering is investigating. Next update in 30 minutes.
Audit and accountability
Settings changes (tokens, team, security) emit audit events. After the incident, check whether a configuration change coincided with the incident start time.
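The correlation check can be sketched as a simple window query: list every audit event in the period leading up to the incident start, most recent first. The event shape (`timestamp`, `actor`, `action`) and the `changes_near` helper are assumptions for illustration, not the actual audit event format.

```python
from datetime import datetime, timedelta, timezone


def changes_near(audit_events, incident_start, lookback_minutes=30):
    """Return audit events in the lookback window before the incident, newest first."""
    window_start = incident_start - timedelta(minutes=lookback_minutes)
    hits = [
        e for e in audit_events
        if window_start <= e["timestamp"] <= incident_start
    ]
    return sorted(hits, key=lambda e: e["timestamp"], reverse=True)


# Hypothetical audit trail.
incident_start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
audit_events = [
    {"timestamp": incident_start - timedelta(hours=5), "actor": "dana", "action": "team.member_added"},
    {"timestamp": incident_start - timedelta(minutes=20), "actor": "sam", "action": "token.rotated"},
    {"timestamp": incident_start - timedelta(minutes=4), "actor": "lee", "action": "security.setting_changed"},
]

suspects = changes_near(audit_events, incident_start)
for e in suspects:
    print(e["timestamp"].isoformat(), e["actor"], e["action"])
```

A change inside the window is a lead to investigate, not proof of cause; confirm it against the first failing log lines before naming it in the postmortem.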
Post-incident: log debt paydown
Within 48 hours:
- Add or tune an alert rule that would have fired 5 minutes earlier — not 5 minutes later.
- Remove or downgrade noisy INFO lines that hid signal.
- Document runbook links in the team wiki and link to them from alert channels.
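One way to check whether a tuned rule “would have fired 5 minutes earlier” is to replay it against the incident's per-minute error counts. The sketch below assumes a simple rule of the form “at least `threshold` errors for `sustained` consecutive minutes”; the rule shape, the counts, and the paging minute are all hypothetical.

```python
def first_breach(per_minute_errors, threshold, sustained=1):
    """Return the minute index at which the rule would fire, or None if it never does."""
    run = 0  # consecutive minutes at or above the threshold
    for minute, count in enumerate(per_minute_errors):
        run = run + 1 if count >= threshold else 0
        if run >= sustained:
            return minute
    return None


# Hypothetical errors per minute from the start of the incident window.
counts = [2, 3, 2, 9, 14, 22, 35, 41, 40, 38, 44, 50]
paged_at = 10  # minute at which the team was actually paged (assumed)

candidate = first_breach(counts, threshold=10, sustained=2)
lead = paged_at - candidate
print(f"candidate rule fires at minute {candidate}, {lead} minutes before the page")
```

Replaying the same rule against a quiet day is a useful second check: a rule that fires early during the incident but also fires on normal traffic only adds noise.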
Legal and data handling
EU/Swiss customers should review the retention and erasure paths described under Privacy (security). Report security issues through responsible disclosure.