
Incident response playbook: logging workflows that actually help

Timeline templates, LoggerMan features for war rooms, share links, audit trails, and post-incident log hygiene.

Incidents, SRE, Operations

Incidents are a search problem

When production breaks, the bottleneck is rarely a lack of logs; it is **finding the right twenty lines** among millions. LoggerMan is built for fast filtering, bulk triage, and shareable views (dashboard).

This playbook assumes you already ingest production traffic (getting started).

Minute 0–5: stabilize and scope

  1. Confirm the customer-facing symptom (status page, support queue).
  2. Open Logs on the affected project.
  3. Filter on `level:ERROR`, the relevant `source`, and a time window (start with the last 15 minutes).
  4. Pin one representative ERROR and annotate it with your working hypothesis (dashboard bulk actions).

Enable a maintenance window on alerts if deploy-related noise would page the team.
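If you are working from an export rather than the UI, the filter in step 3 can be reproduced locally. The following is a minimal sketch, assuming each record is a dict with `level`, `source`, and a timezone-aware `timestamp`; the record shape and the `filter_errors` helper are hypothetical and are not LoggerMan's export format or API.

```python
from datetime import datetime, timedelta, timezone

def filter_errors(records, source, now, window=timedelta(minutes=15)):
    """Keep ERROR records from one source inside the trailing time window."""
    start = now - window
    return [
        r for r in records
        if r["level"] == "ERROR"
        and r["source"] == source
        and start <= r["timestamp"] <= now
    ]

# Hypothetical sample records; field names are assumptions for illustration.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
records = [
    {"level": "ERROR", "source": "checkout", "timestamp": now - timedelta(minutes=3), "message": "payment timeout"},
    {"level": "INFO", "source": "checkout", "timestamp": now - timedelta(minutes=2), "message": "cart loaded"},
    {"level": "ERROR", "source": "checkout", "timestamp": now - timedelta(minutes=40), "message": "old failure"},
    {"level": "ERROR", "source": "search", "timestamp": now - timedelta(minutes=1), "message": "index unavailable"},
]

matches = filter_errors(records, "checkout", now)
for r in matches:
    print(r["timestamp"].isoformat(), r["message"])
```

Widening `window` is usually the first adjustment when the symptom started before anyone noticed it.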

Minute 5–30: correlate

Use metadata you standardized during instrumentation (SDK production guide):

  • `requestId` / `traceId` across services
  • `userId` (hashed) for account-specific failures
  • `deploymentId` or `gitSha` after CI integration
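To illustrate why a shared identifier matters, here is a small sketch that groups records from several services by `requestId` and orders each group by time. It assumes records are dicts with a `metadata` dict and a numeric `timestamp` (epoch seconds); that shape and the `group_by_key` helper are hypothetical.

```python
from collections import defaultdict

def group_by_key(records, key="requestId"):
    """Group records by a metadata key and sort each group chronologically."""
    groups = defaultdict(list)
    for record in records:
        value = record.get("metadata", {}).get(key)
        if value is not None:  # records without the key cannot be correlated
            groups[value].append(record)
    for entries in groups.values():
        entries.sort(key=lambda r: r["timestamp"])
    return dict(groups)

# Hypothetical records from two services sharing one request.
records = [
    {"source": "payments", "timestamp": 1700000005, "message": "card declined", "metadata": {"requestId": "req-1"}},
    {"source": "gateway", "timestamp": 1700000001, "message": "POST /checkout", "metadata": {"requestId": "req-1"}},
    {"source": "gateway", "timestamp": 1700000002, "message": "GET /health", "metadata": {}},
    {"source": "gateway", "timestamp": 1700000003, "message": "POST /checkout", "metadata": {"requestId": "req-2"}},
]

by_request = group_by_key(records)
for entry in by_request["req-1"]:
    print(entry["source"], entry["message"])
```

The same helper works for `traceId` or `deploymentId` by passing a different `key`.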

Analytics helps when the incident is volume-shaped (“errors 4× baseline”) rather than a single stack trace.
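A "4× baseline" statement is just a ratio between the current error count and a typical count for the same bucket size. The sketch below uses the median of recent buckets as the baseline; that choice and the sample numbers are assumptions for illustration, not how LoggerMan's analytics computes it.

```python
from statistics import median

def volume_ratio(baseline_counts, current_count):
    """Ratio of the current bucket's error count to the median of earlier buckets."""
    baseline = median(baseline_counts)
    if baseline == 0:
        # No errors in the baseline: any error at all is an unbounded increase.
        return float("inf") if current_count > 0 else 0.0
    return current_count / baseline

# Hypothetical per-5-minute error counts before the incident, then the current bucket.
baseline_counts = [10, 12, 9, 11, 10, 13]
ratio = volume_ratio(baseline_counts, 42)
print(f"errors are {ratio:.1f}x baseline")
```

The median is less sensitive than the mean to one earlier spike inflating the baseline.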

Share without exporting secrets

For vendors or PMs without accounts, create a **time-boxed share link** (projects → share links). Enable metadata redaction when customer payloads might appear.

Never paste raw tokens in Slack. If one leaks, rotate it right away via API keys.
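When you have to copy a log record by hand instead of using a share link, the same idea can be applied locally before pasting. This is a minimal sketch of key-based redaction; the `SENSITIVE_KEYS` list and the `redact` helper are hypothetical and do not describe how LoggerMan's metadata redaction works.

```python
# Hypothetical list of key names to mask; extend it for your own payloads.
SENSITIVE_KEYS = {"authorization", "password", "token", "apikey", "email"}

def redact(value, sensitive=SENSITIVE_KEYS):
    """Return a copy of a nested structure with sensitive keys masked."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if k.lower() in sensitive else redact(v, sensitive)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(item, sensitive) for item in value]
    return value

event = {
    "requestId": "req-1",
    "user": {"email": "a@example.com", "plan": "pro"},
    "headers": [{"Authorization": "Bearer abc"}],
}
safe = redact(event)
print(safe)
```

Key-based masking does not catch secrets embedded in free-text messages, so still read the record before sharing it.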

Communication templates

**Internal Slack (start)**

> Incident: <symptom> — severity S1/S2 — lead: @name — LoggerMan project: <name> — filter: <link or description>

**Customer-facing (later)**

> We detected elevated errors affecting <feature>. Engineering is investigating. Next update in 30 minutes.

Audit and accountability

Settings changes (tokens, team, security) emit audit events. After the incident, check whether any configuration change coincides with the incident start time.
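That check can be sketched as a window query over audit events. The example assumes each event is a dict with an `at` timestamp and an `action` string; the event shape, the `changes_near` helper, and the 30-minute lookback are assumptions, not LoggerMan's audit schema.

```python
from datetime import datetime, timedelta, timezone

def changes_near(audit_events, incident_start, lookback=timedelta(minutes=30)):
    """Return audit events in the lookback window before the incident, newest first."""
    window_start = incident_start - lookback
    hits = [e for e in audit_events if window_start <= e["at"] <= incident_start]
    return sorted(hits, key=lambda e: e["at"], reverse=True)

incident_start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
# Hypothetical audit events for illustration only.
audit_events = [
    {"at": incident_start - timedelta(hours=5), "action": "team.member_added"},
    {"at": incident_start - timedelta(minutes=12), "action": "token.rotated"},
    {"at": incident_start - timedelta(minutes=4), "action": "security.setting_changed"},
    {"at": incident_start + timedelta(minutes=10), "action": "token.created"},
]

suspects = changes_near(audit_events, incident_start)
for e in suspects:
    print(e["at"].isoformat(), e["action"])
```

A change inside the window is a lead to investigate, not proof of cause.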

Post-incident: log debt paydown

Within 48 hours:

  1. Add or tune an alert rule that would have fired 5 minutes earlier, not 5 minutes later.
  2. Remove or downgrade noisy INFO lines that hid signal.
  3. Document runbook links in the team wiki and link from alert channels.
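One way to test step 1 is to replay a candidate rule against the incident's own per-minute error counts and see when it would have fired. The sketch below assumes a simple "count at or above a threshold for N consecutive minutes" rule; the rule shape, the `first_fire_minute` helper, and the sample counts are hypothetical.

```python
def first_fire_minute(counts_per_minute, threshold, sustained=1):
    """Return the index of the minute at which the rule first fires, or None."""
    streak = 0
    for minute, count in enumerate(counts_per_minute):
        streak = streak + 1 if count >= threshold else 0
        if streak >= sustained:
            return minute
    return None

# Hypothetical per-minute error counts around an incident.
counts = [2, 3, 2, 8, 15, 22, 30, 41, 44, 48, 55, 60]

current = first_fire_minute(counts, threshold=50)             # existing rule
tuned = first_fire_minute(counts, threshold=15, sustained=2)  # candidate rule
print(f"candidate rule fires {current - tuned} minutes earlier")
```

Replay the candidate against a quiet day as well, so the earlier alert does not turn into pager noise.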

Legal and data handling

EU/Swiss customers should review the retention and erasure paths described under Privacy (security). Report security issues through responsible disclosure.
