2026-05-28 · 11 min read

Alert fatigue: fewer rules, better signals

How to design LoggerMan alert rules — spike vs threshold, cooldowns, maintenance windows, channels, and on-call hygiene.

Alerts · On-call · SRE

The goal is wake-up quality

Every page should answer three questions in under thirty seconds: **what broke**, **who is affected**, and **what do I do first**. If a rule cannot answer all three, delete it.
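One way to hold a rule to the three questions is to check its notification text before the rule ships. The sketch below is illustrative only: `PageMessage` and `missingAnswers` are hypothetical names, not part of any LoggerMan schema or SDK.

```typescript
// Hypothetical notification shape; field names are illustrative,
// not a LoggerMan schema.
interface PageMessage {
  whatBroke: string;
  whoIsAffected: string;
  firstAction: string; // ideally a runbook link or a single command
}

// Returns the names of the questions a draft page leaves unanswered.
function missingAnswers(page: Partial<PageMessage>): string[] {
  const required: (keyof PageMessage)[] = [
    "whatBroke",
    "whoIsAffected",
    "firstAction",
  ];
  return required.filter((key) => {
    const value = page[key];
    // Treat absent and whitespace-only answers the same way.
    return value === undefined || value.trim() === "";
  });
}

const draft: Partial<PageMessage> = {
  whatBroke: "Checkout ERROR rate spiked",
  whoIsAffected: "",
};
// whoIsAffected and firstAction are reported as missing.
console.log(missingAnswers(draft));
```

A draft that returns a non-empty list is a candidate for rewriting or deletion.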

LoggerMan alerts evaluate log volume and patterns in near real time. Configuration lives under Alerts → Rules in the dashboard; conceptual background is in our alert fatigue doc.

Start from user-visible symptoms

Map alerts to **customer pain**, not infrastructure curiosity:

Checkout ERROR spike → revenue risk.
Auth WARNING cluster → login degradation.
Ingest failures → data loss risk (check troubleshooting first).

Avoid alerting on INFO lines unless you have a compliance requirement.

Spike beats static thresholds for traffic-shaped workloads

A fixed ERROR count tuned for 2pm traffic will stay silent through a real incident at 2am on Sunday, when traffic is 10× lower; one tuned for 2am will page you all afternoon. Prefer **spike** conditions when volume swings with marketing campaigns or timezone effects.

Static thresholds still help for **invariant** signals — e.g. “any ERROR from `billing.webhook`” where zero is the correct baseline. Document the rationale in the rule description for the next engineer.
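The difference between the two condition types can be sketched in a few lines. This is a simplified model, not LoggerMan's evaluation logic: the `Condition` shape, the `factor` and `minErrors` fields, and the idea of comparing against an average `baseline` are all assumptions made for illustration.

```typescript
// Hypothetical condition shapes; not LoggerMan's actual rule schema.
type Condition =
  | { kind: "threshold"; maxErrors: number }
  | { kind: "spike"; factor: number; minErrors: number };

// `current` is the ERROR count in the latest window; `baseline` is the
// average count over comparable earlier windows.
function shouldAlert(
  cond: Condition,
  current: number,
  baseline: number
): boolean {
  if (cond.kind === "threshold") {
    return current > cond.maxErrors;
  }
  // Spike: require a relative jump and an absolute floor, so that a
  // handful of errors on a quiet night does not page anyone.
  return current >= cond.minErrors && current > baseline * cond.factor;
}

const fixed: Condition = { kind: "threshold", maxErrors: 50 };
const spike: Condition = { kind: "spike", factor: 3, minErrors: 10 };

// Peak traffic: 60 errors against a baseline of 55 is ordinary noise.
console.log(shouldAlert(fixed, 60, 55)); // true (false alarm)
console.log(shouldAlert(spike, 60, 55)); // false

// Quiet hours: 30 errors against a baseline of 5 is a real incident.
console.log(shouldAlert(fixed, 30, 5)); // false (missed)
console.log(shouldAlert(spike, 30, 5)); // true
```

An invariant signal such as the `billing.webhook` case corresponds to `{ kind: "threshold", maxErrors: 0 }` in this model: any single ERROR fires.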

Cooldowns protect humans and channels

`cooldownMinutes` prevents the same stack trace from opening five incidents. Pair cooldowns with:

**Maintenance windows** during schema migrations (Alerts → Maintenance).
A runbook link (Notion/Confluence) in the notification template. LoggerMan Slack webhooks support custom payloads via outgoing webhooks.
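To make the interaction between cooldowns and maintenance windows concrete, here is a minimal sketch of the suppression logic. Only `cooldownMinutes` comes from the text above; the `Rule`, `MaintenanceWindow`, and `Notifier` types are hypothetical and do not describe how LoggerMan implements this internally.

```typescript
// Hypothetical types; only `cooldownMinutes` is a name from the article.
interface Rule {
  id: string;
  cooldownMinutes: number;
}

interface MaintenanceWindow {
  startMs: number; // inclusive, epoch milliseconds
  endMs: number; // exclusive
}

// Tracks the last notification time per rule and suppresses repeats.
class Notifier {
  private lastSent = new Map<string, number>();
  private windows: MaintenanceWindow[];

  constructor(windows: MaintenanceWindow[] = []) {
    this.windows = windows;
  }

  // Returns true if a notification should go out at time `nowMs`.
  shouldNotify(rule: Rule, nowMs: number): boolean {
    const inMaintenance = this.windows.some(
      (w) => nowMs >= w.startMs && nowMs < w.endMs
    );
    if (inMaintenance) return false;

    const last = this.lastSent.get(rule.id);
    const cooldownMs = rule.cooldownMinutes * 60 * 1000;
    if (last !== undefined && nowMs - last < cooldownMs) return false;

    this.lastSent.set(rule.id, nowMs);
    return true;
  }
}

const minute = 60 * 1000;
const notifier = new Notifier([{ startMs: 100 * minute, endMs: 130 * minute }]);
const rule: Rule = { id: "checkout-error-spike", cooldownMinutes: 15 };

console.log(notifier.shouldNotify(rule, 0)); // true: first firing
console.log(notifier.shouldNotify(rule, 5 * minute)); // false: inside cooldown
console.log(notifier.shouldNotify(rule, 20 * minute)); // true: cooldown elapsed
console.log(notifier.shouldNotify(rule, 110 * minute)); // false: maintenance
```

In this sketch a suppressed firing does not reset the cooldown clock, so a continuously failing service still produces one notification per cooldown period.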

One channel per severity

Recommended routing:

**ERROR spikes** → Slack or Teams for fast triage (one thread per incident).
**Daily digest** → email for non-urgent trends.
**Ownership rotation** → on-call schedules on the Scale plan.

Do not duplicate the same rule to Slack **and** SMS unless on-call policy requires it.
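The routing advice above can be kept honest with a small lint over whatever routing table your team maintains. The table format below is an assumption for illustration; LoggerMan's dashboard does not necessarily expose routing in this shape.

```typescript
// Hypothetical routing table; signal and channel names are illustrative.
type Signal = "error-spike" | "daily-digest";
type Channel = "slack" | "teams" | "email" | "sms";

const routes: Record<Signal, Channel[]> = {
  "error-spike": ["slack"],
  "daily-digest": ["email"],
};

// Lists signals that fan out to more than one channel, so each case
// can be checked against the on-call policy.
function findMultiChannelSignals(
  table: Record<Signal, Channel[]>
): Signal[] {
  const signals = Object.keys(table) as Signal[];
  return signals.filter((signal) => table[signal].length > 1);
}

console.log(findMultiChannelSignals(routes)); // empty: nothing duplicated
```

A signal that shows up in the result is not automatically wrong; it just needs a documented reason.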

Test with a sample log before production

Open Integration and send a sample ERROR.
Confirm the rule fires **once** within the expected window.
Adjust condition, cooldown, or environment filter.
Repeat after deploy — see structured logging in Next.js for stable `code` fields.

When alerts fire but logs look fine

Often the cause is an environment mix-up or a duplicate project. Verify `environment` tags in the SDK (integrations) and that preview traffic is not counted in production rules.
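A quick way to see the effect of a missing environment filter is to count the same events with and without it. The `LogEvent` shape and `projectId` field below are assumptions for illustration, not the LoggerMan SDK's actual types.

```typescript
// Hypothetical event shape; only `environment` is a name from the article.
interface LogEvent {
  level: "INFO" | "WARNING" | "ERROR";
  environment: string;
  projectId: string;
}

// Counts ERROR events, optionally restricted to one environment.
function countErrors(events: LogEvent[], environment?: string): number {
  return events.filter(
    (e) =>
      e.level === "ERROR" &&
      (environment === undefined || e.environment === environment)
  ).length;
}

const events: LogEvent[] = [
  { level: "ERROR", environment: "production", projectId: "shop" },
  { level: "ERROR", environment: "preview", projectId: "shop" },
  { level: "ERROR", environment: "preview", projectId: "shop" },
  { level: "INFO", environment: "production", projectId: "shop" },
];

console.log(countErrors(events)); // 3: preview noise included
console.log(countErrors(events, "production")); // 1: what the rule should see
```

If the unfiltered count is what trips the rule, the alert is firing on preview traffic while the production logs look fine.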