Alert fatigue: fewer rules, better signals

Practical patterns for threshold vs spike alerts, cooldowns, and when to snooze during deploys.

The goal is wake-up quality

Every alert should answer: *what broke, for whom, and what do I do first?*

Prefer spike over static thresholds for traffic

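A fixed ERROR count tuned for 2pm traffic is wrong at 2am on Sunday: set it low enough to catch a quiet-hours outage and it pages all afternoon, set it for the afternoon peak and it sleeps through the outage. Use spike detection when volume swings with traffic.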
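
To make the difference concrete, here is a minimal sketch that compares the latest per-window ERROR count against a baseline built from recent windows instead of a fixed number. The function name `is_spike`, the `factor` and `min_count` parameters, and the choice of a median baseline are assumptions for illustration, not settings from any particular product.

```python
from statistics import median


def is_spike(history, current, factor=3.0, min_count=5):
    """Return True when `current` is a spike relative to recent windows.

    history   -- ERROR counts from recent windows (e.g. the last few five-minute windows)
    current   -- ERROR count in the window being evaluated
    factor    -- hypothetical multiplier over the baseline that counts as a spike
    min_count -- hypothetical floor so a jump from 0 to 2 errors does not page anyone
    """
    if current < min_count:
        return False
    # The median is less sensitive than the mean to a single earlier burst.
    baseline = median(history) if history else 0
    # Treat a baseline below 1 as 1 so quiet periods do not turn every error into a spike.
    return current >= factor * max(baseline, 1)


# Quiet period: baseline of about 2 errors per window, so 12 is a spike.
print(is_spike([2, 1, 3, 2, 2], 12))
# Busy period: baseline of about 40, so 45 is normal noise.
print(is_spike([38, 41, 40, 44, 39], 45))
```

A static threshold of 20 would get both cases wrong: it stays silent on the quiet-period jump to 12 and fires on the busy-period 45.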

Cooldowns are not laziness

Set **cooldownMinutes** so the same stack trace does not open five incidents in a row. Pair it with maintenance windows during migrations, when a burst of known errors is expected.
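
One way to picture what a cooldown does is the sketch below. It assumes alerts are grouped by a fingerprint (for example a hash of the stack trace) and that maintenance windows are simple start/end pairs. The `CooldownGate` class and its method names are hypothetical; only the idea of a per-rule `cooldownMinutes` comes from the setting above.

```python
from datetime import datetime, timedelta


class CooldownGate:
    """Decide whether an alert should open a new incident."""

    def __init__(self, cooldown_minutes, maintenance_windows=()):
        self.cooldown = timedelta(minutes=cooldown_minutes)
        # Each maintenance window is a (start, end) pair of datetimes.
        self.maintenance_windows = list(maintenance_windows)
        # Maps a fingerprint to the last time it actually notified.
        self.last_fired = {}

    def should_notify(self, fingerprint, now):
        # Suppress everything inside a maintenance window.
        if any(start <= now < end for start, end in self.maintenance_windows):
            return False
        last = self.last_fired.get(fingerprint)
        # Suppress repeats of the same fingerprint until the cooldown has elapsed.
        if last is not None and now - last < self.cooldown:
            return False
        self.last_fired[fingerprint] = now
        return True


gate = CooldownGate(cooldown_minutes=30)
t0 = datetime(2024, 1, 1, 2, 0)
# The same error arrives 0, 5, 10, 20 and 31 minutes after the first occurrence.
results = [
    gate.should_notify("NullPointer@checkout", t0 + timedelta(minutes=m))
    for m in (0, 5, 10, 20, 31)
]
print(results)  # [True, False, False, False, True]
```

In this sketch the cooldown is measured from the last notification rather than the last occurrence, so a persistent error re-notifies once per cooldown period instead of going silent forever.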

One channel per severity

Route ERROR spikes to Slack and keep email for daily digests. On-call schedules belong on Scale when rotations matter.
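
The routing rule can be thought of as a small lookup table, sketched below. The channel names, the `critical` tier, and the `route` function are all hypothetical and only illustrate the "one destination per severity" idea; they are not a real configuration format.

```python
# Hypothetical routing table: exactly one destination per severity.
ROUTES = {
    "critical": "on-call",      # assumed tier for when rotations matter
    "error": "slack",           # ERROR spikes go to Slack
    "warning": "email-digest",  # lower-severity noise waits for the daily digest
}


def route(severity, default="email-digest"):
    """Return the channel for a severity, falling back to the daily digest."""
    return ROUTES.get(severity.lower(), default)


print(route("ERROR"))  # slack
print(route("info"))   # email-digest
```

Keeping the mapping one-to-one means nobody has to guess whether a Slack message also paged someone.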

Test with a sample log

Send a sample ERROR from Integration, confirm the rule fires once, then tune the condition.