Alert fatigue: fewer rules, better signals
Practical patterns for threshold vs spike alerts, cooldowns, and when to snooze during deploys.
The goal is wake-up quality
Every alert should answer: *what broke, for whom, and what do I do first?*
Prefer spike over static thresholds for traffic
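A fixed ERROR count that is right at 2pm is wrong at 2am on Sunday: tuned for peak traffic it misses real problems overnight, and tuned for quiet hours it pages on every busy afternoon. Use spike detection when volume swings with traffic.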
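To make the difference concrete, here is a minimal sketch of a spike check that compares the current window against a recent baseline instead of a fixed number. The function name, the `factor` and `min_count` parameters, and the window counts are all illustrative assumptions, not the product's actual rule engine.

```python
from statistics import mean


def is_spike(history, current, factor=3.0, min_count=5):
    """Return True when `current` is at least `factor` times the recent baseline.

    history   -- error counts from previous windows (hypothetical input)
    current   -- error count in the window being evaluated
    min_count -- floor that stops a jump from 0 to 2 errors from paging anyone
    """
    if current < min_count:
        return False
    baseline = mean(history) if history else 0.0
    # Clamp the baseline to 1 so an all-zero history does not make
    # every non-zero count look like a spike of infinite size.
    return current >= factor * max(baseline, 1.0)


quiet_night = [1, 0, 2, 1]       # errors per window at 2am
busy_afternoon = [40, 38, 45, 41]  # errors per window at 2pm

print(is_spike(quiet_night, 12))     # 12 errors against a baseline of 1
print(is_spike(busy_afternoon, 50))  # 50 errors against a baseline of 41
```

With these sample numbers, a static rule such as "more than 20 errors" would fire on every afternoon window and stay silent on the 2am jump, while the relative check does the opposite.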
Cooldowns are not laziness
Set **cooldownMinutes** so the same stack trace does not open five incidents. Pair it with maintenance windows during migrations, when a burst of errors is expected.
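The idea behind a cooldown can be sketched in a few lines: remember when each distinct error last opened an incident, and suppress repeats until the cooldown has elapsed. The `CooldownGate` class and the `fingerprint` key (for example, a hash of the stack trace) are hypothetical names used for illustration; only `cooldownMinutes` comes from the product.

```python
from datetime import datetime, timedelta


class CooldownGate:
    """Suppress repeat incidents for the same fingerprint within a cooldown."""

    def __init__(self, cooldown_minutes):
        self.cooldown = timedelta(minutes=cooldown_minutes)
        self.last_fired = {}  # fingerprint -> time the last incident opened

    def should_open_incident(self, fingerprint, now):
        last = self.last_fired.get(fingerprint)
        if last is not None and now - last < self.cooldown:
            return False  # still cooling down: same error, same incident
        self.last_fired[fingerprint] = now
        return True


gate = CooldownGate(cooldown_minutes=15)
start = datetime(2024, 1, 7, 2, 0)

# The same stack trace arrives five times, one minute apart.
opened = [
    gate.should_open_incident("trace-abc", start + timedelta(minutes=i))
    for i in range(5)
]
print(opened.count(True))  # number of incidents actually opened
```

A different fingerprint is tracked separately, so a genuinely new failure still pages during another error's cooldown.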
One channel per severity
Route ERROR spikes to Slack and keep email for daily digests. On-call schedules belong on Scale when rotations matter.
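As a rough sketch, the routing above amounts to a small lookup table with exactly one destination per kind of alert. The keys and channel names below are assumptions made for illustration, not the product's configuration format.

```python
# Hypothetical routing table: one destination per kind of alert.
ROUTES = {
    "error_spike": "slack",
    "daily_digest": "email",
}


def channel_for(alert_kind):
    """Return the single channel configured for this kind of alert."""
    try:
        return ROUTES[alert_kind]
    except KeyError:
        # Fail loudly instead of silently falling back to a noisy default.
        raise ValueError(f"no channel configured for {alert_kind!r}") from None


print(channel_for("error_spike"))
print(channel_for("daily_digest"))
```

Keeping the table this small is the point: if an alert kind has no obvious row, that is a prompt to decide where it belongs before it starts firing.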
Test with a sample log
Send a sample ERROR from Integration, confirm the rule fires once, then tune the condition.