Phase 3 sub-spec covering the alerts engine, notification channels, and
UI (P3-05/06/07). Brainstorm ran 2026-05-04; all ten design decisions
were locked before this spec was written.
Key decisions captured:
- Hardcoded rule set, no operator-tunable thresholds in v1. Six rules:
backup_failed, forget_failed, prune_failed, check_failed,
stale_schedule, agent_offline.
- Hybrid engine cadence: event hooks at MarkJobFinished + offline-sweeper
for immediate triggers; one 60s ticker for stale-schedule detection +
auto-resolution sweeps.
- Auto-resolve when underlying condition clears; manual Resolve any time;
Acknowledge as a separate I-have-seen-it intermediate state that does
NOT close the alert.
- v1 channels: native ntfy + webhook. Apprise + SMTP deferred. Channel
scope is global only — no per-host or per-severity routing.
- Webhook payload is one stable JSON envelope shape across raised /
acknowledged / resolved / test events; ntfy uses the standard publish
format with severity → priority mapping.
- Per-channel Send Test Notification button hits the real send path with
a synthetic info-severity event; inline green-tick / red-cross result.
- Dedup by (host_id, kind, resolved_at IS NULL); last_seen_at bumped on
every confirming tick so the UI can render "still happening · Ns ago"
without re-notifying.
- Top-level /alerts page; Settings shell with Notifications sub-tab.
The per-host vitals "Open alerts" cell deep-links into the filtered list.
- Best-effort fire-and-forget delivery with 5s timeout; failures logged
to a new notification_log table but never retried. Alert row in the DB
is the source of truth.
Migrations:
- 0013 adds alerts.last_seen_at (column-level ALTER per CLAUDE.md)
- 0014 adds notification_channels + notification_log tables
Wireframe: _diag/p3-alerts-wireframe/wireframe.html