restic-manager/docs/superpowers/specs
steve 6165e34f6f docs: P3 alerts design spec
Phase 3 sub-spec covering the alerts engine, notification channels, and
UI (P3-05/06/07). Brainstorm ran 2026-05-04; all ten design decisions
locked before this spec was written.

Key decisions captured:

- Hardcoded rule set, no operator-tunable thresholds in v1. Six rules:
  backup_failed, forget_failed, prune_failed, check_failed,
  stale_schedule, agent_offline.
- Hybrid engine cadence: event hooks at MarkJobFinished + offline-sweeper
  for immediate triggers; one 60s ticker for stale-schedule detection +
  auto-resolution sweeps.
- Auto-resolve when underlying condition clears; manual Resolve any time;
  Acknowledge as a separate I-have-seen-it intermediate state that does
  NOT close the alert.
- v1 channels: native ntfy + webhook. Apprise + SMTP deferred. Channel
  scope is global only — no per-host or per-severity routing.
- Webhook payload is one stable JSON envelope shape across raised /
  acknowledged / resolved / test events; ntfy uses the standard publish
  format with severity → priority mapping.
- Per-channel Send Test Notification button hits the real send path with
  a synthetic info-severity event; inline green-tick / red-cross result.
- Dedup by (host_id, kind, resolved_at IS NULL); last_seen_at bumped on
  every confirming tick so the UI can render "still happening · Ns ago"
  without re-notifying.
- Top-level /alerts page; Settings shell with Notifications sub-tab.
  Per-host vitals Open alerts cell deep-links into filtered list.
- Best-effort fire-and-forget delivery with 5s timeout; failures logged
  to a new notification_log table but never retried. Alert row in the DB
  is the source of truth.
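
The dedup and acknowledge/resolve decisions above can be sketched as a small
in-memory model. This is an illustration only, not the spec's implementation:
the class and method names (AlertStore, confirm, acknowledge, resolve), the
acknowledged_at field, and the use of strings for host_id are assumptions
made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Alert:
    # Field names other than host_id, kind, last_seen_at and resolved_at
    # are hypothetical.
    host_id: str
    kind: str
    raised_at: float
    last_seen_at: float
    acknowledged_at: Optional[float] = None
    resolved_at: Optional[float] = None


class AlertStore:
    """In-memory stand-in for the alerts table (hypothetical)."""

    def __init__(self) -> None:
        # Only open alerts live here, so the key (host_id, kind) mirrors
        # the dedup key (host_id, kind, resolved_at IS NULL).
        self._open: Dict[Tuple[str, str], Alert] = {}
        self.history: List[Alert] = []

    def confirm(self, host_id: str, kind: str, now: float) -> bool:
        """Record that the condition holds; return True if a new alert was raised."""
        alert = self._open.get((host_id, kind))
        if alert is not None:
            # Already open: bump last_seen_at, do not notify again.
            alert.last_seen_at = now
            return False
        self._open[(host_id, kind)] = Alert(host_id, kind, now, now)
        return True

    def acknowledge(self, host_id: str, kind: str, now: float) -> None:
        # Acknowledge marks the alert as seen but leaves it open.
        self._open[(host_id, kind)].acknowledged_at = now

    def resolve(self, host_id: str, kind: str, now: float) -> None:
        # Used for both auto-resolve and manual Resolve in this sketch.
        alert = self._open.pop((host_id, kind), None)
        if alert is not None:
            alert.resolved_at = now
            self.history.append(alert)

    def get_open(self, host_id: str, kind: str) -> Optional[Alert]:
        return self._open.get((host_id, kind))
```

In this model a second confirm() for the same host and kind returns False and
only moves last_seen_at, and an acknowledged alert still dedups later ticks
until it is resolved.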

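One way to picture the webhook envelope and the ntfy severity → priority
mapping is the sketch below. The envelope field names, the severity names
other than info, and the specific priority numbers are assumptions for
illustration; the spec itself defines the real shape. The Title and Priority
headers and the 1 to 5 priority range are standard ntfy publish features.

```python
from typing import Dict, Optional

# The four event types named in the spec.
EVENTS = ("raised", "acknowledged", "resolved", "test")

# Hypothetical mapping: only "info" is named in the spec. ntfy priorities
# run from 1 (min) to 5 (max), with 3 as the default.
NTFY_PRIORITY = {"info": 3, "warning": 4, "critical": 5}


def build_envelope(event: str, kind: str, severity: str,
                   host_id: Optional[str], message: str) -> dict:
    """Build one envelope shape for every event type (field names assumed)."""
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event}")
    return {
        "event": event,
        "alert": {
            "kind": kind,
            "severity": severity,
            "host_id": host_id,  # None for a synthetic test event
            "message": message,
        },
    }


def ntfy_headers(severity: str, title: str) -> Dict[str, str]:
    """Translate a severity into ntfy publish headers."""
    priority = NTFY_PRIORITY.get(severity, 3)  # unknown severity: default
    return {"Title": title, "Priority": str(priority)}
```

Keeping every event on the same set of keys means a receiver can parse
raised, acknowledged, resolved, and test payloads with a single schema.
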
Migrations:
- 0013 adds alerts.last_seen_at (column-level ALTER per CLAUDE.md)
- 0014 adds notification_channels + notification_log tables
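
A column-level ALTER like the one in 0013 might look as follows. This sketch
assumes a SQLite database, which the commit message does not state; the
pre-existing alerts columns and the TEXT column type are also hypothetical
and serve only to make the example runnable.

```python
import sqlite3
from typing import List

# Hypothetical statement in the spirit of migration 0013: add one column
# in place rather than rebuilding the table.
MIGRATION_0013 = "ALTER TABLE alerts ADD COLUMN last_seen_at TEXT"


def column_names(conn: sqlite3.Connection, table: str) -> List[str]:
    # PRAGMA table_info returns one row per column; index 1 is the name.
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


def apply_demo() -> List[str]:
    conn = sqlite3.connect(":memory:")
    # Stand-in for the pre-0013 alerts table (columns assumed).
    conn.execute(
        "CREATE TABLE alerts ("
        "id INTEGER PRIMARY KEY, host_id TEXT, kind TEXT, resolved_at TEXT)"
    )
    conn.execute(MIGRATION_0013)
    cols = column_names(conn, "alerts")
    conn.close()
    return cols
```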

Wireframe: _diag/p3-alerts-wireframe/wireframe.html
2026-05-04 18:39:26 +01:00