From 6165e34f6fef31a8811670ac5e699d87daeaf38b Mon Sep 17 00:00:00 2001 From: Steve Cliff Date: Mon, 4 May 2026 18:39:26 +0100 Subject: [PATCH] docs: P3 alerts design spec MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Phase 3 sub-spec covering the alerts engine, notification channels, and UI (P3-05/06/07). Brainstorm ran 2026-05-04; all ten design decisions locked before this spec was written. Key decisions captured: - Hardcoded rule set, no operator-tunable thresholds in v1. Six rules: backup_failed, forget_failed, prune_failed, check_failed, stale_schedule, agent_offline. - Hybrid engine cadence: event hooks at MarkJobFinished + offline-sweeper for immediate triggers; one 60s ticker for stale-schedule detection + auto-resolution sweeps. - Auto-resolve when underlying condition clears; manual Resolve any time; Acknowledge as a separate I-have-seen-it intermediate state that does NOT close the alert. - v1 channels: native ntfy + webhook. Apprise + SMTP deferred. Channel scope is global only — no per-host or per-severity routing. - Webhook payload is one stable JSON envelope shape across raised / acknowledged / resolved / test events; ntfy uses the standard publish format with severity → priority mapping. - Per-channel Send Test Notification button hits the real send path with a synthetic info-severity event; inline green-tick / red-cross result. - Dedup by (host_id, kind, resolved_at IS NULL); last_seen_at bumped on every confirming tick so the UI can render still happening · Ns ago without re-notifying. - Top-level /alerts page; Settings shell with Notifications sub-tab. Per-host vitals Open alerts cell deep-links into filtered list. - Best-effort fire-and-forget delivery with 5s timeout; failures logged to a new notification_log table but never retried. Alert row in the DB is the source of truth. Migrations: - 0013 adds alerts.last_seen_at (column-level ALTER per CLAUDE.md) - 0014 adds notification_channels + notification_log tables Wireframe: _diag/p3-alerts-wireframe/wireframe.html --- .../specs/2026-05-04-p3-alerts-design.md | 387 ++++++++++++++++++ 1 file changed, 387 insertions(+) create mode 100644 docs/superpowers/specs/2026-05-04-p3-alerts-design.md diff --git a/docs/superpowers/specs/2026-05-04-p3-alerts-design.md b/docs/superpowers/specs/2026-05-04-p3-alerts-design.md new file mode 100644 index 0000000..b88ca95 --- /dev/null +++ b/docs/superpowers/specs/2026-05-04-p3-alerts-design.md @@ -0,0 +1,387 @@ +# P3 — Alerts (design) + +> Phase 3 sub-spec covering the alerts engine, notification channels, and UI +> (P3-05 / P3-06 / P3-07). +> +> Wireframe: `_diag/p3-alerts-wireframe/wireframe.html`. Screenshots in the +> same directory. Spec brainstorm ran 2026-05-04; user approved all ten +> design decisions before this spec was written. + +## Scope locked + +Brainstorm decisions (in order asked): + +1. **Rule model.** Hardcoded rule set, no operator-tunable thresholds in v1. + The engine knows about each rule type internally; per-rule config can land + later if/when an operator asks. +2. **Rule set.** Six rules: `backup_failed`, `forget_failed`, `prune_failed`, + `check_failed`, `stale_schedule`, `agent_offline`. +3. **Engine cadence.** Hybrid. Event hooks at the existing + `MarkJobFinished` and offline-sweeper sites for the immediate triggers; + one 60-second ticker handles stale-schedule detection and auto-resolution. +4. **Resolution.** Auto-resolve when the underlying condition clears + manual + Resolve at any time. Acknowledge is a separate "I've seen it" intermediate + state that does NOT close the alert. +5. **v1 channels.** Webhook + native ntfy. Apprise deferred (the channel + plumbing accepts new kinds without reshaping). +6. **Channel scope.** Global only. No per-host or per-severity routing in v1. +7. **Notification body.** Structured JSON for webhooks, formatted + title+body+click-URL for ntfy, plus a per-channel "Send test notification" + button with inline result feedback. +8. **Deduplication.** Open-alert uniqueness on `(host_id, kind)` with a + `last_seen_at` bump on every confirming tick. One notification per + occurrence; the UI shows "still happening · Ns ago" while a rule keeps + matching. +9. **Alert UI.** Top-level `/alerts` page (the existing nav stub becomes + real). Per-host vitals "Open alerts" cell links to `/alerts?host_id=...`. + Channel CRUD lives at `/settings/notifications`. +10. **Delivery semantics.** Best-effort fire-and-forget with a 5s timeout + per notification. Failures are logged but not retried. The alert row in + the DB is the source of truth. + +## Architecture + +The subsystem is three loosely-coupled units behind one `AlertEngine` +goroutine: + +``` + ┌───────────────────────────┐ + event hooks ─────────────────►│ │ + │ AlertEngine │ ──► raise/resolve + 60s ticker ──────────────────►│ (rule evaluation) │ alert row + │ │ + └────────────┬──────────────┘ + │ + ▼ + ┌──────────────────────┐ + │ notification.Hub │ + │ (fire-and-forget) │ + └──┬────────┬──────────┘ + │ │ + ┌──────▼──┐ ┌──▼──────┐ + │ Webhook │ │ Ntfy │ …future channels + └─────────┘ └─────────┘ +``` + +### Component boundaries + +| Component | Purpose | Depends on | +| ---------------------------------------- | ---------------------------------------------------------------------------------------- | -------------------------------------- | +| `internal/alert.Engine` | Owns the rule evaluation. Exposes `OnJobFinished`, `OnHostOffline`, `OnHostOnline` event hooks; runs a 60s ticker for stale-schedule + auto-resolution sweeps. Persists raises/resolves through the store. | store, notification.Hub, slog | +| `internal/alert.Rule` + per-rule files | Each of the six rules is a small struct with `Kind() string`, `Severity() string`, `MessageFor(ctx) string`. The engine iterates over a registered slice. | store models | +| `internal/notification.Hub` | Receives "alert raised/resolved/test" events; fans out to enabled channels in parallel; logs results to a new `notification_log` table. | store, channel adapters | +| `internal/notification.Channel` (iface) | Single method `Send(ctx, payload) error` with a 5s context. Two impls in v1: `webhookChannel`, `ntfyChannel`. | http.Client | +| `internal/store/alerts.go` | CRUD on `alerts` table: `RaiseOrTouch(host_id, kind, severity, message)`, `Acknowledge(id, user)`, `Resolve(id, by user)`, `AutoResolve(host_id, kind)`, `ListAlerts(filter)`, plus the `last_seen_at` bump. | sqlite | +| `internal/store/notification_channels.go` | CRUD on `notification_channels` (new table) + `notification_log` (new table). | sqlite, crypto.AEAD (for secrets) | +| `internal/server/http/ui_alerts.go` | `/alerts` page handler + filter parsing + ack/resolve form actions. | store | +| `internal/server/http/ui_notifications.go` | `/settings/notifications` page + channel CRUD + "Send test" handler. | store, notification.Hub | + +### Engine event shape + +The engine runs as one goroutine per server process started in +`cmd/server/main.go`. It exposes a small set of channels other code writes to: + +```go +type Engine struct { + store *store.Store + hub *notification.Hub + + // Event channels (buffered, drop-on-full with a slog warning to keep + // hot paths non-blocking). The engine drains them on its own + // goroutine, evaluates the rule, and acts. + jobFinished chan jobFinishedEvent // from store.MarkJobFinished hook + hostOffline chan string // host_id; from offline sweeper + hostOnline chan string // host_id; from ws handler hello + + // 60s ticker drives stale-schedule + auto-resolution sweeps. + tick *time.Ticker +} +``` + +The hot-path call sites (`store.MarkJobFinished`, `ws.handler` offline +sweep, `ws.handler` hello) push to these channels via a tiny +`Engine.Notify*` method that does a non-blocking send. The engine's own +goroutine handles every match — keeps mutation off the hot path. + +### Rule catalogue + +| Kind | Severity | Trigger | Auto-resolve when | +| ------------------- | -------- | ----------------------------------------------------------------------- | -------------------------------------------------- | +| `backup_failed` | warning | `MarkJobFinished` with kind=backup, status=failed | next backup for the same host succeeds | +| `forget_failed` | warning | `MarkJobFinished` with kind=forget, status=failed | next forget for the same host succeeds | +| `prune_failed` | warning | `MarkJobFinished` with kind=prune, status=failed | next prune for the same host succeeds | +| `check_failed` | critical | `MarkJobFinished` with kind=check, status=failed OR errors_found | next check for the same host succeeds without errors | +| `stale_schedule` | warning | 60s ticker: a schedule's next-fire time is more than 5 minutes in the past with no matching job since | next job for that schedule succeeds OR schedule deleted | +| `agent_offline` | warning | offline-sweeper marks the host offline AND the host has been offline > 15 min (engine checks `last_seen_at`) | hostOnline event for that host | + +The 15-minute floor on `agent_offline` exists so a 30-second blip during +agent restart doesn't generate a notification storm. The store's existing +offline sweeper (`hosts.last_seen_at` with 90s threshold) already marks the +host offline; the engine sees the event but waits for the threshold before +raising. + +### Dedup + last_seen_at + +`store.RaiseOrTouch(host_id, kind, severity, message)`: + +```sql +SELECT id, last_seen_at FROM alerts + WHERE host_id = ? AND kind = ? AND resolved_at IS NULL + LIMIT 1; +``` + +- Found: `UPDATE alerts SET last_seen_at = ?, message = ? WHERE id = ?`, + return `(id, didRaise=false)`. +- Not found: `INSERT INTO alerts (id, host_id, kind, severity, message, + created_at, last_seen_at) VALUES (?, ?, ?, ?, ?, ?, ?)`, return + `(id, didRaise=true)`. + +The engine fires a notification through the Hub only when `didRaise=true`. +Touch-only events keep the row's `last_seen_at` fresh so the UI can render +"still happening · Ns ago" without spamming the operator's phone. + +### Notification payload shapes + +**Webhook** — a single JSON envelope per event: + +```json +{ + "event": "alert.raised", + "alert_id": "01KQT...", + "severity": "warning", + "kind": "backup_failed", + "host_id": "01KQ...", + "host_name": "alfa-01", + "message": "Backup 'system-config' failed: rest-server returned 401", + "raised_at": "2026-05-04T15:42:01Z", + "link": "https://restic-manager.example/alerts/01KQT..." +} +``` + +`event` is one of `alert.raised | alert.acknowledged | alert.resolved | +alert.test`. The same envelope shape is reused across events — operators +build one bridge, switch on `event` and `severity`. + +**Ntfy** — uses the standard publish format: + +``` +POST / HTTP/1.1 +Host: +Authorization: Bearer (if configured) +Title: [warning] alfa-01 backup failed +Priority: 4 +Tags: warning,backup_failed +Click: https://restic-manager.example/alerts/01KQT... + +Backup 'system-config' failed: rest-server returned 401 +``` + +Severity → priority mapping: + +| Severity | Priority | +| --------- | -------- | +| info | 3 (default) | +| warning | 4 (high) | +| critical | 5 (urgent) | + +Per-channel `default_priority` setting overrides for non-critical alerts; +critical always goes urgent regardless. + +### Test notification + +`POST /api/notifications/{channel_id}/test` builds a synthetic event +(severity=info, kind=test_notification, message="Test from +restic-manager", link to the channel's edit page) and runs it through the +real send path. Returns `{ok: bool, latency_ms: int, status_code?: int, +error?: string}`. UI renders the green ✓ / red ✗ feedback inline. + +## Routes added + +| Method | Path | Purpose | +| ------- | ----------------------------------------------------- | ------------------------------------------------------------- | +| GET | `/alerts` | Fleet alerts list with filters (`?status=open&severity=warning&host_id=...&q=...`) | +| POST | `/alerts/{id}/acknowledge` | Mark alert acknowledged (HTMX form) | +| POST | `/alerts/{id}/resolve` | Manual resolve (HTMX form) | +| GET | `/settings/notifications` | Channel list page | +| GET | `/settings/notifications/new` | Channel kind picker + empty form | +| POST | `/settings/notifications/new` | Validate + create + redirect | +| GET | `/settings/notifications/{id}/edit` | Channel edit form | +| POST | `/settings/notifications/{id}/edit` | Validate + update | +| POST | `/settings/notifications/{id}/delete` | Delete channel (typed-confirm name in the form) | +| POST | `/api/notifications/{id}/test` | Fire test notification, return JSON result | +| GET | `/api/alerts` | JSON list (mirrors the UI filters) for future REST callers | + +## Data model + +### Migration 0013 — alerts.last_seen_at + +```sql +ALTER TABLE alerts ADD COLUMN last_seen_at TEXT; +UPDATE alerts SET last_seen_at = created_at WHERE last_seen_at IS NULL; +``` + +Existing alerts (currently zero in production — nothing writes them yet) +get `last_seen_at = created_at`. Column is nullable for forwards-compat +with rows from the alert-engine-pre-bump period. + +### Migration 0014 — notification_channels + notification_log + +```sql +CREATE TABLE notification_channels ( + id TEXT PRIMARY KEY, + kind TEXT NOT NULL CHECK (kind IN ('webhook', 'ntfy')), + name TEXT NOT NULL, + enabled INTEGER NOT NULL DEFAULT 1 CHECK (enabled IN (0, 1)), + config BLOB NOT NULL, -- AEAD-encrypted JSON; per-kind shape + default_priority TEXT, -- ntfy only; null for others + created_at TEXT NOT NULL, + updated_at TEXT NOT NULL, + last_fired_at TEXT +); + +CREATE INDEX notification_channels_enabled ON notification_channels(enabled) WHERE enabled = 1; + +CREATE TABLE notification_log ( + id TEXT PRIMARY KEY, + channel_id TEXT NOT NULL REFERENCES notification_channels(id) ON DELETE CASCADE, + alert_id TEXT REFERENCES alerts(id) ON DELETE SET NULL, + event TEXT NOT NULL, -- alert.raised | alert.acknowledged | alert.resolved | alert.test + ok INTEGER NOT NULL CHECK (ok IN (0, 1)), + status_code INTEGER, + latency_ms INTEGER, + error TEXT, + fired_at TEXT NOT NULL +); + +CREATE INDEX notification_log_channel ON notification_log(channel_id, fired_at DESC); +CREATE INDEX notification_log_alert ON notification_log(alert_id); +``` + +`config` is an AEAD-encrypted JSON blob — bearer tokens for webhooks and +access tokens for ntfy live there. Per-kind config shapes: + +```go +type webhookConfig struct { + URL string `json:"url"` + BearerToken string `json:"bearer_token,omitempty"` + HeaderName string `json:"header_name,omitempty"` + HeaderValue string `json:"header_value,omitempty"` +} + +type ntfyConfig struct { + ServerURL string `json:"server_url"` // default https://ntfy.sh + Topic string `json:"topic"` + AccessToken string `json:"access_token,omitempty"` +} +``` + +### Engine state + +The engine itself is stateless beyond the channels it owns; all +persisted state is in the existing `alerts` table + the new +`notification_log` table. A process restart re-evaluates from scratch: +on next tick the stale-schedule + auto-resolution sweeps catch up with +whatever happened during the downtime. No outbox to drain. + +## UI templates + +| Template | Purpose | +| ----------------------------------------- | ------------------------------------------------------ | +| `web/templates/pages/alerts.html` | Fleet alerts page | +| `web/templates/partials/alert_row.html` | One alert row (used by both list and detail-fragment swap) | +| `web/templates/pages/settings.html` | Settings shell with Notifications / Users / Auth sub-tabs | +| `web/templates/pages/notifications.html` | Channel list (Notifications sub-tab body) | +| `web/templates/pages/notification_edit.html` | Channel kind picker + per-kind form + test button + payload preview | +| `web/templates/partials/crit_banner.html` | Dashboard top-of-page banner | +| `web/templates/partials/nav.html` | Existing — gain a `data-alerts-count` attribute on the Alerts tab so the badge auto-updates | + +The Settings shell + Notifications sub-tab is the new chrome the wireframe +introduced; Users + Authentication tabs are placeholder links that 404 in +v1 (or render an "Lands later" notice). Same pattern P2R-02 used for +inert sub-tabs. + +## Tests (target coverage) + +- `internal/alert/engine_test.go` — rule firing per kind: backup_failed + raises on `MarkJobFinished(kind=backup, status=failed)`; touch-only on + the second failure for the same host (no second notification); + auto-resolve on next success. +- `internal/alert/agent_offline_test.go` — `OnHostOffline` emits without + raising until the 15-min floor; `OnHostOnline` clears the alert. +- `internal/alert/stale_schedule_test.go` — synthetic schedule whose next + fire is in the past triggers; resets when a job lands. +- `internal/notification/webhook_test.go` — payload shape pinned; + authorisation header sent when bearer set; custom header echoed; 5s + timeout enforced; error in `notification_log`. +- `internal/notification/ntfy_test.go` — title/priority/tags/click headers + match the severity mapping; access token sent as `Authorization: Bearer + `; default priority overridden by severity for critical. +- `internal/server/http/ui_alerts_test.go` — page renders with filters + applied; ack/resolve POSTs flip the row + write audit; HX-Redirect + bounces back to the filtered list. +- `internal/server/http/ui_notifications_test.go` — CRUD happy paths, + validation re-render, secrets-encrypted-at-rest assertion (load row, + decrypt, compare), test-button hits the real send path against a + test http.Server. +- Migration 0013 + 0014 round-trip tested via `store.Open` on a fresh + db. + +## Playwright sweep + +End-of-phase sweep mirrors the P2R-02 / P3-restore pattern: + +1. Login → `/alerts` (initially empty) → see "All clear · last alert + never" empty state. +2. Trigger a fake-failed-backup via `POST /api/hosts/{id}/jobs` against a + host with a deliberately-wrong rest-server URL. Wait for the + `backup_failed` alert to appear in the list within ~2s of the job + finishing. +3. Acknowledge → row tints + ack actor visible. +4. Take the agent offline (`systemctl stop`); wait 15 min OR mock + `last_seen_at` to 16 min ago via the test harness; confirm + `agent_offline` alert raises once. +5. Restart the agent → `agent_offline` auto-resolves; `backup_failed` is + still open. +6. Configure a webhook channel pointing at a local test sink; click "Send + test" → green ✓. +7. Configure a ntfy channel pointing at a local sink → click "Send test" + → green ✓. +8. Trigger a fresh failed backup → both channels receive the notification + (verified from sink logs); `notification_log` has two rows + `event=alert.raised, ok=true`. +9. Manually Resolve the open `backup_failed`; confirm both channels + receive `event=alert.resolved`. +10. Critical-severity test: trigger `check_failed` (mocked) → dashboard + banner appears; clicking it lands on `/alerts?severity=critical&status=open`. +11. Empty the alerts again → banner disappears. + +Screenshots into `_diag/p3-alerts-sweep/`. End-to-end clean, zero console +errors, before handing back. + +## What does NOT change + +- Existing chrome/templates beyond the small additions noted above. +- Existing `alerts.severity` CHECK (`info`/`warning`/`critical`) — already + the right shape; no migration needed for that. +- Audit log writer pattern — engine writes audit rows for ack/resolve + the same way every other state-changing handler does. +- The agent. Alerts are entirely a server concern; the agent doesn't + know they exist. + +## Open questions / explicit non-goals + +- **Per-rule cooldowns / re-raise on long-running issues.** Out of scope + (brainstorm question 8 ruled this out). Operators see "still happening" + in the UI; they don't get a reminder ping. +- **SMTP / email channel.** Out of scope. Operators wanting email today + can chain webhook → email-gateway; native SMTP can land later. +- **Apprise sidecar integration.** Deferred per brainstorm. The + `Channel` interface accepts a third impl without reshaping when we get + there. +- **Per-host or per-severity channel routing.** Out of scope. Likely + next step if operators ask: a `min_severity` field on the channel row. +- **Snooze / mute.** Out of scope. Acknowledge is the closest analogue; + full silence-windows would need a new table and is YAGNI for v1. +- **PagerDuty / OpsGenie.** Both have webhook receivers; operators wire + them via the webhook channel today. +- **Alert "rules" UI.** No CRUD; the rule set is hardcoded.