# P3 — Alerts (design) > Phase 3 sub-spec covering the alerts engine, notification channels, and UI > (P3-05 / P3-06 / P3-07). > > Wireframe: `_diag/p3-alerts-wireframe/wireframe.html`. Screenshots in the > same directory. Spec brainstorm ran 2026-05-04; user approved all ten > design decisions before this spec was written. ## Scope locked Brainstorm decisions (in order asked): 1. **Rule model.** Hardcoded rule set, no operator-tunable thresholds in v1. The engine knows about each rule type internally; per-rule config can land later if/when an operator asks. 2. **Rule set.** Six rules: `backup_failed`, `forget_failed`, `prune_failed`, `check_failed`, `stale_schedule`, `agent_offline`. 3. **Engine cadence.** Hybrid. Event hooks at the existing `MarkJobFinished` and offline-sweeper sites for the immediate triggers; one 60-second ticker handles stale-schedule detection and auto-resolution. 4. **Resolution.** Auto-resolve when the underlying condition clears + manual Resolve at any time. Acknowledge is a separate "I've seen it" intermediate state that does NOT close the alert. 5. **v1 channels.** Webhook + native ntfy + SMTP. Apprise deferred (the channel plumbing accepts new kinds without reshaping). SMTP added as a first-class channel post-brainstorm because the use case — overnight alerts the operator wants to read in the morning rather than be pinged on at 03:00 — is poorly served by ntfy's push model and clumsy via webhook → email-gateway. 6. **Channel scope.** Global only. No per-host or per-severity routing in v1. 7. **Notification body.** Structured JSON for webhooks, formatted title+body+click-URL for ntfy, plus a per-channel "Send test notification" button with inline result feedback. 8. **Deduplication.** Open-alert uniqueness on `(host_id, kind)` with a `last_seen_at` bump on every confirming tick. One notification per occurrence; the UI shows "still happening · Ns ago" while a rule keeps matching. 9. **Alert UI.** Top-level `/alerts` page (the existing nav stub becomes real). Per-host vitals "Open alerts" cell links to `/alerts?host_id=...`. Channel CRUD lives at `/settings/notifications`. 10. **Delivery semantics.** Best-effort fire-and-forget with a 5s timeout per notification. Failures are logged but not retried. The alert row in the DB is the source of truth. ## Architecture The subsystem is three loosely-coupled units behind one `AlertEngine` goroutine: ``` ┌───────────────────────────┐ event hooks ─────────────────►│ │ │ AlertEngine │ ──► raise/resolve 60s ticker ──────────────────►│ (rule evaluation) │ alert row │ │ └────────────┬──────────────┘ │ ▼ ┌──────────────────────┐ │ notification.Hub │ │ (fire-and-forget) │ └──┬────────┬──────────┘ │ │ ┌──────▼──┐ ┌──▼──────┐ │ Webhook │ │ Ntfy │ …future channels └─────────┘ └─────────┘ ``` ### Component boundaries | Component | Purpose | Depends on | | ---------------------------------------- | ---------------------------------------------------------------------------------------- | -------------------------------------- | | `internal/alert.Engine` | Owns the rule evaluation. Exposes `OnJobFinished`, `OnHostOffline`, `OnHostOnline` event hooks; runs a 60s ticker for stale-schedule + auto-resolution sweeps. Persists raises/resolves through the store. | store, notification.Hub, slog | | `internal/alert.Rule` + per-rule files | Each of the six rules is a small struct with `Kind() string`, `Severity() string`, `MessageFor(ctx) string`. The engine iterates over a registered slice. | store models | | `internal/notification.Hub` | Receives "alert raised/resolved/test" events; fans out to enabled channels in parallel; logs results to a new `notification_log` table. | store, channel adapters | | `internal/notification.Channel` (iface) | Single method `Send(ctx, payload) error` with a 5s context for HTTP channels, 10s for SMTP. Three impls in v1: `webhookChannel`, `ntfyChannel`, `smtpChannel`. | http.Client; net/smtp + crypto/tls for SMTP | | `internal/store/alerts.go` | CRUD on `alerts` table: `RaiseOrTouch(host_id, kind, severity, message)`, `Acknowledge(id, user)`, `Resolve(id, by user)`, `AutoResolve(host_id, kind)`, `ListAlerts(filter)`, plus the `last_seen_at` bump. | sqlite | | `internal/store/notification_channels.go` | CRUD on `notification_channels` (new table) + `notification_log` (new table). | sqlite, crypto.AEAD (for secrets) | | `internal/server/http/ui_alerts.go` | `/alerts` page handler + filter parsing + ack/resolve form actions. | store | | `internal/server/http/ui_notifications.go` | `/settings/notifications` page + channel CRUD + "Send test" handler. | store, notification.Hub | ### Engine event shape The engine runs as one goroutine per server process started in `cmd/server/main.go`. It exposes a small set of channels other code writes to: ```go type Engine struct { store *store.Store hub *notification.Hub // Event channels (buffered, drop-on-full with a slog warning to keep // hot paths non-blocking). The engine drains them on its own // goroutine, evaluates the rule, and acts. jobFinished chan jobFinishedEvent // from store.MarkJobFinished hook hostOffline chan string // host_id; from offline sweeper hostOnline chan string // host_id; from ws handler hello // 60s ticker drives stale-schedule + auto-resolution sweeps. tick *time.Ticker } ``` The hot-path call sites (`store.MarkJobFinished`, `ws.handler` offline sweep, `ws.handler` hello) push to these channels via a tiny `Engine.Notify*` method that does a non-blocking send. The engine's own goroutine handles every match — keeps mutation off the hot path. ### Rule catalogue | Kind | Severity | Trigger | Auto-resolve when | | ------------------- | -------- | ----------------------------------------------------------------------- | -------------------------------------------------- | | `backup_failed` | warning | `MarkJobFinished` with kind=backup, status=failed | next backup for the same host succeeds | | `forget_failed` | warning | `MarkJobFinished` with kind=forget, status=failed | next forget for the same host succeeds | | `prune_failed` | warning | `MarkJobFinished` with kind=prune, status=failed | next prune for the same host succeeds | | `check_failed` | critical | `MarkJobFinished` with kind=check, status=failed OR errors_found | next check for the same host succeeds without errors | | `stale_schedule` | warning | 60s ticker: a schedule's next-fire time is more than 5 minutes in the past with no matching job since | next job for that schedule succeeds OR schedule deleted | | `agent_offline` | warning | offline-sweeper marks the host offline AND the host has been offline > 15 min (engine checks `last_seen_at`) | hostOnline event for that host | The 15-minute floor on `agent_offline` exists so a 30-second blip during agent restart doesn't generate a notification storm. The store's existing offline sweeper (`hosts.last_seen_at` with 90s threshold) already marks the host offline; the engine sees the event but waits for the threshold before raising. ### Dedup + last_seen_at `store.RaiseOrTouch(host_id, kind, severity, message)`: ```sql SELECT id, last_seen_at FROM alerts WHERE host_id = ? AND kind = ? AND resolved_at IS NULL LIMIT 1; ``` - Found: `UPDATE alerts SET last_seen_at = ?, message = ? WHERE id = ?`, return `(id, didRaise=false)`. - Not found: `INSERT INTO alerts (id, host_id, kind, severity, message, created_at, last_seen_at) VALUES (?, ?, ?, ?, ?, ?, ?)`, return `(id, didRaise=true)`. The engine fires a notification through the Hub only when `didRaise=true`. Touch-only events keep the row's `last_seen_at` fresh so the UI can render "still happening · Ns ago" without spamming the operator's phone. ### Notification payload shapes **Webhook** — a single JSON envelope per event: ```json { "event": "alert.raised", "alert_id": "01KQT...", "severity": "warning", "kind": "backup_failed", "host_id": "01KQ...", "host_name": "alfa-01", "message": "Backup 'system-config' failed: rest-server returned 401", "raised_at": "2026-05-04T15:42:01Z", "link": "https://restic-manager.example/alerts/01KQT..." } ``` `event` is one of `alert.raised | alert.acknowledged | alert.resolved | alert.test`. The same envelope shape is reused across events — operators build one bridge, switch on `event` and `severity`. **SMTP** — single-recipient plain-text email per channel. The channel config carries the SMTP server credentials and a `to` address; one channel = one recipient (or one distribution-list address). Operators who want multiple recipients add multiple channels — keeps the config flat and the failure modes per-recipient. Subject pattern is hardcoded (no per-channel template in v1): ``` Subject: [restic-manager] [] : From: To: Date: Message-ID: > — Raised at: 2026-05-04T15:42:01Z Severity: warning Host: alfa-01 Kind: backup_failed Open in restic-manager: https://restic-manager.example/alerts/01KQT... (This message was sent by restic-manager. Acknowledge or resolve in the UI.) ``` The body is plain text only in v1 — no HTML alternative — both because the data is already structured well enough as text and because HTML email opens a long tail of rendering / sanitisation concerns. The `Message-ID` includes the alert id so a thread-aware client can group related events (raised → acknowledged → resolved) together. Encryption: - **STARTTLS** (default, port 587). Opportunistic upgrade. Most operator-facing relays. - **Implicit TLS** (port 465). Connect-then-TLS-handshake. - **None** (port 25). Plain. Hidden behind a "Yes I understand" warning on the form because the password goes over the wire. Auth: - **PLAIN** (RFC 4616) over TLS. Default and almost always what's wanted. - **CRAM-MD5** (RFC 2195). Offered if the server advertises it, no UI toggle — automatic. - No OAuth2 / XOAUTH2 in v1; that's a real next step if Gmail-without- app-passwords becomes a recurring ask. Per-message timeout is 10s (vs 5s for HTTP channels) — STARTTLS handshake + DATA over a slow link can legitimately take that long. **Ntfy** — uses the standard publish format: ``` POST / HTTP/1.1 Host: Authorization: Bearer (if configured) Title: [warning] alfa-01 backup failed Priority: 4 Tags: warning,backup_failed Click: https://restic-manager.example/alerts/01KQT... Backup 'system-config' failed: rest-server returned 401 ``` Severity → priority mapping: | Severity | Priority | | --------- | -------- | | info | 3 (default) | | warning | 4 (high) | | critical | 5 (urgent) | Per-channel `default_priority` setting overrides for non-critical alerts; critical always goes urgent regardless. ### Test notification `POST /api/notifications/{channel_id}/test` builds a synthetic event (severity=info, kind=test_notification, message="Test from restic-manager", link to the channel's edit page) and runs it through the real send path. Returns `{ok: bool, latency_ms: int, status_code?: int, error?: string}`. UI renders the green ✓ / red ✗ feedback inline. ## Routes added | Method | Path | Purpose | | ------- | ----------------------------------------------------- | ------------------------------------------------------------- | | GET | `/alerts` | Fleet alerts list with filters (`?status=open&severity=warning&host_id=...&q=...`) | | POST | `/alerts/{id}/acknowledge` | Mark alert acknowledged (HTMX form) | | POST | `/alerts/{id}/resolve` | Manual resolve (HTMX form) | | GET | `/settings/notifications` | Channel list page | | GET | `/settings/notifications/new` | Channel kind picker + empty form | | POST | `/settings/notifications/new` | Validate + create + redirect | | GET | `/settings/notifications/{id}/edit` | Channel edit form | | POST | `/settings/notifications/{id}/edit` | Validate + update | | POST | `/settings/notifications/{id}/delete` | Delete channel (typed-confirm name in the form) | | POST | `/api/notifications/{id}/test` | Fire test notification, return JSON result | | GET | `/api/alerts` | JSON list (mirrors the UI filters) for future REST callers | ## Data model ### Migration 0013 — alerts.last_seen_at ```sql ALTER TABLE alerts ADD COLUMN last_seen_at TEXT; UPDATE alerts SET last_seen_at = created_at WHERE last_seen_at IS NULL; ``` Existing alerts (currently zero in production — nothing writes them yet) get `last_seen_at = created_at`. Column is nullable for forwards-compat with rows from the alert-engine-pre-bump period. ### Migration 0014 — notification_channels + notification_log ```sql CREATE TABLE notification_channels ( id TEXT PRIMARY KEY, kind TEXT NOT NULL CHECK (kind IN ('webhook', 'ntfy', 'smtp')), name TEXT NOT NULL, enabled INTEGER NOT NULL DEFAULT 1 CHECK (enabled IN (0, 1)), config BLOB NOT NULL, -- AEAD-encrypted JSON; per-kind shape default_priority TEXT, -- ntfy only; null for webhook + smtp created_at TEXT NOT NULL, updated_at TEXT NOT NULL, last_fired_at TEXT ); CREATE INDEX notification_channels_enabled ON notification_channels(enabled) WHERE enabled = 1; CREATE TABLE notification_log ( id TEXT PRIMARY KEY, channel_id TEXT NOT NULL REFERENCES notification_channels(id) ON DELETE CASCADE, alert_id TEXT REFERENCES alerts(id) ON DELETE SET NULL, event TEXT NOT NULL, -- alert.raised | alert.acknowledged | alert.resolved | alert.test ok INTEGER NOT NULL CHECK (ok IN (0, 1)), status_code INTEGER, latency_ms INTEGER, error TEXT, fired_at TEXT NOT NULL ); CREATE INDEX notification_log_channel ON notification_log(channel_id, fired_at DESC); CREATE INDEX notification_log_alert ON notification_log(alert_id); ``` `config` is an AEAD-encrypted JSON blob — bearer tokens for webhooks and access tokens for ntfy live there. Per-kind config shapes: ```go type webhookConfig struct { URL string `json:"url"` BearerToken string `json:"bearer_token,omitempty"` HeaderName string `json:"header_name,omitempty"` HeaderValue string `json:"header_value,omitempty"` } type ntfyConfig struct { ServerURL string `json:"server_url"` // default https://ntfy.sh Topic string `json:"topic"` AccessToken string `json:"access_token,omitempty"` } type smtpConfig struct { Host string `json:"host"` // e.g. smtp.example.com Port int `json:"port"` // default 587 (STARTTLS), 465 (TLS), 25 (none) Encryption string `json:"encryption"` // "starttls" | "tls" | "none" Username string `json:"username"` Password string `json:"password"` // sensitive — AEAD-encrypted with the rest of config From string `json:"from"` // RFC 5322 address; "alerts@example.com" or "Restic-Manager " To string `json:"to"` // single recipient or distribution-list address; v1 = one channel = one to-line } ``` ### Engine state The engine itself is stateless beyond the channels it owns; all persisted state is in the existing `alerts` table + the new `notification_log` table. A process restart re-evaluates from scratch: on next tick the stale-schedule + auto-resolution sweeps catch up with whatever happened during the downtime. No outbox to drain. ## UI templates | Template | Purpose | | ----------------------------------------- | ------------------------------------------------------ | | `web/templates/pages/alerts.html` | Fleet alerts page | | `web/templates/partials/alert_row.html` | One alert row (used by both list and detail-fragment swap) | | `web/templates/pages/settings.html` | Settings shell with Notifications / Users / Auth sub-tabs | | `web/templates/pages/notifications.html` | Channel list (Notifications sub-tab body) | | `web/templates/pages/notification_edit.html` | Channel kind picker + per-kind form + test button + payload preview | | `web/templates/partials/crit_banner.html` | Dashboard top-of-page banner | | `web/templates/partials/nav.html` | Existing — gain a `data-alerts-count` attribute on the Alerts tab so the badge auto-updates | The Settings shell + Notifications sub-tab is the new chrome the wireframe introduced; Users + Authentication tabs are placeholder links that 404 in v1 (or render an "Lands later" notice). Same pattern P2R-02 used for inert sub-tabs. ## Tests (target coverage) - `internal/alert/engine_test.go` — rule firing per kind: backup_failed raises on `MarkJobFinished(kind=backup, status=failed)`; touch-only on the second failure for the same host (no second notification); auto-resolve on next success. - `internal/alert/agent_offline_test.go` — `OnHostOffline` emits without raising until the 15-min floor; `OnHostOnline` clears the alert. - `internal/alert/stale_schedule_test.go` — synthetic schedule whose next fire is in the past triggers; resets when a job lands. - `internal/notification/webhook_test.go` — payload shape pinned; authorisation header sent when bearer set; custom header echoed; 5s timeout enforced; error in `notification_log`. - `internal/notification/ntfy_test.go` — title/priority/tags/click headers match the severity mapping; access token sent as `Authorization: Bearer `; default priority overridden by severity for critical. - `internal/notification/smtp_test.go` — round-trip against a local `net/smtp.NewServer`-style fake (or `mhog`/MailHog if convenient): STARTTLS handshake completes against a self-signed cert; PLAIN auth uses configured creds; subject + from + to + body bytes match the spec'd format; Message-ID contains the alert id; 10s timeout enforced; failure path (auth refused) lands in `notification_log` with the server's error string. - `internal/server/http/ui_alerts_test.go` — page renders with filters applied; ack/resolve POSTs flip the row + write audit; HX-Redirect bounces back to the filtered list. - `internal/server/http/ui_notifications_test.go` — CRUD happy paths, validation re-render, secrets-encrypted-at-rest assertion (load row, decrypt, compare), test-button hits the real send path against a test http.Server. - Migration 0013 + 0014 round-trip tested via `store.Open` on a fresh db. ## Playwright sweep End-of-phase sweep mirrors the P2R-02 / P3-restore pattern: 1. Login → `/alerts` (initially empty) → see "All clear · last alert never" empty state. 2. Trigger a fake-failed-backup via `POST /api/hosts/{id}/jobs` against a host with a deliberately-wrong rest-server URL. Wait for the `backup_failed` alert to appear in the list within ~2s of the job finishing. 3. Acknowledge → row tints + ack actor visible. 4. Take the agent offline (`systemctl stop`); wait 15 min OR mock `last_seen_at` to 16 min ago via the test harness; confirm `agent_offline` alert raises once. 5. Restart the agent → `agent_offline` auto-resolves; `backup_failed` is still open. 6. Configure a webhook channel pointing at a local test sink; click "Send test" → green ✓. 7. Configure a ntfy channel pointing at a local sink → click "Send test" → green ✓. 8. Configure an SMTP channel pointing at a local MailHog (Docker, port 1025, no TLS for the local-only sweep) → click "Send test" → green ✓ → MailHog UI at :8025 shows the test email with the right subject and Message-ID. 9. Trigger a fresh failed backup → all three channels receive the notification (verified from sink logs + MailHog inbox); `notification_log` has three rows `event=alert.raised, ok=true`. 10. Manually Resolve the open `backup_failed`; confirm all three channels receive `event=alert.resolved`. 11. Critical-severity test: trigger `check_failed` (mocked) → dashboard banner appears; clicking it lands on `/alerts?severity=critical&status=open`. 12. Empty the alerts again → banner disappears. Screenshots into `_diag/p3-alerts-sweep/`. End-to-end clean, zero console errors, before handing back. ## What does NOT change - Existing chrome/templates beyond the small additions noted above. - Existing `alerts.severity` CHECK (`info`/`warning`/`critical`) — already the right shape; no migration needed for that. - Audit log writer pattern — engine writes audit rows for ack/resolve the same way every other state-changing handler does. - The agent. Alerts are entirely a server concern; the agent doesn't know they exist. ## Open questions / explicit non-goals - **Per-rule cooldowns / re-raise on long-running issues.** Out of scope (brainstorm question 8 ruled this out). Operators see "still happening" in the UI; they don't get a reminder ping. - **SMTP HTML emails.** v1 is plain text only — operators wanting rich rendering can deploy a webhook → mail-merge bridge, or wait for a v2 template engine. The Message-ID threading + plain text body should be enough for almost every overnight-digest workflow. - **SMTP OAuth2 / XOAUTH2.** Out of scope. Gmail / Microsoft 365 with modern OAuth requires an `app password` workaround in v1. Native XOAUTH2 lands when an operator asks (or when Google starts refusing app passwords for non-business accounts in earnest). - **Multi-recipient SMTP channels.** A channel = one `To`. Operators wanting multiple recipients add multiple channels. Keeps failure attribution per-recipient. - **Apprise sidecar integration.** Deferred per brainstorm. The `Channel` interface accepts a third impl without reshaping when we get there. - **Per-host or per-severity channel routing.** Out of scope. Likely next step if operators ask: a `min_severity` field on the channel row. - **Snooze / mute.** Out of scope. Acknowledge is the closest analogue; full silence-windows would need a new table and is YAGNI for v1. - **PagerDuty / OpsGenie.** Both have webhook receivers; operators wire them via the webhook channel today. - **Alert "rules" UI.** No CRUD; the rule set is hardcoded.