Files

T

steve 6165e34f6f docs: P3 alerts design spec

Phase 3 sub-spec covering the alerts engine, notification channels, and
UI (P3-05/06/07). Brainstorm ran 2026-05-04; all ten design decisions
locked before this spec was written.

Key decisions captured:

- Hardcoded rule set, no operator-tunable thresholds in v1. Six rules:
  backup_failed, forget_failed, prune_failed, check_failed,
  stale_schedule, agent_offline.
- Hybrid engine cadence: event hooks at MarkJobFinished + offline-sweeper
  for immediate triggers; one 60s ticker for stale-schedule detection +
  auto-resolution sweeps.
- Auto-resolve when underlying condition clears; manual Resolve any time;
  Acknowledge as a separate I-have-seen-it intermediate state that does
  NOT close the alert.
- v1 channels: native ntfy + webhook. Apprise + SMTP deferred. Channel
  scope is global only — no per-host or per-severity routing.
- Webhook payload is one stable JSON envelope shape across raised /
  acknowledged / resolved / test events; ntfy uses the standard publish
  format with severity → priority mapping.
- Per-channel Send Test Notification button hits the real send path with
  a synthetic info-severity event; inline green-tick / red-cross result.
- Dedup by (host_id, kind, resolved_at IS NULL); last_seen_at bumped on
  every confirming tick so the UI can render still happening · Ns ago
  without re-notifying.
- Top-level /alerts page; Settings shell with Notifications sub-tab.
  Per-host vitals Open alerts cell deep-links into filtered list.
- Best-effort fire-and-forget delivery with 5s timeout; failures logged
  to a new notification_log table but never retried. Alert row in the DB
  is the source of truth.

Migrations:
- 0013 adds alerts.last_seen_at (column-level ALTER per CLAUDE.md)
- 0014 adds notification_channels + notification_log tables

Wireframe: _diag/p3-alerts-wireframe/wireframe.html

2026-05-04 18:39:26 +01:00

20 KiB

Raw Blame History

P3 — Alerts (design)

Phase 3 sub-spec covering the alerts engine, notification channels, and UI (P3-05 / P3-06 / P3-07).

Wireframe: _diag/p3-alerts-wireframe/wireframe.html. Screenshots in the same directory. Spec brainstorm ran 2026-05-04; user approved all ten design decisions before this spec was written.

Scope locked

Brainstorm decisions (in order asked):

Rule model. Hardcoded rule set, no operator-tunable thresholds in v1. The engine knows about each rule type internally; per-rule config can land later if/when an operator asks.
Rule set. Six rules: backup_failed, forget_failed, prune_failed, check_failed, stale_schedule, agent_offline.
Engine cadence. Hybrid. Event hooks at the existing MarkJobFinished and offline-sweeper sites for the immediate triggers; one 60-second ticker handles stale-schedule detection and auto-resolution.
Resolution. Auto-resolve when the underlying condition clears + manual Resolve at any time. Acknowledge is a separate "I've seen it" intermediate state that does NOT close the alert.
v1 channels. Webhook + native ntfy. Apprise deferred (the channel plumbing accepts new kinds without reshaping).
Channel scope. Global only. No per-host or per-severity routing in v1.
Notification body. Structured JSON for webhooks, formatted title+body+click-URL for ntfy, plus a per-channel "Send test notification" button with inline result feedback.
Deduplication. Open-alert uniqueness on (host_id, kind) with a last_seen_at bump on every confirming tick. One notification per occurrence; the UI shows "still happening · Ns ago" while a rule keeps matching.
Alert UI. Top-level /alerts page (the existing nav stub becomes real). Per-host vitals "Open alerts" cell links to /alerts?host_id=.... Channel CRUD lives at /settings/notifications.
Delivery semantics. Best-effort fire-and-forget with a 5s timeout per notification. Failures are logged but not retried. The alert row in the DB is the source of truth.

Architecture

The subsystem is three loosely-coupled units behind one AlertEngine goroutine:

                                 ┌───────────────────────────┐
   event hooks ─────────────────►│                           │
                                 │   AlertEngine             │ ──► raise/resolve
   60s ticker ──────────────────►│   (rule evaluation)       │     alert row
                                 │                           │
                                 └────────────┬──────────────┘
                                              │
                                              ▼
                                  ┌──────────────────────┐
                                  │   notification.Hub   │
                                  │   (fire-and-forget)  │
                                  └──┬────────┬──────────┘
                                     │        │
                              ┌──────▼──┐  ┌──▼──────┐
                              │ Webhook │  │  Ntfy   │  …future channels
                              └─────────┘  └─────────┘

Component boundaries

Component	Purpose	Depends on
`internal/alert.Engine`	Owns the rule evaluation. Exposes `OnJobFinished`, `OnHostOffline`, `OnHostOnline` event hooks; runs a 60s ticker for stale-schedule + auto-resolution sweeps. Persists raises/resolves through the store.	store, notification.Hub, slog
`internal/alert.Rule` + per-rule files	Each of the six rules is a small struct with `Kind() string`, `Severity() string`, `MessageFor(ctx) string`. The engine iterates over a registered slice.	store models
`internal/notification.Hub`	Receives "alert raised/resolved/test" events; fans out to enabled channels in parallel; logs results to a new `notification_log` table.	store, channel adapters
`internal/notification.Channel` (iface)	Single method `Send(ctx, payload) error` with a 5s context. Two impls in v1: `webhookChannel`, `ntfyChannel`.	http.Client
`internal/store/alerts.go`	CRUD on `alerts` table: `RaiseOrTouch(host_id, kind, severity, message)`, `Acknowledge(id, user)`, `Resolve(id, by user)`, `AutoResolve(host_id, kind)`, `ListAlerts(filter)`, plus the `last_seen_at` bump.	sqlite
`internal/store/notification_channels.go`	CRUD on `notification_channels` (new table) + `notification_log` (new table).	sqlite, crypto.AEAD (for secrets)
`internal/server/http/ui_alerts.go`	`/alerts` page handler + filter parsing + ack/resolve form actions.	store
`internal/server/http/ui_notifications.go`	`/settings/notifications` page + channel CRUD + "Send test" handler.	store, notification.Hub

Engine event shape

The engine runs as one goroutine per server process started in cmd/server/main.go. It exposes a small set of channels other code writes to:

type Engine struct {
    store *store.Store
    hub   *notification.Hub

    // Event channels (buffered, drop-on-full with a slog warning to keep
    // hot paths non-blocking). The engine drains them on its own
    // goroutine, evaluates the rule, and acts.
    jobFinished chan jobFinishedEvent  // from store.MarkJobFinished hook
    hostOffline chan string            // host_id; from offline sweeper
    hostOnline  chan string            // host_id; from ws handler hello

    // 60s ticker drives stale-schedule + auto-resolution sweeps.
    tick *time.Ticker
}

The hot-path call sites (store.MarkJobFinished, ws.handler offline sweep, ws.handler hello) push to these channels via a tiny Engine.Notify* method that does a non-blocking send. The engine's own goroutine handles every match — keeps mutation off the hot path.

Rule catalogue

Kind	Severity	Trigger	Auto-resolve when
`backup_failed`	warning	`MarkJobFinished` with kind=backup, status=failed	next backup for the same host succeeds
`forget_failed`	warning	`MarkJobFinished` with kind=forget, status=failed	next forget for the same host succeeds
`prune_failed`	warning	`MarkJobFinished` with kind=prune, status=failed	next prune for the same host succeeds
`check_failed`	critical	`MarkJobFinished` with kind=check, status=failed OR errors_found	next check for the same host succeeds without errors
`stale_schedule`	warning	60s ticker: a schedule's next-fire time is more than 5 minutes in the past with no matching job since	next job for that schedule succeeds OR schedule deleted
`agent_offline`	warning	offline-sweeper marks the host offline AND the host has been offline > 15 min (engine checks `last_seen_at`)	hostOnline event for that host

The 15-minute floor on agent_offline exists so a 30-second blip during agent restart doesn't generate a notification storm. The store's existing offline sweeper (hosts.last_seen_at with 90s threshold) already marks the host offline; the engine sees the event but waits for the threshold before raising.

Dedup + last_seen_at

store.RaiseOrTouch(host_id, kind, severity, message):

SELECT id, last_seen_at FROM alerts
 WHERE host_id = ? AND kind = ? AND resolved_at IS NULL
 LIMIT 1;

Found: UPDATE alerts SET last_seen_at = ?, message = ? WHERE id = ?, return (id, didRaise=false).
Not found: INSERT INTO alerts (id, host_id, kind, severity, message, created_at, last_seen_at) VALUES (?, ?, ?, ?, ?, ?, ?), return (id, didRaise=true).

The engine fires a notification through the Hub only when didRaise=true. Touch-only events keep the row's last_seen_at fresh so the UI can render "still happening · Ns ago" without spamming the operator's phone.

Notification payload shapes

Webhook — a single JSON envelope per event:

{
  "event":     "alert.raised",
  "alert_id":  "01KQT...",
  "severity":  "warning",
  "kind":      "backup_failed",
  "host_id":   "01KQ...",
  "host_name": "alfa-01",
  "message":   "Backup 'system-config' failed: rest-server returned 401",
  "raised_at": "2026-05-04T15:42:01Z",
  "link":      "https://restic-manager.example/alerts/01KQT..."
}

event is one of alert.raised | alert.acknowledged | alert.resolved | alert.test. The same envelope shape is reused across events — operators build one bridge, switch on event and severity.

Ntfy — uses the standard publish format:

POST /<topic> HTTP/1.1
Host: <server>
Authorization: Bearer <access-token>   (if configured)
Title: [warning] alfa-01 backup failed
Priority: 4
Tags: warning,backup_failed
Click: https://restic-manager.example/alerts/01KQT...

Backup 'system-config' failed: rest-server returned 401

Severity → priority mapping:

Severity	Priority
info	3 (default)
warning	4 (high)
critical	5 (urgent)

Per-channel default_priority setting overrides for non-critical alerts; critical always goes urgent regardless.

Test notification

POST /api/notifications/{channel_id}/test builds a synthetic event (severity=info, kind=test_notification, message="Test from restic-manager", link to the channel's edit page) and runs it through the real send path. Returns {ok: bool, latency_ms: int, status_code?: int, error?: string}. UI renders the green ✓ / red ✗ feedback inline.

Routes added

Method	Path	Purpose
GET	`/alerts`	Fleet alerts list with filters (`?status=open&severity=warning&host_id=...&q=...`)
POST	`/alerts/{id}/acknowledge`	Mark alert acknowledged (HTMX form)
POST	`/alerts/{id}/resolve`	Manual resolve (HTMX form)
GET	`/settings/notifications`	Channel list page
GET	`/settings/notifications/new`	Channel kind picker + empty form
POST	`/settings/notifications/new`	Validate + create + redirect
GET	`/settings/notifications/{id}/edit`	Channel edit form
POST	`/settings/notifications/{id}/edit`	Validate + update
POST	`/settings/notifications/{id}/delete`	Delete channel (typed-confirm name in the form)
POST	`/api/notifications/{id}/test`	Fire test notification, return JSON result
GET	`/api/alerts`	JSON list (mirrors the UI filters) for future REST callers

Data model

Migration 0013 — alerts.last_seen_at

ALTER TABLE alerts ADD COLUMN last_seen_at TEXT;
UPDATE alerts SET last_seen_at = created_at WHERE last_seen_at IS NULL;

Existing alerts (currently zero in production — nothing writes them yet) get last_seen_at = created_at. Column is nullable for forwards-compat with rows from the alert-engine-pre-bump period.

Migration 0014 — notification_channels + notification_log

CREATE TABLE notification_channels (
  id              TEXT PRIMARY KEY,
  kind            TEXT NOT NULL CHECK (kind IN ('webhook', 'ntfy')),
  name            TEXT NOT NULL,
  enabled         INTEGER NOT NULL DEFAULT 1 CHECK (enabled IN (0, 1)),
  config          BLOB NOT NULL,        -- AEAD-encrypted JSON; per-kind shape
  default_priority TEXT,                -- ntfy only; null for others
  created_at      TEXT NOT NULL,
  updated_at      TEXT NOT NULL,
  last_fired_at   TEXT
);

CREATE INDEX notification_channels_enabled ON notification_channels(enabled) WHERE enabled = 1;

CREATE TABLE notification_log (
  id           TEXT PRIMARY KEY,
  channel_id   TEXT NOT NULL REFERENCES notification_channels(id) ON DELETE CASCADE,
  alert_id     TEXT REFERENCES alerts(id) ON DELETE SET NULL,
  event        TEXT NOT NULL,           -- alert.raised | alert.acknowledged | alert.resolved | alert.test
  ok           INTEGER NOT NULL CHECK (ok IN (0, 1)),
  status_code  INTEGER,
  latency_ms   INTEGER,
  error        TEXT,
  fired_at     TEXT NOT NULL
);

CREATE INDEX notification_log_channel ON notification_log(channel_id, fired_at DESC);
CREATE INDEX notification_log_alert ON notification_log(alert_id);

config is an AEAD-encrypted JSON blob — bearer tokens for webhooks and access tokens for ntfy live there. Per-kind config shapes:

type webhookConfig struct {
    URL          string `json:"url"`
    BearerToken  string `json:"bearer_token,omitempty"`
    HeaderName   string `json:"header_name,omitempty"`
    HeaderValue  string `json:"header_value,omitempty"`
}

type ntfyConfig struct {
    ServerURL    string `json:"server_url"`     // default https://ntfy.sh
    Topic        string `json:"topic"`
    AccessToken  string `json:"access_token,omitempty"`
}

Engine state

The engine itself is stateless beyond the channels it owns; all persisted state is in the existing alerts table + the new notification_log table. A process restart re-evaluates from scratch: on next tick the stale-schedule + auto-resolution sweeps catch up with whatever happened during the downtime. No outbox to drain.

UI templates

Template	Purpose
`web/templates/pages/alerts.html`	Fleet alerts page
`web/templates/partials/alert_row.html`	One alert row (used by both list and detail-fragment swap)
`web/templates/pages/settings.html`	Settings shell with Notifications / Users / Auth sub-tabs
`web/templates/pages/notifications.html`	Channel list (Notifications sub-tab body)
`web/templates/pages/notification_edit.html`	Channel kind picker + per-kind form + test button + payload preview
`web/templates/partials/crit_banner.html`	Dashboard top-of-page banner
`web/templates/partials/nav.html`	Existing — gain a `data-alerts-count` attribute on the Alerts tab so the badge auto-updates

The Settings shell + Notifications sub-tab is the new chrome the wireframe introduced; Users + Authentication tabs are placeholder links that 404 in v1 (or render an "Lands later" notice). Same pattern P2R-02 used for inert sub-tabs.

Tests (target coverage)

internal/alert/engine_test.go — rule firing per kind: backup_failed raises on MarkJobFinished(kind=backup, status=failed); touch-only on the second failure for the same host (no second notification); auto-resolve on next success.
internal/alert/agent_offline_test.go — OnHostOffline emits without raising until the 15-min floor; OnHostOnline clears the alert.
internal/alert/stale_schedule_test.go — synthetic schedule whose next fire is in the past triggers; resets when a job lands.
internal/notification/webhook_test.go — payload shape pinned; authorisation header sent when bearer set; custom header echoed; 5s timeout enforced; error in notification_log.
internal/notification/ntfy_test.go — title/priority/tags/click headers match the severity mapping; access token sent as Authorization: Bearer <token>; default priority overridden by severity for critical.
internal/server/http/ui_alerts_test.go — page renders with filters applied; ack/resolve POSTs flip the row + write audit; HX-Redirect bounces back to the filtered list.
internal/server/http/ui_notifications_test.go — CRUD happy paths, validation re-render, secrets-encrypted-at-rest assertion (load row, decrypt, compare), test-button hits the real send path against a test http.Server.
Migration 0013 + 0014 round-trip tested via store.Open on a fresh db.

Playwright sweep

End-of-phase sweep mirrors the P2R-02 / P3-restore pattern:

Login → /alerts (initially empty) → see "All clear · last alert never" empty state.
Trigger a fake-failed-backup via POST /api/hosts/{id}/jobs against a host with a deliberately-wrong rest-server URL. Wait for the backup_failed alert to appear in the list within ~2s of the job finishing.
Acknowledge → row tints + ack actor visible.
Take the agent offline (systemctl stop); wait 15 min OR mock last_seen_at to 16 min ago via the test harness; confirm agent_offline alert raises once.
Restart the agent → agent_offline auto-resolves; backup_failed is still open.
Configure a webhook channel pointing at a local test sink; click "Send test" → green ✓.
Configure a ntfy channel pointing at a local sink → click "Send test" → green ✓.
Trigger a fresh failed backup → both channels receive the notification (verified from sink logs); notification_log has two rows event=alert.raised, ok=true.
Manually Resolve the open backup_failed; confirm both channels receive event=alert.resolved.
Critical-severity test: trigger check_failed (mocked) → dashboard banner appears; clicking it lands on /alerts?severity=critical&status=open.
Empty the alerts again → banner disappears.

Screenshots into _diag/p3-alerts-sweep/. End-to-end clean, zero console errors, before handing back.

What does NOT change

Existing chrome/templates beyond the small additions noted above.
Existing alerts.severity CHECK (info/warning/critical) — already the right shape; no migration needed for that.
Audit log writer pattern — engine writes audit rows for ack/resolve the same way every other state-changing handler does.
The agent. Alerts are entirely a server concern; the agent doesn't know they exist.

Open questions / explicit non-goals

Per-rule cooldowns / re-raise on long-running issues. Out of scope (brainstorm question 8 ruled this out). Operators see "still happening" in the UI; they don't get a reminder ping.
SMTP / email channel. Out of scope. Operators wanting email today can chain webhook → email-gateway; native SMTP can land later.
Apprise sidecar integration. Deferred per brainstorm. The Channel interface accepts a third impl without reshaping when we get there.
Per-host or per-severity channel routing. Out of scope. Likely next step if operators ask: a min_severity field on the channel row.
Snooze / mute. Out of scope. Acknowledge is the closest analogue; full silence-windows would need a new table and is YAGNI for v1.
PagerDuty / OpsGenie. Both have webhook receivers; operators wire them via the webhook channel today.
Alert "rules" UI. No CRUD; the rule set is hardcoded.

20 KiB Raw Blame History