restic-manager/docs/superpowers/specs/2026-05-04-p3-alerts-design.md

# P3 — Alerts (design)

> Phase 3 sub-spec covering the alerts engine, notification channels, and UI
> (P3-05 / P3-06 / P3-07).
>
> Wireframe: `_diag/p3-alerts-wireframe/wireframe.html`. Screenshots in the
> same directory. Spec brainstorm ran 2026-05-04; user approved all ten
> design decisions before this spec was written.

## Scope locked

Brainstorm decisions (in order asked):

1. **Rule model.** Hardcoded rule set, no operator-tunable thresholds in v1.
   The engine knows about each rule type internally; per-rule config can land
   later if/when an operator asks.
2. **Rule set.** Six rules: `backup_failed`, `forget_failed`, `prune_failed`,
   `check_failed`, `stale_schedule`, `agent_offline`.
3. **Engine cadence.** Hybrid. Event hooks at the existing
   `MarkJobFinished` and offline-sweeper sites for the immediate triggers;
   one 60-second ticker handles stale-schedule detection and auto-resolution.
4. **Resolution.** Auto-resolve when the underlying condition clears + manual
   Resolve at any time. Acknowledge is a separate "I've seen it" intermediate
   state that does NOT close the alert.
5. **v1 channels.** Webhook + native ntfy + SMTP. Apprise deferred (the
   channel plumbing accepts new kinds without reshaping). SMTP added as
   a first-class channel post-brainstorm because the use case — overnight
   alerts the operator wants to read in the morning rather than be pinged
   on at 03:00 — is poorly served by ntfy's push model and clumsy via
   webhook → email-gateway.
6. **Channel scope.** Global only. No per-host or per-severity routing in v1.
7. **Notification body.** Structured JSON for webhooks, formatted
   title+body+click-URL for ntfy, plain-text email for SMTP, plus a
   per-channel "Send test notification" button with inline result feedback.
8. **Deduplication.** Open-alert uniqueness on `(host_id, kind)` with a
   `last_seen_at` bump on every confirming tick. One notification per
   occurrence; the UI shows "still happening · Ns ago" while a rule keeps
   matching.
9. **Alert UI.** Top-level `/alerts` page (the existing nav stub becomes
   real). Per-host vitals "Open alerts" cell links to `/alerts?host_id=...`.
   Channel CRUD lives at `/settings/notifications`.
10. **Delivery semantics.** Best-effort fire-and-forget with a 5s timeout
    per notification (10s for SMTP). Failures are logged but not retried.
    The alert row in the DB is the source of truth.
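
The delivery semantics in decision 10 can be illustrated with a minimal sketch. It assumes a `Channel` interface whose payload is a `[]byte` (the spec only names `Send(ctx, payload) error`); `fanOut`, `result`, `okChannel`, `slowChannel`, and `demo` are hypothetical names used for illustration. Each attempt gets its own deadline and failures are collected for logging, never retried.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// Channel mirrors the single-method interface named in this spec.
// The []byte payload type is an assumption.
type Channel interface {
	Send(ctx context.Context, payload []byte) error
}

// result is a hypothetical record of one delivery attempt.
type result struct {
	Name string
	Err  error
}

// fanOut sends payload to every channel in parallel. Each attempt gets its
// own timeout; errors are returned for logging and are not retried.
func fanOut(parent context.Context, chans map[string]Channel, payload []byte, timeout time.Duration) []result {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		out []result
	)
	for name, ch := range chans {
		wg.Add(1)
		go func(name string, ch Channel) {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(parent, timeout)
			defer cancel()
			err := ch.Send(ctx, payload)
			mu.Lock()
			out = append(out, result{Name: name, Err: err})
			mu.Unlock()
		}(name, ch)
	}
	wg.Wait()
	return out
}

// okChannel always succeeds; slowChannel blocks until its context expires.
type okChannel struct{}

func (okChannel) Send(context.Context, []byte) error { return nil }

type slowChannel struct{}

func (slowChannel) Send(ctx context.Context, _ []byte) error {
	<-ctx.Done()
	return ctx.Err()
}

// demo reports, per channel name, whether the attempt hit its deadline.
func demo() map[string]bool {
	chans := map[string]Channel{"ok": okChannel{}, "slow": slowChannel{}}
	timedOut := make(map[string]bool)
	for _, r := range fanOut(context.Background(), chans, []byte(`{}`), 50*time.Millisecond) {
		timedOut[r.Name] = errors.Is(r.Err, context.DeadlineExceeded)
	}
	return timedOut
}

func main() {
	fmt.Println(demo())
}
```

The demo uses a 50ms timeout so it finishes quickly; the spec's values are 5s for HTTP channels and 10s for SMTP. To stay fire-and-forget from the caller's point of view, `fanOut` would presumably be started in its own goroutine.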

## Architecture

The subsystem is three loosely-coupled units behind one `AlertEngine`
goroutine:

```
                                 ┌───────────────────────────┐
   event hooks ─────────────────►│                           │
                                 │   AlertEngine             │ ──► raise/resolve
   60s ticker ──────────────────►│   (rule evaluation)       │     alert row
                                 │                           │
                                 └────────────┬──────────────┘
                                              │
                                              ▼
                                  ┌──────────────────────┐
                                  │   notification.Hub   │
                                  │   (fire-and-forget)  │
                                  └──┬────────┬─────────┬┘
                                     │        │         │
                              ┌──────▼──┐ ┌───▼────┐ ┌──▼─────┐
                              │ Webhook │ │  Ntfy  │ │  SMTP  │  …future channels
                              └─────────┘ └────────┘ └────────┘
```

### Component boundaries

| Component                                | Purpose                                                                                  | Depends on                             |
| ---------------------------------------- | ---------------------------------------------------------------------------------------- | -------------------------------------- |
| `internal/alert.Engine`                  | Owns the rule evaluation. Exposes `OnJobFinished`, `OnHostOffline`, `OnHostOnline` event hooks; runs a 60s ticker for stale-schedule + auto-resolution sweeps. Persists raises/resolves through the store. | store, notification.Hub, slog          |
| `internal/alert.Rule` + per-rule files   | Each of the six rules is a small struct with `Kind() string`, `Severity() string`, `MessageFor(ctx) string`. The engine iterates over a registered slice. | store models                           |
| `internal/notification.Hub`              | Receives "alert raised/resolved/test" events; fans out to enabled channels in parallel; logs results to a new `notification_log` table.        | store, channel adapters                |
| `internal/notification.Channel` (iface)  | Single method `Send(ctx, payload) error` with a 5s context for HTTP channels, 10s for SMTP. Three impls in v1: `webhookChannel`, `ntfyChannel`, `smtpChannel`. | http.Client; net/smtp + crypto/tls for SMTP |
| `internal/store/alerts.go`               | CRUD on `alerts` table: `RaiseOrTouch(host_id, kind, severity, message)`, `Acknowledge(id, user)`, `Resolve(id, by user)`, `AutoResolve(host_id, kind)`, `ListAlerts(filter)`, plus the `last_seen_at` bump. | sqlite                                 |
| `internal/store/notification_channels.go` | CRUD on `notification_channels` (new table) + `notification_log` (new table).            | sqlite, crypto.AEAD (for secrets)      |
| `internal/server/http/ui_alerts.go`      | `/alerts` page handler + filter parsing + ack/resolve form actions.                      | store                                  |
| `internal/server/http/ui_notifications.go` | `/settings/notifications` page + channel CRUD + "Send test" handler.                   | store, notification.Hub                |

### Engine event shape

The engine runs as one goroutine per server process started in
`cmd/server/main.go`. It exposes a small set of channels other code writes to:

```go
type Engine struct {
    store *store.Store
    hub   *notification.Hub

    // Event channels (buffered, drop-on-full with a slog warning to keep
    // hot paths non-blocking). The engine drains them on its own
    // goroutine, evaluates the rule, and acts.
    jobFinished chan jobFinishedEvent  // from store.MarkJobFinished hook
    hostOffline chan string            // host_id; from offline sweeper
    hostOnline  chan string            // host_id; from ws handler hello

    // 60s ticker drives stale-schedule + auto-resolution sweeps.
    tick *time.Ticker
}
```

The hot-path call sites (`store.MarkJobFinished`, `ws.handler` offline
sweep, `ws.handler` hello) push to these channels through the small
`Engine.On*` hooks (`OnJobFinished`, `OnHostOffline`, `OnHostOnline`),
each of which does a non-blocking send. The engine's own goroutine
handles every match, which keeps alert mutation off the hot path.
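
A minimal sketch of the non-blocking send, using Go's `select` with a `default` branch. The fields of `jobFinishedEvent` and the boolean return value are assumptions made for illustration; only the drop-on-full behaviour with a slog warning comes from the struct comment above.

```go
package main

import (
	"fmt"
	"log/slog"
)

// jobFinishedEvent is a hypothetical shape for the event named above.
type jobFinishedEvent struct {
	HostID string
	Kind   string // backup | forget | prune | check
	Status string // e.g. "failed"
}

// Engine is trimmed to the one channel this sketch needs.
type Engine struct {
	jobFinished chan jobFinishedEvent
}

// OnJobFinished is called from the hot path and never blocks. If the buffer
// is full the event is dropped and a warning is logged. The bool return
// (accepted or dropped) is an assumption that makes the demo observable.
func (e *Engine) OnJobFinished(ev jobFinishedEvent) bool {
	select {
	case e.jobFinished <- ev:
		return true
	default:
		slog.Warn("alert engine queue full, dropping event",
			"host_id", ev.HostID, "kind", ev.Kind)
		return false
	}
}

func main() {
	// A buffer of 1 makes the drop path easy to demonstrate.
	e := &Engine{jobFinished: make(chan jobFinishedEvent, 1)}
	ev := jobFinishedEvent{HostID: "host-a", Kind: "backup", Status: "failed"}

	fmt.Println("accepted:", e.OnJobFinished(ev)) // buffered
	fmt.Println("accepted:", e.OnJobFinished(ev)) // buffer full, dropped
}
```

A dropped event is not fatal under this design: the 60s ticker re-evaluates state from the store, so the sweep can presumably catch what the event path missed.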

### Rule catalogue

| Kind                | Severity | Trigger                                                                 | Auto-resolve when                                  |
| ------------------- | -------- | ----------------------------------------------------------------------- | -------------------------------------------------- |
| `backup_failed`     | warning  | `MarkJobFinished` with kind=backup, status=failed                       | next backup for the same host succeeds             |
| `forget_failed`     | warning  | `MarkJobFinished` with kind=forget, status=failed                       | next forget for the same host succeeds             |
| `prune_failed`      | warning  | `MarkJobFinished` with kind=prune, status=failed                        | next prune for the same host succeeds              |
| `check_failed`      | critical | `MarkJobFinished` with kind=check, status=failed OR errors_found        | next check for the same host succeeds without errors |
| `stale_schedule`    | warning  | 60s ticker: a schedule's next-fire time is more than 5 minutes in the past with no matching job since | next job for that schedule succeeds OR schedule deleted |
| `agent_offline`     | warning  | offline-sweeper marks the host offline AND the host has been offline > 15 min (engine checks `last_seen_at`) | hostOnline event for that host                     |

The 15-minute floor on `agent_offline` exists so a 30-second blip during
agent restart doesn't generate a notification storm. The store's existing
offline sweeper (`hosts.last_seen_at` with 90s threshold) already marks the
host offline; the engine sees that event but raises only once the host's
`last_seen_at` is more than 15 minutes old.
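
The two time-based triggers in the catalogue reduce to small predicates. The sketch below assumes the engine has the relevant timestamps in hand; `shouldRaiseOffline`, `isStale`, `minutesAgo`, and `demoNow` are hypothetical names, and treating a zero `lastJob` as "never ran" is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

const (
	offlineFloor = 15 * time.Minute // agent_offline raises only past this
	staleGrace   = 5 * time.Minute  // stale_schedule grace after next-fire
)

// shouldRaiseOffline reports whether a host already marked offline has been
// silent long enough to raise agent_offline.
func shouldRaiseOffline(lastSeen, now time.Time) bool {
	return now.Sub(lastSeen) > offlineFloor
}

// isStale reports whether a schedule missed its fire time by more than the
// grace period with no job recorded at or after that fire time.
// A zero lastJob means the schedule has never run.
func isStale(nextFire, lastJob, now time.Time) bool {
	if now.Sub(nextFire) <= staleGrace {
		return false
	}
	return lastJob.Before(nextFire)
}

// demoNow and minutesAgo are demo helpers only.
var demoNow = time.Date(2026, 5, 4, 15, 0, 0, 0, time.UTC)

func minutesAgo(now time.Time, m int) time.Time {
	return now.Add(-time.Duration(m) * time.Minute)
}

func main() {
	fmt.Println("offline 1 min: ", shouldRaiseOffline(minutesAgo(demoNow, 1), demoNow))
	fmt.Println("offline 16 min:", shouldRaiseOffline(minutesAgo(demoNow, 16), demoNow))
	fmt.Println("missed by 10 min, no job since:",
		isStale(minutesAgo(demoNow, 10), minutesAgo(demoNow, 60), demoNow))
	fmt.Println("missed by 10 min, job ran 9 min ago:",
		isStale(minutesAgo(demoNow, 10), minutesAgo(demoNow, 9), demoNow))
}
```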

### Dedup + last_seen_at

`store.RaiseOrTouch(host_id, kind, severity, message)`:

```sql
SELECT id, last_seen_at FROM alerts
 WHERE host_id = ? AND kind = ? AND resolved_at IS NULL
 LIMIT 1;
```

- Found: `UPDATE alerts SET last_seen_at = ?, message = ? WHERE id = ?`,
  return `(id, didRaise=false)`.
- Not found: `INSERT INTO alerts (id, host_id, kind, severity, message,
  created_at, last_seen_at) VALUES (?, ?, ?, ?, ?, ?, ?)`, return
  `(id, didRaise=true)`.
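
The lookup-then-branch logic above can be sketched against an in-memory map instead of SQLite, to show the `didRaise` contract in isolation. `memStore`, `alertKey`, and the `alert-N` ids are hypothetical; a real implementation would presumably run the SELECT and the INSERT/UPDATE in one transaction so two concurrent raises cannot both insert.

```go
package main

import (
	"fmt"
	"time"
)

// alert holds the columns the dedup logic touches.
type alert struct {
	ID, HostID, Kind, Severity, Message string
	CreatedAt, LastSeenAt               time.Time
}

// alertKey is the open-alert uniqueness key (host_id, kind).
type alertKey struct{ hostID, kind string }

// memStore is an in-memory stand-in for the alerts table. It is not
// goroutine-safe; the sketch assumes a single caller.
type memStore struct {
	open map[alertKey]*alert
	seq  int
	now  func() time.Time
}

func newMemStore() *memStore {
	return &memStore{open: make(map[alertKey]*alert), now: time.Now}
}

// RaiseOrTouch inserts a new open alert, or bumps last_seen_at and the
// message on the existing one. didRaise is true only for the insert.
func (s *memStore) RaiseOrTouch(hostID, kind, severity, message string) (id string, didRaise bool) {
	k := alertKey{hostID, kind}
	now := s.now()
	if a, ok := s.open[k]; ok {
		a.LastSeenAt, a.Message = now, message
		return a.ID, false
	}
	s.seq++
	a := &alert{
		ID: fmt.Sprintf("alert-%d", s.seq), HostID: hostID, Kind: kind,
		Severity: severity, Message: message, CreatedAt: now, LastSeenAt: now,
	}
	s.open[k] = a
	return a.ID, true
}

// AutoResolve closes the open alert, if any, so the next failure raises again.
func (s *memStore) AutoResolve(hostID, kind string) bool {
	k := alertKey{hostID, kind}
	if _, ok := s.open[k]; !ok {
		return false
	}
	delete(s.open, k)
	return true
}

func main() {
	s := newMemStore()
	for i := 0; i < 2; i++ {
		id, raised := s.RaiseOrTouch("host-a", "backup_failed", "warning", "backup failed")
		fmt.Println(id, "notify:", raised)
	}
	s.AutoResolve("host-a", "backup_failed")
	id, raised := s.RaiseOrTouch("host-a", "backup_failed", "warning", "backup failed again")
	fmt.Println(id, "notify:", raised)
}
```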

The engine fires a notification through the Hub only when `didRaise=true`.
Touch-only events keep the row's `last_seen_at` fresh so the UI can render
"still happening · Ns ago" without spamming the operator's phone.

### Notification payload shapes

**Webhook** — a single JSON envelope per event:

```json
{
  "event":     "alert.raised",
  "alert_id":  "01KQT...",
  "severity":  "warning",
  "kind":      "backup_failed",
  "host_id":   "01KQ...",
  "host_name": "alfa-01",
  "message":   "Backup 'system-config' failed: rest-server returned 401",
  "raised_at": "2026-05-04T15:42:01Z",
  "link":      "https://restic-manager.example/alerts/01KQT..."
}
```

`event` is one of `alert.raised | alert.acknowledged | alert.resolved |
alert.test`. The same envelope shape is reused across events — operators
build one bridge, switch on `event` and `severity`.
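
On the receiving side, a bridge might decode the envelope and branch on `event` and `severity` roughly as follows. The `Envelope` struct mirrors the JSON keys above; `route` and its return labels (`page`, `ticket`, `update`, `ignore`) are hypothetical, as are the ids used in `main`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Envelope mirrors the webhook JSON shape shown above.
type Envelope struct {
	Event    string    `json:"event"`
	AlertID  string    `json:"alert_id"`
	Severity string    `json:"severity"`
	Kind     string    `json:"kind"`
	HostID   string    `json:"host_id"`
	HostName string    `json:"host_name"`
	Message  string    `json:"message"`
	RaisedAt time.Time `json:"raised_at"`
	Link     string    `json:"link"`
}

// route decodes one envelope and picks a (hypothetical) downstream action.
func route(raw []byte) (string, error) {
	var e Envelope
	if err := json.Unmarshal(raw, &e); err != nil {
		return "", err
	}
	switch e.Event {
	case "alert.raised":
		if e.Severity == "critical" {
			return "page", nil
		}
		return "ticket", nil
	case "alert.acknowledged", "alert.resolved":
		return "update", nil
	case "alert.test":
		return "ignore", nil
	default:
		return "", fmt.Errorf("unknown event %q", e.Event)
	}
}

func main() {
	e := Envelope{
		Event: "alert.raised", AlertID: "example-alert-id", Severity: "warning",
		Kind: "backup_failed", HostID: "example-host-id", HostName: "alfa-01",
		Message:  "Backup 'system-config' failed: rest-server returned 401",
		RaisedAt: time.Date(2026, 5, 4, 15, 42, 1, 0, time.UTC),
		Link:     "https://restic-manager.example/alerts/example-alert-id",
	}
	raw, err := json.MarshalIndent(e, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))

	action, err := route(raw)
	fmt.Println(action, err)
}
```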

**SMTP** — single-recipient plain-text email per channel. The channel
config carries the SMTP server credentials and a `to` address; one
channel = one recipient (or one distribution-list address). Operators
who want multiple recipients add multiple channels — keeps the config
flat and the failure modes per-recipient.

Subject pattern is hardcoded (no per-channel template in v1):

```
Subject: [restic-manager] [<severity>] <host_name>: <kind>
From: <configured-from-address>
To: <configured-to-address>
Date: <RFC 5322>
Message-ID: <alert_id@<server-host>>

<message line — same string the webhook/ntfy gets>

—
Raised at: 2026-05-04T15:42:01Z
Severity:  warning
Host:      alfa-01
Kind:      backup_failed

Open in restic-manager:
https://restic-manager.example/alerts/01KQT...

(This message was sent by restic-manager. Acknowledge or resolve in the UI.)
```

The body is plain text only in v1 — no HTML alternative — both because
the data is already structured well enough as text and because HTML
email opens a long tail of rendering / sanitisation concerns. The
`Message-ID` includes the alert id so a thread-aware client can group
related events (raised → acknowledged → resolved) together.
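
A minimal sketch of assembling the headers and message line described above. `mailAlert`, `subjectLine`, `messageID`, and `buildMessage` are hypothetical names. Stripping CR/LF from interpolated fields is an assumed precaution against header injection that the spec does not mention, and the detail footer is left out for brevity.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// mailAlert is a hypothetical view of the fields the email needs.
type mailAlert struct {
	ID, Severity, HostName, Kind, Message string
}

// stripCRLF removes CR and LF so a field value cannot inject extra headers.
var stripCRLF = strings.NewReplacer("\r", "", "\n", "")

// subjectLine renders the hardcoded subject pattern.
func subjectLine(a mailAlert) string {
	return fmt.Sprintf("[restic-manager] [%s] %s: %s",
		stripCRLF.Replace(a.Severity),
		stripCRLF.Replace(a.HostName),
		stripCRLF.Replace(a.Kind))
}

// messageID embeds the alert id in the Message-ID value.
func messageID(alertID, serverHost string) string {
	return fmt.Sprintf("<%s@%s>", alertID, serverHost)
}

// buildMessage assembles the headers and the message line with CRLF line
// endings. The footer block from the template is omitted here.
func buildMessage(from, to, serverHost string, a mailAlert, now time.Time) string {
	var b strings.Builder
	fmt.Fprintf(&b, "Subject: %s\r\n", subjectLine(a))
	fmt.Fprintf(&b, "From: %s\r\n", from)
	fmt.Fprintf(&b, "To: %s\r\n", to)
	fmt.Fprintf(&b, "Date: %s\r\n", now.Format(time.RFC1123Z))
	fmt.Fprintf(&b, "Message-ID: %s\r\n", messageID(a.ID, serverHost))
	b.WriteString("\r\n") // blank line separates headers from body
	b.WriteString(a.Message + "\r\n")
	return b.String()
}

func main() {
	a := mailAlert{
		ID: "example-alert-id", Severity: "warning", HostName: "alfa-01",
		Kind:    "backup_failed",
		Message: "Backup 'system-config' failed: rest-server returned 401",
	}
	fmt.Print(buildMessage("alerts@example.com", "ops@example.com",
		"restic-manager.example", a, time.Now()))
}
```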

Encryption:
- **STARTTLS** (default, port 587). Opportunistic upgrade. What most
  operator-facing relays expect.
- **Implicit TLS** (port 465). Connect-then-TLS-handshake.
- **None** (port 25). Plain. Hidden behind a "Yes I understand" warning
  on the form because the password goes over the wire.

Auth:
- **PLAIN** (RFC 4616) over TLS. Default and almost always what's wanted.
- **CRAM-MD5** (RFC 2195). Offered if the server advertises it, no UI
  toggle — automatic.
- No OAuth2 / XOAUTH2 in v1; that's a real next step if Gmail-without-
  app-passwords becomes a recurring ask.

Per-message timeout is 10s (vs 5s for HTTP channels) — STARTTLS
handshake + DATA over a slow link can legitimately take that long.

**Ntfy** — uses the standard publish format:

```
POST /<topic> HTTP/1.1
Host: <server>
Authorization: Bearer <access-token>   (if configured)
Title: [warning] alfa-01 backup failed
Priority: 4
Tags: warning,backup_failed
Click: https://restic-manager.example/alerts/01KQT...

Backup 'system-config' failed: rest-server returned 401
```

Severity → priority mapping:

| Severity  | Priority |
| --------- | -------- |
| info      | 3 (default) |
| warning   | 4 (high)    |
| critical  | 5 (urgent)  |

A per-channel `default_priority` setting overrides this mapping for
non-critical alerts; critical always goes urgent regardless.
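
The mapping and override rule can be sketched as below, together with a request builder that sets the ntfy publish headers shown earlier. Representing `default_priority` as an int with 0 meaning "unset" is an assumption (the schema stores it as TEXT); `ntfyPriority` and `newNtfyRequest` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

// ntfyConfig follows the spec's config shape, plus an assumed integer
// DefaultPriority where 0 means "not set".
type ntfyConfig struct {
	ServerURL       string
	Topic           string
	AccessToken     string
	DefaultPriority int
}

// ntfyPriority applies the severity mapping. Critical is always 5; for
// other severities a valid channel default (1..5) wins over the mapping.
func ntfyPriority(severity string, channelDefault int) int {
	if severity == "critical" {
		return 5
	}
	if channelDefault >= 1 && channelDefault <= 5 {
		return channelDefault
	}
	if severity == "warning" {
		return 4
	}
	return 3
}

// newNtfyRequest builds the publish request; the caller would attach a
// context with the 5s timeout before sending.
func newNtfyRequest(cfg ntfyConfig, severity, kind, title, message, click string) (*http.Request, error) {
	url := strings.TrimRight(cfg.ServerURL, "/") + "/" + cfg.Topic
	req, err := http.NewRequest(http.MethodPost, url, strings.NewReader(message))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Title", title)
	req.Header.Set("Priority", strconv.Itoa(ntfyPriority(severity, cfg.DefaultPriority)))
	req.Header.Set("Tags", severity+","+kind)
	req.Header.Set("Click", click)
	if cfg.AccessToken != "" {
		req.Header.Set("Authorization", "Bearer "+cfg.AccessToken)
	}
	return req, nil
}

func main() {
	cfg := ntfyConfig{ServerURL: "https://ntfy.sh", Topic: "example-topic"}
	req, err := newNtfyRequest(cfg, "warning", "backup_failed",
		"[warning] alfa-01 backup failed",
		"Backup 'system-config' failed: rest-server returned 401",
		"https://restic-manager.example/alerts/example-alert-id")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL)
	fmt.Println("Priority:", req.Header.Get("Priority"))
	fmt.Println("Tags:", req.Header.Get("Tags"))
}
```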

### Test notification

`POST /api/notifications/{channel_id}/test` builds a synthetic event
(severity=info, kind=test_notification, message="Test from
restic-manager", link to the channel's edit page) and runs it through the
real send path. Returns `{ok: bool, latency_ms: int, status_code?: int,
error?: string}`. UI renders the green ✓ / red ✗ feedback inline.
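
The response shape could be modelled as a struct whose optional fields use `omitempty`. This is a sketch only: `testResult`, `runTest`, and `errRefused` are hypothetical names, and treating "no error and, when a status code is present, a 2xx status" as `ok` is an assumption.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// testResult mirrors {ok, latency_ms, status_code?, error?}.
type testResult struct {
	OK         bool   `json:"ok"`
	LatencyMS  int64  `json:"latency_ms"`
	StatusCode int    `json:"status_code,omitempty"`
	Error      string `json:"error,omitempty"`
}

// runTest times one send attempt. send returns the HTTP status (0 when the
// channel has none, e.g. SMTP) and any transport error.
func runTest(send func() (status int, err error)) testResult {
	start := time.Now()
	status, err := send()
	res := testResult{
		LatencyMS:  time.Since(start).Milliseconds(),
		StatusCode: status,
	}
	switch {
	case err != nil:
		res.Error = err.Error()
	case status != 0 && (status < 200 || status > 299):
		res.Error = fmt.Sprintf("unexpected status %d", status)
	default:
		res.OK = true
	}
	return res
}

// errRefused is a demo error standing in for a transport failure.
var errRefused = errors.New("connection refused")

func main() {
	sends := []func() (int, error){
		func() (int, error) { return 200, nil },
		func() (int, error) { return 0, errRefused },
	}
	for _, send := range sends {
		out, err := json.Marshal(runTest(send))
		if err != nil {
			panic(err)
		}
		fmt.Println(string(out))
	}
}
```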

## Routes added

| Method  | Path                                                  | Purpose                                                       |
| ------- | ----------------------------------------------------- | ------------------------------------------------------------- |
| GET     | `/alerts`                                             | Fleet alerts list with filters (`?status=open&severity=warning&host_id=...&q=...`) |
| POST    | `/alerts/{id}/acknowledge`                            | Mark alert acknowledged (HTMX form)                           |
| POST    | `/alerts/{id}/resolve`                                | Manual resolve (HTMX form)                                    |
| GET     | `/settings/notifications`                             | Channel list page                                             |
| GET     | `/settings/notifications/new`                         | Channel kind picker + empty form                              |
| POST    | `/settings/notifications/new`                         | Validate + create + redirect                                  |
| GET     | `/settings/notifications/{id}/edit`                   | Channel edit form                                             |
| POST    | `/settings/notifications/{id}/edit`                   | Validate + update                                             |
| POST    | `/settings/notifications/{id}/delete`                 | Delete channel (typed-confirm name in the form)               |
| POST    | `/api/notifications/{id}/test`                        | Fire test notification, return JSON result                    |
| GET     | `/api/alerts`                                         | JSON list (mirrors the UI filters) for future REST callers    |

## Data model

### Migration 0013 — alerts.last_seen_at

```sql
ALTER TABLE alerts ADD COLUMN last_seen_at TEXT;
UPDATE alerts SET last_seen_at = created_at WHERE last_seen_at IS NULL;
```

Existing alerts (currently zero in production, since nothing writes them
yet) get `last_seen_at = created_at`. The column stays nullable for
compatibility with rows written before the engine starts bumping it.

### Migration 0014 — notification_channels + notification_log

```sql
CREATE TABLE notification_channels (
  id              TEXT PRIMARY KEY,
  kind            TEXT NOT NULL CHECK (kind IN ('webhook', 'ntfy', 'smtp')),
  name            TEXT NOT NULL,
  enabled         INTEGER NOT NULL DEFAULT 1 CHECK (enabled IN (0, 1)),
  config          BLOB NOT NULL,        -- AEAD-encrypted JSON; per-kind shape
  default_priority TEXT,                -- ntfy only; null for webhook + smtp
  created_at      TEXT NOT NULL,
  updated_at      TEXT NOT NULL,
  last_fired_at   TEXT
);

CREATE INDEX notification_channels_enabled ON notification_channels(enabled) WHERE enabled = 1;

CREATE TABLE notification_log (
  id           TEXT PRIMARY KEY,
  channel_id   TEXT NOT NULL REFERENCES notification_channels(id) ON DELETE CASCADE,
  alert_id     TEXT REFERENCES alerts(id) ON DELETE SET NULL,
  event        TEXT NOT NULL,           -- alert.raised | alert.acknowledged | alert.resolved | alert.test
  ok           INTEGER NOT NULL CHECK (ok IN (0, 1)),
  status_code  INTEGER,
  latency_ms   INTEGER,
  error        TEXT,
  fired_at     TEXT NOT NULL
);

CREATE INDEX notification_log_channel ON notification_log(channel_id, fired_at DESC);
CREATE INDEX notification_log_alert ON notification_log(alert_id);
```

`config` is an AEAD-encrypted JSON blob; webhook bearer tokens, ntfy
access tokens, and SMTP passwords live there. Per-kind config shapes:

```go
type webhookConfig struct {
    URL          string `json:"url"`
    BearerToken  string `json:"bearer_token,omitempty"`
    HeaderName   string `json:"header_name,omitempty"`
    HeaderValue  string `json:"header_value,omitempty"`
}

type ntfyConfig struct {
    ServerURL    string `json:"server_url"`     // default https://ntfy.sh
    Topic        string `json:"topic"`
    AccessToken  string `json:"access_token,omitempty"`
}

type smtpConfig struct {
    Host       string `json:"host"`         // e.g. smtp.example.com
    Port       int    `json:"port"`         // default 587 (STARTTLS), 465 (TLS), 25 (none)
    Encryption string `json:"encryption"`   // "starttls" | "tls" | "none"
    Username   string `json:"username"`
    Password   string `json:"password"`     // sensitive — AEAD-encrypted with the rest of config
    From       string `json:"from"`         // RFC 5322 address; "alerts@example.com" or "Restic-Manager <alerts@…>"
    To         string `json:"to"`           // single recipient or distribution-list address; v1 = one channel = one to-line
}
```
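
As an illustration of how one of these shapes might be sealed into the `config` BLOB, the sketch below uses AES-GCM from the standard library. The spec only says "AEAD", so the cipher choice, the nonce-prefixed blob layout, and the helper names (`newAEAD`, `sealConfig`, `openConfig`) are all assumptions; the project's existing `crypto.AEAD` helper may differ.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/json"
	"errors"
	"fmt"
)

// webhookConfig is trimmed to two fields for the sketch.
type webhookConfig struct {
	URL         string `json:"url"`
	BearerToken string `json:"bearer_token,omitempty"`
}

// newAEAD builds an AES-GCM AEAD; key must be 16, 24, or 32 bytes.
func newAEAD(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// sealConfig marshals cfg to JSON and returns nonce || ciphertext.
func sealConfig(key []byte, cfg webhookConfig) ([]byte, error) {
	aead, err := newAEAD(key)
	if err != nil {
		return nil, err
	}
	plain, err := json.Marshal(cfg)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return aead.Seal(nonce, nonce, plain, nil), nil
}

// openConfig reverses sealConfig and fails if the blob was tampered with.
func openConfig(key, blob []byte) (webhookConfig, error) {
	var cfg webhookConfig
	aead, err := newAEAD(key)
	if err != nil {
		return cfg, err
	}
	n := aead.NonceSize()
	if len(blob) < n {
		return cfg, errors.New("config blob too short")
	}
	plain, err := aead.Open(nil, blob[:n], blob[n:], nil)
	if err != nil {
		return cfg, err
	}
	err = json.Unmarshal(plain, &cfg)
	return cfg, err
}

func main() {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	blob, err := sealConfig(key, webhookConfig{URL: "https://hooks.example/alerts", BearerToken: "example-token"})
	if err != nil {
		panic(err)
	}
	cfg, err := openConfig(key, blob)
	fmt.Println(cfg.URL, err)
}
```

This is the same round trip the "secrets-encrypted-at-rest" test would exercise: load the row, decrypt, compare.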

### Engine state

The engine itself is stateless beyond the channels it owns; all
persisted state is in the existing `alerts` table + the new
`notification_log` table. A process restart re-evaluates from scratch:
on next tick the stale-schedule + auto-resolution sweeps catch up with
whatever happened during the downtime. No outbox to drain.

## UI templates

| Template                                  | Purpose                                                |
| ----------------------------------------- | ------------------------------------------------------ |
| `web/templates/pages/alerts.html`         | Fleet alerts page                                      |
| `web/templates/partials/alert_row.html`   | One alert row (used by both list and detail-fragment swap) |
| `web/templates/pages/settings.html`       | Settings shell with Notifications / Users / Auth sub-tabs |
| `web/templates/pages/notifications.html`  | Channel list (Notifications sub-tab body)              |
| `web/templates/pages/notification_edit.html` | Channel kind picker + per-kind form + test button + payload preview |
| `web/templates/partials/crit_banner.html` | Dashboard top-of-page banner                           |
| `web/templates/partials/nav.html`         | Existing; gains a `data-alerts-count` attribute on the Alerts tab so the badge auto-updates |

The Settings shell + Notifications sub-tab is the new chrome the wireframe
introduced; Users + Authentication tabs are placeholder links that 404 in
v1 (or render a "Lands later" notice). Same pattern P2R-02 used for
inert sub-tabs.

## Tests (target coverage)

- `internal/alert/engine_test.go` — rule firing per kind: backup_failed
  raises on `MarkJobFinished(kind=backup, status=failed)`; touch-only on
  the second failure for the same host (no second notification);
  auto-resolve on next success.
- `internal/alert/agent_offline_test.go` — `OnHostOffline` emits without
  raising until the 15-min floor; `OnHostOnline` clears the alert.
- `internal/alert/stale_schedule_test.go` — synthetic schedule whose next
  fire is in the past triggers; resets when a job lands.
- `internal/notification/webhook_test.go` — payload shape pinned;
  authorisation header sent when bearer set; custom header echoed; 5s
  timeout enforced; error in `notification_log`.
- `internal/notification/ntfy_test.go` — title/priority/tags/click headers
  match the severity mapping; access token sent as `Authorization: Bearer
  <token>`; default priority overridden by severity for critical.
- `internal/notification/smtp_test.go` — round-trip against a local
  in-process fake SMTP server (in the style of `httptest.NewServer`,
  since `net/smtp` is client-only; or MailHog if convenient):
  STARTTLS handshake completes against a self-signed cert; PLAIN auth
  uses configured creds; subject + from + to + body bytes match the
  spec'd format; Message-ID contains the alert id; 10s timeout enforced;
  failure path (auth refused) lands in `notification_log` with the
  server's error string.
- `internal/server/http/ui_alerts_test.go` — page renders with filters
  applied; ack/resolve POSTs flip the row + write audit; HX-Redirect
  bounces back to the filtered list.
- `internal/server/http/ui_notifications_test.go` — CRUD happy paths,
  validation re-render, secrets-encrypted-at-rest assertion (load row,
  decrypt, compare), test-button hits the real send path against a
  test http.Server.
- Migration 0013 + 0014 round-trip tested via `store.Open` on a fresh
  db.

## Playwright sweep

End-of-phase sweep mirrors the P2R-02 / P3-restore pattern:

1. Login → `/alerts` (initially empty) → see "All clear · last alert
   never" empty state.
2. Trigger a fake-failed-backup via `POST /api/hosts/{id}/jobs` against a
   host with a deliberately-wrong rest-server URL. Wait for the
   `backup_failed` alert to appear in the list within ~2s of the job
   finishing.
3. Acknowledge → row tints + ack actor visible.
4. Take the agent offline (`systemctl stop`); wait 15 min OR mock
   `last_seen_at` to 16 min ago via the test harness; confirm
   `agent_offline` alert raises once.
5. Restart the agent → `agent_offline` auto-resolves; `backup_failed` is
   still open.
6. Configure a webhook channel pointing at a local test sink; click "Send
   test" → green ✓.
7. Configure a ntfy channel pointing at a local sink → click "Send test"
   → green ✓.
8. Configure an SMTP channel pointing at a local MailHog (Docker, port
   1025, no TLS for the local-only sweep) → click "Send test" → green ✓
   → MailHog UI at :8025 shows the test email with the right subject
   and Message-ID.
9. Trigger a fresh failed backup → all three channels receive the
   notification (verified from sink logs + MailHog inbox);
   `notification_log` has three rows `event=alert.raised, ok=true`.
10. Manually Resolve the open `backup_failed`; confirm all three channels
    receive `event=alert.resolved`.
11. Critical-severity test: trigger `check_failed` (mocked) → dashboard
    banner appears; clicking it lands on `/alerts?severity=critical&status=open`.
12. Empty the alerts again → banner disappears.

Screenshots into `_diag/p3-alerts-sweep/`. End-to-end clean, zero console
errors, before handing back.

## What does NOT change

- Existing chrome/templates beyond the small additions noted above.
- Existing `alerts.severity` CHECK (`info`/`warning`/`critical`) — already
  the right shape; no migration needed for that.
- Audit log writer pattern — engine writes audit rows for ack/resolve
  the same way every other state-changing handler does.
- The agent. Alerts are entirely a server concern; the agent doesn't
  know they exist.

## Open questions / explicit non-goals

- **Per-rule cooldowns / re-raise on long-running issues.** Out of scope
  (brainstorm question 8 ruled this out). Operators see "still happening"
  in the UI; they don't get a reminder ping.
- **SMTP HTML emails.** v1 is plain text only — operators wanting rich
  rendering can deploy a webhook → mail-merge bridge, or wait for a v2
  template engine. The Message-ID threading + plain text body should be
  enough for almost every overnight-digest workflow.
- **SMTP OAuth2 / XOAUTH2.** Out of scope. Gmail / Microsoft 365 with
  modern OAuth requires an `app password` workaround in v1. Native
  XOAUTH2 lands when an operator asks (or when Google starts refusing
  app passwords for non-business accounts in earnest).
- **Multi-recipient SMTP channels.** A channel = one `To`. Operators
  wanting multiple recipients add multiple channels. Keeps failure
  attribution per-recipient.
- **Apprise sidecar integration.** Deferred per brainstorm. The
  `Channel` interface accepts a fourth impl without reshaping when we get
  there.
- **Per-host or per-severity channel routing.** Out of scope. Likely
  next step if operators ask: a `min_severity` field on the channel row.
- **Snooze / mute.** Out of scope. Acknowledge is the closest analogue;
  full silence-windows would need a new table and is YAGNI for v1.
- **PagerDuty / OpsGenie.** Both have webhook receivers; operators wire
  them via the webhook channel today.
- **Alert "rules" UI.** No CRUD; the rule set is hardcoded.