diff --git a/docs/superpowers/specs/2026-05-04-p3-alerts-design.md b/docs/superpowers/specs/2026-05-04-p3-alerts-design.md index b88ca95..21e1f61 100644 --- a/docs/superpowers/specs/2026-05-04-p3-alerts-design.md +++ b/docs/superpowers/specs/2026-05-04-p3-alerts-design.md @@ -22,8 +22,12 @@ Brainstorm decisions (in order asked): 4. **Resolution.** Auto-resolve when the underlying condition clears + manual Resolve at any time. Acknowledge is a separate "I've seen it" intermediate state that does NOT close the alert. -5. **v1 channels.** Webhook + native ntfy. Apprise deferred (the channel - plumbing accepts new kinds without reshaping). +5. **v1 channels.** Webhook + native ntfy + SMTP. Apprise deferred (the + channel plumbing accepts new kinds without reshaping). SMTP added as + a first-class channel post-brainstorm because the use case — overnight + alerts the operator wants to read in the morning rather than be pinged + on at 03:00 — is poorly served by ntfy's push model and clumsy via + webhook → email-gateway. 6. **Channel scope.** Global only. No per-host or per-severity routing in v1. 7. **Notification body.** Structured JSON for webhooks, formatted title+body+click-URL for ntfy, plus a per-channel "Send test notification" @@ -70,7 +74,7 @@ goroutine: | `internal/alert.Engine` | Owns the rule evaluation. Exposes `OnJobFinished`, `OnHostOffline`, `OnHostOnline` event hooks; runs a 60s ticker for stale-schedule + auto-resolution sweeps. Persists raises/resolves through the store. | store, notification.Hub, slog | | `internal/alert.Rule` + per-rule files | Each of the six rules is a small struct with `Kind() string`, `Severity() string`, `MessageFor(ctx) string`. The engine iterates over a registered slice. | store models | | `internal/notification.Hub` | Receives "alert raised/resolved/test" events; fans out to enabled channels in parallel; logs results to a new `notification_log` table. 
| store, channel adapters | -| `internal/notification.Channel` (iface) | Single method `Send(ctx, payload) error` with a 5s context. Two impls in v1: `webhookChannel`, `ntfyChannel`. | http.Client | +| `internal/notification.Channel` (iface) | Single method `Send(ctx, payload) error` with a 5s context for HTTP channels, 10s for SMTP. Three impls in v1: `webhookChannel`, `ntfyChannel`, `smtpChannel`. | http.Client; net/smtp + crypto/tls for SMTP | | `internal/store/alerts.go` | CRUD on `alerts` table: `RaiseOrTouch(host_id, kind, severity, message)`, `Acknowledge(id, user)`, `Resolve(id, by user)`, `AutoResolve(host_id, kind)`, `ListAlerts(filter)`, plus the `last_seen_at` bump. | sqlite | | `internal/store/notification_channels.go` | CRUD on `notification_channels` (new table) + `notification_log` (new table). | sqlite, crypto.AEAD (for secrets) | | `internal/server/http/ui_alerts.go` | `/alerts` page handler + filter parsing + ack/resolve form actions. | store | @@ -162,6 +166,58 @@ Touch-only events keep the row's `last_seen_at` fresh so the UI can render alert.test`. The same envelope shape is reused across events — operators build one bridge, switch on `event` and `severity`. +**SMTP** — single-recipient plain-text email per channel. The channel +config carries the SMTP server credentials and a `to` address; one +channel = one recipient (or one distribution-list address). Operators +who want multiple recipients add multiple channels — keeps the config +flat and the failure modes per-recipient. + +Subject pattern is hardcoded (no per-channel template in v1): + +``` +Subject: [restic-manager] [] : +From: +To: +Date: +Message-ID: > + + + +— +Raised at: 2026-05-04T15:42:01Z +Severity: warning +Host: alfa-01 +Kind: backup_failed + +Open in restic-manager: +https://restic-manager.example/alerts/01KQT... + +(This message was sent by restic-manager. Acknowledge or resolve in the UI.) 
+``` + +The body is plain text only in v1 — no HTML alternative — both because +the data is already structured well enough as text and because HTML +email opens a long tail of rendering / sanitisation concerns. The +`Message-ID` includes the alert id so a thread-aware client can group +related events (raised → acknowledged → resolved) together. + +Encryption: +- **STARTTLS** (default, port 587). Opportunistic upgrade. Most + operator-facing relays. +- **Implicit TLS** (port 465). Connect-then-TLS-handshake. +- **None** (port 25). Plain. Hidden behind a "Yes I understand" warning + on the form because the password goes over the wire. + +Auth: +- **PLAIN** (RFC 4616) over TLS. Default and almost always what's wanted. +- **CRAM-MD5** (RFC 2195). Offered if the server advertises it, no UI + toggle — automatic. +- No OAuth2 / XOAUTH2 in v1; that's a real next step if Gmail-without- + app-passwords becomes a recurring ask. + +Per-message timeout is 10s (vs 5s for HTTP channels) — STARTTLS +handshake + DATA over a slow link can legitimately take that long. + **Ntfy** — uses the standard publish format: ``` @@ -229,11 +285,11 @@ with rows from the alert-engine-pre-bump period. ```sql CREATE TABLE notification_channels ( id TEXT PRIMARY KEY, - kind TEXT NOT NULL CHECK (kind IN ('webhook', 'ntfy')), + kind TEXT NOT NULL CHECK (kind IN ('webhook', 'ntfy', 'smtp')), name TEXT NOT NULL, enabled INTEGER NOT NULL DEFAULT 1 CHECK (enabled IN (0, 1)), config BLOB NOT NULL, -- AEAD-encrypted JSON; per-kind shape - default_priority TEXT, -- ntfy only; null for others + default_priority TEXT, -- ntfy only; null for webhook + smtp created_at TEXT NOT NULL, updated_at TEXT NOT NULL, last_fired_at TEXT @@ -273,6 +329,16 @@ type ntfyConfig struct { Topic string `json:"topic"` AccessToken string `json:"access_token,omitempty"` } + +type smtpConfig struct { + Host string `json:"host"` // e.g. 
smtp.example.com + Port int `json:"port"` // default 587 (STARTTLS), 465 (TLS), 25 (none) + Encryption string `json:"encryption"` // "starttls" | "tls" | "none" + Username string `json:"username"` + Password string `json:"password"` // sensitive — AEAD-encrypted with the rest of config + From string `json:"from"` // RFC 5322 address; "alerts@example.com" or "Restic-Manager " + To string `json:"to"` // single recipient or distribution-list address; v1 = one channel = one to-line +} ``` ### Engine state @@ -316,6 +382,13 @@ inert sub-tabs. - `internal/notification/ntfy_test.go` — title/priority/tags/click headers match the severity mapping; access token sent as `Authorization: Bearer `; default priority overridden by severity for critical. +- `internal/notification/smtp_test.go` — round-trip against a local + in-process fake SMTP server, `httptest.NewServer`-style (`net/smtp` is + client-only, so the fake is hand-rolled), or `mhog`/MailHog if convenient: + STARTTLS handshake completes against a self-signed cert; PLAIN auth + uses configured creds; subject + from + to + body bytes match the + spec'd format; Message-ID contains the alert id; 10s timeout enforced; + failure path (auth refused) lands in `notification_log` with the + server's error string. - `internal/server/http/ui_alerts_test.go` — page renders with filters applied; ack/resolve POSTs flip the row + write audit; HX-Redirect bounces back to the filtered list. @@ -346,14 +419,18 @@ End-of-phase sweep mirrors the P2R-02 / P3-restore pattern: test" → green ✓. 7. Configure a ntfy channel pointing at a local sink → click "Send test" → green ✓. -8. 
Configure an SMTP channel pointing at a local MailHog (Docker, port + 1025, no TLS for the local-only sweep) → click "Send test" → green ✓ + → MailHog UI at :8025 shows the test email with the right subject + and Message-ID. +9. Trigger a fresh failed backup → all three channels receive the + notification (verified from sink logs + MailHog inbox); + `notification_log` has three rows `event=alert.raised, ok=true`. +10. Manually Resolve the open `backup_failed`; confirm all three channels + receive `event=alert.resolved`. +11. Critical-severity test: trigger `check_failed` (mocked) → dashboard banner appears; clicking it lands on `/alerts?severity=critical&status=open`. -11. Empty the alerts again → banner disappears. +12. Empty the alerts again → banner disappears. Screenshots into `_diag/p3-alerts-sweep/`. End-to-end clean, zero console errors, before handing back. @@ -373,8 +450,17 @@ errors, before handing back. - **Per-rule cooldowns / re-raise on long-running issues.** Out of scope (brainstorm question 8 ruled this out). Operators see "still happening" in the UI; they don't get a reminder ping. -- **SMTP / email channel.** Out of scope. Operators wanting email today - can chain webhook → email-gateway; native SMTP can land later. +- **SMTP HTML emails.** v1 is plain text only — operators wanting rich + rendering can deploy a webhook → mail-merge bridge, or wait for a v2 + template engine. The Message-ID threading + plain text body should be + enough for almost every overnight-digest workflow. +- **SMTP OAuth2 / XOAUTH2.** Out of scope. Gmail / Microsoft 365 with + modern OAuth requires an `app password` workaround in v1. Native + XOAUTH2 lands when an operator asks (or when Google starts refusing + app passwords for non-business accounts in earnest). +- **Multi-recipient SMTP channels.** A channel = one `To`. Operators + wanting multiple recipients add multiple channels. Keeps failure + attribution per-recipient. 
- **Apprise sidecar integration.** Deferred per brainstorm. The `Channel` interface accepts a fourth impl without reshaping when we get there.