docs: P3 alerts spec — add SMTP as first-class v1 channel

Post-brainstorm change after operator review: overnight-digest /
"don't ping me at 03:00, email me in the morning" use case is poorly
served by ntfy (push) and clumsy via webhook → email-gateway. SMTP joins
webhook + ntfy as the third v1 channel; Apprise stays deferred.

Spec updates:
- Decision 5 reworded: three channels in v1.
- Channel iface gains smtpChannel using net/smtp + crypto/tls. 10s
  timeout vs 5s for HTTP — STARTTLS handshake + DATA over a slow link
  legitimately needs the headroom.
- Migration 0014 CHECK now allows 'smtp'. New smtpConfig struct: host,
  port, encryption (starttls/tls/none), username, password (AEAD), from,
  to. One channel = one To-address; multi-recipient = multiple channels
  (keeps failure attribution per-recipient).
- Body shape documented: hardcoded subject pattern
  '[restic-manager] [<sev>] <host>: <kind>', Message-ID includes the
  alert id so threading groups raised → ack → resolved cleanly. Plain
  text only in v1.
- Encryption defaults to STARTTLS on 587, with implicit TLS on 465;
  PLAIN auth over TLS, no XOAUTH2 yet (app passwords recommended for
  Gmail / M365).
- Test plan adds MailHog step in the Playwright sweep.
- Non-goals expanded: HTML emails, OAuth2/XOAUTH2, multi-recipient
  channels are explicitly out of v1.

Wireframe updates (_diag/p3-alerts-wireframe/wireframe.html):
- Kind picker grows from 2 cards to 3 (Webhook / Ntfy / SMTP @). SMTP
  gets the --ok green colour family so it visually separates from
  webhook (accent) and ntfy (warm).
- New SMTP variant section (3c): host+port+encryption row, user+pass
  row, from+to row, test result, plus right-rail email shape preview
  showing the RFC 5322 layout.
- Channel list grows a third row: 'overnight-digest · smtp://… →
  ops-overnight@example.com'.
2026-05-04 18:48:15 +01:00
parent 6165e34f6f
commit 518c29ddb3
@@ -22,8 +22,12 @@ Brainstorm decisions (in order asked):
4. **Resolution.** Auto-resolve when the underlying condition clears + manual
Resolve at any time. Acknowledge is a separate "I've seen it" intermediate
state that does NOT close the alert.
5. **v1 channels.** Webhook + native ntfy. Apprise deferred (the channel
plumbing accepts new kinds without reshaping).
5. **v1 channels.** Webhook + native ntfy + SMTP. Apprise deferred (the
channel plumbing accepts new kinds without reshaping). SMTP added as
a first-class channel post-brainstorm because the use case (overnight
alerts the operator wants to read in the morning rather than be pinged
about at 03:00) is poorly served by ntfy's push model and clumsy via
webhook → email-gateway.
6. **Channel scope.** Global only. No per-host or per-severity routing in v1.
7. **Notification body.** Structured JSON for webhooks, formatted
title+body+click-URL for ntfy, plus a per-channel "Send test notification"
@@ -70,7 +74,7 @@ goroutine:
| `internal/alert.Engine` | Owns the rule evaluation. Exposes `OnJobFinished`, `OnHostOffline`, `OnHostOnline` event hooks; runs a 60s ticker for stale-schedule + auto-resolution sweeps. Persists raises/resolves through the store. | store, notification.Hub, slog |
| `internal/alert.Rule` + per-rule files | Each of the six rules is a small struct with `Kind() string`, `Severity() string`, `MessageFor(ctx) string`. The engine iterates over a registered slice. | store models |
| `internal/notification.Hub` | Receives "alert raised/resolved/test" events; fans out to enabled channels in parallel; logs results to a new `notification_log` table. | store, channel adapters |
| `internal/notification.Channel` (iface) | Single method `Send(ctx, payload) error` with a 5s context. Two impls in v1: `webhookChannel`, `ntfyChannel`. | http.Client |
| `internal/notification.Channel` (iface) | Single method `Send(ctx, payload) error` with a 5s context for HTTP channels, 10s for SMTP. Three impls in v1: `webhookChannel`, `ntfyChannel`, `smtpChannel`. | http.Client; net/smtp + crypto/tls for SMTP |
| `internal/store/alerts.go` | CRUD on `alerts` table: `RaiseOrTouch(host_id, kind, severity, message)`, `Acknowledge(id, user)`, `Resolve(id, by user)`, `AutoResolve(host_id, kind)`, `ListAlerts(filter)`, plus the `last_seen_at` bump. | sqlite |
| `internal/store/notification_channels.go` | CRUD on `notification_channels` (new table) + `notification_log` (new table). | sqlite, crypto.AEAD (for secrets) |
| `internal/server/http/ui_alerts.go` | `/alerts` page handler + filter parsing + ack/resolve form actions. | store |
@@ -162,6 +166,58 @@ Touch-only events keep the row's `last_seen_at` fresh so the UI can render
alert.test`. The same envelope shape is reused across events — operators
build one bridge, switch on `event` and `severity`.
**SMTP** — single-recipient plain-text email per channel. The channel
config carries the SMTP server credentials and a `to` address; one
channel = one recipient (or one distribution-list address). Operators
who want multiple recipients add multiple channels — keeps the config
flat and the failure modes per-recipient.
Subject pattern is hardcoded (no per-channel template in v1):
```
Subject: [restic-manager] [<severity>] <host_name>: <kind>
From: <configured-from-address>
To: <configured-to-address>
Date: <RFC 5322>
Message-ID: <alert_id@<server-host>>

<message line — same string the webhook/ntfy gets>
Raised at: 2026-05-04T15:42:01Z
Severity: warning
Host: alfa-01
Kind: backup_failed
Open in restic-manager:
https://restic-manager.example/alerts/01KQT...
(This message was sent by restic-manager. Acknowledge or resolve in the UI.)
```
The body is plain text only in v1 — no HTML alternative — both because
the data is already structured well enough as text and because HTML
email opens a long tail of rendering / sanitisation concerns. The
`Message-ID` includes the alert id so a thread-aware client can group
related events (raised → acknowledged → resolved) together.
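A minimal sketch of how the documented shape could be rendered, assuming a hypothetical `alertEmail` struct and helper names (`subjectFor`, `messageID`, `render`, `oneLine`) that do not appear in the spec. The CR/LF stripping and the exact body layout are assumptions as well; the spec only fixes the subject pattern, the header set, and the presence of the alert id in the `Message-ID`.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// alertEmail carries the fields the documented email shape needs.
// The type and its field names are hypothetical, not taken from the codebase.
type alertEmail struct {
	AlertID, Severity, HostName, Kind, Message string
	From, To                                   string
	ServerHost                                 string // right-hand side of the Message-ID
	BaseURL                                    string // e.g. https://restic-manager.example
	RaisedAt                                   time.Time
}

// oneLine strips CR/LF so a hostile host name cannot inject extra headers.
// (Assumption: the spec does not say how header values are sanitised.)
func oneLine(s string) string {
	return strings.NewReplacer("\r", "", "\n", "").Replace(s)
}

// subjectFor applies the hardcoded subject pattern from the spec.
func subjectFor(e alertEmail) string {
	return oneLine(fmt.Sprintf("[restic-manager] [%s] %s: %s", e.Severity, e.HostName, e.Kind))
}

// messageID embeds the alert id so thread-aware clients can group events.
func messageID(e alertEmail) string {
	return oneLine(fmt.Sprintf("<%s@%s>", e.AlertID, e.ServerHost))
}

// render produces the full plain-text message with CRLF line endings.
func render(e alertEmail) string {
	var b strings.Builder
	header := func(k, v string) { fmt.Fprintf(&b, "%s: %s\r\n", k, oneLine(v)) }
	header("Subject", subjectFor(e))
	header("From", e.From)
	header("To", e.To)
	header("Date", e.RaisedAt.Format(time.RFC1123Z)) // RFC 5322 date-time layout
	header("Message-ID", messageID(e))
	b.WriteString("\r\n") // blank line separates headers from body

	lines := []string{
		e.Message,
		"Raised at: " + e.RaisedAt.UTC().Format(time.RFC3339),
		"Severity: " + e.Severity,
		"Host: " + e.HostName,
		"Kind: " + e.Kind,
		"Open in restic-manager:",
		e.BaseURL + "/alerts/" + e.AlertID,
		"(This message was sent by restic-manager. Acknowledge or resolve in the UI.)",
	}
	b.WriteString(strings.Join(lines, "\r\n") + "\r\n")
	return b.String()
}

func main() {
	e := alertEmail{
		AlertID: "example-alert-id", Severity: "warning", HostName: "alfa-01",
		Kind: "backup_failed", Message: "backup failed on alfa-01",
		From: "alerts@example.com", To: "ops-overnight@example.com",
		ServerHost: "restic-manager.example", BaseURL: "https://restic-manager.example",
		RaisedAt: time.Date(2026, 5, 4, 15, 42, 1, 0, time.UTC),
	}
	fmt.Print(render(e))
}
```

Keeping `subjectFor` and `messageID` as separate pure functions would let `smtp_test.go` assert on them without a network round-trip.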
Encryption:
- **STARTTLS** (default, port 587). Opportunistic upgrade; this is what
most operator-facing relays expect.
- **Implicit TLS** (port 465). Connect-then-TLS-handshake.
- **None** (port 25). Plain. Hidden behind a "Yes I understand" warning
on the form because the password goes over the wire.
Auth:
- **PLAIN** (RFC 4616) over TLS. Default and almost always what's wanted.
- **CRAM-MD5** (RFC 2195). Offered if the server advertises it, no UI
toggle — automatic.
- No OAuth2 / XOAUTH2 in v1; that's a real next step if Gmail-without-
app-passwords becomes a recurring ask.
Per-message timeout is 10s (vs 5s for HTTP channels) — STARTTLS
handshake + DATA over a slow link can legitimately take that long.
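The encryption modes and the 10s budget above could be wired up roughly as follows. This is a sketch under stated assumptions: `defaultPort`, `dial`, and `smtpTimeout` are hypothetical names, the TLS 1.2 floor is an assumption, and refusing to continue when the server does not advertise STARTTLS is a stricter choice than the spec spells out. Only standard-library calls from `net`, `crypto/tls`, and `net/smtp` are used.

```go
package main

import (
	"context"
	"crypto/tls"
	"errors"
	"fmt"
	"net"
	"net/smtp"
	"strconv"
	"time"
)

// smtpTimeout is the per-message budget from the spec (HTTP channels use 5s).
const smtpTimeout = 10 * time.Second

// defaultPort maps an encryption mode to its conventional port.
func defaultPort(encryption string) (int, error) {
	switch encryption {
	case "starttls":
		return 587, nil
	case "tls":
		return 465, nil
	case "none":
		return 25, nil
	}
	return 0, fmt.Errorf("unknown encryption mode %q", encryption)
}

// dial opens an SMTP session using the requested encryption mode.
// The caller authenticates and sends on the returned client, then closes it.
func dial(ctx context.Context, host string, port int, encryption string) (*smtp.Client, error) {
	ctx, cancel := context.WithTimeout(ctx, smtpTimeout)
	defer cancel()

	addr := net.JoinHostPort(host, strconv.Itoa(port))
	tlsCfg := &tls.Config{ServerName: host, MinVersion: tls.VersionTLS12}

	var conn net.Conn
	var err error
	if encryption == "tls" {
		// Implicit TLS: the handshake happens before any SMTP traffic.
		conn, err = (&tls.Dialer{Config: tlsCfg}).DialContext(ctx, "tcp", addr)
	} else {
		conn, err = (&net.Dialer{}).DialContext(ctx, "tcp", addr)
	}
	if err != nil {
		return nil, err
	}
	// The absolute deadline covers the rest of the exchange (EHLO, AUTH, DATA).
	if deadline, ok := ctx.Deadline(); ok {
		_ = conn.SetDeadline(deadline)
	}

	c, err := smtp.NewClient(conn, host)
	if err != nil {
		conn.Close()
		return nil, err
	}
	if encryption == "starttls" {
		if ok, _ := c.Extension("STARTTLS"); !ok {
			c.Close()
			return nil, errors.New("server does not advertise STARTTLS")
		}
		if err := c.StartTLS(tlsCfg); err != nil {
			c.Close()
			return nil, err
		}
	}
	return c, nil
}

func main() {
	for _, mode := range []string{"starttls", "tls", "none"} {
		port, _ := defaultPort(mode)
		fmt.Println(mode, port)
	}
}
```

Setting a connection deadline rather than relying on the context alone matters here because `net/smtp` calls do not take a context; without the deadline a stalled DATA phase would not be bounded by the 10s budget.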
**Ntfy** — uses the standard publish format:
```
@@ -229,11 +285,11 @@ with rows from the alert-engine-pre-bump period.
```sql
CREATE TABLE notification_channels (
id TEXT PRIMARY KEY,
kind TEXT NOT NULL CHECK (kind IN ('webhook', 'ntfy')),
kind TEXT NOT NULL CHECK (kind IN ('webhook', 'ntfy', 'smtp')),
name TEXT NOT NULL,
enabled INTEGER NOT NULL DEFAULT 1 CHECK (enabled IN (0, 1)),
config BLOB NOT NULL, -- AEAD-encrypted JSON; per-kind shape
default_priority TEXT, -- ntfy only; null for others
default_priority TEXT, -- ntfy only; null for webhook + smtp
created_at TEXT NOT NULL,
updated_at TEXT NOT NULL,
last_fired_at TEXT
@@ -273,6 +329,16 @@ type ntfyConfig struct {
Topic string `json:"topic"`
AccessToken string `json:"access_token,omitempty"`
}
type smtpConfig struct {
Host string `json:"host"` // e.g. smtp.example.com
Port int `json:"port"` // default 587 (STARTTLS), 465 (TLS), 25 (none)
Encryption string `json:"encryption"` // "starttls" | "tls" | "none"
Username string `json:"username"`
Password string `json:"password"` // sensitive — AEAD-encrypted with the rest of config
From string `json:"from"` // RFC 5322 address; "alerts@example.com" or "Restic-Manager <alerts@…>"
To string `json:"to"` // single recipient or distribution-list address; v1 = one channel = one to-line
}
```
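One way the "one channel = one To-address" rule could be enforced at save time is sketched below. The `validate` method and its error messages are hypothetical; the struct is repeated so the example compiles on its own. Address parsing uses the standard `net/mail` package.

```go
package main

import (
	"errors"
	"fmt"
	"net/mail"
)

// smtpConfig mirrors the struct from the spec, repeated here for self-containment.
type smtpConfig struct {
	Host       string `json:"host"`
	Port       int    `json:"port"`
	Encryption string `json:"encryption"`
	Username   string `json:"username"`
	Password   string `json:"password"`
	From       string `json:"from"`
	To         string `json:"to"`
}

// validate is a hypothetical save-time check; the spec does not define one.
func (c smtpConfig) validate() error {
	if c.Host == "" {
		return errors.New("host is required")
	}
	if c.Port < 1 || c.Port > 65535 {
		return fmt.Errorf("port %d out of range", c.Port)
	}
	switch c.Encryption {
	case "starttls", "tls", "none":
	default:
		return fmt.Errorf("unknown encryption %q", c.Encryption)
	}
	// From may be a bare address or "Name <addr>".
	if _, err := mail.ParseAddress(c.From); err != nil {
		return fmt.Errorf("from: %w", err)
	}
	// Parse To as a list so a comma-separated value is detected and rejected.
	to, err := mail.ParseAddressList(c.To)
	if err != nil {
		return fmt.Errorf("to: %w", err)
	}
	if len(to) != 1 {
		return fmt.Errorf("to: want exactly one recipient, got %d", len(to))
	}
	return nil
}

func main() {
	cfg := smtpConfig{
		Host: "smtp.example.com", Port: 587, Encryption: "starttls",
		From: "Restic-Manager <alerts@example.com>", To: "ops-overnight@example.com",
	}
	fmt.Println("single recipient:", cfg.validate())

	cfg.To = "a@example.com, b@example.com"
	fmt.Println("two recipients:", cfg.validate())
}
```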
### Engine state
@@ -316,6 +382,13 @@ inert sub-tabs.
- `internal/notification/ntfy_test.go` — title/priority/tags/click headers
match the severity mapping; access token sent as `Authorization: Bearer
<token>`; default priority overridden by severity for critical.
- `internal/notification/smtp_test.go` — round-trip against a local
`httptest.NewServer`-style fake SMTP server (`net/smtp` itself is
client-only; or `mhog`/MailHog if convenient):
STARTTLS handshake completes against a self-signed cert; PLAIN auth
uses configured creds; subject + from + to + body bytes match the
spec'd format; Message-ID contains the alert id; 10s timeout enforced;
failure path (auth refused) lands in `notification_log` with the
server's error string.
- `internal/server/http/ui_alerts_test.go` — page renders with filters
applied; ack/resolve POSTs flip the row + write audit; HX-Redirect
bounces back to the filtered list.
@@ -346,14 +419,18 @@ End-of-phase sweep mirrors the P2R-02 / P3-restore pattern:
test" → green ✓.
7. Configure a ntfy channel pointing at a local sink → click "Send test"
→ green ✓.
8. Trigger a fresh failed backup → both channels receive the notification
(verified from sink logs); `notification_log` has two rows
`event=alert.raised, ok=true`.
9. Manually Resolve the open `backup_failed`; confirm both channels
receive `event=alert.resolved`.
10. Critical-severity test: trigger `check_failed` (mocked) → dashboard
8. Configure an SMTP channel pointing at a local MailHog (Docker, port
1025, no TLS for the local-only sweep) → click "Send test" → green ✓
→ MailHog UI at :8025 shows the test email with the right subject
and Message-ID.
9. Trigger a fresh failed backup → all three channels receive the
notification (verified from sink logs + MailHog inbox);
`notification_log` has three rows `event=alert.raised, ok=true`.
10. Manually Resolve the open `backup_failed`; confirm all three channels
receive `event=alert.resolved`.
11. Critical-severity test: trigger `check_failed` (mocked) → dashboard
banner appears; clicking it lands on `/alerts?severity=critical&status=open`.
11. Empty the alerts again → banner disappears.
12. Empty the alerts again → banner disappears.
Screenshots into `_diag/p3-alerts-sweep/`. End-to-end clean, zero console
errors, before handing back.
@@ -373,8 +450,17 @@ errors, before handing back.
- **Per-rule cooldowns / re-raise on long-running issues.** Out of scope
(brainstorm question 8 ruled this out). Operators see "still happening"
in the UI; they don't get a reminder ping.
- **SMTP / email channel.** Out of scope. Operators wanting email today
can chain webhook → email-gateway; native SMTP can land later.
- **SMTP HTML emails.** v1 is plain text only — operators wanting rich
rendering can deploy a webhook → mail-merge bridge, or wait for a v2
template engine. The Message-ID threading + plain text body should be
enough for almost every overnight-digest workflow.
- **SMTP OAuth2 / XOAUTH2.** Out of scope. Gmail / Microsoft 365 with
modern OAuth requires an `app password` workaround in v1. Native
XOAUTH2 lands when an operator asks (or when Google starts refusing
app passwords for non-business accounts in earnest).
- **Multi-recipient SMTP channels.** A channel = one `To`. Operators
wanting multiple recipients add multiple channels. Keeps failure
attribution per-recipient.
- **Apprise sidecar integration.** Deferred per brainstorm. The
`Channel` interface accepts a fourth impl without reshaping when we get
there.