Post-brainstorm change after operator review: the overnight-digest use case ("don't ping me at 03:00, email me in the morning") is poorly served by ntfy (push) and clumsy via webhook → email-gateway. SMTP joins webhook + ntfy as the third v1 channel; Apprise stays deferred.

Spec updates:
- Decision 5 reworded: three channels in v1.
- Channel iface gains smtpChannel using net/smtp + crypto/tls. 10s timeout vs 5s for HTTP: STARTTLS handshake + DATA over a slow link legitimately needs the headroom.
- Migration 0014 CHECK now allows 'smtp'. New smtpConfig struct: host, port, encryption (starttls/tls/none), username, password (AEAD), from, to. One channel = one To-address; multi-recipient = multiple channels (keeps failure attribution per-recipient).
- Body shape documented: hardcoded subject pattern '[restic-manager] [<sev>] <host>: <kind>', Message-ID includes the alert id so threading groups raised → ack → resolved cleanly. Plain text only in v1.
- Encryption defaults to STARTTLS on 587 (implicit TLS on 465); PLAIN auth over TLS, no XOAUTH2 yet (app passwords recommended for Gmail / M365).
- Test plan adds MailHog step in the Playwright sweep.
- Non-goals expanded: HTML emails, OAuth2/XOAUTH2, multi-recipient channels are explicitly out of v1.

Wireframe updates (_diag/p3-alerts-wireframe/wireframe.html):
- Kind picker grows from 2 cards to 3 (Webhook / Ntfy / SMTP @). SMTP gets the --ok green colour family so it visually separates from webhook (accent) and ntfy (warm).
- New SMTP variant section (3c): host+port+encryption row, user+pass row, from+to row, test result, plus right-rail email shape preview showing the RFC 5322 layout.
- Channel list grows a third row: 'overnight-digest · smtp://… → ops-overnight@example.com'.
P3 — Alerts (design)
Phase 3 sub-spec covering the alerts engine, notification channels, and UI (P3-05 / P3-06 / P3-07).
Wireframe: _diag/p3-alerts-wireframe/wireframe.html. Screenshots in the same directory. Spec brainstorm ran 2026-05-04; user approved all ten design decisions before this spec was written.
Scope locked
Brainstorm decisions (in order asked):
- Rule model. Hardcoded rule set, no operator-tunable thresholds in v1. The engine knows about each rule type internally; per-rule config can land later if/when an operator asks.
- Rule set. Six rules: backup_failed, forget_failed, prune_failed, check_failed, stale_schedule, agent_offline.
- Engine cadence. Hybrid. Event hooks at the existing MarkJobFinished and offline-sweeper sites for the immediate triggers; one 60-second ticker handles stale-schedule detection and auto-resolution.
- Resolution. Auto-resolve when the underlying condition clears + manual Resolve at any time. Acknowledge is a separate "I've seen it" intermediate state that does NOT close the alert.
- v1 channels. Webhook + native ntfy + SMTP. Apprise deferred (the channel plumbing accepts new kinds without reshaping). SMTP added as a first-class channel post-brainstorm because the use case (overnight alerts the operator wants to read in the morning rather than be pinged about at 03:00) is poorly served by ntfy's push model and clumsy via webhook → email-gateway.
- Channel scope. Global only. No per-host or per-severity routing in v1.
- Notification body. Structured JSON for webhooks, formatted title+body+click-URL for ntfy, plain-text email for SMTP, plus a per-channel "Send test notification" button with inline result feedback.
- Deduplication. Open-alert uniqueness on (host_id, kind) with a last_seen_at bump on every confirming tick. One notification per occurrence; the UI shows "still happening · Ns ago" while a rule keeps matching.
- Alert UI. Top-level /alerts page (the existing nav stub becomes real). Per-host vitals "Open alerts" cell links to /alerts?host_id=.... Channel CRUD lives at /settings/notifications.
- Delivery semantics. Best-effort fire-and-forget with a 5s timeout per notification (10s for SMTP). Failures are logged but not retried. The alert row in the DB is the source of truth.
Architecture
The subsystem is three loosely-coupled units behind one AlertEngine
goroutine:
                              ┌───────────────────────────┐
event hooks ─────────────────►│                           │
                              │        AlertEngine        │ ──► raise/resolve
60s ticker ──────────────────►│     (rule evaluation)     │     alert row
                              │                           │
                              └────────────┬──────────────┘
                                           │
                                           ▼
                                ┌──────────────────────┐
                                │   notification.Hub   │
                                │  (fire-and-forget)   │
                                └──┬────────┬──────────┘
                                   │        │
                            ┌──────▼──┐  ┌──▼──────┐
                            │ Webhook │  │  Ntfy   │  …future channels
                            └─────────┘  └─────────┘
Component boundaries
| Component | Purpose | Depends on |
|---|---|---|
| internal/alert.Engine | Owns the rule evaluation. Exposes OnJobFinished, OnHostOffline, OnHostOnline event hooks; runs a 60s ticker for stale-schedule + auto-resolution sweeps. Persists raises/resolves through the store. | store, notification.Hub, slog |
| internal/alert.Rule + per-rule files | Each of the six rules is a small struct with Kind() string, Severity() string, MessageFor(ctx) string. The engine iterates over a registered slice. | store models |
| internal/notification.Hub | Receives "alert raised/resolved/test" events; fans out to enabled channels in parallel; logs results to a new notification_log table. | store, channel adapters |
| internal/notification.Channel (iface) | Single method Send(ctx, payload) error with a 5s context for HTTP channels, 10s for SMTP. Three impls in v1: webhookChannel, ntfyChannel, smtpChannel. | http.Client; net/smtp + crypto/tls for SMTP |
| internal/store/alerts.go | CRUD on alerts table: RaiseOrTouch(host_id, kind, severity, message), Acknowledge(id, user), Resolve(id, by user), AutoResolve(host_id, kind), ListAlerts(filter), plus the last_seen_at bump. | sqlite |
| internal/store/notification_channels.go | CRUD on notification_channels (new table) + notification_log (new table). | sqlite, crypto.AEAD (for secrets) |
| internal/server/http/ui_alerts.go | /alerts page handler + filter parsing + ack/resolve form actions. | store |
| internal/server/http/ui_notifications.go | /settings/notifications page + channel CRUD + "Send test" handler. | store, notification.Hub |
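The Hub / Channel split in the table can be illustrated with a minimal sketch. Only the single-method Send interface and the 5s / 10s timeouts come from the spec; Payload, namedChannel, Result, FanOut and fakeChannel are hypothetical names invented here, and a real Hub would write each Result to notification_log rather than return it.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// Payload is a hypothetical stand-in for the event handed to channels.
type Payload struct {
	Event   string
	AlertID string
}

// Channel mirrors the single-method interface from the component table.
type Channel interface {
	Send(ctx context.Context, p Payload) error
}

// namedChannel pairs an adapter with its kind so the timeout can be chosen.
type namedChannel struct {
	Kind string
	Ch   Channel
}

// Result is what a real Hub would persist to notification_log.
type Result struct {
	Kind string
	Err  error
}

// timeoutFor returns 10s for SMTP and 5s for the HTTP-based channels.
func timeoutFor(kind string) time.Duration {
	if kind == "smtp" {
		return 10 * time.Second
	}
	return 5 * time.Second
}

// FanOut sends p to every channel in parallel and waits for all of them.
// A fire-and-forget caller would run FanOut itself in a goroutine.
func FanOut(ctx context.Context, chans []namedChannel, p Payload) []Result {
	results := make([]Result, len(chans))
	var wg sync.WaitGroup
	for i, nc := range chans {
		wg.Add(1)
		go func(i int, nc namedChannel) {
			defer wg.Done()
			cctx, cancel := context.WithTimeout(ctx, timeoutFor(nc.Kind))
			defer cancel()
			results[i] = Result{Kind: nc.Kind, Err: nc.Ch.Send(cctx, p)}
		}(i, nc)
	}
	wg.Wait()
	return results
}

// fakeChannel is a test double that returns a fixed error.
type fakeChannel struct{ err error }

func (f fakeChannel) Send(ctx context.Context, p Payload) error { return f.err }

// demo fans one event out to a healthy channel and a failing one.
func demo() []Result {
	chans := []namedChannel{
		{Kind: "webhook", Ch: fakeChannel{}},
		{Kind: "smtp", Ch: fakeChannel{err: errors.New("auth refused")}},
	}
	return FanOut(context.Background(), chans, Payload{Event: "alert.raised", AlertID: "01KQT"})
}

func main() {
	for _, r := range demo() {
		fmt.Println(r.Kind, r.Err)
	}
}
```

One failing channel does not affect the others: each send gets its own context and its own result slot.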
Engine event shape
The engine runs as one goroutine per server process started in
cmd/server/main.go. It exposes a small set of channels other code writes to:
type Engine struct {
    store *store.Store
    hub   *notification.Hub

    // Event channels (buffered, drop-on-full with a slog warning to keep
    // hot paths non-blocking). The engine drains them on its own
    // goroutine, evaluates the rule, and acts.
    jobFinished chan jobFinishedEvent // from store.MarkJobFinished hook
    hostOffline chan string           // host_id; from offline sweeper
    hostOnline  chan string           // host_id; from ws handler hello

    // 60s ticker drives stale-schedule + auto-resolution sweeps.
    tick *time.Ticker
}
The hot-path call sites (store.MarkJobFinished, ws.handler offline
sweep, ws.handler hello) push to these channels via a tiny
Engine.Notify* method that does a non-blocking send. The engine's own
goroutine handles every match — keeps mutation off the hot path.
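The non-blocking send described above is the usual select-with-default idiom. A minimal sketch follows; NotifyJobFinished is a hypothetical member of the Engine.Notify* family, the jobFinishedEvent fields are assumed, and the boolean return exists only to make the behaviour observable here.

```go
package main

import (
	"fmt"
	"log/slog"
)

// jobFinishedEvent carries what the rules need; the fields are assumptions.
type jobFinishedEvent struct {
	HostID string
	Kind   string
	Status string
}

// Engine is trimmed to the one channel this sketch exercises.
type Engine struct {
	jobFinished chan jobFinishedEvent
}

// NotifyJobFinished never blocks the caller: when the buffer is full the
// event is dropped and a warning is logged instead.
func (e *Engine) NotifyJobFinished(ev jobFinishedEvent) bool {
	select {
	case e.jobFinished <- ev:
		return true
	default:
		slog.Warn("alert engine busy, dropping event", "host_id", ev.HostID, "kind", ev.Kind)
		return false
	}
}

func main() {
	// Buffer of one and no consumer: the second send has nowhere to go.
	e := &Engine{jobFinished: make(chan jobFinishedEvent, 1)}
	ev := jobFinishedEvent{HostID: "h1", Kind: "backup", Status: "failed"}
	fmt.Println(e.NotifyJobFinished(ev)) // true
	fmt.Println(e.NotifyJobFinished(ev)) // false
}
```

A dropped job event is presumably recoverable for the ticker-driven rules only; the buffer size therefore decides how much burst the event-driven rules tolerate.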
Rule catalogue
| Kind | Severity | Trigger | Auto-resolve when |
|---|---|---|---|
| backup_failed | warning | MarkJobFinished with kind=backup, status=failed | next backup for the same host succeeds |
| forget_failed | warning | MarkJobFinished with kind=forget, status=failed | next forget for the same host succeeds |
| prune_failed | warning | MarkJobFinished with kind=prune, status=failed | next prune for the same host succeeds |
| check_failed | critical | MarkJobFinished with kind=check, status=failed OR errors_found | next check for the same host succeeds without errors |
| stale_schedule | warning | 60s ticker: a schedule's next-fire time is more than 5 minutes in the past with no matching job since | next job for that schedule succeeds OR schedule deleted |
| agent_offline | warning | offline-sweeper marks the host offline AND the host has been offline > 15 min (engine checks last_seen_at) | hostOnline event for that host |
The 15-minute floor on agent_offline exists so a 30-second blip during
agent restart doesn't generate a notification storm. The store's existing
offline sweeper (hosts.last_seen_at with 90s threshold) already marks the
host offline; the engine sees the event but waits for the threshold before
raising.
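The two time-based triggers reduce to small predicates. The sketch below assumes the engine has the relevant timestamps in hand; shouldRaiseOffline, isStale and demo are hypothetical names, and a zero lastJobAt is used here to mean "no job has ever run for this schedule".

```go
package main

import (
	"fmt"
	"time"
)

const (
	offlineFloor = 15 * time.Minute // agent_offline floor from the rule table
	staleAfter   = 5 * time.Minute  // stale_schedule grace period
)

// shouldRaiseOffline reports whether a host already marked offline has been
// silent for longer than the 15-minute floor.
func shouldRaiseOffline(lastSeenAt, now time.Time) bool {
	return now.Sub(lastSeenAt) > offlineFloor
}

// isStale reports whether a schedule's next-fire time is more than five
// minutes in the past with no matching job at or after it.
func isStale(nextFire, lastJobAt, now time.Time) bool {
	if now.Sub(nextFire) <= staleAfter {
		return false
	}
	return lastJobAt.IsZero() || lastJobAt.Before(nextFire)
}

// demo evaluates a 30-second blip, a 16-minute outage, and a schedule that
// should have fired six minutes ago but never ran.
func demo() (blip, outage, stale bool) {
	now := time.Date(2026, 5, 4, 12, 0, 0, 0, time.UTC)
	blip = shouldRaiseOffline(now.Add(-30*time.Second), now)
	outage = shouldRaiseOffline(now.Add(-16*time.Minute), now)
	stale = isStale(now.Add(-6*time.Minute), time.Time{}, now)
	return blip, outage, stale
}

func main() {
	blip, outage, stale := demo()
	fmt.Println(blip, outage, stale) // false true true
}
```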
Dedup + last_seen_at
store.RaiseOrTouch(host_id, kind, severity, message):
SELECT id, last_seen_at FROM alerts
WHERE host_id = ? AND kind = ? AND resolved_at IS NULL
LIMIT 1;
- Found: UPDATE alerts SET last_seen_at = ?, message = ? WHERE id = ?, return (id, didRaise=false).
- Not found: INSERT INTO alerts (id, host_id, kind, severity, message, created_at, last_seen_at) VALUES (?, ?, ?, ?, ?, ?, ?), return (id, didRaise=true).
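The found / not-found branches can be modelled in memory to make the contract concrete. This is a sketch of the semantics only, not the SQLite implementation: memStore, alertKey and the integer timestamps are invented here, and the real select-then-write pair would presumably run inside one transaction so two concurrent raises cannot both insert.

```go
package main

import "fmt"

// alertKey is the open-alert uniqueness key from the dedup decision.
type alertKey struct{ hostID, kind string }

// alert holds the columns the sketch touches.
type alert struct {
	ID         string
	Message    string
	LastSeenAt int64 // unix seconds here; the real column is TEXT
}

// memStore stands in for the alerts table, holding open alerts only.
type memStore struct {
	open map[alertKey]*alert
	next int
}

// RaiseOrTouch bumps last_seen_at on an existing open alert, or inserts a
// new one. didRaise is true only for the insert, which is the caller's cue
// to notify.
func (s *memStore) RaiseOrTouch(hostID, kind, message string, now int64) (id string, didRaise bool) {
	k := alertKey{hostID, kind}
	if a, ok := s.open[k]; ok {
		a.LastSeenAt, a.Message = now, message
		return a.ID, false
	}
	s.next++
	a := &alert{ID: fmt.Sprintf("alert-%d", s.next), Message: message, LastSeenAt: now}
	s.open[k] = a
	return a.ID, true
}

func main() {
	s := &memStore{open: map[alertKey]*alert{}}
	id1, raised1 := s.RaiseOrTouch("h1", "backup_failed", "401", 100)
	id2, raised2 := s.RaiseOrTouch("h1", "backup_failed", "401 again", 160)
	fmt.Println(id1, raised1) // alert-1 true
	fmt.Println(id2, raised2) // alert-1 false
}
```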
The engine fires a notification through the Hub only when didRaise=true.
Touch-only events keep the row's last_seen_at fresh so the UI can render
"still happening · Ns ago" without spamming the operator's phone.
Notification payload shapes
Webhook — a single JSON envelope per event:
{
  "event": "alert.raised",
  "alert_id": "01KQT...",
  "severity": "warning",
  "kind": "backup_failed",
  "host_id": "01KQ...",
  "host_name": "alfa-01",
  "message": "Backup 'system-config' failed: rest-server returned 401",
  "raised_at": "2026-05-04T15:42:01Z",
  "link": "https://restic-manager.example/alerts/01KQT..."
}
event is one of alert.raised | alert.acknowledged | alert.resolved | alert.test. The same envelope shape is reused across events — operators
build one bridge, switch on event and severity.
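On the receiving side, "one bridge, switch on event and severity" might look like the sketch below. The Envelope struct follows the JSON keys shown above; the route function and its "page" / "ticket" / "update" / "ignore" outcomes are a hypothetical operator policy, not part of restic-manager.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Envelope matches the webhook JSON shape field for field.
type Envelope struct {
	Event    string `json:"event"`
	AlertID  string `json:"alert_id"`
	Severity string `json:"severity"`
	Kind     string `json:"kind"`
	HostID   string `json:"host_id"`
	HostName string `json:"host_name"`
	Message  string `json:"message"`
	RaisedAt string `json:"raised_at"`
	Link     string `json:"link"`
}

// route decodes one envelope and picks an action from event + severity.
func route(raw []byte) (string, error) {
	var e Envelope
	if err := json.Unmarshal(raw, &e); err != nil {
		return "", err
	}
	switch {
	case e.Event == "alert.test":
		return "ignore", nil
	case e.Event == "alert.raised" && e.Severity == "critical":
		return "page", nil
	case e.Event == "alert.raised":
		return "ticket", nil
	default: // alert.acknowledged, alert.resolved
		return "update", nil
	}
}

func main() {
	action, err := route([]byte(`{"event":"alert.raised","severity":"critical","kind":"check_failed"}`))
	fmt.Println(action, err) // page <nil>
}
```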
SMTP — single-recipient plain-text email per channel. The channel
config carries the SMTP server credentials and a to address; one
channel = one recipient (or one distribution-list address). Operators
who want multiple recipients add multiple channels — keeps the config
flat and the failure modes per-recipient.
Subject pattern is hardcoded (no per-channel template in v1):
Subject: [restic-manager] [<severity>] <host_name>: <kind>
From: <configured-from-address>
To: <configured-to-address>
Date: <RFC 5322>
Message-ID: <alert_id@<server-host>>

<message line: same string the webhook/ntfy gets>
—
Raised at: 2026-05-04T15:42:01Z
Severity: warning
Host: alfa-01
Kind: backup_failed
Open in restic-manager:
https://restic-manager.example/alerts/01KQT...
(This message was sent by restic-manager. Acknowledge or resolve in the UI.)
The body is plain text only in v1 — no HTML alternative — both because
the data is already structured well enough as text and because HTML
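email opens a long tail of rendering / sanitisation concerns. The
Message-ID includes the alert id so a thread-aware client can group
related events (raised → acknowledged → resolved) together. Clients
generally thread on In-Reply-To / References rather than on Message-ID
itself, so follow-up events would presumably need to reference the raised
message's ID while carrying a unique Message-ID of their own.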
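Assembling the message by hand is short enough to sketch. The header order and subject pattern follow the layout above; alertMail, subject, messageID and build are hypothetical names, the body is trimmed to the message line plus the key fields, and Date uses the raise time only to keep the output deterministic. CR/LF is stripped from header values because host names and addresses are operator-supplied.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// alertMail carries everything the plain-text message needs.
type alertMail struct {
	From, To   string
	Severity   string
	HostName   string
	Kind       string
	Message    string
	AlertID    string
	ServerHost string
	Link       string
	RaisedAt   time.Time
}

// stripCRLF guards against header injection through configured values.
var stripCRLF = strings.NewReplacer("\r", "", "\n", "")

// subject renders the hardcoded subject pattern.
func subject(m alertMail) string {
	return stripCRLF.Replace(fmt.Sprintf("[restic-manager] [%s] %s: %s", m.Severity, m.HostName, m.Kind))
}

// messageID embeds the alert id, as the spec describes.
func messageID(m alertMail) string {
	return stripCRLF.Replace(fmt.Sprintf("<%s@%s>", m.AlertID, m.ServerHost))
}

// build returns the full message with CRLF line endings.
func build(m alertMail) []byte {
	var b strings.Builder
	fmt.Fprintf(&b, "Subject: %s\r\n", subject(m))
	fmt.Fprintf(&b, "From: %s\r\n", stripCRLF.Replace(m.From))
	fmt.Fprintf(&b, "To: %s\r\n", stripCRLF.Replace(m.To))
	fmt.Fprintf(&b, "Date: %s\r\n", m.RaisedAt.Format(time.RFC1123Z))
	fmt.Fprintf(&b, "Message-ID: %s\r\n", messageID(m))
	b.WriteString("\r\n") // blank line separates headers from body
	fmt.Fprintf(&b, "%s\r\n\r\n", m.Message)
	fmt.Fprintf(&b, "Raised at: %s\r\n", m.RaisedAt.UTC().Format(time.RFC3339))
	fmt.Fprintf(&b, "Severity: %s\r\nHost: %s\r\nKind: %s\r\n\r\n", m.Severity, m.HostName, m.Kind)
	fmt.Fprintf(&b, "Open in restic-manager:\r\n%s\r\n", m.Link)
	return []byte(b.String())
}

func main() {
	m := alertMail{
		From: "alerts@example.com", To: "ops-overnight@example.com",
		Severity: "warning", HostName: "alfa-01", Kind: "backup_failed",
		Message: "Backup 'system-config' failed: rest-server returned 401",
		AlertID: "01KQT", ServerHost: "restic-manager.example",
		Link:     "https://restic-manager.example/alerts/01KQT",
		RaisedAt: time.Date(2026, 5, 4, 15, 42, 1, 0, time.UTC),
	}
	fmt.Print(string(build(m)))
}
```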
Encryption:
- STARTTLS (default, port 587). Connect in plaintext, then upgrade with STARTTLS before authenticating. What most operator-facing relays expect.
- Implicit TLS (port 465). Connect-then-TLS-handshake.
- None (port 25). No encryption. Hidden behind a "Yes I understand" warning on the form because the password crosses the wire in cleartext.
Auth:
- PLAIN (RFC 4616) over TLS. Default and almost always what's wanted.
- CRAM-MD5 (RFC 2195). Offered if the server advertises it, no UI toggle; selection is automatic.
- No OAuth2 / XOAUTH2 in v1; that's a real next step if Gmail-without-app-passwords becomes a recurring ask.
Per-message timeout is 10s (vs 5s for HTTP channels) — STARTTLS handshake + DATA over a slow link can legitimately take that long.
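A minimal sketch of how the three encryption modes and the 10s budget might map onto net/smtp and crypto/tls. The smtpConfig fields mirror the spec; defaultPort and send are hypothetical names, From and To are assumed to be bare addresses here, and falling back to CRAM-MD5 only when PLAIN is not advertised is one possible reading of "automatic".

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"net"
	"net/smtp"
	"strings"
	"time"
)

type smtpConfig struct {
	Host, Encryption, Username, Password, From, To string
	Port                                           int
}

// defaultPort maps the encryption mode to its conventional port.
func defaultPort(enc string) (int, error) {
	switch enc {
	case "starttls":
		return 587, nil
	case "tls":
		return 465, nil
	case "none":
		return 25, nil
	}
	return 0, fmt.Errorf("unknown encryption %q", enc)
}

// send delivers msg under a single 10-second deadline for the whole exchange.
func send(cfg smtpConfig, msg []byte) error {
	deadline := time.Now().Add(10 * time.Second)
	addr := net.JoinHostPort(cfg.Host, fmt.Sprint(cfg.Port))
	conn, err := (&net.Dialer{Deadline: deadline}).Dial("tcp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()
	_ = conn.SetDeadline(deadline)
	tlsCfg := &tls.Config{ServerName: cfg.Host}
	if cfg.Encryption == "tls" { // implicit TLS: handshake before any SMTP
		conn = tls.Client(conn, tlsCfg)
	}
	c, err := smtp.NewClient(conn, cfg.Host)
	if err != nil {
		return err
	}
	defer c.Close()
	if cfg.Encryption == "starttls" {
		if ok, _ := c.Extension("STARTTLS"); !ok {
			return errors.New("server does not offer STARTTLS")
		}
		if err := c.StartTLS(tlsCfg); err != nil {
			return err
		}
	}
	if cfg.Username != "" {
		auth := smtp.PlainAuth("", cfg.Username, cfg.Password, cfg.Host)
		if ok, mechs := c.Extension("AUTH"); ok && !strings.Contains(mechs, "PLAIN") && strings.Contains(mechs, "CRAM-MD5") {
			auth = smtp.CRAMMD5Auth(cfg.Username, cfg.Password)
		}
		if err := c.Auth(auth); err != nil {
			return err
		}
	}
	if err := c.Mail(cfg.From); err != nil {
		return err
	}
	if err := c.Rcpt(cfg.To); err != nil {
		return err
	}
	w, err := c.Data()
	if err != nil {
		return err
	}
	if _, err := w.Write(msg); err != nil {
		return err
	}
	if err := w.Close(); err != nil {
		return err
	}
	return c.Quit()
}

func main() {
	for _, enc := range []string{"starttls", "tls", "none"} {
		port, _ := defaultPort(enc)
		fmt.Println(enc, port)
	}
	_ = send // not invoked here: it needs a reachable SMTP server
}
```

Worth knowing for the "none" mode: Go's smtp.PlainAuth refuses to send credentials over an unencrypted connection unless the server is localhost, so authenticated plaintext delivery to a remote relay would need a custom smtp.Auth.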
Ntfy — uses the standard publish format:
POST /<topic> HTTP/1.1
Host: <server>
Authorization: Bearer <access-token> (if configured)
Title: [warning] alfa-01 backup failed
Priority: 4
Tags: warning,backup_failed
Click: https://restic-manager.example/alerts/01KQT...

Backup 'system-config' failed: rest-server returned 401
Severity → priority mapping:
| Severity | Priority |
|---|---|
| info | 3 (default) |
| warning | 4 (high) |
| critical | 5 (urgent) |
A per-channel default_priority setting overrides this mapping for
non-critical alerts; critical always goes out as urgent regardless.
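The mapping plus override fits in one function. In this sketch priorityFor is a hypothetical name and the channel's default_priority is modelled as an int where 0 means "not set"; the actual column is TEXT and nullable.

```go
package main

import "fmt"

// priorityFor returns the ntfy priority for a severity. Critical is always
// 5; otherwise a configured channel default (non-zero) wins over the table.
func priorityFor(severity string, defaultPriority int) int {
	if severity == "critical" {
		return 5
	}
	if defaultPriority != 0 {
		return defaultPriority
	}
	if severity == "warning" {
		return 4
	}
	return 3 // info and anything unrecognised
}

func main() {
	fmt.Println(priorityFor("info", 0))     // 3
	fmt.Println(priorityFor("warning", 0))  // 4
	fmt.Println(priorityFor("warning", 2))  // 2
	fmt.Println(priorityFor("critical", 2)) // 5
}
```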
Test notification
POST /api/notifications/{channel_id}/test builds a synthetic event
(severity=info, kind=test_notification, message="Test from
restic-manager", link to the channel's edit page) and runs it through the
real send path. Returns {ok: bool, latency_ms: int, status_code?: int, error?: string}. UI renders the green ✓ / red ✗ feedback inline.
Routes added
| Method | Path | Purpose |
|---|---|---|
| GET | /alerts | Fleet alerts list with filters (?status=open&severity=warning&host_id=...&q=...) |
| POST | /alerts/{id}/acknowledge | Mark alert acknowledged (HTMX form) |
| POST | /alerts/{id}/resolve | Manual resolve (HTMX form) |
| GET | /settings/notifications | Channel list page |
| GET | /settings/notifications/new | Channel kind picker + empty form |
| POST | /settings/notifications/new | Validate + create + redirect |
| GET | /settings/notifications/{id}/edit | Channel edit form |
| POST | /settings/notifications/{id}/edit | Validate + update |
| POST | /settings/notifications/{id}/delete | Delete channel (typed-confirm name in the form) |
| POST | /api/notifications/{id}/test | Fire test notification, return JSON result |
| GET | /api/alerts | JSON list (mirrors the UI filters) for future REST callers |
Data model
Migration 0013 — alerts.last_seen_at
ALTER TABLE alerts ADD COLUMN last_seen_at TEXT;
UPDATE alerts SET last_seen_at = created_at WHERE last_seen_at IS NULL;
Existing alerts (currently zero in production — nothing writes them yet)
get last_seen_at = created_at. Column is nullable for forwards-compat
with rows from the alert-engine-pre-bump period.
Migration 0014 — notification_channels + notification_log
CREATE TABLE notification_channels (
    id               TEXT PRIMARY KEY,
    kind             TEXT NOT NULL CHECK (kind IN ('webhook', 'ntfy', 'smtp')),
    name             TEXT NOT NULL,
    enabled          INTEGER NOT NULL DEFAULT 1 CHECK (enabled IN (0, 1)),
    config           BLOB NOT NULL,  -- AEAD-encrypted JSON; per-kind shape
    default_priority TEXT,           -- ntfy only; null for webhook + smtp
    created_at       TEXT NOT NULL,
    updated_at       TEXT NOT NULL,
    last_fired_at    TEXT
);
CREATE INDEX notification_channels_enabled ON notification_channels(enabled) WHERE enabled = 1;

CREATE TABLE notification_log (
    id          TEXT PRIMARY KEY,
    channel_id  TEXT NOT NULL REFERENCES notification_channels(id) ON DELETE CASCADE,
    alert_id    TEXT REFERENCES alerts(id) ON DELETE SET NULL,
    event       TEXT NOT NULL,  -- alert.raised | alert.acknowledged | alert.resolved | alert.test
    ok          INTEGER NOT NULL CHECK (ok IN (0, 1)),
    status_code INTEGER,
    latency_ms  INTEGER,
    error       TEXT,
    fired_at    TEXT NOT NULL
);
CREATE INDEX notification_log_channel ON notification_log(channel_id, fired_at DESC);
CREATE INDEX notification_log_alert ON notification_log(alert_id);
config is an AEAD-encrypted JSON blob: bearer tokens for webhooks, access
tokens for ntfy, and SMTP passwords all live there. Per-kind config shapes:
type webhookConfig struct {
    URL         string `json:"url"`
    BearerToken string `json:"bearer_token,omitempty"`
    HeaderName  string `json:"header_name,omitempty"`
    HeaderValue string `json:"header_value,omitempty"`
}

type ntfyConfig struct {
    ServerURL   string `json:"server_url"` // default https://ntfy.sh
    Topic       string `json:"topic"`
    AccessToken string `json:"access_token,omitempty"`
}

type smtpConfig struct {
    Host       string `json:"host"`       // e.g. smtp.example.com
    Port       int    `json:"port"`       // default 587 (STARTTLS), 465 (TLS), 25 (none)
    Encryption string `json:"encryption"` // "starttls" | "tls" | "none"
    Username   string `json:"username"`
    Password   string `json:"password"`   // sensitive; AEAD-encrypted with the rest of config
    From       string `json:"from"`       // RFC 5322 address; "alerts@example.com" or "Restic-Manager <alerts@…>"
    To         string `json:"to"`         // single recipient or distribution-list address; v1 = one channel = one to-line
}
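Storing a config as an AEAD-encrypted JSON blob could look like the sketch below. The spec only says crypto.AEAD; AES-256-GCM with a random nonce prepended to the ciphertext is an assumption made here, and newAEAD, seal, open and roundTrip are hypothetical names. Key management is out of scope for the sketch.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/json"
	"errors"
	"fmt"
)

type ntfyConfig struct {
	ServerURL   string `json:"server_url"`
	Topic       string `json:"topic"`
	AccessToken string `json:"access_token,omitempty"`
}

// newAEAD builds AES-GCM from a 16-, 24- or 32-byte key.
func newAEAD(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal marshals v to JSON and returns nonce || ciphertext.
func seal(aead cipher.AEAD, v any) ([]byte, error) {
	plain, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return aead.Seal(nonce, nonce, plain, nil), nil
}

// open reverses seal, failing if the blob was truncated or tampered with.
func open(aead cipher.AEAD, blob []byte, v any) error {
	n := aead.NonceSize()
	if len(blob) < n {
		return errors.New("config blob too short")
	}
	plain, err := aead.Open(nil, blob[:n], blob[n:], nil)
	if err != nil {
		return err
	}
	return json.Unmarshal(plain, v)
}

// roundTrip encrypts cfg under a throwaway key and decrypts it again.
func roundTrip(cfg ntfyConfig) (ntfyConfig, error) {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		return ntfyConfig{}, err
	}
	aead, err := newAEAD(key)
	if err != nil {
		return ntfyConfig{}, err
	}
	blob, err := seal(aead, cfg)
	if err != nil {
		return ntfyConfig{}, err
	}
	var out ntfyConfig
	err = open(aead, blob, &out)
	return out, err
}

func main() {
	got, err := roundTrip(ntfyConfig{ServerURL: "https://ntfy.sh", Topic: "ops", AccessToken: "tk_example"})
	fmt.Println(got.Topic, err)
}
```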
Engine state
The engine itself is stateless beyond the channels it owns; all
persisted state is in the existing alerts table + the new
notification_log table. A process restart re-evaluates from scratch:
on next tick the stale-schedule + auto-resolution sweeps catch up with
whatever happened during the downtime. No outbox to drain.
UI templates
| Template | Purpose |
|---|---|
| web/templates/pages/alerts.html | Fleet alerts page |
| web/templates/partials/alert_row.html | One alert row (used by both list and detail-fragment swap) |
| web/templates/pages/settings.html | Settings shell with Notifications / Users / Auth sub-tabs |
| web/templates/pages/notifications.html | Channel list (Notifications sub-tab body) |
| web/templates/pages/notification_edit.html | Channel kind picker + per-kind form + test button + payload preview |
| web/templates/partials/crit_banner.html | Dashboard top-of-page banner |
| web/templates/partials/nav.html | Existing; gains a data-alerts-count attribute on the Alerts tab so the badge auto-updates |
The Settings shell + Notifications sub-tab is the new chrome the wireframe introduced; Users + Authentication tabs are placeholder links that 404 in v1 (or render a "Lands later" notice). Same pattern P2R-02 used for inert sub-tabs.
Tests (target coverage)
- internal/alert/engine_test.go: rule firing per kind: backup_failed raises on MarkJobFinished(kind=backup, status=failed); touch-only on the second failure for the same host (no second notification); auto-resolve on next success.
- internal/alert/agent_offline_test.go: OnHostOffline emits without raising until the 15-min floor; OnHostOnline clears the alert.
- internal/alert/stale_schedule_test.go: synthetic schedule whose next fire is in the past triggers; resets when a job lands.
- internal/notification/webhook_test.go: payload shape pinned; authorisation header sent when bearer set; custom header echoed; 5s timeout enforced; error in notification_log.
- internal/notification/ntfy_test.go: title/priority/tags/click headers match the severity mapping; access token sent as Authorization: Bearer <token>; default priority overridden by severity for critical.
- internal/notification/smtp_test.go: round-trip against a local in-process fake SMTP server in the style of httptest.NewServer (net/smtp itself ships only a client), or mhog/MailHog if convenient: STARTTLS handshake completes against a self-signed cert; PLAIN auth uses configured creds; subject + from + to + body bytes match the spec'd format; Message-ID contains the alert id; 10s timeout enforced; failure path (auth refused) lands in notification_log with the server's error string.
- internal/server/http/ui_alerts_test.go: page renders with filters applied; ack/resolve POSTs flip the row + write audit; HX-Redirect bounces back to the filtered list.
- internal/server/http/ui_notifications_test.go: CRUD happy paths, validation re-render, secrets-encrypted-at-rest assertion (load row, decrypt, compare), test-button hits the real send path against a test http.Server.
- Migration 0013 + 0014 round-trip tested via store.Open on a fresh db.
Playwright sweep
End-of-phase sweep mirrors the P2R-02 / P3-restore pattern:
- Login → /alerts (initially empty) → see "All clear · last alert never" empty state.
- Trigger a fake-failed-backup via POST /api/hosts/{id}/jobs against a host with a deliberately-wrong rest-server URL. Wait for the backup_failed alert to appear in the list within ~2s of the job finishing.
- Acknowledge → row tints + ack actor visible.
- Take the agent offline (systemctl stop); wait 15 min OR mock last_seen_at to 16 min ago via the test harness; confirm agent_offline alert raises once.
- Restart the agent → agent_offline auto-resolves; backup_failed is still open.
- Configure a webhook channel pointing at a local test sink; click "Send test" → green ✓.
- Configure an ntfy channel pointing at a local sink → click "Send test" → green ✓.
- Configure an SMTP channel pointing at a local MailHog (Docker, port 1025, no TLS for the local-only sweep) → click "Send test" → green ✓ → MailHog UI at :8025 shows the test email with the right subject and Message-ID.
- Trigger a fresh failed backup on a host with no open backup_failed alert (an already-open alert would only be touched, not re-notified) → all three channels receive the notification (verified from sink logs + MailHog inbox); notification_log has three rows event=alert.raised, ok=true.
- Manually Resolve the open backup_failed; confirm all three channels receive event=alert.resolved.
- Critical-severity test: trigger check_failed (mocked) → dashboard banner appears; clicking it lands on /alerts?severity=critical&status=open.
- Empty the alerts again → banner disappears.
Screenshots into _diag/p3-alerts-sweep/. End-to-end clean, zero console
errors, before handing back.
What does NOT change
- Existing chrome/templates beyond the small additions noted above.
- Existing alerts.severity CHECK (info/warning/critical): already the right shape; no migration needed for that.
- Audit log writer pattern: the engine writes audit rows for ack/resolve the same way every other state-changing handler does.
- The agent. Alerts are entirely a server concern; the agent doesn't know they exist.
Open questions / explicit non-goals
- Per-rule cooldowns / re-raise on long-running issues. Out of scope (brainstorm question 8 ruled this out). Operators see "still happening" in the UI; they don't get a reminder ping.
- SMTP HTML emails. v1 is plain text only; operators wanting rich rendering can deploy a webhook → mail-merge bridge, or wait for a v2 template engine. The Message-ID threading + plain text body should be enough for almost every overnight-digest workflow.
- SMTP OAuth2 / XOAUTH2. Out of scope. Gmail / Microsoft 365 with modern OAuth requires an app password workaround in v1. Native XOAUTH2 lands when an operator asks (or when Google starts refusing app passwords for non-business accounts in earnest).
- Multi-recipient SMTP channels. A channel = one To. Operators wanting multiple recipients add multiple channels. Keeps failure attribution per-recipient.
- Apprise sidecar integration. Deferred per brainstorm. The Channel interface accepts a fourth impl without reshaping when we get there.
- Per-host or per-severity channel routing. Out of scope. Likely next step if operators ask: a min_severity field on the channel row.
- Snooze / mute. Out of scope. Acknowledge is the closest analogue; full silence-windows would need a new table and is YAGNI for v1.
- PagerDuty / OpsGenie. Both have webhook receivers; operators wire them via the webhook channel today.
- Alert "rules" UI. No CRUD; the rule set is hardcoded.