docs: P3 alerts design spec
Phase 3 sub-spec covering the alerts engine, notification channels, and UI (P3-05/06/07). Brainstorm ran 2026-05-04; all ten design decisions locked before this spec was written. Key decisions captured: - Hardcoded rule set, no operator-tunable thresholds in v1. Six rules: backup_failed, forget_failed, prune_failed, check_failed, stale_schedule, agent_offline. - Hybrid engine cadence: event hooks at MarkJobFinished + offline-sweeper for immediate triggers; one 60s ticker for stale-schedule detection + auto-resolution sweeps. - Auto-resolve when underlying condition clears; manual Resolve any time; Acknowledge as a separate I-have-seen-it intermediate state that does NOT close the alert. - v1 channels: native ntfy + webhook. Apprise + SMTP deferred. Channel scope is global only — no per-host or per-severity routing. - Webhook payload is one stable JSON envelope shape across raised / acknowledged / resolved / test events; ntfy uses the standard publish format with severity → priority mapping. - Per-channel Send Test Notification button hits the real send path with a synthetic info-severity event; inline green-tick / red-cross result. - Dedup by (host_id, kind, resolved_at IS NULL); last_seen_at bumped on every confirming tick so the UI can render still happening · Ns ago without re-notifying. - Top-level /alerts page; Settings shell with Notifications sub-tab. Per-host vitals Open alerts cell deep-links into filtered list. - Best-effort fire-and-forget delivery with 5s timeout; failures logged to a new notification_log table but never retried. Alert row in the DB is the source of truth. Migrations: - 0013 adds alerts.last_seen_at (column-level ALTER per CLAUDE.md) - 0014 adds notification_channels + notification_log tables Wireframe: _diag/p3-alerts-wireframe/wireframe.html
This commit is contained in:
@@ -0,0 +1,387 @@
|
||||
# P3 — Alerts (design)
|
||||
|
||||
> Phase 3 sub-spec covering the alerts engine, notification channels, and UI
|
||||
> (P3-05 / P3-06 / P3-07).
|
||||
>
|
||||
> Wireframe: `_diag/p3-alerts-wireframe/wireframe.html`. Screenshots in the
|
||||
> same directory. Spec brainstorm ran 2026-05-04; user approved all ten
|
||||
> design decisions before this spec was written.
|
||||
|
||||
## Scope locked
|
||||
|
||||
Brainstorm decisions (in order asked):
|
||||
|
||||
1. **Rule model.** Hardcoded rule set, no operator-tunable thresholds in v1.
|
||||
The engine knows about each rule type internally; per-rule config can land
|
||||
later if/when an operator asks.
|
||||
2. **Rule set.** Six rules: `backup_failed`, `forget_failed`, `prune_failed`,
|
||||
`check_failed`, `stale_schedule`, `agent_offline`.
|
||||
3. **Engine cadence.** Hybrid. Event hooks at the existing
|
||||
`MarkJobFinished` and offline-sweeper sites for the immediate triggers;
|
||||
one 60-second ticker handles stale-schedule detection and auto-resolution.
|
||||
4. **Resolution.** Auto-resolve when the underlying condition clears + manual
|
||||
Resolve at any time. Acknowledge is a separate "I've seen it" intermediate
|
||||
state that does NOT close the alert.
|
||||
5. **v1 channels.** Webhook + native ntfy. Apprise deferred (the channel
|
||||
plumbing accepts new kinds without reshaping).
|
||||
6. **Channel scope.** Global only. No per-host or per-severity routing in v1.
|
||||
7. **Notification body.** Structured JSON for webhooks, formatted
|
||||
title+body+click-URL for ntfy, plus a per-channel "Send test notification"
|
||||
button with inline result feedback.
|
||||
8. **Deduplication.** Open-alert uniqueness on `(host_id, kind)` with a
|
||||
`last_seen_at` bump on every confirming tick. One notification per
|
||||
occurrence; the UI shows "still happening · Ns ago" while a rule keeps
|
||||
matching.
|
||||
9. **Alert UI.** Top-level `/alerts` page (the existing nav stub becomes
|
||||
real). Per-host vitals "Open alerts" cell links to `/alerts?host_id=...`.
|
||||
Channel CRUD lives at `/settings/notifications`.
|
||||
10. **Delivery semantics.** Best-effort fire-and-forget with a 5s timeout
|
||||
per notification. Failures are logged but not retried. The alert row in
|
||||
the DB is the source of truth.
|
||||
|
||||
## Architecture
|
||||
|
||||
The subsystem is three loosely-coupled units behind one `AlertEngine`
|
||||
goroutine:
|
||||
|
||||
```
|
||||
┌───────────────────────────┐
|
||||
event hooks ─────────────────►│ │
|
||||
│ AlertEngine │ ──► raise/resolve
|
||||
60s ticker ──────────────────►│ (rule evaluation) │ alert row
|
||||
│ │
|
||||
└────────────┬──────────────┘
|
||||
│
|
||||
▼
|
||||
┌──────────────────────┐
|
||||
│ notification.Hub │
|
||||
│ (fire-and-forget) │
|
||||
└──┬────────┬──────────┘
|
||||
│ │
|
||||
┌──────▼──┐ ┌──▼──────┐
|
||||
│ Webhook │ │ Ntfy │ …future channels
|
||||
└─────────┘ └─────────┘
|
||||
```
|
||||
|
||||
### Component boundaries
|
||||
|
||||
| Component | Purpose | Depends on |
|
||||
| ---------------------------------------- | ---------------------------------------------------------------------------------------- | -------------------------------------- |
|
||||
| `internal/alert.Engine` | Owns the rule evaluation. Exposes `OnJobFinished`, `OnHostOffline`, `OnHostOnline` event hooks; runs a 60s ticker for stale-schedule + auto-resolution sweeps. Persists raises/resolves through the store. | store, notification.Hub, slog |
|
||||
| `internal/alert.Rule` + per-rule files | Each of the six rules is a small struct with `Kind() string`, `Severity() string`, `MessageFor(ctx) string`. The engine iterates over a registered slice. | store models |
|
||||
| `internal/notification.Hub` | Receives "alert raised/resolved/test" events; fans out to enabled channels in parallel; logs results to a new `notification_log` table. | store, channel adapters |
|
||||
| `internal/notification.Channel` (iface) | Single method `Send(ctx, payload) error` with a 5s context. Two impls in v1: `webhookChannel`, `ntfyChannel`. | http.Client |
|
||||
| `internal/store/alerts.go` | CRUD on `alerts` table: `RaiseOrTouch(host_id, kind, severity, message)`, `Acknowledge(id, user)`, `Resolve(id, by user)`, `AutoResolve(host_id, kind)`, `ListAlerts(filter)`, plus the `last_seen_at` bump. | sqlite |
|
||||
| `internal/store/notification_channels.go` | CRUD on `notification_channels` (new table) + `notification_log` (new table). | sqlite, crypto.AEAD (for secrets) |
|
||||
| `internal/server/http/ui_alerts.go` | `/alerts` page handler + filter parsing + ack/resolve form actions. | store |
|
||||
| `internal/server/http/ui_notifications.go` | `/settings/notifications` page + channel CRUD + "Send test" handler. | store, notification.Hub |
|
||||
|
||||
### Engine event shape
|
||||
|
||||
The engine runs as one goroutine per server process started in
|
||||
`cmd/server/main.go`. It exposes a small set of channels other code writes to:
|
||||
|
||||
```go
|
||||
type Engine struct {
|
||||
store *store.Store
|
||||
hub *notification.Hub
|
||||
|
||||
// Event channels (buffered, drop-on-full with a slog warning to keep
|
||||
// hot paths non-blocking). The engine drains them on its own
|
||||
// goroutine, evaluates the rule, and acts.
|
||||
jobFinished chan jobFinishedEvent // from store.MarkJobFinished hook
|
||||
hostOffline chan string // host_id; from offline sweeper
|
||||
hostOnline chan string // host_id; from ws handler hello
|
||||
|
||||
// 60s ticker drives stale-schedule + auto-resolution sweeps.
|
||||
tick *time.Ticker
|
||||
}
|
||||
```
|
||||
|
||||
The hot-path call sites (`store.MarkJobFinished`, `ws.handler` offline
|
||||
sweep, `ws.handler` hello) push to these channels via a tiny
|
||||
`Engine.Notify*` method that does a non-blocking send. The engine's own
|
||||
goroutine handles every match — keeps mutation off the hot path.
|
||||
|
||||
### Rule catalogue
|
||||
|
||||
| Kind | Severity | Trigger | Auto-resolve when |
|
||||
| ------------------- | -------- | ----------------------------------------------------------------------- | -------------------------------------------------- |
|
||||
| `backup_failed` | warning | `MarkJobFinished` with kind=backup, status=failed | next backup for the same host succeeds |
|
||||
| `forget_failed` | warning | `MarkJobFinished` with kind=forget, status=failed | next forget for the same host succeeds |
|
||||
| `prune_failed` | warning | `MarkJobFinished` with kind=prune, status=failed | next prune for the same host succeeds |
|
||||
| `check_failed` | critical | `MarkJobFinished` with kind=check, status=failed OR errors_found | next check for the same host succeeds without errors |
|
||||
| `stale_schedule` | warning | 60s ticker: a schedule's next-fire time is more than 5 minutes in the past with no matching job since | next job for that schedule succeeds OR schedule deleted |
|
||||
| `agent_offline` | warning | offline-sweeper marks the host offline AND the host has been offline > 15 min (engine checks `last_seen_at`) | hostOnline event for that host |
|
||||
|
||||
The 15-minute floor on `agent_offline` exists so a 30-second blip during
|
||||
agent restart doesn't generate a notification storm. The store's existing
|
||||
offline sweeper (`hosts.last_seen_at` with 90s threshold) already marks the
|
||||
host offline; the engine sees the event but waits for the threshold before
|
||||
raising.
|
||||
|
||||
### Dedup + last_seen_at
|
||||
|
||||
`store.RaiseOrTouch(host_id, kind, severity, message)`:
|
||||
|
||||
```sql
|
||||
SELECT id, last_seen_at FROM alerts
|
||||
WHERE host_id = ? AND kind = ? AND resolved_at IS NULL
|
||||
LIMIT 1;
|
||||
```
|
||||
|
||||
- Found: `UPDATE alerts SET last_seen_at = ?, message = ? WHERE id = ?`,
|
||||
return `(id, didRaise=false)`.
|
||||
- Not found: `INSERT INTO alerts (id, host_id, kind, severity, message,
|
||||
created_at, last_seen_at) VALUES (?, ?, ?, ?, ?, ?, ?)`, return
|
||||
`(id, didRaise=true)`.
|
||||
|
||||
The engine fires a notification through the Hub only when `didRaise=true`.
|
||||
Touch-only events keep the row's `last_seen_at` fresh so the UI can render
|
||||
"still happening · Ns ago" without spamming the operator's phone.
|
||||
|
||||
### Notification payload shapes
|
||||
|
||||
**Webhook** — a single JSON envelope per event:
|
||||
|
||||
```json
|
||||
{
|
||||
"event": "alert.raised",
|
||||
"alert_id": "01KQT...",
|
||||
"severity": "warning",
|
||||
"kind": "backup_failed",
|
||||
"host_id": "01KQ...",
|
||||
"host_name": "alfa-01",
|
||||
"message": "Backup 'system-config' failed: rest-server returned 401",
|
||||
"raised_at": "2026-05-04T15:42:01Z",
|
||||
"link": "https://restic-manager.example/alerts/01KQT..."
|
||||
}
|
||||
```
|
||||
|
||||
`event` is one of `alert.raised | alert.acknowledged | alert.resolved |
|
||||
alert.test`. The same envelope shape is reused across events — operators
|
||||
build one bridge, switch on `event` and `severity`.
|
||||
|
||||
**Ntfy** — uses the standard publish format:
|
||||
|
||||
```
|
||||
POST /<topic> HTTP/1.1
|
||||
Host: <server>
|
||||
Authorization: Bearer <access-token> (if configured)
|
||||
Title: [warning] alfa-01 backup failed
|
||||
Priority: 4
|
||||
Tags: warning,backup_failed
|
||||
Click: https://restic-manager.example/alerts/01KQT...
|
||||
|
||||
Backup 'system-config' failed: rest-server returned 401
|
||||
```
|
||||
|
||||
Severity → priority mapping:
|
||||
|
||||
| Severity | Priority |
|
||||
| --------- | -------- |
|
||||
| info | 3 (default) |
|
||||
| warning | 4 (high) |
|
||||
| critical | 5 (urgent) |
|
||||
|
||||
Per-channel `default_priority` setting overrides for non-critical alerts;
|
||||
critical always goes urgent regardless.
|
||||
|
||||
### Test notification
|
||||
|
||||
`POST /api/notifications/{channel_id}/test` builds a synthetic event
|
||||
(severity=info, kind=test_notification, message="Test from
|
||||
restic-manager", link to the channel's edit page) and runs it through the
|
||||
real send path. Returns `{ok: bool, latency_ms: int, status_code?: int,
|
||||
error?: string}`. UI renders the green ✓ / red ✗ feedback inline.
|
||||
|
||||
## Routes added
|
||||
|
||||
| Method | Path | Purpose |
|
||||
| ------- | ----------------------------------------------------- | ------------------------------------------------------------- |
|
||||
| GET | `/alerts` | Fleet alerts list with filters (`?status=open&severity=warning&host_id=...&q=...`) |
|
||||
| POST | `/alerts/{id}/acknowledge` | Mark alert acknowledged (HTMX form) |
|
||||
| POST | `/alerts/{id}/resolve` | Manual resolve (HTMX form) |
|
||||
| GET | `/settings/notifications` | Channel list page |
|
||||
| GET | `/settings/notifications/new` | Channel kind picker + empty form |
|
||||
| POST | `/settings/notifications/new` | Validate + create + redirect |
|
||||
| GET | `/settings/notifications/{id}/edit` | Channel edit form |
|
||||
| POST | `/settings/notifications/{id}/edit` | Validate + update |
|
||||
| POST | `/settings/notifications/{id}/delete` | Delete channel (typed-confirm name in the form) |
|
||||
| POST | `/api/notifications/{id}/test` | Fire test notification, return JSON result |
|
||||
| GET | `/api/alerts` | JSON list (mirrors the UI filters) for future REST callers |
|
||||
|
||||
## Data model
|
||||
|
||||
### Migration 0013 — alerts.last_seen_at
|
||||
|
||||
```sql
|
||||
ALTER TABLE alerts ADD COLUMN last_seen_at TEXT;
|
||||
UPDATE alerts SET last_seen_at = created_at WHERE last_seen_at IS NULL;
|
||||
```
|
||||
|
||||
Existing alerts (currently zero in production — nothing writes them yet)
|
||||
get `last_seen_at = created_at`. Column is nullable for forwards-compat
|
||||
with rows from the alert-engine-pre-bump period.
|
||||
|
||||
### Migration 0014 — notification_channels + notification_log
|
||||
|
||||
```sql
|
||||
CREATE TABLE notification_channels (
|
||||
id TEXT PRIMARY KEY,
|
||||
kind TEXT NOT NULL CHECK (kind IN ('webhook', 'ntfy')),
|
||||
name TEXT NOT NULL,
|
||||
enabled INTEGER NOT NULL DEFAULT 1 CHECK (enabled IN (0, 1)),
|
||||
config BLOB NOT NULL, -- AEAD-encrypted JSON; per-kind shape
|
||||
default_priority TEXT, -- ntfy only; null for others
|
||||
created_at TEXT NOT NULL,
|
||||
updated_at TEXT NOT NULL,
|
||||
last_fired_at TEXT
|
||||
);
|
||||
|
||||
CREATE INDEX notification_channels_enabled ON notification_channels(enabled) WHERE enabled = 1;
|
||||
|
||||
CREATE TABLE notification_log (
|
||||
id TEXT PRIMARY KEY,
|
||||
channel_id TEXT NOT NULL REFERENCES notification_channels(id) ON DELETE CASCADE,
|
||||
alert_id TEXT REFERENCES alerts(id) ON DELETE SET NULL,
|
||||
event TEXT NOT NULL, -- alert.raised | alert.acknowledged | alert.resolved | alert.test
|
||||
ok INTEGER NOT NULL CHECK (ok IN (0, 1)),
|
||||
status_code INTEGER,
|
||||
latency_ms INTEGER,
|
||||
error TEXT,
|
||||
fired_at TEXT NOT NULL
|
||||
);
|
||||
|
||||
CREATE INDEX notification_log_channel ON notification_log(channel_id, fired_at DESC);
|
||||
CREATE INDEX notification_log_alert ON notification_log(alert_id);
|
||||
```
|
||||
|
||||
`config` is an AEAD-encrypted JSON blob — bearer tokens for webhooks and
|
||||
access tokens for ntfy live there. Per-kind config shapes:
|
||||
|
||||
```go
|
||||
type webhookConfig struct {
|
||||
URL string `json:"url"`
|
||||
BearerToken string `json:"bearer_token,omitempty"`
|
||||
HeaderName string `json:"header_name,omitempty"`
|
||||
HeaderValue string `json:"header_value,omitempty"`
|
||||
}
|
||||
|
||||
type ntfyConfig struct {
|
||||
ServerURL string `json:"server_url"` // default https://ntfy.sh
|
||||
Topic string `json:"topic"`
|
||||
AccessToken string `json:"access_token,omitempty"`
|
||||
}
|
||||
```
|
||||
|
||||
### Engine state
|
||||
|
||||
The engine itself is stateless beyond the channels it owns; all
|
||||
persisted state is in the existing `alerts` table + the new
|
||||
`notification_log` table. A process restart re-evaluates from scratch:
|
||||
on next tick the stale-schedule + auto-resolution sweeps catch up with
|
||||
whatever happened during the downtime. No outbox to drain.
|
||||
|
||||
## UI templates
|
||||
|
||||
| Template | Purpose |
|
||||
| ----------------------------------------- | ------------------------------------------------------ |
|
||||
| `web/templates/pages/alerts.html` | Fleet alerts page |
|
||||
| `web/templates/partials/alert_row.html` | One alert row (used by both list and detail-fragment swap) |
|
||||
| `web/templates/pages/settings.html` | Settings shell with Notifications / Users / Auth sub-tabs |
|
||||
| `web/templates/pages/notifications.html` | Channel list (Notifications sub-tab body) |
|
||||
| `web/templates/pages/notification_edit.html` | Channel kind picker + per-kind form + test button + payload preview |
|
||||
| `web/templates/partials/crit_banner.html` | Dashboard top-of-page banner |
|
||||
| `web/templates/partials/nav.html` | Existing — gain a `data-alerts-count` attribute on the Alerts tab so the badge auto-updates |
|
||||
|
||||
The Settings shell + Notifications sub-tab is the new chrome the wireframe
|
||||
introduced; Users + Authentication tabs are placeholder links that 404 in
|
||||
v1 (or render an "Lands later" notice). Same pattern P2R-02 used for
|
||||
inert sub-tabs.
|
||||
|
||||
## Tests (target coverage)
|
||||
|
||||
- `internal/alert/engine_test.go` — rule firing per kind: backup_failed
|
||||
raises on `MarkJobFinished(kind=backup, status=failed)`; touch-only on
|
||||
the second failure for the same host (no second notification);
|
||||
auto-resolve on next success.
|
||||
- `internal/alert/agent_offline_test.go` — `OnHostOffline` emits without
|
||||
raising until the 15-min floor; `OnHostOnline` clears the alert.
|
||||
- `internal/alert/stale_schedule_test.go` — synthetic schedule whose next
|
||||
fire is in the past triggers; resets when a job lands.
|
||||
- `internal/notification/webhook_test.go` — payload shape pinned;
|
||||
authorisation header sent when bearer set; custom header echoed; 5s
|
||||
timeout enforced; error in `notification_log`.
|
||||
- `internal/notification/ntfy_test.go` — title/priority/tags/click headers
|
||||
match the severity mapping; access token sent as `Authorization: Bearer
|
||||
<token>`; default priority overridden by severity for critical.
|
||||
- `internal/server/http/ui_alerts_test.go` — page renders with filters
|
||||
applied; ack/resolve POSTs flip the row + write audit; HX-Redirect
|
||||
bounces back to the filtered list.
|
||||
- `internal/server/http/ui_notifications_test.go` — CRUD happy paths,
|
||||
validation re-render, secrets-encrypted-at-rest assertion (load row,
|
||||
decrypt, compare), test-button hits the real send path against a
|
||||
test http.Server.
|
||||
- Migration 0013 + 0014 round-trip tested via `store.Open` on a fresh
|
||||
db.
|
||||
|
||||
## Playwright sweep
|
||||
|
||||
End-of-phase sweep mirrors the P2R-02 / P3-restore pattern:
|
||||
|
||||
1. Login → `/alerts` (initially empty) → see "All clear · last alert
|
||||
never" empty state.
|
||||
2. Trigger a fake-failed-backup via `POST /api/hosts/{id}/jobs` against a
|
||||
host with a deliberately-wrong rest-server URL. Wait for the
|
||||
`backup_failed` alert to appear in the list within ~2s of the job
|
||||
finishing.
|
||||
3. Acknowledge → row tints + ack actor visible.
|
||||
4. Take the agent offline (`systemctl stop`); wait 15 min OR mock
|
||||
`last_seen_at` to 16 min ago via the test harness; confirm
|
||||
`agent_offline` alert raises once.
|
||||
5. Restart the agent → `agent_offline` auto-resolves; `backup_failed` is
|
||||
still open.
|
||||
6. Configure a webhook channel pointing at a local test sink; click "Send
|
||||
test" → green ✓.
|
||||
7. Configure a ntfy channel pointing at a local sink → click "Send test"
|
||||
→ green ✓.
|
||||
8. Trigger a fresh failed backup → both channels receive the notification
|
||||
(verified from sink logs); `notification_log` has two rows
|
||||
`event=alert.raised, ok=true`.
|
||||
9. Manually Resolve the open `backup_failed`; confirm both channels
|
||||
receive `event=alert.resolved`.
|
||||
10. Critical-severity test: trigger `check_failed` (mocked) → dashboard
|
||||
banner appears; clicking it lands on `/alerts?severity=critical&status=open`.
|
||||
11. Empty the alerts again → banner disappears.
|
||||
|
||||
Screenshots into `_diag/p3-alerts-sweep/`. End-to-end clean, zero console
|
||||
errors, before handing back.
|
||||
|
||||
## What does NOT change
|
||||
|
||||
- Existing chrome/templates beyond the small additions noted above.
|
||||
- Existing `alerts.severity` CHECK (`info`/`warning`/`critical`) — already
|
||||
the right shape; no migration needed for that.
|
||||
- Audit log writer pattern — engine writes audit rows for ack/resolve
|
||||
the same way every other state-changing handler does.
|
||||
- The agent. Alerts are entirely a server concern; the agent doesn't
|
||||
know they exist.
|
||||
|
||||
## Open questions / explicit non-goals
|
||||
|
||||
- **Per-rule cooldowns / re-raise on long-running issues.** Out of scope
|
||||
(brainstorm question 8 ruled this out). Operators see "still happening"
|
||||
in the UI; they don't get a reminder ping.
|
||||
- **SMTP / email channel.** Out of scope. Operators wanting email today
|
||||
can chain webhook → email-gateway; native SMTP can land later.
|
||||
- **Apprise sidecar integration.** Deferred per brainstorm. The
|
||||
`Channel` interface accepts a third impl without reshaping when we get
|
||||
there.
|
||||
- **Per-host or per-severity channel routing.** Out of scope. Likely
|
||||
next step if operators ask: a `min_severity` field on the channel row.
|
||||
- **Snooze / mute.** Out of scope. Acknowledge is the closest analogue;
|
||||
full silence-windows would need a new table and is YAGNI for v1.
|
||||
- **PagerDuty / OpsGenie.** Both have webhook receivers; operators wire
|
||||
them via the webhook channel today.
|
||||
- **Alert "rules" UI.** No CRUD; the rule set is hardcoded.
|
||||
Reference in New Issue
Block a user