Phase 3 — Alerts (P3-05/06/07) #7

Merged
steve merged 34 commits from p3-alerts into main 2026-05-04 22:51:17 +01:00

Summary

Phase-3 Alerts subsystem: hardcoded rule engine, three notification
channels (webhook + ntfy + SMTP), and the UI surfaces for review,
acknowledgement, and resolution.

  • Six v1 rules: backup_failed, forget_failed, prune_failed,
    check_failed (critical), agent_offline, plus a stale_schedule
    placeholder. Engine runs as a goroutine with a 60s ticker plus
    event-driven hooks (job finished, host up/down).
  • Channels live behind notification.Channel; notification.Hub
    fans every event out to all enabled channels in parallel and
    writes a notification_log row per dispatch (status code,
    latency, error).
  • AEAD-encrypted channel config in notification_channels.config;
    associated data binds ciphertext to the row id so swapping config
    between rows is rejected.
  • Top-level /alerts list with status/severity/host filters, JSON
    variant at /api/alerts, ack + resolve handlers, dashboard
    critical-alerts banner, nav badge.
  • /settings/notifications channel CRUD with per-kind sub-forms,
    Test button that fires a synthetic alert.test payload.
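
The row-binding idea in the AEAD bullet above can be sketched with the standard library. This is a minimal illustration, not the project's code: the choice of AES-256-GCM, the associated-data layout `notification_channels:<id>`, and the names `rowAD`, `sealConfig` and `openConfig` are all assumptions.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// rowAD derives the associated data for a row. The layout is hypothetical.
func rowAD(id int64) []byte {
	return []byte(fmt.Sprintf("notification_channels:%d", id))
}

// newAEAD builds AES-GCM; the real subsystem may use a different AEAD.
func newAEAD(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// sealConfig encrypts plaintext bound to rowID and returns nonce || ciphertext.
func sealConfig(key []byte, rowID int64, plaintext []byte) ([]byte, error) {
	aead, err := newAEAD(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return aead.Seal(nonce, nonce, plaintext, rowAD(rowID)), nil
}

// openConfig decrypts a blob and fails if it was sealed for another row id.
func openConfig(key []byte, rowID int64, blob []byte) ([]byte, error) {
	aead, err := newAEAD(key)
	if err != nil {
		return nil, err
	}
	if len(blob) < aead.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ct := blob[:aead.NonceSize()], blob[aead.NonceSize():]
	return aead.Open(nil, nonce, ct, rowAD(rowID))
}

func main() {
	key := make([]byte, 32) // AES-256; a real key would come from server configuration
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	blob, err := sealConfig(key, 1, []byte(`{"url":"https://example.invalid/hook"}`))
	if err != nil {
		panic(err)
	}
	_, errSame := openConfig(key, 1, blob)
	_, errSwapped := openConfig(key, 2, blob)
	fmt.Println("row 1 opens:", errSame == nil)
	fmt.Println("row 2 rejected:", errSwapped != nil)
}
```

Because the row id feeds the authentication tag, a config blob copied into another row fails to open rather than silently decrypting.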

Sweep findings (live Playwright run, 2026-05-04)

Three real bugs caught and fixed mid-sweep — see commits 9be3cea,
6466f8c, 3d99306:

  • Ack/Resolve handlers updated state but never dispatched
    alert.acknowledged / alert.resolved. Added Engine.Acknowledge
    / Engine.Resolve wrappers; handlers now route through the engine.
    Also detached the goroutine context with context.WithoutCancel
    so the dispatch survives the 204 response.
  • Channel form persisted enabled=0 even when the toggle was on:
    the hidden input and the checkbox are both named enabled, and
    PostForm.Get returned the first value ("0"). Switched to a helper
    that scans the whole slice.
  • hosts.open_alert_count projection was never written by the
    alerts code path, so the dashboard's OPEN ALERTS card and per-host
    alerts column always read 0. Added refreshHostOpenAlertCount
    (recompute from alerts table — self-healing) and called it from
    RaiseOrTouch (when a row was inserted), Resolve, and
    AutoResolve.

End-to-end verified: 3 channels created + Test fired (webhook 200/1ms,
ntfy 200/322ms, SMTP 250/3ms via local MailHog) → synthetic critical
raised → /alerts list, nav badge, dashboard banner, OPEN ALERTS card
all populate → Acknowledge fans out across all 3 channels → Resolve
fans out across all 3 channels and clears the banner + count.

Test plan

  • go test ./... (passes locally)
  • /settings/notifications: create webhook channel, send Test, verify endpoint receives the JSON envelope
  • /settings/notifications: create ntfy channel, send Test, verify the ntfy topic
  • /settings/notifications: create SMTP channel against a local MailHog (1025), send Test, verify subject [restic-manager] [info] (test): test_notification
  • Force-fail a backup → verify a backup_failed alert appears on /alerts at warning severity, banner does NOT show (it is critical-only), nav badge increments, all enabled channels receive alert.raised
  • Click Acknowledge → verify Acknowledged tab + ack'd-by line + alert.acknowledged fanned out
  • Click Resolve → verify alert moves to Resolved tab + banner clears + dashboard count drops + alert.resolved fanned out
  • Disconnect an agent for >15 min → verify agent_offline raised at warning; reconnect → auto-resolved
steve added 28 commits 2026-05-04 21:02:20 +01:00
Phase 3 sub-spec covering the alerts engine, notification channels, and
UI (P3-05/06/07). Brainstorm ran 2026-05-04; all ten design decisions
locked before this spec was written.

Key decisions captured:

- Hardcoded rule set, no operator-tunable thresholds in v1. Six rules:
  backup_failed, forget_failed, prune_failed, check_failed,
  stale_schedule, agent_offline.
- Hybrid engine cadence: event hooks at MarkJobFinished + offline-sweeper
  for immediate triggers; one 60s ticker for stale-schedule detection +
  auto-resolution sweeps.
- Auto-resolve when underlying condition clears; manual Resolve any time;
  Acknowledge as a separate I-have-seen-it intermediate state that does
  NOT close the alert.
- v1 channels: native ntfy + webhook. Apprise + SMTP deferred. Channel
  scope is global only — no per-host or per-severity routing.
- Webhook payload is one stable JSON envelope shape across raised /
  acknowledged / resolved / test events; ntfy uses the standard publish
  format with severity → priority mapping.
- Per-channel Send Test Notification button hits the real send path with
  a synthetic info-severity event; inline green-tick / red-cross result.
- Dedup by (host_id, kind, resolved_at IS NULL); last_seen_at bumped on
  every confirming tick so the UI can render still happening · Ns ago
  without re-notifying.
- Top-level /alerts page; Settings shell with Notifications sub-tab.
  Per-host vitals Open alerts cell deep-links into filtered list.
- Best-effort fire-and-forget delivery with 5s timeout; failures logged
  to a new notification_log table but never retried. Alert row in the DB
  is the source of truth.

Migrations:
- 0013 adds alerts.last_seen_at (column-level ALTER per CLAUDE.md)
- 0014 adds notification_channels + notification_log tables

Wireframe: _diag/p3-alerts-wireframe/wireframe.html
Post-brainstorm change after operator review: overnight-digest /
"don't ping me at 03:00, email me in the morning" use case is poorly
served by ntfy (push) and clumsy via webhook → email-gateway. SMTP joins
webhook + ntfy as the third v1 channel; Apprise stays deferred.

Spec updates:
- Decision 5 reworded: three channels in v1.
- Channel iface gains smtpChannel using net/smtp + crypto/tls. 10s
  timeout vs 5s for HTTP — STARTTLS handshake + DATA over a slow link
  legitimately needs the headroom.
- Migration 0014 CHECK now allows 'smtp'. New smtpConfig struct: host,
  port, encryption (starttls/tls/none), username, password (AEAD), from,
  to. One channel = one To-address; multi-recipient = multiple channels
  (keeps failure attribution per-recipient).
- Body shape documented: hardcoded subject pattern
  '[restic-manager] [<sev>] <host>: <kind>', Message-ID includes the
  alert id so threading groups raised → ack → resolved cleanly. Plain
  text only in v1.
- Encryption defaults to STARTTLS on 465/587; PLAIN auth over TLS, no
  XOAUTH2 yet (app passwords recommended for Gmail / M365).
- Test plan adds MailHog step in the Playwright sweep.
- Non-goals expanded: HTML emails, OAuth2/XOAUTH2, multi-recipient
  channels are explicitly out of v1.
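
A minimal sketch of the header shapes described in the spec updates above. The subject pattern is taken from the spec; the Message-ID layout, the domain, and the function names `subject` and `messageID` are assumptions, since the spec only says the Message-ID includes the alert id.

```go
package main

import "fmt"

// subject follows the spec's pattern: [restic-manager] [<sev>] <host>: <kind>.
func subject(severity, host, kind string) string {
	return fmt.Sprintf("[restic-manager] [%s] %s: %s", severity, host, kind)
}

// messageID embeds the alert id. The exact layout and domain are hypothetical.
func messageID(alertID int64, event, domain string) string {
	return fmt.Sprintf("<alert-%d.%s@%s>", alertID, event, domain)
}

func main() {
	fmt.Println(subject("info", "(test)", "test_notification"))
	fmt.Println(messageID(42, "raised", "restic-manager.invalid"))
}
```

The first line printed matches the subject that the test plan expects from the SMTP Test button.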

Wireframe updates (_diag/p3-alerts-wireframe/wireframe.html):
- Kind picker grows from 2 cards to 3 (Webhook / Ntfy / SMTP @). SMTP
  gets the --ok green colour family so it visually separates from
  webhook (accent) and ntfy (warm).
- New SMTP variant section (3c): host+port+encryption row, user+pass
  row, from+to row, test result, plus right-rail email shape preview
  showing the RFC 5322 layout.
- Channel list grows a third row: 'overnight-digest · smtp://… →
  ops-overnight@example.com'.
Code-quality nits flagged in review of e6d965d. Mirrors the existing
pattern in host_credentials_test.go.
Fixes flagged in spec review of f0a323e: ntfy POSTs need explicit
Content-Type: text/plain (the spec calls for it; ntfy works without
but explicit beats inferred); trim trailing slashes from server URL
to avoid double-slash when operators paste 'https://ntfy.sh/'.
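
The two ntfy fixes above might look roughly like this. It is a sketch under assumptions: the helper name `newNtfyRequest` and its parameters are hypothetical, and only the URL trimming and the explicit header come from the commit message.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// newNtfyRequest builds a publish request for a topic on an ntfy server.
func newNtfyRequest(serverURL, topic, body string) (*http.Request, error) {
	// Trim trailing slashes so "https://ntfy.sh/" does not yield "//topic".
	base := strings.TrimRight(serverURL, "/")
	req, err := http.NewRequest(http.MethodPost, base+"/"+topic, strings.NewReader(body))
	if err != nil {
		return nil, err
	}
	// Set the content type explicitly instead of relying on inference.
	req.Header.Set("Content-Type", "text/plain")
	return req, nil
}

func main() {
	req, err := newNtfyRequest("https://ntfy.sh/", "backups", "backup_failed on host-a")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.URL.String())               // https://ntfy.sh/backups
	fmt.Println(req.Header.Get("Content-Type")) // text/plain
}
```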
- ws.HandlerDeps gains an AlertEngine *alert.Engine field; populated
  from http.Deps.AlertEngine (nil until G1 constructs the engine)
- runAgentLoop calls NotifyHostOnline after MarkHostHello succeeds
- dispatchAgentMessage MsgJobFinished case calls NotifyJobFinished,
  looking up the job Kind via Store.GetJob before notifying
- store.MarkHostsOfflineStaleReturnIDs added: SELECT+UPDATE in one
  transaction, returns the IDs that flipped to offline
- offline sweeper in cmd/server/main.go switched to the new variant;
  TODO(G1) comment marks where NotifyHostOffline calls will land
Flagged in review of cd38b40: the Alerts tab badge should show the
open count from any page, not just /alerts. baseView now takes the
request and queries store.ListAlerts(Status: "open") to fill
view.OpenAlerts on every page render. All call sites updated.
Add settings.html (shell + sub-tab nav + conditional list/edit body),
notifications.html and notification_edit.html (glob stubs), and the
supporting CSS tokens (.ch-row, .ch-icon, .toggle, .kind-grid,
.kind-card, .radio-pip, .test-pill) to input.css. Rebuild styles.css.
Add ui_parse_test.go to catch template regressions at test time.

The kind picker is JS-driven (no full page reload); the enabled toggle
mirrors the existing visual toggle pattern; the test-notification button
uses HTMX and renders the JSON response as a coloured pill client-side.
- Construct notification.NewHub and alert.NewEngine at boot in cmd/server/main.go
- Start go alertEngine.Run(ctx) after construction, before the HTTP listener
- Wire AlertEngine and NotificationHub into rmhttp.Deps (fields already existed)
- Remove the TODO(G1) in the offline sweeper; now calls NotifyHostOffline per ID
Spotted during the live Playwright sweep: clicking Acknowledge or
Resolve updated the alert row but never fanned out a notification.
The handlers went straight to Store.Acknowledge/Resolve, bypassing
the hub.

Add Engine.Acknowledge and Engine.Resolve that wrap the store call
and dispatch the matching event to every enabled channel. The UI
handlers prefer the engine path when wired, and fall back to the
direct store call so unit tests that construct a Server without an
engine still work.

Use context.WithoutCancel for the goroutine dispatch — the request
context is cancelled the instant the handler returns 204, so the
naive 'go e.hub.Dispatch(ctx, ...)' was racing the response and
losing the channel-list query with 'context canceled'.
The notification channel form has a <input hidden name=enabled value=0>
plus a <input checkbox name=enabled value=1> so unchecking the box
still submits 'enabled=0' (otherwise the field would just be absent).
But Go's url.Values.Get returns the FIRST value, so even when the
checkbox is ticked the handler read '0' and persisted enabled=false.

Scan r.PostForm["enabled"] for any '1' instead. Caught during the
sweep — all three test channels saved with enabled=0 even though
the toggle visually rendered ON.
The denormalised projection was never written by the alerts code
path, so the dashboard's OPEN ALERTS card and the per-host alerts
column always read 0 regardless of how many alerts were open.
fleet.GetStats sums hosts.open_alert_count; if it never moves, the
card is decoration.

Add refreshHostOpenAlertCount that recomputes from the alerts table
(self-healing — no +/- bookkeeping to drift). Call it after the
commit in RaiseOrTouch when a row was inserted, after Resolve, and
after AutoResolve.

Caught during the live sweep: a synthetic critical raised the count
to 1, but resolving it left the dashboard reading '1 unresolved'
indefinitely.
tasks: tick P3-05/06/07 + Playwright sweep notes
CI / Build (windows/amd64) (pull_request) Successful in 22s
CI / Lint (pull_request) Successful in 32s
CI / Build (linux/amd64) (pull_request) Successful in 22s
CI / Build (linux/arm64) (pull_request) Successful in 21s
CI / Test (linux/amd64) (pull_request) Successful in 3m44s
c5b884a22b
Sweep against the live smoke env confirmed the alerts subsystem
end-to-end: three channels (webhook → local sink, ntfy → ntfy.sh,
SMTP → MailHog) created and verified via the Test button; synthetic
critical raised; ack + resolve fan out alert.acknowledged /
alert.resolved across all three; dashboard banner appears and
clears; nav badge tracks open count.

Three real bugs found and fixed mid-sweep — see preceding three
commits for the full reasoning.
steve added 1 commit 2026-05-04 22:17:06 +01:00
fix: read 'name' across all per-kind sub-forms when editing channels
CI / Build (windows/amd64) (pull_request) Successful in 22s
CI / Lint (pull_request) Successful in 38s
CI / Build (linux/amd64) (pull_request) Successful in 21s
CI / Build (linux/arm64) (pull_request) Successful in 22s
CI / Test (linux/amd64) (pull_request) Successful in 2m39s
84e121bb9c
The channel form has three inputs all named 'name' (one per kind
section: webhook / ntfy / smtp), but only the visible kind's input
is filled in. PostForm.Get returns the first regardless of
emptiness, so editing an ntfy or smtp channel always read '' from
the (hidden, unfilled) webhook section's name input and rejected
with 'name required'.

Add firstNonEmpty helper that scans the slice for the first
non-blank value. Same flavour of bug as the enabled checkbox fix
in 6466f8c — both fall out of having multiple inputs share a name
across the per-kind sub-forms.
steve added 1 commit 2026-05-04 22:21:52 +01:00
fix: enabled toggle — list-row click + edit-form save
CI / Build (windows/amd64) (pull_request) Successful in 22s
CI / Build (linux/amd64) (pull_request) Successful in 24s
CI / Build (linux/arm64) (pull_request) Successful in 24s
CI / Lint (pull_request) Successful in 1m15s
CI / Test (linux/amd64) (pull_request) Successful in 1m36s
cffad4b4f3
Two bugs in the channel-enabled affordance:

1. List-row toggle was a static span with no handler; the row's
   row-link overlay swallowed every click and routed to /edit. Add
   POST /settings/notifications/{id}/toggle backed by a new store
   method SetNotificationChannelEnabled, and turn the row toggle
   into an htmx-driven button that swaps in the new state. Use
   event.stopPropagation() on the toggle so it beats the row link.

2. Edit-form toggle visually flipped but the underlying checkbox
   reverted: the visual span lives inside the <label>, so clicking
   it fired the inline JS handler AND the label's native
   checkbox-toggle, cancelling out. Bind to the checkbox 'change'
   event instead and let the label do the toggling — the JS just
   mirrors check.checked into the .on class.
steve added 1 commit 2026-05-04 22:25:51 +01:00
feat(ntfy): support HTTP Basic auth alongside access tokens
CI / Build (windows/amd64) (pull_request) Successful in 22s
CI / Build (linux/amd64) (pull_request) Successful in 22s
CI / Build (linux/arm64) (pull_request) Successful in 21s
CI / Lint (pull_request) Successful in 1m12s
CI / Test (linux/amd64) (pull_request) Successful in 1m18s
feaeff217d
Self-hosted ntfy that doesn't expose a token-mint endpoint can still
authenticate over HTTP Basic. Add Username + Password fields to
NtfyConfig; the channel sends 'Authorization: Basic …' when token is
empty and username is set. Token wins when both are configured.

Form-side: two new optional fields next to the access token, with
the same write-only placeholder treatment as smtp_password (blank
on edit means 'keep stored value'). Username is round-tripped on
edit; password is masked.
steve added 1 commit 2026-05-04 22:35:58 +01:00
fix: move channel delete-panel out of edit form (nested form bug)
CI / Build (windows/amd64) (pull_request) Successful in 21s
CI / Build (linux/amd64) (pull_request) Successful in 22s
CI / Build (linux/arm64) (pull_request) Successful in 21s
CI / Lint (pull_request) Successful in 1m11s
CI / Test (linux/amd64) (pull_request) Successful in 1m22s
7f2a9964db
The delete-panel <form action='.../delete'> was nested inside the
main <form action='.../edit'>. HTML doesn't allow nested forms: the
parser ignores the inner <form> start tag, so the 'Delete permanently'
button belonged to the outer form and submitted to /edit instead of
/delete, leaving the channel intact.

Move the delete-panel block to a sibling of the main form. The
'Delete channel…' button still toggles its visibility via JS, the
panel still renders inside the page layout, and now its form
actually posts to the delete handler.
steve added 1 commit 2026-05-04 22:40:46 +01:00
fix: payload-preview rail follows kind switcher
CI / Lint (pull_request) Successful in 32s
CI / Build (windows/amd64) (pull_request) Successful in 43s
CI / Build (linux/amd64) (pull_request) Successful in 21s
CI / Test (linux/amd64) (pull_request) Successful in 1m18s
CI / Build (linux/arm64) (pull_request) Successful in 43s
3cdaee63d4
Right-rail preview was rendered server-side via {{if eq $f.Kind ...}},
so it stayed on whatever kind the page loaded with. Editing an SMTP
channel and flipping to ntfy in the picker left the email RFC 5322
sample on screen.

Render all three preview panels with id='preview-<kind>' (only the
matching one visible on first render) and toggle their .hidden class
in the kind-switcher JS alongside the field panels. Same pattern
used for fields-<kind>.
steve added 1 commit 2026-05-04 22:49:49 +01:00
chore: ignore cmd/_* dev binaries + Tailwind rebuild
CI / Build (windows/amd64) (pull_request) Successful in 21s
CI / Build (linux/amd64) (pull_request) Successful in 21s
CI / Build (linux/arm64) (pull_request) Successful in 22s
CI / Lint (pull_request) Successful in 1m13s
CI / Test (linux/amd64) (pull_request) Successful in 1m20s
2eac324cec
cmd/_fake_alert and similar one-shot dev tools live under cmd/_*,
which Go's ./... package patterns skip (directories starting with _
or . are ignored). Add an explicit gitignore line so an accidental
'git add cmd/.' can't drag them into a release.

styles.css is the regenerated Tailwind output — picks up the new
ntfy basic-auth fields and the right-rail preview ids.
steve merged commit 7792aadb94 into main 2026-05-04 22:51:17 +01:00
Reference: steve/restic-manager#7