Bite-sized TDD tasks across 7 slices (A schema, B middleware,
C session re-validation, D setup-token flow, E user CRUD API,
F UI, G wiring + sweep). Each task is one commit with concrete
code blocks and test cases — no placeholders.
Refs spec at docs/superpowers/specs/2026-05-05-p4-03-04-rbac-user-mgmt-design.md.
Brainstormed shape locked: chi route-group middleware, fail-closed
admin default; setup-token flow with 1h single-use tokens
(sha256-hashed at rest, raw shown to admin once); disable-only user
lifecycle with last-admin guard; self-service /settings/account
password change for every role; email field on users (metadata
v1); session re-validation on every authenticated request so
disable / role change land immediately.
Locked decisions captured in §Role taxonomy, §Schema changes,
§Setup-token flow, §RBAC enforcement, §Last-admin self-protection.
Deferred items in §Out of scope (OIDC, SMTP email-the-link,
hard delete, lockout).
Migrations 0017 (users extensions) + 0018 (user_setup_tokens)
both column-level ALTERs per CLAUDE.md preference.
Post-brainstorm change after operator review: overnight-digest /
"don't ping me at 03:00, email me in the morning" use case is poorly
served by ntfy (push) and clumsy via webhook → email-gateway. SMTP joins
webhook + ntfy as the third v1 channel; Apprise stays deferred.
Spec updates:
- Decision 5 reworded: three channels in v1.
- Channel iface gains smtpChannel using net/smtp + crypto/tls. 10s
timeout vs 5s for HTTP — STARTTLS handshake + DATA over a slow link
legitimately needs the headroom.
- Migration 0014 CHECK now allows 'smtp'. New smtpConfig struct: host,
port, encryption (starttls/tls/none), username, password (AEAD), from,
to. One channel = one To-address; multi-recipient = multiple channels
(keeps failure attribution per-recipient).
- Body shape documented: hardcoded subject pattern
'[restic-manager] [<sev>] <host>: <kind>', Message-ID includes the
alert id so threading groups raised → ack → resolved cleanly. Plain
text only in v1.
- Encryption defaults to STARTTLS on 465/587; PLAIN auth over TLS, no
XOAUTH2 yet (app passwords recommended for Gmail / M365).
- Test plan adds MailHog step in the Playwright sweep.
- Non-goals expanded: HTML emails, OAuth2/XOAUTH2, multi-recipient
channels are explicitly out of v1.
Wireframe updates (_diag/p3-alerts-wireframe/wireframe.html):
- Kind picker grows from 2 cards to 3 (Webhook / Ntfy / SMTP @). SMTP
gets the --ok green colour family so it visually separates from
webhook (accent) and ntfy (warm).
- New SMTP variant section (3c): host+port+encryption row, user+pass
row, from+to row, test result, plus right-rail email shape preview
showing the RFC 5322 layout.
- Channel list grows a third row: 'overnight-digest · smtp://… →
ops-overnight@example.com'.
Phase 3 sub-spec covering the alerts engine, notification channels, and
UI (P3-05/06/07). Brainstorm ran 2026-05-04; all ten design decisions
locked before this spec was written.
Key decisions captured:
- Hardcoded rule set, no operator-tunable thresholds in v1. Six rules:
backup_failed, forget_failed, prune_failed, check_failed,
stale_schedule, agent_offline.
- Hybrid engine cadence: event hooks at MarkJobFinished + offline-sweeper
for immediate triggers; one 60s ticker for stale-schedule detection +
auto-resolution sweeps.
- Auto-resolve when underlying condition clears; manual Resolve any time;
Acknowledge as a separate I-have-seen-it intermediate state that does
NOT close the alert.
- v1 channels: native ntfy + webhook. Apprise + SMTP deferred. Channel
scope is global only — no per-host or per-severity routing.
- Webhook payload is one stable JSON envelope shape across raised /
acknowledged / resolved / test events; ntfy uses the standard publish
format with severity → priority mapping.
- Per-channel Send Test Notification button hits the real send path with
a synthetic info-severity event; inline green-tick / red-cross result.
- Dedup by (host_id, kind, resolved_at IS NULL); last_seen_at bumped on
every confirming tick so the UI can render still happening · Ns ago
without re-notifying.
- Top-level /alerts page; Settings shell with Notifications sub-tab.
Per-host vitals Open alerts cell deep-links into filtered list.
- Best-effort fire-and-forget delivery with 5s timeout; failures logged
to a new notification_log table but never retried. Alert row in the DB
is the source of truth.
Migrations:
- 0013 adds alerts.last_seen_at (column-level ALTER per CLAUDE.md)
- 0014 adds notification_channels + notification_log tables
Wireframe: _diag/p3-alerts-wireframe/wireframe.html
Splits Phase 3 into three independently-shippable sub-phases (Restore,
Alerts, Audit UI) so they can land in separate PRs with their own brainstorm
→ spec → plan cycles. The Restore sub-phase is up first.
The brainstorm ran on 2026-05-04 and locked the following decisions:
- Single-host restore only this phase. P3-04 (cross-host restore) is moved
to a new 'Future / unscheduled' section. Disaster recovery is already
covered by re-enrolling a replacement host with the same repo creds; the
remaining 'pull a file from host A onto host C' use case is genuinely
different (file sharing / migration, not DR) and has no confirmed need.
- Default target is /var/restic-restore/<job-id>/ with --no-ownership;
in-place restore preserves uid/gid/mode and is gated by typed-confirmation
of the host name (mirroring the repo re-init danger zone).
- Tree browser is the path picker, lazy-loaded via a synchronous WS RPC
(tree.list) over the existing correlation-ID infrastructure with a
per-wizard-session in-memory cache (~30 min TTL).
- Single-page wizard with progressively-enabled sections; entry is a
top-level Restore button on host detail (or per-snapshot Restore action
for direct deep-link).
- Snapshot diff (P3-09) is a JobDiff JobKind, dispatched like every other
agent operation; output streams to the standard live job log page.
- Restore-specific live job page variant with files-restored /
bytes-restored / current-file widget.
- Single-flight per host across all kinds, plus a real cancel-job feature
(command.cancel WS envelope, agent kills the restic subprocess via
context cancel + SIGTERM/SIGKILL grace) so the operator can pre-empt a
long-running backup if they need to restore urgently. Wires the existing
job_detail Cancel button (which was a UI stub).
- Audit row host.restore on every dispatch + a recent-restores panel on
host detail. Role gate deferred to P4-03 RBAC.
Wireframe at _diag/p3-restore-wizard/wireframe.html (gitignored —
transient design artefact); screenshot reviewed and approved 2026-05-04.
Cohesive batch from a smoke-test session against a real rest-server.
Themed bullets:
* Agent runs as root, sandboxed via systemd. CapabilityBoundingSet
drops to CAP_DAC_READ_SEARCH + restore caps; ProtectSystem=strict
with ReadWritePaths confined to /etc + /var/lib/restic-manager;
NoNewPrivileges blocks escalation. Install script no longer
creates a service user. spec.md §4.2 / §14.1 / §14.3 explain the
rationale (matches UrBackup / Veeam / Bareos defaults; trying to
back up "everything" as an unprivileged user creates silent skips
on /home, /root, /var/lib/* with no upside vs the threat model
the agent already implies).
* Init-repo end-to-end. New JobKind="init" wired through agent
runner, restic.Env.RunInit, server dispatcher, and a UI button
(red "Initialise repo" in the run-now panel). hosts.repo_initialised_at
flips on init success, on backup success, or on a non-empty
snapshots.report. The "Run now" / "Init" / "Retry" branching now
drives both the dashboard host row and the host-detail panel.
Migrations 0004 (column), 0005 (jobs.kind CHECK widened — using
the safe create-new-then-rename pattern; first version corrupted
job_logs.job_id FK), 0006 (cleans up job_logs FK on already-
affected DBs).
* rest-server creds embedded at exec time only. restic.Env gains
RepoUsername; mergeRestCreds() builds the user:pass@-prefixed URL
inside envSlice() and never assigns it back to the struct, so
nothing slog-able ever sees the cleartext form. RedactURL helper
for any future surface that needs to log a URL safely. Both
helpers tested.
* Add-host UX. Repo password is now optional — server mints a
24-byte URL-safe random one and surfaces it once, alongside an
htpasswd snippet ("echo PASS | htpasswd -B -i ... USERNAME") so
the operator pastes one command on the rest-server host and one
on the endpoint. Result page also links the install snippet at
/install/install.sh (was /install.sh — 404'd before) and pipes
to bash (not sh — script uses set -o pipefail and other
bashisms; on Debian/Ubuntu sh is dash).
* Late-subscriber race in JobHub. A fast-failing job could finish
(DB write + Broadcast) before the browser's HX-Redirect → page
load → WS-connect path completed, so the JS sat forever waiting
on a job.finished that already passed. JobHub split into
Register + Send + Run; handleJobStream now subscribes first,
re-fetches the job, and sends a synthetic job.finished if the
state is already terminal.
* HTMX error visibility. New toast partial listens to
htmx:responseError and surfaces the response body as a
bottom-right toast — every server-side validation error now
becomes visible without per-handler JS wiring. Also handles
custom rm:toast events for future server-pushed notifications
via the HX-Trigger header. Themed via existing CSS vars.
* Dashboard rows are now whole-row clickable to host detail
(CSS card-link pattern: absolute-positioned anchor + .row-action
z-index restoration so the action button stays clickable).
"View →" on a running job links to /jobs/<id> rather than
/hosts/<id> since the row click already covers the host page.
* "Run first" / "Run first backup" → "Run now" everywhere for
consistency.
* runbook (docs/e2e-smoke.md) updated — live-log streaming step
now reflects P1-26; mentions the browser-driven Run-now flow.
* _diag/dump-creds — moved out of cmd/ so go build doesn't pick
it up; .gitignore now excludes /_diag/ entirely.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The smoke runbook caught a real bug: ConsumeEnrollmentToken was
inserting into host_credentials (FK -> hosts) inside the same tx as
the token burn, but the host row didn't exist yet — CreateHost
runs in the *next* statement. The agent saw a generic 401 with no
clue why.
Fix: drop the host_credentials insert from ConsumeEnrollmentToken;
the HTTP handler now does Consume -> CreateHost ->
SetHostCredentials. SetHostCredentials failure is logged loudly
but doesn't fail the enrol — operator recovers via PUT
/api/hosts/{id}/repo-credentials.
Adds slog.Warn lines on both 401 paths in handleAgentEnroll so the
underlying cause is visible in server logs (the wire response stays
generic to avoid leaking which step failed).
Test: TestEnrollmentTransfersRepoCreds rewritten to mirror the new
order (consume -> create host -> SetHostCredentials).
Runbook (docs/e2e-smoke.md): rest-server moved off 8000 (commonly
in use); URLs use trailing slash on the rest path; clarified that
secrets_key is minted on first agent start, not at enrol time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds docs/e2e-smoke.md — an ~5-minute runbook that walks the full
P1 happy path against a sibling restic/rest-server: bootstrap
admin, mint token with repo creds, enrol an agent, watch the
config.update push land, run a backup, confirm the snapshot, edit
creds and watch the second push fire. Per the design discussion
this is a runbook (not a Go integration test); the Playwright
version lands in P5-06.
GET /api/hosts/{id}/repo-credentials returns the redacted view —
{repo_url, repo_username, has_password} — so the UI can pre-fill
the edit form without ever pulling the password out of the AEAD
blob.
Marks P1-32 / P1-33 / P1-34 done in tasks.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>