formatRelTime now wraps its label in <time data-rel-ts=...>, and
both layouts include a small ticker that re-renders every 30s.
Without this, a job-detail page rendered an hour ago kept showing
'2h ago' when the wall-clock truth was '3h ago'.
The Dockerfile only set `-X main.version=...`, so docker-built binaries
left `internal/version.Version` at its default "dev". The update logic
(host_update.go:61, hosts.go:94, fleet_update.go:101 et al.) compares
against `internal/version.Version`, so a v1.0.0 host always looked
out-of-date to a v1.0.0 server, the chip never cleared, and pressing
"update" re-downloaded the same bundled binary on a loop.
Collapse the two version sources: drop the `var version/commit/date`
locals in cmd/{server,agent}/main.go, route everything through
internal/version (now also carrying Date), and have both the Dockerfile
and the Makefile set the same single set of -X flags. Verified
end-to-end: make build and docker build both emit binaries whose
--version reflects the build VERSION.
The CI e2e workflow greps for 'bootstrap token' in server logs to capture
the one-shot token. The earlier reword dropped that phrase; restore it on
the headless-instructions line so .gitea/workflows/e2e.yml step 'Capture
bootstrap token from server logs' keeps matching.
- CHANGELOG.md: Keep-a-Changelog format, v1.0.0 entry summarising
what each phase delivered.
- docs/threat-model.md: structured walkthrough of assets, actors,
attack surfaces and residual risks; reviewed against v1.0.0.
- cmd/server/main.go: at first-run startup, print a clickable
$RM_BASE_URL/bootstrap URL alongside the existing one-shot
bootstrap token (or a fallback hint when RM_BASE_URL is unset).
- web/templates/pages/bootstrap.html: visible "Minimum 12 characters"
hint under the password field so the rule is communicated
before the operator submits.
- tasks.md: close X-01, X-04, X-05 with notes.
Two small follow-ups noted while working through the
p5-oss-readiness CI-runner switch:
* NS-05 — actions/setup-go is now redundant; ci-runner-go ships
Go on PATH and re-downloading on every job costs ~5s a shard.
* NS-06 — host_chrome's per-host "Run backup now" button is a
permanently-disabled tombstone; remove it so the chrome stops
advertising an action that no longer exists.
Per-host Run-backup is gone — the host_chrome partial still
renders the button but it's hard-disabled with a tooltip
pointing to per-source-group Run-now. The smoke test was
clicking that disabled button and waiting forever for a URL
change that would never happen.
Replace the navigation-based dispatch with two API calls:
create a source group covering the agent's /source mount,
then POST to /api/hosts/{id}/source-groups/{gid}/run. The
backup-status assertion at the end is unchanged — host record
is still the source of truth.
The dashboard renders init_running / init_failed / ready state
based on host.repo_status, but the JSON endpoint dropped the
field on its way out. The e2e test couldn't poll for repo
readiness; reflect the same projection the UI uses.
Two issues uncovered by the page-snapshot dump after the agent
state-dir fix:
* The host page server-renders `Run backup now` as disabled
while repo_status != ready, and the page has no live-refresh
on that field. The test was navigating right after status
flipped to 'online' but before auto-init had completed (~3s
later), so the rendered HTML still showed init_running and
the click was a no-op. Wait for repo_status === 'ready'
before navigating.
* playwright.config.ts pinned the per-test timeout at 60s,
but the test itself uses 60s + 120s of internal waits.
Bump to 240s so the test fails on real regressions instead
of timing out on its own internal budget.
Renamed the test description away from "under a minute" since
it overpromises against the new timeout. The performance SLO
belongs in a separate test if we want to assert it.
The agent writes its encrypted secrets blob to
$DefaultSecretsPath (/var/lib/restic-manager/secrets.enc) but
the e2e fixtures created and mounted a directory at
/var/lib/restic-manager-agent — name mismatch. Result: every
`config.update` push failed with 'create tmp: no such file or
directory', the auto-init never got the repo creds, the host
landed in init_failed, and the smoke test couldn't kick off a
backup (the Run backup button is disabled while
repo_status != ready).
Align the compose volume mount and the Dockerfile mkdir on
/var/lib/restic-manager so they match the production install
script + the agent's own default.
The dashboard's empty-state ("No hosts yet.") was gated on
HostCount == 0 alone, which hid the pending-hosts panel — and
the inline accept form — for the most common first-run scenario:
operator just installed an agent that announced, the fleet has
zero accepted hosts, and the only thing the operator needs to do
is review fingerprint + click Accept.
Tighten the gate so the empty state only shows when there are
truly zero hosts and zero pending announces. With a pending
host, fall through to the regular dashboard layout so the
approval queue is visible and actionable.
Caught by the e2e enrol-via-announce smoke test (now unblocked
on PR #23).
The Playwright run produces error-context.md per failed test
with a full DOM snapshot — useful for triaging UI test failures
without round-tripping through downloaded artifacts. Cat it
into the workflow log on failure.
Also bump actions/upload-artifact v3 → v4. v3 uploads still
return success on this Gitea runner but the artifacts don't
surface through the API or UI; v4 is the correct version per
the workflow header note.
When the runner job runs inside a container, compose's relative
`./playwright/playwright-report` resolves to a path that exists
only inside the runner container, so the host's docker daemon
silently bind-mounts an empty dir and the report never lands
anywhere we can read.
Drop the bind mounts; keep the playwright container around
(--name e2e-pw, no --rm); after the test, `docker cp` the
report and traces out into the runner's workspace volume so
upload-artifact has something real to upload. The new test-results
directory (Playwright traces, screenshots, videos) is also
included so failure post-mortem doesn't need a re-run.
The original write-tmp-then-rename guard handles the ETXTBSY race
on a vanilla filesystem, but inside the new ci-runner-go
container our jobs land on overlayfs, which keeps a lagged
"writable inode" view long enough to leak ETXTBSY into the
exec the test does milliseconds later.
After rename, probe-exec the file with a benign argument
("__rm_probe__" — every script's case statement falls through
to a clean exit) until exec succeeds. Each script body is shaped
`case "$1" in restore) ... ;; esac` so the probe is a no-op.
3s deadline keeps a stuck filesystem from hanging the suite.
When jobs run with `container:` set, Gitea Actions defaults to
`sh -e` (dash on Ubuntu), so `set -euo pipefail` fails with
"Illegal option -o pipefail". Pinning bash workflow-wide
matches what the runner used pre-container and keeps existing
scripts portable.
Pin every job to gitea.dcglab.co.uk/steve/ci-runner-go:2026-05-08
so Go, Node, and Docker tooling are already installed when the
job starts. Drops three actions/setup-go invocations from ci.yml
(redundant — Go is on PATH) and inherits Buildx + Compose v2 in
e2e.yml and release.yml without per-job apt-installs.
Recipe lives in steve/ci. Bump the date pin in lockstep across
the three workflows when picking up a fresher image (e.g. when
the Go floor moves).
Without --profile test, `docker compose build` skips the
playwright service (profiles: [test]) and the image is built
on-demand by `compose run` instead. Across CI runs the Gitea
runner caches the resulting tag, so a Dockerfile FROM bump
(v1.50.0 → v1.59.1) is masked by the cached image — the
container ends up with old browser binaries and Playwright's
own version-mismatch check fails the suite. Pull base images
on every build so the FROM tag wins.
`@playwright/test` was loose-pinned to ^1.50.0; npm resolved it
to 1.59.1 inside the runner image, which only ships browser
binaries for 1.50.0. Pin both the package and the docker image
to v1.59.1 so deps and binaries stay aligned.
Gitea's act-style runners execute workflow steps inside a runner
container, so compose's host port-publish (127.0.0.1:8080:8080) is
not reachable from the steps. PR #23's e2e job timed out waiting
for the server even though the container was up and listening.
Move both the health probe and the Playwright run onto rmnet so
they address the server as http://server:8080:
* health probe: docker run --rm --network e2e_rmnet curlimages/curl
* Playwright: new mcr.microsoft.com/playwright-based image, added
as a profile-gated `playwright` service in compose.e2e.yml,
invoked via `docker compose run --rm playwright`. Drops the
setup-node + npm install runner steps.
P5-01 — Documentation site under docs/book/ rendered with mdBook
(downloaded via Makefile, same static-binary pattern as Tailwind).
Structured chapters: getting started, concepts, operations,
security, reference. `make docs` / `make docs-watch`. Generated
output gitignored.
P5-02 — CONTRIBUTING.md rewritten from placeholder to a full
guide. CODE_OF_CONDUCT.md adapted from Contributor Covenant for a
single-maintainer project. .gitea/issue_template/{bug,feature}.md
and PULL_REQUEST_TEMPLATE.md.
P5-04 — Six README screenshots captured live from a fresh server
bootstrap (login, empty dashboard, add-host, alerts, settings,
audit log). README rewritten to centre the screenshot grid and
link out to the docs site.
P5-05 — SECURITY.md with disclosure policy (3-day ack, 30-day
default window), scope in/out, threat-model summary, operator
hardening checklist. Mirrored as a docs-site chapter.
P5-06 — End-to-end test harness. e2e/compose.e2e.yml brings up
server + sibling Linux agent (alpine + restic) + restic/rest-server.
Agent uses announce-and-approve so Playwright can drive the full
operator flow: bootstrap → login → accept pending → backup →
verify terminal status. Second spec scrapes /metrics to assert
the P6-04 endpoint surface. .gitea/workflows/e2e.yml runs on every
PR; local how-to in docs/e2e.md.
New internal/server/metrics package emits the legacy text/plain
exposition format directly, so we don't pull in
prometheus/client_golang. Endpoint is opt-in via RM_METRICS_TOKEN
and/or RM_METRICS_TRUSTED_CIDR; route is not mounted at all if
neither gate is set. Both gates ANDed when both configured.
Per-host gauges (online, last_backup_*, repo_size_bytes,
snapshot_count, open_alerts, repo_status), server gauges
(hosts_total/online, active_alerts by severity, build_info), and
an in-memory job-duration histogram observed from the existing
MsgJobFinished branch in the WS handler.
Docs in docs/prometheus.md (enable + scrape config + metric
reference + dashboard import). Sample dashboard at
deploy/grafana/restic-manager-dashboard.json - six panels,
Grafana schema 39, single Prometheus datasource variable.
Tests: golden render, concurrent observe, bucket boundaries in
the metrics package; auth matrix (no auth -> 404, token gate,
CIDR gate, both required) in the HTTP layer.
Spent half an evening fighting a smoke server that kept getting SIGTERM'd
mid-iteration. Root cause: backgrounded processes spawned from sandboxed
shell tool calls don't outlive the parent — even with nohup + disown.
Fix: hand the server to user-systemd as a transient unit so its lifecycle
is owned by the user's session, not by whichever bash subprocess started it.
New Make targets:
make smoke-restart build server + (re)launch as systemd --user unit
make smoke-status show unit status
make smoke-logs tail $HOME/smoke/server.log
make smoke-stop stop the unit
make smoke-deploy full rebuild + restage agent assets + restart
Documents the workflow in CLAUDE.md so the next session doesn't relitigate.
- Project total_size_bytes onto hosts.repo_size_bytes inside the
UpsertHostRepoStats transaction. The hosts row column has been
unwritten since the initial schema in 0001, so the dashboard's
Repo size cell has always rendered '—' even after backups. Now
the column updates atomically alongside the host_repo_stats row,
and FleetSummary's SUM(repo_size_bytes) becomes accurate too.
- Right-align the Alerts column header so it sits over its
right-aligned value (was floating left of column, ambiguous).
- Add text-ink-mid to the 30d trend / Alerts / Tags headers so all
column headers share the same brightness.
- Add rotated 'Size' (left) and 'Snapshots' (right) axis titles in
the chart's outer margins so the two y-axes are self-describing.
- Bump the chart viewBox from 600x220 to 640x220 and lift padL from
56 to 72 so the rotated labels and byte tick numbers don't crowd.
- Dedupe the X-axis labels for short windows (1 or 2 days collapsed
the start/mid/end indices onto each other, stacking 'May 7' three
times); the 1-day case now centres a single label, 2-day uses
start+end only.
- Pin a lone data dot to the chart centre instead of the left edge
when len(days)==1, so it sits under the centred date label.
Goldens regenerated.
Adds /hosts/{id}/jobs page listing recent jobs for the host (newest
first, capped at 100) with click-through to /jobs/{id}. Converts the
Jobs placeholder <div> to a real <a> nav link; removes the Settings
stub entirely. Also registers durationHuman template func and a
.jobs-row CSS grid to match the existing .schd-row idiom.
The dashboard's 'Last backup' column reads hosts.last_backup_at /
last_backup_status, but the WS handler only updated hosts.repo_status
on job.finished — backup terminations were silently dropped. Add a
SetHostLastBackup store method and call it from the same job.finished
switch that already handles init jobs.
Also: CLAUDE.md restage block uses /tmp/rm-smoke (the original
default) but the actual dev env runs out of $HOME/smoke. Update the
paths in the doc to match.
Smoke caught this: ProtectSystem=full mounts /usr read-only so the
agent couldn't write its own .new staging file or atomic-rename over
the running binary. Adding /usr/local/bin to ReadWritePaths is the
minimum diff that lets self-update work; the whole-dir grant is
required because os.Rename needs write on the parent directory.
- Add ?updates=behind query filter and the matching dashboardFilter
field; round-trips through encode/parse.
- Compute UpdatesBehind on the dashboard view-model (online + version
trailing the server) and surface as an amber hero tile that links
to the filtered list.
- Test exercise covering the new filter case.
- Surface UpdateAvailable + TargetVersion on the dashboard host row,
the host_chrome header, and the JSON Host shape.
- New host_update_chip partial renders an amber out-of-date pill
next to the agent-version display when the host's agent trails
the server.
- Host detail right-rail gains an admin-only Update agent button
(disabled when host is offline or already updating).
- New .update-chip and .btn-amber CSS tokens; tailwind output
refreshed.
- POST /api/fleet/update, POST /api/fleet-updates/{id}/cancel,
GET /api/fleet-updates/{id} (admin-only).
- GET /settings/fleet-update + /partial for htmx polling.
- Renders idle / running / terminal states with per-host progress.
- Tests cover happy path, derive-host-ids, conflict, cancel, get,
and RBAC.
- alert: update_failed (per-host, dedup=hostID) + fleet_update_halted
(system-scoped, host_id NULL via new RaiseOrTouchSystem helper).
- ws: UpdateWatcher tracks in-flight command.update dispatches and
reconciles them against incoming hello envelopes — success path
marks the job succeeded and auto-resolves the alert; 90s timeout
marks the job failed and raises update_failed.
- http: POST /api/hosts/{id}/update (admin-only JSON) + the HTMX
/hosts/{id}/update form variant. Pre-checks: host exists, online,
agent_version != current, no running update job. Refactored core
into Server.dispatchHostUpdate so the fleet worker can share it
without going through HTTP.
- fleetupdate: rolling worker iterating through host slots, halting
on first failure and raising fleet_update_halted. Polling-based
version-match (re-read hosts.agent_version every 1s up to 95s) —
no extra plumbing into the WS hello path. At-most-one-running is
enforced at the store layer (ErrFleetUpdateRunning).
- cmd/server: wire UpdateWatcher and FleetWorker into the main
goroutine; the worker uses a small serverDispatcher adapter that
delegates back into Server.DispatchHostUpdate.
Tests: watcher (success/timeout/mismatch/late-hello), HTTP endpoint
(happy + four pre-check branches + RBAC), worker (two-host happy,
timeout-halt, host-offline-halt, already-at-target skip, cancel
mid-run, double-Start guard).
The original plan was apt repo + Chocolatey package. The P5-03 Docker
pivot bundled matching agent binaries into the server image and
exposes them via /agent/binary, so 'update agent' now collapses to
're-fetch from your own server'. No third-party packaging or signing
infra needed. P6-01 drops to S; P6-02 keeps the dashboard reporting
+ fleet-update UX but points at the new mechanism.
The auto-issued GITHUB_TOKEN lacks write:package scope on this Gitea
instance, so the v0.9.0 tag build failed at docker login. Switch to
the user-level DEV_TOKEN secret which has the correct scope.
Smoothes the rough edges that came up exercising a live deployment.
First-run bootstrap UI: /bootstrap renders a username + password form
that uses the in-memory token directly (operator no longer copies it
out of the log); /login redirects there while bootstrap is available.
Agent reliability: failJob synthetic envelopes so command.run early
returns no longer hang the server-side job; runtime probe of restic
restore --help drives --no-ownership instead of version sniffing
(0.18.x had it removed). Server unit re-shaped: ProtectSystem=full
plus ReadWritePaths=/etc/restic-manager, no ProtectHome — restore
can now write anywhere a user might want.
Restore wizard: default target is /root/rm-restore/<job-id>/ with
clearer help text. Re-init confirm input uses .field (was .input,
which doesn't exist — text was invisible).
NS-01 host delete: store DeleteHost, admin-band /hosts/{id}/delete
with hostname-confirm danger zone, audit, FK cascade, live WS close.
NS-02 enrollment-token recovery: outstanding-tokens panel on
/hosts/new, regenerate (preserves attachments) and revoke handlers
+ audit, store-level ListOutstandingEnrollmentTokens and
DeleteEnrollmentToken.
NS-03 repo init / probe surface: migration 0020 adds
hosts.repo_status + repo_status_error; WS handler projects every
init job's outcome onto the host row (idempotent already-initialised
collapses to ready); creds-save resets status and dispatches a fresh
probe; /hosts/{id}/repo/probe retry endpoint with banner.
NS-04 dashboard live + sort + filter: query-string filter
(q/status/repo_status/tag/sort/dir), 5s htmx live poll mirroring the
alerts pattern with a localStorage live toggle, sortable column
headers, filter row + clear.
Alerts page: ack'd-by line resolves user_id ULID to username.
Compose.yaml ignored — host-specific.
The reverse proxy is assumed to live outside this project (Caddy,
nginx, Traefik, whatever the operator already runs). The reference
compose stands up only the server: image-pinned via RM_VERSION,
named volume for operator state, localhost-bound so the proxy
reaches it on loopback.
docs/reverse-proxy.md covers what the proxy must forward — the
X-Forwarded-* headers, Host, and Connection: upgrade for the agent
WebSocket and live-log streams — plus the RM_TRUSTED_PROXY CIDR
rule that gates header trust. Worked examples for Caddy, nginx
(with the websocket upgrade map + 1h proxy_read_timeout for live
logs), and Traefik.
Single public deliverable per tag: a multi-arch server image, with
cross-compiled agent binaries + install scripts + the systemd unit
baked under /opt/restic-manager/dist/. The /agent/binary and
/install/* handlers fall back from <DataDir>/... to that read-only
path so a fresh container Just Works without first-run staging;
operators can still drop a custom build into <DataDir>/ to override
per-host.
Architecture rationale: agent distribution already routes through
the running server, so the release surface mirrors that — there's
no second source of truth to keep in sync.
Workflow .gitea/workflows/release.yml triggers on v*.*.* tag-push
(fan-out :vX.Y.Z / :X.Y / :X, plus :latest once MAJOR>=1) and
workflow_dispatch (snapshot tag only). Pushes to the Gitea
container registry on this instance.
Both binaries grow main.commit + main.date ldflag targets. Makefile
and Dockerfile fill them; release workflow forwards from gitea.sha
plus a UTC timestamp.
Spec : docs/superpowers/specs/2026-05-05-p5-03-docker-only-release.md
Plan : docs/superpowers/plans/2026-05-05-p5-03-docker-only-release.md
Authelia (and many other IdPs) only put `sub` in the ID token by
default, surfacing `preferred_username`/`email`/`groups` from the
userinfo endpoint. Fetch userinfo after id_token verification and
fold its claims into the parsed claim map; the id_token claims
remain authoritative on conflict so the signed assertion still
wins.
Live sweep against https://auth.dcglab.co.uk verified all four
flows: rm-admin → admin JIT, rm-operator → operator JIT (RBAC
denies admin pages), rm-viewer → viewer JIT (RBAC denies operator
pages), rm-other → no_role_match banner with no row created.
Returning rm-admin sign-in resolves to the same row by sub.
Screenshots in _diag/p4-05-sweep/.
Pull the operator-experience polish out of Phase 4 so a working v1
ships sooner. Phase 4 keeps RBAC + user mgmt (already done), OIDC,
and host tags. Deferred items renumbered as P6-01..P6-05:
P4-01 → P6-01 apt + Chocolatey update delivery
P4-02 → P6-02 agent-version-behind-server tracking on dashboard
P4-06 → P6-03 repo size trend graphs
P4-08 → P6-04 Prometheus /metrics endpoint
P4-09 → P6-05 Grafana dashboard JSON + integration docs
None of these gate getting the system into production. They land
after Phase 5 (OSS readiness) on the new Phase 6.
Phase 4 remaining: P4-05 (OIDC login) + P4-07 (per-host tags +
dashboard filtering).
Live Playwright + curl sweep on the smoke env exercised the full
user-management lifecycle:
admin add user → setup link generated → curl-as-new-user fetches
/setup (200, username on page) → POSTs password → 303 to / with
Set-Cookie → 200 on dashboard, 200 on /settings/account,
**403 on /settings/users** (admin-only) → admin disables → next
request is **401** + session row count drops to 0 → audit log
reflects user.created + user.setup_completed.
Three-role middleware enforces band gates; admin is fail-closed
default. Setup tokens are sha256-hashed at rest with 1h expiry;
expired tokens are swept on the alert engine's 60s tick. Last-admin
guard rejects disable + demote of the only enabled admin. Self-
service password change at /settings/account is reachable by every
role.
Adds GET/POST handlers for /settings/account in the viewer band
(any authenticated user), account.html template with current-password
field suppressed when must_change_password is set, and audits the
change via AppendAudit.
Adds handleUIUserNewGet, handleUIUserNewPost, handleUIUserSetupLinkGet
to ui_users.go; creates web/templates/pages/user_edit.html (multi-mode
new/edit/setup-link); wires three routes in the admin band of server.go.
Replaces the 501 stub with the full handler: validates the token and
password, hashes and stores the password, deletes the setup token,
mints an 8-hour session cookie, appends a user.setup_completed audit
entry, and redirects to /. Adds TestSetupPostHappyPath covering the
full round-trip including normal-login verification after setup.
Routes are now structured into Public / Viewer / Operator / Admin bands
using requireRole middleware. Job log stream and download moved into the
Viewer band. healthz moved from New() into routes() with the other
public endpoints.
Test job was wall-clocked by `internal/server/http` (~156s on the
self-hosted runner under -race). Two changes here cut that:
1. Matrix-shard the test job by package group: server-http, store,
and "rest" (everything else, computed via `go list | grep -v`).
Each shard runs on its own runner so the heavy package isn't
CPU-starved by siblings.
2. `auth.HashPassword` drops to cheap argon2id params (8 KiB / 1
iter / 1 lane) when `testing.Testing()` returns true. Production
params are unchanged. VerifyPassword reads params from the
encoded hash so cheap-params hashes verify identically — no test
call sites need to change.
Two operator-visible changes on /alerts:
1. Polling drops from 15s to 5s and gains a checkbox in the table
header to turn live monitoring on/off. Choice is persisted in
localStorage so it survives full-page navigations. The toggle
state is woven into the htmx hx-trigger predicate, so flipping
the checkbox just sets the flag and the next tick (or the
absence of one) honours it — no attribute juggling, no
htmx.process re-init. The dot dims to 0.3 opacity when paused
so operators can see at a glance that they're looking at a
stale view.
2. Severity dropdown options pick up the same oklch tints used by
the row dots / left borders / kind chips. The kind column shows
only the kind text, so without a colour cue the dropdown
mentioned a concept (severity) that the table itself didn't
render. Now the colours bridge the gap.
Note on <option> styling: Chrome and Firefox honour inline color:
on options; Safari ignores it. Acceptable degradation — falls back
to plain text, which is what we had.
The alerts list is the one screen where staleness is genuinely
harmful — an operator can be looking at an Open tab that's already
been resolved by another admin or auto-resolved by the engine, and
take action on a row that no longer exists.
Add an htmx poll on just the table panel:
hx-get same URL with current querystring (filters preserved)
hx-trigger every 15s, only when document is visible (no idle CPU)
hx-select #alerts-table — pull this element out of the response
hx-swap outerHTML
Polling lives on the table div, not the page root, so the filter
strip and header don't flash on each tick. Header gains a small
'live ●' label so the polling is discoverable.
RefreshURL is r.URL.RequestURI() on the server side — keeps any
status/severity/host_id/q params intact across refreshes.
Other screens (dashboard, hosts, jobs) deliberately stay manual-
refresh per the project's anti-flicker stance.
Until now the open-alert key was (host_id, kind, resolved_at IS NULL).
A host with two source groups both failing collapsed onto one
backup_failed row — second failure bumped last_seen_at and
overwrote the message but never re-fan-out. Operators saw one
alert that appeared to flap, not two distinct broken things.
Schema changes (column-level ALTER, no rebuild):
- 0015 jobs.source_group_id (FK → source_groups, ON DELETE SET NULL,
index). Populated for backup jobs in CreateJob.
- 0016 alerts.dedup_key (NOT NULL DEFAULT ''). The old alerts_open
partial index gets dropped and replaced with a UNIQUE partial
index on (host_id, kind, dedup_key) WHERE resolved_at IS NULL —
the index is now the actual dedup primitive.
Plumbing:
- RaiseOrTouch / AutoResolve / Alert struct gain dedup_key.
- engine.JobFinishedEvent gains SourceGroupID; handleJobFinished
passes it through for backup_failed only (forget/prune/check stay
repo-scoped with key='').
- ws.handler reads SourceGroupID off the freshly-loaded job row.
- dispatchJobWithPayload gains a *string sourceGroupID arg; the
per-group Run-now path and schedule.fire path pass &g.ID.
Test coverage: TestRaiseOrTouchDedupsPerSourceGroup proves two
distinct groups produce two distinct open alerts and that resolving
one does not auto-resolve the other.
Dev tool: cmd/_fake_alert gains -dedup-key flag.
cmd/_fake_alert and similar one-shot dev tools live under cmd/_*
where Go's build tooling skips them. Add an explicit gitignore line
so an accidental 'git add cmd/.' can't drag them into a release.
styles.css is the regenerated Tailwind output — picks up the new
ntfy basic-auth fields and the right-rail preview ids.
Right-rail preview was rendered server-side via {{if eq $f.Kind ...}},
so it stayed on whatever kind the page loaded with. Editing an SMTP
channel and flipping to ntfy in the picker left the email RFC 5322
sample on screen.
Render all three preview panels with id='preview-<kind>' (only the
matching one visible on first render) and toggle their .hidden class
in the kind-switcher JS alongside the field panels. Same pattern
used for fields-<kind>.
The delete-panel <form action='.../delete'> was nested inside the
main <form action='.../edit'>. HTML doesn't allow nested forms —
browsers parse the inner form as if it didn't exist, so clicking
'Delete permanently' submitted the outer edit form to /edit
instead of /delete, leaving the channel intact.
Move the delete-panel block to a sibling of the main form. The
'Delete channel…' button still toggles its visibility via JS, the
panel still renders inside the page layout, and now its form
actually posts to the delete handler.
Self-hosted ntfy that doesn't expose a token-mint endpoint can still
authenticate over HTTP Basic. Add Username + Password fields to
NtfyConfig; the channel sends 'Authorization: Basic …' when token is
empty and username is set. Token wins when both are configured.
Form-side: two new optional fields next to the access token, with
the same write-only placeholder treatment as smtp_password (blank
on edit means 'keep stored value'). Username is round-tripped on
edit; password is masked.
Two bugs in the channel-enabled affordance:
1. List-row toggle was a static span with no handler; the row's
row-link overlay swallowed every click and routed to /edit. Add
POST /settings/notifications/{id}/toggle backed by a new store
method SetNotificationChannelEnabled, and turn the row toggle
into an htmx-driven button that swaps in the new state. Use
event.stopPropagation() on the toggle so it beats the row link.
2. Edit-form toggle visually flipped but the underlying checkbox
reverted: the visual span lives inside the <label>, so clicking
it fired the inline JS handler AND the label's native
checkbox-toggle, cancelling out. Bind to the checkbox 'change'
event instead and let the label do the toggling — the JS just
mirrors check.checked into the .on class.
The channel form has three inputs all named 'name' (one per kind
section: webhook / ntfy / smtp), but only the visible kind's input
is filled in. PostForm.Get returns the first regardless of
emptiness, so editing an ntfy or smtp channel always read '' from
the (hidden, unfilled) webhook section's name input and rejected
with 'name required'.
Add firstNonEmpty helper that scans the slice for the first
non-blank value. Same flavour of bug as the enabled checkbox fix
in 24eecc1 — both fall out of having multiple inputs share a name
across the per-kind sub-forms.
Sweep against the live smoke env confirmed the alerts subsystem
end-to-end: three channels (webhook → local sink, ntfy → ntfy.sh,
SMTP → MailHog) created and verified via the Test button; synthetic
critical raised; ack + resolve fan out alert.acknowledged /
alert.resolved across all three; dashboard banner appears and
clears; nav badge tracks open count.
Three real bugs found and fixed mid-sweep — see preceding three
commits for the full reasoning.
The denormalised projection was never written by the alerts code
path, so the dashboard's OPEN ALERTS card and the per-host alerts
column always read 0 regardless of how many alerts were open.
fleet.GetStats sums hosts.open_alert_count; if it never moves, the
card is decoration.
Add refreshHostOpenAlertCount that recomputes from the alerts table
(self-healing — no +/- bookkeeping to drift). Call it after the
commit in RaiseOrTouch when a row was inserted, after Resolve, and
after AutoResolve.
Caught during the live sweep: a synthetic critical raised the count
to 1, but resolving it left the dashboard reading '1 unresolved'
indefinitely.
The notification channel form has a <input hidden name=enabled value=0>
plus a <input checkbox name=enabled value=1> so unchecking the box
still submits 'enabled=0' (otherwise the field would just be absent).
But Go's url.Values.Get returns the FIRST value, so even when the
checkbox is ticked the handler read '0' and persisted enabled=false.
Scan r.PostForm["enabled"] for any '1' instead. Caught during the
sweep — all three test channels saved with enabled=0 even though
the toggle visually rendered ON.
Spotted during the live Playwright sweep: clicking Acknowledge or
Resolve updated the alert row but never fanned out a notification.
The handlers went straight to Store.Acknowledge/Resolve, bypassing
the hub.
Add Engine.Acknowledge and Engine.Resolve that wrap the store call
and dispatch the matching event to every enabled channel. The UI
handlers prefer the engine path when wired, and fall back to the
direct store call so unit tests that construct a Server without an
engine still work.
Use context.WithoutCancel for the goroutine dispatch — the request
context is cancelled the instant the handler returns 204, so the
naive 'go e.hub.Dispatch(ctx, ...)' was racing the response and
losing the channel-list query with 'context canceled'.
- Construct notification.NewHub and alert.NewEngine at boot in cmd/server/main.go
- Start go alertEngine.Run(ctx) after construction, before the HTTP listener
- Wire AlertEngine and NotificationHub into rmhttp.Deps (fields already existed)
- Remove the TODO(G1) in the offline sweeper; now calls NotifyHostOffline per ID
Add settings.html (shell + sub-tab nav + conditional list/edit body),
notifications.html and notification_edit.html (glob stubs), and the
supporting CSS tokens (.ch-row, .ch-icon, .toggle, .kind-grid,
.kind-card, .radio-pip, .test-pill) to input.css. Rebuild styles.css.
Add ui_parse_test.go to catch template regressions at test time.
The kind picker is JS-driven (no full page reload); the enabled toggle
mirrors the existing visual toggle pattern; the test-notification button
uses HTMX and renders the JSON response as a coloured pill client-side.
Flagged in review of 35dee98: the Alerts tab badge should show the
open count from any page, not just /alerts. baseView now takes the
request and queries store.ListAlerts(Status: "open") to fill
view.OpenAlerts on every page render. All call sites updated.
- ws.HandlerDeps gains an AlertEngine *alert.Engine field; populated
from http.Deps.AlertEngine (nil until G1 constructs the engine)
- runAgentLoop calls NotifyHostOnline after MarkHostHello succeeds
- dispatchAgentMessage MsgJobFinished case calls NotifyJobFinished,
looking up the job Kind via Store.GetJob before notifying
- store.MarkHostsOfflineStaleReturnIDs added: SELECT+UPDATE in one
transaction, returns the IDs that flipped to offline
- offline sweeper in cmd/server/main.go switched to the new variant;
TODO(G1) comment marks where NotifyHostOffline calls will land
Fixes flagged in spec review of 1ff0b2d: ntfy POSTs need explicit
Content-Type: text/plain (the spec calls for it; ntfy works without
but explicit beats inferred); trim trailing slashes from server URL
to avoid double-slash when operators paste 'https://ntfy.sh/'.
The CI runs go test with -race; the agent runner has two pump goroutines
(pumpStdout + pumpStderr) writing through the sender concurrently, and
the unprotected fakeSender slice append raced. The cancel_test had a
local 'safeSender' workaround for the same issue; promote that mutex
onto fakeSender itself so every test in the package is race-clean
without per-test variants.
- fakeSender grows mu sync.Mutex; Send takes/releases. New snapshot()
helper for tests that want a stable copy.
- cancel_test drops its local safeSender + sync import; uses fakeSender.
Verified: go test -race ./... passes across all packages.
1. Agent-side MkdirAll on the new-dir restore target. Restic creates
missing leaves but won't traverse multiple missing levels, and
under the systemd sandbox writes outside ReadWritePaths fail
anyway. Calling os.MkdirAll(target, 0700) before invoking restic
means the operator never has to pre-create the per-job subdir,
and a path the sandbox rejects surfaces as a clean
'restic restore: prepare target ...: read-only file system' error
in the job log instead of a cryptic restic-side stat failure.
2. tasks.md Phase 3 — Restore section refreshed:
- P3-X4 added (job log download dropdown — txt + ndjson)
- P3-X5 added (UK lint locale switch + 73-correction sweep)
- P3-X6 added (SIZE/FILES tooltip when host's restic < 0.17)
- P3-03 entry expanded to cover version-gated --no-ownership,
editable target, $HOME expansion, agent-side MkdirAll
- As-shipped sweep summary mentions custom-target restore +
download dropdown + tooltip in addition to the original walk
Test: TestRunRestoreNewDirAutoCreatesTarget seeds a multi-level
target the operator hasn't created and confirms RunRestore mkdir's
the chain before invoking restic.
Per-snapshot size + file-count come from the embedded summary block
restic added to 'snapshots --json' in 0.17 (the source comment in
internal/restic/snapshots.go incorrectly said 0.16+). Hosts running
0.16.x leave those columns blank.
- Fix the snapshots.go doc comment: '0.16+' -> '0.17+'.
- hostDetailPage carries a LegacyRestic bool computed from the host's
reported ResticVersion via Env.AtLeastVersion(0, 17). Empty version
also counts as legacy (conservative default).
- Template attaches title='Needs restic 0.17+ on the agent host. This
host runs <ver>.' + cursor:help on the SIZE / FILES headers when
the flag is true. Hosts already on 0.17+ get no tooltip and no
extra styling.
A host upgrading restic to 0.17+ gets the columns populated on the
next backup automatically — no further code change needed.
Replace the floating 'Download log' button + bare '.ndjson' link with
one cohesive dropdown menu — same affordance as the rest of the
header, opens to two well-described options.
- Native <details><summary> for keyboard + no-JS support; only the
click-outside-to-close handler is JS (a few lines).
- New .dropdown / .dropdown-menu / .dropdown-item tokens in
web/styles/input.css. Reusable for future header menus
(host-detail overflow, source-group action menus, etc).
- Chevron flips 180 degrees when open via .dropdown[open] selector.
- Each option has a label + a mono hint line explaining when to pick it
(.txt for humans / paste into a ticket; .ndjson for jq / tooling).
Three small follow-ups from review:
1. Restore target is now operator-editable. Default value is the
literal '\$HOME/rm-restore/<job-id>/' (agent expands \$HOME at
run time using os.UserHomeDir(); also handles \${HOME} and ~/
prefixes). Operator can replace with any absolute path.
- ui_restore.go validates the input is either absolute or starts
with one of the recognised prefixes; other env-var refs (\$PATH
etc.) are deliberately rejected so operator paths can't pick up
arbitrary agent env values.
- host_restore.html replaces the read-only mono-text display with
a real <input>; help text spells out that \$HOME resolves
agent-side and <job-id> is substituted on dispatch.
- install.sh + the systemd unit prep /root/rm-restore so the
default works under the sandbox: ReadWritePaths gains a soft
'-/root/rm-restore' entry (the '-' makes the bind-mount soft-fail
if missing, but install.sh pre-creates it root-owned 0700).
2. --no-ownership flag now gated on restic version. The flag was
added in restic 0.17 and 0.16 rejects it. Previously dropped it
wholesale — that meant new-dir restores silently preserved
ownership against design intent on 0.17+. Now the agent threads
its detected restic version (sysinfo already collects it) through
runner.Config -> restic.Env, and RunRestore appends --no-ownership
only when AtLeastVersion(0, 17) returns true. 0.16 hosts still
restore with original uid/gid; help text in the wizard explicitly
notes this. The previous 'Original ownership is preserved' copy
was wrong for new-dir mode and is corrected.
3. golangci-lint misspell locale switched US -> UK and the codebase
swept (73 corrections, mostly behaviour/serialise/recognise/honour).
Wire-format ErrorCode 'unauthorized' -> 'unauthorised' is a tiny
contract change but the agent doesn't parse those codes today and
no external API consumers exist yet. Tests passed before + after.
Tests:
- internal/restic/version_test.go covers Env.AtLeastVersion across
edge cases (empty, exact match, patch above, minor below, non-
numeric) and expandHome on \$HOME / \${HOME} / ~/, plus
pass-through for absolute paths and refusal of other env vars.
- ui_restore_test updated: TargetDir now starts '\$HOME/rm-restore/'
with the job_id substituted into the placeholder.
Live verified on the smoke env: default target restored to
/root/rm-restore/<job-id>/ as the agent's expanded \$HOME (2 files,
14 bytes); custom override '/tmp/custom-restore/<job-id>/' restored
into the agent's PrivateTmp namespace (1 file, 6 bytes); both jobs
'succeeded', exit 0.
The diff job's full output streams to the standard live job log page,
which can be a lot of text the operator wants to grep through or paste
into a ticket. Add a Download button.
Source of truth is the persisted job_logs table — works any time
(running or finished) and doesn't need to pause the live WS stream.
The download is 'everything the server has up to right now'; if the
operator wants a fuller snapshot of a still-running job, they hit
Download again.
- New endpoint GET /api/jobs/{id}/log.{txt,ndjson} (chi {format}
matcher constrained to the two known suffixes). Auth via session
cookie. 404 on unknown job.
- internal/server/http/job_download.go writeLogsText emits a small
header + 'HH:MM:SS.mmm TAG payload' rows mirroring what the live
page shows. writeLogsNDJSON emits one self-contained {seq,ts,stream,
payload} JSON object per line — appending stays valid (each line
stands alone), and the whole file pipes cleanly into jq. NDJSON is
newline-delimited JSON; not the same as a JSON array.
- web/templates/pages/job_detail.html grows two header buttons:
'Download log' (txt) + '.ndjson' ghost variant for tooling.
Tests cover the txt format (header + per-row shape), the ndjson
format (each line round-trips through json.Unmarshal), unknown job
404, unauthenticated 401.
Bug fixes from the Playwright sweep against the live smoke server:
1. Snapshot-picker layout. The .snap-row class was used in the wireframe
but never landed in web/styles/input.css; rows rendered as vertical
blocks instead of a 6-column grid. Added the token (mirrors host-row
shape with restore-specific column widths).
2. Tree expansion. hx-target='closest .tree-row + .tree-children' isn't
a valid HTMX selector — modifiers don't chain. Replaced HTMX-driven
expansion with a small window.__rmTreeToggle helper that uses plain
fetch + .tree-pair wrapper structure for trivial sibling lookup.
Caches loaded state per node.
3. --no-ownership flag dropped. Restic 0.17 introduced --no-ownership;
0.16 rejects it ('unknown flag') before doing any work. Since the
agent runs as root in the systemd unit, restored files keep their
original uid/gid either way and the parent dir is root-owned, so
the 'cp without sudo' rationale doesn't hold. Drop the flag entirely.
4. Default target dir moved to /var/lib/restic-manager/restore. The
systemd unit pins ReadWritePaths to /etc/restic-manager +
/var/lib/restic-manager (with ProtectSystem=strict making the rest
of /var read-only); writes to /var/restic-restore failed with
'read-only file system'.
5. Confirm summary HTML escaping. defaultTarget JS literal evaluates
to a string with literal angle brackets; insertion into innerHTML
must escape them. Added an inline HTML-escape pass.
tasks.md ticked for the Restore sub-phase with a sweep summary
covering the live end-to-end test.
P3-09 — snapshot diff dispatcher.
- POST /api/hosts/{id}/snapshots/diff (and the unprefixed HTMX-form
variant) takes {snapshot_a, snapshot_b}, validates both belong to
the host (long id / short id / prefix match), checks the agent is
online, mints a JobDiff, ships command.run with DiffPayload, writes
a host.snapshot_diff audit row, returns HX-Redirect to the live
job page (or JSON {job_id, job_url} for REST callers).
- Two-snapshot guard: POSTing diff(a,a) returns 422.
- UI: small panel on the host_detail right rail (visible when the
host has 2+ snapshots) with two short-id inputs and a Diff button.
Output renders on the standard live job page where the operator
reads the per-line diff text directly.
P3-X3 — recent-restores line.
- hostChromeData grows RestoreStatus / RestoreAt / RestoreJobID
populated via store.LatestJobByKind(host_id, 'restore') (already
exists, used by the init line).
- host_chrome.html renders a small line below the existing init-status
one with status-coloured copy + a link to the job log. Hidden when
no restore has ever run on this host.
Tests:
- diff_test covers happy path (correct DiffPayload + HX-Redirect),
same-id rejection (422), unknown-id rejection (422). Adds a
seedTwoSnapshots helper since ReplaceHostSnapshots is atomic-swap
(calling seedSnapshot twice would only leave the second).
Restage block (CLAUDE.md) deferred to the end of the restore phase.
End-to-end wizard from /hosts/{id}/restore (or per-snapshot deep link
/hosts/{id}/snapshots/{sid}/restore) → tree-browse → dispatch →
restore-shaped live job page.
Backend (internal/server/http/ui_restore.go):
- GET handlers render the four-step wizard against the wireframe shape
in docs/superpowers/specs/2026-05-04-p3-restore-design.md.
- HTMX tree partial endpoint hits fetchTreeWithCache (P3-X2) so each
directory expansion is a sub-second cached lookup after the first
miss.
- POST validates: snapshot_id non-empty, ≥1 absolute path, in-place
mode requires confirm_hostname == host name, agent online. On error
re-renders the wizard with the operator's input intact. Happy path
mints a job_id, computes the new-directory target as
/var/restic-restore/<job-id>/ (operator can't escape the prefix —
server picks it), creates the job row, ships command.run with
kind=restore + RestorePayload, writes a host.restore audit row,
returns HX-Redirect (or 303) to the live job page.
Templates:
- host_restore.html: single-page progressively-enabled wizard matching
_diag/p3-restore-wizard wireframe. Form-state-driven JS computes a
running tally of selected paths and the step-4 confirm summary
client-side; the server re-renders on validation failure with form
fields preserved.
- partials/tree_node.html: recursive HTMX-served tree fragment.
- Top-level Restore button on host_detail right rail + per-snapshot
Restore action on snapshot rows replace the previous P3-stub.
Restore-shaped job page (job_detail.html):
- Progress widget rendered as a panel rather than a bare strip when
the job is active.
- Current-file display under the bar, updated from log.stream stdout
lines that look like absolute paths. Hidden for non-restore kinds.
Migration 0012:
- Add restore + diff to the jobs.kind CHECK. Rebuild required (SQLite
can't ALTER CHECK in place); follows the safe pattern from 0005.
Defensive: stash job_logs into a temp table before the rebuild and
INSERT OR IGNORE back afterwards so even if SQLite cascades on
DROP TABLE jobs the log history survives.
Tests:
- ui_restore_test covers GET step-1 render, GET pre-selected snapshot
summary card, POST missing snapshot, POST missing paths, POST
in-place wrong-hostname rejection (no command.run leaks to the
agent), POST happy path (HX-Redirect + correct payload + audit
row), POST against offline host returns 503.
Restage block (CLAUDE.md) deferred to the end of the restore phase.
Wires JobRestore and JobDiff end-to-end at the agent layer (the wizard
backend that drives this lands in the next slice).
- internal/api: JobRestore + JobDiff JobKind constants. CommandRunPayload
grows nullable Restore + Diff sub-payloads. RestorePayload carries
snapshot_id, paths, in_place, target_dir; DiffPayload carries
snapshot_a + snapshot_b.
- internal/restic.RunRestore wraps 'restic restore <sid> --target ...
[--no-ownership] [--include p]...' with --json. New pumpRestoreStdout
parses the per-line status / summary objects (drops raw status from
log.stream — the throttled job.progress envelope covers it). New
RestoreStatus + RestoreSummary types mirror restic's wire shape.
- internal/restic.RunDiff wraps 'restic diff --json <a> <b>'.
- internal/agent/runner: RunRestore translates RestoreStatus into
job.progress (mapping FilesRestored → FilesDone etc) with a small
estimateETA helper since restic doesn't provide ETA for restore.
RunDiff is a thin streamHandler wrapper.
- cmd/agent dispatcher gains JobRestore + JobDiff cases. Both reuse
the spawn() helper from P3-X1 so cancel just works.
- Drive-by fix: lastProgress was initialised to time.Now() so the
very first status event was suppressed by the 1s throttle if the
agent reported quickly. Initialise to time.Time{} (zero) so the
first event always emits. Affects backup + restore.
Tests:
- restore_test covers restore happy path (started → progress →
finished, kind=restore on the started envelope), in-place argv
asserts no --no-ownership, new-dir argv asserts --no-ownership +
--target + --include, diff produces the expected log.stream lines.
Restage block (CLAUDE.md) is deferred to the end of the restore
sub-phase so we restage once with all changes.
Foundational for the restore wizard's tree browser. The wizard needs to
lazy-load directory contents from a snapshot as the operator drills
down; this lands the transport.
- internal/api adds MsgTreeList (server → agent) + MsgTreeListResult
(agent → server) with TreeListRequestPayload / TreeListEntry /
TreeListResultPayload types. Reply correlates by Envelope.ID.
- internal/restic.ListTreeChildren wraps 'restic ls --json' and
filters its recursive output to direct children of the requested
path. Parser + path-normalisation + isDirectChild are unit-tested.
- internal/server/ws/rpc.go introduces a generic SendRPC helper on
Hub: register a buffered channel keyed by ULID, send the request,
block on ctx.Done()/timeout/reply. Reply routing piggybacks on the
existing dispatchAgentMessage by adding a MsgTreeListResult case
that forwards to the registered waiter; if no waiter is registered
(caller already gave up) the stray reply is dropped quietly.
- cmd/agent gains a tree.list handler that runs ListTreeChildren on a
fresh per-call context (60s ceiling) and ships the matching
tree.list.result envelope. Errors surface in result.Error rather
than as transport failures so the server-side waiter can render a
sensible UI message.
- internal/server/http/tree_cache.go is the per-wizard-session cache
layer (~30min TTL, sweep-on-access) that fetchTreeWithCache uses
before falling through to SendRPC. Cached on success only; agent
errors aren't cached so a transient failure doesn't poison the
session.
Tests:
- internal/restic/ls_test.go covers parseLsChildren at root / mid-tree
/ leaf, plus normalizeTreePath and isDirectChild edge cases.
- internal/server/ws/rpc_test.go unit-tests the registry: round-trip,
release semantics, concurrent waiters, ctx-cancel.
- internal/server/http/tree_rpc_test.go is the full round-trip: server
SendRPC → fake-agent over a real WS → reply → server gets the
payload. Plus a timeout test that confirms ~300ms timeouts terminate
in ~300ms rather than waiting forever.
The cache is plumbed but no UI handler hits fetchTreeWithCache yet —
that lands with P3-01 (wizard backend). The unused-linter is suppressed
via nolint until the wizard wires it in.
Wires the existing job_detail Cancel button (which was a UI stub) into
real backend behaviour:
- internal/api already declared MsgCommandCancel + CommandCancelPayload;
promote those from forward-declarations to a working envelope. Agent
side: cmd/agent/main.go drops the TODO-stub and gains a per-job
ctx.CancelFunc map. runJob's switch is refactored around a small
spawn() helper so each kind's goroutine derives a per-job context,
registers the cancel, and removes itself on completion regardless of
outcome. command.cancel looks up the func and fires it.
- internal/agent/runner.sendFinished now takes ctx and rebadges
ctx.Canceled errors as JobCancelled (exit 130) rather than
JobFailed. All Run* call sites updated.
- internal/restic.resticCmd sets cmd.Cancel to send SIGTERM (via
build-tagged sigterm constant; os.Kill on Windows since SIGTERM
isn't deliverable there) and cmd.WaitDelay=5s for the SIGKILL
fallback. SIGTERM lets restic remove its lock file before exiting.
- New POST /api/jobs/{id}/cancel server endpoint validates the job
is non-terminal and the host is online, sends command.cancel via
the hub, writes a job.cancel audit row, returns 202. The agent's
resulting job.finished (status=cancelled) is what actually
transitions the row.
Tests:
- internal/server/http/cancel_test.go covers happy path (envelope
shape + audit row), 409 for terminal jobs, 404 for missing jobs,
503 for offline hosts.
- internal/agent/runner/cancel_test.go covers cancel mid-run: a fake
restic that exec'd into 'sleep 30' is canceled 150ms after start
and the resulting job.finished reports JobCancelled with exit 130
in well under the WaitDelay.
Foundational for P3 restore (operator needs to be able to cancel a
running backup if they need to restore urgently). Independently useful
for prune/check/backup that are stuck.
Splits Phase 3 into three independently-shippable sub-phases (Restore,
Alerts, Audit UI) so they can land in separate PRs with their own brainstorm
→ spec → plan cycles. The Restore sub-phase is up first.
The brainstorm ran on 2026-05-04 and locked the following decisions:
- Single-host restore only this phase. P3-04 (cross-host restore) is moved
to a new 'Future / unscheduled' section. Disaster recovery is already
covered by re-enrolling a replacement host with the same repo creds; the
remaining 'pull a file from host A onto host C' use case is genuinely
different (file sharing / migration, not DR) and has no confirmed need.
- Default target is /var/restic-restore/<job-id>/ with --no-ownership;
in-place restore preserves uid/gid/mode and is gated by typed-confirmation
of the host name (mirroring the repo re-init danger zone).
- Tree browser is the path picker, lazy-loaded via a synchronous WS RPC
(tree.list) over the existing correlation-ID infrastructure with a
per-wizard-session in-memory cache (~30 min TTL).
- Single-page wizard with progressively-enabled sections; entry is a
top-level Restore button on host detail (or per-snapshot Restore action
for direct deep-link).
- Snapshot diff (P3-09) is a JobDiff JobKind, dispatched like every other
agent operation; output streams to the standard live job log page.
- Restore-specific live job page variant with files-restored /
bytes-restored / current-file widget.
- Single-flight per host across all kinds, plus a real cancel-job feature
(command.cancel WS envelope, agent kills the restic subprocess via
context cancel + SIGTERM/SIGKILL grace) so the operator can pre-empt a
long-running backup if they need to restore urgently. Wires the existing
job_detail Cancel button (which was a UI stub).
- Audit row host.restore on every dispatch + a recent-restores panel on
host detail. Role gate deferred to P4-03 RBAC.
Wireframe at _diag/p3-restore-wizard/wireframe.html (gitignored —
transient design artefact); screenshot reviewed and approved 2026-05-04.
P2R-09/10/11/12/13/14, P2-16/17/18 all marked done. Acceptance line
for Windows hosts annotated as 'compile-verified, untested in CI'.
_diag/p2-completion-sweep/ holds the dashboard + host-detail +
schedules + sources + repo + source-group-edit screenshots from a
clean sweep against :8080. Zero console errors throughout.
announce_test.go: rate-limit + global-cap subtests dropped t.Parallel
to avoid racing on the package-level tunables under -race.
Pwsh installer that detects arch, downloads
$Server/agent/binary?os=windows&arch=amd64 to
C:\Program Files\restic-manager\, runs the agent in -enroll-server
[+ -enroll-token] mode (token flow OR announce-and-approve), then
calls 'restic-manager-agent install' to register the SCM service.
Surfaces existing scheduled tasks named *restic* without disabling.
CLAUDE.md restage block updated to also stage install.ps1 alongside
install.sh.
internal/agent/service: build-tagged into service_windows.go (svc.Handler
that listens for Stop/Shutdown + delegates to the agent loop) and
service_other.go (foreground stub for Linux/macOS). install_windows.go
wraps mgr.Connect+CreateService/Delete/Start/Stop for the new
'restic-manager-agent install|uninstall|start|stop' subcommands.
Cross-compile verified: GOOS=windows GOARCH=amd64 go build ./cmd/agent
succeeds. UNTESTED on Windows itself — the SCM round-trip can't be
exercised from Linux CI; treat as a starting point for the first
real Windows install.
Dashboard handler loads ListPendingHosts(now); template renders a
warn-bordered panel above the host table with hostname, OS/arch,
fingerprint (selectable / copyable), source IP, age, expiry. Each
row carries an inline accept form (repo URL/user/password) plus a
Reject button. cmd/server adds a 60s ticker calling
DeleteExpiredPendingHosts so 1h-stale rows drop off.
When -enroll-server is supplied without -enroll-token, the agent
mints (and persists) an Ed25519 keypair, POSTs /api/agents/announce,
prints the SHA256 fingerprint in a copy-friendly banner, opens
/ws/agent/pending, signs the server's nonce, and blocks until the
admin clicks Accept (1h ceiling). On accept, persists the bearer +
host_id from the 'enrolled' message; on reject (close code 4001)
exits with a clear error.
Repo creds are pushed via config.update on the first standard WS
hello (P1-32 path), not in the enrolled message itself.
GET /ws/agent/pending?pending_id=… runs an Ed25519 nonce-sign
handshake against the row's stored public key, then holds the
connection open. POST /api/pending-hosts/{id}/accept (admin)
mints a real Host row + bearer + AEAD-encrypted repo creds, pushes
the bearer down the open WS, deletes the pending row, and writes
a host.accept_pending audit entry. POST /api/pending-hosts/{id}/reject
closes the socket with code 4001 and audit-logs host.reject_pending.
In-memory pendingHub keyed by pending_id wires accept/reject to
their live socket.
Source-group edit form gains pre/post hook textareas with a service-
user warning banner; bodies AEAD-encrypted on save (per-group AD).
Repo page adds a 'Host-default hooks' panel above the danger zone
with the same shape; saved via POST /hosts/{id}/repo/hooks.
Agent: new runner.BackupHooks struct + runHook helper invoked via
/bin/sh -c (cmd.exe /C on Windows). pre_hook non-zero exit aborts
the backup; post_hook always runs with RM_JOB_STATUS=succeeded|failed
in env. Output streamed as 'hook(<phase>): …' log.stream lines.
Hooks only run for kind=backup (other kinds skip both phases).
Server: resolveBackupHooks resolves group → host default → empty,
decrypts via crypto.AEAD with per-slot ad bytes, plumbs plaintext
into CommandRunPayload for both schedule.fire and per-group
Run-now dispatch sites. Decrypt failures degrade silently to no
hook so a malformed blob can't poison every backup.
Adds pre_hook/post_hook BLOB columns to source_groups and
pre_hook_default/post_hook_default to hosts. Bytes stored verbatim
(AEAD encrypt/decrypt happens at the HTTP layer where the AEAD key
lives). Round-trip tests cover set/clear semantics on both tables.
Latest 'init' job status surfaced under the host-detail vitals strip
(succeeded/failed/running/queued, with link to the live job log on
non-success). New POST /hosts/{id}/repo/reinit handler dispatches a
fresh init job after the operator types the host name to confirm;
audit row records 'host.repo_reinit'.
P2R-14. New store.LatestJobBySchedule query (per-schedule fired job).
Schedules-tab handler computes next-fire from cron + last-fire from
the jobs table per row. Schedules table grows two columns; dashboard
host row prepends 'next 12h ago/from now' to the existing last-backup
line when a single covering schedule is the run-now candidate.
Embeds store.Schedule into scheduleRow so existing template field
references keep working without bulk renames.
P2R-13b. POST /hosts/{id}/source-groups/{gid}/run accepts optional
bandwidth_up_kbps / bandwidth_down_kbps form fields, plumbs them onto
CommandRunPayload. Agent dispatcher already prefers per-job override
over host-wide caps (T1). UI wraps the Run-now button in a form with
a <details> 'Limit bandwidth for this run' disclosure containing two
KB/s inputs.
P2R-13a. restic.Env gains LimitUploadKBps/LimitDownloadKBps which are
emitted as global --limit-upload/--limit-download flags before the
subcommand on every invocation. Agent dispatcher tracks host-wide
caps received via config.update; server pushes them on hello and
after PUT /api/hosts/{id}/bandwidth.
Also extends api.CommandRunPayload with optional per-job overrides
(BandwidthUpKBps/Down + PreHook/PostHook); the override consumers
land in T2/T6.
The Phase 5 section had drifted from the convention used by phases
1–4 (single section header carrying ✅, no separate summary block).
Collapse to the existing pattern; fold the summary into a blockquote
sitting right under the header.
While there: P2R-03 and P2R-04 still carried forward-references
saying "cadence-driven dispatch lands in P2R-04 / P2R-05". Both
should point at P2R-06 (the maintenance ticker), not the next item
in the list. Updated descriptions to reflect what actually shipped:
LatestJobByKind anchor includes in-flight jobs, ForgetGroups
multi-group payload reshape, repo.stats envelope shape, per-host
drain mutex.
CI run #50 failed with:
--- FAIL: TestDrainPendingDispatchesOnReconnect (1.03s)
pending_drain_test.go:150: pending rows after drain: got 1, want 0
The test waits for a backup command.run envelope on the wire and
then checks the pending-row count. But conn.Send (the wire write)
returns BEFORE DeletePendingRun runs in the drain goroutine — both
fire serially inside drainOne, but the wire-side reader can observe
the Send while the delete is still pending.
Use the existing waitForPendingCount helper to poll the count with
a 2s deadline. Behaviour unchanged when the delete is fast (count
hits 0 immediately); only relevant under CI scheduling pressure.
-race -count=10 locally now passes consistently.
CI run #48 failed with:
--- FAIL: TestRunInitShipsStartedAndFinished
RunInit: ... fork/exec /tmp/.../restic: text file busy
setupScript and setupScriptBin used os.WriteFile to write a shell
script directly at the final path, then exec'd it. Under -race +
many t.Parallel tests, a fork-from-another-goroutine could inherit
the still-open writable fd from one of those WriteFile calls; the
kernel returns ETXTBSY when the freshly-execed binary still has a
writable fd anywhere on the system.
Fix: write to "<path>.tmp", then os.Rename into place. The rename
is a pure dirent op; by the time the final path exists, no process
has a writable fd on its inode and exec is safe. -race + -count=5
on both runner packages now passes consistently.
version.go: add a comment block explaining why Phase 5's wire changes
(CommandRunPayload, ConfigUpdatePayload, RepoStatsPayload reshapes) did
not bump CurrentProtocolVersion — lockstep deploy, no rolling-upgrade
path, smoke env restage enforces it. Notes where a version bump to 2
would be required if a multi-version path is ever introduced.
cmd/agent/main.go: document why the JobForget handler hard-errors on
empty ForgetGroups rather than falling back to a single-policy form.
The maintenance ticker is the only writer and always populates the
field; the fallback was specced but skipped given lockstep deploy.
Add a per-host drain mutex (drainLocks map guarded by drainLocksMu) on
the Server struct. DrainPending acquires it with TryLock: if a drain is
already in-flight for this host, the call returns immediately — the
running drain will see every pending row. This prevents the on-hello
goroutine and the 30s tick from both listing the same host's rows and
dispatching them twice.
Update three existing tests that called srv.DrainPending explicitly
after the on-hello goroutine had already been spawned: replace the
now-redundant direct call with a waitForPendingCount poll so they don't
race the goroutine's mutex ownership. Add TestDrainPendingSerializesPerHost
which fires 10 concurrent DrainPending goroutines against a 5-row queue
and asserts exactly 5 job rows result.
Widen the SQL query to consider all statuses (queued, running,
succeeded, failed, cancelled) rather than terminal-only. An in-flight
prune that outlasts the 60s tick interval previously produced
ErrNotFound, causing the ticker to anchor at now-24h and fire a second
prune concurrently with the first.
Update the doc comment and test: remove the "queued job filtered out"
case, add assertions that a running job and a queued job are each
returned as the latest.
GetSourceGroup errors in drainOne now gate on errors.Is(err,
store.ErrNotFound) before calling abandonPending, mirroring the
existing GetSchedule pattern. Transient errors (SQLITE_BUSY, context
cancellation) now log a warning and return without deleting the row.
Add regression test TestDrainPendingDropsRowsForGoneSourceGroup
confirming the ErrNotFound path still abandons correctly. Also add
a comment above the backoff-doubling loop explaining the progression.
Extract dispatchBackupForGroupCore (persist+marshal+send, no enqueue on
failure) from dispatchBackupForGroup. drainOne now calls the core
directly so a failed Send only bumps the existing pending_runs row via
BumpPendingRunAttempt — not create a second row — stopping the
geometric duplication on repeated drain failures.
dispatchBackupForGroup (schedule.fire path) wraps the core and keeps
its enqueue-on-failure behaviour unchanged.
TestDrainPendingBumpsOnSendFailure strengthened: asserts exactly 1 row
remains after a send failure (was tolerating >=1 duplicate rows).
Two trigger paths land here:
- A 30s ticker in cmd/server calls Server.DrainAllDue(ctx). It
walks pending_runs rows whose next_attempt_at <= now, dedupes by
host, skips offline hosts, and per online host runs DrainPending.
- onAgentHello spawns a background DrainPending(hostID). When a
host comes back, every pending row for it is dispatchable now —
due-ness becomes irrelevant once the wire is back.
Each row's schedule + group are reloaded; ErrNotFound or
disabled-schedule or gone-group abandons the row with a
pending_run.abandoned audit. attempt >= retry_max also abandons.
Otherwise dispatchBackupForGroup is invoked; success deletes the
row, failure bumps attempt with exponential backoff capped at
30m.
When dispatchBackupForGroup's conn.Send errors, queue a pending_runs
row (attempt=1, next_attempt_at = now + group.RetryBackoffSeconds)
instead of silently dropping the fire. The orphaned queued job row
is left behind for forensic visibility — the drainer will create a
fresh job row on its retry.
Also adds Store.ListPendingRunsForHost — the on-reconnect drain
walks every row for the host, regardless of due-ness, since the
host being back makes 'due' irrelevant.
Wires a 60s server-side ticker to the pure-logic maintenance.Decide
introduced in the previous commit. Decisions flow through a new
DispatchMaintenance method on *Server, which:
- skips offline hosts (no pending_runs queueing — maintenance is
not a backup, missed fires shouldn't pile up)
- silently skips prune when admin creds aren't bound
- pushes admin creds before prune, then dispatches with
RequiresAdminCreds=true (same as operator-driven prune)
- persists job rows with actor_kind="system"
Reshapes the forget wire payload from a single RetentionPolicy to a
ForgetGroups list (one tag + per-group keep-* per source group). The
agent walks the groups and runs `restic forget --tag <name> --keep-*`
once per group. Dead-code removed: CommandRunPayload.RetentionPolicy,
the old forget JSON-decode in cmd/agent, and the single-policy form of
restic.RunForget.
Add hx-swap="none" to the three Run-now buttons (check/prune/unlock) in
host_repo.html to match the existing pattern on host_sources.html and
host_schedules.html. Fix all-blank admin-credentials save to redirect
without ?saved= query string so no false-positive banner is shown;
strengthen the corresponding test to assert Location has no ?saved=.
Rebuild CSS bundle via Tailwind to pick up max-w-[640px] JIT class.
- hostRepoPage gains AdminURL/AdminUsername/HasAdminPassword, Online,
and StatsView (pre-dereferenced projection of host_repo_stats).
- loadHostRepoPage loads the admin slot (tolerating ErrNotFound),
hub.Connected, and stats (tolerating ErrNotFound).
- renderRepoPage gains an adminErr parameter; all callers updated.
- handleUIAdminCredentialsSave / handleUIAdminCredentialsDelete added
(form-POST handlers mirroring the repo-creds pattern, with audit).
- Routes /hosts/{id}/admin-credentials POST and /delete POST registered.
- Template: Admin credentials form after Connection, Run-now HTMX
buttons after Maintenance, Repo health stats panel in right rail.
- Tests: 9 new tests covering rendering, disabled states, save/delete
round-trips, audit rows, and idempotent delete.
Switch handleSetHostCredentials, handleSetAdminCredentials, and
handleDeleteAdminCredentials from authedUser (bool) to requireUser
(*store.User) so AuditEntry.UserID and Actor are populated correctly.
Add slog.Warn on the non-ErrNotFound pushAdminCredsToAgent path in
handleRunRepoPrune so decrypt/send failures surface in the server log
rather than appearing as a generic host_offline 503.
Adds POST /api/hosts/{id}/repo/{prune,check,unlock} (and matching outer
routes for HTMX form posts). Prune pushes the admin-cred slot via
pushAdminCredsToAgent before dispatch and refuses with
admin_creds_required when the slot is not set. Check reads
check_subset_pct from host_repo_maintenance (overridable via ?subset=N,
clamped 0-100; non-numeric override falls back to DB value silently).
Unlock needs no admin creds. All three share the same wantsHTML/HX-Redirect
response split as the per-source-group run-now endpoint.
Adds GET/PUT/DELETE /api/hosts/{id}/admin-credentials handlers that
mirror the existing repo-credentials endpoints but write to
store.CredKindAdmin with AEAD additional-data "host:<id>:admin" (scoped
away from the repo slot to prevent cross-binding). PUT immediately pushes
a config.update(Slot:"admin") to the agent when it is connected, and the
new pushAdminCredsToAgent helper is wired for use by the upcoming prune
run-now endpoint (D2) to push on-demand before dispatch.
Save and SaveAdmin now propagate loadBundle errors instead of silently
overwriting a corrupt file (data-loss fix). Tests added for both paths.
reportStats logs a Debug on RunStats failure; r in runJob gets a comment
explaining the prune-runner asymmetry; runner_test comment tightened.
RunCheck and RunUnlock were calling sendFinished before reportStats,
inverting the required job.started → log.stream → repo.stats →
job.finished envelope order. Move reportStats ahead of sendFinished in
both functions to match the pattern already correct in RunPrune.
Strengthen TestRunCheckShipsCheckStatus, TestRunCheckErrorsFoundShipsErrorsStatus,
and TestRunUnlockClearsLock with the same position-index ordering
assertions used by TestRunPruneShipsExpectedEnvelopes; these assertions
would have failed against the pre-fix code.
Extract resticEnv/sendStarted/streamHandler/sendFinished helpers to remove
boilerplate duplication across Run* methods. Add RunPrune (ships repo.stats
with LastPruneAt before job.finished), RunCheck (ships stats with
LastCheckStatus/LockPresent regardless of outcome), RunUnlock (ships
LockPresent=false on success), and reportStats (fills size fields via
RunStats when caller didn't populate them).
Wire JobPrune/JobCheck/JobUnlock into the dispatcher switch; teach
MsgConfigUpdate about the Slot discriminator for admin vs repo creds;
add strconv import for subset-pct parsing.
Split the on-disk bundle into repo + admin slots. Legacy flat Repo blobs
are detected at load time by the presence of "repo_url" at the top level
and transparently promoted into the new shape on the next Save/SaveAdmin.
Adds ErrNoAdmin sentinel, LoadAdmin, SaveAdmin, and three new tests.
Reshape RepoStatsPayload into pointer-field partial-update form matching
store.HostRepoStats semantics; add Slot discriminator to ConfigUpdatePayload
for admin vs repo credential routing; add RequiresAdminCreds flag to
CommandRunPayload for prune/unlock jobs that need delete authority.
Narrow the LockPresent predicate from bare "locked" (too broad) to
"stale lock" and "already locked" — the two phrases restic actually
emits. Replace TestRunCheckParsesLock with table-driven
TestRunCheckLockSniff covering both trigger phrases and a benign
"locked-file" line that must not set LockPresent. Add
TestRunStatsZeroSnapshots to pin that RunStats accepts zero-snapshot
JSON without error.
Add RunUnlock (delegates straight to runWithPump) and RunStats which
runs `restic stats --json --mode raw-data`, captures the single JSON
line from stdout into RepoStats, and returns an error if no JSON
arrives. Tests cover arg plumbing for unlock, JSON parsing, and the
no-JSON error path.
Add CheckResult (LockPresent, ErrorsFound) and RunCheck. subsetPct>0
passes --read-data-subset N% to limit data reads. Stderr is sniffed
for "Found stale lock"/"locked" to set LockPresent; a non-zero exit
from restic is absorbed as ErrorsFound=true rather than an error so
the caller can always persist last_check_status. Tests cover lock
detection, exit-1 absorption, and subset-arg plumbing.
Add RunPrune for admin-credential prune invocations. Extract
runWithPump to DRY the stdout+stderr pump pattern; refactor RunForget
and RunInit to delegate to it (RunInit preserves the "config file
already exists" soft-success sniff by wrapping the handler before the
call). Add runner_test.go with TestRunPruneInvokesPrune.
The runner-provisioning script has been handed off to the infra
agent, who will own it going forward. ci.yml's header comment is
updated to point at "the infra team owns the script" rather than
the in-repo path, but the runner expectations themselves stay the
same — workflows still rely on the persistent volumes, pre-cloned
actions, and host-installed golangci-lint that any compliant
provisioning produces.
scripts/provision-gitea-runner.sh is a one-shot, idempotent host
setup for an act_runner LXC. It mounts persistent host volumes for
GOMODCACHE / GOCACHE / act-clones, pre-pulls the runner image,
pre-clones the common GitHub actions, installs golangci-lint, and
sets up a nightly cron to refresh the lot. Generic — no per-project
state.
With those persistent volumes in place, `cache: true` on
actions/setup-go becomes a net negative — the action keeps tar-ing /
un-tar-ing GOMODCACHE+GOCACHE through the Gitea cache backend on
every job, adding ~10s per job and overwriting the volume contents.
Drop it from all three jobs in ci.yml. Add a header comment block
explaining the runner-side expectations and the Go version / build
matrix / upload-artifact context for anyone reading later.
Bumping CI to v2.5.0 surfaced two new gofumpt findings (in two test
files that gofumpt v2.1.6 considered fine). Local re-format with
the matching tool brings them in line.
Pre-commit hook config: prepend $GOPATH/bin to PATH inside the hook
entry so gofumpt + golangci-lint resolve when ~/go/bin isn't on the
operator's interactive shell PATH (common — go install puts them
there but PATH config varies). Without this, the hooks fail with
'Executable not found' even when the tools are installed.
Pin the Makefile setup target to v2.5.0 so a fresh clone gets the
same binary CI runs — keeps pre-commit and CI from drifting again.
The v2.1.6 release binary is built with Go 1.24, and golangci-lint
refuses to load a config targeting a newer toolchain than itself
('Go language version (go1.24) used to build golangci-lint is lower
than the targeted Go version (1.25.0)'). go.mod is on 1.25, so the
binary needs to be too.
Locally this didn't bite because 'go install …@v2.1.6' compiled
v2.1.6 against the local Go 1.25 toolchain; CI uses the prebuilt
release tarball which carries the build-time Go version.
v2.5.0 is the first v2.x line built with Go 1.25 — pin in lockstep
with go.mod going forward.
The repo had a .pre-commit-config.yaml entry for golangci-lint
already, but pinned to v1.61.0 — which doesn't grok the v2 schema
we just migrated to, so it would crash if anyone ever ran it. Hence
nobody did.
Replace the third-party hook blocks with local hooks that call
whatever tool is on the developer's PATH (gofumpt + go vet +
golangci-lint). That way the version of each tool tracks what the
developer would invoke by hand — no drift between hook config and
binary.
Add 'make setup' as a one-liner per-clone bootstrap:
* installs gofumpt + golangci-lint via go install if missing
* installs the pre-commit hooks via 'pre-commit install'
end-of-file-fixer auto-fixed two existing files (web/static/css/
styles.css and ask.md) — trailing newlines, harmless.
Cleanup pass over the repo so CI can enforce lint going forward
without the only-new-issues escape hatch:
* gofumpt -w across the tree (31 hits, all formatting)
* misspell --fix (25 hits, US-locale spelling) — but reverted on
api.JobCancelled = "cancelled" since that literal is the wire +
DB CHECK constraint value, plus matched the case in store/fleet.go
back to "cancelled" and added //nolint:misspell on both for the
next time someone reaches for the auto-fix
* Wrap every `defer rows.Close()` / `defer stmt.Close()` /
`defer res.Body.Close()` in `defer func() { _ = .Close() }()`
to satisfy errcheck without losing the close itself
* websocket.Dial callers (1 prod, 4 tests) now capture + close the
upgrade response Body — coder/websocket can return res with a nil
Body on success, so the test deferred-closes guard against that
* Annotate the two genuine-by-design nilerr cases with //nolint
comments explaining why nil-on-error is the contract (cookie
missing = no session; ctx cancelled mid-backoff = clean shutdown)
* Add brief godoc on the 10 exported const groups + types that
revive flagged (api.HostOS/HostArch/JobKind/JobStatus/LogStream/
ErrorCode, restic.EventKind, store.Role, web.FS)
* Drop the unused (*Server).userByID method
* Inline the unparam baseView(active) — every UI page is under
the dashboard primary nav today
Result: `golangci-lint run ./...` reports 0 issues. CI lint job
no longer needs only-new-issues: true; X-06 follow-up entry in
tasks.md removed.
The bump from golangci-lint-action@v6 → v7 (which downloads the v2.x
binary) was blocking CI lint with 'unsupported version of the
configuration: ""' because .golangci.yml was still in the v1 schema.
Migrate the config to v2:
* version: "2" prelude
* disable-all → default: none
* linters-settings → linters.settings
* gofumpt + goimports move into formatters.enable + formatters.settings
* exclude-rules move into linters.exclusions.rules
* gosimple drops (folded into staticcheck in v2)
Fix the four lint hits in the new P2R-02 code:
* host_bandwidth.go: convert hostBandwidthRequest directly to
hostBandwidthView via type conversion (S1016)
* ui_repo.go: drop unparam savedSection + status arguments from
renderRepoPage (always "" / always 422 — split GET render from
validation-fail render)
* ui_schedules.go: gofumpt formatting on the scheduleEditPage struct
Add only-new-issues: true to the lint job. The repo carries ~90
pre-existing findings (gofumpt drift × 31, misspell × 25, missing
godoc × 10, bodyclose × 6, errcheck × 12, …) accumulated before
lint was actually wired into CI. Without this gate, every PR would
fail on baseline noise instead of its own changes.
Track the cleanup as X-06 in tasks.md so the gate is temporary.
Update tasks.md: Phase 4 of the P2 redesign is done end-to-end.
Slice 1–5 wired the four host-detail tabs against the new
slim-schedule + source-group + repo-maintenance model; slice 6
ran a Playwright sweep against the live :8080 server (login,
walk every tab, create source group, create schedule, Run-now,
confirm a snapshot landed) — clean pass, no console errors.
Screenshots in _diag/p2r-02-sweep/.
Side-fix landed alongside slice 6: agent runner now drops
restic's noisy --json status events from log.stream (the
throttled job.progress envelope already covers them).
Phase 5 (server-side maintenance ticker — P2R-03..08) is next.
Replace the placeholder 'Open →' link with a per-host Run-now
decision computed server-side once per render:
* If the host has exactly one enabled schedule whose source-group
set covers every group on the host → primary 'Run all groups'
button (HX-POST to that schedule's /run endpoint, fires every
backup the host knows about in one click).
* Otherwise (zero matches, multiple matches, or any ambiguity) →
ghost 'Open →' link to /hosts/{id}/sources, where the operator
picks per-group from the source-group rows.
dashboardPage.Hosts moves from []store.Host to []dashboardHostRow
to carry the precomputed RunAllScheduleID; host_row.html now reads
.Host.* and .RunAllScheduleID. Two extra store calls per host on
dashboard render — fine at fleet sizes we care about; if we ever
need to support thousands of hosts we'll batch these queries.
restic --json emits a status frame ~every 16ms during a backup.
The runner was forwarding every line to log.stream verbatim, which
flooded the live log pane with duplicate status JSON for any
short-running backup (visible immediately on a 1000-file, ~4MB
test set: ~14 identical 'percent_done: 1' lines in 220ms).
The progress widget already covers the same information at a sane
sample rate (one per second via job.progress), so the raw status
lines in log.stream are double-bookkeeping. Skip them and forward
only non-status lines (file names, errors, summary).
Throttling logic for job.progress is unchanged.
Schedules tab Run-now used to silently HX-Redirect back to the
list, leaving the operator wondering whether the click registered.
Now:
* Single-source-group schedule → HX-Redirect to that one job's
live log, matching the per-source-group Run-now UX from Sources.
* Multi-group schedule → stay on the schedules list and fire a
success toast ("N backups dispatched: <group names>") via the
existing rm:toast HX-Trigger channel, so the operator sees clear
acknowledgement without losing their place.
dispatchBackupForGroup now returns the persisted job ID so the
caller can choose between job-log redirect and toast feedback;
on any internal failure it returns "" and the warning still
hits slog as before. The cron-fired path (dispatchScheduledJob)
ignores the return value, behaviour unchanged.
Three independent forms on /hosts/{id}/repo so saving one section
doesn't disturb the others:
* Connection: edits repo URL, username, password (pre-filled from
the redacted GET /api/hosts/{id}/repo-credentials view; password
field shows masked stored-creds placeholder; blank password = keep
existing). On save, encrypts and pushes config.update to a
connected agent.
* Bandwidth: host-wide upload/download caps (KB/s; blank = no cap)
written via store.SetHostBandwidth. New REST endpoint
PUT /api/hosts/{id}/bandwidth for JSON callers.
* Maintenance: forget/prune/check cadences + check subset %, with
per-row enabled toggles. Reuses cronParser for validation;
auto-seeds the row if a host pre-dates the migration.
Right-rail surfaces repo size, snapshot count, snapshots-by-tag
breakdown (counted from existing snapshot tag rows), and an
'untagged snapshots are left alone' note.
Danger-zone re-init button is rendered but disabled with a hint
pointing at P2R-09 (real implementation lands there).
Validation re-renders the page with the relevant form's banner and
all other section state intact. Successful saves redirect with a
?saved=<section> query param so the page surfaces a small ✓ saved
indicator on the relevant form.
ci.yml: bump golangci-lint-action v6→v7 (separate change picked up
in this commit).
Surface the Run-now button on every schedule when the host is online,
not just enabled ones. Disabled rows render the button as a non-primary
style + a HX-confirm dialog ("This schedule is paused — running it now
won't change that. Fire it once anyway?"); enabled rows keep the
zero-friction primary button.
Server-side, Run-now no longer short-circuits on !Enabled — it
dispatches the source groups inline rather than via dispatchScheduledJob
(which always bails on disabled schedules, since cron-tick semantics
are different from explicit operator intent). The audit-log entry
inside dispatchBackupForGroup still records every fire.
Aligns Sources and Schedules tab rows with the dashboard's row-click
UX: whole-row click navigates to the row's edit page (mirroring
.host-row.clickable). Drops the redundant Edit buttons; Run-now and
Delete remain in .row-action cells that sit above the row-link
overlay via z-index.
Schedule edit form's cron preset chips now carry human-readable
title= tooltips ("Every day at 03:00", "Every Sunday at 03:00", etc).
tasks.md gets a binding row-design rule covering all current and
future list-row templates, and the P2R-02 entry is split into the
six slices already agreed with the operator (slices 1–3 marked
done, 4 next).
Schedules list: status (enabled/paused) + cron + source-group tags +
actions (Run-now when enabled+online, Edit, Delete). Run-now reuses
dispatchScheduledJob — same path real cron fires take, so each
referenced source group runs as its own backup with its own tag.
Falls back to a 409 if the agent is offline.
Schedule new/edit form: cron input with five preset chips
(quick-pick @hourly / nightly / 6h / weekly / monthly), source-group
multi-pick rendered as styled checkbox cards (visual state tracks
the underlying box via a tiny inline script), enabled toggle. No
paths/excludes/retention/kind on the schedule itself — those live on
source groups now.
Server-side validation re-renders with the operator's input + ticked
groups intact. Every successful mutation calls pushScheduleSetAsync.
Adds .schd-row, .preset-chip, .picker styles.
Belt-and-braces: the UI now disables the Delete button when a group
is the only one on the host (with a tooltip explaining why), and the
server-side handler returns 409 if a curl/form-replay tries anyway.
Every host needs at least one source group to be backup-able, so the
'last group on a fresh host' case is a meaningful accident to guard
against.
Sources tab now lists every source group on the host with per-row
counts (used-by-N-schedules, snapshot count by tag), the v4
conflict tag (keep-* dimension that has no compatible cadence),
and Run-now / Edit / Delete actions. Run-now reuses the existing
HTMX-aware /hosts/{id}/source-groups/{gid}/run handler.
New /hosts/{id}/sources/new and /sources/{gid}/edit form: name +
includes/excludes textareas + the 3×2 keep-* retention grid +
retry-on-offline knobs. Server-side validation re-renders with the
operator's input intact; the inline conflict banner shows above the
retention grid when ConflictDimension is set.
Delete blocks (UI + server) when the group is referenced by any
schedule. Every successful mutation calls pushScheduleSetAsync so
an online agent re-arms within seconds.
Adds .src-row and .keep-cell to input.css for the row + retention
grid layout.
Extract header/vitals/sub-tabs into a host_chrome partial that every
host-detail tab page renders. Sources / Schedules / Repo go from
inert divs to real <a> links backed by stub pages that share the
chrome and a 'coming next' body — slices 2/3/4 fill them in.
Also re-establishes the version indicator (host_schedule_version vs
agent's applied_schedule_version) in the header.
Drops the legacy fat-schedule list/edit templates that referenced
fields removed by the P2 redesign (Manual / Paths / RetentionPolicy
on Schedule); the new templates land in slice 3.
- host_credentials_test.go's CreateEnrollmentToken fixture passed 1<<20
as the TTL (third arg, time.Duration) — that's ~1ms in nanoseconds.
Local non-race runs finished inside the window, but -race overhead
blew the deadline so the token was already expired by the time
GetEnrollmentTokenAttachments / ConsumeEnrollmentToken ran. Use
time.Hour instead, which matches the spirit of a per-test fixture.
- Lint pin v1.61.0 was built against Go 1.23 and refuses to load a
config targeting newer toolchains. go.mod is on 1.25, so the lint
step exited 3 ('the Go language version used to build golangci-lint
is lower than the targeted Go version'). Bumping to v2.1.6, which
supports Go 1.25.
Both failures showed up only on the Gitea runner because local make
target runs go test without -race and lint hadn't been re-run after
the go.mod toolchain bump.
Adds p2r01_ws_test.go covering the two paths the original commit's
in-process tests couldn't reach without a live conn:
- maybeAutoInit dispatches command.run(init) on first hello when creds
are bound, skips on second hello once a job row exists, and skips
entirely when the host has no creds.
- dispatchScheduledJob iterates a schedule's source groups and emits
one backup per group with the right Tag/Includes; persists job rows
with actor_kind=schedule + scheduled_id; no-ops on a disabled
schedule.
Drops RetentionPolicy from the per-group Run-now and schedule.fire
backup payloads — the agent's RunBackup ignores it (forget is the
only consumer). Adds Hub.Conn() so tests can grab the live *Conn
post-hello.
Schedules CRUD now takes {cron, enabled, source_group_ids[]} with cron
parsed via robfig/cron/v3 and group membership scoped to the host.
New source-groups CRUD lives at /api/hosts/{id}/source-groups; delete
refuses with 409 if any schedule still references the group, returning
the schedule list so the UI can prompt 'remove from these schedules
first.' Repo-maintenance GET/PUT manages forget/prune/check cadences
on host_repo_maintenance — no version bump, the server-side ticker
(P2R-06) drives execution.
Per-source-group Run-now (POST /hosts/{id}/source-groups/{gid}/run)
resolves the group's includes/excludes/retention/tag and dispatches a
backup command.run with the new structured CommandRunPayload fields
(Includes/Excludes/Tag). Old per-host /hosts/{id}/run-backup and
/hosts/{id}/init-repo return 410 Gone with a redirect message.
schedule_push.go is rebuilt: buildScheduleSetPayload assembles the
slim wire shape, pushScheduleSetOnConn ships it during the on-hello
window, pushScheduleSetAsync fires after every CRUD mutation, and
dispatchScheduledJob handles agent schedule.fire by iterating the
schedule's source groups and dispatching one backup per group with
actor_kind=schedule and scheduled_id pointing at the schedule.
Auto-init at first WS connect: when the host has repo creds bound and
no init job in its history, server dispatches restic init. Restic's
'config file already exists' soft-success means re-runs against an
existing repo no-op; we don't auto-retry on failure (operator triggers
re-init manually via the danger zone in P2R-09).
api.Schedule drops Kind/Paths/Excludes/Tags/RetentionPolicy/Manual etc.
in favour of {id, cron, enabled, source_groups: [...]}. The agent
scheduler stops checking sch.Manual; cmd/agent's backup dispatch reads
Includes/Excludes/Tag instead of Args.
Tests cover the new HTTP surface end-to-end: source-groups CRUD with
in-use refusal, schedule validation (bad cron / missing groups /
foreign group), repo-maintenance auto-seed and validation, the 410
route, and buildScheduleSetPayload's wire-shape correctness. Full
suite passes; smoke env exercises auto-init dispatch on hello,
async push after schedule create, and per-source-group Run-now
landing the right paths/excludes/tag at the agent.
The store rewrite in e7eea7a left tasks.md describing a data shape
(fat schedules, host.repo_initialised_at, manual flag) that no longer
exists, and left the host-detail templates rendering against fields
the store no longer exposes. This commit reconciles both.
tasks.md
* Mid-phase pivot called out at the top of Phase 2 with commit hashes.
* P2-01..P2-05 kept as done but stamped ⚠️ "shipped against old shape
— to re-validate under P2R-02".
* P2-04.5 (manual flag) struck as superseded.
* New P2R-NN section covering work that previously lived only in
commit messages and code stubs:
P2R-00.1/00.2/00.3/00.4 — phases already shipped (this commit
records 00.4)
P2R-01 — REST + WS rewire against slim schedules + source groups
+ repo maintenance + auto-init
P2R-02 — UI rewire against the v4 wireframes
P2R-03..05 — prune / check / unlock command surfaces
P2R-06 — server-side maintenance ticker (cadence-driven)
P2R-07 — repo stats panel
P2R-08 — pending_runs queue worker
P2R-09 — auto-init UX polish
P2R-10..12 — pre/post hooks rehomed from schedule onto source group
P2R-13..14 — bandwidth + next/last-run surface
* P2-16/17/18 (Windows + announce-and-approve) untouched.
* Phase 2 acceptance criteria rewritten against the new model.
UI patch-up (P2R-00.4)
* host_detail.html + host_row.html: removed every $host.RepoInitialisedAt
reference (column dropped in migration 0008 — render was 500'ing).
* Removed manual init-repo branches; the auto-init path replaces them.
* Schedules sub-tab demoted from active link to inert div until P2R-02
rebuilds the page (it was linking to a raw 501 from the stubbed
ui_schedules.go handlers).
* Disabled the four per-host Run-now buttons (dashboard row + host
detail header + empty-snapshots state + right-rail) with a
"lands in P2 Phase 4" hint — handler is 501-stubbed pending P2R-01,
so leaving them clickable produced silent failures over htmx.
* Dashboard row-action becomes "Open →" instead of Run-now.
Project tooling
* .mcp.json at repo root: project-scoped Playwright MCP override.
Forces --headless (so I don't pop a browser at the operator) and
--output-dir _diag (so screenshots / traces land in the gitignored
_diag/ directory rather than scattered at the repo root).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Go-side data model rebuilt against migration 0008. The fat-Schedule
shape (paths/excludes/tags/retention/manual/kind/options/hooks) is
gone; that surface lives on source_groups now.
* store/types.go
- Schedule slimmed to {id, host_id, cron, enabled, source_group_ids,
timestamps}. SourceGroupIDs populated by Get/List, accepted on
Create/Update so callers pass desired junction state in one shape.
- SourceGroup added: name (= snapshot tag), includes/excludes,
retention_policy, retry_max + retry_backoff_seconds, cached
conflict_dimension.
- HostRepoMaintenance added: forget/prune/check cadences + enabled.
- PendingRun added: offline-retry queue.
- Host loses RepoInitialisedAt; gains BandwidthUpKBps + BandwidthDownKBps.
- RetentionPolicy moves home from "schedule field" to "source group
field" but the type itself + Summary() method unchanged.
* store/sources.go (new) — CRUD + GetByName + ConflictDimension cache.
Group writes bump host_schedule_version; conflict cache writes don't
(server-internal projection, agent doesn't see it).
* store/maintenance.go (new) — CreateDefault is idempotent (INSERT OR
IGNORE). UpdateRepoMaintenance doesn't bump schedule version because
these run on the server's own ticker, not the agent's local cron.
* store/pending.go (new) — Enqueue / DueRunsForRetry / Bump / Delete.
* store/schedules.go — rewritten for slim shape + junction CRUD.
Update wipes the schedule_source_groups junction wholesale and
re-inserts (simpler than diffing). Adds SchedulesUsingGroup for
retention-conflict detection + UI labels.
* store/hosts.go — drops repo_initialised_at scan, adds bandwidth scan.
New SetHostBandwidth helper.
* HTTP layer — temporarily stubbed during this rewrite (501 returns
with redesign_in_progress error code). Phase 3 fills these in
against the new shape:
- schedules.go REST CRUD
- schedule_push.go agent reconciliation
- ui_schedules.go HTML form CRUD
Run-now-per-host + Init-repo handlers in ui_handlers.go also stubbed
— both go away in the new model (Run-now per source group; auto-init
at host enrolment).
* enrollment.go — replaces "seed manual schedule from typed paths"
with "seed default source group + repo-maintenance row." The default
group gets the typed paths as its includes; operator edits later
via Sources tab.
* ws/handler.go — drops the MarkHostRepoInitialised projection (column
is gone; auto-init makes it derivable from latest init job's status).
Tests:
* store: existing schedule test rewritten for slim shape + junction;
new sources_test.go covers source-group CRUD, name uniqueness,
conflict cache, repo-maintenance defaults + idempotent seed,
pending-runs queue lifecycle.
* http: schedules_test.go and schedule_push_test.go deleted — both
exercised the obsolete fat-schedule API. Phase 3 rewrites them
against the new endpoints.
go test ./... green. cmd/server + cmd/agent build. The UI is broken
end-to-end (schedules / sources / repo tabs all hit 501 stubs); Phase 3
restores REST + on-the-wire reconciliation; Phase 4 rewires the UI
templates against the new model.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hi-fi mock of the four pages affected by the redesign:
* /hosts/{id}/sources — list of source groups with per-row meta
line (includes/excludes count, retention summary, usage,
snapshot count) and Run-now / Edit / Delete actions. Tweaks
toggle flips between fresh-host (default empty group, Run-now
+ Delete disabled) and multi-group states.
* /hosts/{id}/sources/{gid}/edit — name (snapshot tag), includes/
excludes textareas, retention as a 3×2 grid of keep-* cells,
retry-on-offline, inline conflict banner above retention when
granularity↔cadence mismatch detected.
* /hosts/{id}/schedules — slim list (status / cron / source-tags
/ actions) plus new-schedule form (cron with quick-pick chips,
source-group multi-select via clickable check pickers, enabled
toggle).
* /hosts/{id}/repo — connection (URL/user/password/cert pin),
bandwidth caps, maintenance rows (forget daily / prune weekly /
check monthly with 5% subset), danger zone re-init.
Footer carries the retention-conflict detection spec (granularity
vs cadence mismatch). Visual language matches v1: --accent cyan,
JetBrains Mono for IDs/cron, btn tokens, sub-tab nav, hairline
panels.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Schema rebuild for the model collapse described in
design/v4-sources-redesign.html. Three nouns now stand on their
own:
* schedules — slim. Only cron + enabled + host_id. Fat-schedule
shape (paths/excludes/tags/retention/manual/kind/options/hooks)
is dropped wholesale. Schedule data wiped — by design (smoke env
was nuked before this ran; fresh installs have nothing to lose).
* source_groups — name + includes + excludes + retention_policy +
retry policy + cached conflict_dimension. Group name doubles as
the snapshot tag so retention can target it cleanly. UNIQUE
(host_id, name) enforces tag unambiguity.
* schedule_source_groups — N:M junction. One schedule can fire N
groups per tick; one group can be referenced by N schedules.
* host_repo_maintenance — 1:1 with hosts. Default cadences:
forget daily 03:00, prune weekly Sun 04:00, check monthly 1st
05:00 with --read-data-subset 5%. Operator can edit on Repo tab.
* pending_runs — offline-retry queue. Server-side ticker dispatches
due rows; bounded by source_groups.retry_max + retry_backoff_seconds.
Plus:
* hosts.bandwidth_up_kbps / .bandwidth_down_kbps — host-wide caps.
* hosts.repo_initialised_at — DROPPED. Auto-init on enrol makes
it derivable from the latest init job; the Init-repo button goes
too (failure surfaces via job history banner).
Note on FK safety: smoke env was wiped before migration ran, so
DROP TABLE schedules cascades to nothing. Fresh installs apply
0001-0007 then immediately 0008 — same story (no schedule rows
to lose). For an upgrade path on a populated DB, this migration
would need a data-preserving variant; not needed today.
Tests fail to compile/run after this — expected. The Go side
(store types, CRUD, REST handlers, agent runner, UI templates)
gets rebuilt in subsequent phases. tasks.md will track P2 redesign
progress.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end forget plumbing — operator can create a forget schedule
with keep-* values, agent runs restic forget --keep-* … on the
schedule's cron (or via per-row Run-now), snapshot list shrinks,
UI updates.
* api.CommandRunPayload gains retention_policy json.RawMessage so
the agent doesn't need a typed copy of the server-side struct.
* restic.ForgetPolicy mirrors restic's --keep-* flags. Empty()
reports zero dimensions; restic wrapper RunForget refuses to
run an empty policy (would delete every snapshot). Does NOT
pass --prune — pruning lives behind a separate admin-only
credential (P2-06); forget just rewrites the snapshot index.
* runner.RunForget mirrors RunBackup's envelope shape so the
live log viewer works without special-casing. On success
triggers reportSnapshots (forget shrinks the index, the host's
snapshot count almost certainly changed).
* cmd/agent dispatcher handles MsgCommandRun with kind=forget,
decodes RetentionPolicy from the wire, builds restic.ForgetPolicy.
* Server dispatchScheduleNow marshals the schedule's
RetentionPolicy into the wire payload for kind=forget jobs.
Refuses to dispatch a forget schedule with empty retention.
* validateSchedule rejects kind=forget without at least one keep-*
dimension (new error code: missing_retention).
* UI schedule edit form gains a Kind dropdown (backup or forget;
immutable on edit). Paths block toggles by kind via inline
data-kind attributes. Form help-text explains the prune
separation.
Other kinds (prune, check, unlock) deferred to P2-06..08; the
Kind dropdown only offers backup and forget today.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three sites:
* Schedules list per-row Run-now / Edit / Delete column was 1fr
next to a 1.3fr retention column — too narrow for the three
buttons. Pin the action column to 240px and add
whitespace-nowrap to each button so the layout can't squeeze
them onto two lines regardless.
* Dashboard host_row Run-now button got whitespace-nowrap +
for the same reason inside the 92px action column.
* Host detail header "Run backup now" — the words so the
button never breaks across lines if the header gets crowded.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Agent runs as root (HOME=/root from systemd) with ProtectHome=
read-only, so restic's `mkdir /root/.cache/restic` fails on the
first call. Backups still completed (restic falls back to no-cache)
but every job log started with a noisy red "unable to open cache"
warning.
Default to /var/lib/restic-manager unconditionally — that's already
in the unit's ReadWritePaths and survives ProtectHome. ExtraEnv
overrides still win for tests / unusual setups.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Allow-list filter @system-service excludes some syscalls Go's
runtime + restic's file scanner reach for; init job died
immediately with "bad system call (core dumped)". CapabilityBounding
already constrains what root can do; the Protect*/Restrict* toggles
still cover network / kernel / mount / namespace. Net effect on the
threat model is negligible vs the operational cost.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three rules to date:
* After every make build, restage the agent binary + install
assets into /tmp/rm-smoke/data/ and replace the running agent
on this dev box. Plain `make build` doesn't reach either, and
forgetting has bitten the smoke env twice today (stale agent
without mergeRestCreds; stale unit without User=root).
* Migrations: prefer ALTER TABLE DROP/RENAME COLUMN (SQLite
3.35+) over the rebuild dance. With foreign_keys=ON in the DSN,
DROP TABLE on a parent with ON DELETE CASCADE children wipes
every dependent table — and PRAGMA foreign_keys=OFF inside a
migration is a no-op (PRAGMA can only change outside a tx).
* Don't slog restic's merged URL. The user:pass@-embedded form
exists only inside envSlice() at exec time; if any URL needs
to be operator-visible, route it through restic.RedactURL.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-running restic init on a repo that's already initialised exits
non-zero with "Fatal: ... config file already exists". Semantically
that's a no-op, not a failure — the repo IS initialised, the
caller's intent is satisfied. Sniff stderr for the magic string
and swallow the exit code in that case, emitting an event line
so the operator-facing log says what happened.
Caught while smoke-testing P2-04.5: I'd init'd the repo manually
during a debug session, then the operator clicking the UI's
Init-repo button would hit this and the host's repo_initialised_at
would never flip.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pending page suppressed the htpasswd snippet when repo_username
was blank — but with --private-repos the username is required for
auth, and operators routinely leave the field blank assuming the
system will pick something sensible.
* handleUIAddHostPost defaults repo_username to the typed hostname
when blank. Matches what --private-repos expects (URL path
segment == username).
* pending_host.html: snippet now renders whenever a password is
present (always true after the generate-on-blank logic landed
earlier).
* Form help-text updated to describe the default explicitly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two issues from a smoke session:
1. The awaiting-agent panel never refreshed — operator had to go
back to the dashboard to see the host had connected.
2. Generated passwords were displayed only on the POST response.
Navigating away (or even an accidental tab close) lost them
permanently, so the operator couldn't update the rest-server's
htpasswd.
Both are the same fix: convert the POST-rendered transient
"result state" into a durable GET page at /hosts/pending/{token}.
* New route GET /hosts/pending/{token} renders the install-command +
htpasswd snippet view. Password is decrypted from the (still-
encrypted-at-rest) token row on every render — operator can
refresh, bookmark, navigate away and come back. Once the agent
enrols, the page redirects to /hosts/{id}; once the token
expires, redirect to /hosts/new.
* New route GET /hosts/pending/{token}/awaiting returns a polled
HTML fragment that the pending page swaps in every 2s via HTMX.
States: awaiting (keep polling) | connected (show "Open host →"
+ "View schedules" CTAs, polling stops) | expired (mint-new
link, polling stops). Polling stops naturally because only the
awaiting state's wrapper carries the hx-trigger attribute.
* POST /hosts/new now 303-redirects to /hosts/pending/{token}
on success; validation errors keep re-rendering the form with
banner.
Supporting changes:
* New store helper Store.GetEnrollmentTokenStatus(tokenHash) for
the polling endpoint — returns {expires_at, consumed_at,
consumed_host} in one round-trip without dragging in the
attachments-decryption path.
* New ui.Renderer.RenderPartial(w, name, data) for HTMX fragment
responses (no layout wrap). Picks an arbitrary page's template
set as the lookup point — every page parses the full common-
paths list, so they all see every partial.
* add_host.html stripped to form-only; pending_host.html owns the
result-state UI; awaiting_agent.html is the polled partial.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two independent path lists for "what does this host back up?" was
a real divergence footgun — operator types one set at Add-host time
and a different set into a schedule, both end up in the same repo,
the snapshot history looks fine until restore. Resolution: drop
host.default_paths entirely; add a `manual` flag on schedules.
A manual schedule has paths/excludes/tags/retention like any other
but no cron — it fires only via per-schedule Run-now. Single source
of truth for what gets backed up.
Schema (migration 0007):
* schedules.manual INTEGER NOT NULL DEFAULT 0.
* For every host with non-empty default_paths, seed a manual
schedule with those paths and bump host_schedule_version.
* ALTER TABLE hosts DROP COLUMN default_paths.
* ALTER TABLE enrollment_tokens RENAME COLUMN default_paths
TO initial_paths.
Original draft of this migration rebuilt hosts via the
create-new + drop-old + rename-new pattern. With foreign_keys=ON
(set in the connection DSN), DROP TABLE on the parent fired
ON DELETE CASCADE on every child of hosts(id) — schedules /
jobs / snapshots / host_credentials all wiped on the smoke env
when I tried it. SQLite 3.35+ supports column-level ALTERs
directly, so we skip the rebuild dance and avoid the cascade
trap. Six lines of SQL instead of sixty, no FK risk.
Run-now rewiring:
* New `dispatchScheduleNow(hostID, scheduleID, conn?)` helper
unifies the agent-driven path (cron fire → schedule.fire →
OnScheduleFire callback) and the UI-driven path (operator
clicks Run-now on a schedule row). Conn arg is optional; nil
falls back to Hub.Send.
* New POST /hosts/{id}/schedules/{sid}/run endpoint — per-row
Run-now button on the schedules list.
* Dashboard's per-host Run-now (handleUIRunBackup) now picks the
host's only enabled manual schedule, falls back to the only
enabled schedule, else returns "pick one in Schedules tab".
Keeps one-click for the common case.
Agent:
* Scheduler skips manual schedules in cron build (silent — they're
a normal data shape, not an error).
* Wire Schedule struct gains Manual flag.
* Schedule.fire flow unchanged — the agent only ever fires
non-manual schedules anyway.
UI:
* Add-host form retitled "Initial schedule · manual" so the
operator knows the paths become an editable schedule under
the Schedules tab. Result page calls out the manual schedule
+ points at Host > Schedules.
* Schedule edit form: "Manual schedule" checkbox at the top of
the When section; toggling it hides/shows the cron field via
inline JS. Server-side validator skips the cron requirement
when manual=true.
* Schedule list shows a "manual" tag under the status pill and
renders the When column as "— run-now only —" for manual rows.
Each row gets a Run-now button when the schedule is enabled
and the host is online.
Tests + go test ./... green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the schedule foundations slice — operator can now drive the
plumbing P2-01..03 landed without touching the JSON API.
* New routes:
- GET /hosts/{id}/schedules (list)
- GET /hosts/{id}/schedules/new (create form)
- POST /hosts/{id}/schedules/new (create)
- GET /hosts/{id}/schedules/{sid}/edit (edit form)
- POST /hosts/{id}/schedules/{sid}/edit (update)
- POST /hosts/{id}/schedules/{sid}/delete (delete, confirm-then-redirect)
* List view (web/templates/pages/schedules_list.html):
status, cron, paths, retention summary, tags, edit/delete buttons.
Header shows "version N · agent in sync" or "agent at vM" when the
push hasn't been ack'd yet — backed by host_schedule_version +
applied_schedule_version. Empty-state CTA points at /schedules/new.
* Create/edit form (web/templates/pages/schedule_edit.html, shared):
cron expression with five quick-pick presets (daily 3am / every 6h
/ @hourly / weekly Sun / monthly 1st), paths textarea (one per
line), excludes textarea, tags (comma-separated), retention as six
numeric fields (mirrors restic's --keep-* flags one-for-one),
bandwidth caps, enabled toggle. Side panel explains the
reconciliation flow so the operator knows what saving actually
does. Validation errors re-render with operator's input intact.
* internal/server/http/ui_schedules.go owns the handlers; reuses
the same validateSchedule + pushScheduleSetAsync used by the JSON
API path. Each save audit-logs schedule.created / schedule.updated
/ schedule.deleted (matching the JSON API actions).
* store.RetentionPolicy gains a Summary() method ("last=7, d=14,
w=4" or "—"). Used by the list view's table cell so templates
don't have to do any conditional retention rendering.
* Two new template helpers: list (string varargs → []string, used
for the cron preset row) and joinComma (sibling to joinDot for
the rare list that wants commas). RetentionPolicy.Summary covers
the schedule-list case but the helpers are general.
* host_detail.html secondary tabs row converted from inert <div>s
into <a> links. Snapshots active by default; Schedules now points
at the new page. Jobs/Repo/Settings remain inert until their
P2 owners ship.
Hooks UI deferred to P2-15 (lands with the hook execution path).
Single-kind UI (backup only) by design — other kinds get a UI when
their job dispatch lands in P2-05..08.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the schedule reconciliation loop end-to-end.
* New `internal/agent/scheduler` package wraps robfig/cron/v3 with
the lifecycle the agent needs:
- Apply(ScheduleSetPayload, Sender) stops the prior cron (waiting
for in-flight entries to return), rebuilds from scratch, starts,
and emits schedule.ack with the version we just applied.
- Disabled entries skipped silently; bad cron exprs (which
shouldn't reach us — the server validates — but defensive)
log a warn and skip.
- On each cron tick the entry sends a new schedule.fire envelope
to the server with {schedule_id, scheduled_at}. The scheduler
itself never builds CommandRunPayloads — server is the source
of truth for jobs.
- tx is swapped on every Apply, so reconnect is handled
naturally: cron entries that fire against a dropped tx log
"no active connection" and skip the tick.
- Stop() is idempotent and waits for the cron's in-flight
workers via cron.Stop().Done().
* New wire message api.MsgScheduleFire + api.ScheduleFirePayload
for the agent → server "I just fired locally" RPC.
* Server-side dispatch (schedule_push.go: dispatchScheduledJob):
looks up the schedule by id, validates ownership + that it's
enabled, builds args from kind (paths for backup; other kinds
are still arg-less in Phase 2 and grow as those job kinds land
in P2-05..08), persists a jobs row with actor_kind=schedule +
scheduled_id, and writes command.run back on the same conn so
the agent runs through its existing dispatch path.
* store.CreateJob now writes scheduled_id. This column was in the
schema since 0001 but never populated — the original P1 path
only had operator-driven jobs, so actor_kind was always 'user'
and scheduled_id was always nil.
* cmd/agent/main.go integration: dispatcher gains a
*scheduler.Scheduler; the MsgScheduleSet case now hands the
payload to scheduler.Apply (in a goroutine so the WS read loop
keeps draining other messages).
* WS dispatcher gains OnScheduleFire alongside OnScheduleAck.
* Tests:
- scheduler unit tests (4): ack-on-apply, cron tick fires
schedule.fire envelope, disabled entries don't fire, replace-
prior-state stops the old cron.
- Server-side end-to-end: schedule.fire → command.run with the
right job_id / kind / args, plus jobs row with actor_kind=
"schedule" and scheduled_id linking back to the schedule.
Persistence of next-fire times across agent restarts is
deliberately deferred. A missed fire window during downtime
simply fires once on reconnect — that's the desirable behaviour
(the operator wants the missed backup to run, not be silently
skipped because we lost track of when it was due).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Server is now the source of truth for the agent's cron set.
* Helpers in schedule_push.go:
- loadScheduleSetPayload reads the host's schedules + canonical
version into the wire shape.
- pushScheduleSetOnConn writes directly to a just-handshaken conn
(avoids racing against Hub.Register on a brand-new connection).
- pushScheduleSetAsync is the post-CRUD flavour — no-op when the
host is offline (the next reconnect's on-hello path catches it
up, so a missed push is non-fatal).
- applyScheduleAck records what version the agent has confirmed.
* onAgentHello restructured: was returning early when the host had
no repo credentials, which made the schedule push unreachable for
fresh hosts. Split into pushRepoCredsOnHello (silent no-op on
ErrNotFound) + pushScheduleSetOnConn (always runs). Empty schedule
list is a valid push: tells the agent to drop stale cron entries.
* WS dispatcher gains an OnScheduleAck hook on HandlerDeps; the
http server wires it to applyScheduleAck. MsgScheduleAck moves
out of the "TODO(P2)" group into a real case that decodes the
payload and forwards to the callback.
* Schedule CRUD handlers each fire pushScheduleSetAsync after the
audit-log write so the agent picks up changes within seconds.
Tests cover:
- On-hello push of an already-created schedule, agent acks,
applied_schedule_version flips on the host row.
- Connect-then-CRUD: empty initial push (version 0), then a
follow-on push at version 1 after the operator creates a
schedule via REST.
Agent-side `schedule.set` handler (parse, replace local cron,
emit `schedule.ack`) is the remainder of P2-02 and lands with
P2-03's local scheduler.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `schedules` table was already laid down in migration 0001; this
slice adds the Go-side data model, store CRUD with atomic version
bumps, and REST endpoints.
* `store.Schedule` + `RetentionPolicy` + `ScheduleOptions` typed
views (the wire form on the agent side keeps retention/options
as raw JSON since the agent just forwards them to restic).
* Store CRUD: CreateSchedule / GetSchedule / ListSchedulesByHost /
UpdateSchedule / DeleteSchedule. Each mutation bumps
`host_schedule_version` atomically in the same tx via UPSERT on
`host_schedule_version`. SetHostAppliedScheduleVersion records
what the agent has confirmed via schedule.ack (P2-02 will use it).
* REST endpoints under /api/hosts/{id}/schedules + /{sid}:
GET (list, with the version envelope so callers can detect
drift), POST (create), PUT (update — kind is immutable), DELETE.
* Validation: cron expressions parse via robfig/cron/v3 (same
parser the agent will use, so anything that validates here will
fire there); kind ∈ {backup, forget, prune, check} (init/unlock
are operator-only one-shot kinds, not schedulable); backup
schedules require ≥1 path; hooks rejected on non-backup kinds
(spec §14.3).
* All mutations audit-logged.
* Tests: store-level CRUD + version-bump invariants; REST happy
path (create→list→update→delete with version progression); REST
validation table covers each rejection code.
newTestServerWithHub now sets BootstrapToken so the schedules
handler tests can use the existing login flow without a parallel
test-server constructor.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cohesive batch from a smoke-test session against a real rest-server.
Themed bullets:
* Agent runs as root, sandboxed via systemd. CapabilityBoundingSet
drops to CAP_DAC_READ_SEARCH + restore caps; ProtectSystem=strict
with ReadWritePaths confined to /etc + /var/lib/restic-manager;
NoNewPrivileges blocks escalation. Install script no longer
creates a service user. spec.md §4.2 / §14.1 / §14.3 explain the
rationale (matches UrBackup / Veeam / Bareos defaults; trying to
back up "everything" as an unprivileged user creates silent skips
on /home, /root, /var/lib/* with no upside vs the threat model
the agent already implies).
* Init-repo end-to-end. New JobKind="init" wired through agent
runner, restic.Env.RunInit, server dispatcher, and a UI button
(red "Initialise repo" in the run-now panel). hosts.repo_initialised_at
flips on init success, on backup success, or on a non-empty
snapshots.report. The "Run now" / "Init" / "Retry" branching now
drives both the dashboard host row and the host-detail panel.
Migrations 0004 (column), 0005 (jobs.kind CHECK widened — using
the safe create-new-then-rename pattern; first version corrupted
job_logs.job_id FK), 0006 (cleans up job_logs FK on already-
affected DBs).
* rest-server creds embedded at exec time only. restic.Env gains
RepoUsername; mergeRestCreds() builds the user:pass@-prefixed URL
inside envSlice() and never assigns it back to the struct, so
nothing slog-able ever sees the cleartext form. RedactURL helper
for any future surface that needs to log a URL safely. Both
helpers tested.
* Add-host UX. Repo password is now optional — server mints a
24-byte URL-safe random one and surfaces it once, alongside an
htpasswd snippet ("echo PASS | htpasswd -B -i ... USERNAME") so
the operator pastes one command on the rest-server host and one
on the endpoint. Result page also links the install snippet at
/install/install.sh (was /install.sh — 404'd before) and pipes
to bash (not sh — script uses set -o pipefail and other
bashisms; on Debian/Ubuntu sh is dash).
* Late-subscriber race in JobHub. A fast-failing job could finish
(DB write + Broadcast) before the browser's HX-Redirect → page
load → WS-connect path completed, so the JS sat forever waiting
on a job.finished that already passed. JobHub split into
Register + Send + Run; handleJobStream now subscribes first,
re-fetches the job, and sends a synthetic job.finished if the
state is already terminal.
* HTMX error visibility. New toast partial listens to
htmx:responseError and surfaces the response body as a
bottom-right toast — every server-side validation error now
becomes visible without per-handler JS wiring. Also handles
custom rm:toast events for future server-pushed notifications
via the HX-Trigger header. Themed via existing CSS vars.
* Dashboard rows are now whole-row clickable to host detail
(CSS card-link pattern: absolute-positioned anchor + .row-action
z-index restoration so the action button stays clickable).
"View →" on a running job links to /jobs/<id> rather than
/hosts/<id> since the row click already covers the host page.
* "Run first" / "Run first backup" → "Run now" everywhere for
consistency.
* runbook (docs/e2e-smoke.md) updated — live-log streaming step
now reflects P1-26; mentions the browser-driven Run-now flow.
* _diag/dump-creds — moved out of cmd/ so go build doesn't pick
it up; .gitignore now excludes /_diag/ entirely.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes that close the loop on dashboard run-now and harden the
agent's restic invocation.
Default paths (interim until P2-01 schedules):
- 0003 migration adds default_paths TEXT NOT NULL DEFAULT '[]'
to hosts and to enrollment_tokens.
- Operator types paths in the Add-host form (textarea, one per
line). They ride on the enrol_token row alongside the
encrypted creds (paths aren't secret — plain JSON column).
- On consume, ConsumeEnrollmentToken still just burns the token;
the new GetEnrollmentTokenAttachments returns both the
re-bindable creds and the path list in one round trip, the
handler transfers them onto the new host row inside CreateHost.
- The dashboard's Run-now and host-detail's "Run backup now"
button now read Host.DefaultPaths and pass them to dispatchJob.
A host with no default paths returns 400 with a friendly
"no paths set" message instead of dispatching a doomed
`restic backup` with no positional args.
- Doc comments explicitly call this out as a Phase 1 interim —
schedules supersede.
Restic env hygiene:
- envSlice() previously omitted HOME / XDG_CACHE_HOME, which
bit the smoke runs whenever the agent was launched outside
systemd (restic refused to start: "neither $XDG_CACHE_HOME
nor $HOME are defined"). Now both are set explicitly: prefer
Env.ExtraEnv overrides, fall back to the agent process's own
HOME, and finally to /var/lib/restic-manager.
- Comment makes the env policy explicit: parent's RESTIC_* /
AWS_* / B2_* env is filtered out by design — control-plane
is the unambiguous source of truth.
JS bug fix in the live log page:
- {{$job.ID | printf "%q"}} produced a literal-quoted JS string,
which then went into the WS URL as ".../jobs/"<ID>"/stream"
→ 404. Switched to '{{$job.ID}}' inside the literal so
html/template's auto-escape does the right thing. Verified
end-to-end: dashboard "Run now" → live progress + log lines
arrive over the WS → succeeded pill renders.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the P1-21 remainder.
internal/server/ws/jobhub.go — new JobHub. Per-job_id set of
subscribers; each gets a 64-deep buffered channel with a writer
goroutine. Broadcast is non-blocking: if a subscriber is slow,
its channel fills and messages are dropped for that subscriber
only — the agent's read loop is never blocked by a stuck browser.
The agent dispatchAgentMessage path mirrors job.started /
job.progress / log.stream / job.finished envelopes onto the hub
in addition to its existing persistence work. The wire shape is
the same end-to-end, so client-side JS switches on env.type the
same way Go code does.
GET /api/jobs/{id}/stream is the browser endpoint. Auth via
session cookie (HTTP layer); upgrade; subscribe; pump until
context closes.
GET /jobs/{id} renders the live log page. Three states (queued/
running/succeeded/failed) drive the header pill, the progress
bar block, the failure summary panel, and the action button
(Cancel job while running, Back to host afterwards). Already-
persisted log lines are server-rendered on initial load; new
lines arrive over the WS and append to #log-stream. Auto-scrolls
unless the user scrolls up (a "⇢ Follow" pill re-attaches).
On job.finished the page reloads after 600ms to pick up the
final-state header rendered server-side.
POST /hosts/{id}/run-backup now sets HX-Redirect → /jobs/{job_id}
on success so HTMX lands the operator straight on the live log.
For non-HTMX callers (curl / plain form post) it 303s to the
same target.
store.ListJobLogs returns persisted log lines for initial render
on page load.
Browser-verified end-to-end: enrol → run a real backup against a
sibling restic/rest-server → live progress + 11 log lines stream
in → succeeded pill + final stats land after page reload.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GET /hosts/{id} renders the v1 host detail layout:
- persistent header: status dot (pulse if a job is in flight),
monospace name, tags, plus a metadata strip (os/arch, agent
version, restic version, "last seen Xs ago" or "online · last
heartbeat …").
- vitals strip: four tiles for last backup (status + relative
time), repo size, snapshot count, open alerts.
- sub-tabs: Snapshots is active; Jobs / Repo / Settings are
visible but inert until P2.
- snapshot table: short id, time (absolute), paths joined with
" · ", size, file count, restore button (disabled — wires up
in P3).
- right rail: run-now stack (backup live, forget/prune/check/
unlock disabled with the Phase tag), danger-zone remove panel
(also disabled for now).
Empty state: when a host has no snapshots yet, the table replaces
itself with a "no snapshots yet" prompt that includes the run-now
button (provided the agent is online).
Pagination cap of 50 most-recent snapshots; full pagination lands
when fleet sizes demand it.
Template helpers grew: comma() now accepts int / int32 / int64 so
templates don't fight Go's type inference; joinDot() concatenates
a []string with " · "; absTime() formats time.Time as
YYYY-MM-DD HH:MM:SS; the existing relTime() already accepts T or
*T after P1-27.
Browser-verified end-to-end with seeded fixture data.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GET /hosts/new renders the focused two-column form (hostname,
tags, repo URL/username/password). POST /hosts/new validates,
mints a one-time token via the new mintEnrollmentToken helper —
shared with the existing JSON /api/enrollment-tokens endpoint —
and re-renders the same page in result state showing:
- the install command with RM_SERVER + RM_TOKEN filled in (and
an inline copy-to-clipboard button),
- an "awaiting agent connection" panel with the hostname
pre-filled,
- a troubleshooting list pointing at the most common reasons
the agent doesn't appear,
- back-to-dashboard / add-another-host links.
publicURL() resolves RM_BASE_URL first, falling back to scheme +
Host on the inbound request — useful for local smoke without a
proxy.
Browser-verified end-to-end: form submit → token minted → install
command renders with the right values from the form input.
template fn formatRelTime now accepts time.Time *or* *time.Time
so templates can pass either without fighting Go's lack of an
address-of operator.
Deferred: download-preconfigured-installer (a templated .sh with
the values baked in) — copy-paste covers v1; nice-to-have later.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Server-rendered HTML view backed by:
- new store.FleetSummary aggregating host counts + repo bytes +
snapshot total + open alerts + last-24h job rollup in two queries.
- GET /api/hosts (JSON list of hosts in the dashboard projection).
- GET /api/fleet/summary (JSON aggregate, same shape as above).
The HTML page (web/templates/pages/dashboard.html) renders the four
summary tiles + host table directly from store data — no separate
fetch. Per-row state colour comes from .host-row.{degraded,failed,
offline} which paint a 3px left edge so problem hosts are scannable
without reading. HTMX is loaded into the base layout so per-row
"Run now" buttons can hx-post to /hosts/{id}/run-backup, a thin
HTML wrapper that funnels into a new dispatchJob helper shared
with the JSON /api/hosts/{id}/jobs endpoint.
Empty state (zero hosts) collapses to the "no hosts yet" prompt
with the + Add host CTA — matches the v1 mockup.
Template helpers (internal/server/ui/funcs.go) added for byte
formatting (412 GB / 3.7 TB), relative time (3m ago / 2d ago), and
comma grouping (1,847). Pure Go, no template-magic dependency.
Browser-verified end-to-end with seeded fixture data: five hosts
across all four states render with correct dots, accents, last-
backup pills, sizes, snapshot counts, alerts, tags, and the right
action button (Run now / Retry / Run first / View → / offline).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P1-28: Tailwind standalone CLI wired into the Makefile. `make tailwind`
downloads the pinned v3.4.17 binary into bin/tailwindcss (gitignored),
builds web/styles/input.css → web/static/css/styles.css. `make build`
now runs the CSS pass first; `make tailwind-watch` for dev. Output is
embedded in the binary via web.FS — single static binary, no Node.
The CSS source carries every component class the v1 mockups defined
(status dots, buttons, host row, log viewer, progress bar, fields,
chips, snippet panel, empty state) so screens that land later can
just reach for them.
P1-23: html/template tree at web/templates with two layouts (base
with chrome, chromeless for login + bootstrap), one nav partial, and
two pages (dashboard placeholder, login). internal/server/ui parses
the tree at startup; ui_handlers.go in the http package wires:
GET / dashboard (303 → /login when unauthed)
GET /login sign-in form
POST /login consume form, mint session cookie, 303 → /
POST /logout drop cookie, 303 → /login
GET /static/* embedded Tailwind bundle
The HTML login flow shares store/session logic with /api/auth/login
via a new authenticateAndSession helper — same security guarantees,
two surface representations (HTML form / JSON).
Verified end-to-end: bootstrap → form-login → authed dashboard →
sign-out → 303 cycle works in the browser; Tailwind output emits
only the component classes referenced in the live templates (9.6kB
minified).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five hi-fi screens completing the Phase 1 surface, all in v1's dark
operator-console register.
v1-login Sparse centred card. Sign-in + first-error variant.
No marketing chrome; build version sits in footer
so a returning operator can spot agent drift.
v1-add-host Focused two-column page (form left, contextual
"what happens next" right) — not a modal. Two
states: form (state A) and minted-token result
with install command (state B). Backed by
POST /api/enrollment-tokens (P1-32).
v1-host-detail Persistent header (status dot, mono name, tags,
primary CTAs, vitals strip) over four sub-tabs
(Snapshots / Jobs / Repo / Settings). Snapshots
is the default — the thing 90% of operators
want when they click a host name. Right rail
holds Recent activity, run-now stack, and a
danger-zone panel.
v1-job-log WS-streamed log view. Three states: running (live
progress bar + auto-scroll cursor), succeeded
(summary stats + final lines), failed (error
panel + tail). Backed by WS /api/jobs/{id}/stream
(P1-21 remainder).
v1-components The load-bearing reference. 14 sections covering
tokens (colour + type scale), status, buttons,
form fields, tags, tabs, host row, log viewer,
progress bar, stat tile, modal, toast, install
snippet, empty-state pattern. Every CSS class is
real and copy-able into the Go template build.
This locks the visual register before P1-23 onwards. Each Phase 1
template gets a {{define}} matching a section in v1-components.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Single .host-row CSS rule replaces 13 inline grid-template-columns
copies; column widths bumped so "backup running…" doesn't wrap.
- Faint left-edge accent for degraded / failed / offline rows so
problem hosts are scannable without reading.
- Empty-state hero added: top-bar + nav still present (Dashboard
active, others dimmed) but body collapses to a calm "no hosts yet"
prompt with the install command as the load-bearing affordance.
Prerequisite note keeps the deliberate "restic must already be
installed" decision visible to first-time operators.
This is the artefact P1-23/24/27 will template against.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three deliberately differentiated takes on the dashboard so we can
lock the visual register before the UI work starts (P1-23 onwards).
v1 — Operator console (Linear/Datadog dark register).
Dense table, monospace numerics, restrained colour, pulsing
status dot only when a job is running. The natural fit for
the audience and the most defensible choice.
v2 — Editorial calm (Stripe/Notion light register).
Serif hero headline that humanises the data, cards with
breathing room in a 2-up grid, demoted "quiet hosts" strip,
subtle rust accent. Reads as trustworthy infrastructure.
v3 — Print spec (Tufte/aerospace monospace register).
Pure monospace, near-monochrome, status as typeset glyphs
(●▶▲○✗) so the screen survives greyscale. "Requires
attention" block groups problem hosts at the top; activity
tail reads like a real log. Most polarising; highest
craft ceiling.
Each file is self-contained (Tailwind via CDN + Google Fonts) and
includes a philosophy preamble + the dashboard hero + a component
vocabulary section so we can read the system, not just one screen.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two warts surfaced during the smoke run:
- Agent was silent between "config.update applied" and "job
finished" — operators tailing journalctl saw no acknowledgement
that a command.run had landed. Adds Info logs at job-accept
({job_id, paths}) and at successful completion.
- The host.enrolled audit row had an empty {} payload. Now
carries {hostname, os, arch, has_repo_creds} so an audit-log
reader can answer "what got enrolled and did the operator
bundle creds with the token" without joining back to hosts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The smoke runbook caught a real bug: ConsumeEnrollmentToken was
inserting into host_credentials (FK -> hosts) inside the same tx as
the token burn, but the host row didn't exist yet — CreateHost
runs in the *next* statement. The agent saw a generic 401 with no
clue why.
Fix: drop the host_credentials insert from ConsumeEnrollmentToken;
the HTTP handler now does Consume -> CreateHost ->
SetHostCredentials. SetHostCredentials failure is logged loudly
but doesn't fail the enrol — operator recovers via PUT
/api/hosts/{id}/repo-credentials.
Adds slog.Warn lines on both 401 paths in handleAgentEnroll so the
underlying cause is visible in server logs (the wire response stays
generic to avoid leaking which step failed).
Test: TestEnrollmentTransfersRepoCreds rewritten to mirror the new
order (consume -> create host -> SetHostCredentials).
Runbook (docs/e2e-smoke.md): rest-server moved off 8000 (commonly
in use); URLs use trailing slash on the rest path; clarified that
secrets_key is minted on first agent start, not at enrol time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds docs/e2e-smoke.md — an ~5-minute runbook that walks the full
P1 happy path against a sibling restic/rest-server: bootstrap
admin, mint token with repo creds, enrol an agent, watch the
config.update push land, run a backup, confirm the snapshot, edit
creds and watch the second push fire. Per the design discussion
this is a runbook (not a Go integration test); the Playwright
version lands in P5-06.
GET /api/hosts/{id}/repo-credentials returns the redacted view —
{repo_url, repo_username, has_password} — so the UI can pre-fill
the edit form without ever pulling the password out of the AEAD
blob.
Marks P1-32 / P1-33 / P1-34 done in tasks.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New internal/agent/secrets package: AEAD blob at
/var/lib/restic-manager/secrets.enc, atomic write (os.CreateTemp +
Sync + Rename), 0600. Key lives in agent.yaml as base64
(SecretsKey) — same trust boundary as the bearer token, minted on
first start via EnsureSecretsKey.
cmd/agent: dispatcher reads creds fresh from secrets.Load() on
each job rather than from in-memory config. config.update merges
the push with what's on disk and persists, so a daemon restart
keeps the latest values. Legacy plaintext repo_url/repo_password
in agent.yaml are silently migrated into secrets.enc on next start
and stripped from the YAML on the following save.
Tests: round-trip + wrong-key rejection + atomic-write
post-condition for secrets; key idempotence + legacy-field
parse/clear for config.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Operator-minted enrollment tokens now carry the repo URL/username/
password as one AEAD blob bound (via additional-data) to the token
hash. ConsumeEnrollmentToken re-encrypts under host_id and writes a
host_credentials row in the same tx as token-burn, so the binding
moves with the credential.
PUT /api/hosts/{id}/repo-credentials lets an operator edit creds
post-enrollment; merges with the existing blob, audits, and pushes
config.update if the agent is connected.
WS handler grows an OnHello hook that the HTTP layer wires to send
the host's decrypted creds as a config.update immediately after the
hello succeeds — synchronously, so a racing command.run lands after
the agent has its repo password.
Schema: 0002_host_credentials.sql adds enc_repo_creds to
enrollment_tokens and a host_credentials table (PK = host_id, FK
ON DELETE CASCADE).
Tests: round-trip token → consume → host_credentials with AAD swap
detection; no-creds path stays compatible.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds P1-32/33/34: encrypted repo creds carried on the enrollment token,
agent-side AEAD secrets file, end-to-end smoke. spec.md §4.2 and §7.3
rewritten to describe the full flow (server-issued at token time,
pushed via config.update on hello, persisted encrypted on the agent)
and to make the encrypted-file-now / OS-keyring-Phase-2 split
explicit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P2-18 captures the keypair + fingerprint-comparison enrollment flow
as a Phase 2 alternative to the token model. Includes guards
(rate limit, pending cap, hostname-collision flagging) and explicit
acceptance criteria.
P1-27 grows to mint encrypted repo creds alongside the token and
expose a one-click preconfigured-installer download from the
"Add host" form (cf. UrBackup Internet-mode push installer).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Agent calls restic snapshots --json after each successful backup
(60s timeout, separate from the backup ctx) and ships the projection
over the existing snapshots.report WS envelope. Failure here is
logged but doesn't fail the job — the next successful backup catches
the projection up.
Server-side ReplaceHostSnapshots is delete-then-insert plus a
hosts.snapshot_count update in one transaction so the dashboard's
per-host count stays consistent with the projection. New read
endpoint GET /api/hosts/{id}/snapshots returns the cached list with
a refreshed_at marker so the UI can show staleness when an agent
has been offline.
Schema: dropped the unused snapshots.repo_id FK (repos as a
first-class entity is P2 work), added short_id and refreshed_at
columns, switched the time index to DESC for the most-recent-first
list query. api.Snapshot gains short_id; size_bytes/file_count come
from the embedded summary block on restic 0.16+ and stay zero on
older clients.
Tests cover round-trip, authoritative replacement after forget+prune
shrinkage, and empty-after-wipe.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-hosted deployments already terminate TLS at Caddy/Traefik/nginx;
making the server do TLS too means double cert config, dual ACME
plumbing, and an untested code path. Drop RM_TLS_CERT/RM_TLS_KEY,
remove TLSEnabled() and the ListenAndServeTLS branch.
Replace the cookie's "Secure if TLS-in-process" check with a new
RM_COOKIE_SECURE flag (default true). Local HTTP-only testing sets
RM_COOKIE_SECURE=false; production is always behind a TLS proxy and
the cookie stays Secure.
Default port :8443 → :8080. docker-compose binds 127.0.0.1 only and
populates RM_TRUSTED_PROXY. spec.md §4.1/§10.1 rewritten with a
Caddyfile snippet and a hard "do not expose RM_LISTEN publicly"
warning. enrollResponse keeps cert_pin_sha256 in the shape but the
server can't introspect a cert it doesn't terminate — operator
pastes the proxy's hash into -cert-pin at install time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the state landed in this session:
Done (P1-01..03, P1-05, P1-06, P1-08..16, P1-17..20, P1-29):
HTTP server, store + schema, crypto, first-run bootstrap,
every API type with wire-shape tests, WS transport,
enrollment + hello + heartbeat round-trip, agent config +
service unit + WS client + sysinfo, restic wrapper, job
lifecycle store + run-now endpoint, agent runner.
Partial (P1-04, P1-07, P1-21, P1-31):
CSRF middleware lives with the UI work; audit middleware
sweep lives with rest of API; live job-log fan-out needs
the per-job browser hub; signed agent binaries deferred to
Phase 5.
Open (P1-22..28):
Snapshot listing, full UI suite (login, dashboard, host
detail, live job log, add-host, Tailwind build).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lands the operator → server → agent → restic → server roundtrip for
on-demand backups. The flow:
POST /api/hosts/{id}/jobs {kind:"backup",args:["/path"]}
→ server creates a queued Job row
→ server emits command.run over WS to the host's agent
→ agent dispatcher spawns runner.RunBackup in a goroutine
→ runner spawns `restic backup --json`, parses each line
→ forwards: job.started, log.stream (every line), job.progress
(throttled to 1/sec), job.finished (with summary stats blob)
→ server WS handler persists those into jobs / job_logs
P1-16 internal/restic: thin Locate + Env wrapper that runs `restic
backup --json`, scans stdout/stderr, parses BackupStatus +
BackupSummary, calls back into a LineHandler so the agent can fan
out to log.stream + job.progress. Treats exit code 3 as
"succeeded with issues" (matches restic's contract).
P1-18 store: jobs accessors (CreateJob, MarkJobStarted,
MarkJobFinished, AppendJobLog, GetJob).
P1-19 server: POST /api/hosts/{id}/jobs creates the Job row,
validates kind, dispatches via Hub.Send, audit-logs the action.
P1-20 agent runner: wraps restic.RunBackup with throttled progress
emission. Sender abstraction was added to wsclient.Handler so
background goroutines can keep replying after dispatch returns.
P1-21 server WS: dispatchAgentMessage now persists job.started,
job.finished, log.stream into the database. Browser fan-out for
live tailing lands with the UI work.
Agent gets repo_url + repo_password from agent.yaml in plaintext
for now (mode 0600, owned by service user); spec.md §7.3's keyring
storage moves there in P2. config.update over WS overrides the
in-memory copy (does not persist).
Build clean; all tests pass. End-to-end with a real restic still
needs a host that has restic installed — wire shape verified by
the existing hello/heartbeat round-trip test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P1-14 deploy/install/restic-manager-agent.service: standard systemd
unit with the usual hardening switches (NoNewPrivileges, Protect*,
RestrictRealtime, MemoryDenyWriteExecute). Restart=always with a
5s backoff. Runs as a dedicated unprivileged restic-manager-agent
user; the install script creates it.
P1-29 deploy/install/install.sh: arch detection (amd64/arm64), pulls
the agent binary from /agent/binary, creates the service user
+ dirs (/etc/restic-manager, /var/lib/restic-manager), runs
enrollment via `agent -enroll-server -enroll-token`, lays down
the systemd unit, enables and starts it.
Honours the spec's "detect, don't auto-disable" rule for existing
schedulers: scans systemd timers, /etc/cron.d/*, /etc/cron.daily/*,
root crontab for restic-named entries and prints them with the
exact disable command — operator decides.
P1-31 server endpoints to ship the agent installation payload:
GET /agent/binary?os=linux&arch=amd64 → serves
<DataDir>/agent-binaries/restic-manager-agent-linux-amd64
GET /install/<file> → serves
<DataDir>/install/<file>
Both endpoints reject path traversal and return 404 if the file
isn't published. Operators drop the binaries + service unit into
these directories at release time. Signed-bundle verification is
deferred to Phase 5 OSS readiness.
All tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lands the protocol layer end-to-end: an agent can be enrolled
through the operator UI, store credentials, dial back to the server
over WS, complete the protocol_version handshake, and stay
connected with periodic heartbeats.
Server side:
- P1-09 ws.Hub: one Conn per host_id, last-write-wins eviction,
json envelope writer with a write mutex, reader, error envelopes.
- P1-09 ws.AgentHandler: bearer-auth, accept upgrade, hello-stage
(10s deadline, protocol_version checked against
api.MinAgentProtocolVersion → ErrProtocolTooOld with help URL on
reject), main read loop, defer hub register/unregister.
- P1-10 POST /api/agents/enroll consumes a one-time token, mints a
persistent agent bearer (sha-256 stored), creates a host row.
- P1-10 POST /api/enrollment-tokens (operator, session-auth)
issues a 1h one-time token.
- P1-11 hello upserts agent_version + restic_version +
protocol_version on the host row, flips status to online.
- P1-12 heartbeat touches last_seen_at; background sweeper marks
hosts offline after 90s without one.
- store: hosts table accessors, host_schedule_version,
enrollment_tokens FK on consumed_host dropped (audit-only field;
the token gets burned before the host row exists).
Agent side:
- P1-13 internal/agent/config: yaml at /etc/restic-manager/agent.yaml,
atomic Save (tmp+fsync+rename), Enrolled() helper.
- P1-15 internal/agent/wsclient: dial with bearer + optional
TLS cert pinning (sha-256 of leaf), exponential backoff with
jitter (1s → 60s cap), heartbeat goroutine, fatal handling for
ErrProtocolTooOld.
- P1-15 wsclient.Enroll: HTTP POST /api/agents/enroll with sysinfo.
- P1-17 internal/agent/sysinfo: hostname/OS/arch/restic-version
collection. restic detected by `restic version` parse; absent
restic doesn't block startup.
- cmd/agent: -enroll-server / -enroll-token flags drive first-run
enrollment then exit (so the install script can hand off to
systemd to run the persistent service).
End-to-end smoke verified: bootstrap → login → issue token →
enroll → run agent → server logs `ws agent connected` with the
right host_id and protocol_version 1.
All tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P1-01 chi router, slog request log, graceful shutdown via signal
context. Health endpoint, /api/auth/login, /api/auth/logout,
/api/bootstrap. Background sweeper for expired sessions and
enrollment tokens (15 min cadence).
P1-04 (sessions half) HttpOnly Secure-when-TLS cookie carrying a
base64url token; server stores SHA-256(token) so a stolen DB
doesn't yield credentials. Unknown user and bad password collapse
to the same 401 response code so a probe can't enumerate names.
P1-05 first-run admin bootstrap. On a fresh DB the server mints a
one-time token and prints it to stderr inside a banner. The
/api/bootstrap handler accepts {token, username, password},
creates the first admin, then becomes a 409 forever.
P1-07 (partial) audit hooks fire on auth.login and auth.bootstrap.
Full middleware-driven coverage lands with the rest of the API.
internal/server/config: env > YAML > defaults. RM_LISTEN /
RM_DATA_DIR / RM_BASE_URL / RM_TLS_CERT / RM_TLS_KEY /
RM_SECRET_KEY_FILE / RM_TRUSTED_PROXY (CIDR list, validated).
End-to-end smoke test passes: server boots on a fresh dir,
prints the bootstrap token, POST /api/bootstrap creates the admin,
POST /api/auth/login returns 200 with a session cookie.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lands the bottom three layers of Phase 1:
P1-08 internal/api: protocol_version + envelope + every WS message
shape from spec.md §6.2 (Hello, Heartbeat, Job*, Schedule*, etc).
Wire-format tests pin the JSON shape so a rename here breaks
tests instead of silently breaking the agent.
P1-02 + P1-03 internal/store: SQLite via modernc.org/sqlite,
embed.FS + a tiny version table for hand-rolled migrations.
0001_initial.sql covers every table from spec.md §5 plus
enrollment_tokens and host_schedule_version. Typed accessors
for users / sessions / enrollment / audit. WAL + foreign_keys
+ busy_timeout on by default.
P1-06 internal/crypto: XChaCha20-Poly1305 AEAD wrapper with
per-message random nonce. Key file lifecycle (generate +
refuse-to-overwrite, load with size validation). Optional
additionalData binds ciphertext to the row that owns it.
P1-04 internal/auth (partial — passwords + tokens; sessions
middleware lands with the HTTP handlers): argon2id following
RFC 9106 (64 MiB / t=3 / p=4 / 32B), constant-time verify.
HashToken stores SHA-256 of session/agent/enrollment tokens
so a stolen DB doesn't hand over credentials.
Build floor moves to Go 1.25 (modernc.org/sqlite v1.50+ requires
it); CI + Dockerfile + README updated. Markdown lint diagnostics
on tasks.md cleared.
All packages tested. ~70 new tests pass in <1s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Doc-only changes captured before any Phase 1 code lands.
spec.md:
- §4.1 nhooyr.io/websocket → github.com/coder/websocket (the
maintained fork; the original is unmaintained)
- §4.1 RM_LISTEN documented as source of truth for the bind port;
add RM_TRUSTED_PROXY env var for X-Forwarded-* handling behind
Caddy/Traefik
- §4.2 Phase 1 ships Linux only; Windows binaries continue to build
in CI to keep the codebase portable, but service integration +
installer move to Phase 2
- §4.2 self-update via apt/choco, not bespoke signed binaries
- §5 add Host.protocol_version + Host.applied_schedule_version
- §6.2 lock protocol_version handshake semantics (clean error on
mismatch, not weird JSON parse failures)
- §6.2 schedule reconciliation when server unreachable: agent keeps
firing last-known-good indefinitely; server's view canonical on
reconnect; UI surfaces drift via applied_schedule_version
- §6.2 schedule.set carries schedule_version; new schedule.ack
agent→server message
- §10.1 cross-reference RM_LISTEN ↔ compose port mapping
- §14.3 hooks rejected at validation on non-backup schedule kinds
tasks.md:
- P1-14 / P1-30 (Windows service + install.ps1) → Phase 2 as
P2-16 / P2-17
- P1-29 install.sh detects existing restic timers/cron and prints
disable commands, doesn't auto-disable
- Phase 1 acceptance: drop Windows from end-to-end criterion,
require windows cross-compile in CI
- P4-01 rewritten: package-manager-based update delivery
- P5-08 removed (duplicate of P4-08 Prometheus /metrics)
- Various references updated
No Go code changes; build still clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Close every remaining P2 task in `tasks.md`: P2R-09 (auto-init UX), P2R-10/11/12 (hooks), P2R-13 (bandwidth wiring + per-job override), P2R-14 (schedule next/last run), P2-16 (Windows svc), P2-17 (`install.ps1`), P2-18 (announce-and-approve).
**Architecture:** Server stays HTTP+WS; agent stays a single binary that auto-restages via `make build`. Hooks live on `source_groups` (and host-level defaults). Announce-and-approve adds a separate WS path (`/ws/agent/pending`) and a Pending hosts panel; token-flow stays default. Windows service support uses `golang.org/x/sys/windows/svc` behind a `//go:build windows` tag — Linux builds untouched. **Operator is away — make best guesses on small UX choices, but commit each item separately so the choices are reviewable.**
**Tech Stack:** Go 1.23+, chi router, modernc/sqlite, `coder/websocket`, `robfig/cron/v3`, HTMX + Tailwind, `golang.org/x/sys/windows/svc`, Ed25519 (stdlib).
---
## Pre-flight
- [ ] **Run baseline:**`go vet ./... && go build ./... && go test ./...` — must be green before starting. Restage agent + restart server (per CLAUDE.md restage block) so smoke env is warm.
## Order of execution
Smallest blast-radius first. UI polish → bandwidth → next/last → hooks → announce → Windows. Commit and restage at each task boundary. Run `go vet ./... && go test ./...` before every commit.
- Modify: `internal/restic/runner.go` (add `LimitUploadKBps`, `LimitDownloadKBps` to `Env` or to a per-call options struct already present; emit `--limit-upload N`/`--limit-download N` on `restic backup|forget|prune|check|restore`)
- Modify: `internal/agent/runner/*.go` — pass host-wide caps into the runner. Caps come from `agent.config.Config` or are pushed via `config.update`. Decision: ship caps in the existing `config.update` envelope as new fields `bandwidth_up_kbps`, `bandwidth_down_kbps`. Server pushes on hello + on `PUT /api/hosts/{id}/bandwidth`.
- Modify: `internal/api/messages.go` — extend `ConfigUpdatePayload` with the two int pointers.
- Modify: `internal/server/ws/handler.go` (or wherever hello/config push lives) — include caps in the pushed config.
- Modify: `internal/server/http/host_bandwidth.go` — after `SetHostBandwidth`, fan out a `config.update` to the connected agent (mirror the credentials-edit path).
- Test: `internal/restic/runner_test.go` — assert flag injection.
- Test: `internal/server/ws/*_test.go` — assert config.update carries caps on hello and on edit.
- [ ] **Step 1.1** Add `LimitUploadKBps *int`, `LimitDownloadKBps *int` to whatever per-host config the runner already consults. Existing pattern is `restic.Env{}`; extend it.
- [ ] **Step 1.2** Failing test in `internal/restic/runner_test.go`: build a backup command with `LimitUploadKBps=1024`, assert the resulting argv contains `--limit-upload 1024`.
- [ ] **Step 1.3** Implement: prepend the flags in argv builders for `backup`, `forget`, `prune`, `check`, `restore`. Skip when nil/<=0.
- [ ] **Step 1.4** Wire `config.update` payload — server reads `Host.BandwidthUpKBps`/`DownKBps`, includes them in the existing `ConfigUpdatePayload` push on hello and on bandwidth edit (mirror cred-edit fan-out in `internal/server/http/host_credentials.go`).
- [ ] **Step 1.5** Agent applies caps: store in the in-memory dispatcher state on `config.update`, attach to every restic call.
- [ ] **Step 1.6**`go vet ./... && go test ./... && make build && <restage block>`. Commit:
```
agent+server: apply host bandwidth caps to restic invocations
**Decision:** A small numeric input on the per-source-group Run-now button (and dashboard Run-all). Operator is away — keep it minimal: two optional inputs (up/down KB/s) on the dispatch endpoint; UI shows a `<details>` "Limit bandwidth for this run" disclosure with two number inputs.
**Files:**
- Modify: `internal/server/http/sources.go` (or wherever the per-group Run-now POST lives) — accept optional `bandwidth_up_kbps`/`bandwidth_down_kbps` form fields, pass through.
- Modify: dispatch path (`internal/server/dispatch_*.go` or `ws/handler.go` job-dispatch core) — accept overrides, include in the `command.run` payload.
- Modify: `internal/api/messages.go` — `CommandRunPayload` gains optional caps that take precedence over host-wide caps when present.
- Modify: agent dispatcher — use payload override if present else falls back to config caps.
- [ ] **Step 2.5** Playwright spot-check via `:8080` smoke env: open Sources tab, expand the Run-now disclosure, fire with limit=128, then open the live job log and confirm the agent's restic argv (read `/tmp/rm-smoke/server.log` for the dispatched command — it logs argv) shows `--limit-upload 128`.
- Modify: `internal/store/schedules.go` — add `NextRunAt(time.Time)` derivation helper and `LatestScheduledJobAt(host_id, schedule_id) (time.Time, error)` (or a single batched fetch for all schedules of a host).
- Modify: dashboard host row (`web/templates/partials/host_row.html`) — show "Next: …" and "Last: …" when there's a single covering schedule (already detected in slice 5).
- Modify: `web/templates/pages/host_schedules.html` — add Next/Last columns to the schedules table.
- Test: `schedules_test.go` for next-run derivation (parse cron, compute next from a fixed `now`).
- [ ] **Step 3.1** Add `NextRun(cronExpr string, from time.Time) (time.Time, error)` helper using `robfig/cron/v3`'s `Parse(...).Next(from)`. Test with three crons.
- [ ] **Step 3.2** Add `LatestJobByActorKindForSchedule(host_id, schedule_id) (time.Time, status, error)` query against `jobs` (filter `actor_kind='schedule'` AND `schedule_id=?`, ORDER BY `started_at` DESC LIMIT 1).
- [ ] **Step 3.3** Wire schedules-page handler to populate Next/Last per row; render relative time + ISO tooltip (mirror existing `formatRelTime` template helper if it exists; otherwise use a simple "5m ago" helper).
- [ ] **Step 3.4** Wire dashboard row: when single covering schedule, surface "Next: 03:00" / "Last: 8h ago — succeeded".
- [ ] **Step 3.5** Playwright spot-check: a host with a schedule shows Next/Last; pause it → Next becomes "—" / "(paused)".
- Modify: `internal/server/http/ui_repo.go` (or new `repo_reinit.go`) — `POST /hosts/{id}/repo/reinit` admin-only, audit-logged. Server runs `restic init --force` (or wipes-then-inits — pick the safer of the two; restic doesn't truly wipe a repo, the operator must clear the bucket. **Best guess:** dispatch a normal `init` job with a flag that re-runs even if the repo claims to exist; if restic refuses, surface "the repo on the remote already has data — clear it manually before re-init" via the job log).
- Modify: host detail page header / vitals strip — surface init result line. Use the existing latest-`init`-job query to render "repo ready · initialised <relative time> ago" or "init failed · job N · retry".
- Test: HTTP test for re-init endpoint (auth, audit, host-name confirm); template test that the result line renders for both states.
- [ ] **Step 4.2** Render init line into vitals strip; show "init failed" amber when latest init failed.
- [ ] **Step 4.3** Implement `POST /hosts/{id}/repo/reinit` handler — admin role check, requires a `confirm_hostname` form field that must equal `host.Name`, returns 400 otherwise. Dispatches a fresh `init` job.
- [ ] **Step 4.4** Add danger-zone re-init form to `host_repo.html` (currently disabled per slice 4). Two-step confirm with the typed hostname.
- [ ] **Step 4.5** Playwright: visit `/hosts/{id}/repo`, click re-init, type wrong hostname → blocked; type right hostname → dispatches init job → returns to live log.
- [ ] **Step 5.2** Update store types + getters/setters with AEAD encrypt/decrypt. Mirror `internal/store/host_credentials.go` patterns exactly.
- [ ] **Step 5.3** Round-trip test: set hook on a source group; reload; assert plaintext returned. Set nil; assert nil after reload.
- [ ] **Step 5.4**`go vet && go test`. Commit.
## Task 6 — P2R-11: Agent execution of hooks
**Files:**
- Modify: `internal/api/messages.go` — `ConfigUpdatePayload` (or the per-source-group bundle inside `ScheduleSetPayload`) carries `PreHook`, `PostHook` plaintext (server has decrypted by then; wire is authenticated WS, same trust boundary as repo creds).
- Modify: agent dispatcher — for `kind=backup` only:
- Run `pre_hook` (if present) via `os/exec` with the host shell (`/bin/sh -c` on Linux, `cmd.exe /C` on Windows). Capture stdout+stderr → JobLog with `hook:` prefix. Non-zero exit aborts the backup, marks the job failed with `pre_hook` error.
- Run `post_hook` (if present) **always** after the backup, with `RM_JOB_STATUS=succeeded|failed` env var. Capture into JobLog, prefix `hook:`. Non-zero exit on post_hook does NOT change job status (warning logged).
- Skip both for `kind` ∈ {forget, prune, check, unlock, init} per spec.md §14.3.
- Test: dispatcher test with a `pre_hook` that exits 1 → backup not started; `post_hook` always runs and sees `RM_JOB_STATUS`.
- [ ] **Step 6.1** Plumb hooks through `ScheduleSetPayload` source-group bundle + per-group Run-now `command.run` payload (override host-default with group hook if both present). Server-side resolution: host default if group hook is empty.
- [ ] **Step 6.2** Agent dispatcher: factor hook execution into `internal/agent/runner/hooks.go`. Use `exec.CommandContext`, set env, plumb output to existing JobLog stream with `Source: "hook"` (or prefix the log lines `hook: …`).
- [ ] **Step 6.3** Failing test in `internal/agent/runner/runner_test.go` (create file if absent): `pre_hook=/bin/false` → job fails with `pre_hook failed (exit 1)` and the actual restic backup never runs (assert via mock-restic shim).
- [ ] **Step 6.4** Test: `post_hook` runs even when backup fails; receives `RM_JOB_STATUS=failed`.
- [ ] **Step 6.6**`go vet && go test && make build && <restage block>`. Commit.
## Task 7 — P2R-12: Hook editor UI
**Files:**
- Modify: `web/templates/pages/source_group_edit.html` (new or extend existing source-group form) — `<textarea>` for pre_hook, `<textarea>` for post_hook, with the warning banner: "this hook runs as the agent service user (root on Linux; LocalSystem on Windows)".
- Modify: source-group HTTP handler (`internal/server/http/sources.go`) — accept hook fields on POST/PUT, encrypt-and-persist via store.
- Create: a new "Settings" tab section on host detail (currently inert per P1-25) — wait, just add a new sub-tab or extend Repo page. **Decision:** add `pre_hook_default` / `post_hook_default` to the Repo page under a new "Hooks" section since Settings is still inert.
- Modify: source-group form admin-only check; post-only edit allowed by operators? **Decision:** admin-only edit per spec; render but disable for operators.
- Modify: audit-log writer — emit `source_group.hook_updated` and `host.default_hook_updated` events (without the hook body).
- Test: HTTP test for create + update; admin-only enforcement; audit row written without secret.
- Create: `internal/server/ws/pending.go` — `GET /ws/agent/pending` upgrade. Server issues a 32-byte nonce; agent signs it with its Ed25519 private key; server verifies against the `public_key` stored on the pending row keyed by the supplied `pending_id`. If valid, hold the connection open; on accept, push a single `enrolled` message containing `{bearer_token, repo_credentials_aead_blob}` and close cleanly. On reject, close with code 4001 + reason "rejected".
- Create: `internal/server/http/pending.go` — admin-only `POST /api/pending-hosts/{id}/accept` (atomically: mint bearer, decrypt admin-supplied repo creds (passed in form), promote pending row → real `hosts` row, push `enrolled` to the open WS, audit-log) and `POST /api/pending-hosts/{id}/reject` (delete row + close socket).
- Modify: server `main.go` route registration.
- Test: integration test — fake agent opens pending WS, admin POST /accept, agent receives bearer.
- Modify: `cmd/agent/main.go` — when `RM_TOKEN` is unset, switch to announce mode instead of erroring out. `RM_SERVER` still required.
- Create: `internal/agent/announce/announce.go` — generate-or-load Ed25519 keypair (persisted as a file alongside `secrets.enc`, mode 0600). POST `/api/agents/announce`. Open `/ws/agent/pending`. Wait. On `enrolled` message, persist bearer to `agent.yaml`, persist repo creds via existing secrets store, exit announce mode and reconnect via the normal WS path.
- Modify: `deploy/install/install.sh` — when `RM_TOKEN` is missing, run agent in announce mode and `journalctl --follow` until the agent prints the fingerprint, print it to the operator's terminal in big copy-friendly format, then keep following until enrolled.
- Test: end-to-end test in `internal/server/...` using a fake agent.
- [ ] **Step 10.2** Announce client + pending WS client; print `SHA256:…` fingerprint to stdout in a banner.
- [ ] **Step 10.3** Install script branch.
- [ ] **Step 10.4** Playwright: register a host via announce mode (run agent locally with no RM_TOKEN), log into UI, see Pending hosts panel with the fingerprint, click Accept, confirm host appears.
- [ ] **Step 10.5** Commit.
## Task 11 — P2-18d: Pending hosts UI panel
**Files:**
- Modify: `web/templates/pages/dashboard.html` — add Pending hosts panel above the host list when any pending rows exist.
- Add buttons → POST `/api/pending-hosts/{id}/accept` and `/reject` via HTMX.
- Background sweeper for `DeleteExpiredPendingHosts` every 60s (mirror the existing offline-sweeper goroutine pattern).
- [ ] **Step 11.1** Sweeper goroutine.
- [ ] **Step 11.2** Dashboard handler + template.
- [ ] **Step 11.3** Accept form must include the same repo URL/user/pw fields as the token-mint form (admin still supplies repo creds at accept time).
- [ ] **Step 11.4** Playwright sweep.
- [ ] **Step 11.5** Commit.
## Task 12 — P2-16: Windows service integration
**Decision:** Cannot test on Windows from WSL. Goal is a clean compile under `GOOS=windows GOARCH=amd64` and code that follows the canonical `golang.org/x/sys/windows/svc/example` pattern. Untestable beyond compile + manual review; mark in commit message.
**Files:**
- Create: `internal/agent/service/service_windows.go` (build tag `//go:build windows`) — implements `svc.Handler`. `Execute` starts the agent's main loop in a goroutine, listens for `svc.Stop`/`svc.Shutdown`, cancels ctx, waits.
- Create: `internal/agent/service/service_other.go` (build tag `//go:build !windows`) — stub `RunService` that just runs the agent loop in the foreground.
- Modify: `cmd/agent/main.go` — sub-commands: `install`, `uninstall`, `start`, `stop`, `run` (default). `run` delegates to `service.Run()` which on Windows checks `svc.IsWindowsService()` and dispatches accordingly.
- Test: `internal/agent/service/service_windows_test.go` (build-tagged) for argv parsing only — actual SCM interaction can't be tested in CI.
- [ ] **Step 12.1** Implement the svc.Handler shell.
- [ ] **Step 12.3** Cross-compile check: `GOOS=windows GOARCH=amd64 go build ./cmd/agent` must succeed.
- [ ] **Step 12.4** Commit with note "untested on Windows; compile-verified only".
## Task 13 — P2-17: install.ps1
**Files:**
- Create: `deploy/install/install.ps1` — PowerShell 5.1+ compatible. Checks admin elevation. Downloads agent binary from `$RM_SERVER/agent/binary?os=windows&arch=amd64`. Drops it at `C:\Program Files\restic-manager\restic-manager-agent.exe`. Runs `restic-manager-agent.exe install` (registers service). Starts it. Detects existing tasks named `*restic*` via `Get-ScheduledTask` and prints them — does not auto-disable. Writes `C:\ProgramData\restic-manager\agent.yaml` with `RM_SERVER` + `RM_TOKEN` (or no token if announce-mode).
- Modify: `internal/server/http/install.go` (or wherever install scripts are served) to also serve `/install/install.ps1`.
- Modify: CLAUDE.md restage block to also stage `install.ps1`.
- [ ] **Step 13.1** Write the script.
- [ ] **Step 13.2** Wire serving + restage.
- [ ] **Step 13.3** Smoke parse: `pwsh -NoProfile -Command "Get-Command -Syntax (Get-ChildItem deploy/install/install.ps1)"` if pwsh is on PATH, else `Set-StrictMode` parse via `pwsh -c "$null = [scriptblock]::Create((Get-Content deploy/install/install.ps1 -Raw))"`. Skip if no pwsh available — note in commit.
- [ ] **Step 13.4** Commit.
## Task 14 — Final integration sweep
- [ ] **Step 14.1**`go vet ./... && go test ./... -race`. Full build. Restage. Restart server.
- [ ] **Step 14.2** Playwright walkthrough on `:8080`: login → dashboard shows pending-hosts empty state → create source group → set a `pre_hook` → Run-now with bandwidth override → confirm hook fires + bandwidth applied → schedules tab shows next/last → repo page shows init-OK line → re-init flow gated by typed hostname.
- [ ] **Step 14.4** Open PR `p2-completion → main` with a summary of every item closed.
---
## Decisions made on the operator's behalf (away)
1. **Bandwidth UI for per-job override:** small `<details>` disclosure under each Run-now button. Simpler than a modal; matches the rest of the app's progressive-disclosure style.
2. **Re-init UX:** server dispatches a fresh `init` job; if restic refuses because the repo already exists, surfaces the error in the job log and instructs the operator to clear the remote bucket. We don't try to forcibly wipe — too dangerous, and the agent doesn't have credentials to wipe S3/B2/etc generically.
3. **Hooks editor lives on the Repo page (host defaults) + on the source-group edit form (per-group override).** Skips inventing a new "Settings" tab since that surface is still inert.
4. **Announce flow:** admin still supplies repo creds at accept time (same form as the token-mint flow). The pending row only carries identity-of-the-endpoint material, never repo creds.
5. **Windows service:** compile-verified only; untested. Commit message will say so.
- method `(c Config) MetricsAuthEnabled() bool` returning true iff at least one of the two is configured.
- Env loading: `RM_METRICS_TOKEN` and `RM_METRICS_TRUSTED_CIDR` (comma-CIDR).
- `validate()` extension: ensure each CIDR parses (reuse the same `netip.ParsePrefix` pattern that already validates `TrustedProxies`).
- Tests: extend `config_test.go` covering both env vars + happy/sad CIDR.
## Step 2 — `internal/server/metrics` package
- `Registry` struct: `sync.Mutex`, `map[jobKey]*histogramState` where `jobKey = struct{kind,status string}`.
- `ObserveJob(kind, status string, dur time.Duration)` — clamps negative durations to 0; locks; bumps the right bucket + sum + count.
- `Snapshot() Snapshot` — copies state under lock; returns plain value type.
- `Snapshot` carries `Histogram` rows (kind, status, buckets, sum, count) and accepts the rest from the caller (host rows, alert counts, build info).
- `Render(w io.Writer, s Snapshot) error` — emits standard text exposition with stable line ordering. No external dep; manual escape of `\``"``\n` in label values per the Prom format spec.
- Unit tests: golden render, concurrent observe, bucket boundaries.
## Step 3 — HTTP handler
- New `internal/server/http/metrics.go`:
- `(s *Server) handleMetrics(w, r)` — calls `authoriseMetricsScrape`, then `gatherSnapshot(ctx)` then `metrics.Render`.
- `authoriseMetricsScrape(r, cfg) (ok bool, status int)` — pure helper; bearer token compared with `subtle.ConstantTimeCompare`; CIDR check on `r.RemoteAddr` first, then `X-Forwarded-For` if a trusted proxy fronted us (mirror `realIP`'s logic; simplest path is to call `chi/middleware.RealIP`-aware lookup the existing handlers use).
- `gatherSnapshot(ctx)` — assembles the snapshot from `Store.ListHosts`, `Store.ListAlerts({Status:"open"})`, the metrics registry, and `version.Version`/`version.Commit`/`runtime.Version()`.
- Route mounted in `server.go` only if `s.deps.Cfg.MetricsAuthEnabled()`.
- `Deps` grows a `Metrics *metrics.Registry` field; nil-tolerant in handlers.
- In the `MsgJobFinished` branch, after the `GetJob` lookup we already do, observe `(job.Kind, p.Status.String(), p.FinishedAt.Sub(deref(job.StartedAt)))`. Skip if `job.StartedAt` is nil (rare race).
- `cmd/server` wires the registry into both `Deps` and `HandlerDeps` from a single instance.
- `internal/server/metrics/render_test.go` — golden output for a fixed snapshot.
- `internal/server/http/metrics_test.go` — auth matrix (six cases per the spec) using the existing `newTestServer` fixture pattern. Render snapshot includes ≥1 host so we exercise the gather path end-to-end.
- `deploy/grafana/restic-manager-dashboard.json` — six-panel dashboard. Hand-authored against Grafana 11 dashboard schema (uid, schemaVersion, panels with `targets[].expr`, datasource as variable). Validated by importing into Grafana — but since we can't run Grafana in CI, the structural sniff test is just that the JSON parses and contains the expected panel titles + datasource variable.
## Step 7 — Tasks.md + verification
- Strike P6-04, P6-05 in `tasks.md`; add an "as shipped" note mirroring the prior P6 entries.
- Run `go vet ./...`, `go test ./...`, `make build`.
- Push branch (no PR per standing instruction).
## Risk register
- **CIDR check for proxied scrapes** — easy to mis-implement, easy to mis-document. The handler test must exercise both "direct hit" and "X-Forwarded-For" paths.
- **Histogram lock contention** — every job finish takes the mutex. Job throughput is low (a few/min/host max), and `ObserveJob` is a couple of map lookups; no risk in practice.
- **Dashboard JSON drift** — Grafana versions evolve. Pinning `schemaVersion` and using only well-supported panel types (timeseries, stat, table) keeps the import working across recent versions.
| `internal/alert.Engine` | Owns the rule evaluation. Exposes `OnJobFinished`, `OnHostOffline`, `OnHostOnline` event hooks; runs a 60s ticker for stale-schedule + auto-resolution sweeps. Persists raises/resolves through the store. | store, notification.Hub, slog |
| `internal/alert.Rule` + per-rule files | Each of the six rules is a small struct with `Kind() string`, `Severity() string`, `MessageFor(ctx) string`. The engine iterates over a registered slice. | store models |
| `internal/notification.Hub` | Receives "alert raised/resolved/test" events; fans out to enabled channels in parallel; logs results to a new `notification_log` table. | store, channel adapters |
| `internal/notification.Channel` (iface) | Single method `Send(ctx, payload) error` with a 5s context for HTTP channels, 10s for SMTP. Three impls in v1: `webhookChannel`, `ntfyChannel`, `smtpChannel`. | http.Client; net/smtp + crypto/tls for SMTP |
| `internal/store/alerts.go` | CRUD on `alerts` table: `RaiseOrTouch(host_id, kind, severity, message)`, `Acknowledge(id, user)`, `Resolve(id, by user)`, `AutoResolve(host_id, kind)`, `ListAlerts(filter)`, plus the `last_seen_at` bump. | sqlite |
| `internal/store/notification_channels.go` | CRUD on `notification_channels` (new table) + `notification_log` (new table). | sqlite, crypto.AEAD (for secrets) |
| `internal/server/http/ui_alerts.go` | `/alerts` page handler + filter parsing + ack/resolve form actions. | store |
| `backup_failed` | warning | `MarkJobFinished` with kind=backup, status=failed | next backup for the same host succeeds |
| `forget_failed` | warning | `MarkJobFinished` with kind=forget, status=failed | next forget for the same host succeeds |
| `prune_failed` | warning | `MarkJobFinished` with kind=prune, status=failed | next prune for the same host succeeds |
| `check_failed` | critical | `MarkJobFinished` with kind=check, status=failed OR errors_found | next check for the same host succeeds without errors |
| `stale_schedule` | warning | 60s ticker: a schedule's next-fire time is more than 5 minutes in the past with no matching job since | next job for that schedule succeeds OR schedule deleted |
| `agent_offline` | warning | offline-sweeper marks the host offline AND the host has been offline > 15 min (engine checks `last_seen_at`) | hostOnline event for that host |
The 15-minute floor on `agent_offline` exists so a 30-second blip during
agent restart doesn't generate a notification storm. The store's existing
offline sweeper (`hosts.last_seen_at` with 90s threshold) already marks the
host offline; the engine sees the event but waits for the threshold before
> **Closes:** P4-03 (RBAC enforcement at API layer), P4-04 (User management UI)
## Goal
Enforce role-based access control at the HTTP layer (currently every authenticated user has admin powers) and ship the operator-facing screens for managing users, roles, and password lifecycle.
## Architecture
Two coupled subsystems landing in one PR:
1. **RBAC enforcement** — chi route-group middleware that gates each subtree by minimum role. Fail-closed default (admin) so a forgotten declaration doesn't accidentally widen access.
2. **User management** — `/settings/users` sub-tab with list / add / edit / disable. Setup-link flow for new users (1-hour-expiry single-use token). Self-service password change at `/settings/account`.
The audit log already records actor + user_id on every mutation; new endpoints fold in naturally.
## Role taxonomy
Locked. Three roles, hierarchical (admin ⊇ operator ⊇ viewer):
The role enum already exists in the schema (`CHECK (role IN ('admin','operator','viewer'))`) and in `internal/store/types.go`. Bootstrap creates the first user as admin. Zero migration needed for existing installs.
## Schema changes
All column-level ALTERs (CLAUDE.md prefers these over rebuilds; safe under `foreign_keys=ON`).
### Migration 0017 — `users` extensions
```sql
ALTER TABLE users ADD COLUMN email TEXT;
ALTER TABLE users ADD COLUMN disabled_at TEXT;
ALTER TABLE users ADD COLUMN must_change_password INTEGER NOT NULL DEFAULT 0;
-- Username case-insensitive lookup. Existing rows are kept as-is;
-- normalisation only applies to new INSERTs (handled in Go).
CREATE UNIQUE INDEX users_username_lower ON users(LOWER(username));
```
### Migration 0018 — `user_setup_tokens`
```sql
CREATE TABLE user_setup_tokens (
user_id TEXT PRIMARY KEY REFERENCES users(id) ON DELETE CASCADE,
token_hash TEXT NOT NULL, -- sha256(raw_token), hex
expires_at TEXT NOT NULL,
created_at TEXT NOT NULL,
created_by TEXT NOT NULL REFERENCES users(id) ON DELETE SET NULL
);
CREATE INDEX user_setup_tokens_expires ON user_setup_tokens(expires_at);
```
`user_id` is PRIMARY KEY, not just FOREIGN KEY — only one outstanding setup token per user. Regenerating supersedes the old via `INSERT OR REPLACE`.
## RBAC enforcement
### Middleware
```go
// requireRole returns chi middleware that 403s any request whose
// session-resolved user doesn't meet the minimum role. Roles are
// hierarchical: admin > operator > viewer.
func (s *Server) requireRole(min store.Role) func(http.Handler) http.Handler
```
Hierarchy implemented as a small helper:
```go
func roleAtLeast(have, min store.Role) bool {
rank := map[store.Role]int{
store.RoleViewer: 1,
store.RoleOperator: 2,
store.RoleAdmin: 3,
}
return rank[have] >= rank[min]
}
```
### Route grouping in `server.go`
The existing `/api` and UI routes get re-grouped into three role bands plus a self-service group:
```
/api/* viewer-readable — GET endpoints anyone authenticated can hit
/api/* operator+ — mutating endpoints up to host/source-group/schedule level
Default at the bottom of `routes()` is admin (fail-closed). Any future endpoint that doesn't get explicitly placed lands in admin-only, surfacing the missing declaration as a permission error rather than a silent bypass.
### Per-handler nuance
One existing case warrants a handler-level check on top of the route gate: `GET /settings/users/{id}/edit` is admin-only, but the `PUT /api/account/password` is viewer-OK. The split-by-route already covers this; no per-handler overrides expected in v1.
### Out of scope of role middleware
- `/ws/agent` and `/api/agents/*` — agent bearer-token auth, separate chain
- `/healthz` — unauthenticated
- `/login`, `/logout`, `/bootstrap` — public
### 403 handling
- JSON endpoints: `{"error":"forbidden","code":"insufficient_role"}` with HTTP 403
- HTML endpoints: render a small "You don't have permission" panel inside the chrome (so the user keeps their nav and can move away), HTTP 403
- **No audit row on 403** — too noisy with normal users hitting URLs they don't have access to
### Session re-validation
Sessions need to honour `disabled_at` and current role on every request, not just at login. The session-validation middleware reads the user row each request (single PK lookup, fast in SQLite). If `disabled_at IS NOT NULL`, the session is invalidated and the request 401s. This makes "disable user" and "force logout" effectively immediate.
Cost: one SELECT per authenticated request. SQLite handles this comfortably for the fleet sizes this codebase targets.
## Setup-token flow (replacing temp passwords)
### Add user
1. Admin clicks **+ Add user** on `/settings/users`
2. Form: username (required, lowercase-normalised), email (optional, validated), role (admin/operator/viewer)
3. Server:
- Validates username uniqueness (case-insensitive). On collision with a *disabled* user, return a 409 with `{"existing_user_id": "...", "disabled": true}` so the UI can pivot to a "re-enable existing user" prompt
- On collision with an enabled user: 409 with a plain "username taken" error
- Creates user row with `password_hash = ""`, `must_change_password = 1`, `disabled_at = NULL`
- Generates 32 random bytes, hex-encodes → raw token (64 chars). Stores `sha256(token)` hex in `user_setup_tokens`. `expires_at = now + 1h`
`/settings/users/{id}/edit` shows a "Regenerate setup link" button when `must_change_password = 1`. Clicking it:
1. Generates a new token + hash, INSERT OR REPLACE on `user_setup_tokens`
2. Returns the admin to the same one-time link page as the add-user flow
3. Audit: `user.setup_token.regenerated`
### Cleanup
Expired tokens linger in the DB until cleaned. Add a cheap sweep on the existing maintenance ticker: `DELETE FROM user_setup_tokens WHERE expires_at < ?`. Runs at the same cadence as the alert engine tick (60s). No new ticker needed.
## Self-service password change
`/settings/account`
- Accessible to every authenticated user (any role)
- Form: `current password + new password + confirm`
- Server validates current password (re-uses login bcrypt comparison), updates hash, audits `user.password_changed`
- Special case: if `must_change_password = 1`, the current-password field is hidden / not required (covers the legacy "admin reset password" path if we ever add one — current setup-token path doesn't use this)
The bootstrap user's password change uses this same page (no special case for "first admin").
## User list / management UI
### `/settings/users` (admin-only)
```
Settings · Users [3]
─────────────────────────────────────────────────
[ + Add user ] [ ] Show disabled
USERNAME EMAIL ROLE LAST LOGIN STATUS
alice alice@example.com admin 2 mins ago enabled
bob — operator 3 days ago enabled
charlie c@example.com viewer never setup pending ← if has open setup token
diane d@example.com operator 1 month ago disabled ← only when "Show disabled"
Actions per row: Edit · (Re-enable | Disable)
```
- "setup pending" badge for users with `must_change_password=1` — clicking the row goes to edit, which surfaces the regenerate-link button prominently
- "Show disabled" is a checkbox querystring filter (`?show_disabled=1`)
- Sort columns: clickable like the audit log (username, role, last_login). Reuse the same pattern (server-side sort + URL builder + glyph)
### `/settings/users/new` (admin-only)
Single form: `username + email (optional) + role`. On submit → either landed on the setup-link page (success) or returned with an inline "username exists, re-enable existing?" panel (collision with disabled user) / red error (collision with enabled user).
### `/settings/users/{id}/edit` (admin-only)
- Display-only block: id, created_at, last_login_at, status
- **Editable**: email, role
- **Buttons**:
- "Regenerate setup link" — only when `must_change_password = 1`
- "Disable user" — flips `disabled_at`; rejected if last enabled admin (server-side check). Confirmation modal with typed name to confirm.
- "Re-enable user" — clears `disabled_at`. No confirmation.
- "Force logout" — separate from disable; just kills the session but keeps the user enabled. Useful for "I think Bob's session was hijacked" without locking him out.
Renders the one-time link with copy button + countdown. Shown after add-user and after regenerate. Reload of this URL after the token is consumed: 410 Gone with a clear message.
### `/settings/account` (any authenticated)
Self-service password change. Form-only page; no nav under Settings since most users will only see this one Settings page in v1.
## API surface
```
GET /api/users admin — list (with ?show_disabled=1 filter)
POST /api/users admin — create user, returns user_id + setup_url
GET /api/users/{id} admin — read
PATCH /api/users/{id} admin — update email, role
POST /api/users/{id}/disable admin — set disabled_at; rejects last-admin
POST /api/users/{id}/enable admin — clear disabled_at
POST /api/users/{id}/regenerate-setup admin — new token, returns setup_url
POST /api/users/{id}/force-logout admin — kill all sessions for this user
POST /api/account/password any auth — self password change
GET /setup public — landing page (HTML form)
POST /setup public — submit new password
```
UI routes mirror the API but at `/settings/users/...`.
## Last-admin self-protection
Two operations that could lock everyone out are guarded:
- **Disable user**: rejected if the user is admin AND there are no other enabled admins
- **Demote admin to operator/viewer**: same check
Server-side enforcement (single SELECT on `COUNT(*) FROM users WHERE role='admin' AND disabled_at IS NULL`). UI hint: edit page disables the role dropdown's non-admin options + disable button when the user is the last admin, with a tooltip explaining why.
The bootstrap admin is just a regular admin row; this check covers it.
## Audit actions
New action strings introduced:
- `user.created`
- `user.updated` (email / role change)
- `user.disabled`
- `user.enabled`
- `user.password_changed`
- `user.setup_completed`
- `user.setup_token.regenerated`
- `user.setup_token.expired` (system-driven, on cleanup sweep)
- `user.force_logout`
All target_kind = `user`, target_id = the affected user's id. Existing payload conventions apply.
## Ordering / dependencies
Slices in approximate landing order (writing-plans will firm this up):
7. **G. Sweep** — Playwright walk through the full lifecycle (add → setup link → user signs in → admin disables → user gets kicked → admin re-enables → user signs back in)
Each slice can land as its own commit on the branch. RBAC middleware (B) goes in *before* user CRUD so we don't ship an open `/api/users/*` even briefly.
- **HTTP middleware**: `roleAtLeast` truth table; viewer hitting an operator route returns 403; disabled user gets 401 mid-session
- **Setup flow integration**: create user → fetch setup URL → land on `/setup?token=...` → POST password → user can log in → token row gone
- **UI**: existing Playwright sweep pattern, screenshots into `_diag/p4-03-04-sweep/`
## Out of scope (deferred)
- **OIDC** (P4-05) — adds a parallel auth chain. This PR keeps the surface for it (role taxonomy, session middleware) but doesn't wire it.
- **Email-the-setup-link** — explicitly deferred. Easy follow-up because the SMTP channel client from P3-06 is already there.
- **Hard delete** — disable-only in v1; can add a typed-confirm "purge" later if it turns out to be needed.
- **Password complexity / rotation policy** — current minimum (12 chars) and no rotation; tighten later if/when policy demands.
- **Lockout on failed login** — a brute-force protection layer is its own task and orthogonal to RBAC.
- **Audit on 403** — not in v1; revisit if compliance asks for it.
## Risks / gotchas to watch
- **Existing tests** that assume "any logged-in user can hit any endpoint" will break. Audit the test fixtures: most use `loginAsAdmin`, which is fine; any tests currently exercising specific operator/viewer paths need explicit role assignment. (Quick grep suggests there aren't many — bootstrap-only.)
- **Bootstrap user normalisation** — the existing admin row's username is whatever it was set to at first run. The new lowercase-uniqueness index uses `LOWER(username)`, which makes the existing row implicitly lowercase-keyed for lookups. No data migration needed.
- **Session middleware re-read cost** — one SELECT per authenticated request. SQLite WAL handles this fine at expected fleet sizes; if it ever shows up on a profile we add a small in-memory cache keyed by session id with a 30s TTL.
- **403 vs 401 distinction** — make sure unauthenticated requests still get 401 (login redirect) and authenticated-but-insufficient get 403. The middleware should compose: auth-required first, role-required second.
## Acceptance
- [ ] An admin can add a user, copy the setup link, the new user can land on `/setup?token=...`, set a password, and reach `/`
- [ ] An expired token (>1h) on `/setup?token=...` shows the "contact your administrator" page
- [ ] Admin regenerates the link, old token is invalid, new token works
- [ ] Operator user can trigger Run-now but cannot reach `/settings/users` (403) and the Users tab in Settings is hidden in their nav
- [ ] Viewer user gets 403 on Run-now, 200 on dashboard / alerts / audit
- [ ] Admin disables a user mid-session — the user's next request is 401 and they're redirected to login
- [ ] Admin cannot disable themselves if they are the last enabled admin (server returns 409, UI button is greyed)
- [ ] Self-service password change at `/settings/account` works for every role
- [ ] All existing tests pass; new test suite covers role middleware, setup-token lifecycle, last-admin guard
## Self-review notes
- ✅ All sections concrete, no TBD / TODO
- ✅ Schema migrations are column-level (CLAUDE.md compliance)
- ✅ Audit action vocabulary listed in one place; no string typos to drift
- ✅ Out-of-scope list explicit so reviewers can challenge what we *aren't* doing
- ✅ Last-admin guard handled both server-side and UI-hinted
- ✅ Token storage hashes the secret server-side; raw is shown to admin once and never again
- ✅ Session re-validation cost noted with a fallback if it shows up on a profile
Wire OpenID Connect authentication as a sign-in path alongside the existing local-user system, so a deployment that already has an IdP (Authelia, Authentik, Keycloak, Okta, Auth0, etc.) can use it for restic-manager logins.
## Architecture
OIDC sits on top of the local-user system rather than replacing it. The first time a user signs in via OIDC the server **just-in-time provisions** a local user row marked `auth_source='oidc'`, with role derived from the IdP's `roles` claim. Subsequent sign-ins look up the same row by stable `oidc_subject` and refresh role + email from the latest claims. Once the row exists it behaves like any other local user — admin can disable it, force-logout, see it in audit logs, etc. — except password-login is rejected because there's no password.
The Authorization Code flow (with PKCE) is implemented against the discovered well-known config of a single configured issuer. Front-channel logout: clicking Sign out drops the local session + redirects the browser to the IdP's `end_session_endpoint` (when advertised). Back-channel logout deferred.
## Locked decisions
| Decision | Pick |
|---|---|
| User lifecycle | **B** — JIT-provision local rows on first OIDC login (`auth_source='oidc'`, `oidc_subject`) |
| Role mapping config | **A** — YAML/env, claim name configurable (default `groups`, matching Authelia / Keycloak / Authentik), default = deny on no-match |
| Username source | `preferred_username`, fallback to `email` |
| Username collision with existing local user | **Refuse** with clear remediation message |
| Provider config | **Single provider** — `providers:` array can come later |
| Login page layout | SSO button **above** password form; password form labelled "or sign in with a local account" |
| OIDC users + password login | **Disabled** — `auth_source='oidc'` rows have empty `password_hash`; password form rejects them |
| Logout shape | **Front-channel only** — drop session + redirect to `end_session_endpoint` when advertised |
| Role re-evaluation | **At login only** — claims read at the OIDC callback; admin can disable mid-session locally |
## Schema changes
Migration 0019 — `users` extensions for OIDC bookkeeping:
```sql
ALTER TABLE users ADD COLUMN auth_source TEXT NOT NULL DEFAULT 'local'
CHECK (auth_source IN ('local', 'oidc'));
ALTER TABLE users ADD COLUMN oidc_subject TEXT;
CREATE UNIQUE INDEX users_oidc_subject ON users(oidc_subject)
WHERE oidc_subject IS NOT NULL;
```
Both column-level ALTERs (CLAUDE.md preference). The unique partial index defends the JIT-lookup invariant (one row per IdP subject) without blocking multiple rows with NULL oidc_subject (the local users).
## Configuration
```yaml
# server config — extend existing config struct
oidc:
issuer: https://auth.example.com # well-known config discovered from this
client_id: restic-manager
client_secret: ${RM_OIDC_CLIENT_SECRET} # or via _FILE
display_name: Authelia # button label "Sign in with <display_name>"; default "SSO"
scopes: [openid, profile, email, groups]
role_claim: groups # default if absent (matches Authelia / Keycloak / Authentik)
Env-var overrides: `RM_OIDC_ISSUER`, `RM_OIDC_CLIENT_ID`, `RM_OIDC_CLIENT_SECRET`, `RM_OIDC_CLIENT_SECRET_FILE`. Mapping is YAML-only (env doesn't fit a multi-key string→string map cleanly).
When `oidc.issuer` is empty or missing, OIDC is disabled (current behaviour). No restart-toggle UI; this is a deploy-time setting.
## Auth flow
### Login start
`GET /auth/oidc/login` — only mounted when OIDC is configured.
1. Generate `state` (32 random bytes, base64) and `code_verifier` (64 random bytes, base64); compute `code_challenge = base64(sha256(code_verifier))`.
2. Store `(state, code_verifier, created_at)` in a new ephemeral table (or in memory with a 5-minute TTL — see "trade-off" below).
3. Redirect to `<authorization_endpoint>?response_type=code&client_id=...&redirect_uri=...&scope=...&state=...&code_challenge=...&code_challenge_method=S256`.
### Callback
`GET /auth/oidc/callback?code=...&state=...` — also OIDC-only mount.
1. Validate `state` against the stored value (one-shot — delete row on read). Reject if missing/expired/already used.
2. Exchange `code` + `code_verifier` for tokens at `token_endpoint`.
3. Validate the `id_token` JWT: signature against the JWKS endpoint, `iss`, `aud`, `exp`, `iat`, `nonce` (if used).
4. Extract `sub`, `preferred_username`, `email`, and the configured `role_claim` (default `roles`).
5. Pick username: `preferred_username` if non-empty, else `email`. Lowercase / trim per the existing local-user rules.
6. Pick role: first match in `role_mapping` against the array of role-claim values. **No match → deny with a clear error page**, no row created.
7. Look up user by `oidc_subject`. Three cases:
- **Found** — refresh `email`, `role`, `last_login_at`. Don't touch `username` (changing it would break audit trails; if the IdP changes the username, that's an operator concern). Log `user.oidc_login`.
- **Not found, username free** — INSERT row with `auth_source='oidc'`, `oidc_subject=<sub>`, `password_hash=''`, `must_change_password=0`. Log `user.created` with payload `{"auth_source":"oidc"}` + `user.oidc_login`.
- **Not found, username taken by a local user** — render an error page: "This OIDC user (`<sub>`) wants to sign in as `alice`, but a local user with that name already exists. Ask your administrator to either rename / remove the local user, or exclude this user from the OIDC mapping." 403, no row created. Log `user.oidc_login_blocked`.
8. Drop a session cookie + `MarkUserLogin` (the existing helper).
9. Redirect to `/`.
### Logout
`POST /logout` (existing handler) — augmented:
1. Look up the session before deletion (we need the user row to know if they're an OIDC user).
2. Delete the session as today.
3. If the user is `auth_source='oidc'` AND the discovered `end_session_endpoint` is non-empty → 303 to `<end_session_endpoint>?id_token_hint=<id_token>&post_logout_redirect_uri=<base>/login`. Otherwise → existing 303 to `/login`.
We need to keep the latest `id_token` per session to drive `id_token_hint`. Stash it in a new `sessions.id_token TEXT` column (one column-level ALTER on migration 0019 alongside the user columns), populated only for OIDC sessions.
## State table
Two reasonable shapes for the short-lived state used during the OAuth round-trip:
- **In-memory map** with a 5-minute TTL sweeper. Simpler, but multi-process deployments lose it (no multi-process today, but Phase 5 OSS readiness might add).
- **`oidc_state` table** — `(state_hash PK, code_verifier, created_at)`, swept on the same 60s alert-engine tick that already handles setup-token cleanup.
I'll go with the **table**. Costs ~3 lines in the existing cleanup tick, behaves correctly under restarts, and survives a future scale-out. Migration 0019 includes:
```sql
CREATE TABLE oidc_state (
state_hash TEXT PRIMARY KEY, -- sha256(state) hex; raw state never persisted
code_verifier TEXT NOT NULL,
created_at TEXT NOT NULL
);
CREATE INDEX oidc_state_created ON oidc_state(created_at);
```
## Login-page UI
`/login` template branches based on `view.OIDCEnabled`:
- **OIDC off** → current layout (just the password form).
- **OIDC on** → an `Sign in with <provider name>` button at the top, then a faint divider line, then the existing password form labelled "Or sign in with a local account". Provider name comes from a new optional config `oidc.display_name` (defaults to "SSO").
Failed-OIDC redirects (no role match, username collision, IdP error) land on `/login?oidc_error=<reason>` with a small banner above the buttons.
- `user.oidc_login_blocked` (target_kind=user, target_id=oidc_subject when no row was created, payload `{"username":"…", "reason":"username_taken|no_role_match|other"}`)
- `user.created` already exists; OIDC's first-time provisioning fires this with payload `{"auth_source":"oidc"}` so the audit log distinguishes admin-created from JIT-provisioned rows.
## User-management UI changes
Small additions, not new screens:
- **Users list** — Status column adds a small `oidc` chip when `auth_source='oidc'` so admin can see at a glance which rows came from JIT-provisioning. Sortable by auth_source via the same sortable-headers pattern (lands as a small follow-up if anyone asks; out of scope for v1).
- **Add user form** — disabled when OIDC is the only auth path, with a hint: "User provisioning is handled by your OIDC provider; users appear here on first sign-in." Configurable later via a `oidc.disable_local_users` flag if that becomes a real ask. Out of scope for v1; both paths stay open.
- **Edit user form** — when `auth_source='oidc'`:
- Username field disabled (changing it would just be undone on next OIDC login)
- Role dropdown disabled, with a hint: "Role is managed by your OIDC provider's `roles` claim mapping. Edit the mapping in server config to change."
- Email field disabled (refreshed from IdP on each login)
- **Disable / Enable / Force logout** still work — disabling an OIDC user kicks their session and rejects future OIDC logins ("user disabled by administrator")
- **Regenerate setup link** hidden — there's no setup token for OIDC users
- **Login UI** — password form rejects users with `auth_source='oidc'` ("This account uses single sign-on. Click the SSO button above.")
## Middleware / handler changes
- **Routes**: new public-band entries `GET /auth/oidc/login`, `GET /auth/oidc/callback`. Skipped entirely when OIDC isn't configured (`s.deps.OIDC == nil`).
- **Logout handler** augmented to fetch the user row + decide between local logout (303 → `/login`) and OIDC logout (303 → `end_session_endpoint`).
- **Login handler** rejects `auth_source='oidc'` users with the SSO-prompt error.
- **Last-admin guard** — already covers OIDC users naturally because they live in the `users` table. The role-from-claims path could create a "every admin gets demoted to operator" situation if the IdP's claim mapping is wrong; the guard rejects that demotion at the moment it'd be applied (returns the user to the login page with `oidc_error=role_change_blocked` and audit entry; admin must fix the mapping or promote a local admin first).
2. **Config** — extend `internal/server/config` with the OIDC block + env-var overrides; load JWKS lazily
3. **Discovery + JWKS** — small helper that fetches `<issuer>/.well-known/openid-configuration` once at startup, caches `authorization_endpoint`, `token_endpoint`, `end_session_endpoint`, `jwks_uri`. JWKS refreshed on first failed verification.
4. **Login start handler** — `/auth/oidc/login`
5. **Callback handler** — `/auth/oidc/callback`, with the four claim-resolution branches
6. **Logout handler augmentation** — branch on `auth_source`
7. **Login form rejection** — local-user password form rejects OIDC accounts
9. **UI** — `oidc` chip on users list, disabled fields on edit-form for OIDC users, login page SSO button + error banner
10. **Tests** — config parse tests; happy-path callback test using a fake IdP (httptest server with a hand-rolled discovery doc + JWKS); username-collision test; no-role-match test; logout test
11. **Sweep** — full Playwright walk against an actual IdP (Authelia in a Docker container) — admin gets in via OIDC, role mapping works, logout redirects through IdP, OIDC user can't password-login
## Test strategy
The IdP is the hard part to test cleanly. Two layers:
- **Unit / integration tests** use a stub OIDC provider built into the test harness — `httptest.Server` exposing `.well-known/openid-configuration`, a token endpoint that signs minted JWTs with a test ECDSA key, and a JWKS endpoint serving the public key. This covers every code path without a real IdP. Pattern: each test mints its own claims and runs the callback against the stub.
- **Smoke env** runs against a real Authelia container (existing `compose.smoke.yaml`-style file or one-liner `docker run`) for the final sweep — confirms the discovery doc isn't being misread, real JWT verification works, real `end_session_endpoint` redirect works.
## Out of scope (deferred)
- **Multi-provider** support (`providers:` array)
- **Back-channel logout** (RFC 8138) — schema isn't blocked from adding it later
- **UI-driven role mapping** (config-only in v1)
- **Refresh tokens / mid-session role re-evaluation** — login-only refresh in v1
- **`oidc.disable_local_users`** flag — both paths stay open in v1
- **OIDC user dashboard chip / badges** beyond the small `oidc` indicator on the users list
- **Per-user "auth source" filter on the users list** — sortable headers cover most of the use case
## Risks / gotchas
- **JWKS key rotation** — refresh on first failed verification is the standard fix; document the cache TTL (1h) in the config block.
- **Clock skew** — accept `iat`/`exp` with a 60s leeway; matches what most OIDC libraries do.
- **End-session 404 / not advertised** — degrade gracefully; just drop the session and 303 to `/login`. Don't 500 the logout because the IdP doesn't implement RP-initiated logout.
- **Username changes at the IdP** — silently keep the local username (matches our locked decision: subject is the stable key, username is display-only). Document.
- **Role claim is sometimes a string, sometimes an array, sometimes a comma-separated string** depending on IdP — normalise into `[]string` before mapping. Authelia/Keycloak emit arrays; some custom setups emit strings; handle both.
- **Authelia `sub` is an opaque UUID, not the username** (Authelia 4.39+ default for new clients). Don't assume `sub` is human-readable; it's stable but display value is `preferred_username` or `email`. The locked design already keys lookups on `sub` and uses `preferred_username` for the display username, so this is just a correctness note.
- **`end_session_endpoint` may not be published** (Authelia doesn't advertise it for many configs). The locked logout flow already degrades to "drop session + redirect to /login" when the discovery doc lacks it; no extra config needed.
- **Password-form bypass for OIDC users via /api/auth/login (JSON)** — same rejection rule applies, not just the HTML form.
## Acceptance
- [ ] An OIDC user with `roles: ["rm-admins"]` can sign in, becomes an admin, is visible in `/settings/users` with an `oidc` chip
- [ ] Same user signing in again resolves to the same row (no duplicate)
- [ ] Same user with `roles: ["something-else"]` is denied, lands on `/login?oidc_error=no_role_match` with a banner, no row created
- [ ] OIDC user can't password-login through `/login` or `/api/auth/login`
- [ ] Admin disables an OIDC user → next OIDC login is rejected, existing session bounced (existing disable-mid-session)
- [ ] Sign out as an OIDC user → 303 to IdP's end-session URL (when advertised); no end-session URL → 303 to `/login`
- [ ] OIDC config absent → password login works exactly as today (zero behavioural change)
- [ ] Username collision: a local `alice` exists, OIDC user with `preferred_username=alice` and a different `sub` → blocked at sign-in with the clear error page
- [ ] Last-admin guard refuses to demote the only enabled admin even if the IdP's role mapping says otherwise
- [ ] All existing tests pass; new test suite covers the four claim-resolution branches and logout
**Status:** approved 2026-05-05. Pivots P5-03 away from `goreleaser` +
binary archives toward a single Docker image as the only public
deliverable.
## Goal
One artifact per tag: the `restic-manager` server image, multi-arch
(linux amd64 + arm64), published to the Gitea container registry of
this self-hosted instance. The image bakes in cross-compiled agent
binaries (linux amd64, linux arm64, windows amd64), the install
scripts, and the systemd unit at a read-only image path. The running
server distributes those agents and scripts via its existing
`/agent/binary` and `/install/*` endpoints; operators on N hosts never
download a release artifact directly.
Source builds via `make build` remain a first-class path for anyone
who wants binaries.
## Non-goals
- Standalone binary archives (`.tar.gz`, `.zip`) on the release page.
- darwin / windows-arm64 agent targets — neither is service-tested.
- `goreleaser`. Not used.
- `cosign`, `SBOM`, `in-toto`, `minisign`. Re-promote when we ship
binaries outside an image (Phase 6 candidate).
- GHCR / GitHub mirror. Single source of truth = Gitea.
## Decisions captured (with one-line rationale)
| ID | Decision | Why |
|----|----------|-----|
| D1 | One artifact: server Docker image | Architecture already routes agent distribution through the server (`/agent/binary`); release surface should mirror that. |
| D2 | Trigger: `tag-push` (`v*.*.*`) **plus**`workflow_dispatch` | Tag for real cuts; dispatch for snapshot iteration without polluting tag history. |
| D3 | Build matrix: linux amd64+arm64 server image; agent cross-compiles for linux amd64+arm64+windows amd64 | Mirrors the existing CI build matrix; nothing ships that hasn't been service-tested. |
| D4 | Image-baked, separate path (`/opt/restic-manager/dist/`); HTTP handler reads `<DataDir>/...` first, falls back to `/opt/...` | Volume stays purely operator state; image content is immutable per tag; eliminates the smoke-env "stale agent" footgun in production. |
| D5 | Tag fan-out: `vX.Y.Z`, `X.Y`, `X`, `latest` — but `latest` is held back until `v1.0.0` | Standard rolling-minor pattern; pre-1.0 forces explicit pinning. |
| D6 | Snapshot tag: `:snapshot-<shortsha>`, never moves `latest` | Operator can never accidentally pull an unblessed build. |
| D7 | Version embedding via `-ldflags`: `main.version`, `main.commit`, `main.date` on both `cmd/server` and `cmd/agent` | Server already had `version`; add `commit`/`date` to both for parity and traceability. |
| D8 | Registry: Gitea container registry on this instance, under `<host>/<owner>/restic-manager` | One source of truth, no external creds. |
| D9 | Integrity: a `SHA256SUMS` file + the manifest digest in the release notes; nothing else | Image is the unit of trust; pull-by-digest is the verification primitive. |
| D10 | P1-31 (signed binaries) stays deferred | Re-promote the day we ship binaries outside an image. |
Scope: P6-01 (agent self-update mechanism) and P6-02 (dashboard
version reporting + fleet update UI). One spec, one branch — the
two tasks are tightly coupled (P6-02 is the operator surface for
the mechanism P6-01 ships).
## 1. Background
P5-03 pivoted release distribution to a single multi-arch server
Docker image, with cross-compiled agent binaries baked under
`/opt/restic-manager/dist/agent-binaries/` and served via
`GET /agent/binary?os=…&arch=…`. The plumbing already does
dual-path lookup: `<DataDir>/agent-binaries/<name>` overrides the
image-baked copy, so an operator can hot-patch a pre-release agent
without rebuilding the image.
That makes the server the natural distribution point for agent
upgrades. "Update agent" collapses to "re-fetch from your own
server" — no apt repo, no Chocolatey, no third-party signing infra,
and version pinning is automatic because the server only ever
serves the agent that matches its own release.
This spec wires up the update mechanism end-to-end and the
operator surface that drives it.
## 2. Decisions
| # | Decision | Rationale |
|---|----------|-----------|
| 1 | Operator-driven only — no auto-update | Matches the rest of the app's job-dispatch model; avoids "bad release upgrades every host instantly"; auto-update can be added later as a setting flip if asked |
| 2 | Linux: just exit, let systemd restart. Windows: detached helper script. | Linux supports rename-while-open; Windows holds an exclusive lock on the running .exe |
| 3 | M1 (keep `agent.old` on disk) + M2 (rolling fleet update with halt-on-fail). Skip M3 (auto-rollback watchdog). | M1 is ~5 lines, M2 falls naturally out of P6-02's UI, M3 is a lot of plumbing for "shipped a binary that doesn't start" |
| 4 | Skip sha256 digest verification for v1 | TLS already covers the corruption-in-transit threat; image-tampering is image-build's problem, not the agent's |
| 5 | Exact string version match for "out of date" | With server-bundled binaries there's exactly one canonical version per server image — anything else is out of date by definition |
| 6 | WS envelope only, no `restic-manager-agent update` CLI subcommand | YAGNI; no concrete consumer; the underlying logic is reusable when one appears |
## 3. Wire protocol
### 3.1 Server → agent: `command.update`
```
{
"type": "command.update",
"id": "<envelope id>",
"payload": {
"job_id": "<ulid>"
}
}
```
No `os` / `arch` / `version` in the payload — the agent already
knows its own build target and fetches from its configured server
URL via the existing `/agent/binary` handler. Including a target
version would also tempt the agent into version-comparison logic;
keep that on the server side.
### 3.2 Job lifecycle (server-driven)
The agent has limited ability to report on its own restart, so the
job state machine lives on the server:
- **queued → running** when the envelope is dispatched.
- **running → succeeded** when the agent re-hellos with
`agent_version == server.Version` after dispatch and within
the timeout. Audit `host.update_succeeded`.
- **running → failed (timeout)** if 90 seconds pass without a
hello carrying the matching version. Audit `host.update_failed`.
| `RM_METRICS_TOKEN` | string | If set, callers must send `Authorization: Bearer <token>`. Compared with `crypto/subtle.ConstantTimeCompare`. |
| `RM_METRICS_TRUSTED_CIDR` | comma-CIDR | If set, callers must hit from a source IP inside one of these CIDRs. Reuses the existing `RM_TRUSTED_PROXY` semantics for honouring `X-Forwarded-For`. |
If both are set, both must pass (AND, not OR — a token leak doesn't grant access from outside the network, and a trusted-network compromise doesn't grant access without the token). If only one is set, that one alone gates access.
Falls back gracefully if the registry was never wired (tests).
Route registration in `server.go`:
```go
if s.deps.Cfg.MetricsAuthEnabled() {
r.Get("/metrics", s.handleMetrics)
}
```
## Cardinality + cost
Per scrape: O(hosts) gauge rows + O(kinds × statuses × buckets) histogram rows. For a 100-host fleet that's ~700 host rows + ~270 histogram rows per scrape — well under any practical limit. The store reads we issue per scrape are: `ListHosts` (already exists, one query), `ListAlerts` filtered by open status (one query), `GetHostRepoStats` already projected onto `Host` via `repo_size_bytes`. No N+1.
A 10s scrape interval against a 100-host fleet is cheap: each scrape is a couple of indexed sqlite reads + a small string render. We're nowhere near a place where caching the snapshot would be worthwhile.
## Documentation (P6-05)
- `docs/prometheus.md` — sibling to the existing `docs/reverse-proxy.md`. Sections: enabling the endpoint (env vars), Prometheus scrape config snippet (with bearer + tls), the metric reference table (copy-pasted from this spec), the dashboard import instructions.
- `deploy/grafana/restic-manager-dashboard.json` — Grafana 11+ dashboard JSON. Single Prometheus datasource variable, six panels:
1. **Fleet status** — stat panel showing `rm_hosts_online / rm_hosts_total` + a sparkline.
2. **Open alerts** — stat panel by severity (`sum by (severity) (rm_active_alerts)`).
3. **Hosts** — table of `host`, `online`, `last_backup` (relative time via `time() - rm_host_last_backup_timestamp_seconds`), `repo_size`, `snapshots`.
4. **Repo size over time** — time series, one line per host, `rm_host_repo_size_bytes`.
5. **Backups failing** — time series counting hosts where `rm_host_last_backup_success == 0`.
6. **Job duration p95** — `histogram_quantile(0.95, sum by (kind, le) (rate(rm_job_duration_seconds_bucket[1h])))` over a 1h window.
Dashboard is committed as plain JSON; an operator imports it through the Grafana UI ("+ → Import → upload JSON file") or Grafana provisioning.
## Testing
- Unit tests for `metrics.Render` against a fixed snapshot — golden-file style. Pin the exact line ordering (sorted by metric name + label set) so diffs stay tractable.
- Unit tests for `metrics.Registry.ObserveJob` — concurrent writes, bucket boundary correctness, snapshot independence.
- Handler tests for `/metrics` covering: no auth configured → 404; token configured + missing → 401; token configured + correct → 200 + body sniff; CIDR configured + wrong source → 401; CIDR configured + right source → 200; both configured → require both.
- End-to-end smoke verification deferred to manual operator walk-through; full Playwright pass is P5-06's job.
## Out of scope, explicitly
- Per-job latency tracking with `job_id` labels (cardinality bomb).
- Restore-specific metrics (P3 surfaces are still settling).
- Histograms keyed on host (kind × status × buckets × hosts is a Prom anti-pattern).
- Auto-discovery / file-SD generators for Prometheus.
A short, structured walkthrough of the assets restic-manager
protects, the actors that interact with it, the attack surfaces
exposed, and the mitigations in place. This document is written for
operators considering a deployment and for contributors evaluating
security-sensitive changes. It is **not** a formal certification —
restic-manager has not been third-party audited.
Last reviewed: **2026-05-09** (against v1.0.0).
---
## 1. Assets
In rough order of sensitivity:
| Asset | Why it matters |
|---|---|
| **Restic repository passwords** | Decrypt every backup in the repo. Server holds them encrypted at rest; agents need plaintext at backup-time. |
| **Repository URLs with embedded credentials** (e.g. `rest:https://user:pass@host/repo`) | Same as above — read access to the repo is leak-equivalent to the password. |
| **Agent bearer tokens** | Long-lived credentials authenticating each agent → server WS. Compromise lets an attacker impersonate that host (push fake snapshots, ack fake schedule versions, exfiltrate repo creds the server pushes back). |
| **Server session cookies** | Browser-side session for human operators. Compromise = full UI access at the user's role for the cookie's TTL (24h). |
| **Database secret key** | Wraps every encrypted-at-rest field (repo creds, agent enrolment payloads). Loss of the file means decryptable backups; rotation requires re-pushing creds to every agent. |
| **Bootstrap / setup tokens** | One-shot, time-limited; mint admin or invited-user accounts. |
| **Audit log** | Tamper-evident record of admin actions; read-only via UI. |
| **Backup data on the wire** | Restic itself encrypts on the agent before sending — see "out of scope". |
---
## 2. Actors
| Actor | Trust |
|---|---|
| **Anonymous internet** | Untrusted. Should not reach the server unless proxied behind auth (see deployment guide). |
| **Authenticated viewer** | Read-only on hosts/jobs/alerts/audit. |
| **Authenticated operator** | Add/remove hosts, edit schedules, run backups/restores, mint enrolment tokens, ack alerts. |
| **Authenticated admin** | All of the above plus user management, role changes, fleet update controls, secret-key visibility (no — see below). |
| **Agent** | Trusted to backup-and-report on its own host only. Cannot read other hosts' creds. Bearer-authenticated. |
| **Restic backend (rest-server / S3 / B2 / etc.)** | Out of scope for this document — assumed to authenticate the credentials presented and not collude. |
- **Risk**: race between server start and admin creation — an attacker who reaches the server first can claim admin.
- **Mitigations**:
- Bootstrap token printed to stderr exactly once; held in memory, not persisted.
- The UI form on `/bootstrap` uses the in-memory token automatically (no token field for the operator to type or expose).
- Both surfaces self-disable the moment any user row exists (`CountUsers > 0`).
- Token is also blanked from process memory after success (defence in depth).
- **Residual risk**: if an operator brings up the server on the public internet before reaching the bootstrap page, an attacker reaching `/bootstrap` first wins. **Recommendation**: bring the server up behind an existing trusted network or with the listener bound to `127.0.0.1` until first-run is complete.
### 3.2 Local user accounts
- **Surface**: `/login`, `/api/auth/login`.
- **Mitigations**: Argon2id password hashing with per-deployment params; constant-time password compare; session-cookie minting via `crypto/rand`; session rows hash-only (raw token only in cookie).
- **Rate limiting**: Currently not in place at the application layer — the project assumes a reverse proxy enforces login throttling. **Recommendation**: front the server with `caddy`/`nginx` rate-limit rules in production.
- **Password policy**: 12-character minimum on bootstrap and user-setup paths; no maximum, no rotation, no history. Sufficient for self-hosted ops; tighten in policy if a deployment requires it.
### 3.3 OIDC SSO
- **Surface**: `/auth/oidc/*` — generic OIDC client, JIT user provisioning.
- **Mitigations**: state + nonce per flow; role mapping is server-configured (claims trusted only to identify the user, not pick role); user-disabled gate runs after IdP success.
- **Residual risk**: misconfigured role-mapping rules can promote any IdP user to admin. **Recommendation**: review `cfg.OIDC.RoleMappings` carefully.
### 3.4 Agent enrolment
- **Surface**: `/api/agents/enroll` (token-authenticated), `/api/agents/announce` (anonymous, then operator-approves).
- **Mitigations**:
- Token path: one-shot, hashed at rest, 1h TTL; agent receives a fresh long-lived bearer in the response.
- Announce path: agent supplies an Ed25519 public key; operator sees a fingerprint to confirm out-of-band before accepting.
- Bearer tokens are SHA-256 hashed in the DB.
- **Residual risk**: an attacker on the network between operator and target host who intercepts the install snippet can enrol *as* the target. The install script must be served over TLS in production (the docker-only deployment defaults to TLS-by-default; bare-metal deployers must configure their own).
### 3.5 Agent → server WebSocket
- **Surface**: persistent WS authenticated by agent bearer.
- **Mitigations**: bearer is presented per-connection; server pins the agent fingerprint for the announce flow; messages are envelope-typed and rejected if shape-invalid.
- **No payload-level signing** today — TLS is the integrity boundary. A man-in-the-middle with a valid cert chain could swap messages. **Recommendation**: pin the server cert via `RM_SERVER_CERT_PIN_SHA256` if running over a network you don't fully control.
### 3.6 Repo credential lifecycle
- Stored encrypted at rest under the AEAD secret key.
- Pushed to the agent over the WS on hello, on creds change, and on demand.
- Agent persists them encrypted (per-host secret key derived from a value known only to the agent).
- Logged surfaces use `restic.RedactURL()` to strip `user:pass@` from URLs before they reach `slog`.
- Plaintext form is constructed only at `exec.Command` time inside the agent, never stored on a struct field that could be slogged.
### 3.7 Restore
- Operators can restore to any path the agent (running as root) can write.
- Cross-host restore (host A's snapshot → host C) is **deferred** — see F-01. The current single-host restore does not require granting any cross-host privileges.
### 3.8 Audit log
- Append-only writes from the application; SQLite enforces no schema-level immutability.
- A compromise of the SQLite file (via OS-level access) can edit the audit log. **Recommendation**: ship audit entries to an append-only sink (syslog / Loki / Splunk) if tamper-evidence beyond the OS boundary is required.
### 3.9 Self-update channel (P6)
- Agents fetch new binaries via the WS transport from the server.
- Binaries are signature-checked by the agent against a key embedded in the existing agent (see `internal/fleetupdate/`).
- **Residual risk**: a server compromise lets the attacker push code to every agent (running as root). The signing-key compromise window is the same as the server compromise window because both live on the server. Splitting the signing key onto a separate signer is future work (not v1).
---
## 4. Out of scope
- **Restic itself** — its repository format, encryption, and backend protocol are upstream-trusted.
- **The host OS** — root compromise of a host obviously compromises that host's backups.
- **The backup destination** — restic-manager assumes the rest-server / object-store / SFTP target enforces its own auth.
- **Side-channel attacks** on the server process (RAM dump, process tracing).
- **Physical access** to the server's disk.
---
## 5. Reporting
Found something we missed? See `SECURITY.md` for the disclosure
process. Coordinated disclosure preferred; the project is
maintained by a small team and we'll respond as quickly as we
{name:"never backed up is overdue",cron:daily,lastBackup:nil,now:mustParse("2026-06-15T09:00:00Z"),want:true},
{name:"missed last nights window",cron:daily,lastBackup:ptrTime(mustParse("2026-06-13T02:05:00Z")),now:mustParse("2026-06-15T09:00:00Z"),want:true},
{name:"backed up after the most recent window",cron:daily,lastBackup:ptrTime(mustParse("2026-06-15T02:05:00Z")),now:mustParse("2026-06-15T09:00:00Z"),want:false},
{name:"unparseable cron is never overdue",cron:"not a cron",lastBackup:nil,now:mustParse("2026-06-15T09:00:00Z"),want:false},
@@ -310,7 +310,7 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
> **Sweep verified (smoke env):** admin adds operator → setup link generated → curl-as-new-user fetches /setup (200, page shows username) → POSTs password → 303 to / + Set-Cookie → operator authenticated → 200 on /, 200 on /settings/account, **403 on /settings/users** (admin-only) → admin disables user → operator's next request is **401** + session row count drops to 0 → audit log shows `user.created` + `user.setup_completed` for the cycle. All 26 implementation tasks landed; full `go test ./...` green.
- [x] **P4-05** (L) OIDC login (generic provider config, group → role mapping)
> **As shipped (2026-05-05):** Authorization Code + PKCE (S256) against any OIDC IdP advertising standard discovery. Config is YAML+env (`oidc.issuer`, `oidc.client_id`, `oidc.client_secret`/`_file`, `oidc.role_claim` default `groups`, `oidc.role_mapping`, `oidc.display_name`, `oidc.redirect_url`); empty issuer → OIDC disabled, no routes mounted. Migration 0019 adds `users.auth_source`/`oidc_subject` (partial unique index on `oidc_subject`), `sessions.id_token`, and a small `oidc_state` table for state+verifier round-trip (cleaned up every alert tick, 5 min TTL). Login page renders **Sign in with `<display_name>`** above the local form when OIDC is enabled; the SSO button kicks off a 303 to the IdP with state + S256 code_challenge persisted server-side. Callback verifies ID token, fetches `/userinfo` to merge claims (Authelia / many IdPs only put `sub` in the ID token and surface `preferred_username`/`email`/`groups` from userinfo), maps the first matching group to a role; **no match → deny banner**, no row created, audit `user.oidc_login_blocked`. Username-collision with an existing local user → same deny path with `username_taken`. New user → JIT-provisioned with `auth_source='oidc'`, `oidc_subject=<sub>`, `password_hash=''`. Returning user → looked up by `oidc_subject` (stable when usernames change at the IdP), role + email refreshed on every login. Local password login is rejected for `auth_source='oidc'` users. Logout posts to `/logout` and, when the IdP advertised `end_session_endpoint`, follows up with RP-initiated logout (carries `id_token_hint` + `post_logout_redirect_uri=BaseURL`); when not advertised (Authelia in our smoke env), the local session is cleared and the browser lands on `/login`. Users list shows a small **oidc** chip beside enabled/disabled; the edit page disables username/email/role for OIDC users (server-side guard mirrors UI, returns 403). Force-logout, disable, and the last-admin guard from P4-04 all still apply. **Live Authelia sweep verified all four paths against `https://auth.dcglab.co.uk`:** rm-admin → admin role + JIT row + chip + readonly edit; rm-operator → operator JIT, 403 on `/settings/users`; rm-viewer → viewer JIT, 403 on `/hosts/new`; rm-other (group not in role_mapping) → no_role_match banner, no row created, audit logged. Returning rm-admin login resolved to the same row by sub. Screenshots in `_diag/p4-05-sweep/`. Out-of-scope and on Phase 6 candidate list: refresh tokens, back-channel logout, multiple providers, post-login PKCE for the cookie itself.
> **As shipped (2026-05-05):** Authorization Code + PKCE (S256) against any OIDC IdP advertising standard discovery. Config is YAML+env (`oidc.issuer`, `oidc.client_id`, `oidc.client_secret`/`_file`, `oidc.role_claim` default `groups`, `oidc.role_mapping`, `oidc.display_name`, `oidc.redirect_url`); empty issuer → OIDC disabled, no routes mounted. Migration 0019 adds `users.auth_source`/`oidc_subject` (partial unique index on `oidc_subject`), `sessions.id_token`, and a small `oidc_state` table for state+verifier round-trip (cleaned up every alert tick, 5 min TTL). Login page renders **Sign in with `<display_name>`** above the local form when OIDC is enabled; the SSO button kicks off a 303 to the IdP with state + S256 code_challenge persisted server-side. Callback verifies ID token, fetches `/userinfo` to merge claims (Authelia / many IdPs only put `sub` in the ID token and surface `preferred_username`/`email`/`groups` from userinfo), maps the first matching group to a role; **no match → deny banner**, no row created, audit `user.oidc_login_blocked`. Username-collision with an existing local user → same deny path with `username_taken`. New user → JIT-provisioned with `auth_source='oidc'`, `oidc_subject=<sub>`, `password_hash=''`. Returning user → looked up by `oidc_subject` (stable when usernames change at the IdP), role + email refreshed on every login. Local password login is rejected for `auth_source='oidc'` users. Logout posts to `/logout` and, when the IdP advertised `end_session_endpoint`, follows up with RP-initiated logout (carries `id_token_hint` + `post_logout_redirect_uri=BaseURL`); when not advertised (Authelia in our smoke env), the local session is cleared and the browser lands on `/login`. Users list shows a small **oidc** chip beside enabled/disabled; the edit page disables username/email/role for OIDC users (server-side guard mirrors UI, returns 403). Force-logout, disable, and the last-admin guard from P4-04 all still apply. **Live Authelia sweep verified all four paths against local auth:** rm-admin → admin role + JIT row + chip + readonly edit; rm-operator → operator JIT, 403 on `/settings/users`; rm-viewer → viewer JIT, 403 on `/hosts/new`; rm-other (group not in role_mapping) → no_role_match banner, no row created, audit logged. Returning rm-admin login resolved to the same row by sub. Screenshots in `_diag/p4-05-sweep/`. Out-of-scope and on Phase 6 candidate list: refresh tokens, back-channel logout, multiple providers, post-login PKCE for the cookie itself.
- [x] **P4-07** (S) Per-host tags + dashboard filtering by tag
@@ -480,11 +480,11 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
- [x] **X-01** Keep CHANGELOG.md updated (Keep-a-Changelog format). ✅ Landed: `CHANGELOG.md` at the repo root with a v1.0.0 entry summarising what each phase shipped, plus an empty Unreleased section to accumulate changes after the tag. Updated on each release going forward.
- [ ] **X-02** Track restic version compatibility matrix
- [ ] **X-03** Periodic dependency updates (`dependabot` or `renovate`)
- [] **X-04** Threat-model review at end of each phase
- [] **X-05** Proper first-run onboarding UI: admin shouldn't need to `curl``/api/bootstrap` by hand. Render the bootstrap form on the same login page (extra "setup token" field shown only while no admin user exists, hidden after); on submit POST to `/api/bootstrap`, then drop straight into a session. Surface the one-time token from the server log somewhere copy-able (or print a clickable URL with the token in the query string at first-run). Also: relax the 12-char password floor for the first-run path or document it in the form so `admin` doesn't silently fail validation.
- [x] **X-04** Threat-model review at end of each phase. ✅ Landed: `docs/threat-model.md` covering assets, actors, attack surfaces (bootstrap, local accounts, OIDC, agent enrolment, agent ↔ server WS, credential lifecycle, restore, audit log, self-update channel), residual risks, and explicit out-of-scope items. Reviewed against v1.0.0 surface; refresh on each tagged release.
- [x] **X-05** Proper first-run onboarding UI. ✅ Landed: bootstrap form already lives at `/bootstrap` and `/login` redirects to it when no users exist (so an operator hitting the server in a browser is guided into setup automatically — the form takes username + password only, no token field needed because the server holds the in-memory token and applies it server-side). Improvements added here: at first-run startup the server now prints a clickable `$RM_BASE_URL/bootstrap` URL (or a fallback message when `RM_BASE_URL` is unset) alongside the existing one-shot token for headless `/api/bootstrap` use; the bootstrap form's password field shows an explicit "Minimum 12 characters" hint so the rule is visible before submission instead of failing on submit.
---
@@ -496,6 +496,10 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
- [x] **NS-01** Admin-driven host deletion. ✅ Landed: store `DeleteHost` (FK cascade revokes the agent bearer along with everything else), admin-band `POST /hosts/{id}/delete`, danger-zone form on host detail with hostname-confirm, audit `host.deleted`, live WS connection closed pre-delete. Original scope below for reference. No UI or API surface today — once a host is enrolled the only way to remove it is hand-editing SQLite, which then cascades through schedules/jobs/snapshots/source-groups via the FK chain. Needs: store-level `DeleteHost` + cascade audit, admin-band `DELETE /api/hosts/{id}` and form-post variant, confirm-modal on the host-detail page, audit entry, and a decision on whether to also revoke the agent's bearer (recommend: yes, so a re-installed host comes back through the normal pending-host accept flow).
- [x] **NS-02** Recoverable enrollment-token UX. ✅ Landed: `Store.ListOutstandingEnrollmentTokens` + `DeleteEnrollmentToken`; outstanding-tokens panel on the Add-host page (short hash, redacted repo URL, created/expires) with per-row Regenerate (revokes old hash, mints fresh raw token preserving repo creds + initial paths, 303s to `/hosts/pending/{newToken}`) and Revoke (delete + audit). Audit actions `enrollment_token.regenerated` / `enrollment_token.revoked`. Original scope below. Today `POST /hosts/new` mints a token and 303s to `/hosts/pending/{token}`; if the operator closes that tab the install snippet is lost and there's no UI surface to find it again — the row sits in `enrollment_tokens` until TTL expiry, invisible. Needs: store-level `ListOutstandingEnrollmentTokens` returning `(token_hash, created_at, expires_at, repo_url_redacted, initial_paths, attached_host_id_or_null)`; a small list section on the Add-host page (and/or Settings) showing outstanding tokens with created/expires-in and the redacted repo URL; admin-band `POST /api/enrollment-tokens/{id}/regenerate` (revokes the old hash, mints a fresh raw token, re-uses the original attachments — same pattern as the user-setup-token regenerate flow) and `POST /api/enrollment-tokens/{id}/revoke`. Choose regenerate over "show original token" because we only persist hashes, never raw tokens.
- [x] **NS-03** Auto-init repo on first onboard, surface credential failures eagerly. ✅ Landed: migration 0020 adds `hosts.repo_status` (`unknown`/`ready`/`init_failed`) + `repo_status_error`; WS handler projects every init job's terminal state onto the host row (with idempotent "config file already exists" → ready); creds-save handlers (UI + JSON API) reset status to `unknown` and dispatch a fresh init when the agent is online; new `/hosts/{id}/repo/probe` retry endpoint and a status banner on the repo page. Remainder of original scope below. surface credential failures eagerly. Today the operator types repo URL + creds during Add-host and the credentials are pushed to the agent on connect, but no `restic init`/probe runs until the first scheduled job — so a typo in the password or a wrong URL goes undetected for hours/days, manifesting as a silent missed-backup. Wanted behaviour: when the host completes enrolment (or when an admin saves new repo creds), the server dispatches a one-shot probe job that runs `restic cat config` (cheap, repo-existence + creds-validity in one call). On `Is there already a config file? unable to open config file` → run `restic init`. On success → mark the host's repo as ready. On any other error (network, auth, fingerprint) → surface a panel-level error on the host detail page and audit the failure, leaving the host in an "init pending" state with a "Retry" button. Needs: a new `JobKind` (or piggyback on an existing one) for the probe, server-side state on the host row (`repo_status` enum: `unknown`/`ready`/`init_pending`/`init_failed`), UI panel that shows the state, and clear copy on the Add-host page so the operator knows the save isn't fire-and-forget.
- [x] **NS-05** Drop redundant `actions/setup-go` from `.gitea/workflows/ci.yml`. ✅ Already gone — verified `.gitea/workflows/ci.yml` has zero `actions/setup-go@v5` invocations and no `GO_VERSION` env; the file's header comment now documents that the runner image (`gitea.dcglab.co.uk/steve/ci-runner-go`) is the single source of truth for the Go version. Closing as done; no further code change needed.
- [x] **NS-06** Remove the permanently-disabled "Run backup now" button from `web/templates/partials/host_chrome.html`. ✅ Landed: dropped the disabled tombstone button from the host header action row; only "Edit credentials" + the ⋯ menu remain. Per-source-group Run-now on `/hosts/{id}/sources` is the only path now. No e2e change needed — `smoke.spec.ts` does not assert on host_chrome's button row.
- [x] **NS-07** Relative timestamps go stale on long-open tabs. ✅ Landed: `formatRelTime` now wraps its label in `<time data-rel-ts=…>` and both layouts (`base.html`, `chromeless.html`) carry a small ticker that re-renders every 30s, so a page rendered an hour ago no longer keeps showing "2h ago" when the wall-clock truth is "3h ago". Covered by `funcs_test.go`. The bug: every relative label was computed once at server render and never updated client-side, so a job-detail page left open drifted further from reality the longer it sat.
- [x] **NS-08** Always-On vs intermittent host mode. ✅ Landed: a host can now be marked not-always-on (laptop/workstation) so it stops generating offline-alert noise when it legitimately sleeps. Migration 0024 adds `hosts.always_on` (default 1 = today's 24×7 behaviour; intermittent is strictly opt-in). The alert engine suppresses `agent_offline` for intermittent hosts and instead wires up the previously-dead `stale_schedule` alert for them — raised at a 7-day global threshold when the host has an enabled schedule and a stale last backup, resolved on the next successful backup. A new server-side catch-up scheduler (`internal/server/http/catchup.go`) arms on agent hello and fires from the existing 30s pending-drain tick: ~60s after an intermittent host reconnects it dispatches a backup for any enabled schedule whose window elapsed while asleep (overdue = `cron.Next(lastBackup) <= now`, reusing the shared `cronParser`), guarded against firing when the host bounced offline, flipped to always-on, or already has a job running. Overdue is measured against the per-host `LastBackupAt` (exact for the common single-schedule laptop; a known coarseness for multi-cadence hosts, documented in code). Operator toggle via `POST /hosts/{id}/mode` (audited `host.mode_updated`), which also clears open offline/staleness alerts so the next sweep re-settles. UI: intermittent offline hosts render a calm grey `asleep · <relTime> · will catch up on return` state (new `.dot-asleep`) instead of red "offline"; a `24×7` chip shows only for always-on hosts; a "presence" inline toggle on the host header. Design + plan in `docs/specs/2026-06-15-always-on-host-mode-design.md` and `docs/plans/2026-06-15-always-on-host-mode.md`. Spec §2 (online/offline mechanics) deliberately left untouched. Out of scope for v1: per-host staleness thresholds, continuous (non-reconnect) overdue evaluation, per-schedule last-success tracking.
- [x] **NS-04** Dashboard parity with the alerts screen: live refresh, column sorting, filters. ✅ Landed: `/` now parses `q`/`status`/`repo_status`/`tag`/`sort`/`dir` query params (round-trip durable for bookmarks); table is wrapped in an `id="hosts-table"` htmx live-poll matching the alerts cadence (5s, gated on `document.visibilityState` and `localStorage.rm-dashboard-live`); filter row above the table with hostname free-text + status + repo_status selects + tag chips + clear; column headers (Host / OS · arch / Last backup / Repo size / Snapshots) are clickable links that toggle direction on the active column; pure-Go sort+filter pipeline covered by `dashboard_filter_test.go`. Original scope below. live refresh, column sorting, filters. The host list is currently a static render — operators have to reload to see new heartbeats / job state changes. Mirror the alerts pattern (`web/templates/pages/alerts.html` uses `hx-trigger="every 5s [document.visibilityState==='visible' && localStorage.getItem('rm-alerts-live')!=='off']"` plus a Live/Off toggle so background tabs and explicit-off don't burn server cycles). Add: server-side sort on every meaningful column (name, OS, last-backup time, last-backup status, agent online/offline, restic version, tags), and a small filter row above the table — at minimum free-text on hostname, status (online/offline/never-seen), and tag chips. Columns + filter state should round-trip through query string so a bookmarked / shared URL is durable. Re-use the `host_row` partial that already exists so the live-refresh swap is a clean OOB swap, not a full table re-render.
{{if .UpdateAvailable}}<spanclass="update-chip"title="Agent at {{.Host.AgentVersion}}; server at {{.TargetVersion}}">out of date · {{.Host.AgentVersion}} → {{.TargetVersion}}</span>{{end}}
{{if .UpdateAvailable}}<spanclass="update-chip"title="Agent at {{.Host.AgentVersion}}; server at {{.TargetVersion}}">out of date</span>{{end}}
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.