- CHANGELOG.md: Keep-a-Changelog format, v1.0.0 entry summarising
what each phase delivered.
- docs/threat-model.md: structured walkthrough of assets, actors,
attack surfaces and residual risks; reviewed against v1.0.0.
- cmd/server/main.go: at first-run startup, print a clickable
$RM_BASE_URL/bootstrap URL alongside the existing one-shot
bootstrap token (or a fallback hint when RM_BASE_URL is unset).
- web/templates/pages/bootstrap.html: visible "Minimum 12 characters"
hint under the password field so the rule is communicated
before the operator submits.
- tasks.md: close X-01, X-04, X-05 with notes.
New internal/server/metrics package emits the legacy text/plain
exposition format directly, so we don't pull in
prometheus/client_golang. Endpoint is opt-in via RM_METRICS_TOKEN
and/or RM_METRICS_TRUSTED_CIDR; route is not mounted at all if
neither gate is set. Both gates ANDed when both configured.
Per-host gauges (online, last_backup_*, repo_size_bytes,
snapshot_count, open_alerts, repo_status), server gauges
(hosts_total/online, active_alerts by severity, build_info), and
an in-memory job-duration histogram observed from the existing
MsgJobFinished branch in the WS handler.
Docs in docs/prometheus.md (enable + scrape config + metric
reference + dashboard import). Sample dashboard at
deploy/grafana/restic-manager-dashboard.json - six panels,
Grafana schema 39, single Prometheus datasource variable.
Tests: golden render, concurrent observe, bucket boundaries in
the metrics package; auth matrix (no auth -> 404, token gate,
CIDR gate, both required) in the HTTP layer.
- alert: update_failed (per-host, dedup=hostID) + fleet_update_halted
(system-scoped, host_id NULL via new RaiseOrTouchSystem helper).
- ws: UpdateWatcher tracks in-flight command.update dispatches and
reconciles them against incoming hello envelopes — success path
marks the job succeeded and auto-resolves the alert; 90s timeout
marks the job failed and raises update_failed.
- http: POST /api/hosts/{id}/update (admin-only JSON) + the HTMX
/hosts/{id}/update form variant. Pre-checks: host exists, online,
agent_version != current, no running update job. Refactored core
into Server.dispatchHostUpdate so the fleet worker can share it
without going through HTTP.
- fleetupdate: rolling worker iterating through host slots, halting
on first failure and raising fleet_update_halted. Polling-based
version-match (re-read hosts.agent_version every 1s up to 95s) —
no extra plumbing into the WS hello path. At-most-one-running is
enforced at the store layer (ErrFleetUpdateRunning).
- cmd/server: wire UpdateWatcher and FleetWorker into the main
goroutine; the worker uses a small serverDispatcher adapter that
delegates back into Server.DispatchHostUpdate.
Tests: watcher (success/timeout/mismatch/late-hello), HTTP endpoint
(happy + four pre-check branches + RBAC), worker (two-host happy,
timeout-halt, host-offline-halt, already-at-target skip, cancel
mid-run, double-Start guard).
Single public deliverable per tag: a multi-arch server image, with
cross-compiled agent binaries + install scripts + the systemd unit
baked under /opt/restic-manager/dist/. The /agent/binary and
/install/* handlers fall back from <DataDir>/... to that read-only
path so a fresh container Just Works without first-run staging;
operators can still drop a custom build into <DataDir>/ to override
per-host.
Architecture rationale: agent distribution already routes through
the running server, so the release surface mirrors that — there's
no second source of truth to keep in sync.
Workflow .gitea/workflows/release.yml triggers on v*.*.* tag-push
(fan-out :vX.Y.Z / :X.Y / :X, plus :latest once MAJOR>=1) and
workflow_dispatch (snapshot tag only). Pushes to the Gitea
container registry on this instance.
Both binaries grow main.commit + main.date ldflag targets. Makefile
and Dockerfile fill them; release workflow forwards from gitea.sha
plus a UTC timestamp.
Spec : docs/superpowers/specs/2026-05-05-p5-03-docker-only-release.md
Plan : docs/superpowers/plans/2026-05-05-p5-03-docker-only-release.md
- Construct notification.NewHub and alert.NewEngine at boot in cmd/server/main.go
- Start go alertEngine.Run(ctx) after construction, before the HTTP listener
- Wire AlertEngine and NotificationHub into rmhttp.Deps (fields already existed)
- Remove the TODO(G1) in the offline sweeper; now calls NotifyHostOffline per ID
- ws.HandlerDeps gains an AlertEngine *alert.Engine field; populated
from http.Deps.AlertEngine (nil until G1 constructs the engine)
- runAgentLoop calls NotifyHostOnline after MarkHostHello succeeds
- dispatchAgentMessage MsgJobFinished case calls NotifyJobFinished,
looking up the job Kind via Store.GetJob before notifying
- store.MarkHostsOfflineStaleReturnIDs added: SELECT+UPDATE in one
transaction, returns the IDs that flipped to offline
- offline sweeper in cmd/server/main.go switched to the new variant;
TODO(G1) comment marks where NotifyHostOffline calls will land
Dashboard handler loads ListPendingHosts(now); template renders a
warn-bordered panel above the host table with hostname, OS/arch,
fingerprint (selectable / copyable), source IP, age, expiry. Each
row carries an inline accept form (repo URL/user/password) plus a
Reject button. cmd/server adds a 60s ticker calling
DeleteExpiredPendingHosts so 1h-stale rows drop off.
Two trigger paths land here:
- A 30s ticker in cmd/server calls Server.DrainAllDue(ctx). It
walks pending_runs rows whose next_attempt_at <= now, dedupes by
host, skips offline hosts, and per online host runs DrainPending.
- onAgentHello spawns a background DrainPending(hostID). When a
host comes back, every pending row for it is dispatchable now —
due-ness becomes irrelevant once the wire is back.
Each row's schedule + group are reloaded; ErrNotFound or
disabled-schedule or gone-group abandons the row with a
pending_run.abandoned audit. attempt >= retry_max also abandons.
Otherwise dispatchBackupForGroup is invoked; success deletes the
row, failure bumps attempt with exponential backoff capped at
30m.
Wires a 60s server-side ticker to the pure-logic maintenance.Decide
introduced in the previous commit. Decisions flow through a new
DispatchMaintenance method on *Server, which:
- skips offline hosts (no pending_runs queueing — maintenance is
not a backup, missed fires shouldn't pile up)
- silently skips prune when admin creds aren't bound
- pushes admin creds before prune, then dispatches with
RequiresAdminCreds=true (same as operator-driven prune)
- persists job rows with actor_kind="system"
Reshapes the forget wire payload from a single RetentionPolicy to a
ForgetGroups list (one tag + per-group keep-* per source group). The
agent walks the groups and runs `restic forget --tag <name> --keep-*`
once per group. Dead-code removed: CommandRunPayload.RetentionPolicy,
the old forget JSON-decode in cmd/agent, and the single-policy form of
restic.RunForget.
Closes the P1-21 remainder.
internal/server/ws/jobhub.go — new JobHub. Per-job_id set of
subscribers; each gets a 64-deep buffered channel with a writer
goroutine. Broadcast is non-blocking: if a subscriber is slow,
its channel fills and messages are dropped for that subscriber
only — the agent's read loop is never blocked by a stuck browser.
The agent dispatchAgentMessage path mirrors job.started /
job.progress / log.stream / job.finished envelopes onto the hub
in addition to its existing persistence work. The wire shape is
the same end-to-end, so client-side JS switches on env.type the
same way Go code does.
GET /api/jobs/{id}/stream is the browser endpoint. Auth via
session cookie (HTTP layer); upgrade; subscribe; pump until
context closes.
GET /jobs/{id} renders the live log page. Three states (queued/
running/succeeded/failed) drive the header pill, the progress
bar block, the failure summary panel, and the action button
(Cancel job while running, Back to host afterwards). Already-
persisted log lines are server-rendered on initial load; new
lines arrive over the WS and append to #log-stream. Auto-scrolls
unless the user scrolls up (a "⇢ Follow" pill re-attaches).
On job.finished the page reloads after 600ms to pick up the
final-state header rendered server-side.
POST /hosts/{id}/run-backup now sets HX-Redirect → /jobs/{job_id}
on success so HTMX lands the operator straight on the live log.
For non-HTMX callers (curl / plain form post) it 303s to the
same target.
store.ListJobLogs returns persisted log lines for initial render
on page load.
Browser-verified end-to-end: enrol → run a real backup against a
sibling restic/rest-server → live progress + 11 log lines stream
in → succeeded pill + final stats land after page reload.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P1-28: Tailwind standalone CLI wired into the Makefile. `make tailwind`
downloads the pinned v3.4.17 binary into bin/tailwindcss (gitignored),
builds web/styles/input.css → web/static/css/styles.css. `make build`
now runs the CSS pass first; `make tailwind-watch` for dev. Output is
embedded in the binary via web.FS — single static binary, no Node.
The CSS source carries every component class the v1 mockups defined
(status dots, buttons, host row, log viewer, progress bar, fields,
chips, snippet panel, empty state) so screens that land later can
just reach for them.
P1-23: html/template tree at web/templates with two layouts (base
with chrome, chromeless for login + bootstrap), one nav partial, and
two pages (dashboard placeholder, login). internal/server/ui parses
the tree at startup; ui_handlers.go in the http package wires:
GET / dashboard (303 → /login when unauthed)
GET /login sign-in form
POST /login consume form, mint session cookie, 303 → /
POST /logout drop cookie, 303 → /login
GET /static/* embedded Tailwind bundle
The HTML login flow shares store/session logic with /api/auth/login
via a new authenticateAndSession helper — same security guarantees,
two surface representations (HTML form / JSON).
Verified end-to-end: bootstrap → form-login → authed dashboard →
sign-out → 303 cycle works in the browser; Tailwind output emits
only the component classes referenced in the live templates (9.6kB
minified).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-hosted deployments already terminate TLS at Caddy/Traefik/nginx;
making the server do TLS too means double cert config, dual ACME
plumbing, and an untested code path. Drop RM_TLS_CERT/RM_TLS_KEY,
remove TLSEnabled() and the ListenAndServeTLS branch.
Replace the cookie's "Secure if TLS-in-process" check with a new
RM_COOKIE_SECURE flag (default true). Local HTTP-only testing sets
RM_COOKIE_SECURE=false; production is always behind a TLS proxy and
the cookie stays Secure.
Default port :8443 → :8080. docker-compose binds 127.0.0.1 only and
populates RM_TRUSTED_PROXY. spec.md §4.1/§10.1 rewritten with a
Caddyfile snippet and a hard "do not expose RM_LISTEN publicly"
warning. enrollResponse keeps cert_pin_sha256 in the shape but the
server can't introspect a cert it doesn't terminate — operator
pastes the proxy's hash into -cert-pin at install time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lands the protocol layer end-to-end: an agent can be enrolled
through the operator UI, store credentials, dial back to the server
over WS, complete the protocol_version handshake, and stay
connected with periodic heartbeats.
Server side:
- P1-09 ws.Hub: one Conn per host_id, last-write-wins eviction,
json envelope writer with a write mutex, reader, error envelopes.
- P1-09 ws.AgentHandler: bearer-auth, accept upgrade, hello-stage
(10s deadline, protocol_version checked against
api.MinAgentProtocolVersion → ErrProtocolTooOld with help URL on
reject), main read loop, defer hub register/unregister.
- P1-10 POST /api/agents/enroll consumes a one-time token, mints a
persistent agent bearer (sha-256 stored), creates a host row.
- P1-10 POST /api/enrollment-tokens (operator, session-auth)
issues a 1h one-time token.
- P1-11 hello upserts agent_version + restic_version +
protocol_version on the host row, flips status to online.
- P1-12 heartbeat touches last_seen_at; background sweeper marks
hosts offline after 90s without one.
- store: hosts table accessors, host_schedule_version,
enrollment_tokens FK on consumed_host dropped (audit-only field;
the token gets burned before the host row exists).
Agent side:
- P1-13 internal/agent/config: yaml at /etc/restic-manager/agent.yaml,
atomic Save (tmp+fsync+rename), Enrolled() helper.
- P1-15 internal/agent/wsclient: dial with bearer + optional
TLS cert pinning (sha-256 of leaf), exponential backoff with
jitter (1s → 60s cap), heartbeat goroutine, fatal handling for
ErrProtocolTooOld.
- P1-15 wsclient.Enroll: HTTP POST /api/agents/enroll with sysinfo.
- P1-17 internal/agent/sysinfo: hostname/OS/arch/restic-version
collection. restic detected by `restic version` parse; absent
restic doesn't block startup.
- cmd/agent: -enroll-server / -enroll-token flags drive first-run
enrollment then exit (so the install script can hand off to
systemd to run the persistent service).
End-to-end smoke verified: bootstrap → login → issue token →
enroll → run agent → server logs `ws agent connected` with the
right host_id and protocol_version 1.
All tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P1-01 chi router, slog request log, graceful shutdown via signal
context. Health endpoint, /api/auth/login, /api/auth/logout,
/api/bootstrap. Background sweeper for expired sessions and
enrollment tokens (15 min cadence).
P1-04 (sessions half) HttpOnly Secure-when-TLS cookie carrying a
base64url token; server stores SHA-256(token) so a stolen DB
doesn't yield credentials. Unknown user and bad password collapse
to the same 401 response code so a probe can't enumerate names.
P1-05 first-run admin bootstrap. On a fresh DB the server mints a
one-time token and prints it to stderr inside a banner. The
/api/bootstrap handler accepts {token, username, password},
creates the first admin, then becomes a 409 forever.
P1-07 (partial) audit hooks fire on auth.login and auth.bootstrap.
Full middleware-driven coverage lands with the rest of the API.
internal/server/config: env > YAML > defaults. RM_LISTEN /
RM_DATA_DIR / RM_BASE_URL / RM_TLS_CERT / RM_TLS_KEY /
RM_SECRET_KEY_FILE / RM_TRUSTED_PROXY (CIDR list, validated).
End-to-end smoke test passes: server boots on a fresh dir,
prints the bootstrap token, POST /api/bootstrap creates the admin,
POST /api/auth/login returns 200 with a session cookie.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>