restic-manager — Tasks
Tasks are grouped by phase. Each task has an ID for cross-referencing, an estimated size (S/M/L), and acceptance criteria.
Sizes: S = under a day, M = 1–3 days, L = 3–7 days.
Phase 0 — Project bootstrap
- P0-01 (S) Initialize Go module, cmd/server, cmd/agent, baseline internal/ packages
- P0-02 (S) Add LICENSE (PolyForm Noncommercial 1.0.0), README stub, CONTRIBUTING placeholder
- P0-03 (S) Set up golangci-lint, gofumpt, goimports; pre-commit config
- P0-04 (S) Gitea Actions (replaces GitHub Actions): build matrix (linux amd64/arm64, windows amd64), unit tests, lint
- P0-05 (S) Dockerfile.server (multi-stage, distroless), deploy/docker-compose.yml
- P0-06 (S) Makefile / taskfile.yml with common targets (build, test, run, release)
Phase 1 — MVP: enrollment, visibility, on-demand backup
Server foundations
- P1-01 (M) HTTP server scaffolding (chi, structured logging via slog, graceful shutdown)
- P1-02 (M) SQLite store layer (modernc.org/sqlite) + migrations (hand-rolled, embed.FS)
- P1-03 (M) Schema for users, sessions, hosts, repos, credentials, jobs, job_logs, snapshots, audit_log
- [~] P1-04 (M) Auth: argon2id password hashing, login/logout, session cookies; CSRF middleware deferred to P1-23 (UI work); REST clients use bearer/session-only flows
- P1-05 (S) First-run admin bootstrap (printed one-time setup token in server logs)
- P1-06 (M) Secret encryption helper (AEAD with key from RM_SECRET_KEY_FILE)
- [~] P1-07 (M) Audit log writer; the middleware sweep for every state-changing endpoint lands when the rest of the API surface does. Currently audited: login / bootstrap / host.enrolled / job.run_now
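The secret encryption helper in P1-06 can be pictured roughly as follows. This is a minimal sketch, not the project's crypto.AEAD: the cipher (AES-256-GCM), the nonce-prefixed blob layout, and the names sealSecret / openSecret are assumptions made for illustration. Only "AEAD with a key loaded from RM_SECRET_KEY_FILE" comes from the task list.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD from a 16/24/32-byte key (assumed cipher choice).
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// sealSecret encrypts plaintext and prepends the random nonce to the blob.
// ad is authenticated but not stored; the caller must supply it again to open.
func sealSecret(key, plaintext, ad []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, ad), nil
}

// openSecret splits the nonce off the blob and authenticates + decrypts the rest.
func openSecret(key, blob, ad []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(blob) < gcm.NonceSize() {
		return nil, errors.New("blob shorter than nonce")
	}
	nonce, ct := blob[:gcm.NonceSize()], blob[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, ad)
}

func main() {
	// Hypothetical key; the real server would read it from RM_SECRET_KEY_FILE.
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	blob, err := sealSecret(key, []byte("repo-password"), []byte("host_credentials"))
	if err != nil {
		panic(err)
	}
	plain, err := openSecret(key, blob, []byte("host_credentials"))
	fmt.Println(string(plain), err)
}
```

Binding a per-purpose value as additional data means a blob copied from one column into another fails to decrypt, which is one plausible reason to pass it explicitly.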
Agent ↔ server protocol
- P1-08 (M) Define shared API types in internal/api (envelopes, every WS message + protocol_version constants; JSON-shape tests pin the wire)
- P1-09 (L) WebSocket transport (github.com/coder/websocket), framed JSON envelopes, RPC correlation IDs, exponential-backoff reconnect on the agent side
- P1-10 (M) Enrollment flow: POST /api/agents/enroll with one-time token → returns persistent bearer. Cert pin field stays in the response shape but is left empty: the server is HTTP-only behind a reverse proxy, so the operator pastes the proxy's cert hash into the install command rather than having the server introspect a cert it doesn't terminate.
- P1-11 (M) Agent registration on connect (hello upserts agent_version/restic_version/protocol_version, flips status online; protocol_too_old rejection has a clean error envelope)
- P1-12 (S) Heartbeat handler (touches last_seen_at; background sweeper marks hosts offline after 90s without one)
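The agent-side exponential-backoff reconnect from P1-09 (the task list later pins it at 1s → 60s with jitter) might look like the sketch below. The jitter strategy (uniform in the upper half of the window) and all function names are assumptions; the task list only states the bounds.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	minDelay = 1 * time.Second  // first reconnect delay
	maxDelay = 60 * time.Second // cap for every later delay
)

// backoffBase returns the un-jittered delay for a zero-based attempt:
// minDelay doubled once per attempt, capped at maxDelay.
func backoffBase(attempt int) time.Duration {
	d := minDelay
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

// withJitter picks a delay uniformly in [d/2, d] so that a fleet of agents
// does not reconnect in lockstep after a server restart (assumed strategy).
func withJitter(d time.Duration, r *rand.Rand) time.Duration {
	half := d / 2
	return half + time.Duration(r.Int63n(int64(half)+1))
}

// jitteredDelay is a deterministic helper for tests: same seed, same delay.
func jitteredDelay(attempt int, seed int64) time.Duration {
	return withJitter(backoffBase(attempt), rand.New(rand.NewSource(seed)))
}

func main() {
	r := rand.New(rand.NewSource(time.Now().UnixNano()))
	for attempt := 0; attempt < 8; attempt++ {
		base := backoffBase(attempt)
		fmt.Printf("attempt %d: base %v, sleep %v\n", attempt, base, withJitter(base, r))
	}
}
```

A real client would reset the attempt counter to zero after a connection has stayed up for a while; that policy is not specified in the task list.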
Agent foundations
- P1-13 (M) Agent config file (/etc/restic-manager/agent.yaml, atomic save: tmp+fsync+rename); Windows path deferred to Phase 2
- P1-14 (M) Service integration: systemd unit (sandboxed: NoNewPrivileges, Protect*, MemoryDenyWriteExecute); Linux only in Phase 1
- P1-15 (M) Outbound WS client (github.com/coder/websocket) with reconnect (1s → 60s exponential + jitter), optional cert pin (SHA-256 of leaf), heartbeats, protocol_version in hello
- P1-16 (M) Restic wrapper: locate via PATH or override, run with --json, scan stdout/stderr, parse BackupStatus + BackupSummary, exit code 3 treated as success-with-issues
- P1-17 (S) Host metadata collection (OS, arch, hostname, restic version, agent version, protocol_version)
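The "tmp+fsync+rename" atomic save named in P1-13 is a standard pattern; one way to write it is sketched here. The function names and the temp-file naming are illustrative assumptions, not the agent's actual code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temp file in the same directory, fsyncs it,
// then renames it over path. Readers see either the old file or the new one,
// never a half-written file.
func writeFileAtomic(path string, data []byte, perm os.FileMode) (err error) {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
	if err != nil {
		return err
	}
	// Remove the temp file if any later step fails.
	defer func() {
		if err != nil {
			os.Remove(tmp.Name())
		}
	}()
	if err = tmp.Chmod(perm); err != nil {
		tmp.Close()
		return err
	}
	if _, err = tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err = tmp.Sync(); err != nil { // flush file contents to stable storage
		tmp.Close()
		return err
	}
	if err = tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// roundTrip writes content into a throwaway directory and reads it back.
func roundTrip(content string) (string, error) {
	dir, err := os.MkdirTemp("", "atomic-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "agent.yaml")
	if err := writeFileAtomic(path, []byte(content), 0o600); err != nil {
		return "", err
	}
	b, err := os.ReadFile(path)
	return string(b), err
}

func main() {
	got, err := roundTrip("server: https://rm.example.invalid\n")
	fmt.Printf("%q %v\n", got, err)
}
```

The temp file lives in the target's own directory because a rename is only atomic within one filesystem. A stricter variant would also fsync the parent directory after the rename.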
Run-now backup
- P1-18 (L) Job lifecycle: queued → running → succeeded/failed/cancelled, persisted with stats blob and error
- P1-19 (M) Server endpoint POST /api/hosts/{id}/jobs to dispatch a backup command (validates kind, checks online, audit-logs)
- P1-20 (M) Agent executes restic backup, streams stdout/stderr + parsed JSON events back as job.progress (1Hz throttle) / log.stream
- [~] P1-21 (M) Server persists log stream to job_logs ✓; WS /api/jobs/{id}/stream for live browser tailing still TODO (needs the per-job fan-out hub)
- P1-22 (S) Snapshot listing: agent calls restic snapshots --json after each successful backup and ships the projection over snapshots.report. Server ReplaceHostSnapshots atomically swaps the per-host list and updates hosts.snapshot_count in the same tx. Read endpoint: GET /api/hosts/{id}/snapshots. Tests cover round-trip, authoritative-replace semantics, and empty-after-prune. Schema dropped the unused repo_id FK from snapshots (repos as a first-class entity is P2 work).
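The job lifecycle in P1-18 is a small state machine. The sketch below encodes it as a transition table; the type and constant names are hypothetical, and allowing queued → cancelled directly is an assumption (the task list only draws queued → running → succeeded/failed/cancelled).

```go
package main

import "fmt"

// JobStatus is a hypothetical name for the persisted lifecycle state.
type JobStatus string

const (
	StatusQueued    JobStatus = "queued"
	StatusRunning   JobStatus = "running"
	StatusSucceeded JobStatus = "succeeded"
	StatusFailed    JobStatus = "failed"
	StatusCancelled JobStatus = "cancelled"
)

// next lists the allowed successor states. Terminal states have no entry.
var next = map[JobStatus][]JobStatus{
	StatusQueued:  {StatusRunning, StatusCancelled}, // direct cancel is an assumption
	StatusRunning: {StatusSucceeded, StatusFailed, StatusCancelled},
}

// canTransition reports whether a job may move from one state to another.
func canTransition(from, to JobStatus) bool {
	for _, s := range next[from] {
		if s == to {
			return true
		}
	}
	return false
}

// isTerminal reports whether no further transitions are allowed.
func isTerminal(s JobStatus) bool { return len(next[s]) == 0 }

func main() {
	fmt.Println(canTransition(StatusQueued, StatusRunning))    // allowed
	fmt.Println(canTransition(StatusSucceeded, StatusRunning)) // rejected: terminal
	fmt.Println(isTerminal(StatusFailed))
}
```

Checking transitions in one place keeps a late or duplicated agent message from moving a finished job back to running.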
UI (HTMX + Tailwind)
- P1-23 (M) Base layout, login page, session-aware nav
- P1-24 (M) Dashboard: fleet summary tiles + host table (status dot + row accent + os/arch + last backup + repo size + snapshots + alerts + tags + run-now). Backed by GET /api/hosts + GET /api/fleet/summary (JSON) and a server-rendered HTML view. Empty state hands the operator the install command. HTMX Run now button posts to /hosts/{id}/run-backup.
- P1-25 (M) Host detail page (/hosts/{id}): persistent header (status dot + mono name + tags + OS/arch/agent/restic/last-seen), vitals strip (last backup / repo size / snapshots / open alerts), sub-tabs (Snapshots active; Jobs/Repo/Settings tabs visible but inert until P2), snapshot table (cap 50, pagination later), right-rail run-now stack (backup live; forget/prune/check/unlock disabled with P2 hints) and a danger-zone delete panel.
- P1-26 (M) Live job log viewer + WS browser fan-out hub (closes the P1-21 remainder). Browser opens /api/jobs/{id}/stream; agent-emitted job.started / job.progress / log.stream / job.finished are mirrored to subscribers. Per-subscriber buffered channel + non-blocking broadcast keeps a slow browser from blocking the agent's read loop. Page renders running / succeeded / failed states; auto-scrolls until the operator scrolls up; reloads on job.finished to show the final header. "Run now" sets HX-Redirect so the operator lands on the live log.
- [~] P1-27 (M) "Add host" flow: form takes hostname + repo URL/username/password, mints token (TTL 1h), re-renders the same page in result-state with the install command (RM_SERVER + RM_TOKEN filled in), copy button, and an awaiting-agent panel. Encrypted repo creds ride on the token row (P1-32) and get pushed to the agent on first WS connect (P1-33). Deferred: one-click "download preconfigured installer" install-<hostname>.sh (cf. UrBackup Internet-mode push installer); copy-paste covers it for v1.
- P1-28 (S) Tailwind build via tailwindcss standalone binary (no Node): Makefile downloads pinned v3.4.17 into bin/tailwindcss, builds web/styles/input.css → web/static/css/styles.css, embedded into the binary via web.FS. make build runs Tailwind first.
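The "per-subscriber buffered channel + non-blocking broadcast" idea from P1-26 can be sketched as below. The hub type, its method names, and the choice to drop a message for a full subscriber (rather than disconnect it) are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// hub fans one job's messages out to any number of browser subscribers.
type hub struct {
	mu   sync.Mutex
	subs map[chan string]struct{}
}

func newHub() *hub { return &hub{subs: make(map[chan string]struct{})} }

// subscribe registers a subscriber with its own buffer of the given size.
func (h *hub) subscribe(buf int) chan string {
	ch := make(chan string, buf)
	h.mu.Lock()
	h.subs[ch] = struct{}{}
	h.mu.Unlock()
	return ch
}

// unsubscribe removes the subscriber and closes its channel.
func (h *hub) unsubscribe(ch chan string) {
	h.mu.Lock()
	if _, ok := h.subs[ch]; ok {
		delete(h.subs, ch)
		close(ch)
	}
	h.mu.Unlock()
}

// broadcast offers msg to every subscriber without ever blocking.
// A subscriber whose buffer is full misses the message; the count of
// misses is returned so the caller can log or act on it.
func (h *hub) broadcast(msg string) (dropped int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for ch := range h.subs {
		select {
		case ch <- msg:
		default:
			dropped++
		}
	}
	return dropped
}

func main() {
	h := newHub()
	fast, slow := h.subscribe(8), h.subscribe(1)
	for _, m := range []string{"job.started", "log.stream", "job.finished"} {
		fmt.Printf("%s dropped=%d\n", m, h.broadcast(m))
	}
	fmt.Println(len(fast), len(slow)) // buffered message counts per subscriber
	h.unsubscribe(fast)
	h.unsubscribe(slow)
}
```

Because the send sits in a select with a default case, the agent's read loop never waits on a browser. Dropped lines are recoverable in this design only because the log is also persisted to job_logs.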
Install scripts
- P1-29 (M) install.sh (Linux): detects arch, downloads agent, installs systemd unit, enrolls. Detects existing restic timers / /etc/cron.{d,daily,hourly,weekly}/* / root crontab and prints them with the exact disable commands; does not auto-disable
- [~] P1-31 (S) Server endpoint to serve agent binaries + install scripts ✓ (/agent/binary + /install/*); signature verification deferred to Phase 5 OSS readiness
Repo credentials (pulled forward from Phase 2)
- P1-32 (M) Server-side encrypted repo creds carried on the enrollment token:
  - POST /api/enrollment-tokens body grows repo_url, repo_username, repo_password (all required).
  - Token row stores them as one AEAD-encrypted blob (existing crypto.AEAD); ConsumeEnrollmentToken moves the blob to a new host_credentials row keyed by host_id in the same tx.
  - PUT /api/hosts/{id}/repo-credentials (admin/operator) re-encrypts and replaces the row, emits an in-memory event to the WS hub.
  - GET /api/hosts/{id}/repo-credentials returns the redacted view (URL + username + has_password) so the UI can pre-fill the edit form. Password never leaves the server outside the WS push.
  - On WS hello, server pushes a config.update with decrypted creds before returning the connection to idle. Same path on edit-while-connected.
  - Audit-logged on create / consume / edit; payload omits the secret material.
- P1-33 (M) Agent-side encrypted secrets store:
  - New internal/agent/secrets package: AEAD blob at /var/lib/restic-manager/secrets.enc, atomic write (tmp+fsync+rename, mode 0600).
  - Per-host 32-byte secrets key minted at enrollment, persisted in agent.yaml (already 0600 root-only, the same trust boundary as the bearer; explicit comment in the file).
  - Strip repo_url / repo_password from agent.config.Config. Agent loads creds from secrets.enc at startup; config.update handler writes through to the file.
  - Dispatcher reads from the secrets store on every job rather than from in-memory config.
  - Migration path: if agent.yaml still contains repo_url / repo_password, copy them into secrets.enc on next start, then strip from the YAML on save.
- P1-34 (S) End-to-end smoke runbook: docs/e2e-smoke.md walks through enrollment with repo creds → agent receives them via push-on-connect → run-now backup completes against a real restic/rest-server in a sibling container → host appears with snapshot count. Test-driven version (Playwright + compose) deferred to P5-06.
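The redacted view returned by GET /api/hosts/{id}/repo-credentials in P1-32 (URL + username + has_password) can be modelled by a separate response type that has no password field at all. The struct names and the exact JSON keys below are assumptions; only the three pieces of information come from the task list.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// repoCredentials is the decrypted, server-internal form (hypothetical type).
type repoCredentials struct {
	URL      string
	Username string
	Password string
}

// redactedView is what the UI is allowed to see. It cannot leak the
// password by accident because the field does not exist on this type.
type redactedView struct {
	URL         string `json:"repo_url"`
	Username    string `json:"repo_username"`
	HasPassword bool   `json:"has_password"`
}

// redact converts full credentials into the UI-safe projection.
func redact(c repoCredentials) redactedView {
	return redactedView{URL: c.URL, Username: c.Username, HasPassword: c.Password != ""}
}

// redactedJSON renders the projection as the response body.
func redactedJSON(c repoCredentials) (string, error) {
	b, err := json.Marshal(redact(c))
	return string(b), err
}

func main() {
	out, err := redactedJSON(repoCredentials{
		URL:      "rest:https://backup.example.invalid/host1",
		Username: "host1",
		Password: "hunter2",
	})
	fmt.Println(out, err)
}
```

Using a distinct type rather than a json:"-" tag on the full struct means a future field added to the credentials never reaches the wire unless someone adds it to the view on purpose.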
Phase 1 acceptance
- One Linux host can enroll, appear in the dashboard, and a backup can be triggered from the UI with live log streaming. Snapshots list updates after success.
- Windows binary builds cleanly in CI (.gitea/workflows/ci.yml) but is not service-tested or installer-shipped in Phase 1; that lands in Phase 2 (P2-16, P2-17).
- Agent ↔ server protocol_version handshake rejects mismatched versions with a clear error rather than failing on JSON parse.
- Repo credentials never appear in plaintext on disk: server stores them AEAD-encrypted, agent stores them AEAD-encrypted, the wire carries them only inside the authenticated WS as config.update.
Phase 2 — Scheduling, retention, repo operations
Mid-phase pivot: "P2 redesign" (commits 7a7cac5, 666af41, 5667cdf). The original P2 plan put paths/excludes/retention/manual/kind/options on Schedule and one repo per host. After landing P2-01..P2-05 against that shape, the data model was rewritten: schedules are slim (cron + which source_groups); paths/excludes/retention/retry live on source_group (also doubles as the snapshot tag); forget/prune/check cadences live on host_repo_maintenance and run on a server-side ticker, not the agent cron; pending_runs queues offline retries; host.repo_initialised_at is gone (auto-init at enrolment). The redesign is captured below as P2R-NN items. Items P2-01..P2-05 stay marked done because the work shipped, but they're labelled "⚠️ shipped against old shape"; behaviour is to be re-validated under P2R-02 after the UI rewire. P2-04.5 (manual flag) is dropped wholesale. P2-06..P2-15 are reframed below to point at their new homes; P2-16/17/18 are unaffected by the redesign.
Original P2 work — shipped (against pre-redesign shape)
- ⚠️ P2-01 (M) Schedule schema + CRUD API (fat-schedule shape) — superseded by P2R-01.
- ⚠️ P2-02 (L) Server-pushed schedule reconciliation (fat-schedule shape) — re-validate under P2R-01.
- ⚠️ P2-03 (M) Agent local scheduler (internal/agent/scheduler, robfig/cron/v3, schedule.fire envelope, dispatchScheduledJob). The cron loop + ack/fire round-trip stay; the payload it carries reshapes under P2R-01.
- ⚠️ P2-04 (M) Schedule editor UI (fat-schedule form: paths/excludes/tags/retention/bandwidth on the schedule itself); superseded by P2R-02.
- P2-04.5 Manual schedules / kill host.default_paths: superseded; the manual flag concept is gone, Run-now lives on source groups (see P2R-01 / P2R-02).
- ⚠️ P2-05 (M) forget command with retention policy. Wire payload (CommandRunPayload.retention_policy) and restic wrapper (restic.ForgetPolicy, RunForget) are still correct; what changes under P2R-03 is where retention comes from (source_group, not schedule) and who dispatches (server-side maintenance ticker for cadence; per-source-group Run-now for ad-hoc).
P2 redesign — Phase 1 ✅
- P2R-00.1 (M) Migration 0008: sources + repo maintenance. Adds source_groups, schedule_source_groups junction, host_repo_maintenance, pending_runs, host.bandwidth_up_kbps / bandwidth_down_kbps. Drops host.repo_initialised_at. Slim-schedule columns dropped from schedules. Column-level ALTERs only; no table rebuilds (FK cascade trap, see CLAUDE.md). Commit 7a7cac5.
- P2R-00.2 (M) v4 wireframes for sources / schedules / repo. Hi-fi mocks of /hosts/{id}/sources, /sources/{gid}/edit (with retention-conflict banner), slim /schedules, /repo (connection / bandwidth / maintenance / re-init). Commit 666af41.
P2 redesign — Phase 2 ✅
- P2R-00.3 (L) Go-side store rewrite against migration 0008. New types: SourceGroup, HostRepoMaintenance, PendingRun. Schedule slimmed to {id, host_id, cron, enabled, source_group_ids, timestamps}. RetentionPolicy moves from schedule field → source group field (type unchanged). Host loses RepoInitialisedAt, gains bandwidth caps. New files: store/sources.go, store/maintenance.go, store/pending.go. store/schedules.go rewritten for slim shape + junction CRUD. enrollment.go seeds a default source group + repo-maintenance row instead of a manual schedule. ws/handler.go drops MarkHostRepoInitialised. HTTP layer + UI templates temporarily 501-stubbed with redesign_in_progress; this is what P2R-01 / P2R-02 fill back in. Tests for the obsolete fat-schedule API deleted. Commit 5667cdf.
- P2R-00.4 (S) Host-detail UI patched up enough to render: RepoInitialisedAt template refs removed, manual init-repo branches stripped, dead Schedules sub-tab demoted to inert div (matches Jobs/Repo/Settings), broken Run-now buttons disabled with P2-Phase-4 hints. Stop-gap until P2R-02 lands the real surface.
P2 redesign — Phase 3 (REST + WS rewire) ✅
- P2R-01 (L) HTTP/WS layer against the slim shape:
  - Schedules REST CRUD: GET|POST /api/hosts/{id}/schedules, PUT|DELETE /api/hosts/{id}/schedules/{sid}. Body shape is {cron, enabled, source_group_ids[]}; paths/excludes/retention/kind/manual all go away. Junction wiped + re-inserted on every update (per store.UpdateSchedule). Validation: cron parses via robfig/cron/v3; ≥1 source_group_ids; all referenced groups belong to the host.
  - Source-groups REST CRUD: GET|POST /api/hosts/{id}/source-groups, GET|PUT|DELETE /api/hosts/{id}/source-groups/{gid}. Body: {name, includes[], excludes[], retention_policy, retry_max, retry_backoff_seconds}. Name uniqueness per host. Refuse delete if SchedulesUsingGroup(gid) is non-empty (return the schedule list so UI can show "remove from these schedules first"). Mutations bump host_schedule_version.
  - Repo-maintenance REST: GET|PUT /api/hosts/{id}/repo-maintenance. Body: {forget_cadence, prune_cadence, check_cadence, check_subset_pct, enabled}. Server-side ticker drives execution (P2R-04), so updates here do not bump host_schedule_version.
  - Per-source-group Run-now: POST /hosts/{id}/source-groups/{gid}/run. Reuses the existing dispatchScheduleNow-style path; agent receives a normal command.run carrying the resolved includes/excludes/retention from the group. This replaces the old per-host /hosts/{id}/run-backup endpoint (kept around as a 410-Gone with a hint pointing to source groups).
  - schedule_push.go reconciliation: rebuild pushScheduleSet* to ship the new wire format (ScheduleSetPayload carries [{schedule_id, cron, enabled, source_groups: [{name, includes, excludes, retention, retry_*}]}]; the agent doesn't need to know source_group_id, just the resolved bundle). On-hello + async-on-CRUD flavours; ack still persists applied_schedule_version.
  - Auto-init at enrolment: server dispatches restic init on first WS connect (was the P2-old "Init repo" button, now invisible to the operator). On success: emit a normal job row with kind=init so the audit trail still shows it. On init returning "config file already exists" (e.g. re-enrolment against an existing repo): treat as soft success per existing restic-wrapper behaviour.
  - Tests: rewrite the deleted schedules_test.go and schedule_push_test.go against new endpoints; new source_groups_test.go, repo_maintenance_test.go, auto_init_test.go. End-to-end: enrol → server pushes creds → server dispatches init → agent runs it → schedule reconcile fires → operator hits per-source-group Run-now → backup runs → snapshots refresh.
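The validation rules listed for the slim schedule body under P2R-01 ({cron, enabled, source_group_ids[]}) could be checked along these lines. This is a sketch with stated simplifications: the real handler parses cron with robfig/cron/v3, while the stand-in here only counts five fields to stay dependency-free, and the type and function names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// scheduleBody mirrors the slim request shape from the task list.
type scheduleBody struct {
	Cron           string   `json:"cron"`
	Enabled        bool     `json:"enabled"`
	SourceGroupIDs []string `json:"source_group_ids"`
}

// validateSchedule applies the three rules: cron must parse, at least one
// source group is referenced, and every referenced group belongs to the host.
// hostGroups is the set of source-group IDs owned by the host.
func validateSchedule(b scheduleBody, hostGroups map[string]bool) error {
	// Stand-in for a real cron parser: accept exactly five fields.
	if len(strings.Fields(b.Cron)) != 5 {
		return errors.New("cron: expected 5 fields")
	}
	if len(b.SourceGroupIDs) == 0 {
		return errors.New("source_group_ids: at least one is required")
	}
	for _, id := range b.SourceGroupIDs {
		if !hostGroups[id] {
			return fmt.Errorf("source_group_ids: %q does not belong to this host", id)
		}
	}
	return nil
}

func main() {
	groups := map[string]bool{"sg-etc": true, "sg-home": true}
	ok := scheduleBody{Cron: "0 3 * * *", Enabled: true, SourceGroupIDs: []string{"sg-etc"}}
	bad := scheduleBody{Cron: "0 3 * * *", SourceGroupIDs: []string{"sg-other-host"}}
	fmt.Println(validateSchedule(ok, groups))
	fmt.Println(validateSchedule(bad, groups))
}
```

The ownership check matters because the junction table is wiped and re-inserted on every update; without it a schedule on one host could be pointed at another host's source group.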
P2 redesign — Phase 4 (UI rewire, against v4 wireframes) ✅
Row-design rule (binding for every list-row template in this app, current and future): Whole-row click navigates to the row's primary detail/edit page. Mirror .host-row.clickable on the dashboard (partials/host_row.html): an absolute-positioned .row-link overlay with text-indent: -9999px covers the row, action buttons live in .row-action cells that sit above via z-index. Do not add an explicit "Edit" button when the row is clickable: it duplicates the affordance and dilutes the click target. Action cells are reserved for verbs that aren't "open this row" (Run-now, Delete, Pause, etc).
- P2R-02 (L) UI templates rebuilt against the new model:
  - Slice 1 ✅ Sub-tab navigation skeleton: extract header/vitals/sub-tabs into a host_chrome partial; Sources / Schedules / Repo become real <a> links; placeholder pages share the chrome; version indicator restored. (commit a535822)
  - Slice 2 ✅ Sources tab: /hosts/{id}/sources list with per-row meta + clickable rows + per-group Run-now/Delete; /sources/new and /sources/{gid}/edit form (name, includes/excludes, 3×2 keep-* grid, retry-on-offline, inline conflict banner from ConflictDimension cache); validation re-renders form with input intact; refuses to delete a host's last source group. (commits 0ed9c3d, dede74f)
  - Slice 3 ✅ Schedules tab: /hosts/{id}/schedules slim list (status / cron / source-tags / actions, clickable rows) plus /schedules/new and /schedules/{sid}/edit form (cron with five quick-pick chips that have human-readable tooltips, source-group multi-pick as styled check cards, enabled toggle); per-schedule Run-now reuses dispatchScheduledJob for enabled schedules + bypasses the enabled check (with a HX-confirm) for paused ones; multi-group fires emit a success toast, single-group fires HX-Redirect to the live job log. (commit 67ca769 + follow-ups 64d2fcf, 8b91d30, 4035c44)
  - Slice 4 ✅ /hosts/{id}/repo: three independent forms (connection: URL/user/password pre-filled from GET /api/hosts/{id}/repo-credentials redacted view; bandwidth: host-wide caps via new PUT /api/hosts/{id}/bandwidth; maintenance: forget/prune/check cadences + check subset %); danger-zone re-init button rendered + disabled (real flow lands in P2R-09); right-rail snapshots-by-tag breakdown. (commit d62b173)
  - Slice 5 ✅ Dashboard row Run-now uses the single covering schedule when one exists ("Run all groups" primary button), otherwise falls back to "Open →" pointing at the Sources tab. Right-rail and empty-snapshots-state Run-now were rehomed to source-group context in slice 1. (commit fab99b4)
  - Slice 6 ✅ Playwright sweep against the live :8080 server: login → walk every new tab → create source group → create schedule → Run-now → confirm a snapshot landed. End-to-end clean, no console errors. Screenshots in _diag/p2r-02-sweep/.
  - Side-fix: agent runner drops noisy restic status events from log.stream (they were drowning the live log on short backups; the throttled job.progress envelope already covers the same data). (commit ffba737)
  - Header "version N · agent in sync / agent at vM" indicator preserved across all tabs (backed by host_schedule_version + applied_schedule_version).
  - Form validation re-renders with the operator's typed input intact (mirror P2-04's behaviour). Each save fires pushScheduleSetAsync so an online agent re-arms within seconds.
P2 redesign — Phase 5 ✅
Shipped on branch p2r-phase5-maintenance (PR #3). Plan: docs/superpowers/plans/2026-05-03-p2-redesign-phase-5.md.
- P2R-03 (M) prune command end-to-end. Restic wrapper (restic.RunPrune), agent dispatcher (case api.JobPrune:), wire envelope. Admin-only credential: a second host_credentials row keyed by host_id + kind=admin carries the non-append-only username/password; server pushes it via config.update only when dispatching a prune job, and the agent's secrets store keeps it in a separate slot from the everyday append-only creds. UI: prune row on the Repo page. Operator-triggered Run-now via POST /hosts/{id}/repo/prune. Cadence-driven dispatch via the maintenance ticker (P2R-06).
- P2R-04 (M) check command end-to-end (restic check --read-data-subset N%). Wrapper + dispatcher + wire. UI: check row on the Repo page (with the subset % slider). Operator Run-now via POST /hosts/{id}/repo/check. Cadence-driven dispatch via the maintenance ticker (P2R-06).
- P2R-05 (S) unlock command end-to-end (restic unlock). Operator-only, no cadence. POST /hosts/{id}/repo/unlock. Repo page surfaces lock state from the most recent check (which warns about stale locks).
- P2R-06 (M) Server-side maintenance ticker. Cron-style loop on the server reads host_repo_maintenance rows, dispatches forget / prune / check jobs against the right host on the configured cadence. Last-fire anchor is derived from the jobs table via LatestJobByKind (queued + running included so a long-running prune correctly suppresses the next tick). Independent of the agent's local cron; the agent's cron only handles backup schedules now. Skips offline hosts (logged, no queue; only scheduled backup fires queue, per P2R-08). Forget reshape: ships a multi-group ForgetGroups payload so one job fires N restic-forget invocations per tick.
- P2R-07 (S) Repo stats panel on the Repo page: total size, raw size, last-check timestamp + status (color-coded), last-prune timestamp, stale-lock banner. Backed by restic stats --json --mode raw-data that the agent ships in a repo.stats envelope after every backup / check / prune / unlock; persisted via Store.UpsertHostRepoStats into a new host_repo_stats projection table.
- P2R-08 (M) Pending-runs queue worker. Scheduled backup fires that race an agent disconnect queue to pending_runs. Drained on a 30s server-side tick and on agent reconnect (via onAgentHello); per-host TryLock mutex prevents the two paths double-dispatching the same row. Exponential backoff capped at 30 minutes; abandons rows that exceed the source-group's retry_max (audit-logged) or whose schedule/group has genuinely been deleted.
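The per-host TryLock guard from P2R-08, which keeps the 30s tick and the reconnect path from draining the same host at once, could be sketched as follows. sync.Mutex.TryLock is standard library (Go 1.18+); the hostLocks type and the drain method are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// hostLocks hands out one mutex per host ID, created on first use.
type hostLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newHostLocks() *hostLocks {
	return &hostLocks{locks: make(map[string]*sync.Mutex)}
}

// forHost returns the mutex for a host, creating it if needed.
func (l *hostLocks) forHost(id string) *sync.Mutex {
	l.mu.Lock()
	defer l.mu.Unlock()
	m, ok := l.locks[id]
	if !ok {
		m = &sync.Mutex{}
		l.locks[id] = m
	}
	return m
}

// drain runs fn for the host unless another drain is already in flight.
// It never waits: the loser simply skips, because the winner is already
// working through the same pending rows.
func (l *hostLocks) drain(hostID string, fn func()) bool {
	m := l.forHost(hostID)
	if !m.TryLock() {
		return false
	}
	defer m.Unlock()
	fn()
	return true
}

func main() {
	l := newHostLocks()
	ran := l.drain("host-1", func() {
		// Simulates the reconnect path firing while the tick is mid-drain.
		fmt.Println("nested drain ran:", l.drain("host-1", func() {}))
	})
	fmt.Println("outer drain ran:", ran)
}
```

Skipping instead of blocking is the point of TryLock here: a blocked second drain would only re-read rows the first one has just dispatched.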
P2 redesign — Phase 6 (auto-init follow-up) ✅
- P2R-09 (S) Auto-init UX polish. Latest init job status surfaced under the host-detail vitals strip (succeeded/failed/running/queued, with link to the live job log on non-success). Danger-zone POST /hosts/{id}/repo/reinit dispatches a fresh init job after the operator types the host name to confirm; audit row records host.repo_reinit.
Pre/post hooks (rehomed onto source groups) ✅
- P2R-10 (M) Hook schema: migration 0010 adds pre_hook / post_hook BLOB columns to source_groups and pre_hook_default / post_hook_default to hosts. Bytes stored verbatim; AEAD encrypt/decrypt at the HTTP layer (per-slot AD bytes). Round-trip tests cover set/clear semantics on both tables.
- P2R-11 (M) Agent execution of hooks: runner.BackupHooks + runHook helper invoked via /bin/sh -c (cmd.exe /C on Windows). pre_hook non-zero exit aborts the backup; post_hook always runs with RM_JOB_STATUS=succeeded|failed in env. Output streamed as hook(<phase>): … log.stream lines. Hooks only run for kind=backup. Server side resolves group → host default → empty and ships plaintext on the WS payload (decrypt at HTTP layer).
- P2R-12 (S) Hook editor UI: source-group edit form gains pre/post hook textareas with the service-user warning banner; bodies AEAD-encrypted on save (per-group AD). Repo page adds a host-default Hooks panel with the same shape; saved via POST /hosts/{id}/repo/hooks.
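The hook rules in P2R-11 (resolve group → host default → empty, run through the platform shell, abort on a failing pre_hook, always pass RM_JOB_STATUS to post_hook) reduce to a few small decisions. The helper names below are hypothetical and the sketch deliberately stops short of spawning a process.

```go
package main

import "fmt"

// resolveHook picks the source group's hook, else the host default, else "".
func resolveHook(groupHook, hostDefault string) string {
	if groupHook != "" {
		return groupHook
	}
	return hostDefault
}

// hookArgv wraps a hook body in the platform shell named in the task list.
// goos is passed in (e.g. runtime.GOOS) so the choice is easy to test.
func hookArgv(goos, script string) []string {
	if goos == "windows" {
		return []string{"cmd.exe", "/C", script}
	}
	return []string{"/bin/sh", "-c", script}
}

// backupMayRun reports whether the backup proceeds after the pre_hook:
// any non-zero exit aborts it.
func backupMayRun(preHookExit int) bool { return preHookExit == 0 }

// postHookEnv is the environment entry the post_hook always receives.
func postHookEnv(backupSucceeded bool) string {
	if backupSucceeded {
		return "RM_JOB_STATUS=succeeded"
	}
	return "RM_JOB_STATUS=failed"
}

func main() {
	hook := resolveHook("", "systemctl stop example-app")
	fmt.Println(hookArgv("linux", hook))
	fmt.Println(backupMayRun(1), postHookEnv(false))
}
```

An agent would typically hand hookArgv's result to os/exec and append postHookEnv's value to the command's Env; how the real runHook does this is not described in the task list.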
Bandwidth + niceties (rehomed onto host + source groups) ✅
- P2R-13 (S) Bandwidth limit fields. restic.Env gains LimitUploadKBps / LimitDownloadKBps, emitted as --limit-upload / --limit-download global flags before the subcommand on every invocation. Agent dispatcher tracks host-wide caps received via config.update; server pushes them on hello and after PUT /api/hosts/{id}/bandwidth. Per-job override on the per-source-group Run-now form (collapsed <details> "Limit bandwidth for this run" with two KB/s inputs); override wins over host caps.
- P2R-14 (S) Schedule "next run" / "last run". New store.LatestJobBySchedule query. Schedules tab grows two columns (Next derived from cron via robfig/cron/v3.Parse(...).Next, Last from latest actor_kind=schedule job). Dashboard host row prepends next 12h ago/from now when a single covering schedule is the run-now candidate.
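The bandwidth behaviour in P2R-13 (global --limit-upload / --limit-download flags placed before the subcommand, per-run override beating host caps) can be illustrated as below. The limits type and helper names are hypothetical, and treating the override per field (a zero field falls back to the host cap) is an assumption the task list does not settle.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// limits holds bandwidth caps in KB/s; zero means "no limit".
type limits struct {
	UploadKBps   int
	DownloadKBps int
}

// effective merges host-wide caps with an optional per-run override.
// Assumption: the override wins field by field when it is non-zero.
func effective(host limits, override *limits) limits {
	out := host
	if override != nil {
		if override.UploadKBps > 0 {
			out.UploadKBps = override.UploadKBps
		}
		if override.DownloadKBps > 0 {
			out.DownloadKBps = override.DownloadKBps
		}
	}
	return out
}

// resticArgs emits the limit flags first, then the subcommand and its args.
func resticArgs(l limits, subcommand string, rest ...string) []string {
	var args []string
	if l.UploadKBps > 0 {
		args = append(args, "--limit-upload", strconv.Itoa(l.UploadKBps))
	}
	if l.DownloadKBps > 0 {
		args = append(args, "--limit-download", strconv.Itoa(l.DownloadKBps))
	}
	args = append(args, subcommand)
	return append(args, rest...)
}

// commandLine joins args for display; it does no shell quoting.
func commandLine(args []string) string { return strings.Join(args, " ") }

func main() {
	host := limits{UploadKBps: 1024, DownloadKBps: 2048}
	perRun := &limits{UploadKBps: 256}
	fmt.Println("restic", commandLine(resticArgs(effective(host, perRun), "backup", "--json", "/etc")))
}
```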
Cross-platform + alt-enrolment ✅
- P2-16 (M) Windows service integration: internal/agent/service (build-tagged) implements svc.Handler; new restic-manager-agent install|uninstall|start|stop|run subcommands wrap the SCM via golang.org/x/sys/windows/svc/mgr. Cross-compile verified (GOOS=windows GOARCH=amd64 go build ./cmd/agent); untested on Windows itself, because Linux CI can't exercise the SCM round-trip.
- P2-17 (M) install.ps1 (Windows): pwsh installer that detects arch, downloads $Server/agent/binary?os=windows&arch=amd64, runs the agent in -enroll-server (+ optional -enroll-token) mode (token flow OR announce-and-approve), then registers the service via restic-manager-agent install. Surfaces existing scheduled tasks named *restic* without disabling. Served by the existing GET /install/* handler; restage block in CLAUDE.md updated.
- P2-18 (L) Announce-and-approve enrolment (second enrolment mode):
  - Agent run with no RM_TOKEN generates a local Ed25519 keypair (persisted alongside the encrypted secrets blob), then POST /api/agents/announce with {hostname, os, arch, agent_version, restic_version, public_key}. Server stores a pending_hosts row (public_key, fingerprint = sha256(public_key), announced_from_ip, first_seen_at, last_seen_at, expires_at = now+1h). Hostname collisions with existing or other pending rows are flagged in the response so the install script can warn loudly on the endpoint terminal.
  - Agent then opens a long-poll/WS to /ws/agent/pending authenticated by signing a server-issued nonce with its private key, which proves possession of the key tied to the pending row. Connection stays open; agent waits.
  - Install script prints the fingerprint on the endpoint's terminal in a copy-friendly form (e.g. SHA256:ab12…cd34) and tells the operator to compare it to the one shown in the UI before clicking accept.
  - UI: new "Pending hosts" panel on the dashboard. Admin sees fingerprint, hostname, source IP, OS/arch, time announced. Buttons: Accept (mints persistent bearer + repo creds, pushes both down the open pending socket, promotes pending row → real Host row, audit-logged) / Reject (deletes pending row, closes the socket with a clean error). Fingerprint is the load-bearing field; the UI must make comparison easy (large monospace, one-click copy).
  - Server-side guards: per-source-IP rate limit on /api/agents/announce (token-bucket, e.g. 10/min); global cap on pending rows (e.g. 100); pending rows auto-expire after 1h; duplicate-hostname pending rows allowed but visually flagged in UI; accepting one does not auto-reject the others (admin sees them all and decides, which defends against the "attacker announces first, real host second" race).
  - Token-based enrollment (Phase 1) remains the default and is unchanged; announce-and-approve is opt-in for interactive installs. Docs explicitly call out that the fingerprint comparison step is what makes this flow safe; without it, this is no better than trusting hostname over the wire.

  As shipped: migration 0011 + store/pending_hosts.go cover the table. POST /api/agents/announce (rate-limited 10/min/IP, global cap 100 in-flight rows) returns {pending_id, fingerprint, hostname_collision}. GET /ws/agent/pending runs the Ed25519 nonce-sign handshake. Admin POSTs to /api/pending-hosts/{id}/accept|reject (audit-logged as host.accept_pending / host.reject_pending). Dashboard panel renders the queue with a copyable fingerprint + inline accept form (URL/user/password). 60s server ticker sweeps expired rows. Agent: cmd/agent/announce.go mints + persists an Ed25519 keypair into agent.yaml's announce_key field; runs automatically when -enroll-server is supplied without -enroll-token. The install scripts haven't been updated to surface the printed fingerprint beyond the agent's own banner; the operator reads it from the install script's stdout.
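The two cryptographic pieces of announce-and-approve in P2-18, the fingerprint and the nonce-signing proof of possession, map directly onto the standard library. In this sketch the hex encoding of the fingerprint, the 32-byte nonce size, and all function names are assumptions; the task list fixes only fingerprint = sha256(public_key) and "sign a server-issued nonce".

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprint renders sha256(public_key) in a copy-friendly form.
// Hex is an assumed encoding, matching the SHA256:ab12…cd34 style.
func fingerprint(pub ed25519.PublicKey) string {
	sum := sha256.Sum256(pub)
	return "SHA256:" + hex.EncodeToString(sum[:])
}

// verifyPossession checks that sig is a valid signature over the
// server-issued nonce by the key stored on the pending row.
func verifyPossession(pub ed25519.PublicKey, nonce, sig []byte) bool {
	if len(pub) != ed25519.PublicKeySize {
		return false // ed25519.Verify panics on a wrong-sized key
	}
	return ed25519.Verify(pub, nonce, sig)
}

// demoHandshake plays both sides once: an honest agent signing with the
// announced key, and an impostor signing with a different key.
func demoHandshake() (bool, bool, string, error) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return false, false, "", err
	}
	_, impostorPriv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return false, false, "", err
	}
	nonce := make([]byte, 32) // server side: fresh random nonce per connection
	if _, err := rand.Read(nonce); err != nil {
		return false, false, "", err
	}
	honest := verifyPossession(pub, nonce, ed25519.Sign(priv, nonce))
	forged := verifyPossession(pub, nonce, ed25519.Sign(impostorPriv, nonce))
	return honest, forged, fingerprint(pub), nil
}

func main() {
	honest, forged, fp, err := demoHandshake()
	fmt.Println(honest, forged, fp, err)
}
```

The signature only proves that the socket belongs to whoever announced the key. It says nothing about which machine that is, which is why the operator's fingerprint comparison remains the step that makes the flow safe.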
Phase 2 acceptance
- A host can be onboarded end-to-end with no manual REST: enrol → auto-init runs → operator opens host → creates source group(s) → attaches them to one or more schedules → schedule fires on time → backup runs against the right paths with the right retention → snapshots tagged by group name appear in UI.
- Operator can hit Run-now per source group from any of: dashboard row (single-group host), source group row, snapshot empty-state.
- Server-side maintenance ticker drives forget/prune/check at the configured cadences, independent of agent cron. Scheduled backup fires that hit an offline host queue to pending_runs and drain on reconnect.
- Pre/post hooks fire correctly per source group, fail loudly on pre_hook errors, run post_hook with RM_JOB_STATUS. Rejected on non-backup kinds.
- Bandwidth limits honoured (host-wide default + per-run override).
- A Windows host can enrol, appear in the dashboard, and run a backup with live log streaming. Not validated in CI: Linux runners cannot exercise the SCM round-trip; the service_windows.go / install.ps1 pieces compile cleanly under GOOS=windows GOARCH=amd64 but the first real Windows install will be the first end-to-end test.
- A Linux host can enrol via announce-and-approve, with fingerprint-comparison gate enforced. Rate-limit + pending-cap guards verified.
Phase 3 — Restore, alerts, audit
Phase 3 is split into three independently-shippable sub-phases: Restore (P3-01..03 + P3-09 + P3-X1 cancel + P3-X2 tree-list RPC), Alerts (P3-05..07), Audit UI (P3-08). Each sub-phase has its own spec → plan → implement cycle; we hand back at sub-phase boundaries.
P3-04 (cross-host restore) was de-scoped during the Phase-3 brainstorm on 2026-05-04: disaster recovery is already covered by re-enrolling a replacement host with the same repo creds (snapshots reappear, restore is same-host). The remaining "pull a file from host A onto host C without giving C permanent access" use case is genuinely different and doesn't have a confirmed need yet, so it's moved to the Future / unscheduled section at the end of this file.
Phase 3 — Restore ✅
Spec: docs/superpowers/specs/2026-05-04-p3-restore-design.md. Wireframe: _diag/p3-restore-wizard/wireframe.html. Sweep screenshots: _diag/p3-restore-sweep/. Shipped on branch p3-restore.
- P3-X1 (S) Cancel-job feature. command.cancel WS envelope; agent tracks per-job ctx.CancelFunc and kills the running restic subprocess via context cancel (SIGTERM, SIGKILL after 5s grace via cmd.Cancel + cmd.WaitDelay); server endpoint POST /api/jobs/{id}/cancel bridges UI → WS; the existing UI Cancel button on /jobs/{id} is now real for any running kind. Sandbox-aware: internal/restic/cancel_{unix,windows}.go build-tags pick SIGTERM on POSIX vs os.Kill on Windows (which can't deliver SIGTERM). Tests: cancel mid-run via 'sleep 30' fake-restic returns JobCancelled with exit 130 in <200ms.
- P3-X2 (S) Tree-list synchronous WS RPC. MsgTreeList ↔ MsgTreeListResult with Envelope.ID correlation; generic Hub.SendRPC helper (registry of buffered channels keyed by ULID, ctx-cancel + timeout aware). internal/restic.ListTreeChildren wraps restic ls --json and filters its recursive output to direct children. Server-side treeCache is per-wizard-session (keyed by session cookie + host + snapshot + path) with a 30-min TTL and lazy sweep.
- P3-01 (L) Restore wizard backend (internal/server/http/ui_restore.go). GET handlers render the four-step wizard against the wireframe. HTMX/fetch tree partial endpoint hits fetchTreeWithCache. POST validates: snapshot_id, ≥1 absolute path, in-place ⇒ confirm_hostname == host name, agent online; on error re-renders with operator's input intact. Happy path mints job_id, target = /var/lib/restic-manager/restore/<job-id> (server-picked, agent's writable dir under the systemd sandbox's ReadWritePaths), creates job row, ships command.run with RestorePayload, writes host.restore audit row, returns HX-Redirect (or 303) to the live job page.
- P3-02 (L) Wizard UI templates (web/templates/pages/host_restore.html + partials/tree_node.html). Single-page progressively-enabled four-step form. Form-state-driven JS computes a running tally + step-4 confirm summary client-side. Tree expansion uses plain fetch (not HTMX) for simpler target lookup; loaded-state cached per node. Top-level Restore button on host detail right rail + per-snapshot Restore action on snapshot rows. New .snap-row token in web/styles/input.css.
- P3-03 (M) Restore execution. restic.RunRestore builds restore <sid> --target <dir> [--include p]... with --json; new pumpRestoreStdout parses status + summary objects. --no-ownership is gated on the agent's restic version via Env.AtLeastVersion(0, 17): the flag was added in 0.17 and 0.16 rejects it. Restic version is threaded through runner.Config.ResticVersion from the agent's sysinfo snapshot. New-dir target is operator-editable (default $HOME/rm-restore/<job-id>/); agent expands $HOME / ${HOME} / ~/ at run time and calls os.MkdirAll on the target chain so the operator never has to pre-create the per-job subdir. runner.RunRestore translates RestoreStatus into job.progress (mapping FilesRestored → FilesDone, etc.); agent dispatcher case JobRestore reuses the spawn() helper from P3-X1 so cancel works. Restore-shaped job-detail variant with current-file display under the progress bar.
- P3-09 (S) diff between two snapshots. JobDiff JobKind + restic.RunDiff + runner.RunDiff; POST /api/hosts/{id}/snapshots/diff (and HTMX-form variant on the unprefixed path) dispatcher with two-snapshot guard + per-host snapshot-list validation; UI panel on host detail right rail (visible when 2+ snapshots) with two short-id inputs + Diff button. Output streams as log.stream to the standard live job log page.
- P3-X3 (S) Recent-restores line on host detail. hostChromeData grows RestoreStatus / RestoreAt / RestoreJobID populated via store.LatestJobByKind(host_id, 'restore') (already exists from P2R). host_chrome.html renders a small line below the init-status one with status-coloured copy + a link to the job log. Hidden when no restore has ever run on this host.
- P3-X4 (S) Job log download (txt + ndjson). New GET /api/jobs/{id}/log.{txt|ndjson} endpoint backed by the persisted job_logs table; it works any time (running or finished) without pausing the live WS stream because the source is the DB, not the live socket. Plain-text format mirrors the on-screen "HH:MM:SS.mmm TAG payload" shape with a small # job ... · kind ... · status ... header; ndjson emits one self-contained {seq,ts,stream,payload} JSON object per line for jq / tooling. Surfaced as a single header dropdown on the live job page (details/summary-driven, native keyboard support, click-outside-to-close). New reusable .dropdown / .dropdown-menu / .dropdown-item tokens in web/styles/input.css.
- P3-X5 (S) UK lint locale + sweep.
.golangci.ymlmisspell locale switched US → UK and the codebase swept (~73 corrections — behaviour, serialise, recognise, honour, initialise, enrol, unauthorised, etc.). WireErrorCodevalue"unauthorized"→"unauthorised"is a tiny contract change but the agent doesn't parse those codes today and no external clients exist yet. - P3-X6 (S) Snapshot SIZE/FILES tooltip on host detail. The per-snapshot summary block was added by restic 0.17 (the source comment in
internal/restic/snapshots.goincorrectly said 0.16+); on 0.16 hosts the columns render—.hostDetailPage.LegacyRestic(computed viaEnv.AtLeastVersion(0, 17)) drives atitle="Needs restic 0.17+ on the agent host. This host runs <ver>."+cursor: helpon the column headers, hidden once the host upgrades.
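The version gate described in P3-03 and P3-X6 can be sketched as follows. This is a minimal illustration, not the project's code: `atLeast` and `restoreArgs` are hypothetical stand-ins for `Env.AtLeastVersion` and the argument builder inside `restic.RunRestore`, and the version string is assumed to arrive as plain `major.minor.patch`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// atLeast reports whether version ("0.16.4", "v0.17.0", ...) is >= major.minor.
// Hypothetical stand-in for Env.AtLeastVersion; unparseable input counts as "too old".
func atLeast(version string, major, minor int) bool {
	parts := strings.SplitN(strings.TrimPrefix(version, "v"), ".", 3)
	if len(parts) < 2 {
		return false
	}
	maj, errMaj := strconv.Atoi(parts[0])
	mnr, errMin := strconv.Atoi(parts[1])
	if errMaj != nil || errMin != nil {
		return false
	}
	if maj != major {
		return maj > major
	}
	return mnr >= minor
}

// restoreArgs builds the restic argument list and only adds --no-ownership
// when the agent's restic is new enough to accept it (0.17+ per the task notes).
func restoreArgs(version, snapshotID, target string, includes []string) []string {
	args := []string{"restore", snapshotID, "--target", target, "--json"}
	if atLeast(version, 0, 17) {
		args = append(args, "--no-ownership")
	}
	for _, p := range includes {
		args = append(args, "--include", p)
	}
	return args
}

func main() {
	for _, v := range []string{"0.16.4", "0.17.0"} {
		fmt.Println(v, restoreArgs(v, "a1ac4006", "/root/rm-restore/job", []string{"/home/steve/test"}))
	}
}
```

Treating an unparseable version as "too old" keeps the gate fail-safe: the worst case is a missing optional flag rather than a restore that restic rejects outright.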
Migration 0012 widens the jobs.kind CHECK constraint to include restore and diff. Rebuild required (SQLite can't ALTER CHECK in place); follows the safe pattern from 0005, with a defensive temp-table backup of job_logs so the cascade-trap that bit migration 0007 wouldn't take the log history with it.
install.sh + systemd unit: the install script now pre-creates /root/rm-restore (root-owned 0700) so the default new-dir restore target works under the sandbox out of the box; the unit's ReadWritePaths gains -/root/rm-restore (soft-fail prefix). Existing installs need a re-run of install.sh to pick up the new dir; new operator-typed targets are auto-created by the agent at job time.
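The agent-side target handling (expand the home prefix, then create the directory chain) might look roughly like this. It is a sketch under assumptions: `expandHome` is a hypothetical helper name, the home directory is passed in explicitly so the behaviour is easy to test, and the 0700 mode simply mirrors what install.sh uses for /root/rm-restore.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// expandHome replaces a leading $HOME, ${HOME} or ~ path element with home.
// Anything else (absolute paths, "~user", "$HOMEDIR") is returned untouched.
func expandHome(p, home string) string {
	for _, prefix := range []string{"${HOME}", "$HOME", "~"} {
		if p == prefix {
			return home
		}
		if strings.HasPrefix(p, prefix+"/") {
			return filepath.Join(home, p[len(prefix)+1:])
		}
	}
	return p
}

func main() {
	// A temp dir stands in for the agent's real home in this demo.
	home, err := os.MkdirTemp("", "rm-restore-demo")
	if err != nil {
		fmt.Println("mkdtemp:", err)
		return
	}
	defer os.RemoveAll(home)

	target := expandHome("$HOME/rm-restore/job-123/", home)
	// MkdirAll creates every missing parent, so the per-job subdir
	// never has to exist ahead of time.
	if err := os.MkdirAll(target, 0o700); err != nil {
		fmt.Println("mkdir:", err)
		return
	}
	fmt.Println("restore target ready:", target)
}
```

Matching only on a full leading path element avoids rewriting names that merely start with the same characters, such as `$HOMEDIR`.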
As shipped (Playwright sweep against the live smoke env, 2026-05-04): login → host detail → Restore button → wizard step 1 picks snapshot a1ac4006 (most recent) → tree drill-down /home/steve/test (3 lazy loads) → tick file1 + file2 → step 4 confirm summary populated → dispatch → live job page with running progress widget → restore succeeds, files land on disk at /root/rm-restore/<job-id>/home/steve/test/file{1,2} (default $HOME/rm-restore/<job-id>/ after agent-side expansion). Custom-target restore to /tmp/custom-restore/<job-id>/ lands inside the agent's PrivateTmp namespace. Snapshot diff between a1ac4006 and 5f78c788 → diff job page, statistics output streamed (738 bytes added, 0 removed). Recent-restores line on host detail reads "last restore · succeeded 28s ago · job log →". Download dropdown serves both .txt and .ndjson with correct Content-Type + Content-Disposition. SIZE/FILES tooltip "Needs restic 0.17+ on the agent host. This host runs 0.16.4." renders on column hover.
Phase 3 — Alerts ✅
- P3-05 (M) Alert engine: rule evaluation loop (failed backup, stale schedule, agent offline, check failed)
- P3-06 (M) Notification channels: webhook, ntfy, SMTP email
- P3-07 (S) Alert UI: list, acknowledge, resolve
As shipped (Playwright sweep, 2026-05-04): /settings/notifications → 3 channels created (sweep-webhook → local Python sink, sweep-ntfy → ntfy.sh public topic, sweep-smtp → MailHog at 127.0.0.1:1025). Test buttons fire alert.test on each: webhook 200/1ms, ntfy 200/322ms, SMTP 250/3ms. Synthetic critical backup_failed raised → /alerts shows row with severity dot, kind chip, host, message, raised/last-seen, Ack + Resolve buttons; nav badge 1; dashboard critical-alert banner appears with Review → link; OPEN ALERTS card reads 1 unresolved. Acknowledge → fan-out to all 3 channels emits alert.acknowledged (verified in webhook sink, MailHog inbox, notification_log); Acknowledged tab shows row with ack'd by <user> line. Resolve → fan-out emits alert.resolved across all 3 channels; banner clears; dashboard reads 0 unresolved · all clear; host alerts column reads —.

Three live bugs found and fixed mid-sweep: (a) enabled form value lost because hidden + checkbox both named enabled and PostForm.Get returned the first ("0"); (b) Ack/Resolve handlers stored the state change but never dispatched alert.acknowledged / alert.resolved; (c) hosts.open_alert_count projection was never recomputed on Raise/Resolve/AutoResolve, so the dashboard count always read 0.
Phase 3 — Audit log UI ✅
- P3-08 (S) Audit log UI with filters (user, action, target, time range)
As shipped (2026-05-05): Read-only /audit page (+ /api/audit JSON). Filters: time-range presets (24h / 7d / 30d / all), user dropdown (any registered user), actor dropdown (user / agent / system), target-kind dropdown (host / schedule / source_group / alert / notification_channel / job / user), action substring search box. Table columns: when (relative + abstime tooltip), actor tag (user accent / agent green / system grey), user (or em-dash for system rows), action string, target (kind · resolved name for hosts, kind · id otherwise), payload <details> block when non-empty. New Store.ListAudit(AuditFilter) and Store.DistinctAuditActions plus Store.ListUsers. Append-only: no edit/delete surface, deliberately.
Phase 3 acceptance
- A file deleted on a host can be restored from the UI in under 2 minutes via the wizard at /hosts/{id}/restore; the operator can cancel a running restore (or any other running job) from the live job page. Snapshot diff between two snapshots renders as a normal job page.
- A failed backup raises an alert via the configured channel within 60s.
- The audit-log UI lets an admin filter by user / action / target / time range.
Phase 4 — RBAC, OIDC, host tags
- P4-03 (M) RBAC enforcement at API layer (admin / operator / viewer)
- P4-04 (S) User management UI (create/edit/disable, role assignment, password reset)
As shipped (2026-05-05): Three-role hierarchy (admin > operator > viewer) enforced via chi route-group middleware (requireRole). Admin is the fail-closed default; agent endpoints stay on the bearer-token chain. Sessions re-validate disabled_at on every authenticated request, so admin-driven changes (disable, force-logout) land immediately.

Setup-token flow replaces temp passwords. Admin clicks + Add user, picks username + email + role, server returns a one-time setup link valid for 1 hour (sha256-hashed at rest, raw shown to admin once). User clicks the link → sets a password (≥12 chars) → drops a session → lands on /. /settings/users/{id}/regenerate-setup issues a new link, replacing the old via INSERT OR REPLACE. Expired tokens are swept on the alert engine's 60s tick.

Disable-only lifecycle: soft delete via disabled_at. Last-admin guard rejects "disable last admin" and "demote last admin to non-admin" (both server-side and UI-hinted). Re-enable on disabled-username collision: admin trying to add a name that matches a disabled user is redirected to that user's edit page rather than 409'd.

Self-service password change at /settings/account available to any role. Skips current-password check when must_change_password is set so admin-initiated resets work without surfacing a credential the user doesn't know.

Schema: migration 0017 adds disabled_at, must_change_password plus a UNIQUE INDEX on LOWER(username) (lowercase normalisation in Go on every CreateUser); 0018 adds user_setup_tokens. Both column-level ALTERs per CLAUDE.md preference. Email is metadata only in v1 (no SMTP-the-link); the SMTP channel infrastructure from P3-06 makes that a one-page follow-up.

Sweep verified (smoke env): admin adds operator → setup link generated → curl-as-new-user fetches /setup (200, page shows username) → POSTs password → 303 to / + Set-Cookie → operator authenticated → 200 on /, 200 on /settings/account, 403 on /settings/users (admin-only) → admin disables user → operator's next request is 401 + session row count drops to 0 → audit log shows user.created + user.setup_completed for the cycle. All 26 implementation tasks landed; full go test ./... green.
- P4-05 (L) OIDC login (generic provider config, group → role mapping)
As shipped (2026-05-05): Authorization Code + PKCE (S256) against any OIDC IdP advertising standard discovery. Config is YAML+env (oidc.issuer, oidc.client_id, oidc.client_secret/_file, oidc.role_claim default groups, oidc.role_mapping, oidc.display_name, oidc.redirect_url); empty issuer → OIDC disabled, no routes mounted. Migration 0019 adds users.auth_source / oidc_subject (partial unique index on oidc_subject), sessions.id_token, and a small oidc_state table for state+verifier round-trip (cleaned up every alert tick, 5 min TTL). Login page renders Sign in with <display_name> above the local form when OIDC is enabled; the SSO button kicks off a 303 to the IdP with state + S256 code_challenge persisted server-side. Callback verifies ID token, fetches /userinfo to merge claims (Authelia / many IdPs only put sub in the ID token and surface preferred_username / groups from userinfo), maps the first matching group to a role; no match → deny banner, no row created, audit user.oidc_login_blocked. Username-collision with an existing local user → same deny path with username_taken. New user → JIT-provisioned with auth_source='oidc', oidc_subject=<sub>, password_hash=''. Returning user → looked up by oidc_subject (stable when usernames change at the IdP), role + email refreshed on every login. Local password login is rejected for auth_source='oidc' users. Logout posts to /logout and, when the IdP advertised end_session_endpoint, follows up with RP-initiated logout (carries id_token_hint + post_logout_redirect_uri=BaseURL); when not advertised (Authelia in our smoke env), the local session is cleared and the browser lands on /login. Users list shows a small oidc chip beside enabled/disabled; the edit page disables username/email/role for OIDC users (server-side guard mirrors UI, returns 403). Force-logout, disable, and the last-admin guard from P4-04 all still apply.

Live Authelia sweep verified all four paths against local auth: rm-admin → admin role + JIT row + chip + readonly edit; rm-operator → operator JIT, 403 on /settings/users; rm-viewer → viewer JIT, 403 on /hosts/new; rm-other (group not in role_mapping) → no_role_match banner, no row created, audit logged. Returning rm-admin login resolved to the same row by sub. Screenshots in _diag/p4-05-sweep/. Out-of-scope and on Phase 6 candidate list: refresh tokens, back-channel logout, multiple providers, post-login PKCE for the cookie itself.
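Two pieces of the flow above are small enough to sketch: the S256 code challenge (base64url without padding over the SHA-256 of the verifier, per RFC 7636) and the "first matching group wins" role lookup. The function names and the `map[string]string` shape for oidc.role_mapping are assumptions, not the project's actual types.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// codeChallengeS256 derives the PKCE challenge sent to the IdP from the
// verifier that stays server-side until the token exchange.
func codeChallengeS256(verifier string) string {
	sum := sha256.Sum256([]byte(verifier))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

// roleFor returns the role of the first group (in claim order) that appears
// in the mapping; ok=false corresponds to the deny path.
func roleFor(groups []string, mapping map[string]string) (role string, ok bool) {
	for _, g := range groups {
		if r, found := mapping[g]; found {
			return r, true
		}
	}
	return "", false
}

func main() {
	// Verifier taken from the worked example in RFC 7636, Appendix B.
	fmt.Println(codeChallengeS256("dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk"))

	mapping := map[string]string{"rm-admins": "admin", "rm-operators": "operator"}
	fmt.Println(roleFor([]string{"staff", "rm-operators"}, mapping))
	fmt.Println(roleFor([]string{"staff"}, mapping))
}
```

Iterating the claim's groups rather than the mapping keeps the result deterministic, since Go map iteration order is randomised.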
- P4-07 (S) Per-host tags + dashboard filtering by tag
As shipped (2026-05-05): Tag column already existed on the hosts schema (JSON array, round-tripped through the Host struct since Phase 1) but had no edit UI or filter. Added Store.SetHostTags + Store.DistinctHostTags (the latter via json_each for autocomplete + chip-row population). Inline editor on the host detail header: + tag button reveals a comma-separated input with <datalist> autocomplete from the fleet's distinct tags; submit lowercases / trims / dedupes server-side. Tag chips on the host header link to the dashboard pre-filtered. Dashboard chip-row above the hosts table (All / <tag1> / <tag2> …) with the active chip highlighted via a new .tag-active style; ?tag=foo filters the list with the count showing N of M. Operator-band POST /hosts/{id}/tags audited as host.tags_updated.
Phase 4 acceptance
- Non-admin users see an appropriately limited UI. OIDC login works against at least one provider (Authelia or Authentik). Hosts can be tagged and the dashboard filters by tag.
Deferred to Phase 6 (2026-05-05) — pulled forward of OSS readiness so a working v1 ships sooner: P4-01/02 (update delivery + agent-version tracking), P4-06 (repo size trends), P4-08/09 (Prometheus + Grafana). All operator-experience polish, none of it gates getting the system into production.
Phase 5 — OSS readiness
- P5-01 (M) Documentation site (mdBook or similar) with install, concepts, security model, screenshots
- P5-02 (S) CONTRIBUTING.md, CODE_OF_CONDUCT.md, issue + PR templates
- P5-03 (S) Release automation: pivoted away from goreleaser/binary archives on 2026-05-05 (spec: docs/superpowers/specs/2026-05-05-p5-03-docker-only-release.md). Single deliverable per tag: a multi-arch (linux amd64+arm64) server image, with cross-compiled agent binaries (linux amd64+arm64, windows amd64) + install.sh + install.ps1 + the systemd unit baked under /opt/restic-manager/dist/. The /agent/binary and /install/* handlers fall back from <DataDir>/... to <BundledAssetsDir>/... so a fresh container Just Works. Workflow .gitea/workflows/release.yml triggers on v*.*.* tag-push (real release: fan-out :vX.Y.Z, :X.Y, :X, plus :latest once MAJOR>=1) and workflow_dispatch (snapshot: :snapshot-<shortsha> only). Pushed to the Gitea container registry on this instance; no external creds, no GHCR mirror. Cosign / SBOM / minisign / GHCR mirror deferred to Phase 6. Source builds via make build remain a first-class path.
- P5-04 (S) Demo screenshots / short Loom walkthrough in README
- P5-05 (S) SECURITY.md with disclosure process
- P5-06 (M) End-to-end test suite in CI (Playwright vs. compose stack with sibling Linux agent)
As shipped (2026-05-07, branch p5-oss-readiness):

P5-01: docs site. mdBook under docs/book/ with structured chapters: getting-started (install, enrolling hosts, reverse proxy), concepts (architecture, credentials, schedules + source groups, repo maintenance), operations (backups + restores, alerts, observability, updates), security (threat model, hardening, disclosure), reference (env vars, HTTP endpoints), plus contributing / roadmap / license pages. mdBook binary downloaded via Makefile (make docs / make docs-watch), the same "static binary, no toolchain" pattern as Tailwind. Generated book/ dir gitignored.

P5-02: CONTRIBUTING + CoC + templates. CONTRIBUTING.md rewritten from placeholder to full guide (setup, conventions, workflow, RBAC of the project itself). CODE_OF_CONDUCT.md shaped on the Contributor Covenant but adapted for a single-maintainer project. .gitea/issue_template/{bug_report,feature_request}.md + .gitea/PULL_REQUEST_TEMPLATE.md.

P5-04: README screenshots. Six full-page captures from a fresh server bootstrap under docs/screenshots/ (login, empty dashboard, add host, alerts, settings, audit log). README rewritten to centre the screenshot grid + link out to docs site. Captured live from a working build via Playwright; replaceable as the UI evolves without breaking layout.

P5-05: SECURITY.md. Disclosure policy (3-day ack, 30-day default disclosure window), supported-versions matrix, scope in/out, threat-model summary, hardening checklist for operators. Mirrored as a chapter in the docs site.

P5-06: e2e harness. e2e/compose.e2e.yml stands up server + sibling Linux agent (alpine + restic) + restic/rest-server backend, with announce-and-approve as the enrolment path so Playwright drives the operator flow end-to-end. Tests under e2e/playwright/tests/: smoke spec covers bootstrap → login → accept-pending → backup → terminal-status; second spec scrapes /metrics to verify the P6-04 endpoint. New .gitea/workflows/e2e.yml runs on every PR (separate from the fast lint/test workflow). Local how-to in docs/e2e.md.
- P5-07 (S) Reference deployment landed alongside P5-03. deploy/docker-compose.yml stands up only the server (image-pinned via RM_VERSION, named volume for operator state, bound to localhost); TLS termination is left to whichever reverse proxy the operator already runs. docs/reverse-proxy.md documents the headers + WebSocket pass-through the proxy must forward, the RM_TRUSTED_PROXY CIDR rule, and worked examples for Caddy, nginx, and Traefik.
Phase 5 acceptance
- A stranger can read the docs and stand up a working install in under 30 minutes.
Phase 6 — Update delivery + observability
Deferred from Phase 4 on 2026-05-05 — operator-experience polish that doesn't gate a working v1.
- P6-01 (S) Agent self-update from the server's bundled binaries. Server-dispatched command.update WS envelope; agent fetches $RM_SERVER/agent/binary?os=…&arch=… to <bin>.new, copies running binary to <bin>.old (M1: keep one revision back), atomic-rename, exit cleanly. Linux relies on systemd Restart=always; Windows writes a detached update.cmd helper that waits 3s, sc stops, renames, sc starts. No sha256 digest verification; TLS already covers corruption-in-transit (decision deferred per spec §4). (Was P4-01.)
- P6-02 (M) Agent version reporting + fleet update on dashboard. internal/version package + Makefile ldflags injection so server and agent are comparable byte-for-byte. Out-of-date chip on host rows + detail header (amber, format out of date · A → B). Hero tile "N hosts behind" with ?updates=behind filter. Per-host Update agent button on host detail. Admin /settings/fleet-update page drives a rolling worker (internal/server/fleetupdate) that updates one host at a time, polls for hello-with-target-version up to 95s, halts on first failure with fleet_update_halted alert. Per-host update_failed alerts auto-resolve when the agent reconnects at the right version. host.update_dispatched / _succeeded / _failed and fleet.update_started / _completed / _halted / _cancelled audit actions. (Was P4-02.)
As shipped (2026-05-06, branch p6-agent-self-update): Spec docs/superpowers/specs/2026-05-06-p6-01-02-agent-self-update-design.md, plan docs/superpowers/plans/2026-05-06-p6-01-02-agent-self-update.md. Schema: migration 0021 widens jobs.kind CHECK to include update; 0022 creates fleet_updates + fleet_update_hosts. Agent: new internal/agent/updater package (build-tag split unix/windows); dispatcher case MsgCommandUpdate in cmd/agent/update_dispatch.go emits job.started + log.stream updates before exit. Server: WS update-watcher (internal/server/ws/update_watch.go) tracks in-flight dispatches, marks succeeded on hello-with-matching-version, fails after 90s timeout (covers both no-show and rollback cases per spec §3.2). Endpoint POST /api/hosts/{id}/update (admin, JSON) + POST /hosts/{id}/update (HTMX, HX-Redirect: /jobs/{id}); pre-checks for offline / already up-to-date / update_in_progress. Fleet worker exposes Start / Cancel and runs at most one rolling sequence at a time. Alert kinds update_failed and fleet_update_halted plug into the P3-05 engine.

Smoke caught + fixed mid-sweep: the systemd unit's ProtectSystem=full made /usr/local/bin read-only, blocking the .new staging file. Added /usr/local/bin to ReadWritePaths. With the fix in place: end-to-end Update agent took the host from v0.9.0-11-gccaccd8-dirty → v9.9.9-smoke in <5s; .old preserved on disk; chip and hero tile cleared on reconnect; audit row landed. Screenshots in _diag/p6-update-sweep/.
- P6-03 (M) Repo size trend graphs (sparkline on host card, full chart on repo page). (Was P4-06.)
As shipped (2026-05-07, branch tidy-up-last-backup-projection): Spec docs/superpowers/specs/2026-05-07-p6-03-repo-size-trend-design.md, plan docs/superpowers/plans/2026-05-07-p6-03-repo-size-trend.md. Migration 0023 introduces host_repo_stats_history (one row per host per UTC day, last-write-wins per column via COALESCE, so a prune-only or check-only patch never nulls a backup-time size we already captured). WS handler in internal/server/ws/handler.go writes a history row alongside the existing UpsertHostRepoStats call; failure is best-effort, logged at WARN. New internal/web/sparkline package emits inline SVG (sparkline + two-axis chart with hover dots and bytes/count formatting); golden-file tests, deterministic output. Dashboard host row gains a 30d sparkline cell between Repo size and Snapshots; host repo page gains a Trend panel with server-rendered 30d | 90d | 1y range pills (htmx outerHTML swap, helper buildRepoTrendView shared between page-load and fragment endpoint). No new dependencies, no client JS, no agent change. CI green; in-browser smoke walk-through pending operator.
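The per-column last-write-wins rule is done in SQL via COALESCE in the migration described above; the same semantics expressed in Go, purely for illustration, look like this. The struct and its two columns are hypothetical and do not reflect the real host_repo_stats_history schema.

```go
package main

import "fmt"

// dayStats models one history row; nil means "this patch did not report the column".
type dayStats struct {
	RepoSizeBytes *int64
	SnapshotCount *int64
}

// coalesce mirrors SQL COALESCE(patch, existing): take the new value if present.
func coalesce(patch, existing *int64) *int64 {
	if patch != nil {
		return patch
	}
	return existing
}

// merge applies a partial patch without nulling columns it does not carry.
func merge(existing, patch dayStats) dayStats {
	return dayStats{
		RepoSizeBytes: coalesce(patch.RepoSizeBytes, existing.RepoSizeBytes),
		SnapshotCount: coalesce(patch.SnapshotCount, existing.SnapshotCount),
	}
}

func main() {
	size, count := int64(1<<30), int64(7)
	afterBackup := dayStats{RepoSizeBytes: &size}  // backup reported a size
	pruneOnly := dayStats{SnapshotCount: &count}   // later prune reported only a count
	row := merge(afterBackup, pruneOnly)
	fmt.Println(*row.RepoSizeBytes, *row.SnapshotCount)
}
```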
- P6-04 (M) Prometheus /metrics endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list. (Was P4-08.)
- P6-05 (S) Document Prometheus integration + sample Grafana dashboard JSON. (Was P4-09.)
As shipped (2026-05-07, branch p6-04-05-prometheus-metrics): Spec docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md, plan docs/superpowers/plans/2026-05-07-p6-04-05-prometheus-metrics.md. New internal/server/metrics package emits the legacy text/plain; version=0.0.4 exposition format directly, with no prometheus/client_golang dependency, matching the repo's "no Tailwind, no Node" minimal-deps style. /metrics is opt-in: RM_METRICS_TOKEN and/or RM_METRICS_TRUSTED_CIDR must be set or the route isn't mounted at all (404). When both are set, both must pass; either alone gates access. Token compare is constant-time. CIDR check honours X-Forwarded-For only when the immediate hop is a configured RM_TRUSTED_PROXY (mirrors the existing realIP resolution).

Metrics: per-host gauges (rm_host_agent_online, rm_host_last_backup_timestamp_seconds, rm_host_last_backup_success, rm_host_repo_size_bytes, rm_host_snapshot_count, rm_host_open_alerts, rm_host_repo_status); server gauges (rm_hosts_total, rm_hosts_online, rm_active_alerts{severity}, rm_build_info{version,commit,go_version}); histogram rm_job_duration_seconds_bucket{kind,status,le} with buckets 1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf. Histogram is in-memory; observations come from the existing MsgJobFinished branch in internal/server/ws/handler.go.

Docs: docs/prometheus.md covers enable + scrape config + metric reference + dashboard import. Dashboard: deploy/grafana/restic-manager-dashboard.json, six panels (fleet status, open alerts, backups failing, hosts table, repo size over time, job-duration p95). Schema 39, single Prometheus datasource variable.

Tests: golden-render + concurrent-observe + bucket-boundary in the metrics package; auth matrix (no auth → 404; token missing/wrong/right; CIDR matching/non-matching; token AND CIDR) in the HTTP layer.
Phase 6 acceptance
- Agents upgrade via apt/choco with one admin-triggered action. Prometheus can scrape /metrics and the sample Grafana dashboard renders with live data. Repo size trend visible on host detail.
Cross-cutting / ongoing
- X-01 Keep CHANGELOG.md updated (Keep-a-Changelog format). ✅ Landed: CHANGELOG.md at the repo root with a v1.0.0 entry summarising what each phase shipped, plus an empty Unreleased section to accumulate changes after the tag. Updated on each release going forward.
- X-02 Track restic version compatibility matrix
- X-03 Periodic dependency updates (dependabot or renovate)
- X-04 Threat-model review at end of each phase. ✅ Landed: docs/threat-model.md covering assets, actors, attack surfaces (bootstrap, local accounts, OIDC, agent enrolment, agent ↔ server WS, credential lifecycle, restore, audit log, self-update channel), residual risks, and explicit out-of-scope items. Reviewed against v1.0.0 surface; refresh on each tagged release.
- X-05 Proper first-run onboarding UI. ✅ Landed: bootstrap form already lives at /bootstrap and /login redirects to it when no users exist (so an operator hitting the server in a browser is guided into setup automatically; the form takes username + password only, no token field needed because the server holds the in-memory token and applies it server-side). Improvements added here: at first-run startup the server now prints a clickable $RM_BASE_URL/bootstrap URL (or a fallback message when RM_BASE_URL is unset) alongside the existing one-shot token for headless /api/bootstrap use; the bootstrap form's password field shows an explicit "Minimum 12 characters" hint so the rule is visible before submission instead of failing on submit.
Next steps from testing
Bin for issues spotted while exercising a live deployment. Promote into a phase once scoped; leave here while still being collected.
- NS-01 Admin-driven host deletion. ✅ Landed: store
DeleteHost(FK cascade revokes the agent bearer along with everything else), admin-bandPOST /hosts/{id}/delete, danger-zone form on host detail with hostname-confirm, audithost.deleted, live WS connection closed pre-delete. Original scope below for reference. No UI or API surface today — once a host is enrolled the only way to remove it is hand-editing SQLite, which then cascades through schedules/jobs/snapshots/source-groups via the FK chain. Needs: store-levelDeleteHost+ cascade audit, admin-bandDELETE /api/hosts/{id}and form-post variant, confirm-modal on the host-detail page, audit entry, and a decision on whether to also revoke the agent's bearer (recommend: yes, so a re-installed host comes back through the normal pending-host accept flow). - NS-02 Recoverable enrollment-token UX. ✅ Landed:
Store.ListOutstandingEnrollmentTokens+DeleteEnrollmentToken; outstanding-tokens panel on the Add-host page (short hash, redacted repo URL, created/expires) with per-row Regenerate (revokes old hash, mints fresh raw token preserving repo creds + initial paths, 303s to/hosts/pending/{newToken}) and Revoke (delete + audit). Audit actionsenrollment_token.regenerated/enrollment_token.revoked. Original scope below. TodayPOST /hosts/newmints a token and 303s to/hosts/pending/{token}; if the operator closes that tab the install snippet is lost and there's no UI surface to find it again — the row sits inenrollment_tokensuntil TTL expiry, invisible. Needs: store-levelListOutstandingEnrollmentTokensreturning(token_hash, created_at, expires_at, repo_url_redacted, initial_paths, attached_host_id_or_null); a small list section on the Add-host page (and/or Settings) showing outstanding tokens with created/expires-in and the redacted repo URL; admin-bandPOST /api/enrollment-tokens/{id}/regenerate(revokes the old hash, mints a fresh raw token, re-uses the original attachments — same pattern as the user-setup-token regenerate flow) andPOST /api/enrollment-tokens/{id}/revoke. Choose regenerate over "show original token" because we only persist hashes, never raw tokens. - NS-03 Auto-init repo on first onboard, surface credential failures eagerly. ✅ Landed: migration 0020 adds
hosts.repo_status(unknown/ready/init_failed) +repo_status_error; WS handler projects every init job's terminal state onto the host row (with idempotent "config file already exists" → ready); creds-save handlers (UI + JSON API) reset status tounknownand dispatch a fresh init when the agent is online; new/hosts/{id}/repo/proberetry endpoint and a status banner on the repo page. Remainder of original scope below. surface credential failures eagerly. Today the operator types repo URL + creds during Add-host and the credentials are pushed to the agent on connect, but norestic init/probe runs until the first scheduled job — so a typo in the password or a wrong URL goes undetected for hours/days, manifesting as a silent missed-backup. Wanted behaviour: when the host completes enrolment (or when an admin saves new repo creds), the server dispatches a one-shot probe job that runsrestic cat config(cheap, repo-existence + creds-validity in one call). OnIs there already a config file? unable to open config file→ runrestic init. On success → mark the host's repo as ready. On any other error (network, auth, fingerprint) → surface a panel-level error on the host detail page and audit the failure, leaving the host in an "init pending" state with a "Retry" button. Needs: a newJobKind(or piggyback on an existing one) for the probe, server-side state on the host row (repo_statusenum:unknown/ready/init_pending/init_failed), UI panel that shows the state, and clear copy on the Add-host page so the operator knows the save isn't fire-and-forget. - NS-05 Drop redundant
actions/setup-gofrom.gitea/workflows/ci.yml. ✅ Already gone — verified.gitea/workflows/ci.ymlhas zeroactions/setup-go@v5invocations and noGO_VERSIONenv; the file's header comment now documents that the runner image (gitea.dcglab.co.uk/steve/ci-runner-go) is the single source of truth for the Go version. Closing as done; no further code change needed. - NS-06 Remove the permanently-disabled "Run backup now" button from
web/templates/partials/host_chrome.html. ✅ Landed: dropped the disabled tombstone button from the host header action row; only "Edit credentials" + the ⋯ menu remain. Per-source-group Run-now on/hosts/{id}/sourcesis the only path now. No e2e change needed —smoke.spec.tsdoes not assert on host_chrome's button row. - NS-07 Relative timestamps go stale on long-open tabs. ✅ Landed:
`formatRelTime` now wraps its label in `<time data-rel-ts=…>`, and both layouts (`base.html`, `chromeless.html`) carry a small ticker that re-renders every 30s, so a page rendered an hour ago no longer keeps showing "2h ago" when the wall-clock truth is "3h ago". Covered by `funcs_test.go`. The bug: every relative label was computed once at server render and never updated client-side, so a job-detail page left open drifted further from reality the longer it sat.
- NS-08 Always-On vs intermittent host mode. ✅ Landed: a host can now be marked not-always-on (laptop/workstation) so it stops generating offline-alert noise when it legitimately sleeps. Migration 0024 adds `hosts.always_on` (default 1 = today's 24×7 behaviour; intermittent is strictly opt-in). The alert engine suppresses `agent_offline` for intermittent hosts and instead wires up the previously-dead `stale_schedule` alert for them, raised at a 7-day global threshold when the host has an enabled schedule and a stale last backup, and resolved on the next successful backup. A new server-side catch-up scheduler (`internal/server/http/catchup.go`) arms on agent hello and fires from the existing 30s pending-drain tick: ~60s after an intermittent host reconnects it dispatches a backup for any enabled schedule whose window elapsed while asleep (overdue = `cron.Next(lastBackup) <= now`, reusing the shared `cronParser`), guarded against firing when the host bounced offline, flipped to always-on, or already has a job running. Overdue is measured against the per-host `LastBackupAt` (exact for the common single-schedule laptop; a known coarseness for multi-cadence hosts, documented in code). Operator toggle via `POST /hosts/{id}/mode` (audited `host.mode_updated`), which also clears open offline/staleness alerts so the next sweep re-settles. UI: intermittent offline hosts render a calm grey `asleep · <relTime> · will catch up on return` state (new `.dot-asleep`) instead of red "offline"; a `24×7` chip shows only for always-on hosts; a "presence" inline toggle sits on the host header. Design + plan in `docs/specs/2026-06-15-always-on-host-mode-design.md` and `docs/plans/2026-06-15-always-on-host-mode.md`. Spec §2 (online/offline mechanics) deliberately left untouched. Out of scope for v1: per-host staleness thresholds, continuous (non-reconnect) overdue evaluation, per-schedule last-success tracking.
- NS-04 Dashboard parity with the alerts screen: live refresh, column sorting, filters. ✅ Landed: `/` now parses `q` / `status` / `repo_status` / `tag` / `sort` / `dir` query params (round-trip durable for bookmarks); the table is wrapped in an `id="hosts-table"` htmx live-poll matching the alerts cadence (5s, gated on `document.visibilityState` and `localStorage.rm-dashboard-live`); a filter row above the table offers hostname free-text + status + repo_status selects + tag chips + clear; column headers (Host / OS · arch / Last backup / Repo size / Snapshots) are clickable links that toggle direction on the active column; the pure-Go sort+filter pipeline is covered by `dashboard_filter_test.go`. Original scope below: live refresh, column sorting, filters. The host list is currently a static render, so operators have to reload to see new heartbeats / job state changes. Mirror the alerts pattern (`web/templates/pages/alerts.html` uses `hx-trigger="every 5s [document.visibilityState==='visible' && localStorage.getItem('rm-alerts-live')!=='off']"` plus a Live/Off toggle so background tabs and explicit-off don't burn server cycles). Add: server-side sort on every meaningful column (name, OS, last-backup time, last-backup status, agent online/offline, restic version, tags), and a small filter row above the table, at minimum free-text on hostname, status (online/offline/never-seen), and tag chips. Columns + filter state should round-trip through the query string so a bookmarked / shared URL is durable. Re-use the `host_row` partial that already exists so the live-refresh swap is a clean OOB swap, not a full table re-render.
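
As a rough illustration of the NS-04 query-string round trip, the sketch below parses filter and sort parameters and applies them to an in-memory host list. It is not the project's actual pipeline: the `hostRow` and `dashboardQuery` types, the `parseDashboardQuery` and `applyQuery` helpers, the two accepted sort keys, and the fallback to sorting by name are all assumptions made for the example. Only the parameter names (`q`, `status`, `tag`, `sort`, `dir`) come from the task text, and `repo_status` is left out for brevity.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
	"strings"
)

// hostRow is a hypothetical, trimmed-down dashboard row.
type hostRow struct {
	Name   string
	Status string // "online", "offline", or "never-seen"
	Tags   []string
}

// dashboardQuery holds the filter and sort state carried in the URL.
type dashboardQuery struct {
	Q, Status, Tag, Sort string
	Desc                 bool
}

// parseDashboardQuery reads the filter/sort params from a raw query string.
// Unknown sort keys fall back to "name" so a stale bookmark still renders.
func parseDashboardQuery(raw string) dashboardQuery {
	v, err := url.ParseQuery(raw)
	if err != nil {
		v = url.Values{} // malformed query: behave as if no filters were set
	}
	dq := dashboardQuery{
		Q:      strings.ToLower(strings.TrimSpace(v.Get("q"))),
		Status: v.Get("status"),
		Tag:    v.Get("tag"),
		Sort:   v.Get("sort"),
		Desc:   v.Get("dir") == "desc",
	}
	if dq.Sort != "name" && dq.Sort != "status" {
		dq.Sort = "name"
	}
	return dq
}

func hasTag(tags []string, want string) bool {
	for _, t := range tags {
		if t == want {
			return true
		}
	}
	return false
}

// applyQuery filters rows, then sorts the survivors.
// It builds a new slice and leaves the input untouched.
func applyQuery(rows []hostRow, dq dashboardQuery) []hostRow {
	out := make([]hostRow, 0, len(rows))
	for _, r := range rows {
		if dq.Q != "" && !strings.Contains(strings.ToLower(r.Name), dq.Q) {
			continue
		}
		if dq.Status != "" && r.Status != dq.Status {
			continue
		}
		if dq.Tag != "" && !hasTag(r.Tags, dq.Tag) {
			continue
		}
		out = append(out, r)
	}
	sort.SliceStable(out, func(i, j int) bool {
		a, b := out[i], out[j]
		if dq.Desc {
			a, b = b, a // reverse the comparison for descending order
		}
		if dq.Sort == "status" {
			return a.Status < b.Status
		}
		return a.Name < b.Name
	})
	return out
}

func main() {
	rows := []hostRow{
		{Name: "web-01", Status: "online", Tags: []string{"prod"}},
		{Name: "laptop-ann", Status: "offline", Tags: []string{"laptop"}},
		{Name: "db-01", Status: "online", Tags: []string{"prod"}},
	}
	dq := parseDashboardQuery("status=online&tag=prod&sort=name&dir=desc")
	for _, r := range applyQuery(rows, dq) {
		fmt.Println(r.Name)
	}
}
```

Keeping the parse step separate from the filter/sort step means the same `dashboardQuery` value can drive both the initial render and each live-poll refresh, which is one way to keep a bookmarked URL and the polled table in agreement.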
Future / unscheduled
Items here have a plausible use case but no confirmed need. They live outside numbered phases until a concrete trigger (a user request, a security review finding, a real disaster-recovery exercise) bumps them back into a phase.
- F-02 API tokens (PATs) for automation. Today the only way to drive `/api/*` from a tool is to log in as a real user and reuse the `rm_session` cookie. That is fine for a single automation account, but the cookie is bearer-equivalent for the 24h session TTL and cannot be revoked per tool. Build a proper personal-access-token feature: a new `personal_access_tokens` table (id, user_id, sha256 hash, name, optional role cap, created_at, last_used_at, revoked_at), a `/settings/tokens` UI to mint/list/revoke, and a branch in `requireUser` that accepts `Authorization: Bearer …` and falls back to the cookie. Reuse `auth.NewToken()` / `auth.HashToken()` (same primitives used for agent bearers). Audit each mint/revoke. Trigger to promote: second automation consumer, or any external integration request.
- F-01 (formerly P3-04) Cross-host restore. De-scoped from Phase 3 on 2026-05-04. Disaster recovery is already covered: stand up a replacement host, paste the original repo creds at enrolment, snapshots reappear, restore is same-host. The remaining "pull a file from host A onto host C without granting C permanent access" use case is genuinely different (file sharing / migration, not DR) and hasn't been requested. Original spec language was: "target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root". Re-promote when there's a real ask.
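
For F-02, the "accept a bearer token, else fall back to the cookie" branch might look roughly like the sketch below. Everything here is illustrative: `hashToken`, `bearerToken`, `patRecord`, and `resolveUser` are hypothetical names, the maps stand in for store lookups, and the real `auth.HashToken()` may not be a plain hex SHA-256. Rejecting a request that presents an invalid or revoked bearer, instead of quietly falling through to the cookie, is an assumption of this sketch and not something the task text specifies.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// hashToken returns the hex SHA-256 of a raw token so only the hash is stored.
// Hypothetical stand-in; the project's auth.HashToken() may differ.
func hashToken(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

// bearerToken extracts the token from an "Authorization" header value of the
// form "Bearer <token>". The scheme is matched case-insensitively.
func bearerToken(header string) (string, bool) {
	scheme, tok, found := strings.Cut(header, " ")
	if !found || !strings.EqualFold(scheme, "Bearer") {
		return "", false
	}
	tok = strings.TrimSpace(tok)
	return tok, tok != ""
}

// patRecord is a hypothetical subset of a personal_access_tokens row.
type patRecord struct {
	UserID  int64
	Revoked bool
}

// resolveUser prefers a bearer token and falls back to the session cookie
// only when no bearer token was presented. It returns the user ID, the
// credential kind ("pat" or "session"), and whether authentication succeeded.
func resolveUser(authHeader, sessionCookie string,
	patsByHash map[string]patRecord, sessions map[string]int64) (int64, string, bool) {
	if tok, ok := bearerToken(authHeader); ok {
		rec, found := patsByHash[hashToken(tok)]
		if !found || rec.Revoked {
			// Assumption: a presented-but-invalid bearer is rejected outright.
			return 0, "", false
		}
		return rec.UserID, "pat", true
	}
	if sessionCookie != "" {
		if uid, ok := sessions[sessionCookie]; ok {
			return uid, "session", true
		}
	}
	return 0, "", false
}

func main() {
	pats := map[string]patRecord{
		hashToken("tok-live"): {UserID: 7},
		hashToken("tok-dead"): {UserID: 8, Revoked: true},
	}
	sessions := map[string]int64{"cookie-1": 42}

	fmt.Println(resolveUser("Bearer tok-live", "", pats, sessions))
	fmt.Println(resolveUser("Bearer tok-dead", "cookie-1", pats, sessions))
	fmt.Println(resolveUser("", "cookie-1", pats, sessions))
}
```

Looking tokens up by hash means a leaked database row does not reveal a usable credential, and the `Revoked` flag gives the per-tool revocation that the shared session cookie lacks.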