restic-manager — Tasks
Tasks are grouped by phase. Each task has an ID for cross-referencing, an estimated size (S/M/L), and acceptance criteria.
Sizes: S = under a day, M = 1–3 days, L = 3–7 days.
Phase 0 — Project bootstrap
- P0-01 (S) Initialize Go module, cmd/server, cmd/agent, baseline internal/ packages
- P0-02 (S) Add LICENSE (PolyForm Noncommercial 1.0.0), README stub, CONTRIBUTING placeholder
- P0-03 (S) Set up golangci-lint, gofumpt, goimports; pre-commit config
- P0-04 (S) Gitea Actions (replacing GitHub Actions): build matrix (linux amd64/arm64, windows amd64), unit tests, lint
- P0-05 (S) Dockerfile.server (multi-stage, distroless), deploy/docker-compose.yml
- P0-06 (S) Makefile / taskfile.yml with common targets (build, test, run, release)
Phase 1 — MVP: enrollment, visibility, on-demand backup
Server foundations
- P1-01 (M) HTTP server scaffolding (chi, structured logging via slog, graceful shutdown)
- P1-02 (M) SQLite store layer (modernc.org/sqlite) + migrations (hand-rolled, embed.FS)
- P1-03 (M) Schema for users, sessions, hosts, repos, credentials, jobs, job_logs, snapshots, audit_log
- [~] P1-04 (M) Auth: argon2id password hashing, login/logout, session cookies; CSRF middleware deferred to P1-23 (UI work); REST clients use bearer/session-only flows
- P1-05 (S) First-run admin bootstrap (printed one-time setup token in server logs)
- P1-06 (M) Secret encryption helper (AEAD with key from RM_SECRET_KEY_FILE)
- [~] P1-07 (M) Audit log writer; the middleware sweep for every state-changing endpoint lands when the rest of the API surface does. Currently audited: login, bootstrap, host.enrolled, job.run_now
Agent ↔ server protocol
- P1-08 (M) Define shared API types in internal/api (envelopes, every WS message + protocol_version constants; JSON-shape tests pin the wire)
- P1-09 (L) WebSocket transport (github.com/coder/websocket), framed JSON envelopes, RPC correlation IDs, exponential-backoff reconnect on the agent side
- P1-10 (M) Enrollment flow: POST /api/agents/enroll with one-time token → returns persistent bearer. Cert pin field stays in the response shape but is left empty: the server is HTTP-only behind a reverse proxy, so the operator pastes the proxy's cert hash into the install command rather than having the server introspect a cert it doesn't terminate.
- P1-11 (M) Agent registration on connect (hello upserts agent_version/restic_version/protocol_version, flips status online; protocol_too_old rejection has a clean error envelope)
- P1-12 (S) Heartbeat handler (touches last_seen_at; background sweeper marks hosts offline after 90s without one)
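The hello handshake in P1-11 can be illustrated with a small sketch. The envelope and payload field names below, the "hello.ok" reply type, and the minimum version of 2 are all assumptions made for illustration; only the agent_version / restic_version / protocol_version fields and the protocol_too_old code come from the task list.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MinProtocolVersion is a made-up floor for this sketch.
const MinProtocolVersion = 2

// Envelope is a hypothetical framed message: a type tag, a correlation ID,
// and a payload that is decoded only once the type is known.
type Envelope struct {
	Type    string          `json:"type"`
	ID      string          `json:"id,omitempty"`
	Payload json.RawMessage `json:"payload,omitempty"`
}

type Hello struct {
	AgentVersion    string `json:"agent_version"`
	ResticVersion   string `json:"restic_version"`
	ProtocolVersion int    `json:"protocol_version"`
}

type ErrorPayload struct {
	Code    string `json:"code"`
	Message string `json:"message"`
}

// HandleHello decodes a hello envelope and returns the reply to send back.
// An outdated agent gets a structured error envelope, not a parse failure.
func HandleHello(raw []byte) (Envelope, error) {
	var env Envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return Envelope{}, fmt.Errorf("decode envelope: %w", err)
	}
	if env.Type != "hello" {
		return Envelope{}, fmt.Errorf("expected hello, got %q", env.Type)
	}
	var h Hello
	if err := json.Unmarshal(env.Payload, &h); err != nil {
		return Envelope{}, fmt.Errorf("decode hello: %w", err)
	}
	if h.ProtocolVersion < MinProtocolVersion {
		p, err := json.Marshal(ErrorPayload{
			Code:    "protocol_too_old",
			Message: fmt.Sprintf("agent speaks v%d, server needs v%d or newer", h.ProtocolVersion, MinProtocolVersion),
		})
		if err != nil {
			return Envelope{}, err
		}
		return Envelope{Type: "error", ID: env.ID, Payload: p}, nil
	}
	return Envelope{Type: "hello.ok", ID: env.ID}, nil
}

func main() {
	raw := []byte(`{"type":"hello","id":"1","payload":{"agent_version":"0.1.0","restic_version":"0.17.0","protocol_version":1}}`)
	reply, err := HandleHello(raw)
	fmt.Println(reply.Type, string(reply.Payload), err)
}
```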
Agent foundations
- P1-13 (M) Agent config file (/etc/restic-manager/agent.yaml, atomic save: tmp+fsync+rename); Windows path deferred to Phase 2
- P1-14 (M) Service integration: systemd unit (sandboxed: NoNewPrivileges, Protect*, MemoryDenyWriteExecute); Linux only in Phase 1
- P1-15 (M) Outbound WS client (github.com/coder/websocket) with reconnect (1s → 60s exponential + jitter), optional cert pin (sha-256 of leaf), heartbeats, protocol_version in hello
- P1-16 (M) Restic wrapper: locate via PATH or override, run with --json, scan stdout/stderr, parse BackupStatus + BackupSummary, exit-code 3 treated as success-with-issues
- P1-17 (S) Host metadata collection (OS, arch, hostname, restic version, agent version, protocol_version)
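The reconnect policy in P1-15 (1s → 60s exponential + jitter) could be computed as in the sketch below. The task list does not specify the jitter strategy; this version assumes "equal jitter" (half the delay is fixed, half is randomised), and the Backoff name is hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	baseDelay = time.Second      // first retry delay
	maxDelay  = 60 * time.Second // cap from the task list
)

// Backoff returns the wait before reconnect attempt n (0-based).
// jitter must be in [0, 1); callers normally pass rand.Float64().
// The delay doubles from 1s, is capped at 60s, and the result is spread
// over [d/2, d) so a fleet of agents does not reconnect in lockstep.
func Backoff(attempt int, jitter float64) time.Duration {
	d := baseDelay
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	half := d / 2
	return half + time.Duration(jitter*float64(half))
}

func main() {
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Printf("attempt %d: wait %v\n", attempt, Backoff(attempt, rand.Float64()))
	}
}
```

Taking the jitter as a parameter keeps the function deterministic, which makes the cap and doubling easy to pin in a unit test.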
Run-now backup
- P1-18 (L) Job lifecycle: queued → running → succeeded/failed/cancelled, persisted with stats blob and error
- P1-19 (M) Server endpoint POST /api/hosts/{id}/jobs to dispatch a backup command (validates kind, checks online, audit-logs)
- P1-20 (M) Agent executes restic backup, streams stdout/stderr + parsed JSON events back as job.progress (1Hz throttle) / log.stream
- [~] P1-21 (M) Server persists log stream to job_logs ✓; WS /api/jobs/{id}/stream for live browser tailing still TODO (needs the per-job fan-out hub)
- P1-22 (S) Snapshot listing: agent calls restic snapshots --json after each successful backup and ships the projection over snapshots.report. Server ReplaceHostSnapshots atomically swaps the per-host list and updates hosts.snapshot_count in the same tx. Read endpoint: GET /api/hosts/{id}/snapshots. Tests cover round-trip, authoritative-replace semantics, and empty-after-prune. Schema dropped the unused repo_id FK from snapshots (repos as a first-class entity is P2 work).
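The job lifecycle in P1-18 (queued → running → succeeded/failed/cancelled) is a small state machine. A minimal sketch of a transition guard follows; the type and function names are hypothetical, and it models only the transitions the task list states, so for example cancelling a job that is still queued is not covered.

```go
package main

import "fmt"

type JobStatus string

const (
	JobQueued    JobStatus = "queued"
	JobRunning   JobStatus = "running"
	JobSucceeded JobStatus = "succeeded"
	JobFailed    JobStatus = "failed"
	JobCancelled JobStatus = "cancelled"
)

// next lists the transitions named in the task list and nothing else.
var next = map[JobStatus][]JobStatus{
	JobQueued:  {JobRunning},
	JobRunning: {JobSucceeded, JobFailed, JobCancelled},
}

// CanTransition reports whether a job may move from one status to another.
func CanTransition(from, to JobStatus) bool {
	for _, s := range next[from] {
		if s == to {
			return true
		}
	}
	return false
}

// IsTerminal reports whether a status is one of the three end states.
func IsTerminal(s JobStatus) bool {
	switch s {
	case JobSucceeded, JobFailed, JobCancelled:
		return true
	}
	return false
}

func main() {
	fmt.Println(CanTransition(JobQueued, JobRunning))    // allowed
	fmt.Println(CanTransition(JobSucceeded, JobRunning)) // terminal states never restart
	fmt.Println(IsTerminal(JobCancelled))
}
```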
UI (HTMX + Tailwind)
- P1-23 (M) Base layout, login page, session-aware nav
- P1-24 (M) Dashboard: fleet summary tiles + host table (status dot + row accent + os/arch + last backup + repo size + snapshots + alerts + tags + run-now). Backed by GET /api/hosts + GET /api/fleet/summary (JSON) and a server-rendered HTML view. Empty state hands the operator the install command. HTMX Run now button posts to /hosts/{id}/run-backup.
- P1-25 (M) Host detail page (/hosts/{id}): persistent header (status dot + mono name + tags + OS/arch/agent/restic/last-seen), vitals strip (last backup / repo size / snapshots / open alerts), sub-tabs (Snapshots active; Jobs/Repo/Settings tabs visible but inert until P2), snapshot table (cap 50, pagination later), right-rail run-now stack (backup live; forget/prune/check/unlock disabled with P2 hints) and a danger-zone delete panel.
- P1-26 (M) Live job log viewer + WS browser fan-out hub (closes the P1-21 remainder). Browser opens /api/jobs/{id}/stream; agent-emitted job.started / job.progress / log.stream / job.finished are mirrored to subscribers. Per-subscriber buffered channel + non-blocking broadcast keeps a slow browser from blocking the agent's read loop. Page renders running / succeeded / failed states; auto-scrolls until the operator scrolls up; reloads on job.finished to show the final header. "Run now" sets HX-Redirect so the operator lands on the live log.
- [~] P1-27 (M) "Add host" flow: form takes hostname + repo URL/username/password, mints token (TTL 1h), re-renders the same page in result-state with the install command (RM_SERVER + RM_TOKEN filled in), copy button, and an awaiting-agent panel. Encrypted repo creds ride on the token row (P1-32) and get pushed to the agent on first WS connect (P1-33). Deferred: one-click "download preconfigured installer" install-<hostname>.sh (cf. UrBackup Internet-mode push installer); copy-paste covers it for v1.
- P1-28 (S) Tailwind build via tailwindcss standalone binary (no Node): Makefile downloads pinned v3.4.17 into bin/tailwindcss, builds web/styles/input.css → web/static/css/styles.css, embedded into the binary via web.FS. make build runs Tailwind first.
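The fan-out described in P1-26 (per-subscriber buffered channel + non-blocking broadcast) can be sketched as below. The Hub type and its method names are illustrative, and messages are plain strings rather than real envelopes; the point is the select with a default branch, which drops a message for a full subscriber instead of stalling the sender.

```go
package main

import (
	"fmt"
	"sync"
)

// Hub mirrors messages to every subscriber without ever blocking the sender.
type Hub struct {
	mu   sync.Mutex
	subs map[chan string]struct{}
}

func NewHub() *Hub { return &Hub{subs: make(map[chan string]struct{})} }

// Subscribe registers a subscriber with its own buffer of the given size.
func (h *Hub) Subscribe(buf int) chan string {
	ch := make(chan string, buf)
	h.mu.Lock()
	h.subs[ch] = struct{}{}
	h.mu.Unlock()
	return ch
}

// Unsubscribe removes a subscriber; the caller stops reading afterwards.
func (h *Hub) Unsubscribe(ch chan string) {
	h.mu.Lock()
	delete(h.subs, ch)
	h.mu.Unlock()
}

// Broadcast offers msg to every subscriber and returns how many were skipped
// because their buffer was full. A slow browser loses lines; the agent's read
// loop never waits on it.
func (h *Hub) Broadcast(msg string) (dropped int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for ch := range h.subs {
		select {
		case ch <- msg:
		default:
			dropped++
		}
	}
	return dropped
}

func main() {
	h := NewHub()
	slow := h.Subscribe(1) // never drained in this demo
	fast := h.Subscribe(8)
	for i := 0; i < 3; i++ {
		fmt.Println("dropped:", h.Broadcast(fmt.Sprintf("line %d", i)))
	}
	fmt.Println("buffered:", len(slow), len(fast))
}
```

Dropping is one possible policy for a full buffer; disconnecting the slow subscriber is another, and the task list does not say which the project chose.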
Install scripts
- P1-29 (M) install.sh (Linux): detects arch, downloads agent, installs systemd unit, enrolls. Detects existing restic timers / /etc/cron.{d,daily,hourly,weekly}/* / root crontab and prints them with the exact disable commands; does not auto-disable
- [~] P1-31 (S) Server endpoint to serve agent binaries + install scripts ✓ (/agent/binary + /install/*); signature verification deferred to Phase 5 OSS readiness
Repo credentials (pulled forward from Phase 2)
- P1-32 (M) Server-side encrypted repo creds carried on the enrollment token:
  - POST /api/enrollment-tokens body grows repo_url, repo_username, repo_password (all required).
  - Token row stores them as one AEAD-encrypted blob (existing crypto.AEAD); ConsumeEnrollmentToken moves the blob to a new host_credentials row keyed by host_id in the same tx.
  - PUT /api/hosts/{id}/repo-credentials (admin/operator) re-encrypts and replaces the row, emits an in-memory event to the WS hub.
  - GET /api/hosts/{id}/repo-credentials returns the redacted view (URL + username + has_password) so the UI can pre-fill the edit form. Password never leaves the server outside the WS push.
  - On WS hello, server pushes a config.update with decrypted creds before returning the connection to idle. Same path on edit-while-connected.
  - Audit-logged on create / consume / edit; payload omits the secret material.
- P1-33 (M) Agent-side encrypted secrets store:
  - New internal/agent/secrets package: AEAD blob at /var/lib/restic-manager/secrets.enc, atomic write (tmp+fsync+rename, mode 0600).
  - Per-host 32-byte secrets key minted at enrollment, persisted in agent.yaml (already 0600 root-only; same trust boundary as the bearer; explicit comment in the file).
  - Strip repo_url / repo_password from agent.config.Config. Agent loads creds from secrets.enc at startup; config.update handler writes through to the file.
  - Dispatcher reads from the secrets store on every job rather than from in-memory config.
  - Migration path: if agent.yaml still contains repo_url / repo_password, copy them into secrets.enc on next start, then strip from the YAML on save.
- P1-34 (S) End-to-end smoke runbook: docs/e2e-smoke.md walks through enrollment with repo creds → agent receives them via push-on-connect → run-now backup completes against a real restic/rest-server in a sibling container → host appears with snapshot count. Test-driven version (Playwright + compose) deferred to P5-06.
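Both the agent config (P1-13) and the secrets blob (P1-33) rely on the tmp+fsync+rename pattern. A minimal sketch of that pattern follows; WriteFileAtomic and writeAndReadBack are hypothetical names, and the sketch omits the directory fsync that a stricter implementation might add after the rename.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// WriteFileAtomic writes data to a temp file in the target's directory,
// fsyncs it, then renames it over path. Readers see either the old file or
// the complete new one, never a partial write.
func WriteFileAtomic(path string, data []byte, perm os.FileMode) (err error) {
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
	if err != nil {
		return err
	}
	defer func() {
		if err != nil {
			tmp.Close()           // harmless if already closed
			os.Remove(tmp.Name()) // do not leave a partial temp file behind
		}
	}()
	if err = tmp.Chmod(perm); err != nil {
		return err
	}
	if _, err = tmp.Write(data); err != nil {
		return err
	}
	if err = tmp.Sync(); err != nil { // flush file contents to stable storage
		return err
	}
	if err = tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// writeAndReadBack is a demo helper: it writes into a throwaway directory
// and returns what a subsequent read sees.
func writeAndReadBack(data []byte) ([]byte, error) {
	dir, err := os.MkdirTemp("", "secrets-demo-")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "secrets.enc")
	if err := WriteFileAtomic(path, data, 0o600); err != nil {
		return nil, err
	}
	return os.ReadFile(path)
}

func main() {
	got, err := writeAndReadBack([]byte("encrypted-blob"))
	fmt.Println(string(got), err)
}
```

The temp file is created in the same directory as the target because a rename is only atomic within one filesystem.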
Phase 1 acceptance
- One Linux host can enroll, appear in the dashboard, and a backup can be triggered from the UI with live log streaming. Snapshots list updates after success.
- Windows binary builds cleanly in CI (.gitea/workflows/ci.yml) but is not service-tested or installer-shipped in Phase 1; that lands in Phase 2 (P2-16, P2-17).
- Agent ↔ server protocol_version handshake rejects mismatched versions with a clear error rather than failing on JSON parse.
- Repo credentials never appear in plaintext on disk: server stores them AEAD-encrypted, agent stores them AEAD-encrypted, the wire carries them only inside the authenticated WS as config.update.
Phase 2 — Scheduling, retention, repo operations
Mid-phase pivot: "P2 redesign" (commits 7a7cac5, 666af41, 5667cdf). The original P2 plan put paths/excludes/retention/manual/kind/options on Schedule and one repo per host. After landing P2-01..P2-05 against that shape, the data model was rewritten: schedules are slim (cron + which source_groups); paths/excludes/retention/retry live on source_group (also doubles as the snapshot tag); forget/prune/check cadences live on host_repo_maintenance and run on a server-side ticker, not the agent cron; pending_runs queues offline retries; host.repo_initialised_at is gone (auto-init at enrolment). The redesign is captured below as P2R-NN items. Items P2-01..P2-05 stay marked done because the work shipped, but they're labelled "⚠️ shipped against old shape"; behaviour to be re-validated under P2R-02 after UI rewire. P2-04.5 (manual flag) is dropped wholesale. P2-06..P2-15 are reframed below to point at their new homes; P2-16/17/18 are unaffected by the redesign.
Original P2 work — shipped (against pre-redesign shape)
- ⚠️ P2-01 (M) Schedule schema + CRUD API (fat-schedule shape) — superseded by P2R-01.
- ⚠️ P2-02 (L) Server-pushed schedule reconciliation (fat-schedule shape) — re-validate under P2R-01.
- ⚠️ P2-03 (M) Agent local scheduler (internal/agent/scheduler, robfig/cron/v3, schedule.fire envelope, dispatchScheduledJob). The cron loop + ack/fire round-trip stay; the payload it carries reshapes under P2R-01.
- ⚠️ P2-04 (M) Schedule editor UI (fat-schedule form: paths/excludes/tags/retention/bandwidth on the schedule itself), superseded by P2R-02.
- P2-04.5 Manual schedules / kill host.default_paths: superseded; the manual flag concept is gone, Run-now lives on source groups (see P2R-01 / P2R-02).
- ⚠️ P2-05 (M) forget command with retention policy. Wire payload (CommandRunPayload.retention_policy) and restic wrapper (restic.ForgetPolicy, RunForget) are still correct; what changes under P2R-03 is where retention comes from (source_group, not schedule) and who dispatches (server-side maintenance ticker for cadence; per-source-group Run-now for ad-hoc).
P2 redesign — Phase 1 ✅
- P2R-00.1 (M) Migration 0008: sources + repo maintenance. Adds source_groups, schedule_source_groups junction, host_repo_maintenance, pending_runs, host.bandwidth_up_kbps / bandwidth_down_kbps. Drops host.repo_initialised_at. Slim-schedule columns dropped from schedules. Column-level ALTERs only, no table rebuilds (FK cascade trap, see CLAUDE.md). Commit 7a7cac5.
- P2R-00.2 (M) v4 wireframes for sources / schedules / repo. Hi-fi mocks of /hosts/{id}/sources, /sources/{gid}/edit (with retention-conflict banner), slim /schedules, /repo (connection / bandwidth / maintenance / re-init). Commit 666af41.
P2 redesign — Phase 2 ✅
- P2R-00.3 (L) Go-side store rewrite against migration 0008. New types: SourceGroup, HostRepoMaintenance, PendingRun. Schedule slimmed to {id, host_id, cron, enabled, source_group_ids, timestamps}. RetentionPolicy moves from schedule field → source group field (type unchanged). Host loses RepoInitialisedAt, gains bandwidth caps. New files: store/sources.go, store/maintenance.go, store/pending.go. store/schedules.go rewritten for slim shape + junction CRUD. enrollment.go seeds a default source group + repo-maintenance row instead of a manual schedule. ws/handler.go drops MarkHostRepoInitialised. HTTP layer + UI templates temporarily 501-stubbed with redesign_in_progress; this is what P2R-01 / P2R-02 fill back in. Tests for the obsolete fat-schedule API deleted. Commit 5667cdf.
- P2R-00.4 (S) Host-detail UI patched up enough to render: RepoInitialisedAt template refs removed, manual init-repo branches stripped, dead Schedules sub-tab demoted to inert div (matches Jobs/Repo/Settings), broken Run-now buttons disabled with P2-Phase-4 hints. Stop-gap until P2R-02 lands the real surface.
P2 redesign — Phase 3 (REST + WS rewire) ✅
- P2R-01 (L) HTTP/WS layer against the slim shape:
  - Schedules REST CRUD: GET|POST /api/hosts/{id}/schedules, PUT|DELETE /api/hosts/{id}/schedules/{sid}. Body shape is {cron, enabled, source_group_ids[]}; paths/excludes/retention/kind/manual all go away. Junction wiped + re-inserted on every update (per store.UpdateSchedule). Validation: cron parses via robfig/cron/v3; ≥1 source_group_ids; all referenced groups belong to the host.
  - Source-groups REST CRUD: GET|POST /api/hosts/{id}/source-groups, GET|PUT|DELETE /api/hosts/{id}/source-groups/{gid}. Body: {name, includes[], excludes[], retention_policy, retry_max, retry_backoff_seconds}. Name uniqueness per host. Refuse delete if SchedulesUsingGroup(gid) is non-empty (return the schedule list so UI can show "remove from these schedules first"). Mutations bump host_schedule_version.
  - Repo-maintenance REST: GET|PUT /api/hosts/{id}/repo-maintenance. Body: {forget_cadence, prune_cadence, check_cadence, check_subset_pct, enabled}. Server-side ticker drives execution (P2R-04), so updates here do not bump host_schedule_version.
  - Per-source-group Run-now: POST /hosts/{id}/source-groups/{gid}/run. Reuses the existing dispatchScheduleNow-style path; agent receives a normal command.run carrying the resolved includes/excludes/retention from the group. This replaces the old per-host /hosts/{id}/run-backup endpoint (kept around as a 410-Gone with a hint pointing to source groups).
  - schedule_push.go reconciliation: rebuild pushScheduleSet* to ship the new wire format (ScheduleSetPayload carries [{schedule_id, cron, enabled, source_groups: [{name, includes, excludes, retention, retry_*}]}]; agent doesn't need to know source_group_id, just the resolved bundle). On-hello + async-on-CRUD flavours; ack still persists applied_schedule_version.
  - Auto-init at enrolment: server dispatches restic init on first WS connect (was P2-old "Init repo" button, now invisible to the operator). On success: emit a normal job row with kind=init so the audit trail still shows it. On init returning "config file already exists" (e.g. re-enrolment against an existing repo): treat as soft success per existing restic-wrapper behaviour.
  - Tests: rewrite the deleted schedules_test.go and schedule_push_test.go against new endpoints; new source_groups_test.go, repo_maintenance_test.go, auto_init_test.go. End-to-end: enrol → server pushes creds → server dispatches init → agent runs it → schedule reconcile fires → operator hits per-source-group Run-now → backup runs → snapshots refresh.
P2 redesign — Phase 4 (UI rewire, against v4 wireframes) ✅
Row-design rule (binding for every list-row template in this app, current and future): Whole-row click navigates to the row's primary detail/edit page. Mirror .host-row.clickable on the dashboard (partials/host_row.html): an absolute-positioned .row-link overlay with text-indent: -9999px covers the row, action buttons live in .row-action cells that sit above via z-index. Do not add an explicit "Edit" button when the row is clickable; it duplicates the affordance and dilutes the click target. Action cells are reserved for verbs that aren't "open this row" (Run-now, Delete, Pause, etc).
- P2R-02 (L) UI templates rebuilt against the new model:
  - Slice 1 ✅ Sub-tab navigation skeleton: extract header/vitals/sub-tabs into a host_chrome partial; Sources / Schedules / Repo become real <a> links; placeholder pages share the chrome; version indicator restored. (commit a535822)
  - Slice 2 ✅ Sources tab: /hosts/{id}/sources list with per-row meta + clickable rows + per-group Run-now/Delete; /sources/new and /sources/{gid}/edit form (name, includes/excludes, 3×2 keep-* grid, retry-on-offline, inline conflict banner from ConflictDimension cache); validation re-renders form with input intact; refuses to delete a host's last source group. (commits 0ed9c3d, dede74f)
  - Slice 3 ✅ Schedules tab: /hosts/{id}/schedules slim list (status / cron / source-tags / actions, clickable rows) plus /schedules/new and /schedules/{sid}/edit form (cron with five quick-pick chips that have human-readable tooltips, source-group multi-pick as styled check cards, enabled toggle); per-schedule Run-now reuses dispatchScheduledJob for enabled schedules + bypasses the enabled check (with a HX-confirm) for paused ones; multi-group fires emit a success toast, single-group fires HX-Redirect to the live job log. (commit 67ca769 + follow-ups 64d2fcf, 8b91d30, 4035c44)
  - Slice 4 ✅ /hosts/{id}/repo: three independent forms (connection: URL/user/password pre-filled from GET /api/hosts/{id}/repo-credentials redacted view; bandwidth: host-wide caps via new PUT /api/hosts/{id}/bandwidth; maintenance: forget/prune/check cadences + check subset %); danger-zone re-init button rendered + disabled (real flow lands in P2R-09); right-rail snapshots-by-tag breakdown. (commit d62b173)
  - Slice 5 ✅ Dashboard row Run-now uses the single covering schedule when one exists ("Run all groups" primary button), otherwise falls back to "Open →" pointing at the Sources tab. Right-rail and empty-snapshots-state Run-now were rehomed to source-group context in slice 1. (commit fab99b4)
  - Slice 6 ✅ Playwright sweep against the live :8080 server: login → walk every new tab → create source group → create schedule → Run-now → confirm a snapshot landed → end-to-end clean, no console errors. Screenshots in _diag/p2r-02-sweep/.
  - Side-fix: agent runner drops noisy restic status events from log.stream (they were drowning the live log on short backups; the throttled job.progress envelope already covers the same data). (commit ffba737)
  - Header "version N · agent in sync / agent at vM" indicator preserved across all tabs (backed by host_schedule_version + applied_schedule_version).
  - Form validation re-renders with the operator's typed input intact (mirror P2-04's behaviour). Each save fires pushScheduleSetAsync so an online agent re-arms within seconds.
P2 redesign — Phase 5 ✅
Shipped on branch p2r-phase5-maintenance (PR #3). Plan: docs/superpowers/plans/2026-05-03-p2-redesign-phase-5.md.
- P2R-03 (M) prune command end-to-end. Restic wrapper (restic.RunPrune), agent dispatcher (case api.JobPrune:), wire envelope. Admin-only credential: a second host_credentials row keyed by host_id + kind=admin carries the non-append-only username/password; server pushes it via config.update only when dispatching a prune job, and the agent's secrets store keeps it in a separate slot from the everyday append-only creds. UI: prune row on the Repo page. Operator-triggered Run-now via POST /hosts/{id}/repo/prune. Cadence-driven dispatch via the maintenance ticker (P2R-06).
- P2R-04 (M) check command end-to-end (restic check --read-data-subset N%). Wrapper + dispatcher + wire. UI: check row on the Repo page (with the subset % slider). Operator Run-now via POST /hosts/{id}/repo/check. Cadence-driven dispatch via the maintenance ticker (P2R-06).
- P2R-05 (S) unlock command end-to-end (restic unlock). Operator-only, no cadence. POST /hosts/{id}/repo/unlock. Repo page surfaces lock state from the most recent check (which warns about stale locks).
- P2R-06 (M) Server-side maintenance ticker. Cron-style loop on the server reads host_repo_maintenance rows, dispatches forget / prune / check jobs against the right host on the configured cadence. Last-fire anchor is derived from the jobs table via LatestJobByKind (queued + running included so a long-running prune correctly suppresses the next tick). Independent of the agent's local cron; the agent's cron only handles backup schedules now. Skips offline hosts (logged, no queue; only scheduled backup fires queue, per P2R-08). Forget reshape: ships a multi-group ForgetGroups payload so one job fires N restic-forget invocations per tick.
- P2R-07 (S) Repo stats panel on the Repo page: total size, raw size, last-check timestamp + status (color-coded), last-prune timestamp, stale-lock banner. Backed by restic stats --json --mode raw-data that the agent ships in a repo.stats envelope after every backup / check / prune / unlock; persisted via Store.UpsertHostRepoStats into a new host_repo_stats projection table.
- P2R-08 (M) Pending-runs queue worker. Scheduled backup fires that race an agent disconnect queue to pending_runs. Drained on a 30s server-side tick and on agent reconnect (via onAgentHello); per-host TryLock mutex prevents the two paths double-dispatching the same row. Exponential backoff capped at 30 minutes; abandons rows that exceed the source-group's retry_max (audit-logged) or whose schedule/group has genuinely been deleted.
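P2R-08 has two drain paths (the 30s tick and the reconnect hook) guarded by a per-host TryLock mutex. A minimal sketch of that guard follows, assuming a hypothetical HostLocks registry; sync.Mutex.TryLock requires Go 1.18 or newer. The path that loses the race simply skips, since the winner is already draining the same rows.

```go
package main

import (
	"fmt"
	"sync"
)

// HostLocks hands out one mutex per host ID.
type HostLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func NewHostLocks() *HostLocks {
	return &HostLocks{locks: make(map[string]*sync.Mutex)}
}

// forHost returns the mutex for a host, creating it on first use.
func (h *HostLocks) forHost(hostID string) *sync.Mutex {
	h.mu.Lock()
	defer h.mu.Unlock()
	l, ok := h.locks[hostID]
	if !ok {
		l = &sync.Mutex{}
		h.locks[hostID] = l
	}
	return l
}

// DrainHost runs drain unless another drain for the same host is in flight.
// It returns false when the call was skipped.
func (h *HostLocks) DrainHost(hostID string, drain func()) bool {
	l := h.forHost(hostID)
	if !l.TryLock() {
		return false // the other path owns this host right now
	}
	defer l.Unlock()
	drain()
	return true
}

func main() {
	locks := NewHostLocks()
	ran := locks.DrainHost("host-a", func() {
		// Simulate the reconnect hook firing while the ticker is mid-drain.
		fmt.Println("same host, concurrent:", locks.DrainHost("host-a", func() {}))
		fmt.Println("other host:", locks.DrainHost("host-b", func() {}))
	})
	fmt.Println("outer drain ran:", ran)
}
```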
P2 redesign — Phase 6 (auto-init follow-up) ✅
- P2R-09 (S) Auto-init UX polish. Latest init job status surfaced under the host-detail vitals strip (succeeded/failed/running/queued, with link to the live job log on non-success). Danger-zone POST /hosts/{id}/repo/reinit dispatches a fresh init job after the operator types the host name to confirm; audit row records host.repo_reinit.
Pre/post hooks (rehomed onto source groups) ✅
- P2R-10 (M) Hook schema: migration 0010 adds pre_hook / post_hook BLOB columns to source_groups and pre_hook_default / post_hook_default to hosts. Bytes stored verbatim; AEAD encrypt/decrypt at the HTTP layer (per-slot AD bytes). Round-trip tests cover set/clear semantics on both tables.
- P2R-11 (M) Agent execution of hooks: runner.BackupHooks + runHook helper invoked via /bin/sh -c (cmd.exe /C on Windows). pre_hook non-zero exit aborts the backup; post_hook always runs with RM_JOB_STATUS=succeeded|failed in env. Output streamed as hook(<phase>): … log.stream lines. Hooks only run for kind=backup. Server side resolves group → host default → empty and ships plaintext on the WS payload (decrypt at HTTP layer).
- P2R-12 (S) Hook editor UI: source-group edit form gains pre/post hook textareas with the service-user warning banner; bodies AEAD-encrypted on save (per-group AD). Repo page adds a host-default Hooks panel with the same shape; saved via POST /hosts/{id}/repo/hooks.
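The hook semantics in P2R-11 can be sketched roughly as follows. This is an illustration, not the project's runner: the function signatures are invented, output is returned as a string instead of being streamed as log.stream lines, and two behaviours are assumptions the task list leaves open (the post hook also runs when the pre hook aborts, and a post hook failure does not change the job result).

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"runtime"
)

// runHook executes a hook body through the platform shell. When jobStatus is
// non-empty it is exported to the hook as RM_JOB_STATUS.
func runHook(script, jobStatus string) (string, error) {
	var cmd *exec.Cmd
	if runtime.GOOS == "windows" {
		cmd = exec.Command("cmd.exe", "/C", script)
	} else {
		cmd = exec.Command("/bin/sh", "-c", script)
	}
	cmd.Env = os.Environ()
	if jobStatus != "" {
		cmd.Env = append(cmd.Env, "RM_JOB_STATUS="+jobStatus)
	}
	out, err := cmd.CombinedOutput()
	return string(out), err
}

// RunBackupWithHooks wraps a backup function with optional pre/post hooks.
func RunBackupWithHooks(pre, post string, backup func() error) error {
	var err error
	if pre != "" {
		if out, hookErr := runHook(pre, ""); hookErr != nil {
			// A non-zero pre_hook exit aborts the backup.
			err = fmt.Errorf("pre_hook: %w (output %q)", hookErr, out)
		}
	}
	if err == nil {
		err = backup()
	}
	status := "succeeded"
	if err != nil {
		status = "failed"
	}
	if post != "" {
		// Assumption: the post hook's own exit code is not folded into the result.
		_, _ = runHook(post, status)
	}
	return err
}

func main() {
	err := RunBackupWithHooks("exit 3", "", func() error {
		fmt.Println("backup ran") // not reached: the pre hook fails first
		return nil
	})
	fmt.Println("result:", err)
}
```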
Bandwidth + niceties (rehomed onto host + source groups) ✅
- P2R-13 (S) Bandwidth limit fields. restic.Env gains LimitUploadKBps / LimitDownloadKBps, emitted as --limit-upload / --limit-download global flags before the subcommand on every invocation. Agent dispatcher tracks host-wide caps received via config.update; server pushes them on hello and after PUT /api/hosts/{id}/bandwidth. Per-job override on the per-source-group Run-now form (collapsed <details> "Limit bandwidth for this run" with two KB/s inputs); override wins over host caps.
- P2R-14 (S) Schedule "next run" / "last run". New store.LatestJobBySchedule query. Schedules tab grows two columns (Next derived from cron via robfig/cron/v3.Parse(...).Next, Last from latest actor_kind=schedule job). Dashboard host row prepends next 12h ago/from now when a single covering schedule is the run-now candidate.
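The flag handling in P2R-13 amounts to resolving the effective caps and placing the global flags ahead of the subcommand. A minimal sketch follows; the Limits type and both function names are hypothetical (the source names the fields on restic.Env), and resolving the override per direction is an assumption, since the task list only says the override wins over host caps.

```go
package main

import (
	"fmt"
	"strconv"
)

// Limits holds bandwidth caps in KB/s; zero means "no limit".
type Limits struct {
	UploadKBps   int
	DownloadKBps int
}

// Effective applies a per-run override on top of the host-wide caps.
// Assumption: each direction is overridden independently.
func Effective(host, override Limits) Limits {
	out := host
	if override.UploadKBps > 0 {
		out.UploadKBps = override.UploadKBps
	}
	if override.DownloadKBps > 0 {
		out.DownloadKBps = override.DownloadKBps
	}
	return out
}

// ResticArgs builds the argument list with the limit flags placed before
// the subcommand, followed by the subcommand's own arguments.
func ResticArgs(l Limits, subcommand string, rest ...string) []string {
	var args []string
	if l.UploadKBps > 0 {
		args = append(args, "--limit-upload", strconv.Itoa(l.UploadKBps))
	}
	if l.DownloadKBps > 0 {
		args = append(args, "--limit-download", strconv.Itoa(l.DownloadKBps))
	}
	args = append(args, subcommand)
	return append(args, rest...)
}

func main() {
	host := Limits{UploadKBps: 1024, DownloadKBps: 2048}
	perRun := Limits{UploadKBps: 512} // operator limited upload for this run only
	fmt.Println(ResticArgs(Effective(host, perRun), "backup", "--json", "/etc"))
}
```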
Cross-platform + alt-enrolment ✅
- P2-16 (M) Windows service integration: internal/agent/service (build-tagged) implements svc.Handler; new restic-manager-agent install|uninstall|start|stop|run subcommands wrap the SCM via golang.org/x/sys/windows/svc/mgr. Cross-compile verified (GOOS=windows GOARCH=amd64 go build ./cmd/agent); untested on Windows itself, since Linux CI can't exercise the SCM round-trip.
- P2-17 (M) install.ps1 (Windows): pwsh installer that detects arch, downloads $Server/agent/binary?os=windows&arch=amd64, runs the agent in -enroll-server (+ optional -enroll-token) mode (token flow OR announce-and-approve), then registers the service via restic-manager-agent install. Surfaces existing scheduled tasks named *restic* without disabling. Served by the existing GET /install/* handler; restage block in CLAUDE.md updated.
- P2-18 (L) Announce-and-approve enrolment (second enrolment mode):
  - Agent run with no RM_TOKEN generates a local Ed25519 keypair (persisted alongside the encrypted secrets blob), then POST /api/agents/announce with {hostname, os, arch, agent_version, restic_version, public_key}. Server stores a pending_hosts row (public_key, fingerprint = sha256(public_key), announced_from_ip, first_seen_at, last_seen_at, expires_at = now+1h). Hostname collisions with existing or other pending rows are flagged in the response so the install script can warn loudly on the endpoint terminal.
  - Agent then opens a long-poll/WS to /ws/agent/pending authenticated by signing a server-issued nonce with its private key, which proves possession of the key tied to the pending row. Connection stays open; agent waits.
  - Install script prints the fingerprint on the endpoint's terminal in a copy-friendly form (e.g. SHA256:ab12…cd34) and tells the operator to compare it to the one shown in the UI before clicking accept.
  - UI: new "Pending hosts" panel on the dashboard. Admin sees fingerprint, hostname, source IP, OS/arch, time announced. Buttons: Accept (mints persistent bearer + repo creds, pushes both down the open pending socket, promotes pending row → real Host row, audit-logged) / Reject (deletes pending row, closes the socket with a clean error). Fingerprint is the load-bearing field; UI must make comparison easy (large monospace, one-click copy).
  - Server-side guards: per-source-IP rate limit on /api/agents/announce (token-bucket, e.g. 10/min); global cap on pending rows (e.g. 100); pending rows auto-expire after 1h; duplicate-hostname pending rows allowed but visually flagged in UI; accepting one does not auto-reject the others (admin sees them all and decides; defends against the "attacker announces first, real host second" race).
  - Token-based enrollment (Phase 1) remains the default and is unchanged; announce-and-approve is opt-in for interactive installs. Docs explicitly call out that the fingerprint comparison step is what makes this flow safe; without it, this is no better than trusting hostname over the wire.

  As shipped: migration 0011 + store/pending_hosts.go cover the table. POST /api/agents/announce (rate-limited 10/min/IP, global cap 100 in-flight rows) returns {pending_id, fingerprint, hostname_collision}. GET /ws/agent/pending runs the Ed25519 nonce-sign handshake. Admin POSTs to /api/pending-hosts/{id}/accept|reject (audit-logged as host.accept_pending / host.reject_pending). Dashboard panel renders the queue with a copyable fingerprint + inline accept form (URL/user/password). 60s server ticker sweeps expired rows. Agent: cmd/agent/announce.go mints + persists an Ed25519 keypair into agent.yaml's announce_key field; runs automatically when -enroll-server is supplied without -enroll-token. The install scripts haven't been updated to surface the printed fingerprint beyond the agent's own banner; the operator reads it from the install script's stdout.
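The two cryptographic steps of announce-and-approve (fingerprint = sha256(public_key), and proving key possession by signing a server-issued nonce) can be sketched with the standard library alone. All function names here are hypothetical, the hex encoding of the fingerprint is an assumption based on the SHA256:ab12…cd34 example, and the 32-byte nonce size is an arbitrary choice.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// NewAgentKey mints the keypair the agent would persist locally.
func NewAgentKey() (ed25519.PublicKey, ed25519.PrivateKey, error) {
	return ed25519.GenerateKey(rand.Reader)
}

// Fingerprint is what the operator compares between terminal and UI.
func Fingerprint(pub ed25519.PublicKey) string {
	sum := sha256.Sum256(pub)
	return "SHA256:" + hex.EncodeToString(sum[:])
}

// NewNonce returns a fresh random challenge issued by the server.
func NewNonce() ([]byte, error) {
	nonce := make([]byte, 32)
	_, err := rand.Read(nonce)
	return nonce, err
}

// SignNonce is the agent side: sign the challenge with the private key.
func SignNonce(priv ed25519.PrivateKey, nonce []byte) []byte {
	return ed25519.Sign(priv, nonce)
}

// VerifyPossession is the server side: check the signature against the
// public key stored on the pending row.
func VerifyPossession(pub ed25519.PublicKey, nonce, sig []byte) bool {
	if len(pub) != ed25519.PublicKeySize {
		return false // ed25519.Verify panics on a wrong-sized key
	}
	return ed25519.Verify(pub, nonce, sig)
}

func main() {
	pub, priv, err := NewAgentKey()
	if err != nil {
		panic(err)
	}
	nonce, err := NewNonce()
	if err != nil {
		panic(err)
	}
	sig := SignNonce(priv, nonce)
	fmt.Println(Fingerprint(pub))
	fmt.Println("possession proven:", VerifyPossession(pub, nonce, sig))
}
```

Because the nonce is fresh per connection, a captured signature cannot be replayed against a later challenge.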
Phase 2 acceptance
- A host can be onboarded end-to-end with no manual REST: enrol → auto-init runs → operator opens host → creates source group(s) → attaches them to one or more schedules → schedule fires on time → backup runs against the right paths with the right retention → snapshots tagged by group name appear in UI.
- Operator can hit Run-now per source group from any of: dashboard row (single-group host), source group row, snapshot empty-state.
- Server-side maintenance ticker drives forget/prune/check at the configured cadences, independent of agent cron. Scheduled backup fires for offline hosts queue to pending_runs and drain on reconnect.
- Pre/post hooks fire correctly per source group, fail loudly on pre_hook errors, run post_hook with RM_JOB_STATUS. Rejected on non-backup kinds.
- Bandwidth limits honoured (host-wide default + per-run override).
- A Windows host can enrol, appear in the dashboard, and run a backup with live log streaming. Not validated in CI: Linux runners cannot exercise the SCM round-trip; the service_windows.go / install.ps1 pieces compile cleanly under GOOS=windows GOARCH=amd64 but the first real Windows install will be the first end-to-end test.
- A Linux host can enrol via announce-and-approve, with fingerprint-comparison gate enforced. Rate-limit + pending-cap guards verified.
Phase 3 — Restore, alerts, audit
Phase 3 is split into three independently-shippable sub-phases: Restore (P3-01..03 + P3-09 + P3-X1 cancel + P3-X2 tree-list RPC), Alerts (P3-05..07), Audit UI (P3-08). Each sub-phase has its own spec → plan → implement cycle; we hand back at sub-phase boundaries.
P3-04 (cross-host restore) was de-scoped during the Phase-3 brainstorm on 2026-05-04: disaster recovery is already covered by re-enrolling a replacement host with the same repo creds (snapshots reappear, restore is same-host). The remaining "pull a file from host A onto host C without giving C permanent access" use case is genuinely different and doesn't have a confirmed need yet, so it's moved to the Future / unscheduled section at the end of this file.
Phase 3 — Restore ✅
Spec: docs/superpowers/specs/2026-05-04-p3-restore-design.md. Wireframe: _diag/p3-restore-wizard/wireframe.html. Sweep screenshots: _diag/p3-restore-sweep/. Shipped on branch p3-restore.
- P3-X1 (S) Cancel-job feature. command.cancel WS envelope; agent tracks per-job ctx.CancelFunc and kills the running restic subprocess via context cancel (SIGTERM, SIGKILL after 5s grace via cmd.Cancel + cmd.WaitDelay); server endpoint POST /api/jobs/{id}/cancel bridges UI → WS; the existing UI Cancel button on /jobs/{id} now works for any running kind. Sandbox-aware: internal/restic/cancel_{unix,windows}.go build-tags pick SIGTERM on POSIX vs os.Kill on Windows (which can't deliver SIGTERM). Tests: cancel mid-run via 'sleep 30' fake-restic returns JobCancelled with exit 130 in <200ms.
- P3-X2 (S) Tree-list synchronous WS RPC. MsgTreeList ↔ MsgTreeListResult with Envelope.ID correlation; generic Hub.SendRPC helper (registry of buffered channels keyed by ULID, ctx-cancel + timeout aware). internal/restic.ListTreeChildren wraps restic ls --json and filters its recursive output to direct children. Server-side treeCache is per-wizard-session (keyed by session cookie + host + snapshot + path) with a 30-min TTL and lazy sweep.
- P3-01 (L) Restore wizard backend (internal/server/http/ui_restore.go). GET handlers render the four-step wizard against the wireframe. HTMX/fetch tree partial endpoint hits fetchTreeWithCache. POST validates: snapshot_id, ≥1 absolute path, in-place ⇒ confirm_hostname == host name, agent online; on error re-renders with operator's input intact. Happy path mints job_id, target = /var/lib/restic-manager/restore/<job-id> (server-picked, agent's writable dir under the systemd sandbox's ReadWritePaths), creates job row, ships command.run with RestorePayload, writes host.restore audit row, returns HX-Redirect (or 303) to the live job page.
- P3-02 (L) Wizard UI templates (web/templates/pages/host_restore.html + partials/tree_node.html). Single-page progressively-enabled four-step form. Form-state-driven JS computes a running tally + step-4 confirm summary client-side. Tree expansion uses plain fetch (not HTMX) for simpler target lookup; loaded-state cached per node. Top-level Restore button on host detail right rail + per-snapshot Restore action on snapshot rows. New .snap-row token in web/styles/input.css.
- P3-03 (M) Restore execution. restic.RunRestore builds restore <sid> --target <dir> [--include p]... with --json; new pumpRestoreStdout parses status + summary objects. --no-ownership is gated on the agent's restic version via Env.AtLeastVersion(0, 17): the flag was added in 0.17 and 0.16 rejects it. Restic version is threaded through runner.Config.ResticVersion from the agent's sysinfo snapshot. New-dir target is operator-editable (default $HOME/rm-restore/<job-id>/); agent expands $HOME, ${HOME}, and ~/ at run time and calls os.MkdirAll on the target chain so the operator never has to pre-create the per-job subdir. runner.RunRestore translates RestoreStatus into job.progress (mapping FilesRestored → FilesDone, etc.); agent dispatcher case JobRestore reuses the spawn() helper from P3-X1 so cancel works. Restore-shaped job-detail variant with current-file display under the progress bar.
- P3-09 (S) diff between two snapshots. JobDiff JobKind + restic.RunDiff + runner.RunDiff; POST /api/hosts/{id}/snapshots/diff (and HTMX-form variant on the unprefixed path) dispatcher with two-snapshot guard + per-host snapshot-list validation; UI panel on host detail right rail (visible when 2+ snapshots) with two short-id inputs + Diff button. Output streams as log.stream to the standard live job log page.
- P3-X3 (S) Recent-restores line on host detail. hostChromeData grows RestoreStatus / RestoreAt / RestoreJobID populated via store.LatestJobByKind(host_id, 'restore') (already exists from P2R). host_chrome.html renders a small line below the init-status one with status-coloured copy + a link to the job log. Hidden when no restore has ever run on this host.
- P3-X4 (S) Job log download (txt + ndjson). New GET /api/jobs/{id}/log.{txt|ndjson} endpoint backed by the persisted job_logs table; it works at any time (running or finished) without pausing the live WS stream because the source is the DB, not the live socket. Plain-text format mirrors the on-screen "HH:MM:SS.mmm TAG payload" shape with a small # job ... · kind ... · status ... header; ndjson emits one self-contained {seq,ts,stream,payload} JSON object per line for jq / tooling. Surfaced as a single header dropdown on the live job page (details/summary-driven, native keyboard support, click-outside-to-close). New reusable .dropdown / .dropdown-menu / .dropdown-item tokens in web/styles/input.css.
- P3-X5 (S) UK lint locale + sweep. .golangci.yml misspell locale switched US → UK and the codebase swept (~73 corrections: behaviour, serialise, recognise, honour, initialise, enrol, unauthorised, etc.). Wire ErrorCode value "unauthorized" → "unauthorised" is a tiny contract change but the agent doesn't parse those codes today and no external clients exist yet.
- P3-X6 (S) Snapshot SIZE/FILES tooltip on host detail. The per-snapshot summary block was added by restic 0.17 (the source comment in internal/restic/snapshots.go incorrectly said 0.16+); on 0.16 hosts the columns render —. hostDetailPage.LegacyRestic (computed via Env.AtLeastVersion(0, 17)) drives a title="Needs restic 0.17+ on the agent host. This host runs <ver>." + cursor: help on the column headers, hidden once the host upgrades.
Migration 0012 widens the jobs.kind CHECK constraint to include restore and diff. Rebuild required (SQLite can't ALTER CHECK in place); follows the safe pattern from 0005, with a defensive temp-table backup of job_logs so the cascade-trap that bit migration 0007 wouldn't take the log history with it.
install.sh + systemd unit: the install script now pre-creates /root/rm-restore (root-owned 0700) so the default new-dir restore target works under the sandbox out of the box; the unit's ReadWritePaths gains -/root/rm-restore (soft-fail prefix). Existing installs need a re-run of install.sh to pick up the new dir; new operator-typed targets are auto-created by the agent at job time.
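The agent-side target handling ($HOME / ${HOME} / ~/ expansion followed by os.MkdirAll) might look roughly like the sketch below. expandHome is a hypothetical name, and the 0700 mode is an assumption borrowed from the install script's permissions for /root/rm-restore; the agent's real implementation may differ.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// expandHome is a hypothetical helper. It replaces a leading $HOME, ${HOME}
// or ~ path component with the given home directory. Anything else,
// including look-alikes such as $HOMEWORK, is returned untouched.
func expandHome(p, home string) string {
	for _, prefix := range []string{"${HOME}", "$HOME", "~"} {
		if p == prefix {
			return home
		}
		if rest, ok := strings.CutPrefix(p, prefix+"/"); ok {
			return filepath.Join(home, rest)
		}
	}
	return p
}

func main() {
	// A temp dir stands in for the agent's home so the sketch is safe to run.
	home, err := os.MkdirTemp("", "home")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(home)

	target := expandHome("$HOME/rm-restore/job-123/", home)
	// MkdirAll creates every missing directory in the chain, so the
	// operator never has to pre-create the per-job subdirectory.
	if err := os.MkdirAll(target, 0o700); err != nil {
		panic(err)
	}
	fmt.Println("restore target ready:", target)
}
```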
As shipped (Playwright sweep against the live smoke env, 2026-05-04): login → host detail → Restore button → wizard step 1 picks snapshot a1ac4006 (most recent) → tree drill-down /home/steve/test (3 lazy loads) → tick file1 + file2 → step 4 confirm summary populated → dispatch → live job page with running progress widget → restore succeeds, files land on disk at /root/rm-restore/<job-id>/home/steve/test/file{1,2} (default $HOME/rm-restore/<job-id>/ after agent-side expansion). Custom-target restore to /tmp/custom-restore/<job-id>/ lands inside the agent's PrivateTmp namespace. Snapshot diff between a1ac4006 and 5f78c788 → diff job page, statistics output streamed (738 bytes added, 0 removed). Recent-restores line on host detail reads "last restore · succeeded 28s ago · job log →". Download dropdown serves both .txt and .ndjson with correct Content-Type + Content-Disposition. SIZE/FILES tooltip "Needs restic 0.17+ on the agent host. This host runs 0.16.4." renders on column hover.
Phase 3 — Alerts ✅
- P3-05 (M) Alert engine: rule evaluation loop (failed backup, stale schedule, agent offline, check failed)
- P3-06 (M) Notification channels: webhook, ntfy, SMTP email
- P3-07 (S) Alert UI: list, acknowledge, resolve
As shipped (Playwright sweep, 2026-05-04): /settings/notifications → 3 channels created (sweep-webhook → local Python sink, sweep-ntfy → ntfy.sh public topic, sweep-smtp → MailHog at 127.0.0.1:1025). Test buttons fire alert.test on each: webhook 200/1ms, ntfy 200/322ms, SMTP 250/3ms. Synthetic critical backup_failed raised → /alerts shows row with severity dot, kind chip, host, message, raised/last-seen, Ack + Resolve buttons; nav badge 1; dashboard critical-alert banner appears with Review → link; OPEN ALERTS card reads 1 unresolved. Acknowledge → fan-out to all 3 channels emits alert.acknowledged (verified in webhook sink, MailHog inbox, notification_log); Acknowledged tab shows row with ack'd by <user> line. Resolve → fan-out emits alert.resolved across all 3 channels; banner clears; dashboard reads 0 unresolved · all clear; host alerts column reads —. Three live bugs found and fixed mid-sweep: (a) enabled form value lost because hidden + checkbox both named enabled and PostForm.Get returned the first ("0"); (b) Ack/Resolve handlers stored the state change but never dispatched alert.acknowledged / alert.resolved; (c) hosts.open_alert_count projection was never recomputed on Raise/Resolve/AutoResolve, so the dashboard count always read 0.
Phase 3 — Audit log UI (not started)
- P3-08 (S) Audit log UI with filters (user, action, target, time range)
Phase 3 acceptance
- A file deleted on a host can be restored from the UI in under 2 minutes via the wizard at /hosts/{id}/restore; the operator can cancel a running restore (or any other running job) from the live job page. Snapshot diff between two snapshots renders as a normal job page.
- A failed backup raises an alert via the configured channel within 60s.
- The audit-log UI lets an admin filter by user / action / target / time range.
Phase 4 — Update delivery, RBAC polish, OIDC
- P4-01 (M) Update delivery via OS package managers: host an apt repo (Linux) and Chocolatey package (Windows) on Gitea releases. restic-manager-agent update is a thin wrapper over apt-get install --only-upgrade restic-manager-agent / choco upgrade. Trades flexibility for a much smaller security surface than bespoke signed binaries (see spec.md §4.2)
- P4-02 (M) Agent version reporting on dashboard: surface "agent N versions behind server"; "update all" admin action calls the package-manager wrapper on each host
- P4-03 (M) RBAC enforcement at API layer (admin / operator / viewer)
- P4-04 (S) User management UI (create/edit/disable, role assignment, password reset)
- P4-05 (L) OIDC login (generic provider config, group → role mapping)
- P4-06 (M) Repo size trend graphs (sparkline on host card, full chart on repo page)
- P4-07 (S) Per-host tags + dashboard filtering by tag
- P4-08 (M) Prometheus /metrics endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list
- P4-09 (S) Document Prometheus integration + sample Grafana dashboard JSON
Phase 4 acceptance
- Non-admin users see an appropriately limited UI. Agents upgrade via apt/choco with one admin-triggered action. OIDC login works against at least one provider (Authelia or Authentik). Prometheus can scrape /metrics and the sample Grafana dashboard renders with live data.
Phase 5 — OSS readiness
- P5-01 (M) Documentation site (mdBook or similar) with install, concepts, security model, screenshots
- P5-02 (S) CONTRIBUTING.md, CODE_OF_CONDUCT.md, issue + PR templates
- P5-03 (S) Release automation: goreleaser for binaries + Docker image to GHCR
- P5-04 (S) Demo screenshots / short Loom walkthrough in README
- P5-05 (S) SECURITY.md with disclosure process
- P5-06 (M) End-to-end test suite in CI (Playwright vs. compose stack with sibling Linux agent)
- P5-07 (S) Reference deployment: docker-compose.yml + Caddyfile snippet showing the TLS-terminating reverse proxy in front of the HTTP-only server (also demonstrates RM_TRUSTED_PROXY)
Phase 5 acceptance
- A stranger can read the docs and stand up a working install in under 30 minutes.
Cross-cutting / ongoing
- X-01 Keep CHANGELOG.md updated (Keep-a-Changelog format)
- X-02 Track restic version compatibility matrix
- X-03 Periodic dependency updates (dependabot or renovate)
- X-04 Threat-model review at end of each phase
- X-05 Proper first-run onboarding UI: admin shouldn't need to curl /api/bootstrap by hand. Render the bootstrap form on the same login page (extra "setup token" field shown only while no admin user exists, hidden after); on submit POST to /api/bootstrap, then drop straight into a session. Surface the one-time token from the server log somewhere copy-able (or print a clickable URL with the token in the query string at first-run). Also: relax the 12-char password floor for the first-run path or document it in the form so admin doesn't silently fail validation.
Future / unscheduled
Items here have a plausible use case but no confirmed need. They live outside numbered phases until a concrete trigger (a user request, a security review finding, a real disaster-recovery exercise) bumps them back into a phase.
- F-01 (formerly P3-04) Cross-host restore. De-scoped from Phase 3 on 2026-05-04. Disaster recovery is already covered: stand up a replacement host, paste the original repo creds at enrolment, snapshots reappear, restore is same-host. The remaining "pull a file from host A onto host C without granting C permanent access" use case is genuinely different (file sharing / migration, not DR) and hasn't been requested. Original spec language was: "target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root". Re-promote when there's a real ask.