Splits Phase 3 into three independently-shippable sub-phases (Restore, Alerts, Audit UI) so they can land in separate PRs with their own brainstorm → spec → plan cycles. The Restore sub-phase is up first. The brainstorm ran on 2026-05-04 and locked the following decisions:
- Single-host restore only this phase. P3-04 (cross-host restore) is moved to a new 'Future / unscheduled' section. Disaster recovery is already covered by re-enrolling a replacement host with the same repo creds; the remaining 'pull a file from host A onto host C' use case is genuinely different (file sharing / migration, not DR) and has no confirmed need.
- Default target is /var/restic-restore/<job-id>/ with --no-ownership; in-place restore preserves uid/gid/mode and is gated by typed-confirmation of the host name (mirroring the repo re-init danger zone).
- Tree browser is the path picker, lazy-loaded via a synchronous WS RPC (tree.list) over the existing correlation-ID infrastructure with a per-wizard-session in-memory cache (~30 min TTL).
- Single-page wizard with progressively-enabled sections; entry is a top-level Restore button on host detail (or per-snapshot Restore action for direct deep-link).
- Snapshot diff (P3-09) is a JobDiff JobKind, dispatched like every other agent operation; output streams to the standard live job log page.
- Restore-specific live job page variant with files-restored / bytes-restored / current-file widget.
- Single-flight per host across all kinds, plus a real cancel-job feature (command.cancel WS envelope, agent kills the restic subprocess via context cancel + SIGTERM/SIGKILL grace) so the operator can pre-empt a long-running backup if they need to restore urgently. Wires the existing job_detail Cancel button (which was a UI stub).
- Audit row host.restore on every dispatch + a recent-restores panel on host detail. Role gate deferred to P4-03 RBAC.
Wireframe at _diag/p3-restore-wizard/wireframe.html (gitignored — transient design artefact); screenshot reviewed and approved 2026-05-04.
restic-manager — Tasks
Tasks are grouped by phase. Each task has an ID for cross-referencing, an estimated size (S/M/L), and acceptance criteria.
Sizes: S = under a day, M = 1–3 days, L = 3–7 days.
Phase 0 — Project bootstrap
- P0-01 (S) Initialize Go module, cmd/server, cmd/agent, baseline internal/ packages
- P0-02 (S) Add LICENSE (PolyForm Noncommercial 1.0.0), README stub, CONTRIBUTING placeholder
- P0-03 (S) Set up golangci-lint, gofumpt, goimports; pre-commit config
- P0-04 (S) Gitea Actions (originally GitHub Actions): build matrix (linux amd64/arm64, windows amd64), unit tests, lint
- P0-05 (S) Dockerfile.server (multi-stage, distroless), deploy/docker-compose.yml
- P0-06 (S) Makefile / taskfile.yml with common targets (build, test, run, release)
Phase 1 — MVP: enrollment, visibility, on-demand backup
Server foundations
- P1-01 (M) HTTP server scaffolding (chi, structured logging via slog, graceful shutdown)
- P1-02 (M) SQLite store layer (modernc.org/sqlite) + migrations (hand-rolled, embed.FS)
- P1-03 (M) Schema for users, sessions, hosts, repos, credentials, jobs, job_logs, snapshots, audit_log
- [~] P1-04 (M) Auth: argon2id password hashing, login/logout, session cookies; CSRF middleware deferred to P1-23 (UI work) — REST clients use bearer/session-only flows
- P1-05 (S) First-run admin bootstrap (printed one-time setup token in server logs)
- P1-06 (M) Secret encryption helper (AEAD with key from RM_SECRET_KEY_FILE)
- [~] P1-07 (M) Audit log writer; middleware sweep for every state-changing endpoint lands when the rest of the API surface does — login / bootstrap / host.enrolled / job.run_now currently audited
Agent ↔ server protocol
- P1-08 (M) Define shared API types in internal/api (envelopes, every WS message + protocol_version constants; JSON-shape tests pin the wire)
- P1-09 (L) WebSocket transport (github.com/coder/websocket), framed JSON envelopes, RPC correlation IDs, exponential-backoff reconnect on the agent side
- P1-10 (M) Enrollment flow: POST /api/agents/enroll with one-time token → returns persistent bearer. Cert pin field stays in the response shape but is left empty: the server is HTTP-only behind a reverse proxy, so the operator pastes the proxy's cert hash into the install command rather than having the server introspect a cert it doesn't terminate.
- P1-11 (M) Agent registration on connect (hello upserts agent_version/restic_version/protocol_version, flips status online, protocol_too_old rejection has clean error envelope)
- P1-12 (S) Heartbeat handler (touches last_seen_at; background sweeper marks hosts offline after 90s without one)
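The RPC correlation IDs from P1-09 can be illustrated with a small pending-request table. This is a minimal sketch, not the project's code: the Envelope field names (type, id, payload) and the Pending type are assumptions made for illustration; the real shapes live in internal/api.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// Envelope is a hypothetical wire frame; the field names are assumed.
type Envelope struct {
	Type    string          `json:"type"`
	ID      string          `json:"id,omitempty"` // correlation ID
	Payload json.RawMessage `json:"payload,omitempty"`
}

// Pending maps in-flight correlation IDs to the channel awaiting the reply.
type Pending struct {
	mu      sync.Mutex
	waiters map[string]chan Envelope
}

func NewPending() *Pending {
	return &Pending{waiters: make(map[string]chan Envelope)}
}

// Register reserves a slot for a request and returns its reply channel.
func (p *Pending) Register(id string) <-chan Envelope {
	ch := make(chan Envelope, 1) // buffered so Resolve never blocks
	p.mu.Lock()
	p.waiters[id] = ch
	p.mu.Unlock()
	return ch
}

// Resolve routes a reply to its waiter; it reports false for unknown IDs.
func (p *Pending) Resolve(reply Envelope) bool {
	p.mu.Lock()
	ch, ok := p.waiters[reply.ID]
	delete(p.waiters, reply.ID)
	p.mu.Unlock()
	if ok {
		ch <- reply
	}
	return ok
}

func main() {
	p := NewPending()
	wait := p.Register("req-1")

	// Simulate the read loop decoding a reply frame off the socket.
	var reply Envelope
	raw := []byte(`{"type":"tree.list.result","id":"req-1","payload":{}}`)
	if err := json.Unmarshal(raw, &reply); err != nil {
		panic(err)
	}
	fmt.Println(p.Resolve(reply), (<-wait).Type)
}
```

A real caller would also need a timeout on the wait and cleanup of the slot when the socket drops; both are left out here.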
Agent foundations
- P1-13 (M) Agent config file (/etc/restic-manager/agent.yaml, atomic save: tmp+fsync+rename); Windows path deferred to Phase 2
- P1-14 (M) Service integration: systemd unit (sandboxed: NoNewPrivileges, Protect*, MemoryDenyWriteExecute) — Linux only in Phase 1
- P1-15 (M) Outbound WS client (github.com/coder/websocket) with reconnect (1s → 60s exponential + jitter), optional cert pin (sha-256 of leaf), heartbeats, protocol_version in hello
- P1-16 (M) Restic wrapper: locate via PATH or override, run with --json, scan stdout/stderr, parse BackupStatus + BackupSummary, exit-code 3 treated as success-with-issues
- P1-17 (S) Host metadata collection (OS, arch, hostname, restic version, agent version, protocol_version)
Run-now backup
- P1-18 (L) Job lifecycle: queued → running → succeeded/failed/cancelled, persisted with stats blob and error
- P1-19 (M) Server endpoint POST /api/hosts/{id}/jobs to dispatch a backup command (validates kind, checks online, audit-logs)
- P1-20 (M) Agent executes restic backup, streams stdout/stderr + parsed JSON events back as job.progress (1Hz throttle) / log.stream
- [~] P1-21 (M) Server persists log stream to job_logs ✓; WS /api/jobs/{id}/stream for live browser tailing still TODO — needs the per-job fan-out hub
- P1-22 (S) Snapshot listing: agent calls restic snapshots --json after each successful backup and ships the projection over snapshots.report. Server ReplaceHostSnapshots atomically swaps the per-host list and updates hosts.snapshot_count in the same tx. Read endpoint: GET /api/hosts/{id}/snapshots. Tests cover round-trip, authoritative-replace semantics, and empty-after-prune. Schema dropped the unused repo_id FK from snapshots (repos as a first-class entity is P2 work).
UI (HTMX + Tailwind)
- P1-23 (M) Base layout, login page, session-aware nav
- P1-24 (M) Dashboard: fleet summary tiles + host table (status dot + row accent + os/arch + last backup + repo size + snapshots + alerts + tags + run-now). Backed by GET /api/hosts + GET /api/fleet/summary (JSON) and a server-rendered HTML view. Empty state hands the operator the install command. HTMX Run now button posts to /hosts/{id}/run-backup.
- P1-25 (M) Host detail page (/hosts/{id}): persistent header (status dot + mono name + tags + OS/arch/agent/restic/last-seen), vitals strip (last backup / repo size / snapshots / open alerts), sub-tabs (Snapshots active; Jobs/Repo/Settings tabs visible but inert until P2), snapshot table (cap 50, pagination later), right-rail run-now stack (backup live; forget/prune/check/unlock disabled with P2 hints) and a danger-zone delete panel.
- P1-26 (M) Live job log viewer + WS browser fan-out hub (closes the P1-21 remainder). Browser opens /api/jobs/{id}/stream; agent-emitted job.started / job.progress / log.stream / job.finished are mirrored to subscribers. Per-subscriber buffered channel + non-blocking broadcast keeps a slow browser from blocking the agent's read loop. Page renders running / succeeded / failed states; auto-scrolls until the operator scrolls up; reloads on job.finished to show the final header. "Run now" sets HX-Redirect so the operator lands on the live log.
- [~] P1-27 (M) "Add host" flow: form takes hostname + repo URL/username/password, mints token (TTL 1h), re-renders the same page in result-state with the install command (RM_SERVER + RM_TOKEN filled in), copy button, and an awaiting-agent panel. Encrypted repo creds ride on the token row (P1-32) and get pushed to the agent on first WS connect (P1-33). Deferred: one-click "download preconfigured installer" install-<hostname>.sh (cf. UrBackup Internet-mode push installer) — copy-paste covers it for v1.
- P1-28 (S) Tailwind build via tailwindcss standalone binary (no Node) — Makefile downloads pinned v3.4.17 into bin/tailwindcss, builds web/styles/input.css → web/static/css/styles.css, embedded into the binary via web.FS. make build runs Tailwind first.
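The fan-out behaviour described in P1-26 (per-subscriber buffered channel plus non-blocking broadcast) can be sketched as follows. The Hub type and its method names are hypothetical, and the choice to drop messages for a full subscriber is an assumption about how "non-blocking" is realised.

```go
package main

import (
	"fmt"
	"sync"
)

// Hub mirrors each message to every subscriber without ever blocking.
type Hub struct {
	mu   sync.Mutex
	subs map[chan string]struct{}
}

func NewHub() *Hub { return &Hub{subs: make(map[chan string]struct{})} }

// Subscribe registers a subscriber with its own buffer of size buf.
func (h *Hub) Subscribe(buf int) chan string {
	ch := make(chan string, buf)
	h.mu.Lock()
	h.subs[ch] = struct{}{}
	h.mu.Unlock()
	return ch
}

// Unsubscribe removes the subscriber and closes its channel.
func (h *Hub) Unsubscribe(ch chan string) {
	h.mu.Lock()
	if _, ok := h.subs[ch]; ok {
		delete(h.subs, ch)
		close(ch)
	}
	h.mu.Unlock()
}

// Broadcast offers msg to every subscriber and returns how many
// dropped it because their buffer was full.
func (h *Hub) Broadcast(msg string) (dropped int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for ch := range h.subs {
		select {
		case ch <- msg:
		default: // slow subscriber: skip rather than stall the sender
			dropped++
		}
	}
	return dropped
}

func main() {
	h := NewHub()
	fast := h.Subscribe(8)
	slow := h.Subscribe(1) // never drained in this demo
	dropped := 0
	for i := 0; i < 3; i++ {
		dropped += h.Broadcast(fmt.Sprintf("line %d", i))
	}
	fmt.Println(len(fast), len(slow), dropped)
}
```

Because the sender never waits, the agent's read loop keeps its pace; a browser that falls behind simply misses lines and can recover them from the persisted job_logs.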
Install scripts
- P1-29 (M) install.sh (Linux): detects arch, downloads agent, installs systemd unit, enrolls. Detects existing restic timers / /etc/cron.{d,daily,hourly,weekly}/* / root crontab and prints them with the exact disable commands — does not auto-disable
- [~] P1-31 (S) Server endpoint to serve agent binaries + install scripts ✓ (/agent/binary + /install/*); signature verification deferred to Phase 5 OSS readiness
Repo credentials (pulled forward from Phase 2)
- P1-32 (M) Server-side encrypted repo creds carried on the enrollment token:
  - POST /api/enrollment-tokens body grows repo_url, repo_username, repo_password (all required).
  - Token row stores them as one AEAD-encrypted blob (existing crypto.AEAD); ConsumeEnrollmentToken moves the blob to a new host_credentials row keyed by host_id in the same tx.
  - PUT /api/hosts/{id}/repo-credentials (admin/operator) re-encrypts and replaces the row, emits an in-memory event to the WS hub.
  - GET /api/hosts/{id}/repo-credentials returns the redacted view (URL + username + has_password) so the UI can pre-fill the edit form. Password never leaves the server outside the WS push.
  - On WS hello, server pushes a config.update with decrypted creds before returning the connection to idle. Same path on edit-while-connected.
  - Audit-logged on create / consume / edit; payload omits the secret material.
- P1-33 (M) Agent-side encrypted secrets store:
  - New internal/agent/secrets package: AEAD blob at /var/lib/restic-manager/secrets.enc, atomic write (tmp+fsync+rename, mode 0600).
  - Per-host 32-byte secrets key minted at enrollment, persisted in agent.yaml (already 0600 root-only — same trust boundary as the bearer; explicit comment in the file).
  - Strip repo_url / repo_password from agent.config.Config. Agent loads creds from secrets.enc at startup; config.update handler writes through to the file.
  - Dispatcher reads from the secrets store on every job rather than from in-memory config.
  - Migration path: if agent.yaml still contains repo_url / repo_password, copy them into secrets.enc on next start, then strip from the YAML on save.
- P1-34 (S) End-to-end smoke runbook: docs/e2e-smoke.md walks through enrollment with repo creds → agent receives them via push-on-connect → run-now backup completes against a real restic/rest-server in a sibling container → host appears with snapshot count. Test-driven version (Playwright + compose) deferred to P5-06.
Phase 1 acceptance
- One Linux host can enroll, appear in the dashboard, and a backup can be triggered from the UI with live log streaming. Snapshots list updates after success.
- Windows binary builds cleanly in CI (.gitea/workflows/ci.yml) but is not service-tested or installer-shipped in Phase 1 — that lands in Phase 2 (P2-16, P2-17).
- Agent ↔ server protocol_version handshake rejects mismatched versions with a clear error rather than failing on JSON parse.
- Repo credentials never appear in plaintext on disk: server stores them AEAD-encrypted, agent stores them AEAD-encrypted, the wire carries them only inside the authenticated WS as config.update.
Phase 2 — Scheduling, retention, repo operations
Mid-phase pivot — "P2 redesign" (commits 7a7cac5, 666af41, 5667cdf). The original P2 plan put paths/excludes/retention/manual/kind/options on Schedule and one repo per host. After landing P2-01..P2-05 against that shape, the data model was rewritten: schedules are slim (cron + which source_groups); paths/excludes/retention/retry live on source_group (also doubles as the snapshot tag); forget/prune/check cadences live on host_repo_maintenance and run on a server-side ticker, not the agent cron; pending_runs queues offline retries; host.repo_initialised_at is gone (auto-init at enrolment). The redesign is captured below as P2R-NN items. Items P2-01..P2-05 stay marked done because the work shipped, but they're labelled ⚠️ shipped against old shape — behaviour to be re-validated under P2R-02 after UI rewire. P2-04.5 (manual flag) is dropped wholesale. P2-06..P2-15 are reframed below to point at their new homes; P2-16/17/18 are unaffected by the redesign.
Original P2 work — shipped (against pre-redesign shape)
- ⚠️ P2-01 (M) Schedule schema + CRUD API (fat-schedule shape) — superseded by P2R-01.
- ⚠️ P2-02 (L) Server-pushed schedule reconciliation (fat-schedule shape) — re-validate under P2R-01.
- ⚠️ P2-03 (M) Agent local scheduler (internal/agent/scheduler, robfig/cron/v3, schedule.fire envelope, dispatchScheduledJob). The cron loop + ack/fire round-trip stay; the payload it carries reshapes under P2R-01.
- ⚠️ P2-04 (M) Schedule editor UI (fat-schedule form: paths/excludes/tags/retention/bandwidth on the schedule itself) — superseded by P2R-02.
- P2-04.5 Manual schedules / kill host.default_paths — superseded; the manual flag concept is gone, Run-now lives on source groups (see P2R-01 / P2R-02).
- ⚠️ P2-05 (M) forget command with retention policy. Wire payload (CommandRunPayload.retention_policy) and restic wrapper (restic.ForgetPolicy, RunForget) are still correct; what changes under P2R-03 is where retention comes from (source_group, not schedule) and who dispatches (server-side maintenance ticker for cadence; per-source-group Run-now for ad-hoc).
P2 redesign — Phase 1 ✅
- P2R-00.1 (M) Migration 0008 — sources + repo maintenance. Adds source_groups, schedule_source_groups junction, host_repo_maintenance, pending_runs, host.bandwidth_up_kbps / bandwidth_down_kbps. Drops host.repo_initialised_at. Slim-schedule columns dropped from schedules. Column-level ALTERs only — no table rebuilds (FK cascade trap, see CLAUDE.md). Commit 7a7cac5.
- P2R-00.2 (M) v4 wireframes for sources / schedules / repo. Hi-fi mocks of /hosts/{id}/sources, /sources/{gid}/edit (with retention-conflict banner), slim /schedules, /repo (connection / bandwidth / maintenance / re-init). Commit 666af41.
P2 redesign — Phase 2 ✅
- P2R-00.3 (L) Go-side store rewrite against migration 0008. New types: SourceGroup, HostRepoMaintenance, PendingRun. Schedule slimmed to {id, host_id, cron, enabled, source_group_ids, timestamps}. RetentionPolicy moves from schedule field → source group field (type unchanged). Host loses RepoInitialisedAt, gains bandwidth caps. New files: store/sources.go, store/maintenance.go, store/pending.go. store/schedules.go rewritten for slim shape + junction CRUD. enrollment.go seeds a default source group + repo-maintenance row instead of a manual schedule. ws/handler.go drops MarkHostRepoInitialised. HTTP layer + UI templates temporarily 501-stubbed with redesign_in_progress — this is what P2R-01 / P2R-02 fill back in. Tests for the obsolete fat-schedule API deleted. Commit 5667cdf.
- P2R-00.4 (S) Host-detail UI patched up enough to render: RepoInitialisedAt template refs removed, manual init-repo branches stripped, dead Schedules sub-tab demoted to inert div (matches Jobs/Repo/Settings), broken Run-now buttons disabled with P2-Phase-4 hints. Stop-gap until P2R-02 lands the real surface.
P2 redesign — Phase 3 (REST + WS rewire) ✅
- P2R-01 (L) HTTP/WS layer against the slim shape:
  - Schedules REST CRUD: GET|POST /api/hosts/{id}/schedules, PUT|DELETE /api/hosts/{id}/schedules/{sid}. Body shape is {cron, enabled, source_group_ids[]} — paths/excludes/retention/kind/manual all go away. Junction wiped + re-inserted on every update (per store.UpdateSchedule). Validation: cron parses via robfig/cron/v3; ≥1 source_group_ids; all referenced groups belong to the host.
  - Source-groups REST CRUD: GET|POST /api/hosts/{id}/source-groups, GET|PUT|DELETE /api/hosts/{id}/source-groups/{gid}. Body: {name, includes[], excludes[], retention_policy, retry_max, retry_backoff_seconds}. Name uniqueness per host. Refuse delete if SchedulesUsingGroup(gid) is non-empty (return the schedule list so UI can show "remove from these schedules first"). Mutations bump host_schedule_version.
  - Repo-maintenance REST: GET|PUT /api/hosts/{id}/repo-maintenance. Body: {forget_cadence, prune_cadence, check_cadence, check_subset_pct, enabled}. Server-side ticker drives execution (P2R-06), so updates here do not bump host_schedule_version.
  - Per-source-group Run-now: POST /hosts/{id}/source-groups/{gid}/run. Reuses the existing dispatchScheduleNow-style path; agent receives a normal command.run carrying the resolved includes/excludes/retention from the group. This replaces the old per-host /hosts/{id}/run-backup endpoint (kept around as a 410-Gone with a hint pointing to source groups).
  - schedule_push.go reconciliation: rebuild pushScheduleSet* to ship the new wire format (ScheduleSetPayload carries [{schedule_id, cron, enabled, source_groups: [{name, includes, excludes, retention, retry_*}]}] — agent doesn't need to know source_group_id, just the resolved bundle). On-hello + async-on-CRUD flavours; ack still persists applied_schedule_version.
  - Auto-init at enrolment: server dispatches restic init on first WS connect (was P2-old "Init repo" button — now invisible to the operator). On success: emit a normal job row with kind=init so the audit trail still shows it. On init returning "config file already exists" (e.g. re-enrolment against an existing repo): treat as soft success per existing restic-wrapper behaviour.
  - Tests: rewrite the deleted schedules_test.go and schedule_push_test.go against new endpoints; new source_groups_test.go, repo_maintenance_test.go, auto_init_test.go. End-to-end: enrol → server pushes creds → server dispatches init → agent runs it → schedule reconcile fires → operator hits per-source-group Run-now → backup runs → snapshots refresh.
P2 redesign — Phase 4 (UI rewire, against v4 wireframes) ✅
Row-design rule (binding for every list-row template in this app, current and future): Whole-row click navigates to the row's primary detail/edit page — mirror .host-row.clickable on the dashboard (partials/host_row.html): an absolute-positioned .row-link overlay with text-indent: -9999px covers the row, action buttons live in .row-action cells that sit above via z-index. Do not add an explicit "Edit" button when the row is clickable — it duplicates the affordance and dilutes the click target. Action cells are reserved for verbs that aren't "open this row" (Run-now, Delete, Pause, etc).
- P2R-02 (L) UI templates rebuilt against the new model:
  - Slice 1 ✅ Sub-tab navigation skeleton — extract header/vitals/sub-tabs into a host_chrome partial; Sources / Schedules / Repo become real <a> links; placeholder pages share the chrome; version indicator restored. (commit a535822)
  - Slice 2 ✅ Sources tab — /hosts/{id}/sources list with per-row meta + clickable rows + per-group Run-now/Delete; /sources/new and /sources/{gid}/edit form (name, includes/excludes, 3×2 keep-* grid, retry-on-offline, inline conflict banner from ConflictDimension cache); validation re-renders form with input intact; refuses to delete a host's last source group. (commits 0ed9c3d, dede74f)
  - Slice 3 ✅ Schedules tab — /hosts/{id}/schedules slim list (status / cron / source-tags / actions, clickable rows) plus /schedules/new and /schedules/{sid}/edit form (cron with five quick-pick chips that have human-readable tooltips, source-group multi-pick as styled check cards, enabled toggle); per-schedule Run-now reuses dispatchScheduledJob for enabled schedules + bypasses the enabled check (with an HX-confirm) for paused ones; multi-group fires emit a success toast, single-group fires HX-Redirect to the live job log. (commit 67ca769 + follow-ups 64d2fcf, 8b91d30, 4035c44)
  - Slice 4 ✅ /hosts/{id}/repo — three independent forms (connection: URL/user/password pre-filled from GET /api/hosts/{id}/repo-credentials redacted view; bandwidth: host-wide caps via new PUT /api/hosts/{id}/bandwidth; maintenance: forget/prune/check cadences + check subset %); danger-zone re-init button rendered + disabled (real flow lands in P2R-09); right-rail snapshots-by-tag breakdown. (commit d62b173)
  - Slice 5 ✅ Dashboard row Run-now uses the single covering schedule when one exists ("Run all groups" primary button), otherwise falls back to "Open →" pointing at the Sources tab. Right-rail and empty-snapshots-state Run-now were rehomed to source-group context in slice 1. (commit fab99b4)
  - Slice 6 ✅ Playwright sweep against the live :8080 server — login → walk every new tab → create source group → create schedule → Run-now → confirm a snapshot landed → end-to-end clean, no console errors. Screenshots in _diag/p2r-02-sweep/.
  - Side-fix: agent runner drops noisy restic status events from log.stream (they were drowning the live log on short backups; the throttled job.progress envelope already covers the same data). (commit ffba737)
  - Header "version N · agent in sync / agent at vM" indicator preserved across all tabs (backed by host_schedule_version + applied_schedule_version).
  - Form validation re-renders with the operator's typed input intact (mirror P2-04's behaviour). Each save fires pushScheduleSetAsync so an online agent re-arms within seconds.
P2 redesign — Phase 5 ✅
Shipped on branch p2r-phase5-maintenance (PR #3). Plan: docs/superpowers/plans/2026-05-03-p2-redesign-phase-5.md.
- P2R-03 (M) prune command end-to-end. Restic wrapper (restic.RunPrune), agent dispatcher (case api.JobPrune:), wire envelope. Admin-only credential: a second host_credentials row keyed by host_id + kind=admin carries the non-append-only username/password; server pushes it via config.update only when dispatching a prune job, and the agent's secrets store keeps it in a separate slot from the everyday append-only creds. UI: prune row on the Repo page. Operator-triggered Run-now via POST /hosts/{id}/repo/prune. Cadence-driven dispatch via the maintenance ticker (P2R-06).
- P2R-04 (M) check command end-to-end (restic check --read-data-subset N%). Wrapper + dispatcher + wire. UI: check row on the Repo page (with the subset % slider). Operator Run-now via POST /hosts/{id}/repo/check. Cadence-driven dispatch via the maintenance ticker (P2R-06).
- P2R-05 (S) unlock command end-to-end (restic unlock). Operator-only — no cadence. POST /hosts/{id}/repo/unlock. Repo page surfaces lock state from the most recent check (which warns about stale locks).
- P2R-06 (M) Server-side maintenance ticker. Cron-style loop on the server reads host_repo_maintenance rows, dispatches forget / prune / check jobs against the right host on the configured cadence. Last-fire anchor is derived from the jobs table via LatestJobByKind (queued + running included so a long-running prune correctly suppresses the next tick). Independent of the agent's local cron — the agent's cron only handles backup schedules now. Skips offline hosts (logged, no queue — only scheduled backup fires queue, per P2R-08). Forget reshape: ships a multi-group ForgetGroups payload so one job fires N restic-forget invocations per tick.
- P2R-07 (S) Repo stats panel on the Repo page: total size, raw size, last-check timestamp + status (color-coded), last-prune timestamp, stale-lock banner. Backed by restic stats --json --mode raw-data that the agent ships in a repo.stats envelope after every backup / check / prune / unlock; persisted via Store.UpsertHostRepoStats into a new host_repo_stats projection table.
- P2R-08 (M) Pending-runs queue worker. Scheduled backup fires that race an agent disconnect queue to pending_runs. Drained on a 30s server-side tick and on agent reconnect (via onAgentHello); per-host TryLock mutex prevents the two paths double-dispatching the same row. Exponential backoff capped at 30 minutes; abandons rows that exceed the source-group's retry_max (audit-logged) or whose schedule/group has genuinely been deleted.
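The per-host TryLock guard in P2R-08 lets whichever path arrives second (the 30s tick or the reconnect hook) skip the drain instead of waiting for it. A minimal sketch, assuming hypothetical hostLocks and drainHost names; sync.Mutex.TryLock requires Go 1.18 or later.

```go
package main

import (
	"fmt"
	"sync"
)

// hostLocks hands out one mutex per host ID.
type hostLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func (h *hostLocks) get(hostID string) *sync.Mutex {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.locks == nil {
		h.locks = make(map[string]*sync.Mutex)
	}
	m, ok := h.locks[hostID]
	if !ok {
		m = &sync.Mutex{}
		h.locks[hostID] = m
	}
	return m
}

// drainHost runs drain unless another path is already draining this
// host. It reports whether drain ran.
func (h *hostLocks) drainHost(hostID string, drain func()) bool {
	m := h.get(hostID)
	if !m.TryLock() {
		return false // the ticker or the hello path got here first
	}
	defer m.Unlock()
	drain()
	return true
}

func main() {
	var locks hostLocks
	ran := locks.drainHost("host-a", func() {
		// A second drain arriving mid-run is skipped, not blocked.
		fmt.Println("nested:", locks.drainHost("host-a", func() {}))
		fmt.Println("other host:", locks.drainHost("host-b", func() {}))
	})
	fmt.Println("outer:", ran)
}
```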
P2 redesign — Phase 6 (auto-init follow-up) ✅
- P2R-09 (S) Auto-init UX polish. Latest init job status surfaced under the host-detail vitals strip (succeeded/failed/running/queued, with link to the live job log on non-success). Danger-zone POST /hosts/{id}/repo/reinit dispatches a fresh init job after the operator types the host name to confirm; audit row records host.repo_reinit.
Pre/post hooks (rehomed onto source groups) ✅
- P2R-10 (M) Hook schema: migration 0010 adds pre_hook / post_hook BLOB columns to source_groups and pre_hook_default / post_hook_default to hosts. Bytes stored verbatim — AEAD encrypt/decrypt at the HTTP layer (per-slot AD bytes). Round-trip tests cover set/clear semantics on both tables.
- P2R-11 (M) Agent execution of hooks: runner.BackupHooks + runHook helper invoked via /bin/sh -c (cmd.exe /C on Windows). pre_hook non-zero exit aborts the backup; post_hook always runs with RM_JOB_STATUS=succeeded|failed in env. Output streamed as "hook(<phase>): …" log.stream lines. Hooks only run for kind=backup. Server side resolves group → host default → empty and ships plaintext on the WS payload (decrypt at HTTP layer).
- P2R-12 (S) Hook editor UI: source-group edit form gains pre/post hook textareas with the service-user warning banner; bodies AEAD-encrypted on save (per-group AD). Repo page adds a host-default Hooks panel with the same shape; saved via POST /hosts/{id}/repo/hooks.
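The "per-slot AD bytes" in P2R-10 bind each encrypted hook body to the slot it was saved in, so a blob copied into another column fails to decrypt. The sketch below shows the idea with AES-256-GCM; the cipher choice, the nonce-prefixed blob layout, and the associated-data strings are all assumptions, since the tasks only name the project's crypto.AEAD helper.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newAEAD builds an AES-GCM AEAD; a 32-byte key selects AES-256.
func newAEAD(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal returns nonce || ciphertext, bound to the associated data ad.
func seal(a cipher.AEAD, plaintext, ad []byte) ([]byte, error) {
	nonce := make([]byte, a.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return a.Seal(nonce, nonce, plaintext, ad), nil
}

// open reverses seal; it fails if the blob or ad was altered.
func open(a cipher.AEAD, blob, ad []byte) ([]byte, error) {
	if len(blob) < a.NonceSize() {
		return nil, errors.New("blob too short")
	}
	nonce, ct := blob[:a.NonceSize()], blob[a.NonceSize():]
	return a.Open(nil, nonce, ct, ad)
}

func main() {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	a, err := newAEAD(key)
	if err != nil {
		panic(err)
	}

	// Hypothetical AD format: table, row ID and column name.
	blob, err := seal(a, []byte("pg_dump > /tmp/db.sql"), []byte("source_group:42:pre_hook"))
	if err != nil {
		panic(err)
	}
	_, err = open(a, blob, []byte("source_group:42:post_hook"))
	fmt.Println("wrong slot rejected:", err != nil)
}
```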
Bandwidth + niceties (rehomed onto host + source groups) ✅
- P2R-13 (S) Bandwidth limit fields. restic.Env gains LimitUploadKBps / LimitDownloadKBps, emitted as --limit-upload / --limit-download global flags before the subcommand on every invocation. Agent dispatcher tracks host-wide caps received via config.update; server pushes them on hello and after PUT /api/hosts/{id}/bandwidth. Per-job override on the per-source-group Run-now form (collapsed <details> "Limit bandwidth for this run" with two KB/s inputs); override wins over host caps.
- P2R-14 (S) Schedule "next run" / "last run". New store.LatestJobBySchedule query. Schedules tab grows two columns (Next derived from cron via robfig/cron/v3.Parse(...).Next, Last from latest actor_kind=schedule job). Dashboard host row prepends "next 12h ago/from now" when a single covering schedule is the run-now candidate.
Cross-platform + alt-enrolment ✅
- P2-16 (M) Windows service integration: internal/agent/service (build-tagged) implements svc.Handler; new restic-manager-agent install|uninstall|start|stop|run subcommands wrap the SCM via golang.org/x/sys/windows/svc/mgr. Cross-compile verified (GOOS=windows GOARCH=amd64 go build ./cmd/agent); untested on Windows itself — Linux CI can't exercise the SCM round-trip.
- P2-17 (M) install.ps1 (Windows): pwsh installer that detects arch, downloads $Server/agent/binary?os=windows&arch=amd64, runs the agent in -enroll-server (+ optional -enroll-token) mode (token flow OR announce-and-approve), then registers the service via restic-manager-agent install. Surfaces existing scheduled tasks named *restic* without disabling. Served by the existing GET /install/* handler; restage block in CLAUDE.md updated.
- P2-18 (L) Announce-and-approve enrolment (second enrolment mode):
  - Agent run with no RM_TOKEN generates a local Ed25519 keypair (persisted alongside the encrypted secrets blob), then POST /api/agents/announce with {hostname, os, arch, agent_version, restic_version, public_key}. Server stores a pending_hosts row (public_key, fingerprint = sha256(public_key), announced_from_ip, first_seen_at, last_seen_at, expires_at = now+1h). Hostname collisions with existing or other pending rows are flagged in the response so the install script can warn loudly on the endpoint terminal.
  - Agent then opens a long-poll/WS to /ws/agent/pending authenticated by signing a server-issued nonce with its private key — proves possession of the key tied to the pending row. Connection stays open; agent waits.
  - Install script prints the fingerprint on the endpoint's terminal in a copy-friendly form (e.g. SHA256:ab12…cd34) and tells the operator to compare it to the one shown in the UI before clicking accept.
  - UI: new "Pending hosts" panel on the dashboard. Admin sees fingerprint, hostname, source IP, OS/arch, time announced. Buttons: Accept (mints persistent bearer + repo creds, pushes both down the open pending socket, promotes pending row → real Host row, audit-logged) / Reject (deletes pending row, closes the socket with a clean error). Fingerprint is the load-bearing field — UI must make comparison easy (large monospace, one-click copy).
  - Server-side guards: per-source-IP rate limit on /api/agents/announce (token-bucket, e.g. 10/min); global cap on pending rows (e.g. 100); pending rows auto-expire after 1h; duplicate-hostname pending rows allowed but visually flagged in UI; accepting one does not auto-reject the others (admin sees them all and decides — defends against the "attacker announces first, real host second" race).
  - Token-based enrollment (Phase 1) remains the default and is unchanged; announce-and-approve is opt-in for interactive installs. Docs explicitly call out that the fingerprint comparison step is what makes this flow safe — without it, this is no better than trusting hostname over the wire.

  As shipped: migration 0011 + store/pending_hosts.go cover the table. POST /api/agents/announce (rate-limited 10/min/IP, global cap 100 in-flight rows) returns {pending_id, fingerprint, hostname_collision}. GET /ws/agent/pending runs the Ed25519 nonce-sign handshake. Admin POSTs to /api/pending-hosts/{id}/accept|reject (audit-logged as host.accept_pending / host.reject_pending). Dashboard panel renders the queue with a copyable fingerprint + inline accept form (URL/user/password). 60s server ticker sweeps expired rows. Agent: cmd/agent/announce.go mints + persists an Ed25519 keypair into agent.yaml's announce_key field; runs automatically when -enroll-server is supplied without -enroll-token. The install scripts haven't been updated to surface the printed fingerprint beyond the agent's own banner — the operator reads it from the install script's stdout.
Phase 2 acceptance
- A host can be onboarded end-to-end with no manual REST: enrol → auto-init runs → operator opens host → creates source group(s) → attaches them to one or more schedules → schedule fires on time → backup runs against the right paths with the right retention → snapshots tagged by group name appear in UI.
- Operator can hit Run-now per source group from any of: dashboard row (single-group host), source group row, snapshot empty-state.
- Server-side maintenance ticker drives forget/prune/check at the configured cadences, independent of agent cron. Scheduled backup fires that miss an offline host queue to pending_runs and drain on reconnect; maintenance ticks skip offline hosts (P2R-06, P2R-08).
- Pre/post hooks fire correctly per source group, fail loudly on pre_hook errors, run post_hook with RM_JOB_STATUS. Rejected on non-backup kinds.
- Bandwidth limits honoured (host-wide default + per-run override).
- A Windows host can enrol, appear in the dashboard, and run a backup with live log streaming. Not validated in CI: Linux runners cannot exercise the SCM round-trip; the service_windows.go / install.ps1 pieces compile cleanly under GOOS=windows GOARCH=amd64 but the first real Windows install will be the first end-to-end test.
- A Linux host can enrol via announce-and-approve, with fingerprint-comparison gate enforced. Rate-limit + pending-cap guards verified.
Phase 3 — Restore, alerts, audit
Phase 3 is split into three independently-shippable sub-phases: Restore (P3-01..03 + P3-09 + P3-X1 cancel + P3-X2 tree-list RPC), Alerts (P3-05..07), Audit UI (P3-08). Each sub-phase has its own spec → plan → implement cycle; we hand back at sub-phase boundaries.
P3-04 (cross-host restore) was de-scoped during the Phase-3 brainstorm on 2026-05-04: disaster recovery is already covered by re-enrolling a replacement host with the same repo creds (snapshots reappear, restore is same-host). The remaining "pull a file from host A onto host C without giving C permanent access" use case is genuinely different and doesn't have a confirmed need yet, so it's moved to the Future / unscheduled section at the end of this file.
Phase 3 — Restore (in progress, branch `p3-restore`)
Spec: `docs/superpowers/specs/2026-05-04-p3-restore-design.md`. Wireframe: `_diag/p3-restore-wizard/wireframe.html`.
- P3-X1 (S) Cancel-job feature. New `command.cancel` WS envelope; agent tracks per-job ctx.CancelFunc and kills the running `restic` subprocess (SIGTERM, SIGKILL after 5s grace); server endpoint `POST /api/jobs/{id}/cancel` bridges UI → WS; the existing UI Cancel button on `/jobs/{id}` becomes real for any running kind. Foundational: restore depends on it.
- P3-X2 (S) Tree-list synchronous WS RPC. New `tree.list` request / `tree.list.result` reply on the existing correlation-ID infra; agent runs `restic ls --json <sid> <path>` per call; server-side mediator `ws.SendRPC` + per-wizard-session in-memory cache (~30-min TTL).
- P3-01 (L) Restore wizard backend: tree browse via `tree.list` RPC (P3-X2), path picker validation, target selection (new-dir vs in-place + typed-confirm), dispatch endpoint `POST /hosts/{id}/restore`, audit row `host.restore`.
- P3-02 (L) Restore wizard UI: single-page progressively-enabled four-step form at `/hosts/{id}/restore` (and pre-selected variant `/hosts/{id}/snapshots/{sid}/restore`); tree-browser HTMX partials. Top-level "Restore" button on host detail.
- P3-03 (M) Restore execution: `restic.RunRestore` (paths, --target, --no-ownership for new-dir; preserves ownership for in-place); agent dispatcher case `JobRestore`; restore-specific job page variant with files-restored / bytes-restored / throughput / ETA / current-file widget.
- P3-09 (S) `diff` between two snapshots in UI: `JobDiff` JobKind, `restic.RunDiff`, `POST /api/hosts/{id}/snapshots/diff` dispatcher, snapshot-picker UI on Snapshots tab to pick A+B; output streams as `log.stream` to the standard live job log page.
- P3-X3 (S) Recent-restores panel on host detail: small line below the existing init-status, surfacing latest `JobRestore` outcome (succeeded N hours ago / failed → live log link). Backed by `store.LatestJobByKind(host_id, JobRestore)`.
Phase 3 — Alerts (not started)
- P3-05 (M) Alert engine: rule evaluation loop (failed backup, stale schedule, agent offline, check failed)
- P3-06 (M) Notification channels: webhook, ntfy, SMTP email
- P3-07 (S) Alert UI: list, acknowledge, resolve
Phase 3 — Audit log UI (not started)
- P3-08 (S) Audit log UI with filters (user, action, target, time range)
Phase 3 acceptance
- A file deleted on a host can be restored from the UI in under 2 minutes via the wizard at `/hosts/{id}/restore`; the operator can cancel a running restore (or any other running job) from the live job page. Snapshot diff between two snapshots renders as a normal job page.
- A failed backup raises an alert via the configured channel within 60s.
- The audit-log UI lets an admin filter by user / action / target / time range.
Phase 4 — Update delivery, RBAC polish, OIDC
- P4-01 (M) Update delivery via OS package managers: host an apt repo (Linux) and Chocolatey package (Windows) on Gitea releases. `restic-manager-agent update` is a thin wrapper over `apt-get install --only-upgrade restic-manager-agent` / `choco upgrade`. Trades flexibility for a much smaller security surface than bespoke signed binaries (see spec.md §4.2).
- P4-02 (M) Agent version reporting on dashboard: surface "agent N versions behind server"; "update all" admin action calls the package-manager wrapper on each host
- P4-03 (M) RBAC enforcement at API layer (admin / operator / viewer)
- P4-04 (S) User management UI (create/edit/disable, role assignment, password reset)
- P4-05 (L) OIDC login (generic provider config, group → role mapping)
- P4-06 (M) Repo size trend graphs (sparkline on host card, full chart on repo page)
- P4-07 (S) Per-host tags + dashboard filtering by tag
- P4-08 (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list
- P4-09 (S) Document Prometheus integration + sample Grafana dashboard JSON
Phase 4 acceptance
- Non-admin users see an appropriately limited UI. Agents upgrade via apt/choco with one admin-triggered action. OIDC login works against at least one provider (Authelia or Authentik). Prometheus can scrape `/metrics` and the sample Grafana dashboard renders with live data.
Phase 5 — OSS readiness
- P5-01 (M) Documentation site (mdBook or similar) with install, concepts, security model, screenshots
- P5-02 (S) `CONTRIBUTING.md`, `CODE_OF_CONDUCT.md`, issue + PR templates
- P5-03 (S) Release automation: `goreleaser` for binaries + Docker image to GHCR
- P5-04 (S) Demo screenshots / short Loom walkthrough in README
- P5-05 (S) `SECURITY.md` with disclosure process
- P5-06 (M) End-to-end test suite in CI (Playwright vs. compose stack with sibling Linux agent)
- P5-07 (S) Reference deployment: `docker-compose.yml` + Caddyfile snippet showing the TLS-terminating reverse proxy in front of the HTTP-only server (also demonstrates `RM_TRUSTED_PROXY`)
Phase 5 acceptance
- A stranger can read the docs and stand up a working install in under 30 minutes.
Cross-cutting / ongoing
- X-01 Keep CHANGELOG.md updated (Keep-a-Changelog format)
- X-02 Track restic version compatibility matrix
- X-03 Periodic dependency updates (`dependabot` or `renovate`)
- X-04 Threat-model review at end of each phase
- X-05 Proper first-run onboarding UI: admin shouldn't need to `curl` `/api/bootstrap` by hand. Render the bootstrap form on the same login page (extra "setup token" field shown only while no admin user exists, hidden after); on submit POST to `/api/bootstrap`, then drop straight into a session. Surface the one-time token from the server log somewhere copy-able (or print a clickable URL with the token in the query string at first-run). Also: relax the 12-char password floor for the first-run path or document it in the form so `admin` doesn't silently fail validation.
Future / unscheduled
Items here have a plausible use case but no confirmed need. They live outside numbered phases until a concrete trigger (a user request, a security review finding, a real disaster-recovery exercise) bumps them back into a phase.
- F-01 `P3-04` Cross-host restore. De-scoped from Phase 3 on 2026-05-04. Disaster recovery is already covered: stand up a replacement host, paste the original repo creds at enrolment, snapshots reappear, restore is same-host. The remaining "pull a file from host A onto host C without granting C permanent access" use case is genuinely different (file sharing / migration, not DR) and hasn't been requested. Original spec language was: "target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root". Re-promote when there's a real ask.