steve aa2d7db097 P3 wrap: agent auto-creates restore target; tasks.md ticked
1. Agent-side MkdirAll on the new-dir restore target. Restic creates
   missing leaves but won't traverse multiple missing levels, and
   under the systemd sandbox writes outside ReadWritePaths fail
   anyway. Calling os.MkdirAll(target, 0700) before invoking restic
   means the operator never has to pre-create the per-job subdir,
   and a path the sandbox rejects surfaces as a clean
   'restic restore: prepare target ...: read-only file system' error
   in the job log instead of a cryptic restic-side stat failure.

2. tasks.md Phase 3 — Restore section refreshed:
   - P3-X4 added (job log download dropdown — txt + ndjson)
   - P3-X5 added (UK lint locale switch + 73-correction sweep)
   - P3-X6 added (SIZE/FILES tooltip when host's restic < 0.17)
   - P3-03 entry expanded to cover version-gated --no-ownership,
     editable target, $HOME expansion, agent-side MkdirAll
   - As-shipped sweep summary mentions custom-target restore +
     download dropdown + tooltip in addition to the original walk

Test: TestRunRestoreNewDirAutoCreatesTarget seeds a multi-level
target the operator hasn't created and confirms RunRestore mkdir's
the chain before invoking restic.
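
A minimal sketch of the pre-create step described in item 1, assuming a helper named prepareTarget (the name, the dirExists helper, and the demo paths are illustrative, not taken from the agent code). It only shows that os.MkdirAll builds every missing level and that a failure is wrapped with a recognisable prefix.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// prepareTarget creates the restore target and any missing parents before
// restic would be invoked. Name and error prefix are illustrative.
func prepareTarget(target string) error {
	if err := os.MkdirAll(target, 0o700); err != nil {
		return fmt.Errorf("restic restore: prepare target %s: %w", target, err)
	}
	return nil
}

// dirExists reports whether path exists and is a directory.
func dirExists(path string) bool {
	info, err := os.Stat(path)
	return err == nil && info.IsDir()
}

func main() {
	base, err := os.MkdirTemp("", "restore-demo")
	if err != nil {
		fmt.Println("tempdir:", err)
		return
	}
	defer os.RemoveAll(base)

	// Several levels are missing below base; MkdirAll creates the whole chain.
	target := filepath.Join(base, "restore", "job-123", "data")
	if err := prepareTarget(target); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("created:", dirExists(target))
}
```

Calling prepareTarget on a path that already exists is a no-op, so the step is safe to run on every restore.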
2026-05-04 17:51:34 +01:00


restic-manager — Tasks

Tasks are grouped by phase. Each task has an ID for cross-referencing, an estimated size (S/M/L), and acceptance criteria.

Sizes: S = under a day, M = 1-3 days, L = 3-7 days.


Phase 0 — Project bootstrap

  • P0-01 (S) Initialize Go module, cmd/server, cmd/agent, baseline internal/ packages
  • P0-02 (S) Add LICENSE (PolyForm Noncommercial 1.0.0), README stub, CONTRIBUTING placeholder
  • P0-03 (S) Set up golangci-lint, gofumpt, goimports; pre-commit config
  • P0-04 (S) Gitea Actions (replacing the originally planned GitHub Actions): build matrix (linux amd64/arm64, windows amd64), unit tests, lint
  • P0-05 (S) Dockerfile.server (multi-stage, distroless), deploy/docker-compose.yml
  • P0-06 (S) Makefile / taskfile.yml with common targets (build, test, run, release)

Phase 1 — MVP: enrollment, visibility, on-demand backup

Server foundations

  • P1-01 (M) HTTP server scaffolding (chi, structured logging via slog, graceful shutdown)
  • P1-02 (M) SQLite store layer (modernc.org/sqlite) + migrations (hand-rolled, embed.FS)
  • P1-03 (M) Schema for users, sessions, hosts, repos, credentials, jobs, job_logs, snapshots, audit_log
  • [~] P1-04 (M) Auth: argon2id password hashing, login/logout, session cookies; CSRF middleware deferred to P1-23 (UI work) — REST clients use bearer/session-only flows
  • P1-05 (S) First-run admin bootstrap (printed one-time setup token in server logs)
  • P1-06 (M) Secret encryption helper (AEAD with key from RM_SECRET_KEY_FILE)
  • [~] P1-07 (M) Audit log writer; middleware sweep for every state-changing endpoint lands when the rest of the API surface does — login / bootstrap / host.enrolled / job.run_now currently audited

Agent ↔ server protocol

  • P1-08 (M) Define shared API types in internal/api (envelopes, every WS message + protocol_version constants; JSON-shape tests pin the wire)
  • P1-09 (L) WebSocket transport (github.com/coder/websocket), framed JSON envelopes, RPC correlation IDs, exponential-backoff reconnect on the agent side
  • P1-10 (M) Enrollment flow: POST /api/agents/enroll with one-time token → returns persistent bearer. Cert pin field stays in the response shape but is left empty: the server is HTTP-only behind a reverse proxy, so the operator pastes the proxy's cert hash into the install command rather than having the server introspect a cert it doesn't terminate.
  • P1-11 (M) Agent registration on connect (hello upserts agent_version/restic_version/protocol_version, flips status online, protocol_too_old rejection has clean error envelope)
  • P1-12 (S) Heartbeat handler (touches last_seen_at; background sweeper marks hosts offline after 90s without one)

Agent foundations

  • P1-13 (M) Agent config file (/etc/restic-manager/agent.yaml, atomic save: tmp+fsync+rename); Windows path deferred to Phase 2
  • P1-14 (M) Service integration: systemd unit (sandboxed: NoNewPrivileges, Protect*, MemoryDenyWriteExecute) — Linux only in Phase 1
  • P1-15 (M) Outbound WS client (github.com/coder/websocket) with reconnect (1s → 60s exponential + jitter), optional cert pin (sha-256 of leaf), heartbeats, protocol_version in hello
  • P1-16 (M) Restic wrapper: locate via PATH or override, run with --json, scan stdout/stderr, parse BackupStatus + BackupSummary, exit-code 3 treated as success-with-issues
  • P1-17 (S) Host metadata collection (OS, arch, hostname, restic version, agent version, protocol_version)
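
The reconnect schedule in P1-15 (1s → 60s exponential + jitter) can be sketched as follows. The doubling factor and the jitter range of [d/2, d] are assumptions; the task list does not state which jitter strategy the agent uses.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	minDelay = 1 * time.Second
	maxDelay = 60 * time.Second
)

// baseDelay doubles from minDelay once per failed attempt and caps at maxDelay.
func baseDelay(attempt int) time.Duration {
	d := minDelay
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

// withJitter picks a delay uniformly from [d/2, d] so that many agents
// reconnecting after a server restart do not all retry at the same instant.
func withJitter(d time.Duration) time.Duration {
	half := d / 2
	return half + time.Duration(rand.Int63n(int64(half)+1))
}

func main() {
	for attempt := 0; attempt <= 7; attempt++ {
		base := baseDelay(attempt)
		fmt.Printf("attempt %d: base %v, jittered %v\n", attempt, base, withJitter(base))
	}
}
```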

Run-now backup

  • P1-18 (L) Job lifecycle: queued → running → succeeded/failed/cancelled, persisted with stats blob and error
  • P1-19 (M) Server endpoint POST /api/hosts/{id}/jobs to dispatch a backup command (validates kind, checks online, audit-logs)
  • P1-20 (M) Agent executes restic backup, streams stdout/stderr + parsed JSON events back as job.progress (1Hz throttle) / log.stream
  • [~] P1-21 (M) Server persists log stream to job_logs ✓; WS /api/jobs/{id}/stream for live browser tailing still TODO — needs the per-job fan-out hub
  • P1-22 (S) Snapshot listing: agent calls restic snapshots --json after each successful backup and ships the projection over snapshots.report. Server ReplaceHostSnapshots atomically swaps the per-host list and updates hosts.snapshot_count in the same tx. Read endpoint: GET /api/hosts/{id}/snapshots. Tests cover round-trip, authoritative-replace semantics, and empty-after-prune. Schema dropped the unused repo_id FK from snapshots (repos as a first-class entity is P2 work).
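
The job lifecycle in P1-18 can be read as a small state machine. The sketch below is illustrative only: the type and function names are hypothetical, and allowing queued → failed/cancelled (for a job that never starts) is an assumption beyond what the task list states.

```go
package main

import "fmt"

// JobStatus mirrors the states named in P1-18.
type JobStatus string

const (
	Queued    JobStatus = "queued"
	Running   JobStatus = "running"
	Succeeded JobStatus = "succeeded"
	Failed    JobStatus = "failed"
	Cancelled JobStatus = "cancelled"
)

// allowed lists the legal next states. Terminal states have no entry.
// queued -> failed/cancelled is an assumption for jobs that never start.
var allowed = map[JobStatus][]JobStatus{
	Queued:  {Running, Failed, Cancelled},
	Running: {Succeeded, Failed, Cancelled},
}

// canTransition reports whether a job may move from one state to another.
func canTransition(from, to JobStatus) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(Queued, Running))    // true
	fmt.Println(canTransition(Running, Succeeded)) // true
	fmt.Println(canTransition(Succeeded, Running)) // false
	fmt.Println(canTransition(Queued, Succeeded))  // false
}
```

Checking transitions in one place keeps a late or duplicated job.finished message from moving a terminal job back into a live state.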

UI (HTMX + Tailwind)

  • P1-23 (M) Base layout, login page, session-aware nav
  • P1-24 (M) Dashboard: fleet summary tiles + host table (status dot + row accent + os/arch + last backup + repo size + snapshots + alerts + tags + run-now). Backed by GET /api/hosts + GET /api/fleet/summary (JSON) and a server-rendered HTML view. Empty state hands the operator the install command. HTMX Run now button posts to /hosts/{id}/run-backup.
  • P1-25 (M) Host detail page (/hosts/{id}): persistent header (status dot + mono name + tags + OS/arch/agent/restic/last-seen), vitals strip (last backup / repo size / snapshots / open alerts), sub-tabs (Snapshots active; Jobs/Repo/Settings tabs visible but inert until P2), snapshot table (cap 50, pagination later), right-rail run-now stack (backup live; forget/prune/check/unlock disabled with P2 hints) and a danger-zone delete panel.
  • P1-26 (M) Live job log viewer + WS browser fan-out hub (closes the P1-21 remainder). Browser opens /api/jobs/{id}/stream; agent-emitted job.started/job.progress/log.stream/job.finished are mirrored to subscribers. Per-subscriber buffered channel + non-blocking broadcast keeps a slow browser from blocking the agent's read loop. Page renders running / succeeded / failed states; auto-scrolls until the operator scrolls up; reloads on job.finished to show the final header. "Run now" sets HX-Redirect so the operator lands on the live log.
  • [~] P1-27 (M) "Add host" flow: form takes hostname + repo URL/username/password, mints token (TTL 1h), re-renders the same page in result-state with the install command (RM_SERVER + RM_TOKEN filled in), copy button, and an awaiting-agent panel. Encrypted repo creds ride on the token row (P1-32) and get pushed to the agent on first WS connect (P1-33). Deferred: one-click "download preconfigured installer" install-<hostname>.sh (cf. UrBackup Internet-mode push installer) — copy-paste covers it for v1.
  • P1-28 (S) Tailwind build via tailwindcss standalone binary (no Node). Makefile downloads pinned v3.4.17 into bin/tailwindcss, builds web/styles/input.css → web/static/css/styles.css, embedded into the binary via web.FS. make build runs Tailwind first.

Install scripts

  • P1-29 (M) install.sh (Linux): detects arch, downloads agent, installs systemd unit, enrolls. Detects existing restic timers / /etc/cron.{d,daily,hourly,weekly}/* / root crontab and prints them with the exact disable commands — does not auto-disable
  • [~] P1-31 (S) Server endpoint to serve agent binaries + install scripts ✓ (/agent/binary + /install/*); signature verification deferred to Phase 5 OSS readiness

Repo credentials (pulled forward from Phase 2)

  • P1-32 (M) Server-side encrypted repo creds carried on the enrollment token:

    • POST /api/enrollment-tokens body grows repo_url, repo_username, repo_password (all required).
    • Token row stores them as one AEAD-encrypted blob (existing crypto.AEAD); ConsumeEnrollmentToken moves the blob to a new host_credentials row keyed by host_id in the same tx.
    • PUT /api/hosts/{id}/repo-credentials (admin/operator) re-encrypts and replaces the row, emits an in-memory event to the WS hub.
    • GET /api/hosts/{id}/repo-credentials returns the redacted view (URL + username + has_password) so the UI can pre-fill the edit form. Password never leaves the server outside the WS push.
    • On WS hello, server pushes a config.update with decrypted creds before returning the connection to idle. Same path on edit-while-connected.
    • Audit-logged on create / consume / edit; payload omits the secret material.
  • P1-33 (M) Agent-side encrypted secrets store:

    • New internal/agent/secrets package: AEAD blob at /var/lib/restic-manager/secrets.enc, atomic write (tmp+fsync+rename, mode 0600).
    • Per-host 32-byte secrets key minted at enrollment, persisted in agent.yaml (already 0600 root-only — same trust boundary as the bearer; explicit comment in the file).
    • Strip repo_url / repo_password from agent.config.Config. Agent loads creds from secrets.enc at startup; config.update handler writes through to the file.
    • Dispatcher reads from the secrets store on every job rather than from in-memory config.
    • Migration path: if agent.yaml still contains repo_url/repo_password, copy them into secrets.enc on next start, then strip from the YAML on save.
  • P1-34 (S) End-to-end smoke runbook: docs/e2e-smoke.md walks through enrollment with repo creds → agent receives them via push-on-connect → run-now backup completes against a real restic/rest-server in a sibling container → host appears with snapshot count. Test-driven version (Playwright + compose) deferred to P5-06.
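
The atomic write pattern named in P1-13 and P1-33 (tmp+fsync+rename, mode 0600) can be sketched like this. The function names and temp-file pattern are illustrative assumptions; only the sequence of steps comes from the task list.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// atomicWrite writes data to a temp file in the same directory, fsyncs it,
// then renames it over path, so readers see the old or the new file in full.
func atomicWrite(path string, data []byte, mode os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	defer os.Remove(tmpName) // harmless once the rename has succeeded

	if err := tmp.Chmod(mode); err != nil {
		tmp.Close()
		return err
	}
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush contents to disk before rename
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmpName, path)
}

// readString is a small helper for the demo.
func readString(path string) (string, error) {
	b, err := os.ReadFile(path)
	return string(b), err
}

func main() {
	dir, err := os.MkdirTemp("", "atomic-demo")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "secrets.enc")
	if err := atomicWrite(path, []byte("v1"), 0o600); err != nil {
		fmt.Println(err)
		return
	}
	if err := atomicWrite(path, []byte("v2"), 0o600); err != nil {
		fmt.Println(err)
		return
	}
	s, err := readString(path)
	fmt.Println(s, err)
}
```

The temp file lives in the target's own directory because a rename is only atomic within a single filesystem.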

Phase 1 acceptance

  • One Linux host can enroll, appear in the dashboard, and a backup can be triggered from the UI with live log streaming. Snapshots list updates after success.
  • Windows binary builds cleanly in CI (.gitea/workflows/ci.yml) but is not service-tested or installer-shipped in Phase 1 — that lands in Phase 2 (P2-16, P2-17).
  • Agent ↔ server protocol_version handshake rejects mismatched versions with a clear error rather than failing on JSON parse.
  • Repo credentials never appear in plaintext on disk: server stores them AEAD-encrypted, agent stores them AEAD-encrypted, the wire carries them only inside the authenticated WS as config.update.

Phase 2 — Scheduling, retention, repo operations

Mid-phase pivot — "P2 redesign" (commits 7a7cac5, 666af41, 5667cdf). The original P2 plan put paths/excludes/retention/manual/kind/options on Schedule and one repo per host. After landing P2-01..P2-05 against that shape, the data model was rewritten: schedules are slim (cron + which source_groups); paths/excludes/retention/retry live on source_group (also doubles as the snapshot tag); forget/prune/check cadences live on host_repo_maintenance and run on a server-side ticker, not the agent cron; pending_runs queues offline retries; host.repo_initialised_at is gone (auto-init at enrolment). The redesign is captured below as P2R-NN items. Items P2-01..P2-05 stay marked done because the work shipped, but they're labelled ⚠️ shipped against old shape — behaviour to be re-validated under P2R-02 after UI rewire. P2-04.5 (manual flag) is dropped wholesale. P2-06..P2-15 are reframed below to point at their new homes; P2-16/17/18 are unaffected by the redesign.

Original P2 work — shipped (against pre-redesign shape)

  • ⚠️ P2-01 (M) Schedule schema + CRUD API (fat-schedule shape) — superseded by P2R-01.
  • ⚠️ P2-02 (L) Server-pushed schedule reconciliation (fat-schedule shape) — re-validate under P2R-01.
  • ⚠️ P2-03 (M) Agent local scheduler (internal/agent/scheduler, robfig/cron/v3, schedule.fire envelope, dispatchScheduledJob). The cron loop + ack/fire round-trip stay; the payload it carries reshapes under P2R-01.
  • ⚠️ P2-04 (M) Schedule editor UI (fat-schedule form: paths/excludes/tags/retention/bandwidth on the schedule itself) — superseded by P2R-02.
  • P2-04.5 Manual schedules / kill host.default_paths — superseded; the manual flag concept is gone, Run-now lives on source groups (see P2R-01 / P2R-02).
  • ⚠️ P2-05 (M) forget command with retention policy. Wire payload (CommandRunPayload.retention_policy) and restic wrapper (restic.ForgetPolicy, RunForget) are still correct; what changes under P2R-03 is where retention comes from (source_group, not schedule) and who dispatches (server-side maintenance ticker for cadence; per-source-group Run-now for ad-hoc).

P2 redesign — Phase 1

  • P2R-00.1 (M) Migration 0008 — sources + repo maintenance. Adds source_groups, schedule_source_groups junction, host_repo_maintenance, pending_runs, host.bandwidth_up_kbps / bandwidth_down_kbps. Drops host.repo_initialised_at. Slim-schedule columns dropped from schedules. Column-level ALTERs only — no table rebuilds (FK cascade trap, see CLAUDE.md). Commit 7a7cac5.
  • P2R-00.2 (M) v4 wireframes for sources / schedules / repo. Hi-fi mocks of /hosts/{id}/sources, /sources/{gid}/edit (with retention-conflict banner), slim /schedules, /repo (connection / bandwidth / maintenance / re-init). Commit 666af41.

P2 redesign — Phase 2

  • P2R-00.3 (L) Go-side store rewrite against migration 0008. New types: SourceGroup, HostRepoMaintenance, PendingRun. Schedule slimmed to {id, host_id, cron, enabled, source_group_ids, timestamps}. RetentionPolicy moves from schedule field → source group field (type unchanged). Host loses RepoInitialisedAt, gains bandwidth caps. New files: store/sources.go, store/maintenance.go, store/pending.go. store/schedules.go rewritten for slim shape + junction CRUD. enrollment.go seeds a default source group + repo-maintenance row instead of a manual schedule. ws/handler.go drops MarkHostRepoInitialised. HTTP layer + UI templates temporarily 501-stubbed with redesign_in_progress — this is what P2R-01 / P2R-02 fill back in. Tests for the obsolete fat-schedule API deleted. Commit 5667cdf.
  • P2R-00.4 (S) Host-detail UI patched up enough to render: RepoInitialisedAt template refs removed, manual init-repo branches stripped, dead Schedules sub-tab demoted to inert div (matches Jobs/Repo/Settings), broken Run-now buttons disabled with P2-Phase-4 hints. Stop-gap until P2R-02 lands the real surface.

P2 redesign — Phase 3 (REST + WS rewire)

  • P2R-01 (L) HTTP/WS layer against the slim shape:
    • Schedules REST CRUD: GET|POST /api/hosts/{id}/schedules, PUT|DELETE /api/hosts/{id}/schedules/{sid}. Body shape is {cron, enabled, source_group_ids[]} — paths/excludes/retention/kind/manual all go away. Junction wiped + re-inserted on every update (per store.UpdateSchedule). Validation: cron parses via robfig/cron/v3; ≥1 source_group_ids; all referenced groups belong to the host.
    • Source-groups REST CRUD: GET|POST /api/hosts/{id}/source-groups, GET|PUT|DELETE /api/hosts/{id}/source-groups/{gid}. Body: {name, includes[], excludes[], retention_policy, retry_max, retry_backoff_seconds}. Name uniqueness per host. Refuse delete if SchedulesUsingGroup(gid) is non-empty (return the schedule list so UI can show "remove from these schedules first"). Mutations bump host_schedule_version.
    • Repo-maintenance REST: GET|PUT /api/hosts/{id}/repo-maintenance. Body: {forget_cadence, prune_cadence, check_cadence, check_subset_pct, enabled}. Server-side ticker drives execution (P2R-04), so updates here do not bump host_schedule_version.
    • Per-source-group Run-now: POST /hosts/{id}/source-groups/{gid}/run. Reuses the existing dispatchScheduleNow-style path; agent receives a normal command.run carrying the resolved includes/excludes/retention from the group. This replaces the old per-host /hosts/{id}/run-backup endpoint (kept around as a 410-Gone with a hint pointing to source groups).
    • schedule_push.go reconciliation: rebuild pushScheduleSet* to ship the new wire format (ScheduleSetPayload carries [{schedule_id, cron, enabled, source_groups: [{name, includes, excludes, retention, retry_*}]}] — agent doesn't need to know source_group_id, just the resolved bundle). On-hello + async-on-CRUD flavours; ack still persists applied_schedule_version.
    • Auto-init at enrolment: server dispatches restic init on first WS connect (was P2-old "Init repo" button — now invisible to the operator). On success: emit a normal job row with kind=init so the audit trail still shows it. On init returning "config file already exists" (e.g. re-enrolment against an existing repo): treat as soft success per existing restic-wrapper behaviour.
    • Tests: rewrite the deleted schedules_test.go and schedule_push_test.go against new endpoints; new source_groups_test.go, repo_maintenance_test.go, auto_init_test.go. End-to-end: enrol → server pushes creds → server dispatches init → agent runs it → schedule reconcile fires → operator hits per-source-group Run-now → backup runs → snapshots refresh.

P2 redesign — Phase 4 (UI rewire, against v4 wireframes)

Row-design rule (binding for every list-row template in this app, current and future): Whole-row click navigates to the row's primary detail/edit page — mirror .host-row.clickable on the dashboard (partials/host_row.html): an absolute-positioned .row-link overlay with text-indent: -9999px covers the row, action buttons live in .row-action cells that sit above via z-index. Do not add an explicit "Edit" button when the row is clickable — it duplicates the affordance and dilutes the click target. Action cells are reserved for verbs that aren't "open this row" (Run-now, Delete, Pause, etc).

  • P2R-02 (L) UI templates rebuilt against the new model:
    • Slice 1 Sub-tab navigation skeleton — extract header/vitals/sub-tabs into a host_chrome partial; Sources / Schedules / Repo become real <a> links; placeholder pages share the chrome; version indicator restored. (commit a535822)
    • Slice 2 Sources tab — /hosts/{id}/sources list with per-row meta + clickable rows + per-group Run-now/Delete; /sources/new and /sources/{gid}/edit form (name, includes/excludes, 3×2 keep-* grid, retry-on-offline, inline conflict banner from ConflictDimension cache); validation re-renders form with input intact; refuses to delete a host's last source group. (commits 0ed9c3d, dede74f)
    • Slice 3 Schedules tab — /hosts/{id}/schedules slim list (status / cron / source-tags / actions, clickable rows) plus /schedules/new and /schedules/{sid}/edit form (cron with five quick-pick chips that have human-readable tooltips, source-group multi-pick as styled check cards, enabled toggle); per-schedule Run-now reuses dispatchScheduledJob for enabled schedules + bypasses the enabled check (with a HX-confirm) for paused ones; multi-group fires emit a success toast, single-group fires HX-Redirect to the live job log. (commit 67ca769 + follow-ups 64d2fcf, 8b91d30, 4035c44)
    • Slice 4 /hosts/{id}/repo — three independent forms (connection: URL/user/password pre-filled from GET /api/hosts/{id}/repo-credentials redacted view; bandwidth: host-wide caps via new PUT /api/hosts/{id}/bandwidth; maintenance: forget/prune/check cadences + check subset %); danger-zone re-init button rendered + disabled (real flow lands in P2R-09); right-rail snapshots-by-tag breakdown. (commit d62b173)
    • Slice 5 Dashboard row Run-now uses the single covering schedule when one exists ("Run all groups" primary button), otherwise falls back to "Open →" pointing at the Sources tab. Right-rail and empty-snapshots-state Run-now were rehomed to source-group context in slice 1. (commit fab99b4)
    • Slice 6 Playwright sweep against the live :8080 server — login → walk every new tab → create source group → create schedule → Run-now → confirm a snapshot landed → end-to-end clean, no console errors. Screenshots in _diag/p2r-02-sweep/.
    • Side-fix: agent runner drops noisy restic status events from log.stream (they were drowning the live log on short backups; the throttled job.progress envelope already covers the same data). (commit ffba737)
    • Header "version N · agent in sync / agent at vM" indicator preserved across all tabs (backed by host_schedule_version + applied_schedule_version).
    • Form validation re-renders with the operator's typed input intact (mirror P2-04's behaviour). Each save fires pushScheduleSetAsync so an online agent re-arms within seconds.

P2 redesign — Phase 5

Shipped on branch p2r-phase5-maintenance (PR #3). Plan: docs/superpowers/plans/2026-05-03-p2-redesign-phase-5.md.

  • P2R-03 (M) prune command end-to-end. Restic wrapper (restic.RunPrune), agent dispatcher (case api.JobPrune:), wire envelope. Admin-only credential: a second host_credentials row keyed by host_id + kind=admin carries the non-append-only username/password; server pushes it via config.update only when dispatching a prune job, and the agent's secrets store keeps it in a separate slot from the everyday append-only creds. UI: prune row on the Repo page. Operator-triggered Run-now via POST /hosts/{id}/repo/prune. Cadence-driven dispatch via the maintenance ticker (P2R-06).
  • P2R-04 (M) check command end-to-end (restic check --read-data-subset N%). Wrapper + dispatcher + wire. UI: check row on the Repo page (with the subset % slider). Operator Run-now via POST /hosts/{id}/repo/check. Cadence-driven dispatch via the maintenance ticker (P2R-06).
  • P2R-05 (S) unlock command end-to-end (restic unlock). Operator-only — no cadence. POST /hosts/{id}/repo/unlock. Repo page surfaces lock state from the most recent check (which warns about stale locks).
  • P2R-06 (M) Server-side maintenance ticker. Cron-style loop on the server reads host_repo_maintenance rows, dispatches forget / prune / check jobs against the right host on the configured cadence. Last-fire anchor is derived from the jobs table via LatestJobByKind (queued + running included so a long-running prune correctly suppresses the next tick). Independent of the agent's local cron — the agent's cron only handles backup schedules now. Skips offline hosts (logged, no queue — only scheduled backup fires queue, per P2R-08). Forget reshape: ships a multi-group ForgetGroups payload so one job fires N restic-forget invocations per tick.
  • P2R-07 (S) Repo stats panel on the Repo page: total size, raw size, last-check timestamp + status (color-coded), last-prune timestamp, stale-lock banner. Backed by restic stats --json --mode raw-data that the agent ships in a repo.stats envelope after every backup / check / prune / unlock; persisted via Store.UpsertHostRepoStats into a new host_repo_stats projection table.
  • P2R-08 (M) Pending-runs queue worker. Scheduled backup fires that race an agent disconnect queue to pending_runs. Drained on a 30s server-side tick and on agent reconnect (via onAgentHello); per-host TryLock mutex prevents the two paths double-dispatching the same row. Exponential backoff capped at 30 minutes; abandons rows that exceed the source-group's retry_max (audit-logged) or whose schedule/group has genuinely been deleted.
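
The per-host TryLock guard in P2R-08 keeps the 30s tick and the reconnect path from draining the same host at once. A minimal sketch, assuming a map of mutexes keyed by host ID (the hostLocks type and tryDrain method are hypothetical names):

```go
package main

import (
	"fmt"
	"sync"
)

// hostLocks hands out one mutex per host ID.
type hostLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newHostLocks() *hostLocks {
	return &hostLocks{locks: make(map[string]*sync.Mutex)}
}

// forHost returns the mutex for a host, creating it on first use.
func (h *hostLocks) forHost(id string) *sync.Mutex {
	h.mu.Lock()
	defer h.mu.Unlock()
	m, ok := h.locks[id]
	if !ok {
		m = &sync.Mutex{}
		h.locks[id] = m
	}
	return m
}

// tryDrain runs drain only if no other drain for the same host is in flight.
// It returns false without blocking when the host is already being drained.
func (h *hostLocks) tryDrain(hostID string, drain func()) bool {
	m := h.forHost(hostID)
	if !m.TryLock() {
		return false
	}
	defer m.Unlock()
	drain()
	return true
}

func main() {
	hl := newHostLocks()
	outer := hl.tryDrain("host-1", func() {
		// Simulates the reconnect path firing while the tick is mid-drain.
		fmt.Println("same host, concurrent:", hl.tryDrain("host-1", func() {}))
		fmt.Println("other host:", hl.tryDrain("host-2", func() {}))
	})
	fmt.Println("outer:", outer)
	fmt.Println("after release:", hl.tryDrain("host-1", func() {}))
}
```

Skipping instead of blocking is safe here because whichever path holds the lock drains the full queue for that host.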

P2 redesign — Phase 6 (auto-init follow-up)

  • P2R-09 (S) Auto-init UX polish. Latest init job status surfaced under the host-detail vitals strip (succeeded/failed/running/queued, with link to the live job log on non-success). Danger-zone POST /hosts/{id}/repo/reinit dispatches a fresh init job after the operator types the host name to confirm; audit row records host.repo_reinit.

Pre/post hooks (rehomed onto source groups)

  • P2R-10 (M) Hook schema: migration 0010 adds pre_hook/post_hook BLOB columns to source_groups and pre_hook_default/post_hook_default to hosts. Bytes stored verbatim — AEAD encrypt/decrypt at the HTTP layer (per-slot AD bytes). Round-trip tests cover set/clear semantics on both tables.
  • P2R-11 (M) Agent execution of hooks: runner.BackupHooks + runHook helper invoked via /bin/sh -c (cmd.exe /C on Windows). pre_hook non-zero exit aborts the backup; post_hook always runs with RM_JOB_STATUS=succeeded|failed in env. Output streamed as hook(<phase>): … log.stream lines. Hooks only run for kind=backup. Server side resolves group → host default → empty and ships plaintext on the WS payload (decrypt at HTTP layer).
  • P2R-12 (S) Hook editor UI: source-group edit form gains pre/post hook textareas with the service-user warning banner; bodies AEAD-encrypted on save (per-group AD). Repo page adds a host-default Hooks panel with the same shape; saved via POST /hosts/{id}/repo/hooks.
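
The hook semantics in P2R-11 can be sketched as below. The function names are hypothetical, context handling and log streaming are omitted, and running post_hook even when pre_hook fails is one reading of "always runs", not something the task list confirms. The demo scripts use POSIX shell syntax.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"runtime"
)

// runHook runs script through the platform shell and returns combined output.
// jobStatus, when set, is exposed to the script as RM_JOB_STATUS.
func runHook(script, jobStatus string) (string, error) {
	var cmd *exec.Cmd
	if runtime.GOOS == "windows" {
		cmd = exec.Command("cmd.exe", "/C", script)
	} else {
		cmd = exec.Command("/bin/sh", "-c", script)
	}
	cmd.Env = os.Environ()
	if jobStatus != "" {
		cmd.Env = append(cmd.Env, "RM_JOB_STATUS="+jobStatus)
	}
	out, err := cmd.CombinedOutput()
	return string(out), err
}

// runWithHooks aborts the backup when pre exits non-zero and then runs post
// with the final status (assumption: post also runs after a pre failure).
func runWithHooks(pre, post string, backup func() error) (status, postOut string, err error) {
	status = "succeeded"
	if pre != "" {
		if _, hookErr := runHook(pre, ""); hookErr != nil {
			status, err = "failed", fmt.Errorf("pre_hook: %w", hookErr)
		}
	}
	if err == nil {
		if err = backup(); err != nil {
			status = "failed"
		}
	}
	if post != "" {
		postOut, _ = runHook(post, status) // post errors ignored in this sketch
	}
	return status, postOut, err
}

func main() {
	status, out, err := runWithHooks("exit 3", "echo post saw $RM_JOB_STATUS", func() error {
		fmt.Println("backup ran") // not reached: pre_hook fails first
		return nil
	})
	fmt.Printf("status=%s err=%v post=%q\n", status, err, out)
}
```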

Bandwidth + niceties (rehomed onto host + source groups)

  • P2R-13 (S) Bandwidth limit fields. restic.Env gains LimitUploadKBps/LimitDownloadKBps, emitted as --limit-upload/--limit-download global flags before the subcommand on every invocation. Agent dispatcher tracks host-wide caps received via config.update; server pushes them on hello and after PUT /api/hosts/{id}/bandwidth. Per-job override on the per-source-group Run-now form (collapsed <details> "Limit bandwidth for this run" with two KB/s inputs); override wins over host caps.
  • P2R-14 (S) Schedule "next run" / "last run". New store.LatestJobBySchedule query. Schedules tab grows two columns (Next derived from cron via robfig/cron/v3.Parse(...).Next, Last from latest actor_kind=schedule job). Dashboard host row prepends next 12h ago/from now when a single covering schedule is the run-now candidate.
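
The bandwidth handling in P2R-13 amounts to resolving the effective caps and emitting them as global flags ahead of the subcommand. A rough sketch, assuming a Limits struct, per-field override resolution, and zero meaning "no cap" (all three are illustrative choices, not confirmed by the task list):

```go
package main

import (
	"fmt"
	"strconv"
)

// Limits holds upload/download caps in KB/s; zero means no cap (assumption).
type Limits struct {
	UploadKBps   int
	DownloadKBps int
}

// effective applies a per-run override on top of host-wide caps.
// Each non-zero override field wins over the host value.
func effective(host, override Limits) Limits {
	out := host
	if override.UploadKBps > 0 {
		out.UploadKBps = override.UploadKBps
	}
	if override.DownloadKBps > 0 {
		out.DownloadKBps = override.DownloadKBps
	}
	return out
}

// resticArgs places the limit flags before the subcommand, as global options.
func resticArgs(l Limits, subcommand string, rest ...string) []string {
	var args []string
	if l.UploadKBps > 0 {
		args = append(args, "--limit-upload", strconv.Itoa(l.UploadKBps))
	}
	if l.DownloadKBps > 0 {
		args = append(args, "--limit-download", strconv.Itoa(l.DownloadKBps))
	}
	args = append(args, subcommand)
	return append(args, rest...)
}

func main() {
	host := Limits{UploadKBps: 1000, DownloadKBps: 2000}
	override := Limits{UploadKBps: 250}
	fmt.Println(resticArgs(effective(host, override), "backup", "--json", "/etc"))
}
```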

Cross-platform + alt-enrolment

  • P2-16 (M) Windows service integration: internal/agent/service (build-tagged) implements svc.Handler; new restic-manager-agent install|uninstall|start|stop|run subcommands wrap the SCM via golang.org/x/sys/windows/svc/mgr. Cross-compile verified (GOOS=windows GOARCH=amd64 go build ./cmd/agent); untested on Windows itself — Linux CI can't exercise the SCM round-trip.

  • P2-17 (M) install.ps1 (Windows): pwsh installer that detects arch, downloads $Server/agent/binary?os=windows&arch=amd64, runs the agent in -enroll-server (+ optional -enroll-token) mode (token flow OR announce-and-approve), then registers the service via restic-manager-agent install. Surfaces existing scheduled tasks named *restic* without disabling. Served by the existing GET /install/* handler; restage block in CLAUDE.md updated.

  • P2-18 (L) Announce-and-approve enrolment (second enrolment mode):

    • Agent run with no RM_TOKEN generates a local Ed25519 keypair (persisted alongside the encrypted secrets blob), then POST /api/agents/announce with {hostname, os, arch, agent_version, restic_version, public_key}. Server stores a pending_hosts row (public_key, fingerprint = sha256(public_key), announced_from_ip, first_seen_at, last_seen_at, expires_at = now+1h). Hostname collisions with existing or other pending rows are flagged in the response so the install script can warn loudly on the endpoint terminal.
    • Agent then opens a long-poll/WS to /ws/agent/pending authenticated by signing a server-issued nonce with its private key — proves possession of the key tied to the pending row. Connection stays open; agent waits.
    • Install script prints the fingerprint on the endpoint's terminal in a copy-friendly form (e.g. SHA256:ab12…cd34) and tells the operator to compare it to the one shown in the UI before clicking accept.
    • UI: new "Pending hosts" panel on the dashboard. Admin sees fingerprint, hostname, source IP, OS/arch, time announced. Buttons: Accept (mints persistent bearer + repo creds, pushes both down the open pending socket, promotes pending row → real Host row, audit-logged) / Reject (deletes pending row, closes the socket with a clean error). Fingerprint is the load-bearing field — UI must make comparison easy (large monospace, one-click copy).
    • Server-side guards: per-source-IP rate limit on /api/agents/announce (token-bucket, e.g. 10/min); global cap on pending rows (e.g. 100); pending rows auto-expire after 1h; duplicate-hostname pending rows allowed but visually flagged in UI; accepting one does not auto-reject the others (admin sees them all and decides — defends against the "attacker announces first, real host second" race).
    • Token-based enrollment (Phase 1) remains the default and is unchanged; announce-and-approve is opt-in for interactive installs. Docs explicitly call out that the fingerprint comparison step is what makes this flow safe — without it, this is no better than trusting hostname over the wire.

    As shipped: migration 0011 + store/pending_hosts.go cover the table. POST /api/agents/announce (rate-limited 10/min/IP, global cap 100 in-flight rows) returns {pending_id, fingerprint, hostname_collision}. GET /ws/agent/pending runs the Ed25519 nonce-sign handshake. Admin POSTs to /api/pending-hosts/{id}/accept|reject (audit-logged as host.accept_pending/host.reject_pending). Dashboard panel renders the queue with a copyable fingerprint + inline accept form (URL/user/password). 60s server ticker sweeps expired rows. Agent: cmd/agent/announce.go mints + persists an Ed25519 keypair into agent.yaml's announce_key field; runs automatically when -enroll-server is supplied without -enroll-token. The install scripts haven't been updated to surface the printed fingerprint beyond the agent's own banner — the operator reads it from the install script's stdout.
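
The cryptographic core of announce-and-approve (fingerprint = sha256(public_key), plus proof of possession by signing a server-issued nonce) can be sketched with the standard library. Function names, the 32-byte nonce length, and the hex rendering of the fingerprint are assumptions; the source only shows an abbreviated SHA256:ab12…cd34 form.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// newAgentKey mints the keypair the agent would persist locally.
func newAgentKey() (ed25519.PublicKey, ed25519.PrivateKey, error) {
	return ed25519.GenerateKey(rand.Reader)
}

// fingerprint renders sha256(public_key) for side-by-side comparison.
// Hex encoding is an assumption made for this sketch.
func fingerprint(pub ed25519.PublicKey) string {
	sum := sha256.Sum256(pub)
	return "SHA256:" + hex.EncodeToString(sum[:])
}

// newNonce returns a random server-side challenge (length is an assumption).
func newNonce() ([]byte, error) {
	n := make([]byte, 32)
	_, err := rand.Read(n)
	return n, err
}

// signNonce is the agent side: prove possession of the private key.
func signNonce(priv ed25519.PrivateKey, nonce []byte) []byte {
	return ed25519.Sign(priv, nonce)
}

// verifyNonce is the server side: check the signature against the public key
// stored on the pending row.
func verifyNonce(pub ed25519.PublicKey, nonce, sig []byte) bool {
	return ed25519.Verify(pub, nonce, sig)
}

func main() {
	pub, priv, err := newAgentKey()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("fingerprint:", fingerprint(pub))

	nonce, err := newNonce()
	if err != nil {
		fmt.Println(err)
		return
	}
	sig := signNonce(priv, nonce)
	fmt.Println("handshake ok:", verifyNonce(pub, nonce, sig))
}
```

A fresh nonce per connection matters: a signature captured from one handshake then proves nothing on the next.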

Phase 2 acceptance

  • A host can be onboarded end-to-end with no manual REST: enrol → auto-init runs → operator opens host → creates source group(s) → attaches them to one or more schedules → schedule fires on time → backup runs against the right paths with the right retention → snapshots tagged by group name appear in UI.
  • Operator can hit Run-now per source group from any of: dashboard row (single-group host), source group row, snapshot empty-state.
  • Server-side maintenance ticker drives forget/prune/check at the configured cadences, independent of agent cron. Offline hosts queue to pending_runs and drain on reconnect.
  • Pre/post hooks fire correctly per source group, fail loudly on pre_hook errors, run post_hook with RM_JOB_STATUS. Rejected on non-backup kinds.
  • Bandwidth limits honoured (host-wide default + per-run override).
  • A Windows host can enrol, appear in the dashboard, and run a backup with live log streaming. Not validated in CI: Linux runners cannot exercise the SCM round-trip; the service_windows.go/install.ps1 pieces compile cleanly under GOOS=windows GOARCH=amd64 but the first real Windows install will be the first end-to-end test.
  • A Linux host can enrol via announce-and-approve, with fingerprint-comparison gate enforced. Rate-limit + pending-cap guards verified.

Phase 3 — Restore, alerts, audit

Phase 3 is split into three independently-shippable sub-phases: Restore (P3-01..03 + P3-09 + P3-X1 cancel + P3-X2 tree-list RPC), Alerts (P3-05..07), Audit UI (P3-08). Each sub-phase has its own spec → plan → implement cycle; we hand back at sub-phase boundaries.

P3-04 (cross-host restore) was de-scoped during the Phase-3 brainstorm on 2026-05-04: disaster recovery is already covered by re-enrolling a replacement host with the same repo creds (snapshots reappear, restore is same-host). The remaining "pull a file from host A onto host C without giving C permanent access" use case is genuinely different and doesn't have a confirmed need yet, so it's moved to the Future / unscheduled section at the end of this file.

Phase 3 — Restore

Spec: docs/superpowers/specs/2026-05-04-p3-restore-design.md. Wireframe: _diag/p3-restore-wizard/wireframe.html. Sweep screenshots: _diag/p3-restore-sweep/. Shipped on branch p3-restore.

  • P3-X1 (S) Cancel-job feature. command.cancel WS envelope; agent tracks per-job ctx.CancelFunc and kills the running restic subprocess via context cancel (SIGTERM, SIGKILL after 5s grace via cmd.Cancel + cmd.WaitDelay); server endpoint POST /api/jobs/{id}/cancel bridges UI → WS; the existing UI Cancel button on /jobs/{id} is now real for any running kind. Sandbox-aware: internal/restic/cancel_{unix,windows}.go build-tags pick SIGTERM on POSIX vs os.Kill on Windows (which can't deliver SIGTERM). Tests: cancel mid-run via 'sleep 30' fake-restic returns JobCancelled with exit 130 in <200ms.
  • P3-X2 (S) Tree-list synchronous WS RPC. MsgTreeList → MsgTreeListResult with Envelope.ID correlation; generic Hub.SendRPC helper (registry of buffered channels keyed by ULID, ctx-cancel + timeout aware). internal/restic.ListTreeChildren wraps restic ls --json and filters its recursive output to direct children. Server-side treeCache is per-wizard-session (keyed by session cookie + host + snapshot + path) with a 30-min TTL and lazy sweep.
  • P3-01 (L) Restore wizard backend (internal/server/http/ui_restore.go). GET handlers render the four-step wizard against the wireframe. HTMX/fetch tree partial endpoint hits fetchTreeWithCache. POST validates: snapshot_id, ≥1 absolute path, in-place ⇒ confirm_hostname == host name, agent online; on error re-renders with operator's input intact. Happy path mints job_id, target = /var/lib/restic-manager/restore/<job-id> (server-picked, agent's writable dir under the systemd sandbox's ReadWritePaths), creates job row, ships command.run with RestorePayload, writes host.restore audit row, returns HX-Redirect (or 303) to the live job page.
  • P3-02 (L) Wizard UI templates (web/templates/pages/host_restore.html + partials/tree_node.html). Single-page progressively-enabled four-step form. Form-state-driven JS computes a running tally + step-4 confirm summary client-side. Tree expansion uses plain fetch (not HTMX) for simpler target lookup; loaded-state cached per node. Top-level Restore button on host detail right rail + per-snapshot Restore action on snapshot rows. New .snap-row token in web/styles/input.css.
  • P3-03 (M) Restore execution. restic.RunRestore builds restore <sid> --target <dir> [--include p]... with --json; new pumpRestoreStdout parses status + summary objects. --no-ownership is gated on the agent's restic version via Env.AtLeastVersion(0, 17) — the flag was added in 0.17 and 0.16 rejects it. Restic version is threaded through runner.Config.ResticVersion from the agent's sysinfo snapshot. New-dir target is operator-editable (default $HOME/rm-restore/<job-id>/); agent expands $HOME / ${HOME} / ~/ at run time and calls os.MkdirAll on the target chain so the operator never has to pre-create the per-job subdir. runner.RunRestore translates RestoreStatus into job.progress (mapping FilesRestored → FilesDone, etc.); agent dispatcher case JobRestore reuses the spawn() helper from P3-X1 so cancel works. Restore-shaped job-detail variant with current-file display under the progress bar.
  • P3-09 (S) diff between two snapshots. JobDiff JobKind + restic.RunDiff + runner.RunDiff; POST /api/hosts/{id}/snapshots/diff (and HTMX-form variant on the unprefixed path) dispatcher with two-snapshot guard + per-host snapshot-list validation; UI panel on host detail right rail (visible when 2+ snapshots) with two short-id inputs + Diff button. Output streams as log.stream to the standard live job log page.
  • P3-X3 (S) Recent-restores line on host detail. hostChromeData grows RestoreStatus / RestoreAt / RestoreJobID populated via store.LatestJobByKind(host_id, 'restore') (already exists from P2R). host_chrome.html renders a small line below the init-status one with status-coloured copy + a link to the job log. Hidden when no restore has ever run on this host.
  • P3-X4 (S) Job log download (txt + ndjson). New GET /api/jobs/{id}/log.{txt|ndjson} endpoint backed by the persisted job_logs table — works any time (running or finished) without pausing the live WS stream because the source is the DB, not the live socket. Plain-text format mirrors the on-screen "HH:MM:SS.mmm TAG payload" shape with a small # job ... · kind ... · status ... header; ndjson emits one self-contained {seq,ts,stream,payload} JSON object per line for jq / tooling. Surfaced as a single header dropdown on the live job page (details/summary-driven, native keyboard support, click-outside-to-close). New reusable .dropdown / .dropdown-menu / .dropdown-item tokens in web/styles/input.css.
  • P3-X5 (S) UK lint locale + sweep. .golangci.yml misspell locale switched US → UK and the codebase swept (~73 corrections: behaviour, serialise, recognise, honour, initialise, enrol, unauthorised, etc.). The wire ErrorCode value change "unauthorized" → "unauthorised" is a tiny contract change, but the agent doesn't parse those codes today and no external clients exist yet.
  • P3-X6 (S) Snapshot SIZE/FILES tooltip on host detail. The per-snapshot summary block was added by restic 0.17 (the source comment in internal/restic/snapshots.go incorrectly said 0.16+); on 0.16 hosts the columns render . hostDetailPage.LegacyRestic (computed via Env.AtLeastVersion(0, 17)) drives a title="Needs restic 0.17+ on the agent host. This host runs <ver>." + cursor: help on the column headers, hidden once the host upgrades.

Migration 0012 widens the jobs.kind CHECK constraint to include restore and diff. Rebuild required (SQLite can't ALTER CHECK in place); follows the safe pattern from 0005, with a defensive temp-table backup of job_logs so the cascade-trap that bit migration 0007 wouldn't take the log history with it.

install.sh + systemd unit: the install script now pre-creates /root/rm-restore (root-owned 0700) so the default new-dir restore target works under the sandbox out of the box; the unit's ReadWritePaths gains -/root/rm-restore (soft-fail prefix). Existing installs need a re-run of install.sh to pick up the new dir; new operator-typed targets are auto-created by the agent at job time.

As shipped (Playwright sweep against the live smoke env, 2026-05-04): login → host detail → Restore button → wizard step 1 picks snapshot a1ac4006 (most recent) → tree drill-down /home/steve/test (3 lazy loads) → tick file1 + file2 → step 4 confirm summary populated → dispatch → live job page with running progress widget → restore succeeds, files land on disk at /root/rm-restore/<job-id>/home/steve/test/file{1,2} (default $HOME/rm-restore/<job-id>/ after agent-side expansion). Custom-target restore to /tmp/custom-restore/<job-id>/ lands inside the agent's PrivateTmp namespace. Snapshot diff between a1ac4006 and 5f78c788 → diff job page, statistics output streamed (738 bytes added, 0 removed). Recent-restores line on host detail reads "last restore · succeeded 28s ago · job log →". Download dropdown serves both .txt and .ndjson with correct Content-Type + Content-Disposition. SIZE/FILES tooltip "Needs restic 0.17+ on the agent host. This host runs 0.16.4." renders on column hover.

Phase 3 — Alerts (not started)

  • P3-05 (M) Alert engine: rule evaluation loop (failed backup, stale schedule, agent offline, check failed)
  • P3-06 (M) Notification channels: webhook, ntfy, SMTP email
  • P3-07 (S) Alert UI: list, acknowledge, resolve

Phase 3 — Audit log UI (not started)

  • P3-08 (S) Audit log UI with filters (user, action, target, time range)

Phase 3 acceptance

  • A file deleted on a host can be restored from the UI in under 2 minutes via the wizard at /hosts/{id}/restore; the operator can cancel a running restore (or any other running job) from the live job page. Snapshot diff between two snapshots renders as a normal job page.
  • A failed backup raises an alert via the configured channel within 60s.
  • The audit-log UI lets an admin filter by user / action / target / time range.

Phase 4 — Update delivery, RBAC polish, OIDC

  • P4-01 (M) Update delivery via OS package managers: host an apt repo (Linux) and a Chocolatey package (Windows) on Gitea releases. restic-manager-agent update is a thin wrapper over apt-get install --only-upgrade restic-manager-agent / choco upgrade. Trades flexibility for a much smaller security surface than bespoke signed binaries (see spec.md §4.2).
  • P4-02 (M) Agent version reporting on dashboard: surface "agent N versions behind server"; "update all" admin action calls the package-manager wrapper on each host
  • P4-03 (M) RBAC enforcement at API layer (admin / operator / viewer)
  • P4-04 (S) User management UI (create/edit/disable, role assignment, password reset)
  • P4-05 (L) OIDC login (generic provider config, group → role mapping)
  • P4-06 (M) Repo size trend graphs (sparkline on host card, full chart on repo page)
  • P4-07 (S) Per-host tags + dashboard filtering by tag
  • P4-08 (M) Prometheus /metrics endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list
  • P4-09 (S) Document Prometheus integration + sample Grafana dashboard JSON

Phase 4 acceptance

  • Non-admin users see an appropriately limited UI. Agents upgrade via apt/choco with one admin-triggered action. OIDC login works against at least one provider (Authelia or Authentik). Prometheus can scrape /metrics and the sample Grafana dashboard renders with live data.

Phase 5 — OSS readiness

  • P5-01 (M) Documentation site (mdBook or similar) with install, concepts, security model, screenshots
  • P5-02 (S) CONTRIBUTING.md, CODE_OF_CONDUCT.md, issue + PR templates
  • P5-03 (S) Release automation: goreleaser for binaries + Docker image to GHCR
  • P5-04 (S) Demo screenshots / short Loom walkthrough in README
  • P5-05 (S) SECURITY.md with disclosure process
  • P5-06 (M) End-to-end test suite in CI (Playwright vs. compose stack with sibling Linux agent)
  • P5-07 (S) Reference deployment: docker-compose.yml + Caddyfile snippet showing the TLS-terminating reverse proxy in front of the HTTP-only server (also demonstrates RM_TRUSTED_PROXY)

Phase 5 acceptance

  • A stranger can read the docs and stand up a working install in under 30 minutes.

Cross-cutting / ongoing

  • X-01 Keep CHANGELOG.md updated (Keep-a-Changelog format)
  • X-02 Track restic version compatibility matrix
  • X-03 Periodic dependency updates (dependabot or renovate)
  • X-04 Threat-model review at end of each phase
  • X-05 Proper first-run onboarding UI: admin shouldn't need to curl /api/bootstrap by hand. Render the bootstrap form on the same login page (extra "setup token" field shown only while no admin user exists, hidden after); on submit POST to /api/bootstrap, then drop straight into a session. Surface the one-time token from the server log somewhere copy-able (or print a clickable URL with the token in the query string at first-run). Also: relax the 12-char password floor for the first-run path or document it in the form so admin doesn't silently fail validation.

Future / unscheduled

Items here have a plausible use case but no confirmed need. They live outside numbered phases until a concrete trigger (a user request, a security review finding, a real disaster-recovery exercise) bumps them back into a phase.

  • F-01 (formerly P3-04) Cross-host restore. De-scoped from Phase 3 on 2026-05-04. Disaster recovery is already covered: stand up a replacement host, paste the original repo creds at enrolment, snapshots reappear, restore is same-host. The remaining "pull a file from host A onto host C without granting C permanent access" use case is genuinely different (file sharing / migration, not DR) and hasn't been requested. Original spec language was: "target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root". Re-promote when there's a real ask.