restic-manager — Tasks

Tasks are grouped by phase. Each task has an ID for cross-referencing, an estimated size (S/M/L), and acceptance criteria.

Sizes: S = under a day, M = 1–3 days, L = 3–7 days.


Phase 0 — Project bootstrap

  • P0-01 (S) Initialize Go module, cmd/server, cmd/agent, baseline internal/ packages
  • P0-02 (S) Add LICENSE (PolyForm Noncommercial 1.0.0), README stub, CONTRIBUTING placeholder
  • P0-03 (S) Set up golangci-lint, gofumpt, goimports; pre-commit config
  • P0-04 (S) Gitea Actions: build matrix (linux amd64/arm64, windows amd64), unit tests, lint
  • P0-05 (S) Dockerfile.server (multi-stage, distroless), deploy/docker-compose.yml
  • P0-06 (S) Makefile / taskfile.yml with common targets (build, test, run, release)

Phase 1 — MVP: enrollment, visibility, on-demand backup

Server foundations

  • P1-01 (M) HTTP server scaffolding (chi, structured logging via slog, graceful shutdown)
  • P1-02 (M) SQLite store layer (modernc.org/sqlite) + migrations (hand-rolled, embed.FS)
  • P1-03 (M) Schema for users, sessions, hosts, repos, credentials, jobs, job_logs, snapshots, audit_log
  • [~] P1-04 (M) Auth: argon2id password hashing, login/logout, session cookies; CSRF middleware deferred to P1-23 (UI work) — REST clients use bearer/session-only flows
  • P1-05 (S) First-run admin bootstrap (printed one-time setup token in server logs)
  • P1-06 (M) Secret encryption helper (AEAD with key from RM_SECRET_KEY_FILE)
  • [~] P1-07 (M) Audit log writer; middleware sweep for every state-changing endpoint lands when the rest of the API surface does — login / bootstrap / host.enrolled / job.run_now currently audited
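The secret encryption helper in P1-06 is specified only as "AEAD with key from RM_SECRET_KEY_FILE". A minimal sketch of what such a helper could look like follows, assuming AES-256-GCM from the Go standard library and a nonce-prefixed blob layout. The cipher choice, the blob layout, and the names sealSecret / openSecret are illustrative assumptions, not the project's actual crypto.AEAD.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD; a 32-byte key selects AES-256.
// AES-GCM is an assumption: the task list only says "AEAD".
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// sealSecret returns nonce || ciphertext, using a fresh random nonce per call.
func sealSecret(key, plaintext []byte) ([]byte, error) {
	aead, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Passing nonce as dst appends the ciphertext after it.
	return aead.Seal(nonce, nonce, plaintext, nil), nil
}

// openSecret splits the nonce off the blob, then authenticates and decrypts the rest.
func openSecret(key, blob []byte) ([]byte, error) {
	aead, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(blob) < aead.NonceSize() {
		return nil, errors.New("secret blob too short")
	}
	nonce, ciphertext := blob[:aead.NonceSize()], blob[aead.NonceSize():]
	return aead.Open(nil, nonce, ciphertext, nil)
}

func main() {
	// Stand-in for the key the server would load from RM_SECRET_KEY_FILE.
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	blob, err := sealSecret(key, []byte("repo-password"))
	if err != nil {
		panic(err)
	}
	plain, err := openSecret(key, blob)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d-byte blob decrypts to %q\n", len(blob), plain)
}
```

Because the authentication tag covers the whole ciphertext, a blob that has been modified at rest fails to open instead of decrypting to garbage.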

Agent ↔ server protocol

  • P1-08 (M) Define shared API types in internal/api (envelopes, every WS message + protocol_version constants; JSON-shape tests pin the wire)
  • P1-09 (L) WebSocket transport (github.com/coder/websocket), framed JSON envelopes, RPC correlation IDs, exponential-backoff reconnect on the agent side
  • P1-10 (M) Enrollment flow: POST /api/agents/enroll with one-time token → returns persistent bearer. Cert pin field stays in the response shape but is left empty: the server is HTTP-only behind a reverse proxy, so the operator pastes the proxy's cert hash into the install command rather than having the server introspect a cert it doesn't terminate.
  • P1-11 (M) Agent registration on connect (hello upserts agent_version/restic_version/protocol_version, flips status online, protocol_too_old rejection has clean error envelope)
  • P1-12 (S) Heartbeat handler (touches last_seen_at; background sweeper marks hosts offline after 90s without one)
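To illustrate P1-08 and P1-11, here is a hedged sketch of a JSON envelope with an RPC correlation ID and a hello handler that answers an outdated agent with a clean protocol_too_old error envelope. The field names, the integer protocol_version, the "hello.ack" message type, and the minimum version value are all assumptions made for the example; only protocol_too_old, hello, and the three reported version fields come from the task list.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// minProtocolVersion is a hypothetical value for illustration.
const minProtocolVersion = 2

// Envelope is an assumed wire shape: a type tag, a correlation ID, and a raw payload.
type Envelope struct {
	Type    string          `json:"type"`
	ID      string          `json:"id,omitempty"`
	Payload json.RawMessage `json:"payload,omitempty"`
}

// Hello carries the versions the agent reports on connect.
type Hello struct {
	AgentVersion    string `json:"agent_version"`
	ResticVersion   string `json:"restic_version"`
	ProtocolVersion int    `json:"protocol_version"`
}

type ErrorPayload struct {
	Code    string `json:"code"`
	Message string `json:"message"`
}

// reply wraps a payload in an envelope that echoes the request's correlation ID.
func reply(id, typ string, payload any) (Envelope, error) {
	raw, err := json.Marshal(payload)
	if err != nil {
		return Envelope{}, err
	}
	return Envelope{Type: typ, ID: id, Payload: raw}, nil
}

// handleHello decodes a hello frame and either acknowledges it or returns
// an error envelope, so an old agent gets a readable reason, not a parse failure.
func handleHello(frame []byte) (Envelope, error) {
	var env Envelope
	if err := json.Unmarshal(frame, &env); err != nil {
		return Envelope{}, err
	}
	if env.Type != "hello" {
		return Envelope{}, fmt.Errorf("unexpected message type %q", env.Type)
	}
	var h Hello
	if err := json.Unmarshal(env.Payload, &h); err != nil {
		return Envelope{}, err
	}
	if h.ProtocolVersion < minProtocolVersion {
		return reply(env.ID, "error", ErrorPayload{
			Code:    "protocol_too_old",
			Message: fmt.Sprintf("agent speaks v%d, server needs at least v%d", h.ProtocolVersion, minProtocolVersion),
		})
	}
	return reply(env.ID, "hello.ack", struct{}{})
}

func main() {
	frame := []byte(`{"type":"hello","id":"42","payload":{"agent_version":"0.1.0","restic_version":"0.17.3","protocol_version":1}}`)
	out, err := handleHello(frame)
	if err != nil {
		panic(err)
	}
	b, _ := json.Marshal(out)
	fmt.Println(string(b))
}
```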

Agent foundations

  • P1-13 (M) Agent config file (/etc/restic-manager/agent.yaml, atomic save: tmp+fsync+rename); Windows path deferred to Phase 2
  • P1-14 (M) Service integration: systemd unit (sandboxed: NoNewPrivileges, Protect*, MemoryDenyWriteExecute) — Linux only in Phase 1
  • P1-15 (M) Outbound WS client (github.com/coder/websocket) with reconnect (1s → 60s exponential + jitter), optional cert pin (sha-256 of leaf), heartbeats, protocol_version in hello
  • P1-16 (M) Restic wrapper: locate via PATH or override, run with --json, scan stdout/stderr, parse BackupStatus + BackupSummary, exit-code 3 treated as success-with-issues
  • P1-17 (S) Host metadata collection (OS, arch, hostname, restic version, agent version, protocol_version)
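The reconnect policy in P1-15 (1s → 60s exponential + jitter) can be sketched as a pure function. The task list does not say which jitter scheme is used, so the "equal jitter" variant below (half the delay fixed, half randomized) and the name reconnectDelay are assumptions for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	baseDelay = 1 * time.Second
	maxDelay  = 60 * time.Second
)

// reconnectDelay returns the wait before reconnect attempt n (0-based).
// jitter must return a value in [0, 1); it is injected so tests can be deterministic.
func reconnectDelay(attempt int, jitter func() float64) time.Duration {
	d := baseDelay
	// Double per attempt, stopping early once the cap is reached to avoid overflow.
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	// Equal jitter (an assumption): keep half the delay, randomize the other half.
	half := d / 2
	return half + time.Duration(jitter()*float64(half))
}

func main() {
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Printf("attempt %d: wait %v\n", attempt, reconnectDelay(attempt, rand.Float64))
	}
}
```

Jitter keeps a fleet of agents from reconnecting in lockstep after a server restart; the cap bounds how long a recovered server waits for the last agent to return.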

Run-now backup

  • P1-18 (L) Job lifecycle: queued → running → succeeded/failed/cancelled, persisted with stats blob and error
  • P1-19 (M) Server endpoint POST /api/hosts/{id}/jobs to dispatch a backup command (validates kind, checks online, audit-logs)
  • P1-20 (M) Agent executes restic backup, streams stdout/stderr + parsed JSON events back as job.progress (1Hz throttle) / log.stream
  • [~] P1-21 (M) Server persists log stream to job_logs ✓; WS /api/jobs/{id}/stream for live browser tailing still TODO — needs the per-job fan-out hub
  • P1-22 (S) Snapshot listing: agent calls restic snapshots --json after each successful backup and ships the projection over snapshots.report. Server ReplaceHostSnapshots atomically swaps the per-host list and updates hosts.snapshot_count in the same tx. Read endpoint: GET /api/hosts/{id}/snapshots. Tests cover round-trip, authoritative-replace semantics, and empty-after-prune. Schema dropped the unused repo_id FK from snapshots (repos as a first-class entity is P2 work).
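The job lifecycle in P1-18 (queued → running → succeeded/failed/cancelled) reads naturally as a small transition table. The sketch below is one way to encode it; allowing queued → cancelled directly is an assumption, since the task list only names the states, and the identifiers are hypothetical.

```go
package main

import "fmt"

type JobStatus string

const (
	Queued    JobStatus = "queued"
	Running   JobStatus = "running"
	Succeeded JobStatus = "succeeded"
	Failed    JobStatus = "failed"
	Cancelled JobStatus = "cancelled"
)

// allowed lists the legal next states. States with no entry are terminal.
// queued -> cancelled is an assumption (cancelling before dispatch).
var allowed = map[JobStatus][]JobStatus{
	Queued:  {Running, Cancelled},
	Running: {Succeeded, Failed, Cancelled},
}

// canTransition reports whether a job may move from one status to another.
func canTransition(from, to JobStatus) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

// isTerminal reports whether a status has no outgoing transitions.
func isTerminal(s JobStatus) bool {
	return len(allowed[s]) == 0
}

func main() {
	path := []JobStatus{Queued, Running, Succeeded}
	for i := 0; i+1 < len(path); i++ {
		fmt.Println(path[i], "->", path[i+1], canTransition(path[i], path[i+1]))
	}
	fmt.Println("succeeded -> running", canTransition(Succeeded, Running))
}
```

Checking the table before every status write keeps a late job.finished from an agent from reviving a job the operator already cancelled.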

UI (HTMX + Tailwind)

  • P1-23 (M) Base layout, login page, session-aware nav
  • P1-24 (M) Dashboard: fleet summary tiles + host table (status dot + row accent + os/arch + last backup + repo size + snapshots + alerts + tags + run-now). Backed by GET /api/hosts + GET /api/fleet/summary (JSON) and a server-rendered HTML view. Empty state hands the operator the install command. HTMX Run now button posts to /hosts/{id}/run-backup.
  • P1-25 (M) Host detail page (/hosts/{id}): persistent header (status dot + mono name + tags + OS/arch/agent/restic/last-seen), vitals strip (last backup / repo size / snapshots / open alerts), sub-tabs (Snapshots active; Jobs/Repo/Settings tabs visible but inert until P2), snapshot table (cap 50, pagination later), right-rail run-now stack (backup live; forget/prune/check/unlock disabled with P2 hints) and a danger-zone delete panel.
  • P1-26 (M) Live job log viewer + WS browser fan-out hub (closes the P1-21 remainder). Browser opens /api/jobs/{id}/stream; agent-emitted job.started/job.progress/log.stream/job.finished are mirrored to subscribers. Per-subscriber buffered channel + non-blocking broadcast keeps a slow browser from blocking the agent's read loop. Page renders running / succeeded / failed states; auto-scrolls until the operator scrolls up; reloads on job.finished to show the final header. "Run now" sets HX-Redirect so the operator lands on the live log.
  • [~] P1-27 (M) "Add host" flow: form takes hostname + repo URL/username/password, mints token (TTL 1h), re-renders the same page in result-state with the install command (RM_SERVER + RM_TOKEN filled in), copy button, and an awaiting-agent panel. Encrypted repo creds ride on the token row (P1-32) and get pushed to the agent on first WS connect (P1-33). Deferred: one-click "download preconfigured installer" install-<hostname>.sh (cf. UrBackup Internet-mode push installer) — copy-paste covers it for v1.
  • P1-28 (S) Tailwind build via tailwindcss standalone binary (no Node): Makefile downloads pinned v3.4.17 into bin/tailwindcss, builds web/styles/input.css → web/static/css/styles.css, embedded into the binary via web.FS. make build runs Tailwind first.
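The fan-out hub in P1-26 relies on a per-subscriber buffered channel plus a non-blocking broadcast so that a slow browser cannot stall the agent's read loop. A minimal sketch of that pattern follows. The Hub type, its method names, and the choice to drop (rather than disconnect) on a full buffer are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Hub fans job messages out to any number of subscribers per job ID.
type Hub struct {
	mu   sync.Mutex
	subs map[string]map[chan []byte]struct{}
}

func NewHub() *Hub {
	return &Hub{subs: make(map[string]map[chan []byte]struct{})}
}

// Subscribe registers a buffered channel for a job and returns it with a cancel func.
func (h *Hub) Subscribe(jobID string, buffer int) (<-chan []byte, func()) {
	ch := make(chan []byte, buffer)
	h.mu.Lock()
	if h.subs[jobID] == nil {
		h.subs[jobID] = make(map[chan []byte]struct{})
	}
	h.subs[jobID][ch] = struct{}{}
	h.mu.Unlock()

	cancel := func() {
		h.mu.Lock()
		defer h.mu.Unlock()
		if _, ok := h.subs[jobID][ch]; !ok {
			return // already cancelled
		}
		delete(h.subs[jobID], ch)
		if len(h.subs[jobID]) == 0 {
			delete(h.subs, jobID)
		}
		close(ch) // safe: Broadcast sends only while holding the same lock
	}
	return ch, cancel
}

// Broadcast offers msg to every subscriber without blocking and returns
// how many subscribers missed it because their buffer was full.
func (h *Hub) Broadcast(jobID string, msg []byte) (dropped int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for ch := range h.subs[jobID] {
		select {
		case ch <- msg:
		default:
			dropped++ // slow consumer: skip rather than stall the caller
		}
	}
	return dropped
}

func main() {
	hub := NewHub()
	ch, cancel := hub.Subscribe("job-1", 2)
	defer cancel()
	hub.Broadcast("job-1", []byte("log line 1"))
	hub.Broadcast("job-1", []byte("log line 2"))
	fmt.Println("dropped on third send:", hub.Broadcast("job-1", []byte("log line 3")))
	fmt.Println(string(<-ch))
}
```

Dropping is acceptable here because the same log lines are persisted to job_logs (P1-21), so a browser that fell behind can recover them on reload.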

Install scripts

  • P1-29 (M) install.sh (Linux): detects arch, downloads agent, installs systemd unit, enrolls. Detects existing restic timers / /etc/cron.{d,daily,hourly,weekly}/* / root crontab and prints them with the exact disable commands — does not auto-disable
  • [~] P1-31 (S) Server endpoint to serve agent binaries + install scripts ✓ (/agent/binary + /install/*); signature verification deferred to Phase 5 OSS readiness

Repo credentials (pulled forward from Phase 2)

  • P1-32 (M) Server-side encrypted repo creds carried on the enrollment token:

    • POST /api/enrollment-tokens body grows repo_url, repo_username, repo_password (all required).
    • Token row stores them as one AEAD-encrypted blob (existing crypto.AEAD); ConsumeEnrollmentToken moves the blob to a new host_credentials row keyed by host_id in the same tx.
    • PUT /api/hosts/{id}/repo-credentials (admin/operator) re-encrypts and replaces the row, emits an in-memory event to the WS hub.
    • GET /api/hosts/{id}/repo-credentials returns the redacted view (URL + username + has_password) so the UI can pre-fill the edit form. Password never leaves the server outside the WS push.
    • On WS hello, server pushes a config.update with decrypted creds before returning the connection to idle. Same path on edit-while-connected.
    • Audit-logged on create / consume / edit; payload omits the secret material.
  • P1-33 (M) Agent-side encrypted secrets store:

    • New internal/agent/secrets package: AEAD blob at /var/lib/restic-manager/secrets.enc, atomic write (tmp+fsync+rename, mode 0600).
    • Per-host 32-byte secrets key minted at enrollment, persisted in agent.yaml (already 0600 root-only — same trust boundary as the bearer; explicit comment in the file).
    • Strip repo_url / repo_password from agent.config.Config. Agent loads creds from secrets.enc at startup; config.update handler writes through to the file.
    • Dispatcher reads from the secrets store on every job rather than from in-memory config.
    • Migration path: if agent.yaml still contains repo_url/repo_password, copy them into secrets.enc on next start, then strip from the YAML on save.
  • P1-34 (S) End-to-end smoke runbook: docs/e2e-smoke.md walks through enrollment with repo creds → agent receives them via push-on-connect → run-now backup completes against a real restic/rest-server in a sibling container → host appears with snapshot count. Test-driven version (Playwright + compose) deferred to P5-06.
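P1-13 and P1-33 both call for an atomic write (tmp+fsync+rename, mode 0600). The sketch below shows that sequence with the Go standard library. The function name writeFileAtomic and the demo helper are hypothetical, and the sketch deliberately leaves out fsync of the parent directory, which a stricter implementation might add.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temp file in the target's directory,
// fsyncs it, then renames it over the target so readers never observe
// a partially written file.
func writeFileAtomic(path string, data []byte, mode os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	defer os.Remove(tmpName) // harmless after a successful rename

	if err := tmp.Chmod(mode); err != nil {
		tmp.Close()
		return err
	}
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush file contents to disk before the rename
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmpName, path)
}

// writeAndReadBack exercises writeFileAtomic twice in a throwaway directory
// and returns what ends up on disk.
func writeAndReadBack(first, second []byte) ([]byte, error) {
	dir, err := os.MkdirTemp("", "rm-secrets-")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "secrets.enc")
	for _, data := range [][]byte{first, second} {
		if err := writeFileAtomic(path, data, 0o600); err != nil {
			return nil, err
		}
	}
	return os.ReadFile(path)
}

func main() {
	got, err := writeAndReadBack([]byte("old blob"), []byte("new blob"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("on disk: %q\n", got)
}
```

Creating the temp file in the same directory as the target matters: a rename is only atomic within one filesystem.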

Phase 1 acceptance

  • One Linux host can enroll, appear in the dashboard, and a backup can be triggered from the UI with live log streaming. Snapshots list updates after success.
  • Windows binary builds cleanly in CI (.gitea/workflows/ci.yml) but is not service-tested or installer-shipped in Phase 1 — that lands in Phase 2 (P2-16, P2-17).
  • Agent ↔ server protocol_version handshake rejects mismatched versions with a clear error rather than failing on JSON parse.
  • Repo credentials never appear in plaintext on disk: server stores them AEAD-encrypted, agent stores them AEAD-encrypted, the wire carries them only inside the authenticated WS as config.update.

Phase 2 — Scheduling, retention, repo operations

Mid-phase pivot — "P2 redesign" (commits 7a7cac5, 666af41, 5667cdf). The original P2 plan put paths/excludes/retention/manual/kind/options on Schedule and one repo per host. After landing P2-01..P2-05 against that shape, the data model was rewritten: schedules are slim (cron + which source_groups); paths/excludes/retention/retry live on source_group (also doubles as the snapshot tag); forget/prune/check cadences live on host_repo_maintenance and run on a server-side ticker, not the agent cron; pending_runs queues offline retries; host.repo_initialised_at is gone (auto-init at enrolment). The redesign is captured below as P2R-NN items. Items P2-01..P2-05 stay marked done because the work shipped, but they're labelled ⚠️ shipped against old shape — behaviour to be re-validated under P2R-02 after UI rewire. P2-04.5 (manual flag) is dropped wholesale. P2-06..P2-15 are reframed below to point at their new homes; P2-16/17/18 are unaffected by the redesign.

Original P2 work — shipped (against pre-redesign shape)

  • ⚠️ P2-01 (M) Schedule schema + CRUD API (fat-schedule shape) — superseded by P2R-01.
  • ⚠️ P2-02 (L) Server-pushed schedule reconciliation (fat-schedule shape) — re-validate under P2R-01.
  • ⚠️ P2-03 (M) Agent local scheduler (internal/agent/scheduler, robfig/cron/v3, schedule.fire envelope, dispatchScheduledJob). The cron loop + ack/fire round-trip stay; the payload it carries reshapes under P2R-01.
  • ⚠️ P2-04 (M) Schedule editor UI (fat-schedule form: paths/excludes/tags/retention/bandwidth on the schedule itself) — superseded by P2R-02.
  • P2-04.5 Manual schedules / kill host.default_paths — superseded; the manual flag concept is gone, Run-now lives on source groups (see P2R-01 / P2R-02).
  • ⚠️ P2-05 (M) forget command with retention policy. Wire payload (CommandRunPayload.retention_policy) and restic wrapper (restic.ForgetPolicy, RunForget) are still correct; what changes under P2R-03 is where retention comes from (source_group, not schedule) and who dispatches (server-side maintenance ticker for cadence; per-source-group Run-now for ad-hoc).

P2 redesign — Phase 1

  • P2R-00.1 (M) Migration 0008 — sources + repo maintenance. Adds source_groups, schedule_source_groups junction, host_repo_maintenance, pending_runs, host.bandwidth_up_kbps / bandwidth_down_kbps. Drops host.repo_initialised_at. Slim-schedule columns dropped from schedules. Column-level ALTERs only — no table rebuilds (FK cascade trap, see CLAUDE.md). Commit 7a7cac5.
  • P2R-00.2 (M) v4 wireframes for sources / schedules / repo. Hi-fi mocks of /hosts/{id}/sources, /sources/{gid}/edit (with retention-conflict banner), slim /schedules, /repo (connection / bandwidth / maintenance / re-init). Commit 666af41.

P2 redesign — Phase 2

  • P2R-00.3 (L) Go-side store rewrite against migration 0008. New types: SourceGroup, HostRepoMaintenance, PendingRun. Schedule slimmed to {id, host_id, cron, enabled, source_group_ids, timestamps}. RetentionPolicy moves from schedule field → source group field (type unchanged). Host loses RepoInitialisedAt, gains bandwidth caps. New files: store/sources.go, store/maintenance.go, store/pending.go. store/schedules.go rewritten for slim shape + junction CRUD. enrollment.go seeds a default source group + repo-maintenance row instead of a manual schedule. ws/handler.go drops MarkHostRepoInitialised. HTTP layer + UI templates temporarily 501-stubbed with redesign_in_progress — this is what P2R-01 / P2R-02 fill back in. Tests for the obsolete fat-schedule API deleted. Commit 5667cdf.
  • P2R-00.4 (S) Host-detail UI patched up enough to render: RepoInitialisedAt template refs removed, manual init-repo branches stripped, dead Schedules sub-tab demoted to inert div (matches Jobs/Repo/Settings), broken Run-now buttons disabled with P2-Phase-4 hints. Stop-gap until P2R-02 lands the real surface.

P2 redesign — Phase 3 (REST + WS rewire)

  • P2R-01 (L) HTTP/WS layer against the slim shape:
    • Schedules REST CRUD: GET|POST /api/hosts/{id}/schedules, PUT|DELETE /api/hosts/{id}/schedules/{sid}. Body shape is {cron, enabled, source_group_ids[]} — paths/excludes/retention/kind/manual all go away. Junction wiped + re-inserted on every update (per store.UpdateSchedule). Validation: cron parses via robfig/cron/v3; ≥1 source_group_ids; all referenced groups belong to the host.
    • Source-groups REST CRUD: GET|POST /api/hosts/{id}/source-groups, GET|PUT|DELETE /api/hosts/{id}/source-groups/{gid}. Body: {name, includes[], excludes[], retention_policy, retry_max, retry_backoff_seconds}. Name uniqueness per host. Refuse delete if SchedulesUsingGroup(gid) is non-empty (return the schedule list so UI can show "remove from these schedules first"). Mutations bump host_schedule_version.
    • Repo-maintenance REST: GET|PUT /api/hosts/{id}/repo-maintenance. Body: {forget_cadence, prune_cadence, check_cadence, check_subset_pct, enabled}. Server-side ticker drives execution (P2R-06), so updates here do not bump host_schedule_version.
    • Per-source-group Run-now: POST /hosts/{id}/source-groups/{gid}/run. Reuses the existing dispatchScheduleNow-style path; agent receives a normal command.run carrying the resolved includes/excludes/retention from the group. This replaces the old per-host /hosts/{id}/run-backup endpoint (kept around as a 410-Gone with a hint pointing to source groups).
    • schedule_push.go reconciliation: rebuild pushScheduleSet* to ship the new wire format (ScheduleSetPayload carries [{schedule_id, cron, enabled, source_groups: [{name, includes, excludes, retention, retry_*}]}] — agent doesn't need to know source_group_id, just the resolved bundle). On-hello + async-on-CRUD flavours; ack still persists applied_schedule_version.
    • Auto-init at enrolment: server dispatches restic init on first WS connect (was P2-old "Init repo" button — now invisible to the operator). On success: emit a normal job row with kind=init so the audit trail still shows it. On init returning "config file already exists" (e.g. re-enrolment against an existing repo): treat as soft success per existing restic-wrapper behaviour.
    • Tests: rewrite the deleted schedules_test.go and schedule_push_test.go against new endpoints; new source_groups_test.go, repo_maintenance_test.go, auto_init_test.go. End-to-end: enrol → server pushes creds → server dispatches init → agent runs it → schedule reconcile fires → operator hits per-source-group Run-now → backup runs → snapshots refresh.
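The schedule validation rules listed under P2R-01 (cron must parse, at least one source group, every referenced group belongs to the host) can be sketched as follows. To stay standard-library-only, the cron check here is reduced to "non-empty"; the task list names robfig/cron/v3 for the real parse. The string ID type, the struct name, and the error wording are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ScheduleBody mirrors the slim request shape {cron, enabled, source_group_ids[]}.
// Using string IDs is an assumption for this sketch.
type ScheduleBody struct {
	Cron           string   `json:"cron"`
	Enabled        bool     `json:"enabled"`
	SourceGroupIDs []string `json:"source_group_ids"`
}

// validateSchedule applies the three rules from the task list.
// hostGroupIDs is the set of source-group IDs owned by the host in the URL.
func validateSchedule(body ScheduleBody, hostGroupIDs map[string]bool) error {
	// Placeholder for the real check: the project parses the expression
	// with robfig/cron/v3; this sketch only rejects an empty string.
	if strings.TrimSpace(body.Cron) == "" {
		return errors.New("cron is required")
	}
	if len(body.SourceGroupIDs) == 0 {
		return errors.New("at least one source_group_id is required")
	}
	for _, id := range body.SourceGroupIDs {
		if !hostGroupIDs[id] {
			return fmt.Errorf("source group %q does not belong to this host", id)
		}
	}
	return nil
}

func main() {
	owned := map[string]bool{"sg-etc": true, "sg-home": true}
	bodies := []ScheduleBody{
		{Cron: "0 3 * * *", Enabled: true, SourceGroupIDs: []string{"sg-etc"}},
		{Cron: "0 3 * * *", Enabled: true},
		{Cron: "0 3 * * *", Enabled: true, SourceGroupIDs: []string{"sg-other-host"}},
	}
	for _, b := range bodies {
		fmt.Printf("%v -> %v\n", b.SourceGroupIDs, validateSchedule(b, owned))
	}
}
```

The ownership check is what stops a schedule on one host from referencing another host's source group by guessing its ID.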

P2 redesign — Phase 4 (UI rewire, against v4 wireframes)

Row-design rule (binding for every list-row template in this app, current and future): Whole-row click navigates to the row's primary detail/edit page — mirror .host-row.clickable on the dashboard (partials/host_row.html): an absolute-positioned .row-link overlay with text-indent: -9999px covers the row, action buttons live in .row-action cells that sit above via z-index. Do not add an explicit "Edit" button when the row is clickable — it duplicates the affordance and dilutes the click target. Action cells are reserved for verbs that aren't "open this row" (Run-now, Delete, Pause, etc).

  • P2R-02 (L) UI templates rebuilt against the new model:
    • Slice 1 Sub-tab navigation skeleton — extract header/vitals/sub-tabs into a host_chrome partial; Sources / Schedules / Repo become real <a> links; placeholder pages share the chrome; version indicator restored. (commit a535822)
    • Slice 2 Sources tab — /hosts/{id}/sources list with per-row meta + clickable rows + per-group Run-now/Delete; /sources/new and /sources/{gid}/edit form (name, includes/excludes, 3×2 keep-* grid, retry-on-offline, inline conflict banner from ConflictDimension cache); validation re-renders form with input intact; refuses to delete a host's last source group. (commits 0ed9c3d, dede74f)
    • Slice 3 Schedules tab — /hosts/{id}/schedules slim list (status / cron / source-tags / actions, clickable rows) plus /schedules/new and /schedules/{sid}/edit form (cron with five quick-pick chips that have human-readable tooltips, source-group multi-pick as styled check cards, enabled toggle); per-schedule Run-now reuses dispatchScheduledJob for enabled schedules + bypasses the enabled check (with a HX-confirm) for paused ones; multi-group fires emit a success toast, single-group fires HX-Redirect to the live job log. (commit 67ca769 + follow-ups 64d2fcf, 8b91d30, 4035c44)
    • Slice 4 /hosts/{id}/repo — three independent forms (connection: URL/user/password pre-filled from GET /api/hosts/{id}/repo-credentials redacted view; bandwidth: host-wide caps via new PUT /api/hosts/{id}/bandwidth; maintenance: forget/prune/check cadences + check subset %); danger-zone re-init button rendered + disabled (real flow lands in P2R-09); right-rail snapshots-by-tag breakdown. (commit d62b173)
    • Slice 5 Dashboard row Run-now uses the single covering schedule when one exists ("Run all groups" primary button), otherwise falls back to "Open →" pointing at the Sources tab. Right-rail and empty-snapshots-state Run-now were rehomed to source-group context in slice 1. (commit fab99b4)
    • Slice 6 Playwright sweep against the live :8080 server — login → walk every new tab → create source group → create schedule → Run-now → confirm a snapshot landed → end-to-end clean, no console errors. Screenshots in _diag/p2r-02-sweep/.
    • Side-fix: agent runner drops noisy restic status events from log.stream (they were drowning the live log on short backups; the throttled job.progress envelope already covers the same data). (commit ffba737)
    • Header "version N · agent in sync / agent at vM" indicator preserved across all tabs (backed by host_schedule_version + applied_schedule_version).
    • Form validation re-renders with the operator's typed input intact (mirror P2-04's behaviour). Each save fires pushScheduleSetAsync so an online agent re-arms within seconds.

P2 redesign — Phase 5 (server-side maintenance ticker)

  • P2R-03 (M) prune command end-to-end. Restic wrapper (restic.RunPrune), agent dispatcher (case api.JobPrune:), wire envelope. Admin-only credential: a second host_credentials row keyed by host_id + kind=admin carries the non-append-only username/password; server pushes it via config.update only when dispatching a prune job, and the agent's secrets store keeps it in a separate slot from the everyday append-only creds. UI: prune row on the Repo page. Operator-triggered Run-now via POST /hosts/{id}/repo/prune. Cadence-driven dispatch lands in P2R-06.
  • P2R-04 (M) check command end-to-end (restic check --read-data-subset N%). Wrapper + dispatcher + wire. UI: check row on the Repo page (with the subset % slider). Operator Run-now via POST /hosts/{id}/repo/check. Cadence-driven dispatch lands in P2R-06.
  • P2R-05 (S) unlock command end-to-end (restic unlock). Operator-only — no cadence. POST /hosts/{id}/repo/unlock. Repo page surfaces lock state from the most recent check (which warns about stale locks).
  • P2R-06 (M) Server-side maintenance ticker. Cron-style loop on the server reads host_repo_maintenance rows, dispatches forget / prune / check jobs against the right host on the configured cadence (last-run timestamps tracked per kind on the maintenance row). Independent of the agent's local cron — the agent's cron only handles backup schedules now. Skips offline hosts (queues to pending_runs instead — see P2R-08). Handles ticker restarts cleanly (no-op if a job of the same kind ran inside the cadence window).
  • P2R-07 (S) Repo stats panel on the Repo page: size, dedup ratio, snapshot count, last-check timestamp + result, lock state, last-prune timestamp + bytes-freed. Backed by parsing restic stats --json output that the agent ships periodically (piggyback on the existing snapshots-report path).
  • P2R-08 (M) Pending-runs queue worker. On agent reconnect, server drains pending_runs rows for that host and re-dispatches them in order. Bump backoff per pending_run.attempt_count; drop rows that have exceeded the source-group's retry_max. Audit-logged. Smoke-tested by stopping the agent, running maintenance ticker so cadence misses, restarting agent, watching the queue drain.
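The per-tick decision described in P2R-06 (skip if the same kind already ran inside the cadence window, queue to pending_runs if the host is offline, otherwise dispatch) fits in one pure function. This is a sketch under assumptions: cadence is modelled as a time.Duration, a zero cadence means "disabled", a never-run kind counts as due, and all names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

const day = 24 * time.Hour

// demoStart is an arbitrary fixed instant used by the demo below.
var demoStart = time.Date(2026, 5, 3, 0, 0, 0, 0, time.UTC)

type Action string

const (
	Skip     Action = "skip"
	Dispatch Action = "dispatch"
	Enqueue  Action = "enqueue" // host offline: record in pending_runs
)

// decide is evaluated once per maintenance kind (forget / prune / check) per tick.
// lastRun is the per-kind timestamp tracked on the maintenance row.
func decide(now, lastRun time.Time, cadence time.Duration, online bool) Action {
	if cadence <= 0 {
		return Skip // assumption: zero cadence disables this kind
	}
	// A run inside the cadence window makes the tick a no-op, which is
	// also what keeps a restarted ticker from double-dispatching.
	if !lastRun.IsZero() && now.Sub(lastRun) < cadence {
		return Skip
	}
	if !online {
		return Enqueue
	}
	return Dispatch
}

func main() {
	now := demoStart.Add(8 * day)
	fmt.Println("weekly prune, last ran 8d ago, online: ", decide(now, demoStart, 7*day, true))
	fmt.Println("weekly prune, last ran 3d ago, online: ", decide(now, demoStart.Add(5*day), 7*day, true))
	fmt.Println("weekly prune, last ran 8d ago, offline:", decide(now, demoStart, 7*day, false))
}
```

Keeping the decision free of I/O makes the restart and offline cases cheap to unit-test without a running ticker.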

P2 redesign — Phase 5

  • Restic-manager Phase 5 lands on branch p2r-phase5-maintenance: prune/check/unlock end-to-end (P2R-03/04/05); server-side maintenance ticker drives forget/prune/check on cadence (P2R-06); repo-stats panel surfaces size, lock state, last-check / last-prune (P2R-07); pending-runs queue worker drains scheduled-backup fires that raced an agent disconnect (P2R-08). See docs/superpowers/plans/2026-05-03-p2-redesign-phase-5.md.

P2 redesign — Phase 6 (auto-init follow-up) — TODO

  • P2R-09 (S) Auto-init UX polish. Surface init result on host detail (small "repo ready · initialised by you on …" line; or "init failed — see job N · retry" if init failed). Re-init button on Repo page danger zone wipes then re-runs init (admin only, audit-logged, two-step confirm with the host name typed in).

Pre/post hooks (rehomed onto source groups) — TODO

  • P2R-10 (M) Hook schema: source_group.pre_hook, source_group.post_hook, host.pre_hook_default, host.post_hook_default. Encrypted at rest (existing crypto.AEAD). Admin-only edit. Audit-logged.
  • P2R-11 (M) Agent execution of hooks: configurable shell per host. pre_hook failure aborts the backup. post_hook always runs with RM_JOB_STATUS env var. Stdout/stderr captured into JobLog with a hook: prefix. Hooks only run for kind=backup jobs (forget/prune/check/unlock skip them, per spec.md §14.3).
  • P2R-12 (S) Hook editor UI on source-group edit page (per-group override) and host Settings tab (host-wide default). Validation rejects non-backup contexts. Warning banner: "this hook runs as the agent service user (root on Linux; LocalSystem on Windows)".

Bandwidth + niceties (rehomed onto host + source groups) — TODO

  • P2R-13 (S) Bandwidth limit fields. Host-wide caps (Host.BandwidthUpKBps, BandwidthDownKBps — schema is in 0008 already, just needs UI on the Repo page) applied to every restic invocation. Per-job override on Run-now (override field on the Run-now confirm dialog). Maps to restic --limit-upload / --limit-download.
  • P2R-14 (S) Schedule "next run" / "last run" surfaced on host card (dashboard row) + on the Schedules tab. "Next run" computed server-side from cron + now; "last run" from the most recent job with actor_kind=schedule for any schedule that uses any of the host's source groups.
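P2R-13 maps the host-wide caps and the per-job override onto restic's --limit-upload / --limit-download flags. A small sketch of that resolution follows. Treating 0 as "unlimited", letting an override win per field, and passing the stored number through unchanged (restic interprets these flags in KiB/s) are assumptions; the type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// Limits holds upload/download caps; 0 means "no cap" in this sketch.
type Limits struct {
	UpKBps   int
	DownKBps int
}

// effectiveLimits starts from the host-wide caps and lets a Run-now
// override replace each direction it sets.
func effectiveLimits(host Limits, override *Limits) Limits {
	out := host
	if override != nil {
		if override.UpKBps > 0 {
			out.UpKBps = override.UpKBps
		}
		if override.DownKBps > 0 {
			out.DownKBps = override.DownKBps
		}
	}
	return out
}

// resticLimitArgs renders the caps as restic global flags, omitting unset ones.
func resticLimitArgs(l Limits) []string {
	var args []string
	if l.UpKBps > 0 {
		args = append(args, "--limit-upload", strconv.Itoa(l.UpKBps))
	}
	if l.DownKBps > 0 {
		args = append(args, "--limit-download", strconv.Itoa(l.DownKBps))
	}
	return args
}

func main() {
	host := Limits{UpKBps: 1024}
	fmt.Println(resticLimitArgs(effectiveLimits(host, nil)))
	fmt.Println(resticLimitArgs(effectiveLimits(host, &Limits{UpKBps: 256, DownKBps: 512})))
}
```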

Cross-platform + alt-enrolment (unchanged by redesign) — TODO

  • P2-16 (M) Windows service integration: agent runs under the Service Control Manager via golang.org/x/sys/windows/svc; install/uninstall/start/stop wired up.
  • P2-17 (M) install.ps1 (Windows): downloads agent, installs as service, enrolls; detects existing scheduled tasks named *restic* and prints them for manual review.
  • P2-18 (L) Announce-and-approve enrollment (second enrollment mode, alongside the token flow that ships in Phase 1):
    • Agent run with no RM_TOKEN generates a local Ed25519 keypair (persisted alongside the encrypted secrets blob), then POST /api/agents/announce with {hostname, os, arch, agent_version, restic_version, public_key}. Server stores a pending_hosts row (public_key, fingerprint = sha256(public_key), announced_from_ip, first_seen_at, last_seen_at, expires_at = now+1h). Hostname collisions with existing or other pending rows are flagged in the response so the install script can warn loudly on the endpoint terminal.
    • Agent then opens a long-poll/WS to /ws/agent/pending authenticated by signing a server-issued nonce with its private key — proves possession of the key tied to the pending row. Connection stays open; agent waits.
    • Install script prints the fingerprint on the endpoint's terminal in a copy-friendly form (e.g. SHA256:ab12…cd34) and tells the operator to compare it to the one shown in the UI before clicking accept.
    • UI: new "Pending hosts" panel on the dashboard. Admin sees fingerprint, hostname, source IP, OS/arch, time announced. Buttons: Accept (mints persistent bearer + repo creds, pushes both down the open pending socket, promotes pending row → real Host row, audit-logged) / Reject (deletes pending row, closes the socket with a clean error). Fingerprint is the load-bearing field — UI must make comparison easy (large monospace, one-click copy).
    • Server-side guards: per-source-IP rate limit on /api/agents/announce (token-bucket, e.g. 10/min); global cap on pending rows (e.g. 100); pending rows auto-expire after 1h; duplicate-hostname pending rows allowed but visually flagged in UI; accepting one does not auto-reject the others (admin sees them all and decides — defends against the "attacker announces first, real host second" race).
    • Token-based enrollment (Phase 1) remains the default and is unchanged; announce-and-approve is opt-in for interactive installs. Docs explicitly call out that the fingerprint comparison step is what makes this flow safe — without it, this is no better than trusting hostname over the wire.
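Two pieces of the announce-and-approve flow in P2-18 are easy to show concretely: the fingerprint (sha256 of the public key) and the proof of possession (signing a server-issued nonce). The sketch below uses crypto/ed25519 from the standard library. Hex encoding of the fingerprint is an assumption based on the SHA256:ab12…cd34 example, and the function names are hypothetical.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// newKeypair mints the agent's local Ed25519 identity.
func newKeypair() (ed25519.PublicKey, ed25519.PrivateKey, error) {
	return ed25519.GenerateKey(rand.Reader)
}

// fingerprint renders sha256(public_key) in a copy-friendly form.
// Hex output is an assumption; the task list only shows "SHA256:ab12…cd34".
func fingerprint(pub ed25519.PublicKey) string {
	sum := sha256.Sum256(pub)
	return "SHA256:" + hex.EncodeToString(sum[:])
}

// proveNonce is the agent side: sign the nonce the server issued.
func proveNonce(priv ed25519.PrivateKey, nonce []byte) []byte {
	return ed25519.Sign(priv, nonce)
}

// verifyNonce is the server side: check the signature against the public
// key stored on the pending row.
func verifyNonce(pub ed25519.PublicKey, nonce, sig []byte) bool {
	return ed25519.Verify(pub, nonce, sig)
}

func main() {
	pub, priv, err := newKeypair()
	if err != nil {
		panic(err)
	}
	nonce := make([]byte, 32)
	if _, err := rand.Read(nonce); err != nil {
		panic(err)
	}
	sig := proveNonce(priv, nonce)
	fmt.Println("fingerprint:", fingerprint(pub))
	fmt.Println("possession proven:", verifyNonce(pub, nonce, sig))
}
```

The signature only proves the socket belongs to whoever holds the announced key. It says nothing about whether that key belongs to the intended machine, which is why the operator's fingerprint comparison remains the gate.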

Phase 2 acceptance

  • A host can be onboarded end-to-end with no manual REST: enrol → auto-init runs → operator opens host → creates source group(s) → attaches them to one or more schedules → schedule fires on time → backup runs against the right paths with the right retention → snapshots tagged by group name appear in UI.
  • Operator can hit Run-now per source group from any of: dashboard row (single-group host), source group row, snapshot empty-state.
  • Server-side maintenance ticker drives forget/prune/check at the configured cadences, independent of agent cron. Offline hosts queue to pending_runs and drain on reconnect.
  • Pre/post hooks fire correctly per source group, fail loudly on pre_hook errors, run post_hook with RM_JOB_STATUS. Rejected on non-backup kinds.
  • Bandwidth limits honoured (host-wide default + per-run override).
  • A Windows host can enrol, appear in the dashboard, and run a backup with live log streaming.
  • A Linux host can enrol via announce-and-approve, with fingerprint-comparison gate enforced. Rate-limit + pending-cap guards verified.

Phase 3 — Restore, alerts, audit

  • P3-01 (L) Restore wizard backend: snapshot tree browse via restic ls --json, path picker, target selection
  • P3-02 (L) Restore wizard UI (multi-step: host → snapshot → paths → target → confirm)
  • P3-03 (M) Restore execution: restic restore invocation, progress streaming
  • P3-04 (L) Cross-host restore: target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root
  • P3-05 (M) Alert engine: rule evaluation loop (failed backup, stale schedule, agent offline, check failed)
  • P3-06 (M) Notification channels: webhook, ntfy, SMTP email
  • P3-07 (S) Alert UI: list, acknowledge, resolve
  • P3-08 (S) Audit log UI with filters (user, action, target, time range)
  • P3-09 (S) diff between two snapshots in UI

Phase 3 acceptance

  • A file deleted on a host can be restored from the UI in under 2 minutes. A failed backup raises an alert via the configured channel within 60s.

Phase 4 — Update delivery, RBAC polish, OIDC

  • P4-01 (M) Update delivery via OS package managers — host an apt repo (Linux) and Chocolatey package (Windows) on gitea releases. restic-manager-agent update is a thin wrapper over apt-get install --only-upgrade restic-manager-agent / choco upgrade. Trades flexibility for a much smaller security surface than bespoke signed binaries (see spec.md §4.2)
  • P4-02 (M) Agent version reporting on dashboard: surface "agent N versions behind server"; "update all" admin action calls the package-manager wrapper on each host
  • P4-03 (M) RBAC enforcement at API layer (admin / operator / viewer)
  • P4-04 (S) User management UI (create/edit/disable, role assignment, password reset)
  • P4-05 (L) OIDC login (generic provider config, group → role mapping)
  • P4-06 (M) Repo size trend graphs (sparkline on host card, full chart on repo page)
  • P4-07 (S) Per-host tags + dashboard filtering by tag
  • P4-08 (M) Prometheus /metrics endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list
  • P4-09 (S) Document Prometheus integration + sample Grafana dashboard JSON

Phase 4 acceptance

  • Non-admin users see an appropriately limited UI. Agents upgrade via apt/choco with one admin-triggered action. OIDC login works against at least one provider (Authelia or Authentik). Prometheus can scrape /metrics and the sample Grafana dashboard renders with live data.

Phase 5 — OSS readiness

  • P5-01 (M) Documentation site (mdBook or similar) with install, concepts, security model, screenshots
  • P5-02 (S) CONTRIBUTING.md, CODE_OF_CONDUCT.md, issue + PR templates
  • P5-03 (S) Release automation: goreleaser for binaries + Docker image to GHCR
  • P5-04 (S) Demo screenshots / short Loom walkthrough in README
  • P5-05 (S) SECURITY.md with disclosure process
  • P5-06 (M) End-to-end test suite in CI (Playwright vs. compose stack with sibling Linux agent)
  • P5-07 (S) Reference deployment: docker-compose.yml + Caddyfile snippet showing the TLS-terminating reverse proxy in front of the HTTP-only server (also demonstrates RM_TRUSTED_PROXY)

Phase 5 acceptance

  • A stranger can read the docs and stand up a working install in under 30 minutes.

Cross-cutting / ongoing

  • X-01 Keep CHANGELOG.md updated (Keep-a-Changelog format)
  • X-02 Track restic version compatibility matrix
  • X-03 Periodic dependency updates (dependabot or renovate)
  • X-04 Threat-model review at end of each phase
  • X-05 Proper first-run onboarding UI: admin shouldn't need to curl /api/bootstrap by hand. Render the bootstrap form on the same login page (extra "setup token" field shown only while no admin user exists, hidden after); on submit POST to /api/bootstrap, then drop straight into a session. Surface the one-time token from the server log somewhere copy-able (or print a clickable URL with the token in the query string at first-run). Also: relax the 12-char password floor for the first-run path or document it in the form so admin doesn't silently fail validation.