steve/restic-manager

steve c5b884a22b

tasks: tick P3-05/06/07 + Playwright sweep notes

Sweep against the live smoke env confirmed the alerts subsystem
end-to-end: three channels (webhook → local sink, ntfy → ntfy.sh,
SMTP → MailHog) created and verified via the Test button; synthetic
critical raised; ack + resolve fan out alert.acknowledged /
alert.resolved across all three; dashboard banner appears and
clears; nav badge tracks open count.

Three real bugs found and fixed mid-sweep — see preceding three
commits for the full reasoning.

2026-05-04 21:01:34 +01:00

restic-manager — Tasks

Tasks are grouped by phase. Each task has an ID for cross-referencing, an estimated size (S/M/L), and acceptance criteria.

Sizes: S = under a day, M = 1–3 days, L = 3–7 days.

Phase 0 — Project bootstrap

P0-01 (S) Initialize Go module, cmd/server, cmd/agent, baseline internal/ packages
P0-02 (S) Add LICENSE (PolyForm Noncommercial 1.0.0), README stub, CONTRIBUTING placeholder
P0-03 (S) Set up golangci-lint, gofumpt, goimports; pre-commit config
P0-04 (S) ~~GitHub Actions~~ Gitea Actions: build matrix (linux amd64/arm64, windows amd64), unit tests, lint
P0-05 (S) Dockerfile.server (multi-stage, distroless), deploy/docker-compose.yml
P0-06 (S) Makefile / ~~taskfile.yml~~ with common targets (build, test, run, release)

Phase 1 — MVP: enrollment, visibility, on-demand backup

Server foundations

P1-01 (M) HTTP server scaffolding (chi, structured logging via slog, graceful shutdown)
P1-02 (M) SQLite store layer (modernc.org/sqlite) + migrations (hand-rolled, embed.FS)
P1-03 (M) Schema for users, sessions, hosts, repos, credentials, jobs, job_logs, snapshots, audit_log
[~] P1-04 (M) Auth: argon2id password hashing, login/logout, session cookies; CSRF middleware deferred to P1-23 (UI work) — REST clients use bearer/session-only flows
P1-05 (S) First-run admin bootstrap (printed one-time setup token in server logs)
P1-06 (M) Secret encryption helper (AEAD with key from RM_SECRET_KEY_FILE)
[~] P1-07 (M) Audit log writer; middleware sweep for every state-changing endpoint lands when the rest of the API surface does — login / bootstrap / host.enrolled / job.run_now currently audited
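P1-06 only says "AEAD with key from RM_SECRET_KEY_FILE". A minimal sketch of such a helper, assuming AES-256-GCM with a random nonce prepended to the ciphertext (the cipher choice, blob layout, and the names seal/open are assumptions, not the project's actual crypto.AEAD):

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD; a 32-byte key selects AES-256.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext and returns nonce||ciphertext. The associated data
// (ad) is authenticated but not stored in the blob.
func seal(key, plaintext, ad []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Passing nonce as dst appends the ciphertext after it.
	return gcm.Seal(nonce, nonce, plaintext, ad), nil
}

// open reverses seal; it fails if the key, blob, or ad do not match.
func open(key, blob, ad []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	n := gcm.NonceSize()
	if len(blob) < n {
		return nil, errors.New("blob shorter than nonce")
	}
	return gcm.Open(nil, blob[:n], blob[n:], ad)
}

func main() {
	// Demo key only; a real server would read it from the key file.
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	blob, err := seal(key, []byte("repo-password"), []byte("host_credentials:42"))
	if err != nil {
		panic(err)
	}
	pt, err := open(key, blob, []byte("host_credentials:42"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d-byte blob -> %q\n", len(blob), pt)
}
```

Binding a per-row string as associated data means a blob copied onto a different row fails to decrypt, which is the same idea as the per-slot AD bytes mentioned under P2R-10.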

Agent ↔ server protocol

P1-08 (M) Define shared API types in internal/api (envelopes, every WS message + protocol_version constants; JSON-shape tests pin the wire)
P1-09 (L) WebSocket transport (github.com/coder/websocket), framed JSON envelopes, RPC correlation IDs, exponential-backoff reconnect on the agent side
P1-10 (M) Enrollment flow: POST /api/agents/enroll with one-time token → returns persistent bearer. Cert pin field stays in the response shape but is left empty: the server is HTTP-only behind a reverse proxy, so the operator pastes the proxy's cert hash into the install command rather than having the server introspect a cert it doesn't terminate.
P1-11 (M) Agent registration on connect (hello upserts agent_version/restic_version/protocol_version, flips status online, protocol_too_old rejection has clean error envelope)
P1-12 (S) Heartbeat handler (touches last_seen_at; background sweeper marks hosts offline after 90s without one)
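The envelope and handshake items (P1-08, P1-09, P1-11) can be sketched roughly as follows. The field names, the "hello.ack" and "error" message types, the integer protocol_version, and the minProtocolVersion floor are all assumptions for illustration; only protocol_version, correlation IDs, and the protocol_too_old code come from the task list.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// minProtocolVersion is a hypothetical floor, not the project's real constant.
const minProtocolVersion = 2

// Envelope is an assumed wire frame: a type tag, an RPC correlation ID,
// and a payload left undecoded until the type is known.
type Envelope struct {
	Type    string          `json:"type"`
	ID      string          `json:"id,omitempty"`
	Payload json.RawMessage `json:"payload,omitempty"`
}

type Hello struct {
	ProtocolVersion int `json:"protocol_version"`
}

type ErrorPayload struct {
	Code    string `json:"code"`
	Message string `json:"message"`
}

// handleHello decodes a hello frame and returns the reply envelope.
// An old agent gets a structured error instead of a JSON parse failure.
func handleHello(frame []byte) (Envelope, error) {
	var env Envelope
	if err := json.Unmarshal(frame, &env); err != nil {
		return Envelope{}, err
	}
	if env.Type != "hello" {
		return Envelope{}, fmt.Errorf("unexpected message type %q", env.Type)
	}
	var h Hello
	if err := json.Unmarshal(env.Payload, &h); err != nil {
		return Envelope{}, err
	}
	if h.ProtocolVersion < minProtocolVersion {
		p, err := json.Marshal(ErrorPayload{
			Code:    "protocol_too_old",
			Message: fmt.Sprintf("agent speaks v%d, server needs v%d or newer", h.ProtocolVersion, minProtocolVersion),
		})
		if err != nil {
			return Envelope{}, err
		}
		return Envelope{Type: "error", ID: env.ID, Payload: p}, nil
	}
	return Envelope{Type: "hello.ack", ID: env.ID}, nil
}

// errorCode extracts the code from an error envelope ("" if there is none).
func errorCode(e Envelope) string {
	var p ErrorPayload
	if json.Unmarshal(e.Payload, &p) != nil {
		return ""
	}
	return p.Code
}

func main() {
	reply, err := handleHello([]byte(`{"type":"hello","id":"a1","payload":{"protocol_version":1}}`))
	if err != nil {
		panic(err)
	}
	out, _ := json.Marshal(reply)
	fmt.Println(string(out))
}
```

Echoing the request ID on the reply is what lets the agent match the error to its own hello rather than guessing from ordering.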

Agent foundations

P1-13 (M) Agent config file (/etc/restic-manager/agent.yaml, atomic save: tmp+fsync+rename); Windows path deferred to Phase 2
P1-14 (M) Service integration: systemd unit (sandboxed: NoNewPrivileges, Protect*, MemoryDenyWriteExecute) — Linux only in Phase 1
P1-15 (M) Outbound WS client (github.com/coder/websocket) with reconnect (1s → 60s exponential + jitter), optional cert pin (sha-256 of leaf), heartbeats, protocol_version in hello
P1-16 (M) Restic wrapper: locate via PATH or override, run with --json, scan stdout/stderr, parse BackupStatus + BackupSummary, exit-code 3 treated as success-with-issues
P1-17 (S) Host metadata collection (OS, arch, hostname, restic version, agent version, protocol_version)
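The reconnect schedule in P1-15 (1s → 60s exponential + jitter) could look like the sketch below. The jitter shape is not specified in the task, so drawing uniformly from the upper half of the delay is an assumption, as are the function names.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	baseDelay = time.Second
	maxDelay  = 60 * time.Second
)

// backoff returns the un-jittered delay for a zero-based attempt number:
// 1s, 2s, 4s, ... capped at 60s.
func backoff(attempt int) time.Duration {
	d := baseDelay
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

// withJitter picks a delay uniformly in [d/2, d] so that a fleet of agents
// does not reconnect in lockstep after a server restart.
func withJitter(d time.Duration, r *rand.Rand) time.Duration {
	half := d / 2
	return half + time.Duration(r.Int63n(int64(half)+1))
}

// jittered is a convenience wrapper with a caller-supplied seed.
func jittered(d time.Duration, seed int64) time.Duration {
	return withJitter(d, rand.New(rand.NewSource(seed)))
}

func main() {
	r := rand.New(rand.NewSource(time.Now().UnixNano()))
	for attempt := 0; attempt < 8; attempt++ {
		d := backoff(attempt)
		fmt.Printf("attempt %d: base %v, sleeping %v\n", attempt, d, withJitter(d, r))
	}
}
```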

Run-now backup

P1-18 (L) Job lifecycle: queued → running → succeeded/failed/cancelled, persisted with stats blob and error
P1-19 (M) Server endpoint POST /api/hosts/{id}/jobs to dispatch a backup command (validates kind, checks online, audit-logs)
P1-20 (M) Agent executes restic backup, streams stdout/stderr + parsed JSON events back as job.progress (1Hz throttle) / log.stream
[~] P1-21 (M) Server persists log stream to job_logs ✓; WS /api/jobs/{id}/stream for live browser tailing still TODO — needs the per-job fan-out hub
P1-22 (S) Snapshot listing: agent calls restic snapshots --json after each successful backup and ships the projection over snapshots.report. Server ReplaceHostSnapshots atomically swaps the per-host list and updates hosts.snapshot_count in the same tx. Read endpoint: GET /api/hosts/{id}/snapshots. Tests cover round-trip, authoritative-replace semantics, and empty-after-prune. Schema dropped the unused repo_id FK from snapshots (repos as a first-class entity is P2 work).
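The job lifecycle from P1-18 can be expressed as a small transition table. This is a sketch: the status strings follow the task text, but whether a queued job may go straight to failed or cancelled is an assumption made here.

```go
package main

import "fmt"

type JobStatus string

const (
	Queued    JobStatus = "queued"
	Running   JobStatus = "running"
	Succeeded JobStatus = "succeeded"
	Failed    JobStatus = "failed"
	Cancelled JobStatus = "cancelled"
)

// allowed lists the legal next states. Terminal states have no entry.
// Assumption: a queued job can fail or be cancelled before it ever runs.
var allowed = map[JobStatus][]JobStatus{
	Queued:  {Running, Failed, Cancelled},
	Running: {Succeeded, Failed, Cancelled},
}

// canTransition reports whether a job may move from one status to another.
func canTransition(from, to JobStatus) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

// isTerminal reports whether no further transitions are possible.
func isTerminal(s JobStatus) bool {
	return len(allowed[s]) == 0
}

func main() {
	fmt.Println(canTransition(Queued, Running))    // legal
	fmt.Println(canTransition(Succeeded, Running)) // illegal: succeeded is terminal
}
```

Checking the transition in the store layer, inside the same transaction that writes the stats blob and error, keeps a late job.finished from overwriting a job that was already cancelled.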

UI (HTMX + Tailwind)

P1-23 (M) Base layout, login page, session-aware nav
P1-24 (M) Dashboard: fleet summary tiles + host table (status dot + row accent + os/arch + last backup + repo size + snapshots + alerts + tags + run-now). Backed by GET /api/hosts + GET /api/fleet/summary (JSON) and a server-rendered HTML view. Empty state hands the operator the install command. HTMX Run now button posts to /hosts/{id}/run-backup.
P1-25 (M) Host detail page (/hosts/{id}): persistent header (status dot + mono name + tags + OS/arch/agent/restic/last-seen), vitals strip (last backup / repo size / snapshots / open alerts), sub-tabs (Snapshots active; Jobs/Repo/Settings tabs visible but inert until P2), snapshot table (cap 50, pagination later), right-rail run-now stack (backup live; forget/prune/check/unlock disabled with P2 hints) and a danger-zone delete panel.
P1-26 (M) Live job log viewer + WS browser fan-out hub (closes the P1-21 remainder). Browser opens /api/jobs/{id}/stream; agent-emitted job.started/job.progress/log.stream/job.finished are mirrored to subscribers. Per-subscriber buffered channel + non-blocking broadcast keeps a slow browser from blocking the agent's read loop. Page renders running / succeeded / failed states; auto-scrolls until the operator scrolls up; reloads on job.finished to show the final header. "Run now" sets HX-Redirect so the operator lands on the live log.
[~] P1-27 (M) "Add host" flow: form takes hostname + repo URL/username/password, mints token (TTL 1h), re-renders the same page in result-state with the install command (RM_SERVER + RM_TOKEN filled in), copy button, and an awaiting-agent panel. Encrypted repo creds ride on the token row (P1-32) and get pushed to the agent on first WS connect (P1-33). Deferred: one-click "download preconfigured installer" install-<hostname>.sh (cf. UrBackup Internet-mode push installer) — copy-paste covers it for v1.
P1-28 (S) Tailwind build via tailwindcss standalone binary (no Node) — Makefile downloads pinned v3.4.17 into bin/tailwindcss, builds web/styles/input.css → web/static/css/styles.css, embedded into the binary via web.FS. make build runs Tailwind first.
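The fan-out described in P1-26 (per-subscriber buffered channel plus non-blocking broadcast) might be sketched like this. The Hub type and its method names are hypothetical; the point is the select with a default branch, which drops a message for a full subscriber instead of stalling the sender.

```go
package main

import (
	"fmt"
	"sync"
)

// Hub maps a job ID to the set of browser subscribers tailing it.
type Hub struct {
	mu   sync.Mutex
	subs map[string]map[chan []byte]struct{}
}

func NewHub() *Hub {
	return &Hub{subs: make(map[string]map[chan []byte]struct{})}
}

// Subscribe registers a buffered channel for jobID and returns it together
// with an idempotent cancel function that unregisters and closes it.
func (h *Hub) Subscribe(jobID string, buf int) (<-chan []byte, func()) {
	ch := make(chan []byte, buf)
	h.mu.Lock()
	if h.subs[jobID] == nil {
		h.subs[jobID] = make(map[chan []byte]struct{})
	}
	h.subs[jobID][ch] = struct{}{}
	h.mu.Unlock()

	cancel := func() {
		h.mu.Lock()
		defer h.mu.Unlock()
		if _, ok := h.subs[jobID][ch]; ok {
			delete(h.subs[jobID], ch)
			close(ch) // safe: Broadcast sends only while holding h.mu
		}
	}
	return ch, cancel
}

// Broadcast offers msg to every subscriber of jobID without blocking.
// It returns how many subscribers were full and therefore skipped.
func (h *Hub) Broadcast(jobID string, msg []byte) (dropped int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for ch := range h.subs[jobID] {
		select {
		case ch <- msg:
		default:
			dropped++ // slow browser: drop rather than block the agent read loop
		}
	}
	return dropped
}

func main() {
	h := NewHub()
	ch, cancel := h.Subscribe("job-1", 2)
	defer cancel()
	for i := 0; i < 3; i++ {
		fmt.Println("dropped:", h.Broadcast("job-1", []byte(fmt.Sprintf("line %d", i))))
	}
	fmt.Println("buffered:", len(ch))
}
```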

Install scripts

P1-29 (M) install.sh (Linux): detects arch, downloads agent, installs systemd unit, enrolls. Detects existing restic timers / /etc/cron.{d,daily,hourly,weekly}/* / root crontab and prints them with the exact disable commands — does not auto-disable
[~] P1-31 (S) Server endpoint to serve agent binaries + install scripts ✓ (/agent/binary + /install/*); signature verification deferred to Phase 5 OSS readiness

Repo credentials (pulled forward from Phase 2)

P1-32 (M) Server-side encrypted repo creds carried on the enrollment token:
- POST /api/enrollment-tokens body grows repo_url, repo_username, repo_password (all required).
- Token row stores them as one AEAD-encrypted blob (existing crypto.AEAD); ConsumeEnrollmentToken moves the blob to a new host_credentials row keyed by host_id in the same tx.
- PUT /api/hosts/{id}/repo-credentials (admin/operator) re-encrypts and replaces the row, emits an in-memory event to the WS hub.
- GET /api/hosts/{id}/repo-credentials returns the redacted view (URL + username + has_password) so the UI can pre-fill the edit form. Password never leaves the server outside the WS push.
- On WS hello, server pushes a config.update with decrypted creds before returning the connection to idle. Same path on edit-while-connected.
- Audit-logged on create / consume / edit; payload omits the secret material.
P1-33 (M) Agent-side encrypted secrets store:
- New internal/agent/secrets package: AEAD blob at /var/lib/restic-manager/secrets.enc, atomic write (tmp+fsync+rename, mode 0600).
- Per-host 32-byte secrets key minted at enrollment, persisted in agent.yaml (already 0600 root-only — same trust boundary as the bearer; explicit comment in the file).
- Strip repo_url / repo_password from agent.config.Config. Agent loads creds from secrets.enc at startup; config.update handler writes through to the file.
- Dispatcher reads from the secrets store on every job rather than from in-memory config.
- Migration path: if agent.yaml still contains repo_url/repo_password, copy them into secrets.enc on next start, then strip from the YAML on save.
P1-34 (S) End-to-end smoke runbook: docs/e2e-smoke.md walks through enrollment with repo creds → agent receives them via push-on-connect → run-now backup completes against a real restic/rest-server in a sibling container → host appears with snapshot count. Test-driven version (Playwright + compose) deferred to P5-06.
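The atomic write used for agent.yaml (P1-13) and secrets.enc (P1-33) follows the tmp+fsync+rename pattern. A minimal sketch, with writeFileAtomic and roundTrip as hypothetical names:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temp file in the target's directory,
// fsyncs it, then renames it over path. A reader sees either the old file
// or the complete new one, never a partial write.
func writeFileAtomic(path string, data []byte, mode os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	// Best-effort cleanup on failure; a no-op once the rename has succeeded.
	defer os.Remove(tmpName)

	if err := tmp.Chmod(mode); err != nil {
		tmp.Close()
		return err
	}
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmpName, path)
}

// roundTrip writes data into a scratch directory and reads it back,
// returning the contents and the resulting permission bits.
func roundTrip(data []byte) (string, os.FileMode, error) {
	dir, err := os.MkdirTemp("", "secrets-demo-")
	if err != nil {
		return "", 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "secrets.enc")
	if err := writeFileAtomic(path, data, 0o600); err != nil {
		return "", 0, err
	}
	got, err := os.ReadFile(path)
	if err != nil {
		return "", 0, err
	}
	info, err := os.Stat(path)
	if err != nil {
		return "", 0, err
	}
	return string(got), info.Mode().Perm(), nil
}

func main() {
	got, perm, err := roundTrip([]byte("encrypted-blob"))
	fmt.Println(got, perm, err)
}
```

The temp file has to live in the same directory as the target, because a rename is only atomic within one filesystem.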

Phase 1 acceptance

One Linux host can enroll, appear in the dashboard, and a backup can be triggered from the UI with live log streaming. Snapshots list updates after success.
Windows binary builds cleanly in CI (.gitea/workflows/ci.yml) but is not service-tested or installer-shipped in Phase 1 — that lands in Phase 2 (P2-16, P2-17).
Agent ↔ server protocol_version handshake rejects mismatched versions with a clear error rather than failing on JSON parse.
Repo credentials never appear in plaintext on disk: server stores them AEAD-encrypted, agent stores them AEAD-encrypted, the wire carries them only inside the authenticated WS as config.update.

Phase 2 — Scheduling, retention, repo operations

Mid-phase pivot — "P2 redesign" (commits 7a7cac5, 666af41, 5667cdf). The original P2 plan put paths/excludes/retention/manual/kind/options on Schedule and one repo per host. After landing P2-01..P2-05 against that shape, the data model was rewritten: schedules are slim (cron + which source_groups); paths/excludes/retention/retry live on source_group (also doubles as the snapshot tag); forget/prune/check cadences live on host_repo_maintenance and run on a server-side ticker, not the agent cron; pending_runs queues offline retries; host.repo_initialised_at is gone (auto-init at enrolment). The redesign is captured below as P2R-NN items. Items P2-01..P2-05 stay marked done because the work shipped, but they're labelled ⚠️ shipped against old shape — behaviour to be re-validated under P2R-02 after UI rewire. P2-04.5 (manual flag) is dropped wholesale. P2-06..P2-15 are reframed below to point at their new homes; P2-16/17/18 are unaffected by the redesign.

Original P2 work — shipped (against pre-redesign shape)

⚠️ P2-01 (M) Schedule schema + CRUD API (fat-schedule shape) — superseded by P2R-01.
⚠️ P2-02 (L) Server-pushed schedule reconciliation (fat-schedule shape) — re-validate under P2R-01.
⚠️ P2-03 (M) Agent local scheduler (internal/agent/scheduler, robfig/cron/v3, schedule.fire envelope, dispatchScheduledJob). The cron loop + ack/fire round-trip stay; the payload it carries reshapes under P2R-01.
⚠️ P2-04 (M) Schedule editor UI (fat-schedule form: paths/excludes/tags/retention/bandwidth on the schedule itself) — superseded by P2R-02.
P2-04.5 Manual schedules / kill host.default_paths — superseded; the manual flag concept is gone, Run-now lives on source groups (see P2R-01 / P2R-02).
⚠️ P2-05 (M) forget command with retention policy. Wire payload (CommandRunPayload.retention_policy) and restic wrapper (restic.ForgetPolicy, RunForget) are still correct; what changes under P2R-03 is where retention comes from (source_group, not schedule) and who dispatches (server-side maintenance ticker for cadence; per-source-group Run-now for ad-hoc).
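For P2-05, the translation from a retention policy to a restic forget invocation could be sketched as below. The RetentionPolicy field names are hypothetical (the real restic.ForgetPolicy is not shown in this file), and scoping by --tag assumes the source-group name is used as the snapshot tag, as described in the redesign note. The --keep-* flags themselves are standard restic options.

```go
package main

import (
	"fmt"
	"strconv"
)

// RetentionPolicy holds one count per restic keep-* bucket.
// Field names are illustrative; zero means "omit the flag".
type RetentionPolicy struct {
	KeepLast    int
	KeepHourly  int
	KeepDaily   int
	KeepWeekly  int
	KeepMonthly int
	KeepYearly  int
}

// forgetArgs builds the argv (after the restic binary itself) for a
// forget run limited to snapshots carrying the given tag.
func forgetArgs(tag string, p RetentionPolicy) []string {
	args := []string{"forget", "--json", "--tag", tag}
	add := func(flag string, n int) {
		if n > 0 {
			args = append(args, flag, strconv.Itoa(n))
		}
	}
	add("--keep-last", p.KeepLast)
	add("--keep-hourly", p.KeepHourly)
	add("--keep-daily", p.KeepDaily)
	add("--keep-weekly", p.KeepWeekly)
	add("--keep-monthly", p.KeepMonthly)
	add("--keep-yearly", p.KeepYearly)
	return args
}

func main() {
	fmt.Println(forgetArgs("etc", RetentionPolicy{KeepDaily: 7, KeepWeekly: 4}))
}
```

A caller would likely want to refuse an all-zero policy before dispatching, since a forget with no keep rule has nothing to apply.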

P2 redesign — Phase 1 ✅

P2R-00.1 (M) Migration 0008 — sources + repo maintenance. Adds source_groups, schedule_source_groups junction, host_repo_maintenance, pending_runs, host.bandwidth_up_kbps / bandwidth_down_kbps. Drops host.repo_initialised_at. Fat-schedule columns dropped from schedules. Column-level ALTERs only — no table rebuilds (FK cascade trap, see CLAUDE.md). Commit 7a7cac5.
P2R-00.2 (M) v4 wireframes for sources / schedules / repo. Hi-fi mocks of /hosts/{id}/sources, /sources/{gid}/edit (with retention-conflict banner), slim /schedules, /repo (connection / bandwidth / maintenance / re-init). Commit 666af41.

P2 redesign — Phase 2 ✅

P2R-00.3 (L) Go-side store rewrite against migration 0008. New types: SourceGroup, HostRepoMaintenance, PendingRun. Schedule slimmed to {id, host_id, cron, enabled, source_group_ids, timestamps}. RetentionPolicy moves from schedule field → source group field (type unchanged). Host loses RepoInitialisedAt, gains bandwidth caps. New files: store/sources.go, store/maintenance.go, store/pending.go. store/schedules.go rewritten for slim shape + junction CRUD. enrollment.go seeds a default source group + repo-maintenance row instead of a manual schedule. ws/handler.go drops MarkHostRepoInitialised. HTTP layer + UI templates temporarily 501-stubbed with redesign_in_progress — this is what P2R-01 / P2R-02 fill back in. Tests for the obsolete fat-schedule API deleted. Commit 5667cdf.
P2R-00.4 (S) Host-detail UI patched up enough to render: RepoInitialisedAt template refs removed, manual init-repo branches stripped, dead Schedules sub-tab demoted to inert div (matches Jobs/Repo/Settings), broken Run-now buttons disabled with P2-Phase-4 hints. Stop-gap until P2R-02 lands the real surface.

P2 redesign — Phase 3 (REST + WS rewire) ✅

P2R-01 (L) HTTP/WS layer against the slim shape:
- Schedules REST CRUD: GET|POST /api/hosts/{id}/schedules, PUT|DELETE /api/hosts/{id}/schedules/{sid}. Body shape is {cron, enabled, source_group_ids[]} — paths/excludes/retention/kind/manual all go away. Junction wiped + re-inserted on every update (per store.UpdateSchedule). Validation: cron parses via robfig/cron/v3; ≥1 source_group_ids; all referenced groups belong to the host.
- Source-groups REST CRUD: GET|POST /api/hosts/{id}/source-groups, GET|PUT|DELETE /api/hosts/{id}/source-groups/{gid}. Body: {name, includes[], excludes[], retention_policy, retry_max, retry_backoff_seconds}. Name uniqueness per host. Refuse delete if SchedulesUsingGroup(gid) is non-empty (return the schedule list so UI can show "remove from these schedules first"). Mutations bump host_schedule_version.
- Repo-maintenance REST: GET|PUT /api/hosts/{id}/repo-maintenance. Body: {forget_cadence, prune_cadence, check_cadence, check_subset_pct, enabled}. Server-side ticker drives execution (P2R-04), so updates here do not bump host_schedule_version.
- Per-source-group Run-now: POST /hosts/{id}/source-groups/{gid}/run. Reuses the existing dispatchScheduleNow-style path; agent receives a normal command.run carrying the resolved includes/excludes/retention from the group. This replaces the old per-host /hosts/{id}/run-backup endpoint (kept around as a 410-Gone with a hint pointing to source groups).
- schedule_push.go reconciliation: rebuild pushScheduleSet* to ship the new wire format (ScheduleSetPayload carries [{schedule_id, cron, enabled, source_groups: [{name, includes, excludes, retention, retry_*}]}] — agent doesn't need to know source_group_id, just the resolved bundle). On-hello + async-on-CRUD flavours; ack still persists applied_schedule_version.
- Auto-init at enrolment: server dispatches restic init on first WS connect (was P2-old "Init repo" button — now invisible to the operator). On success: emit a normal job row with kind=init so the audit trail still shows it. On init returning "config file already exists" (e.g. re-enrolment against an existing repo): treat as soft success per existing restic-wrapper behaviour.
- Tests: rewrite the deleted schedules_test.go and schedule_push_test.go against new endpoints; new source_groups_test.go, repo_maintenance_test.go, auto_init_test.go. End-to-end: enrol → server pushes creds → server dispatches init → agent runs it → schedule reconcile fires → operator hits per-source-group Run-now → backup runs → snapshots refresh.
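The schedule validation rules listed under P2R-01 (cron parses, at least one source group, every group belongs to the host) can be sketched as follows. The five-field check is only a stand-in for the real robfig/cron/v3 parse, and string IDs plus the validateSchedule name are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ScheduleBody mirrors the slim request body {cron, enabled, source_group_ids[]}.
type ScheduleBody struct {
	Cron           string   `json:"cron"`
	Enabled        bool     `json:"enabled"`
	SourceGroupIDs []string `json:"source_group_ids"`
}

// validateSchedule applies the three rules from the task list.
// hostGroupIDs is the set of source-group IDs owned by the host in the URL.
func validateSchedule(b ScheduleBody, hostGroupIDs map[string]bool) error {
	// Stand-in only: the real check hands the expression to the cron parser.
	if len(strings.Fields(b.Cron)) != 5 {
		return errors.New("cron: expected a five-field expression")
	}
	if len(b.SourceGroupIDs) == 0 {
		return errors.New("source_group_ids: at least one is required")
	}
	for _, id := range b.SourceGroupIDs {
		if !hostGroupIDs[id] {
			return fmt.Errorf("source_group_ids: %q does not belong to this host", id)
		}
	}
	return nil
}

func main() {
	owned := map[string]bool{"g1": true, "g2": true}
	fmt.Println(validateSchedule(ScheduleBody{Cron: "0 3 * * *", SourceGroupIDs: []string{"g1"}}, owned))
	fmt.Println(validateSchedule(ScheduleBody{Cron: "0 3 * * *", SourceGroupIDs: []string{"g9"}}, owned))
}
```

The ownership check matters because the group IDs arrive in the request body while the host ID comes from the URL; without it a schedule on one host could reference another host's paths.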

P2 redesign — Phase 4 (UI rewire, against v4 wireframes) ✅

Row-design rule (binding for every list-row template in this app, current and future): Whole-row click navigates to the row's primary detail/edit page — mirror .host-row.clickable on the dashboard (partials/host_row.html): an absolute-positioned .row-link overlay with text-indent: -9999px covers the row, action buttons live in .row-action cells that sit above via z-index. Do not add an explicit "Edit" button when the row is clickable — it duplicates the affordance and dilutes the click target. Action cells are reserved for verbs that aren't "open this row" (Run-now, Delete, Pause, etc).

P2R-02 (L) UI templates rebuilt against the new model:
- Slice 1 ✅ Sub-tab navigation skeleton — extract header/vitals/sub-tabs into a host_chrome partial; Sources / Schedules / Repo become real <a> links; placeholder pages share the chrome; version indicator restored. (commit a535822)
- Slice 2 ✅ Sources tab — /hosts/{id}/sources list with per-row meta + clickable rows + per-group Run-now/Delete; /sources/new and /sources/{gid}/edit form (name, includes/excludes, 3×2 keep-* grid, retry-on-offline, inline conflict banner from ConflictDimension cache); validation re-renders form with input intact; refuses to delete a host's last source group. (commits 0ed9c3d, dede74f)
- Slice 3 ✅ Schedules tab — /hosts/{id}/schedules slim list (status / cron / source-tags / actions, clickable rows) plus /schedules/new and /schedules/{sid}/edit form (cron with five quick-pick chips that have human-readable tooltips, source-group multi-pick as styled check cards, enabled toggle); per-schedule Run-now reuses dispatchScheduledJob for enabled schedules + bypasses the enabled check (with an HX-confirm) for paused ones; multi-group fires emit a success toast, single-group fires HX-Redirect to the live job log. (commit 67ca769 + follow-ups 64d2fcf, 8b91d30, 4035c44)
- Slice 4 ✅ /hosts/{id}/repo — three independent forms (connection: URL/user/password pre-filled from GET /api/hosts/{id}/repo-credentials redacted view; bandwidth: host-wide caps via new PUT /api/hosts/{id}/bandwidth; maintenance: forget/prune/check cadences + check subset %); danger-zone re-init button rendered + disabled (real flow lands in P2R-09); right-rail snapshots-by-tag breakdown. (commit d62b173)
- Slice 5 ✅ Dashboard row Run-now uses the single covering schedule when one exists ("Run all groups" primary button), otherwise falls back to "Open →" pointing at the Sources tab. Right-rail and empty-snapshots-state Run-now were rehomed to source-group context in slice 1. (commit fab99b4)
- Slice 6 ✅ Playwright sweep against the live :8080 server — login → walk every new tab → create source group → create schedule → Run-now → confirm a snapshot landed → end-to-end clean, no console errors. Screenshots in _diag/p2r-02-sweep/.
- Side-fix: agent runner drops noisy restic status events from log.stream (they were drowning the live log on short backups; the throttled job.progress envelope already covers the same data). (commit ffba737)
- Header "version N · agent in sync / agent at vM" indicator preserved across all tabs (backed by host_schedule_version + applied_schedule_version).
- Form validation re-renders with the operator's typed input intact (mirror P2-04's behaviour). Each save fires pushScheduleSetAsync so an online agent re-arms within seconds.

P2 redesign — Phase 5 ✅

Shipped on branch p2r-phase5-maintenance (PR #3). Plan: docs/superpowers/plans/2026-05-03-p2-redesign-phase-5.md.

P2R-03 (M) prune command end-to-end. Restic wrapper (restic.RunPrune), agent dispatcher (case api.JobPrune:), wire envelope. Admin-only credential: a second host_credentials row keyed by host_id + kind=admin carries the non-append-only username/password; server pushes it via config.update only when dispatching a prune job, and the agent's secrets store keeps it in a separate slot from the everyday append-only creds. UI: prune row on the Repo page. Operator-triggered Run-now via POST /hosts/{id}/repo/prune. Cadence-driven dispatch via the maintenance ticker (P2R-06).
P2R-04 (M) check command end-to-end (restic check --read-data-subset N%). Wrapper + dispatcher + wire. UI: check row on the Repo page (with the subset % slider). Operator Run-now via POST /hosts/{id}/repo/check. Cadence-driven dispatch via the maintenance ticker (P2R-06).
P2R-05 (S) unlock command end-to-end (restic unlock). Operator-only — no cadence. POST /hosts/{id}/repo/unlock. Repo page surfaces lock state from the most recent check (which warns about stale locks).
P2R-06 (M) Server-side maintenance ticker. Cron-style loop on the server reads host_repo_maintenance rows, dispatches forget / prune / check jobs against the right host on the configured cadence. Last-fire anchor is derived from the jobs table via LatestJobByKind (queued + running included so a long-running prune correctly suppresses the next tick). Independent of the agent's local cron — the agent's cron only handles backup schedules now. Skips offline hosts (logged, no queue — only scheduled backup fires queue, per P2R-08). Forget reshape: ships a multi-group ForgetGroups payload so one job fires N restic-forget invocations per tick.
P2R-07 (S) Repo stats panel on the Repo page: total size, raw size, last-check timestamp + status (color-coded), last-prune timestamp, stale-lock banner. Backed by restic stats --json --mode raw-data that the agent ships in a repo.stats envelope after every backup / check / prune / unlock; persisted via Store.UpsertHostRepoStats into a new host_repo_stats projection table.
P2R-08 (M) Pending-runs queue worker. Scheduled backup fires that race an agent disconnect queue to pending_runs. Drained on a 30s server-side tick and on agent reconnect (via onAgentHello); per-host TryLock mutex prevents the two paths double-dispatching the same row. Exponential backoff capped at 30 minutes; abandons rows that exceed the source-group's retry_max (audit-logged) or whose schedule/group has genuinely been deleted.
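The double-dispatch guard in P2R-08 relies on a per-host TryLock: whichever path (30s tick or agent reconnect) gets the lock drains, the other skips. A sketch using sync.Mutex.TryLock (Go 1.18+), with hostLocks and tryDrain as hypothetical names:

```go
package main

import (
	"fmt"
	"sync"
)

// hostLocks hands out one mutex per host ID, created on first use.
type hostLocks struct {
	mu sync.Mutex
	m  map[string]*sync.Mutex
}

func newHostLocks() *hostLocks {
	return &hostLocks{m: make(map[string]*sync.Mutex)}
}

func (l *hostLocks) lockFor(hostID string) *sync.Mutex {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.m[hostID] == nil {
		l.m[hostID] = &sync.Mutex{}
	}
	return l.m[hostID]
}

// tryDrain runs drain for hostID unless another caller is already draining
// that host, in which case it returns false immediately instead of waiting.
func (l *hostLocks) tryDrain(hostID string, drain func()) bool {
	m := l.lockFor(hostID)
	if !m.TryLock() {
		return false
	}
	defer m.Unlock()
	drain()
	return true
}

func main() {
	locks := newHostLocks()
	ranTick := locks.tryDrain("host-1", func() {
		// While the tick path holds the lock, the reconnect path backs off.
		ranHello := locks.tryDrain("host-1", func() {})
		fmt.Println("reconnect path ran:", ranHello)
	})
	fmt.Println("tick path ran:", ranTick)
}
```

Skipping rather than blocking is safe here because the losing path has nothing unique to do: the winner drains every due row for that host.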

P2 redesign — Phase 6 (auto-init follow-up) ✅

P2R-09 (S) Auto-init UX polish. Latest init job status surfaced under the host-detail vitals strip (succeeded/failed/running/queued, with link to the live job log on non-success). Danger-zone POST /hosts/{id}/repo/reinit dispatches a fresh init job after the operator types the host name to confirm; audit row records host.repo_reinit.

Pre/post hooks (rehomed onto source groups) ✅

P2R-10 (M) Hook schema: migration 0010 adds pre_hook/post_hook BLOB columns to source_groups and pre_hook_default/post_hook_default to hosts. Bytes stored verbatim — AEAD encrypt/decrypt at the HTTP layer (per-slot AD bytes). Round-trip tests cover set/clear semantics on both tables.
P2R-11 (M) Agent execution of hooks: runner.BackupHooks + runHook helper invoked via /bin/sh -c (cmd.exe /C on Windows). pre_hook non-zero exit aborts the backup; post_hook always runs with RM_JOB_STATUS=succeeded|failed in env. Output streamed as hook(<phase>): … log.stream lines. Hooks only run for kind=backup. Server side resolves group → host default → empty and ships plaintext on the WS payload (decrypt at HTTP layer).
P2R-12 (S) Hook editor UI: source-group edit form gains pre/post hook textareas with the service-user warning banner; bodies AEAD-encrypted on save (per-group AD). Repo page adds a host-default Hooks panel with the same shape; saved via POST /hosts/{id}/repo/hooks.
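The hook resolution and invocation rules from P2R-11 (group → host default → empty; /bin/sh -c, or cmd.exe /C on Windows; RM_JOB_STATUS in the post-hook environment) might be sketched like this. The helper names are hypothetical and the real runner.BackupHooks is not shown in this file.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"runtime"
	"time"
)

// resolveHook picks the group's hook if set, else the host default, else "".
func resolveHook(group, hostDefault string) string {
	if group != "" {
		return group
	}
	return hostDefault
}

// hookCommand returns the argv used to run a hook script on the given OS.
func hookCommand(goos, script string) []string {
	if goos == "windows" {
		return []string{"cmd.exe", "/C", script}
	}
	return []string{"/bin/sh", "-c", script}
}

// runHook executes the script through the platform shell and returns its
// combined output. extraEnv entries (e.g. "RM_JOB_STATUS=failed") are
// appended to the agent's own environment.
func runHook(ctx context.Context, script string, extraEnv ...string) ([]byte, error) {
	argv := hookCommand(runtime.GOOS, script)
	cmd := exec.CommandContext(ctx, argv[0], argv[1:]...)
	cmd.Env = append(os.Environ(), extraEnv...)
	return cmd.CombinedOutput()
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	script := resolveHook("", "echo host-default hook")
	out, err := runHook(ctx, script, "RM_JOB_STATUS=succeeded")
	fmt.Printf("%s(err=%v)\n", out, err)
}
```

A non-nil error from runHook on the pre phase is what the caller would treat as "abort the backup".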

Bandwidth + niceties (rehomed onto host + source groups) ✅

P2R-13 (S) Bandwidth limit fields. restic.Env gains LimitUploadKBps/LimitDownloadKBps, emitted as --limit-upload/--limit-download global flags before the subcommand on every invocation. Agent dispatcher tracks host-wide caps received via config.update; server pushes them on hello and after PUT /api/hosts/{id}/bandwidth. Per-job override on the per-source-group Run-now form (collapsed <details> "Limit bandwidth for this run" with two KB/s inputs); override wins over host caps.
P2R-14 (S) Schedule "next run" / "last run". New store.LatestJobBySchedule query. Schedules tab grows two columns (Next derived from cron via robfig/cron/v3.Parse(...).Next, Last from latest actor_kind=schedule job). Dashboard host row prepends next 12h ago/from now when a single covering schedule is the run-now candidate.
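The bandwidth handling in P2R-13 boils down to two steps: pick the effective caps, then emit them as global flags ahead of the subcommand. A sketch, assuming the override is applied per direction and that zero means "no limit" (both assumptions; Limits, effective, and resticArgs are hypothetical names):

```go
package main

import (
	"fmt"
	"strconv"
)

// Limits carries upload/download caps; zero means unlimited.
type Limits struct {
	UploadKBps   int
	DownloadKBps int
}

// effective applies a per-run override on top of the host-wide caps.
// Assumption: each direction is overridden independently.
func effective(host, override Limits) Limits {
	out := host
	if override.UploadKBps > 0 {
		out.UploadKBps = override.UploadKBps
	}
	if override.DownloadKBps > 0 {
		out.DownloadKBps = override.DownloadKBps
	}
	return out
}

// resticArgs places the limit flags before the subcommand, followed by the
// subcommand's own arguments.
func resticArgs(l Limits, subcommand string, rest ...string) []string {
	var args []string
	if l.UploadKBps > 0 {
		args = append(args, "--limit-upload", strconv.Itoa(l.UploadKBps))
	}
	if l.DownloadKBps > 0 {
		args = append(args, "--limit-download", strconv.Itoa(l.DownloadKBps))
	}
	args = append(args, subcommand)
	return append(args, rest...)
}

func main() {
	host := Limits{UploadKBps: 1024, DownloadKBps: 2048}
	run := Limits{UploadKBps: 512} // operator lowered upload for this run only
	fmt.Println(resticArgs(effective(host, run), "backup", "--json", "/etc"))
}
```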

Cross-platform + alt-enrolment ✅

P2-16 (M) Windows service integration: internal/agent/service (build-tagged) implements svc.Handler; new restic-manager-agent install|uninstall|start|stop|run subcommands wrap the SCM via golang.org/x/sys/windows/svc/mgr. Cross-compile verified (GOOS=windows GOARCH=amd64 go build ./cmd/agent); untested on Windows itself — Linux CI can't exercise the SCM round-trip.
P2-17 (M) install.ps1 (Windows): pwsh installer that detects arch, downloads $Server/agent/binary?os=windows&arch=amd64, runs the agent in -enroll-server (+ optional -enroll-token) mode (token flow OR announce-and-approve), then registers the service via restic-manager-agent install. Surfaces existing scheduled tasks named *restic* without disabling. Served by the existing GET /install/* handler; restage block in CLAUDE.md updated.
P2-18 (L) Announce-and-approve enrolment (second enrolment mode):
- Agent run with no RM_TOKEN generates a local Ed25519 keypair (persisted alongside the encrypted secrets blob), then POST /api/agents/announce with {hostname, os, arch, agent_version, restic_version, public_key}. Server stores a pending_hosts row (public_key, fingerprint = sha256(public_key), announced_from_ip, first_seen_at, last_seen_at, expires_at = now+1h). Hostname collisions with existing or other pending rows are flagged in the response so the install script can warn loudly on the endpoint terminal.
- Agent then opens a long-poll/WS to /ws/agent/pending authenticated by signing a server-issued nonce with its private key — proves possession of the key tied to the pending row. Connection stays open; agent waits.
- Install script prints the fingerprint on the endpoint's terminal in a copy-friendly form (e.g. SHA256:ab12…cd34) and tells the operator to compare it to the one shown in the UI before clicking accept.
- UI: new "Pending hosts" panel on the dashboard. Admin sees fingerprint, hostname, source IP, OS/arch, time announced. Buttons: Accept (mints persistent bearer + repo creds, pushes both down the open pending socket, promotes pending row → real Host row, audit-logged) / Reject (deletes pending row, closes the socket with a clean error). Fingerprint is the load-bearing field — UI must make comparison easy (large monospace, one-click copy).
- Server-side guards: per-source-IP rate limit on /api/agents/announce (token-bucket, e.g. 10/min); global cap on pending rows (e.g. 100); pending rows auto-expire after 1h; duplicate-hostname pending rows allowed but visually flagged in UI; accepting one does not auto-reject the others (admin sees them all and decides — defends against the "attacker announces first, real host second" race).
- Token-based enrollment (Phase 1) remains the default and is unchanged; announce-and-approve is opt-in for interactive installs. Docs explicitly call out that the fingerprint comparison step is what makes this flow safe — without it, this is no better than trusting hostname over the wire.
As shipped: migration 0011 + store/pending_hosts.go cover the table. POST /api/agents/announce (rate-limited 10/min/IP, global cap 100 in-flight rows) returns {pending_id, fingerprint, hostname_collision}. GET /ws/agent/pending runs the Ed25519 nonce-sign handshake. Admin POSTs to /api/pending-hosts/{id}/accept|reject (audit-logged as host.accept_pending/host.reject_pending). Dashboard panel renders the queue with a copyable fingerprint + inline accept form (URL/user/password). 60s server ticker sweeps expired rows. Agent: cmd/agent/announce.go mints + persists an Ed25519 keypair into agent.yaml's announce_key field; runs automatically when -enroll-server is supplied without -enroll-token. The install scripts haven't been updated to surface the printed fingerprint beyond the agent's own banner — the operator reads it from the install script's stdout.
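The two cryptographic steps in P2-18, the fingerprint (sha256 of the public key) and the nonce-sign proof of possession, can be sketched with the standard library. Hex encoding after the SHA256: prefix, the 32-byte nonce size, and all function names are assumptions; the task text does not fix them.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// newKeypair mints the agent's announce key.
func newKeypair() (ed25519.PublicKey, ed25519.PrivateKey, error) {
	return ed25519.GenerateKey(rand.Reader)
}

// fingerprint is what the operator compares between terminal and UI.
func fingerprint(pub ed25519.PublicKey) string {
	sum := sha256.Sum256(pub)
	return "SHA256:" + hex.EncodeToString(sum[:])
}

// newNonce is the server-issued challenge for the pending socket.
func newNonce() ([]byte, error) {
	nonce := make([]byte, 32)
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return nonce, nil
}

// signNonce runs on the agent.
func signNonce(priv ed25519.PrivateKey, nonce []byte) []byte {
	return ed25519.Sign(priv, nonce)
}

// verifyNonce runs on the server against the public key stored on the
// pending row. The length check guards ed25519.Verify, which panics on a
// malformed key.
func verifyNonce(pub ed25519.PublicKey, nonce, sig []byte) bool {
	if len(pub) != ed25519.PublicKeySize {
		return false
	}
	return ed25519.Verify(pub, nonce, sig)
}

func main() {
	pub, priv, err := newKeypair()
	if err != nil {
		panic(err)
	}
	nonce, err := newNonce()
	if err != nil {
		panic(err)
	}
	sig := signNonce(priv, nonce)
	fmt.Println(fingerprint(pub))
	fmt.Println("proof of possession:", verifyNonce(pub, nonce, sig))
}
```

A fresh nonce per connection keeps a captured signature from being replayed on a later pending socket.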

Phase 2 acceptance

A host can be onboarded end-to-end with no manual REST: enrol → auto-init runs → operator opens host → creates source group(s) → attaches them to one or more schedules → schedule fires on time → backup runs against the right paths with the right retention → snapshots tagged by group name appear in UI.
Operator can hit Run-now per source group from any of: dashboard row (single-group host), source group row, snapshot empty-state.
Server-side maintenance ticker drives forget/prune/check at the configured cadences, independent of agent cron. Scheduled backup fires that miss an offline host queue to pending_runs and drain on reconnect; maintenance jobs skip offline hosts without queueing.
Pre/post hooks fire correctly per source group, fail loudly on pre_hook errors, and run post_hook with RM_JOB_STATUS. Hooks are rejected on non-backup kinds.
Bandwidth limits honoured (host-wide default + per-run override).
A Windows host can enrol, appear in the dashboard, and run a backup with live log streaming. Not validated in CI: Linux runners cannot exercise the SCM round-trip; the service_windows.go/install.ps1 pieces compile cleanly under GOOS=windows GOARCH=amd64 but the first real Windows install will be the first end-to-end test.
A Linux host can enrol via announce-and-approve, with fingerprint-comparison gate enforced. Rate-limit + pending-cap guards verified.

Phase 3 — Restore, alerts, audit

Phase 3 is split into three independently-shippable sub-phases: Restore (P3-01..03 + P3-09 + P3-X1 cancel + P3-X2 tree-list RPC), Alerts (P3-05..07), Audit UI (P3-08). Each sub-phase has its own spec → plan → implement cycle; we hand back at sub-phase boundaries.

P3-04 (cross-host restore) was de-scoped during the Phase-3 brainstorm on 2026-05-04: disaster recovery is already covered by re-enrolling a replacement host with the same repo creds (snapshots reappear, restore is same-host). The remaining "pull a file from host A onto host C without giving C permanent access" use case is genuinely different and doesn't have a confirmed need yet, so it's moved to the Future / unscheduled section at the end of this file.

Phase 3 — Restore ✅

Spec: docs/superpowers/specs/2026-05-04-p3-restore-design.md. Wireframe: _diag/p3-restore-wizard/wireframe.html. Sweep screenshots: _diag/p3-restore-sweep/. Shipped on branch p3-restore.

P3-X1 (S) Cancel-job feature. command.cancel WS envelope; agent tracks per-job ctx.CancelFunc and kills the running restic subprocess via context cancel (SIGTERM, SIGKILL after 5s grace via cmd.Cancel + cmd.WaitDelay); server endpoint POST /api/jobs/{id}/cancel bridges UI → WS; the existing UI Cancel button on /jobs/{id} is now real for any running kind. Platform-aware: internal/restic/cancel_{unix,windows}.go build-tags pick SIGTERM on POSIX vs os.Kill on Windows (which can't deliver SIGTERM). Tests: cancel mid-run via 'sleep 30' fake-restic returns JobCancelled with exit 130 in <200ms.
P3-X2 (S) Tree-list synchronous WS RPC. MsgTreeList ↔ MsgTreeListResult with Envelope.ID correlation; generic Hub.SendRPC helper (registry of buffered channels keyed by ULID, ctx-cancel + timeout aware). internal/restic.ListTreeChildren wraps restic ls --json and filters its recursive output to direct children. Server-side treeCache is per-wizard-session (keyed by session cookie + host + snapshot + path) with a 30-min TTL and lazy sweep.
P3-01 (L) Restore wizard backend (internal/server/http/ui_restore.go). GET handlers render the four-step wizard against the wireframe. HTMX/fetch tree partial endpoint hits fetchTreeWithCache. POST validates: snapshot_id, ≥1 absolute path, in-place ⇒ confirm_hostname == host name, agent online; on error re-renders with operator's input intact. Happy path mints job_id, target = /var/lib/restic-manager/restore/<job-id> (server-picked, agent's writable dir under the systemd sandbox's ReadWritePaths), creates job row, ships command.run with RestorePayload, writes host.restore audit row, returns HX-Redirect (or 303) to the live job page.
P3-02 (L) Wizard UI templates (web/templates/pages/host_restore.html + partials/tree_node.html). Single-page progressively-enabled four-step form. Form-state-driven JS computes a running tally + step-4 confirm summary client-side. Tree expansion uses plain fetch (not HTMX) for simpler target lookup; loaded-state cached per node. Top-level Restore button on host detail right rail + per-snapshot Restore action on snapshot rows. New .snap-row token in web/styles/input.css.
P3-03 (M) Restore execution. restic.RunRestore builds restore <sid> --target <dir> [--include p]... with --json; new pumpRestoreStdout parses status + summary objects. --no-ownership is gated on the agent's restic version via Env.AtLeastVersion(0, 17) — the flag was added in 0.17 and 0.16 rejects it. Restic version is threaded through runner.Config.ResticVersion from the agent's sysinfo snapshot. New-dir target is operator-editable (default $HOME/rm-restore/<job-id>/); agent expands $HOME / ${HOME} / ~/ at run time and calls os.MkdirAll on the target chain so the operator never has to pre-create the per-job subdir. runner.RunRestore translates RestoreStatus into job.progress (mapping FilesRestored → FilesDone, etc.); agent dispatcher case JobRestore reuses the spawn() helper from P3-X1 so cancel works. Restore-shaped job-detail variant with current-file display under the progress bar.
P3-09 (S) diff between two snapshots. JobDiff JobKind + restic.RunDiff + runner.RunDiff; POST /api/hosts/{id}/snapshots/diff (and HTMX-form variant on the unprefixed path) dispatcher with two-snapshot guard + per-host snapshot-list validation; UI panel on host detail right rail (visible when 2+ snapshots) with two short-id inputs + Diff button. Output streams as log.stream to the standard live job log page.
P3-X3 (S) Recent-restores line on host detail. hostChromeData grows RestoreStatus / RestoreAt / RestoreJobID populated via store.LatestJobByKind(host_id, 'restore') (already exists from P2R). host_chrome.html renders a small line below the init-status one with status-coloured copy + a link to the job log. Hidden when no restore has ever run on this host.
P3-X4 (S) Job log download (txt + ndjson). New GET /api/jobs/{id}/log.{txt|ndjson} endpoint backed by the persisted job_logs table — works any time (running or finished) without pausing the live WS stream because the source is the DB, not the live socket. Plain-text format mirrors the on-screen "HH:MM:SS.mmm TAG payload" shape with a small # job ... · kind ... · status ... header; ndjson emits one self-contained {seq,ts,stream,payload} JSON object per line for jq / tooling. Surfaced as a single header dropdown on the live job page (details/summary-driven, native keyboard support, click-outside-to-close). New reusable .dropdown / .dropdown-menu / .dropdown-item tokens in web/styles/input.css.
P3-X5 (S) UK lint locale + sweep. .golangci.yml misspell locale switched US → UK and the codebase swept (~73 corrections — behaviour, serialise, recognise, honour, initialise, enrol, unauthorised, etc.). Wire ErrorCode value "unauthorized" → "unauthorised" is a tiny contract change but the agent doesn't parse those codes today and no external clients exist yet.
P3-X6 (S) Snapshot SIZE/FILES tooltip on host detail. The per-snapshot summary block was added by restic 0.17 (the source comment in internal/restic/snapshots.go incorrectly said 0.16+); on 0.16 hosts the columns render —. hostDetailPage.LegacyRestic (computed via Env.AtLeastVersion(0, 17)) drives a title="Needs restic 0.17+ on the agent host. This host runs <ver>." + cursor: help on the column headers, hidden once the host upgrades.
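
The restic version gate used by P3-03 (for --no-ownership) and P3-X6 (for the SIZE/FILES tooltip) could look roughly like the sketch below. This is a minimal illustration, not the project's code: atLeastVersion and restoreArgs are hypothetical stand-ins for Env.AtLeastVersion and the argv builder inside restic.RunRestore, and the version-string parsing (plain "major.minor.patch", unparseable treated as too old) is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// atLeastVersion reports whether a version string such as "0.16.4" is at
// least major.minor. Hypothetical stand-in for Env.AtLeastVersion.
func atLeastVersion(version string, major, minor int) bool {
	parts := strings.SplitN(strings.TrimPrefix(version, "v"), ".", 3)
	if len(parts) < 2 {
		return false // unparseable: assume old, so newer flags are omitted
	}
	gotMajor, err1 := strconv.Atoi(parts[0])
	gotMinor, err2 := strconv.Atoi(parts[1])
	if err1 != nil || err2 != nil {
		return false
	}
	if gotMajor != major {
		return gotMajor > major
	}
	return gotMinor >= minor
}

// restoreArgs builds an argv of the shape described in P3-03:
// restore <sid> --target <dir> [--include p]... --json, adding
// --no-ownership only when the agent's restic is 0.17 or newer.
func restoreArgs(version, snapshotID, target string, includes []string) []string {
	args := []string{"restore", snapshotID, "--target", target, "--json"}
	for _, p := range includes {
		args = append(args, "--include", p)
	}
	if atLeastVersion(version, 0, 17) {
		args = append(args, "--no-ownership")
	}
	return args
}

func main() {
	for _, v := range []string{"0.16.4", "0.17.0"} {
		fmt.Println(v, restoreArgs(v, "a1ac4006", "/root/rm-restore/job", []string{"/home/steve/test"}))
	}
}
```

Treating an unparseable version as "too old" errs on the side of omitting a flag that an older restic would reject.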

Migration 0012 widens the jobs.kind CHECK constraint to include restore and diff. This requires a table rebuild (SQLite can't ALTER a CHECK constraint in place); it follows the safe pattern from 0005, with a defensive temp-table backup of job_logs so that the cascade trap that bit migration 0007 can't take the log history with it.

install.sh + systemd unit: the install script now pre-creates /root/rm-restore (root-owned 0700) so the default new-dir restore target works under the sandbox out of the box; the unit's ReadWritePaths gains -/root/rm-restore (soft-fail prefix). Existing installs need a re-run of install.sh to pick up the new dir; new operator-typed targets are auto-created by the agent at job time.
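
The agent-side target handling (expand $HOME / ${HOME} / ~/, then os.MkdirAll the chain) might be sketched as follows. The function names expandHome and prepareTarget, the 0700 mode for newly created directories, and the absolute-path check are assumptions made for illustration; only the expansion forms and the MkdirAll call come from the task notes above.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// expandHome replaces a leading $HOME, ${HOME} or ~ path component with home.
// Anything else (including "~user/..." or "$HOMEDIR/...") is returned unchanged.
func expandHome(target, home string) string {
	for _, prefix := range []string{"${HOME}", "$HOME", "~"} {
		if target == prefix {
			return home
		}
		if strings.HasPrefix(target, prefix+"/") {
			return filepath.Join(home, target[len(prefix)+1:])
		}
	}
	return target
}

// prepareTarget expands the operator-supplied target and creates the whole
// directory chain so the per-job subdir never has to be pre-created.
func prepareTarget(target string) (string, error) {
	home, err := os.UserHomeDir()
	if err != nil {
		return "", fmt.Errorf("resolve home dir: %w", err)
	}
	dir := expandHome(target, home)
	if !filepath.IsAbs(dir) {
		return "", fmt.Errorf("restore target %q is not absolute", dir)
	}
	// 0o700 is an assumed mode, chosen to match the root-owned 0700 default dir.
	if err := os.MkdirAll(dir, 0o700); err != nil {
		return "", fmt.Errorf("create restore target: %w", err)
	}
	return dir, nil
}

func main() {
	fmt.Println(expandHome("$HOME/rm-restore/job-1/", "/root"))
	fmt.Println(expandHome("~/rm-restore/job-1", "/root"))
	fmt.Println(expandHome("/tmp/custom-restore/job-1", "/root"))
}
```

Note that creating the directory only succeeds under the systemd sandbox if the expanded path falls inside one of the unit's ReadWritePaths.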

As shipped (Playwright sweep against the live smoke env, 2026-05-04): login → host detail → Restore button → wizard step 1 picks snapshot a1ac4006 (most recent) → tree drill-down /home/steve/test (3 lazy loads) → tick file1 + file2 → step 4 confirm summary populated → dispatch → live job page with running progress widget → restore succeeds, files land on disk at /root/rm-restore/<job-id>/home/steve/test/file{1,2} (default $HOME/rm-restore/<job-id>/ after agent-side expansion). Custom-target restore to /tmp/custom-restore/<job-id>/ lands inside the agent's PrivateTmp namespace. Snapshot diff between a1ac4006 and 5f78c788 → diff job page, statistics output streamed (738 bytes added, 0 removed). Recent-restores line on host detail reads "last restore · succeeded 28s ago · job log →". Download dropdown serves both .txt and .ndjson with correct Content-Type + Content-Disposition. SIZE/FILES tooltip "Needs restic 0.17+ on the agent host. This host runs 0.16.4." renders on column hover.
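
The two download formats from P3-X4 can be illustrated with a small sketch. The logLine struct, its field types, the RFC 3339 timestamp encoding, and the use of the stream name as the on-screen TAG are all assumptions; the task notes only fix the {seq,ts,stream,payload} key set and the "HH:MM:SS.mmm TAG payload" text shape.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// logLine mirrors one persisted job_logs row (hypothetical field types).
type logLine struct {
	Seq     int64     `json:"seq"`
	TS      time.Time `json:"ts"`
	Stream  string    `json:"stream"`
	Payload string    `json:"payload"`
}

// toNDJSON emits one self-contained JSON object per line, suitable for jq.
func toNDJSON(lines []logLine) (string, error) {
	var sb strings.Builder
	enc := json.NewEncoder(&sb) // Encode appends a newline after each value
	for _, l := range lines {
		if err := enc.Encode(l); err != nil {
			return "", err
		}
	}
	return sb.String(), nil
}

// toTextLine renders the "HH:MM:SS.mmm TAG payload" shape for one row.
func toTextLine(l logLine) string {
	return fmt.Sprintf("%s %s %s", l.TS.Format("15:04:05.000"), l.Stream, l.Payload)
}

func main() {
	lines := []logLine{
		{Seq: 1, TS: time.Date(2026, 5, 4, 21, 1, 34, 0, time.UTC), Stream: "stdout", Payload: "restoring file1"},
		{Seq: 2, TS: time.Date(2026, 5, 4, 21, 1, 35, 0, time.UTC), Stream: "stderr", Payload: "warning"},
	}
	out, err := toNDJSON(lines)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
	for _, l := range lines {
		fmt.Println(toTextLine(l))
	}
}
```

Because both renderers read from stored rows rather than the live socket, the same code path serves running and finished jobs.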

Phase 3 — Alerts ✅

P3-05 (M) Alert engine: rule evaluation loop (failed backup, stale schedule, agent offline, check failed)
P3-06 (M) Notification channels: webhook, ntfy, SMTP email
P3-07 (S) Alert UI: list, acknowledge, resolve

As shipped (Playwright sweep, 2026-05-04): /settings/notifications → 3 channels created (sweep-webhook → local Python sink, sweep-ntfy → ntfy.sh public topic, sweep-smtp → MailHog at 127.0.0.1:1025). Test buttons fire alert.test on each: webhook 200/1ms, ntfy 200/322ms, SMTP 250/3ms. Synthetic critical backup_failed raised → /alerts shows row with severity dot, kind chip, host, message, raised/last-seen, Ack + Resolve buttons; nav badge 1; dashboard critical-alert banner appears with Review→ link; OPEN ALERTS card reads 1 unresolved. Acknowledge → fan-out to all 3 channels emits alert.acknowledged (verified in webhook sink, MailHog inbox, notification_log); Acknowledged tab shows row with ack'd by <user> line. Resolve → fan-out emits alert.resolved across all 3 channels; banner clears; dashboard reads 0 unresolved · all clear; host alerts column reads —. Three live bugs found and fixed mid-sweep: (a) enabled form value lost because hidden+checkbox both named enabled and PostForm.Get returned the first ("0"); (b) Ack/Resolve handlers stored the state change but never dispatched alert.acknowledged / alert.resolved; (c) hosts.open_alert_count projection was never recomputed on Raise/Resolve/AutoResolve, so the dashboard count always read 0.
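
Bug (a) is easy to reproduce in isolation: when a hidden input and a checkbox share a name, the browser submits both values in document order and url.Values.Get (which PostForm.Get uses) returns the first. The sketch below shows the behaviour and one possible fix, reading the last value instead; the helper name parseEnabled is hypothetical and the notes above do not say which fix was actually applied.

```go
package main

import (
	"fmt"
	"net/url"
)

// lastValue returns the final value submitted for key, or "" if absent.
// With a hidden "0" input followed by a ticked checkbox "1", this yields "1".
func lastValue(form url.Values, key string) string {
	vals := form[key]
	if len(vals) == 0 {
		return ""
	}
	return vals[len(vals)-1]
}

// parseEnabled parses a raw form body and returns what Get sees (first value)
// alongside what lastValue sees for the "enabled" field.
func parseEnabled(raw string) (first, last string, err error) {
	form, err := url.ParseQuery(raw)
	if err != nil {
		return "", "", err
	}
	return form.Get("enabled"), lastValue(form, "enabled"), nil
}

func main() {
	// Body a browser sends for: <input type=hidden name=enabled value=0>
	// followed by a ticked <input type=checkbox name=enabled value=1>.
	first, last, err := parseEnabled("enabled=0&enabled=1")
	if err != nil {
		panic(err)
	}
	fmt.Println(first) // prints "0": the hidden fallback wins with Get
	fmt.Println(last)  // prints "1": the ticked checkbox
}
```

An unticked checkbox submits only the hidden value, so reading the last value still yields "0" in that case.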

Phase 3 — Audit log UI (not started)

P3-08 (S) Audit log UI with filters (user, action, target, time range)

Phase 3 acceptance

A file deleted on a host can be restored from the UI in under 2 minutes via the wizard at /hosts/{id}/restore; the operator can cancel a running restore (or any other running job) from the live job page. Snapshot diff between two snapshots renders as a normal job page.
A failed backup raises an alert via the configured channel within 60s.
The audit-log UI lets an admin filter by user / action / target / time range.

Phase 4 — Update delivery, RBAC polish, OIDC

P4-01 (M) Update delivery via OS package managers: host an apt repo (Linux) and a Chocolatey package (Windows) on Gitea releases. restic-manager-agent update is a thin wrapper over apt-get install --only-upgrade restic-manager-agent / choco upgrade. Trades flexibility for a much smaller security surface than bespoke signed binaries (see spec.md §4.2).
P4-02 (M) Agent version reporting on dashboard: surface "agent N versions behind server"; "update all" admin action calls the package-manager wrapper on each host
P4-03 (M) RBAC enforcement at API layer (admin / operator / viewer)
P4-04 (S) User management UI (create/edit/disable, role assignment, password reset)
P4-05 (L) OIDC login (generic provider config, group → role mapping)
P4-06 (M) Repo size trend graphs (sparkline on host card, full chart on repo page)
P4-07 (S) Per-host tags + dashboard filtering by tag
P4-08 (M) Prometheus /metrics endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list
P4-09 (S) Document Prometheus integration + sample Grafana dashboard JSON

Phase 4 acceptance

Non-admin users see an appropriately limited UI. Agents upgrade via apt/choco with one admin-triggered action. OIDC login works against at least one provider (Authelia or Authentik). Prometheus can scrape /metrics and the sample Grafana dashboard renders with live data.

Phase 5 — OSS readiness

P5-01 (M) Documentation site (mdBook or similar) with install, concepts, security model, screenshots
P5-02 (S) CONTRIBUTING.md, CODE_OF_CONDUCT.md, issue + PR templates
P5-03 (S) Release automation: goreleaser for binaries + Docker image to GHCR
P5-04 (S) Demo screenshots / short Loom walkthrough in README
P5-05 (S) SECURITY.md with disclosure process
P5-06 (M) End-to-end test suite in CI (Playwright vs. compose stack with sibling Linux agent)
P5-07 (S) Reference deployment: docker-compose.yml + Caddyfile snippet showing the TLS-terminating reverse proxy in front of the HTTP-only server (also demonstrates RM_TRUSTED_PROXY)

Phase 5 acceptance

A stranger can read the docs and stand up a working install in under 30 minutes.

Cross-cutting / ongoing

X-01 Keep CHANGELOG.md updated (Keep-a-Changelog format)
X-02 Track restic version compatibility matrix
X-03 Periodic dependency updates (dependabot or renovate)
X-04 Threat-model review at end of each phase
X-05 Proper first-run onboarding UI: admin shouldn't need to curl /api/bootstrap by hand. Render the bootstrap form on the same login page (extra "setup token" field shown only while no admin user exists, hidden after); on submit POST to /api/bootstrap, then drop straight into a session. Surface the one-time token from the server log somewhere copy-able (or print a clickable URL with the token in the query string at first-run). Also: relax the 12-char password floor for the first-run path or document it in the form so admin doesn't silently fail validation.

Future / unscheduled

Items here have a plausible use case but no confirmed need. They live outside numbered phases until a concrete trigger (a user request, a security review finding, a real disaster-recovery exercise) bumps them back into a phase.

F-01 ~~P3-04~~ Cross-host restore. De-scoped from Phase 3 on 2026-05-04. Disaster recovery is already covered: stand up a replacement host, paste the original repo creds at enrolment, snapshots reappear, restore is same-host. The remaining "pull a file from host A onto host C without granting C permanent access" use case is genuinely different (file sharing / migration, not DR) and hasn't been requested. Original spec language was: "target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root". Re-promote when there's a real ask.
