# restic-manager — Tasks
Tasks are grouped by phase. Each task has an ID for cross-referencing, an estimated size (S/M/L), and acceptance criteria.

Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.

---
## Phase 0 — Project bootstrap
- [x] **P0-01** (S) Initialize Go module, `cmd/server`, `cmd/agent`, baseline `internal/` packages
- [x] **P0-02** (S) Add LICENSE (PolyForm Noncommercial 1.0.0), README stub, CONTRIBUTING placeholder
- [x] **P0-03** (S) Set up `golangci-lint`, `gofumpt`, `goimports`; pre-commit config
- [x] **P0-04** (S) ~~GitHub Actions~~ Gitea Actions: build matrix (linux amd64/arm64, windows amd64), unit tests, lint
- [x] **P0-05** (S) `Dockerfile.server` (multi-stage, distroless), `deploy/docker-compose.yml`
- [x] **P0-06** (S) Makefile / ~~`taskfile.yml`~~ with common targets (`build`, `test`, `run`, `release`)
---
## Phase 1 — MVP: enrollment, visibility, on-demand backup
### Server foundations
- [x] **P1-01** (M) HTTP server scaffolding (`chi`, structured logging via `slog`, graceful shutdown)
- [x] **P1-02** (M) SQLite store layer (`modernc.org/sqlite`) + migrations (hand-rolled, `embed.FS`)
- [x] **P1-03** (M) Schema for `users`, `sessions`, `hosts`, `repos`, `credentials`, `jobs`, `job_logs`, `snapshots`, `audit_log`
- [~] **P1-04** (M) Auth: argon2id password hashing, login/logout, session cookies; **CSRF middleware deferred to P1-23 (UI work)** — REST clients use bearer/session-only flows
- [x] **P1-05** (S) First-run admin bootstrap (printed one-time setup token in server logs)
- [x] **P1-06** (M) Secret encryption helper (AEAD with key from `RM_SECRET_KEY_FILE`)
- [~] **P1-07** (M) Audit log writer; middleware sweep for every state-changing endpoint **lands when the rest of the API surface does** — login / bootstrap / host.enrolled / job.run_now currently audited
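> **Illustrative sketch (not from the codebase):** P1-06 only says "AEAD with key from `RM_SECRET_KEY_FILE`" and does not name the cipher. The Go below assumes AES-256-GCM from the standard library; `sealSecret`, `openSecret` and `roundTrip` are hypothetical names, not the project's `crypto.AEAD` API.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD from a 32-byte key (assumed cipher choice).
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// sealSecret encrypts plaintext and prepends the random nonce to the result.
// ad is additional authenticated data that binds the blob to its slot.
func sealSecret(key, plaintext, ad []byte) ([]byte, error) {
	aead, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return aead.Seal(nonce, nonce, plaintext, ad), nil
}

// openSecret reverses sealSecret; it fails if the key, blob or ad differ.
func openSecret(key, blob, ad []byte) ([]byte, error) {
	aead, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(blob) < aead.NonceSize() {
		return nil, errors.New("blob shorter than nonce")
	}
	nonce, ct := blob[:aead.NonceSize()], blob[aead.NonceSize():]
	return aead.Open(nil, nonce, ct, ad)
}

// roundTrip seals with one AD value and opens with another (demo helper).
func roundTrip(secret, sealAD, openAD string) (string, error) {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		return "", err
	}
	blob, err := sealSecret(key, []byte(secret), []byte(sealAD))
	if err != nil {
		return "", err
	}
	pt, err := openSecret(key, blob, []byte(openAD))
	return string(pt), err
}

func main() {
	got, err := roundTrip("s3cret", "host:1", "host:1")
	fmt.Println(got, err)
}
```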
### Agent ↔ server protocol
- [x] **P1-08** (M) Define shared API types in `internal/api` (envelopes, every WS message + `protocol_version` constants; JSON-shape tests pin the wire)
- [x] **P1-09** (L) WebSocket transport (`github.com/coder/websocket`), framed JSON envelopes, RPC correlation IDs, exponential-backoff reconnect on the agent side
- [x] **P1-10** (M) Enrollment flow: `POST /api/agents/enroll` with one-time token → returns persistent bearer. Cert pin field stays in the response shape but is left empty: the server is HTTP-only behind a reverse proxy, so the operator pastes the proxy's cert hash into the install command rather than having the server introspect a cert it doesn't terminate.
- [x] **P1-11** (M) Agent registration on connect (`hello` upserts agent_version/restic_version/protocol_version and flips status to online; a `protocol_too_old` rejection returns a clean error envelope)
- [x] **P1-12** (S) Heartbeat handler (touches `last_seen_at`; background sweeper marks hosts offline after 90s without one)
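> **Illustrative sketch (not from the codebase):** the offline sweep in P1-12 could look roughly like this. The 90s threshold comes from the task text; the `host` struct and `sweep` function are hypothetical stand-ins for the real `hosts` table and sweeper.

```go
package main

import (
	"fmt"
	"time"
)

// offlineAfter mirrors the 90s threshold from P1-12.
const offlineAfter = 90 * time.Second

// host is a hypothetical in-memory stand-in for a `hosts` row.
type host struct {
	Name       string
	LastSeenAt time.Time
	Online     bool
}

// sweep marks hosts offline when no heartbeat arrived within offlineAfter
// and returns the names it flipped.
func sweep(hosts []*host, now time.Time) []string {
	var flipped []string
	for _, h := range hosts {
		if h.Online && now.Sub(h.LastSeenAt) > offlineAfter {
			h.Online = false
			flipped = append(flipped, h.Name)
		}
	}
	return flipped
}

// demoSweep runs one sweep over a fresh host and a stale host.
func demoSweep() []string {
	now := time.Date(2026, 1, 1, 12, 0, 0, 0, time.UTC)
	hosts := []*host{
		{Name: "fresh", LastSeenAt: now.Add(-30 * time.Second), Online: true},
		{Name: "stale", LastSeenAt: now.Add(-2 * time.Minute), Online: true},
	}
	return sweep(hosts, now)
}

func main() {
	fmt.Println(demoSweep())
}
```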
### Agent foundations
- [x] **P1-13** (M) Agent config file (`/etc/restic-manager/agent.yaml`, atomic save: tmp+fsync+rename); Windows path deferred to Phase 2
- [x] **P1-14** (M) Service integration: systemd unit (sandboxed: NoNewPrivileges, Protect*, MemoryDenyWriteExecute) — Linux only in Phase 1
- [x] **P1-15** (M) Outbound WS client (`github.com/coder/websocket`) with reconnect (1s → 60s exponential + jitter), optional cert pin (sha-256 of leaf), heartbeats, `protocol_version` in hello
- [x] **P1-16** (M) Restic wrapper: locate via PATH or override, run with `--json`, scan stdout/stderr, parse `BackupStatus` + `BackupSummary`, exit-code 3 treated as success-with-issues
- [x] **P1-17** (S) Host metadata collection (OS, arch, hostname, restic version, agent version, protocol_version)
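> **Illustrative sketch (not from the codebase):** the reconnect delay in P1-15 (1s → 60s exponential + jitter) might be computed as below. The doubling factor and the jitter range `[d/2, d]` are assumptions; the task only fixes the 1s floor and 60s ceiling.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	minDelay = 1 * time.Second
	maxDelay = 60 * time.Second
)

// baseDelay doubles from minDelay per failed attempt (0-based), capped at maxDelay.
func baseDelay(attempt int) time.Duration {
	d := minDelay
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

// withJitter picks a random delay in [d/2, d]; the jitter shape is an assumption.
func withJitter(d time.Duration) time.Duration {
	half := d / 2
	return half + time.Duration(rand.Int63n(int64(half)+1))
}

// baseSeconds exposes baseDelay in whole seconds for easy checking.
func baseSeconds(attempt int) int {
	return int(baseDelay(attempt) / time.Second)
}

func main() {
	for attempt := 0; attempt < 8; attempt++ {
		d := baseDelay(attempt)
		fmt.Printf("attempt %d: base=%v jittered=%v\n", attempt, d, withJitter(d))
	}
}
```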
### Run-now backup
- [x] **P1-18** (L) Job lifecycle: queued → running → succeeded/failed/cancelled, persisted with stats blob and error
- [x] **P1-19** (M) Server endpoint `POST /api/hosts/{id}/jobs` to dispatch a `backup` command (validates kind, checks online, audit-logs)
- [x] **P1-20** (M) Agent executes `restic backup`, streams stdout/stderr + parsed JSON events back as `job.progress` (1Hz throttle) / `log.stream`
- [~] **P1-21** (M) Server persists log stream to `job_logs` ✓; **WS `/api/jobs/{id}/stream` for live browser tailing** still TODO — needs the per-job fan-out hub
- [x] **P1-22** (S) Snapshot listing: agent calls `restic snapshots --json` after each successful backup and ships the projection over `snapshots.report`. Server `ReplaceHostSnapshots` atomically swaps the per-host list and updates `hosts.snapshot_count` in the same tx. Read endpoint: `GET /api/hosts/{id}/snapshots`. Tests cover round-trip, authoritative-replace semantics, and empty-after-prune. Schema dropped the unused `repo_id` FK from `snapshots` (repos as a first-class entity is P2 work).
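> **Illustrative sketch (not from the codebase):** the P1-18 lifecycle (queued → running → succeeded/failed/cancelled) can be guarded with a small transition table. The type and constant names are hypothetical, and allowing a queued job to fail or be cancelled directly is an assumption.

```go
package main

import "fmt"

type jobState string

const (
	stateQueued    jobState = "queued"
	stateRunning   jobState = "running"
	stateSucceeded jobState = "succeeded"
	stateFailed    jobState = "failed"
	stateCancelled jobState = "cancelled"
)

// allowed lists the legal next states; terminal states have no entry.
var allowed = map[jobState][]jobState{
	stateQueued:  {stateRunning, stateFailed, stateCancelled},
	stateRunning: {stateSucceeded, stateFailed, stateCancelled},
}

// canTransition reports whether a job may move from one state to another.
func canTransition(from, to jobState) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(stateQueued, stateRunning))    // true
	fmt.Println(canTransition(stateSucceeded, stateRunning)) // false
}
```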
### UI (HTMX + Tailwind)
- [x] **P1-23** (M) Base layout, login page, session-aware nav
- [x] **P1-24** (M) Dashboard: fleet summary tiles + host table (status dot + row accent + os/arch + last backup + repo size + snapshots + alerts + tags + run-now). Backed by `GET /api/hosts` + `GET /api/fleet/summary` (JSON) and a server-rendered HTML view. Empty state hands the operator the install command. HTMX `Run now` button posts to `/hosts/{id}/run-backup`.
- [x] **P1-25** (M) Host detail page (`/hosts/{id}`): persistent header (status dot + mono name + tags + OS/arch/agent/restic/last-seen), vitals strip (last backup / repo size / snapshots / open alerts), sub-tabs (Snapshots active; Jobs/Repo/Settings tabs visible but inert until P2), snapshot table (cap 50, pagination later), right-rail run-now stack (backup live; forget/prune/check/unlock disabled with P2 hints) and a danger-zone delete panel.
- [x] **P1-26** (M) Live job log viewer + WS browser fan-out hub (closes the P1-21 remainder). Browser opens `/api/jobs/{id}/stream`; agent-emitted `job.started`/`job.progress`/`log.stream`/`job.finished` are mirrored to subscribers. Per-subscriber buffered channel + non-blocking broadcast keeps a slow browser from blocking the agent's read loop. Page renders running / succeeded / failed states; auto-scrolls until the operator scrolls up; reloads on `job.finished` to show the final header. "Run now" sets `HX-Redirect` so the operator lands on the live log.
- [~] **P1-27** (M) "Add host" flow: form takes hostname + repo URL/username/password, mints token (TTL 1h), re-renders the same page in result-state with the install command (`RM_SERVER` + `RM_TOKEN` filled in), copy button, and an awaiting-agent panel. Encrypted repo creds ride on the token row (P1-32) and get pushed to the agent on first WS connect (P1-33). **Deferred:** one-click "download preconfigured installer" `install-<hostname>.sh` (cf. UrBackup Internet-mode push installer) — copy-paste covers it for v1.
- [x] **P1-28** (S) Tailwind build via `tailwindcss` standalone binary (no Node) — Makefile downloads pinned v3.4.17 into `bin/tailwindcss`, builds `web/styles/input.css` → `web/static/css/styles.css`, embedded into the binary via `web.FS`. `make build` runs Tailwind first.
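> **Illustrative sketch (not from the codebase):** the "per-subscriber buffered channel + non-blocking broadcast" idea from P1-26 can be shown with a `select`/`default` send. The `hub` type here is hypothetical, and dropping a message for a subscriber whose buffer is full is an assumed policy; the task only states that a slow browser must not block the agent's read loop.

```go
package main

import (
	"fmt"
	"sync"
)

// hub fans messages out to browser subscribers (hypothetical shape).
type hub struct {
	mu   sync.Mutex
	subs map[chan string]struct{}
}

func newHub() *hub {
	return &hub{subs: make(map[chan string]struct{})}
}

// subscribe registers a subscriber with a bounded buffer.
func (h *hub) subscribe(buf int) chan string {
	ch := make(chan string, buf)
	h.mu.Lock()
	h.subs[ch] = struct{}{}
	h.mu.Unlock()
	return ch
}

// broadcast never blocks: when a subscriber's buffer is full, the message
// is dropped for that subscriber only. It returns the number of drops.
func (h *hub) broadcast(msg string) (dropped int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for ch := range h.subs {
		select {
		case ch <- msg:
		default:
			dropped++
		}
	}
	return dropped
}

// demoBroadcast sends three lines to a roomy and a tiny subscriber.
func demoBroadcast() (fastGot, slowGot, dropped int) {
	h := newHub()
	fast := h.subscribe(8)
	slow := h.subscribe(1)
	for i := 0; i < 3; i++ {
		dropped += h.broadcast(fmt.Sprintf("line %d", i))
	}
	return len(fast), len(slow), dropped
}

func main() {
	fmt.Println(demoBroadcast())
}
```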
### Install scripts
- [x] **P1-29** (M) `install.sh` (Linux): detects arch, downloads agent, installs systemd unit, enrolls. Detects existing restic timers / `/etc/cron.{d,daily,hourly,weekly}/*` / root crontab and prints them with the exact disable commands — does **not** auto-disable
- [~] **P1-31** (S) Server endpoint to serve agent binaries + install scripts ✓ (`/agent/binary` + `/install/*`); **signature verification** deferred to Phase 5 OSS readiness
### Repo credentials (pulled forward from Phase 2)
- [x] **P1-32** (M) Server-side encrypted repo creds carried on the enrollment token:
- `POST /api/enrollment-tokens` body grows `repo_url`, `repo_username`, `repo_password` (all required).
- Token row stores them as one AEAD-encrypted blob (existing `crypto.AEAD`); `ConsumeEnrollmentToken` moves the blob to a new `host_credentials` row keyed by `host_id` in the same tx.
- `PUT /api/hosts/{id}/repo-credentials` (admin/operator) re-encrypts and replaces the row, emits an in-memory event to the WS hub.
- `GET /api/hosts/{id}/repo-credentials` returns the redacted view (URL + username + `has_password`) so the UI can pre-fill the edit form. Password never leaves the server outside the WS push.
- On WS `hello`, server pushes a `config.update` with decrypted creds **before** returning the connection to idle. Same path on edit-while-connected.
- Audit-logged on create / consume / edit; payload omits the secret material.
- [x] **P1-33** (M) Agent-side encrypted secrets store:
- New `internal/agent/secrets` package: AEAD blob at `/var/lib/restic-manager/secrets.enc`, atomic write (tmp+fsync+rename, mode 0600).
- Per-host 32-byte secrets key minted at enrollment, persisted in `agent.yaml` (already 0600 root-only — same trust boundary as the bearer; explicit comment in the file).
- Strip `repo_url` / `repo_password` from `agent.config.Config`. Agent loads creds from `secrets.enc` at startup; `config.update` handler writes through to the file.
- Dispatcher reads from the secrets store on every job rather than from in-memory config.
- Migration path: if `agent.yaml` still contains `repo_url`/`repo_password`, copy them into `secrets.enc` on next start, then strip from the YAML on save.
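> **Illustrative sketch (not from the codebase):** the "tmp+fsync+rename, mode 0600" write used for `secrets.enc` (and for `agent.yaml` in P1-13) follows a common pattern, sketched here with only the standard library. `atomicWrite` is a hypothetical name, and the sketch skips fsync of the parent directory, which a stricter implementation may add.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// atomicWrite replaces path with data via a temp file in the same directory:
// write, fsync, close, then rename over the target.
func atomicWrite(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // harmless once the rename has succeeded
	if err := tmp.Chmod(0o600); err != nil {
		tmp.Close()
		return err
	}
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// demoAtomicWrite overwrites a file twice and returns its final content.
func demoAtomicWrite() (string, error) {
	dir, err := os.MkdirTemp("", "secrets-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "secrets.enc")
	if err := atomicWrite(path, []byte("v1")); err != nil {
		return "", err
	}
	if err := atomicWrite(path, []byte("v2")); err != nil {
		return "", err
	}
	b, err := os.ReadFile(path)
	return string(b), err
}

func main() {
	fmt.Println(demoAtomicWrite())
}
```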
- [x] **P1-34** (S) End-to-end smoke runbook: `docs/e2e-smoke.md` walks through enrollment with repo creds → agent receives them via push-on-connect → run-now backup completes against a real `restic/rest-server` in a sibling container → host appears with snapshot count. Test-driven version (Playwright + compose) deferred to P5-06.
### Phase 1 acceptance
- One Linux host can enroll, appear in the dashboard, and a backup can be triggered from the UI with live log streaming. Snapshots list updates after success.
- Windows binary builds cleanly in CI (`.gitea/workflows/ci.yml`) but is not service-tested or installer-shipped in Phase 1 — that lands in Phase 2 (P2-16, P2-17).
- Agent ↔ server `protocol_version` handshake rejects mismatched versions with a clear error rather than failing on JSON parse.
- Repo credentials never appear in plaintext on disk: server stores them AEAD-encrypted, agent stores them AEAD-encrypted, the wire carries them only inside the authenticated WS as `config.update`.
---
## Phase 2 — Scheduling, retention, repo operations
> **Mid-phase pivot — "P2 redesign" (commits `7a7cac5`, `666af41`, `5667cdf`).**
> The original P2 plan put paths/excludes/retention/manual/kind/options on
> `Schedule` and one repo per host. After landing P2-01..P2-05 against that
> shape, the data model was rewritten: schedules are slim (cron + which
> `source_groups`); paths/excludes/retention/retry live on `source_group`
> (also doubles as the snapshot tag); forget/prune/check cadences live on
> `host_repo_maintenance` and run on a server-side ticker, not the agent
> cron; `pending_runs` queues offline retries; `host.repo_initialised_at`
> is gone (auto-init at enrolment). The redesign is captured below as
> `P2R-NN` items. Items P2-01..P2-05 stay marked done because the work
> shipped, but they're labelled ⚠️ **shipped against old shape — behaviour
> to be re-validated under P2R-02 after UI rewire**. P2-04.5 (`manual`
> flag) is dropped wholesale. P2-06..P2-15 are reframed below to point at
> their new homes; P2-16/17/18 are unaffected by the redesign.
### Original P2 work — shipped (against pre-redesign shape)
- [x] ⚠️ **P2-01** (M) Schedule schema + CRUD API (fat-schedule shape) — superseded by P2R-01.
- [x] ⚠️ **P2-02** (L) Server-pushed schedule reconciliation (fat-schedule shape) — re-validate under P2R-01.
- [x] ⚠️ **P2-03** (M) Agent local scheduler (`internal/agent/scheduler`, `robfig/cron/v3`, `schedule.fire` envelope, `dispatchScheduledJob`). The cron loop + ack/fire round-trip stay; the payload it carries reshapes under P2R-01.
- [x] ⚠️ **P2-04** (M) Schedule editor UI (fat-schedule form: paths/excludes/tags/retention/bandwidth on the schedule itself) — superseded by P2R-02.
- ~~**P2-04.5** Manual schedules / kill `host.default_paths`~~ — superseded; the `manual` flag concept is gone, Run-now lives on source groups (see P2R-01 / P2R-02).
- [x] ⚠️ **P2-05** (M) `forget` command with retention policy. Wire payload (`CommandRunPayload.retention_policy`) and restic wrapper (`restic.ForgetPolicy`, `RunForget`) are still correct; what changes under P2R-03 is **where retention comes from** (source_group, not schedule) and **who dispatches** (server-side maintenance ticker for cadence; per-source-group Run-now for ad-hoc).
### P2 redesign — Phase 1 ✅
- [x] **P2R-00.1** (M) Migration 0008 — sources + repo maintenance. Adds `source_groups`, `schedule_source_groups` junction, `host_repo_maintenance`, `pending_runs`, `host.bandwidth_up_kbps` / `bandwidth_down_kbps`. Drops `host.repo_initialised_at`. Slim-schedule columns dropped from `schedules`. Column-level ALTERs only — no table rebuilds (FK cascade trap, see CLAUDE.md). Commit `7a7cac5`.
- [x] **P2R-00.2** (M) v4 wireframes for sources / schedules / repo. Hi-fi mocks of `/hosts/{id}/sources`, `/sources/{gid}/edit` (with retention-conflict banner), slim `/schedules`, `/repo` (connection / bandwidth / maintenance / re-init). Commit `666af41`.
### P2 redesign — Phase 2 ✅
- [x] **P2R-00.3** (L) Go-side store rewrite against migration 0008. New types: `SourceGroup`, `HostRepoMaintenance`, `PendingRun`. `Schedule` slimmed to `{id, host_id, cron, enabled, source_group_ids, timestamps}`. `RetentionPolicy` moves from schedule field → source group field (type unchanged). `Host` loses `RepoInitialisedAt`, gains bandwidth caps. New files: `store/sources.go`, `store/maintenance.go`, `store/pending.go`. `store/schedules.go` rewritten for slim shape + junction CRUD. `enrollment.go` seeds a default source group + repo-maintenance row instead of a manual schedule. `ws/handler.go` drops `MarkHostRepoInitialised`. HTTP layer + UI templates **temporarily 501-stubbed** with `redesign_in_progress` — this is what P2R-01 / P2R-02 fill back in. Tests for the obsolete fat-schedule API deleted. Commit `5667cdf`.
- [x] **P2R-00.4** (S) Host-detail UI patched up enough to render: `RepoInitialisedAt` template refs removed, manual init-repo branches stripped, dead Schedules sub-tab demoted to inert div (matches Jobs/Repo/Settings), broken Run-now buttons disabled with P2-Phase-4 hints. Stop-gap until P2R-02 lands the real surface.
### P2 redesign — Phase 3 (REST + WS rewire) ✅
- [x] **P2R-01** (L) HTTP/WS layer against the slim shape:
- **Schedules REST CRUD**: `GET|POST /api/hosts/{id}/schedules`, `PUT|DELETE /api/hosts/{id}/schedules/{sid}`. Body shape is `{cron, enabled, source_group_ids[]}` — paths/excludes/retention/kind/manual all go away. Junction wiped + re-inserted on every update (per `store.UpdateSchedule`). Validation: cron parses via `robfig/cron/v3`; ≥1 `source_group_ids`; all referenced groups belong to the host.
- **Source-groups REST CRUD**: `GET|POST /api/hosts/{id}/source-groups`, `GET|PUT|DELETE /api/hosts/{id}/source-groups/{gid}`. Body: `{name, includes[], excludes[], retention_policy, retry_max, retry_backoff_seconds}`. Name uniqueness per host. Refuse delete if `SchedulesUsingGroup(gid)` is non-empty (return the schedule list so UI can show "remove from these schedules first"). Mutations bump `host_schedule_version`.
- **Repo-maintenance REST**: `GET|PUT /api/hosts/{id}/repo-maintenance`. Body: `{forget_cadence, prune_cadence, check_cadence, check_subset_pct, enabled}`. Server-side ticker drives execution (P2R-04), so updates here do **not** bump `host_schedule_version`.
- **Per-source-group Run-now**: `POST /hosts/{id}/source-groups/{gid}/run`. Reuses the existing `dispatchScheduleNow`-style path; agent receives a normal `command.run` carrying the resolved includes/excludes/retention from the group. This replaces the old per-host `/hosts/{id}/run-backup` endpoint (kept around as a 410-Gone with a hint pointing to source groups).
- **`schedule_push.go` reconciliation**: rebuild `pushScheduleSet*` to ship the new wire format (`ScheduleSetPayload` carries `[{schedule_id, cron, enabled, source_groups: [{name, includes, excludes, retention, retry_*}]}]` — agent doesn't need to know `source_group_id`, just the resolved bundle). On-hello + async-on-CRUD flavours; ack still persists `applied_schedule_version`.
- **Auto-init at enrolment**: server dispatches `restic init` on first WS connect (was P2-old "Init repo" button — now invisible to the operator). On success: emit a normal job row with `kind=init` so the audit trail still shows it. On `init` returning "config file already exists" (e.g. re-enrolment against an existing repo): treat as soft success per existing restic-wrapper behaviour.
- **Tests**: rewrite the deleted `schedules_test.go` and `schedule_push_test.go` against new endpoints; new `source_groups_test.go`, `repo_maintenance_test.go`, `auto_init_test.go`. End-to-end: enrol → server pushes creds → server dispatches init → agent runs it → schedule reconcile fires → operator hits per-source-group Run-now → backup runs → snapshots refresh.
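> **Illustrative sketch (not from the codebase):** one way to model the slim wire format described for `ScheduleSetPayload` in P2R-01. The JSON keys follow the task text; the Go type names, the string type for `schedule_id`, the expansion of `retry_*` into `retry_max` / `retry_backoff_seconds`, and the `keep_daily` retention key are all assumptions. Retention is kept as raw JSON because its shape is not given here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sourceGroupBundle is the resolved bundle the agent receives; note that it
// carries no source_group_id, matching the task description.
type sourceGroupBundle struct {
	Name                string          `json:"name"`
	Includes            []string        `json:"includes"`
	Excludes            []string        `json:"excludes"`
	Retention           json.RawMessage `json:"retention"`
	RetryMax            int             `json:"retry_max"`             // assumed key
	RetryBackoffSeconds int             `json:"retry_backoff_seconds"` // assumed key
}

// scheduleEntry is one element of the pushed schedule set.
type scheduleEntry struct {
	ScheduleID   string              `json:"schedule_id"`
	Cron         string              `json:"cron"`
	Enabled      bool                `json:"enabled"`
	SourceGroups []sourceGroupBundle `json:"source_groups"`
}

// roundTripSchedules marshals a sample set and decodes it again.
func roundTripSchedules() ([]scheduleEntry, string, error) {
	in := []scheduleEntry{{
		ScheduleID: "sched-1",
		Cron:       "0 3 * * *",
		Enabled:    true,
		SourceGroups: []sourceGroupBundle{{
			Name:                "etc",
			Includes:            []string{"/etc"},
			Excludes:            []string{"*.tmp"},
			Retention:           json.RawMessage(`{"keep_daily":7}`),
			RetryMax:            3,
			RetryBackoffSeconds: 60,
		}},
	}}
	wire, err := json.Marshal(in)
	if err != nil {
		return nil, "", err
	}
	var out []scheduleEntry
	if err := json.Unmarshal(wire, &out); err != nil {
		return nil, "", err
	}
	return out, string(wire), nil
}

func main() {
	_, wire, err := roundTripSchedules()
	fmt.Println(wire, err)
}
```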
### P2 redesign — Phase 4 (UI rewire, against v4 wireframes) ✅
> **Row-design rule (binding for every list-row template in this app, current and future):**
> Whole-row click navigates to the row's primary detail/edit page —
> mirror `.host-row.clickable` on the dashboard
> (`partials/host_row.html`): an absolute-positioned `.row-link`
> overlay with `text-indent: -9999px` covers the row, action buttons
> live in `.row-action` cells that sit above via z-index. **Do not
> add an explicit "Edit" button** when the row is clickable — it
> duplicates the affordance and dilutes the click target. Action
> cells are reserved for verbs that aren't "open this row" (Run-now,
> Delete, Pause, etc).
- [x] **P2R-02** (L) UI templates rebuilt against the new model:
- **Slice 1 ✅** Sub-tab navigation skeleton — extract header/vitals/sub-tabs into a `host_chrome` partial; Sources / Schedules / Repo become real `<a>` links; placeholder pages share the chrome; version indicator restored. (commit `a535822`)
- **Slice 2 ✅** Sources tab — `/hosts/{id}/sources` list with per-row meta + clickable rows + per-group Run-now/Delete; `/sources/new` and `/sources/{gid}/edit` form (name, includes/excludes, 3×2 keep-* grid, retry-on-offline, inline conflict banner from `ConflictDimension` cache); validation re-renders form with input intact; refuses to delete a host's last source group. (commits `0ed9c3d`, `dede74f`)
- **Slice 3 ✅** Schedules tab — `/hosts/{id}/schedules` slim list (status / cron / source-tags / actions, clickable rows) plus `/schedules/new` and `/schedules/{sid}/edit` form (cron with five quick-pick chips that have human-readable tooltips, source-group multi-pick as styled check cards, enabled toggle); per-schedule Run-now reuses `dispatchScheduledJob` for enabled schedules + bypasses the enabled check (with a HX-confirm) for paused ones; multi-group fires emit a success toast, single-group fires HX-Redirect to the live job log. (commit `67ca769` + follow-ups `64d2fcf`, `8b91d30`, `4035c44`)
- **Slice 4 ✅** `/hosts/{id}/repo` — three independent forms (connection: URL/user/password pre-filled from `GET /api/hosts/{id}/repo-credentials` redacted view; bandwidth: host-wide caps via new `PUT /api/hosts/{id}/bandwidth`; maintenance: forget/prune/check cadences + check subset %); danger-zone re-init button rendered + disabled (real flow lands in P2R-09); right-rail snapshots-by-tag breakdown. (commit `d62b173`)
- **Slice 5 ✅** Dashboard row Run-now uses the single covering schedule when one exists ("Run all groups" primary button), otherwise falls back to "Open →" pointing at the Sources tab. Right-rail and empty-snapshots-state Run-now were rehomed to source-group context in slice 1. (commit `fab99b4`)
- **Slice 6 ✅** Playwright sweep against the live `:8080` server — login → walk every new tab → create source group → create schedule → Run-now → confirm a snapshot landed → end-to-end clean, no console errors. Screenshots in `_diag/p2r-02-sweep/`.
- Side-fix: agent runner drops noisy restic `status` events from `log.stream` (they were drowning the live log on short backups; the throttled `job.progress` envelope already covers the same data). (commit `ffba737`)
- Header "version N · agent in sync / agent at vM" indicator preserved across all tabs (backed by `host_schedule_version` + `applied_schedule_version`).
- Form validation re-renders with the operator's typed input intact (mirror P2-04's behaviour). Each save fires `pushScheduleSetAsync` so an online agent re-arms within seconds.
### P2 redesign — Phase 5 ✅
> Shipped on branch `p2r-phase5-maintenance` (PR #3). Plan:
> `docs/superpowers/plans/2026-05-03-p2-redesign-phase-5.md`.
- [x] **P2R-03** (M) `prune` command end-to-end. Restic wrapper (`restic.RunPrune`), agent dispatcher (`case api.JobPrune:`), wire envelope. **Admin-only credential**: a second `host_credentials` row keyed by `host_id` + `kind=admin` carries the non-append-only username/password; server pushes it via `config.update` only when dispatching a prune job, and the agent's secrets store keeps it in a separate slot from the everyday append-only creds. UI: prune row on the Repo page. Operator-triggered Run-now via `POST /hosts/{id}/repo/prune`. Cadence-driven dispatch via the maintenance ticker (P2R-06).
- [x] **P2R-04** (M) `check` command end-to-end (`restic check --read-data-subset N%`). Wrapper + dispatcher + wire. UI: check row on the Repo page (with the subset % slider). Operator Run-now via `POST /hosts/{id}/repo/check`. Cadence-driven dispatch via the maintenance ticker (P2R-06).
- [x] **P2R-05** (S) `unlock` command end-to-end (`restic unlock`). Operator-only — no cadence. `POST /hosts/{id}/repo/unlock`. Repo page surfaces lock state from the most recent `check` (which warns about stale locks).
- [x] **P2R-06** (M) Server-side maintenance ticker. Cron-style loop on the server reads `host_repo_maintenance` rows, dispatches `forget` / `prune` / `check` jobs against the right host on the configured cadence. Last-fire anchor is derived from the `jobs` table via `LatestJobByKind` (queued + running included so a long-running prune correctly suppresses the next tick). Independent of the agent's local cron — the agent's cron only handles backup schedules now. Skips offline hosts (logged, no queue — only scheduled backup fires queue, per P2R-08). Forget reshape: ships a multi-group `ForgetGroups` payload so one job fires N restic-forget invocations per tick.
- [x] **P2R-07** (S) Repo stats panel on the Repo page: total size, raw size, last-check timestamp + status (color-coded), last-prune timestamp, stale-lock banner. Backed by `restic stats --json --mode raw-data` that the agent ships in a `repo.stats` envelope after every backup / check / prune / unlock; persisted via `Store.UpsertHostRepoStats` into a new `host_repo_stats` projection table.
- [x] **P2R-08** (M) Pending-runs queue worker. Scheduled backup fires that race an agent disconnect queue to `pending_runs`. Drained on a 30s server-side tick **and** on agent reconnect (via `onAgentHello`); per-host TryLock mutex prevents the two paths double-dispatching the same row. Exponential backoff capped at 30 minutes; abandons rows that exceed the source-group's `retry_max` (audit-logged) or whose schedule/group has genuinely been deleted.
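> **Illustrative sketch (not from the codebase):** the per-host TryLock guard in P2R-08 keeps the 30s tick and the reconnect path from draining the same host at once. `hostLocks` and `drain` are hypothetical names; `sync.Mutex.TryLock` needs Go 1.18 or newer.

```go
package main

import (
	"fmt"
	"sync"
)

// hostLocks hands out one mutex per host ID.
type hostLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

// forHost returns the mutex for a host, creating it on first use.
func (l *hostLocks) forHost(id string) *sync.Mutex {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.locks == nil {
		l.locks = make(map[string]*sync.Mutex)
	}
	m, ok := l.locks[id]
	if !ok {
		m = &sync.Mutex{}
		l.locks[id] = m
	}
	return m
}

// drain runs fn for the host unless another drain for that host is already
// in flight. It reports whether fn ran.
func (l *hostLocks) drain(id string, fn func()) bool {
	m := l.forHost(id)
	if !m.TryLock() {
		return false
	}
	defer m.Unlock()
	fn()
	return true
}

// demoDrain simulates the reconnect path firing while the ticker is mid-drain.
func demoDrain() (outer, inner, other bool) {
	var l hostLocks
	outer = l.drain("host-1", func() {
		inner = l.drain("host-1", func() {})
		other = l.drain("host-2", func() {})
	})
	return outer, inner, other
}

func main() {
	fmt.Println(demoDrain())
}
```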
### P2 redesign — Phase 6 (auto-init follow-up) ✅
- [x] **P2R-09** (S) Auto-init UX polish. Latest `init` job status surfaced under the host-detail vitals strip (succeeded/failed/running/queued, with link to the live job log on non-success). Danger-zone `POST /hosts/{id}/repo/reinit` dispatches a fresh init job after the operator types the host name to confirm; audit row records `host.repo_reinit`.
### Pre/post hooks (rehomed onto source groups) ✅
- [x] **P2R-10** (M) Hook schema: migration 0010 adds `pre_hook`/`post_hook` BLOB columns to `source_groups` and `pre_hook_default`/`post_hook_default` to `hosts`. Bytes stored verbatim — AEAD encrypt/decrypt at the HTTP layer (per-slot AD bytes). Round-trip tests cover set/clear semantics on both tables.
- [x] **P2R-11** (M) Agent execution of hooks: `runner.BackupHooks` + `runHook` helper invoked via `/bin/sh -c` (`cmd.exe /C` on Windows). pre_hook non-zero exit aborts the backup; post_hook always runs with `RM_JOB_STATUS=succeeded|failed` in env. Output streamed as `hook(<phase>): …` log.stream lines. Hooks only run for `kind=backup`. Server side resolves group → host default → empty and ships plaintext on the WS payload (decrypt at HTTP layer).
- [x] **P2R-12** (S) Hook editor UI: source-group edit form gains pre/post hook textareas with the service-user warning banner; bodies AEAD-encrypted on save (per-group AD). Repo page adds a host-default Hooks panel with the same shape; saved via `POST /hosts/{id}/repo/hooks`.
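> **Illustrative sketch (not from the codebase):** the "group → host default → empty" resolution and the `RM_JOB_STATUS` environment variable from P2R-11, reduced to two small functions. `resolveHook` and `runPostHook` are hypothetical names, and only the POSIX `/bin/sh -c` path is shown.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// resolveHook prefers the source group's hook, then the host default,
// and otherwise returns the empty string (no hook).
func resolveHook(groupHook, hostDefault string) string {
	if groupHook != "" {
		return groupHook
	}
	return hostDefault
}

// runPostHook runs the script through /bin/sh -c with RM_JOB_STATUS set
// and returns its combined output. An empty script is a no-op.
func runPostHook(script, jobStatus string) (string, error) {
	if script == "" {
		return "", nil
	}
	cmd := exec.Command("/bin/sh", "-c", script)
	cmd.Env = append(os.Environ(), "RM_JOB_STATUS="+jobStatus)
	out, err := cmd.CombinedOutput()
	return string(out), err
}

func main() {
	hook := resolveHook("", `echo "status=$RM_JOB_STATUS"`)
	out, err := runPostHook(hook, "failed")
	fmt.Printf("%q %v\n", out, err)
}
```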
### Bandwidth + niceties (rehomed onto host + source groups) ✅
- [x] **P2R-13** (S) Bandwidth limit fields. `restic.Env` gains `LimitUploadKBps`/`LimitDownloadKBps`, emitted as `--limit-upload`/`--limit-download` global flags before the subcommand on every invocation. Agent dispatcher tracks host-wide caps received via `config.update`; server pushes them on hello and after `PUT /api/hosts/{id}/bandwidth`. Per-job override on the per-source-group Run-now form (collapsed `<details>` "Limit bandwidth for this run" with two KB/s inputs); override wins over host caps.
- [x] **P2R-14** (S) Schedule "next run" / "last run". New `store.LatestJobBySchedule` query. Schedules tab grows two columns (Next derived from cron via `robfig/cron/v3.Parse(...).Next`, Last from latest `actor_kind=schedule` job). Dashboard host row prepends `next 12h ago/from now` when a single covering schedule is the run-now candidate.
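> **Illustrative sketch (not from the codebase):** how the P2R-13 bandwidth caps might turn into restic arguments, with the global flags placed before the subcommand and the per-run override winning over the host cap. `effectiveLimit` and `resticArgs` are hypothetical helpers, and treating `0` as "no limit" is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// effectiveLimit returns the per-run override when set, else the host cap.
// A value of 0 means "no limit" in this sketch.
func effectiveLimit(hostCap, override int) int {
	if override > 0 {
		return override
	}
	return hostCap
}

// resticArgs emits --limit-upload / --limit-download before the subcommand.
func resticArgs(upKBps, downKBps int, sub string, rest ...string) []string {
	var args []string
	if upKBps > 0 {
		args = append(args, "--limit-upload", strconv.Itoa(upKBps))
	}
	if downKBps > 0 {
		args = append(args, "--limit-download", strconv.Itoa(downKBps))
	}
	args = append(args, sub)
	return append(args, rest...)
}

func main() {
	up := effectiveLimit(1024, 256) // override wins
	down := effectiveLimit(0, 0)    // no cap at all
	fmt.Println(resticArgs(up, down, "backup", "/etc"))
}
```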
### Cross-platform + alt-enrolment ✅
- [x] **P2-16** (M) Windows service integration: `internal/agent/service` (build-tagged) implements `svc.Handler`; new `restic-manager-agent install|uninstall|start|stop|run` subcommands wrap the SCM via `golang.org/x/sys/windows/svc/mgr`. Cross-compile verified (`GOOS=windows GOARCH=amd64 go build ./cmd/agent`); **untested on Windows itself** — Linux CI can't exercise the SCM round-trip.
- [x] **P2-17** (M) `install.ps1` (Windows): pwsh installer that detects arch, downloads `$Server/agent/binary?os=windows&arch=amd64`, runs the agent in `-enroll-server` (+ optional `-enroll-token`) mode (token flow OR announce-and-approve), then registers the service via `restic-manager-agent install`. Surfaces existing scheduled tasks named `*restic*` without disabling. Served by the existing `GET /install/*` handler; restage block in CLAUDE.md updated.
- [x] **P2-18** (L) Announce-and-approve enrolment (second enrolment mode):
- Agent run with no `RM_TOKEN` generates a local Ed25519 keypair (persisted alongside the encrypted secrets blob), then `POST /api/agents/announce` with `{hostname, os, arch, agent_version, restic_version, public_key}`. Server stores a `pending_hosts` row (`public_key`, `fingerprint = sha256(public_key)`, `announced_from_ip`, `first_seen_at`, `last_seen_at`, `expires_at = now+1h`). Hostname collisions with existing or other pending rows are flagged in the response so the install script can warn loudly on the endpoint terminal.
- Agent then opens a long-poll/WS to `/ws/agent/pending` authenticated by signing a server-issued nonce with its private key — proves possession of the key tied to the pending row. Connection stays open; agent waits.
- Install script prints the fingerprint on the endpoint's terminal in a copy-friendly form (e.g. `SHA256:ab12…cd34`) and tells the operator to compare it to the one shown in the UI before clicking accept.
- UI: new "Pending hosts" panel on the dashboard. Admin sees fingerprint, hostname, source IP, OS/arch, time announced. Buttons: **Accept** (mints persistent bearer + repo creds, pushes both down the open pending socket, promotes pending row → real Host row, audit-logged) / **Reject** (deletes pending row, closes the socket with a clean error). Fingerprint is the load-bearing field — UI must make comparison easy (large monospace, one-click copy).
- Server-side guards: per-source-IP rate limit on `/api/agents/announce` (token-bucket, e.g. 10/min); global cap on pending rows (e.g. 100); pending rows auto-expire after 1h; duplicate-hostname pending rows allowed but visually flagged in UI; accepting one does **not** auto-reject the others (admin sees them all and decides — defends against the "attacker announces first, real host second" race).
- Token-based enrollment (Phase 1) remains the default and is unchanged; announce-and-approve is opt-in for interactive installs. Docs explicitly call out that the fingerprint comparison step is what makes this flow safe — without it, this is no better than trusting `hostname` over the wire.
> **As shipped:** migration 0011 + `store/pending_hosts.go` cover the table.
> `POST /api/agents/announce` (rate-limited 10/min/IP, global cap 100 in-flight rows)
> returns `{pending_id, fingerprint, hostname_collision}`. `GET /ws/agent/pending`
> runs the Ed25519 nonce-sign handshake. Admin POSTs to
> `/api/pending-hosts/{id}/accept|reject` (audit-logged as
> `host.accept_pending`/`host.reject_pending`). Dashboard panel renders the queue
> with a copyable fingerprint + inline accept form (URL/user/password). 60s
> server ticker sweeps expired rows. Agent: `cmd/agent/announce.go` mints +
> persists an Ed25519 keypair into `agent.yaml`'s `announce_key` field; runs
> automatically when `-enroll-server` is supplied without `-enroll-token`. The
> install scripts haven't been updated to surface the printed fingerprint
> beyond the agent's own banner — the operator reads it from the install
> script's stdout.
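>
> **Illustrative sketch (not from the codebase):** the fingerprint and nonce-sign steps of P2-18 using only the standard library. `fingerprint = sha256(public_key)` is from the task text; rendering it as `SHA256:` plus lowercase hex is an assumption, as are the function names and the 32-byte nonce.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprint renders sha256(public_key) in a copy-friendly form.
// Hex encoding is an assumption; the source only shows "SHA256:ab12…cd34".
func fingerprint(pub ed25519.PublicKey) string {
	sum := sha256.Sum256(pub)
	return "SHA256:" + hex.EncodeToString(sum[:])
}

// demoHandshake signs a server-issued nonce with the agent key and checks
// that a signature from a different key is rejected.
func demoHandshake() (string, bool, bool, error) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return "", false, false, err
	}
	nonce := make([]byte, 32)
	if _, err := rand.Read(nonce); err != nil {
		return "", false, false, err
	}
	genuine := ed25519.Verify(pub, nonce, ed25519.Sign(priv, nonce))

	_, otherPriv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return "", false, false, err
	}
	forged := ed25519.Verify(pub, nonce, ed25519.Sign(otherPriv, nonce))
	return fingerprint(pub), genuine, forged, nil
}

func main() {
	fmt.Println(demoHandshake())
}
```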
### Phase 2 acceptance
- A host can be onboarded end-to-end with no manual REST: enrol → auto-init runs → operator opens host → creates source group(s) → attaches them to one or more schedules → schedule fires on time → backup runs against the right paths with the right retention → snapshots tagged by group name appear in UI.
- Operator can hit Run-now per source group from any of: dashboard row (single-group host), source group row, snapshot empty-state.
- Server-side maintenance ticker drives forget/prune/check at the configured cadences, independent of agent cron. Offline hosts queue to `pending_runs` and drain on reconnect.
- Pre/post hooks fire correctly per source group, fail loudly on `pre_hook` errors, and run `post_hook` with `RM_JOB_STATUS`. Hooks are rejected on non-backup kinds.
- Bandwidth limits honoured (host-wide default + per-run override).
- A Windows host can enrol, appear in the dashboard, and run a backup with live log streaming. **Not validated in CI:** Linux runners cannot exercise the SCM round-trip; the `service_windows.go`/`install.ps1` pieces compile cleanly under `GOOS=windows GOARCH=amd64` but the first real Windows install will be the first end-to-end test.
- A Linux host can enrol via announce-and-approve, with fingerprint-comparison gate enforced. Rate-limit + pending-cap guards verified.

---

## Phase 3 — Restore, alerts, audit
> Phase 3 is split into three independently-shippable sub-phases:
> **Restore** (P3-01..03 + P3-09 + P3-X1 cancel + P3-X2 tree-list RPC),
> **Alerts** (P3-05..07), **Audit UI** (P3-08). Each sub-phase has its own
> spec → plan → implement cycle; we hand back at sub-phase boundaries.
>
> P3-04 (cross-host restore) was de-scoped during the Phase-3 brainstorm
> on 2026-05-04: disaster recovery is already covered by re-enrolling a
> replacement host with the same repo creds (snapshots reappear, restore
> is same-host). The remaining "pull a file from host A onto host C
> without giving C permanent access" use case is genuinely different and
> doesn't have a confirmed need yet, so it's moved to the **Future /
> unscheduled** section at the end of this file.
### Phase 3 — Restore ✅
> Spec: `docs/superpowers/specs/2026-05-04-p3-restore-design.md`.
> Wireframe: `_diag/p3-restore-wizard/wireframe.html`.
> Sweep screenshots: `_diag/p3-restore-sweep/`.
> Shipped on branch `p3-restore`.
- [x] **P3-X1** (S) Cancel-job feature. `command.cancel` WS envelope; agent tracks per-job ctx.CancelFunc and kills the running `restic` subprocess via context cancel (SIGTERM, SIGKILL after 5s grace via `cmd.Cancel` + `cmd.WaitDelay`); server endpoint `POST /api/jobs/{id}/cancel` bridges UI → WS; the existing UI Cancel button on `/jobs/{id}` is now real for any running kind. Sandbox-aware: `internal/restic/cancel_{unix,windows}.go` build-tags pick SIGTERM on POSIX vs `os.Kill` on Windows (which can't deliver SIGTERM). Tests: cancel mid-run via 'sleep 30' fake-restic returns JobCancelled with exit 130 in <200ms.
- [x] **P3-X2** (S) Tree-list synchronous WS RPC. `MsgTreeList` ↔ `MsgTreeListResult` with `Envelope.ID` correlation; generic `Hub.SendRPC` helper (registry of buffered channels keyed by ULID, ctx-cancel + timeout aware). `internal/restic.ListTreeChildren` wraps `restic ls --json` and filters its recursive output to direct children. Server-side `treeCache` is per-wizard-session (keyed by session cookie + host + snapshot + path) with a 30-min TTL and lazy sweep.
- [x] **P3-01** (L) Restore wizard backend (`internal/server/http/ui_restore.go`). GET handlers render the four-step wizard against the wireframe. HTMX/fetch tree partial endpoint hits `fetchTreeWithCache`. POST validates: snapshot_id, ≥1 absolute path, in-place ⇒ confirm_hostname == host name, agent online; on error re-renders with operator's input intact. Happy path mints job_id, target = `/var/lib/restic-manager/restore/<job-id>` (server-picked, agent's writable dir under the systemd sandbox's `ReadWritePaths`), creates job row, ships `command.run` with `RestorePayload`, writes `host.restore` audit row, returns HX-Redirect (or 303) to the live job page.
- [x] **P3-02** (L) Wizard UI templates (`web/templates/pages/host_restore.html` + `partials/tree_node.html`). Single-page progressively-enabled four-step form. Form-state-driven JS computes a running tally + step-4 confirm summary client-side. Tree expansion uses plain fetch (not HTMX) for simpler target lookup; loaded-state cached per node. Top-level Restore button on host detail right rail + per-snapshot Restore action on snapshot rows. New `.snap-row` token in `web/styles/input.css`.
- [x] **P3-03** (M) Restore execution. `restic.RunRestore` builds `restore <sid> --target <dir> [--include p]...` with --json; new `pumpRestoreStdout` parses status + summary objects. `--no-ownership` is gated on the agent's restic version via `Env.AtLeastVersion(0, 17)` — the flag was added in 0.17 and 0.16 rejects it. Restic version is threaded through `runner.Config.ResticVersion` from the agent's sysinfo snapshot. New-dir target is operator-editable (default `$HOME/rm-restore/<job-id>/`); agent expands `$HOME` / `${HOME}` / `~/` at run time and calls `os.MkdirAll` on the target chain so the operator never has to pre-create the per-job subdir. `runner.RunRestore` translates `RestoreStatus` into `job.progress` (mapping FilesRestored → FilesDone, etc.); agent dispatcher case `JobRestore` reuses the `spawn()` helper from P3-X1 so cancel works. Restore-shaped job-detail variant with current-file display under the progress bar.
- [x] **P3-09** (S) `diff` between two snapshots. `JobDiff` JobKind + `restic.RunDiff` + `runner.RunDiff`; `POST /api/hosts/{id}/snapshots/diff` (and HTMX-form variant on the unprefixed path) dispatcher with two-snapshot guard + per-host snapshot-list validation; UI panel on host detail right rail (visible when 2+ snapshots) with two short-id inputs + Diff button. Output streams as log.stream to the standard live job log page.
- [x] **P3-X3** (S) Recent-restores line on host detail. `hostChromeData` grows `RestoreStatus` / `RestoreAt` / `RestoreJobID` populated via `store.LatestJobByKind(host_id, 'restore')` (already exists from P2R). `host_chrome.html` renders a small line below the init-status one with status-coloured copy + a link to the job log. Hidden when no restore has ever run on this host.
- [x] **P3-X4** (S) Job log download (txt + ndjson). New `GET /api/jobs/{id}/log.{txt|ndjson}` endpoint backed by the persisted `job_logs` table — works any time (running or finished) without pausing the live WS stream because the source is the DB, not the live socket. Plain-text format mirrors the on-screen "HH:MM:SS.mmm TAG payload" shape with a small `# job ... · kind ... · status ...` header; ndjson emits one self-contained `{seq,ts,stream,payload}` JSON object per line for `jq` / tooling. Surfaced as a single header dropdown on the live job page (`details/summary`-driven, native keyboard support, click-outside-to-close). New reusable `.dropdown` / `.dropdown-menu` / `.dropdown-item` tokens in `web/styles/input.css`.
- [x] **P3-X5** (S) UK lint locale + sweep. `.golangci.yml` misspell locale switched US → UK and the codebase swept (~73 corrections — behaviour, serialise, recognise, honour, initialise, enrol, unauthorised, etc.). Wire `ErrorCode` value `"unauthorized"` → `"unauthorised"` is a tiny contract change but the agent doesn't parse those codes today and no external clients exist yet.
- [x] **P3-X6** (S) Snapshot SIZE/FILES tooltip on host detail. The per-snapshot summary block was added by restic 0.17 (the source comment in `internal/restic/snapshots.go` incorrectly said 0.16+); on 0.16 hosts the columns render `—`. `hostDetailPage.LegacyRestic` (computed via `Env.AtLeastVersion(0, 17)`) drives a `title="Needs restic 0.17+ on the agent host. This host runs <ver>."` + `cursor: help` on the column headers, hidden once the host upgrades.
> **Migration 0012** widens the `jobs.kind` CHECK constraint to include `restore` and `diff`. Rebuild required (SQLite can't ALTER CHECK in place); follows the safe pattern from 0005, with a defensive temp-table backup of `job_logs` so the cascade-trap that bit migration 0007 wouldn't take the log history with it.
>
> **install.sh + systemd unit:** the install script now pre-creates `/root/rm-restore` (root-owned 0700) so the default new-dir restore target works under the sandbox out of the box; the unit's `ReadWritePaths` gains `-/root/rm-restore` (soft-fail prefix). Existing installs need a re-run of `install.sh` to pick up the new dir; new operator-typed targets are auto-created by the agent at job time.
>
> **As shipped (Playwright sweep against the live smoke env, 2026-05-04):** login → host detail → Restore button → wizard step 1 picks snapshot a1ac4006 (most recent) → tree drill-down `/home/steve/test` (3 lazy loads) → tick `file1` + `file2` → step 4 confirm summary populated → dispatch → live job page with running progress widget → restore succeeds, files land on disk at `/root/rm-restore/<job-id>/home/steve/test/file{1,2}` (default `$HOME/rm-restore/<job-id>/` after agent-side expansion). Custom-target restore to `/tmp/custom-restore/<job-id>/` lands inside the agent's `PrivateTmp` namespace. Snapshot diff between `a1ac4006` and `5f78c788` → diff job page, statistics output streamed (738 bytes added, 0 removed). Recent-restores line on host detail reads "last restore · succeeded 28s ago · job log →". Download dropdown serves both `.txt` and `.ndjson` with correct `Content-Type` + `Content-Disposition`. SIZE/FILES tooltip "Needs restic 0.17+ on the agent host. This host runs 0.16.4." renders on column hover.
### Phase 3 — Alerts ✅
- [x] **P3-05** (M) Alert engine: rule evaluation loop (failed backup, stale schedule, agent offline, check failed)
- [x] **P3-06** (M) Notification channels: webhook, ntfy, SMTP email
- [x] **P3-07** (S) Alert UI: list, acknowledge, resolve
> **As shipped (Playwright sweep, 2026-05-04):** /settings/notifications → 3 channels created (sweep-webhook → local Python sink, sweep-ntfy → ntfy.sh public topic, sweep-smtp → MailHog at 127.0.0.1:1025). Test buttons fire alert.test on each: webhook 200/1ms, ntfy 200/322ms, SMTP 250/3ms. Synthetic critical `backup_failed` raised → /alerts shows row with severity dot, kind chip, host, message, raised/last-seen, Ack + Resolve buttons; nav badge `1`; dashboard critical-alert banner appears with Review→ link; OPEN ALERTS card reads `1 unresolved`. Acknowledge → fan-out to all 3 channels emits alert.acknowledged (verified in webhook sink, MailHog inbox, notification_log); Acknowledged tab shows row with `ack'd by <user>` line. Resolve → fan-out emits alert.resolved across all 3 channels; banner clears; dashboard reads `0 unresolved · all clear`; host alerts column reads —. Three live bugs found and fixed mid-sweep: (a) `enabled` form value lost because hidden+checkbox both named `enabled` and `PostForm.Get` returned the first ("0"); (b) Ack/Resolve handlers stored the state change but never dispatched alert.acknowledged / alert.resolved; (c) `hosts.open_alert_count` projection was never recomputed on Raise/Resolve/AutoResolve, so the dashboard count always read 0.
### Phase 3 — Audit log UI ✅
- [x] **P3-08** (S) Audit log UI with filters (user, action, target, time range)
> **As shipped (2026-05-05):** Read-only `/audit` page (+ `/api/audit` JSON). Filters: time-range presets (24h / 7d / 30d / all), user dropdown (any registered user), actor dropdown (user / agent / system), target-kind dropdown (host / schedule / source_group / alert / notification_channel / job / user), action substring search box. Table columns: when (relative + abstime tooltip), actor tag (user accent / agent green / system grey), user (or em-dash for system rows), action string, target (kind · resolved name for hosts, kind · id otherwise), payload `<details>` block when non-empty. New `Store.ListAudit(AuditFilter)` and `Store.DistinctAuditActions` plus `Store.ListUsers`. Append-only — no edit/delete surface, deliberately.
### Phase 3 acceptance
- A file deleted on a host can be restored from the UI in under 2 minutes via the wizard at `/hosts/{id}/restore`; the operator can cancel a running restore (or any other running job) from the live job page. Snapshot diff between two snapshots renders as a normal job page.
- A failed backup raises an alert via the configured channel within 60s.
- The audit-log UI lets an admin filter by user / action / target / time range.

---

## Phase 4 — RBAC, OIDC, host tags
- [x] **P4-03** (M) RBAC enforcement at API layer (admin / operator / viewer)
- [x] **P4-04** (S) User management UI (create/edit/disable, role assignment, password reset)
> **As shipped (2026-05-05):** Three-role hierarchy (admin > operator > viewer) enforced via chi route-group middleware (`requireRole`). Admin is the fail-closed default; agent endpoints stay on the bearer-token chain. Sessions re-validate `disabled_at` on every authenticated request — admin-driven changes (disable, force-logout) land immediately.
>
> **Setup-token flow** replaces temp passwords. Admin clicks `+ Add user`, picks username + email + role, server returns a one-time setup link valid for 1 hour (sha256-hashed at rest, raw shown to admin once). User clicks the link → sets a password (≥12 chars) → drops a session → lands on `/`. `/settings/users/{id}/regenerate-setup` issues a new link, replacing the old via INSERT OR REPLACE. Expired tokens are swept on the alert engine's 60s tick.
>
> **Disable-only lifecycle** — soft delete via `disabled_at`. Last-admin guard rejects "disable last admin" and "demote last admin to non-admin" (both server-side and UI-hinted). Re-enable on disabled-username collision: admin trying to add a name that matches a disabled user is redirected to that user's edit page rather than 409'd.
>
> **Self-service password change** at `/settings/account` available to any role. Skips current-password check when `must_change_password` is set so admin-initiated resets work without surfacing a credential the user doesn't know.
>
> **Schema:** migration 0017 adds `email`, `disabled_at`, `must_change_password` plus a UNIQUE INDEX on LOWER(username) (lowercase normalisation in Go on every CreateUser); 0018 adds `user_setup_tokens`. Both column-level ALTERs per CLAUDE.md preference. Email is metadata only in v1 (no SMTP-the-link); the SMTP channel infrastructure from P3-06 makes that a one-page follow-up.
>
> **Sweep verified (smoke env):** admin adds operator → setup link generated → curl-as-new-user fetches /setup (200, page shows username) → POSTs password → 303 to / + Set-Cookie → operator authenticated → 200 on /, 200 on /settings/account, **403 on /settings/users** (admin-only) → admin disables user → operator's next request is **401** + session row count drops to 0 → audit log shows `user.created` + `user.setup_completed` for the cycle. All 26 implementation tasks landed; full `go test ./...` green.
- [x] **P4-05** (L) OIDC login (generic provider config, group → role mapping)
> **As shipped (2026-05-05):** Authorization Code + PKCE (S256) against any OIDC IdP advertising standard discovery. Config is YAML+env (`oidc.issuer`, `oidc.client_id`, `oidc.client_secret`/`_file`, `oidc.role_claim` default `groups`, `oidc.role_mapping`, `oidc.display_name`, `oidc.redirect_url`); empty issuer → OIDC disabled, no routes mounted. Migration 0019 adds `users.auth_source`/`oidc_subject` (partial unique index on `oidc_subject`), `sessions.id_token`, and a small `oidc_state` table for state+verifier round-trip (cleaned up every alert tick, 5 min TTL). Login page renders **Sign in with `<display_name>`** above the local form when OIDC is enabled; the SSO button kicks off a 303 to the IdP with state + S256 code_challenge persisted server-side. Callback verifies ID token, fetches `/userinfo` to merge claims (Authelia / many IdPs only put `sub` in the ID token and surface `preferred_username`/`email`/`groups` from userinfo), maps the first matching group to a role; **no match → deny banner**, no row created, audit `user.oidc_login_blocked`. Username-collision with an existing local user → same deny path with `username_taken`. New user → JIT-provisioned with `auth_source='oidc'`, `oidc_subject=<sub>`, `password_hash=''`. Returning user → looked up by `oidc_subject` (stable when usernames change at the IdP), role + email refreshed on every login. Local password login is rejected for `auth_source='oidc'` users. Logout posts to `/logout` and, when the IdP advertised `end_session_endpoint`, follows up with RP-initiated logout (carries `id_token_hint` + `post_logout_redirect_uri=BaseURL`); when not advertised (Authelia in our smoke env), the local session is cleared and the browser lands on `/login`. Users list shows a small **oidc** chip beside enabled/disabled; the edit page disables username/email/role for OIDC users (server-side guard mirrors UI, returns 403). Force-logout, disable, and the last-admin guard from P4-04 all still apply. 
> **Live Authelia sweep verified all four paths against `https://auth.example.invalid`:** rm-admin → admin role + JIT row + chip + readonly edit; rm-operator → operator JIT, 403 on `/settings/users`; rm-viewer → viewer JIT, 403 on `/hosts/new`; rm-other (group not in role_mapping) → no_role_match banner, no row created, audit logged. Returning rm-admin login resolved to the same row by sub. Screenshots in `_diag/p4-05-sweep/`. Out of scope and on the Phase 6 candidate list: refresh tokens, back-channel logout, multiple providers, post-login PKCE for the cookie itself.
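
> **Sketch (illustrative, not the shipped code):** the two mechanical pieces of the OIDC flow above are the PKCE S256 challenge (RFC 7636: base64url without padding of the SHA-256 of the verifier) and the group-to-role lookup. The function names are hypothetical, and "first matching group" is read here as first in claim order, which this file does not confirm.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// newVerifier returns a random PKCE code_verifier (43 base64url characters).
func newVerifier() (string, error) {
	b := make([]byte, 32)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(b), nil
}

// challengeS256 derives the code_challenge sent to the IdP.
func challengeS256(verifier string) string {
	sum := sha256.Sum256([]byte(verifier))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

// roleForGroups returns the role of the first group that has a mapping.
// ok is false when nothing matches, which corresponds to the deny path.
func roleForGroups(groups []string, mapping map[string]string) (role string, ok bool) {
	for _, g := range groups {
		if r, found := mapping[g]; found {
			return r, true
		}
	}
	return "", false
}

func main() {
	verifier, err := newVerifier()
	if err != nil {
		panic(err)
	}
	fmt.Println("code_challenge:", challengeS256(verifier))

	// Group names here are made up for the example.
	mapping := map[string]string{"backup-admins": "admin", "backup-viewers": "viewer"}
	fmt.Println(roleForGroups([]string{"staff", "backup-viewers"}, mapping))
}
```
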
- [x] **P4-07** (S) Per-host tags + dashboard filtering by tag
> **As shipped (2026-05-05):** Tag column already existed on the hosts schema (JSON array, round-tripped through the Host struct since Phase 1) but had no edit UI or filter. Added `Store.SetHostTags` + `Store.DistinctHostTags` (the latter via `json_each` for autocomplete + chip-row population). Inline editor on the host detail header: `+ tag` button reveals a comma-separated input with `<datalist>` autocomplete from the fleet's distinct tags; submit lowercases / trims / dedupes server-side. Tag chips on the host header link to the dashboard pre-filtered. Dashboard chip-row above the hosts table — `All / <tag1> / <tag2> …` with the active chip highlighted via a new `.tag-active` style; `?tag=foo` filters the list with the count showing `N of M`. Operator-band POST `/hosts/{id}/tags` audited as `host.tags_updated`.
### Phase 4 acceptance
- Non-admin users see an appropriately limited UI. OIDC login works against at least one provider (Authelia or Authentik). Hosts can be tagged and the dashboard filters by tag.
> **Deferred to Phase 6** (2026-05-05): OSS readiness (Phase 5) was pulled ahead of these items so a working v1 ships sooner. Deferred: P4-01/02 (update delivery + agent-version tracking), P4-06 (repo size trends), P4-08/09 (Prometheus + Grafana). All of it is operator-experience polish; none of it gates getting the system into production.

---

## Phase 5 — OSS readiness
- [ ] **P5-01** (M) Documentation site (mdBook or similar) with install, concepts, security model, screenshots
- [ ] **P5-02** (S) `CONTRIBUTING.md`, `CODE_OF_CONDUCT.md`, issue + PR templates
- [x] **P5-03** (S) Release automation — **pivoted away from goreleaser/binary archives** on 2026-05-05 (spec: `docs/superpowers/specs/2026-05-05-p5-03-docker-only-release.md`). Single deliverable per tag: a multi-arch (linux amd64+arm64) server image, with cross-compiled agent binaries (linux amd64+arm64, windows amd64) + `install.sh` + `install.ps1` + the systemd unit baked under `/opt/restic-manager/dist/`. The `/agent/binary` and `/install/*` handlers fall back from `<DataDir>/...` to `<BundledAssetsDir>/...` so a fresh container Just Works. Workflow `.gitea/workflows/release.yml` triggers on `v*.*.*` tag-push (real release: fan-out `:vX.Y.Z`, `:X.Y`, `:X`, plus `:latest` once `MAJOR>=1`) and `workflow_dispatch` (snapshot: `:snapshot-<shortsha>` only). Pushed to the Gitea container registry on this instance — no external creds, no GHCR mirror. Cosign / SBOM / minisign / GHCR mirror deferred to Phase 6. Source builds via `make build` remain a first-class path.
- [ ] **P5-04** (S) Demo screenshots / short Loom walkthrough in README
- [ ] **P5-05** (S) `SECURITY.md` with disclosure process
- [ ] **P5-06** (M) End-to-end test suite in CI (Playwright vs. compose stack with sibling Linux agent)
- [x] **P5-07** (S) Reference deployment landed alongside P5-03. `deploy/docker-compose.yml` stands up *only* the server (image-pinned via `RM_VERSION`, named volume for operator state, bound to localhost) — TLS termination is left to whichever reverse proxy the operator already runs. `docs/reverse-proxy.md` documents the headers + WebSocket pass-through the proxy must forward, the `RM_TRUSTED_PROXY` CIDR rule, and worked examples for Caddy, nginx, and Traefik.
### Phase 5 acceptance
- A stranger can read the docs and stand up a working install in under 30 minutes.

---

## Phase 6 — Update delivery + observability
> Deferred from Phase 4 on 2026-05-05 — operator-experience polish that doesn't gate a working v1.
- [ ] **P6-01** (S) Agent self-update from the server's bundled binaries. P5-03 already bakes matching `agent-{linux-amd64,linux-arm64,windows-amd64}` into the server image under `/opt/restic-manager/dist/`, served by `/agent/binary`. Add a `restic-manager-agent update` subcommand (and a server-dispatched `command.update` WS envelope) that fetches `$RM_SERVER/agent/binary?os=…&arch=…`, verifies sha256 against a digest the server advertises alongside the binary, atomic-renames over the running binary (`tmp+fsync+rename`), and asks the service manager to restart (`systemctl restart` on Linux, SCM restart on Windows). Version pinning is automatic — the server only ever serves the agent that matches its own release. No apt repo, no Chocolatey, no third-party signing infra. _(Was P4-01; original apt/choco plan dropped after the P5-03 Docker pivot made the server the natural distribution point.)_
- [ ] **P6-02** (M) Agent version reporting + fleet update on dashboard. Server already knows its own build version and each agent's `agent_version` from the WS hello. Surface "N hosts behind" on the dashboard, a per-host "out of date" chip, and an admin-only **Update all** action that fans out `command.update` to every online host (offline hosts queue via `pending_runs`-style retry on reconnect). Per-host **Update** button on host detail for one-shot upgrades. Audit-logged. _(Was P4-02.)_
- [ ] **P6-03** (M) Repo size trend graphs (sparkline on host card, full chart on repo page). _(Was P4-06.)_
- [ ] **P6-04** (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list. _(Was P4-08.)_
- [ ] **P6-05** (S) Document Prometheus integration + sample Grafana dashboard JSON. _(Was P4-09.)_
### Phase 6 acceptance
- Agents upgrade from the server's bundled binaries (served by `/agent/binary`) with one admin-triggered action. Prometheus can scrape `/metrics` and the sample Grafana dashboard renders with live data. Repo size trend visible on host detail.

---

## Cross-cutting / ongoing
- [ ] **X-01** Keep CHANGELOG.md updated (Keep-a-Changelog format)
- [ ] **X-02** Track restic version compatibility matrix
- [ ] **X-03** Periodic dependency updates (`dependabot` or `renovate`)
- [ ] **X-04** Threat-model review at end of each phase
- [ ] **X-05** Proper first-run onboarding UI: admin shouldn't need to `curl` `/api/bootstrap` by hand. Render the bootstrap form on the same login page (extra "setup token" field shown only while no admin user exists, hidden after); on submit POST to `/api/bootstrap`, then drop straight into a session. Surface the one-time token from the server log somewhere copy-able (or print a clickable URL with the token in the query string at first-run). Also: relax the 12-char password floor for the first-run path or document it in the form so `admin` doesn't silently fail validation.

---

## Next steps from testing
> Bin for issues spotted while exercising a live deployment. Promote
> into a phase once scoped; leave here while still being collected.
- [x] **NS-01** Admin-driven host deletion. ✅ Landed: store `DeleteHost` (FK cascade revokes the agent bearer along with everything else), admin-band `POST /hosts/{id}/delete`, danger-zone form on host detail with hostname-confirm, audit `host.deleted`, live WS connection closed pre-delete. Original scope below for reference. No UI or API surface today — once a host is enrolled the only way to remove it is hand-editing SQLite, which then cascades through schedules/jobs/snapshots/source-groups via the FK chain. Needs: store-level `DeleteHost` + cascade audit, admin-band `DELETE /api/hosts/{id}` and form-post variant, confirm-modal on the host-detail page, audit entry, and a decision on whether to also revoke the agent's bearer (recommend: yes, so a re-installed host comes back through the normal pending-host accept flow).
- [x] **NS-02** Recoverable enrollment-token UX. ✅ Landed: `Store.ListOutstandingEnrollmentTokens` + `DeleteEnrollmentToken`; outstanding-tokens panel on the Add-host page (short hash, redacted repo URL, created/expires) with per-row Regenerate (revokes old hash, mints fresh raw token preserving repo creds + initial paths, 303s to `/hosts/pending/{newToken}`) and Revoke (delete + audit). Audit actions `enrollment_token.regenerated` / `enrollment_token.revoked`. Original scope below. Today `POST /hosts/new` mints a token and 303s to `/hosts/pending/{token}`; if the operator closes that tab the install snippet is lost and there's no UI surface to find it again — the row sits in `enrollment_tokens` until TTL expiry, invisible. Needs: store-level `ListOutstandingEnrollmentTokens` returning `(token_hash, created_at, expires_at, repo_url_redacted, initial_paths, attached_host_id_or_null)`; a small list section on the Add-host page (and/or Settings) showing outstanding tokens with created/expires-in and the redacted repo URL; admin-band `POST /api/enrollment-tokens/{id}/regenerate` (revokes the old hash, mints a fresh raw token, re-uses the original attachments — same pattern as the user-setup-token regenerate flow) and `POST /api/enrollment-tokens/{id}/revoke`. Choose regenerate over "show original token" because we only persist hashes, never raw tokens.
- [x] **NS-03** Auto-init repo on first onboard, surface credential failures eagerly. ✅ Landed: migration 0020 adds `hosts.repo_status` (`unknown`/`ready`/`init_failed`) + `repo_status_error`; WS handler projects every init job's terminal state onto the host row (with idempotent "config file already exists" → ready); creds-save handlers (UI + JSON API) reset status to `unknown` and dispatch a fresh init when the agent is online; new `/hosts/{id}/repo/probe` retry endpoint and a status banner on the repo page. Original scope below. Today the operator types repo URL + creds during Add-host and the credentials are pushed to the agent on connect, but no `restic init`/probe runs until the first scheduled job, so a typo in the password or a wrong URL goes undetected for hours/days, manifesting as a silent missed-backup. Wanted behaviour: when the host completes enrolment (or when an admin saves new repo creds), the server dispatches a one-shot probe job that runs `restic cat config` (cheap, repo-existence + creds-validity in one call). On `Is there already a config file? unable to open config file` → run `restic init`. On success → mark the host's repo as ready. On any other error (network, auth, fingerprint) → surface a panel-level error on the host detail page and audit the failure, leaving the host in an "init pending" state with a "Retry" button. Needs: a new `JobKind` (or piggyback on an existing one) for the probe, server-side state on the host row (`repo_status` enum: `unknown`/`ready`/`init_pending`/`init_failed`), UI panel that shows the state, and clear copy on the Add-host page so the operator knows the save isn't fire-and-forget.
- [x] **NS-04** Dashboard parity with the alerts screen: live refresh, column sorting, filters. ✅ Landed: `/` now parses `q`/`status`/`repo_status`/`tag`/`sort`/`dir` query params (round-trip durable for bookmarks); table is wrapped in an `id="hosts-table"` htmx live-poll matching the alerts cadence (5s, gated on `document.visibilityState` and `localStorage.rm-dashboard-live`); filter row above the table with hostname free-text + status + repo_status selects + tag chips + clear; column headers (Host / OS · arch / Last backup / Repo size / Snapshots) are clickable links that toggle direction on the active column; pure-Go sort+filter pipeline covered by `dashboard_filter_test.go`. Original scope below. The host list is currently a static render: operators have to reload to see new heartbeats / job state changes. Mirror the alerts pattern (`web/templates/pages/alerts.html` uses `hx-trigger="every 5s [document.visibilityState==='visible' && localStorage.getItem('rm-alerts-live')!=='off']"` plus a Live/Off toggle so background tabs and explicit-off don't burn server cycles). Add: server-side sort on every meaningful column (name, OS, last-backup time, last-backup status, agent online/offline, restic version, tags), and a small filter row above the table with, at minimum, free-text on hostname, status (online/offline/never-seen), and tag chips. Columns + filter state should round-trip through query string so a bookmarked / shared URL is durable. Re-use the `host_row` partial that already exists so the live-refresh swap is a clean OOB swap, not a full table re-render.

---

## Future / unscheduled
> Items here have a plausible use case but no confirmed need. They live
> outside numbered phases until a concrete trigger (a user request, a
> security review finding, a real disaster-recovery exercise) bumps them
> back into a phase.
- [ ] **F-02** API tokens (PATs) for automation. Today the only way to drive `/api/*` from a tool is to log in as a real user and reuse the `rm_session` cookie — fine for a single automation account, but bearer-equivalent for the 24h session TTL and not revocable per-tool. Build a proper personal-access-token feature: new `personal_access_tokens` table (id, user_id, sha256 hash, name, optional role cap, created_at, last_used_at, revoked_at), a `/settings/tokens` UI to mint/list/revoke, and a branch in `requireUser` that accepts `Authorization: Bearer …` and falls back to the cookie. Reuse `auth.NewToken()` / `auth.HashToken()` (same primitives used for agent bearers). Audit each mint/revoke. Trigger to promote: second automation consumer, or any external integration request.
- [ ] **F-01** ~~P3-04~~ Cross-host restore. De-scoped from Phase 3 on 2026-05-04. Disaster recovery is already covered: stand up a replacement host, paste the original repo creds at enrolment, snapshots reappear, restore is same-host. The remaining "pull a file from host A onto host C without granting C permanent access" use case is genuinely different (file sharing / migration, not DR) and hasn't been requested. Original spec language was: "target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root". Re-promote when there's a real ask.