# restic-manager — Tasks

Tasks are grouped by phase. Each task has an ID for cross-referencing, an estimated size (S/M/L), and acceptance criteria.

Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.

---

## Phase 0 — Project bootstrap
- [x] **P0-01** (S) Initialize Go module, `cmd/server`, `cmd/agent`, baseline `internal/` packages
- [x] **P0-02** (S) Add LICENSE (PolyForm Noncommercial 1.0.0), README stub, CONTRIBUTING placeholder
- [x] **P0-03** (S) Set up `golangci-lint`, `gofumpt`, `goimports`; pre-commit config
- [x] **P0-04** (S) ~~GitHub Actions~~ Gitea Actions: build matrix (linux amd64/arm64, windows amd64), unit tests, lint
- [x] **P0-05** (S) `Dockerfile.server` (multi-stage, distroless), `deploy/docker-compose.yml`
- [x] **P0-06** (S) Makefile / ~~`taskfile.yml`~~ with common targets (`build`, `test`, `run`, `release`)
---
## Phase 1 — MVP: enrollment, visibility, on-demand backup
### Server foundations
- [x] **P1-01** (M) HTTP server scaffolding (`chi`, structured logging via `slog`, graceful shutdown)
- [x] **P1-02** (M) SQLite store layer (`modernc.org/sqlite`) + migrations (hand-rolled, `embed.FS`)
- [x] **P1-03** (M) Schema for `users`, `sessions`, `hosts`, `repos`, `credentials`, `jobs`, `job_logs`, `snapshots`, `audit_log`
- [~] **P1-04** (M) Auth: argon2id password hashing, login/logout, session cookies; **CSRF middleware deferred to P1-23 (UI work)** — REST clients use bearer/session-only flows
- [x] **P1-05** (S) First-run admin bootstrap (printed one-time setup token in server logs)
- [x] **P1-06** (M) Secret encryption helper (AEAD with key from `RM_SECRET_KEY_FILE`)
- [~] **P1-07** (M) Audit log writer; middleware sweep for every state-changing endpoint **lands when the rest of the API surface does** — login / bootstrap / host.enrolled / job.run_now currently audited
### Agent ↔ server protocol
- [x] **P1-08** (M) Define shared API types in `internal/api` (envelopes, every WS message + `protocol_version` constants; JSON-shape tests pin the wire)
- [x] **P1-09** (L) WebSocket transport (`github.com/coder/websocket`), framed JSON envelopes, RPC correlation IDs, exponential-backoff reconnect on the agent side
- [x] **P1-10** (M) Enrollment flow: `POST /api/agents/enroll` with one-time token → returns persistent bearer. Cert pin field stays in the response shape but is left empty: the server is HTTP-only behind a reverse proxy, so the operator pastes the proxy's cert hash into the install command rather than having the server introspect a cert it doesn't terminate.
- [x] **P1-11** (M) Agent registration on connect (`hello` upserts agent_version/restic_version/protocol_version, flips status online, `protocol_too_old` rejection has clean error envelope)
- [x] **P1-12** (S) Heartbeat handler (touches `last_seen_at`; background sweeper marks hosts offline after 90s without one)
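
The offline sweep described in P1-12 could look roughly like the sketch below. The `host` struct and `sweep` function are hypothetical in-memory stand-ins for the real store layer; only the 90s window comes from the task text.

```go
package main

import (
	"fmt"
	"time"
)

// offlineAfter mirrors the 90s window named in P1-12.
const offlineAfter = 90 * time.Second

// host is a hypothetical in-memory stand-in for a row in `hosts`.
type host struct {
	Name       string
	LastSeenAt time.Time
	Online     bool
}

// sweep marks hosts offline when their last heartbeat is older than
// offlineAfter and returns the names whose state changed.
func sweep(hosts []*host, now time.Time) []string {
	var flipped []string
	for _, h := range hosts {
		if h.Online && now.Sub(h.LastSeenAt) > offlineAfter {
			h.Online = false
			flipped = append(flipped, h.Name)
		}
	}
	return flipped
}

func main() {
	now := time.Now()
	hosts := []*host{
		{Name: "fresh", LastSeenAt: now.Add(-30 * time.Second), Online: true},
		{Name: "stale", LastSeenAt: now.Add(-2 * time.Minute), Online: true},
	}
	fmt.Println(sweep(hosts, now)) // [stale]
}
```

Passing `now` in explicitly keeps the sweep testable without sleeping.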
### Agent foundations
- [x] **P1-13** (M) Agent config file (`/etc/restic-manager/agent.yaml`, atomic save: tmp+fsync+rename); Windows path deferred to Phase 2
- [x] **P1-14** (M) Service integration: systemd unit (sandboxed: NoNewPrivileges, Protect*, MemoryDenyWriteExecute) — Linux only in Phase 1
- [x] **P1-15** (M) Outbound WS client (`github.com/coder/websocket`) with reconnect (1s → 60s exponential + jitter), optional cert pin (sha-256 of leaf), heartbeats, `protocol_version` in hello
- [x] **P1-16** (M) Restic wrapper: locate via PATH or override, run with `--json`, scan stdout/stderr, parse `BackupStatus` + `BackupSummary`, exit-code 3 treated as success-with-issues
- [x] **P1-17** (S) Host metadata collection (OS, arch, hostname, restic version, agent version, protocol_version)
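
For the reconnect policy in P1-15 (1s → 60s exponential + jitter), a minimal sketch follows. The task text does not say how much jitter is applied, so the ±20% spread in `withJitter` is an assumption, as are the function names.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	baseDelay = 1 * time.Second  // first retry delay, per P1-15
	maxDelay  = 60 * time.Second // cap, per P1-15
)

// backoff returns the un-jittered delay for a 0-based attempt:
// 1s, 2s, 4s, ... capped at 60s.
func backoff(attempt int) time.Duration {
	d := baseDelay
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

// withJitter scales d into the range [0.8d, 1.2d). The 20% spread is an
// assumption for illustration; r must be in [0, 1).
func withJitter(d time.Duration, r float64) time.Duration {
	return time.Duration(float64(d) * (0.8 + 0.4*r))
}

func main() {
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Println(attempt, withJitter(backoff(attempt), rand.Float64()))
	}
}
```

Keeping the random source outside `backoff` makes the cap easy to check deterministically.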
### Run-now backup
- [x] **P1-18** (L) Job lifecycle: queued → running → succeeded/failed/cancelled, persisted with stats blob and error
- [x] **P1-19** (M) Server endpoint `POST /api/hosts/{id}/jobs` to dispatch a `backup` command (validates kind, checks online, audit-logs)
- [x] **P1-20** (M) Agent executes `restic backup`, streams stdout/stderr + parsed JSON events back as `job.progress` (1Hz throttle) / `log.stream`
- [~] **P1-21** (M) Server persists log stream to `job_logs` ✓; **WS `/api/jobs/{id}/stream` for live browser tailing** still TODO — needs the per-job fan-out hub
- [x] **P1-22** (S) Snapshot listing: agent calls `restic snapshots --json` after each successful backup and ships the projection over `snapshots.report`. Server `ReplaceHostSnapshots` atomically swaps the per-host list and updates `hosts.snapshot_count` in the same tx. Read endpoint: `GET /api/hosts/{id}/snapshots`. Tests cover round-trip, authoritative-replace semantics, and empty-after-prune. Schema dropped the unused `repo_id` FK from `snapshots` (repos as a first-class entity is P2 work).
### UI (HTMX + Tailwind)
- [x] **P1-23** (M) Base layout, login page, session-aware nav
- [x] **P1-24** (M) Dashboard: fleet summary tiles + host table (status dot + row accent + os/arch + last backup + repo size + snapshots + alerts + tags + run-now). Backed by `GET /api/hosts` + `GET /api/fleet/summary` (JSON) and a server-rendered HTML view. Empty state hands the operator the install command. HTMX `Run now` button posts to `/hosts/{id}/run-backup`.
- [x] **P1-25** (M) Host detail page (`/hosts/{id}`): persistent header (status dot + mono name + tags + OS/arch/agent/restic/last-seen), vitals strip (last backup / repo size / snapshots / open alerts), sub-tabs (Snapshots active; Jobs/Repo/Settings tabs visible but inert until P2), snapshot table (cap 50, pagination later), right-rail run-now stack (backup live; forget/prune/check/unlock disabled with P2 hints) and a danger-zone delete panel.
- [x] **P1-26** (M) Live job log viewer + WS browser fan-out hub (closes the P1-21 remainder). Browser opens `/api/jobs/{id}/stream`; agent-emitted `job.started`/`job.progress`/`log.stream`/`job.finished` are mirrored to subscribers. Per-subscriber buffered channel + non-blocking broadcast keeps a slow browser from blocking the agent's read loop. Page renders running / succeeded / failed states; auto-scrolls until the operator scrolls up; reloads on `job.finished` to show the final header. "Run now" sets `HX-Redirect` so the operator lands on the live log.
- [~] **P1-27** (M) "Add host" flow: form takes hostname + repo URL/username/password, mints token (TTL 1h), re-renders the same page in result-state with the install command (`RM_SERVER` + `RM_TOKEN` filled in), copy button, and an awaiting-agent panel. Encrypted repo creds ride on the token row (P1-32) and get pushed to the agent on first WS connect (P1-33). **Deferred:** one-click "download preconfigured installer" `install-<hostname>.sh` (cf. UrBackup Internet-mode push installer) — copy-paste covers it for v1.
- [x] **P1-28** (S) Tailwind build via `tailwindcss` standalone binary (no Node) — Makefile downloads pinned v3.4.17 into `bin/tailwindcss`, builds `web/styles/input.css` → `web/static/css/styles.css`, embedded into the binary via `web.FS`. `make build` runs Tailwind first.
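
The "per-subscriber buffered channel + non-blocking broadcast" idea from P1-26 can be illustrated with a small sketch. The `hub` type and its method names are hypothetical, and dropping the message for a full subscriber is an assumed policy; the task text only states that a slow browser must not block the agent's read loop.

```go
package main

import (
	"fmt"
	"sync"
)

// hub fans messages for one job out to browser subscribers.
type hub struct {
	mu   sync.Mutex
	subs map[chan string]struct{}
}

func newHub() *hub { return &hub{subs: make(map[chan string]struct{})} }

// subscribe registers a buffered channel of the given capacity.
func (h *hub) subscribe(buf int) chan string {
	ch := make(chan string, buf)
	h.mu.Lock()
	h.subs[ch] = struct{}{}
	h.mu.Unlock()
	return ch
}

// unsubscribe removes a channel from the fan-out set.
func (h *hub) unsubscribe(ch chan string) {
	h.mu.Lock()
	delete(h.subs, ch)
	h.mu.Unlock()
}

// broadcast never blocks: when a subscriber's buffer is full the message
// is dropped for that subscriber only. It returns the number of drops.
func (h *hub) broadcast(msg string) (dropped int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for ch := range h.subs {
		select {
		case ch <- msg:
		default:
			dropped++
		}
	}
	return dropped
}

func main() {
	h := newHub()
	slow := h.subscribe(1)
	fmt.Println(h.broadcast("job.started"))  // 0
	fmt.Println(h.broadcast("job.progress")) // 1 (slow's buffer is full)
	fmt.Println(<-slow)                      // job.started
}
```

The `select` with a `default` branch is what keeps the sender from waiting on any single reader.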
### Install scripts
- [x] **P1-29** (M) `install.sh` (Linux): detects arch, downloads agent, installs systemd unit, enrolls. Detects existing restic timers / `/etc/cron.{d,daily,hourly,weekly}/*` / root crontab and prints them with the exact disable commands — does **not** auto-disable
- [~] **P1-31** (S) Server endpoint to serve agent binaries + install scripts ✓ (`/agent/binary` + `/install/*`); **signature verification** deferred to Phase 5 OSS readiness
### Repo credentials (pulled forward from Phase 2)
- [x] **P1-32** (M) Server-side encrypted repo creds carried on the enrollment token:
- `POST /api/enrollment-tokens` body grows `repo_url`, `repo_username`, `repo_password` (all required).
- Token row stores them as one AEAD-encrypted blob (existing `crypto.AEAD`); `ConsumeEnrollmentToken` moves the blob to a new `host_credentials` row keyed by `host_id` in the same tx.
- `PUT /api/hosts/{id}/repo-credentials` (admin/operator) re-encrypts and replaces the row, emits an in-memory event to the WS hub.
- `GET /api/hosts/{id}/repo-credentials` returns the redacted view (URL + username + `has_password`) so the UI can pre-fill the edit form. Password never leaves the server outside the WS push.
- On WS `hello`, server pushes a `config.update` with decrypted creds **before** returning the connection to idle. Same path on edit-while-connected.
- Audit-logged on create / consume / edit; payload omits the secret material.
- [x] **P1-33** (M) Agent-side encrypted secrets store:
- New `internal/agent/secrets` package: AEAD blob at `/var/lib/restic-manager/secrets.enc`, atomic write (tmp+fsync+rename, mode 0600).
- Per-host 32-byte secrets key minted at enrollment, persisted in `agent.yaml` (already 0600 root-only — same trust boundary as the bearer; explicit comment in the file).
- Strip `repo_url` / `repo_password` from `agent.config.Config`. Agent loads creds from `secrets.enc` at startup; `config.update` handler writes through to the file.
- Dispatcher reads from the secrets store on every job rather than from in-memory config.
- Migration path: if `agent.yaml` still contains `repo_url`/`repo_password`, copy them into `secrets.enc` on next start, then strip from the YAML on save.
- [x] **P1-34** (S) End-to-end smoke runbook: `docs/e2e-smoke.md` walks through enrollment with repo creds → agent receives them via push-on-connect → run-now backup completes against a real `restic/rest-server` in a sibling container → host appears with snapshot count. Test-driven version (Playwright + compose) deferred to P5-06.
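
The "atomic write (tmp+fsync+rename, mode 0600)" pattern used for `agent.yaml` (P1-13) and `secrets.enc` (P1-33) can be sketched as below. This is a minimal illustration using only the standard library; `atomicWrite`, `writeAndSync`, and `roundTrip` are hypothetical names, not the project's actual helpers.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// atomicWrite replaces path with data using tmp+fsync+rename, so readers
// see either the old file or the complete new one.
func atomicWrite(path string, data []byte, perm os.FileMode) error {
	// The temp file lives in the target directory so the rename stays on
	// one filesystem.
	tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
	if err != nil {
		return err
	}
	// Cleans up on failure; fails harmlessly after a successful rename.
	defer os.Remove(tmp.Name())

	if err := writeAndSync(tmp, data, perm); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// writeAndSync sets the mode, writes the payload, and flushes it to disk
// before the rename makes it visible.
func writeAndSync(f *os.File, data []byte, perm os.FileMode) error {
	if err := f.Chmod(perm); err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		return err
	}
	return f.Sync()
}

// roundTrip is a demo helper: write into a throwaway directory, read back.
func roundTrip(payload string) (string, error) {
	dir, err := os.MkdirTemp("", "atomic-demo-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "secrets.enc")
	if err := atomicWrite(path, []byte(payload), 0o600); err != nil {
		return "", err
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	got, err := roundTrip("ciphertext-bytes")
	fmt.Println(got, err)
}
```

A stricter variant would also fsync the parent directory after the rename; whether the agent does so is not stated in the task list.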
### Phase 1 acceptance
- One Linux host can enroll, appear in the dashboard, and a backup can be triggered from the UI with live log streaming. Snapshots list updates after success.
- Windows binary builds cleanly in CI (`.gitea/workflows/ci.yml`) but is not service-tested or installer-shipped in Phase 1 — that lands in Phase 2 (P2-16, P2-17).
- Agent ↔ server `protocol_version` handshake rejects mismatched versions with a clear error rather than failing on JSON parse.
- Repo credentials never appear in plaintext on disk: server stores them AEAD-encrypted, agent stores them AEAD-encrypted, the wire carries them only inside the authenticated WS as `config.update`.
---
## Phase 2 — Scheduling, retention, repo operations
> **Mid-phase pivot: "P2 redesign" (commits `7a7cac5`, `666af41`, `5667cdf`).**
> The original P2 plan put paths/excludes/retention/manual/kind/options on
> `Schedule` and one repo per host. After landing P2-01..P2-05 against that
> shape, the data model was rewritten: schedules are slim (cron + which
> `source_groups`); paths/excludes/retention/retry live on `source_group`
> (also doubles as the snapshot tag); forget/prune/check cadences live on
> `host_repo_maintenance` and run on a server-side ticker, not the agent
> cron; `pending_runs` queues offline retries; `host.repo_initialised_at`
> is gone (auto-init at enrolment). The redesign is captured below as
> `P2R-NN` items. Items P2-01..P2-05 stay marked done because the work
> shipped, but they're labelled ⚠️ **shipped against old shape; behaviour
> to be re-validated under P2R-02 after UI rewire**. P2-04.5 (`manual`
> flag) is dropped wholesale. P2-06..P2-15 are reframed below to point at
> their new homes; P2-16/17/18 are unaffected by the redesign.
### Original P2 work — shipped (against pre-redesign shape)
- [x] ⚠️ **P2-01** (M) Schedule schema + CRUD API (fat-schedule shape) — superseded by P2R-01.
- [x] ⚠️ **P2-02** (L) Server-pushed schedule reconciliation (fat-schedule shape) — re-validate under P2R-01.
- [x] ⚠️ **P2-03** (M) Agent local scheduler (`internal/agent/scheduler`, `robfig/cron/v3`, `schedule.fire` envelope, `dispatchScheduledJob`). The cron loop + ack/fire round-trip stay; the payload it carries reshapes under P2R-01.
- [x] ⚠️ **P2-04** (M) Schedule editor UI (fat-schedule form: paths/excludes/tags/retention/bandwidth on the schedule itself) — superseded by P2R-02.
- ~~**P2-04.5** Manual schedules / kill `host.default_paths`~~ — superseded; the `manual` flag concept is gone, Run-now lives on source groups (see P2R-01 / P2R-02).
- [x] ⚠️ **P2-05** (M) `forget` command with retention policy. Wire payload (`CommandRunPayload.retention_policy`) and restic wrapper (`restic.ForgetPolicy`, `RunForget`) are still correct; what changes under P2R-03 is **where retention comes from** (source_group, not schedule) and **who dispatches** (server-side maintenance ticker for cadence; per-source-group Run-now for ad-hoc).
### P2 redesign — Phase 1 ✅
- [x] **P2R-00.1** (M) Migration 0008: sources + repo maintenance. Adds `source_groups`, `schedule_source_groups` junction, `host_repo_maintenance`, `pending_runs`, `host.bandwidth_up_kbps` / `bandwidth_down_kbps`. Drops `host.repo_initialised_at`. Fat-schedule columns dropped from `schedules`, leaving the slim shape. Column-level ALTERs only, no table rebuilds (FK cascade trap, see CLAUDE.md). Commit `7a7cac5`.
- [x] **P2R-00.2** (M) v4 wireframes for sources / schedules / repo. Hi-fi mocks of `/hosts/{id}/sources`, `/sources/{gid}/edit` (with retention-conflict banner), slim `/schedules`, `/repo` (connection / bandwidth / maintenance / re-init). Commit `666af41`.
### P2 redesign — Phase 2 ✅
- [x] **P2R-00.3** (L) Go-side store rewrite against migration 0008. New types: `SourceGroup`, `HostRepoMaintenance`, `PendingRun`. `Schedule` slimmed to `{id, host_id, cron, enabled, source_group_ids, timestamps}`. `RetentionPolicy` moves from schedule field → source group field (type unchanged). `Host` loses `RepoInitialisedAt`, gains bandwidth caps. New files: `store/sources.go`, `store/maintenance.go`, `store/pending.go`. `store/schedules.go` rewritten for slim shape + junction CRUD. `enrollment.go` seeds a default source group + repo-maintenance row instead of a manual schedule. `ws/handler.go` drops `MarkHostRepoInitialised`. HTTP layer + UI templates **temporarily 501-stubbed** with `redesign_in_progress` — this is what P2R-01 / P2R-02 fill back in. Tests for the obsolete fat-schedule API deleted. Commit `5667cdf`.
- [x] **P2R-00.4** (S) Host-detail UI patched up enough to render: `RepoInitialisedAt` template refs removed, manual init-repo branches stripped, dead Schedules sub-tab demoted to inert div (matches Jobs/Repo/Settings), broken Run-now buttons disabled with P2-Phase-4 hints. Stop-gap until P2R-02 lands the real surface.
### P2 redesign — Phase 3 (REST + WS rewire) ✅
- [x] **P2R-01** (L) HTTP/WS layer against the slim shape:
- **Schedules REST CRUD**: `GET|POST /api/hosts/{id}/schedules`, `PUT|DELETE /api/hosts/{id}/schedules/{sid}`. Body shape is `{cron, enabled, source_group_ids[]}` — paths/excludes/retention/kind/manual all go away. Junction wiped + re-inserted on every update (per `store.UpdateSchedule`). Validation: cron parses via `robfig/cron/v3`; ≥1 `source_group_ids`; all referenced groups belong to the host.
- **Source-groups REST CRUD**: `GET|POST /api/hosts/{id}/source-groups`, `GET|PUT|DELETE /api/hosts/{id}/source-groups/{gid}`. Body: `{name, includes[], excludes[], retention_policy, retry_max, retry_backoff_seconds}`. Name uniqueness per host. Refuse delete if `SchedulesUsingGroup(gid)` is non-empty (return the schedule list so UI can show "remove from these schedules first"). Mutations bump `host_schedule_version`.
- **Repo-maintenance REST**: `GET|PUT /api/hosts/{id}/repo-maintenance`. Body: `{forget_cadence, prune_cadence, check_cadence, check_subset_pct, enabled}`. Server-side ticker drives execution (P2R-04), so updates here do **not** bump `host_schedule_version`.
- **Per-source-group Run-now**: `POST /hosts/{id}/source-groups/{gid}/run`. Reuses the existing `dispatchScheduleNow`-style path; agent receives a normal `command.run` carrying the resolved includes/excludes/retention from the group. This replaces the old per-host `/hosts/{id}/run-backup` endpoint (kept around as a 410-Gone with a hint pointing to source groups).
- **`schedule_push.go` reconciliation**: rebuild `pushScheduleSet*` to ship the new wire format (`ScheduleSetPayload` carries `[{schedule_id, cron, enabled, source_groups: [{name, includes, excludes, retention, retry_*}]}]` — agent doesn't need to know `source_group_id`, just the resolved bundle). On-hello + async-on-CRUD flavours; ack still persists `applied_schedule_version`.
- **Auto-init at enrolment**: server dispatches `restic init` on first WS connect (was P2-old "Init repo" button — now invisible to the operator). On success: emit a normal job row with `kind=init` so the audit trail still shows it. On `init` returning "config file already exists" (e.g. re-enrolment against an existing repo): treat as soft success per existing restic-wrapper behaviour.
- **Tests**: rewrite the deleted `schedules_test.go` and `schedule_push_test.go` against new endpoints; new `source_groups_test.go`, `repo_maintenance_test.go`, `auto_init_test.go`. End-to-end: enrol → server pushes creds → server dispatches init → agent runs it → schedule reconcile fires → operator hits per-source-group Run-now → backup runs → snapshots refresh.
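
The slim schedule body and its validation rules from P2R-01 might be expressed as in the sketch below. Several things here are assumptions: the string ID type, the function and struct names, and the five-field check, which is only a stand-in for the real `robfig/cron/v3` parse so the example stays standard-library only.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// scheduleBody mirrors the slim shape {cron, enabled, source_group_ids[]}.
// String IDs are an assumption; the task list does not state the ID type.
type scheduleBody struct {
	Cron           string   `json:"cron"`
	Enabled        bool     `json:"enabled"`
	SourceGroupIDs []string `json:"source_group_ids"`
}

// parseSchedule decodes and validates a request body. hostGroups holds the
// source-group IDs that belong to the host named in the URL.
func parseSchedule(raw []byte, hostGroups map[string]bool) (scheduleBody, error) {
	var b scheduleBody
	if err := json.Unmarshal(raw, &b); err != nil {
		return b, fmt.Errorf("decode: %w", err)
	}
	// Stand-in for the real cron parser: only checks the field count.
	if len(strings.Fields(b.Cron)) != 5 {
		return b, errors.New("cron: expected 5 fields")
	}
	if len(b.SourceGroupIDs) == 0 {
		return b, errors.New("source_group_ids: need at least one")
	}
	for _, id := range b.SourceGroupIDs {
		if !hostGroups[id] {
			return b, fmt.Errorf("source_group_ids: %q does not belong to this host", id)
		}
	}
	return b, nil
}

func main() {
	groups := map[string]bool{"g1": true, "g2": true}
	body := []byte(`{"cron":"0 3 * * *","enabled":true,"source_group_ids":["g1"]}`)
	fmt.Println(parseSchedule(body, groups))

	bad := []byte(`{"cron":"0 3 * * *","enabled":true,"source_group_ids":["other"]}`)
	_, err := parseSchedule(bad, groups)
	fmt.Println(err)
}
```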
### P2 redesign — Phase 4 (UI rewire, against v4 wireframes) ✅
> **Row-design rule (binding for every list-row template in this app, current and future):**
> Whole-row click navigates to the row's primary detail/edit page.
> Mirror `.host-row.clickable` on the dashboard
> (`partials/host_row.html`): an absolute-positioned `.row-link`
> overlay with `text-indent: -9999px` covers the row, action buttons
> live in `.row-action` cells that sit above via z-index. **Do not
> add an explicit "Edit" button** when the row is clickable; it
> duplicates the affordance and dilutes the click target. Action
> cells are reserved for verbs that aren't "open this row" (Run-now,
> Delete, Pause, etc).
- [x] **P2R-02** (L) UI templates rebuilt against the new model:
- **Slice 1 ✅** Sub-tab navigation skeleton — extract header/vitals/sub-tabs into a `host_chrome` partial; Sources / Schedules / Repo become real `<a>` links; placeholder pages share the chrome; version indicator restored. (commit `a535822`)
- **Slice 2 ✅** Sources tab — `/hosts/{id}/sources` list with per-row meta + clickable rows + per-group Run-now/Delete; `/sources/new` and `/sources/{gid}/edit` form (name, includes/excludes, 3×2 keep-* grid, retry-on-offline, inline conflict banner from `ConflictDimension` cache); validation re-renders form with input intact; refuses to delete a host's last source group. (commits `0ed9c3d`, `dede74f`)
- **Slice 3 ✅** Schedules tab — `/hosts/{id}/schedules` slim list (status / cron / source-tags / actions, clickable rows) plus `/schedules/new` and `/schedules/{sid}/edit` form (cron with five quick-pick chips that have human-readable tooltips, source-group multi-pick as styled check cards, enabled toggle); per-schedule Run-now reuses `dispatchScheduledJob` for enabled schedules + bypasses the enabled check (with a HX-confirm) for paused ones; multi-group fires emit a success toast, single-group fires HX-Redirect to the live job log. (commit `67ca769` + follow-ups `64d2fcf`, `8b91d30`, `4035c44`)
- **Slice 4 ✅** `/hosts/{id}/repo` — three independent forms (connection: URL/user/password pre-filled from `GET /api/hosts/{id}/repo-credentials` redacted view; bandwidth: host-wide caps via new `PUT /api/hosts/{id}/bandwidth`; maintenance: forget/prune/check cadences + check subset %); danger-zone re-init button rendered + disabled (real flow lands in P2R-09); right-rail snapshots-by-tag breakdown. (commit `d62b173`)
- **Slice 5 ✅** Dashboard row Run-now uses the single covering schedule when one exists ("Run all groups" primary button), otherwise falls back to "Open →" pointing at the Sources tab. Right-rail and empty-snapshots-state Run-now were rehomed to source-group context in slice 1. (commit `fab99b4`)
- **Slice 6 ✅** Playwright sweep against the live `:8080` server — login → walk every new tab → create source group → create schedule → Run-now → confirm a snapshot landed → end-to-end clean, no console errors. Screenshots in `_diag/p2r-02-sweep/`.
- Side-fix: agent runner drops noisy restic `status` events from `log.stream` (they were drowning the live log on short backups; the throttled `job.progress` envelope already covers the same data). (commit `ffba737`)
- Header "version N · agent in sync / agent at vM" indicator preserved across all tabs (backed by `host_schedule_version` + `applied_schedule_version`).
- Form validation re-renders with the operator's typed input intact (mirror P2-04's behaviour). Each save fires `pushScheduleSetAsync` so an online agent re-arms within seconds.
### P2 redesign — Phase 5 ✅
> Shipped on branch `p2r-phase5-maintenance` (PR #3). Plan:
> `docs/superpowers/plans/2026-05-03-p2-redesign-phase-5.md`.
- [x] **P2R-03** (M) `prune` command end-to-end. Restic wrapper (`restic.RunPrune`), agent dispatcher (`case api.JobPrune:`), wire envelope. **Admin-only credential**: a second `host_credentials` row keyed by `host_id` + `kind=admin` carries the non-append-only username/password; server pushes it via `config.update` only when dispatching a prune job, and the agent's secrets store keeps it in a separate slot from the everyday append-only creds. UI: prune row on the Repo page. Operator-triggered Run-now via `POST /hosts/{id}/repo/prune`. Cadence-driven dispatch via the maintenance ticker (P2R-06).
- [x] **P2R-04** (M) `check` command end-to-end (`restic check --read-data-subset N%`). Wrapper + dispatcher + wire. UI: check row on the Repo page (with the subset % slider). Operator Run-now via `POST /hosts/{id}/repo/check`. Cadence-driven dispatch via the maintenance ticker (P2R-06).
- [x] **P2R-05** (S) `unlock` command end-to-end (`restic unlock`). Operator-only — no cadence. `POST /hosts/{id}/repo/unlock`. Repo page surfaces lock state from the most recent `check` (which warns about stale locks).
- [x] **P2R-06** (M) Server-side maintenance ticker. Cron-style loop on the server reads `host_repo_maintenance` rows, dispatches `forget` / `prune` / `check` jobs against the right host on the configured cadence. Last-fire anchor is derived from the `jobs` table via `LatestJobByKind` (queued + running included so a long-running prune correctly suppresses the next tick). Independent of the agent's local cron — the agent's cron only handles backup schedules now. Skips offline hosts (logged, no queue — only scheduled backup fires queue, per P2R-08). Forget reshape: ships a multi-group `ForgetGroups` payload so one job fires N restic-forget invocations per tick.
- [x] **P2R-07** (S) Repo stats panel on the Repo page: total size, raw size, last-check timestamp + status (color-coded), last-prune timestamp, stale-lock banner. Backed by `restic stats --json --mode raw-data` that the agent ships in a `repo.stats` envelope after every backup / check / prune / unlock; persisted via `Store.UpsertHostRepoStats` into a new `host_repo_stats` projection table.
- [x] **P2R-08** (M) Pending-runs queue worker. Scheduled backup fires that race an agent disconnect queue to `pending_runs`. Drained on a 30s server-side tick **and** on agent reconnect (via `onAgentHello`); per-host TryLock mutex prevents the two paths double-dispatching the same row. Exponential backoff capped at 30 minutes; abandons rows that exceed the source-group's `retry_max` (audit-logged) or whose schedule/group has genuinely been deleted.
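
Two mechanics from P2R-08 lend themselves to a short sketch: the per-host TryLock that stops the 30s tick and the reconnect path from draining the same host at once, and the retry backoff capped at 30 minutes. The `hostLocks` type, the function names, and the 30s base delay are assumptions; only the TryLock idea and the 30-minute cap come from the task text.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// hostLocks hands out one mutex per host ID.
type hostLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func (h *hostLocks) get(hostID string) *sync.Mutex {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.locks == nil {
		h.locks = make(map[string]*sync.Mutex)
	}
	m, ok := h.locks[hostID]
	if !ok {
		m = &sync.Mutex{}
		h.locks[hostID] = m
	}
	return m
}

// drain runs fn for hostID unless another drain for that host is already
// in flight, in which case it returns false without waiting.
func (h *hostLocks) drain(hostID string, fn func()) bool {
	m := h.get(hostID)
	if !m.TryLock() { // sync.Mutex.TryLock needs Go 1.18+
		return false
	}
	defer m.Unlock()
	fn()
	return true
}

// retryDelay doubles from an assumed 30s base and caps at 30 minutes.
func retryDelay(attempt int) time.Duration {
	const base, limit = 30 * time.Second, 30 * time.Minute
	d := base
	for i := 0; i < attempt && d < limit; i++ {
		d *= 2
	}
	if d > limit {
		d = limit
	}
	return d
}

func main() {
	var locks hostLocks
	ok := locks.drain("host-1", func() {
		// A reconnect-triggered drain arriving mid-tick is turned away.
		fmt.Println("nested:", locks.drain("host-1", func() {}))
	})
	fmt.Println("outer:", ok)
	fmt.Println(retryDelay(0), retryDelay(3), retryDelay(10))
}
```

Skipping rather than waiting is safe here because whichever path holds the lock is already draining the same rows.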
### P2 redesign — Phase 6 (auto-init follow-up) ✅
- [x] **P2R-09** (S) Auto-init UX polish. Latest `init` job status surfaced under the host-detail vitals strip (succeeded/failed/running/queued, with link to the live job log on non-success). Danger-zone `POST /hosts/{id}/repo/reinit` dispatches a fresh init job after the operator types the host name to confirm; audit row records `host.repo_reinit`.
### Pre/post hooks (rehomed onto source groups) ✅
- [x] **P2R-10** (M) Hook schema: migration 0010 adds `pre_hook`/`post_hook` BLOB columns to `source_groups` and `pre_hook_default`/`post_hook_default` to `hosts`. Bytes stored verbatim — AEAD encrypt/decrypt at the HTTP layer (per-slot AD bytes). Round-trip tests cover set/clear semantics on both tables.
- [x] **P2R-11** (M) Agent execution of hooks: `runner.BackupHooks` + `runHook` helper invoked via `/bin/sh -c` (`cmd.exe /C` on Windows). pre_hook non-zero exit aborts the backup; post_hook always runs with `RM_JOB_STATUS=succeeded|failed` in env. Output streamed as `hook(<phase>): …` log.stream lines. Hooks only run for `kind=backup`. Server side resolves group → host default → empty and ships plaintext on the WS payload (decrypt at HTTP layer).
- [x] **P2R-12** (S) Hook editor UI: source-group edit form gains pre/post hook textareas with the service-user warning banner; bodies AEAD-encrypted on save (per-group AD). Repo page adds a host-default Hooks panel with the same shape; saved via `POST /hosts/{id}/repo/hooks`.
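
The hook semantics from P2R-11 can be sketched as follows. The function names are illustrative rather than the real `runner.BackupHooks` API, and running `post_hook` even after a failed `pre_hook` is an assumed reading of "post_hook always runs".

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"runtime"
)

// runHook executes script through the platform shell and returns its
// combined stdout/stderr.
func runHook(script string, extraEnv ...string) (string, error) {
	var cmd *exec.Cmd
	if runtime.GOOS == "windows" {
		cmd = exec.Command("cmd.exe", "/C", script)
	} else {
		cmd = exec.Command("/bin/sh", "-c", script)
	}
	cmd.Env = append(os.Environ(), extraEnv...)
	out, err := cmd.CombinedOutput()
	return string(out), err
}

// backupWithHooks runs pre, then backup, then post. A failing pre hook
// skips the backup; post sees the outcome in RM_JOB_STATUS.
func backupWithHooks(pre, post string, backup func() error) (status, postOut string) {
	status = "succeeded"
	if pre != "" {
		if _, err := runHook(pre); err != nil {
			status = "failed" // non-zero pre_hook aborts the backup
		}
	}
	if status == "succeeded" {
		if err := backup(); err != nil {
			status = "failed"
		}
	}
	if post != "" {
		postOut, _ = runHook(post, "RM_JOB_STATUS="+status)
	}
	return status, postOut
}

func main() {
	status, out := backupWithHooks("exit 1", "echo $RM_JOB_STATUS", func() error {
		fmt.Println("backup ran") // not reached: the pre hook fails first
		return nil
	})
	fmt.Printf("%s %q\n", status, out)
}
```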
### Bandwidth + niceties (rehomed onto host + source groups) ✅
- [x] **P2R-13** (S) Bandwidth limit fields. `restic.Env` gains `LimitUploadKBps`/`LimitDownloadKBps`, emitted as `--limit-upload`/`--limit-download` global flags before the subcommand on every invocation. Agent dispatcher tracks host-wide caps received via `config.update`; server pushes them on hello and after `PUT /api/hosts/{id}/bandwidth`. Per-job override on the per-source-group Run-now form (collapsed `<details>` "Limit bandwidth for this run" with two KB/s inputs); override wins over host caps.
- [x] **P2R-14** (S) Schedule "next run" / "last run". New `store.LatestJobBySchedule` query. Schedules tab grows two columns (Next derived from cron via `robfig/cron/v3.Parse(...).Next`, Last from latest `actor_kind=schedule` job). Dashboard host row prepends `next 12h ago/from now` when a single covering schedule is the run-now candidate.
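
As an illustration of P2R-13, the sketch below resolves host caps against a per-run override and places `--limit-upload` / `--limit-download` before the subcommand. The `limits` type and helper names are hypothetical, and treating zero as "not set" with a per-field override is an assumption about the real behaviour.

```go
package main

import (
	"fmt"
	"strconv"
)

// limits holds bandwidth caps; zero means "no cap / not set" in this sketch.
type limits struct {
	UploadKBps   int
	DownloadKBps int
}

// effective applies a per-run override on top of the host-wide caps.
func effective(host, override limits) limits {
	out := host
	if override.UploadKBps > 0 {
		out.UploadKBps = override.UploadKBps
	}
	if override.DownloadKBps > 0 {
		out.DownloadKBps = override.DownloadKBps
	}
	return out
}

// resticArgs emits the global limit flags first, then the subcommand and
// its own arguments.
func resticArgs(l limits, sub string, rest ...string) []string {
	var args []string
	if l.UploadKBps > 0 {
		args = append(args, "--limit-upload", strconv.Itoa(l.UploadKBps))
	}
	if l.DownloadKBps > 0 {
		args = append(args, "--limit-download", strconv.Itoa(l.DownloadKBps))
	}
	args = append(args, sub)
	return append(args, rest...)
}

func main() {
	host := limits{UploadKBps: 1024, DownloadKBps: 2048}
	run := limits{UploadKBps: 256}
	fmt.Println(resticArgs(effective(host, run), "backup", "--json", "/etc"))
	// [--limit-upload 256 --limit-download 2048 backup --json /etc]
}
```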
### Cross-platform + alt-enrolment ✅
- [x] **P2-16** (M) Windows service integration: `internal/agent/service` (build-tagged) implements `svc.Handler`; new `restic-manager-agent install|uninstall|start|stop|run` subcommands wrap the SCM via `golang.org/x/sys/windows/svc/mgr`. Cross-compile verified (`GOOS=windows GOARCH=amd64 go build ./cmd/agent`); **untested on Windows itself** — Linux CI can't exercise the SCM round-trip.
- [x] **P2-17** (M) `install.ps1` (Windows): pwsh installer that detects arch, downloads `$Server/agent/binary?os=windows&arch=amd64`, runs the agent in `-enroll-server` (+ optional `-enroll-token`) mode (token flow OR announce-and-approve), then registers the service via `restic-manager-agent install`. Surfaces existing scheduled tasks named `*restic*` without disabling. Served by the existing `GET /install/*` handler; restage block in CLAUDE.md updated.
- [x] **P2-18** (L) Announce-and-approve enrolment (second enrolment mode):
- Agent run with no `RM_TOKEN` generates a local Ed25519 keypair (persisted alongside the encrypted secrets blob), then `POST /api/agents/announce` with `{hostname, os, arch, agent_version, restic_version, public_key}`. Server stores a `pending_hosts` row (`public_key`, `fingerprint = sha256(public_key)`, `announced_from_ip`, `first_seen_at`, `last_seen_at`, `expires_at = now+1h`). Hostname collisions with existing or other pending rows are flagged in the response so the install script can warn loudly on the endpoint terminal.
- Agent then opens a long-poll/WS to `/ws/agent/pending` authenticated by signing a server-issued nonce with its private key — proves possession of the key tied to the pending row. Connection stays open; agent waits.
- Install script prints the fingerprint on the endpoint's terminal in a copy-friendly form (e.g. `SHA256:ab12…cd34`) and tells the operator to compare it to the one shown in the UI before clicking accept.
- UI: new "Pending hosts" panel on the dashboard. Admin sees fingerprint, hostname, source IP, OS/arch, time announced. Buttons: **Accept** (mints persistent bearer + repo creds, pushes both down the open pending socket, promotes pending row → real Host row, audit-logged) / **Reject** (deletes pending row, closes the socket with a clean error). Fingerprint is the load-bearing field — UI must make comparison easy (large monospace, one-click copy).
- Server-side guards: per-source-IP rate limit on `/api/agents/announce` (token-bucket, e.g. 10/min); global cap on pending rows (e.g. 100); pending rows auto-expire after 1h; duplicate-hostname pending rows allowed but visually flagged in UI; accepting one does **not** auto-reject the others (admin sees them all and decides — defends against the "attacker announces first, real host second" race).
- Token-based enrollment (Phase 1) remains the default and is unchanged; announce-and-approve is opt-in for interactive installs. Docs explicitly call out that the fingerprint comparison step is what makes this flow safe — without it, this is no better than trusting `hostname` over the wire.
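
The two cryptographic steps in the announce-and-approve flow above (a `sha256(public_key)` fingerprint and proof of possession by signing a server-issued nonce) can be sketched with the standard library. The helper names, the 32-byte nonce length, and the hex rendering of the fingerprint are assumptions; the task list only shows the `SHA256:ab12…cd34` display form.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// newKeypair mints the agent's Ed25519 keypair.
func newKeypair() (ed25519.PublicKey, ed25519.PrivateKey, error) {
	return ed25519.GenerateKey(rand.Reader)
}

// fingerprint renders sha256(public_key) for side-by-side comparison.
// Hex encoding is an assumption for this sketch.
func fingerprint(pub ed25519.PublicKey) string {
	sum := sha256.Sum256(pub)
	return "SHA256:" + hex.EncodeToString(sum[:])
}

// newNonce is the server-side challenge (length assumed).
func newNonce() ([]byte, error) {
	n := make([]byte, 32)
	_, err := rand.Read(n)
	return n, err
}

// signNonce is what the agent does with its private key.
func signNonce(priv ed25519.PrivateKey, nonce []byte) []byte {
	return ed25519.Sign(priv, nonce)
}

// verifyPossession checks the signature against the public key stored on
// the pending row. The length guard avoids the panic ed25519.Verify raises
// on a malformed key.
func verifyPossession(pub ed25519.PublicKey, nonce, sig []byte) bool {
	return len(pub) == ed25519.PublicKeySize && ed25519.Verify(pub, nonce, sig)
}

func main() {
	pub, priv, err := newKeypair()
	if err != nil {
		panic(err)
	}
	nonce, err := newNonce()
	if err != nil {
		panic(err)
	}
	sig := signNonce(priv, nonce)
	fmt.Println(fingerprint(pub))
	fmt.Println(verifyPossession(pub, nonce, sig)) // true
}
```

The signature proves the socket belongs to the announcing key; it is still the operator's fingerprint comparison that ties that key to the intended machine.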
> **As shipped:** migration 0011 + `store/pending_hosts.go` cover the table.
> `POST /api/agents/announce` (rate-limited 10/min/IP, global cap 100 in-flight rows)
> returns `{pending_id, fingerprint, hostname_collision}`. `GET /ws/agent/pending`
> runs the Ed25519 nonce-sign handshake. Admin POSTs to
> `/api/pending-hosts/{id}/accept|reject` (audit-logged as
> `host.accept_pending`/`host.reject_pending`). Dashboard panel renders the queue
> with a copyable fingerprint + inline accept form (URL/user/password). 60s
> server ticker sweeps expired rows. Agent: `cmd/agent/announce.go` mints +
> persists an Ed25519 keypair into `agent.yaml`'s `announce_key` field; runs
> automatically when `-enroll-server` is supplied without `-enroll-token`. The
> install scripts haven't been updated to surface the printed fingerprint
> beyond the agent's own banner; the operator reads it from the install
> script's stdout.
### Phase 2 acceptance
- A host can be onboarded end-to-end with no manual REST: enrol → auto-init runs → operator opens host → creates source group(s) → attaches them to one or more schedules → schedule fires on time → backup runs against the right paths with the right retention → snapshots tagged by group name appear in UI.
- Operator can hit Run-now per source group from any of: dashboard row (single-group host), source group row, snapshot empty-state.
- Server-side maintenance ticker drives forget/prune/check at the configured cadences, independent of agent cron. Offline hosts queue to `pending_runs` and drain on reconnect.
- Pre/post hooks fire correctly per source group, fail loudly on `pre_hook` errors, run `post_hook` with `RM_JOB_STATUS`. Rejected on non-backup kinds.
- Bandwidth limits honoured (host-wide default + per-run override).
- A Windows host can enrol, appear in the dashboard, and run a backup with live log streaming. **Not validated in CI:** Linux runners cannot exercise the SCM round-trip; the `service_windows.go`/`install.ps1` pieces compile cleanly under `GOOS=windows GOARCH=amd64` but the first real Windows install will be the first end-to-end test.
- A Linux host can enrol via announce-and-approve, with fingerprint-comparison gate enforced. Rate-limit + pending-cap guards verified.
---
## Phase 3 — Restore, alerts, audit
> Phase 3 is split into three independently-shippable sub-phases:
> **Restore** (P3-01..03 + P3-09 + P3-X1 cancel + P3-X2 tree-list RPC),
> **Alerts** (P3-05..07), **Audit UI** (P3-08). Each sub-phase has its own
> spec → plan → implement cycle; we hand back at sub-phase boundaries.
>
> P3-04 (cross-host restore) was de-scoped during the Phase-3 brainstorm
> on 2026-05-04: disaster recovery is already covered by re-enrolling a
> replacement host with the same repo creds (snapshots reappear, restore
> is same-host). The remaining "pull a file from host A onto host C
> without giving C permanent access" use case is genuinely different and
> doesn't have a confirmed need yet, so it's moved to the **Future /
> unscheduled** section at the end of this file.
### Phase 3 — Restore ✅
> Spec: `docs/superpowers/specs/2026-05-04-p3-restore-design.md`.
> Wireframe: `_diag/p3-restore-wizard/wireframe.html`.
> Sweep screenshots: `_diag/p3-restore-sweep/`.
> Shipped on branch `p3-restore`.
- [x] **P3-X1** (S) Cancel-job feature. `command.cancel` WS envelope; agent tracks a per-job `context.CancelFunc` and kills the running `restic` subprocess via context cancel (SIGTERM, SIGKILL after 5s grace via `cmd.Cancel` + `cmd.WaitDelay`); server endpoint `POST /api/jobs/{id}/cancel` bridges UI → WS; the existing UI Cancel button on `/jobs/{id}` is now real for any running kind. Platform-aware: `internal/restic/cancel_{unix,windows}.go` build tags pick SIGTERM on POSIX vs `os.Kill` on Windows (which can't deliver SIGTERM). Tests: cancelling mid-run against a `sleep 30` fake restic returns JobCancelled with exit 130 in <200ms.
- [x] **P3-X2** (S) Tree-list synchronous WS RPC. `MsgTreeList` ↔ `MsgTreeListResult` with `Envelope.ID` correlation; generic `Hub.SendRPC` helper (registry of buffered channels keyed by ULID, ctx-cancel + timeout aware). `internal/restic.ListTreeChildren` wraps `restic ls --json` and filters its recursive output to direct children. Server-side `treeCache` is per-wizard-session (keyed by session cookie + host + snapshot + path) with a 30-min TTL and lazy sweep.
- [x] **P3-01** (L) Restore wizard backend (`internal/server/http/ui_restore.go`). GET handlers render the four-step wizard against the wireframe. HTMX/fetch tree partial endpoint hits `fetchTreeWithCache`. POST validates: snapshot_id, ≥1 absolute path, in-place ⇒ confirm_hostname == host name, agent online; on error re-renders with operator's input intact. Happy path mints job_id, target = `/var/lib/restic-manager/restore/<job-id>` (server-picked, agent's writable dir under the systemd sandbox's `ReadWritePaths`), creates job row, ships `command.run` with `RestorePayload`, writes `host.restore` audit row, returns HX-Redirect (or 303) to the live job page.
- [x] **P3-02** (L) Wizard UI templates (`web/templates/pages/host_restore.html` + `partials/tree_node.html`). Single-page progressively-enabled four-step form. Form-state-driven JS computes a running tally + step-4 confirm summary client-side. Tree expansion uses plain fetch (not HTMX) for simpler target lookup; loaded-state cached per node. Top-level Restore button on host detail right rail + per-snapshot Restore action on snapshot rows. New `.snap-row` token in `web/styles/input.css`.
- [x] **P3-03** (M) Restore execution. `restic.RunRestore` builds `restore <sid> --target <dir> [--include p]...` with --json; new `pumpRestoreStdout` parses status + summary objects. `--no-ownership` is gated on the agent's restic version via `Env.AtLeastVersion(0, 17)` — the flag was added in 0.17 and 0.16 rejects it. Restic version is threaded through `runner.Config.ResticVersion` from the agent's sysinfo snapshot. New-dir target is operator-editable (default `$HOME/rm-restore/<job-id>/`); agent expands `$HOME` / `${HOME}` / `~/` at run time and calls `os.MkdirAll` on the target chain so the operator never has to pre-create the per-job subdir. `runner.RunRestore` translates `RestoreStatus` into `job.progress` (mapping FilesRestored → FilesDone, etc.); agent dispatcher case `JobRestore` reuses the `spawn()` helper from P3-X1 so cancel works. Restore-shaped job-detail variant with current-file display under the progress bar.
- [x] **P3-09** (S) `diff` between two snapshots. `JobDiff` JobKind + `restic.RunDiff` + `runner.RunDiff`; `POST /api/hosts/{id}/snapshots/diff` (and HTMX-form variant on the unprefixed path) dispatcher with two-snapshot guard + per-host snapshot-list validation; UI panel on host detail right rail (visible when 2+ snapshots) with two short-id inputs + Diff button. Output streams as log.stream to the standard live job log page.
- [x] **P3-X3** (S) Recent-restores line on host detail. `hostChromeData` grows `RestoreStatus` / `RestoreAt` / `RestoreJobID` populated via `store.LatestJobByKind(host_id, 'restore')` (already exists from P2R). `host_chrome.html` renders a small line below the init-status one with status-coloured copy + a link to the job log. Hidden when no restore has ever run on this host.
- [x] **P3-X4** (S) Job log download (txt + ndjson). New `GET /api/jobs/{id}/log.{txt|ndjson}` endpoint backed by the persisted `job_logs` table — works any time (running or finished) without pausing the live WS stream because the source is the DB, not the live socket. Plain-text format mirrors the on-screen "HH:MM:SS.mmm TAG payload" shape with a small `# job ... · kind ... · status ...` header; ndjson emits one self-contained `{seq,ts,stream,payload}` JSON object per line for `jq` / tooling. Surfaced as a single header dropdown on the live job page (`details/summary`-driven, native keyboard support, click-outside-to-close). New reusable `.dropdown` / `.dropdown-menu` / `.dropdown-item` tokens in `web/styles/input.css`.
- [x] **P3-X5** (S) UK lint locale + sweep. `.golangci.yml` misspell locale switched US → UK and the codebase swept (~73 corrections — behaviour, serialise, recognise, honour, initialise, enrol, unauthorised, etc.). Wire `ErrorCode` value `"unauthorized"` → `"unauthorised"` is a tiny contract change but the agent doesn't parse those codes today and no external clients exist yet.
- [x] **P3-X6** (S) Snapshot SIZE/FILES tooltip on host detail. The per-snapshot summary block was added by restic 0.17 (the source comment in `internal/restic/snapshots.go` incorrectly said 0.16+); on 0.16 hosts the columns render `—`. `hostDetailPage.LegacyRestic` (computed via `Env.AtLeastVersion(0, 17)`) drives a `title="Needs restic 0.17+ on the agent host. This host runs <ver>."` + `cursor: help` on the column headers, hidden once the host upgrades.
> **Migration 0012** widens the `jobs.kind` CHECK constraint to include `restore` and `diff`. Rebuild required (SQLite can't ALTER CHECK in place); follows the safe pattern from 0005, with a defensive temp-table backup of `job_logs` so the cascade-trap that bit migration 0007 wouldn't take the log history with it.
> **install.sh + systemd unit:** the install script now pre-creates `/root/rm-restore` (root-owned 0700) so the default new-dir restore target works under the sandbox out of the box; the unit's `ReadWritePaths` gains `-/root/rm-restore` (soft-fail prefix). Existing installs need a re-run of `install.sh` to pick up the new dir; new operator-typed targets are auto-created by the agent at job time.
> **As shipped (Playwright sweep against the live smoke env, 2026-05-04):** login → host detail → Restore button → wizard step 1 picks snapshot a1ac4006 (most recent) → tree drill-down `/home/steve/test` (3 lazy loads) → tick `file1` + `file2` → step 4 confirm summary populated → dispatch → live job page with running progress widget → restore succeeds, files land on disk at `/root/rm-restore/<job-id>/home/steve/test/file{1,2}` (default `$HOME/rm-restore/<job-id>/` after agent-side expansion). Custom-target restore to `/tmp/custom-restore/<job-id>/` lands inside the agent's `PrivateTmp` namespace. Snapshot diff between `a1ac4006` and `5f78c788` → diff job page, statistics output streamed (738 bytes added, 0 removed). Recent-restores line on host detail reads "last restore · succeeded 28s ago · job log →". Download dropdown serves both `.txt` and `.ndjson` with correct `Content-Type` + `Content-Disposition`. SIZE/FILES tooltip "Needs restic 0.17+ on the agent host. This host runs 0.16.4." renders on column hover.
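
A sketch of the two agent-side steps the sweep exercises: expanding `$HOME` in the restore target and gating `--no-ownership` on restic 0.17+. `expandHome`, `atLeast` and `restoreArgs` are hypothetical helpers, not the real `Env.AtLeastVersion` / `restic.RunRestore` code, and the sketch assumes POSIX paths.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// atLeast reports whether major.minor is at or above wantMajor.wantMinor.
func atLeast(major, minor, wantMajor, wantMinor int) bool {
	return major > wantMajor || (major == wantMajor && minor >= wantMinor)
}

// expandHome replaces a leading $HOME, ${HOME} or ~ with home. Anything else,
// including look-alikes such as $HOMEDIR or ~user, is returned untouched.
func expandHome(target, home string) string {
	for _, prefix := range []string{"${HOME}", "$HOME", "~"} {
		rest, ok := strings.CutPrefix(target, prefix)
		if ok && (rest == "" || strings.HasPrefix(rest, "/")) {
			return filepath.Join(home, rest)
		}
	}
	return target
}

// restoreArgs builds the restic argv for a restore of the given include paths.
func restoreArgs(snapshotID, target string, includes []string, major, minor int) []string {
	args := []string{"restore", snapshotID, "--target", target, "--json"}
	for _, p := range includes {
		args = append(args, "--include", p)
	}
	// Per the P3-03 note, the flag exists from restic 0.17; 0.16 rejects it.
	if atLeast(major, minor, 0, 17) {
		args = append(args, "--no-ownership")
	}
	return args
}

func main() {
	target := expandHome("$HOME/rm-restore/job-123/", "/root")
	fmt.Println(target) // /root/rm-restore/job-123
	fmt.Println(restoreArgs("a1ac4006", target, []string{"/home/steve/test/file1"}, 0, 16))
	fmt.Println(restoreArgs("a1ac4006", target, []string{"/home/steve/test/file1"}, 0, 17))
}
```

The agent would still need to `os.MkdirAll` the expanded target before invoking restic, as the P3-03 entry describes.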
### Phase 3 — Alerts ✅
- [x] **P3-05** (M) Alert engine: rule evaluation loop (failed backup, stale schedule, agent offline, check failed)
- [x] **P3-06** (M) Notification channels: webhook, ntfy, SMTP email
- [x] **P3-07** (S) Alert UI: list, acknowledge, resolve
> **As shipped (Playwright sweep, 2026-05-04):** /settings/notifications → 3 channels created (sweep-webhook → local Python sink, sweep-ntfy → ntfy.sh public topic, sweep-smtp → MailHog at 127.0.0.1:1025). Test buttons fire alert.test on each: webhook 200/1ms, ntfy 200/322ms, SMTP 250/3ms. Synthetic critical `backup_failed` raised → /alerts shows row with severity dot, kind chip, host, message, raised/last-seen, Ack + Resolve buttons; nav badge `1`; dashboard critical-alert banner appears with Review→ link; OPEN ALERTS card reads `1 unresolved`. Acknowledge → fan-out to all 3 channels emits alert.acknowledged (verified in webhook sink, MailHog inbox, notification_log); Acknowledged tab shows row with `ack'd by <user>` line. Resolve → fan-out emits alert.resolved across all 3 channels; banner clears; dashboard reads `0 unresolved · all clear`; host alerts column reads —. Three live bugs found and fixed mid-sweep: (a) `enabled` form value lost because hidden+checkbox both named `enabled` and `PostForm.Get` returned the first ("0"); (b) Ack/Resolve handlers stored the state change but never dispatched alert.acknowledged / alert.resolved; (c) `hosts.open_alert_count` projection was never recomputed on Raise/Resolve/AutoResolve, so the dashboard count always read 0.
### Phase 3 — Audit log UI ✅
- [x] **P3-08** (S) Audit log UI with filters (user, action, target, time range)
> **As shipped (2026-05-05):** Read-only `/audit` page (+ `/api/audit` JSON). Filters: time-range presets (24h / 7d / 30d / all), user dropdown (any registered user), actor dropdown (user / agent / system), target-kind dropdown (host / schedule / source_group / alert / notification_channel / job / user), action substring search box. Table columns: when (relative + abstime tooltip), actor tag (user accent / agent green / system grey), user (or em-dash for system rows), action string, target (kind · resolved name for hosts, kind · id otherwise), payload `<details>` block when non-empty. New `Store.ListAudit(AuditFilter)` and `Store.DistinctAuditActions` plus `Store.ListUsers`. Append-only — no edit/delete surface, deliberately.
### Phase 3 acceptance
- A file deleted on a host can be restored from the UI in under 2 minutes via the wizard at `/hosts/{id}/restore`; the operator can cancel a running restore (or any other running job) from the live job page. Snapshot diff between two snapshots renders as a normal job page.
- A failed backup raises an alert via the configured channel within 60s.
- The audit-log UI lets an admin filter by user / action / target / time range.
---
## Phase 4 — Update delivery, RBAC polish, OIDC
- [ ] **P4-01** (M) Update delivery via OS package managers: host an apt repo (Linux) and a Chocolatey package (Windows) on Gitea releases. `restic-manager-agent update` is a thin wrapper over `apt-get install --only-upgrade restic-manager-agent` / `choco upgrade`. Trades flexibility for a much smaller security surface than bespoke signed binaries (see spec.md §4.2).
- [ ] **P4-02** (M) Agent version reporting on dashboard: surface "agent N versions behind server"; "update all" admin action calls the package-manager wrapper on each host
- [x] **P4-03** (M) RBAC enforcement at API layer (admin / operator / viewer)
- [x] **P4-04** (S) User management UI (create/edit/disable, role assignment, password reset)
> **As shipped (2026-05-05):** Three-role hierarchy (admin > operator > viewer) enforced via chi route-group middleware (`requireRole`). Admin is the fail-closed default; agent endpoints stay on the bearer-token chain. Sessions re-validate `disabled_at` on every authenticated request — admin-driven changes (disable, force-logout) land immediately.
>
> **Setup-token flow** replaces temp passwords. Admin clicks `+ Add user`, picks username + email + role, server returns a one-time setup link valid for 1 hour (sha256-hashed at rest, raw shown to admin once). User clicks the link → sets a password (≥12 chars) → drops a session → lands on `/`. `/settings/users/{id}/regenerate-setup` issues a new link, replacing the old via INSERT OR REPLACE. Expired tokens are swept on the alert engine's 60s tick.
>
> **Disable-only lifecycle** — soft delete via `disabled_at`. Last-admin guard rejects "disable last admin" and "demote last admin to non-admin" (both server-side and UI-hinted). Re-enable on disabled-username collision: admin trying to add a name that matches a disabled user is redirected to that user's edit page rather than 409'd.
>
> **Self-service password change** at `/settings/account` available to any role. Skips current-password check when `must_change_password` is set so admin-initiated resets work without surfacing a credential the user doesn't know.
>
> **Schema:** migration 0017 adds `email`, `disabled_at`, `must_change_password` plus a UNIQUE INDEX on LOWER(username) (lowercase normalisation in Go on every CreateUser); 0018 adds `user_setup_tokens`. Both column-level ALTERs per CLAUDE.md preference. Email is metadata only in v1 (no SMTP-the-link); the SMTP channel infrastructure from P3-06 makes that a one-page follow-up.
>
> **Sweep verified (smoke env):** admin adds operator → setup link generated → curl-as-new-user fetches /setup (200, page shows username) → POSTs password → 303 to / + Set-Cookie → operator authenticated → 200 on /, 200 on /settings/account, **403 on /settings/users** (admin-only) → admin disables user → operator's next request is **401** + session row count drops to 0 → audit log shows `user.created` + `user.setup_completed` for the cycle. All 26 implementation tasks landed; full `go test ./...` green.
- [ ] **P4-05** (L) OIDC login (generic provider config, group → role mapping)
- [ ] **P4-06** (M) Repo size trend graphs (sparkline on host card, full chart on repo page)
- [ ] **P4-07** (S) Per-host tags + dashboard filtering by tag
- [ ] **P4-08** (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list
- [ ] **P4-09** (S) Document Prometheus integration + sample Grafana dashboard JSON
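
For reference, the three-role gate from the P4-03/P4-04 notes can be sketched as plain `net/http` middleware, which chi route groups accept as-is. This is an illustration under assumptions: the role is taken to have been placed in the request context by an earlier session middleware, and `withRole`, `rank` and `statusFor` are hypothetical names; only `requireRole` is named in the notes above.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

type Role string

const (
	RoleViewer   Role = "viewer"
	RoleOperator Role = "operator"
	RoleAdmin    Role = "admin"
)

// rank orders the hierarchy admin > operator > viewer; unknown roles rank 0.
var rank = map[Role]int{RoleViewer: 1, RoleOperator: 2, RoleAdmin: 3}

type ctxKey struct{}

// withRole stands in for the session middleware that resolves the user.
func withRole(ctx context.Context, r Role) context.Context {
	return context.WithValue(ctx, ctxKey{}, r)
}

// requireRole admits requests whose role ranks at or above minRole.
func requireRole(minRole Role) func(http.Handler) http.Handler {
	need, ok := rank[minRole]
	if !ok {
		need = rank[RoleAdmin] // fail closed: an unknown requirement means admin-only
	}
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			role, _ := r.Context().Value(ctxKey{}).(Role)
			if rank[role] < need {
				http.Error(w, "forbidden", http.StatusForbidden)
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}

// statusFor runs one request through the gate and returns the status code.
func statusFor(role, minRole Role) int {
	ok := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	req := httptest.NewRequest(http.MethodGet, "/settings/users", nil)
	req = req.WithContext(withRole(req.Context(), role))
	rec := httptest.NewRecorder()
	requireRole(minRole)(ok).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(RoleOperator, RoleAdmin)) // 403
	fmt.Println(statusFor(RoleAdmin, RoleAdmin))    // 200
}
```

The 401 on a disabled user comes from the session layer re-checking `disabled_at`, not from this gate, which only ever answers 403.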
### Phase 4 acceptance
- Non-admin users see an appropriately limited UI. Agents upgrade via apt/choco with one admin-triggered action. OIDC login works against at least one provider (Authelia or Authentik). Prometheus can scrape `/metrics` and the sample Grafana dashboard renders with live data.
---
## Phase 5 — OSS readiness
- [ ] **P5-01** (M) Documentation site (mdBook or similar) with install, concepts, security model, screenshots
- [ ] **P5-02** (S) `CONTRIBUTING.md`, `CODE_OF_CONDUCT.md`, issue + PR templates
- [ ] **P5-03** (S) Release automation: `goreleaser` for binaries + Docker image to GHCR
- [ ] **P5-04** (S) Demo screenshots / short Loom walkthrough in README
- [ ] **P5-05** (S) `SECURITY.md` with disclosure process
- [ ] **P5-06** (M) End-to-end test suite in CI (Playwright vs. compose stack with sibling Linux agent)
- [ ] **P5-07** (S) Reference deployment: `docker-compose.yml` + Caddyfile snippet showing the TLS-terminating reverse proxy in front of the HTTP-only server (also demonstrates `RM_TRUSTED_PROXY`)
### Phase 5 acceptance
- A stranger can read the docs and stand up a working install in under 30 minutes.
---
## Cross-cutting / ongoing
- [ ] **X-01** Keep CHANGELOG.md updated (Keep-a-Changelog format)
- [ ] **X-02** Track restic version compatibility matrix
- [ ] **X-03** Periodic dependency updates (`dependabot` or `renovate`)
- [ ] **X-04** Threat-model review at end of each phase
- [ ] **X-05** Proper first-run onboarding UI: admin shouldn't need to `curl` `/api/bootstrap` by hand. Render the bootstrap form on the same login page (extra "setup token" field shown only while no admin user exists, hidden after); on submit POST to `/api/bootstrap`, then drop straight into a session. Surface the one-time token from the server log somewhere copy-able (or print a clickable URL with the token in the query string at first-run). Also: relax the 12-char password floor for the first-run path or document it in the form so `admin` doesn't silently fail validation.
---
## Future / unscheduled
> Items here have a plausible use case but no confirmed need. They live
> outside numbered phases until a concrete trigger (a user request, a
> security review finding, a real disaster-recovery exercise) bumps them
> back into a phase.
- [ ] **F-01** ~~P3-04~~ Cross-host restore. De-scoped from Phase 3 on 2026-05-04. Disaster recovery is already covered: stand up a replacement host, paste the original repo creds at enrolment, snapshots reappear, restore is same-host. The remaining "pull a file from host A onto host C without granting C permanent access" use case is genuinely different (file sharing / migration, not DR) and hasn't been requested. Original spec language was: "target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root". Re-promote when there's a real ask.