Files
restic-manager/tasks.md
T
steve 54528b9b15 P2R-02 follow-up: clickable rows on Sources/Schedules + cron-preset tooltips
Aligns Sources and Schedules tab rows with the dashboard's row-click
UX: whole-row click navigates to the row's edit page (mirroring
.host-row.clickable). Drops the redundant Edit buttons; Run-now and
Delete remain in .row-action cells that sit above the row-link
overlay via z-index.

Schedule edit form's cron preset chips now carry human-readable
title= tooltips ("Every day at 03:00", "Every Sunday at 03:00", etc).

tasks.md gets a binding row-design rule covering all current and
future list-row templates, and the P2R-02 entry is split into the
six slices already agreed with the operator (slices 1–3 marked
done, 4 next).
2026-05-03 12:01:55 +01:00

275 lines
32 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# restic-manager — Tasks
Tasks are grouped by phase. Each task has an ID for cross-referencing, an estimated size (S/M/L), and acceptance criteria.
Sizes: **S** = under a day, **M** = 13 days, **L** = 37 days.
---
## Phase 0 — Project bootstrap
- [x] **P0-01** (S) Initialize Go module, `cmd/server`, `cmd/agent`, baseline `internal/` packages
- [x] **P0-02** (S) Add LICENSE (PolyForm Noncommercial 1.0.0), README stub, CONTRIBUTING placeholder
- [x] **P0-03** (S) Set up `golangci-lint`, `gofumpt`, `goimports`; pre-commit config
- [x] **P0-04** (S) ~~GitHub Actions~~ Gitea Actions: build matrix (linux amd64/arm64, windows amd64), unit tests, lint
- [x] **P0-05** (S) `Dockerfile.server` (multi-stage, distroless), `deploy/docker-compose.yml`
- [x] **P0-06** (S) Makefile / ~~`taskfile.yml`~~ with common targets (`build`, `test`, `run`, `release`)
---
## Phase 1 — MVP: enrollment, visibility, on-demand backup
### Server foundations
- [x] **P1-01** (M) HTTP server scaffolding (`chi`, structured logging via `slog`, graceful shutdown)
- [x] **P1-02** (M) SQLite store layer (`modernc.org/sqlite`) + migrations (hand-rolled, `embed.FS`)
- [x] **P1-03** (M) Schema for `users`, `sessions`, `hosts`, `repos`, `credentials`, `jobs`, `job_logs`, `snapshots`, `audit_log`
- [~] **P1-04** (M) Auth: argon2id password hashing, login/logout, session cookies; **CSRF middleware deferred to P1-23 (UI work)** — REST clients use bearer/session-only flows
- [x] **P1-05** (S) First-run admin bootstrap (printed one-time setup token in server logs)
- [x] **P1-06** (M) Secret encryption helper (AEAD with key from `RM_SECRET_KEY_FILE`)
- [~] **P1-07** (M) Audit log writer; middleware sweep for every state-changing endpoint **lands when the rest of the API surface does** — login / bootstrap / host.enrolled / job.run_now currently audited
### Agent ↔ server protocol
- [x] **P1-08** (M) Define shared API types in `internal/api` (envelopes, every WS message + `protocol_version` constants; JSON-shape tests pin the wire)
- [x] **P1-09** (L) WebSocket transport (`github.com/coder/websocket`), framed JSON envelopes, RPC correlation IDs, exponential-backoff reconnect on the agent side
- [x] **P1-10** (M) Enrollment flow: `POST /api/agents/enroll` with one-time token → returns persistent bearer. Cert pin field stays in the response shape but is left empty: the server is HTTP-only behind a reverse proxy, so the operator pastes the proxy's cert hash into the install command rather than having the server introspect a cert it doesn't terminate.
- [x] **P1-11** (M) Agent registration on connect (`hello` upserts agent_version/restic_version/protocol_version, flips status online, `protocol_too_old` rejection has clean error envelope)
- [x] **P1-12** (S) Heartbeat handler (touches `last_seen_at`; background sweeper marks hosts offline after 90s without one)
### Agent foundations
- [x] **P1-13** (M) Agent config file (`/etc/restic-manager/agent.yaml`, atomic save: tmp+fsync+rename); Windows path deferred to Phase 2
- [x] **P1-14** (M) Service integration: systemd unit (sandboxed: NoNewPrivileges, Protect*, MemoryDenyWriteExecute) — Linux only in Phase 1
- [x] **P1-15** (M) Outbound WS client (`github.com/coder/websocket`) with reconnect (1s → 60s exponential + jitter), optional cert pin (sha-256 of leaf), heartbeats, `protocol_version` in hello
- [x] **P1-16** (M) Restic wrapper: locate via PATH or override, run with `--json`, scan stdout/stderr, parse `BackupStatus` + `BackupSummary`, exit-code 3 treated as success-with-issues
- [x] **P1-17** (S) Host metadata collection (OS, arch, hostname, restic version, agent version, protocol_version)
### Run-now backup
- [x] **P1-18** (L) Job lifecycle: queued → running → succeeded/failed/cancelled, persisted with stats blob and error
- [x] **P1-19** (M) Server endpoint `POST /api/hosts/{id}/jobs` to dispatch a `backup` command (validates kind, checks online, audit-logs)
- [x] **P1-20** (M) Agent executes `restic backup`, streams stdout/stderr + parsed JSON events back as `job.progress` (1Hz throttle) / `log.stream`
- [~] **P1-21** (M) Server persists log stream to `job_logs` ✓; **WS `/api/jobs/{id}/stream` for live browser tailing** still TODO — needs the per-job fan-out hub
- [x] **P1-22** (S) Snapshot listing: agent calls `restic snapshots --json` after each successful backup and ships the projection over `snapshots.report`. Server `ReplaceHostSnapshots` atomically swaps the per-host list and updates `hosts.snapshot_count` in the same tx. Read endpoint: `GET /api/hosts/{id}/snapshots`. Tests cover round-trip, authoritative-replace semantics, and empty-after-prune. Schema dropped the unused `repo_id` FK from `snapshots` (repos as a first-class entity is P2 work).
### UI (HTMX + Tailwind)
- [x] **P1-23** (M) Base layout, login page, session-aware nav
- [x] **P1-24** (M) Dashboard: fleet summary tiles + host table (status dot + row accent + os/arch + last backup + repo size + snapshots + alerts + tags + run-now). Backed by `GET /api/hosts` + `GET /api/fleet/summary` (JSON) and a server-rendered HTML view. Empty state hands the operator the install command. HTMX `Run now` button posts to `/hosts/{id}/run-backup`.
- [x] **P1-25** (M) Host detail page (`/hosts/{id}`): persistent header (status dot + mono name + tags + OS/arch/agent/restic/last-seen), vitals strip (last backup / repo size / snapshots / open alerts), sub-tabs (Snapshots active; Jobs/Repo/Settings tabs visible but inert until P2), snapshot table (cap 50, pagination later), right-rail run-now stack (backup live; forget/prune/check/unlock disabled with P2 hints) and a danger-zone delete panel.
- [x] **P1-26** (M) Live job log viewer + WS browser fan-out hub (closes the P1-21 remainder). Browser opens `/api/jobs/{id}/stream`; agent-emitted `job.started`/`job.progress`/`log.stream`/`job.finished` are mirrored to subscribers. Per-subscriber buffered channel + non-blocking broadcast keeps a slow browser from blocking the agent's read loop. Page renders running / succeeded / failed states; auto-scrolls until the operator scrolls up; reloads on `job.finished` to show the final header. "Run now" sets `HX-Redirect` so the operator lands on the live log.
- [~] **P1-27** (M) "Add host" flow: form takes hostname + repo URL/username/password, mints token (TTL 1h), re-renders the same page in result-state with the install command (`RM_SERVER` + `RM_TOKEN` filled in), copy button, and an awaiting-agent panel. Encrypted repo creds ride on the token row (P1-32) and get pushed to the agent on first WS connect (P1-33). **Deferred:** one-click "download preconfigured installer" `install-<hostname>.sh` (cf. UrBackup Internet-mode push installer) — copy-paste covers it for v1.
- [x] **P1-28** (S) Tailwind build via `tailwindcss` standalone binary (no Node) — Makefile downloads pinned v3.4.17 into `bin/tailwindcss`, builds `web/styles/input.css``web/static/css/styles.css`, embedded into the binary via `web.FS`. `make build` runs Tailwind first.
### Install scripts
- [x] **P1-29** (M) `install.sh` (Linux): detects arch, downloads agent, installs systemd unit, enrolls. Detects existing restic timers / `/etc/cron.{d,daily,hourly,weekly}/*` / root crontab and prints them with the exact disable commands — does **not** auto-disable
- [~] **P1-31** (S) Server endpoint to serve agent binaries + install scripts ✓ (`/agent/binary` + `/install/*`); **signature verification** deferred to Phase 5 OSS readiness
### Repo credentials (pulled forward from Phase 2)
- [x] **P1-32** (M) Server-side encrypted repo creds carried on the enrollment token:
- `POST /api/enrollment-tokens` body grows `repo_url`, `repo_username`, `repo_password` (all required).
- Token row stores them as one AEAD-encrypted blob (existing `crypto.AEAD`); `ConsumeEnrollmentToken` moves the blob to a new `host_credentials` row keyed by `host_id` in the same tx.
- `PUT /api/hosts/{id}/repo-credentials` (admin/operator) re-encrypts and replaces the row, emits an in-memory event to the WS hub.
- `GET /api/hosts/{id}/repo-credentials` returns the redacted view (URL + username + `has_password`) so the UI can pre-fill the edit form. Password never leaves the server outside the WS push.
- On WS `hello`, server pushes a `config.update` with decrypted creds **before** returning the connection to idle. Same path on edit-while-connected.
- Audit-logged on create / consume / edit; payload omits the secret material.
- [x] **P1-33** (M) Agent-side encrypted secrets store:
- New `internal/agent/secrets` package: AEAD blob at `/var/lib/restic-manager/secrets.enc`, atomic write (tmp+fsync+rename, mode 0600).
- Per-host 32-byte secrets key minted at enrollment, persisted in `agent.yaml` (already 0600 root-only — same trust boundary as the bearer; explicit comment in the file).
- Strip `repo_url` / `repo_password` from `agent.config.Config`. Agent loads creds from `secrets.enc` at startup; `config.update` handler writes through to the file.
- Dispatcher reads from the secrets store on every job rather than from in-memory config.
- Migration path: if `agent.yaml` still contains `repo_url`/`repo_password`, copy them into `secrets.enc` on next start, then strip from the YAML on save.
- [x] **P1-34** (S) End-to-end smoke runbook: `docs/e2e-smoke.md` walks through enrollment with repo creds → agent receives them via push-on-connect → run-now backup completes against a real `restic/rest-server` in a sibling container → host appears with snapshot count. Test-driven version (Playwright + compose) deferred to P5-06.
### Phase 1 acceptance
- One Linux host can enroll, appear in the dashboard, and a backup can be triggered from the UI with live log streaming. Snapshots list updates after success.
- Windows binary builds cleanly in CI (`.gitea/workflows/ci.yml`) but is not service-tested or installer-shipped in Phase 1 — that lands in Phase 2 (P2-16, P2-17).
- Agent ↔ server `protocol_version` handshake rejects mismatched versions with a clear error rather than failing on JSON parse.
- Repo credentials never appear in plaintext on disk: server stores them AEAD-encrypted, agent stores them AEAD-encrypted, the wire carries them only inside the authenticated WS as `config.update`.
---
## Phase 2 — Scheduling, retention, repo operations
> **Mid-phase pivot — "P2 redesign" (commits `7a7cac5`, `666af41`, `5667cdf`).**
> The original P2 plan put paths/excludes/retention/manual/kind/options on
> `Schedule` and one repo per host. After landing P2-01..P2-05 against that
> shape, the data model was rewritten: schedules are slim (cron + which
> `source_groups`); paths/excludes/retention/retry live on `source_group`
> (also doubles as the snapshot tag); forget/prune/check cadences live on
> `host_repo_maintenance` and run on a server-side ticker, not the agent
> cron; `pending_runs` queues offline retries; `host.repo_initialised_at`
> is gone (auto-init at enrolment). The redesign is captured below as
> `P2R-NN` items. Items P2-01..P2-05 stay marked done because the work
> shipped, but they're labelled ⚠️ **shipped against old shape — behaviour
> to be re-validated under P2R-02 after UI rewire**. P2-04.5 (`manual`
> flag) is dropped wholesale. P2-06..P2-15 are reframed below to point at
> their new homes; P2-16/17/18 are unaffected by the redesign.
### Original P2 work — shipped (against pre-redesign shape)
- [x] ⚠️ **P2-01** (M) Schedule schema + CRUD API (fat-schedule shape) — superseded by P2R-01.
- [x] ⚠️ **P2-02** (L) Server-pushed schedule reconciliation (fat-schedule shape) — re-validate under P2R-01.
- [x] ⚠️ **P2-03** (M) Agent local scheduler (`internal/agent/scheduler`, `robfig/cron/v3`, `schedule.fire` envelope, `dispatchScheduledJob`). The cron loop + ack/fire round-trip stay; the payload it carries reshapes under P2R-01.
- [x] ⚠️ **P2-04** (M) Schedule editor UI (fat-schedule form: paths/excludes/tags/retention/bandwidth on the schedule itself) — superseded by P2R-02.
- ~~**P2-04.5** Manual schedules / kill `host.default_paths`~~ — superseded; the `manual` flag concept is gone, Run-now lives on source groups (see P2R-01 / P2R-02).
- [x] ⚠️ **P2-05** (M) `forget` command with retention policy. Wire payload (`CommandRunPayload.retention_policy`) and restic wrapper (`restic.ForgetPolicy`, `RunForget`) are still correct; what changes under P2R-03 is **where retention comes from** (source_group, not schedule) and **who dispatches** (server-side maintenance ticker for cadence; per-source-group Run-now for ad-hoc).
### P2 redesign — Phase 1 ✅
- [x] **P2R-00.1** (M) Migration 0008 — sources + repo maintenance. Adds `source_groups`, `schedule_source_groups` junction, `host_repo_maintenance`, `pending_runs`, `host.bandwidth_up_kbps` / `bandwidth_down_kbps`. Drops `host.repo_initialised_at`. Slim-schedule columns dropped from `schedules`. Column-level ALTERs only — no table rebuilds (FK cascade trap, see CLAUDE.md). Commit `7a7cac5`.
- [x] **P2R-00.2** (M) v4 wireframes for sources / schedules / repo. Hi-fi mocks of `/hosts/{id}/sources`, `/sources/{gid}/edit` (with retention-conflict banner), slim `/schedules`, `/repo` (connection / bandwidth / maintenance / re-init). Commit `666af41`.
### P2 redesign — Phase 2 ✅
- [x] **P2R-00.3** (L) Go-side store rewrite against migration 0008. New types: `SourceGroup`, `HostRepoMaintenance`, `PendingRun`. `Schedule` slimmed to `{id, host_id, cron, enabled, source_group_ids, timestamps}`. `RetentionPolicy` moves from schedule field → source group field (type unchanged). `Host` loses `RepoInitialisedAt`, gains bandwidth caps. New files: `store/sources.go`, `store/maintenance.go`, `store/pending.go`. `store/schedules.go` rewritten for slim shape + junction CRUD. `enrollment.go` seeds a default source group + repo-maintenance row instead of a manual schedule. `ws/handler.go` drops `MarkHostRepoInitialised`. HTTP layer + UI templates **temporarily 501-stubbed** with `redesign_in_progress` — this is what P2R-01 / P2R-02 fill back in. Tests for the obsolete fat-schedule API deleted. Commit `5667cdf`.
- [x] **P2R-00.4** (S) Host-detail UI patched up enough to render: `RepoInitialisedAt` template refs removed, manual init-repo branches stripped, dead Schedules sub-tab demoted to inert div (matches Jobs/Repo/Settings), broken Run-now buttons disabled with P2-Phase-4 hints. Stop-gap until P2R-02 lands the real surface.
### P2 redesign — Phase 3 (REST + WS rewire) ✅
- [x] **P2R-01** (L) HTTP/WS layer against the slim shape:
- **Schedules REST CRUD**: `GET|POST /api/hosts/{id}/schedules`, `PUT|DELETE /api/hosts/{id}/schedules/{sid}`. Body shape is `{cron, enabled, source_group_ids[]}` — paths/excludes/retention/kind/manual all go away. Junction wiped + re-inserted on every update (per `store.UpdateSchedule`). Validation: cron parses via `robfig/cron/v3`; ≥1 `source_group_ids`; all referenced groups belong to the host.
- **Source-groups REST CRUD**: `GET|POST /api/hosts/{id}/source-groups`, `GET|PUT|DELETE /api/hosts/{id}/source-groups/{gid}`. Body: `{name, includes[], excludes[], retention_policy, retry_max, retry_backoff_seconds}`. Name uniqueness per host. Refuse delete if `SchedulesUsingGroup(gid)` is non-empty (return the schedule list so UI can show "remove from these schedules first"). Mutations bump `host_schedule_version`.
- **Repo-maintenance REST**: `GET|PUT /api/hosts/{id}/repo-maintenance`. Body: `{forget_cadence, prune_cadence, check_cadence, check_subset_pct, enabled}`. Server-side ticker drives execution (P2R-04), so updates here do **not** bump `host_schedule_version`.
- **Per-source-group Run-now**: `POST /hosts/{id}/source-groups/{gid}/run`. Reuses the existing `dispatchScheduleNow`-style path; agent receives a normal `command.run` carrying the resolved includes/excludes/retention from the group. This replaces the old per-host `/hosts/{id}/run-backup` endpoint (kept around as a 410-Gone with a hint pointing to source groups).
- **`schedule_push.go` reconciliation**: rebuild `pushScheduleSet*` to ship the new wire format (`ScheduleSetPayload` carries `[{schedule_id, cron, enabled, source_groups: [{name, includes, excludes, retention, retry_*}]}]` — agent doesn't need to know `source_group_id`, just the resolved bundle). On-hello + async-on-CRUD flavours; ack still persists `applied_schedule_version`.
- **Auto-init at enrolment**: server dispatches `restic init` on first WS connect (was P2-old "Init repo" button — now invisible to the operator). On success: emit a normal job row with `kind=init` so the audit trail still shows it. On `init` returning "config file already exists" (e.g. re-enrolment against an existing repo): treat as soft success per existing restic-wrapper behaviour.
- **Tests**: rewrite the deleted `schedules_test.go` and `schedule_push_test.go` against new endpoints; new `source_groups_test.go`, `repo_maintenance_test.go`, `auto_init_test.go`. End-to-end: enrol → server pushes creds → server dispatches init → agent runs it → schedule reconcile fires → operator hits per-source-group Run-now → backup runs → snapshots refresh.
### P2 redesign — Phase 4 (UI rewire, against v4 wireframes) — IN PROGRESS
> **Row-design rule (binding for every list-row template in this app, current and future):**
> Whole-row click navigates to the row's primary detail/edit page —
> mirror `.host-row.clickable` on the dashboard
> (`partials/host_row.html`): an absolute-positioned `.row-link`
> overlay with `text-indent: -9999px` covers the row, action buttons
> live in `.row-action` cells that sit above via z-index. **Do not
> add an explicit "Edit" button** when the row is clickable — it
> duplicates the affordance and dilutes the click target. Action
> cells are reserved for verbs that aren't "open this row" (Run-now,
> Delete, Pause, etc).
- [ ] **P2R-02** (L) UI templates rebuilt against the new model:
- **Slice 1 ✅** Sub-tab navigation skeleton — extract header/vitals/sub-tabs into a `host_chrome` partial; Sources / Schedules / Repo become real `<a>` links; placeholder pages share the chrome; version indicator restored. (commit `a535822`)
- **Slice 2 ✅** Sources tab — `/hosts/{id}/sources` list with per-row meta + clickable rows + per-group Run-now/Delete; `/sources/new` and `/sources/{gid}/edit` form (name, includes/excludes, 3×2 keep-* grid, retry-on-offline, inline conflict banner from `ConflictDimension` cache); validation re-renders form with input intact; refuses to delete a host's last source group. (commits `0ed9c3d`, `dede74f`)
- **Slice 3 ✅** Schedules tab — `/hosts/{id}/schedules` slim list (status / cron / source-tags / actions, clickable rows) plus `/schedules/new` and `/schedules/{sid}/edit` form (cron with five quick-pick chips that have human-readable tooltips, source-group multi-pick as styled check cards, enabled toggle); per-schedule Run-now reuses `dispatchScheduledJob`. (commit `67ca769` + clickable-row follow-up)
- **Slice 4 (next)** `/hosts/{id}/repo` — connection (URL/user/password — pre-filled from `GET /api/hosts/{id}/repo-credentials` redacted view; cert pin), bandwidth caps (host-wide; new `PUT /api/hosts/{id}/bandwidth`), maintenance rows (forget/prune/check cadence + check subset %), danger-zone re-init.
- **Slice 5** Per-source-group Run-now wiring on dashboard row + empty-snapshots state. Dashboard row's Run-now becomes either "Run all groups" (if exactly one schedule covers all groups) or "Open →" (multi-group hosts).
- **Slice 6** Playwright sweep — login → walk new tabs → create source group → create schedule → run-now → confirm dispatch.
- Header "version N · agent in sync / agent at vM" indicator preserved across all tabs (backed by `host_schedule_version` + `applied_schedule_version`).
- Form validation re-renders with the operator's typed input intact (mirror P2-04's behaviour). Each save fires `pushScheduleSetAsync` so an online agent re-arms within seconds.
### P2 redesign — Phase 5 (server-side maintenance ticker) — TODO
- [ ] **P2R-03** (M) `prune` command end-to-end. Restic wrapper (`restic.RunPrune`), agent dispatcher (`case api.JobPrune:`), wire envelope. **Admin-only credential**: a second `host_credentials` row keyed by `host_id` + `kind=admin` carries the non-append-only username/password; server pushes it via `config.update` only when dispatching a prune job, and the agent's secrets store keeps it in a separate slot from the everyday append-only creds. UI: prune row on the Repo page. Operator-triggered Run-now via `POST /hosts/{id}/repo/prune`. Cadence-driven dispatch lands in P2R-04.
- [ ] **P2R-04** (M) `check` command end-to-end (`restic check --read-data-subset N%`). Wrapper + dispatcher + wire. UI: check row on the Repo page (with the subset % slider). Operator Run-now via `POST /hosts/{id}/repo/check`. Cadence-driven dispatch lands in P2R-05.
- [ ] **P2R-05** (S) `unlock` command end-to-end (`restic unlock`). Operator-only — no cadence. `POST /hosts/{id}/repo/unlock`. Repo page surfaces lock state from the most recent `check` (which warns about stale locks).
- [ ] **P2R-06** (M) Server-side maintenance ticker. Cron-style loop on the server reads `host_repo_maintenance` rows, dispatches `forget` / `prune` / `check` jobs against the right host on the configured cadence (last-run timestamps tracked per kind on the maintenance row). Independent of the agent's local cron — the agent's cron only handles backup schedules now. Skips offline hosts (queues to `pending_runs` instead — see P2R-08). Handles ticker restarts cleanly (no-op if a job of the same kind ran inside the cadence window).
- [ ] **P2R-07** (S) Repo stats panel on the Repo page: size, dedup ratio, snapshot count, last-check timestamp + result, lock state, last-prune timestamp + bytes-freed. Backed by parsing `restic stats --json` output that the agent ships periodically (piggyback on the existing snapshots-report path).
- [ ] **P2R-08** (M) Pending-runs queue worker. On agent reconnect, server drains `pending_runs` rows for that host and re-dispatches them in order. Bump backoff per `pending_run.attempt_count`; drop rows that have exceeded the source-group's `retry_max`. Audit-logged. Smoke-tested by stopping the agent, running maintenance ticker so cadence misses, restarting agent, watching the queue drain.
### P2 redesign — Phase 6 (auto-init follow-up) — TODO
- [ ] **P2R-09** (S) Auto-init UX polish. Surface init result on host detail (small "repo ready · initialised by you on …" line; or "init failed — see job N · retry" if init failed). Re-init button on Repo page danger zone wipes then re-runs init (admin only, audit-logged, two-step confirm with the host name typed in).
### Pre/post hooks (rehomed onto source groups) — TODO
- [ ] **P2R-10** (M) Hook schema: `source_group.pre_hook`, `source_group.post_hook`, `host.pre_hook_default`, `host.post_hook_default`. Encrypted at rest (existing `crypto.AEAD`). Admin-only edit. Audit-logged.
- [ ] **P2R-11** (M) Agent execution of hooks: configurable shell per host. `pre_hook` failure aborts the backup. `post_hook` always runs with `RM_JOB_STATUS` env var. Stdout/stderr captured into `JobLog` with a `hook:` prefix. Hooks only run for `kind=backup` jobs (forget/prune/check/unlock skip them, per spec.md §14.3).
- [ ] **P2R-12** (S) Hook editor UI on source-group edit page (per-group override) and host Settings tab (host-wide default). Validation rejects non-backup contexts. Warning banner: "this hook runs as the agent service user (root on Linux; LocalSystem on Windows)".
### Bandwidth + niceties (rehomed onto host + source groups) — TODO
- [ ] **P2R-13** (S) Bandwidth limit fields. Host-wide caps (`Host.BandwidthUpKBps`, `BandwidthDownKBps` — schema is in 0008 already, just needs UI on the Repo page) applied to every restic invocation. Per-job override on Run-now (override field on the Run-now confirm dialog). Maps to `restic --limit-upload` / `--limit-download`.
- [ ] **P2R-14** (S) Schedule "next run" / "last run" surfaced on host card (dashboard row) + on the Schedules tab. "Next run" computed server-side from cron + now; "last run" from the most recent job with `actor_kind=schedule` for any schedule that uses any of the host's source groups.
### Cross-platform + alt-enrolment (unchanged by redesign) — TODO
- [ ] **P2-16** (M) Windows service integration: agent runs under the Service Control Manager via `golang.org/x/sys/windows/svc`; install/uninstall/start/stop wired up.
- [ ] **P2-17** (M) `install.ps1` (Windows): downloads agent, installs as service, enrolls; detects existing scheduled tasks named `*restic*` and prints them for manual review.
- [ ] **P2-18** (L) Announce-and-approve enrollment (second enrollment mode, alongside the token flow that ships in Phase 1):
- Agent run with no `RM_TOKEN` generates a local Ed25519 keypair (persisted alongside the encrypted secrets blob), then `POST /api/agents/announce` with `{hostname, os, arch, agent_version, restic_version, public_key}`. Server stores a `pending_hosts` row (`public_key`, `fingerprint = sha256(public_key)`, `announced_from_ip`, `first_seen_at`, `last_seen_at`, `expires_at = now+1h`). Hostname collisions with existing or other pending rows are flagged in the response so the install script can warn loudly on the endpoint terminal.
- Agent then opens a long-poll/WS to `/ws/agent/pending` authenticated by signing a server-issued nonce with its private key — proves possession of the key tied to the pending row. Connection stays open; agent waits.
- Install script prints the fingerprint on the endpoint's terminal in a copy-friendly form (e.g. `SHA256:ab12…cd34`) and tells the operator to compare it to the one shown in the UI before clicking accept.
- UI: new "Pending hosts" panel on the dashboard. Admin sees fingerprint, hostname, source IP, OS/arch, time announced. Buttons: **Accept** (mints persistent bearer + repo creds, pushes both down the open pending socket, promotes pending row → real Host row, audit-logged) / **Reject** (deletes pending row, closes the socket with a clean error). Fingerprint is the load-bearing field — UI must make comparison easy (large monospace, one-click copy).
- Server-side guards: per-source-IP rate limit on `/api/agents/announce` (token-bucket, e.g. 10/min); global cap on pending rows (e.g. 100); pending rows auto-expire after 1h; duplicate-hostname pending rows allowed but visually flagged in UI; accepting one does **not** auto-reject the others (admin sees them all and decides — defends against the "attacker announces first, real host second" race).
- Token-based enrollment (Phase 1) remains the default and is unchanged; announce-and-approve is opt-in for interactive installs. Docs explicitly call out that the fingerprint comparison step is what makes this flow safe — without it, this is no better than trusting `hostname` over the wire.
### Phase 2 acceptance
- A host can be onboarded end-to-end with no manual REST: enrol → auto-init runs → operator opens host → creates source group(s) → attaches them to one or more schedules → schedule fires on time → backup runs against the right paths with the right retention → snapshots tagged by group name appear in UI.
- Operator can hit Run-now per source group from any of: dashboard row (single-group host), source group row, snapshot empty-state.
- Server-side maintenance ticker drives forget/prune/check at the configured cadences, independent of agent cron. Offline hosts queue to `pending_runs` and drain on reconnect.
- Pre/post hooks fire correctly per source group, fail loudly on `pre_hook` errors, run `post_hook` with `RM_JOB_STATUS`. Rejected on non-backup kinds.
- Bandwidth limits honoured (host-wide default + per-run override).
- A Windows host can enrol, appear in the dashboard, and run a backup with live log streaming.
- A Linux host can enrol via announce-and-approve, with fingerprint-comparison gate enforced. Rate-limit + pending-cap guards verified.
---
## Phase 3 — Restore, alerts, audit
- [ ] **P3-01** (L) Restore wizard backend: snapshot tree browse via `restic ls --json`, path picker, target selection
- [ ] **P3-02** (L) Restore wizard UI (multi-step: host → snapshot → paths → target → confirm)
- [ ] **P3-03** (M) Restore execution: `restic restore` invocation, progress streaming
- [ ] **P3-04** (L) Cross-host restore: target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root
- [ ] **P3-05** (M) Alert engine: rule evaluation loop (failed backup, stale schedule, agent offline, check failed)
- [ ] **P3-06** (M) Notification channels: webhook, ntfy, SMTP email
- [ ] **P3-07** (S) Alert UI: list, acknowledge, resolve
- [ ] **P3-08** (S) Audit log UI with filters (user, action, target, time range)
- [ ] **P3-09** (S) `diff` between two snapshots in UI
### Phase 3 acceptance
- A file deleted on a host can be restored from the UI in under 2 minutes. A failed backup raises an alert via the configured channel within 60s.
---
## Phase 4 — Update delivery, RBAC polish, OIDC
- [ ] **P4-01** (M) Update delivery via OS package managers — host an apt repo (Linux) and Chocolatey package (Windows) on gitea releases. `restic-manager-agent update` is a thin wrapper over `apt-get install --only-upgrade restic-manager-agent` / `choco upgrade`. Trades flexibility for a much smaller security surface than bespoke signed binaries (see spec.md §4.2)
- [ ] **P4-02** (M) Agent version reporting on dashboard: surface "agent N versions behind server"; "update all" admin action calls the package-manager wrapper on each host
- [ ] **P4-03** (M) RBAC enforcement at API layer (admin / operator / viewer)
- [ ] **P4-04** (S) User management UI (create/edit/disable, role assignment, password reset)
- [ ] **P4-05** (L) OIDC login (generic provider config, group → role mapping)
- [ ] **P4-06** (M) Repo size trend graphs (sparkline on host card, full chart on repo page)
- [ ] **P4-07** (S) Per-host tags + dashboard filtering by tag
- [ ] **P4-08** (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list
- [ ] **P4-09** (S) Document Prometheus integration + sample Grafana dashboard JSON
### Phase 4 acceptance
- Non-admin users see an appropriately limited UI. Agents upgrade via apt/choco with one admin-triggered action. OIDC login works against at least one provider (Authelia or Authentik). Prometheus can scrape `/metrics` and the sample Grafana dashboard renders with live data.
---
## Phase 5 — OSS readiness
- [ ] **P5-01** (M) Documentation site (mdBook or similar) with install, concepts, security model, screenshots
- [ ] **P5-02** (S) `CONTRIBUTING.md`, `CODE_OF_CONDUCT.md`, issue + PR templates
- [ ] **P5-03** (S) Release automation: `goreleaser` for binaries + Docker image to GHCR
- [ ] **P5-04** (S) Demo screenshots / short Loom walkthrough in README
- [ ] **P5-05** (S) `SECURITY.md` with disclosure process
- [ ] **P5-06** (M) End-to-end test suite in CI (Playwright vs. compose stack with sibling Linux agent)
- [ ] **P5-07** (S) Reference deployment: `docker-compose.yml` + Caddyfile snippet showing the TLS-terminating reverse proxy in front of the HTTP-only server (also demonstrates `RM_TRUSTED_PROXY`)
### Phase 5 acceptance
- A stranger can read the docs and stand up a working install in under 30 minutes.
---
## Cross-cutting / ongoing
- [ ] **X-01** Keep CHANGELOG.md updated (Keep-a-Changelog format)
- [ ] **X-02** Track restic version compatibility matrix
- [ ] **X-03** Periodic dependency updates (`dependabot` or `renovate`)
- [ ] **X-04** Threat-model review at end of each phase
- [ ] **X-05** Proper first-run onboarding UI: admin shouldn't need to `curl` `/api/bootstrap` by hand. Render the bootstrap form on the same login page (extra "setup token" field shown only while no admin user exists, hidden after); on submit POST to `/api/bootstrap`, then drop straight into a session. Surface the one-time token from the server log somewhere copy-able (or print a clickable URL with the token in the query string at first-run). Also: relax the 12-char password floor for the first-run path or document it in the form so `admin` doesn't silently fail validation.