e871b05b38
CI / Test (linux/amd64) (pull_request) Successful in 34s
CI / Lint (pull_request) Failing after 16s
CI / Build (windows/amd64) (pull_request) Successful in 22s
CI / Build (linux/amd64) (pull_request) Successful in 20s
CI / Build (linux/arm64) (pull_request) Successful in 21s
Cleanup pass over the repo so CI can enforce lint going forward
without the only-new-issues escape hatch:
* gofumpt -w across the tree (31 hits, all formatting)
* misspell --fix (25 hits, US-locale spelling) — but reverted on
api.JobCancelled = "cancelled" since that literal is the wire +
DB CHECK constraint value, plus matched the case in store/fleet.go
back to "cancelled" and added //nolint:misspell on both for the
next time someone reaches for the auto-fix
* Wrap every `defer rows.Close()` / `defer stmt.Close()` /
`defer res.Body.Close()` in `defer func() { _ = x.Close() }()`
to satisfy errcheck without losing the close itself
* websocket.Dial callers (1 prod, 4 tests) now capture + close the
upgrade response Body — coder/websocket can return res with a nil
Body on success, so the test deferred-closes guard against that
* Annotate the two genuine-by-design nilerr cases with //nolint
comments explaining why nil-on-error is the contract (cookie
missing = no session; ctx cancelled mid-backoff = clean shutdown)
* Add brief godoc on the 10 exported const groups + types that
revive flagged (api.HostOS/HostArch/JobKind/JobStatus/LogStream/
ErrorCode, restic.EventKind, store.Role, web.FS)
* Drop the unused (*Server).userByID method
* Inline the unparam baseView(active) — every UI page is under
the dashboard primary nav today
Result: `golangci-lint run ./...` reports 0 issues. CI lint job
no longer needs only-new-issues: true; X-06 follow-up entry in
tasks.md removed.
276 lines
33 KiB
Markdown
# restic-manager — Tasks

Tasks are grouped by phase. Each task has an ID for cross-referencing, an estimated size (S/M/L), and acceptance criteria.

Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.

---

## Phase 0 — Project bootstrap

- [x] **P0-01** (S) Initialize Go module, `cmd/server`, `cmd/agent`, baseline `internal/` packages
- [x] **P0-02** (S) Add LICENSE (PolyForm Noncommercial 1.0.0), README stub, CONTRIBUTING placeholder
- [x] **P0-03** (S) Set up `golangci-lint`, `gofumpt`, `goimports`; pre-commit config
- [x] **P0-04** (S) ~~GitHub Actions~~ Gitea Actions: build matrix (linux amd64/arm64, windows amd64), unit tests, lint
- [x] **P0-05** (S) `Dockerfile.server` (multi-stage, distroless), `deploy/docker-compose.yml`
- [x] **P0-06** (S) Makefile / ~~`taskfile.yml`~~ with common targets (`build`, `test`, `run`, `release`)

---

## Phase 1 — MVP: enrollment, visibility, on-demand backup

### Server foundations

- [x] **P1-01** (M) HTTP server scaffolding (`chi`, structured logging via `slog`, graceful shutdown)
- [x] **P1-02** (M) SQLite store layer (`modernc.org/sqlite`) + migrations (hand-rolled, `embed.FS`)
- [x] **P1-03** (M) Schema for `users`, `sessions`, `hosts`, `repos`, `credentials`, `jobs`, `job_logs`, `snapshots`, `audit_log`
- [~] **P1-04** (M) Auth: argon2id password hashing, login/logout, session cookies; **CSRF middleware deferred to P1-23 (UI work)** — REST clients use bearer/session-only flows
- [x] **P1-05** (S) First-run admin bootstrap (printed one-time setup token in server logs)
- [x] **P1-06** (M) Secret encryption helper (AEAD with key from `RM_SECRET_KEY_FILE`)
- [~] **P1-07** (M) Audit log writer; middleware sweep for every state-changing endpoint **lands when the rest of the API surface does** — login / bootstrap / host.enrolled / job.run_now currently audited
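
For P1-06, a minimal sketch of what an AEAD seal/open helper could look like. The source does not name the cipher, so AES-256-GCM is an assumption here, and `newAEAD`, `seal`, and `open` are hypothetical names rather than the project's actual `crypto.AEAD` API. The nonce is generated per message and stored in front of the ciphertext.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newAEAD builds an AES-256-GCM AEAD from a 32-byte key.
// Assumption: the real helper may use a different AEAD construction.
func newAEAD(key []byte) (cipher.AEAD, error) {
	if len(key) != 32 {
		return nil, errors.New("key must be 32 bytes")
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext and prepends a fresh random nonce to the output.
func seal(a cipher.AEAD, plaintext []byte) ([]byte, error) {
	nonce := make([]byte, a.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return a.Seal(nonce, nonce, plaintext, nil), nil
}

// open splits the nonce off the blob, then authenticates and decrypts the rest.
func open(a cipher.AEAD, blob []byte) ([]byte, error) {
	if len(blob) < a.NonceSize() {
		return nil, errors.New("blob too short")
	}
	nonce, ct := blob[:a.NonceSize()], blob[a.NonceSize():]
	return a.Open(nil, nonce, ct, nil)
}

func main() {
	key := make([]byte, 32) // all-zero demo key; a real key would come from the key file
	a, err := newAEAD(key)
	if err != nil {
		panic(err)
	}
	blob, err := seal(a, []byte("repo-password"))
	if err != nil {
		panic(err)
	}
	pt, err := open(a, blob)
	fmt.Println(string(pt), err)
}
```

Any modification to the stored blob makes `open` fail, which is the property the encrypted credential rows rely on.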

### Agent ↔ server protocol

- [x] **P1-08** (M) Define shared API types in `internal/api` (envelopes, every WS message + `protocol_version` constants; JSON-shape tests pin the wire)
- [x] **P1-09** (L) WebSocket transport (`github.com/coder/websocket`), framed JSON envelopes, RPC correlation IDs, exponential-backoff reconnect on the agent side
- [x] **P1-10** (M) Enrollment flow: `POST /api/agents/enroll` with one-time token → returns persistent bearer. Cert pin field stays in the response shape but is left empty: the server is HTTP-only behind a reverse proxy, so the operator pastes the proxy's cert hash into the install command rather than having the server introspect a cert it doesn't terminate.
- [x] **P1-11** (M) Agent registration on connect (`hello` upserts agent_version/restic_version/protocol_version, flips status online, `protocol_too_old` rejection has clean error envelope)
- [x] **P1-12** (S) Heartbeat handler (touches `last_seen_at`; background sweeper marks hosts offline after 90s without one)

### Agent foundations

- [x] **P1-13** (M) Agent config file (`/etc/restic-manager/agent.yaml`, atomic save: tmp+fsync+rename); Windows path deferred to Phase 2
- [x] **P1-14** (M) Service integration: systemd unit (sandboxed: NoNewPrivileges, Protect*, MemoryDenyWriteExecute) — Linux only in Phase 1
- [x] **P1-15** (M) Outbound WS client (`github.com/coder/websocket`) with reconnect (1s → 60s exponential + jitter), optional cert pin (sha-256 of leaf), heartbeats, `protocol_version` in hello
- [x] **P1-16** (M) Restic wrapper: locate via PATH or override, run with `--json`, scan stdout/stderr, parse `BackupStatus` + `BackupSummary`, exit-code 3 treated as success-with-issues
- [x] **P1-17** (S) Host metadata collection (OS, arch, hostname, restic version, agent version, protocol_version)

### Run-now backup

- [x] **P1-18** (L) Job lifecycle: queued → running → succeeded/failed/cancelled, persisted with stats blob and error
- [x] **P1-19** (M) Server endpoint `POST /api/hosts/{id}/jobs` to dispatch a `backup` command (validates kind, checks online, audit-logs)
- [x] **P1-20** (M) Agent executes `restic backup`, streams stdout/stderr + parsed JSON events back as `job.progress` (1Hz throttle) / `log.stream`
- [~] **P1-21** (M) Server persists log stream to `job_logs` ✓; **WS `/api/jobs/{id}/stream` for live browser tailing** still TODO — needs the per-job fan-out hub
- [x] **P1-22** (S) Snapshot listing: agent calls `restic snapshots --json` after each successful backup and ships the projection over `snapshots.report`. Server `ReplaceHostSnapshots` atomically swaps the per-host list and updates `hosts.snapshot_count` in the same tx. Read endpoint: `GET /api/hosts/{id}/snapshots`. Tests cover round-trip, authoritative-replace semantics, and empty-after-prune. Schema dropped the unused `repo_id` FK from `snapshots` (repos as a first-class entity is P2 work).
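
The job lifecycle in P1-18 (queued → running → succeeded/failed/cancelled) reads naturally as a small transition table. This is a sketch with hypothetical names; the real constants live in `internal/api`. Whether a queued job may be cancelled directly is not stated above, so that edge is left out.

```go
package main

import "fmt"

// JobStatus mirrors the lifecycle states named in P1-18.
type JobStatus string

const (
	JobQueued    JobStatus = "queued"
	JobRunning   JobStatus = "running"
	JobSucceeded JobStatus = "succeeded"
	JobFailed    JobStatus = "failed"
	JobCancelled JobStatus = "cancelled"
)

// next lists the allowed successors of each state. Terminal states have none.
var next = map[JobStatus][]JobStatus{
	JobQueued:  {JobRunning},
	JobRunning: {JobSucceeded, JobFailed, JobCancelled},
}

// canTransition reports whether a job may move from one state to another.
func canTransition(from, to JobStatus) bool {
	for _, s := range next[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(JobQueued, JobRunning))
	fmt.Println(canTransition(JobSucceeded, JobRunning))
}
```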

### UI (HTMX + Tailwind)

- [x] **P1-23** (M) Base layout, login page, session-aware nav
- [x] **P1-24** (M) Dashboard: fleet summary tiles + host table (status dot + row accent + os/arch + last backup + repo size + snapshots + alerts + tags + run-now). Backed by `GET /api/hosts` + `GET /api/fleet/summary` (JSON) and a server-rendered HTML view. Empty state hands the operator the install command. HTMX `Run now` button posts to `/hosts/{id}/run-backup`.
- [x] **P1-25** (M) Host detail page (`/hosts/{id}`): persistent header (status dot + mono name + tags + OS/arch/agent/restic/last-seen), vitals strip (last backup / repo size / snapshots / open alerts), sub-tabs (Snapshots active; Jobs/Repo/Settings tabs visible but inert until P2), snapshot table (cap 50, pagination later), right-rail run-now stack (backup live; forget/prune/check/unlock disabled with P2 hints) and a danger-zone delete panel.
- [x] **P1-26** (M) Live job log viewer + WS browser fan-out hub (closes the P1-21 remainder). Browser opens `/api/jobs/{id}/stream`; agent-emitted `job.started`/`job.progress`/`log.stream`/`job.finished` are mirrored to subscribers. Per-subscriber buffered channel + non-blocking broadcast keeps a slow browser from blocking the agent's read loop. Page renders running / succeeded / failed states; auto-scrolls until the operator scrolls up; reloads on `job.finished` to show the final header. "Run now" sets `HX-Redirect` so the operator lands on the live log.
- [~] **P1-27** (M) "Add host" flow: form takes hostname + repo URL/username/password, mints token (TTL 1h), re-renders the same page in result-state with the install command (`RM_SERVER` + `RM_TOKEN` filled in), copy button, and an awaiting-agent panel. Encrypted repo creds ride on the token row (P1-32) and get pushed to the agent on first WS connect (P1-33). **Deferred:** one-click "download preconfigured installer" `install-<hostname>.sh` (cf. UrBackup Internet-mode push installer) — copy-paste covers it for v1.
- [x] **P1-28** (S) Tailwind build via `tailwindcss` standalone binary (no Node) — Makefile downloads pinned v3.4.17 into `bin/tailwindcss`, builds `web/styles/input.css` → `web/static/css/styles.css`, embedded into the binary via `web.FS`. `make build` runs Tailwind first.
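
The fan-out pattern from P1-26 (per-subscriber buffered channel + non-blocking broadcast) can be sketched as below. The `hub` type and its methods are hypothetical, and dropping the message for a full subscriber is an assumption: the task only says the broadcast must not block.

```go
package main

import (
	"fmt"
	"sync"
)

// hub fans messages out to any number of subscribers.
type hub struct {
	mu   sync.Mutex
	subs map[chan string]struct{}
}

func newHub() *hub { return &hub{subs: make(map[chan string]struct{})} }

// subscribe registers a new subscriber with its own buffer.
func (h *hub) subscribe(buf int) chan string {
	ch := make(chan string, buf)
	h.mu.Lock()
	h.subs[ch] = struct{}{}
	h.mu.Unlock()
	return ch
}

// unsubscribe removes the subscriber and closes its channel.
func (h *hub) unsubscribe(ch chan string) {
	h.mu.Lock()
	delete(h.subs, ch)
	h.mu.Unlock()
	close(ch)
}

// broadcast offers msg to every subscriber without ever blocking.
// It returns how many subscribers had a full buffer and missed the message.
func (h *hub) broadcast(msg string) (dropped int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for ch := range h.subs {
		select {
		case ch <- msg:
		default:
			dropped++ // slow reader: skip rather than stall the producer
		}
	}
	return dropped
}

func main() {
	h := newHub()
	fast := h.subscribe(8)
	slow := h.subscribe(1)
	h.broadcast("job.started")
	fmt.Println("dropped:", h.broadcast("log.stream"))
	fmt.Println(<-fast, <-fast, <-slow)
}
```

The `select` with a `default` branch is what keeps a stalled browser tab from back-pressuring the agent's read loop.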

### Install scripts

- [x] **P1-29** (M) `install.sh` (Linux): detects arch, downloads agent, installs systemd unit, enrolls. Detects existing restic timers / `/etc/cron.{d,daily,hourly,weekly}/*` / root crontab and prints them with the exact disable commands — does **not** auto-disable
- [~] **P1-31** (S) Server endpoint to serve agent binaries + install scripts ✓ (`/agent/binary` + `/install/*`); **signature verification** deferred to Phase 5 OSS readiness

### Repo credentials (pulled forward from Phase 2)

- [x] **P1-32** (M) Server-side encrypted repo creds carried on the enrollment token:
- `POST /api/enrollment-tokens` body grows `repo_url`, `repo_username`, `repo_password` (all required).
- Token row stores them as one AEAD-encrypted blob (existing `crypto.AEAD`); `ConsumeEnrollmentToken` moves the blob to a new `host_credentials` row keyed by `host_id` in the same tx.
- `PUT /api/hosts/{id}/repo-credentials` (admin/operator) re-encrypts and replaces the row, emits an in-memory event to the WS hub.
- `GET /api/hosts/{id}/repo-credentials` returns the redacted view (URL + username + `has_password`) so the UI can pre-fill the edit form. Password never leaves the server outside the WS push.
- On WS `hello`, server pushes a `config.update` with decrypted creds **before** returning the connection to idle. Same path on edit-while-connected.
- Audit-logged on create / consume / edit; payload omits the secret material.
- [x] **P1-33** (M) Agent-side encrypted secrets store:
- New `internal/agent/secrets` package: AEAD blob at `/var/lib/restic-manager/secrets.enc`, atomic write (tmp+fsync+rename, mode 0600).
- Per-host 32-byte secrets key minted at enrollment, persisted in `agent.yaml` (already 0600 root-only — same trust boundary as the bearer; explicit comment in the file).
- Strip `repo_url` / `repo_password` from `agent.config.Config`. Agent loads creds from `secrets.enc` at startup; `config.update` handler writes through to the file.
- Dispatcher reads from the secrets store on every job rather than from in-memory config.
- Migration path: if `agent.yaml` still contains `repo_url`/`repo_password`, copy them into `secrets.enc` on next start, then strip from the YAML on save.
- [x] **P1-34** (S) End-to-end smoke runbook: `docs/e2e-smoke.md` walks through enrollment with repo creds → agent receives them via push-on-connect → run-now backup completes against a real `restic/rest-server` in a sibling container → host appears with snapshot count. Test-driven version (Playwright + compose) deferred to P5-06.

### Phase 1 acceptance

- One Linux host can enroll, appear in the dashboard, and a backup can be triggered from the UI with live log streaming. Snapshots list updates after success.
- Windows binary builds cleanly in CI (`.gitea/workflows/ci.yml`) but is not service-tested or installer-shipped in Phase 1 — that lands in Phase 2 (P2-16, P2-17).
- Agent ↔ server `protocol_version` handshake rejects mismatched versions with a clear error rather than failing on JSON parse.
- Repo credentials never appear in plaintext on disk: server stores them AEAD-encrypted, agent stores them AEAD-encrypted, the wire carries them only inside the authenticated WS as `config.update`.

---

## Phase 2 — Scheduling, retention, repo operations

> **Mid-phase pivot — "P2 redesign" (commits `7a7cac5`, `666af41`, `5667cdf`).**
> The original P2 plan put paths/excludes/retention/manual/kind/options on
> `Schedule` and one repo per host. After landing P2-01..P2-05 against that
> shape, the data model was rewritten: schedules are slim (cron + which
> `source_groups`); paths/excludes/retention/retry live on `source_group`
> (also doubles as the snapshot tag); forget/prune/check cadences live on
> `host_repo_maintenance` and run on a server-side ticker, not the agent
> cron; `pending_runs` queues offline retries; `host.repo_initialised_at`
> is gone (auto-init at enrolment). The redesign is captured below as
> `P2R-NN` items. Items P2-01..P2-05 stay marked done because the work
> shipped, but they're labelled ⚠️ **shipped against old shape — behaviour
> to be re-validated under P2R-02 after UI rewire**. P2-04.5 (`manual`
> flag) is dropped wholesale. P2-06..P2-15 are reframed below to point at
> their new homes; P2-16/17/18 are unaffected by the redesign.

### Original P2 work — shipped (against pre-redesign shape)

- [x] ⚠️ **P2-01** (M) Schedule schema + CRUD API (fat-schedule shape) — superseded by P2R-01.
- [x] ⚠️ **P2-02** (L) Server-pushed schedule reconciliation (fat-schedule shape) — re-validate under P2R-01.
- [x] ⚠️ **P2-03** (M) Agent local scheduler (`internal/agent/scheduler`, `robfig/cron/v3`, `schedule.fire` envelope, `dispatchScheduledJob`). The cron loop + ack/fire round-trip stay; the payload it carries reshapes under P2R-01.
- [x] ⚠️ **P2-04** (M) Schedule editor UI (fat-schedule form: paths/excludes/tags/retention/bandwidth on the schedule itself) — superseded by P2R-02.
- ~~**P2-04.5** Manual schedules / kill `host.default_paths`~~ — superseded; the `manual` flag concept is gone, Run-now lives on source groups (see P2R-01 / P2R-02).
- [x] ⚠️ **P2-05** (M) `forget` command with retention policy. Wire payload (`CommandRunPayload.retention_policy`) and restic wrapper (`restic.ForgetPolicy`, `RunForget`) are still correct; what changes under P2R-03 is **where retention comes from** (source_group, not schedule) and **who dispatches** (server-side maintenance ticker for cadence; per-source-group Run-now for ad-hoc).

### P2 redesign — Phase 1 ✅

- [x] **P2R-00.1** (M) Migration 0008 — sources + repo maintenance. Adds `source_groups`, `schedule_source_groups` junction, `host_repo_maintenance`, `pending_runs`, `host.bandwidth_up_kbps` / `bandwidth_down_kbps`. Drops `host.repo_initialised_at`. Fat-schedule columns dropped from `schedules`, leaving the slim shape. Column-level ALTERs only — no table rebuilds (FK cascade trap, see CLAUDE.md). Commit `7a7cac5`.
- [x] **P2R-00.2** (M) v4 wireframes for sources / schedules / repo. Hi-fi mocks of `/hosts/{id}/sources`, `/sources/{gid}/edit` (with retention-conflict banner), slim `/schedules`, `/repo` (connection / bandwidth / maintenance / re-init). Commit `666af41`.

### P2 redesign — Phase 2 ✅

- [x] **P2R-00.3** (L) Go-side store rewrite against migration 0008. New types: `SourceGroup`, `HostRepoMaintenance`, `PendingRun`. `Schedule` slimmed to `{id, host_id, cron, enabled, source_group_ids, timestamps}`. `RetentionPolicy` moves from schedule field → source group field (type unchanged). `Host` loses `RepoInitialisedAt`, gains bandwidth caps. New files: `store/sources.go`, `store/maintenance.go`, `store/pending.go`. `store/schedules.go` rewritten for slim shape + junction CRUD. `enrollment.go` seeds a default source group + repo-maintenance row instead of a manual schedule. `ws/handler.go` drops `MarkHostRepoInitialised`. HTTP layer + UI templates **temporarily 501-stubbed** with `redesign_in_progress` — this is what P2R-01 / P2R-02 fill back in. Tests for the obsolete fat-schedule API deleted. Commit `5667cdf`.
- [x] **P2R-00.4** (S) Host-detail UI patched up enough to render: `RepoInitialisedAt` template refs removed, manual init-repo branches stripped, dead Schedules sub-tab demoted to inert div (matches Jobs/Repo/Settings), broken Run-now buttons disabled with P2-Phase-4 hints. Stop-gap until P2R-02 lands the real surface.

### P2 redesign — Phase 3 (REST + WS rewire) ✅

- [x] **P2R-01** (L) HTTP/WS layer against the slim shape:
- **Schedules REST CRUD**: `GET|POST /api/hosts/{id}/schedules`, `PUT|DELETE /api/hosts/{id}/schedules/{sid}`. Body shape is `{cron, enabled, source_group_ids[]}` — paths/excludes/retention/kind/manual all go away. Junction wiped + re-inserted on every update (per `store.UpdateSchedule`). Validation: cron parses via `robfig/cron/v3`; ≥1 `source_group_ids`; all referenced groups belong to the host.
- **Source-groups REST CRUD**: `GET|POST /api/hosts/{id}/source-groups`, `GET|PUT|DELETE /api/hosts/{id}/source-groups/{gid}`. Body: `{name, includes[], excludes[], retention_policy, retry_max, retry_backoff_seconds}`. Name uniqueness per host. Refuse delete if `SchedulesUsingGroup(gid)` is non-empty (return the schedule list so UI can show "remove from these schedules first"). Mutations bump `host_schedule_version`.
- **Repo-maintenance REST**: `GET|PUT /api/hosts/{id}/repo-maintenance`. Body: `{forget_cadence, prune_cadence, check_cadence, check_subset_pct, enabled}`. Server-side ticker drives execution (P2R-06), so updates here do **not** bump `host_schedule_version`.
- **Per-source-group Run-now**: `POST /hosts/{id}/source-groups/{gid}/run`. Reuses the existing `dispatchScheduleNow`-style path; agent receives a normal `command.run` carrying the resolved includes/excludes/retention from the group. This replaces the old per-host `/hosts/{id}/run-backup` endpoint (kept around as a 410-Gone with a hint pointing to source groups).
- **`schedule_push.go` reconciliation**: rebuild `pushScheduleSet*` to ship the new wire format (`ScheduleSetPayload` carries `[{schedule_id, cron, enabled, source_groups: [{name, includes, excludes, retention, retry_*}]}]` — agent doesn't need to know `source_group_id`, just the resolved bundle). On-hello + async-on-CRUD flavours; ack still persists `applied_schedule_version`.
- **Auto-init at enrolment**: server dispatches `restic init` on first WS connect (was P2-old "Init repo" button — now invisible to the operator). On success: emit a normal job row with `kind=init` so the audit trail still shows it. On `init` returning "config file already exists" (e.g. re-enrolment against an existing repo): treat as soft success per existing restic-wrapper behaviour.
- **Tests**: rewrite the deleted `schedules_test.go` and `schedule_push_test.go` against new endpoints; new `source_groups_test.go`, `repo_maintenance_test.go`, `auto_init_test.go`. End-to-end: enrol → server pushes creds → server dispatches init → agent runs it → schedule reconcile fires → operator hits per-source-group Run-now → backup runs → snapshots refresh.

### P2 redesign — Phase 4 (UI rewire, against v4 wireframes) ✅

> **Row-design rule (binding for every list-row template in this app, current and future):**
> Whole-row click navigates to the row's primary detail/edit page —
> mirror `.host-row.clickable` on the dashboard
> (`partials/host_row.html`): an absolute-positioned `.row-link`
> overlay with `text-indent: -9999px` covers the row, action buttons
> live in `.row-action` cells that sit above via z-index. **Do not
> add an explicit "Edit" button** when the row is clickable — it
> duplicates the affordance and dilutes the click target. Action
> cells are reserved for verbs that aren't "open this row" (Run-now,
> Delete, Pause, etc).
- [x] **P2R-02** (L) UI templates rebuilt against the new model:
- **Slice 1 ✅** Sub-tab navigation skeleton — extract header/vitals/sub-tabs into a `host_chrome` partial; Sources / Schedules / Repo become real `<a>` links; placeholder pages share the chrome; version indicator restored. (commit `a535822`)
- **Slice 2 ✅** Sources tab — `/hosts/{id}/sources` list with per-row meta + clickable rows + per-group Run-now/Delete; `/sources/new` and `/sources/{gid}/edit` form (name, includes/excludes, 3×2 keep-* grid, retry-on-offline, inline conflict banner from `ConflictDimension` cache); validation re-renders form with input intact; refuses to delete a host's last source group. (commits `0ed9c3d`, `dede74f`)
- **Slice 3 ✅** Schedules tab — `/hosts/{id}/schedules` slim list (status / cron / source-tags / actions, clickable rows) plus `/schedules/new` and `/schedules/{sid}/edit` form (cron with five quick-pick chips that have human-readable tooltips, source-group multi-pick as styled check cards, enabled toggle); per-schedule Run-now reuses `dispatchScheduledJob` for enabled schedules + bypasses the enabled check (with a HX-confirm) for paused ones; multi-group fires emit a success toast, single-group fires HX-Redirect to the live job log. (commit `67ca769` + follow-ups `64d2fcf`, `8b91d30`, `4035c44`)
- **Slice 4 ✅** `/hosts/{id}/repo` — three independent forms (connection: URL/user/password pre-filled from `GET /api/hosts/{id}/repo-credentials` redacted view; bandwidth: host-wide caps via new `PUT /api/hosts/{id}/bandwidth`; maintenance: forget/prune/check cadences + check subset %); danger-zone re-init button rendered + disabled (real flow lands in P2R-09); right-rail snapshots-by-tag breakdown. (commit `d62b173`)
- **Slice 5 ✅** Dashboard row Run-now uses the single covering schedule when one exists ("Run all groups" primary button), otherwise falls back to "Open →" pointing at the Sources tab. Right-rail and empty-snapshots-state Run-now were rehomed to source-group context in slice 1. (commit `fab99b4`)
- **Slice 6 ✅** Playwright sweep against the live `:8080` server — login → walk every new tab → create source group → create schedule → Run-now → confirm a snapshot landed → end-to-end clean, no console errors. Screenshots in `_diag/p2r-02-sweep/`.
- Side-fix: agent runner drops noisy restic `status` events from `log.stream` (they were drowning the live log on short backups; the throttled `job.progress` envelope already covers the same data). (commit `ffba737`)
- Header "version N · agent in sync / agent at vM" indicator preserved across all tabs (backed by `host_schedule_version` + `applied_schedule_version`).
- Form validation re-renders with the operator's typed input intact (mirror P2-04's behaviour). Each save fires `pushScheduleSetAsync` so an online agent re-arms within seconds.

### P2 redesign — Phase 5 (server-side maintenance ticker) — TODO

- [ ] **P2R-03** (M) `prune` command end-to-end. Restic wrapper (`restic.RunPrune`), agent dispatcher (`case api.JobPrune:`), wire envelope. **Admin-only credential**: a second `host_credentials` row keyed by `host_id` + `kind=admin` carries the non-append-only username/password; server pushes it via `config.update` only when dispatching a prune job, and the agent's secrets store keeps it in a separate slot from the everyday append-only creds. UI: prune row on the Repo page. Operator-triggered Run-now via `POST /hosts/{id}/repo/prune`. Cadence-driven dispatch lands in P2R-06.
- [ ] **P2R-04** (M) `check` command end-to-end (`restic check --read-data-subset N%`). Wrapper + dispatcher + wire. UI: check row on the Repo page (with the subset % slider). Operator Run-now via `POST /hosts/{id}/repo/check`. Cadence-driven dispatch lands in P2R-06.
- [ ] **P2R-05** (S) `unlock` command end-to-end (`restic unlock`). Operator-only — no cadence. `POST /hosts/{id}/repo/unlock`. Repo page surfaces lock state from the most recent `check` (which warns about stale locks).
- [ ] **P2R-06** (M) Server-side maintenance ticker. Cron-style loop on the server reads `host_repo_maintenance` rows, dispatches `forget` / `prune` / `check` jobs against the right host on the configured cadence (last-run timestamps tracked per kind on the maintenance row). Independent of the agent's local cron — the agent's cron only handles backup schedules now. Skips offline hosts (queues to `pending_runs` instead — see P2R-08). Handles ticker restarts cleanly (no-op if a job of the same kind ran inside the cadence window).
- [ ] **P2R-07** (S) Repo stats panel on the Repo page: size, dedup ratio, snapshot count, last-check timestamp + result, lock state, last-prune timestamp + bytes-freed. Backed by parsing `restic stats --json` output that the agent ships periodically (piggyback on the existing snapshots-report path).
- [ ] **P2R-08** (M) Pending-runs queue worker. On agent reconnect, server drains `pending_runs` rows for that host and re-dispatches them in order. Bump backoff per `pending_run.attempt_count`; drop rows that have exceeded the source-group's `retry_max`. Audit-logged. Smoke-tested by stopping the agent, running maintenance ticker so cadence misses, restarting agent, watching the queue drain.

### P2 redesign — Phase 6 (auto-init follow-up) — TODO

- [ ] **P2R-09** (S) Auto-init UX polish. Surface init result on host detail (small "repo ready · initialised by you on …" line; or "init failed — see job N · retry" if init failed). Re-init button on Repo page danger zone wipes then re-runs init (admin only, audit-logged, two-step confirm with the host name typed in).

### Pre/post hooks (rehomed onto source groups) — TODO

- [ ] **P2R-10** (M) Hook schema: `source_group.pre_hook`, `source_group.post_hook`, `host.pre_hook_default`, `host.post_hook_default`. Encrypted at rest (existing `crypto.AEAD`). Admin-only edit. Audit-logged.
- [ ] **P2R-11** (M) Agent execution of hooks: configurable shell per host. `pre_hook` failure aborts the backup. `post_hook` always runs with `RM_JOB_STATUS` env var. Stdout/stderr captured into `JobLog` with a `hook:` prefix. Hooks only run for `kind=backup` jobs (forget/prune/check/unlock skip them, per spec.md §14.3).
- [ ] **P2R-12** (S) Hook editor UI on source-group edit page (per-group override) and host Settings tab (host-wide default). Validation rejects non-backup contexts. Warning banner: "this hook runs as the agent service user (root on Linux; LocalSystem on Windows)".

### Bandwidth + niceties (rehomed onto host + source groups) — TODO

- [ ] **P2R-13** (S) Bandwidth limit fields. Host-wide caps (`Host.BandwidthUpKBps`, `BandwidthDownKBps` — schema is in 0008 already, just needs UI on the Repo page) applied to every restic invocation. Per-job override on Run-now (override field on the Run-now confirm dialog). Maps to `restic --limit-upload` / `--limit-download`.
- [ ] **P2R-14** (S) Schedule "next run" / "last run" surfaced on host card (dashboard row) + on the Schedules tab. "Next run" computed server-side from cron + now; "last run" from the most recent job with `actor_kind=schedule` for any schedule that uses any of the host's source groups.

### Cross-platform + alt-enrolment (unchanged by redesign) — TODO

- [ ] **P2-16** (M) Windows service integration: agent runs under the Service Control Manager via `golang.org/x/sys/windows/svc`; install/uninstall/start/stop wired up.
- [ ] **P2-17** (M) `install.ps1` (Windows): downloads agent, installs as service, enrolls; detects existing scheduled tasks named `*restic*` and prints them for manual review.
- [ ] **P2-18** (L) Announce-and-approve enrollment (second enrollment mode, alongside the token flow that ships in Phase 1):
- Agent run with no `RM_TOKEN` generates a local Ed25519 keypair (persisted alongside the encrypted secrets blob), then `POST /api/agents/announce` with `{hostname, os, arch, agent_version, restic_version, public_key}`. Server stores a `pending_hosts` row (`public_key`, `fingerprint = sha256(public_key)`, `announced_from_ip`, `first_seen_at`, `last_seen_at`, `expires_at = now+1h`). Hostname collisions with existing or other pending rows are flagged in the response so the install script can warn loudly on the endpoint terminal.
- Agent then opens a long-poll/WS to `/ws/agent/pending` authenticated by signing a server-issued nonce with its private key — proves possession of the key tied to the pending row. Connection stays open; agent waits.
- Install script prints the fingerprint on the endpoint's terminal in a copy-friendly form (e.g. `SHA256:ab12…cd34`) and tells the operator to compare it to the one shown in the UI before clicking accept.
- UI: new "Pending hosts" panel on the dashboard. Admin sees fingerprint, hostname, source IP, OS/arch, time announced. Buttons: **Accept** (mints persistent bearer + repo creds, pushes both down the open pending socket, promotes pending row → real Host row, audit-logged) / **Reject** (deletes pending row, closes the socket with a clean error). Fingerprint is the load-bearing field — UI must make comparison easy (large monospace, one-click copy).
- Server-side guards: per-source-IP rate limit on `/api/agents/announce` (token-bucket, e.g. 10/min); global cap on pending rows (e.g. 100); pending rows auto-expire after 1h; duplicate-hostname pending rows allowed but visually flagged in UI; accepting one does **not** auto-reject the others (admin sees them all and decides — defends against the "attacker announces first, real host second" race).
- Token-based enrollment (Phase 1) remains the default and is unchanged; announce-and-approve is opt-in for interactive installs. Docs explicitly call out that the fingerprint comparison step is what makes this flow safe — without it, this is no better than trusting `hostname` over the wire.

### Phase 2 acceptance

- A host can be onboarded end-to-end with no manual REST: enrol → auto-init runs → operator opens host → creates source group(s) → attaches them to one or more schedules → schedule fires on time → backup runs against the right paths with the right retention → snapshots tagged by group name appear in UI.
- Operator can hit Run-now per source group from any of: dashboard row (single-group host), source group row, snapshot empty-state.
- Server-side maintenance ticker drives forget/prune/check at the configured cadences, independent of agent cron. Offline hosts queue to `pending_runs` and drain on reconnect.
- Pre/post hooks fire correctly per source group, fail loudly on `pre_hook` errors, run `post_hook` with `RM_JOB_STATUS`. Rejected on non-backup kinds.
- Bandwidth limits honoured (host-wide default + per-run override).
- A Windows host can enrol, appear in the dashboard, and run a backup with live log streaming.
- A Linux host can enrol via announce-and-approve, with fingerprint-comparison gate enforced. Rate-limit + pending-cap guards verified.

---

## Phase 3 — Restore, alerts, audit

- [ ] **P3-01** (L) Restore wizard backend: snapshot tree browse via `restic ls --json`, path picker, target selection
- [ ] **P3-02** (L) Restore wizard UI (multi-step: host → snapshot → paths → target → confirm)
- [ ] **P3-03** (M) Restore execution: `restic restore` invocation, progress streaming
- [ ] **P3-04** (L) Cross-host restore: target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root
- [ ] **P3-05** (M) Alert engine: rule evaluation loop (failed backup, stale schedule, agent offline, check failed)
- [ ] **P3-06** (M) Notification channels: webhook, ntfy, SMTP email
- [ ] **P3-07** (S) Alert UI: list, acknowledge, resolve
- [ ] **P3-08** (S) Audit log UI with filters (user, action, target, time range)
- [ ] **P3-09** (S) `diff` between two snapshots in UI

### Phase 3 acceptance

- A file deleted on a host can be restored from the UI in under 2 minutes. A failed backup raises an alert via the configured channel within 60s.

---

## Phase 4 — Update delivery, RBAC polish, OIDC

- [ ] **P4-01** (M) Update delivery via OS package managers — host an apt repo (Linux) and Chocolatey package (Windows) on Gitea releases. `restic-manager-agent update` is a thin wrapper over `apt-get install --only-upgrade restic-manager-agent` / `choco upgrade`. Trades flexibility for a much smaller security surface than bespoke signed binaries (see spec.md §4.2)
- [ ] **P4-02** (M) Agent version reporting on dashboard: surface "agent N versions behind server"; "update all" admin action calls the package-manager wrapper on each host
- [ ] **P4-03** (M) RBAC enforcement at API layer (admin / operator / viewer)
- [ ] **P4-04** (S) User management UI (create/edit/disable, role assignment, password reset)
- [ ] **P4-05** (L) OIDC login (generic provider config, group → role mapping)
- [ ] **P4-06** (M) Repo size trend graphs (sparkline on host card, full chart on repo page)
- [ ] **P4-07** (S) Per-host tags + dashboard filtering by tag
- [ ] **P4-08** (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list
- [ ] **P4-09** (S) Document Prometheus integration + sample Grafana dashboard JSON

### Phase 4 acceptance

- Non-admin users see an appropriately limited UI. Agents upgrade via apt/choco with one admin-triggered action. OIDC login works against at least one provider (Authelia or Authentik). Prometheus can scrape `/metrics` and the sample Grafana dashboard renders with live data.

---

## Phase 5 — OSS readiness

- [ ] **P5-01** (M) Documentation site (mdBook or similar) with install, concepts, security model, screenshots
- [ ] **P5-02** (S) `CONTRIBUTING.md`, `CODE_OF_CONDUCT.md`, issue + PR templates
- [ ] **P5-03** (S) Release automation: `goreleaser` for binaries + Docker image to GHCR
- [ ] **P5-04** (S) Demo screenshots / short Loom walkthrough in README
- [ ] **P5-05** (S) `SECURITY.md` with disclosure process
- [ ] **P5-06** (M) End-to-end test suite in CI (Playwright vs. compose stack with sibling Linux agent)
- [ ] **P5-07** (S) Reference deployment: `docker-compose.yml` + Caddyfile snippet showing the TLS-terminating reverse proxy in front of the HTTP-only server (also demonstrates `RM_TRUSTED_PROXY`)

### Phase 5 acceptance

- A stranger can read the docs and stand up a working install in under 30 minutes.

---

## Cross-cutting / ongoing

- [ ] **X-01** Keep CHANGELOG.md updated (Keep-a-Changelog format)
- [ ] **X-02** Track restic version compatibility matrix
- [ ] **X-03** Periodic dependency updates (`dependabot` or `renovate`)
- [ ] **X-04** Threat-model review at end of each phase
- [ ] **X-05** Proper first-run onboarding UI: an admin shouldn't need to `curl` `/api/bootstrap` by hand. Render the bootstrap form on the same login page (extra "setup token" field shown only while no admin user exists, hidden after); on submit, POST to `/api/bootstrap`, then drop straight into a session. Surface the one-time token from the server log somewhere copyable (or print a clickable URL with the token in the query string on first run). Also: relax the 12-char password floor for the first-run path, or document it in the form so a password like `admin` doesn't silently fail validation.