From c821ec1fe02f868e75f52f1fe91bbddebf74f312 Mon Sep 17 00:00:00 2001
From: Steve Cliff
Date: Fri, 1 May 2026 00:12:55 +0100
Subject: [PATCH] spec/tasks: address pre-Phase-1 design feedback
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Doc-only changes captured before any Phase 1 code lands.

spec.md:
- §4.1 nhooyr.io/websocket → github.com/coder/websocket (the maintained
  fork; the original is unmaintained)
- §4.1 RM_LISTEN documented as source of truth for the bind port; add
  RM_TRUSTED_PROXY env var for X-Forwarded-* handling behind
  Caddy/Traefik
- §4.2 Phase 1 ships Linux only; Windows binaries continue to build in
  CI to keep the codebase portable, but service integration + installer
  move to Phase 2
- §4.2 self-update via apt/choco, not bespoke signed binaries
- §5 add Host.protocol_version + Host.applied_schedule_version
- §6.2 lock protocol_version handshake semantics (clean error on
  mismatch, not weird JSON parse failures)
- §6.2 schedule reconciliation when server unreachable: agent keeps
  firing last-known-good indefinitely; server's view canonical on
  reconnect; UI surfaces drift via applied_schedule_version
- §6.2 schedule.set carries schedule_version; new schedule.ack
  agent→server message
- §10.1 cross-reference RM_LISTEN ↔ compose port mapping
- §14.3 hooks rejected at validation on non-backup schedule kinds

tasks.md:
- P1-14 / P1-30 (Windows service + install.ps1) → Phase 2 as
  P2-16 / P2-17
- P1-29 install.sh detects existing restic timers/cron and prints
  disable commands, doesn't auto-disable
- Phase 1 acceptance: drop Windows from end-to-end criterion, require
  windows cross-compile in CI
- P4-01 rewritten: package-manager-based update delivery
- P5-08 removed (duplicate of P4-08 Prometheus /metrics)
- Various references updated

No Go code changes; build still clean.
Co-Authored-By: Claude Opus 4.7 (1M context) --- spec.md | 81 +++++++++++++++++++++++++++++++++++++++++++++++--------- tasks.md | 33 ++++++++++++----------- 2 files changed, 87 insertions(+), 27 deletions(-) diff --git a/spec.md b/spec.md index 865a06a..edf2752 100644 --- a/spec.md +++ b/spec.md @@ -94,20 +94,37 @@ It is built for small-to-medium fleets (initial target: ~12 endpoints) and is in - **Language:** Go 1.22+ - **Storage:** SQLite (via `modernc.org/sqlite`, no CGo) - **HTTP:** `net/http` + `chi` router -- **WebSocket:** `nhooyr.io/websocket` +- **WebSocket:** `github.com/coder/websocket` (the maintained fork of the + unmaintained `nhooyr.io/websocket`; same API) - **UI:** HTMX + Tailwind, server-rendered Go templates, no Node build step - **Distribution:** single static binary, packaged in a Docker image; published `docker-compose.yml` -- **Config:** YAML or env vars (`RM_LISTEN`, `RM_DATA_DIR`, `RM_BASE_URL`, `RM_TLS_CERT`, `RM_TLS_KEY`) +- **Config:** YAML or env vars: + - `RM_LISTEN` — bind address, e.g. `:8443` (source of truth for the port; + the `8443` in the reference compose is just a default mapping) + - `RM_DATA_DIR`, `RM_BASE_URL`, `RM_TLS_CERT`, `RM_TLS_KEY`, + `RM_SECRET_KEY_FILE` + - `RM_TRUSTED_PROXY` — comma-separated CIDR list of reverse proxies + whose `X-Forwarded-For` / `X-Forwarded-Proto` we honour. Empty (the + default) = trust no one. Set this when fronted by Caddy/Traefik. - **TLS:** terminate TLS in-process (cert from Caddy/Traefik sidecar acceptable; agents require HTTPS) ### 4.2 Agent -- **Language:** Go (cross-compiled for `linux/amd64`, `linux/arm64`, `windows/amd64`) -- **Service integration:** systemd unit (Linux), Windows service via `golang.org/x/sys/windows/svc` +- **Language:** Go (cross-compiled for `linux/amd64`, `linux/arm64`, `windows/amd64`). 
+ Phase 1 ships Linux only; Windows binaries continue to build in CI to keep + the codebase portable, but Windows service integration + signed installer + + install.ps1 land in Phase 2. +- **Service integration:** systemd unit (Linux). Windows service via + `golang.org/x/sys/windows/svc` — Phase 2. - **Footprint goal:** ≤ 15 MB binary, ≤ 50 MB RSS idle - **Persistence:** local config file + small state DB (BoltDB or JSON) for queued reports if server is unreachable - **Restic invocation:** spawns `restic` with `--json`, parses streamed output, forwards to server in real time -- **Self-update:** server publishes signed agent binary; agent downloads, verifies signature, swaps binary, restarts service +- **Updates:** distributed via OS package manager — apt repo (Linux) and + Chocolatey package (Windows), both pointing at gitea releases. No + bespoke signed-binary self-update; the `restic-manager-agent update` + command is a thin wrapper over `apt-get install --only-upgrade` / + `choco upgrade`. UI surfaces "agent N versions behind server" so an + operator knows when to upgrade. ### 4.3 Restic REST server (Unraid) @@ -121,14 +138,20 @@ It is built for small-to-medium fleets (initial target: ~12 endpoints) and is in ``` Host - id, name, os, arch, agent_version, restic_version, + id, name, os, arch, agent_version, restic_version, protocol_version, enrolled_at, last_seen_at, status (online/offline/degraded), repo_id (FK), tags, current_job_id (FK nullable), last_backup_at, last_backup_status (succeeded|failed|cancelled|null), - repo_size_bytes, snapshot_count, open_alert_count - # Last six fields are denormalised projections, refreshed on - # job.finished, snapshots.report, repo.stats, and alert state changes. 
+ repo_size_bytes, snapshot_count, open_alert_count, + applied_schedule_version + # Bottom block (last_backup_*, repo_size_bytes, snapshot_count, + # open_alert_count, applied_schedule_version) are denormalised + # projections, refreshed on job.finished, snapshots.report, + # repo.stats, and alert state changes. + # applied_schedule_version is the schedule_version the agent most + # recently acknowledged via `schedule.ack` — lets the UI surface + # drift when an agent is offline. Repo id, name, url, kind (rest|s3|local), credential_id (FK), @@ -241,7 +264,8 @@ if dashboard staleness becomes a problem in practice. Single authenticated WebSocket per agent. Bidirectional JSON-RPC-ish messages. **Agent → server:** -- `hello` (host metadata, agent version, restic version, OS) +- `hello` (host metadata, agent_version, restic_version, OS, + `protocol_version` — see "Protocol versioning" below) - `heartbeat` (every 30s) - `job.started` (job_id, kind, started_at) - `job.progress` (job_id, percent_done, files_done, total_files, @@ -252,18 +276,44 @@ Single authenticated WebSocket per agent. Bidirectional JSON-RPC-ish messages. 
last_check_status, lock_state) - `log.stream` (live stdout/stderr lines while job running; {job_id, seq, ts, stream: stdout|stderr|event, payload}) +- `schedule.ack` (schedule_version) — agent confirms it has applied a + schedule push; lets the server surface "this host is N versions + behind" without polling **Server → agent:** - `command.run` (kind, args) - `command.cancel` (job_id) -- `schedule.set` (full schedule list, agent reconciles local cron) +- `schedule.set` (schedule_version, schedules: [...]) — full schedule + list, agent reconciles local cron and replies with `schedule.ack` - `config.update` -- `agent.update` (new version available, URL + signature) +- `agent.update.available` (new version + package source URL — + informational only; agent does not self-update, see §4.2) The server fans `job.progress` and `log.stream` for a given job to all browsers subscribed to `WS /api/jobs/:id/stream` (§6.1) without transformation, so the schema is shared end-to-end. +**Protocol versioning.** Agents and the server each declare an integer +`protocol_version` in `hello`. The version bumps **only** on breaking +wire-format changes (not human-readable software releases). The server +maintains a `MinAgentProtocolVersion` constant; agents below it are +disconnected with `error: protocol_too_old` and a URL pointing at the +upgrade instructions. Symmetrically, an agent talking to a server that +advertises a `protocol_version` it does not recognise refuses to +proceed and surfaces a clear log message. This avoids the failure mode +of "weird JSON parse errors when v0.3 agent meets v0.5 server." + +**Schedule reconciliation when the server is unreachable.** Agents +keep firing the **last-known-good** schedule pushed by the server, +indefinitely. Rationale: a missed backup because the controller is +down is a worse outcome than firing a schedule the user has since +edited. 
On reconnect, the server's view is canonical: the next +`schedule.set` overrides whatever the agent was running, the agent +replies `schedule.ack` with the new `schedule_version`, and the server +updates `Host.applied_schedule_version`. The UI surfaces drift +("schedule v7 pushed, agent applied v5") when an agent has been +offline. + ### 6.3 Enrollment 1. Operator clicks "Add host" → server generates one-time token (TTL 1h) @@ -339,8 +389,14 @@ services: - RM_TLS_CERT=/certs/fullchain.pem - RM_TLS_KEY=/certs/privkey.pem - RM_SECRET_KEY_FILE=/data/secret.key + # - RM_TRUSTED_PROXY=10.0.0.0/8 # set when fronted by a reverse proxy ``` +`RM_LISTEN` is the source of truth for the server's bind address. The +`8443:8443` mapping above is just the matching default; if you change +`RM_LISTEN` to e.g. `:9443`, change the right-hand side of the port +mapping to match. + ### 10.2 Restic REST server (Unraid) Standard `restic/rest-server` container, `--append-only`, `--private-repos`, htpasswd mounted, data path on the share. @@ -430,6 +486,7 @@ Per-host upload/download caps for backup, restore, and prune jobs. Per-host shell commands run before and after a backup job. Use cases: `mysqldump`/`pg_dump` to a staging path, stop/start Docker containers, quiesce a service, post-backup notifications. - **Schema:** `Schedule.pre_hook` and `Schedule.post_hook` (string, optional). For more complex cases, `Host.pre_hook_default` / `Host.post_hook_default` apply to all schedules on that host unless overridden +- **Applicability:** hooks are only meaningful for `kind = backup` schedules. The API rejects non-null `pre_hook` / `post_hook` on any other schedule kind (`forget`, `prune`, `check`) with a clear validation error rather than silently ignoring them. 
The same constraint applies to `Host.pre_hook_default` / `Host.post_hook_default`: they only fire for backup schedules on that host - **Execution:** agent runs hooks via the host's default shell (`/bin/sh` Linux, `cmd.exe` or PowerShell Windows — host-configurable) - **Failure semantics:** `pre_hook` non-zero exit aborts the backup and marks the job failed. `post_hook` runs on both success and failure (with `RM_JOB_STATUS` env var); its own exit code is recorded but does not change the backup job's final status - **Stdout/stderr:** captured into `JobLog` like restic output, prefixed `pre_hook:` / `post_hook:` diff --git a/tasks.md b/tasks.md index 0859f52..90b6093 100644 --- a/tasks.md +++ b/tasks.md @@ -36,11 +36,11 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days. - [ ] **P1-12** (S) Heartbeat handler (mark host offline after 90s without heartbeat) ### Agent foundations -- [ ] **P1-13** (M) Agent config file (`/etc/restic-manager/agent.yaml` / `%PROGRAMDATA%\restic-manager\agent.yaml`) -- [ ] **P1-14** (M) Service integration: systemd unit + Windows service entrypoint -- [ ] **P1-15** (M) Outbound WS client with reconnect, server cert pinning +- [ ] **P1-13** (M) Agent config file (`/etc/restic-manager/agent.yaml`); Windows path deferred to Phase 2 +- [ ] **P1-14** (M) Service integration: systemd unit (Linux only in Phase 1; Windows service entrypoint deferred to Phase 2 — see P2-16) +- [ ] **P1-15** (M) Outbound WS client (`github.com/coder/websocket`) with reconnect, server cert pinning, `protocol_version` advertisement in `hello` - [ ] **P1-16** (M) Restic wrapper: locate `restic` binary, run with `--json`, stream parsed events -- [ ] **P1-17** (S) Host metadata collection (OS, arch, hostname, restic version, agent version) +- [ ] **P1-17** (S) Host metadata collection (OS, arch, hostname, restic version, agent version, protocol_version) ### Run-now backup - [ ] **P1-18** (L) Job lifecycle: queued → running → succeeded/failed/cancelled, 
persisted with logs @@ -58,12 +58,13 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days. - [ ] **P1-28** (S) Tailwind build via `tailwindcss` standalone binary (no Node) ### Install scripts -- [ ] **P1-29** (M) `install.sh` (Linux): detects arch, downloads agent, installs systemd unit, enrolls -- [ ] **P1-30** (M) `install.ps1` (Windows): downloads agent, installs as service, enrolls +- [ ] **P1-29** (M) `install.sh` (Linux): detects arch, downloads agent, installs systemd unit, enrolls. Also detects existing restic timers/cron (`systemctl list-timers --all | grep -i restic`, `crontab -l`, `/etc/cron.d/`, `/etc/cron.daily/`) and prints them with the disable commands — does **not** auto-disable, since heuristic matches could be unrelated tooling - [ ] **P1-31** (S) Server endpoint to serve agent binaries + install scripts (signed) ### Phase 1 acceptance -- One Linux + one Windows host can enroll, appear in the dashboard, and a backup can be triggered from the UI with live log streaming. Snapshots list updates after success. +- One Linux host can enroll, appear in the dashboard, and a backup can be triggered from the UI with live log streaming. Snapshots list updates after success. +- Windows binary builds cleanly in CI (`.gitea/workflows/ci.yml`) but is not service-tested or installer-shipped in Phase 1 — that lands in Phase 2 (P2-16, P2-17). +- Agent ↔ server `protocol_version` handshake rejects mismatched versions with a clear error rather than failing on JSON parse. --- @@ -83,10 +84,13 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days. 
- [ ] **P2-12** (S) Bandwidth limit fields on schedule editor (`--limit-upload`, `--limit-download`); also overridable on run-now jobs - [ ] **P2-13** (M) Pre/post backup hooks: schema (`Schedule.pre_hook`, `Schedule.post_hook`, `Host.pre_hook_default`, `Host.post_hook_default`), encrypted at rest, admin-only edit, audit-logged - [ ] **P2-14** (M) Agent execution of hooks: configurable shell per host, `pre_hook` failure aborts backup, `post_hook` always runs with `RM_JOB_STATUS` env var, stdout/stderr captured into `JobLog` with prefix -- [ ] **P2-15** (S) Hook editor UI on schedule + host pages, with sensible warnings (e.g. "this hook runs as the agent service user") +- [ ] **P2-15** (S) Hook editor UI on schedule + host pages, with sensible warnings (e.g. "this hook runs as the agent service user"); validation enforces hooks only on `kind = backup` schedules (see spec.md §14.3) +- [ ] **P2-16** (M) Windows service integration: agent runs under the Service Control Manager via `golang.org/x/sys/windows/svc`; install/uninstall/start/stop wired up +- [ ] **P2-17** (M) `install.ps1` (Windows): downloads agent, installs as service, enrolls; detects existing scheduled tasks named `*restic*` and prints them for manual review ### Phase 2 acceptance -- Schedules created in UI run on agents on time; retention is applied; admin can prune from UI; repo health visible per host. Pre/post hooks fire correctly (verified with a Docker stop/start example and a `mysqldump` example). Bandwidth limits honoured. +- Schedules created in UI run on agents on time; retention is applied; admin can prune from UI; repo health visible per host. Pre/post hooks fire correctly (verified with a Docker stop/start example and a `mysqldump` example) and are rejected on non-backup schedule kinds. Bandwidth limits honoured. +- A Windows host can enroll, appear in the dashboard, and run a backup with live log streaming — closing the cross-platform gap left by Phase 1. 
--- @@ -107,10 +111,10 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days. --- -## Phase 4 — Self-update, RBAC polish, OIDC +## Phase 4 — Update delivery, RBAC polish, OIDC -- [ ] **P4-01** (L) Agent self-update: signed binary published by server, agent downloads, verifies, swaps, restarts -- [ ] **P4-02** (M) Agent version reporting on dashboard; "update all" admin action +- [ ] **P4-01** (M) Update delivery via OS package managers — host an apt repo (Linux) and Chocolatey package (Windows) on gitea releases. `restic-manager-agent update` is a thin wrapper over `apt-get install --only-upgrade restic-manager-agent` / `choco upgrade`. Trades flexibility for a much smaller security surface than bespoke signed binaries (see spec.md §4.2) +- [ ] **P4-02** (M) Agent version reporting on dashboard: surface "agent N versions behind server"; "update all" admin action calls the package-manager wrapper on each host - [ ] **P4-03** (M) RBAC enforcement at API layer (admin / operator / viewer) - [ ] **P4-04** (S) User management UI (create/edit/disable, role assignment, password reset) - [ ] **P4-05** (L) OIDC login (generic provider config, group → role mapping) @@ -120,7 +124,7 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days. - [ ] **P4-09** (S) Document Prometheus integration + sample Grafana dashboard JSON ### Phase 4 acceptance -- Non-admin users see an appropriately limited UI. Agents update themselves with one click. OIDC login works against at least one provider (Authelia or Authentik). Prometheus can scrape `/metrics` and the sample Grafana dashboard renders with live data. +- Non-admin users see an appropriately limited UI. Agents upgrade via apt/choco with one admin-triggered action. OIDC login works against at least one provider (Authelia or Authentik). Prometheus can scrape `/metrics` and the sample Grafana dashboard renders with live data. --- @@ -132,8 +136,7 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days. 
- [ ] **P5-04** (S) Demo screenshots / short Loom walkthrough in README - [ ] **P5-05** (S) `SECURITY.md` with disclosure process - [ ] **P5-06** (M) End-to-end test suite in CI (Playwright vs. compose stack with sibling Linux agent) -- [ ] **P5-07** (S) Sample `docker-compose.yml` with TLS via Caddy sidecar -- [ ] **P5-08** (S) Optional Prometheus `/metrics` endpoint +- [ ] **P5-07** (S) Sample `docker-compose.yml` with TLS via Caddy sidecar (also demonstrates `RM_TRUSTED_PROXY`) ### Phase 5 acceptance - A stranger can read the docs and stand up a working install in under 30 minutes.