spec/tasks: address pre-Phase-1 design feedback
Doc-only changes captured before any Phase 1 code lands.

spec.md:
- §4.1 nhooyr.io/websocket → github.com/coder/websocket (the maintained fork; the original is unmaintained)
- §4.1 RM_LISTEN documented as source of truth for the bind port; add RM_TRUSTED_PROXY env var for X-Forwarded-* handling behind Caddy/Traefik
- §4.2 Phase 1 ships Linux only; Windows binaries continue to build in CI to keep the codebase portable, but service integration + installer move to Phase 2
- §4.2 self-update via apt/choco, not bespoke signed binaries
- §5 add Host.protocol_version + Host.applied_schedule_version
- §6.2 lock protocol_version handshake semantics (clean error on mismatch, not weird JSON parse failures)
- §6.2 schedule reconciliation when server unreachable: agent keeps firing last-known-good indefinitely; server's view canonical on reconnect; UI surfaces drift via applied_schedule_version
- §6.2 schedule.set carries schedule_version; new schedule.ack agent→server message
- §10.1 cross-reference RM_LISTEN ↔ compose port mapping
- §14.3 hooks rejected at validation on non-backup schedule kinds

tasks.md:
- P1-14 / P1-30 (Windows service + install.ps1) → Phase 2 as P2-16 / P2-17
- P1-29 install.sh detects existing restic timers/cron and prints disable commands, doesn't auto-disable
- Phase 1 acceptance: drop Windows from end-to-end criterion, require windows cross-compile in CI
- P4-01 rewritten: package-manager-based update delivery
- P5-08 removed (duplicate of P4-08 Prometheus /metrics)
- Various references updated

No Go code changes; build still clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -94,20 +94,37 @@ It is built for small-to-medium fleets (initial target: ~12 endpoints) and is in
 - **Language:** Go 1.22+
 - **Storage:** SQLite (via `modernc.org/sqlite`, no CGo)
 - **HTTP:** `net/http` + `chi` router
-- **WebSocket:** `nhooyr.io/websocket`
+- **WebSocket:** `github.com/coder/websocket` (the maintained fork of the
+  unmaintained `nhooyr.io/websocket`; same API)
 - **UI:** HTMX + Tailwind, server-rendered Go templates, no Node build step
 - **Distribution:** single static binary, packaged in a Docker image; published `docker-compose.yml`
-- **Config:** YAML or env vars (`RM_LISTEN`, `RM_DATA_DIR`, `RM_BASE_URL`, `RM_TLS_CERT`, `RM_TLS_KEY`)
+- **Config:** YAML or env vars:
+  - `RM_LISTEN` — bind address, e.g. `:8443` (source of truth for the port;
+    the `8443` in the reference compose is just a default mapping)
+  - `RM_DATA_DIR`, `RM_BASE_URL`, `RM_TLS_CERT`, `RM_TLS_KEY`,
+    `RM_SECRET_KEY_FILE`
+  - `RM_TRUSTED_PROXY` — comma-separated CIDR list of reverse proxies
+    whose `X-Forwarded-For` / `X-Forwarded-Proto` we honour. Empty (the
+    default) = trust no one. Set this when fronted by Caddy/Traefik.
 - **TLS:** terminate TLS in-process (cert from Caddy/Traefik sidecar acceptable; agents require HTTPS)

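The `RM_TRUSTED_PROXY` rule above (honour `X-Forwarded-For` only from listed CIDRs, trust no one when empty) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `parseTrustedProxies`, `isTrusted`, and `clientIP` are hypothetical names, and the right-to-left walk over the header is one common way to skip trusted hops.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// parseTrustedProxies parses an RM_TRUSTED_PROXY-style value: a
// comma-separated CIDR list. An empty value yields an empty list, which
// means no proxy is trusted. (Hypothetical helper, not from the spec.)
func parseTrustedProxies(v string) ([]netip.Prefix, error) {
	var out []netip.Prefix
	for _, part := range strings.Split(v, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		p, err := netip.ParsePrefix(part)
		if err != nil {
			return nil, fmt.Errorf("RM_TRUSTED_PROXY entry %q: %w", part, err)
		}
		out = append(out, p.Masked())
	}
	return out, nil
}

// isTrusted reports whether a falls inside any trusted prefix.
func isTrusted(a netip.Addr, trusted []netip.Prefix) bool {
	for _, p := range trusted {
		if p.Contains(a) {
			return true
		}
	}
	return false
}

// clientIP returns the address a request should be attributed to.
// remoteAddr is the TCP peer ("ip:port"); xff is the raw X-Forwarded-For
// value. The header is consulted only when the peer itself is trusted.
func clientIP(remoteAddr, xff string, trusted []netip.Prefix) string {
	ap, err := netip.ParseAddrPort(remoteAddr)
	if err != nil {
		return ""
	}
	peer := ap.Addr().Unmap()
	if !isTrusted(peer, trusted) {
		return peer.String() // untrusted peer: ignore forwarded headers
	}
	// Walk right to left, skipping hops that are themselves trusted proxies.
	hops := strings.Split(xff, ",")
	for i := len(hops) - 1; i >= 0; i-- {
		a, err := netip.ParseAddr(strings.TrimSpace(hops[i]))
		if err != nil {
			break // malformed or empty header: fall back to the peer
		}
		if a = a.Unmap(); !isTrusted(a, trusted) {
			return a.String()
		}
	}
	return peer.String()
}
```

With `RM_TRUSTED_PROXY=10.0.0.0/8`, a request arriving from `10.1.2.3` with `X-Forwarded-For: 203.0.113.7, 10.0.0.2` would be attributed to `203.0.113.7`; with the variable unset, the same request is attributed to `10.1.2.3` and the header is ignored.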
 ### 4.2 Agent

-- **Language:** Go (cross-compiled for `linux/amd64`, `linux/arm64`, `windows/amd64`)
-- **Service integration:** systemd unit (Linux), Windows service via `golang.org/x/sys/windows/svc`
+- **Language:** Go (cross-compiled for `linux/amd64`, `linux/arm64`, `windows/amd64`).
+  Phase 1 ships Linux only; Windows binaries continue to build in CI to keep
+  the codebase portable, but Windows service integration + signed installer
+  + install.ps1 land in Phase 2.
+- **Service integration:** systemd unit (Linux). Windows service via
+  `golang.org/x/sys/windows/svc` — Phase 2.
 - **Footprint goal:** ≤ 15 MB binary, ≤ 50 MB RSS idle
 - **Persistence:** local config file + small state DB (BoltDB or JSON) for queued reports if server is unreachable
 - **Restic invocation:** spawns `restic` with `--json`, parses streamed output, forwards to server in real time
-- **Self-update:** server publishes signed agent binary; agent downloads, verifies signature, swaps binary, restarts service
+- **Updates:** distributed via OS package manager — apt repo (Linux) and
+  Chocolatey package (Windows), both pointing at gitea releases. No
+  bespoke signed-binary self-update; the `restic-manager-agent update`
+  command is a thin wrapper over `apt-get install --only-upgrade` /
+  `choco upgrade`. UI surfaces "agent N versions behind server" so an
+  operator knows when to upgrade.
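To make the "thin wrapper" concrete, here is a hedged sketch of how the `update` subcommand might pick its package-manager invocation per OS. The package name `restic-manager-agent`, the function `upgradeCommand`, and the non-interactive `-y` flags are assumptions for illustration; only `apt-get install --only-upgrade` and `choco upgrade` come from the spec text.

```go
package main

import "fmt"

// agentPackage is an assumed package name; the spec only names the command.
const agentPackage = "restic-manager-agent"

// upgradeCommand returns the argv the update subcommand would execute for
// the given GOOS value. It only builds the command; running it (for example
// with os/exec) is left to the caller.
func upgradeCommand(goos string) ([]string, error) {
	switch goos {
	case "linux":
		// --only-upgrade upgrades the package only if it is already installed.
		return []string{"apt-get", "install", "--only-upgrade", "-y", agentPackage}, nil
	case "windows":
		return []string{"choco", "upgrade", agentPackage, "-y"}, nil
	default:
		return nil, fmt.Errorf("no package-manager update path for %q", goos)
	}
}
```

Keeping the function pure (it returns an argv rather than executing anything) makes the OS dispatch easy to unit-test on any build host, which fits the requirement that Windows keeps cross-compiling in CI during Phase 1.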

 ### 4.3 Restic REST server (Unraid)

@@ -121,14 +138,20 @@ It is built for small-to-medium fleets (initial target: ~12 endpoints) and is in

 ```
 Host
-id, name, os, arch, agent_version, restic_version,
+id, name, os, arch, agent_version, restic_version, protocol_version,
 enrolled_at, last_seen_at, status (online/offline/degraded),
 repo_id (FK), tags,
 current_job_id (FK nullable),
 last_backup_at, last_backup_status (succeeded|failed|cancelled|null),
-repo_size_bytes, snapshot_count, open_alert_count
-# Last six fields are denormalised projections, refreshed on
-# job.finished, snapshots.report, repo.stats, and alert state changes.
+repo_size_bytes, snapshot_count, open_alert_count,
+applied_schedule_version
+# Bottom block (last_backup_*, repo_size_bytes, snapshot_count,
+# open_alert_count, applied_schedule_version) are denormalised
+# projections, refreshed on job.finished, snapshots.report,
+# repo.stats, and alert state changes.
+# applied_schedule_version is the schedule_version the agent most
+# recently acknowledged via `schedule.ack` — lets the UI surface
+# drift when an agent is offline.

 Repo
 id, name, url, kind (rest|s3|local), credential_id (FK),
@@ -241,7 +264,8 @@ if dashboard staleness becomes a problem in practice.
 Single authenticated WebSocket per agent. Bidirectional JSON-RPC-ish messages.

 **Agent → server:**
-- `hello` (host metadata, agent version, restic version, OS)
+- `hello` (host metadata, agent_version, restic_version, OS,
+  `protocol_version` — see "Protocol versioning" below)
 - `heartbeat` (every 30s)
 - `job.started` (job_id, kind, started_at)
 - `job.progress` (job_id, percent_done, files_done, total_files,
@@ -252,18 +276,44 @@ Single authenticated WebSocket per agent. Bidirectional JSON-RPC-ish messages.
   last_check_status, lock_state)
 - `log.stream` (live stdout/stderr lines while job running;
   {job_id, seq, ts, stream: stdout|stderr|event, payload})
+- `schedule.ack` (schedule_version) — agent confirms it has applied a
+  schedule push; lets the server surface "this host is N versions
+  behind" without polling

 **Server → agent:**
 - `command.run` (kind, args)
 - `command.cancel` (job_id)
-- `schedule.set` (full schedule list, agent reconciles local cron)
+- `schedule.set` (schedule_version, schedules: [...]) — full schedule
+  list, agent reconciles local cron and replies with `schedule.ack`
 - `config.update`
-- `agent.update` (new version available, URL + signature)
+- `agent.update.available` (new version + package source URL —
+  informational only; agent does not self-update, see §4.2)

 The server fans `job.progress` and `log.stream` for a given job to all
 browsers subscribed to `WS /api/jobs/:id/stream` (§6.1) without
 transformation, so the schema is shared end-to-end.

+**Protocol versioning.** Agents and the server each declare an integer
+`protocol_version` in `hello`. The version bumps **only** on breaking
+wire-format changes (not human-readable software releases). The server
+maintains a `MinAgentProtocolVersion` constant; agents below it are
+disconnected with `error: protocol_too_old` and a URL pointing at the
+upgrade instructions. Symmetrically, an agent talking to a server that
+advertises a `protocol_version` it does not recognise refuses to
+proceed and surfaces a clear log message. This avoids the failure mode
+of "weird JSON parse errors when v0.3 agent meets v0.5 server."
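The two-sided check in the protocol-versioning paragraph might look something like the sketch below. `MinAgentProtocolVersion` and the `protocol_too_old` error code come from the spec; the value `2`, the function names, and the idea of the agent carrying an explicit list of recognised versions are assumptions made for illustration.

```go
package main

import "errors"

// MinAgentProtocolVersion is named in the spec; the value here is made up.
const MinAgentProtocolVersion = 2

var (
	// ErrProtocolTooOld mirrors the `protocol_too_old` error code.
	ErrProtocolTooOld = errors.New("protocol_too_old")
	// ErrUnknownProtocol is a hypothetical agent-side error.
	ErrUnknownProtocol = errors.New("unrecognised server protocol_version")
)

// checkAgentHello is the server-side half: it runs on the agent's `hello`
// before any other message is parsed, so an old agent gets a clean error
// instead of a JSON decoding failure later on.
func checkAgentHello(agentVersion int) error {
	if agentVersion < MinAgentProtocolVersion {
		return ErrProtocolTooOld
	}
	return nil
}

// checkServerHello is the agent-side half: the agent proceeds only if the
// server advertises a version the agent was built to understand.
func checkServerHello(serverVersion int, recognised []int) error {
	for _, v := range recognised {
		if v == serverVersion {
			return nil
		}
	}
	return ErrUnknownProtocol
}
```

Because the version is a plain integer that bumps only on breaking wire changes, both checks stay a single comparison or lookup and do not need to parse release strings such as `v0.3`.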
+
+**Schedule reconciliation when the server is unreachable.** Agents
+keep firing the **last-known-good** schedule pushed by the server,
+indefinitely. Rationale: a missed backup because the controller is
+down is a worse outcome than firing a schedule the user has since
+edited. On reconnect, the server's view is canonical: the next
+`schedule.set` overrides whatever the agent was running, the agent
+replies `schedule.ack` with the new `schedule_version`, and the server
+updates `Host.applied_schedule_version`. The UI surfaces drift
+("schedule v7 pushed, agent applied v5") when an agent has been
+offline.
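A small sketch of the `schedule.set` / `schedule.ack` bookkeeping described above, assuming in-memory types with hypothetical names (`Agent`, `HostView`, `ScheduleSet`). It illustrates the drift arithmetic only and leaves out the WebSocket transport and persistence.

```go
package main

// Schedule and ScheduleSet loosely mirror the `schedule.set` payload; the
// field names are illustrative assumptions.
type Schedule struct {
	ID   string
	Kind string
	Cron string
}

type ScheduleSet struct {
	Version   int
	Schedules []Schedule
}

// Agent holds the last-known-good schedule. It keeps serving it from
// Active regardless of connectivity.
type Agent struct {
	current ScheduleSet
}

// Apply replaces the schedule wholesale (the server's view is canonical)
// and returns the version to send back in `schedule.ack`.
func (a *Agent) Apply(s ScheduleSet) int {
	a.current = ScheduleSet{
		Version:   s.Version,
		Schedules: append([]Schedule(nil), s.Schedules...),
	}
	return s.Version
}

// Active returns the schedules the agent should keep firing.
func (a *Agent) Active() []Schedule { return a.current.Schedules }

// HostView is the server-side record: the version last pushed and the
// version last acknowledged (Host.applied_schedule_version in the spec).
type HostView struct {
	PushedVersion  int
	AppliedVersion int
}

// OnAck records a `schedule.ack` from the agent.
func (h *HostView) OnAck(version int) { h.AppliedVersion = version }

// Drift is how many versions the agent is behind; zero means in sync.
func (h *HostView) Drift() int { return h.PushedVersion - h.AppliedVersion }
```

In the spec's own example, a host with `PushedVersion` 7 and `AppliedVersion` 5 reports a drift of 2 while the agent keeps firing v5; once the agent reconnects, applies v7, and acks it, the drift returns to 0.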
+
 ### 6.3 Enrollment

 1. Operator clicks "Add host" → server generates one-time token (TTL 1h)
@@ -339,8 +389,14 @@ services:
 - RM_TLS_CERT=/certs/fullchain.pem
 - RM_TLS_KEY=/certs/privkey.pem
 - RM_SECRET_KEY_FILE=/data/secret.key
+# - RM_TRUSTED_PROXY=10.0.0.0/8 # set when fronted by a reverse proxy
 ```

+`RM_LISTEN` is the source of truth for the server's bind address. The
+`8443:8443` mapping above is just the matching default; if you change
+`RM_LISTEN` to e.g. `:9443`, change the right-hand side of the port
+mapping to match.
+
 ### 10.2 Restic REST server (Unraid)

 Standard `restic/rest-server` container, `--append-only`, `--private-repos`, htpasswd mounted, data path on the share.
@@ -430,6 +486,7 @@ Per-host upload/download caps for backup, restore, and prune jobs.
 Per-host shell commands run before and after a backup job. Use cases: `mysqldump`/`pg_dump` to a staging path, stop/start Docker containers, quiesce a service, post-backup notifications.

 - **Schema:** `Schedule.pre_hook` and `Schedule.post_hook` (string, optional). For more complex cases, `Host.pre_hook_default` / `Host.post_hook_default` apply to all schedules on that host unless overridden
+- **Applicability:** hooks are only meaningful for `kind = backup` schedules. The API rejects non-null `pre_hook` / `post_hook` on any other schedule kind (`forget`, `prune`, `check`) with a clear validation error rather than silently ignoring them. The same constraint applies to `Host.pre_hook_default` / `Host.post_hook_default`: they only fire for backup schedules on that host
 - **Execution:** agent runs hooks via the host's default shell (`/bin/sh` Linux, `cmd.exe` or PowerShell Windows — host-configurable)
 - **Failure semantics:** `pre_hook` non-zero exit aborts the backup and marks the job failed. `post_hook` runs on both success and failure (with `RM_JOB_STATUS` env var); its own exit code is recorded but does not change the backup job's final status
 - **Stdout/stderr:** captured into `JobLog` like restic output, prefixed `pre_hook:` / `post_hook:`
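The new applicability rule (reject non-null hooks on anything but `backup`) could be enforced at API validation time along these lines. This is a sketch under assumptions: the `Schedule` struct, its pointer-typed hook fields (nil standing in for JSON null), and `validateHooks` are hypothetical, and the error wording is not taken from the spec.

```go
package main

import "fmt"

// Schedule carries only the fields relevant to hook validation. A nil
// pointer stands for a null / absent hook in the request body.
type Schedule struct {
	Kind     string // "backup", "forget", "prune" or "check"
	PreHook  *string
	PostHook *string
}

// validateHooks rejects hooks on non-backup schedules instead of silently
// ignoring them, so the operator finds out at save time rather than when
// the hook never runs.
func validateHooks(s Schedule) error {
	if s.Kind == "backup" {
		return nil
	}
	if s.PreHook != nil {
		return fmt.Errorf("pre_hook is only valid for kind=backup, got kind=%q", s.Kind)
	}
	if s.PostHook != nil {
		return fmt.Errorf("post_hook is only valid for kind=backup, got kind=%q", s.Kind)
	}
	return nil
}
```

Naming the offending field and the schedule kind in the error is one way to meet the "clear validation error" requirement; host-level defaults need no such check here because they simply never fire for non-backup schedules.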