From 731f01a63ecdd1783587a022cc0681d31e1f2272 Mon Sep 17 00:00:00 2001
From: Steve Cliff
Date: Wed, 6 May 2026 21:20:00 +0100
Subject: [PATCH] spec: P6-01+02 agent self-update + fleet update design

---
 ...05-06-p6-01-02-agent-self-update-design.md | 448 ++++++++++++++++++
 1 file changed, 448 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-05-06-p6-01-02-agent-self-update-design.md

diff --git a/docs/superpowers/specs/2026-05-06-p6-01-02-agent-self-update-design.md b/docs/superpowers/specs/2026-05-06-p6-01-02-agent-self-update-design.md
new file mode 100644
index 0000000..d788418
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-06-p6-01-02-agent-self-update-design.md
@@ -0,0 +1,448 @@
+# P6-01 + P6-02 — Agent self-update + fleet update
+
+Status: design approved 2026-05-06.
+Scope: P6-01 (agent self-update mechanism) and P6-02 (dashboard
+version reporting + fleet update UI). One spec, one branch — the
+two tasks are tightly coupled (P6-02 is the operator surface for
+the mechanism P6-01 ships).
+
+## 1. Background
+
+P5-03 pivoted release distribution to a single multi-arch server
+Docker image, with cross-compiled agent binaries baked under
+`/opt/restic-manager/dist/agent-binaries/` and served via
+`GET /agent/binary?os=…&arch=…`. The plumbing already does
+dual-path lookup: `/agent-binaries/` overrides the
+image-baked copy, so an operator can hot-patch a pre-release agent
+without rebuilding the image.
+
+That makes the server the natural distribution point for agent
+upgrades. "Update agent" collapses to "re-fetch from your own
+server" — no apt repo, no Chocolatey, no third-party signing infra,
+and version pinning is automatic because the server only ever
+serves the agent that matches its own release.
+
+This spec wires up the update mechanism end-to-end and the
+operator surface that drives it.
+
+## 2. Decisions
+
+| # | Decision | Rationale |
+|---|----------|-----------|
+| 1 | Operator-driven only — no auto-update | Matches the rest of the app's job-dispatch model; avoids "bad release upgrades every host instantly"; auto-update can be added later as a setting flip if asked |
+| 2 | Linux: just exit, let systemd restart. Windows: detached helper script. | Linux supports rename-while-open; Windows holds an exclusive lock on the running .exe |
+| 3 | M1 (keep `agent.old` on disk) + M2 (rolling fleet update with halt-on-fail). Skip M3 (auto-rollback watchdog). | M1 is ~5 lines, M2 falls naturally out of P6-02's UI, M3 is a lot of plumbing for "shipped a binary that doesn't start" |
+| 4 | Skip sha256 digest verification for v1 | TLS already covers the corruption-in-transit threat; image-tampering is image-build's problem, not the agent's |
+| 5 | Exact string version match for "out of date" | With server-bundled binaries there's exactly one canonical version per server image — anything else is out of date by definition |
+| 6 | WS envelope only, no `restic-manager-agent update` CLI subcommand | YAGNI; no concrete consumer; the underlying logic is reusable when one appears |
+
+## 3. Wire protocol
+
+### 3.1 Server → agent: `command.update`
+
+```
+{
+  "type": "command.update",
+  "id": "",
+  "payload": {
+    "job_id": ""
+  }
+}
+```
+
+No `os` / `arch` / `version` in the payload — the agent already
+knows its own build target and fetches from its configured server
+URL via the existing `/agent/binary` handler. Including a target
+version would also tempt the agent into version-comparison logic;
+keep that on the server side.
+
+### 3.2 Job lifecycle (server-driven)
+
+The agent has limited ability to report on its own restart, so the
+job state machine lives on the server:
+
+- **queued → running** when the envelope is dispatched.
+- **running → succeeded** when the agent re-hellos with
+  `agent_version == server.Version` after dispatch and within
+  the timeout. Audit `host.update_succeeded`.
+- **running → failed (timeout)** if 90 seconds pass without a
+  hello carrying the matching version. Audit `host.update_failed`.
+  Raise alert kind `update_failed` (reuses P3-05 alert engine).
+  This single transition covers both the "agent never came back
+  at all" case and the "agent came back at the wrong version"
+  case — see §6.2 for why we don't transition immediately on a
+  mismatched hello.
+
+Migration 0021 widens the `jobs.kind` CHECK constraint to include
+`update`. Same column-level pattern as 0012 (where 0012 added
+`restore` and `diff`).
+
+## 4. Agent-side execution
+
+Lives in `internal/agent/updater`, build-tag split:
+
+- `updater_unix.go` — Linux + any future POSIX target.
+- `updater_windows.go` — Windows-only, uses the helper-script
+  pattern.
+- `updater.go` — shared `Update(ctx, serverURL string) error`
+  interface and the HTTP fetch/streaming code (no platform deps).
+
+### 4.1 Linux flow
+
+1. Receive `command.update` from the WS dispatcher.
+2. Resolve own binary via `os.Executable()` and `filepath.Abs`.
+   Refuse if the resolved path is `/proc/self/exe` or otherwise
+   not a real file (defence in depth — shouldn't happen under
+   systemd, but bail loudly if it does).
+3. `GET /agent/binary?os=linux&arch=`,
+   stream to `.new` in the same directory as the running
+   binary (same filesystem ⇒ atomic rename).
+4. fsync the file, `os.Chmod(0755)`.
+5. Copy current binary to `.old` (overwrite if it
+   exists). M1 — one-revision rollback target.
+6. `os.Rename(.new, )`.
+7. Close the WS connection cleanly (sends close frame so the
+   server transitions the connection to `disconnected` rather
+   than waiting for the heartbeat-miss sweep).
+8. `os.Exit(0)`. Systemd's `Restart=always` (already in the unit)
+   brings up the new binary within seconds.
+
+### 4.2 Windows flow
+
+The .exe is exclusively locked by the OS while running, so steps
+5–6 above can't happen in-process. Use a detached helper:
+
+1. Steps 1–4 the same — fetch into `.exe.new`, fsync.
+2. Write `update.cmd` to a tmp path with the orchestration:
+   ```
+   timeout /t 3 /nobreak >nul
+   copy /Y ".exe" ".exe.old"
+   sc stop restic-manager-agent
+   :wait
+   sc query restic-manager-agent | find "STOPPED" >nul
+   if errorlevel 1 (timeout /t 1 /nobreak >nul & goto wait)
+   move /Y ".exe.new" ".exe"
+   sc start restic-manager-agent
+   del "%~f0"
+   ```
+3. `CreateProcess` it detached
+   (`DETACHED_PROCESS | CREATE_NO_WINDOW`, no parent handles).
+4. Close WS, `os.Exit(0)`. SCM sees clean stop and waits — does
+   *not* try to restart, because `sc stop` is the helper's job,
+   not a crash. (`Restart=always` semantics differ between
+   systemd and SCM. SCM treats clean-exit-after-stop as
+   intentional and does not auto-restart; only crashes restart.
+   That's why the helper script needs the explicit `sc start`
+   at the end.)
+
+### 4.3 Service-user assumption
+
+Both Linux (`User=root` per the existing unit) and Windows
+(`LocalSystem` by default) can write the binary path directly. If
+the agent ever moves to a non-root service user, the updater
+breaks — would need either a setuid helper or an out-of-process
+update service. Add a `// NOTE:` comment in the updater package
+flagging this; not a v1 blocker.
+
+## 5. Server build version
+
+New package `internal/version` exposing two string variables
+(`var`, not `const`, so that `-ldflags -X` can set them):
+
+```
+package version
+
+var (
+    Version = "dev"
+    Commit  = ""
+)
+```
+
+Wired via `-ldflags` in the Makefile:
+
+```
+GO_LDFLAGS = -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=$(VERSION) \
+             -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Commit=$(COMMIT)
+
+VERSION := $(shell git describe --tags --always --dirty)
+COMMIT := $(shell git rev-parse --short HEAD)
+```
+
+Both `cmd/server` and `cmd/agent` link the same package, so an
+agent's `agent_version` (sent in the hello payload, already wired
+since P1-11) is comparable byte-for-byte to the server's
+`version.Version`.
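The exact-match rule from decision 5 can be sketched as a single predicate. This is a minimal illustration, not the project's code: `OutOfDate` and the placeholder `serverVersion` value are hypothetical names, and the empty-string guard follows the chip rule given in §7.3.

```go
package main

import "fmt"

// serverVersion stands in for version.Version in this sketch; the value
// is a placeholder, not a real release.
var serverVersion = "v1.2.3"

// OutOfDate reports whether a host should be flagged as out of date.
// The comparison is a plain string match (decision 5), with no semver
// parsing. An empty agentVersion means the host has never connected,
// so its version is unknown and it is not flagged.
func OutOfDate(agentVersion, server string) bool {
	return agentVersion != "" && agentVersion != server
}

func main() {
	for _, v := range []string{"v1.2.3", "v1.2.2", "v1.2.3-dirty", ""} {
		fmt.Printf("%q out of date: %v\n", v, OutOfDate(v, serverVersion))
	}
}
```

Because the match is byte-for-byte, a `-dirty` suffix is enough to flag a host, which is the dev-build behaviour the spec accepts.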
+
+`make build` already does what's needed for source builds. The
+Phase 2 work in this spec is the Docker release path — confirm
+during plan execution that `.gitea/workflows/release.yml` passes
+`VERSION` and `COMMIT` into the Docker `--build-arg` chain so the
+in-image binaries embed the same string the image is tagged with.
+If not, add the wiring.
+
+Dirty/dev builds (`v1.2.3-dirty`) won't match clean server builds,
+so every dev environment will show every host as out-of-date. This
+is acceptable — the chip is a noop in dev, real ops always run
+tagged builds.
+
+A new `GET /api/version` endpoint returns
+`{"version": "...", "commit": "..."}`. Used by the dashboard
+header tile and by `/settings/fleet-update`. Public-band — exposes
+no secrets, lets the install scripts surface it too.
+
+## 6. P6-01 server endpoints
+
+### 6.1 `POST /api/hosts/{id}/update`
+
+Admin-only. Refuses (with structured error code) when:
+
+- Host is offline (`host_offline`).
+- Host's `agent_version == server.Version` (`already_up_to_date`).
+- An update job for this host is already running (`update_in_progress`).
+
+Happy path: creates `jobs` row with `kind=update`, dispatches
+`command.update` envelope, audit-logs `host.update_dispatched`,
+returns `{"job_id": "..."}`.
+
+UI form-post variant on `/hosts/{id}/update` returns
+`HX-Redirect` to the live job log.
+
+### 6.2 Hello handler integration
+
+The existing `onAgentHello` (P1-11) already upserts
+`agent_version`. Extend it: after the upsert, look for any
+`update` job for this host with `status='running'`. If one
+exists:
+
+- `agent_version == server.Version` → mark job `succeeded`,
+  audit `host.update_succeeded`.
+- `agent_version != server.Version` → leave the job running so
+  the timeout path catches it as a rollback failure (don't fail
+  immediately — gives the agent one chance to come back, restart,
+  hello again with the right version).
+
+Adds a small in-memory map of pending updates so the timeout
+goroutine knows when to give up. Persisted state lives in the
+`jobs` table; the in-memory map is just for the timer.
+
+## 7. P6-02 fleet update
+
+### 7.1 Schema
+
+Migration 0022, column-level adds only:
+
+```
+CREATE TABLE fleet_updates (
+    id TEXT PRIMARY KEY,
+    started_at TEXT NOT NULL,
+    started_by_user_id TEXT NOT NULL REFERENCES users(id),
+    target_version TEXT NOT NULL,
+    status TEXT NOT NULL CHECK (status IN ('running','completed','halted','cancelled')),
+    current_host_id TEXT REFERENCES hosts(id),
+    halted_reason TEXT,
+    completed_at TEXT
+);
+
+CREATE TABLE fleet_update_hosts (
+    fleet_update_id TEXT NOT NULL REFERENCES fleet_updates(id) ON DELETE CASCADE,
+    host_id TEXT NOT NULL REFERENCES hosts(id) ON DELETE CASCADE,
+    status TEXT NOT NULL CHECK (status IN ('pending','running','succeeded','failed','skipped')),
+    job_id TEXT REFERENCES jobs(id),
+    failed_reason TEXT,
+    PRIMARY KEY (fleet_update_id, host_id)
+);
+```
+
+### 7.2 Worker loop
+
+A single in-process goroutine — at most one fleet update may run
+at a time (enforced via a `sync.Mutex` + a precondition check on
+`POST /api/fleet/update`).
+
+```
+for each pending fleet_update_hosts row in dispatch order:
+    set fleet_updates.current_host_id = row.host_id
+    set fleet_update_hosts.status = 'running'
+    if host.agent_version == server.Version:
+        # Already updated since we built the list — skip.
+        set status = 'skipped'; continue
+    if !host.online:
+        # Offline since we built the list — halt.
+        halt(reason="host went offline")
+        return
+    dispatch_update_for_host(host) # reuses 6.1 logic
+    wait_up_to_90s_for_hello_with_matching_version()
+    if matched:
+        set status = 'succeeded'; continue
+    else:
+        set status = 'failed', failed_reason = "..."
+        halt(reason="update failed on host X")
+        return
+set fleet_updates.status = 'completed', completed_at = now
+```
+
+Halt: set `fleet_updates.status = 'halted'`, raise an alert kind
+`fleet_update_halted`, audit `fleet.update_halted` with the host
+id and reason. Subsequent hosts stay `pending` so the operator can
+see what was queued and decide whether to resume (resume = start a
+new fleet update with the still-out-of-date subset).
+
+Cancel: admin-only `POST /api/fleet-updates/{id}/cancel`. Sets
+`status='cancelled'`. The currently-dispatched host's update job
+keeps running (the agent is already mid-restart) — cancel only
+prevents the *next* host from being picked. Audit
+`fleet.update_cancelled`.
+
+### 7.3 UI surfaces
+
+**Per-host chip (host_row partial + host detail chrome):**
+
+`out of date · v1.2.2 → v1.2.3` — amber-accented, mirrors `.tag`
+token shape. Only rendered when:
+
+```
+host.agent_version != "" && host.agent_version != server.Version
+```
+
+Empty `agent_version` (host enrolled but never connected) renders
+nothing rather than "out of date" — we don't know what version
+they have.
+
+**Dashboard summary tile:**
+
+The hero strip already has tiles. Add an "Updates" tile:
+`N hosts behind` linking to `/?updates=behind` (extends NS-04's
+filter machinery — adds an `updates` query param alongside
+`status`/`repo_status`/`tag`). Hidden when N == 0.
+
+**Per-host Update button on `/hosts/{id}`:**
+
+Right-rail, admin-only. Disabled with hover tooltip when host
+offline / already up to date / update in progress. POSTs to
+`/hosts/{id}/update`, `HX-Redirect` to the live job log.
+
+**Fleet update page `/settings/fleet-update`:**
+
+Admin-only. Two states:
+
+- **Idle**: lists out-of-date online hosts (table: hostname,
+  current version, target version, last seen). Big "Start rolling
+  update" button behind a typed-confirm dialog (operator types
+  the host count, e.g. `12`, to enable the button — same shape as
+  the host-delete confirm).
+- **Running/halted/completed**: shows the currently-active
+  fleet_update row + per-host progress list. Polls every 3s (htmx
+  trigger conditional on `document.visibilityState === 'visible'`,
+  same pattern as the alerts page). Renders:
+  ```
+  Updated 3/12 · currently updating
+  Halted on : · job log →
+  ```
+
+Audit actions: `fleet.update_started`, `fleet.update_completed`,
+`fleet.update_halted`, `fleet.update_cancelled`.
+
+### 7.4 Alert engine integration
+
+P3-05's alert engine already supports kind-based registration. Add
+two new kinds:
+
+- `update_failed` — per-host, raised on individual update failure.
+  Auto-resolves when the host re-hellos with the matching version.
+- `fleet_update_halted` — global, raised on fleet halt. Auto-resolves
+  when a subsequent fleet update completes successfully.
+
+## 8. RBAC
+
+| Endpoint | Role |
+|----------|------|
+| `POST /api/hosts/{id}/update` | admin |
+| `POST /api/fleet/update` | admin |
+| `POST /api/fleet-updates/{id}/cancel` | admin |
+| `GET /api/fleet-updates/{id}` | admin (status polling) |
+| `GET /api/version` | public |
+
+Operator and viewer see the "out of date" chip but no update
+buttons. Mirrors the existing pattern: read affordances are
+visible to all roles, write affordances are gated.
+
+## 9. Testing
+
+### 9.1 Unit
+
+- `internal/agent/updater`: fake-`/agent/binary` HTTP server +
+  tmp "running binary" file, assert post-state — binary swapped,
+  `.old` present, no leftover `.new`. Linux path only (Windows
+  helper covered by build-tag compile-only).
+- `internal/server/http`: `POST /api/hosts/{id}/update` happy
+  path, refuses-when-offline, refuses-when-up-to-date,
+  refuses-when-update-in-progress, RBAC enforcement, audit row
+  written.
+- Hello handler: agent reconnects with matching version after
+  `update` job dispatch → marks job `succeeded`, drops the
+  in-memory pending entry. Mismatched version → no-op (timeout
+  catches it).
+- Timeout path: synthetic `update` job + 90s elapsed →
+  marks `failed`, raises alert.
+- Fleet worker: table-driven over the loop's state machine —
+  success-then-success, success-then-timeout-halts,
+  cancel-mid-flight, no-online-out-of-date-hosts-completes-immediately,
+  host-disappears-from-list-mid-loop-skips.
+
+### 9.2 Smoke validation (per CLAUDE.md restage block)
+
+1. Build server + agent at version A. Restage. Enrol a host;
+   confirm `agent_version=A`.
+2. Bump version to B (`make build VERSION=B`), rebuild server
+   only, restart server. Dashboard shows host as out-of-date with
+   `A → B` chip. Updates tile reads "1 host behind".
+3. Rebuild agent at B, restage `/agent-binaries/`. Click
+   **Update agent** on host detail. Agent fetches, swaps, exits;
+   systemd restarts it; hello-back at B → job `succeeded`, chip
+   gone, tile clears.
+4. Rollback path: leave `/agent-binaries/` at A, server
+   at B, click Update — agent fetches A, swaps to A, restarts at
+   A; hello says A != B; server marks job `failed` after 90s with
+   reason "agent reconnected at version A, expected B".
+5. Fleet update: spin up two smoke hosts both out-of-date, fire
+   **Start rolling update**, watch progress page tick host 1 →
+   host 2 → completed.
+6. Halt path: replace one of the `/agent-binaries/`
+   files with `/bin/false`. Run fleet update. First host gets
+   broken binary, fails to come back up, fleet update halts at
+   host 1 after 90s, alert raised, host 2 left as `pending`.
+
+Step 6 validates M2 end-to-end — the rolling halt is the actual
+safety guarantee, not a nice-to-have.
+
+## 10. Out of scope
+
+- sha256 digest verification (deferred — see decision 4).
+- `restic-manager-agent update` CLI subcommand (deferred —
+  decision 6).
+- Auto-update (deferred — decision 1).
+- Auto-rollback watchdog M3 (deferred — decision 3).
+- Migrating the agent off `User=root` (separate hardening track).
+- Cross-version protocol-compatibility checks beyond the existing
+  `protocol_version` handshake (P1-11). If the new agent's
+  `protocol_version` is incompatible with the server, the
+  existing handshake rejects it; the update job will then
+  correctly time out and be marked failed.
+
+## 11. Migration plan
+
+1. `internal/version` package + Makefile ldflags wiring.
+2. Migration 0021 (jobs.kind widening) + 0022 (fleet_updates
+   tables).
+3. `internal/agent/updater` package, Linux first.
+4. WS envelope wiring + `command.update` dispatcher.
+5. `POST /api/hosts/{id}/update` + hello-handler integration +
+   timeout goroutine.
+6. UI: chip + per-host update button + dashboard tile + filter.
+7. Fleet update worker + page.
+8. Windows updater path.
+9. Alert engine kinds.
+10. Smoke validation per §9.2.
+
+Each step is independently testable; commits should land at each
+boundary so a failed Windows path (8) doesn't block the rest of
+the work.