From 731f01a63ecdd1783587a022cc0681d31e1f2272 Mon Sep 17 00:00:00 2001
From: Steve Cliff <steve@devcloud.guru>
Date: Wed, 6 May 2026 21:20:00 +0100
Subject: [PATCH] spec: P6-01+02 agent self-update + fleet update design

---
 ...05-06-p6-01-02-agent-self-update-design.md | 448 ++++++++++++++++++
 1 file changed, 448 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-05-06-p6-01-02-agent-self-update-design.md
diff --git a/docs/superpowers/specs/2026-05-06-p6-01-02-agent-self-update-design.md b/docs/superpowers/specs/2026-05-06-p6-01-02-agent-self-update-design.md
new file mode 100644
index 0000000..d788418
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-06-p6-01-02-agent-self-update-design.md
@@ -0,0 +1,448 @@
+# P6-01 + P6-02 — Agent self-update + fleet update
+
+Status: design approved 2026-05-06.
+Scope: P6-01 (agent self-update mechanism) and P6-02 (dashboard
+version reporting + fleet update UI). One spec, one branch — the
+two tasks are tightly coupled (P6-02 is the operator surface for
+the mechanism P6-01 ships).
+
+## 1. Background
+
+P5-03 pivoted release distribution to a single multi-arch server
+Docker image, with cross-compiled agent binaries baked under
+`/opt/restic-manager/dist/agent-binaries/` and served via
+`GET /agent/binary?os=…&arch=…`. The plumbing already does
+dual-path lookup: `<DataDir>/agent-binaries/<name>` overrides the
+image-baked copy, so an operator can hot-patch a pre-release agent
+without rebuilding the image.
+
+That makes the server the natural distribution point for agent
+upgrades. "Update agent" collapses to "re-fetch from your own
+server" — no apt repo, no Chocolatey, no third-party signing infra,
+and version pinning is automatic because the server only ever
+serves the agent that matches its own release.
+
+This spec wires up the update mechanism end-to-end and the
+operator surface that drives it.
+
+## 2. Decisions
+
+| # | Decision | Rationale |
+|---|----------|-----------|
+| 1 | Operator-driven only — no auto-update | Matches the rest of the app's job-dispatch model; avoids "bad release upgrades every host instantly"; auto-update can be added later as a setting flip if asked |
+| 2 | Linux: just exit, let systemd restart. Windows: detached helper script. | Linux supports rename-while-open; Windows holds an exclusive lock on the running .exe |
+| 3 | M1 (keep `agent.old` on disk) + M2 (rolling fleet update with halt-on-fail). Skip M3 (auto-rollback watchdog). | M1 is ~5 lines, M2 falls naturally out of P6-02's UI, M3 is a lot of plumbing for "shipped a binary that doesn't start" |
+| 4 | Skip sha256 digest verification for v1 | TLS already covers the corruption-in-transit threat; image-tampering is image-build's problem, not the agent's |
+| 5 | Exact string version match for "out of date" | With server-bundled binaries there's exactly one canonical version per server image — anything else is out of date by definition |
+| 6 | WS envelope only, no `restic-manager-agent update` CLI subcommand | YAGNI; no concrete consumer; the underlying logic is reusable when one appears |
+
+## 3. Wire protocol
+
+### 3.1 Server → agent: `command.update`
+
+```
+{
+  "type": "command.update",
+  "id": "<envelope id>",
+  "payload": {
+    "job_id": "<ulid>"
+  }
+}
+```
+
+No `os` / `arch` / `version` in the payload — the agent already
+knows its own build target and fetches from its configured server
+URL via the existing `/agent/binary` handler. Including a target
+version would also tempt the agent into version-comparison logic;
+keep that on the server side.
+
+### 3.2 Job lifecycle (server-driven)
+
+The agent has limited ability to report on its own restart, so the
+job state machine lives on the server:
+
+- **queued → running** when the envelope is dispatched.
+- **running → succeeded** when the agent re-hellos with
+  `agent_version == server.Version` after dispatch and within
+  the timeout. Audit `host.update_succeeded`.
+- **running → failed (timeout)** if 90 seconds pass without a
+  hello carrying the matching version. Audit `host.update_failed`.
+  Raise alert kind `update_failed` (reuses P3-05 alert engine).
+  This single transition covers both the "agent never came back
+  at all" case and the "agent came back at the wrong version"
+  case — see §6.2 for why we don't transition immediately on a
+  mismatched hello.
+
+Migration 0021 widens the `jobs.kind` CHECK constraint to include
+`update`. Same column-level pattern as 0012, which added
+`restore` and `diff`.
+
+## 4. Agent-side execution
+
+Lives in `internal/agent/updater`, build-tag split:
+
+- `updater_unix.go` — Linux + any future POSIX target.
+- `updater_windows.go` — Windows-only, uses the helper-script
+  pattern.
+- `updater.go` — shared `Update(ctx context.Context, serverURL string) error`
+  entry point and the HTTP fetch/streaming code (no platform deps).
+
+### 4.1 Linux flow
+
+1. Receive `command.update` from the WS dispatcher.
+2. Resolve own binary via `os.Executable()` and `filepath.Abs`.
+   Refuse if the resolved path is `/proc/self/exe` or otherwise
+   not a real file (defence in depth — shouldn't happen under
+   systemd, but bail loudly if it does).
+3. `GET <server>/agent/binary?os=linux&arch=<runtime.GOARCH>`,
+   stream to `<binary>.new` in the same directory as the running
+   binary (same filesystem ⇒ atomic rename).
+4. fsync the file, `os.Chmod(<binary>.new, 0755)`.
+5. Copy current binary to `<binary>.old` (overwrite if it
+   exists). M1 — one-revision rollback target.
+6. `os.Rename(<binary>.new, <binary>)`.
+7. Close the WS connection cleanly (sends close frame so the
+   server transitions the connection to `disconnected` rather
+   than waiting for the heartbeat-miss sweep).
+8. `os.Exit(0)`. Systemd's `Restart=always` (already in the unit)
+   brings up the new binary within seconds.
+
+### 4.2 Windows flow
+
+The .exe is exclusively locked by the OS while running, so steps
+5–6 above can't happen in-process. Use a detached helper:
+
+1. Steps 1–4 as in §4.1 (with `os=windows`): fetch into `<binary>.exe.new`, fsync.
+2. Write `update.cmd` to a tmp path with the orchestration (waits use
+   `ping`, since `timeout` aborts when stdin is not a console):
+   ```
+   ping -n 4 127.0.0.1 >nul
+   copy /Y "<binary>.exe" "<binary>.exe.old"
+   sc stop restic-manager-agent
+   :wait
+   sc query restic-manager-agent | find "STOPPED" >nul
+   if errorlevel 1 (ping -n 2 127.0.0.1 >nul & goto wait)
+   move /Y "<binary>.exe.new" "<binary>.exe"
+   sc start restic-manager-agent
+   del "%~f0"
+   ```
+3. `CreateProcess` it with `DETACHED_PROCESS` and no parent handles.
+4. Close WS, `os.Exit(0)`. SCM does *not* restart the service
+   after this exit. `Restart=always` semantics differ between
+   systemd and SCM: SCM treats a clean exit followed by the
+   helper's `sc stop` as intentional, not as a crash, and only
+   crashes trigger an auto-restart. That is why the helper
+   script needs the explicit `sc start` at the end, after the
+   new binary is in place.
+
+### 4.3 Service-user assumption
+
+On both Linux (`User=root` per the existing unit) and Windows
+(`LocalSystem` by default) the service account can write the
+binary path directly. If the agent ever moves to a non-root
+service user, the updater breaks and would need either a setuid
+helper or an out-of-process update service. Add a `// NOTE:`
+comment in the updater package flagging this; not a v1 blocker.
+
+## 5. Server build version
+
+New package `internal/version` exposing two string variables:
+
+```
+package version
+
+var (
+    Version = "dev"
+    Commit  = ""
+)
+```
+
+Wired via `-ldflags` in the Makefile:
+
+```
+VERSION := $(shell git describe --tags --always --dirty)
+COMMIT  := $(shell git rev-parse --short HEAD)
+
+GO_LDFLAGS = -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=$(VERSION) \
+             -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Commit=$(COMMIT)
+```
+
+Both `cmd/server` and `cmd/agent` link the same package, so an
+agent's `agent_version` (sent in the hello payload, already wired
+since P1-11) is comparable byte-for-byte to the server's
+`version.Version`.
+
+`make build` already does what's needed for source builds. The
+Phase 2 work in this spec is the Docker release path — confirm
+during plan execution that `.gitea/workflows/release.yml` passes
+`VERSION` and `COMMIT` into the Docker `--build-arg` chain so the
+in-image binaries embed the same string the image is tagged with.
+If not, add the wiring.
+
+A dirty/dev agent build (`v1.2.3-dirty`) never matches a clean
+server build, so a dev environment that mixes the two shows every
+host as out-of-date. This is acceptable: the chip is noise in dev,
+and real ops always run tagged builds.
+
+A new `GET /api/version` endpoint returns
+`{"version": "...", "commit": "..."}`. Used by the dashboard
+header tile and by `/settings/fleet-update`. Public-band — exposes
+no secrets, lets the install scripts surface it too.
+
+## 6. P6-01 server endpoints
+
+### 6.1 `POST /api/hosts/{id}/update`
+
+Admin-only. Refuses (with structured error code) when:
+
+- Host is offline (`host_offline`).
+- Host's `agent_version == server.Version` (`already_up_to_date`).
+- An update job for this host is already running (`update_in_progress`).
+
+Happy path: creates `jobs` row with `kind=update`, dispatches
+`command.update` envelope, audit-logs `host.update_dispatched`,
+returns `{"job_id": "..."}`.
+
+UI form-post variant on `/hosts/{id}/update` returns
+`HX-Redirect` to the live job log.
+
+### 6.2 Hello handler integration
+
+The existing `onAgentHello` (P1-11) already upserts
+`agent_version`. Extend it: after the upsert, look for any
+`update` job for this host with `status='running'`. If one
+exists:
+
+- `agent_version == server.Version` → mark job `succeeded`,
+  audit `host.update_succeeded`.
+- `agent_version != server.Version` → leave the job running so
+  the timeout path catches it as a rollback failure. Don't fail
+  immediately: this gives the agent a chance to restart and
+  hello again with the right version before the 90s deadline.
+
+Adds a small in-memory map of pending updates so the timeout
+goroutine knows when to give up. Persisted state lives in the
+`jobs` table; the in-memory map is just for the timer.
+
+## 7. P6-02 fleet update
+
+### 7.1 Schema
+
+Migration 0022, two new tables only:
+
+```
+CREATE TABLE fleet_updates (
+  id              TEXT PRIMARY KEY,
+  started_at      TEXT NOT NULL,
+  started_by_user_id TEXT NOT NULL REFERENCES users(id),
+  target_version  TEXT NOT NULL,
+  status          TEXT NOT NULL CHECK (status IN ('running','completed','halted','cancelled')),
+  current_host_id TEXT REFERENCES hosts(id),
+  halted_reason   TEXT,
+  completed_at    TEXT
+);
+
+CREATE TABLE fleet_update_hosts (
+  fleet_update_id TEXT NOT NULL REFERENCES fleet_updates(id) ON DELETE CASCADE,
+  host_id         TEXT NOT NULL REFERENCES hosts(id) ON DELETE CASCADE,
+  status          TEXT NOT NULL CHECK (status IN ('pending','running','succeeded','failed','skipped')),
+  job_id          TEXT REFERENCES jobs(id),
+  failed_reason   TEXT,
+  PRIMARY KEY (fleet_update_id, host_id)
+);
+```
+
+### 7.2 Worker loop
+
+A single in-process goroutine — at most one fleet update may run
+at a time (enforced via a `sync.Mutex` + a precondition check on
+`POST /api/fleet/update`).
+
+```
+for each pending fleet_update_hosts row in dispatch order:
+    if fleet_updates.status == 'cancelled':
+        return  # cancel only stops the next host from being picked
+    set fleet_updates.current_host_id = row.host_id
+    set fleet_update_hosts.status = 'running'
+    if host.agent_version == server.Version:
+        # Already updated since we built the list: skip.
+        set status = 'skipped'; continue
+    if !host.online:
+        # Offline since we built the list: halt.
+        halt(reason="host went offline"); return
+    dispatch_update_for_host(host)  # reuses 6.1 logic
+    wait_up_to_90s_for_hello_with_matching_version()
+    if matched:
+        set status = 'succeeded'; continue
+    else:
+        set status = 'failed', failed_reason = "..."
+        halt(reason="update failed on host X"); return
+set fleet_updates.status = 'completed', completed_at = now
+```
+
+Halt: set `fleet_updates.status = 'halted'`, raise an alert of kind
+`fleet_update_halted`, audit `fleet.update_halted` with the host
+id and reason. Subsequent hosts stay `pending` so the operator can
+see what was queued and decide whether to resume (resume = start a
+new fleet update with the still-out-of-date subset).
+
+Cancel: admin-only `POST /api/fleet-updates/{id}/cancel`. Sets
+`status='cancelled'`. The currently-dispatched host's update job
+keeps running (the agent is already mid-restart) — cancel only
+prevents the *next* host from being picked. Audit
+`fleet.update_cancelled`.
+
+### 7.3 UI surfaces
+
+**Per-host chip (host_row partial + host detail chrome):**
+
+`out of date · v1.2.2 → v1.2.3` — amber-accented, mirrors `.tag`
+token shape. Only rendered when:
+
+```
+host.agent_version != "" && host.agent_version != server.Version
+```
+
+Empty `agent_version` (host enrolled but never connected) renders
+nothing rather than "out of date" — we don't know what version
+they have.
+
+**Dashboard summary tile:**
+
+The hero strip already has tiles. Add an "Updates" tile:
+`N hosts behind` linking to `/?updates=behind` (extends NS-04's
+filter machinery — adds an `updates` query param alongside
+`status`/`repo_status`/`tag`). Hidden when N == 0.
+
+**Per-host Update button on `/hosts/{id}`:**
+
+Right-rail, admin-only. Disabled with hover tooltip when host
+offline / already up to date / update in progress. POSTs to
+`/hosts/{id}/update`, `HX-Redirect` to the live job log.
+
+**Fleet update page `/settings/fleet-update`:**
+
+Admin-only. Two states:
+
+- **Idle**: lists out-of-date online hosts (table: hostname,
+  current version, target version, last seen). Big "Start rolling
+  update" button behind a typed-confirm dialog (operator types
+  the host count, e.g. `12`, to enable the button — same shape as
+  the host-delete confirm).
+- **Running/halted/completed**: shows the currently-active
+  fleet_update row + per-host progress list. Polls every 3s (htmx
+  trigger conditional on `document.visibilityState === 'visible'`,
+  same pattern as the alerts page). Renders:
+  ```
+  Updated 3/12 · currently updating <hostname>
+  Halted on <hostname>: <reason> · job log →
+  ```
+
+Audit actions: `fleet.update_started`, `fleet.update_completed`,
+`fleet.update_halted`, `fleet.update_cancelled`.
+
+### 7.4 Alert engine integration
+
+P3-05's alert engine already supports kind-based registration. Add
+two new kinds:
+
+- `update_failed` — per-host, raised on individual update failure.
+  Auto-resolves when the host re-hellos with the matching version.
+- `fleet_update_halted` — global, raised on fleet halt. Auto-resolves
+  when a subsequent fleet update completes successfully.
+
+## 8. RBAC
+
+| Endpoint | Role |
+|----------|------|
+| `POST /api/hosts/{id}/update` | admin |
+| `POST /api/fleet/update` | admin |
+| `POST /api/fleet-updates/{id}/cancel` | admin |
+| `GET /api/fleet-updates/{id}` | admin (status polling) |
+| `GET /api/version` | public |
+
+Operator and viewer see the "out of date" chip but no update
+buttons. Mirrors the existing pattern: read affordances are
+visible to all roles, write affordances are gated.
+
+## 9. Testing
+
+### 9.1 Unit
+
+- `internal/agent/updater`: fake-`/agent/binary` HTTP server +
+  tmp "running binary" file, assert post-state — binary swapped,
+  `.old` present, no leftover `.new`. Linux path only (the Windows
+  helper is only compile-checked under its build tag).
+- `internal/server/http`: `POST /api/hosts/{id}/update` happy
+  path, refuses-when-offline, refuses-when-up-to-date,
+  refuses-when-update-in-progress, RBAC enforcement, audit row
+  written.
+- Hello handler: agent reconnects with matching version after
+  `update` job dispatch → marks job `succeeded`, drops the
+  in-memory pending entry. Mismatched version → no-op (timeout
+  catches it).
+- Timeout path: synthetic `update` job + 90s elapsed →
+  marks `failed`, raises alert.
+- Fleet worker: table-driven over the loop's state machine —
+  success-then-success, success-then-timeout-halts,
+  cancel-mid-flight, no-online-out-of-date-hosts-completes-immediately,
+  host-disappears-from-list-mid-loop-skips.
+
+### 9.2 Smoke validation (per CLAUDE.md restage block)
+
+1. Build server + agent at version A. Restage. Enrol a host;
+   confirm `agent_version=A`.
+2. Bump version to B (`make build VERSION=B`), rebuild server
+   only, restart server. Dashboard shows host as out-of-date with
+   `A → B` chip. Updates tile reads "1 host behind".
+3. Rebuild agent at B, restage `<DataDir>/agent-binaries/`. Click
+   **Update agent** on host detail. Agent fetches, swaps, exits;
+   systemd restarts it; hello-back at B → job `succeeded`, chip
+   gone, tile clears.
+4. Rollback path: leave `<DataDir>/agent-binaries/` at A, server
+   at B, click Update — agent fetches A, swaps to A, restarts at
+   A; hello says A != B; server marks job `failed` after 90s with
+   reason "agent reconnected at version A, expected B".
+5. Fleet update: spin up two smoke hosts both out-of-date, fire
+   **Start rolling update**, watch progress page tick host 1 →
+   host 2 → completed.
+6. Halt path: replace one of the `<DataDir>/agent-binaries/`
+   files with `/bin/false`. Run fleet update. First host gets
+   broken binary, fails to come back up, fleet update halts at
+   host 1 after 90s, alert raised, host 2 left as `pending`.
+
+Step 6 validates M2 end-to-end — the rolling halt is the actual
+safety guarantee, not a nice-to-have.
+
+## 10. Out of scope
+
+- sha256 digest verification (deferred — see decision 4).
+- `restic-manager-agent update` CLI subcommand (deferred —
+  decision 6).
+- Auto-update (deferred — decision 1).
+- Auto-rollback watchdog M3 (deferred — decision 3).
+- Migrating the agent off `User=root` (separate hardening track).
+- Cross-version protocol-compatibility checks beyond the existing
+  `protocol_version` handshake (P1-11). If the new agent's
+  `protocol_version` is incompatible with the server, the
+  existing handshake rejects it; the update job will then
+  correctly time out and be marked failed.
+
+## 11. Migration plan
+
+1. `internal/version` package + Makefile ldflags wiring.
+2. Migration 0021 (jobs.kind widening) + 0022 (fleet_updates
+   tables).
+3. `internal/agent/updater` package, Linux first.
+4. WS envelope wiring + `command.update` dispatcher.
+5. `POST /api/hosts/{id}/update` + hello-handler integration +
+   timeout goroutine.
+6. UI: chip + per-host update button + dashboard tile + filter.
+7. Fleet update worker + page.
+8. Windows updater path.
+9. Alert engine kinds.
+10. Smoke validation per §9.2.
+
+Each step is independently testable; commits should land at each
+boundary so a failed Windows path (8) doesn't block the rest of
+the work.