# P6-01 + P6-02: Agent self-update + fleet update

Status: design approved 2026-05-06.

Scope: P6-01 (agent self-update mechanism) and P6-02 (dashboard version reporting + fleet update UI). One spec, one branch: the two tasks are tightly coupled (P6-02 is the operator surface for the mechanism P6-01 ships).

## 1. Background

P5-03 pivoted release distribution to a single multi-arch server Docker image, with cross-compiled agent binaries baked under `/opt/restic-manager/dist/agent-binaries/` and served via `GET /agent/binary?os=…&arch=…`. The plumbing already does dual-path lookup: `/agent-binaries/` overrides the image-baked copy, so an operator can hot-patch a pre-release agent without rebuilding the image.

That makes the server the natural distribution point for agent upgrades. "Update agent" collapses to "re-fetch from your own server": no apt repo, no Chocolatey, no third-party signing infra, and version pinning is automatic because the server only ever serves the agent that matches its own release.

This spec wires up the update mechanism end-to-end and the operator surface that drives it.

## 2. Decisions

| # | Decision | Rationale |
|---|----------|-----------|
| 1 | Operator-driven only, no auto-update | Matches the rest of the app's job-dispatch model; avoids "bad release upgrades every host instantly"; auto-update can be added later as a setting flip if asked |
| 2 | Linux: just exit, let systemd restart. Windows: detached helper script. | Linux supports rename-while-open; Windows holds an exclusive lock on the running .exe |
| 3 | M1 (keep `agent.old` on disk) + M2 (rolling fleet update with halt-on-fail). Skip M3 (auto-rollback watchdog). | M1 is ~5 lines, M2 falls naturally out of P6-02's UI, M3 is a lot of plumbing for "shipped a binary that doesn't start" |
| 4 | Skip sha256 digest verification for v1 | TLS already covers the corruption-in-transit threat; image-tampering is image-build's problem, not the agent's |
| 5 | Exact string version match for "out of date" | With server-bundled binaries there's exactly one canonical version per server image; anything else is out of date by definition |
| 6 | WS envelope only, no `restic-manager-agent update` CLI subcommand | YAGNI; no concrete consumer; the underlying logic is reusable when one appears |

## 3. Wire protocol

### 3.1 Server → agent: `command.update`

```
{
  "type": "command.update",
  "id": "",
  "payload": {
    "job_id": ""
  }
}
```

No `os` / `arch` / `version` in the payload: the agent already knows its own build target and fetches from its configured server URL via the existing `/agent/binary` handler. Including a target version would also tempt the agent into version-comparison logic; keep that on the server side.

### 3.2 Job lifecycle (server-driven)

The agent has limited ability to report on its own restart, so the job state machine lives on the server:

- **queued → running** when the envelope is dispatched.
- **running → succeeded** when the agent re-hellos with `agent_version == server.Version` after dispatch and within the timeout. Audit `host.update_succeeded`.
- **running → failed (timeout)** if 90 seconds pass without a hello carrying the matching version. Audit `host.update_failed`. Raise alert kind `update_failed` (reuses P3-05 alert engine). This single transition covers both the "agent never came back at all" case and the "agent came back at the wrong version" case; see §6.2 for why we don't transition immediately on a mismatched hello.

Migration 0021 widens the `jobs.kind` CHECK constraint to include `update`. Same column-level pattern as 0012 (where 0012 added `restore` and `diff`).

## 4. Agent-side execution

Lives in `internal/agent/updater`, build-tag split:

- `updater_unix.go`: Linux + any future POSIX target.
- `updater_windows.go`: Windows-only, uses the helper-script pattern.
- `updater.go`: shared `Update(ctx, serverURL string) error` interface and the HTTP fetch/streaming code (no platform deps).

### 4.1 Linux flow

1. Receive `command.update` from the WS dispatcher.
2. Resolve own binary via `os.Executable()` and `filepath.Abs`. Refuse if the resolved path is `/proc/self/exe` or otherwise not a real file (defence in depth: it shouldn't happen under systemd, but bail loudly if it does).
3. `GET /agent/binary?os=linux&arch=`, stream to `.new` in the same directory as the running binary (same filesystem ⇒ atomic rename).
4. fsync the file, `os.Chmod(0755)`.
5. Copy current binary to `.old` (overwrite if it exists). This is M1, the one-revision rollback target.
6. `os.Rename(.new, )`.
7. Close the WS connection cleanly (sends close frame so the server transitions the connection to `disconnected` rather than waiting for the heartbeat-miss sweep).
8. `os.Exit(0)`. Systemd's `Restart=always` (already in the unit) brings up the new binary within seconds.

### 4.2 Windows flow

The .exe is exclusively locked by the OS while running, so steps 5–6 above can't happen in-process. Use a detached helper:

1. Steps 1–4 are the same: fetch into `.exe.new`, fsync.
2. Write `update.cmd` to a tmp path with the orchestration:

   ```
   timeout /t 3 /nobreak >nul
   copy /Y ".exe" ".exe.old"
   sc stop restic-manager-agent
   :wait
   sc query restic-manager-agent | find "STOPPED" >nul
   if errorlevel 1 (timeout /t 1 /nobreak >nul & goto wait)
   move /Y ".exe.new" ".exe"
   sc start restic-manager-agent
   del "%~f0"
   ```

3. `CreateProcess` it detached (`DETACHED_PROCESS | CREATE_NO_WINDOW`, no parent handles).
4. Close WS, `os.Exit(0)`. SCM sees a clean stop and waits; it does *not* try to restart, because `sc stop` is the helper's job, not a crash.

(`Restart=always` semantics differ between systemd and SCM. SCM treats clean-exit-after-stop as intentional and does not auto-restart; only crashes restart. That's why the helper script needs the explicit `sc start` at the end.)

### 4.3 Service-user assumption

Both Linux (`User=root` per the existing unit) and Windows (`LocalSystem` by default) can write the binary path directly. If the agent ever moves to a non-root service user, the updater breaks; it would need either a setuid helper or an out-of-process update service. Add a `// NOTE:` comment in the updater package flagging this; not a v1 blocker.

## 5. Server build version

New package `internal/version` exposing two string variables (variables rather than constants, because `-ldflags -X` can only set string variables):

```
package version

var (
	Version = "dev"
	Commit  = ""
)
```

Wired via `-ldflags` in the Makefile:

```
GO_LDFLAGS = -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=$(VERSION) \
             -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Commit=$(COMMIT)
VERSION := $(shell git describe --tags --always --dirty)
COMMIT  := $(shell git rev-parse --short HEAD)
```

Both `cmd/server` and `cmd/agent` link the same package, so an agent's `agent_version` (sent in the hello payload, already wired since P1-11) is comparable byte-for-byte to the server's `version.Version`.

`make build` already does what's needed for source builds. The Phase 2 work in this spec is the Docker release path: confirm during plan execution that `.gitea/workflows/release.yml` passes `VERSION` and `COMMIT` into the Docker `--build-arg` chain so the in-image binaries embed the same string the image is tagged with. If not, add the wiring.

Dirty/dev builds (`v1.2.3-dirty`) won't match clean server builds, so every dev environment will show every host as out-of-date. This is acceptable: the chip is a no-op in dev, and real ops always run tagged builds.

A new `GET /api/version` endpoint returns `{"version": "...", "commit": "..."}`. Used by the dashboard header tile and by `/settings/fleet-update`. Public-band: it exposes no secrets and lets the install scripts surface it too.
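
To make the `GET /api/version` response shape concrete, here is a minimal sketch of what such a handler might look like. The names `handleVersion`, `versionInfo` and `fetchVersion` are hypothetical, and the two package-level variables merely stand in for `internal/version.Version` and `internal/version.Commit`; the real routing and middleware are not shown in this spec and are not assumed here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Version and Commit stand in for internal/version.Version and
// internal/version.Commit; a release build would set them via -ldflags -X.
var (
	Version = "dev"
	Commit  = ""
)

// versionInfo mirrors the JSON shape given in the spec:
// {"version": "...", "commit": "..."}.
type versionInfo struct {
	Version string `json:"version"`
	Commit  string `json:"commit"`
}

// handleVersion is a hypothetical handler name. It serves the build
// version as JSON and rejects anything other than GET.
func handleVersion(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodGet {
		w.Header().Set("Allow", http.MethodGet)
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(versionInfo{Version: Version, Commit: Commit})
}

// fetchVersion calls the handler in-process (no network listener) and
// decodes the response body, returning it with the HTTP status code.
func fetchVersion() (versionInfo, int) {
	rec := httptest.NewRecorder()
	handleVersion(rec, httptest.NewRequest(http.MethodGet, "/api/version", nil))
	var info versionInfo
	_ = json.NewDecoder(rec.Body).Decode(&info)
	return info, rec.Code
}

func main() {
	info, code := fetchVersion()
	fmt.Println(code, info.Version, info.Commit)
}
```

Because the values are plain package variables, an untagged `go build` reports `dev` with an empty commit, which is the case that shows every host as out of date in development.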
## 6. P6-01 server endpoints

### 6.1 `POST /api/hosts/{id}/update`

Admin-only. Refuses (with structured error code) when:

- Host is offline (`host_offline`).
- Host's `agent_version == server.Version` (`already_up_to_date`).
- An update job for this host is already running (`update_in_progress`).

Happy path: creates `jobs` row with `kind=update`, dispatches `command.update` envelope, audit-logs `host.update_dispatched`, returns `{"job_id": "..."}`. UI form-post variant on `/hosts/{id}/update` returns `HX-Redirect` to the live job log.

### 6.2 Hello handler integration

The existing `onAgentHello` (P1-11) already upserts `agent_version`. Extend it: after the upsert, look for any `update` job for this host with `status='running'`. If one exists:

- `agent_version == server.Version` → mark job `succeeded`, audit `host.update_succeeded`.
- `agent_version != server.Version` → leave the job running so the timeout path catches it as a rollback failure (don't fail immediately; this gives the agent one chance to come back, restart, and hello again with the right version).

Adds a small in-memory map of pending updates so the timeout goroutine knows when to give up. Persisted state lives in the `jobs` table; the in-memory map is just for the timer.

## 7. P6-02 fleet update

### 7.1 Schema

Migration 0022, column-level adds only:

```
CREATE TABLE fleet_updates (
  id                  TEXT PRIMARY KEY,
  started_at          TEXT NOT NULL,
  started_by_user_id  TEXT NOT NULL REFERENCES users(id),
  target_version      TEXT NOT NULL,
  status              TEXT NOT NULL CHECK (status IN ('running','completed','halted','cancelled')),
  current_host_id     TEXT REFERENCES hosts(id),
  halted_reason       TEXT,
  completed_at        TEXT
);

CREATE TABLE fleet_update_hosts (
  fleet_update_id  TEXT NOT NULL REFERENCES fleet_updates(id) ON DELETE CASCADE,
  host_id          TEXT NOT NULL REFERENCES hosts(id) ON DELETE CASCADE,
  status           TEXT NOT NULL CHECK (status IN ('pending','running','succeeded','failed','skipped')),
  job_id           TEXT REFERENCES jobs(id),
  failed_reason    TEXT,
  PRIMARY KEY (fleet_update_id, host_id)
);
```

### 7.2 Worker loop

A single in-process goroutine: at most one fleet update may run at a time (enforced via a `sync.Mutex` + a precondition check on `POST /api/fleet/update`).

```
for each pending fleet_update_hosts row in dispatch order:
    set fleet_updates.current_host_id = row.host_id
    set fleet_update_hosts.status = 'running'

    if host.agent_version == server.Version:
        # Already updated since we built the list; skip.
        set status = 'skipped'; continue

    if !host.online:
        # Offline since we built the list; halt.
        halt(reason="host went offline")
        return

    dispatch_update_for_host(host)   # reuses 6.1 logic
    wait_up_to_90s_for_hello_with_matching_version()

    if matched:
        set status = 'succeeded'; continue
    else:
        set status = 'failed', failed_reason = "..."
        halt(reason="update failed on host X")
        return

set fleet_updates.status = 'completed', completed_at = now
```

Halt: set `fleet_updates.status = 'halted'`, raise an alert kind `fleet_update_halted`, audit `fleet.update_halted` with the host id and reason. Subsequent hosts stay `pending` so the operator can see what was queued and decide whether to resume (resume = start a new fleet update with the still-out-of-date subset).
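
The worker loop and halt rule above can be sketched in Go as a pure function over in-memory data. This is an illustration under stated assumptions, not the project's worker: `host`, `outcome`, `runFleetUpdate` and `demo` are hypothetical names, the `update` callback stands in for "dispatch, then wait up to 90 seconds for a matching hello", persistence, alerts and audit rows are left out, and the transient `running` state is skipped. One deliberate simplification: a host found offline is left `pending` here, whereas the pseudocode has already marked it `running` at that point.

```go
package main

import "fmt"

// Per-host statuses taken from the fleet_update_hosts CHECK constraint.
// The transient 'running' state is omitted in this sketch.
const (
	statusPending   = "pending"
	statusSucceeded = "succeeded"
	statusFailed    = "failed"
	statusSkipped   = "skipped"
)

// host is a hypothetical in-memory view of a hosts row.
type host struct {
	ID           string
	AgentVersion string
	Online       bool
}

// outcome is a hypothetical summary of one fleet update run.
type outcome struct {
	FleetStatus  string            // "completed" or "halted"
	HaltedReason string            // empty unless halted
	Hosts        map[string]string // host ID -> per-host status
}

// runFleetUpdate walks hosts in dispatch order. update is assumed to
// dispatch command.update and block until the agent re-hellos at the
// target version (true) or the wait window lapses (false).
func runFleetUpdate(hosts []host, target string, update func(host) bool) outcome {
	out := outcome{FleetStatus: "completed", Hosts: make(map[string]string, len(hosts))}
	for _, h := range hosts {
		out.Hosts[h.ID] = statusPending
	}
	for _, h := range hosts {
		if h.AgentVersion == target {
			// Already at the target version; nothing to do.
			out.Hosts[h.ID] = statusSkipped
			continue
		}
		if !h.Online {
			// Halt: this host and every later one stay pending.
			out.FleetStatus = "halted"
			out.HaltedReason = "host went offline"
			return out
		}
		if update(h) {
			out.Hosts[h.ID] = statusSucceeded
			continue
		}
		// First failure halts the run; later hosts stay pending.
		out.Hosts[h.ID] = statusFailed
		out.FleetStatus = "halted"
		out.HaltedReason = "update failed on host " + h.ID
		return out
	}
	return out
}

// demo runs four hosts where "b" is already current and "c" fails.
func demo() outcome {
	hosts := []host{
		{ID: "a", AgentVersion: "v1.2.2", Online: true},
		{ID: "b", AgentVersion: "v1.2.3", Online: true},
		{ID: "c", AgentVersion: "v1.2.2", Online: true},
		{ID: "d", AgentVersion: "v1.2.2", Online: true},
	}
	return runFleetUpdate(hosts, "v1.2.3", func(h host) bool { return h.ID != "c" })
}

func main() {
	out := demo()
	fmt.Println(out.FleetStatus, out.HaltedReason)
	for _, id := range []string{"a", "b", "c", "d"} {
		fmt.Println(id, out.Hosts[id])
	}
}
```

Keeping the loop free of I/O like this is one way to make the table-driven state-machine tests in §9.1 cheap to write: each case is just a host list plus a stub `update` function.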

Cancel: admin-only `POST /api/fleet-updates/{id}/cancel`. Sets `status='cancelled'`. The currently-dispatched host's update job keeps running (the agent is already mid-restart); cancel only prevents the *next* host from being picked. Audit `fleet.update_cancelled`.

### 7.3 UI surfaces

**Per-host chip (host_row partial + host detail chrome):**

`out of date · v1.2.2 → v1.2.3`, amber-accented, mirroring the `.tag` token shape. Only rendered when:

```
host.agent_version != "" && host.agent_version != server.Version
```

Empty `agent_version` (host enrolled but never connected) renders nothing rather than "out of date", since we don't know what version they have.

**Dashboard summary tile:**

The hero strip already has tiles. Add an "Updates" tile: `N hosts behind` linking to `/?updates=behind` (extends NS-04's filter machinery by adding an `updates` query param alongside `status`/`repo_status`/`tag`). Hidden when N == 0.

**Per-host Update button on `/hosts/{id}`:**

Right-rail, admin-only. Disabled with hover tooltip when host offline / already up to date / update in progress. POSTs to `/hosts/{id}/update`, `HX-Redirect` to the live job log.

**Fleet update page `/settings/fleet-update`:**

Admin-only. Two states:

- **Idle**: lists out-of-date online hosts (table: hostname, current version, target version, last seen). Big "Start rolling update" button behind a typed-confirm dialog (operator types the host count, e.g. `12`, to enable the button; same shape as the host-delete confirm).
- **Running/halted/completed**: shows the currently-active fleet_update row + per-host progress list. Polls every 3s (htmx trigger conditional on `document.visibilityState === 'visible'`, same pattern as the alerts page). Renders:

  ```
  Updated 3/12 · currently updating
  Halted on : · job log →
  ```

Audit actions: `fleet.update_started`, `fleet.update_completed`, `fleet.update_halted`, `fleet.update_cancelled`.

### 7.4 Alert engine integration

P3-05's alert engine already supports kind-based registration.
Add two new kinds:

- `update_failed`: per-host, raised on individual update failure. Auto-resolves when the host re-hellos with the matching version.
- `fleet_update_halted`: global, raised on fleet halt. Auto-resolves when a subsequent fleet update completes successfully.

## 8. RBAC

| Endpoint | Role |
|----------|------|
| `POST /api/hosts/{id}/update` | admin |
| `POST /api/fleet/update` | admin |
| `POST /api/fleet-updates/{id}/cancel` | admin |
| `GET /api/fleet-updates/{id}` | admin (status polling) |
| `GET /api/version` | public |

Operator and viewer see the "out of date" chip but no update buttons. Mirrors the existing pattern: read affordances are visible to all roles, write affordances are gated.

## 9. Testing

### 9.1 Unit

- `internal/agent/updater`: fake-`/agent/binary` HTTP server + tmp "running binary" file, assert post-state: binary swapped, `.old` present, no leftover `.new`. Linux path only (Windows helper covered by build-tag compile-only).
- `internal/server/http`: `POST /api/hosts/{id}/update` happy path, refuses-when-offline, refuses-when-up-to-date, refuses-when-update-in-progress, RBAC enforcement, audit row written.
- Hello handler: agent reconnects with matching version after `update` job dispatch → marks job `succeeded`, drops the in-memory pending entry. Mismatched version → no-op (timeout catches it).
- Timeout path: synthetic `update` job + 90s elapsed → marks `failed`, raises alert.
- Fleet worker: table-driven over the loop's state machine: success-then-success, success-then-timeout-halts, cancel-mid-flight, no-online-out-of-date-hosts-completes-immediately, host-disappears-from-list-mid-loop-skips.

### 9.2 Smoke validation (per CLAUDE.md restage block)

1. Build server + agent at version A. Restage. Enrol a host; confirm `agent_version=A`.
2. Bump version to B (`make build VERSION=B`), rebuild server only, restart server. Dashboard shows host as out-of-date with `A → B` chip. Updates tile reads "1 host behind".
3. Rebuild agent at B, restage `/agent-binaries/`. Click **Update agent** on host detail. Agent fetches, swaps, exits; systemd restarts it; hello-back at B → job `succeeded`, chip gone, tile clears.
4. Rollback path: leave `/agent-binaries/` at A, server at B, click Update. The agent fetches A, swaps to A, restarts at A; hello says A != B; server marks job `failed` after 90s with reason "agent reconnected at version A, expected B".
5. Fleet update: spin up two smoke hosts both out-of-date, fire **Start rolling update**, watch progress page tick host 1 → host 2 → completed.
6. Halt path: replace one of the `/agent-binaries/` files with `/bin/false`. Run fleet update. First host gets broken binary, fails to come back up, fleet update halts at host 1 after 90s, alert raised, host 2 left as `pending`.

Step 6 validates M2 end-to-end: the rolling halt is the actual safety guarantee, not a nice-to-have.

## 10. Out of scope

- sha256 digest verification (deferred; see decision 4).
- `restic-manager-agent update` CLI subcommand (deferred; decision 6).
- Auto-update (deferred; decision 1).
- Auto-rollback watchdog M3 (deferred; decision 3).
- Migrating the agent off `User=root` (separate hardening track).
- Cross-version protocol-compatibility checks beyond the existing `protocol_version` handshake (P1-11). If the new agent's `protocol_version` is incompatible with the server, the existing handshake rejects it; the update job will then correctly time out and be marked failed.

## 11. Migration plan

1. `internal/version` package + Makefile ldflags wiring.
2. Migration 0021 (jobs.kind widening) + 0022 (fleet_updates tables).
3. `internal/agent/updater` package, Linux first.
4. WS envelope wiring + `command.update` dispatcher.
5. `POST /api/hosts/{id}/update` + hello-handler integration + timeout goroutine.
6. UI: chip + per-host update button + dashboard tile + filter.
7. Fleet update worker + page.
8. Windows updater path.
9. Alert engine kinds.
10. Smoke validation per §9.2.

Each step is independently testable; commits should land at each boundary so a failed Windows path (8) doesn't block the rest of the work.
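
As an illustration of how step 3 might be exercised on its own, the sketch below follows steps 3 to 6 of the Linux flow in §4.1 and then checks the post-state that §9.1 asks for. It is a sketch under assumptions rather than the updater itself: `swapBinary`, `writeSynced` and `demo` are hypothetical names, and an `io.Reader` stands in for the body of the `/agent/binary` response so the example runs without a server.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// swapBinary is a hypothetical helper: it stages src as target+".new",
// keeps the current file as target+".old" (M1), then renames the staged
// file over target. src stands in for the HTTP response body.
func swapBinary(src io.Reader, target string) error {
	newPath := target + ".new"
	if err := writeSynced(newPath, src); err != nil {
		os.Remove(newPath) // best-effort cleanup of a partial download
		return fmt.Errorf("stage new binary: %w", err)
	}
	cur, err := os.Open(target)
	if err != nil {
		os.Remove(newPath)
		return err
	}
	defer cur.Close()
	// Overwrites any rollback copy left by an earlier update.
	if err := writeSynced(target+".old", cur); err != nil {
		os.Remove(newPath)
		return fmt.Errorf("save rollback copy: %w", err)
	}
	// Same directory, so same filesystem, so the rename is atomic.
	return os.Rename(newPath, target)
}

// writeSynced streams r to path, fsyncs, and marks the file executable.
func writeSynced(path string, r io.Reader) error {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
	if err != nil {
		return err
	}
	if _, err := io.Copy(f, r); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	// Explicit chmod because the umask may have masked the create mode.
	return os.Chmod(path, 0o755)
}

// demo swaps a fake "agent-A" binary for "agent-B" in a temp directory
// and returns: new contents, .old contents, whether .new is left over.
func demo() (string, string, bool, error) {
	dir, err := os.MkdirTemp("", "updater-sketch")
	if err != nil {
		return "", "", false, err
	}
	defer os.RemoveAll(dir)
	target := filepath.Join(dir, "agent")
	if err := os.WriteFile(target, []byte("agent-A"), 0o755); err != nil {
		return "", "", false, err
	}
	if err := swapBinary(strings.NewReader("agent-B"), target); err != nil {
		return "", "", false, err
	}
	cur, err := os.ReadFile(target)
	if err != nil {
		return "", "", false, err
	}
	old, err := os.ReadFile(target + ".old")
	if err != nil {
		return "", "", false, err
	}
	_, statErr := os.Stat(target + ".new")
	return string(cur), string(old), statErr == nil, nil
}

func main() {
	cur, old, leftover, err := demo()
	fmt.Println(cur, old, leftover, err)
}
```

Taking a reader instead of a URL is a design choice for the sketch only; it keeps the swap logic testable in isolation, with the fake HTTP server from §9.1 layered on top in the real unit test.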