P6-01 + P6-02: agent self-update + fleet update #19

Merged
steve merged 12 commits from p6-agent-self-update into main 2026-05-07 17:49:25 +01:00
# P6-01 + P6-02 — Agent self-update + fleet update
Status: design approved 2026-05-06.
Scope: P6-01 (agent self-update mechanism) and P6-02 (dashboard
version reporting + fleet update UI). One spec, one branch — the
two tasks are tightly coupled (P6-02 is the operator surface for
the mechanism P6-01 ships).
## 1. Background
P5-03 pivoted release distribution to a single multi-arch server
Docker image, with cross-compiled agent binaries baked under
`/opt/restic-manager/dist/agent-binaries/` and served via
`GET /agent/binary?os=…&arch=…`. The plumbing already does
dual-path lookup: `<DataDir>/agent-binaries/<name>` overrides the
image-baked copy, so an operator can hot-patch a pre-release agent
without rebuilding the image.
That makes the server the natural distribution point for agent
upgrades. "Update agent" collapses to "re-fetch from your own
server" — no apt repo, no Chocolatey, no third-party signing infra,
and version pinning is automatic because the server only ever
serves the agent that matches its own release.
This spec wires up the update mechanism end-to-end and the
operator surface that drives it.
## 2. Decisions
| # | Decision | Rationale |
|---|----------|-----------|
| 1 | Operator-driven only — no auto-update | Matches the rest of the app's job-dispatch model; avoids "bad release upgrades every host instantly"; auto-update can be added later as a setting flip if asked |
| 2 | Linux: just exit, let systemd restart. Windows: detached helper script. | Linux supports rename-while-open; Windows holds an exclusive lock on the running .exe |
| 3 | M1 (keep `agent.old` on disk) + M2 (rolling fleet update with halt-on-fail). Skip M3 (auto-rollback watchdog). | M1 is ~5 lines, M2 falls naturally out of P6-02's UI, M3 is a lot of plumbing for "shipped a binary that doesn't start" |
| 4 | Skip sha256 digest verification for v1 | TLS already covers the corruption-in-transit threat; image-tampering is image-build's problem, not the agent's |
| 5 | Exact string version match for "out of date" | With server-bundled binaries there's exactly one canonical version per server image — anything else is out of date by definition |
| 6 | WS envelope only, no `restic-manager-agent update` CLI subcommand | YAGNI; no concrete consumer; the underlying logic is reusable when one appears |
## 3. Wire protocol
### 3.1 Server → agent: `command.update`
```
{
"type": "command.update",
"id": "<envelope id>",
"payload": {
"job_id": "<ulid>"
}
}
```
No `os` / `arch` / `version` in the payload — the agent already
knows its own build target and fetches from its configured server
URL via the existing `/agent/binary` handler. Including a target
version would also tempt the agent into version-comparison logic;
keep that on the server side.
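For illustration, the envelope could map onto Go types roughly as follows. This is a minimal sketch: the type and function names (`Envelope`, `UpdatePayload`, `NewUpdateCommand`, `DecodeUpdate`) are assumptions made for the example, and only the JSON keys come from the spec above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Envelope is a hypothetical Go shape for a WS message. Only the JSON
// keys ("type", "id", "payload") are taken from the spec.
type Envelope struct {
	Type    string          `json:"type"`
	ID      string          `json:"id"`
	Payload json.RawMessage `json:"payload"`
}

// UpdatePayload carries only the job id: no os, arch or version.
type UpdatePayload struct {
	JobID string `json:"job_id"`
}

// NewUpdateCommand builds a command.update envelope for the given job.
func NewUpdateCommand(envelopeID, jobID string) (Envelope, error) {
	raw, err := json.Marshal(UpdatePayload{JobID: jobID})
	if err != nil {
		return Envelope{}, err
	}
	return Envelope{Type: "command.update", ID: envelopeID, Payload: raw}, nil
}

// DecodeUpdate extracts the job id from a raw command.update message.
func DecodeUpdate(data []byte) (string, error) {
	var env Envelope
	if err := json.Unmarshal(data, &env); err != nil {
		return "", err
	}
	if env.Type != "command.update" {
		return "", fmt.Errorf("unexpected envelope type %q", env.Type)
	}
	var p UpdatePayload
	if err := json.Unmarshal(env.Payload, &p); err != nil {
		return "", err
	}
	if p.JobID == "" {
		return "", fmt.Errorf("command.update without job_id")
	}
	return p.JobID, nil
}

func main() {
	env, err := NewUpdateCommand("env-1", "example-job-id")
	if err != nil {
		panic(err)
	}
	wire, err := json.Marshal(env)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(wire))
	fmt.Println(DecodeUpdate(wire))
}
```

Keeping the payload as `json.RawMessage` lets a dispatcher switch on `type` before committing to a payload struct.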
### 3.2 Job lifecycle (server-driven)
The agent has limited ability to report on its own restart, so the
job state machine lives on the server:
- **queued → running** when the envelope is dispatched.
- **running → succeeded** when the agent re-hellos with
`agent_version == server.Version` after dispatch and within
the timeout. Audit `host.update_succeeded`.
- **running → failed (timeout)** if 90 seconds pass without a
hello carrying the matching version. Audit `host.update_failed`.
Raise alert kind `update_failed` (reuses P3-05 alert engine).
This single transition covers both the "agent never came back
at all" case and the "agent came back at the wrong version"
case — see §6.2 for why we don't transition immediately on a
mismatched hello.
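The transitions above can be sketched as pure functions, which also makes them easy to table-test. The names (`JobStatus`, `OnDispatch`, `OnHello`, `OnTimer`) are hypothetical, and treating a hello at exactly 90 seconds as too late is an assumption of this sketch rather than something the spec states.

```go
package main

import (
	"fmt"
	"time"
)

// JobStatus mirrors the job states named in the spec.
type JobStatus string

const (
	Queued    JobStatus = "queued"
	Running   JobStatus = "running"
	Succeeded JobStatus = "succeeded"
	Failed    JobStatus = "failed"
)

// UpdateTimeout is the 90 second window from the spec.
const UpdateTimeout = 90 * time.Second

// OnDispatch moves a queued job to running when the envelope is sent.
func OnDispatch(s JobStatus) JobStatus {
	if s == Queued {
		return Running
	}
	return s
}

// OnHello handles a hello received sinceDispatch after the envelope
// went out. A mismatched version deliberately leaves the job running.
func OnHello(s JobStatus, sinceDispatch time.Duration, agentVersion, serverVersion string) JobStatus {
	if s == Running && sinceDispatch < UpdateTimeout && agentVersion == serverVersion {
		return Succeeded
	}
	return s
}

// OnTimer fails a job that is still running once the window has passed.
func OnTimer(s JobStatus, sinceDispatch time.Duration) JobStatus {
	if s == Running && sinceDispatch >= UpdateTimeout {
		return Failed
	}
	return s
}

func main() {
	s := OnDispatch(Queued)
	s = OnHello(s, 5*time.Second, "v1.2.2", "v1.2.3") // wrong version: no transition
	fmt.Println(s)                                    // running
	s = OnTimer(s, UpdateTimeout)
	fmt.Println(s) // failed
}
```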
Migration 0021 widens the `jobs.kind` CHECK constraint to include
`update`. Same column-level pattern as 0012 (where 0012 added
`restore` and `diff`).
## 4. Agent-side execution
Lives in `internal/agent/updater`, build-tag split:
- `updater_unix.go` — Linux + any future POSIX target.
- `updater_windows.go` — Windows-only, uses the helper-script
pattern.
- `updater.go` — shared `Update(ctx, serverURL string) error`
interface and the HTTP fetch/streaming code (no platform deps).
### 4.1 Linux flow
1. Receive `command.update` from the WS dispatcher.
2. Resolve own binary via `os.Executable()` and `filepath.Abs`.
Refuse if the resolved path is `/proc/self/exe` or otherwise
not a real file (defence in depth — shouldn't happen under
systemd, but bail loudly if it does).
3. `GET <server>/agent/binary?os=linux&arch=<runtime.GOARCH>`,
stream to `<binary>.new` in the same directory as the running
binary (same filesystem ⇒ atomic rename).
4. fsync the file, then `os.Chmod` it to `0755`.
5. Copy current binary to `<binary>.old` (overwrite if it
exists). M1 — one-revision rollback target.
6. `os.Rename(<binary>.new, <binary>)`.
7. Close the WS connection cleanly (sends close frame so the
server transitions the connection to `disconnected` rather
than waiting for the heartbeat-miss sweep).
8. `os.Exit(0)`. Systemd's `Restart=always` (already in the unit)
brings up the new binary within seconds.
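Steps 3 to 6 can be sketched as below. This is not the shipped updater: `stageNew`, `swapIn` and `demo` are hypothetical names, and the HTTP fetch is left out so that `src` stands in for the `/agent/binary` response body.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// stageNew streams src into <binPath>.new, fsyncs it and marks it
// executable (steps 3 and 4).
func stageNew(src io.Reader, binPath string) (string, error) {
	newPath := binPath + ".new"
	f, err := os.OpenFile(newPath, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
	if err != nil {
		return "", err
	}
	_, err = io.Copy(f, src)
	if err == nil {
		err = f.Sync() // flush to disk before the rename makes it live
	}
	if cerr := f.Close(); err == nil {
		err = cerr
	}
	if err == nil {
		err = os.Chmod(newPath, 0o755) // OpenFile's mode is subject to umask
	}
	if err != nil {
		os.Remove(newPath) // do not leave a partial download behind
		return "", err
	}
	return newPath, nil
}

// swapIn copies the current binary to <binPath>.old (step 5), then
// renames the staged file over it (step 6).
func swapIn(binPath, newPath string) error {
	cur, err := os.Open(binPath)
	if err != nil {
		return err
	}
	defer cur.Close()
	old, err := os.OpenFile(binPath+".old", os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
	if err != nil {
		return err
	}
	if _, err := io.Copy(old, cur); err != nil {
		old.Close()
		return err
	}
	if err := old.Close(); err != nil {
		return err
	}
	return os.Rename(newPath, binPath)
}

// demo runs both steps in a throwaway directory and reports the
// resulting contents of the binary and its .old copy.
func demo() (current, previous string, err error) {
	dir, err := os.MkdirTemp("", "updater-sketch")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)
	bin := filepath.Join(dir, "restic-manager-agent")
	if err := os.WriteFile(bin, []byte("version A"), 0o755); err != nil {
		return "", "", err
	}
	newPath, err := stageNew(strings.NewReader("version B"), bin)
	if err != nil {
		return "", "", err
	}
	if err := swapIn(bin, newPath); err != nil {
		return "", "", err
	}
	cur, err := os.ReadFile(bin)
	if err != nil {
		return "", "", err
	}
	prev, err := os.ReadFile(bin + ".old")
	return string(cur), string(prev), err
}

func main() {
	fmt.Println(demo())
}
```

Staging in the same directory as the running binary is what keeps the final `os.Rename` on a single filesystem.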
### 4.2 Windows flow
The .exe is exclusively locked by the OS while running, so steps
5 and 6 above can't happen in-process. Use a detached helper:
1. Steps 1-4 are the same: fetch into `<binary>.exe.new`, fsync.
2. Write `update.cmd` to a tmp path with the orchestration:
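```
rem Give the agent a moment to close the WS and exit.
timeout /t 3 /nobreak >nul
rem M1: keep the current binary as the one-revision rollback target.
copy /Y "<binary>.exe" "<binary>.exe.old"
sc stop restic-manager-agent
rem Poll until SCM reports the service as STOPPED.
:wait
sc query restic-manager-agent | find "STOPPED" >nul
if errorlevel 1 (timeout /t 1 /nobreak >nul & goto wait)
rem Swap in the new binary, restart the service, then self-delete.
move /Y "<binary>.exe.new" "<binary>.exe"
sc start restic-manager-agent
del "%~f0"
```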
3. `CreateProcess` it detached
(`DETACHED_PROCESS | CREATE_NO_WINDOW`, no parent handles).
4. Close WS, `os.Exit(0)`. SCM treats the clean exit as
intentional and does *not* try to restart the service; stopping
and starting it is the helper's job. (Restart semantics differ
here: systemd's `Restart=always` restarts after any exit, whereas
SCM treats clean-exit-after-stop as intentional and only restarts
after a crash. That's why the helper script needs the explicit
`sc start` at the end.)
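One way the agent might render `update.cmd` before launching it is sketched below. The function names are hypothetical, the detached `CreateProcess` launch is omitted because it is Windows-only, and the sketch assumes neither the path nor the service name contains `%` or `"`, which would need escaping in a batch file.

```go
package main

import (
	"fmt"
	"strings"
)

// helperScriptLines returns the lines of update.cmd for the given agent
// binary path (including .exe) and Windows service name.
func helperScriptLines(binPath, service string) []string {
	return []string{
		"timeout /t 3 /nobreak >nul",
		fmt.Sprintf(`copy /Y "%s" "%s.old"`, binPath, binPath),
		"sc stop " + service,
		":wait",
		fmt.Sprintf(`sc query %s | find "STOPPED" >nul`, service),
		"if errorlevel 1 (timeout /t 1 /nobreak >nul & goto wait)",
		fmt.Sprintf(`move /Y "%s.new" "%s"`, binPath, binPath),
		"sc start " + service,
		`del "%~f0"`,
	}
}

// helperScript joins the lines with CRLF, the conventional line ending
// for .cmd files.
func helperScript(binPath, service string) string {
	return strings.Join(helperScriptLines(binPath, service), "\r\n") + "\r\n"
}

func main() {
	// The install path here is made up for the example.
	fmt.Print(helperScript(`C:\restic-manager\agent.exe`, "restic-manager-agent"))
}
```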
### 4.3 Service-user assumption
Both Linux (`User=root` per the existing unit) and Windows
(`LocalSystem` by default) can write the binary path directly. If
the agent ever moves to a non-root service user, the updater
breaks — would need either a setuid helper or an out-of-process
update service. Add a `// NOTE:` comment in the updater package
flagging this; not a v1 blocker.
## 5. Server build version
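New package `internal/version` exposing two string variables (they
must be `var`, not `const`, because the linker's `-X` flag can only
set string variables):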
```
package version
var (
Version = "dev"
Commit = ""
)
```
Wired via `-ldflags` in the Makefile:
```
GO_LDFLAGS = -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=$(VERSION) \
-X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Commit=$(COMMIT)
VERSION := $(shell git describe --tags --always --dirty)
COMMIT := $(shell git rev-parse --short HEAD)
```
Both `cmd/server` and `cmd/agent` link the same package, so an
agent's `agent_version` (sent in the hello payload, already wired
since P1-11) is comparable byte-for-byte to the server's
`version.Version`.
`make build` already does what's needed for source builds. The
Phase 2 work in this spec is the Docker release path — confirm
during plan execution that `.gitea/workflows/release.yml` passes
`VERSION` and `COMMIT` into the Docker `--build-arg` chain so the
in-image binaries embed the same string the image is tagged with.
If not, add the wiring.
Dirty/dev builds (`v1.2.3-dirty`) won't match clean server builds,
so a dev environment that mixes the two will show every host as
out-of-date. This is acceptable: the chip carries no signal in dev,
and real ops always run tagged builds.
A new `GET /api/version` endpoint returns
`{"version": "...", "commit": "..."}`. Used by the dashboard
header tile and by `/settings/fleet-update`. Public-band — exposes
no secrets, lets the install scripts surface it too.
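A minimal sketch of such a handler, assuming plain `net/http`. The package-level `Version` and `Commit` here are stand-ins for `internal/version`, and `versionHandler`, `versionResponse` and `callVersion` are hypothetical names; the real routing and middleware are not shown.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Stand-ins for internal/version.Version and .Commit, which the real
// build would set through -ldflags.
var (
	Version = "dev"
	Commit  = ""
)

// versionResponse fixes the field order of the JSON body.
type versionResponse struct {
	Version string `json:"version"`
	Commit  string `json:"commit"`
}

// versionHandler serves GET /api/version and rejects other methods.
func versionHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodGet {
		w.Header().Set("Allow", http.MethodGet)
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(versionResponse{Version: Version, Commit: Commit})
}

// callVersion exercises the handler in-process, without opening a socket.
func callVersion(method string) (int, string) {
	rec := httptest.NewRecorder()
	versionHandler(rec, httptest.NewRequest(method, "/api/version", nil))
	return rec.Code, rec.Body.String()
}

func main() {
	status, body := callVersion(http.MethodGet)
	fmt.Println(status, body)
}
```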
## 6. P6-01 server endpoints
### 6.1 `POST /api/hosts/{id}/update`
Admin-only. Refuses (with structured error code) when:
- Host is offline (`host_offline`).
- Host's `agent_version == server.Version` (`already_up_to_date`).
- An update job for this host is already running (`update_in_progress`).
Happy path: creates `jobs` row with `kind=update`, dispatches
`command.update` envelope, audit-logs `host.update_dispatched`,
returns `{"job_id": "..."}`.
UI form-post variant on `/hosts/{id}/update` returns
`HX-Redirect` to the live job log.
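The refusal checks could be factored into one small function, sketched here. `HostState` and `refusalCode` are hypothetical names, and the precedence between the three checks (offline first, then up to date, then in progress) is an assumption that simply follows the order of the list above.

```go
package main

import "fmt"

// HostState holds the facts the handler needs; field names are illustrative.
type HostState struct {
	Online          bool
	AgentVersion    string
	UpdateJobActive bool
}

// refusalCode returns the structured error code for a host that must
// not be updated, or "" when the update may be dispatched.
func refusalCode(h HostState, serverVersion string) string {
	switch {
	case !h.Online:
		return "host_offline"
	case h.AgentVersion == serverVersion:
		return "already_up_to_date"
	case h.UpdateJobActive:
		return "update_in_progress"
	}
	return ""
}

func main() {
	hosts := []HostState{
		{Online: false, AgentVersion: "v1.2.2"},
		{Online: true, AgentVersion: "v1.2.3"},
		{Online: true, AgentVersion: "v1.2.2", UpdateJobActive: true},
		{Online: true, AgentVersion: "v1.2.2"},
	}
	for _, h := range hosts {
		fmt.Printf("%+v -> %q\n", h, refusalCode(h, "v1.2.3"))
	}
}
```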
### 6.2 Hello handler integration
The existing `onAgentHello` (P1-11) already upserts
`agent_version`. Extend it: after the upsert, look for any
`update` job for this host with `status='running'`. If one
exists:
- `agent_version == server.Version` → mark job `succeeded`,
audit `host.update_succeeded`.
- `agent_version != server.Version` → leave the job running so
the timeout path catches it as a rollback failure (don't fail
immediately — gives the agent one chance to come back, restart,
hello again with the right version).
Adds a small in-memory map of pending updates so the timeout
goroutine knows when to give up. Persisted state lives in the
`jobs` table; the in-memory map is just for the timer.
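The in-memory map might look something like the sketch below, built on `time.AfterFunc`. `pendingUpdates`, `Track` and `Resolve` are hypothetical names. The property worth preserving is that for each tracked host exactly one of "resolved by hello" or "timed out" wins, even when the two race.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pendingUpdates maps host id to the timer for its running update job.
type pendingUpdates struct {
	mu     sync.Mutex
	timers map[string]*time.Timer
}

func newPendingUpdates() *pendingUpdates {
	return &pendingUpdates{timers: make(map[string]*time.Timer)}
}

// Track arms a timeout for hostID. onTimeout runs at most once, and
// only if Resolve has not claimed the entry first.
func (p *pendingUpdates) Track(hostID string, timeout time.Duration, onTimeout func()) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if old, ok := p.timers[hostID]; ok {
		old.Stop()
	}
	var t *time.Timer
	t = time.AfterFunc(timeout, func() {
		p.mu.Lock()
		mine := p.timers[hostID] == t // false if resolved or replaced
		if mine {
			delete(p.timers, hostID)
		}
		p.mu.Unlock()
		if mine {
			onTimeout() // e.g. mark the job failed and raise the alert
		}
	})
	p.timers[hostID] = t
}

// Resolve is called from the hello handler on a matching version. It
// reports whether an update was pending for hostID.
func (p *pendingUpdates) Resolve(hostID string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	t, ok := p.timers[hostID]
	if !ok {
		return false
	}
	t.Stop()
	delete(p.timers, hostID)
	return true
}

// demo resolves one host before its timeout and lets another expire.
func demo() (resolved, timedOut bool) {
	p := newPendingUpdates()
	p.Track("host-a", time.Hour, func() {})
	resolved = p.Resolve("host-a")

	done := make(chan struct{})
	p.Track("host-b", 10*time.Millisecond, func() { close(done) })
	select {
	case <-done:
		timedOut = true
	case <-time.After(2 * time.Second):
	}
	return resolved, timedOut
}

func main() {
	fmt.Println(demo())
}
```

On server restart the map is empty, so a real implementation would presumably re-arm timers from `jobs` rows that are still `running`; that recovery step is not specified here.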
## 7. P6-02 fleet update
### 7.1 Schema
Migration 0022, additive only (two new tables):
```
CREATE TABLE fleet_updates (
id TEXT PRIMARY KEY,
started_at TEXT NOT NULL,
started_by_user_id TEXT NOT NULL REFERENCES users(id),
target_version TEXT NOT NULL,
status TEXT NOT NULL CHECK (status IN ('running','completed','halted','cancelled')),
current_host_id TEXT REFERENCES hosts(id),
halted_reason TEXT,
completed_at TEXT
);
CREATE TABLE fleet_update_hosts (
fleet_update_id TEXT NOT NULL REFERENCES fleet_updates(id) ON DELETE CASCADE,
host_id TEXT NOT NULL REFERENCES hosts(id) ON DELETE CASCADE,
status TEXT NOT NULL CHECK (status IN ('pending','running','succeeded','failed','skipped')),
job_id TEXT REFERENCES jobs(id),
failed_reason TEXT,
PRIMARY KEY (fleet_update_id, host_id)
);
```
### 7.2 Worker loop
A single in-process goroutine — at most one fleet update may run
at a time (enforced via a `sync.Mutex` + a precondition check on
`POST /api/fleet/update`).
```
for each pending fleet_update_hosts row in dispatch order:
set fleet_updates.current_host_id = row.host_id
set fleet_update_hosts.status = 'running'
if host.agent_version == server.Version:
# Already updated since we built the list — skip.
set status = 'skipped'; continue
if !host.online:
# Offline since we built the list — halt.
halt(reason="host went offline")
return
dispatch_update_for_host(host) # reuses 6.1 logic
wait_up_to_90s_for_hello_with_matching_version()
if matched:
set status = 'succeeded'; continue
else:
set status = 'failed', failed_reason = "..."
halt(reason="update failed on host X")
return
set fleet_updates.status = 'completed', completed_at = now
```
Halt: set `fleet_updates.status = 'halted'`, raise an alert kind
`fleet_update_halted`, audit `fleet.update_halted` with the host
id and reason. Subsequent hosts stay `pending` so the operator can
see what was queued and decide whether to resume (resume = start a
new fleet update with the still-out-of-date subset).
Cancel: admin-only `POST /api/fleet-updates/{id}/cancel`. Sets
`status='cancelled'`. The currently-dispatched host's update job
keeps running (the agent is already mid-restart) — cancel only
prevents the *next* host from being picked. Audit
`fleet.update_cancelled`.
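The worker loop, including the halt and cancel rules, can be sketched in Go as below. The `Fleet` interface is a hypothetical seam invented for this example, and `fakeFleet` exists only so that it runs. Two simplifications are assumptions of the sketch: the transient per-host `running` status is not modelled, and a host that halts the run is marked `failed`.

```go
package main

import "fmt"

// HostState is what the loop re-reads for each host before dispatching.
type HostState struct {
	Online       bool
	AgentVersion string
}

// Fleet is a hypothetical seam between the loop and the rest of the server.
type Fleet interface {
	Host(id string) HostState     // current state, re-read per iteration
	Cancelled() bool              // true once the operator has cancelled
	UpdateAndWait(id string) bool // dispatch, then wait up to 90s for a matching hello
}

// Result summarises the fleet_updates row and its per-host rows.
type Result struct {
	Status       string // running, completed, halted, cancelled
	HaltedReason string
	Hosts        map[string]string // host id -> per-host status
}

func runFleetUpdate(f Fleet, order []string, serverVersion string) Result {
	res := Result{Status: "running", Hosts: make(map[string]string, len(order))}
	for _, id := range order {
		res.Hosts[id] = "pending"
	}
	halt := func(id, reason string) Result {
		res.Hosts[id] = "failed"
		res.Status, res.HaltedReason = "halted", reason
		return res
	}
	for _, id := range order {
		if f.Cancelled() {
			res.Status = "cancelled" // remaining hosts stay pending
			return res
		}
		h := f.Host(id)
		switch {
		case h.AgentVersion == serverVersion:
			res.Hosts[id] = "skipped" // updated since the list was built
		case !h.Online:
			return halt(id, "host went offline: "+id)
		case f.UpdateAndWait(id):
			res.Hosts[id] = "succeeded"
		default:
			return halt(id, "update failed on host "+id)
		}
	}
	res.Status = "completed"
	return res
}

// fakeFleet is an in-memory Fleet used only to run the sketch.
type fakeFleet struct {
	hosts       map[string]HostState
	failOn      map[string]bool
	cancelAfter int // cancel once this many hosts were dispatched; -1 = never
	dispatched  []string
}

func (f *fakeFleet) Host(id string) HostState { return f.hosts[id] }
func (f *fakeFleet) Cancelled() bool {
	return f.cancelAfter >= 0 && len(f.dispatched) >= f.cancelAfter
}
func (f *fakeFleet) UpdateAndWait(id string) bool {
	f.dispatched = append(f.dispatched, id)
	return !f.failOn[id]
}

func main() {
	f := &fakeFleet{
		hosts: map[string]HostState{
			"a": {Online: true, AgentVersion: "v1.2.2"},
			"b": {Online: true, AgentVersion: "v1.2.2"},
			"c": {Online: true, AgentVersion: "v1.2.2"},
		},
		failOn:      map[string]bool{"b": true},
		cancelAfter: -1,
	}
	res := runFleetUpdate(f, []string{"a", "b", "c"}, "v1.2.3")
	fmt.Println(res.Status, res.HaltedReason, res.Hosts)
}
```

Checking `Cancelled()` at the top of each iteration matches the rule that cancel only prevents the next host from being picked.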
### 7.3 UI surfaces
**Per-host chip (host_row partial + host detail chrome):**
`out of date · v1.2.2 → v1.2.3` — amber-accented, mirrors `.tag`
token shape. Only rendered when:
```
host.agent_version != "" && host.agent_version != server.Version
```
Empty `agent_version` (host enrolled but never connected) renders
nothing rather than "out of date" — we don't know what version
they have.
**Dashboard summary tile:**
The hero strip already has tiles. Add an "Updates" tile:
`N hosts behind` linking to `/?updates=behind` (extends NS-04's
filter machinery — adds an `updates` query param alongside
`status`/`repo_status`/`tag`). Hidden when N == 0.
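The chip and tile logic reduces to a pair of small helpers, sketched here with hypothetical names (`outOfDateChip`, `hostsBehind`). Counting never-connected hosts (empty `agent_version`) as not behind for the tile is an assumption carried over from the chip rule; the spec does not say so explicitly.

```go
package main

import "fmt"

// outOfDateChip returns the chip text and whether it should render.
// An empty agent version means the host has never connected, so the
// chip is suppressed rather than guessing.
func outOfDateChip(agentVersion, serverVersion string) (string, bool) {
	if agentVersion == "" || agentVersion == serverVersion {
		return "", false
	}
	return "out of date · " + agentVersion + " → " + serverVersion, true
}

// hostsBehind counts the hosts that would show the chip. The Updates
// tile is hidden when this returns 0.
func hostsBehind(agentVersions []string, serverVersion string) int {
	n := 0
	for _, v := range agentVersions {
		if _, behind := outOfDateChip(v, serverVersion); behind {
			n++
		}
	}
	return n
}

func main() {
	text, show := outOfDateChip("v1.2.2", "v1.2.3")
	fmt.Println(show, text)
	fmt.Println(hostsBehind([]string{"v1.2.2", "", "v1.2.3", "v1.2.1"}, "v1.2.3"))
}
```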
**Per-host Update button on `/hosts/{id}`:**
Right-rail, admin-only. Disabled with hover tooltip when host
offline / already up to date / update in progress. POSTs to
`/hosts/{id}/update`, `HX-Redirect` to the live job log.
**Fleet update page `/settings/fleet-update`:**
Admin-only. Two states:
- **Idle**: lists out-of-date online hosts (table: hostname,
current version, target version, last seen). Big "Start rolling
update" button behind a typed-confirm dialog (operator types
the host count, e.g. `12`, to enable the button — same shape as
the host-delete confirm).
- **Running/halted/completed**: shows the currently-active
fleet_update row + per-host progress list. Polls every 3s (htmx
trigger conditional on `document.visibilityState === 'visible'`,
same pattern as the alerts page). Renders:
```
Updated 3/12 · currently updating <hostname>
Halted on <hostname>: <reason> · job log →
```
Audit actions: `fleet.update_started`, `fleet.update_completed`,
`fleet.update_halted`, `fleet.update_cancelled`.
### 7.4 Alert engine integration
P3-05's alert engine already supports kind-based registration. Add
two new kinds:
- `update_failed` — per-host, raised on individual update failure.
Auto-resolves when the host re-hellos with the matching version.
- `fleet_update_halted` — global, raised on fleet halt. Auto-resolves
when a subsequent fleet update completes successfully.
## 8. RBAC
| Endpoint | Role |
|----------|------|
| `POST /api/hosts/{id}/update` | admin |
| `POST /api/fleet/update` | admin |
| `POST /api/fleet-updates/{id}/cancel` | admin |
| `GET /api/fleet-updates/{id}` | admin (status polling) |
| `GET /api/version` | public |
Operator and viewer see the "out of date" chip but no update
buttons. Mirrors the existing pattern: read affordances are
visible to all roles, write affordances are gated.
## 9. Testing
### 9.1 Unit
- `internal/agent/updater`: fake-`/agent/binary` HTTP server +
tmp "running binary" file, assert post-state — binary swapped,
`.old` present, no leftover `.new`. Linux path only (Windows
helper covered by build-tag compile-only).
- `internal/server/http`: `POST /api/hosts/{id}/update` happy
path, refuses-when-offline, refuses-when-up-to-date,
refuses-when-update-in-progress, RBAC enforcement, audit row
written.
- Hello handler: agent reconnects with matching version after
`update` job dispatch → marks job `succeeded`, drops the
in-memory pending entry. Mismatched version → no-op (timeout
catches it).
- Timeout path: synthetic `update` job + 90s elapsed →
marks `failed`, raises alert.
- Fleet worker: table-driven over the loop's state machine —
success-then-success, success-then-timeout-halts,
cancel-mid-flight, no-online-out-of-date-hosts-completes-immediately,
host-disappears-from-list-mid-loop-skips.
### 9.2 Smoke validation (per CLAUDE.md restage block)
1. Build server + agent at version A. Restage. Enrol a host;
confirm `agent_version=A`.
2. Bump version to B (`make build VERSION=B`), rebuild server
only, restart server. Dashboard shows host as out-of-date with
`A → B` chip. Updates tile reads "1 host behind".
3. Rebuild agent at B, restage `<DataDir>/agent-binaries/`. Click
**Update agent** on host detail. Agent fetches, swaps, exits;
systemd restarts it; hello-back at B → job `succeeded`, chip
gone, tile clears.
4. Rollback path: leave `<DataDir>/agent-binaries/` at A, server
at B, click Update — agent fetches A, swaps to A, restarts at
A; hello says A != B; server marks job `failed` after 90s with
reason "agent reconnected at version A, expected B".
5. Fleet update: spin up two smoke hosts both out-of-date, fire
**Start rolling update**, watch progress page tick host 1 →
host 2 → completed.
6. Halt path: replace one of the `<DataDir>/agent-binaries/`
files with `/bin/false`. Run fleet update. First host gets
broken binary, fails to come back up, fleet update halts at
host 1 after 90s, alert raised, host 2 left as `pending`.
Step 6 validates M2 end-to-end — the rolling halt is the actual
safety guarantee, not a nice-to-have.
## 10. Out of scope
- sha256 digest verification (deferred — see decision 4).
- `restic-manager-agent update` CLI subcommand (deferred —
decision 6).
- Auto-update (deferred — decision 1).
- Auto-rollback watchdog M3 (deferred — decision 3).
- Migrating the agent off `User=root` (separate hardening track).
- Cross-version protocol-compatibility checks beyond the existing
`protocol_version` handshake (P1-11). If the new agent's
`protocol_version` is incompatible with the server, the
existing handshake rejects it; the update job will then
correctly time out and be marked failed.
## 11. Migration plan
1. `internal/version` package + Makefile ldflags wiring.
2. Migration 0021 (jobs.kind widening) + 0022 (fleet_updates
tables).
3. `internal/agent/updater` package, Linux first.
4. WS envelope wiring + `command.update` dispatcher.
5. `POST /api/hosts/{id}/update` + hello-handler integration +
timeout goroutine.
6. UI: chip + per-host update button + dashboard tile + filter.
7. Fleet update worker + page.
8. Windows updater path.
9. Alert engine kinds.
10. Smoke validation per §9.2.
Each step is independently testable; commits should land at each
boundary so a failed Windows path (8) doesn't block the rest of
the work.