# P6-01 + P6-02 — Agent self-update + fleet update
Status: design approved 2026-05-06.
Scope: P6-01 (agent self-update mechanism) and P6-02 (dashboard
version reporting + fleet update UI). One spec, one branch — the
two tasks are tightly coupled (P6-02 is the operator surface for
the mechanism P6-01 ships).
## 1. Background
P5-03 pivoted release distribution to a single multi-arch server
Docker image, with cross-compiled agent binaries baked under
`/opt/restic-manager/dist/agent-binaries/` and served via
`GET /agent/binary?os=…&arch=…`. The plumbing already does
dual-path lookup: `<DataDir>/agent-binaries/<name>` overrides the
image-baked copy, so an operator can hot-patch a pre-release agent
without rebuilding the image.
That makes the server the natural distribution point for agent
upgrades. "Update agent" collapses to "re-fetch from your own
server" — no apt repo, no Chocolatey, no third-party signing infra,
and version pinning is automatic because the server only ever
serves the agent that matches its own release.
This spec wires up the update mechanism end-to-end and the
operator surface that drives it.
## 2. Decisions
| # | Decision | Rationale |
|---|----------|-----------|
| 1 | Operator-driven only — no auto-update | Matches the rest of the app's job-dispatch model; avoids "bad release upgrades every host instantly"; auto-update can be added later as a setting flip if asked |
| 2 | Linux: just exit, let systemd restart. Windows: detached helper script. | Linux supports rename-while-open; Windows holds an exclusive lock on the running .exe |
| 3 | M1 (keep `agent.old` on disk) + M2 (rolling fleet update with halt-on-fail). Skip M3 (auto-rollback watchdog). | M1 is ~5 lines, M2 falls naturally out of P6-02's UI, M3 is a lot of plumbing for "shipped a binary that doesn't start" |
| 4 | Skip sha256 digest verification for v1 | TLS already covers the corruption-in-transit threat; image-tampering is image-build's problem, not the agent's |
| 5 | Exact string version match for "out of date" | With server-bundled binaries there's exactly one canonical version per server image — anything else is out of date by definition |
| 6 | WS envelope only, no `restic-manager-agent update` CLI subcommand | YAGNI; no concrete consumer; the underlying logic is reusable when one appears |
## 3. Wire protocol
### 3.1 Server → agent: `command.update`
```
{
  "type": "command.update",
  "id": "<envelope id>",
  "payload": {
    "job_id": "<ulid>"
  }
}
```
No `os` / `arch` / `version` in the payload — the agent already
knows its own build target and fetches from its configured server
URL via the existing `/agent/binary` handler. Including a target
version would also tempt the agent into version-comparison logic;
keep that on the server side.
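As a rough illustration of the envelope above, the message could be modelled in Go as follows. The type and function names (`envelope`, `updatePayload`, `newUpdateEnvelope`) are assumptions made for this sketch, not names from the codebase; only the JSON field names come from the spec.

```go
package main

import "encoding/json"

// updatePayload is the body of a command.update envelope. It carries only
// the job id: no os, arch, or version, as described above.
type updatePayload struct {
	JobID string `json:"job_id"`
}

// envelope mirrors the JSON shape shown in this section. The Go type name
// is hypothetical.
type envelope struct {
	Type    string        `json:"type"`
	ID      string        `json:"id"`
	Payload updatePayload `json:"payload"`
}

// newUpdateEnvelope serialises a command.update message for one job.
func newUpdateEnvelope(envelopeID, jobID string) ([]byte, error) {
	return json.Marshal(envelope{
		Type:    "command.update",
		ID:      envelopeID,
		Payload: updatePayload{JobID: jobID},
	})
}
```

Keeping the payload to a single field means the agent has nothing to compare against, which is the point of the design note above.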
### 3.2 Job lifecycle (server-driven)
The agent has limited ability to report on its own restart, so the
job state machine lives on the server:
- **queued → running** when the envelope is dispatched.
- **running → succeeded** when the agent re-hellos with
`agent_version == server.Version` after dispatch and within
the timeout. Audit `host.update_succeeded`.
- **running → failed (timeout)** if 90 seconds pass without a
hello carrying the matching version. Audit `host.update_failed`.
Raise alert kind `update_failed` (reuses P3-05 alert engine).
This single transition covers both the "agent never came back
at all" case and the "agent came back at the wrong version"
case — see §6.2 for why we don't transition immediately on a
mismatched hello.
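The transitions above can be sketched as three small pure functions. This is a minimal sketch under assumptions: the `jobStatus` type, the constant names, and the function names are hypothetical, and the real server would also write audit rows and raise alerts at each transition.

```go
package main

import "time"

type jobStatus string

const (
	statusQueued    jobStatus = "queued"
	statusRunning   jobStatus = "running"
	statusSucceeded jobStatus = "succeeded"
	statusFailed    jobStatus = "failed"
)

// updateTimeout is the 90-second window from the spec.
const updateTimeout = 90 * time.Second

// onDispatch: queued -> running when the envelope is sent.
func onDispatch(cur jobStatus) jobStatus {
	if cur == statusQueued {
		return statusRunning
	}
	return cur
}

// onHello: running -> succeeded only on an exact version match.
// A mismatched hello leaves the job running so the timeout decides.
func onHello(cur jobStatus, agentVersion, serverVersion string) jobStatus {
	if cur == statusRunning && agentVersion == serverVersion {
		return statusSucceeded
	}
	return cur
}

// onTick: running -> failed once the timeout has elapsed since dispatch.
func onTick(cur jobStatus, sinceDispatch time.Duration) jobStatus {
	if cur == statusRunning && sinceDispatch >= updateTimeout {
		return statusFailed
	}
	return cur
}
```

Note that there is deliberately no path from a mismatched hello straight to `failed`; both failure cases go through `onTick`.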
Migration 0021 widens the `jobs.kind` CHECK constraint to include
`update`. Same column-level pattern as 0012 (where 0012 added
`restore` and `diff`).
## 4. Agent-side execution
Lives in `internal/agent/updater`, build-tag split:
- `updater_unix.go` — Linux + any future POSIX target.
- `updater_windows.go` — Windows-only, uses the helper-script
pattern.
- `updater.go` — shared `Update(ctx, serverURL string) error`
interface and the HTTP fetch/streaming code (no platform deps).
### 4.1 Linux flow
1. Receive `command.update` from the WS dispatcher.
2. Resolve own binary via `os.Executable()` and `filepath.Abs`.
Refuse if the resolved path is `/proc/self/exe` or otherwise
not a real file (defence in depth — shouldn't happen under
systemd, but bail loudly if it does).
3. `GET <server>/agent/binary?os=linux&arch=<runtime.GOARCH>`,
stream to `<binary>.new` in the same directory as the running
binary (same filesystem ⇒ atomic rename).
4. fsync the file, `os.Chmod(0755)`.
5. Copy current binary to `<binary>.old` (overwrite if it
exists). M1 — one-revision rollback target.
6. `os.Rename(<binary>.new, <binary>)`.
7. Close the WS connection cleanly (sends close frame so the
server transitions the connection to `disconnected` rather
than waiting for the heartbeat-miss sweep).
8. `os.Exit(0)`. Systemd's `Restart=always` (already in the unit)
brings up the new binary within seconds.
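The on-disk part of this flow (the chmod in step 4, then steps 5 and 6) might look roughly like the sketch below. It assumes the download to `<binary>.new` has already been streamed and fsynced; `swapBinary`, `copyFile`, and `demoSwap` are hypothetical names, and `demoSwap` exists only to exercise the swap in a temp directory.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// copyFile copies src to dst (overwriting dst) and fsyncs the result.
func copyFile(src, dst string, mode os.FileMode) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.OpenFile(dst, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, mode)
	if err != nil {
		return err
	}
	if _, err := io.Copy(out, in); err != nil {
		out.Close()
		return err
	}
	if err := out.Sync(); err != nil {
		out.Close()
		return err
	}
	return out.Close()
}

// swapBinary marks <binary>.new executable, keeps the current binary as
// <binary>.old, then renames the staged file over the running one.
func swapBinary(binary string) error {
	staged := binary + ".new"
	if err := os.Chmod(staged, 0o755); err != nil {
		return fmt.Errorf("chmod staged binary: %w", err)
	}
	if err := copyFile(binary, binary+".old", 0o755); err != nil {
		return fmt.Errorf("keep rollback copy: %w", err)
	}
	if err := os.Rename(staged, binary); err != nil {
		return fmt.Errorf("swap binary: %w", err)
	}
	return nil
}

// demoSwap runs the swap on throwaway files and returns the contents of
// the live binary and the rollback copy afterwards.
func demoSwap() (string, string, error) {
	dir, err := os.MkdirTemp("", "updater-demo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)
	bin := filepath.Join(dir, "agent")
	if err := os.WriteFile(bin, []byte("A"), 0o755); err != nil {
		return "", "", err
	}
	if err := os.WriteFile(bin+".new", []byte("B"), 0o644); err != nil {
		return "", "", err
	}
	if err := swapBinary(bin); err != nil {
		return "", "", err
	}
	cur, err := os.ReadFile(bin)
	if err != nil {
		return "", "", err
	}
	old, err := os.ReadFile(bin + ".old")
	if err != nil {
		return "", "", err
	}
	return string(cur), string(old), nil
}
```

The rollback copy is made with a copy rather than a rename so the live path never disappears, even if the process dies between steps 5 and 6.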
### 4.2 Windows flow
The .exe is exclusively locked by the OS while running, so steps
5 and 6 above can't happen in-process. Use a detached helper:
1. Steps 1 to 4 are the same: fetch into `<binary>.exe.new`, fsync.
2. Write `update.cmd` to a tmp path with the orchestration:
```
timeout /t 3 /nobreak >nul
copy /Y "<binary>.exe" "<binary>.exe.old"
sc stop restic-manager-agent
:wait
sc query restic-manager-agent | find "STOPPED" >nul
if errorlevel 1 (timeout /t 1 /nobreak >nul & goto wait)
move /Y "<binary>.exe.new" "<binary>.exe"
sc start restic-manager-agent
del "%~f0"
```
3. `CreateProcess` it detached
(`DETACHED_PROCESS | CREATE_NO_WINDOW`, no parent handles).
4. Close WS, `os.Exit(0)`. SCM sees clean stop and waits — does
*not* try to restart, because `sc stop` is the helper's job,
not a crash. (`Restart=always` semantics differ between
systemd and SCM. SCM treats clean-exit-after-stop as
intentional and does not auto-restart; only crashes restart.
That's why the helper script needs the explicit `sc start`
at the end.)
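One way the agent could produce `update.cmd` is to render it from the resolved executable path, as sketched below. `renderUpdateScript` and the `serviceName` constant are hypothetical names; the script lines themselves are copied from step 2. The sketch assumes the path contains no double quotes or percent signs, which would need escaping in a batch file.

```go
package main

import "strings"

// serviceName is the Windows service name used in the helper script above.
const serviceName = "restic-manager-agent"

// renderUpdateScript returns the helper script for one executable path.
// exePath is the full path of the running agent, including ".exe".
func renderUpdateScript(exePath string) string {
	q := func(p string) string { return `"` + p + `"` }
	lines := []string{
		"timeout /t 3 /nobreak >nul",
		"copy /Y " + q(exePath) + " " + q(exePath+".old"),
		"sc stop " + serviceName,
		":wait",
		"sc query " + serviceName + ` | find "STOPPED" >nul`,
		"if errorlevel 1 (timeout /t 1 /nobreak >nul & goto wait)",
		"move /Y " + q(exePath+".new") + " " + q(exePath),
		"sc start " + serviceName,
		`del "%~f0"`,
	}
	// Batch files conventionally use CRLF line endings.
	return strings.Join(lines, "\r\n") + "\r\n"
}
```

Rendering the script from a single path keeps the `.old` and `.new` names consistent with the Linux flow.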
### 4.3 Service-user assumption
Both Linux (`User=root` per the existing unit) and Windows
(`LocalSystem` by default) can write the binary path directly. If
the agent ever moves to a non-root service user, the updater
breaks — would need either a setuid helper or an out-of-process
update service. Add a `// NOTE:` comment in the updater package
flagging this; not a v1 blocker.
## 5. Server build version
New package `internal/version` exposing two string variables (set at link time, so they cannot be constants):
```
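package version

// Version and Commit are overridden at build time via -ldflags -X.
// They must be variables: -X cannot set constants.
var (
	Version = "dev"
	Commit  = ""
)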
Wired via `-ldflags` in the Makefile:
```
VERSION := $(shell git describe --tags --always --dirty)
COMMIT := $(shell git rev-parse --short HEAD)
GO_LDFLAGS = -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=$(VERSION) \
             -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Commit=$(COMMIT)
```
Both `cmd/server` and `cmd/agent` link the same package, so an
agent's `agent_version` (sent in the hello payload, already wired
since P1-11) is comparable byte-for-byte to the server's
`version.Version`.
`make build` already does what's needed for source builds. The
Phase 2 work in this spec is the Docker release path — confirm
during plan execution that `.gitea/workflows/release.yml` passes
`VERSION` and `COMMIT` into the Docker `--build-arg` chain so the
in-image binaries embed the same string the image is tagged with.
If not, add the wiring.
Dirty/dev builds (`v1.2.3-dirty`) won't match clean server builds,
so every dev environment will show every host as out-of-date. This
is acceptable — the chip is a noop in dev, real ops always run
tagged builds.
A new `GET /api/version` endpoint returns
`{"version": "...", "commit": "..."}`. Used by the dashboard
header tile and by `/settings/fleet-update`. Public-band — exposes
no secrets, lets the install scripts surface it too.
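A minimal sketch of the `/api/version` handler, using only the standard library. The local `Version` and `Commit` variables stand in for the `internal/version` package, and `versionResponse`, `renderVersion`, and `handleVersion` are hypothetical names; how the route is registered with the server's router is not shown.

```go
package main

import (
	"encoding/json"
	"net/http"
)

// Stand-ins for internal/version.Version and internal/version.Commit.
var (
	Version = "dev"
	Commit  = ""
)

// versionResponse matches the {"version": "...", "commit": "..."} shape.
type versionResponse struct {
	Version string `json:"version"`
	Commit  string `json:"commit"`
}

// renderVersion builds the JSON body for the endpoint.
func renderVersion(version, commit string) ([]byte, error) {
	return json.Marshal(versionResponse{Version: version, Commit: commit})
}

// handleVersion serves GET /api/version. It needs no authentication
// because the response exposes no secrets.
func handleVersion(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodGet {
		w.Header().Set("Allow", http.MethodGet)
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	body, err := renderVersion(Version, Commit)
	if err != nil {
		http.Error(w, "internal error", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	w.Write(body)
}
```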
## 6. P6-01 server endpoints
### 6.1 `POST /api/hosts/{id}/update`
Admin-only. Refuses (with structured error code) when:
- Host is offline (`host_offline`).
- Host's `agent_version == server.Version` (`already_up_to_date`).
- An update job for this host is already running (`update_in_progress`).
Happy path: creates `jobs` row with `kind=update`, dispatches
`command.update` envelope, audit-logs `host.update_dispatched`,
returns `{"job_id": "..."}`.
UI form-post variant on `/hosts/{id}/update` returns
`HX-Redirect` to the live job log.
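The three refusal conditions could be collected into one precondition check, sketched below. `hostState` and `refusalCode` are hypothetical names, and the order in which the checks run is an assumption (the spec lists the codes but does not say which wins when several apply).

```go
package main

// hostState is the subset of host data the precondition check needs.
type hostState struct {
	Online        bool
	AgentVersion  string
	UpdateRunning bool // an update job for this host has status 'running'
}

// refusalCode returns the structured error code for a refused update,
// or "" when the update may be dispatched.
func refusalCode(h hostState, serverVersion string) string {
	switch {
	case !h.Online:
		return "host_offline"
	case h.AgentVersion == serverVersion:
		return "already_up_to_date"
	case h.UpdateRunning:
		return "update_in_progress"
	}
	return ""
}
```

The same function can back both the JSON endpoint and the disabled-button tooltip in the UI, so the two cannot drift apart.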
### 6.2 Hello handler integration
The existing `onAgentHello` (P1-11) already upserts
`agent_version`. Extend it: after the upsert, look for any
`update` job for this host with `status='running'`. If one
exists:
- `agent_version == server.Version` → mark job `succeeded`,
audit `host.update_succeeded`.
- `agent_version != server.Version` → leave the job running so
the timeout path catches it as a rollback failure (don't fail
immediately — gives the agent one chance to come back, restart,
hello again with the right version).
Adds a small in-memory map of pending updates so the timeout
goroutine knows when to give up. Persisted state lives in the
`jobs` table; the in-memory map is just for the timer.
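The in-memory map described above might be as small as the sketch below: one timer per host, armed at dispatch and cancelled by a matching hello. `pendingUpdates`, `Start`, and `Resolve` are hypothetical names; persisting the job status is left to the callbacks.

```go
package main

import (
	"sync"
	"time"
)

// pendingUpdates tracks one timeout timer per host with a running update.
type pendingUpdates struct {
	mu     sync.Mutex
	timers map[string]*time.Timer // keyed by host ID
}

func newPendingUpdates() *pendingUpdates {
	return &pendingUpdates{timers: make(map[string]*time.Timer)}
}

// Start arms the timeout for hostID. onTimeout runs at most once, and
// only if Resolve has not claimed the entry first.
func (p *pendingUpdates) Start(hostID string, timeout time.Duration, onTimeout func()) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if old, ok := p.timers[hostID]; ok {
		old.Stop()
	}
	var t *time.Timer
	t = time.AfterFunc(timeout, func() {
		p.mu.Lock()
		owned := p.timers[hostID] == t
		if owned {
			delete(p.timers, hostID)
		}
		p.mu.Unlock()
		if owned {
			onTimeout()
		}
	})
	p.timers[hostID] = t
}

// Resolve is called by the hello handler on a matching version. It reports
// whether an update was still pending for hostID.
func (p *pendingUpdates) Resolve(hostID string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	t, ok := p.timers[hostID]
	if !ok {
		return false
	}
	t.Stop()
	delete(p.timers, hostID)
	return true
}
```

Because the map is only a timer registry, losing it on a server restart loses no state; a startup sweep over `running` update jobs in the `jobs` table could re-arm or fail them, though the spec does not describe one.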
## 7. P6-02 fleet update
### 7.1 Schema
Migration 0022, purely additive (two new tables):
```
CREATE TABLE fleet_updates (
    id                 TEXT PRIMARY KEY,
    started_at         TEXT NOT NULL,
    started_by_user_id TEXT NOT NULL REFERENCES users(id),
    target_version     TEXT NOT NULL,
    status             TEXT NOT NULL CHECK (status IN ('running','completed','halted','cancelled')),
    current_host_id    TEXT REFERENCES hosts(id),
    halted_reason      TEXT,
    completed_at       TEXT
);
CREATE TABLE fleet_update_hosts (
    fleet_update_id TEXT NOT NULL REFERENCES fleet_updates(id) ON DELETE CASCADE,
    host_id         TEXT NOT NULL REFERENCES hosts(id) ON DELETE CASCADE,
    status          TEXT NOT NULL CHECK (status IN ('pending','running','succeeded','failed','skipped')),
    job_id          TEXT REFERENCES jobs(id),
    failed_reason   TEXT,
    PRIMARY KEY (fleet_update_id, host_id)
);
```
### 7.2 Worker loop
A single in-process goroutine — at most one fleet update may run
at a time (enforced via a `sync.Mutex` + a precondition check on
`POST /api/fleet/update`).
```
for each pending fleet_update_hosts row in dispatch order:
    set fleet_updates.current_host_id = row.host_id
    set fleet_update_hosts.status = 'running'
    if host.agent_version == server.Version:
        # Already updated since we built the list: skip.
        set status = 'skipped'; continue
    if !host.online:
        # Offline since we built the list: halt.
        halt(reason="host went offline")
        return
    dispatch_update_for_host(host)  # reuses 6.1 logic
    wait_up_to_90s_for_hello_with_matching_version()
    if matched:
        set status = 'succeeded'; continue
    else:
        set status = 'failed', failed_reason = "..."
        halt(reason="update failed on host X")
        return
set fleet_updates.status = 'completed', completed_at = now
```
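The loop above translates fairly directly into Go. In this sketch the database rows are replaced by an in-memory slice and the dispatch-and-wait step by a callback, so `fleetHost`, `updateFunc`, and `runFleetUpdate` are all hypothetical names. Cancellation is not modelled.

```go
package main

// fleetHost stands in for one fleet_update_hosts row joined with its host.
type fleetHost struct {
	ID           string
	Online       bool
	AgentVersion string
	Status       string // pending, running, succeeded, failed, skipped
}

// updateFunc dispatches an update to one host and blocks until the agent
// re-hellos at the target version (true) or the 90-second wait expires (false).
type updateFunc func(hostID string) bool

// runFleetUpdate walks the pending hosts in order and returns the final
// fleet status plus the halt reason, if any.
func runFleetUpdate(hosts []*fleetHost, serverVersion string, update updateFunc) (string, string) {
	for _, h := range hosts {
		if h.Status != "pending" {
			continue
		}
		h.Status = "running"
		if h.AgentVersion == serverVersion {
			// Already updated since the list was built.
			h.Status = "skipped"
			continue
		}
		if !h.Online {
			// Mirrors the pseudocode: the row stays 'running' on this path.
			return "halted", "host went offline"
		}
		if update(h.ID) {
			h.Status = "succeeded"
			continue
		}
		h.Status = "failed"
		return "halted", "update failed on host " + h.ID
	}
	return "completed", ""
}
```

Passing the wait step in as a callback is what makes the table-driven tests in §9.1 practical: each case supplies a function that succeeds or times out per host.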
Halt: set `fleet_updates.status = 'halted'`, raise an alert kind
`fleet_update_halted`, audit `fleet.update_halted` with the host
id and reason. Subsequent hosts stay `pending` so the operator can
see what was queued and decide whether to resume (resume = start a
new fleet update with the still-out-of-date subset).
Cancel: admin-only `POST /api/fleet-updates/{id}/cancel`. Sets
`status='cancelled'`. The currently-dispatched host's update job
keeps running (the agent is already mid-restart) — cancel only
prevents the *next* host from being picked. Audit
`fleet.update_cancelled`.
### 7.3 UI surfaces
**Per-host chip (host_row partial + host detail chrome):**
`out of date · v1.2.2 → v1.2.3` — amber-accented, mirrors `.tag`
token shape. Only rendered when:
```
host.agent_version != "" && host.agent_version != server.Version
```
Empty `agent_version` (host enrolled but never connected) renders
nothing rather than "out of date", because the server does not
know which version that host is running.
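The chip rule can be expressed as a small helper that returns the label and whether to render it. `outOfDateChip` is a hypothetical name; the label format follows the example above.

```go
package main

// outOfDateChip returns the chip text for a host and whether the chip
// should be rendered at all. An empty agentVersion means the host has
// never connected, so nothing is shown.
func outOfDateChip(agentVersion, serverVersion string) (string, bool) {
	if agentVersion == "" || agentVersion == serverVersion {
		return "", false
	}
	return "out of date · " + agentVersion + " → " + serverVersion, true
}
```

The comparison is a plain string inequality, in line with decision 5: there is no attempt to order versions.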
**Dashboard summary tile:**
The hero strip already has tiles. Add an "Updates" tile:
`N hosts behind` linking to `/?updates=behind` (extends NS-04's
filter machinery — adds an `updates` query param alongside
`status`/`repo_status`/`tag`). Hidden when N == 0.
**Per-host Update button on `/hosts/{id}`:**
Right-rail, admin-only. Disabled with hover tooltip when host
offline / already up to date / update in progress. POSTs to
`/hosts/{id}/update`, `HX-Redirect` to the live job log.
**Fleet update page `/settings/fleet-update`:**
Admin-only. Two states:
- **Idle**: lists out-of-date online hosts (table: hostname,
current version, target version, last seen). Big "Start rolling
update" button behind a typed-confirm dialog (operator types
the host count, e.g. `12`, to enable the button — same shape as
the host-delete confirm).
- **Running/halted/completed**: shows the currently-active
fleet_update row + per-host progress list. Polls every 3s (htmx
trigger conditional on `document.visibilityState === 'visible'`,
same pattern as the alerts page). Renders:
```
Updated 3/12 · currently updating <hostname>
Halted on <hostname>: <reason> · job log →
```
Audit actions: `fleet.update_started`, `fleet.update_completed`,
`fleet.update_halted`, `fleet.update_cancelled`.
### 7.4 Alert engine integration
P3-05's alert engine already supports kind-based registration. Add
two new kinds:
- `update_failed` — per-host, raised on individual update failure.
Auto-resolves when the host re-hellos with the matching version.
- `fleet_update_halted` — global, raised on fleet halt. Auto-resolves
when a subsequent fleet update completes successfully.
## 8. RBAC
| Endpoint | Role |
|----------|------|
| `POST /api/hosts/{id}/update` | admin |
| `POST /api/fleet/update` | admin |
| `POST /api/fleet-updates/{id}/cancel` | admin |
| `GET /api/fleet-updates/{id}` | admin (status polling) |
| `GET /api/version` | public |
Operator and viewer see the "out of date" chip but no update
buttons. Mirrors the existing pattern: read affordances are
visible to all roles, write affordances are gated.
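The table above could be encoded as a lookup keyed by method and route pattern, as in the sketch below. The `role` type, the `requiredRole` map, and `allowed` are hypothetical, as is the choice to deny unknown routes. The existing server presumably already has its own middleware for this.

```go
package main

type role string

const (
	rolePublic   role = "" // no authentication required
	roleViewer   role = "viewer"
	roleOperator role = "operator"
	roleAdmin    role = "admin"
)

// requiredRole encodes the RBAC table from this section.
var requiredRole = map[string]role{
	"POST /api/hosts/{id}/update":         roleAdmin,
	"POST /api/fleet/update":              roleAdmin,
	"POST /api/fleet-updates/{id}/cancel": roleAdmin,
	"GET /api/fleet-updates/{id}":         roleAdmin,
	"GET /api/version":                    rolePublic,
}

// allowed reports whether caller may use the given route. Routes missing
// from the table are denied. Every gated route here needs admin, so a
// direct equality check is enough; this is not a role hierarchy.
func allowed(route string, caller role) bool {
	need, known := requiredRole[route]
	if !known {
		return false
	}
	return need == rolePublic || caller == need
}
```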
## 9. Testing
### 9.1 Unit
- `internal/agent/updater`: fake-`/agent/binary` HTTP server +
tmp "running binary" file, assert post-state — binary swapped,
`.old` present, no leftover `.new`. Linux path only (Windows
helper covered by build-tag compile-only).
- `internal/server/http`: `POST /api/hosts/{id}/update` happy
path, refuses-when-offline, refuses-when-up-to-date,
refuses-when-update-in-progress, RBAC enforcement, audit row
written.
- Hello handler: agent reconnects with matching version after
`update` job dispatch → marks job `succeeded`, drops the
in-memory pending entry. Mismatched version → no-op (timeout
catches it).
- Timeout path: synthetic `update` job + 90s elapsed →
marks `failed`, raises alert.
- Fleet worker: table-driven over the loop's state machine —
success-then-success, success-then-timeout-halts,
cancel-mid-flight, no-online-out-of-date-hosts-completes-immediately,
host-disappears-from-list-mid-loop-skips.
### 9.2 Smoke validation (per CLAUDE.md restage block)
1. Build server + agent at version A. Restage. Enrol a host;
confirm `agent_version=A`.
2. Bump version to B (`make build VERSION=B`), rebuild server
only, restart server. Dashboard shows host as out-of-date with
`A → B` chip. Updates tile reads "1 host behind".
3. Rebuild agent at B, restage `<DataDir>/agent-binaries/`. Click
**Update agent** on host detail. Agent fetches, swaps, exits;
systemd restarts it; hello-back at B → job `succeeded`, chip
gone, tile clears.
4. Rollback path: leave `<DataDir>/agent-binaries/` at A, server
at B, click Update — agent fetches A, swaps to A, restarts at
A; hello says A != B; server marks job `failed` after 90s with
reason "agent reconnected at version A, expected B".
5. Fleet update: spin up two smoke hosts both out-of-date, fire
**Start rolling update**, watch progress page tick host 1 →
host 2 → completed.
6. Halt path: replace one of the `<DataDir>/agent-binaries/`
files with `/bin/false`. Run fleet update. First host gets
broken binary, fails to come back up, fleet update halts at
host 1 after 90s, alert raised, host 2 left as `pending`.
Step 6 validates M2 end-to-end — the rolling halt is the actual
safety guarantee, not a nice-to-have.
## 10. Out of scope
- sha256 digest verification (deferred — see decision 4).
- `restic-manager-agent update` CLI subcommand (deferred —
decision 6).
- Auto-update (deferred — decision 1).
- Auto-rollback watchdog M3 (deferred — decision 3).
- Migrating the agent off `User=root` (separate hardening track).
- Cross-version protocol-compatibility checks beyond the existing
`protocol_version` handshake (P1-11). If the new agent's
`protocol_version` is incompatible with the server, the
existing handshake rejects it; the update job will then
correctly time out and be marked failed.
## 11. Migration plan
1. `internal/version` package + Makefile ldflags wiring.
2. Migration 0021 (jobs.kind widening) + 0022 (fleet_updates
tables).
3. `internal/agent/updater` package, Linux first.
4. WS envelope wiring + `command.update` dispatcher.
5. `POST /api/hosts/{id}/update` + hello-handler integration +
timeout goroutine.
6. UI: chip + per-host update button + dashboard tile + filter.
7. Fleet update worker + page.
8. Windows updater path.
9. Alert engine kinds.
10. Smoke validation per §9.2.
Each step is independently testable; commits should land at each
boundary so a failed Windows path (8) doesn't block the rest of
the work.