spec: P6-01+02 agent self-update + fleet update design
# P6-01 + P6-02 — Agent self-update + fleet update

Status: design approved 2026-05-06.

Scope: P6-01 (agent self-update mechanism) and P6-02 (dashboard
version reporting + fleet update UI). One spec, one branch: the
two tasks are tightly coupled (P6-02 is the operator surface for
the mechanism P6-01 ships).

## 1. Background

P5-03 pivoted release distribution to a single multi-arch server
Docker image, with cross-compiled agent binaries baked under
`/opt/restic-manager/dist/agent-binaries/` and served via
`GET /agent/binary?os=…&arch=…`. The plumbing already does
dual-path lookup: `<DataDir>/agent-binaries/<name>` overrides the
image-baked copy, so an operator can hot-patch a pre-release agent
without rebuilding the image.

That makes the server the natural distribution point for agent
upgrades. "Update agent" collapses to "re-fetch from your own
server": no apt repo, no Chocolatey, no third-party signing infra,
and version pinning is automatic because the server only ever
serves the agent that matches its own release.

This spec wires up the update mechanism end-to-end and the
operator surface that drives it.

## 2. Decisions

| # | Decision | Rationale |
|---|----------|-----------|
| 1 | Operator-driven only; no auto-update | Matches the rest of the app's job-dispatch model; avoids "bad release upgrades every host instantly"; auto-update can be added later as a setting flip if asked |
| 2 | Linux: just exit, let systemd restart. Windows: detached helper script. | Linux supports rename-while-open; Windows holds an exclusive lock on the running .exe |
| 3 | M1 (keep `agent.old` on disk) + M2 (rolling fleet update with halt-on-fail). Skip M3 (auto-rollback watchdog). | M1 is ~5 lines, M2 falls naturally out of P6-02's UI, M3 is a lot of plumbing for "shipped a binary that doesn't start" |
| 4 | Skip sha256 digest verification for v1 | TLS already covers the corruption-in-transit threat; image-tampering is image-build's problem, not the agent's |
| 5 | Exact string version match for "out of date" | With server-bundled binaries there's exactly one canonical version per server image; anything else is out of date by definition |
| 6 | WS envelope only, no `restic-manager-agent update` CLI subcommand | YAGNI; no concrete consumer; the underlying logic is reusable when one appears |

## 3. Wire protocol

### 3.1 Server → agent: `command.update`

```
{
  "type": "command.update",
  "id": "<envelope id>",
  "payload": {
    "job_id": "<ulid>"
  }
}
```

No `os` / `arch` / `version` in the payload: the agent already
knows its own build target and fetches from its configured server
URL via the existing `/agent/binary` handler. Including a target
version would also tempt the agent into version-comparison logic;
keep that on the server side.
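
As a rough illustration of the envelope above, the following Go sketch builds a `command.update` frame. The type and function names (`Envelope`, `UpdatePayload`, `NewUpdateEnvelope`) are hypothetical and are not taken from the codebase; only the JSON field names come from this spec.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Envelope mirrors the frame shape shown in section 3.1.
// The Go type names are illustrative assumptions.
type Envelope struct {
	Type    string          `json:"type"`
	ID      string          `json:"id"`
	Payload json.RawMessage `json:"payload"`
}

// UpdatePayload carries only the job id; os, arch and version are
// deliberately absent so the agent never compares versions itself.
type UpdatePayload struct {
	JobID string `json:"job_id"`
}

// NewUpdateEnvelope builds a command.update frame for the given ids.
func NewUpdateEnvelope(envelopeID, jobID string) ([]byte, error) {
	payload, err := json.Marshal(UpdatePayload{JobID: jobID})
	if err != nil {
		return nil, err
	}
	return json.Marshal(Envelope{Type: "command.update", ID: envelopeID, Payload: payload})
}

func main() {
	raw, err := NewUpdateEnvelope("env-1", "job-1")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))
}
```

Keeping the payload as `json.RawMessage` lets a dispatcher route on `type` before decoding the payload, which is one plausible way to add new command kinds without touching the envelope type.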

### 3.2 Job lifecycle (server-driven)

The agent has limited ability to report on its own restart, so the
job state machine lives on the server:

- **queued → running** when the envelope is dispatched.
- **running → succeeded** when the agent re-hellos with
  `agent_version == server.Version` after dispatch and within
  the timeout. Audit `host.update_succeeded`.
- **running → failed (timeout)** if 90 seconds pass without a
  hello carrying the matching version. Audit `host.update_failed`.
  Raise alert kind `update_failed` (reuses P3-05 alert engine).
  This single transition covers both the "agent never came back
  at all" case and the "agent came back at the wrong version"
  case; see §6.2 for why we don't transition immediately on a
  mismatched hello.

Migration 0021 widens the `jobs.kind` CHECK constraint to include
`update`. Same column-level pattern as 0012 (where 0012 added
`restore` and `diff`).

## 4. Agent-side execution

Lives in `internal/agent/updater`, build-tag split:

- `updater_unix.go`: Linux + any future POSIX target.
- `updater_windows.go`: Windows-only, uses the helper-script
  pattern.
- `updater.go`: shared `Update(ctx, serverURL string) error`
  interface and the HTTP fetch/streaming code (no platform deps).

### 4.1 Linux flow

1. Receive `command.update` from the WS dispatcher.
2. Resolve own binary via `os.Executable()` and `filepath.Abs`.
   Refuse if the resolved path is `/proc/self/exe` or otherwise
   not a real file (defence in depth: this shouldn't happen under
   systemd, but bail loudly if it does).
3. `GET <server>/agent/binary?os=linux&arch=<runtime.GOARCH>`,
   stream to `<binary>.new` in the same directory as the running
   binary (same filesystem ⇒ atomic rename).
4. fsync the file, then `os.Chmod` it to `0755`.
5. Copy current binary to `<binary>.old` (overwrite if it
   exists). This is M1: the one-revision rollback target.
6. `os.Rename(<binary>.new, <binary>)`.
7. Close the WS connection cleanly (sends close frame so the
   server transitions the connection to `disconnected` rather
   than waiting for the heartbeat-miss sweep).
8. `os.Exit(0)`. Systemd's `Restart=always` (already in the unit)
   brings up the new binary within seconds.

### 4.2 Windows flow

The .exe is exclusively locked by the OS while running, so steps
5–6 above can't happen in-process. Use a detached helper:

1. Steps 1–4 are the same as on Linux: fetch into
   `<binary>.exe.new`, fsync.
2. Write `update.cmd` to a tmp path with the orchestration:
   ```
   timeout /t 3 /nobreak >nul
   copy /Y "<binary>.exe" "<binary>.exe.old"
   sc stop restic-manager-agent
   :wait
   sc query restic-manager-agent | find "STOPPED" >nul
   if errorlevel 1 (timeout /t 1 /nobreak >nul & goto wait)
   move /Y "<binary>.exe.new" "<binary>.exe"
   sc start restic-manager-agent
   del "%~f0"
   ```
3. `CreateProcess` it detached
   (`DETACHED_PROCESS | CREATE_NO_WINDOW`, no parent handles).
4. Close WS, `os.Exit(0)`. SCM sees a clean stop and does *not*
   try to restart the service, because stopping is the helper's
   job, not a crash. (Restart semantics differ between systemd
   and SCM: systemd's `Restart=always` restarts the unit however
   the process exits, while SCM treats a clean exit after a stop
   as intentional and only restarts after crashes. That's why the
   helper script needs the explicit `sc start` at the end.)

### 4.3 Service-user assumption

Both Linux (`User=root` per the existing unit) and Windows
(`LocalSystem` by default) can write the binary path directly. If
the agent ever moves to a non-root service user, the updater
breaks and would need either a setuid helper or an out-of-process
update service. Add a `// NOTE:` comment in the updater package
flagging this; not a v1 blocker.

## 5. Server build version

New package `internal/version` exposing two package-level variables
(they must be `var`, not `const`, so that `-ldflags -X` can set
them at link time):

```
package version

var (
	Version = "dev"
	Commit  = ""
)
```

Wired via `-ldflags` in the Makefile:

```
VERSION := $(shell git describe --tags --always --dirty)
COMMIT  := $(shell git rev-parse --short HEAD)

GO_LDFLAGS = -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=$(VERSION) \
             -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Commit=$(COMMIT)
```

Both `cmd/server` and `cmd/agent` link the same package, so an
agent's `agent_version` (sent in the hello payload, already wired
since P1-11) is comparable byte-for-byte to the server's
`version.Version`.

`make build` already does what's needed for source builds. The
Phase 2 work in this spec is the Docker release path: confirm
during plan execution that `.gitea/workflows/release.yml` passes
`VERSION` and `COMMIT` into the Docker `--build-arg` chain so the
in-image binaries embed the same string the image is tagged with.
If not, add the wiring.

Dirty/dev builds (`v1.2.3-dirty`) won't match clean server builds,
so every dev environment will show every host as out-of-date. This
is acceptable: the chip is a no-op in dev, and real ops always run
tagged builds.

A new `GET /api/version` endpoint returns
`{"version": "...", "commit": "..."}`. It is used by the dashboard
header tile and by `/settings/fleet-update`. It sits in the public
band: it exposes no secrets, and the install scripts can surface it
too.

## 6. P6-01 server endpoints

### 6.1 `POST /api/hosts/{id}/update`

Admin-only. Refuses (with structured error code) when:

- Host is offline (`host_offline`).
- Host's `agent_version == server.Version` (`already_up_to_date`).
- An update job for this host is already running (`update_in_progress`).

Happy path: creates `jobs` row with `kind=update`, dispatches
`command.update` envelope, audit-logs `host.update_dispatched`,
returns `{"job_id": "..."}`.

UI form-post variant on `/hosts/{id}/update` returns
`HX-Redirect` to the live job log.
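
The three refusal conditions can be expressed as one pure function, which keeps them easy to table-test. The sketch below assumes the checks run in the order listed above; `hostState` and `refuseCode` are hypothetical names, and only the error codes come from this spec.

```go
package main

import "fmt"

// hostState is a hypothetical projection of what the handler would load.
type hostState struct {
	Online        bool
	AgentVersion  string
	UpdateRunning bool
}

// refuseCode returns the structured error code for a refused request,
// or "" when the update may be dispatched.
func refuseCode(h hostState, serverVersion string) string {
	switch {
	case !h.Online:
		return "host_offline"
	case h.AgentVersion == serverVersion:
		return "already_up_to_date"
	case h.UpdateRunning:
		return "update_in_progress"
	}
	return ""
}

func main() {
	const server = "v1.2.3"
	hosts := []hostState{
		{Online: false, AgentVersion: "v1.2.2"},
		{Online: true, AgentVersion: "v1.2.3"},
		{Online: true, AgentVersion: "v1.2.2", UpdateRunning: true},
		{Online: true, AgentVersion: "v1.2.2"},
	}
	for _, h := range hosts {
		fmt.Printf("%+v -> %q\n", h, refuseCode(h, server))
	}
}
```

The same function could back the disabled-state tooltip on the per-host Update button, so the UI and the API cannot drift apart on why an update is refused.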

### 6.2 Hello handler integration

The existing `onAgentHello` (P1-11) already upserts
`agent_version`. Extend it: after the upsert, look for any
`update` job for this host with `status='running'`. If one
exists:

- `agent_version == server.Version` → mark job `succeeded`,
  audit `host.update_succeeded`.
- `agent_version != server.Version` → leave the job running so
  the timeout path catches it as a rollback failure. Don't fail
  immediately: this gives the agent one chance to come back,
  restart, and hello again with the right version.

This adds a small in-memory map of pending updates so the timeout
goroutine knows when to give up. Persisted state lives in the
`jobs` table; the in-memory map is just for the timer.

## 7. P6-02 fleet update

### 7.1 Schema

Migration 0022 is purely additive: it creates two new tables and
alters no existing ones:

```
CREATE TABLE fleet_updates (
    id                 TEXT PRIMARY KEY,
    started_at         TEXT NOT NULL,
    started_by_user_id TEXT NOT NULL REFERENCES users(id),
    target_version     TEXT NOT NULL,
    status             TEXT NOT NULL CHECK (status IN ('running','completed','halted','cancelled')),
    current_host_id    TEXT REFERENCES hosts(id),
    halted_reason      TEXT,
    completed_at       TEXT
);

CREATE TABLE fleet_update_hosts (
    fleet_update_id TEXT NOT NULL REFERENCES fleet_updates(id) ON DELETE CASCADE,
    host_id         TEXT NOT NULL REFERENCES hosts(id) ON DELETE CASCADE,
    status          TEXT NOT NULL CHECK (status IN ('pending','running','succeeded','failed','skipped')),
    job_id          TEXT REFERENCES jobs(id),
    failed_reason   TEXT,
    PRIMARY KEY (fleet_update_id, host_id)
);
```

### 7.2 Worker loop

A single in-process goroutine runs the loop; at most one fleet
update may run at a time (enforced via a `sync.Mutex` + a
precondition check on `POST /api/fleet/update`).

```
for each pending fleet_update_hosts row in dispatch order:
    set fleet_updates.current_host_id = row.host_id
    set fleet_update_hosts.status = 'running'
    if host.agent_version == server.Version:
        # Already updated since we built the list: skip.
        set status = 'skipped'; continue
    if !host.online:
        # Offline since we built the list: halt.
        halt(reason="host went offline")
        return
    dispatch_update_for_host(host)  # reuses 6.1 logic
    wait_up_to_90s_for_hello_with_matching_version()
    if matched:
        set status = 'succeeded'; continue
    else:
        set status = 'failed', failed_reason = "..."
        halt(reason="update failed on host X")
        return
set fleet_updates.status = 'completed', completed_at = now
```

Halt: set `fleet_updates.status = 'halted'`, raise an alert kind
`fleet_update_halted`, audit `fleet.update_halted` with the host
id and reason. Subsequent hosts stay `pending` so the operator can
see what was queued and decide whether to resume (resume = start a
new fleet update with the still-out-of-date subset).

Cancel: admin-only `POST /api/fleet-updates/{id}/cancel`. Sets
`status='cancelled'`. The currently-dispatched host's update job
keeps running (the agent is already mid-restart); cancel only
prevents the *next* host from being picked. Audit
`fleet.update_cancelled`.
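
The loop, halt and cancel rules above can be sketched as a function over injected dependencies, which is also the shape a table-driven test would want. Everything named here (`hostRow`, `fleetDeps`, `runFleetUpdate`) is hypothetical, database writes are reduced to in-memory status fields, and two points are assumptions the pseudocode leaves open: cancellation is checked before each host is picked, and a host found offline ends as `failed`.

```go
package main

import "fmt"

// hostRow is an in-memory stand-in for a fleet_update_hosts row.
type hostRow struct {
	ID     string
	Status string // pending, running, succeeded, failed, skipped
}

// fleetDeps abstracts what the loop needs from the rest of the server.
type fleetDeps struct {
	ServerVersion string
	AgentVersion  func(hostID string) string
	Online        func(hostID string) bool
	UpdateAndWait func(hostID string) bool // dispatch, then wait for a matching hello
	Cancelled     func() bool
}

// runFleetUpdate walks rows in order and returns the final fleet status
// plus a halt reason (empty unless halted).
func runFleetUpdate(rows []hostRow, d fleetDeps) (status, reason string) {
	for i := range rows {
		if d.Cancelled() { // assumption: checked before picking each host
			return "cancelled", ""
		}
		row := &rows[i]
		row.Status = "running"
		if d.AgentVersion(row.ID) == d.ServerVersion {
			row.Status = "skipped" // already updated since the list was built
			continue
		}
		if !d.Online(row.ID) {
			row.Status = "failed" // assumption: pseudocode leaves this row's end state open
			return "halted", "host went offline"
		}
		if !d.UpdateAndWait(row.ID) {
			row.Status = "failed"
			return "halted", "update failed on host " + row.ID
		}
		row.Status = "succeeded"
	}
	return "completed", ""
}

func main() {
	rows := []hostRow{{"a", "pending"}, {"b", "pending"}, {"c", "pending"}}
	deps := fleetDeps{
		ServerVersion: "v1.2.3",
		AgentVersion:  func(string) string { return "v1.2.2" },
		Online:        func(string) bool { return true },
		UpdateAndWait: func(id string) bool { return id != "b" }, // host b never comes back
		Cancelled:     func() bool { return false },
	}
	status, reason := runFleetUpdate(rows, deps)
	fmt.Println(status, reason, rows)
}
```

In the demo, host `a` succeeds, host `b` fails and halts the run, and host `c` is never touched, so it stays `pending` for the operator to see.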

### 7.3 UI surfaces

**Per-host chip (host_row partial + host detail chrome):**

`out of date · v1.2.2 → v1.2.3`, amber-accented, mirrors `.tag`
token shape. Only rendered when:

```
host.agent_version != "" && host.agent_version != server.Version
```

Empty `agent_version` (host enrolled but never connected) renders
nothing rather than "out of date", because we don't know what
version the host has.

**Dashboard summary tile:**

The hero strip already has tiles. Add an "Updates" tile:
`N hosts behind` linking to `/?updates=behind` (extends NS-04's
filter machinery by adding an `updates` query param alongside
`status`/`repo_status`/`tag`). Hidden when N == 0.

**Per-host Update button on `/hosts/{id}`:**

Right-rail, admin-only. Disabled with hover tooltip when host
offline / already up to date / update in progress. POSTs to
`/hosts/{id}/update`, `HX-Redirect` to the live job log.

**Fleet update page `/settings/fleet-update`:**

Admin-only. Two states:

- **Idle**: lists out-of-date online hosts (table: hostname,
  current version, target version, last seen). Big "Start rolling
  update" button behind a typed-confirm dialog (operator types
  the host count, e.g. `12`, to enable the button; same shape as
  the host-delete confirm).
- **Running/halted/completed**: shows the currently-active
  fleet_update row + per-host progress list. Polls every 3s (htmx
  trigger conditional on `document.visibilityState === 'visible'`,
  same pattern as the alerts page). Renders:
  ```
  Updated 3/12 · currently updating <hostname>
  Halted on <hostname>: <reason> · job log →
  ```

Audit actions: `fleet.update_started`, `fleet.update_completed`,
`fleet.update_halted`, `fleet.update_cancelled`.

### 7.4 Alert engine integration

P3-05's alert engine already supports kind-based registration. Add
two new kinds:

- `update_failed`: per-host, raised on individual update failure.
  Auto-resolves when the host re-hellos with the matching version.
- `fleet_update_halted`: global, raised on fleet halt. Auto-resolves
  when a subsequent fleet update completes successfully.

## 8. RBAC

| Endpoint | Role |
|----------|------|
| `POST /api/hosts/{id}/update` | admin |
| `POST /api/fleet/update` | admin |
| `POST /api/fleet-updates/{id}/cancel` | admin |
| `GET /api/fleet-updates/{id}` | admin (status polling) |
| `GET /api/version` | public |

Operator and viewer see the "out of date" chip but no update
buttons. Mirrors the existing pattern: read affordances are
visible to all roles, write affordances are gated.

## 9. Testing

### 9.1 Unit

- `internal/agent/updater`: fake-`/agent/binary` HTTP server +
  tmp "running binary" file, assert the post-state: binary swapped,
  `.old` present, no leftover `.new`. Linux path only (Windows
  helper covered by build-tag compile-only).
- `internal/server/http`: `POST /api/hosts/{id}/update` happy
  path, refuses-when-offline, refuses-when-up-to-date,
  refuses-when-update-in-progress, RBAC enforcement, audit row
  written.
- Hello handler: agent reconnects with matching version after
  `update` job dispatch → marks job `succeeded`, drops the
  in-memory pending entry. Mismatched version → no-op (timeout
  catches it).
- Timeout path: synthetic `update` job + 90s elapsed →
  marks `failed`, raises alert.
- Fleet worker: table-driven over the loop's state machine:
  success-then-success, success-then-timeout-halts,
  cancel-mid-flight, no-online-out-of-date-hosts-completes-immediately,
  host-disappears-from-list-mid-loop-skips.

### 9.2 Smoke validation (per CLAUDE.md restage block)

1. Build server + agent at version A. Restage. Enrol a host;
   confirm `agent_version=A`.
2. Bump version to B (`make build VERSION=B`), rebuild server
   only, restart server. Dashboard shows host as out-of-date with
   `A → B` chip. Updates tile reads "1 host behind".
3. Rebuild agent at B, restage `<DataDir>/agent-binaries/`. Click
   **Update agent** on host detail. Agent fetches, swaps, exits;
   systemd restarts it; hello-back at B → job `succeeded`, chip
   gone, tile clears.
4. Rollback path: leave `<DataDir>/agent-binaries/` at A, server
   at B, click Update. The agent fetches A, swaps to A, restarts
   at A; hello says A != B; server marks job `failed` after 90s
   with reason "agent reconnected at version A, expected B".
5. Fleet update: spin up two smoke hosts both out-of-date, fire
   **Start rolling update**, watch progress page tick host 1 →
   host 2 → completed.
6. Halt path: replace one of the `<DataDir>/agent-binaries/`
   files with `/bin/false`. Run fleet update. First host gets
   broken binary, fails to come back up, fleet update halts at
   host 1 after 90s, alert raised, host 2 left as `pending`.

Step 6 validates M2 end-to-end: the rolling halt is the actual
safety guarantee, not a nice-to-have.

## 10. Out of scope

- sha256 digest verification (deferred; see decision 4).
- `restic-manager-agent update` CLI subcommand (deferred;
  decision 6).
- Auto-update (deferred; decision 1).
- Auto-rollback watchdog M3 (deferred; decision 3).
- Migrating the agent off `User=root` (separate hardening track).
- Cross-version protocol-compatibility checks beyond the existing
  `protocol_version` handshake (P1-11). If the new agent's
  `protocol_version` is incompatible with the server, the
  existing handshake rejects it; the update job will then
  correctly time out and be marked failed.

## 11. Migration plan

1. `internal/version` package + Makefile ldflags wiring.
2. Migration 0021 (jobs.kind widening) + 0022 (fleet_updates
   tables).
3. `internal/agent/updater` package, Linux first.
4. WS envelope wiring + `command.update` dispatcher.
5. `POST /api/hosts/{id}/update` + hello-handler integration +
   timeout goroutine.
6. UI: chip + per-host update button + dashboard tile + filter.
7. Fleet update worker + page.
8. Windows updater path.
9. Alert engine kinds.
10. Smoke validation per §9.2.

Each step is independently testable; commits should land at each
boundary so a failed Windows path (8) doesn't block the rest of
the work.