P6-01 + P6-02 — Agent self-update + fleet update
Status: design approved 2026-05-06. Scope: P6-01 (agent self-update mechanism) and P6-02 (dashboard version reporting + fleet update UI). One spec, one branch — the two tasks are tightly coupled (P6-02 is the operator surface for the mechanism P6-01 ships).
1. Background
P5-03 pivoted release distribution to a single multi-arch server
Docker image, with cross-compiled agent binaries baked under
/opt/restic-manager/dist/agent-binaries/ and served via
GET /agent/binary?os=…&arch=…. The plumbing already does
dual-path lookup: <DataDir>/agent-binaries/<name> overrides the
image-baked copy, so an operator can hot-patch a pre-release agent
without rebuilding the image.
That makes the server the natural distribution point for agent upgrades. "Update agent" collapses to "re-fetch from your own server" — no apt repo, no Chocolatey, no third-party signing infra, and version pinning is automatic because the server only ever serves the agent that matches its own release.
This spec wires up the update mechanism end-to-end and the operator surface that drives it.
2. Decisions
| # | Decision | Rationale |
|---|---|---|
| 1 | Operator-driven only — no auto-update | Matches the rest of the app's job-dispatch model; avoids "bad release upgrades every host instantly"; auto-update can be added later as a setting flip if asked |
| 2 | Linux: just exit, let systemd restart. Windows: detached helper script. | Linux supports rename-while-open; Windows holds an exclusive lock on the running .exe |
| 3 | M1 (keep agent.old on disk) + M2 (rolling fleet update with halt-on-fail). Skip M3 (auto-rollback watchdog). | M1 is ~5 lines, M2 falls naturally out of P6-02's UI, M3 is a lot of plumbing for "shipped a binary that doesn't start" |
| 4 | Skip sha256 digest verification for v1 | TLS already covers the corruption-in-transit threat; image-tampering is image-build's problem, not the agent's |
| 5 | Exact string version match for "out of date" | With server-bundled binaries there's exactly one canonical version per server image — anything else is out of date by definition |
| 6 | WS envelope only, no restic-manager-agent update CLI subcommand | YAGNI; no concrete consumer; the underlying logic is reusable when one appears |
3. Wire protocol
3.1 Server → agent: command.update
{
  "type": "command.update",
  "id": "<envelope id>",
  "payload": {
    "job_id": "<ulid>"
  }
}
No os / arch / version in the payload — the agent already
knows its own build target and fetches from its configured server
URL via the existing /agent/binary handler. Including a target
version would also tempt the agent into version-comparison logic;
keep that on the server side.
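As an illustration of how small this message is, here is a minimal Go sketch of the envelope. Only the JSON keys come from the spec; the type names, the helper names (newUpdateCommand, encode) and the example ids are assumptions made for the sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Envelope mirrors the command.update message above. The Go type and field
// names are hypothetical; only the JSON keys are taken from the spec.
type Envelope struct {
	Type    string        `json:"type"`
	ID      string        `json:"id"`
	Payload UpdatePayload `json:"payload"`
}

// UpdatePayload deliberately carries only the job id: no os, arch or version.
type UpdatePayload struct {
	JobID string `json:"job_id"`
}

// newUpdateCommand builds the envelope the server would send to one agent.
func newUpdateCommand(envelopeID, jobID string) Envelope {
	return Envelope{
		Type:    "command.update",
		ID:      envelopeID,
		Payload: UpdatePayload{JobID: jobID},
	}
}

// encode marshals an envelope to its JSON wire form.
func encode(e Envelope) (string, error) {
	b, err := json.Marshal(e)
	return string(b), err
}

func main() {
	wire, err := encode(newUpdateCommand("env-1", "job-1"))
	if err != nil {
		panic(err)
	}
	fmt.Println(wire)
}
```

Keeping the payload to a single field means the agent has nothing to compare against, which is the point of the paragraph above.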
3.2 Job lifecycle (server-driven)
The agent has limited ability to report on its own restart, so the job state machine lives on the server:
- queued → running when the envelope is dispatched.
- running → succeeded when the agent re-hellos with agent_version == server.Version after dispatch and within the timeout. Audit host.update_succeeded.
- running → failed (timeout) if 90 seconds pass without a hello carrying the matching version. Audit host.update_failed. Raise alert kind update_failed (reuses P3-05 alert engine). This single transition covers both the "agent never came back at all" case and the "agent came back at the wrong version" case; see §6.2 for why we don't transition immediately on a mismatched hello.
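The three transitions above can be sketched as a pure function. This is an illustration only: the Status and Event types, their constant names, and the function name next are assumptions, and auditing and alerting are left out.

```go
package main

import "fmt"

// Status values match the job states named in the list above.
type Status string

const (
	Queued    Status = "queued"
	Running   Status = "running"
	Succeeded Status = "succeeded"
	Failed    Status = "failed"
)

// Event is a hypothetical enumeration of what the server can observe.
type Event int

const (
	Dispatched      Event = iota // envelope sent to the agent
	HelloMatching                // hello with agent_version == server.Version
	HelloMismatched              // hello with any other version
	TimedOut                     // 90 seconds passed without a matching hello
)

// next returns the job status after an event. Any combination not listed in
// the spec leaves the status unchanged; in particular a mismatched hello keeps
// the job running so the timeout is the only path to failed.
func next(s Status, e Event) Status {
	switch {
	case s == Queued && e == Dispatched:
		return Running
	case s == Running && e == HelloMatching:
		return Succeeded
	case s == Running && e == TimedOut:
		return Failed
	}
	return s
}

func main() {
	s := Queued
	for _, e := range []Event{Dispatched, HelloMismatched, HelloMatching} {
		s = next(s, e)
		fmt.Println(s)
	}
}
```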
Migration 0021 widens the jobs.kind CHECK constraint to include
update. Same column-level pattern as 0012 (where 0012 added
restore and diff).
4. Agent-side execution
Lives in internal/agent/updater, build-tag split:
- updater_unix.go: Linux + any future POSIX target.
- updater_windows.go: Windows-only, uses the helper-script pattern.
- updater.go: shared Update(ctx, serverURL string) error interface and the HTTP fetch/streaming code (no platform deps).
4.1 Linux flow
1. Receive command.update from the WS dispatcher.
2. Resolve own binary via os.Executable() and filepath.Abs. Refuse if the resolved path is /proc/self/exe or otherwise not a real file (defence in depth: shouldn't happen under systemd, but bail loudly if it does).
3. GET <server>/agent/binary?os=linux&arch=<runtime.GOARCH>, stream to <binary>.new in the same directory as the running binary (same filesystem ⇒ atomic rename).
4. fsync the file, os.Chmod(0755).
5. Copy current binary to <binary>.old (overwrite if it exists). M1: one-revision rollback target.
6. os.Rename(<binary>.new, <binary>).
7. Close the WS connection cleanly (sends close frame so the server transitions the connection to disconnected rather than waiting for the heartbeat-miss sweep).
8. os.Exit(0). Systemd's Restart=always (already in the unit) brings up the new binary within seconds.
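The file-handling part of the Linux flow (stage to .new, fsync, chmod, keep .old, rename) might look roughly like the sketch below. It is not the updater package itself: the function names stage, swap and demo are hypothetical, the HTTP fetch is replaced by an io.Reader, and the .old copy reads the whole file into memory for brevity.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// stage streams the new build to <binary>.new in the same directory, fsyncs
// it, and marks it executable.
func stage(binary string, src io.Reader) (string, error) {
	tmp := binary + ".new"
	f, err := os.OpenFile(tmp, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
	if err != nil {
		return "", err
	}
	_, err = io.Copy(f, src)
	if err == nil {
		err = f.Sync()
	}
	if cerr := f.Close(); err == nil {
		err = cerr
	}
	if err == nil {
		err = os.Chmod(tmp, 0o755)
	}
	if err != nil {
		os.Remove(tmp) // never leave a half-written .new behind
		return "", err
	}
	return tmp, nil
}

// swap keeps the current binary as <binary>.old (M1) and then renames the
// staged file over it. The rename is atomic because both paths share a
// directory, and therefore a filesystem.
func swap(binary string, src io.Reader) error {
	tmp, err := stage(binary, src)
	if err != nil {
		return err
	}
	old, err := os.ReadFile(binary) // whole-file copy: acceptable in a sketch
	if err == nil {
		err = os.WriteFile(binary+".old", old, 0o755)
	}
	if err != nil {
		os.Remove(tmp)
		return err
	}
	return os.Rename(tmp, binary)
}

// demo runs swap against a throwaway file standing in for the agent binary
// and returns the resulting contents of <binary> and <binary>.old.
func demo() (string, string, error) {
	dir, err := os.MkdirTemp("", "updater-demo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)
	bin := filepath.Join(dir, "agent")
	if err := os.WriteFile(bin, []byte("build A"), 0o755); err != nil {
		return "", "", err
	}
	if err := swap(bin, strings.NewReader("build B")); err != nil {
		return "", "", err
	}
	cur, err := os.ReadFile(bin)
	if err != nil {
		return "", "", err
	}
	old, err := os.ReadFile(bin + ".old")
	return string(cur), string(old), err
}

func main() {
	cur, old, err := demo()
	fmt.Println(cur, old, err)
}
```

In the real flow the reader would be the response body of the /agent/binary request, and the caller would close the WS connection and exit only after swap returns nil.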
4.2 Windows flow
The .exe is exclusively locked by the OS while running, so steps 5–6 above can't happen in-process. Use a detached helper:
1. Steps 1–4 the same: fetch into <binary>.exe.new, fsync.
2. Write update.cmd to a tmp path with the orchestration:

       timeout /t 3 /nobreak >nul
       copy /Y "<binary>.exe" "<binary>.exe.old"
       sc stop restic-manager-agent
       :wait
       sc query restic-manager-agent | find "STOPPED" >nul
       if errorlevel 1 (timeout /t 1 /nobreak >nul & goto wait)
       move /Y "<binary>.exe.new" "<binary>.exe"
       sc start restic-manager-agent
       del "%~f0"

3. CreateProcess it detached (DETACHED_PROCESS | CREATE_NO_WINDOW, no parent handles).
4. Close WS, os.Exit(0). SCM sees a clean stop and waits; it does not try to restart, because sc stop is the helper's job, not a crash.

(Restart=always semantics differ between systemd and SCM. SCM treats clean-exit-after-stop as intentional and does not auto-restart; only crashes restart. That's why the helper script needs the explicit sc start at the end.)
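Generating the helper script is plain string assembly, so it can live in platform-neutral code and be unit tested on Linux. A minimal sketch, assuming a hypothetical renderUpdateScript function and CRLF line endings (a choice made here, not something the spec states):

```go
package main

import (
	"fmt"
	"strings"
)

// renderUpdateScript returns the update.cmd body for one agent executable and
// service name. Function and parameter names are hypothetical.
func renderUpdateScript(exe, service string) string {
	lines := []string{
		"timeout /t 3 /nobreak >nul", // give the agent time to exit
		fmt.Sprintf(`copy /Y "%s" "%s.old"`, exe, exe),
		"sc stop " + service,
		":wait",
		fmt.Sprintf(`sc query %s | find "STOPPED" >nul`, service),
		"if errorlevel 1 (timeout /t 1 /nobreak >nul & goto wait)",
		fmt.Sprintf(`move /Y "%s.new" "%s"`, exe, exe),
		"sc start " + service,
		`del "%~f0"`, // kept out of Sprintf so the % is not read as a verb
	}
	return strings.Join(lines, "\r\n") + "\r\n"
}

func main() {
	fmt.Print(renderUpdateScript(`C:\agent\agent.exe`, "restic-manager-agent"))
}
```

Writing the script to disk and launching it detached stays in updater_windows.go; only the text generation is shown here.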
4.3 Service-user assumption
Both Linux (User=root per the existing unit) and Windows
(LocalSystem by default) can write the binary path directly. If
the agent ever moves to a non-root service user, the updater
breaks — would need either a setuid helper or an out-of-process
update service. Add a // NOTE: comment in the updater package
flagging this; not a v1 blocker.
5. Server build version
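New package internal/version exposing two package-level string variables (variables, not constants, because the linker's -X flag can only set string variables):
package version
var (
	Version = "dev"
	Commit  = ""
)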
Wired via -ldflags in the Makefile:
GO_LDFLAGS = -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=$(VERSION) \
-X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Commit=$(COMMIT)
VERSION := $(shell git describe --tags --always --dirty)
COMMIT := $(shell git rev-parse --short HEAD)
Both cmd/server and cmd/agent link the same package, so an
agent's agent_version (sent in the hello payload, already wired
since P1-11) is comparable byte-for-byte to the server's
version.Version.
make build already does what's needed for source builds. The
Phase 2 work in this spec is the Docker release path — confirm
during plan execution that .gitea/workflows/release.yml passes
VERSION and COMMIT into the Docker --build-arg chain so the
in-image binaries embed the same string the image is tagged with.
If not, add the wiring.
Dirty/dev builds (v1.2.3-dirty) won't match clean server builds,
so every dev environment will show every host as out-of-date. This
is acceptable — the chip is a noop in dev, real ops always run
tagged builds.
A new GET /api/version endpoint returns
{"version": "...", "commit": "..."}. Used by the dashboard
header tile and by /settings/fleet-update. Public-band — exposes
no secrets, lets the install scripts surface it too.
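A handler for this endpoint could be as small as the sketch below. The handler name, the versionInfo struct, the fetchVersion helper and the method check are assumptions; the two variables stand in for internal/version so the example is self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Stand-ins for internal/version.Version and internal/version.Commit.
var (
	Version = "dev"
	Commit  = ""
)

// versionInfo fixes the JSON field order to match the shape shown above.
type versionInfo struct {
	Version string `json:"version"`
	Commit  string `json:"commit"`
}

// versionHandler serves GET /api/version without any authentication.
func versionHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodGet {
		w.Header().Set("Allow", http.MethodGet)
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(versionInfo{Version: Version, Commit: Commit})
}

// fetchVersion exercises the handler in-process and returns status and body.
func fetchVersion() (int, string) {
	rec := httptest.NewRecorder()
	versionHandler(rec, httptest.NewRequest(http.MethodGet, "/api/version", nil))
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := fetchVersion()
	fmt.Println(code, body)
}
```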
6. P6-01 server endpoints
6.1 POST /api/hosts/{id}/update
Admin-only. Refuses (with structured error code) when:
- Host is offline (host_offline).
- Host's agent_version == server.Version (already_up_to_date).
- An update job for this host is already running (update_in_progress).
Happy path: creates jobs row with kind=update, dispatches
command.update envelope, audit-logs host.update_dispatched,
returns {"job_id": "..."}.
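The refusal checks can be expressed as one function that returns the structured error code, or an empty string when dispatch may go ahead. This is a sketch: the Host struct, its field names, and the order in which the checks run are assumptions (the order simply follows the list above).

```go
package main

import "fmt"

// Host holds only the fields the precondition check needs. The struct and
// field names are hypothetical.
type Host struct {
	Online        bool
	AgentVersion  string
	UpdateRunning bool // an update job for this host has status 'running'
}

// refuseUpdate returns the error code for a refused request, or "" when the
// update may be dispatched.
func refuseUpdate(h Host, serverVersion string) string {
	switch {
	case !h.Online:
		return "host_offline"
	case h.AgentVersion == serverVersion:
		return "already_up_to_date"
	case h.UpdateRunning:
		return "update_in_progress"
	}
	return ""
}

func main() {
	h := Host{Online: true, AgentVersion: "v1.2.2"}
	fmt.Printf("%q\n", refuseUpdate(h, "v1.2.3"))
}
```

Keeping this free of HTTP and database types would let both the per-host endpoint and the fleet worker share it.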
UI form-post variant on /hosts/{id}/update returns
HX-Redirect to the live job log.
6.2 Hello handler integration
The existing onAgentHello (P1-11) already upserts
agent_version. Extend it: after the upsert, look for any
update job for this host with status='running'. If one
exists:
- agent_version == server.Version → mark job succeeded, audit host.update_succeeded.
- agent_version != server.Version → leave the job running so the timeout path catches it as a rollback failure (don't fail immediately; this gives the agent one chance to come back, restart, hello again with the right version).
Adds a small in-memory map of pending updates so the timeout
goroutine knows when to give up. Persisted state lives in the
jobs table; the in-memory map is just for the timer.
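One possible shape for the in-memory map is a mutex-guarded map from host id to a time.Timer, as sketched below. All names (pendingUpdates, Track, OnHello, demo) are hypothetical, the timeout is a parameter so the demo does not have to wait 90 seconds, and persisting the job status is left to the callbacks.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pendingUpdates tracks hosts with a running update job. It holds timers
// only; the jobs table remains the source of truth.
type pendingUpdates struct {
	mu     sync.Mutex
	timers map[string]*time.Timer // host id -> timeout timer
}

func newPendingUpdates() *pendingUpdates {
	return &pendingUpdates{timers: map[string]*time.Timer{}}
}

// Track starts the timeout clock for a host. onTimeout runs at most once, and
// only if no matching hello removed the entry first.
func (p *pendingUpdates) Track(hostID string, timeout time.Duration, onTimeout func()) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.timers[hostID] = time.AfterFunc(timeout, func() {
		p.mu.Lock()
		_, still := p.timers[hostID]
		delete(p.timers, hostID)
		p.mu.Unlock()
		if still {
			onTimeout() // caller marks the job failed and raises the alert
		}
	})
}

// OnHello reports whether this hello completes a pending update. A mismatched
// version is a no-op: the entry stays so the timeout can still fire.
func (p *pendingUpdates) OnHello(hostID, agentVersion, serverVersion string) bool {
	if agentVersion != serverVersion {
		return false
	}
	p.mu.Lock()
	defer p.mu.Unlock()
	t, ok := p.timers[hostID]
	if !ok {
		return false
	}
	t.Stop()
	delete(p.timers, hostID)
	return true
}

type demoResult struct{ mismatched, matched, repeated, timedOut bool }

// demo walks one host through mismatch then match, and lets another time out.
func demo() demoResult {
	p := newPendingUpdates()
	var r demoResult
	p.Track("host-1", time.Hour, func() {})
	r.mismatched = p.OnHello("host-1", "v1.2.2", "v1.2.3")
	r.matched = p.OnHello("host-1", "v1.2.3", "v1.2.3")
	r.repeated = p.OnHello("host-1", "v1.2.3", "v1.2.3")

	fired := make(chan struct{})
	p.Track("host-2", 10*time.Millisecond, func() { close(fired) })
	select {
	case <-fired:
		r.timedOut = true
	case <-time.After(2 * time.Second):
	}
	return r
}

func main() {
	fmt.Printf("%+v\n", demo())
}
```

After a server restart the map is empty, so running update jobs found in the jobs table would need their timers re-armed or be failed outright; the spec does not say which, and the sketch does not address it.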
7. P6-02 fleet update
7.1 Schema
Migration 0022, column-level adds only:
CREATE TABLE fleet_updates (
    id                 TEXT PRIMARY KEY,
    started_at         TEXT NOT NULL,
    started_by_user_id TEXT NOT NULL REFERENCES users(id),
    target_version     TEXT NOT NULL,
    status             TEXT NOT NULL CHECK (status IN ('running','completed','halted','cancelled')),
    current_host_id    TEXT REFERENCES hosts(id),
    halted_reason      TEXT,
    completed_at       TEXT
);
CREATE TABLE fleet_update_hosts (
    fleet_update_id TEXT NOT NULL REFERENCES fleet_updates(id) ON DELETE CASCADE,
    host_id         TEXT NOT NULL REFERENCES hosts(id) ON DELETE CASCADE,
    status          TEXT NOT NULL CHECK (status IN ('pending','running','succeeded','failed','skipped')),
    job_id          TEXT REFERENCES jobs(id),
    failed_reason   TEXT,
    PRIMARY KEY (fleet_update_id, host_id)
);
7.2 Worker loop
A single in-process goroutine — at most one fleet update may run
at a time (enforced via a sync.Mutex + a precondition check on
POST /api/fleet/update).
for each pending fleet_update_hosts row in dispatch order:
    set fleet_updates.current_host_id = row.host_id
    set fleet_update_hosts.status = 'running'
    if host.agent_version == server.Version:
        # Already updated since we built the list: skip.
        set status = 'skipped'; continue
    if !host.online:
        # Offline since we built the list: halt.
        halt(reason="host went offline")
        return
    dispatch_update_for_host(host)  # reuses 6.1 logic
    wait_up_to_90s_for_hello_with_matching_version()
    if matched:
        set status = 'succeeded'; continue
    else:
        set status = 'failed', failed_reason = "..."
        halt(reason="update failed on host X")
        return
set fleet_updates.status = 'completed', completed_at = now
Halt: set fleet_updates.status = 'halted', raise an alert kind
fleet_update_halted, audit fleet.update_halted with the host
id and reason. Subsequent hosts stay pending so the operator can
see what was queued and decide whether to resume (resume = start a
new fleet update with the still-out-of-date subset).
Cancel: admin-only POST /api/fleet-updates/{id}/cancel. Sets
status='cancelled'. The currently-dispatched host's update job
keeps running (the agent is already mid-restart) — cancel only
prevents the next host from being picked. Audit
fleet.update_cancelled.
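The loop, halt and cancel rules above can be sketched in Go with the side effects injected as functions, which is also what makes the table-driven test in §9.1 practical. Everything named here (HostRow, Deps, runFleetUpdate, demo) is hypothetical; database writes, alerts and audit rows are omitted.

```go
package main

import "fmt"

// HostRow is one fleet_update_hosts row, reduced to what the loop touches.
type HostRow struct {
	HostID string
	Status string // pending, running, succeeded, failed, skipped
}

// Deps injects everything the loop needs from the rest of the server.
type Deps struct {
	ServerVersion string
	AgentVersion  func(hostID string) string
	Online        func(hostID string) bool
	UpdateAndWait func(hostID string) bool // dispatch, then wait up to 90s for a matching hello
	Cancelled     func() bool
}

// runFleetUpdate walks the rows in order and returns the final fleet status
// plus the halt reason, if any.
func runFleetUpdate(rows []HostRow, d Deps) (status, reason string) {
	for i := range rows {
		// Cancel only stops the next host from being picked.
		if d.Cancelled() {
			return "cancelled", ""
		}
		row := &rows[i]
		row.Status = "running"
		switch {
		case d.AgentVersion(row.HostID) == d.ServerVersion:
			row.Status = "skipped" // updated since the list was built
		case !d.Online(row.HostID):
			// The pseudocode does not reset the row here, so neither does this sketch.
			return "halted", "host went offline"
		case d.UpdateAndWait(row.HostID):
			row.Status = "succeeded"
		default:
			row.Status = "failed"
			return "halted", "update failed on host " + row.HostID
		}
	}
	return "completed", ""
}

// demo runs four hosts: one already current, one that updates cleanly, one
// that never comes back, and one that is therefore never reached.
func demo() (string, string, []HostRow) {
	versions := map[string]string{"h1": "v1.2.3", "h2": "v1.2.2", "h3": "v1.2.2", "h4": "v1.2.2"}
	rows := []HostRow{{"h1", "pending"}, {"h2", "pending"}, {"h3", "pending"}, {"h4", "pending"}}
	d := Deps{
		ServerVersion: "v1.2.3",
		AgentVersion:  func(id string) string { return versions[id] },
		Online:        func(string) bool { return true },
		UpdateAndWait: func(id string) bool { return id != "h3" },
		Cancelled:     func() bool { return false },
	}
	status, reason := runFleetUpdate(rows, d)
	return status, reason, rows
}

func main() {
	status, reason, rows := demo()
	fmt.Println(status, reason, rows)
}
```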
7.3 UI surfaces
Per-host chip (host_row partial + host detail chrome):
out of date · v1.2.2 → v1.2.3 — amber-accented, mirrors .tag
token shape. Only rendered when:
host.agent_version != "" && host.agent_version != server.Version
Empty agent_version (host enrolled but never connected) renders
nothing rather than "out of date" — we don't know what version
they have.
Dashboard summary tile:
The hero strip already has tiles. Add an "Updates" tile:
N hosts behind linking to /?updates=behind (extends NS-04's
filter machinery — adds an updates query param alongside
status/repo_status/tag). Hidden when N == 0.
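The chip rule and the tile count reduce to one string comparison. In the sketch below the function names are hypothetical, and it is an assumption that the tile counts exactly the hosts for which the chip would render (so never-connected hosts are not counted).

```go
package main

import "fmt"

// chipText returns the chip label, or "" when no chip should be rendered:
// either the host never reported a version or it matches the server exactly.
func chipText(agentVersion, serverVersion string) string {
	if agentVersion == "" || agentVersion == serverVersion {
		return ""
	}
	return "out of date · " + agentVersion + " → " + serverVersion
}

// hostsBehind counts the hosts that would show a chip. The tile is hidden
// when this returns 0.
func hostsBehind(agentVersions []string, serverVersion string) int {
	n := 0
	for _, v := range agentVersions {
		if chipText(v, serverVersion) != "" {
			n++
		}
	}
	return n
}

func main() {
	fleet := []string{"v1.2.2", "", "v1.2.3", "v1.2.3-dirty"}
	fmt.Println(chipText("v1.2.2", "v1.2.3"))
	fmt.Println(hostsBehind(fleet, "v1.2.3"))
}
```

Because the match is an exact string comparison, a v1.2.3-dirty agent counts as behind a v1.2.3 server, which is the dev-environment behaviour §5 accepts.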
Per-host Update button on /hosts/{id}:
Right-rail, admin-only. Disabled with hover tooltip when host
offline / already up to date / update in progress. POSTs to
/hosts/{id}/update, HX-Redirect to the live job log.
Fleet update page /settings/fleet-update:
Admin-only. Two states:
- Idle: lists out-of-date online hosts (table: hostname, current version, target version, last seen). Big "Start rolling update" button behind a typed-confirm dialog (operator types the host count, e.g. 12, to enable the button; same shape as the host-delete confirm).
- Running/halted/completed: shows the currently-active fleet_update row + per-host progress list. Polls every 3s (htmx trigger conditional on document.visibilityState === 'visible', same pattern as the alerts page). Renders:

      Updated 3/12 · currently updating <hostname>
      Halted on <hostname>: <reason> · job log →

Audit actions: fleet.update_started, fleet.update_completed,
fleet.update_halted, fleet.update_cancelled.
7.4 Alert engine integration
P3-05's alert engine already supports kind-based registration. Add two new kinds:
- update_failed: per-host, raised on individual update failure. Auto-resolves when the host re-hellos with the matching version.
- fleet_update_halted: global, raised on fleet halt. Auto-resolves when a subsequent fleet update completes successfully.
8. RBAC
| Endpoint | Role |
|---|---|
| POST /api/hosts/{id}/update | admin |
| POST /api/fleet/update | admin |
| POST /api/fleet-updates/{id}/cancel | admin |
| GET /api/fleet-updates/{id} | admin (status polling) |
| GET /api/version | public |
Operator and viewer see the "out of date" chip but no update buttons. Mirrors the existing pattern: read affordances are visible to all roles, write affordances are gated.
9. Testing
9.1 Unit
- internal/agent/updater: fake /agent/binary HTTP server + tmp "running binary" file, assert post-state: binary swapped, .old present, no leftover .new. Linux path only (Windows helper covered by build-tag compile-only).
- internal/server/http: POST /api/hosts/{id}/update happy path, refuses-when-offline, refuses-when-up-to-date, refuses-when-update-in-progress, RBAC enforcement, audit row written.
- Hello handler: agent reconnects with matching version after update job dispatch → marks job succeeded, drops the in-memory pending entry. Mismatched version → no-op (timeout catches it).
- Timeout path: synthetic update job + 90s elapsed → marks failed, raises alert.
- Fleet worker: table-driven over the loop's state machine: success-then-success, success-then-timeout-halts, cancel-mid-flight, no-online-out-of-date-hosts-completes-immediately, host-disappears-from-list-mid-loop-skips.
9.2 Smoke validation (per CLAUDE.md restage block)
1. Build server + agent at version A. Restage. Enrol a host; confirm agent_version=A.
2. Bump version to B (make build VERSION=B), rebuild server only, restart server. Dashboard shows host as out-of-date with A → B chip. Updates tile reads "1 host behind".
3. Rebuild agent at B, restage <DataDir>/agent-binaries/. Click Update agent on host detail. Agent fetches, swaps, exits; systemd restarts it; hello-back at B → job succeeded, chip gone, tile clears.
4. Rollback path: leave <DataDir>/agent-binaries/ at A, server at B, click Update. Agent fetches A, swaps to A, restarts at A; hello says A != B; server marks job failed after 90s with reason "agent reconnected at version A, expected B".
5. Fleet update: spin up two smoke hosts both out-of-date, fire Start rolling update, watch progress page tick host 1 → host 2 → completed.
6. Halt path: replace one of the <DataDir>/agent-binaries/ files with /bin/false. Run fleet update. First host gets broken binary, fails to come back up, fleet update halts at host 1 after 90s, alert raised, host 2 left as pending.
Step 6 validates M2 end-to-end — the rolling halt is the actual safety guarantee, not a nice-to-have.
10. Out of scope
- sha256 digest verification (deferred, see decision 4).
- restic-manager-agent update CLI subcommand (deferred, decision 6).
- Auto-update (deferred, decision 1).
- Auto-rollback watchdog M3 (deferred, decision 3).
- Migrating the agent off User=root (separate hardening track).
- Cross-version protocol-compatibility checks beyond the existing protocol_version handshake (P1-11). If the new agent's protocol_version is incompatible with the server, the existing handshake rejects it; the update job will then correctly time out and be marked failed.
11. Migration plan
1. internal/version package + Makefile ldflags wiring.
2. Migration 0021 (jobs.kind widening) + 0022 (fleet_updates tables).
3. internal/agent/updater package, Linux first.
4. WS envelope wiring + command.update dispatcher.
5. POST /api/hosts/{id}/update + hello-handler integration + timeout goroutine.
6. UI: chip + per-host update button + dashboard tile + filter.
7. Fleet update worker + page.
8. Windows updater path.
9. Alert engine kinds.
10. Smoke validation per §9.2.
Each step is independently testable; commits should land at each boundary so a failed Windows path (8) doesn't block the rest of the work.