P6-01 + P6-02: agent self-update + fleet update #19
Summary
- command.update envelope. Linux: atomic rename + clean exit; systemd brings the new binary up. Windows: a detached update.cmd helper does the swap-and-restart while the agent exits cleanly.
- ?updates=behind filter, per-host Update agent button, and an admin /settings/fleet-update page driving a rolling worker that halts on first failure with an alert.
Spec: docs/superpowers/specs/2026-05-06-p6-01-02-agent-self-update-design.md
Plan: docs/superpowers/plans/2026-05-06-p6-01-02-agent-self-update.md
Architecture decisions (full rationale in the spec)
- os.Exit(0); systemd Restart=always brings up the new binary
- update.cmd helper script (can't overwrite a running .exe)
- <bin>.old on disk + M2: rolling fleet update halts on first failure
- (agent_version != server.Version ⇒ out of date)
What landed
- internal/version package + Makefile ldflags injection so server and agent are comparable byte-for-byte
- command.update WS envelope + JobUpdate kind
- jobs.kind CHECK; 0022 creates fleet_updates + fleet_update_hosts
- internal/agent/updater (build-tag split unix/windows)
- POST /api/hosts/{id}/update (admin JSON) + POST /hosts/{id}/update (HTMX, HX-Redirect)
- internal/server/fleetupdate: at-most-one rolling update, halt-on-first-failure
- update_failed + fleet_update_halted alert kinds, auto-resolve
- host.update_dispatched / _succeeded / _failed and fleet.update_* audit actions
Smoke validation
End-to-end on the dev box: agent at v0.9.0-11-gccaccd8-dirty → click Update → server dispatches command.update → agent fetches new binary, swaps, exits → systemd restarts → hello at v9.9.9-smoke matches server → job marked succeeded → chip + tile clear automatically. Took <5s, with .old preserved on disk.
Caught and fixed mid-sweep: the systemd unit's ProtectSystem=full made /usr/local/bin read-only and blocked the .new staging write. Added /usr/local/bin to ReadWritePaths (commit 83d97a2). A comment in the unit explains why the whole-dir grant is needed (os.Rename needs write access to the parent directory). Existing installs need a re-run of install.sh to pick up the unit change.
Screenshots in _diag/p6-update-sweep/ (gitignored, captured during the sweep).
Out of scope
- restic-manager-agent update CLI subcommand (no consumer)
Test plan
b91fe56 era), the Update button dispatches but the old agent ignores the envelope; the job times out at 90s with a clear failure. Manual install of the new agent is the bootstrap path. Worth documenting in the agent-version compatibility note.

- alert: update_failed (per-host, dedup=hostID) + fleet_update_halted (system-scoped, host_id NULL via new RaiseOrTouchSystem helper).
- ws: UpdateWatcher tracks in-flight command.update dispatches and reconciles them against incoming hello envelopes. The success path marks the job succeeded and auto-resolves the alert; the 90s timeout marks the job failed and raises update_failed.
- http: POST /api/hosts/{id}/update (admin-only JSON) + the HTMX /hosts/{id}/update form variant. Pre-checks: host exists, online, agent_version != current, no running update job. Refactored core into Server.dispatchHostUpdate so the fleet worker can share it without going through HTTP.
- fleetupdate: rolling worker iterating through host slots, halting on first failure and raising fleet_update_halted. Polling-based version-match (re-read hosts.agent_version every 1s up to 95s), with no extra plumbing into the WS hello path. At-most-one-running is enforced at the store layer (ErrFleetUpdateRunning).
- cmd/server: wire UpdateWatcher and FleetWorker into the main goroutine; the worker uses a small serverDispatcher adapter that delegates back into Server.DispatchHostUpdate.

Tests: watcher (success/timeout/mismatch/late-hello), HTTP endpoint (happy + four pre-check branches + RBAC), worker (two-host happy, timeout-halt, host-offline-halt, already-at-target skip, cancel mid-run, double-Start guard).

- POST /api/fleet/update, POST /api/fleet-updates/{id}/cancel, GET /api/fleet-updates/{id} (admin-only).
- GET /settings/fleet-update + /partial for htmx polling.
- Renders idle / running / terminal states with per-host progress.
- Tests cover happy path, derive-host-ids, conflict, cancel, get, and RBAC.