P6-01 + P6-02: agent self-update + fleet update #19

Merged
steve merged 12 commits from p6-agent-self-update into main 2026-05-07 17:49:25 +01:00

Summary

  • P6-01 Agent self-update via WS command.update envelope. Linux: atomic-rename + clean exit, systemd brings the new binary up. Windows: detached update.cmd helper does the swap-and-restart while the agent exits cleanly.
  • P6-02 Dashboard surfacing + rolling fleet update. Out-of-date chip on host rows + detail header, "N hosts behind" hero tile with ?updates=behind filter, per-host Update agent button, admin /settings/fleet-update page driving a rolling worker that halts on first failure with an alert.

Spec: docs/superpowers/specs/2026-05-06-p6-01-02-agent-self-update-design.md
Plan: docs/superpowers/plans/2026-05-06-p6-01-02-agent-self-update.md

Architecture decisions (full rationale in the spec)

| | Decision |
|-|-|
| Update trigger | Operator-driven only via WS; no auto-update |
| Linux restart | os.Exit(0), systemd Restart=always brings up the new binary |
| Windows restart | Detached update.cmd helper script (can't overwrite running .exe) |
| Rollback safety | M1: keep <bin>.old on disk; M2: rolling fleet update halts on first failure |
| Digest verification | Skipped for v1; TLS already covers corruption-in-transit |
| Version compare | Exact string match (agent_version != server.Version ⇒ out of date) |
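
A minimal sketch of the exact-match version rule, for illustration only. The function name `updateAvailable` is hypothetical and not taken from the codebase; the version strings are the ones from the smoke run. Note that exact matching flags any difference, so an agent that is newer than the server would also show as out of date.

```go
package main

import "fmt"

// updateAvailable is a hypothetical helper: an agent counts as out of date
// whenever its version string differs from the server's in any way.
// There is no semver parsing or ordering, only string equality.
func updateAvailable(agentVersion, serverVersion string) bool {
	return agentVersion != serverVersion
}

func main() {
	server := "v9.9.9-smoke"
	for _, agent := range []string{"v0.9.0-11-gccaccd8-dirty", "v9.9.9-smoke"} {
		fmt.Printf("agent=%s out_of_date=%t\n", agent, updateAvailable(agent, server))
	}
}
```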

What landed

  • internal/version package + Makefile ldflags injection so server and agent are comparable byte-for-byte
  • command.update WS envelope + JobUpdate kind
  • Migration 0021 widens jobs.kind CHECK; 0022 creates fleet_updates + fleet_update_hosts
  • internal/agent/updater (build-tag split unix/windows)
  • POST /api/hosts/{id}/update (admin JSON) + POST /hosts/{id}/update (HTMX, HX-Redirect)
  • WS update-watcher: marks succeeded on matching-version hello, fails after 90s timeout (covers no-show + rollback)
  • Fleet worker (internal/server/fleetupdate) — at-most-one rolling update, halt-on-first-failure
  • update_failed + fleet_update_halted alert kinds, auto-resolve
  • Per-host chip, dashboard tile + filter, per-host Update button, fleet update page (idle/running/halted/completed)
  • host.update_dispatched/_succeeded/_failed and fleet.update_* audit actions

Smoke validation

End-to-end on the dev box: agent at v0.9.0-11-gccaccd8-dirty → click Update → server dispatches command.update → agent fetches new binary, swaps, exits → systemd restarts → hello at v9.9.9-smoke matches server → job marked succeeded → chip + tile clear automatically. Took <5s, .old preserved on disk.

Caught and fixed mid-sweep: the systemd unit's ProtectSystem=full made /usr/local/bin read-only and blocked the .new staging write. Added /usr/local/bin to ReadWritePaths (commit 83d97a2). A comment in the unit explains why the whole-dir grant is needed (os.Rename needs write permission on the parent directory, not just on the file being replaced). Existing installs need a re-run of install.sh to pick up the unit change.

Screenshots in _diag/p6-update-sweep/ (gitignored, captured during the sweep).

Out of scope

  • sha256 digest verification (deferred — TLS suffices for v1)
  • restic-manager-agent update CLI subcommand (no consumer)
  • Auto-update on hello mismatch (deferred; can be a setting flip later)
  • Auto-rollback watchdog M3 (deferred until we ship a broken release)
  • Migrating the agent off User=root

Test plan

  • Re-run install.sh on every existing agent host so they pick up the unit change. Without it the first self-update will fail with "read-only file system".
  • On a host enrolled at the OLD agent version (b91fe56 era), the Update button dispatches but the old agent ignores the envelope; job times out at 90s with a clear failure. Manual install of the new agent is the bootstrap path. Worth documenting in the agent-version compatibility note.
  • Windows path is compile-only verified — first real Windows install will be the first end-to-end test (same caveat as P2-16/17).
steve added 12 commits 2026-05-07 07:42:06 +01:00
- alert: update_failed (per-host, dedup=hostID) + fleet_update_halted
  (system-scoped, host_id NULL via new RaiseOrTouchSystem helper).
- ws: UpdateWatcher tracks in-flight command.update dispatches and
  reconciles them against incoming hello envelopes — success path
  marks the job succeeded and auto-resolves the alert; 90s timeout
  marks the job failed and raises update_failed.
- http: POST /api/hosts/{id}/update (admin-only JSON) + the HTMX
  /hosts/{id}/update form variant. Pre-checks: host exists, online,
  agent_version != current, no running update job. Refactored core
  into Server.dispatchHostUpdate so the fleet worker can share it
  without going through HTTP.
- fleetupdate: rolling worker iterating through host slots, halting
  on first failure and raising fleet_update_halted. Polling-based
  version-match (re-read hosts.agent_version every 1s up to 95s) —
  no extra plumbing into the WS hello path. At-most-one-running is
  enforced at the store layer (ErrFleetUpdateRunning).
- cmd/server: wire UpdateWatcher and FleetWorker into the main
  goroutine; the worker uses a small serverDispatcher adapter that
  delegates back into Server.DispatchHostUpdate.

Tests: watcher (success/timeout/mismatch/late-hello), HTTP endpoint
(happy + four pre-check branches + RBAC), worker (two-host happy,
timeout-halt, host-offline-halt, already-at-target skip, cancel
mid-run, double-Start guard).
- POST /api/fleet/update, POST /api/fleet-updates/{id}/cancel,
  GET /api/fleet-updates/{id} (admin-only).
- GET /settings/fleet-update + /partial for htmx polling.
- Renders idle / running / terminal states with per-host progress.
- Tests cover happy path, derive-host-ids, conflict, cancel, get,
  and RBAC.
- Surface UpdateAvailable + TargetVersion on the dashboard host row,
  the host_chrome header, and the JSON Host shape.
- New host_update_chip partial renders an amber out-of-date pill
  next to the agent-version display when the host's agent trails
  the server.
- Host detail right-rail gains an admin-only Update agent button
  (disabled when host is offline or already updating).
- New .update-chip and .btn-amber CSS tokens; tailwind output
  refreshed.
- Add ?updates=behind query filter and the matching dashboardFilter
  field; round-trips through encode/parse.
- Compute UpdatesBehind on the dashboard view-model (online + version
  trailing the server) and surface as an amber hero tile that links
  to the filtered list.
- Tests exercise the new filter case.
Smoke caught this: ProtectSystem=full mounts /usr read-only so the
agent couldn't write its own .new staging file or atomic-rename over
the running binary. Adding /usr/local/bin to ReadWritePaths is the
minimum diff that lets self-update work; the whole-dir grant is
required because os.Rename needs write on the parent directory.
tasks: mark P6-01 + P6-02 done with as-shipped block
CI / Test (store) (pull_request) Successful in 52s
CI / Test (rest) (pull_request) Successful in 1m6s
CI / Lint (pull_request) Successful in 32s
CI / Test (server-http) (pull_request) Successful in 1m41s
CI / Build (windows/amd64) (pull_request) Successful in 41s
CI / Build (linux/amd64) (pull_request) Successful in 22s
CI / Build (linux/arm64) (pull_request) Successful in 24s
0bd075c2a3
steve merged commit 39657355be into main 2026-05-07 17:49:25 +01:00

Reference: steve/restic-manager#19