diff --git a/tasks.md b/tasks.md index 22f4186..26ff06f 100644 --- a/tasks.md +++ b/tasks.md @@ -344,8 +344,33 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days. > Deferred from Phase 4 on 2026-05-05 — operator-experience polish that doesn't gate a working v1. -- [ ] **P6-01** (S) Agent self-update from the server's bundled binaries. P5-03 already bakes matching `agent-{linux-amd64,linux-arm64,windows-amd64}` into the server image under `/opt/restic-manager/dist/`, served by `/agent/binary`. Add a `restic-manager-agent update` subcommand (and a server-dispatched `command.update` WS envelope) that fetches `$RM_SERVER/agent/binary?os=…&arch=…`, verifies sha256 against a digest the server advertises alongside the binary, atomic-renames over the running binary (`tmp+fsync+rename`), and asks the service manager to restart (`systemctl restart` on Linux, SCM restart on Windows). Version pinning is automatic — the server only ever serves the agent that matches its own release. No apt repo, no Chocolatey, no third-party signing infra. _(Was P4-01; original apt/choco plan dropped after the P5-03 Docker pivot made the server the natural distribution point.)_ -- [ ] **P6-02** (M) Agent version reporting + fleet update on dashboard. Server already knows its own build version and each agent's `agent_version` from the WS hello. Surface "N hosts behind" on the dashboard, a per-host "out of date" chip, and an admin-only **Update all** action that fans out `command.update` to every online host (offline hosts queue via `pending_runs`-style retry on reconnect). Per-host **Update** button on host detail for one-shot upgrades. Audit-logged. _(Was P4-02.)_ +- [x] **P6-01** (S) Agent self-update from the server's bundled binaries. Server-dispatched `command.update` WS envelope; agent fetches `$RM_SERVER/agent/binary?os=…&arch=…` to `.new`, copies running binary to `.old` (M1 — keep one revision back), atomic-rename, exit cleanly. Linux relies on systemd `Restart=always`; Windows writes a detached `update.cmd` helper that waits 3s, `sc stop`s, renames, `sc start`s. No sha256 digest verification — TLS already covers corruption-in-transit (decision deferred per spec §4). _(Was P4-01.)_ +- [x] **P6-02** (M) Agent version reporting + fleet update on dashboard. `internal/version` package + Makefile ldflags injection so server and agent are comparable byte-for-byte. Out-of-date chip on host rows + detail header (amber, format `out of date · A → B`). Hero tile "N hosts behind" with `?updates=behind` filter. Per-host **Update agent** button on host detail. Admin `/settings/fleet-update` page drives a rolling worker (`internal/server/fleetupdate`) that updates one host at a time, polls for hello-with-target-version up to 95s, halts on first failure with `fleet_update_halted` alert. Per-host `update_failed` alerts auto-resolve when the agent reconnects at the right version. `host.update_dispatched/_succeeded/_failed` and `fleet.update_started/_completed/_halted/_cancelled` audit actions. _(Was P4-02.)_ + +> **As shipped (2026-05-06, branch `p6-agent-self-update`):** +> Spec `docs/superpowers/specs/2026-05-06-p6-01-02-agent-self-update-design.md`, +> plan `docs/superpowers/plans/2026-05-06-p6-01-02-agent-self-update.md`. +> Schema: migration 0021 widens `jobs.kind` CHECK to include `update`; +> 0022 creates `fleet_updates` + `fleet_update_hosts`. Agent: new +> `internal/agent/updater` package (build-tag split unix/windows); +> dispatcher case `MsgCommandUpdate` in `cmd/agent/update_dispatch.go` +> emits `job.started` + `log.stream` updates before exit. Server: WS +> update-watcher (`internal/server/ws/update_watch.go`) tracks in-flight +> dispatches, marks succeeded on hello-with-matching-version, fails after +> 90s timeout (covers both no-show and rollback cases per spec §3.2). +> Endpoint `POST /api/hosts/{id}/update` (admin, JSON) + `POST /hosts/{id}/update` +> (HTMX, `HX-Redirect: /jobs/{id}`); pre-checks for offline / already +> up-to-date / update_in_progress. Fleet worker exposes `Start` / +> `Cancel` and runs at most one rolling sequence at a time. Alert kinds +> `update_failed` and `fleet_update_halted` plug into the P3-05 engine. +> +> **Smoke caught + fixed mid-sweep:** the systemd unit's +> `ProtectSystem=full` made `/usr/local/bin` read-only, blocking the +> .new staging file. Added `/usr/local/bin` to `ReadWritePaths`. With +> the fix in place: end-to-end Update agent took the host from +> `v0.9.0-11-gccaccd8-dirty` → `v9.9.9-smoke` in <5s; `.old` preserved +> on disk; chip and hero tile cleared on reconnect; audit row landed. +> Screenshots in `_diag/p6-update-sweep/`. - [ ] **P6-03** (M) Repo size trend graphs (sparkline on host card, full chart on repo page). _(Was P4-06.)_ - [ ] **P6-04** (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list. _(Was P4-08.)_ - [ ] **P6-05** (S) Document Prometheus integration + sample Grafana dashboard JSON. _(Was P4-09.)_