tasks: mark P6-01 + P6-02 done with as-shipped block
CI / Test (store) (pull_request) Successful in 52s
CI / Test (rest) (pull_request) Successful in 1m6s
CI / Lint (pull_request) Successful in 32s
CI / Test (server-http) (pull_request) Successful in 1m41s
CI / Build (windows/amd64) (pull_request) Successful in 41s
CI / Build (linux/amd64) (pull_request) Successful in 22s
CI / Build (linux/arm64) (pull_request) Successful in 24s

This commit is contained in:
2026-05-06 22:33:33 +01:00
parent 83d97a27cc
commit 0bd075c2a3
+27 -2
View File
@@ -344,8 +344,33 @@ Sizes: **S** = under a day, **M** = 13 days, **L** = 37 days.
> Deferred from Phase 4 on 2026-05-05 — operator-experience polish that doesn't gate a working v1.
- [ ] **P6-01** (S) Agent self-update from the server's bundled binaries. P5-03 already bakes matching `agent-{linux-amd64,linux-arm64,windows-amd64}` into the server image under `/opt/restic-manager/dist/`, served by `/agent/binary`. Add a `restic-manager-agent update` subcommand (and a server-dispatched `command.update` WS envelope) that fetches `$RM_SERVER/agent/binary?os=…&arch=…`, verifies sha256 against a digest the server advertises alongside the binary, atomic-renames over the running binary (`tmp+fsync+rename`), and asks the service manager to restart (`systemctl restart` on Linux, SCM restart on Windows). Version pinning is automatic — the server only ever serves the agent that matches its own release. No apt repo, no Chocolatey, no third-party signing infra. _(Was P4-01; original apt/choco plan dropped after the P5-03 Docker pivot made the server the natural distribution point.)_
- [ ] **P6-02** (M) Agent version reporting + fleet update on dashboard. Server already knows its own build version and each agent's `agent_version` from the WS hello. Surface "N hosts behind" on the dashboard, a per-host "out of date" chip, and an admin-only **Update all** action that fans out `command.update` to every online host (offline hosts queue via `pending_runs`-style retry on reconnect). Per-host **Update** button on host detail for one-shot upgrades. Audit-logged. _(Was P4-02.)_
- [x] **P6-01** (S) Agent self-update from the server's bundled binaries. Server-dispatched `command.update` WS envelope; agent fetches `$RM_SERVER/agent/binary?os=…&arch=…` to `<bin>.new`, copies running binary to `<bin>.old` (M1 — keep one revision back), atomic-rename, exit cleanly. Linux relies on systemd `Restart=always`; Windows writes a detached `update.cmd` helper that waits 3s, `sc stop`s, renames, `sc start`s. No sha256 digest verification — TLS already covers corruption-in-transit (decision deferred per spec §4). _(Was P4-01.)_
- [x] **P6-02** (M) Agent version reporting + fleet update on dashboard. `internal/version` package + Makefile ldflags injection so server and agent are comparable byte-for-byte. Out-of-date chip on host rows + detail header (amber, format `out of date · A → B`). Hero tile "N hosts behind" with `?updates=behind` filter. Per-host **Update agent** button on host detail. Admin `/settings/fleet-update` page drives a rolling worker (`internal/server/fleetupdate`) that updates one host at a time, polls for hello-with-target-version up to 95s, halts on first failure with `fleet_update_halted` alert. Per-host `update_failed` alerts auto-resolve when the agent reconnects at the right version. `host.update_dispatched/_succeeded/_failed` and `fleet.update_started/_completed/_halted/_cancelled` audit actions. _(Was P4-02.)_
> **As shipped (2026-05-06, branch `p6-agent-self-update`):**
> Spec `docs/superpowers/specs/2026-05-06-p6-01-02-agent-self-update-design.md`,
> plan `docs/superpowers/plans/2026-05-06-p6-01-02-agent-self-update.md`.
> Schema: migration 0021 widens `jobs.kind` CHECK to include `update`;
> 0022 creates `fleet_updates` + `fleet_update_hosts`. Agent: new
> `internal/agent/updater` package (build-tag split unix/windows);
> dispatcher case `MsgCommandUpdate` in `cmd/agent/update_dispatch.go`
> emits `job.started` + `log.stream` updates before exit. Server: WS
> update-watcher (`internal/server/ws/update_watch.go`) tracks in-flight
> dispatches, marks succeeded on hello-with-matching-version, fails after
> 90s timeout (covers both no-show and rollback cases per spec §3.2).
> Endpoint `POST /api/hosts/{id}/update` (admin, JSON) + `POST /hosts/{id}/update`
> (HTMX, `HX-Redirect: /jobs/{id}`); pre-checks for offline / already
> up-to-date / update_in_progress. Fleet worker exposes `Start` /
> `Cancel` and runs at most one rolling sequence at a time. Alert kinds
> `update_failed` and `fleet_update_halted` plug into the P3-05 engine.
>
> **Smoke caught + fixed mid-sweep:** the systemd unit's
> `ProtectSystem=full` made `/usr/local/bin` read-only, blocking the
> .new staging file. Added `/usr/local/bin` to `ReadWritePaths`. With
> the fix in place: end-to-end Update agent took the host from
> `v0.9.0-11-gccaccd8-dirty` → `v9.9.9-smoke` in <5s; `.old` preserved
> on disk; chip and hero tile cleared on reconnect; audit row landed.
> Screenshots in `_diag/p6-update-sweep/`.
- [ ] **P6-03** (M) Repo size trend graphs (sparkline on host card, full chart on repo page). _(Was P4-06.)_
- [ ] **P6-04** (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list. _(Was P4-08.)_
- [ ] **P6-05** (S) Document Prometheus integration + sample Grafana dashboard JSON. _(Was P4-09.)_