From d24856866ebce9ec160a44428810eb5972b5cf69 Mon Sep 17 00:00:00 2001 From: Steve Cliff Date: Wed, 6 May 2026 21:37:38 +0100 Subject: [PATCH] plan: P6-01+02 implementation plan --- .../2026-05-06-p6-01-02-agent-self-update.md | 1706 +++++++++++++++++ 1 file changed, 1706 insertions(+) create mode 100644 docs/superpowers/plans/2026-05-06-p6-01-02-agent-self-update.md diff --git a/docs/superpowers/plans/2026-05-06-p6-01-02-agent-self-update.md b/docs/superpowers/plans/2026-05-06-p6-01-02-agent-self-update.md new file mode 100644 index 0000000..f5829ca --- /dev/null +++ b/docs/superpowers/plans/2026-05-06-p6-01-02-agent-self-update.md @@ -0,0 +1,1706 @@ +# P6-01 + P6-02 Agent Self-Update Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Operator-driven agent self-update via WS envelope, with dashboard "out of date" surfacing, per-host Update button, and a rolling fleet-update worker that halts on first failure. + +**Architecture:** Agent fetches its replacement binary from `/agent/binary`, atomic-renames over the running binary (Linux) or hands off to a detached helper script (Windows), and exits cleanly so the service manager restarts it. The server tracks each update as a `jobs` row with `kind=update`; success is detected when the agent re-hellos with `agent_version == server.Version`. A fleet-update worker iterates out-of-date hosts one at a time, halting on the first failure. + +**Tech Stack:** Go server + agent, SQLite migrations, WebSocket envelopes (existing `internal/api`), htmx/Tailwind UI. + +**Spec:** `docs/superpowers/specs/2026-05-06-p6-01-02-agent-self-update-design.md` + +--- + +## File Structure + +### New files +- `internal/version/version.go` — `Version`, `Commit` constants, ldflags-injected. +- `internal/agent/updater/updater.go` — shared HTTP fetch + atomic-write helpers. 
+- `internal/agent/updater/updater_unix.go` — Linux platform path (build-tag `!windows`). +- `internal/agent/updater/updater_windows.go` — Windows platform path (build-tag `windows`). +- `internal/agent/updater/updater_test.go` — Linux unit tests with fake HTTP server. +- `internal/store/migrations/0021_jobs_update_kind.sql` — widen jobs.kind CHECK. +- `internal/store/migrations/0022_fleet_updates.sql` — fleet_updates + fleet_update_hosts tables. +- `internal/store/fleet_updates.go` — store layer for the new tables. +- `internal/store/fleet_updates_test.go` +- `internal/server/http/host_update.go` — `POST /api/hosts/{id}/update` + form variant. +- `internal/server/http/host_update_test.go` +- `internal/server/http/version.go` — `GET /api/version`. +- `internal/server/http/fleet_update.go` — fleet update endpoints + page handler. +- `internal/server/http/fleet_update_test.go` +- `internal/server/fleetupdate/worker.go` — the rolling-update goroutine. +- `internal/server/fleetupdate/worker_test.go` +- `internal/alert/update_alerts.go` — alert kinds + helpers (`update_failed`, `fleet_update_halted`). +- `web/templates/pages/fleet_update.html` — both idle and running states. +- `web/templates/partials/host_update_chip.html` — reusable chip. +- `cmd/agent/update_dispatch.go` — agent-side `command.update` handler. + +### Modified files +- `Makefile` — add `VERSION` / `COMMIT` ldflags. +- `internal/api/wire.go` — drop `MsgAgentUpdateAvail`, add `MsgCommandUpdate`. +- `internal/api/messages.go` — drop `AgentUpdateAvailablePayload`, add `CommandUpdatePayload`. +- `cmd/agent/main.go` — wire `MsgCommandUpdate` case in dispatcher. +- `internal/server/ws/handler.go` — extend `onAgentHello` to mark in-flight `update` jobs succeeded on version match. +- `internal/server/http/server.go` — register new routes + middleware. +- `internal/server/http/middleware.go` — already has `requireAdmin`; reuse. +- `internal/server/http/hosts.go` — render the update chip into host responses. 
+- `internal/server/http/dashboard.go` (or wherever the dashboard handler lives) — "N hosts behind" tile, `updates=behind` filter. +- `web/templates/partials/host_chrome.html` — embed update chip in header. +- `web/templates/partials/host_row.html` — embed update chip in dashboard row. +- `web/templates/pages/host_detail.html` — Update agent button on right-rail. +- `web/styles/input.css` — `.update-chip` token (amber). +- `cmd/server/main.go` — wire fleet-update worker into the daemon lifecycle. +- `tasks.md` — mark P6-01 and P6-02 done with the as-shipped block. + +--- + +## Phase 1 — Build version plumbing + +### Task 1: `internal/version` package + +**Files:** +- Create: `internal/version/version.go` + +- [ ] **Step 1: Create the package** + +```go +// Package version exposes build-time identifying constants. Both the +// server and agent link this package; their values are set via +// -ldflags during the build. An unset Version falls back to "dev" +// so source builds without ldflags still run. +package version + +var ( + // Version is the human-facing release string, e.g. "v1.2.3" or + // "v1.2.3-dirty". Compared byte-for-byte between agent and + // server to drive the "out of date" signal. + Version = "dev" + + // Commit is the short git SHA. Informational only; surfaced via + // /api/version but not used for any comparison. 
+ Commit = "" +) +``` + +- [ ] **Step 2: Commit** + +``` +git add internal/version/version.go +git commit -m "version: add build-time version package" +``` + +### Task 2: Wire ldflags into the Makefile + +**Files:** +- Modify: `Makefile` + +- [ ] **Step 1: Read the Makefile, locate the build target, and prepend the ldflags** + +Add near the top of the Makefile (after any existing variable block): + +```make +VERSION ?= $(shell git describe --tags --always --dirty 2>/dev/null || echo dev) +COMMIT ?= $(shell git rev-parse --short HEAD 2>/dev/null || echo unknown) +GO_LDFLAGS := -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=$(VERSION) \ + -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Commit=$(COMMIT) +``` + +Then thread `-ldflags "$(GO_LDFLAGS)"` into every `go build` invocation in the Makefile (for `cmd/server`, `cmd/agent`, and any cross-compile target). + +- [ ] **Step 2: Verify** + +```sh +make build +./bin/restic-manager-server -version 2>/dev/null || true # if -version flag exists +strings ./bin/restic-manager-server | grep -E "^v[0-9]+|^dev" | head -3 +``` + +Expected: a non-`dev` version string is embedded in the binary when in a tagged-or-dirty git checkout. 
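The `-X` mechanism used above can be illustrated with a minimal sketch, assuming a `-version` flag is eventually added. `versionLine` and the local `Version` / `Commit` variables are hypothetical stand-ins for illustration only; the real values live in `internal/version` and are set by `GO_LDFLAGS`.

```go
package main

import "fmt"

// Stand-ins for internal/version.Version and internal/version.Commit
// (assumption: shown in package main so the sketch compiles alone).
// In package main they could be overridden at build time with:
//
//	go build -ldflags "-X main.Version=v1.2.3 -X main.Commit=abc1234"
var (
	Version = "dev"
	Commit  = ""
)

// versionLine is a hypothetical helper that renders what a -version
// flag might print. An empty commit is omitted instead of printing "()".
func versionLine(version, commit string) string {
	if commit == "" {
		return version
	}
	return fmt.Sprintf("%s (%s)", version, commit)
}
```

A source build without ldflags would report plain `dev`, which matches the fallback described in Task 1.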
+ +- [ ] **Step 3: Commit** + +``` +git add Makefile +git commit -m "build: inject version + commit via ldflags" +``` + +### Task 3: `GET /api/version` endpoint + +**Files:** +- Create: `internal/server/http/version.go` +- Create: `internal/server/http/version_test.go` +- Modify: `internal/server/http/server.go` + +- [ ] **Step 1: Write the failing test** + +```go +package http + +import ( + "encoding/json" + "net/http" + "net/http/httptest" + "testing" + + "gitea.dcglab.co.uk/steve/restic-manager/internal/version" +) + +func TestVersionEndpoint(t *testing.T) { + version.Version = "v1.2.3" + version.Commit = "abc1234" + t.Cleanup(func() { + version.Version = "dev" + version.Commit = "" + }) + + srv := newTestServer(t) // existing helper in this package + rr := httptest.NewRecorder() + req := httptest.NewRequest(http.MethodGet, "/api/version", nil) + srv.Router().ServeHTTP(rr, req) + + if rr.Code != http.StatusOK { + t.Fatalf("status: got %d want 200", rr.Code) + } + var body struct { + Version string `json:"version"` + Commit string `json:"commit"` + } + if err := json.NewDecoder(rr.Body).Decode(&body); err != nil { + t.Fatalf("decode: %v", err) + } + if body.Version != "v1.2.3" || body.Commit != "abc1234" { + t.Fatalf("body: %+v", body) + } +} +``` + +If `newTestServer` doesn't exist by that name in this package, locate the equivalent helper (look at `enrollment_test.go` or `version.go` style elsewhere) and adapt. 
+ +- [ ] **Step 2: Implement** + +```go +// internal/server/http/version.go +package http + +import ( + "encoding/json" + stdhttp "net/http" + + "gitea.dcglab.co.uk/steve/restic-manager/internal/version" +) + +func (s *Server) handleVersion(w stdhttp.ResponseWriter, r *stdhttp.Request) { + w.Header().Set("Content-Type", "application/json") + _ = json.NewEncoder(w).Encode(map[string]string{ + "version": version.Version, + "commit": version.Commit, + }) +} +``` + +In `server.go`, add inside the route registration block (where other public routes live, near `/agent/binary`): + +```go +r.Get("/api/version", s.handleVersion) +``` + +- [ ] **Step 3: Run tests** + +```sh +go test ./internal/server/http/ -run TestVersionEndpoint -v +``` + +Expected: PASS. + +- [ ] **Step 4: Commit** + +``` +git add internal/server/http/version.go internal/server/http/version_test.go internal/server/http/server.go +git commit -m "http: expose GET /api/version" +``` + +--- + +## Phase 2 — Wire protocol changes + +### Task 4: Add `MsgCommandUpdate` and `CommandUpdatePayload`, retire `MsgAgentUpdateAvail` + +**Files:** +- Modify: `internal/api/wire.go` +- Modify: `internal/api/messages.go` +- Modify: `cmd/agent/main.go` + +- [ ] **Step 1: Edit `wire.go`** + +In the server-to-agent block (around line 32-37), replace: +```go +MsgAgentUpdateAvail MessageType = "agent.update.available" +``` +with: +```go +MsgCommandUpdate MessageType = "command.update" +``` + +- [ ] **Step 2: Edit `messages.go`** + +Delete the `AgentUpdateAvailablePayload` struct (lines ~364-371) and its doc comment. Add immediately before `TreeListRequestPayload`: + +```go +// CommandUpdatePayload carries no operational data — the agent +// already knows its own os/arch and fetches from its configured +// server URL via /agent/binary. JobID is the server-issued id of +// the update job; the agent echoes it on log.stream lines so the +// live job log captures pre-restart progress. 
+type CommandUpdatePayload struct { + JobID string `json:"job_id"` +} +``` + +- [ ] **Step 3: Edit `cmd/agent/main.go`** + +Replace the `case api.MsgAgentUpdateAvail` block (lines ~398-401) with: + +```go +case api.MsgCommandUpdate: + var p api.CommandUpdatePayload + if err := env.UnmarshalPayload(&p); err != nil { + return fmt.Errorf("command.update: %w", err) + } + go d.runUpdate(ctx, p, tx) +``` + +`runUpdate` lands in Task 9. + +- [ ] **Step 4: Update `JobKind` constants in `messages.go`** + +In the `JobKind` const block (line ~57), add: + +```go +JobUpdate JobKind = "update" +``` + +- [ ] **Step 5: Verify** + +```sh +go build ./... +``` + +Expected: build error from `cmd/agent/main.go` calling `d.runUpdate` (not yet defined). That's fine — proceed; the next phase plugs the gap. Verify only the `internal/api` package builds: + +```sh +go build ./internal/api/ +``` + +Expected: PASS. + +- [ ] **Step 6: Commit** + +``` +git add internal/api/wire.go internal/api/messages.go cmd/agent/main.go +git commit -m "api: replace agent.update.available with command.update + JobUpdate kind" +``` + +--- + +## Phase 3 — Database migrations + +### Task 5: Migration 0021 — widen `jobs.kind` CHECK + +**Files:** +- Create: `internal/store/migrations/0021_jobs_update_kind.sql` + +- [ ] **Step 1: Write the migration** + +Mirror the pattern in `0012_jobs_restore_diff_kind.sql` exactly: temp-backup of `job_logs`, rebuild `jobs` with the wider CHECK, restore log rows, recreate indexes. Only change is the CHECK list now includes `'update'`: + +```sql +-- 0021_jobs_update_kind.sql +-- +-- Add 'update' to the jobs.kind CHECK constraint so the agent +-- self-update flow (P6-01) can persist its job rows. Same safe +-- rebuild pattern as 0012; cascade trap mitigated by job_logs +-- temp-backup. 
+ +CREATE TEMPORARY TABLE _job_logs_backup AS + SELECT job_id, seq, ts, stream, payload FROM job_logs; + +CREATE TABLE jobs_new ( + id TEXT PRIMARY KEY, + host_id TEXT NOT NULL REFERENCES hosts(id) ON DELETE CASCADE, + kind TEXT NOT NULL CHECK (kind IN + ('backup','init','forget','prune','check','unlock','restore','diff','update')), + status TEXT NOT NULL CHECK (status IN ('queued','running','succeeded','failed','cancelled')), + scheduled_id TEXT REFERENCES schedules(id) ON DELETE SET NULL, + actor_kind TEXT NOT NULL CHECK (actor_kind IN ('user','schedule','system')), + actor_id TEXT, + started_at TEXT, + finished_at TEXT, + exit_code INTEGER, + stats TEXT, + error TEXT, + created_at TEXT NOT NULL +); + +INSERT INTO jobs_new + SELECT id, host_id, kind, status, scheduled_id, actor_kind, actor_id, + started_at, finished_at, exit_code, stats, error, created_at + FROM jobs; + +DROP TABLE jobs; +ALTER TABLE jobs_new RENAME TO jobs; + +CREATE INDEX jobs_host_id ON jobs(host_id); +CREATE INDEX jobs_status ON jobs(status); +CREATE INDEX jobs_created_at ON jobs(created_at); + +INSERT OR IGNORE INTO job_logs (job_id, seq, ts, stream, payload) + SELECT job_id, seq, ts, stream, payload FROM _job_logs_backup; + +DROP TABLE _job_logs_backup; +``` + +If the live `jobs` schema already has columns added by post-0012 migrations (e.g. 0015 added `source_group_id`), match them in `jobs_new` and the INSERT — check the latest schema before writing. + +- [ ] **Step 2: Verify** + +```sh +go test ./internal/store/ -run TestMigrations -v +``` + +Expected: PASS, includes 0021. 
+ +- [ ] **Step 3: Commit** + +``` +git add internal/store/migrations/0021_jobs_update_kind.sql +git commit -m "store: migration 0021 — add 'update' to jobs.kind" +``` + +### Task 6: Migration 0022 — fleet_updates tables + +**Files:** +- Create: `internal/store/migrations/0022_fleet_updates.sql` + +- [ ] **Step 1: Write the migration** + +```sql +-- 0022_fleet_updates.sql +-- +-- Tables backing the rolling fleet-update worker (P6-02). One row in +-- fleet_updates per "update all" invocation, a child row per host so +-- the worker can resume / report progress / mark per-host failures. + +CREATE TABLE fleet_updates ( + id TEXT PRIMARY KEY, + started_at TEXT NOT NULL, + started_by_user_id TEXT NOT NULL REFERENCES users(id), + target_version TEXT NOT NULL, + status TEXT NOT NULL CHECK (status IN + ('running','completed','halted','cancelled')), + current_host_id TEXT REFERENCES hosts(id), + halted_reason TEXT, + completed_at TEXT +); + +CREATE INDEX fleet_updates_status ON fleet_updates(status); + +CREATE TABLE fleet_update_hosts ( + fleet_update_id TEXT NOT NULL REFERENCES fleet_updates(id) ON DELETE CASCADE, + host_id TEXT NOT NULL REFERENCES hosts(id) ON DELETE CASCADE, + position INTEGER NOT NULL, + status TEXT NOT NULL CHECK (status IN + ('pending','running','succeeded','failed','skipped')), + job_id TEXT REFERENCES jobs(id), + failed_reason TEXT, + PRIMARY KEY (fleet_update_id, host_id) +); + +CREATE INDEX fleet_update_hosts_status ON fleet_update_hosts(fleet_update_id, position); +``` + +- [ ] **Step 2: Verify** + +```sh +go test ./internal/store/ -run TestMigrations -v +``` + +- [ ] **Step 3: Commit** + +``` +git add internal/store/migrations/0022_fleet_updates.sql +git commit -m "store: migration 0022 — fleet_updates + fleet_update_hosts" +``` + +### Task 7: `internal/store/fleet_updates.go` + +**Files:** +- Create: `internal/store/fleet_updates.go` +- Create: `internal/store/fleet_updates_test.go` + +- [ ] **Step 1: Write the failing tests** + +```go +package 
store_test + +// Cover: CreateFleetUpdate creates parent + N pending host rows; +// SetFleetUpdateHostStatus moves a row through pending→running→succeeded; +// HaltFleetUpdate sets status=halted + halted_reason and stamps no +// completed_at; CompleteFleetUpdate sets completed_at; ListPendingFleetUpdateHosts +// returns rows in position order; ActiveFleetUpdate returns the running +// row (or nil); GetFleetUpdate hydrates parent + children. +// +// One test per behaviour, table-driven where the API supports it. +// Mirror the structure of internal/store/maintenance_test.go. +``` + +Write four to six discrete test functions. Look at `internal/store/maintenance.go` and `maintenance_test.go` for the established style (constructor on `*Store`, NewStore + tmp DB). + +- [ ] **Step 2: Implement** + +Sketch: + +```go +package store + +import ( + "context" + "database/sql" + "errors" + "fmt" + "time" +) + +type FleetUpdate struct { + ID string + StartedAt time.Time + StartedByUserID string + TargetVersion string + Status string // running | completed | halted | cancelled + CurrentHostID string + HaltedReason string + CompletedAt *time.Time +} + +type FleetUpdateHost struct { + FleetUpdateID string + HostID string + Position int + Status string // pending | running | succeeded | failed | skipped + JobID string + FailedReason string +} + +func (s *Store) CreateFleetUpdate(ctx context.Context, fu FleetUpdate, hostIDs []string) error { ... } +func (s *Store) ActiveFleetUpdate(ctx context.Context) (*FleetUpdate, error) { ... } +func (s *Store) GetFleetUpdate(ctx context.Context, id string) (*FleetUpdate, []FleetUpdateHost, error) { ... } +func (s *Store) ListPendingFleetUpdateHosts(ctx context.Context, fuID string) ([]FleetUpdateHost, error) { ... } +func (s *Store) SetFleetUpdateHostStatus(ctx context.Context, fuID, hostID, status, failedReason, jobID string) error { ... } +func (s *Store) SetFleetUpdateCurrentHost(ctx context.Context, fuID, hostID string) error { ... 
} +func (s *Store) HaltFleetUpdate(ctx context.Context, fuID, reason string, when time.Time) error { ... } +func (s *Store) CancelFleetUpdate(ctx context.Context, fuID string) error { ... } +func (s *Store) CompleteFleetUpdate(ctx context.Context, fuID string, when time.Time) error { ... } +``` + +- [ ] **Step 3: Run tests** + +```sh +go test ./internal/store/ -run TestFleetUpdate -v +``` + +Expected: PASS. + +- [ ] **Step 4: Commit** + +``` +git add internal/store/fleet_updates.go internal/store/fleet_updates_test.go +git commit -m "store: fleet_updates + fleet_update_hosts CRUD" +``` + +--- + +## Phase 4 — Agent updater (Linux) + +### Task 8: `internal/agent/updater` package skeleton + Linux path + +**Files:** +- Create: `internal/agent/updater/updater.go` +- Create: `internal/agent/updater/updater_unix.go` +- Create: `internal/agent/updater/updater_windows.go` +- Create: `internal/agent/updater/updater_test.go` + +- [ ] **Step 1: Write the failing test (Linux)** + +```go +//go:build !windows + +package updater + +import ( + "io" + "net/http" + "net/http/httptest" + "os" + "path/filepath" + "runtime" + "testing" +) + +func TestUpdate_LinuxAtomicSwap(t *testing.T) { + // Stage 1: a fake "running binary" file + a server that serves new bytes. 
+ tmp := t.TempDir() + binPath := filepath.Join(tmp, "agent") + if err := os.WriteFile(binPath, []byte("OLD"), 0o755); err != nil { + t.Fatal(err) + } + newBytes := []byte("NEW BINARY CONTENTS") + + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + if r.URL.Path != "/agent/binary" { + http.NotFound(w, r); return + } + if r.URL.Query().Get("os") != runtime.GOOS || r.URL.Query().Get("arch") != runtime.GOARCH { + t.Errorf("unexpected query: %s", r.URL.RawQuery) + } + _, _ = io.Copy(w, &io.LimitedReader{R: bytesReader(newBytes), N: int64(len(newBytes))}) + })) + defer srv.Close() + + if err := UpdateForTest(srv.URL, binPath); err != nil { + t.Fatalf("update: %v", err) + } + + got, err := os.ReadFile(binPath) + if err != nil { t.Fatal(err) } + if string(got) != "NEW BINARY CONTENTS" { + t.Fatalf("binary contents: got %q", got) + } + old, err := os.ReadFile(binPath + ".old") + if err != nil { t.Fatalf("agent.old missing: %v", err) } + if string(old) != "OLD" { + t.Fatalf("agent.old contents: got %q", old) + } + // .new must have been renamed away + if _, err := os.Stat(binPath + ".new"); !os.IsNotExist(err) { + t.Fatalf("agent.new should be absent after swap") + } +} +``` + +`UpdateForTest(serverURL, binaryPath string) error` is a tiny wrapper exposed by `updater.go` that does steps 1–6 of §4.1 of the spec (everything except `os.Exit`). The exit-and-restart side effect can't be covered by a unit test. + +- [ ] **Step 2: Implement `updater.go` (shared)** + +```go +package updater + +import ( + "context" + "fmt" + "io" + "net/http" + "os" + "path/filepath" + "runtime" + "time" +) + +// fetch downloads the new binary into .new, fsyncs, chmods. +// Returns the path to the staged file. 
+func fetch(ctx context.Context, serverURL, binaryPath string) (string, error) { + url := fmt.Sprintf("%s/agent/binary?os=%s&arch=%s", serverURL, runtime.GOOS, runtime.GOARCH) + req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil) + if err != nil { + return "", err + } + c := &http.Client{Timeout: 5 * time.Minute} + res, err := c.Do(req) + if err != nil { + return "", err + } + defer res.Body.Close() + if res.StatusCode != http.StatusOK { + return "", fmt.Errorf("agent binary fetch: %s", res.Status) + } + + stagePath := binaryPath + ".new" + f, err := os.OpenFile(stagePath, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755) + if err != nil { + return "", err + } + if _, err := io.Copy(f, res.Body); err != nil { + f.Close() + _ = os.Remove(stagePath) + return "", err + } + if err := f.Sync(); err != nil { + f.Close() + _ = os.Remove(stagePath) + return "", err + } + if err := f.Close(); err != nil { + _ = os.Remove(stagePath) + return "", err + } + if err := os.Chmod(stagePath, 0o755); err != nil { + _ = os.Remove(stagePath) + return "", err + } + return stagePath, nil +} + +// resolveOwnBinary returns the absolute path of the running binary. +// Refuses /proc/self/exe — that's what os.Executable returns on +// some systems but it can't be renamed across. +func resolveOwnBinary() (string, error) { + p, err := os.Executable() + if err != nil { + return "", err + } + abs, err := filepath.Abs(p) + if err != nil { + return "", err + } + if abs == "/proc/self/exe" { + return "", fmt.Errorf("cannot resolve own binary path (/proc/self/exe — not a real file)") + } + return abs, nil +} + +// UpdateForTest is the platform-neutral test seam. +// In production, Update (in updater_unix.go / updater_windows.go) does +// the same fetch+swap then exits the process. UpdateForTest stops short +// of the exit so unit tests can assert on file state. 
+func UpdateForTest(serverURL, binaryPath string) error { + ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute) + defer cancel() + stage, err := fetch(ctx, serverURL, binaryPath) + if err != nil { + return err + } + return swap(stage, binaryPath) +} +``` + +- [ ] **Step 3: Implement `updater_unix.go`** + +```go +//go:build !windows + +package updater + +import ( + "context" + "fmt" + "io" + "log/slog" + "os" + "time" +) + +// Update fetches the new binary, swaps it in, then exits so systemd +// restarts the process under the new binary. Caller should close +// the WS cleanly before invoking. +func Update(ctx context.Context, serverURL string) error { + binPath, err := resolveOwnBinary() + if err != nil { + return err + } + stage, err := fetch(ctx, serverURL, binPath) + if err != nil { + return err + } + if err := swap(stage, binPath); err != nil { + return err + } + slog.Info("agent self-update: binary swapped, exiting for systemd restart", + "binary", binPath) + // Give logger a moment to flush, then exit. + time.Sleep(200 * time.Millisecond) + os.Exit(0) + return nil // unreachable +} + +// swap copies the running binary to .old, then atomic-renames the +// staged binary into place. On non-Windows this works because the OS +// allows renames across an open file. 
+func swap(stagePath, binPath string) error { + src, err := os.Open(binPath) + if err != nil { + return fmt.Errorf("open running binary: %w", err) + } + defer src.Close() + dst, err := os.OpenFile(binPath+".old", os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755) + if err != nil { + return fmt.Errorf("open .old: %w", err) + } + if _, err := io.Copy(dst, src); err != nil { + dst.Close() + return fmt.Errorf("copy to .old: %w", err) + } + if err := dst.Sync(); err != nil { + dst.Close() + return err + } + if err := dst.Close(); err != nil { + return err + } + if err := os.Rename(stagePath, binPath); err != nil { + return fmt.Errorf("rename .new over running binary: %w", err) + } + return nil +} +``` + +- [ ] **Step 4: Implement `updater_windows.go` stub** + +```go +//go:build windows + +package updater + +import ( + "context" + "errors" +) + +// Update is implemented in Task 12. Stubbed so the package builds +// on Windows during phases 4-11. +func Update(ctx context.Context, serverURL string) error { + return errors.New("agent self-update on Windows: not yet implemented") +} + +func swap(stagePath, binPath string) error { + return errors.New("agent self-update on Windows: not yet implemented") +} +``` + +- [ ] **Step 5: Add `bytesReader` helper to test file** (or use `bytes.NewReader` directly). + +- [ ] **Step 6: Run tests** + +```sh +go test ./internal/agent/updater/ -v +``` + +Expected: PASS. 
+ +- [ ] **Step 7: Commit** + +``` +git add internal/agent/updater/ +git commit -m "agent: updater package — Linux atomic-swap path" +``` + +### Task 9: Wire `command.update` into the agent dispatcher + +**Files:** +- Create: `cmd/agent/update_dispatch.go` +- Verify: `cmd/agent/main.go` (already edited in Task 4) + +- [ ] **Step 1: Implement the dispatcher method** + +```go +package main + +import ( + "context" + "fmt" + "log/slog" + "time" + + "gitea.dcglab.co.uk/steve/restic-manager/internal/agent/updater" + "gitea.dcglab.co.uk/steve/restic-manager/internal/agent/wsclient" + "gitea.dcglab.co.uk/steve/restic-manager/internal/api" +) + +// runUpdate handles a server-dispatched command.update. It logs progress +// via log.stream so the live job page captures pre-restart state, then +// calls the platform updater. On Linux the updater calls os.Exit; on +// Windows it spawns a helper and returns, with the agent then exiting. +func (d *dispatcher) runUpdate(ctx context.Context, p api.CommandUpdatePayload, tx wsclient.Sender) { + logf := func(format string, args ...any) { + line := fmt.Sprintf(format, args...) + slog.Info("ws agent: update: " + line) + env, _ := api.Marshal(api.MsgLogStream, "", api.LogStreamPayload{ + JobID: p.JobID, + Stream: api.LogStdout, + Data: line + "\n", + At: time.Now().UTC(), + }) + _ = tx.Send(env) + } + + // Job-started so the server flips queued→running. 
+ startedEnv, _ := api.Marshal(api.MsgJobStarted, "", api.JobStartedPayload{ + JobID: p.JobID, + Kind: api.JobUpdate, + StartedAt: time.Now().UTC(), + }) + _ = tx.Send(startedEnv) + + logf("fetching new binary from %s", d.serverURL) + if err := updater.Update(ctx, d.serverURL); err != nil { + logf("update failed: %v", err) + finishedEnv, _ := api.Marshal(api.MsgJobFinished, "", api.JobFinishedPayload{ + JobID: p.JobID, + Kind: api.JobUpdate, + Status: api.JobFailed, + FinishedAt: time.Now().UTC(), + Error: err.Error(), + }) + _ = tx.Send(finishedEnv) + return + } + // Unreachable on Linux (Update calls os.Exit). On Windows control + // returns here and the agent exits cleanly so SCM hands off to the + // helper script that does the actual swap-and-restart. +} +``` + +`d.serverURL` should already exist on the dispatcher (it's the URL the WS connection was made to). If not, plumb it from `cmd/agent/main.go`'s connection setup — the URL is in the agent config. + +- [ ] **Step 2: Verify build** + +```sh +go build ./... +``` + +Expected: PASS. + +- [ ] **Step 3: Run all agent tests** + +```sh +go test ./cmd/agent/... ./internal/agent/... +``` + +Expected: PASS, no regressions. + +- [ ] **Step 4: Commit** + +``` +git add cmd/agent/update_dispatch.go cmd/agent/main.go +git commit -m "agent: handle command.update — fetch + swap + exit" +``` + +--- + +## Phase 5 — Server endpoint + hello integration + +### Task 10: `POST /api/hosts/{id}/update` + +**Files:** +- Create: `internal/server/http/host_update.go` +- Create: `internal/server/http/host_update_test.go` +- Modify: `internal/server/http/server.go` (route registration) + +- [ ] **Step 1: Write tests covering** + +Mirror the structure of an existing admin-band endpoint test, e.g. `repo_ops_test.go`: + +- happy path: admin POST → 200 + `{job_id}` returned, `jobs` row created with `kind=update`, audit row written, WS envelope `command.update` sent to the host's connection. 
+- refuses when host offline → 409 / structured error code `host_offline`. +- refuses when `agent_version == server.Version` → 409 / `already_up_to_date`. +- refuses when an `update` job is already running for this host → 409 / `update_in_progress`. +- RBAC: operator → 403, viewer → 403. + +- [ ] **Step 2: Implement** + +```go +package http + +import ( + "encoding/json" + stdhttp "net/http" + + "github.com/go-chi/chi/v5" + "github.com/oklog/ulid/v2" + + "gitea.dcglab.co.uk/steve/restic-manager/internal/api" + "gitea.dcglab.co.uk/steve/restic-manager/internal/store" + "gitea.dcglab.co.uk/steve/restic-manager/internal/version" +) + +// handleHostUpdate dispatches a command.update WS envelope after +// validating that the host is online, currently running a different +// version, and not already in the middle of an update. +func (s *Server) handleHostUpdate(w stdhttp.ResponseWriter, r *stdhttp.Request) { + hostID := chi.URLParam(r, "id") + host, err := s.deps.Store.GetHost(r.Context(), hostID) + if err != nil { writeJSONError(w, stdhttp.StatusNotFound, "host_not_found", ""); return } + + if !s.deps.Hub.IsOnline(hostID) { + writeJSONError(w, stdhttp.StatusConflict, "host_offline", + "agent must be online to receive an update") + return + } + if host.AgentVersion == version.Version { + writeJSONError(w, stdhttp.StatusConflict, "already_up_to_date", + "host is already on "+version.Version) + return + } + running, err := s.deps.Store.RunningUpdateJobForHost(r.Context(), hostID) + if err == nil && running != "" { + writeJSONError(w, stdhttp.StatusConflict, "update_in_progress", + "an update job is already running for this host") + return + } + + jobID := ulid.Make().String() + user := userFrom(r) // existing helper + if err := s.deps.Store.InsertJob(r.Context(), store.Job{ + ID: jobID, + HostID: hostID, + Kind: string(api.JobUpdate), + Status: string(api.JobQueued), + ActorKind: "user", + ActorID: user.ID, + // CreatedAt is set by InsertJob. 
+ }); err != nil { + writeJSONError(w, stdhttp.StatusInternalServerError, "internal", err.Error()) + return + } + + env, _ := api.Marshal(api.MsgCommandUpdate, ulid.Make().String(), api.CommandUpdatePayload{JobID: jobID}) + if err := s.deps.Hub.SendTo(hostID, env); err != nil { + writeJSONError(w, stdhttp.StatusBadGateway, "send_failed", err.Error()) + return + } + + s.audit(r, "host.update_dispatched", store.AuditTarget{ + Kind: "host", ID: hostID, + }, map[string]any{"job_id": jobID, "target_version": version.Version}) + + w.Header().Set("Content-Type", "application/json") + _ = json.NewEncoder(w).Encode(map[string]string{"job_id": jobID}) +} + +// Form-post variant for HTMX. Same gates, returns HX-Redirect. +func (s *Server) handleHostUpdateForm(w stdhttp.ResponseWriter, r *stdhttp.Request) { + // reuse handleHostUpdate's pre-checks via a shared validator; + // on success, set HX-Redirect to /jobs/ and write 200. + // On error, render an inline banner partial. +} +``` + +Helpers to add: +- `Store.RunningUpdateJobForHost(ctx, hostID) (string, error)` — returns the job id of any running `kind=update` job for this host, or empty string + nil if none. One-line query. + +- [ ] **Step 3: Register routes** + +In `server.go`, inside the admin-only route group: + +```go +r.Post("/api/hosts/{id}/update", s.handleHostUpdate) +r.Post("/hosts/{id}/update", s.handleHostUpdateForm) +``` + +- [ ] **Step 4: Run tests** + +```sh +go test ./internal/server/http/ -run TestHostUpdate -v +``` + +Expected: PASS. 
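The shared validator mentioned for `handleHostUpdateForm` could be factored as a pure function, which also makes the three refusal cases from Step 1 easy to unit-test without a WS hub. This is a sketch under assumptions: `updateGate` and its field names are hypothetical, and only the error codes (`host_offline`, `already_up_to_date`, `update_in_progress`) come from the plan.

```go
package main

// updateGate (hypothetical) collects the facts both handlers need
// before dispatching command.update.
type updateGate struct {
	Online        bool   // result of the hub's online check for the host
	AgentVersion  string // version last reported by the agent
	ServerVersion string // version.Version on the server
	RunningJobID  string // id of an in-flight update job, "" if none
}

// check returns the structured error code to send with a 409, or ""
// when the update may be dispatched. The order mirrors the checks in
// handleHostUpdate: offline first, then version, then in-flight job.
func (g updateGate) check() string {
	switch {
	case !g.Online:
		return "host_offline"
	case g.AgentVersion == g.ServerVersion:
		return "already_up_to_date"
	case g.RunningJobID != "":
		return "update_in_progress"
	}
	return ""
}
```

Both the JSON and the form handler could then build an `updateGate` from the store and hub, call `check`, and differ only in how they render the result.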
+ +- [ ] **Step 5: Commit** + +``` +git add internal/server/http/host_update.go internal/server/http/host_update_test.go internal/server/http/server.go internal/store/jobs.go +git commit -m "http: POST /api/hosts/{id}/update — dispatch agent update" +``` + +### Task 11: Hello-handler integration + timeout watcher + +**Files:** +- Modify: `internal/server/ws/handler.go` +- Create: `internal/server/ws/update_watch.go` +- Create: `internal/server/ws/update_watch_test.go` + +- [ ] **Step 1: Write the failing test** + +```go +// In update_watch_test.go: +// +// 1. newUpdateWatcher; Track(jobID, hostID). Hello arrives after +// 50ms with matching version → watcher marks the job succeeded +// (verify via mock Store.SetJobStatus call). +// 2. newUpdateWatcher; Track(...). 100ms timeout (override the +// updateTimeout var for the test). +// No hello arrives → after 100ms, watcher marks the job failed with +// reason "timeout" and raises an alert (verify via mock Store + +// AlertRaiser). +// 3. newUpdateWatcher; Track(...). Hello arrives but version doesn't match. +// Watcher does nothing (timeout will catch). After timeout, marked +// failed with reason "agent reconnected at version X, expected Y". +// 4. Cancel: Track then explicitly Stop(jobID); no further callbacks. +``` + +- [ ] **Step 2: Implement the watcher** + +```go +package ws + +import ( + "context" + "fmt" + "sync" + "time" + + "gitea.dcglab.co.uk/steve/restic-manager/internal/api" + "gitea.dcglab.co.uk/steve/restic-manager/internal/store" + "gitea.dcglab.co.uk/steve/restic-manager/internal/version" +) + +// updateTimeout is the default ceiling for how long the server waits +// for an agent re-hello carrying the matching version after dispatching +// a command.update. Declared as a var (not a const) so tests can shrink it. 
+var updateTimeout = 90 * time.Second + +type updateWatch struct { + jobID string + hostID string + deadline time.Time +} + +type updateWatcher struct { + mu sync.Mutex + pending map[string]*updateWatch // hostID → watch + store *store.Store + alerts AlertRaiser // small interface, injected + now func() time.Time +} + +func newUpdateWatcher(st *store.Store, alerts AlertRaiser) *updateWatcher { + return &updateWatcher{ + pending: make(map[string]*updateWatch), + store: st, + alerts: alerts, + now: func() time.Time { return time.Now().UTC() }, + } +} + +// Track registers an in-flight update. If a hello with the matching +// version arrives before the deadline, OnHello marks the job succeeded +// and clears the entry. Otherwise sweep (driven by Run) marks the job failed. +func (w *updateWatcher) Track(jobID, hostID string) { + w.mu.Lock() + defer w.mu.Unlock() + w.pending[hostID] = &updateWatch{ + jobID: jobID, + hostID: hostID, + deadline: w.now().Add(updateTimeout), + } +} + +// OnHello is called by the WS handler when an agent hellos. If a watch +// is pending for this host AND the version matches, mark succeeded and +// drop the watch. Mismatched version → leave the watch (timeout +// handles it). +func (w *updateWatcher) OnHello(ctx context.Context, hostID, agentVersion, serverVersion string) { + w.mu.Lock() + watch, ok := w.pending[hostID] + if ok && agentVersion == serverVersion { + delete(w.pending, hostID) + } + w.mu.Unlock() + if !ok || agentVersion != serverVersion { return } + // Mark job succeeded. + _ = w.store.SetJobStatus(ctx, watch.jobID, string(api.JobSucceeded), "", w.now()) + // Audit + alert auto-resolve. + // (audit hook reused via http layer's helper, or write directly here) +} + +// Run is started as a goroutine by NewHandler and sweeps for expired +// watches every 5s. 
+func (w *updateWatcher) Run(ctx context.Context) { + tick := time.NewTicker(5 * time.Second) + defer tick.Stop() + for { + select { + case <-ctx.Done(): return + case <-tick.C: + w.sweep(ctx) + } + } +} + +func (w *updateWatcher) sweep(ctx context.Context) { + now := w.now() + w.mu.Lock() + expired := []*updateWatch{} + for hostID, wch := range w.pending { + if now.After(wch.deadline) { + expired = append(expired, wch) + delete(w.pending, hostID) + } + } + w.mu.Unlock() + for _, wch := range expired { + // Determine reason: did the agent come back at all? + host, _ := w.store.GetHost(ctx, wch.hostID) + reason := fmt.Sprintf("timeout: agent did not reconnect within %s", updateTimeout) + if host != nil && host.AgentVersion != "" && host.AgentVersion != version.Version { + reason = fmt.Sprintf("agent reconnected at %s, expected %s", + host.AgentVersion, version.Version) + } + _ = w.store.SetJobStatus(ctx, wch.jobID, string(api.JobFailed), reason, now) + if w.alerts != nil { + w.alerts.RaiseUpdateFailed(ctx, wch.hostID, wch.jobID, reason, now) + } + } +} +``` + +- [ ] **Step 3: Hook into the WS handler** + +In `handler.go`, where `onAgentHello` is defined (search for the place it upserts `agent_version`), at the *end* of the handler — after the upsert succeeds — call: + +```go +deps.UpdateWatcher.OnHello(ctx, hostID, hello.AgentVersion, version.Version) +``` + +The `UpdateWatcher *updateWatcher` field needs to exist on the handler `Deps` struct. Wire it up in `cmd/server/main.go`. Task 10's dispatch path must also call `Track(jobID, hostID)` once `SendTo` succeeds; otherwise `OnHello` and `sweep` have nothing to act on. + +The `AlertRaiser` interface (defined alongside the watcher) is implemented by `*alert.Engine` once Task 13 adds the `RaiseUpdateFailed` method. For now, define the interface only. + +- [ ] **Step 4: Run tests** + +```sh +go test ./internal/server/ws/ -v +``` + +Expected: PASS. 
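
The four scenarios in Step 1 are easier to make deterministic with an injected clock than with real sleeps. The sketch below is a trimmed, self-contained stand-in for the watcher, not the plan's implementation: `miniWatcher`, `fakeClock`, and `scenario` are hypothetical names, and the store and alert calls are replaced by return values so it runs on its own.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// watch mirrors the plan's updateWatch: one in-flight update per host.
type watch struct {
	jobID    string
	deadline time.Time
}

// miniWatcher keeps only the deadline bookkeeping.
type miniWatcher struct {
	mu      sync.Mutex
	pending map[string]watch // hostID → watch
	timeout time.Duration
	now     func() time.Time // injected clock
}

func newMiniWatcher(timeout time.Duration, now func() time.Time) *miniWatcher {
	return &miniWatcher{pending: map[string]watch{}, timeout: timeout, now: now}
}

func (w *miniWatcher) Track(jobID, hostID string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.pending[hostID] = watch{jobID: jobID, deadline: w.now().Add(w.timeout)}
}

// OnHello returns the job id to mark succeeded, or "" when nothing matches.
func (w *miniWatcher) OnHello(hostID, agentVersion, serverVersion string) string {
	w.mu.Lock()
	defer w.mu.Unlock()
	wch, ok := w.pending[hostID]
	if !ok || agentVersion != serverVersion {
		return ""
	}
	delete(w.pending, hostID)
	return wch.jobID
}

// Sweep drops every watch whose deadline has passed and returns its job id.
func (w *miniWatcher) Sweep() []string {
	w.mu.Lock()
	defer w.mu.Unlock()
	var expired []string
	now := w.now()
	for hostID, wch := range w.pending {
		if now.After(wch.deadline) {
			expired = append(expired, wch.jobID)
			delete(w.pending, hostID)
		}
	}
	return expired
}

// fakeClock is a manually advanced clock for deterministic tests.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time          { return c.t }
func (c *fakeClock) Advance(d time.Duration) { c.t = c.t.Add(d) }

// scenario walks through a mismatched hello, a matching hello, and a timeout.
func scenario() (mismatchKept bool, expired []string, matched string) {
	clock := &fakeClock{t: time.Date(2026, 5, 6, 0, 0, 0, 0, time.UTC)}
	w := newMiniWatcher(90*time.Second, clock.Now)
	w.Track("job-1", "host-a")
	w.Track("job-2", "host-b")

	mismatchKept = w.OnHello("host-a", "v1", "v2") == "" // wrong version: watch stays
	matched = w.OnHello("host-b", "v2", "v2")             // right version: watch cleared

	clock.Advance(91 * time.Second)
	expired = w.Sweep() // only host-a is still pending
	return
}

func main() {
	fmt.Println(scenario())
}
```

Advancing the fake clock past the deadline replaces the 100ms real-time wait, which tends to keep such tests from flaking under load.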
+ +- [ ] **Step 5: Commit** + +``` +git add internal/server/ws/update_watch.go internal/server/ws/update_watch_test.go internal/server/ws/handler.go cmd/server/main.go +git commit -m "ws: update watcher — promote/fail update jobs on hello timeout" +``` + +--- + +## Phase 6 — Windows updater path + +### Task 12: Windows helper-script implementation + +**Files:** +- Modify: `internal/agent/updater/updater_windows.go` + +- [ ] **Step 1: Replace the stub from Task 8** + +```go +//go:build windows + +package updater + +import ( + "context" + "fmt" + "log/slog" + "os" + "os/exec" + "path/filepath" + "syscall" + "time" +) + +// helperScript delays with ping rather than timeout: timeout.exe exits +// immediately when stdin is not a console, which is the case for a +// detached helper. Arguments: binary, backup path, staged file, binary. +const helperScript = `@echo off +ping -n 4 127.0.0.1 >nul +copy /Y "%s" "%s" +sc stop restic-manager-agent +:wait +sc query restic-manager-agent | find "STOPPED" >nul +if errorlevel 1 (ping -n 2 127.0.0.1 >nul & goto wait) +move /Y "%s" "%s" +sc start restic-manager-agent +del "%%~f0" +` + +func Update(ctx context.Context, serverURL string) error { + binPath, err := resolveOwnBinary() + if err != nil { + return err + } + stage, err := fetch(ctx, serverURL, binPath) + if err != nil { + return err + } + helperPath := filepath.Join(filepath.Dir(binPath), "agent-update.cmd") + oldPath := binPath + ".old" + body := fmt.Sprintf(helperScript, binPath, oldPath, stage, binPath) + if err := os.WriteFile(helperPath, []byte(body), 0o755); err != nil { + return err + } + cmd := exec.Command("cmd.exe", "/c", helperPath) + cmd.SysProcAttr = &syscall.SysProcAttr{ + HideWindow: true, + CreationFlags: 0x00000008 | 0x08000000, // DETACHED_PROCESS | CREATE_NO_WINDOW + } + if err := cmd.Start(); err != nil { + return err + } + slog.Info("agent self-update: helper spawned, exiting cleanly", "binary", binPath) + time.Sleep(200 * time.Millisecond) + os.Exit(0) + return nil // unreachable; satisfies the compiler +} + +func swap(_, _ string) error { return nil } // not used on Windows +``` + +- [ ] **Step 2: Verify cross-compile** + +```sh +GOOS=windows GOARCH=amd64 go build ./... +``` + +Expected: PASS. 
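
Cross-compiling only proves the Windows file builds; the script template itself can be checked on any OS. The sketch below renders an abbreviated stand-in for the helper template and confirms the argument order and the `%%~f0` escape. `helperTemplate`, `renderHelper`, `hasLine`, and the example paths are hypothetical, chosen for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// helperTemplate is an abbreviated stand-in for the plan's helperScript:
// same verb order, same doubled percent sign on the self-delete line.
const helperTemplate = `@echo off
copy /Y "%s" "%s"
move /Y "%s" "%s"
del "%%~f0"
`

// renderHelper fills the template in the order the script consumes its
// arguments: backup source, backup destination, staged file, final path.
func renderHelper(binPath, stagePath string) string {
	oldPath := binPath + ".old"
	return fmt.Sprintf(helperTemplate, binPath, oldPath, stagePath, binPath)
}

// hasLine reports whether the rendered script contains the exact line.
func hasLine(script, line string) bool {
	for _, l := range strings.Split(script, "\n") {
		if l == line {
			return true
		}
	}
	return false
}

func main() {
	fmt.Print(renderHelper(`C:\rm\agent.exe`, `C:\rm\agent.exe.new`))
}
```

A test along these lines could sit in a file without the `windows` build tag if the template were moved out of `updater_windows.go`; whether that split is worthwhile is a design choice the plan leaves open.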
+ +- [ ] **Step 3: Commit** + +``` +git add internal/agent/updater/updater_windows.go +git commit -m "agent: Windows updater — detached helper script" +``` + +--- + +## Phase 7 — Alert kinds + auto-resolve + +### Task 13: Add `update_failed` and `fleet_update_halted` alert kinds + +**Files:** +- Create: `internal/alert/update_alerts.go` +- Modify: `internal/alert/rules.go` (auto-resolve hook on host hello) + +- [ ] **Step 1: Implement** + +```go +package alert + +import ( + "context" + "time" +) + +// RaiseUpdateFailed is called by the WS update-watcher when an agent +// fails to come back at the target version after a command.update +// dispatch. Auto-resolves when the host next hellos with the right +// version. +func (e *Engine) RaiseUpdateFailed(ctx context.Context, hostID, jobID, reason string, when time.Time) { + dedup := "update_failed:" + hostID + msg := "agent self-update failed: " + reason + e.raiseAndNotify(ctx, hostID, "update_failed", dedup, "warning", msg, when) +} + +// ResolveUpdateFailed is called from the WS hello handler when the +// host comes back at the expected version. +func (e *Engine) ResolveUpdateFailed(ctx context.Context, hostID string, when time.Time) { + e.resolveAndNotify(ctx, hostID, "update_failed", "update_failed:"+hostID, when) +} + +// RaiseFleetUpdateHalted is called by the fleet-update worker when it +// halts on a per-host failure. No host id (global alert). +func (e *Engine) RaiseFleetUpdateHalted(ctx context.Context, fleetUpdateID, reason string, when time.Time) { + dedup := "fleet_update_halted:" + fleetUpdateID + msg := "fleet update halted: " + reason + e.raiseAndNotify(ctx, "", "fleet_update_halted", dedup, "warning", msg, when) +} +``` + +- [ ] **Step 2: Wire auto-resolve into the WS hello handler** (in Task 11's update watcher: when a successful match is recorded, also call `ResolveUpdateFailed`). 
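
The dedup keys above (`update_failed:<hostID>`, `fleet_update_halted:<fleetUpdateID>`) are what make repeated failures collapse into one open alert and let a later hello resolve it. The sketch below illustrates that behaviour with an in-memory map. It is an assumption about what `raiseAndNotify` and `resolveAndNotify` do, not a description of the real engine, and `miniEngine` is a hypothetical name.

```go
package main

import "fmt"

// miniEngine keeps open alerts keyed by dedup key.
type miniEngine struct {
	open map[string]string // dedup key → latest message
}

func newMiniEngine() *miniEngine {
	return &miniEngine{open: map[string]string{}}
}

// raise opens an alert, or refreshes the message when the key is
// already open. It reports whether a new alert was opened.
func (e *miniEngine) raise(kind, scope, msg string) bool {
	key := kind + ":" + scope
	_, existed := e.open[key]
	e.open[key] = msg
	return !existed
}

// resolve closes the alert for the key and reports whether one was open.
func (e *miniEngine) resolve(kind, scope string) bool {
	key := kind + ":" + scope
	_, existed := e.open[key]
	delete(e.open, key)
	return existed
}

// openCount returns how many alerts are currently open.
func (e *miniEngine) openCount() int { return len(e.open) }

func main() {
	e := newMiniEngine()
	fmt.Println(e.raise("update_failed", "host-a", "timeout"))       // opens
	fmt.Println(e.raise("update_failed", "host-a", "timeout again")) // same key
	fmt.Println(e.raise("fleet_update_halted", "fu-1", "halted"))    // separate key
	fmt.Println(e.openCount())
	fmt.Println(e.resolve("update_failed", "host-a"), e.openCount())
}
```

Keying the fleet alert by fleet-update id rather than host id means a second halted run opens its own alert instead of refreshing the previous one.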
+ +- [ ] **Step 3: Commit** + +``` +git add internal/alert/update_alerts.go +git commit -m "alert: update_failed + fleet_update_halted kinds" +``` + +--- + +## Phase 8 — Fleet-update worker + +### Task 14: `internal/server/fleetupdate` worker + +**Files:** +- Create: `internal/server/fleetupdate/worker.go` +- Create: `internal/server/fleetupdate/worker_test.go` + +- [ ] **Step 1: Sketch the API** + +```go +package fleetupdate + +import ( + "context" + "errors" + "sync" + "time" + + "gitea.dcglab.co.uk/steve/restic-manager/internal/store" +) + +// ErrAlreadyRunning is returned by Start while another run is active. +var ErrAlreadyRunning = errors.New("fleetupdate: a fleet update is already running") + +// Worker owns the at-most-one rolling fleet-update goroutine. +type Worker struct { + mu sync.Mutex // ensures one run at a time + store *store.Store + hub Hub // small interface — IsOnline, SendTo + dispatcher Dispatcher // small interface — DispatchUpdate(ctx, hostID, fleetUpdateID) (jobID string, err error) + watcher Watcher // small interface — WaitForVersion(ctx, hostID, version, timeout) bool + alerts AlertRaiser +} + +// Start kicks off a new fleet update. Validates that no other run +// is in progress. Returns the new fleet_update id on success. +func (w *Worker) Start(ctx context.Context, userID, targetVersion string, hostIDs []string) (string, error) { + if !w.mu.TryLock() { + return "", ErrAlreadyRunning + } + // build the fleet_updates row + N pending fleet_update_hosts rows + // in position order (this yields fuID; unlock w.mu and return on + // error), then spawn a goroutine that runs the loop. + // NOTE: if ctx is an HTTP request context it is cancelled when the + // response is written, so the loop needs a longer-lived context. + go w.run(ctx, fuID) + return fuID, nil +} + +// run is the rolling loop. For each pending host: pre-check, dispatch, +// wait for hello-with-target-version, mark succeeded/failed, halt on +// first failure. +func (w *Worker) run(ctx context.Context, fuID string) { + defer w.mu.Unlock() + // ... see spec §7.2 pseudocode +} +``` + +- [ ] **Step 2: Write tests** + +Use mocks/fakes for Hub/Dispatcher/Watcher. Cover: + +- two-host run, both succeed → completed. +- first host succeeds, second times out → halted, alert raised, third stays pending. 
+- host goes offline mid-run → halted with reason "host went offline". +- host already at target version when its turn comes (raced with another path) → skipped, loop continues. +- cancel mid-run → status=cancelled, current host's job left running, no further dispatches. +- start while another run active → returns ErrAlreadyRunning. + +- [ ] **Step 3: Implement the run loop** + +```go +func (w *Worker) run(ctx context.Context, fuID string) { + defer w.mu.Unlock() + pending, err := w.store.ListPendingFleetUpdateHosts(ctx, fuID) + if err != nil { return } + for _, p := range pending { + // Re-check status — could have been cancelled. + fu, _ := w.store.ActiveFleetUpdate(ctx) + if fu == nil || fu.Status != "running" || fu.ID != fuID { return } + + _ = w.store.SetFleetUpdateCurrentHost(ctx, fuID, p.HostID) + + host, _ := w.store.GetHost(ctx, p.HostID) + if host == nil { + _ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "skipped", "host deleted", "") + continue + } + // Already at target? 
+ // (target version comes from the fleet_update row) + if host.AgentVersion == fu.TargetVersion { + _ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "skipped", "already at target", "") + continue + } + if !w.hub.IsOnline(p.HostID) { + _ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "failed", "host offline at dispatch time", "") + _ = w.store.HaltFleetUpdate(ctx, fuID, "host went offline: "+host.Hostname, time.Now().UTC()) + w.alerts.RaiseFleetUpdateHalted(ctx, fuID, "host went offline: "+host.Hostname, time.Now().UTC()) + return + } + + _ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "running", "", "") + jobID, err := w.dispatcher.DispatchUpdate(ctx, p.HostID, fuID) + if err != nil { + _ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "failed", err.Error(), "") + _ = w.store.HaltFleetUpdate(ctx, fuID, "dispatch failed on "+host.Hostname, time.Now().UTC()) + w.alerts.RaiseFleetUpdateHalted(ctx, fuID, err.Error(), time.Now().UTC()) + return + } + _ = w.store.SetFleetUpdateHostStatusJob(ctx, fuID, p.HostID, jobID) + + ok := w.watcher.WaitForVersion(ctx, p.HostID, fu.TargetVersion, 95*time.Second) + if !ok { + _ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "failed", "did not reconnect at target version", jobID) + _ = w.store.HaltFleetUpdate(ctx, fuID, "update failed on "+host.Hostname, time.Now().UTC()) + w.alerts.RaiseFleetUpdateHalted(ctx, fuID, "update failed on "+host.Hostname, time.Now().UTC()) + return + } + _ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "succeeded", "", jobID) + } + _ = w.store.CompleteFleetUpdate(ctx, fuID, time.Now().UTC()) +} +``` + +- [ ] **Step 4: Run tests** + +```sh +go test ./internal/server/fleetupdate/ -v +``` + +Expected: PASS. 
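
The halt-on-first-failure behaviour that the tests in Step 2 exercise can be reduced to a few lines once the store, hub, and watcher are faked out. The sketch below shows only that control flow; `rollOut`, `updater`, and `failOn` are hypothetical names, and the single `updater` callback stands in for the dispatch-then-wait pair.

```go
package main

import "fmt"

// updater is a stand-in for Dispatcher + Watcher combined: it reports
// whether the host came back at the target version.
type updater func(hostID string) bool

// rollOut walks hosts in order, stops at the first failure, and returns
// per-host statuses plus whether the run halted.
func rollOut(hosts []string, update updater) (map[string]string, bool) {
	status := make(map[string]string, len(hosts))
	for _, h := range hosts {
		status[h] = "pending"
	}
	for _, h := range hosts {
		status[h] = "running"
		if !update(h) {
			status[h] = "failed"
			return status, true // halt: later hosts stay pending
		}
		status[h] = "succeeded"
	}
	return status, false
}

// failOn builds a fake updater that fails for exactly one host id.
func failOn(bad string) updater {
	return func(hostID string) bool { return hostID != bad }
}

func main() {
	status, halted := rollOut([]string{"a", "b", "c"}, failOn("b"))
	fmt.Println(halted, status["a"], status["b"], status["c"])
}
```

Leaving untouched hosts as `pending` rather than marking them skipped keeps a halted run distinguishable from a completed one when the page renders per-host status.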
+ +- [ ] **Step 5: Commit** + +``` +git add internal/server/fleetupdate/ +git commit -m "fleetupdate: rolling worker with halt-on-fail" +``` + +### Task 15: HTTP endpoints + page handler for fleet update + +**Files:** +- Create: `internal/server/http/fleet_update.go` +- Create: `internal/server/http/fleet_update_test.go` +- Create: `web/templates/pages/fleet_update.html` +- Modify: `internal/server/http/server.go` (route registration) + +- [ ] **Step 1: Endpoints** + +```go +// POST /api/fleet/update — admin-only, body: {target_version?}. +// If target_version omitted, defaults to current server version. +// Returns {fleet_update_id}. +func (s *Server) handleFleetUpdateStart(w stdhttp.ResponseWriter, r *stdhttp.Request) { ... } + +// POST /api/fleet-updates/{id}/cancel — admin-only. +func (s *Server) handleFleetUpdateCancel(w stdhttp.ResponseWriter, r *stdhttp.Request) { ... } + +// GET /api/fleet-updates/{id} — admin-only, returns +// {fleet_update + per-host array} as JSON. +func (s *Server) handleFleetUpdateGet(w stdhttp.ResponseWriter, r *stdhttp.Request) { ... } + +// GET /settings/fleet-update — admin-only, renders the page. +// Shows idle list (out-of-date online hosts) when no run is active, +// or the running run's progress. +func (s *Server) handleFleetUpdatePage(w stdhttp.ResponseWriter, r *stdhttp.Request) { ... } +``` + +- [ ] **Step 2: Tests** + +Unit-test the page handler (idle vs running variants) and the start endpoint (accepts target list, refuses if a run is already active, RBAC). + +- [ ] **Step 3: Page template** + +`web/templates/pages/fleet_update.html`: +- Inherit from the base layout. +- Idle state block: header "Fleet update", paragraph explaining "rolling updates one host at a time, halts on first failure", table of out-of-date online hosts with Hostname / Current / Target / Last seen, plus a typed-confirm dialog ("Type the host count to confirm"), "Start rolling update" button. 
+- Running state block: htmx auto-refresh every 3s (`hx-get="/api/fleet-updates/{id}/partial" hx-trigger="every 3s [...visibility...]"`), per-host progress list with status pill, link to job log when present, "Cancel" button. +- Mirror the visual idiom of `web/templates/pages/alerts.html` for the auto-refresh behaviour. + +- [ ] **Step 4: Run tests + smoke render** + +```sh +go test ./internal/server/http/ -run TestFleetUpdate -v +``` + +- [ ] **Step 5: Commit** + +``` +git add internal/server/http/fleet_update.go internal/server/http/fleet_update_test.go web/templates/pages/fleet_update.html internal/server/http/server.go +git commit -m "http: fleet update endpoints + /settings/fleet-update page" +``` + +--- + +## Phase 9 — UI surfacing + +### Task 16: Update chip on host row + host detail header + +**Files:** +- Create: `web/templates/partials/host_update_chip.html` +- Modify: `web/templates/partials/host_row.html` +- Modify: `web/templates/partials/host_chrome.html` +- Modify: `internal/server/http/hosts.go` (add `UpdateAvailable` and `TargetVersion` fields to the row view-model) +- Modify: `internal/server/http/host_detail.go` (or wherever `host_chrome` is populated) +- Modify: `web/styles/input.css` + +- [ ] **Step 1: View-model** + +Compute `UpdateAvailable bool` and `TargetVersion string` (= server version) anywhere `host` data is built for templates. Hide chip when `host.AgentVersion == ""` or matches. 
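
The chip rule from Step 1 is a two-condition predicate, and keeping it in one constructor avoids the row and detail views drifting apart. A minimal sketch, assuming a view-model shaped like the fields named above; `hostRow` and `newHostRow` are hypothetical names:

```go
package main

import "fmt"

// hostRow is a hypothetical slice of the row view-model.
type hostRow struct {
	AgentVersion    string
	TargetVersion   string
	UpdateAvailable bool
}

// newHostRow fills the chip fields. An empty agent version means the
// host has not reported one yet, so the chip stays hidden.
func newHostRow(agentVersion, serverVersion string) hostRow {
	return hostRow{
		AgentVersion:    agentVersion,
		TargetVersion:   serverVersion,
		UpdateAvailable: agentVersion != "" && agentVersion != serverVersion,
	}
}

func main() {
	for _, v := range []string{"", "v0.0.1-smoke-A", "v0.0.1-smoke-B"} {
		row := newHostRow(v, "v0.0.1-smoke-B")
		fmt.Printf("agent=%q update=%v\n", row.AgentVersion, row.UpdateAvailable)
	}
}
```

The comparison is plain string inequality, so a host running a newer build than the server would also show the chip; the plan does not say whether that case needs separate handling.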
+ +- [ ] **Step 2: Partial** + +```html +{{ define "host_update_chip" }} +{{ if .UpdateAvailable }} + + out of date · {{ .AgentVersion }} → {{ .TargetVersion }} + +{{ end }} +{{ end }} +``` + +- [ ] **Step 3: CSS** + +`web/styles/input.css`: + +```css +.update-chip { + @apply inline-flex items-center gap-1 px-2 py-0.5 rounded text-xs; + @apply bg-amber-50 text-amber-900 border border-amber-200; +} +``` + +- [ ] **Step 4: Render Tailwind + commit** + +```sh +make build +git add web/templates web/styles/input.css web/static/css/styles.css internal/server/http/hosts.go internal/server/http/host_detail.go +git commit -m "ui: update chip on host row + detail header" +``` + +### Task 17: Per-host Update agent button on `/hosts/{id}` + +**Files:** +- Modify: `web/templates/pages/host_detail.html` + +- [ ] **Step 1: Right-rail button block** + +Look at the existing right-rail in `host_detail.html` (e.g. the Restore button block from P3). Add (admin-only): + +```html +{{ if and .CanAdmin .Host.UpdateAvailable }} +
+ +
+{{ end }} +``` + +The view-model needs `Host.Online` and `Host.UpdateInProgress` populated. + +- [ ] **Step 2: Commit** + +``` +git add web/templates/pages/host_detail.html internal/server/http/host_detail.go +git commit -m "ui: per-host Update agent button" +``` + +### Task 18: Dashboard "N hosts behind" tile + `?updates=behind` filter + +**Files:** +- Modify: `internal/server/http/dashboard_filter.go` (or wherever the dashboard handler lives — search for the `?status=` filter from NS-04) +- Modify: `web/templates/pages/dashboard.html` + +- [ ] **Step 1: Extend filter parsing** + +Add `Updates string` (values: "" or "behind") to the dashboard filter struct. When `behind`, filter to hosts where `agent_version != "" && agent_version != server.Version`. + +- [ ] **Step 2: Hero tile** + +In `dashboard.html`, alongside existing tiles (online/offline/snapshot count), add — only when N > 0: + +```html +{{ if gt .UpdatesBehind 0 }} + + {{ .UpdatesBehind }} + hosts behind + +{{ end }} +``` + +- [ ] **Step 3: Tests** + +Extend `dashboard_filter_test.go` to cover the `updates=behind` path. + +- [ ] **Step 4: Commit** + +``` +git add internal/server/http/dashboard*.go web/templates/pages/dashboard.html +git commit -m "ui: dashboard hosts-behind tile + filter" +``` + +--- + +## Phase 10 — Smoke validation + +### Task 19: Restage + smoke validate + +- [ ] **Step 1: Build at version A** + +```sh +make build VERSION=v0.0.1-smoke-A +# restage block from CLAUDE.md +``` + +- [ ] **Step 2: Onboard `uptime` as a fresh host** + +Use the dashboard's Add-host flow against `ssh uptime`. Confirm the host shows `agent_version=v0.0.1-smoke-A`. + +- [ ] **Step 3: Bump server to version B** + +```sh +make build VERSION=v0.0.1-smoke-B +# restart server only (not the agent) +``` + +Verify: dashboard shows `uptime` with the "out of date · v0.0.1-smoke-A → v0.0.1-smoke-B" chip and the "1 host behind" tile. 
+ +- [ ] **Step 4: Stage agent at version B** + +```sh +cp bin/restic-manager-agent $HOME/smoke/data/agent-binaries/restic-manager-agent-linux-amd64 +``` + +- [ ] **Step 5: Click Update agent** + +On `/hosts/{uptime-id}`. Watch the live job log. Expect: agent fetches, swaps, exits, systemd restarts it, hellos at version B, job marked succeeded, chip and tile clear. + +Verify on `uptime`: +```sh +ssh uptime "ls -la /usr/local/bin/restic-manager-agent*" +``` +Expect both `restic-manager-agent` (B) and `restic-manager-agent.old` (A) present. + +- [ ] **Step 6: Test rollback path** + +```sh +# Replace the bundled binary with the OLD one — server claims B but serves A +cp bin/restic-manager-agent.A $HOME/smoke/data/agent-binaries/restic-manager-agent-linux-amd64 +# (assume earlier build saved as .A) +``` + +Click Update — agent fetches A, swaps to A, restarts at A. Server should mark the job `failed` after 90s with reason like "agent reconnected at v0.0.1-smoke-A, expected v0.0.1-smoke-B". Alert raised. + +- [ ] **Step 7: Fleet update path** + +If only one host is available, this validates the worker on N=1. Spin up a second sibling agent (docker-based or another VM) to validate N=2 + halt-on-fail (replace `/agent-binaries/...` with `/bin/false`-equivalent during one host's turn). + +- [ ] **Step 8: Capture screenshots** + +Save Playwright screenshots of: out-of-date host row, fleet-update idle page, fleet-update running progress, fleet-update halted state. Drop into `_diag/p6-update-sweep/`. + +- [ ] **Step 9: Commit + update tasks.md** + +Mark P6-01 and P6-02 done in `tasks.md` with an as-shipped block summarising what landed (mirror the style used for P5-03/P5-07). + +``` +git add tasks.md _diag/p6-update-sweep/ +git commit -m "tasks: mark P6-01 + P6-02 done with as-shipped block" +``` + +--- + +## Self-review + +Run through the spec sections: + +- §3 wire protocol → Task 4, Task 5 (jobs.kind), Task 9, Task 10. 
✅ +- §4 agent execution → Task 8 (Linux), Task 9 (dispatch wiring), Task 12 (Windows). ✅ +- §5 server build version → Task 1, Task 2, Task 3. ✅ +- §6 server endpoints → Task 10 (host update), Task 11 (hello integration + watcher). ✅ +- §7 fleet update → Task 6 (schema), Task 7 (store), Task 14 (worker), Task 15 (HTTP+UI). ✅ +- §7.3 UI surfaces → Task 16 (chip), Task 17 (button), Task 18 (dashboard tile). ✅ +- §7.4 alert engine → Task 13. ✅ +- §8 RBAC → enforced in Task 10 + Task 15 by reusing existing `requireAdmin` middleware. ✅ +- §9 testing → Task tests + Task 19 smoke. ✅ + +No placeholders. All types referenced consistently across tasks. Done.