# P6-01 + P6-02 Agent Self-Update Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Operator-driven agent self-update via WS envelope, with dashboard "out of date" surfacing, per-host Update button, and a rolling fleet-update worker that halts on first failure.

**Architecture:** Agent fetches its replacement binary from `/agent/binary`, atomic-renames over the running binary (Linux) or hands off to a detached helper script (Windows), and exits cleanly so the service manager restarts it. The server tracks each update as a `jobs` row with `kind=update`; success is detected when the agent re-hellos with `agent_version == server.Version`. A fleet-update worker iterates out-of-date hosts one at a time, halting on the first failure.

**Tech Stack:** Go server + agent, SQLite migrations, WebSocket envelopes (existing `internal/api`), htmx/Tailwind UI.

**Spec:** `docs/superpowers/specs/2026-05-06-p6-01-02-agent-self-update-design.md`

---

## File Structure

### New files

- `internal/version/version.go` — `Version`, `Commit` constants, ldflags-injected.
- `internal/agent/updater/updater.go` — shared HTTP fetch + atomic-write helpers.
- `internal/agent/updater/updater_unix.go` — Linux platform path (build-tag `!windows`).
- `internal/agent/updater/updater_windows.go` — Windows platform path (build-tag `windows`).
- `internal/agent/updater/updater_test.go` — Linux unit tests with fake HTTP server.
- `internal/store/migrations/0021_jobs_update_kind.sql` — widen jobs.kind CHECK.
- `internal/store/migrations/0022_fleet_updates.sql` — fleet_updates + fleet_update_hosts tables.
- `internal/store/fleet_updates.go` — store layer for the new tables.
- `internal/store/fleet_updates_test.go`
- `internal/server/http/host_update.go` — `POST /api/hosts/{id}/update` + form variant.
- `internal/server/http/host_update_test.go`
- `internal/server/http/version.go` — `GET /api/version`.
- `internal/server/http/fleet_update.go` — fleet update endpoints + page handler.
- `internal/server/http/fleet_update_test.go`
- `internal/server/fleetupdate/worker.go` — the rolling-update goroutine.
- `internal/server/fleetupdate/worker_test.go`
- `internal/alert/update_alerts.go` — alert kinds + helpers (`update_failed`, `fleet_update_halted`).
- `web/templates/pages/fleet_update.html` — both idle and running states.
- `web/templates/partials/host_update_chip.html` — reusable chip.
- `cmd/agent/update_dispatch.go` — agent-side `command.update` handler.

### Modified files

- `Makefile` — add `VERSION` / `COMMIT` ldflags.
- `internal/api/wire.go` — drop `MsgAgentUpdateAvail`, add `MsgCommandUpdate`.
- `internal/api/messages.go` — drop `AgentUpdateAvailablePayload`, add `CommandUpdatePayload`.
- `cmd/agent/main.go` — wire `MsgCommandUpdate` case in dispatcher.
- `internal/server/ws/handler.go` — extend `onAgentHello` to mark in-flight `update` jobs succeeded on version match.
- `internal/server/http/server.go` — register new routes + middleware.
- `internal/server/http/middleware.go` — already has `requireAdmin`; reuse.
- `internal/server/http/hosts.go` — render the update chip into host responses.
- `internal/server/http/dashboard.go` (or wherever the dashboard handler lives) — "N hosts behind" tile, `updates=behind` filter.
- `web/templates/partials/host_chrome.html` — embed update chip in header.
- `web/templates/partials/host_row.html` — embed update chip in dashboard row.
- `web/templates/pages/host_detail.html` — Update agent button on right-rail.
- `web/styles/input.css` — `.update-chip` token (amber).
- `cmd/server/main.go` — wire fleet-update worker into the daemon lifecycle.
- `tasks.md` — mark P6-01 and P6-02 done with the as-shipped block.
---

## Phase 1 — Build version plumbing

### Task 1: `internal/version` package

**Files:**
- Create: `internal/version/version.go`

- [ ] **Step 1: Create the package**

```go
// Package version exposes build-time identifying constants. Both the
// server and agent link this package; their values are set via
// -ldflags during the build. An unset Version falls back to "dev"
// so source builds without ldflags still run.
package version

var (
	// Version is the human-facing release string, e.g. "v1.2.3" or
	// "v1.2.3-dirty". Compared byte-for-byte between agent and
	// server to drive the "out of date" signal.
	Version = "dev"

	// Commit is the short git SHA. Informational only; surfaced via
	// /api/version but not used for any comparison.
	Commit = ""
)
```

- [ ] **Step 2: Commit**

```
git add internal/version/version.go
git commit -m "version: add build-time version package"
```

### Task 2: Wire ldflags into the Makefile

**Files:**
- Modify: `Makefile`

- [ ] **Step 1: Read the Makefile, locate the build target, and prepend the ldflags**

Add near the top of the Makefile (after any existing variable block):

```make
VERSION ?= $(shell git describe --tags --always --dirty 2>/dev/null || echo dev)
COMMIT  ?= $(shell git rev-parse --short HEAD 2>/dev/null || echo unknown)
GO_LDFLAGS := -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=$(VERSION) \
              -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Commit=$(COMMIT)
```

Then thread `-ldflags "$(GO_LDFLAGS)"` into every `go build` invocation in the Makefile (for `cmd/server`, `cmd/agent`, and any cross-compile target).

- [ ] **Step 2: Verify**

```sh
make build
./bin/restic-manager-server -version 2>/dev/null || true   # if -version flag exists
strings ./bin/restic-manager-server | grep -E "^v[0-9]+|^dev" | head -3
```

Expected: a non-`dev` version string is embedded in the binary when in a tagged-or-dirty git checkout.
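To make the effect of the `-X` flags concrete, here is a minimal sketch, assuming a standalone `main` package with its own `Version` and `Commit` variables standing in for `internal/version`. The `versionString` helper is hypothetical and not part of this plan.

```go
package main

import "fmt"

// Hypothetical stand-ins for internal/version.Version and .Commit.
// The linker overwrites them when built with, for example:
//   go build -ldflags "-X main.Version=v1.2.3 -X main.Commit=abc1234"
var (
	Version = "dev"
	Commit  = ""
)

// versionString formats the pair for display; an empty commit is omitted.
func versionString(version, commit string) string {
	if commit == "" {
		return version
	}
	return fmt.Sprintf("%s (%s)", version, commit)
}

func main() {
	fmt.Println(versionString(Version, Commit))
}
```

Without ldflags the variables keep their source defaults, which is why the plan falls back to `dev` for plain `go build` runs.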
- [ ] **Step 3: Commit**

```
git add Makefile
git commit -m "build: inject version + commit via ldflags"
```

### Task 3: `GET /api/version` endpoint

**Files:**
- Create: `internal/server/http/version.go`
- Create: `internal/server/http/version_test.go`
- Modify: `internal/server/http/server.go`

- [ ] **Step 1: Write the failing test**

```go
package http

import (
	"encoding/json"
	"net/http"
	"net/http/httptest"
	"testing"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
)

func TestVersionEndpoint(t *testing.T) {
	version.Version = "v1.2.3"
	version.Commit = "abc1234"
	t.Cleanup(func() {
		version.Version = "dev"
		version.Commit = ""
	})

	srv := newTestServer(t) // existing helper in this package
	rr := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/api/version", nil)
	srv.Router().ServeHTTP(rr, req)

	if rr.Code != http.StatusOK {
		t.Fatalf("status: got %d want 200", rr.Code)
	}
	var body struct {
		Version string `json:"version"`
		Commit  string `json:"commit"`
	}
	if err := json.NewDecoder(rr.Body).Decode(&body); err != nil {
		t.Fatalf("decode: %v", err)
	}
	if body.Version != "v1.2.3" || body.Commit != "abc1234" {
		t.Fatalf("body: %+v", body)
	}
}
```

If `newTestServer` doesn't exist by that name in this package, locate the equivalent helper (look at `enrollment_test.go` or `version.go` style elsewhere) and adapt.
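If no suitable server helper turns up, the same request/decode round trip can be exercised without one. The following is a sketch only: `versionHandler` and `fetchVersion` are hypothetical names, and the handler is a stand-in for the real `handleVersion` rather than a copy of it.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// versionHandler is a hypothetical stand-in for the real endpoint: it
// writes the two strings as a JSON object.
func versionHandler(version, commit string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(map[string]string{
			"version": version,
			"commit":  commit,
		})
	}
}

// fetchVersion drives a handler through an in-memory recorder and
// decodes the JSON body, mirroring what the test above asserts on.
func fetchVersion(h http.Handler) (status int, version, commit string, err error) {
	rr := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/api/version", nil)
	h.ServeHTTP(rr, req)

	var body struct {
		Version string `json:"version"`
		Commit  string `json:"commit"`
	}
	if err := json.NewDecoder(rr.Body).Decode(&body); err != nil {
		return rr.Code, "", "", err
	}
	return rr.Code, body.Version, body.Commit, nil
}

func main() {
	status, v, c, err := fetchVersion(versionHandler("v1.2.3", "abc1234"))
	fmt.Println(status, v, c, err)
}
```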
- [ ] **Step 2: Implement**

```go
// internal/server/http/version.go
package http

import (
	"encoding/json"
	stdhttp "net/http"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
)

func (s *Server) handleVersion(w stdhttp.ResponseWriter, r *stdhttp.Request) {
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(map[string]string{
		"version": version.Version,
		"commit":  version.Commit,
	})
}
```

In `server.go`, add inside the route registration block (where other public routes live, near `/agent/binary`):

```go
r.Get("/api/version", s.handleVersion)
```

- [ ] **Step 3: Run tests**

```sh
go test ./internal/server/http/ -run TestVersionEndpoint -v
```

Expected: PASS.

- [ ] **Step 4: Commit**

```
git add internal/server/http/version.go internal/server/http/version_test.go internal/server/http/server.go
git commit -m "http: expose GET /api/version"
```

---

## Phase 2 — Wire protocol changes

### Task 4: Add `MsgCommandUpdate` and `CommandUpdatePayload`, retire `MsgAgentUpdateAvail`

**Files:**
- Modify: `internal/api/wire.go`
- Modify: `internal/api/messages.go`
- Modify: `cmd/agent/main.go`

- [ ] **Step 1: Edit `wire.go`**

In the server-to-agent block (around line 32-37), replace:

```go
MsgAgentUpdateAvail MessageType = "agent.update.available"
```

with:

```go
MsgCommandUpdate MessageType = "command.update"
```

- [ ] **Step 2: Edit `messages.go`**

Delete the `AgentUpdateAvailablePayload` struct (lines ~364-371) and its doc comment. Add immediately before `TreeListRequestPayload`:

```go
// CommandUpdatePayload carries no operational data — the agent
// already knows its own os/arch and fetches from its configured
// server URL via /agent/binary. JobID is the server-issued id of
// the update job; the agent echoes it on log.stream lines so the
// live job log captures pre-restart progress.
type CommandUpdatePayload struct {
	JobID string `json:"job_id"`
}
```

- [ ] **Step 3: Edit `cmd/agent/main.go`**

Replace the `case api.MsgAgentUpdateAvail` block (lines ~398-401) with:

```go
case api.MsgCommandUpdate:
	var p api.CommandUpdatePayload
	if err := env.UnmarshalPayload(&p); err != nil {
		return fmt.Errorf("command.update: %w", err)
	}
	go d.runUpdate(ctx, p, tx)
```

`runUpdate` lands in Task 9.

- [ ] **Step 4: Update `JobKind` constants in `messages.go`**

In the `JobKind` const block (line ~57), add:

```go
JobUpdate JobKind = "update"
```

- [ ] **Step 5: Verify**

```sh
go build ./...
```

Expected: build error from `cmd/agent/main.go` calling `d.runUpdate` (not yet defined). That's fine — proceed; Task 9 plugs the gap. Verify only the `internal/api` package builds:

```sh
go build ./internal/api/
```

Expected: PASS.

- [ ] **Step 6: Commit**

```
git add internal/api/wire.go internal/api/messages.go cmd/agent/main.go
git commit -m "api: replace agent.update.available with command.update + JobUpdate kind"
```

---

## Phase 3 — Database migrations

### Task 5: Migration 0021 — widen `jobs.kind` CHECK

**Files:**
- Create: `internal/store/migrations/0021_jobs_update_kind.sql`

- [ ] **Step 1: Write the migration**

Mirror the pattern in `0012_jobs_restore_diff_kind.sql` exactly: temp-backup of `job_logs`, rebuild `jobs` with the wider CHECK, restore log rows, recreate indexes. Only change is the CHECK list now includes `'update'`:

```sql
-- 0021_jobs_update_kind.sql
--
-- Add 'update' to the jobs.kind CHECK constraint so the agent
-- self-update flow (P6-01) can persist its job rows. Same safe
-- rebuild pattern as 0012; cascade trap mitigated by job_logs
-- temp-backup.
CREATE TEMPORARY TABLE _job_logs_backup AS
  SELECT job_id, seq, ts, stream, payload FROM job_logs;

CREATE TABLE jobs_new (
  id           TEXT PRIMARY KEY,
  host_id      TEXT NOT NULL REFERENCES hosts(id) ON DELETE CASCADE,
  kind         TEXT NOT NULL CHECK (kind IN ('backup','init','forget','prune','check','unlock','restore','diff','update')),
  status       TEXT NOT NULL CHECK (status IN ('queued','running','succeeded','failed','cancelled')),
  scheduled_id TEXT REFERENCES schedules(id) ON DELETE SET NULL,
  actor_kind   TEXT NOT NULL CHECK (actor_kind IN ('user','schedule','system')),
  actor_id     TEXT,
  started_at   TEXT,
  finished_at  TEXT,
  exit_code    INTEGER,
  stats        TEXT,
  error        TEXT,
  created_at   TEXT NOT NULL
);

INSERT INTO jobs_new
  SELECT id, host_id, kind, status, scheduled_id, actor_kind, actor_id,
         started_at, finished_at, exit_code, stats, error, created_at
  FROM jobs;

DROP TABLE jobs;
ALTER TABLE jobs_new RENAME TO jobs;

CREATE INDEX jobs_host_id ON jobs(host_id);
CREATE INDEX jobs_status ON jobs(status);
CREATE INDEX jobs_created_at ON jobs(created_at);

INSERT OR IGNORE INTO job_logs (job_id, seq, ts, stream, payload)
  SELECT job_id, seq, ts, stream, payload FROM _job_logs_backup;

DROP TABLE _job_logs_backup;
```

If the live `jobs` schema already has columns added by post-0012 migrations (e.g. 0015 added `source_group_id`), match them in `jobs_new` and the INSERT — check the latest schema before writing.

- [ ] **Step 2: Verify**

```sh
go test ./internal/store/ -run TestMigrations -v
```

Expected: PASS, includes 0021.

- [ ] **Step 3: Commit**

```
git add internal/store/migrations/0021_jobs_update_kind.sql
git commit -m "store: migration 0021 — add 'update' to jobs.kind"
```

### Task 6: Migration 0022 — fleet_updates tables

**Files:**
- Create: `internal/store/migrations/0022_fleet_updates.sql`

- [ ] **Step 1: Write the migration**

```sql
-- 0022_fleet_updates.sql
--
-- Tables backing the rolling fleet-update worker (P6-02).
-- One row in fleet_updates per "update all" invocation, a child row
-- per host so the worker can resume / report progress / mark
-- per-host failures.

CREATE TABLE fleet_updates (
  id                 TEXT PRIMARY KEY,
  started_at         TEXT NOT NULL,
  started_by_user_id TEXT NOT NULL REFERENCES users(id),
  target_version     TEXT NOT NULL,
  status             TEXT NOT NULL CHECK (status IN ('running','completed','halted','cancelled')),
  current_host_id    TEXT REFERENCES hosts(id),
  halted_reason      TEXT,
  completed_at       TEXT
);

CREATE INDEX fleet_updates_status ON fleet_updates(status);

CREATE TABLE fleet_update_hosts (
  fleet_update_id TEXT NOT NULL REFERENCES fleet_updates(id) ON DELETE CASCADE,
  host_id         TEXT NOT NULL REFERENCES hosts(id) ON DELETE CASCADE,
  position        INTEGER NOT NULL,
  status          TEXT NOT NULL CHECK (status IN ('pending','running','succeeded','failed','skipped')),
  job_id          TEXT REFERENCES jobs(id),
  failed_reason   TEXT,
  PRIMARY KEY (fleet_update_id, host_id)
);

CREATE INDEX fleet_update_hosts_status ON fleet_update_hosts(fleet_update_id, position);
```

- [ ] **Step 2: Verify**

```sh
go test ./internal/store/ -run TestMigrations -v
```

- [ ] **Step 3: Commit**

```
git add internal/store/migrations/0022_fleet_updates.sql
git commit -m "store: migration 0022 — fleet_updates + fleet_update_hosts"
```

### Task 7: `internal/store/fleet_updates.go`

**Files:**
- Create: `internal/store/fleet_updates.go`
- Create: `internal/store/fleet_updates_test.go`

- [ ] **Step 1: Write the failing tests**

```go
package store_test

// Cover: CreateFleetUpdate creates parent + N pending host rows;
// SetFleetUpdateHostStatus moves a row through pending→running→succeeded;
// HaltFleetUpdate sets status=halted + halted_reason and stamps no
// completed_at; CompleteFleetUpdate sets completed_at; ListPendingFleetUpdateHosts
// returns rows in position order; ActiveFleetUpdate returns the running
// row (or nil); GetFleetUpdate hydrates parent + children.
//
// One test per behaviour, table-driven where the API supports it.
// Mirror the structure of internal/store/maintenance_test.go.
```

Write four to six discrete test functions. Look at `internal/store/maintenance.go` + `_test.go` for the established style (constructor on `*Store`, NewStore + tmp DB).

- [ ] **Step 2: Implement**

Sketch:

```go
package store

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
	"time"
)

type FleetUpdate struct {
	ID              string
	StartedAt       time.Time
	StartedByUserID string
	TargetVersion   string
	Status          string // running | completed | halted | cancelled
	CurrentHostID   string
	HaltedReason    string
	CompletedAt     *time.Time
}

type FleetUpdateHost struct {
	FleetUpdateID string
	HostID        string
	Position      int
	Status        string // pending | running | succeeded | failed | skipped
	JobID         string
	FailedReason  string
}

func (s *Store) CreateFleetUpdate(ctx context.Context, fu FleetUpdate, hostIDs []string) error { ... }
func (s *Store) ActiveFleetUpdate(ctx context.Context) (*FleetUpdate, error) { ... }
func (s *Store) GetFleetUpdate(ctx context.Context, id string) (*FleetUpdate, []FleetUpdateHost, error) { ... }
func (s *Store) ListPendingFleetUpdateHosts(ctx context.Context, fuID string) ([]FleetUpdateHost, error) { ... }
func (s *Store) SetFleetUpdateHostStatus(ctx context.Context, fuID, hostID, status, failedReason, jobID string) error { ... }
func (s *Store) SetFleetUpdateCurrentHost(ctx context.Context, fuID, hostID string) error { ... }
func (s *Store) HaltFleetUpdate(ctx context.Context, fuID, reason string, when time.Time) error { ... }
func (s *Store) CancelFleetUpdate(ctx context.Context, fuID string) error { ... }
func (s *Store) CompleteFleetUpdate(ctx context.Context, fuID string, when time.Time) error { ... }
```

- [ ] **Step 3: Run tests**

```sh
go test ./internal/store/ -run TestFleetUpdate -v
```

Expected: PASS.
- [ ] **Step 4: Commit**

```
git add internal/store/fleet_updates.go internal/store/fleet_updates_test.go
git commit -m "store: fleet_updates + fleet_update_hosts CRUD"
```

---

## Phase 4 — Agent updater (Linux)

### Task 8: `internal/agent/updater` package skeleton + Linux path

**Files:**
- Create: `internal/agent/updater/updater.go`
- Create: `internal/agent/updater/updater_unix.go`
- Create: `internal/agent/updater/updater_windows.go`
- Create: `internal/agent/updater/updater_test.go`

- [ ] **Step 1: Write the failing test (Linux)**

```go
//go:build !windows

package updater

import (
	"io"
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
	"runtime"
	"testing"
)

func TestUpdate_LinuxAtomicSwap(t *testing.T) {
	// Stage 1: a fake "running binary" file + a server that serves new bytes.
	tmp := t.TempDir()
	binPath := filepath.Join(tmp, "agent")
	if err := os.WriteFile(binPath, []byte("OLD"), 0o755); err != nil {
		t.Fatal(err)
	}
	newBytes := []byte("NEW BINARY CONTENTS")

	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/agent/binary" {
			http.NotFound(w, r)
			return
		}
		if r.URL.Query().Get("os") != runtime.GOOS || r.URL.Query().Get("arch") != runtime.GOARCH {
			t.Errorf("unexpected query: %s", r.URL.RawQuery)
		}
		_, _ = io.Copy(w, &io.LimitedReader{R: bytesReader(newBytes), N: int64(len(newBytes))})
	}))
	defer srv.Close()

	if err := UpdateForTest(srv.URL, binPath); err != nil {
		t.Fatalf("update: %v", err)
	}

	got, err := os.ReadFile(binPath)
	if err != nil {
		t.Fatal(err)
	}
	if string(got) != "NEW BINARY CONTENTS" {
		t.Fatalf("binary contents: got %q", got)
	}
	old, err := os.ReadFile(binPath + ".old")
	if err != nil {
		t.Fatalf("agent.old missing: %v", err)
	}
	if string(old) != "OLD" {
		t.Fatalf("agent.old contents: got %q", old)
	}
	// .new must have been renamed away
	if _, err := os.Stat(binPath + ".new"); !os.IsNotExist(err) {
		t.Fatalf("agent.new should be absent after swap")
	}
}
```

`UpdateForTest(serverURL, binaryPath string) error`
is a tiny wrapper exposed by `updater.go` that does steps 1–6 of §4.1 of the spec (everything except `os.Exit`). The exit-and-restart side effect can't be covered by a unit test.

- [ ] **Step 2: Implement `updater.go` (shared)**

```go
package updater

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
	"runtime"
	"time"
)

// fetch downloads the new binary into .new, fsyncs, chmods.
// Returns the path to the staged file.
func fetch(ctx context.Context, serverURL, binaryPath string) (string, error) {
	url := fmt.Sprintf("%s/agent/binary?os=%s&arch=%s", serverURL, runtime.GOOS, runtime.GOARCH)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	c := &http.Client{Timeout: 5 * time.Minute}
	res, err := c.Do(req)
	if err != nil {
		return "", err
	}
	defer res.Body.Close()
	if res.StatusCode != http.StatusOK {
		return "", fmt.Errorf("agent binary fetch: %s", res.Status)
	}

	stagePath := binaryPath + ".new"
	f, err := os.OpenFile(stagePath, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
	if err != nil {
		return "", err
	}
	if _, err := io.Copy(f, res.Body); err != nil {
		f.Close()
		_ = os.Remove(stagePath)
		return "", err
	}
	if err := f.Sync(); err != nil {
		f.Close()
		_ = os.Remove(stagePath)
		return "", err
	}
	if err := f.Close(); err != nil {
		_ = os.Remove(stagePath)
		return "", err
	}
	if err := os.Chmod(stagePath, 0o755); err != nil {
		_ = os.Remove(stagePath)
		return "", err
	}
	return stagePath, nil
}

// resolveOwnBinary returns the absolute path of the running binary.
// Refuses /proc/self/exe — that's what os.Executable returns on
// some systems but it can't be renamed across.
func resolveOwnBinary() (string, error) {
	p, err := os.Executable()
	if err != nil {
		return "", err
	}
	abs, err := filepath.Abs(p)
	if err != nil {
		return "", err
	}
	if abs == "/proc/self/exe" {
		return "", fmt.Errorf("cannot resolve own binary path (/proc/self/exe — not a real file)")
	}
	return abs, nil
}

// UpdateForTest is the platform-neutral test seam.
// In production, Update (in updater_unix.go / updater_windows.go) does
// the same fetch+swap then exits the process. UpdateForTest stops short
// of the exit so unit tests can assert on file state.
func UpdateForTest(serverURL, binaryPath string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()
	stage, err := fetch(ctx, serverURL, binaryPath)
	if err != nil {
		return err
	}
	return swap(stage, binaryPath)
}
```

- [ ] **Step 3: Implement `updater_unix.go`**

```go
//go:build !windows

package updater

import (
	"context"
	"fmt"
	"io"
	"log/slog"
	"os"
	"time"
)

// Update fetches the new binary, swaps it in, then exits so systemd
// restarts the process under the new binary. Caller should close
// the WS cleanly before invoking.
func Update(ctx context.Context, serverURL string) error {
	binPath, err := resolveOwnBinary()
	if err != nil {
		return err
	}
	stage, err := fetch(ctx, serverURL, binPath)
	if err != nil {
		return err
	}
	if err := swap(stage, binPath); err != nil {
		return err
	}
	slog.Info("agent self-update: binary swapped, exiting for systemd restart", "binary", binPath)
	// Give logger a moment to flush, then exit.
	time.Sleep(200 * time.Millisecond)
	os.Exit(0)
	return nil // unreachable
}

// swap copies the running binary to .old, then atomic-renames the
// staged binary into place. On non-Windows this works because the OS
// allows renames across an open file.
func swap(stagePath, binPath string) error {
	src, err := os.Open(binPath)
	if err != nil {
		return fmt.Errorf("open running binary: %w", err)
	}
	defer src.Close()
	dst, err := os.OpenFile(binPath+".old", os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
	if err != nil {
		return fmt.Errorf("open .old: %w", err)
	}
	if _, err := io.Copy(dst, src); err != nil {
		dst.Close()
		return fmt.Errorf("copy to .old: %w", err)
	}
	if err := dst.Sync(); err != nil {
		dst.Close()
		return err
	}
	if err := dst.Close(); err != nil {
		return err
	}
	if err := os.Rename(stagePath, binPath); err != nil {
		return fmt.Errorf("rename .new over running binary: %w", err)
	}
	return nil
}
```

- [ ] **Step 4: Implement `updater_windows.go` stub**

```go
//go:build windows

package updater

import (
	"context"
	"errors"
)

// Update is implemented in Task 12. Stubbed so the package builds
// on Windows during Tasks 8-11.
func Update(ctx context.Context, serverURL string) error {
	return errors.New("agent self-update on Windows: not yet implemented")
}

func swap(stagePath, binPath string) error {
	return errors.New("agent self-update on Windows: not yet implemented")
}
```

- [ ] **Step 5: Add `bytesReader` helper to test file** (or use `bytes.NewReader` directly).

- [ ] **Step 6: Run tests**

```sh
go test ./internal/agent/updater/ -v
```

Expected: PASS.

- [ ] **Step 7: Commit**

```
git add internal/agent/updater/
git commit -m "agent: updater package — Linux atomic-swap path"
```

### Task 9: Wire `command.update` into the agent dispatcher

**Files:**
- Create: `cmd/agent/update_dispatch.go`
- Verify: `cmd/agent/main.go` (already edited in Task 4)

- [ ] **Step 1: Implement the dispatcher method**

```go
package main

import (
	"context"
	"fmt"
	"log/slog"
	"time"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/agent/updater"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/agent/wsclient"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
)

// runUpdate handles a server-dispatched command.update.
// It logs progress via log.stream so the live job page captures
// pre-restart state, then calls the platform updater. On Linux the
// updater calls os.Exit; on Windows it spawns a helper and returns,
// with the agent then exiting.
func (d *dispatcher) runUpdate(ctx context.Context, p api.CommandUpdatePayload, tx wsclient.Sender) {
	logf := func(format string, args ...any) {
		line := fmt.Sprintf(format, args...)
		slog.Info("ws agent: update: " + line)
		env, _ := api.Marshal(api.MsgLogStream, "", api.LogStreamPayload{
			JobID:  p.JobID,
			Stream: api.LogStdout,
			Data:   line + "\n",
			At:     time.Now().UTC(),
		})
		_ = tx.Send(env)
	}

	// Job-started so the server flips queued→running.
	startedEnv, _ := api.Marshal(api.MsgJobStarted, "", api.JobStartedPayload{
		JobID:     p.JobID,
		Kind:      api.JobUpdate,
		StartedAt: time.Now().UTC(),
	})
	_ = tx.Send(startedEnv)

	logf("fetching new binary from %s", d.serverURL)
	if err := updater.Update(ctx, d.serverURL); err != nil {
		logf("update failed: %v", err)
		finishedEnv, _ := api.Marshal(api.MsgJobFinished, "", api.JobFinishedPayload{
			JobID:      p.JobID,
			Kind:       api.JobUpdate,
			Status:     api.JobFailed,
			FinishedAt: time.Now().UTC(),
			Error:      err.Error(),
		})
		_ = tx.Send(finishedEnv)
		return
	}
	// Unreachable on Linux (Update calls os.Exit). On Windows control
	// returns here and the agent exits cleanly so SCM hands off to the
	// helper script that does the actual swap-and-restart.
}
```

`d.serverURL` should already exist on the dispatcher (it's the URL the WS connection was made to). If not, plumb it from `cmd/agent/main.go`'s connection setup — the URL is in the agent config.

- [ ] **Step 2: Verify build**

```sh
go build ./...
```

Expected: PASS.

- [ ] **Step 3: Run all agent tests**

```sh
go test ./cmd/agent/... ./internal/agent/...
```

Expected: PASS, no regressions.
- [ ] **Step 4: Commit**

```
git add cmd/agent/update_dispatch.go cmd/agent/main.go
git commit -m "agent: handle command.update — fetch + swap + exit"
```

---

## Phase 5 — Server endpoint + hello integration

### Task 10: `POST /api/hosts/{id}/update`

**Files:**
- Create: `internal/server/http/host_update.go`
- Create: `internal/server/http/host_update_test.go`
- Modify: `internal/server/http/server.go` (route registration)

- [ ] **Step 1: Write tests covering**

Mirror the structure of an existing admin-band endpoint test, e.g. `repo_ops_test.go`:

- happy path: admin POST → 200 + `{job_id}` returned, `jobs` row created with `kind=update`, audit row written, WS envelope `command.update` sent to the host's connection.
- refuses when host offline → 409 / structured error code `host_offline`.
- refuses when `agent_version == server.Version` → 409 / `already_up_to_date`.
- refuses when an `update` job is already running for this host → 409 / `update_in_progress`.
- RBAC: operator → 403, viewer → 403.

- [ ] **Step 2: Implement**

```go
package http

import (
	"encoding/json"
	stdhttp "net/http"

	"github.com/go-chi/chi/v5"
	"github.com/oklog/ulid/v2"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
)

// handleHostUpdate dispatches a command.update WS envelope after
// validating that the host is online, currently running a different
// version, and not already in the middle of an update.
func (s *Server) handleHostUpdate(w stdhttp.ResponseWriter, r *stdhttp.Request) {
	hostID := chi.URLParam(r, "id")
	host, err := s.deps.Store.GetHost(r.Context(), hostID)
	if err != nil {
		writeJSONError(w, stdhttp.StatusNotFound, "host_not_found", "")
		return
	}
	if !s.deps.Hub.IsOnline(hostID) {
		writeJSONError(w, stdhttp.StatusConflict, "host_offline", "agent must be online to receive an update")
		return
	}
	if host.AgentVersion == version.Version {
		writeJSONError(w, stdhttp.StatusConflict, "already_up_to_date", "host is already on "+version.Version)
		return
	}
	running, err := s.deps.Store.RunningUpdateJobForHost(r.Context(), hostID)
	if err != nil {
		// Fail closed: a store error must not let a second update through.
		writeJSONError(w, stdhttp.StatusInternalServerError, "internal", err.Error())
		return
	}
	if running != "" {
		writeJSONError(w, stdhttp.StatusConflict, "update_in_progress", "an update job is already running for this host")
		return
	}

	jobID := ulid.Make().String()
	user := userFrom(r) // existing helper
	if err := s.deps.Store.InsertJob(r.Context(), store.Job{
		ID:        jobID,
		HostID:    hostID,
		Kind:      string(api.JobUpdate),
		Status:    string(api.JobQueued),
		ActorKind: "user",
		ActorID:   user.ID,
		// CreatedAt is set by InsertJob.
	}); err != nil {
		writeJSONError(w, stdhttp.StatusInternalServerError, "internal", err.Error())
		return
	}

	env, _ := api.Marshal(api.MsgCommandUpdate, ulid.Make().String(), api.CommandUpdatePayload{JobID: jobID})
	if err := s.deps.Hub.SendTo(hostID, env); err != nil {
		writeJSONError(w, stdhttp.StatusBadGateway, "send_failed", err.Error())
		return
	}

	s.audit(r, "host.update_dispatched", store.AuditTarget{
		Kind: "host", ID: hostID,
	}, map[string]any{"job_id": jobID, "target_version": version.Version})

	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(map[string]string{"job_id": jobID})
}

// Form-post variant for HTMX. Same gates, returns HX-Redirect.
func (s *Server) handleHostUpdateForm(w stdhttp.ResponseWriter, r *stdhttp.Request) {
	// reuse handleHostUpdate's pre-checks via a shared validator;
	// on success, set HX-Redirect to /jobs/ and write 200.
	// On error, render an inline banner partial.
}
```

Helpers to add:

- `Store.RunningUpdateJobForHost(ctx, hostID) (string, error)` — returns the job id of any running `kind=update` job for this host, or empty string + nil if none. One-line query.

- [ ] **Step 3: Register routes**

In `server.go`, inside the admin-only route group:

```go
r.Post("/api/hosts/{id}/update", s.handleHostUpdate)
r.Post("/hosts/{id}/update", s.handleHostUpdateForm)
```

- [ ] **Step 4: Run tests**

```sh
go test ./internal/server/http/ -run TestHostUpdate -v
```

Expected: PASS.

- [ ] **Step 5: Commit**

```
git add internal/server/http/host_update.go internal/server/http/host_update_test.go internal/server/http/server.go internal/store/jobs.go
git commit -m "http: POST /api/hosts/{id}/update — dispatch agent update"
```

### Task 11: Hello-handler integration + timeout watcher

**Files:**
- Modify: `internal/server/ws/handler.go`
- Create: `internal/server/ws/update_watch.go`
- Create: `internal/server/ws/update_watch_test.go`

- [ ] **Step 1: Write the failing test**

```go
// In update_watch_test.go:
//
// 1. NewWatcher; Track(jobID, hostID, started=now). Hello arrives after
//    50ms with matching version → watcher marks the job succeeded
//    (verify via mock Store.SetJobStatus call).
// 2. NewWatcher; Track(...). 100ms timeout (override constant for test).
//    No hello arrives → after 100ms, watcher marks the job failed with
//    reason "timeout" and raises an alert (verify via mock Store +
//    AlertEngine).
// 3. NewWatcher; Track(...). Hello arrives but version doesn't match.
//    Watcher does nothing (timeout will catch). After timeout, marked
//    failed with reason "agent reconnected at version X, expected Y".
// 4. Cancel: Track then explicitly Stop(jobID) — no further callbacks.
```

- [ ] **Step 2: Implement the watcher**

```go
package ws

import (
	"context"
	"fmt"
	"sync"
	"time"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
)

// updateTimeout is the default ceiling for how long the server waits
// for an agent re-hello carrying the matching version after dispatching
// a command.update. Declared as a var so tests can shrink it.
var updateTimeout = 90 * time.Second

type updateWatch struct {
	jobID    string
	hostID   string
	deadline time.Time
}

type updateWatcher struct {
	mu      sync.Mutex
	pending map[string]*updateWatch // hostID → watch
	store   *store.Store
	alerts  AlertRaiser // small interface, injected
	now     func() time.Time
}

func newUpdateWatcher(st *store.Store, alerts AlertRaiser) *updateWatcher {
	return &updateWatcher{
		pending: make(map[string]*updateWatch),
		store:   st,
		alerts:  alerts,
		now:     func() time.Time { return time.Now().UTC() },
	}
}

// Track registers an in-flight update. If a hello with the matching
// version arrives before the deadline, OnHello marks the job succeeded
// and clears the entry. Otherwise the watcher's sweep marks the job
// failed.
func (w *updateWatcher) Track(jobID, hostID string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.pending[hostID] = &updateWatch{
		jobID:    jobID,
		hostID:   hostID,
		deadline: w.now().Add(updateTimeout),
	}
}

// OnHello is called by the WS handler when an agent hellos. If a watch
// is pending for this host AND the version matches, mark succeeded and
// drop the watch. Mismatched version → leave the watch (timeout
// handles it).
func (w *updateWatcher) OnHello(ctx context.Context, hostID, agentVersion, serverVersion string) {
	w.mu.Lock()
	watch, ok := w.pending[hostID]
	if ok && agentVersion == serverVersion {
		delete(w.pending, hostID)
	}
	w.mu.Unlock()
	if !ok || agentVersion != serverVersion {
		return
	}
	// Mark job succeeded.
	_ = w.store.SetJobStatus(ctx, watch.jobID, string(api.JobSucceeded), "", w.now())
	// Audit + alert auto-resolve.
// (audit hook reused via http layer's helper, or write directly here) } // Run is a goroutine started by NewHandler — sweeps for expired // watches every 5s. func (w *updateWatcher) Run(ctx context.Context) { tick := time.NewTicker(5 * time.Second) defer tick.Stop() for { select { case <-ctx.Done(): return case <-tick.C: w.sweep(ctx) } } } func (w *updateWatcher) sweep(ctx context.Context) { now := w.now() w.mu.Lock() expired := []*updateWatch{} for hostID, wch := range w.pending { if now.After(wch.deadline) { expired = append(expired, wch) delete(w.pending, hostID) } } w.mu.Unlock() for _, wch := range expired { // Determine reason: did the agent come back at all? host, _ := w.store.GetHost(ctx, wch.hostID) reason := "timeout: agent did not reconnect within 90s" if host != nil && host.AgentVersion != "" && host.AgentVersion != version.Version { reason = fmt.Sprintf("agent reconnected at %s, expected %s", host.AgentVersion, version.Version) } _ = w.store.SetJobStatus(ctx, wch.jobID, string(api.JobFailed), reason, now) if w.alerts != nil { w.alerts.RaiseUpdateFailed(ctx, wch.hostID, wch.jobID, reason, now) } } } ``` - [ ] **Step 3: Hook into the WS handler** In `handler.go`, where `onAgentHello` is defined (search for the place it upserts `agent_version`), at the *end* of the handler — after the upsert succeeds — call: ```go deps.UpdateWatcher.OnHello(ctx, hostID, hello.AgentVersion, version.Version) ``` The `UpdateWatcher *updateWatcher` field needs to exist on the handler `Deps` struct. Wire it up in `cmd/server/main.go`. `AlertRaiser` interface (defined alongside the watcher) is implemented by `*alert.Engine` after Task 14 adds the `RaiseUpdateFailed` method. For now, define the interface and make the engine satisfy it. - [ ] **Step 4: Run tests** ```sh go test ./internal/server/ws/ -v ``` Expected: PASS. 
- [ ] **Step 5: Commit**

```
git add internal/server/ws/update_watch.go internal/server/ws/update_watch_test.go internal/server/ws/handler.go cmd/server/main.go
git commit -m "ws: update watcher — promote/fail update jobs on hello timeout"
```

---

## Phase 6 — Windows updater path

### Task 12: Windows helper-script implementation

**Files:**
- Modify: `internal/agent/updater/updater_windows.go`

- [ ] **Step 1: Replace the stub from Task 8**

```go
//go:build windows

package updater

import (
	"context"
	"fmt"
	"log/slog"
	"os"
	"os/exec"
	"path/filepath"
	"syscall"
	"time"
)

// Windows process-creation flags.
const (
	detachedProcess = 0x00000008 // DETACHED_PROCESS
	createNoWindow  = 0x08000000 // CREATE_NO_WINDOW
)

// helperScript is rendered with fmt.Sprintf. The four %s verbs are, in
// order: binary, backup (.old), staged download, binary. %%~f0 renders
// as %~f0, the script's own path. Delays use ping rather than
// timeout.exe, which refuses to run when stdin is redirected (os/exec
// attaches the null device when Cmd.Stdin is nil).
const helperScript = `@echo off
ping -n 4 127.0.0.1 >nul
copy /Y "%s" "%s"
sc stop restic-manager-agent
:wait
sc query restic-manager-agent | find "STOPPED" >nul
if errorlevel 1 (ping -n 2 127.0.0.1 >nul & goto wait)
move /Y "%s" "%s"
sc start restic-manager-agent
del "%%~f0"
`

// Update downloads the replacement binary, writes the helper script
// next to the running binary, spawns it detached, and exits so the
// helper can swap the file while the service is stopped.
func Update(ctx context.Context, serverURL string) error {
	binPath, err := resolveOwnBinary()
	if err != nil {
		return err
	}
	stage, err := fetch(ctx, serverURL, binPath)
	if err != nil {
		return err
	}
	helperPath := filepath.Join(filepath.Dir(binPath), "agent-update.cmd")
	oldPath := binPath + ".old"
	body := fmt.Sprintf(helperScript, binPath, oldPath, stage, binPath)
	if err := os.WriteFile(helperPath, []byte(body), 0o755); err != nil {
		return err
	}
	cmd := exec.Command("cmd.exe", "/c", helperPath)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		HideWindow:    true,
		CreationFlags: detachedProcess | createNoWindow,
	}
	if err := cmd.Start(); err != nil {
		return err
	}
	slog.Info("agent self-update: helper spawned, exiting cleanly", "binary", binPath)
	time.Sleep(200 * time.Millisecond)
	os.Exit(0)
	return nil // unreachable; satisfies the compiler
}

func swap(_, _ string) error { return nil } // not used on Windows
```

- [ ] **Step 2: Verify cross-compile**

```sh
GOOS=windows GOARCH=amd64 go build ./...
```

Expected: PASS.
- [ ] **Step 3: Commit**

```
git add internal/agent/updater/updater_windows.go
git commit -m "agent: Windows updater — detached helper script"
```

---

## Phase 7 — Alert kinds + auto-resolve

### Task 13: Add `update_failed` and `fleet_update_halted` alert kinds

**Files:**
- Create: `internal/alert/update_alerts.go`
- Modify: `internal/alert/rules.go` (auto-resolve hook on host hello)

- [ ] **Step 1: Implement**

```go
package alert

import (
	"context"
	"time"
)

// RaiseUpdateFailed is called by the WS update-watcher when an agent
// fails to come back at the target version after a command.update
// dispatch. Auto-resolves when the host next hellos with the right
// version.
func (e *Engine) RaiseUpdateFailed(ctx context.Context, hostID, jobID, reason string, when time.Time) {
	dedup := "update_failed:" + hostID
	msg := "agent self-update failed: " + reason
	e.raiseAndNotify(ctx, hostID, "update_failed", dedup, "warning", msg, when)
}

// ResolveUpdateFailed is called from the WS hello handler when the
// host comes back at the expected version.
func (e *Engine) ResolveUpdateFailed(ctx context.Context, hostID string, when time.Time) {
	e.resolveAndNotify(ctx, hostID, "update_failed", "update_failed:"+hostID, when)
}

// RaiseFleetUpdateHalted is called by the fleet-update worker when it
// halts on a per-host failure. No host id (global alert).
func (e *Engine) RaiseFleetUpdateHalted(ctx context.Context, fleetUpdateID, reason string, when time.Time) {
	dedup := "fleet_update_halted:" + fleetUpdateID
	msg := "fleet update halted: " + reason
	e.raiseAndNotify(ctx, "", "fleet_update_halted", dedup, "warning", msg, when)
}
```

- [ ] **Step 2: Wire auto-resolve into the WS hello handler** (in Task 11's update watcher: when a successful match is recorded, also call `ResolveUpdateFailed`).
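
The raise and resolve helpers share one dedup key per host, which is what lets a later successful hello close the alert opened by an earlier failure. The sketch below models that pairing in memory. It assumes, without confirmation from the plan, that raising an already-open key is a no-op and that resolving removes it; `alertBook`, `raise`, `resolve`, and `updateFailedKey` are hypothetical names, not part of `alert.Engine`.

```go
package main

import "fmt"

// alertBook is a hypothetical in-memory model of open alerts keyed by
// dedup key. Assumption: one open alert per key at most.
type alertBook struct {
	open map[string]string // dedup key -> message
}

// updateFailedKey builds the same key shape the plan uses.
func updateFailedKey(hostID string) string { return "update_failed:" + hostID }

// raise opens the alert and reports whether it was newly opened.
func (b *alertBook) raise(key, msg string) bool {
	if _, exists := b.open[key]; exists {
		return false
	}
	b.open[key] = msg
	return true
}

// resolve closes the alert and reports whether anything was open.
func (b *alertBook) resolve(key string) bool {
	if _, exists := b.open[key]; !exists {
		return false
	}
	delete(b.open, key)
	return true
}

func main() {
	b := &alertBook{open: map[string]string{}}
	key := updateFailedKey("host-1")
	fmt.Println(b.raise(key, "agent self-update failed: timeout")) // first failure opens it
	fmt.Println(b.raise(key, "agent self-update failed: timeout")) // repeat failure is deduplicated
	fmt.Println(b.resolve(key))                                    // matching hello closes it
	fmt.Println(len(b.open))                                       // nothing left open
}
```

If the engine's real dedup semantics differ, the point that still carries over is that `RaiseUpdateFailed` and `ResolveUpdateFailed` must build the key identically.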
- [ ] **Step 3: Commit**

```
git add internal/alert/update_alerts.go
git commit -m "alert: update_failed + fleet_update_halted kinds"
```

---

## Phase 8 — Fleet-update worker

### Task 14: `internal/server/fleetupdate` worker

**Files:**
- Create: `internal/server/fleetupdate/worker.go`
- Create: `internal/server/fleetupdate/worker_test.go`

- [ ] **Step 1: Sketch the API**

```go
package fleetupdate

import (
	"context"
	"errors"
	"sync"
	"time"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
)

// ErrAlreadyRunning is returned by Start while another run holds the lock.
var ErrAlreadyRunning = errors.New("fleetupdate: a run is already in progress")

// Worker owns the at-most-one rolling fleet-update goroutine.
type Worker struct {
	mu         sync.Mutex // ensures one run at a time
	store      *store.Store
	hub        Hub        // small interface — IsOnline, SendTo
	dispatcher Dispatcher // small interface — DispatchUpdate(ctx, hostID, fleetUpdateID) (jobID string, err error)
	watcher    Watcher    // small interface — WaitForVersion(ctx, hostID, version, timeout) bool
	alerts     AlertRaiser
}

// Start kicks off a new fleet update. Validates that no other run
// is in progress. Returns the new fleet_update id on success.
func (w *Worker) Start(ctx context.Context, userID, targetVersion string, hostIDs []string) (string, error) {
	if !w.mu.TryLock() {
		return "", ErrAlreadyRunning
	}
	// build the fleet_updates row + N pending fleet_update_hosts rows
	// in position order (yielding fuID), then spawn a goroutine that
	// runs the loop. On any error before the goroutine starts, call
	// w.mu.Unlock() before returning.
	//
	// ctx must outlive the HTTP request that called Start; a request
	// context is cancelled as soon as the handler returns.
	go w.run(ctx, fuID)
	return fuID, nil
}

// run is the rolling loop. For each pending host: pre-check, dispatch,
// wait for hello-with-target-version, mark succeeded/failed, halt on
// first failure.
func (w *Worker) run(ctx context.Context, fuID string) {
	defer w.mu.Unlock()
	// ... see spec §7.2 pseudocode
}
```

- [ ] **Step 2: Write tests**

Use mocks/fakes for Hub/Dispatcher/Watcher. Cover:

- two-host run, both succeed → completed.
- first host succeeds, second times out → halted, alert raised, third stays pending.
- host goes offline mid-run → halted with reason "host went offline".
- host already at target version when its turn comes (raced with another path) → skipped, loop continues.
- cancel mid-run → status=cancelled, current host's job left running, no further dispatches.
- start while another run active → returns ErrAlreadyRunning.

- [ ] **Step 3: Implement the run loop**

```go
func (w *Worker) run(ctx context.Context, fuID string) {
	defer w.mu.Unlock()
	pending, err := w.store.ListPendingFleetUpdateHosts(ctx, fuID)
	if err != nil {
		return
	}
	for _, p := range pending {
		// Re-check status — could have been cancelled.
		fu, _ := w.store.ActiveFleetUpdate(ctx)
		if fu == nil || fu.Status != "running" || fu.ID != fuID {
			return
		}
		_ = w.store.SetFleetUpdateCurrentHost(ctx, fuID, p.HostID)

		host, _ := w.store.GetHost(ctx, p.HostID)
		if host == nil {
			_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "skipped", "host deleted", "")
			continue
		}
		// Already at target?
		// (target version comes from the fleet_update row)
		if host.AgentVersion == fu.TargetVersion {
			_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "skipped", "already at target", "")
			continue
		}
		if !w.hub.IsOnline(p.HostID) {
			_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "failed", "host offline at dispatch time", "")
			_ = w.store.HaltFleetUpdate(ctx, fuID, "host went offline: "+host.Hostname, time.Now().UTC())
			w.alerts.RaiseFleetUpdateHalted(ctx, fuID, "host went offline: "+host.Hostname, time.Now().UTC())
			return
		}
		_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "running", "", "")
		jobID, err := w.dispatcher.DispatchUpdate(ctx, p.HostID, fuID)
		if err != nil {
			// Use one reason for both the halt record and the alert so
			// they name the same host and cause.
			reason := "dispatch failed on " + host.Hostname + ": " + err.Error()
			_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "failed", err.Error(), "")
			_ = w.store.HaltFleetUpdate(ctx, fuID, reason, time.Now().UTC())
			w.alerts.RaiseFleetUpdateHalted(ctx, fuID, reason, time.Now().UTC())
			return
		}
		_ = w.store.SetFleetUpdateHostStatusJob(ctx, fuID, p.HostID, jobID)

		// 95s is 5s longer than the ws watcher's 90s updateTimeout.
		ok := w.watcher.WaitForVersion(ctx, p.HostID, fu.TargetVersion, 95*time.Second)
		if !ok {
			_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "failed", "did not reconnect at target version", jobID)
			_ = w.store.HaltFleetUpdate(ctx, fuID, "update failed on "+host.Hostname, time.Now().UTC())
			w.alerts.RaiseFleetUpdateHalted(ctx, fuID, "update failed on "+host.Hostname, time.Now().UTC())
			return
		}
		_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "succeeded", "", jobID)
	}
	_ = w.store.CompleteFleetUpdate(ctx, fuID, time.Now().UTC())
}
```

- [ ] **Step 4: Run tests**

```sh
go test ./internal/server/fleetupdate/ -v
```

Expected: PASS.

- [ ] **Step 5: Commit**

```
git add internal/server/fleetupdate/
git commit -m "fleetupdate: rolling worker with halt-on-fail"
```

### Task 15: HTTP endpoints + page handler for fleet update

**Files:**
- Create: `internal/server/http/fleet_update.go`
- Create: `internal/server/http/fleet_update_test.go`
- Create: `web/templates/pages/fleet_update.html`
- Modify: `internal/server/http/server.go` (route registration)

- [ ] **Step 1: Endpoints**

```go
// POST /api/fleet/update — admin-only, body: {target_version?}.
// If target_version omitted, defaults to current server version.
// Returns {fleet_update_id}.
func (s *Server) handleFleetUpdateStart(w stdhttp.ResponseWriter, r *stdhttp.Request) {
	...
}

// POST /api/fleet-updates/{id}/cancel — admin-only.
func (s *Server) handleFleetUpdateCancel(w stdhttp.ResponseWriter, r *stdhttp.Request) {
	...
}

// GET /api/fleet-updates/{id} — admin-only, returns
// {fleet_update + per-host array} as JSON.
func (s *Server) handleFleetUpdateGet(w stdhttp.ResponseWriter, r *stdhttp.Request) {
	...
}

// GET /settings/fleet-update — admin-only, renders the page.
// Shows idle list (out-of-date online hosts) when no run is active,
// or the running run's progress.
func (s *Server) handleFleetUpdatePage(w stdhttp.ResponseWriter, r *stdhttp.Request) {
	...
}
```

- [ ] **Step 2: Tests**

Unit-test the page handler (idle vs running variants) and the start endpoint (accepts target list, refuses if a run is already active, RBAC).

- [ ] **Step 3: Page template**

`web/templates/pages/fleet_update.html`:

- Inherit from the base layout.
- Idle state block: header "Fleet update", paragraph explaining "rolling updates one host at a time, halts on first failure", table of out-of-date online hosts with Hostname / Current / Target / Last seen, plus a typed-confirm dialog ("Type the host count to confirm"), "Start rolling update" button.
- Running state block: htmx auto-refresh every 3s (`hx-get="/api/fleet-updates/{id}/partial" hx-trigger="every 3s [...visibility...]"`), per-host progress list with status pill, link to job log when present, "Cancel" button.
- Mirror the visual idiom of `web/templates/pages/alerts.html` for the auto-refresh behaviour.

- [ ] **Step 4: Run tests + smoke render**

```sh
go test ./internal/server/http/ -run TestFleetUpdate -v
```

- [ ] **Step 5: Commit**

```
git add internal/server/http/fleet_update.go internal/server/http/fleet_update_test.go web/templates/pages/fleet_update.html internal/server/http/server.go
git commit -m "http: fleet update endpoints + /settings/fleet-update page"
```

---

## Phase 9 — UI surfacing

### Task 16: Update chip on host row + host detail header

**Files:**
- Create: `web/templates/partials/host_update_chip.html`
- Modify: `web/templates/partials/host_row.html`
- Modify: `web/templates/partials/host_chrome.html`
- Modify: `internal/server/http/hosts.go` (add `UpdateAvailable` and `TargetVersion` fields to the row view-model)
- Modify: `internal/server/http/host_detail.go` (or wherever `host_chrome` is populated)
- Modify: `web/styles/input.css`

- [ ] **Step 1: View-model**

Compute `UpdateAvailable bool` and `TargetVersion string` (= server version) anywhere `host` data is built for templates. Hide the chip when `host.AgentVersion` is empty or already matches the server version.
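
The view-model rule from Step 1 reduces to one predicate, and the dashboard count of hosts behind can reuse it. A minimal sketch, assuming a cut-down row type: `hostRow`, `newHostRow`, and `countBehind` are hypothetical names, and only the fields the chip needs are shown.

```go
package main

import "fmt"

// hostRow is a hypothetical slice of the row view-model; only the
// fields the chip needs are shown.
type hostRow struct {
	Hostname        string
	AgentVersion    string
	TargetVersion   string
	UpdateAvailable bool
}

// newHostRow fills the chip fields. Per Step 1, the chip is hidden when
// the agent version is empty or already equals the server version.
func newHostRow(hostname, agentVersion, serverVersion string) hostRow {
	return hostRow{
		Hostname:        hostname,
		AgentVersion:    agentVersion,
		TargetVersion:   serverVersion,
		UpdateAvailable: agentVersion != "" && agentVersion != serverVersion,
	}
}

// countBehind applies the same predicate across a host list, as a
// dashboard tile would need.
func countBehind(rows []hostRow) int {
	n := 0
	for _, r := range rows {
		if r.UpdateAvailable {
			n++
		}
	}
	return n
}

func main() {
	const server = "v0.0.1-smoke-B"
	rows := []hostRow{
		newHostRow("uptime", "v0.0.1-smoke-A", server), // behind
		newHostRow("fresh", "", server),                // no version reported: hidden
		newHostRow("current", server, server),          // up to date: hidden
	}
	fmt.Println(countBehind(rows)) // prints 1
}
```

Keeping the comparison in one constructor means the row, the detail header, and the dashboard tile cannot disagree about which hosts are out of date.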
- [ ] **Step 2: Partial**

```html
{{ define "host_update_chip" }}
{{ if .UpdateAvailable }}
out of date · {{ .AgentVersion }} → {{ .TargetVersion }}
{{ end }}
{{ end }}
```

- [ ] **Step 3: CSS**

`web/styles/input.css`:

```css
.update-chip {
  @apply inline-flex items-center gap-1 px-2 py-0.5 rounded text-xs;
  @apply bg-amber-50 text-amber-900 border border-amber-200;
}
```

- [ ] **Step 4: Render Tailwind + commit**

```sh
make build
git add web/templates web/styles/input.css web/static/css/styles.css internal/server/http/hosts.go internal/server/http/host_detail.go
git commit -m "ui: update chip on host row + detail header"
```

### Task 17: Per-host Update agent button on `/hosts/{id}`

**Files:**
- Modify: `web/templates/pages/host_detail.html`
- Modify: `internal/server/http/host_detail.go` (populate `Host.Online` and `Host.UpdateInProgress`)

- [ ] **Step 1: Right-rail button block**

Look at the existing right-rail in `host_detail.html` (e.g. the Restore button block from P3). Add (admin-only):

```html
{{ if and .CanAdmin .Host.UpdateAvailable }}
{{ end }}
```

The view-model needs `Host.Online` and `Host.UpdateInProgress` populated.

- [ ] **Step 2: Commit**

```
git add web/templates/pages/host_detail.html internal/server/http/host_detail.go
git commit -m "ui: per-host Update agent button"
```

### Task 18: Dashboard "N hosts behind" tile + `?updates=behind` filter

**Files:**
- Modify: `internal/server/http/dashboard_filter.go` (or wherever the dashboard handler lives — search for the `?status=` filter from NS-04)
- Modify: `web/templates/pages/dashboard.html`

- [ ] **Step 1: Extend filter parsing**

Add `Updates string` (values: "" or "behind") to the dashboard filter struct. When `behind`, filter to hosts where `agent_version != "" && agent_version != server.Version`.

- [ ] **Step 2: Hero tile**

In `dashboard.html`, alongside existing tiles (online/offline/snapshot count), add — only when N > 0:

```html
{{ if gt .UpdatesBehind 0 }}
{{ .UpdatesBehind }} hosts behind
{{ end }}
```

- [ ] **Step 3: Tests**

Extend `dashboard_filter_test.go` to cover the `updates=behind` path.

- [ ] **Step 4: Commit**

```
git add internal/server/http/dashboard*.go web/templates/pages/dashboard.html
git commit -m "ui: dashboard hosts-behind tile + filter"
```

---

## Phase 10 — Smoke validation

### Task 19: Restage + smoke validate

- [ ] **Step 1: Build at version A**

```sh
make build VERSION=v0.0.1-smoke-A
# restage block from CLAUDE.md
```

- [ ] **Step 2: Onboard `uptime` as a fresh host**

Use the dashboard's Add-host flow against `ssh uptime`. Confirm the host shows `agent_version=v0.0.1-smoke-A`.

- [ ] **Step 3: Bump server to version B**

```sh
make build VERSION=v0.0.1-smoke-B
# restart server only (not the agent)
```

Verify: dashboard shows `uptime` with the "out of date · v0.0.1-smoke-A → v0.0.1-smoke-B" chip and the "1 host behind" tile.
- [ ] **Step 4: Stage agent at version B**

```sh
cp bin/restic-manager-agent $HOME/smoke/data/agent-binaries/restic-manager-agent-linux-amd64
```

- [ ] **Step 5: Click Update agent**

On `/hosts/{uptime-id}`. Watch the live job log. Expect: agent fetches, swaps, exits, systemd restarts it, hellos at version B, job marked succeeded, chip and tile clear.

Verify on `uptime`:

```sh
ssh uptime "ls -la /usr/local/bin/restic-manager-agent*"
```

Expect both `restic-manager-agent` (B) and `restic-manager-agent.old` (A) present.

- [ ] **Step 6: Test rollback path**

```sh
# Replace the bundled binary with the OLD one — server claims B but serves A
cp bin/restic-manager-agent.A $HOME/smoke/data/agent-binaries/restic-manager-agent-linux-amd64
# (assume earlier build saved as .A)
```

Click Update. The agent fetches A, swaps to A, and restarts at A. The server should mark the job `failed` after 90s with a reason like "agent reconnected at v0.0.1-smoke-A, expected v0.0.1-smoke-B". Alert raised.

- [ ] **Step 7: Fleet update path**

If only one host is available, this validates the worker on N=1. Spin up a second sibling agent (docker-based or another VM) to validate N=2 + halt-on-fail (replace `/agent-binaries/...` with `/bin/false`-equivalent during one host's turn).

- [ ] **Step 8: Capture screenshots**

Save Playwright screenshots of: out-of-date host row, fleet-update idle page, fleet-update running progress, fleet-update halted state. Drop into `_diag/p6-update-sweep/`.

- [ ] **Step 9: Commit + update tasks.md**

Mark P6-01 and P6-02 done in `tasks.md` with an as-shipped block summarising what landed (mirror the style used for P5-03/P5-07).

```
git add tasks.md _diag/p6-update-sweep/
git commit -m "tasks: mark P6-01 + P6-02 done with as-shipped block"
```

---

## Self-review

Run through the spec sections:

- §3 wire protocol → Task 4, Task 5 (jobs.kind), Task 9, Task 10. ✅
- §4 agent execution → Task 8 (Linux), Task 9 (dispatch wiring), Task 12 (Windows). ✅
- §5 server build version → Task 1, Task 2, Task 3. ✅
- §6 server endpoints → Task 10 (host update), Task 11 (hello integration + watcher). ✅
- §7 fleet update → Task 6 (schema), Task 7 (store), Task 14 (worker), Task 15 (HTTP+UI). ✅
- §7.3 UI surfaces → Task 16 (chip), Task 17 (button), Task 18 (dashboard tile). ✅
- §7.4 alert engine → Task 13. ✅
- §8 RBAC → enforced in Task 10 + Task 15 by reusing existing `requireAdmin` middleware. ✅
- §9 testing → per-task tests + Task 19 smoke. ✅

No placeholders. All types referenced consistently across tasks. Done.