# P6-01 + P6-02 Agent Self-Update Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Operator-driven agent self-update via WS envelope, with dashboard "out of date" surfacing, per-host Update button, and a rolling fleet-update worker that halts on first failure.
**Architecture:** Agent fetches its replacement binary from `/agent/binary`, atomic-renames over the running binary (Linux) or hands off to a detached helper script (Windows), and exits cleanly so the service manager restarts it. The server tracks each update as a `jobs` row with `kind=update`; success is detected when the agent re-hellos with `agent_version == server.Version`. A fleet-update worker iterates out-of-date hosts one at a time, halting on the first failure.
**Tech Stack:** Go server + agent, SQLite migrations, WebSocket envelopes (existing `internal/api`), htmx/Tailwind UI.
**Spec:** `docs/superpowers/specs/2026-05-06-p6-01-02-agent-self-update-design.md`
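The "one host at a time, halt on first failure" rule from the architecture summary can be sketched in isolation. This is a minimal illustration, not the worker itself: `rollFleet`, `errNoReconnect`, and the host names are hypothetical, and the real worker in `internal/server/fleetupdate` would persist progress through the store rather than return a slice.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoReconnect is a hypothetical failure used only by this sketch.
var errNoReconnect = errors.New("agent did not reconnect")

// rollFleet updates hosts strictly in order and stops at the first
// failure. It returns the hosts updated so far and the halting error
// (nil when every host succeeded). Hosts after the failure are never
// attempted.
func rollFleet(hosts []string, update func(host string) error) ([]string, error) {
	done := make([]string, 0, len(hosts))
	for _, h := range hosts {
		if err := update(h); err != nil {
			return done, fmt.Errorf("halted at %s: %w", h, err)
		}
		done = append(done, h)
	}
	return done, nil
}

func main() {
	done, err := rollFleet([]string{"alpha", "bravo", "charlie"}, func(h string) error {
		if h == "bravo" {
			return errNoReconnect
		}
		return nil
	})
	fmt.Println(done, err)
}
```

Returning the partial result alongside the error keeps the "what already succeeded" information available to whoever reports the halt.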
---
## File Structure
### New files
- `internal/version/version.go`: `Version`, `Commit` variables, ldflags-injected.
- `internal/agent/updater/updater.go` — shared HTTP fetch + atomic-write helpers.
- `internal/agent/updater/updater_unix.go` — Linux platform path (build-tag `!windows`).
- `internal/agent/updater/updater_windows.go` — Windows platform path (build-tag `windows`).
- `internal/agent/updater/updater_test.go` — Linux unit tests with fake HTTP server.
- `internal/store/migrations/0021_jobs_update_kind.sql` — widen jobs.kind CHECK.
- `internal/store/migrations/0022_fleet_updates.sql` — fleet_updates + fleet_update_hosts tables.
- `internal/store/fleet_updates.go` — store layer for the new tables.
- `internal/store/fleet_updates_test.go`
- `internal/server/http/host_update.go`: `POST /api/hosts/{id}/update` + form variant.
- `internal/server/http/host_update_test.go`
- `internal/server/http/version.go`: `GET /api/version`.
- `internal/server/http/fleet_update.go` — fleet update endpoints + page handler.
- `internal/server/http/fleet_update_test.go`
- `internal/server/fleetupdate/worker.go` — the rolling-update goroutine.
- `internal/server/fleetupdate/worker_test.go`
- `internal/alert/update_alerts.go` — alert kinds + helpers (`update_failed`, `fleet_update_halted`).
- `web/templates/pages/fleet_update.html` — both idle and running states.
- `web/templates/partials/host_update_chip.html` — reusable chip.
- `cmd/agent/update_dispatch.go` — agent-side `command.update` handler.
### Modified files
- `Makefile` — add `VERSION` / `COMMIT` ldflags.
- `internal/api/wire.go` — drop `MsgAgentUpdateAvail`, add `MsgCommandUpdate`.
- `internal/api/messages.go` — drop `AgentUpdateAvailablePayload`, add `CommandUpdatePayload`.
- `cmd/agent/main.go` — wire `MsgCommandUpdate` case in dispatcher.
- `internal/server/ws/handler.go` — extend `onAgentHello` to mark in-flight `update` jobs succeeded on version match.
- `internal/server/http/server.go` — register new routes + middleware.
- `internal/server/http/middleware.go` — already has `requireAdmin`; reuse.
- `internal/server/http/hosts.go` — render the update chip into host responses.
- `internal/server/http/dashboard.go` (or wherever the dashboard handler lives) — "N hosts behind" tile, `updates=behind` filter.
- `web/templates/partials/host_chrome.html` — embed update chip in header.
- `web/templates/partials/host_row.html` — embed update chip in dashboard row.
- `web/templates/pages/host_detail.html` — Update agent button on right-rail.
- `web/styles/input.css`: `.update-chip` token (amber).
- `cmd/server/main.go` — wire fleet-update worker into the daemon lifecycle.
- `tasks.md` — mark P6-01 and P6-02 done with the as-shipped block.
---
## Phase 1 — Build version plumbing
### Task 1: `internal/version` package
**Files:**
- Create: `internal/version/version.go`
- [ ] **Step 1: Create the package**
```go
// Package version exposes build-time identifying values. Both the
// server and agent link this package; the values are set via
// -ldflags during the build. They are variables, not constants,
// because the linker's -X flag can only set string variables. An
// unset Version falls back to "dev" so source builds without
// ldflags still run.
package version
var (
// Version is the human-facing release string, e.g. "v1.2.3" or
// "v1.2.3-dirty". Compared byte-for-byte between agent and
// server to drive the "out of date" signal.
Version = "dev"
// Commit is the short git SHA. Informational only; surfaced via
// /api/version but not used for any comparison.
Commit = ""
)
```
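The byte-for-byte comparison described in the `Version` comment could look like the sketch below. `outOfDate` is a hypothetical helper name, not something the plan defines. The point of the example is the consequence of plain string equality: there is no semver ordering, so a `-dirty` build or even a newer agent is also reported as differing from the server.

```go
package main

import "fmt"

// outOfDate reports whether the agent's version string differs from
// the server's. Hypothetical helper: it is a plain string comparison,
// so any difference at all (suffix, newer release) counts.
func outOfDate(agentVersion, serverVersion string) bool {
	return agentVersion != serverVersion
}

func main() {
	fmt.Println(outOfDate("v1.2.3", "v1.2.3"))       // identical strings
	fmt.Println(outOfDate("v1.2.3-dirty", "v1.2.3")) // suffix differs
	fmt.Println(outOfDate("v1.3.0", "v1.2.3"))       // newer agent still differs
}
```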
- [ ] **Step 2: Commit**
```
git add internal/version/version.go
git commit -m "version: add build-time version package"
```
### Task 2: Wire ldflags into the Makefile
**Files:**
- Modify: `Makefile`
- [ ] **Step 1: Read the Makefile, locate the build target, and prepend the ldflags**
Add near the top of the Makefile (after any existing variable block):
```make
VERSION ?= $(shell git describe --tags --always --dirty 2>/dev/null || echo dev)
COMMIT ?= $(shell git rev-parse --short HEAD 2>/dev/null || echo unknown)
GO_LDFLAGS := -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=$(VERSION) \
-X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Commit=$(COMMIT)
```
Then thread `-ldflags "$(GO_LDFLAGS)"` into every `go build` invocation in the Makefile (for `cmd/server`, `cmd/agent`, and any cross-compile target).
- [ ] **Step 2: Verify**
```sh
make build
./bin/restic-manager-server -version 2>/dev/null || true # if -version flag exists
strings ./bin/restic-manager-server | grep -E "^v[0-9]+|^dev" | head -3
```
Expected: a non-`dev` version string is embedded in the binary when in a tagged-or-dirty git checkout.
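For anyone unfamiliar with `-X`, here is a minimal standalone sketch of the mechanism, using a `main.Version` variable rather than the project's `internal/version` package. The `banner` function and the `demo` name are illustrative only.

```go
package main

import "fmt"

// Version must be a package-level string variable for the linker's
// -X flag to overwrite it; a const would be left untouched.
var Version = "dev"

// banner formats a name together with the embedded version.
func banner(name string) string {
	return fmt.Sprintf("%s %s", name, Version)
}

func main() {
	fmt.Println(banner("demo"))
}
```

Built without flags this prints `demo dev`. Built with `go run -ldflags "-X main.Version=v1.2.3" .` the linker substitutes the value, which is the same mechanism `GO_LDFLAGS` relies on with the full import path of the variable.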
- [ ] **Step 3: Commit**
```
git add Makefile
git commit -m "build: inject version + commit via ldflags"
```
### Task 3: `GET /api/version` endpoint
**Files:**
- Create: `internal/server/http/version.go`
- Create: `internal/server/http/version_test.go`
- Modify: `internal/server/http/server.go`
- [ ] **Step 1: Write the failing test**
```go
package http
import (
"encoding/json"
"net/http"
"net/http/httptest"
"testing"
"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
)
func TestVersionEndpoint(t *testing.T) {
version.Version = "v1.2.3"
version.Commit = "abc1234"
t.Cleanup(func() {
version.Version = "dev"
version.Commit = ""
})
srv := newTestServer(t) // existing helper in this package
rr := httptest.NewRecorder()
req := httptest.NewRequest(http.MethodGet, "/api/version", nil)
srv.Router().ServeHTTP(rr, req)
if rr.Code != http.StatusOK {
t.Fatalf("status: got %d want 200", rr.Code)
}
var body struct {
Version string `json:"version"`
Commit string `json:"commit"`
}
if err := json.NewDecoder(rr.Body).Decode(&body); err != nil {
t.Fatalf("decode: %v", err)
}
if body.Version != "v1.2.3" || body.Commit != "abc1234" {
t.Fatalf("body: %+v", body)
}
}
```
If `newTestServer` doesn't exist by that name in this package, locate the equivalent helper (look at `enrollment_test.go` or `version.go` style elsewhere) and adapt.
- [ ] **Step 2: Implement**
```go
// internal/server/http/version.go
package http
import (
"encoding/json"
stdhttp "net/http"
"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
)
func (s *Server) handleVersion(w stdhttp.ResponseWriter, r *stdhttp.Request) {
w.Header().Set("Content-Type", "application/json")
_ = json.NewEncoder(w).Encode(map[string]string{
"version": version.Version,
"commit": version.Commit,
})
}
```
In `server.go`, add inside the route registration block (where other public routes live, near `/agent/binary`):
```go
r.Get("/api/version", s.handleVersion)
```
- [ ] **Step 3: Run tests**
```sh
go test ./internal/server/http/ -run TestVersionEndpoint -v
```
Expected: PASS.
- [ ] **Step 4: Commit**
```
git add internal/server/http/version.go internal/server/http/version_test.go internal/server/http/server.go
git commit -m "http: expose GET /api/version"
```
---
## Phase 2 — Wire protocol changes
### Task 4: Add `MsgCommandUpdate` and `CommandUpdatePayload`, retire `MsgAgentUpdateAvail`
**Files:**
- Modify: `internal/api/wire.go`
- Modify: `internal/api/messages.go`
- Modify: `cmd/agent/main.go`
- [ ] **Step 1: Edit `wire.go`**
In the server-to-agent block (around line 32-37), replace:
```go
MsgAgentUpdateAvail MessageType = "agent.update.available"
```
with:
```go
MsgCommandUpdate MessageType = "command.update"
```
- [ ] **Step 2: Edit `messages.go`**
Delete the `AgentUpdateAvailablePayload` struct (lines ~364-371) and its doc comment. Add immediately before `TreeListRequestPayload`:
```go
// CommandUpdatePayload carries no operational data — the agent
// already knows its own os/arch and fetches from its configured
// server URL via /agent/binary. JobID is the server-issued id of
// the update job; the agent echoes it on log.stream lines so the
// live job log captures pre-restart progress.
type CommandUpdatePayload struct {
JobID string `json:"job_id"`
}
```
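To make the wire shape concrete, the sketch below encodes and decodes a `command.update` message. Only the `job_id` field and the `command.update` type string come from this plan; the `envelope` struct and its `type` / `id` / `payload` field names are assumptions standing in for whatever `internal/api` actually defines.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// envelope is a hypothetical stand-in for the real wire envelope;
// its JSON field names are assumptions, not the project's schema.
type envelope struct {
	Type    string          `json:"type"`
	ID      string          `json:"id,omitempty"`
	Payload json.RawMessage `json:"payload"`
}

// commandUpdatePayload mirrors CommandUpdatePayload from the plan.
type commandUpdatePayload struct {
	JobID string `json:"job_id"`
}

// encodeCommandUpdate wraps a job id in a command.update envelope.
func encodeCommandUpdate(id, jobID string) ([]byte, error) {
	p, err := json.Marshal(commandUpdatePayload{JobID: jobID})
	if err != nil {
		return nil, err
	}
	return json.Marshal(envelope{Type: "command.update", ID: id, Payload: p})
}

// decodeJobID checks the envelope type and extracts the job id.
func decodeJobID(raw []byte) (string, error) {
	var env envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return "", err
	}
	if env.Type != "command.update" {
		return "", fmt.Errorf("unexpected type %q", env.Type)
	}
	var p commandUpdatePayload
	if err := json.Unmarshal(env.Payload, &p); err != nil {
		return "", err
	}
	return p.JobID, nil
}

func main() {
	raw, err := encodeCommandUpdate("msg-1", "job-1")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))
}
```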
- [ ] **Step 3: Edit `cmd/agent/main.go`**
Replace the `case api.MsgAgentUpdateAvail` block (lines ~398-401) with:
```go
case api.MsgCommandUpdate:
var p api.CommandUpdatePayload
if err := env.UnmarshalPayload(&p); err != nil {
return fmt.Errorf("command.update: %w", err)
}
go d.runUpdate(ctx, p, tx)
```
`runUpdate` lands in Task 9.
- [ ] **Step 4: Update `JobKind` constants in `messages.go`**
In the `JobKind` const block (line ~57), add:
```go
JobUpdate JobKind = "update"
```
- [ ] **Step 5: Verify**
```sh
go build ./...
```
Expected: a build error from `cmd/agent/main.go`, which calls `d.runUpdate` before it is defined. That is fine at this point; Task 9 adds the method. For now, verify only that the `internal/api` package builds:
```sh
go build ./internal/api/
```
Expected: PASS.
- [ ] **Step 6: Commit**
```
git add internal/api/wire.go internal/api/messages.go cmd/agent/main.go
git commit -m "api: replace agent.update.available with command.update + JobUpdate kind"
```
---
## Phase 3 — Database migrations
### Task 5: Migration 0021 — widen `jobs.kind` CHECK
**Files:**
- Create: `internal/store/migrations/0021_jobs_update_kind.sql`
- [ ] **Step 1: Write the migration**
Mirror the pattern in `0012_jobs_restore_diff_kind.sql` exactly: temp-backup of `job_logs`, rebuild `jobs` with the wider CHECK, restore log rows, recreate indexes. Only change is the CHECK list now includes `'update'`:
```sql
-- 0021_jobs_update_kind.sql
--
-- Add 'update' to the jobs.kind CHECK constraint so the agent
-- self-update flow (P6-01) can persist its job rows. Same safe
-- rebuild pattern as 0012; cascade trap mitigated by job_logs
-- temp-backup.
CREATE TEMPORARY TABLE _job_logs_backup AS
SELECT job_id, seq, ts, stream, payload FROM job_logs;
CREATE TABLE jobs_new (
id TEXT PRIMARY KEY,
host_id TEXT NOT NULL REFERENCES hosts(id) ON DELETE CASCADE,
kind TEXT NOT NULL CHECK (kind IN
('backup','init','forget','prune','check','unlock','restore','diff','update')),
status TEXT NOT NULL CHECK (status IN ('queued','running','succeeded','failed','cancelled')),
scheduled_id TEXT REFERENCES schedules(id) ON DELETE SET NULL,
actor_kind TEXT NOT NULL CHECK (actor_kind IN ('user','schedule','system')),
actor_id TEXT,
started_at TEXT,
finished_at TEXT,
exit_code INTEGER,
stats TEXT,
error TEXT,
created_at TEXT NOT NULL
);
INSERT INTO jobs_new
SELECT id, host_id, kind, status, scheduled_id, actor_kind, actor_id,
started_at, finished_at, exit_code, stats, error, created_at
FROM jobs;
DROP TABLE jobs;
ALTER TABLE jobs_new RENAME TO jobs;
CREATE INDEX jobs_host_id ON jobs(host_id);
CREATE INDEX jobs_status ON jobs(status);
CREATE INDEX jobs_created_at ON jobs(created_at);
INSERT OR IGNORE INTO job_logs (job_id, seq, ts, stream, payload)
SELECT job_id, seq, ts, stream, payload FROM _job_logs_backup;
DROP TABLE _job_logs_backup;
```
If the live `jobs` schema already has columns added by post-0012 migrations (e.g. 0015 added `source_group_id`), match them in `jobs_new` and the INSERT — check the latest schema before writing.
- [ ] **Step 2: Verify**
```sh
go test ./internal/store/ -run TestMigrations -v
```
Expected: PASS, includes 0021.
- [ ] **Step 3: Commit**
```
git add internal/store/migrations/0021_jobs_update_kind.sql
git commit -m "store: migration 0021 — add 'update' to jobs.kind"
```
### Task 6: Migration 0022 — fleet_updates tables
**Files:**
- Create: `internal/store/migrations/0022_fleet_updates.sql`
- [ ] **Step 1: Write the migration**
```sql
-- 0022_fleet_updates.sql
--
-- Tables backing the rolling fleet-update worker (P6-02). One row in
-- fleet_updates per "update all" invocation, a child row per host so
-- the worker can resume / report progress / mark per-host failures.
CREATE TABLE fleet_updates (
id TEXT PRIMARY KEY,
started_at TEXT NOT NULL,
started_by_user_id TEXT NOT NULL REFERENCES users(id),
target_version TEXT NOT NULL,
status TEXT NOT NULL CHECK (status IN
('running','completed','halted','cancelled')),
current_host_id TEXT REFERENCES hosts(id),
halted_reason TEXT,
completed_at TEXT
);
CREATE INDEX fleet_updates_status ON fleet_updates(status);
CREATE TABLE fleet_update_hosts (
fleet_update_id TEXT NOT NULL REFERENCES fleet_updates(id) ON DELETE CASCADE,
host_id TEXT NOT NULL REFERENCES hosts(id) ON DELETE CASCADE,
position INTEGER NOT NULL,
status TEXT NOT NULL CHECK (status IN
('pending','running','succeeded','failed','skipped')),
job_id TEXT REFERENCES jobs(id),
failed_reason TEXT,
PRIMARY KEY (fleet_update_id, host_id)
);
CREATE INDEX fleet_update_hosts_status ON fleet_update_hosts(fleet_update_id, position);
```
- [ ] **Step 2: Verify**
```sh
go test ./internal/store/ -run TestMigrations -v
```
- [ ] **Step 3: Commit**
```
git add internal/store/migrations/0022_fleet_updates.sql
git commit -m "store: migration 0022 — fleet_updates + fleet_update_hosts"
```
### Task 7: `internal/store/fleet_updates.go`
**Files:**
- Create: `internal/store/fleet_updates.go`
- Create: `internal/store/fleet_updates_test.go`
- [ ] **Step 1: Write the failing tests**
```go
package store_test
// Cover: CreateFleetUpdate creates parent + N pending host rows;
// SetFleetUpdateHostStatus moves a row through pending→running→succeeded;
// HaltFleetUpdate sets status=halted + halted_reason and stamps no
// completed_at; CompleteFleetUpdate sets completed_at; ListPendingFleetUpdateHosts
// returns rows in position order; ActiveFleetUpdate returns the running
// row (or nil); GetFleetUpdate hydrates parent + children.
//
// One test per behaviour, table-driven where the API supports it.
// Mirror the structure of internal/store/maintenance_test.go.
```
Write four to six discrete test functions. Look at `internal/store/maintenance.go` + `_test.go` for the established style (constructor on `*Store`, NewStore + tmp DB).
- [ ] **Step 2: Implement**
Sketch:
```go
package store
import (
"context"
"database/sql"
"errors"
"fmt"
"time"
)
type FleetUpdate struct {
ID string
StartedAt time.Time
StartedByUserID string
TargetVersion string
Status string // running | completed | halted | cancelled
CurrentHostID string
HaltedReason string
CompletedAt *time.Time
}
type FleetUpdateHost struct {
FleetUpdateID string
HostID string
Position int
Status string // pending | running | succeeded | failed | skipped
JobID string
FailedReason string
}
func (s *Store) CreateFleetUpdate(ctx context.Context, fu FleetUpdate, hostIDs []string) error { ... }
func (s *Store) ActiveFleetUpdate(ctx context.Context) (*FleetUpdate, error) { ... }
func (s *Store) GetFleetUpdate(ctx context.Context, id string) (*FleetUpdate, []FleetUpdateHost, error) { ... }
func (s *Store) ListPendingFleetUpdateHosts(ctx context.Context, fuID string) ([]FleetUpdateHost, error) { ... }
func (s *Store) SetFleetUpdateHostStatus(ctx context.Context, fuID, hostID, status, failedReason, jobID string) error { ... }
func (s *Store) SetFleetUpdateCurrentHost(ctx context.Context, fuID, hostID string) error { ... }
func (s *Store) HaltFleetUpdate(ctx context.Context, fuID, reason string, when time.Time) error { ... }
func (s *Store) CancelFleetUpdate(ctx context.Context, fuID string) error { ... }
func (s *Store) CompleteFleetUpdate(ctx context.Context, fuID string, when time.Time) error { ... }
```
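One of the behaviours listed for the tests, "returns rows in position order", can be pinned down with an in-memory model that a store test could use as an oracle. This is a sketch under assumptions: `pendingInOrder` and the trimmed `fleetUpdateHost` type are hypothetical, and the real method would presumably express the same thing as a `WHERE status = 'pending' ORDER BY position` query.

```go
package main

import (
	"fmt"
	"sort"
)

// fleetUpdateHost is a trimmed, hypothetical copy of FleetUpdateHost
// carrying only the fields this sketch needs.
type fleetUpdateHost struct {
	HostID   string
	Position int
	Status   string
}

// pendingInOrder keeps only pending rows and sorts them by position,
// leaving the input slice unmodified.
func pendingInOrder(rows []fleetUpdateHost) []fleetUpdateHost {
	out := make([]fleetUpdateHost, 0, len(rows))
	for _, r := range rows {
		if r.Status == "pending" {
			out = append(out, r)
		}
	}
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Position < out[j].Position
	})
	return out
}

func main() {
	rows := []fleetUpdateHost{
		{HostID: "h2", Position: 2, Status: "pending"},
		{HostID: "h0", Position: 0, Status: "succeeded"},
		{HostID: "h1", Position: 1, Status: "pending"},
	}
	for _, r := range pendingInOrder(rows) {
		fmt.Println(r.Position, r.HostID)
	}
}
```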
- [ ] **Step 3: Run tests**
```sh
go test ./internal/store/ -run TestFleetUpdate -v
```
Expected: PASS.
- [ ] **Step 4: Commit**
```
git add internal/store/fleet_updates.go internal/store/fleet_updates_test.go
git commit -m "store: fleet_updates + fleet_update_hosts CRUD"
```
---
## Phase 4 — Agent updater (Linux)
### Task 8: `internal/agent/updater` package skeleton + Linux path
**Files:**
- Create: `internal/agent/updater/updater.go`
- Create: `internal/agent/updater/updater_unix.go`
- Create: `internal/agent/updater/updater_windows.go`
- Create: `internal/agent/updater/updater_test.go`
- [ ] **Step 1: Write the failing test (Linux)**
```go
//go:build !windows
package updater
import (
"io"
"net/http"
"net/http/httptest"
"os"
"path/filepath"
"runtime"
"testing"
)
func TestUpdate_LinuxAtomicSwap(t *testing.T) {
// Stage 1: a fake "running binary" file + a server that serves new bytes.
tmp := t.TempDir()
binPath := filepath.Join(tmp, "agent")
if err := os.WriteFile(binPath, []byte("OLD"), 0o755); err != nil {
t.Fatal(err)
}
newBytes := []byte("NEW BINARY CONTENTS")
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
if r.URL.Path != "/agent/binary" {
http.NotFound(w, r); return
}
if r.URL.Query().Get("os") != runtime.GOOS || r.URL.Query().Get("arch") != runtime.GOARCH {
t.Errorf("unexpected query: %s", r.URL.RawQuery)
}
_, _ = io.Copy(w, &io.LimitedReader{R: bytesReader(newBytes), N: int64(len(newBytes))})
}))
defer srv.Close()
if err := UpdateForTest(srv.URL, binPath); err != nil {
t.Fatalf("update: %v", err)
}
got, err := os.ReadFile(binPath)
if err != nil { t.Fatal(err) }
if string(got) != "NEW BINARY CONTENTS" {
t.Fatalf("binary contents: got %q", got)
}
old, err := os.ReadFile(binPath + ".old")
if err != nil { t.Fatalf("agent.old missing: %v", err) }
if string(old) != "OLD" {
t.Fatalf("agent.old contents: got %q", old)
}
// .new must have been renamed away
if _, err := os.Stat(binPath + ".new"); !os.IsNotExist(err) {
t.Fatalf("agent.new should be absent after swap")
}
}
```
`UpdateForTest(serverURL, binaryPath string) error` is a small wrapper exposed by `updater.go` that performs steps 1-6 of §4.1 of the spec (everything except `os.Exit`). The exit-and-restart side effect can't be covered by a unit test.
- [ ] **Step 2: Implement `updater.go` (shared)**
```go
package updater
import (
"context"
"fmt"
"io"
"net/http"
"os"
"path/filepath"
"runtime"
"time"
)
// fetch downloads the new binary into <binaryPath>.new, fsyncs, chmods.
// Returns the path to the staged file.
func fetch(ctx context.Context, serverURL, binaryPath string) (string, error) {
url := fmt.Sprintf("%s/agent/binary?os=%s&arch=%s", serverURL, runtime.GOOS, runtime.GOARCH)
req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
if err != nil {
return "", err
}
c := &http.Client{Timeout: 5 * time.Minute}
res, err := c.Do(req)
if err != nil {
return "", err
}
defer res.Body.Close()
if res.StatusCode != http.StatusOK {
return "", fmt.Errorf("agent binary fetch: %s", res.Status)
}
stagePath := binaryPath + ".new"
f, err := os.OpenFile(stagePath, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
if err != nil {
return "", err
}
if _, err := io.Copy(f, res.Body); err != nil {
f.Close()
_ = os.Remove(stagePath)
return "", err
}
if err := f.Sync(); err != nil {
f.Close()
_ = os.Remove(stagePath)
return "", err
}
if err := f.Close(); err != nil {
_ = os.Remove(stagePath)
return "", err
}
if err := os.Chmod(stagePath, 0o755); err != nil {
_ = os.Remove(stagePath)
return "", err
}
return stagePath, nil
}
// resolveOwnBinary returns the absolute path of the running binary.
// Refuses /proc/self/exe — that's what os.Executable returns on
// some systems but it can't be renamed across.
func resolveOwnBinary() (string, error) {
p, err := os.Executable()
if err != nil {
return "", err
}
abs, err := filepath.Abs(p)
if err != nil {
return "", err
}
if abs == "/proc/self/exe" {
return "", fmt.Errorf("cannot resolve own binary path (/proc/self/exe — not a real file)")
}
return abs, nil
}
// UpdateForTest is the platform-neutral test seam.
// In production, Update (in updater_unix.go / updater_windows.go) does
// the same fetch+swap then exits the process. UpdateForTest stops short
// of the exit so unit tests can assert on file state.
func UpdateForTest(serverURL, binaryPath string) error {
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer cancel()
stage, err := fetch(ctx, serverURL, binaryPath)
if err != nil {
return err
}
return swap(stage, binaryPath)
}
```
- [ ] **Step 3: Implement `updater_unix.go`**
```go
//go:build !windows
package updater
import (
"context"
"fmt"
"io"
"log/slog"
"os"
"time"
)
// Update fetches the new binary, swaps it in, then exits so systemd
// restarts the process under the new binary. Caller should close
// the WS cleanly before invoking.
func Update(ctx context.Context, serverURL string) error {
binPath, err := resolveOwnBinary()
if err != nil {
return err
}
stage, err := fetch(ctx, serverURL, binPath)
if err != nil {
return err
}
if err := swap(stage, binPath); err != nil {
return err
}
slog.Info("agent self-update: binary swapped, exiting for systemd restart",
"binary", binPath)
// Give logger a moment to flush, then exit.
time.Sleep(200 * time.Millisecond)
os.Exit(0)
return nil // unreachable
}
// swap copies the running binary to <bin>.old, then atomic-renames the
// staged binary into place. On non-Windows this works because the OS
// allows renaming over a file that is currently open or executing; the
// running process keeps its old inode until it exits.
func swap(stagePath, binPath string) error {
src, err := os.Open(binPath)
if err != nil {
return fmt.Errorf("open running binary: %w", err)
}
defer src.Close()
dst, err := os.OpenFile(binPath+".old", os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
if err != nil {
return fmt.Errorf("open .old: %w", err)
}
if _, err := io.Copy(dst, src); err != nil {
dst.Close()
return fmt.Errorf("copy to .old: %w", err)
}
if err := dst.Sync(); err != nil {
dst.Close()
return err
}
if err := dst.Close(); err != nil {
return err
}
if err := os.Rename(stagePath, binPath); err != nil {
return fmt.Errorf("rename .new over running binary: %w", err)
}
return nil
}
```
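The stage-then-rename technique that `fetch` and `swap` rely on can be tried in a temporary directory without touching a real binary. The sketch below is a reduced illustration, not the updater: `replaceFile` and `demo` are hypothetical names, and it skips the `.old` backup and the fsync that the real code performs. The staged file lives next to the target so the rename stays on one filesystem.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// replaceFile writes data to path+".new" and then renames it over
// path. The staged file is removed if the rename fails.
func replaceFile(path string, data []byte, mode os.FileMode) error {
	stage := path + ".new"
	if err := os.WriteFile(stage, data, mode); err != nil {
		return err
	}
	if err := os.Rename(stage, path); err != nil {
		_ = os.Remove(stage)
		return err
	}
	return nil
}

// demo creates a throwaway "binary", replaces it, and returns the
// contents found at the original path afterwards.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "swap-demo-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	target := filepath.Join(dir, "agent")
	if err := os.WriteFile(target, []byte("OLD"), 0o755); err != nil {
		return "", err
	}
	if err := replaceFile(target, []byte("NEW"), 0o755); err != nil {
		return "", err
	}
	got, err := os.ReadFile(target)
	return string(got), err
}

func main() {
	got, err := demo()
	fmt.Println(got, err)
}
```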
- [ ] **Step 4: Implement `updater_windows.go` stub**
```go
//go:build windows
package updater
import (
"context"
"errors"
)
// Update is implemented in Task 12. Stubbed so the package builds
// on Windows until then.
func Update(ctx context.Context, serverURL string) error {
return errors.New("agent self-update on Windows: not yet implemented")
}
func swap(stagePath, binPath string) error {
return errors.New("agent self-update on Windows: not yet implemented")
}
```
- [ ] **Step 5: Add `bytesReader` helper to test file** (or use `bytes.NewReader` directly).
- [ ] **Step 6: Run tests**
```sh
go test ./internal/agent/updater/ -v
```
Expected: PASS.
- [ ] **Step 7: Commit**
```
git add internal/agent/updater/
git commit -m "agent: updater package — Linux atomic-swap path"
```
### Task 9: Wire `command.update` into the agent dispatcher
**Files:**
- Create: `cmd/agent/update_dispatch.go`
- Verify: `cmd/agent/main.go` (already edited in Task 4)
- [ ] **Step 1: Implement the dispatcher method**
```go
package main
import (
"context"
"fmt"
"log/slog"
"time"
"gitea.dcglab.co.uk/steve/restic-manager/internal/agent/updater"
"gitea.dcglab.co.uk/steve/restic-manager/internal/agent/wsclient"
"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
)
// runUpdate handles a server-dispatched command.update. It logs progress
// via log.stream so the live job page captures pre-restart state, then
// calls the platform updater. On Linux the updater calls os.Exit; on
// Windows it spawns a helper and returns, with the agent then exiting.
func (d *dispatcher) runUpdate(ctx context.Context, p api.CommandUpdatePayload, tx wsclient.Sender) {
logf := func(format string, args ...any) {
line := fmt.Sprintf(format, args...)
slog.Info("ws agent: update: " + line)
env, _ := api.Marshal(api.MsgLogStream, "", api.LogStreamPayload{
JobID: p.JobID,
Stream: api.LogStdout,
Data: line + "\n",
At: time.Now().UTC(),
})
_ = tx.Send(env)
}
// Job-started so the server flips queued→running.
startedEnv, _ := api.Marshal(api.MsgJobStarted, "", api.JobStartedPayload{
JobID: p.JobID,
Kind: api.JobUpdate,
StartedAt: time.Now().UTC(),
})
_ = tx.Send(startedEnv)
logf("fetching new binary from %s", d.serverURL)
if err := updater.Update(ctx, d.serverURL); err != nil {
logf("update failed: %v", err)
finishedEnv, _ := api.Marshal(api.MsgJobFinished, "", api.JobFinishedPayload{
JobID: p.JobID,
Kind: api.JobUpdate,
Status: api.JobFailed,
FinishedAt: time.Now().UTC(),
Error: err.Error(),
})
_ = tx.Send(finishedEnv)
return
}
// Unreachable on Linux (Update calls os.Exit). On Windows control
// returns here and the agent exits cleanly so SCM hands off to the
// helper script that does the actual swap-and-restart.
}
```
`d.serverURL` should already exist on the dispatcher (it's the URL the WS connection was made to). If not, plumb it from `cmd/agent/main.go`'s connection setup — the URL is in the agent config.
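When plumbing the URL through, it may be worth normalising it before it reaches `fetch`, which builds the request by plain string concatenation. The sketch below is one possible approach and entirely an assumption: `binaryURL` is a hypothetical helper, and whether the agent config stores an `http(s)` base URL at all (rather than a WebSocket URL) needs checking against the real config.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// binaryURL builds the /agent/binary download URL from a base server
// URL, tolerating a trailing slash and rejecting non-HTTP schemes.
func binaryURL(base, goos, goarch string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return "", fmt.Errorf("server URL must be http(s), got scheme %q", u.Scheme)
	}
	u.Path = strings.TrimRight(u.Path, "/") + "/agent/binary"
	q := url.Values{}
	q.Set("os", goos)
	q.Set("arch", goarch)
	u.RawQuery = q.Encode() // keys are emitted in sorted order
	return u.String(), nil
}

func main() {
	fmt.Println(binaryURL("https://rm.example.com/", "linux", "amd64"))
	fmt.Println(binaryURL("ws://rm.example.com", "linux", "amd64"))
}
```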
- [ ] **Step 2: Verify build**
```sh
go build ./...
```
Expected: PASS.
- [ ] **Step 3: Run all agent tests**
```sh
go test ./cmd/agent/... ./internal/agent/...
```
Expected: PASS, no regressions.
- [ ] **Step 4: Commit**
```
git add cmd/agent/update_dispatch.go cmd/agent/main.go
git commit -m "agent: handle command.update — fetch + swap + exit"
```
---
## Phase 5 — Server endpoint + hello integration
### Task 10: `POST /api/hosts/{id}/update`
**Files:**
- Create: `internal/server/http/host_update.go`
- Create: `internal/server/http/host_update_test.go`
- Modify: `internal/server/http/server.go` (route registration)
- Modify: `internal/store/jobs.go` (add `RunningUpdateJobForHost`)
- [ ] **Step 1: Write tests covering**
Mirror the structure of an existing admin-band endpoint test, e.g. `repo_ops_test.go`:
- happy path: admin POST → 200 + `{job_id}` returned, `jobs` row created with `kind=update`, audit row written, WS envelope `command.update` sent to the host's connection.
- refuses when host offline → 409 / structured error code `host_offline`.
- refuses when `agent_version == server.Version` → 409 / `already_up_to_date`.
- refuses when an `update` job is already running for this host → 409 / `update_in_progress`.
- RBAC: operator → 403, viewer → 403.
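The refusal cases above form a small decision table, which can be sketched as a pure function and tested without a server. This is illustrative only: `updateGate` and `gateInput` are hypothetical, the `forbidden` code for the RBAC case is an assumption (the list specifies only the 403 status), and in the real server the role check would live in the `requireAdmin` middleware rather than in the handler.

```go
package main

import (
	"fmt"
	"net/http"
)

// gateInput collects the facts the update endpoint checks.
type gateInput struct {
	Role          string
	Online        bool
	AgentVersion  string
	ServerVersion string
	UpdateRunning bool
}

// updateGate returns the HTTP status and error code for a request,
// or (200, "") when the update may be dispatched. Checks run in the
// same order as the test list: role, online, version, in-progress.
func updateGate(in gateInput) (int, string) {
	switch {
	case in.Role != "admin":
		return http.StatusForbidden, "forbidden" // code name is an assumption
	case !in.Online:
		return http.StatusConflict, "host_offline"
	case in.AgentVersion == in.ServerVersion:
		return http.StatusConflict, "already_up_to_date"
	case in.UpdateRunning:
		return http.StatusConflict, "update_in_progress"
	}
	return http.StatusOK, ""
}

func main() {
	ok := gateInput{Role: "admin", Online: true, AgentVersion: "v1.0.0", ServerVersion: "v1.1.0"}
	fmt.Println(updateGate(ok))

	offline := ok
	offline.Online = false
	fmt.Println(updateGate(offline))
}
```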
- [ ] **Step 2: Implement**
```go
package http
import (
"encoding/json"
stdhttp "net/http"
"github.com/go-chi/chi/v5"
"github.com/oklog/ulid/v2"
"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
)
// handleHostUpdate dispatches a command.update WS envelope after
// validating that the host is online, currently running a different
// version, and not already in the middle of an update.
func (s *Server) handleHostUpdate(w stdhttp.ResponseWriter, r *stdhttp.Request) {
hostID := chi.URLParam(r, "id")
host, err := s.deps.Store.GetHost(r.Context(), hostID)
if err != nil { writeJSONError(w, stdhttp.StatusNotFound, "host_not_found", ""); return }
if !s.deps.Hub.IsOnline(hostID) {
writeJSONError(w, stdhttp.StatusConflict, "host_offline",
"agent must be online to receive an update")
return
}
if host.AgentVersion == version.Version {
writeJSONError(w, stdhttp.StatusConflict, "already_up_to_date",
"host is already on "+version.Version)
return
}
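running, err := s.deps.Store.RunningUpdateJobForHost(r.Context(), hostID)
if err != nil {
// A failed lookup must not be treated as "no update running".
writeJSONError(w, stdhttp.StatusInternalServerError, "internal", err.Error())
return
}
if running != "" {
writeJSONError(w, stdhttp.StatusConflict, "update_in_progress",
"an update job is already running for this host")
return
}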
jobID := ulid.Make().String()
user := userFrom(r) // existing helper
if err := s.deps.Store.InsertJob(r.Context(), store.Job{
ID: jobID,
HostID: hostID,
Kind: string(api.JobUpdate),
Status: string(api.JobQueued),
ActorKind: "user",
ActorID: user.ID,
// CreatedAt is set by InsertJob.
}); err != nil {
writeJSONError(w, stdhttp.StatusInternalServerError, "internal", err.Error())
return
}
env, _ := api.Marshal(api.MsgCommandUpdate, ulid.Make().String(), api.CommandUpdatePayload{JobID: jobID})
if err := s.deps.Hub.SendTo(hostID, env); err != nil {
writeJSONError(w, stdhttp.StatusBadGateway, "send_failed", err.Error())
return
}
s.audit(r, "host.update_dispatched", store.AuditTarget{
Kind: "host", ID: hostID,
}, map[string]any{"job_id": jobID, "target_version": version.Version})
w.Header().Set("Content-Type", "application/json")
_ = json.NewEncoder(w).Encode(map[string]string{"job_id": jobID})
}
// Form-post variant for HTMX. Same gates, returns HX-Redirect.
func (s *Server) handleHostUpdateForm(w stdhttp.ResponseWriter, r *stdhttp.Request) {
// reuse handleHostUpdate's pre-checks via a shared validator;
// on success, set HX-Redirect to /jobs/<id> and write 200.
// On error, render an inline banner partial.
}
```
Helpers to add:
- `Store.RunningUpdateJobForHost(ctx, hostID) (string, error)` — returns the job id of any running `kind=update` job for this host, or empty string + nil if none. One-line query.
- [ ] **Step 3: Register routes**
In `server.go`, inside the admin-only route group:
```go
r.Post("/api/hosts/{id}/update", s.handleHostUpdate)
r.Post("/hosts/{id}/update", s.handleHostUpdateForm)
```
- [ ] **Step 4: Run tests**
```sh
go test ./internal/server/http/ -run TestHostUpdate -v
```
Expected: PASS.
- [ ] **Step 5: Commit**
```
git add internal/server/http/host_update.go internal/server/http/host_update_test.go internal/server/http/server.go internal/store/jobs.go
git commit -m "http: POST /api/hosts/{id}/update — dispatch agent update"
```
### Task 11: Hello-handler integration + timeout watcher
**Files:**
- Modify: `internal/server/ws/handler.go`
- Create: `internal/server/ws/update_watch.go`
- Create: `internal/server/ws/update_watch_test.go`
- [ ] **Step 1: Write the failing test**
```go
// In update_watch_test.go:
//
// 1. NewWatcher; Track(jobID, hostID). Hello arrives after
// 50ms with matching version → watcher marks the job succeeded
// (verify via mock Store.SetJobStatus call).
// 2. NewWatcher; Track(...). 100ms timeout (override constant for test).
// No hello arrives → after 100ms, watcher marks the job failed with
// reason "timeout" and raises an alert (verify via mock Store +
// AlertEngine).
// 3. NewWatcher; Track(...). Hello arrives but version doesn't match.
// Watcher does nothing (timeout will catch). After timeout, marked
// failed with reason "agent reconnected at version X, expected Y".
// 4. Cancel: Track then explicitly Stop(jobID) — no further callbacks.
```
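The timeout cases are easier to test deterministically with an injected clock than with real sleeps, which is what a `now func() time.Time` field on the watcher allows. The sketch below shows the idea on a stripped-down tracker; `fakeClock`, `tracker`, and `demo` are hypothetical and leave out the store, alerts, and locking that the real watcher needs.

```go
package main

import (
	"fmt"
	"time"
)

// fakeClock is a manually advanced clock for tests.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time          { return c.t }
func (c *fakeClock) Advance(d time.Duration) { c.t = c.t.Add(d) }

// tracker records a deadline per host and reports expired entries.
type tracker struct {
	now       func() time.Time
	timeout   time.Duration
	deadlines map[string]time.Time
}

// Track registers a host with a deadline of now + timeout.
func (t *tracker) Track(hostID string) {
	t.deadlines[hostID] = t.now().Add(t.timeout)
}

// Sweep removes and returns every host whose deadline has passed.
func (t *tracker) Sweep() []string {
	now := t.now()
	var expired []string
	for id, deadline := range t.deadlines {
		if now.After(deadline) {
			expired = append(expired, id)
			delete(t.deadlines, id)
		}
	}
	return expired
}

// demo returns how many hosts expired just before the deadline,
// just after it, and on a repeat sweep.
func demo() (early, late, again int) {
	clock := &fakeClock{t: time.Date(2026, 5, 6, 12, 0, 0, 0, time.UTC)}
	tr := &tracker{now: clock.Now, timeout: 90 * time.Second, deadlines: map[string]time.Time{}}
	tr.Track("host-1")

	clock.Advance(89 * time.Second)
	early = len(tr.Sweep())

	clock.Advance(2 * time.Second)
	late = len(tr.Sweep())
	again = len(tr.Sweep())
	return early, late, again
}

func main() {
	fmt.Println(demo())
}
```

With a fake clock the test never waits on wall time, so shrinking the timeout constant becomes optional rather than required.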
- [ ] **Step 2: Implement the watcher**
```go
package ws
import (
"context"
"fmt"
"sync"
"time"
"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
)
// updateTimeout is the default ceiling for how long the server waits
// for an agent re-hello carrying the matching version after dispatching
// a command.update. Declared as a package-level var so tests can shrink it.
var updateTimeout = 90 * time.Second
type updateWatch struct {
jobID string
hostID string
deadline time.Time
}
type updateWatcher struct {
mu sync.Mutex
pending map[string]*updateWatch // hostID → watch
store *store.Store
alerts AlertRaiser // small interface, injected
now func() time.Time
}
func newUpdateWatcher(st *store.Store, alerts AlertRaiser) *updateWatcher {
return &updateWatcher{
pending: make(map[string]*updateWatch),
store: st,
alerts: alerts,
now: func() time.Time { return time.Now().UTC() },
}
}
// Track registers an in-flight update. If a hello with the matching
// version arrives before the deadline, OnHello returns true and clears
// the entry. Otherwise the watcher's runLoop will mark the job failed.
func (w *updateWatcher) Track(jobID, hostID string) {
w.mu.Lock()
defer w.mu.Unlock()
w.pending[hostID] = &updateWatch{
jobID: jobID,
hostID: hostID,
deadline: w.now().Add(updateTimeout),
}
}
// OnHello is called by the WS handler when an agent hellos. If a watch
// is pending for this host AND the version matches, mark succeeded and
// drop the watch. Mismatched version → leave the watch (timeout
// handles it).
func (w *updateWatcher) OnHello(ctx context.Context, hostID, agentVersion, serverVersion string) {
w.mu.Lock()
watch, ok := w.pending[hostID]
if ok && agentVersion == serverVersion {
delete(w.pending, hostID)
}
w.mu.Unlock()
if !ok || agentVersion != serverVersion { return }
// Mark job succeeded.
_ = w.store.SetJobStatus(ctx, watch.jobID, string(api.JobSucceeded), "", w.now())
// Audit + alert auto-resolve.
// (audit hook reused via http layer's helper, or write directly here)
}
// Run is a goroutine started by NewHandler — sweeps for expired
// watches every 5s.
func (w *updateWatcher) Run(ctx context.Context) {
tick := time.NewTicker(5 * time.Second)
defer tick.Stop()
for {
select {
case <-ctx.Done(): return
case <-tick.C:
w.sweep(ctx)
}
}
}
func (w *updateWatcher) sweep(ctx context.Context) {
now := w.now()
w.mu.Lock()
expired := []*updateWatch{}
for hostID, wch := range w.pending {
if now.After(wch.deadline) {
expired = append(expired, wch)
delete(w.pending, hostID)
}
}
w.mu.Unlock()
for _, wch := range expired {
// Determine reason: did the agent come back at all?
host, _ := w.store.GetHost(ctx, wch.hostID)
reason := "timeout: agent did not reconnect within 90s"
if host != nil && host.AgentVersion != "" && host.AgentVersion != version.Version {
reason = fmt.Sprintf("agent reconnected at %s, expected %s",
host.AgentVersion, version.Version)
}
_ = w.store.SetJobStatus(ctx, wch.jobID, string(api.JobFailed), reason, now)
if w.alerts != nil {
w.alerts.RaiseUpdateFailed(ctx, wch.hostID, wch.jobID, reason, now)
}
}
}
```
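The injected `now` function is what lets the timeout cases in Step 1 run without real sleeps. The following is a minimal, self-contained sketch of that idea only; `fakeClock`, `miniWatcher`, and their methods are hypothetical stand-ins for illustration and are not the plan's real types.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// fakeClock is a hand-advanced clock (hypothetical test helper).
type fakeClock struct{ t time.Time }

func (c *fakeClock) now() time.Time { return c.t }

func (c *fakeClock) advance(seconds int) {
	c.t = c.t.Add(time.Duration(seconds) * time.Second)
}

// miniWatcher mirrors only the deadline bookkeeping of an update watcher.
type miniWatcher struct {
	pending map[string]time.Time // hostID -> deadline
	timeout time.Duration
	now     func() time.Time
}

func newMiniWatcher(c *fakeClock, timeoutSeconds int) *miniWatcher {
	return &miniWatcher{
		pending: make(map[string]time.Time),
		timeout: time.Duration(timeoutSeconds) * time.Second,
		now:     c.now,
	}
}

func (w *miniWatcher) track(hostID string) {
	w.pending[hostID] = w.now().Add(w.timeout)
}

// sweep removes and returns, in sorted order, the hosts whose deadline
// has strictly passed.
func (w *miniWatcher) sweep() []string {
	now := w.now()
	var expired []string
	for hostID, deadline := range w.pending {
		if now.After(deadline) {
			expired = append(expired, hostID)
			delete(w.pending, hostID)
		}
	}
	sort.Strings(expired)
	return expired
}

func main() {
	clock := &fakeClock{t: time.Date(2026, 5, 6, 12, 0, 0, 0, time.UTC)}
	w := newMiniWatcher(clock, 90)
	w.track("host-a")
	clock.advance(60)
	w.track("host-b")
	clock.advance(31) // host-a is now 91s old, host-b 31s old
	fmt.Println(w.sweep(), len(w.pending))
}
```

A test built this way advances the clock and calls `sweep` directly, so it does not depend on the 5s ticker at all.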
- [ ] **Step 3: Hook into the WS handler**
In `handler.go`, where `onAgentHello` is defined (search for the place it upserts `agent_version`), at the *end* of the handler — after the upsert succeeds — call:
```go
deps.UpdateWatcher.OnHello(ctx, hostID, hello.AgentVersion, version.Version)
```
The `UpdateWatcher *updateWatcher` field needs to exist on the handler `Deps` struct. Wire it up in `cmd/server/main.go`.
The `AlertRaiser` interface (defined alongside the watcher) is implemented by `*alert.Engine` once Task 13 adds the `RaiseUpdateFailed` method. For now, define the interface and make the engine satisfy it.
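One way to keep the interface and the engine from drifting apart is a compile-time assertion. The sketch below is self-contained and uses a stand-in `Engine` type (hypothetical; the real one lives in `internal/alert`), with the method signature taken from the watcher's call site.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// AlertRaiser matches the single method the watcher's sweep calls.
type AlertRaiser interface {
	RaiseUpdateFailed(ctx context.Context, hostID, jobID, reason string, when time.Time)
}

// Engine is a stand-in for the alert engine (hypothetical). It only
// records what it was asked to raise.
type Engine struct{ raised []string }

func (e *Engine) RaiseUpdateFailed(_ context.Context, hostID, jobID, reason string, _ time.Time) {
	e.raised = append(e.raised, hostID+"/"+jobID+": "+reason)
}

// Compile-time check: the build fails if *Engine stops satisfying
// AlertRaiser.
var _ AlertRaiser = (*Engine)(nil)

// raiseVia shows a caller that depends only on the interface.
func raiseVia(a AlertRaiser, hostID, jobID, reason string) {
	a.RaiseUpdateFailed(context.Background(), hostID, jobID, reason, time.Now().UTC())
}

func main() {
	e := &Engine{}
	raiseVia(e, "host-1", "job-9", "timeout")
	fmt.Println(e.raised)
}
```

The same pattern also makes it easy to pass a recording fake into the watcher tests.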
- [ ] **Step 4: Run tests**
```sh
go test ./internal/server/ws/ -v
```
Expected: PASS.
- [ ] **Step 5: Commit**
```
git add internal/server/ws/update_watch.go internal/server/ws/update_watch_test.go internal/server/ws/handler.go cmd/server/main.go
git commit -m "ws: update watcher — promote/fail update jobs on hello timeout"
```
---
## Phase 6 — Windows updater path
### Task 12: Windows helper-script implementation
**Files:**
- Modify: `internal/agent/updater/updater_windows.go`
- [ ] **Step 1: Replace the stub from Task 8**
```go
//go:build windows
package updater
import (
"context"
"fmt"
"log/slog"
"os"
"os/exec"
"path/filepath"
"syscall"
"time"
)
const helperScript = `@echo off
timeout /t 3 /nobreak >nul
copy /Y "%s" "%s"
sc stop restic-manager-agent
:wait
sc query restic-manager-agent | find "STOPPED" >nul
if errorlevel 1 (timeout /t 1 /nobreak >nul & goto wait)
move /Y "%s" "%s"
sc start restic-manager-agent
del "%%~f0"
`

// Update fetches the replacement binary, writes a one-shot helper
// script next to the running binary, launches it detached, and exits
// so the helper can swap the files once the service has stopped.
func Update(ctx context.Context, serverURL string) error {
	binPath, err := resolveOwnBinary()
	if err != nil {
		return err
	}
	stage, err := fetch(ctx, serverURL, binPath)
	if err != nil {
		return err
	}
	helperPath := filepath.Join(filepath.Dir(binPath), "agent-update.cmd")
	oldPath := binPath + ".old"
	// Argument order must match the four %s verbs in helperScript:
	// copy <bin> <old>, then move <stage> <bin>.
	body := fmt.Sprintf(helperScript, binPath, oldPath, stage, binPath)
	if err := os.WriteFile(helperPath, []byte(body), 0o755); err != nil {
		return err
	}
	cmd := exec.Command("cmd.exe", "/c", helperPath)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		HideWindow:    true,
		CreationFlags: 0x00000008 | 0x08000000, // DETACHED_PROCESS | CREATE_NO_WINDOW
	}
	if err := cmd.Start(); err != nil {
		return err
	}
	slog.Info("agent self-update: helper spawned, exiting cleanly", "binary", binPath)
	time.Sleep(200 * time.Millisecond)
	os.Exit(0)
	return nil // unreachable; keeps the compiler satisfied
}

// swap is a no-op on Windows; the helper script performs the swap.
func swap(_, _ string) error { return nil }
```
- [ ] **Step 2: Verify cross-compile**
```sh
GOOS=windows GOARCH=amd64 go build ./...
```
Expected: PASS.
- [ ] **Step 3: Commit**
```
git add internal/agent/updater/updater_windows.go
git commit -m "agent: Windows updater — detached helper script"
```
---
## Phase 7 — Alert kinds + auto-resolve
### Task 13: Add `update_failed` and `fleet_update_halted` alert kinds
**Files:**
- Create: `internal/alert/update_alerts.go`
- Modify: `internal/alert/rules.go` (auto-resolve hook on host hello)
- [ ] **Step 1: Implement**
```go
package alert

import (
	"context"
	"time"
)

// RaiseUpdateFailed is called by the WS update-watcher when an agent
// fails to come back at the target version after a command.update
// dispatch. Auto-resolves when the host next hellos with the right
// version.
func (e *Engine) RaiseUpdateFailed(ctx context.Context, hostID, jobID, reason string, when time.Time) {
	dedup := "update_failed:" + hostID
	msg := "agent self-update failed: " + reason
	e.raiseAndNotify(ctx, hostID, "update_failed", dedup, "warning", msg, when)
}

// ResolveUpdateFailed is called from the WS hello handler when the
// host comes back at the expected version.
func (e *Engine) ResolveUpdateFailed(ctx context.Context, hostID string, when time.Time) {
	e.resolveAndNotify(ctx, hostID, "update_failed", "update_failed:"+hostID, when)
}

// RaiseFleetUpdateHalted is called by the fleet-update worker when it
// halts on a per-host failure. No host id (global alert).
func (e *Engine) RaiseFleetUpdateHalted(ctx context.Context, fleetUpdateID, reason string, when time.Time) {
	dedup := "fleet_update_halted:" + fleetUpdateID
	msg := "fleet update halted: " + reason
	e.raiseAndNotify(ctx, "", "fleet_update_halted", dedup, "warning", msg, when)
}
```
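The dedup keys above (`update_failed:<hostID>`, `fleet_update_halted:<fleetUpdateID>`) suggest at most one open alert per key. The sketch below illustrates that behaviour in memory; it is an assumption about how `raiseAndNotify` and `resolveAndNotify` treat the key, not a description of the real engine, and `memEngine` is a hypothetical type.

```go
package main

import "fmt"

type alertRow struct{ kind, hostID, message string }

// memEngine keeps at most one open alert per dedup key (assumed
// semantics, for illustration only).
type memEngine struct{ open map[string]alertRow }

func newMemEngine() *memEngine {
	return &memEngine{open: make(map[string]alertRow)}
}

// raise reports whether a new alert was opened; false means an alert
// with the same dedup key was already open.
func (e *memEngine) raise(hostID, kind, dedup, msg string) bool {
	if _, exists := e.open[dedup]; exists {
		return false
	}
	e.open[dedup] = alertRow{kind: kind, hostID: hostID, message: msg}
	return true
}

// resolve reports whether an open alert was closed.
func (e *memEngine) resolve(dedup string) bool {
	if _, exists := e.open[dedup]; !exists {
		return false
	}
	delete(e.open, dedup)
	return true
}

func updateFailedKey(hostID string) string { return "update_failed:" + hostID }

func main() {
	e := newMemEngine()
	key := updateFailedKey("host-1")
	fmt.Println(e.raise("host-1", "update_failed", key, "timeout")) // opens
	fmt.Println(e.raise("host-1", "update_failed", key, "timeout")) // deduplicated
	fmt.Println(e.resolve(key))                                     // closes
}
```

Keying `update_failed` by host rather than by job means a retry that fails again would reuse the open alert, while a successful hello clears it.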
- [ ] **Step 2: Wire auto-resolve into the WS hello handler** (in Task 11's update watcher: when a successful match is recorded, also call `ResolveUpdateFailed`).
- [ ] **Step 3: Commit**
```
git add internal/alert/update_alerts.go
git commit -m "alert: update_failed + fleet_update_halted kinds"
```
---
## Phase 8 — Fleet-update worker
### Task 14: `internal/server/fleetupdate` worker
**Files:**
- Create: `internal/server/fleetupdate/worker.go`
- Create: `internal/server/fleetupdate/worker_test.go`
- [ ] **Step 1: Sketch the API**
```go
package fleetupdate

import (
	"context"
	"errors"
	"sync"
	"time"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
)

// ErrAlreadyRunning is returned by Start while another run is in progress.
var ErrAlreadyRunning = errors.New("fleetupdate: a run is already in progress")

// Worker owns the at-most-one rolling fleet-update goroutine.
type Worker struct {
	mu         sync.Mutex // held for the lifetime of a run; ensures one run at a time
	store      *store.Store
	hub        Hub        // small interface: IsOnline, SendTo
	dispatcher Dispatcher // small interface: DispatchUpdate(ctx, hostID, fleetUpdateID) (jobID string, err error)
	watcher    Watcher    // small interface: WaitForVersion(ctx, hostID, version, timeout) bool
	alerts     AlertRaiser
}

// Start kicks off a new fleet update. Validates that no other run
// is in progress. Returns the new fleet_update id on success.
func (w *Worker) Start(ctx context.Context, userID, targetVersion string, hostIDs []string) (string, error) {
	if !w.mu.TryLock() {
		return "", ErrAlreadyRunning
	}
	// Build the fleet_updates row + N pending fleet_update_hosts rows
	// in position order (store calls elided in this sketch), assigning
	// the new row's id to fuID. Any early return on error must call
	// w.mu.Unlock() first.
	var fuID string
	// Spawn the goroutine that runs the loop. ctx must outlive the HTTP
	// request that triggered Start; a request context is cancelled as
	// soon as the handler returns.
	go w.run(ctx, fuID)
	return fuID, nil
}

// run is the rolling loop. For each pending host: pre-check, dispatch,
// wait for hello-with-target-version, mark succeeded/failed, halt on
// first failure.
func (w *Worker) run(ctx context.Context, fuID string) {
	defer w.mu.Unlock()
	// ... see spec §7.2 pseudocode
}
```
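The at-most-one-run guarantee rests on `sync.Mutex.TryLock` (available since Go 1.18) being taken in `Start` and released by the run goroutine. A self-contained sketch of just that hand-off follows; `singleFlight` and its `start` method are hypothetical names, and the error value is redeclared locally so the example compiles on its own.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrAlreadyRunning = errors.New("fleet update already running")

// singleFlight allows at most one background run at a time.
type singleFlight struct {
	mu sync.Mutex
}

// start runs fn in a goroutine if no other run is active. The returned
// channel is closed after fn has finished and the lock is released.
func (s *singleFlight) start(fn func()) (<-chan struct{}, error) {
	if !s.mu.TryLock() {
		return nil, ErrAlreadyRunning
	}
	done := make(chan struct{})
	go func() {
		// Deferred calls run last-in first-out: Unlock, then close(done).
		defer close(done)
		defer s.mu.Unlock()
		fn()
	}()
	return done, nil
}

func main() {
	var s singleFlight
	release := make(chan struct{})

	done, err := s.start(func() { <-release })
	fmt.Println("first:", err)

	_, err = s.start(func() {})
	fmt.Println("second rejected:", errors.Is(err, ErrAlreadyRunning))

	close(release)
	<-done

	done, err = s.start(func() {})
	fmt.Println("third:", err)
	<-done
}
```

Because the lock is acquired synchronously in `start`, a second caller is rejected even if the goroutine has not been scheduled yet. Go permits unlocking a mutex from a different goroutine than the one that locked it, which is what makes this hand-off legal.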
- [ ] **Step 2: Write tests**
Use mocks/fakes for Hub/Dispatcher/Watcher. Cover:
- two-host run, both succeed → completed.
- first host succeeds, second times out → halted, alert raised, third stays pending.
- host goes offline mid-run → halted with reason "host went offline".
- host already at target version when its turn comes (raced with another path) → skipped, loop continues.
- cancel mid-run → status=cancelled, current host's job left running, no further dispatches.
- start while another run active → returns ErrAlreadyRunning.
- [ ] **Step 3: Implement the run loop**
```go
func (w *Worker) run(ctx context.Context, fuID string) {
	defer w.mu.Unlock()

	// halt marks the host failed, halts the run, and raises the alert
	// with the same reason that is stored on the fleet_updates row.
	halt := func(hostID, hostReason, haltReason, jobID string) {
		now := time.Now().UTC()
		_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, hostID, "failed", hostReason, jobID)
		_ = w.store.HaltFleetUpdate(ctx, fuID, haltReason, now)
		w.alerts.RaiseFleetUpdateHalted(ctx, fuID, haltReason, now)
	}

	pending, err := w.store.ListPendingFleetUpdateHosts(ctx, fuID)
	if err != nil {
		return
	}
	for _, p := range pending {
		// Re-check status: the run could have been cancelled.
		fu, _ := w.store.ActiveFleetUpdate(ctx)
		if fu == nil || fu.Status != "running" || fu.ID != fuID {
			return
		}
		_ = w.store.SetFleetUpdateCurrentHost(ctx, fuID, p.HostID)

		host, _ := w.store.GetHost(ctx, p.HostID)
		if host == nil {
			_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "skipped", "host deleted", "")
			continue
		}
		// Already at target? (target version comes from the fleet_update row)
		if host.AgentVersion == fu.TargetVersion {
			_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "skipped", "already at target", "")
			continue
		}
		if !w.hub.IsOnline(p.HostID) {
			halt(p.HostID, "host offline at dispatch time", "host went offline: "+host.Hostname, "")
			return
		}

		_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "running", "", "")
		jobID, err := w.dispatcher.DispatchUpdate(ctx, p.HostID, fuID)
		if err != nil {
			halt(p.HostID, err.Error(), "dispatch failed on "+host.Hostname+": "+err.Error(), "")
			return
		}
		_ = w.store.SetFleetUpdateHostStatusJob(ctx, fuID, p.HostID, jobID)

		// 95s is a little longer than the watcher's 90s updateTimeout.
		ok := w.watcher.WaitForVersion(ctx, p.HostID, fu.TargetVersion, 95*time.Second)
		if !ok {
			halt(p.HostID, "did not reconnect at target version", "update failed on "+host.Hostname, jobID)
			return
		}
		_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "succeeded", "", jobID)
	}
	_ = w.store.CompleteFleetUpdate(ctx, fuID, time.Now().UTC())
}
```
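Stripped of store calls and alerts, the loop's control flow is: skip hosts already at target, update the rest in order, and stop at the first failure so later hosts stay pending. The sketch below models only that flow with function-typed dependencies; `rollOut`, `hostResult`, and `errTimeout` are hypothetical names for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

var errTimeout = errors.New("did not reconnect at target version")

type hostResult struct {
	HostID string
	Status string // pending | skipped | succeeded | failed
}

// rollOut updates hosts one at a time, in order. atTarget reports hosts
// that need no work; update returns nil on success. On the first
// failure it stops and returns the failing host id in haltedOn.
func rollOut(hostIDs []string, atTarget func(string) bool, update func(string) error) (results []hostResult, haltedOn string) {
	results = make([]hostResult, len(hostIDs))
	for i, id := range hostIDs {
		results[i] = hostResult{HostID: id, Status: "pending"}
	}
	for i, id := range hostIDs {
		if atTarget(id) {
			results[i].Status = "skipped"
			continue
		}
		if err := update(id); err != nil {
			results[i].Status = "failed"
			return results, id
		}
		results[i].Status = "succeeded"
	}
	return results, ""
}

func main() {
	hosts := []string{"a", "b", "c", "d"}
	results, halted := rollOut(hosts,
		func(id string) bool { return id == "a" },
		func(id string) error {
			if id == "c" {
				return errTimeout
			}
			return nil
		})
	fmt.Println(results, "halted on:", halted)
}
```

The table-style cases in Step 2 map directly onto this shape: each case is a list of hosts plus the two fake functions, with the expected per-host statuses as the assertion.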
- [ ] **Step 4: Run tests**
```sh
go test ./internal/server/fleetupdate/ -v
```
Expected: PASS.
- [ ] **Step 5: Commit**
```
git add internal/server/fleetupdate/
git commit -m "fleetupdate: rolling worker with halt-on-fail"
```
### Task 15: HTTP endpoints + page handler for fleet update
**Files:**
- Create: `internal/server/http/fleet_update.go`
- Create: `internal/server/http/fleet_update_test.go`
- Create: `web/templates/pages/fleet_update.html`
- Modify: `internal/server/http/server.go` (route registration)
- [ ] **Step 1: Endpoints**
```go
// POST /api/fleet/update — admin-only, body: {target_version?}.
// If target_version omitted, defaults to current server version.
// Returns {fleet_update_id}.
func (s *Server) handleFleetUpdateStart(w stdhttp.ResponseWriter, r *stdhttp.Request) { ... }
// POST /api/fleet-updates/{id}/cancel — admin-only.
func (s *Server) handleFleetUpdateCancel(w stdhttp.ResponseWriter, r *stdhttp.Request) { ... }
// GET /api/fleet-updates/{id} — admin-only, returns
// {fleet_update + per-host array} as JSON.
func (s *Server) handleFleetUpdateGet(w stdhttp.ResponseWriter, r *stdhttp.Request) { ... }
// GET /settings/fleet-update — admin-only, renders the page.
// Shows idle list (out-of-date online hosts) when no run is active,
// or the running run's progress.
func (s *Server) handleFleetUpdatePage(w stdhttp.ResponseWriter, r *stdhttp.Request) { ... }
```
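The start endpoint's one piece of request logic is the `target_version` default. A possible shape for that step, pulled out as a pure function so it can be tested without an HTTP server, is sketched below; `startRequest` and `resolveTargetVersion` are hypothetical names, and a real handler would first read `r.Body` into the byte slice.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// startRequest is the assumed JSON body of the start endpoint.
type startRequest struct {
	TargetVersion string `json:"target_version"`
}

// resolveTargetVersion returns the requested target version, falling
// back to serverVersion when the body is empty or omits the field.
func resolveTargetVersion(body []byte, serverVersion string) (string, error) {
	var req startRequest
	if len(bytes.TrimSpace(body)) > 0 {
		if err := json.Unmarshal(body, &req); err != nil {
			return "", fmt.Errorf("decode fleet update request: %w", err)
		}
	}
	if req.TargetVersion == "" {
		return serverVersion, nil
	}
	return req.TargetVersion, nil
}

func main() {
	for _, body := range []string{``, `{}`, `{"target_version":"v1.1.0"}`, `{`} {
		v, err := resolveTargetVersion([]byte(body), "v1.2.0")
		fmt.Printf("%q -> %q, err=%v\n", body, v, err)
	}
}
```

A malformed body is treated as a client error here rather than silently defaulting, which seems the safer choice for an endpoint that touches every host.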
- [ ] **Step 2: Tests**
Unit-test the page handler (idle vs running variants) and the start endpoint (accepts target list, refuses if a run is already active, RBAC).
- [ ] **Step 3: Page template**
`web/templates/pages/fleet_update.html`:
- Inherit from the base layout.
- Idle state block: header "Fleet update", paragraph explaining "rolling updates one host at a time, halts on first failure", table of out-of-date online hosts with Hostname / Current / Target / Last seen, plus a typed-confirm dialog ("Type the host count to confirm"), "Start rolling update" button.
- Running state block: htmx auto-refresh every 3s (`hx-get="/api/fleet-updates/{id}/partial" hx-trigger="every 3s [...visibility...]"`), per-host progress list with status pill, link to job log when present, "Cancel" button.
- Mirror the visual idiom of `web/templates/pages/alerts.html` for the auto-refresh behaviour.
- [ ] **Step 4: Run tests + smoke render**
```sh
go test ./internal/server/http/ -run TestFleetUpdate -v
```
- [ ] **Step 5: Commit**
```
git add internal/server/http/fleet_update.go internal/server/http/fleet_update_test.go web/templates/pages/fleet_update.html internal/server/http/server.go
git commit -m "http: fleet update endpoints + /settings/fleet-update page"
```
---
## Phase 9 — UI surfacing
### Task 16: Update chip on host row + host detail header
**Files:**
- Create: `web/templates/partials/host_update_chip.html`
- Modify: `web/templates/partials/host_row.html`
- Modify: `web/templates/partials/host_chrome.html`
- Modify: `internal/server/http/hosts.go` (add `UpdateAvailable` and `TargetVersion` fields to the row view-model)
- Modify: `internal/server/http/host_detail.go` (or wherever `host_chrome` is populated)
- Modify: `web/styles/input.css`
- [ ] **Step 1: View-model**
Compute `UpdateAvailable bool` and `TargetVersion string` (= server version) anywhere `host` data is built for templates. Hide the chip when `host.AgentVersion == ""` or when it already matches the server version.
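As a sketch of that rule, the view-model could be built by one small constructor so every template sees the same answer. `hostRowView` and `newHostRowView` are hypothetical names; only the two field names come from the plan.

```go
package main

import "fmt"

// hostRowView is an assumed shape for the row view-model.
type hostRowView struct {
	Hostname        string
	AgentVersion    string
	TargetVersion   string
	UpdateAvailable bool
}

// newHostRowView sets TargetVersion to the server version and flags an
// update only when the agent has reported a version that differs.
func newHostRowView(hostname, agentVersion, serverVersion string) hostRowView {
	return hostRowView{
		Hostname:        hostname,
		AgentVersion:    agentVersion,
		TargetVersion:   serverVersion,
		UpdateAvailable: agentVersion != "" && agentVersion != serverVersion,
	}
}

func main() {
	fmt.Printf("%+v\n", newHostRowView("uptime", "v0.0.1-smoke-A", "v0.0.1-smoke-B"))
	fmt.Printf("%+v\n", newHostRowView("fresh", "", "v0.0.1-smoke-B"))
}
```

The comparison is plain string inequality, so an agent that is newer than the server would also be flagged; that matches the plan's definition of success as an exact version match.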
- [ ] **Step 2: Partial**
```html
{{ define "host_update_chip" }}
  {{ if .UpdateAvailable }}
    <span class="update-chip" title="Agent {{ .AgentVersion }} → server {{ .TargetVersion }}">
      out of date · {{ .AgentVersion }} → {{ .TargetVersion }}
    </span>
  {{ end }}
{{ end }}
```
- [ ] **Step 3: CSS**
`web/styles/input.css`:
```css
.update-chip {
@apply inline-flex items-center gap-1 px-2 py-0.5 rounded text-xs;
@apply bg-amber-50 text-amber-900 border border-amber-200;
}
```
- [ ] **Step 4: Render Tailwind + commit**
```sh
make build
git add web/templates web/styles/input.css web/static/css/styles.css internal/server/http/hosts.go internal/server/http/host_detail.go
git commit -m "ui: update chip on host row + detail header"
```
### Task 17: Per-host Update agent button on `/hosts/{id}`
**Files:**
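- Modify: `web/templates/pages/host_detail.html`
- Modify: `internal/server/http/host_detail.go` (populate `Host.Online` and `Host.UpdateInProgress`)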
- [ ] **Step 1: Right-rail button block**
Look at the existing right-rail in `host_detail.html` (e.g. the Restore button block from P3). Add (admin-only):
```html
{{ if and .CanAdmin .Host.UpdateAvailable }}
<form hx-post="/api/hosts/{{ .Host.ID }}/update" hx-swap="none">
  <button type="submit" class="btn btn-amber w-full"
    {{ if not .Host.Online }}disabled title="Agent must be online"
    {{ else if .Host.UpdateInProgress }}disabled title="Update already in progress"{{ end }}>
    Update agent
  </button>
</form>
{{ end }}
```
The view-model needs `Host.Online` and `Host.UpdateInProgress` populated.
- [ ] **Step 2: Commit**
```
git add web/templates/pages/host_detail.html internal/server/http/host_detail.go
git commit -m "ui: per-host Update agent button"
```
### Task 18: Dashboard "N hosts behind" tile + `?updates=behind` filter
**Files:**
- Modify: `internal/server/http/dashboard_filter.go` (or wherever the dashboard handler lives — search for the `?status=` filter from NS-04)
- Modify: `web/templates/pages/dashboard.html`
- [ ] **Step 1: Extend filter parsing**
Add `Updates string` (values: "" or "behind") to the dashboard filter struct. When `behind`, filter to hosts where `agent_version != "" && agent_version != server.Version`.
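A minimal sketch of the parsing and filtering step follows. The types and function names (`dashboardFilter`, `parseFilter`, `applyFilter`, `host`) are hypothetical and the existing filter struct will have more fields; the same filtered slice can also supply the count for the hero tile.

```go
package main

import (
	"fmt"
	"net/url"
)

type host struct{ Hostname, AgentVersion string }

// dashboardFilter shows only the new field; "" means no update filter.
type dashboardFilter struct{ Updates string }

// parseFilter reads ?updates= from a raw query string. Any value other
// than "behind" is treated as no filter.
func parseFilter(rawQuery string) dashboardFilter {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return dashboardFilter{}
	}
	if q.Get("updates") == "behind" {
		return dashboardFilter{Updates: "behind"}
	}
	return dashboardFilter{}
}

// applyFilter keeps hosts whose reported agent version differs from the
// server version when the "behind" filter is active.
func applyFilter(hosts []host, f dashboardFilter, serverVersion string) []host {
	if f.Updates != "behind" {
		return hosts
	}
	var out []host
	for _, h := range hosts {
		if h.AgentVersion != "" && h.AgentVersion != serverVersion {
			out = append(out, h)
		}
	}
	return out
}

func main() {
	hosts := []host{{"uptime", "v1"}, {"db", "v2"}, {"new", ""}}
	behind := applyFilter(hosts, parseFilter("updates=behind"), "v2")
	fmt.Println(len(behind), behind)
}
```

Hosts that have never reported a version are excluded from the "behind" set, matching the chip rule.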
- [ ] **Step 2: Hero tile**
In `dashboard.html`, alongside existing tiles (online/offline/snapshot count), add — only when N > 0:
```html
{{ if gt .UpdatesBehind 0 }}
<a href="?updates=behind" class="hero-tile hero-tile--amber">
  <span class="hero-num">{{ .UpdatesBehind }}</span>
  <span class="hero-label">{{ if eq .UpdatesBehind 1 }}host{{ else }}hosts{{ end }} behind</span>
</a>
{{ end }}
```
- [ ] **Step 3: Tests**
Extend `dashboard_filter_test.go` to cover the `updates=behind` path.
- [ ] **Step 4: Commit**
```
git add internal/server/http/dashboard*.go web/templates/pages/dashboard.html
git commit -m "ui: dashboard hosts-behind tile + filter"
```
---
## Phase 10 — Smoke validation
### Task 19: Restage + smoke validate
- [ ] **Step 1: Build at version A**
```sh
make build VERSION=v0.0.1-smoke-A
# restage block from CLAUDE.md
```
- [ ] **Step 2: Onboard `uptime` as a fresh host**
Use the dashboard's Add-host flow against `ssh uptime`. Confirm the host shows `agent_version=v0.0.1-smoke-A`.
- [ ] **Step 3: Bump server to version B**
```sh
make build VERSION=v0.0.1-smoke-B
# restart server only (not the agent)
```
Verify: dashboard shows `uptime` with the "out of date · v0.0.1-smoke-A → v0.0.1-smoke-B" chip and the "1 host behind" tile.
- [ ] **Step 4: Stage agent at version B**
```sh
cp bin/restic-manager-agent $HOME/smoke/data/agent-binaries/restic-manager-agent-linux-amd64
```
- [ ] **Step 5: Click Update agent**
On `/hosts/{uptime-id}`. Watch the live job log. Expect: agent fetches, swaps, exits, systemd restarts it, hellos at version B, job marked succeeded, chip and tile clear.
Verify on `uptime`:
```sh
ssh uptime "ls -la /usr/local/bin/restic-manager-agent*"
```
Expect both `restic-manager-agent` (B) and `restic-manager-agent.old` (A) present.
- [ ] **Step 6: Test rollback path**
```sh
# Assumes the version-A build was saved earlier as bin/restic-manager-agent.A.
# Stage it as the served binary: the server claims B but serves A.
cp bin/restic-manager-agent.A $HOME/smoke/data/agent-binaries/restic-manager-agent-linux-amd64
```
Click Update — agent fetches A, swaps to A, restarts at A. Server should mark the job `failed` after 90s with reason like "agent reconnected at v0.0.1-smoke-A, expected v0.0.1-smoke-B". Alert raised.
- [ ] **Step 7: Fleet update path**
Start a rolling update from `/settings/fleet-update`. With only one host available, this validates the worker at N=1. To validate N=2 and halt-on-fail, spin up a second sibling agent (Docker-based or another VM) and replace `<DataDir>/agent-binaries/...` with a `/bin/false`-equivalent during one host's turn.
- [ ] **Step 8: Capture screenshots**
Save Playwright screenshots of: out-of-date host row, fleet-update idle page, fleet-update running progress, fleet-update halted state. Drop into `_diag/p6-update-sweep/`.
- [ ] **Step 9: Commit + update tasks.md**
Mark P6-01 and P6-02 done in `tasks.md` with an as-shipped block summarising what landed (mirror the style used for P5-03/P5-07).
```
git add tasks.md _diag/p6-update-sweep/
git commit -m "tasks: mark P6-01 + P6-02 done with as-shipped block"
```
---
## Self-review
Run through the spec sections:
- §3 wire protocol → Task 4, Task 5 (jobs.kind), Task 9, Task 10. ✅
- §4 agent execution → Task 8 (Linux), Task 9 (dispatch wiring), Task 12 (Windows). ✅
- §5 server build version → Task 1, Task 2, Task 3. ✅
- §6 server endpoints → Task 10 (host update), Task 11 (hello integration + watcher). ✅
- §7 fleet update → Task 6 (schema), Task 7 (store), Task 14 (worker), Task 15 (HTTP+UI). ✅
- §7.3 UI surfaces → Task 16 (chip), Task 17 (button), Task 18 (dashboard tile). ✅
- §7.4 alert engine → Task 13. ✅
- §8 RBAC → enforced in Task 10 + Task 15 by reusing existing `requireAdmin` middleware. ✅
- §9 testing → Task tests + Task 19 smoke. ✅
No placeholders. All types referenced consistently across tasks. Done.