P6-01 + P6-02 Agent Self-Update Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Operator-driven agent self-update via WS envelope, with dashboard "out of date" surfacing, per-host Update button, and a rolling fleet-update worker that halts on first failure.

Architecture: Agent fetches its replacement binary from /agent/binary, atomic-renames over the running binary (Linux) or hands off to a detached helper script (Windows), and exits cleanly so the service manager restarts it. The server tracks each update as a jobs row with kind=update; success is detected when the agent re-hellos with agent_version == server.Version. A fleet-update worker iterates out-of-date hosts one at a time, halting on the first failure.

Tech Stack: Go server + agent, SQLite migrations, WebSocket envelopes (existing internal/api), htmx/Tailwind UI.

Spec: docs/superpowers/specs/2026-05-06-p6-01-02-agent-self-update-design.md


File Structure

New files

  • internal/version/version.go — Version, Commit constants, ldflags-injected.
  • internal/agent/updater/updater.go — shared HTTP fetch + atomic-write helpers.
  • internal/agent/updater/updater_unix.go — Linux platform path (build-tag !windows).
  • internal/agent/updater/updater_windows.go — Windows platform path (build-tag windows).
  • internal/agent/updater/updater_test.go — Linux unit tests with fake HTTP server.
  • internal/store/migrations/0021_jobs_update_kind.sql — widen jobs.kind CHECK.
  • internal/store/migrations/0022_fleet_updates.sql — fleet_updates + fleet_update_hosts tables.
  • internal/store/fleet_updates.go — store layer for the new tables.
  • internal/store/fleet_updates_test.go
  • internal/server/http/host_update.go — POST /api/hosts/{id}/update + form variant.
  • internal/server/http/host_update_test.go
  • internal/server/http/version.go — GET /api/version.
  • internal/server/http/fleet_update.go — fleet update endpoints + page handler.
  • internal/server/http/fleet_update_test.go
  • internal/server/fleetupdate/worker.go — the rolling-update goroutine.
  • internal/server/fleetupdate/worker_test.go
  • internal/alert/update_alerts.go — alert kinds + helpers (update_failed, fleet_update_halted).
  • web/templates/pages/fleet_update.html — both idle and running states.
  • web/templates/partials/host_update_chip.html — reusable chip.
  • cmd/agent/update_dispatch.go — agent-side command.update handler.

Modified files

  • Makefile — add VERSION / COMMIT ldflags.
  • internal/api/wire.go — drop MsgAgentUpdateAvail, add MsgCommandUpdate.
  • internal/api/messages.go — drop AgentUpdateAvailablePayload, add CommandUpdatePayload.
  • cmd/agent/main.go — wire MsgCommandUpdate case in dispatcher.
  • internal/server/ws/handler.go — extend onAgentHello to mark in-flight update jobs succeeded on version match.
  • internal/server/http/server.go — register new routes + middleware.
  • internal/server/http/middleware.go — already has requireAdmin; reuse.
  • internal/server/http/hosts.go — render the update chip into host responses.
  • internal/server/http/dashboard.go (or wherever the dashboard handler lives) — "N hosts behind" tile, updates=behind filter.
  • web/templates/partials/host_chrome.html — embed update chip in header.
  • web/templates/partials/host_row.html — embed update chip in dashboard row.
  • web/templates/pages/host_detail.html — Update agent button on right-rail.
  • web/styles/input.css — .update-chip token (amber).
  • cmd/server/main.go — wire fleet-update worker into the daemon lifecycle.
  • tasks.md — mark P6-01 and P6-02 done with the as-shipped block.

Phase 1 — Build version plumbing

Task 1: internal/version package

Files:

  • Create: internal/version/version.go

  • Step 1: Create the package

// Package version exposes build-time identifying constants. Both the
// server and agent link this package; their values are set via
// -ldflags during the build. An unset Version falls back to "dev"
// so source builds without ldflags still run.
package version

var (
	// Version is the human-facing release string, e.g. "v1.2.3" or
	// "v1.2.3-dirty". Compared byte-for-byte between agent and
	// server to drive the "out of date" signal.
	Version = "dev"

	// Commit is the short git SHA. Informational only; surfaced via
	// /api/version but not used for any comparison.
	Commit = ""
)
  • Step 2: Commit
git add internal/version/version.go
git commit -m "version: add build-time version package"
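
Since Version is compared byte-for-byte, the "out of date" check reduces to plain string inequality. A minimal sketch, assuming a hypothetical helper named `outOfDate` (the plan does not name one):

```go
package main

import "fmt"

// outOfDate reports whether an agent should be flagged as behind.
// Hypothetical helper: the plan only says the two strings are compared
// byte-for-byte, so there is deliberately no semver ordering here.
func outOfDate(agentVersion, serverVersion string) bool {
	return agentVersion != serverVersion
}

func main() {
	fmt.Println(outOfDate("v1.2.3", "v1.2.3"))       // identical strings
	fmt.Println(outOfDate("v1.2.3", "v1.2.3-dirty")) // server built from a dirty tree
	fmt.Println(outOfDate("v1.3.0", "v1.2.3"))       // agent newer than server
}
```

Two consequences of the byte-for-byte rule: a server built from a dirty checkout flags every agent built from the clean tag, and an agent that is newer than the server is flagged just like an older one.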

Task 2: Wire ldflags into the Makefile

Files:

  • Modify: Makefile

  • Step 1: Read the Makefile, locate the build target, and prepend the ldflags

Add near the top of the Makefile (after any existing variable block):

VERSION ?= $(shell git describe --tags --always --dirty 2>/dev/null || echo dev)
COMMIT  ?= $(shell git rev-parse --short HEAD 2>/dev/null || echo unknown)
GO_LDFLAGS := -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=$(VERSION) \
              -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Commit=$(COMMIT)

Then thread -ldflags "$(GO_LDFLAGS)" into every go build invocation in the Makefile (for cmd/server, cmd/agent, and any cross-compile target).

  • Step 2: Verify
make build
./bin/restic-manager-server -version 2>/dev/null || true   # if -version flag exists
strings ./bin/restic-manager-server | grep -E "^v[0-9]+|^dev" | head -3

Expected: a non-dev version string is embedded in the binary when in a tagged-or-dirty git checkout.
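
If the binaries have no -version flag yet, one could be added along these lines. This is a sketch under assumptions: the `run` and `versionLine` helpers are hypothetical, and the local `Version` / `Commit` variables stand in for the ones in internal/version.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Stand-ins for internal/version.Version and internal/version.Commit,
// which the real build overrides via -ldflags "-X ...".
var (
	Version = "dev"
	Commit  = ""
)

// versionLine formats the version for humans, omitting the commit
// when it was not injected at build time.
func versionLine(version, commit string) string {
	if commit == "" {
		return version
	}
	return fmt.Sprintf("%s (%s)", version, commit)
}

// run parses args and returns the text to print for -version, or an
// empty string when the flag is absent.
func run(args []string) (string, error) {
	fs := flag.NewFlagSet("restic-manager-server", flag.ContinueOnError)
	showVersion := fs.Bool("version", false, "print version and exit")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	if *showVersion {
		return versionLine(Version, Commit), nil
	}
	return "", nil
}

func main() {
	out, err := run(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	if out != "" {
		fmt.Println(out)
		return
	}
	fmt.Println("server would start here")
}
```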

  • Step 3: Commit
git add Makefile
git commit -m "build: inject version + commit via ldflags"

Task 3: GET /api/version endpoint

Files:

  • Create: internal/server/http/version.go

  • Create: internal/server/http/version_test.go

  • Modify: internal/server/http/server.go

  • Step 1: Write the failing test

package http

import (
	"encoding/json"
	"net/http"
	"net/http/httptest"
	"testing"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
)

func TestVersionEndpoint(t *testing.T) {
	version.Version = "v1.2.3"
	version.Commit = "abc1234"
	t.Cleanup(func() {
		version.Version = "dev"
		version.Commit = ""
	})

	srv := newTestServer(t)         // existing helper in this package
	rr := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/api/version", nil)
	srv.Router().ServeHTTP(rr, req)

	if rr.Code != http.StatusOK {
		t.Fatalf("status: got %d want 200", rr.Code)
	}
	var body struct {
		Version string `json:"version"`
		Commit  string `json:"commit"`
	}
	if err := json.NewDecoder(rr.Body).Decode(&body); err != nil {
		t.Fatalf("decode: %v", err)
	}
	if body.Version != "v1.2.3" || body.Commit != "abc1234" {
		t.Fatalf("body: %+v", body)
	}
}

If newTestServer doesn't exist by that name in this package, locate the equivalent helper (look at enrollment_test.go or version.go style elsewhere) and adapt.

  • Step 2: Implement
// internal/server/http/version.go
package http

import (
	"encoding/json"
	stdhttp "net/http"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
)

func (s *Server) handleVersion(w stdhttp.ResponseWriter, r *stdhttp.Request) {
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(map[string]string{
		"version": version.Version,
		"commit":  version.Commit,
	})
}

In server.go, add inside the route registration block (where other public routes live, near /agent/binary):

r.Get("/api/version", s.handleVersion)
  • Step 3: Run tests
go test ./internal/server/http/ -run TestVersionEndpoint -v

Expected: PASS.

  • Step 4: Commit
git add internal/server/http/version.go internal/server/http/version_test.go internal/server/http/server.go
git commit -m "http: expose GET /api/version"

Phase 2 — Wire protocol changes

Task 4: Add MsgCommandUpdate and CommandUpdatePayload, retire MsgAgentUpdateAvail

Files:

  • Modify: internal/api/wire.go

  • Modify: internal/api/messages.go

  • Modify: cmd/agent/main.go

  • Step 1: Edit wire.go

In the server-to-agent block (around line 32-37), replace:

MsgAgentUpdateAvail MessageType = "agent.update.available"

with:

MsgCommandUpdate    MessageType = "command.update"
  • Step 2: Edit messages.go

Delete the AgentUpdateAvailablePayload struct (lines ~364-371) and its doc comment. Add immediately before TreeListRequestPayload:

// CommandUpdatePayload carries no operational data — the agent
// already knows its own os/arch and fetches from its configured
// server URL via /agent/binary. JobID is the server-issued id of
// the update job; the agent echoes it on log.stream lines so the
// live job log captures pre-restart progress.
type CommandUpdatePayload struct {
	JobID string `json:"job_id"`
}
  • Step 3: Edit cmd/agent/main.go

Replace the case api.MsgAgentUpdateAvail block (lines ~398-401) with:

case api.MsgCommandUpdate:
	var p api.CommandUpdatePayload
	if err := env.UnmarshalPayload(&p); err != nil {
		return fmt.Errorf("command.update: %w", err)
	}
	go d.runUpdate(ctx, p, tx)

runUpdate lands in Task 9.

  • Step 4: Update JobKind constants in messages.go

In the JobKind const block (line ~57), add:

JobUpdate  JobKind = "update"
  • Step 5: Verify
go build ./...

Expected: build error from cmd/agent/main.go calling d.runUpdate (not yet defined). That's fine; proceed, since Task 9 plugs the gap. For now, verify only that the internal/api package builds:

go build ./internal/api/

Expected: PASS.

  • Step 6: Commit
git add internal/api/wire.go internal/api/messages.go cmd/agent/main.go
git commit -m "api: replace agent.update.available with command.update + JobUpdate kind"
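
To make the new message concrete, here is a rough sketch of how a command.update envelope might look on the wire. The `envelope` struct and its JSON field names are assumptions made for illustration; the real envelope type lives in internal/api and may differ. Only the `job_id` payload field comes from this plan.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// envelope is a hypothetical stand-in for the WS envelope in
// internal/api; field names here are assumed, not taken from the repo.
type envelope struct {
	Type    string          `json:"type"`
	ID      string          `json:"id,omitempty"`
	Payload json.RawMessage `json:"payload"`
}

// commandUpdatePayload mirrors CommandUpdatePayload from this task.
type commandUpdatePayload struct {
	JobID string `json:"job_id"`
}

// encodeCommandUpdate builds the server-to-agent message.
func encodeCommandUpdate(envID, jobID string) ([]byte, error) {
	p, err := json.Marshal(commandUpdatePayload{JobID: jobID})
	if err != nil {
		return nil, err
	}
	return json.Marshal(envelope{Type: "command.update", ID: envID, Payload: p})
}

// decodeCommandUpdate is what the agent side does before dispatching.
func decodeCommandUpdate(raw []byte) (commandUpdatePayload, error) {
	var env envelope
	var p commandUpdatePayload
	if err := json.Unmarshal(raw, &env); err != nil {
		return p, err
	}
	if env.Type != "command.update" {
		return p, fmt.Errorf("unexpected message type %q", env.Type)
	}
	if err := json.Unmarshal(env.Payload, &p); err != nil {
		return p, err
	}
	return p, nil
}

func main() {
	raw, err := encodeCommandUpdate("env-1", "job-42")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))
	p, err := decodeCommandUpdate(raw)
	fmt.Println(p.JobID, err)
}
```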

Phase 3 — Database migrations

Task 5: Migration 0021 — widen jobs.kind CHECK

Files:

  • Create: internal/store/migrations/0021_jobs_update_kind.sql

  • Step 1: Write the migration

Mirror the pattern in 0012_jobs_restore_diff_kind.sql exactly: temp-backup of job_logs, rebuild jobs with the wider CHECK, restore log rows, recreate indexes. Only change is the CHECK list now includes 'update':

-- 0021_jobs_update_kind.sql
--
-- Add 'update' to the jobs.kind CHECK constraint so the agent
-- self-update flow (P6-01) can persist its job rows. Same safe
-- rebuild pattern as 0012; cascade trap mitigated by job_logs
-- temp-backup.

CREATE TEMPORARY TABLE _job_logs_backup AS
  SELECT job_id, seq, ts, stream, payload FROM job_logs;

CREATE TABLE jobs_new (
  id              TEXT PRIMARY KEY,
  host_id         TEXT NOT NULL REFERENCES hosts(id) ON DELETE CASCADE,
  kind            TEXT NOT NULL CHECK (kind IN
                       ('backup','init','forget','prune','check','unlock','restore','diff','update')),
  status          TEXT NOT NULL CHECK (status IN ('queued','running','succeeded','failed','cancelled')),
  scheduled_id    TEXT REFERENCES schedules(id) ON DELETE SET NULL,
  actor_kind      TEXT NOT NULL CHECK (actor_kind IN ('user','schedule','system')),
  actor_id        TEXT,
  started_at      TEXT,
  finished_at     TEXT,
  exit_code       INTEGER,
  stats           TEXT,
  error           TEXT,
  created_at      TEXT NOT NULL
);

INSERT INTO jobs_new
  SELECT id, host_id, kind, status, scheduled_id, actor_kind, actor_id,
         started_at, finished_at, exit_code, stats, error, created_at
  FROM jobs;

DROP TABLE jobs;
ALTER TABLE jobs_new RENAME TO jobs;

CREATE INDEX jobs_host_id ON jobs(host_id);
CREATE INDEX jobs_status ON jobs(status);
CREATE INDEX jobs_created_at ON jobs(created_at);

INSERT OR IGNORE INTO job_logs (job_id, seq, ts, stream, payload)
  SELECT job_id, seq, ts, stream, payload FROM _job_logs_backup;

DROP TABLE _job_logs_backup;

If the live jobs schema already has columns added by post-0012 migrations (e.g. 0015 added source_group_id), match them in jobs_new and the INSERT — check the latest schema before writing.

  • Step 2: Verify
go test ./internal/store/ -run TestMigrations -v

Expected: PASS, includes 0021.

  • Step 3: Commit
git add internal/store/migrations/0021_jobs_update_kind.sql
git commit -m "store: migration 0021 — add 'update' to jobs.kind"

Task 6: Migration 0022 — fleet_updates tables

Files:

  • Create: internal/store/migrations/0022_fleet_updates.sql

  • Step 1: Write the migration

-- 0022_fleet_updates.sql
--
-- Tables backing the rolling fleet-update worker (P6-02). One row in
-- fleet_updates per "update all" invocation, a child row per host so
-- the worker can resume / report progress / mark per-host failures.

CREATE TABLE fleet_updates (
  id                 TEXT PRIMARY KEY,
  started_at         TEXT NOT NULL,
  started_by_user_id TEXT NOT NULL REFERENCES users(id),
  target_version     TEXT NOT NULL,
  status             TEXT NOT NULL CHECK (status IN
                          ('running','completed','halted','cancelled')),
  current_host_id    TEXT REFERENCES hosts(id),
  halted_reason      TEXT,
  completed_at       TEXT
);

CREATE INDEX fleet_updates_status ON fleet_updates(status);

CREATE TABLE fleet_update_hosts (
  fleet_update_id TEXT NOT NULL REFERENCES fleet_updates(id) ON DELETE CASCADE,
  host_id         TEXT NOT NULL REFERENCES hosts(id) ON DELETE CASCADE,
  position        INTEGER NOT NULL,
  status          TEXT NOT NULL CHECK (status IN
                       ('pending','running','succeeded','failed','skipped')),
  job_id          TEXT REFERENCES jobs(id),
  failed_reason   TEXT,
  PRIMARY KEY (fleet_update_id, host_id)
);

CREATE INDEX fleet_update_hosts_status ON fleet_update_hosts(fleet_update_id, position);
  • Step 2: Verify
go test ./internal/store/ -run TestMigrations -v
  • Step 3: Commit
git add internal/store/migrations/0022_fleet_updates.sql
git commit -m "store: migration 0022 — fleet_updates + fleet_update_hosts"

Task 7: internal/store/fleet_updates.go

Files:

  • Create: internal/store/fleet_updates.go

  • Create: internal/store/fleet_updates_test.go

  • Step 1: Write the failing tests

package store_test

// Cover: CreateFleetUpdate creates parent + N pending host rows;
// SetFleetUpdateHostStatus moves a row through pending→running→succeeded;
// HaltFleetUpdate sets status=halted + halted_reason and stamps no
// completed_at; CompleteFleetUpdate sets completed_at; ListPendingFleetUpdateHosts
// returns rows in position order; ActiveFleetUpdate returns the running
// row (or nil); GetFleetUpdate hydrates parent + children.
//
// One test per behaviour, table-driven where the API supports it.
// Mirror the structure of internal/store/maintenance_test.go.

Write four to six discrete test functions. Look at internal/store/maintenance.go and its _test.go for the established style (constructor on *Store, NewStore + tmp DB).

  • Step 2: Implement

Sketch:

package store

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
	"time"
)

type FleetUpdate struct {
	ID              string
	StartedAt       time.Time
	StartedByUserID string
	TargetVersion   string
	Status          string // running | completed | halted | cancelled
	CurrentHostID   string
	HaltedReason    string
	CompletedAt     *time.Time
}

type FleetUpdateHost struct {
	FleetUpdateID string
	HostID        string
	Position      int
	Status        string // pending | running | succeeded | failed | skipped
	JobID         string
	FailedReason  string
}

func (s *Store) CreateFleetUpdate(ctx context.Context, fu FleetUpdate, hostIDs []string) error { ... }
func (s *Store) ActiveFleetUpdate(ctx context.Context) (*FleetUpdate, error) { ... }
func (s *Store) GetFleetUpdate(ctx context.Context, id string) (*FleetUpdate, []FleetUpdateHost, error) { ... }
func (s *Store) ListPendingFleetUpdateHosts(ctx context.Context, fuID string) ([]FleetUpdateHost, error) { ... }
func (s *Store) SetFleetUpdateHostStatus(ctx context.Context, fuID, hostID, status, failedReason, jobID string) error { ... }
func (s *Store) SetFleetUpdateCurrentHost(ctx context.Context, fuID, hostID string) error { ... }
func (s *Store) HaltFleetUpdate(ctx context.Context, fuID, reason string, when time.Time) error { ... }
func (s *Store) CancelFleetUpdate(ctx context.Context, fuID string) error { ... }
func (s *Store) CompleteFleetUpdate(ctx context.Context, fuID string, when time.Time) error { ... }
  • Step 3: Run tests
go test ./internal/store/ -run TestFleetUpdate -v

Expected: PASS.
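
The per-host status values form a small state machine. The sketch below shows one plausible set of legal transitions that SetFleetUpdateHostStatus could enforce; the transition table itself is an assumption (the plan only lists the status values and the pending→running→succeeded path), and `canTransition` is a hypothetical helper.

```go
package main

import "fmt"

// allowedNext lists the assumed legal moves for a fleet_update_hosts
// row. succeeded, failed and skipped are treated as terminal.
var allowedNext = map[string][]string{
	"pending": {"running", "skipped"},
	"running": {"succeeded", "failed"},
}

// canTransition reports whether a row may move from one status to
// another under the assumed table above.
func canTransition(from, to string) bool {
	for _, next := range allowedNext[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition("pending", "running"))
	fmt.Println(canTransition("running", "succeeded"))
	fmt.Println(canTransition("succeeded", "running"))
}
```

Whether to enforce this in Go or leave it to the worker's control flow is a design choice; the CHECK constraint in migration 0022 only validates the values, not the order.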

  • Step 4: Commit
git add internal/store/fleet_updates.go internal/store/fleet_updates_test.go
git commit -m "store: fleet_updates + fleet_update_hosts CRUD"

Phase 4 — Agent updater (Linux)

Task 8: internal/agent/updater package skeleton + Linux path

Files:

  • Create: internal/agent/updater/updater.go

  • Create: internal/agent/updater/updater_unix.go

  • Create: internal/agent/updater/updater_windows.go

  • Create: internal/agent/updater/updater_test.go

  • Step 1: Write the failing test (Linux)

//go:build !windows

package updater

import (
	"io"
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
	"runtime"
	"testing"
)

func TestUpdate_LinuxAtomicSwap(t *testing.T) {
	// Stage 1: a fake "running binary" file + a server that serves new bytes.
	tmp := t.TempDir()
	binPath := filepath.Join(tmp, "agent")
	if err := os.WriteFile(binPath, []byte("OLD"), 0o755); err != nil {
		t.Fatal(err)
	}
	newBytes := []byte("NEW BINARY CONTENTS")

	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/agent/binary" {
			http.NotFound(w, r)
			return
		}
		if r.URL.Query().Get("os") != runtime.GOOS || r.URL.Query().Get("arch") != runtime.GOARCH {
			t.Errorf("unexpected query: %s", r.URL.RawQuery)
		}
		_, _ = io.Copy(w, &io.LimitedReader{R: bytesReader(newBytes), N: int64(len(newBytes))})
	}))
	defer srv.Close()

	if err := UpdateForTest(srv.URL, binPath); err != nil {
		t.Fatalf("update: %v", err)
	}

	got, err := os.ReadFile(binPath)
	if err != nil {
		t.Fatal(err)
	}
	if string(got) != "NEW BINARY CONTENTS" {
		t.Fatalf("binary contents: got %q", got)
	}
	old, err := os.ReadFile(binPath + ".old")
	if err != nil {
		t.Fatalf("agent.old missing: %v", err)
	}
	if string(old) != "OLD" {
		t.Fatalf("agent.old contents: got %q", old)
	}
	// .new must have been renamed away
	if _, err := os.Stat(binPath + ".new"); !os.IsNotExist(err) {
		t.Fatalf("agent.new should be absent after swap")
	}
}

UpdateForTest(serverURL, binaryPath string) error is a tiny wrapper exposed by updater.go that does steps 1-6 of §4.1 of the spec (everything except os.Exit). The exit-and-restart side effect can't be covered by a unit test.

  • Step 2: Implement updater.go (shared)
package updater

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
	"runtime"
	"time"
)

// fetch downloads the new binary into <binaryPath>.new, fsyncs, chmods.
// Returns the path to the staged file.
func fetch(ctx context.Context, serverURL, binaryPath string) (string, error) {
	url := fmt.Sprintf("%s/agent/binary?os=%s&arch=%s", serverURL, runtime.GOOS, runtime.GOARCH)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	c := &http.Client{Timeout: 5 * time.Minute}
	res, err := c.Do(req)
	if err != nil {
		return "", err
	}
	defer res.Body.Close()
	if res.StatusCode != http.StatusOK {
		return "", fmt.Errorf("agent binary fetch: %s", res.Status)
	}

	stagePath := binaryPath + ".new"
	f, err := os.OpenFile(stagePath, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
	if err != nil {
		return "", err
	}
	if _, err := io.Copy(f, res.Body); err != nil {
		f.Close()
		_ = os.Remove(stagePath)
		return "", err
	}
	if err := f.Sync(); err != nil {
		f.Close()
		_ = os.Remove(stagePath)
		return "", err
	}
	if err := f.Close(); err != nil {
		_ = os.Remove(stagePath)
		return "", err
	}
	if err := os.Chmod(stagePath, 0o755); err != nil {
		_ = os.Remove(stagePath)
		return "", err
	}
	return stagePath, nil
}

// resolveOwnBinary returns the absolute path of the running binary.
// It refuses a literal /proc/self/exe, which os.Executable can return
// on some systems: that path is a kernel-provided link, not a file
// that can be renamed over.
func resolveOwnBinary() (string, error) {
	p, err := os.Executable()
	if err != nil {
		return "", err
	}
	abs, err := filepath.Abs(p)
	if err != nil {
		return "", err
	}
	if abs == "/proc/self/exe" {
		return "", fmt.Errorf("cannot resolve own binary path (/proc/self/exe — not a real file)")
	}
	return abs, nil
}

// UpdateForTest is the platform-neutral test seam.
// In production, Update (in updater_unix.go / updater_windows.go) does
// the same fetch+swap then exits the process. UpdateForTest stops short
// of the exit so unit tests can assert on file state.
func UpdateForTest(serverURL, binaryPath string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()
	stage, err := fetch(ctx, serverURL, binaryPath)
	if err != nil {
		return err
	}
	return swap(stage, binaryPath)
}
  • Step 3: Implement updater_unix.go
//go:build !windows

package updater

import (
	"context"
	"fmt"
	"io"
	"log/slog"
	"os"
	"time"
)

// Update fetches the new binary, swaps it in, then exits so systemd
// restarts the process under the new binary. Caller should close
// the WS cleanly before invoking.
func Update(ctx context.Context, serverURL string) error {
	binPath, err := resolveOwnBinary()
	if err != nil {
		return err
	}
	stage, err := fetch(ctx, serverURL, binPath)
	if err != nil {
		return err
	}
	if err := swap(stage, binPath); err != nil {
		return err
	}
	slog.Info("agent self-update: binary swapped, exiting for systemd restart",
		"binary", binPath)
	// Give logger a moment to flush, then exit.
	time.Sleep(200 * time.Millisecond)
	os.Exit(0)
	return nil // unreachable
}

// swap copies the running binary to <bin>.old, then atomically renames
// the staged binary into place. On non-Windows systems this works
// because rename(2) may replace a file that is open or executing; the
// running process keeps its old inode until it exits.
func swap(stagePath, binPath string) error {
	src, err := os.Open(binPath)
	if err != nil {
		return fmt.Errorf("open running binary: %w", err)
	}
	defer src.Close()
	dst, err := os.OpenFile(binPath+".old", os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
	if err != nil {
		return fmt.Errorf("open .old: %w", err)
	}
	if _, err := io.Copy(dst, src); err != nil {
		dst.Close()
		return fmt.Errorf("copy to .old: %w", err)
	}
	if err := dst.Sync(); err != nil {
		dst.Close()
		return err
	}
	if err := dst.Close(); err != nil {
		return err
	}
	if err := os.Rename(stagePath, binPath); err != nil {
		return fmt.Errorf("rename .new over running binary: %w", err)
	}
	return nil
}
  • Step 4: Implement updater_windows.go stub
//go:build windows

package updater

import (
	"context"
	"errors"
)

// Update is implemented in Task 12. Stubbed so the package builds
// on Windows until then.
func Update(ctx context.Context, serverURL string) error {
	return errors.New("agent self-update on Windows: not yet implemented")
}

func swap(stagePath, binPath string) error {
	return errors.New("agent self-update on Windows: not yet implemented")
}
  • Step 5: Add bytesReader helper to test file (or use bytes.NewReader directly).

  • Step 6: Run tests

go test ./internal/agent/updater/ -v

Expected: PASS.

  • Step 7: Commit
git add internal/agent/updater/
git commit -m "agent: updater package — Linux atomic-swap path"

Task 9: Wire command.update into the agent dispatcher

Files:

  • Create: cmd/agent/update_dispatch.go

  • Verify: cmd/agent/main.go (already edited in Task 4)

  • Step 1: Implement the dispatcher method

package main

import (
	"context"
	"fmt"
	"log/slog"
	"time"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/agent/updater"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/agent/wsclient"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
)

// runUpdate handles a server-dispatched command.update. It logs progress
// via log.stream so the live job page captures pre-restart state, then
// calls the platform updater. On Linux the updater calls os.Exit; on
// Windows it spawns a helper and returns, with the agent then exiting.
func (d *dispatcher) runUpdate(ctx context.Context, p api.CommandUpdatePayload, tx wsclient.Sender) {
	logf := func(format string, args ...any) {
		line := fmt.Sprintf(format, args...)
		slog.Info("ws agent: update: " + line)
		env, _ := api.Marshal(api.MsgLogStream, "", api.LogStreamPayload{
			JobID:  p.JobID,
			Stream: api.LogStdout,
			Data:   line + "\n",
			At:     time.Now().UTC(),
		})
		_ = tx.Send(env)
	}

	// Job-started so the server flips queued→running.
	startedEnv, _ := api.Marshal(api.MsgJobStarted, "", api.JobStartedPayload{
		JobID:     p.JobID,
		Kind:      api.JobUpdate,
		StartedAt: time.Now().UTC(),
	})
	_ = tx.Send(startedEnv)

	logf("fetching new binary from %s", d.serverURL)
	if err := updater.Update(ctx, d.serverURL); err != nil {
		logf("update failed: %v", err)
		finishedEnv, _ := api.Marshal(api.MsgJobFinished, "", api.JobFinishedPayload{
			JobID:      p.JobID,
			Kind:       api.JobUpdate,
			Status:     api.JobFailed,
			FinishedAt: time.Now().UTC(),
			Error:      err.Error(),
		})
		_ = tx.Send(finishedEnv)
		return
	}
	// Unreachable on Linux (Update calls os.Exit). On Windows control
	// returns here and the agent exits cleanly so SCM hands off to the
	// helper script that does the actual swap-and-restart.
}

d.serverURL should already exist on the dispatcher (it's the URL the WS connection was made to). If not, plumb it from cmd/agent/main.go's connection setup — the URL is in the agent config.

  • Step 2: Verify build
go build ./...

Expected: PASS.

  • Step 3: Run all agent tests
go test ./cmd/agent/... ./internal/agent/...

Expected: PASS, no regressions.

  • Step 4: Commit
git add cmd/agent/update_dispatch.go cmd/agent/main.go
git commit -m "agent: handle command.update — fetch + swap + exit"

Phase 5 — Server endpoint + hello integration

Task 10: POST /api/hosts/{id}/update

Files:

  • Create: internal/server/http/host_update.go

  • Create: internal/server/http/host_update_test.go

  • Modify: internal/server/http/server.go (route registration)

  • Step 1: Write tests covering

Mirror the structure of an existing admin-band endpoint test, e.g. repo_ops_test.go:

  • happy path: admin POST → 200 + {job_id} returned, jobs row created with kind=update, audit row written, WS envelope command.update sent to the host's connection.

  • refuses when host offline → 409 / structured error code host_offline.

  • refuses when agent_version == server.Version → 409 / already_up_to_date.

  • refuses when an update job is already running for this host → 409 / update_in_progress.

  • RBAC: operator → 403, viewer → 403.

  • Step 2: Implement

package http

import (
	"encoding/json"
	stdhttp "net/http"

	"github.com/go-chi/chi/v5"
	"github.com/oklog/ulid/v2"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
)

// handleHostUpdate dispatches a command.update WS envelope after
// validating that the host is online, currently running a different
// version, and not already in the middle of an update.
func (s *Server) handleHostUpdate(w stdhttp.ResponseWriter, r *stdhttp.Request) {
	hostID := chi.URLParam(r, "id")
	host, err := s.deps.Store.GetHost(r.Context(), hostID)
	if err != nil {
		writeJSONError(w, stdhttp.StatusNotFound, "host_not_found", "")
		return
	}

	if !s.deps.Hub.IsOnline(hostID) {
		writeJSONError(w, stdhttp.StatusConflict, "host_offline",
			"agent must be online to receive an update")
		return
	}
	if host.AgentVersion == version.Version {
		writeJSONError(w, stdhttp.StatusConflict, "already_up_to_date",
			"host is already on "+version.Version)
		return
	}
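	running, err := s.deps.Store.RunningUpdateJobForHost(r.Context(), hostID)
	if err != nil {
		// Fail closed: without a definite answer we can't rule out a
		// concurrent update, so don't dispatch a second one.
		writeJSONError(w, stdhttp.StatusInternalServerError, "internal", err.Error())
		return
	}
	if running != "" {
		writeJSONError(w, stdhttp.StatusConflict, "update_in_progress",
			"an update job is already running for this host")
		return
	}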

	jobID := ulid.Make().String()
	user := userFrom(r) // existing helper
	if err := s.deps.Store.InsertJob(r.Context(), store.Job{
		ID:         jobID,
		HostID:     hostID,
		Kind:       string(api.JobUpdate),
		Status:     string(api.JobQueued),
		ActorKind:  "user",
		ActorID:    user.ID,
		// CreatedAt is set by InsertJob.
	}); err != nil {
		writeJSONError(w, stdhttp.StatusInternalServerError, "internal", err.Error())
		return
	}

	env, _ := api.Marshal(api.MsgCommandUpdate, ulid.Make().String(), api.CommandUpdatePayload{JobID: jobID})
	if err := s.deps.Hub.SendTo(hostID, env); err != nil {
		writeJSONError(w, stdhttp.StatusBadGateway, "send_failed", err.Error())
		return
	}

	s.audit(r, "host.update_dispatched", store.AuditTarget{
		Kind: "host", ID: hostID,
	}, map[string]any{"job_id": jobID, "target_version": version.Version})

	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(map[string]string{"job_id": jobID})
}

// Form-post variant for HTMX. Same gates, returns HX-Redirect.
func (s *Server) handleHostUpdateForm(w stdhttp.ResponseWriter, r *stdhttp.Request) {
	// reuse handleHostUpdate's pre-checks via a shared validator;
	// on success, set HX-Redirect to /jobs/<id> and write 200.
	// On error, render an inline banner partial.
}

Helpers to add:

  • Store.RunningUpdateJobForHost(ctx, hostID) (string, error) — returns the job id of any running kind=update job for this host, or empty string + nil if none. One-line query.
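
The shared validator mentioned for the form variant could be a pure function over values both handlers have already looked up. The sketch below is one possible shape; `updateRefusal` and `checkUpdatePreconditions` are hypothetical names, while the status codes, error codes and messages are the ones used in handleHostUpdate.

```go
package main

import (
	"fmt"
	"net/http"
)

// updateRefusal describes why an update request was rejected.
// Hypothetical type for illustration.
type updateRefusal struct {
	Status  int
	Code    string
	Message string
}

// checkUpdatePreconditions applies the three gates in the same order
// as handleHostUpdate and returns nil when the update may proceed.
func checkUpdatePreconditions(online bool, agentVersion, serverVersion, runningJobID string) *updateRefusal {
	switch {
	case !online:
		return &updateRefusal{http.StatusConflict, "host_offline",
			"agent must be online to receive an update"}
	case agentVersion == serverVersion:
		return &updateRefusal{http.StatusConflict, "already_up_to_date",
			"host is already on " + serverVersion}
	case runningJobID != "":
		return &updateRefusal{http.StatusConflict, "update_in_progress",
			"an update job is already running for this host"}
	}
	return nil
}

func main() {
	if r := checkUpdatePreconditions(false, "v1.0.0", "v1.1.0", ""); r != nil {
		fmt.Println(r.Status, r.Code)
	}
	fmt.Println(checkUpdatePreconditions(true, "v1.0.0", "v1.1.0", "") == nil)
}
```

Keeping the gates free of HTTP and store types lets the JSON handler map a refusal to writeJSONError and the form handler map it to an inline banner, with one table-driven test covering both.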

  • Step 3: Register routes

In server.go, inside the admin-only route group:

r.Post("/api/hosts/{id}/update", s.handleHostUpdate)
r.Post("/hosts/{id}/update", s.handleHostUpdateForm)
  • Step 4: Run tests
go test ./internal/server/http/ -run TestHostUpdate -v

Expected: PASS.

  • Step 5: Commit
git add internal/server/http/host_update.go internal/server/http/host_update_test.go internal/server/http/server.go internal/store/jobs.go
git commit -m "http: POST /api/hosts/{id}/update — dispatch agent update"

Task 11: Hello-handler integration + timeout watcher

Files:

  • Modify: internal/server/ws/handler.go

  • Create: internal/server/ws/update_watch.go

  • Create: internal/server/ws/update_watch_test.go

  • Step 1: Write the failing test

// In update_watch_test.go:
//
// 1. newUpdateWatcher; Track(jobID, hostID). Hello arrives after
//    50ms with matching version → watcher marks the job succeeded
//    (verify via mock Store.SetJobStatus call).
// 2. newUpdateWatcher; Track(...). 100ms timeout (override
//    updateTimeout for the test). No hello arrives → after 100ms,
//    watcher marks the job failed with reason "timeout" and raises an
//    alert (verify via mock Store + AlertEngine).
// 3. newUpdateWatcher; Track(...). Hello arrives but version doesn't
//    match. Watcher does nothing (timeout will catch). After timeout,
//    marked failed with reason "agent reconnected at version X,
//    expected Y".
// 4. Cancel: Track then explicitly Stop(jobID) — no further callbacks.
  • Step 2: Implement the watcher
package ws

import (
	"context"
	"fmt"
	"sync"
	"time"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
)

// updateTimeout is the default ceiling for how long the server waits
// for an agent re-hello carrying the matching version after dispatching
// a command.update. Declared as a package-level var rather than a const
// so tests in this package can shrink it.
var updateTimeout = 90 * time.Second

type updateWatch struct {
	jobID    string
	hostID   string
	deadline time.Time
}

type updateWatcher struct {
	mu       sync.Mutex
	pending  map[string]*updateWatch // hostID → watch
	store    *store.Store
	alerts   AlertRaiser              // small interface, injected
	now      func() time.Time
}

func newUpdateWatcher(st *store.Store, alerts AlertRaiser) *updateWatcher {
	return &updateWatcher{
		pending: make(map[string]*updateWatch),
		store:   st,
		alerts:  alerts,
		now:     func() time.Time { return time.Now().UTC() },
	}
}

// Track registers an in-flight update. If a hello with the matching
// version arrives before the deadline, OnHello marks the job succeeded
// and clears the entry. Otherwise the sweep driven by Run marks the
// job failed.
func (w *updateWatcher) Track(jobID, hostID string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.pending[hostID] = &updateWatch{
		jobID:    jobID,
		hostID:   hostID,
		deadline: w.now().Add(updateTimeout),
	}
}

// OnHello is called by the WS handler when an agent hellos. If a watch
// is pending for this host AND the version matches, mark succeeded and
// drop the watch. Mismatched version → leave the watch (timeout
// handles it).
func (w *updateWatcher) OnHello(ctx context.Context, hostID, agentVersion, serverVersion string) {
	w.mu.Lock()
	watch, ok := w.pending[hostID]
	if ok && agentVersion == serverVersion {
		delete(w.pending, hostID)
	}
	w.mu.Unlock()
	if !ok || agentVersion != serverVersion { return }
	// Mark job succeeded.
	_ = w.store.SetJobStatus(ctx, watch.jobID, string(api.JobSucceeded), "", w.now())
	// Audit + alert auto-resolve.
	// (audit hook reused via http layer's helper, or write directly here)
}

// Run is a goroutine started by NewHandler — sweeps for expired
// watches every 5s.
func (w *updateWatcher) Run(ctx context.Context) {
	tick := time.NewTicker(5 * time.Second)
	defer tick.Stop()
	for {
		select {
		case <-ctx.Done(): return
		case <-tick.C:
			w.sweep(ctx)
		}
	}
}

func (w *updateWatcher) sweep(ctx context.Context) {
	now := w.now()
	w.mu.Lock()
	expired := []*updateWatch{}
	for hostID, wch := range w.pending {
		if now.After(wch.deadline) {
			expired = append(expired, wch)
			delete(w.pending, hostID)
		}
	}
	w.mu.Unlock()
	for _, wch := range expired {
		// Determine reason: did the agent come back at all?
		host, _ := w.store.GetHost(ctx, wch.hostID)
		reason := "timeout: agent did not reconnect within 90s"
		if host != nil && host.AgentVersion != "" && host.AgentVersion != version.Version {
			reason = fmt.Sprintf("agent reconnected at %s, expected %s",
				host.AgentVersion, version.Version)
		}
		_ = w.store.SetJobStatus(ctx, wch.jobID, string(api.JobFailed), reason, now)
		if w.alerts != nil {
			w.alerts.RaiseUpdateFailed(ctx, wch.hostID, wch.jobID, reason, now)
		}
	}
}
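The injected `now` field is what lets the test cases above exercise the deadline without sleeping. A minimal, self-contained sketch of that idea follows; the `tracker` type, its `expired` method, and the host name are illustrative stand-ins, not the plan's actual types.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// tracker is a hypothetical, stripped-down stand-in for updateWatcher:
// it only records deadlines and reports which hosts have expired.
type tracker struct {
	mu      sync.Mutex
	pending map[string]time.Time // hostID → deadline
	timeout time.Duration
	now     func() time.Time // injected so tests control the clock
}

func newTracker(timeout time.Duration, now func() time.Time) *tracker {
	return &tracker{pending: map[string]time.Time{}, timeout: timeout, now: now}
}

func (t *tracker) Track(hostID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.pending[hostID] = t.now().Add(t.timeout)
}

// expired removes and returns every host whose deadline has passed,
// sorted so callers see a stable order.
func (t *tracker) expired() []string {
	t.mu.Lock()
	defer t.mu.Unlock()
	now := t.now()
	var out []string
	for id, deadline := range t.pending {
		if now.After(deadline) {
			out = append(out, id)
			delete(t.pending, id)
		}
	}
	sort.Strings(out)
	return out
}

// demo advances a fake clock across the 90s deadline and returns what
// expired just before and just after it.
func demo() (before, after []string) {
	clock := time.Date(2026, 5, 6, 12, 0, 0, 0, time.UTC)
	tr := newTracker(90*time.Second, func() time.Time { return clock })
	tr.Track("host-a")
	clock = clock.Add(89 * time.Second)
	before = tr.expired()
	clock = clock.Add(2 * time.Second)
	after = tr.expired()
	return before, after
}

func main() {
	before, after := demo()
	fmt.Println(len(before), after)
}
```

Because the clock is a closure over a local variable, a test can step time forward by assignment and call the sweep directly rather than waiting for the ticker.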
  • Step 3: Hook into the WS handler

In handler.go, where onAgentHello is defined (search for the place it upserts agent_version), at the end of the handler — after the upsert succeeds — call:

deps.UpdateWatcher.OnHello(ctx, hostID, hello.AgentVersion, version.Version)

The UpdateWatcher *updateWatcher field needs to exist on the handler Deps struct. Wire it up in cmd/server/main.go.

The AlertRaiser interface (defined alongside the watcher) is implemented by *alert.Engine once Task 13 adds the RaiseUpdateFailed method. For now, define the interface and make the engine satisfy it.
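One way the narrow interface and a test double could look is sketched below. The method signature mirrors `RaiseUpdateFailed` as Task 13 declares it; `recordingAlerts` and `failJob` are hypothetical names introduced only for this illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// AlertRaiser is the narrow view of the alert engine the watcher needs.
type AlertRaiser interface {
	RaiseUpdateFailed(ctx context.Context, hostID, jobID, reason string, when time.Time)
}

// recordingAlerts is a hypothetical test double that remembers each call.
type recordingAlerts struct {
	calls []string
}

func (r *recordingAlerts) RaiseUpdateFailed(_ context.Context, hostID, jobID, reason string, _ time.Time) {
	r.calls = append(r.calls, hostID+"/"+jobID+": "+reason)
}

// Compile-time check that the fake satisfies the interface.
var _ AlertRaiser = (*recordingAlerts)(nil)

// failJob stands in for the watcher's timeout branch: it reports the
// failure through whatever AlertRaiser it was given, if any.
func failJob(ctx context.Context, alerts AlertRaiser, hostID, jobID, reason string) {
	if alerts != nil {
		alerts.RaiseUpdateFailed(ctx, hostID, jobID, reason, time.Now().UTC())
	}
}

// demo raises one failure through the fake and one through a nil raiser.
func demo() []string {
	rec := &recordingAlerts{}
	failJob(context.Background(), rec, "host-a", "job-1", "timeout")
	failJob(context.Background(), nil, "host-b", "job-2", "timeout") // nil raiser is a no-op
	return rec.calls
}

func main() {
	fmt.Println(demo())
}
```

Keeping the interface to a single method means the ws package does not need to import the alert package, and the watcher tests can assert on recorded calls.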

  • Step 4: Run tests
go test ./internal/server/ws/ -v

Expected: PASS.

  • Step 5: Commit
git add internal/server/ws/update_watch.go internal/server/ws/update_watch_test.go internal/server/ws/handler.go cmd/server/main.go
git commit -m "ws: update watcher marks update jobs succeeded on hello, failed on timeout"

Phase 6 — Windows updater path

Task 12: Windows helper-script implementation

Files:

  • Modify: internal/agent/updater/updater_windows.go

  • Step 1: Replace the stub from Task 8

//go:build windows

package updater

import (
	"context"
	"fmt"
	"log/slog"
	"os"
	"os/exec"
	"path/filepath"
	"syscall"
	"time"
)

const helperScript = `@echo off
rem The helper runs detached with stdin on NUL, and timeout.exe exits
rem immediately when stdin is not a console, so delays use ping instead.
ping -n 4 127.0.0.1 >nul
copy /Y "%s" "%s"
sc stop restic-manager-agent
:wait
sc query restic-manager-agent | find "STOPPED" >nul
if errorlevel 1 (ping -n 2 127.0.0.1 >nul & goto wait)
move /Y "%s" "%s"
sc start restic-manager-agent
del "%%~f0"
`

func Update(ctx context.Context, serverURL string) error {
	binPath, err := resolveOwnBinary()
	if err != nil {
		return err
	}
	stage, err := fetch(ctx, serverURL, binPath)
	if err != nil {
		return err
	}
	helperPath := filepath.Join(filepath.Dir(binPath), "agent-update.cmd")
	oldPath := binPath + ".old"
	body := fmt.Sprintf(helperScript, binPath, oldPath, stage, binPath)
	if err := os.WriteFile(helperPath, []byte(body), 0o755); err != nil {
		return err
	}
	cmd := exec.Command("cmd.exe", "/c", helperPath)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		HideWindow:    true,
		CreationFlags: 0x00000008 | 0x08000000, // DETACHED_PROCESS | CREATE_NO_WINDOW
	}
	if err := cmd.Start(); err != nil {
		return err
	}
	slog.Info("agent self-update: helper spawned, exiting cleanly", "binary", binPath)
	time.Sleep(200 * time.Millisecond)
	os.Exit(0)
	return nil
}

func swap(_, _ string) error { return nil } // not used on Windows
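The helper script goes through `fmt.Sprintf`, so the argument order and the `%%` escape both matter. The sketch below renders a shortened stand-in for the template to show what lands on disk; `helperTemplate`, `renderHelper`, and the example paths are assumptions made for illustration.

```go
package main

import (
	"fmt"
)

// helperTemplate is a shortened stand-in for the plan's helperScript;
// only the lines that carry format verbs are kept.
const helperTemplate = `copy /Y "%s" "%s"
move /Y "%s" "%s"
del "%%~f0"
`

// renderHelper fills the template in the same argument order the plan
// uses: backup source, backup destination, staged file, live binary.
func renderHelper(binPath, stagePath string) string {
	return fmt.Sprintf(helperTemplate, binPath, binPath+".old", stagePath, binPath)
}

func main() {
	// Hypothetical install location, used only to show the rendering.
	fmt.Print(renderHelper(`C:\agent\agent.exe`, `C:\agent\agent.exe.new`))
}
```

`%%` collapses to a single `%`, so the last line becomes `del "%~f0"`, which cmd.exe expands to the script's own path. Note that an install path containing a literal `%` would be reinterpreted by cmd.exe when the script runs, so this approach assumes paths without percent signs.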
  • Step 2: Verify cross-compile
GOOS=windows GOARCH=amd64 go build ./...

Expected: PASS.

  • Step 3: Commit
git add internal/agent/updater/updater_windows.go
git commit -m "agent: Windows updater — detached helper script"

Phase 7 — Alert kinds + auto-resolve

Task 13: Add update_failed and fleet_update_halted alert kinds

Files:

  • Create: internal/alert/update_alerts.go

  • Modify: internal/alert/rules.go (auto-resolve hook on host hello)

  • Step 1: Implement

package alert

import (
	"context"
	"time"
)

// RaiseUpdateFailed is called by the WS update-watcher when an agent
// fails to come back at the target version after a command.update
// dispatch. Auto-resolves when the host next hellos with the right
// version.
func (e *Engine) RaiseUpdateFailed(ctx context.Context, hostID, jobID, reason string, when time.Time) {
	dedup := "update_failed:" + hostID
	msg := "agent self-update failed: " + reason
	e.raiseAndNotify(ctx, hostID, "update_failed", dedup, "warning", msg, when)
}

// ResolveUpdateFailed is called from the WS hello handler when the
// host comes back at the expected version.
func (e *Engine) ResolveUpdateFailed(ctx context.Context, hostID string, when time.Time) {
	e.resolveAndNotify(ctx, hostID, "update_failed", "update_failed:"+hostID, when)
}

// RaiseFleetUpdateHalted is called by the fleet-update worker when it
// halts on a per-host failure. No host id (global alert).
func (e *Engine) RaiseFleetUpdateHalted(ctx context.Context, fleetUpdateID, reason string, when time.Time) {
	dedup := "fleet_update_halted:" + fleetUpdateID
	msg := "fleet update halted: " + reason
	e.raiseAndNotify(ctx, "", "fleet_update_halted", dedup, "warning", msg, when)
}
  • Step 2: Wire auto-resolve into the WS hello handler (in Task 11's update watcher: when a successful match is recorded, also call ResolveUpdateFailed).
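The dedup keys above (`update_failed:<hostID>`) suggest that repeated failures on one host collapse into a single open alert that a later matching hello resolves. How `raiseAndNotify` and `resolveAndNotify` actually treat the key is not shown in this plan, so the in-memory `alertBook` below is only an assumed model of that behaviour.

```go
package main

import "fmt"

// alertBook is a hypothetical in-memory stand-in for the alert store,
// keyed by dedup string so repeated raises collapse into one open alert.
type alertBook struct {
	open map[string]string // dedup key → latest message
}

func newAlertBook() *alertBook { return &alertBook{open: map[string]string{}} }

// raise opens an alert, or refreshes the message if one is already
// open. It reports whether a new alert was created.
func (b *alertBook) raise(dedup, msg string) bool {
	_, exists := b.open[dedup]
	b.open[dedup] = msg
	return !exists
}

// resolve closes the alert if it is open and reports whether it did.
func (b *alertBook) resolve(dedup string) bool {
	_, exists := b.open[dedup]
	delete(b.open, dedup)
	return exists
}

// updateFailedKey builds the same key shape the plan uses.
func updateFailedKey(hostID string) string { return "update_failed:" + hostID }

// demo raises twice for host-a, once for host-b, then resolves host-a.
func demo() (created []bool, remaining int) {
	b := newAlertBook()
	created = append(created, b.raise(updateFailedKey("host-a"), "timeout"))
	created = append(created, b.raise(updateFailedKey("host-a"), "timeout again")) // same key
	created = append(created, b.raise(updateFailedKey("host-b"), "timeout"))
	b.resolve(updateFailedKey("host-a")) // host-a came back at the target version
	return created, len(b.open)
}

func main() {
	created, remaining := demo()
	fmt.Println(created, remaining)
}
```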

  • Step 3: Commit

git add internal/alert/update_alerts.go internal/alert/rules.go internal/server/ws/update_watch.go
git commit -m "alert: update_failed + fleet_update_halted kinds"

Phase 8 — Fleet-update worker

Task 14: internal/server/fleetupdate worker

Files:

  • Create: internal/server/fleetupdate/worker.go

  • Create: internal/server/fleetupdate/worker_test.go

  • Step 1: Sketch the API

package fleetupdate

import (
	"context"
	"errors"
	"sync"
	"time"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
)

// ErrAlreadyRunning is returned by Start while another run holds the lock.
var ErrAlreadyRunning = errors.New("fleetupdate: a fleet update is already running")

// Worker owns the at-most-one rolling fleet-update goroutine.
type Worker struct {
	mu         sync.Mutex // held for the lifetime of a run; ensures one at a time
	store      *store.Store
	hub        Hub        // small interface: IsOnline, SendTo
	dispatcher Dispatcher // small interface: DispatchUpdate(ctx, hostID, fleetUpdateID) (jobID string, err error)
	watcher    Watcher    // small interface: WaitForVersion(ctx, hostID, version, timeout) bool
	alerts     AlertRaiser
}

// Start kicks off a new fleet update. Validates that no other run
// is in progress. Returns the new fleet_update id on success.
// ctx should be a long-lived server context rather than an HTTP
// request context, because run keeps using it after Start returns.
func (w *Worker) Start(ctx context.Context, userID, targetVersion string, hostIDs []string) (string, error) {
	if !w.mu.TryLock() {
		return "", ErrAlreadyRunning
	}
	// Build the fleet_updates row + N pending fleet_update_hosts rows
	// in position order and capture the new id as fuID. On any error
	// here, call w.mu.Unlock() before returning, since run never starts.
	// Then spawn a goroutine that runs the loop.
	go w.run(ctx, fuID)
	return fuID, nil
}

// run is the rolling loop. For each pending host: pre-check, dispatch,
// wait for hello-with-target-version, mark succeeded/failed, halt on
// first failure.
func (w *Worker) run(ctx context.Context, fuID string) {
	defer w.mu.Unlock()
	// ... see spec §7.2 pseudocode
}
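The lock hand-off in `Start` and `run` (claim with `TryLock`, release from the spawned goroutine) can be tried in isolation. The sketch below assumes Go 1.18 or later for `sync.Mutex.TryLock`; the `runner` type and its `WaitGroup` are hypothetical additions that make the demo deterministic.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrAlreadyRunning = errors.New("already running")

// runner is a hypothetical stand-in for Worker with only the lock.
type runner struct {
	mu sync.Mutex
	wg sync.WaitGroup // lets the demo wait until the lock is released
}

// Start claims the mutex without blocking and hands ownership to the
// goroutine, which releases it when work returns.
func (r *runner) Start(work func()) error {
	if !r.mu.TryLock() {
		return ErrAlreadyRunning
	}
	r.wg.Add(1)
	go func() {
		defer r.wg.Done()   // runs last, after the lock is free
		defer r.mu.Unlock() // runs first
		work()
	}()
	return nil
}

// demo starts a run, tries to start a second while the first is still
// active, then starts a third after the first has finished.
func demo() (first, second, third error) {
	var r runner
	release := make(chan struct{})
	first = r.Start(func() { <-release })
	second = r.Start(func() {}) // rejected: the first run holds the lock
	close(release)
	r.wg.Wait()
	third = r.Start(func() {})
	r.wg.Wait()
	return first, second, third
}

func main() {
	fmt.Println(demo())
}
```

A Go mutex is not tied to the goroutine that locked it, so unlocking from the worker goroutine is legal; the cost is that every early-return path in `Start` after `TryLock` has to unlock by hand.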
  • Step 2: Write tests

Use mocks/fakes for Hub/Dispatcher/Watcher. Cover:

  • two-host run, both succeed → completed.

  • first host succeeds, second times out → halted, alert raised, third stays pending.

  • host goes offline mid-run → halted with reason "host went offline".

  • host already at target version when its turn comes (raced with another path) → skipped, loop continues.

  • cancel mid-run → status=cancelled, current host's job left running, no further dispatches.

  • start while another run active → returns ErrAlreadyRunning.
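As a starting point for those cases, the sketch below wires three fakes into a simplified loop and reproduces the "first succeeds, second times out, third stays pending" scenario. The interfaces here are deliberately cut down (no context, no store), and `roll` is a hypothetical stand-in for `run`, so treat it as a shape for the test rather than the worker itself.

```go
package main

import "fmt"

// Hypothetical narrow interfaces; the plan's real ones also take a
// context and a timeout.
type Hub interface{ IsOnline(hostID string) bool }
type Dispatcher interface {
	DispatchUpdate(hostID string) (jobID string, err error)
}
type Watcher interface {
	WaitForVersion(hostID, version string) bool
}

type fakeHub struct{ offline map[string]bool }

func (h fakeHub) IsOnline(id string) bool { return !h.offline[id] }

type fakeDispatcher struct{ sent []string }

func (d *fakeDispatcher) DispatchUpdate(id string) (string, error) {
	d.sent = append(d.sent, id)
	return "job-" + id, nil
}

type fakeWatcher struct{ timesOut map[string]bool }

func (w fakeWatcher) WaitForVersion(id, _ string) bool { return !w.timesOut[id] }

// roll walks hosts in order and stops at the first failure, returning
// a per-host status map and the overall outcome.
func roll(hosts []string, target string, hub Hub, d Dispatcher, w Watcher) (map[string]string, string) {
	status := make(map[string]string, len(hosts))
	for _, h := range hosts {
		status[h] = "pending"
	}
	for _, h := range hosts {
		if !hub.IsOnline(h) {
			status[h] = "failed"
			return status, "halted"
		}
		if _, err := d.DispatchUpdate(h); err != nil {
			status[h] = "failed"
			return status, "halted"
		}
		if !w.WaitForVersion(h, target) {
			status[h] = "failed"
			return status, "halted"
		}
		status[h] = "succeeded"
	}
	return status, "completed"
}

// demo runs three hosts where the second never reaches the target.
func demo() (map[string]string, string, []string) {
	d := &fakeDispatcher{}
	status, outcome := roll(
		[]string{"a", "b", "c"}, "v2",
		fakeHub{}, d, fakeWatcher{timesOut: map[string]bool{"b": true}},
	)
	return status, outcome, d.sent
}

func main() {
	fmt.Println(demo())
}
```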

  • Step 3: Implement the run loop

func (w *Worker) run(ctx context.Context, fuID string) {
	defer w.mu.Unlock()
	pending, err := w.store.ListPendingFleetUpdateHosts(ctx, fuID)
	if err != nil { return }
	for _, p := range pending {
		// Re-check status — could have been cancelled.
		fu, _ := w.store.ActiveFleetUpdate(ctx)
		if fu == nil || fu.Status != "running" || fu.ID != fuID { return }

		_ = w.store.SetFleetUpdateCurrentHost(ctx, fuID, p.HostID)

		host, _ := w.store.GetHost(ctx, p.HostID)
		if host == nil {
			_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "skipped", "host deleted", "")
			continue
		}
		// Already at target?
		// (target version comes from the fleet_update row)
		if host.AgentVersion == fu.TargetVersion {
			_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "skipped", "already at target", "")
			continue
		}
		if !w.hub.IsOnline(p.HostID) {
			_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "failed", "host offline at dispatch time", "")
			_ = w.store.HaltFleetUpdate(ctx, fuID, "host went offline: "+host.Hostname, time.Now().UTC())
			w.alerts.RaiseFleetUpdateHalted(ctx, fuID, "host went offline: "+host.Hostname, time.Now().UTC())
			return
		}

		_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "running", "", "")
		jobID, err := w.dispatcher.DispatchUpdate(ctx, p.HostID, fuID)
		if err != nil {
			_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "failed", err.Error(), "")
			_ = w.store.HaltFleetUpdate(ctx, fuID, "dispatch failed on "+host.Hostname, time.Now().UTC())
			w.alerts.RaiseFleetUpdateHalted(ctx, fuID, err.Error(), time.Now().UTC())
			return
		}
		_ = w.store.SetFleetUpdateHostStatusJob(ctx, fuID, p.HostID, jobID)

		ok := w.watcher.WaitForVersion(ctx, p.HostID, fu.TargetVersion, 95*time.Second)
		if !ok {
			_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "failed", "did not reconnect at target version", jobID)
			_ = w.store.HaltFleetUpdate(ctx, fuID, "update failed on "+host.Hostname, time.Now().UTC())
			w.alerts.RaiseFleetUpdateHalted(ctx, fuID, "update failed on "+host.Hostname, time.Now().UTC())
			return
		}
		_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "succeeded", "", jobID)
	}
	_ = w.store.CompleteFleetUpdate(ctx, fuID, time.Now().UTC())
}
  • Step 4: Run tests
go test ./internal/server/fleetupdate/ -v

Expected: PASS.

  • Step 5: Commit
git add internal/server/fleetupdate/
git commit -m "fleetupdate: rolling worker with halt-on-fail"

Task 15: HTTP endpoints + page handler for fleet update

Files:

  • Create: internal/server/http/fleet_update.go

  • Create: internal/server/http/fleet_update_test.go

  • Create: web/templates/pages/fleet_update.html

  • Modify: internal/server/http/server.go (route registration)

  • Step 1: Endpoints

// POST /api/fleet/update — admin-only, body: {target_version?}.
// If target_version omitted, defaults to current server version.
// Returns {fleet_update_id}.
func (s *Server) handleFleetUpdateStart(w stdhttp.ResponseWriter, r *stdhttp.Request) { ... }

// POST /api/fleet-updates/{id}/cancel — admin-only.
func (s *Server) handleFleetUpdateCancel(w stdhttp.ResponseWriter, r *stdhttp.Request) { ... }

// GET /api/fleet-updates/{id} — admin-only, returns
// {fleet_update + per-host array} as JSON.
func (s *Server) handleFleetUpdateGet(w stdhttp.ResponseWriter, r *stdhttp.Request) { ... }

// GET /settings/fleet-update — admin-only, renders the page.
// Shows idle list (out-of-date online hosts) when no run is active,
// or the running run's progress.
func (s *Server) handleFleetUpdatePage(w stdhttp.ResponseWriter, r *stdhttp.Request) { ... }
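The optional `target_version` default on the start endpoint can be exercised in-process with `httptest`. The sketch below covers only body decoding and the default; RBAC, the worker call, and the real response shape are left out, and `serverVersion`, `startRequest`, `handleStart`, and `post` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	stdhttp "net/http"
	"net/http/httptest"
	"strings"
)

const serverVersion = "v0.0.1-example" // placeholder for version.Version

type startRequest struct {
	TargetVersion string `json:"target_version"`
}

// handleStart decodes the optional body and echoes the version a run
// would target. Creating the run itself is out of scope here.
func handleStart(w stdhttp.ResponseWriter, r *stdhttp.Request) {
	var req startRequest
	// An empty body is allowed: Decode reports io.EOF and req stays zero.
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil && !errors.Is(err, io.EOF) {
		stdhttp.Error(w, "invalid JSON body", stdhttp.StatusBadRequest)
		return
	}
	if req.TargetVersion == "" {
		req.TargetVersion = serverVersion
	}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(map[string]string{"target_version": req.TargetVersion})
}

// post runs the handler in-process and returns status code and body.
func post(body string) (int, string) {
	req := httptest.NewRequest(stdhttp.MethodPost, "/api/fleet/update", strings.NewReader(body))
	rec := httptest.NewRecorder()
	handleStart(rec, req)
	return rec.Code, strings.TrimSpace(rec.Body.String())
}

func main() {
	fmt.Println(post(""))
	fmt.Println(post(`{"target_version":"v9"}`))
	fmt.Println(post(`{`))
}
```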
  • Step 2: Tests

Unit-test the page handler (idle vs running variants) and the start endpoint (accepts target list, refuses if a run is already active, RBAC).

  • Step 3: Page template

web/templates/pages/fleet_update.html:

  • Inherit from the base layout.

  • Idle state block: header "Fleet update", paragraph explaining "rolling updates one host at a time, halts on first failure", table of out-of-date online hosts with Hostname / Current / Target / Last seen, plus a typed-confirm dialog ("Type the host count to confirm"), "Start rolling update" button.

  • Running state block: htmx auto-refresh every 3s (hx-get="/api/fleet-updates/{id}/partial" hx-trigger="every 3s [...visibility...]"), per-host progress list with status pill, link to job log when present, "Cancel" button.

  • Mirror the visual idiom of web/templates/pages/alerts.html for the auto-refresh behaviour.

  • Step 4: Run tests + smoke render

go test ./internal/server/http/ -run TestFleetUpdate -v
  • Step 5: Commit
git add internal/server/http/fleet_update.go internal/server/http/fleet_update_test.go web/templates/pages/fleet_update.html internal/server/http/server.go
git commit -m "http: fleet update endpoints + /settings/fleet-update page"

Phase 9 — UI surfacing

Task 16: Update chip on host row + host detail header

Files:

  • Create: web/templates/partials/host_update_chip.html

  • Modify: web/templates/partials/host_row.html

  • Modify: web/templates/partials/host_chrome.html

  • Modify: internal/server/http/hosts.go (add UpdateAvailable and TargetVersion fields to the row view-model)

  • Modify: internal/server/http/host_detail.go (or wherever host_chrome is populated)

  • Modify: web/styles/input.css

  • Step 1: View-model

Compute UpdateAvailable bool and TargetVersion string (= server version) wherever host data is built for templates. UpdateAvailable is false, and the chip hidden, when host.AgentVersion is empty or equals the server version.

  • Step 2: Partial
{{ define "host_update_chip" }}
{{ if .UpdateAvailable }}
<span class="update-chip" title="Agent {{ .AgentVersion }} → server {{ .TargetVersion }}">
  out of date · {{ .AgentVersion }} → {{ .TargetVersion }}
</span>
{{ end }}
{{ end }}
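To check the partial's two states without starting the server, it can be rendered directly with `html/template`. The sketch below inlines a single-line copy of the partial; `chipData` and `renderChip` are hypothetical helpers, and the real view-model will carry more fields.

```go
package main

import (
	"fmt"
	"html/template"
	"strings"
)

// chipTmpl is the partial from this step, collapsed onto one line so
// the rendered output has no surrounding whitespace.
const chipTmpl = `{{ define "host_update_chip" }}{{ if .UpdateAvailable }}<span class="update-chip" title="Agent {{ .AgentVersion }} → server {{ .TargetVersion }}">out of date · {{ .AgentVersion }} → {{ .TargetVersion }}</span>{{ end }}{{ end }}`

// chipData is a hypothetical view-model carrying only the chip's fields.
type chipData struct {
	UpdateAvailable bool
	AgentVersion    string
	TargetVersion   string
}

// renderChip executes the partial and returns the HTML it produced.
func renderChip(d chipData) (string, error) {
	t, err := template.New("chips").Parse(chipTmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.ExecuteTemplate(&sb, "host_update_chip", d); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	behind, _ := renderChip(chipData{UpdateAvailable: true, AgentVersion: "v1", TargetVersion: "v2"})
	current, _ := renderChip(chipData{AgentVersion: "v2", TargetVersion: "v2"})
	fmt.Printf("%q\n%q\n", behind, current)
}
```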
  • Step 3: CSS

web/styles/input.css:

.update-chip {
  @apply inline-flex items-center gap-1 px-2 py-0.5 rounded text-xs;
  @apply bg-amber-50 text-amber-900 border border-amber-200;
}
  • Step 4: Render Tailwind + commit
make build
git add web/templates web/styles/input.css web/static/css/styles.css internal/server/http/hosts.go internal/server/http/host_detail.go
git commit -m "ui: update chip on host row + detail header"

Task 17: Per-host Update agent button on /hosts/{id}

Files:

  • Modify: web/templates/pages/host_detail.html

  • Modify: internal/server/http/host_detail.go (populate Host.Online and Host.UpdateInProgress)

  • Step 1: Right-rail button block

Look at the existing right-rail in host_detail.html (e.g. the Restore button block from P3). Add (admin-only):

{{ if and .CanAdmin .Host.UpdateAvailable }}
<form hx-post="/hosts/{{ .Host.ID }}/update" hx-swap="none">
  <button class="btn btn-amber w-full"
          {{ if not .Host.Online }}disabled title="Agent must be online"
          {{ else if .Host.UpdateInProgress }}disabled title="Update already in progress"{{ end }}>
    Update agent
  </button>
</form>
{{ end }}

The view-model needs Host.Online and Host.UpdateInProgress populated.

  • Step 2: Commit
git add web/templates/pages/host_detail.html internal/server/http/host_detail.go
git commit -m "ui: per-host Update agent button"

Task 18: Dashboard "N hosts behind" tile + ?updates=behind filter

Files:

  • Modify: internal/server/http/dashboard_filter.go (or wherever the dashboard handler lives — search for the ?status= filter from NS-04)

  • Modify: web/templates/pages/dashboard.html

  • Step 1: Extend filter parsing

Add Updates string (values: "" or "behind") to the dashboard filter struct. When behind, filter to hosts where agent_version != "" && agent_version != server.Version.
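A minimal sketch of the parsing and filtering follows, assuming the filter is built from `url.Values` and applied to an in-memory slice. The `host` and `dashboardFilter` types here are illustrative, and the real handler may well filter in SQL instead.

```go
package main

import (
	"fmt"
	"net/url"
)

// host is a hypothetical row carrying only the fields the filter needs.
type host struct{ Name, AgentVersion string }

type dashboardFilter struct{ Updates string } // "" or "behind"

// parseFilter reads ?updates= and ignores any value other than "behind".
func parseFilter(q url.Values) dashboardFilter {
	if q.Get("updates") == "behind" {
		return dashboardFilter{Updates: "behind"}
	}
	return dashboardFilter{}
}

// apply keeps hosts that have reported a version different from the
// server's. Hosts with no reported version are never counted as behind.
func (f dashboardFilter) apply(hosts []host, serverVersion string) []host {
	if f.Updates != "behind" {
		return hosts
	}
	var out []host
	for _, h := range hosts {
		if h.AgentVersion != "" && h.AgentVersion != serverVersion {
			out = append(out, h)
		}
	}
	return out
}

// demo filters a fixed three-host fleet against server version v2.
func demo(rawQuery string) []string {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return nil
	}
	hosts := []host{{"old", "v1"}, {"current", "v2"}, {"unknown", ""}}
	var names []string
	for _, h := range parseFilter(q).apply(hosts, "v2") {
		names = append(names, h.Name)
	}
	return names
}

func main() {
	fmt.Println(demo("updates=behind"))
	fmt.Println(demo(""))
}
```

The comparison is plain inequality, so an agent that is newer than the server would also be listed as behind; that matches the rule stated in Step 1.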

  • Step 2: Hero tile

In dashboard.html, alongside existing tiles (online/offline/snapshot count), add — only when N > 0:

{{ if gt .UpdatesBehind 0 }}
<a href="?updates=behind" class="hero-tile hero-tile--amber">
  <span class="hero-num">{{ .UpdatesBehind }}</span>
  <span class="hero-label">{{ if eq .UpdatesBehind 1 }}host{{ else }}hosts{{ end }} behind</span>
</a>
{{ end }}
  • Step 3: Tests

Extend dashboard_filter_test.go to cover the updates=behind path.

  • Step 4: Commit
git add internal/server/http/dashboard*.go web/templates/pages/dashboard.html
git commit -m "ui: dashboard hosts-behind tile + filter"

Phase 10 — Smoke validation

Task 19: Restage + smoke validate

  • Step 1: Build at version A
make build VERSION=v0.0.1-smoke-A
# restage block from CLAUDE.md
  • Step 2: Onboard uptime as a fresh host

Use the dashboard's Add-host flow against ssh uptime. Confirm the host shows agent_version=v0.0.1-smoke-A.

  • Step 3: Bump server to version B
make build VERSION=v0.0.1-smoke-B
# restart server only (not the agent)

Verify: dashboard shows uptime with the "out of date · v0.0.1-smoke-A → v0.0.1-smoke-B" chip and the "1 host behind" tile.

  • Step 4: Stage agent at version B
cp bin/restic-manager-agent $HOME/smoke/data/agent-binaries/restic-manager-agent-linux-amd64
  • Step 5: Click Update agent

On /hosts/{uptime-id}. Watch the live job log. Expect: agent fetches, swaps, exits, systemd restarts it, hellos at version B, job marked succeeded, chip and tile clear.

Verify on uptime:

ssh uptime "ls -la /usr/local/bin/restic-manager-agent*"

Expect both restic-manager-agent (B) and restic-manager-agent.old (A) present.

  • Step 6: Test rollback path
# Replace the bundled binary with the OLD one, so the server claims B but serves A.
# This needs the version-A build, which the Step 3 rebuild will have replaced in
# bin/; keep a copy before rebuilding (assumed here as bin/restic-manager-agent.A).
cp bin/restic-manager-agent.A $HOME/smoke/data/agent-binaries/restic-manager-agent-linux-amd64

Click Update: the agent fetches A, swaps to A, and restarts at A. The server should mark the job failed after 90s with a reason like "agent reconnected at v0.0.1-smoke-A, expected v0.0.1-smoke-B", and raise an update_failed alert.

  • Step 7: Fleet update path

If only one host is available, this validates the worker on N=1. Spin up a second sibling agent (docker-based or another VM) to validate N=2 + halt-on-fail (replace <DataDir>/agent-binaries/... with /bin/false-equivalent during one host's turn).

  • Step 8: Capture screenshots

Save Playwright screenshots of: out-of-date host row, fleet-update idle page, fleet-update running progress, fleet-update halted state. Drop into _diag/p6-update-sweep/.

  • Step 9: Commit + update tasks.md

Mark P6-01 and P6-02 done in tasks.md with an as-shipped block summarising what landed (mirror the style used for P5-03/P5-07).

git add tasks.md _diag/p6-update-sweep/
git commit -m "tasks: mark P6-01 + P6-02 done with as-shipped block"

Self-review

Run through the spec sections:

  • §3 wire protocol → Task 4, Task 5 (jobs.kind), Task 9, Task 10.
  • §4 agent execution → Task 8 (Linux), Task 9 (dispatch wiring), Task 12 (Windows).
  • §5 server build version → Task 1, Task 2, Task 3.
  • §6 server endpoints → Task 10 (host update), Task 11 (hello integration + watcher).
  • §7 fleet update → Task 6 (schema), Task 7 (store), Task 14 (worker), Task 15 (HTTP+UI).
  • §7.3 UI surfaces → Task 16 (chip), Task 17 (button), Task 18 (dashboard tile).
  • §7.4 alert engine → Task 13.
  • §8 RBAC → enforced in Task 10 + Task 15 by reusing existing requireAdmin middleware.
  • §9 testing → Task tests + Task 19 smoke.

No placeholders. All types referenced consistently across tasks. Done.