# P6-01 + P6-02 Agent Self-Update Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Operator-driven agent self-update via WS envelope, with dashboard "out of date" surfacing, per-host Update button, and a rolling fleet-update worker that halts on first failure.

**Architecture:** Agent fetches its replacement binary from `/agent/binary`, atomic-renames over the running binary (Linux) or hands off to a detached helper script (Windows), and exits cleanly so the service manager restarts it. The server tracks each update as a `jobs` row with `kind=update`; success is detected when the agent re-hellos with `agent_version == server.Version`. A fleet-update worker iterates out-of-date hosts one at a time, halting on the first failure.

**Tech Stack:** Go server + agent, SQLite migrations, WebSocket envelopes (existing `internal/api`), htmx/Tailwind UI.

**Spec:** `docs/superpowers/specs/2026-05-06-p6-01-02-agent-self-update-design.md`

---

## File Structure

### New files

- `internal/version/version.go` — `Version`, `Commit` variables, set via ldflags.
- `internal/agent/updater/updater.go` — shared HTTP fetch + atomic-write helpers.
- `internal/agent/updater/updater_unix.go` — Linux platform path (build-tag `!windows`).
- `internal/agent/updater/updater_windows.go` — Windows platform path (build-tag `windows`).
- `internal/agent/updater/updater_test.go` — Linux unit tests with fake HTTP server.
- `internal/store/migrations/0021_jobs_update_kind.sql` — widen jobs.kind CHECK.
- `internal/store/migrations/0022_fleet_updates.sql` — fleet_updates + fleet_update_hosts tables.
- `internal/store/fleet_updates.go` — store layer for the new tables.
- `internal/store/fleet_updates_test.go`
- `internal/server/http/host_update.go` — `POST /api/hosts/{id}/update` + form variant.
- `internal/server/http/host_update_test.go`
- `internal/server/http/version.go` — `GET /api/version`.
- `internal/server/http/version_test.go`
- `internal/server/http/fleet_update.go` — fleet update endpoints + page handler.
- `internal/server/http/fleet_update_test.go`
- `internal/server/fleetupdate/worker.go` — the rolling-update goroutine.
- `internal/server/fleetupdate/worker_test.go`
- `internal/alert/update_alerts.go` — alert kinds + helpers (`update_failed`, `fleet_update_halted`).
- `web/templates/pages/fleet_update.html` — both idle and running states.
- `web/templates/partials/host_update_chip.html` — reusable chip.
- `cmd/agent/update_dispatch.go` — agent-side `command.update` handler.

### Modified files

- `Makefile` — add `VERSION` / `COMMIT` ldflags.
- `internal/api/wire.go` — drop `MsgAgentUpdateAvail`, add `MsgCommandUpdate`.
- `internal/api/messages.go` — drop `AgentUpdateAvailablePayload`, add `CommandUpdatePayload`.
- `cmd/agent/main.go` — wire `MsgCommandUpdate` case in dispatcher.
- `internal/server/ws/handler.go` — extend `onAgentHello` to mark in-flight `update` jobs succeeded on version match.
- `internal/server/http/server.go` — register new routes + middleware.
- `internal/server/http/middleware.go` — already has `requireAdmin`; reuse.
- `internal/server/http/hosts.go` — render the update chip into host responses.
- `internal/server/http/dashboard.go` (or wherever the dashboard handler lives) — "N hosts behind" tile, `updates=behind` filter.
- `web/templates/partials/host_chrome.html` — embed update chip in header.
- `web/templates/partials/host_row.html` — embed update chip in dashboard row.
- `web/templates/pages/host_detail.html` — Update agent button on right-rail.
- `web/styles/input.css` — `.update-chip` token (amber).
- `cmd/server/main.go` — wire fleet-update worker into the daemon lifecycle.
- `tasks.md` — mark P6-01 and P6-02 done with the as-shipped block.

---

## Phase 1 — Build version plumbing

### Task 1: `internal/version` package

**Files:**
- Create: `internal/version/version.go`

- [ ] **Step 1: Create the package**

```go
// Package version exposes build-time identifying values. Both the
// server and agent link this package; the values are set via
// -ldflags during the build, which is why they are variables rather
// than constants. An unset Version falls back to "dev" so source
// builds without ldflags still run.
package version

var (
	// Version is the human-facing release string, e.g. "v1.2.3" or
	// "v1.2.3-dirty". Compared byte-for-byte between agent and
	// server to drive the "out of date" signal.
	Version = "dev"

	// Commit is the short git SHA. Informational only; surfaced via
	// /api/version but not used for any comparison.
	Commit = ""
)
```

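The byte-for-byte comparison mentioned in the `Version` comment can be sketched as below. This is a minimal illustration under stated assumptions, not the project's code: `outOfDate` and the local `serverVersion` variable are hypothetical stand-ins for however the server ends up consulting `version.Version`.

```go
package main

import "fmt"

// serverVersion stands in for version.Version on the server side.
// Assumption: the real value is injected with -ldflags at build time.
var serverVersion = "v1.2.3"

// outOfDate is a hypothetical helper. It performs an exact string
// comparison and does no semver parsing or ordering.
func outOfDate(agentVersion string) bool {
	return agentVersion != serverVersion
}

func main() {
	for _, v := range []string{"v1.2.3", "v1.2.2", "v1.2.3-dirty", "dev"} {
		fmt.Printf("%-14s out of date: %v\n", v, outOfDate(v))
	}
}
```

Because the comparison is exact, a `-dirty` build counts as out of date against a clean build of the same tag, and an agent that is newer than the server is flagged just like an older one.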
- [ ] **Step 2: Commit**

```
git add internal/version/version.go
git commit -m "version: add build-time version package"
```

### Task 2: Wire ldflags into the Makefile

**Files:**
- Modify: `Makefile`

- [ ] **Step 1: Read the Makefile, locate the build target, and prepend the ldflags**

Add near the top of the Makefile (after any existing variable block):

```make
VERSION ?= $(shell git describe --tags --always --dirty 2>/dev/null || echo dev)
COMMIT  ?= $(shell git rev-parse --short HEAD 2>/dev/null || echo unknown)
GO_LDFLAGS := -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=$(VERSION) \
              -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Commit=$(COMMIT)
```

Then thread `-ldflags "$(GO_LDFLAGS)"` into every `go build` invocation in the Makefile (for `cmd/server`, `cmd/agent`, and any cross-compile target).

- [ ] **Step 2: Verify**

```sh
make build
./bin/restic-manager-server -version 2>/dev/null || true   # if -version flag exists
strings ./bin/restic-manager-server | grep -E "^v[0-9]+|^dev" | head -3
```

Expected: in a tagged checkout the binary embeds a `v`-prefixed version string (with a `-dirty` suffix if the tree has uncommitted changes). Without a reachable tag, `git describe --always` yields the abbreviated commit hash instead, which this `grep` pattern will not match.

- [ ] **Step 3: Commit**

```
git add Makefile
git commit -m "build: inject version + commit via ldflags"
```

### Task 3: `GET /api/version` endpoint

**Files:**
- Create: `internal/server/http/version.go`
- Create: `internal/server/http/version_test.go`
- Modify: `internal/server/http/server.go`

- [ ] **Step 1: Write the failing test**

```go
package http

import (
	"encoding/json"
	"net/http"
	"net/http/httptest"
	"testing"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
)

func TestVersionEndpoint(t *testing.T) {
	// Save and restore the package-level values so other tests in the
	// same binary see whatever was there before.
	origVersion, origCommit := version.Version, version.Commit
	version.Version = "v1.2.3"
	version.Commit = "abc1234"
	t.Cleanup(func() {
		version.Version = origVersion
		version.Commit = origCommit
	})

	srv := newTestServer(t) // existing helper in this package
	rr := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/api/version", nil)
	srv.Router().ServeHTTP(rr, req)

	if rr.Code != http.StatusOK {
		t.Fatalf("status: got %d want 200", rr.Code)
	}
	var body struct {
		Version string `json:"version"`
		Commit  string `json:"commit"`
	}
	if err := json.NewDecoder(rr.Body).Decode(&body); err != nil {
		t.Fatalf("decode: %v", err)
	}
	if body.Version != "v1.2.3" || body.Commit != "abc1234" {
		t.Fatalf("body: %+v", body)
	}
}
```

If `newTestServer` doesn't exist by that name in this package, locate the equivalent helper (look at `enrollment_test.go` or similar tests elsewhere in the package) and adapt.

- [ ] **Step 2: Implement**

```go
// internal/server/http/version.go
package http

import (
	"encoding/json"
	stdhttp "net/http"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
)

func (s *Server) handleVersion(w stdhttp.ResponseWriter, r *stdhttp.Request) {
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(map[string]string{
		"version": version.Version,
		"commit":  version.Commit,
	})
}
```

In `server.go`, add inside the route registration block (where other public routes live, near `/agent/binary`):

```go
r.Get("/api/version", s.handleVersion)
```

- [ ] **Step 3: Run tests**

```sh
go test ./internal/server/http/ -run TestVersionEndpoint -v
```

Expected: PASS.

- [ ] **Step 4: Commit**

```
git add internal/server/http/version.go internal/server/http/version_test.go internal/server/http/server.go
git commit -m "http: expose GET /api/version"
```

---

## Phase 2 — Wire protocol changes

### Task 4: Add `MsgCommandUpdate` and `CommandUpdatePayload`, retire `MsgAgentUpdateAvail`

**Files:**
- Modify: `internal/api/wire.go`
- Modify: `internal/api/messages.go`
- Modify: `cmd/agent/main.go`

- [ ] **Step 1: Edit `wire.go`**

In the server-to-agent block (around lines 32-37), replace:
```go
MsgAgentUpdateAvail MessageType = "agent.update.available"
```
with:
```go
MsgCommandUpdate MessageType = "command.update"
```

- [ ] **Step 2: Edit `messages.go`**

Delete the `AgentUpdateAvailablePayload` struct (lines ~364-371) and its doc comment. Add immediately before `TreeListRequestPayload`:

```go
// CommandUpdatePayload carries no operational data — the agent
// already knows its own os/arch and fetches from its configured
// server URL via /agent/binary. JobID is the server-issued id of
// the update job; the agent echoes it on log.stream lines so the
// live job log captures pre-restart progress.
type CommandUpdatePayload struct {
	JobID string `json:"job_id"`
}
```

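To make the shape of the message concrete, here is a rough sketch of a `command.update` round trip. Only `CommandUpdatePayload` and the `"command.update"` type string come from this plan; the `envelope` struct and the `encodeUpdate` / `decodeUpdate` helpers are hypothetical stand-ins, since the real envelope type and `api.Marshal` live in `internal/api` and may well look different.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CommandUpdatePayload mirrors the struct added in Step 2.
type CommandUpdatePayload struct {
	JobID string `json:"job_id"`
}

// envelope is an ASSUMED wire shape used only for this illustration.
type envelope struct {
	Type    string          `json:"type"`
	ID      string          `json:"id"`
	Payload json.RawMessage `json:"payload"`
}

// encodeUpdate wraps a payload in the assumed envelope.
func encodeUpdate(msgID, jobID string) ([]byte, error) {
	p, err := json.Marshal(CommandUpdatePayload{JobID: jobID})
	if err != nil {
		return nil, err
	}
	return json.Marshal(envelope{Type: "command.update", ID: msgID, Payload: p})
}

// decodeUpdate checks the message type before unpacking the payload.
func decodeUpdate(raw []byte) (CommandUpdatePayload, error) {
	var env envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return CommandUpdatePayload{}, err
	}
	if env.Type != "command.update" {
		return CommandUpdatePayload{}, fmt.Errorf("unexpected message type %q", env.Type)
	}
	var p CommandUpdatePayload
	if err := json.Unmarshal(env.Payload, &p); err != nil {
		return CommandUpdatePayload{}, err
	}
	return p, nil
}

func main() {
	raw, err := encodeUpdate("msg-1", "job-42")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))
	p, err := decodeUpdate(raw)
	fmt.Println(p.JobID, err)
}
```

The point of the sketch is that the payload carries only the job id: the agent derives os/arch and the download URL locally.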
- [ ] **Step 3: Edit `cmd/agent/main.go`**

Replace the `case api.MsgAgentUpdateAvail` block (lines ~398-401) with:

```go
case api.MsgCommandUpdate:
	var p api.CommandUpdatePayload
	if err := env.UnmarshalPayload(&p); err != nil {
		return fmt.Errorf("command.update: %w", err)
	}
	go d.runUpdate(ctx, p, tx)
```

`runUpdate` lands in Task 9.

- [ ] **Step 4: Update `JobKind` constants in `messages.go`**

In the `JobKind` const block (line ~57), add:

```go
JobUpdate JobKind = "update"
```

- [ ] **Step 5: Verify**

```sh
go build ./...
```

Expected: build error from `cmd/agent/main.go` calling `d.runUpdate` (not yet defined). That's fine; proceed, and Task 9 plugs the gap. For now, verify only that the `internal/api` package builds:

```sh
go build ./internal/api/
```

Expected: succeeds with no output.

- [ ] **Step 6: Commit**

```
git add internal/api/wire.go internal/api/messages.go cmd/agent/main.go
git commit -m "api: replace agent.update.available with command.update + JobUpdate kind"
```

---

## Phase 3 — Database migrations

### Task 5: Migration 0021 — widen `jobs.kind` CHECK

**Files:**
- Create: `internal/store/migrations/0021_jobs_update_kind.sql`

- [ ] **Step 1: Write the migration**

Mirror the pattern in `0012_jobs_restore_diff_kind.sql` exactly: temp-backup of `job_logs`, rebuild `jobs` with the wider CHECK, restore log rows, recreate indexes. The only change is that the CHECK list now includes `'update'`:

```sql
-- 0021_jobs_update_kind.sql
--
-- Add 'update' to the jobs.kind CHECK constraint so the agent
-- self-update flow (P6-01) can persist its job rows. Same safe
-- rebuild pattern as 0012; cascade trap mitigated by job_logs
-- temp-backup.

CREATE TEMPORARY TABLE _job_logs_backup AS
  SELECT job_id, seq, ts, stream, payload FROM job_logs;

CREATE TABLE jobs_new (
  id           TEXT PRIMARY KEY,
  host_id      TEXT NOT NULL REFERENCES hosts(id) ON DELETE CASCADE,
  kind         TEXT NOT NULL CHECK (kind IN
                 ('backup','init','forget','prune','check','unlock','restore','diff','update')),
  status       TEXT NOT NULL CHECK (status IN ('queued','running','succeeded','failed','cancelled')),
  scheduled_id TEXT REFERENCES schedules(id) ON DELETE SET NULL,
  actor_kind   TEXT NOT NULL CHECK (actor_kind IN ('user','schedule','system')),
  actor_id     TEXT,
  started_at   TEXT,
  finished_at  TEXT,
  exit_code    INTEGER,
  stats        TEXT,
  error        TEXT,
  created_at   TEXT NOT NULL
);

INSERT INTO jobs_new
  SELECT id, host_id, kind, status, scheduled_id, actor_kind, actor_id,
         started_at, finished_at, exit_code, stats, error, created_at
  FROM jobs;

DROP TABLE jobs;
ALTER TABLE jobs_new RENAME TO jobs;

CREATE INDEX jobs_host_id ON jobs(host_id);
CREATE INDEX jobs_status ON jobs(status);
CREATE INDEX jobs_created_at ON jobs(created_at);

INSERT OR IGNORE INTO job_logs (job_id, seq, ts, stream, payload)
  SELECT job_id, seq, ts, stream, payload FROM _job_logs_backup;

DROP TABLE _job_logs_backup;
```

If the live `jobs` schema already has columns added by post-0012 migrations (e.g. 0015 added `source_group_id`), match them in `jobs_new` and the INSERT — check the latest schema before writing.

- [ ] **Step 2: Verify**

```sh
go test ./internal/store/ -run TestMigrations -v
```

Expected: PASS, includes 0021.

- [ ] **Step 3: Commit**

```
git add internal/store/migrations/0021_jobs_update_kind.sql
git commit -m "store: migration 0021 — add 'update' to jobs.kind"
```

### Task 6: Migration 0022 — fleet_updates tables

**Files:**
- Create: `internal/store/migrations/0022_fleet_updates.sql`

- [ ] **Step 1: Write the migration**

```sql
-- 0022_fleet_updates.sql
--
-- Tables backing the rolling fleet-update worker (P6-02). One row in
-- fleet_updates per "update all" invocation, a child row per host so
-- the worker can resume / report progress / mark per-host failures.

CREATE TABLE fleet_updates (
  id                 TEXT PRIMARY KEY,
  started_at         TEXT NOT NULL,
  started_by_user_id TEXT NOT NULL REFERENCES users(id),
  target_version     TEXT NOT NULL,
  status             TEXT NOT NULL CHECK (status IN
                       ('running','completed','halted','cancelled')),
  current_host_id    TEXT REFERENCES hosts(id),
  halted_reason      TEXT,
  completed_at       TEXT
);

CREATE INDEX fleet_updates_status ON fleet_updates(status);

CREATE TABLE fleet_update_hosts (
  fleet_update_id TEXT NOT NULL REFERENCES fleet_updates(id) ON DELETE CASCADE,
  host_id         TEXT NOT NULL REFERENCES hosts(id) ON DELETE CASCADE,
  position        INTEGER NOT NULL,
  status          TEXT NOT NULL CHECK (status IN
                    ('pending','running','succeeded','failed','skipped')),
  job_id          TEXT REFERENCES jobs(id),
  failed_reason   TEXT,
  PRIMARY KEY (fleet_update_id, host_id)
);

CREATE INDEX fleet_update_hosts_status ON fleet_update_hosts(fleet_update_id, position);
```

- [ ] **Step 2: Verify**

```sh
go test ./internal/store/ -run TestMigrations -v
```

Expected: PASS, includes 0022.

- [ ] **Step 3: Commit**

```
git add internal/store/migrations/0022_fleet_updates.sql
git commit -m "store: migration 0022 — fleet_updates + fleet_update_hosts"
```

### Task 7: `internal/store/fleet_updates.go`

**Files:**
- Create: `internal/store/fleet_updates.go`
- Create: `internal/store/fleet_updates_test.go`

- [ ] **Step 1: Write the failing tests**

```go
package store_test

// Cover: CreateFleetUpdate creates parent + N pending host rows;
// SetFleetUpdateHostStatus moves a row through pending→running→succeeded;
// HaltFleetUpdate sets status=halted + halted_reason and stamps no
// completed_at; CompleteFleetUpdate sets completed_at; ListPendingFleetUpdateHosts
// returns rows in position order; ActiveFleetUpdate returns the running
// row (or nil); GetFleetUpdate hydrates parent + children.
//
// One test per behaviour, table-driven where the API supports it.
// Mirror the structure of internal/store/maintenance_test.go.
```

Write four to six discrete test functions. Look at `internal/store/maintenance.go` + `_test.go` for the established style (constructor on `*Store`, NewStore + tmp DB).

- [ ] **Step 2: Implement**

Sketch:

```go
package store

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
	"time"
)

type FleetUpdate struct {
	ID              string
	StartedAt       time.Time
	StartedByUserID string
	TargetVersion   string
	Status          string // running | completed | halted | cancelled
	CurrentHostID   string
	HaltedReason    string
	CompletedAt     *time.Time
}

type FleetUpdateHost struct {
	FleetUpdateID string
	HostID        string
	Position      int
	Status        string // pending | running | succeeded | failed | skipped
	JobID         string
	FailedReason  string
}

func (s *Store) CreateFleetUpdate(ctx context.Context, fu FleetUpdate, hostIDs []string) error { ... }
func (s *Store) ActiveFleetUpdate(ctx context.Context) (*FleetUpdate, error) { ... }
func (s *Store) GetFleetUpdate(ctx context.Context, id string) (*FleetUpdate, []FleetUpdateHost, error) { ... }
func (s *Store) ListPendingFleetUpdateHosts(ctx context.Context, fuID string) ([]FleetUpdateHost, error) { ... }
func (s *Store) SetFleetUpdateHostStatus(ctx context.Context, fuID, hostID, status, failedReason, jobID string) error { ... }
func (s *Store) SetFleetUpdateCurrentHost(ctx context.Context, fuID, hostID string) error { ... }
func (s *Store) HaltFleetUpdate(ctx context.Context, fuID, reason string, when time.Time) error { ... }
func (s *Store) CancelFleetUpdate(ctx context.Context, fuID string) error { ... }
func (s *Store) CompleteFleetUpdate(ctx context.Context, fuID string, when time.Time) error { ... }
```

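As a rough picture of how a worker might consume these rows, the sketch below walks hosts in `position` order and halts on the first failure, using the status strings from migration 0022. It is an in-memory illustration only: `hostRow`, `roll`, and the `update` callback are hypothetical names, and the real worker would persist each transition through the store methods above rather than mutate a slice.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// hostRow is a trimmed, hypothetical stand-in for FleetUpdateHost.
type hostRow struct {
	HostID   string
	Position int
	Status   string // pending | running | succeeded | failed | skipped
}

// errNoHello is a sample failure used by the demo below.
var errNoHello = errors.New("agent did not re-hello")

// roll sorts rows by position (in place), updates pending hosts one at
// a time, and stops at the first error. It returns the fleet-level
// status and, when halted, a human-readable reason.
func roll(rows []hostRow, update func(hostID string) error) (string, string) {
	sort.Slice(rows, func(i, j int) bool { return rows[i].Position < rows[j].Position })
	for i := range rows {
		if rows[i].Status != "pending" {
			continue // already handled, e.g. when resuming
		}
		rows[i].Status = "running"
		if err := update(rows[i].HostID); err != nil {
			rows[i].Status = "failed"
			return "halted", fmt.Sprintf("host %s: %v", rows[i].HostID, err)
		}
		rows[i].Status = "succeeded"
	}
	return "completed", ""
}

func main() {
	rows := []hostRow{{"web-2", 2, "pending"}, {"web-1", 1, "pending"}, {"web-3", 3, "pending"}}
	status, reason := roll(rows, func(id string) error {
		if id == "web-2" {
			return errNoHello
		}
		return nil
	})
	fmt.Println(status, reason)
	for _, r := range rows {
		fmt.Println(r.Position, r.HostID, r.Status)
	}
}
```

Hosts after the failing one stay `pending`, which is what lets an operator see exactly where the rollout stopped.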
- [ ] **Step 3: Run tests**

```sh
go test ./internal/store/ -run TestFleetUpdate -v
```

Expected: PASS.

- [ ] **Step 4: Commit**

```
git add internal/store/fleet_updates.go internal/store/fleet_updates_test.go
git commit -m "store: fleet_updates + fleet_update_hosts CRUD"
```

---

## Phase 4 — Agent updater (Linux)

### Task 8: `internal/agent/updater` package skeleton + Linux path

**Files:**
- Create: `internal/agent/updater/updater.go`
- Create: `internal/agent/updater/updater_unix.go`
- Create: `internal/agent/updater/updater_windows.go`
- Create: `internal/agent/updater/updater_test.go`

- [ ] **Step 1: Write the failing test (Linux)**

```go
//go:build !windows

package updater

import (
	"io"
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
	"runtime"
	"testing"
)

func TestUpdate_LinuxAtomicSwap(t *testing.T) {
	// Stage 1: a fake "running binary" file + a server that serves new bytes.
	tmp := t.TempDir()
	binPath := filepath.Join(tmp, "agent")
	if err := os.WriteFile(binPath, []byte("OLD"), 0o755); err != nil {
		t.Fatal(err)
	}
	newBytes := []byte("NEW BINARY CONTENTS")

	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/agent/binary" {
			http.NotFound(w, r)
			return
		}
		if r.URL.Query().Get("os") != runtime.GOOS || r.URL.Query().Get("arch") != runtime.GOARCH {
			t.Errorf("unexpected query: %s", r.URL.RawQuery)
		}
		_, _ = io.Copy(w, &io.LimitedReader{R: bytesReader(newBytes), N: int64(len(newBytes))})
	}))
	defer srv.Close()

	if err := UpdateForTest(srv.URL, binPath); err != nil {
		t.Fatalf("update: %v", err)
	}

	got, err := os.ReadFile(binPath)
	if err != nil {
		t.Fatal(err)
	}
	if string(got) != "NEW BINARY CONTENTS" {
		t.Fatalf("binary contents: got %q", got)
	}
	old, err := os.ReadFile(binPath + ".old")
	if err != nil {
		t.Fatalf("agent.old missing: %v", err)
	}
	if string(old) != "OLD" {
		t.Fatalf("agent.old contents: got %q", old)
	}
	// .new must have been renamed away
	if _, err := os.Stat(binPath + ".new"); !os.IsNotExist(err) {
		t.Fatalf("agent.new should be absent after swap")
	}
}
```

`UpdateForTest(serverURL, binaryPath string) error` is a tiny wrapper exposed by `updater.go` that does steps 1–6 of §4.1 of the spec (everything except `os.Exit`). The exit-and-restart side effect can't be covered by a unit test.

- [ ] **Step 2: Implement `updater.go` (shared)**

```go
package updater

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
	"runtime"
	"time"
)

// fetch downloads the new binary into <binaryPath>.new, fsyncs, chmods.
// Returns the path to the staged file. On any failure after the stage
// file is created, the partial file is removed.
func fetch(ctx context.Context, serverURL, binaryPath string) (string, error) {
	url := fmt.Sprintf("%s/agent/binary?os=%s&arch=%s", serverURL, runtime.GOOS, runtime.GOARCH)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	c := &http.Client{Timeout: 5 * time.Minute}
	res, err := c.Do(req)
	if err != nil {
		return "", err
	}
	defer res.Body.Close()
	if res.StatusCode != http.StatusOK {
		return "", fmt.Errorf("agent binary fetch: %s", res.Status)
	}

	stagePath := binaryPath + ".new"
	f, err := os.OpenFile(stagePath, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
	if err != nil {
		return "", err
	}
	if _, err := io.Copy(f, res.Body); err != nil {
		f.Close()
		_ = os.Remove(stagePath)
		return "", err
	}
	if err := f.Sync(); err != nil {
		f.Close()
		_ = os.Remove(stagePath)
		return "", err
	}
	if err := f.Close(); err != nil {
		_ = os.Remove(stagePath)
		return "", err
	}
	if err := os.Chmod(stagePath, 0o755); err != nil {
		_ = os.Remove(stagePath)
		return "", err
	}
	return stagePath, nil
}

// resolveOwnBinary returns the absolute path of the running binary.
// Refuses /proc/self/exe, which os.Executable can return on some
// systems: it is a magic link rather than a real file, so the staged
// binary cannot be renamed over it.
func resolveOwnBinary() (string, error) {
	p, err := os.Executable()
	if err != nil {
		return "", err
	}
	abs, err := filepath.Abs(p)
	if err != nil {
		return "", err
	}
	if abs == "/proc/self/exe" {
		return "", fmt.Errorf("cannot resolve own binary path (/proc/self/exe — not a real file)")
	}
	return abs, nil
}

// UpdateForTest is the platform-neutral test seam.
// In production, Update (in updater_unix.go / updater_windows.go) does
// the same fetch+swap then exits the process. UpdateForTest stops short
// of the exit so unit tests can assert on file state.
func UpdateForTest(serverURL, binaryPath string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()
	stage, err := fetch(ctx, serverURL, binaryPath)
	if err != nil {
		return err
	}
	return swap(stage, binaryPath)
}
```

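One detail worth a second look in `fetch` is the URL built with `fmt.Sprintf`: a configured server URL with a trailing slash would yield `//agent/binary`. A possible alternative, shown here purely as a sketch, builds the URL with `net/url`. The helper name `binaryURL` is hypothetical and not part of the plan; whether the agent config can actually contain a trailing slash or a path prefix is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// binaryURL is a hypothetical helper that joins the server base URL
// with /agent/binary and encodes os/arch as query parameters.
func binaryURL(serverURL, goos, goarch string) (string, error) {
	u, err := url.Parse(serverURL)
	if err != nil {
		return "", err
	}
	// Tolerate both "https://host" and "https://host/prefix/".
	u.Path = strings.TrimRight(u.Path, "/") + "/agent/binary"
	q := url.Values{}
	q.Set("os", goos)
	q.Set("arch", goarch)
	u.RawQuery = q.Encode() // Encode sorts keys, so arch precedes os
	return u.String(), nil
}

func main() {
	for _, base := range []string{"https://rm.example.com", "https://rm.example.com/", "https://example.com/rm/"} {
		got, err := binaryURL(base, "linux", "amd64")
		fmt.Println(got, err)
	}
}
```

The query-parameter order differs from the `Sprintf` version, which does not matter to a handler that reads them with `r.URL.Query().Get`, as the test server in Step 1 does.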
- [ ] **Step 3: Implement `updater_unix.go`**

```go
//go:build !windows

package updater

import (
	"context"
	"fmt"
	"io"
	"log/slog"
	"os"
	"time"
)

// Update fetches the new binary, swaps it in, then exits so systemd
// restarts the process under the new binary. The exit code is 0, so
// the unit must restart on clean exits (Restart=always; a unit with
// Restart=on-failure would stay stopped). Caller should close the WS
// cleanly before invoking.
func Update(ctx context.Context, serverURL string) error {
	binPath, err := resolveOwnBinary()
	if err != nil {
		return err
	}
	stage, err := fetch(ctx, serverURL, binPath)
	if err != nil {
		return err
	}
	if err := swap(stage, binPath); err != nil {
		return err
	}
	slog.Info("agent self-update: binary swapped, exiting for systemd restart",
		"binary", binPath)
	// Give logger a moment to flush, then exit.
	time.Sleep(200 * time.Millisecond)
	os.Exit(0)
	return nil // unreachable
}

// swap copies the running binary to <bin>.old, then atomic-renames the
// staged binary into place. On non-Windows this works because the OS
// allows renaming over a file that is still open or executing: the
// running process keeps the old inode until it exits.
func swap(stagePath, binPath string) error {
	src, err := os.Open(binPath)
	if err != nil {
		return fmt.Errorf("open running binary: %w", err)
	}
	defer src.Close()
	dst, err := os.OpenFile(binPath+".old", os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
	if err != nil {
		return fmt.Errorf("open .old: %w", err)
	}
	if _, err := io.Copy(dst, src); err != nil {
		dst.Close()
		return fmt.Errorf("copy to .old: %w", err)
	}
	if err := dst.Sync(); err != nil {
		dst.Close()
		return err
	}
	if err := dst.Close(); err != nil {
		return err
	}
	if err := os.Rename(stagePath, binPath); err != nil {
		return fmt.Errorf("rename .new over running binary: %w", err)
	}
	return nil
}
```

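The reason `swap` is safe on Unix-like systems can be demonstrated without any agent code. The sketch below is illustrative only (`demoSwap` is a hypothetical name, and plain text files stand in for binaries): it opens a file, renames a replacement over it, and then reads through both the old handle and the path.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// demoSwap returns what an already-open handle sees and what the path
// sees after a replacement file is renamed over the original.
// Assumes a POSIX filesystem; Windows does not permit this rename
// while the target is open.
func demoSwap() (string, string, error) {
	dir, err := os.MkdirTemp("", "swap-demo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)

	bin := filepath.Join(dir, "agent")
	if err := os.WriteFile(bin, []byte("OLD"), 0o755); err != nil {
		return "", "", err
	}
	if err := os.WriteFile(bin+".new", []byte("NEW"), 0o755); err != nil {
		return "", "", err
	}

	// Simulate the running process holding the old file open.
	running, err := os.Open(bin)
	if err != nil {
		return "", "", err
	}
	defer running.Close()

	if err := os.Rename(bin+".new", bin); err != nil {
		return "", "", err
	}

	viaHandle, err := io.ReadAll(running)
	if err != nil {
		return "", "", err
	}
	viaPath, err := os.ReadFile(bin)
	if err != nil {
		return "", "", err
	}
	return string(viaHandle), string(viaPath), nil
}

func main() {
	h, p, err := demoSwap()
	if err != nil {
		panic(err)
	}
	fmt.Printf("open handle: %q, path: %q\n", h, p)
}
```

On Linux the open handle still yields `"OLD"` while the path yields `"NEW"`, which mirrors how the old agent process keeps executing its original image until it exits and the service manager starts the new one.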
- [ ] **Step 4: Implement `updater_windows.go` stub**

```go
//go:build windows

package updater

import (
	"context"
	"errors"
)

// Update is implemented in Task 12. Stubbed so the package builds
// on Windows until then.
func Update(ctx context.Context, serverURL string) error {
	return errors.New("agent self-update on Windows: not yet implemented")
}

// swap is stubbed for the same reason; UpdateForTest in updater.go
// references it on every platform.
func swap(stagePath, binPath string) error {
	return errors.New("agent self-update on Windows: not yet implemented")
}
```

- [ ] **Step 5: Add a `bytesReader` helper to the test file** (a thin wrapper that returns `bytes.NewReader(b)`), or call `bytes.NewReader` directly in the test handler.

- [ ] **Step 6: Run tests**

```sh
go test ./internal/agent/updater/ -v
```

Expected: PASS.

- [ ] **Step 7: Commit**

```
git add internal/agent/updater/
git commit -m "agent: updater package — Linux atomic-swap path"
```

### Task 9: Wire `command.update` into the agent dispatcher

**Files:**
- Create: `cmd/agent/update_dispatch.go`
- Verify: `cmd/agent/main.go` (already edited in Task 4)

- [ ] **Step 1: Implement the dispatcher method**

```go
package main

import (
	"context"
	"fmt"
	"log/slog"
	"time"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/agent/updater"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/agent/wsclient"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
)

// runUpdate handles a server-dispatched command.update. It logs progress
// via log.stream so the live job page captures pre-restart state, then
// calls the platform updater. On Linux the updater calls os.Exit; on
// Windows it spawns a helper and returns, with the agent then exiting.
func (d *dispatcher) runUpdate(ctx context.Context, p api.CommandUpdatePayload, tx wsclient.Sender) {
	logf := func(format string, args ...any) {
		line := fmt.Sprintf(format, args...)
		slog.Info("ws agent: update: " + line)
		env, _ := api.Marshal(api.MsgLogStream, "", api.LogStreamPayload{
			JobID:  p.JobID,
			Stream: api.LogStdout,
			Data:   line + "\n",
			At:     time.Now().UTC(),
		})
		_ = tx.Send(env)
	}

	// Job-started so the server flips queued→running.
	startedEnv, _ := api.Marshal(api.MsgJobStarted, "", api.JobStartedPayload{
		JobID:     p.JobID,
		Kind:      api.JobUpdate,
		StartedAt: time.Now().UTC(),
	})
	_ = tx.Send(startedEnv)

	logf("fetching new binary from %s", d.serverURL)
	if err := updater.Update(ctx, d.serverURL); err != nil {
		logf("update failed: %v", err)
		finishedEnv, _ := api.Marshal(api.MsgJobFinished, "", api.JobFinishedPayload{
			JobID:      p.JobID,
			Kind:       api.JobUpdate,
			Status:     api.JobFailed,
			FinishedAt: time.Now().UTC(),
			Error:      err.Error(),
		})
		_ = tx.Send(finishedEnv)
		return
	}
	// Unreachable on Linux (Update calls os.Exit). On Windows control
	// returns here and the agent exits cleanly so SCM hands off to the
	// helper script that does the actual swap-and-restart.
}
```

`d.serverURL` should already exist on the dispatcher (it's the URL the WS connection was made to). If not, plumb it from `cmd/agent/main.go`'s connection setup — the URL is in the agent config.

- [ ] **Step 2: Verify build**

```sh
go build ./...
```

Expected: PASS.

- [ ] **Step 3: Run all agent tests**

```sh
go test ./cmd/agent/... ./internal/agent/...
```

Expected: PASS, no regressions.

- [ ] **Step 4: Commit**

```
git add cmd/agent/update_dispatch.go cmd/agent/main.go
git commit -m "agent: handle command.update — fetch + swap + exit"
```

---

## Phase 5 — Server endpoint + hello integration
|
||
|
||
### Task 10: `POST /api/hosts/{id}/update`
|
||
|
||
**Files:**
|
||
- Create: `internal/server/http/host_update.go`
|
||
- Create: `internal/server/http/host_update_test.go`
|
||
- Modify: `internal/server/http/server.go` (route registration)
|
||
|
||
- [ ] **Step 1: Write tests covering**
|
||
|
||
Mirror the structure of an existing admin-band endpoint test, e.g. `repo_ops_test.go`:
|
||
|
||
- happy path: admin POST → 200 + `{job_id}` returned, `jobs` row created with `kind=update`, audit row written, WS envelope `command.update` sent to the host's connection.
|
||
- refuses when host offline → 409 / structured error code `host_offline`.
|
||
- refuses when `agent_version == server.Version` → 409 / `already_up_to_date`.
|
||
- refuses when an `update` job is already running for this host → 409 / `update_in_progress`.
|
||
- RBAC: operator → 403, viewer → 403.

- [ ] **Step 2: Implement**

```go
package http

import (
	"encoding/json"
	stdhttp "net/http"
	"time"

	"github.com/go-chi/chi/v5"
	"github.com/oklog/ulid/v2"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
)

// handleHostUpdate dispatches a command.update WS envelope after
// validating that the host is online, currently running a different
// version, and not already in the middle of an update.
func (s *Server) handleHostUpdate(w stdhttp.ResponseWriter, r *stdhttp.Request) {
	hostID := chi.URLParam(r, "id")
	host, err := s.deps.Store.GetHost(r.Context(), hostID)
	if err != nil {
		writeJSONError(w, stdhttp.StatusNotFound, "host_not_found", "")
		return
	}

	if !s.deps.Hub.IsOnline(hostID) {
		writeJSONError(w, stdhttp.StatusConflict, "host_offline",
			"agent must be online to receive an update")
		return
	}
	if host.AgentVersion == version.Version {
		writeJSONError(w, stdhttp.StatusConflict, "already_up_to_date",
			"host is already on "+version.Version)
		return
	}
	running, err := s.deps.Store.RunningUpdateJobForHost(r.Context(), hostID)
	if err != nil {
		// Fail closed: an unchecked error here could let a second
		// update through while one is still running.
		writeJSONError(w, stdhttp.StatusInternalServerError, "internal", err.Error())
		return
	}
	if running != "" {
		writeJSONError(w, stdhttp.StatusConflict, "update_in_progress",
			"an update job is already running for this host")
		return
	}

	// Build the envelope before touching the database so a marshal
	// failure cannot leave an orphaned job row behind.
	jobID := ulid.Make().String()
	env, err := api.Marshal(api.MsgCommandUpdate, ulid.Make().String(), api.CommandUpdatePayload{JobID: jobID})
	if err != nil {
		writeJSONError(w, stdhttp.StatusInternalServerError, "internal", err.Error())
		return
	}

	user := userFrom(r) // existing helper
	if err := s.deps.Store.InsertJob(r.Context(), store.Job{
		ID:        jobID,
		HostID:    hostID,
		Kind:      string(api.JobUpdate),
		Status:    string(api.JobQueued),
		ActorKind: "user",
		ActorID:   user.ID,
		// CreatedAt is set by InsertJob.
	}); err != nil {
		writeJSONError(w, stdhttp.StatusInternalServerError, "internal", err.Error())
		return
	}

	if err := s.deps.Hub.SendTo(hostID, env); err != nil {
		// The command never reached the agent; fail the job so it
		// does not linger as queued.
		_ = s.deps.Store.SetJobStatus(r.Context(), jobID, string(api.JobFailed),
			"send failed: "+err.Error(), time.Now().UTC())
		writeJSONError(w, stdhttp.StatusBadGateway, "send_failed", err.Error())
		return
	}

	s.audit(r, "host.update_dispatched", store.AuditTarget{
		Kind: "host", ID: hostID,
	}, map[string]any{"job_id": jobID, "target_version": version.Version})

	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(map[string]string{"job_id": jobID})
}

// Form-post variant for HTMX. Same gates, returns HX-Redirect.
func (s *Server) handleHostUpdateForm(w stdhttp.ResponseWriter, r *stdhttp.Request) {
	// reuse handleHostUpdate's pre-checks via a shared validator;
	// on success, set HX-Redirect to /jobs/<id> and write 200.
	// On error, render an inline banner partial.
}
```

Helpers to add:
- `Store.RunningUpdateJobForHost(ctx, hostID) (string, error)` — returns the job id of any running `kind=update` job for this host, or empty string + nil if none. One-line query.
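
The "shared validator" mentioned in `handleHostUpdateForm` could be factored out as a pure function over values the handler has already loaded. The following is a minimal sketch, assuming a hypothetical `checkUpdatable` helper (the name and signature are illustrative, not part of the plan); the error codes are the ones listed in Step 1.

```go
package main

import "fmt"

// checkUpdatable is a hypothetical helper: it applies the three gates
// from handleHostUpdate in order and returns the structured error code
// of the first gate that fails, or "" when dispatch may proceed.
func checkUpdatable(online bool, agentVersion, serverVersion, runningJobID string) string {
	switch {
	case !online:
		return "host_offline"
	case agentVersion == serverVersion:
		return "already_up_to_date"
	case runningJobID != "":
		return "update_in_progress"
	}
	return ""
}

func main() {
	fmt.Println(checkUpdatable(false, "v1", "v2", ""))     // host_offline
	fmt.Println(checkUpdatable(true, "v2", "v2", ""))      // already_up_to_date
	fmt.Println(checkUpdatable(true, "v1", "v2", "job-1")) // update_in_progress
}
```

Both handlers could call this after loading the host and map a non-empty code to a 409 (JSON variant) or an inline banner (HTMX variant), which keeps the gate order identical in the two paths.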

- [ ] **Step 3: Register routes**

In `server.go`, inside the admin-only route group:

```go
r.Post("/api/hosts/{id}/update", s.handleHostUpdate)
r.Post("/hosts/{id}/update", s.handleHostUpdateForm)
```

- [ ] **Step 4: Run tests**

```sh
go test ./internal/server/http/ -run TestHostUpdate -v
```

Expected: PASS.

- [ ] **Step 5: Commit**

```
git add internal/server/http/host_update.go internal/server/http/host_update_test.go internal/server/http/server.go internal/store/jobs.go
git commit -m "http: POST /api/hosts/{id}/update — dispatch agent update"
```

### Task 11: Hello-handler integration + timeout watcher

**Files:**
- Modify: `internal/server/ws/handler.go`
- Create: `internal/server/ws/update_watch.go`
- Create: `internal/server/ws/update_watch_test.go`

- [ ] **Step 1: Write the failing test**

```go
// In update_watch_test.go:
//
// 1. newUpdateWatcher; Track(jobID, hostID). Hello arrives after
//    50ms with matching version → watcher marks the job succeeded
//    (verify via mock Store.SetJobStatus call).
// 2. newUpdateWatcher; Track(...). 100ms timeout (override the
//    updateTimeout var for the test). No hello arrives → after 100ms,
//    sweep marks the job failed with a "timeout" reason and raises an
//    alert (verify via mock Store + AlertRaiser). Call sweep directly:
//    Run only ticks every 5s.
// 3. newUpdateWatcher; Track(...). Hello arrives but version doesn't
//    match. Watcher does nothing (timeout will catch). After timeout,
//    marked failed with reason "agent reconnected at X, expected Y".
// 4. Cancel: Track then explicitly Stop(jobID) → no further callbacks.
```

- [ ] **Step 2: Implement the watcher**

```go
package ws

import (
	"context"
	"fmt"
	"sync"
	"time"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
)

// updateTimeout is the default ceiling for how long the server waits
// for an agent re-hello carrying the matching version after dispatching
// a command.update. Declared as a var so tests can shrink it.
var updateTimeout = 90 * time.Second

// AlertRaiser is the slice of the alert engine the watcher needs.
// *alert.Engine satisfies it once Task 13 lands.
type AlertRaiser interface {
	RaiseUpdateFailed(ctx context.Context, hostID, jobID, reason string, when time.Time)
}

type updateWatch struct {
	jobID    string
	hostID   string
	deadline time.Time
}

type updateWatcher struct {
	mu      sync.Mutex
	pending map[string]*updateWatch // hostID → watch
	store   *store.Store
	alerts  AlertRaiser // small interface, injected
	now     func() time.Time
}

func newUpdateWatcher(st *store.Store, alerts AlertRaiser) *updateWatcher {
	return &updateWatcher{
		pending: make(map[string]*updateWatch),
		store:   st,
		alerts:  alerts,
		now:     func() time.Time { return time.Now().UTC() },
	}
}

// Track registers an in-flight update. If a hello with the matching
// version arrives before the deadline, OnHello marks the job succeeded
// and clears the entry. Otherwise Run's sweep marks the job failed.
func (w *updateWatcher) Track(jobID, hostID string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.pending[hostID] = &updateWatch{
		jobID:    jobID,
		hostID:   hostID,
		deadline: w.now().Add(updateTimeout),
	}
}

// Stop drops the pending watch for jobID, if any, without touching
// the job row.
func (w *updateWatcher) Stop(jobID string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	for hostID, wch := range w.pending {
		if wch.jobID == jobID {
			delete(w.pending, hostID)
		}
	}
}

// OnHello is called by the WS handler when an agent hellos. If a watch
// is pending for this host AND the version matches, mark succeeded and
// drop the watch. Mismatched version → leave the watch (timeout
// handles it).
func (w *updateWatcher) OnHello(ctx context.Context, hostID, agentVersion, serverVersion string) {
	w.mu.Lock()
	watch, ok := w.pending[hostID]
	matched := ok && agentVersion == serverVersion
	if matched {
		delete(w.pending, hostID)
	}
	w.mu.Unlock()
	if !matched {
		return
	}
	// Mark job succeeded.
	_ = w.store.SetJobStatus(ctx, watch.jobID, string(api.JobSucceeded), "", w.now())
	// Audit + alert auto-resolve.
	// (audit hook reused via http layer's helper, or write directly here)
}

// Run is a goroutine started by NewHandler — sweeps for expired
// watches every 5s.
func (w *updateWatcher) Run(ctx context.Context) {
	tick := time.NewTicker(5 * time.Second)
	defer tick.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-tick.C:
			w.sweep(ctx)
		}
	}
}

func (w *updateWatcher) sweep(ctx context.Context) {
	now := w.now()
	w.mu.Lock()
	expired := []*updateWatch{}
	for hostID, wch := range w.pending {
		if now.After(wch.deadline) {
			expired = append(expired, wch)
			delete(w.pending, hostID)
		}
	}
	w.mu.Unlock()
	for _, wch := range expired {
		// Determine reason: did the agent come back at all?
		host, _ := w.store.GetHost(ctx, wch.hostID)
		reason := fmt.Sprintf("timeout: agent did not reconnect within %s", updateTimeout)
		if host != nil && host.AgentVersion != "" && host.AgentVersion != version.Version {
			reason = fmt.Sprintf("agent reconnected at %s, expected %s",
				host.AgentVersion, version.Version)
		}
		_ = w.store.SetJobStatus(ctx, wch.jobID, string(api.JobFailed), reason, now)
		if w.alerts != nil {
			w.alerts.RaiseUpdateFailed(ctx, wch.hostID, wch.jobID, reason, now)
		}
	}
}
```

- [ ] **Step 3: Hook into the WS handler**

In `handler.go`, where `onAgentHello` is defined (search for the place it upserts `agent_version`), at the *end* of the handler — after the upsert succeeds — call:

```go
deps.UpdateWatcher.OnHello(ctx, hostID, hello.AgentVersion, version.Version)
```

The `UpdateWatcher *updateWatcher` field needs to exist on the handler `Deps` struct. Wire it up in `cmd/server/main.go`.

The dispatch side needs the matching hook: after `Hub.SendTo` succeeds in Task 10's handler, call `Track(jobID, hostID)`. Without it, `OnHello` and `sweep` never see the job and it stays queued.

`AlertRaiser` interface (defined alongside the watcher) is implemented by `*alert.Engine` after Task 13 adds the `RaiseUpdateFailed` method. For now, define the interface and make the engine satisfy it.

- [ ] **Step 4: Run tests**

```sh
go test ./internal/server/ws/ -v
```

Expected: PASS.

- [ ] **Step 5: Commit**

```
git add internal/server/ws/update_watch.go internal/server/ws/update_watch_test.go internal/server/ws/handler.go cmd/server/main.go
git commit -m "ws: update watcher — promote/fail update jobs on hello timeout"
```

---

## Phase 6 — Windows updater path

### Task 12: Windows helper-script implementation

**Files:**
- Modify: `internal/agent/updater/updater_windows.go`

- [ ] **Step 1: Replace the stub from Task 8**

```go
//go:build windows

package updater

import (
	"context"
	"fmt"
	"log/slog"
	"os"
	"os/exec"
	"path/filepath"
	"syscall"
	"time"
)

// helperScript takes four format args, in order: live binary, backup
// path, staged binary, live binary. "%%~f0" renders as "%~f0", the
// script's own path, so the helper deletes itself when done.
const helperScript = `@echo off
timeout /t 3 /nobreak >nul
copy /Y "%s" "%s"
sc stop restic-manager-agent
:wait
sc query restic-manager-agent | find "STOPPED" >nul
if errorlevel 1 (timeout /t 1 /nobreak >nul & goto wait)
move /Y "%s" "%s"
sc start restic-manager-agent
del "%%~f0"
`

func Update(ctx context.Context, serverURL string) error {
	binPath, err := resolveOwnBinary()
	if err != nil {
		return err
	}
	stage, err := fetch(ctx, serverURL, binPath)
	if err != nil {
		return err
	}
	helperPath := filepath.Join(filepath.Dir(binPath), "agent-update.cmd")
	oldPath := binPath + ".old"
	body := fmt.Sprintf(helperScript, binPath, oldPath, stage, binPath)
	if err := os.WriteFile(helperPath, []byte(body), 0o755); err != nil {
		return err
	}
	cmd := exec.Command("cmd.exe", "/c", helperPath)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		HideWindow:    true,
		CreationFlags: 0x00000008 | 0x08000000, // DETACHED_PROCESS | CREATE_NO_WINDOW
	}
	if err := cmd.Start(); err != nil {
		return err
	}
	slog.Info("agent self-update: helper spawned, exiting cleanly", "binary", binPath)
	time.Sleep(200 * time.Millisecond)
	os.Exit(0)
	return nil
}

func swap(_, _ string) error { return nil } // not used on Windows
```

- [ ] **Step 2: Verify cross-compile**

```sh
GOOS=windows GOARCH=amd64 go build ./...
```

Expected: builds with no errors.

- [ ] **Step 3: Commit**

```
git add internal/agent/updater/updater_windows.go
git commit -m "agent: Windows updater — detached helper script"
```

---

## Phase 7 — Alert kinds + auto-resolve

### Task 13: Add `update_failed` and `fleet_update_halted` alert kinds

**Files:**
- Create: `internal/alert/update_alerts.go`
- Modify: `internal/alert/rules.go` (auto-resolve hook on host hello)

- [ ] **Step 1: Implement**

```go
package alert

import (
	"context"
	"time"
)

// RaiseUpdateFailed is called by the WS update-watcher when an agent
// fails to come back at the target version after a command.update
// dispatch. Auto-resolves when the host next hellos with the right
// version.
func (e *Engine) RaiseUpdateFailed(ctx context.Context, hostID, jobID, reason string, when time.Time) {
	dedup := "update_failed:" + hostID
	msg := "agent self-update failed: " + reason
	e.raiseAndNotify(ctx, hostID, "update_failed", dedup, "warning", msg, when)
}

// ResolveUpdateFailed is called from the WS hello handler when the
// host comes back at the expected version.
func (e *Engine) ResolveUpdateFailed(ctx context.Context, hostID string, when time.Time) {
	e.resolveAndNotify(ctx, hostID, "update_failed", "update_failed:"+hostID, when)
}

// RaiseFleetUpdateHalted is called by the fleet-update worker when it
// halts on a per-host failure. No host id (global alert).
func (e *Engine) RaiseFleetUpdateHalted(ctx context.Context, fleetUpdateID, reason string, when time.Time) {
	dedup := "fleet_update_halted:" + fleetUpdateID
	msg := "fleet update halted: " + reason
	e.raiseAndNotify(ctx, "", "fleet_update_halted", dedup, "warning", msg, when)
}
```

- [ ] **Step 2: Wire auto-resolve into the WS hello handler** (in Task 11's update watcher: when a successful match is recorded, also call `ResolveUpdateFailed`). This means adding `ResolveUpdateFailed` to the watcher's `AlertRaiser` interface.

- [ ] **Step 3: Commit**

```
git add internal/alert/update_alerts.go internal/alert/rules.go internal/server/ws/update_watch.go
git commit -m "alert: update_failed + fleet_update_halted kinds"
```

---

## Phase 8 — Fleet-update worker

### Task 14: `internal/server/fleetupdate` worker

**Files:**
- Create: `internal/server/fleetupdate/worker.go`
- Create: `internal/server/fleetupdate/worker_test.go`

- [ ] **Step 1: Sketch the API**

```go
package fleetupdate

import (
	"context"
	"errors"
	"sync"
	"time"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
)

// ErrAlreadyRunning is returned by Start while another run is active.
var ErrAlreadyRunning = errors.New("fleetupdate: a run is already in progress")

// Worker owns the at-most-one rolling fleet-update goroutine.
type Worker struct {
	mu         sync.Mutex // ensures one run at a time
	store      *store.Store
	hub        Hub        // small interface — IsOnline, SendTo
	dispatcher Dispatcher // small interface — DispatchUpdate(ctx, hostID, fleetUpdateID) (jobID string, err error)
	watcher    Watcher    // small interface — WaitForVersion(ctx, hostID, version, timeout) bool
	alerts     AlertRaiser
}

// Start kicks off a new fleet update. Validates that no other run
// is in progress. Returns the new fleet_update id on success.
func (w *Worker) Start(ctx context.Context, userID, targetVersion string, hostIDs []string) (string, error) {
	if !w.mu.TryLock() {
		return "", ErrAlreadyRunning
	}
	// build the fleet_updates row + N pending fleet_update_hosts rows
	// in position order (this yields fuID), then spawn a goroutine that
	// runs the loop. On any error here, w.mu.Unlock() before returning.
	//
	// The loop must outlive the HTTP request that started it, so detach
	// it from the caller's cancellation. context.WithoutCancel needs
	// Go 1.21+; a server-lifetime context works too.
	go w.run(context.WithoutCancel(ctx), fuID)
	return fuID, nil
}

// run is the rolling loop. For each pending host: pre-check, dispatch,
// wait for hello-with-target-version, mark succeeded/failed, halt on
// first failure.
func (w *Worker) run(ctx context.Context, fuID string) {
	defer w.mu.Unlock()
	// ... see spec §7.2 pseudocode
}
```

- [ ] **Step 2: Write tests**

Use mocks/fakes for Hub/Dispatcher/Watcher. Cover:

- two-host run, both succeed → completed.
- first host succeeds, second times out → halted, alert raised, third stays pending.
- host goes offline mid-run → halted with reason "host went offline".
- host already at target version when its turn comes (raced with another path) → skipped, loop continues.
- cancel mid-run → status=cancelled, current host's job left running, no further dispatches.
- start while another run active → returns ErrAlreadyRunning.

- [ ] **Step 3: Implement the run loop**

```go
func (w *Worker) run(ctx context.Context, fuID string) {
	defer w.mu.Unlock()
	pending, err := w.store.ListPendingFleetUpdateHosts(ctx, fuID)
	if err != nil {
		return
	}
	for _, p := range pending {
		// Re-check status — could have been cancelled.
		fu, _ := w.store.ActiveFleetUpdate(ctx)
		if fu == nil || fu.Status != "running" || fu.ID != fuID {
			return
		}

		_ = w.store.SetFleetUpdateCurrentHost(ctx, fuID, p.HostID)

		host, _ := w.store.GetHost(ctx, p.HostID)
		if host == nil {
			_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "skipped", "host deleted", "")
			continue
		}
		// Already at target?
		// (target version comes from the fleet_update row)
		if host.AgentVersion == fu.TargetVersion {
			_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "skipped", "already at target", "")
			continue
		}
		if !w.hub.IsOnline(p.HostID) {
			reason := "host went offline: " + host.Hostname
			_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "failed", "host offline at dispatch time", "")
			_ = w.store.HaltFleetUpdate(ctx, fuID, reason, time.Now().UTC())
			w.alerts.RaiseFleetUpdateHalted(ctx, fuID, reason, time.Now().UTC())
			return
		}

		_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "running", "", "")
		jobID, err := w.dispatcher.DispatchUpdate(ctx, p.HostID, fuID)
		if err != nil {
			// Use one reason for both the halt row and the alert so
			// they name the same host and cause.
			reason := "dispatch failed on " + host.Hostname + ": " + err.Error()
			_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "failed", err.Error(), "")
			_ = w.store.HaltFleetUpdate(ctx, fuID, reason, time.Now().UTC())
			w.alerts.RaiseFleetUpdateHalted(ctx, fuID, reason, time.Now().UTC())
			return
		}
		_ = w.store.SetFleetUpdateHostStatusJob(ctx, fuID, p.HostID, jobID)

		// 95s: slightly longer than the watcher's 90s updateTimeout.
		ok := w.watcher.WaitForVersion(ctx, p.HostID, fu.TargetVersion, 95*time.Second)
		if !ok {
			reason := "update failed on " + host.Hostname
			_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "failed", "did not reconnect at target version", jobID)
			_ = w.store.HaltFleetUpdate(ctx, fuID, reason, time.Now().UTC())
			w.alerts.RaiseFleetUpdateHalted(ctx, fuID, reason, time.Now().UTC())
			return
		}
		_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, p.HostID, "succeeded", "", jobID)
	}
	_ = w.store.CompleteFleetUpdate(ctx, fuID, time.Now().UTC())
}
```

- [ ] **Step 4: Run tests**

```sh
go test ./internal/server/fleetupdate/ -v
```

Expected: PASS.

- [ ] **Step 5: Commit**

```
git add internal/server/fleetupdate/
git commit -m "fleetupdate: rolling worker with halt-on-fail"
```

### Task 15: HTTP endpoints + page handler for fleet update

**Files:**
- Create: `internal/server/http/fleet_update.go`
- Create: `internal/server/http/fleet_update_test.go`
- Create: `web/templates/pages/fleet_update.html`
- Modify: `internal/server/http/server.go` (route registration)

- [ ] **Step 1: Endpoints**

```go
// POST /api/fleet/update — admin-only, body: {target_version?}.
// If target_version omitted, defaults to current server version.
// Returns {fleet_update_id}.
func (s *Server) handleFleetUpdateStart(w stdhttp.ResponseWriter, r *stdhttp.Request) { ... }

// POST /api/fleet-updates/{id}/cancel — admin-only.
func (s *Server) handleFleetUpdateCancel(w stdhttp.ResponseWriter, r *stdhttp.Request) { ... }

// GET /api/fleet-updates/{id} — admin-only, returns
// {fleet_update + per-host array} as JSON.
func (s *Server) handleFleetUpdateGet(w stdhttp.ResponseWriter, r *stdhttp.Request) { ... }

// GET /settings/fleet-update — admin-only, renders the page.
// Shows idle list (out-of-date online hosts) when no run is active,
// or the running run's progress.
func (s *Server) handleFleetUpdatePage(w stdhttp.ResponseWriter, r *stdhttp.Request) { ... }
```

- [ ] **Step 2: Tests**

Unit-test the page handler (idle vs running variants) and the start endpoint (accepts target list, refuses if a run is already active, RBAC).

- [ ] **Step 3: Page template**

`web/templates/pages/fleet_update.html`:
- Inherit from the base layout.
- Idle state block: header "Fleet update", paragraph explaining "rolling updates one host at a time, halts on first failure", table of out-of-date online hosts with Hostname / Current / Target / Last seen, plus a typed-confirm dialog ("Type the host count to confirm"), "Start rolling update" button.
- Running state block: htmx auto-refresh every 3s (`hx-get="/api/fleet-updates/{id}/partial" hx-trigger="every 3s [...visibility...]"`), per-host progress list with status pill, link to job log when present, "Cancel" button.
- Mirror the visual idiom of `web/templates/pages/alerts.html` for the auto-refresh behaviour.

- [ ] **Step 4: Run tests + smoke render**

```sh
go test ./internal/server/http/ -run TestFleetUpdate -v
```

- [ ] **Step 5: Commit**

```
git add internal/server/http/fleet_update.go internal/server/http/fleet_update_test.go web/templates/pages/fleet_update.html internal/server/http/server.go
git commit -m "http: fleet update endpoints + /settings/fleet-update page"
```

---

## Phase 9 — UI surfacing

### Task 16: Update chip on host row + host detail header

**Files:**
- Create: `web/templates/partials/host_update_chip.html`
- Modify: `web/templates/partials/host_row.html`
- Modify: `web/templates/partials/host_chrome.html`
- Modify: `internal/server/http/hosts.go` (add `UpdateAvailable` and `TargetVersion` fields to the row view-model)
- Modify: `internal/server/http/host_detail.go` (or wherever `host_chrome` is populated)
- Modify: `web/styles/input.css`

- [ ] **Step 1: View-model**

Compute `UpdateAvailable bool` and `TargetVersion string` (= server version) anywhere `host` data is built for templates. Hide the chip when `host.AgentVersion == ""` or when it matches the server version.
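
The rule above fits in one expression. A minimal sketch, assuming a hypothetical `hostRow` struct and `newHostRow` constructor (the real view-model lives in `hosts.go` and will carry more fields):

```go
package main

import "fmt"

// hostRow is a hypothetical slice of the row view-model.
type hostRow struct {
	AgentVersion    string
	TargetVersion   string
	UpdateAvailable bool
}

// newHostRow fills the update fields. An empty agent version (the
// host has never helloed) hides the chip rather than claiming the
// host is behind.
func newHostRow(agentVersion, serverVersion string) hostRow {
	return hostRow{
		AgentVersion:    agentVersion,
		TargetVersion:   serverVersion,
		UpdateAvailable: agentVersion != "" && agentVersion != serverVersion,
	}
}

func main() {
	fmt.Println(newHostRow("v1.0.0", "v1.1.0").UpdateAvailable) // true
	fmt.Println(newHostRow("v1.1.0", "v1.1.0").UpdateAvailable) // false
	fmt.Println(newHostRow("", "v1.1.0").UpdateAvailable)       // false
}
```

Note that this is a plain inequality, not a semver comparison, so an agent that is newer than the server would also show the chip; that matches the success condition used elsewhere in the plan (`agent_version == server.Version`).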

- [ ] **Step 2: Partial**

```html
{{ define "host_update_chip" }}
{{ if .UpdateAvailable }}
<span class="update-chip" title="Agent {{ .AgentVersion }} → server {{ .TargetVersion }}">
  out of date · {{ .AgentVersion }} → {{ .TargetVersion }}
</span>
{{ end }}
{{ end }}
```

- [ ] **Step 3: CSS**

`web/styles/input.css`:

```css
.update-chip {
  @apply inline-flex items-center gap-1 px-2 py-0.5 rounded text-xs;
  @apply bg-amber-50 text-amber-900 border border-amber-200;
}
```

- [ ] **Step 4: Render Tailwind + commit**

```sh
make build
git add web/templates web/styles/input.css web/static/css/styles.css internal/server/http/hosts.go internal/server/http/host_detail.go
git commit -m "ui: update chip on host row + detail header"
```

### Task 17: Per-host Update agent button on `/hosts/{id}`

**Files:**
- Modify: `web/templates/pages/host_detail.html`
- Modify: `internal/server/http/host_detail.go` (populate `Host.Online` and `Host.UpdateInProgress`)

- [ ] **Step 1: Right-rail button block**

Look at the existing right-rail in `host_detail.html` (e.g. the Restore button block from P3). Add (admin-only):

```html
{{ if and .CanAdmin .Host.UpdateAvailable }}
<form hx-post="/hosts/{{ .Host.ID }}/update" hx-swap="none">
  <button class="btn btn-amber w-full"
    {{ if not .Host.Online }}disabled title="Agent must be online"
    {{ else if .Host.UpdateInProgress }}disabled title="Update already in progress"{{ end }}>
    Update agent
  </button>
</form>
{{ end }}
```

The two conditions are chained with `else if` so the button never ends up with duplicate `disabled` and `title` attributes. The view-model needs `Host.Online` and `Host.UpdateInProgress` populated.

- [ ] **Step 2: Commit**

```
git add web/templates/pages/host_detail.html internal/server/http/host_detail.go
git commit -m "ui: per-host Update agent button"
```

### Task 18: Dashboard "N hosts behind" tile + `?updates=behind` filter

**Files:**
- Modify: `internal/server/http/dashboard_filter.go` (or wherever the dashboard handler lives — search for the `?status=` filter from NS-04)
- Modify: `web/templates/pages/dashboard.html`

- [ ] **Step 1: Extend filter parsing**

Add `Updates string` (values: "" or "behind") to the dashboard filter struct. When `behind`, filter to hosts where `agent_version != "" && agent_version != server.Version`.

- [ ] **Step 2: Hero tile**

In `dashboard.html`, alongside existing tiles (online/offline/snapshot count), add — only when N > 0:

```html
{{ if gt .UpdatesBehind 0 }}
<a href="?updates=behind" class="hero-tile hero-tile--amber">
  <span class="hero-num">{{ .UpdatesBehind }}</span>
  <span class="hero-label">{{ if eq .UpdatesBehind 1 }}host{{ else }}hosts{{ end }} behind</span>
</a>
{{ end }}
```

- [ ] **Step 3: Tests**

Extend `dashboard_filter_test.go` to cover the `updates=behind` path.

- [ ] **Step 4: Commit**

```
git add internal/server/http/dashboard*.go web/templates/pages/dashboard.html
git commit -m "ui: dashboard hosts-behind tile + filter"
```

---

## Phase 10 — Smoke validation

### Task 19: Restage + smoke validate

- [ ] **Step 1: Build at version A**

```sh
make build VERSION=v0.0.1-smoke-A
cp bin/restic-manager-agent bin/restic-manager-agent.A  # keep a copy for Step 6
# restage block from CLAUDE.md
```

- [ ] **Step 2: Onboard `uptime` as a fresh host**

Use the dashboard's Add-host flow against `ssh uptime`. Confirm the host shows `agent_version=v0.0.1-smoke-A`.

- [ ] **Step 3: Bump server to version B**

```sh
make build VERSION=v0.0.1-smoke-B
# restart server only (not the agent)
```

Verify: dashboard shows `uptime` with the "out of date · v0.0.1-smoke-A → v0.0.1-smoke-B" chip and the "1 host behind" tile.

- [ ] **Step 4: Stage agent at version B**

```sh
cp bin/restic-manager-agent $HOME/smoke/data/agent-binaries/restic-manager-agent-linux-amd64
```

- [ ] **Step 5: Click Update agent**

On `/hosts/{uptime-id}`. Watch the live job log. Expect: agent fetches, swaps, exits, systemd restarts it, hellos at version B, job marked succeeded, chip and tile clear.

Verify on `uptime`:
```sh
ssh uptime "ls -la /usr/local/bin/restic-manager-agent*"
```
Expect both `restic-manager-agent` (B) and `restic-manager-agent.old` (A) present.

- [ ] **Step 6: Test rollback path**

After Step 5 the host is already at B, so the Update button is hidden and the endpoint would answer `already_up_to_date`. First put `uptime` back on A (for example, move `restic-manager-agent.old` over the live binary and restart the agent service) so the chip reappears. Then:

```sh
# Replace the bundled binary with the OLD one — server claims B but serves A
cp bin/restic-manager-agent.A $HOME/smoke/data/agent-binaries/restic-manager-agent-linux-amd64
# (.A is the copy saved in Step 1)
```

Click Update — agent fetches A, swaps to A, restarts at A. Server should mark the job `failed` after 90s with reason like "agent reconnected at v0.0.1-smoke-A, expected v0.0.1-smoke-B". Alert raised.

- [ ] **Step 7: Fleet update path**

If only one host is available, this validates the worker on N=1. Spin up a second sibling agent (docker-based or another VM) to validate N=2 + halt-on-fail (replace `<DataDir>/agent-binaries/...` with `/bin/false`-equivalent during one host's turn).

- [ ] **Step 8: Capture screenshots**

Save Playwright screenshots of: out-of-date host row, fleet-update idle page, fleet-update running progress, fleet-update halted state. Drop into `_diag/p6-update-sweep/`.

- [ ] **Step 9: Commit + update tasks.md**

Mark P6-01 and P6-02 done in `tasks.md` with an as-shipped block summarising what landed (mirror the style used for P5-03/P5-07).

```
git add tasks.md _diag/p6-update-sweep/
git commit -m "tasks: mark P6-01 + P6-02 done with as-shipped block"
```

---

## Self-review

Run through the spec sections:

- §3 wire protocol → Task 4, Task 5 (jobs.kind), Task 9, Task 10. ✅
- §4 agent execution → Task 8 (Linux), Task 9 (dispatch wiring), Task 12 (Windows). ✅
- §5 server build version → Task 1, Task 2, Task 3. ✅
- §6 server endpoints → Task 10 (host update), Task 11 (hello integration + watcher). ✅
- §7 fleet update → Task 6 (schema), Task 7 (store), Task 14 (worker), Task 15 (HTTP+UI). ✅
- §7.3 UI surfaces → Task 16 (chip), Task 17 (button), Task 18 (dashboard tile). ✅
- §7.4 alert engine → Task 13. ✅
- §8 RBAC → enforced in Task 10 + Task 15 by reusing existing `requireAdmin` middleware. ✅
- §9 testing → Task tests + Task 19 smoke. ✅

No placeholders. All types referenced consistently across tasks. Done.