# P6-01 + P6-02 — Agent self-update + fleet update

Status: design approved 2026-05-06.
Scope: P6-01 (agent self-update mechanism) and P6-02 (dashboard
version reporting + fleet update UI). One spec, one branch — the
two tasks are tightly coupled (P6-02 is the operator surface for
the mechanism P6-01 ships).

## 1. Background

P5-03 pivoted release distribution to a single multi-arch server
Docker image, with cross-compiled agent binaries baked under
`/opt/restic-manager/dist/agent-binaries/` and served via
`GET /agent/binary?os=…&arch=…`. The plumbing already does
dual-path lookup: `<DataDir>/agent-binaries/<name>` overrides the
image-baked copy, so an operator can hot-patch a pre-release agent
without rebuilding the image.

That makes the server the natural distribution point for agent
upgrades. "Update agent" collapses to "re-fetch from your own
server" — no apt repo, no Chocolatey, no third-party signing infra,
and version pinning is automatic because the server only ever
serves the agent that matches its own release.

This spec wires up the update mechanism end-to-end and the
operator surface that drives it.

## 2. Decisions

| # | Decision | Rationale |
|---|----------|-----------|
| 1 | Operator-driven only — no auto-update | Matches the rest of the app's job-dispatch model; avoids "bad release upgrades every host instantly"; auto-update can be added later as a setting flip if asked |
| 2 | Linux: just exit, let systemd restart. Windows: detached helper script. | Linux supports rename-while-open; Windows holds an exclusive lock on the running .exe |
| 3 | M1 (keep `agent.old` on disk) + M2 (rolling fleet update with halt-on-fail). Skip M3 (auto-rollback watchdog). | M1 is ~5 lines, M2 falls naturally out of P6-02's UI, M3 is a lot of plumbing for "shipped a binary that doesn't start" |
| 4 | Skip sha256 digest verification for v1 | TLS already covers the corruption-in-transit threat; image-tampering is image-build's problem, not the agent's |
| 5 | Exact string version match for "out of date" | With server-bundled binaries there's exactly one canonical version per server image — anything else is out of date by definition |
| 6 | WS envelope only, no `restic-manager-agent update` CLI subcommand | YAGNI; no concrete consumer; the underlying logic is reusable when one appears |

## 3. Wire protocol

### 3.1 Server → agent: `command.update`

```
{
  "type": "command.update",
  "id": "<envelope id>",
  "payload": {
    "job_id": "<ulid>"
  }
}
```

No `os` / `arch` / `version` in the payload — the agent already
knows its own build target and fetches from its configured server
URL via the existing `/agent/binary` handler. Including a target
version would also tempt the agent into version-comparison logic;
keep that on the server side.
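
A minimal sketch of how the envelope above might be modelled in Go. The type and function names (`Envelope`, `UpdatePayload`, `NewUpdateCommand`) are illustrative assumptions, not the existing WS package's API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// UpdatePayload mirrors the command.update payload: only the job id.
type UpdatePayload struct {
	JobID string `json:"job_id"`
}

// Envelope is a hypothetical Go shape for the WS envelope.
type Envelope struct {
	Type    string        `json:"type"`
	ID      string        `json:"id"`
	Payload UpdatePayload `json:"payload"`
}

// NewUpdateCommand serialises a command.update envelope. It deliberately
// takes no os/arch/version arguments, matching the payload above.
func NewUpdateCommand(envelopeID, jobID string) ([]byte, error) {
	return json.Marshal(Envelope{
		Type:    "command.update",
		ID:      envelopeID,
		Payload: UpdatePayload{JobID: jobID},
	})
}

func main() {
	raw, err := NewUpdateCommand("env-1", "job-1")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))
}
```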

### 3.2 Job lifecycle (server-driven)

The agent has limited ability to report on its own restart, so the
job state machine lives on the server:

- **queued → running** when the envelope is dispatched.
- **running → succeeded** when the agent re-hellos with
  `agent_version == server.Version` after dispatch and within
  the timeout. Audit `host.update_succeeded`.
- **running → failed (timeout)** if 90 seconds pass without a
  hello carrying the matching version. Audit `host.update_failed`.
  Raise alert kind `update_failed` (reuses P3-05 alert engine).
  This single transition covers both the "agent never came back
  at all" case and the "agent came back at the wrong version"
  case — see §6.2 for why we don't transition immediately on a
  mismatched hello.

Migration 0021 widens the `jobs.kind` CHECK constraint to include
`update`, following the same column-level pattern as migration
0012 (which added `restore` and `diff`).
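
The transitions above can be sketched as three pure functions. This is an illustration under assumed names (`JobStatus`, `OnDispatch`, `OnHello`, `OnTick`); the real job code may be shaped differently. Note that a mismatched hello leaves the job `running`, so only the timeout produces `failed`.

```go
package main

import (
	"fmt"
	"time"
)

// updateTimeout is the 90-second window from the spec.
const updateTimeout = 90 * time.Second

type JobStatus string

const (
	StatusQueued    JobStatus = "queued"
	StatusRunning   JobStatus = "running"
	StatusSucceeded JobStatus = "succeeded"
	StatusFailed    JobStatus = "failed"
)

// OnDispatch: queued -> running when the envelope is sent.
func OnDispatch(s JobStatus) JobStatus {
	if s == StatusQueued {
		return StatusRunning
	}
	return s
}

// OnHello: running -> succeeded only on an exact version match.
// A mismatched hello is a no-op; the timeout path handles it.
func OnHello(s JobStatus, agentVersion, serverVersion string) JobStatus {
	if s == StatusRunning && agentVersion == serverVersion {
		return StatusSucceeded
	}
	return s
}

// OnTick: running -> failed once the timeout has elapsed since dispatch.
func OnTick(s JobStatus, sinceDispatch time.Duration) JobStatus {
	if s == StatusRunning && sinceDispatch >= updateTimeout {
		return StatusFailed
	}
	return s
}

func main() {
	s := OnDispatch(StatusQueued)
	s = OnHello(s, "v1.2.2", "v1.2.3") // wrong version: still running
	fmt.Println(s)
	s = OnTick(s, updateTimeout) // window elapsed: failed
	fmt.Println(s)
}
```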

## 4. Agent-side execution

Lives in `internal/agent/updater`, build-tag split:

- `updater_unix.go` — Linux + any future POSIX target.
- `updater_windows.go` — Windows-only, uses the helper-script
  pattern.
- `updater.go` — shared `Update(ctx, serverURL string) error`
  interface and the HTTP fetch/streaming code (no platform deps).
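
A sketch of what the shared fetch/streaming half could look like, assuming hypothetical helpers `binaryURL` and `fetchBinary`; `runDemo` exercises it against an in-process test server. The query parameter order differs from the prose example because `url.Values.Encode` sorts keys, which the handler should not care about.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
	"os"
	"path/filepath"
	"runtime"
	"strings"
)

// binaryURL builds the /agent/binary URL for a build target.
func binaryURL(serverURL, goos, goarch string) string {
	q := url.Values{}
	q.Set("os", goos)
	q.Set("arch", goarch)
	return strings.TrimRight(serverURL, "/") + "/agent/binary?" + q.Encode()
}

// fetchBinary streams the binary for this build target into dst, fsyncs
// it, and removes the partial file on any failure.
func fetchBinary(ctx context.Context, client *http.Client, serverURL, dst string) (err error) {
	u := binaryURL(serverURL, runtime.GOOS, runtime.GOARCH)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("fetch %s: unexpected status %s", u, resp.Status)
	}
	f, err := os.OpenFile(dst, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o755)
	if err != nil {
		return err
	}
	defer func() {
		if err != nil {
			f.Close()
			os.Remove(dst)
		}
	}()
	if _, err = io.Copy(f, resp.Body); err != nil {
		return err
	}
	if err = f.Sync(); err != nil {
		return err
	}
	return f.Close()
}

// runDemo serves fake bytes from a test server and fetches them.
func runDemo() error {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/agent/binary" || r.URL.Query().Get("arch") != runtime.GOARCH {
			http.NotFound(w, r)
			return
		}
		io.WriteString(w, "new-agent-bytes")
	}))
	defer srv.Close()
	dir, err := os.MkdirTemp("", "updater-fetch")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	dst := filepath.Join(dir, "agent.new")
	if err := fetchBinary(context.Background(), srv.Client(), srv.URL, dst); err != nil {
		return err
	}
	got, err := os.ReadFile(dst)
	if err != nil {
		return err
	}
	if string(got) != "new-agent-bytes" {
		return fmt.Errorf("unexpected content %q", got)
	}
	return nil
}

func main() {
	if err := runDemo(); err != nil {
		panic(err)
	}
	fmt.Println("fetched")
}
```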

### 4.1 Linux flow

1. Receive `command.update` from the WS dispatcher.
2. Resolve own binary via `os.Executable()` and `filepath.Abs`.
   Refuse if the resolved path is `/proc/self/exe` or otherwise
   not a real file (defence in depth — shouldn't happen under
   systemd, but bail loudly if it does).
3. `GET <server>/agent/binary?os=linux&arch=<runtime.GOARCH>`,
   stream to `<binary>.new` in the same directory as the running
   binary (same filesystem ⇒ atomic rename).
4. fsync the file, then `os.Chmod(<binary>.new, 0755)`.
5. Copy current binary to `<binary>.old` (overwrite if it
   exists). M1 — one-revision rollback target.
6. `os.Rename(<binary>.new, <binary>)`.
7. Close the WS connection cleanly (sends close frame so the
   server transitions the connection to `disconnected` rather
   than waiting for the heartbeat-miss sweep).
8. `os.Exit(0)`. Systemd's `Restart=always` (already in the unit)
   brings up the new binary within seconds.
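
Steps 4 to 6 can be sketched as below, assuming the download has already landed in `<binary>.new`. `swapBinary`, `copyFile`, and `runDemo` are hypothetical names; the demo runs against a temp directory rather than a real agent binary.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// copyFile copies src to dst (overwriting) and fsyncs the result.
func copyFile(src, dst string, mode os.FileMode) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.OpenFile(dst, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, mode)
	if err != nil {
		return err
	}
	if _, err := io.Copy(out, in); err != nil {
		out.Close()
		return err
	}
	if err := out.Sync(); err != nil {
		out.Close()
		return err
	}
	return out.Close()
}

// swapBinary makes the staged file executable, keeps the current binary
// as <binary>.old (M1), then renames <binary>.new over <binary>.
func swapBinary(binary string) error {
	staged := binary + ".new"
	if err := os.Chmod(staged, 0o755); err != nil {
		return err
	}
	if err := copyFile(binary, binary+".old", 0o755); err != nil {
		return err
	}
	return os.Rename(staged, binary)
}

// runDemo stages fake "old" and "new" binaries and checks the post-state.
func runDemo() error {
	dir, err := os.MkdirTemp("", "updater-swap")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	bin := filepath.Join(dir, "agent")
	if err := os.WriteFile(bin, []byte("old"), 0o755); err != nil {
		return err
	}
	if err := os.WriteFile(bin+".new", []byte("new"), 0o644); err != nil {
		return err
	}
	if err := swapBinary(bin); err != nil {
		return err
	}
	cur, _ := os.ReadFile(bin)
	old, _ := os.ReadFile(bin + ".old")
	if string(cur) != "new" || string(old) != "old" {
		return fmt.Errorf("bad post-state: cur=%q old=%q", cur, old)
	}
	if _, err := os.Stat(bin + ".new"); !os.IsNotExist(err) {
		return fmt.Errorf("leftover .new file")
	}
	return nil
}

func main() {
	if err := runDemo(); err != nil {
		panic(err)
	}
	fmt.Println("swapped")
}
```

The copy to `.old` happens before the rename, so a failure at any point leaves the running binary untouched.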

### 4.2 Windows flow

The .exe is exclusively locked by the OS while running, so steps
5–6 above can't happen in-process. Use a detached helper:

1. Steps 1–4 the same — fetch into `<binary>.exe.new`, fsync.
2. Write `update.cmd` to a tmp path with the orchestration:
   ```
   timeout /t 3 /nobreak >nul
   copy /Y "<binary>.exe" "<binary>.exe.old"
   sc stop restic-manager-agent
   :wait
   sc query restic-manager-agent | find "STOPPED" >nul
   if errorlevel 1 (timeout /t 1 /nobreak >nul & goto wait)
   move /Y "<binary>.exe.new" "<binary>.exe"
   sc start restic-manager-agent
   del "%~f0"
   ```
3. `CreateProcess` it detached (`DETACHED_PROCESS`, no inherited
   handles). `CREATE_NO_WINDOW` is ignored when combined with
   `DETACHED_PROCESS`, so the single flag is enough.
4. Close WS, `os.Exit(0)`. Restart semantics differ between
   systemd and SCM: unlike `Restart=always`, SCM treats a clean
   exit after `sc stop` as intentional and does not auto-restart;
   only crashes restart. Stopping the service is the helper's job,
   which is why the script ends with an explicit `sc start`.
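
One way the agent might render `update.cmd` from Go, shown as a sketch. `helperScriptLines` and `renderHelperScript` are hypothetical helpers, and the install path in `main` is only an example. The script body reproduces step 2 unchanged.

```go
package main

import (
	"fmt"
	"strings"
)

const serviceName = "restic-manager-agent"

// helperScriptLines returns the update.cmd lines for the given agent
// executable path, mirroring the script in step 2.
func helperScriptLines(exe string) []string {
	return []string{
		`timeout /t 3 /nobreak >nul`,
		fmt.Sprintf(`copy /Y "%s" "%s.old"`, exe, exe),
		`sc stop ` + serviceName,
		`:wait`,
		`sc query ` + serviceName + ` | find "STOPPED" >nul`,
		`if errorlevel 1 (timeout /t 1 /nobreak >nul & goto wait)`,
		fmt.Sprintf(`move /Y "%s.new" "%s"`, exe, exe),
		`sc start ` + serviceName,
		`del "%~f0"`,
	}
}

// renderHelperScript joins the lines with CRLF, the conventional line
// ending for .cmd files.
func renderHelperScript(exe string) string {
	return strings.Join(helperScriptLines(exe), "\r\n") + "\r\n"
}

func main() {
	// Example path only; the real install location is not specified here.
	fmt.Print(renderHelperScript(`C:\restic-manager\agent.exe`))
}
```

Building the script from a slice rather than `fmt.Sprintf` over one template avoids having to escape the literal `%~f0`.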

### 4.3 Service-user assumption

Both service accounts (`User=root` on Linux per the existing unit,
`LocalSystem` by default on Windows) can write the binary path
directly. If the agent ever moves to a non-root service user, the
updater breaks and would need either a setuid helper or an
out-of-process update service. Add a `// NOTE:` comment in the
updater package flagging this; it is not a v1 blocker.

## 5. Server build version

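New package `internal/version` exposing two package-level string
variables (they must be `var`, not `const`, because `-ldflags -X`
can only set string variables):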

```
package version

var (
    Version = "dev"
    Commit  = ""
)
```

Wired via `-ldflags` in the Makefile:

```
GO_LDFLAGS = -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=$(VERSION) \
             -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Commit=$(COMMIT)

VERSION := $(shell git describe --tags --always --dirty)
COMMIT  := $(shell git rev-parse --short HEAD)
```

Both `cmd/server` and `cmd/agent` link the same package, so an
agent's `agent_version` (sent in the hello payload, already wired
since P1-11) is comparable byte-for-byte to the server's
`version.Version`.

`make build` already does what's needed for source builds. The
Phase 2 work in this spec is the Docker release path — confirm
during plan execution that `.gitea/workflows/release.yml` passes
`VERSION` and `COMMIT` into the Docker `--build-arg` chain so the
in-image binaries embed the same string the image is tagged with.
If not, add the wiring.

Dirty/dev builds (`v1.2.3-dirty`) won't match clean server builds,
so every dev environment will show every host as out-of-date. This
is acceptable: the chip is noise in dev, and real deployments
always run tagged builds.

A new `GET /api/version` endpoint returns
`{"version": "...", "commit": "..."}`. Used by the dashboard
header tile and by `/settings/fleet-update`. Public-band — exposes
no secrets, lets the install scripts surface it too.
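
A sketch of the handler, with `Version` and `Commit` declared locally as stand-ins for the `internal/version` variables. `versionHandler` and `versionBody` are hypothetical names; router and middleware wiring are omitted.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Stand-ins for internal/version.Version and internal/version.Commit.
var (
	Version = "dev"
	Commit  = ""
)

type versionResponse struct {
	Version string `json:"version"`
	Commit  string `json:"commit"`
}

// versionHandler serves GET /api/version as JSON.
func versionHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodGet {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(versionResponse{Version: Version, Commit: Commit})
}

// versionBody runs the handler in-process and returns the response body.
func versionBody() string {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/api/version", nil)
	versionHandler(rec, req)
	return rec.Body.String()
}

func main() {
	fmt.Print(versionBody())
}
```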

## 6. P6-01 server endpoints

### 6.1 `POST /api/hosts/{id}/update`

Admin-only. Refuses (with structured error code) when:

- Host is offline (`host_offline`).
- Host's `agent_version == server.Version` (`already_up_to_date`).
- An update job for this host is already running (`update_in_progress`).

Happy path: creates `jobs` row with `kind=update`, dispatches
`command.update` envelope, audit-logs `host.update_dispatched`,
returns `{"job_id": "..."}`.

UI form-post variant on `/hosts/{id}/update` returns
`HX-Redirect` to the live job log.
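
The refusal checks can be expressed as one small function. `Host` and `updateRefusal` are illustrative names; the three error codes are the ones listed above, checked in the same order.

```go
package main

import "fmt"

// Host carries only the fields the precondition check needs.
type Host struct {
	Online       bool
	AgentVersion string
}

// updateRefusal returns the structured error code for a refused update,
// or "" when the update may be dispatched.
func updateRefusal(h Host, serverVersion string, updateRunning bool) string {
	switch {
	case !h.Online:
		return "host_offline"
	case h.AgentVersion == serverVersion:
		return "already_up_to_date"
	case updateRunning:
		return "update_in_progress"
	}
	return ""
}

func main() {
	h := Host{Online: true, AgentVersion: "v1.2.2"}
	fmt.Printf("refusal=%q\n", updateRefusal(h, "v1.2.3", false))
}
```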

### 6.2 Hello handler integration

The existing `onAgentHello` (P1-11) already upserts
`agent_version`. Extend it: after the upsert, look for any
`update` job for this host with `status='running'`. If one
exists:

- `agent_version == server.Version` → mark job `succeeded`,
  audit `host.update_succeeded`.
- `agent_version != server.Version` → leave the job running so
  the timeout path catches it as a rollback failure (don't fail
  immediately — gives the agent one chance to come back, restart,
  hello again with the right version).

Adds a small in-memory map of pending updates so the timeout
goroutine knows when to give up. Persisted state lives in the
`jobs` table; the in-memory map is just for the timer.
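
A minimal sketch of the in-memory pending map, keyed by host id with one `time.Timer` per entry. `pendingUpdates`, `Track`, and `OnHello` are assumed names. The `onTimeout` callback is where the real code would mark the job `failed` and raise the alert.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type pendingUpdates struct {
	mu     sync.Mutex
	timers map[string]*time.Timer // host id -> timeout timer
}

func newPendingUpdates() *pendingUpdates {
	return &pendingUpdates{timers: make(map[string]*time.Timer)}
}

// Track arms a timer for hostID; onTimeout runs only if no matching
// hello has removed the entry first.
func (p *pendingUpdates) Track(hostID string, timeout time.Duration, onTimeout func()) {
	p.mu.Lock()
	defer p.mu.Unlock()
	var t *time.Timer
	t = time.AfterFunc(timeout, func() {
		p.mu.Lock()
		if p.timers[hostID] != t { // a hello won the race
			p.mu.Unlock()
			return
		}
		delete(p.timers, hostID)
		p.mu.Unlock()
		onTimeout()
	})
	p.timers[hostID] = t
}

// OnHello reports whether this hello completes a pending update. A
// mismatched version leaves the entry in place for the timer.
func (p *pendingUpdates) OnHello(hostID, agentVersion, serverVersion string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	t, ok := p.timers[hostID]
	if !ok || agentVersion != serverVersion {
		return false
	}
	t.Stop()
	delete(p.timers, hostID)
	return true
}

// runDemo shows one matching hello and one mismatch that times out.
func runDemo() (matched, timedOut bool) {
	p := newPendingUpdates()
	const server = "v1.2.3"

	p.Track("h1", time.Second, func() {})
	matched = p.OnHello("h1", server, server)

	fired := make(chan struct{})
	p.Track("h2", 20*time.Millisecond, func() { close(fired) })
	if p.OnHello("h2", "v1.2.2", server) {
		return matched, false
	}
	select {
	case <-fired:
		timedOut = true
	case <-time.After(2 * time.Second):
	}
	return matched, timedOut
}

func main() {
	matched, timedOut := runDemo()
	fmt.Println(matched, timedOut)
}
```

Because the map is not persisted, a server restart loses the timers; the `jobs` table remains the source of truth for which updates are still `running`.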

## 7. P6-02 fleet update

### 7.1 Schema

Migration 0022, additive only (two new tables, no changes to
existing ones):

```
CREATE TABLE fleet_updates (
  id              TEXT PRIMARY KEY,
  started_at      TEXT NOT NULL,
  started_by_user_id TEXT NOT NULL REFERENCES users(id),
  target_version  TEXT NOT NULL,
  status          TEXT NOT NULL CHECK (status IN ('running','completed','halted','cancelled')),
  current_host_id TEXT REFERENCES hosts(id),
  halted_reason   TEXT,
  completed_at    TEXT
);

CREATE TABLE fleet_update_hosts (
  fleet_update_id TEXT NOT NULL REFERENCES fleet_updates(id) ON DELETE CASCADE,
  host_id         TEXT NOT NULL REFERENCES hosts(id) ON DELETE CASCADE,
  status          TEXT NOT NULL CHECK (status IN ('pending','running','succeeded','failed','skipped')),
  job_id          TEXT REFERENCES jobs(id),
  failed_reason   TEXT,
  PRIMARY KEY (fleet_update_id, host_id)
);
```

### 7.2 Worker loop

A single in-process goroutine — at most one fleet update may run
at a time (enforced via a `sync.Mutex` + a precondition check on
`POST /api/fleet/update`).

```
for each pending fleet_update_hosts row in dispatch order:
    if fleet_updates.status == 'cancelled':
        # Cancelled while the previous host was updating: stop here.
        return
    set fleet_updates.current_host_id = row.host_id
    if host.agent_version == server.Version:
        # Already updated since we built the list: skip.
        set fleet_update_hosts.status = 'skipped'; continue
    if !host.online:
        # Offline since we built the list: halt, row stays 'pending'.
        halt(reason="host went offline")
        return
    set fleet_update_hosts.status = 'running'
    dispatch_update_for_host(host)  # reuses 6.1 logic
    wait_up_to_90s_for_hello_with_matching_version()
    if matched:
        set fleet_update_hosts.status = 'succeeded'; continue
    else:
        set fleet_update_hosts.status = 'failed', failed_reason = "..."
        halt(reason="update failed on host X")
        return
set fleet_updates.status = 'completed', completed_at = now
```

Halt: set `fleet_updates.status = 'halted'`, raise an alert kind
`fleet_update_halted`, audit `fleet.update_halted` with the host
id and reason. Subsequent hosts stay `pending` so the operator can
see what was queued and decide whether to resume (resume = start a
new fleet update with the still-out-of-date subset).

Cancel: admin-only `POST /api/fleet-updates/{id}/cancel`. Sets
`status='cancelled'`. The currently-dispatched host's update job
keeps running (the agent is already mid-restart) — cancel only
prevents the *next* host from being picked. Audit
`fleet.update_cancelled`.
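
The worker loop, including halt and cancel, might look like the sketch below. All types and the `Deps` hooks are hypothetical stand-ins for the DB and dispatcher. Two assumptions: a row is marked `running` only just before dispatch, and cancellation is checked before each host is picked.

```go
package main

import "fmt"

type HostStatus string

const (
	Pending   HostStatus = "pending"
	Running   HostStatus = "running"
	Succeeded HostStatus = "succeeded"
	Failed    HostStatus = "failed"
	Skipped   HostStatus = "skipped"
)

type Host struct {
	ID           string
	Online       bool
	AgentVersion string
}

type Row struct {
	HostID       string
	Status       HostStatus
	FailedReason string
}

type FleetUpdate struct {
	Status        string // running, completed, halted, cancelled
	CurrentHostID string
	HaltedReason  string
	Rows          []*Row
}

// Deps are hypothetical hooks standing in for the DB and dispatcher.
type Deps struct {
	ServerVersion string
	LookupHost    func(id string) Host
	// UpdateAndWait dispatches command.update and reports whether the
	// host re-helloed at ServerVersion within the timeout.
	UpdateAndWait func(h Host) bool
	Cancelled     func() bool
}

func halt(fu *FleetUpdate, reason string) {
	fu.Status = "halted"
	fu.HaltedReason = reason
}

func runFleetUpdate(fu *FleetUpdate, d Deps) {
	for _, row := range fu.Rows {
		if row.Status != Pending {
			continue
		}
		if d.Cancelled() {
			return // the cancel endpoint has already set the status
		}
		fu.CurrentHostID = row.HostID
		h := d.LookupHost(row.HostID)
		if h.AgentVersion == d.ServerVersion {
			row.Status = Skipped
			continue
		}
		if !h.Online {
			halt(fu, "host went offline")
			return
		}
		row.Status = Running
		if d.UpdateAndWait(h) {
			row.Status = Succeeded
			continue
		}
		row.Status = Failed
		row.FailedReason = "no hello at " + d.ServerVersion + " within timeout"
		halt(fu, "update failed on host "+h.ID)
		return
	}
	fu.Status = "completed"
}

func main() {
	hosts := map[string]Host{"a": {ID: "a", Online: true, AgentVersion: "v1"}}
	fu := &FleetUpdate{Status: "running", Rows: []*Row{{HostID: "a", Status: Pending}}}
	runFleetUpdate(fu, Deps{
		ServerVersion: "v2",
		LookupHost:    func(id string) Host { return hosts[id] },
		UpdateAndWait: func(Host) bool { return true },
		Cancelled:     func() bool { return false },
	})
	fmt.Println(fu.Status, fu.Rows[0].Status)
}
```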

### 7.3 UI surfaces

**Per-host chip (host_row partial + host detail chrome):**

`out of date · v1.2.2 → v1.2.3` — amber-accented, mirrors `.tag`
token shape. Only rendered when:

```
host.agent_version != "" && host.agent_version != server.Version
```

Empty `agent_version` (host enrolled but never connected) renders
nothing rather than "out of date" — we don't know what version
they have.
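
The render condition and label as a small helper, for illustration. `outOfDateChip` is a hypothetical name; in practice the same check may live in the template.

```go
package main

import "fmt"

// outOfDateChip returns the chip label and whether it should render.
// An empty agentVersion (never connected) renders nothing.
func outOfDateChip(agentVersion, serverVersion string) (string, bool) {
	if agentVersion == "" || agentVersion == serverVersion {
		return "", false
	}
	return fmt.Sprintf("out of date · %s → %s", agentVersion, serverVersion), true
}

func main() {
	if label, ok := outOfDateChip("v1.2.2", "v1.2.3"); ok {
		fmt.Println(label)
	}
}
```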

**Dashboard summary tile:**

The hero strip already has tiles. Add an "Updates" tile:
`N hosts behind` linking to `/?updates=behind` (extends NS-04's
filter machinery — adds an `updates` query param alongside
`status`/`repo_status`/`tag`). Hidden when N == 0.

**Per-host Update button on `/hosts/{id}`:**

Right-rail, admin-only. Disabled with hover tooltip when host
offline / already up to date / update in progress. POSTs to
`/hosts/{id}/update`, `HX-Redirect` to the live job log.

**Fleet update page `/settings/fleet-update`:**

Admin-only. Two states:

- **Idle**: lists out-of-date online hosts (table: hostname,
  current version, target version, last seen). Big "Start rolling
  update" button behind a typed-confirm dialog (operator types
  the host count, e.g. `12`, to enable the button — same shape as
  the host-delete confirm).
- **Running/halted/completed**: shows the currently-active
  fleet_update row + per-host progress list. Polls every 3s (htmx
  trigger conditional on `document.visibilityState === 'visible'`,
  same pattern as the alerts page). Renders:
  ```
  Updated 3/12 · currently updating <hostname>
  Halted on <hostname>: <reason> · job log →
  ```

Audit actions: `fleet.update_started`, `fleet.update_completed`,
`fleet.update_halted`, `fleet.update_cancelled`.

### 7.4 Alert engine integration

P3-05's alert engine already supports kind-based registration. Add
two new kinds:

- `update_failed` — per-host, raised on individual update failure.
  Auto-resolves when the host re-hellos with the matching version.
- `fleet_update_halted` — global, raised on fleet halt. Auto-resolves
  when a subsequent fleet update completes successfully.

## 8. RBAC

| Endpoint | Role |
|----------|------|
| `POST /api/hosts/{id}/update` | admin |
| `POST /api/fleet/update` | admin |
| `POST /api/fleet-updates/{id}/cancel` | admin |
| `GET /api/fleet-updates/{id}` | admin (status polling) |
| `GET /api/version` | public |

Operator and viewer see the "out of date" chip but no update
buttons. Mirrors the existing pattern: read affordances are
visible to all roles, write affordances are gated.

## 9. Testing

### 9.1 Unit

- `internal/agent/updater`: fake-`/agent/binary` HTTP server +
  tmp "running binary" file, assert post-state — binary swapped,
  `.old` present, no leftover `.new`. Linux path only (Windows
  helper covered by build-tag compile-only).
- `internal/server/http`: `POST /api/hosts/{id}/update` happy
  path, refuses-when-offline, refuses-when-up-to-date,
  refuses-when-update-in-progress, RBAC enforcement, audit row
  written.
- Hello handler: agent reconnects with matching version after
  `update` job dispatch → marks job `succeeded`, drops the
  in-memory pending entry. Mismatched version → no-op (timeout
  catches it).
- Timeout path: synthetic `update` job + 90s elapsed →
  marks `failed`, raises alert.
- Fleet worker: table-driven over the loop's state machine —
  success-then-success, success-then-timeout-halts,
  cancel-mid-flight, no-online-out-of-date-hosts-completes-immediately,
  host-disappears-from-list-mid-loop-skips.

### 9.2 Smoke validation (per CLAUDE.md restage block)

1. Build server + agent at version A. Restage. Enrol a host;
   confirm `agent_version=A`.
2. Bump version to B (`make build VERSION=B`), rebuild server
   only, restart server. Dashboard shows host as out-of-date with
   `A → B` chip. Updates tile reads "1 host behind".
3. Rebuild agent at B, restage `<DataDir>/agent-binaries/`. Click
   **Update agent** on host detail. Agent fetches, swaps, exits;
   systemd restarts it; hello-back at B → job `succeeded`, chip
   gone, tile clears.
4. Rollback path: leave `<DataDir>/agent-binaries/` at A, server
   at B, click Update — agent fetches A, swaps to A, restarts at
   A; hello says A != B; server marks job `failed` after 90s with
   reason "agent reconnected at version A, expected B".
5. Fleet update: spin up two smoke hosts both out-of-date, fire
   **Start rolling update**, watch progress page tick host 1 →
   host 2 → completed.
6. Halt path: replace one of the `<DataDir>/agent-binaries/`
   files with `/bin/false`. Run fleet update. First host gets
   broken binary, fails to come back up, fleet update halts at
   host 1 after 90s, alert raised, host 2 left as `pending`.

Step 6 validates M2 end-to-end — the rolling halt is the actual
safety guarantee, not a nice-to-have.

## 10. Out of scope

- sha256 digest verification (deferred — see decision 4).
- `restic-manager-agent update` CLI subcommand (deferred —
  decision 6).
- Auto-update (deferred — decision 1).
- Auto-rollback watchdog M3 (deferred — decision 3).
- Migrating the agent off `User=root` (separate hardening track).
- Cross-version protocol-compatibility checks beyond the existing
  `protocol_version` handshake (P1-11). If the new agent's
  `protocol_version` is incompatible with the server, the
  existing handshake rejects it; the update job will then
  correctly time out and be marked failed.

## 11. Migration plan

1. `internal/version` package + Makefile ldflags wiring.
2. Migration 0021 (jobs.kind widening) + 0022 (fleet_updates
   tables).
3. `internal/agent/updater` package, Linux first.
4. WS envelope wiring + `command.update` dispatcher.
5. `POST /api/hosts/{id}/update` + hello-handler integration +
   timeout goroutine.
6. UI: chip + per-host update button + dashboard tile + filter.
7. Fleet update worker + page.
8. Windows updater path.
9. Alert engine kinds.
10. Smoke validation per §9.2.

Each step is independently testable; commits should land at each
boundary so a failed Windows path (8) doesn't block the rest of
the work.