P6-01 + P6-02 — Agent self-update + fleet update

Status: design approved 2026-05-06. Scope: P6-01 (agent self-update mechanism) and P6-02 (dashboard version reporting + fleet update UI). One spec, one branch — the two tasks are tightly coupled (P6-02 is the operator surface for the mechanism P6-01 ships).

1. Background

P5-03 pivoted release distribution to a single multi-arch server Docker image, with cross-compiled agent binaries baked under /opt/restic-manager/dist/agent-binaries/ and served via GET /agent/binary?os=…&arch=…. The plumbing already does dual-path lookup: <DataDir>/agent-binaries/<name> overrides the image-baked copy, so an operator can hot-patch a pre-release agent without rebuilding the image.

That makes the server the natural distribution point for agent upgrades. "Update agent" collapses to "re-fetch from your own server" — no apt repo, no Chocolatey, no third-party signing infra, and version pinning is automatic because the server only ever serves the agent that matches its own release.

This spec wires up the update mechanism end-to-end and the operator surface that drives it.

2. Decisions

| # | Decision | Rationale |
|---|----------|-----------|
| 1 | Operator-driven only, no auto-update | Matches the rest of the app's job-dispatch model; avoids "bad release upgrades every host instantly"; auto-update can be added later as a setting flip if asked |
| 2 | Linux: just exit, let systemd restart. Windows: detached helper script. | Linux supports rename-while-open; Windows holds an exclusive lock on the running .exe |
| 3 | M1 (keep agent.old on disk) + M2 (rolling fleet update with halt-on-fail). Skip M3 (auto-rollback watchdog). | M1 is ~5 lines, M2 falls naturally out of P6-02's UI, M3 is a lot of plumbing for "shipped a binary that doesn't start" |
| 4 | Skip sha256 digest verification for v1 | TLS already covers the corruption-in-transit threat; image-tampering is image-build's problem, not the agent's |
| 5 | Exact string version match for "out of date" | With server-bundled binaries there's exactly one canonical version per server image, so anything else is out of date by definition |
| 6 | WS envelope only, no restic-manager-agent update CLI subcommand | YAGNI; no concrete consumer; the underlying logic is reusable when one appears |

3. Wire protocol

3.1 Server → agent: command.update

```json
{
  "type": "command.update",
  "id": "<envelope id>",
  "payload": {
    "job_id": "<ulid>"
  }
}
```

No os / arch / version in the payload — the agent already knows its own build target and fetches from its configured server URL via the existing /agent/binary handler. Including a target version would also tempt the agent into version-comparison logic; keep that on the server side.

3.2 Job lifecycle (server-driven)

The agent has limited ability to report on its own restart, so the job state machine lives on the server:

  • queued → running when the envelope is dispatched.
  • running → succeeded when the agent re-hellos with agent_version == server.Version after dispatch and within the timeout. Audit host.update_succeeded.
  • running → failed (timeout) if 90 seconds pass without a hello carrying the matching version. Audit host.update_failed. Raise alert kind update_failed (reuses P3-05 alert engine). This single transition covers both the "agent never came back at all" case and the "agent came back at the wrong version" case — see §6.2 for why we don't transition immediately on a mismatched hello.

Migration 0021 widens the jobs.kind CHECK constraint to include update, following the same column-level pattern migration 0012 used to add restore and diff.

4. Agent-side execution

Lives in internal/agent/updater, build-tag split:

  • updater_unix.go — Linux + any future POSIX target.
  • updater_windows.go — Windows-only, uses the helper-script pattern.
  • updater.go — shared Update(ctx, serverURL string) error interface and the HTTP fetch/streaming code (no platform deps).

4.1 Linux flow

  1. Receive command.update from the WS dispatcher.
  2. Resolve own binary via os.Executable() and filepath.Abs. Refuse if the resolved path is /proc/self/exe or otherwise not a real file (defence in depth — shouldn't happen under systemd, but bail loudly if it does).
  3. GET <server>/agent/binary?os=linux&arch=<runtime.GOARCH>, stream to <binary>.new in the same directory as the running binary (same filesystem ⇒ atomic rename).
  4. fsync the file, os.Chmod(0755).
  5. Copy current binary to <binary>.old (overwrite if it exists). M1 — one-revision rollback target.
  6. os.Rename(<binary>.new, <binary>).
  7. Close the WS connection cleanly (sends close frame so the server transitions the connection to disconnected rather than waiting for the heartbeat-miss sweep).
  8. os.Exit(0). Systemd's Restart=always (already in the unit) brings up the new binary within seconds.

4.2 Windows flow

The .exe is exclusively locked by the OS while running, so steps 5 and 6 above can't happen in-process. Use a detached helper:

  1. Steps 1 to 4 are the same: fetch into <binary>.exe.new, fsync.
  2. Write update.cmd to a tmp path with the orchestration:
    timeout /t 3 /nobreak >nul
    copy /Y "<binary>.exe" "<binary>.exe.old"
    sc stop restic-manager-agent
    :wait
    sc query restic-manager-agent | find "STOPPED" >nul
    if errorlevel 1 (timeout /t 1 /nobreak >nul & goto wait)
    move /Y "<binary>.exe.new" "<binary>.exe"
    sc start restic-manager-agent
    del "%~f0"
    
  3. CreateProcess it detached (DETACHED_PROCESS | CREATE_NO_WINDOW, no parent handles).
  4. Close WS, os.Exit(0). SCM sees a clean stop and waits rather than restarting, because sc stop is the helper's job, not a crash. (Restart semantics differ between systemd and SCM: systemd's Restart=always restarts the unit after any exit, whereas SCM treats clean-exit-after-stop as intentional and auto-restarts only after crashes. That's why the helper script needs the explicit sc start at the end.)

4.3 Service-user assumption

Both Linux (User=root per the existing unit) and Windows (LocalSystem by default) can write the binary path directly. If the agent ever moves to a non-root service user, the updater breaks — would need either a setuid helper or an out-of-process update service. Add a // NOTE: comment in the updater package flagging this; not a v1 blocker.

5. Server build version

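New package internal/version exposing two string variables (variables rather than constants, because the linker's -X flag can only set string variables):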

```go
package version

var (
    Version = "dev"
    Commit  = ""
)
```

Wired via -ldflags in the Makefile:

```makefile
VERSION := $(shell git describe --tags --always --dirty)
COMMIT  := $(shell git rev-parse --short HEAD)

GO_LDFLAGS = -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=$(VERSION) \
             -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Commit=$(COMMIT)
```

Both cmd/server and cmd/agent link the same package, so an agent's agent_version (sent in the hello payload, already wired since P1-11) is comparable byte-for-byte to the server's version.Version.

make build already does what's needed for source builds. The Phase 2 work in this spec is the Docker release path — confirm during plan execution that .gitea/workflows/release.yml passes VERSION and COMMIT into the Docker --build-arg chain so the in-image binaries embed the same string the image is tagged with. If not, add the wiring.

Dirty/dev builds (v1.2.3-dirty) won't match clean server builds, so every dev environment will show every host as out-of-date. This is acceptable: the chip carries no signal in dev, and real deployments always run tagged builds.

A new GET /api/version endpoint returns {"version": "...", "commit": "..."}. Used by the dashboard header tile and by /settings/fleet-update. Public-band — exposes no secrets, lets the install scripts surface it too.

6. P6-01 server endpoints

6.1 POST /api/hosts/{id}/update

Admin-only. Refuses (with structured error code) when:

  • Host is offline (host_offline).
  • Host's agent_version == server.Version (already_up_to_date).
  • An update job for this host is already running (update_in_progress).

Happy path: creates jobs row with kind=update, dispatches command.update envelope, audit-logs host.update_dispatched, returns {"job_id": "..."}.

UI form-post variant on /hosts/{id}/update returns HX-Redirect to the live job log.

6.2 Hello handler integration

The existing onAgentHello (P1-11) already upserts agent_version. Extend it: after the upsert, look for any update job for this host with status='running'. If one exists:

  • agent_version == server.Version → mark job succeeded, audit host.update_succeeded.
  • agent_version != server.Version → leave the job running so the timeout path catches it as a rollback failure (don't fail immediately — gives the agent one chance to come back, restart, hello again with the right version).

Adds a small in-memory map of pending updates so the timeout goroutine knows when to give up. Persisted state lives in the jobs table; the in-memory map is just for the timer.

7. P6-02 fleet update

7.1 Schema

Migration 0022, additive only (two new tables, no changes to existing ones):

```sql
CREATE TABLE fleet_updates (
  id                 TEXT PRIMARY KEY,
  started_at         TEXT NOT NULL,
  started_by_user_id TEXT NOT NULL REFERENCES users(id),
  target_version     TEXT NOT NULL,
  status             TEXT NOT NULL CHECK (status IN ('running','completed','halted','cancelled')),
  current_host_id    TEXT REFERENCES hosts(id),
  halted_reason      TEXT,
  completed_at       TEXT
);

CREATE TABLE fleet_update_hosts (
  fleet_update_id TEXT NOT NULL REFERENCES fleet_updates(id) ON DELETE CASCADE,
  host_id         TEXT NOT NULL REFERENCES hosts(id) ON DELETE CASCADE,
  status          TEXT NOT NULL CHECK (status IN ('pending','running','succeeded','failed','skipped')),
  job_id          TEXT REFERENCES jobs(id),
  failed_reason   TEXT,
  PRIMARY KEY (fleet_update_id, host_id)
);
```

7.2 Worker loop

A single in-process goroutine — at most one fleet update may run at a time (enforced via a sync.Mutex + a precondition check on POST /api/fleet/update).

```text
for each pending fleet_update_hosts row in dispatch order:
    set fleet_updates.current_host_id = row.host_id
    set fleet_update_hosts.status = 'running'
    if host.agent_version == server.Version:
        # Already updated since we built the list — skip.
        set status = 'skipped'; continue
    if !host.online:
        # Offline since we built the list — halt.
        halt(reason="host went offline")
        return
    dispatch_update_for_host(host)  # reuses 6.1 logic
    wait_up_to_90s_for_hello_with_matching_version()
    if matched:
        set status = 'succeeded'; continue
    else:
        set status = 'failed', failed_reason = "..."
        halt(reason="update failed on host X")
        return
set fleet_updates.status = 'completed', completed_at = now
```

Halt: set fleet_updates.status = 'halted', raise an alert kind fleet_update_halted, audit fleet.update_halted with the host id and reason. Subsequent hosts stay pending so the operator can see what was queued and decide whether to resume (resume = start a new fleet update with the still-out-of-date subset).

Cancel: admin-only POST /api/fleet-updates/{id}/cancel. Sets status='cancelled'. The currently-dispatched host's update job keeps running (the agent is already mid-restart) — cancel only prevents the next host from being picked. Audit fleet.update_cancelled.

7.3 UI surfaces

Per-host chip (host_row partial + host detail chrome):

out of date · v1.2.2 → v1.2.3 — amber-accented, mirrors .tag token shape. Only rendered when:

host.agent_version != "" && host.agent_version != server.Version

Empty agent_version (host enrolled but never connected) renders nothing rather than "out of date" — we don't know what version they have.

Dashboard summary tile:

The hero strip already has tiles. Add an "Updates" tile: N hosts behind linking to /?updates=behind (extends NS-04's filter machinery — adds an updates query param alongside status/repo_status/tag). Hidden when N == 0.
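The chip and tile rules are simple enough to pin down in a few lines. This sketch assumes hypothetical helper names (`OutOfDate`, `ChipLabel`, `UpdatesTile`) and takes the "1 host behind" singular form from the smoke steps in §9.2; template wiring and styling are left out.

```go
package main

import "fmt"

// OutOfDate applies the render rule: a known version that differs from
// the server's. An empty version means "never connected", not "behind".
func OutOfDate(agentVersion, serverVersion string) bool {
	return agentVersion != "" && agentVersion != serverVersion
}

// ChipLabel returns the chip text, or "" when no chip should render.
func ChipLabel(agentVersion, serverVersion string) string {
	if !OutOfDate(agentVersion, serverVersion) {
		return ""
	}
	return fmt.Sprintf("out of date · %s → %s", agentVersion, serverVersion)
}

// UpdatesTile returns the tile text, or "" when the tile is hidden (N == 0).
func UpdatesTile(agentVersions []string, serverVersion string) string {
	n := 0
	for _, v := range agentVersions {
		if OutOfDate(v, serverVersion) {
			n++
		}
	}
	switch n {
	case 0:
		return ""
	case 1:
		return "1 host behind"
	}
	return fmt.Sprintf("%d hosts behind", n)
}

func main() {
	fmt.Println(ChipLabel("v1.2.2", "v1.2.3"))
	fmt.Println(UpdatesTile([]string{"v1.2.2", "", "v1.2.3", "v1.2.1"}, "v1.2.3"))
}
```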

Per-host Update button on /hosts/{id}:

Right-rail, admin-only. Disabled with hover tooltip when host offline / already up to date / update in progress. POSTs to /hosts/{id}/update, HX-Redirect to the live job log.

Fleet update page /settings/fleet-update:

Admin-only. Two states:

  • Idle: lists out-of-date online hosts (table: hostname, current version, target version, last seen). Big "Start rolling update" button behind a typed-confirm dialog (operator types the host count, e.g. 12, to enable the button — same shape as the host-delete confirm).
  • Running/halted/completed: shows the currently-active fleet_update row + per-host progress list. Polls every 3s (htmx trigger conditional on document.visibilityState === 'visible', same pattern as the alerts page). Renders:
    Updated 3/12 · currently updating <hostname>
    Halted on <hostname>: <reason> · job log →
    

Audit actions: fleet.update_started, fleet.update_completed, fleet.update_halted, fleet.update_cancelled.

7.4 Alert engine integration

P3-05's alert engine already supports kind-based registration. Add two new kinds:

  • update_failed — per-host, raised on individual update failure. Auto-resolves when the host re-hellos with the matching version.
  • fleet_update_halted — global, raised on fleet halt. Auto-resolves when a subsequent fleet update completes successfully.

8. RBAC

| Endpoint | Role |
|----------|------|
| POST /api/hosts/{id}/update | admin |
| POST /api/fleet/update | admin |
| POST /api/fleet-updates/{id}/cancel | admin |
| GET /api/fleet-updates/{id} | admin (status polling) |
| GET /api/version | public |

Operator and viewer see the "out of date" chip but no update buttons. Mirrors the existing pattern: read affordances are visible to all roles, write affordances are gated.
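The role table can be expressed as data so the RBAC tests in §9.1 iterate over it. The sketch below assumes a simple ordered role model (public < viewer < operator < admin) and a hypothetical `Allowed` helper; the project's actual middleware may be structured differently.

```go
package main

import "fmt"

type Role int

// Assumed ordering: each role includes the permissions of those before it.
const (
	Public Role = iota
	Viewer
	Operator
	Admin
)

// requiredRole mirrors the endpoint table above.
var requiredRole = map[string]Role{
	"POST /api/hosts/{id}/update":         Admin,
	"POST /api/fleet/update":              Admin,
	"POST /api/fleet-updates/{id}/cancel": Admin,
	"GET /api/fleet-updates/{id}":         Admin,
	"GET /api/version":                    Public,
}

// Allowed reports whether a caller with the given role may use the route.
// Unknown routes are denied.
func Allowed(caller Role, route string) bool {
	min, ok := requiredRole[route]
	return ok && caller >= min
}

func main() {
	fmt.Println(Allowed(Operator, "POST /api/hosts/{id}/update"))
	fmt.Println(Allowed(Public, "GET /api/version"))
}
```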

9. Testing

9.1 Unit

  • internal/agent/updater: a fake /agent/binary HTTP server plus a tmp "running binary" file; assert the post-state: binary swapped, .old present, no leftover .new. Linux path only (the Windows helper is covered by a build-tag compile-only check).
  • internal/server/http: POST /api/hosts/{id}/update happy path, refuses-when-offline, refuses-when-up-to-date, refuses-when-update-in-progress, RBAC enforcement, audit row written.
  • Hello handler: agent reconnects with matching version after update job dispatch → marks job succeeded, drops the in-memory pending entry. Mismatched version → no-op (timeout catches it).
  • Timeout path: synthetic update job + 90s elapsed → marks failed, raises alert.
  • Fleet worker: table-driven over the loop's state machine — success-then-success, success-then-timeout-halts, cancel-mid-flight, no-online-out-of-date-hosts-completes-immediately, host-disappears-from-list-mid-loop-skips.

9.2 Smoke validation (per CLAUDE.md restage block)

  1. Build server + agent at version A. Restage. Enrol a host; confirm agent_version=A.
  2. Bump version to B (make build VERSION=B), rebuild server only, restart server. Dashboard shows host as out-of-date with A → B chip. Updates tile reads "1 host behind".
  3. Rebuild agent at B, restage <DataDir>/agent-binaries/. Click Update agent on host detail. Agent fetches, swaps, exits; systemd restarts it; hello-back at B → job succeeded, chip gone, tile clears.
  4. Rollback path: leave <DataDir>/agent-binaries/ at A, server at B, click Update — agent fetches A, swaps to A, restarts at A; hello says A != B; server marks job failed after 90s with reason "agent reconnected at version A, expected B".
  5. Fleet update: spin up two smoke hosts both out-of-date, fire Start rolling update, watch progress page tick host 1 → host 2 → completed.
  6. Halt path: replace one of the <DataDir>/agent-binaries/ files with /bin/false. Run fleet update. First host gets broken binary, fails to come back up, fleet update halts at host 1 after 90s, alert raised, host 2 left as pending.

Step 6 validates M2 end-to-end — the rolling halt is the actual safety guarantee, not a nice-to-have.

10. Out of scope

  • sha256 digest verification (deferred — see decision 4).
  • restic-manager-agent update CLI subcommand (deferred — decision 6).
  • Auto-update (deferred — decision 1).
  • Auto-rollback watchdog M3 (deferred — decision 3).
  • Migrating the agent off User=root (separate hardening track).
  • Cross-version protocol-compatibility checks beyond the existing protocol_version handshake (P1-11). If the new agent's protocol_version is incompatible with the server, the existing handshake rejects it; the update job will then correctly time out and be marked failed.

11. Migration plan

  1. internal/version package + Makefile ldflags wiring.
  2. Migration 0021 (jobs.kind widening) + 0022 (fleet_updates tables).
  3. internal/agent/updater package, Linux first.
  4. WS envelope wiring + command.update dispatcher.
  5. POST /api/hosts/{id}/update + hello-handler integration + timeout goroutine.
  6. UI: chip + per-host update button + dashboard tile + filter.
  7. Fleet update worker + page.
  8. Windows updater path.
  9. Alert engine kinds.
  10. Smoke validation per §9.2.

Each step is independently testable; commits should land at each boundary so a failed Windows path (8) doesn't block the rest of the work.