Commit Graph

303 Commits

Author SHA1 Message Date
steve 8ef681f3f4 tasks: queue NS-05 (drop setup-go) + NS-06 (drop disabled Run-backup button)
Two small follow-ups noted while working through the
p5-oss-readiness CI-runner switch:

* NS-05 — actions/setup-go is now redundant; ci-runner-go ships
  Go on PATH and re-downloading on every job costs ~5s a shard.
* NS-06 — host_chrome's per-host "Run backup now" button is a
  permanently-disabled tombstone; remove it so the chrome stops
  advertising an action that no longer exists.
2026-05-08 22:26:59 +01:00
steve 7be2e4c5b0 Merge pull request 'P5: OSS readiness — docs site, contributor onboarding, e2e harness' (#23) from p5-oss-readiness into main
Reviewed-on: #23
2026-05-08 21:22:38 +00:00
steve ea9941b9ec e2e: dispatch backup via source-group API
CI / Test (rest) (pull_request) Successful in 7s
CI / Test (store) (pull_request) Successful in 6s
CI / Build (windows/amd64) (pull_request) Successful in 8s
CI / Lint (pull_request) Successful in 18s
CI / Build (linux/amd64) (pull_request) Successful in 7s
CI / Build (linux/arm64) (pull_request) Successful in 8s
e2e / Playwright vs docker-compose (pull_request) Successful in 1m27s
CI / Test (server-http) (pull_request) Successful in 3m3s
Per-host Run-backup is gone — the host_chrome partial still
renders the button but it's hard-disabled with a tooltip
pointing to per-source-group Run-now. The smoke test was
clicking that disabled button and waiting forever for a URL
change that would never happen.

Replace the navigation-based dispatch with two API calls:
create a source group covering the agent's /source mount,
then POST to /api/hosts/{id}/source-groups/{gid}/run. The
backup-status assertion at the end is unchanged — host record
is still the source of truth.
2026-05-08 22:16:57 +01:00
steve 130b68226e api: expose host.repo_status in /api/hosts JSON
CI / Test (rest) (pull_request) Successful in 11s
CI / Test (store) (pull_request) Successful in 10s
CI / Build (windows/amd64) (pull_request) Successful in 15s
CI / Lint (pull_request) Successful in 19s
CI / Build (linux/arm64) (pull_request) Successful in 7s
CI / Build (linux/amd64) (pull_request) Successful in 15s
CI / Test (server-http) (pull_request) Successful in 1m30s
e2e / Playwright vs docker-compose (pull_request) Failing after 6m36s
The dashboard renders init_running / init_failed / ready state
based on host.repo_status, but the JSON endpoint dropped the
field on its way out. The e2e test couldn't poll for repo
readiness; reflect the same projection the UI uses.
2026-05-08 22:06:22 +01:00
steve ccd7c2f2fd e2e: wait for repo_status=ready and bump test timeout
CI / Test (rest) (pull_request) Successful in 8s
CI / Test (store) (pull_request) Successful in 8s
CI / Test (server-http) (pull_request) Successful in 12s
CI / Build (windows/amd64) (pull_request) Successful in 9s
CI / Lint (pull_request) Successful in 20s
CI / Build (linux/arm64) (pull_request) Successful in 9s
CI / Build (linux/amd64) (pull_request) Successful in 17s
e2e / Playwright vs docker-compose (pull_request) Failing after 4m7s
Two issues uncovered by the page-snapshot dump after the agent
state-dir fix:

* The host page server-renders `Run backup now` as disabled
  while repo_status != ready, and the page has no live-refresh
  on that field. The test was navigating right after status
  flipped to 'online' but before auto-init had completed (~3s
  later), so the rendered HTML still showed init_running and
  the click was a no-op. Wait for repo_status === 'ready'
  before navigating.

* playwright.config.ts pinned the per-test timeout at 60s,
  but the test itself uses 60s + 120s of internal waits.
  Bump to 240s so the test fails on real regressions instead
  of timing out on its own internal budget.

Renamed the test description away from "under a minute" since
it overpromises against the new timeout. The performance SLO
belongs in a separate test if we want to assert it.
2026-05-08 22:00:24 +01:00
steve 51fe1946b7 e2e: fix agent state-dir to /var/lib/restic-manager
CI / Test (store) (pull_request) Successful in 6s
CI / Test (rest) (pull_request) Successful in 17s
CI / Lint (pull_request) Successful in 19s
CI / Build (linux/amd64) (pull_request) Successful in 8s
CI / Build (linux/arm64) (pull_request) Successful in 9s
CI / Build (windows/amd64) (pull_request) Successful in 57s
CI / Test (server-http) (pull_request) Successful in 1m30s
e2e / Playwright vs docker-compose (pull_request) Failing after 3m26s
The agent writes its encrypted secrets blob to
$DefaultSecretsPath (/var/lib/restic-manager/secrets.enc) but
the e2e fixtures created and mounted a directory at
/var/lib/restic-manager-agent — name mismatch. Result: every
`config.update` push failed with 'create tmp: no such file or
directory', the auto-init never got the repo creds, the host
landed in init_failed, and the smoke test couldn't kick off a
backup (the Run backup button is disabled while
repo_status != ready).

Align the compose volume mount and the Dockerfile mkdir on
/var/lib/restic-manager so they match the production install
script + the agent's own default.
2026-05-08 21:53:35 +01:00
steve 523ac4137a ui: show pending-hosts panel even when fleet is otherwise empty
CI / Test (store) (pull_request) Successful in 6s
CI / Lint (pull_request) Successful in 20s
CI / Build (windows/amd64) (pull_request) Successful in 10s
CI / Test (rest) (pull_request) Successful in 41s
CI / Build (linux/amd64) (pull_request) Successful in 8s
CI / Build (linux/arm64) (pull_request) Successful in 8s
CI / Test (server-http) (pull_request) Successful in 3m8s
e2e / Playwright vs docker-compose (pull_request) Failing after 3m28s
The dashboard's empty-state ("No hosts yet.") was gated on
HostCount == 0 alone, which hid the pending-hosts panel — and
the inline accept form — for the most common first-run scenario:
operator just installed an agent that announced, the fleet has
zero accepted hosts, and the only thing the operator needs to do
is review fingerprint + click Accept.

Tighten the gate so the empty state only shows when there are
truly zero hosts and zero pending announces. With a pending
host, fall through to the regular dashboard layout so the
approval queue is visible and actionable.
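
The tightened gate can be sketched as a single predicate. This is a minimal illustration, not the project's template code: the function and parameter names are hypothetical, and only the "zero hosts and zero pending announces" rule comes from the message above.

```go
package main

import "fmt"

// showEmptyState reports whether the dashboard should render the
// "No hosts yet." empty state. Both counts must be zero: a single
// pending announce is enough to fall through to the regular layout.
// Names are hypothetical; only the rule itself is from the commit.
func showEmptyState(hostCount, pendingCount int) bool {
	return hostCount == 0 && pendingCount == 0
}

func main() {
	fmt.Println(showEmptyState(0, 0)) // nothing to show at all
	fmt.Println(showEmptyState(0, 1)) // first-run case: pending host waiting
	fmt.Println(showEmptyState(3, 0)) // normal fleet
}
```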

Caught by the e2e enrol-via-announce smoke test (now unblocked
on PR #23).
2026-05-08 21:47:31 +01:00
steve 74be681b4b e2e: dump error-context.md to log on failure + bump upload-artifact
CI / Test (server-http) (pull_request) Successful in 7s
CI / Test (store) (pull_request) Successful in 6s
CI / Test (rest) (pull_request) Successful in 13s
CI / Build (windows/amd64) (pull_request) Successful in 9s
CI / Build (linux/arm64) (pull_request) Successful in 8s
CI / Lint (pull_request) Successful in 19s
CI / Build (linux/amd64) (pull_request) Successful in 15s
e2e / Playwright vs docker-compose (pull_request) Failing after 3m37s
The Playwright run produces error-context.md per failed test
with a full DOM snapshot — useful for triaging UI test failures
without round-tripping through downloaded artifacts. Cat it
into the workflow log on failure.

Also bump actions/upload-artifact v3 → v4. v3 uploads still
return success on this Gitea runner but the artifacts don't
surface through the API or UI; v4 is the correct version per
the workflow header note.
2026-05-08 21:41:38 +01:00
steve e14dd82f20 e2e: extract Playwright report via docker cp instead of bind mount
CI / Test (server-http) (pull_request) Successful in 6s
CI / Test (store) (pull_request) Successful in 5s
CI / Build (windows/amd64) (pull_request) Successful in 7s
CI / Lint (pull_request) Successful in 19s
CI / Build (linux/amd64) (pull_request) Successful in 7s
CI / Build (linux/arm64) (pull_request) Successful in 8s
CI / Test (rest) (pull_request) Successful in 51s
e2e / Playwright vs docker-compose (pull_request) Failing after 3m30s
When the runner job runs inside a container, compose's relative
`./playwright/playwright-report` resolves to a path that exists
only inside the runner container, so the host's docker daemon
silently bind-mounts an empty dir and the report never lands
anywhere we can read.

Drop the bind mounts; keep the playwright container around
(--name e2e-pw, no --rm); after the test, `docker cp` the
report and traces out into the runner's workspace volume so
upload-artifact has something real to upload. The new test-results
directory (Playwright traces, screenshots, videos) is also
included so failure post-mortem doesn't need a re-run.
2026-05-08 21:36:09 +01:00
steve 21567adb8e runner tests: probe-exec setupScript to clear overlayfs ETXTBSY
CI / Test (rest) (pull_request) Successful in 7s
CI / Test (server-http) (pull_request) Successful in 1m37s
CI / Test (store) (pull_request) Successful in 5s
CI / Lint (pull_request) Successful in 21s
CI / Build (windows/amd64) (pull_request) Successful in 10s
CI / Build (linux/arm64) (pull_request) Successful in 9s
CI / Build (linux/amd64) (pull_request) Successful in 1m2s
e2e / Playwright vs docker-compose (pull_request) Failing after 5m0s
The original write-tmp-then-rename guard handles the ETXTBSY race
on a vanilla filesystem, but inside the new ci-runner-go
container our jobs land on overlayfs, which keeps a lagged
"writable inode" view long enough to leak ETXTBSY into the
exec the test does milliseconds later.

After rename, probe-exec the file with a benign argument
("__rm_probe__" — every script's case statement falls through
to a clean exit) until exec succeeds. Each script body is shaped
`case "$1" in restore) ... ;; esac` so the probe is a no-op.
3s deadline keeps a stuck filesystem from hanging the suite.
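
A rough sketch of the write, rename, then probe-exec sequence, assuming a Unix target where `syscall.ETXTBSY` is defined. The helper names, the 10ms retry interval, and the choice to retry only on ETXTBSY are assumptions for illustration; the probe argument and the 3s deadline are the ones named above.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"syscall"
	"time"
)

// probeArg is the benign argument from the commit message; the scripts'
// case statements fall through on it and exit cleanly.
const probeArg = "__rm_probe__"

// writeScript writes body to a temp file, renames it into place, then
// probe-execs it until the kernel stops reporting ETXTBSY.
func writeScript(path, body string, deadline time.Duration) error {
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, []byte(body), 0o755); err != nil {
		return err
	}
	if err := os.Rename(tmp, path); err != nil {
		return err
	}
	return probeExec(path, deadline)
}

// probeExec retries the exec while it fails with ETXTBSY (assumed retry
// policy); any other error is returned immediately.
func probeExec(path string, deadline time.Duration) error {
	stop := time.Now().Add(deadline)
	for {
		err := exec.Command(path, probeArg).Run()
		if err == nil {
			return nil
		}
		if !errors.Is(err, syscall.ETXTBSY) {
			return err
		}
		if time.Now().After(stop) {
			return fmt.Errorf("probe exec %s: still ETXTBSY after %s", path, deadline)
		}
		time.Sleep(10 * time.Millisecond)
	}
}

// demoProbe writes a script shaped like the ones described above into a
// temp dir and probes it with the 3s deadline.
func demoProbe() error {
	dir, err := os.MkdirTemp("", "probe")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	script := "#!/bin/sh\ncase \"$1\" in restore) echo restoring ;; esac\n"
	return writeScript(filepath.Join(dir, "restic"), script, 3*time.Second)
}

func main() {
	fmt.Println("probe error:", demoProbe())
}
```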
2026-05-08 21:26:35 +01:00
steve 084ddd56ba ci: force bash as default shell in container jobs
CI / Test (rest) (pull_request) Failing after 56s
CI / Test (store) (pull_request) Successful in 37s
CI / Lint (pull_request) Successful in 17s
CI / Test (server-http) (pull_request) Successful in 2m0s
CI / Build (windows/amd64) (pull_request) Successful in 26s
CI / Build (linux/amd64) (pull_request) Successful in 28s
CI / Build (linux/arm64) (pull_request) Successful in 26s
e2e / Playwright vs docker-compose (pull_request) Failing after 3m47s
When jobs run with `container:` set, Gitea Actions defaults to
`sh -e` (dash on Ubuntu), so `set -euo pipefail` fails with
"Illegal option -o pipefail". Pinning bash workflow-wide
matches what the runner used pre-container and keeps existing
scripts portable.
2026-05-08 21:10:33 +01:00
steve dedc653256 ci: run jobs in ci-runner-go container
CI / Test (rest) (pull_request) Failing after 40s
CI / Test (store) (pull_request) Failing after 40s
CI / Lint (pull_request) Successful in 21s
CI / Build (windows/amd64) (pull_request) Successful in 26s
CI / Test (server-http) (pull_request) Failing after 1m19s
CI / Build (linux/amd64) (pull_request) Successful in 27s
CI / Build (linux/arm64) (pull_request) Successful in 27s
e2e / Playwright vs docker-compose (pull_request) Failing after 5m18s
Pin every job to gitea.dcglab.co.uk/steve/ci-runner-go:2026-05-08
so Go, Node, and Docker tooling are already installed when the
job starts. Drops three actions/setup-go invocations from ci.yml
(redundant — Go is on PATH) and inherits Buildx + Compose v2 in
e2e.yml and release.yml without per-job apt-installs.

Recipe lives in steve/ci. Bump the date pin in lockstep across
the three workflows when picking up a fresher image (e.g. when
the Go floor moves).
2026-05-08 21:06:38 +01:00
steve 60e9197c24 e2e: build playwright image with --profile test --pull
CI / Test (server-http) (pull_request) Successful in 22s
CI / Test (store) (pull_request) Successful in 22s
CI / Lint (pull_request) Successful in 27s
CI / Build (windows/amd64) (pull_request) Successful in 25s
CI / Build (linux/amd64) (pull_request) Successful in 25s
CI / Build (linux/arm64) (pull_request) Successful in 24s
CI / Test (rest) (pull_request) Successful in 1m19s
e2e / Playwright vs docker-compose (pull_request) Failing after 4m51s
Without --profile test, `docker compose build` skips the
playwright service (profiles: [test]) and the image is built
on-demand by `compose run` instead. Across CI runs the Gitea
runner caches the resulting tag, so a Dockerfile FROM bump
(v1.50.0 → v1.59.1) is masked by the cached image — the
container ends up with old browser binaries and Playwright's
own version-mismatch check fails the suite. Pull base images
on every build so the FROM tag wins.
2026-05-08 20:15:21 +01:00
steve a3f134bcd6 e2e: pin Playwright to 1.59.1
CI / Test (rest) (pull_request) Successful in 34s
CI / Test (store) (pull_request) Successful in 54s
CI / Lint (pull_request) Successful in 26s
CI / Build (windows/amd64) (pull_request) Successful in 26s
CI / Build (linux/amd64) (pull_request) Successful in 25s
CI / Build (linux/arm64) (pull_request) Successful in 25s
e2e / Playwright vs docker-compose (pull_request) Failing after 1m36s
CI / Test (server-http) (pull_request) Successful in 3m19s
`@playwright/test` was loose-pinned to ^1.50.0; npm resolved it
to 1.59.1 inside the runner image, which only ships browser
binaries for 1.50.0. Pin both the package and the docker image
to v1.59.1 so deps and binaries stay aligned.
2026-05-08 20:09:17 +01:00
steve 17b9ee08b7 e2e: run health probe + Playwright on the compose network
Gitea's act-style runners execute workflow steps inside a runner
container, so compose's host port-publish (127.0.0.1:8080:8080) is
not reachable from the steps. PR #23's e2e job timed out waiting
for the server even though the container was up and listening.

Move both the health probe and the Playwright run onto rmnet so
they address the server as http://server:8080:

* health probe: docker run --rm --network e2e_rmnet curlimages/curl
* Playwright: new mcr.microsoft.com/playwright-based image, added
  as a profile-gated `playwright` service in compose.e2e.yml,
  invoked via `docker compose run --rm playwright`. Drops the
  setup-node + npm install runner steps.
2026-05-08 20:08:23 +01:00
steve 89537d417a P5: OSS readiness — docs site, contributor onboarding, e2e harness
P5-01 — Documentation site under docs/book/ rendered with mdBook
(downloaded via Makefile, same static-binary pattern as Tailwind).
Structured chapters: getting started, concepts, operations,
security, reference. `make docs` / `make docs-watch`. Generated
output gitignored.

P5-02 — CONTRIBUTING.md rewritten from placeholder to a full
guide. CODE_OF_CONDUCT.md adapted from Contributor Covenant for a
single-maintainer project. .gitea/issue_template/{bug,feature}.md
and PULL_REQUEST_TEMPLATE.md.

P5-04 — Six README screenshots captured live from a fresh server
bootstrap (login, empty dashboard, add-host, alerts, settings,
audit log). README rewritten to centre the screenshot grid and
link out to the docs site.

P5-05 — SECURITY.md with disclosure policy (3-day ack, 30-day
default window), scope in/out, threat-model summary, operator
hardening checklist. Mirrored as a docs-site chapter.

P5-06 — End-to-end test harness. e2e/compose.e2e.yml brings up
server + sibling Linux agent (alpine + restic) + restic/rest-server.
Agent uses announce-and-approve so Playwright can drive the full
operator flow: bootstrap → login → accept pending → backup →
verify terminal status. Second spec scrapes /metrics to assert
the P6-04 endpoint surface. .gitea/workflows/e2e.yml runs on every
PR; local how-to in docs/e2e.md.
2026-05-08 20:08:23 +01:00
steve a252b25854 Merge pull request 'spec+plan: P6-04/05 prometheus /metrics + Grafana dashboard' (#22) from p6-04-05-prometheus-metrics into main
Reviewed-on: #22
2026-05-08 18:31:57 +00:00
steve 73e733be61 P6-04+05: Prometheus /metrics endpoint + Grafana dashboard
CI / Test (rest) (pull_request) Successful in 41s
CI / Test (store) (pull_request) Successful in 43s
CI / Lint (pull_request) Successful in 29s
CI / Build (windows/amd64) (pull_request) Successful in 44s
CI / Test (server-http) (pull_request) Successful in 1m47s
CI / Build (linux/arm64) (pull_request) Successful in 43s
CI / Build (linux/amd64) (pull_request) Successful in 2m1s
New internal/server/metrics package emits the legacy text/plain
exposition format directly, so we don't pull in
prometheus/client_golang. Endpoint is opt-in via RM_METRICS_TOKEN
and/or RM_METRICS_TRUSTED_CIDR; route is not mounted at all if
neither gate is set. Both gates are ANDed when both are configured.
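
The gating rules above could look roughly like this. It is a sketch under assumptions: the type, field, and method names are hypothetical, and only the two environment variable names and the mount/AND semantics come from this message.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/netip"
)

// metricsGate holds the two optional gates (hypothetical type).
type metricsGate struct {
	token string       // value of RM_METRICS_TOKEN; empty means unset
	cidr  netip.Prefix // value of RM_METRICS_TRUSTED_CIDR; zero means unset
}

// newMetricsGate parses the two settings; an empty string leaves a gate unset.
func newMetricsGate(token, cidr string) (metricsGate, error) {
	g := metricsGate{token: token}
	if cidr != "" {
		p, err := netip.ParsePrefix(cidr)
		if err != nil {
			return metricsGate{}, err
		}
		g.cidr = p
	}
	return g, nil
}

// mounted reports whether the route should be registered at all.
func (g metricsGate) mounted() bool {
	return g.token != "" || g.cidr.IsValid()
}

// allow requires every configured gate to pass (AND semantics).
func (g metricsGate) allow(token, remoteIP string) bool {
	if !g.mounted() {
		return false
	}
	if g.token != "" && subtle.ConstantTimeCompare([]byte(token), []byte(g.token)) != 1 {
		return false
	}
	if g.cidr.IsValid() {
		addr, err := netip.ParseAddr(remoteIP)
		if err != nil || !g.cidr.Contains(addr.Unmap()) {
			return false
		}
	}
	return true
}

func main() {
	g, err := newMetricsGate("s3cret", "10.0.0.0/8")
	if err != nil {
		panic(err)
	}
	fmt.Println(g.mounted(), g.allow("s3cret", "10.1.2.3"), g.allow("s3cret", "192.168.1.1"))
}
```

When `mounted()` is false the handler would simply never be registered, which is what yields the 404 in the "no auth" row of the auth matrix.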

Per-host gauges (online, last_backup_*, repo_size_bytes,
snapshot_count, open_alerts, repo_status), server gauges
(hosts_total/online, active_alerts by severity, build_info), and
an in-memory job-duration histogram observed from the existing
MsgJobFinished branch in the WS handler.
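
For readers unfamiliar with the text exposition format, here is a minimal sketch of an in-memory histogram rendered without prometheus/client_golang. The type, the bucket bounds, and the metric name `rm_job_duration_seconds` are assumptions for illustration, not the package's actual surface.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"sync"
)

// histogram is a fixed-bucket, mutex-guarded histogram (hypothetical type).
type histogram struct {
	mu     sync.Mutex
	bounds []float64 // inclusive upper bounds in seconds, ascending
	counts []uint64  // per-bucket counts; the last slot is the +Inf bucket
	sum    float64
	total  uint64
}

func newHistogram(bounds ...float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// observe records one duration; a value equal to a bound lands in that bucket.
func (h *histogram) observe(seconds float64) {
	h.mu.Lock()
	defer h.mu.Unlock()
	i := 0
	for i < len(h.bounds) && seconds > h.bounds[i] {
		i++
	}
	h.counts[i]++
	h.sum += seconds
	h.total++
}

// render emits cumulative buckets, then _sum and _count, in text format.
func (h *histogram) render(name string) string {
	h.mu.Lock()
	defer h.mu.Unlock()
	var b strings.Builder
	fmt.Fprintf(&b, "# TYPE %s histogram\n", name)
	var cum uint64
	for i, ub := range h.bounds {
		cum += h.counts[i]
		fmt.Fprintf(&b, "%s_bucket{le=%q} %d\n", name, strconv.FormatFloat(ub, 'g', -1, 64), cum)
	}
	fmt.Fprintf(&b, "%s_bucket{le=\"+Inf\"} %d\n", name, h.total)
	fmt.Fprintf(&b, "%s_sum %s\n", name, strconv.FormatFloat(h.sum, 'g', -1, 64))
	fmt.Fprintf(&b, "%s_count %d\n", name, h.total)
	return b.String()
}

// demoRender observes four durations against assumed bounds of 1s, 10s, 60s.
func demoRender() string {
	h := newHistogram(1, 10, 60)
	for _, s := range []float64{0.5, 1, 30, 120} {
		h.observe(s)
	}
	return h.render("rm_job_duration_seconds")
}

func main() {
	fmt.Print(demoRender())
}
```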

Docs in docs/prometheus.md (enable + scrape config + metric
reference + dashboard import). Sample dashboard at
deploy/grafana/restic-manager-dashboard.json - six panels,
Grafana schema 39, single Prometheus datasource variable.

Tests: golden render, concurrent observe, bucket boundaries in
the metrics package; auth matrix (no auth -> 404, token gate,
CIDR gate, both required) in the HTTP layer.
2026-05-07 23:17:15 +01:00
steve 70ff554402 spec+plan: P6-04/05 prometheus /metrics + Grafana dashboard
2026-05-07 23:07:30 +01:00
steve 39dcda4e9e Merge pull request 'P6-03 repo size trend + agent-update UI fix + dashboard polish' (#21) from tidy-up-last-backup-projection into main
Reviewed-on: #21
2026-05-07 22:00:03 +00:00
steve 1b9b23f205 smoke env: systemd --user unit + Make targets so the dev server outlives shell tool boundaries
CI / Test (rest) (pull_request) Successful in 46s
CI / Test (store) (pull_request) Successful in 1m34s
CI / Test (server-http) (pull_request) Successful in 1m46s
CI / Build (linux/amd64) (pull_request) Successful in 23s
CI / Build (windows/amd64) (pull_request) Successful in 41s
CI / Build (linux/arm64) (pull_request) Successful in 23s
CI / Lint (pull_request) Successful in 2m9s
Spent half an evening fighting a smoke server that kept getting SIGTERM'd
mid-iteration. Root cause: backgrounded processes spawned from sandboxed
shell tool calls don't outlive the parent — even with nohup + disown.

Fix: hand the server to user-systemd as a transient unit so its lifecycle
is owned by the user's session, not by whichever bash subprocess started it.
New Make targets:

  make smoke-restart   build server + (re)launch as systemd --user unit
  make smoke-status    show unit status
  make smoke-logs      tail $HOME/smoke/server.log
  make smoke-stop      stop the unit
  make smoke-deploy    full rebuild + restage agent assets + restart

Documents the workflow in CLAUDE.md so the next session doesn't relitigate.
2026-05-07 22:55:36 +01:00
steve c4dc9e9119 ui+store: dashboard polish — repo size projection + header alignment
- Project total_size_bytes onto hosts.repo_size_bytes inside the
  UpsertHostRepoStats transaction. The hosts row column has been
  unwritten since the initial schema in 0001, so the dashboard's
  Repo size cell has always rendered '—' even after backups. Now
  the column updates atomically alongside the host_repo_stats row,
  and FleetSummary's SUM(repo_size_bytes) becomes accurate too.
- Right-align the Alerts column header so it sits over its
  right-aligned value (was floating left of column, ambiguous).
- Add text-ink-mid to the 30d trend / Alerts / Tags headers so all
  column headers share the same brightness.
2026-05-07 22:55:21 +01:00
steve 7011510092 ui: chart polish — rotated y-axis labels, wider viewBox, single-day fallback
- Add rotated 'Size' (left) and 'Snapshots' (right) axis titles in
  the chart's outer margins so the two y-axes are self-describing.
- Bump the chart viewBox from 600x220 to 640x220 and lift padL from
  56 to 72 so the rotated labels and byte tick numbers don't crowd.
- Dedupe the X-axis labels for short windows (1 or 2 days collapsed
  the start/mid/end indices onto each other, stacking 'May 7' three
  times); the 1-day case now centres a single label, 2-day uses
  start+end only.
- Pin a lone data dot to the chart centre instead of the left edge
  when len(days)==1, so it sits under the centred date label.
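
The X-axis dedupe can be pictured with a small helper like the one below. It is only a sketch: `labelIndices` is a hypothetical name, and the start/mid/end choice for longer windows is inferred from the description above.

```go
package main

import "fmt"

// labelIndices returns which day indices get an X-axis label.
// One day gets a single label, two days get start and end only,
// and longer windows get start, middle, and end.
func labelIndices(n int) []int {
	switch {
	case n <= 0:
		return nil
	case n == 1:
		return []int{0}
	case n == 2:
		return []int{0, 1}
	default:
		return []int{0, n / 2, n - 1}
	}
}

func main() {
	for _, n := range []int{1, 2, 30} {
		fmt.Println(n, labelIndices(n))
	}
}
```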

Goldens regenerated.
2026-05-07 22:55:12 +01:00
steve 42eeabea9a ui: per-host Jobs sub-tab; drop unused Settings stub
Adds /hosts/{id}/jobs page listing recent jobs for the host (newest
first, capped at 100) with click-through to /jobs/{id}. Converts the
Jobs placeholder <div> to a real <a> nav link; removes the Settings
stub entirely. Also registers durationHuman template func and a
.jobs-row CSS grid to match the existing .schd-row idiom.
2026-05-07 22:49:10 +01:00
steve 7b390e9e5e ws: synthesize job.finished from update watcher so browser stream wakes up
2026-05-07 20:32:48 +01:00
steve afd15c6990 tasks: P6-03 done, repo size trend graphs
2026-05-07 19:20:05 +01:00
steve 2562b2c7b5 test: assert Trend panel renders on full repo page
2026-05-07 19:14:34 +01:00
steve 8be551349c ui: trend panel + range selector on host repo page
2026-05-07 19:10:59 +01:00
steve a48df77f40 ui: 30d repo-size sparkline on every dashboard host row
2026-05-07 19:02:35 +01:00
steve 70769f0841 web/sparkline: guard days[i] against shorter days slice in RenderChart
2026-05-07 18:58:33 +01:00
steve ea74965830 web/sparkline: two-axis trend chart with hover dots
2026-05-07 18:55:31 +01:00
steve 9c209a952e web/sparkline: inline-SVG sparkline renderer (empty / single / multi)
2026-05-07 18:50:23 +01:00
steve 871490b9d4 ws: record daily repo stats history alongside current upsert
2026-05-07 18:46:26 +01:00
steve d317d2e561 store: history table helpers (upsert/list, COALESCE preserves prior values)
2026-05-07 18:43:20 +01:00
steve 00bfef0aee store: migration 0023 host_repo_stats_history
2026-05-07 18:39:44 +01:00
steve 363bdff85b plan: P6-03 repo size trend implementation
2026-05-07 18:15:06 +01:00
steve 20425b3360 spec: P6-03 repo size trend (sparkline + chart) design
2026-05-07 18:09:25 +01:00
steve 9c098e773b Merge pull request 'tidy: project finished backup jobs onto host row + smoke doc tweaks' (#20) from tidy-up-last-backup-projection into main
Reviewed-on: #20
2026-05-07 16:58:16 +00:00
steve 711d5e964c fix: project finished backup jobs onto host row + smoke path tweaks
CI / Test (store) (pull_request) Successful in 50s
CI / Test (rest) (pull_request) Successful in 1m5s
CI / Lint (pull_request) Successful in 24s
CI / Build (linux/amd64) (pull_request) Successful in 22s
CI / Build (windows/amd64) (pull_request) Successful in 43s
CI / Test (server-http) (pull_request) Successful in 1m51s
CI / Build (linux/arm64) (pull_request) Successful in 21s
The dashboard's 'Last backup' column reads hosts.last_backup_at /
last_backup_status, but the WS handler only updated hosts.repo_status
on job.finished — backup terminations were silently dropped. Add a
SetHostLastBackup store method and call it from the same job.finished
switch that already handles init jobs.

Also: CLAUDE.md restage block uses /tmp/rm-smoke (the original
default) but the actual dev env runs out of $HOME/smoke. Update the
paths in the doc to match.
2026-05-07 17:55:23 +01:00
steve 39657355be Merge pull request 'P6-01 + P6-02: agent self-update + fleet update' (#19) from p6-agent-self-update into main
Reviewed-on: #19
2026-05-07 16:49:25 +00:00
steve 0bd075c2a3 tasks: mark P6-01 + P6-02 done with as-shipped block
CI / Test (store) (pull_request) Successful in 52s
CI / Test (rest) (pull_request) Successful in 1m6s
CI / Lint (pull_request) Successful in 32s
CI / Test (server-http) (pull_request) Successful in 1m41s
CI / Build (windows/amd64) (pull_request) Successful in 41s
CI / Build (linux/amd64) (pull_request) Successful in 22s
CI / Build (linux/arm64) (pull_request) Successful in 24s
2026-05-06 22:33:33 +01:00
steve 83d97a27cc agent unit: allow writes to /usr/local/bin for self-update
Smoke caught this: ProtectSystem=full mounts /usr read-only so the
agent couldn't write its own .new staging file or atomic-rename over
the running binary. Adding /usr/local/bin to ReadWritePaths is the
minimum diff that lets self-update work; the whole-dir grant is
required because os.Rename needs write on the parent directory.
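
To make the parent-directory requirement concrete, here is a sketch of the stage-then-rename pattern. The helper names are hypothetical and the real updater may differ; only the `.new` staging suffix and the rename over the running binary come from the message above.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// replaceBinary stages data next to path as path+".new", then renames it
// over path. Both steps create or replace directory entries, so the
// process needs write access to the parent directory, not just the file.
func replaceBinary(path string, data []byte) error {
	staging := path + ".new"
	if err := os.WriteFile(staging, data, 0o755); err != nil {
		return err
	}
	if err := os.Rename(staging, path); err != nil {
		os.Remove(staging) // best-effort cleanup of the staging file
		return err
	}
	return nil
}

// demoReplace swaps a stand-in "binary" inside a temp dir and reads it back.
func demoReplace() (string, error) {
	dir, err := os.MkdirTemp("", "selfupdate")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	bin := filepath.Join(dir, "agent")
	if err := os.WriteFile(bin, []byte("old"), 0o755); err != nil {
		return "", err
	}
	if err := replaceBinary(bin, []byte("new")); err != nil {
		return "", err
	}
	got, err := os.ReadFile(bin)
	return string(got), err
}

func main() {
	fmt.Println(demoReplace())
}
```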
2026-05-06 22:32:50 +01:00
steve ccaccd840a ui: dashboard hosts-behind tile + filter
- Add ?updates=behind query filter and the matching dashboardFilter
  field; round-trips through encode/parse.
- Compute UpdatesBehind on the dashboard view-model (online + version
  trailing the server) and surface as an amber hero tile that links
  to the filtered list.
- Add a test covering the new filter case.
2026-05-06 22:20:54 +01:00
steve 94441a5371 ui: update chip + per-host button
- Surface UpdateAvailable + TargetVersion on the dashboard host row,
  the host_chrome header, and the JSON Host shape.
- New host_update_chip partial renders an amber out-of-date pill
  next to the agent-version display when the host's agent trails
  the server.
- Host detail right-rail gains an admin-only Update agent button
  (disabled when host is offline or already updating).
- New .update-chip and .btn-amber CSS tokens; tailwind output
  refreshed.
2026-05-06 22:20:40 +01:00
steve 3fa7be51a5 ui: fleet update page + endpoints
- POST /api/fleet/update, POST /api/fleet-updates/{id}/cancel,
  GET /api/fleet-updates/{id} (admin-only).
- GET /settings/fleet-update + /partial for htmx polling.
- Renders idle / running / terminal states with per-host progress.
- Tests cover happy path, derive-host-ids, conflict, cancel, get,
  and RBAC.
2026-05-06 22:20:03 +01:00
steve 6fd2a2ff77 p6-01/02: agent self-update + fleet update server cluster
- alert: update_failed (per-host, dedup=hostID) + fleet_update_halted
  (system-scoped, host_id NULL via new RaiseOrTouchSystem helper).
- ws: UpdateWatcher tracks in-flight command.update dispatches and
  reconciles them against incoming hello envelopes — success path
  marks the job succeeded and auto-resolves the alert; 90s timeout
  marks the job failed and raises update_failed.
- http: POST /api/hosts/{id}/update (admin-only JSON) + the HTMX
  /hosts/{id}/update form variant. Pre-checks: host exists, online,
  agent_version != current, no running update job. Refactored core
  into Server.dispatchHostUpdate so the fleet worker can share it
  without going through HTTP.
- fleetupdate: rolling worker iterating through host slots, halting
  on first failure and raising fleet_update_halted. Polling-based
  version-match (re-read hosts.agent_version every 1s up to 95s) —
  no extra plumbing into the WS hello path. At-most-one-running is
  enforced at the store layer (ErrFleetUpdateRunning).
- cmd/server: wire UpdateWatcher and FleetWorker into the main
  goroutine; the worker uses a small serverDispatcher adapter that
  delegates back into Server.DispatchHostUpdate.
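
The polling-based version match in the fleetupdate item could be sketched as below. This is an assumption-laden illustration: `waitForVersion`, its `read` callback, and the sentinel error are hypothetical stand-ins for re-reading hosts.agent_version, and the demo uses millisecond intervals instead of the 1s / 95s values named above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errVersionTimeout is a hypothetical sentinel for "agent never caught up".
var errVersionTimeout = errors.New("agent did not report target version in time")

// waitForVersion calls read every interval until it returns target,
// the timeout elapses, or ctx is cancelled.
func waitForVersion(ctx context.Context, read func() (string, error),
	target string, interval, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	tick := time.NewTicker(interval)
	defer tick.Stop()
	for {
		v, err := read()
		if err != nil {
			return err
		}
		if v == target {
			return nil
		}
		select {
		case <-ctx.Done():
			if errors.Is(ctx.Err(), context.DeadlineExceeded) {
				return errVersionTimeout
			}
			return ctx.Err() // cancelled mid-run
		case <-tick.C:
		}
	}
}

// demoWait simulates an agent that reports the new version after
// flipAfter reads, polling every 5ms with the given timeout.
func demoWait(flipAfter, timeoutMs int) error {
	reads := 0
	read := func() (string, error) {
		reads++
		if reads > flipAfter {
			return "1.1.0", nil
		}
		return "1.0.0", nil
	}
	return waitForVersion(context.Background(), read, "1.1.0",
		5*time.Millisecond, time.Duration(timeoutMs)*time.Millisecond)
}

func main() {
	fmt.Println(demoWait(3, 500))   // agent catches up
	fmt.Println(demoWait(1000, 50)) // agent never catches up
}
```

Polling keeps the worker decoupled from the WS hello path, at the cost of up to one interval of extra latency per host.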

Tests: watcher (success/timeout/mismatch/late-hello), HTTP endpoint
(happy + four pre-check branches + RBAC), worker (two-host happy,
timeout-halt, host-offline-halt, already-at-target skip, cancel
mid-run, double-Start guard).
2026-05-06 22:03:50 +01:00
steve d413896302 store: migrations 0021+0022 + fleet_updates CRUD
2026-05-06 21:47:54 +01:00
steve 74cf24c28b agent: command.update handler + updater package (Linux + Windows)
2026-05-06 21:42:50 +01:00
steve 22bcf69e6c http: expose GET /api/version
2026-05-06 21:39:13 +01:00
steve fe1ed49977 version: build-time version package + Makefile ldflags wiring
2026-05-06 21:38:35 +01:00