283 Commits

Author SHA1 Message Date
steve cc6844605f e2e: fix agent state-dir to /var/lib/restic-manager
The agent writes its encrypted secrets blob to
$DefaultSecretsPath (/var/lib/restic-manager/secrets.enc) but
the e2e fixtures created and mounted a directory at
/var/lib/restic-manager-agent — name mismatch. Result: every
`config.update` push failed with 'create tmp: no such file or
directory', the auto-init never got the repo creds, the host
landed in init_failed, and the smoke test couldn't kick off a
backup (the Run backup button is disabled while
repo_status != ready).

Align the compose volume mount and the Dockerfile mkdir on
/var/lib/restic-manager so they match the production install
script + the agent's own default.
2026-05-08 21:53:35 +01:00
steve 4cd36d83e3 ui: show pending-hosts panel even when fleet is otherwise empty
The dashboard's empty-state ("No hosts yet.") was gated on
HostCount == 0 alone, which hid the pending-hosts panel — and
the inline accept form — for the most common first-run scenario:
operator just installed an agent that announced, the fleet has
zero accepted hosts, and the only thing the operator needs to do
is review fingerprint + click Accept.

Tighten the gate so the empty state only shows when there are
truly zero hosts and zero pending announces. With a pending
host, fall through to the regular dashboard layout so the
approval queue is visible and actionable.
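
A minimal sketch of the tightened gate, assuming a view-model with a pending-announce count next to the existing `HostCount`. The type name, the `PendingCount` field, and the function name are hypothetical and only illustrate the condition described above.

```go
package main

import "fmt"

// dashboardView is a hypothetical view-model. HostCount is named in the
// commit message; PendingCount is an assumed name for pending announces.
type dashboardView struct {
	HostCount    int
	PendingCount int
}

// showEmptyState reports whether the "No hosts yet." empty state should
// render: only when there are zero accepted hosts AND zero pending announces.
func showEmptyState(v dashboardView) bool {
	return v.HostCount == 0 && v.PendingCount == 0
}

func main() {
	// First-run scenario: one agent has announced, nothing accepted yet.
	fmt.Println(showEmptyState(dashboardView{HostCount: 0, PendingCount: 1}))
	// Truly empty fleet.
	fmt.Println(showEmptyState(dashboardView{}))
}
```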

Caught by the e2e enrol-via-announce smoke test (now unblocked
on PR #23).
2026-05-08 21:47:31 +01:00
steve 68276810ec e2e: dump error-context.md to log on failure + bump upload-artifact
The Playwright run produces error-context.md per failed test
with a full DOM snapshot — useful for triaging UI test failures
without round-tripping through downloaded artifacts. Cat it
into the workflow log on failure.

Also bump actions/upload-artifact v3 → v4. v3 uploads still
return success on this Gitea runner but the artifacts don't
surface through the API or UI; v4 is the correct version per
the workflow header note.
2026-05-08 21:41:38 +01:00
steve e8804922b5 e2e: extract Playwright report via docker cp instead of bind mount
When the runner job runs inside a container, compose's relative
`./playwright/playwright-report` resolves to a path that exists
only inside the runner container, so the host's docker daemon
silently bind-mounts an empty dir and the report never lands
anywhere we can read.

Drop the bind mounts; keep the playwright container around
(--name e2e-pw, no --rm); after the test, `docker cp` the
report and traces out into the runner's workspace volume so
upload-artifact has something real to upload. The new test-results
directory (Playwright traces, screenshots, videos) is also
included so failure post-mortem doesn't need a re-run.
2026-05-08 21:36:09 +01:00
steve a9c6a060d4 runner tests: probe-exec setupScript to clear overlayfs ETXTBSY
The original write-tmp-then-rename guard handles the ETXTBSY race
on a vanilla filesystem, but inside the new ci-runner-go
container our jobs land on overlayfs, which keeps a lagged
"writable inode" view long enough to leak ETXTBSY into the
exec the test does milliseconds later.

After rename, probe-exec the file with a benign argument
("__rm_probe__" — every script's case statement falls through
to a clean exit) until exec succeeds. Each script body is shaped
`case "$1" in restore) ... ;; esac` so the probe is a no-op.
3s deadline keeps a stuck filesystem from hanging the suite.
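
The write, rename, then probe-exec sequence can be sketched roughly as below. This is an illustration under assumptions, not the test helper itself: `installScript` and `demo` are hypothetical names, the sketch retries on any exec error rather than on ETXTBSY specifically, and it assumes a Unix host with `/bin/sh`.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"time"
)

// installScript writes body to a temp file, renames it into place, then
// probe-execs it with a benign argument until exec succeeds or the
// deadline passes. Hypothetical helper; retries on any exec error.
func installScript(path, body string, deadline time.Duration) error {
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, []byte(body), 0o755); err != nil {
		return err
	}
	if err := os.Rename(tmp, path); err != nil {
		return err
	}
	stop := time.Now().Add(deadline)
	for {
		err := exec.Command(path, "__rm_probe__").Run()
		if err == nil {
			return nil // exec worked, so the file is safe to run
		}
		if time.Now().After(stop) {
			return fmt.Errorf("probe-exec %s: %w", path, err)
		}
		time.Sleep(10 * time.Millisecond)
	}
}

// demo installs a script shaped like the ones described above: the case
// statement has no branch for the probe argument, so the probe is a no-op.
func demo() error {
	dir, err := os.MkdirTemp("", "probe-exec")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	body := "#!/bin/sh\ncase \"$1\" in restore) echo restoring ;; esac\n"
	return installScript(filepath.Join(dir, "fake-restic"), body, 3*time.Second)
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("install failed:", err)
		return
	}
	fmt.Println("script ready")
}
```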
2026-05-08 21:26:35 +01:00
steve a8026608ae ci: force bash as default shell in container jobs
When jobs run with `container:` set, Gitea Actions defaults to
`sh -e` (dash on Ubuntu), so `set -euo pipefail` fails with
"Illegal option -o pipefail". Pinning bash workflow-wide
matches what the runner used pre-container and keeps existing
scripts portable.
2026-05-08 21:10:33 +01:00
steve 6c23bdbe63 ci: run jobs in ci-runner-go container
Pin every job to gitea.dcglab.co.uk/steve/ci-runner-go:2026-05-08
so Go, Node, and Docker tooling are already installed when the
job starts. Drops three actions/setup-go invocations from ci.yml
(redundant — Go is on PATH) and inherits Buildx + Compose v2 in
e2e.yml and release.yml without per-job apt-installs.

Recipe lives in steve/ci. Bump the date pin in lockstep across
the three workflows when picking up a fresher image (e.g. when
the Go floor moves).
2026-05-08 21:06:38 +01:00
steve a087321570 e2e: build playwright image with --profile test --pull
Without --profile test, `docker compose build` skips the
playwright service (profiles: [test]) and the image is built
on-demand by `compose run` instead. Across CI runs the Gitea
runner caches the resulting tag, so a Dockerfile FROM bump
(v1.50.0 → v1.59.1) is masked by the cached image — the
container ends up with old browser binaries and Playwright's
own version-mismatch check fails the suite. Pull base images
on every build so the FROM tag wins.
2026-05-08 20:15:21 +01:00
steve e8f7502a7f e2e: pin Playwright to 1.59.1
`@playwright/test` was loose-pinned to ^1.50.0; npm resolved it
to 1.59.1 inside the runner image, which only ships browser
binaries for 1.50.0. Pin both the package and the docker image
to v1.59.1 so deps and binaries stay aligned.
2026-05-08 20:09:17 +01:00
steve af2cb292b8 e2e: run health probe + Playwright on the compose network
Gitea's act-style runners execute workflow steps inside a runner
container, so compose's host port-publish (127.0.0.1:8080:8080) is
not reachable from the steps. PR #23's e2e job timed out waiting
for the server even though the container was up and listening.

Move both the health probe and the Playwright run onto rmnet so
they address the server as http://server:8080:

* health probe: docker run --rm --network e2e_rmnet curlimages/curl
* Playwright: new mcr.microsoft.com/playwright-based image, added
  as a profile-gated `playwright` service in compose.e2e.yml,
  invoked via `docker compose run --rm playwright`. Drops the
  setup-node + npm install runner steps.
2026-05-08 20:08:23 +01:00
steve bb4ed3502d P5: OSS readiness — docs site, contributor onboarding, e2e harness
P5-01 — Documentation site under docs/book/ rendered with mdBook
(downloaded via Makefile, same static-binary pattern as Tailwind).
Structured chapters: getting started, concepts, operations,
security, reference. `make docs` / `make docs-watch`. Generated
output gitignored.

P5-02 — CONTRIBUTING.md rewritten from placeholder to a full
guide. CODE_OF_CONDUCT.md adapted from Contributor Covenant for a
single-maintainer project. .gitea/issue_template/{bug,feature}.md
and PULL_REQUEST_TEMPLATE.md.

P5-04 — Six README screenshots captured live from a fresh server
bootstrap (login, empty dashboard, add-host, alerts, settings,
audit log). README rewritten to centre the screenshot grid and
link out to the docs site.

P5-05 — SECURITY.md with disclosure policy (3-day ack, 30-day
default window), scope in/out, threat-model summary, operator
hardening checklist. Mirrored as a docs-site chapter.

P5-06 — End-to-end test harness. e2e/compose.e2e.yml brings up
server + sibling Linux agent (alpine + restic) + restic/rest-server.
Agent uses announce-and-approve so Playwright can drive the full
operator flow: bootstrap → login → accept pending → backup →
verify terminal status. Second spec scrapes /metrics to assert
the P6-04 endpoint surface. .gitea/workflows/e2e.yml runs on every
PR; local how-to in docs/e2e.md.
2026-05-08 20:08:23 +01:00
steve ff8a5dbead Merge pull request 'spec+plan: P6-04/05 prometheus /metrics + Grafana dashboard' (#22) from p6-04-05-prometheus-metrics into main
Reviewed-on: #22
2026-05-08 18:31:57 +00:00
steve ccd14f7cee P6-04+05: Prometheus /metrics endpoint + Grafana dashboard
New internal/server/metrics package emits the legacy text/plain
exposition format directly, so we don't pull in
prometheus/client_golang. Endpoint is opt-in via RM_METRICS_TOKEN
and/or RM_METRICS_TRUSTED_CIDR; route is not mounted at all if
neither gate is set. Both gates ANDed when both configured.
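
The gating rules above could look something like the following sketch. The type and method names are hypothetical, and how the token reaches the server (header, query string) is not stated in the commit, so the sketch takes the presented token as a plain string.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net"
)

// metricsGate is a hypothetical holder for the two optional gates.
type metricsGate struct {
	token   string     // value of RM_METRICS_TOKEN, "" when unset
	trusted *net.IPNet // parsed RM_METRICS_TRUSTED_CIDR, nil when unset
}

func newMetricsGate(token, cidr string) (metricsGate, error) {
	g := metricsGate{token: token}
	if cidr != "" {
		_, n, err := net.ParseCIDR(cidr)
		if err != nil {
			return metricsGate{}, err
		}
		g.trusted = n
	}
	return g, nil
}

// mounted reports whether the route should exist at all.
func (g metricsGate) mounted() bool { return g.token != "" || g.trusted != nil }

// allow applies every configured gate; when both are set, both must pass.
func (g metricsGate) allow(remoteIP, presented string) bool {
	if !g.mounted() {
		return false
	}
	if g.token != "" &&
		subtle.ConstantTimeCompare([]byte(presented), []byte(g.token)) != 1 {
		return false
	}
	if g.trusted != nil {
		ip := net.ParseIP(remoteIP)
		if ip == nil || !g.trusted.Contains(ip) {
			return false
		}
	}
	return true
}

func main() {
	g, err := newMetricsGate("s3cret", "10.0.0.0/8")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(g.allow("10.1.2.3", "s3cret"))    // both gates pass
	fmt.Println(g.allow("192.168.1.5", "s3cret")) // CIDR gate fails
	fmt.Println(g.allow("10.1.2.3", "wrong"))     // token gate fails
}
```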

Per-host gauges (online, last_backup_*, repo_size_bytes,
snapshot_count, open_alerts, repo_status), server gauges
(hosts_total/online, active_alerts by severity, build_info), and
an in-memory job-duration histogram observed from the existing
MsgJobFinished branch in the WS handler.
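
For readers unfamiliar with hand-rolled exposition output, here is a small sketch of an in-memory histogram rendered in the Prometheus text format. It is not the `internal/server/metrics` code: the type, the metric name, and the bucket bounds are all assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// histogram is a hypothetical in-memory histogram, safe for concurrent use.
type histogram struct {
	mu     sync.Mutex
	bounds []float64 // ascending upper bounds (inclusive, as with "le")
	counts []uint64  // per-bucket counts; the last slot is the +Inf bucket
	sum    float64
	total  uint64
}

func newHistogram(bounds ...float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// Observe records v in the first bucket whose upper bound is >= v.
func (h *histogram) Observe(v float64) {
	h.mu.Lock()
	defer h.mu.Unlock()
	i := 0
	for i < len(h.bounds) && v > h.bounds[i] {
		i++
	}
	h.counts[i]++
	h.sum += v
	h.total++
}

// Render emits cumulative buckets plus _sum and _count lines.
func (h *histogram) Render(name string) string {
	h.mu.Lock()
	defer h.mu.Unlock()
	var b strings.Builder
	fmt.Fprintf(&b, "# TYPE %s histogram\n", name)
	var cum uint64
	for i, ub := range h.bounds {
		cum += h.counts[i]
		fmt.Fprintf(&b, "%s_bucket{le=\"%g\"} %d\n", name, ub, cum)
	}
	fmt.Fprintf(&b, "%s_bucket{le=\"+Inf\"} %d\n", name, h.total)
	fmt.Fprintf(&b, "%s_sum %g\n", name, h.sum)
	fmt.Fprintf(&b, "%s_count %d\n", name, h.total)
	return b.String()
}

func main() {
	h := newHistogram(1, 10) // hypothetical bounds, in seconds
	for _, d := range []float64{1, 5, 100} {
		h.Observe(d)
	}
	fmt.Print(h.Render("job_duration_seconds")) // hypothetical metric name
}
```

Bucket counts are stored per bucket and accumulated at render time, which keeps `Observe` to a single increment under the lock.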

Docs in docs/prometheus.md (enable + scrape config + metric
reference + dashboard import). Sample dashboard at
deploy/grafana/restic-manager-dashboard.json - six panels,
Grafana schema 39, single Prometheus datasource variable.

Tests: golden render, concurrent observe, bucket boundaries in
the metrics package; auth matrix (no auth -> 404, token gate,
CIDR gate, both required) in the HTTP layer.
2026-05-07 23:17:15 +01:00
steve 07bce16c84 Merge pull request 'P6-03 repo size trend + agent-update UI fix + dashboard polish' (#21) from tidy-up-last-backup-projection into main
Reviewed-on: #21
2026-05-07 22:00:03 +00:00
steve a28bda2031 smoke env: systemd --user unit + Make targets so the dev server outlives shell tool boundaries
Spent half an evening fighting a smoke server that kept getting SIGTERM'd
mid-iteration. Root cause: backgrounded processes spawned from sandboxed
shell tool calls don't outlive the parent — even with nohup + disown.

Fix: hand the server to user-systemd as a transient unit so its lifecycle
is owned by the user's session, not by whichever bash subprocess started it.
New Make targets:

  make smoke-restart   build server + (re)launch as systemd --user unit
  make smoke-status    show unit status
  make smoke-logs      tail $HOME/smoke/server.log
  make smoke-stop      stop the unit
  make smoke-deploy    full rebuild + restage agent assets + restart

Documents the workflow in CLAUDE.md so the next session doesn't relitigate.
2026-05-07 22:55:36 +01:00
steve 51192c3603 ui+store: dashboard polish — repo size projection + header alignment
- Project total_size_bytes onto hosts.repo_size_bytes inside the
  UpsertHostRepoStats transaction. The hosts row column has been
  unwritten since the initial schema in 0001, so the dashboard's
  Repo size cell has always rendered '—' even after backups. Now
  the column updates atomically alongside the host_repo_stats row,
  and FleetSummary's SUM(repo_size_bytes) becomes accurate too.
- Right-align the Alerts column header so it sits over its
  right-aligned value (was floating left of column, ambiguous).
- Add text-ink-mid to the 30d trend / Alerts / Tags headers so all
  column headers share the same brightness.
2026-05-07 22:55:21 +01:00
steve 06fd440dd4 ui: chart polish — rotated y-axis labels, wider viewBox, single-day fallback
- Add rotated 'Size' (left) and 'Snapshots' (right) axis titles in
  the chart's outer margins so the two y-axes are self-describing.
- Bump the chart viewBox from 600x220 to 640x220 and lift padL from
  56 to 72 so the rotated labels and byte tick numbers don't crowd.
- Dedupe the X-axis labels for short windows (1 or 2 days collapsed
  the start/mid/end indices onto each other, stacking 'May 7' three
  times); the 1-day case now centres a single label, 2-day uses
  start+end only.
- Pin a lone data dot to the chart centre instead of the left edge
  when len(days)==1, so it sits under the centred date label.
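
The X-axis dedupe rule can be sketched as a small index picker. `labelIndices` is a hypothetical name, and the midpoint choice for longer windows is an assumption; only the 1-day and 2-day behaviour comes from the description above.

```go
package main

import "fmt"

// labelIndices returns the indices of the days that get an X-axis label.
// One day yields a single (centred) label, two days yield start and end,
// and longer windows yield start, middle, and end.
func labelIndices(n int) []int {
	switch {
	case n <= 0:
		return nil
	case n == 1:
		return []int{0}
	case n == 2:
		return []int{0, 1}
	default:
		return []int{0, n / 2, n - 1}
	}
}

func main() {
	for _, n := range []int{1, 2, 30} {
		fmt.Println(n, labelIndices(n))
	}
}
```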

Goldens regenerated.
2026-05-07 22:55:12 +01:00
steve 28c8b58f93 ui: per-host Jobs sub-tab; drop unused Settings stub
Adds /hosts/{id}/jobs page listing recent jobs for the host (newest
first, capped at 100) with click-through to /jobs/{id}. Converts the
Jobs placeholder <div> to a real <a> nav link; removes the Settings
stub entirely. Also registers durationHuman template func and a
.jobs-row CSS grid to match the existing .schd-row idiom.
2026-05-07 22:49:10 +01:00
steve 6ef58a707e ws: synthesize job.finished from update watcher so browser stream wakes up 2026-05-07 20:32:48 +01:00
steve 001575ae9c tasks: P6-03 done, repo size trend graphs 2026-05-07 19:20:05 +01:00
steve 28cc55711d test: assert Trend panel renders on full repo page 2026-05-07 19:14:34 +01:00
steve 98cc490ea8 ui: trend panel + range selector on host repo page 2026-05-07 19:10:59 +01:00
steve be4ac02ddd ui: 30d repo-size sparkline on every dashboard host row 2026-05-07 19:02:35 +01:00
steve 6e8a1c5b45 web/sparkline: guard days[i] against shorter days slice in RenderChart 2026-05-07 18:58:33 +01:00
steve e7d25cd704 web/sparkline: two-axis trend chart with hover dots 2026-05-07 18:55:31 +01:00
steve db88c5a7d1 web/sparkline: inline-SVG sparkline renderer (empty / single / multi) 2026-05-07 18:50:23 +01:00
steve bb2a88be24 ws: record daily repo stats history alongside current upsert 2026-05-07 18:46:26 +01:00
steve b9c7ec6ebf store: history table helpers (upsert/list, COALESCE preserves prior values) 2026-05-07 18:43:20 +01:00
steve da518de3e6 store: migration 0023 host_repo_stats_history 2026-05-07 18:39:44 +01:00
steve 55453300b0 Merge pull request 'tidy: project finished backup jobs onto host row + smoke doc tweaks' (#20) from tidy-up-last-backup-projection into main
Reviewed-on: #20
2026-05-07 16:58:16 +00:00
steve 0a75b82c17 fix: project finished backup jobs onto host row + smoke path tweaks
The dashboard's 'Last backup' column reads hosts.last_backup_at /
last_backup_status, but the WS handler only updated hosts.repo_status
on job.finished — backup terminations were silently dropped. Add a
SetHostLastBackup store method and call it from the same job.finished
switch that already handles init jobs.

Also: CLAUDE.md restage block uses /tmp/rm-smoke (the original
default) but the actual dev env runs out of $HOME/smoke. Update the
paths in the doc to match.
2026-05-07 17:55:23 +01:00
steve b60c2c6f6b Merge pull request 'P6-01 + P6-02: agent self-update + fleet update' (#19) from p6-agent-self-update into main
Reviewed-on: #19
2026-05-07 16:49:25 +00:00
steve 1909f71f90 tasks: mark P6-01 + P6-02 done with as-shipped block 2026-05-06 22:33:33 +01:00
steve dddff10b99 agent unit: allow writes to /usr/local/bin for self-update
Smoke caught this: ProtectSystem=full mounts /usr read-only so the
agent couldn't write its own .new staging file or atomic-rename over
the running binary. Adding /usr/local/bin to ReadWritePaths is the
minimum diff that lets self-update work; the whole-dir grant is
required because os.Rename needs write on the parent directory.
2026-05-06 22:32:50 +01:00
steve 39304b08d0 ui: dashboard hosts-behind tile + filter
- Add ?updates=behind query filter and the matching dashboardFilter
  field; round-trips through encode/parse.
- Compute UpdatesBehind on the dashboard view-model (online + version
  trailing the server) and surface as an amber hero tile that links
  to the filtered list.
- Test covering the new filter case.
2026-05-06 22:20:54 +01:00
steve 9bcd8bc5fe ui: update chip + per-host button
- Surface UpdateAvailable + TargetVersion on the dashboard host row,
  the host_chrome header, and the JSON Host shape.
- New host_update_chip partial renders an amber out-of-date pill
  next to the agent-version display when the host's agent trails
  the server.
- Host detail right-rail gains an admin-only Update agent button
  (disabled when host is offline or already updating).
- New .update-chip and .btn-amber CSS tokens; tailwind output
  refreshed.
2026-05-06 22:20:40 +01:00
steve e6cfb1cd9f ui: fleet update page + endpoints
- POST /api/fleet/update, POST /api/fleet-updates/{id}/cancel,
  GET /api/fleet-updates/{id} (admin-only).
- GET /settings/fleet-update + /partial for htmx polling.
- Renders idle / running / terminal states with per-host progress.
- Tests cover happy path, derive-host-ids, conflict, cancel, get,
  and RBAC.
2026-05-06 22:20:03 +01:00
steve 9d5775fb47 p6-01/02: agent self-update + fleet update server cluster
- alert: update_failed (per-host, dedup=hostID) + fleet_update_halted
  (system-scoped, host_id NULL via new RaiseOrTouchSystem helper).
- ws: UpdateWatcher tracks in-flight command.update dispatches and
  reconciles them against incoming hello envelopes — success path
  marks the job succeeded and auto-resolves the alert; 90s timeout
  marks the job failed and raises update_failed.
- http: POST /api/hosts/{id}/update (admin-only JSON) + the HTMX
  /hosts/{id}/update form variant. Pre-checks: host exists, online,
  agent_version != current, no running update job. Refactored core
  into Server.dispatchHostUpdate so the fleet worker can share it
  without going through HTTP.
- fleetupdate: rolling worker iterating through host slots, halting
  on first failure and raising fleet_update_halted. Polling-based
  version-match (re-read hosts.agent_version every 1s up to 95s) —
  no extra plumbing into the WS hello path. At-most-one-running is
  enforced at the store layer (ErrFleetUpdateRunning).
- cmd/server: wire UpdateWatcher and FleetWorker into the main
  goroutine; the worker uses a small serverDispatcher adapter that
  delegates back into Server.DispatchHostUpdate.
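
The rolling worker's control flow, as described in the fleetupdate bullet, might be sketched like this. Everything here is hypothetical scaffolding: the `fleet` interface, `rollOut`, and the fake are illustrative names, and the poll interval and timeout are parameters so the sketch runs quickly instead of using the 1s and 95s figures above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// fleet is a hypothetical seam over the store and the dispatcher.
type fleet interface {
	AgentVersion(hostID string) (string, error)
	DispatchUpdate(hostID string) error
}

var errHalted = errors.New("fleet update halted")

// waitForVersion re-reads the host's agent version until it matches target.
func waitForVersion(f fleet, hostID, target string, poll, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if v, err := f.AgentVersion(hostID); err == nil && v == target {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("host %s did not reach %s within %s", hostID, target, timeout)
		}
		time.Sleep(poll)
	}
}

// rollOut updates hosts one at a time and halts on the first failure.
func rollOut(f fleet, hosts []string, target string, poll, timeout time.Duration) ([]string, error) {
	var updated []string
	for _, id := range hosts {
		v, err := f.AgentVersion(id)
		if err != nil {
			return updated, fmt.Errorf("%w: %v", errHalted, err)
		}
		if v == target {
			continue // already at target: skip
		}
		if err := f.DispatchUpdate(id); err != nil {
			return updated, fmt.Errorf("%w: %v", errHalted, err)
		}
		if err := waitForVersion(f, id, target, poll, timeout); err != nil {
			return updated, fmt.Errorf("%w: %v", errHalted, err)
		}
		updated = append(updated, id)
	}
	return updated, nil
}

// fakeFleet flips a host to the target version on dispatch unless it is stuck.
type fakeFleet struct {
	versions map[string]string
	target   string
	stuck    map[string]bool
}

func (f *fakeFleet) AgentVersion(id string) (string, error) {
	v, ok := f.versions[id]
	if !ok {
		return "", fmt.Errorf("unknown host %s", id)
	}
	return v, nil
}

func (f *fakeFleet) DispatchUpdate(id string) error {
	if !f.stuck[id] {
		f.versions[id] = f.target
	}
	return nil
}

func main() {
	f := &fakeFleet{
		versions: map[string]string{"a": "v1", "b": "v2", "c": "v1", "d": "v1"},
		target:   "v2",
		stuck:    map[string]bool{"c": true},
	}
	updated, err := rollOut(f, []string{"a", "b", "c", "d"}, "v2",
		time.Millisecond, 20*time.Millisecond)
	fmt.Println("updated:", updated)
	fmt.Println("error:", err)
}
```

In the demo, host `b` is skipped, host `c` times out and halts the run, and host `d` is never touched.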

Tests: watcher (success/timeout/mismatch/late-hello), HTTP endpoint
(happy + four pre-check branches + RBAC), worker (two-host happy,
timeout-halt, host-offline-halt, already-at-target skip, cancel
mid-run, double-Start guard).
2026-05-06 22:03:50 +01:00
steve c37954aa3f store: migrations 0021+0022 + fleet_updates CRUD 2026-05-06 21:47:54 +01:00
steve efed96f67a agent: command.update handler + updater package (Linux + Windows) 2026-05-06 21:42:50 +01:00
steve f31f6edde7 http: expose GET /api/version 2026-05-06 21:39:13 +01:00
steve 516c50fa16 version: build-time version package + Makefile ldflags wiring 2026-05-06 21:38:35 +01:00
steve a8256f5aff tasks: rewrite P6-01/02 around server-bundled agent self-update
The original plan was apt repo + Chocolatey package. The P5-03 Docker
pivot bundled matching agent binaries into the server image and
exposes them via /agent/binary, so 'update agent' now collapses to
're-fetch from your own server'. No third-party packaging or signing
infra needed. P6-01 drops to S; P6-02 keeps the dashboard reporting
+ fleet-update UX but points at the new mechanism.
2026-05-06 21:08:22 +01:00
steve ab7fee0ae7 ci(release): use DEV_TOKEN for registry login
The auto-issued GITHUB_TOKEN lacks write:package scope on this Gitea
instance, so the v0.9.0 tag build failed at docker login. Switch to
the user-level DEV_TOKEN secret which has the correct scope.
v0.9.0
2026-05-06 19:05:54 +01:00
steve ed276813f0 Merge pull request 'testing: bootstrap UI, agent reliability, NS-01..04 + alert username' (#18) from ns-batch-host-ops into main 2026-05-05 21:09:17 +00:00
steve 02e4ef7544 testing: bootstrap UI, agent reliability, NS-01..04 + alert username
Smooths the rough edges that came up exercising a live deployment.

First-run bootstrap UI: /bootstrap renders a username + password form
that uses the in-memory token directly (operator no longer copies it
out of the log); /login redirects there while bootstrap is available.

Agent reliability: failJob synthetic envelopes so command.run early
returns no longer hang the server-side job; runtime probe of restic
restore --help drives --no-ownership instead of version sniffing
(0.18.x had it removed). Server unit re-shaped: ProtectSystem=full
plus ReadWritePaths=/etc/restic-manager, no ProtectHome — restore
can now write anywhere a user might want.
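
Probing for a flag at runtime instead of sniffing the version could be sketched as follows. The function names are hypothetical and the token matching is a simplification. The sketch only assumes that `restic restore --help` prints its flag list, and it treats a failed probe as "do not pass the flag".

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// helpMentionsFlag reports whether flag appears as a whole token in help
// text, so that a longer flag sharing the same prefix does not match.
func helpMentionsFlag(help, flag string) bool {
	for _, tok := range strings.Fields(help) {
		tok = strings.TrimRight(tok, ",")
		if tok == flag || strings.HasPrefix(tok, flag+"=") {
			return true
		}
	}
	return false
}

// resticSupports runs `<bin> restore --help` and looks for flag in the output.
func resticSupports(bin, flag string) (bool, error) {
	out, err := exec.Command(bin, "restore", "--help").CombinedOutput()
	if err != nil {
		return false, err
	}
	return helpMentionsFlag(string(out), flag), nil
}

func main() {
	ok, err := resticSupports("restic", "--no-ownership")
	if err != nil {
		fmt.Println("probe failed, omitting --no-ownership:", err)
		return
	}
	fmt.Println("pass --no-ownership:", ok)
}
```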

Restore wizard: default target is /root/rm-restore/<job-id>/ with
clearer help text. Re-init confirm input uses .field (was .input,
which doesn't exist — text was invisible).

NS-01 host delete: store DeleteHost, admin-band /hosts/{id}/delete
with hostname-confirm danger zone, audit, FK cascade, live WS close.

NS-02 enrollment-token recovery: outstanding-tokens panel on
/hosts/new, regenerate (preserves attachments) and revoke handlers
+ audit, store-level ListOutstandingEnrollmentTokens and
DeleteEnrollmentToken.

NS-03 repo init / probe surface: migration 0020 adds
hosts.repo_status + repo_status_error; WS handler projects every
init job's outcome onto the host row (idempotent already-initialised
collapses to ready); creds-save resets status and dispatches a fresh
probe; /hosts/{id}/repo/probe retry endpoint with banner.

NS-04 dashboard live + sort + filter: query-string filter
(q/status/repo_status/tag/sort/dir), 5s htmx live poll mirroring the
alerts pattern with a localStorage live toggle, sortable column
headers, filter row + clear.

Alerts page: ack'd-by line resolves user_id ULID to username.

Compose.yaml ignored — host-specific.
2026-05-05 22:03:15 +01:00
steve ddb46e16b6 Merge pull request 'P5-03 + P5-07: docker-only release path & reference deployment' (#17) from p5-03-docker-release into main
Reviewed-on: #17
2026-05-05 16:36:08 +00:00
steve e8913943f9 p5-07: reference deployment (server-only compose + reverse-proxy docs)
The reverse proxy is assumed to live outside this project (Caddy,
nginx, Traefik, whatever the operator already runs). The reference
compose stands up only the server: image-pinned via RM_VERSION,
named volume for operator state, localhost-bound so the proxy
reaches it on loopback.

docs/reverse-proxy.md covers what the proxy must forward — the
X-Forwarded-* headers, Host, and Connection: upgrade for the agent
WebSocket and live-log streams — plus the RM_TRUSTED_PROXY CIDR
rule that gates header trust. Worked examples for Caddy, nginx
(with the websocket upgrade map + 1h proxy_read_timeout for live
logs), and Traefik.
2026-05-05 17:15:00 +01:00
steve fb978ad10c p5-03: docker-only release path (drop goreleaser)
Single public deliverable per tag: a multi-arch server image, with
cross-compiled agent binaries + install scripts + the systemd unit
baked under /opt/restic-manager/dist/. The /agent/binary and
/install/* handlers fall back from <DataDir>/... to that read-only
path so a fresh container Just Works without first-run staging;
operators can still drop a custom build into <DataDir>/ to override
per-host.
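
The fallback lookup can be sketched as a two-root resolver. `resolveAsset`, `runDemo`, and the file name are hypothetical. A real handler would also need to reject path traversal in the requested name, which this sketch leaves out.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// resolveAsset prefers an operator-supplied file under dataDir and falls
// back to the read-only copy baked into the image under distDir.
func resolveAsset(dataDir, distDir, rel string) (string, error) {
	for _, root := range []string{dataDir, distDir} {
		p := filepath.Join(root, rel)
		info, err := os.Stat(p)
		if err == nil && !info.IsDir() {
			return p, nil
		}
		if err != nil && !errors.Is(err, fs.ErrNotExist) {
			return "", err // a real I/O problem, not a missing file
		}
	}
	return "", fs.ErrNotExist
}

// runDemo reports which root wins before and after an override is dropped
// into the data directory. Temp dirs stand in for the real paths.
func runDemo() (string, string, error) {
	dataDir, err := os.MkdirTemp("", "data")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dataDir)
	distDir, err := os.MkdirTemp("", "dist")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(distDir)

	const name = "agent-linux-amd64" // hypothetical asset name
	which := func() (string, error) {
		p, err := resolveAsset(dataDir, distDir, name)
		if err != nil {
			return "", err
		}
		if p == filepath.Join(dataDir, name) {
			return "data", nil
		}
		return "dist", nil
	}

	if err := os.WriteFile(filepath.Join(distDir, name), []byte("bundled"), 0o644); err != nil {
		return "", "", err
	}
	before, err := which()
	if err != nil {
		return "", "", err
	}
	if err := os.WriteFile(filepath.Join(dataDir, name), []byte("custom"), 0o644); err != nil {
		return "", "", err
	}
	after, err := which()
	return before, after, err
}

func main() {
	before, after, err := runDemo()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println("fresh container serves from:", before)
	fmt.Println("after operator override serves from:", after)
}
```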

Architecture rationale: agent distribution already routes through
the running server, so the release surface mirrors that — there's
no second source of truth to keep in sync.

Workflow .gitea/workflows/release.yml triggers on v*.*.* tag-push
(fan-out :vX.Y.Z / :X.Y / :X, plus :latest once MAJOR>=1) and
workflow_dispatch (snapshot tag only). Pushes to the Gitea
container registry on this instance.
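
The tag fan-out rule is small enough to state as code. The actual workflow lives in YAML, so this Go function is only an illustration of the rule, with `imageTags` as a hypothetical name.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// imageTags expands a vX.Y.Z git tag into the image tags described above:
// vX.Y.Z, X.Y, X, plus latest once the major version is at least 1.
func imageTags(gitTag string) ([]string, error) {
	if !strings.HasPrefix(gitTag, "v") {
		return nil, fmt.Errorf("tag %q is not of the form vX.Y.Z", gitTag)
	}
	parts := strings.Split(strings.TrimPrefix(gitTag, "v"), ".")
	if len(parts) != 3 {
		return nil, fmt.Errorf("tag %q is not of the form vX.Y.Z", gitTag)
	}
	nums := make([]uint64, 3)
	for i, p := range parts {
		n, err := strconv.ParseUint(p, 10, 32)
		if err != nil {
			return nil, fmt.Errorf("tag %q: bad component %q", gitTag, p)
		}
		nums[i] = n
	}
	tags := []string{gitTag, parts[0] + "." + parts[1], parts[0]}
	if nums[0] >= 1 {
		tags = append(tags, "latest")
	}
	return tags, nil
}

func main() {
	for _, t := range []string{"v0.9.0", "v1.4.2"} {
		tags, err := imageTags(t)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Println(t, "->", strings.Join(tags, ", "))
	}
}
```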

Both binaries grow main.commit + main.date ldflag targets. Makefile
and Dockerfile fill them; release workflow forwards from gitea.sha
plus a UTC timestamp.

Spec : docs/superpowers/specs/2026-05-05-p5-03-docker-only-release.md
Plan : docs/superpowers/plans/2026-05-05-p5-03-docker-only-release.md
2026-05-05 15:18:48 +01:00
steve 9abdedf40a Merge pull request 'P4-05: OIDC login (generic, JIT-provisioned)' (#16) from p4-05-oidc into main
Reviewed-on: #16
2026-05-05 13:46:23 +00:00