Commit Graph

304 Commits

Author SHA1 Message Date
steve 06748f5582 Merge pull request 'ui(relTime): tick relative timestamps client-side' (#28) from fix-stale-reltime into main
Release / Build + push image (push) Successful in 3m52s
Reviewed-on: #28
v1.0.1
2026-05-15 20:14:08 +00:00
steve a4d705db6b Merge branch 'main' into fix-stale-reltime
CI / Test (store) (pull_request) Successful in 1m15s
CI / Lint (pull_request) Successful in 19s
CI / Build (windows/amd64) (pull_request) Successful in 25s
CI / Test (server-http) (pull_request) Successful in 2m2s
CI / Test (rest) (pull_request) Successful in 2m12s
CI / Build (linux/amd64) (pull_request) Successful in 26s
CI / Build (linux/arm64) (pull_request) Successful in 26s
e2e / Playwright vs docker-compose (pull_request) Successful in 2m59s
2026-05-15 20:05:45 +00:00
steve c6f73f790d ci: pull ci-runner-go from zot registry 2026-05-15 19:51:02 +00:00
steve 068f08d96d ci: migrate release workflow to zot registry 2026-05-15 19:50:50 +00:00
steve 28ef9750d3 ui(relTime): tick relative timestamps client-side so long-open tabs don't freeze
CI / Test (rest) (pull_request) Successful in 9s
CI / Test (store) (pull_request) Successful in 6s
CI / Build (windows/amd64) (pull_request) Successful in 8s
CI / Build (linux/amd64) (pull_request) Successful in 7s
CI / Lint (pull_request) Successful in 19s
CI / Build (linux/arm64) (pull_request) Successful in 7s
e2e / Playwright vs docker-compose (pull_request) Successful in 1m26s
CI / Test (server-http) (pull_request) Successful in 2m34s
formatRelTime now wraps its label in <time data-rel-ts=...>, and
both layouts include a small ticker that re-renders every 30s.
Without this, a job-detail page rendered an hour ago kept showing
'2h ago' when the wall-clock truth was '3h ago'.
2026-05-10 07:37:03 +01:00
steve f4db0b17e8 Merge pull request 'fix(version): single-source internal/version, fix dockerfile ldflags' (#27) from fix-version-ldflags into main
Release / Build + push image (push) Successful in 3m58s
2026-05-09 14:26:50 +00:00
steve 8afda7cd8c fix(version): use internal/version as single source for build constants
CI / Test (store) (pull_request) Successful in 5s
CI / Test (rest) (pull_request) Successful in 9s
CI / Build (windows/amd64) (pull_request) Successful in 7s
CI / Test (server-http) (pull_request) Successful in 17s
CI / Build (linux/amd64) (pull_request) Successful in 7s
CI / Lint (pull_request) Successful in 19s
CI / Build (linux/arm64) (pull_request) Successful in 14s
e2e / Playwright vs docker-compose (pull_request) Successful in 1m27s
The Dockerfile only set `-X main.version=...`, so docker-built binaries
left `internal/version.Version` at its default "dev". The update logic
(host_update.go:61, hosts.go:94, fleet_update.go:101 et al.) compares
against `internal/version.Version`, so a v1.0.0 host always looked
out-of-date to a v1.0.0 server, the chip never cleared, and pressing
"update" re-downloaded the same bundled binary on a loop.

Collapse the two version sources: drop the `var version/commit/date`
locals in cmd/{server,agent}/main.go, route everything through
internal/version (now also carrying Date), and have both the Dockerfile
and the Makefile set the same single set of -X flags. Verified
end-to-end: make build and docker build both emit binaries whose
--version reflects the build VERSION.
2026-05-09 15:20:13 +01:00
steve 123e4f4915 scrub: remove docs/superpowers and ask.md; gitignore them
These were never meant for the public repo. Wiped from history in
the same change set via git-filter-repo.
2026-05-09 14:23:29 +01:00
steve 7b035a8f09 Merge pull request 'v1 readiness: CHANGELOG + threat model + first-run onboarding polish' (#26) from v1-readiness into main
Release / Build + push image (push) Successful in 2m16s
Reviewed-on: #26
v1.0.0
2026-05-09 11:52:33 +00:00
steve 7a813cacd3 first-run: keep 'bootstrap token' phrase so e2e log-scraper still matches
The CI e2e workflow greps for 'bootstrap token' in server logs to capture
the one-shot token. The earlier reword dropped that phrase; restore it on
the headless-instructions line so .gitea/workflows/e2e.yml step 'Capture
bootstrap token from server logs' keeps matching.
2026-05-09 12:49:40 +01:00
steve 1d36dcd668 v1 readiness: CHANGELOG + threat model + first-run onboarding polish
- CHANGELOG.md: Keep-a-Changelog format, v1.0.0 entry summarising
  what each phase delivered.
- docs/threat-model.md: structured walkthrough of assets, actors,
  attack surfaces and residual risks; reviewed against v1.0.0.
- cmd/server/main.go: at first-run startup, print a clickable
  $RM_BASE_URL/bootstrap URL alongside the existing one-shot
  bootstrap token (or a fallback hint when RM_BASE_URL is unset).
- web/templates/pages/bootstrap.html: visible "Minimum 12 characters"
  hint under the password field so the rule is communicated
  before the operator submits.
- tasks.md: close X-01, X-04, X-05 with notes.
2026-05-09 12:29:00 +01:00
steve 755840d9ff Merge pull request 'docs: AI-agent host onboarding guide' (#25) from temp-onboarding into main
Reviewed-on: #25
2026-05-09 11:22:54 +00:00
steve cc638f6456 Added new AI focused document for host onboarding 2026-05-09 12:18:42 +01:00
steve e046be98b2 Merge pull request 'Cleanup: NS-05/NS-06 + drop dead /repos nav link' (#24) from ns-05-06-cleanup into main
Reviewed-on: #24
2026-05-09 11:11:36 +00:00
steve a9c47deb26 nav: drop dead /repos top-level link (repos are per-host, accessed via host sub-tab) 2026-05-09 11:59:08 +01:00
steve 8a7706407d tasks: close NS-05 (setup-go already gone) + NS-06 (drop Run-backup tombstone button) 2026-05-09 11:55:21 +01:00
steve 3101024d1a tasks: queue NS-05 (drop setup-go) + NS-06 (drop disabled Run-backup button)
Two small follow-ups noted while working through the
p5-oss-readiness CI-runner switch:

* NS-05 — actions/setup-go is now redundant; ci-runner-go ships
  Go on PATH and re-downloading on every job costs ~5s a shard.
* NS-06 — host_chrome's per-host "Run backup now" button is a
  permanently-disabled tombstone; remove it so the chrome stops
  advertising an action that no longer exists.
2026-05-08 22:26:59 +01:00
steve 7f98524cfa Merge pull request 'P5: OSS readiness — docs site, contributor onboarding, e2e harness' (#23) from p5-oss-readiness into main
Reviewed-on: #23
2026-05-08 21:22:38 +00:00
steve 41def51977 e2e: dispatch backup via source-group API
Per-host Run-backup is gone — the host_chrome partial still
renders the button but it's hard-disabled with a tooltip
pointing to per-source-group Run-now. The smoke test was
clicking that disabled button and waiting forever for a URL
change that would never happen.

Replace the navigation-based dispatch with two API calls:
create a source group covering the agent's /source mount,
then POST to /api/hosts/{id}/source-groups/{gid}/run. The
backup-status assertion at the end is unchanged — host record
is still the source of truth.
2026-05-08 22:16:57 +01:00
steve b9439da467 api: expose host.repo_status in /api/hosts JSON
The dashboard renders init_running / init_failed / ready state
based on host.repo_status, but the JSON endpoint dropped the
field on its way out. The e2e test couldn't poll for repo
readiness; reflect the same projection the UI uses.
2026-05-08 22:06:22 +01:00
steve 5925d09e8b e2e: wait for repo_status=ready and bump test timeout
Two issues uncovered by the page-snapshot dump after the agent
state-dir fix:

* The host page server-renders `Run backup now` as disabled
  while repo_status != ready, and the page has no live-refresh
  on that field. The test was navigating right after status
  flipped to 'online' but before auto-init had completed (~3s
  later), so the rendered HTML still showed init_running and
  the click was a no-op. Wait for repo_status === 'ready'
  before navigating.

* playwright.config.ts pinned the per-test timeout at 60s,
  but the test itself uses 60s + 120s of internal waits.
  Bump to 240s so the test fails on real regressions instead
  of timing out on its own internal budget.

Renamed the test description away from "under a minute" since
it overpromises against the new timeout. The performance SLO
belongs in a separate test if we want to assert it.
2026-05-08 22:00:24 +01:00
steve cc6844605f e2e: fix agent state-dir to /var/lib/restic-manager
The agent writes its encrypted secrets blob to
$DefaultSecretsPath (/var/lib/restic-manager/secrets.enc) but
the e2e fixtures created and mounted a directory at
/var/lib/restic-manager-agent — name mismatch. Result: every
`config.update` push failed with 'create tmp: no such file or
directory', the auto-init never got the repo creds, the host
landed in init_failed, and the smoke test couldn't kick off a
backup (the Run backup button is disabled while
repo_status != ready).

Align the compose volume mount and the Dockerfile mkdir on
/var/lib/restic-manager so they match the production install
script + the agent's own default.
2026-05-08 21:53:35 +01:00
steve 4cd36d83e3 ui: show pending-hosts panel even when fleet is otherwise empty
The dashboard's empty-state ("No hosts yet.") was gated on
HostCount == 0 alone, which hid the pending-hosts panel — and
the inline accept form — for the most common first-run scenario:
operator just installed an agent that announced, the fleet has
zero accepted hosts, and the only thing the operator needs to do
is review fingerprint + click Accept.

Tighten the gate so the empty state only shows when there are
truly zero hosts and zero pending announces. With a pending
host, fall through to the regular dashboard layout so the
approval queue is visible and actionable.

Caught by the e2e enrol-via-announce smoke test (now unblocked
on PR #23).
2026-05-08 21:47:31 +01:00
steve 68276810ec e2e: dump error-context.md to log on failure + bump upload-artifact
The Playwright run produces error-context.md per failed test
with a full DOM snapshot — useful for triaging UI test failures
without round-tripping through downloaded artifacts. Cat it
into the workflow log on failure.

Also bump actions/upload-artifact v3 → v4. v3 uploads still
return success on this Gitea runner but the artifacts don't
surface through the API or UI; v4 is the correct version per
the workflow header note.
2026-05-08 21:41:38 +01:00
steve e8804922b5 e2e: extract Playwright report via docker cp instead of bind mount
When the runner job runs inside a container, compose's relative
`./playwright/playwright-report` resolves to a path that exists
only inside the runner container, so the host's docker daemon
silently bind-mounts an empty dir and the report never lands
anywhere we can read.

Drop the bind mounts; keep the playwright container around
(--name e2e-pw, no --rm); after the test, `docker cp` the
report and traces out into the runner's workspace volume so
upload-artifact has something real to upload. The new test-results
directory (Playwright traces, screenshots, videos) is also
included so failure post-mortem doesn't need a re-run.
2026-05-08 21:36:09 +01:00
steve a9c6a060d4 runner tests: probe-exec setupScript to clear overlayfs ETXTBSY
The original write-tmp-then-rename guard handles the ETXTBSY race
on a vanilla filesystem, but inside the new ci-runner-go
container our jobs land on overlayfs, which keeps a lagged
"writable inode" view long enough to leak ETXTBSY into the
exec the test does milliseconds later.

After rename, probe-exec the file with a benign argument
("__rm_probe__" — every script's case statement falls through
to a clean exit) until exec succeeds. Each script body is shaped
`case "$1" in restore) ... ;; esac` so the probe is a no-op.
3s deadline keeps a stuck filesystem from hanging the suite.
2026-05-08 21:26:35 +01:00
steve a8026608ae ci: force bash as default shell in container jobs
When jobs run with `container:` set, Gitea Actions defaults to
`sh -e` (dash on Ubuntu), so `set -euo pipefail` fails with
"Illegal option -o pipefail". Pinning bash workflow-wide
matches what the runner used pre-container and keeps existing
scripts portable.
2026-05-08 21:10:33 +01:00
steve 6c23bdbe63 ci: run jobs in ci-runner-go container
Pin every job to gitea.dcglab.co.uk/steve/ci-runner-go:2026-05-08
so Go, Node, and Docker tooling are already installed when the
job starts. Drops three actions/setup-go invocations from ci.yml
(redundant — Go is on PATH) and inherits Buildx + Compose v2 in
e2e.yml and release.yml without per-job apt-installs.

Recipe lives in steve/ci. Bump the date pin in lockstep across
the three workflows when picking up a fresher image (e.g. when
the Go floor moves).
2026-05-08 21:06:38 +01:00
steve a087321570 e2e: build playwright image with --profile test --pull
Without --profile test, `docker compose build` skips the
playwright service (profiles: [test]) and the image is built
on-demand by `compose run` instead. Across CI runs the Gitea
runner caches the resulting tag, so a Dockerfile FROM bump
(v1.50.0 → v1.59.1) is masked by the cached image — the
container ends up with old browser binaries and Playwright's
own version-mismatch check fails the suite. Pull base images
on every build so the FROM tag wins.
2026-05-08 20:15:21 +01:00
steve e8f7502a7f e2e: pin Playwright to 1.59.1
`@playwright/test` was loose-pinned to ^1.50.0; npm resolved it
to 1.59.1 inside the runner image, which only ships browser
binaries for 1.50.0. Pin both the package and the docker image
to v1.59.1 so deps and binaries stay aligned.
2026-05-08 20:09:17 +01:00
steve af2cb292b8 e2e: run health probe + Playwright on the compose network
Gitea's act-style runners execute workflow steps inside a runner
container, so compose's host port-publish (127.0.0.1:8080:8080) is
not reachable from the steps. PR #23's e2e job timed out waiting
for the server even though the container was up and listening.

Move both the health probe and the Playwright run onto rmnet so
they address the server as http://server:8080:

* health probe: docker run --rm --network e2e_rmnet curlimages/curl
* Playwright: new mcr.microsoft.com/playwright-based image, added
  as a profile-gated `playwright` service in compose.e2e.yml,
  invoked via `docker compose run --rm playwright`. Drops the
  setup-node + npm install runner steps.
2026-05-08 20:08:23 +01:00
steve bb4ed3502d P5: OSS readiness — docs site, contributor onboarding, e2e harness
P5-01 — Documentation site under docs/book/ rendered with mdBook
(downloaded via Makefile, same static-binary pattern as Tailwind).
Structured chapters: getting started, concepts, operations,
security, reference. `make docs` / `make docs-watch`. Generated
output gitignored.

P5-02 — CONTRIBUTING.md rewritten from placeholder to a full
guide. CODE_OF_CONDUCT.md adapted from Contributor Covenant for a
single-maintainer project. .gitea/issue_template/{bug,feature}.md
and PULL_REQUEST_TEMPLATE.md.

P5-04 — Six README screenshots captured live from a fresh server
bootstrap (login, empty dashboard, add-host, alerts, settings,
audit log). README rewritten to centre the screenshot grid and
link out to the docs site.

P5-05 — SECURITY.md with disclosure policy (3-day ack, 30-day
default window), scope in/out, threat-model summary, operator
hardening checklist. Mirrored as a docs-site chapter.

P5-06 — End-to-end test harness. e2e/compose.e2e.yml brings up
server + sibling Linux agent (alpine + restic) + restic/rest-server.
Agent uses announce-and-approve so Playwright can drive the full
operator flow: bootstrap → login → accept pending → backup →
verify terminal status. Second spec scrapes /metrics to assert
the P6-04 endpoint surface. .gitea/workflows/e2e.yml runs on every
PR; local how-to in docs/e2e.md.
2026-05-08 20:08:23 +01:00
steve ff8a5dbead Merge pull request 'spec+plan: P6-04/05 prometheus /metrics + Grafana dashboard' (#22) from p6-04-05-prometheus-metrics into main
Reviewed-on: #22
2026-05-08 18:31:57 +00:00
steve ccd14f7cee P6-04+05: Prometheus /metrics endpoint + Grafana dashboard
New internal/server/metrics package emits the legacy text/plain
exposition format directly, so we don't pull in
prometheus/client_golang. Endpoint is opt-in via RM_METRICS_TOKEN
and/or RM_METRICS_TRUSTED_CIDR; route is not mounted at all if
neither gate is set. Both gates ANDed when both configured.

Per-host gauges (online, last_backup_*, repo_size_bytes,
snapshot_count, open_alerts, repo_status), server gauges
(hosts_total/online, active_alerts by severity, build_info), and
an in-memory job-duration histogram observed from the existing
MsgJobFinished branch in the WS handler.

Docs in docs/prometheus.md (enable + scrape config + metric
reference + dashboard import). Sample dashboard at
deploy/grafana/restic-manager-dashboard.json - six panels,
Grafana schema 39, single Prometheus datasource variable.

Tests: golden render, concurrent observe, bucket boundaries in
the metrics package; auth matrix (no auth -> 404, token gate,
CIDR gate, both required) in the HTTP layer.
2026-05-07 23:17:15 +01:00
steve 07bce16c84 Merge pull request 'P6-03 repo size trend + agent-update UI fix + dashboard polish' (#21) from tidy-up-last-backup-projection into main
Reviewed-on: #21
2026-05-07 22:00:03 +00:00
steve a28bda2031 smoke env: systemd --user unit + Make targets so the dev server outlives shell tool boundaries
Spent half an evening fighting a smoke server that kept getting SIGTERM'd
mid-iteration. Root cause: backgrounded processes spawned from sandboxed
shell tool calls don't outlive the parent — even with nohup + disown.

Fix: hand the server to user-systemd as a transient unit so its lifecycle
is owned by the user's session, not by whichever bash subprocess started it.
New Make targets:

  make smoke-restart   build server + (re)launch as systemd --user unit
  make smoke-status    show unit status
  make smoke-logs      tail $HOME/smoke/server.log
  make smoke-stop      stop the unit
  make smoke-deploy    full rebuild + restage agent assets + restart

Documents the workflow in CLAUDE.md so the next session doesn't relitigate.
2026-05-07 22:55:36 +01:00
steve 51192c3603 ui+store: dashboard polish — repo size projection + header alignment
- Project total_size_bytes onto hosts.repo_size_bytes inside the
  UpsertHostRepoStats transaction. The hosts row column has been
  unwritten since the initial schema in 0001, so the dashboard's
  Repo size cell has always rendered '—' even after backups. Now
  the column updates atomically alongside the host_repo_stats row,
  and FleetSummary's SUM(repo_size_bytes) becomes accurate too.
- Right-align the Alerts column header so it sits over its
  right-aligned value (was floating left of column, ambiguous).
- Add text-ink-mid to the 30d trend / Alerts / Tags headers so all
  column headers share the same brightness.
2026-05-07 22:55:21 +01:00
steve 06fd440dd4 ui: chart polish — rotated y-axis labels, wider viewBox, single-day fallback
- Add rotated 'Size' (left) and 'Snapshots' (right) axis titles in
  the chart's outer margins so the two y-axes are self-describing.
- Bump the chart viewBox from 600x220 to 640x220 and lift padL from
  56 to 72 so the rotated labels and byte tick numbers don't crowd.
- Dedupe the X-axis labels for short windows (1 or 2 days collapsed
  the start/mid/end indices onto each other, stacking 'May 7' three
  times); the 1-day case now centres a single label, 2-day uses
  start+end only.
- Pin a lone data dot to the chart centre instead of the left edge
  when len(days)==1, so it sits under the centred date label.

Goldens regenerated.
2026-05-07 22:55:12 +01:00
steve 28c8b58f93 ui: per-host Jobs sub-tab; drop unused Settings stub
Adds /hosts/{id}/jobs page listing recent jobs for the host (newest
first, capped at 100) with click-through to /jobs/{id}. Converts the
Jobs placeholder <div> to a real <a> nav link; removes the Settings
stub entirely. Also registers durationHuman template func and a
.jobs-row CSS grid to match the existing .schd-row idiom.
2026-05-07 22:49:10 +01:00
steve 6ef58a707e ws: synthesize job.finished from update watcher so browser stream wakes up 2026-05-07 20:32:48 +01:00
steve 001575ae9c tasks: P6-03 done, repo size trend graphs 2026-05-07 19:20:05 +01:00
steve 28cc55711d test: assert Trend panel renders on full repo page 2026-05-07 19:14:34 +01:00
steve 98cc490ea8 ui: trend panel + range selector on host repo page 2026-05-07 19:10:59 +01:00
steve be4ac02ddd ui: 30d repo-size sparkline on every dashboard host row 2026-05-07 19:02:35 +01:00
steve 6e8a1c5b45 web/sparkline: guard days[i] against shorter days slice in RenderChart 2026-05-07 18:58:33 +01:00
steve e7d25cd704 web/sparkline: two-axis trend chart with hover dots 2026-05-07 18:55:31 +01:00
steve db88c5a7d1 web/sparkline: inline-SVG sparkline renderer (empty / single / multi) 2026-05-07 18:50:23 +01:00
steve bb2a88be24 ws: record daily repo stats history alongside current upsert 2026-05-07 18:46:26 +01:00
steve b9c7ec6ebf store: history table helpers (upsert/list, COALESCE preserves prior values) 2026-05-07 18:43:20 +01:00
steve da518de3e6 store: migration 0023 host_repo_stats_history 2026-05-07 18:39:44 +01:00