Pull the operator-experience polish out of Phase 4 so a working v1
ships sooner. Phase 4 keeps RBAC + user mgmt (already done), OIDC,
and host tags. Deferred items renumbered as P6-01..P6-05:
  P4-01 → P6-01  apt + Chocolatey update delivery
  P4-02 → P6-02  agent-version-behind-server tracking on dashboard
  P4-06 → P6-03  repo size trend graphs
  P4-08 → P6-04  Prometheus /metrics endpoint
  P4-09 → P6-05  Grafana dashboard JSON + integration docs
None of these gate getting the system into production. They land
after Phase 5 (OSS readiness), in the new Phase 6.
Phase 4 remaining: P4-05 (OIDC login) + P4-07 (per-host tags +
dashboard filtering).
Live Playwright + curl sweep on the smoke env exercised the full
user-management lifecycle:
admin add user → setup link generated → curl-as-new-user fetches
/setup (200, username on page) → POSTs password → 303 to / with
Set-Cookie → 200 on dashboard, 200 on /settings/account,
**403 on /settings/users** (admin-only) → admin disables → next
request is **401** + session row count drops to 0 → audit log
reflects user.created + user.setup_completed.
Three-role middleware enforces band gates; admin is the fail-closed
default. Setup tokens are SHA-256-hashed at rest with a 1h expiry;
expired tokens are swept on the alert engine's 60s tick. Last-admin
guard rejects disable + demote of the only enabled admin. Self-
service password change at /settings/account is reachable by every
role.
Adds GET/POST handlers for /settings/account in the viewer band
(any authenticated user) and an account.html template that suppresses
the current-password field when must_change_password is set. The
change is audited via AppendAudit.
Adds handleUIUserNewGet, handleUIUserNewPost, handleUIUserSetupLinkGet
to ui_users.go; creates web/templates/pages/user_edit.html (multi-mode
new/edit/setup-link); wires three routes in the admin band of server.go.
Replaces the 501 setup POST stub with the full handler: validates the token and
password, hashes and stores the password, deletes the setup token,
mints an 8-hour session cookie, appends a user.setup_completed audit
entry, and redirects to /. Adds TestSetupPostHappyPath covering the
full round-trip including normal-login verification after setup.
Routes are now structured into Public / Viewer / Operator / Admin bands
using requireRole middleware. Job log stream and download moved into the
Viewer band. healthz moved from New() into routes() with the other
public endpoints.
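The band layout can be sketched roughly as below. This is an illustration under assumptions, not the code in server.go: the Role type, the X-Demo-Role header standing in for the session lookup, and the /operator-example route are all hypothetical. Only requireRole, the band names, /healthz, /settings/account and /settings/users appear in the source.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Role is ordered so that a higher value implies every lower band.
type Role int

const (
	RoleViewer Role = iota + 1
	RoleOperator
	RoleAdmin
)

// roleFromRequest stands in for the real session lookup (hypothetical).
// A demo header is used here purely to keep the sketch self-contained.
func roleFromRequest(r *http.Request) (Role, bool) {
	switch r.Header.Get("X-Demo-Role") {
	case "viewer":
		return RoleViewer, true
	case "operator":
		return RoleOperator, true
	case "admin":
		return RoleAdmin, true
	}
	return 0, false
}

// requireRole answers 401 without a session and 403 below the minimum role.
func requireRole(min Role, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		role, ok := roleFromRequest(r)
		if !ok {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		if role < min {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// newMux wires one example route per band.
func newMux() *http.ServeMux {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	mux := http.NewServeMux()
	mux.Handle("/healthz", ok)                                       // Public
	mux.Handle("/settings/account", requireRole(RoleViewer, ok))     // Viewer
	mux.Handle("/operator-example", requireRole(RoleOperator, ok))   // Operator (hypothetical route)
	mux.Handle("/settings/users", requireRole(RoleAdmin, ok))        // Admin
	return mux
}

// status issues a GET as the given demo role and returns the status code.
func status(h http.Handler, path, role string) int {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	if role != "" {
		req.Header.Set("X-Demo-Role", role)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	mux := newMux()
	fmt.Println(status(mux, "/healthz", ""))                // 200
	fmt.Println(status(mux, "/settings/account", "viewer")) // 200
	fmt.Println(status(mux, "/settings/users", "viewer"))   // 403
	fmt.Println(status(mux, "/settings/users", ""))         // 401
}
```

Ordering the roles numerically keeps each gate to a single comparison, so an admin passes every band and a viewer passes only the lowest one.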
The test job's wall-clock time was dominated by `internal/server/http`
(~156s on the self-hosted runner under -race). Two changes here cut that:
1. Matrix-shard the test job by package group: server-http, store,
and "rest" (everything else, computed via `go list | grep -v`).
Each shard runs on its own runner so the heavy package isn't
CPU-starved by siblings.
2. `auth.HashPassword` drops to cheap argon2id params (8 KiB / 1
iter / 1 lane) when `testing.Testing()` returns true. Production
params are unchanged. VerifyPassword reads params from the
encoded hash so cheap-params hashes verify identically — no test
call sites need to change.
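A sketch of why the second change needs no call-site edits, assuming hashes are stored in the common PHC-style layout ($argon2id$v=19$m=..,t=..,p=..$salt$hash). That layout, the helper names, and the production parameter values below are assumptions for illustration; only the cheap params (8 KiB / 1 iter / 1 lane) and the use of `testing.Testing()` come from this change. The actual argon2id call is omitted so the sketch stays stdlib-only.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"testing"
)

// argonParams holds the tunable argon2id cost parameters.
type argonParams struct {
	MemoryKiB  uint32
	Iterations uint32
	Lanes      uint8
}

// paramsFor picks cheap params under test and production params otherwise.
// The production values here are placeholders, not the project's settings.
func paramsFor(underTest bool) argonParams {
	if underTest {
		return argonParams{MemoryKiB: 8, Iterations: 1, Lanes: 1}
	}
	return argonParams{MemoryKiB: 64 * 1024, Iterations: 3, Lanes: 2}
}

// currentParams is what a HashPassword-style function would consult.
// testing.Testing (Go 1.21+) is true only in binaries built by `go test`.
func currentParams() argonParams {
	return paramsFor(testing.Testing())
}

// encodePrefix renders the parameter portion of a PHC-style hash string.
func encodePrefix(p argonParams) string {
	return fmt.Sprintf("$argon2id$v=19$m=%d,t=%d,p=%d$", p.MemoryKiB, p.Iterations, p.Lanes)
}

// parseParams recovers the params from an encoded hash, which is what lets
// verification work regardless of which params produced the hash.
func parseParams(encoded string) (argonParams, error) {
	var p argonParams
	parts := strings.Split(encoded, "$")
	if len(parts) < 4 || parts[1] != "argon2id" {
		return p, errors.New("not an argon2id hash")
	}
	_, err := fmt.Sscanf(parts[3], "m=%d,t=%d,p=%d", &p.MemoryKiB, &p.Iterations, &p.Lanes)
	return p, err
}

func main() {
	cheap := encodePrefix(paramsFor(true)) + "c2FsdA$aGFzaA" // dummy salt/hash
	p, err := parseParams(cheap)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", p)
	fmt.Printf("%+v\n", currentParams())
}
```

Since the verifier reads m, t and p back out of the stored string, a hash minted with test params and one minted with production params go through the same verification path.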
Two operator-visible changes on /alerts:
1. The polling interval drops from 15s to 5s, and the table header
gains a checkbox to turn live monitoring on/off. The choice is persisted in
localStorage so it survives full-page navigations. The toggle
state is woven into the htmx hx-trigger predicate, so flipping
the checkbox just sets the flag and the next tick (or the
absence of one) honours it — no attribute juggling, no
htmx.process re-init. The dot dims to 0.3 opacity when paused
so operators can see at a glance that they're looking at a
stale view.
2. Severity dropdown options pick up the same oklch tints used by
the row dots / left borders / kind chips. The kind column shows
only the kind text, so without a colour cue the dropdown
mentioned a concept (severity) that the table itself didn't
render. Now the colours bridge the gap.
Note on <option> styling: Chrome and Firefox honour inline color:
on options; Safari ignores it. Acceptable degradation — falls back
to plain text, which is what we had.
The alerts list is the one screen where staleness is genuinely
harmful: an operator can be looking at an alert on the Open tab that
has already been resolved by another admin or auto-resolved by the
engine, and take action on a row that no longer exists.
Add an htmx poll on just the table panel:
  hx-get      same URL with current querystring (filters preserved)
  hx-trigger  every 15s, only when document is visible (no idle CPU)
  hx-select   #alerts-table (pull this element out of the response)
  hx-swap     outerHTML
Polling lives on the table div, not the page root, so the filter
strip and header don't flash on each tick. Header gains a small
'live ●' label so the polling is discoverable.
RefreshURL is r.URL.RequestURI() on the server side — keeps any
status/severity/host_id/q params intact across refreshes.
Other screens (dashboard, hosts, jobs) deliberately stay manual-
refresh per the project's anti-flicker stance.