Brainstormed shape locked: JIT-provision local rows on first OIDC
sign-in (auth_source='oidc'), YAML-only config (no UI), 'roles'
claim with deny-on-no-match default, preferred_username with email
fallback, refuse on local-user collision, single provider, login
page shows SSO above password (break-glass), front-channel logout
only, role re-evaluation at login only.
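A minimal sketch of the claim handling described above, assuming a
plain IdP-role to local-role map loaded from the YAML config. The
Claims type, the function names, the example group names and the
"highest mapped role wins" tie-break are all illustrative assumptions,
not the project's actual code:

```go
package main

import (
	"errors"
	"fmt"
)

// Claims is a hypothetical subset of the ID-token claims.
type Claims struct {
	PreferredUsername string
	Email             string
	Roles             []string
}

var (
	errNoRoleMatch = errors.New("oidc: no mapped role in 'roles' claim")
	errNoUsername  = errors.New("oidc: no preferred_username or email")
)

// rank orders local roles; assumed tie-break: highest mapped role wins.
var rank = map[string]int{"viewer": 1, "operator": 2, "admin": 3}

// resolveRole maps IdP role names to a local role. If nothing in the
// claim maps to a known local role, the login is denied.
func resolveRole(c Claims, mapping map[string]string) (string, error) {
	best := ""
	for _, r := range c.Roles {
		local, ok := mapping[r]
		if ok && rank[local] > rank[best] {
			best = local
		}
	}
	if best == "" {
		return "", errNoRoleMatch
	}
	return best, nil
}

// resolveUsername prefers preferred_username and falls back to email.
func resolveUsername(c Claims) (string, error) {
	if c.PreferredUsername != "" {
		return c.PreferredUsername, nil
	}
	if c.Email != "" {
		return c.Email, nil
	}
	return "", errNoUsername
}

func main() {
	mapping := map[string]string{"backup-admins": "admin", "backup-ops": "operator"}
	c := Claims{Email: "ana@example.com", Roles: []string{"staff", "backup-ops"}}
	role, err := resolveRole(c, mapping)
	name, _ := resolveUsername(c)
	fmt.Println(name, role, err)
}
```

A user whose 'roles' claim contains only unmapped values gets
errNoRoleMatch, which is the deny-on-no-match default.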
Migration 0019: users.auth_source + users.oidc_subject (partial
unique index), sessions.id_token (for end_session id_token_hint),
oidc_state table for the OAuth round-trip state, swept on the
existing alert-engine tick.
Composes with the user-management work from P4-03/04: admins can
disable OIDC users the same way as local ones; the last-admin guard
catches IdP role-mapping mistakes; the audit trail covers JIT
provisioning via user.created with an auth_source payload, plus new
user.oidc_login / user.oidc_login_blocked actions.
Out of scope (deferred): back-channel logout, multi-provider,
UI-driven role mapping, refresh tokens / mid-session re-eval.
Pull the operator-experience polish out of Phase 4 so a working v1
ships sooner. Phase 4 keeps RBAC + user mgmt (already done), OIDC,
and host tags. Deferred items renumbered as P6-01..P6-05:
  P4-01 → P6-01  apt + Chocolatey update delivery
  P4-02 → P6-02  agent-version-behind-server tracking on dashboard
  P4-06 → P6-03  repo size trend graphs
  P4-08 → P6-04  Prometheus /metrics endpoint
  P4-09 → P6-05  Grafana dashboard JSON + integration docs
None of these block getting the system into production. They land
after Phase 5 (OSS readiness), in the new Phase 6.
Phase 4 remaining: P4-05 (OIDC login) + P4-07 (per-host tags +
dashboard filtering).
Live Playwright + curl sweep on the smoke env exercised the full
user-management lifecycle:
admin add user → setup link generated → curl-as-new-user fetches
/setup (200, username on page) → POSTs password → 303 to / with
Set-Cookie → 200 on dashboard, 200 on /settings/account,
**403 on /settings/users** (admin-only) → admin disables → next
request is **401** + session row count drops to 0 → audit log
reflects user.created + user.setup_completed.
Three-role middleware enforces band gates; admin is fail-closed
default. Setup tokens are sha256-hashed at rest with 1h expiry;
expired tokens are swept on the alert engine's 60s tick. Last-admin
guard rejects disable + demote of the only enabled admin. Self-
service password change at /settings/account is reachable by every
role.
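The last-admin guard can be sketched as a pure check run before a
disable or demote is applied. The User type and guardLastAdmin are
hypothetical names for illustration; the real guard presumably runs
against the users table:

```go
package main

import (
	"errors"
	"fmt"
)

// User is an illustrative stand-in for a users row.
type User struct {
	ID      int
	Role    string
	Enabled bool
}

var errLastAdmin = errors.New("refusing: would leave no enabled admin")

// guardLastAdmin returns an error if setting (newRole, newEnabled) on
// the user with targetID would leave zero enabled admins.
func guardLastAdmin(users []User, targetID int, newRole string, newEnabled bool) error {
	remaining := 0
	for _, u := range users {
		role, enabled := u.Role, u.Enabled
		if u.ID == targetID {
			role, enabled = newRole, newEnabled
		}
		if role == "admin" && enabled {
			remaining++
		}
	}
	if remaining == 0 {
		return errLastAdmin
	}
	return nil
}

func main() {
	users := []User{
		{ID: 1, Role: "admin", Enabled: true},
		{ID: 2, Role: "operator", Enabled: true},
	}
	fmt.Println(guardLastAdmin(users, 1, "admin", false))   // disable the only admin
	fmt.Println(guardLastAdmin(users, 1, "operator", true)) // demote the only admin
	fmt.Println(guardLastAdmin(users, 2, "viewer", true))   // unrelated change
}
```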
Adds GET/POST handlers for /settings/account in the viewer band
(any authenticated user), account.html template with current-password
field suppressed when must_change_password is set, and audits the
change via AppendAudit.
Adds handleUIUserNewGet, handleUIUserNewPost, handleUIUserSetupLinkGet
to ui_users.go; creates web/templates/pages/user_edit.html (multi-mode
new/edit/setup-link); wires three routes in the admin band of server.go.
Replaces the 501 stub with the full handler: validates the token and
password, hashes and stores the password, deletes the setup token,
mints an 8-hour session cookie, appends a user.setup_completed audit
entry, and redirects to /. Adds TestSetupPostHappyPath covering the
full round-trip including normal-login verification after setup.
Routes are now structured into Public / Viewer / Operator / Admin bands
using requireRole middleware. Job log stream and download moved into the
Viewer band. healthz moved from New() into routes() with the other
public endpoints.
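A rough sketch of what a banded requireRole can look like, written
against net/http only so it stands alone (the real code uses chi route
groups). The context key, withRole helper and rank table are
assumptions for illustration; authentication is assumed to have
happened upstream:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

type ctxKey struct{}

// rank orders the three roles; unknown or missing roles rank 0.
var rank = map[string]int{"viewer": 1, "operator": 2, "admin": 3}

// withRole stands in for the auth middleware that would attach the
// caller's role to the request context.
func withRole(r *http.Request, role string) *http.Request {
	return r.WithContext(context.WithValue(r.Context(), ctxKey{}, role))
}

// requireRole admits callers whose role ranks at or above minRole.
// An unrecognised minRole falls back to admin, so mistakes fail closed.
func requireRole(minRole string) func(http.Handler) http.Handler {
	need, ok := rank[minRole]
	if !ok {
		need = rank["admin"]
	}
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			role, _ := r.Context().Value(ctxKey{}).(string)
			if rank[role] < need {
				http.Error(w, "forbidden", http.StatusForbidden)
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}

// statusFor runs one request through the middleware and returns the status.
func statusFor(role, minRole string) int {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	rec := httptest.NewRecorder()
	req := withRole(httptest.NewRequest(http.MethodGet, "/", nil), role)
	requireRole(minRole)(ok).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor("viewer", "admin"))
	fmt.Println(statusFor("admin", "viewer"))
}
```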
Bite-sized TDD tasks across 7 slices (A schema, B middleware,
C session re-validation, D setup-token flow, E user CRUD API,
F UI, G wiring + sweep). Each task is one commit with concrete
code blocks and test cases — no placeholders.
Refs spec at docs/superpowers/specs/2026-05-05-p4-03-04-rbac-user-mgmt-design.md.
Brainstormed shape locked: chi route-group middleware, fail-closed
admin default; setup-token flow with 1h single-use tokens
(sha256-hashed at rest, raw shown to admin once); disable-only user
lifecycle with last-admin guard; self-service /settings/account
password change for every role; email field on users (metadata only
in v1); session re-validation on every authenticated request so
disable / role change land immediately.
Locked decisions captured in §Role taxonomy, §Schema changes,
§Setup-token flow, §RBAC enforcement, §Last-admin self-protection.
Deferred items in §Out of scope (OIDC, SMTP email-the-link,
hard delete, lockout).
Migrations 0017 (users extensions) + 0018 (user_setup_tokens) are
both column-level ALTERs, per the CLAUDE.md preference.
The test job's wall-clock time was dominated by `internal/server/http`
(~156s on the self-hosted runner under -race). Two changes here cut
that:
1. Matrix-shard the test job by package group: server-http, store,
   and "rest" (everything else, computed via `go list | grep -v`).
   Each shard runs on its own runner so the heavy package isn't
   CPU-starved by siblings.
2. `auth.HashPassword` drops to cheap argon2id params (8 KiB / 1
   iter / 1 lane) when `testing.Testing()` returns true. Production
   params are unchanged. VerifyPassword reads params from the
   encoded hash, so cheap-params hashes verify identically and no
   test call sites need to change.
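Why verification keeps working can be sketched without the argon2
dependency. Assumptions, loudly: the encoded form is taken to be the
common PHC-style string ($argon2id$v=19$m=..,t=..,p=..$salt$hash),
prodParams is a placeholder (the real production values are not stated
here), and pickParams takes a bool where the real code consults
testing.Testing():

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

type params struct {
	MemoryKiB  uint32
	Iterations uint32
	Lanes      uint8
}

var (
	// Placeholder values only; the real production params differ or may differ.
	prodParams = params{MemoryKiB: 64 * 1024, Iterations: 3, Lanes: 2}
	// Cheap params from this change: 8 KiB / 1 iter / 1 lane.
	testParams = params{MemoryKiB: 8, Iterations: 1, Lanes: 1}
)

// pickParams chooses hashing params; inTest stands in for testing.Testing().
func pickParams(inTest bool) params {
	if inTest {
		return testParams
	}
	return prodParams
}

// encode builds the assumed PHC-style string around already-encoded salt and hash.
func encode(p params, saltB64, hashB64 string) string {
	return fmt.Sprintf("$argon2id$v=19$m=%d,t=%d,p=%d$%s$%s",
		p.MemoryKiB, p.Iterations, p.Lanes, saltB64, hashB64)
}

// parseParams recovers the params from the encoded hash, which is what
// lets a verifier handle cheap and production hashes with one code path.
func parseParams(encoded string) (params, error) {
	parts := strings.Split(encoded, "$")
	if len(parts) != 6 || parts[1] != "argon2id" {
		return params{}, errors.New("not an argon2id encoded hash")
	}
	var p params
	if _, err := fmt.Sscanf(parts[3], "m=%d,t=%d,p=%d",
		&p.MemoryKiB, &p.Iterations, &p.Lanes); err != nil {
		return params{}, err
	}
	return p, nil
}

func main() {
	enc := encode(pickParams(true), "c2FsdA", "aGFzaA")
	p, err := parseParams(enc)
	fmt.Println(enc, p, err)
}
```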
Two operator-visible changes on /alerts:
1. Polling drops from 15s to 5s and gains a checkbox in the table
   header to turn live monitoring on/off. Choice is persisted in
   localStorage so it survives full-page navigations. The toggle
   state is woven into the htmx hx-trigger predicate, so flipping
   the checkbox just sets the flag and the next tick (or the
   absence of one) honours it, with no attribute juggling and no
   htmx.process re-init. The dot dims to 0.3 opacity when paused
   so operators can see at a glance that they're looking at a
   stale view.
2. Severity dropdown options pick up the same oklch tints used by
   the row dots / left borders / kind chips. The kind column shows
   only the kind text, so without a colour cue the dropdown
   mentioned a concept (severity) that the table itself didn't
   render. Now the colours bridge the gap.
Note on <option> styling: Chrome and Firefox honour inline color:
on options; Safari ignores it. Acceptable degradation — falls back
to plain text, which is what we had.
The alerts list is the one screen where staleness is genuinely
harmful — an operator can be looking at an Open tab that's already
been resolved by another admin or auto-resolved by the engine, and
take action on a row that no longer exists.
Add an htmx poll on just the table panel:
  hx-get      same URL with current querystring (filters preserved)
  hx-trigger  every 15s, only when document is visible (no idle CPU)
  hx-select   #alerts-table (pull this element out of the response)
  hx-swap     outerHTML
Polling lives on the table div, not the page root, so the filter
strip and header don't flash on each tick. Header gains a small
'live ●' label so the polling is discoverable.
RefreshURL is r.URL.RequestURI() on the server side — keeps any
status/severity/host_id/q params intact across refreshes.
Other screens (dashboard, hosts, jobs) deliberately stay manual-
refresh per the project's anti-flicker stance.
Until now the open-alert key was (host_id, kind, resolved_at IS NULL).
A host with two source groups both failing collapsed onto one
backup_failed row: the second failure bumped last_seen_at and
overwrote the message but never fanned out again. Operators saw one
alert that appeared to flap, not two distinct broken things.
Schema changes (column-level ALTER, no rebuild):
- 0015 jobs.source_group_id (FK → source_groups, ON DELETE SET NULL,
  index). Populated for backup jobs in CreateJob.
- 0016 alerts.dedup_key (NOT NULL DEFAULT ''). The old alerts_open
  partial index gets dropped and replaced with a UNIQUE partial
  index on (host_id, kind, dedup_key) WHERE resolved_at IS NULL;
  the index is now the actual dedup primitive.
Plumbing:
- RaiseOrTouch / AutoResolve / Alert struct gain dedup_key.
- engine.JobFinishedEvent gains SourceGroupID; handleJobFinished
  passes it through for backup_failed only (forget/prune/check stay
  repo-scoped with key='').
- ws.handler reads SourceGroupID off the freshly-loaded job row.
- dispatchJobWithPayload gains a *string sourceGroupID arg; the
  per-group Run-now path and schedule.fire path pass &g.ID.
Test coverage: TestRaiseOrTouchDedupsPerSourceGroup proves two
distinct groups produce two distinct open alerts and that resolving
one does not auto-resolve the other.
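The behaviour that test pins down can be modelled in memory, with a
map key standing in for the UNIQUE partial index on
(host_id, kind, dedup_key). This is an illustrative model only: the
method names mirror the store's, but the signatures and types are
assumptions:

```go
package main

import "fmt"

// openKey plays the role of the unique index over open alerts.
type openKey struct {
	HostID, Kind, DedupKey string
}

type Alert struct {
	Message string
	Touches int
}

type Alerts struct {
	open map[openKey]*Alert
}

func NewAlerts() *Alerts { return &Alerts{open: map[openKey]*Alert{}} }

// RaiseOrTouch opens a new alert for the key, or touches the existing
// open one. It reports whether a new alert was created.
func (a *Alerts) RaiseOrTouch(hostID, kind, dedupKey, msg string) bool {
	k := openKey{hostID, kind, dedupKey}
	if cur, ok := a.open[k]; ok {
		cur.Message = msg
		cur.Touches++
		return false
	}
	a.open[k] = &Alert{Message: msg, Touches: 1}
	return true
}

// AutoResolve closes only the alert matching the full key.
func (a *Alerts) AutoResolve(hostID, kind, dedupKey string) bool {
	k := openKey{hostID, kind, dedupKey}
	if _, ok := a.open[k]; !ok {
		return false
	}
	delete(a.open, k)
	return true
}

func (a *Alerts) OpenCount() int { return len(a.open) }

func main() {
	a := NewAlerts()
	a.RaiseOrTouch("host-1", "backup_failed", "group-a", "exit 1")
	a.RaiseOrTouch("host-1", "backup_failed", "group-b", "exit 3")
	fmt.Println(a.OpenCount())
	a.AutoResolve("host-1", "backup_failed", "group-a")
	fmt.Println(a.OpenCount())
}
```

With dedup_key left as '' for repo-scoped kinds, the key degenerates
to the old (host_id, kind) behaviour.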
Dev tool: cmd/_fake_alert gains -dedup-key flag.