spec: P4-03/04 — RBAC + user management design

Brainstormed shape locked: chi route-group middleware, fail-closed
admin default; setup-token flow with 1h single-use tokens
(sha256-hashed at rest, raw shown to admin once); disable-only user
lifecycle with last-admin guard; self-service /settings/account
password change for every role; email field on users (metadata
v1); session re-validation on every authenticated request so
disable / role change land immediately.

Locked decisions captured in §Role taxonomy, §Schema changes,
§Setup-token flow, §RBAC enforcement, §Last-admin self-protection.
Deferred items in §Out of scope (OIDC, SMTP email-the-link,
hard delete, lockout).

Migrations 0017 (users extensions) + 0018 (user_setup_tokens)
both column-level ALTERs per CLAUDE.md preference.
This commit is contained in:
2026-05-05 08:36:53 +01:00
parent 4eab42a9c3
commit 282258e837
@@ -0,0 +1,340 @@
# P4-03 / P4-04 — RBAC + User Management Design
> **Date:** 2026-05-05
> **Status:** brainstorm complete; ready for plan
> **Closes:** P4-03 (RBAC enforcement at API layer), P4-04 (User management UI)
## Goal
Enforce role-based access control at the HTTP layer (currently every authenticated user has admin powers) and ship the operator-facing screens for managing users, roles, and password lifecycle.
## Architecture
Two coupled subsystems landing in one PR:
1. **RBAC enforcement** — chi route-group middleware that gates each subtree by minimum role. Fail-closed default (admin) so a forgotten declaration doesn't accidentally widen access.
2. **User management**`/settings/users` sub-tab with list / add / edit / disable. Setup-link flow for new users (1-hour-expiry single-use token). Self-service password change at `/settings/account`.
The audit log already records actor + user_id on every mutation; new endpoints fold in naturally.
## Role taxonomy
Locked. Three roles, hierarchical (admin ⊇ operator ⊇ viewer):
| Action | admin | operator | viewer |
|---|:-:|:-:|:-:|
| View dashboard / alerts / audit / hosts | ✓ | ✓ | ✓ |
| Trigger Run-now / Restore / Snapshot diff | ✓ | ✓ | ✗ |
| Acknowledge / resolve alerts | ✓ | ✓ | ✗ |
| Edit schedules / source groups / retention / hooks | ✓ | ✓ | ✗ |
| Add / remove hosts (enrolment, accept/reject pending) | ✓ | ✓ | ✗ |
| Cancel running jobs | ✓ | ✓ | ✗ |
| Edit repo credentials | ✓ | ✓ | ✗ |
| Edit notification channels | ✓ | ✗ | ✗ |
| Manage users | ✓ | ✗ | ✗ |
| Self password change (`/settings/account`) | ✓ | ✓ | ✓ |
The role enum already exists in the schema (`CHECK (role IN ('admin','operator','viewer'))`) and in `internal/store/types.go`. Bootstrap creates the first user as admin. Zero migration needed for existing installs.
## Schema changes
All column-level ALTERs (CLAUDE.md prefers these over rebuilds; safe under `foreign_keys=ON`).
### Migration 0017 — `users` extensions
```sql
ALTER TABLE users ADD COLUMN email TEXT;
ALTER TABLE users ADD COLUMN disabled_at TEXT;
ALTER TABLE users ADD COLUMN must_change_password INTEGER NOT NULL DEFAULT 0;
-- Username case-insensitive lookup. Existing rows are kept as-is;
-- normalisation only applies to new INSERTs (handled in Go).
CREATE UNIQUE INDEX users_username_lower ON users(LOWER(username));
```
### Migration 0018 — `user_setup_tokens`
```sql
CREATE TABLE user_setup_tokens (
user_id TEXT PRIMARY KEY REFERENCES users(id) ON DELETE CASCADE,
token_hash TEXT NOT NULL, -- sha256(raw_token), hex
expires_at TEXT NOT NULL,
created_at TEXT NOT NULL,
created_by TEXT NOT NULL REFERENCES users(id) ON DELETE SET NULL
);
CREATE INDEX user_setup_tokens_expires ON user_setup_tokens(expires_at);
```
`user_id` is PRIMARY KEY, not just FOREIGN KEY — only one outstanding setup token per user. Regenerating supersedes the old via `INSERT OR REPLACE`.
## RBAC enforcement
### Middleware
```go
// requireRole returns chi middleware that 403s any request whose
// session-resolved user doesn't meet the minimum role. Roles are
// hierarchical: admin > operator > viewer.
func (s *Server) requireRole(min store.Role) func(http.Handler) http.Handler
```
Hierarchy implemented as a small helper:
```go
func roleAtLeast(have, min store.Role) bool {
rank := map[store.Role]int{
store.RoleViewer: 1,
store.RoleOperator: 2,
store.RoleAdmin: 3,
}
return rank[have] >= rank[min]
}
```
### Route grouping in `server.go`
The existing `/api` and UI routes get re-grouped into three role bands plus a self-service group:
```
/api/* viewer-readable — GET endpoints anyone authenticated can hit
/api/* operator+ — mutating endpoints up to host/source-group/schedule level
/api/* admin-only — /api/users/*, channel CRUD
/api/account — self-service password change
/audit, /alerts, /hosts/{id}, etc. — viewer
/hosts/{id}/run, /alerts/{id}/ack — operator
/settings/users/*, /settings/notifications/* — admin
/settings/account — viewer (any authenticated)
```
Default at the bottom of `routes()` is admin (fail-closed). Any future endpoint that doesn't get explicitly placed lands in admin-only, surfacing the missing declaration as a permission error rather than a silent bypass.
### Per-handler nuance
One existing case warrants a handler-level check on top of the route gate: `GET /settings/users/{id}/edit` is admin-only, but the `PUT /api/account/password` is viewer-OK. The split-by-route already covers this; no per-handler overrides expected in v1.
### Out of scope of role middleware
- `/ws/agent` and `/api/agents/*` — agent bearer-token auth, separate chain
- `/healthz` — unauthenticated
- `/login`, `/logout`, `/bootstrap` — public
### 403 handling
- JSON endpoints: `{"error":"forbidden","code":"insufficient_role"}` with HTTP 403
- HTML endpoints: render a small "You don't have permission" panel inside the chrome (so the user keeps their nav and can move away), HTTP 403
- **No audit row on 403** — too noisy with normal users hitting URLs they don't have access to
### Session re-validation
Sessions need to honour `disabled_at` and current role on every request, not just at login. The session-validation middleware reads the user row each request (single PK lookup, fast in SQLite). If `disabled_at IS NOT NULL`, the session is invalidated and the request 401s. This makes "disable user" and "force logout" effectively immediate.
Cost: one SELECT per authenticated request. SQLite handles this comfortably for the fleet sizes this codebase targets.
## Setup-token flow (replacing temp passwords)
### Add user
1. Admin clicks **+ Add user** on `/settings/users`
2. Form: username (required, lowercase-normalised), email (optional, validated), role (admin/operator/viewer)
3. Server:
- Validates username uniqueness (case-insensitive). On collision with a *disabled* user, return a 409 with `{"existing_user_id": "...", "disabled": true}` so the UI can pivot to a "re-enable existing user" prompt
- On collision with an enabled user: 409 with a plain "username taken" error
- Creates user row with `password_hash = ""`, `must_change_password = 1`, `disabled_at = NULL`
- Generates 32 random bytes, hex-encodes → raw token (64 chars). Stores `sha256(token)` hex in `user_setup_tokens`. `expires_at = now + 1h`
- Audit: `user.created`, payload `{"username": "...", "role": "...", "with_setup_token": true}`
4. Server returns the admin to a one-time setup-link page: `/settings/users/{id}/setup-link`
- Shows the URL `http(s)://<base>/setup?token=<raw>` with a Copy button
- Countdown timer (live JS) showing time-to-expiry
- Warning: "This is the only time you'll see this link. If you lose it, regenerate from the user edit page."
- "Done" button → `/settings/users`
The raw token is **never persisted** server-side. Lost tokens require regeneration.
### Setup landing page (public, no auth required)
1. User clicks the link, lands on `/setup?token=<raw>`
2. Server hashes the token, looks up `user_setup_tokens` row, validates `expires_at > now`
3. On invalid / expired: render an error page with a "Contact your administrator" message. Audit: `user.setup_token.expired` (no actor).
4. On valid: render a password-set form: `new password + confirm`. Submit:
- Validates password meets policy (min 12 chars, no other constraints in v1 — same as bootstrap path)
- Hashes via `auth.HashPassword` (existing helper)
- Updates `users.password_hash`, sets `must_change_password = 0`
- Deletes the `user_setup_tokens` row (single-use)
- Logs the user in via the existing session helper
- Audit: `user.setup_completed`, payload `{"user_id": "..."}`
- Redirect to `/`
### Regenerate setup link (admin)
`/settings/users/{id}/edit` shows a "Regenerate setup link" button when `must_change_password = 1`. Clicking it:
1. Generates a new token + hash, INSERT OR REPLACE on `user_setup_tokens`
2. Returns the admin to the same one-time link page as the add-user flow
3. Audit: `user.setup_token.regenerated`
### Cleanup
Expired tokens linger in the DB until cleaned. Add a cheap sweep on the existing maintenance ticker: `DELETE FROM user_setup_tokens WHERE expires_at < ?`. Runs at the same cadence as the alert engine tick (60s). No new ticker needed.
## Self-service password change
`/settings/account`
- Accessible to every authenticated user (any role)
- Form: `current password + new password + confirm`
- Server validates current password (re-uses login bcrypt comparison), updates hash, audits `user.password_changed`
- Special case: if `must_change_password = 1`, the current-password field is hidden / not required (covers the legacy "admin reset password" path if we ever add one — current setup-token path doesn't use this)
The bootstrap user's password change uses this same page (no special case for "first admin").
## User list / management UI
### `/settings/users` (admin-only)
```
Settings · Users [3]
─────────────────────────────────────────────────
[ + Add user ] [ ] Show disabled
USERNAME EMAIL ROLE LAST LOGIN STATUS
alice alice@example.com admin 2 mins ago enabled
bob — operator 3 days ago enabled
charlie c@example.com viewer never setup pending ← if has open setup token
diane d@example.com operator 1 month ago disabled ← only when "Show disabled"
Actions per row: Edit · (Re-enable | Disable)
```
- "setup pending" badge for users with `must_change_password=1` — clicking the row goes to edit, which surfaces the regenerate-link button prominently
- "Show disabled" is a checkbox querystring filter (`?show_disabled=1`)
- Sort columns: clickable like the audit log (username, role, last_login). Reuse the same pattern (server-side sort + URL builder + glyph)
### `/settings/users/new` (admin-only)
Single form: `username + email (optional) + role`. On submit → either landed on the setup-link page (success) or returned with an inline "username exists, re-enable existing?" panel (collision with disabled user) / red error (collision with enabled user).
### `/settings/users/{id}/edit` (admin-only)
- Display-only block: id, created_at, last_login_at, status
- **Editable**: email, role
- **Buttons**:
- "Regenerate setup link" — only when `must_change_password = 1`
- "Disable user" — flips `disabled_at`; rejected if last enabled admin (server-side check). Confirmation modal with typed name to confirm.
- "Re-enable user" — clears `disabled_at`. No confirmation.
- "Force logout" — separate from disable; just kills the session but keeps the user enabled. Useful for "I think Bob's session was hijacked" without locking him out.
- Cancel / Save buttons at the bottom
### `/settings/users/{id}/setup-link` (admin-only)
Renders the one-time link with copy button + countdown. Shown after add-user and after regenerate. Reload of this URL after the token is consumed: 410 Gone with a clear message.
### `/settings/account` (any authenticated)
Self-service password change. Form-only page; no nav under Settings since most users will only see this one Settings page in v1.
## API surface
```
GET /api/users admin — list (with ?show_disabled=1 filter)
POST /api/users admin — create user, returns user_id + setup_url
GET /api/users/{id} admin — read
PATCH /api/users/{id} admin — update email, role
POST /api/users/{id}/disable admin — set disabled_at; rejects last-admin
POST /api/users/{id}/enable admin — clear disabled_at
POST /api/users/{id}/regenerate-setup admin — new token, returns setup_url
POST /api/users/{id}/force-logout admin — kill all sessions for this user
POST /api/account/password any auth — self password change
GET /setup public — landing page (HTML form)
POST /setup public — submit new password
```
UI routes mirror the API but at `/settings/users/...`.
## Last-admin self-protection
Two operations that could lock everyone out are guarded:
- **Disable user**: rejected if the user is admin AND there are no other enabled admins
- **Demote admin to operator/viewer**: same check
Server-side enforcement (single SELECT on `COUNT(*) FROM users WHERE role='admin' AND disabled_at IS NULL`). UI hint: edit page disables the role dropdown's non-admin options + disable button when the user is the last admin, with a tooltip explaining why.
The bootstrap admin is just a regular admin row; this check covers it.
## Audit actions
New action strings introduced:
- `user.created`
- `user.updated` (email / role change)
- `user.disabled`
- `user.enabled`
- `user.password_changed`
- `user.setup_completed`
- `user.setup_token.regenerated`
- `user.setup_token.expired` (system-driven, on cleanup sweep)
- `user.force_logout`
All target_kind = `user`, target_id = the affected user's id. Existing payload conventions apply.
## Ordering / dependencies
Slices in approximate landing order (writing-plans will firm this up):
1. **A. Schema** — migrations 0017 + 0018, `Role` helper updates, store API extensions (email, disabled_at, must_change_password, setup_token CRUD, lowercase username constraints)
2. **B. RBAC middleware**`requireRole` + `roleAtLeast`, route re-grouping in server.go, 403 rendering for HTML + JSON
3. **C. Session re-validation** — extend the existing session middleware to re-read user state per request, kick disabled users
4. **D. Setup-token flow**`/setup` GET+POST, the one-time link page after add-user
5. **E. User CRUD API** — handlers + handlers' tests
6. **F. UI**`/settings/users` list, add, edit, setup-link page, account page
7. **G. Sweep** — Playwright walk through the full lifecycle (add → setup link → user signs in → admin disables → user gets kicked → admin re-enables → user signs back in)
Each slice can land as its own commit on the branch. RBAC middleware (B) goes in *before* user CRUD so we don't ship an open `/api/users/*` even briefly.
## Test strategy
- **Store**: `Set/GetSetupToken`, `EnableUser`/`DisableUser`, last-admin guard, lowercase-username uniqueness, expired-token cleanup
- **HTTP middleware**: `roleAtLeast` truth table; viewer hitting an operator route returns 403; disabled user gets 401 mid-session
- **Setup flow integration**: create user → fetch setup URL → land on `/setup?token=...` → POST password → user can log in → token row gone
- **UI**: existing Playwright sweep pattern, screenshots into `_diag/p4-03-04-sweep/`
## Out of scope (deferred)
- **OIDC** (P4-05) — adds a parallel auth chain. This PR keeps the surface for it (role taxonomy, session middleware) but doesn't wire it.
- **Email-the-setup-link** — explicitly deferred. Easy follow-up because the SMTP channel client from P3-06 is already there.
- **Hard delete** — disable-only in v1; can add a typed-confirm "purge" later if it turns out to be needed.
- **Password complexity / rotation policy** — current minimum (12 chars) and no rotation; tighten later if/when policy demands.
- **Lockout on failed login** — a brute-force protection layer is its own task and orthogonal to RBAC.
- **Audit on 403** — not in v1; revisit if compliance asks for it.
## Risks / gotchas to watch
- **Existing tests** that assume "any logged-in user can hit any endpoint" will break. Audit the test fixtures: most use `loginAsAdmin`, which is fine; any tests currently exercising specific operator/viewer paths need explicit role assignment. (Quick grep suggests there aren't many — bootstrap-only.)
- **Bootstrap user normalisation** — the existing admin row's username is whatever it was set to at first run. The new lowercase-uniqueness index uses `LOWER(username)`, which makes the existing row implicitly lowercase-keyed for lookups. No data migration needed.
- **Session middleware re-read cost** — one SELECT per authenticated request. SQLite WAL handles this fine at expected fleet sizes; if it ever shows up on a profile we add a small in-memory cache keyed by session id with a 30s TTL.
- **403 vs 401 distinction** — make sure unauthenticated requests still get 401 (login redirect) and authenticated-but-insufficient get 403. The middleware should compose: auth-required first, role-required second.
## Acceptance
- [ ] An admin can add a user, copy the setup link, the new user can land on `/setup?token=...`, set a password, and reach `/`
- [ ] An expired token (>1h) on `/setup?token=...` shows the "contact your administrator" page
- [ ] Admin regenerates the link, old token is invalid, new token works
- [ ] Operator user can trigger Run-now but cannot reach `/settings/users` (403) and the Users tab in Settings is hidden in their nav
- [ ] Viewer user gets 403 on Run-now, 200 on dashboard / alerts / audit
- [ ] Admin disables a user mid-session — the user's next request is 401 and they're redirected to login
- [ ] Admin cannot disable themselves if they are the last enabled admin (server returns 409, UI button is greyed)
- [ ] Self-service password change at `/settings/account` works for every role
- [ ] All existing tests pass; new test suite covers role middleware, setup-token lifecycle, last-admin guard
## Self-review notes
- ✅ All sections concrete, no TBD / TODO
- ✅ Schema migrations are column-level (CLAUDE.md compliance)
- ✅ Audit action vocabulary listed in one place; no string typos to drift
- ✅ Out-of-scope list explicit so reviewers can challenge what we *aren't* doing
- ✅ Last-admin guard handled both server-side and UI-hinted
- ✅ Token storage hashes the secret server-side; raw is shown to admin once and never again
- ✅ Session re-validation cost noted with a fallback if it shows up on a profile