Commit Graph

299 Commits

Author SHA1 Message Date
steve 2073898c10 http: test helpers — makeUser, loginAs 2026-05-05 10:57:24 +01:00
steve 37a25beb14 http: roleAtLeast helper for the role hierarchy 2026-05-05 10:57:24 +01:00
steve f0828782c1 store: DeleteSessionsByUserID for force-logout 2026-05-05 10:57:24 +01:00
steve 12391abef0 store: user_setup_tokens CRUD + cleanup-expired 2026-05-05 10:57:24 +01:00
steve 2c090171e5 store: lowercase username, email/disable helpers, last-admin count 2026-05-05 10:57:24 +01:00
steve bd08d8ca14 store: extend User struct with Email, DisabledAt, MustChangePassword 2026-05-05 10:57:24 +01:00
steve a7e53e0a64 store: migration 0018 — user_setup_tokens 2026-05-05 10:57:24 +01:00
steve ca170fedc5 store: migration 0017 — users.email, disabled_at, must_change_password 2026-05-05 10:57:24 +01:00
steve c9f230ce1d plan: P4-03/04 — RBAC + user management implementation plan
Bite-sized TDD tasks across 7 slices (A schema, B middleware,
C session re-validation, D setup-token flow, E user CRUD API,
F UI, G wiring + sweep). Each task is one commit with concrete
code blocks and test cases — no placeholders.

Refs spec at docs/superpowers/specs/2026-05-05-p4-03-04-rbac-user-mgmt-design.md.
2026-05-05 10:57:24 +01:00
steve 282258e837 spec: P4-03/04 — RBAC + user management design
Brainstormed shape locked: chi route-group middleware, fail-closed
admin default; setup-token flow with 1h single-use tokens
(sha256-hashed at rest, raw shown to admin once); disable-only user
lifecycle with last-admin guard; self-service /settings/account
password change for every role; email field on users (metadata
v1); session re-validation on every authenticated request so
disable / role change land immediately.

Locked decisions captured in §Role taxonomy, §Schema changes,
§Setup-token flow, §RBAC enforcement, §Last-admin self-protection.
Deferred items in §Out of scope (OIDC, SMTP email-the-link,
hard delete, lockout).

Migrations 0017 (users extensions) + 0018 (user_setup_tokens)
both column-level ALTERs per CLAUDE.md preference.
2026-05-05 10:57:24 +01:00
steve 4eab42a9c3 Merge pull request 'ci: shard test job + cheap argon2 in test mode' (#13) from ci-faster-tests into main
Reviewed-on: #13
2026-05-05 07:44:30 +00:00
steve 03e5ec31f1 ci: shard test job + cheap argon2 in test mode
CI / Test (store) (pull_request) Successful in 38s
CI / Test (rest) (pull_request) Successful in 48s
CI / Test (server-http) (pull_request) Successful in 1m10s
CI / Lint (pull_request) Successful in 33s
CI / Build (linux/amd64) (pull_request) Successful in 24s
CI / Build (windows/amd64) (pull_request) Successful in 48s
CI / Build (linux/arm64) (pull_request) Successful in 23s
Test job was wall-clocked by `internal/server/http` (~156s on the
self-hosted runner under -race). Two changes here cut that:

1. Matrix-shard the test job by package group: server-http, store,
   and "rest" (everything else, computed via `go list | grep -v`).
   Each shard runs on its own runner so the heavy package isn't
   CPU-starved by siblings.

2. `auth.HashPassword` drops to cheap argon2id params (8 KiB / 1
   iter / 1 lane) when `testing.Testing()` returns true. Production
   params are unchanged. VerifyPassword reads params from the
   encoded hash so cheap-params hashes verify identically — no test
   call sites need to change.
2026-05-05 08:40:50 +01:00
steve 6fd16ace81 Merge pull request 'feat(audit): P3-08 — audit log UI with filters, sort, CSV export, payload modal' (#12) from p3-08-audit-ui into main
Reviewed-on: #12
2026-05-05 07:17:25 +00:00
steve ba425c9766 feat(audit): clickable column headers with asc/desc sort
CI / Build (windows/amd64) (pull_request) Successful in 23s
CI / Lint (pull_request) Successful in 34s
CI / Build (linux/amd64) (pull_request) Successful in 23s
CI / Build (linux/arm64) (pull_request) Successful in 21s
CI / Test (linux/amd64) (pull_request) Successful in 3m41s
2026-05-05 08:15:22 +01:00
steve 1d0d994bc4 audit(csv): drop user_id and target_id columns 2026-05-05 08:05:41 +01:00
steve 489f831fc7 feat(audit): CSV export, absolute timestamps, payload modal 2026-05-05 08:00:53 +01:00
steve 3f36bcd0b0 feat(audit): P3-08 — audit log UI with filters 2026-05-05 07:49:25 +01:00
steve cb3260b89c Merge pull request 'feat(alerts): live refresh table with toggle + severity colour cues' (#11) from alerts-live-refresh into main
Reviewed-on: #11
2026-05-05 06:42:21 +00:00
steve 8813e93317 alerts: 5s polling cadence + live toggle + severity colour cues
CI / Build (windows/amd64) (pull_request) Successful in 23s
CI / Lint (pull_request) Successful in 38s
CI / Build (linux/amd64) (pull_request) Successful in 21s
CI / Build (linux/arm64) (pull_request) Successful in 23s
CI / Test (linux/amd64) (pull_request) Successful in 2m57s
Two operator-visible changes on /alerts:

1. Polling drops from 15s to 5s and gains a checkbox in the table
   header to turn live monitoring on/off. Choice is persisted in
   localStorage so it survives full-page navigations. The toggle
   state is woven into the htmx hx-trigger predicate, so flipping
   the checkbox just sets the flag and the next tick (or the
   absence of one) honours it — no attribute juggling, no
   htmx.process re-init. The dot dims to 0.3 opacity when paused
   so operators can see at a glance that they're looking at a
   stale view.

2. Severity dropdown options pick up the same oklch tints used by
   the row dots / left borders / kind chips. The kind column shows
   only the kind text, so without a colour cue the dropdown
   mentioned a concept (severity) that the table itself didn't
   render. Now the colours bridge the gap.

Note on <option> styling: Chrome and Firefox honour inline color:
on options; Safari ignores it. Acceptable degradation — falls back
to plain text, which is what we had.
2026-05-04 23:35:03 +01:00
steve 9860b412f7 feat(alerts): live-refresh the table every 15s while the tab is visible
The alerts list is the one screen where staleness is genuinely
harmful — an operator can be looking at an Open tab that's already
been resolved by another admin or auto-resolved by the engine, and
take action on a row that no longer exists.

Add an htmx poll on just the table panel:

  hx-get        same URL with current querystring (filters preserved)
  hx-trigger    every 15s, only when document is visible (no idle CPU)
  hx-select     #alerts-table — pull this element out of the response
  hx-swap       outerHTML

Polling lives on the table div, not the page root, so the filter
strip and header don't flash on each tick. Header gains a small
'live ●' label so the polling is discoverable.

RefreshURL is r.URL.RequestURI() on the server side — keeps any
status/severity/host_id/q params intact across refreshes.

Other screens (dashboard, hosts, jobs) deliberately stay manual-
refresh per the project's anti-flicker stance.
2026-05-04 23:30:19 +01:00
steve 1618094a26 feat(channels): include event verb in ntfy title + smtp subject (#10)
Co-authored-by: Steve Cliff <steve@devcloud.guru>
Co-committed-by: Steve Cliff <steve@devcloud.guru>
2026-05-04 22:25:38 +00:00
steve dd53c9e497 ui(alerts): clarify Acknowledge vs Resolve (#9)
Co-authored-by: Steve Cliff <steve@devcloud.guru>
Co-committed-by: Steve Cliff <steve@devcloud.guru>
2026-05-04 22:25:35 +00:00
steve 84814b1386 Merge pull request 'Phase 3 — Alerts: per-source-group dedup' (#8) from p3-alerts-dedup into main
CI / Build (windows/amd64) (pull_request) Successful in 23s
CI / Build (linux/amd64) (pull_request) Successful in 23s
CI / Build (linux/arm64) (pull_request) Successful in 22s
CI / Lint (pull_request) Successful in 1m22s
CI / Test (linux/amd64) (pull_request) Successful in 1m28s
Reviewed-on: #8
2026-05-04 22:11:08 +00:00
steve a45c801884 feat(alerts): per-source-group dedup so two failing backups produce two alerts
Until now the open-alert key was (host_id, kind, resolved_at IS NULL).
A host with two source groups both failing collapsed onto one
backup_failed row — second failure bumped last_seen_at and
overwrote the message but never re-fan-out. Operators saw one
alert that appeared to flap, not two distinct broken things.

Schema changes (column-level ALTER, no rebuild):

- 0015 jobs.source_group_id (FK → source_groups, ON DELETE SET NULL,
  index). Populated for backup jobs in CreateJob.
- 0016 alerts.dedup_key (NOT NULL DEFAULT ''). The old alerts_open
  partial index gets dropped and replaced with a UNIQUE partial
  index on (host_id, kind, dedup_key) WHERE resolved_at IS NULL —
  the index is now the actual dedup primitive.

Plumbing:

- RaiseOrTouch / AutoResolve / Alert struct gain dedup_key.
- engine.JobFinishedEvent gains SourceGroupID; handleJobFinished
  passes it through for backup_failed only (forget/prune/check stay
  repo-scoped with key='').
- ws.handler reads SourceGroupID off the freshly-loaded job row.
- dispatchJobWithPayload gains a *string sourceGroupID arg; the
  per-group Run-now path and schedule.fire path pass &g.ID.

Test coverage: TestRaiseOrTouchDedupsPerSourceGroup proves two
distinct groups produce two distinct open alerts and that resolving
one does not auto-resolve the other.

Dev tool: cmd/_fake_alert gains -dedup-key flag.
2026-05-04 22:59:48 +01:00
steve 7792aadb94 Merge pull request 'Phase 3 — Alerts (P3-05/06/07)' (#7) from p3-alerts into main
Reviewed-on: #7
2026-05-04 21:51:16 +00:00
steve 2eac324cec chore: ignore cmd/_* dev binaries + Tailwind rebuild
CI / Build (windows/amd64) (pull_request) Successful in 21s
CI / Build (linux/amd64) (pull_request) Successful in 21s
CI / Build (linux/arm64) (pull_request) Successful in 22s
CI / Lint (pull_request) Successful in 1m13s
CI / Test (linux/amd64) (pull_request) Successful in 1m20s
cmd/_fake_alert and similar one-shot dev tools live under cmd/_*
where Go's build tooling skips them. Add an explicit gitignore line
so an accidental 'git add cmd/.' can't drag them into a release.

styles.css is the regenerated Tailwind output — picks up the new
ntfy basic-auth fields and the right-rail preview ids.
2026-05-04 22:49:46 +01:00
steve 3cdaee63d4 fix: payload-preview rail follows kind switcher
CI / Lint (pull_request) Successful in 32s
CI / Build (windows/amd64) (pull_request) Successful in 43s
CI / Build (linux/amd64) (pull_request) Successful in 21s
CI / Test (linux/amd64) (pull_request) Successful in 1m18s
CI / Build (linux/arm64) (pull_request) Successful in 43s
Right-rail preview was rendered server-side via {{if eq $f.Kind ...}},
so it stayed on whatever kind the page loaded with. Editing an SMTP
channel and flipping to ntfy in the picker left the email RFC 5322
sample on screen.

Render all three preview panels with id='preview-<kind>' (only the
matching one visible on first render) and toggle their .hidden class
in the kind-switcher JS alongside the field panels. Same pattern
used for fields-<kind>.
2026-05-04 22:40:46 +01:00
steve 7f2a9964db fix: move channel delete-panel out of edit form (nested form bug)
CI / Build (windows/amd64) (pull_request) Successful in 21s
CI / Build (linux/amd64) (pull_request) Successful in 22s
CI / Build (linux/arm64) (pull_request) Successful in 21s
CI / Lint (pull_request) Successful in 1m11s
CI / Test (linux/amd64) (pull_request) Successful in 1m22s
The delete-panel <form action='.../delete'> was nested inside the
main <form action='.../edit'>. HTML doesn't allow nested forms —
browsers parse the inner form as if it didn't exist, so clicking
'Delete permanently' submitted the outer edit form to /edit
instead of /delete, leaving the channel intact.

Move the delete-panel block to a sibling of the main form. The
'Delete channel…' button still toggles its visibility via JS, the
panel still renders inside the page layout, and now its form
actually posts to the delete handler.
2026-05-04 22:35:58 +01:00
steve feaeff217d feat(ntfy): support HTTP Basic auth alongside access tokens
CI / Build (windows/amd64) (pull_request) Successful in 22s
CI / Build (linux/amd64) (pull_request) Successful in 22s
CI / Build (linux/arm64) (pull_request) Successful in 21s
CI / Lint (pull_request) Successful in 1m12s
CI / Test (linux/amd64) (pull_request) Successful in 1m18s
Self-hosted ntfy that doesn't expose a token-mint endpoint can still
authenticate over HTTP Basic. Add Username + Password fields to
NtfyConfig; the channel sends 'Authorization: Basic …' when token is
empty and username is set. Token wins when both are configured.

Form-side: two new optional fields next to the access token, with
the same write-only placeholder treatment as smtp_password (blank
on edit means 'keep stored value'). Username is round-tripped on
edit; password is masked.
2026-05-04 22:25:42 +01:00
steve cffad4b4f3 fix: enabled toggle — list-row click + edit-form save
CI / Build (windows/amd64) (pull_request) Successful in 22s
CI / Build (linux/amd64) (pull_request) Successful in 24s
CI / Build (linux/arm64) (pull_request) Successful in 24s
CI / Lint (pull_request) Successful in 1m15s
CI / Test (linux/amd64) (pull_request) Successful in 1m36s
Two bugs in the channel-enabled affordance:

1. List-row toggle was a static span with no handler; the row's
   row-link overlay swallowed every click and routed to /edit. Add
   POST /settings/notifications/{id}/toggle backed by a new store
   method SetNotificationChannelEnabled, and turn the row toggle
   into an htmx-driven button that swaps in the new state. Use
   event.stopPropagation() on the toggle so it beats the row link.

2. Edit-form toggle visually flipped but the underlying checkbox
   reverted: the visual span lives inside the <label>, so clicking
   it fired the inline JS handler AND the label's native
   checkbox-toggle, cancelling out. Bind to the checkbox 'change'
   event instead and let the label do the toggling — the JS just
   mirrors check.checked into the .on class.
2026-05-04 22:21:45 +01:00
steve 84e121bb9c fix: read 'name' across all per-kind sub-forms when editing channels
CI / Build (windows/amd64) (pull_request) Successful in 22s
CI / Lint (pull_request) Successful in 38s
CI / Build (linux/amd64) (pull_request) Successful in 21s
CI / Build (linux/arm64) (pull_request) Successful in 22s
CI / Test (linux/amd64) (pull_request) Successful in 2m39s
The channel form has three inputs all named 'name' (one per kind
section: webhook / ntfy / smtp), but only the visible kind's input
is filled in. PostForm.Get returns the first regardless of
emptiness, so editing an ntfy or smtp channel always read '' from
the (hidden, unfilled) webhook section's name input and rejected
with 'name required'.

Add firstNonEmpty helper that scans the slice for the first
non-blank value. Same flavour of bug as the enabled checkbox fix
in 6466f8c — both fall out of having multiple inputs share a name
across the per-kind sub-forms.
2026-05-04 22:16:59 +01:00
steve c5b884a22b tasks: tick P3-05/06/07 + Playwright sweep notes
CI / Build (windows/amd64) (pull_request) Successful in 22s
CI / Lint (pull_request) Successful in 32s
CI / Build (linux/amd64) (pull_request) Successful in 22s
CI / Build (linux/arm64) (pull_request) Successful in 21s
CI / Test (linux/amd64) (pull_request) Successful in 3m44s
Sweep against the live smoke env confirmed the alerts subsystem
end-to-end: three channels (webhook → local sink, ntfy → ntfy.sh,
SMTP → MailHog) created and verified via the Test button; synthetic
critical raised; ack + resolve fan out alert.acknowledged /
alert.resolved across all three; dashboard banner appears and
clears; nav badge tracks open count.

Three real bugs found and fixed mid-sweep — see preceding three
commits for the full reasoning.
2026-05-04 21:01:34 +01:00
steve 3d99306cea fix: refresh hosts.open_alert_count on Raise/Resolve/AutoResolve
The denormalised projection was never written by the alerts code
path, so the dashboard's OPEN ALERTS card and the per-host alerts
column always read 0 regardless of how many alerts were open.
fleet.GetStats sums hosts.open_alert_count; if it never moves, the
card is decoration.

Add refreshHostOpenAlertCount that recomputes from the alerts table
(self-healing — no +/- bookkeeping to drift). Call it after the
commit in RaiseOrTouch when a row was inserted, after Resolve, and
after AutoResolve.

Caught during the live sweep: a synthetic critical raised the count
to 1, but resolving it left the dashboard reading '1 unresolved'
indefinitely.
2026-05-04 21:01:17 +01:00
steve 6466f8c759 fix: read enabled checkbox correctly when paired with hidden=0 sibling
The notification channel form has a <input hidden name=enabled value=0>
plus a <input checkbox name=enabled value=1> so unchecking the box
still submits 'enabled=0' (otherwise the field would just be absent).
But Go's url.Values.Get returns the FIRST value, so even when the
checkbox is ticked the handler read '0' and persisted enabled=false.

Scan r.PostForm["enabled"] for any '1' instead. Caught during the
sweep — all three test channels saved with enabled=0 even though
the toggle visually rendered ON.
2026-05-04 21:00:54 +01:00
steve 9be3cead8e fix: dispatch alert.acknowledged + alert.resolved on UI ack/resolve
Spotted during the live Playwright sweep: clicking Acknowledge or
Resolve updated the alert row but never fanned out a notification.
The handlers went straight to Store.Acknowledge/Resolve, bypassing
the hub.

Add Engine.Acknowledge and Engine.Resolve that wrap the store call
and dispatch the matching event to every enabled channel. The UI
handlers prefer the engine path when wired, and fall back to the
direct store call so unit tests that construct a Server without an
engine still work.

Use context.WithoutCancel for the goroutine dispatch — the request
context is cancelled the instant the handler returns 204, so the
naive 'go e.hub.Dispatch(ctx, ...)' was racing the response and
losing the channel-list query with 'context canceled'.
2026-05-04 21:00:44 +01:00
steve ee410fcf95 alert: construct + run engine; expose hub to handlers
- Construct notification.NewHub and alert.NewEngine at boot in cmd/server/main.go
- Start go alertEngine.Run(ctx) after construction, before the HTTP listener
- Wire AlertEngine and NotificationHub into rmhttp.Deps (fields already existed)
- Remove the TODO(G1) in the offline sweeper; now calls NotifyHostOffline per ID
2026-05-04 20:32:10 +01:00
steve e0fbb8c980 ui: dashboard crit-alerts banner 2026-05-04 20:29:49 +01:00
steve 371fe734f3 ui: /settings/notifications list + edit form (3 kinds)
Add settings.html (shell + sub-tab nav + conditional list/edit body),
notifications.html and notification_edit.html (glob stubs), and the
supporting CSS tokens (.ch-row, .ch-icon, .toggle, .kind-grid,
.kind-card, .radio-pip, .test-pill) to input.css. Rebuild styles.css.
Add ui_parse_test.go to catch template regressions at test time.

The kind picker is JS-driven (no full page reload); the enabled toggle
mirrors the existing visual toggle pattern; the test-notification button
uses HTMX and renders the JSON response as a coloured pill client-side.
2026-05-04 20:25:06 +01:00
steve d373d19647 ui: F1 — populate OpenAlerts in baseView so nav badge updates everywhere
Flagged in review of cd38b40: the Alerts tab badge should show the
open count from any page, not just /alerts. baseView now takes the
request and queries store.ListAlerts(Status: "open") to fill
view.OpenAlerts on every page render. All call sites updated.
2026-05-04 20:19:09 +01:00
steve cd38b40516 ui: alerts list page + alert row partial + nav badge 2026-05-04 20:15:01 +01:00
steve de6939b3f6 http: /settings/notifications CRUD + test endpoint 2026-05-04 20:06:45 +01:00
steve 873821b871 http: /alerts list + ack/resolve handlers + /api/alerts JSON 2026-05-04 19:59:24 +01:00
steve 8c42b00228 alert: wire engine into ws hello + MarkJobFinished + offline sweep
- ws.HandlerDeps gains an AlertEngine *alert.Engine field; populated
  from http.Deps.AlertEngine (nil until G1 constructs the engine)
- runAgentLoop calls NotifyHostOnline after MarkHostHello succeeds
- dispatchAgentMessage MsgJobFinished case calls NotifyJobFinished,
  looking up the job Kind via Store.GetJob before notifying
- store.MarkHostsOfflineStaleReturnIDs added: SELECT+UPDATE in one
  transaction, returns the IDs that flipped to offline
- offline sweeper in cmd/server/main.go switched to the new variant;
  TODO(G1) comment marks where NotifyHostOffline calls will land
2026-05-04 19:54:39 +01:00
steve cb4695e09a alert: rule logic for the six v1 rules 2026-05-04 19:50:33 +01:00
steve f38930e2e6 alert: engine skeleton + event channels 2026-05-04 19:47:09 +01:00
steve 16e71a0708 notification: Hub fan-out + log writer 2026-05-04 19:44:31 +01:00
steve a6ac9ee71d notification: smtp channel 2026-05-04 19:40:21 +01:00
steve a99864c649 notification: B3 — Content-Type header + URL trim
Fixes flagged in spec review of f0a323e: ntfy POSTs need explicit
Content-Type: text/plain (the spec calls for it; ntfy works without
but explicit beats inferred); trim trailing slashes from server URL
to avoid double-slash when operators paste 'https://ntfy.sh/'.
2026-05-04 19:38:16 +01:00
steve f0a323ef91 notification: ntfy channel 2026-05-04 19:35:50 +01:00
steve c22fb24f5b notification: webhook channel 2026-05-04 19:33:29 +01:00