Replaces the 501 stub with the full handler: validates the token and
password, hashes and stores the password, deletes the setup token,
mints an 8-hour session cookie, appends a user.setup_completed audit
entry, and redirects to /. Adds TestSetupPostHappyPath covering the
full round-trip including normal-login verification after setup.
Routes are now structured into Public / Viewer / Operator / Admin bands
using requireRole middleware. Job log stream and download moved into the
Viewer band. healthz moved from New() into routes() with the other
public endpoints.
Test job was wall-clocked by `internal/server/http` (~156s on the
self-hosted runner under -race). Two changes here cut that:
1. Matrix-shard the test job by package group: server-http, store,
and "rest" (everything else, computed via `go list | grep -v`).
Each shard runs on its own runner so the heavy package isn't
CPU-starved by siblings.
2. `auth.HashPassword` drops to cheap argon2id params (8 KiB / 1
iter / 1 lane) when `testing.Testing()` returns true. Production
params are unchanged. VerifyPassword reads params from the
encoded hash so cheap-params hashes verify identically — no test
call sites need to change.
The alerts list is the one screen where staleness is genuinely
harmful — an operator can be looking at an Open tab that's already
been resolved by another admin or auto-resolved by the engine, and
take action on a row that no longer exists.
Add an htmx poll on just the table panel:
hx-get same URL with current querystring (filters preserved)
hx-trigger every 15s, only when document is visible (no idle CPU)
hx-select #alerts-table — pull this element out of the response
hx-swap outerHTML
Polling lives on the table div, not the page root, so the filter
strip and header don't flash on each tick. Header gains a small
'live ●' label so the polling is discoverable.
RefreshURL is r.URL.RequestURI() on the server side — keeps any
status/severity/host_id/q params intact across refreshes.
Other screens (dashboard, hosts, jobs) deliberately stay manual-
refresh per the project's anti-flicker stance.
Until now the open-alert key was (host_id, kind, resolved_at IS NULL).
A host with two source groups both failing collapsed onto one
backup_failed row — second failure bumped last_seen_at and
overwrote the message but never re-fan-out. Operators saw one
alert that appeared to flap, not two distinct broken things.
Schema changes (column-level ALTER, no rebuild):
- 0015 jobs.source_group_id (FK → source_groups, ON DELETE SET NULL,
index). Populated for backup jobs in CreateJob.
- 0016 alerts.dedup_key (NOT NULL DEFAULT ''). The old alerts_open
partial index gets dropped and replaced with a UNIQUE partial
index on (host_id, kind, dedup_key) WHERE resolved_at IS NULL —
the index is now the actual dedup primitive.
Plumbing:
- RaiseOrTouch / AutoResolve / Alert struct gain dedup_key.
- engine.JobFinishedEvent gains SourceGroupID; handleJobFinished
passes it through for backup_failed only (forget/prune/check stay
repo-scoped with key='').
- ws.handler reads SourceGroupID off the freshly-loaded job row.
- dispatchJobWithPayload gains a *string sourceGroupID arg; the
per-group Run-now path and schedule.fire path pass &g.ID.
Test coverage: TestRaiseOrTouchDedupsPerSourceGroup proves two
distinct groups produce two distinct open alerts and that resolving
one does not auto-resolve the other.
Dev tool: cmd/_fake_alert gains -dedup-key flag.
Self-hosted ntfy that doesn't expose a token-mint endpoint can still
authenticate over HTTP Basic. Add Username + Password fields to
NtfyConfig; the channel sends 'Authorization: Basic …' when token is
empty and username is set. Token wins when both are configured.
Form-side: two new optional fields next to the access token, with
the same write-only placeholder treatment as smtp_password (blank
on edit means 'keep stored value'). Username is round-tripped on
edit; password is masked.
Two bugs in the channel-enabled affordance:
1. List-row toggle was a static span with no handler; the row's
row-link overlay swallowed every click and routed to /edit. Add
POST /settings/notifications/{id}/toggle backed by a new store
method SetNotificationChannelEnabled, and turn the row toggle
into an htmx-driven button that swaps in the new state. Use
event.stopPropagation() on the toggle so it beats the row link.
2. Edit-form toggle visually flipped but the underlying checkbox
reverted: the visual span lives inside the <label>, so clicking
it fired the inline JS handler AND the label's native
checkbox-toggle, cancelling out. Bind to the checkbox 'change'
event instead and let the label do the toggling — the JS just
mirrors check.checked into the .on class.
The channel form has three inputs all named 'name' (one per kind
section: webhook / ntfy / smtp), but only the visible kind's input
is filled in. PostForm.Get returns the first regardless of
emptiness, so editing an ntfy or smtp channel always read '' from
the (hidden, unfilled) webhook section's name input and rejected
with 'name required'.
Add firstNonEmpty helper that scans the slice for the first
non-blank value. Same flavour of bug as the enabled checkbox fix
in 6466f8c — both fall out of having multiple inputs share a name
across the per-kind sub-forms.
The denormalised projection was never written by the alerts code
path, so the dashboard's OPEN ALERTS card and the per-host alerts
column always read 0 regardless of how many alerts were open.
fleet.GetStats sums hosts.open_alert_count; if it never moves, the
card is decoration.
Add refreshHostOpenAlertCount that recomputes from the alerts table
(self-healing — no +/- bookkeeping to drift). Call it after the
commit in RaiseOrTouch when a row was inserted, after Resolve, and
after AutoResolve.
Caught during the live sweep: a synthetic critical raised the count
to 1, but resolving it left the dashboard reading '1 unresolved'
indefinitely.
The notification channel form has a <input hidden name=enabled value=0>
plus a <input checkbox name=enabled value=1> so unchecking the box
still submits 'enabled=0' (otherwise the field would just be absent).
But Go's url.Values.Get returns the FIRST value, so even when the
checkbox is ticked the handler read '0' and persisted enabled=false.
Scan r.PostForm["enabled"] for any '1' instead. Caught during the
sweep — all three test channels saved with enabled=0 even though
the toggle visually rendered ON.
Spotted during the live Playwright sweep: clicking Acknowledge or
Resolve updated the alert row but never fanned out a notification.
The handlers went straight to Store.Acknowledge/Resolve, bypassing
the hub.
Add Engine.Acknowledge and Engine.Resolve that wrap the store call
and dispatch the matching event to every enabled channel. The UI
handlers prefer the engine path when wired, and fall back to the
direct store call so unit tests that construct a Server without an
engine still work.
Use context.WithoutCancel for the goroutine dispatch — the request
context is cancelled the instant the handler returns 204, so the
naive 'go e.hub.Dispatch(ctx, ...)' was racing the response and
losing the channel-list query with 'context canceled'.
Add settings.html (shell + sub-tab nav + conditional list/edit body),
notifications.html and notification_edit.html (glob stubs), and the
supporting CSS tokens (.ch-row, .ch-icon, .toggle, .kind-grid,
.kind-card, .radio-pip, .test-pill) to input.css. Rebuild styles.css.
Add ui_parse_test.go to catch template regressions at test time.
The kind picker is JS-driven (no full page reload); the enabled toggle
mirrors the existing visual toggle pattern; the test-notification button
uses HTMX and renders the JSON response as a coloured pill client-side.
Flagged in review of cd38b40: the Alerts tab badge should show the
open count from any page, not just /alerts. baseView now takes the
request and queries store.ListAlerts(Status: "open") to fill
view.OpenAlerts on every page render. All call sites updated.
- ws.HandlerDeps gains an AlertEngine *alert.Engine field; populated
from http.Deps.AlertEngine (nil until G1 constructs the engine)
- runAgentLoop calls NotifyHostOnline after MarkHostHello succeeds
- dispatchAgentMessage MsgJobFinished case calls NotifyJobFinished,
looking up the job Kind via Store.GetJob before notifying
- store.MarkHostsOfflineStaleReturnIDs added: SELECT+UPDATE in one
transaction, returns the IDs that flipped to offline
- offline sweeper in cmd/server/main.go switched to the new variant;
TODO(G1) comment marks where NotifyHostOffline calls will land
Fixes flagged in spec review of f0a323e: ntfy POSTs need explicit
Content-Type: text/plain (the spec calls for it; ntfy works without
but explicit beats inferred); trim trailing slashes from server URL
to avoid double-slash when operators paste 'https://ntfy.sh/'.