Two operator-visible changes on /alerts:
1. Polling drops from 15s to 5s and gains a checkbox in the table
header to turn live monitoring on/off. Choice is persisted in
localStorage so it survives full-page navigations. The toggle
state is woven into the htmx hx-trigger predicate, so flipping
the checkbox just sets the flag and the next tick (or the
absence of one) honours it — no attribute juggling, no
htmx.process re-init. The dot dims to 0.3 opacity when paused
so operators can see at a glance that they're looking at a
stale view.
2. Severity dropdown options pick up the same oklch tints used by
the row dots / left borders / kind chips. The kind column shows
only the kind text, so without a colour cue the dropdown
mentioned a concept (severity) that the table itself didn't
render. Now the colours bridge the gap.
Note on <option> styling: Chrome and Firefox honour inline color:
on options; Safari ignores it. Acceptable degradation — falls back
to plain text, which is what we had.
The alerts list is the one screen where staleness is genuinely
harmful — an operator can be looking at an Open tab that's already
been resolved by another admin or auto-resolved by the engine, and
take action on a row that no longer exists.
Add an htmx poll on just the table panel:
hx-get same URL with current querystring (filters preserved)
hx-trigger every 15s, only when document is visible (no idle CPU)
hx-select #alerts-table — pull this element out of the response
hx-swap outerHTML
Polling lives on the table div, not the page root, so the filter
strip and header don't flash on each tick. Header gains a small
'live ●' label so the polling is discoverable.
RefreshURL is r.URL.RequestURI() on the server side — keeps any
status/severity/host_id/q params intact across refreshes.
Other screens (dashboard, hosts, jobs) deliberately stay manual-
refresh per the project's anti-flicker stance.
Until now the open-alert key was (host_id, kind, resolved_at IS NULL).
A host with two source groups both failing collapsed onto one
backup_failed row — second failure bumped last_seen_at and
overwrote the message but never re-fan-out. Operators saw one
alert that appeared to flap, not two distinct broken things.
Schema changes (column-level ALTER, no rebuild):
- 0015 jobs.source_group_id (FK → source_groups, ON DELETE SET NULL,
index). Populated for backup jobs in CreateJob.
- 0016 alerts.dedup_key (NOT NULL DEFAULT ''). The old alerts_open
partial index gets dropped and replaced with a UNIQUE partial
index on (host_id, kind, dedup_key) WHERE resolved_at IS NULL —
the index is now the actual dedup primitive.
Plumbing:
- RaiseOrTouch / AutoResolve / Alert struct gain dedup_key.
- engine.JobFinishedEvent gains SourceGroupID; handleJobFinished
passes it through for backup_failed only (forget/prune/check stay
repo-scoped with key='').
- ws.handler reads SourceGroupID off the freshly-loaded job row.
- dispatchJobWithPayload gains a *string sourceGroupID arg; the
per-group Run-now path and schedule.fire path pass &g.ID.
Test coverage: TestRaiseOrTouchDedupsPerSourceGroup proves two
distinct groups produce two distinct open alerts and that resolving
one does not auto-resolve the other.
Dev tool: cmd/_fake_alert gains -dedup-key flag.
cmd/_fake_alert and similar one-shot dev tools live under cmd/_*
where Go's build tooling skips them. Add an explicit gitignore line
so an accidental 'git add cmd/.' can't drag them into a release.
styles.css is the regenerated Tailwind output — picks up the new
ntfy basic-auth fields and the right-rail preview ids.
Right-rail preview was rendered server-side via {{if eq $f.Kind ...}},
so it stayed on whatever kind the page loaded with. Editing an SMTP
channel and flipping to ntfy in the picker left the email RFC 5322
sample on screen.
Render all three preview panels with id='preview-<kind>' (only the
matching one visible on first render) and toggle their .hidden class
in the kind-switcher JS alongside the field panels. Same pattern
used for fields-<kind>.
The delete-panel <form action='.../delete'> was nested inside the
main <form action='.../edit'>. HTML doesn't allow nested forms —
browsers parse the inner form as if it didn't exist, so clicking
'Delete permanently' submitted the outer edit form to /edit
instead of /delete, leaving the channel intact.
Move the delete-panel block to a sibling of the main form. The
'Delete channel…' button still toggles its visibility via JS, the
panel still renders inside the page layout, and now its form
actually posts to the delete handler.
Self-hosted ntfy that doesn't expose a token-mint endpoint can still
authenticate over HTTP Basic. Add Username + Password fields to
NtfyConfig; the channel sends 'Authorization: Basic …' when token is
empty and username is set. Token wins when both are configured.
Form-side: two new optional fields next to the access token, with
the same write-only placeholder treatment as smtp_password (blank
on edit means 'keep stored value'). Username is round-tripped on
edit; password is masked.
Two bugs in the channel-enabled affordance:
1. List-row toggle was a static span with no handler; the row's
row-link overlay swallowed every click and routed to /edit. Add
POST /settings/notifications/{id}/toggle backed by a new store
method SetNotificationChannelEnabled, and turn the row toggle
into an htmx-driven button that swaps in the new state. Use
event.stopPropagation() on the toggle so it beats the row link.
2. Edit-form toggle visually flipped but the underlying checkbox
reverted: the visual span lives inside the <label>, so clicking
it fired the inline JS handler AND the label's native
checkbox-toggle, cancelling out. Bind to the checkbox 'change'
event instead and let the label do the toggling — the JS just
mirrors check.checked into the .on class.
The channel form has three inputs all named 'name' (one per kind
section: webhook / ntfy / smtp), but only the visible kind's input
is filled in. PostForm.Get returns the first regardless of
emptiness, so editing an ntfy or smtp channel always read '' from
the (hidden, unfilled) webhook section's name input and rejected
with 'name required'.
Add firstNonEmpty helper that scans the slice for the first
non-blank value. Same flavour of bug as the enabled checkbox fix
in 24eecc1 — both fall out of having multiple inputs share a name
across the per-kind sub-forms.
Sweep against the live smoke env confirmed the alerts subsystem
end-to-end: three channels (webhook → local sink, ntfy → ntfy.sh,
SMTP → MailHog) created and verified via the Test button; synthetic
critical raised; ack + resolve fan out alert.acknowledged /
alert.resolved across all three; dashboard banner appears and
clears; nav badge tracks open count.
Three real bugs found and fixed mid-sweep — see preceding three
commits for the full reasoning.
The denormalised projection was never written by the alerts code
path, so the dashboard's OPEN ALERTS card and the per-host alerts
column always read 0 regardless of how many alerts were open.
fleet.GetStats sums hosts.open_alert_count; if it never moves, the
card is decoration.
Add refreshHostOpenAlertCount that recomputes from the alerts table
(self-healing — no +/- bookkeeping to drift). Call it after the
commit in RaiseOrTouch when a row was inserted, after Resolve, and
after AutoResolve.
Caught during the live sweep: a synthetic critical raised the count
to 1, but resolving it left the dashboard reading '1 unresolved'
indefinitely.
The notification channel form has a <input hidden name=enabled value=0>
plus a <input checkbox name=enabled value=1> so unchecking the box
still submits 'enabled=0' (otherwise the field would just be absent).
But Go's url.Values.Get returns the FIRST value, so even when the
checkbox is ticked the handler read '0' and persisted enabled=false.
Scan r.PostForm["enabled"] for any '1' instead. Caught during the
sweep — all three test channels saved with enabled=0 even though
the toggle visually rendered ON.
Spotted during the live Playwright sweep: clicking Acknowledge or
Resolve updated the alert row but never fanned out a notification.
The handlers went straight to Store.Acknowledge/Resolve, bypassing
the hub.
Add Engine.Acknowledge and Engine.Resolve that wrap the store call
and dispatch the matching event to every enabled channel. The UI
handlers prefer the engine path when wired, and fall back to the
direct store call so unit tests that construct a Server without an
engine still work.
Use context.WithoutCancel for the goroutine dispatch — the request
context is cancelled the instant the handler returns 204, so the
naive 'go e.hub.Dispatch(ctx, ...)' was racing the response and
losing the channel-list query with 'context canceled'.
- Construct notification.NewHub and alert.NewEngine at boot in cmd/server/main.go
- Start go alertEngine.Run(ctx) after construction, before the HTTP listener
- Wire AlertEngine and NotificationHub into rmhttp.Deps (fields already existed)
- Remove the TODO(G1) in the offline sweeper; now calls NotifyHostOffline per ID
Add settings.html (shell + sub-tab nav + conditional list/edit body),
notifications.html and notification_edit.html (glob stubs), and the
supporting CSS tokens (.ch-row, .ch-icon, .toggle, .kind-grid,
.kind-card, .radio-pip, .test-pill) to input.css. Rebuild styles.css.
Add ui_parse_test.go to catch template regressions at test time.
The kind picker is JS-driven (no full page reload); the enabled toggle
mirrors the existing visual toggle pattern; the test-notification button
uses HTMX and renders the JSON response as a coloured pill client-side.
Flagged in review of 35dee98: the Alerts tab badge should show the
open count from any page, not just /alerts. baseView now takes the
request and queries store.ListAlerts(Status: "open") to fill
view.OpenAlerts on every page render. All call sites updated.
- ws.HandlerDeps gains an AlertEngine *alert.Engine field; populated
from http.Deps.AlertEngine (nil until G1 constructs the engine)
- runAgentLoop calls NotifyHostOnline after MarkHostHello succeeds
- dispatchAgentMessage MsgJobFinished case calls NotifyJobFinished,
looking up the job Kind via Store.GetJob before notifying
- store.MarkHostsOfflineStaleReturnIDs added: SELECT+UPDATE in one
transaction, returns the IDs that flipped to offline
- offline sweeper in cmd/server/main.go switched to the new variant;
TODO(G1) comment marks where NotifyHostOffline calls will land
Fixes flagged in spec review of 1ff0b2d: ntfy POSTs need explicit
Content-Type: text/plain (the spec calls for it; ntfy works without
but explicit beats inferred); trim trailing slashes from server URL
to avoid double-slash when operators paste 'https://ntfy.sh/'.
The CI runs go test with -race; the agent runner has two pump goroutines
(pumpStdout + pumpStderr) writing through the sender concurrently, and
the unprotected fakeSender slice append raced. The cancel_test had a
local 'safeSender' workaround for the same issue; promote that mutex
onto fakeSender itself so every test in the package is race-clean
without per-test variants.
- fakeSender grows mu sync.Mutex; Send takes/releases. New snapshot()
helper for tests that want a stable copy.
- cancel_test drops its local safeSender + sync import; uses fakeSender.
Verified: go test -race ./... passes across all packages.
1. Agent-side MkdirAll on the new-dir restore target. Restic creates
missing leaves but won't traverse multiple missing levels, and
under the systemd sandbox writes outside ReadWritePaths fail
anyway. Calling os.MkdirAll(target, 0700) before invoking restic
means the operator never has to pre-create the per-job subdir,
and a path the sandbox rejects surfaces as a clean
'restic restore: prepare target ...: read-only file system' error
in the job log instead of a cryptic restic-side stat failure.
2. tasks.md Phase 3 — Restore section refreshed:
- P3-X4 added (job log download dropdown — txt + ndjson)
- P3-X5 added (UK lint locale switch + 73-correction sweep)
- P3-X6 added (SIZE/FILES tooltip when host's restic < 0.17)
- P3-03 entry expanded to cover version-gated --no-ownership,
editable target, $HOME expansion, agent-side MkdirAll
- As-shipped sweep summary mentions custom-target restore +
download dropdown + tooltip in addition to the original walk
Test: TestRunRestoreNewDirAutoCreatesTarget seeds a multi-level
target the operator hasn't created and confirms RunRestore mkdir's
the chain before invoking restic.
Per-snapshot size + file-count come from the embedded summary block
restic added to 'snapshots --json' in 0.17 (the source comment in
internal/restic/snapshots.go incorrectly said 0.16+). Hosts running
0.16.x leave those columns blank.
- Fix the snapshots.go doc comment: '0.16+' -> '0.17+'.
- hostDetailPage carries a LegacyRestic bool computed from the host's
reported ResticVersion via Env.AtLeastVersion(0, 17). Empty version
also counts as legacy (conservative default).
- Template attaches title='Needs restic 0.17+ on the agent host. This
host runs <ver>.' + cursor:help on the SIZE / FILES headers when
the flag is true. Hosts already on 0.17+ get no tooltip and no
extra styling.
A host upgrading restic to 0.17+ gets the columns populated on the
next backup automatically — no further code change needed.
Replace the floating 'Download log' button + bare '.ndjson' link with
one cohesive dropdown menu — same affordance as the rest of the
header, opens to two well-described options.
- Native <details><summary> for keyboard + no-JS support; only the
click-outside-to-close handler is JS (a few lines).
- New .dropdown / .dropdown-menu / .dropdown-item tokens in
web/styles/input.css. Reusable for future header menus
(host-detail overflow, source-group action menus, etc).
- Chevron flips 180 degrees when open via .dropdown[open] selector.
- Each option has a label + a mono hint line explaining when to pick it
(.txt for humans / paste into a ticket; .ndjson for jq / tooling).
Three small follow-ups from review:
1. Restore target is now operator-editable. Default value is the
literal '\$HOME/rm-restore/<job-id>/' (agent expands \$HOME at
run time using os.UserHomeDir(); also handles \${HOME} and ~/
prefixes). Operator can replace with any absolute path.
- ui_restore.go validates the input is either absolute or starts
with one of the recognised prefixes; other env-var refs (\$PATH
etc.) are deliberately rejected so operator paths can't pick up
arbitrary agent env values.
- host_restore.html replaces the read-only mono-text display with
a real <input>; help text spells out that \$HOME resolves
agent-side and <job-id> is substituted on dispatch.
- install.sh + the systemd unit prep /root/rm-restore so the
default works under the sandbox: ReadWritePaths gains a soft
'-/root/rm-restore' entry (the '-' makes the bind-mount soft-fail
if missing, but install.sh pre-creates it root-owned 0700).
2. --no-ownership flag now gated on restic version. The flag was
added in restic 0.17 and 0.16 rejects it. Previously dropped it
wholesale — that meant new-dir restores silently preserved
ownership against design intent on 0.17+. Now the agent threads
its detected restic version (sysinfo already collects it) through
runner.Config -> restic.Env, and RunRestore appends --no-ownership
only when AtLeastVersion(0, 17) returns true. 0.16 hosts still
restore with original uid/gid; help text in the wizard explicitly
notes this. The previous 'Original ownership is preserved' copy
was wrong for new-dir mode and is corrected.
3. golangci-lint misspell locale switched US -> UK and the codebase
swept (73 corrections, mostly behaviour/serialise/recognise/honour).
Wire-format ErrorCode 'unauthorized' -> 'unauthorised' is a tiny
contract change but the agent doesn't parse those codes today and
no external API consumers exist yet. Tests passed before + after.
Tests:
- internal/restic/version_test.go covers Env.AtLeastVersion across
edge cases (empty, exact match, patch above, minor below, non-
numeric) and expandHome on \$HOME / \${HOME} / ~/, plus
pass-through for absolute paths and refusal of other env vars.
- ui_restore_test updated: TargetDir now starts '\$HOME/rm-restore/'
with the job_id substituted into the placeholder.
Live verified on the smoke env: default target restored to
/root/rm-restore/<job-id>/ as the agent's expanded \$HOME (2 files,
14 bytes); custom override '/tmp/custom-restore/<job-id>/' restored
into the agent's PrivateTmp namespace (1 file, 6 bytes); both jobs
'succeeded', exit 0.
The diff job's full output streams to the standard live job log page,
which can be a lot of text the operator wants to grep through or paste
into a ticket. Add a Download button.
Source of truth is the persisted job_logs table — works any time
(running or finished) and doesn't need to pause the live WS stream.
The download is 'everything the server has up to right now'; if the
operator wants a fuller snapshot of a still-running job, they hit
Download again.
- New endpoint GET /api/jobs/{id}/log.{txt,ndjson} (chi {format}
matcher constrained to the two known suffixes). Auth via session
cookie. 404 on unknown job.
- internal/server/http/job_download.go writeLogsText emits a small
header + 'HH:MM:SS.mmm TAG payload' rows mirroring what the live
page shows. writeLogsNDJSON emits one self-contained {seq,ts,stream,
payload} JSON object per line — appending stays valid (each line
stands alone), and the whole file pipes cleanly into jq. NDJSON is
newline-delimited JSON; not the same as a JSON array.
- web/templates/pages/job_detail.html grows two header buttons:
'Download log' (txt) + '.ndjson' ghost variant for tooling.
Tests cover the txt format (header + per-row shape), the ndjson
format (each line round-trips through json.Unmarshal), unknown job
404, unauthenticated 401.
Bug fixes from the Playwright sweep against the live smoke server:
1. Snapshot-picker layout. The .snap-row class was used in the wireframe
but never landed in web/styles/input.css; rows rendered as vertical
blocks instead of a 6-column grid. Added the token (mirrors host-row
shape with restore-specific column widths).
2. Tree expansion. hx-target='closest .tree-row + .tree-children' isn't
a valid HTMX selector — modifiers don't chain. Replaced HTMX-driven
expansion with a small window.__rmTreeToggle helper that uses plain
fetch + .tree-pair wrapper structure for trivial sibling lookup.
Caches loaded state per node.
3. --no-ownership flag dropped. Restic 0.17 introduced --no-ownership;
0.16 rejects it ('unknown flag') before doing any work. Since the
agent runs as root in the systemd unit, restored files keep their
original uid/gid either way and the parent dir is root-owned, so
the 'cp without sudo' rationale doesn't hold. Drop the flag entirely.
4. Default target dir moved to /var/lib/restic-manager/restore. The
systemd unit pins ReadWritePaths to /etc/restic-manager +
/var/lib/restic-manager (with ProtectSystem=strict making the rest
of /var read-only); writes to /var/restic-restore failed with
'read-only file system'.
5. Confirm summary HTML escaping. defaultTarget JS literal evaluates
to a string with literal angle brackets; insertion into innerHTML
must escape them. Added an inline HTML-escape pass.
tasks.md ticked for the Restore sub-phase with a sweep summary
covering the live end-to-end test.