restic-manager

Author	SHA1	Message	Date
steve	95aee73e2c	http: gated test for admin-band reject of operator (lands fully in B4+E1)	2026-05-05 10:57:24 +01:00
steve	f87ba29836	http: requireRole middleware + 403 forbidden page	2026-05-05 10:57:24 +01:00
steve	2073898c10	http: test helpers — makeUser, loginAs	2026-05-05 10:57:24 +01:00
steve	37a25beb14	http: roleAtLeast helper for the role hierarchy	2026-05-05 10:57:24 +01:00
steve	f0828782c1	store: DeleteSessionsByUserID for force-logout	2026-05-05 10:57:24 +01:00
steve	12391abef0	store: user_setup_tokens CRUD + cleanup-expired	2026-05-05 10:57:24 +01:00
steve	2c090171e5	store: lowercase username, email/disable helpers, last-admin count	2026-05-05 10:57:24 +01:00
steve	bd08d8ca14	store: extend User struct with Email, DisabledAt, MustChangePassword	2026-05-05 10:57:24 +01:00
steve	a7e53e0a64	store: migration 0018 — user_setup_tokens	2026-05-05 10:57:24 +01:00
steve	ca170fedc5	store: migration 0017 — users.email, disabled_at, must_change_password	2026-05-05 10:57:24 +01:00
steve	03e5ec31f1	ci: shard test job + cheap argon2 in test mode Test job was wall-clocked by `internal/server/http` (~156s on the self-hosted runner under -race). Two changes here cut that: 1. Matrix-shard the test job by package group: server-http, store, and "rest" (everything else, computed via `go list | grep -v`). Each shard runs on its own runner so the heavy package isn't CPU-starved by siblings. 2. `auth.HashPassword` drops to cheap argon2id params (8 KiB / 1 iter / 1 lane) when `testing.Testing()` returns true. Production params are unchanged. VerifyPassword reads params from the encoded hash so cheap-params hashes verify identically — no test call sites need to change.	2026-05-05 08:40:50 +01:00
steve	ba425c9766	feat(audit): clickable column headers with asc/desc sort	2026-05-05 08:15:22 +01:00
steve	1d0d994bc4	audit(csv): drop user_id and target_id columns	2026-05-05 08:05:41 +01:00
steve	489f831fc7	feat(audit): CSV export, absolute timestamps, payload modal	2026-05-05 08:00:53 +01:00
steve	3f36bcd0b0	feat(audit): P3-08 — audit log UI with filters	2026-05-05 07:49:25 +01:00
steve	9860b412f7	feat(alerts): live-refresh the table every 15s while the tab is visible The alerts list is the one screen where staleness is genuinely harmful — an operator can be looking at an Open tab that's already been resolved by another admin or auto-resolved by the engine, and take action on a row that no longer exists. Add an htmx poll on just the table panel: hx-get same URL with current querystring (filters preserved) hx-trigger every 15s, only when document is visible (no idle CPU) hx-select #alerts-table — pull this element out of the response hx-swap outerHTML Polling lives on the table div, not the page root, so the filter strip and header don't flash on each tick. Header gains a small 'live ●' label so the polling is discoverable. RefreshURL is r.URL.RequestURI() on the server side — keeps any status/severity/host_id/q params intact across refreshes. Other screens (dashboard, hosts, jobs) deliberately stay manual-refresh per the project's anti-flicker stance.	2026-05-04 23:30:19 +01:00
steve	1618094a26	feat(channels): include event verb in ntfy title + smtp subject (#10) Co-authored-by: Steve Cliff <steve@devcloud.guru> Co-committed-by: Steve Cliff <steve@devcloud.guru>	2026-05-04 22:25:38 +00:00
steve	a45c801884	feat(alerts): per-source-group dedup so two failing backups produce two alerts Until now the open-alert key was (host_id, kind, resolved_at IS NULL). A host with two source groups both failing collapsed onto one backup_failed row — second failure bumped last_seen_at and overwrote the message but never re-fan-out. Operators saw one alert that appeared to flap, not two distinct broken things. Schema changes (column-level ALTER, no rebuild): - 0015 jobs.source_group_id (FK → source_groups, ON DELETE SET NULL, index). Populated for backup jobs in CreateJob. - 0016 alerts.dedup_key (NOT NULL DEFAULT ''). The old alerts_open partial index gets dropped and replaced with a UNIQUE partial index on (host_id, kind, dedup_key) WHERE resolved_at IS NULL — the index is now the actual dedup primitive. Plumbing: - RaiseOrTouch / AutoResolve / Alert struct gain dedup_key. - engine.JobFinishedEvent gains SourceGroupID; handleJobFinished passes it through for backup_failed only (forget/prune/check stay repo-scoped with key=''). - ws.handler reads SourceGroupID off the freshly-loaded job row. - dispatchJobWithPayload gains a *string sourceGroupID arg; the per-group Run-now path and schedule.fire path pass &g.ID. Test coverage: TestRaiseOrTouchDedupsPerSourceGroup proves two distinct groups produce two distinct open alerts and that resolving one does not auto-resolve the other. Dev tool: cmd/_fake_alert gains -dedup-key flag.	2026-05-04 22:59:48 +01:00
steve	feaeff217d	feat(ntfy): support HTTP Basic auth alongside access tokens Self-hosted ntfy that doesn't expose a token-mint endpoint can still authenticate over HTTP Basic. Add Username + Password fields to NtfyConfig; the channel sends 'Authorization: Basic …' when token is empty and username is set. Token wins when both are configured. Form-side: two new optional fields next to the access token, with the same write-only placeholder treatment as smtp_password (blank on edit means 'keep stored value'). Username is round-tripped on edit; password is masked.	2026-05-04 22:25:42 +01:00
steve	cffad4b4f3	fix: enabled toggle — list-row click + edit-form save Two bugs in the channel-enabled affordance: 1. List-row toggle was a static span with no handler; the row's row-link overlay swallowed every click and routed to /edit. Add POST /settings/notifications/{id}/toggle backed by a new store method SetNotificationChannelEnabled, and turn the row toggle into an htmx-driven button that swaps in the new state. Use event.stopPropagation() on the toggle so it beats the row link. 2. Edit-form toggle visually flipped but the underlying checkbox reverted: the visual span lives inside the <label>, so clicking it fired the inline JS handler AND the label's native checkbox-toggle, cancelling out. Bind to the checkbox 'change' event instead and let the label do the toggling — the JS just mirrors check.checked into the .on class.	2026-05-04 22:21:45 +01:00
steve	84e121bb9c	fix: read 'name' across all per-kind sub-forms when editing channels The channel form has three inputs all named 'name' (one per kind section: webhook / ntfy / smtp), but only the visible kind's input is filled in. PostForm.Get returns the first regardless of emptiness, so editing an ntfy or smtp channel always read '' from the (hidden, unfilled) webhook section's name input and rejected with 'name required'. Add firstNonEmpty helper that scans the slice for the first non-blank value. Same flavour of bug as the enabled checkbox fix in `6466f8c` — both fall out of having multiple inputs share a name across the per-kind sub-forms.	2026-05-04 22:16:59 +01:00
steve	3d99306cea	fix: refresh hosts.open_alert_count on Raise/Resolve/AutoResolve The denormalised projection was never written by the alerts code path, so the dashboard's OPEN ALERTS card and the per-host alerts column always read 0 regardless of how many alerts were open. fleet.GetStats sums hosts.open_alert_count; if it never moves, the card is decoration. Add refreshHostOpenAlertCount that recomputes from the alerts table (self-healing — no +/- bookkeeping to drift). Call it after the commit in RaiseOrTouch when a row was inserted, after Resolve, and after AutoResolve. Caught during the live sweep: a synthetic critical raised the count to 1, but resolving it left the dashboard reading '1 unresolved' indefinitely.	2026-05-04 21:01:17 +01:00
steve	6466f8c759	fix: read enabled checkbox correctly when paired with hidden=0 sibling The notification channel form has a <input hidden name=enabled value=0> plus a <input checkbox name=enabled value=1> so unchecking the box still submits 'enabled=0' (otherwise the field would just be absent). But Go's url.Values.Get returns the FIRST value, so even when the checkbox is ticked the handler read '0' and persisted enabled=false. Scan r.PostForm["enabled"] for any '1' instead. Caught during the sweep — all three test channels saved with enabled=0 even though the toggle visually rendered ON.	2026-05-04 21:00:54 +01:00
steve	9be3cead8e	fix: dispatch alert.acknowledged + alert.resolved on UI ack/resolve Spotted during the live Playwright sweep: clicking Acknowledge or Resolve updated the alert row but never fanned out a notification. The handlers went straight to Store.Acknowledge/Resolve, bypassing the hub. Add Engine.Acknowledge and Engine.Resolve that wrap the store call and dispatch the matching event to every enabled channel. The UI handlers prefer the engine path when wired, and fall back to the direct store call so unit tests that construct a Server without an engine still work. Use context.WithoutCancel for the goroutine dispatch — the request context is cancelled the instant the handler returns 204, so the naive 'go e.hub.Dispatch(ctx, ...)' was racing the response and losing the channel-list query with 'context canceled'.	2026-05-04 21:00:44 +01:00
steve	e0fbb8c980	ui: dashboard crit-alerts banner	2026-05-04 20:29:49 +01:00
steve	371fe734f3	ui: /settings/notifications list + edit form (3 kinds) Add settings.html (shell + sub-tab nav + conditional list/edit body), notifications.html and notification_edit.html (glob stubs), and the supporting CSS tokens (.ch-row, .ch-icon, .toggle, .kind-grid, .kind-card, .radio-pip, .test-pill) to input.css. Rebuild styles.css. Add ui_parse_test.go to catch template regressions at test time. The kind picker is JS-driven (no full page reload); the enabled toggle mirrors the existing visual toggle pattern; the test-notification button uses HTMX and renders the JSON response as a coloured pill client-side.	2026-05-04 20:25:06 +01:00
steve	d373d19647	ui: F1 — populate OpenAlerts in baseView so nav badge updates everywhere Flagged in review of `cd38b40`: the Alerts tab badge should show the open count from any page, not just /alerts. baseView now takes the request and queries store.ListAlerts(Status: "open") to fill view.OpenAlerts on every page render. All call sites updated.	2026-05-04 20:19:09 +01:00
steve	cd38b40516	ui: alerts list page + alert row partial + nav badge	2026-05-04 20:15:01 +01:00
steve	de6939b3f6	http: /settings/notifications CRUD + test endpoint	2026-05-04 20:06:45 +01:00
steve	873821b871	http: /alerts list + ack/resolve handlers + /api/alerts JSON	2026-05-04 19:59:24 +01:00
steve	8c42b00228	alert: wire engine into ws hello + MarkJobFinished + offline sweep - ws.HandlerDeps gains an AlertEngine *alert.Engine field; populated from http.Deps.AlertEngine (nil until G1 constructs the engine) - runAgentLoop calls NotifyHostOnline after MarkHostHello succeeds - dispatchAgentMessage MsgJobFinished case calls NotifyJobFinished, looking up the job Kind via Store.GetJob before notifying - store.MarkHostsOfflineStaleReturnIDs added: SELECT+UPDATE in one transaction, returns the IDs that flipped to offline - offline sweeper in cmd/server/main.go switched to the new variant; TODO(G1) comment marks where NotifyHostOffline calls will land	2026-05-04 19:54:39 +01:00
steve	cb4695e09a	alert: rule logic for the six v1 rules	2026-05-04 19:50:33 +01:00
steve	f38930e2e6	alert: engine skeleton + event channels	2026-05-04 19:47:09 +01:00
steve	16e71a0708	notification: Hub fan-out + log writer	2026-05-04 19:44:31 +01:00
steve	a6ac9ee71d	notification: smtp channel	2026-05-04 19:40:21 +01:00
steve	a99864c649	notification: B3 — Content-Type header + URL trim Fixes flagged in spec review of `f0a323e`: ntfy POSTs need explicit Content-Type: text/plain (the spec calls for it; ntfy works without but explicit beats inferred); trim trailing slashes from server URL to avoid double-slash when operators paste 'https://ntfy.sh/'.	2026-05-04 19:38:16 +01:00
steve	f0a323ef91	notification: ntfy channel	2026-05-04 19:35:50 +01:00
steve	c22fb24f5b	notification: webhook channel	2026-05-04 19:33:29 +01:00
steve	6688b3f88a	notification: payload + Channel interface	2026-05-04 19:31:27 +01:00
steve	69fc89143d	store: notification_channels CRUD + AppendNotificationLog	2026-05-04 19:28:41 +01:00
steve	b5a0aa4667	store: alerts CRUD with dedup + last_seen_at bump	2026-05-04 19:24:17 +01:00
steve	f24dfa5214	store: migration 0014 — notification_channels + notification_log	2026-05-04 19:20:37 +01:00
steve	640b64710e	store: A1 — check rows.Err() + Scan err in migrate_test Code-quality nits flagged in review of `e6d965d`. Mirrors the existing pattern in host_credentials_test.go.	2026-05-04 19:19:28 +01:00
steve	e6d965d7a5	store: migration 0013 — alerts.last_seen_at	2026-05-04 19:16:59 +01:00
steve	28d5043eb0	test: lock-protect fakeSender so -race CI passes The CI runs go test with -race; the agent runner has two pump goroutines (pumpStdout + pumpStderr) writing through the sender concurrently, and the unprotected fakeSender slice append raced. The cancel_test had a local 'safeSender' workaround for the same issue; promote that mutex onto fakeSender itself so every test in the package is race-clean without per-test variants. - fakeSender grows mu sync.Mutex; Send takes/releases. New snapshot() helper for tests that want a stable copy. - cancel_test drops its local safeSender + sync import; uses fakeSender. Verified: go test -race ./... passes across all packages.	2026-05-04 18:01:35 +01:00
steve	e4031d26fa	P3 wrap: agent auto-creates restore target; tasks.md ticked 1. Agent-side MkdirAll on the new-dir restore target. Restic creates missing leaves but won't traverse multiple missing levels, and under the systemd sandbox writes outside ReadWritePaths fail anyway. Calling os.MkdirAll(target, 0700) before invoking restic means the operator never has to pre-create the per-job subdir, and a path the sandbox rejects surfaces as a clean 'restic restore: prepare target ...: read-only file system' error in the job log instead of a cryptic restic-side stat failure. 2. tasks.md Phase 3 — Restore section refreshed: - P3-X4 added (job log download dropdown — txt + ndjson) - P3-X5 added (UK lint locale switch + 73-correction sweep) - P3-X6 added (SIZE/FILES tooltip when host's restic < 0.17) - P3-03 entry expanded to cover version-gated --no-ownership, editable target, $HOME expansion, agent-side MkdirAll - As-shipped sweep summary mentions custom-target restore + download dropdown + tooltip in addition to the original walk Test: TestRunRestoreNewDirAutoCreatesTarget seeds a multi-level target the operator hasn't created and confirms RunRestore mkdir's the chain before invoking restic.	2026-05-04 17:51:34 +01:00
steve	02250670c1	ui: snapshots SIZE/FILES tooltip when host's restic is < 0.17 Per-snapshot size + file-count come from the embedded summary block restic added to 'snapshots --json' in 0.17 (the source comment in internal/restic/snapshots.go incorrectly said 0.16+). Hosts running 0.16.x leave those columns blank. - Fix the snapshots.go doc comment: '0.16+' -> '0.17+'. - hostDetailPage carries a LegacyRestic bool computed from the host's reported ResticVersion via Env.AtLeastVersion(0, 17). Empty version also counts as legacy (conservative default). - Template attaches title='Needs restic 0.17+ on the agent host. This host runs <ver>.' + cursor:help on the SIZE / FILES headers when the flag is true. Hosts already on 0.17+ get no tooltip and no extra styling. A host upgrading restic to 0.17+ gets the columns populated on the next backup automatically — no further code change needed.	2026-05-04 17:45:32 +01:00
steve	f0dfa689fe	P3 follow-up: editable target dir, conditional --no-ownership, UK lint Three small follow-ups from review: 1. Restore target is now operator-editable. Default value is the literal '$HOME/rm-restore/<job-id>/' (agent expands $HOME at run time using os.UserHomeDir(); also handles ${HOME} and ~/ prefixes). Operator can replace with any absolute path. - ui_restore.go validates the input is either absolute or starts with one of the recognised prefixes; other env-var refs ($PATH etc.) are deliberately rejected so operator paths can't pick up arbitrary agent env values. - host_restore.html replaces the read-only mono-text display with a real <input>; help text spells out that $HOME resolves agent-side and <job-id> is substituted on dispatch. - install.sh + the systemd unit prep /root/rm-restore so the default works under the sandbox: ReadWritePaths gains a soft '-/root/rm-restore' entry (the '-' makes the bind-mount soft-fail if missing, but install.sh pre-creates it root-owned 0700). 2. --no-ownership flag now gated on restic version. The flag was added in restic 0.17 and 0.16 rejects it. Previously dropped it wholesale — that meant new-dir restores silently preserved ownership against design intent on 0.17+. Now the agent threads its detected restic version (sysinfo already collects it) through runner.Config -> restic.Env, and RunRestore appends --no-ownership only when AtLeastVersion(0, 17) returns true. 0.16 hosts still restore with original uid/gid; help text in the wizard explicitly notes this. The previous 'Original ownership is preserved' copy was wrong for new-dir mode and is corrected. 3. golangci-lint misspell locale switched US -> UK and the codebase swept (73 corrections, mostly behaviour/serialise/recognise/honour). Wire-format ErrorCode 'unauthorized' -> 'unauthorised' is a tiny contract change but the agent doesn't parse those codes today and no external API consumers exist yet. Tests passed before + after. Tests: - internal/restic/version_test.go covers Env.AtLeastVersion across edge cases (empty, exact match, patch above, minor below, non-numeric) and expandHome on $HOME / ${HOME} / ~/, plus pass-through for absolute paths and refusal of other env vars. - ui_restore_test updated: TargetDir now starts '$HOME/rm-restore/' with the job_id substituted into the placeholder. Live verified on the smoke env: default target restored to /root/rm-restore/<job-id>/ as the agent's expanded $HOME (2 files, 14 bytes); custom override '/tmp/custom-restore/<job-id>/' restored into the agent's PrivateTmp namespace (1 file, 6 bytes); both jobs 'succeeded', exit 0.	2026-05-04 17:27:52 +01:00
steve	a2398d0b66	P3 follow-up: log download (txt + ndjson) on the live job page The diff job's full output streams to the standard live job log page, which can be a lot of text the operator wants to grep through or paste into a ticket. Add a Download button. Source of truth is the persisted job_logs table — works any time (running or finished) and doesn't need to pause the live WS stream. The download is 'everything the server has up to right now'; if the operator wants a fuller snapshot of a still-running job, they hit Download again. - New endpoint GET /api/jobs/{id}/log.{txt,ndjson} (chi {format} matcher constrained to the two known suffixes). Auth via session cookie. 404 on unknown job. - internal/server/http/job_download.go writeLogsText emits a small header + 'HH:MM:SS.mmm TAG payload' rows mirroring what the live page shows. writeLogsNDJSON emits one self-contained {seq,ts,stream,payload} JSON object per line — appending stays valid (each line stands alone), and the whole file pipes cleanly into jq. NDJSON is newline-delimited JSON; not the same as a JSON array. - web/templates/pages/job_detail.html grows two header buttons: 'Download log' (txt) + '.ndjson' ghost variant for tooling. Tests cover the txt format (header + per-row shape), the ndjson format (each line round-trips through json.Unmarshal), unknown job 404, unauthenticated 401.	2026-05-04 17:12:45 +01:00
steve	e22b41d452	P3 sweep fixes: snap-row CSS, tree expand, --no-ownership drop, target path Bug fixes from the Playwright sweep against the live smoke server: 1. Snapshot-picker layout. The .snap-row class was used in the wireframe but never landed in web/styles/input.css; rows rendered as vertical blocks instead of a 6-column grid. Added the token (mirrors host-row shape with restore-specific column widths). 2. Tree expansion. hx-target='closest .tree-row + .tree-children' isn't a valid HTMX selector — modifiers don't chain. Replaced HTMX-driven expansion with a small window.__rmTreeToggle helper that uses plain fetch + .tree-pair wrapper structure for trivial sibling lookup. Caches loaded state per node. 3. --no-ownership flag dropped. Restic 0.17 introduced --no-ownership; 0.16 rejects it ('unknown flag') before doing any work. Since the agent runs as root in the systemd unit, restored files keep their original uid/gid either way and the parent dir is root-owned, so the 'cp without sudo' rationale doesn't hold. Drop the flag entirely. 4. Default target dir moved to /var/lib/restic-manager/restore. The systemd unit pins ReadWritePaths to /etc/restic-manager + /var/lib/restic-manager (with ProtectSystem=strict making the rest of /var read-only); writes to /var/restic-restore failed with 'read-only file system'. 5. Confirm summary HTML escaping. defaultTarget JS literal evaluates to a string with literal angle brackets; insertion into innerHTML must escape them. Added an inline HTML-escape pass. tasks.md ticked for the Restore sub-phase with a sweep summary covering the live end-to-end test.	2026-05-04 15:57:42 +01:00


150 Commits