Commit Graph

88 Commits

Author SHA1 Message Date
steve 373d74cdaf fix: read 'name' across all per-kind sub-forms when editing channels
The channel form has three inputs all named 'name' (one per kind
section: webhook / ntfy / smtp), but only the visible kind's input
is filled in. PostForm.Get returns the first regardless of
emptiness, so editing an ntfy or smtp channel always read '' from
the (hidden, unfilled) webhook section's name input and rejected
with 'name required'.

Add firstNonEmpty helper that scans the slice for the first
non-blank value. Same flavour of bug as the enabled checkbox fix
in 24eecc1 — both fall out of having multiple inputs share a name
across the per-kind sub-forms.
2026-05-04 22:16:59 +01:00
steve 24eecc1673 fix: read enabled checkbox correctly when paired with hidden=0 sibling
The notification channel form has a <input hidden name=enabled value=0>
plus a <input checkbox name=enabled value=1> so unchecking the box
still submits 'enabled=0' (otherwise the field would just be absent).
But Go's url.Values.Get returns the FIRST value, so even when the
checkbox is ticked the handler read '0' and persisted enabled=false.

Scan r.PostForm["enabled"] for any '1' instead. Caught during the
sweep — all three test channels saved with enabled=0 even though
the toggle visually rendered ON.
2026-05-04 21:00:54 +01:00
steve 04dde93acd fix: dispatch alert.acknowledged + alert.resolved on UI ack/resolve
Spotted during the live Playwright sweep: clicking Acknowledge or
Resolve updated the alert row but never fanned out a notification.
The handlers went straight to Store.Acknowledge/Resolve, bypassing
the hub.

Add Engine.Acknowledge and Engine.Resolve that wrap the store call
and dispatch the matching event to every enabled channel. The UI
handlers prefer the engine path when wired, and fall back to the
direct store call so unit tests that construct a Server without an
engine still work.

Use context.WithoutCancel for the goroutine dispatch — the request
context is cancelled the instant the handler returns 204, so the
naive 'go e.hub.Dispatch(ctx, ...)' was racing the response and
losing the channel-list query with 'context canceled'.
2026-05-04 21:00:44 +01:00
steve b25f96e465 ui: dashboard crit-alerts banner 2026-05-04 20:29:49 +01:00
steve e0847517a8 ui: /settings/notifications list + edit form (3 kinds)
Add settings.html (shell + sub-tab nav + conditional list/edit body),
notifications.html and notification_edit.html (glob stubs), and the
supporting CSS tokens (.ch-row, .ch-icon, .toggle, .kind-grid,
.kind-card, .radio-pip, .test-pill) to input.css. Rebuild styles.css.
Add ui_parse_test.go to catch template regressions at test time.

The kind picker is JS-driven (no full page reload); the enabled toggle
mirrors the existing visual toggle pattern; the test-notification button
uses HTMX and renders the JSON response as a coloured pill client-side.
2026-05-04 20:25:06 +01:00
steve 9dbed025e0 ui: F1 — populate OpenAlerts in baseView so nav badge updates everywhere
Flagged in review of 35dee98: the Alerts tab badge should show the
open count from any page, not just /alerts. baseView now takes the
request and queries store.ListAlerts(Status: "open") to fill
view.OpenAlerts on every page render. All call sites updated.
2026-05-04 20:19:09 +01:00
steve 35dee98cf9 ui: alerts list page + alert row partial + nav badge 2026-05-04 20:15:01 +01:00
steve 5d8350132c http: /settings/notifications CRUD + test endpoint 2026-05-04 20:06:45 +01:00
steve 5c6ac155eb http: /alerts list + ack/resolve handlers + /api/alerts JSON 2026-05-04 19:59:24 +01:00
steve c710743231 alert: wire engine into ws hello + MarkJobFinished + offline sweep
- ws.HandlerDeps gains an AlertEngine *alert.Engine field; populated
  from http.Deps.AlertEngine (nil until G1 constructs the engine)
- runAgentLoop calls NotifyHostOnline after MarkHostHello succeeds
- dispatchAgentMessage MsgJobFinished case calls NotifyJobFinished,
  looking up the job Kind via Store.GetJob before notifying
- store.MarkHostsOfflineStaleReturnIDs added: SELECT+UPDATE in one
  transaction, returns the IDs that flipped to offline
- offline sweeper in cmd/server/main.go switched to the new variant;
  TODO(G1) comment marks where NotifyHostOffline calls will land
2026-05-04 19:54:39 +01:00
steve 539b941db5 ui: snapshots SIZE/FILES tooltip when host's restic is < 0.17
Per-snapshot size + file-count come from the embedded summary block
restic added to 'snapshots --json' in 0.17 (the source comment in
internal/restic/snapshots.go incorrectly said 0.16+). Hosts running
0.16.x leave those columns blank.

- Fix the snapshots.go doc comment: '0.16+' -> '0.17+'.
- hostDetailPage carries a LegacyRestic bool computed from the host's
  reported ResticVersion via Env.AtLeastVersion(0, 17). Empty version
  also counts as legacy (conservative default).
- Template attaches title='Needs restic 0.17+ on the agent host. This
  host runs <ver>.' + cursor:help on the SIZE / FILES headers when
  the flag is true. Hosts already on 0.17+ get no tooltip and no
  extra styling.

A host upgrading restic to 0.17+ gets the columns populated on the
next backup automatically — no further code change needed.
2026-05-04 17:45:32 +01:00
steve a781e95c94 P3 follow-up: editable target dir, conditional --no-ownership, UK lint
Three small follow-ups from review:

1. Restore target is now operator-editable. Default value is the
   literal '\$HOME/rm-restore/<job-id>/' (agent expands \$HOME at
   run time using os.UserHomeDir(); also handles \${HOME} and ~/
   prefixes). Operator can replace with any absolute path.
   - ui_restore.go validates the input is either absolute or starts
     with one of the recognised prefixes; other env-var refs (\$PATH
     etc.) are deliberately rejected so operator paths can't pick up
     arbitrary agent env values.
   - host_restore.html replaces the read-only mono-text display with
     a real <input>; help text spells out that \$HOME resolves
     agent-side and <job-id> is substituted on dispatch.
   - install.sh + the systemd unit prep /root/rm-restore so the
     default works under the sandbox: ReadWritePaths gains a soft
     '-/root/rm-restore' entry (the '-' makes the bind-mount soft-fail
     if missing, but install.sh pre-creates it root-owned 0700).

2. --no-ownership flag now gated on restic version. The flag was
   added in restic 0.17 and 0.16 rejects it. Previously dropped it
   wholesale — that meant new-dir restores silently preserved
   ownership against design intent on 0.17+. Now the agent threads
   its detected restic version (sysinfo already collects it) through
   runner.Config -> restic.Env, and RunRestore appends --no-ownership
   only when AtLeastVersion(0, 17) returns true. 0.16 hosts still
   restore with original uid/gid; help text in the wizard explicitly
   notes this. The previous 'Original ownership is preserved' copy
   was wrong for new-dir mode and is corrected.

3. golangci-lint misspell locale switched US -> UK and the codebase
   swept (73 corrections, mostly behaviour/serialise/recognise/honour).
   Wire-format ErrorCode 'unauthorized' -> 'unauthorised' is a tiny
   contract change but the agent doesn't parse those codes today and
   no external API consumers exist yet. Tests passed before + after.

Tests:
- internal/restic/version_test.go covers Env.AtLeastVersion across
  edge cases (empty, exact match, patch above, minor below, non-
  numeric) and expandHome on \$HOME / \${HOME} / ~/, plus
  pass-through for absolute paths and refusal of other env vars.
- ui_restore_test updated: TargetDir now starts '\$HOME/rm-restore/'
  with the job_id substituted into the placeholder.

Live verified on the smoke env: default target restored to
/root/rm-restore/<job-id>/ as the agent's expanded \$HOME (2 files,
14 bytes); custom override '/tmp/custom-restore/<job-id>/' restored
into the agent's PrivateTmp namespace (1 file, 6 bytes); both jobs
'succeeded', exit 0.
2026-05-04 17:27:52 +01:00
steve 727c610765 P3 follow-up: log download (txt + ndjson) on the live job page
The diff job's full output streams to the standard live job log page,
which can be a lot of text the operator wants to grep through or paste
into a ticket. Add a Download button.

Source of truth is the persisted job_logs table — works any time
(running or finished) and doesn't need to pause the live WS stream.
The download is 'everything the server has up to right now'; if the
operator wants a fuller snapshot of a still-running job, they hit
Download again.

- New endpoint GET /api/jobs/{id}/log.{txt,ndjson} (chi {format}
  matcher constrained to the two known suffixes). Auth via session
  cookie. 404 on unknown job.
- internal/server/http/job_download.go writeLogsText emits a small
  header + 'HH:MM:SS.mmm  TAG  payload' rows mirroring what the live
  page shows. writeLogsNDJSON emits one self-contained {seq,ts,stream,
  payload} JSON object per line — appending stays valid (each line
  stands alone), and the whole file pipes cleanly into jq. NDJSON is
  newline-delimited JSON; not the same as a JSON array.
- web/templates/pages/job_detail.html grows two header buttons:
  'Download log' (txt) + '.ndjson' ghost variant for tooling.

Tests cover the txt format (header + per-row shape), the ndjson
format (each line round-trips through json.Unmarshal), unknown job
404, unauthenticated 401.
2026-05-04 17:12:45 +01:00
steve 65a0134101 P3 sweep fixes: snap-row CSS, tree expand, --no-ownership drop, target path
Bug fixes from the Playwright sweep against the live smoke server:

1. Snapshot-picker layout. The .snap-row class was used in the wireframe
   but never landed in web/styles/input.css; rows rendered as vertical
   blocks instead of a 6-column grid. Added the token (mirrors host-row
   shape with restore-specific column widths).

2. Tree expansion. hx-target='closest .tree-row + .tree-children' isn't
   a valid HTMX selector — modifiers don't chain. Replaced HTMX-driven
   expansion with a small window.__rmTreeToggle helper that uses plain
   fetch + .tree-pair wrapper structure for trivial sibling lookup.
   Caches loaded state per node.

3. --no-ownership flag dropped. Restic 0.17 introduced --no-ownership;
   0.16 rejects it ('unknown flag') before doing any work. Since the
   agent runs as root in the systemd unit, restored files keep their
   original uid/gid either way and the parent dir is root-owned, so
   the 'cp without sudo' rationale doesn't hold. Drop the flag entirely.

4. Default target dir moved to /var/lib/restic-manager/restore. The
   systemd unit pins ReadWritePaths to /etc/restic-manager +
   /var/lib/restic-manager (with ProtectSystem=strict making the rest
   of /var read-only); writes to /var/restic-restore failed with
   'read-only file system'.

5. Confirm summary HTML escaping. defaultTarget JS literal evaluates
   to a string with literal angle brackets; insertion into innerHTML
   must escape them. Added an inline HTML-escape pass.

tasks.md ticked for the Restore sub-phase with a sweep summary
covering the live end-to-end test.
2026-05-04 15:57:42 +01:00
steve c417b5e9ab P3-09 + P3-X3: snapshot diff + recent-restores line
P3-09 — snapshot diff dispatcher.
- POST /api/hosts/{id}/snapshots/diff (and the unprefixed HTMX-form
  variant) takes {snapshot_a, snapshot_b}, validates both belong to
  the host (long id / short id / prefix match), checks the agent is
  online, mints a JobDiff, ships command.run with DiffPayload, writes
  a host.snapshot_diff audit row, returns HX-Redirect to the live
  job page (or JSON {job_id, job_url} for REST callers).
- Two-snapshot guard: POSTing diff(a,a) returns 422.
- UI: small panel on the host_detail right rail (visible when the
  host has 2+ snapshots) with two short-id inputs and a Diff button.
  Output renders on the standard live job page where the operator
  reads the per-line diff text directly.

P3-X3 — recent-restores line.
- hostChromeData grows RestoreStatus / RestoreAt / RestoreJobID
  populated via store.LatestJobByKind(host_id, 'restore') (already
  exists, used by the init line).
- host_chrome.html renders a small line below the existing init-status
  one with status-coloured copy + a link to the job log. Hidden when
  no restore has ever run on this host.

Tests:
- diff_test covers happy path (correct DiffPayload + HX-Redirect),
  same-id rejection (422), unknown-id rejection (422). Adds a
  seedTwoSnapshots helper since ReplaceHostSnapshots is atomic-swap
  (calling seedSnapshot twice would only leave the second).

Restage block (CLAUDE.md) deferred to the end of the restore phase.
2026-05-04 15:38:28 +01:00
steve 4c108bb68a P3-01/02/03: restore wizard backend + templates + restore-shaped job page
End-to-end wizard from /hosts/{id}/restore (or per-snapshot deep link
/hosts/{id}/snapshots/{sid}/restore) → tree-browse → dispatch →
restore-shaped live job page.

Backend (internal/server/http/ui_restore.go):
- GET handlers render the four-step wizard against the wireframe shape
  in docs/superpowers/specs/2026-05-04-p3-restore-design.md.
- HTMX tree partial endpoint hits fetchTreeWithCache (P3-X2) so each
  directory expansion is a sub-second cached lookup after the first
  miss.
- POST validates: snapshot_id non-empty, ≥1 absolute path, in-place
  mode requires confirm_hostname == host name, agent online. On error
  re-renders the wizard with the operator's input intact. Happy path
  mints a job_id, computes the new-directory target as
  /var/restic-restore/<job-id>/ (operator can't escape the prefix —
  server picks it), creates the job row, ships command.run with
  kind=restore + RestorePayload, writes a host.restore audit row,
  returns HX-Redirect (or 303) to the live job page.

Templates:
- host_restore.html: single-page progressively-enabled wizard matching
  _diag/p3-restore-wizard wireframe. Form-state-driven JS computes a
  running tally of selected paths and the step-4 confirm summary
  client-side; the server re-renders on validation failure with form
  fields preserved.
- partials/tree_node.html: recursive HTMX-served tree fragment.
- Top-level Restore button on host_detail right rail + per-snapshot
  Restore action on snapshot rows replace the previous P3-stub.

Restore-shaped job page (job_detail.html):
- Progress widget rendered as a panel rather than a bare strip when
  the job is active.
- Current-file display under the bar, updated from log.stream stdout
  lines that look like absolute paths. Hidden for non-restore kinds.

Migration 0012:
- Add restore + diff to the jobs.kind CHECK. Rebuild required (SQLite
  can't ALTER CHECK in place); follows the safe pattern from 0005.
  Defensive: stash job_logs into a temp table before the rebuild and
  INSERT OR IGNORE back afterwards so even if SQLite cascades on
  DROP TABLE jobs the log history survives.

Tests:
- ui_restore_test covers GET step-1 render, GET pre-selected snapshot
  summary card, POST missing snapshot, POST missing paths, POST
  in-place wrong-hostname rejection (no command.run leaks to the
  agent), POST happy path (HX-Redirect + correct payload + audit
  row), POST against offline host returns 503.

Restage block (CLAUDE.md) deferred to the end of the restore phase.
2026-05-04 15:34:29 +01:00
steve f5e3bca6a2 P3-03: restic restore + diff execution path
Wires JobRestore and JobDiff end-to-end at the agent layer (the wizard
backend that drives this lands in the next slice).

- internal/api: JobRestore + JobDiff JobKind constants. CommandRunPayload
  grows nullable Restore + Diff sub-payloads. RestorePayload carries
  snapshot_id, paths, in_place, target_dir; DiffPayload carries
  snapshot_a + snapshot_b.
- internal/restic.RunRestore wraps 'restic restore <sid> --target ...
  [--no-ownership] [--include p]...' with --json. New pumpRestoreStdout
  parses the per-line status / summary objects (drops raw status from
  log.stream — the throttled job.progress envelope covers it). New
  RestoreStatus + RestoreSummary types mirror restic's wire shape.
- internal/restic.RunDiff wraps 'restic diff --json <a> <b>'.
- internal/agent/runner: RunRestore translates RestoreStatus into
  job.progress (mapping FilesRestored → FilesDone etc) with a small
  estimateETA helper since restic doesn't provide ETA for restore.
  RunDiff is a thin streamHandler wrapper.
- cmd/agent dispatcher gains JobRestore + JobDiff cases. Both reuse
  the spawn() helper from P3-X1 so cancel just works.
- Drive-by fix: lastProgress was initialised to time.Now() so the
  very first status event was suppressed by the 1s throttle if the
  agent reported quickly. Initialise to time.Time{} (zero) so the
  first event always emits. Affects backup + restore.

Tests:
- restore_test covers restore happy path (started → progress →
  finished, kind=restore on the started envelope), in-place argv
  asserts no --no-ownership, new-dir argv asserts --no-ownership +
  --target + --include, diff produces the expected log.stream lines.

Restage block (CLAUDE.md) is deferred to the end of the restore
sub-phase so we restage once with all changes.
2026-05-04 15:24:14 +01:00
steve 13f58bd052 P3-X2: tree.list synchronous WS RPC + per-session cache
Foundational for the restore wizard's tree browser. The wizard needs to
lazy-load directory contents from a snapshot as the operator drills
down; this lands the transport.

- internal/api adds MsgTreeList (server → agent) + MsgTreeListResult
  (agent → server) with TreeListRequestPayload / TreeListEntry /
  TreeListResultPayload types. Reply correlates by Envelope.ID.
- internal/restic.ListTreeChildren wraps 'restic ls --json' and
  filters its recursive output to direct children of the requested
  path. Parser + path-normalisation + isDirectChild are unit-tested.
- internal/server/ws/rpc.go introduces a generic SendRPC helper on
  Hub: register a buffered channel keyed by ULID, send the request,
  block on ctx.Done()/timeout/reply. Reply routing piggybacks on the
  existing dispatchAgentMessage by adding a MsgTreeListResult case
  that forwards to the registered waiter; if no waiter is registered
  (caller already gave up) the stray reply is dropped quietly.
- cmd/agent gains a tree.list handler that runs ListTreeChildren on a
  fresh per-call context (60s ceiling) and ships the matching
  tree.list.result envelope. Errors surface in result.Error rather
  than as transport failures so the server-side waiter can render a
  sensible UI message.
- internal/server/http/tree_cache.go is the per-wizard-session cache
  layer (~30min TTL, sweep-on-access) that fetchTreeWithCache uses
  before falling through to SendRPC. Cached on success only; agent
  errors aren't cached so a transient failure doesn't poison the
  session.

Tests:
- internal/restic/ls_test.go covers parseLsChildren at root / mid-tree
  / leaf, plus normalizeTreePath and isDirectChild edge cases.
- internal/server/ws/rpc_test.go unit-tests the registry: round-trip,
  release semantics, concurrent waiters, ctx-cancel.
- internal/server/http/tree_rpc_test.go is the full round-trip: server
  SendRPC → fake-agent over a real WS → reply → server gets the
  payload. Plus a timeout test that confirms ~300ms timeouts terminate
  in ~300ms rather than waiting forever.

The cache is plumbed but no UI handler hits fetchTreeWithCache yet —
that lands with P3-01 (wizard backend). The unused-linter is suppressed
via nolint until the wizard wires it in.
2026-05-04 15:19:22 +01:00
steve 94149a7324 P3-X1: cancel-job feature
Wires the existing job_detail Cancel button (which was a UI stub) into
real backend behaviour:

- internal/api already declared MsgCommandCancel + CommandCancelPayload;
  promote those from forward-declarations to a working envelope. Agent
  side: cmd/agent/main.go drops the TODO-stub and gains a per-job
  ctx.CancelFunc map. runJob's switch is refactored around a small
  spawn() helper so each kind's goroutine derives a per-job context,
  registers the cancel, and removes itself on completion regardless of
  outcome. command.cancel looks up the func and fires it.
- internal/agent/runner.sendFinished now takes ctx and rebadges
  ctx.Canceled errors as JobCancelled (exit 130) rather than
  JobFailed. All Run* call sites updated.
- internal/restic.resticCmd sets cmd.Cancel to send SIGTERM (via
  build-tagged sigterm constant; os.Kill on Windows since SIGTERM
  isn't deliverable there) and cmd.WaitDelay=5s for the SIGKILL
  fallback. SIGTERM lets restic remove its lock file before exiting.
- New POST /api/jobs/{id}/cancel server endpoint validates the job
  is non-terminal and the host is online, sends command.cancel via
  the hub, writes a job.cancel audit row, returns 202. The agent's
  resulting job.finished (status=cancelled) is what actually
  transitions the row.

Tests:
- internal/server/http/cancel_test.go covers happy path (envelope
  shape + audit row), 409 for terminal jobs, 404 for missing jobs,
  503 for offline hosts.
- internal/agent/runner/cancel_test.go covers cancel mid-run: a fake
  restic that exec'd into 'sleep 30' is canceled 150ms after start
  and the resulting job.finished reports JobCancelled with exit 130
  in well under the WaitDelay.

Foundational for P3 restore (operator needs to be able to cancel a
running backup if they need to restore urgently). Independently useful
for prune/check/backup that are stuck.
2026-05-04 15:11:49 +01:00
steve 2095505edd tasks: tick P2 completion + Playwright sweep screenshots
P2R-09/10/11/12/13/14, P2-16/17/18 all marked done. Acceptance line
for Windows hosts annotated as 'compile-verified, untested in CI'.

_diag/p2-completion-sweep/ holds the dashboard + host-detail +
schedules + sources + repo + source-group-edit screenshots from a
clean sweep against :8080. Zero console errors throughout.

announce_test.go: rate-limit + global-cap subtests dropped t.Parallel
to avoid racing on the package-level tunables under -race.
2026-05-04 11:27:09 +01:00
steve 4c81ff3e7b ui+server: P2-18d pending hosts dashboard panel + expiry sweeper
Dashboard handler loads ListPendingHosts(now); template renders a
warn-bordered panel above the host table with hostname, OS/arch,
fingerprint (selectable / copyable), source IP, age, expiry. Each
row carries an inline accept form (repo URL/user/password) plus a
Reject button. cmd/server adds a 60s ticker calling
DeleteExpiredPendingHosts so 1h-stale rows drop off.
2026-05-04 11:11:32 +01:00
steve fd87218b3f server: P2-18b pending WS + admin accept/reject
GET /ws/agent/pending?pending_id=… runs an Ed25519 nonce-sign
handshake against the row's stored public key, then holds the
connection open. POST /api/pending-hosts/{id}/accept (admin)
mints a real Host row + bearer + AEAD-encrypted repo creds, pushes
the bearer down the open WS, deletes the pending row, and writes
a host.accept_pending audit entry. POST /api/pending-hosts/{id}/reject
closes the socket with code 4001 and audit-logs host.reject_pending.

In-memory pendingHub keyed by pending_id wires accept/reject to
their live socket.
2026-05-04 11:07:32 +01:00
steve cd80be3b13 store+server: P2-18a announce-and-approve schema + endpoint
migration 0011 adds pending_hosts table (id, hostname, public_key,
fingerprint, expiry). store/pending_hosts.go covers full CRUD plus
hostname-collision count + expired-row sweeper.

POST /api/agents/announce takes {hostname, os, arch, agent_version,
restic_version, public_key (base64)}, returns {pending_id,
fingerprint, hostname_collision}. Per-source-IP token-bucket
rate limit (10/min) + global cap of 100 in-flight rows. Public
key must be exactly 32 bytes (Ed25519).
2026-05-04 11:03:41 +01:00
steve a5a2cb91d0 ui: P2R-12 hook editor — source-group form + host-default Repo section
Source-group edit form gains pre/post hook textareas with a service-
user warning banner; bodies AEAD-encrypted on save (per-group AD).
Repo page adds a 'Host-default hooks' panel above the danger zone
with the same shape; saved via POST /hosts/{id}/repo/hooks.
2026-05-04 11:00:28 +01:00
steve 7b1990cf11 agent+server: P2R-11 pre/post hook execution for backup jobs
Agent: new runner.BackupHooks struct + runHook helper invoked via
/bin/sh -c (cmd.exe /C on Windows). pre_hook non-zero exit aborts
the backup; post_hook always runs with RM_JOB_STATUS=succeeded|failed
in env. Output streamed as 'hook(<phase>): …' log.stream lines.
Hooks only run for kind=backup (other kinds skip both phases).

Server: resolveBackupHooks resolves group → host default → empty,
decrypts via crypto.AEAD with per-slot ad bytes, plumbs plaintext
into CommandRunPayload for both schedule.fire and per-group
Run-now dispatch sites. Decrypt failures degrade silently to no
hook so a malformed blob can't poison every backup.
2026-05-04 10:57:28 +01:00
steve c9b49637d1 ui: P2R-09 auto-init UX — init line in chrome + danger-zone re-init
Latest 'init' job status surfaced under the host-detail vitals strip
(succeeded/failed/running/queued, with link to the live job log on
non-success). New POST /hosts/{id}/repo/reinit handler dispatches a
fresh init job after the operator types the host name to confirm;
audit row records 'host.repo_reinit'.
2026-05-04 10:49:57 +01:00
steve d02a093eeb ui+server: schedule next-run / last-run on dashboard + schedules tab
P2R-14. New store.LatestJobBySchedule query (per-schedule fired job).
Schedules-tab handler computes next-fire from cron + last-fire from
the jobs table per row. Schedules table grows two columns; dashboard
host row prepends 'next 12h ago/from now' to the existing last-backup
line when a single covering schedule is the run-now candidate.

Embeds store.Schedule into scheduleRow so existing template field
references keep working without bulk renames.
2026-05-04 10:44:31 +01:00
steve e6fc9e9963 ui+server: per-job bandwidth override on Run-now
P2R-13b. POST /hosts/{id}/source-groups/{gid}/run accepts optional
bandwidth_up_kbps / bandwidth_down_kbps form fields, plumbs them onto
CommandRunPayload. Agent dispatcher already prefers per-job override
over host-wide caps (T1). UI wraps the Run-now button in a form with
a <details> 'Limit bandwidth for this run' disclosure containing two
KB/s inputs.
2026-05-04 10:41:13 +01:00
steve cdf88c6dc3 agent+server: apply host bandwidth caps to restic invocations
P2R-13a. restic.Env gains LimitUploadKBps/LimitDownloadKBps which are
emitted as global --limit-upload/--limit-download flags before the
subcommand on every invocation. Agent dispatcher tracks host-wide
caps received via config.update; server pushes them on hello and
after PUT /api/hosts/{id}/bandwidth.

Also extends api.CommandRunPayload with optional per-job overrides
(BandwidthUpKBps/Down + PreHook/PostHook); the override consumers
land in T2/T6.
2026-05-04 10:38:34 +01:00
steve e850f6f44c test: poll pending-row count in drain-on-reconnect test (race fix)
CI run #50 failed with:

  --- FAIL: TestDrainPendingDispatchesOnReconnect (1.03s)
      pending_drain_test.go:150: pending rows after drain: got 1, want 0

The test waits for a backup command.run envelope on the wire and
then checks the pending-row count. But conn.Send (the wire write)
returns BEFORE DeletePendingRun runs in the drain goroutine — both
fire serially inside drainOne, but the wire-side reader can observe
the Send while the delete is still pending.

Use the existing waitForPendingCount helper to poll the count with
a 2s deadline. Behaviour unchanged when the delete is fast (count
hits 0 immediately); only relevant under CI scheduling pressure.
-race -count=10 locally now passes consistently.
2026-05-04 10:20:54 +01:00
steve adece5eb72 server: serialize DrainPending per host (avoid drain double-dispatch)
Add a per-host drain mutex (drainLocks map guarded by drainLocksMu) on
the Server struct. DrainPending acquires it with TryLock: if a drain is
already in-flight for this host, the call returns immediately — the
running drain will see every pending row. This prevents the on-hello
goroutine and the 30s tick from both listing the same host's rows and
dispatching them twice.

Update three existing tests that called srv.DrainPending explicitly
after the on-hello goroutine had already been spawned: replace the
now-redundant direct call with a waitForPendingCount poll so they don't
race the goroutine's mutex ownership. Add TestDrainPendingSerializesPerHost
which fires 10 concurrent DrainPending goroutines against a 5-row queue
and asserts exactly 5 job rows result.
2026-05-04 10:19:15 +01:00
steve 6af5a945ce server/ws: persist repo.stats into host_repo_stats 2026-05-04 10:19:15 +01:00
steve e0eae0a96f server: drainer abandons only on ErrNotFound, not transient errors
GetSourceGroup errors in drainOne now gate on errors.Is(err,
store.ErrNotFound) before calling abandonPending, mirroring the
existing GetSchedule pattern. Transient errors (SQLITE_BUSY, context
cancellation) now log a warning and return without deleting the row.

Add regression test TestDrainPendingDropsRowsForGoneSourceGroup
confirming the ErrNotFound path still abandons correctly. Also add
a comment above the backoff-doubling loop explaining the progression.
2026-05-04 10:19:15 +01:00
steve d6dcdd5ec4 server: drainer uses dispatch-core to avoid duplicate pending_run enqueue
Extract dispatchBackupForGroupCore (persist+marshal+send, no enqueue on
failure) from dispatchBackupForGroup. drainOne now calls the core
directly so a failed Send only bumps the existing pending_runs row via
BumpPendingRunAttempt — not create a second row — stopping the
geometric duplication on repeated drain failures.

dispatchBackupForGroup (schedule.fire path) wraps the core and keeps
its enqueue-on-failure behaviour unchanged.

TestDrainPendingBumpsOnSendFailure strengthened: asserts exactly 1 row
remains after a send failure (was tolerating >=1 duplicate rows).
2026-05-04 10:19:15 +01:00
steve 5b4a590508 server: drain pending_runs on tick + on agent reconnect
Two trigger paths land here:

- A 30s ticker in cmd/server calls Server.DrainAllDue(ctx). It
  walks pending_runs rows whose next_attempt_at <= now, dedupes by
  host, skips offline hosts, and per online host runs DrainPending.

- onAgentHello spawns a background DrainPending(hostID). When a
  host comes back, every pending row for it is dispatchable now —
  due-ness becomes irrelevant once the wire is back.

Each row's schedule + group are reloaded; ErrNotFound or
disabled-schedule or gone-group abandons the row with a
pending_run.abandoned audit. attempt >= retry_max also abandons.
Otherwise dispatchBackupForGroup is invoked; success deletes the
row, failure bumps attempt with exponential backoff capped at
30m.
2026-05-04 10:19:15 +01:00
steve 18a4f74a22 server: enqueue pending_runs when scheduled-job dispatch fails
When dispatchBackupForGroup's conn.Send errors, queue a pending_runs
row (attempt=1, next_attempt_at = now + group.RetryBackoffSeconds)
instead of silently dropping the fire. The orphaned queued job row
is left behind for forensic visibility — the drainer will create a
fresh job row on its retry.

Also adds Store.ListPendingRunsForHost — the on-reconnect drain
walks every row for the host, regardless of due-ness, since the
host being back makes 'due' irrelevant.
2026-05-04 10:19:15 +01:00
steve aba0b7e177 server: fix stale RetentionPolicy comment + check Scan errors in maintenance test 2026-05-04 10:19:15 +01:00
steve 14b703be58 server: maintenance ticker drives forget/prune/check on cadence
Wires a 60s server-side ticker to the pure-logic maintenance.Decide
introduced in the previous commit. Decisions flow through a new
DispatchMaintenance method on *Server, which:

  - skips offline hosts (no pending_runs queueing — maintenance is
    not a backup, missed fires shouldn't pile up)
  - silently skips prune when admin creds aren't bound
  - pushes admin creds before prune, then dispatches with
    RequiresAdminCreds=true (same as operator-driven prune)
  - persists job rows with actor_kind="system"

Reshapes the forget wire payload from a single RetentionPolicy to a
ForgetGroups list (one tag + per-group keep-* per source group). The
agent walks the groups and runs `restic forget --tag <name> --keep-*`
once per group. Dead-code removed: CommandRunPayload.RetentionPolicy,
the old forget JSON-decode in cmd/agent, and the single-policy form of
restic.RunForget.
2026-05-04 10:19:15 +01:00
steve ae96983877 maintenance: pure-logic ticker decides forget/prune/check fires 2026-05-04 10:19:15 +01:00
steve 6f204a6877 ui: hx-swap none on Run-now + truthful save banner + tailwind rebuild
Add hx-swap="none" to the three Run-now buttons (check/prune/unlock) in
host_repo.html to match the existing pattern on host_sources.html and
host_schedules.html. Fix all-blank admin-credentials save to redirect
without ?saved= query string so no false-positive banner is shown;
strengthen the corresponding test to assert Location has no ?saved=.
Rebuild CSS bundle via Tailwind to pick up max-w-[640px] JIT class.
2026-05-04 10:19:15 +01:00
steve c5b52df7ed ui: Slice E — admin creds form + run-now buttons + repo health panel
- hostRepoPage gains AdminURL/AdminUsername/HasAdminPassword, Online,
  and StatsView (pre-dereferenced projection of host_repo_stats).
- loadHostRepoPage loads the admin slot (tolerating ErrNotFound),
  hub.Connected, and stats (tolerating ErrNotFound).
- renderRepoPage gains an adminErr parameter; all callers updated.
- handleUIAdminCredentialsSave / handleUIAdminCredentialsDelete added
  (form-POST handlers mirroring the repo-creds pattern, with audit).
- Routes /hosts/{id}/admin-credentials POST and /delete POST registered.
- Template: Admin credentials form after Connection, Run-now HTMX
  buttons after Maintenance, Repo health stats panel in right rail.
- Tests: 9 new tests covering rendering, disabled states, save/delete
  round-trips, audit rows, and idempotent delete.
2026-05-04 10:19:15 +01:00
steve e2d94bf3a2 server: populate audit UserID on credential mutations + slog prune push errors
Switch handleSetHostCredentials, handleSetAdminCredentials, and
handleDeleteAdminCredentials from authedUser (bool) to requireUser
(*store.User) so AuditEntry.UserID and Actor are populated correctly.
Add slog.Warn on the non-ErrNotFound pushAdminCredsToAgent path in
handleRunRepoPrune so decrypt/send failures surface in the server log
rather than appearing as a generic host_offline 503.
2026-05-04 10:19:15 +01:00
steve c5f401e99b server: cover HTMX auth-redirect path in repo-ops tests 2026-05-04 10:19:15 +01:00
steve 69abc40786 server: HTTP run-now for prune / check / unlock
Adds POST /api/hosts/{id}/repo/{prune,check,unlock} (and matching outer
routes for HTMX form posts). Prune pushes the admin-cred slot via
pushAdminCredsToAgent before dispatch and refuses with
admin_creds_required when the slot is not set. Check reads
check_subset_pct from host_repo_maintenance (overridable via ?subset=N,
clamped 0-100; non-numeric override falls back to DB value silently).
Unlock needs no admin creds. All three share the same wantsHTML/HX-Redirect
response split as the per-source-group run-now endpoint.
2026-05-04 10:19:15 +01:00
steve 35f07c3cee server: admin-credentials REST + Slot:admin push helper
Adds GET/PUT/DELETE /api/hosts/{id}/admin-credentials handlers that
mirror the existing repo-credentials endpoints but write to
store.CredKindAdmin with AEAD additional-data "host:<id>:admin" (scoped
away from the repo slot to prevent cross-binding). PUT immediately pushes
a config.update(Slot:"admin") to the agent when it is connected, and the
new pushAdminCredsToAgent helper is wired for use by the upcoming prune
run-now endpoint (D2) to push on-demand before dispatch.
2026-05-04 10:19:15 +01:00
steve f801fdf65b store: host_credentials becomes kind-aware (repo + admin slots) 2026-05-04 10:19:15 +01:00
steve 380931b3a8 lint: align local gofumpt rules with golangci-lint v2.5.0
Bumping CI to v2.5.0 surfaced two new gofumpt findings (in two test
files that gofumpt v2.1.6 considered fine). Local re-format with
the matching tool brings them in line.

Pre-commit hook config: prepend $GOPATH/bin to PATH inside the hook
entry so gofumpt + golangci-lint resolve when ~/go/bin isn't on the
operator's interactive shell PATH (common — go install puts them
there but PATH config varies). Without this, the hooks fail with
'Executable not found' even when the tools are installed.

Pin the Makefile setup target to v2.5.0 so a fresh clone gets the
same binary CI runs — keeps pre-commit and CI from drifting again.
2026-05-03 21:31:47 +01:00
steve b6f8de1dcc lint: drive baseline to zero, drop only-new-issues gate
Cleanup pass over the repo so CI can enforce lint going forward
without the only-new-issues escape hatch:

* gofumpt -w across the tree (31 hits, all formatting)
* misspell --fix (25 hits, US-locale spelling) — but reverted on
  api.JobCancelled = "cancelled" since that literal is the wire +
  DB CHECK constraint value, plus matched the case in store/fleet.go
  back to "cancelled" and added //nolint:misspell on both for the
  next time someone reaches for the auto-fix
* Wrap every `defer rows.Close()` / `defer stmt.Close()` /
  `defer res.Body.Close()` in `defer func() { _ = .Close() }()`
  to satisfy errcheck without losing the close itself
* websocket.Dial callers (1 prod, 4 tests) now capture + close the
  upgrade response Body — coder/websocket can return res with a nil
  Body on success, so the test deferred-closes guard against that
* Annotate the two genuine-by-design nilerr cases with //nolint
  comments explaining why nil-on-error is the contract (cookie
  missing = no session; ctx cancelled mid-backoff = clean shutdown)
* Add brief godoc on the 10 exported const groups + types that
  revive flagged (api.HostOS/HostArch/JobKind/JobStatus/LogStream/
  ErrorCode, restic.EventKind, store.Role, web.FS)
* Drop the unused (*Server).userByID method
* Inline the unparam baseView(active) — every UI page is under
  the dashboard primary nav today

Result: `golangci-lint run ./...` reports 0 issues. CI lint job
no longer needs only-new-issues: true; X-06 follow-up entry in
tasks.md removed.
2026-05-03 16:15:17 +01:00
steve 41c3ec7c6f ci: migrate .golangci.yml to v2 schema + only-new-issues gate
The bump from golangci-lint-action@v6 → v7 (which downloads the v2.x
binary) was blocking CI lint with 'unsupported version of the
configuration: ""' because .golangci.yml was still in the v1 schema.

Migrate the config to v2:
* version: "2" prelude
* disable-all → default: none
* linters-settings → linters.settings
* gofumpt + goimports move into formatters.enable + formatters.settings
* exclude-rules move into linters.exclusions.rules
* gosimple drops (folded into staticcheck in v2)

Fix the four lint hits in the new P2R-02 code:
* host_bandwidth.go: convert hostBandwidthRequest directly to
  hostBandwidthView via type conversion (S1016)
* ui_repo.go: drop unparam savedSection + status arguments from
  renderRepoPage (always "" / always 422 — split GET render from
  validation-fail render)
* ui_schedules.go: gofumpt formatting on the scheduleEditPage struct

Add only-new-issues: true to the lint job. The repo carries ~90
pre-existing findings (gofumpt drift × 31, misspell × 25, missing
godoc × 10, bodyclose × 6, errcheck × 12, …) accumulated before
lint was actually wired into CI. Without this gate, every PR would
fail on baseline noise instead of its own changes.

Track the cleanup as X-06 in tasks.md so the gate is temporary.
2026-05-03 15:00:24 +01:00
steve a4823193e7 P2R-02 slice 5: dashboard row Run-now uses covering schedule
Replace the placeholder 'Open →' link with a per-host Run-now
decision computed server-side once per render:

* If the host has exactly one enabled schedule whose source-group
  set covers every group on the host → primary 'Run all groups'
  button (HX-POST to that schedule's /run endpoint, fires every
  backup the host knows about in one click).
* Otherwise (zero matches, multiple matches, or any ambiguity) →
  ghost 'Open →' link to /hosts/{id}/sources, where the operator
  picks per-group from the source-group rows.

dashboardPage.Hosts moves from []store.Host to []dashboardHostRow
to carry the precomputed RunAllScheduleID; host_row.html now reads
.Host.* and .RunAllScheduleID. Two extra store calls per host on
dashboard render — fine at fleet sizes we care about; if we ever
need to support thousands of hosts we'll batch these queries.
2026-05-03 13:42:50 +01:00