restic-manager

Author	SHA1	Message	Date
steve	d373d19647	ui: F1 — populate OpenAlerts in baseView so nav badge updates everywhere Flagged in review of `cd38b40`: the Alerts tab badge should show the open count from any page, not just /alerts. baseView now takes the request and queries store.ListAlerts(Status: "open") to fill view.OpenAlerts on every page render. All call sites updated.	2026-05-04 20:19:09 +01:00
steve	cd38b40516	ui: alerts list page + alert row partial + nav badge	2026-05-04 20:15:01 +01:00
steve	de6939b3f6	http: /settings/notifications CRUD + test endpoint	2026-05-04 20:06:45 +01:00
steve	873821b871	http: /alerts list + ack/resolve handlers + /api/alerts JSON	2026-05-04 19:59:24 +01:00
steve	8c42b00228	alert: wire engine into ws hello + MarkJobFinished + offline sweep - ws.HandlerDeps gains an AlertEngine *alert.Engine field; populated from http.Deps.AlertEngine (nil until G1 constructs the engine) - runAgentLoop calls NotifyHostOnline after MarkHostHello succeeds - dispatchAgentMessage MsgJobFinished case calls NotifyJobFinished, looking up the job Kind via Store.GetJob before notifying - store.MarkHostsOfflineStaleReturnIDs added: SELECT+UPDATE in one transaction, returns the IDs that flipped to offline - offline sweeper in cmd/server/main.go switched to the new variant; TODO(G1) comment marks where NotifyHostOffline calls will land	2026-05-04 19:54:39 +01:00
steve	02250670c1	ui: snapshots SIZE/FILES tooltip when host's restic is < 0.17 Per-snapshot size + file-count come from the embedded summary block restic added to 'snapshots --json' in 0.17 (the source comment in internal/restic/snapshots.go incorrectly said 0.16+). Hosts running 0.16.x leave those columns blank. - Fix the snapshots.go doc comment: '0.16+' -> '0.17+'. - hostDetailPage carries a LegacyRestic bool computed from the host's reported ResticVersion via Env.AtLeastVersion(0, 17). Empty version also counts as legacy (conservative default). - Template attaches title='Needs restic 0.17+ on the agent host. This host runs <ver>.' + cursor:help on the SIZE / FILES headers when the flag is true. Hosts already on 0.17+ get no tooltip and no extra styling. A host upgrading restic to 0.17+ gets the columns populated on the next backup automatically — no further code change needed.	2026-05-04 17:45:32 +01:00
steve	f0dfa689fe	P3 follow-up: editable target dir, conditional --no-ownership, UK lint Three small follow-ups from review: 1. Restore target is now operator-editable. Default value is the literal '\$HOME/rm-restore/<job-id>/' (agent expands \$HOME at run time using os.UserHomeDir(); also handles \${HOME} and ~/ prefixes). Operator can replace with any absolute path. - ui_restore.go validates the input is either absolute or starts with one of the recognised prefixes; other env-var refs (\$PATH etc.) are deliberately rejected so operator paths can't pick up arbitrary agent env values. - host_restore.html replaces the read-only mono-text display with a real <input>; help text spells out that \$HOME resolves agent-side and <job-id> is substituted on dispatch. - install.sh + the systemd unit prep /root/rm-restore so the default works under the sandbox: ReadWritePaths gains a soft '-/root/rm-restore' entry (the '-' makes the bind-mount soft-fail if missing, but install.sh pre-creates it root-owned 0700). 2. --no-ownership flag now gated on restic version. The flag was added in restic 0.17 and 0.16 rejects it. Previously dropped it wholesale — that meant new-dir restores silently preserved ownership against design intent on 0.17+. Now the agent threads its detected restic version (sysinfo already collects it) through runner.Config -> restic.Env, and RunRestore appends --no-ownership only when AtLeastVersion(0, 17) returns true. 0.16 hosts still restore with original uid/gid; help text in the wizard explicitly notes this. The previous 'Original ownership is preserved' copy was wrong for new-dir mode and is corrected. 3. golangci-lint misspell locale switched US -> UK and the codebase swept (73 corrections, mostly behaviour/serialise/recognise/honour). Wire-format ErrorCode 'unauthorized' -> 'unauthorised' is a tiny contract change but the agent doesn't parse those codes today and no external API consumers exist yet. Tests passed before + after. 
Tests: - internal/restic/version_test.go covers Env.AtLeastVersion across edge cases (empty, exact match, patch above, minor below, non- numeric) and expandHome on \$HOME / \${HOME} / ~/, plus pass-through for absolute paths and refusal of other env vars. - ui_restore_test updated: TargetDir now starts '\$HOME/rm-restore/' with the job_id substituted into the placeholder. Live verified on the smoke env: default target restored to /root/rm-restore/<job-id>/ as the agent's expanded \$HOME (2 files, 14 bytes); custom override '/tmp/custom-restore/<job-id>/' restored into the agent's PrivateTmp namespace (1 file, 6 bytes); both jobs 'succeeded', exit 0.	2026-05-04 17:27:52 +01:00
steve	a2398d0b66	P3 follow-up: log download (txt + ndjson) on the live job page The diff job's full output streams to the standard live job log page, which can be a lot of text the operator wants to grep through or paste into a ticket. Add a Download button. Source of truth is the persisted job_logs table — works any time (running or finished) and doesn't need to pause the live WS stream. The download is 'everything the server has up to right now'; if the operator wants a fuller snapshot of a still-running job, they hit Download again. - New endpoint GET /api/jobs/{id}/log.{txt,ndjson} (chi {format} matcher constrained to the two known suffixes). Auth via session cookie. 404 on unknown job. - internal/server/http/job_download.go writeLogsText emits a small header + 'HH:MM:SS.mmm TAG payload' rows mirroring what the live page shows. writeLogsNDJSON emits one self-contained {seq,ts,stream, payload} JSON object per line — appending stays valid (each line stands alone), and the whole file pipes cleanly into jq. NDJSON is newline-delimited JSON; not the same as a JSON array. - web/templates/pages/job_detail.html grows two header buttons: 'Download log' (txt) + '.ndjson' ghost variant for tooling. Tests cover the txt format (header + per-row shape), the ndjson format (each line round-trips through json.Unmarshal), unknown job 404, unauthenticated 401.	2026-05-04 17:12:45 +01:00
steve	e22b41d452	P3 sweep fixes: snap-row CSS, tree expand, --no-ownership drop, target path Bug fixes from the Playwright sweep against the live smoke server: 1. Snapshot-picker layout. The .snap-row class was used in the wireframe but never landed in web/styles/input.css; rows rendered as vertical blocks instead of a 6-column grid. Added the token (mirrors host-row shape with restore-specific column widths). 2. Tree expansion. hx-target='closest .tree-row + .tree-children' isn't a valid HTMX selector — modifiers don't chain. Replaced HTMX-driven expansion with a small window.__rmTreeToggle helper that uses plain fetch + .tree-pair wrapper structure for trivial sibling lookup. Caches loaded state per node. 3. --no-ownership flag dropped. Restic 0.17 introduced --no-ownership; 0.16 rejects it ('unknown flag') before doing any work. Since the agent runs as root in the systemd unit, restored files keep their original uid/gid either way and the parent dir is root-owned, so the 'cp without sudo' rationale doesn't hold. Drop the flag entirely. 4. Default target dir moved to /var/lib/restic-manager/restore. The systemd unit pins ReadWritePaths to /etc/restic-manager + /var/lib/restic-manager (with ProtectSystem=strict making the rest of /var read-only); writes to /var/restic-restore failed with 'read-only file system'. 5. Confirm summary HTML escaping. defaultTarget JS literal evaluates to a string with literal angle brackets; insertion into innerHTML must escape them. Added an inline HTML-escape pass. tasks.md ticked for the Restore sub-phase with a sweep summary covering the live end-to-end test.	2026-05-04 15:57:42 +01:00
steve	1111124573	P3-09 + P3-X3: snapshot diff + recent-restores line P3-09 — snapshot diff dispatcher. - POST /api/hosts/{id}/snapshots/diff (and the unprefixed HTMX-form variant) takes {snapshot_a, snapshot_b}, validates both belong to the host (long id / short id / prefix match), checks the agent is online, mints a JobDiff, ships command.run with DiffPayload, writes a host.snapshot_diff audit row, returns HX-Redirect to the live job page (or JSON {job_id, job_url} for REST callers). - Two-snapshot guard: POSTing diff(a,a) returns 422. - UI: small panel on the host_detail right rail (visible when the host has 2+ snapshots) with two short-id inputs and a Diff button. Output renders on the standard live job page where the operator reads the per-line diff text directly. P3-X3 — recent-restores line. - hostChromeData grows RestoreStatus / RestoreAt / RestoreJobID populated via store.LatestJobByKind(host_id, 'restore') (already exists, used by the init line). - host_chrome.html renders a small line below the existing init-status one with status-coloured copy + a link to the job log. Hidden when no restore has ever run on this host. Tests: - diff_test covers happy path (correct DiffPayload + HX-Redirect), same-id rejection (422), unknown-id rejection (422). Adds a seedTwoSnapshots helper since ReplaceHostSnapshots is atomic-swap (calling seedSnapshot twice would only leave the second). Restage block (CLAUDE.md) deferred to the end of the restore phase.	2026-05-04 15:38:28 +01:00
steve	6e47efc146	P3-01/02/03: restore wizard backend + templates + restore-shaped job page End-to-end wizard from /hosts/{id}/restore (or per-snapshot deep link /hosts/{id}/snapshots/{sid}/restore) → tree-browse → dispatch → restore-shaped live job page. Backend (internal/server/http/ui_restore.go): - GET handlers render the four-step wizard against the wireframe shape in docs/superpowers/specs/2026-05-04-p3-restore-design.md. - HTMX tree partial endpoint hits fetchTreeWithCache (P3-X2) so each directory expansion is a sub-second cached lookup after the first miss. - POST validates: snapshot_id non-empty, ≥1 absolute path, in-place mode requires confirm_hostname == host name, agent online. On error re-renders the wizard with the operator's input intact. Happy path mints a job_id, computes the new-directory target as /var/restic-restore/<job-id>/ (operator can't escape the prefix — server picks it), creates the job row, ships command.run with kind=restore + RestorePayload, writes a host.restore audit row, returns HX-Redirect (or 303) to the live job page. Templates: - host_restore.html: single-page progressively-enabled wizard matching _diag/p3-restore-wizard wireframe. Form-state-driven JS computes a running tally of selected paths and the step-4 confirm summary client-side; the server re-renders on validation failure with form fields preserved. - partials/tree_node.html: recursive HTMX-served tree fragment. - Top-level Restore button on host_detail right rail + per-snapshot Restore action on snapshot rows replace the previous P3-stub. Restore-shaped job page (job_detail.html): - Progress widget rendered as a panel rather than a bare strip when the job is active. - Current-file display under the bar, updated from log.stream stdout lines that look like absolute paths. Hidden for non-restore kinds. Migration 0012: - Add restore + diff to the jobs.kind CHECK. Rebuild required (SQLite can't ALTER CHECK in place); follows the safe pattern from 0005. 
Defensive: stash job_logs into a temp table before the rebuild and INSERT OR IGNORE back afterwards so even if SQLite cascades on DROP TABLE jobs the log history survives. Tests: - ui_restore_test covers GET step-1 render, GET pre-selected snapshot summary card, POST missing snapshot, POST missing paths, POST in-place wrong-hostname rejection (no command.run leaks to the agent), POST happy path (HX-Redirect + correct payload + audit row), POST against offline host returns 503. Restage block (CLAUDE.md) deferred to the end of the restore phase.	2026-05-04 15:34:29 +01:00
steve	265b4b6c5d	P3-03: restic restore + diff execution path Wires JobRestore and JobDiff end-to-end at the agent layer (the wizard backend that drives this lands in the next slice). - internal/api: JobRestore + JobDiff JobKind constants. CommandRunPayload grows nullable Restore + Diff sub-payloads. RestorePayload carries snapshot_id, paths, in_place, target_dir; DiffPayload carries snapshot_a + snapshot_b. - internal/restic.RunRestore wraps 'restic restore <sid> --target ... [--no-ownership] [--include p]...' with --json. New pumpRestoreStdout parses the per-line status / summary objects (drops raw status from log.stream — the throttled job.progress envelope covers it). New RestoreStatus + RestoreSummary types mirror restic's wire shape. - internal/restic.RunDiff wraps 'restic diff --json <a> <b>'. - internal/agent/runner: RunRestore translates RestoreStatus into job.progress (mapping FilesRestored → FilesDone etc) with a small estimateETA helper since restic doesn't provide ETA for restore. RunDiff is a thin streamHandler wrapper. - cmd/agent dispatcher gains JobRestore + JobDiff cases. Both reuse the spawn() helper from P3-X1 so cancel just works. - Drive-by fix: lastProgress was initialised to time.Now() so the very first status event was suppressed by the 1s throttle if the agent reported quickly. Initialise to time.Time{} (zero) so the first event always emits. Affects backup + restore. Tests: - restore_test covers restore happy path (started → progress → finished, kind=restore on the started envelope), in-place argv asserts no --no-ownership, new-dir argv asserts --no-ownership + --target + --include, diff produces the expected log.stream lines. Restage block (CLAUDE.md) is deferred to the end of the restore sub-phase so we restage once with all changes.	2026-05-04 15:24:14 +01:00
steve	6d295bc9f6	P3-X2: tree.list synchronous WS RPC + per-session cache Foundational for the restore wizard's tree browser. The wizard needs to lazy-load directory contents from a snapshot as the operator drills down; this lands the transport. - internal/api adds MsgTreeList (server → agent) + MsgTreeListResult (agent → server) with TreeListRequestPayload / TreeListEntry / TreeListResultPayload types. Reply correlates by Envelope.ID. - internal/restic.ListTreeChildren wraps 'restic ls --json' and filters its recursive output to direct children of the requested path. Parser + path-normalisation + isDirectChild are unit-tested. - internal/server/ws/rpc.go introduces a generic SendRPC helper on Hub: register a buffered channel keyed by ULID, send the request, block on ctx.Done()/timeout/reply. Reply routing piggybacks on the existing dispatchAgentMessage by adding a MsgTreeListResult case that forwards to the registered waiter; if no waiter is registered (caller already gave up) the stray reply is dropped quietly. - cmd/agent gains a tree.list handler that runs ListTreeChildren on a fresh per-call context (60s ceiling) and ships the matching tree.list.result envelope. Errors surface in result.Error rather than as transport failures so the server-side waiter can render a sensible UI message. - internal/server/http/tree_cache.go is the per-wizard-session cache layer (~30min TTL, sweep-on-access) that fetchTreeWithCache uses before falling through to SendRPC. Cached on success only; agent errors aren't cached so a transient failure doesn't poison the session. Tests: - internal/restic/ls_test.go covers parseLsChildren at root / mid-tree / leaf, plus normalizeTreePath and isDirectChild edge cases. - internal/server/ws/rpc_test.go unit-tests the registry: round-trip, release semantics, concurrent waiters, ctx-cancel. - internal/server/http/tree_rpc_test.go is the full round-trip: server SendRPC → fake-agent over a real WS → reply → server gets the payload. 
Plus a timeout test that confirms ~300ms timeouts terminate in ~300ms rather than waiting forever. The cache is plumbed but no UI handler hits fetchTreeWithCache yet — that lands with P3-01 (wizard backend). The unused-linter is suppressed via nolint until the wizard wires it in.	2026-05-04 15:19:22 +01:00
steve	9fa2ef48f0	P3-X1: cancel-job feature Wires the existing job_detail Cancel button (which was a UI stub) into real backend behaviour: - internal/api already declared MsgCommandCancel + CommandCancelPayload; promote those from forward-declarations to a working envelope. Agent side: cmd/agent/main.go drops the TODO-stub and gains a per-job ctx.CancelFunc map. runJob's switch is refactored around a small spawn() helper so each kind's goroutine derives a per-job context, registers the cancel, and removes itself on completion regardless of outcome. command.cancel looks up the func and fires it. - internal/agent/runner.sendFinished now takes ctx and rebadges ctx.Canceled errors as JobCancelled (exit 130) rather than JobFailed. All Run* call sites updated. - internal/restic.resticCmd sets cmd.Cancel to send SIGTERM (via build-tagged sigterm constant; os.Kill on Windows since SIGTERM isn't deliverable there) and cmd.WaitDelay=5s for the SIGKILL fallback. SIGTERM lets restic remove its lock file before exiting. - New POST /api/jobs/{id}/cancel server endpoint validates the job is non-terminal and the host is online, sends command.cancel via the hub, writes a job.cancel audit row, returns 202. The agent's resulting job.finished (status=cancelled) is what actually transitions the row. Tests: - internal/server/http/cancel_test.go covers happy path (envelope shape + audit row), 409 for terminal jobs, 404 for missing jobs, 503 for offline hosts. - internal/agent/runner/cancel_test.go covers cancel mid-run: a fake restic that exec'd into 'sleep 30' is canceled 150ms after start and the resulting job.finished reports JobCancelled with exit 130 in well under the WaitDelay. Foundational for P3 restore (operator needs to be able to cancel a running backup if they need to restore urgently). Independently useful for prune/check/backup that are stuck.	2026-05-04 15:11:49 +01:00
steve	c691dc8a56	tasks: tick P2 completion + Playwright sweep screenshots P2R-09/10/11/12/13/14, P2-16/17/18 all marked done. Acceptance line for Windows hosts annotated as 'compile-verified, untested in CI'. _diag/p2-completion-sweep/ holds the dashboard + host-detail + schedules + sources + repo + source-group-edit screenshots from a clean sweep against :8080. Zero console errors throughout. announce_test.go: rate-limit + global-cap subtests dropped t.Parallel to avoid racing on the package-level tunables under -race.	2026-05-04 11:27:09 +01:00
steve	bbdf631a01	ui+server: P2-18d pending hosts dashboard panel + expiry sweeper Dashboard handler loads ListPendingHosts(now); template renders a warn-bordered panel above the host table with hostname, OS/arch, fingerprint (selectable / copyable), source IP, age, expiry. Each row carries an inline accept form (repo URL/user/password) plus a Reject button. cmd/server adds a 60s ticker calling DeleteExpiredPendingHosts so 1h-stale rows drop off.	2026-05-04 11:11:32 +01:00
steve	567561a6a3	server: P2-18b pending WS + admin accept/reject GET /ws/agent/pending?pending_id=… runs an Ed25519 nonce-sign handshake against the row's stored public key, then holds the connection open. POST /api/pending-hosts/{id}/accept (admin) mints a real Host row + bearer + AEAD-encrypted repo creds, pushes the bearer down the open WS, deletes the pending row, and writes a host.accept_pending audit entry. POST /api/pending-hosts/{id}/reject closes the socket with code 4001 and audit-logs host.reject_pending. In-memory pendingHub keyed by pending_id wires accept/reject to their live socket.	2026-05-04 11:07:32 +01:00
steve	a8e6c9d6d7	store+server: P2-18a announce-and-approve schema + endpoint migration 0011 adds pending_hosts table (id, hostname, public_key, fingerprint, expiry). store/pending_hosts.go covers full CRUD plus hostname-collision count + expired-row sweeper. POST /api/agents/announce takes {hostname, os, arch, agent_version, restic_version, public_key (base64)}, returns {pending_id, fingerprint, hostname_collision}. Per-source-IP token-bucket rate limit (10/min) + global cap of 100 in-flight rows. Public key must be exactly 32 bytes (Ed25519).	2026-05-04 11:03:41 +01:00
steve	1d3661470f	ui: P2R-12 hook editor — source-group form + host-default Repo section Source-group edit form gains pre/post hook textareas with a service- user warning banner; bodies AEAD-encrypted on save (per-group AD). Repo page adds a 'Host-default hooks' panel above the danger zone with the same shape; saved via POST /hosts/{id}/repo/hooks.	2026-05-04 11:00:28 +01:00
steve	13c35b68d4	agent+server: P2R-11 pre/post hook execution for backup jobs Agent: new runner.BackupHooks struct + runHook helper invoked via /bin/sh -c (cmd.exe /C on Windows). pre_hook non-zero exit aborts the backup; post_hook always runs with RM_JOB_STATUS=succeeded|failed in env. Output streamed as 'hook(<phase>): …' log.stream lines. Hooks only run for kind=backup (other kinds skip both phases). Server: resolveBackupHooks resolves group → host default → empty, decrypts via crypto.AEAD with per-slot ad bytes, plumbs plaintext into CommandRunPayload for both schedule.fire and per-group Run-now dispatch sites. Decrypt failures degrade silently to no hook so a malformed blob can't poison every backup.	2026-05-04 10:57:28 +01:00
steve	cce3cd8384	ui: P2R-09 auto-init UX — init line in chrome + danger-zone re-init Latest 'init' job status surfaced under the host-detail vitals strip (succeeded/failed/running/queued, with link to the live job log on non-success). New POST /hosts/{id}/repo/reinit handler dispatches a fresh init job after the operator types the host name to confirm; audit row records 'host.repo_reinit'.	2026-05-04 10:49:57 +01:00
steve	93ab0ae84f	ui+server: schedule next-run / last-run on dashboard + schedules tab P2R-14. New store.LatestJobBySchedule query (per-schedule fired job). Schedules-tab handler computes next-fire from cron + last-fire from the jobs table per row. Schedules table grows two columns; dashboard host row prepends 'next 12h ago/from now' to the existing last-backup line when a single covering schedule is the run-now candidate. Embeds store.Schedule into scheduleRow so existing template field references keep working without bulk renames.	2026-05-04 10:44:31 +01:00
steve	6589f23313	ui+server: per-job bandwidth override on Run-now P2R-13b. POST /hosts/{id}/source-groups/{gid}/run accepts optional bandwidth_up_kbps / bandwidth_down_kbps form fields, plumbs them onto CommandRunPayload. Agent dispatcher already prefers per-job override over host-wide caps (T1). UI wraps the Run-now button in a form with a <details> 'Limit bandwidth for this run' disclosure containing two KB/s inputs.	2026-05-04 10:41:13 +01:00
steve	ddc07609cb	agent+server: apply host bandwidth caps to restic invocations P2R-13a. restic.Env gains LimitUploadKBps/LimitDownloadKBps which are emitted as global --limit-upload/--limit-download flags before the subcommand on every invocation. Agent dispatcher tracks host-wide caps received via config.update; server pushes them on hello and after PUT /api/hosts/{id}/bandwidth. Also extends api.CommandRunPayload with optional per-job overrides (BandwidthUpKBps/Down + PreHook/PostHook); the override consumers land in T2/T6.	2026-05-04 10:38:34 +01:00
steve	bc02fcb498	test: poll pending-row count in drain-on-reconnect test (race fix) CI run #50 failed with: --- FAIL: TestDrainPendingDispatchesOnReconnect (1.03s) pending_drain_test.go:150: pending rows after drain: got 1, want 0 The test waits for a backup command.run envelope on the wire and then checks the pending-row count. But conn.Send (the wire write) returns BEFORE DeletePendingRun runs in the drain goroutine — both fire serially inside drainOne, but the wire-side reader can observe the Send while the delete is still pending. Use the existing waitForPendingCount helper to poll the count with a 2s deadline. Behaviour unchanged when the delete is fast (count hits 0 immediately); only relevant under CI scheduling pressure. -race -count=10 locally now passes consistently.	2026-05-04 10:20:54 +01:00
steve	99ef2b7a71	server: serialize DrainPending per host (avoid drain double-dispatch) Add a per-host drain mutex (drainLocks map guarded by drainLocksMu) on the Server struct. DrainPending acquires it with TryLock: if a drain is already in-flight for this host, the call returns immediately — the running drain will see every pending row. This prevents the on-hello goroutine and the 30s tick from both listing the same host's rows and dispatching them twice. Update three existing tests that called srv.DrainPending explicitly after the on-hello goroutine had already been spawned: replace the now-redundant direct call with a waitForPendingCount poll so they don't race the goroutine's mutex ownership. Add TestDrainPendingSerializesPerHost which fires 10 concurrent DrainPending goroutines against a 5-row queue and asserts exactly 5 job rows result.	2026-05-04 10:19:15 +01:00
steve	99b88d08c9	server/ws: persist repo.stats into host_repo_stats	2026-05-04 10:19:15 +01:00
steve	1629dc7146	server: drainer abandons only on ErrNotFound, not transient errors GetSourceGroup errors in drainOne now gate on errors.Is(err, store.ErrNotFound) before calling abandonPending, mirroring the existing GetSchedule pattern. Transient errors (SQLITE_BUSY, context cancellation) now log a warning and return without deleting the row. Add regression test TestDrainPendingDropsRowsForGoneSourceGroup confirming the ErrNotFound path still abandons correctly. Also add a comment above the backoff-doubling loop explaining the progression.	2026-05-04 10:19:15 +01:00
steve	0c9ea75046	server: drainer uses dispatch-core to avoid duplicate pending_run enqueue Extract dispatchBackupForGroupCore (persist+marshal+send, no enqueue on failure) from dispatchBackupForGroup. drainOne now calls the core directly so a failed Send only bumps the existing pending_runs row via BumpPendingRunAttempt — not create a second row — stopping the geometric duplication on repeated drain failures. dispatchBackupForGroup (schedule.fire path) wraps the core and keeps its enqueue-on-failure behaviour unchanged. TestDrainPendingBumpsOnSendFailure strengthened: asserts exactly 1 row remains after a send failure (was tolerating >=1 duplicate rows).	2026-05-04 10:19:15 +01:00
steve	3e337dfb3c	server: drain pending_runs on tick + on agent reconnect Two trigger paths land here: - A 30s ticker in cmd/server calls Server.DrainAllDue(ctx). It walks pending_runs rows whose next_attempt_at <= now, dedupes by host, skips offline hosts, and per online host runs DrainPending. - onAgentHello spawns a background DrainPending(hostID). When a host comes back, every pending row for it is dispatchable now — due-ness becomes irrelevant once the wire is back. Each row's schedule + group are reloaded; ErrNotFound or disabled-schedule or gone-group abandons the row with a pending_run.abandoned audit. attempt >= retry_max also abandons. Otherwise dispatchBackupForGroup is invoked; success deletes the row, failure bumps attempt with exponential backoff capped at 30m.	2026-05-04 10:19:15 +01:00
steve	e64cf25c0e	server: enqueue pending_runs when scheduled-job dispatch fails When dispatchBackupForGroup's conn.Send errors, queue a pending_runs row (attempt=1, next_attempt_at = now + group.RetryBackoffSeconds) instead of silently dropping the fire. The orphaned queued job row is left behind for forensic visibility — the drainer will create a fresh job row on its retry. Also adds Store.ListPendingRunsForHost — the on-reconnect drain walks every row for the host, regardless of due-ness, since the host being back makes 'due' irrelevant.	2026-05-04 10:19:15 +01:00
steve	2794d5a821	server: fix stale RetentionPolicy comment + check Scan errors in maintenance test	2026-05-04 10:19:15 +01:00
steve	c47cc682e0	server: maintenance ticker drives forget/prune/check on cadence Wires a 60s server-side ticker to the pure-logic maintenance.Decide introduced in the previous commit. Decisions flow through a new DispatchMaintenance method on Server, which: - skips offline hosts (no pending_runs queueing — maintenance is not a backup, missed fires shouldn't pile up) - silently skips prune when admin creds aren't bound - pushes admin creds before prune, then dispatches with RequiresAdminCreds=true (same as operator-driven prune) - persists job rows with actor_kind="system" Reshapes the forget wire payload from a single RetentionPolicy to a ForgetGroups list (one tag + per-group keep- per source group). The agent walks the groups and runs `restic forget --tag <name> --keep-*` once per group. Dead-code removed: CommandRunPayload.RetentionPolicy, the old forget JSON-decode in cmd/agent, and the single-policy form of restic.RunForget.	2026-05-04 10:19:15 +01:00
steve	e7e11454a8	maintenance: pure-logic ticker decides forget/prune/check fires	2026-05-04 10:19:15 +01:00
steve	77a8590e3a	ui: hx-swap none on Run-now + truthful save banner + tailwind rebuild Add hx-swap="none" to the three Run-now buttons (check/prune/unlock) in host_repo.html to match the existing pattern on host_sources.html and host_schedules.html. Fix all-blank admin-credentials save to redirect without ?saved= query string so no false-positive banner is shown; strengthen the corresponding test to assert Location has no ?saved=. Rebuild CSS bundle via Tailwind to pick up max-w-[640px] JIT class.	2026-05-04 10:19:15 +01:00
steve	46ec123f95	ui: Slice E — admin creds form + run-now buttons + repo health panel - hostRepoPage gains AdminURL/AdminUsername/HasAdminPassword, Online, and StatsView (pre-dereferenced projection of host_repo_stats). - loadHostRepoPage loads the admin slot (tolerating ErrNotFound), hub.Connected, and stats (tolerating ErrNotFound). - renderRepoPage gains an adminErr parameter; all callers updated. - handleUIAdminCredentialsSave / handleUIAdminCredentialsDelete added (form-POST handlers mirroring the repo-creds pattern, with audit). - Routes /hosts/{id}/admin-credentials POST and /delete POST registered. - Template: Admin credentials form after Connection, Run-now HTMX buttons after Maintenance, Repo health stats panel in right rail. - Tests: 9 new tests covering rendering, disabled states, save/delete round-trips, audit rows, and idempotent delete.	2026-05-04 10:19:15 +01:00
steve	b35f1736f7	server: populate audit UserID on credential mutations + slog prune push errors Switch handleSetHostCredentials, handleSetAdminCredentials, and handleDeleteAdminCredentials from authedUser (bool) to requireUser (*store.User) so AuditEntry.UserID and Actor are populated correctly. Add slog.Warn on the non-ErrNotFound pushAdminCredsToAgent path in handleRunRepoPrune so decrypt/send failures surface in the server log rather than appearing as a generic host_offline 503.	2026-05-04 10:19:15 +01:00
steve	a8aff2c62b	server: cover HTMX auth-redirect path in repo-ops tests	2026-05-04 10:19:15 +01:00
steve	1ae567021a	server: HTTP run-now for prune / check / unlock Adds POST /api/hosts/{id}/repo/{prune,check,unlock} (and matching outer routes for HTMX form posts). Prune pushes the admin-cred slot via pushAdminCredsToAgent before dispatch and refuses with admin_creds_required when the slot is not set. Check reads check_subset_pct from host_repo_maintenance (overridable via ?subset=N, clamped 0-100; non-numeric override falls back to DB value silently). Unlock needs no admin creds. All three share the same wantsHTML/HX-Redirect response split as the per-source-group run-now endpoint.	2026-05-04 10:19:15 +01:00
steve	81a00202d0	server: admin-credentials REST + Slot:admin push helper Adds GET/PUT/DELETE /api/hosts/{id}/admin-credentials handlers that mirror the existing repo-credentials endpoints but write to store.CredKindAdmin with AEAD additional-data "host:<id>:admin" (scoped away from the repo slot to prevent cross-binding). PUT immediately pushes a config.update(Slot:"admin") to the agent when it is connected, and the new pushAdminCredsToAgent helper is wired for use by the upcoming prune run-now endpoint (D2) to push on-demand before dispatch.	2026-05-04 10:19:15 +01:00
steve	de6d51eeb1	store: host_credentials becomes kind-aware (repo + admin slots)	2026-05-04 10:19:15 +01:00
steve	dd7b37a5c1	lint: align local gofumpt rules with golangci-lint v2.5.0 Bumping CI to v2.5.0 surfaced two new gofumpt findings (in two test files that gofumpt v2.1.6 considered fine). Local re-format with the matching tool brings them in line. Pre-commit hook config: prepend $GOPATH/bin to PATH inside the hook entry so gofumpt + golangci-lint resolve when ~/go/bin isn't on the operator's interactive shell PATH (common — go install puts them there but PATH config varies). Without this, the hooks fail with 'Executable not found' even when the tools are installed. Pin the Makefile setup target to v2.5.0 so a fresh clone gets the same binary CI runs — keeps pre-commit and CI from drifting again.	2026-05-03 21:31:47 +01:00
steve	e871b05b38	lint: drive baseline to zero, drop only-new-issues gate Cleanup pass over the repo so CI can enforce lint going forward without the only-new-issues escape hatch: * gofumpt -w across the tree (31 hits, all formatting) * misspell --fix (25 hits, US-locale spelling) — but reverted on api.JobCancelled = "cancelled" since that literal is the wire + DB CHECK constraint value, plus matched the case in store/fleet.go back to "cancelled" and added //nolint:misspell on both for the next time someone reaches for the auto-fix * Wrap every `defer rows.Close()` / `defer stmt.Close()` / `defer res.Body.Close()` in `defer func() { _ = .Close() }()` to satisfy errcheck without losing the close itself * websocket.Dial callers (1 prod, 4 tests) now capture + close the upgrade response Body — coder/websocket can return res with a nil Body on success, so the test deferred-closes guard against that * Annotate the two genuine-by-design nilerr cases with //nolint comments explaining why nil-on-error is the contract (cookie missing = no session; ctx cancelled mid-backoff = clean shutdown) * Add brief godoc on the 10 exported const groups + types that revive flagged (api.HostOS/HostArch/JobKind/JobStatus/LogStream/ ErrorCode, restic.EventKind, store.Role, web.FS) * Drop the unused (Server).userByID method Inline the unparam baseView(active) — every UI page is under the dashboard primary nav today Result: `golangci-lint run ./...` reports 0 issues. CI lint job no longer needs only-new-issues: true; X-06 follow-up entry in tasks.md removed.	2026-05-03 16:15:17 +01:00
steve	18a9f6624e	ci: migrate .golangci.yml to v2 schema + only-new-issues gate The bump from golangci-lint-action@v6 → v7 (which downloads the v2.x binary) was blocking CI lint with 'unsupported version of the configuration: ""' because .golangci.yml was still in the v1 schema. Migrate the config to v2: * version: "2" prelude * disable-all → default: none * linters-settings → linters.settings * gofumpt + goimports move into formatters.enable + formatters.settings * exclude-rules move into linters.exclusions.rules * gosimple drops (folded into staticcheck in v2) Fix the four lint hits in the new P2R-02 code: * host_bandwidth.go: convert hostBandwidthRequest directly to hostBandwidthView via type conversion (S1016) * ui_repo.go: drop unparam savedSection + status arguments from renderRepoPage (always "" / always 422 — split GET render from validation-fail render) * ui_schedules.go: gofumpt formatting on the scheduleEditPage struct Add only-new-issues: true to the lint job. The repo carries ~90 pre-existing findings (gofumpt drift × 31, misspell × 25, missing godoc × 10, bodyclose × 6, errcheck × 12, …) accumulated before lint was actually wired into CI. Without this gate, every PR would fail on baseline noise instead of its own changes. Track the cleanup as X-06 in tasks.md so the gate is temporary.	2026-05-03 15:00:24 +01:00
steve	fab99b4a38	P2R-02 slice 5: dashboard row Run-now uses covering schedule Replace the placeholder 'Open →' link with a per-host Run-now decision computed server-side once per render: * If the host has exactly one enabled schedule whose source-group set covers every group on the host → primary 'Run all groups' button (HX-POST to that schedule's /run endpoint, fires every backup the host knows about in one click). * Otherwise (zero matches, multiple matches, or any ambiguity) → ghost 'Open →' link to /hosts/{id}/sources, where the operator picks per-group from the source-group rows. dashboardPage.Hosts moves from []store.Host to []dashboardHostRow to carry the precomputed RunAllScheduleID; host_row.html now reads .Host.* and .RunAllScheduleID. Two extra store calls per host on dashboard render — fine at fleet sizes we care about; if we ever need to support thousands of hosts we'll batch these queries.	2026-05-03 13:42:50 +01:00
steve	4035c44be3	P2R-02 follow-up: schedule Run-now feedback (single → job log, multi → toast) Schedules tab Run-now used to silently HX-Redirect back to the list, leaving the operator wondering whether the click registered. Now: * Single-source-group schedule → HX-Redirect to that one job's live log, matching the per-source-group Run-now UX from Sources. * Multi-group schedule → stay on the schedules list and fire a success toast ("N backups dispatched: <group names>") via the existing rm:toast HX-Trigger channel, so the operator sees clear acknowledgement without losing their place. dispatchBackupForGroup now returns the persisted job ID so the caller can choose between job-log redirect and toast feedback; on any internal failure it returns "" and the warning still hits slog as before. The cron-fired path (dispatchScheduledJob) ignores the return value, behaviour unchanged.	2026-05-03 13:25:31 +01:00
steve	d62b173712	P2R-02 slice 4: Repo tab — connection / bandwidth / maintenance Three independent forms on /hosts/{id}/repo so saving one section doesn't disturb the others: * Connection: edits repo URL, username, password (pre-filled from the redacted GET /api/hosts/{id}/repo-credentials view; password field shows masked stored-creds placeholder; blank password = keep existing). On save, encrypts and pushes config.update to a connected agent. * Bandwidth: host-wide upload/download caps (KB/s; blank = no cap) written via store.SetHostBandwidth. New REST endpoint PUT /api/hosts/{id}/bandwidth for JSON callers. * Maintenance: forget/prune/check cadences + check subset %, with per-row enabled toggles. Reuses cronParser for validation; auto-seeds the row if a host pre-dates the migration. Right-rail surfaces repo size, snapshot count, snapshots-by-tag breakdown (counted from existing snapshot tag rows), and an 'untagged snapshots are left alone' note. Danger-zone re-init button is rendered but disabled with a hint pointing at P2R-09 (real implementation lands there). Validation re-renders the page with the relevant form's banner and all other section state intact. Successful saves redirect with a ?saved=<section> query param so the page surfaces a small ✓ saved indicator on the relevant form. ci.yml: bump golangci-lint-action v6→v7 (separate change picked up in this commit).	2026-05-03 12:14:03 +01:00
steve	8b91d3037c	P2R-02 follow-up: Run-now works on disabled schedules with confirm Surface the Run-now button on every schedule when the host is online, not just enabled ones. Disabled rows render the button as a non-primary style + a HX-confirm dialog ("This schedule is paused — running it now won't change that. Fire it once anyway?"); enabled rows keep the zero-friction primary button. Server-side, Run-now no longer short-circuits on !Enabled — it dispatches the source groups inline rather than via dispatchScheduledJob (which always bails on disabled schedules, since cron-tick semantics are different from explicit operator intent). The audit-log entry inside dispatchBackupForGroup still records every fire.	2026-05-03 12:07:26 +01:00
steve	64d2fcf7a3	P2R-02 follow-up: clickable rows on Sources/Schedules + cron-preset tooltips Aligns Sources and Schedules tab rows with the dashboard's row-click UX: whole-row click navigates to the row's edit page (mirroring .host-row.clickable). Drops the redundant Edit buttons; Run-now and Delete remain in .row-action cells that sit above the row-link overlay via z-index. Schedule edit form's cron preset chips now carry human-readable title= tooltips ("Every day at 03:00", "Every Sunday at 03:00", etc). tasks.md gets a binding row-design rule covering all current and future list-row templates, and the P2R-02 entry is split into the six slices already agreed with the operator (slices 1–3 marked done, 4 next).	2026-05-03 12:01:55 +01:00
steve	67ca769686	P2R-02 slice 3: Schedules tab — slim list, new/edit form, delete, Run-now Schedules list: status (enabled/paused) + cron + source-group tags + actions (Run-now when enabled+online, Edit, Delete). Run-now reuses dispatchScheduledJob — same path real cron fires take, so each referenced source group runs as its own backup with its own tag. Falls back to a 409 if the agent is offline. Schedule new/edit form: cron input with five preset chips (quick-pick @hourly / nightly / 6h / weekly / monthly), source-group multi-pick rendered as styled checkbox cards (visual state tracks the underlying box via a tiny inline script), enabled toggle. No paths/excludes/retention/kind on the schedule itself — those live on source groups now. Server-side validation re-renders with the operator's input + ticked groups intact. Every successful mutation calls pushScheduleSetAsync. Adds .schd-row, .preset-chip, .picker styles.	2026-05-03 11:55:16 +01:00

83 Commits