restic-manager

Author	SHA1	Message	Date
steve	168059ae45	feat(hosts): per-host tags edit + dashboard chip-row filter (P4-07)	2026-05-05 11:16:09 +01:00
steve	2d9e53b025	ui(users): record last_login on /setup + sortable headers	2026-05-05 10:57:25 +01:00
steve	e76a383813	store: DeleteSessionsByUserID for force-logout	2026-05-05 10:57:24 +01:00
steve	93d857d995	store: user_setup_tokens CRUD + cleanup-expired	2026-05-05 10:57:24 +01:00
steve	dafdfcda3f	store: lowercase username, email/disable helpers, last-admin count	2026-05-05 10:57:24 +01:00
steve	c6fbe7c0e0	store: extend User struct with Email, DisabledAt, MustChangePassword	2026-05-05 10:57:24 +01:00
steve	a1d307fafa	store: migration 0018 — user_setup_tokens	2026-05-05 10:57:24 +01:00
steve	9712c65b04	store: migration 0017 — users.email, disabled_at, must_change_password	2026-05-05 10:57:24 +01:00
steve	4f66cc2b34	feat(audit): clickable column headers with asc/desc sort	2026-05-05 08:15:22 +01:00
steve	16c77a8cc5	feat(audit): P3-08 — audit log UI with filters	2026-05-05 07:49:25 +01:00
steve	350be3f19d	feat(alerts): per-source-group dedup so two failing backups produce two alerts Until now the open-alert key was (host_id, kind, resolved_at IS NULL). A host with two source groups both failing collapsed onto one backup_failed row — second failure bumped last_seen_at and overwrote the message but never re-fan-out. Operators saw one alert that appeared to flap, not two distinct broken things. Schema changes (column-level ALTER, no rebuild): - 0015 jobs.source_group_id (FK → source_groups, ON DELETE SET NULL, index). Populated for backup jobs in CreateJob. - 0016 alerts.dedup_key (NOT NULL DEFAULT ''). The old alerts_open partial index gets dropped and replaced with a UNIQUE partial index on (host_id, kind, dedup_key) WHERE resolved_at IS NULL — the index is now the actual dedup primitive. Plumbing: - RaiseOrTouch / AutoResolve / Alert struct gain dedup_key. - engine.JobFinishedEvent gains SourceGroupID; handleJobFinished passes it through for backup_failed only (forget/prune/check stay repo-scoped with key=''). - ws.handler reads SourceGroupID off the freshly-loaded job row. - dispatchJobWithPayload gains a *string sourceGroupID arg; the per-group Run-now path and schedule.fire path pass &g.ID. Test coverage: TestRaiseOrTouchDedupsPerSourceGroup proves two distinct groups produce two distinct open alerts and that resolving one does not auto-resolve the other. Dev tool: cmd/_fake_alert gains -dedup-key flag.	2026-05-04 22:59:48 +01:00
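The dedup rule in 350be3f19d can be sketched in memory. This is only an illustration of keying open alerts on (host_id, kind, dedup_key); `alertBook`, `openKey`, `raiseOrTouch` and `resolve` are hypothetical names, not the store's real API, and a map stands in for the UNIQUE partial index.

```go
package main

import "fmt"

// openKey mirrors the columns of the UNIQUE partial index described in the
// commit message: (host_id, kind, dedup_key) for alerts that are still open.
type openKey struct {
	HostID   string
	Kind     string
	DedupKey string
}

// alertBook is a hypothetical in-memory stand-in for the alerts table.
type alertBook struct {
	open map[openKey]int // value = how many times the open alert was seen
}

func newAlertBook() *alertBook {
	return &alertBook{open: make(map[openKey]int)}
}

// raiseOrTouch returns true when a new open alert was created and false when
// an existing open alert with the same key was only touched.
func (b *alertBook) raiseOrTouch(hostID, kind, dedupKey string) bool {
	k := openKey{HostID: hostID, Kind: kind, DedupKey: dedupKey}
	b.open[k]++
	return b.open[k] == 1
}

// resolve closes exactly one open alert and leaves alerts with other keys alone.
func (b *alertBook) resolve(hostID, kind, dedupKey string) {
	delete(b.open, openKey{HostID: hostID, Kind: kind, DedupKey: dedupKey})
}

func main() {
	b := newAlertBook()
	fmt.Println(b.raiseOrTouch("host-1", "backup_failed", "group-a")) // new alert
	fmt.Println(b.raiseOrTouch("host-1", "backup_failed", "group-b")) // second, distinct alert
	fmt.Println(b.raiseOrTouch("host-1", "backup_failed", "group-a")) // touch only
	b.resolve("host-1", "backup_failed", "group-a")
	fmt.Println(len(b.open)) // group-b is still open
}
```

With an empty dedup key (the repo-scoped forget/prune/check case) the same structure degrades to the old per-host, per-kind behaviour.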
steve	d830635a2e	fix: enabled toggle — list-row click + edit-form save Two bugs in the channel-enabled affordance: 1. List-row toggle was a static span with no handler; the row's row-link overlay swallowed every click and routed to /edit. Add POST /settings/notifications/{id}/toggle backed by a new store method SetNotificationChannelEnabled, and turn the row toggle into an htmx-driven button that swaps in the new state. Use event.stopPropagation() on the toggle so it beats the row link. 2. Edit-form toggle visually flipped but the underlying checkbox reverted: the visual span lives inside the <label>, so clicking it fired the inline JS handler AND the label's native checkbox-toggle, cancelling out. Bind to the checkbox 'change' event instead and let the label do the toggling — the JS just mirrors check.checked into the .on class.	2026-05-04 22:21:45 +01:00
steve	cbdaa4daeb	fix: refresh hosts.open_alert_count on Raise/Resolve/AutoResolve The denormalised projection was never written by the alerts code path, so the dashboard's OPEN ALERTS card and the per-host alerts column always read 0 regardless of how many alerts were open. fleet.GetStats sums hosts.open_alert_count; if it never moves, the card is decoration. Add refreshHostOpenAlertCount that recomputes from the alerts table (self-healing — no +/- bookkeeping to drift). Call it after the commit in RaiseOrTouch when a row was inserted, after Resolve, and after AutoResolve. Caught during the live sweep: a synthetic critical raised the count to 1, but resolving it left the dashboard reading '1 unresolved' indefinitely.	2026-05-04 21:01:17 +01:00
steve	c710743231	alert: wire engine into ws hello + MarkJobFinished + offline sweep - ws.HandlerDeps gains an AlertEngine *alert.Engine field; populated from http.Deps.AlertEngine (nil until G1 constructs the engine) - runAgentLoop calls NotifyHostOnline after MarkHostHello succeeds - dispatchAgentMessage MsgJobFinished case calls NotifyJobFinished, looking up the job Kind via Store.GetJob before notifying - store.MarkHostsOfflineStaleReturnIDs added: SELECT+UPDATE in one transaction, returns the IDs that flipped to offline - offline sweeper in cmd/server/main.go switched to the new variant; TODO(G1) comment marks where NotifyHostOffline calls will land	2026-05-04 19:54:39 +01:00
steve	8a92fedba1	store: notification_channels CRUD + AppendNotificationLog	2026-05-04 19:28:41 +01:00
steve	7c62d111d5	store: alerts CRUD with dedup + last_seen_at bump	2026-05-04 19:24:17 +01:00
steve	b2dffb1d83	store: migration 0014 — notification_channels + notification_log	2026-05-04 19:20:37 +01:00
steve	db71e006bb	store: A1 — check rows.Err() + Scan err in migrate_test Code-quality nits flagged in review of `2692c66`. Mirrors the existing pattern in host_credentials_test.go.	2026-05-04 19:19:28 +01:00
steve	2692c660c5	store: migration 0013 — alerts.last_seen_at	2026-05-04 19:16:59 +01:00
steve	a781e95c94	P3 follow-up: editable target dir, conditional --no-ownership, UK lint Three small follow-ups from review: 1. Restore target is now operator-editable. Default value is the literal '$HOME/rm-restore/<job-id>/' (agent expands $HOME at run time using os.UserHomeDir(); also handles ${HOME} and ~/ prefixes). Operator can replace with any absolute path. - ui_restore.go validates the input is either absolute or starts with one of the recognised prefixes; other env-var refs ($PATH etc.) are deliberately rejected so operator paths can't pick up arbitrary agent env values. - host_restore.html replaces the read-only mono-text display with a real <input>; help text spells out that $HOME resolves agent-side and <job-id> is substituted on dispatch. - install.sh + the systemd unit prep /root/rm-restore so the default works under the sandbox: ReadWritePaths gains a soft '-/root/rm-restore' entry (the '-' makes the bind-mount soft-fail if missing, but install.sh pre-creates it root-owned 0700). 2. --no-ownership flag now gated on restic version. The flag was added in restic 0.17 and 0.16 rejects it. Previously dropped it wholesale — that meant new-dir restores silently preserved ownership against design intent on 0.17+. Now the agent threads its detected restic version (sysinfo already collects it) through runner.Config -> restic.Env, and RunRestore appends --no-ownership only when AtLeastVersion(0, 17) returns true. 0.16 hosts still restore with original uid/gid; help text in the wizard explicitly notes this. The previous 'Original ownership is preserved' copy was wrong for new-dir mode and is corrected. 3. golangci-lint misspell locale switched US -> UK and the codebase swept (73 corrections, mostly behaviour/serialise/recognise/honour). Wire-format ErrorCode 'unauthorized' -> 'unauthorised' is a tiny contract change but the agent doesn't parse those codes today and no external API consumers exist yet. Tests passed before + after. 
Tests: - internal/restic/version_test.go covers Env.AtLeastVersion across edge cases (empty, exact match, patch above, minor below, non-numeric) and expandHome on $HOME / ${HOME} / ~/, plus pass-through for absolute paths and refusal of other env vars. - ui_restore_test updated: TargetDir now starts '$HOME/rm-restore/' with the job_id substituted into the placeholder. Live verified on the smoke env: default target restored to /root/rm-restore/<job-id>/ as the agent's expanded $HOME (2 files, 14 bytes); custom override '/tmp/custom-restore/<job-id>/' restored into the agent's PrivateTmp namespace (1 file, 6 bytes); both jobs 'succeeded', exit 0.	2026-05-04 17:27:52 +01:00
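The version gate in a781e95c94 can be sketched as follows. This is not the repository's `Env.AtLeastVersion`; `atLeast` and `restoreArgs` are hypothetical helpers. Treating an empty or non-numeric version as "too old" is an assumption: the commit lists those as tested edge cases but does not state the outcome.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// atLeast reports whether version (e.g. "0.17.3") is >= major.minor.
// Assumption: empty or unparseable input returns false, so the caller
// simply omits the newer flag.
func atLeast(version string, major, minor int) bool {
	parts := strings.Split(strings.TrimSpace(version), ".")
	if len(parts) < 2 {
		return false
	}
	gotMajor, err := strconv.Atoi(parts[0])
	if err != nil {
		return false
	}
	gotMinor, err := strconv.Atoi(parts[1])
	if err != nil {
		return false
	}
	if gotMajor != major {
		return gotMajor > major
	}
	return gotMinor >= minor
}

// restoreArgs builds a hypothetical restic argument list and appends
// --no-ownership only when the detected restic version supports it.
func restoreArgs(version, snapshotID, target string) []string {
	args := []string{"restore", snapshotID, "--target", target}
	if atLeast(version, 0, 17) {
		args = append(args, "--no-ownership")
	}
	return args
}

func main() {
	fmt.Println(restoreArgs("0.16.4", "abc123", "/tmp/out"))
	fmt.Println(restoreArgs("0.17.1", "abc123", "/tmp/out"))
}
```

Failing closed on an unknown version keeps 0.16 hosts working at the cost of preserving ownership there, which matches the behaviour the commit describes for 0.16.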
steve	4c108bb68a	P3-01/02/03: restore wizard backend + templates + restore-shaped job page End-to-end wizard from /hosts/{id}/restore (or per-snapshot deep link /hosts/{id}/snapshots/{sid}/restore) → tree-browse → dispatch → restore-shaped live job page. Backend (internal/server/http/ui_restore.go): - GET handlers render the four-step wizard against the wireframe shape in docs/superpowers/specs/2026-05-04-p3-restore-design.md. - HTMX tree partial endpoint hits fetchTreeWithCache (P3-X2) so each directory expansion is a sub-second cached lookup after the first miss. - POST validates: snapshot_id non-empty, ≥1 absolute path, in-place mode requires confirm_hostname == host name, agent online. On error re-renders the wizard with the operator's input intact. Happy path mints a job_id, computes the new-directory target as /var/restic-restore/<job-id>/ (operator can't escape the prefix — server picks it), creates the job row, ships command.run with kind=restore + RestorePayload, writes a host.restore audit row, returns HX-Redirect (or 303) to the live job page. Templates: - host_restore.html: single-page progressively-enabled wizard matching _diag/p3-restore-wizard wireframe. Form-state-driven JS computes a running tally of selected paths and the step-4 confirm summary client-side; the server re-renders on validation failure with form fields preserved. - partials/tree_node.html: recursive HTMX-served tree fragment. - Top-level Restore button on host_detail right rail + per-snapshot Restore action on snapshot rows replace the previous P3-stub. Restore-shaped job page (job_detail.html): - Progress widget rendered as a panel rather than a bare strip when the job is active. - Current-file display under the bar, updated from log.stream stdout lines that look like absolute paths. Hidden for non-restore kinds. Migration 0012: - Add restore + diff to the jobs.kind CHECK. Rebuild required (SQLite can't ALTER CHECK in place); follows the safe pattern from 0005. 
Defensive: stash job_logs into a temp table before the rebuild and INSERT OR IGNORE back afterwards so even if SQLite cascades on DROP TABLE jobs the log history survives. Tests: - ui_restore_test covers GET step-1 render, GET pre-selected snapshot summary card, POST missing snapshot, POST missing paths, POST in-place wrong-hostname rejection (no command.run leaks to the agent), POST happy path (HX-Redirect + correct payload + audit row), POST against offline host returns 503. Restage block (CLAUDE.md) deferred to the end of the restore phase.	2026-05-04 15:34:29 +01:00
steve	cd80be3b13	store+server: P2-18a announce-and-approve schema + endpoint migration 0011 adds pending_hosts table (id, hostname, public_key, fingerprint, expiry). store/pending_hosts.go covers full CRUD plus hostname-collision count + expired-row sweeper. POST /api/agents/announce takes {hostname, os, arch, agent_version, restic_version, public_key (base64)}, returns {pending_id, fingerprint, hostname_collision}. Per-source-IP token-bucket rate limit (10/min) + global cap of 100 in-flight rows. Public key must be exactly 32 bytes (Ed25519).	2026-05-04 11:03:41 +01:00
steve	7b1990cf11	agent+server: P2R-11 pre/post hook execution for backup jobs Agent: new runner.BackupHooks struct + runHook helper invoked via /bin/sh -c (cmd.exe /C on Windows). pre_hook non-zero exit aborts the backup; post_hook always runs with RM_JOB_STATUS=succeeded|failed in env. Output streamed as 'hook(<phase>): …' log.stream lines. Hooks only run for kind=backup (other kinds skip both phases). Server: resolveBackupHooks resolves group → host default → empty, decrypts via crypto.AEAD with per-slot ad bytes, plumbs plaintext into CommandRunPayload for both schedule.fire and per-group Run-now dispatch sites. Decrypt failures degrade silently to no hook so a malformed blob can't poison every backup.	2026-05-04 10:57:28 +01:00
steve	18b0bf976d	store: P2R-10 schema for source-group + host-default hooks (migration 0010) Adds pre_hook/post_hook BLOB columns to source_groups and pre_hook_default/post_hook_default to hosts. Bytes stored verbatim (AEAD encrypt/decrypt happens at the HTTP layer where the AEAD key lives). Round-trip tests cover set/clear semantics on both tables.	2026-05-04 10:52:16 +01:00
steve	d02a093eeb	ui+server: schedule next-run / last-run on dashboard + schedules tab P2R-14. New store.LatestJobBySchedule query (per-schedule fired job). Schedules-tab handler computes next-fire from cron + last-fire from the jobs table per row. Schedules table grows two columns; dashboard host row prepends 'next 12h ago/from now' to the existing last-backup line when a single covering schedule is the run-now candidate. Embeds store.Schedule into scheduleRow so existing template field references keep working without bulk renames.	2026-05-04 10:44:31 +01:00
steve	9ec69456fe	store: LatestJobByKind includes in-flight jobs (avoid maintenance double-fire) Widen the SQL query to consider all statuses (queued, running, succeeded, failed, cancelled) rather than terminal-only. An in-flight prune that outlasts the 60s tick interval previously produced ErrNotFound, causing the ticker to anchor at now-24h and fire a second prune concurrently with the first. Update the doc comment and test: remove the "queued job filtered out" case, add assertions that a running job and a queued job are each returned as the latest.	2026-05-04 10:19:15 +01:00
steve	18a4f74a22	server: enqueue pending_runs when scheduled-job dispatch fails When dispatchBackupForGroup's conn.Send errors, queue a pending_runs row (attempt=1, next_attempt_at = now + group.RetryBackoffSeconds) instead of silently dropping the fire. The orphaned queued job row is left behind for forensic visibility — the drainer will create a fresh job row on its retry. Also adds Store.ListPendingRunsForHost — the on-reconnect drain walks every row for the host, regardless of due-ness, since the host being back makes 'due' irrelevant.	2026-05-04 10:19:15 +01:00
steve	ae96983877	maintenance: pure-logic ticker decides forget/prune/check fires	2026-05-04 10:19:15 +01:00
steve	11cbc2fb7f	store: tighten CHECK constraint on host_repo_stats.last_check_status	2026-05-04 10:19:15 +01:00
steve	5200e44536	store: wrap UpsertHostRepoStats in a transaction (concurrency safety)	2026-05-04 10:19:15 +01:00
steve	84a8c060b6	store: assert CHECK constraint on host_credentials.kind	2026-05-04 10:19:15 +01:00
steve	cfe25b9799	store: HostRepoStats projection (size, lock, last-check, last-prune)	2026-05-04 10:19:15 +01:00
steve	f801fdf65b	store: host_credentials becomes kind-aware (repo + admin slots)	2026-05-04 10:19:15 +01:00
steve	9f2cb18e42	store: migration 0009 — admin-creds kind + host_repo_stats	2026-05-04 10:19:15 +01:00
steve	b6f8de1dcc	lint: drive baseline to zero, drop only-new-issues gate Cleanup pass over the repo so CI can enforce lint going forward without the only-new-issues escape hatch: * gofumpt -w across the tree (31 hits, all formatting) * misspell --fix (25 hits, US-locale spelling) — but reverted on api.JobCancelled = "cancelled" since that literal is the wire + DB CHECK constraint value, plus matched the case in store/fleet.go back to "cancelled" and added //nolint:misspell on both for the next time someone reaches for the auto-fix * Wrap every `defer rows.Close()` / `defer stmt.Close()` / `defer res.Body.Close()` in `defer func() { _ = .Close() }()` to satisfy errcheck without losing the close itself * websocket.Dial callers (1 prod, 4 tests) now capture + close the upgrade response Body — coder/websocket can return res with a nil Body on success, so the test deferred-closes guard against that * Annotate the two genuine-by-design nilerr cases with //nolint comments explaining why nil-on-error is the contract (cookie missing = no session; ctx cancelled mid-backoff = clean shutdown) * Add brief godoc on the 10 exported const groups + types that revive flagged (api.HostOS/HostArch/JobKind/JobStatus/LogStream/ErrorCode, restic.EventKind, store.Role, web.FS) * Drop the unused (Server).userByID method * Inline the unparam baseView(active) — every UI page is under the dashboard primary nav today Result: `golangci-lint run ./...` reports 0 issues. CI lint job no longer needs only-new-issues: true; X-06 follow-up entry in tasks.md removed.	2026-05-03 16:15:17 +01:00
steve	ec0bf0f6c3	P2R-01: REST + WS rewire against the slim shape Schedules CRUD now takes {cron, enabled, source_group_ids[]} with cron parsed via robfig/cron/v3 and group membership scoped to the host. New source-groups CRUD lives at /api/hosts/{id}/source-groups; delete refuses with 409 if any schedule still references the group, returning the schedule list so the UI can prompt 'remove from these schedules first.' Repo-maintenance GET/PUT manages forget/prune/check cadences on host_repo_maintenance — no version bump, the server-side ticker (P2R-06) drives execution. Per-source-group Run-now (POST /hosts/{id}/source-groups/{gid}/run) resolves the group's includes/excludes/retention/tag and dispatches a backup command.run with the new structured CommandRunPayload fields (Includes/Excludes/Tag). Old per-host /hosts/{id}/run-backup and /hosts/{id}/init-repo return 410 Gone with a redirect message. schedule_push.go is rebuilt: buildScheduleSetPayload assembles the slim wire shape, pushScheduleSetOnConn ships it during the on-hello window, pushScheduleSetAsync fires after every CRUD mutation, and dispatchScheduledJob handles agent schedule.fire by iterating the schedule's source groups and dispatching one backup per group with actor_kind=schedule and scheduled_id pointing at the schedule. Auto-init at first WS connect: when the host has repo creds bound and no init job in its history, server dispatches restic init. Restic's 'config file already exists' soft-success means re-runs against an existing repo no-op; we don't auto-retry on failure (operator triggers re-init manually via the danger zone in P2R-09). api.Schedule drops Kind/Paths/Excludes/Tags/RetentionPolicy/Manual etc. in favour of {id, cron, enabled, source_groups: [...]}. The agent scheduler stops checking sch.Manual; cmd/agent's backup dispatch reads Includes/Excludes/Tag instead of Args. 
Tests cover the new HTTP surface end-to-end: source-groups CRUD with in-use refusal, schedule validation (bad cron / missing groups / foreign group), repo-maintenance auto-seed and validation, the 410 route, and buildScheduleSetPayload's wire-shape correctness. Full suite passes; smoke env exercises auto-init dispatch on hello, async push after schedule create, and per-source-group Run-now landing the right paths/excludes/tag at the agent.	2026-05-03 10:56:40 +01:00
steve	e7eea7afac	P2 redesign · phase 2: store rewrite — sources, slim schedules, repo maintenance Go-side data model rebuilt against migration 0008. The fat-Schedule shape (paths/excludes/tags/retention/manual/kind/options/hooks) is gone; that surface lives on source_groups now. * store/types.go - Schedule slimmed to {id, host_id, cron, enabled, source_group_ids, timestamps}. SourceGroupIDs populated by Get/List, accepted on Create/Update so callers pass desired junction state in one shape. - SourceGroup added: name (= snapshot tag), includes/excludes, retention_policy, retry_max + retry_backoff_seconds, cached conflict_dimension. - HostRepoMaintenance added: forget/prune/check cadences + enabled. - PendingRun added: offline-retry queue. - Host loses RepoInitialisedAt; gains BandwidthUpKBps + BandwidthDownKBps. - RetentionPolicy moves home from "schedule field" to "source group field" but the type itself + Summary() method unchanged. * store/sources.go (new) — CRUD + GetByName + ConflictDimension cache. Group writes bump host_schedule_version; conflict cache writes don't (server-internal projection, agent doesn't see it). * store/maintenance.go (new) — CreateDefault is idempotent (INSERT OR IGNORE). UpdateRepoMaintenance doesn't bump schedule version because these run on the server's own ticker, not the agent's local cron. * store/pending.go (new) — Enqueue / DueRunsForRetry / Bump / Delete. * store/schedules.go — rewritten for slim shape + junction CRUD. Update wipes the schedule_source_groups junction wholesale and re-inserts (simpler than diffing). Adds SchedulesUsingGroup for retention-conflict detection + UI labels. * store/hosts.go — drops repo_initialised_at scan, adds bandwidth scan. New SetHostBandwidth helper. * HTTP layer — temporarily stubbed during this rewrite (501 returns with redesign_in_progress error code). 
Phase 3 fills these in against the new shape: - schedules.go REST CRUD - schedule_push.go agent reconciliation - ui_schedules.go HTML form CRUD Run-now-per-host + Init-repo handlers in ui_handlers.go also stubbed — both go away in the new model (Run-now per source group; auto-init at host enrolment). * enrollment.go — replaces "seed manual schedule from typed paths" with "seed default source group + repo-maintenance row." The default group gets the typed paths as its includes; operator edits later via Sources tab. * ws/handler.go — drops the MarkHostRepoInitialised projection (column is gone; auto-init makes it derivable from latest init job's status). Tests: * store: existing schedule test rewritten for slim shape + junction; new sources_test.go covers source-group CRUD, name uniqueness, conflict cache, repo-maintenance defaults + idempotent seed, pending-runs queue lifecycle. * http: schedules_test.go and schedule_push_test.go deleted — both exercised the obsolete fat-schedule API. Phase 3 rewrites them against the new endpoints. go test ./... green. cmd/server + cmd/agent build. The UI is broken end-to-end (schedules / sources / repo tabs all hit 501 stubs); Phase 3 restores REST + on-the-wire reconciliation; Phase 4 rewires the UI templates against the new model. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 21:30:41 +01:00
steve	49ecb7c771	P2 redesign · phase 1: migration 0008 — sources + repo maintenance Schema rebuild for the model collapse described in design/v4-sources-redesign.html. Three nouns now stand on their own: * schedules — slim. Only cron + enabled + host_id. Fat-schedule shape (paths/excludes/tags/retention/manual/kind/options/hooks) is dropped wholesale. Schedule data wiped — by design (smoke env was nuked before this ran; fresh installs have nothing to lose). * source_groups — name + includes + excludes + retention_policy + retry policy + cached conflict_dimension. Group name doubles as the snapshot tag so retention can target it cleanly. UNIQUE (host_id, name) enforces tag unambiguity. * schedule_source_groups — N:M junction. One schedule can fire N groups per tick; one group can be referenced by N schedules. * host_repo_maintenance — 1:1 with hosts. Default cadences: forget daily 03:00, prune weekly Sun 04:00, check monthly 1st 05:00 with --read-data-subset 5%. Operator can edit on Repo tab. * pending_runs — offline-retry queue. Server-side ticker dispatches due rows; bounded by source_groups.retry_max + retry_backoff_seconds. Plus: * hosts.bandwidth_up_kbps / .bandwidth_down_kbps — host-wide caps. * hosts.repo_initialised_at — DROPPED. Auto-init on enrol makes it derivable from the latest init job; the Init-repo button goes too (failure surfaces via job history banner). Note on FK safety: smoke env was wiped before migration ran, so DROP TABLE schedules cascades to nothing. Fresh installs apply 0001-0007 then immediately 0008 — same story (no schedule rows to lose). For an upgrade path on a populated DB, this migration would need a data-preserving variant; not needed today. Tests fail to compile/run after this — expected. The Go side (store types, CRUD, REST handlers, agent runner, UI templates) gets rebuilt in subsequent phases. tasks.md will track P2 redesign progress. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 20:54:01 +01:00
steve	c1f85da55f	Add-host: durable pending page + polled awaiting-agent panel Two issues from a smoke session: 1. The awaiting-agent panel never refreshed — operator had to go back to the dashboard to see the host had connected. 2. Generated passwords were displayed only on the POST response. Navigating away (or even an accidental tab close) lost them permanently, so the operator couldn't update the rest-server's htpasswd. Both are the same fix: convert the POST-rendered transient "result state" into a durable GET page at /hosts/pending/{token}. * New route GET /hosts/pending/{token} renders the install-command + htpasswd snippet view. Password is decrypted from the (still-encrypted-at-rest) token row on every render — operator can refresh, bookmark, navigate away and come back. Once the agent enrols, the page redirects to /hosts/{id}; once the token expires, redirect to /hosts/new. * New route GET /hosts/pending/{token}/awaiting returns a polled HTML fragment that the pending page swaps in every 2s via HTMX. States: awaiting (keep polling) | connected (show "Open host →" + "View schedules" CTAs, polling stops) | expired (mint-new link, polling stops). Polling stops naturally because only the awaiting state's wrapper carries the hx-trigger attribute. * POST /hosts/new now 303-redirects to /hosts/pending/{token} on success; validation errors keep re-rendering the form with banner. Supporting changes: * New store helper Store.GetEnrollmentTokenStatus(tokenHash) for the polling endpoint — returns {expires_at, consumed_at, consumed_host} in one round-trip without dragging in the attachments-decryption path. * New ui.Renderer.RenderPartial(w, name, data) for HTMX fragment responses (no layout wrap). Picks an arbitrary page's template set as the lookup point — every page parses the full common-paths list, so they all see every partial. * add_host.html stripped to form-only; pending_host.html owns the result-state UI; awaiting_agent.html is the polled partial. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 12:59:24 +01:00
steve	8fb1c100fd	P2-04.5: kill host.default_paths in favour of manual schedules Two independent path lists for "what does this host back up?" was a real divergence footgun — operator types one set at Add-host time and a different set into a schedule, both end up in the same repo, the snapshot history looks fine until restore. Resolution: drop host.default_paths entirely; add a `manual` flag on schedules. A manual schedule has paths/excludes/tags/retention like any other but no cron — it fires only via per-schedule Run-now. Single source of truth for what gets backed up. Schema (migration 0007): * schedules.manual INTEGER NOT NULL DEFAULT 0. * For every host with non-empty default_paths, seed a manual schedule with those paths and bump host_schedule_version. * ALTER TABLE hosts DROP COLUMN default_paths. * ALTER TABLE enrollment_tokens RENAME COLUMN default_paths TO initial_paths. Original draft of this migration rebuilt hosts via the create-new + drop-old + rename-new pattern. With foreign_keys=ON (set in the connection DSN), DROP TABLE on the parent fired ON DELETE CASCADE on every child of hosts(id) — schedules / jobs / snapshots / host_credentials all wiped on the smoke env when I tried it. SQLite 3.35+ supports column-level ALTERs directly, so we skip the rebuild dance and avoid the cascade trap. Six lines of SQL instead of sixty, no FK risk. Run-now rewiring: * New `dispatchScheduleNow(hostID, scheduleID, conn?)` helper unifies the agent-driven path (cron fire → schedule.fire → OnScheduleFire callback) and the UI-driven path (operator clicks Run-now on a schedule row). Conn arg is optional; nil falls back to Hub.Send. * New POST /hosts/{id}/schedules/{sid}/run endpoint — per-row Run-now button on the schedules list. * Dashboard's per-host Run-now (handleUIRunBackup) now picks the host's only enabled manual schedule, falls back to the only enabled schedule, else returns "pick one in Schedules tab". Keeps one-click for the common case. 
Agent: * Scheduler skips manual schedules in cron build (silent — they're a normal data shape, not an error). * Wire Schedule struct gains Manual flag. * Schedule.fire flow unchanged — the agent only ever fires non-manual schedules anyway. UI: * Add-host form retitled "Initial schedule · manual" so the operator knows the paths become an editable schedule under the Schedules tab. Result page calls out the manual schedule + points at Host > Schedules. * Schedule edit form: "Manual schedule" checkbox at the top of the When section; toggling it hides/shows the cron field via inline JS. Server-side validator skips the cron requirement when manual=true. * Schedule list shows a "manual" tag under the status pill and renders the When column as "— run-now only —" for manual rows. Each row gets a Run-now button when the schedule is enabled and the host is online. Tests + go test ./... green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 12:26:06 +01:00
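The dashboard Run-now fallback described in 8fb1c100fd can be sketched as a small selection function. `schedule`, `pickRunNow` and `errPickOne` are hypothetical names; only the selection order (only enabled manual schedule, then only enabled schedule, then an error) comes from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// schedule carries just the fields the selection rule needs.
type schedule struct {
	ID      string
	Enabled bool
	Manual  bool
}

// errPickOne reuses the operator-facing wording quoted in the commit message.
var errPickOne = errors.New("pick one in Schedules tab")

// pickRunNow returns the host's only enabled manual schedule, falls back to
// the only enabled schedule, and otherwise asks the operator to choose.
func pickRunNow(all []schedule) (schedule, error) {
	var manual, enabled []schedule
	for _, s := range all {
		if !s.Enabled {
			continue
		}
		enabled = append(enabled, s)
		if s.Manual {
			manual = append(manual, s)
		}
	}
	if len(manual) == 1 {
		return manual[0], nil
	}
	if len(enabled) == 1 {
		return enabled[0], nil
	}
	return schedule{}, errPickOne
}

func main() {
	got, err := pickRunNow([]schedule{
		{ID: "nightly", Enabled: true},
		{ID: "adhoc", Enabled: true, Manual: true},
	})
	fmt.Println(got.ID, err)

	_, err = pickRunNow([]schedule{
		{ID: "nightly", Enabled: true},
		{ID: "weekly", Enabled: true},
	})
	fmt.Println(err)
}
```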
steve	c6237d4004	P2-04: schedule editor UI Closes the schedule foundations slice — operator can now drive the plumbing P2-01..03 landed without touching the JSON API. * New routes: - GET /hosts/{id}/schedules (list) - GET /hosts/{id}/schedules/new (create form) - POST /hosts/{id}/schedules/new (create) - GET /hosts/{id}/schedules/{sid}/edit (edit form) - POST /hosts/{id}/schedules/{sid}/edit (update) - POST /hosts/{id}/schedules/{sid}/delete (delete, confirm-then-redirect) * List view (web/templates/pages/schedules_list.html): status, cron, paths, retention summary, tags, edit/delete buttons. Header shows "version N · agent in sync" or "agent at vM" when the push hasn't been ack'd yet — backed by host_schedule_version + applied_schedule_version. Empty-state CTA points at /schedules/new. * Create/edit form (web/templates/pages/schedule_edit.html, shared): cron expression with five quick-pick presets (daily 3am / every 6h / @hourly / weekly Sun / monthly 1st), paths textarea (one per line), excludes textarea, tags (comma-separated), retention as six numeric fields (mirrors restic's --keep-* flags one-for-one), bandwidth caps, enabled toggle. Side panel explains the reconciliation flow so the operator knows what saving actually does. Validation errors re-render with operator's input intact. * internal/server/http/ui_schedules.go owns the handlers; reuses the same validateSchedule + pushScheduleSetAsync used by the JSON API path. Each save audit-logs schedule.created / schedule.updated / schedule.deleted (matching the JSON API actions). * store.RetentionPolicy gains a Summary() method ("last=7, d=14, w=4" or "—"). Used by the list view's table cell so templates don't have to do any conditional retention rendering. * Two new template helpers: list (string varargs → []string, used for the cron preset row) and joinComma (sibling to joinDot for the rare list that wants commas). RetentionPolicy.Summary covers the schedule-list case but the helpers are general. 
* host_detail.html secondary tabs row converted from inert <div>s into <a> links. Snapshots active by default; Schedules now points at the new page. Jobs/Repo/Settings remain inert until their P2 owners ship. Hooks UI deferred to P2-15 (lands with the hook execution path). Single-kind UI (backup only) by design — other kinds get a UI when their job dispatch lands in P2-05..08. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:44:40 +01:00
steve	608962441b	P2-02 (agent side) + P2-03: agent scheduler + schedule.fire dispatch Closes the schedule reconciliation loop end-to-end. * New `internal/agent/scheduler` package wraps robfig/cron/v3 with the lifecycle the agent needs: - Apply(ScheduleSetPayload, Sender) stops the prior cron (waiting for in-flight entries to return), rebuilds from scratch, starts, and emits schedule.ack with the version we just applied. - Disabled entries skipped silently; bad cron exprs (which shouldn't reach us — the server validates — but defensive) log a warn and skip. - On each cron tick the entry sends a new schedule.fire envelope to the server with {schedule_id, scheduled_at}. The scheduler itself never builds CommandRunPayloads — server is the source of truth for jobs. - tx is swapped on every Apply, so reconnect is handled naturally: cron entries that fire against a dropped tx log "no active connection" and skip the tick. - Stop() is idempotent and waits for the cron's in-flight workers via cron.Stop().Done(). * New wire message api.MsgScheduleFire + api.ScheduleFirePayload for the agent → server "I just fired locally" RPC. * Server-side dispatch (schedule_push.go: dispatchScheduledJob): looks up the schedule by id, validates ownership + that it's enabled, builds args from kind (paths for backup; other kinds are still arg-less in Phase 2 and grow as those job kinds land in P2-05..08), persists a jobs row with actor_kind=schedule + scheduled_id, and writes command.run back on the same conn so the agent runs through its existing dispatch path. * store.CreateJob now writes scheduled_id. This column was in the schema since 0001 but never populated — the original P1 path only had operator-driven jobs, so actor_kind was always 'user' and scheduled_id was always nil. * cmd/agent/main.go integration: dispatcher gains a scheduler.Scheduler; the MsgScheduleSet case now hands the payload to scheduler.Apply (in a goroutine so the WS read loop keeps draining other messages). 
WS dispatcher gains OnScheduleFire alongside OnScheduleAck. * Tests: - scheduler unit tests (4): ack-on-apply, cron tick fires schedule.fire envelope, disabled entries don't fire, replace-prior-state stops the old cron. - Server-side end-to-end: schedule.fire → command.run with the right job_id / kind / args, plus jobs row with actor_kind="schedule" and scheduled_id linking back to the schedule. Persistence of next-fire times across agent restarts is deliberately deferred. A missed fire window during downtime simply fires once on reconnect — that's the desirable behaviour (the operator wants the missed backup to run, not be silently skipped because we lost track of when it was due). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:29:12 +01:00
steve	aa9fc330fc	P2-01: schedule schema + CRUD API The `schedules` table was already laid down in migration 0001; this slice adds the Go-side data model, store CRUD with atomic version bumps, and REST endpoints. * `store.Schedule` + `RetentionPolicy` + `ScheduleOptions` typed views (the wire form on the agent side keeps retention/options as raw JSON since the agent just forwards them to restic). * Store CRUD: CreateSchedule / GetSchedule / ListSchedulesByHost / UpdateSchedule / DeleteSchedule. Each mutation bumps `host_schedule_version` atomically in the same tx via UPSERT on `host_schedule_version`. SetHostAppliedScheduleVersion records what the agent has confirmed via schedule.ack (P2-02 will use it). * REST endpoints under /api/hosts/{id}/schedules + /{sid}: GET (list, with the version envelope so callers can detect drift), POST (create), PUT (update — kind is immutable), DELETE. * Validation: cron expressions parse via robfig/cron/v3 (same parser the agent will use, so anything that validates here will fire there); kind ∈ {backup, forget, prune, check} (init/unlock are operator-only one-shot kinds, not schedulable); backup schedules require ≥1 path; hooks rejected on non-backup kinds (spec §14.3). * All mutations audit-logged. * Tests: store-level CRUD + version-bump invariants; REST happy path (create→list→update→delete with version progression); REST validation table covers each rejection code. newTestServerWithHub now sets BootstrapToken so the schedules handler tests can use the existing login flow without a parallel test-server constructor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:12:58 +01:00
steve	c8ead66f08	P1 polish: agent-as-root, init-repo flow, rest creds passthrough, UX fixes Cohesive batch from a smoke-test session against a real rest-server. Themed bullets: * Agent runs as root, sandboxed via systemd. CapabilityBoundingSet drops to CAP_DAC_READ_SEARCH + restore caps; ProtectSystem=strict with ReadWritePaths confined to /etc + /var/lib/restic-manager; NoNewPrivileges blocks escalation. Install script no longer creates a service user. spec.md §4.2 / §14.1 / §14.3 explain the rationale (matches UrBackup / Veeam / Bareos defaults; trying to back up "everything" as an unprivileged user creates silent skips on /home, /root, /var/lib/* with no upside vs the threat model the agent already implies). * Init-repo end-to-end. New JobKind="init" wired through agent runner, restic.Env.RunInit, server dispatcher, and a UI button (red "Initialise repo" in the run-now panel). hosts.repo_initialised_at flips on init success, on backup success, or on a non-empty snapshots.report. The "Run now" / "Init" / "Retry" branching now drives both the dashboard host row and the host-detail panel. Migrations 0004 (column), 0005 (jobs.kind CHECK widened — using the safe create-new-then-rename pattern; first version corrupted job_logs.job_id FK), 0006 (cleans up job_logs FK on already-affected DBs). * rest-server creds embedded at exec time only. restic.Env gains RepoUsername; mergeRestCreds() builds the user:pass@-prefixed URL inside envSlice() and never assigns it back to the struct, so nothing slog-able ever sees the cleartext form. RedactURL helper for any future surface that needs to log a URL safely. Both helpers tested. * Add-host UX. Repo password is now optional — server mints a 24-byte URL-safe random one and surfaces it once, alongside an htpasswd snippet ("echo PASS | htpasswd -B -i ... USERNAME") so the operator pastes one command on the rest-server host and one on the endpoint. 
Result page also links the install snippet at /install/install.sh (was /install.sh — 404'd before) and pipes to bash (not sh — script uses set -o pipefail and other bashisms; on Debian/Ubuntu sh is dash). * Late-subscriber race in JobHub. A fast-failing job could finish (DB write + Broadcast) before the browser's HX-Redirect → page load → WS-connect path completed, so the JS sat forever waiting on a job.finished that already passed. JobHub split into Register + Send + Run; handleJobStream now subscribes first, re-fetches the job, and sends a synthetic job.finished if the state is already terminal. * HTMX error visibility. New toast partial listens to htmx:responseError and surfaces the response body as a bottom-right toast — every server-side validation error now becomes visible without per-handler JS wiring. Also handles custom rm:toast events for future server-pushed notifications via the HX-Trigger header. Themed via existing CSS vars. * Dashboard rows are now whole-row clickable to host detail (CSS card-link pattern: absolute-positioned anchor + .row-action z-index restoration so the action button stays clickable). "View →" on a running job links to /jobs/<id> rather than /hosts/<id> since the row click already covers the host page. * "Run first" / "Run first backup" → "Run now" everywhere for consistency. * runbook (docs/e2e-smoke.md) updated — live-log streaming step now reflects P1-26; mentions the browser-driven Run-now flow. * _diag/dump-creds — moved out of cmd/ so go build doesn't pick it up; .gitignore now excludes /_diag/ entirely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:02:12 +01:00
steve	8aa635f0c1	P1 polish: Host.default_paths interim + restic env hygiene + job_id JS quoting Two fixes that close the loop on dashboard run-now and harden the agent's restic invocation. Default paths (interim until P2-01 schedules): - 0003 migration adds default_paths TEXT NOT NULL DEFAULT '[]' to hosts and to enrollment_tokens. - Operator types paths in the Add-host form (textarea, one per line). They ride on the enrol_token row alongside the encrypted creds (paths aren't secret — plain JSON column). - On consume, ConsumeEnrollmentToken still just burns the token; the new GetEnrollmentTokenAttachments returns both the re-bindable creds and the path list in one round trip, the handler transfers them onto the new host row inside CreateHost. - The dashboard's Run-now and host-detail's "Run backup now" button now read Host.DefaultPaths and pass them to dispatchJob. A host with no default paths returns 400 with a friendly "no paths set" message instead of dispatching a doomed `restic backup` with no positional args. - Doc comments explicitly call this out as a Phase 1 interim — schedules supersede. Restic env hygiene: - envSlice() previously omitted HOME / XDG_CACHE_HOME, which bit the smoke runs whenever the agent was launched outside systemd (restic refused to start: "neither $XDG_CACHE_HOME nor $HOME are defined"). Now both are set explicitly: prefer Env.ExtraEnv overrides, fall back to the agent process's own HOME, and finally to /var/lib/restic-manager. - Comment makes the env policy explicit: parent's RESTIC_* / AWS_* / B2_* env is filtered out by design — control-plane is the unambiguous source of truth. JS bug fix in the live log page: - {{$job.ID | printf "%q"}} produced a literal-quoted JS string, which then went into the WS URL as ".../jobs/"<ID>"/stream" → 404. Switched to '{{$job.ID}}' inside the literal so html/template's auto-escape does the right thing. 
Verified end-to-end: dashboard "Run now" → live progress + log lines arrive over the WS → succeeded pill renders. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 22:35:33 +01:00
steve	e6729a5a3d	P1-26: live job log viewer + WS browser fan-out hub Closes the P1-21 remainder. internal/server/ws/jobhub.go — new JobHub. Per-job_id set of subscribers; each gets a 64-deep buffered channel with a writer goroutine. Broadcast is non-blocking: if a subscriber is slow, its channel fills and messages are dropped for that subscriber only — the agent's read loop is never blocked by a stuck browser. The agent dispatchAgentMessage path mirrors job.started / job.progress / log.stream / job.finished envelopes onto the hub in addition to its existing persistence work. The wire shape is the same end-to-end, so client-side JS switches on env.type the same way Go code does. GET /api/jobs/{id}/stream is the browser endpoint. Auth via session cookie (HTTP layer); upgrade; subscribe; pump until context closes. GET /jobs/{id} renders the live log page. Four states (queued/running/succeeded/failed) drive the header pill, the progress bar block, the failure summary panel, and the action button (Cancel job while running, Back to host afterwards). Already-persisted log lines are server-rendered on initial load; new lines arrive over the WS and append to #log-stream. Auto-scrolls unless the user scrolls up (a "⇢ Follow" pill re-attaches). On job.finished the page reloads after 600ms to pick up the final-state header rendered server-side. POST /hosts/{id}/run-backup now sets HX-Redirect → /jobs/{job_id} on success so HTMX lands the operator straight on the live log. For non-HTMX callers (curl / plain form post) it 303s to the same target. store.ListJobLogs returns persisted log lines for initial render on page load. Browser-verified end-to-end: enrol → run a real backup against a sibling restic/rest-server → live progress + 11 log lines stream in → succeeded pill + final stats land after page reload. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 21:45:56 +01:00
steve	86f7c17d9d	P1-24: live dashboard — fleet summary tiles + host table Server-rendered HTML view backed by: - new store.FleetSummary aggregating host counts + repo bytes + snapshot total + open alerts + last-24h job rollup in two queries. - GET /api/hosts (JSON list of hosts in the dashboard projection). - GET /api/fleet/summary (JSON aggregate, same shape as above). The HTML page (web/templates/pages/dashboard.html) renders the four summary tiles + host table directly from store data — no separate fetch. Per-row state colour comes from .host-row.{degraded,failed,offline} which paint a 3px left edge so problem hosts are scannable without reading. HTMX is loaded into the base layout so per-row "Run now" buttons can hx-post to /hosts/{id}/run-backup, a thin HTML wrapper that funnels into a new dispatchJob helper shared with the JSON /api/hosts/{id}/jobs endpoint. Empty state (zero hosts) collapses to the "no hosts yet" prompt with the + Add host CTA — matches the v1 mockup. Template helpers (internal/server/ui/funcs.go) added for byte formatting (412 GB / 3.7 TB), relative time (3m ago / 2d ago), and comma grouping (1,847). Pure Go, no template-magic dependency. Browser-verified end-to-end with seeded fixture data: five hosts across all four states render with correct dots, accents, last-backup pills, sizes, snapshot counts, alerts, tags, and the right action button (Run now / Retry / Run first / View → / offline). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 19:29:11 +01:00
steve	44feb708bc	fix: enrollment FK race + log-when-rejected; runbook fixes from dry-run The smoke runbook caught a real bug: ConsumeEnrollmentToken was inserting into host_credentials (FK -> hosts) inside the same tx as the token burn, but the host row didn't exist yet — CreateHost runs in the next statement. The agent saw a generic 401 with no clue why. Fix: drop the host_credentials insert from ConsumeEnrollmentToken; the HTTP handler now does Consume -> CreateHost -> SetHostCredentials. SetHostCredentials failure is logged loudly but doesn't fail the enrol — operator recovers via PUT /api/hosts/{id}/repo-credentials. Adds slog.Warn lines on both 401 paths in handleAgentEnroll so the underlying cause is visible in server logs (the wire response stays generic to avoid leaking which step failed). Test: TestEnrollmentTransfersRepoCreds rewritten to mirror the new order (consume -> create host -> SetHostCredentials). Runbook (docs/e2e-smoke.md): rest-server moved off 8000 (commonly in use); URLs use trailing slash on the rest path; clarified that secrets_key is minted on first agent start, not at enrol time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 14:01:59 +01:00
steve	b3b89045f2	P1-32: server-side encrypted repo creds + push-on-hello Operator-minted enrollment tokens now carry the repo URL/username/password as one AEAD blob bound (via additional-data) to the token hash. ConsumeEnrollmentToken re-encrypts under host_id and writes a host_credentials row in the same tx as token-burn, so the binding moves with the credential. PUT /api/hosts/{id}/repo-credentials lets an operator edit creds post-enrollment; merges with the existing blob, audits, and pushes config.update if the agent is connected. WS handler grows an OnHello hook that the HTTP layer wires to send the host's decrypted creds as a config.update immediately after the hello succeeds — synchronously, so a racing command.run lands after the agent has its repo password. Schema: 0002_host_credentials.sql adds enc_repo_creds to enrollment_tokens and a host_credentials table (PK = host_id, FK ON DELETE CASCADE). Tests: round-trip token → consume → host_credentials with AAD swap detection; no-creds path stays compatible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 12:38:35 +01:00
steve	8d5282a180	P1-22: snapshot listing via restic snapshots --json Agent calls restic snapshots --json after each successful backup (60s timeout, separate from the backup ctx) and ships the projection over the existing snapshots.report WS envelope. Failure here is logged but doesn't fail the job — the next successful backup catches the projection up. Server-side ReplaceHostSnapshots is delete-then-insert plus a hosts.snapshot_count update in one transaction so the dashboard's per-host count stays consistent with the projection. New read endpoint GET /api/hosts/{id}/snapshots returns the cached list with a refreshed_at marker so the UI can show staleness when an agent has been offline. Schema: dropped the unused snapshots.repo_id FK (repos as a first-class entity is P2 work), added short_id and refreshed_at columns, switched the time index to DESC for the most-recent-first list query. api.Snapshot gains short_id; size_bytes/file_count come from the embedded summary block on restic 0.16+ and stay zero on older clients. Tests cover round-trip, authoritative replacement after forget+prune shrinkage, and empty-after-wipe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 11:20:57 +01:00
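
Note on e6729a5a3d (P1-26): the commit describes a per-job fan-out in which each subscriber gets a 64-deep buffered channel and a slow subscriber loses messages instead of blocking the sender. The repository's actual JobHub is not shown in this log, so the following is only a minimal sketch of that technique. The names Hub, Envelope, Subscribe and Broadcast are hypothetical, and the per-subscriber writer goroutine is omitted.

```go
package main

import (
	"fmt"
	"sync"
)

// Envelope is a hypothetical stand-in for the wire message mirrored to browsers.
type Envelope struct {
	Type  string
	JobID string
}

// Hub is an illustrative fan-out hub: one set of subscriber channels per job ID.
type Hub struct {
	mu   sync.Mutex
	subs map[string]map[chan Envelope]struct{}
}

func NewHub() *Hub {
	return &Hub{subs: make(map[string]map[chan Envelope]struct{})}
}

// Subscribe registers a new subscriber with a 64-deep buffer
// (the depth quoted in the commit message).
func (h *Hub) Subscribe(jobID string) chan Envelope {
	ch := make(chan Envelope, 64)
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.subs[jobID] == nil {
		h.subs[jobID] = make(map[chan Envelope]struct{})
	}
	h.subs[jobID][ch] = struct{}{}
	return ch
}

// Broadcast offers env to every subscriber of env.JobID without blocking.
// A subscriber whose buffer is full simply misses the message; the return
// value counts how many subscribers missed it.
func (h *Hub) Broadcast(env Envelope) (dropped int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for ch := range h.subs[env.JobID] {
		select {
		case ch <- env:
		default: // buffer full: drop for this subscriber only
			dropped++
		}
	}
	return dropped
}

func main() {
	h := NewHub()
	slow := h.Subscribe("job-1") // never drained, simulating a stuck browser
	dropped := 0
	for i := 0; i < 70; i++ {
		dropped += h.Broadcast(Envelope{Type: "log.stream", JobID: "job-1"})
	}
	fmt.Printf("buffered=%d dropped=%d\n", len(slow), dropped) // buffered=64 dropped=6
}
```

The select with a default case is what keeps the sender from stalling: a full buffer costs that one subscriber a message, which matches the trade-off the commit message states.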

54 Commits