restic-manager

Author	SHA1	Message	Date
steve	6fd2a2ff77	p6-01/02: agent self-update + fleet update server cluster - alert: update_failed (per-host, dedup=hostID) + fleet_update_halted (system-scoped, host_id NULL via new RaiseOrTouchSystem helper). - ws: UpdateWatcher tracks in-flight command.update dispatches and reconciles them against incoming hello envelopes — success path marks the job succeeded and auto-resolves the alert; 90s timeout marks the job failed and raises update_failed. - http: POST /api/hosts/{id}/update (admin-only JSON) + the HTMX /hosts/{id}/update form variant. Pre-checks: host exists, online, agent_version != current, no running update job. Refactored core into Server.dispatchHostUpdate so the fleet worker can share it without going through HTTP. - fleetupdate: rolling worker iterating through host slots, halting on first failure and raising fleet_update_halted. Polling-based version-match (re-read hosts.agent_version every 1s up to 95s) — no extra plumbing into the WS hello path. At-most-one-running is enforced at the store layer (ErrFleetUpdateRunning). - cmd/server: wire UpdateWatcher and FleetWorker into the main goroutine; the worker uses a small serverDispatcher adapter that delegates back into Server.DispatchHostUpdate. Tests: watcher (success/timeout/mismatch/late-hello), HTTP endpoint (happy + four pre-check branches + RBAC), worker (two-host happy, timeout-halt, host-offline-halt, already-at-target skip, cancel mid-run, double-Start guard).	2026-05-06 22:03:50 +01:00
steve	d413896302	store: migrations 0021+0022 + fleet_updates CRUD	2026-05-06 21:47:54 +01:00
steve	3800b34a2b	testing: bootstrap UI, agent reliability, NS-01..04 + alert username Smooths the rough edges that came up exercising a live deployment. First-run bootstrap UI: /bootstrap renders a username + password form that uses the in-memory token directly (operator no longer copies it out of the log); /login redirects there while bootstrap is available. Agent reliability: failJob synthetic envelopes so command.run early returns no longer hang the server-side job; runtime probe of restic restore --help drives --no-ownership instead of version sniffing (0.18.x had it removed). Server unit re-shaped: ProtectSystem=full plus ReadWritePaths=/etc/restic-manager, no ProtectHome — restore can now write anywhere a user might want. Restore wizard: default target is /root/rm-restore/<job-id>/ with clearer help text. Re-init confirm input uses .field (was .input, which doesn't exist — text was invisible). NS-01 host delete: store DeleteHost, admin-band /hosts/{id}/delete with hostname-confirm danger zone, audit, FK cascade, live WS close. NS-02 enrollment-token recovery: outstanding-tokens panel on /hosts/new, regenerate (preserves attachments) and revoke handlers + audit, store-level ListOutstandingEnrollmentTokens and DeleteEnrollmentToken. 
NS-03 repo init / probe surface: migration 0020 adds hosts.repo_status + repo_status_error; WS handler projects every init job's outcome onto the host row (idempotent already-initialised collapses to ready); creds-save resets status and dispatches a fresh probe; /hosts/{id}/repo/probe retry endpoint with banner. NS-04 dashboard live + sort + filter: query-string filter (q/status/repo_status/tag/sort/dir), 5s htmx live poll mirroring the alerts pattern with a localStorage live toggle, sortable column headers, filter row + clear. Alerts page: ack'd-by line resolves user_id ULID to username. Compose.yaml ignored — host-specific.	2026-05-05 22:03:15 +01:00
steve	6006cad992	store: oidc_state CRUD + 5-minute cleanup	2026-05-05 13:15:45 +01:00
steve	7f8bd13a07	store: round-trip IDToken on sessions for RP-initiated logout	2026-05-05 13:14:27 +01:00
steve	805380f52d	store: GetUserByOIDCSubject + scanUser auth_source/oidc_subject	2026-05-05 13:12:11 +01:00
steve	c2581e56e8	store: extend User with AuthSource/OIDCSubject; Session with IDToken	2026-05-05 13:09:49 +01:00
steve	dc89997307	store: migration 0019 — users.auth_source/oidc_subject + sessions.id_token + oidc_state	2026-05-05 13:08:15 +01:00
steve	89d4458866	feat(hosts): per-host tags edit + dashboard chip-row filter (P4-07)	2026-05-05 11:16:09 +01:00
steve	0415a96e27	ui(users): record last_login on /setup + sortable headers	2026-05-05 10:57:25 +01:00
steve	f0828782c1	store: DeleteSessionsByUserID for force-logout	2026-05-05 10:57:24 +01:00
steve	12391abef0	store: user_setup_tokens CRUD + cleanup-expired	2026-05-05 10:57:24 +01:00
steve	2c090171e5	store: lowercase username, email/disable helpers, last-admin count	2026-05-05 10:57:24 +01:00
steve	bd08d8ca14	store: extend User struct with Email, DisabledAt, MustChangePassword	2026-05-05 10:57:24 +01:00
steve	a7e53e0a64	store: migration 0018 — user_setup_tokens	2026-05-05 10:57:24 +01:00
steve	ca170fedc5	store: migration 0017 — users.email, disabled_at, must_change_password	2026-05-05 10:57:24 +01:00
steve	ba425c9766	feat(audit): clickable column headers with asc/desc sort	2026-05-05 08:15:22 +01:00
steve	3f36bcd0b0	feat(audit): P3-08 — audit log UI with filters	2026-05-05 07:49:25 +01:00
steve	a45c801884	feat(alerts): per-source-group dedup so two failing backups produce two alerts Until now the open-alert key was (host_id, kind, resolved_at IS NULL). A host with two source groups both failing collapsed onto one backup_failed row — second failure bumped last_seen_at and overwrote the message but never re-fan-out. Operators saw one alert that appeared to flap, not two distinct broken things. Schema changes (column-level ALTER, no rebuild): - 0015 jobs.source_group_id (FK → source_groups, ON DELETE SET NULL, index). Populated for backup jobs in CreateJob. - 0016 alerts.dedup_key (NOT NULL DEFAULT ''). The old alerts_open partial index gets dropped and replaced with a UNIQUE partial index on (host_id, kind, dedup_key) WHERE resolved_at IS NULL — the index is now the actual dedup primitive. Plumbing: - RaiseOrTouch / AutoResolve / Alert struct gain dedup_key. - engine.JobFinishedEvent gains SourceGroupID; handleJobFinished passes it through for backup_failed only (forget/prune/check stay repo-scoped with key=''). - ws.handler reads SourceGroupID off the freshly-loaded job row. - dispatchJobWithPayload gains a *string sourceGroupID arg; the per-group Run-now path and schedule.fire path pass &g.ID. Test coverage: TestRaiseOrTouchDedupsPerSourceGroup proves two distinct groups produce two distinct open alerts and that resolving one does not auto-resolve the other. Dev tool: cmd/_fake_alert gains -dedup-key flag.	2026-05-04 22:59:48 +01:00
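The per-source-group dedup described in a45c801884 can be pictured with a small in-memory model. This is only a sketch: the `openAlerts` map and its lowercase methods are hypothetical stand-ins for the alerts table and the store's RaiseOrTouch / AutoResolve, and the real primitive is the UNIQUE partial index, not a Go map.

```go
package main

import "fmt"

// alertKey mirrors the open-alert uniqueness tuple from the commit:
// (host_id, kind, dedup_key) among rows that are still unresolved.
type alertKey struct {
	HostID, Kind, DedupKey string
}

// openAlerts is a hypothetical in-memory stand-in for the alerts table.
// The value counts how often the alert was raised or touched.
type openAlerts map[alertKey]int

// raiseOrTouch inserts a new open alert or bumps the existing one.
// It reports true when a new row was created.
func (o openAlerts) raiseOrTouch(hostID, kind, dedupKey string) bool {
	k := alertKey{HostID: hostID, Kind: kind, DedupKey: dedupKey}
	_, exists := o[k]
	o[k]++
	return !exists
}

// autoResolve closes only the alert that matches the full key.
func (o openAlerts) autoResolve(hostID, kind, dedupKey string) {
	delete(o, alertKey{HostID: hostID, Kind: kind, DedupKey: dedupKey})
}

func main() {
	alerts := openAlerts{}
	// Two failing source groups on one host yield two distinct alerts.
	fmt.Println(alerts.raiseOrTouch("host-1", "backup_failed", "group-a"))
	fmt.Println(alerts.raiseOrTouch("host-1", "backup_failed", "group-b"))
	// A repeat failure of group-a only touches the existing alert.
	fmt.Println(alerts.raiseOrTouch("host-1", "backup_failed", "group-a"))
	// Resolving group-a leaves group-b open.
	alerts.autoResolve("host-1", "backup_failed", "group-a")
	fmt.Println(len(alerts))
}
```

Repo-scoped kinds (forget/prune/check) would pass an empty dedup key, which collapses them back to one open alert per host and kind.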
steve	cffad4b4f3	fix: enabled toggle — list-row click + edit-form save Two bugs in the channel-enabled affordance: 1. List-row toggle was a static span with no handler; the row's row-link overlay swallowed every click and routed to /edit. Add POST /settings/notifications/{id}/toggle backed by a new store method SetNotificationChannelEnabled, and turn the row toggle into an htmx-driven button that swaps in the new state. Use event.stopPropagation() on the toggle so it beats the row link. 2. Edit-form toggle visually flipped but the underlying checkbox reverted: the visual span lives inside the <label>, so clicking it fired the inline JS handler AND the label's native checkbox-toggle, cancelling out. Bind to the checkbox 'change' event instead and let the label do the toggling — the JS just mirrors check.checked into the .on class.	2026-05-04 22:21:45 +01:00
steve	3d99306cea	fix: refresh hosts.open_alert_count on Raise/Resolve/AutoResolve The denormalised projection was never written by the alerts code path, so the dashboard's OPEN ALERTS card and the per-host alerts column always read 0 regardless of how many alerts were open. fleet.GetStats sums hosts.open_alert_count; if it never moves, the card is decoration. Add refreshHostOpenAlertCount that recomputes from the alerts table (self-healing — no +/- bookkeeping to drift). Call it after the commit in RaiseOrTouch when a row was inserted, after Resolve, and after AutoResolve. Caught during the live sweep: a synthetic critical raised the count to 1, but resolving it left the dashboard reading '1 unresolved' indefinitely.	2026-05-04 21:01:17 +01:00
steve	8c42b00228	alert: wire engine into ws hello + MarkJobFinished + offline sweep - ws.HandlerDeps gains an AlertEngine *alert.Engine field; populated from http.Deps.AlertEngine (nil until G1 constructs the engine) - runAgentLoop calls NotifyHostOnline after MarkHostHello succeeds - dispatchAgentMessage MsgJobFinished case calls NotifyJobFinished, looking up the job Kind via Store.GetJob before notifying - store.MarkHostsOfflineStaleReturnIDs added: SELECT+UPDATE in one transaction, returns the IDs that flipped to offline - offline sweeper in cmd/server/main.go switched to the new variant; TODO(G1) comment marks where NotifyHostOffline calls will land	2026-05-04 19:54:39 +01:00
steve	69fc89143d	store: notification_channels CRUD + AppendNotificationLog	2026-05-04 19:28:41 +01:00
steve	b5a0aa4667	store: alerts CRUD with dedup + last_seen_at bump	2026-05-04 19:24:17 +01:00
steve	f24dfa5214	store: migration 0014 — notification_channels + notification_log	2026-05-04 19:20:37 +01:00
steve	640b64710e	store: A1 — check rows.Err() + Scan err in migrate_test Code-quality nits flagged in review of `e6d965d`. Mirrors the existing pattern in host_credentials_test.go.	2026-05-04 19:19:28 +01:00
steve	e6d965d7a5	store: migration 0013 — alerts.last_seen_at	2026-05-04 19:16:59 +01:00
steve	f0dfa689fe	P3 follow-up: editable target dir, conditional --no-ownership, UK lint Three small follow-ups from review: 1. Restore target is now operator-editable. Default value is the literal '$HOME/rm-restore/<job-id>/' (agent expands $HOME at run time using os.UserHomeDir(); also handles ${HOME} and ~/ prefixes). Operator can replace with any absolute path. - ui_restore.go validates the input is either absolute or starts with one of the recognised prefixes; other env-var refs ($PATH etc.) are deliberately rejected so operator paths can't pick up arbitrary agent env values. - host_restore.html replaces the read-only mono-text display with a real <input>; help text spells out that $HOME resolves agent-side and <job-id> is substituted on dispatch. - install.sh + the systemd unit prep /root/rm-restore so the default works under the sandbox: ReadWritePaths gains a soft '-/root/rm-restore' entry (the '-' makes the bind-mount soft-fail if missing, but install.sh pre-creates it root-owned 0700). 2. --no-ownership flag now gated on restic version. The flag was added in restic 0.17 and 0.16 rejects it. Previously dropped it wholesale — that meant new-dir restores silently preserved ownership against design intent on 0.17+. Now the agent threads its detected restic version (sysinfo already collects it) through runner.Config -> restic.Env, and RunRestore appends --no-ownership only when AtLeastVersion(0, 17) returns true. 0.16 hosts still restore with original uid/gid; help text in the wizard explicitly notes this. The previous 'Original ownership is preserved' copy was wrong for new-dir mode and is corrected. 3. golangci-lint misspell locale switched US -> UK and the codebase swept (73 corrections, mostly behaviour/serialise/recognise/honour). Wire-format ErrorCode 'unauthorized' -> 'unauthorised' is a tiny contract change but the agent doesn't parse those codes today and no external API consumers exist yet. Tests passed before + after. 
Tests: - internal/restic/version_test.go covers Env.AtLeastVersion across edge cases (empty, exact match, patch above, minor below, non-numeric) and expandHome on $HOME / ${HOME} / ~/, plus pass-through for absolute paths and refusal of other env vars. - ui_restore_test updated: TargetDir now starts '$HOME/rm-restore/' with the job_id substituted into the placeholder. Live verified on the smoke env: default target restored to /root/rm-restore/<job-id>/ as the agent's expanded $HOME (2 files, 14 bytes); custom override '/tmp/custom-restore/<job-id>/' restored into the agent's PrivateTmp namespace (1 file, 6 bytes); both jobs 'succeeded', exit 0.	2026-05-04 17:27:52 +01:00
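The version gate from f0dfa689fe might look roughly like the following. This is a sketch, not the repository's code: `atLeastVersion` and `restoreArgs` are hypothetical free functions standing in for Env.AtLeastVersion and RunRestore, and treating an empty or non-numeric version as "too old" is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// atLeastVersion reports whether version (e.g. "0.17.3") is >= major.minor.
// Empty or non-numeric input returns false, so the flag is withheld when
// the version is unknown (an assumption of this sketch).
func atLeastVersion(version string, major, minor int) bool {
	parts := strings.Split(strings.TrimSpace(version), ".")
	if len(parts) < 2 {
		return false
	}
	gotMajor, err := strconv.Atoi(parts[0])
	if err != nil {
		return false
	}
	gotMinor, err := strconv.Atoi(parts[1])
	if err != nil {
		return false
	}
	if gotMajor != major {
		return gotMajor > major
	}
	return gotMinor >= minor
}

// restoreArgs builds a restic restore argument list and appends
// --no-ownership only when the detected version is at least 0.17.
func restoreArgs(version, snapshotID, target string) []string {
	args := []string{"restore", snapshotID, "--target", target}
	if atLeastVersion(version, 0, 17) {
		args = append(args, "--no-ownership")
	}
	return args
}

func main() {
	fmt.Println(restoreArgs("0.17.3", "abc123", "/root/rm-restore/job-1/"))
	fmt.Println(restoreArgs("0.16.4", "abc123", "/root/rm-restore/job-1/"))
}
```

The cases mirror the edge cases the commit's tests list: empty, exact match, patch above, minor below and non-numeric.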
steve	6e47efc146	P3-01/02/03: restore wizard backend + templates + restore-shaped job page End-to-end wizard from /hosts/{id}/restore (or per-snapshot deep link /hosts/{id}/snapshots/{sid}/restore) → tree-browse → dispatch → restore-shaped live job page. Backend (internal/server/http/ui_restore.go): - GET handlers render the four-step wizard against the wireframe shape in docs/superpowers/specs/2026-05-04-p3-restore-design.md. - HTMX tree partial endpoint hits fetchTreeWithCache (P3-X2) so each directory expansion is a sub-second cached lookup after the first miss. - POST validates: snapshot_id non-empty, ≥1 absolute path, in-place mode requires confirm_hostname == host name, agent online. On error re-renders the wizard with the operator's input intact. Happy path mints a job_id, computes the new-directory target as /var/restic-restore/<job-id>/ (operator can't escape the prefix — server picks it), creates the job row, ships command.run with kind=restore + RestorePayload, writes a host.restore audit row, returns HX-Redirect (or 303) to the live job page. Templates: - host_restore.html: single-page progressively-enabled wizard matching _diag/p3-restore-wizard wireframe. Form-state-driven JS computes a running tally of selected paths and the step-4 confirm summary client-side; the server re-renders on validation failure with form fields preserved. - partials/tree_node.html: recursive HTMX-served tree fragment. - Top-level Restore button on host_detail right rail + per-snapshot Restore action on snapshot rows replace the previous P3-stub. Restore-shaped job page (job_detail.html): - Progress widget rendered as a panel rather than a bare strip when the job is active. - Current-file display under the bar, updated from log.stream stdout lines that look like absolute paths. Hidden for non-restore kinds. Migration 0012: - Add restore + diff to the jobs.kind CHECK. Rebuild required (SQLite can't ALTER CHECK in place); follows the safe pattern from 0005. 
Defensive: stash job_logs into a temp table before the rebuild and INSERT OR IGNORE back afterwards so even if SQLite cascades on DROP TABLE jobs the log history survives. Tests: - ui_restore_test covers GET step-1 render, GET pre-selected snapshot summary card, POST missing snapshot, POST missing paths, POST in-place wrong-hostname rejection (no command.run leaks to the agent), POST happy path (HX-Redirect + correct payload + audit row), POST against offline host returns 503. Restage block (CLAUDE.md) deferred to the end of the restore phase.	2026-05-04 15:34:29 +01:00
steve	a8e6c9d6d7	store+server: P2-18a announce-and-approve schema + endpoint migration 0011 adds pending_hosts table (id, hostname, public_key, fingerprint, expiry). store/pending_hosts.go covers full CRUD plus hostname-collision count + expired-row sweeper. POST /api/agents/announce takes {hostname, os, arch, agent_version, restic_version, public_key (base64)}, returns {pending_id, fingerprint, hostname_collision}. Per-source-IP token-bucket rate limit (10/min) + global cap of 100 in-flight rows. Public key must be exactly 32 bytes (Ed25519).	2026-05-04 11:03:41 +01:00
steve	13c35b68d4	agent+server: P2R-11 pre/post hook execution for backup jobs Agent: new runner.BackupHooks struct + runHook helper invoked via /bin/sh -c (cmd.exe /C on Windows). pre_hook non-zero exit aborts the backup; post_hook always runs with RM_JOB_STATUS=succeeded|failed in env. Output streamed as 'hook(<phase>): …' log.stream lines. Hooks only run for kind=backup (other kinds skip both phases). Server: resolveBackupHooks resolves group → host default → empty, decrypts via crypto.AEAD with per-slot ad bytes, plumbs plaintext into CommandRunPayload for both schedule.fire and per-group Run-now dispatch sites. Decrypt failures degrade silently to no hook so a malformed blob can't poison every backup.	2026-05-04 10:57:28 +01:00
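The hook fallback order and the post-hook status variable from 13c35b68d4 reduce to a few lines. A minimal sketch, assuming plaintext hooks (decryption is skipped) and hypothetical helper names `resolveHook` and `postHookEnv`:

```go
package main

import "fmt"

// resolveHook applies the fallback order from the commit:
// source-group hook, then host default, then empty (no hook).
func resolveHook(groupHook, hostDefault string) string {
	if groupHook != "" {
		return groupHook
	}
	return hostDefault
}

// postHookEnv copies the base environment and adds RM_JOB_STATUS,
// which the post hook always receives as "succeeded" or "failed".
func postHookEnv(base []string, succeeded bool) []string {
	status := "failed"
	if succeeded {
		status = "succeeded"
	}
	env := append([]string{}, base...)
	return append(env, "RM_JOB_STATUS="+status)
}

func main() {
	fmt.Println(resolveHook("", "systemctl stop app"))
	fmt.Println(postHookEnv([]string{"PATH=/usr/bin"}, false))
}
```

Copying the base slice before appending keeps the caller's environment untouched if the same slice is reused for the pre hook.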
steve	c20375eaf5	store: P2R-10 schema for source-group + host-default hooks (migration 0010) Adds pre_hook/post_hook BLOB columns to source_groups and pre_hook_default/post_hook_default to hosts. Bytes stored verbatim (AEAD encrypt/decrypt happens at the HTTP layer where the AEAD key lives). Round-trip tests cover set/clear semantics on both tables.	2026-05-04 10:52:16 +01:00
steve	93ab0ae84f	ui+server: schedule next-run / last-run on dashboard + schedules tab P2R-14. New store.LatestJobBySchedule query (per-schedule fired job). Schedules-tab handler computes next-fire from cron + last-fire from the jobs table per row. Schedules table grows two columns; dashboard host row prepends 'next 12h ago/from now' to the existing last-backup line when a single covering schedule is the run-now candidate. Embeds store.Schedule into scheduleRow so existing template field references keep working without bulk renames.	2026-05-04 10:44:31 +01:00
steve	b8c9c50a93	store: LatestJobByKind includes in-flight jobs (avoid maintenance double-fire) Widen the SQL query to consider all statuses (queued, running, succeeded, failed, cancelled) rather than terminal-only. An in-flight prune that outlasts the 60s tick interval previously produced ErrNotFound, causing the ticker to anchor at now-24h and fire a second prune concurrently with the first. Update the doc comment and test: remove the "queued job filtered out" case, add assertions that a running job and a queued job are each returned as the latest.	2026-05-04 10:19:15 +01:00
steve	e64cf25c0e	server: enqueue pending_runs when scheduled-job dispatch fails When dispatchBackupForGroup's conn.Send errors, queue a pending_runs row (attempt=1, next_attempt_at = now + group.RetryBackoffSeconds) instead of silently dropping the fire. The orphaned queued job row is left behind for forensic visibility — the drainer will create a fresh job row on its retry. Also adds Store.ListPendingRunsForHost — the on-reconnect drain walks every row for the host, regardless of due-ness, since the host being back makes 'due' irrelevant.	2026-05-04 10:19:15 +01:00
steve	e7e11454a8	maintenance: pure-logic ticker decides forget/prune/check fires	2026-05-04 10:19:15 +01:00
steve	4ad0b5147a	store: tighten CHECK constraint on host_repo_stats.last_check_status	2026-05-04 10:19:15 +01:00
steve	f97f67eb67	store: wrap UpsertHostRepoStats in a transaction (concurrency safety)	2026-05-04 10:19:15 +01:00
steve	bc77081366	store: assert CHECK constraint on host_credentials.kind	2026-05-04 10:19:15 +01:00
steve	87655cf0e4	store: HostRepoStats projection (size, lock, last-check, last-prune)	2026-05-04 10:19:15 +01:00
steve	de6d51eeb1	store: host_credentials becomes kind-aware (repo + admin slots)	2026-05-04 10:19:15 +01:00
steve	212ddfe226	store: migration 0009 — admin-creds kind + host_repo_stats	2026-05-04 10:19:15 +01:00
steve	e871b05b38	lint: drive baseline to zero, drop only-new-issues gate Cleanup pass over the repo so CI can enforce lint going forward without the only-new-issues escape hatch: * gofumpt -w across the tree (31 hits, all formatting) * misspell --fix (25 hits, US-locale spelling) — but reverted on api.JobCancelled = "cancelled" since that literal is the wire + DB CHECK constraint value, plus matched the case in store/fleet.go back to "cancelled" and added //nolint:misspell on both for the next time someone reaches for the auto-fix * Wrap every `defer rows.Close()` / `defer stmt.Close()` / `defer res.Body.Close()` in `defer func() { _ = .Close() }()` to satisfy errcheck without losing the close itself * websocket.Dial callers (1 prod, 4 tests) now capture + close the upgrade response Body — coder/websocket can return res with a nil Body on success, so the test deferred-closes guard against that * Annotate the two genuine-by-design nilerr cases with //nolint comments explaining why nil-on-error is the contract (cookie missing = no session; ctx cancelled mid-backoff = clean shutdown) * Add brief godoc on the 10 exported const groups + types that revive flagged (api.HostOS/HostArch/JobKind/JobStatus/LogStream/ErrorCode, restic.EventKind, store.Role, web.FS) * Drop the unused (Server).userByID method * Inline the unparam baseView(active) — every UI page is under the dashboard primary nav today Result: `golangci-lint run ./...` reports 0 issues. CI lint job no longer needs only-new-issues: true; X-06 follow-up entry in tasks.md removed.	2026-05-03 16:15:17 +01:00
steve	d000fe7ec1	P2R-01: REST + WS rewire against the slim shape Schedules CRUD now takes {cron, enabled, source_group_ids[]} with cron parsed via robfig/cron/v3 and group membership scoped to the host. New source-groups CRUD lives at /api/hosts/{id}/source-groups; delete refuses with 409 if any schedule still references the group, returning the schedule list so the UI can prompt 'remove from these schedules first.' Repo-maintenance GET/PUT manages forget/prune/check cadences on host_repo_maintenance — no version bump, the server-side ticker (P2R-06) drives execution. Per-source-group Run-now (POST /hosts/{id}/source-groups/{gid}/run) resolves the group's includes/excludes/retention/tag and dispatches a backup command.run with the new structured CommandRunPayload fields (Includes/Excludes/Tag). Old per-host /hosts/{id}/run-backup and /hosts/{id}/init-repo return 410 Gone with a redirect message. schedule_push.go is rebuilt: buildScheduleSetPayload assembles the slim wire shape, pushScheduleSetOnConn ships it during the on-hello window, pushScheduleSetAsync fires after every CRUD mutation, and dispatchScheduledJob handles agent schedule.fire by iterating the schedule's source groups and dispatching one backup per group with actor_kind=schedule and scheduled_id pointing at the schedule. Auto-init at first WS connect: when the host has repo creds bound and no init job in its history, server dispatches restic init. Restic's 'config file already exists' soft-success means re-runs against an existing repo no-op; we don't auto-retry on failure (operator triggers re-init manually via the danger zone in P2R-09). api.Schedule drops Kind/Paths/Excludes/Tags/RetentionPolicy/Manual etc. in favour of {id, cron, enabled, source_groups: [...]}. The agent scheduler stops checking sch.Manual; cmd/agent's backup dispatch reads Includes/Excludes/Tag instead of Args. 
Tests cover the new HTTP surface end-to-end: source-groups CRUD with in-use refusal, schedule validation (bad cron / missing groups / foreign group), repo-maintenance auto-seed and validation, the 410 route, and buildScheduleSetPayload's wire-shape correctness. Full suite passes; smoke env exercises auto-init dispatch on hello, async push after schedule create, and per-source-group Run-now landing the right paths/excludes/tag at the agent.	2026-05-03 10:56:40 +01:00
steve	5667cdf13a	P2 redesign · phase 2: store rewrite — sources, slim schedules, repo maintenance Go-side data model rebuilt against migration 0008. The fat-Schedule shape (paths/excludes/tags/retention/manual/kind/options/hooks) is gone; that surface lives on source_groups now. * store/types.go - Schedule slimmed to {id, host_id, cron, enabled, source_group_ids, timestamps}. SourceGroupIDs populated by Get/List, accepted on Create/Update so callers pass desired junction state in one shape. - SourceGroup added: name (= snapshot tag), includes/excludes, retention_policy, retry_max + retry_backoff_seconds, cached conflict_dimension. - HostRepoMaintenance added: forget/prune/check cadences + enabled. - PendingRun added: offline-retry queue. - Host loses RepoInitialisedAt; gains BandwidthUpKBps + BandwidthDownKBps. - RetentionPolicy moves home from "schedule field" to "source group field" but the type itself + Summary() method unchanged. * store/sources.go (new) — CRUD + GetByName + ConflictDimension cache. Group writes bump host_schedule_version; conflict cache writes don't (server-internal projection, agent doesn't see it). * store/maintenance.go (new) — CreateDefault is idempotent (INSERT OR IGNORE). UpdateRepoMaintenance doesn't bump schedule version because these run on the server's own ticker, not the agent's local cron. * store/pending.go (new) — Enqueue / DueRunsForRetry / Bump / Delete. * store/schedules.go — rewritten for slim shape + junction CRUD. Update wipes the schedule_source_groups junction wholesale and re-inserts (simpler than diffing). Adds SchedulesUsingGroup for retention-conflict detection + UI labels. 
* store/hosts.go — drops repo_initialised_at scan, adds bandwidth scan. New SetHostBandwidth helper. * HTTP layer — temporarily stubbed during this rewrite (501 returns with redesign_in_progress error code). Phase 3 fills these in against the new shape: - schedules.go REST CRUD - schedule_push.go agent reconciliation - ui_schedules.go HTML form CRUD Run-now-per-host + Init-repo handlers in ui_handlers.go also stubbed — both go away in the new model (Run-now per source group; auto-init at host enrolment). * enrollment.go — replaces "seed manual schedule from typed paths" with "seed default source group + repo-maintenance row." The default group gets the typed paths as its includes; operator edits later via Sources tab. * ws/handler.go — drops the MarkHostRepoInitialised projection (column is gone; auto-init makes it derivable from latest init job's status). Tests: * store: existing schedule test rewritten for slim shape + junction; new sources_test.go covers source-group CRUD, name uniqueness, conflict cache, repo-maintenance defaults + idempotent seed, pending-runs queue lifecycle. * http: schedules_test.go and schedule_push_test.go deleted — both exercised the obsolete fat-schedule API. Phase 3 rewrites them against the new endpoints. go test ./... green. cmd/server + cmd/agent build. The UI is broken end-to-end (schedules / sources / repo tabs all hit 501 stubs); Phase 3 restores REST + on-the-wire reconciliation; Phase 4 rewires the UI templates against the new model. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 21:30:41 +01:00
steve	7a7cac588c	P2 redesign · phase 1: migration 0008 — sources + repo maintenance Schema rebuild for the model collapse described in design/v4-sources-redesign.html. Three nouns now stand on their own: * schedules — slim. Only cron + enabled + host_id. Fat-schedule shape (paths/excludes/tags/retention/manual/kind/options/hooks) is dropped wholesale. Schedule data wiped — by design (smoke env was nuked before this ran; fresh installs have nothing to lose). * source_groups — name + includes + excludes + retention_policy + retry policy + cached conflict_dimension. Group name doubles as the snapshot tag so retention can target it cleanly. UNIQUE (host_id, name) enforces tag unambiguity. * schedule_source_groups — N:M junction. One schedule can fire N groups per tick; one group can be referenced by N schedules. * host_repo_maintenance — 1:1 with hosts. Default cadences: forget daily 03:00, prune weekly Sun 04:00, check monthly 1st 05:00 with --read-data-subset 5%. Operator can edit on Repo tab. * pending_runs — offline-retry queue. Server-side ticker dispatches due rows; bounded by source_groups.retry_max + retry_backoff_seconds. Plus: * hosts.bandwidth_up_kbps / .bandwidth_down_kbps — host-wide caps. * hosts.repo_initialised_at — DROPPED. Auto-init on enrol makes it derivable from the latest init job; the Init-repo button goes too (failure surfaces via job history banner). Note on FK safety: smoke env was wiped before migration ran, so DROP TABLE schedules cascades to nothing. Fresh installs apply 0001-0007 then immediately 0008 — same story (no schedule rows to lose). For an upgrade path on a populated DB, this migration would need a data-preserving variant; not needed today. Tests fail to compile/run after this — expected. The Go side (store types, CRUD, REST handlers, agent runner, UI templates) gets rebuilt in subsequent phases. tasks.md will track P2 redesign progress. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 20:54:01 +01:00
steve	8a05969953	Add-host: durable pending page + polled awaiting-agent panel Two issues from a smoke session: 1. The awaiting-agent panel never refreshed — operator had to go back to the dashboard to see the host had connected. 2. Generated passwords were displayed only on the POST response. Navigating away (or even an accidental tab close) lost them permanently, so the operator couldn't update the rest-server's htpasswd. Both are the same fix: convert the POST-rendered transient "result state" into a durable GET page at /hosts/pending/{token}. * New route GET /hosts/pending/{token} renders the install-command + htpasswd snippet view. Password is decrypted from the (still-encrypted-at-rest) token row on every render — operator can refresh, bookmark, navigate away and come back. Once the agent enrols, the page redirects to /hosts/{id}; once the token expires, redirect to /hosts/new. * New route GET /hosts/pending/{token}/awaiting returns a polled HTML fragment that the pending page swaps in every 2s via HTMX. States: awaiting (keep polling) | connected (show "Open host →" + "View schedules" CTAs, polling stops) | expired (mint-new link, polling stops). Polling stops naturally because only the awaiting state's wrapper carries the hx-trigger attribute. * POST /hosts/new now 303-redirects to /hosts/pending/{token} on success; validation errors keep re-rendering the form with banner. Supporting changes: * New store helper Store.GetEnrollmentTokenStatus(tokenHash) for the polling endpoint — returns {expires_at, consumed_at, consumed_host} in one round-trip without dragging in the attachments-decryption path. 
* New ui.Renderer.RenderPartial(w, name, data) for HTMX fragment responses (no layout wrap). Picks an arbitrary page's template set as the lookup point — every page parses the full common-paths list, so they all see every partial. * add_host.html stripped to form-only; pending_host.html owns the result-state UI; awaiting_agent.html is the polled partial. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 12:59:24 +01:00
steve	148e61b33b	P2-04.5: kill host.default_paths in favour of manual schedules Two independent path lists for "what does this host back up?" was a real divergence footgun — operator types one set at Add-host time and a different set into a schedule, both end up in the same repo, the snapshot history looks fine until restore. Resolution: drop host.default_paths entirely; add a `manual` flag on schedules. A manual schedule has paths/excludes/tags/retention like any other but no cron — it fires only via per-schedule Run-now. Single source of truth for what gets backed up. Schema (migration 0007): * schedules.manual INTEGER NOT NULL DEFAULT 0. * For every host with non-empty default_paths, seed a manual schedule with those paths and bump host_schedule_version. * ALTER TABLE hosts DROP COLUMN default_paths. * ALTER TABLE enrollment_tokens RENAME COLUMN default_paths TO initial_paths. Original draft of this migration rebuilt hosts via the create-new + drop-old + rename-new pattern. With foreign_keys=ON (set in the connection DSN), DROP TABLE on the parent fired ON DELETE CASCADE on every child of hosts(id) — schedules / jobs / snapshots / host_credentials all wiped on the smoke env when I tried it. SQLite 3.35+ supports column-level ALTERs directly, so we skip the rebuild dance and avoid the cascade trap. Six lines of SQL instead of sixty, no FK risk. Run-now rewiring: * New `dispatchScheduleNow(hostID, scheduleID, conn?)` helper unifies the agent-driven path (cron fire → schedule.fire → OnScheduleFire callback) and the UI-driven path (operator clicks Run-now on a schedule row). Conn arg is optional; nil falls back to Hub.Send.
* New POST /hosts/{id}/schedules/{sid}/run endpoint — per-row Run-now button on the schedules list. * Dashboard's per-host Run-now (handleUIRunBackup) now picks the host's only enabled manual schedule, falls back to the only enabled schedule, else returns "pick one in Schedules tab". Keeps one-click for the common case. Agent: * Scheduler skips manual schedules in cron build (silent — they're a normal data shape, not an error). * Wire Schedule struct gains Manual flag. * Schedule.fire flow unchanged — the agent only ever fires non-manual schedules anyway. UI: * Add-host form retitled "Initial schedule · manual" so the operator knows the paths become an editable schedule under the Schedules tab. Result page calls out the manual schedule + points at Host > Schedules. * Schedule edit form: "Manual schedule" checkbox at the top of the When section; toggling it hides/shows the cron field via inline JS. Server-side validator skips the cron requirement when manual=true. * Schedule list shows a "manual" tag under the status pill and renders the When column as "— run-now only —" for manual rows. Each row gets a Run-now button when the schedule is enabled and the host is online. Tests + go test ./... green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 12:26:06 +01:00
steve	160d788bae	P2-04: schedule editor UI Closes the schedule foundations slice — operator can now drive the plumbing P2-01..03 landed without touching the JSON API. * New routes: - GET /hosts/{id}/schedules (list) - GET /hosts/{id}/schedules/new (create form) - POST /hosts/{id}/schedules/new (create) - GET /hosts/{id}/schedules/{sid}/edit (edit form) - POST /hosts/{id}/schedules/{sid}/edit (update) - POST /hosts/{id}/schedules/{sid}/delete (delete, confirm-then-redirect) * List view (web/templates/pages/schedules_list.html): status, cron, paths, retention summary, tags, edit/delete buttons. Header shows "version N · agent in sync" or "agent at vM" when the push hasn't been ack'd yet — backed by host_schedule_version + applied_schedule_version. Empty-state CTA points at /schedules/new. * Create/edit form (web/templates/pages/schedule_edit.html, shared): cron expression with five quick-pick presets (daily 3am / every 6h / @hourly / weekly Sun / monthly 1st), paths textarea (one per line), excludes textarea, tags (comma-separated), retention as six numeric fields (mirrors restic's --keep-* flags one-for-one), bandwidth caps, enabled toggle. Side panel explains the reconciliation flow so the operator knows what saving actually does. Validation errors re-render with operator's input intact. * internal/server/http/ui_schedules.go owns the handlers; reuses the same validateSchedule + pushScheduleSetAsync used by the JSON API path. Each save audit-logs schedule.created / schedule.updated / schedule.deleted (matching the JSON API actions). * store.RetentionPolicy gains a Summary() method ("last=7, d=14, w=4" or "—").
Used by the list view's table cell so templates don't have to do any conditional retention rendering. * Two new template helpers: list (string varargs → []string, used for the cron preset row) and joinComma (sibling to joinDot for the rare list that wants commas). RetentionPolicy.Summary covers the schedule-list case but the helpers are general. * host_detail.html secondary tabs row converted from inert <div>s into <a> links. Snapshots active by default; Schedules now points at the new page. Jobs/Repo/Settings remain inert until their P2 owners ship. Hooks UI deferred to P2-15 (lands with the hook execution path). Single-kind UI (backup only) by design — other kinds get a UI when their job dispatch lands in P2-05..08. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:44:40 +01:00
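A Summary() method like the one 160d788bae adds to store.RetentionPolicy could look roughly as follows. This is an illustrative sketch only: the struct fields are assumed from restic's six --keep-* flags, and the "h", "m", and "y" labels are guesses, since the commit message shows only "last", "d", and "w".

```go
package main

import (
	"fmt"
	"strings"
)

// RetentionPolicy is a hypothetical stand-in for store.RetentionPolicy with
// one counter per restic --keep-* flag. Zero means "not set".
type RetentionPolicy struct {
	KeepLast, KeepHourly, KeepDaily, KeepWeekly, KeepMonthly, KeepYearly int
}

// Summary renders the non-zero counters as "last=7, d=14, w=4" and returns
// "—" when nothing is set, so templates need no conditional logic.
func (p RetentionPolicy) Summary() string {
	fields := []struct {
		label string
		n     int
	}{
		{"last", p.KeepLast},
		{"h", p.KeepHourly}, // label assumed
		{"d", p.KeepDaily},
		{"w", p.KeepWeekly},
		{"m", p.KeepMonthly}, // label assumed
		{"y", p.KeepYearly},  // label assumed
	}
	var parts []string
	for _, f := range fields {
		if f.n > 0 {
			parts = append(parts, fmt.Sprintf("%s=%d", f.label, f.n))
		}
	}
	if len(parts) == 0 {
		return "—"
	}
	return strings.Join(parts, ", ")
}

func main() {
	p := RetentionPolicy{KeepLast: 7, KeepDaily: 14, KeepWeekly: 4}
	fmt.Println(p.Summary()) // prints "last=7, d=14, w=4"
}
```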
steve	6450bf1b88	P2-02 (agent side) + P2-03: agent scheduler + schedule.fire dispatch Closes the schedule reconciliation loop end-to-end. * New `internal/agent/scheduler` package wraps robfig/cron/v3 with the lifecycle the agent needs: - Apply(ScheduleSetPayload, Sender) stops the prior cron (waiting for in-flight entries to return), rebuilds from scratch, starts, and emits schedule.ack with the version we just applied. - Disabled entries skipped silently; bad cron exprs (which shouldn't reach us — the server validates — but defensive) log a warn and skip. - On each cron tick the entry sends a new schedule.fire envelope to the server with {schedule_id, scheduled_at}. The scheduler itself never builds CommandRunPayloads — server is the source of truth for jobs. - tx is swapped on every Apply, so reconnect is handled naturally: cron entries that fire against a dropped tx log "no active connection" and skip the tick. - Stop() is idempotent and waits for the cron's in-flight workers via cron.Stop().Done(). * New wire message api.MsgScheduleFire + api.ScheduleFirePayload for the agent → server "I just fired locally" RPC. * Server-side dispatch (schedule_push.go: dispatchScheduledJob): looks up the schedule by id, validates ownership + that it's enabled, builds args from kind (paths for backup; other kinds are still arg-less in Phase 2 and grow as those job kinds land in P2-05..08), persists a jobs row with actor_kind=schedule + scheduled_id, and writes command.run back on the same conn so the agent runs through its existing dispatch path. * store.CreateJob now writes scheduled_id.
This column was in the schema since 0001 but never populated — the original P1 path only had operator-driven jobs, so actor_kind was always 'user' and scheduled_id was always nil. * cmd/agent/main.go integration: dispatcher gains a scheduler.Scheduler; the MsgScheduleSet case now hands the payload to scheduler.Apply (in a goroutine so the WS read loop keeps draining other messages). WS dispatcher gains OnScheduleFire alongside OnScheduleAck. * Tests: - scheduler unit tests (4): ack-on-apply, cron tick fires schedule.fire envelope, disabled entries don't fire, replace-prior-state stops the old cron. - Server-side end-to-end: schedule.fire → command.run with the right job_id / kind / args, plus jobs row with actor_kind="schedule" and scheduled_id linking back to the schedule. Persistence of next-fire times across agent restarts is deliberately deferred. A missed fire window during downtime simply fires once on reconnect — that's the desirable behaviour (the operator wants the missed backup to run, not be silently skipped because we lost track of when it was due). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:29:12 +01:00


62 Commits