restic-manager

Author	SHA1	Message	Date
steve	28c8b58f93	ui: per-host Jobs sub-tab; drop unused Settings stub Adds /hosts/{id}/jobs page listing recent jobs for the host (newest first, capped at 100) with click-through to /jobs/{id}. Converts the Jobs placeholder <div> to a real <a> nav link; removes the Settings stub entirely. Also registers durationHuman template func and a .jobs-row CSS grid to match the existing .schd-row idiom.	2026-05-07 22:49:10 +01:00
steve	b9c7ec6ebf	store: history table helpers (upsert/list, COALESCE preserves prior values)	2026-05-07 18:43:20 +01:00
steve	da518de3e6	store: migration 0023 host_repo_stats_history	2026-05-07 18:39:44 +01:00
steve	0a75b82c17	fix: project finished backup jobs onto host row + smoke path tweaks The dashboard's 'Last backup' column reads hosts.last_backup_at / last_backup_status, but the WS handler only updated hosts.repo_status on job.finished — backup terminations were silently dropped. Add a SetHostLastBackup store method and call it from the same job.finished switch that already handles init jobs. Also: CLAUDE.md restage block uses /tmp/rm-smoke (the original default) but the actual dev env runs out of $HOME/smoke. Update the paths in the doc to match.	2026-05-07 17:55:23 +01:00
steve	9d5775fb47	p6-01/02: agent self-update + fleet update server cluster - alert: update_failed (per-host, dedup=hostID) + fleet_update_halted (system-scoped, host_id NULL via new RaiseOrTouchSystem helper). - ws: UpdateWatcher tracks in-flight command.update dispatches and reconciles them against incoming hello envelopes — success path marks the job succeeded and auto-resolves the alert; 90s timeout marks the job failed and raises update_failed. - http: POST /api/hosts/{id}/update (admin-only JSON) + the HTMX /hosts/{id}/update form variant. Pre-checks: host exists, online, agent_version != current, no running update job. Refactored core into Server.dispatchHostUpdate so the fleet worker can share it without going through HTTP. - fleetupdate: rolling worker iterating through host slots, halting on first failure and raising fleet_update_halted. Polling-based version-match (re-read hosts.agent_version every 1s up to 95s) — no extra plumbing into the WS hello path. At-most-one-running is enforced at the store layer (ErrFleetUpdateRunning). - cmd/server: wire UpdateWatcher and FleetWorker into the main goroutine; the worker uses a small serverDispatcher adapter that delegates back into Server.DispatchHostUpdate. Tests: watcher (success/timeout/mismatch/late-hello), HTTP endpoint (happy + four pre-check branches + RBAC), worker (two-host happy, timeout-halt, host-offline-halt, already-at-target skip, cancel mid-run, double-Start guard).	2026-05-06 22:03:50 +01:00
steve	c37954aa3f	store: migrations 0021+0022 + fleet_updates CRUD	2026-05-06 21:47:54 +01:00
steve	02e4ef7544	testing: bootstrap UI, agent reliability, NS-01..04 + alert username Smoothes the rough edges that came up exercising a live deployment. First-run bootstrap UI: /bootstrap renders a username + password form that uses the in-memory token directly (operator no longer copies it out of the log); /login redirects there while bootstrap is available. Agent reliability: failJob synthetic envelopes so command.run early returns no longer hang the server-side job; runtime probe of restic restore --help drives --no-ownership instead of version sniffing (0.18.x had it removed). Server unit re-shaped: ProtectSystem=full plus ReadWritePaths=/etc/restic-manager, no ProtectHome — restore can now write anywhere a user might want. Restore wizard: default target is /root/rm-restore/<job-id>/ with clearer help text. Re-init confirm input uses .field (was .input, which doesn't exist — text was invisible). NS-01 host delete: store DeleteHost, admin-band /hosts/{id}/delete with hostname-confirm danger zone, audit, FK cascade, live WS close. NS-02 enrollment-token recovery: outstanding-tokens panel on /hosts/new, regenerate (preserves attachments) and revoke handlers + audit, store-level ListOutstandingEnrollmentTokens and DeleteEnrollmentToken. NS-03 repo init / probe surface: migration 0020 adds hosts.repo_status + repo_status_error; WS handler projects every init job's outcome onto the host row (idempotent already-initialised collapses to ready); creds-save resets status and dispatches a fresh probe; /hosts/{id}/repo/probe retry endpoint with banner. NS-04 dashboard live + sort + filter: query-string filter (q/status/repo_status/tag/sort/dir), 5s htmx live poll mirroring the alerts pattern with a localStorage live toggle, sortable column headers, filter row + clear. Alerts page: ack'd-by line resolves user_id ULID to username. Compose.yaml ignored — host-specific.	2026-05-05 22:03:15 +01:00
steve	e2976a42e6	store: oidc_state CRUD + 5-minute cleanup	2026-05-05 13:15:45 +01:00
steve	14be63510c	store: round-trip IDToken on sessions for RP-initiated logout	2026-05-05 13:14:27 +01:00
steve	70aa22e87e	store: GetUserByOIDCSubject + scanUser auth_source/oidc_subject	2026-05-05 13:12:11 +01:00
steve	154b57a4cd	store: extend User with AuthSource/OIDCSubject; Session with IDToken	2026-05-05 13:09:49 +01:00
steve	c5b29b88b9	store: migration 0019 — users.auth_source/oidc_subject + sessions.id_token + oidc_state	2026-05-05 13:08:15 +01:00
steve	168059ae45	feat(hosts): per-host tags edit + dashboard chip-row filter (P4-07)	2026-05-05 11:16:09 +01:00
steve	2d9e53b025	ui(users): record last_login on /setup + sortable headers	2026-05-05 10:57:25 +01:00
steve	e76a383813	store: DeleteSessionsByUserID for force-logout	2026-05-05 10:57:24 +01:00
steve	93d857d995	store: user_setup_tokens CRUD + cleanup-expired	2026-05-05 10:57:24 +01:00
steve	dafdfcda3f	store: lowercase username, email/disable helpers, last-admin count	2026-05-05 10:57:24 +01:00
steve	c6fbe7c0e0	store: extend User struct with Email, DisabledAt, MustChangePassword	2026-05-05 10:57:24 +01:00
steve	a1d307fafa	store: migration 0018 — user_setup_tokens	2026-05-05 10:57:24 +01:00
steve	9712c65b04	store: migration 0017 — users.email, disabled_at, must_change_password	2026-05-05 10:57:24 +01:00
steve	4f66cc2b34	feat(audit): clickable column headers with asc/desc sort	2026-05-05 08:15:22 +01:00
steve	16c77a8cc5	feat(audit): P3-08 — audit log UI with filters	2026-05-05 07:49:25 +01:00
steve	350be3f19d	feat(alerts): per-source-group dedup so two failing backups produce two alerts Until now the open-alert key was (host_id, kind, resolved_at IS NULL). A host with two source groups both failing collapsed onto one backup_failed row — second failure bumped last_seen_at and overwrote the message but never re-fan-out. Operators saw one alert that appeared to flap, not two distinct broken things. Schema changes (column-level ALTER, no rebuild): - 0015 jobs.source_group_id (FK → source_groups, ON DELETE SET NULL, index). Populated for backup jobs in CreateJob. - 0016 alerts.dedup_key (NOT NULL DEFAULT ''). The old alerts_open partial index gets dropped and replaced with a UNIQUE partial index on (host_id, kind, dedup_key) WHERE resolved_at IS NULL — the index is now the actual dedup primitive. Plumbing: - RaiseOrTouch / AutoResolve / Alert struct gain dedup_key. - engine.JobFinishedEvent gains SourceGroupID; handleJobFinished passes it through for backup_failed only (forget/prune/check stay repo-scoped with key=''). - ws.handler reads SourceGroupID off the freshly-loaded job row. - dispatchJobWithPayload gains a *string sourceGroupID arg; the per-group Run-now path and schedule.fire path pass &g.ID. Test coverage: TestRaiseOrTouchDedupsPerSourceGroup proves two distinct groups produce two distinct open alerts and that resolving one does not auto-resolve the other. Dev tool: cmd/_fake_alert gains -dedup-key flag.	2026-05-04 22:59:48 +01:00
steve	d830635a2e	fix: enabled toggle — list-row click + edit-form save Two bugs in the channel-enabled affordance: 1. List-row toggle was a static span with no handler; the row's row-link overlay swallowed every click and routed to /edit. Add POST /settings/notifications/{id}/toggle backed by a new store method SetNotificationChannelEnabled, and turn the row toggle into an htmx-driven button that swaps in the new state. Use event.stopPropagation() on the toggle so it beats the row link. 2. Edit-form toggle visually flipped but the underlying checkbox reverted: the visual span lives inside the <label>, so clicking it fired the inline JS handler AND the label's native checkbox-toggle, cancelling out. Bind to the checkbox 'change' event instead and let the label do the toggling — the JS just mirrors check.checked into the .on class.	2026-05-04 22:21:45 +01:00
steve	cbdaa4daeb	fix: refresh hosts.open_alert_count on Raise/Resolve/AutoResolve The denormalised projection was never written by the alerts code path, so the dashboard's OPEN ALERTS card and the per-host alerts column always read 0 regardless of how many alerts were open. fleet.GetStats sums hosts.open_alert_count; if it never moves, the card is decoration. Add refreshHostOpenAlertCount that recomputes from the alerts table (self-healing — no +/- bookkeeping to drift). Call it after the commit in RaiseOrTouch when a row was inserted, after Resolve, and after AutoResolve. Caught during the live sweep: a synthetic critical raised the count to 1, but resolving it left the dashboard reading '1 unresolved' indefinitely.	2026-05-04 21:01:17 +01:00
steve	c710743231	alert: wire engine into ws hello + MarkJobFinished + offline sweep - ws.HandlerDeps gains an AlertEngine *alert.Engine field; populated from http.Deps.AlertEngine (nil until G1 constructs the engine) - runAgentLoop calls NotifyHostOnline after MarkHostHello succeeds - dispatchAgentMessage MsgJobFinished case calls NotifyJobFinished, looking up the job Kind via Store.GetJob before notifying - store.MarkHostsOfflineStaleReturnIDs added: SELECT+UPDATE in one transaction, returns the IDs that flipped to offline - offline sweeper in cmd/server/main.go switched to the new variant; TODO(G1) comment marks where NotifyHostOffline calls will land	2026-05-04 19:54:39 +01:00
steve	8a92fedba1	store: notification_channels CRUD + AppendNotificationLog	2026-05-04 19:28:41 +01:00
steve	7c62d111d5	store: alerts CRUD with dedup + last_seen_at bump	2026-05-04 19:24:17 +01:00
steve	b2dffb1d83	store: migration 0014 — notification_channels + notification_log	2026-05-04 19:20:37 +01:00
steve	db71e006bb	store: A1 — check rows.Err() + Scan err in migrate_test Code-quality nits flagged in review of `2692c66`. Mirrors the existing pattern in host_credentials_test.go.	2026-05-04 19:19:28 +01:00
steve	2692c660c5	store: migration 0013 — alerts.last_seen_at	2026-05-04 19:16:59 +01:00
steve	a781e95c94	P3 follow-up: editable target dir, conditional --no-ownership, UK lint Three small follow-ups from review: 1. Restore target is now operator-editable. Default value is the literal '\$HOME/rm-restore/<job-id>/' (agent expands \$HOME at run time using os.UserHomeDir(); also handles \${HOME} and ~/ prefixes). Operator can replace with any absolute path. - ui_restore.go validates the input is either absolute or starts with one of the recognised prefixes; other env-var refs (\$PATH etc.) are deliberately rejected so operator paths can't pick up arbitrary agent env values. - host_restore.html replaces the read-only mono-text display with a real <input>; help text spells out that \$HOME resolves agent-side and <job-id> is substituted on dispatch. - install.sh + the systemd unit prep /root/rm-restore so the default works under the sandbox: ReadWritePaths gains a soft '-/root/rm-restore' entry (the '-' makes the bind-mount soft-fail if missing, but install.sh pre-creates it root-owned 0700). 2. --no-ownership flag now gated on restic version. The flag was added in restic 0.17 and 0.16 rejects it. Previously dropped it wholesale — that meant new-dir restores silently preserved ownership against design intent on 0.17+. Now the agent threads its detected restic version (sysinfo already collects it) through runner.Config -> restic.Env, and RunRestore appends --no-ownership only when AtLeastVersion(0, 17) returns true. 0.16 hosts still restore with original uid/gid; help text in the wizard explicitly notes this. The previous 'Original ownership is preserved' copy was wrong for new-dir mode and is corrected. 3. golangci-lint misspell locale switched US -> UK and the codebase swept (73 corrections, mostly behaviour/serialise/recognise/honour). Wire-format ErrorCode 'unauthorized' -> 'unauthorised' is a tiny contract change but the agent doesn't parse those codes today and no external API consumers exist yet. Tests passed before + after. 
Tests: - internal/restic/version_test.go covers Env.AtLeastVersion across edge cases (empty, exact match, patch above, minor below, non- numeric) and expandHome on \$HOME / \${HOME} / ~/, plus pass-through for absolute paths and refusal of other env vars. - ui_restore_test updated: TargetDir now starts '\$HOME/rm-restore/' with the job_id substituted into the placeholder. Live verified on the smoke env: default target restored to /root/rm-restore/<job-id>/ as the agent's expanded \$HOME (2 files, 14 bytes); custom override '/tmp/custom-restore/<job-id>/' restored into the agent's PrivateTmp namespace (1 file, 6 bytes); both jobs 'succeeded', exit 0.	2026-05-04 17:27:52 +01:00
steve	4c108bb68a	P3-01/02/03: restore wizard backend + templates + restore-shaped job page End-to-end wizard from /hosts/{id}/restore (or per-snapshot deep link /hosts/{id}/snapshots/{sid}/restore) → tree-browse → dispatch → restore-shaped live job page. Backend (internal/server/http/ui_restore.go): - GET handlers render the four-step wizard against the wireframe shape in docs/superpowers/specs/2026-05-04-p3-restore-design.md. - HTMX tree partial endpoint hits fetchTreeWithCache (P3-X2) so each directory expansion is a sub-second cached lookup after the first miss. - POST validates: snapshot_id non-empty, ≥1 absolute path, in-place mode requires confirm_hostname == host name, agent online. On error re-renders the wizard with the operator's input intact. Happy path mints a job_id, computes the new-directory target as /var/restic-restore/<job-id>/ (operator can't escape the prefix — server picks it), creates the job row, ships command.run with kind=restore + RestorePayload, writes a host.restore audit row, returns HX-Redirect (or 303) to the live job page. Templates: - host_restore.html: single-page progressively-enabled wizard matching _diag/p3-restore-wizard wireframe. Form-state-driven JS computes a running tally of selected paths and the step-4 confirm summary client-side; the server re-renders on validation failure with form fields preserved. - partials/tree_node.html: recursive HTMX-served tree fragment. - Top-level Restore button on host_detail right rail + per-snapshot Restore action on snapshot rows replace the previous P3-stub. Restore-shaped job page (job_detail.html): - Progress widget rendered as a panel rather than a bare strip when the job is active. - Current-file display under the bar, updated from log.stream stdout lines that look like absolute paths. Hidden for non-restore kinds. Migration 0012: - Add restore + diff to the jobs.kind CHECK. Rebuild required (SQLite can't ALTER CHECK in place); follows the safe pattern from 0005. 
Defensive: stash job_logs into a temp table before the rebuild and INSERT OR IGNORE back afterwards so even if SQLite cascades on DROP TABLE jobs the log history survives. Tests: - ui_restore_test covers GET step-1 render, GET pre-selected snapshot summary card, POST missing snapshot, POST missing paths, POST in-place wrong-hostname rejection (no command.run leaks to the agent), POST happy path (HX-Redirect + correct payload + audit row), POST against offline host returns 503. Restage block (CLAUDE.md) deferred to the end of the restore phase.	2026-05-04 15:34:29 +01:00
steve	cd80be3b13	store+server: P2-18a announce-and-approve schema + endpoint migration 0011 adds pending_hosts table (id, hostname, public_key, fingerprint, expiry). store/pending_hosts.go covers full CRUD plus hostname-collision count + expired-row sweeper. POST /api/agents/announce takes {hostname, os, arch, agent_version, restic_version, public_key (base64)}, returns {pending_id, fingerprint, hostname_collision}. Per-source-IP token-bucket rate limit (10/min) + global cap of 100 in-flight rows. Public key must be exactly 32 bytes (Ed25519).	2026-05-04 11:03:41 +01:00
steve	7b1990cf11	agent+server: P2R-11 pre/post hook execution for backup jobs Agent: new runner.BackupHooks struct + runHook helper invoked via /bin/sh -c (cmd.exe /C on Windows). pre_hook non-zero exit aborts the backup; post_hook always runs with RM_JOB_STATUS=succeeded\|failed in env. Output streamed as 'hook(<phase>): …' log.stream lines. Hooks only run for kind=backup (other kinds skip both phases). Server: resolveBackupHooks resolves group → host default → empty, decrypts via crypto.AEAD with per-slot ad bytes, plumbs plaintext into CommandRunPayload for both schedule.fire and per-group Run-now dispatch sites. Decrypt failures degrade silently to no hook so a malformed blob can't poison every backup.	2026-05-04 10:57:28 +01:00
steve	18b0bf976d	store: P2R-10 schema for source-group + host-default hooks (migration 0010) Adds pre_hook/post_hook BLOB columns to source_groups and pre_hook_default/post_hook_default to hosts. Bytes stored verbatim (AEAD encrypt/decrypt happens at the HTTP layer where the AEAD key lives). Round-trip tests cover set/clear semantics on both tables.	2026-05-04 10:52:16 +01:00
steve	d02a093eeb	ui+server: schedule next-run / last-run on dashboard + schedules tab P2R-14. New store.LatestJobBySchedule query (per-schedule fired job). Schedules-tab handler computes next-fire from cron + last-fire from the jobs table per row. Schedules table grows two columns; dashboard host row prepends 'next 12h ago/from now' to the existing last-backup line when a single covering schedule is the run-now candidate. Embeds store.Schedule into scheduleRow so existing template field references keep working without bulk renames.	2026-05-04 10:44:31 +01:00
steve	9ec69456fe	store: LatestJobByKind includes in-flight jobs (avoid maintenance double-fire) Widen the SQL query to consider all statuses (queued, running, succeeded, failed, cancelled) rather than terminal-only. An in-flight prune that outlasts the 60s tick interval previously produced ErrNotFound, causing the ticker to anchor at now-24h and fire a second prune concurrently with the first. Update the doc comment and test: remove the "queued job filtered out" case, add assertions that a running job and a queued job are each returned as the latest.	2026-05-04 10:19:15 +01:00
steve	18a4f74a22	server: enqueue pending_runs when scheduled-job dispatch fails When dispatchBackupForGroup's conn.Send errors, queue a pending_runs row (attempt=1, next_attempt_at = now + group.RetryBackoffSeconds) instead of silently dropping the fire. The orphaned queued job row is left behind for forensic visibility — the drainer will create a fresh job row on its retry. Also adds Store.ListPendingRunsForHost — the on-reconnect drain walks every row for the host, regardless of due-ness, since the host being back makes 'due' irrelevant.	2026-05-04 10:19:15 +01:00
steve	ae96983877	maintenance: pure-logic ticker decides forget/prune/check fires	2026-05-04 10:19:15 +01:00
steve	11cbc2fb7f	store: tighten CHECK constraint on host_repo_stats.last_check_status	2026-05-04 10:19:15 +01:00
steve	5200e44536	store: wrap UpsertHostRepoStats in a transaction (concurrency safety)	2026-05-04 10:19:15 +01:00
steve	84a8c060b6	store: assert CHECK constraint on host_credentials.kind	2026-05-04 10:19:15 +01:00
steve	cfe25b9799	store: HostRepoStats projection (size, lock, last-check, last-prune)	2026-05-04 10:19:15 +01:00
steve	f801fdf65b	store: host_credentials becomes kind-aware (repo + admin slots)	2026-05-04 10:19:15 +01:00
steve	9f2cb18e42	store: migration 0009 — admin-creds kind + host_repo_stats	2026-05-04 10:19:15 +01:00
steve	b6f8de1dcc	lint: drive baseline to zero, drop only-new-issues gate Cleanup pass over the repo so CI can enforce lint going forward without the only-new-issues escape hatch: * gofumpt -w across the tree (31 hits, all formatting) * misspell --fix (25 hits, US-locale spelling) — but reverted on api.JobCancelled = "cancelled" since that literal is the wire + DB CHECK constraint value, plus matched the case in store/fleet.go back to "cancelled" and added //nolint:misspell on both for the next time someone reaches for the auto-fix * Wrap every `defer rows.Close()` / `defer stmt.Close()` / `defer res.Body.Close()` in `defer func() { _ = .Close() }()` to satisfy errcheck without losing the close itself * websocket.Dial callers (1 prod, 4 tests) now capture + close the upgrade response Body — coder/websocket can return res with a nil Body on success, so the test deferred-closes guard against that * Annotate the two genuine-by-design nilerr cases with //nolint comments explaining why nil-on-error is the contract (cookie missing = no session; ctx cancelled mid-backoff = clean shutdown) * Add brief godoc on the 10 exported const groups + types that revive flagged (api.HostOS/HostArch/JobKind/JobStatus/LogStream/ ErrorCode, restic.EventKind, store.Role, web.FS) * Drop the unused (Server).userByID method Inline the unparam baseView(active) — every UI page is under the dashboard primary nav today Result: `golangci-lint run ./...` reports 0 issues. CI lint job no longer needs only-new-issues: true; X-06 follow-up entry in tasks.md removed.	2026-05-03 16:15:17 +01:00
steve	ec0bf0f6c3	P2R-01: REST + WS rewire against the slim shape Schedules CRUD now takes {cron, enabled, source_group_ids[]} with cron parsed via robfig/cron/v3 and group membership scoped to the host. New source-groups CRUD lives at /api/hosts/{id}/source-groups; delete refuses with 409 if any schedule still references the group, returning the schedule list so the UI can prompt 'remove from these schedules first.' Repo-maintenance GET/PUT manages forget/prune/check cadences on host_repo_maintenance — no version bump, the server-side ticker (P2R-06) drives execution. Per-source-group Run-now (POST /hosts/{id}/source-groups/{gid}/run) resolves the group's includes/excludes/retention/tag and dispatches a backup command.run with the new structured CommandRunPayload fields (Includes/Excludes/Tag). Old per-host /hosts/{id}/run-backup and /hosts/{id}/init-repo return 410 Gone with a redirect message. schedule_push.go is rebuilt: buildScheduleSetPayload assembles the slim wire shape, pushScheduleSetOnConn ships it during the on-hello window, pushScheduleSetAsync fires after every CRUD mutation, and dispatchScheduledJob handles agent schedule.fire by iterating the schedule's source groups and dispatching one backup per group with actor_kind=schedule and scheduled_id pointing at the schedule. Auto-init at first WS connect: when the host has repo creds bound and no init job in its history, server dispatches restic init. Restic's 'config file already exists' soft-success means re-runs against an existing repo no-op; we don't auto-retry on failure (operator triggers re-init manually via the danger zone in P2R-09). api.Schedule drops Kind/Paths/Excludes/Tags/RetentionPolicy/Manual etc. in favour of {id, cron, enabled, source_groups: [...]}. The agent scheduler stops checking sch.Manual; cmd/agent's backup dispatch reads Includes/Excludes/Tag instead of Args. 
Tests cover the new HTTP surface end-to-end: source-groups CRUD with in-use refusal, schedule validation (bad cron / missing groups / foreign group), repo-maintenance auto-seed and validation, the 410 route, and buildScheduleSetPayload's wire-shape correctness. Full suite passes; smoke env exercises auto-init dispatch on hello, async push after schedule create, and per-source-group Run-now landing the right paths/excludes/tag at the agent.	2026-05-03 10:56:40 +01:00
steve	e7eea7afac	P2 redesign · phase 2: store rewrite — sources, slim schedules, repo maintenance Go-side data model rebuilt against migration 0008. The fat-Schedule shape (paths/excludes/tags/retention/manual/kind/options/hooks) is gone; that surface lives on source_groups now. * store/types.go - Schedule slimmed to {id, host_id, cron, enabled, source_group_ids, timestamps}. SourceGroupIDs populated by Get/List, accepted on Create/Update so callers pass desired junction state in one shape. - SourceGroup added: name (= snapshot tag), includes/excludes, retention_policy, retry_max + retry_backoff_seconds, cached conflict_dimension. - HostRepoMaintenance added: forget/prune/check cadences + enabled. - PendingRun added: offline-retry queue. - Host loses RepoInitialisedAt; gains BandwidthUpKBps + BandwidthDownKBps. - RetentionPolicy moves home from "schedule field" to "source group field" but the type itself + Summary() method unchanged. * store/sources.go (new) — CRUD + GetByName + ConflictDimension cache. Group writes bump host_schedule_version; conflict cache writes don't (server-internal projection, agent doesn't see it). * store/maintenance.go (new) — CreateDefault is idempotent (INSERT OR IGNORE). UpdateRepoMaintenance doesn't bump schedule version because these run on the server's own ticker, not the agent's local cron. * store/pending.go (new) — Enqueue / DueRunsForRetry / Bump / Delete. * store/schedules.go — rewritten for slim shape + junction CRUD. Update wipes the schedule_source_groups junction wholesale and re-inserts (simpler than diffing). Adds SchedulesUsingGroup for retention-conflict detection + UI labels. * store/hosts.go — drops repo_initialised_at scan, adds bandwidth scan. New SetHostBandwidth helper. * HTTP layer — temporarily stubbed during this rewrite (501 returns with redesign_in_progress error code). 
Phase 3 fills these in against the new shape: - schedules.go REST CRUD - schedule_push.go agent reconciliation - ui_schedules.go HTML form CRUD Run-now-per-host + Init-repo handlers in ui_handlers.go also stubbed — both go away in the new model (Run-now per source group; auto-init at host enrolment). * enrollment.go — replaces "seed manual schedule from typed paths" with "seed default source group + repo-maintenance row." The default group gets the typed paths as its includes; operator edits later via Sources tab. * ws/handler.go — drops the MarkHostRepoInitialised projection (column is gone; auto-init makes it derivable from latest init job's status). Tests: * store: existing schedule test rewritten for slim shape + junction; new sources_test.go covers source-group CRUD, name uniqueness, conflict cache, repo-maintenance defaults + idempotent seed, pending-runs queue lifecycle. * http: schedules_test.go and schedule_push_test.go deleted — both exercised the obsolete fat-schedule API. Phase 3 rewrites them against the new endpoints. go test ./... green. cmd/server + cmd/agent build. The UI is broken end-to-end (schedules / sources / repo tabs all hit 501 stubs); Phase 3 restores REST + on-the-wire reconciliation; Phase 4 rewires the UI templates against the new model. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 21:30:41 +01:00
steve	49ecb7c771	P2 redesign · phase 1: migration 0008 — sources + repo maintenance Schema rebuild for the model collapse described in design/v4-sources-redesign.html. Three nouns now stand on their own: * schedules — slim. Only cron + enabled + host_id. Fat-schedule shape (paths/excludes/tags/retention/manual/kind/options/hooks) is dropped wholesale. Schedule data wiped — by design (smoke env was nuked before this ran; fresh installs have nothing to lose). * source_groups — name + includes + excludes + retention_policy + retry policy + cached conflict_dimension. Group name doubles as the snapshot tag so retention can target it cleanly. UNIQUE (host_id, name) enforces tag unambiguity. * schedule_source_groups — N:M junction. One schedule can fire N groups per tick; one group can be referenced by N schedules. * host_repo_maintenance — 1:1 with hosts. Default cadences: forget daily 03:00, prune weekly Sun 04:00, check monthly 1st 05:00 with --read-data-subset 5%. Operator can edit on Repo tab. * pending_runs — offline-retry queue. Server-side ticker dispatches due rows; bounded by source_groups.retry_max + retry_backoff_seconds. Plus: * hosts.bandwidth_up_kbps / .bandwidth_down_kbps — host-wide caps. * hosts.repo_initialised_at — DROPPED. Auto-init on enrol makes it derivable from the latest init job; the Init-repo button goes too (failure surfaces via job history banner). Note on FK safety: smoke env was wiped before migration ran, so DROP TABLE schedules cascades to nothing. Fresh installs apply 0001-0007 then immediately 0008 — same story (no schedule rows to lose). For an upgrade path on a populated DB, this migration would need a data-preserving variant; not needed today. Tests fail to compile/run after this — expected. The Go side (store types, CRUD, REST handlers, agent runner, UI templates) gets rebuilt in subsequent phases. tasks.md will track P2 redesign progress. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 20:54:01 +01:00
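
Commit a781e95c94 above gates the restore flag --no-ownership on the detected restic version through AtLeastVersion(0, 17). The following is a minimal sketch of that kind of gate, not the repository's code: the function names atLeastVersion and restoreArgs, the argument layout, and the choice to treat unparseable versions as "too old" are all assumptions made for illustration. Note that commit 02e4ef7544 later replaces version sniffing with a runtime probe of restic restore --help, so this shows the earlier approach only.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// atLeastVersion reports whether version (for example "0.17.3") is at least
// major.minor. Hypothetical stand-in for the Env.AtLeastVersion named in the
// log; empty or non-numeric input is treated as "not at least" (assumption).
func atLeastVersion(version string, major, minor int) bool {
	parts := strings.Split(strings.TrimSpace(version), ".")
	if len(parts) < 2 {
		return false
	}
	gotMajor, err := strconv.Atoi(parts[0])
	if err != nil {
		return false
	}
	gotMinor, err := strconv.Atoi(parts[1])
	if err != nil {
		return false
	}
	if gotMajor != major {
		return gotMajor > major
	}
	// Same major: the patch component never changes the answer.
	return gotMinor >= minor
}

// restoreArgs builds an illustrative restic argument list and appends
// --no-ownership only when the version gate passes.
func restoreArgs(version, snapshotID, target string) []string {
	args := []string{"restore", snapshotID, "--target", target}
	if atLeastVersion(version, 0, 17) {
		args = append(args, "--no-ownership")
	}
	return args
}

func main() {
	fmt.Println(restoreArgs("0.17.3", "abc123", "/tmp/out"))
	fmt.Println(restoreArgs("0.16.4", "abc123", "/tmp/out"))
}
```

The edge cases listed in the commit's test notes (empty, exact match, patch above, minor below, non-numeric) all fall out of the two Atoi checks and the final comparison.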


66 Commits