Commit Graph

66 Commits

Author SHA1 Message Date
steve 28c8b58f93 ui: per-host Jobs sub-tab; drop unused Settings stub
Adds /hosts/{id}/jobs page listing recent jobs for the host (newest
first, capped at 100) with click-through to /jobs/{id}. Converts the
Jobs placeholder <div> to a real <a> nav link; removes the Settings
stub entirely. Also registers durationHuman template func and a
.jobs-row CSS grid to match the existing .schd-row idiom.
2026-05-07 22:49:10 +01:00
steve b9c7ec6ebf store: history table helpers (upsert/list, COALESCE preserves prior values) 2026-05-07 18:43:20 +01:00
steve da518de3e6 store: migration 0023 host_repo_stats_history 2026-05-07 18:39:44 +01:00
steve 0a75b82c17 fix: project finished backup jobs onto host row + smoke path tweaks
The dashboard's 'Last backup' column reads hosts.last_backup_at /
last_backup_status, but the WS handler only updated hosts.repo_status
on job.finished — backup terminations were silently dropped. Add a
SetHostLastBackup store method and call it from the same job.finished
switch that already handles init jobs.

Also: CLAUDE.md restage block uses /tmp/rm-smoke (the original
default) but the actual dev env runs out of $HOME/smoke. Update the
paths in the doc to match.
2026-05-07 17:55:23 +01:00
steve 9d5775fb47 p6-01/02: agent self-update + fleet update server cluster
- alert: update_failed (per-host, dedup=hostID) + fleet_update_halted
  (system-scoped, host_id NULL via new RaiseOrTouchSystem helper).
- ws: UpdateWatcher tracks in-flight command.update dispatches and
  reconciles them against incoming hello envelopes — success path
  marks the job succeeded and auto-resolves the alert; 90s timeout
  marks the job failed and raises update_failed.
- http: POST /api/hosts/{id}/update (admin-only JSON) + the HTMX
  /hosts/{id}/update form variant. Pre-checks: host exists, online,
  agent_version != current, no running update job. Refactored core
  into Server.dispatchHostUpdate so the fleet worker can share it
  without going through HTTP.
- fleetupdate: rolling worker iterating through host slots, halting
  on first failure and raising fleet_update_halted. Polling-based
  version-match (re-read hosts.agent_version every 1s up to 95s) —
  no extra plumbing into the WS hello path. At-most-one-running is
  enforced at the store layer (ErrFleetUpdateRunning).
- cmd/server: wire UpdateWatcher and FleetWorker into the main
  goroutine; the worker uses a small serverDispatcher adapter that
  delegates back into Server.DispatchHostUpdate.

Tests: watcher (success/timeout/mismatch/late-hello), HTTP endpoint
(happy + four pre-check branches + RBAC), worker (two-host happy,
timeout-halt, host-offline-halt, already-at-target skip, cancel
mid-run, double-Start guard).
2026-05-06 22:03:50 +01:00
steve c37954aa3f store: migrations 0021+0022 + fleet_updates CRUD 2026-05-06 21:47:54 +01:00
steve 02e4ef7544 testing: bootstrap UI, agent reliability, NS-01..04 + alert username
Smoothes the rough edges that came up exercising a live deployment.

First-run bootstrap UI: /bootstrap renders a username + password form
that uses the in-memory token directly (operator no longer copies it
out of the log); /login redirects there while bootstrap is available.

Agent reliability: failJob synthetic envelopes so command.run early
returns no longer hang the server-side job; runtime probe of restic
restore --help drives --no-ownership instead of version sniffing
(0.18.x had it removed). Server unit re-shaped: ProtectSystem=full
plus ReadWritePaths=/etc/restic-manager, no ProtectHome — restore
can now write anywhere a user might want.

Restore wizard: default target is /root/rm-restore/<job-id>/ with
clearer help text. Re-init confirm input uses .field (was .input,
which doesn't exist — text was invisible).

NS-01 host delete: store DeleteHost, admin-band /hosts/{id}/delete
with hostname-confirm danger zone, audit, FK cascade, live WS close.

NS-02 enrollment-token recovery: outstanding-tokens panel on
/hosts/new, regenerate (preserves attachments) and revoke handlers
+ audit, store-level ListOutstandingEnrollmentTokens and
DeleteEnrollmentToken.

NS-03 repo init / probe surface: migration 0020 adds
hosts.repo_status + repo_status_error; WS handler projects every
init job's outcome onto the host row (idempotent already-initialised
collapses to ready); creds-save resets status and dispatches a fresh
probe; /hosts/{id}/repo/probe retry endpoint with banner.

NS-04 dashboard live + sort + filter: query-string filter
(q/status/repo_status/tag/sort/dir), 5s htmx live poll mirroring the
alerts pattern with a localStorage live toggle, sortable column
headers, filter row + clear.

Alerts page: ack'd-by line resolves user_id ULID to username.

Compose.yaml ignored — host-specific.
2026-05-05 22:03:15 +01:00
steve e2976a42e6 store: oidc_state CRUD + 5-minute cleanup 2026-05-05 13:15:45 +01:00
steve 14be63510c store: round-trip IDToken on sessions for RP-initiated logout 2026-05-05 13:14:27 +01:00
steve 70aa22e87e store: GetUserByOIDCSubject + scanUser auth_source/oidc_subject 2026-05-05 13:12:11 +01:00
steve 154b57a4cd store: extend User with AuthSource/OIDCSubject; Session with IDToken 2026-05-05 13:09:49 +01:00
steve c5b29b88b9 store: migration 0019 — users.auth_source/oidc_subject + sessions.id_token + oidc_state 2026-05-05 13:08:15 +01:00
steve 168059ae45 feat(hosts): per-host tags edit + dashboard chip-row filter (P4-07) 2026-05-05 11:16:09 +01:00
steve 2d9e53b025 ui(users): record last_login on /setup + sortable headers 2026-05-05 10:57:25 +01:00
steve e76a383813 store: DeleteSessionsByUserID for force-logout 2026-05-05 10:57:24 +01:00
steve 93d857d995 store: user_setup_tokens CRUD + cleanup-expired 2026-05-05 10:57:24 +01:00
steve dafdfcda3f store: lowercase username, email/disable helpers, last-admin count 2026-05-05 10:57:24 +01:00
steve c6fbe7c0e0 store: extend User struct with Email, DisabledAt, MustChangePassword 2026-05-05 10:57:24 +01:00
steve a1d307fafa store: migration 0018 — user_setup_tokens 2026-05-05 10:57:24 +01:00
steve 9712c65b04 store: migration 0017 — users.email, disabled_at, must_change_password 2026-05-05 10:57:24 +01:00
steve 4f66cc2b34 feat(audit): clickable column headers with asc/desc sort 2026-05-05 08:15:22 +01:00
steve 16c77a8cc5 feat(audit): P3-08 — audit log UI with filters 2026-05-05 07:49:25 +01:00
steve 350be3f19d feat(alerts): per-source-group dedup so two failing backups produce two alerts
Until now the open-alert key was (host_id, kind, resolved_at IS NULL).
A host with two source groups both failing collapsed onto one
backup_failed row — second failure bumped last_seen_at and
overwrote the message but never re-fan-out. Operators saw one
alert that appeared to flap, not two distinct broken things.

Schema changes (column-level ALTER, no rebuild):

- 0015 jobs.source_group_id (FK → source_groups, ON DELETE SET NULL,
  index). Populated for backup jobs in CreateJob.
- 0016 alerts.dedup_key (NOT NULL DEFAULT ''). The old alerts_open
  partial index gets dropped and replaced with a UNIQUE partial
  index on (host_id, kind, dedup_key) WHERE resolved_at IS NULL —
  the index is now the actual dedup primitive.

Plumbing:

- RaiseOrTouch / AutoResolve / Alert struct gain dedup_key.
- engine.JobFinishedEvent gains SourceGroupID; handleJobFinished
  passes it through for backup_failed only (forget/prune/check stay
  repo-scoped with key='').
- ws.handler reads SourceGroupID off the freshly-loaded job row.
- dispatchJobWithPayload gains a *string sourceGroupID arg; the
  per-group Run-now path and schedule.fire path pass &g.ID.

Test coverage: TestRaiseOrTouchDedupsPerSourceGroup proves two
distinct groups produce two distinct open alerts and that resolving
one does not auto-resolve the other.

Dev tool: cmd/_fake_alert gains -dedup-key flag.
2026-05-04 22:59:48 +01:00
steve d830635a2e fix: enabled toggle — list-row click + edit-form save
Two bugs in the channel-enabled affordance:

1. List-row toggle was a static span with no handler; the row's
   row-link overlay swallowed every click and routed to /edit. Add
   POST /settings/notifications/{id}/toggle backed by a new store
   method SetNotificationChannelEnabled, and turn the row toggle
   into an htmx-driven button that swaps in the new state. Use
   event.stopPropagation() on the toggle so it beats the row link.

2. Edit-form toggle visually flipped but the underlying checkbox
   reverted: the visual span lives inside the <label>, so clicking
   it fired the inline JS handler AND the label's native
   checkbox-toggle, cancelling out. Bind to the checkbox 'change'
   event instead and let the label do the toggling — the JS just
   mirrors check.checked into the .on class.
2026-05-04 22:21:45 +01:00
steve cbdaa4daeb fix: refresh hosts.open_alert_count on Raise/Resolve/AutoResolve
The denormalised projection was never written by the alerts code
path, so the dashboard's OPEN ALERTS card and the per-host alerts
column always read 0 regardless of how many alerts were open.
fleet.GetStats sums hosts.open_alert_count; if it never moves, the
card is decoration.

Add refreshHostOpenAlertCount that recomputes from the alerts table
(self-healing — no +/- bookkeeping to drift). Call it after the
commit in RaiseOrTouch when a row was inserted, after Resolve, and
after AutoResolve.

Caught during the live sweep: a synthetic critical raised the count
to 1, but resolving it left the dashboard reading '1 unresolved'
indefinitely.
2026-05-04 21:01:17 +01:00
steve c710743231 alert: wire engine into ws hello + MarkJobFinished + offline sweep
- ws.HandlerDeps gains an AlertEngine *alert.Engine field; populated
  from http.Deps.AlertEngine (nil until G1 constructs the engine)
- runAgentLoop calls NotifyHostOnline after MarkHostHello succeeds
- dispatchAgentMessage MsgJobFinished case calls NotifyJobFinished,
  looking up the job Kind via Store.GetJob before notifying
- store.MarkHostsOfflineStaleReturnIDs added: SELECT+UPDATE in one
  transaction, returns the IDs that flipped to offline
- offline sweeper in cmd/server/main.go switched to the new variant;
  TODO(G1) comment marks where NotifyHostOffline calls will land
2026-05-04 19:54:39 +01:00
steve 8a92fedba1 store: notification_channels CRUD + AppendNotificationLog 2026-05-04 19:28:41 +01:00
steve 7c62d111d5 store: alerts CRUD with dedup + last_seen_at bump 2026-05-04 19:24:17 +01:00
steve b2dffb1d83 store: migration 0014 — notification_channels + notification_log 2026-05-04 19:20:37 +01:00
steve db71e006bb store: A1 — check rows.Err() + Scan err in migrate_test
Code-quality nits flagged in review of 2692c66. Mirrors the existing
pattern in host_credentials_test.go.
2026-05-04 19:19:28 +01:00
steve 2692c660c5 store: migration 0013 — alerts.last_seen_at 2026-05-04 19:16:59 +01:00
steve a781e95c94 P3 follow-up: editable target dir, conditional --no-ownership, UK lint
Three small follow-ups from review:

1. Restore target is now operator-editable. Default value is the
   literal '\$HOME/rm-restore/<job-id>/' (agent expands \$HOME at
   run time using os.UserHomeDir(); also handles \${HOME} and ~/
   prefixes). Operator can replace with any absolute path.
   - ui_restore.go validates the input is either absolute or starts
     with one of the recognised prefixes; other env-var refs (\$PATH
     etc.) are deliberately rejected so operator paths can't pick up
     arbitrary agent env values.
   - host_restore.html replaces the read-only mono-text display with
     a real <input>; help text spells out that \$HOME resolves
     agent-side and <job-id> is substituted on dispatch.
   - install.sh + the systemd unit prep /root/rm-restore so the
     default works under the sandbox: ReadWritePaths gains a soft
     '-/root/rm-restore' entry (the '-' makes the bind-mount soft-fail
     if missing, but install.sh pre-creates it root-owned 0700).

2. --no-ownership flag now gated on restic version. The flag was
   added in restic 0.17 and 0.16 rejects it. Previously dropped it
   wholesale — that meant new-dir restores silently preserved
   ownership against design intent on 0.17+. Now the agent threads
   its detected restic version (sysinfo already collects it) through
   runner.Config -> restic.Env, and RunRestore appends --no-ownership
   only when AtLeastVersion(0, 17) returns true. 0.16 hosts still
   restore with original uid/gid; help text in the wizard explicitly
   notes this. The previous 'Original ownership is preserved' copy
   was wrong for new-dir mode and is corrected.

3. golangci-lint misspell locale switched US -> UK and the codebase
   swept (73 corrections, mostly behaviour/serialise/recognise/honour).
   Wire-format ErrorCode 'unauthorized' -> 'unauthorised' is a tiny
   contract change but the agent doesn't parse those codes today and
   no external API consumers exist yet. Tests passed before + after.

Tests:
- internal/restic/version_test.go covers Env.AtLeastVersion across
  edge cases (empty, exact match, patch above, minor below, non-
  numeric) and expandHome on \$HOME / \${HOME} / ~/, plus
  pass-through for absolute paths and refusal of other env vars.
- ui_restore_test updated: TargetDir now starts '\$HOME/rm-restore/'
  with the job_id substituted into the placeholder.

Live verified on the smoke env: default target restored to
/root/rm-restore/<job-id>/ as the agent's expanded \$HOME (2 files,
14 bytes); custom override '/tmp/custom-restore/<job-id>/' restored
into the agent's PrivateTmp namespace (1 file, 6 bytes); both jobs
'succeeded', exit 0.
2026-05-04 17:27:52 +01:00
steve 4c108bb68a P3-01/02/03: restore wizard backend + templates + restore-shaped job page
End-to-end wizard from /hosts/{id}/restore (or per-snapshot deep link
/hosts/{id}/snapshots/{sid}/restore) → tree-browse → dispatch →
restore-shaped live job page.

Backend (internal/server/http/ui_restore.go):
- GET handlers render the four-step wizard against the wireframe shape
  in docs/superpowers/specs/2026-05-04-p3-restore-design.md.
- HTMX tree partial endpoint hits fetchTreeWithCache (P3-X2) so each
  directory expansion is a sub-second cached lookup after the first
  miss.
- POST validates: snapshot_id non-empty, ≥1 absolute path, in-place
  mode requires confirm_hostname == host name, agent online. On error
  re-renders the wizard with the operator's input intact. Happy path
  mints a job_id, computes the new-directory target as
  /var/restic-restore/<job-id>/ (operator can't escape the prefix —
  server picks it), creates the job row, ships command.run with
  kind=restore + RestorePayload, writes a host.restore audit row,
  returns HX-Redirect (or 303) to the live job page.

Templates:
- host_restore.html: single-page progressively-enabled wizard matching
  _diag/p3-restore-wizard wireframe. Form-state-driven JS computes a
  running tally of selected paths and the step-4 confirm summary
  client-side; the server re-renders on validation failure with form
  fields preserved.
- partials/tree_node.html: recursive HTMX-served tree fragment.
- Top-level Restore button on host_detail right rail + per-snapshot
  Restore action on snapshot rows replace the previous P3-stub.

Restore-shaped job page (job_detail.html):
- Progress widget rendered as a panel rather than a bare strip when
  the job is active.
- Current-file display under the bar, updated from log.stream stdout
  lines that look like absolute paths. Hidden for non-restore kinds.

Migration 0012:
- Add restore + diff to the jobs.kind CHECK. Rebuild required (SQLite
  can't ALTER CHECK in place); follows the safe pattern from 0005.
  Defensive: stash job_logs into a temp table before the rebuild and
  INSERT OR IGNORE back afterwards so even if SQLite cascades on
  DROP TABLE jobs the log history survives.

Tests:
- ui_restore_test covers GET step-1 render, GET pre-selected snapshot
  summary card, POST missing snapshot, POST missing paths, POST
  in-place wrong-hostname rejection (no command.run leaks to the
  agent), POST happy path (HX-Redirect + correct payload + audit
  row), POST against offline host returns 503.

Restage block (CLAUDE.md) deferred to the end of the restore phase.
2026-05-04 15:34:29 +01:00
steve cd80be3b13 store+server: P2-18a announce-and-approve schema + endpoint
migration 0011 adds pending_hosts table (id, hostname, public_key,
fingerprint, expiry). store/pending_hosts.go covers full CRUD plus
hostname-collision count + expired-row sweeper.

POST /api/agents/announce takes {hostname, os, arch, agent_version,
restic_version, public_key (base64)}, returns {pending_id,
fingerprint, hostname_collision}. Per-source-IP token-bucket
rate limit (10/min) + global cap of 100 in-flight rows. Public
key must be exactly 32 bytes (Ed25519).
2026-05-04 11:03:41 +01:00
steve 7b1990cf11 agent+server: P2R-11 pre/post hook execution for backup jobs
Agent: new runner.BackupHooks struct + runHook helper invoked via
/bin/sh -c (cmd.exe /C on Windows). pre_hook non-zero exit aborts
the backup; post_hook always runs with RM_JOB_STATUS=succeeded|failed
in env. Output streamed as 'hook(<phase>): …' log.stream lines.
Hooks only run for kind=backup (other kinds skip both phases).

Server: resolveBackupHooks resolves group → host default → empty,
decrypts via crypto.AEAD with per-slot ad bytes, plumbs plaintext
into CommandRunPayload for both schedule.fire and per-group
Run-now dispatch sites. Decrypt failures degrade silently to no
hook so a malformed blob can't poison every backup.
2026-05-04 10:57:28 +01:00
steve 18b0bf976d store: P2R-10 schema for source-group + host-default hooks (migration 0010)
Adds pre_hook/post_hook BLOB columns to source_groups and
pre_hook_default/post_hook_default to hosts. Bytes stored verbatim
(AEAD encrypt/decrypt happens at the HTTP layer where the AEAD key
lives). Round-trip tests cover set/clear semantics on both tables.
2026-05-04 10:52:16 +01:00
steve d02a093eeb ui+server: schedule next-run / last-run on dashboard + schedules tab
P2R-14. New store.LatestJobBySchedule query (per-schedule fired job).
Schedules-tab handler computes next-fire from cron + last-fire from
the jobs table per row. Schedules table grows two columns; dashboard
host row prepends 'next 12h ago/from now' to the existing last-backup
line when a single covering schedule is the run-now candidate.

Embeds store.Schedule into scheduleRow so existing template field
references keep working without bulk renames.
2026-05-04 10:44:31 +01:00
steve 9ec69456fe store: LatestJobByKind includes in-flight jobs (avoid maintenance double-fire)
Widen the SQL query to consider all statuses (queued, running,
succeeded, failed, cancelled) rather than terminal-only. An in-flight
prune that outlasts the 60s tick interval previously produced
ErrNotFound, causing the ticker to anchor at now-24h and fire a second
prune concurrently with the first.

Update the doc comment and test: remove the "queued job filtered out"
case, add assertions that a running job and a queued job are each
returned as the latest.
2026-05-04 10:19:15 +01:00
steve 18a4f74a22 server: enqueue pending_runs when scheduled-job dispatch fails
When dispatchBackupForGroup's conn.Send errors, queue a pending_runs
row (attempt=1, next_attempt_at = now + group.RetryBackoffSeconds)
instead of silently dropping the fire. The orphaned queued job row
is left behind for forensic visibility — the drainer will create a
fresh job row on its retry.

Also adds Store.ListPendingRunsForHost — the on-reconnect drain
walks every row for the host, regardless of due-ness, since the
host being back makes 'due' irrelevant.
2026-05-04 10:19:15 +01:00
steve ae96983877 maintenance: pure-logic ticker decides forget/prune/check fires 2026-05-04 10:19:15 +01:00
steve 11cbc2fb7f store: tighten CHECK constraint on host_repo_stats.last_check_status 2026-05-04 10:19:15 +01:00
steve 5200e44536 store: wrap UpsertHostRepoStats in a transaction (concurrency safety) 2026-05-04 10:19:15 +01:00
steve 84a8c060b6 store: assert CHECK constraint on host_credentials.kind 2026-05-04 10:19:15 +01:00
steve cfe25b9799 store: HostRepoStats projection (size, lock, last-check, last-prune) 2026-05-04 10:19:15 +01:00
steve f801fdf65b store: host_credentials becomes kind-aware (repo + admin slots) 2026-05-04 10:19:15 +01:00
steve 9f2cb18e42 store: migration 0009 — admin-creds kind + host_repo_stats 2026-05-04 10:19:15 +01:00
steve b6f8de1dcc lint: drive baseline to zero, drop only-new-issues gate
Cleanup pass over the repo so CI can enforce lint going forward
without the only-new-issues escape hatch:

* gofumpt -w across the tree (31 hits, all formatting)
* misspell --fix (25 hits, US-locale spelling) — but reverted on
  api.JobCancelled = "cancelled" since that literal is the wire +
  DB CHECK constraint value, plus matched the case in store/fleet.go
  back to "cancelled" and added //nolint:misspell on both for the
  next time someone reaches for the auto-fix
* Wrap every `defer rows.Close()` / `defer stmt.Close()` /
  `defer res.Body.Close()` in `defer func() { _ = .Close() }()`
  to satisfy errcheck without losing the close itself
* websocket.Dial callers (1 prod, 4 tests) now capture + close the
  upgrade response Body — coder/websocket can return res with a nil
  Body on success, so the test deferred-closes guard against that
* Annotate the two genuine-by-design nilerr cases with //nolint
  comments explaining why nil-on-error is the contract (cookie
  missing = no session; ctx cancelled mid-backoff = clean shutdown)
* Add brief godoc on the 10 exported const groups + types that
  revive flagged (api.HostOS/HostArch/JobKind/JobStatus/LogStream/
  ErrorCode, restic.EventKind, store.Role, web.FS)
* Drop the unused (*Server).userByID method
* Inline the unparam baseView(active) — every UI page is under
  the dashboard primary nav today

Result: `golangci-lint run ./...` reports 0 issues. CI lint job
no longer needs only-new-issues: true; X-06 follow-up entry in
tasks.md removed.
2026-05-03 16:15:17 +01:00
steve ec0bf0f6c3 P2R-01: REST + WS rewire against the slim shape
Schedules CRUD now takes {cron, enabled, source_group_ids[]} with cron
parsed via robfig/cron/v3 and group membership scoped to the host.
New source-groups CRUD lives at /api/hosts/{id}/source-groups; delete
refuses with 409 if any schedule still references the group, returning
the schedule list so the UI can prompt 'remove from these schedules
first.' Repo-maintenance GET/PUT manages forget/prune/check cadences
on host_repo_maintenance — no version bump, the server-side ticker
(P2R-06) drives execution.

Per-source-group Run-now (POST /hosts/{id}/source-groups/{gid}/run)
resolves the group's includes/excludes/retention/tag and dispatches a
backup command.run with the new structured CommandRunPayload fields
(Includes/Excludes/Tag). Old per-host /hosts/{id}/run-backup and
/hosts/{id}/init-repo return 410 Gone with a redirect message.

schedule_push.go is rebuilt: buildScheduleSetPayload assembles the
slim wire shape, pushScheduleSetOnConn ships it during the on-hello
window, pushScheduleSetAsync fires after every CRUD mutation, and
dispatchScheduledJob handles agent schedule.fire by iterating the
schedule's source groups and dispatching one backup per group with
actor_kind=schedule and scheduled_id pointing at the schedule.

Auto-init at first WS connect: when the host has repo creds bound and
no init job in its history, server dispatches restic init. Restic's
'config file already exists' soft-success means re-runs against an
existing repo no-op; we don't auto-retry on failure (operator triggers
re-init manually via the danger zone in P2R-09).

api.Schedule drops Kind/Paths/Excludes/Tags/RetentionPolicy/Manual etc.
in favour of {id, cron, enabled, source_groups: [...]}. The agent
scheduler stops checking sch.Manual; cmd/agent's backup dispatch reads
Includes/Excludes/Tag instead of Args.

Tests cover the new HTTP surface end-to-end: source-groups CRUD with
in-use refusal, schedule validation (bad cron / missing groups /
foreign group), repo-maintenance auto-seed and validation, the 410
route, and buildScheduleSetPayload's wire-shape correctness. Full
suite passes; smoke env exercises auto-init dispatch on hello,
async push after schedule create, and per-source-group Run-now
landing the right paths/excludes/tag at the agent.
2026-05-03 10:56:40 +01:00
steve e7eea7afac P2 redesign · phase 2: store rewrite — sources, slim schedules, repo maintenance
Go-side data model rebuilt against migration 0008. The fat-Schedule
shape (paths/excludes/tags/retention/manual/kind/options/hooks) is
gone; that surface lives on source_groups now.

* store/types.go
  - Schedule slimmed to {id, host_id, cron, enabled, source_group_ids,
    timestamps}. SourceGroupIDs populated by Get/List, accepted on
    Create/Update so callers pass desired junction state in one shape.
  - SourceGroup added: name (= snapshot tag), includes/excludes,
    retention_policy, retry_max + retry_backoff_seconds, cached
    conflict_dimension.
  - HostRepoMaintenance added: forget/prune/check cadences + enabled.
  - PendingRun added: offline-retry queue.
  - Host loses RepoInitialisedAt; gains BandwidthUpKBps + BandwidthDownKBps.
  - RetentionPolicy moves home from "schedule field" to "source group
    field" but the type itself + Summary() method unchanged.

* store/sources.go (new) — CRUD + GetByName + ConflictDimension cache.
  Group writes bump host_schedule_version; conflict cache writes don't
  (server-internal projection, agent doesn't see it).
* store/maintenance.go (new) — CreateDefault is idempotent (INSERT OR
  IGNORE). UpdateRepoMaintenance doesn't bump schedule version because
  these run on the server's own ticker, not the agent's local cron.
* store/pending.go (new) — Enqueue / DueRunsForRetry / Bump / Delete.
* store/schedules.go — rewritten for slim shape + junction CRUD.
  Update wipes the schedule_source_groups junction wholesale and
  re-inserts (simpler than diffing). Adds SchedulesUsingGroup for
  retention-conflict detection + UI labels.
* store/hosts.go — drops repo_initialised_at scan, adds bandwidth scan.
  New SetHostBandwidth helper.

* HTTP layer — temporarily stubbed during this rewrite (501 returns
  with redesign_in_progress error code). Phase 3 fills these in
  against the new shape:
    - schedules.go REST CRUD
    - schedule_push.go agent reconciliation
    - ui_schedules.go HTML form CRUD
  Run-now-per-host + Init-repo handlers in ui_handlers.go also stubbed
  — both go away in the new model (Run-now per source group; auto-init
  at host enrolment).

* enrollment.go — replaces "seed manual schedule from typed paths"
  with "seed default source group + repo-maintenance row." The default
  group gets the typed paths as its includes; operator edits later
  via Sources tab.

* ws/handler.go — drops the MarkHostRepoInitialised projection (column
  is gone; auto-init makes it derivable from latest init job's status).

Tests:
* store: existing schedule test rewritten for slim shape + junction;
  new sources_test.go covers source-group CRUD, name uniqueness,
  conflict cache, repo-maintenance defaults + idempotent seed,
  pending-runs queue lifecycle.
* http: schedules_test.go and schedule_push_test.go deleted — both
  exercised the obsolete fat-schedule API. Phase 3 rewrites them
  against the new endpoints.

go test ./... green. cmd/server + cmd/agent build. The UI is broken
end-to-end (schedules / sources / repo tabs all hit 501 stubs); Phase 3
restores REST + on-the-wire reconciliation; Phase 4 rewires the UI
templates against the new model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 21:30:41 +01:00
steve 49ecb7c771 P2 redesign · phase 1: migration 0008 — sources + repo maintenance
Schema rebuild for the model collapse described in
design/v4-sources-redesign.html. Three nouns now stand on their
own:

* schedules — slim. Only cron + enabled + host_id. Fat-schedule
  shape (paths/excludes/tags/retention/manual/kind/options/hooks)
  is dropped wholesale. Schedule data wiped — by design (smoke env
  was nuked before this ran; fresh installs have nothing to lose).
* source_groups — name + includes + excludes + retention_policy +
  retry policy + cached conflict_dimension. Group name doubles as
  the snapshot tag so retention can target it cleanly. UNIQUE
  (host_id, name) enforces tag unambiguity.
* schedule_source_groups — N:M junction. One schedule can fire N
  groups per tick; one group can be referenced by N schedules.
* host_repo_maintenance — 1:1 with hosts. Default cadences:
  forget daily 03:00, prune weekly Sun 04:00, check monthly 1st
  05:00 with --read-data-subset 5%. Operator can edit on Repo tab.
* pending_runs — offline-retry queue. Server-side ticker dispatches
  due rows; bounded by source_groups.retry_max + retry_backoff_seconds.

Plus:
* hosts.bandwidth_up_kbps / .bandwidth_down_kbps — host-wide caps.
* hosts.repo_initialised_at — DROPPED. Auto-init on enrol makes
  it derivable from the latest init job; the Init-repo button goes
  too (failure surfaces via job history banner).

Note on FK safety: smoke env was wiped before migration ran, so
DROP TABLE schedules cascades to nothing. Fresh installs apply
0001-0007 then immediately 0008 — same story (no schedule rows
to lose). For an upgrade path on a populated DB, this migration
would need a data-preserving variant; not needed today.

Tests fail to compile/run after this — expected. The Go side
(store types, CRUD, REST handlers, agent runner, UI templates)
gets rebuilt in subsequent phases. tasks.md will track P2 redesign
progress.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 20:54:01 +01:00