restic-manager

Author	SHA1	Message	Date
steve	02dbe59d68	server: drainer uses dispatch-core to avoid duplicate pending_run enqueue Extract dispatchBackupForGroupCore (persist+marshal+send, no enqueue on failure) from dispatchBackupForGroup. drainOne now calls the core directly so a failed Send only bumps the existing pending_runs row via BumpPendingRunAttempt — not create a second row — stopping the geometric duplication on repeated drain failures. dispatchBackupForGroup (schedule.fire path) wraps the core and keeps its enqueue-on-failure behaviour unchanged. TestDrainPendingBumpsOnSendFailure strengthened: asserts exactly 1 row remains after a send failure (was tolerating >=1 duplicate rows).	2026-05-04 10:15:18 +01:00
steve	81c611264d	server: drain pending_runs on tick + on agent reconnect Two trigger paths land here: - A 30s ticker in cmd/server calls Server.DrainAllDue(ctx). It walks pending_runs rows whose next_attempt_at <= now, dedupes by host, skips offline hosts, and per online host runs DrainPending. - onAgentHello spawns a background DrainPending(hostID). When a host comes back, every pending row for it is dispatchable now — due-ness becomes irrelevant once the wire is back. Each row's schedule + group are reloaded; ErrNotFound or disabled-schedule or gone-group abandons the row with a pending_run.abandoned audit. attempt >= retry_max also abandons. Otherwise dispatchBackupForGroup is invoked; success deletes the row, failure bumps attempt with exponential backoff capped at 30m.	2026-05-04 10:15:18 +01:00
steve	194e6c9719	server: enqueue pending_runs when scheduled-job dispatch fails When dispatchBackupForGroup's conn.Send errors, queue a pending_runs row (attempt=1, next_attempt_at = now + group.RetryBackoffSeconds) instead of silently dropping the fire. The orphaned queued job row is left behind for forensic visibility — the drainer will create a fresh job row on its retry. Also adds Store.ListPendingRunsForHost — the on-reconnect drain walks every row for the host, regardless of due-ness, since the host being back makes 'due' irrelevant.	2026-05-04 10:15:18 +01:00
steve	f29a9e49d3	server: fix stale RetentionPolicy comment + check Scan errors in maintenance test	2026-05-04 10:15:18 +01:00
steve	e283d70c27	server: maintenance ticker drives forget/prune/check on cadence Wires a 60s server-side ticker to the pure-logic maintenance.Decide introduced in the previous commit. Decisions flow through a new DispatchMaintenance method on Server, which: - skips offline hosts (no pending_runs queueing — maintenance is not a backup, missed fires shouldn't pile up) - silently skips prune when admin creds aren't bound - pushes admin creds before prune, then dispatches with RequiresAdminCreds=true (same as operator-driven prune) - persists job rows with actor_kind="system" Reshapes the forget wire payload from a single RetentionPolicy to a ForgetGroups list (one tag + per-group keep- per source group). The agent walks the groups and runs `restic forget --tag <name> --keep-*` once per group. Dead-code removed: CommandRunPayload.RetentionPolicy, the old forget JSON-decode in cmd/agent, and the single-policy form of restic.RunForget.	2026-05-04 10:15:18 +01:00
steve	edce90d196	ui: hx-swap none on Run-now + truthful save banner + tailwind rebuild Add hx-swap="none" to the three Run-now buttons (check/prune/unlock) in host_repo.html to match the existing pattern on host_sources.html and host_schedules.html. Fix all-blank admin-credentials save to redirect without ?saved= query string so no false-positive banner is shown; strengthen the corresponding test to assert Location has no ?saved=. Rebuild CSS bundle via Tailwind to pick up max-w-[640px] JIT class.	2026-05-04 10:15:18 +01:00
steve	ccccc6aa33	ui: Slice E — admin creds form + run-now buttons + repo health panel - hostRepoPage gains AdminURL/AdminUsername/HasAdminPassword, Online, and StatsView (pre-dereferenced projection of host_repo_stats). - loadHostRepoPage loads the admin slot (tolerating ErrNotFound), hub.Connected, and stats (tolerating ErrNotFound). - renderRepoPage gains an adminErr parameter; all callers updated. - handleUIAdminCredentialsSave / handleUIAdminCredentialsDelete added (form-POST handlers mirroring the repo-creds pattern, with audit). - Routes /hosts/{id}/admin-credentials POST and /delete POST registered. - Template: Admin credentials form after Connection, Run-now HTMX buttons after Maintenance, Repo health stats panel in right rail. - Tests: 9 new tests covering rendering, disabled states, save/delete round-trips, audit rows, and idempotent delete.	2026-05-04 10:15:18 +01:00
steve	b07cb14320	server: populate audit UserID on credential mutations + slog prune push errors Switch handleSetHostCredentials, handleSetAdminCredentials, and handleDeleteAdminCredentials from authedUser (bool) to requireUser (*store.User) so AuditEntry.UserID and Actor are populated correctly. Add slog.Warn on the non-ErrNotFound pushAdminCredsToAgent path in handleRunRepoPrune so decrypt/send failures surface in the server log rather than appearing as a generic host_offline 503.	2026-05-04 10:15:18 +01:00
steve	56dd7ab411	server: cover HTMX auth-redirect path in repo-ops tests	2026-05-04 10:15:18 +01:00
steve	0095e80fe9	server: HTTP run-now for prune / check / unlock Adds POST /api/hosts/{id}/repo/{prune,check,unlock} (and matching outer routes for HTMX form posts). Prune pushes the admin-cred slot via pushAdminCredsToAgent before dispatch and refuses with admin_creds_required when the slot is not set. Check reads check_subset_pct from host_repo_maintenance (overridable via ?subset=N, clamped 0-100; non-numeric override falls back to DB value silently). Unlock needs no admin creds. All three share the same wantsHTML/HX-Redirect response split as the per-source-group run-now endpoint.	2026-05-04 10:15:18 +01:00
steve	b66eb10524	server: admin-credentials REST + Slot:admin push helper Adds GET/PUT/DELETE /api/hosts/{id}/admin-credentials handlers that mirror the existing repo-credentials endpoints but write to store.CredKindAdmin with AEAD additional-data "host:<id>:admin" (scoped away from the repo slot to prevent cross-binding). PUT immediately pushes a config.update(Slot:"admin") to the agent when it is connected, and the new pushAdminCredsToAgent helper is wired for use by the upcoming prune run-now endpoint (D2) to push on-demand before dispatch.	2026-05-04 10:15:18 +01:00
steve	2055ce360b	store: host_credentials becomes kind-aware (repo + admin slots)	2026-05-04 10:15:18 +01:00
steve	dd7b37a5c1	lint: align local gofumpt rules with golangci-lint v2.5.0 Bumping CI to v2.5.0 surfaced two new gofumpt findings (in two test files that gofumpt v2.1.6 considered fine). Local re-format with the matching tool brings them in line. Pre-commit hook config: prepend $GOPATH/bin to PATH inside the hook entry so gofumpt + golangci-lint resolve when ~/go/bin isn't on the operator's interactive shell PATH (common — go install puts them there but PATH config varies). Without this, the hooks fail with 'Executable not found' even when the tools are installed. Pin the Makefile setup target to v2.5.0 so a fresh clone gets the same binary CI runs — keeps pre-commit and CI from drifting again.	2026-05-03 21:31:47 +01:00
steve	e871b05b38	lint: drive baseline to zero, drop only-new-issues gate Cleanup pass over the repo so CI can enforce lint going forward without the only-new-issues escape hatch: * gofumpt -w across the tree (31 hits, all formatting) * misspell --fix (25 hits, US-locale spelling) — but reverted on api.JobCancelled = "cancelled" since that literal is the wire + DB CHECK constraint value, plus matched the case in store/fleet.go back to "cancelled" and added //nolint:misspell on both for the next time someone reaches for the auto-fix * Wrap every `defer rows.Close()` / `defer stmt.Close()` / `defer res.Body.Close()` in `defer func() { _ = .Close() }()` to satisfy errcheck without losing the close itself * websocket.Dial callers (1 prod, 4 tests) now capture + close the upgrade response Body — coder/websocket can return res with a nil Body on success, so the test deferred-closes guard against that * Annotate the two genuine-by-design nilerr cases with //nolint comments explaining why nil-on-error is the contract (cookie missing = no session; ctx cancelled mid-backoff = clean shutdown) * Add brief godoc on the 10 exported const groups + types that revive flagged (api.HostOS/HostArch/JobKind/JobStatus/LogStream/ ErrorCode, restic.EventKind, store.Role, web.FS) * Drop the unused (Server).userByID method Inline the unparam baseView(active) — every UI page is under the dashboard primary nav today Result: `golangci-lint run ./...` reports 0 issues. CI lint job no longer needs only-new-issues: true; X-06 follow-up entry in tasks.md removed.	2026-05-03 16:15:17 +01:00
steve	18a9f6624e	ci: migrate .golangci.yml to v2 schema + only-new-issues gate The bump from golangci-lint-action@v6 → v7 (which downloads the v2.x binary) was blocking CI lint with 'unsupported version of the configuration: ""' because .golangci.yml was still in the v1 schema. Migrate the config to v2: * version: "2" prelude * disable-all → default: none * linters-settings → linters.settings * gofumpt + goimports move into formatters.enable + formatters.settings * exclude-rules move into linters.exclusions.rules * gosimple drops (folded into staticcheck in v2) Fix the four lint hits in the new P2R-02 code: * host_bandwidth.go: convert hostBandwidthRequest directly to hostBandwidthView via type conversion (S1016) * ui_repo.go: drop unparam savedSection + status arguments from renderRepoPage (always "" / always 422 — split GET render from validation-fail render) * ui_schedules.go: gofumpt formatting on the scheduleEditPage struct Add only-new-issues: true to the lint job. The repo carries ~90 pre-existing findings (gofumpt drift × 31, misspell × 25, missing godoc × 10, bodyclose × 6, errcheck × 12, …) accumulated before lint was actually wired into CI. Without this gate, every PR would fail on baseline noise instead of its own changes. Track the cleanup as X-06 in tasks.md so the gate is temporary.	2026-05-03 15:00:24 +01:00
steve	fab99b4a38	P2R-02 slice 5: dashboard row Run-now uses covering schedule Replace the placeholder 'Open →' link with a per-host Run-now decision computed server-side once per render: * If the host has exactly one enabled schedule whose source-group set covers every group on the host → primary 'Run all groups' button (HX-POST to that schedule's /run endpoint, fires every backup the host knows about in one click). * Otherwise (zero matches, multiple matches, or any ambiguity) → ghost 'Open →' link to /hosts/{id}/sources, where the operator picks per-group from the source-group rows. dashboardPage.Hosts moves from []store.Host to []dashboardHostRow to carry the precomputed RunAllScheduleID; host_row.html now reads .Host.* and .RunAllScheduleID. Two extra store calls per host on dashboard render — fine at fleet sizes we care about; if we ever need to support thousands of hosts we'll batch these queries.	2026-05-03 13:42:50 +01:00
steve	4035c44be3	P2R-02 follow-up: schedule Run-now feedback (single → job log, multi → toast) Schedules tab Run-now used to silently HX-Redirect back to the list, leaving the operator wondering whether the click registered. Now: * Single-source-group schedule → HX-Redirect to that one job's live log, matching the per-source-group Run-now UX from Sources. * Multi-group schedule → stay on the schedules list and fire a success toast ("N backups dispatched: <group names>") via the existing rm:toast HX-Trigger channel, so the operator sees clear acknowledgement without losing their place. dispatchBackupForGroup now returns the persisted job ID so the caller can choose between job-log redirect and toast feedback; on any internal failure it returns "" and the warning still hits slog as before. The cron-fired path (dispatchScheduledJob) ignores the return value, behaviour unchanged.	2026-05-03 13:25:31 +01:00
steve	d62b173712	P2R-02 slice 4: Repo tab — connection / bandwidth / maintenance Three independent forms on /hosts/{id}/repo so saving one section doesn't disturb the others: * Connection: edits repo URL, username, password (pre-filled from the redacted GET /api/hosts/{id}/repo-credentials view; password field shows masked stored-creds placeholder; blank password = keep existing). On save, encrypts and pushes config.update to a connected agent. * Bandwidth: host-wide upload/download caps (KB/s; blank = no cap) written via store.SetHostBandwidth. New REST endpoint PUT /api/hosts/{id}/bandwidth for JSON callers. * Maintenance: forget/prune/check cadences + check subset %, with per-row enabled toggles. Reuses cronParser for validation; auto-seeds the row if a host pre-dates the migration. Right-rail surfaces repo size, snapshot count, snapshots-by-tag breakdown (counted from existing snapshot tag rows), and an 'untagged snapshots are left alone' note. Danger-zone re-init button is rendered but disabled with a hint pointing at P2R-09 (real implementation lands there). Validation re-renders the page with the relevant form's banner and all other section state intact. Successful saves redirect with a ?saved=<section> query param so the page surfaces a small ✓ saved indicator on the relevant form. ci.yml: bump golangci-lint-action v6→v7 (separate change picked up in this commit).	2026-05-03 12:14:03 +01:00
steve	8b91d3037c	P2R-02 follow-up: Run-now works on disabled schedules with confirm Surface the Run-now button on every schedule when the host is online, not just enabled ones. Disabled rows render the button as a non-primary style + a HX-confirm dialog ("This schedule is paused — running it now won't change that. Fire it once anyway?"); enabled rows keep the zero-friction primary button. Server-side, Run-now no longer short-circuits on !Enabled — it dispatches the source groups inline rather than via dispatchScheduledJob (which always bails on disabled schedules, since cron-tick semantics are different from explicit operator intent). The audit-log entry inside dispatchBackupForGroup still records every fire.	2026-05-03 12:07:26 +01:00
steve	64d2fcf7a3	P2R-02 follow-up: clickable rows on Sources/Schedules + cron-preset tooltips Aligns Sources and Schedules tab rows with the dashboard's row-click UX: whole-row click navigates to the row's edit page (mirroring .host-row.clickable). Drops the redundant Edit buttons; Run-now and Delete remain in .row-action cells that sit above the row-link overlay via z-index. Schedule edit form's cron preset chips now carry human-readable title= tooltips ("Every day at 03:00", "Every Sunday at 03:00", etc). tasks.md gets a binding row-design rule covering all current and future list-row templates, and the P2R-02 entry is split into the six slices already agreed with the operator (slices 1–3 marked done, 4 next).	2026-05-03 12:01:55 +01:00
steve	67ca769686	P2R-02 slice 3: Schedules tab — slim list, new/edit form, delete, Run-now Schedules list: status (enabled/paused) + cron + source-group tags + actions (Run-now when enabled+online, Edit, Delete). Run-now reuses dispatchScheduledJob — same path real cron fires take, so each referenced source group runs as its own backup with its own tag. Falls back to a 409 if the agent is offline. Schedule new/edit form: cron input with five preset chips (quick-pick @hourly / nightly / 6h / weekly / monthly), source-group multi-pick rendered as styled checkbox cards (visual state tracks the underlying box via a tiny inline script), enabled toggle. No paths/excludes/retention/kind on the schedule itself — those live on source groups now. Server-side validation re-renders with the operator's input + ticked groups intact. Every successful mutation calls pushScheduleSetAsync. Adds .schd-row, .preset-chip, .picker styles.	2026-05-03 11:55:16 +01:00
steve	dede74fd3a	P2R-02 slice 2 follow-up: refuse to delete a host's last source group Belt-and-braces: the UI now disables the Delete button when a group is the only one on the host (with a tooltip explaining why), and the server-side handler returns 409 if a curl/form-replay tries anyway. Every host needs at least one source group to be backup-able, so the 'last group on a fresh host' case is a meaningful accident to guard against.	2026-05-03 11:49:17 +01:00
steve	0ed9c3d1ec	P2R-02 slice 2: Sources tab — list, new/edit form, delete, Run-now Sources tab now lists every source group on the host with per-row counts (used-by-N-schedules, snapshot count by tag), the v4 conflict tag (keep-* dimension that has no compatible cadence), and Run-now / Edit / Delete actions. Run-now reuses the existing HTMX-aware /hosts/{id}/source-groups/{gid}/run handler. New /hosts/{id}/sources/new and /sources/{gid}/edit form: name + includes/excludes textareas + the 3×2 keep-* retention grid + retry-on-offline knobs. Server-side validation re-renders with the operator's input intact; the inline conflict banner shows above the retention grid when ConflictDimension is set. Delete blocks (UI + server) when the group is referenced by any schedule. Every successful mutation calls pushScheduleSetAsync so an online agent re-arms within seconds. Adds .src-row and .keep-cell to input.css for the row + retention grid layout.	2026-05-03 11:44:43 +01:00
steve	a535822ff3	P2R-02 slice 1: host-detail sub-tab skeleton Extract header/vitals/sub-tabs into a host_chrome partial that every host-detail tab page renders. Sources / Schedules / Repo go from inert divs to real <a> links backed by stub pages that share the chrome and a 'coming next' body — slices 2/3/4 fill them in. Also re-establishes the version indicator (host_schedule_version vs agent's applied_schedule_version) in the header. Drops the legacy fat-schedule list/edit templates that referenced fields removed by the P2 redesign (Manual / Paths / RetentionPolicy on Schedule); the new templates land in slice 3.	2026-05-03 11:37:55 +01:00
steve	e968abc042	ci: fix race-trip in enrollment fixture + bump golangci-lint to v2.1.6 - host_credentials_test.go's CreateEnrollmentToken fixture passed 1<<20 as the TTL (third arg, time.Duration) — that's ~1ms in nanoseconds. Local non-race runs finished inside the window, but -race overhead blew the deadline so the token was already expired by the time GetEnrollmentTokenAttachments / ConsumeEnrollmentToken ran. Use time.Hour instead, which matches the spirit of a per-test fixture. - Lint pin v1.61.0 was built against Go 1.23 and refuses to load a config targeting newer toolchains. go.mod is on 1.25, so the lint step exited 3 ('the Go language version used to build golangci-lint is lower than the targeted Go version'). Bumping to v2.1.6, which supports Go 1.25. Both failures showed up only on the Gitea runner because local make target runs go test without -race and lint hadn't been re-run after the go.mod toolchain bump.	2026-05-03 11:13:22 +01:00
steve	713bc4a2bb	P2R-01 follow-up: WS-path tests + drop unused retention from backup dispatch Adds p2r01_ws_test.go covering the two paths the original commit's in-process tests couldn't reach without a live conn: - maybeAutoInit dispatches command.run(init) on first hello when creds are bound, skips on second hello once a job row exists, and skips entirely when the host has no creds. - dispatchScheduledJob iterates a schedule's source groups and emits one backup per group with the right Tag/Includes; persists job rows with actor_kind=schedule + scheduled_id; no-ops on a disabled schedule. Drops RetentionPolicy from the per-group Run-now and schedule.fire backup payloads — the agent's RunBackup ignores it (forget is the only consumer). Adds Hub.Conn() so tests can grab the live *Conn post-hello.	2026-05-03 11:00:45 +01:00
steve	d000fe7ec1	P2R-01: REST + WS rewire against the slim shape Schedules CRUD now takes {cron, enabled, source_group_ids[]} with cron parsed via robfig/cron/v3 and group membership scoped to the host. New source-groups CRUD lives at /api/hosts/{id}/source-groups; delete refuses with 409 if any schedule still references the group, returning the schedule list so the UI can prompt 'remove from these schedules first.' Repo-maintenance GET/PUT manages forget/prune/check cadences on host_repo_maintenance — no version bump, the server-side ticker (P2R-06) drives execution. Per-source-group Run-now (POST /hosts/{id}/source-groups/{gid}/run) resolves the group's includes/excludes/retention/tag and dispatches a backup command.run with the new structured CommandRunPayload fields (Includes/Excludes/Tag). Old per-host /hosts/{id}/run-backup and /hosts/{id}/init-repo return 410 Gone with a redirect message. schedule_push.go is rebuilt: buildScheduleSetPayload assembles the slim wire shape, pushScheduleSetOnConn ships it during the on-hello window, pushScheduleSetAsync fires after every CRUD mutation, and dispatchScheduledJob handles agent schedule.fire by iterating the schedule's source groups and dispatching one backup per group with actor_kind=schedule and scheduled_id pointing at the schedule. Auto-init at first WS connect: when the host has repo creds bound and no init job in its history, server dispatches restic init. Restic's 'config file already exists' soft-success means re-runs against an existing repo no-op; we don't auto-retry on failure (operator triggers re-init manually via the danger zone in P2R-09). api.Schedule drops Kind/Paths/Excludes/Tags/RetentionPolicy/Manual etc. in favour of {id, cron, enabled, source_groups: [...]}. The agent scheduler stops checking sch.Manual; cmd/agent's backup dispatch reads Includes/Excludes/Tag instead of Args. 
Tests cover the new HTTP surface end-to-end: source-groups CRUD with in-use refusal, schedule validation (bad cron / missing groups / foreign group), repo-maintenance auto-seed and validation, the 410 route, and buildScheduleSetPayload's wire-shape correctness. Full suite passes; smoke env exercises auto-init dispatch on hello, async push after schedule create, and per-source-group Run-now landing the right paths/excludes/tag at the agent.	2026-05-03 10:56:40 +01:00
steve	5667cdf13a	P2 redesign · phase 2: store rewrite — sources, slim schedules, repo maintenance Go-side data model rebuilt against migration 0008. The fat-Schedule shape (paths/excludes/tags/retention/manual/kind/options/hooks) is gone; that surface lives on source_groups now. * store/types.go - Schedule slimmed to {id, host_id, cron, enabled, source_group_ids, timestamps}. SourceGroupIDs populated by Get/List, accepted on Create/Update so callers pass desired junction state in one shape. - SourceGroup added: name (= snapshot tag), includes/excludes, retention_policy, retry_max + retry_backoff_seconds, cached conflict_dimension. - HostRepoMaintenance added: forget/prune/check cadences + enabled. - PendingRun added: offline-retry queue. - Host loses RepoInitialisedAt; gains BandwidthUpKBps + BandwidthDownKBps. - RetentionPolicy moves home from "schedule field" to "source group field" but the type itself + Summary() method unchanged. * store/sources.go (new) — CRUD + GetByName + ConflictDimension cache. Group writes bump host_schedule_version; conflict cache writes don't (server-internal projection, agent doesn't see it). * store/maintenance.go (new) — CreateDefault is idempotent (INSERT OR IGNORE). UpdateRepoMaintenance doesn't bump schedule version because these run on the server's own ticker, not the agent's local cron. * store/pending.go (new) — Enqueue / DueRunsForRetry / Bump / Delete. * store/schedules.go — rewritten for slim shape + junction CRUD. Update wipes the schedule_source_groups junction wholesale and re-inserts (simpler than diffing). Adds SchedulesUsingGroup for retention-conflict detection + UI labels. 
* store/hosts.go — drops repo_initialised_at scan, adds bandwidth scan. New SetHostBandwidth helper. * HTTP layer — temporarily stubbed during this rewrite (501 returns with redesign_in_progress error code). Phase 3 fills these in against the new shape: - schedules.go REST CRUD - schedule_push.go agent reconciliation - ui_schedules.go HTML form CRUD Run-now-per-host + Init-repo handlers in ui_handlers.go also stubbed — both go away in the new model (Run-now per source group; auto-init at host enrolment). * enrollment.go — replaces "seed manual schedule from typed paths" with "seed default source group + repo-maintenance row." The default group gets the typed paths as its includes; operator edits later via Sources tab. * ws/handler.go — drops the MarkHostRepoInitialised projection (column is gone; auto-init makes it derivable from latest init job's status). Tests: * store: existing schedule test rewritten for slim shape + junction; new sources_test.go covers source-group CRUD, name uniqueness, conflict cache, repo-maintenance defaults + idempotent seed, pending-runs queue lifecycle. * http: schedules_test.go and schedule_push_test.go deleted — both exercised the obsolete fat-schedule API. Phase 3 rewrites them against the new endpoints. go test ./... green. cmd/server + cmd/agent build. The UI is broken end-to-end (schedules / sources / repo tabs all hit 501 stubs); Phase 3 restores REST + on-the-wire reconciliation; Phase 4 rewires the UI templates against the new model. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 21:30:41 +01:00
steve	fdecde0d5c	P2-05: forget command with retention policy End-to-end forget plumbing — operator can create a forget schedule with keep-* values, agent runs restic forget --keep-* … on the schedule's cron (or via per-row Run-now), snapshot list shrinks, UI updates. * api.CommandRunPayload gains retention_policy json.RawMessage so the agent doesn't need a typed copy of the server-side struct. * restic.ForgetPolicy mirrors restic's --keep-* flags. Empty() reports zero dimensions; restic wrapper RunForget refuses to run an empty policy (would delete every snapshot). Does NOT pass --prune — pruning lives behind a separate admin-only credential (P2-06); forget just rewrites the snapshot index. * runner.RunForget mirrors RunBackup's envelope shape so the live log viewer works without special-casing. On success triggers reportSnapshots (forget shrinks the index, the host's snapshot count almost certainly changed). * cmd/agent dispatcher handles MsgCommandRun with kind=forget, decodes RetentionPolicy from the wire, builds restic.ForgetPolicy. * Server dispatchScheduleNow marshals the schedule's RetentionPolicy into the wire payload for kind=forget jobs. Refuses to dispatch a forget schedule with empty retention. * validateSchedule rejects kind=forget without at least one keep-* dimension (new error code: missing_retention). * UI schedule edit form gains a Kind dropdown (backup or forget; immutable on edit). Paths block toggles by kind via inline data-kind attributes. Form help-text explains the prune separation. Other kinds (prune, check, unlock) deferred to P2-06..08; the Kind dropdown only offers backup and forget today. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 14:07:42 +01:00
steve	72d8081b0d	Add-host: default repo username to hostname; always show htpasswd snippet The pending page suppressed the htpasswd snippet when repo_username was blank — but with --private-repos the username is required for auth, and operators routinely leave the field blank assuming the system will pick something sensible. * handleUIAddHostPost defaults repo_username to the typed hostname when blank. Matches what --private-repos expects (URL path segment == username). * pending_host.html: snippet now renders whenever a password is present (always true after the generate-on-blank logic landed earlier). * Form help-text updated to describe the default explicitly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 13:08:23 +01:00
steve	8a05969953	Add-host: durable pending page + polled awaiting-agent panel Two issues from a smoke session: 1. The awaiting-agent panel never refreshed — operator had to go back to the dashboard to see the host had connected. 2. Generated passwords were displayed only on the POST response. Navigating away (or even an accidental tab close) lost them permanently, so the operator couldn't update the rest-server's htpasswd. Both are the same fix: convert the POST-rendered transient "result state" into a durable GET page at /hosts/pending/{token}. * New route GET /hosts/pending/{token} renders the install-command + htpasswd snippet view. Password is decrypted from the (still- encrypted-at-rest) token row on every render — operator can refresh, bookmark, navigate away and come back. Once the agent enrols, the page redirects to /hosts/{id}; once the token expires, redirect to /hosts/new. * New route GET /hosts/pending/{token}/awaiting returns a polled HTML fragment that the pending page swaps in every 2s via HTMX. States: awaiting (keep polling) | connected (show "Open host →" + "View schedules" CTAs, polling stops) | expired (mint-new link, polling stops). Polling stops naturally because only the awaiting state's wrapper carries the hx-trigger attribute. * POST /hosts/new now 303-redirects to /hosts/pending/{token} on success; validation errors keep re-rendering the form with banner. Supporting changes: * New store helper Store.GetEnrollmentTokenStatus(tokenHash) for the polling endpoint — returns {expires_at, consumed_at, consumed_host} in one round-trip without dragging in the attachments-decryption path. 
* New ui.Renderer.RenderPartial(w, name, data) for HTMX fragment responses (no layout wrap). Picks an arbitrary page's template set as the lookup point — every page parses the full common- paths list, so they all see every partial. * add_host.html stripped to form-only; pending_host.html owns the result-state UI; awaiting_agent.html is the polled partial. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 12:59:24 +01:00
steve	148e61b33b	P2-04.5: kill host.default_paths in favour of manual schedules CI / Test (linux/amd64) (push) Has been cancelled Details CI / Lint (push) Has been cancelled Details CI / Build (windows/amd64) (push) Has been cancelled Details CI / Build (linux/amd64) (push) Has been cancelled Details CI / Build (linux/arm64) (push) Has been cancelled Details Two independent path lists for "what does this host back up?" was a real divergence footgun — operator types one set at Add-host time and a different set into a schedule, both end up in the same repo, the snapshot history looks fine until restore. Resolution: drop host.default_paths entirely; add a `manual` flag on schedules. A manual schedule has paths/excludes/tags/retention like any other but no cron — it fires only via per-schedule Run-now. Single source of truth for what gets backed up. Schema (migration 0007): * schedules.manual INTEGER NOT NULL DEFAULT 0. * For every host with non-empty default_paths, seed a manual schedule with those paths and bump host_schedule_version. * ALTER TABLE hosts DROP COLUMN default_paths. * ALTER TABLE enrollment_tokens RENAME COLUMN default_paths TO initial_paths. Original draft of this migration rebuilt hosts via the create-new + drop-old + rename-new pattern. With foreign_keys=ON (set in the connection DSN), DROP TABLE on the parent fired ON DELETE CASCADE on every child of hosts(id) — schedules / jobs / snapshots / host_credentials all wiped on the smoke env when I tried it. SQLite 3.35+ supports column-level ALTERs directly, so we skip the rebuild dance and avoid the cascade trap. Six lines of SQL instead of sixty, no FK risk. Run-now rewiring: * New `dispatchScheduleNow(hostID, scheduleID, conn?)` helper unifies the agent-driven path (cron fire → schedule.fire → OnScheduleFire callback) and the UI-driven path (operator clicks Run-now on a schedule row). Conn arg is optional; nil falls back to Hub.Send. 
* New POST /hosts/{id}/schedules/{sid}/run endpoint — per-row Run-now button on the schedules list. * Dashboard's per-host Run-now (handleUIRunBackup) now picks the host's only enabled manual schedule, falls back to the only enabled schedule, else returns "pick one in Schedules tab". Keeps one-click for the common case. Agent: * Scheduler skips manual schedules in cron build (silent — they're a normal data shape, not an error). * Wire Schedule struct gains Manual flag. * Schedule.fire flow unchanged — the agent only ever fires non-manual schedules anyway. UI: * Add-host form retitled "Initial schedule · manual" so the operator knows the paths become an editable schedule under the Schedules tab. Result page calls out the manual schedule + points at Host > Schedules. * Schedule edit form: "Manual schedule" checkbox at the top of the When section; toggling it hides/shows the cron field via inline JS. Server-side validator skips the cron requirement when manual=true. * Schedule list shows a "manual" tag under the status pill and renders the When column as "— run-now only —" for manual rows. Each row gets a Run-now button when the schedule is enabled and the host is online. Tests + go test ./... green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 12:26:06 +01:00
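The dashboard Run-now selection rule from 148e61b33b (only enabled manual schedule, else only enabled schedule, else ask the operator) can be sketched as below. This is an illustration under assumptions: the `Schedule` struct, `pickRunNow`, and `ErrPickOne` are hypothetical names, and the real handleUIRunBackup may differ in detail.

```go
package main

import (
	"errors"
	"fmt"
)

// Schedule holds only the fields the selection rule needs (hypothetical shape).
type Schedule struct {
	ID      string
	Enabled bool
	Manual  bool
}

// ErrPickOne mirrors the "pick one in Schedules tab" response in the commit message.
var ErrPickOne = errors.New("pick one in Schedules tab")

// pickRunNow returns the schedule a one-click Run-now should fire.
func pickRunNow(all []Schedule) (Schedule, error) {
	var enabled, manual []Schedule
	for _, s := range all {
		if !s.Enabled {
			continue
		}
		enabled = append(enabled, s)
		if s.Manual {
			manual = append(manual, s)
		}
	}
	switch {
	case len(manual) == 1: // exactly one enabled manual schedule wins
		return manual[0], nil
	case len(enabled) == 1: // otherwise fall back to the only enabled schedule
		return enabled[0], nil
	default: // zero or ambiguous: the operator has to choose
		return Schedule{}, ErrPickOne
	}
}

func main() {
	s, err := pickRunNow([]Schedule{
		{ID: "nightly", Enabled: true},
		{ID: "adhoc", Enabled: true, Manual: true},
	})
	fmt.Println(s.ID, err)
}
```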
steve	160d788bae	P2-04: schedule editor UI CI / Test (linux/amd64) (push) Has been cancelled Details CI / Lint (push) Has been cancelled Details CI / Build (windows/amd64) (push) Has been cancelled Details CI / Build (linux/amd64) (push) Has been cancelled Details CI / Build (linux/arm64) (push) Has been cancelled Details Closes the schedule foundations slice — operator can now drive the plumbing P2-01..03 landed without touching the JSON API. * New routes: - GET /hosts/{id}/schedules (list) - GET /hosts/{id}/schedules/new (create form) - POST /hosts/{id}/schedules/new (create) - GET /hosts/{id}/schedules/{sid}/edit (edit form) - POST /hosts/{id}/schedules/{sid}/edit (update) - POST /hosts/{id}/schedules/{sid}/delete (delete, confirm-then-redirect) * List view (web/templates/pages/schedules_list.html): status, cron, paths, retention summary, tags, edit/delete buttons. Header shows "version N · agent in sync" or "agent at vM" when the push hasn't been ack'd yet — backed by host_schedule_version + applied_schedule_version. Empty-state CTA points at /schedules/new. * Create/edit form (web/templates/pages/schedule_edit.html, shared): cron expression with five quick-pick presets (daily 3am / every 6h / @hourly / weekly Sun / monthly 1st), paths textarea (one per line), excludes textarea, tags (comma-separated), retention as six numeric fields (mirrors restic's --keep-* flags one-for-one), bandwidth caps, enabled toggle. Side panel explains the reconciliation flow so the operator knows what saving actually does. Validation errors re-render with operator's input intact. * internal/server/http/ui_schedules.go owns the handlers; reuses the same validateSchedule + pushScheduleSetAsync used by the JSON API path. Each save audit-logs schedule.created / schedule.updated / schedule.deleted (matching the JSON API actions). * store.RetentionPolicy gains a Summary() method ("last=7, d=14, w=4" or "—"). 
Used by the list view's table cell so templates don't have to do any conditional retention rendering. * Two new template helpers: list (string varargs → []string, used for the cron preset row) and joinComma (sibling to joinDot for the rare list that wants commas). RetentionPolicy.Summary covers the schedule-list case but the helpers are general. * host_detail.html secondary tabs row converted from inert <div>s into <a> links. Snapshots active by default; Schedules now points at the new page. Jobs/Repo/Settings remain inert until their P2 owners ship. Hooks UI deferred to P2-15 (lands with the hook execution path). Single-kind UI (backup only) by design — other kinds get a UI when their job dispatch lands in P2-05..08. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:44:40 +01:00
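A Summary() helper of the kind 160d788bae describes might look like the sketch below. The field names and the h / m / y labels are assumptions; only the "last=7, d=14, w=4" example and the dash fallback come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// RetentionPolicy mirrors restic's six --keep-* flags (field names assumed).
type RetentionPolicy struct {
	KeepLast, KeepHourly, KeepDaily, KeepWeekly, KeepMonthly, KeepYearly int
}

// Summary renders only the non-zero fields, e.g. "last=7, d=14, w=4".
// An empty policy renders as a single dash character (U+2014).
func (r RetentionPolicy) Summary() string {
	fields := []struct {
		label string
		n     int
	}{
		{"last", r.KeepLast}, {"h", r.KeepHourly}, {"d", r.KeepDaily},
		{"w", r.KeepWeekly}, {"m", r.KeepMonthly}, {"y", r.KeepYearly},
	}
	var parts []string
	for _, f := range fields {
		if f.n > 0 {
			parts = append(parts, fmt.Sprintf("%s=%d", f.label, f.n))
		}
	}
	if len(parts) == 0 {
		return "\u2014"
	}
	return strings.Join(parts, ", ")
}

func main() {
	p := RetentionPolicy{KeepLast: 7, KeepDaily: 14, KeepWeekly: 4}
	fmt.Println(p.Summary()) // prints "last=7, d=14, w=4"
}
```

Keeping the formatting on the type means the list template can print the cell with a single method call and no conditionals.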
steve	6450bf1b88	P2-02 (agent side) + P2-03: agent scheduler + schedule.fire dispatch CI / Test (linux/amd64) (push) Has been cancelled Details CI / Lint (push) Has been cancelled Details CI / Build (windows/amd64) (push) Has been cancelled Details CI / Build (linux/amd64) (push) Has been cancelled Details CI / Build (linux/arm64) (push) Has been cancelled Details Closes the schedule reconciliation loop end-to-end. * New `internal/agent/scheduler` package wraps robfig/cron/v3 with the lifecycle the agent needs: - Apply(ScheduleSetPayload, Sender) stops the prior cron (waiting for in-flight entries to return), rebuilds from scratch, starts, and emits schedule.ack with the version we just applied. - Disabled entries skipped silently; bad cron exprs (which shouldn't reach us — the server validates — but defensive) log a warn and skip. - On each cron tick the entry sends a new schedule.fire envelope to the server with {schedule_id, scheduled_at}. The scheduler itself never builds CommandRunPayloads — server is the source of truth for jobs. - tx is swapped on every Apply, so reconnect is handled naturally: cron entries that fire against a dropped tx log "no active connection" and skip the tick. - Stop() is idempotent and waits for the cron's in-flight workers via cron.Stop().Done(). * New wire message api.MsgScheduleFire + api.ScheduleFirePayload for the agent → server "I just fired locally" RPC. * Server-side dispatch (schedule_push.go: dispatchScheduledJob): looks up the schedule by id, validates ownership + that it's enabled, builds args from kind (paths for backup; other kinds are still arg-less in Phase 2 and grow as those job kinds land in P2-05..08), persists a jobs row with actor_kind=schedule + scheduled_id, and writes command.run back on the same conn so the agent runs through its existing dispatch path. * store.CreateJob now writes scheduled_id. 
This column was in the schema since 0001 but never populated — the original P1 path only had operator-driven jobs, so actor_kind was always 'user' and scheduled_id was always nil. * cmd/agent/main.go integration: dispatcher gains a scheduler.Scheduler; the MsgScheduleSet case now hands the payload to scheduler.Apply (in a goroutine so the WS read loop keeps draining other messages). WS dispatcher gains OnScheduleFire alongside OnScheduleAck. * Tests: - scheduler unit tests (4): ack-on-apply, cron tick fires schedule.fire envelope, disabled entries don't fire, replace- prior-state stops the old cron. - Server-side end-to-end: schedule.fire → command.run with the right job_id / kind / args, plus jobs row with actor_kind= "schedule" and scheduled_id linking back to the schedule. Persistence of next-fire times across agent restarts is deliberately deferred. A missed fire window during downtime simply fires once on reconnect — that's the desirable behaviour (the operator wants the missed backup to run, not be silently skipped because we lost track of when it was due). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:29:12 +01:00
steve	946b6db137	P2-02 (server side): schedule reconciliation push + ack handling CI / Test (linux/amd64) (push) Has been cancelled Details CI / Lint (push) Has been cancelled Details CI / Build (windows/amd64) (push) Has been cancelled Details CI / Build (linux/amd64) (push) Has been cancelled Details CI / Build (linux/arm64) (push) Has been cancelled Details Server is now the source of truth for the agent's cron set. * Helpers in schedule_push.go: - loadScheduleSetPayload reads the host's schedules + canonical version into the wire shape. - pushScheduleSetOnConn writes directly to a just-handshaken conn (avoids racing against Hub.Register on a brand-new connection). - pushScheduleSetAsync is the post-CRUD flavour — no-op when the host is offline (the next reconnect's on-hello path catches it up, so a missed push is non-fatal). - applyScheduleAck records what version the agent has confirmed. * onAgentHello restructured: was returning early when the host had no repo credentials, which made the schedule push unreachable for fresh hosts. Split into pushRepoCredsOnHello (silent no-op on ErrNotFound) + pushScheduleSetOnConn (always runs). Empty schedule list is a valid push: tells the agent to drop stale cron entries. * WS dispatcher gains an OnScheduleAck hook on HandlerDeps; the http server wires it to applyScheduleAck. MsgScheduleAck moves out of the "TODO(P2)" group into a real case that decodes the payload and forwards to the callback. * Schedule CRUD handlers each fire pushScheduleSetAsync after the audit-log write so the agent picks up changes within seconds. Tests cover: - On-hello push of an already-created schedule, agent acks, applied_schedule_version flips on the host row. - Connect-then-CRUD: empty initial push (version 0), then a follow-on push at version 1 after the operator creates a schedule via REST. Agent-side `schedule.set` handler (parse, replace local cron, emit `schedule.ack`) is the remainder of P2-02 and lands with P2-03's local scheduler. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:22:06 +01:00
steve	4b075840a1	P2-01: schedule schema + CRUD API CI / Test (linux/amd64) (push) Has been cancelled Details CI / Lint (push) Has been cancelled Details CI / Build (windows/amd64) (push) Has been cancelled Details CI / Build (linux/amd64) (push) Has been cancelled Details CI / Build (linux/arm64) (push) Has been cancelled Details The `schedules` table was already laid down in migration 0001; this slice adds the Go-side data model, store CRUD with atomic version bumps, and REST endpoints. * `store.Schedule` + `RetentionPolicy` + `ScheduleOptions` typed views (the wire form on the agent side keeps retention/options as raw JSON since the agent just forwards them to restic). * Store CRUD: CreateSchedule / GetSchedule / ListSchedulesByHost / UpdateSchedule / DeleteSchedule. Each mutation bumps `host_schedule_version` atomically in the same tx via UPSERT on `host_schedule_version`. SetHostAppliedScheduleVersion records what the agent has confirmed via schedule.ack (P2-02 will use it). * REST endpoints under /api/hosts/{id}/schedules + /{sid}: GET (list, with the version envelope so callers can detect drift), POST (create), PUT (update — kind is immutable), DELETE. * Validation: cron expressions parse via robfig/cron/v3 (same parser the agent will use, so anything that validates here will fire there); kind ∈ {backup, forget, prune, check} (init/unlock are operator-only one-shot kinds, not schedulable); backup schedules require ≥1 path; hooks rejected on non-backup kinds (spec §14.3). * All mutations audit-logged. * Tests: store-level CRUD + version-bump invariants; REST happy path (create→list→update→delete with version progression); REST validation table covers each rejection code. newTestServerWithHub now sets BootstrapToken so the schedules handler tests can use the existing login flow without a parallel test-server constructor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:12:58 +01:00
steve	ee3ee241ea	P1 polish: agent-as-root, init-repo flow, rest creds passthrough, UX fixes CI / Test (linux/amd64) (push) Has been cancelled Details CI / Lint (push) Has been cancelled Details CI / Build (windows/amd64) (push) Has been cancelled Details CI / Build (linux/amd64) (push) Has been cancelled Details CI / Build (linux/arm64) (push) Has been cancelled Details Cohesive batch from a smoke-test session against a real rest-server. Themed bullets: * Agent runs as root, sandboxed via systemd. CapabilityBoundingSet drops to CAP_DAC_READ_SEARCH + restore caps; ProtectSystem=strict with ReadWritePaths confined to /etc + /var/lib/restic-manager; NoNewPrivileges blocks escalation. Install script no longer creates a service user. spec.md §4.2 / §14.1 / §14.3 explain the rationale (matches UrBackup / Veeam / Bareos defaults; trying to back up "everything" as an unprivileged user creates silent skips on /home, /root, /var/lib/* with no upside vs the threat model the agent already implies). * Init-repo end-to-end. New JobKind="init" wired through agent runner, restic.Env.RunInit, server dispatcher, and a UI button (red "Initialise repo" in the run-now panel). hosts.repo_initialised_at flips on init success, on backup success, or on a non-empty snapshots.report. The "Run now" / "Init" / "Retry" branching now drives both the dashboard host row and the host-detail panel. Migrations 0004 (column), 0005 (jobs.kind CHECK widened — using the safe create-new-then-rename pattern; first version corrupted job_logs.job_id FK), 0006 (cleans up job_logs FK on already- affected DBs). * rest-server creds embedded at exec time only. restic.Env gains RepoUsername; mergeRestCreds() builds the user:pass@-prefixed URL inside envSlice() and never assigns it back to the struct, so nothing slog-able ever sees the cleartext form. RedactURL helper for any future surface that needs to log a URL safely. Both helpers tested. * Add-host UX. 
Repo password is now optional — server mints a 24-byte URL-safe random one and surfaces it once, alongside an htpasswd snippet ("echo PASS \| htpasswd -B -i ... USERNAME") so the operator pastes one command on the rest-server host and one on the endpoint. Result page also links the install snippet at /install/install.sh (was /install.sh — 404'd before) and pipes to bash (not sh — script uses set -o pipefail and other bashisms; on Debian/Ubuntu sh is dash). * Late-subscriber race in JobHub. A fast-failing job could finish (DB write + Broadcast) before the browser's HX-Redirect → page load → WS-connect path completed, so the JS sat forever waiting on a job.finished that already passed. JobHub split into Register + Send + Run; handleJobStream now subscribes first, re-fetches the job, and sends a synthetic job.finished if the state is already terminal. * HTMX error visibility. New toast partial listens to htmx:responseError and surfaces the response body as a bottom-right toast — every server-side validation error now becomes visible without per-handler JS wiring. Also handles custom rm:toast events for future server-pushed notifications via the HX-Trigger header. Themed via existing CSS vars. * Dashboard rows are now whole-row clickable to host detail (CSS card-link pattern: absolute-positioned anchor + .row-action z-index restoration so the action button stays clickable). "View →" on a running job links to /jobs/<id> rather than /hosts/<id> since the row click already covers the host page. * "Run first" / "Run first backup" → "Run now" everywhere for consistency. * runbook (docs/e2e-smoke.md) updated — live-log streaming step now reflects P1-26; mentions the browser-driven Run-now flow. * _diag/dump-creds — moved out of cmd/ so go build doesn't pick it up; .gitignore now excludes /_diag/ entirely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:02:12 +01:00
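The "embed creds at exec time, redact everywhere else" idea from ee3ee241ea can be illustrated with the standard library. This is a sketch under assumptions, not the repository's mergeRestCreds or RedactURL: the lowercase function names, the `rest:` prefix handling, and the example host are hypothetical. `url.UserPassword` and `(*url.URL).Redacted` are real net/url APIs.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

const restPrefix = "rest:" // restic's REST backend scheme prefix

// mergeRestCreds returns a copy of repo with user:pass@ embedded.
// The caller should pass the result straight to the child environment
// and never store it on a struct that might be logged.
func mergeRestCreds(repo, user, pass string) (string, error) {
	if !strings.HasPrefix(repo, restPrefix) || user == "" {
		return repo, nil // nothing to merge
	}
	u, err := url.Parse(strings.TrimPrefix(repo, restPrefix))
	if err != nil {
		return "", err
	}
	u.User = url.UserPassword(user, pass)
	return restPrefix + u.String(), nil
}

// redactURL masks any embedded password so the URL is safe to log.
func redactURL(repo string) string {
	prefix := ""
	if strings.HasPrefix(repo, restPrefix) {
		prefix = restPrefix
	}
	u, err := url.Parse(strings.TrimPrefix(repo, restPrefix))
	if err != nil {
		return "<unparseable repo url>"
	}
	return prefix + u.Redacted()
}

func main() {
	full, _ := mergeRestCreds("rest:https://backup.example:8000/web01/", "web01", "s3cret")
	fmt.Println(redactURL(full)) // prints "rest:https://web01:xxxxx@backup.example:8000/web01/"
}
```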
steve	12b72e7dde	P1 polish: Host.default_paths interim + restic env hygiene + job_id JS quoting CI / Test (linux/amd64) (push) Has been cancelled Details CI / Lint (push) Has been cancelled Details CI / Build (windows/amd64) (push) Has been cancelled Details CI / Build (linux/amd64) (push) Has been cancelled Details CI / Build (linux/arm64) (push) Has been cancelled Details Two fixes that close the loop on dashboard run-now and harden the agent's restic invocation. Default paths (interim until P2-01 schedules): - 0003 migration adds default_paths TEXT NOT NULL DEFAULT '[]' to hosts and to enrollment_tokens. - Operator types paths in the Add-host form (textarea, one per line). They ride on the enrol_token row alongside the encrypted creds (paths aren't secret — plain JSON column). - On consume, ConsumeEnrollmentToken still just burns the token; the new GetEnrollmentTokenAttachments returns both the re-bindable creds and the path list in one round trip, the handler transfers them onto the new host row inside CreateHost. - The dashboard's Run-now and host-detail's "Run backup now" button now read Host.DefaultPaths and pass them to dispatchJob. A host with no default paths returns 400 with a friendly "no paths set" message instead of dispatching a doomed `restic backup` with no positional args. - Doc comments explicitly call this out as a Phase 1 interim — schedules supersede. Restic env hygiene: - envSlice() previously omitted HOME / XDG_CACHE_HOME, which bit the smoke runs whenever the agent was launched outside systemd (restic refused to start: "neither $XDG_CACHE_HOME nor $HOME are defined"). Now both are set explicitly: prefer Env.ExtraEnv overrides, fall back to the agent process's own HOME, and finally to /var/lib/restic-manager. - Comment makes the env policy explicit: parent's RESTIC_* / AWS_* / B2_* env is filtered out by design — control-plane is the unambiguous source of truth. 
JS bug fix in the live log page: - {{$job.ID \| printf "%q"}} produced a literal-quoted JS string, which then went into the WS URL as ".../jobs/"<ID>"/stream" → 404. Switched to '{{$job.ID}}' inside the literal so html/template's auto-escape does the right thing. Verified end-to-end: dashboard "Run now" → live progress + log lines arrive over the WS → succeeded pill renders. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 22:35:33 +01:00
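The env policy in 12b72e7dde (HOME resolved from overrides, then the agent's own HOME, then /var/lib/restic-manager; parent RESTIC_* / AWS_* / B2_* variables dropped) can be sketched as follows. Function names and signatures are hypothetical, and passing the remaining parent variables through unchanged is an assumption the commit message does not confirm.

```go
package main

import (
	"fmt"
	"strings"
)

const fallbackHome = "/var/lib/restic-manager"

// resolveHome applies the three-step fallback described in the commit:
// explicit override, then the agent process's HOME, then a fixed directory.
func resolveHome(extraEnv map[string]string, parentHome string) string {
	if v := extraEnv["HOME"]; v != "" {
		return v
	}
	if parentHome != "" {
		return parentHome
	}
	return fallbackHome
}

// filterParentEnv drops variables the control plane is meant to own.
func filterParentEnv(parent []string) []string {
	var out []string
	for _, kv := range parent {
		if strings.HasPrefix(kv, "RESTIC_") ||
			strings.HasPrefix(kv, "AWS_") ||
			strings.HasPrefix(kv, "B2_") {
			continue
		}
		out = append(out, kv)
	}
	return out
}

func main() {
	fmt.Println(resolveHome(nil, "")) // prints "/var/lib/restic-manager"
	fmt.Println(filterParentEnv([]string{"PATH=/usr/bin", "RESTIC_PASSWORD=leak"}))
}
```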
steve	bd434bd1d0	P1-26: live job log viewer + WS browser fan-out hub Closes the P1-21 remainder. internal/server/ws/jobhub.go — new JobHub. Per-job_id set of subscribers; each gets a 64-deep buffered channel with a writer goroutine. Broadcast is non-blocking: if a subscriber is slow, its channel fills and messages are dropped for that subscriber only — the agent's read loop is never blocked by a stuck browser. The agent dispatchAgentMessage path mirrors job.started / job.progress / log.stream / job.finished envelopes onto the hub in addition to its existing persistence work. The wire shape is the same end-to-end, so client-side JS switches on env.type the same way Go code does. GET /api/jobs/{id}/stream is the browser endpoint. Auth via session cookie (HTTP layer); upgrade; subscribe; pump until context closes. GET /jobs/{id} renders the live log page. Four states (queued/running/succeeded/failed) drive the header pill, the progress bar block, the failure summary panel, and the action button (Cancel job while running, Back to host afterwards). Already-persisted log lines are server-rendered on initial load; new lines arrive over the WS and append to #log-stream. Auto-scrolls unless the user scrolls up (a "⇢ Follow" pill re-attaches). On job.finished the page reloads after 600ms to pick up the final-state header rendered server-side. POST /hosts/{id}/run-backup now sets HX-Redirect → /jobs/{job_id} on success so HTMX lands the operator straight on the live log. For non-HTMX callers (curl / plain form post) it 303s to the same target. store.ListJobLogs returns persisted log lines for initial render on page load.
Browser-verified end-to-end: enrol → run a real backup against a sibling restic/rest-server → live progress + 11 log lines stream in → succeeded pill + final stats land after page reload. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 21:45:56 +01:00
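The non-blocking fan-out in bd434bd1d0 rests on a buffered channel per subscriber and a `select` with a `default` branch. The sketch below shows only that mechanism; the type layout and method names are assumptions, and the per-subscriber writer goroutine and unsubscribe path from the real JobHub are left out.

```go
package main

import (
	"fmt"
	"sync"
)

const subscriberBuffer = 64 // "64-deep buffered channel" per the commit message

// JobHub fans messages out to every subscriber of a job ID.
type JobHub struct {
	mu   sync.Mutex
	subs map[string][]chan []byte
}

func NewJobHub() *JobHub {
	return &JobHub{subs: make(map[string][]chan []byte)}
}

// Subscribe registers a new buffered channel for jobID.
func (h *JobHub) Subscribe(jobID string) <-chan []byte {
	ch := make(chan []byte, subscriberBuffer)
	h.mu.Lock()
	h.subs[jobID] = append(h.subs[jobID], ch)
	h.mu.Unlock()
	return ch
}

// Broadcast never blocks: a full subscriber channel loses this message,
// other subscribers are unaffected. It returns how many sends were dropped.
func (h *JobHub) Broadcast(jobID string, msg []byte) (dropped int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, ch := range h.subs[jobID] {
		select {
		case ch <- msg:
		default:
			dropped++
		}
	}
	return dropped
}

func main() {
	hub := NewJobHub()
	ch := hub.Subscribe("job-1")
	dropped := 0
	for i := 0; i < 70; i++ { // nobody is reading, so the buffer fills
		dropped += hub.Broadcast("job-1", []byte("line"))
	}
	fmt.Println(len(ch), dropped) // prints "64 6"
}
```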
steve	26a2b85e13	P1-25: host detail page (snapshots tab default) CI / Test (linux/amd64) (push) Has been cancelled Details CI / Lint (push) Has been cancelled Details CI / Build (windows/amd64) (push) Has been cancelled Details CI / Build (linux/amd64) (push) Has been cancelled Details CI / Build (linux/arm64) (push) Has been cancelled Details GET /hosts/{id} renders the v1 host detail layout: - persistent header: status dot (pulse if a job is in flight), monospace name, tags, plus a metadata strip (os/arch, agent version, restic version, "last seen Xs ago" or "online · last heartbeat …"). - vitals strip: four tiles for last backup (status + relative time), repo size, snapshot count, open alerts. - sub-tabs: Snapshots is active; Jobs / Repo / Settings are visible but inert until P2. - snapshot table: short id, time (absolute), paths joined with " · ", size, file count, restore button (disabled — wires up in P3). - right rail: run-now stack (backup live, forget/prune/check/ unlock disabled with the Phase tag), danger-zone remove panel (also disabled for now). Empty state: when a host has no snapshots yet, the table replaces itself with a "no snapshots yet" prompt that includes the run-now button (provided the agent is online). Pagination cap of 50 most-recent snapshots; full pagination lands when fleet sizes demand it. Template helpers grew: comma() now accepts int / int32 / int64 so templates don't fight Go's type inference; joinDot() concatenates a []string with " · "; absTime() formats time.Time as YYYY-MM-DD HH:MM:SS; the existing relTime() already accepts T or *T after P1-27. Browser-verified end-to-end with seeded fixture data. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 20:20:21 +01:00
steve	dad8c7fe99	P1-27: Add host flow — form + minted-token result page CI / Test (linux/amd64) (push) Has been cancelled Details CI / Lint (push) Has been cancelled Details CI / Build (windows/amd64) (push) Has been cancelled Details CI / Build (linux/amd64) (push) Has been cancelled Details CI / Build (linux/arm64) (push) Has been cancelled Details GET /hosts/new renders the focused two-column form (hostname, tags, repo URL/username/password). POST /hosts/new validates, mints a one-time token via the new mintEnrollmentToken helper — shared with the existing JSON /api/enrollment-tokens endpoint — and re-renders the same page in result state showing: - the install command with RM_SERVER + RM_TOKEN filled in (and an inline copy-to-clipboard button), - an "awaiting agent connection" panel with the hostname pre-filled, - a troubleshooting list pointing at the most common reasons the agent doesn't appear, - back-to-dashboard / add-another-host links. publicURL() resolves RM_BASE_URL first, falling back to scheme + Host on the inbound request — useful for local smoke without a proxy. Browser-verified end-to-end: form submit → token minted → install command renders with the right values from the form input. template fn formatRelTime now accepts time.Time or *time.Time so templates can pass either without fighting Go's lack of an address-of operator. Deferred: download-preconfigured-installer (a templated .sh with the values baked in) — copy-paste covers v1; nice-to-have later. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 20:16:54 +01:00
steve	ee16bc7ce7	P1-24: live dashboard — fleet summary tiles + host table CI / Test (linux/amd64) (push) Has been cancelled Details CI / Lint (push) Has been cancelled Details CI / Build (windows/amd64) (push) Has been cancelled Details CI / Build (linux/amd64) (push) Has been cancelled Details CI / Build (linux/arm64) (push) Has been cancelled Details Server-rendered HTML view backed by: - new store.FleetSummary aggregating host counts + repo bytes + snapshot total + open alerts + last-24h job rollup in two queries. - GET /api/hosts (JSON list of hosts in the dashboard projection). - GET /api/fleet/summary (JSON aggregate, same shape as above). The HTML page (web/templates/pages/dashboard.html) renders the four summary tiles + host table directly from store data — no separate fetch. Per-row state colour comes from .host-row.{degraded,failed, offline} which paint a 3px left edge so problem hosts are scannable without reading. HTMX is loaded into the base layout so per-row "Run now" buttons can hx-post to /hosts/{id}/run-backup, a thin HTML wrapper that funnels into a new dispatchJob helper shared with the JSON /api/hosts/{id}/jobs endpoint. Empty state (zero hosts) collapses to the "no hosts yet" prompt with the + Add host CTA — matches the v1 mockup. Template helpers (internal/server/ui/funcs.go) added for byte formatting (412 GB / 3.7 TB), relative time (3m ago / 2d ago), and comma grouping (1,847). Pure Go, no template-magic dependency. Browser-verified end-to-end with seeded fixture data: five hosts across all four states render with correct dots, accents, last- backup pills, sizes, snapshot counts, alerts, tags, and the right action button (Run now / Retry / Run first / View → / offline). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 19:29:11 +01:00
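The comma-grouping template helper mentioned in ee16bc7ce7 (1,847) is small enough to sketch in full. This version is an illustration, not the code in internal/server/ui/funcs.go; it takes an int64 only, whereas the later P1-25 commit says the real helper accepts int / int32 / int64.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// comma formats n with thousands separators, e.g. 1847 -> "1,847".
func comma(n int64) string {
	s := strconv.FormatInt(n, 10)
	sign := ""
	if n < 0 {
		sign, s = "-", s[1:] // group digits only, re-attach the sign afterwards
	}
	var b strings.Builder
	for i, c := range s {
		// Insert a separator whenever the remaining digit count is a multiple of 3.
		if i > 0 && (len(s)-i)%3 == 0 {
			b.WriteByte(',')
		}
		b.WriteRune(c)
	}
	return sign + b.String()
}

func main() {
	fmt.Println(comma(1847)) // prints "1,847"
}
```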
steve	229f89fee2	P1-23 / P1-28: base layout, login, session-aware nav + Tailwind build CI / Test (linux/amd64) (push) Has been cancelled Details CI / Lint (push) Has been cancelled Details CI / Build (windows/amd64) (push) Has been cancelled Details CI / Build (linux/amd64) (push) Has been cancelled Details CI / Build (linux/arm64) (push) Has been cancelled Details P1-28: Tailwind standalone CLI wired into the Makefile. `make tailwind` downloads the pinned v3.4.17 binary into bin/tailwindcss (gitignored), builds web/styles/input.css → web/static/css/styles.css. `make build` now runs the CSS pass first; `make tailwind-watch` for dev. Output is embedded in the binary via web.FS — single static binary, no Node. The CSS source carries every component class the v1 mockups defined (status dots, buttons, host row, log viewer, progress bar, fields, chips, snippet panel, empty state) so screens that land later can just reach for them. P1-23: html/template tree at web/templates with two layouts (base with chrome, chromeless for login + bootstrap), one nav partial, and two pages (dashboard placeholder, login). internal/server/ui parses the tree at startup; ui_handlers.go in the http package wires: GET / dashboard (303 → /login when unauthed) GET /login sign-in form POST /login consume form, mint session cookie, 303 → / POST /logout drop cookie, 303 → /login GET /static/* embedded Tailwind bundle The HTML login flow shares store/session logic with /api/auth/login via a new authenticateAndSession helper — same security guarantees, two surface representations (HTML form / JSON). Verified end-to-end: bootstrap → form-login → authed dashboard → sign-out → 303 cycle works in the browser; Tailwind output emits only the component classes referenced in the live templates (9.6kB minified). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 19:19:06 +01:00
steve	b6cfa99413	agent: log accept/complete on backup jobs; audit: populate host.enrolled payload Two warts surfaced during the smoke run: - Agent was silent between "config.update applied" and "job finished" — operators tailing journalctl saw no acknowledgement that a command.run had landed. Adds Info logs at job-accept ({job_id, paths}) and at successful completion. - The host.enrolled audit row had an empty {} payload. Now carries {hostname, os, arch, has_repo_creds} so an audit-log reader can answer "what got enrolled and did the operator bundle creds with the token" without joining back to hosts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 18:24:56 +01:00
steve	2418e585db	fix: enrollment FK race + log-when-rejected; runbook fixes from dry-run The smoke runbook caught a real bug: ConsumeEnrollmentToken was inserting into host_credentials (FK -> hosts) inside the same tx as the token burn, but the host row didn't exist yet — CreateHost runs in the next statement. The agent saw a generic 401 with no clue why. Fix: drop the host_credentials insert from ConsumeEnrollmentToken; the HTTP handler now does Consume -> CreateHost -> SetHostCredentials. SetHostCredentials failure is logged loudly but doesn't fail the enrol — operator recovers via PUT /api/hosts/{id}/repo-credentials. Adds slog.Warn lines on both 401 paths in handleAgentEnroll so the underlying cause is visible in server logs (the wire response stays generic to avoid leaking which step failed). Test: TestEnrollmentTransfersRepoCreds rewritten to mirror the new order (consume -> create host -> SetHostCredentials). Runbook (docs/e2e-smoke.md): rest-server moved off 8000 (commonly in use); URLs use trailing slash on the rest path; clarified that secrets_key is minted on first agent start, not at enrol time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 14:01:59 +01:00
steve	5d1951ad94	P1-34: e2e smoke runbook + redacted GET /repo-credentials Adds docs/e2e-smoke.md — a ~5-minute runbook that walks the full P1 happy path against a sibling restic/rest-server: bootstrap admin, mint token with repo creds, enrol an agent, watch the config.update push land, run a backup, confirm the snapshot, edit creds and watch the second push fire. Per the design discussion this is a runbook (not a Go integration test); the Playwright version lands in P5-06. GET /api/hosts/{id}/repo-credentials returns the redacted view — {repo_url, repo_username, has_password} — so the UI can pre-fill the edit form without ever pulling the password out of the AEAD blob. Marks P1-32 / P1-33 / P1-34 done in tasks.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 13:49:34 +01:00
steve	0ba56ed30d	P1-32: server-side encrypted repo creds + push-on-hello Operator-minted enrollment tokens now carry the repo URL/username/password as one AEAD blob bound (via additional-data) to the token hash. ConsumeEnrollmentToken re-encrypts under host_id and writes a host_credentials row in the same tx as token-burn, so the binding moves with the credential. PUT /api/hosts/{id}/repo-credentials lets an operator edit creds post-enrollment; merges with the existing blob, audits, and pushes config.update if the agent is connected. WS handler grows an OnHello hook that the HTTP layer wires to send the host's decrypted creds as a config.update immediately after the hello succeeds — synchronously, so a racing command.run lands after the agent has its repo password. Schema: 0002_host_credentials.sql adds enc_repo_creds to enrollment_tokens and a host_credentials table (PK = host_id, FK ON DELETE CASCADE). Tests: round-trip token → consume → host_credentials with AAD swap detection; no-creds path stays compatible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 12:38:35 +01:00
steve	3904a78f14	P1-22: snapshot listing via restic snapshots --json Agent calls restic snapshots --json after each successful backup (60s timeout, separate from the backup ctx) and ships the projection over the existing snapshots.report WS envelope. Failure here is logged but doesn't fail the job — the next successful backup catches the projection up. Server-side ReplaceHostSnapshots is delete-then-insert plus a hosts.snapshot_count update in one transaction so the dashboard's per-host count stays consistent with the projection. New read endpoint GET /api/hosts/{id}/snapshots returns the cached list with a refreshed_at marker so the UI can show staleness when an agent has been offline. Schema: dropped the unused snapshots.repo_id FK (repos as a first-class entity is P2 work), added short_id and refreshed_at columns, switched the time index to DESC for the most-recent-first list query. api.Snapshot gains short_id; size_bytes/file_count come from the embedded summary block on restic 0.16+ and stay zero on older clients. Tests cover round-trip, authoritative replacement after forget+prune shrinkage, and empty-after-wipe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 11:20:57 +01:00
steve	41a4043af3	server: drop in-process TLS — HTTP-only behind reverse proxy Self-hosted deployments already terminate TLS at Caddy/Traefik/nginx; making the server do TLS too means double cert config, dual ACME plumbing, and an untested code path. Drop RM_TLS_CERT/RM_TLS_KEY, remove TLSEnabled() and the ListenAndServeTLS branch. Replace the cookie's "Secure if TLS-in-process" check with a new RM_COOKIE_SECURE flag (default true). Local HTTP-only testing sets RM_COOKIE_SECURE=false; production is always behind a TLS proxy and the cookie stays Secure. Default port :8443 → :8080. docker-compose binds 127.0.0.1 only and populates RM_TRUSTED_PROXY. spec.md §4.1/§10.1 rewritten with a Caddyfile snippet and a hard "do not expose RM_LISTEN publicly" warning. enrollResponse keeps cert_pin_sha256 in the shape but the server can't introspect a cert it doesn't terminate — operator pastes the proxy's hash into -cert-pin at install time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 11:20:41 +01:00
steve	95b49ecab9	phase 1: run-now backup — restic wrapper, job lifecycle, end-to-end Lands the operator → server → agent → restic → server roundtrip for on-demand backups. The flow: POST /api/hosts/{id}/jobs {kind:"backup",args:["/path"]} → server creates a queued Job row → server emits command.run over WS to the host's agent → agent dispatcher spawns runner.RunBackup in a goroutine → runner spawns `restic backup --json`, parses each line → forwards: job.started, log.stream (every line), job.progress (throttled to 1/sec), job.finished (with summary stats blob) → server WS handler persists those into jobs / job_logs P1-16 internal/restic: thin Locate + Env wrapper that runs `restic backup --json`, scans stdout/stderr, parses BackupStatus + BackupSummary, calls back into a LineHandler so the agent can fan out to log.stream + job.progress. Treats exit code 3 as "succeeded with issues" (matches restic's contract). P1-18 store: jobs accessors (CreateJob, MarkJobStarted, MarkJobFinished, AppendJobLog, GetJob). P1-19 server: POST /api/hosts/{id}/jobs creates the Job row, validates kind, dispatches via Hub.Send, audit-logs the action. P1-20 agent runner: wraps restic.RunBackup with throttled progress emission. Sender abstraction was added to wsclient.Handler so background goroutines can keep replying after dispatch returns. P1-21 server WS: dispatchAgentMessage now persists job.started, job.finished, log.stream into the database. Browser fan-out for live tailing lands with the UI work. Agent gets repo_url + repo_password from agent.yaml in plaintext for now (mode 0600, owned by service user); spec.md §7.3's keyring storage moves there in P2. config.update over WS overrides the in-memory copy (does not persist). Build clean; all tests pass. End-to-end with a real restic still needs a host that has restic installed — wire shape verified by the existing hello/heartbeat round-trip test. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 00:45:04 +01:00

54 Commits