restic-manager

Author	SHA1	Message	Date
steve	5c4e0275d9	feat(catchup): arm on hello, fire missed-window backups on tick	2026-06-15 21:02:04 +01:00
steve	02e4ef7544	testing: bootstrap UI, agent reliability, NS-01..04 + alert username Smoothes the rough edges that came up exercising a live deployment. First-run bootstrap UI: /bootstrap renders a username + password form that uses the in-memory token directly (operator no longer copies it out of the log); /login redirects there while bootstrap is available. Agent reliability: failJob synthetic envelopes so command.run early returns no longer hang the server-side job; runtime probe of restic restore --help drives --no-ownership instead of version sniffing (0.18.x had it removed). Server unit re-shaped: ProtectSystem=full plus ReadWritePaths=/etc/restic-manager, no ProtectHome — restore can now write anywhere a user might want. Restore wizard: default target is /root/rm-restore/<job-id>/ with clearer help text. Re-init confirm input uses .field (was .input, which doesn't exist — text was invisible). NS-01 host delete: store DeleteHost, admin-band /hosts/{id}/delete with hostname-confirm danger zone, audit, FK cascade, live WS close. NS-02 enrollment-token recovery: outstanding-tokens panel on /hosts/new, regenerate (preserves attachments) and revoke handlers + audit, store-level ListOutstandingEnrollmentTokens and DeleteEnrollmentToken. NS-03 repo init / probe surface: migration 0020 adds hosts.repo_status + repo_status_error; WS handler projects every init job's outcome onto the host row (idempotent already-initialised collapses to ready); creds-save resets status and dispatches a fresh probe; /hosts/{id}/repo/probe retry endpoint with banner. NS-04 dashboard live + sort + filter: query-string filter (q/status/repo_status/tag/sort/dir), 5s htmx live poll mirroring the alerts pattern with a localStorage live toggle, sortable column headers, filter row + clear. Alerts page: ack'd-by line resolves user_id ULID to username. Compose.yaml ignored — host-specific.	2026-05-05 22:03:15 +01:00
steve	a781e95c94	P3 follow-up: editable target dir, conditional --no-ownership, UK lint Three small follow-ups from review: 1. Restore target is now operator-editable. Default value is the literal '\$HOME/rm-restore/<job-id>/' (agent expands \$HOME at run time using os.UserHomeDir(); also handles \${HOME} and ~/ prefixes). Operator can replace with any absolute path. - ui_restore.go validates the input is either absolute or starts with one of the recognised prefixes; other env-var refs (\$PATH etc.) are deliberately rejected so operator paths can't pick up arbitrary agent env values. - host_restore.html replaces the read-only mono-text display with a real <input>; help text spells out that \$HOME resolves agent-side and <job-id> is substituted on dispatch. - install.sh + the systemd unit prep /root/rm-restore so the default works under the sandbox: ReadWritePaths gains a soft '-/root/rm-restore' entry (the '-' makes the bind-mount soft-fail if missing, but install.sh pre-creates it root-owned 0700). 2. --no-ownership flag now gated on restic version. The flag was added in restic 0.17 and 0.16 rejects it. Previously dropped it wholesale — that meant new-dir restores silently preserved ownership against design intent on 0.17+. Now the agent threads its detected restic version (sysinfo already collects it) through runner.Config -> restic.Env, and RunRestore appends --no-ownership only when AtLeastVersion(0, 17) returns true. 0.16 hosts still restore with original uid/gid; help text in the wizard explicitly notes this. The previous 'Original ownership is preserved' copy was wrong for new-dir mode and is corrected. 3. golangci-lint misspell locale switched US -> UK and the codebase swept (73 corrections, mostly behaviour/serialise/recognise/honour). Wire-format ErrorCode 'unauthorized' -> 'unauthorised' is a tiny contract change but the agent doesn't parse those codes today and no external API consumers exist yet. Tests passed before + after. Tests: - internal/restic/version_test.go covers Env.AtLeastVersion across edge cases (empty, exact match, patch above, minor below, non- numeric) and expandHome on \$HOME / \${HOME} / ~/, plus pass-through for absolute paths and refusal of other env vars. - ui_restore_test updated: TargetDir now starts '\$HOME/rm-restore/' with the job_id substituted into the placeholder. Live verified on the smoke env: default target restored to /root/rm-restore/<job-id>/ as the agent's expanded \$HOME (2 files, 14 bytes); custom override '/tmp/custom-restore/<job-id>/' restored into the agent's PrivateTmp namespace (1 file, 6 bytes); both jobs 'succeeded', exit 0.	2026-05-04 17:27:52 +01:00
steve	cdf88c6dc3	agent+server: apply host bandwidth caps to restic invocations P2R-13a. restic.Env gains LimitUploadKBps/LimitDownloadKBps which are emitted as global --limit-upload/--limit-download flags before the subcommand on every invocation. Agent dispatcher tracks host-wide caps received via config.update; server pushes them on hello and after PUT /api/hosts/{id}/bandwidth. Also extends api.CommandRunPayload with optional per-job overrides (BandwidthUpKBps/Down + PreHook/PostHook); the override consumers land in T2/T6.	2026-05-04 10:38:34 +01:00
steve	5b4a590508	server: drain pending_runs on tick + on agent reconnect Two trigger paths land here: - A 30s ticker in cmd/server calls Server.DrainAllDue(ctx). It walks pending_runs rows whose next_attempt_at <= now, dedupes by host, skips offline hosts, and per online host runs DrainPending. - onAgentHello spawns a background DrainPending(hostID). When a host comes back, every pending row for it is dispatchable now — due-ness becomes irrelevant once the wire is back. Each row's schedule + group are reloaded; ErrNotFound or disabled-schedule or gone-group abandons the row with a pending_run.abandoned audit. attempt >= retry_max also abandons. Otherwise dispatchBackupForGroup is invoked; success deletes the row, failure bumps attempt with exponential backoff capped at 30m.	2026-05-04 10:19:15 +01:00
steve	e2d94bf3a2	server: populate audit UserID on credential mutations + slog prune push errors Switch handleSetHostCredentials, handleSetAdminCredentials, and handleDeleteAdminCredentials from authedUser (bool) to requireUser (*store.User) so AuditEntry.UserID and Actor are populated correctly. Add slog.Warn on the non-ErrNotFound pushAdminCredsToAgent path in handleRunRepoPrune so decrypt/send failures surface in the server log rather than appearing as a generic host_offline 503.	2026-05-04 10:19:15 +01:00
steve	35f07c3cee	server: admin-credentials REST + Slot:admin push helper Adds GET/PUT/DELETE /api/hosts/{id}/admin-credentials handlers that mirror the existing repo-credentials endpoints but write to store.CredKindAdmin with AEAD additional-data "host:<id>:admin" (scoped away from the repo slot to prevent cross-binding). PUT immediately pushes a config.update(Slot:"admin") to the agent when it is connected, and the new pushAdminCredsToAgent helper is wired for use by the upcoming prune run-now endpoint (D2) to push on-demand before dispatch.	2026-05-04 10:19:15 +01:00
steve	f801fdf65b	store: host_credentials becomes kind-aware (repo + admin slots)	2026-05-04 10:19:15 +01:00
steve	b6f8de1dcc	lint: drive baseline to zero, drop only-new-issues gate Cleanup pass over the repo so CI can enforce lint going forward without the only-new-issues escape hatch: * gofumpt -w across the tree (31 hits, all formatting) * misspell --fix (25 hits, US-locale spelling) — but reverted on api.JobCancelled = "cancelled" since that literal is the wire + DB CHECK constraint value, plus matched the case in store/fleet.go back to "cancelled" and added //nolint:misspell on both for the next time someone reaches for the auto-fix * Wrap every `defer rows.Close()` / `defer stmt.Close()` / `defer res.Body.Close()` in `defer func() { _ = .Close() }()` to satisfy errcheck without losing the close itself * websocket.Dial callers (1 prod, 4 tests) now capture + close the upgrade response Body — coder/websocket can return res with a nil Body on success, so the test deferred-closes guard against that * Annotate the two genuine-by-design nilerr cases with //nolint comments explaining why nil-on-error is the contract (cookie missing = no session; ctx cancelled mid-backoff = clean shutdown) * Add brief godoc on the 10 exported const groups + types that revive flagged (api.HostOS/HostArch/JobKind/JobStatus/LogStream/ ErrorCode, restic.EventKind, store.Role, web.FS) * Drop the unused (Server).userByID method Inline the unparam baseView(active) — every UI page is under the dashboard primary nav today Result: `golangci-lint run ./...` reports 0 issues. CI lint job no longer needs only-new-issues: true; X-06 follow-up entry in tasks.md removed.	2026-05-03 16:15:17 +01:00
steve	ec0bf0f6c3	P2R-01: REST + WS rewire against the slim shape Schedules CRUD now takes {cron, enabled, source_group_ids[]} with cron parsed via robfig/cron/v3 and group membership scoped to the host. New source-groups CRUD lives at /api/hosts/{id}/source-groups; delete refuses with 409 if any schedule still references the group, returning the schedule list so the UI can prompt 'remove from these schedules first.' Repo-maintenance GET/PUT manages forget/prune/check cadences on host_repo_maintenance — no version bump, the server-side ticker (P2R-06) drives execution. Per-source-group Run-now (POST /hosts/{id}/source-groups/{gid}/run) resolves the group's includes/excludes/retention/tag and dispatches a backup command.run with the new structured CommandRunPayload fields (Includes/Excludes/Tag). Old per-host /hosts/{id}/run-backup and /hosts/{id}/init-repo return 410 Gone with a redirect message. schedule_push.go is rebuilt: buildScheduleSetPayload assembles the slim wire shape, pushScheduleSetOnConn ships it during the on-hello window, pushScheduleSetAsync fires after every CRUD mutation, and dispatchScheduledJob handles agent schedule.fire by iterating the schedule's source groups and dispatching one backup per group with actor_kind=schedule and scheduled_id pointing at the schedule. Auto-init at first WS connect: when the host has repo creds bound and no init job in its history, server dispatches restic init. Restic's 'config file already exists' soft-success means re-runs against an existing repo no-op; we don't auto-retry on failure (operator triggers re-init manually via the danger zone in P2R-09). api.Schedule drops Kind/Paths/Excludes/Tags/RetentionPolicy/Manual etc. in favour of {id, cron, enabled, source_groups: [...]}. The agent scheduler stops checking sch.Manual; cmd/agent's backup dispatch reads Includes/Excludes/Tag instead of Args. Tests cover the new HTTP surface end-to-end: source-groups CRUD with in-use refusal, schedule validation (bad cron / missing groups / foreign group), repo-maintenance auto-seed and validation, the 410 route, and buildScheduleSetPayload's wire-shape correctness. Full suite passes; smoke env exercises auto-init dispatch on hello, async push after schedule create, and per-source-group Run-now landing the right paths/excludes/tag at the agent.	2026-05-03 10:56:40 +01:00
steve	a086b0eb75	P2-02 (server side): schedule reconciliation push + ack handling Server is now the source of truth for the agent's cron set. * Helpers in schedule_push.go: - loadScheduleSetPayload reads the host's schedules + canonical version into the wire shape. - pushScheduleSetOnConn writes directly to a just-handshaken conn (avoids racing against Hub.Register on a brand-new connection). - pushScheduleSetAsync is the post-CRUD flavour — no-op when the host is offline (the next reconnect's on-hello path catches it up, so a missed push is non-fatal). - applyScheduleAck records what version the agent has confirmed. * onAgentHello restructured: was returning early when the host had no repo credentials, which made the schedule push unreachable for fresh hosts. Split into pushRepoCredsOnHello (silent no-op on ErrNotFound) + pushScheduleSetOnConn (always runs). Empty schedule list is a valid push: tells the agent to drop stale cron entries. * WS dispatcher gains an OnScheduleAck hook on HandlerDeps; the http server wires it to applyScheduleAck. MsgScheduleAck moves out of the "TODO(P2)" group into a real case that decodes the payload and forwards to the callback. * Schedule CRUD handlers each fire pushScheduleSetAsync after the audit-log write so the agent picks up changes within seconds. Tests cover: - On-hello push of an already-created schedule, agent acks, applied_schedule_version flips on the host row. - Connect-then-CRUD: empty initial push (version 0), then a follow-on push at version 1 after the operator creates a schedule via REST. Agent-side `schedule.set` handler (parse, replace local cron, emit `schedule.ack`) is the remainder of P2-02 and lands with P2-03's local scheduler. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:22:06 +01:00
steve	6cfbdfc7ab	P1-34: e2e smoke runbook + redacted GET /repo-credentials Adds docs/e2e-smoke.md — an ~5-minute runbook that walks the full P1 happy path against a sibling restic/rest-server: bootstrap admin, mint token with repo creds, enrol an agent, watch the config.update push land, run a backup, confirm the snapshot, edit creds and watch the second push fire. Per the design discussion this is a runbook (not a Go integration test); the Playwright version lands in P5-06. GET /api/hosts/{id}/repo-credentials returns the redacted view — {repo_url, repo_username, has_password} — so the UI can pre-fill the edit form without ever pulling the password out of the AEAD blob. Marks P1-32 / P1-33 / P1-34 done in tasks.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 13:49:34 +01:00
steve	b3b89045f2	P1-32: server-side encrypted repo creds + push-on-hello Operator-minted enrollment tokens now carry the repo URL/username/ password as one AEAD blob bound (via additional-data) to the token hash. ConsumeEnrollmentToken re-encrypts under host_id and writes a host_credentials row in the same tx as token-burn, so the binding moves with the credential. PUT /api/hosts/{id}/repo-credentials lets an operator edit creds post-enrollment; merges with the existing blob, audits, and pushes config.update if the agent is connected. WS handler grows an OnHello hook that the HTTP layer wires to send the host's decrypted creds as a config.update immediately after the hello succeeds — synchronously, so a racing command.run lands after the agent has its repo password. Schema: 0002_host_credentials.sql adds enc_repo_creds to enrollment_tokens and a host_credentials table (PK = host_id, FK ON DELETE CASCADE). Tests: round-trip token → consume → host_credentials with AAD swap detection; no-creds path stays compatible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 12:38:35 +01:00

13 Commits