restic-manager

Author	SHA1	Message	Date
steve	8fb1c100fd	P2-04.5: kill host.default_paths in favour of manual schedules Two independent path lists for "what does this host back up?" was a real divergence footgun — operator types one set at Add-host time and a different set into a schedule, both end up in the same repo, the snapshot history looks fine until restore. Resolution: drop host.default_paths entirely; add a `manual` flag on schedules. A manual schedule has paths/excludes/tags/retention like any other but no cron — it fires only via per-schedule Run-now. Single source of truth for what gets backed up. Schema (migration 0007): * schedules.manual INTEGER NOT NULL DEFAULT 0. * For every host with non-empty default_paths, seed a manual schedule with those paths and bump host_schedule_version. * ALTER TABLE hosts DROP COLUMN default_paths. * ALTER TABLE enrollment_tokens RENAME COLUMN default_paths TO initial_paths. Original draft of this migration rebuilt hosts via the create-new + drop-old + rename-new pattern. With foreign_keys=ON (set in the connection DSN), DROP TABLE on the parent fired ON DELETE CASCADE on every child of hosts(id) — schedules / jobs / snapshots / host_credentials all wiped on the smoke env when I tried it. SQLite 3.35+ supports column-level ALTERs directly, so we skip the rebuild dance and avoid the cascade trap. Six lines of SQL instead of sixty, no FK risk. Run-now rewiring: * New `dispatchScheduleNow(hostID, scheduleID, conn?)` helper unifies the agent-driven path (cron fire → schedule.fire → OnScheduleFire callback) and the UI-driven path (operator clicks Run-now on a schedule row). Conn arg is optional; nil falls back to Hub.Send. * New POST /hosts/{id}/schedules/{sid}/run endpoint — per-row Run-now button on the schedules list. * Dashboard's per-host Run-now (handleUIRunBackup) now picks the host's only enabled manual schedule, falls back to the only enabled schedule, else returns "pick one in Schedules tab". Keeps one-click for the common case. 
Agent: * Scheduler skips manual schedules in cron build (silent — they're a normal data shape, not an error). * Wire Schedule struct gains Manual flag. * Schedule.fire flow unchanged — the agent only ever fires non-manual schedules anyway. UI: * Add-host form retitled "Initial schedule · manual" so the operator knows the paths become an editable schedule under the Schedules tab. Result page calls out the manual schedule + points at Host > Schedules. * Schedule edit form: "Manual schedule" checkbox at the top of the When section; toggling it hides/shows the cron field via inline JS. Server-side validator skips the cron requirement when manual=true. * Schedule list shows a "manual" tag under the status pill and renders the When column as "— run-now only —" for manual rows. Each row gets a Run-now button when the schedule is enabled and the host is online. Tests + go test ./... green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 12:26:06 +01:00
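The dashboard Run-now fallback described in the commit above (the only enabled manual schedule, else the only enabled schedule, else an error) can be sketched roughly as follows. This is an illustration, not the project's handleUIRunBackup: the Schedule struct, pickRunNowSchedule, and errPickOne are hypothetical names introduced for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// Schedule is a hypothetical, trimmed-down view of a schedule row.
type Schedule struct {
	ID      int64
	Enabled bool
	Manual  bool
}

// errPickOne mirrors the "pick one in Schedules tab" response from the commit text.
var errPickOne = errors.New("pick one in Schedules tab")

// pickRunNowSchedule prefers the host's only enabled manual schedule,
// then the host's only enabled schedule of any kind, and otherwise
// reports that the operator has to choose.
func pickRunNowSchedule(schedules []Schedule) (Schedule, error) {
	var enabled, manual []Schedule
	for _, s := range schedules {
		if !s.Enabled {
			continue // disabled schedules are never candidates
		}
		enabled = append(enabled, s)
		if s.Manual {
			manual = append(manual, s)
		}
	}
	switch {
	case len(manual) == 1:
		return manual[0], nil
	case len(enabled) == 1:
		return enabled[0], nil
	default:
		return Schedule{}, errPickOne
	}
}

func main() {
	s, err := pickRunNowSchedule([]Schedule{
		{ID: 1, Enabled: true},               // cron-driven
		{ID: 2, Enabled: true, Manual: true}, // manual
	})
	fmt.Println(s.ID, err)
}
```

With one cron schedule and one manual schedule enabled, the manual one is chosen; two enabled manual schedules are ambiguous and produce the error.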
steve	608962441b	P2-02 (agent side) + P2-03: agent scheduler + schedule.fire dispatch Closes the schedule reconciliation loop end-to-end. * New `internal/agent/scheduler` package wraps robfig/cron/v3 with the lifecycle the agent needs: - Apply(ScheduleSetPayload, Sender) stops the prior cron (waiting for in-flight entries to return), rebuilds from scratch, starts, and emits schedule.ack with the version we just applied. - Disabled entries skipped silently; bad cron exprs (which shouldn't reach us — the server validates — but defensive) log a warn and skip. - On each cron tick the entry sends a new schedule.fire envelope to the server with {schedule_id, scheduled_at}. The scheduler itself never builds CommandRunPayloads — server is the source of truth for jobs. - tx is swapped on every Apply, so reconnect is handled naturally: cron entries that fire against a dropped tx log "no active connection" and skip the tick. - Stop() is idempotent and waits for the cron's in-flight workers via cron.Stop().Done(). * New wire message api.MsgScheduleFire + api.ScheduleFirePayload for the agent → server "I just fired locally" RPC. * Server-side dispatch (schedule_push.go: dispatchScheduledJob): looks up the schedule by id, validates ownership + that it's enabled, builds args from kind (paths for backup; other kinds are still arg-less in Phase 2 and grow as those job kinds land in P2-05..08), persists a jobs row with actor_kind=schedule + scheduled_id, and writes command.run back on the same conn so the agent runs through its existing dispatch path. * store.CreateJob now writes scheduled_id. This column was in the schema since 0001 but never populated — the original P1 path only had operator-driven jobs, so actor_kind was always 'user' and scheduled_id was always nil. * cmd/agent/main.go integration: dispatcher gains a scheduler.Scheduler; the MsgScheduleSet case now hands the payload to scheduler.Apply (in a goroutine so the WS read loop keeps draining other messages). 
WS dispatcher gains OnScheduleFire alongside OnScheduleAck. * Tests: - scheduler unit tests (4): ack-on-apply, cron tick fires schedule.fire envelope, disabled entries don't fire, replace-prior-state stops the old cron. - Server-side end-to-end: schedule.fire → command.run with the right job_id / kind / args, plus jobs row with actor_kind="schedule" and scheduled_id linking back to the schedule. Persistence of next-fire times across agent restarts is deliberately deferred. A missed fire window during downtime simply fires once on reconnect; that's the desirable behaviour (the operator wants the missed backup to run, not be silently skipped because we lost track of when it was due). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:29:12 +01:00
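The server-side checks described in the commit above (look up the schedule, validate ownership and the enabled flag, build args from kind) might look roughly like this. It is a minimal sketch under assumptions: the in-memory map stands in for the store, and Schedule, argsForFire, and the error values are hypothetical names, not the code in schedule_push.go.

```go
package main

import (
	"errors"
	"fmt"
)

// Schedule is a hypothetical subset of a schedules row.
type Schedule struct {
	ID      int64
	HostID  int64
	Kind    string
	Enabled bool
	Paths   []string
}

var (
	errNotFound  = errors.New("schedule not found")
	errWrongHost = errors.New("schedule belongs to another host")
	errDisabled  = errors.New("schedule is disabled")
)

// argsForFire validates a schedule.fire from hostID and returns the job args.
// Only backup schedules carry args (their paths); other kinds are arg-less.
func argsForFire(schedules map[int64]Schedule, hostID, scheduleID int64) ([]string, error) {
	s, ok := schedules[scheduleID]
	if !ok {
		return nil, errNotFound
	}
	if s.HostID != hostID {
		return nil, errWrongHost // an agent may only fire its own schedules
	}
	if !s.Enabled {
		return nil, errDisabled
	}
	if s.Kind == "backup" {
		// Copy so the caller cannot mutate the stored slice.
		return append([]string(nil), s.Paths...), nil
	}
	return nil, nil
}

func main() {
	schedules := map[int64]Schedule{
		7: {ID: 7, HostID: 1, Kind: "backup", Enabled: true, Paths: []string{"/etc", "/home"}},
	}
	args, err := argsForFire(schedules, 1, 7)
	fmt.Println(args, err)
}
```

Rejecting a fire for a schedule owned by another host keeps the server as the source of truth even if an agent sends an arbitrary schedule_id.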
steve	c8ead66f08	P1 polish: agent-as-root, init-repo flow, rest creds passthrough, UX fixes Cohesive batch from a smoke-test session against a real rest-server. Themed bullets: * Agent runs as root, sandboxed via systemd. CapabilityBoundingSet drops to CAP_DAC_READ_SEARCH + restore caps; ProtectSystem=strict with ReadWritePaths confined to /etc + /var/lib/restic-manager; NoNewPrivileges blocks escalation. Install script no longer creates a service user. spec.md §4.2 / §14.1 / §14.3 explain the rationale (matches UrBackup / Veeam / Bareos defaults; trying to back up "everything" as an unprivileged user creates silent skips on /home, /root, /var/lib/* with no upside vs the threat model the agent already implies). * Init-repo end-to-end. New JobKind="init" wired through agent runner, restic.Env.RunInit, server dispatcher, and a UI button (red "Initialise repo" in the run-now panel). hosts.repo_initialised_at flips on init success, on backup success, or on a non-empty snapshots.report. The "Run now" / "Init" / "Retry" branching now drives both the dashboard host row and the host-detail panel. Migrations 0004 (column), 0005 (jobs.kind CHECK widened, using the safe create-new-then-rename pattern; first version corrupted job_logs.job_id FK), 0006 (cleans up job_logs FK on already-affected DBs). * rest-server creds embedded at exec time only. restic.Env gains RepoUsername; mergeRestCreds() builds the user:pass@-prefixed URL inside envSlice() and never assigns it back to the struct, so nothing slog-able ever sees the cleartext form. RedactURL helper for any future surface that needs to log a URL safely. Both helpers tested. * Add-host UX. Repo password is now optional: server mints a 24-byte URL-safe random one and surfaces it once, alongside an htpasswd snippet ("echo PASS | htpasswd -B -i ... USERNAME") so the operator pastes one command on the rest-server host and one on the endpoint. 
Result page also links the install snippet at /install/install.sh (was /install.sh — 404'd before) and pipes to bash (not sh — script uses set -o pipefail and other bashisms; on Debian/Ubuntu sh is dash). * Late-subscriber race in JobHub. A fast-failing job could finish (DB write + Broadcast) before the browser's HX-Redirect → page load → WS-connect path completed, so the JS sat forever waiting on a job.finished that already passed. JobHub split into Register + Send + Run; handleJobStream now subscribes first, re-fetches the job, and sends a synthetic job.finished if the state is already terminal. * HTMX error visibility. New toast partial listens to htmx:responseError and surfaces the response body as a bottom-right toast — every server-side validation error now becomes visible without per-handler JS wiring. Also handles custom rm:toast events for future server-pushed notifications via the HX-Trigger header. Themed via existing CSS vars. * Dashboard rows are now whole-row clickable to host detail (CSS card-link pattern: absolute-positioned anchor + .row-action z-index restoration so the action button stays clickable). "View →" on a running job links to /jobs/<id> rather than /hosts/<id> since the row click already covers the host page. * "Run first" / "Run first backup" → "Run now" everywhere for consistency. * runbook (docs/e2e-smoke.md) updated — live-log streaming step now reflects P1-26; mentions the browser-driven Run-now flow. * _diag/dump-creds — moved out of cmd/ so go build doesn't pick it up; .gitignore now excludes /_diag/ entirely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:02:12 +01:00
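The "creds embedded at exec time only" idea from the commit above can be illustrated with the standard net/url package. This is a sketch, not the project's mergeRestCreds or RedactURL: withRestCreds and redactRepoURL are hypothetical names, and the example assumes repository strings of the form rest:https://host/path.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

const restPrefix = "rest:" // scheme prefix assumed for rest-server repositories

// withRestCreds returns the repo URL with user:pass@ embedded. The result is
// meant to be placed straight into the child process environment and never
// stored back on a long-lived struct or logged.
func withRestCreds(repo, user, pass string) (string, error) {
	rest, ok := strings.CutPrefix(repo, restPrefix)
	if !ok || user == "" {
		return repo, nil // not a rest repo, or nothing to embed
	}
	u, err := url.Parse(rest)
	if err != nil {
		return "", err
	}
	u.User = url.UserPassword(user, pass)
	return restPrefix + u.String(), nil
}

// redactRepoURL returns a form that is safe to log: any password in the
// userinfo is replaced by url.URL.Redacted.
func redactRepoURL(repo string) string {
	rest, ok := strings.CutPrefix(repo, restPrefix)
	prefix := ""
	if ok {
		prefix = restPrefix
	}
	u, err := url.Parse(rest)
	if err != nil {
		return prefix + "<unparseable>" // avoid echoing the raw input
	}
	return prefix + u.Redacted()
}

func main() {
	full, _ := withRestCreds("rest:https://backup.example:8000/host1", "alice", "s3cret")
	fmt.Println(redactRepoURL(full))
}
```

Keeping the credentialed string as a local value at exec time, and routing anything that must be logged through the redacting helper, is the property the commit describes.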
steve	8d5282a180	P1-22: snapshot listing via restic snapshots --json Agent calls restic snapshots --json after each successful backup (60s timeout, separate from the backup ctx) and ships the projection over the existing snapshots.report WS envelope. Failure here is logged but doesn't fail the job — the next successful backup catches the projection up. Server-side ReplaceHostSnapshots is delete-then-insert plus a hosts.snapshot_count update in one transaction so the dashboard's per-host count stays consistent with the projection. New read endpoint GET /api/hosts/{id}/snapshots returns the cached list with a refreshed_at marker so the UI can show staleness when an agent has been offline. Schema: dropped the unused snapshots.repo_id FK (repos as a first-class entity is P2 work), added short_id and refreshed_at columns, switched the time index to DESC for the most-recent-first list query. api.Snapshot gains short_id; size_bytes/file_count come from the embedded summary block on restic 0.16+ and stay zero on older clients. Tests cover round-trip, authoritative replacement after forget+prune shrinkage, and empty-after-wipe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 11:20:57 +01:00
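A rough sketch of projecting restic snapshots --json output into the cached shape, leaving size and file count at zero when no summary block is present, as the commit above describes. The JSON field names used here (id, short_id, time, summary.total_bytes_processed, summary.total_files_processed) are assumptions about restic's output and should be checked against the restic version in use; the Snapshot type and projectSnapshots are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// resticSummary holds the assumed subset of the optional summary block.
type resticSummary struct {
	TotalBytesProcessed int64 `json:"total_bytes_processed"`
	TotalFilesProcessed int64 `json:"total_files_processed"`
}

// resticSnapshot is the assumed subset of one element of the JSON array.
type resticSnapshot struct {
	ID      string         `json:"id"`
	ShortID string         `json:"short_id"`
	Time    time.Time      `json:"time"`
	Summary *resticSummary `json:"summary"` // nil when the client omits it
}

// Snapshot is a hypothetical projection shipped to the server.
type Snapshot struct {
	ID        string
	ShortID   string
	Time      time.Time
	SizeBytes int64
	FileCount int64
}

// projectSnapshots decodes the JSON array and copies summary figures when
// present; without a summary the numeric fields stay zero.
func projectSnapshots(raw []byte) ([]Snapshot, error) {
	var in []resticSnapshot
	if err := json.Unmarshal(raw, &in); err != nil {
		return nil, fmt.Errorf("decode snapshots: %w", err)
	}
	out := make([]Snapshot, 0, len(in))
	for _, r := range in {
		s := Snapshot{ID: r.ID, ShortID: r.ShortID, Time: r.Time}
		if r.Summary != nil {
			s.SizeBytes = r.Summary.TotalBytesProcessed
			s.FileCount = r.Summary.TotalFilesProcessed
		}
		out = append(out, s)
	}
	return out, nil
}

func main() {
	raw := []byte(`[{"id":"aaaa1111","short_id":"aaaa","time":"2026-05-01T10:00:00Z",
		"summary":{"total_bytes_processed":2048,"total_files_processed":3}}]`)
	snaps, err := projectSnapshots(raw)
	fmt.Println(len(snaps), err)
}
```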
steve	f55747a281	phase 1 foundations: api types, store, crypto, auth Lands the bottom three layers of Phase 1: P1-08 internal/api: protocol_version + envelope + every WS message shape from spec.md §6.2 (Hello, Heartbeat, Job, Schedule, etc). Wire-format tests pin the JSON shape so a rename here breaks tests instead of silently breaking the agent. P1-02 + P1-03 internal/store: SQLite via modernc.org/sqlite, embed.FS + a tiny version table for hand-rolled migrations. 0001_initial.sql covers every table from spec.md §5 plus enrollment_tokens and host_schedule_version. Typed accessors for users / sessions / enrollment / audit. WAL + foreign_keys + busy_timeout on by default. P1-06 internal/crypto: XChaCha20-Poly1305 AEAD wrapper with per-message random nonce. Key file lifecycle (generate + refuse-to-overwrite, load with size validation). Optional additionalData binds ciphertext to the row that owns it. P1-04 internal/auth (partial — passwords + tokens; sessions middleware lands with the HTTP handlers): argon2id following RFC 9106 (64 MiB / t=3 / p=4 / 32B), constant-time verify. HashToken stores SHA-256 of session/agent/enrollment tokens so a stolen DB doesn't hand over credentials. Build floor moves to Go 1.25 (modernc.org/sqlite v1.50+ requires it); CI + Dockerfile + README updated. Markdown lint diagnostics on tasks.md cleared. All packages tested. ~70 new tests pass in <1s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 00:24:40 +01:00
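The token-hashing approach mentioned in the commit above (store only the SHA-256 of a token so a stolen DB does not hand over credentials) can be sketched with the standard library alone. The function names are hypothetical and the hex encoding is an assumption; the project's HashToken may differ in both.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

// newToken returns a random URL-safe token to hand to the client once.
func newToken() (string, error) {
	b := make([]byte, 32)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(b), nil
}

// hashToken returns the hex SHA-256 digest that would be stored in the DB.
// A fast hash is acceptable here because the token is high-entropy random
// data, unlike a human-chosen password.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// verifyToken hashes the presented token and compares it with the stored
// digest in constant time.
func verifyToken(presented, storedHex string) bool {
	got := hashToken(presented)
	return subtle.ConstantTimeCompare([]byte(got), []byte(storedHex)) == 1
}

func main() {
	tok, err := newToken()
	if err != nil {
		panic(err)
	}
	stored := hashToken(tok)
	fmt.Println(verifyToken(tok, stored), verifyToken("wrong", stored))
}
```

Only the digest is persisted; the cleartext token exists in the response that mints it and in the client's subsequent requests.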

5 Commits