restic-manager

Author	SHA1	Message	Date
steve	ccd14f7cee	P6-04+05: Prometheus /metrics endpoint + Grafana dashboard New internal/server/metrics package emits the legacy text/plain exposition format directly, so we don't pull in prometheus/client_golang. Endpoint is opt-in via RM_METRICS_TOKEN and/or RM_METRICS_TRUSTED_CIDR; route is not mounted at all if neither gate is set. Both gates ANDed when both configured. Per-host gauges (online, last_backup_*, repo_size_bytes, snapshot_count, open_alerts, repo_status), server gauges (hosts_total/online, active_alerts by severity, build_info), and an in-memory job-duration histogram observed from the existing MsgJobFinished branch in the WS handler. Docs in docs/prometheus.md (enable + scrape config + metric reference + dashboard import). Sample dashboard at deploy/grafana/restic-manager-dashboard.json - six panels, Grafana schema 39, single Prometheus datasource variable. Tests: golden render, concurrent observe, bucket boundaries in the metrics package; auth matrix (no auth -> 404, token gate, CIDR gate, both required) in the HTTP layer.	2026-05-07 23:17:15 +01:00
steve	6ef58a707e	ws: synthesize job.finished from update watcher so browser stream wakes up	2026-05-07 20:32:48 +01:00
steve	9d5775fb47	p6-01/02: agent self-update + fleet update server cluster - alert: update_failed (per-host, dedup=hostID) + fleet_update_halted (system-scoped, host_id NULL via new RaiseOrTouchSystem helper). - ws: UpdateWatcher tracks in-flight command.update dispatches and reconciles them against incoming hello envelopes — success path marks the job succeeded and auto-resolves the alert; 90s timeout marks the job failed and raises update_failed. - http: POST /api/hosts/{id}/update (admin-only JSON) + the HTMX /hosts/{id}/update form variant. Pre-checks: host exists, online, agent_version != current, no running update job. Refactored core into Server.dispatchHostUpdate so the fleet worker can share it without going through HTTP. - fleetupdate: rolling worker iterating through host slots, halting on first failure and raising fleet_update_halted. Polling-based version-match (re-read hosts.agent_version every 1s up to 95s) — no extra plumbing into the WS hello path. At-most-one-running is enforced at the store layer (ErrFleetUpdateRunning). - cmd/server: wire UpdateWatcher and FleetWorker into the main goroutine; the worker uses a small serverDispatcher adapter that delegates back into Server.DispatchHostUpdate. Tests: watcher (success/timeout/mismatch/late-hello), HTTP endpoint (happy + four pre-check branches + RBAC), worker (two-host happy, timeout-halt, host-offline-halt, already-at-target skip, cancel mid-run, double-Start guard).	2026-05-06 22:03:50 +01:00
steve	efed96f67a	agent: command.update handler + updater package (Linux + Windows)	2026-05-06 21:42:50 +01:00
steve	02e4ef7544	testing: bootstrap UI, agent reliability, NS-01..04 + alert username Smooths the rough edges that came up exercising a live deployment. First-run bootstrap UI: /bootstrap renders a username + password form that uses the in-memory token directly (operator no longer copies it out of the log); /login redirects there while bootstrap is available. Agent reliability: failJob synthetic envelopes so command.run early returns no longer hang the server-side job; runtime probe of restic restore --help drives --no-ownership instead of version sniffing (0.18.x had it removed). Server unit re-shaped: ProtectSystem=full plus ReadWritePaths=/etc/restic-manager, no ProtectHome, so restore can now write anywhere a user might want. Restore wizard: default target is /root/rm-restore/<job-id>/ with clearer help text. Re-init confirm input uses .field (was .input, which doesn't exist, so text was invisible). NS-01 host delete: store DeleteHost, admin-band /hosts/{id}/delete with hostname-confirm danger zone, audit, FK cascade, live WS close. NS-02 enrollment-token recovery: outstanding-tokens panel on /hosts/new, regenerate (preserves attachments) and revoke handlers + audit, store-level ListOutstandingEnrollmentTokens and DeleteEnrollmentToken. NS-03 repo init / probe surface: migration 0020 adds hosts.repo_status + repo_status_error; WS handler projects every init job's outcome onto the host row (idempotent already-initialised collapses to ready); creds-save resets status and dispatches a fresh probe; /hosts/{id}/repo/probe retry endpoint with banner. NS-04 dashboard live + sort + filter: query-string filter (q/status/repo_status/tag/sort/dir), 5s htmx live poll mirroring the alerts pattern with a localStorage live toggle, sortable column headers, filter row + clear. Alerts page: ack'd-by line resolves user_id ULID to username. Compose.yaml ignored (host-specific).	2026-05-05 22:03:15 +01:00
steve	fb978ad10c	p5-03: docker-only release path (drop goreleaser) Single public deliverable per tag: a multi-arch server image, with cross-compiled agent binaries + install scripts + the systemd unit baked under /opt/restic-manager/dist/. The /agent/binary and /install/* handlers fall back from <DataDir>/... to that read-only path so a fresh container Just Works without first-run staging; operators can still drop a custom build into <DataDir>/ to override per-host. Architecture rationale: agent distribution already routes through the running server, so the release surface mirrors that — there's no second source of truth to keep in sync. Workflow .gitea/workflows/release.yml triggers on v..* tag-push (fan-out :vX.Y.Z / :X.Y / :X, plus :latest once MAJOR>=1) and workflow_dispatch (snapshot tag only). Pushes to the Gitea container registry on this instance. Both binaries grow main.commit + main.date ldflag targets. Makefile and Dockerfile fill them; release workflow forwards from gitea.sha plus a UTC timestamp. Spec : docs/superpowers/specs/2026-05-05-p5-03-docker-only-release.md Plan : docs/superpowers/plans/2026-05-05-p5-03-docker-only-release.md	2026-05-05 15:18:48 +01:00
steve	e0989e1cef	server: build OIDC client at startup; sweep oidc_state on alert tick	2026-05-05 13:45:52 +01:00
steve	809c4ed910	alert: construct + run engine; expose hub to handlers - Construct notification.NewHub and alert.NewEngine at boot in cmd/server/main.go - Start go alertEngine.Run(ctx) after construction, before the HTTP listener - Wire AlertEngine and NotificationHub into rmhttp.Deps (fields already existed) - Remove the TODO(G1) in the offline sweeper; now calls NotifyHostOffline per ID	2026-05-04 20:32:10 +01:00
steve	c710743231	alert: wire engine into ws hello + MarkJobFinished + offline sweep - ws.HandlerDeps gains an AlertEngine *alert.Engine field; populated from http.Deps.AlertEngine (nil until G1 constructs the engine) - runAgentLoop calls NotifyHostOnline after MarkHostHello succeeds - dispatchAgentMessage MsgJobFinished case calls NotifyJobFinished, looking up the job Kind via Store.GetJob before notifying - store.MarkHostsOfflineStaleReturnIDs added: SELECT+UPDATE in one transaction, returns the IDs that flipped to offline - offline sweeper in cmd/server/main.go switched to the new variant; TODO(G1) comment marks where NotifyHostOffline calls will land	2026-05-04 19:54:39 +01:00
steve	a781e95c94	P3 follow-up: editable target dir, conditional --no-ownership, UK lint Three small follow-ups from review: 1. Restore target is now operator-editable. Default value is the literal '\$HOME/rm-restore/<job-id>/' (agent expands \$HOME at run time using os.UserHomeDir(); also handles \${HOME} and ~/ prefixes). Operator can replace with any absolute path. - ui_restore.go validates the input is either absolute or starts with one of the recognised prefixes; other env-var refs (\$PATH etc.) are deliberately rejected so operator paths can't pick up arbitrary agent env values. - host_restore.html replaces the read-only mono-text display with a real <input>; help text spells out that \$HOME resolves agent-side and <job-id> is substituted on dispatch. - install.sh + the systemd unit prep /root/rm-restore so the default works under the sandbox: ReadWritePaths gains a soft '-/root/rm-restore' entry (the '-' makes the bind-mount soft-fail if missing, but install.sh pre-creates it root-owned 0700). 2. --no-ownership flag now gated on restic version. The flag was added in restic 0.17 and 0.16 rejects it. Previously dropped it wholesale — that meant new-dir restores silently preserved ownership against design intent on 0.17+. Now the agent threads its detected restic version (sysinfo already collects it) through runner.Config -> restic.Env, and RunRestore appends --no-ownership only when AtLeastVersion(0, 17) returns true. 0.16 hosts still restore with original uid/gid; help text in the wizard explicitly notes this. The previous 'Original ownership is preserved' copy was wrong for new-dir mode and is corrected. 3. golangci-lint misspell locale switched US -> UK and the codebase swept (73 corrections, mostly behaviour/serialise/recognise/honour). Wire-format ErrorCode 'unauthorized' -> 'unauthorised' is a tiny contract change but the agent doesn't parse those codes today and no external API consumers exist yet. Tests passed before + after. 
Tests: - internal/restic/version_test.go covers Env.AtLeastVersion across edge cases (empty, exact match, patch above, minor below, non- numeric) and expandHome on \$HOME / \${HOME} / ~/, plus pass-through for absolute paths and refusal of other env vars. - ui_restore_test updated: TargetDir now starts '\$HOME/rm-restore/' with the job_id substituted into the placeholder. Live verified on the smoke env: default target restored to /root/rm-restore/<job-id>/ as the agent's expanded \$HOME (2 files, 14 bytes); custom override '/tmp/custom-restore/<job-id>/' restored into the agent's PrivateTmp namespace (1 file, 6 bytes); both jobs 'succeeded', exit 0.	2026-05-04 17:27:52 +01:00
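The version gate described in this entry can be sketched roughly as below. This is an illustration under assumptions, not the code in internal/restic: `atLeastVersion` and `restoreArgs` are hypothetical free functions standing in for Env.AtLeastVersion and RunRestore, and treating an unparseable version as "too old" is a guess at the edge-case behaviour. The 02e4ef7544 entry in this log later replaces version sniffing with a runtime probe.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// atLeastVersion reports whether v (e.g. "0.17.3") is >= major.minor.
// Empty or non-numeric input yields false, so the flag is simply omitted.
func atLeastVersion(v string, major, minor int) bool {
	parts := strings.Split(strings.TrimSpace(v), ".")
	if len(parts) < 2 {
		return false
	}
	maj, err := strconv.Atoi(parts[0])
	if err != nil {
		return false
	}
	mnr, err := strconv.Atoi(parts[1])
	if err != nil {
		return false
	}
	if maj != major {
		return maj > major
	}
	return mnr >= minor
}

// restoreArgs builds a new-directory restore argv and appends
// --no-ownership only when the detected restic version supports it.
func restoreArgs(version, snapshot, target string) []string {
	args := []string{"restore", snapshot, "--target", target, "--json"}
	if atLeastVersion(version, 0, 17) {
		args = append(args, "--no-ownership")
	}
	return args
}

func main() {
	fmt.Println(restoreArgs("0.17.1", "abc123", "/root/rm-restore/job1/"))
	fmt.Println(restoreArgs("0.16.4", "abc123", "/root/rm-restore/job1/"))
}
```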
steve	f5e3bca6a2	P3-03: restic restore + diff execution path Wires JobRestore and JobDiff end-to-end at the agent layer (the wizard backend that drives this lands in the next slice). - internal/api: JobRestore + JobDiff JobKind constants. CommandRunPayload grows nullable Restore + Diff sub-payloads. RestorePayload carries snapshot_id, paths, in_place, target_dir; DiffPayload carries snapshot_a + snapshot_b. - internal/restic.RunRestore wraps 'restic restore <sid> --target ... [--no-ownership] [--include p]...' with --json. New pumpRestoreStdout parses the per-line status / summary objects (drops raw status from log.stream — the throttled job.progress envelope covers it). New RestoreStatus + RestoreSummary types mirror restic's wire shape. - internal/restic.RunDiff wraps 'restic diff --json <a> <b>'. - internal/agent/runner: RunRestore translates RestoreStatus into job.progress (mapping FilesRestored → FilesDone etc) with a small estimateETA helper since restic doesn't provide ETA for restore. RunDiff is a thin streamHandler wrapper. - cmd/agent dispatcher gains JobRestore + JobDiff cases. Both reuse the spawn() helper from P3-X1 so cancel just works. - Drive-by fix: lastProgress was initialised to time.Now() so the very first status event was suppressed by the 1s throttle if the agent reported quickly. Initialise to time.Time{} (zero) so the first event always emits. Affects backup + restore. Tests: - restore_test covers restore happy path (started → progress → finished, kind=restore on the started envelope), in-place argv asserts no --no-ownership, new-dir argv asserts --no-ownership + --target + --include, diff produces the expected log.stream lines. Restage block (CLAUDE.md) is deferred to the end of the restore sub-phase so we restage once with all changes.	2026-05-04 15:24:14 +01:00
steve	13f58bd052	P3-X2: tree.list synchronous WS RPC + per-session cache Foundational for the restore wizard's tree browser. The wizard needs to lazy-load directory contents from a snapshot as the operator drills down; this lands the transport. - internal/api adds MsgTreeList (server → agent) + MsgTreeListResult (agent → server) with TreeListRequestPayload / TreeListEntry / TreeListResultPayload types. Reply correlates by Envelope.ID. - internal/restic.ListTreeChildren wraps 'restic ls --json' and filters its recursive output to direct children of the requested path. Parser + path-normalisation + isDirectChild are unit-tested. - internal/server/ws/rpc.go introduces a generic SendRPC helper on Hub: register a buffered channel keyed by ULID, send the request, block on ctx.Done()/timeout/reply. Reply routing piggybacks on the existing dispatchAgentMessage by adding a MsgTreeListResult case that forwards to the registered waiter; if no waiter is registered (caller already gave up) the stray reply is dropped quietly. - cmd/agent gains a tree.list handler that runs ListTreeChildren on a fresh per-call context (60s ceiling) and ships the matching tree.list.result envelope. Errors surface in result.Error rather than as transport failures so the server-side waiter can render a sensible UI message. - internal/server/http/tree_cache.go is the per-wizard-session cache layer (~30min TTL, sweep-on-access) that fetchTreeWithCache uses before falling through to SendRPC. Cached on success only; agent errors aren't cached so a transient failure doesn't poison the session. Tests: - internal/restic/ls_test.go covers parseLsChildren at root / mid-tree / leaf, plus normalizeTreePath and isDirectChild edge cases. - internal/server/ws/rpc_test.go unit-tests the registry: round-trip, release semantics, concurrent waiters, ctx-cancel. - internal/server/http/tree_rpc_test.go is the full round-trip: server SendRPC → fake-agent over a real WS → reply → server gets the payload. 
Plus a timeout test that confirms ~300ms timeouts terminate in ~300ms rather than waiting forever. The cache is plumbed but no UI handler hits fetchTreeWithCache yet — that lands with P3-01 (wizard backend). The unused-linter is suppressed via nolint until the wizard wires it in.	2026-05-04 15:19:22 +01:00
steve	94149a7324	P3-X1: cancel-job feature Wires the existing job_detail Cancel button (which was a UI stub) into real backend behaviour: - internal/api already declared MsgCommandCancel + CommandCancelPayload; promote those from forward-declarations to a working envelope. Agent side: cmd/agent/main.go drops the TODO-stub and gains a per-job ctx.CancelFunc map. runJob's switch is refactored around a small spawn() helper so each kind's goroutine derives a per-job context, registers the cancel, and removes itself on completion regardless of outcome. command.cancel looks up the func and fires it. - internal/agent/runner.sendFinished now takes ctx and rebadges ctx.Canceled errors as JobCancelled (exit 130) rather than JobFailed. All Run* call sites updated. - internal/restic.resticCmd sets cmd.Cancel to send SIGTERM (via build-tagged sigterm constant; os.Kill on Windows since SIGTERM isn't deliverable there) and cmd.WaitDelay=5s for the SIGKILL fallback. SIGTERM lets restic remove its lock file before exiting. - New POST /api/jobs/{id}/cancel server endpoint validates the job is non-terminal and the host is online, sends command.cancel via the hub, writes a job.cancel audit row, returns 202. The agent's resulting job.finished (status=cancelled) is what actually transitions the row. Tests: - internal/server/http/cancel_test.go covers happy path (envelope shape + audit row), 409 for terminal jobs, 404 for missing jobs, 503 for offline hosts. - internal/agent/runner/cancel_test.go covers cancel mid-run: a fake restic that exec'd into 'sleep 30' is canceled 150ms after start and the resulting job.finished reports JobCancelled with exit 130 in well under the WaitDelay. Foundational for P3 restore (operator needs to be able to cancel a running backup if they need to restore urgently). Independently useful for prune/check/backup that are stuck.	2026-05-04 15:11:49 +01:00
steve	8062db1f2f	agent: P2-16 Windows service (SCM) integration internal/agent/service: build-tagged into service_windows.go (svc.Handler that listens for Stop/Shutdown + delegates to the agent loop) and service_other.go (foreground stub for Linux/macOS). install_windows.go wraps mgr.Connect+CreateService/Delete/Start/Stop for the new 'restic-manager-agent install|uninstall|start|stop' subcommands. Cross-compile verified: GOOS=windows GOARCH=amd64 go build ./cmd/agent succeeds. UNTESTED on Windows itself: the SCM round-trip can't be exercised from Linux CI; treat as a starting point for the first real Windows install.	2026-05-04 11:13:56 +01:00
steve	4c81ff3e7b	ui+server: P2-18d pending hosts dashboard panel + expiry sweeper Dashboard handler loads ListPendingHosts(now); template renders a warn-bordered panel above the host table with hostname, OS/arch, fingerprint (selectable / copyable), source IP, age, expiry. Each row carries an inline accept form (repo URL/user/password) plus a Reject button. cmd/server adds a 60s ticker calling DeleteExpiredPendingHosts so 1h-stale rows drop off.	2026-05-04 11:11:32 +01:00
steve	a46d906d27	agent: P2-18c announce-and-approve enrolment path When -enroll-server is supplied without -enroll-token, the agent mints (and persists) an Ed25519 keypair, POSTs /api/agents/announce, prints the SHA256 fingerprint in a copy-friendly banner, opens /ws/agent/pending, signs the server's nonce, and blocks until the admin clicks Accept (1h ceiling). On accept, persists the bearer + host_id from the 'enrolled' message; on reject (close code 4001) exits with a clear error. Repo creds are pushed via config.update on the first standard WS hello (P1-32 path), not in the enrolled message itself.	2026-05-04 11:09:47 +01:00
steve	7b1990cf11	agent+server: P2R-11 pre/post hook execution for backup jobs Agent: new runner.BackupHooks struct + runHook helper invoked via /bin/sh -c (cmd.exe /C on Windows). pre_hook non-zero exit aborts the backup; post_hook always runs with RM_JOB_STATUS=succeeded|failed in env. Output streamed as 'hook(<phase>): …' log.stream lines. Hooks only run for kind=backup (other kinds skip both phases). Server: resolveBackupHooks resolves group → host default → empty, decrypts via crypto.AEAD with per-slot ad bytes, plumbs plaintext into CommandRunPayload for both schedule.fire and per-group Run-now dispatch sites. Decrypt failures degrade silently to no hook so a malformed blob can't poison every backup.	2026-05-04 10:57:28 +01:00
steve	cdf88c6dc3	agent+server: apply host bandwidth caps to restic invocations P2R-13a. restic.Env gains LimitUploadKBps/LimitDownloadKBps which are emitted as global --limit-upload/--limit-download flags before the subcommand on every invocation. Agent dispatcher tracks host-wide caps received via config.update; server pushes them on hello and after PUT /api/hosts/{id}/bandwidth. Also extends api.CommandRunPayload with optional per-job overrides (BandwidthUpKBps/Down + PreHook/PostHook); the override consumers land in T2/T6.	2026-05-04 10:38:34 +01:00
steve	f94e8ec967	api+agent: document protocol-version stability and forget back-compat decisions version.go: add a comment block explaining why Phase 5's wire changes (CommandRunPayload, ConfigUpdatePayload, RepoStatsPayload reshapes) did not bump CurrentProtocolVersion — lockstep deploy, no rolling-upgrade path, smoke env restage enforces it. Notes where a version bump to 2 would be required if a multi-version path is ever introduced. cmd/agent/main.go: document why the JobForget handler hard-errors on empty ForgetGroups rather than falling back to a single-policy form. The maintenance ticker is the only writer and always populates the field; the fallback was specced but skipped given lockstep deploy.	2026-05-04 10:19:15 +01:00
steve	5b4a590508	server: drain pending_runs on tick + on agent reconnect Two trigger paths land here: - A 30s ticker in cmd/server calls Server.DrainAllDue(ctx). It walks pending_runs rows whose next_attempt_at <= now, dedupes by host, skips offline hosts, and per online host runs DrainPending. - onAgentHello spawns a background DrainPending(hostID). When a host comes back, every pending row for it is dispatchable now — due-ness becomes irrelevant once the wire is back. Each row's schedule + group are reloaded; ErrNotFound or disabled-schedule or gone-group abandons the row with a pending_run.abandoned audit. attempt >= retry_max also abandons. Otherwise dispatchBackupForGroup is invoked; success deletes the row, failure bumps attempt with exponential backoff capped at 30m.	2026-05-04 10:19:15 +01:00
steve	14b703be58	server: maintenance ticker drives forget/prune/check on cadence Wires a 60s server-side ticker to the pure-logic maintenance.Decide introduced in the previous commit. Decisions flow through a new DispatchMaintenance method on Server, which: - skips offline hosts (no pending_runs queueing: maintenance is not a backup, missed fires shouldn't pile up) - silently skips prune when admin creds aren't bound - pushes admin creds before prune, then dispatches with RequiresAdminCreds=true (same as operator-driven prune) - persists job rows with actor_kind="system" Reshapes the forget wire payload from a single RetentionPolicy to a ForgetGroups list (one tag + per-group keep-* per source group). The agent walks the groups and runs `restic forget --tag <name> --keep-*` once per group. Dead-code removed: CommandRunPayload.RetentionPolicy, the old forget JSON-decode in cmd/agent, and the single-policy form of restic.RunForget.	2026-05-04 10:19:15 +01:00
steve	a110e3c00c	agent: secrets fail-loud on corrupt blob + small polish Save and SaveAdmin now propagate loadBundle errors instead of silently overwriting a corrupt file (data-loss fix). Tests added for both paths. reportStats logs a Debug on RunStats failure; r in runJob gets a comment explaining the prune-runner asymmetry; runner_test comment tightened.	2026-05-04 10:19:15 +01:00
steve	57bf9690f2	agent: RunPrune/RunCheck/RunUnlock + reportStats + admin-cred slot dispatch Extract resticEnv/sendStarted/streamHandler/sendFinished helpers to remove boilerplate duplication across Run* methods. Add RunPrune (ships repo.stats with LastPruneAt before job.finished), RunCheck (ships stats with LastCheckStatus/LockPresent regardless of outcome), RunUnlock (ships LockPresent=false on success), and reportStats (fills size fields via RunStats when caller didn't populate them). Wire JobPrune/JobCheck/JobUnlock into the dispatcher switch; teach MsgConfigUpdate about the Slot discriminator for admin vs repo creds; add strconv import for subset-pct parsing.	2026-05-04 10:19:15 +01:00
steve	ec0bf0f6c3	P2R-01: REST + WS rewire against the slim shape Schedules CRUD now takes {cron, enabled, source_group_ids[]} with cron parsed via robfig/cron/v3 and group membership scoped to the host. New source-groups CRUD lives at /api/hosts/{id}/source-groups; delete refuses with 409 if any schedule still references the group, returning the schedule list so the UI can prompt 'remove from these schedules first.' Repo-maintenance GET/PUT manages forget/prune/check cadences on host_repo_maintenance — no version bump, the server-side ticker (P2R-06) drives execution. Per-source-group Run-now (POST /hosts/{id}/source-groups/{gid}/run) resolves the group's includes/excludes/retention/tag and dispatches a backup command.run with the new structured CommandRunPayload fields (Includes/Excludes/Tag). Old per-host /hosts/{id}/run-backup and /hosts/{id}/init-repo return 410 Gone with a redirect message. schedule_push.go is rebuilt: buildScheduleSetPayload assembles the slim wire shape, pushScheduleSetOnConn ships it during the on-hello window, pushScheduleSetAsync fires after every CRUD mutation, and dispatchScheduledJob handles agent schedule.fire by iterating the schedule's source groups and dispatching one backup per group with actor_kind=schedule and scheduled_id pointing at the schedule. Auto-init at first WS connect: when the host has repo creds bound and no init job in its history, server dispatches restic init. Restic's 'config file already exists' soft-success means re-runs against an existing repo no-op; we don't auto-retry on failure (operator triggers re-init manually via the danger zone in P2R-09). api.Schedule drops Kind/Paths/Excludes/Tags/RetentionPolicy/Manual etc. in favour of {id, cron, enabled, source_groups: [...]}. The agent scheduler stops checking sch.Manual; cmd/agent's backup dispatch reads Includes/Excludes/Tag instead of Args. 
Tests cover the new HTTP surface end-to-end: source-groups CRUD with in-use refusal, schedule validation (bad cron / missing groups / foreign group), repo-maintenance auto-seed and validation, the 410 route, and buildScheduleSetPayload's wire-shape correctness. Full suite passes; smoke env exercises auto-init dispatch on hello, async push after schedule create, and per-source-group Run-now landing the right paths/excludes/tag at the agent.	2026-05-03 10:56:40 +01:00
steve	6a171596f1	P2-05: forget command with retention policy End-to-end forget plumbing — operator can create a forget schedule with keep-* values, agent runs restic forget --keep-* … on the schedule's cron (or via per-row Run-now), snapshot list shrinks, UI updates. * api.CommandRunPayload gains retention_policy json.RawMessage so the agent doesn't need a typed copy of the server-side struct. * restic.ForgetPolicy mirrors restic's --keep-* flags. Empty() reports zero dimensions; restic wrapper RunForget refuses to run an empty policy (would delete every snapshot). Does NOT pass --prune — pruning lives behind a separate admin-only credential (P2-06); forget just rewrites the snapshot index. * runner.RunForget mirrors RunBackup's envelope shape so the live log viewer works without special-casing. On success triggers reportSnapshots (forget shrinks the index, the host's snapshot count almost certainly changed). * cmd/agent dispatcher handles MsgCommandRun with kind=forget, decodes RetentionPolicy from the wire, builds restic.ForgetPolicy. * Server dispatchScheduleNow marshals the schedule's RetentionPolicy into the wire payload for kind=forget jobs. Refuses to dispatch a forget schedule with empty retention. * validateSchedule rejects kind=forget without at least one keep-* dimension (new error code: missing_retention). * UI schedule edit form gains a Kind dropdown (backup or forget; immutable on edit). Paths block toggles by kind via inline data-kind attributes. Form help-text explains the prune separation. Other kinds (prune, check, unlock) deferred to P2-06..08; the Kind dropdown only offers backup and forget today. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 14:07:42 +01:00
steve	608962441b	P2-02 (agent side) + P2-03: agent scheduler + schedule.fire dispatch Closes the schedule reconciliation loop end-to-end. * New `internal/agent/scheduler` package wraps robfig/cron/v3 with the lifecycle the agent needs: - Apply(ScheduleSetPayload, Sender) stops the prior cron (waiting for in-flight entries to return), rebuilds from scratch, starts, and emits schedule.ack with the version we just applied. - Disabled entries skipped silently; bad cron exprs (which shouldn't reach us — the server validates — but defensive) log a warn and skip. - On each cron tick the entry sends a new schedule.fire envelope to the server with {schedule_id, scheduled_at}. The scheduler itself never builds CommandRunPayloads — server is the source of truth for jobs. - tx is swapped on every Apply, so reconnect is handled naturally: cron entries that fire against a dropped tx log "no active connection" and skip the tick. - Stop() is idempotent and waits for the cron's in-flight workers via cron.Stop().Done(). * New wire message api.MsgScheduleFire + api.ScheduleFirePayload for the agent → server "I just fired locally" RPC. * Server-side dispatch (schedule_push.go: dispatchScheduledJob): looks up the schedule by id, validates ownership + that it's enabled, builds args from kind (paths for backup; other kinds are still arg-less in Phase 2 and grow as those job kinds land in P2-05..08), persists a jobs row with actor_kind=schedule + scheduled_id, and writes command.run back on the same conn so the agent runs through its existing dispatch path. * store.CreateJob now writes scheduled_id. This column was in the schema since 0001 but never populated — the original P1 path only had operator-driven jobs, so actor_kind was always 'user' and scheduled_id was always nil. * cmd/agent/main.go integration: dispatcher gains a scheduler.Scheduler; the MsgScheduleSet case now hands the payload to scheduler.Apply (in a goroutine so the WS read loop keeps draining other messages). 
WS dispatcher gains OnScheduleFire alongside OnScheduleAck. * Tests: - scheduler unit tests (4): ack-on-apply, cron tick fires schedule.fire envelope, disabled entries don't fire, replace- prior-state stops the old cron. - Server-side end-to-end: schedule.fire → command.run with the right job_id / kind / args, plus jobs row with actor_kind= "schedule" and scheduled_id linking back to the schedule. Persistence of next-fire times across agent restarts is deliberately deferred. A missed fire window during downtime simply fires once on reconnect — that's the desirable behaviour (the operator wants the missed backup to run, not be silently skipped because we lost track of when it was due). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:29:12 +01:00
steve	c8ead66f08	P1 polish: agent-as-root, init-repo flow, rest creds passthrough, UX fixes Cohesive batch from a smoke-test session against a real rest-server. Themed bullets: * Agent runs as root, sandboxed via systemd. CapabilityBoundingSet drops to CAP_DAC_READ_SEARCH + restore caps; ProtectSystem=strict with ReadWritePaths confined to /etc + /var/lib/restic-manager; NoNewPrivileges blocks escalation. Install script no longer creates a service user. spec.md §4.2 / §14.1 / §14.3 explain the rationale (matches UrBackup / Veeam / Bareos defaults; trying to back up "everything" as an unprivileged user creates silent skips on /home, /root, /var/lib/* with no upside vs the threat model the agent already implies). * Init-repo end-to-end. New JobKind="init" wired through agent runner, restic.Env.RunInit, server dispatcher, and a UI button (red "Initialise repo" in the run-now panel). hosts.repo_initialised_at flips on init success, on backup success, or on a non-empty snapshots.report. The "Run now" / "Init" / "Retry" branching now drives both the dashboard host row and the host-detail panel. Migrations 0004 (column), 0005 (jobs.kind CHECK widened — using the safe create-new-then-rename pattern; first version corrupted job_logs.job_id FK), 0006 (cleans up job_logs FK on already- affected DBs). * rest-server creds embedded at exec time only. restic.Env gains RepoUsername; mergeRestCreds() builds the user:pass@-prefixed URL inside envSlice() and never assigns it back to the struct, so nothing slog-able ever sees the cleartext form. RedactURL helper for any future surface that needs to log a URL safely. Both helpers tested. * Add-host UX. Repo password is now optional — server mints a 24-byte URL-safe random one and surfaces it once, alongside an htpasswd snippet ("echo PASS \| htpasswd -B -i ... USERNAME") so the operator pastes one command on the rest-server host and one on the endpoint. 
Result page also links the install snippet at /install/install.sh (was /install.sh — 404'd before) and pipes to bash (not sh — script uses set -o pipefail and other bashisms; on Debian/Ubuntu sh is dash). * Late-subscriber race in JobHub. A fast-failing job could finish (DB write + Broadcast) before the browser's HX-Redirect → page load → WS-connect path completed, so the JS sat forever waiting on a job.finished that already passed. JobHub split into Register + Send + Run; handleJobStream now subscribes first, re-fetches the job, and sends a synthetic job.finished if the state is already terminal. * HTMX error visibility. New toast partial listens to htmx:responseError and surfaces the response body as a bottom-right toast — every server-side validation error now becomes visible without per-handler JS wiring. Also handles custom rm:toast events for future server-pushed notifications via the HX-Trigger header. Themed via existing CSS vars. * Dashboard rows are now whole-row clickable to host detail (CSS card-link pattern: absolute-positioned anchor + .row-action z-index restoration so the action button stays clickable). "View →" on a running job links to /jobs/<id> rather than /hosts/<id> since the row click already covers the host page. * "Run first" / "Run first backup" → "Run now" everywhere for consistency. * runbook (docs/e2e-smoke.md) updated — live-log streaming step now reflects P1-26; mentions the browser-driven Run-now flow. * _diag/dump-creds — moved out of cmd/ so go build doesn't pick it up; .gitignore now excludes /_diag/ entirely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:02:12 +01:00
steve	e6729a5a3d	P1-26: live job log viewer + WS browser fan-out hub Closes the P1-21 remainder. internal/server/ws/jobhub.go: new JobHub. Per-job_id set of subscribers; each gets a 64-deep buffered channel with a writer goroutine. Broadcast is non-blocking: if a subscriber is slow, its channel fills and messages are dropped for that subscriber only, so the agent's read loop is never blocked by a stuck browser. The agent dispatchAgentMessage path mirrors job.started / job.progress / log.stream / job.finished envelopes onto the hub in addition to its existing persistence work. The wire shape is the same end-to-end, so client-side JS switches on env.type the same way Go code does. GET /api/jobs/{id}/stream is the browser endpoint. Auth via session cookie (HTTP layer); upgrade; subscribe; pump until context closes. GET /jobs/{id} renders the live log page. Four states (queued/running/succeeded/failed) drive the header pill, the progress bar block, the failure summary panel, and the action button (Cancel job while running, Back to host afterwards). Already-persisted log lines are server-rendered on initial load; new lines arrive over the WS and append to #log-stream. Auto-scrolls unless the user scrolls up (a "⇢ Follow" pill re-attaches). On job.finished the page reloads after 600ms to pick up the final-state header rendered server-side. POST /hosts/{id}/run-backup now sets HX-Redirect → /jobs/{job_id} on success so HTMX lands the operator straight on the live log. For non-HTMX callers (curl / plain form post) it 303s to the same target. store.ListJobLogs returns persisted log lines for initial render on page load. Browser-verified end-to-end: enrol → run a real backup against a sibling restic/rest-server → live progress + 11 log lines stream in → succeeded pill + final stats land after page reload. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 21:45:56 +01:00
steve	55242caf58	P1-23 / P1-28: base layout, login, session-aware nav + Tailwind build P1-28: Tailwind standalone CLI wired into the Makefile. `make tailwind` downloads the pinned v3.4.17 binary into bin/tailwindcss (gitignored), builds web/styles/input.css → web/static/css/styles.css. `make build` now runs the CSS pass first; `make tailwind-watch` for dev. Output is embedded in the binary via web.FS — single static binary, no Node. The CSS source carries every component class the v1 mockups defined (status dots, buttons, host row, log viewer, progress bar, fields, chips, snippet panel, empty state) so screens that land later can just reach for them. P1-23: html/template tree at web/templates with two layouts (base with chrome, chromeless for login + bootstrap), one nav partial, and two pages (dashboard placeholder, login). internal/server/ui parses the tree at startup; ui_handlers.go in the http package wires: GET / dashboard (303 → /login when unauthed) GET /login sign-in form POST /login consume form, mint session cookie, 303 → / POST /logout drop cookie, 303 → /login GET /static/* embedded Tailwind bundle The HTML login flow shares store/session logic with /api/auth/login via a new authenticateAndSession helper — same security guarantees, two surface representations (HTML form / JSON). Verified end-to-end: bootstrap → form-login → authed dashboard → sign-out → 303 cycle works in the browser; Tailwind output emits only the component classes referenced in the live templates (9.6kB minified). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 19:19:06 +01:00
steve	9798a2b5fe	agent: log accept/complete on backup jobs; audit: populate host.enrolled payload Two warts surfaced during the smoke run: - Agent was silent between "config.update applied" and "job finished" — operators tailing journalctl saw no acknowledgement that a command.run had landed. Adds Info logs at job-accept ({job_id, paths}) and at successful completion. - The host.enrolled audit row had an empty {} payload. Now carries {hostname, os, arch, has_repo_creds} so an audit-log reader can answer "what got enrolled and did the operator bundle creds with the token" without joining back to hosts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 18:24:56 +01:00
steve	27086783da	P1-33: agent-side encrypted secrets store + push-on-update New internal/agent/secrets package: AEAD blob at /var/lib/restic-manager/secrets.enc, atomic write (os.CreateTemp + Sync + Rename), 0600. Key lives in agent.yaml as base64 (SecretsKey) — same trust boundary as the bearer token, minted on first start via EnsureSecretsKey. cmd/agent: dispatcher reads creds fresh from secrets.Load() on each job rather than from in-memory config. config.update merges the push with what's on disk and persists, so a daemon restart keeps the latest values. Legacy plaintext repo_url/repo_password in agent.yaml are silently migrated into secrets.enc on next start and stripped from the YAML on the following save. Tests: round-trip + wrong-key rejection + atomic-write post-condition for secrets; key idempotence + legacy-field parse/clear for config. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 12:41:28 +01:00
steve	811157b4ce	server: drop in-process TLS — HTTP-only behind reverse proxy Self-hosted deployments already terminate TLS at Caddy/Traefik/nginx; making the server do TLS too means double cert config, dual ACME plumbing, and an untested code path. Drop RM_TLS_CERT/RM_TLS_KEY, remove TLSEnabled() and the ListenAndServeTLS branch. Replace the cookie's "Secure if TLS-in-process" check with a new RM_COOKIE_SECURE flag (default true). Local HTTP-only testing sets RM_COOKIE_SECURE=false; production is always behind a TLS proxy and the cookie stays Secure. Default port :8443 → :8080. docker-compose binds 127.0.0.1 only and populates RM_TRUSTED_PROXY. spec.md §4.1/§10.1 rewritten with a Caddyfile snippet and a hard "do not expose RM_LISTEN publicly" warning. enrollResponse keeps cert_pin_sha256 in the shape but the server can't introspect a cert it doesn't terminate — operator pastes the proxy's hash into -cert-pin at install time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 11:20:41 +01:00
steve	a7c6a6e09c	phase 1: run-now backup (restic wrapper, job lifecycle, end-to-end) Lands the operator → server → agent → restic → server roundtrip for on-demand backups. The flow: POST /api/hosts/{id}/jobs {kind:"backup",args:["/path"]} → server creates a queued Job row → server emits command.run over WS to the host's agent → agent dispatcher spawns runner.RunBackup in a goroutine → runner spawns `restic backup --json`, parses each line → forwards: job.started, log.stream (every line), job.progress (throttled to 1/sec), job.finished (with summary stats blob) → server WS handler persists those into jobs / job_logs P1-16 internal/restic: thin Locate + Env wrapper that runs `restic backup --json`, scans stdout/stderr, parses BackupStatus + BackupSummary, calls back into a LineHandler so the agent can fan out to log.stream + job.progress. Treats exit code 3 as "succeeded with issues" (matches restic's contract). P1-18 store: jobs accessors (CreateJob, MarkJobStarted, MarkJobFinished, AppendJobLog, GetJob). P1-19 server: POST /api/hosts/{id}/jobs creates the Job row, validates kind, dispatches via Hub.Send, audit-logs the action. P1-20 agent runner: wraps restic.RunBackup with throttled progress emission. Sender abstraction was added to wsclient.Handler so background goroutines can keep replying after dispatch returns. P1-21 server WS: dispatchAgentMessage now persists job.started, job.finished, log.stream into the database. Browser fan-out for live tailing lands with the UI work. Agent gets repo_url + repo_password from agent.yaml in plaintext for now (mode 0600, owned by service user); spec.md §7.3's keyring storage moves there in P2. config.update over WS overrides the in-memory copy (does not persist). Build clean; all tests pass. End-to-end with a real restic still needs a host that has restic installed; wire shape verified by the existing hello/heartbeat round-trip test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 00:45:04 +01:00
steve	9cc0caff1e	phase 1: WS transport, enrollment, agent that hellos and heartbeats Lands the protocol layer end-to-end: an agent can be enrolled through the operator UI, store credentials, dial back to the server over WS, complete the protocol_version handshake, and stay connected with periodic heartbeats. Server side: - P1-09 ws.Hub: one Conn per host_id, last-write-wins eviction, json envelope writer with a write mutex, reader, error envelopes. - P1-09 ws.AgentHandler: bearer-auth, accept upgrade, hello-stage (10s deadline, protocol_version checked against api.MinAgentProtocolVersion → ErrProtocolTooOld with help URL on reject), main read loop, defer hub register/unregister. - P1-10 POST /api/agents/enroll consumes a one-time token, mints a persistent agent bearer (sha-256 stored), creates a host row. - P1-10 POST /api/enrollment-tokens (operator, session-auth) issues a 1h one-time token. - P1-11 hello upserts agent_version + restic_version + protocol_version on the host row, flips status to online. - P1-12 heartbeat touches last_seen_at; background sweeper marks hosts offline after 90s without one. - store: hosts table accessors, host_schedule_version, enrollment_tokens FK on consumed_host dropped (audit-only field; the token gets burned before the host row exists). Agent side: - P1-13 internal/agent/config: yaml at /etc/restic-manager/agent.yaml, atomic Save (tmp+fsync+rename), Enrolled() helper. - P1-15 internal/agent/wsclient: dial with bearer + optional TLS cert pinning (sha-256 of leaf), exponential backoff with jitter (1s → 60s cap), heartbeat goroutine, fatal handling for ErrProtocolTooOld. - P1-15 wsclient.Enroll: HTTP POST /api/agents/enroll with sysinfo. - P1-17 internal/agent/sysinfo: hostname/OS/arch/restic-version collection. restic detected by `restic version` parse; absent restic doesn't block startup. - cmd/agent: -enroll-server / -enroll-token flags drive first-run enrollment then exit (so the install script can hand off to systemd to run the persistent service). End-to-end smoke verified: bootstrap → login → issue token → enroll → run agent → server logs `ws agent connected` with the right host_id and protocol_version 1. All tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 00:39:00 +01:00
steve	df2c584b23	phase 1: HTTP server + first-run bootstrap P1-01 chi router, slog request log, graceful shutdown via signal context. Health endpoint, /api/auth/login, /api/auth/logout, /api/bootstrap. Background sweeper for expired sessions and enrollment tokens (15 min cadence). P1-04 (sessions half) HttpOnly Secure-when-TLS cookie carrying a base64url token; server stores SHA-256(token) so a stolen DB doesn't yield credentials. Unknown user and bad password collapse to the same 401 response code so a probe can't enumerate names. P1-05 first-run admin bootstrap. On a fresh DB the server mints a one-time token and prints it to stderr inside a banner. The /api/bootstrap handler accepts {token, username, password}, creates the first admin, then becomes a 409 forever. P1-07 (partial) audit hooks fire on auth.login and auth.bootstrap. Full middleware-driven coverage lands with the rest of the API. internal/server/config: env > YAML > defaults. RM_LISTEN / RM_DATA_DIR / RM_BASE_URL / RM_TLS_CERT / RM_TLS_KEY / RM_SECRET_KEY_FILE / RM_TRUSTED_PROXY (CIDR list, validated). End-to-end smoke test passes: server boots on a fresh dir, prints the bootstrap token, POST /api/bootstrap creates the admin, POST /api/auth/login returns 200 with a session cookie. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 00:28:18 +01:00
steve	25aa001135	phase 0: project bootstrap P0-01 Go module + cmd/server + cmd/agent skeletons + internal/ tree P0-02 LICENSE (PolyForm NC 1.0.0), README, CONTRIBUTING P0-03 golangci-lint, pre-commit, .editorconfig, .gitignore P0-04 Gitea Actions CI: test (race+coverage), lint, cross-platform build matrix P0-05 Dockerfile.server (multi-stage, distroless/static), docker-compose.yml P0-06 Makefile with build/test/lint/fmt/run/release targets build, vet, test, and cross-compile to linux/{amd64,arm64} + windows/amd64 all verified locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 00:03:59 +01:00
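The non-blocking fan-out described in commit e6729a5a3d (P1-26) could look roughly like the sketch below. This is not the project's jobhub.go: the names Hub, NewHub, Subscribe, and Broadcast, and the returned drop count, are hypothetical and chosen for illustration. Only the 64-deep buffered channel per subscriber and the drop-when-full behaviour come from the commit message; the writer goroutine that drains each channel to a WebSocket is omitted.

```go
package main

import (
	"fmt"
	"sync"
)

// subscriberBuffer mirrors the 64-deep buffered channel named in the commit message.
const subscriberBuffer = 64

// Hub is a hypothetical stand-in for JobHub: a set of subscriber channels per job ID.
type Hub struct {
	mu   sync.Mutex
	subs map[string]map[chan []byte]struct{}
}

// NewHub returns an empty hub.
func NewHub() *Hub {
	return &Hub{subs: make(map[string]map[chan []byte]struct{})}
}

// Subscribe registers a new buffered channel for jobID and returns it.
func (h *Hub) Subscribe(jobID string) chan []byte {
	ch := make(chan []byte, subscriberBuffer)
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.subs[jobID] == nil {
		h.subs[jobID] = make(map[chan []byte]struct{})
	}
	h.subs[jobID][ch] = struct{}{}
	return ch
}

// Broadcast offers msg to every subscriber of jobID without blocking.
// It returns how many subscribers had a full buffer and missed the message.
func (h *Hub) Broadcast(jobID string, msg []byte) (dropped int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for ch := range h.subs[jobID] {
		select {
		case ch <- msg:
			// Delivered into the subscriber's buffer.
		default:
			// Buffer full: drop for this subscriber only, never block the caller.
			dropped++
		}
	}
	return dropped
}

func main() {
	hub := NewHub()
	slow := hub.Subscribe("job-1") // simulates a stuck browser: never drained

	dropped := 0
	for i := 0; i < 70; i++ {
		dropped += hub.Broadcast("job-1", []byte("log line"))
	}
	fmt.Println(len(slow), dropped) // prints "64 6"
}
```

The `select` with a `default` branch is what keeps the sender from stalling: once a subscriber's buffer is full, further messages are skipped for that subscriber while the caller carries on.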

36 Commits