restic-manager

Author	SHA1	Message	Date
steve	a45c801884	feat(alerts): per-source-group dedup so two failing backups produce two alerts Until now the open-alert key was (host_id, kind, resolved_at IS NULL). A host with two source groups both failing collapsed onto one backup_failed row — second failure bumped last_seen_at and overwrote the message but never re-fan-out. Operators saw one alert that appeared to flap, not two distinct broken things. Schema changes (column-level ALTER, no rebuild): - 0015 jobs.source_group_id (FK → source_groups, ON DELETE SET NULL, index). Populated for backup jobs in CreateJob. - 0016 alerts.dedup_key (NOT NULL DEFAULT ''). The old alerts_open partial index gets dropped and replaced with a UNIQUE partial index on (host_id, kind, dedup_key) WHERE resolved_at IS NULL — the index is now the actual dedup primitive. Plumbing: - RaiseOrTouch / AutoResolve / Alert struct gain dedup_key. - engine.JobFinishedEvent gains SourceGroupID; handleJobFinished passes it through for backup_failed only (forget/prune/check stay repo-scoped with key=''). - ws.handler reads SourceGroupID off the freshly-loaded job row. - dispatchJobWithPayload gains a *string sourceGroupID arg; the per-group Run-now path and schedule.fire path pass &g.ID. Test coverage: TestRaiseOrTouchDedupsPerSourceGroup proves two distinct groups produce two distinct open alerts and that resolving one does not auto-resolve the other. Dev tool: cmd/_fake_alert gains -dedup-key flag.	2026-05-04 22:59:48 +01:00
steve	8c42b00228	alert: wire engine into ws hello + MarkJobFinished + offline sweep - ws.HandlerDeps gains an AlertEngine *alert.Engine field; populated from http.Deps.AlertEngine (nil until G1 constructs the engine) - runAgentLoop calls NotifyHostOnline after MarkHostHello succeeds - dispatchAgentMessage MsgJobFinished case calls NotifyJobFinished, looking up the job Kind via Store.GetJob before notifying - store.MarkHostsOfflineStaleReturnIDs added: SELECT+UPDATE in one transaction, returns the IDs that flipped to offline - offline sweeper in cmd/server/main.go switched to the new variant; TODO(G1) comment marks where NotifyHostOffline calls will land	2026-05-04 19:54:39 +01:00
steve	f0dfa689fe	P3 follow-up: editable target dir, conditional --no-ownership, UK lint Three small follow-ups from review: 1. Restore target is now operator-editable. Default value is the literal '\$HOME/rm-restore/<job-id>/' (agent expands \$HOME at run time using os.UserHomeDir(); also handles \${HOME} and ~/ prefixes). Operator can replace with any absolute path. - ui_restore.go validates the input is either absolute or starts with one of the recognised prefixes; other env-var refs (\$PATH etc.) are deliberately rejected so operator paths can't pick up arbitrary agent env values. - host_restore.html replaces the read-only mono-text display with a real <input>; help text spells out that \$HOME resolves agent-side and <job-id> is substituted on dispatch. - install.sh + the systemd unit prep /root/rm-restore so the default works under the sandbox: ReadWritePaths gains a soft '-/root/rm-restore' entry (the '-' makes the bind-mount soft-fail if missing, but install.sh pre-creates it root-owned 0700). 2. --no-ownership flag now gated on restic version. The flag was added in restic 0.17 and 0.16 rejects it. Previously dropped it wholesale — that meant new-dir restores silently preserved ownership against design intent on 0.17+. Now the agent threads its detected restic version (sysinfo already collects it) through runner.Config -> restic.Env, and RunRestore appends --no-ownership only when AtLeastVersion(0, 17) returns true. 0.16 hosts still restore with original uid/gid; help text in the wizard explicitly notes this. The previous 'Original ownership is preserved' copy was wrong for new-dir mode and is corrected. 3. golangci-lint misspell locale switched US -> UK and the codebase swept (73 corrections, mostly behaviour/serialise/recognise/honour). Wire-format ErrorCode 'unauthorized' -> 'unauthorised' is a tiny contract change but the agent doesn't parse those codes today and no external API consumers exist yet. Tests passed before + after. Tests: - internal/restic/version_test.go covers Env.AtLeastVersion across edge cases (empty, exact match, patch above, minor below, non- numeric) and expandHome on \$HOME / \${HOME} / ~/, plus pass-through for absolute paths and refusal of other env vars. - ui_restore_test updated: TargetDir now starts '\$HOME/rm-restore/' with the job_id substituted into the placeholder. Live verified on the smoke env: default target restored to /root/rm-restore/<job-id>/ as the agent's expanded \$HOME (2 files, 14 bytes); custom override '/tmp/custom-restore/<job-id>/' restored into the agent's PrivateTmp namespace (1 file, 6 bytes); both jobs 'succeeded', exit 0.	2026-05-04 17:27:52 +01:00
steve	6d295bc9f6	P3-X2: tree.list synchronous WS RPC + per-session cache Foundational for the restore wizard's tree browser. The wizard needs to lazy-load directory contents from a snapshot as the operator drills down; this lands the transport. - internal/api adds MsgTreeList (server → agent) + MsgTreeListResult (agent → server) with TreeListRequestPayload / TreeListEntry / TreeListResultPayload types. Reply correlates by Envelope.ID. - internal/restic.ListTreeChildren wraps 'restic ls --json' and filters its recursive output to direct children of the requested path. Parser + path-normalisation + isDirectChild are unit-tested. - internal/server/ws/rpc.go introduces a generic SendRPC helper on Hub: register a buffered channel keyed by ULID, send the request, block on ctx.Done()/timeout/reply. Reply routing piggybacks on the existing dispatchAgentMessage by adding a MsgTreeListResult case that forwards to the registered waiter; if no waiter is registered (caller already gave up) the stray reply is dropped quietly. - cmd/agent gains a tree.list handler that runs ListTreeChildren on a fresh per-call context (60s ceiling) and ships the matching tree.list.result envelope. Errors surface in result.Error rather than as transport failures so the server-side waiter can render a sensible UI message. - internal/server/http/tree_cache.go is the per-wizard-session cache layer (~30min TTL, sweep-on-access) that fetchTreeWithCache uses before falling through to SendRPC. Cached on success only; agent errors aren't cached so a transient failure doesn't poison the session. Tests: - internal/restic/ls_test.go covers parseLsChildren at root / mid-tree / leaf, plus normalizeTreePath and isDirectChild edge cases. - internal/server/ws/rpc_test.go unit-tests the registry: round-trip, release semantics, concurrent waiters, ctx-cancel. - internal/server/http/tree_rpc_test.go is the full round-trip: server SendRPC → fake-agent over a real WS → reply → server gets the payload. Plus a timeout test that confirms ~300ms timeouts terminate in ~300ms rather than waiting forever. The cache is plumbed but no UI handler hits fetchTreeWithCache yet — that lands with P3-01 (wizard backend). The unused-linter is suppressed via nolint until the wizard wires it in.	2026-05-04 15:19:22 +01:00
steve	99b88d08c9	server/ws: persist repo.stats into host_repo_stats	2026-05-04 10:19:15 +01:00
steve	e871b05b38	lint: drive baseline to zero, drop only-new-issues gate CI / Test (linux/amd64) (pull_request) Successful in 34s Details CI / Lint (pull_request) Failing after 16s Details CI / Build (windows/amd64) (pull_request) Successful in 22s Details CI / Build (linux/amd64) (pull_request) Successful in 20s Details CI / Build (linux/arm64) (pull_request) Successful in 21s Details Cleanup pass over the repo so CI can enforce lint going forward without the only-new-issues escape hatch: * gofumpt -w across the tree (31 hits, all formatting) * misspell --fix (25 hits, US-locale spelling) — but reverted on api.JobCancelled = "cancelled" since that literal is the wire + DB CHECK constraint value, plus matched the case in store/fleet.go back to "cancelled" and added //nolint:misspell on both for the next time someone reaches for the auto-fix * Wrap every `defer rows.Close()` / `defer stmt.Close()` / `defer res.Body.Close()` in `defer func() { _ = .Close() }()` to satisfy errcheck without losing the close itself * websocket.Dial callers (1 prod, 4 tests) now capture + close the upgrade response Body — coder/websocket can return res with a nil Body on success, so the test deferred-closes guard against that * Annotate the two genuine-by-design nilerr cases with //nolint comments explaining why nil-on-error is the contract (cookie missing = no session; ctx cancelled mid-backoff = clean shutdown) * Add brief godoc on the 10 exported const groups + types that revive flagged (api.HostOS/HostArch/JobKind/JobStatus/LogStream/ ErrorCode, restic.EventKind, store.Role, web.FS) * Drop the unused (Server).userByID method Inline the unparam baseView(active) — every UI page is under the dashboard primary nav today Result: `golangci-lint run ./...` reports 0 issues. CI lint job no longer needs only-new-issues: true; X-06 follow-up entry in tasks.md removed.	2026-05-03 16:15:17 +01:00
steve	5667cdf13a	P2 redesign · phase 2: store rewrite — sources, slim schedules, repo maintenance CI / Test (linux/amd64) (push) Has been cancelled Details CI / Lint (push) Has been cancelled Details CI / Build (windows/amd64) (push) Has been cancelled Details CI / Build (linux/amd64) (push) Has been cancelled Details CI / Build (linux/arm64) (push) Has been cancelled Details Go-side data model rebuilt against migration 0008. The fat-Schedule shape (paths/excludes/tags/retention/manual/kind/options/hooks) is gone; that surface lives on source_groups now. * store/types.go - Schedule slimmed to {id, host_id, cron, enabled, source_group_ids, timestamps}. SourceGroupIDs populated by Get/List, accepted on Create/Update so callers pass desired junction state in one shape. - SourceGroup added: name (= snapshot tag), includes/excludes, retention_policy, retry_max + retry_backoff_seconds, cached conflict_dimension. - HostRepoMaintenance added: forget/prune/check cadences + enabled. - PendingRun added: offline-retry queue. - Host loses RepoInitialisedAt; gains BandwidthUpKBps + BandwidthDownKBps. - RetentionPolicy moves home from "schedule field" to "source group field" but the type itself + Summary() method unchanged. * store/sources.go (new) — CRUD + GetByName + ConflictDimension cache. Group writes bump host_schedule_version; conflict cache writes don't (server-internal projection, agent doesn't see it). * store/maintenance.go (new) — CreateDefault is idempotent (INSERT OR IGNORE). UpdateRepoMaintenance doesn't bump schedule version because these run on the server's own ticker, not the agent's local cron. * store/pending.go (new) — Enqueue / DueRunsForRetry / Bump / Delete. * store/schedules.go — rewritten for slim shape + junction CRUD. Update wipes the schedule_source_groups junction wholesale and re-inserts (simpler than diffing). Adds SchedulesUsingGroup for retention-conflict detection + UI labels. * store/hosts.go — drops repo_initialised_at scan, adds bandwidth scan. New SetHostBandwidth helper. * HTTP layer — temporarily stubbed during this rewrite (501 returns with redesign_in_progress error code). Phase 3 fills these in against the new shape: - schedules.go REST CRUD - schedule_push.go agent reconciliation - ui_schedules.go HTML form CRUD Run-now-per-host + Init-repo handlers in ui_handlers.go also stubbed — both go away in the new model (Run-now per source group; auto-init at host enrolment). * enrollment.go — replaces "seed manual schedule from typed paths" with "seed default source group + repo-maintenance row." The default group gets the typed paths as its includes; operator edits later via Sources tab. * ws/handler.go — drops the MarkHostRepoInitialised projection (column is gone; auto-init makes it derivable from latest init job's status). Tests: * store: existing schedule test rewritten for slim shape + junction; new sources_test.go covers source-group CRUD, name uniqueness, conflict cache, repo-maintenance defaults + idempotent seed, pending-runs queue lifecycle. * http: schedules_test.go and schedule_push_test.go deleted — both exercised the obsolete fat-schedule API. Phase 3 rewrites them against the new endpoints. go test ./... green. cmd/server + cmd/agent build. The UI is broken end-to-end (schedules / sources / repo tabs all hit 501 stubs); Phase 3 restores REST + on-the-wire reconciliation; Phase 4 rewires the UI templates against the new model. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 21:30:41 +01:00
steve	6450bf1b88	P2-02 (agent side) + P2-03: agent scheduler + schedule.fire dispatch CI / Test (linux/amd64) (push) Has been cancelled Details CI / Lint (push) Has been cancelled Details CI / Build (windows/amd64) (push) Has been cancelled Details CI / Build (linux/amd64) (push) Has been cancelled Details CI / Build (linux/arm64) (push) Has been cancelled Details Closes the schedule reconciliation loop end-to-end. * New `internal/agent/scheduler` package wraps robfig/cron/v3 with the lifecycle the agent needs: - Apply(ScheduleSetPayload, Sender) stops the prior cron (waiting for in-flight entries to return), rebuilds from scratch, starts, and emits schedule.ack with the version we just applied. - Disabled entries skipped silently; bad cron exprs (which shouldn't reach us — the server validates — but defensive) log a warn and skip. - On each cron tick the entry sends a new schedule.fire envelope to the server with {schedule_id, scheduled_at}. The scheduler itself never builds CommandRunPayloads — server is the source of truth for jobs. - tx is swapped on every Apply, so reconnect is handled naturally: cron entries that fire against a dropped tx log "no active connection" and skip the tick. - Stop() is idempotent and waits for the cron's in-flight workers via cron.Stop().Done(). * New wire message api.MsgScheduleFire + api.ScheduleFirePayload for the agent → server "I just fired locally" RPC. * Server-side dispatch (schedule_push.go: dispatchScheduledJob): looks up the schedule by id, validates ownership + that it's enabled, builds args from kind (paths for backup; other kinds are still arg-less in Phase 2 and grow as those job kinds land in P2-05..08), persists a jobs row with actor_kind=schedule + scheduled_id, and writes command.run back on the same conn so the agent runs through its existing dispatch path. * store.CreateJob now writes scheduled_id. This column was in the schema since 0001 but never populated — the original P1 path only had operator-driven jobs, so actor_kind was always 'user' and scheduled_id was always nil. * cmd/agent/main.go integration: dispatcher gains a scheduler.Scheduler; the MsgScheduleSet case now hands the payload to scheduler.Apply (in a goroutine so the WS read loop keeps draining other messages). WS dispatcher gains OnScheduleFire alongside OnScheduleAck. * Tests: - scheduler unit tests (4): ack-on-apply, cron tick fires schedule.fire envelope, disabled entries don't fire, replace- prior-state stops the old cron. - Server-side end-to-end: schedule.fire → command.run with the right job_id / kind / args, plus jobs row with actor_kind= "schedule" and scheduled_id linking back to the schedule. Persistence of next-fire times across agent restarts is deliberately deferred. A missed fire window during downtime simply fires once on reconnect — that's the desirable behaviour (the operator wants the missed backup to run, not be silently skipped because we lost track of when it was due). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:29:12 +01:00
steve	946b6db137	P2-02 (server side): schedule reconciliation push + ack handling CI / Test (linux/amd64) (push) Has been cancelled Details CI / Lint (push) Has been cancelled Details CI / Build (windows/amd64) (push) Has been cancelled Details CI / Build (linux/amd64) (push) Has been cancelled Details CI / Build (linux/arm64) (push) Has been cancelled Details Server is now the source of truth for the agent's cron set. * Helpers in schedule_push.go: - loadScheduleSetPayload reads the host's schedules + canonical version into the wire shape. - pushScheduleSetOnConn writes directly to a just-handshaken conn (avoids racing against Hub.Register on a brand-new connection). - pushScheduleSetAsync is the post-CRUD flavour — no-op when the host is offline (the next reconnect's on-hello path catches it up, so a missed push is non-fatal). - applyScheduleAck records what version the agent has confirmed. * onAgentHello restructured: was returning early when the host had no repo credentials, which made the schedule push unreachable for fresh hosts. Split into pushRepoCredsOnHello (silent no-op on ErrNotFound) + pushScheduleSetOnConn (always runs). Empty schedule list is a valid push: tells the agent to drop stale cron entries. * WS dispatcher gains an OnScheduleAck hook on HandlerDeps; the http server wires it to applyScheduleAck. MsgScheduleAck moves out of the "TODO(P2)" group into a real case that decodes the payload and forwards to the callback. * Schedule CRUD handlers each fire pushScheduleSetAsync after the audit-log write so the agent picks up changes within seconds. Tests cover: - On-hello push of an already-created schedule, agent acks, applied_schedule_version flips on the host row. - Connect-then-CRUD: empty initial push (version 0), then a follow-on push at version 1 after the operator creates a schedule via REST. Agent-side `schedule.set` handler (parse, replace local cron, emit `schedule.ack`) is the remainder of P2-02 and lands with P2-03's local scheduler. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:22:06 +01:00
steve	ee3ee241ea	P1 polish: agent-as-root, init-repo flow, rest creds passthrough, UX fixes CI / Test (linux/amd64) (push) Has been cancelled Details CI / Lint (push) Has been cancelled Details CI / Build (windows/amd64) (push) Has been cancelled Details CI / Build (linux/amd64) (push) Has been cancelled Details CI / Build (linux/arm64) (push) Has been cancelled Details Cohesive batch from a smoke-test session against a real rest-server. Themed bullets: * Agent runs as root, sandboxed via systemd. CapabilityBoundingSet drops to CAP_DAC_READ_SEARCH + restore caps; ProtectSystem=strict with ReadWritePaths confined to /etc + /var/lib/restic-manager; NoNewPrivileges blocks escalation. Install script no longer creates a service user. spec.md §4.2 / §14.1 / §14.3 explain the rationale (matches UrBackup / Veeam / Bareos defaults; trying to back up "everything" as an unprivileged user creates silent skips on /home, /root, /var/lib/* with no upside vs the threat model the agent already implies). * Init-repo end-to-end. New JobKind="init" wired through agent runner, restic.Env.RunInit, server dispatcher, and a UI button (red "Initialise repo" in the run-now panel). hosts.repo_initialised_at flips on init success, on backup success, or on a non-empty snapshots.report. The "Run now" / "Init" / "Retry" branching now drives both the dashboard host row and the host-detail panel. Migrations 0004 (column), 0005 (jobs.kind CHECK widened — using the safe create-new-then-rename pattern; first version corrupted job_logs.job_id FK), 0006 (cleans up job_logs FK on already- affected DBs). * rest-server creds embedded at exec time only. restic.Env gains RepoUsername; mergeRestCreds() builds the user:pass@-prefixed URL inside envSlice() and never assigns it back to the struct, so nothing slog-able ever sees the cleartext form. RedactURL helper for any future surface that needs to log a URL safely. Both helpers tested. * Add-host UX. Repo password is now optional — server mints a 24-byte URL-safe random one and surfaces it once, alongside an htpasswd snippet ("echo PASS \| htpasswd -B -i ... USERNAME") so the operator pastes one command on the rest-server host and one on the endpoint. Result page also links the install snippet at /install/install.sh (was /install.sh — 404'd before) and pipes to bash (not sh — script uses set -o pipefail and other bashisms; on Debian/Ubuntu sh is dash). * Late-subscriber race in JobHub. A fast-failing job could finish (DB write + Broadcast) before the browser's HX-Redirect → page load → WS-connect path completed, so the JS sat forever waiting on a job.finished that already passed. JobHub split into Register + Send + Run; handleJobStream now subscribes first, re-fetches the job, and sends a synthetic job.finished if the state is already terminal. * HTMX error visibility. New toast partial listens to htmx:responseError and surfaces the response body as a bottom-right toast — every server-side validation error now becomes visible without per-handler JS wiring. Also handles custom rm:toast events for future server-pushed notifications via the HX-Trigger header. Themed via existing CSS vars. * Dashboard rows are now whole-row clickable to host detail (CSS card-link pattern: absolute-positioned anchor + .row-action z-index restoration so the action button stays clickable). "View →" on a running job links to /jobs/<id> rather than /hosts/<id> since the row click already covers the host page. * "Run first" / "Run first backup" → "Run now" everywhere for consistency. * runbook (docs/e2e-smoke.md) updated — live-log streaming step now reflects P1-26; mentions the browser-driven Run-now flow. * _diag/dump-creds — moved out of cmd/ so go build doesn't pick it up; .gitignore now excludes /_diag/ entirely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:02:12 +01:00
steve	bd434bd1d0	P1-26: live job log viewer + WS browser fan-out hub CI / Test (linux/amd64) (push) Has been cancelled Details CI / Lint (push) Has been cancelled Details CI / Build (windows/amd64) (push) Has been cancelled Details CI / Build (linux/amd64) (push) Has been cancelled Details CI / Build (linux/arm64) (push) Has been cancelled Details Closes the P1-21 remainder. internal/server/ws/jobhub.go — new JobHub. Per-job_id set of subscribers; each gets a 64-deep buffered channel with a writer goroutine. Broadcast is non-blocking: if a subscriber is slow, its channel fills and messages are dropped for that subscriber only — the agent's read loop is never blocked by a stuck browser. The agent dispatchAgentMessage path mirrors job.started / job.progress / log.stream / job.finished envelopes onto the hub in addition to its existing persistence work. The wire shape is the same end-to-end, so client-side JS switches on env.type the same way Go code does. GET /api/jobs/{id}/stream is the browser endpoint. Auth via session cookie (HTTP layer); upgrade; subscribe; pump until context closes. GET /jobs/{id} renders the live log page. Three states (queued/ running/succeeded/failed) drive the header pill, the progress bar block, the failure summary panel, and the action button (Cancel job while running, Back to host afterwards). Already- persisted log lines are server-rendered on initial load; new lines arrive over the WS and append to #log-stream. Auto-scrolls unless the user scrolls up (a "⇢ Follow" pill re-attaches). On job.finished the page reloads after 600ms to pick up the final-state header rendered server-side. POST /hosts/{id}/run-backup now sets HX-Redirect → /jobs/{job_id} on success so HTMX lands the operator straight on the live log. For non-HTMX callers (curl / plain form post) it 303s to the same target. store.ListJobLogs returns persisted log lines for initial render on page load. Browser-verified end-to-end: enrol → run a real backup against a sibling restic/rest-server → live progress + 11 log lines stream in → succeeded pill + final stats land after page reload. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 21:45:56 +01:00
steve	0ba56ed30d	P1-32: server-side encrypted repo creds + push-on-hello CI / Test (linux/amd64) (push) Has been cancelled Details CI / Lint (push) Has been cancelled Details CI / Build (windows/amd64) (push) Has been cancelled Details CI / Build (linux/amd64) (push) Has been cancelled Details CI / Build (linux/arm64) (push) Has been cancelled Details Operator-minted enrollment tokens now carry the repo URL/username/ password as one AEAD blob bound (via additional-data) to the token hash. ConsumeEnrollmentToken re-encrypts under host_id and writes a host_credentials row in the same tx as token-burn, so the binding moves with the credential. PUT /api/hosts/{id}/repo-credentials lets an operator edit creds post-enrollment; merges with the existing blob, audits, and pushes config.update if the agent is connected. WS handler grows an OnHello hook that the HTTP layer wires to send the host's decrypted creds as a config.update immediately after the hello succeeds — synchronously, so a racing command.run lands after the agent has its repo password. Schema: 0002_host_credentials.sql adds enc_repo_creds to enrollment_tokens and a host_credentials table (PK = host_id, FK ON DELETE CASCADE). Tests: round-trip token → consume → host_credentials with AAD swap detection; no-creds path stays compatible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 12:38:35 +01:00
steve	3904a78f14	P1-22: snapshot listing via restic snapshots --json CI / Test (linux/amd64) (push) Has been cancelled Details CI / Lint (push) Has been cancelled Details CI / Build (windows/amd64) (push) Has been cancelled Details CI / Build (linux/amd64) (push) Has been cancelled Details CI / Build (linux/arm64) (push) Has been cancelled Details Agent calls restic snapshots --json after each successful backup (60s timeout, separate from the backup ctx) and ships the projection over the existing snapshots.report WS envelope. Failure here is logged but doesn't fail the job — the next successful backup catches the projection up. Server-side ReplaceHostSnapshots is delete-then-insert plus a hosts.snapshot_count update in one transaction so the dashboard's per-host count stays consistent with the projection. New read endpoint GET /api/hosts/{id}/snapshots returns the cached list with a refreshed_at marker so the UI can show staleness when an agent has been offline. Schema: dropped the unused snapshots.repo_id FK (repos as a first-class entity is P2 work), added short_id and refreshed_at columns, switched the time index to DESC for the most-recent-first list query. api.Snapshot gains short_id; size_bytes/file_count come from the embedded summary block on restic 0.16+ and stay zero on older clients. Tests cover round-trip, authoritative replacement after forget+prune shrinkage, and empty-after-wipe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 11:20:57 +01:00
steve	95b49ecab9	phase 1: run-now backup — restic wrapper, job lifecycle, end-to-end Lands the operator → server → agent → restic → server roundtrip for on-demand backups. The flow: POST /api/hosts/{id}/jobs {kind:"backup",args:["/path"]} → server creates a queued Job row → server emits command.run over WS to the host's agent → agent dispatcher spawns runner.RunBackup in a goroutine → runner spawns `restic backup --json`, parses each line → forwards: job.started, log.stream (every line), job.progress (throttled to 1/sec), job.finished (with summary stats blob) → server WS handler persists those into jobs / job_logs P1-16 internal/restic: thin Locate + Env wrapper that runs `restic backup --json`, scans stdout/stderr, parses BackupStatus + BackupSummary, calls back into a LineHandler so the agent can fan out to log.stream + job.progress. Treats exit code 3 as "succeeded with issues" (matches restic's contract). P1-18 store: jobs accessors (CreateJob, MarkJobStarted, MarkJobFinished, AppendJobLog, GetJob). P1-19 server: POST /api/hosts/{id}/jobs creates the Job row, validates kind, dispatches via Hub.Send, audit-logs the action. P1-20 agent runner: wraps restic.RunBackup with throttled progress emission. Sender abstraction was added to wsclient.Handler so background goroutines can keep replying after dispatch returns. P1-21 server WS: dispatchAgentMessage now persists job.started, job.finished, log.stream into the database. Browser fan-out for live tailing lands with the UI work. Agent gets repo_url + repo_password from agent.yaml in plaintext for now (mode 0600, owned by service user); spec.md §7.3's keyring storage moves there in P2. config.update over WS overrides the in-memory copy (does not persist). Build clean; all tests pass. End-to-end with a real restic still needs a host that has restic installed — wire shape verified by the existing hello/heartbeat round-trip test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 00:45:04 +01:00
steve	f34773b505	phase 1: WS transport, enrollment, agent that hellos and heartbeats Lands the protocol layer end-to-end: an agent can be enrolled through the operator UI, store credentials, dial back to the server over WS, complete the protocol_version handshake, and stay connected with periodic heartbeats. Server side: - P1-09 ws.Hub: one Conn per host_id, last-write-wins eviction, json envelope writer with a write mutex, reader, error envelopes. - P1-09 ws.AgentHandler: bearer-auth, accept upgrade, hello-stage (10s deadline, protocol_version checked against api.MinAgentProtocolVersion → ErrProtocolTooOld with help URL on reject), main read loop, defer hub register/unregister. - P1-10 POST /api/agents/enroll consumes a one-time token, mints a persistent agent bearer (sha-256 stored), creates a host row. - P1-10 POST /api/enrollment-tokens (operator, session-auth) issues a 1h one-time token. - P1-11 hello upserts agent_version + restic_version + protocol_version on the host row, flips status to online. - P1-12 heartbeat touches last_seen_at; background sweeper marks hosts offline after 90s without one. - store: hosts table accessors, host_schedule_version, enrollment_tokens FK on consumed_host dropped (audit-only field; the token gets burned before the host row exists). Agent side: - P1-13 internal/agent/config: yaml at /etc/restic-manager/agent.yaml, atomic Save (tmp+fsync+rename), Enrolled() helper. - P1-15 internal/agent/wsclient: dial with bearer + optional TLS cert pinning (sha-256 of leaf), exponential backoff with jitter (1s → 60s cap), heartbeat goroutine, fatal handling for ErrProtocolTooOld. - P1-15 wsclient.Enroll: HTTP POST /api/agents/enroll with sysinfo. - P1-17 internal/agent/sysinfo: hostname/OS/arch/restic-version collection. restic detected by `restic version` parse; absent restic doesn't block startup. - cmd/agent: -enroll-server / -enroll-token flags drive first-run enrollment then exit (so the install script can hand off to systemd to run the persistent service). End-to-end smoke verified: bootstrap → login → issue token → enroll → run agent → server logs `ws agent connected` with the right host_id and protocol_version 1. All tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 00:39:00 +01:00

15 Commits