restic-manager

Author	SHA1	Message	Date
steve	9371b7b777	fix(catchup): guard on real in-flight backup check; add scheduler tests	2026-06-15 21:45:01 +01:00
steve	28c8b58f93	ui: per-host Jobs sub-tab; drop unused Settings stub Adds /hosts/{id}/jobs page listing recent jobs for the host (newest first, capped at 100) with click-through to /jobs/{id}. Converts the Jobs placeholder <div> to a real <a> nav link; removes the Settings stub entirely. Also registers durationHuman template func and a .jobs-row CSS grid to match the existing .schd-row idiom.	2026-05-07 22:49:10 +01:00
steve	350be3f19d	feat(alerts): per-source-group dedup so two failing backups produce two alerts Until now the open-alert key was (host_id, kind, resolved_at IS NULL). A host with two source groups both failing collapsed onto one backup_failed row — second failure bumped last_seen_at and overwrote the message but never re-fan-out. Operators saw one alert that appeared to flap, not two distinct broken things. Schema changes (column-level ALTER, no rebuild): - 0015 jobs.source_group_id (FK → source_groups, ON DELETE SET NULL, index). Populated for backup jobs in CreateJob. - 0016 alerts.dedup_key (NOT NULL DEFAULT ''). The old alerts_open partial index gets dropped and replaced with a UNIQUE partial index on (host_id, kind, dedup_key) WHERE resolved_at IS NULL — the index is now the actual dedup primitive. Plumbing: - RaiseOrTouch / AutoResolve / Alert struct gain dedup_key. - engine.JobFinishedEvent gains SourceGroupID; handleJobFinished passes it through for backup_failed only (forget/prune/check stay repo-scoped with key=''). - ws.handler reads SourceGroupID off the freshly-loaded job row. - dispatchJobWithPayload gains a *string sourceGroupID arg; the per-group Run-now path and schedule.fire path pass &g.ID. Test coverage: TestRaiseOrTouchDedupsPerSourceGroup proves two distinct groups produce two distinct open alerts and that resolving one does not auto-resolve the other. Dev tool: cmd/_fake_alert gains -dedup-key flag.	2026-05-04 22:59:48 +01:00
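Note on 350be3f19d: the dedup rule above can be sketched in memory. This is a minimal illustration, not the project's store code. The map stands in for the UNIQUE partial index on (host_id, kind, dedup_key) WHERE resolved_at IS NULL, and the names alertKey, openAlerts, raiseOrTouch and resolve are hypothetical.

```go
package main

import "fmt"

// alertKey mirrors the columns of the unique partial index described in the
// commit message. An empty DedupKey models repo-scoped alerts.
type alertKey struct {
	HostID   string
	Kind     string
	DedupKey string
}

// alert is a hypothetical stand-in for an open row in the alerts table.
type alert struct {
	Message string
	Seen    int // number of times the alert was raised or touched
}

// openAlerts holds only unresolved alerts, one per key.
type openAlerts map[alertKey]*alert

// raiseOrTouch opens a new alert (returns true) or touches the existing open
// alert for the same key (returns false).
func (o openAlerts) raiseOrTouch(k alertKey, msg string) bool {
	if a, ok := o[k]; ok {
		a.Message = msg
		a.Seen++
		return false
	}
	o[k] = &alert{Message: msg, Seen: 1}
	return true
}

// resolve closes the open alert for exactly this key and no other.
func (o openAlerts) resolve(k alertKey) { delete(o, k) }

func main() {
	o := openAlerts{}
	a := alertKey{HostID: "host-1", Kind: "backup_failed", DedupKey: "group-a"}
	b := alertKey{HostID: "host-1", Kind: "backup_failed", DedupKey: "group-b"}

	fmt.Println("opened a:", o.raiseOrTouch(a, "backup failed"))
	fmt.Println("opened b:", o.raiseOrTouch(b, "backup failed"))
	fmt.Println("opened a again:", o.raiseOrTouch(a, "backup failed again"))

	o.resolve(a)
	fmt.Println("open alerts after resolving a:", len(o))
}
```

With the old key (host_id, kind) the second call would have touched the first alert instead of opening a new one; adding the dedup key is what keeps the two groups apart.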
steve	9ec69456fe	store: LatestJobByKind includes in-flight jobs (avoid maintenance double-fire) Widen the SQL query to consider all statuses (queued, running, succeeded, failed, cancelled) rather than terminal-only. An in-flight prune that outlasts the 60s tick interval previously produced ErrNotFound, causing the ticker to anchor at now-24h and fire a second prune concurrently with the first. Update the doc comment and test: remove the "queued job filtered out" case, add assertions that a running job and a queued job are each returned as the latest.	2026-05-04 10:19:15 +01:00
steve	ae96983877	maintenance: pure-logic ticker decides forget/prune/check fires	2026-05-04 10:19:15 +01:00
steve	b6f8de1dcc	lint: drive baseline to zero, drop only-new-issues gate Cleanup pass over the repo so CI can enforce lint going forward without the only-new-issues escape hatch: * gofumpt -w across the tree (31 hits, all formatting) * misspell --fix (25 hits, US-locale spelling), but reverted on api.JobCancelled = "cancelled" since that literal is the wire + DB CHECK constraint value, plus matched the case in store/fleet.go back to "cancelled" and added //nolint:misspell on both for the next time someone reaches for the auto-fix * Wrap every `defer rows.Close()` / `defer stmt.Close()` / `defer res.Body.Close()` in `defer func() { _ = .Close() }()` to satisfy errcheck without losing the close itself * websocket.Dial callers (1 prod, 4 tests) now capture + close the upgrade response Body; coder/websocket can return res with a nil Body on success, so the test deferred-closes guard against that * Annotate the two genuine-by-design nilerr cases with //nolint comments explaining why nil-on-error is the contract (cookie missing = no session; ctx cancelled mid-backoff = clean shutdown) * Add brief godoc on the 10 exported const groups + types that revive flagged (api.HostOS/HostArch/JobKind/JobStatus/LogStream/ErrorCode, restic.EventKind, store.Role, web.FS) * Drop the unused (Server).userByID method * Inline the unparam baseView(active): every UI page is under the dashboard primary nav today Result: `golangci-lint run ./...` reports 0 issues. CI lint job no longer needs only-new-issues: true; X-06 follow-up entry in tasks.md removed.	2026-05-03 16:15:17 +01:00
steve	ec0bf0f6c3	P2R-01: REST + WS rewire against the slim shape Schedules CRUD now takes {cron, enabled, source_group_ids[]} with cron parsed via robfig/cron/v3 and group membership scoped to the host. New source-groups CRUD lives at /api/hosts/{id}/source-groups; delete refuses with 409 if any schedule still references the group, returning the schedule list so the UI can prompt 'remove from these schedules first.' Repo-maintenance GET/PUT manages forget/prune/check cadences on host_repo_maintenance — no version bump, the server-side ticker (P2R-06) drives execution. Per-source-group Run-now (POST /hosts/{id}/source-groups/{gid}/run) resolves the group's includes/excludes/retention/tag and dispatches a backup command.run with the new structured CommandRunPayload fields (Includes/Excludes/Tag). Old per-host /hosts/{id}/run-backup and /hosts/{id}/init-repo return 410 Gone with a redirect message. schedule_push.go is rebuilt: buildScheduleSetPayload assembles the slim wire shape, pushScheduleSetOnConn ships it during the on-hello window, pushScheduleSetAsync fires after every CRUD mutation, and dispatchScheduledJob handles agent schedule.fire by iterating the schedule's source groups and dispatching one backup per group with actor_kind=schedule and scheduled_id pointing at the schedule. Auto-init at first WS connect: when the host has repo creds bound and no init job in its history, server dispatches restic init. Restic's 'config file already exists' soft-success means re-runs against an existing repo no-op; we don't auto-retry on failure (operator triggers re-init manually via the danger zone in P2R-09). api.Schedule drops Kind/Paths/Excludes/Tags/RetentionPolicy/Manual etc. in favour of {id, cron, enabled, source_groups: [...]}. The agent scheduler stops checking sch.Manual; cmd/agent's backup dispatch reads Includes/Excludes/Tag instead of Args. 
Tests cover the new HTTP surface end-to-end: source-groups CRUD with in-use refusal, schedule validation (bad cron / missing groups / foreign group), repo-maintenance auto-seed and validation, the 410 route, and buildScheduleSetPayload's wire-shape correctness. Full suite passes; smoke env exercises auto-init dispatch on hello, async push after schedule create, and per-source-group Run-now landing the right paths/excludes/tag at the agent.	2026-05-03 10:56:40 +01:00
steve	608962441b	P2-02 (agent side) + P2-03: agent scheduler + schedule.fire dispatch Closes the schedule reconciliation loop end-to-end. * New `internal/agent/scheduler` package wraps robfig/cron/v3 with the lifecycle the agent needs: - Apply(ScheduleSetPayload, Sender) stops the prior cron (waiting for in-flight entries to return), rebuilds from scratch, starts, and emits schedule.ack with the version we just applied. - Disabled entries skipped silently; bad cron exprs (which shouldn't reach us — the server validates — but defensive) log a warn and skip. - On each cron tick the entry sends a new schedule.fire envelope to the server with {schedule_id, scheduled_at}. The scheduler itself never builds CommandRunPayloads — server is the source of truth for jobs. - tx is swapped on every Apply, so reconnect is handled naturally: cron entries that fire against a dropped tx log "no active connection" and skip the tick. - Stop() is idempotent and waits for the cron's in-flight workers via cron.Stop().Done(). * New wire message api.MsgScheduleFire + api.ScheduleFirePayload for the agent → server "I just fired locally" RPC. * Server-side dispatch (schedule_push.go: dispatchScheduledJob): looks up the schedule by id, validates ownership + that it's enabled, builds args from kind (paths for backup; other kinds are still arg-less in Phase 2 and grow as those job kinds land in P2-05..08), persists a jobs row with actor_kind=schedule + scheduled_id, and writes command.run back on the same conn so the agent runs through its existing dispatch path. * store.CreateJob now writes scheduled_id. This column was in the schema since 0001 but never populated — the original P1 path only had operator-driven jobs, so actor_kind was always 'user' and scheduled_id was always nil. * cmd/agent/main.go integration: dispatcher gains a scheduler.Scheduler; the MsgScheduleSet case now hands the payload to scheduler.Apply (in a goroutine so the WS read loop keeps draining other messages). 
WS dispatcher gains OnScheduleFire alongside OnScheduleAck. * Tests: - scheduler unit tests (4): ack-on-apply, cron tick fires schedule.fire envelope, disabled entries don't fire, replace-prior-state stops the old cron. - Server-side end-to-end: schedule.fire → command.run with the right job_id / kind / args, plus jobs row with actor_kind="schedule" and scheduled_id linking back to the schedule. Persistence of next-fire times across agent restarts is deliberately deferred. A missed fire window during downtime simply fires once on reconnect; that's the desirable behaviour (the operator wants the missed backup to run, not be silently skipped because we lost track of when it was due). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:29:12 +01:00
steve	e6729a5a3d	P1-26: live job log viewer + WS browser fan-out hub Closes the P1-21 remainder. internal/server/ws/jobhub.go: new JobHub. Per-job_id set of subscribers; each gets a 64-deep buffered channel with a writer goroutine. Broadcast is non-blocking: if a subscriber is slow, its channel fills and messages are dropped for that subscriber only, so the agent's read loop is never blocked by a stuck browser. The agent dispatchAgentMessage path mirrors job.started / job.progress / log.stream / job.finished envelopes onto the hub in addition to its existing persistence work. The wire shape is the same end-to-end, so client-side JS switches on env.type the same way Go code does. GET /api/jobs/{id}/stream is the browser endpoint. Auth via session cookie (HTTP layer); upgrade; subscribe; pump until context closes. GET /jobs/{id} renders the live log page. Four states (queued/running/succeeded/failed) drive the header pill, the progress bar block, the failure summary panel, and the action button (Cancel job while running, Back to host afterwards). Already-persisted log lines are server-rendered on initial load; new lines arrive over the WS and append to #log-stream. Auto-scrolls unless the user scrolls up (a "⇢ Follow" pill re-attaches). On job.finished the page reloads after 600ms to pick up the final-state header rendered server-side. POST /hosts/{id}/run-backup now sets HX-Redirect → /jobs/{job_id} on success so HTMX lands the operator straight on the live log. For non-HTMX callers (curl / plain form post) it 303s to the same target. store.ListJobLogs returns persisted log lines for initial render on page load. Browser-verified end-to-end: enrol → run a real backup against a sibling restic/rest-server → live progress + 11 log lines stream in → succeeded pill + final stats land after page reload. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 21:45:56 +01:00
steve	a7c6a6e09c	phase 1: run-now backup — restic wrapper, job lifecycle, end-to-end Lands the operator → server → agent → restic → server roundtrip for on-demand backups. The flow: POST /api/hosts/{id}/jobs {kind:"backup",args:["/path"]} → server creates a queued Job row → server emits command.run over WS to the host's agent → agent dispatcher spawns runner.RunBackup in a goroutine → runner spawns `restic backup --json`, parses each line → forwards: job.started, log.stream (every line), job.progress (throttled to 1/sec), job.finished (with summary stats blob) → server WS handler persists those into jobs / job_logs P1-16 internal/restic: thin Locate + Env wrapper that runs `restic backup --json`, scans stdout/stderr, parses BackupStatus + BackupSummary, calls back into a LineHandler so the agent can fan out to log.stream + job.progress. Treats exit code 3 as "succeeded with issues" (matches restic's contract). P1-18 store: jobs accessors (CreateJob, MarkJobStarted, MarkJobFinished, AppendJobLog, GetJob). P1-19 server: POST /api/hosts/{id}/jobs creates the Job row, validates kind, dispatches via Hub.Send, audit-logs the action. P1-20 agent runner: wraps restic.RunBackup with throttled progress emission. Sender abstraction was added to wsclient.Handler so background goroutines can keep replying after dispatch returns. P1-21 server WS: dispatchAgentMessage now persists job.started, job.finished, log.stream into the database. Browser fan-out for live tailing lands with the UI work. Agent gets repo_url + repo_password from agent.yaml in plaintext for now (mode 0600, owned by service user); spec.md §7.3's keyring storage moves there in P2. config.update over WS overrides the in-memory copy (does not persist). Build clean; all tests pass. End-to-end with a real restic still needs a host that has restic installed — wire shape verified by the existing hello/heartbeat round-trip test. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 00:45:04 +01:00

10 Commits