Commit Graph

10 Commits

Author SHA1 Message Date
steve 14b703be58 server: maintenance ticker drives forget/prune/check on cadence
Wires a 60s server-side ticker to the pure-logic maintenance.Decide
introduced in the previous commit. Decisions flow through a new
DispatchMaintenance method on *Server, which:

  - skips offline hosts (no pending_runs queueing — maintenance is
    not a backup, missed fires shouldn't pile up)
  - silently skips prune when admin creds aren't bound
  - pushes admin creds before prune, then dispatches with
    RequiresAdminCreds=true (same as operator-driven prune)
  - persists job rows with actor_kind="system"

Reshapes the forget wire payload from a single RetentionPolicy to a
ForgetGroups list (one tag + per-group keep-* per source group). The
agent walks the groups and runs `restic forget --tag <name> --keep-*`
once per group. Dead-code removed: CommandRunPayload.RetentionPolicy,
the old forget JSON-decode in cmd/agent, and the single-policy form of
restic.RunForget.
2026-05-04 10:19:15 +01:00
steve a110e3c00c agent: secrets fail-loud on corrupt blob + small polish
Save and SaveAdmin now propagate loadBundle errors instead of silently
overwriting a corrupt file (data-loss fix). Tests added for both paths.
reportStats logs a Debug on RunStats failure; r in runJob gets a comment
explaining the prune-runner asymmetry; runner_test comment tightened.
2026-05-04 10:19:15 +01:00
steve 22adde36b3 agent/runner: ship repo.stats before job.finished in RunCheck/RunUnlock
RunCheck and RunUnlock were calling sendFinished before reportStats,
inverting the required job.started → log.stream → repo.stats →
job.finished envelope order. Move reportStats ahead of sendFinished in
both functions to match the pattern already correct in RunPrune.

Strengthen TestRunCheckShipsCheckStatus, TestRunCheckErrorsFoundShipsErrorsStatus,
and TestRunUnlockClearsLock with the same position-index ordering
assertions used by TestRunPruneShipsExpectedEnvelopes; these assertions
would have failed against the pre-fix code.
2026-05-04 10:19:15 +01:00
steve 57bf9690f2 agent: RunPrune/RunCheck/RunUnlock + reportStats + admin-cred slot dispatch
Extract resticEnv/sendStarted/streamHandler/sendFinished helpers to remove
boilerplate duplication across Run* methods. Add RunPrune (ships repo.stats
with LastPruneAt before job.finished), RunCheck (ships stats with
LastCheckStatus/LockPresent regardless of outcome), RunUnlock (ships
LockPresent=false on success), and reportStats (fills size fields via
RunStats when caller didn't populate them).

Wire JobPrune/JobCheck/JobUnlock into the dispatcher switch; teach
MsgConfigUpdate about the Slot discriminator for admin vs repo creds;
add strconv import for subset-pct parsing.
2026-05-04 10:19:15 +01:00
steve b6f8de1dcc lint: drive baseline to zero, drop only-new-issues gate
Cleanup pass over the repo so CI can enforce lint going forward
without the only-new-issues escape hatch:

* gofumpt -w across the tree (31 hits, all formatting)
* misspell --fix (25 hits, US-locale spelling) — but reverted on
  api.JobCancelled = "cancelled" since that literal is the wire +
  DB CHECK constraint value, plus matched the case in store/fleet.go
  back to "cancelled" and added //nolint:misspell on both for the
  next time someone reaches for the auto-fix
* Wrap every `defer rows.Close()` / `defer stmt.Close()` /
  `defer res.Body.Close()` in `defer func() { _ = .Close() }()`
  to satisfy errcheck without losing the close itself
* websocket.Dial callers (1 prod, 4 tests) now capture + close the
  upgrade response Body — coder/websocket can return res with a nil
  Body on success, so the test deferred-closes guard against that
* Annotate the two genuine-by-design nilerr cases with //nolint
  comments explaining why nil-on-error is the contract (cookie
  missing = no session; ctx cancelled mid-backoff = clean shutdown)
* Add brief godoc on the 10 exported const groups + types that
  revive flagged (api.HostOS/HostArch/JobKind/JobStatus/LogStream/
  ErrorCode, restic.EventKind, store.Role, web.FS)
* Drop the unused (*Server).userByID method
* Inline the unparam baseView(active) — every UI page is under
  the dashboard primary nav today

Result: `golangci-lint run ./...` reports 0 issues. CI lint job
no longer needs only-new-issues: true; X-06 follow-up entry in
tasks.md removed.
2026-05-03 16:15:17 +01:00
steve 5f2845c331 agent runner: drop status-event spam from log.stream
restic --json emits a status frame ~every 16ms during a backup.
The runner was forwarding every line to log.stream verbatim, which
flooded the live log pane with duplicate status JSON for any
short-running backup (visible immediately on a 1000-file, ~4MB
test set: ~14 identical 'percent_done: 1' lines in 220ms).

The progress widget already covers the same information at a sane
sample rate (one per second via job.progress), so the raw status
lines in log.stream are double-bookkeeping. Skip them and forward
only non-status lines (file names, errors, summary).

Throttling logic for job.progress is unchanged.
2026-05-03 13:35:18 +01:00
steve 6a171596f1 P2-05: forget command with retention policy
End-to-end forget plumbing — operator can create a forget schedule
with keep-* values, agent runs restic forget --keep-* … on the
schedule's cron (or via per-row Run-now), snapshot list shrinks,
UI updates.

* api.CommandRunPayload gains retention_policy json.RawMessage so
  the agent doesn't need a typed copy of the server-side struct.
* restic.ForgetPolicy mirrors restic's --keep-* flags. Empty()
  reports zero dimensions; restic wrapper RunForget refuses to
  run an empty policy (would delete every snapshot). Does NOT
  pass --prune — pruning lives behind a separate admin-only
  credential (P2-06); forget just rewrites the snapshot index.
* runner.RunForget mirrors RunBackup's envelope shape so the
  live log viewer works without special-casing. On success
  triggers reportSnapshots (forget shrinks the index, the host's
  snapshot count almost certainly changed).
* cmd/agent dispatcher handles MsgCommandRun with kind=forget,
  decodes RetentionPolicy from the wire, builds restic.ForgetPolicy.
* Server dispatchScheduleNow marshals the schedule's
  RetentionPolicy into the wire payload for kind=forget jobs.
  Refuses to dispatch a forget schedule with empty retention.
* validateSchedule rejects kind=forget without at least one keep-*
  dimension (new error code: missing_retention).
* UI schedule edit form gains a Kind dropdown (backup or forget;
  immutable on edit). Paths block toggles by kind via inline
  data-kind attributes. Form help-text explains the prune
  separation.

Other kinds (prune, check, unlock) deferred to P2-06..08; the
Kind dropdown only offers backup and forget today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 14:07:42 +01:00
steve c8ead66f08 P1 polish: agent-as-root, init-repo flow, rest creds passthrough, UX fixes
Cohesive batch from a smoke-test session against a real rest-server.
Themed bullets:

* Agent runs as root, sandboxed via systemd. CapabilityBoundingSet
  drops to CAP_DAC_READ_SEARCH + restore caps; ProtectSystem=strict
  with ReadWritePaths confined to /etc + /var/lib/restic-manager;
  NoNewPrivileges blocks escalation. Install script no longer
  creates a service user. spec.md §4.2 / §14.1 / §14.3 explain the
  rationale (matches UrBackup / Veeam / Bareos defaults; trying to
  back up "everything" as an unprivileged user creates silent skips
  on /home, /root, /var/lib/* with no upside vs the threat model
  the agent already implies).

* Init-repo end-to-end. New JobKind="init" wired through agent
  runner, restic.Env.RunInit, server dispatcher, and a UI button
  (red "Initialise repo" in the run-now panel). hosts.repo_initialised_at
  flips on init success, on backup success, or on a non-empty
  snapshots.report. The "Run now" / "Init" / "Retry" branching now
  drives both the dashboard host row and the host-detail panel.
  Migrations 0004 (column), 0005 (jobs.kind CHECK widened — using
  the safe create-new-then-rename pattern; first version corrupted
  job_logs.job_id FK), 0006 (cleans up job_logs FK on already-
  affected DBs).

* rest-server creds embedded at exec time only. restic.Env gains
  RepoUsername; mergeRestCreds() builds the user:pass@-prefixed URL
  inside envSlice() and never assigns it back to the struct, so
  nothing slog-able ever sees the cleartext form. RedactURL helper
  for any future surface that needs to log a URL safely. Both
  helpers tested.

* Add-host UX. Repo password is now optional — server mints a
  24-byte URL-safe random one and surfaces it once, alongside an
  htpasswd snippet ("echo PASS | htpasswd -B -i ... USERNAME") so
  the operator pastes one command on the rest-server host and one
  on the endpoint. Result page also links the install snippet at
  /install/install.sh (was /install.sh — 404'd before) and pipes
  to bash (not sh — script uses set -o pipefail and other
  bashisms; on Debian/Ubuntu sh is dash).

* Late-subscriber race in JobHub. A fast-failing job could finish
  (DB write + Broadcast) before the browser's HX-Redirect → page
  load → WS-connect path completed, so the JS sat forever waiting
  on a job.finished that already passed. JobHub split into
  Register + Send + Run; handleJobStream now subscribes first,
  re-fetches the job, and sends a synthetic job.finished if the
  state is already terminal.

* HTMX error visibility. New toast partial listens to
  htmx:responseError and surfaces the response body as a
  bottom-right toast — every server-side validation error now
  becomes visible without per-handler JS wiring. Also handles
  custom rm:toast events for future server-pushed notifications
  via the HX-Trigger header. Themed via existing CSS vars.

* Dashboard rows are now whole-row clickable to host detail
  (CSS card-link pattern: absolute-positioned anchor + .row-action
  z-index restoration so the action button stays clickable).
  "View →" on a running job links to /jobs/<id> rather than
  /hosts/<id> since the row click already covers the host page.

* "Run first" / "Run first backup" → "Run now" everywhere for
  consistency.

* runbook (docs/e2e-smoke.md) updated — live-log streaming step
  now reflects P1-26; mentions the browser-driven Run-now flow.

* _diag/dump-creds — moved out of cmd/ so go build doesn't pick
  it up; .gitignore now excludes /_diag/ entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:02:12 +01:00
steve 8d5282a180 P1-22: snapshot listing via restic snapshots --json
Agent calls restic snapshots --json after each successful backup
(60s timeout, separate from the backup ctx) and ships the projection
over the existing snapshots.report WS envelope. Failure here is
logged but doesn't fail the job — the next successful backup catches
the projection up.

Server-side ReplaceHostSnapshots is delete-then-insert plus a
hosts.snapshot_count update in one transaction so the dashboard's
per-host count stays consistent with the projection. New read
endpoint GET /api/hosts/{id}/snapshots returns the cached list with
a refreshed_at marker so the UI can show staleness when an agent
has been offline.

Schema: dropped the unused snapshots.repo_id FK (repos as a
first-class entity is P2 work), added short_id and refreshed_at
columns, switched the time index to DESC for the most-recent-first
list query. api.Snapshot gains short_id; size_bytes/file_count come
from the embedded summary block on restic 0.16+ and stay zero on
older clients.

Tests cover round-trip, authoritative replacement after forget+prune
shrinkage, and empty-after-wipe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 11:20:57 +01:00
steve a7c6a6e09c phase 1: run-now backup — restic wrapper, job lifecycle, end-to-end
Lands the operator → server → agent → restic → server roundtrip for
on-demand backups. The flow:

  POST /api/hosts/{id}/jobs {kind:"backup",args:["/path"]}
    → server creates a queued Job row
    → server emits command.run over WS to the host's agent
    → agent dispatcher spawns runner.RunBackup in a goroutine
    → runner spawns `restic backup --json`, parses each line
    → forwards: job.started, log.stream (every line), job.progress
      (throttled to 1/sec), job.finished (with summary stats blob)
    → server WS handler persists those into jobs / job_logs

P1-16 internal/restic: thin Locate + Env wrapper that runs `restic
  backup --json`, scans stdout/stderr, parses BackupStatus +
  BackupSummary, calls back into a LineHandler so the agent can fan
  out to log.stream + job.progress. Treats exit code 3 as
  "succeeded with issues" (matches restic's contract).

P1-18 store: jobs accessors (CreateJob, MarkJobStarted,
  MarkJobFinished, AppendJobLog, GetJob).

P1-19 server: POST /api/hosts/{id}/jobs creates the Job row,
  validates kind, dispatches via Hub.Send, audit-logs the action.

P1-20 agent runner: wraps restic.RunBackup with throttled progress
  emission. Sender abstraction was added to wsclient.Handler so
  background goroutines can keep replying after dispatch returns.

P1-21 server WS: dispatchAgentMessage now persists job.started,
  job.finished, log.stream into the database. Browser fan-out for
  live tailing lands with the UI work.

Agent gets repo_url + repo_password from agent.yaml in plaintext
for now (mode 0600, owned by service user); spec.md §7.3's keyring
storage moves there in P2. config.update over WS overrides the
in-memory copy (does not persist).

Build clean; all tests pass. End-to-end with a real restic still
needs a host that has restic installed — wire shape verified by
the existing hello/heartbeat round-trip test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:45:04 +01:00