P1 polish: agent-as-root, init-repo flow, rest creds passthrough, UX fixes

Cohesive batch from a smoke-test session against a real rest-server.
Themed bullets:

* Agent runs as root, sandboxed via systemd. CapabilityBoundingSet
  drops to CAP_DAC_READ_SEARCH + restore caps; ProtectSystem=strict
  with ReadWritePaths confined to /etc/restic-manager and
  /var/lib/restic-manager; NoNewPrivileges blocks escalation.
  Install script no longer
  creates a service user. spec.md §4.2 / §14.1 / §14.3 explain the
  rationale (matches UrBackup / Veeam / Bareos defaults; trying to
  back up "everything" as an unprivileged user creates silent skips
  on /home, /root, /var/lib/* with no upside vs the threat model
  the agent already implies).

* Init-repo end-to-end. New JobKind="init" wired through agent
  runner, restic.Env.RunInit, server dispatcher, and a UI button
  (red "Initialise repo" in the run-now panel). hosts.repo_initialised_at
  flips on init success, on backup success, or on a non-empty
  snapshots.report. The "Run now" / "Init" / "Retry" branching now
  drives both the dashboard host row and the host-detail panel.
  Migrations 0004 (column), 0005 (jobs.kind CHECK widened — using
  the safe create-new-then-rename pattern; first version corrupted
  job_logs.job_id FK), 0006 (cleans up job_logs FK on already-
  affected DBs).

* rest-server creds embedded at exec time only. restic.Env gains
  RepoUsername; mergeRestCreds() builds the user:pass@-prefixed URL
  inside envSlice() and never assigns it back to the struct, so
  nothing slog-able ever sees the cleartext form. RedactURL helper
  for any future surface that needs to log a URL safely. Both
  helpers tested.

* Add-host UX. Repo password is now optional — server mints a
  24-byte URL-safe random one and surfaces it once, alongside an
  htpasswd snippet ("echo PASS | htpasswd -B -i ... USERNAME") so
  the operator pastes one command on the rest-server host and one
  on the endpoint. Result page also links the install snippet at
  /install/install.sh (was /install.sh — 404'd before) and pipes
  to bash (not sh — script uses set -o pipefail and other
  bashisms; on Debian/Ubuntu sh is dash).

* Late-subscriber race in JobHub. A fast-failing job could finish
  (DB write + Broadcast) before the browser's HX-Redirect → page
  load → WS-connect path completed, so the JS sat forever waiting
  on a job.finished that already passed. JobHub split into
  Register + Send + Run; handleJobStream now subscribes first,
  re-fetches the job, and sends a synthetic job.finished if the
  state is already terminal.

* HTMX error visibility. New toast partial listens to
  htmx:responseError and surfaces the response body as a
  bottom-right toast — every server-side validation error now
  becomes visible without per-handler JS wiring. Also handles
  custom rm:toast events for future server-pushed notifications
  via the HX-Trigger header. Themed via existing CSS vars.

* Dashboard rows are now whole-row clickable to host detail
  (CSS card-link pattern: absolute-positioned anchor + .row-action
  z-index restoration so the action button stays clickable).
  "View →" on a running job links to /jobs/<id> rather than
  /hosts/<id> since the row click already covers the host page.

* "Run first" / "Run first backup" → "Run now" everywhere for
  consistency.

* runbook (docs/e2e-smoke.md) updated — live-log streaming step
  now reflects P1-26; mentions the browser-driven Run-now flow.

* _diag/dump-creds — moved out of cmd/ so go build doesn't pick
  it up; .gitignore now excludes /_diag/ entirely.

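The late-subscriber fix in the JobHub bullet above relies on ordering: subscribe first, then re-read the job state. A minimal sketch of that ordering follows. Hub, Event, subscribeWithCatchUp, and the state names "succeeded" / "failed" are hypothetical stand-ins, not the real JobHub API.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical message pushed to browser subscribers.
type Event struct {
	Type  string
	JobID string
}

// Hub fans events out to per-job subscribers.
type Hub struct {
	mu   sync.Mutex
	subs map[string][]chan Event
}

func NewHub() *Hub { return &Hub{subs: make(map[string][]chan Event)} }

// Register adds a buffered subscriber channel for jobID.
func (h *Hub) Register(jobID string) chan Event {
	ch := make(chan Event, 16)
	h.mu.Lock()
	h.subs[jobID] = append(h.subs[jobID], ch)
	h.mu.Unlock()
	return ch
}

// Broadcast delivers ev to every subscriber of ev.JobID, dropping the event
// for subscribers whose buffer is full.
func (h *Hub) Broadcast(ev Event) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, ch := range h.subs[ev.JobID] {
		select {
		case ch <- ev:
		default:
		}
	}
}

// subscribeWithCatchUp registers BEFORE looking up the job state. A job that
// finished before registration is caught by the lookup; a job that finishes
// after registration is caught by Broadcast. lookup stands in for the DB read.
func subscribeWithCatchUp(h *Hub, jobID string, lookup func(string) string) chan Event {
	ch := h.Register(jobID)
	if s := lookup(jobID); s == "succeeded" || s == "failed" {
		select {
		case ch <- Event{Type: "job.finished", JobID: jobID}: // synthetic event
		default:
		}
	}
	return ch
}

func main() {
	h := NewHub()
	// The job failed before the browser connected.
	ch := subscribeWithCatchUp(h, "job-1", func(string) string { return "failed" })
	ev := <-ch
	fmt.Println(ev.Type, ev.JobID) // job.finished job-1
}
```

A job that finishes between Register and the lookup can produce two job.finished events, so a client built on this sketch would need to treat the event as idempotent.
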
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:02:12 +01:00
parent 8aa635f0c1
commit c8ead66f08
29 changed files with 885 additions and 129 deletions
+28 -8
@@ -123,14 +123,34 @@ It is built for small-to-medium fleets (initial target: ~12 endpoints) and is in
- **Service integration:** systemd unit (Linux). Windows service via
`golang.org/x/sys/windows/svc` — Phase 2.
- **Footprint goal:** ≤ 15 MB binary, ≤ 50 MB RSS idle
+- **Privilege model:** the agent runs as root, sandboxed via systemd. A
+fleet-backup tool needs to read every file on the system regardless
+of DAC permissions; running as a dedicated unprivileged user means
+either silent skips on `/home`, `/root`, `/var/lib/<other-daemons>`,
+or operators having to add the service user to every group whose
+files they want backed up. Both are worse failure modes than the
+threat model already implies: the agent holds long-lived repo
+credentials, executes arbitrary `restic` commands, and runs
+operator-defined hooks; its blast radius is already large. This
+matches how every comparable tool ships (UrBackup client, Veeam
+Agent, Bareos FD, BackupPC client, borgmatic via systemd). The
+mitigation is aggressive systemd sandboxing of the root process:
+drop the capability set to `CAP_DAC_READ_SEARCH` (read any file)
++ `CAP_DAC_OVERRIDE`/`CAP_FOWNER`/`CAP_CHOWN` (restore ownership);
+`NoNewPrivileges=true` blocks escalation; `ProtectSystem=strict`
++ a tight `ReadWritePaths=` confines writes to `/etc/restic-manager`
+and `/var/lib/restic-manager`; `ProtectHome=read-only` keeps `/home`
+readable but immutable; standard `Protect*` / `Restrict*` toggles
+cover the rest. Hooks (P2) run as root by default with a per-hook
+override knob.
- **Persistence:** `agent.yaml` (server URL, host ID, bearer, secrets
key) + an AEAD-encrypted secrets blob (`secrets.enc`) holding the
-restic repo URL + password. Both files are mode 0600 owned by the
-agent service user. Phase 1 ships the encrypted-file form on
-Linux; Phase 2 swaps that for OS-keyring storage (DPAPI on Windows,
-Secret Service / `pass` on Linux where a session bus is
-available; see §7.3). A small state DB (BoltDB or JSON) for
-queued reports lands when offline-resilience work does.
+restic repo URL + password. Both files are mode 0600 owned by root.
+Phase 1 ships the encrypted-file form on Linux; Phase 2 swaps that
+for OS-keyring storage (DPAPI on Windows, Secret Service / `pass`
+on Linux where a session bus is available; see §7.3). A small
+state DB (BoltDB or JSON) for queued reports lands when offline-
+resilience work does.
- **Restic invocation:** spawns `restic` with `--json`, parses streamed output, forwards to server in real time
- **Updates:** distributed via OS package manager — apt repo (Linux) and
Chocolatey package (Windows), both pointing at gitea releases. No
@@ -517,7 +537,7 @@ Restore a snapshot taken on host A onto host B (e.g. recover a dead box onto a f
- **Credential model:** target host's agent receives a temporary, server-issued read credential for the source host's repo, scoped to a single restore job and revoked immediately after
- **Path remapping:** UI allows rewriting source paths to target paths (e.g. `/home/alice` → `/home/alice-new`)
-- **Permissions:** restore runs as the agent's service user; UI surfaces a warning when source paths require root and target service user is non-root
+- **Permissions:** restore runs as root (the agent's process; see §4.2). The agent retains `CAP_DAC_OVERRIDE`/`CAP_FOWNER`/`CAP_CHOWN` precisely so it can recreate ownership on the target. The "service user is non-root" warning that appeared in earlier drafts is moot.
- **Phase:** 3 (with the restore wizard)
### 14.2 Bandwidth limiting
@@ -535,7 +555,7 @@ Per-host shell commands run before and after a backup job. Use cases: `mysqldump
- **Schema:** `Schedule.pre_hook` and `Schedule.post_hook` (string, optional). For more complex cases, `Host.pre_hook_default` / `Host.post_hook_default` apply to all schedules on that host unless overridden
- **Applicability:** hooks are only meaningful for `kind = backup` schedules. The API rejects non-null `pre_hook` / `post_hook` on any other schedule kind (`forget`, `prune`, `check`) with a clear validation error rather than silently ignoring them. The same constraint applies to `Host.pre_hook_default` / `Host.post_hook_default`: they only fire for backup schedules on that host
-- **Execution:** agent runs hooks via the host's default shell (`/bin/sh` Linux, `cmd.exe` or PowerShell Windows; host-configurable)
+- **Execution:** agent runs hooks via the host's default shell (`/bin/sh` Linux, `cmd.exe` or PowerShell Windows; host-configurable). Hooks inherit the agent's process, i.e. **root by default** (see §4.2). A per-hook `run_as` field lets the operator drop privileges for a specific hook (`run_as: postgres` for a `pg_dump` hook, etc.); the agent uses `setuid`/`setgid` before exec rather than shelling out to `sudo`. Hooks running as root is what makes `docker stop`, `mysqldump`, `systemctl reload` etc. work without per-host setup, which is what the user expects when typing them into the UI.
- **Failure semantics:** `pre_hook` non-zero exit aborts the backup and marks the job failed. `post_hook` runs on both success and failure (with `RM_JOB_STATUS` env var); its own exit code is recorded but does not change the backup job's final status
- **Stdout/stderr:** captured into `JobLog` like restic output, prefixed `pre_hook:` / `post_hook:`
- **Security:** hooks are stored encrypted; only admins can edit them; every edit audit-logged