P1 polish: agent-as-root, init-repo flow, rest creds passthrough, UX fixes
Cohesive batch from a smoke-test session against a real rest-server.
Themed bullets:
* Agent runs as root, sandboxed via systemd. CapabilityBoundingSet
drops to CAP_DAC_READ_SEARCH + restore caps; ProtectSystem=strict
with ReadWritePaths confined to /etc + /var/lib/restic-manager;
NoNewPrivileges blocks escalation. Install script no longer
creates a service user. spec.md §4.2 / §14.1 / §14.3 explain the
rationale (matches UrBackup / Veeam / Bareos defaults; trying to
back up "everything" as an unprivileged user creates silent skips
on /home, /root, /var/lib/* with no upside vs the threat model
the agent already implies).
* Init-repo end-to-end. New JobKind="init" wired through agent
runner, restic.Env.RunInit, server dispatcher, and a UI button
(red "Initialise repo" in the run-now panel). hosts.repo_initialised_at
flips on init success, on backup success, or on a non-empty
snapshots.report. The "Run now" / "Init" / "Retry" branching now
drives both the dashboard host row and the host-detail panel.
Migrations 0004 (column), 0005 (jobs.kind CHECK widened — using
the safe create-new-then-rename pattern; first version corrupted
job_logs.job_id FK), 0006 (cleans up job_logs FK on already-
affected DBs).
* rest-server creds embedded at exec time only. restic.Env gains
RepoUsername; mergeRestCreds() builds the user:pass@-prefixed URL
inside envSlice() and never assigns it back to the struct, so
nothing slog-able ever sees the cleartext form. RedactURL helper
for any future surface that needs to log a URL safely. Both
helpers tested.
* Add-host UX. Repo password is now optional — server mints a
24-byte URL-safe random one and surfaces it once, alongside an
htpasswd snippet ("echo PASS | htpasswd -B -i ... USERNAME") so
the operator pastes one command on the rest-server host and one
on the endpoint. Result page also links the install snippet at
/install/install.sh (was /install.sh — 404'd before) and pipes
to bash (not sh — script uses set -o pipefail and other
bashisms; on Debian/Ubuntu sh is dash).
* Late-subscriber race in JobHub. A fast-failing job could finish
(DB write + Broadcast) before the browser's HX-Redirect → page
load → WS-connect path completed, so the JS sat forever waiting
on a job.finished that already passed. JobHub split into
Register + Send + Run; handleJobStream now subscribes first,
re-fetches the job, and sends a synthetic job.finished if the
state is already terminal.
* HTMX error visibility. New toast partial listens to
htmx:responseError and surfaces the response body as a
bottom-right toast — every server-side validation error now
becomes visible without per-handler JS wiring. Also handles
custom rm:toast events for future server-pushed notifications
via the HX-Trigger header. Themed via existing CSS vars.
* Dashboard rows are now whole-row clickable to host detail
(CSS card-link pattern: absolute-positioned anchor + .row-action
z-index restoration so the action button stays clickable).
"View →" on a running job links to /jobs/<id> rather than
/hosts/<id> since the row click already covers the host page.
* "Run first" / "Run first backup" → "Run now" everywhere for
consistency.
* runbook (docs/e2e-smoke.md) updated — live-log streaming step
now reflects P1-26; mentions the browser-driven Run-now flow.
* _diag/dump-creds — moved out of cmd/ so go build doesn't pick
it up; .gitignore now excludes /_diag/ entirely.
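
The rest-server creds and add-host bullets above can be sketched in Go roughly as follows. This is a minimal illustration, not the code in this commit: the trimmed `Env` struct, the `RestPassword` field name, the exact signatures, and the reading of "24-byte" as 24 random bytes before encoding are all assumptions. Only `RESTIC_REPOSITORY`, `RESTIC_PASSWORD`, and the standard-library calls are real.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
	"net/url"
	"strings"
)

const restPrefix = "rest:" // restic's prefix for rest-server repositories

// Env is a hypothetical, trimmed-down stand-in for restic.Env.
type Env struct {
	RepoURL      string // e.g. "rest:https://backup.example.com/host-a/"
	RepoUsername string
	RestPassword string // assumed name for the rest-server HTTP password
	RepoPassword string // restic repository password
}

// mergeRestCreds returns RepoURL with user:pass@ embedded. It only returns a
// new string, so the cleartext form never lands on the struct.
func mergeRestCreds(e Env) (string, error) {
	if !strings.HasPrefix(e.RepoURL, restPrefix) || e.RepoUsername == "" {
		return e.RepoURL, nil
	}
	u, err := url.Parse(strings.TrimPrefix(e.RepoURL, restPrefix))
	if err != nil {
		return "", fmt.Errorf("parse repo URL: %w", err)
	}
	u.User = url.UserPassword(e.RepoUsername, e.RestPassword)
	return restPrefix + u.String(), nil
}

// envSlice builds the child-process environment at exec time. The value
// receiver works on a copy of Env, so the caller's struct is untouched.
func (e Env) envSlice() ([]string, error) {
	repo, err := mergeRestCreds(e)
	if err != nil {
		return nil, err
	}
	return []string{
		"RESTIC_REPOSITORY=" + repo,
		"RESTIC_PASSWORD=" + e.RepoPassword,
	}, nil
}

// RedactURL masks the password of a (possibly rest:-prefixed) URL for logging.
func RedactURL(raw string) string {
	prefix, rest := "", raw
	if strings.HasPrefix(raw, restPrefix) {
		prefix, rest = restPrefix, strings.TrimPrefix(raw, restPrefix)
	}
	u, err := url.Parse(rest)
	if err != nil {
		return prefix + "<unparseable URL>" // never echo the raw input
	}
	return prefix + u.Redacted() // the standard library masks the password as "xxxxx"
}

// mintPassword returns 24 random bytes as unpadded URL-safe base64 (32 chars).
func mintPassword() (string, error) {
	buf := make([]byte, 24)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(buf), nil
}

func main() {
	e := Env{RepoURL: "rest:https://backup.example.com/host-a/", RepoUsername: "alice", RestPassword: "s3cret"}
	merged, err := mergeRestCreds(e)
	if err != nil {
		panic(err)
	}
	fmt.Println(RedactURL(merged)) // safe to log
	fmt.Println(e.RepoURL)         // still credential-free
}
```

Returning the merged URL instead of storing it keeps the logging surface small: anything that prints the struct only ever sees the credential-free form.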
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
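
The late-subscriber fix described in the JobHub bullet boils down to "subscribe first, then re-read the job state". A minimal sketch of that ordering, assuming a simplified hub: `Register` and `Broadcast` are named in the message above, but the signatures, the `Event` type, `subscribeThenCheck`, and the state names `succeeded` / `failed` are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical wire event; the real payload is not shown in the commit.
type Event struct {
	Type  string
	JobID string
}

// JobHub fans events out to the subscribers that exist at broadcast time.
type JobHub struct {
	mu   sync.Mutex
	subs map[string][]chan Event
}

func NewJobHub() *JobHub { return &JobHub{subs: make(map[string][]chan Event)} }

// Register subscribes to a job and returns a buffered event channel.
func (h *JobHub) Register(jobID string) chan Event {
	ch := make(chan Event, 16)
	h.mu.Lock()
	defer h.mu.Unlock()
	h.subs[jobID] = append(h.subs[jobID], ch)
	return ch
}

// Broadcast delivers ev to current subscribers only; nothing is replayed later.
func (h *JobHub) Broadcast(ev Event) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, ch := range h.subs[ev.JobID] {
		select {
		case ch <- ev:
		default: // slow subscriber: drop rather than block the sender
		}
	}
}

func isTerminal(state string) bool { return state == "succeeded" || state == "failed" }

// subscribeThenCheck registers before reading the job state, so a job that
// finished during page load still yields a (synthetic) job.finished event.
func subscribeThenCheck(h *JobHub, jobID string, lookup func(string) string) chan Event {
	ch := h.Register(jobID)
	if isTerminal(lookup(jobID)) {
		select {
		case ch <- Event{Type: "job.finished", JobID: jobID}:
		default:
		}
	}
	return ch
}

func main() {
	h := NewJobHub()
	// The job fails fast: the broadcast happens before anyone is subscribed.
	h.Broadcast(Event{Type: "job.finished", JobID: "42"})

	ch := subscribeThenCheck(h, "42", func(string) string { return "failed" })
	fmt.Println((<-ch).Type) // prints "job.finished"
}
```

With this ordering a job that finishes between `Register` and the state lookup can produce two `job.finished` events (one real, one synthetic), so the client should treat the event as idempotent.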
@@ -123,14 +123,34 @@ It is built for small-to-medium fleets (initial target: ~12 endpoints) and is in
 - **Service integration:** systemd unit (Linux). Windows service via
 `golang.org/x/sys/windows/svc` — Phase 2.
 - **Footprint goal:** ≤ 15 MB binary, ≤ 50 MB RSS idle
+- **Privilege model:** the agent runs as root, sandboxed via systemd. A
+fleet-backup tool needs to read every file on the system regardless
+of DAC permissions; running as a dedicated unprivileged user means
+either silent skips on `/home`, `/root`, `/var/lib/<other-daemons>`,
+or operators having to add the service user to every group whose
+files they want backed up. Both are worse failure modes than the
+threat model already implies — the agent holds long-lived repo
+credentials, executes arbitrary `restic` commands, and runs
+operator-defined hooks; its blast radius is already large. This
+matches how every comparable tool ships (UrBackup client, Veeam
+Agent, Bareos FD, BackupPC client, borgmatic via systemd). The
+mitigation is aggressive systemd sandboxing of the root process:
+drop the capability set to `CAP_DAC_READ_SEARCH` (read any file)
++ `CAP_DAC_OVERRIDE`/`CAP_FOWNER`/`CAP_CHOWN` (restore ownership);
+`NoNewPrivileges=true` blocks escalation; `ProtectSystem=strict`
++ a tight `ReadWritePaths=` confines writes to `/etc/restic-manager`
+and `/var/lib/restic-manager`; `ProtectHome=read-only` keeps `/home`
+readable but immutable; standard `Protect*` / `Restrict*` toggles
+cover the rest. Hooks (P2) run as root by default with a per-hook
+override knob.
 - **Persistence:** `agent.yaml` (server URL, host ID, bearer, secrets
 key) + an AEAD-encrypted secrets blob (`secrets.enc`) holding the
-restic repo URL + password. Both files are mode 0600 owned by the
-agent service user. Phase 1 ships the encrypted-file form on
-Linux; Phase 2 swaps that for OS-keyring storage (DPAPI on Windows,
-Secret Service / `pass` on Linux where a session bus is
-available — see §7.3). A small state DB (BoltDB or JSON) for
-queued reports lands when offline-resilience work does.
+restic repo URL + password. Both files are mode 0600 owned by root.
+Phase 1 ships the encrypted-file form on Linux; Phase 2 swaps that
+for OS-keyring storage (DPAPI on Windows, Secret Service / `pass`
+on Linux where a session bus is available — see §7.3). A small
+state DB (BoltDB or JSON) for queued reports lands when offline-
+resilience work does.
 - **Restic invocation:** spawns `restic` with `--json`, parses streamed output, forwards to server in real time
 - **Updates:** distributed via OS package manager — apt repo (Linux) and
 Chocolatey package (Windows), both pointing at gitea releases. No
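
The Persistence bullet in the hunk above describes `secrets.enc` as an AEAD-encrypted blob stored with mode 0600. A minimal sketch of that shape follows, under loud assumptions: the spec does not name the cipher, so AES-256-GCM with the nonce prepended to the ciphertext is chosen here purely for illustration, and `sealSecrets`, `openSecrets`, and `writeSecretsFile` are hypothetical names.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// newAEAD builds AES-GCM from a 32-byte key (assumed cipher, see lead-in).
func newAEAD(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// sealSecrets encrypts plaintext and returns nonce || ciphertext || tag.
func sealSecrets(key, plaintext []byte) ([]byte, error) {
	aead, err := newAEAD(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return aead.Seal(nonce, nonce, plaintext, nil), nil
}

// openSecrets reverses sealSecrets and fails if the blob was tampered with.
func openSecrets(key, blob []byte) ([]byte, error) {
	aead, err := newAEAD(key)
	if err != nil {
		return nil, err
	}
	n := aead.NonceSize()
	if len(blob) < n {
		return nil, errors.New("secrets blob too short")
	}
	return aead.Open(nil, blob[:n], blob[n:], nil)
}

// writeSecretsFile stores the blob with mode 0600. The explicit Chmod also
// tightens a pre-existing file, since WriteFile only sets the mode on create.
func writeSecretsFile(path string, blob []byte) error {
	if err := os.WriteFile(path, blob, 0o600); err != nil {
		return err
	}
	return os.Chmod(path, 0o600)
}

func main() {
	key := make([]byte, 32) // all-zero demo key; a real key would come from agent.yaml
	blob, err := sealSecrets(key, []byte("rest:https://backup.example.com/host-a/"))
	if err != nil {
		panic(err)
	}
	dir, err := os.MkdirTemp("", "rm-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "secrets.enc")
	if err := writeSecretsFile(path, blob); err != nil {
		panic(err)
	}
	stored, err := os.ReadFile(path)
	if err != nil {
		panic(err)
	}
	plain, err := openSecrets(key, stored)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(plain))
}
```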
@@ -517,7 +537,7 @@ Restore a snapshot taken on host A onto host B (e.g. recover a dead box onto a f

 - **Credential model:** target host's agent receives a temporary, server-issued read credential for the source host's repo, scoped to a single restore job and revoked immediately after
 - **Path remapping:** UI allows rewriting source paths to target paths (e.g. `/home/alice` → `/home/alice-new`)
-- **Permissions:** restore runs as the agent's service user; UI surfaces a warning when source paths require root and target service user is non-root
+- **Permissions:** restore runs as root (the agent's process; see §4.2). The agent retains `CAP_DAC_OVERRIDE`/`CAP_FOWNER`/`CAP_CHOWN` precisely so it can recreate ownership on the target. The "service user is non-root" warning that appeared in earlier drafts is moot.
 - **Phase:** 3 (with the restore wizard)

 ### 14.2 Bandwidth limiting
@@ -535,7 +555,7 @@ Per-host shell commands run before and after a backup job. Use cases: `mysqldump

 - **Schema:** `Schedule.pre_hook` and `Schedule.post_hook` (string, optional). For more complex cases, `Host.pre_hook_default` / `Host.post_hook_default` apply to all schedules on that host unless overridden
 - **Applicability:** hooks are only meaningful for `kind = backup` schedules. The API rejects non-null `pre_hook` / `post_hook` on any other schedule kind (`forget`, `prune`, `check`) with a clear validation error rather than silently ignoring them. The same constraint applies to `Host.pre_hook_default` / `Host.post_hook_default`: they only fire for backup schedules on that host
-- **Execution:** agent runs hooks via the host's default shell (`/bin/sh` Linux, `cmd.exe` or PowerShell Windows — host-configurable)
+- **Execution:** agent runs hooks via the host's default shell (`/bin/sh` Linux, `cmd.exe` or PowerShell Windows — host-configurable). Hooks inherit the agent's process — i.e. **root by default** (see §4.2). A per-hook `run_as` field lets the operator drop privileges for a specific hook (`run_as: postgres` for a `pg_dump` hook, etc.); the agent uses `setuid`/`setgid` before exec rather than shelling out to `sudo`. Hooks running as root is what makes `docker stop`, `mysqldump`, `systemctl reload` etc. work without per-host setup, which is what the user expects when typing them into the UI.
 - **Failure semantics:** `pre_hook` non-zero exit aborts the backup and marks the job failed. `post_hook` runs on both success and failure (with `RM_JOB_STATUS` env var); its own exit code is recorded but does not change the backup job's final status
 - **Stdout/stderr:** captured into `JobLog` like restic output, prefixed `pre_hook:` / `post_hook:`
 - **Security:** hooks are stored encrypted; only admins can edit them; every edit audit-logged
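
The Applicability and Failure semantics bullets in the hunk above can be read as the control flow sketched below. Everything here is hypothetical scaffolding: the function names, the `HookRunner` type, and the status strings `succeeded` / `failed` are invented for illustration. The sketch also assumes `post_hook` still runs when `pre_hook` aborts the job, which is one possible reading of "runs on both success and failure".

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNonZeroExit stands in for a hook process exiting with a non-zero status.
var ErrNonZeroExit = errors.New("hook exited non-zero")

// HookRunner executes one hook script with extra environment variables.
type HookRunner func(script string, env []string) error

// validateHooks rejects hooks on any schedule kind other than "backup".
func validateHooks(kind, preHook, postHook string) error {
	if kind != "backup" && (preHook != "" || postHook != "") {
		return fmt.Errorf("hooks are only valid on backup schedules, got kind %q", kind)
	}
	return nil
}

// runBackupWithHooks returns the job's final status plus the post_hook error,
// which is recorded by the caller but never changes the status.
func runBackupWithHooks(preHook, postHook string, hook HookRunner, backup func() error) (status string, postErr error) {
	status = "succeeded"
	if preHook != "" && hook(preHook, nil) != nil {
		status = "failed" // pre_hook failed: the backup never starts
	} else if err := backup(); err != nil {
		status = "failed"
	}
	if postHook != "" {
		postErr = hook(postHook, []string{"RM_JOB_STATUS=" + status})
	}
	return status, postErr
}

func main() {
	hook := func(script string, env []string) error {
		fmt.Println("hook:", script, env)
		if script == "exit 1" {
			return ErrNonZeroExit
		}
		return nil
	}
	backup := func() error { fmt.Println("backup runs"); return nil }

	status, postErr := runBackupWithHooks("exit 1", "docker start db", hook, backup)
	fmt.Println(status, postErr)
	fmt.Println(validateHooks("prune", "mysqldump", ""))
}
```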
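
For the per-hook `run_as` field in the Execution line of the last hunk, one way to drop privileges before exec without `sudo` in Go is `syscall.SysProcAttr.Credential`, which makes the child switch UID/GID between fork and exec. This is a Unix-only sketch under assumptions: `RunAs`, `lookupIDs`, and `hookCommand` are hypothetical names, and hooks are a Phase 2 item, so none of this reflects shipped code. Changing UID also requires `CAP_SETUID`/`CAP_SETGID`, which the capability list in §4.2 does not mention.

```go
package main

import (
	"fmt"
	"os/exec"
	"os/user"
	"strconv"
	"syscall"
)

// RunAs is a hypothetical resolved form of a hook's run_as field.
type RunAs struct {
	UID, GID uint32
}

// lookupIDs resolves a username such as "postgres" to numeric IDs.
func lookupIDs(username string) (*RunAs, error) {
	u, err := user.Lookup(username)
	if err != nil {
		return nil, err
	}
	uid, err := strconv.ParseUint(u.Uid, 10, 32)
	if err != nil {
		return nil, fmt.Errorf("parse uid %q: %w", u.Uid, err)
	}
	gid, err := strconv.ParseUint(u.Gid, 10, 32)
	if err != nil {
		return nil, fmt.Errorf("parse gid %q: %w", u.Gid, err)
	}
	return &RunAs{UID: uint32(uid), GID: uint32(gid)}, nil
}

// hookCommand builds (but does not start) the shell command for a hook.
// With runAs == nil the hook inherits the agent's identity, i.e. root.
func hookCommand(script string, runAs *RunAs) *exec.Cmd {
	cmd := exec.Command("/bin/sh", "-c", script)
	if runAs != nil {
		// The child applies these credentials after fork and before exec.
		cmd.SysProcAttr = &syscall.SysProcAttr{
			Credential: &syscall.Credential{Uid: runAs.UID, Gid: runAs.GID},
		}
	}
	return cmd
}

func main() {
	ids, err := lookupIDs("postgres")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	cmd := hookCommand("pg_dump mydb", ids)
	fmt.Println(cmd.Args, "as uid", cmd.SysProcAttr.Credential.Uid)
}
```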