P1 polish: agent-as-root, init-repo flow, rest creds passthrough, UX fixes

Cohesive batch from a smoke-test session against a real rest-server.
Themed bullets:

* Agent runs as root, sandboxed via systemd. CapabilityBoundingSet
  drops to CAP_DAC_READ_SEARCH + restore caps; ProtectSystem=strict
  with ReadWritePaths confined to /etc + /var/lib/restic-manager;
  NoNewPrivileges blocks escalation. Install script no longer
  creates a service user. spec.md §4.2 / §14.1 / §14.3 explain the
  rationale (matches UrBackup / Veeam / Bareos defaults; trying to
  back up "everything" as an unprivileged user creates silent skips
  on /home, /root, /var/lib/* with no upside vs the threat model
  the agent already implies).

* Init-repo end-to-end. New JobKind="init" wired through agent
  runner, restic.Env.RunInit, server dispatcher, and a UI button
  (red "Initialise repo" in the run-now panel). hosts.repo_initialised_at
  flips on init success, on backup success, or on a non-empty
  snapshots.report. The "Run now" / "Init" / "Retry" branching now
  drives both the dashboard host row and the host-detail panel.
  Migrations 0004 (column), 0005 (jobs.kind CHECK widened — using
  the safe create-new-then-rename pattern; first version corrupted
  job_logs.job_id FK), 0006 (cleans up job_logs FK on already-
  affected DBs).

* rest-server creds embedded at exec time only. restic.Env gains
  RepoUsername; mergeRestCreds() builds the user:pass@-prefixed URL
  inside envSlice() and never assigns it back to the struct, so
  nothing slog-able ever sees the cleartext form. Also adds a
  RedactURL helper for any future surface that needs to log a URL
  safely. Both helpers are tested.

* Add-host UX. Repo password is now optional — server mints a
  24-byte URL-safe random one and surfaces it once, alongside an
  htpasswd snippet ("echo PASS | htpasswd -B -i ... USERNAME") so
  the operator pastes one command on the rest-server host and one
  on the endpoint. Result page also links the install snippet at
  /install/install.sh (was /install.sh — 404'd before) and pipes
  to bash (not sh — script uses set -o pipefail and other
  bashisms; on Debian/Ubuntu sh is dash).

* Late-subscriber race in JobHub. A fast-failing job could finish
  (DB write + Broadcast) before the browser's HX-Redirect → page
  load → WS-connect path completed, so the JS sat forever waiting
  on a job.finished that already passed. JobHub split into
  Register + Send + Run; handleJobStream now subscribes first,
  re-fetches the job, and sends a synthetic job.finished if the
  state is already terminal.

* HTMX error visibility. New toast partial listens to
  htmx:responseError and surfaces the response body as a
  bottom-right toast — every server-side validation error now
  becomes visible without per-handler JS wiring. Also handles
  custom rm:toast events for future server-pushed notifications
  via the HX-Trigger header. Themed via existing CSS vars.

* Dashboard rows are now whole-row clickable to host detail
  (CSS card-link pattern: absolute-positioned anchor + .row-action
  z-index restoration so the action button stays clickable).
  "View →" on a running job links to /jobs/<id> rather than
  /hosts/<id> since the row click already covers the host page.

* "Run first" / "Run first backup" → "Run now" everywhere for
  consistency.

* runbook (docs/e2e-smoke.md) updated — live-log streaming step
  now reflects P1-26; mentions the browser-driven Run-now flow.

* _diag/dump-creds — moved out of cmd/ so go build doesn't pick
  it up; .gitignore now excludes /_diag/ entirely.
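
The rest-server credential handling described in the bullets above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this commit: only the names restic.Env, RepoUsername, mergeRestCreds, envSlice and RedactURL come from the message. The other Env fields, the function bodies, and the reuse of the repo password as the HTTP password are guesses made for the example.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// Env is a hypothetical stand-in for restic.Env. Only RepoUsername is
// named in the commit message; the other fields are assumptions.
type Env struct {
	RepoURL      string // e.g. "rest:https://backup.example.com/host1"
	RepoUsername string
	RepoPassword string // assumed to double as the rest-server password
}

// mergeRestCreds returns the repo URL with user:pass@ embedded. It takes
// Env by value and returns a new string, so the cleartext form is never
// stored back on the struct.
func mergeRestCreds(e Env) (string, error) {
	if e.RepoUsername == "" || !strings.HasPrefix(e.RepoURL, "rest:") {
		return e.RepoURL, nil
	}
	u, err := url.Parse(strings.TrimPrefix(e.RepoURL, "rest:"))
	if err != nil {
		return "", fmt.Errorf("parse repo url: %w", err)
	}
	u.User = url.UserPassword(e.RepoUsername, e.RepoPassword)
	return "rest:" + u.String(), nil
}

// envSlice builds the child-process environment at exec time; the merged
// URL only ever lives in the returned slice.
func envSlice(e Env) ([]string, error) {
	repo, err := mergeRestCreds(e)
	if err != nil {
		return nil, err
	}
	return []string{"RESTIC_REPOSITORY=" + repo}, nil
}

// redactURL masks any password in a repo URL so it is safe to log.
func redactURL(raw string) string {
	prefix := ""
	if strings.HasPrefix(raw, "rest:") {
		prefix, raw = "rest:", strings.TrimPrefix(raw, "rest:")
	}
	u, err := url.Parse(raw)
	if err != nil {
		return "<unparseable url>" // never echo input we could not parse
	}
	return prefix + u.Redacted()
}

func main() {
	e := Env{
		RepoURL:      "rest:https://backup.example.com/host1",
		RepoUsername: "alice",
		RepoPassword: "s3cret",
	}
	env, err := envSlice(e)
	if err != nil {
		panic(err)
	}
	// Log only the redacted form; e.RepoURL itself is untouched.
	fmt.Println(redactURL(strings.TrimPrefix(env[0], "RESTIC_REPOSITORY=")))
	fmt.Println(e.RepoURL)
}
```

Building the userinfo with url.UserPassword means passwords containing reserved characters are percent-encoded rather than spliced in by hand.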

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
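
The late-subscriber fix in the JobHub bullet relies on ordering: subscribe first, then re-read the job state. A toy sketch of that ordering is below. The Hub type, its method bodies, and the subscribe helper are hypothetical and are not the JobHub from this commit; only the terminal status names are taken from the jobs.status CHECK constraint.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical message pushed to stream subscribers.
type Event struct {
	Type  string
	JobID string
}

// Hub is a toy stand-in for JobHub: a per-job fan-out of events.
type Hub struct {
	mu   sync.Mutex
	subs map[string][]chan Event
}

func NewHub() *Hub { return &Hub{subs: make(map[string][]chan Event)} }

// Register adds a subscriber for jobID and returns its channel.
func (h *Hub) Register(jobID string) <-chan Event {
	ch := make(chan Event, 8)
	h.mu.Lock()
	h.subs[jobID] = append(h.subs[jobID], ch)
	h.mu.Unlock()
	return ch
}

// Broadcast delivers ev to whoever is registered right now. Events sent
// before a subscriber registers are simply lost, which is the race.
func (h *Hub) Broadcast(ev Event) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, ch := range h.subs[ev.JobID] {
		select {
		case ch <- ev:
		default: // drop rather than block on a slow subscriber
		}
	}
}

// isTerminal mirrors the terminal values of jobs.status.
func isTerminal(status string) bool {
	switch status {
	case "succeeded", "failed", "cancelled":
		return true
	}
	return false
}

// subscribe registers first and only then re-fetches the job status. If
// the job is already terminal it returns a synthetic job.finished event
// for the caller to send to the client.
func subscribe(h *Hub, jobID string, fetchStatus func(string) string) (<-chan Event, *Event) {
	ch := h.Register(jobID)
	if isTerminal(fetchStatus(jobID)) {
		return ch, &Event{Type: "job.finished", JobID: jobID}
	}
	return ch, nil
}

func main() {
	hub := NewHub()
	// The job fails and broadcasts before any browser has connected.
	hub.Broadcast(Event{Type: "job.finished", JobID: "j1"})
	_, synthetic := subscribe(hub, "j1", func(string) string { return "failed" })
	fmt.Println("synthetic event sent:", synthetic != nil)
}
```

Checking the status before registering would leave a window in which the job could finish unnoticed. With this ordering the worst case is a duplicate job.finished, which the client needs to tolerate.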
2026-05-02 11:02:12 +01:00
parent 8aa635f0c1
commit c8ead66f08
29 changed files with 885 additions and 129 deletions
@@ -50,7 +50,7 @@ func (s *Store) LookupHostByAgentToken(ctx context.Context, tokenHash string) (*
enrolled_at, last_seen_at, status, repo_id, tags,
current_job_id, last_backup_at, last_backup_status,
repo_size_bytes, snapshot_count, open_alert_count,
-applied_schedule_version, default_paths
+applied_schedule_version, default_paths, repo_initialised_at
FROM hosts WHERE agent_token_hash = ?`,
tokenHash)
return scanHost(row)
@@ -63,7 +63,7 @@ func (s *Store) GetHost(ctx context.Context, id string) (*Host, error) {
enrolled_at, last_seen_at, status, repo_id, tags,
current_job_id, last_backup_at, last_backup_status,
repo_size_bytes, snapshot_count, open_alert_count,
-applied_schedule_version, default_paths
+applied_schedule_version, default_paths, repo_initialised_at
FROM hosts WHERE id = ?`, id)
return scanHost(row)
}
@@ -124,7 +124,7 @@ func (s *Store) ListHosts(ctx context.Context) ([]Host, error) {
enrolled_at, last_seen_at, status, repo_id, tags,
current_job_id, last_backup_at, last_backup_status,
repo_size_bytes, snapshot_count, open_alert_count,
-applied_schedule_version, default_paths
+applied_schedule_version, default_paths, repo_initialised_at
FROM hosts ORDER BY name`)
if err != nil {
return nil, fmt.Errorf("store: list hosts: %w", err)
@@ -163,13 +163,14 @@ func scanHostRow(s hostScanner) (*Host, error) {
enrolled string
tags string
defaultPaths string
repoInitAt sql.NullString
)
err := s.Scan(&h.ID, &h.Name, &h.OS, &h.Arch,
&h.AgentVersion, &h.ResticVersion, &h.ProtocolVersion,
&enrolled, &lastSeen, &h.Status, &repoID, &tags,
&currentJob, &lastBackupAt, &lastBkSt,
&h.RepoSizeBytes, &h.SnapshotCount, &h.OpenAlertCount,
-&h.AppliedScheduleVersion, &defaultPaths)
+&h.AppliedScheduleVersion, &defaultPaths, &repoInitAt)
if err != nil {
if errors.Is(err, sql.ErrNoRows) {
return nil, ErrNotFound
@@ -213,5 +214,28 @@ func scanHostRow(s hostScanner) (*Host, error) {
if defaultPaths != "" {
_ = json.Unmarshal([]byte(defaultPaths), &h.DefaultPaths)
}
if repoInitAt.Valid {
t, err := time.Parse(time.RFC3339Nano, repoInitAt.String)
if err != nil {
return nil, fmt.Errorf("store: parse repo_initialised_at: %w", err)
}
h.RepoInitialisedAt = &t
}
return &h, nil
}
// MarkHostRepoInitialised sets repo_initialised_at to `when` if it is
// currently NULL. Idempotent: re-firing for an already-initialised
// host is a no-op (we never want to clobber the original timestamp).
// Returns true if the row was updated, false if it was already set.
func (s *Store) MarkHostRepoInitialised(ctx context.Context, hostID string, when time.Time) (bool, error) {
res, err := s.db.ExecContext(ctx,
`UPDATE hosts SET repo_initialised_at = ?
WHERE id = ? AND repo_initialised_at IS NULL`,
when.UTC().Format(time.RFC3339Nano), hostID)
if err != nil {
return false, fmt.Errorf("store: mark repo initialised: %w", err)
}
n, _ := res.RowsAffected()
return n > 0, nil
}
@@ -0,0 +1,15 @@
-- 0004_repo_initialised.sql
--
-- Track whether a host's restic repo has been initialised. Set when:
-- 1. an `init` job succeeds, OR
-- 2. any backup job succeeds (proves the repo exists), OR
-- 3. a snapshots.report arrives with at least one snapshot.
--
-- Once set, never cleared by code — only by the operator deleting the
-- host or wiping the column manually if they re-pointed the agent at
-- a different (empty) repo. The UI keys off NULL/non-NULL to decide
-- whether to surface the red "Initialise repo" affordance in the
-- run-now panel.
ALTER TABLE hosts
ADD COLUMN repo_initialised_at TEXT;
@@ -0,0 +1,47 @@
-- 0005_jobs_init_kind.sql
--
-- Add 'init' to the jobs.kind CHECK constraint so the operator can
-- dispatch a `restic init` job from the UI before the first backup.
-- SQLite can't ALTER a CHECK in place, so we rebuild the table.
--
-- Rebuild pattern note: we create jobs_new (with the wider CHECK),
-- copy data over, DROP the original jobs table, then ALTER RENAME
-- jobs_new TO jobs. This avoids the trap of renaming the original
-- first — with legacy_alter_table=OFF (the modern default), a rename
-- propagates into FK references in dependent tables (e.g.
-- job_logs.job_id), leaving them pointing at the temporary name even
-- after we drop it. Migration 0006 cleans up the orphan FK left by
-- the first version of this migration on already-affected DBs.
PRAGMA foreign_keys = OFF;
CREATE TABLE jobs_new (
id TEXT PRIMARY KEY,
host_id TEXT NOT NULL REFERENCES hosts(id) ON DELETE CASCADE,
kind TEXT NOT NULL CHECK (kind IN ('backup','init','forget','prune','check','unlock')),
status TEXT NOT NULL CHECK (status IN ('queued','running','succeeded','failed','cancelled')),
scheduled_id TEXT REFERENCES schedules(id) ON DELETE SET NULL,
actor_kind TEXT NOT NULL CHECK (actor_kind IN ('user','schedule','system')),
actor_id TEXT,
started_at TEXT,
finished_at TEXT,
exit_code INTEGER,
stats TEXT,
error TEXT,
created_at TEXT NOT NULL
);
INSERT INTO jobs_new
SELECT id, host_id, kind, status, scheduled_id, actor_kind, actor_id,
started_at, finished_at, exit_code, stats, error, created_at
FROM jobs;
DROP TABLE jobs;
ALTER TABLE jobs_new RENAME TO jobs;
CREATE INDEX jobs_host_id ON jobs(host_id);
CREATE INDEX jobs_status ON jobs(status);
CREATE INDEX jobs_created_at ON jobs(created_at);
PRAGMA foreign_keys = ON;
@@ -0,0 +1,33 @@
-- 0006_fix_job_logs_fk.sql
--
-- The first version of migration 0005 rebuilt the jobs table via the
-- unsafe pattern of renaming the original to jobs_old before dropping
-- it. SQLite (with legacy_alter_table=OFF, the modern default)
-- propagated that rename into the FK declaration of job_logs.job_id,
-- which now points at jobs_old, a table that no longer exists. INSERTs
-- into job_logs fail with "no such table: main.jobs_old (1)".
--
-- Rebuild job_logs using the safe pattern: create job_logs_new with
-- a clean FK to jobs, copy rows, drop the broken job_logs, rename
-- job_logs_new to job_logs. Renaming job_logs_new is safe because
-- nothing references it.
PRAGMA foreign_keys = OFF;
CREATE TABLE job_logs_new (
job_id TEXT NOT NULL REFERENCES jobs(id) ON DELETE CASCADE,
seq INTEGER NOT NULL,
ts TEXT NOT NULL,
stream TEXT NOT NULL CHECK (stream IN ('stdout','stderr','event')),
payload TEXT NOT NULL,
PRIMARY KEY (job_id, seq)
);
INSERT INTO job_logs_new (job_id, seq, ts, stream, payload)
SELECT job_id, seq, ts, stream, payload FROM job_logs;
DROP TABLE job_logs;
ALTER TABLE job_logs_new RENAME TO job_logs;
PRAGMA foreign_keys = ON;
@@ -62,6 +62,12 @@ type Host struct {
// operator hits "Run now" without supplying paths. Phase 1
// interim — schedules (P2-01) supersede this.
DefaultPaths []string
// RepoInitialisedAt is non-nil once we've confirmed the host's
// repo has been initialised — either the operator clicked the
// init button, or a backup succeeded, or snapshots.report came
// back non-empty. The host detail run-now panel shows a red
// "Initialise repo" affordance while this is nil.
RepoInitialisedAt *time.Time
}
// EnrollmentToken is the issuer's view of a one-time token. The