diff --git a/docs/superpowers/specs/2026-05-04-p3-restore-design.md b/docs/superpowers/specs/2026-05-04-p3-restore-design.md new file mode 100644 index 0000000..7dbc747 --- /dev/null +++ b/docs/superpowers/specs/2026-05-04-p3-restore-design.md @@ -0,0 +1,342 @@ +# P3 — Restore (design) + +> Phase 3 sub-spec covering single-host restore (P3-01, P3-02, P3-03, P3-09). +> P3-04 (cross-host restore) is deferred to a new "Future / unscheduled" +> section in `tasks.md` — disaster recovery is already covered by re-enrolling +> a replacement host with the same repo credentials. +> +> Wireframe: `_diag/p3-restore-wizard/wireframe.html`. Screenshot: +> `_diag/p3-restore-wizard/01-full-wizard.png`. + +## Scope locked + +Brainstorm decisions (in order asked): + +1. **In-place vs new-directory.** Default is a new directory under + `/var/restic-restore//`. A "Restore in place (overwrite original + paths)" toggle is gated by typed-confirmation of the host name, mirroring + the repo re-init pattern. +2. **Path-selection granularity.** Tree browser as the path selector, lazy- + loaded via `restic ls --json ` per directory expansion. +3. **Cross-host restore (P3-04).** Out of scope this phase. Move to + "Future / unscheduled" in `tasks.md`. The disaster-recovery case is covered + by the standard enrolment flow: stand up a replacement host, paste the + original repo creds at enrolment, snapshots reappear, restore is + same-host. +4. **Snapshot diff (P3-09).** Diff-as-a-job. New `JobDiff` JobKind dispatched + like every other agent operation. Output streams as `log.stream` and + renders on the live job log page. +5. **Wizard entry points.** Top-level "Restore" button on host detail + (`/hosts/{id}/restore`, opens wizard at step 1) plus a per-snapshot + Restore action on snapshot rows (`/hosts/{id}/snapshots/{sid}/restore`, + skips step 1). +6. **Wizard interaction model.** Single-page, sections progressively enable; + tree-browser nodes lazy-load via HTMX partials.
No `restore_drafts` table. +7. **Tree-browser data path.** Synchronous WS RPC (`tree.list` ↔ + `tree.list.result`, correlation-ID) plus a per-wizard-session in-memory + cache keyed by `{snapshot_id, path}` with ~30-min TTL. +8. **Restore progress UI.** Restore-specific job-page variant: files-restored + / bytes-restored / throughput / ETA / current-file display, driven by + restic restore's JSON status events surfaced through `job.progress`. +9. **Permissions/ownership.** Policy, not toggle. In-place restore preserves + original ownership; new-directory restore drops ownership + (`--no-ownership`). +10. **Concurrency.** Single-flight per host (one job at a time across all + kinds). Plus a real cancel-job feature: `command.cancel` envelope, agent + kills the `restic` subprocess via context cancel (SIGTERM, SIGKILL after + grace), server transitions the job to `cancelled`. The "Cancel" button + already in the `job_detail` template becomes real for any running job + kind. +11. **Audit + safety.** Audit row on every restore dispatch (`host.restore` + with snapshot ID, paths, target, in-place flag). Recent-restores panel + on the host page surfacing the latest restore job alongside last-backup + and last-init signals. Role gate deferred to P4-03. + +## Architecture + +Restore composes from existing primitives plus three new pieces: + +- **New JobKind values**: `JobRestore`, `JobDiff`. Dispatcher cases mirror + the prune/check pattern. Agent-side handlers wrap `restic.RunRestore` and + `restic.RunDiff` (new methods on the `restic` package). +- **New WS RPC**: `tree.list` request (`{snapshot_id, path}`) ↔ + `tree.list.result` reply (`{entries: [{name, type, size}], ...}` or + `{error}`). Reuses existing correlation-ID infrastructure from P1-09. No + `jobs` row. +- **New cancel surface**: `command.cancel` request (`{job_id}`), agent + cancels the running subprocess context, returns `command.ack` + `job.finished` + with status `cancelled`. 
Server endpoint `POST /api/jobs/{id}/cancel` + bridges UI button → WS envelope. + +Everything else (job lifecycle, log streaming, progress envelope, snapshot +listing, audit log writer, host_chrome partial, danger-zone typed-confirmation) +already exists and is reused verbatim. + +### Component boundaries + +| Component | Purpose | Depends on | +| ---------------------------------- | ---------------------------------------------------- | ----------------------------------------- | +| `internal/restic.RunRestore` | Run `restic restore` with paths + target + ownership | `restic.Env` | +| `internal/restic.RunDiff` | Run `restic diff --json a b` | `restic.Env` | +| `internal/agent/runner` cases | Dispatch `JobRestore` / `JobDiff` jobs | `restic.Run*`, hooks (skipped: backup-only) | +| `internal/agent/runner` cancel hook | Wire WS `command.cancel` → ctx.CancelFunc per job | runner job map | +| `internal/agent/runner` tree-list | Sync RPC handler: `restic ls --json` for one path | `restic.Env` | +| `internal/server/ws/cancel.go` | Validate + send `command.cancel` envelope | hub.Send, store.UpdateJobStatus | +| `internal/server/ws/tree.go` | RPC mediator: `tree.list` request → reply, with cache | hub.SendRPC, in-memory cache | +| `internal/server/http/restore.go` | Wizard routes + dispatch endpoint | store, ws, audit | +| `internal/server/http/diff.go` | Snapshot-diff dispatch endpoint | store, ws | +| `internal/server/http/cancel.go` | `POST /api/jobs/{id}/cancel` | ws | +| `web/templates/pages/host_restore.html` | Wizard page | host_chrome partial | +| `web/templates/partials/tree_node.html` | Lazy-loaded tree node fragment for HTMX swap | — | +| `web/templates/pages/job_detail.html` | Restore-kind progress widget (variant) | existing job_detail | + +### Data flow — wizard happy path + +``` +operator + ├─ GET /hosts/{id}/restore + │ server renders wizard shell, snapshot table from store.ListSnapshotsByHost + │ + ├─ click snapshot row (or arrives via 
/hosts/{id}/snapshots/{sid}/restore) + │ wizard advances to step 2, snapshot summary card rendered + │ + ├─ expand a tree node (chevron click) + │ HTMX GET /hosts/{id}/restore/tree?snapshot={sid}&path=/etc + │ server checks per-session cache (keyed by sid+path) + │ hit → render tree_node fragment from cache + │ miss → hub.SendRPC(host_id, "tree.list", {sid, path}) → wait reply + │ cache result, render tree_node fragment + │ + ├─ tick file/dir checkboxes (form state, no round-trip) + │ + ├─ pick target radio (and optionally type host name to unlock in-place) + │ + └─ POST /hosts/{id}/restore (form submit) + server validates: ≥1 path, target mode, in-place ⇒ host name match + write audit row host.restore + store.CreateJob{kind=restore, payload={snapshot_id, paths, target, in_place}} + hub.Send(host_id, "command.run", {job_id, kind=restore, payload}) + HX-Redirect: /jobs/{job_id} +``` + +### Data flow — agent restore execution + +``` +agent.runner receives command.run kind=restore + ├─ check single-flight: if r.activeJobID != "" → reply busy + │ (server queues to pending_runs only for kind=backup; restore returns busy) + ├─ allocate ctx, ctxCancel — store cancelFunc against job_id in r.cancels + ├─ sendStarted(job_id, JobRestore, now) + ├─ build target path: if in_place → "/" else "/var/restic-restore//" + ├─ build flags: paths from payload, --no-ownership when !in_place + ├─ restic.RunRestore(ctx, env, snapshot_id, paths, target, in_place): + │ restic restore --target [--no-ownership] -- ... 
+ │ parse stdout JSON: forward "status" → job.progress (1Hz throttle), "summary" → final + ├─ on success: sendFinished(job_id, succeeded, exit=0) + ├─ on ctx.Err() == context.Canceled: sendFinished(job_id, cancelled, exit=130) + └─ delete cancel func from r.cancels +``` + +### Data flow — cancel + +``` +operator clicks Cancel on /jobs/{id} (running) + POST /api/jobs/{id}/cancel + server: lookup job, ensure status=running, find host + hub.Send(host_id, "command.cancel", {job_id}) + → agent.runner receives command.cancel + cancelFunc, ok := r.cancels[job_id] + ok && cancelFunc() + → restic subprocess context done → exec.Cmd kills via SIGTERM + → if still alive after 5s grace → SIGKILL + → runner sendFinished(job_id, cancelled, exit=130) + → server receives job.finished status=cancelled, persists, broadcasts + → browser refresh shows cancelled state +``` + +The cancel surface is independently useful for any kind (prune/check/backup) — +not gated to restore. The button already in `job_detail.html` becomes real. + +### Tree-list RPC details + +New WS message types (added to `internal/api/messages.go`): + +``` +type TreeListRequestPayload struct { + SnapshotID string `json:"snapshot_id"` + Path string `json:"path"` +} + +type TreeListEntry struct { + Name string `json:"name"` + Type string `json:"type"` // "dir" | "file" | "symlink" + Size int64 `json:"size,omitempty"` +} + +type TreeListResultPayload struct { + SnapshotID string `json:"snapshot_id"` + Path string `json:"path"` + Entries []TreeListEntry `json:"entries,omitempty"` + Error string `json:"error,omitempty"` +} +``` + +Server-side mediator (`ws.SendRPC`) takes a request envelope, registers the +correlation ID in a pending map, sends, blocks on a per-call channel until +the matching reply arrives (or 30s timeout). The pattern is small enough +to inline in `internal/server/ws/rpc.go` as a generic helper — future +synchronous RPCs reuse it. 
+ +In-memory cache: `map[sessionID]map[cacheKey]TreeListResultPayload` with +`cacheKey = snapshot_id + "\x00" + path`. Session ID minted per wizard +load (HTTP-only cookie scoped to `/hosts/{id}/restore/tree`, lifetime 30 +min). On wizard close (browser navigation away) the entry expires +naturally. No persistence, no migration. + +Agent handler runs `restic ls --json ` and returns only the direct +children of `path`: a snapshot-wide listing is recursive, so the handler +parses the output line-by-line and drops deeper entries (`restic ls` also +accepts `--long` and a path filter). 60s context timeout, mirroring the +existing `restic snapshots` invocation. + +### Restore payload + +`api.CommandRunPayload` gains a nested optional `restore` field: + +``` +type RestorePayload struct { + SnapshotID string `json:"snapshot_id"` + Paths []string `json:"paths"` // absolute paths inside the snapshot + InPlace bool `json:"in_place"` + TargetDir string `json:"target_dir"` // empty when in_place=true + PreserveOwner bool `json:"preserve_owner"` // mirrors policy: in_place=>true, else=>false +} +``` + +The payload is set by the server when dispatching `JobRestore` and ignored +on every other kind. Wire-shape test pinned in `wire_test.go`. + +### Diff payload + +`api.CommandRunPayload` gains: + +``` +type DiffPayload struct { + SnapshotA string `json:"snapshot_a"` + SnapshotB string `json:"snapshot_b"` +} +``` + +Set on `JobDiff`. Output is plain `restic diff --json ` forwarded as +`log.stream` lines. Job page renders unchanged — operator reads the diff +output directly. + +### Recent-restores panel + +A small panel rendered on the host detail page below the existing init-status +line: + +``` +last restore: succeeded 2h ago · job f73ab4c1… · 3 files to /var/restic-restore/... +``` + +Backed by a `store.LatestJobByKind(host_id, JobRestore)` call, reusing the +`store.LatestJobByKind` query already used for init/forget/prune/check in +P2R-06. One template addition in `host_chrome.html` next to the +`InitStatus` block.
+ +## Routes added + +| Method | Path | Purpose | +| ------- | --------------------------------------------------------- | ----------------------------------------------------------- | +| GET | `/hosts/{id}/restore` | Wizard shell (step 1 = snapshot picker) | +| GET | `/hosts/{id}/snapshots/{sid}/restore` | Wizard shell with snapshot pre-selected (skips step 1) | +| GET | `/hosts/{id}/restore/tree` | HTMX partial: tree node listing for `?snapshot=&path=` | +| POST | `/hosts/{id}/restore` | Validate + dispatch restore job, redirect to live job page | +| POST | `/api/hosts/{id}/snapshots/diff` | Dispatch a diff job for `{snapshot_a, snapshot_b}` | +| POST | `/api/jobs/{id}/cancel` | Send `command.cancel` to host, transition job → cancelled | + +## Migrations + +None. Restore + diff piggyback on the existing `jobs` table (their `kind` is +new but the schema already accepts arbitrary kind strings — there's no +CHECK constraint on `kind`). The cancel feature uses the existing +`JobCancelled` terminal status. The tree-list cache lives in process memory. + +## Tests (target coverage) + +- `internal/restic/restore_test.go` — `RunRestore` invocation builds the + expected argv (paths, --target, --no-ownership flag presence, in-place + variant); JSON status parsing → `BackupStatus`-shaped progress envelopes. +- `internal/restic/diff_test.go` — `RunDiff` argv shape and JSON forwarding. +- `internal/agent/runner/restore_test.go` — happy path, cancel mid-run + produces `cancelled` finished, in-place vs new-directory dispatch, + single-flight rejects when another job is running. +- `internal/agent/runner/tree_test.go` — `tree.list` handler returns + direct children for a synthetic restic ls output, surfaces error on + missing snapshot. +- `internal/server/ws/rpc_test.go` — `SendRPC` correlation matching, + timeout, concurrent calls. 
+- `internal/server/http/restore_test.go` — wizard renders with snapshots, + POST validates ≥1 path + in-place host-name match, audit row written, + job dispatched with correct payload, in-place without typed-confirm + re-renders form with input intact and an error. +- `internal/server/http/diff_test.go` — POST dispatches `JobDiff`, + snapshot IDs validated against the host's snapshot list. +- `internal/server/http/cancel_test.go` — POST cancel happy path + (running → cancelled), 4xx for non-running jobs, 4xx when host offline. +- `internal/server/http/restore_e2e_test.go` — happy path: GET wizard, + expand `/etc` (HTMX call returns expected fragment), submit, follow + HX-Redirect to job page, see status. +- `web/templates/pages/host_restore_test.go` (template-render test) — + wizard renders all four sections; in-place card disabled until typed + confirm. + +## Playwright iteration / sweep + +A Playwright sweep at the end (mirroring P2R-02 Slice 6) runs against the +local smoke server with a real agent enrolled. Steps: + +1. Login → navigate to alfa-01 host → click Restore. +2. Wizard step 1: pick the most recent snapshot. +3. Wizard step 2: expand a directory two levels, tick three files, + verify tally updates. +4. Wizard step 3: leave default new-directory. +5. Wizard step 4: dispatch. +6. Land on live job page, see progress widget animating, see log lines. +7. Click Cancel mid-flight, verify status transitions to cancelled and + the agent's subprocess actually died (log line `signal: killed` or exit + 130). +8. Repeat with in-place mode: type host name, dispatch, verify red + primary button, verify files actually overwritten on host. +9. Snapshot diff: navigate to snapshots, pick two, dispatch diff, see + diff output streamed. +10. Screenshots into `_diag/p3-restore-sweep/`. + +End-to-end clean, zero console errors, before handing back. 
+ +## What does NOT change + +- `host_chrome.html` only grows the recent-restores line; sub-tab list + unchanged (Restore is a top-level button on the host page, not a sub-tab). +- `enrollment.go`, schedule reconciliation, source-group CRUD, repo + maintenance ticker, hook execution — none of these are touched. +- The CLAUDE.md restage block applies as-is when the agent binary changes + (it does — runner gains restore/diff/cancel/tree handlers). The unit + file does not change. + +## Open questions / explicit non-goals + +- **Restore preview / dry-run.** Newer restic releases accept + `restore --dry-run`, but the wizard does not surface a preview step. + Out of scope. +- **Resumable restore.** Restic restore is idempotent per-file but not + resumable mid-stream from where it left off. If a restore is cancelled, + the operator re-runs (files already written are overwritten). No state + to track. +- **Restore to a glob/pattern (e.g. `*.conf`).** Out of scope; the tree + picker requires explicit ticks. Power users can edit the URL or use the + CLI. +- **Bandwidth caps for restore.** Honoured automatically — restic's + `--limit-download` is part of `restic.Env` already (P2R-13) and applies + to restore unchanged. +- **Pre/post hooks for restore.** Hooks today gate only `kind=backup` + (P2R-11). Out of scope. diff --git a/tasks.md b/tasks.md index 7a65c72..03faac7 100644 --- a/tasks.md +++ b/tasks.md @@ -233,19 +233,47 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
## Phase 3 — Restore, alerts, audit -- [ ] **P3-01** (L) Restore wizard backend: snapshot tree browse via `restic ls --json`, path picker, target selection -- [ ] **P3-02** (L) Restore wizard UI (multi-step: host → snapshot → paths → target → confirm) -- [ ] **P3-03** (M) Restore execution: `restic restore` invocation, progress streaming -- [ ] **P3-04** (L) Cross-host restore: target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root +> Phase 3 is split into three independently-shippable sub-phases: +> **Restore** (P3-01..03 + P3-09 + P3-X1 cancel + P3-X2 tree-list RPC), +> **Alerts** (P3-05..07), **Audit UI** (P3-08). Each sub-phase has its own +> spec → plan → implement cycle; we hand back at sub-phase boundaries. +> +> P3-04 (cross-host restore) was de-scoped during the Phase-3 brainstorm +> on 2026-05-04: disaster recovery is already covered by re-enrolling a +> replacement host with the same repo creds (snapshots reappear, restore +> is same-host). The remaining "pull a file from host A onto host C +> without giving C permanent access" use case is genuinely different and +> doesn't have a confirmed need yet, so it's moved to the **Future / +> unscheduled** section at the end of this file. + +### Phase 3 — Restore (in progress, branch `p3-restore`) + +> Spec: `docs/superpowers/specs/2026-05-04-p3-restore-design.md`. +> Wireframe: `_diag/p3-restore-wizard/wireframe.html`. + +- [ ] **P3-X1** (S) Cancel-job feature. New `command.cancel` WS envelope; agent tracks per-job ctx.CancelFunc and kills the running `restic` subprocess (SIGTERM, SIGKILL after 5s grace); server endpoint `POST /api/jobs/{id}/cancel` bridges UI → WS; the existing UI Cancel button on `/jobs/{id}` becomes real for any running kind. Foundational — restore depends on it. +- [ ] **P3-X2** (S) Tree-list synchronous WS RPC.
New `tree.list` request / `tree.list.result` reply on the existing correlation-ID infra; agent runs `restic ls --json ` per call; server-side mediator `ws.SendRPC` + per-wizard-session in-memory cache (~30-min TTL). +- [ ] **P3-01** (L) Restore wizard backend: tree browse via `tree.list` RPC (P3-X2), path picker validation, target selection (new-dir vs in-place + typed-confirm), dispatch endpoint `POST /hosts/{id}/restore`, audit row `host.restore`. +- [ ] **P3-02** (L) Restore wizard UI: single-page progressively-enabled four-step form at `/hosts/{id}/restore` (and pre-selected variant `/hosts/{id}/snapshots/{sid}/restore`); tree-browser HTMX partials. Top-level "Restore" button on host detail. +- [ ] **P3-03** (M) Restore execution: `restic.RunRestore` (paths, --target, --no-ownership for new-dir; preserves ownership for in-place); agent dispatcher case `JobRestore`; restore-specific job page variant with files-restored / bytes-restored / throughput / ETA / current-file widget. +- [ ] **P3-09** (S) `diff` between two snapshots in UI: `JobDiff` JobKind, `restic.RunDiff`, `POST /api/hosts/{id}/snapshots/diff` dispatcher, snapshot-picker UI on Snapshots tab to pick A+B; output streams as `log.stream` to the standard live job log page. +- [ ] **P3-X3** (S) Recent-restores panel on host detail: small line below the existing init-status, surfacing latest `JobRestore` outcome (succeeded N hours ago / failed → live log link). Backed by `store.LatestJobByKind(host_id, JobRestore)`. 
+ +### Phase 3 — Alerts (not started) + - [ ] **P3-05** (M) Alert engine: rule evaluation loop (failed backup, stale schedule, agent offline, check failed) - [ ] **P3-06** (M) Notification channels: webhook, ntfy, SMTP email - [ ] **P3-07** (S) Alert UI: list, acknowledge, resolve + +### Phase 3 — Audit log UI (not started) + - [ ] **P3-08** (S) Audit log UI with filters (user, action, target, time range) -- [ ] **P3-09** (S) `diff` between two snapshots in UI ### Phase 3 acceptance -- A file deleted on a host can be restored from the UI in under 2 minutes. A failed backup raises an alert via the configured channel within 60s. +- A file deleted on a host can be restored from the UI in under 2 minutes via the wizard at `/hosts/{id}/restore`; the operator can cancel a running restore (or any other running job) from the live job page. Snapshot diff between two snapshots renders as a normal job page. +- A failed backup raises an alert via the configured channel within 60s. +- The audit-log UI lets an admin filter by user / action / target / time range. --- @@ -290,3 +318,14 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days. - [ ] **X-03** Periodic dependency updates (`dependabot` or `renovate`) - [ ] **X-04** Threat-model review at end of each phase - [ ] **X-05** Proper first-run onboarding UI: admin shouldn't need to `curl` `/api/bootstrap` by hand. Render the bootstrap form on the same login page (extra "setup token" field shown only while no admin user exists, hidden after); on submit POST to `/api/bootstrap`, then drop straight into a session. Surface the one-time token from the server log somewhere copy-able (or print a clickable URL with the token in the query string at first-run). Also: relax the 12-char password floor for the first-run path or document it in the form so `admin` doesn't silently fail validation. + +--- + +## Future / unscheduled + +> Items here have a plausible use case but no confirmed need. 
They live +> outside numbered phases until a concrete trigger (a user request, a +> security review finding, a real disaster-recovery exercise) bumps them +> back into a phase. + +- [ ] **F-01** ~~P3-04~~ Cross-host restore. De-scoped from Phase 3 on 2026-05-04. Disaster recovery is already covered: stand up a replacement host, paste the original repo creds at enrolment, snapshots reappear, restore is same-host. The remaining "pull a file from host A onto host C without granting C permanent access" use case is genuinely different (file sharing / migration, not DR) and hasn't been requested. Original spec language was: "target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root". Re-promote when there's a real ask.