# P3 — Restore (design)

> Phase 3 sub-spec covering single-host restore (P3-01, P3-02, P3-03, P3-09).
> P3-04 (cross-host restore) is deferred to a new "Future / unscheduled"
> section in `tasks.md`; disaster recovery is already covered by re-enrolling
> a replacement host with the same repo credentials.
>
> Wireframe: `_diag/p3-restore-wizard/wireframe.html`. Screenshot:
> `_diag/p3-restore-wizard/01-full-wizard.png`.

## Scope locked

Brainstorm decisions (in order asked):

1. **In-place vs new-directory.** Default is a new directory under
   `/var/restic-restore//`. A "Restore in place (overwrite original paths)"
   toggle is gated by typed confirmation of the host name, mirroring the repo
   re-init pattern.
2. **Path-selection granularity.** Tree browser as the path selector,
   lazy-loaded via `restic ls --json ` per directory expansion.
3. **Cross-host restore (P3-04).** Out of scope this phase. Move to
   "Future / unscheduled" in `tasks.md`. The disaster-recovery case is covered
   by the standard enrolment flow: stand up a replacement host, paste the
   original repo creds at enrolment, snapshots reappear, restore is same-host.
4. **Snapshot diff (P3-09).** Diff-as-a-job. New `JobDiff` JobKind dispatched
   like every other agent operation. Output streams as `log.stream` and renders
   on the live job log page.
5. **Wizard entry points.** Top-level "Restore" button on host detail
   (`/hosts/{id}/restore`, opens wizard at step 1) plus a per-snapshot Restore
   action on snapshot rows (`/hosts/{id}/snapshots/{sid}/restore`, skips step 1).
6. **Wizard interaction model.** Single-page, sections progressively enable;
   tree-browser nodes lazy-load via HTMX partials. No `restore_drafts` table.
7. **Tree-browser data path.** Synchronous WS RPC (`tree.list` ↔
   `tree.list.result`, correlation-ID) plus a per-wizard-session in-memory
   cache keyed by `{snapshot_id, path}` with ~30-min TTL.
8. **Restore progress UI.** Restore-specific job-page variant: files-restored /
   bytes-restored / throughput / ETA / current-file display, driven by restic
   restore's JSON status events surfaced through `job.progress`.
9. **Permissions/ownership.** Policy, not toggle. In-place restore preserves
   original ownership; new-directory restore drops ownership (`--no-ownership`).
10. **Concurrency.** Single-flight per host (one job at a time across all
    kinds). Plus a real cancel-job feature: `command.cancel` envelope, agent
    kills the `restic` subprocess via context cancel (SIGTERM, SIGKILL after
    grace), server transitions the job to `cancelled`. The "Cancel" button
    already in the `job_detail` template becomes real for any running job kind.
11. **Audit + safety.** Audit row on every restore dispatch (`host.restore`
    with snapshot ID, paths, target, in-place flag). Recent-restores panel on
    the host page surfacing the latest restore job alongside last-backup and
    last-init signals. Role gate deferred to P4-03.

## Architecture

Restore composes from existing primitives plus three new pieces:

- **New JobKind values**: `JobRestore`, `JobDiff`. Dispatcher cases mirror the
  prune/check pattern. Agent-side handlers wrap `restic.RunRestore` and
  `restic.RunDiff` (new methods on the `restic` package).
- **New WS RPC**: `tree.list` request (`{snapshot_id, path}`) ↔
  `tree.list.result` reply (`{entries: [{name, type, size}], ...}` or
  `{error}`). Reuses existing correlation-ID infrastructure from P1-09. No
  `jobs` row.
- **New cancel surface**: `command.cancel` request (`{job_id}`), agent cancels
  the running subprocess context, returns `command.ack` + `job.finished` with
  status `cancelled`. Server endpoint `POST /api/jobs/{id}/cancel` bridges UI
  button → WS envelope.

Everything else (job lifecycle, log streaming, progress envelope, snapshot
listing, audit log writer, host_chrome partial, danger-zone typed-confirmation)
already exists and is reused verbatim.
### Component boundaries

| Component | Purpose | Depends on |
| --- | --- | --- |
| `internal/restic.RunRestore` | Run `restic restore` with paths + target + ownership | `restic.Env` |
| `internal/restic.RunDiff` | Run `restic diff --json a b` | `restic.Env` |
| `internal/agent/runner` cases | Dispatch `JobRestore` / `JobDiff` jobs | `restic.Run*`, hooks (skipped: backup-only) |
| `internal/agent/runner` cancel hook | Wire WS `command.cancel` → ctx.CancelFunc per job | runner job map |
| `internal/agent/runner` tree-list | Sync RPC handler: `restic ls --json` for one path | `restic.Env` |
| `internal/server/ws/cancel.go` | Validate + send `command.cancel` envelope | hub.Send, store.UpdateJobStatus |
| `internal/server/ws/tree.go` | RPC mediator: `tree.list` request → reply, with cache | hub.SendRPC, in-memory cache |
| `internal/server/http/restore.go` | Wizard routes + dispatch endpoint | store, ws, audit |
| `internal/server/http/diff.go` | Snapshot-diff dispatch endpoint | store, ws |
| `internal/server/http/cancel.go` | `POST /api/jobs/{id}/cancel` | ws |
| `web/templates/pages/host_restore.html` | Wizard page | host_chrome partial |
| `web/templates/partials/tree_node.html` | Lazy-loaded tree node fragment for HTMX swap | (none) |
| `web/templates/pages/job_detail.html` | Restore-kind progress widget (variant) | existing job_detail |

### Data flow — wizard happy path

```
operator
  ├─ GET /hosts/{id}/restore
  │    server renders wizard shell, snapshot table from store.ListSnapshotsByHost
  │
  ├─ click snapshot row (or arrives via /hosts/{id}/snapshots/{sid}/restore)
  │    wizard advances to step 2, snapshot summary card rendered
  │
  ├─ expand a tree node (chevron click)
  │    HTMX GET /hosts/{id}/restore/tree?snapshot={sid}&path=/etc
  │    server checks per-session cache (keyed by sid+path)
  │      hit  → render tree_node fragment from cache
  │      miss → hub.SendRPC(host_id, "tree.list", {sid, path}) → wait reply
  │             cache result, render tree_node fragment
  │
  ├─ tick file/dir checkboxes (form state, no round-trip)
  │
  ├─ pick target radio (and optionally type host name to unlock in-place)
  │
  └─ POST /hosts/{id}/restore (form submit)
       server validates: ≥1 path, target mode, in-place ⇒ host name match
       write audit row host.restore
       store.CreateJob{kind=restore, payload={snapshot_id, paths, target, in_place}}
       hub.Send(host_id, "command.run", {job_id, kind=restore, payload})
       HX-Redirect: /jobs/{job_id}
```

### Data flow — agent restore execution

```
agent.runner receives command.run kind=restore
  ├─ check single-flight: if r.activeJobID != "" → reply busy
  │    (server queues to pending_runs only for kind=backup; restore returns busy)
  ├─ allocate ctx, ctxCancel; store cancelFunc against job_id in r.cancels
  ├─ sendStarted(job_id, JobRestore, now)
  ├─ build target path: if in_place → "/" else "/var/restic-restore//"
  ├─ build flags: paths from payload, --no-ownership when !in_place
  ├─ restic.RunRestore(ctx, env, snapshot_id, paths, target, in_place):
  │    restic restore --target [--no-ownership] -- ...
  │    parse stdout JSON: forward "status" → job.progress (1Hz throttle), "summary" → final
  ├─ on success: sendFinished(job_id, succeeded, exit=0)
  ├─ on ctx.Err() == context.Canceled: sendFinished(job_id, cancelled, exit=130)
  └─ delete cancel func from r.cancels
```

### Data flow — cancel

```
operator clicks Cancel on /jobs/{id} (running)
  POST /api/jobs/{id}/cancel
    server: lookup job, ensure status=running, find host
    hub.Send(host_id, "command.cancel", {job_id})
      → agent.runner receives command.cancel
        cancelFunc, ok := r.cancels[job_id]
        ok && cancelFunc()
          → restic subprocess context done
          → exec.Cmd kills via SIGTERM
          → if still alive after 5s grace → SIGKILL
          → runner sendFinished(job_id, cancelled, exit=130)
      → server receives job.finished status=cancelled, persists, broadcasts
      → browser refresh shows cancelled state
```

The cancel surface is independently useful for any kind (prune/check/backup)
and is not gated to restore. The button already in `job_detail.html` becomes
real.

### Tree-list RPC details

New WS message types (added to `internal/api/messages.go`):

```go
type TreeListRequestPayload struct {
	SnapshotID string `json:"snapshot_id"`
	Path       string `json:"path"`
}

type TreeListEntry struct {
	Name string `json:"name"`
	Type string `json:"type"` // "dir" | "file" | "symlink"
	Size int64  `json:"size,omitempty"`
}

type TreeListResultPayload struct {
	SnapshotID string          `json:"snapshot_id"`
	Path       string          `json:"path"`
	Entries    []TreeListEntry `json:"entries,omitempty"`
	Error      string          `json:"error,omitempty"`
}
```

Server-side mediator (`ws.SendRPC`) takes a request envelope, registers the
correlation ID in a pending map, sends, blocks on a per-call channel until the
matching reply arrives (or 30s timeout). The pattern is small enough to inline
in `internal/server/ws/rpc.go` as a generic helper; future synchronous RPCs
reuse it.

In-memory cache: `map[sessionID]map[cacheKey]TreeListResultPayload` with
`cacheKey = snapshot_id + "\x00" + path`.
The session ID is minted per wizard load (HTTP-only cookie scoped to
`/hosts/{id}/restore/tree`, lifetime 30 min). On wizard close (browser
navigation away) the entry expires naturally. No persistence, no migration.

The agent handler runs `restic ls --json ` and must return a non-recursive
listing: restic defaults to recursive, but `restic ls` accepts `--long` and a
path filter; parse the output line by line and emit only direct children of
`path`. It runs under a 60s context timeout, mirroring the existing
`restic snapshots` invocation.

### Restore payload

`api.CommandRunPayload` gains a nested optional `restore` field:

```go
type RestorePayload struct {
	SnapshotID    string   `json:"snapshot_id"`
	Paths         []string `json:"paths"`          // absolute paths inside the snapshot
	InPlace       bool     `json:"in_place"`
	TargetDir     string   `json:"target_dir"`     // empty when in_place=true
	PreserveOwner bool     `json:"preserve_owner"` // mirrors policy: in_place=>true, else=>false
}
```

The payload is set by the server when dispatching `JobRestore` and ignored on
every other kind. Wire-shape test pinned in `wire_test.go`.

### Diff payload

`api.CommandRunPayload` gains:

```go
type DiffPayload struct {
	SnapshotA string `json:"snapshot_a"`
	SnapshotB string `json:"snapshot_b"`
}
```

Set on `JobDiff`. Output is plain `restic diff --json ` forwarded as
`log.stream` lines. The job page renders unchanged; the operator reads the diff
output directly.

### Recent-restores panel

A small panel rendered on the host detail page below the existing init-status
line:

```
last restore: succeeded 2h ago · job f73ab4c1… · 3 files to /var/restic-restore/...
```

Backed by a new `store.LatestJobByKind(host_id, JobRestore)` lookup, which
reuses the `store.LatestJobByKind` query already used for
init/forget/prune/check in P2R-06. One template addition in `host_chrome.html`
next to the `InitStatus` block.
## Routes added

| Method | Path | Purpose |
| --- | --- | --- |
| GET | `/hosts/{id}/restore` | Wizard shell (step 1 = snapshot picker) |
| GET | `/hosts/{id}/snapshots/{sid}/restore` | Wizard shell with snapshot pre-selected (skips step 1) |
| GET | `/hosts/{id}/restore/tree` | HTMX partial: tree node listing for `?snapshot=&path=` |
| POST | `/hosts/{id}/restore` | Validate + dispatch restore job, redirect to live job page |
| POST | `/api/hosts/{id}/snapshots/diff` | Dispatch a diff job for `{snapshot_a, snapshot_b}` |
| POST | `/api/jobs/{id}/cancel` | Send `command.cancel` to host, transition job → cancelled |

## Migrations

None. Restore + diff piggyback on the existing `jobs` table (their `kind` is
new but the schema already accepts arbitrary kind strings; there is no CHECK
constraint on `kind`). The cancel feature uses the existing `JobCancelled`
terminal status. The tree-list cache lives in process memory.

## Tests (target coverage)

- `internal/restic/restore_test.go`: `RunRestore` invocation builds the
  expected argv (paths, `--target`, `--no-ownership` flag presence, in-place
  variant); JSON status parsing → `BackupStatus`-shaped progress envelopes.
- `internal/restic/diff_test.go`: `RunDiff` argv shape and JSON forwarding.
- `internal/agent/runner/restore_test.go`: happy path, cancel mid-run produces
  `cancelled` finished, in-place vs new-directory dispatch, single-flight
  rejects when another job is running.
- `internal/agent/runner/tree_test.go`: `tree.list` handler returns direct
  children for a synthetic restic ls output, surfaces error on missing
  snapshot.
- `internal/server/ws/rpc_test.go`: `SendRPC` correlation matching, timeout,
  concurrent calls.
- `internal/server/http/restore_test.go`: wizard renders with snapshots, POST
  validates ≥1 path + in-place host-name match, audit row written, job
  dispatched with correct payload, in-place without typed-confirm re-renders
  form with input intact and an error.
- `internal/server/http/diff_test.go`: POST dispatches `JobDiff`, snapshot IDs
  validated against the host's snapshot list.
- `internal/server/http/cancel_test.go`: POST cancel happy path (running →
  cancelled), 4xx for non-running jobs, 4xx when host offline.
- `internal/server/http/restore_e2e_test.go`: happy path: GET wizard, expand
  `/etc` (HTMX call returns expected fragment), submit, follow HX-Redirect to
  job page, see status.
- `web/templates/pages/host_restore_test.go` (template-render test): wizard
  renders all four sections; in-place card disabled until typed confirm.

## Playwright iteration / sweep

A Playwright sweep at the end (mirroring P2R-02 Slice 6) runs against the local
smoke server with a real agent enrolled. Steps:

1. Login → navigate to alfa-01 host → click Restore.
2. Wizard step 1: pick the most recent snapshot.
3. Wizard step 2: expand a directory two levels, tick three files, verify
   tally updates.
4. Wizard step 3: leave default new-directory.
5. Wizard step 4: dispatch.
6. Land on live job page, see progress widget animating, see log lines.
7. Click Cancel mid-flight, verify status transitions to cancelled and the
   agent's subprocess actually died (log line `signal: killed` or exit 130).
8. Repeat with in-place mode: type host name, dispatch, verify red primary
   button, verify files actually overwritten on host.
9. Snapshot diff: navigate to snapshots, pick two, dispatch diff, see diff
   output streamed.
10. Screenshots into `_diag/p3-restore-sweep/`.

End-to-end clean, zero console errors, before handing back.

## What does NOT change

- `host_chrome.html` only grows the recent-restores line; sub-tab list
  unchanged (Restore is a top-level button on the host page, not a sub-tab).
- `enrollment.go`, schedule reconciliation, source-group CRUD, repo
  maintenance ticker, hook execution: none of these are touched.
- The CLAUDE.md restage block applies as-is when the agent binary changes (it
  does: the runner gains restore/diff/cancel/tree handlers). The unit file does
  not change.

## Open questions / explicit non-goals

- **Restore preview / dry-run.** Restic doesn't have a dry-run for restore.
  Out of scope.
- **Resumable restore.** Restic restore is idempotent per-file but not
  resumable mid-stream from where it left off. If a restore is cancelled, the
  operator re-runs (files already written are overwritten). No state to track.
- **Restore to a glob/pattern (e.g. `*.conf`).** Out of scope; the tree picker
  requires explicit ticks. Power users can edit the URL or use the CLI.
- **Bandwidth caps for restore.** Honoured automatically: restic's
  `--limit-download` is part of `restic.Env` already (P2R-13) and applies to
  restore unchanged.
- **Pre/post hooks for restore.** Hooks today gate only `kind=backup`
  (P2R-11). Out of scope.