restic-manager/docs/superpowers/specs/2026-05-04-p3-restore-design.md
steve 454a2415dc docs: P3 restore design spec + scope-decompose Phase 3
Splits Phase 3 into three independently-shippable sub-phases (Restore,
Alerts, Audit UI) so they can land in separate PRs with their own brainstorm
→ spec → plan cycles. The Restore sub-phase is up first.

The brainstorm ran on 2026-05-04 and locked the following decisions:

- Single-host restore only this phase. P3-04 (cross-host restore) is moved
  to a new 'Future / unscheduled' section. Disaster recovery is already
  covered by re-enrolling a replacement host with the same repo creds; the
  remaining 'pull a file from host A onto host C' use case is genuinely
  different (file sharing / migration, not DR) and has no confirmed need.
- Default target is /var/restic-restore/<job-id>/ with --no-ownership;
  in-place restore preserves uid/gid/mode and is gated by typed-confirmation
  of the host name (mirroring the repo re-init danger zone).
- Tree browser is the path picker, lazy-loaded via a synchronous WS RPC
  (tree.list) over the existing correlation-ID infrastructure with a
  per-wizard-session in-memory cache (~30 min TTL).
- Single-page wizard with progressively-enabled sections; entry is a
  top-level Restore button on host detail (or per-snapshot Restore action
  for direct deep-link).
- Snapshot diff (P3-09) is a JobDiff JobKind, dispatched like every other
  agent operation; output streams to the standard live job log page.
- Restore-specific live job page variant with files-restored /
  bytes-restored / current-file widget.
- Single-flight per host across all kinds, plus a real cancel-job feature
  (command.cancel WS envelope, agent kills the restic subprocess via
  context cancel + SIGTERM/SIGKILL grace) so the operator can pre-empt a
  long-running backup if they need to restore urgently. Wires the existing
  job_detail Cancel button (which was a UI stub).
- Audit row host.restore on every dispatch + a recent-restores panel on
  host detail. Role gate deferred to P4-03 RBAC.

Wireframe at _diag/p3-restore-wizard/wireframe.html (gitignored —
transient design artefact); screenshot reviewed and approved 2026-05-04.
2026-05-04 15:02:32 +01:00

P3 — Restore (design)

Phase 3 sub-spec covering single-host restore (P3-01, P3-02, P3-03, P3-09). P3-04 (cross-host restore) is deferred to a new "Future / unscheduled" section in tasks.md — disaster recovery is already covered by re-enrolling a replacement host with the same repo credentials.

Wireframe: _diag/p3-restore-wizard/wireframe.html. Screenshot: _diag/p3-restore-wizard/01-full-wizard.png.

Scope locked

Brainstorm decisions (in order asked):

  1. In-place vs new-directory. Default is a new directory under /var/restic-restore/<job-id>/. A "Restore in place (overwrite original paths)" toggle is gated by typed-confirmation of the host name, mirroring the repo re-init pattern.
  2. Path-selection granularity. Tree browser as the path selector, lazy-loaded via restic ls --json <snapshot> <path> per directory expansion.
  3. Cross-host restore (P3-04). Out of scope this phase. Move to "Future / unscheduled" in tasks.md. The disaster-recovery case is covered by the standard enrolment flow: stand up a replacement host, paste the original repo creds at enrolment, snapshots reappear, restore is same-host.
  4. Snapshot diff (P3-09). Diff-as-a-job. New JobDiff JobKind dispatched like every other agent operation. Output streams as log.stream and renders on the live job log page.
  5. Wizard entry points. Top-level "Restore" button on host detail (/hosts/{id}/restore, opens wizard at step 1) plus a per-snapshot Restore action on snapshot rows (/hosts/{id}/snapshots/{sid}/restore, skips step 1).
  6. Wizard interaction model. Single-page, sections progressively enable; tree-browser nodes lazy-load via HTMX partials. No restore_drafts table.
  7. Tree-browser data path. Synchronous WS RPC (tree.list ↔ tree.list.result, matched by correlation ID) plus a per-wizard-session in-memory cache keyed by {snapshot_id, path} with ~30-min TTL.
  8. Restore progress UI. Restore-specific job-page variant: files-restored / bytes-restored / throughput / ETA / current-file display, driven by restic restore's JSON status events surfaced through job.progress.
  9. Permissions/ownership. Policy, not toggle. In-place restore preserves original ownership; new-directory restore drops ownership (--no-ownership).
  10. Concurrency. Single-flight per host (one job at a time across all kinds). Plus a real cancel-job feature: command.cancel envelope, agent kills the restic subprocess via context cancel (SIGTERM, SIGKILL after grace), server transitions the job to cancelled. The "Cancel" button already in the job_detail template becomes real for any running job kind.
  11. Audit + safety. Audit row on every restore dispatch (host.restore with snapshot ID, paths, target, in-place flag). Recent-restores panel on the host page surfacing the latest restore job alongside last-backup and last-init signals. Role gate deferred to P4-03.
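
Decision 8 derives throughput and ETA from restic's JSON status events. A minimal sketch of that derivation follows. The JSON field names (message_type, seconds_elapsed, total_files, files_restored, total_bytes, bytes_restored) are what this sketch assumes `restic restore --json` emits; the `progressView` type and `parseStatus` function are hypothetical names, not part of the existing codebase.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// restoreStatus holds the fields this sketch assumes appear in a
// "status" line of `restic restore --json`.
type restoreStatus struct {
	MessageType    string  `json:"message_type"`
	SecondsElapsed float64 `json:"seconds_elapsed"`
	TotalFiles     uint64  `json:"total_files"`
	FilesRestored  uint64  `json:"files_restored"`
	TotalBytes     uint64  `json:"total_bytes"`
	BytesRestored  uint64  `json:"bytes_restored"`
}

// progressView is a hypothetical shape for what the job page widget shows.
type progressView struct {
	FilesRestored, TotalFiles uint64
	BytesRestored, TotalBytes uint64
	BytesPerSec               float64
	ETA                       time.Duration
}

// parseStatus decodes one stdout line. ok is false for lines that are not
// valid JSON or are not status events (for example the final summary).
func parseStatus(line []byte) (v progressView, ok bool) {
	var s restoreStatus
	if err := json.Unmarshal(line, &s); err != nil || s.MessageType != "status" {
		return progressView{}, false
	}
	v = progressView{
		FilesRestored: s.FilesRestored, TotalFiles: s.TotalFiles,
		BytesRestored: s.BytesRestored, TotalBytes: s.TotalBytes,
	}
	if s.SecondsElapsed > 0 {
		v.BytesPerSec = float64(s.BytesRestored) / s.SecondsElapsed
	}
	// ETA is only meaningful once some throughput has been observed.
	if v.BytesPerSec > 0 && s.TotalBytes > s.BytesRestored {
		remaining := float64(s.TotalBytes - s.BytesRestored)
		v.ETA = time.Duration(remaining / v.BytesPerSec * float64(time.Second))
	}
	return v, true
}

func main() {
	line := []byte(`{"message_type":"status","seconds_elapsed":10,"total_files":4,"files_restored":1,"total_bytes":4000,"bytes_restored":1000}`)
	if v, ok := parseStatus(line); ok {
		fmt.Printf("%d/%d files, %.0f B/s, ETA %s\n",
			v.FilesRestored, v.TotalFiles, v.BytesPerSec, v.ETA)
	}
}
```

Averaging over the whole run keeps the sketch short; a windowed rate would react faster to stalls at the cost of extra state.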

Architecture

Restore composes from existing primitives plus three new pieces:

  • New JobKind values: JobRestore, JobDiff. Dispatcher cases mirror the prune/check pattern. Agent-side handlers wrap restic.RunRestore and restic.RunDiff (new methods on the restic package).
  • New WS RPC: tree.list request ({snapshot_id, path}) ↔ tree.list.result reply ({entries: [{name, type, size}], ...} or {error}). Reuses existing correlation-ID infrastructure from P1-09. No jobs row.
  • New cancel surface: command.cancel request ({job_id}), agent cancels the running subprocess context, returns command.ack + job.finished with status cancelled. Server endpoint POST /api/jobs/{id}/cancel bridges UI button → WS envelope.

Everything else (job lifecycle, log streaming, progress envelope, snapshot listing, audit log writer, host_chrome partial, danger-zone typed-confirmation) already exists and is reused verbatim.

Component boundaries

| Component | Purpose | Depends on |
| --- | --- | --- |
| internal/restic.RunRestore | Run restic restore with paths + target + ownership | restic.Env |
| internal/restic.RunDiff | Run restic diff --json a b | restic.Env |
| internal/agent/runner cases | Dispatch JobRestore / JobDiff jobs | restic.Run*, hooks (skipped: backup-only) |
| internal/agent/runner cancel hook | Wire WS command.cancel → ctx.CancelFunc per job | runner job map |
| internal/agent/runner tree-list | Sync RPC handler: restic ls --json for one path | restic.Env |
| internal/server/ws/cancel.go | Validate + send command.cancel envelope | hub.Send, store.UpdateJobStatus |
| internal/server/ws/tree.go | RPC mediator: tree.list request → reply, with cache | hub.SendRPC, in-memory cache |
| internal/server/http/restore.go | Wizard routes + dispatch endpoint | store, ws, audit |
| internal/server/http/diff.go | Snapshot-diff dispatch endpoint | store, ws |
| internal/server/http/cancel.go | POST /api/jobs/{id}/cancel | ws |
| web/templates/pages/host_restore.html | Wizard page | host_chrome partial |
| web/templates/partials/tree_node.html | Lazy-loaded tree node fragment | for HTMX swap |
| web/templates/pages/job_detail.html | Restore-kind progress widget (variant) | existing job_detail |

Data flow — wizard happy path

operator
  ├─ GET /hosts/{id}/restore
  │     server renders wizard shell, snapshot table from store.ListSnapshotsByHost
  │
  ├─ click snapshot row (or arrives via /hosts/{id}/snapshots/{sid}/restore)
  │     wizard advances to step 2, snapshot summary card rendered
  │
  ├─ expand a tree node (chevron click)
  │     HTMX GET /hosts/{id}/restore/tree?snapshot={sid}&path=/etc
  │       server checks per-session cache (keyed by sid+path)
  │         hit  → render tree_node fragment from cache
  │         miss → hub.SendRPC(host_id, "tree.list", {sid, path}) → wait reply
  │                cache result, render tree_node fragment
  │
  ├─ tick file/dir checkboxes (form state, no round-trip)
  │
  ├─ pick target radio (and optionally type host name to unlock in-place)
  │
  └─ POST /hosts/{id}/restore  (form submit)
        server validates: ≥1 path, target mode, in-place ⇒ host name match
        write audit row host.restore
        store.CreateJob{kind=restore, payload={snapshot_id, paths, target, in_place}}
        hub.Send(host_id, "command.run", {job_id, kind=restore, payload})
        HX-Redirect: /jobs/{job_id}
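
The server-side validation step in the flow above (at least one path, in-place requires a host name match) could be sketched as below. `restoreForm`, `validateRestoreForm`, and the error values are hypothetical names; the absolute-path check is an added assumption based on the payload's "absolute paths inside the snapshot" wording.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// restoreForm is a hypothetical decoded form of the wizard's POST body.
type restoreForm struct {
	SnapshotID  string
	Paths       []string
	InPlace     bool
	ConfirmHost string // host name typed by the operator to unlock in-place
}

var (
	errNoSnapshot      = errors.New("select a snapshot")
	errNoPaths         = errors.New("select at least one path")
	errRelativePath    = errors.New("paths must be absolute")
	errConfirmMismatch = errors.New("type the host name to confirm in-place restore")
)

// validateRestoreForm applies the checks listed in the data flow. It returns
// the first failure so the handler can re-render the form with that error.
func validateRestoreForm(f restoreForm, hostName string) error {
	if f.SnapshotID == "" {
		return errNoSnapshot
	}
	if len(f.Paths) == 0 {
		return errNoPaths
	}
	for _, p := range f.Paths {
		if !strings.HasPrefix(p, "/") {
			return fmt.Errorf("%w: %q", errRelativePath, p)
		}
	}
	// Typed confirmation is only required for the destructive mode.
	if f.InPlace && f.ConfirmHost != hostName {
		return errConfirmMismatch
	}
	return nil
}

func main() {
	f := restoreForm{SnapshotID: "abc123", Paths: []string{"/etc/hosts"}, InPlace: true}
	fmt.Println(validateRestoreForm(f, "alfa-01"))
}
```

Running validation before the audit write keeps rejected submissions out of the host.restore audit trail, which matches the ordering in the flow.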

Data flow — agent restore execution

agent.runner receives command.run kind=restore
  ├─ check single-flight: if r.activeJobID != "" → reply busy
  │   (server queues to pending_runs only for kind=backup; restore returns busy)
  ├─ allocate ctx, ctxCancel — store cancelFunc against job_id in r.cancels
  ├─ sendStarted(job_id, JobRestore, now)
  ├─ build target path: if in_place → "/" else "/var/restic-restore/<job_id>/"
  ├─ build flags: one --include per payload path, --no-ownership when !in_place
  ├─ restic.RunRestore(ctx, env, snapshot_id, paths, target, in_place):
  │   restic restore <sid> --json --target <path> [--no-ownership] --include <p1> --include <p2> ...
  │   parse stdout JSON: forward "status" → job.progress (1Hz throttle), "summary" → final
  ├─ on success: sendFinished(job_id, succeeded, exit=0)
  ├─ on ctx.Err() == context.Canceled: sendFinished(job_id, cancelled, exit=130)
  └─ delete cancel func from r.cancels
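
The target and flag construction in this flow can be sketched as a pure function, which also makes the argv easy to pin in a unit test. `restoreTarget` and `restoreArgs` are hypothetical helpers. The sketch selects paths with restic's `--include` option because `restic restore` accepts a single positional snapshot ID; `--no-ownership` is taken from this spec as written.

```go
package main

import (
	"fmt"
	"path"
)

// restoreRoot is the default parent for new-directory restores per this spec.
const restoreRoot = "/var/restic-restore"

// restoreTarget picks the --target value: "/" for in-place, otherwise a
// per-job directory. jobID is assumed to be a server-minted identifier
// with no path separators.
func restoreTarget(jobID string, inPlace bool) string {
	if inPlace {
		return "/"
	}
	return path.Join(restoreRoot, jobID) + "/"
}

// restoreArgs builds the argv that follows the restic binary name.
func restoreArgs(snapshotID string, paths []string, target string, inPlace bool) []string {
	args := []string{"restore", snapshotID, "--json", "--target", target}
	if !inPlace {
		// New-directory restores drop ownership (policy, not a toggle).
		args = append(args, "--no-ownership")
	}
	for _, p := range paths {
		args = append(args, "--include", p)
	}
	return args
}

func main() {
	target := restoreTarget("job-42", false)
	fmt.Println(restoreArgs("abc123", []string{"/etc/hosts", "/etc/ssh"}, target, false))
}
```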

Data flow — cancel

operator clicks Cancel on /jobs/{id} (running)
  POST /api/jobs/{id}/cancel
    server: lookup job, ensure status=running, find host
    hub.Send(host_id, "command.cancel", {job_id})
  → agent.runner receives command.cancel
       cancelFunc, ok := r.cancels[job_id]
       ok && cancelFunc()
       → restic subprocess context done → exec.Cmd kills via SIGTERM
       → if still alive after 5s grace → SIGKILL
       → runner sendFinished(job_id, cancelled, exit=130)
  → server receives job.finished status=cancelled, persists, broadcasts
  → browser refresh shows cancelled state

The cancel surface is independently useful for any kind (prune/check/backup) — not gated to restore. The button already in job_detail.html becomes real.
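
One detail worth pinning down: Go's `exec.CommandContext` kills the child with SIGKILL by default when the context ends. To get the SIGTERM-then-SIGKILL behaviour described above, the command needs `Cmd.Cancel` and `Cmd.WaitDelay` (Go 1.20+). The sketch below combines that with a single-flight slot and a cancel registry; `runner`, `begin`, `end`, `cancel`, and `gracefulCommand` are hypothetical names, not the existing agent code.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"sync"
	"syscall"
	"time"
)

// killGrace is how long the subprocess gets after SIGTERM before SIGKILL.
const killGrace = 5 * time.Second

// runner tracks the single active job and its cancel function.
type runner struct {
	mu      sync.Mutex
	active  string
	cancels map[string]context.CancelFunc
}

func newRunner() *runner {
	return &runner{cancels: make(map[string]context.CancelFunc)}
}

// begin claims the single-flight slot. ok is false when another job runs.
func (r *runner) begin(jobID string) (ctx context.Context, ok bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.active != "" {
		return nil, false
	}
	ctx, cancel := context.WithCancel(context.Background())
	r.active = jobID
	r.cancels[jobID] = cancel
	return ctx, true
}

// cancel handles command.cancel. It reports whether the job was known.
func (r *runner) cancel(jobID string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	cancel, ok := r.cancels[jobID]
	if ok {
		cancel()
	}
	return ok
}

// end releases the slot once the job has finished, whatever the outcome.
func (r *runner) end(jobID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if cancel, ok := r.cancels[jobID]; ok {
		cancel() // release context resources
		delete(r.cancels, jobID)
	}
	if r.active == jobID {
		r.active = ""
	}
}

// gracefulCommand sends SIGTERM on context cancellation and lets the
// standard library escalate to a hard kill after killGrace.
func gracefulCommand(ctx context.Context, name string, args ...string) *exec.Cmd {
	cmd := exec.CommandContext(ctx, name, args...)
	cmd.Cancel = func() error { return cmd.Process.Signal(syscall.SIGTERM) }
	cmd.WaitDelay = killGrace
	return cmd
}

func main() {
	r := newRunner()
	ctx, _ := r.begin("job-1")
	cmd := gracefulCommand(ctx, "restic", "version") // constructed, not started
	fmt.Println(cmd.Args, cmd.WaitDelay)
	fmt.Println("cancelled:", r.cancel("job-1"), ctx.Err())
	r.end("job-1")
}
```

Because `cancel` only looks the job up by ID, a late or duplicate command.cancel for a finished job is a harmless no-op.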

Tree-list RPC details

New WS message types (added to internal/api/messages.go):

type TreeListRequestPayload struct {
    SnapshotID string `json:"snapshot_id"`
    Path       string `json:"path"`
}

type TreeListEntry struct {
    Name string `json:"name"`
    Type string `json:"type"`        // "dir" | "file" | "symlink"
    Size int64  `json:"size,omitempty"`
}

type TreeListResultPayload struct {
    SnapshotID string          `json:"snapshot_id"`
    Path       string          `json:"path"`
    Entries    []TreeListEntry `json:"entries,omitempty"`
    Error      string          `json:"error,omitempty"`
}

Server-side mediator (ws.SendRPC) takes a request envelope, registers the correlation ID in a pending map, sends, blocks on a per-call channel until the matching reply arrives (or 30s timeout). The pattern is small enough to inline in internal/server/ws/rpc.go as a generic helper — future synchronous RPCs reuse it.
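
A minimal sketch of such a generic helper, assuming replies arrive on a separate read loop that calls `resolve` with the correlation ID. `rpcMediator`, `call`, `resolve`, and the two demo functions are hypothetical names; the real hub integration is not shown.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

var errRPCTimeout = errors.New("rpc: timed out waiting for reply")

// rpcMediator matches replies to waiting callers by correlation ID.
type rpcMediator struct {
	mu      sync.Mutex
	pending map[string]chan []byte
}

func newRPCMediator() *rpcMediator {
	return &rpcMediator{pending: make(map[string]chan []byte)}
}

// call registers id, runs send, then blocks until the reply, the timeout,
// or context cancellation, whichever comes first.
func (m *rpcMediator) call(ctx context.Context, id string, timeout time.Duration, send func() error) ([]byte, error) {
	ch := make(chan []byte, 1) // buffered so resolve never blocks
	m.mu.Lock()
	m.pending[id] = ch
	m.mu.Unlock()
	defer func() {
		m.mu.Lock()
		delete(m.pending, id)
		m.mu.Unlock()
	}()

	if err := send(); err != nil {
		return nil, err
	}
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case reply := <-ch:
		return reply, nil
	case <-timer.C:
		return nil, errRPCTimeout
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// resolve delivers a reply. It returns false for unknown or late IDs.
func (m *rpcMediator) resolve(id string, reply []byte) bool {
	m.mu.Lock()
	ch, ok := m.pending[id]
	if ok {
		delete(m.pending, id) // at most one delivery per call
	}
	m.mu.Unlock()
	if ok {
		ch <- reply
	}
	return ok
}

// demoRoundTrip simulates an agent that answers from another goroutine.
func demoRoundTrip() (string, error) {
	m := newRPCMediator()
	reply, err := m.call(context.Background(), "c1", time.Second, func() error {
		go m.resolve("c1", []byte("pong"))
		return nil
	})
	return string(reply), err
}

// demoTimeout simulates an agent that never answers.
func demoTimeout() error {
	m := newRPCMediator()
	_, err := m.call(context.Background(), "c2", 10*time.Millisecond, func() error { return nil })
	return err
}

func main() {
	fmt.Println(demoRoundTrip())
	fmt.Println(demoTimeout())
}
```

Registering the ID before sending avoids a race where a fast reply arrives before the caller is listening.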

In-memory cache: map[sessionID]map[cacheKey]TreeListResultPayload with cacheKey = snapshot_id + "\x00" + path. Session ID minted per wizard load (HTTP-only cookie scoped to /hosts/{id}/restore/tree, lifetime 30 min). On wizard close (browser navigation away) the entry expires naturally. No persistence, no migration.
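
The cache shape described above could look roughly like this. The sketch expires each entry individually after the TTL, which is a simplification of "session lifetime 30 min", and it takes an injectable clock so tests do not sleep. `treeCache`, `fakeClock`, and `treeEntry` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const treeTTL = 30 * time.Minute

type treeEntry struct {
	Name string
	Type string
	Size int64
}

type cachedListing struct {
	entries []treeEntry
	expires time.Time
}

// treeCache is map[sessionID]map[cacheKey]listing guarded by a mutex.
type treeCache struct {
	mu       sync.Mutex
	ttl      time.Duration
	now      func() time.Time
	sessions map[string]map[string]cachedListing
}

func newTreeCache(ttl time.Duration, now func() time.Time) *treeCache {
	return &treeCache{ttl: ttl, now: now, sessions: make(map[string]map[string]cachedListing)}
}

// cacheKey joins the parts with NUL, which cannot occur in either part.
func cacheKey(snapshotID, p string) string { return snapshotID + "\x00" + p }

func (c *treeCache) put(session, snapshotID, p string, entries []treeEntry) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.sessions[session] == nil {
		c.sessions[session] = make(map[string]cachedListing)
	}
	c.sessions[session][cacheKey(snapshotID, p)] = cachedListing{entries, c.now().Add(c.ttl)}
}

func (c *treeCache) get(session, snapshotID, p string) ([]treeEntry, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	key := cacheKey(snapshotID, p)
	e, ok := c.sessions[session][key]
	if !ok {
		return nil, false
	}
	if !c.now().Before(e.expires) {
		delete(c.sessions[session], key) // lazily evict expired entries
		return nil, false
	}
	return e.entries, true
}

// fakeClock is a test helper that advances only when told to.
type fakeClock struct{ t time.Time }

func (f *fakeClock) now() time.Time       { return f.t }
func (f *fakeClock) advanceMinutes(n int) { f.t = f.t.Add(time.Duration(n) * time.Minute) }

func main() {
	c := newTreeCache(treeTTL, time.Now)
	c.put("sess-1", "abc123", "/etc", []treeEntry{{Name: "hosts", Type: "file", Size: 120}})
	fmt.Println(c.get("sess-1", "abc123", "/etc"))
}
```

Lazy eviction alone leaves abandoned sessions in memory until they are read again, so a periodic sweep would likely be needed in a long-running server.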

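Agent handler runs restic ls --json <sid> <path>. Given a directory argument and no --recursive flag, restic ls lists that directory without descending into its subdirectories, so the handler never passes --recursive. It still parses the output line by line and emits only direct children of path, which drops any other lines such as the snapshot header. 60s context timeout, mirroring the existing restic snapshots invocation.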
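
The line-by-line filter can be sketched as follows. The JSON field names (name, type, path, size) are what this sketch assumes `restic ls --json` prints for each node; `lsNode`, `directChildren`, and `parseLS` are hypothetical names.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"path"
	"strings"
)

// lsNode holds the per-line fields this sketch assumes restic emits.
type lsNode struct {
	Name string `json:"name"`
	Type string `json:"type"`
	Path string `json:"path"`
	Size int64  `json:"size"`
}

type treeListEntry struct {
	Name string
	Type string
	Size int64
}

// directChildren reads newline-delimited JSON and keeps only nodes whose
// parent directory is dir. Lines without a path (such as a snapshot
// header) and the entry for dir itself are skipped.
func directChildren(r io.Reader, dir string) ([]treeListEntry, error) {
	dir = path.Clean(dir)
	var out []treeListEntry
	sc := bufio.NewScanner(r)
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // tolerate long lines
	for sc.Scan() {
		line := sc.Bytes()
		if len(line) == 0 {
			continue
		}
		var n lsNode
		if err := json.Unmarshal(line, &n); err != nil {
			return nil, fmt.Errorf("parse ls line: %w", err)
		}
		if n.Path == "" || path.Clean(n.Path) == dir || path.Dir(n.Path) != dir {
			continue
		}
		out = append(out, treeListEntry{Name: n.Name, Type: n.Type, Size: n.Size})
	}
	return out, sc.Err()
}

// parseLS is a convenience wrapper for string input.
func parseLS(output, dir string) ([]treeListEntry, error) {
	return directChildren(strings.NewReader(output), dir)
}

func main() {
	sample := `{"struct_type":"snapshot","paths":["/etc"]}
{"name":"etc","type":"dir","path":"/etc"}
{"name":"hosts","type":"file","path":"/etc/hosts","size":120}
{"name":"ssh","type":"dir","path":"/etc/ssh"}
{"name":"sshd_config","type":"file","path":"/etc/ssh/sshd_config","size":3200}`
	entries, err := parseLS(sample, "/etc")
	fmt.Println(entries, err)
}
```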

Restore payload

api.CommandRunPayload gains a nested optional restore field:

type RestorePayload struct {
    SnapshotID    string   `json:"snapshot_id"`
    Paths         []string `json:"paths"`           // absolute paths inside the snapshot
    InPlace       bool     `json:"in_place"`
    TargetDir     string   `json:"target_dir"`      // empty when in_place=true
    PreserveOwner bool     `json:"preserve_owner"`  // mirrors policy: in_place=>true, else=>false
}

The payload is set by the server when dispatching JobRestore and ignored on every other kind. Wire-shape test pinned in wire_test.go.
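
As an illustration of what the wire-shape test could pin, the sketch below builds a payload under the ownership policy and compares its JSON encoding. `newRestorePayload` and `wireJSON` are hypothetical helpers, and deriving `TargetDir` on the server side is an assumption of this sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type RestorePayload struct {
	SnapshotID    string   `json:"snapshot_id"`
	Paths         []string `json:"paths"`
	InPlace       bool     `json:"in_place"`
	TargetDir     string   `json:"target_dir"`
	PreserveOwner bool     `json:"preserve_owner"`
}

// newRestorePayload applies the policy: in-place preserves ownership and
// leaves TargetDir empty; new-directory restores drop ownership.
func newRestorePayload(snapshotID string, paths []string, inPlace bool, jobID string) RestorePayload {
	p := RestorePayload{
		SnapshotID:    snapshotID,
		Paths:         paths,
		InPlace:       inPlace,
		PreserveOwner: inPlace,
	}
	if !inPlace {
		p.TargetDir = "/var/restic-restore/" + jobID + "/"
	}
	return p
}

// wireJSON returns the exact bytes that would go on the wire.
func wireJSON(p RestorePayload) (string, error) {
	b, err := json.Marshal(p)
	return string(b), err
}

func main() {
	s, err := wireJSON(newRestorePayload("abc123", []string{"/etc/hosts"}, false, "job-42"))
	fmt.Println(s, err)
}
```

Since encoding/json emits struct fields in declaration order, a string comparison is enough to catch accidental renames or reordering.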

Diff payload

api.CommandRunPayload gains:

type DiffPayload struct {
    SnapshotA string `json:"snapshot_a"`
    SnapshotB string `json:"snapshot_b"`
}

Set on JobDiff. Output is plain restic diff --json <a> <b> forwarded as log.stream lines. Job page renders unchanged — operator reads the diff output directly.
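
Forwarding the diff output unchanged amounts to a line scanner feeding the log stream. A minimal sketch, with `diffArgs`, `forwardLines`, and `collectLines` as hypothetical names and the `emit` callback standing in for whatever sends log.stream envelopes:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// diffArgs builds the argv that follows the restic binary name.
func diffArgs(snapshotA, snapshotB string) []string {
	return []string{"diff", "--json", snapshotA, snapshotB}
}

// forwardLines calls emit once per output line, in order, without
// interpreting the content.
func forwardLines(r io.Reader, emit func(line string)) error {
	sc := bufio.NewScanner(r)
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // tolerate long JSON lines
	for sc.Scan() {
		emit(sc.Text())
	}
	return sc.Err()
}

// collectLines is a demo helper that gathers forwarded lines into a slice.
func collectLines(output string) ([]string, error) {
	var lines []string
	err := forwardLines(strings.NewReader(output), func(l string) {
		lines = append(lines, l)
	})
	return lines, err
}

func main() {
	fmt.Println(diffArgs("abc123", "def456"))
	lines, err := collectLines("{\"path\":\"/etc/hosts\",\"modifier\":\"M\"}\n{\"path\":\"/etc/motd\",\"modifier\":\"+\"}\n")
	fmt.Println(len(lines), err)
}
```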

Recent-restores panel

A small panel rendered on the host detail page below the existing init-status line:

last restore: succeeded 2h ago · job f73ab4c1… · 3 files to /var/restic-restore/...

Backed by a new store.LatestJobByKind(host_id, JobRestore) call, reusing the existing store.LatestJobByKind query already used for init/forget/prune/check in P2R-06. One template addition in host_chrome.html next to the InitStatus block.

Routes added

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /hosts/{id}/restore | Wizard shell (step 1 = snapshot picker) |
| GET | /hosts/{id}/snapshots/{sid}/restore | Wizard shell with snapshot pre-selected (skips step 1) |
| GET | /hosts/{id}/restore/tree | HTMX partial: tree node listing for ?snapshot=&path= |
| POST | /hosts/{id}/restore | Validate + dispatch restore job, redirect to live job page |
| POST | /api/hosts/{id}/snapshots/diff | Dispatch a diff job for {snapshot_a, snapshot_b} |
| POST | /api/jobs/{id}/cancel | Send command.cancel to host, transition job → cancelled |

Migrations

None. Restore + diff piggyback on the existing jobs table (their kind is new but the schema already accepts arbitrary kind strings — there's no CHECK constraint on kind). The cancel feature uses the existing JobCancelled terminal status. The tree-list cache lives in process memory.

Tests (target coverage)

  • internal/restic/restore_test.go: RunRestore invocation builds the expected argv (paths, --target, --no-ownership flag presence, in-place variant); JSON status parsing → BackupStatus-shaped progress envelopes.
  • internal/restic/diff_test.go: RunDiff argv shape and JSON forwarding.
  • internal/agent/runner/restore_test.go: happy path, cancel mid-run produces cancelled finished, in-place vs new-directory dispatch, single-flight rejects when another job is running.
  • internal/agent/runner/tree_test.go: tree.list handler returns direct children for a synthetic restic ls output, surfaces error on missing snapshot.
  • internal/server/ws/rpc_test.go: SendRPC correlation matching, timeout, concurrent calls.
  • internal/server/http/restore_test.go: wizard renders with snapshots, POST validates ≥1 path + in-place host-name match, audit row written, job dispatched with correct payload, in-place without typed-confirm re-renders form with input intact and an error.
  • internal/server/http/diff_test.go: POST dispatches JobDiff, snapshot IDs validated against the host's snapshot list.
  • internal/server/http/cancel_test.go: POST cancel happy path (running → cancelled), 4xx for non-running jobs, 4xx when host offline.
  • internal/server/http/restore_e2e_test.go: happy path (GET wizard, expand /etc with the HTMX call returning the expected fragment, submit, follow HX-Redirect to job page, see status).
  • web/templates/pages/host_restore_test.go (template-render test): wizard renders all four sections; in-place card disabled until typed confirm.

Playwright iteration / sweep

A Playwright sweep at the end (mirroring P2R-02 Slice 6) runs against the local smoke server with a real agent enrolled. Steps:

  1. Login → navigate to alfa-01 host → click Restore.
  2. Wizard step 1: pick the most recent snapshot.
  3. Wizard step 2: expand a directory two levels, tick three files, verify tally updates.
  4. Wizard step 3: leave default new-directory.
  5. Wizard step 4: dispatch.
  6. Land on live job page, see progress widget animating, see log lines.
  7. Click Cancel mid-flight, verify status transitions to cancelled and the agent's subprocess actually died (log line signal: killed or exit 130).
  8. Repeat with in-place mode: type host name, dispatch, verify red primary button, verify files actually overwritten on host.
  9. Snapshot diff: navigate to snapshots, pick two, dispatch diff, see diff output streamed.
  10. Screenshots into _diag/p3-restore-sweep/.

End-to-end clean, zero console errors, before handing back.

What does NOT change

  • host_chrome.html only grows the recent-restores line; sub-tab list unchanged (Restore is a top-level button on the host page, not a sub-tab).
  • enrollment.go, schedule reconciliation, source-group CRUD, repo maintenance ticker, hook execution — none of these are touched.
  • The CLAUDE.md restage block applies as-is when the agent binary changes (it does — runner gains restore/diff/cancel/tree handlers). The unit file does not change.

Open questions / explicit non-goals

  • Restore preview / dry-run. restic 0.17.0 and later accept --dry-run on restore, but the wizard does not expose it this phase. Out of scope.
  • Resumable restore. Restic restore is idempotent per-file but not resumable mid-stream from where it left off. If a restore is cancelled, the operator re-runs (files already written are overwritten). No state to track.
  • Restore to a glob/pattern (e.g. *.conf). Out of scope; the tree picker requires explicit ticks. Power users can edit the URL or use the CLI.
  • Bandwidth caps for restore. Honoured automatically — restic's --limit-download is part of restic.Env already (P2R-13) and applies to restore unchanged.
  • Pre/post hooks for restore. Hooks today gate only kind=backup (P2R-11). Out of scope.