Splits Phase 3 into three independently-shippable sub-phases (Restore, Alerts, Audit UI) so they can land in separate PRs with their own brainstorm → spec → plan cycles. The Restore sub-phase is up first. The brainstorm ran on 2026-05-04 and locked the following decisions:
- Single-host restore only this phase. P3-04 (cross-host restore) is moved to a new 'Future / unscheduled' section. Disaster recovery is already covered by re-enrolling a replacement host with the same repo creds; the remaining 'pull a file from host A onto host C' use case is genuinely different (file sharing / migration, not DR) and has no confirmed need.
- Default target is /var/restic-restore/<job-id>/ with --no-ownership; in-place restore preserves uid/gid/mode and is gated by typed-confirmation of the host name (mirroring the repo re-init danger zone).
- Tree browser is the path picker, lazy-loaded via a synchronous WS RPC (tree.list) over the existing correlation-ID infrastructure with a per-wizard-session in-memory cache (~30 min TTL).
- Single-page wizard with progressively-enabled sections; entry is a top-level Restore button on host detail (or per-snapshot Restore action for direct deep-link).
- Snapshot diff (P3-09) is a JobDiff JobKind, dispatched like every other agent operation; output streams to the standard live job log page.
- Restore-specific live job page variant with files-restored / bytes-restored / current-file widget.
- Single-flight per host across all kinds, plus a real cancel-job feature (command.cancel WS envelope, agent kills the restic subprocess via context cancel + SIGTERM/SIGKILL grace) so the operator can pre-empt a long-running backup if they need to restore urgently. Wires the existing job_detail Cancel button (which was a UI stub).
- Audit row host.restore on every dispatch + a recent-restores panel on host detail. Role gate deferred to P4-03 RBAC.
Wireframe at _diag/p3-restore-wizard/wireframe.html (gitignored — transient design artefact); screenshot reviewed and approved 2026-05-04.
P3 — Restore (design)
Phase 3 sub-spec covering single-host restore (P3-01, P3-02, P3-03, P3-09). P3-04 (cross-host restore) is deferred to a new "Future / unscheduled" section in tasks.md: disaster recovery is already covered by re-enrolling a replacement host with the same repo credentials.
Wireframe: _diag/p3-restore-wizard/wireframe.html. Screenshot: _diag/p3-restore-wizard/01-full-wizard.png.
Scope locked
Brainstorm decisions (in order asked):
- In-place vs new-directory. Default is a new directory under /var/restic-restore/<job-id>/. A "Restore in place (overwrite original paths)" toggle is gated by typed-confirmation of the host name, mirroring the repo re-init pattern.
- Path-selection granularity. Tree browser as the path selector, lazy-loaded via restic ls --json <snapshot> <path> per directory expansion.
- Cross-host restore (P3-04). Out of scope this phase. Move to "Future / unscheduled" in tasks.md. The disaster-recovery case is covered by the standard enrolment flow: stand up a replacement host, paste the original repo creds at enrolment, snapshots reappear, restore is same-host.
- Snapshot diff (P3-09). Diff-as-a-job. New JobDiff JobKind dispatched like every other agent operation. Output streams as log.stream and renders on the live job log page.
- Wizard entry points. Top-level "Restore" button on host detail (/hosts/{id}/restore, opens wizard at step 1) plus a per-snapshot Restore action on snapshot rows (/hosts/{id}/snapshots/{sid}/restore, skips step 1).
- Wizard interaction model. Single-page, sections progressively enable; tree-browser nodes lazy-load via HTMX partials. No restore_drafts table.
- Tree-browser data path. Synchronous WS RPC (tree.list ↔ tree.list.result, correlation-ID) plus a per-wizard-session in-memory cache keyed by {snapshot_id, path} with ~30-min TTL.
- Restore progress UI. Restore-specific job-page variant: files-restored / bytes-restored / throughput / ETA / current-file display, driven by restic restore's JSON status events surfaced through job.progress.
- Permissions/ownership. Policy, not toggle. In-place restore preserves original ownership; new-directory restore drops ownership (--no-ownership).
- Concurrency. Single-flight per host (one job at a time across all kinds). Plus a real cancel-job feature: command.cancel envelope, agent kills the restic subprocess via context cancel (SIGTERM, SIGKILL after grace), server transitions the job to cancelled. The "Cancel" button already in the job_detail template becomes real for any running job kind.
- Audit + safety. Audit row on every restore dispatch (host.restore with snapshot ID, paths, target, in-place flag). Recent-restores panel on the host page surfacing the latest restore job alongside last-backup and last-init signals. Role gate deferred to P4-03.
Architecture
Restore composes from existing primitives plus three new pieces:
- New JobKind values: JobRestore, JobDiff. Dispatcher cases mirror the prune/check pattern. Agent-side handlers wrap restic.RunRestore and restic.RunDiff (new methods on the restic package).
- New WS RPC: tree.list request ({snapshot_id, path}) ↔ tree.list.result reply ({entries: [{name, type, size}], ...} or {error}). Reuses existing correlation-ID infrastructure from P1-09. No jobs row.
- New cancel surface: command.cancel request ({job_id}), agent cancels the running subprocess context, returns command.ack + job.finished with status cancelled. Server endpoint POST /api/jobs/{id}/cancel bridges UI button → WS envelope.
Everything else (job lifecycle, log streaming, progress envelope, snapshot listing, audit log writer, host_chrome partial, danger-zone typed-confirmation) already exists and is reused verbatim.
Component boundaries
| Component | Purpose | Depends on |
|---|---|---|
| internal/restic.RunRestore | Run restic restore with paths + target + ownership | restic.Env |
| internal/restic.RunDiff | Run restic diff --json a b | restic.Env |
| internal/agent/runner cases | Dispatch JobRestore / JobDiff jobs | restic.Run*, hooks (skipped: backup-only) |
| internal/agent/runner cancel hook | Wire WS command.cancel → ctx.CancelFunc per job | runner job map |
| internal/agent/runner tree-list | Sync RPC handler: restic ls --json for one path | restic.Env |
| internal/server/ws/cancel.go | Validate + send command.cancel envelope | hub.Send, store.UpdateJobStatus |
| internal/server/ws/tree.go | RPC mediator: tree.list request → reply, with cache | hub.SendRPC, in-memory cache |
| internal/server/http/restore.go | Wizard routes + dispatch endpoint | store, ws, audit |
| internal/server/http/diff.go | Snapshot-diff dispatch endpoint | store, ws |
| internal/server/http/cancel.go | POST /api/jobs/{id}/cancel | ws |
| web/templates/pages/host_restore.html | Wizard page | host_chrome partial |
| web/templates/partials/tree_node.html | Lazy-loaded tree node fragment for HTMX swap | none |
| web/templates/pages/job_detail.html | Restore-kind progress widget (variant) | existing job_detail |
Data flow — wizard happy path
operator
  ├─ GET /hosts/{id}/restore
  │    server renders wizard shell, snapshot table from store.ListSnapshotsByHost
  │
  ├─ click snapshot row (or arrives via /hosts/{id}/snapshots/{sid}/restore)
  │    wizard advances to step 2, snapshot summary card rendered
  │
  ├─ expand a tree node (chevron click)
  │    HTMX GET /hosts/{id}/restore/tree?snapshot={sid}&path=/etc
  │    server checks per-session cache (keyed by sid+path)
  │      hit  → render tree_node fragment from cache
  │      miss → hub.SendRPC(host_id, "tree.list", {sid, path}) → wait reply
  │             cache result, render tree_node fragment
  │
  ├─ tick file/dir checkboxes (form state, no round-trip)
  │
  ├─ pick target radio (and optionally type host name to unlock in-place)
  │
  └─ POST /hosts/{id}/restore (form submit)
       server validates: ≥1 path, target mode, in-place ⇒ host name match
       write audit row host.restore
       store.CreateJob{kind=restore, payload={snapshot_id, paths, target, in_place}}
       hub.Send(host_id, "command.run", {job_id, kind=restore, payload})
       HX-Redirect: /jobs/{job_id}
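The server-side checks in the final POST step can be sketched as a single pure function. This is a minimal illustration, not the project's handler: RestoreForm, its field names, and validateRestoreForm are hypothetical, and the snapshot and absolute-path checks are assumptions layered on top of the three checks the flow names.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// RestoreForm is a hypothetical shape for the submitted wizard form.
type RestoreForm struct {
	SnapshotID  string
	Paths       []string
	InPlace     bool   // target mode: false = new directory, true = in place
	ConfirmHost string // typed confirmation, only checked for in-place
}

// validateRestoreForm applies the dispatch checks: at least one path, and an
// exact host-name match when in-place restore is requested.
func validateRestoreForm(f RestoreForm, hostName string) error {
	if f.SnapshotID == "" { // assumption: the wizard always carries a snapshot
		return errors.New("no snapshot selected")
	}
	if len(f.Paths) == 0 {
		return errors.New("select at least one path")
	}
	for _, p := range f.Paths {
		if !strings.HasPrefix(p, "/") { // payload paths are absolute inside the snapshot
			return fmt.Errorf("path %q is not absolute", p)
		}
	}
	if f.InPlace && f.ConfirmHost != hostName {
		return errors.New("type the host name exactly to confirm in-place restore")
	}
	return nil
}

func main() {
	f := RestoreForm{SnapshotID: "abc123", Paths: []string{"/etc/hosts"}, InPlace: true}
	fmt.Println(validateRestoreForm(f, "alfa-01")) // confirmation still missing
	f.ConfirmHost = "alfa-01"
	fmt.Println(validateRestoreForm(f, "alfa-01")) // passes
}
```

Keeping validation free of store and hub dependencies makes the "re-render form with input intact and an error" case easy to unit-test.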
Data flow — agent restore execution
agent.runner receives command.run kind=restore
  ├─ check single-flight: if r.activeJobID != "" → reply busy
  │    (server queues to pending_runs only for kind=backup; restore returns busy)
  ├─ allocate ctx, ctxCancel; store cancelFunc against job_id in r.cancels
  ├─ sendStarted(job_id, JobRestore, now)
  ├─ build target path: if in_place → "/" else "/var/restic-restore/<job_id>/"
  ├─ build flags: one --include per payload path, --no-ownership when !in_place
  ├─ restic.RunRestore(ctx, env, snapshot_id, paths, target, in_place):
  │    restic restore <sid> --json --target <path> [--no-ownership] --include <p1> --include <p2> ...
  │    parse stdout JSON: forward "status" → job.progress (1Hz throttle), "summary" → final
  ├─ on success: sendFinished(job_id, succeeded, exit=0)
  ├─ on ctx.Err() == context.Canceled: sendFinished(job_id, cancelled, exit=130)
  └─ delete cancel func from r.cancels
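The argv construction in the flow above can be sketched as follows. restic's restore command takes a single snapshot ID as its positional argument and selects paths with --include, so the sketch emits one --include per path; --json is added because the progress events are parsed from stdout. The function names are hypothetical, and --no-ownership is the flag name this spec uses, which should be confirmed against the restic version the agent ships with.

```go
package main

import "fmt"

// restoreTarget returns the restore root: "/" for in-place, otherwise a
// per-job directory under /var/restic-restore/.
func restoreTarget(jobID string, inPlace bool) string {
	if inPlace {
		return "/"
	}
	return "/var/restic-restore/" + jobID + "/"
}

// buildRestoreArgs assembles the restic argv (without the binary name).
func buildRestoreArgs(snapshotID, target string, paths []string, inPlace bool) []string {
	args := []string{"restore", snapshotID, "--json", "--target", target}
	if !inPlace {
		// Flag name taken from this spec; unverified against restic itself.
		args = append(args, "--no-ownership")
	}
	for _, p := range paths {
		args = append(args, "--include", p)
	}
	return args
}

func main() {
	paths := []string{"/etc/hosts", "/etc/ssh"}
	fmt.Println(buildRestoreArgs("abc123", restoreTarget("job-1", false), paths, false))
	fmt.Println(buildRestoreArgs("abc123", restoreTarget("job-2", true), paths, true))
}
```

Returning a plain slice keeps the "invocation builds the expected argv" test a simple comparison with no subprocess involved.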
Data flow — cancel
operator clicks Cancel on /jobs/{id} (running)
  POST /api/jobs/{id}/cancel
    server: lookup job, ensure status=running, find host
    hub.Send(host_id, "command.cancel", {job_id})
      → agent.runner receives command.cancel
          cancelFunc, ok := r.cancels[job_id]
          if ok { cancelFunc() }
      → restic subprocess context done → exec.Cmd sends SIGTERM
          (set via cmd.Cancel; the exec.CommandContext default is Kill)
      → if still alive after 5s grace → SIGKILL (cmd.WaitDelay)
      → runner sendFinished(job_id, cancelled, exit=130)
      → server receives job.finished status=cancelled, persists, broadcasts
      → browser refresh shows cancelled state
The cancel surface is independently useful for any kind (prune/check/backup) —
not gated to restore. The button already in job_detail.html becomes real.
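A minimal sketch of the agent-side cancel path, assuming Go 1.20 or later, where exec.Cmd exposes the Cancel and WaitDelay fields. By default exec.CommandContext kills the process outright when the context ends, so the SIGTERM-then-SIGKILL behaviour has to be configured explicitly. cancelRegistry, runWithGrace and demoCancel are hypothetical names, and `sleep 30` stands in for the restic subprocess.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"sync"
	"syscall"
	"time"
)

// cancelRegistry is a hypothetical stand-in for the runner's r.cancels map.
type cancelRegistry struct {
	mu      sync.Mutex
	cancels map[string]context.CancelFunc
}

func newCancelRegistry() *cancelRegistry {
	return &cancelRegistry{cancels: make(map[string]context.CancelFunc)}
}

func (r *cancelRegistry) register(jobID string, cancel context.CancelFunc) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.cancels[jobID] = cancel
}

// cancel fires the cancel func for jobID and reports whether one was registered.
func (r *cancelRegistry) cancel(jobID string) bool {
	r.mu.Lock()
	cancel, ok := r.cancels[jobID]
	r.mu.Unlock()
	if ok {
		cancel()
	}
	return ok
}

func (r *cancelRegistry) forget(jobID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.cancels, jobID)
}

// runWithGrace runs the command; when ctx is cancelled it sends SIGTERM, and
// os/exec kills the process if it is still alive once grace has elapsed.
func runWithGrace(ctx context.Context, grace time.Duration, name string, args ...string) error {
	cmd := exec.CommandContext(ctx, name, args...)
	cmd.Cancel = func() error { return cmd.Process.Signal(syscall.SIGTERM) }
	cmd.WaitDelay = grace
	return cmd.Run()
}

// demoCancel cancels a long-running `sleep` after 200ms via the registry.
func demoCancel() (seconds float64, cancelled bool, runErr error) {
	reg := newCancelRegistry()
	ctx, cancel := context.WithCancel(context.Background())
	reg.register("job-1", cancel)
	defer reg.forget("job-1")

	go func() {
		time.Sleep(200 * time.Millisecond)
		reg.cancel("job-1") // what the command.cancel handler would do
	}()

	start := time.Now()
	runErr = runWithGrace(ctx, 5*time.Second, "sleep", "30")
	return time.Since(start).Seconds(), ctx.Err() == context.Canceled, runErr
}

func main() {
	secs, cancelled, err := demoCancel()
	fmt.Printf("stopped after %.1fs, cancelled=%v, err=%v\n", secs, cancelled, err)
}
```

The runner decides between failed and cancelled by checking ctx.Err(), as in the flow above, because the error returned by Run for a signalled process is an ordinary exit error.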
Tree-list RPC details
New WS message types (added to internal/api/messages.go):
type TreeListRequestPayload struct {
	SnapshotID string `json:"snapshot_id"`
	Path       string `json:"path"`
}

type TreeListEntry struct {
	Name string `json:"name"`
	Type string `json:"type"` // "dir" | "file" | "symlink"
	Size int64  `json:"size,omitempty"`
}

type TreeListResultPayload struct {
	SnapshotID string          `json:"snapshot_id"`
	Path       string          `json:"path"`
	Entries    []TreeListEntry `json:"entries,omitempty"`
	Error      string          `json:"error,omitempty"`
}
Server-side mediator (ws.SendRPC) takes a request envelope, registers the
correlation ID in a pending map, sends, blocks on a per-call channel until
the matching reply arrives (or 30s timeout). The pattern is small enough
to inline in internal/server/ws/rpc.go as a generic helper — future
synchronous RPCs reuse it.
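The pending-map pattern described above might look like the sketch below. It is an illustration under assumptions: Envelope, rpcMediator, Call and Deliver are hypothetical names, the send callback stands in for hub.Send, and the payload is reduced to a string to keep the example self-contained.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var ErrRPCTimeout = errors.New("rpc: timed out waiting for reply")

// Envelope is a simplified stand-in for the WS message envelope.
type Envelope struct {
	CorrelationID string
	Type          string
	Payload       string
}

type rpcMediator struct {
	mu      sync.Mutex
	pending map[string]chan Envelope // correlation ID -> per-call reply channel
	send    func(Envelope) error     // stands in for hub.Send
	timeout time.Duration
}

func newRPCMediator(send func(Envelope) error) *rpcMediator {
	return &rpcMediator{
		pending: make(map[string]chan Envelope),
		send:    send,
		timeout: 30 * time.Second,
	}
}

// Call registers the correlation ID, sends the request, and blocks until the
// matching reply is delivered or the timeout elapses.
func (m *rpcMediator) Call(req Envelope) (Envelope, error) {
	ch := make(chan Envelope, 1)
	m.mu.Lock()
	m.pending[req.CorrelationID] = ch
	m.mu.Unlock()
	defer func() {
		m.mu.Lock()
		delete(m.pending, req.CorrelationID)
		m.mu.Unlock()
	}()

	if err := m.send(req); err != nil {
		return Envelope{}, err
	}
	timer := time.NewTimer(m.timeout)
	defer timer.Stop()
	select {
	case reply := <-ch:
		return reply, nil
	case <-timer.C:
		return Envelope{}, ErrRPCTimeout
	}
}

// Deliver is called by the WS read loop for each reply; it reports whether a
// caller was waiting. Late or duplicate replies are dropped, never blocked on.
func (m *rpcMediator) Deliver(reply Envelope) bool {
	m.mu.Lock()
	ch, ok := m.pending[reply.CorrelationID]
	m.mu.Unlock()
	if !ok {
		return false
	}
	select {
	case ch <- reply:
		return true
	default:
		return false
	}
}

func main() {
	var m *rpcMediator
	m = newRPCMediator(func(req Envelope) error {
		// Simulated agent: answer asynchronously with the same correlation ID.
		go m.Deliver(Envelope{CorrelationID: req.CorrelationID, Type: req.Type + ".result"})
		return nil
	})
	reply, err := m.Call(Envelope{CorrelationID: "c-1", Type: "tree.list"})
	fmt.Println(reply.Type, err)
}
```

The buffered per-call channel lets the read loop hand off a reply without waiting on the caller, and the deferred delete keeps the pending map from growing after timeouts.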
In-memory cache: map[sessionID]map[cacheKey]TreeListResultPayload with
cacheKey = snapshot_id + "\x00" + path. Session ID minted per wizard
load (HTTP-only cookie scoped to /hosts/{id}/restore/tree, lifetime 30
min). On wizard close (browser navigation away) the entry expires
naturally. No persistence, no migration.
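One way to shape the per-session cache is sketched below. The type and method names are hypothetical, the payload types are trimmed copies of the ones defined earlier, and the injectable clock is an assumption added so that expiry can be tested without sleeping. The sketch expires a whole session 30 minutes after its first entry, matching the cookie lifetime.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const treeCacheTTL = 30 * time.Minute

type TreeListEntry struct {
	Name, Type string
	Size       int64
}

type TreeListResultPayload struct {
	SnapshotID string
	Path       string
	Entries    []TreeListEntry
}

type sessionEntries struct {
	expires time.Time
	results map[string]TreeListResultPayload
}

// treeCache is an in-memory map[sessionID]map[cacheKey]result with a TTL.
type treeCache struct {
	mu       sync.Mutex
	ttl      time.Duration
	now      func() time.Time
	sessions map[string]*sessionEntries
}

func newTreeCache(ttl time.Duration, now func() time.Time) *treeCache {
	return &treeCache{ttl: ttl, now: now, sessions: make(map[string]*sessionEntries)}
}

// cacheKey joins snapshot ID and path with a NUL byte, which cannot appear in either.
func cacheKey(snapshotID, path string) string { return snapshotID + "\x00" + path }

func (c *treeCache) Put(sessionID string, res TreeListResultPayload) {
	c.mu.Lock()
	defer c.mu.Unlock()
	s, ok := c.sessions[sessionID]
	if !ok || !c.now().Before(s.expires) {
		s = &sessionEntries{
			expires: c.now().Add(c.ttl),
			results: make(map[string]TreeListResultPayload),
		}
		c.sessions[sessionID] = s
	}
	s.results[cacheKey(res.SnapshotID, res.Path)] = res
}

func (c *treeCache) Get(sessionID, snapshotID, path string) (TreeListResultPayload, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	s, ok := c.sessions[sessionID]
	if !ok {
		return TreeListResultPayload{}, false
	}
	if !c.now().Before(s.expires) {
		delete(c.sessions, sessionID) // expired: drop the whole session lazily
		return TreeListResultPayload{}, false
	}
	res, ok := s.results[cacheKey(snapshotID, path)]
	return res, ok
}

// fakeClock is a test helper that lets callers move time forward manually.
type fakeClock struct{ t time.Time }

func (f *fakeClock) Now() time.Time          { return f.t }
func (f *fakeClock) Advance(d time.Duration) { f.t = f.t.Add(d) }

func main() {
	c := newTreeCache(treeCacheTTL, time.Now)
	c.Put("sess-1", TreeListResultPayload{SnapshotID: "abc123", Path: "/etc"})
	_, hit := c.Get("sess-1", "abc123", "/etc")
	fmt.Println("hit:", hit)
}
```

Expired sessions are only removed when they are next looked up; a long-running server would likely also want a periodic sweep, which is omitted here.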
Agent handler runs restic ls --json <sid> <path>. With a directory argument
and no --recursive flag, restic ls limits the listing to that directory; the
handler still parses the output line by line and emits only direct children
of path, skipping non-node lines such as the snapshot header and any entry
for path itself. 60s context timeout, mirroring the existing restic snapshots
invocation.
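The line-by-line filtering could be sketched as below. The lsNode field names follow the JSON lines that restic ls --json prints for node entries (struct_type, name, type, path, size); message_type is accepted as well since newer restic releases emit it. directChildren is a hypothetical helper name, and the input in main is synthetic.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"path"
)

type TreeListEntry struct {
	Name string `json:"name"`
	Type string `json:"type"`
	Size int64  `json:"size,omitempty"`
}

// lsNode holds the subset of a restic ls JSON line that the handler needs.
type lsNode struct {
	StructType  string `json:"struct_type"`
	MessageType string `json:"message_type"`
	Name        string `json:"name"`
	Type        string `json:"type"`
	Path        string `json:"path"`
	Size        int64  `json:"size"`
}

// directChildren parses JSON-lines output and keeps only nodes whose parent
// directory is dir, dropping the snapshot header, dir itself, and descendants.
func directChildren(output []byte, dir string) ([]TreeListEntry, error) {
	dir = path.Clean(dir)
	var entries []TreeListEntry
	sc := bufio.NewScanner(bytes.NewReader(output))
	for sc.Scan() {
		line := bytes.TrimSpace(sc.Bytes())
		if len(line) == 0 {
			continue
		}
		var n lsNode
		if err := json.Unmarshal(line, &n); err != nil {
			return nil, fmt.Errorf("parse ls line: %w", err)
		}
		if n.StructType != "node" && n.MessageType != "node" {
			continue // snapshot header or another non-node message
		}
		p := path.Clean(n.Path)
		if p == dir || path.Dir(p) != dir {
			continue
		}
		entries = append(entries, TreeListEntry{Name: n.Name, Type: n.Type, Size: n.Size})
	}
	return entries, sc.Err()
}

func main() {
	out := []byte(`{"struct_type":"snapshot","id":"abc123"}
{"struct_type":"node","name":"etc","type":"dir","path":"/etc"}
{"struct_type":"node","name":"hosts","type":"file","path":"/etc/hosts","size":220}
{"struct_type":"node","name":"ssh","type":"dir","path":"/etc/ssh"}`)
	entries, err := directChildren(out, "/etc")
	fmt.Println(entries, err)
}
```

bufio.Scanner's default line limit is 64 KiB, which is ample for single node records; a handler facing unusual metadata could raise it with Scanner.Buffer.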
Restore payload
api.CommandRunPayload gains a nested optional restore field:
type RestorePayload struct {
	SnapshotID    string   `json:"snapshot_id"`
	Paths         []string `json:"paths"`          // absolute paths inside the snapshot
	InPlace       bool     `json:"in_place"`
	TargetDir     string   `json:"target_dir"`     // empty when in_place=true
	PreserveOwner bool     `json:"preserve_owner"` // mirrors policy: in_place=>true, else=>false
}
The payload is set by the server when dispatching JobRestore and ignored
on every other kind. Wire-shape test pinned in wire_test.go.
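To make the ownership policy and the wire shape concrete, the sketch below derives TargetDir and PreserveOwner from the in-place flag and marshals the result. newRestorePayload and wireShape are hypothetical helpers; the struct is copied from the definition above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type RestorePayload struct {
	SnapshotID    string   `json:"snapshot_id"`
	Paths         []string `json:"paths"`
	InPlace       bool     `json:"in_place"`
	TargetDir     string   `json:"target_dir"`
	PreserveOwner bool     `json:"preserve_owner"`
}

// newRestorePayload applies the policy: in-place keeps ownership and has no
// target directory; otherwise restore into a per-job directory without ownership.
func newRestorePayload(jobID, snapshotID string, paths []string, inPlace bool) RestorePayload {
	p := RestorePayload{
		SnapshotID:    snapshotID,
		Paths:         paths,
		InPlace:       inPlace,
		PreserveOwner: inPlace,
	}
	if !inPlace {
		p.TargetDir = "/var/restic-restore/" + jobID + "/"
	}
	return p
}

// wireShape returns the JSON encoding that a wire-shape test would pin.
func wireShape(p RestorePayload) (string, error) {
	b, err := json.Marshal(p)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	s, err := wireShape(newRestorePayload("job-1", "abc123", []string{"/etc/hosts"}, false))
	fmt.Println(s, err)
}
```

Deriving both fields in one constructor keeps the server from ever dispatching a payload where in_place and preserve_owner disagree.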
Diff payload
api.CommandRunPayload gains:
type DiffPayload struct {
	SnapshotA string `json:"snapshot_a"`
	SnapshotB string `json:"snapshot_b"`
}
Set on JobDiff. Output is plain restic diff --json <a> <b> forwarded as
log.stream lines. Job page renders unchanged — operator reads the diff
output directly.
Recent-restores panel
A small panel rendered on the host detail page below the existing init-status line:
last restore: succeeded 2h ago · job f73ab4c1… · 3 files to /var/restic-restore/...
Backed by a store.LatestJobByKind(host_id, JobRestore) call, reusing the
existing store.LatestJobByKind query already used for init/forget/prune/check
in P2R-06. One template addition in host_chrome.html next to the
InitStatus block.
Routes added
| Method | Path | Purpose |
|---|---|---|
| GET | /hosts/{id}/restore | Wizard shell (step 1 = snapshot picker) |
| GET | /hosts/{id}/snapshots/{sid}/restore | Wizard shell with snapshot pre-selected (skips step 1) |
| GET | /hosts/{id}/restore/tree | HTMX partial: tree node listing for ?snapshot=&path= |
| POST | /hosts/{id}/restore | Validate + dispatch restore job, redirect to live job page |
| POST | /api/hosts/{id}/snapshots/diff | Dispatch a diff job for {snapshot_a, snapshot_b} |
| POST | /api/jobs/{id}/cancel | Send command.cancel to host, transition job → cancelled |
Migrations
None. Restore + diff piggyback on the existing jobs table (their kind is
new but the schema already accepts arbitrary kind strings — there's no
CHECK constraint on kind). The cancel feature uses the existing
JobCancelled terminal status. The tree-list cache lives in process memory.
Tests (target coverage)
- internal/restic/restore_test.go: RunRestore invocation builds the expected argv (paths, --target, --no-ownership flag presence, in-place variant); JSON status parsing → BackupStatus-shaped progress envelopes.
- internal/restic/diff_test.go: RunDiff argv shape and JSON forwarding.
- internal/agent/runner/restore_test.go: happy path, cancel mid-run produces cancelled finished, in-place vs new-directory dispatch, single-flight rejects when another job is running.
- internal/agent/runner/tree_test.go: tree.list handler returns direct children for a synthetic restic ls output, surfaces error on missing snapshot.
- internal/server/ws/rpc_test.go: SendRPC correlation matching, timeout, concurrent calls.
- internal/server/http/restore_test.go: wizard renders with snapshots, POST validates ≥1 path + in-place host-name match, audit row written, job dispatched with correct payload, in-place without typed-confirm re-renders form with input intact and an error.
- internal/server/http/diff_test.go: POST dispatches JobDiff, snapshot IDs validated against the host's snapshot list.
- internal/server/http/cancel_test.go: POST cancel happy path (running → cancelled), 4xx for non-running jobs, 4xx when host offline.
- internal/server/http/restore_e2e_test.go: happy path: GET wizard, expand /etc (HTMX call returns expected fragment), submit, follow HX-Redirect to job page, see status.
- web/templates/pages/host_restore_test.go (template-render test): wizard renders all four sections; in-place card disabled until typed confirm.
Playwright iteration / sweep
A Playwright sweep at the end (mirroring P2R-02 Slice 6) runs against the local smoke server with a real agent enrolled. Steps:
- Login → navigate to alfa-01 host → click Restore.
- Wizard step 1: pick the most recent snapshot.
- Wizard step 2: expand a directory two levels, tick three files, verify tally updates.
- Wizard step 3: leave default new-directory.
- Wizard step 4: dispatch.
- Land on live job page, see progress widget animating, see log lines.
- Click Cancel mid-flight, verify status transitions to cancelled and the agent's subprocess actually died (log line signal: killed or exit 130).
- Repeat with in-place mode: type host name, dispatch, verify red primary button, verify files actually overwritten on host.
- Snapshot diff: navigate to snapshots, pick two, dispatch diff, see diff output streamed.
- Screenshots into _diag/p3-restore-sweep/.
End-to-end clean, zero console errors, before handing back.
What does NOT change
- host_chrome.html only grows the recent-restores line; sub-tab list unchanged (Restore is a top-level button on the host page, not a sub-tab).
- enrollment.go, schedule reconciliation, source-group CRUD, repo maintenance ticker, hook execution: none of these are touched.
- The CLAUDE.md restage block applies as-is when the agent binary changes (it does: runner gains restore/diff/cancel/tree handlers). The unit file does not change.
Open questions / explicit non-goals
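- Restore preview / dry-run. Out of scope. The brainstorm assumed restic has no dry-run for restore; recent restic releases document a restore --dry-run flag, so check the deployed restic version before revisiting this.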
- Resumable restore. Restic restore is idempotent per-file but not resumable mid-stream from where it left off. If a restore is cancelled, the operator re-runs (files already written are overwritten). No state to track.
- Restore to a glob/pattern (e.g. *.conf). Out of scope; the tree picker requires explicit ticks. Power users can edit the URL or use the CLI.
- Bandwidth caps for restore. Honoured automatically: restic's --limit-download is part of restic.Env already (P2R-13) and applies to restore unchanged.
- Pre/post hooks for restore. Hooks today gate only kind=backup (P2R-11). Out of scope.