docs: P3 restore design spec + scope-decompose Phase 3
Splits Phase 3 into three independently-shippable sub-phases (Restore, Alerts, Audit UI) so they can land in separate PRs with their own brainstorm → spec → plan cycles. The Restore sub-phase is up first. The brainstorm ran on 2026-05-04 and locked the following decisions:

- Single-host restore only this phase. P3-04 (cross-host restore) is moved to a new 'Future / unscheduled' section. Disaster recovery is already covered by re-enrolling a replacement host with the same repo creds; the remaining 'pull a file from host A onto host C' use case is genuinely different (file sharing / migration, not DR) and has no confirmed need.
- Default target is /var/restic-restore/<job-id>/ with --no-ownership; in-place restore preserves uid/gid/mode and is gated by typed-confirmation of the host name (mirroring the repo re-init danger zone).
- Tree browser is the path picker, lazy-loaded via a synchronous WS RPC (tree.list) over the existing correlation-ID infrastructure with a per-wizard-session in-memory cache (~30 min TTL).
- Single-page wizard with progressively-enabled sections; entry is a top-level Restore button on host detail (or per-snapshot Restore action for direct deep-link).
- Snapshot diff (P3-09) is a JobDiff JobKind, dispatched like every other agent operation; output streams to the standard live job log page.
- Restore-specific live job page variant with files-restored / bytes-restored / current-file widget.
- Single-flight per host across all kinds, plus a real cancel-job feature (command.cancel WS envelope, agent kills the restic subprocess via context cancel + SIGTERM/SIGKILL grace) so the operator can pre-empt a long-running backup if they need to restore urgently. Wires the existing job_detail Cancel button (which was a UI stub).
- Audit row host.restore on every dispatch + a recent-restores panel on host detail. Role gate deferred to P4-03 RBAC.

Wireframe at _diag/p3-restore-wizard/wireframe.html (gitignored — transient design artefact); screenshot reviewed and approved 2026-05-04.
@@ -0,0 +1,342 @@
# P3 — Restore (design)

> Phase 3 sub-spec covering single-host restore (P3-01, P3-02, P3-03, P3-09).
> P3-04 (cross-host restore) is deferred to a new "Future / unscheduled"
> section in `tasks.md` — disaster recovery is already covered by re-enrolling
> a replacement host with the same repo credentials.
>
> Wireframe: `_diag/p3-restore-wizard/wireframe.html`. Screenshot:
> `_diag/p3-restore-wizard/01-full-wizard.png`.

## Scope locked

Brainstorm decisions (in order asked):

1. **In-place vs new-directory.** Default is a new directory under
   `/var/restic-restore/<job-id>/`. A "Restore in place (overwrite original
   paths)" toggle is gated by typed-confirmation of the host name, mirroring
   the repo re-init pattern.
2. **Path-selection granularity.** Tree browser as the path selector,
   lazy-loaded via `restic ls --json <snapshot> <path>` per directory expansion.
3. **Cross-host restore (P3-04).** Out of scope this phase. Move to
   "Future / unscheduled" in `tasks.md`. The disaster-recovery case is covered
   by the standard enrolment flow: stand up a replacement host, paste the
   original repo creds at enrolment, snapshots reappear, restore is
   same-host.
4. **Snapshot diff (P3-09).** Diff-as-a-job. New `JobDiff` JobKind dispatched
   like every other agent operation. Output streams as `log.stream` and
   renders on the live job log page.
5. **Wizard entry points.** Top-level "Restore" button on host detail
   (`/hosts/{id}/restore`, opens wizard at step 1) plus a per-snapshot
   Restore action on snapshot rows (`/hosts/{id}/snapshots/{sid}/restore`,
   skips step 1).
6. **Wizard interaction model.** Single-page, sections progressively enable;
   tree-browser nodes lazy-load via HTMX partials. No `restore_drafts` table.
7. **Tree-browser data path.** Synchronous WS RPC (`tree.list` ↔
   `tree.list.result`, correlation-ID) plus a per-wizard-session in-memory
   cache keyed by `{snapshot_id, path}` with ~30-min TTL.
8. **Restore progress UI.** Restore-specific job-page variant: files-restored
   / bytes-restored / throughput / ETA / current-file display, driven by
   restic restore's JSON status events surfaced through `job.progress`.
9. **Permissions/ownership.** Policy, not toggle. In-place restore preserves
   original ownership; new-directory restore drops ownership
   (`--no-ownership`).
10. **Concurrency.** Single-flight per host (one job at a time across all
    kinds). Plus a real cancel-job feature: `command.cancel` envelope, agent
    kills the `restic` subprocess via context cancel (SIGTERM, SIGKILL after
    grace), server transitions the job to `cancelled`. The "Cancel" button
    already in the `job_detail` template becomes real for any running job
    kind.
11. **Audit + safety.** Audit row on every restore dispatch (`host.restore`
    with snapshot ID, paths, target, in-place flag). Recent-restores panel
    on the host page surfacing the latest restore job alongside last-backup
    and last-init signals. Role gate deferred to P4-03.

## Architecture

Restore composes from existing primitives plus three new pieces:

- **New JobKind values**: `JobRestore`, `JobDiff`. Dispatcher cases mirror
  the prune/check pattern. Agent-side handlers wrap `restic.RunRestore` and
  `restic.RunDiff` (new methods on the `restic` package).
- **New WS RPC**: `tree.list` request (`{snapshot_id, path}`) ↔
  `tree.list.result` reply (`{entries: [{name, type, size}], ...}` or
  `{error}`). Reuses existing correlation-ID infrastructure from P1-09. No
  `jobs` row.
- **New cancel surface**: `command.cancel` request (`{job_id}`), agent
  cancels the running subprocess context, returns `command.ack` + `job.finished`
  with status `cancelled`. Server endpoint `POST /api/jobs/{id}/cancel`
  bridges UI button → WS envelope.

Everything else (job lifecycle, log streaming, progress envelope, snapshot
listing, audit log writer, host_chrome partial, danger-zone typed-confirmation)
already exists and is reused verbatim.

### Component boundaries

| Component | Purpose | Depends on |
| --- | --- | --- |
| `internal/restic.RunRestore` | Run `restic restore` with paths + target + ownership | `restic.Env` |
| `internal/restic.RunDiff` | Run `restic diff --json a b` | `restic.Env` |
| `internal/agent/runner` cases | Dispatch `JobRestore` / `JobDiff` jobs | `restic.Run*`, hooks (skipped: backup-only) |
| `internal/agent/runner` cancel hook | Wire WS `command.cancel` → ctx.CancelFunc per job | runner job map |
| `internal/agent/runner` tree-list | Sync RPC handler: `restic ls --json` for one path | `restic.Env` |
| `internal/server/ws/cancel.go` | Validate + send `command.cancel` envelope | hub.Send, store.UpdateJobStatus |
| `internal/server/ws/tree.go` | RPC mediator: `tree.list` request → reply, with cache | hub.SendRPC, in-memory cache |
| `internal/server/http/restore.go` | Wizard routes + dispatch endpoint | store, ws, audit |
| `internal/server/http/diff.go` | Snapshot-diff dispatch endpoint | store, ws |
| `internal/server/http/cancel.go` | `POST /api/jobs/{id}/cancel` | ws |
| `web/templates/pages/host_restore.html` | Wizard page | host_chrome partial |
| `web/templates/partials/tree_node.html` | Lazy-loaded tree node fragment for HTMX swap | — |
| `web/templates/pages/job_detail.html` | Restore-kind progress widget (variant) | existing job_detail |

### Data flow — wizard happy path

```
operator
  ├─ GET /hosts/{id}/restore
  │     server renders wizard shell, snapshot table from store.ListSnapshotsByHost
  │
  ├─ click snapshot row (or arrives via /hosts/{id}/snapshots/{sid}/restore)
  │     wizard advances to step 2, snapshot summary card rendered
  │
  ├─ expand a tree node (chevron click)
  │     HTMX GET /hosts/{id}/restore/tree?snapshot={sid}&path=/etc
  │       server checks per-session cache (keyed by sid+path)
  │       hit  → render tree_node fragment from cache
  │       miss → hub.SendRPC(host_id, "tree.list", {sid, path}) → wait reply
  │              cache result, render tree_node fragment
  │
  ├─ tick file/dir checkboxes (form state, no round-trip)
  │
  ├─ pick target radio (and optionally type host name to unlock in-place)
  │
  └─ POST /hosts/{id}/restore (form submit)
        server validates: ≥1 path, target mode, in-place ⇒ host name match
        write audit row host.restore
        store.CreateJob{kind=restore, payload={snapshot_id, paths, target, in_place}}
        hub.Send(host_id, "command.run", {job_id, kind=restore, payload})
        HX-Redirect: /jobs/{job_id}
```

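The validation step at the end of the flow above could be sketched roughly as follows. This is a minimal illustration, not the project's handler: `RestoreForm`, `ValidateRestoreForm`, and the error values are hypothetical names, and the absolute-path check is an assumption drawn from the payload comment ("absolute paths inside the snapshot").

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// RestoreForm is a hypothetical view of the submitted wizard form.
type RestoreForm struct {
	Paths         []string // ticked tree entries
	InPlace       bool     // "Restore in place" toggle
	ConfirmedHost string   // what the operator typed into the confirmation box
}

var (
	ErrNoPaths      = errors.New("select at least one path")
	ErrRelativePath = errors.New("paths must be absolute")
	ErrHostMismatch = errors.New("typed host name does not match")
)

// ValidateRestoreForm applies the checks named in the flow: at least one
// path, and for in-place restores the typed host name must match exactly.
func ValidateRestoreForm(f RestoreForm, hostName string) error {
	if len(f.Paths) == 0 {
		return ErrNoPaths
	}
	for _, p := range f.Paths {
		if !strings.HasPrefix(p, "/") { // assumption: snapshot paths are absolute
			return ErrRelativePath
		}
	}
	if f.InPlace && f.ConfirmedHost != hostName {
		return ErrHostMismatch
	}
	return nil
}

func main() {
	f := RestoreForm{Paths: []string{"/etc/hosts"}, InPlace: true, ConfirmedHost: "alfa-1"}
	fmt.Println(ValidateRestoreForm(f, "alfa-01"))
}
```

Returning sentinel errors keeps the handler free to re-render the form with the operator's input intact and a specific message, which is what the test plan expects for a failed typed-confirm.
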
### Data flow — agent restore execution

```
agent.runner receives command.run kind=restore
  ├─ check single-flight: if r.activeJobID != "" → reply busy
  │     (server queues to pending_runs only for kind=backup; restore returns busy)
  ├─ allocate ctx, ctxCancel — store cancelFunc against job_id in r.cancels
  ├─ sendStarted(job_id, JobRestore, now)
  ├─ build target path: if in_place → "/" else "/var/restic-restore/<job_id>/"
  ├─ build flags: paths from payload, --no-ownership when !in_place
  ├─ restic.RunRestore(ctx, env, snapshot_id, paths, target, in_place):
  │     restic restore <sid> --target <path> [--no-ownership] --include <p1> --include <p2> ...
  │     (restic restore takes a single snapshot argument; paths are selected with --include)
  │     parse stdout JSON: forward "status" → job.progress (1Hz throttle), "summary" → final
  ├─ on success: sendFinished(job_id, succeeded, exit=0)
  ├─ on ctx.Err() == context.Canceled: sendFinished(job_id, cancelled, exit=130)
  └─ delete cancel func from r.cancels
```

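The "parse stdout JSON" step above might look something like the sketch below. It assumes restic is run with `--json` and reads only the `message_type`, file, and byte counters from each line; `forwardProgress`, `restoreEvent`, and the injected clock are hypothetical names for illustration, and a real runner would map the event onto its own `job.progress` envelope.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
	"time"
)

// restoreEvent holds the fields of restic's restore JSON lines used here.
type restoreEvent struct {
	MessageType   string  `json:"message_type"` // "status" or "summary"
	PercentDone   float64 `json:"percent_done"`
	TotalFiles    uint64  `json:"total_files"`
	FilesRestored uint64  `json:"files_restored"`
	TotalBytes    uint64  `json:"total_bytes"`
	BytesRestored uint64  `json:"bytes_restored"`
}

// forwardProgress reads JSON lines from r, emits "status" events at most
// once per minGap, and always emits the final "summary".
func forwardProgress(r io.Reader, minGap time.Duration, now func() time.Time, emit func(restoreEvent)) error {
	sc := bufio.NewScanner(r)
	var last time.Time
	for sc.Scan() {
		var ev restoreEvent
		if err := json.Unmarshal(sc.Bytes(), &ev); err != nil {
			continue // not JSON; a real runner would forward it as a log line
		}
		switch ev.MessageType {
		case "status":
			if t := now(); last.IsZero() || t.Sub(last) >= minGap {
				last = t
				emit(ev)
			}
		case "summary":
			emit(ev)
		}
	}
	return sc.Err()
}

// demo feeds five lines through a fake clock that advances 400ms per call.
func demo() []string {
	input := strings.Join([]string{
		`{"message_type":"status","files_restored":1,"total_files":4}`,
		`{"message_type":"status","files_restored":2,"total_files":4}`,
		`{"message_type":"status","files_restored":3,"total_files":4}`,
		`{"message_type":"status","files_restored":4,"total_files":4}`,
		`{"message_type":"summary","files_restored":4,"total_files":4}`,
	}, "\n")
	clock := time.Unix(0, 0)
	now := func() time.Time { clock = clock.Add(400 * time.Millisecond); return clock }
	var seen []string
	_ = forwardProgress(strings.NewReader(input), time.Second, now, func(ev restoreEvent) {
		seen = append(seen, fmt.Sprintf("%s %d/%d", ev.MessageType, ev.FilesRestored, ev.TotalFiles))
	})
	return seen
}

func main() {
	for _, line := range demo() {
		fmt.Println(line)
	}
}
```

Injecting the clock keeps the 1 Hz throttle testable without sleeping; the summary bypasses the throttle so the final totals are never dropped.
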
### Data flow — cancel

```
operator clicks Cancel on /jobs/{id} (running)
  POST /api/jobs/{id}/cancel
    server: lookup job, ensure status=running, find host
    hub.Send(host_id, "command.cancel", {job_id})
  → agent.runner receives command.cancel
      cancelFunc, ok := r.cancels[job_id]
      ok && cancelFunc()
  → restic subprocess context done → exec.Cmd kills via SIGTERM
  → if still alive after 5s grace → SIGKILL
  → runner sendFinished(job_id, cancelled, exit=130)
  → server receives job.finished status=cancelled, persists, broadcasts
  → browser refresh shows cancelled state
```

The cancel surface is independently useful for any kind (prune/check/backup) —
not gated to restore. The button already in `job_detail.html` becomes real.

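One way to get the SIGTERM-then-SIGKILL behaviour on the agent is sketched below. By default `exec.CommandContext` kills the process outright when the context is cancelled, so the sketch sets `Cmd.Cancel` to send SIGTERM and `Cmd.WaitDelay` for the grace period (both available since Go 1.20). The `cancelRegistry` type and `runWithGrace` helper are hypothetical stand-ins for the runner's `r.cancels` map, and `sleep 30` stands in for the `restic` subprocess.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"sync"
	"syscall"
	"time"
)

// cancelRegistry maps running job IDs to their cancel functions.
type cancelRegistry struct {
	mu      sync.Mutex
	cancels map[string]context.CancelFunc
}

func (r *cancelRegistry) register(jobID string, cancel context.CancelFunc) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.cancels == nil {
		r.cancels = make(map[string]context.CancelFunc)
	}
	r.cancels[jobID] = cancel
}

// cancel reports whether a running job with that ID was found.
func (r *cancelRegistry) cancel(jobID string) bool {
	r.mu.Lock()
	cancel, ok := r.cancels[jobID]
	delete(r.cancels, jobID)
	r.mu.Unlock()
	if ok {
		cancel()
	}
	return ok
}

// runWithGrace runs the command; on ctx cancellation it sends SIGTERM and
// lets Wait force-kill the process if it is still alive after grace.
func runWithGrace(ctx context.Context, grace time.Duration, name string, args ...string) error {
	cmd := exec.CommandContext(ctx, name, args...)
	cmd.Cancel = func() error { return cmd.Process.Signal(syscall.SIGTERM) }
	cmd.WaitDelay = grace
	err := cmd.Run()
	if ctx.Err() != nil {
		return ctx.Err() // cancelled: report that rather than "signal: terminated"
	}
	return err
}

// demoCancel cancels a long sleep shortly after start and reports whether
// the run ended as context.Canceled.
func demoCancel() bool {
	reg := &cancelRegistry{}
	ctx, cancel := context.WithCancel(context.Background())
	reg.register("job-1", cancel)
	time.AfterFunc(100*time.Millisecond, func() { reg.cancel("job-1") })
	err := runWithGrace(ctx, 5*time.Second, "sleep", "30")
	return errors.Is(err, context.Canceled)
}

func main() {
	fmt.Println("cancelled:", demoCancel())
}
```

Checking `ctx.Err()` after `Run` lets the runner distinguish an operator cancel from an ordinary restic failure before choosing the `cancelled` status.
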
### Tree-list RPC details

New WS message types (added to `internal/api/messages.go`):

```
type TreeListRequestPayload struct {
	SnapshotID string `json:"snapshot_id"`
	Path       string `json:"path"`
}

type TreeListEntry struct {
	Name string `json:"name"`
	Type string `json:"type"` // "dir" | "file" | "symlink"
	Size int64  `json:"size,omitempty"`
}

type TreeListResultPayload struct {
	SnapshotID string          `json:"snapshot_id"`
	Path       string          `json:"path"`
	Entries    []TreeListEntry `json:"entries,omitempty"`
	Error      string          `json:"error,omitempty"`
}
```

Server-side mediator (`ws.SendRPC`) takes a request envelope, registers the
correlation ID in a pending map, sends, blocks on a per-call channel until
the matching reply arrives (or 30s timeout). The pattern is small enough
to inline in `internal/server/ws/rpc.go` as a generic helper — future
synchronous RPCs reuse it.

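The mediator described above can be sketched as a small pending-map helper. This is an illustration under assumptions: the `Envelope` shape, the `Mediator` type, and the `Deliver` hook (called by the WS read loop when a reply arrives) are hypothetical, and the real hub's send function will differ.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// Envelope is a simplified stand-in for the WS message envelope.
type Envelope struct {
	CorrelationID string
	Type          string
	Payload       string
}

var ErrRPCTimeout = errors.New("rpc: timed out waiting for reply")

// Mediator matches replies to waiting callers by correlation ID.
type Mediator struct {
	mu      sync.Mutex
	pending map[string]chan Envelope
	send    func(Envelope) error
}

func NewMediator(send func(Envelope) error) *Mediator {
	return &Mediator{pending: make(map[string]chan Envelope), send: send}
}

// SendRPC registers the correlation ID, sends, and blocks until the
// matching reply is delivered or the timeout elapses.
func (m *Mediator) SendRPC(req Envelope, timeout time.Duration) (Envelope, error) {
	ch := make(chan Envelope, 1)
	m.mu.Lock()
	m.pending[req.CorrelationID] = ch
	m.mu.Unlock()
	defer func() {
		m.mu.Lock()
		delete(m.pending, req.CorrelationID)
		m.mu.Unlock()
	}()
	if err := m.send(req); err != nil {
		return Envelope{}, err
	}
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case reply := <-ch:
		return reply, nil
	case <-timer.C:
		return Envelope{}, ErrRPCTimeout
	}
}

// Deliver hands an incoming reply to its waiting caller, if any.
func (m *Mediator) Deliver(reply Envelope) bool {
	m.mu.Lock()
	ch, ok := m.pending[reply.CorrelationID]
	m.mu.Unlock()
	if !ok {
		return false // late or unknown reply
	}
	select {
	case ch <- reply:
	default: // a reply is already buffered; drop the duplicate
	}
	return true
}

func demoRoundTrip() (string, error) {
	var m *Mediator
	m = NewMediator(func(req Envelope) error {
		// Simulated agent: answer asynchronously with the same correlation ID.
		go m.Deliver(Envelope{CorrelationID: req.CorrelationID, Type: "tree.list.result"})
		return nil
	})
	reply, err := m.SendRPC(Envelope{CorrelationID: "c-1", Type: "tree.list"}, time.Second)
	return reply.Type, err
}

func demoTimeout() error {
	m := NewMediator(func(Envelope) error { return nil }) // agent never answers
	_, err := m.SendRPC(Envelope{CorrelationID: "c-2", Type: "tree.list"}, 20*time.Millisecond)
	return err
}

func main() {
	fmt.Println(demoRoundTrip())
	fmt.Println(demoTimeout())
}
```

The buffered channel means `Deliver` never blocks the read loop, and the deferred delete ensures a reply that arrives after the timeout is simply reported as unknown.
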
In-memory cache: `map[sessionID]map[cacheKey]TreeListResultPayload` with
`cacheKey = snapshot_id + "\x00" + path`. Session ID minted per wizard
load (HTTP-only cookie scoped to `/hosts/{id}/restore/tree`, lifetime 30
min). On wizard close (browser navigation away) the entry expires
naturally. No persistence, no migration.

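A minimal sketch of such a cache follows. It keeps the nested map and the `"\x00"`-joined key from the paragraph above, but stores plain string slices instead of `TreeListResultPayload` to stay self-contained, and expires each entry individually rather than per session; `TreeCache` and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type cacheItem struct {
	entries []string
	expires time.Time
}

// TreeCache is a per-session, in-memory cache with a fixed TTL.
type TreeCache struct {
	mu       sync.Mutex
	ttl      time.Duration
	now      func() time.Time // injected so tests can control time
	sessions map[string]map[string]cacheItem
}

func NewTreeCache(ttl time.Duration, now func() time.Time) *TreeCache {
	return &TreeCache{ttl: ttl, now: now, sessions: make(map[string]map[string]cacheItem)}
}

// cacheKey joins with NUL, which cannot appear in a snapshot ID or a path.
func cacheKey(snapshotID, path string) string { return snapshotID + "\x00" + path }

func (c *TreeCache) Put(sessionID, snapshotID, path string, entries []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	s := c.sessions[sessionID]
	if s == nil {
		s = make(map[string]cacheItem)
		c.sessions[sessionID] = s
	}
	s[cacheKey(snapshotID, path)] = cacheItem{entries: entries, expires: c.now().Add(c.ttl)}
}

// Get returns the cached listing, treating expired entries as misses.
func (c *TreeCache) Get(sessionID, snapshotID, path string) ([]string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	key := cacheKey(snapshotID, path)
	item, ok := c.sessions[sessionID][key]
	if !ok {
		return nil, false
	}
	if !c.now().Before(item.expires) {
		delete(c.sessions[sessionID], key)
		return nil, false
	}
	return item.entries, true
}

// demoExpiry checks a hit at 29 minutes and a miss at 31 minutes.
func demoExpiry() (hitBefore, hitAfter bool) {
	clock := time.Unix(0, 0)
	c := NewTreeCache(30*time.Minute, func() time.Time { return clock })
	c.Put("sess-1", "f73ab4c1", "/etc", []string{"hosts", "ssh"})
	clock = clock.Add(29 * time.Minute)
	_, hitBefore = c.Get("sess-1", "f73ab4c1", "/etc")
	clock = clock.Add(2 * time.Minute)
	_, hitAfter = c.Get("sess-1", "f73ab4c1", "/etc")
	return hitBefore, hitAfter
}

func main() {
	fmt.Println(demoExpiry())
}
```

Expired entries are only removed when read, so a long-lived server would likely also want a periodic sweep of abandoned sessions; that part is left out here.
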
Agent handler runs `restic ls --json <sid> <path>`. Without a path, `restic ls`
walks the whole snapshot; with a directory argument it lists that directory
and descends further only when `--recursive` is passed. The handler still
parses the output line-by-line and emits only direct children of `path`. 60s
context timeout, mirroring the existing `restic snapshots` invocation.

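The line-by-line filtering could be sketched as below. The sketch assumes each `restic ls --json` line carries a `struct_type` of `"snapshot"` or `"node"` plus `name`, `type`, `path`, and `size` fields; `directChildren` and `lsNode` are hypothetical names, and the sample input is synthetic.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"path"
	"strings"
)

// lsNode holds the fields read from one `restic ls --json` output line.
type lsNode struct {
	StructType string `json:"struct_type"` // "snapshot" header or "node"
	Name       string `json:"name"`
	Type       string `json:"type"`
	Path       string `json:"path"`
	Size       int64  `json:"size"`
}

type TreeListEntry struct {
	Name string `json:"name"`
	Type string `json:"type"` // "dir" | "file" | "symlink"
	Size int64  `json:"size,omitempty"`
}

// directChildren keeps only nodes whose parent directory is dir, skipping
// the snapshot header, the directory itself, and deeper descendants.
func directChildren(r io.Reader, dir string) ([]TreeListEntry, error) {
	dir = path.Clean(dir)
	var out []TreeListEntry
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		var n lsNode
		if err := json.Unmarshal(sc.Bytes(), &n); err != nil {
			return nil, fmt.Errorf("parse restic ls line: %w", err)
		}
		if n.StructType != "node" {
			continue
		}
		if path.Clean(n.Path) == dir || path.Dir(n.Path) != dir {
			continue
		}
		out = append(out, TreeListEntry{Name: n.Name, Type: n.Type, Size: n.Size})
	}
	return out, sc.Err()
}

func demoChildren() ([]TreeListEntry, error) {
	sample := strings.Join([]string{
		`{"struct_type":"snapshot","id":"f73ab4c1"}`,
		`{"struct_type":"node","name":"etc","type":"dir","path":"/etc"}`,
		`{"struct_type":"node","name":"hosts","type":"file","path":"/etc/hosts","size":220}`,
		`{"struct_type":"node","name":"ssh","type":"dir","path":"/etc/ssh"}`,
		`{"struct_type":"node","name":"sshd_config","type":"file","path":"/etc/ssh/sshd_config","size":3200}`,
	}, "\n")
	return directChildren(strings.NewReader(sample), "/etc")
}

func main() {
	entries, err := demoChildren()
	fmt.Println(entries, err)
}
```

Filtering on the parent path keeps the handler correct whether or not restic happens to return deeper levels, so the reply size stays bounded by one directory.
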
### Restore payload

`api.CommandRunPayload` gains a nested optional `restore` field:

```
type RestorePayload struct {
	SnapshotID    string   `json:"snapshot_id"`
	Paths         []string `json:"paths"`          // absolute paths inside the snapshot
	InPlace       bool     `json:"in_place"`
	TargetDir     string   `json:"target_dir"`     // empty when in_place=true
	PreserveOwner bool     `json:"preserve_owner"` // mirrors policy: in_place=>true, else=>false
}
```

The payload is set by the server when dispatching `JobRestore` and ignored
on every other kind. Wire-shape test pinned in `wire_test.go`.

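For illustration, turning this payload into a restic argument list might look like the sketch below. It is an assumption-laden example, not `restic.RunRestore` itself: `restoreArgs` is a hypothetical helper, paths are passed with `--include` because `restic restore` takes a single snapshot argument, and `--no-ownership` is copied from this spec and should be checked against the restic version actually deployed.

```go
package main

import "fmt"

type RestorePayload struct {
	SnapshotID    string   `json:"snapshot_id"`
	Paths         []string `json:"paths"`
	InPlace       bool     `json:"in_place"`
	TargetDir     string   `json:"target_dir"`
	PreserveOwner bool     `json:"preserve_owner"`
}

// restoreArgs builds the argv (without the binary name) for one restore.
func restoreArgs(p RestorePayload) []string {
	target := p.TargetDir
	if p.InPlace {
		target = "/" // in-place restores write back to the original paths
	}
	args := []string{"restore", p.SnapshotID, "--json", "--target", target}
	if !p.PreserveOwner {
		args = append(args, "--no-ownership") // flag name taken from the spec, unverified
	}
	for _, path := range p.Paths {
		args = append(args, "--include", path)
	}
	return args
}

func main() {
	p := RestorePayload{
		SnapshotID: "f73ab4c1",
		Paths:      []string{"/etc/hosts", "/etc/ssh"},
		TargetDir:  "/var/restic-restore/job-1/",
	}
	fmt.Println(restoreArgs(p))
}
```

Keeping the argv construction in a pure function is what makes the "expected argv" assertions in `restore_test.go` cheap to write.
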
### Diff payload

`api.CommandRunPayload` gains:

```
type DiffPayload struct {
	SnapshotA string `json:"snapshot_a"`
	SnapshotB string `json:"snapshot_b"`
}
```

Set on `JobDiff`. Output is plain `restic diff --json <a> <b>` forwarded as
`log.stream` lines. Job page renders unchanged — operator reads the diff
output directly.

### Recent-restores panel

A small panel rendered on the host detail page below the existing init-status
line:

```
last restore: succeeded 2h ago · job f73ab4c1… · 3 files to /var/restic-restore/...
```

Backed by `store.LatestJobByKind(host_id, JobRestore)`, the same
`store.LatestJobByKind` query already used for init/forget/prune/check
in P2R-06, called with the new kind. One template addition in
`host_chrome.html` next to the `InitStatus` block.

## Routes added

| Method | Path | Purpose |
| --- | --- | --- |
| GET | `/hosts/{id}/restore` | Wizard shell (step 1 = snapshot picker) |
| GET | `/hosts/{id}/snapshots/{sid}/restore` | Wizard shell with snapshot pre-selected (skips step 1) |
| GET | `/hosts/{id}/restore/tree` | HTMX partial: tree node listing for `?snapshot=&path=` |
| POST | `/hosts/{id}/restore` | Validate + dispatch restore job, redirect to live job page |
| POST | `/api/hosts/{id}/snapshots/diff` | Dispatch a diff job for `{snapshot_a, snapshot_b}` |
| POST | `/api/jobs/{id}/cancel` | Send `command.cancel` to host, transition job → cancelled |

## Migrations

None. Restore + diff piggyback on the existing `jobs` table (their `kind` is
new but the schema already accepts arbitrary kind strings — there's no
CHECK constraint on `kind`). The cancel feature uses the existing
`JobCancelled` terminal status. The tree-list cache lives in process memory.

## Tests (target coverage)

- `internal/restic/restore_test.go` — `RunRestore` invocation builds the
  expected argv (paths, --target, --no-ownership flag presence, in-place
  variant); JSON status parsing → `BackupStatus`-shaped progress envelopes.
- `internal/restic/diff_test.go` — `RunDiff` argv shape and JSON forwarding.
- `internal/agent/runner/restore_test.go` — happy path, cancel mid-run
  produces `cancelled` finished, in-place vs new-directory dispatch,
  single-flight rejects when another job is running.
- `internal/agent/runner/tree_test.go` — `tree.list` handler returns
  direct children for a synthetic restic ls output, surfaces error on
  missing snapshot.
- `internal/server/ws/rpc_test.go` — `SendRPC` correlation matching,
  timeout, concurrent calls.
- `internal/server/http/restore_test.go` — wizard renders with snapshots,
  POST validates ≥1 path + in-place host-name match, audit row written,
  job dispatched with correct payload, in-place without typed-confirm
  re-renders form with input intact and an error.
- `internal/server/http/diff_test.go` — POST dispatches `JobDiff`,
  snapshot IDs validated against the host's snapshot list.
- `internal/server/http/cancel_test.go` — POST cancel happy path
  (running → cancelled), 4xx for non-running jobs, 4xx when host offline.
- `internal/server/http/restore_e2e_test.go` — happy path: GET wizard,
  expand `/etc` (HTMX call returns expected fragment), submit, follow
  HX-Redirect to job page, see status.
- `web/templates/pages/host_restore_test.go` (template-render test) —
  wizard renders all four sections; in-place card disabled until typed
  confirm.

## Playwright iteration / sweep

A Playwright sweep at the end (mirroring P2R-02 Slice 6) runs against the
local smoke server with a real agent enrolled. Steps:

1. Login → navigate to alfa-01 host → click Restore.
2. Wizard step 1: pick the most recent snapshot.
3. Wizard step 2: expand a directory two levels, tick three files,
   verify tally updates.
4. Wizard step 3: leave default new-directory.
5. Wizard step 4: dispatch.
6. Land on live job page, see progress widget animating, see log lines.
7. Click Cancel mid-flight, verify status transitions to cancelled and
   the agent's subprocess actually died (log line `signal: killed` or exit
   130).
8. Repeat with in-place mode: type host name, dispatch, verify red
   primary button, verify files actually overwritten on host.
9. Snapshot diff: navigate to snapshots, pick two, dispatch diff, see
   diff output streamed.
10. Screenshots into `_diag/p3-restore-sweep/`.

End-to-end clean, zero console errors, before handing back.

## What does NOT change

- `host_chrome.html` only grows the recent-restores line; sub-tab list
  unchanged (Restore is a top-level button on the host page, not a sub-tab).
- `enrollment.go`, schedule reconciliation, source-group CRUD, repo
  maintenance ticker, hook execution — none of these are touched.
- The CLAUDE.md restage block applies as-is when the agent binary changes
  (it does — runner gains restore/diff/cancel/tree handlers). The unit
  file does not change.

## Open questions / explicit non-goals

- **Restore preview / dry-run.** Out of scope for this phase. Older restic
  releases have no dry-run for restore; newer ones accept
  `restic restore --dry-run`, which this spec does not surface.
- **Resumable restore.** Restic restore is idempotent per-file but not
  resumable mid-stream from where it left off. If a restore is cancelled,
  the operator re-runs (files already written are overwritten). No state
  to track.
- **Restore to a glob/pattern (e.g. `*.conf`).** Out of scope; the tree
  picker requires explicit ticks. Power users can edit the URL or use the
  CLI.
- **Bandwidth caps for restore.** Honoured automatically — restic's
  `--limit-download` is part of `restic.Env` already (P2R-13) and applies
  to restore unchanged.
- **Pre/post hooks for restore.** Hooks today gate only `kind=backup`
  (P2R-11). Out of scope.

@@ -233,19 +233,47 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.

## Phase 3 — Restore, alerts, audit

- [ ] **P3-01** (L) Restore wizard backend: snapshot tree browse via `restic ls --json`, path picker, target selection
- [ ] **P3-02** (L) Restore wizard UI (multi-step: host → snapshot → paths → target → confirm)
- [ ] **P3-03** (M) Restore execution: `restic restore` invocation, progress streaming
- [ ] **P3-04** (L) Cross-host restore: target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root
> Phase 3 is split into three independently-shippable sub-phases:
> **Restore** (P3-01..03 + P3-09 + P3-X1 cancel + P3-X2 tree-list RPC),
> **Alerts** (P3-05..07), **Audit UI** (P3-08). Each sub-phase has its own
> spec → plan → implement cycle; we hand back at sub-phase boundaries.
>
> P3-04 (cross-host restore) was de-scoped during the Phase-3 brainstorm
> on 2026-05-04: disaster recovery is already covered by re-enrolling a
> replacement host with the same repo creds (snapshots reappear, restore
> is same-host). The remaining "pull a file from host A onto host C
> without giving C permanent access" use case is genuinely different and
> doesn't have a confirmed need yet, so it's moved to the **Future /
> unscheduled** section at the end of this file.

### Phase 3 — Restore (in progress, branch `p3-restore`)

> Spec: `docs/superpowers/specs/2026-05-04-p3-restore-design.md`.
> Wireframe: `_diag/p3-restore-wizard/wireframe.html`.

- [ ] **P3-X1** (S) Cancel-job feature. New `command.cancel` WS envelope; agent tracks per-job ctx.CancelFunc and kills the running `restic` subprocess (SIGTERM, SIGKILL after 5s grace); server endpoint `POST /api/jobs/{id}/cancel` bridges UI → WS; the existing UI Cancel button on `/jobs/{id}` becomes real for any running kind. Foundational — restore depends on it.
- [ ] **P3-X2** (S) Tree-list synchronous WS RPC. New `tree.list` request / `tree.list.result` reply on the existing correlation-ID infra; agent runs `restic ls --json <sid> <path>` per call; server-side mediator `ws.SendRPC` + per-wizard-session in-memory cache (~30-min TTL).
- [ ] **P3-01** (L) Restore wizard backend: tree browse via `tree.list` RPC (P3-X2), path picker validation, target selection (new-dir vs in-place + typed-confirm), dispatch endpoint `POST /hosts/{id}/restore`, audit row `host.restore`.
- [ ] **P3-02** (L) Restore wizard UI: single-page progressively-enabled four-step form at `/hosts/{id}/restore` (and pre-selected variant `/hosts/{id}/snapshots/{sid}/restore`); tree-browser HTMX partials. Top-level "Restore" button on host detail.
- [ ] **P3-03** (M) Restore execution: `restic.RunRestore` (paths, --target, --no-ownership for new-dir; preserves ownership for in-place); agent dispatcher case `JobRestore`; restore-specific job page variant with files-restored / bytes-restored / throughput / ETA / current-file widget.
- [ ] **P3-09** (S) `diff` between two snapshots in UI: `JobDiff` JobKind, `restic.RunDiff`, `POST /api/hosts/{id}/snapshots/diff` dispatcher, snapshot-picker UI on Snapshots tab to pick A+B; output streams as `log.stream` to the standard live job log page.
- [ ] **P3-X3** (S) Recent-restores panel on host detail: small line below the existing init-status, surfacing latest `JobRestore` outcome (succeeded N hours ago / failed → live log link). Backed by `store.LatestJobByKind(host_id, JobRestore)`.

### Phase 3 — Alerts (not started)

- [ ] **P3-05** (M) Alert engine: rule evaluation loop (failed backup, stale schedule, agent offline, check failed)
- [ ] **P3-06** (M) Notification channels: webhook, ntfy, SMTP email
- [ ] **P3-07** (S) Alert UI: list, acknowledge, resolve

### Phase 3 — Audit log UI (not started)

- [ ] **P3-08** (S) Audit log UI with filters (user, action, target, time range)
- [ ] **P3-09** (S) `diff` between two snapshots in UI

### Phase 3 acceptance

- A file deleted on a host can be restored from the UI in under 2 minutes. A failed backup raises an alert via the configured channel within 60s.
- A file deleted on a host can be restored from the UI in under 2 minutes via the wizard at `/hosts/{id}/restore`; the operator can cancel a running restore (or any other running job) from the live job page. Snapshot diff between two snapshots renders as a normal job page.
- A failed backup raises an alert via the configured channel within 60s.
- The audit-log UI lets an admin filter by user / action / target / time range.

---

@@ -290,3 +318,14 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
- [ ] **X-03** Periodic dependency updates (`dependabot` or `renovate`)
- [ ] **X-04** Threat-model review at end of each phase
- [ ] **X-05** Proper first-run onboarding UI: admin shouldn't need to `curl` `/api/bootstrap` by hand. Render the bootstrap form on the same login page (extra "setup token" field shown only while no admin user exists, hidden after); on submit POST to `/api/bootstrap`, then drop straight into a session. Surface the one-time token from the server log somewhere copy-able (or print a clickable URL with the token in the query string at first-run). Also: relax the 12-char password floor for the first-run path or document it in the form so `admin` doesn't silently fail validation.

---

## Future / unscheduled

> Items here have a plausible use case but no confirmed need. They live
> outside numbered phases until a concrete trigger (a user request, a
> security review finding, a real disaster-recovery exercise) bumps them
> back into a phase.

- [ ] **F-01** ~~P3-04~~ Cross-host restore. De-scoped from Phase 3 on 2026-05-04. Disaster recovery is already covered: stand up a replacement host, paste the original repo creds at enrolment, snapshots reappear, restore is same-host. The remaining "pull a file from host A onto host C without granting C permanent access" use case is genuinely different (file sharing / migration, not DR) and hasn't been requested. Original spec language was: "target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root". Re-promote when there's a real ask.
