docs: P3 restore design spec + scope-decompose Phase 3
Splits Phase 3 into three independently-shippable sub-phases (Restore, Alerts, Audit UI) so they can land in separate PRs with their own brainstorm → spec → plan cycles. The Restore sub-phase is up first. The brainstorm ran on 2026-05-04 and locked the following decisions: - Single-host restore only this phase. P3-04 (cross-host restore) is moved to a new 'Future / unscheduled' section. Disaster recovery is already covered by re-enrolling a replacement host with the same repo creds; the remaining 'pull a file from host A onto host C' use case is genuinely different (file sharing / migration, not DR) and has no confirmed need. - Default target is /var/restic-restore/<job-id>/ with --no-ownership; in-place restore preserves uid/gid/mode and is gated by typed-confirmation of the host name (mirroring the repo re-init danger zone). - Tree browser is the path picker, lazy-loaded via a synchronous WS RPC (tree.list) over the existing correlation-ID infrastructure with a per-wizard-session in-memory cache (~30 min TTL). - Single-page wizard with progressively-enabled sections; entry is a top-level Restore button on host detail (or per-snapshot Restore action for direct deep-link). - Snapshot diff (P3-09) is a JobDiff JobKind, dispatched like every other agent operation; output streams to the standard live job log page. - Restore-specific live job page variant with files-restored / bytes-restored / current-file widget. - Single-flight per host across all kinds, plus a real cancel-job feature (command.cancel WS envelope, agent kills the restic subprocess via context cancel + SIGTERM/SIGKILL grace) so the operator can pre-empt a long-running backup if they need to restore urgently. Wires the existing job_detail Cancel button (which was a UI stub). - Audit row host.restore on every dispatch + a recent-restores panel on host detail. Role gate deferred to P4-03 RBAC. Wireframe at _diag/p3-restore-wizard/wireframe.html (gitignored — transient design artefact); screenshot reviewed and approved 2026-05-04.
This commit is contained in:
@@ -233,19 +233,47 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
|
||||
|
||||
## Phase 3 — Restore, alerts, audit
|
||||
|
||||
- [ ] **P3-01** (L) Restore wizard backend: snapshot tree browse via `restic ls --json`, path picker, target selection
|
||||
- [ ] **P3-02** (L) Restore wizard UI (multi-step: host → snapshot → paths → target → confirm)
|
||||
- [ ] **P3-03** (M) Restore execution: `restic restore` invocation, progress streaming
|
||||
- [ ] **P3-04** (L) Cross-host restore: target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root
|
||||
> Phase 3 is split into three independently-shippable sub-phases:
|
||||
> **Restore** (P3-01..03 + P3-09 + P3-X1 cancel + P3-X2 tree-list RPC),
|
||||
> **Alerts** (P3-05..07), **Audit UI** (P3-08). Each sub-phase has its own
|
||||
> spec → plan → implement cycle; we hand back at sub-phase boundaries.
|
||||
>
|
||||
> P3-04 (cross-host restore) was de-scoped during the Phase-3 brainstorm
|
||||
> on 2026-05-04: disaster recovery is already covered by re-enrolling a
|
||||
> replacement host with the same repo creds (snapshots reappear, restore
|
||||
> is same-host). The remaining "pull a file from host A onto host C
|
||||
> without giving C permanent access" use case is genuinely different and
|
||||
> doesn't have a confirmed need yet, so it's moved to the **Future /
|
||||
> unscheduled** section at the end of this file.
|
||||
|
||||
### Phase 3 — Restore (in progress, brand `p3-restore`)
|
||||
|
||||
> Spec: `docs/superpowers/specs/2026-05-04-p3-restore-design.md`.
|
||||
> Wireframe: `_diag/p3-restore-wizard/wireframe.html`.
|
||||
|
||||
- [ ] **P3-X1** (S) Cancel-job feature. New `command.cancel` WS envelope; agent tracks per-job ctx.CancelFunc and kills the running `restic` subprocess (SIGTERM, SIGKILL after 5s grace); server endpoint `POST /api/jobs/{id}/cancel` bridges UI → WS; the existing UI Cancel button on `/jobs/{id}` becomes real for any running kind. Foundational — restore depends on it.
|
||||
- [ ] **P3-X2** (S) Tree-list synchronous WS RPC. New `tree.list` request / `tree.list.result` reply on the existing correlation-ID infra; agent runs `restic ls --json <sid> <path>` per call; server-side mediator `ws.SendRPC` + per-wizard-session in-memory cache (~30-min TTL).
|
||||
- [ ] **P3-01** (L) Restore wizard backend: tree browse via `tree.list` RPC (P3-X2), path picker validation, target selection (new-dir vs in-place + typed-confirm), dispatch endpoint `POST /hosts/{id}/restore`, audit row `host.restore`.
|
||||
- [ ] **P3-02** (L) Restore wizard UI: single-page progressively-enabled four-step form at `/hosts/{id}/restore` (and pre-selected variant `/hosts/{id}/snapshots/{sid}/restore`); tree-browser HTMX partials. Top-level "Restore" button on host detail.
|
||||
- [ ] **P3-03** (M) Restore execution: `restic.RunRestore` (paths, --target, --no-ownership for new-dir; preserves ownership for in-place); agent dispatcher case `JobRestore`; restore-specific job page variant with files-restored / bytes-restored / throughput / ETA / current-file widget.
|
||||
- [ ] **P3-09** (S) `diff` between two snapshots in UI: `JobDiff` JobKind, `restic.RunDiff`, `POST /api/hosts/{id}/snapshots/diff` dispatcher, snapshot-picker UI on Snapshots tab to pick A+B; output streams as `log.stream` to the standard live job log page.
|
||||
- [ ] **P3-X3** (S) Recent-restores panel on host detail: small line below the existing init-status, surfacing latest `JobRestore` outcome (succeeded N hours ago / failed → live log link). Backed by `store.LatestJobByKind(host_id, JobRestore)`.
|
||||
|
||||
### Phase 3 — Alerts (not started)
|
||||
|
||||
- [ ] **P3-05** (M) Alert engine: rule evaluation loop (failed backup, stale schedule, agent offline, check failed)
|
||||
- [ ] **P3-06** (M) Notification channels: webhook, ntfy, SMTP email
|
||||
- [ ] **P3-07** (S) Alert UI: list, acknowledge, resolve
|
||||
|
||||
### Phase 3 — Audit log UI (not started)
|
||||
|
||||
- [ ] **P3-08** (S) Audit log UI with filters (user, action, target, time range)
|
||||
- [ ] **P3-09** (S) `diff` between two snapshots in UI
|
||||
|
||||
### Phase 3 acceptance
|
||||
|
||||
- A file deleted on a host can be restored from the UI in under 2 minutes. A failed backup raises an alert via the configured channel within 60s.
|
||||
- A file deleted on a host can be restored from the UI in under 2 minutes via the wizard at `/hosts/{id}/restore`; the operator can cancel a running restore (or any other running job) from the live job page. Snapshot diff between two snapshots renders as a normal job page.
|
||||
- A failed backup raises an alert via the configured channel within 60s.
|
||||
- The audit-log UI lets an admin filter by user / action / target / time range.
|
||||
|
||||
---
|
||||
|
||||
@@ -290,3 +318,14 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
|
||||
- [ ] **X-03** Periodic dependency updates (`dependabot` or `renovate`)
|
||||
- [ ] **X-04** Threat-model review at end of each phase
|
||||
- [ ] **X-05** Proper first-run onboarding UI: admin shouldn't need to `curl` `/api/bootstrap` by hand. Render the bootstrap form on the same login page (extra "setup token" field shown only while no admin user exists, hidden after); on submit POST to `/api/bootstrap`, then drop straight into a session. Surface the one-time token from the server log somewhere copy-able (or print a clickable URL with the token in the query string at first-run). Also: relax the 12-char password floor for the first-run path or document it in the form so `admin` doesn't silently fail validation.
|
||||
|
||||
---
|
||||
|
||||
## Future / unscheduled
|
||||
|
||||
> Items here have a plausible use case but no confirmed need. They live
|
||||
> outside numbered phases until a concrete trigger (a user request, a
|
||||
> security review finding, a real disaster-recovery exercise) bumps them
|
||||
> back into a phase.
|
||||
|
||||
- [ ] **F-01** ~~P3-04~~ Cross-host restore. De-scoped from Phase 3 on 2026-05-04. Disaster recovery is already covered: stand up a replacement host, paste the original repo creds at enrolment, snapshots reappear, restore is same-host. The remaining "pull a file from host A onto host C without granting C permanent access" use case is genuinely different (file sharing / migration, not DR) and hasn't been requested. Original spec language was: "target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root". Re-promote when there's a real ask.
|
||||
|
||||
Reference in New Issue
Block a user