From d325a27439bb8f152413a87114e3358aa8ff28b9 Mon Sep 17 00:00:00 2001
From: Steve Cliff
Date: Mon, 4 May 2026 15:02:32 +0100
Subject: [PATCH] docs: P3 restore design spec + scope-decompose Phase 3
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Splits Phase 3 into three independently-shippable sub-phases (Restore,
Alerts, Audit UI) so they can land in separate PRs with their own
brainstorm → spec → plan cycles. The Restore sub-phase is up first.

The brainstorm ran on 2026-05-04 and locked the following decisions:

- Single-host restore only this phase. P3-04 (cross-host restore) is
  moved to a new 'Future / unscheduled' section. Disaster recovery is
  already covered by re-enrolling a replacement host with the same repo
  creds; the remaining 'pull a file from host A onto host C' use case
  is genuinely different (file sharing / migration, not DR) and has no
  confirmed need.
- Default target is /var/restic-restore// with --no-ownership; in-place
  restore preserves uid/gid/mode and is gated by typed-confirmation of
  the host name (mirroring the repo re-init danger zone).
- Tree browser is the path picker, lazy-loaded via a synchronous WS RPC
  (tree.list) over the existing correlation-ID infrastructure with a
  per-wizard-session in-memory cache (~30 min TTL).
- Single-page wizard with progressively-enabled sections; entry is a
  top-level Restore button on host detail (or per-snapshot Restore
  action for direct deep-link).
- Snapshot diff (P3-09) is a JobDiff JobKind, dispatched like every
  other agent operation; output streams to the standard live job log
  page.
- Restore-specific live job page variant with files-restored /
  bytes-restored / current-file widget.
- Single-flight per host across all kinds, plus a real cancel-job
  feature (command.cancel WS envelope, agent kills the restic
  subprocess via context cancel + SIGTERM/SIGKILL grace) so the
  operator can pre-empt a long-running backup if they need to restore
  urgently. Wires the existing job_detail Cancel button (which was a UI
  stub).
- Audit row host.restore on every dispatch + a recent-restores panel on
  host detail. Role gate deferred to P4-03 RBAC.

Wireframe at _diag/p3-restore-wizard/wireframe.html (gitignored —
transient design artefact); screenshot reviewed and approved
2026-05-04.
---
 tasks.md | 51 +++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 45 insertions(+), 6 deletions(-)

diff --git a/tasks.md b/tasks.md
index 7a65c72..03faac7 100644
--- a/tasks.md
+++ b/tasks.md
@@ -233,19 +233,47 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
 
 ## Phase 3 — Restore, alerts, audit
 
-- [ ] **P3-01** (L) Restore wizard backend: snapshot tree browse via `restic ls --json`, path picker, target selection
-- [ ] **P3-02** (L) Restore wizard UI (multi-step: host → snapshot → paths → target → confirm)
-- [ ] **P3-03** (M) Restore execution: `restic restore` invocation, progress streaming
-- [ ] **P3-04** (L) Cross-host restore: target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root
+> Phase 3 is split into three independently-shippable sub-phases:
+> **Restore** (P3-01..03 + P3-09 + P3-X1 cancel + P3-X2 tree-list RPC),
+> **Alerts** (P3-05..07), **Audit UI** (P3-08). Each sub-phase has its own
+> spec → plan → implement cycle; we hand back at sub-phase boundaries.
+>
+> P3-04 (cross-host restore) was de-scoped during the Phase-3 brainstorm
+> on 2026-05-04: disaster recovery is already covered by re-enrolling a
+> replacement host with the same repo creds (snapshots reappear, restore
+> is same-host). The remaining "pull a file from host A onto host C
+> without giving C permanent access" use case is genuinely different and
+> doesn't have a confirmed need yet, so it's moved to the **Future /
+> unscheduled** section at the end of this file.
+
+### Phase 3 — Restore (in progress, brand `p3-restore`)
+
+> Spec: `docs/superpowers/specs/2026-05-04-p3-restore-design.md`.
+> Wireframe: `_diag/p3-restore-wizard/wireframe.html`.
+
+- [ ] **P3-X1** (S) Cancel-job feature. New `command.cancel` WS envelope; agent tracks per-job ctx.CancelFunc and kills the running `restic` subprocess (SIGTERM, SIGKILL after 5s grace); server endpoint `POST /api/jobs/{id}/cancel` bridges UI → WS; the existing UI Cancel button on `/jobs/{id}` becomes real for any running kind. Foundational — restore depends on it.
+- [ ] **P3-X2** (S) Tree-list synchronous WS RPC. New `tree.list` request / `tree.list.result` reply on the existing correlation-ID infra; agent runs `restic ls --json ` per call; server-side mediator `ws.SendRPC` + per-wizard-session in-memory cache (~30-min TTL).
+- [ ] **P3-01** (L) Restore wizard backend: tree browse via `tree.list` RPC (P3-X2), path picker validation, target selection (new-dir vs in-place + typed-confirm), dispatch endpoint `POST /hosts/{id}/restore`, audit row `host.restore`.
+- [ ] **P3-02** (L) Restore wizard UI: single-page progressively-enabled four-step form at `/hosts/{id}/restore` (and pre-selected variant `/hosts/{id}/snapshots/{sid}/restore`); tree-browser HTMX partials. Top-level "Restore" button on host detail.
+- [ ] **P3-03** (M) Restore execution: `restic.RunRestore` (paths, --target, --no-ownership for new-dir; preserves ownership for in-place); agent dispatcher case `JobRestore`; restore-specific job page variant with files-restored / bytes-restored / throughput / ETA / current-file widget.
+- [ ] **P3-09** (S) `diff` between two snapshots in UI: `JobDiff` JobKind, `restic.RunDiff`, `POST /api/hosts/{id}/snapshots/diff` dispatcher, snapshot-picker UI on Snapshots tab to pick A+B; output streams as `log.stream` to the standard live job log page.
+- [ ] **P3-X3** (S) Recent-restores panel on host detail: small line below the existing init-status, surfacing latest `JobRestore` outcome (succeeded N hours ago / failed → live log link). Backed by `store.LatestJobByKind(host_id, JobRestore)`.
+
+### Phase 3 — Alerts (not started)
+
 - [ ] **P3-05** (M) Alert engine: rule evaluation loop (failed backup, stale schedule, agent offline, check failed)
 - [ ] **P3-06** (M) Notification channels: webhook, ntfy, SMTP email
 - [ ] **P3-07** (S) Alert UI: list, acknowledge, resolve
+
+### Phase 3 — Audit log UI (not started)
+
 - [ ] **P3-08** (S) Audit log UI with filters (user, action, target, time range)
-- [ ] **P3-09** (S) `diff` between two snapshots in UI
 
 ### Phase 3 acceptance
 
-- A file deleted on a host can be restored from the UI in under 2 minutes. A failed backup raises an alert via the configured channel within 60s.
+- A file deleted on a host can be restored from the UI in under 2 minutes via the wizard at `/hosts/{id}/restore`; the operator can cancel a running restore (or any other running job) from the live job page. Snapshot diff between two snapshots renders as a normal job page.
+- A failed backup raises an alert via the configured channel within 60s.
+- The audit-log UI lets an admin filter by user / action / target / time range.
 
 ---
 
@@ -290,3 +318,14 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
 - [ ] **X-03** Periodic dependency updates (`dependabot` or `renovate`)
 - [ ] **X-04** Threat-model review at end of each phase
 - [ ] **X-05** Proper first-run onboarding UI: admin shouldn't need to `curl` `/api/bootstrap` by hand. Render the bootstrap form on the same login page (extra "setup token" field shown only while no admin user exists, hidden after); on submit POST to `/api/bootstrap`, then drop straight into a session. Surface the one-time token from the server log somewhere copy-able (or print a clickable URL with the token in the query string at first-run). Also: relax the 12-char password floor for the first-run path or document it in the form so `admin` doesn't silently fail validation.
+
+---
+
+## Future / unscheduled
+
+> Items here have a plausible use case but no confirmed need. They live
+> outside numbered phases until a concrete trigger (a user request, a
+> security review finding, a real disaster-recovery exercise) bumps them
+> back into a phase.
+
+- [ ] **F-01** ~~P3-04~~ Cross-host restore. De-scoped from Phase 3 on 2026-05-04. Disaster recovery is already covered: stand up a replacement host, paste the original repo creds at enrolment, snapshots reappear, restore is same-host. The remaining "pull a file from host A onto host C without granting C permanent access" use case is genuinely different (file sharing / migration, not DR) and hasn't been requested. Original spec language was: "target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root". Re-promote when there's a real ask.