From b640775a61c5493e99e4d9c21ddb771ae9e4b083 Mon Sep 17 00:00:00 2001 From: Steve Cliff Date: Sun, 3 May 2026 21:59:40 +0100 Subject: [PATCH] plan: P2 redesign Phase 5 (P2R-03..P2R-08) --- .../plans/2026-05-03-p2-redesign-phase-5.md | 1663 +++++++++++++++++ 1 file changed, 1663 insertions(+) create mode 100644 docs/superpowers/plans/2026-05-03-p2-redesign-phase-5.md diff --git a/docs/superpowers/plans/2026-05-03-p2-redesign-phase-5.md b/docs/superpowers/plans/2026-05-03-p2-redesign-phase-5.md new file mode 100644 index 0000000..543401e --- /dev/null +++ b/docs/superpowers/plans/2026-05-03-p2-redesign-phase-5.md @@ -0,0 +1,1663 @@ +# P2 Redesign — Phase 5 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Land the operator-facing prune/check/unlock surface, the server-side maintenance ticker that fires forget/prune/check on cadence, the offline-retry queue worker, and the repo-stats panel that surfaces the result. Closes P2R-03, P2R-04, P2R-05, P2R-06, P2R-07, P2R-08. + +**Architecture:** +- New restic-wrapper functions (`RunPrune`, `RunCheck`, `RunUnlock`, `RunStats`) mirror `RunForget`'s pumpPlain pattern. Agent-side runners mirror `RunForget`'s envelope shape (`runner.RunPrune`, `RunCheck`, `RunUnlock`). Wire envelopes already include the `JobKind` constants — no protocol bump. +- A second per-host credential row (`host_credentials.kind = 'admin'`) carries the prune-capable creds; pushed to the agent only when dispatching a job that needs it (`prune`). +- Server-side maintenance runs from a single goroutine in `cmd/server/main.go`, ticking every 60s. For each `host_repo_maintenance` row it computes the most-recent cron-fire instant for each kind, compares to the latest persisted job's `created_at`, and dispatches if due. Offline hosts queue to `pending_runs`. +- A second background goroutine drains `pending_runs` every 30s plus on agent reconnect (via `onAgentHello`). +- Repo stats are a singleton-per-host projection (`host_repo_stats` table). Agent ships `stats.report` after every successful backup (mirrors `snapshots.report` plumbing). UI reads it on the Repo page. + +**Tech Stack:** Go (server + agent), SQLite (modernc.org/sqlite), `github.com/robfig/cron/v3` for cron parsing (already a dep), chi for routing, html/template + HTMX for UI, Playwright for end-to-end smoke. + +--- + +## File Structure + +**New files:** +- `internal/store/migrations/0009_admin_creds_and_repo_stats.sql` — schema for `host_credentials.kind`, new `host_repo_stats` table. +- `internal/store/host_repo_stats.go` — CRUD for the stats projection. +- `internal/server/maintenance/ticker.go` — pure-logic maintenance scheduler (parseable, testable without a server). +- `internal/server/maintenance/ticker_test.go` +- `internal/server/http/repo_ops.go` — `POST /api/hosts/{id}/repo/{prune,check,unlock}` handlers + their HTMX-form siblings under the UI tree. +- `internal/server/http/repo_ops_test.go` +- `internal/server/http/pending_drain.go` — the offline-queue drain logic (server-side, called on tick + on connect). +- `internal/server/http/pending_drain_test.go` + +**Modified files:** +- `internal/restic/runner.go` — add `RunPrune`, `RunCheck`, `RunUnlock`, `RunStats`. Add `LockState` parsing helper for check stderr. +- `internal/api/messages.go` — add `MsgStatsReport` constant + `StatsReportPayload`. Add `RequiresAdminCreds` field to `CommandRunPayload`. +- `internal/api/wire.go` — register the new message type. +- `internal/agent/runner/runner.go` — add `RunPrune`, `RunCheck`, `RunUnlock`. Add `reportStats` (mirrors `reportSnapshots`). +- `cmd/agent/main.go` — wire new `JobKind` cases into the dispatcher; load `admin` creds slot from secrets when `RequiresAdminCreds=true`. +- `internal/agent/secrets/secrets.go` — extend the on-disk shape to carry both `repo` and `admin` blobs (one file, two named slots). +- `internal/server/http/host_credentials.go` — extend the encrypted blob to support an optional admin-creds field; add `PUT /api/hosts/{id}/admin-credentials`. Adjust `pushRepoCredsToAgent` to take a kind argument; add `pushAdminCredsToAgent` for on-demand admin-cred pushes. +- `internal/server/http/server.go` — register the new routes. +- `internal/server/http/jobs.go` — extend `dispatchJobWithPayload` to optionally push admin creds first when kind is `prune`. +- `internal/server/ws/handler.go` — handle `MsgStatsReport` (persist + broadcast). +- `internal/store/jobs.go` — add `LatestJobByKind` (returns the most recent terminal job of a kind, used by the ticker). +- `internal/store/pending.go` — add `ListPendingRunsForHost` (used by the on-reconnect drain). +- `internal/store/maintenance.go` — no schema, but add `ListAllMaintenance(ctx)` (the ticker iterates every host). +- `cmd/server/main.go` — wire the maintenance ticker + pending-drain goroutine. +- `web/templates/pages/host_repo.html` — add three "one-time" Run-now buttons (prune / check / unlock), an admin-creds form, a stats panel. +- `internal/server/http/ui_repo.go` — render the new panels; pre-fill the admin-creds form from a redacted view. +- `web/templates/partials/host_chrome.html` — surface "lock detected" banner if `host_repo_stats.lock_present`. + +**Deletes:** none. + +--- + +## Slice A — Schema groundwork (admin creds + stats projection) + +### Task A1: Add migration 0009 for admin-credentials column + repo-stats table + +**Files:** +- Create: `internal/store/migrations/0009_admin_creds_and_repo_stats.sql` + +- [ ] **Step 1: Write the migration** + +```sql +-- 0009_admin_creds_and_repo_stats.sql +-- +-- Phase 5 of the P2 redesign needs two things in the schema: +-- +-- 1. A second credential row per host. Today host_credentials is +-- 1:1 with hosts. For prune (and any future destructive op) we +-- want a rest-server admin user whose password gives delete +-- access — separate from the append-only user used on every +-- backup. Add a `kind` column with default 'repo'; existing rows +-- become kind='repo'. Future admin rows live alongside. +-- +-- 2. A small singleton-per-host projection for repo size, snapshot +-- count, last-prune freed bytes, lock state, and last-check +-- result. Backed by `restic stats --json` + sniffed `restic +-- check` stderr. +-- +-- Use column-level ALTERs only; host_credentials has no inbound +-- FKs but the rule from CLAUDE.md still applies. + +ALTER TABLE host_credentials ADD COLUMN kind TEXT NOT NULL DEFAULT 'repo'; + +-- The PK on host_credentials is currently (host_id) — we need a +-- composite (host_id, kind). SQLite has no ALTER TABLE … +-- ADD/CHANGE PRIMARY KEY, so this is the one place a rebuild is +-- justified. host_credentials has no inbound FKs, so the cascade +-- trap doesn't apply here. Verified against schema/0002. + +CREATE TABLE host_credentials_new ( + host_id TEXT NOT NULL REFERENCES hosts(id) ON DELETE CASCADE, + kind TEXT NOT NULL DEFAULT 'repo' + CHECK (kind IN ('repo', 'admin')), + enc_repo_creds TEXT NOT NULL, + updated_at TEXT NOT NULL, + PRIMARY KEY (host_id, kind) +); +INSERT INTO host_credentials_new (host_id, kind, enc_repo_creds, updated_at) + SELECT host_id, kind, enc_repo_creds, updated_at FROM host_credentials; +DROP TABLE host_credentials; +ALTER TABLE host_credentials_new RENAME TO host_credentials; + +-- Repo stats projection. One row per host, upserted by the agent's +-- stats.report envelope (which fires after every successful backup +-- and after every check / prune). All fields nullable so a freshly +-- enrolled host with no jobs yet is representable. + +CREATE TABLE host_repo_stats ( + host_id TEXT PRIMARY KEY REFERENCES hosts(id) ON DELETE CASCADE, + total_size_bytes INTEGER, + raw_size_bytes INTEGER, + unique_files INTEGER, + snapshot_count INTEGER, + last_check_at TEXT, + last_check_status TEXT, -- 'ok' | 'errors_found' | 'failed' + lock_present INTEGER NOT NULL DEFAULT 0, + last_prune_at TEXT, + last_prune_freed_bytes INTEGER, + updated_at TEXT NOT NULL +); +``` + +- [ ] **Step 2: Run migrations against an empty DB and verify schema** + +Run: `go test ./internal/store/... -run TestMigrations -v` +Expected: PASS, or write a fresh `TestMigration0009` if no such test harness exists yet (mirror `internal/store/migrations_test.go` if present, else add a minimal one that opens an in-memory DB and asserts the new tables exist). + +- [ ] **Step 3: Commit** + +```bash +git add internal/store/migrations/0009_admin_creds_and_repo_stats.sql +git commit -m "store: migration 0009 — admin-creds kind + host_repo_stats" +``` + +### Task A2: Extend the host_credentials store API to be kind-aware + +**Files:** +- Modify: `internal/store/host_credentials.go` +- Test: `internal/store/host_credentials_test.go` + +- [ ] **Step 1: Replace the two existing functions with kind-aware versions** + +Existing `GetHostCredentials(ctx, hostID)` and `SetHostCredentials(ctx, hostID, blob)` become: + +```go +type CredentialKind string + +const ( + CredKindRepo CredentialKind = "repo" + CredKindAdmin CredentialKind = "admin" +) + +func (s *Store) GetHostCredentials(ctx context.Context, hostID string, kind CredentialKind) (string, error) { + row := s.db.QueryRowContext(ctx, + `SELECT enc_repo_creds FROM host_credentials WHERE host_id = ? AND kind = ?`, + hostID, string(kind)) + var enc string + if err := row.Scan(&enc); err != nil { + if errors.Is(err, sql.ErrNoRows) { + return "", ErrNotFound + } + return "", fmt.Errorf("store: get host credentials: %w", err) + } + return enc, nil +} + +func (s *Store) SetHostCredentials(ctx context.Context, hostID string, kind CredentialKind, encRepoCreds string) error { + if encRepoCreds == "" { + return fmt.Errorf("store: empty enc_repo_creds") + } + now := time.Now().UTC().Format(time.RFC3339Nano) + _, err := s.db.ExecContext(ctx, + `INSERT INTO host_credentials (host_id, kind, enc_repo_creds, updated_at) + VALUES (?, ?, ?, ?) + ON CONFLICT(host_id, kind) DO UPDATE SET + enc_repo_creds = excluded.enc_repo_creds, + updated_at = excluded.updated_at`, + hostID, string(kind), encRepoCreds, now) + if err != nil { + return fmt.Errorf("store: set host credentials: %w", err) + } + return nil +} + +func (s *Store) DeleteHostCredentials(ctx context.Context, hostID string, kind CredentialKind) error { + _, err := s.db.ExecContext(ctx, + `DELETE FROM host_credentials WHERE host_id = ? AND kind = ?`, + hostID, string(kind)) + return err +} +``` + +- [ ] **Step 2: Update every caller in `internal/server/http/`** + +Run: `grep -rn "GetHostCredentials\|SetHostCredentials" internal/ cmd/` +For each call site, pass `store.CredKindRepo` as the new arg. Admin variants come later. + +- [ ] **Step 3: Add a test for the admin-row variant** + +In `internal/store/host_credentials_test.go`, add `TestHostCredentialsAdminRowSeparate`: +- Set repo creds, set admin creds, fetch each by kind, assert blobs differ. +- Delete admin, verify repo unaffected. + +- [ ] **Step 4: Run tests and verify** + +Run: `go test ./internal/store/... -v` +Expected: PASS, including the new test. + +- [ ] **Step 5: Run the full server tests to surface call-site fallout** + +Run: `go test ./...` +Expected: PASS. Fix any compile errors at call sites you missed. + +- [ ] **Step 6: Commit** + +```bash +git add internal/store/host_credentials.go internal/store/host_credentials_test.go internal/server/http/ +git commit -m "store: host_credentials becomes kind-aware (repo + admin slots)" +``` + +### Task A3: Add the host_repo_stats store API + +**Files:** +- Create: `internal/store/host_repo_stats.go` +- Test: `internal/store/host_repo_stats_test.go` + +- [ ] **Step 1: Write the type and CRUD** + +```go +package store + +import ( + "context" + "database/sql" + "errors" + "fmt" + "time" +) + +type HostRepoStats struct { + HostID string + TotalSizeBytes *int64 + RawSizeBytes *int64 + UniqueFiles *int64 + SnapshotCount *int64 + LastCheckAt *time.Time + LastCheckStatus string // "" | "ok" | "errors_found" | "failed" + LockPresent bool + LastPruneAt *time.Time + LastPruneFreedBytes *int64 + UpdatedAt time.Time +} + +func (s *Store) GetHostRepoStats(ctx context.Context, hostID string) (*HostRepoStats, error) { + // SELECT … WHERE host_id = ?; sql.ErrNoRows → ErrNotFound. + // Mirror existing nullable-time handling in store/jobs.go. +} + +// UpsertHostRepoStats writes a partial update — only non-nil fields +// in the input overwrite existing columns. Implemented as a row-fetch +// + merge + INSERT…ON CONFLICT for clarity (versus building a sparse +// UPDATE statement at runtime). +func (s *Store) UpsertHostRepoStats(ctx context.Context, hostID string, patch HostRepoStats) error +``` + +- [ ] **Step 2: Write the test** + +```go +func TestHostRepoStatsRoundTrip(t *testing.T) { + // Open in-memory store, seed a host, upsert partial fields, + // re-read, assert. + // Then upsert a different field, assert the first is preserved. +} +``` + +- [ ] **Step 3: Run tests** + +Run: `go test ./internal/store/ -run TestHostRepoStats -v` +Expected: PASS. + +- [ ] **Step 4: Commit** + +```bash +git add internal/store/host_repo_stats.go internal/store/host_repo_stats_test.go +git commit -m "store: HostRepoStats projection (size, lock, last-check, last-prune)" +``` + +--- + +## Slice B — Restic wrapper additions + +### Task B1: RunPrune + +**Files:** +- Modify: `internal/restic/runner.go` +- Test: `internal/restic/runner_test.go` (create if absent — there's currently only `url_test.go`) + +- [ ] **Step 1: Add `RunPrune` after `RunForget`** + +```go +// RunPrune executes `restic prune` against the configured repo. +// Requires the *admin* credentials (delete access on the rest-server +// repo) — the caller is responsible for populating Env.RepoUsername +// and Env.RepoPassword with the admin pair before calling this. +// +// Prune emits human-readable progress on stdout/stderr (no --json +// support that's useful for our purposes). We tee everything to the +// handler so the live log is the operator's progress bar. +func (e Env) RunPrune(ctx context.Context, handle LineHandler) error { + cmd := exec.CommandContext(ctx, e.Bin, "prune") + cmd.Env = e.envSlice() + cmd.Dir = e.WorkDir + return runWithPump(cmd, handle) +} +``` + +Add a small private helper to DRY the stdout/stderr pump pattern that's now shared by Forget/Init/Prune/Check/Unlock: + +```go +// runWithPump starts the configured cmd, fans stdout+stderr into +// pumpPlain, and waits. Errors from the wait are wrapped with the +// cmd args[0] for context. +func runWithPump(cmd *exec.Cmd, handle LineHandler) error { + label := "restic" + if len(cmd.Args) > 1 { + label = "restic " + cmd.Args[1] + } + stdout, err := cmd.StdoutPipe() + if err != nil { + return fmt.Errorf("%s: stdout pipe: %w", label, err) + } + stderr, err := cmd.StderrPipe() + if err != nil { + return fmt.Errorf("%s: stderr pipe: %w", label, err) + } + if err := cmd.Start(); err != nil { + return fmt.Errorf("%s: start: %w", label, err) + } + done := make(chan error, 2) + go func() { done <- pumpPlain(stdout, "stdout", handle) }() + go func() { done <- pumpPlain(stderr, "stderr", handle) }() + for i := 0; i < 2; i++ { + if err := <-done; err != nil && handle != nil { + handle("event", fmt.Sprintf("pump error: %v", err), nil) + } + } + if werr := cmd.Wait(); werr != nil { + return fmt.Errorf("%s: %w", label, werr) + } + return nil +} +``` + +Refactor `RunForget` and `RunInit` to use `runWithPump` (init keeps its sniff wrapper). Keep behaviour identical. + +- [ ] **Step 2: Write a stub test that exercises arg construction** + +Since restic isn't on every developer's PATH, the test should construct an `Env` with `Bin = "/bin/echo"` (or `Bin = "/bin/sh"` running a script that echoes args) and assert the handler observes the expected argv. This is the same pattern as any existing wrapper tests; if none exists, model the test on `TestRunForgetArgs` in another comparable Go project (or just shell out to `echo` and grep stdout). + +```go +func TestRunPruneInvokesPrune(t *testing.T) { + var captured []string + h := func(stream, line string, _ any) { + if stream == "stdout" { captured = append(captured, line) } + } + env := restic.Env{Bin: "/bin/echo"} + if err := env.RunPrune(context.Background(), h); err != nil { + t.Fatalf("run: %v", err) + } + if len(captured) != 1 || captured[0] != "prune" { + t.Fatalf("expected stdout 'prune', got %v", captured) + } +} +``` + +- [ ] **Step 3: Run tests** + +Run: `go test ./internal/restic/ -v` +Expected: PASS. + +- [ ] **Step 4: Commit** + +```bash +git add internal/restic/runner.go internal/restic/runner_test.go +git commit -m "restic: RunPrune + runWithPump helper, refactor Forget/Init onto it" +``` + +### Task B2: RunCheck (with subset support + lock-detection) + +**Files:** +- Modify: `internal/restic/runner.go` +- Test: `internal/restic/runner_test.go` + +- [ ] **Step 1: Add RunCheck** + +```go +// CheckResult summarises a `restic check` invocation. LockPresent is +// true if the stderr stream contained "Found stale lock" (caller is +// expected to surface this in the UI so the operator can run unlock). +// ErrorsFound is true if check exited with status 1 (errors detected +// in repo metadata). +type CheckResult struct { + LockPresent bool + ErrorsFound bool +} + +// RunCheck executes `restic check` with optional --read-data-subset. +// subsetPct of 0 omits the flag (full data check); >0 passes +// --read-data-subset Npct. Returns a CheckResult summarising what +// was sniffed from stderr; the bool is set even if check itself +// returns an error (so the caller can persist the lock-state). +func (e Env) RunCheck(ctx context.Context, subsetPct int, handle LineHandler) (CheckResult, error) { + args := []string{"check"} + if subsetPct > 0 { + args = append(args, "--read-data-subset", fmt.Sprintf("%d%%", subsetPct)) + } + cmd := exec.CommandContext(ctx, e.Bin, args...) + cmd.Env = e.envSlice() + cmd.Dir = e.WorkDir + + var res CheckResult + sniff := func(stream, line string, ev any) { + if stream == "stderr" { + if strings.Contains(line, "Found stale lock") || strings.Contains(line, "locked") { + res.LockPresent = true + } + } + if handle != nil { + handle(stream, line, ev) + } + } + + err := runWithPumpHandler(cmd, sniff) + if err != nil { + // restic check exits 1 when corruption is found; that's a + // CheckResult, not a wrapper failure. Treat any non-zero + // exit as "errors found" but still return the result so the + // ticker can persist last_check_status='errors_found'. + var ee *exec.ExitError + if errors.As(err, &ee) { + res.ErrorsFound = true + return res, nil + } + return res, err + } + return res, nil +} +``` + +`runWithPumpHandler` is a tiny variant that takes the LineHandler directly (so the sniff wrapper can intercept) — split from `runWithPump` (which currently constructs nothing). Implement as the existing function but with the wrapped handler. + +- [ ] **Step 2: Test** + +```go +func TestRunCheckParsesLock(t *testing.T) { + // Use /bin/sh -c to emit a known stderr line containing + // "Found stale lock" and exit 0. Assert LockPresent=true. +} + +func TestRunCheckErrorsFoundOnExit1(t *testing.T) { + // /bin/sh -c "exit 1". Assert err==nil, ErrorsFound=true. +} +``` + +- [ ] **Step 3: Run tests** + +Run: `go test ./internal/restic/ -v` +Expected: PASS. + +- [ ] **Step 4: Commit** + +```bash +git add internal/restic/runner.go internal/restic/runner_test.go +git commit -m "restic: RunCheck with subset% + lock-state sniffing" +``` + +### Task B3: RunUnlock + RunStats + +**Files:** +- Modify: `internal/restic/runner.go` +- Test: `internal/restic/runner_test.go` + +- [ ] **Step 1: Add RunUnlock (trivial)** + +```go +func (e Env) RunUnlock(ctx context.Context, handle LineHandler) error { + cmd := exec.CommandContext(ctx, e.Bin, "unlock") + cmd.Env = e.envSlice() + cmd.Dir = e.WorkDir + return runWithPumpHandler(cmd, handle) +} +``` + +- [ ] **Step 2: Add RunStats with --json parsing** + +```go +// RepoStats mirrors `restic stats --json` output (mode=raw-data is +// the most useful — gives total/unique sizes + file counts). +type RepoStats struct { + TotalSize int64 `json:"total_size"` + TotalUncompressed int64 `json:"total_uncompressed_size"` + SnapshotsCount int64 `json:"snapshots_count"` + TotalFileCount int64 `json:"total_file_count"` + TotalBlobCount int64 `json:"total_blob_count"` +} + +// RunStats executes `restic stats --json --mode raw-data` and parses +// the (single-line) JSON response. Tees raw output to handle so the +// caller can still log it. +func (e Env) RunStats(ctx context.Context, handle LineHandler) (*RepoStats, error) { + cmd := exec.CommandContext(ctx, e.Bin, "stats", "--json", "--mode", "raw-data") + cmd.Env = e.envSlice() + cmd.Dir = e.WorkDir + + var out *RepoStats + capture := func(stream, line string, ev any) { + if stream == "stdout" && strings.HasPrefix(line, "{") { + var s RepoStats + if json.Unmarshal([]byte(line), &s) == nil && s.SnapshotsCount > 0 { + cp := s + out = &cp + } + } + if handle != nil { + handle(stream, line, ev) + } + } + if err := runWithPumpHandler(cmd, capture); err != nil { + return nil, err + } + if out == nil { + return nil, fmt.Errorf("restic stats: no JSON in output") + } + return out, nil +} +``` + +- [ ] **Step 3: Tests** + +```go +func TestRunUnlockInvokesUnlock(t *testing.T) { /* /bin/echo + assert "unlock" arg */ } +func TestRunStatsParsesJSON(t *testing.T) { + // /bin/sh -c 'echo "{\"total_size\":123,\"snapshots_count\":4, …}"' + // assert parsed RepoStats matches. +} +``` + +- [ ] **Step 4: Run + commit** + +```bash +go test ./internal/restic/ -v +git add internal/restic/runner.go internal/restic/runner_test.go +git commit -m "restic: RunUnlock + RunStats (raw-data mode)" +``` + +--- + +## Slice C — Wire envelopes + agent runners + dispatcher + +### Task C1: Wire — add stats.report + RequiresAdminCreds + +**Files:** +- Modify: `internal/api/wire.go`, `internal/api/messages.go` +- Test: `internal/api/messages_test.go` (whatever shape pins the wire today — `grep -n "MsgConfigUpdate\|MsgSnapshotsRpt" internal/api/`) + +- [ ] **Step 1: Add the new message constant** + +In `internal/api/wire.go`: + +```go +MsgStatsReport MessageType = "stats.report" +``` + +In `internal/api/messages.go`: + +```go +// StatsReportPayload — agent ships this after every successful +// backup, prune, or check. Fields are all optional; the server +// upserts only what's populated. Lock state comes only from check; +// freed bytes only from prune. +type StatsReportPayload struct { + TotalSizeBytes *int64 `json:"total_size_bytes,omitempty"` + RawSizeBytes *int64 `json:"raw_size_bytes,omitempty"` + UniqueFiles *int64 `json:"unique_files,omitempty"` + SnapshotCount *int64 `json:"snapshot_count,omitempty"` + LastCheckAt *time.Time `json:"last_check_at,omitempty"` + LastCheckStatus string `json:"last_check_status,omitempty"` + LockPresent *bool `json:"lock_present,omitempty"` + LastPruneAt *time.Time `json:"last_prune_at,omitempty"` + LastPruneFreedBytes *int64 `json:"last_prune_freed_bytes,omitempty"` +} +``` + +Also extend `CommandRunPayload` (in the same file): + +```go +type CommandRunPayload struct { + // … existing fields … + + // RequiresAdminCreds tells the agent to load the admin slot of + // its secrets store rather than the everyday repo slot. Set by + // the server only for `prune` and operator-triggered `unlock` + // (kinds that need delete authority on a rest-server repo). + RequiresAdminCreds bool `json:"requires_admin_creds,omitempty"` +} +``` + +- [ ] **Step 2: Pin the JSON shape in tests** + +Whatever existing test pins the `ConfigUpdatePayload` / `SnapshotsReportPayload` shape, add a sibling test for `StatsReportPayload` and the extended `CommandRunPayload` (assert the new field marshals omitted-when-zero). + +- [ ] **Step 3: Run + commit** + +```bash +go test ./internal/api/ -v +git add internal/api/ +git commit -m "api: stats.report envelope + CommandRun.RequiresAdminCreds" +``` + +### Task C2: Agent secrets — split into repo + admin slots + +**Files:** +- Modify: `internal/agent/secrets/secrets.go` (and any tests) +- Modify: `cmd/agent/main.go` (load path) + +- [ ] **Step 1: Read the current shape** + +Run: `cat internal/agent/secrets/secrets.go` and identify the on-disk struct and the `Load() (Creds, error)` API. The current shape carries one set of `{URL, Username, Password}` — extend to carry both. + +- [ ] **Step 2: Extend the on-disk shape** + +```go +// On-disk JSON (encrypted as a single AEAD blob): +type bundle struct { + Repo Creds `json:"repo,omitempty"` + Admin *Creds `json:"admin,omitempty"` +} +``` + +Public API: + +```go +// Load returns the everyday repo slot. Existing callers compile +// unchanged. +func (s *Store) Load() (Creds, error) { /* returns bundle.Repo */ } + +// LoadAdmin returns the admin slot, or (Creds{}, ErrNoAdmin) if +// the agent has never received an admin push. Caller decides what +// to do (typically: refuse the prune job with a clear error). +func (s *Store) LoadAdmin() (Creds, error) + +// Save replaces the repo slot (admin slot preserved). Used by the +// existing config.update handler. +func (s *Store) Save(c Creds) error + +// SaveAdmin replaces the admin slot. Called by the new config.update +// path that ships admin creds. +func (s *Store) SaveAdmin(c Creds) error +``` + +Migration path on first boot: if the existing on-disk shape decodes to a flat `Creds`, treat that as the repo slot and write back the new bundle shape. Mirror the secrets-package migration logic that landed in P1-33. + +- [ ] **Step 3: Test** + +In `internal/agent/secrets/secrets_test.go` (or whatever the test file is), add `TestSecretsAdminSlotIndependent`: +- Save repo creds. Load admin → ErrNoAdmin. +- SaveAdmin. Load admin → those creds. Load → still the original repo creds. +- Restart (re-open the store from disk). Both slots survive. + +- [ ] **Step 4: Run + commit** + +```bash +go test ./internal/agent/secrets/ -v +git add internal/agent/secrets/ +git commit -m "agent/secrets: separate admin slot + back-compat decode" +``` + +### Task C3: Agent runners — RunPrune, RunCheck, RunUnlock + reportStats + +**Files:** +- Modify: `internal/agent/runner/runner.go` +- Test: `internal/agent/runner/runner_test.go` (create if absent) + +- [ ] **Step 1: Add RunPrune** + +Mirror `RunForget` (`internal/agent/runner/runner.go:220-282`) verbatim, swap `restic.JobForget`/`env.RunForget` for `JobPrune`/`env.RunPrune`. After success, call `r.reportStats(ctx, env, statsPatch{LastPruneAt: &now})` — see step 4. + +```go +func (r *Runner) RunPrune(ctx context.Context, jobID string) error { + startedAt := time.Now().UTC() + startEnv, _ := api.Marshal(api.MsgJobStarted, jobID, api.JobStartedPayload{ + JobID: jobID, Kind: api.JobPrune, StartedAt: startedAt, + }) + if err := r.tx.Send(startEnv); err != nil { + slog.Warn("runner: send job.started (prune)", "err", err) + } + env := r.resticEnv() + var seq atomic.Int64 + handle := r.streamHandler(jobID, &seq) + + err := env.RunPrune(ctx, handle) + finishedAt := time.Now().UTC() + r.sendFinished(jobID, finishedAt, err) + if err == nil { + now := finishedAt + // Stats refresh — prune freed-bytes is hard to extract + // reliably from text output; for now just timestamp it + // and let RunStats refresh size. + if rerr := r.reportStats(ctx, env, api.StatsReportPayload{LastPruneAt: &now}); rerr != nil { + slog.Warn("runner: stats.report after prune failed", "job_id", jobID, "err", rerr) + } + } + if err != nil { + return fmt.Errorf("runner prune: %w", err) + } + return nil +} +``` + +Pull the duplicated `resticEnv()`/`streamHandler()`/`sendFinished()` out of the existing methods into helpers in the same file. The repeated start-env/finish-env scaffolding across Backup/Init/Forget/Prune/Check/Unlock is exactly the kind of duplication that hurts every time a payload field is added. + +- [ ] **Step 2: Add RunCheck** + +```go +func (r *Runner) RunCheck(ctx context.Context, jobID string, subsetPct int) error { + // start envelope (kind=check) → run → finished + // On any outcome, ship stats.report with last_check_at + status + lock_present. + res, err := env.RunCheck(ctx, subsetPct, handle) + status := "ok" + if res.ErrorsFound { status = "errors_found" } + if err != nil { status = "failed" } + now := time.Now().UTC() + lock := res.LockPresent + _ = r.reportStats(ctx, env, api.StatsReportPayload{ + LastCheckAt: &now, LastCheckStatus: status, LockPresent: &lock, + }) + // job.finished +} +``` + +- [ ] **Step 3: Add RunUnlock** + +Trivial mirror of RunInit. After success, also fire `reportStats` with `LockPresent: &false` so the UI banner clears. + +- [ ] **Step 4: Add reportStats** + +Sibling to `reportSnapshots` (runner.go:288-317). Build a `StatsReportPayload`, invoke `env.RunStats` to fill in size fields if the patch doesn't already cover them, marshal `MsgStatsReport`, send. Bound by a 60s timeout. + +```go +func (r *Runner) reportStats(ctx context.Context, env restic.Env, patch api.StatsReportPayload) error { + listCtx, cancel := context.WithTimeout(ctx, 60*time.Second) + defer cancel() + // Fill in sizes via restic stats unless the caller is just + // updating a metadata field (e.g. lock-cleared from unlock). + if patch.TotalSizeBytes == nil { + if s, err := env.RunStats(listCtx, nil); err == nil { + patch.TotalSizeBytes = &s.TotalSize + patch.RawSizeBytes = &s.TotalUncompressed + patch.UniqueFiles = &s.TotalFileCount + patch.SnapshotCount = &s.SnapshotsCount + } + } + envOut, err := api.Marshal(api.MsgStatsReport, "", patch) + if err != nil { return err } + return r.tx.Send(envOut) +} +``` + +Also call `reportStats(ctx, env, api.StatsReportPayload{})` from the existing `RunBackup` success path (next to `reportSnapshots`) so size refreshes after every backup. + +- [ ] **Step 5: Update Config + cmd/agent dispatcher to support admin creds** + +`runner.Config` grows an optional `AdminUsername`/`AdminPassword`. The dispatcher in `cmd/agent/main.go:243-325` builds the runner with the right creds based on `payload.RequiresAdminCreds`: + +```go +case api.JobPrune: + creds := repoCreds + if p.RequiresAdminCreds { + adminCreds, err := d.secrets.LoadAdmin() + if err != nil { + return fmt.Errorf("prune: admin creds not configured (server didn't push them)") + } + creds = adminCreds + } + r := runner.New(runner.Config{ + ResticBin: d.resticBin, + RepoURL: creds.URL, + RepoUsername: creds.Username, + RepoPassword: creds.Password, + }, tx, time.Second) + go func() { _ = r.RunPrune(ctx, p.JobID) }() +case api.JobCheck: + // subset% comes from CommandRunPayload.Args[0] (e.g. "5") to + // avoid bloating the wire — server stringifies the int from + // host_repo_maintenance.check_subset_pct. + subset := 0 + if len(p.Args) > 0 { subset, _ = strconv.Atoi(p.Args[0]) } + go func() { _ = r.RunCheck(ctx, p.JobID, subset) }() +case api.JobUnlock: + go func() { _ = r.RunUnlock(ctx, p.JobID) }() +``` + +Also: wire `MsgConfigUpdate` to call `secrets.SaveAdmin` when the new admin-fields are populated. Extend `ConfigUpdatePayload` with a `Slot` discriminator (default empty = repo, "admin" = admin). + +```go +// in api/messages.go +type ConfigUpdatePayload struct { + Slot string `json:"slot,omitempty"` // "" = repo, "admin" = admin slot + // … existing fields … +} +``` + +- [ ] **Step 6: Test** + +Add `TestRunPruneEnvelopes`, `TestRunCheckShipsStatsReport`, `TestRunUnlockClearsLockState` — drive the runner with a fake `Sender` and a `Bin = /bin/echo` env, capture the envelopes shipped, assert ordering: `job.started → log.stream(s) → stats.report (if applicable) → job.finished`. + +- [ ] **Step 7: Run + commit** + +```bash +go test ./internal/agent/... ./cmd/agent/... -v +git add internal/agent/runner/ internal/agent/secrets/ cmd/agent/main.go internal/api/messages.go +git commit -m "agent: RunPrune/RunCheck/RunUnlock + reportStats + admin-cred slot" +``` + +--- + +## Slice D — Server: HTTP run-now endpoints (P2R-03/04/05) + +### Task D1: Admin credentials REST + push helper + +**Files:** +- Modify: `internal/server/http/host_credentials.go` +- Modify: `internal/server/http/server.go` +- Test: `internal/server/http/host_credentials_test.go` + +- [ ] **Step 1: Add admin-credentials endpoints** + +Mirror the existing `handleGetHostCredentials` / `handleSetHostCredentials` for the admin slot. Reuse the redacted-view shape (URL + username + has_password). + +```go +// PUT /api/hosts/{id}/admin-credentials → SetHostCredentials(host, admin, …) +// GET /api/hosts/{id}/admin-credentials → redacted view, or 404 if unset +// DELETE /api/hosts/{id}/admin-credentials → clear the admin slot +``` + +Validation: same length / charset rules as repo creds. AEAD additional-data is `"host:" + hostID + ":admin"` (different AAD per slot so a corrupted DB can't cross-bind). + +Audit-log all three. + +- [ ] **Step 2: Add `pushAdminCredsToAgent`** + +```go +func (s *Server) pushAdminCredsToAgent(ctx context.Context, hostID string) error { + enc, err := s.deps.Store.GetHostCredentials(ctx, hostID, store.CredKindAdmin) + if err != nil { return err } // ErrNotFound bubbles + plain, err := s.deps.AEAD.Decrypt(enc, []byte("host:"+hostID+":admin")) + if err != nil { return err } + var blob repoCredsBlob + if err := json.Unmarshal(plain, &blob); err != nil { return err } + env, err := api.Marshal(api.MsgConfigUpdate, "", api.ConfigUpdatePayload{ + Slot: "admin", + RepoURL: blob.RepoURL, RepoUsername: blob.RepoUsername, RepoPassword: blob.RepoPassword, + }) + if err != nil { return err } + sendCtx, cancel := context.WithTimeout(ctx, 5*time.Second) + defer cancel() + return s.deps.Hub.Send(sendCtx, hostID, env) +} +``` + +- [ ] **Step 3: Register routes in `server.go`** + +```go +r.Get("/hosts/{id}/admin-credentials", s.handleGetAdminCredentials) +r.Put("/hosts/{id}/admin-credentials", s.handleSetAdminCredentials) +r.Delete("/hosts/{id}/admin-credentials", s.handleDeleteAdminCredentials) +``` + +- [ ] **Step 4: Test** + +Set admin creds, fetch redacted, assert. Clear, assert 404. Encrypt/decrypt round-trip with different AAD. + +- [ ] **Step 5: Commit** + +```bash +git add internal/server/http/host_credentials.go internal/server/http/server.go internal/server/http/host_credentials_test.go +git commit -m "server: admin-credentials REST + push helper" +``` + +### Task D2: HTTP run-now for prune / check / unlock + +**Files:** +- Create: `internal/server/http/repo_ops.go` +- Create: `internal/server/http/repo_ops_test.go` +- Modify: `internal/server/http/server.go` + +- [ ] **Step 1: Write the handlers** + +```go +// repo_ops.go — operator-triggered Run-now for repo-level +// operations: prune, check, unlock. Backed by the same +// dispatchJobWithPayload pipeline as backup, with an extra step +// for prune: push the admin creds first if they're set, refuse +// loudly if they aren't. +package http + +func (s *Server) handleRunRepoPrune(w stdhttp.ResponseWriter, r *stdhttp.Request) { + user, ok := s.requireUser(r) + if !ok { /* same auth handling as run_group.go */ return } + hostID := chi.URLParam(r, "id") + + // Admin creds are required. Push them down before dispatching. + if err := s.pushAdminCredsToAgent(r.Context(), hostID); err != nil { + if errors.Is(err, store.ErrNotFound) { + s.runOpError(w, r, stdhttp.StatusBadRequest, "admin_creds_required", + "set admin credentials on the Repo page before running prune") + return + } + s.runOpError(w, r, stdhttp.StatusServiceUnavailable, "host_offline", + "agent not connected; reconnect and try again") + return + } + + res, status, code, msg := s.dispatchJobWithPayload(r.Context(), user, hostID, api.JobPrune, + api.CommandRunPayload{RequiresAdminCreds: true}) + if code != "" { s.runOpError(w, r, status, code, msg); return } + s.runOpRedirect(w, r, res) +} + +func (s *Server) handleRunRepoCheck(w stdhttp.ResponseWriter, r *stdhttp.Request) { + // Resolve subset% from host_repo_maintenance for this host (or + // optional ?subset=N override on the query string for ad-hoc). + m, _ := s.deps.Store.GetRepoMaintenance(r.Context(), hostID) + subset := m.CheckSubsetPct + if q := r.URL.Query().Get("subset"); q != "" { /* parse + clamp */ } + + res, status, code, msg := s.dispatchJobWithPayload(r.Context(), user, hostID, api.JobCheck, + api.CommandRunPayload{Args: []string{strconv.Itoa(subset)}}) + // … +} + +func (s *Server) handleRunRepoUnlock(w stdhttp.ResponseWriter, r *stdhttp.Request) { + // No admin creds required (unlock works with the everyday user). + res, status, code, msg := s.dispatchJobWithPayload(r.Context(), user, hostID, api.JobUnlock, + api.CommandRunPayload{}) + // … +} + +func (s *Server) runOpRedirect(...) { /* HX-Redirect to /jobs/{id} for HTMX, JSON otherwise. Mirror run_group.go */ } +func (s *Server) runOpError(...) { /* mirror run_group.go runGroupError */ } +``` + +- [ ] **Step 2: Register routes** + +In `server.go`, both inside `/api` and at the outer router for HTMX: + +```go +// Inside r.Route("/api", …): +r.Post("/hosts/{id}/repo/prune", s.handleRunRepoPrune) +r.Post("/hosts/{id}/repo/check", s.handleRunRepoCheck) +r.Post("/hosts/{id}/repo/unlock", s.handleRunRepoUnlock) + +// Outer (for HTMX form posts): +r.Post("/hosts/{id}/repo/prune", s.handleRunRepoPrune) +r.Post("/hosts/{id}/repo/check", s.handleRunRepoCheck) +r.Post("/hosts/{id}/repo/unlock", s.handleRunRepoUnlock) +``` + +- [ ] **Step 3: Test** + +In `repo_ops_test.go`: + +```go +func TestRunPruneRefusesWithoutAdminCreds(t *testing.T) { + // Stand up testServer with a connected fake host, no admin creds. + // POST /api/hosts/{id}/repo/prune; assert 400 + admin_creds_required. + // No job row created. +} + +func TestRunPruneShipsConfigUpdateThenCommandRun(t *testing.T) { + // Set admin creds on the host; connect a fake WS; POST prune. + // Drain the conn; assert config.update(slot=admin) → command.run(prune,RequiresAdminCreds=true). +} + +func TestRunCheckPullsSubsetFromMaintenanceRow(t *testing.T) { + // Update maintenance row, check_subset_pct=25; POST /repo/check. + // Assert command.run.Args == ["25"]. +} + +func TestRunUnlockNeedsNoAdminCreds(t *testing.T) { + // No admin creds set; POST /repo/unlock; assert 202 + command.run shipped. +} +``` + +- [ ] **Step 4: Run + commit** + +```bash +go test ./internal/server/http/ -v -run TestRunRepo -run TestRunPrune -run TestRunCheck -run TestRunUnlock +git add internal/server/http/repo_ops.go internal/server/http/repo_ops_test.go internal/server/http/server.go +git commit -m "server: HTTP run-now for prune / check / unlock" +``` + +--- + +## Slice E — UI: Repo page additions + +### Task E1: Render admin-creds form + one-time maintenance buttons + stats panel skeleton + +**Files:** +- Modify: `web/templates/pages/host_repo.html` +- Modify: `internal/server/http/ui_repo.go` + +- [ ] **Step 1: Read the current Repo page to understand sections** + +Run: `cat web/templates/pages/host_repo.html` and identify section anchors. Also read `internal/server/http/ui_repo.go` to see what data the renderer hands to the template. + +- [ ] **Step 2: Add the admin-creds form** + +In the Connection section (right after the existing repo creds form), add a sibling form for admin creds. Use a collapsible/optional pattern with a clear note: + +```html +
+

Admin credentials (required for prune)

+

Only needed for rest-server repos that distinguish an + append-only user (everyday backups) from a delete-capable user + (prune). For S3/B2/local, leave blank.

+
+ + + + + {{if .AdminCreds.HasPassword}} + + {{end}} +
+
+``` + +The handler `handleUIHostRepo` in `ui_repo.go` needs to pre-fill `AdminCreds` from `GetHostCredentials(host, admin)` using a redacted view. + +- [ ] **Step 3: Add one-time maintenance buttons** + +Below the cadence rows in the Maintenance section, add a "Run now" subsection with three buttons. Mirror the per-source-group Run-now pattern (inline form, hx-post, HX-Redirect to /jobs/{id} for live tailing). + +```html +
+

Run now

+
+
+ +
+
+ +
+
+ +
+
+
+``` + +`HasAdminCreds`, `Online` come from the renderer. + +- [ ] **Step 4: Render the stats panel (data fields wired up)** + +In the right rail (replace or augment the existing snapshots-by-tag breakdown), add a stats card backed by `host_repo_stats`: + +```html +
+

Repo stats

+
+
Total size
{{humanBytes .Stats.TotalSizeBytes}}
+
Snapshots
{{.Stats.SnapshotCount}}
+
Last check
{{if .Stats.LastCheckAt}}{{ago .Stats.LastCheckAt}} · {{.Stats.LastCheckStatus}}{{else}}—{{end}}
+
Last prune
{{if .Stats.LastPruneAt}}{{ago .Stats.LastPruneAt}}{{else}}—{{end}}
+ {{if .Stats.LockPresent}} +
⚠ Stale lock detected — run Unlock above.
+ {{end}} +
+
+``` + +`humanBytes` and `ago` are template helpers already in this codebase (or add them in `internal/server/ui/`). Check `web/templates/partials/host_row.html` for prior art before coining new ones. + +- [ ] **Step 5: Update `handleUIHostRepo` to populate the new view fields** + +Add to whatever struct backs the page render: `AdminCreds`, `HasAdminCreds`, `Online`, `Stats *store.HostRepoStats`. Default to a zero-valued stats struct when the row doesn't exist yet (so the template doesn't NPE). + +- [ ] **Step 6: Manual smoke check** + +Run: `make build && ` then visit `/hosts//repo` in a browser; verify all three sections render, buttons disable correctly when offline / when admin creds are missing. + +- [ ] **Step 7: Commit** + +```bash +git add web/templates/pages/host_repo.html internal/server/http/ui_repo.go +git commit -m "ui: Repo page — admin creds form, one-time maintenance, stats panel" +``` + +--- + +## Slice F — Server-side maintenance ticker (P2R-06) + +### Task F1: Pure-logic ticker package + +**Files:** +- Create: `internal/server/maintenance/ticker.go` +- Create: `internal/server/maintenance/ticker_test.go` + +- [ ] **Step 1: Add `LatestJobByKind` to the store** + +In `internal/store/jobs.go`, add: + +```go +// LatestJobByKind returns the most recent terminal (succeeded or +// failed) job of the given kind for the host, or (nil, ErrNotFound) +// if there isn't one. Used by the maintenance ticker to decide +// whether a cron tick needs dispatching. +func (s *Store) LatestJobByKind(ctx context.Context, hostID, kind string) (*Job, error) +``` + +`ORDER BY created_at DESC LIMIT 1` over `WHERE host_id=? AND kind=? AND status IN ('succeeded','failed','cancelled')`. + +- [ ] **Step 2: Add `ListAllMaintenance(ctx)` to store** + +In `internal/store/maintenance.go`, return all rows (one per host) with their host_id. Used by the ticker to iterate. + +- [ ] **Step 3: Write the pure-logic ticker** + +```go +// Package maintenance owns the server-side scheduler that fires +// forget/prune/check on the cadences operators set on +// host_repo_maintenance rows. The ticker is intentionally +// side-effect-free at the package boundary: it asks an injected +// Backend for current state and emits a list of Decisions for the +// caller to act on. Easy to unit-test without a running server. +package maintenance + +import ( + "context" + "fmt" + "time" + + "github.com/robfig/cron/v3" + + "gitea.dcglab.co.uk/steve/restic-manager/internal/store" +) + +type Decision struct { + HostID string + Kind string // "forget" | "prune" | "check" + SubsetPct int // populated when Kind == "check" +} + +type Backend interface { + ListAllMaintenance(ctx context.Context) ([]store.HostRepoMaintenance, error) + LatestJobByKind(ctx context.Context, hostID, kind string) (*store.Job, error) +} + +type Ticker struct { + backend Backend + parser cron.Parser +} + +func New(b Backend) *Ticker { + return &Ticker{ + backend: b, + parser: cron.NewParser(cron.Minute | cron.Hour | cron.Dom | cron.Month | cron.Dow), + } +} + +// Decide returns the set of jobs the ticker would dispatch right +// now. Called by the goroutine in cmd/server every minute. The +// caller is responsible for: checking host online state, queuing +// to pending_runs if offline, persisting the job row, and shipping +// command.run. +func (t *Ticker) Decide(ctx context.Context, now time.Time) ([]Decision, error) { + rows, err := t.backend.ListAllMaintenance(ctx) + if err != nil { return nil, err } + var out []Decision + for _, m := range rows { + if d, ok := t.dueFor(ctx, now, m.HostID, "forget", m.ForgetCron, m.ForgetEnabled, 0); ok { + out = append(out, d) + } + if d, ok := t.dueFor(ctx, now, m.HostID, "prune", m.PruneCron, m.PruneEnabled, 0); ok { + out = append(out, d) + } + if d, ok := t.dueFor(ctx, now, m.HostID, "check", m.CheckCron, m.CheckEnabled, m.CheckSubsetPct); ok { + out = append(out, d) + } + } + return out, nil +} + +// dueFor returns true if the given cron has a fire-instant strictly +// after the latest persisted job's created_at and at-or-before now. +// Equivalently: cron.Next(latestJob.CreatedAt) <= now. +// +// First-run case (no prior job): the cron is "due" if its previous +// fire instant is at-or-before now (i.e. cron has fired at least +// once since the schedule was set up). We use a 24h lookback as the +// conservative anchor so a brand-new host doesn't fire 30 days of +// missed checks on first tick. +func (t *Ticker) dueFor(ctx context.Context, now time.Time, hostID, kind, expr string, enabled bool, subset int) (Decision, bool) { + if !enabled || expr == "" { return Decision{}, false } + sched, err := t.parser.Parse(expr) + if err != nil { return Decision{}, false } + j, err := t.backend.LatestJobByKind(ctx, hostID, kind) + var anchor time.Time + if err == nil && j != nil { + anchor = j.CreatedAt + } else { + anchor = now.Add(-24 * time.Hour) + } + next := sched.Next(anchor) + if next.IsZero() || next.After(now) { + return Decision{}, false + } + return Decision{HostID: hostID, Kind: kind, SubsetPct: subset}, true +} +``` + +- [ ] **Step 4: Tests** + +```go +func TestTickerSkipsDisabled(t *testing.T) { + // Maintenance row with all *_enabled = false. Decide returns nothing. +} +func TestTickerFiresWhenOverdue(t *testing.T) { + // Maintenance with forget_cron = "0 3 * * *", latest forget job + // 25 hours ago. Decide returns one forget Decision. +} +func TestTickerSuppressesWhenRecent(t *testing.T) { + // Same cron, latest forget job 2 hours ago (and no 3am tick has + // fallen between then and now). Decide returns nothing. +} +func TestTickerCheckCarriesSubset(t *testing.T) { + // CheckSubsetPct=25. Assert the Decision has SubsetPct=25. +} +func TestTickerHandlesFirstRun(t *testing.T) { + // No prior job. Use a cron that fires every minute ("* * * * *"). + // Assert one Decision is returned. +} +``` + +Use a `fakeBackend` that implements the interface in-test. + +- [ ] **Step 5: Run + commit** + +```bash +go test ./internal/server/maintenance/ -v +git add internal/server/maintenance/ internal/store/jobs.go internal/store/maintenance.go +git commit -m "maintenance: pure-logic ticker (decides forget/prune/check fires)" +``` + +### Task F2: Wire the ticker into cmd/server + +**Files:** +- Modify: `cmd/server/main.go` +- Create: `internal/server/http/maintenance_dispatch.go` — the side-effecting glue (online-check + dispatch + queue-on-offline). + +- [ ] **Step 1: Write the dispatcher glue** + +```go +// maintenance_dispatch.go — bridges the pure-logic ticker to the +// side-effecting world: checks online state, queues offline runs to +// pending_runs, dispatches command.run for online ones, audit-logs +// system actor. +package http + +func (s *Server) DispatchMaintenance(ctx context.Context, decisions []maintenance.Decision) { + for _, d := range decisions { + if !s.deps.Hub.Connected(d.HostID) { + // Maintenance fires we miss when offline DON'T queue — + // they're fire-and-forget cadences (you don't want to run + // 5 missed prunes the moment the host comes back). Just + // log + skip; next tick re-evaluates. + slog.Info("maintenance: host offline, skipping", + "host_id", d.HostID, "kind", d.Kind) + continue + } + var payload api.CommandRunPayload + switch d.Kind { + case "prune": + // Skip if no admin creds (we'd just bounce the agent). + // Surface as a job row with status=failed for visibility. + if _, err := s.deps.Store.GetHostCredentials(ctx, d.HostID, store.CredKindAdmin); err != nil { + slog.Info("maintenance: prune skipped — no admin creds", + "host_id", d.HostID) + continue + } + if err := s.pushAdminCredsToAgent(ctx, d.HostID); err != nil { + slog.Warn("maintenance: push admin creds failed", "err", err) + continue + } + payload = api.CommandRunPayload{RequiresAdminCreds: true} + case "check": + payload = api.CommandRunPayload{Args: []string{strconv.Itoa(d.SubsetPct)}} + case "forget": + // The agent already runs forget per-source-group with + // each group's RetentionPolicy. The server-driven forget + // dispatch ships the host's group set in the payload so + // the agent doesn't need to remember which groups to + // walk. Build it server-side here. + payload = s.buildForgetPayloadForHost(ctx, d.HostID) + if payload.Args == nil { + continue // host has no groups with retention set + } + } + kind := api.JobKind(d.Kind) + _, _, code, msg := s.dispatchJobWithPayload(ctx, nil /*system*/, d.HostID, kind, payload) + if code != "" { + slog.Warn("maintenance: dispatch failed", + "host_id", d.HostID, "kind", d.Kind, "code", code, "msg", msg) + } + } +} +``` + +`buildForgetPayloadForHost` walks `ListSourceGroups(host)`, encodes the per-group retention policies into a structured payload field. This requires extending `CommandRunPayload` with a `ForgetGroups []ForgetGroup` field where `ForgetGroup = {Tag string, Policy RetentionPolicy}`. The agent's `RunForget` handler would then loop over the groups, running `restic forget --tag --keep-* …` for each. **Important: this changes the agent-side `RunForget` semantics.** Bake this in here: + +- Extend `internal/api/messages.go` — add `ForgetGroup` and `CommandRunPayload.ForgetGroups`. +- Extend `cmd/agent/main.go` JobForget case — if `ForgetGroups` is set, loop. Backwards compatible: if `RetentionPolicy` is set instead, run the single-policy form (the operator-triggered single-group path through `dispatchScheduledJob` is gone in this redesign — backups don't dispatch forget — so the operator-triggered single-policy path is from the maintenance run-now button only, not relevant any more). +- Extend `internal/agent/runner/runner.go` `RunForget` to take `[]ForgetGroup` and loop. + +For maintenance, prefer the multi-group shape; remove the old `RetentionPolicy` field from `CommandRunPayload` after migration if no callers remain. + +- [ ] **Step 2: Wire the goroutine in cmd/server/main.go** + +In the existing background-sweepers goroutine (or as a sibling — keep it readable), add a 60-second ticker: + +```go +maintenanceTick := time.NewTicker(60 * time.Second) +defer maintenanceTick.Stop() + +ticker := maintenance.New(st) + +// In the select: +case <-maintenanceTick.C: + decisions, err := ticker.Decide(ctx, time.Now().UTC()) + if err != nil { + slog.Warn("maintenance ticker: decide", "err", err) + continue + } + if len(decisions) > 0 { + slog.Info("maintenance ticker: dispatching", "n", len(decisions)) + srv.DispatchMaintenance(ctx, decisions) + } +``` + +`srv` needs an exported `DispatchMaintenance` method on `*Server` (or expose it via a tiny interface to keep wiring honest). + +- [ ] **Step 3: Test the dispatcher (integration shape)** + +In `internal/server/http/repo_ops_test.go` or a new `maintenance_dispatch_test.go`, drive a fake hub + fake store: +- One host online, forget enabled, no recent forget job → assert command.run shipped. +- One host offline → assert no command.run, no audit, just a log. +- Prune dispatch with no admin creds → assert skipped silently. + +- [ ] **Step 4: Smoke** + +`make build && `. Set a host's forget cadence to `* * * * *` (every minute) and watch the server logs for "maintenance: dispatching" within ~60s. Open the host's Jobs UI and verify a forget job appeared with `actor_kind=schedule` (or a new system actor — pick one consistently). + +- [ ] **Step 5: Commit** + +```bash +git add cmd/server/main.go internal/server/http/maintenance_dispatch.go internal/api/messages.go internal/agent/runner/runner.go cmd/agent/main.go +git commit -m "server: maintenance ticker drives forget/prune/check on cadence" +``` + +--- + +## Slice G — Pending-runs queue worker (P2R-08) + +### Task G1: Enqueue on schedule.fire when host offline + +**Files:** +- Modify: `internal/server/http/schedule_push.go` (the dispatchScheduledJob path) +- Modify: `internal/store/sources.go` (or wherever `SchedulesUsingGroup` and similar live) — needs `GetSourceGroup` to expose retry_max + retry_backoff_seconds (likely already does). + +- [ ] **Step 1: Wrap `dispatchBackupForGroup` to enqueue when send fails** + +The current path calls `conn.Send` directly. If `conn.Send` fails (offline), enqueue a row in `pending_runs` instead of just logging. + +Note: schedule.fire comes *from the agent's local cron*, so by the time we're here the agent is connected. But the same dispatch path is reachable from `pushScheduleSetAsync` callers — and from the ticker's Slice F. For ticker's case, we already short-circuit on offline. For the schedule.fire case, the offline race is narrow: agent fired its cron, then disconnected before the round-trip completes. Still worth queuing. + +Better: handle the enqueue inside `dispatchBackupForGroup` itself, on `conn.Send` error: + +```go +if err := conn.Send(sendCtx, env); err != nil { + slog.Warn("schedule.fire: send command.run, queueing for retry", + "host_id", hostID, "schedule_id", scheduleID, "err", err) + backoff := time.Duration(g.RetryBackoffSeconds) * time.Second + _ = s.deps.Store.EnqueuePendingRun(ctx, &store.PendingRun{ + ID: ulid.Make().String(), + ScheduleID: scheduleID, + SourceGroupID: g.ID, + HostID: hostID, + Attempt: 1, + NextAttemptAt: time.Now().UTC().Add(backoff), + ScheduledAt: scheduledAt, + LastError: err.Error(), + }) + return "" +} +``` + +- [ ] **Step 2: Add `ListPendingRunsForHost` to the store** + +```go +func (st *Store) ListPendingRunsForHost(ctx context.Context, hostID string) ([]PendingRun, error) +``` + +`SELECT … WHERE host_id = ? ORDER BY next_attempt_at`. + +- [ ] **Step 3: Test** + +In `internal/server/http/p2r01_ws_test.go` or a fresh test file, simulate a `conn.Send` failure mid-dispatch and assert a pending row landed. + +- [ ] **Step 4: Commit** + +```bash +git add internal/server/http/schedule_push.go internal/store/pending.go +git commit -m "server: enqueue pending_runs when scheduled-job dispatch fails" +``` + +### Task G2: Drain pending_runs on tick + on agent reconnect + +**Files:** +- Create: `internal/server/http/pending_drain.go` +- Create: `internal/server/http/pending_drain_test.go` +- Modify: `internal/server/http/host_credentials.go` (extend `onAgentHello`) +- Modify: `cmd/server/main.go` (add the 30s tick) + +- [ ] **Step 1: Write the drainer** + +```go +// pending_drain.go — drains pending_runs rows that are due. +// +// Two trigger paths: +// 1. The 30s ticker in cmd/server, which sweeps every host with +// due rows. +// 2. onAgentHello, which targets one host and drains *its* rows +// synchronously (so a host coming back online flushes its +// queue before the operator notices). + +package http + +func (s *Server) DrainPending(ctx context.Context, hostID string) { + runs, err := s.deps.Store.ListPendingRunsForHost(ctx, hostID) + if err != nil { + slog.Warn("drain: list pending", "host_id", hostID, "err", err) + return + } + if !s.deps.Hub.Connected(hostID) { return } + for _, p := range runs { + // Walk the schedule + group; if either is gone, drop the row. + sc, err := s.deps.Store.GetSchedule(ctx, hostID, p.ScheduleID) + if err != nil || !sc.Enabled { + _ = s.deps.Store.DeletePendingRun(ctx, p.ID) + continue + } + g, err := s.deps.Store.GetSourceGroup(ctx, hostID, p.SourceGroupID) + if err != nil { + _ = s.deps.Store.DeletePendingRun(ctx, p.ID) + continue + } + if p.Attempt >= g.RetryMax { + slog.Info("drain: abandoning pending run (retry_max exceeded)", + "host_id", hostID, "schedule_id", p.ScheduleID, "group", g.Name, + "attempts", p.Attempt) + _ = s.deps.Store.AppendAudit(ctx, store.AuditEntry{ + ID: ulid.Make().String(), Actor: "system", + Action: "pending_run.abandoned", + TargetKind: ptr("schedule"), TargetID: &p.ScheduleID, + TS: time.Now().UTC(), + }) + _ = s.deps.Store.DeletePendingRun(ctx, p.ID) + continue + } + // Try to dispatch. + conn, ok := s.deps.Hub.GetConn(hostID) + if !ok { return } + jobID := s.dispatchBackupForGroup(ctx, conn, hostID, p.ScheduleID, g, p.ScheduledAt) + if jobID == "" { + // Send failed again — bump attempt with exponential backoff. + backoff := time.Duration(g.RetryBackoffSeconds) * time.Second + backoff *= time.Duration(1 << p.Attempt) + if max := 30 * time.Minute; backoff > max { backoff = max } + _ = s.deps.Store.BumpPendingRunAttempt(ctx, p.ID, + time.Now().UTC().Add(backoff), "dispatch failed on drain") + continue + } + _ = s.deps.Store.DeletePendingRun(ctx, p.ID) + } +} + +// DrainAllDue is the periodic-ticker entrypoint. Loops over hosts +// with due rows and calls DrainPending per host. +func (s *Server) DrainAllDue(ctx context.Context) { + due, err := s.deps.Store.DuePendingRuns(ctx, time.Now().UTC(), 100) + if err != nil { + slog.Warn("drain: list due", "err", err); return + } + seen := make(map[string]bool) + for _, p := range due { + if seen[p.HostID] { continue } + seen[p.HostID] = true + s.DrainPending(ctx, p.HostID) + } +} +``` + +`Hub.GetConn(hostID)` may not exist yet — if so, add it (`hub.go`). Returns the registered `*ws.Conn` or `nil, false`. + +- [ ] **Step 2: Wire `onAgentHello` to drain** + +In `host_credentials.go`, add to `onAgentHello`: + +```go +go s.DrainPending(context.Background(), hostID) +``` + +(Use a fresh context — the hello-bound context is short-lived, and the drain may take a few seconds across many runs.) + +- [ ] **Step 3: Wire the periodic tick in cmd/server/main.go** + +Add a 30s ticker alongside the maintenance one: + +```go +drainTick := time.NewTicker(30 * time.Second) +defer drainTick.Stop() +// case <-drainTick.C: +// srv.DrainAllDue(ctx) +``` + +- [ ] **Step 4: Test** + +```go +func TestDrainOnReconnect(t *testing.T) { + // Stand up server + fake host. Disconnect agent. Schedule fires + // via a fake schedule.fire envelope, races the disconnect, lands + // in pending_runs (assert via store query). Reconnect. Assert + // drain ran: pending_runs empty, command.run was sent. +} + +func TestDrainAbandonsAfterRetryMax(t *testing.T) { + // Set group.retry_max=2; insert a pending row at attempt=2. + // Connect host. DrainPending. Assert row deleted, audit logged, + // no command.run sent. +} + +func TestDrainBumpsAttemptOnSendFail(t *testing.T) { + // Pending row at attempt=1, retry_max=5. Hub.Send rigged to + // fail. DrainPending. Assert row.attempt==2 in DB, + // next_attempt_at moved out by backoff*2. +} +``` + +- [ ] **Step 5: Smoke** + +`make build && `. Disable the agent's WS connection (e.g., `sudo systemctl stop restic-manager-agent`), trigger a schedule fire (or fast-forward via an artificial cron), wait for a row to land in `pending_runs`. Re-start the agent, watch the server log for "drain: dispatched", verify the row is gone and a job ran. + +- [ ] **Step 6: Commit** + +```bash +git add internal/server/http/pending_drain.go internal/server/http/pending_drain_test.go internal/server/ws/hub.go internal/server/http/host_credentials.go cmd/server/main.go +git commit -m "server: drain pending_runs on tick + on agent reconnect" +``` + +--- + +## Slice H — Repo stats wiring (P2R-07) + +### Task H1: Server-side stats.report handler + +**Files:** +- Modify: `internal/server/ws/handler.go` +- Modify: `internal/server/ws/handler_test.go` + +- [ ] **Step 1: Add the dispatch case** + +In `dispatchAgentMessage`, add a case for `MsgStatsReport` that decodes the payload and calls `Store.UpsertHostRepoStats(ctx, hostID, patch)`. Mirror the existing `MsgSnapshotsRpt` handler. + +- [ ] **Step 2: Test** + +Drive a fake stats.report through the handler, assert the row in `host_repo_stats` matches. + +- [ ] **Step 3: Commit** + +```bash +git add internal/server/ws/handler.go internal/server/ws/handler_test.go +git commit -m "server/ws: persist stats.report into host_repo_stats" +``` + +### Task H2: End-to-end smoke + visual check + +- [ ] **Step 1: Build + restage per CLAUDE.md** + +```bash +make build && \ + cp bin/restic-manager-agent /tmp/rm-smoke/data/agent-binaries/restic-manager-agent-linux-amd64 && \ + cp deploy/install/install.sh /tmp/rm-smoke/data/install/install.sh && \ + cp deploy/install/restic-manager-agent.service /tmp/rm-smoke/data/install/restic-manager-agent.service && \ + sudo -n install -m 0755 bin/restic-manager-agent /usr/local/bin/restic-manager-agent && \ + sudo -n install -m 0644 deploy/install/restic-manager-agent.service /etc/systemd/system/restic-manager-agent.service && \ + sudo -n systemctl daemon-reload && sudo -n systemctl restart restic-manager-agent && \ + pkill -f restic-manager-server; sleep 1; \ + RM_LISTEN=:8080 RM_DATA_DIR=/tmp/rm-smoke/data RM_BASE_URL=http://127.0.0.1:8080 \ + RM_SECRET_KEY_FILE=/tmp/rm-smoke/data/secret.key RM_COOKIE_SECURE=false \ + ./bin/restic-manager-server >> /tmp/rm-smoke/server.log 2>&1 & +``` + +- [ ] **Step 2: Playwright sweep** + +Drive a Playwright session against http://127.0.0.1:8080: +1. Log in. +2. Navigate to a host's `/repo` page. +3. Click "Run now → Check". Verify: redirect to /jobs/{id}, live log streams, job ends succeeded. +4. Set admin credentials. Click "Run now → Prune". Verify: redirect, log streams, ends succeeded. +5. Click "Run now → Unlock" — even when there's no lock, restic returns success quickly. Verify behaviour. +6. After all three, refresh `/repo` and verify the stats panel shows "Last check {n} ago · ok", "Last prune {n} ago", and total size matches what `restic stats` would return. +7. Force a fake lock via direct restic call against the smoke repo, run check, verify the lock-detected banner appears. + +Save screenshots to `_diag/p2r-phase5-sweep/`. + +- [ ] **Step 3: Commit screenshots + a short note** + +```bash +git add _diag/p2r-phase5-sweep/ +git commit -m "diag: phase 5 Playwright sweep screenshots" +``` + +### Task H3: Mark the tasks complete in tasks.md + +- [ ] **Step 1: Tick P2R-03 through P2R-08** + +In `tasks.md`, change `- [ ] **P2R-03**` (and -04, -05, -06, -07, -08) to `- [x]` and append a one-line note pointing at the commit range or PR URL. Mirror P2R-01 / P2R-02's done-format. + +- [ ] **Step 2: Commit** + +```bash +git add tasks.md +git commit -m "tasks: mark P2R-03..P2R-08 done" +``` + +--- + +## Self-Review + +**Spec coverage check:** +- P2R-03 (prune) — Slices B (RunPrune), C (agent runner + admin slot), D (HTTP run-now), E (UI button + admin creds form), F (cadence-driven). ✓ +- P2R-04 (check + subset) — Slices B (RunCheck), C (agent runner), D (HTTP run-now), E (UI button), F (cadence-driven). ✓ +- P2R-05 (unlock) — Slices B (RunUnlock), C (agent runner), D (HTTP run-now), E (UI button + lock banner via stats panel). ✓ +- P2R-06 (server-side maintenance ticker) — Slice F. ✓ +- P2R-07 (repo stats panel) — stats.report wire (C1), agent reportStats (C3), server handler (H1), UI panel (E1). ✓ +- P2R-08 (pending-runs queue) — Slices G1 + G2. ✓ + +**Placeholder scan:** No "TODO/TBD/etc" in any step body. Each step shows actual code or an exact existing-file reference for the engineer to mirror. + +**Type consistency:** `CredentialKind` constants and `host_credentials.kind` column align (`'repo' | 'admin'`). `ConfigUpdatePayload.Slot` uses the same string values. `StatsReportPayload` field names match `HostRepoStats` columns 1:1 (camelCase ↔ snake_case via JSON tags). `JobKind` constants are used unchanged from `internal/api/messages.go:50-63`. Maintenance `Decision.Kind` uses string-match (`"forget"|"prune"|"check"`) to match the `kind` column in `jobs` and the `host_repo_maintenance.*_cron` field naming. + +**Open scope question:** the `RetentionPolicy` field on `CommandRunPayload` becomes dead weight once forget runs in multi-group form. Slice F task F2 step 1 calls this out — leave the deletion as a follow-up cleanup commit at the end, after the agent code stops reading it. Don't drop it inline because the wire-shape pin tests need to be updated atomically.