Files

T

steve b640775a61 plan: P2 redesign Phase 5 (P2R-03..P2R-08)

2026-05-04 10:19:15 +01:00

66 KiB

Raw Blame History

P2 Redesign — Phase 5 Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Land the operator-facing prune/check/unlock surface, the server-side maintenance ticker that fires forget/prune/check on cadence, the offline-retry queue worker, and the repo-stats panel that surfaces the result. Closes P2R-03, P2R-04, P2R-05, P2R-06, P2R-07, P2R-08.

Architecture:

New restic-wrapper functions (RunPrune, RunCheck, RunUnlock, RunStats) mirror RunForget's pumpPlain pattern. Agent-side runners mirror RunForget's envelope shape (runner.RunPrune, RunCheck, RunUnlock). Wire envelopes already include the JobKind constants — no protocol bump.
A second per-host credential row (host_credentials.kind = 'admin') carries the prune-capable creds; pushed to the agent only when dispatching a job that needs it (prune).
Server-side maintenance runs from a single goroutine in cmd/server/main.go, ticking every 60s. For each host_repo_maintenance row it computes the most-recent cron-fire instant for each kind, compares to the latest persisted job's created_at, and dispatches if due. Offline hosts queue to pending_runs.
A second background goroutine drains pending_runs every 30s plus on agent reconnect (via onAgentHello).
Repo stats are a singleton-per-host projection (host_repo_stats table). Agent ships stats.report after every successful backup (mirrors snapshots.report plumbing). UI reads it on the Repo page.

Tech Stack: Go (server + agent), SQLite (modernc.org/sqlite), github.com/robfig/cron/v3 for cron parsing (already a dep), chi for routing, html/template + HTMX for UI, Playwright for end-to-end smoke.

File Structure

New files:

internal/store/migrations/0009_admin_creds_and_repo_stats.sql — schema for host_credentials.kind, new host_repo_stats table.
internal/store/host_repo_stats.go — CRUD for the stats projection.
internal/server/maintenance/ticker.go — pure-logic maintenance scheduler (parseable, testable without a server).
internal/server/maintenance/ticker_test.go
internal/server/http/repo_ops.go — POST /api/hosts/{id}/repo/{prune,check,unlock} handlers + their HTMX-form siblings under the UI tree.
internal/server/http/repo_ops_test.go
internal/server/http/pending_drain.go — the offline-queue drain logic (server-side, called on tick + on connect).
internal/server/http/pending_drain_test.go

Modified files:

internal/restic/runner.go — add RunPrune, RunCheck, RunUnlock, RunStats. Add LockState parsing helper for check stderr.
internal/api/messages.go — add MsgStatsReport constant + StatsReportPayload. Add RequiresAdminCreds field to CommandRunPayload.
internal/api/wire.go — register the new message type.
internal/agent/runner/runner.go — add RunPrune, RunCheck, RunUnlock. Add reportStats (mirrors reportSnapshots).
cmd/agent/main.go — wire new JobKind cases into the dispatcher; load admin creds slot from secrets when RequiresAdminCreds=true.
internal/agent/secrets/secrets.go — extend the on-disk shape to carry both repo and admin blobs (one file, two named slots).
internal/server/http/host_credentials.go — extend the encrypted blob to support an optional admin-creds field; add PUT /api/hosts/{id}/admin-credentials. Adjust pushRepoCredsToAgent to take a kind argument; add pushAdminCredsToAgent for on-demand admin-cred pushes.
internal/server/http/server.go — register the new routes.
internal/server/http/jobs.go — extend dispatchJobWithPayload to optionally push admin creds first when kind is prune.
internal/server/ws/handler.go — handle MsgStatsReport (persist + broadcast).
internal/store/jobs.go — add LatestJobByKind (returns the most recent terminal job of a kind, used by the ticker).
internal/store/pending.go — add ListPendingRunsForHost (used by the on-reconnect drain).
internal/store/maintenance.go — no schema, but add ListAllMaintenance(ctx) (the ticker iterates every host).
cmd/server/main.go — wire the maintenance ticker + pending-drain goroutine.
web/templates/pages/host_repo.html — add three "one-time" Run-now buttons (prune / check / unlock), an admin-creds form, a stats panel.
internal/server/http/ui_repo.go — render the new panels; pre-fill the admin-creds form from a redacted view.
web/templates/partials/host_chrome.html — surface "lock detected" banner if host_repo_stats.lock_present.

Deletes: none.

Slice A — Schema groundwork (admin creds + stats projection)

Task A1: Add migration 0009 for admin-credentials column + repo-stats table

Files:

Create: internal/store/migrations/0009_admin_creds_and_repo_stats.sql
Step 1: Write the migration

-- 0009_admin_creds_and_repo_stats.sql
--
-- Phase 5 of the P2 redesign needs two things in the schema:
--
-- 1. A second credential row per host. Today host_credentials is
--    1:1 with hosts. For prune (and any future destructive op) we
--    want a rest-server admin user whose password gives delete
--    access — separate from the append-only user used on every
--    backup. Add a `kind` column with default 'repo'; existing rows
--    become kind='repo'. Future admin rows live alongside.
--
-- 2. A small singleton-per-host projection for repo size, snapshot
--    count, last-prune freed bytes, lock state, and last-check
--    result. Backed by `restic stats --json` + sniffed `restic
--    check` stderr.
--
-- Use column-level ALTERs only; host_credentials has no inbound
-- FKs but the rule from CLAUDE.md still applies.

ALTER TABLE host_credentials ADD COLUMN kind TEXT NOT NULL DEFAULT 'repo';

-- The PK on host_credentials is currently (host_id) — we need a
-- composite (host_id, kind). SQLite has no ALTER TABLE …
-- ADD/CHANGE PRIMARY KEY, so this is the one place a rebuild is
-- justified. host_credentials has no inbound FKs, so the cascade
-- trap doesn't apply here. Verified against schema/0002.

CREATE TABLE host_credentials_new (
  host_id        TEXT NOT NULL REFERENCES hosts(id) ON DELETE CASCADE,
  kind           TEXT NOT NULL DEFAULT 'repo'
                   CHECK (kind IN ('repo', 'admin')),
  enc_repo_creds TEXT NOT NULL,
  updated_at     TEXT NOT NULL,
  PRIMARY KEY (host_id, kind)
);
INSERT INTO host_credentials_new (host_id, kind, enc_repo_creds, updated_at)
  SELECT host_id, kind, enc_repo_creds, updated_at FROM host_credentials;
DROP TABLE host_credentials;
ALTER TABLE host_credentials_new RENAME TO host_credentials;

-- Repo stats projection. One row per host, upserted by the agent's
-- stats.report envelope (which fires after every successful backup
-- and after every check / prune). All fields nullable so a freshly
-- enrolled host with no jobs yet is representable.

CREATE TABLE host_repo_stats (
  host_id                 TEXT PRIMARY KEY REFERENCES hosts(id) ON DELETE CASCADE,
  total_size_bytes        INTEGER,
  raw_size_bytes          INTEGER,
  unique_files            INTEGER,
  snapshot_count          INTEGER,
  last_check_at           TEXT,
  last_check_status       TEXT, -- 'ok' | 'errors_found' | 'failed'
  lock_present            INTEGER NOT NULL DEFAULT 0,
  last_prune_at           TEXT,
  last_prune_freed_bytes  INTEGER,
  updated_at              TEXT NOT NULL
);

Step 2: Run migrations against an empty DB and verify schema

Run: go test ./internal/store/... -run TestMigrations -v Expected: PASS, or write a fresh TestMigration0009 if no such test harness exists yet (mirror internal/store/migrations_test.go if present, else add a minimal one that opens an in-memory DB and asserts the new tables exist).

Step 3: Commit

git add internal/store/migrations/0009_admin_creds_and_repo_stats.sql
git commit -m "store: migration 0009 — admin-creds kind + host_repo_stats"

Task A2: Extend the host_credentials store API to be kind-aware

Files:

Modify: internal/store/host_credentials.go
Test: internal/store/host_credentials_test.go
Step 1: Replace the two existing functions with kind-aware versions

Existing GetHostCredentials(ctx, hostID) and SetHostCredentials(ctx, hostID, blob) become:

type CredentialKind string

const (
    CredKindRepo  CredentialKind = "repo"
    CredKindAdmin CredentialKind = "admin"
)

func (s *Store) GetHostCredentials(ctx context.Context, hostID string, kind CredentialKind) (string, error) {
    row := s.db.QueryRowContext(ctx,
        `SELECT enc_repo_creds FROM host_credentials WHERE host_id = ? AND kind = ?`,
        hostID, string(kind))
    var enc string
    if err := row.Scan(&enc); err != nil {
        if errors.Is(err, sql.ErrNoRows) {
            return "", ErrNotFound
        }
        return "", fmt.Errorf("store: get host credentials: %w", err)
    }
    return enc, nil
}

func (s *Store) SetHostCredentials(ctx context.Context, hostID string, kind CredentialKind, encRepoCreds string) error {
    if encRepoCreds == "" {
        return fmt.Errorf("store: empty enc_repo_creds")
    }
    now := time.Now().UTC().Format(time.RFC3339Nano)
    _, err := s.db.ExecContext(ctx,
        `INSERT INTO host_credentials (host_id, kind, enc_repo_creds, updated_at)
         VALUES (?, ?, ?, ?)
         ON CONFLICT(host_id, kind) DO UPDATE SET
            enc_repo_creds = excluded.enc_repo_creds,
            updated_at     = excluded.updated_at`,
        hostID, string(kind), encRepoCreds, now)
    if err != nil {
        return fmt.Errorf("store: set host credentials: %w", err)
    }
    return nil
}

func (s *Store) DeleteHostCredentials(ctx context.Context, hostID string, kind CredentialKind) error {
    _, err := s.db.ExecContext(ctx,
        `DELETE FROM host_credentials WHERE host_id = ? AND kind = ?`,
        hostID, string(kind))
    return err
}

Step 2: Update every caller in internal/server/http/

Run: grep -rn "GetHostCredentials\|SetHostCredentials" internal/ cmd/ For each call site, pass store.CredKindRepo as the new arg. Admin variants come later.

Step 3: Add a test for the admin-row variant

In internal/store/host_credentials_test.go, add TestHostCredentialsAdminRowSeparate:

Set repo creds, set admin creds, fetch each by kind, assert blobs differ.
Delete admin, verify repo unaffected.
Step 4: Run tests and verify

Run: go test ./internal/store/... -v Expected: PASS, including the new test.

Step 5: Run the full server tests to surface call-site fallout

Run: go test ./... Expected: PASS. Fix any compile errors at call sites you missed.

Step 6: Commit

git add internal/store/host_credentials.go internal/store/host_credentials_test.go internal/server/http/
git commit -m "store: host_credentials becomes kind-aware (repo + admin slots)"

Task A3: Add the host_repo_stats store API

Files:

Create: internal/store/host_repo_stats.go
Test: internal/store/host_repo_stats_test.go
Step 1: Write the type and CRUD

package store

import (
    "context"
    "database/sql"
    "errors"
    "fmt"
    "time"
)

type HostRepoStats struct {
    HostID               string
    TotalSizeBytes       *int64
    RawSizeBytes         *int64
    UniqueFiles          *int64
    SnapshotCount        *int64
    LastCheckAt          *time.Time
    LastCheckStatus      string  // "" | "ok" | "errors_found" | "failed"
    LockPresent          bool
    LastPruneAt          *time.Time
    LastPruneFreedBytes  *int64
    UpdatedAt            time.Time
}

func (s *Store) GetHostRepoStats(ctx context.Context, hostID string) (*HostRepoStats, error) {
    // SELECT … WHERE host_id = ?; sql.ErrNoRows → ErrNotFound.
    // Mirror existing nullable-time handling in store/jobs.go.
}

// UpsertHostRepoStats writes a partial update — only non-nil fields
// in the input overwrite existing columns. Implemented as a row-fetch
// + merge + INSERT…ON CONFLICT for clarity (versus building a sparse
// UPDATE statement at runtime).
func (s *Store) UpsertHostRepoStats(ctx context.Context, hostID string, patch HostRepoStats) error

Step 2: Write the test

func TestHostRepoStatsRoundTrip(t *testing.T) {
    // Open in-memory store, seed a host, upsert partial fields,
    // re-read, assert.
    // Then upsert a different field, assert the first is preserved.
}

Step 3: Run tests

Run: go test ./internal/store/ -run TestHostRepoStats -v Expected: PASS.

Step 4: Commit

git add internal/store/host_repo_stats.go internal/store/host_repo_stats_test.go
git commit -m "store: HostRepoStats projection (size, lock, last-check, last-prune)"

Slice B — Restic wrapper additions

Task B1: RunPrune

Files:

Modify: internal/restic/runner.go
Test: internal/restic/runner_test.go (create if absent — there's currently only url_test.go)
Step 1: Add RunPrune after RunForget

// RunPrune executes `restic prune` against the configured repo.
// Requires the *admin* credentials (delete access on the rest-server
// repo) — the caller is responsible for populating Env.RepoUsername
// and Env.RepoPassword with the admin pair before calling this.
//
// Prune emits human-readable progress on stdout/stderr (no --json
// support that's useful for our purposes). We tee everything to the
// handler so the live log is the operator's progress bar.
func (e Env) RunPrune(ctx context.Context, handle LineHandler) error {
    cmd := exec.CommandContext(ctx, e.Bin, "prune")
    cmd.Env = e.envSlice()
    cmd.Dir = e.WorkDir
    return runWithPump(cmd, handle)
}

Add a small private helper to DRY the stdout/stderr pump pattern that's now shared by Forget/Init/Prune/Check/Unlock:

// runWithPump starts the configured cmd, fans stdout+stderr into
// pumpPlain, and waits. Errors from the wait are wrapped with the
// cmd args[0] for context.
func runWithPump(cmd *exec.Cmd, handle LineHandler) error {
    label := "restic"
    if len(cmd.Args) > 1 {
        label = "restic " + cmd.Args[1]
    }
    stdout, err := cmd.StdoutPipe()
    if err != nil {
        return fmt.Errorf("%s: stdout pipe: %w", label, err)
    }
    stderr, err := cmd.StderrPipe()
    if err != nil {
        return fmt.Errorf("%s: stderr pipe: %w", label, err)
    }
    if err := cmd.Start(); err != nil {
        return fmt.Errorf("%s: start: %w", label, err)
    }
    done := make(chan error, 2)
    go func() { done <- pumpPlain(stdout, "stdout", handle) }()
    go func() { done <- pumpPlain(stderr, "stderr", handle) }()
    for i := 0; i < 2; i++ {
        if err := <-done; err != nil && handle != nil {
            handle("event", fmt.Sprintf("pump error: %v", err), nil)
        }
    }
    if werr := cmd.Wait(); werr != nil {
        return fmt.Errorf("%s: %w", label, werr)
    }
    return nil
}

Refactor RunForget and RunInit to use runWithPump (init keeps its sniff wrapper). Keep behaviour identical.

Step 2: Write a stub test that exercises arg construction

Since restic isn't on every developer's PATH, the test should construct an Env with Bin = "/bin/echo" (or Bin = "/bin/sh" running a script that echoes args) and assert the handler observes the expected argv. This is the same pattern as any existing wrapper tests; if none exists, model the test on TestRunForgetArgs in another comparable Go project (or just shell out to echo and grep stdout).

func TestRunPruneInvokesPrune(t *testing.T) {
    var captured []string
    h := func(stream, line string, _ any) {
        if stream == "stdout" { captured = append(captured, line) }
    }
    env := restic.Env{Bin: "/bin/echo"}
    if err := env.RunPrune(context.Background(), h); err != nil {
        t.Fatalf("run: %v", err)
    }
    if len(captured) != 1 || captured[0] != "prune" {
        t.Fatalf("expected stdout 'prune', got %v", captured)
    }
}

Step 3: Run tests

Run: go test ./internal/restic/ -v Expected: PASS.

Step 4: Commit

git add internal/restic/runner.go internal/restic/runner_test.go
git commit -m "restic: RunPrune + runWithPump helper, refactor Forget/Init onto it"

Task B2: RunCheck (with subset support + lock-detection)

Files:

Modify: internal/restic/runner.go
Test: internal/restic/runner_test.go
Step 1: Add RunCheck

// CheckResult summarises a `restic check` invocation. LockPresent is
// true if the stderr stream contained "Found stale lock" (caller is
// expected to surface this in the UI so the operator can run unlock).
// ErrorsFound is true if check exited with status 1 (errors detected
// in repo metadata).
type CheckResult struct {
    LockPresent bool
    ErrorsFound bool
}

// RunCheck executes `restic check` with optional --read-data-subset.
// subsetPct of 0 omits the flag (full data check); >0 passes
// --read-data-subset Npct. Returns a CheckResult summarising what
// was sniffed from stderr; the bool is set even if check itself
// returns an error (so the caller can persist the lock-state).
func (e Env) RunCheck(ctx context.Context, subsetPct int, handle LineHandler) (CheckResult, error) {
    args := []string{"check"}
    if subsetPct > 0 {
        args = append(args, "--read-data-subset", fmt.Sprintf("%d%%", subsetPct))
    }
    cmd := exec.CommandContext(ctx, e.Bin, args...)
    cmd.Env = e.envSlice()
    cmd.Dir = e.WorkDir

    var res CheckResult
    sniff := func(stream, line string, ev any) {
        if stream == "stderr" {
            if strings.Contains(line, "Found stale lock") || strings.Contains(line, "locked") {
                res.LockPresent = true
            }
        }
        if handle != nil {
            handle(stream, line, ev)
        }
    }

    err := runWithPumpHandler(cmd, sniff)
    if err != nil {
        // restic check exits 1 when corruption is found; that's a
        // CheckResult, not a wrapper failure. Treat any non-zero
        // exit as "errors found" but still return the result so the
        // ticker can persist last_check_status='errors_found'.
        var ee *exec.ExitError
        if errors.As(err, &ee) {
            res.ErrorsFound = true
            return res, nil
        }
        return res, err
    }
    return res, nil
}

runWithPumpHandler is a tiny variant that takes the LineHandler directly (so the sniff wrapper can intercept) — split from runWithPump (which currently constructs nothing). Implement as the existing function but with the wrapped handler.

Step 2: Test

func TestRunCheckParsesLock(t *testing.T) {
    // Use /bin/sh -c to emit a known stderr line containing
    // "Found stale lock" and exit 0. Assert LockPresent=true.
}

func TestRunCheckErrorsFoundOnExit1(t *testing.T) {
    // /bin/sh -c "exit 1". Assert err==nil, ErrorsFound=true.
}

Step 3: Run tests

Run: go test ./internal/restic/ -v Expected: PASS.

Step 4: Commit

git add internal/restic/runner.go internal/restic/runner_test.go
git commit -m "restic: RunCheck with subset% + lock-state sniffing"

Task B3: RunUnlock + RunStats

Files:

Modify: internal/restic/runner.go
Test: internal/restic/runner_test.go
Step 1: Add RunUnlock (trivial)

func (e Env) RunUnlock(ctx context.Context, handle LineHandler) error {
    cmd := exec.CommandContext(ctx, e.Bin, "unlock")
    cmd.Env = e.envSlice()
    cmd.Dir = e.WorkDir
    return runWithPumpHandler(cmd, handle)
}

Step 2: Add RunStats with --json parsing

// RepoStats mirrors `restic stats --json` output (mode=raw-data is
// the most useful — gives total/unique sizes + file counts).
type RepoStats struct {
    TotalSize          int64 `json:"total_size"`
    TotalUncompressed  int64 `json:"total_uncompressed_size"`
    SnapshotsCount     int64 `json:"snapshots_count"`
    TotalFileCount     int64 `json:"total_file_count"`
    TotalBlobCount     int64 `json:"total_blob_count"`
}

// RunStats executes `restic stats --json --mode raw-data` and parses
// the (single-line) JSON response. Tees raw output to handle so the
// caller can still log it.
func (e Env) RunStats(ctx context.Context, handle LineHandler) (*RepoStats, error) {
    cmd := exec.CommandContext(ctx, e.Bin, "stats", "--json", "--mode", "raw-data")
    cmd.Env = e.envSlice()
    cmd.Dir = e.WorkDir

    var out *RepoStats
    capture := func(stream, line string, ev any) {
        if stream == "stdout" && strings.HasPrefix(line, "{") {
            var s RepoStats
            if json.Unmarshal([]byte(line), &s) == nil && s.SnapshotsCount > 0 {
                cp := s
                out = &cp
            }
        }
        if handle != nil {
            handle(stream, line, ev)
        }
    }
    if err := runWithPumpHandler(cmd, capture); err != nil {
        return nil, err
    }
    if out == nil {
        return nil, fmt.Errorf("restic stats: no JSON in output")
    }
    return out, nil
}

Step 3: Tests

func TestRunUnlockInvokesUnlock(t *testing.T) { /* /bin/echo + assert "unlock" arg */ }
func TestRunStatsParsesJSON(t *testing.T) {
    // /bin/sh -c 'echo "{\"total_size\":123,\"snapshots_count\":4, …}"'
    // assert parsed RepoStats matches.
}

Step 4: Run + commit

go test ./internal/restic/ -v
git add internal/restic/runner.go internal/restic/runner_test.go
git commit -m "restic: RunUnlock + RunStats (raw-data mode)"

Slice C — Wire envelopes + agent runners + dispatcher

Task C1: Wire — add stats.report + RequiresAdminCreds

Files:

Modify: internal/api/wire.go, internal/api/messages.go
Test: internal/api/messages_test.go (whatever shape pins the wire today — grep -n "MsgConfigUpdate\|MsgSnapshotsRpt" internal/api/)
Step 1: Add the new message constant

In internal/api/wire.go:

MsgStatsReport MessageType = "stats.report"

In internal/api/messages.go:

// StatsReportPayload — agent ships this after every successful
// backup, prune, or check. Fields are all optional; the server
// upserts only what's populated. Lock state comes only from check;
// freed bytes only from prune.
type StatsReportPayload struct {
    TotalSizeBytes       *int64     `json:"total_size_bytes,omitempty"`
    RawSizeBytes         *int64     `json:"raw_size_bytes,omitempty"`
    UniqueFiles          *int64     `json:"unique_files,omitempty"`
    SnapshotCount        *int64     `json:"snapshot_count,omitempty"`
    LastCheckAt          *time.Time `json:"last_check_at,omitempty"`
    LastCheckStatus      string     `json:"last_check_status,omitempty"`
    LockPresent          *bool      `json:"lock_present,omitempty"`
    LastPruneAt          *time.Time `json:"last_prune_at,omitempty"`
    LastPruneFreedBytes  *int64     `json:"last_prune_freed_bytes,omitempty"`
}

Also extend CommandRunPayload (in the same file):

type CommandRunPayload struct {
    // … existing fields …

    // RequiresAdminCreds tells the agent to load the admin slot of
    // its secrets store rather than the everyday repo slot. Set by
    // the server only for `prune` and operator-triggered `unlock`
    // (kinds that need delete authority on a rest-server repo).
    RequiresAdminCreds bool `json:"requires_admin_creds,omitempty"`
}

Step 2: Pin the JSON shape in tests

Whatever existing test pins the ConfigUpdatePayload / SnapshotsReportPayload shape, add a sibling test for StatsReportPayload and the extended CommandRunPayload (assert the new field marshals omitted-when-zero).

Step 3: Run + commit

go test ./internal/api/ -v
git add internal/api/
git commit -m "api: stats.report envelope + CommandRun.RequiresAdminCreds"

Task C2: Agent secrets — split into repo + admin slots

Files:

Modify: internal/agent/secrets/secrets.go (and any tests)
Modify: cmd/agent/main.go (load path)
Step 1: Read the current shape

Run: cat internal/agent/secrets/secrets.go and identify the on-disk struct and the Load() (Creds, error) API. The current shape carries one set of {URL, Username, Password} — extend to carry both.

Step 2: Extend the on-disk shape

// On-disk JSON (encrypted as a single AEAD blob):
type bundle struct {
    Repo  Creds  `json:"repo,omitempty"`
    Admin *Creds `json:"admin,omitempty"`
}

Public API:

// Load returns the everyday repo slot. Existing callers compile
// unchanged.
func (s *Store) Load() (Creds, error) { /* returns bundle.Repo */ }

// LoadAdmin returns the admin slot, or (Creds{}, ErrNoAdmin) if
// the agent has never received an admin push. Caller decides what
// to do (typically: refuse the prune job with a clear error).
func (s *Store) LoadAdmin() (Creds, error)

// Save replaces the repo slot (admin slot preserved). Used by the
// existing config.update handler.
func (s *Store) Save(c Creds) error

// SaveAdmin replaces the admin slot. Called by the new config.update
// path that ships admin creds.
func (s *Store) SaveAdmin(c Creds) error

Migration path on first boot: if the existing on-disk shape decodes to a flat Creds, treat that as the repo slot and write back the new bundle shape. Mirror the secrets-package migration logic that landed in P1-33.

Step 3: Test

In internal/agent/secrets/secrets_test.go (or whatever the test file is), add TestSecretsAdminSlotIndependent:

Save repo creds. Load admin → ErrNoAdmin.
SaveAdmin. Load admin → those creds. Load → still the original repo creds.
Restart (re-open the store from disk). Both slots survive.
Step 4: Run + commit

go test ./internal/agent/secrets/ -v
git add internal/agent/secrets/
git commit -m "agent/secrets: separate admin slot + back-compat decode"

Task C3: Agent runners — RunPrune, RunCheck, RunUnlock + reportStats

Files:

Modify: internal/agent/runner/runner.go
Test: internal/agent/runner/runner_test.go (create if absent)
Step 1: Add RunPrune

Mirror RunForget (internal/agent/runner/runner.go:220-282) verbatim, swap restic.JobForget/env.RunForget for JobPrune/env.RunPrune. After success, call r.reportStats(ctx, env, statsPatch{LastPruneAt: &now}) — see step 4.

func (r *Runner) RunPrune(ctx context.Context, jobID string) error {
    startedAt := time.Now().UTC()
    startEnv, _ := api.Marshal(api.MsgJobStarted, jobID, api.JobStartedPayload{
        JobID: jobID, Kind: api.JobPrune, StartedAt: startedAt,
    })
    if err := r.tx.Send(startEnv); err != nil {
        slog.Warn("runner: send job.started (prune)", "err", err)
    }
    env := r.resticEnv()
    var seq atomic.Int64
    handle := r.streamHandler(jobID, &seq)

    err := env.RunPrune(ctx, handle)
    finishedAt := time.Now().UTC()
    r.sendFinished(jobID, finishedAt, err)
    if err == nil {
        now := finishedAt
        // Stats refresh — prune freed-bytes is hard to extract
        // reliably from text output; for now just timestamp it
        // and let RunStats refresh size.
        if rerr := r.reportStats(ctx, env, api.StatsReportPayload{LastPruneAt: &now}); rerr != nil {
            slog.Warn("runner: stats.report after prune failed", "job_id", jobID, "err", rerr)
        }
    }
    if err != nil {
        return fmt.Errorf("runner prune: %w", err)
    }
    return nil
}

Pull the duplicated resticEnv()/streamHandler()/sendFinished() out of the existing methods into helpers in the same file. The repeated start-env/finish-env scaffolding across Backup/Init/Forget/Prune/Check/Unlock is exactly the kind of duplication that hurts every time a payload field is added.

Step 2: Add RunCheck

func (r *Runner) RunCheck(ctx context.Context, jobID string, subsetPct int) error {
    // start envelope (kind=check) → run → finished
    // On any outcome, ship stats.report with last_check_at + status + lock_present.
    res, err := env.RunCheck(ctx, subsetPct, handle)
    status := "ok"
    if res.ErrorsFound { status = "errors_found" }
    if err != nil { status = "failed" }
    now := time.Now().UTC()
    lock := res.LockPresent
    _ = r.reportStats(ctx, env, api.StatsReportPayload{
        LastCheckAt: &now, LastCheckStatus: status, LockPresent: &lock,
    })
    // job.finished
}

Step 3: Add RunUnlock

Trivial mirror of RunInit. After success, also fire reportStats with LockPresent: &false so the UI banner clears.

Step 4: Add reportStats

Sibling to reportSnapshots (runner.go:288-317). Build a StatsReportPayload, invoke env.RunStats to fill in size fields if the patch doesn't already cover them, marshal MsgStatsReport, send. Bound by a 60s timeout.

func (r *Runner) reportStats(ctx context.Context, env restic.Env, patch api.StatsReportPayload) error {
    listCtx, cancel := context.WithTimeout(ctx, 60*time.Second)
    defer cancel()
    // Fill in sizes via restic stats unless the caller is just
    // updating a metadata field (e.g. lock-cleared from unlock).
    if patch.TotalSizeBytes == nil {
        if s, err := env.RunStats(listCtx, nil); err == nil {
            patch.TotalSizeBytes = &s.TotalSize
            patch.RawSizeBytes = &s.TotalUncompressed
            patch.UniqueFiles = &s.TotalFileCount
            patch.SnapshotCount = &s.SnapshotsCount
        }
    }
    envOut, err := api.Marshal(api.MsgStatsReport, "", patch)
    if err != nil { return err }
    return r.tx.Send(envOut)
}

Also call reportStats(ctx, env, api.StatsReportPayload{}) from the existing RunBackup success path (next to reportSnapshots) so size refreshes after every backup.

Step 5: Update Config + cmd/agent dispatcher to support admin creds

runner.Config grows an optional AdminUsername/AdminPassword. The dispatcher in cmd/agent/main.go:243-325 builds the runner with the right creds based on payload.RequiresAdminCreds:

case api.JobPrune:
    creds := repoCreds
    if p.RequiresAdminCreds {
        adminCreds, err := d.secrets.LoadAdmin()
        if err != nil {
            return fmt.Errorf("prune: admin creds not configured (server didn't push them)")
        }
        creds = adminCreds
    }
    r := runner.New(runner.Config{
        ResticBin: d.resticBin,
        RepoURL: creds.URL,
        RepoUsername: creds.Username,
        RepoPassword: creds.Password,
    }, tx, time.Second)
    go func() { _ = r.RunPrune(ctx, p.JobID) }()
case api.JobCheck:
    // subset% comes from CommandRunPayload.Args[0] (e.g. "5") to
    // avoid bloating the wire — server stringifies the int from
    // host_repo_maintenance.check_subset_pct.
    subset := 0
    if len(p.Args) > 0 { subset, _ = strconv.Atoi(p.Args[0]) }
    go func() { _ = r.RunCheck(ctx, p.JobID, subset) }()
case api.JobUnlock:
    go func() { _ = r.RunUnlock(ctx, p.JobID) }()

Also: wire MsgConfigUpdate to call secrets.SaveAdmin when the new admin-fields are populated. Extend ConfigUpdatePayload with a Slot discriminator (default empty = repo, "admin" = admin).

// in api/messages.go
type ConfigUpdatePayload struct {
    Slot           string `json:"slot,omitempty"` // "" = repo, "admin" = admin slot
    // … existing fields …
}

Step 6: Test

Add TestRunPruneEnvelopes, TestRunCheckShipsStatsReport, TestRunUnlockClearsLockState — drive the runner with a fake Sender and a Bin = /bin/echo env, capture the envelopes shipped, assert ordering: job.started → log.stream(s) → stats.report (if applicable) → job.finished.

Step 7: Run + commit

go test ./internal/agent/... ./cmd/agent/... -v
git add internal/agent/runner/ internal/agent/secrets/ cmd/agent/main.go internal/api/messages.go
git commit -m "agent: RunPrune/RunCheck/RunUnlock + reportStats + admin-cred slot"

Slice D — Server: HTTP run-now endpoints (P2R-03/04/05)

Task D1: Admin credentials REST + push helper

Files:

Modify: internal/server/http/host_credentials.go
Modify: internal/server/http/server.go
Test: internal/server/http/host_credentials_test.go
Step 1: Add admin-credentials endpoints

Mirror the existing handleGetHostCredentials / handleSetHostCredentials for the admin slot. Reuse the redacted-view shape (URL + username + has_password).

// PUT /api/hosts/{id}/admin-credentials → SetHostCredentials(host, admin, …)
// GET /api/hosts/{id}/admin-credentials → redacted view, or 404 if unset
// DELETE /api/hosts/{id}/admin-credentials → clear the admin slot

Validation: same length / charset rules as repo creds. AEAD additional-data is "host:" + hostID + ":admin" (different AAD per slot so a corrupted DB can't cross-bind).

Audit-log all three.

Step 2: Add pushAdminCredsToAgent

func (s *Server) pushAdminCredsToAgent(ctx context.Context, hostID string) error {
    enc, err := s.deps.Store.GetHostCredentials(ctx, hostID, store.CredKindAdmin)
    if err != nil { return err } // ErrNotFound bubbles
    plain, err := s.deps.AEAD.Decrypt(enc, []byte("host:"+hostID+":admin"))
    if err != nil { return err }
    var blob repoCredsBlob
    if err := json.Unmarshal(plain, &blob); err != nil { return err }
    env, err := api.Marshal(api.MsgConfigUpdate, "", api.ConfigUpdatePayload{
        Slot: "admin",
        RepoURL: blob.RepoURL, RepoUsername: blob.RepoUsername, RepoPassword: blob.RepoPassword,
    })
    if err != nil { return err }
    sendCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
    defer cancel()
    return s.deps.Hub.Send(sendCtx, hostID, env)
}

Step 3: Register routes in server.go

r.Get("/hosts/{id}/admin-credentials", s.handleGetAdminCredentials)
r.Put("/hosts/{id}/admin-credentials", s.handleSetAdminCredentials)
r.Delete("/hosts/{id}/admin-credentials", s.handleDeleteAdminCredentials)

Step 4: Test

Set admin creds, fetch redacted, assert. Clear, assert 404. Encrypt/decrypt round-trip with different AAD.

Step 5: Commit

git add internal/server/http/host_credentials.go internal/server/http/server.go internal/server/http/host_credentials_test.go
git commit -m "server: admin-credentials REST + push helper"

Task D2: HTTP run-now for prune / check / unlock

Files:

Create: internal/server/http/repo_ops.go
Create: internal/server/http/repo_ops_test.go
Modify: internal/server/http/server.go
Step 1: Write the handlers

// repo_ops.go — operator-triggered Run-now for repo-level
// operations: prune, check, unlock. Backed by the same
// dispatchJobWithPayload pipeline as backup, with an extra step
// for prune: push the admin creds first if they're set, refuse
// loudly if they aren't.
package http

func (s *Server) handleRunRepoPrune(w stdhttp.ResponseWriter, r *stdhttp.Request) {
    user, ok := s.requireUser(r)
    if !ok { /* same auth handling as run_group.go */ return }
    hostID := chi.URLParam(r, "id")

    // Admin creds are required. Push them down before dispatching.
    if err := s.pushAdminCredsToAgent(r.Context(), hostID); err != nil {
        if errors.Is(err, store.ErrNotFound) {
            s.runOpError(w, r, stdhttp.StatusBadRequest, "admin_creds_required",
                "set admin credentials on the Repo page before running prune")
            return
        }
        s.runOpError(w, r, stdhttp.StatusServiceUnavailable, "host_offline",
            "agent not connected; reconnect and try again")
        return
    }

    res, status, code, msg := s.dispatchJobWithPayload(r.Context(), user, hostID, api.JobPrune,
        api.CommandRunPayload{RequiresAdminCreds: true})
    if code != "" { s.runOpError(w, r, status, code, msg); return }
    s.runOpRedirect(w, r, res)
}

func (s *Server) handleRunRepoCheck(w stdhttp.ResponseWriter, r *stdhttp.Request) {
    // Resolve subset% from host_repo_maintenance for this host (or
    // optional ?subset=N override on the query string for ad-hoc).
    m, _ := s.deps.Store.GetRepoMaintenance(r.Context(), hostID)
    subset := m.CheckSubsetPct
    if q := r.URL.Query().Get("subset"); q != "" { /* parse + clamp */ }

    res, status, code, msg := s.dispatchJobWithPayload(r.Context(), user, hostID, api.JobCheck,
        api.CommandRunPayload{Args: []string{strconv.Itoa(subset)}})
    // …
}

func (s *Server) handleRunRepoUnlock(w stdhttp.ResponseWriter, r *stdhttp.Request) {
    // No admin creds required (unlock works with the everyday user).
    res, status, code, msg := s.dispatchJobWithPayload(r.Context(), user, hostID, api.JobUnlock,
        api.CommandRunPayload{})
    // …
}

func (s *Server) runOpRedirect(...) { /* HX-Redirect to /jobs/{id} for HTMX, JSON otherwise. Mirror run_group.go */ }
func (s *Server) runOpError(...)    { /* mirror run_group.go runGroupError */ }

Step 2: Register routes

In server.go, both inside /api and at the outer router for HTMX:

// Inside r.Route("/api", …):
r.Post("/hosts/{id}/repo/prune",  s.handleRunRepoPrune)
r.Post("/hosts/{id}/repo/check",  s.handleRunRepoCheck)
r.Post("/hosts/{id}/repo/unlock", s.handleRunRepoUnlock)

// Outer (for HTMX form posts):
r.Post("/hosts/{id}/repo/prune",  s.handleRunRepoPrune)
r.Post("/hosts/{id}/repo/check",  s.handleRunRepoCheck)
r.Post("/hosts/{id}/repo/unlock", s.handleRunRepoUnlock)

Step 3: Test

In repo_ops_test.go:

func TestRunPruneRefusesWithoutAdminCreds(t *testing.T) {
    // Stand up testServer with a connected fake host, no admin creds.
    // POST /api/hosts/{id}/repo/prune; assert 400 + admin_creds_required.
    // No job row created.
}

func TestRunPruneShipsConfigUpdateThenCommandRun(t *testing.T) {
    // Set admin creds on the host; connect a fake WS; POST prune.
    // Drain the conn; assert config.update(slot=admin) → command.run(prune,RequiresAdminCreds=true).
}

func TestRunCheckPullsSubsetFromMaintenanceRow(t *testing.T) {
    // Update maintenance row, check_subset_pct=25; POST /repo/check.
    // Assert command.run.Args == ["25"].
}

func TestRunUnlockNeedsNoAdminCreds(t *testing.T) {
    // No admin creds set; POST /repo/unlock; assert 202 + command.run shipped.
}

Step 4: Run + commit

go test ./internal/server/http/ -v -run TestRunRepo -run TestRunPrune -run TestRunCheck -run TestRunUnlock
git add internal/server/http/repo_ops.go internal/server/http/repo_ops_test.go internal/server/http/server.go
git commit -m "server: HTTP run-now for prune / check / unlock"

Slice E — UI: Repo page additions

Task E1: Render admin-creds form + one-time maintenance buttons + stats panel skeleton

Files:

Modify: web/templates/pages/host_repo.html
Modify: internal/server/http/ui_repo.go
Step 1: Read the current Repo page to understand sections

Run: cat web/templates/pages/host_repo.html and identify section anchors. Also read internal/server/http/ui_repo.go to see what data the renderer hands to the template.

Step 2: Add the admin-creds form

In the Connection section (right after the existing repo creds form), add a sibling form for admin creds. Use a collapsible/optional pattern with a clear note:

<section class="…">
  <h3>Admin credentials <span class="text-sm">(required for prune)</span></h3>
  <p class="…">Only needed for rest-server repos that distinguish an
    append-only user (everyday backups) from a delete-capable user
    (prune). For S3/B2/local, leave blank.</p>
  <form hx-put="/api/hosts/{{.Host.ID}}/admin-credentials" …>
    <!-- URL is shared with the repo creds — only username + password fields here. -->
    <input name="username" value="{{.AdminCreds.Username}}" …>
    <input type="password" name="password" placeholder="{{if .AdminCreds.HasPassword}}(set){{else}}(unset){{end}}" …>
    <button>Save</button>
    {{if .AdminCreds.HasPassword}}
      <button hx-delete="/api/hosts/{{.Host.ID}}/admin-credentials" hx-confirm="Clear admin credentials?" class="…">Clear</button>
    {{end}}
  </form>
</section>

The handler handleUIHostRepo in ui_repo.go needs to pre-fill AdminCreds from GetHostCredentials(host, admin) using a redacted view.

Step 3: Add one-time maintenance buttons

Below the cadence rows in the Maintenance section, add a "Run now" subsection with three buttons. Mirror the per-source-group Run-now pattern (inline form, hx-post, HX-Redirect to /jobs/{id} for live tailing).

<section class="border-t pt-4">
  <h3 class="text-sm font-medium">Run now</h3>
  <div class="grid grid-cols-3 gap-2 mt-2">
    <form method="post" action="/hosts/{{.Host.ID}}/repo/check" hx-post="/hosts/{{.Host.ID}}/repo/check" hx-confirm="Run check now ({{.Maintenance.CheckSubsetPct}}% data subset)?">
      <button class="btn btn-secondary" {{if not .Online}}disabled title="agent offline"{{end}}>Check</button>
    </form>
    <form method="post" action="/hosts/{{.Host.ID}}/repo/prune" hx-post="/hosts/{{.Host.ID}}/repo/prune" hx-confirm="Run prune now? Removes data not referenced by any snapshot.">
      <button class="btn btn-secondary" {{if not .HasAdminCreds}}disabled title="set admin credentials first"{{else if not .Online}}disabled title="agent offline"{{end}}>Prune</button>
    </form>
    <form method="post" action="/hosts/{{.Host.ID}}/repo/unlock" hx-post="/hosts/{{.Host.ID}}/repo/unlock" hx-confirm="Clear stale repo locks?">
      <button class="btn btn-secondary" {{if not .Online}}disabled title="agent offline"{{end}}>Unlock</button>
    </form>
  </div>
</section>

HasAdminCreds, Online come from the renderer.

Step 4: Render the stats panel (data fields wired up)

In the right rail (replace or augment the existing snapshots-by-tag breakdown), add a stats card backed by host_repo_stats:

<section class="…">
  <h3>Repo stats</h3>
  <dl class="…">
    <div><dt>Total size</dt><dd>{{humanBytes .Stats.TotalSizeBytes}}</dd></div>
    <div><dt>Snapshots</dt><dd>{{.Stats.SnapshotCount}}</dd></div>
    <div><dt>Last check</dt><dd>{{if .Stats.LastCheckAt}}{{ago .Stats.LastCheckAt}} · {{.Stats.LastCheckStatus}}{{else}}—{{end}}</dd></div>
    <div><dt>Last prune</dt><dd>{{if .Stats.LastPruneAt}}{{ago .Stats.LastPruneAt}}{{else}}—{{end}}</dd></div>
    {{if .Stats.LockPresent}}
    <div class="text-amber-700">⚠ Stale lock detected — run Unlock above.</div>
    {{end}}
  </dl>
</section>

humanBytes and ago are template helpers already in this codebase (or add them in internal/server/ui/). Check web/templates/partials/host_row.html for prior art before coining new ones.

Step 5: Update handleUIHostRepo to populate the new view fields

Add to whatever struct backs the page render: AdminCreds, HasAdminCreds, Online, Stats *store.HostRepoStats. Default to a zero-valued stats struct when the row doesn't exist yet (so the template doesn't NPE).

Step 6: Manual smoke check

Run: make build && <restage block from CLAUDE.md> then visit /hosts/<id>/repo in a browser; verify all three sections render, buttons disable correctly when offline / when admin creds are missing.

Step 7: Commit

git add web/templates/pages/host_repo.html internal/server/http/ui_repo.go
git commit -m "ui: Repo page — admin creds form, one-time maintenance, stats panel"

Slice F — Server-side maintenance ticker (P2R-06)

Task F1: Pure-logic ticker package

Files:

Create: internal/server/maintenance/ticker.go
Create: internal/server/maintenance/ticker_test.go
Step 1: Add LatestJobByKind to the store

In internal/store/jobs.go, add:

// LatestJobByKind returns the most recent terminal (succeeded or
// failed) job of the given kind for the host, or (nil, ErrNotFound)
// if there isn't one. Used by the maintenance ticker to decide
// whether a cron tick needs dispatching.
func (s *Store) LatestJobByKind(ctx context.Context, hostID, kind string) (*Job, error)

ORDER BY created_at DESC LIMIT 1 over WHERE host_id=? AND kind=? AND status IN ('succeeded','failed','cancelled').

Step 2: Add ListAllMaintenance(ctx) to store

In internal/store/maintenance.go, return all rows (one per host) with their host_id. Used by the ticker to iterate.

Step 3: Write the pure-logic ticker

// Package maintenance owns the server-side scheduler that fires
// forget/prune/check on the cadences operators set on
// host_repo_maintenance rows. The ticker is intentionally
// side-effect-free at the package boundary: it asks an injected
// Backend for current state and emits a list of Decisions for the
// caller to act on. Easy to unit-test without a running server.
package maintenance

import (
    "context"
    "fmt"
    "time"

    "github.com/robfig/cron/v3"

    "gitea.dcglab.co.uk/steve/restic-manager/internal/store"
)

type Decision struct {
    HostID    string
    Kind      string  // "forget" | "prune" | "check"
    SubsetPct int     // populated when Kind == "check"
}

type Backend interface {
    ListAllMaintenance(ctx context.Context) ([]store.HostRepoMaintenance, error)
    LatestJobByKind(ctx context.Context, hostID, kind string) (*store.Job, error)
}

type Ticker struct {
    backend Backend
    parser  cron.Parser
}

func New(b Backend) *Ticker {
    return &Ticker{
        backend: b,
        parser:  cron.NewParser(cron.Minute | cron.Hour | cron.Dom | cron.Month | cron.Dow),
    }
}

// Decide returns the set of jobs the ticker would dispatch right
// now. Called by the goroutine in cmd/server every minute. The
// caller is responsible for: checking host online state, queuing
// to pending_runs if offline, persisting the job row, and shipping
// command.run.
func (t *Ticker) Decide(ctx context.Context, now time.Time) ([]Decision, error) {
    rows, err := t.backend.ListAllMaintenance(ctx)
    if err != nil { return nil, err }
    var out []Decision
    for _, m := range rows {
        if d, ok := t.dueFor(ctx, now, m.HostID, "forget", m.ForgetCron, m.ForgetEnabled, 0); ok {
            out = append(out, d)
        }
        if d, ok := t.dueFor(ctx, now, m.HostID, "prune", m.PruneCron, m.PruneEnabled, 0); ok {
            out = append(out, d)
        }
        if d, ok := t.dueFor(ctx, now, m.HostID, "check", m.CheckCron, m.CheckEnabled, m.CheckSubsetPct); ok {
            out = append(out, d)
        }
    }
    return out, nil
}

// dueFor returns true if the given cron has a fire-instant strictly
// after the latest persisted job's created_at and at-or-before now.
// Equivalently: cron.Next(latestJob.CreatedAt) <= now.
//
// First-run case (no prior job): the cron is "due" if its previous
// fire instant is at-or-before now (i.e. cron has fired at least
// once since the schedule was set up). We use a 24h lookback as the
// conservative anchor so a brand-new host doesn't fire 30 days of
// missed checks on first tick.
func (t *Ticker) dueFor(ctx context.Context, now time.Time, hostID, kind, expr string, enabled bool, subset int) (Decision, bool) {
    if !enabled || expr == "" { return Decision{}, false }
    sched, err := t.parser.Parse(expr)
    if err != nil { return Decision{}, false }
    j, err := t.backend.LatestJobByKind(ctx, hostID, kind)
    var anchor time.Time
    if err == nil && j != nil {
        anchor = j.CreatedAt
    } else {
        anchor = now.Add(-24 * time.Hour)
    }
    next := sched.Next(anchor)
    if next.IsZero() || next.After(now) {
        return Decision{}, false
    }
    return Decision{HostID: hostID, Kind: kind, SubsetPct: subset}, true
}

Step 4: Tests

func TestTickerSkipsDisabled(t *testing.T) {
    // Maintenance row with all *_enabled = false. Decide returns nothing.
}
func TestTickerFiresWhenOverdue(t *testing.T) {
    // Maintenance with forget_cron = "0 3 * * *", latest forget job
    // 25 hours ago. Decide returns one forget Decision.
}
func TestTickerSuppressesWhenRecent(t *testing.T) {
    // Same cron, latest forget job 2 hours ago (and no 3am tick has
    // fallen between then and now). Decide returns nothing.
}
func TestTickerCheckCarriesSubset(t *testing.T) {
    // CheckSubsetPct=25. Assert the Decision has SubsetPct=25.
}
func TestTickerHandlesFirstRun(t *testing.T) {
    // No prior job. Use a cron that fires every minute ("* * * * *").
    // Assert one Decision is returned.
}

Use a fakeBackend that implements the interface in-test.

Step 5: Run + commit

go test ./internal/server/maintenance/ -v
git add internal/server/maintenance/ internal/store/jobs.go internal/store/maintenance.go
git commit -m "maintenance: pure-logic ticker (decides forget/prune/check fires)"

Task F2: Wire the ticker into cmd/server

Files:

Modify: cmd/server/main.go
Create: internal/server/http/maintenance_dispatch.go — the side-effecting glue (online-check + dispatch + queue-on-offline).
Step 1: Write the dispatcher glue

// maintenance_dispatch.go — bridges the pure-logic ticker to the
// side-effecting world: checks online state, queues offline runs to
// pending_runs, dispatches command.run for online ones, audit-logs
// system actor.
package http

func (s *Server) DispatchMaintenance(ctx context.Context, decisions []maintenance.Decision) {
    for _, d := range decisions {
        if !s.deps.Hub.Connected(d.HostID) {
            // Maintenance fires we miss when offline DON'T queue —
            // they're fire-and-forget cadences (you don't want to run
            // 5 missed prunes the moment the host comes back). Just
            // log + skip; next tick re-evaluates.
            slog.Info("maintenance: host offline, skipping",
                "host_id", d.HostID, "kind", d.Kind)
            continue
        }
        var payload api.CommandRunPayload
        switch d.Kind {
        case "prune":
            // Skip if no admin creds (we'd just bounce the agent).
            // Surface as a job row with status=failed for visibility.
            if _, err := s.deps.Store.GetHostCredentials(ctx, d.HostID, store.CredKindAdmin); err != nil {
                slog.Info("maintenance: prune skipped — no admin creds",
                    "host_id", d.HostID)
                continue
            }
            if err := s.pushAdminCredsToAgent(ctx, d.HostID); err != nil {
                slog.Warn("maintenance: push admin creds failed", "err", err)
                continue
            }
            payload = api.CommandRunPayload{RequiresAdminCreds: true}
        case "check":
            payload = api.CommandRunPayload{Args: []string{strconv.Itoa(d.SubsetPct)}}
        case "forget":
            // The agent already runs forget per-source-group with
            // each group's RetentionPolicy. The server-driven forget
            // dispatch ships the host's group set in the payload so
            // the agent doesn't need to remember which groups to
            // walk. Build it server-side here.
            payload = s.buildForgetPayloadForHost(ctx, d.HostID)
            if payload.Args == nil {
                continue // host has no groups with retention set
            }
        }
        kind := api.JobKind(d.Kind)
        _, _, code, msg := s.dispatchJobWithPayload(ctx, nil /*system*/, d.HostID, kind, payload)
        if code != "" {
            slog.Warn("maintenance: dispatch failed",
                "host_id", d.HostID, "kind", d.Kind, "code", code, "msg", msg)
        }
    }
}

buildForgetPayloadForHost walks ListSourceGroups(host), encodes the per-group retention policies into a structured payload field. This requires extending CommandRunPayload with a ForgetGroups []ForgetGroup field where ForgetGroup = {Tag string, Policy RetentionPolicy}. The agent's RunForget handler would then loop over the groups, running restic forget --tag <tag> --keep-* … for each. Important: this changes the agent-side RunForget semantics. Bake this in here:

Extend internal/api/messages.go — add ForgetGroup and CommandRunPayload.ForgetGroups.
Extend cmd/agent/main.go JobForget case — if ForgetGroups is set, loop. Backwards compatible: if RetentionPolicy is set instead, run the single-policy form (the operator-triggered single-group path through dispatchScheduledJob is gone in this redesign — backups don't dispatch forget — so the operator-triggered single-policy path is from the maintenance run-now button only, not relevant any more).
Extend internal/agent/runner/runner.go RunForget to take []ForgetGroup and loop.

For maintenance, prefer the multi-group shape; remove the old RetentionPolicy field from CommandRunPayload after migration if no callers remain.

Step 2: Wire the goroutine in cmd/server/main.go

In the existing background-sweepers goroutine (or as a sibling — keep it readable), add a 60-second ticker:

maintenanceTick := time.NewTicker(60 * time.Second)
defer maintenanceTick.Stop()

ticker := maintenance.New(st)

// In the select:
case <-maintenanceTick.C:
    decisions, err := ticker.Decide(ctx, time.Now().UTC())
    if err != nil {
        slog.Warn("maintenance ticker: decide", "err", err)
        continue
    }
    if len(decisions) > 0 {
        slog.Info("maintenance ticker: dispatching", "n", len(decisions))
        srv.DispatchMaintenance(ctx, decisions)
    }

srv needs an exported DispatchMaintenance method on *Server (or expose it via a tiny interface to keep wiring honest).

Step 3: Test the dispatcher (integration shape)

In internal/server/http/repo_ops_test.go or a new maintenance_dispatch_test.go, drive a fake hub + fake store:

One host online, forget enabled, no recent forget job → assert command.run shipped.
One host offline → assert no command.run, no audit, just a log.
Prune dispatch with no admin creds → assert skipped silently.
Step 4: Smoke

make build && <restage block>. Set a host's forget cadence to * * * * * (every minute) and watch the server logs for "maintenance: dispatching" within ~60s. Open the host's Jobs UI and verify a forget job appeared with actor_kind=schedule (or a new system actor — pick one consistently).

Step 5: Commit

git add cmd/server/main.go internal/server/http/maintenance_dispatch.go internal/api/messages.go internal/agent/runner/runner.go cmd/agent/main.go
git commit -m "server: maintenance ticker drives forget/prune/check on cadence"

Slice G — Pending-runs queue worker (P2R-08)

Task G1: Enqueue on schedule.fire when host offline

Files:

Modify: internal/server/http/schedule_push.go (the dispatchScheduledJob path)
Modify: internal/store/sources.go (or wherever SchedulesUsingGroup and similar live) — needs GetSourceGroup to expose retry_max + retry_backoff_seconds (likely already does).
Step 1: Wrap dispatchBackupForGroup to enqueue when send fails

The current path calls conn.Send directly. If conn.Send fails (offline), enqueue a row in pending_runs instead of just logging.

Note: schedule.fire comes from the agent's local cron, so by the time we're here the agent is connected. But the same dispatch path is reachable from pushScheduleSetAsync callers — and from the ticker's Slice F. For ticker's case, we already short-circuit on offline. For the schedule.fire case, the offline race is narrow: agent fired its cron, then disconnected before the round-trip completes. Still worth queuing.

Better: handle the enqueue inside dispatchBackupForGroup itself, on conn.Send error:

if err := conn.Send(sendCtx, env); err != nil {
    slog.Warn("schedule.fire: send command.run, queueing for retry",
        "host_id", hostID, "schedule_id", scheduleID, "err", err)
    backoff := time.Duration(g.RetryBackoffSeconds) * time.Second
    _ = s.deps.Store.EnqueuePendingRun(ctx, &store.PendingRun{
        ID:            ulid.Make().String(),
        ScheduleID:    scheduleID,
        SourceGroupID: g.ID,
        HostID:        hostID,
        Attempt:       1,
        NextAttemptAt: time.Now().UTC().Add(backoff),
        ScheduledAt:   scheduledAt,
        LastError:     err.Error(),
    })
    return ""
}

Step 2: Add ListPendingRunsForHost to the store

func (st *Store) ListPendingRunsForHost(ctx context.Context, hostID string) ([]PendingRun, error)

SELECT … WHERE host_id = ? ORDER BY next_attempt_at.

Step 3: Test

In internal/server/http/p2r01_ws_test.go or a fresh test file, simulate a conn.Send failure mid-dispatch and assert a pending row landed.

Step 4: Commit

git add internal/server/http/schedule_push.go internal/store/pending.go
git commit -m "server: enqueue pending_runs when scheduled-job dispatch fails"

Task G2: Drain pending_runs on tick + on agent reconnect

Files:

Create: internal/server/http/pending_drain.go
Create: internal/server/http/pending_drain_test.go
Modify: internal/server/http/host_credentials.go (extend onAgentHello)
Modify: cmd/server/main.go (add the 30s tick)
Step 1: Write the drainer

// pending_drain.go — drains pending_runs rows that are due.
//
// Two trigger paths:
//   1. The 30s ticker in cmd/server, which sweeps every host with
//      due rows.
//   2. onAgentHello, which targets one host and drains *its* rows
//      synchronously (so a host coming back online flushes its
//      queue before the operator notices).

package http

func (s *Server) DrainPending(ctx context.Context, hostID string) {
    runs, err := s.deps.Store.ListPendingRunsForHost(ctx, hostID)
    if err != nil {
        slog.Warn("drain: list pending", "host_id", hostID, "err", err)
        return
    }
    if !s.deps.Hub.Connected(hostID) { return }
    for _, p := range runs {
        // Walk the schedule + group; if either is gone, drop the row.
        sc, err := s.deps.Store.GetSchedule(ctx, hostID, p.ScheduleID)
        if err != nil || !sc.Enabled {
            _ = s.deps.Store.DeletePendingRun(ctx, p.ID)
            continue
        }
        g, err := s.deps.Store.GetSourceGroup(ctx, hostID, p.SourceGroupID)
        if err != nil {
            _ = s.deps.Store.DeletePendingRun(ctx, p.ID)
            continue
        }
        if p.Attempt >= g.RetryMax {
            slog.Info("drain: abandoning pending run (retry_max exceeded)",
                "host_id", hostID, "schedule_id", p.ScheduleID, "group", g.Name,
                "attempts", p.Attempt)
            _ = s.deps.Store.AppendAudit(ctx, store.AuditEntry{
                ID: ulid.Make().String(), Actor: "system",
                Action: "pending_run.abandoned",
                TargetKind: ptr("schedule"), TargetID: &p.ScheduleID,
                TS: time.Now().UTC(),
            })
            _ = s.deps.Store.DeletePendingRun(ctx, p.ID)
            continue
        }
        // Try to dispatch.
        conn, ok := s.deps.Hub.GetConn(hostID)
        if !ok { return }
        jobID := s.dispatchBackupForGroup(ctx, conn, hostID, p.ScheduleID, g, p.ScheduledAt)
        if jobID == "" {
            // Send failed again — bump attempt with exponential backoff.
            backoff := time.Duration(g.RetryBackoffSeconds) * time.Second
            backoff *= time.Duration(1 << p.Attempt)
            if max := 30 * time.Minute; backoff > max { backoff = max }
            _ = s.deps.Store.BumpPendingRunAttempt(ctx, p.ID,
                time.Now().UTC().Add(backoff), "dispatch failed on drain")
            continue
        }
        _ = s.deps.Store.DeletePendingRun(ctx, p.ID)
    }
}

// DrainAllDue is the periodic-ticker entrypoint. Loops over hosts
// with due rows and calls DrainPending per host.
func (s *Server) DrainAllDue(ctx context.Context) {
    due, err := s.deps.Store.DuePendingRuns(ctx, time.Now().UTC(), 100)
    if err != nil {
        slog.Warn("drain: list due", "err", err); return
    }
    seen := make(map[string]bool)
    for _, p := range due {
        if seen[p.HostID] { continue }
        seen[p.HostID] = true
        s.DrainPending(ctx, p.HostID)
    }
}

Hub.GetConn(hostID) may not exist yet — if so, add it (hub.go). Returns the registered *ws.Conn or nil, false.

Step 2: Wire onAgentHello to drain

In host_credentials.go, add to onAgentHello:

go s.DrainPending(context.Background(), hostID)

(Use a fresh context — the hello-bound context is short-lived, and the drain may take a few seconds across many runs.)

Step 3: Wire the periodic tick in cmd/server/main.go

Add a 30s ticker alongside the maintenance one:

drainTick := time.NewTicker(30 * time.Second)
defer drainTick.Stop()
// case <-drainTick.C:
//     srv.DrainAllDue(ctx)

Step 4: Test

func TestDrainOnReconnect(t *testing.T) {
    // Stand up server + fake host. Disconnect agent. Schedule fires
    // via a fake schedule.fire envelope, races the disconnect, lands
    // in pending_runs (assert via store query). Reconnect. Assert
    // drain ran: pending_runs empty, command.run was sent.
}

func TestDrainAbandonsAfterRetryMax(t *testing.T) {
    // Set group.retry_max=2; insert a pending row at attempt=2.
    // Connect host. DrainPending. Assert row deleted, audit logged,
    // no command.run sent.
}

func TestDrainBumpsAttemptOnSendFail(t *testing.T) {
    // Pending row at attempt=1, retry_max=5. Hub.Send rigged to
    // fail. DrainPending. Assert row.attempt==2 in DB,
    // next_attempt_at moved out by backoff*2.
}

Step 5: Smoke

make build && <restage block>. Disable the agent's WS connection (e.g., sudo systemctl stop restic-manager-agent), trigger a schedule fire (or fast-forward via an artificial cron), wait for a row to land in pending_runs. Re-start the agent, watch the server log for "drain: dispatched", verify the row is gone and a job ran.

Step 6: Commit

git add internal/server/http/pending_drain.go internal/server/http/pending_drain_test.go internal/server/ws/hub.go internal/server/http/host_credentials.go cmd/server/main.go
git commit -m "server: drain pending_runs on tick + on agent reconnect"

Slice H — Repo stats wiring (P2R-07)

Task H1: Server-side stats.report handler

Files:

Modify: internal/server/ws/handler.go
Modify: internal/server/ws/handler_test.go
Step 1: Add the dispatch case

In dispatchAgentMessage, add a case for MsgStatsReport that decodes the payload and calls Store.UpsertHostRepoStats(ctx, hostID, patch). Mirror the existing MsgSnapshotsRpt handler.

Step 2: Test

Drive a fake stats.report through the handler, assert the row in host_repo_stats matches.

Step 3: Commit

git add internal/server/ws/handler.go internal/server/ws/handler_test.go
git commit -m "server/ws: persist stats.report into host_repo_stats"

Task H2: End-to-end smoke + visual check

Step 1: Build + restage per CLAUDE.md

make build && \
  cp bin/restic-manager-agent /tmp/rm-smoke/data/agent-binaries/restic-manager-agent-linux-amd64 && \
  cp deploy/install/install.sh /tmp/rm-smoke/data/install/install.sh && \
  cp deploy/install/restic-manager-agent.service /tmp/rm-smoke/data/install/restic-manager-agent.service && \
  sudo -n install -m 0755 bin/restic-manager-agent /usr/local/bin/restic-manager-agent && \
  sudo -n install -m 0644 deploy/install/restic-manager-agent.service /etc/systemd/system/restic-manager-agent.service && \
  sudo -n systemctl daemon-reload && sudo -n systemctl restart restic-manager-agent && \
  pkill -f restic-manager-server; sleep 1; \
  RM_LISTEN=:8080 RM_DATA_DIR=/tmp/rm-smoke/data RM_BASE_URL=http://127.0.0.1:8080 \
    RM_SECRET_KEY_FILE=/tmp/rm-smoke/data/secret.key RM_COOKIE_SECURE=false \
    ./bin/restic-manager-server >> /tmp/rm-smoke/server.log 2>&1 &

Step 2: Playwright sweep

Drive a Playwright session against http://127.0.0.1:8080:

Log in.
Navigate to a host's /repo page.
Click "Run now → Check". Verify: redirect to /jobs/{id}, live log streams, job ends succeeded.
Set admin credentials. Click "Run now → Prune". Verify: redirect, log streams, ends succeeded.
Click "Run now → Unlock" — even when there's no lock, restic returns success quickly. Verify behaviour.
After all three, refresh /repo and verify the stats panel shows "Last check {n} ago · ok", "Last prune {n} ago", and total size matches what restic stats would return.
Force a fake lock via direct restic call against the smoke repo, run check, verify the lock-detected banner appears.

Save screenshots to _diag/p2r-phase5-sweep/.

Step 3: Commit screenshots + a short note

git add _diag/p2r-phase5-sweep/
git commit -m "diag: phase 5 Playwright sweep screenshots"

Task H3: Mark the tasks complete in tasks.md

Step 1: Tick P2R-03 through P2R-08

In tasks.md, change - [ ] **P2R-03** (and -04, -05, -06, -07, -08) to - [x] and append a one-line note pointing at the commit range or PR URL. Mirror P2R-01 / P2R-02's done-format.

Step 2: Commit

git add tasks.md
git commit -m "tasks: mark P2R-03..P2R-08 done"

Self-Review

Spec coverage check:

P2R-03 (prune) — Slices B (RunPrune), C (agent runner + admin slot), D (HTTP run-now), E (UI button + admin creds form), F (cadence-driven). ✓
P2R-04 (check + subset) — Slices B (RunCheck), C (agent runner), D (HTTP run-now), E (UI button), F (cadence-driven). ✓
P2R-05 (unlock) — Slices B (RunUnlock), C (agent runner), D (HTTP run-now), E (UI button + lock banner via stats panel). ✓
P2R-06 (server-side maintenance ticker) — Slice F. ✓
P2R-07 (repo stats panel) — stats.report wire (C1), agent reportStats (C3), server handler (H1), UI panel (E1). ✓
P2R-08 (pending-runs queue) — Slices G1 + G2. ✓

Placeholder scan: No "TODO/TBD/etc" in any step body. Each step shows actual code or an exact existing-file reference for the engineer to mirror.

Type consistency: CredentialKind constants and host_credentials.kind column align ('repo' | 'admin'). ConfigUpdatePayload.Slot uses the same string values. StatsReportPayload field names match HostRepoStats columns 1:1 (camelCase ↔ snake_case via JSON tags). JobKind constants are used unchanged from internal/api/messages.go:50-63. Maintenance Decision.Kind uses string-match ("forget"|"prune"|"check") to match the kind column in jobs and the host_repo_maintenance.*_cron field naming.

Open scope question: the RetentionPolicy field on CommandRunPayload becomes dead weight once forget runs in multi-group form. Slice F task F2 step 1 calls this out — leave the deletion as a follow-up cleanup commit at the end, after the agent code stops reading it. Don't drop it inline because the wire-shape pin tests need to be updated atomically.

66 KiB Raw Blame History

P2 Redesign — Phase 5 Implementation Plan

File Structure

Slice A — Schema groundwork (admin creds + stats projection)

Task A1: Add migration 0009 for admin-credentials column + repo-stats table

Task A2: Extend the host_credentials store API to be kind-aware

Task A3: Add the host_repo_stats store API

Slice B — Restic wrapper additions

Task B1: RunPrune

Task B2: RunCheck (with subset support + lock-detection)

Task B3: RunUnlock + RunStats

Slice C — Wire envelopes + agent runners + dispatcher

Task C1: Wire — add stats.report + RequiresAdminCreds

Task C2: Agent secrets — split into repo + admin slots

Task C3: Agent runners — RunPrune, RunCheck, RunUnlock + reportStats

Slice D — Server: HTTP run-now endpoints (P2R-03/04/05)

Task D1: Admin credentials REST + push helper

Task D2: HTTP run-now for prune / check / unlock

Slice E — UI: Repo page additions

Task E1: Render admin-creds form + one-time maintenance buttons + stats panel skeleton

Slice F — Server-side maintenance ticker (P2R-06)

Task F1: Pure-logic ticker package

Task F2: Wire the ticker into cmd/server

Slice G — Pending-runs queue worker (P2R-08)

Task G1: Enqueue on schedule.fire when host offline

Task G2: Drain pending_runs on tick + on agent reconnect

Slice H — Repo stats wiring (P2R-07)

Task H1: Server-side stats.report handler

Task H2: End-to-end smoke + visual check

Task H3: Mark the tasks complete in tasks.md

Self-Review

66 KiB

Raw Blame History