P2 redesign Phase 5 — prune/check/unlock + maintenance ticker + repo stats + pending-runs queue #3

Merged
steve merged 37 commits from p2r-phase5-maintenance into main 2026-05-04 10:25:02 +01:00
Owner

Summary

Phase 5 of the P2 redesign — six task IDs (P2R-03..P2R-08) close out the server-side maintenance story:

  • P2R-03 / P2R-04 / P2R-05 — prune / check / unlock end-to-end: restic-wrapper additions (RunPrune/RunCheck/RunUnlock/RunStats), agent runners + dispatcher cases, operator HTTP run-now endpoints (POST /api/hosts/{id}/repo/{prune,check,unlock} + HTMX twins), Repo-page UI buttons.
  • P2R-06 — server-side maintenance ticker: pure-logic internal/server/maintenance package + side-effecting DispatchMaintenance glue. 60s tick reads host_repo_maintenance, derives last-fire from LatestJobByKind, dispatches via the same dispatchJobWithPayload path as operator Run-now. Forget reshape: multi-group ForgetGroups payload (one job → N restic-forget invocations per tick).
  • P2R-07 — repo stats panel: agent ships repo.stats after every backup/check/prune/unlock; server persists into new host_repo_stats projection table; Repo page surfaces total size, last-check status (color-coded), last-prune time, lock-detected banner.
  • P2R-08 — pending-runs queue: scheduled backup fires that race a disconnect queue to pending_runs. Drained on 30s tick + on agent reconnect via onAgentHello. Exponential backoff capped at 30m. Abandons rows on retry_max or genuinely-gone schedule/group.

Also lands: admin-credentials slot (host_credentials.kind = 'admin', separate AAD host:<id>:admin), agent secrets-store split with backwards-compatible decode of legacy flat blobs, and a new runWithPump helper consolidating Forget/Init/Prune/Check/Unlock pump duplication.

Plan: `docs/superpowers/plans/2026-05-03-p2-redesign-phase-5.md`. Tasks ticked in `tasks.md`.

Architecture notes

  • Migration 0009: `host_credentials` becomes kind-aware (composite PK + CHECK constraint), new `host_repo_stats` table.
  • Wire changes: `CommandRunPayload.RetentionPolicy` removed, `ForgetGroups` + `RequiresAdminCreds` added; `ConfigUpdatePayload.Slot` added; `RepoStatsPayload` reshaped to partial-update form (pointer fields). Protocol version deliberately not bumped — see comment in `internal/api/version.go` (lockstep deploy).
  • Maintenance fires don't queue when offline (re-evaluate next tick); only scheduled backups queue. Documented in code.
  • Per-host TryLock mutex on `DrainPending` prevents the on-hello goroutine and the 30s tick double-dispatching the same row.
  • `LatestJobByKind` returns most-recent job of any status (including in-flight) — a long-running prune correctly suppresses the next tick.

Test plan

  • `go vet ./...` clean
  • `go test ./...` green across all packages
  • Pre-commit hooks (gofumpt, goimports, golangci-lint v2.5.0, vet) passed on every commit; no `--no-verify`
  • Smoke sweep against `/tmp/rm-smoke` env: login → repo page → check job → repo health panel populated → unlock job. Screenshots in `_diag/p2r-phase5-sweep/`.
  • Live prune dispatch (skipped in smoke — no rest-server admin password available; disabled-button UX path verified instead)
  • Reviewer to spot-check: maintenance ticker fires forget/prune/check on cadence in a longer-running env
  • Reviewer to verify: pending-runs drain on a real disconnect/reconnect cycle

🤖 Generated with Claude Code

## Summary Phase 5 of the P2 redesign — six task IDs (P2R-03..P2R-08) close out the server-side maintenance story: - **P2R-03 / P2R-04 / P2R-05 — prune / check / unlock** end-to-end: restic-wrapper additions (`RunPrune`/`RunCheck`/`RunUnlock`/`RunStats`), agent runners + dispatcher cases, operator HTTP run-now endpoints (`POST /api/hosts/{id}/repo/{prune,check,unlock}` + HTMX twins), Repo-page UI buttons. - **P2R-06 — server-side maintenance ticker**: pure-logic `internal/server/maintenance` package + side-effecting `DispatchMaintenance` glue. 60s tick reads `host_repo_maintenance`, derives last-fire from `LatestJobByKind`, dispatches via the same `dispatchJobWithPayload` path as operator Run-now. Forget reshape: multi-group `ForgetGroups` payload (one job → N restic-forget invocations per tick). - **P2R-07 — repo stats panel**: agent ships `repo.stats` after every backup/check/prune/unlock; server persists into new `host_repo_stats` projection table; Repo page surfaces total size, last-check status (color-coded), last-prune time, lock-detected banner. - **P2R-08 — pending-runs queue**: scheduled backup fires that race a disconnect queue to `pending_runs`. Drained on 30s tick + on agent reconnect via `onAgentHello`. Exponential backoff capped at 30m. Abandons rows on `retry_max` or genuinely-gone schedule/group. Also lands: admin-credentials slot (`host_credentials.kind = 'admin'`, separate AAD `host:<id>:admin`), agent secrets-store split with backwards-compatible decode of legacy flat blobs, and a new `runWithPump` helper consolidating Forget/Init/Prune/Check/Unlock pump duplication. Plan: \`docs/superpowers/plans/2026-05-03-p2-redesign-phase-5.md\`. Tasks ticked in \`tasks.md\`. ## Architecture notes - Migration 0009: \`host_credentials\` becomes kind-aware (composite PK + CHECK constraint), new \`host_repo_stats\` table. - Wire changes: \`CommandRunPayload.RetentionPolicy\` removed, \`ForgetGroups\` + \`RequiresAdminCreds\` added; \`ConfigUpdatePayload.Slot\` added; \`RepoStatsPayload\` reshaped to partial-update form (pointer fields). Protocol version deliberately not bumped — see comment in \`internal/api/version.go\` (lockstep deploy). - Maintenance fires don't queue when offline (re-evaluate next tick); only scheduled backups queue. Documented in code. - Per-host TryLock mutex on \`DrainPending\` prevents the on-hello goroutine and the 30s tick double-dispatching the same row. - \`LatestJobByKind\` returns most-recent job of any status (including in-flight) — a long-running prune correctly suppresses the next tick. ## Test plan - [x] \`go vet ./...\` clean - [x] \`go test ./...\` green across all packages - [x] Pre-commit hooks (gofumpt, goimports, golangci-lint v2.5.0, vet) passed on every commit; no \`--no-verify\` - [x] Smoke sweep against \`/tmp/rm-smoke\` env: login → repo page → check job → repo health panel populated → unlock job. Screenshots in \`_diag/p2r-phase5-sweep/\`. - [ ] Live prune dispatch (skipped in smoke — no rest-server admin password available; disabled-button UX path verified instead) - [ ] Reviewer to spot-check: maintenance ticker fires forget/prune/check on cadence in a longer-running env - [ ] Reviewer to verify: pending-runs drain on a real disconnect/reconnect cycle 🤖 Generated with [Claude Code](https://claude.com/claude-code)
steve force-pushed p2r-phase5-maintenance from aa80a3418e to e1b60f02ae 2026-05-04 10:15:39 +01:00 Compare
steve added 36 commits 2026-05-04 10:19:16 +01:00
Add RunPrune for admin-credential prune invocations.  Extract
runWithPump to DRY the stdout+stderr pump pattern; refactor RunForget
and RunInit to delegate to it (RunInit preserves the "config file
already exists" soft-success sniff by wrapping the handler before the
call).  Add runner_test.go with TestRunPruneInvokesPrune.
Add CheckResult (LockPresent, ErrorsFound) and RunCheck.  subsetPct>0
passes --read-data-subset N% to limit data reads.  Stderr is sniffed
for "Found stale lock"/"locked" to set LockPresent; a non-zero exit
from restic is absorbed as ErrorsFound=true rather than an error so
the caller can always persist last_check_status.  Tests cover lock
detection, exit-1 absorption, and subset-arg plumbing.
Add RunUnlock (delegates straight to runWithPump) and RunStats which
runs `restic stats --json --mode raw-data`, captures the single JSON
line from stdout into RepoStats, and returns an error if no JSON
arrives.  Tests cover arg plumbing for unlock, JSON parsing, and the
no-JSON error path.
Narrow the LockPresent predicate from bare "locked" (too broad) to
"stale lock" and "already locked" — the two phrases restic actually
emits. Replace TestRunCheckParsesLock with table-driven
TestRunCheckLockSniff covering both trigger phrases and a benign
"locked-file" line that must not set LockPresent. Add
TestRunStatsZeroSnapshots to pin that RunStats accepts zero-snapshot
JSON without error.
Reshape RepoStatsPayload into pointer-field partial-update form matching
store.HostRepoStats semantics; add Slot discriminator to ConfigUpdatePayload
for admin vs repo credential routing; add RequiresAdminCreds flag to
CommandRunPayload for prune/unlock jobs that need delete authority.
Split the on-disk bundle into repo + admin slots. Legacy flat Repo blobs
are detected at load time by the presence of "repo_url" at the top level
and transparently promoted into the new shape on the next Save/SaveAdmin.
Adds ErrNoAdmin sentinel, LoadAdmin, SaveAdmin, and three new tests.
Extract resticEnv/sendStarted/streamHandler/sendFinished helpers to remove
boilerplate duplication across Run* methods. Add RunPrune (ships repo.stats
with LastPruneAt before job.finished), RunCheck (ships stats with
LastCheckStatus/LockPresent regardless of outcome), RunUnlock (ships
LockPresent=false on success), and reportStats (fills size fields via
RunStats when caller didn't populate them).

Wire JobPrune/JobCheck/JobUnlock into the dispatcher switch; teach
MsgConfigUpdate about the Slot discriminator for admin vs repo creds;
add strconv import for subset-pct parsing.
RunCheck and RunUnlock were calling sendFinished before reportStats,
inverting the required job.started → log.stream → repo.stats →
job.finished envelope order. Move reportStats ahead of sendFinished in
both functions to match the pattern already correct in RunPrune.

Strengthen TestRunCheckShipsCheckStatus, TestRunCheckErrorsFoundShipsErrorsStatus,
and TestRunUnlockClearsLock with the same position-index ordering
assertions used by TestRunPruneShipsExpectedEnvelopes; these assertions
would have failed against the pre-fix code.
Save and SaveAdmin now propagate loadBundle errors instead of silently
overwriting a corrupt file (data-loss fix). Tests added for both paths.
reportStats logs a Debug on RunStats failure; r in runJob gets a comment
explaining the prune-runner asymmetry; runner_test comment tightened.
Adds GET/PUT/DELETE /api/hosts/{id}/admin-credentials handlers that
mirror the existing repo-credentials endpoints but write to
store.CredKindAdmin with AEAD additional-data "host:<id>:admin" (scoped
away from the repo slot to prevent cross-binding). PUT immediately pushes
a config.update(Slot:"admin") to the agent when it is connected, and the
new pushAdminCredsToAgent helper is wired for use by the upcoming prune
run-now endpoint (D2) to push on-demand before dispatch.
Adds POST /api/hosts/{id}/repo/{prune,check,unlock} (and matching outer
routes for HTMX form posts). Prune pushes the admin-cred slot via
pushAdminCredsToAgent before dispatch and refuses with
admin_creds_required when the slot is not set. Check reads
check_subset_pct from host_repo_maintenance (overridable via ?subset=N,
clamped 0-100; non-numeric override falls back to DB value silently).
Unlock needs no admin creds. All three share the same wantsHTML/HX-Redirect
response split as the per-source-group run-now endpoint.
Switch handleSetHostCredentials, handleSetAdminCredentials, and
handleDeleteAdminCredentials from authedUser (bool) to requireUser
(*store.User) so AuditEntry.UserID and Actor are populated correctly.
Add slog.Warn on the non-ErrNotFound pushAdminCredsToAgent path in
handleRunRepoPrune so decrypt/send failures surface in the server log
rather than appearing as a generic host_offline 503.
- hostRepoPage gains AdminURL/AdminUsername/HasAdminPassword, Online,
  and StatsView (pre-dereferenced projection of host_repo_stats).
- loadHostRepoPage loads the admin slot (tolerating ErrNotFound),
  hub.Connected, and stats (tolerating ErrNotFound).
- renderRepoPage gains an adminErr parameter; all callers updated.
- handleUIAdminCredentialsSave / handleUIAdminCredentialsDelete added
  (form-POST handlers mirroring the repo-creds pattern, with audit).
- Routes /hosts/{id}/admin-credentials POST and /delete POST registered.
- Template: Admin credentials form after Connection, Run-now HTMX
  buttons after Maintenance, Repo health stats panel in right rail.
- Tests: 9 new tests covering rendering, disabled states, save/delete
  round-trips, audit rows, and idempotent delete.
Add hx-swap="none" to the three Run-now buttons (check/prune/unlock) in
host_repo.html to match the existing pattern on host_sources.html and
host_schedules.html. Fix all-blank admin-credentials save to redirect
without ?saved= query string so no false-positive banner is shown;
strengthen the corresponding test to assert Location has no ?saved=.
Rebuild CSS bundle via Tailwind to pick up max-w-[640px] JIT class.
Wires a 60s server-side ticker to the pure-logic maintenance.Decide
introduced in the previous commit. Decisions flow through a new
DispatchMaintenance method on *Server, which:

  - skips offline hosts (no pending_runs queueing — maintenance is
    not a backup, missed fires shouldn't pile up)
  - silently skips prune when admin creds aren't bound
  - pushes admin creds before prune, then dispatches with
    RequiresAdminCreds=true (same as operator-driven prune)
  - persists job rows with actor_kind="system"

Reshapes the forget wire payload from a single RetentionPolicy to a
ForgetGroups list (one tag + per-group keep-* per source group). The
agent walks the groups and runs `restic forget --tag <name> --keep-*`
once per group. Dead-code removed: CommandRunPayload.RetentionPolicy,
the old forget JSON-decode in cmd/agent, and the single-policy form of
restic.RunForget.
When dispatchBackupForGroup's conn.Send errors, queue a pending_runs
row (attempt=1, next_attempt_at = now + group.RetryBackoffSeconds)
instead of silently dropping the fire. The orphaned queued job row
is left behind for forensic visibility — the drainer will create a
fresh job row on its retry.

Also adds Store.ListPendingRunsForHost — the on-reconnect drain
walks every row for the host, regardless of due-ness, since the
host being back makes 'due' irrelevant.
Two trigger paths land here:

- A 30s ticker in cmd/server calls Server.DrainAllDue(ctx). It
  walks pending_runs rows whose next_attempt_at <= now, dedupes by
  host, skips offline hosts, and per online host runs DrainPending.

- onAgentHello spawns a background DrainPending(hostID). When a
  host comes back, every pending row for it is dispatchable now —
  due-ness becomes irrelevant once the wire is back.

Each row's schedule + group are reloaded; ErrNotFound or
disabled-schedule or gone-group abandons the row with a
pending_run.abandoned audit. attempt >= retry_max also abandons.
Otherwise dispatchBackupForGroup is invoked; success deletes the
row, failure bumps attempt with exponential backoff capped at
30m.
Extract dispatchBackupForGroupCore (persist+marshal+send, no enqueue on
failure) from dispatchBackupForGroup. drainOne now calls the core
directly so a failed Send only bumps the existing pending_runs row via
BumpPendingRunAttempt — not create a second row — stopping the
geometric duplication on repeated drain failures.

dispatchBackupForGroup (schedule.fire path) wraps the core and keeps
its enqueue-on-failure behaviour unchanged.

TestDrainPendingBumpsOnSendFailure strengthened: asserts exactly 1 row
remains after a send failure (was tolerating >=1 duplicate rows).
GetSourceGroup errors in drainOne now gate on errors.Is(err,
store.ErrNotFound) before calling abandonPending, mirroring the
existing GetSchedule pattern. Transient errors (SQLITE_BUSY, context
cancellation) now log a warning and return without deleting the row.

Add regression test TestDrainPendingDropsRowsForGoneSourceGroup
confirming the ErrNotFound path still abandons correctly. Also add
a comment above the backoff-doubling loop explaining the progression.
Widen the SQL query to consider all statuses (queued, running,
succeeded, failed, cancelled) rather than terminal-only. An in-flight
prune that outlasts the 60s tick interval previously produced
ErrNotFound, causing the ticker to anchor at now-24h and fire a second
prune concurrently with the first.

Update the doc comment and test: remove the "queued job filtered out"
case, add assertions that a running job and a queued job are each
returned as the latest.
Add a per-host drain mutex (drainLocks map guarded by drainLocksMu) on
the Server struct. DrainPending acquires it with TryLock: if a drain is
already in-flight for this host, the call returns immediately — the
running drain will see every pending row. This prevents the on-hello
goroutine and the 30s tick from both listing the same host's rows and
dispatching them twice.

Update three existing tests that called srv.DrainPending explicitly
after the on-hello goroutine had already been spawned: replace the
now-redundant direct call with a waitForPendingCount poll so they don't
race the goroutine's mutex ownership. Add TestDrainPendingSerializesPerHost
which fires 10 concurrent DrainPending goroutines against a 5-row queue
and asserts exactly 5 job rows result.
version.go: add a comment block explaining why Phase 5's wire changes
(CommandRunPayload, ConfigUpdatePayload, RepoStatsPayload reshapes) did
not bump CurrentProtocolVersion — lockstep deploy, no rolling-upgrade
path, smoke env restage enforces it. Notes where a version bump to 2
would be required if a multi-version path is ever introduced.

cmd/agent/main.go: document why the JobForget handler hard-errors on
empty ForgetGroups rather than falling back to a single-policy form.
The maintenance ticker is the only writer and always populates the
field; the fallback was specced but skipped given lockstep deploy.
test: write-then-rename script-bin helpers (avoid ETXTBSY under -race)
CI / Build (windows/amd64) (pull_request) Successful in 18s
CI / Build (linux/amd64) (pull_request) Successful in 19s
CI / Lint (pull_request) Successful in 41s
CI / Build (linux/arm64) (pull_request) Successful in 18s
CI / Test (linux/amd64) (pull_request) Failing after 3m41s
d8dd21b5e0
CI run #48 failed with:

  --- FAIL: TestRunInitShipsStartedAndFinished
      RunInit: ... fork/exec /tmp/.../restic: text file busy

setupScript and setupScriptBin used os.WriteFile to write a shell
script directly at the final path, then exec'd it. Under -race +
many t.Parallel tests, a fork-from-another-goroutine could inherit
the still-open writable fd from one of those WriteFile calls; the
kernel returns ETXTBSY when the freshly-execed binary still has a
writable fd anywhere on the system.

Fix: write to "<path>.tmp", then os.Rename into place. The rename
is a pure dirent op; by the time the final path exists, no process
has a writable fd on its inode and exec is safe. -race + -count=5
on both runner packages now passes consistently.
steve force-pushed p2r-phase5-maintenance from e1b60f02ae to d8dd21b5e0 2026-05-04 10:19:16 +01:00 Compare
steve added 1 commit 2026-05-04 10:20:57 +01:00
test: poll pending-row count in drain-on-reconnect test (race fix)
CI / Lint (pull_request) Successful in 17s
CI / Test (linux/amd64) (pull_request) Successful in 43s
CI / Build (linux/amd64) (pull_request) Successful in 22s
CI / Build (windows/amd64) (pull_request) Successful in 51s
CI / Build (linux/arm64) (pull_request) Successful in 21s
bc02fcb498
CI run #50 failed with:

  --- FAIL: TestDrainPendingDispatchesOnReconnect (1.03s)
      pending_drain_test.go:150: pending rows after drain: got 1, want 0

The test waits for a backup command.run envelope on the wire and
then checks the pending-row count. But conn.Send (the wire write)
returns BEFORE DeletePendingRun runs in the drain goroutine — both
fire serially inside drainOne, but the wire-side reader can observe
the Send while the delete is still pending.

Use the existing waitForPendingCount helper to poll the count with
a 2s deadline. Behaviour unchanged when the delete is fast (count
hits 0 immediately); only relevant under CI scheduling pressure.
-race -count=10 locally now passes consistently.
steve merged commit a07d7fc53e into main 2026-05-04 10:25:02 +01:00
Sign in to join this conversation.
No Reviewers
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: steve/restic-manager#3