Extract dispatchBackupForGroupCore (persist+marshal+send, no enqueue on
failure) from dispatchBackupForGroup. drainOne now calls the core
directly, so a failed Send only bumps the existing pending_runs row via
BumpPendingRunAttempt instead of creating a second row. This stops the
geometric duplication on repeated drain failures.
dispatchBackupForGroup (schedule.fire path) wraps the core and keeps
its enqueue-on-failure behaviour unchanged.
TestDrainPendingBumpsOnSendFailure strengthened: asserts exactly 1 row
remains after a send failure (was tolerating >=1 duplicate rows).
Two trigger paths land here:
- A 30s ticker in cmd/server calls Server.DrainAllDue(ctx). It
walks pending_runs rows whose next_attempt_at <= now, dedupes by
host, skips offline hosts, and per online host runs DrainPending.
- onAgentHello spawns a background DrainPending(hostID). When a
host comes back, every pending row for it is dispatchable now —
due-ness becomes irrelevant once the wire is back.
Each row's schedule and group are reloaded. The row is abandoned with a
pending_run.abandoned audit if either lookup returns ErrNotFound, if the
schedule is disabled, or if attempt >= retry_max. Otherwise
dispatchBackupForGroup is invoked; success deletes the row, and failure
bumps attempt with exponential backoff capped at 30m.
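A minimal sketch of the capped backoff described above. nextBackoff and maxBackoff are hypothetical names, and the doubling factor is an assumption (the message only says "exponential"); the 30m cap comes from the message.

```go
package main

import (
	"fmt"
	"time"
)

// maxBackoff is the 30-minute ceiling named in the commit message.
const maxBackoff = 30 * time.Minute

// nextBackoff is a hypothetical helper. It starts at base for attempt 1,
// doubles for each further attempt, and never exceeds maxBackoff.
// The factor of 2 is an assumption; the source only says "exponential".
func nextBackoff(base time.Duration, attempt int) time.Duration {
	if base <= 0 {
		return 0
	}
	d := base
	for i := 1; i < attempt; i++ {
		d *= 2
		if d >= maxBackoff {
			return maxBackoff // stop early so large attempts cannot overflow
		}
	}
	if d > maxBackoff {
		return maxBackoff
	}
	return d
}

func main() {
	base := 60 * time.Second // stand-in for group.RetryBackoffSeconds
	for attempt := 1; attempt <= 7; attempt++ {
		fmt.Printf("attempt %d -> %s\n", attempt, nextBackoff(base, attempt))
	}
}
```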
When dispatchBackupForGroup's conn.Send errors, queue a pending_runs
row (attempt=1, next_attempt_at = now + group.RetryBackoffSeconds)
instead of silently dropping the fire. The orphaned queued job row
is left behind for forensic visibility — the drainer will create a
fresh job row on its retry.
Also adds Store.ListPendingRunsForHost — the on-reconnect drain
walks every row for the host, regardless of due-ness, since the
host being back makes 'due' irrelevant.
Wires a 60s server-side ticker to the pure-logic maintenance.Decide
introduced in the previous commit. Decisions flow through a new
DispatchMaintenance method on *Server, which:
- skips offline hosts (no pending_runs queueing — maintenance is
not a backup, missed fires shouldn't pile up)
- silently skips prune when admin creds aren't bound
- pushes admin creds before prune, then dispatches with
RequiresAdminCreds=true (same as operator-driven prune)
- persists job rows with actor_kind="system"
Reshapes the forget wire payload from a single RetentionPolicy to a
ForgetGroups list (one tag + per-group keep-* per source group). The
agent walks the groups and runs `restic forget --tag <name> --keep-*`
once per group. Dead code removed: CommandRunPayload.RetentionPolicy,
the old forget JSON-decode in cmd/agent, and the single-policy form of
restic.RunForget.
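For illustration, the per-group argv could be assembled roughly like this. forgetGroup and forgetArgs are hypothetical names and the field set is an assumption; only the `forget --tag <name> --keep-*` shape comes from the message.

```go
package main

import (
	"fmt"
	"strconv"
)

// forgetGroup is a hypothetical stand-in for one ForgetGroups entry:
// one tag plus that group's keep-* values.
type forgetGroup struct {
	Tag         string
	KeepLast    int
	KeepDaily   int
	KeepWeekly  int
	KeepMonthly int
}

// forgetArgs builds the restic argv for a single group. Dimensions that
// are zero are omitted from the command line.
func forgetArgs(g forgetGroup) []string {
	args := []string{"forget", "--tag", g.Tag}
	add := func(flag string, n int) {
		if n > 0 {
			args = append(args, flag, strconv.Itoa(n))
		}
	}
	add("--keep-last", g.KeepLast)
	add("--keep-daily", g.KeepDaily)
	add("--keep-weekly", g.KeepWeekly)
	add("--keep-monthly", g.KeepMonthly)
	return args
}

func main() {
	groups := []forgetGroup{
		{Tag: "etc", KeepDaily: 7},
		{Tag: "home", KeepLast: 3, KeepWeekly: 4},
	}
	// One restic invocation per group, as the agent does.
	for _, g := range groups {
		fmt.Println(forgetArgs(g))
	}
}
```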
Add hx-swap="none" to the three Run-now buttons (check/prune/unlock) in
host_repo.html to match the existing pattern on host_sources.html and
host_schedules.html. Fix all-blank admin-credentials save to redirect
without ?saved= query string so no false-positive banner is shown;
strengthen the corresponding test to assert Location has no ?saved=.
Rebuild CSS bundle via Tailwind to pick up max-w-[640px] JIT class.
- hostRepoPage gains AdminURL/AdminUsername/HasAdminPassword, Online,
and StatsView (pre-dereferenced projection of host_repo_stats).
- loadHostRepoPage loads the admin slot (tolerating ErrNotFound),
hub.Connected, and stats (tolerating ErrNotFound).
- renderRepoPage gains an adminErr parameter; all callers updated.
- handleUIAdminCredentialsSave / handleUIAdminCredentialsDelete added
(form-POST handlers mirroring the repo-creds pattern, with audit).
- Routes /hosts/{id}/admin-credentials POST and /delete POST registered.
- Template: Admin credentials form after Connection, Run-now HTMX
buttons after Maintenance, Repo health stats panel in right rail.
- Tests: 9 new tests covering rendering, disabled states, save/delete
round-trips, audit rows, and idempotent delete.
Switch handleSetHostCredentials, handleSetAdminCredentials, and
handleDeleteAdminCredentials from authedUser (bool) to requireUser
(*store.User) so AuditEntry.UserID and Actor are populated correctly.
Add slog.Warn on the non-ErrNotFound pushAdminCredsToAgent path in
handleRunRepoPrune so decrypt/send failures surface in the server log
rather than appearing as a generic host_offline 503.
Adds POST /api/hosts/{id}/repo/{prune,check,unlock} (and matching outer
routes for HTMX form posts). Prune pushes the admin-cred slot via
pushAdminCredsToAgent before dispatch and refuses with
admin_creds_required when the slot is not set. Check reads
check_subset_pct from host_repo_maintenance (overridable via ?subset=N,
clamped 0-100; non-numeric override falls back to DB value silently).
Unlock needs no admin creds. All three share the same wantsHTML/HX-Redirect
response split as the per-source-group run-now endpoint.
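The ?subset=N handling could look something like the sketch below. resolveSubsetPct is a hypothetical name; clamping only the override (and trusting the stored value as-is) is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// resolveSubsetPct picks the subset percentage for a check run.
// override is the raw ?subset= query value; dbValue is the stored
// check_subset_pct. A missing or non-numeric override falls back to
// dbValue without an error; a numeric override is clamped to 0-100.
func resolveSubsetPct(override string, dbValue int) int {
	if override == "" {
		return dbValue
	}
	n, err := strconv.Atoi(override)
	if err != nil {
		return dbValue // silent fallback, per the commit message
	}
	if n < 0 {
		return 0
	}
	if n > 100 {
		return 100
	}
	return n
}

func main() {
	for _, raw := range []string{"", "abc", "-3", "40", "250"} {
		fmt.Printf("%q -> %d\n", raw, resolveSubsetPct(raw, 5))
	}
}
```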
Adds GET/PUT/DELETE /api/hosts/{id}/admin-credentials handlers that
mirror the existing repo-credentials endpoints but write to
store.CredKindAdmin with AEAD additional-data "host:<id>:admin" (scoped
away from the repo slot to prevent cross-binding). PUT immediately pushes
a config.update(Slot:"admin") to the agent when it is connected, and the
new pushAdminCredsToAgent helper is wired for use by the upcoming prune
run-now endpoint (D2) to push on-demand before dispatch.
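To show why slot-scoped additional data prevents cross-binding, here is a sketch using AES-GCM from the standard library. The cipher choice, the helper names, and the "host:h1:repo" string are assumptions for illustration; only the "host:<id>:admin" format comes from the message.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// adminAAD returns the additional-data string quoted in the commit message.
func adminAAD(hostID string) []byte {
	return []byte("host:" + hostID + ":admin")
}

// newGCM builds an AES-GCM AEAD from a 16-, 24-, or 32-byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext and prepends the random nonce to the result.
func seal(key, plaintext, aad []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, aad), nil
}

// open reverses seal. It fails if aad differs from the value used to seal.
func open(key, sealed, aad []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("ciphertext shorter than nonce")
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, aad)
}

func main() {
	key := make([]byte, 32) // all-zero demo key; never use a fixed key in practice
	sealed, err := seal(key, []byte("admin-password"), adminAAD("h1"))
	if err != nil {
		panic(err)
	}
	_, err = open(key, sealed, adminAAD("h1"))
	fmt.Println("same AAD ok:", err == nil)
	// "host:h1:repo" is a hypothetical repo-slot AAD, used only for contrast.
	_, err = open(key, sealed, []byte("host:h1:repo"))
	fmt.Println("other AAD rejected:", err != nil)
}
```

A blob sealed for the admin slot fails authentication when opened under any other slot or host string, so a row copied between slots cannot be decrypted.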
Save and SaveAdmin now propagate loadBundle errors instead of silently
overwriting a corrupt file (data-loss fix). Tests added for both paths.
reportStats logs a Debug on RunStats failure; r in runJob gets a comment
explaining the prune-runner asymmetry; runner_test comment tightened.
RunCheck and RunUnlock were calling sendFinished before reportStats,
inverting the required job.started → log.stream → repo.stats →
job.finished envelope order. Move reportStats ahead of sendFinished in
both functions to match the pattern already correct in RunPrune.
Strengthen TestRunCheckShipsCheckStatus, TestRunCheckErrorsFoundShipsErrorsStatus,
and TestRunUnlockClearsLock with the same position-index ordering
assertions used by TestRunPruneShipsExpectedEnvelopes; these assertions
would have failed against the pre-fix code.
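A position-index ordering assertion of the kind mentioned above might be sketched as follows. indexOf and inOrder are hypothetical helpers, not the test suite's actual code; the envelope names are the ones listed in the message.

```go
package main

import "fmt"

// indexOf returns the position of the first envelope of the given type,
// or -1 when it never appears.
func indexOf(types []string, want string) int {
	for i, t := range types {
		if t == want {
			return i
		}
	}
	return -1
}

// inOrder reports whether every wanted type appears and the first
// occurrences sit at strictly increasing positions.
func inOrder(types []string, want ...string) bool {
	prev := -1
	for _, w := range want {
		i := indexOf(types, w)
		if i <= prev { // covers both "missing" (-1) and "out of order"
			return false
		}
		prev = i
	}
	return true
}

func main() {
	fixed := []string{"job.started", "log.stream", "log.stream", "repo.stats", "job.finished"}
	broken := []string{"job.started", "log.stream", "job.finished", "repo.stats"}
	order := []string{"job.started", "log.stream", "repo.stats", "job.finished"}
	fmt.Println(inOrder(fixed, order...))
	fmt.Println(inOrder(broken, order...))
}
```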
Extract resticEnv/sendStarted/streamHandler/sendFinished helpers to remove
boilerplate duplication across Run* methods. Add RunPrune (ships repo.stats
with LastPruneAt before job.finished), RunCheck (ships stats with
LastCheckStatus/LockPresent regardless of outcome), RunUnlock (ships
LockPresent=false on success), and reportStats (fills size fields via
RunStats when caller didn't populate them).
Wire JobPrune/JobCheck/JobUnlock into the dispatcher switch; teach
MsgConfigUpdate about the Slot discriminator for admin vs repo creds;
add strconv import for subset-pct parsing.
Split the on-disk bundle into repo + admin slots. Legacy flat Repo blobs
are detected at load time by the presence of "repo_url" at the top level
and transparently promoted into the new shape on the next Save/SaveAdmin.
Adds ErrNoAdmin sentinel, LoadAdmin, SaveAdmin, and three new tests.
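The legacy-shape check can be pictured like this, assuming the bundle is JSON once decrypted. isLegacyBundle and every field name other than "repo_url" are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// isLegacyBundle reports whether raw uses the old flat shape, identified
// by a "repo_url" key at the top level of the JSON object.
func isLegacyBundle(raw []byte) (bool, error) {
	var top map[string]json.RawMessage
	if err := json.Unmarshal(raw, &top); err != nil {
		return false, err // propagate instead of treating corrupt data as empty
	}
	_, ok := top["repo_url"]
	return ok, nil
}

func main() {
	// Both documents are made-up examples of the two shapes.
	legacy := []byte(`{"repo_url":"rest:https://example.invalid/h1","repo_password":"x"}`)
	split := []byte(`{"repo":{"repo_url":"rest:https://example.invalid/h1"},"admin":{}}`)

	for _, doc := range [][]byte{legacy, split} {
		isLegacy, err := isLegacyBundle(doc)
		fmt.Println(isLegacy, err)
	}
}
```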
Reshape RepoStatsPayload into pointer-field partial-update form matching
store.HostRepoStats semantics; add Slot discriminator to ConfigUpdatePayload
for admin vs repo credential routing; add RequiresAdminCreds flag to
CommandRunPayload for prune/unlock jobs that need delete authority.
Narrow the LockPresent predicate from bare "locked" (too broad) to
"stale lock" and "already locked" — the two phrases restic actually
emits. Replace TestRunCheckParsesLock with table-driven
TestRunCheckLockSniff covering both trigger phrases and a benign
"locked-file" line that must not set LockPresent. Add
TestRunStatsZeroSnapshots to pin that RunStats accepts zero-snapshot
JSON without error.
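A sketch of the narrowed predicate, with a small table in the spirit of TestRunCheckLockSniff. indicatesLock is a hypothetical name, case-insensitive matching is an assumption, and the sample lines are illustrative rather than captured restic output.

```go
package main

import (
	"fmt"
	"strings"
)

// lockPhrases are the two substrings the commit message names.
var lockPhrases = []string{"stale lock", "already locked"}

// indicatesLock reports whether a stderr line contains one of the lock
// phrases. A bare "locked" no longer matches.
func indicatesLock(line string) bool {
	lower := strings.ToLower(line)
	for _, p := range lockPhrases {
		if strings.Contains(lower, p) {
			return true
		}
	}
	return false
}

func main() {
	cases := []struct {
		line string
		want bool
	}{
		{"Found stale lock", true},
		{"repository is already locked", true},
		{"skipping locked-file", false},
	}
	for _, c := range cases {
		fmt.Printf("%-32q got=%v want=%v\n", c.line, indicatesLock(c.line), c.want)
	}
}
```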
Add RunUnlock (delegates straight to runWithPump) and RunStats which
runs `restic stats --json --mode raw-data`, captures the single JSON
line from stdout into RepoStats, and returns an error if no JSON
arrives. Tests cover arg plumbing for unlock, JSON parsing, and the
no-JSON error path.
Add CheckResult (LockPresent, ErrorsFound) and RunCheck. subsetPct>0
passes --read-data-subset N% to limit data reads. Stderr is sniffed
for "Found stale lock"/"locked" to set LockPresent; a non-zero exit
from restic is absorbed as ErrorsFound=true rather than an error so
the caller can always persist last_check_status. Tests cover lock
detection, exit-1 absorption, and subset-arg plumbing.
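The subset-arg plumbing amounts to something like the following. checkArgs is a hypothetical name; the `--read-data-subset N%` form is the one quoted above.

```go
package main

import (
	"fmt"
	"strconv"
)

// checkArgs builds the restic argv for a check run. A subsetPct of zero
// or less leaves the flag off, so restic runs its default check.
func checkArgs(subsetPct int) []string {
	args := []string{"check"}
	if subsetPct > 0 {
		args = append(args, "--read-data-subset", strconv.Itoa(subsetPct)+"%")
	}
	return args
}

func main() {
	fmt.Println(checkArgs(0))
	fmt.Println(checkArgs(5))
}
```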
Add RunPrune for admin-credential prune invocations. Extract
runWithPump to DRY the stdout+stderr pump pattern; refactor RunForget
and RunInit to delegate to it (RunInit preserves the "config file
already exists" soft-success sniff by wrapping the handler before the
call). Add runner_test.go with TestRunPruneInvokesPrune.
Bumping CI to golangci-lint v2.5.0 surfaced two new gofumpt findings
(in two test files that the gofumpt bundled with v2.1.6 considered
fine). A local re-format with the matching tool brings them in line.
Pre-commit hook config: prepend $GOPATH/bin to PATH inside the hook
entry so gofumpt + golangci-lint resolve when ~/go/bin isn't on the
operator's interactive shell PATH (common — go install puts them
there but PATH config varies). Without this, the hooks fail with
'Executable not found' even when the tools are installed.
Pin the Makefile setup target to v2.5.0 so a fresh clone gets the
same binary CI runs — keeps pre-commit and CI from drifting again.
Cleanup pass over the repo so CI can enforce lint going forward
without the only-new-issues escape hatch:
* gofumpt -w across the tree (31 hits, all formatting)
* misspell --fix (25 hits, US-locale spelling), except that
api.JobCancelled = "cancelled" was reverted because that literal is
the wire + DB CHECK constraint value. The matching case in
store/fleet.go was restored to "cancelled" as well, and both now
carry //nolint:misspell to stop the next auto-fix
* Wrap every `defer rows.Close()` / `defer stmt.Close()` /
`defer res.Body.Close()` as `defer func() { _ = x.Close() }()`
(x being the value closed) to satisfy errcheck without losing the
close itself
* websocket.Dial callers (1 prod, 4 tests) now capture + close the
upgrade response Body — coder/websocket can return res with a nil
Body on success, so the test deferred-closes guard against that
* Annotate the two genuine-by-design nilerr cases with //nolint
comments explaining why nil-on-error is the contract (cookie
missing = no session; ctx cancelled mid-backoff = clean shutdown)
* Add brief godoc on the 10 exported const groups + types that
revive flagged (api.HostOS/HostArch/JobKind/JobStatus/LogStream/
ErrorCode, restic.EventKind, store.Role, web.FS)
* Drop the unused (*Server).userByID method
* Inline the unparam baseView(active) — every UI page is under
the dashboard primary nav today
Result: `golangci-lint run ./...` reports 0 issues. CI lint job
no longer needs only-new-issues: true; X-06 follow-up entry in
tasks.md removed.
The bump from golangci-lint-action@v6 → v7 (which downloads the v2.x
binary) was blocking CI lint with 'unsupported version of the
configuration: ""' because .golangci.yml was still in the v1 schema.
Migrate the config to v2:
* version: "2" prelude
* disable-all → default: none
* linters-settings → linters.settings
* gofumpt + goimports move into formatters.enable + formatters.settings
* exclude-rules move into linters.exclusions.rules
* gosimple drops (folded into staticcheck in v2)
Fix the four lint hits in the new P2R-02 code:
* host_bandwidth.go: convert hostBandwidthRequest directly to
hostBandwidthView via type conversion (S1016)
* ui_repo.go: drop unparam savedSection + status arguments from
renderRepoPage (always "" / always 422 — split GET render from
validation-fail render)
* ui_schedules.go: gofumpt formatting on the scheduleEditPage struct
Add only-new-issues: true to the lint job. The repo carries ~90
pre-existing findings (gofumpt drift × 31, misspell × 25, missing
godoc × 10, bodyclose × 6, errcheck × 12, …) accumulated before
lint was actually wired into CI. Without this gate, every PR would
fail on baseline noise instead of its own changes.
Track the cleanup as X-06 in tasks.md so the gate is temporary.
Replace the placeholder 'Open →' link with a per-host Run-now
decision computed server-side once per render:
* If the host has exactly one enabled schedule whose source-group
set covers every group on the host → primary 'Run all groups'
button (HX-POST to that schedule's /run endpoint, fires every
backup the host knows about in one click).
* Otherwise (zero matches, multiple matches, or any ambiguity) →
ghost 'Open →' link to /hosts/{id}/sources, where the operator
picks per-group from the source-group rows.
dashboardPage.Hosts moves from []store.Host to []dashboardHostRow
to carry the precomputed RunAllScheduleID; host_row.html now reads
.Host.* and .RunAllScheduleID. Two extra store calls per host on
dashboard render — fine at fleet sizes we care about; if we ever
need to support thousands of hosts we'll batch these queries.
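The Run-now decision can be sketched as a pure function. The schedule type and function names are hypothetical, and returning "" for a host with no groups is an assumption.

```go
package main

import "fmt"

// schedule is a hypothetical, trimmed-down view of a stored schedule.
type schedule struct {
	ID       string
	Enabled  bool
	GroupIDs []string
}

// coversAll reports whether have contains every ID in want.
func coversAll(have, want []string) bool {
	set := make(map[string]struct{}, len(have))
	for _, id := range have {
		set[id] = struct{}{}
	}
	for _, id := range want {
		if _, ok := set[id]; !ok {
			return false
		}
	}
	return true
}

// runAllScheduleID returns the ID of the single enabled schedule whose
// groups cover every group on the host. It returns "" when there is no
// such schedule or more than one, which maps to the 'Open' link.
func runAllScheduleID(schedules []schedule, hostGroups []string) string {
	if len(hostGroups) == 0 {
		return "" // assumption: nothing to run on a host without groups
	}
	match := ""
	for _, s := range schedules {
		if !s.Enabled || !coversAll(s.GroupIDs, hostGroups) {
			continue
		}
		if match != "" {
			return "" // second match: ambiguous
		}
		match = s.ID
	}
	return match
}

func main() {
	groups := []string{"g1", "g2"}
	schedules := []schedule{
		{ID: "nightly", Enabled: true, GroupIDs: []string{"g1", "g2"}},
		{ID: "hourly", Enabled: true, GroupIDs: []string{"g1"}},
	}
	fmt.Printf("%q\n", runAllScheduleID(schedules, groups))
}
```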
restic --json emits a status frame ~every 16ms during a backup.
The runner was forwarding every line to log.stream verbatim, which
flooded the live log pane with duplicate status JSON for any
short-running backup (visible immediately on a 1000-file, ~4MB
test set: ~14 identical 'percent_done: 1' lines in 220ms).
The progress widget already covers the same information at a sane
sample rate (one per second via job.progress), so the raw status
lines in log.stream are double-bookkeeping. Skip them and forward
only non-status lines (file names, errors, summary).
Throttling logic for job.progress is unchanged.
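The filter can be pictured as below. isStatusLine is a hypothetical name; it relies on the message_type field that restic's --json output carries on each line, and treating unparseable lines as non-status (so they are still forwarded) is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// isStatusLine reports whether one line of restic --json output is a
// status frame. Lines that are not valid JSON count as non-status so
// they still reach log.stream.
func isStatusLine(line []byte) bool {
	var frame struct {
		MessageType string `json:"message_type"`
	}
	if err := json.Unmarshal(line, &frame); err != nil {
		return false
	}
	return frame.MessageType == "status"
}

func main() {
	lines := []string{
		`{"message_type":"status","percent_done":1}`,
		`{"message_type":"summary","files_new":1000}`,
		`plain text warning`,
	}
	for _, l := range lines {
		if isStatusLine([]byte(l)) {
			continue // already covered by job.progress
		}
		fmt.Println("forward:", l)
	}
}
```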
Schedules tab Run-now used to silently HX-Redirect back to the
list, leaving the operator wondering whether the click registered.
Now:
* Single-source-group schedule → HX-Redirect to that one job's
live log, matching the per-source-group Run-now UX from Sources.
* Multi-group schedule → stay on the schedules list and fire a
success toast ("N backups dispatched: <group names>") via the
existing rm:toast HX-Trigger channel, so the operator sees clear
acknowledgement without losing their place.
dispatchBackupForGroup now returns the persisted job ID so the
caller can choose between job-log redirect and toast feedback;
on any internal failure it returns "" and the warning still
hits slog as before. The cron-fired path (dispatchScheduledJob)
ignores the return value, behaviour unchanged.
Three independent forms on /hosts/{id}/repo so saving one section
doesn't disturb the others:
* Connection: edits repo URL, username, password (pre-filled from
the redacted GET /api/hosts/{id}/repo-credentials view; password
field shows masked stored-creds placeholder; blank password = keep
existing). On save, encrypts and pushes config.update to a
connected agent.
* Bandwidth: host-wide upload/download caps (KB/s; blank = no cap)
written via store.SetHostBandwidth. New REST endpoint
PUT /api/hosts/{id}/bandwidth for JSON callers.
* Maintenance: forget/prune/check cadences + check subset %, with
per-row enabled toggles. Reuses cronParser for validation;
auto-seeds the row if a host pre-dates the migration.
Right-rail surfaces repo size, snapshot count, snapshots-by-tag
breakdown (counted from existing snapshot tag rows), and an
'untagged snapshots are left alone' note.
Danger-zone re-init button is rendered but disabled with a hint
pointing at P2R-09 (real implementation lands there).
Validation re-renders the page with the relevant form's banner and
all other section state intact. Successful saves redirect with a
?saved=<section> query param so the page surfaces a small ✓ saved
indicator on the relevant form.
ci.yml: bump golangci-lint-action v6→v7 (separate change picked up
in this commit).
Surface the Run-now button on every schedule when the host is online,
not just enabled ones. Disabled rows render the button as a non-primary
style + a HX-confirm dialog ("This schedule is paused — running it now
won't change that. Fire it once anyway?"); enabled rows keep the
zero-friction primary button.
Server-side, Run-now no longer short-circuits on !Enabled — it
dispatches the source groups inline rather than via dispatchScheduledJob
(which always bails on disabled schedules, since cron-tick semantics
are different from explicit operator intent). The audit-log entry
inside dispatchBackupForGroup still records every fire.
Aligns Sources and Schedules tab rows with the dashboard's row-click
UX: whole-row click navigates to the row's edit page (mirroring
.host-row.clickable). Drops the redundant Edit buttons; Run-now and
Delete remain in .row-action cells that sit above the row-link
overlay via z-index.
Schedule edit form's cron preset chips now carry human-readable
title= tooltips ("Every day at 03:00", "Every Sunday at 03:00", etc).
tasks.md gets a binding row-design rule covering all current and
future list-row templates, and the P2R-02 entry is split into the
six slices already agreed with the operator (slices 1–3 marked
done, 4 next).
Schedules list: status (enabled/paused) + cron + source-group tags +
actions (Run-now when enabled+online, Edit, Delete). Run-now reuses
dispatchScheduledJob — same path real cron fires take, so each
referenced source group runs as its own backup with its own tag.
Falls back to a 409 if the agent is offline.
Schedule new/edit form: cron input with five preset chips
(quick-pick @hourly / nightly / 6h / weekly / monthly), source-group
multi-pick rendered as styled checkbox cards (visual state tracks
the underlying box via a tiny inline script), enabled toggle. No
paths/excludes/retention/kind on the schedule itself — those live on
source groups now.
Server-side validation re-renders with the operator's input + ticked
groups intact. Every successful mutation calls pushScheduleSetAsync.
Adds .schd-row, .preset-chip, .picker styles.
Belt-and-braces: the UI now disables the Delete button when a group
is the only one on the host (with a tooltip explaining why), and the
server-side handler returns 409 if a curl/form-replay tries anyway.
Every host needs at least one source group to be backup-able, so the
'last group on a fresh host' case is a meaningful accident to guard
against.
Sources tab now lists every source group on the host with per-row
counts (used-by-N-schedules, snapshot count by tag), the v4
conflict tag (keep-* dimension that has no compatible cadence),
and Run-now / Edit / Delete actions. Run-now reuses the existing
HTMX-aware /hosts/{id}/source-groups/{gid}/run handler.
New /hosts/{id}/sources/new and /sources/{gid}/edit form: name +
includes/excludes textareas + the 3×2 keep-* retention grid +
retry-on-offline knobs. Server-side validation re-renders with the
operator's input intact; the inline conflict banner shows above the
retention grid when ConflictDimension is set.
Delete blocks (UI + server) when the group is referenced by any
schedule. Every successful mutation calls pushScheduleSetAsync so
an online agent re-arms within seconds.
Adds .src-row and .keep-cell to input.css for the row + retention
grid layout.
Extract header/vitals/sub-tabs into a host_chrome partial that every
host-detail tab page renders. Sources / Schedules / Repo go from
inert divs to real <a> links backed by stub pages that share the
chrome and a 'coming next' body — slices 2/3/4 fill them in.
Also re-establishes the version indicator (host_schedule_version vs
agent's applied_schedule_version) in the header.
Drops the legacy fat-schedule list/edit templates that referenced
fields removed by the P2 redesign (Manual / Paths / RetentionPolicy
on Schedule); the new templates land in slice 3.
- host_credentials_test.go's CreateEnrollmentToken fixture passed 1<<20
as the TTL (third arg, time.Duration) — that's ~1ms in nanoseconds.
Local non-race runs finished inside the window, but -race overhead
blew the deadline so the token was already expired by the time
GetEnrollmentTokenAttachments / ConsumeEnrollmentToken ran. Use
time.Hour instead, which matches the spirit of a per-test fixture.
- Lint pin v1.61.0 was built against Go 1.23 and refuses to load a
config targeting newer toolchains. go.mod is on 1.25, so the lint
step exited 3 ('the Go language version used to build golangci-lint
is lower than the targeted Go version'). Bumping to v2.1.6, which
supports Go 1.25.
Both failures showed up only on the Gitea runner because the local
make target runs go test without -race, and lint hadn't been re-run
after the go.mod toolchain bump.
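The TTL bug is easy to reproduce in isolation: an untyped integer constant used as a time.Duration counts nanoseconds. The constant names below are illustrative only.

```go
package main

import (
	"fmt"
	"time"
)

// buggyTTL mirrors the fixture's literal: 1<<20 nanoseconds.
const buggyTTL time.Duration = 1 << 20

// fixedTTL is the replacement value from the commit message.
const fixedTTL = time.Hour

func main() {
	fmt.Println(buggyTTL) // 1.048576ms
	fmt.Println(fixedTTL) // 1h0m0s
}
```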
Adds p2r01_ws_test.go covering the two paths the original commit's
in-process tests couldn't reach without a live conn:
- maybeAutoInit dispatches command.run(init) on first hello when creds
are bound, skips on second hello once a job row exists, and skips
entirely when the host has no creds.
- dispatchScheduledJob iterates a schedule's source groups and emits
one backup per group with the right Tag/Includes; persists job rows
with actor_kind=schedule + scheduled_id; no-ops on a disabled
schedule.
Drops RetentionPolicy from the per-group Run-now and schedule.fire
backup payloads — the agent's RunBackup ignores it (forget is the
only consumer). Adds Hub.Conn() so tests can grab the live *Conn
post-hello.
Schedules CRUD now takes {cron, enabled, source_group_ids[]} with cron
parsed via robfig/cron/v3 and group membership scoped to the host.
New source-groups CRUD lives at /api/hosts/{id}/source-groups; delete
refuses with 409 if any schedule still references the group, returning
the schedule list so the UI can prompt 'remove from these schedules
first.' Repo-maintenance GET/PUT manages forget/prune/check cadences
on host_repo_maintenance — no version bump, the server-side ticker
(P2R-06) drives execution.
Per-source-group Run-now (POST /hosts/{id}/source-groups/{gid}/run)
resolves the group's includes/excludes/retention/tag and dispatches a
backup command.run with the new structured CommandRunPayload fields
(Includes/Excludes/Tag). Old per-host /hosts/{id}/run-backup and
/hosts/{id}/init-repo return 410 Gone with a redirect message.
schedule_push.go is rebuilt: buildScheduleSetPayload assembles the
slim wire shape, pushScheduleSetOnConn ships it during the on-hello
window, pushScheduleSetAsync fires after every CRUD mutation, and
dispatchScheduledJob handles agent schedule.fire by iterating the
schedule's source groups and dispatching one backup per group with
actor_kind=schedule and scheduled_id pointing at the schedule.
Auto-init at first WS connect: when the host has repo creds bound and
no init job in its history, server dispatches restic init. Restic's
'config file already exists' soft-success means re-runs against an
existing repo no-op; we don't auto-retry on failure (operator triggers
re-init manually via the danger zone in P2R-09).
api.Schedule drops Kind/Paths/Excludes/Tags/RetentionPolicy/Manual etc.
in favour of {id, cron, enabled, source_groups: [...]}. The agent
scheduler stops checking sch.Manual; cmd/agent's backup dispatch reads
Includes/Excludes/Tag instead of Args.
Tests cover the new HTTP surface end-to-end: source-groups CRUD with
in-use refusal, schedule validation (bad cron / missing groups /
foreign group), repo-maintenance auto-seed and validation, the 410
route, and buildScheduleSetPayload's wire-shape correctness. Full
suite passes; smoke env exercises auto-init dispatch on hello,
async push after schedule create, and per-source-group Run-now
landing the right paths/excludes/tag at the agent.
Go-side data model rebuilt against migration 0008. The fat-Schedule
shape (paths/excludes/tags/retention/manual/kind/options/hooks) is
gone; that surface lives on source_groups now.
* store/types.go
- Schedule slimmed to {id, host_id, cron, enabled, source_group_ids,
timestamps}. SourceGroupIDs populated by Get/List, accepted on
Create/Update so callers pass desired junction state in one shape.
- SourceGroup added: name (= snapshot tag), includes/excludes,
retention_policy, retry_max + retry_backoff_seconds, cached
conflict_dimension.
- HostRepoMaintenance added: forget/prune/check cadences + enabled.
- PendingRun added: offline-retry queue.
- Host loses RepoInitialisedAt; gains BandwidthUpKBps + BandwidthDownKBps.
- RetentionPolicy moves home from "schedule field" to "source group
field" but the type itself + Summary() method unchanged.
* store/sources.go (new) — CRUD + GetByName + ConflictDimension cache.
Group writes bump host_schedule_version; conflict cache writes don't
(server-internal projection, agent doesn't see it).
* store/maintenance.go (new) — CreateDefault is idempotent (INSERT OR
IGNORE). UpdateRepoMaintenance doesn't bump schedule version because
these run on the server's own ticker, not the agent's local cron.
* store/pending.go (new) — Enqueue / DueRunsForRetry / Bump / Delete.
* store/schedules.go — rewritten for slim shape + junction CRUD.
Update wipes the schedule_source_groups junction wholesale and
re-inserts (simpler than diffing). Adds SchedulesUsingGroup for
retention-conflict detection + UI labels.
* store/hosts.go — drops repo_initialised_at scan, adds bandwidth scan.
New SetHostBandwidth helper.
* HTTP layer — temporarily stubbed during this rewrite (501 returns
with redesign_in_progress error code). Phase 3 fills these in
against the new shape:
- schedules.go REST CRUD
- schedule_push.go agent reconciliation
- ui_schedules.go HTML form CRUD
Run-now-per-host + Init-repo handlers in ui_handlers.go also stubbed
— both go away in the new model (Run-now per source group; auto-init
at host enrolment).
* enrollment.go — replaces "seed manual schedule from typed paths"
with "seed default source group + repo-maintenance row." The default
group gets the typed paths as its includes; operator edits later
via Sources tab.
* ws/handler.go — drops the MarkHostRepoInitialised projection (column
is gone; auto-init makes it derivable from latest init job's status).
Tests:
* store: existing schedule test rewritten for slim shape + junction;
new sources_test.go covers source-group CRUD, name uniqueness,
conflict cache, repo-maintenance defaults + idempotent seed,
pending-runs queue lifecycle.
* http: schedules_test.go and schedule_push_test.go deleted — both
exercised the obsolete fat-schedule API. Phase 3 rewrites them
against the new endpoints.
go test ./... green. cmd/server + cmd/agent build. The UI is broken
end-to-end (schedules / sources / repo tabs all hit 501 stubs); Phase 3
restores REST + on-the-wire reconciliation; Phase 4 rewires the UI
templates against the new model.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Schema rebuild for the model collapse described in
design/v4-sources-redesign.html. Three nouns now stand on their
own:
* schedules — slim. Only cron + enabled + host_id. Fat-schedule
shape (paths/excludes/tags/retention/manual/kind/options/hooks)
is dropped wholesale. Schedule data wiped — by design (smoke env
was nuked before this ran; fresh installs have nothing to lose).
* source_groups — name + includes + excludes + retention_policy +
retry policy + cached conflict_dimension. Group name doubles as
the snapshot tag so retention can target it cleanly. UNIQUE
(host_id, name) enforces tag unambiguity.
* schedule_source_groups — N:M junction. One schedule can fire N
groups per tick; one group can be referenced by N schedules.
* host_repo_maintenance — 1:1 with hosts. Default cadences:
forget daily 03:00, prune weekly Sun 04:00, check monthly 1st
05:00 with --read-data-subset 5%. Operator can edit on Repo tab.
* pending_runs — offline-retry queue. Server-side ticker dispatches
due rows; bounded by source_groups.retry_max + retry_backoff_seconds.
Plus:
* hosts.bandwidth_up_kbps / .bandwidth_down_kbps — host-wide caps.
* hosts.repo_initialised_at — DROPPED. Auto-init on enrol makes
it derivable from the latest init job; the Init-repo button goes
too (failure surfaces via job history banner).
Note on FK safety: smoke env was wiped before migration ran, so
DROP TABLE schedules cascades to nothing. Fresh installs apply
0001-0007 then immediately 0008 — same story (no schedule rows
to lose). For an upgrade path on a populated DB, this migration
would need a data-preserving variant; not needed today.
Tests fail to compile/run after this — expected. The Go side
(store types, CRUD, REST handlers, agent runner, UI templates)
gets rebuilt in subsequent phases. tasks.md will track P2 redesign
progress.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end forget plumbing — operator can create a forget schedule
with keep-* values, agent runs restic forget --keep-* … on the
schedule's cron (or via per-row Run-now), snapshot list shrinks,
UI updates.
* api.CommandRunPayload gains retention_policy json.RawMessage so
the agent doesn't need a typed copy of the server-side struct.
* restic.ForgetPolicy mirrors restic's --keep-* flags. Empty()
reports zero dimensions; restic wrapper RunForget refuses to
run an empty policy (would delete every snapshot). Does NOT
pass --prune: pruning lives behind a separate admin-only
credential (P2-06); forget only drops snapshot records and leaves
the underlying data for a later prune.
* runner.RunForget mirrors RunBackup's envelope shape so the
live log viewer works without special-casing. On success
triggers reportSnapshots (forget shrinks the index, the host's
snapshot count almost certainly changed).
* cmd/agent dispatcher handles MsgCommandRun with kind=forget,
decodes RetentionPolicy from the wire, builds restic.ForgetPolicy.
* Server dispatchScheduleNow marshals the schedule's
RetentionPolicy into the wire payload for kind=forget jobs.
Refuses to dispatch a forget schedule with empty retention.
* validateSchedule rejects kind=forget without at least one keep-*
dimension (new error code: missing_retention).
* UI schedule edit form gains a Kind dropdown (backup or forget;
immutable on edit). Paths block toggles by kind via inline
data-kind attributes. Form help-text explains the prune
separation.
Other kinds (prune, check, unlock) deferred to P2-06..08; the
Kind dropdown only offers backup and forget today.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Agent runs as root (HOME=/root from systemd) with ProtectHome=
read-only, so restic's `mkdir /root/.cache/restic` fails on the
first call. Backups still completed (restic falls back to no-cache)
but every job log started with a noisy red "unable to open cache"
warning.
Default to /var/lib/restic-manager unconditionally — that's already
in the unit's ReadWritePaths and survives ProtectHome. ExtraEnv
overrides still win for tests / unusual setups.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-running restic init on a repo that's already initialised exits
non-zero with "Fatal: ... config file already exists". Semantically
that's a no-op, not a failure — the repo IS initialised, the
caller's intent is satisfied. Sniff stderr for the magic string
and swallow the exit code in that case, emitting an event line
so the operator-facing log says what happened.
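A minimal sketch of the soft-success sniff. The function names and errExit are hypothetical, and the sample stderr line reuses the elided form quoted above rather than real restic output.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// alreadyInitialisedMarker is the stderr substring quoted in the commit message.
const alreadyInitialisedMarker = "config file already exists"

// errExit stands in for the non-zero exit error a process runner returns.
var errExit = errors.New("exit status 1")

// sawAlreadyInitialised reports whether any stderr line carries the marker.
func sawAlreadyInitialised(stderrLines []string) bool {
	for _, l := range stderrLines {
		if strings.Contains(l, alreadyInitialisedMarker) {
			return true
		}
	}
	return false
}

// initResult swallows the exit error only when restic failed because the
// repository already exists; every other error passes through unchanged.
func initResult(runErr error, stderrLines []string) error {
	if runErr != nil && sawAlreadyInitialised(stderrLines) {
		return nil
	}
	return runErr
}

func main() {
	stderr := []string{"Fatal: ... config file already exists"}
	fmt.Println(initResult(errExit, stderr))                  // <nil>
	fmt.Println(initResult(errExit, []string{"other error"})) // exit status 1
}
```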
Caught while smoke-testing P2-04.5: I had initialised the repo
manually during a debug session, so an operator clicking the UI's
Init-repo button would hit this error and the host's
repo_initialised_at would never flip.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pending page suppressed the htpasswd snippet when repo_username
was blank — but with --private-repos the username is required for
auth, and operators routinely leave the field blank assuming the
system will pick something sensible.
* handleUIAddHostPost defaults repo_username to the typed hostname
when blank. Matches what --private-repos expects (URL path
segment == username).
* pending_host.html: snippet now renders whenever a password is
present (always true after the generate-on-blank logic landed
earlier).
* Form help-text updated to describe the default explicitly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two issues from a smoke session:
1. The awaiting-agent panel never refreshed — operator had to go
back to the dashboard to see the host had connected.
2. Generated passwords were displayed only on the POST response.
Navigating away (or even an accidental tab close) lost them
permanently, so the operator couldn't update the rest-server's
htpasswd.
Both are the same fix: convert the POST-rendered transient
"result state" into a durable GET page at /hosts/pending/{token}.
* New route GET /hosts/pending/{token} renders the install-command +
htpasswd snippet view. Password is decrypted from the (still-
encrypted-at-rest) token row on every render — operator can
refresh, bookmark, navigate away and come back. Once the agent
enrols, the page redirects to /hosts/{id}; once the token
expires, redirect to /hosts/new.
* New route GET /hosts/pending/{token}/awaiting returns a polled
HTML fragment that the pending page swaps in every 2s via HTMX.
States: awaiting (keep polling) | connected (show "Open host →"
+ "View schedules" CTAs, polling stops) | expired (mint-new
link, polling stops). Polling stops naturally because only the
awaiting state's wrapper carries the hx-trigger attribute.
* POST /hosts/new now 303-redirects to /hosts/pending/{token}
on success; validation errors keep re-rendering the form with
banner.
Supporting changes:
* New store helper Store.GetEnrollmentTokenStatus(tokenHash) for
the polling endpoint — returns {expires_at, consumed_at,
consumed_host} in one round-trip without dragging in the
attachments-decryption path.
* New ui.Renderer.RenderPartial(w, name, data) for HTMX fragment
responses (no layout wrap). Picks an arbitrary page's template
set as the lookup point — every page parses the full common-
paths list, so they all see every partial.
* add_host.html stripped to form-only; pending_host.html owns the
result-state UI; awaiting_agent.html is the polled partial.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>