d6f3d84ba5c39bc52bb6c73b90875d93b62c7021
11 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
d6dcdd5ec4 |
server: drainer uses dispatch-core to avoid duplicate pending_run enqueue
Extract dispatchBackupForGroupCore (persist+marshal+send, no enqueue on failure) from dispatchBackupForGroup. drainOne now calls the core directly so a failed Send only bumps the existing pending_runs row via BumpPendingRunAttempt — not create a second row — stopping the geometric duplication on repeated drain failures. dispatchBackupForGroup (schedule.fire path) wraps the core and keeps its enqueue-on-failure behaviour unchanged. TestDrainPendingBumpsOnSendFailure strengthened: asserts exactly 1 row remains after a send failure (was tolerating >=1 duplicate rows). |
||
|
|
18a4f74a22 |
server: enqueue pending_runs when scheduled-job dispatch fails
When dispatchBackupForGroup's conn.Send errors, queue a pending_runs row (attempt=1, next_attempt_at = now + group.RetryBackoffSeconds) instead of silently dropping the fire. The orphaned queued job row is left behind for forensic visibility — the drainer will create a fresh job row on its retry. Also adds Store.ListPendingRunsForHost — the on-reconnect drain walks every row for the host, regardless of due-ness, since the host being back makes 'due' irrelevant. |
||
|
|
b6f8de1dcc |
lint: drive baseline to zero, drop only-new-issues gate
Cleanup pass over the repo so CI can enforce lint going forward
without the only-new-issues escape hatch:
* gofumpt -w across the tree (31 hits, all formatting)
* misspell --fix (25 hits, US-locale spelling) — but reverted on
api.JobCancelled = "cancelled" since that literal is the wire +
DB CHECK constraint value, plus matched the case in store/fleet.go
back to "cancelled" and added //nolint:misspell on both for the
next time someone reaches for the auto-fix
* Wrap every `defer rows.Close()` / `defer stmt.Close()` /
`defer res.Body.Close()` in `defer func() { _ = .Close() }()`
to satisfy errcheck without losing the close itself
* websocket.Dial callers (1 prod, 4 tests) now capture + close the
upgrade response Body — coder/websocket can return res with a nil
Body on success, so the test deferred-closes guard against that
* Annotate the two genuine-by-design nilerr cases with //nolint
comments explaining why nil-on-error is the contract (cookie
missing = no session; ctx cancelled mid-backoff = clean shutdown)
* Add brief godoc on the 10 exported const groups + types that
revive flagged (api.HostOS/HostArch/JobKind/JobStatus/LogStream/
ErrorCode, restic.EventKind, store.Role, web.FS)
* Drop the unused (*Server).userByID method
* Inline the unparam baseView(active) — every UI page is under
the dashboard primary nav today
Result: `golangci-lint run ./...` reports 0 issues. CI lint job
no longer needs only-new-issues: true; X-06 follow-up entry in
tasks.md removed.
|
||
|
|
e45f75598f |
P2R-02 follow-up: schedule Run-now feedback (single → job log, multi → toast)
Schedules tab Run-now used to silently HX-Redirect back to the
list, leaving the operator wondering whether the click registered.
Now:
* Single-source-group schedule → HX-Redirect to that one job's
live log, matching the per-source-group Run-now UX from Sources.
* Multi-group schedule → stay on the schedules list and fire a
success toast ("N backups dispatched: <group names>") via the
existing rm:toast HX-Trigger channel, so the operator sees clear
acknowledgement without losing their place.
dispatchBackupForGroup now returns the persisted job ID so the
caller can choose between job-log redirect and toast feedback;
on any internal failure it returns "" and the warning still
hits slog as before. The cron-fired path (dispatchScheduledJob)
ignores the return value, behaviour unchanged.
|
||
|
|
d692272d10 |
P2R-01 follow-up: WS-path tests + drop unused retention from backup dispatch
Adds p2r01_ws_test.go covering the two paths the original commit's in-process tests couldn't reach without a live conn: - maybeAutoInit dispatches command.run(init) on first hello when creds are bound, skips on second hello once a job row exists, and skips entirely when the host has no creds. - dispatchScheduledJob iterates a schedule's source groups and emits one backup per group with the right Tag/Includes; persists job rows with actor_kind=schedule + scheduled_id; no-ops on a disabled schedule. Drops RetentionPolicy from the per-group Run-now and schedule.fire backup payloads — the agent's RunBackup ignores it (forget is the only consumer). Adds Hub.Conn() so tests can grab the live *Conn post-hello. |
||
|
|
ec0bf0f6c3 |
P2R-01: REST + WS rewire against the slim shape
Schedules CRUD now takes {cron, enabled, source_group_ids[]} with cron
parsed via robfig/cron/v3 and group membership scoped to the host.
New source-groups CRUD lives at /api/hosts/{id}/source-groups; delete
refuses with 409 if any schedule still references the group, returning
the schedule list so the UI can prompt 'remove from these schedules
first.' Repo-maintenance GET/PUT manages forget/prune/check cadences
on host_repo_maintenance — no version bump, the server-side ticker
(P2R-06) drives execution.
Per-source-group Run-now (POST /hosts/{id}/source-groups/{gid}/run)
resolves the group's includes/excludes/retention/tag and dispatches a
backup command.run with the new structured CommandRunPayload fields
(Includes/Excludes/Tag). Old per-host /hosts/{id}/run-backup and
/hosts/{id}/init-repo return 410 Gone with a redirect message.
schedule_push.go is rebuilt: buildScheduleSetPayload assembles the
slim wire shape, pushScheduleSetOnConn ships it during the on-hello
window, pushScheduleSetAsync fires after every CRUD mutation, and
dispatchScheduledJob handles agent schedule.fire by iterating the
schedule's source groups and dispatching one backup per group with
actor_kind=schedule and scheduled_id pointing at the schedule.
Auto-init at first WS connect: when the host has repo creds bound and
no init job in its history, server dispatches restic init. Restic's
'config file already exists' soft-success means re-runs against an
existing repo no-op; we don't auto-retry on failure (operator triggers
re-init manually via the danger zone in P2R-09).
api.Schedule drops Kind/Paths/Excludes/Tags/RetentionPolicy/Manual etc.
in favour of {id, cron, enabled, source_groups: [...]}. The agent
scheduler stops checking sch.Manual; cmd/agent's backup dispatch reads
Includes/Excludes/Tag instead of Args.
Tests cover the new HTTP surface end-to-end: source-groups CRUD with
in-use refusal, schedule validation (bad cron / missing groups /
foreign group), repo-maintenance auto-seed and validation, the 410
route, and buildScheduleSetPayload's wire-shape correctness. Full
suite passes; smoke env exercises auto-init dispatch on hello,
async push after schedule create, and per-source-group Run-now
landing the right paths/excludes/tag at the agent.
|
||
|
|
e7eea7afac |
P2 redesign · phase 2: store rewrite — sources, slim schedules, repo maintenance
Go-side data model rebuilt against migration 0008. The fat-Schedule
shape (paths/excludes/tags/retention/manual/kind/options/hooks) is
gone; that surface lives on source_groups now.
* store/types.go
- Schedule slimmed to {id, host_id, cron, enabled, source_group_ids,
timestamps}. SourceGroupIDs populated by Get/List, accepted on
Create/Update so callers pass desired junction state in one shape.
- SourceGroup added: name (= snapshot tag), includes/excludes,
retention_policy, retry_max + retry_backoff_seconds, cached
conflict_dimension.
- HostRepoMaintenance added: forget/prune/check cadences + enabled.
- PendingRun added: offline-retry queue.
- Host loses RepoInitialisedAt; gains BandwidthUpKBps + BandwidthDownKBps.
- RetentionPolicy moves home from "schedule field" to "source group
field" but the type itself + Summary() method unchanged.
* store/sources.go (new) — CRUD + GetByName + ConflictDimension cache.
Group writes bump host_schedule_version; conflict cache writes don't
(server-internal projection, agent doesn't see it).
* store/maintenance.go (new) — CreateDefault is idempotent (INSERT OR
IGNORE). UpdateRepoMaintenance doesn't bump schedule version because
these run on the server's own ticker, not the agent's local cron.
* store/pending.go (new) — Enqueue / DueRunsForRetry / Bump / Delete.
* store/schedules.go — rewritten for slim shape + junction CRUD.
Update wipes the schedule_source_groups junction wholesale and
re-inserts (simpler than diffing). Adds SchedulesUsingGroup for
retention-conflict detection + UI labels.
* store/hosts.go — drops repo_initialised_at scan, adds bandwidth scan.
New SetHostBandwidth helper.
* HTTP layer — temporarily stubbed during this rewrite (501 returns
with redesign_in_progress error code). Phase 3 fills these in
against the new shape:
- schedules.go REST CRUD
- schedule_push.go agent reconciliation
- ui_schedules.go HTML form CRUD
Run-now-per-host + Init-repo handlers in ui_handlers.go also stubbed
— both go away in the new model (Run-now per source group; auto-init
at host enrolment).
* enrollment.go — replaces "seed manual schedule from typed paths"
with "seed default source group + repo-maintenance row." The default
group gets the typed paths as its includes; operator edits later
via Sources tab.
* ws/handler.go — drops the MarkHostRepoInitialised projection (column
is gone; auto-init makes it derivable from latest init job's status).
Tests:
* store: existing schedule test rewritten for slim shape + junction;
new sources_test.go covers source-group CRUD, name uniqueness,
conflict cache, repo-maintenance defaults + idempotent seed,
pending-runs queue lifecycle.
* http: schedules_test.go and schedule_push_test.go deleted — both
exercised the obsolete fat-schedule API. Phase 3 rewrites them
against the new endpoints.
go test ./... green. cmd/server + cmd/agent build. The UI is broken
end-to-end (schedules / sources / repo tabs all hit 501 stubs); Phase 3
restores REST + on-the-wire reconciliation; Phase 4 rewires the UI
templates against the new model.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
6a171596f1 |
P2-05: forget command with retention policy
End-to-end forget plumbing — operator can create a forget schedule with keep-* values, agent runs restic forget --keep-* … on the schedule's cron (or via per-row Run-now), snapshot list shrinks, UI updates. * api.CommandRunPayload gains retention_policy json.RawMessage so the agent doesn't need a typed copy of the server-side struct. * restic.ForgetPolicy mirrors restic's --keep-* flags. Empty() reports zero dimensions; restic wrapper RunForget refuses to run an empty policy (would delete every snapshot). Does NOT pass --prune — pruning lives behind a separate admin-only credential (P2-06); forget just rewrites the snapshot index. * runner.RunForget mirrors RunBackup's envelope shape so the live log viewer works without special-casing. On success triggers reportSnapshots (forget shrinks the index, the host's snapshot count almost certainly changed). * cmd/agent dispatcher handles MsgCommandRun with kind=forget, decodes RetentionPolicy from the wire, builds restic.ForgetPolicy. * Server dispatchScheduleNow marshals the schedule's RetentionPolicy into the wire payload for kind=forget jobs. Refuses to dispatch a forget schedule with empty retention. * validateSchedule rejects kind=forget without at least one keep-* dimension (new error code: missing_retention). * UI schedule edit form gains a Kind dropdown (backup or forget; immutable on edit). Paths block toggles by kind via inline data-kind attributes. Form help-text explains the prune separation. Other kinds (prune, check, unlock) deferred to P2-06..08; the Kind dropdown only offers backup and forget today. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
8fb1c100fd |
P2-04.5: kill host.default_paths in favour of manual schedules
Two independent path lists for "what does this host back up?" was
a real divergence footgun — operator types one set at Add-host time
and a different set into a schedule, both end up in the same repo,
the snapshot history looks fine until restore. Resolution: drop
host.default_paths entirely; add a `manual` flag on schedules.
A manual schedule has paths/excludes/tags/retention like any other
but no cron — it fires only via per-schedule Run-now. Single source
of truth for what gets backed up.
Schema (migration 0007):
* schedules.manual INTEGER NOT NULL DEFAULT 0.
* For every host with non-empty default_paths, seed a manual
schedule with those paths and bump host_schedule_version.
* ALTER TABLE hosts DROP COLUMN default_paths.
* ALTER TABLE enrollment_tokens RENAME COLUMN default_paths
TO initial_paths.
Original draft of this migration rebuilt hosts via the
create-new + drop-old + rename-new pattern. With foreign_keys=ON
(set in the connection DSN), DROP TABLE on the parent fired
ON DELETE CASCADE on every child of hosts(id) — schedules /
jobs / snapshots / host_credentials all wiped on the smoke env
when I tried it. SQLite 3.35+ supports column-level ALTERs
directly, so we skip the rebuild dance and avoid the cascade
trap. Six lines of SQL instead of sixty, no FK risk.
Run-now rewiring:
* New `dispatchScheduleNow(hostID, scheduleID, conn?)` helper
unifies the agent-driven path (cron fire → schedule.fire →
OnScheduleFire callback) and the UI-driven path (operator
clicks Run-now on a schedule row). Conn arg is optional; nil
falls back to Hub.Send.
* New POST /hosts/{id}/schedules/{sid}/run endpoint — per-row
Run-now button on the schedules list.
* Dashboard's per-host Run-now (handleUIRunBackup) now picks the
host's only enabled manual schedule, falls back to the only
enabled schedule, else returns "pick one in Schedules tab".
Keeps one-click for the common case.
Agent:
* Scheduler skips manual schedules in cron build (silent — they're
a normal data shape, not an error).
* Wire Schedule struct gains Manual flag.
* Schedule.fire flow unchanged — the agent only ever fires
non-manual schedules anyway.
UI:
* Add-host form retitled "Initial schedule · manual" so the
operator knows the paths become an editable schedule under
the Schedules tab. Result page calls out the manual schedule
+ points at Host > Schedules.
* Schedule edit form: "Manual schedule" checkbox at the top of
the When section; toggling it hides/shows the cron field via
inline JS. Server-side validator skips the cron requirement
when manual=true.
* Schedule list shows a "manual" tag under the status pill and
renders the When column as "— run-now only —" for manual rows.
Each row gets a Run-now button when the schedule is enabled
and the host is online.
Tests + go test ./... green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
608962441b |
P2-02 (agent side) + P2-03: agent scheduler + schedule.fire dispatch
Closes the schedule reconciliation loop end-to-end.
* New `internal/agent/scheduler` package wraps robfig/cron/v3 with
the lifecycle the agent needs:
- Apply(ScheduleSetPayload, Sender) stops the prior cron (waiting
for in-flight entries to return), rebuilds from scratch, starts,
and emits schedule.ack with the version we just applied.
- Disabled entries skipped silently; bad cron exprs (which
shouldn't reach us — the server validates — but defensive)
log a warn and skip.
- On each cron tick the entry sends a new schedule.fire envelope
to the server with {schedule_id, scheduled_at}. The scheduler
itself never builds CommandRunPayloads — server is the source
of truth for jobs.
- tx is swapped on every Apply, so reconnect is handled
naturally: cron entries that fire against a dropped tx log
"no active connection" and skip the tick.
- Stop() is idempotent and waits for the cron's in-flight
workers via cron.Stop().Done().
* New wire message api.MsgScheduleFire + api.ScheduleFirePayload
for the agent → server "I just fired locally" RPC.
* Server-side dispatch (schedule_push.go: dispatchScheduledJob):
looks up the schedule by id, validates ownership + that it's
enabled, builds args from kind (paths for backup; other kinds
are still arg-less in Phase 2 and grow as those job kinds land
in P2-05..08), persists a jobs row with actor_kind=schedule +
scheduled_id, and writes command.run back on the same conn so
the agent runs through its existing dispatch path.
* store.CreateJob now writes scheduled_id. This column was in the
schema since 0001 but never populated — the original P1 path
only had operator-driven jobs, so actor_kind was always 'user'
and scheduled_id was always nil.
* cmd/agent/main.go integration: dispatcher gains a
*scheduler.Scheduler; the MsgScheduleSet case now hands the
payload to scheduler.Apply (in a goroutine so the WS read loop
keeps draining other messages).
* WS dispatcher gains OnScheduleFire alongside OnScheduleAck.
* Tests:
- scheduler unit tests (4): ack-on-apply, cron tick fires
schedule.fire envelope, disabled entries don't fire, replace-
prior-state stops the old cron.
- Server-side end-to-end: schedule.fire → command.run with the
right job_id / kind / args, plus jobs row with actor_kind=
"schedule" and scheduled_id linking back to the schedule.
Persistence of next-fire times across agent restarts is
deliberately deferred. A missed fire window during downtime
simply fires once on reconnect — that's the desirable behaviour
(the operator wants the missed backup to run, not be silently
skipped because we lost track of when it was due).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
a086b0eb75 |
P2-02 (server side): schedule reconciliation push + ack handling
Server is now the source of truth for the agent's cron set.
* Helpers in schedule_push.go:
- loadScheduleSetPayload reads the host's schedules + canonical
version into the wire shape.
- pushScheduleSetOnConn writes directly to a just-handshaken conn
(avoids racing against Hub.Register on a brand-new connection).
- pushScheduleSetAsync is the post-CRUD flavour — no-op when the
host is offline (the next reconnect's on-hello path catches it
up, so a missed push is non-fatal).
- applyScheduleAck records what version the agent has confirmed.
* onAgentHello restructured: was returning early when the host had
no repo credentials, which made the schedule push unreachable for
fresh hosts. Split into pushRepoCredsOnHello (silent no-op on
ErrNotFound) + pushScheduleSetOnConn (always runs). Empty schedule
list is a valid push: tells the agent to drop stale cron entries.
* WS dispatcher gains an OnScheduleAck hook on HandlerDeps; the
http server wires it to applyScheduleAck. MsgScheduleAck moves
out of the "TODO(P2)" group into a real case that decodes the
payload and forwards to the callback.
* Schedule CRUD handlers each fire pushScheduleSetAsync after the
audit-log write so the agent picks up changes within seconds.
Tests cover:
- On-hello push of an already-created schedule, agent acks,
applied_schedule_version flips on the host row.
- Connect-then-CRUD: empty initial push (version 0), then a
follow-on push at version 1 after the operator creates a
schedule via REST.
Agent-side `schedule.set` handler (parse, replace local cron,
emit `schedule.ack`) is the remainder of P2-02 and lands with
P2-03's local scheduler.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|