server: drain pending_runs on tick + on agent reconnect

Two trigger paths land here:

- A 30s ticker in cmd/server calls Server.DrainAllDue(ctx). It walks
  pending_runs rows whose next_attempt_at <= now, dedupes by host,
  skips offline hosts, and runs DrainPending for each online host.
- onAgentHello spawns a background DrainPending(hostID). When a host
  comes back, every pending row for it is dispatchable now; due-ness
  becomes irrelevant once the wire is back.

Each row's schedule + group are reloaded; ErrNotFound, a disabled
schedule, or a gone group abandons the row with a
pending_run.abandoned audit. attempt >= retry_max also abandons.
Otherwise dispatchBackupForGroup is invoked; success deletes the row,
failure bumps attempt with exponential backoff capped at 30m.
@@ -147,6 +147,15 @@ func run() error {
 	// work.
 	maintenanceTick := time.NewTicker(60 * time.Second)
 	defer maintenanceTick.Stop()
+	// Pending-runs drain ticker: 30s cadence sweeps every host with
+	// pending_runs rows whose next_attempt_at <= now (rows accumulate
+	// when a schedule.fire's command.run send fails because the agent
+	// dropped offline mid-flight). The on-reconnect path in
+	// onAgentHello handles the common case; this ticker is the
+	// safety-net for hosts that come back without a fresh hello (they
+	// shouldn't, but the queue exists either way).
+	pendingDrainTick := time.NewTicker(30 * time.Second)
+	defer pendingDrainTick.Stop()
 	mt := maintenance.New(st)
 	go func() {
 		for {
@@ -165,6 +174,8 @@ func run() error {
 				if n, err := st.MarkHostsOfflineStale(ctx, cutoff); err == nil && n > 0 {
 					slog.Info("marked hosts offline (stale heartbeat)", "n", n)
 				}
+			case <-pendingDrainTick.C:
+				srv.DrainAllDue(ctx)
 			case <-maintenanceTick.C:
 				decisions, err := mt.Decide(ctx, time.Now().UTC())
 				if err != nil {