server: drain pending_runs on tick + on agent reconnect

Two trigger paths land here:

- A 30s ticker in cmd/server calls Server.DrainAllDue(ctx). It walks
  pending_runs rows whose next_attempt_at <= now, dedupes by host,
  skips offline hosts, and runs DrainPending for each online host.
- onAgentHello spawns a background DrainPending(hostID). When a host
  comes back, every pending row for it is dispatchable now; due-ness
  becomes irrelevant once the wire is back.

Each row's schedule + group are reloaded; ErrNotFound, a disabled
schedule, or a gone group abandons the row with a
pending_run.abandoned audit. attempt >= retry_max also abandons.
Otherwise dispatchBackupForGroup is invoked; success deletes the row,
failure bumps attempt with exponential backoff capped at 30m.
2026-05-03 23:57:08 +01:00
parent 18a4f74a22
commit 5b4a590508
4 changed files with 604 additions and 0 deletions
@@ -147,6 +147,15 @@ func run() error {
 	// work.
 	maintenanceTick := time.NewTicker(60 * time.Second)
 	defer maintenanceTick.Stop()
+	// Pending-runs drain ticker: 30s cadence sweeps every host with
+	// pending_runs rows whose next_attempt_at <= now (rows accumulate
+	// when a schedule.fire's command.run send fails because the agent
+	// dropped offline mid-flight). The on-reconnect path in
+	// onAgentHello handles the common case; this ticker is the
+	// safety-net for hosts that come back without a fresh hello (they
+	// shouldn't, but the queue exists either way).
+	pendingDrainTick := time.NewTicker(30 * time.Second)
+	defer pendingDrainTick.Stop()
 	mt := maintenance.New(st)
 	go func() {
 		for {
@@ -165,6 +174,8 @@ func run() error {
 				if n, err := st.MarkHostsOfflineStale(ctx, cutoff); err == nil && n > 0 {
 					slog.Info("marked hosts offline (stale heartbeat)", "n", n)
 				}
+			case <-pendingDrainTick.C:
+				srv.DrainAllDue(ctx)
 			case <-maintenanceTick.C:
 				decisions, err := mt.Decide(ctx, time.Now().UTC())
 				if err != nil {