server: drain pending_runs on tick + on agent reconnect

Two trigger paths land here:

- A 30s ticker in cmd/server calls Server.DrainAllDue(ctx). It
  walks pending_runs rows whose next_attempt_at <= now, dedupes by
  host, skips offline hosts, and per online host runs DrainPending.

- onAgentHello spawns a background DrainPending(hostID). When a
  host comes back, every pending row for it is dispatchable now —
  due-ness becomes irrelevant once the wire is back.

For each row, the schedule and group are reloaded. The row is
abandoned, with a pending_run.abandoned audit entry, when the
reload returns ErrNotFound, the schedule is disabled, the group is
gone, or attempt >= retry_max. Otherwise dispatchBackupForGroup is
invoked: success deletes the row; failure bumps attempt and applies
exponential backoff capped at 30m.
2026-05-03 23:57:08 +01:00
parent e64cf25c0e
commit 3e337dfb3c
4 changed files with 604 additions and 0 deletions
@@ -411,6 +411,11 @@ func (s *Server) onAgentHello(ctx context.Context, hostID string, conn *ws.Conn)
// just no-ops. Skipped silently when the host has no creds yet —
// the next hello after the operator binds creds will dispatch.
s.maybeAutoInit(ctx, hostID, conn)
// Drain any pending runs that accumulated while this host was
// offline. Use a fresh context — the hello-bound ctx is short-lived,
// and the drain may take seconds across many rows. A non-blocking
// goroutine keeps the hello path snappy.
go s.DrainPending(context.Background(), hostID)
}
// maybeAutoInit dispatches a `restic init` job iff the host has no