P2-02 (agent side) + P2-03: agent scheduler + schedule.fire dispatch

Closes the schedule reconciliation loop end-to-end.

* New `internal/agent/scheduler` package wraps robfig/cron/v3 with
  the lifecycle the agent needs:
  - Apply(ScheduleSetPayload, Sender) stops the prior cron (waiting
    for in-flight entries to return), rebuilds from scratch, starts,
    and emits schedule.ack with the version we just applied.
  - Disabled entries skipped silently; bad cron exprs (which
    shouldn't reach us, since the server validates, but defensive)
    log a warn and skip.
  - On each cron tick the entry sends a new schedule.fire envelope
    to the server with {schedule_id, scheduled_at}. The scheduler
    itself never builds CommandRunPayloads; server is the source
    of truth for jobs.
  - tx is swapped on every Apply, so reconnect is handled
    naturally: cron entries that fire against a dropped tx log
    "no active connection" and skip the tick.
  - Stop() is idempotent and waits for the cron's in-flight
    workers via cron.Stop().Done().
* New wire message api.MsgScheduleFire + api.ScheduleFirePayload
  for the agent → server "I just fired locally" RPC.
* Server-side dispatch (schedule_push.go: dispatchScheduledJob):
  looks up the schedule by id, validates ownership + that it's
  enabled, builds args from kind (paths for backup; other kinds
  are still arg-less in Phase 2 and grow as those job kinds land
  in P2-05..08), persists a jobs row with actor_kind=schedule +
  scheduled_id, and writes command.run back on the same conn so
  the agent runs through its existing dispatch path.
* store.CreateJob now writes scheduled_id. This column was in the
  schema since 0001 but never populated: the original P1 path
  only had operator-driven jobs, so actor_kind was always 'user'
  and scheduled_id was always nil.
* cmd/agent/main.go integration: dispatcher gains a
  *scheduler.Scheduler; the MsgScheduleSet case now hands the
  payload to scheduler.Apply (in a goroutine so the WS read loop
  keeps draining other messages).
* WS dispatcher gains OnScheduleFire alongside OnScheduleAck.
* Tests:
  - scheduler unit tests (4): ack-on-apply, cron tick fires
    schedule.fire envelope, disabled entries don't fire, replace-
    prior-state stops the old cron.
  - Server-side end-to-end: schedule.fire → command.run with the
    right job_id / kind / args, plus jobs row with actor_kind=
    "schedule" and scheduled_id linking back to the schedule.
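
For readers who want the server-side dispatch steps in one place, here is a rough, standard-library-only sketch. It is not the code in schedule_push.go: the `schedule` and `job` types, the `buildJob` function, the in-memory map standing in for the store, and the error values are all hypothetical names chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// schedule is a hypothetical stand-in for a stored schedule row.
type schedule struct {
	ID      string
	HostID  string
	Kind    string
	Enabled bool
	Paths   []string
}

// job is a hypothetical stand-in for the jobs row the server persists.
type job struct {
	Kind        string
	Args        []string
	ActorKind   string
	ScheduledID string
}

// Illustrative sentinel errors; the real server may report these differently.
var (
	errNotFound = errors.New("schedule not found")
	errNotOwner = errors.New("schedule belongs to another host")
	errDisabled = errors.New("schedule is disabled")
)

// buildJob follows the steps from the commit message: look up the
// schedule by id, check ownership and enabled state, then derive the
// job args from the schedule kind.
func buildJob(store map[string]schedule, hostID, scheduleID string) (job, error) {
	sch, ok := store[scheduleID]
	if !ok {
		return job{}, errNotFound
	}
	if sch.HostID != hostID {
		return job{}, errNotOwner
	}
	if !sch.Enabled {
		return job{}, errDisabled
	}
	var args []string
	if sch.Kind == "backup" {
		// Only backup carries args (its paths); other kinds stay arg-less.
		args = append(args, sch.Paths...)
	}
	return job{
		Kind:        sch.Kind,
		Args:        args,
		ActorKind:   "schedule",
		ScheduledID: sch.ID,
	}, nil
}

func main() {
	store := map[string]schedule{
		"s1": {ID: "s1", HostID: "h1", Kind: "backup", Enabled: true, Paths: []string{"/etc"}},
	}
	j, err := buildJob(store, "h1", "s1")
	fmt.Println(j, err)

	// A fire from a host that does not own the schedule is rejected.
	_, err = buildJob(store, "h2", "s1")
	fmt.Println(err)
}
```

The sketch stops at building the row; persisting it and writing command.run back on the connection are left out.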

Persistence of next-fire times across agent restarts is
deliberately deferred. A missed fire window during downtime
simply fires once on reconnect; that's the desirable behaviour
(the operator wants the missed backup to run, not be silently
skipped because we lost track of when it was due).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,170 @@
package scheduler

import (
	"log/slog"
	"sync"
	"time"

	"github.com/robfig/cron/v3"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
)

// Sender abstracts away the agent's outbound WS channel; we use it
// to fire schedule.fire and schedule.ack envelopes back at the
// server. Same shape as runner.Sender; deliberately not shared so
// the scheduler can be tested without dragging in the runner.
type Sender interface {
	Send(env api.Envelope) error
}

// Scheduler maintains the agent's local cron entries. Schedules
// arrive from the server via Apply (driven by MsgScheduleSet); on
// each fire, the entry sends a schedule.fire to the server and
// lets the server's existing dispatch path turn that into a
// command.run. The scheduler itself never builds CommandRunPayloads.
//
// Lifecycle:
//   - New once at agent boot (no cron runs until the first Apply).
//   - Apply on every MsgScheduleSet: replaces the active cron with
//     a fresh one, then emits schedule.ack with the version we just
//     applied.
//   - Stop on agent shutdown.
//
// The active Sender is updated on every Apply call. This handles
// reconnects naturally: a new connection's first MsgScheduleSet
// re-arms the scheduler with a working tx; cron entries that fire
// against a dropped connection just log and skip the tick.
type Scheduler struct {
	mu      sync.Mutex
	current *cron.Cron
	version int64
	tx      Sender
}

// New builds a Scheduler. Doesn't start any cron yet; Apply is
// what brings the loop alive.
func New() *Scheduler {
	return &Scheduler{}
}

// Stop halts whatever cron is currently running. Safe to call
// multiple times.
func (s *Scheduler) Stop() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.current != nil {
		<-s.current.Stop().Done()
		s.current = nil
	}
}

// Apply reconciles the active cron with payload. Stops the old cron
// (waiting for in-flight entries to return), builds a new one from
// every enabled entry, starts it, and emits schedule.ack with
// payload.Version. Schedule entries with malformed cron exprs are
// logged and skipped; the server's validator should have caught
// these, but better skip-and-warn than crash the loop.
//
// Payload's order doesn't matter; we always rebuild from scratch.
// Empty Schedules is a valid input that effectively disables every
// timed job for this host.
func (s *Scheduler) Apply(payload api.ScheduleSetPayload, tx Sender) {
	s.mu.Lock()
	s.tx = tx

	// Stop the previous cron, if any. cron.Stop prevents new entries
	// from firing and returns a context whose Done channel closes
	// once in-flight entries have returned, so we block on it here.
	// Entries only send a schedule.fire envelope (the job itself runs
	// later via command.run), so the wait should be short.
	if s.current != nil {
		<-s.current.Stop().Done()
		s.current = nil
	}

	c := cron.New()
	added := 0
	for _, sch := range payload.Schedules {
		if !sch.Enabled {
			continue
		}
		// Capture by value so the closure doesn't share id across iters.
		entry := sch
		_, err := c.AddFunc(entry.CronExpr, func() {
			s.fire(entry)
		})
		if err != nil {
			slog.Warn("scheduler: skipping entry with bad cron expr",
				"schedule_id", entry.ID, "expr", entry.CronExpr, "err", err)
			continue
		}
		added++
	}
	c.Start()
	s.current = c
	s.version = payload.Version
	ackTx := s.tx
	s.mu.Unlock()

	slog.Info("scheduler: applied", "version", payload.Version,
		"received", len(payload.Schedules), "active", added)

	// Ack outside the lock. Send() shouldn't take long, but holding
	// s.mu across an external call would needlessly serialise other
	// callers (e.g. a future Status() inspection from the UI).
	ackEnv, err := api.Marshal(api.MsgScheduleAck, "", api.ScheduleAckPayload{
		Version:   payload.Version,
		AppliedAt: time.Now().UTC(),
	})
	if err != nil {
		slog.Error("scheduler: marshal schedule.ack", "err", err)
		return
	}
	if ackTx == nil {
		return
	}
	if err := ackTx.Send(ackEnv); err != nil {
		slog.Warn("scheduler: send schedule.ack — server will retry on reconnect",
			"version", payload.Version, "err", err)
	}
}

// Version returns the schedule version currently applied. Useful for
// tests + diagnostics.
func (s *Scheduler) Version() int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.version
}

// fire runs when one of the cron entries' time arrives. Sends a
// schedule.fire envelope to the server, which is responsible for
// minting the job_id, persisting the row, and shipping back a
// command.run envelope that the agent's existing dispatcher will
// then execute. Fire-and-log: if the WS write fails we skip this
// tick; the next one will fire normally, and a flapping link is
// already noisy elsewhere.
func (s *Scheduler) fire(entry api.Schedule) {
	s.mu.Lock()
	tx := s.tx
	s.mu.Unlock()
	if tx == nil {
		slog.Info("scheduler: tick fired with no active connection — skipping",
			"schedule_id", entry.ID)
		return
	}
	env, err := api.Marshal(api.MsgScheduleFire, "", api.ScheduleFirePayload{
		ScheduleID:  entry.ID,
		ScheduledAt: time.Now().UTC(),
	})
	if err != nil {
		slog.Error("scheduler: marshal schedule.fire",
			"schedule_id", entry.ID, "err", err)
		return
	}
	if err := tx.Send(env); err != nil {
		slog.Warn("scheduler: send schedule.fire — skipping this tick",
			"schedule_id", entry.ID, "err", err)
	}
}
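
The sender-swap behaviour in the scheduler above (the active Sender is replaced on every Apply, and a tick that cannot be delivered is skipped) can be illustrated in isolation. The following is a minimal, standard-library-only sketch and not part of this commit: `sender`, `conn`, and `swapper` are hypothetical names, plain strings stand in for envelopes, and the ticks are triggered by hand rather than by cron.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// sender is a hypothetical, simplified stand-in for the Sender interface.
type sender interface {
	Send(msg string) error
}

// conn is a toy connection that rejects sends once it is marked closed.
type conn struct {
	closed bool
	sent   []string
}

func (c *conn) Send(msg string) error {
	if c.closed {
		return errors.New("connection closed")
	}
	c.sent = append(c.sent, msg)
	return nil
}

// swapper keeps the current sender behind a mutex, much as the
// Scheduler keeps tx.
type swapper struct {
	mu sync.Mutex
	tx sender
}

// apply swaps in the sender that belongs to the newest connection.
func (s *swapper) apply(tx sender) {
	s.mu.Lock()
	s.tx = tx
	s.mu.Unlock()
}

// fire copies the sender under the lock and sends outside it.
// It reports whether the tick was delivered.
func (s *swapper) fire(msg string) bool {
	s.mu.Lock()
	tx := s.tx
	s.mu.Unlock()
	if tx == nil {
		return false // no connection yet: skip the tick
	}
	return tx.Send(msg) == nil // a failed send also skips the tick
}

func main() {
	s := &swapper{}
	fmt.Println("before any apply:", s.fire("tick 1"))

	first := &conn{}
	s.apply(first)
	fmt.Println("first connection:", s.fire("tick 2"))

	first.closed = true // the link drops
	fmt.Println("dropped connection:", s.fire("tick 3"))

	second := &conn{}
	s.apply(second) // reconnect re-arms the sender
	fmt.Println("after reconnect:", s.fire("tick 4"))
}
```

Copying the sender under the lock and sending outside it keeps the critical section small, which is the same choice the scheduler makes in fire.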
@@ -0,0 +1,159 @@
package scheduler

import (
	"sync"
	"testing"
	"time"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
)

// recSender is a Sender that records every envelope it gets. Tests
// inspect it after a tick to assert the right messages were emitted.
type recSender struct {
	mu   sync.Mutex
	envs []api.Envelope
}

func (r *recSender) Send(env api.Envelope) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.envs = append(r.envs, env)
	return nil
}

func (r *recSender) snapshot() []api.Envelope {
	r.mu.Lock()
	defer r.mu.Unlock()
	out := make([]api.Envelope, len(r.envs))
	copy(out, r.envs)
	return out
}

func TestApplyEmitsAck(t *testing.T) {
	t.Parallel()
	tx := &recSender{}
	s := New()
	defer s.Stop()

	s.Apply(api.ScheduleSetPayload{
		Version: 7,
		Schedules: []api.Schedule{
			{ID: "s1", Kind: api.JobBackup, CronExpr: "@hourly", Enabled: true},
		},
	}, tx)

	if got := s.Version(); got != 7 {
		t.Fatalf("Version: got %d, want 7", got)
	}

	envs := tx.snapshot()
	if len(envs) != 1 {
		t.Fatalf("expected 1 envelope (ack), got %d", len(envs))
	}
	if envs[0].Type != api.MsgScheduleAck {
		t.Fatalf("envelope type: got %s, want %s", envs[0].Type, api.MsgScheduleAck)
	}
	var ack api.ScheduleAckPayload
	// Fail loudly on a decode error instead of asserting on a zero value.
	if err := envs[0].UnmarshalPayload(&ack); err != nil {
		t.Fatalf("unmarshal ack payload: %v", err)
	}
	if ack.Version != 7 {
		t.Fatalf("ack version: got %d, want 7", ack.Version)
	}
}

func TestApplyTickFiresScheduleFire(t *testing.T) {
	t.Parallel()
	tx := &recSender{}
	s := New()
	defer s.Stop()

	// Cron expression that fires roughly every second; close enough
	// to be reliable in CI without making the test slow.
	s.Apply(api.ScheduleSetPayload{
		Version: 1,
		Schedules: []api.Schedule{
			{ID: "every-second", Kind: api.JobBackup, CronExpr: "@every 1s", Enabled: true},
		},
	}, tx)

	deadline := time.Now().Add(3 * time.Second)
	for time.Now().Before(deadline) {
		envs := tx.snapshot()
		for _, e := range envs {
			if e.Type == api.MsgScheduleFire {
				var p api.ScheduleFirePayload
				_ = e.UnmarshalPayload(&p)
				if p.ScheduleID == "every-second" {
					return
				}
			}
		}
		time.Sleep(50 * time.Millisecond)
	}
	t.Fatal("schedule.fire did not arrive within 3s")
}

func TestApplyDisabledEntriesSkipped(t *testing.T) {
	t.Parallel()
	tx := &recSender{}
	s := New()
	defer s.Stop()

	s.Apply(api.ScheduleSetPayload{
		Version: 1,
		Schedules: []api.Schedule{
			{ID: "off", Kind: api.JobBackup, CronExpr: "@every 1s", Enabled: false},
		},
	}, tx)

	// A disabled schedule must never fire; give the cron a couple
	// of ticks to confirm it's silent.
	time.Sleep(2200 * time.Millisecond)
	for _, e := range tx.snapshot() {
		if e.Type == api.MsgScheduleFire {
			t.Fatalf("disabled schedule fired: %+v", e)
		}
	}
}

func TestApplyReplacesPriorState(t *testing.T) {
	t.Parallel()
	tx := &recSender{}
	s := New()
	defer s.Stop()

	s.Apply(api.ScheduleSetPayload{
		Version: 1,
		Schedules: []api.Schedule{
			{ID: "old", Kind: api.JobBackup, CronExpr: "@every 1s", Enabled: true},
		},
	}, tx)

	// Wait long enough for the first version to fire at least once.
	time.Sleep(1500 * time.Millisecond)

	// Now replace with version 2 that doesn't include "old".
	s.Apply(api.ScheduleSetPayload{
		Version:   2,
		Schedules: []api.Schedule{},
	}, tx)

	// Snapshot count *after* the replacement.
	before := 0
	for _, e := range tx.snapshot() {
		if e.Type == api.MsgScheduleFire {
			before++
		}
	}
	// Without at least one fire from "old", the comparison below
	// would pass vacuously.
	if before == 0 {
		t.Fatal("old schedule never fired before replacement")
	}
	time.Sleep(2 * time.Second)
	after := 0
	for _, e := range tx.snapshot() {
		if e.Type == api.MsgScheduleFire {
			after++
		}
	}
	if after != before {
		t.Fatalf("schedule.fire count grew after replacement (before=%d after=%d) — old cron still firing",
			before, after)
	}
}
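
The deadline-and-poll loop used in TestApplyTickFiresScheduleFire is a general pattern for asserting on asynchronous events without a fixed sleep. As a rough sketch only (the `waitFor` helper below is hypothetical and is not part of this commit), it could be factored out like this, using nothing beyond the standard library:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// waitFor polls cond every interval until it returns true or the
// timeout elapses. It reports whether cond was ever observed true.
func waitFor(cond func() bool, timeout, interval time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if cond() {
			return true
		}
		time.Sleep(interval)
	}
	return cond() // one last check once the deadline has passed
}

func main() {
	var fired int32

	// Simulate an asynchronous event, such as a cron tick, arriving later.
	time.AfterFunc(20*time.Millisecond, func() {
		atomic.StoreInt32(&fired, 1)
	})

	ok := waitFor(func() bool {
		return atomic.LoadInt32(&fired) == 1
	}, time.Second, 5*time.Millisecond)
	fmt.Println("event observed:", ok)
}
```

A helper like this returns as soon as the event shows up, so the happy path stays fast while the timeout still bounds the failure case.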