P2-02 (agent side) + P2-03: agent scheduler + schedule.fire dispatch

Closes the schedule reconciliation loop end-to-end.

* New `internal/agent/scheduler` package wraps robfig/cron/v3 with
  the lifecycle the agent needs:
  - Apply(ScheduleSetPayload, Sender) stops the prior cron (waiting
    for in-flight entries to return), rebuilds from scratch, starts,
    and emits schedule.ack with the version we just applied.
  - Disabled entries are skipped silently; bad cron exprs (which
    shouldn't reach us because the server validates, but we stay
    defensive) log a warning and are skipped.
  - On each cron tick the entry sends a new schedule.fire envelope
    to the server with {schedule_id, scheduled_at}. The scheduler
    itself never builds CommandRunPayloads — server is the source
    of truth for jobs.
  - The Sender (tx) is swapped on every Apply, so reconnect is
    handled naturally: cron entries that fire against a dropped tx
    log "no active connection" and skip the tick.
  - Stop() is idempotent and waits for the cron's in-flight
    workers via cron.Stop().Done().
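
The sender swap and the "no active connection" skip described above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: `Sender`, `recorder`, `fire`, `errNoConn`, and the string message bodies are hypothetical stand-ins, and the cron handling (stop, rebuild, start, waiting for in-flight entries) is omitted because the real package delegates it to robfig/cron/v3.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Sender is a hypothetical stand-in for the agent's outbound connection.
type Sender interface {
	Send(msgType, body string) error
}

// errNoConn mirrors the "no active connection" skip from the commit message.
var errNoConn = errors.New("no active connection")

// Scheduler holds only the current sender and the last applied version.
// The real package would also own a cron instance; that part is left out.
type Scheduler struct {
	mu      sync.Mutex
	tx      Sender
	version int
}

// Apply swaps in the new sender, records the version, and acks it.
func (s *Scheduler) Apply(version int, tx Sender) error {
	s.mu.Lock()
	s.tx = tx
	s.version = version
	s.mu.Unlock()
	return tx.Send("schedule.ack", fmt.Sprintf("version=%d", version))
}

// fire is what a cron entry would call on each tick (assumed name).
// It reads the sender under the lock so a concurrent Apply is safe.
func (s *Scheduler) fire(scheduleID string) error {
	s.mu.Lock()
	tx := s.tx
	s.mu.Unlock()
	if tx == nil {
		return errNoConn // skip this tick; a later Apply installs a sender
	}
	return tx.Send("schedule.fire", scheduleID)
}

// recorder is a test double that remembers everything sent through it.
type recorder struct{ msgs []string }

func (r *recorder) Send(msgType, body string) error {
	r.msgs = append(r.msgs, msgType+" "+body)
	return nil
}

func main() {
	s := &Scheduler{}
	rec := &recorder{}
	fmt.Println(s.fire("sched-1")) // no sender has been applied yet
	_ = s.Apply(3, rec)
	_ = s.fire("sched-1")
	fmt.Println(rec.msgs)
}
```

Copying the sender out under the mutex before sending keeps the lock from being held across network I/O, which matters if a reconnect triggers Apply while a tick is in flight.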

* New wire message api.MsgScheduleFire + api.ScheduleFirePayload
  for the agent → server "I just fired locally" RPC.

* Server-side dispatch (schedule_push.go: dispatchScheduledJob):
  looks up the schedule by id, validates ownership + that it's
  enabled, builds args from kind (paths for backup; other kinds
  are still arg-less in Phase 2 and grow as those job kinds land
  in P2-05..08), persists a jobs row with actor_kind=schedule +
  scheduled_id, and writes command.run back on the same conn so
  the agent runs through its existing dispatch path.
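
The validation and arg-building steps of the dispatch can be sketched as a pure function over an in-memory map. Everything here is hypothetical: the `Schedule` and `Job` shapes, the error values, the `buildScheduledJob` name, and the reading of "ownership" as "the schedule belongs to the host on this connection". The real dispatchScheduledJob also persists the row and writes command.run, which this sketch leaves out.

```go
package main

import (
	"errors"
	"fmt"
)

// Schedule and Job are simplified, assumed shapes for illustration only.
type Schedule struct {
	ID      string
	HostID  string
	Kind    string
	Enabled bool
	Paths   []string
}

type Job struct {
	ID          string
	HostID      string
	Kind        string
	ActorKind   string
	ScheduledID string
	Args        []string
}

var (
	errUnknownSchedule = errors.New("unknown schedule")
	errWrongHost       = errors.New("schedule belongs to another host")
	errDisabled        = errors.New("schedule is disabled")
)

// buildScheduledJob looks up the schedule, checks ownership and the
// enabled flag, and derives args from the schedule kind.
func buildScheduledJob(schedules map[string]Schedule, hostID, scheduleID, jobID string) (Job, error) {
	sc, ok := schedules[scheduleID]
	switch {
	case !ok:
		return Job{}, errUnknownSchedule
	case sc.HostID != hostID:
		return Job{}, errWrongHost
	case !sc.Enabled:
		return Job{}, errDisabled
	}
	var args []string
	if sc.Kind == "backup" {
		// Only backup carries args in this sketch; other kinds stay arg-less.
		args = append(args, sc.Paths...)
	}
	return Job{
		ID:          jobID,
		HostID:      hostID,
		Kind:        sc.Kind,
		ActorKind:   "schedule",
		ScheduledID: sc.ID,
		Args:        args,
	}, nil
}

func main() {
	schedules := map[string]Schedule{
		"sched-1": {ID: "sched-1", HostID: "host-a", Kind: "backup", Enabled: true, Paths: []string{"/etc", "/home"}},
	}
	job, err := buildScheduledJob(schedules, "host-a", "sched-1", "job-42")
	fmt.Println(job.Kind, job.Args, err)
}
```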

* store.CreateJob now writes scheduled_id. This column has been in
  the schema since 0001 but was never populated: the original P1
  path only had operator-driven jobs, so actor_kind was always
  'user' and scheduled_id was always nil.

* cmd/agent/main.go integration: dispatcher gains a
  *scheduler.Scheduler; the MsgScheduleSet case now hands the
  payload to scheduler.Apply (in a goroutine so the WS read loop
  keeps draining other messages).
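
The reason for the goroutine can be shown with a small stand-in read loop. This is a sketch under assumptions: `Message`, `readLoop`, and the "schedule.set" and "command.run" type strings are hypothetical, and the `apply` callback stands in for scheduler.Apply. In the demo the apply call blocks until a later message has been handled, which would deadlock if apply ran inline on the read loop.

```go
package main

import (
	"fmt"
	"sync"
)

// Message is a hypothetical decoded envelope.
type Message struct {
	Type string
	Body string
}

// readLoop drains messages from in. Schedule updates are handed to a
// goroutine so a slow apply cannot stall the handling of later messages.
func readLoop(in <-chan Message, apply func(body string), handle func(Message)) {
	var wg sync.WaitGroup
	for m := range in {
		switch m.Type {
		case "schedule.set":
			wg.Add(1)
			go func(body string) {
				defer wg.Done()
				apply(body)
			}(m.Body)
		default:
			handle(m)
		}
	}
	wg.Wait() // let outstanding applies finish before returning
}

func main() {
	in := make(chan Message, 2)
	release := make(chan struct{})
	applied := make(chan string, 1)

	in <- Message{Type: "schedule.set", Body: "v1"}
	in <- Message{Type: "command.run", Body: "job-1"}
	close(in)

	readLoop(in,
		// apply blocks until the later message has been handled.
		func(body string) { <-release; applied <- body },
		func(m Message) { fmt.Println("handled", m.Type); close(release) },
	)
	fmt.Println("applied", <-applied)
}
```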

* WS dispatcher gains OnScheduleFire alongside OnScheduleAck.

* Tests:
  - scheduler unit tests (4): ack-on-apply, cron tick fires
    schedule.fire envelope, disabled entries don't fire, replace-
    prior-state stops the old cron.
  - Server-side end-to-end: schedule.fire → command.run with the
    right job_id / kind / args, plus jobs row with actor_kind=
    "schedule" and scheduled_id linking back to the schedule.

Persistence of next-fire times across agent restarts is
deliberately deferred. A missed fire window during downtime
simply fires once on reconnect — that's the desirable behaviour
(the operator wants the missed backup to run, not be silently
skipped because we lost track of when it was due).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Steve Cliff, 2026-05-02 11:29:12 +01:00
parent a086b0eb75
commit 608962441b
11 changed files with 561 additions and 13 deletions

									
internal/store/jobs.go (+7 -4)
												
@@ -27,12 +27,15 @@ type Job struct {
 }
 
 // CreateJob inserts a queued job. The agent will mark it running
-// when it actually starts work.
+// when it actually starts work. ScheduledID is set when the job
+// originates from a cron fire (actor_kind="schedule"); nil for
+// operator-driven run-now.
 func (s *Store) CreateJob(ctx context.Context, j Job) error {
 	_, err := s.db.ExecContext(ctx,
-		`INSERT INTO jobs (id, host_id, kind, status, actor_kind, actor_id, created_at)
-		 VALUES (?, ?, ?, 'queued', ?, ?, ?)`,
-		j.ID, j.HostID, j.Kind, j.ActorKind, nullable(j.ActorID),
+		`INSERT INTO jobs (id, host_id, kind, status, scheduled_id, actor_kind, actor_id, created_at)
+		 VALUES (?, ?, ?, 'queued', ?, ?, ?, ?)`,
+		j.ID, j.HostID, j.Kind,
+		nullable(j.ScheduledID), j.ActorKind, nullable(j.ActorID),
 		j.CreatedAt.UTC().Format(time.RFC3339Nano))
 	if err != nil {
 		return fmt.Errorf("store: create job: %w", err)