Always-On vs Intermittent host mode
Date: 2026-06-15
Branch: feat-laptop-host-mode
Status: Design — awaiting review
Problem
The server currently assumes every host should be present 24×7. When an
agent stops heartbeating for 90s it is flipped to offline, and after 15
minutes that raises a warning alert. This is correct for a server, but
wrong for a host that legitimately comes and goes — a workstation or
laptop that sleeps overnight, travels, or is shut down on weekends. Such
a host generates noise alerts every time it is closed, and — more
importantly — there is no mechanism to catch up a backup it missed
while it was away.
Two distinct facts make the catch-up gap real:
- Backup cron runs on the agent, locally. The agent fires MsgScheduleFire; the server only dispatches in response. If the host is asleep, the agent process is suspended, so the cron tick never fires and no MsgScheduleFire is ever sent.
- Therefore the existing pending_runs retry queue does not cover this case. pending_runs only gets a row when a schedule fired but the agent was momentarily disconnected at dispatch time. A window missed entirely during sleep never enqueues anything.
Goal
Let an operator mark a host as not always-on. Such a host:
- Does not raise offline/agent-down alerts when it is not visible.
- Renders a distinct, calm "asleep" state in the UI instead of the alarming red "offline".
- When it reconnects, after a short settle delay, the server checks whether it missed a scheduled backup and — if so — triggers a catch-up backup automatically.
- Still raises a staleness alert if it has genuinely gone too long without any backup (a host left in a drawer), and still raises normal job-failure alerts for backups that run and fail.
Default behaviour is unchanged for the entire existing fleet.
Decisions (from brainstorming)
- Setting shape: a single boolean Always On checkbox per host, default ON. Checked = today's 24×7 server semantics. Unchecked = intermittent host. Opt-in only; zero behaviour change for current and future hosts unless explicitly toggled.
- Overdue trigger: evaluated on reconnect + behind schedule (not a continuous always-evaluating sweep).
- Alert policy for intermittent hosts: suppress offline alerts; keep a long-threshold staleness alert; keep job-failure alerts.
- Staleness threshold: 7 days, a global constant for v1. May become per-host configurable later — out of scope now.
- Catch-up granularity: per enabled schedule. A host with a daily and a weekly schedule catches up only whichever is actually behind.
- UI vocabulary: not-visible intermittent host shows a grey asleep state; detail line reads "asleep · last seen <relTime> · will catch up on return".
- Chip: chip and checkbox highlight the same truth (24×7). Show a chip for Always-On hosts; no chip for intermittent.
Architecture
The change is deliberately a thin policy + presentation layer over the
existing online/offline state machine. We do not add a new status
enum value or alter heartbeat / last_seen_at tracking. "Asleep" is a
reinterpretation of status='offline' AND NOT always_on.
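The reinterpretation above can be pictured as a pure function over the two stored values. This is a minimal sketch, not the repo's code: the function name displayState and the plain string status values are assumptions made for illustration.

```go
package main

import "fmt"

// displayState is a hypothetical helper: it derives the label shown to the
// operator from the stored status and the always_on flag, without adding a
// new status value to the state machine.
func displayState(status string, alwaysOn bool) string {
	if status == "offline" && !alwaysOn {
		return "asleep"
	}
	return status
}

func main() {
	fmt.Println(displayState("offline", true))  // Always-On host: offline
	fmt.Println(displayState("offline", false)) // intermittent host: asleep
	fmt.Println(displayState("online", false))  // online is never rewritten
}
```

Because the stored status never changes, the sweeper and heartbeat code need no knowledge of the new mode.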
1. Data model
- Migration 0024_hosts_always_on.sql:
    ALTER TABLE hosts ADD COLUMN always_on INTEGER NOT NULL DEFAULT 1;
  Column-level ALTER per the repo's migration rules. Default 1 means every existing row is Always-On, so there is no behaviour change on upgrade.
- store/types.go: add AlwaysOn bool to the Host struct; thread it through every host SELECT scan and the host insert/update paths.
- New store helper SetHostAlwaysOn(ctx, hostID, bool) error.
2. Online/offline mechanics — UNCHANGED
The 30s offline sweeper (cmd/server/main.go:220) still flips an unseen
host to status='offline' and still calls
alertEngine.NotifyHostOffline(id). TouchHost / MarkHostHello
behaviour is untouched. The intermittent distinction is applied
downstream of this state, in the alert engine and the templates.
3. Alert behaviour
All changes key off host.AlwaysOn, which the engine already has access
to via the host row it loads.
- Suppress offline alert (alert/engine.go handleHostOffline() and the 60s tick()): when !host.AlwaysOn, do not raise agent_offline.
- Resolve-on-toggle: when a host is switched server→intermittent and has an open agent_offline alert, auto-resolve it. (Handled in the mode-change handler, fanning through the normal resolve path so channels/audit fire as usual.)
- Staleness alert: wire up the currently-dead KindStaleSchedule constant, for intermittent hosts only. On the 60s tick, for each host where !AlwaysOn AND the host has ≥1 enabled schedule AND LastBackupAt != nil AND now - LastBackupAt > 7*24h: raise a warning stale_schedule alert (dedup key "", one per host). Auto-resolves when LastBackupAt advances past the threshold (i.e. any successful backup, including the catch-up). Always-On hosts' stale_schedule remains a no-op (unchanged, out of scope).
  - If LastBackupAt == nil (intermittent host enrolled but never backed up): no staleness alert in v1. There is no baseline to measure against, and onboarding probe state (repo_status) already covers "never successfully set up."
- Job-failure alerts: untouched. A catch-up backup that runs and fails alerts exactly like any other backup.
4. Catch-up on reconnect
A new small component — the catch-up scheduler — lives server-side alongside the existing ticks.
- Arm: on agent hello (server/ws/handler.go hello path / onAgentHello), if the host is !AlwaysOn, record catchupDueAt[hostID] = now + 60s in an in-memory map. Re-arming on a subsequent hello just overwrites the timestamp (debounce: rapid flapping does not stack catch-ups). In-memory is acceptable: catch-up is best-effort and a server restart simply re-arms on the next hello.
- Fire: reuse the existing 30s server tick. For each due entry (catchupDueAt <= now):
  1. Re-verify the agent is still connected (Hub.Connected(hostID)). If it bounced back offline within the settle window, drop the entry (it will re-arm on the next hello).
  2. Skip if a backup is already running or queued for the host (current_job_id set, or a relevant pending_runs row exists), to avoid double-firing alongside a normal dispatch or pending drain.
  3. For each enabled schedule on the host, compute overdue:
       overdue := sched.Next(host.LastBackupAt) <= now
     using robfig/cron/v3 (already a dependency) to parse Schedule.CronExpr. Next(lastBackup) is the first fire strictly after the last successful backup; if that moment has already passed, the window was missed → overdue. (If LastBackupAt is nil, treat as overdue so a never-backed-up intermittent host with a schedule gets its first run on connect.)
  4. For each overdue schedule, dispatch its source-groups via the existing dispatchBackupForGroupCore().
  5. Clear the entry.
Net latency is ~60–90s after wake (60s settle + up to one 30s tick).
This path is independent of and complementary to the pending_runs
drain, which continues to handle the fired-but-not-sent case.
5. UI
- CSS: new grey dot-asleep token in web/styles/input.css, visually distinct from red dot-offline.
- partials/host_row.html and partials/host_chrome.html: when !AlwaysOn && status=='offline', render the grey dot + label asleep; the detail/last-seen line reads "asleep · last seen <relTime> · will catch up on return". All other states unchanged.
- 24×7 chip: on the host detail header, render a small Always On/24×7 chip only when AlwaysOn is true. No chip for intermittent hosts. (Chip and checkbox highlight the same fact.)
- Toggle: an Always On checkbox (default checked) on the host edit surface. Operator-band POST (mirrors existing host-edit handlers), audited as host.mode_updated. On save, if switching to intermittent, trigger the resolve-on-toggle path for any open agent_offline alert.
Error handling & edge cases
- Toggle server→intermittent while offline+alerting: open agent_offline alert auto-resolved on save.
- Toggle intermittent→server while asleep: host resumes normal offline/alert semantics; it will alert per the 15-minute floor once the sweeper/tick next evaluates it.
- No enabled schedules: no catch-up and no staleness alert — there is no backup expectation to measure against.
- Catch-up vs in-flight work: guarded by the running/queued check in step 4.2 so catch-up never races a normal dispatch or pending drain.
- Agent flaps during settle window: entry dropped if not connected at fire time; re-armed on the next hello.
Testing
- Alert engine (unit):
  - offline alert suppressed when !AlwaysOn.
  - staleness alert raised when intermittent + schedule + last backup > 7d; not raised for Always-On hosts; not raised when last backup is recent; not raised when no enabled schedule.
  - staleness alert auto-resolves after a backup advances LastBackupAt.
  - server→intermittent toggle resolves an open agent_offline alert.
- Overdue computation (unit, table-driven): (cronExpr, lastBackupAt, now) → overdue? including nil-last-backup and daily/weekly cases.
- Catch-up scheduler (unit): fires only when still connected; skips when a backup is running/queued; dispatches only overdue schedules.
- UI (render test): asleep state + 24×7 chip render under the right conditions; offline state for Always-On hosts unchanged.
- go vet ./... and full go test ./... green before merge.
Out of scope
- Per-host staleness thresholds (global 7d constant for v1).
- Continuous (non-reconnect) overdue evaluation.
- Agent-side catch-up cron — the server is the reliable arbiter.
- Wiring stale_schedule for Always-On hosts (separate concern).
Task tracking
Add an entry to tasks.md under "Next steps from testing" (or a new
small section) once the plan is approved, per the repo's tasks.md
source-of-truth rule.