# Always-On vs Intermittent host mode

**Date:** 2026-06-15
**Branch:** `feat-laptop-host-mode`
**Status:** Design, awaiting review

## Problem

The server currently assumes every host should be present 24×7. When an agent stops heartbeating for 90s it is flipped to `offline`, and after 15 minutes that raises a `warning` alert. This is correct for a server, but wrong for a host that legitimately comes and goes: a workstation or laptop that sleeps overnight, travels, or is shut down on weekends. Such a host generates noise alerts every time it is closed, and, more importantly, there is **no mechanism to catch up a backup it missed while it was away.**

Two distinct facts make the catch-up gap real:

- **Backup cron runs on the agent, locally.** The agent fires `MsgScheduleFire`; the server only dispatches in response. If the host is asleep, the agent process is suspended, so the cron tick never fires and no `MsgScheduleFire` is ever sent.
- Therefore the existing `pending_runs` retry queue **does not** cover this case. `pending_runs` only gets a row when a schedule *fired* but the agent was momentarily disconnected at dispatch time. A window missed entirely during sleep never enqueues anything.

## Goal

Let an operator mark a host as **not** always-on. Such a host:

1. Does **not** raise offline/agent-down alerts when it is not visible.
2. Renders a distinct, calm "asleep" state in the UI instead of the alarming red "offline".
3. Is checked on reconnect: after a short settle delay, the server determines whether the host missed a scheduled backup and, if so, triggers a catch-up backup automatically.
4. Still raises a *staleness* alert if it has genuinely gone too long without any backup (a host left in a drawer). This is the only alert covering an asleep host: while the agent is offline no job runs, so there is no failure to detect; staleness is the safety net for "no backups are happening at all."
5. Leaves normal job-failure alerting untouched: a backup that actually runs (scheduled or catch-up) and fails alerts as it does today. Failures can only occur while the agent is online and executing restic.

Default behaviour is unchanged for the entire existing fleet.

## Decisions (from brainstorming)

- **Setting shape:** a single boolean `Always On` checkbox per host, **default ON**. Checked = today's 24×7 server semantics. Unchecked = intermittent host. Opt-in only; zero behaviour change for current and future hosts unless explicitly toggled.
- **Overdue trigger:** evaluated on **reconnect + behind schedule** (not a continuous always-evaluating sweep).
- **Alert policy for intermittent hosts:** suppress offline alerts; keep a long-threshold **staleness** alert; keep job-failure alerts.
- **Staleness threshold:** **7 days**, a global constant for v1. May become per-host configurable later; out of scope now.
- **Catch-up granularity:** **per enabled schedule.** A host with a daily and a weekly schedule catches up only whichever is actually behind.
- **UI vocabulary:** not-visible intermittent host shows a grey `asleep` state; detail line reads `asleep · last seen · will catch up on return`.
- **Chip:** chip and checkbox highlight the **same** truth (24×7). Show a chip for **Always-On** hosts; **no** chip for intermittent.

## Architecture

The change is deliberately a thin policy + presentation layer over the existing online/offline state machine. We do **not** add a new `status` enum value or alter heartbeat / `last_seen_at` tracking. "Asleep" is a reinterpretation of `status='offline' AND NOT always_on`.

### 1. Data model

- **Migration `0024_hosts_always_on.sql`:**

  ```sql
  ALTER TABLE hosts ADD COLUMN always_on INTEGER NOT NULL DEFAULT 1;
  ```

  Column-level ALTER per the repo's migration rules. Default `1` means every existing row is Always-On, so there is no behaviour change on upgrade.
- `store/types.go`: add `AlwaysOn bool` to the `Host` struct; thread it through every host SELECT scan and the host insert/update paths.
- New store helper `SetHostAlwaysOn(ctx, hostID, bool) error`.

### 2. Online/offline mechanics (UNCHANGED)

The 30s offline sweeper (`cmd/server/main.go:220`) still flips an unseen host to `status='offline'` and still calls `alertEngine.NotifyHostOffline(id)`. `TouchHost` / `MarkHostHello` behaviour is untouched. The intermittent distinction is applied *downstream* of this state, in the alert engine and the templates.

### 3. Alert behaviour

All changes key off `host.AlwaysOn`, which the engine already has access to via the host row it loads.

- **Suppress offline alert** (`alert/engine.go` `handleHostOffline()` and the 60s `tick()`): when `!host.AlwaysOn`, do not raise `agent_offline`.
- **Resolve-on-toggle:** when a host is switched Always-On→intermittent and has an open `agent_offline` alert, auto-resolve it. (Handled in the mode-change handler, fanning through the normal resolve path so channels/audit fire as usual.)
- **Staleness alert:** wire up the currently-dead `KindStaleSchedule` constant, **for intermittent hosts only.** On the 60s tick, for each host where `!AlwaysOn` AND the host has ≥1 enabled schedule AND `LastBackupAt != nil` AND `now - LastBackupAt > 7*24h`: raise a `warning` `stale_schedule` alert (dedup key `""`, one per host). Auto-resolves once `LastBackupAt` is back within the threshold (i.e. after any successful backup, including the catch-up). Always-On hosts' `stale_schedule` remains a no-op (unchanged, out of scope).
  - If `LastBackupAt == nil` (intermittent host enrolled but never backed up): no staleness alert in v1. There is no baseline to measure against, and onboarding probe state (`repo_status`) already covers "never successfully set up."
- **Job-failure alerts:** untouched. A catch-up backup that runs and fails alerts exactly like any other backup.

### 4. Catch-up on reconnect

A new small component, the **catch-up scheduler**, lives server-side alongside the existing ticks.

- **Arm:** on agent hello (`server/ws/handler.go` hello path / `onAgentHello`), if the host is `!AlwaysOn`, record `catchupDueAt[hostID] = now + 60s` in an in-memory map. Re-arming on a subsequent hello just overwrites the timestamp (debounce: rapid flapping does not stack catch-ups). In-memory is acceptable: catch-up is best-effort and a server restart simply re-arms on the next hello.
- **Fire:** reuse the existing 30s server tick. For each due entry (`catchupDueAt <= now`):
  1. Re-verify the agent is still connected (`Hub.Connected(hostID)`). If it bounced back offline within the settle window, drop the entry (it will re-arm on the next hello).
  2. Skip if a backup is already running or queued for the host (`current_job_id` set, or a relevant `pending_runs` row exists), to avoid double-firing alongside a normal dispatch or pending drain.
  3. For each **enabled** schedule on the host, compute overdue:

     ```go
     // Assumes host.LastBackupAt is non-nil; the nil case is treated as overdue.
     overdue := !sched.Next(*host.LastBackupAt).After(now) // Next(lastBackup) <= now
     ```

     using `robfig/cron/v3` (already a dependency) to parse `Schedule.CronExpr`. `Next(lastBackup)` is the first fire strictly after the last successful backup; if that moment has already passed, the window was missed → overdue. (If `LastBackupAt` is nil, treat as overdue so a never-backed-up intermittent host with a schedule gets its first run on connect.)
  4. For each overdue schedule, dispatch its source-groups via the existing `dispatchBackupForGroupCore()`.
  5. Clear the entry.

Net latency is ~60–90s after wake (60s settle + up to one 30s tick). This path is independent of and complementary to the `pending_runs` drain, which continues to handle the fired-but-not-sent case.

### 5. UI

- **CSS:** new grey `dot-asleep` token in `web/styles/input.css`, visually distinct from red `dot-offline`.
- **`partials/host_row.html` and `partials/host_chrome.html`:** when `!AlwaysOn && status=='offline'`, render the grey dot + label `asleep`; the detail/last-seen line reads `asleep · last seen · will catch up on return`. All other states unchanged.
- **24×7 chip:** on the host detail header, render a small `Always On` / `24×7` chip **only when `AlwaysOn` is true**. No chip for intermittent hosts. (Chip and checkbox highlight the same fact.)
- **Toggle:** an `Always On` checkbox (default checked) on the host edit surface. Operator-band `POST` (mirrors existing host-edit handlers), audited as `host.mode_updated`. On save, if switching to intermittent, trigger the resolve-on-toggle path for any open `agent_offline` alert.

## Error handling & edge cases

- **Toggle Always-On→intermittent while offline+alerting:** open `agent_offline` alert auto-resolved on save.
- **Toggle intermittent→Always-On while asleep:** host resumes normal offline/alert semantics; it will alert per the 15-minute floor once the sweeper/tick next evaluates it.
- **No enabled schedules:** no catch-up and no staleness alert, because there is no backup expectation to measure against.
- **Catch-up vs in-flight work:** guarded by the running/queued check (section 4, **Fire** step 2) so catch-up never races a normal dispatch or pending drain.
- **Agent flaps during settle window:** entry dropped if not connected at fire time; re-armed on the next hello.

## Testing

- **Alert engine (unit):**
  - offline alert suppressed when `!AlwaysOn`.
  - staleness alert raised when intermittent + schedule + last backup > 7d; not raised for Always-On hosts; not raised when last backup is recent; not raised when no enabled schedule.
  - staleness alert auto-resolves after a backup advances `LastBackupAt`.
  - Always-On→intermittent toggle resolves an open `agent_offline` alert.
- **Overdue computation (unit, table-driven):** `(cronExpr, lastBackupAt, now) → overdue?` including nil-last-backup and daily/weekly cases.
- **Catch-up scheduler (unit):** fires only when still connected; skips when a backup is running/queued; dispatches only overdue schedules.
- **UI (render test):** asleep state + 24×7 chip render under the right conditions; offline state for Always-On hosts unchanged.
- `go vet ./...` and full `go test ./...` green before merge.

## Out of scope

- Per-host staleness thresholds (global 7d constant for v1).
- Continuous (non-reconnect) overdue evaluation.
- Agent-side catch-up cron: the server is the reliable arbiter.
- Wiring `stale_schedule` for Always-On hosts (separate concern).

## Task tracking

Add an entry to `tasks.md` under "Next steps from testing" (or a new small section) once the plan is approved, per the repo's tasks.md source-of-truth rule.