From 0c3a0844e451bc8eef1da45ccd5343273f21fea5 Mon Sep 17 00:00:00 2001 From: Steve Cliff Date: Mon, 15 Jun 2026 20:37:45 +0100 Subject: [PATCH] docs(spec): always-on vs intermittent host mode design --- .../2026-06-15-always-on-host-mode-design.md | 217 ++++++++++++++++++ 1 file changed, 217 insertions(+) create mode 100644 docs/specs/2026-06-15-always-on-host-mode-design.md diff --git a/docs/specs/2026-06-15-always-on-host-mode-design.md b/docs/specs/2026-06-15-always-on-host-mode-design.md new file mode 100644 index 0000000..a10a089 --- /dev/null +++ b/docs/specs/2026-06-15-always-on-host-mode-design.md @@ -0,0 +1,217 @@ +# Always-On vs Intermittent host mode + +**Date:** 2026-06-15 +**Branch:** `feat-laptop-host-mode` +**Status:** Design — awaiting review + +## Problem + +The server currently assumes every host should be present 24×7. When an +agent stops heartbeating for 90s it is flipped to `offline`, and after 15 +minutes that raises a `warning` alert. This is correct for a server, but +wrong for a host that legitimately comes and goes — a workstation or +laptop that sleeps overnight, travels, or is shut down on weekends. Such +a host generates noise alerts every time it is closed, and — more +importantly — there is **no mechanism to catch up a backup it missed +while it was away.** + +Two distinct facts make the catch-up gap real: + +- **Backup cron runs on the agent, locally.** The agent fires + `MsgScheduleFire`; the server only dispatches in response. If the host + is asleep, the agent process is suspended, so the cron tick never + fires and no `MsgScheduleFire` is ever sent. +- Therefore the existing `pending_runs` retry queue **does not** cover + this case. `pending_runs` only gets a row when a schedule *fired* but + the agent was momentarily disconnected at dispatch time. A window + missed entirely during sleep never enqueues anything. + +## Goal + +Let an operator mark a host as **not** always-on. Such a host: + +1. Does **not** raise offline/agent-down alerts when it is not visible. +2. Renders a distinct, calm "asleep" state in the UI instead of the + alarming red "offline". +3. When it reconnects, after a short settle delay, the server checks + whether it missed a scheduled backup and — if so — triggers a + catch-up backup automatically. +4. Still raises a *staleness* alert if it has genuinely gone too long + without any backup (a host left in a drawer), and still raises normal + job-failure alerts for backups that run and fail. + +Default behaviour is unchanged for the entire existing fleet. + +## Decisions (from brainstorming) + +- **Setting shape:** a single boolean `Always On` checkbox per host, + **default ON**. Checked = today's 24×7 server semantics. Unchecked = + intermittent host. Opt-in only; zero behaviour change for current and + future hosts unless explicitly toggled. +- **Overdue trigger:** evaluated on **reconnect + behind schedule** + (not a continuous always-evaluating sweep). +- **Alert policy for intermittent hosts:** suppress offline alerts; + keep a long-threshold **staleness** alert; keep job-failure alerts. +- **Staleness threshold:** **7 days**, a global constant for v1. May + become per-host configurable later — out of scope now. +- **Catch-up granularity:** **per enabled schedule.** A host with a + daily and a weekly schedule catches up only whichever is actually + behind. +- **UI vocabulary:** not-visible intermittent host shows a grey + `asleep` state; detail line reads + `asleep · last seen · will catch up on return`. +- **Chip:** chip and checkbox highlight the **same** truth (24×7). Show + a chip for **Always-On** hosts; **no** chip for intermittent. + +## Architecture + +The change is deliberately a thin policy + presentation layer over the +existing online/offline state machine. We do **not** add a new `status` +enum value or alter heartbeat / `last_seen_at` tracking. "Asleep" is a +reinterpretation of `status='offline' AND NOT always_on`. + +### 1. Data model + +- **Migration `0024_hosts_always_on.sql`:** + ```sql + ALTER TABLE hosts ADD COLUMN always_on INTEGER NOT NULL DEFAULT 1; + ``` + Column-level ALTER per the repo's migration rules. Default `1` means + every existing row is Always-On — no behaviour change on upgrade. +- `store/types.go`: add `AlwaysOn bool` to the `Host` struct; thread it + through every host SELECT scan and the host insert/update paths. +- New store helper `SetHostAlwaysOn(ctx, hostID, bool) error`. + +### 2. Online/offline mechanics — UNCHANGED + +The 30s offline sweeper (`cmd/server/main.go:220`) still flips an unseen +host to `status='offline'` and still calls +`alertEngine.NotifyHostOffline(id)`. `TouchHost` / `MarkHostHello` +behaviour is untouched. The intermittent distinction is applied +*downstream* of this state, in the alert engine and the templates. + +### 3. Alert behaviour + +All changes key off `host.AlwaysOn`, which the engine already has access +to via the host row it loads. + +- **Suppress offline alert** (`alert/engine.go` `handleHostOffline()` + and the 60s `tick()`): when `!host.AlwaysOn`, do not raise + `agent_offline`. +- **Resolve-on-toggle:** when a host is switched server→intermittent and + has an open `agent_offline` alert, auto-resolve it. (Handled in the + mode-change handler, fanning through the normal resolve path so + channels/audit fire as usual.) +- **Staleness alert** — wire up the currently-dead `KindStaleSchedule` + constant, **for intermittent hosts only.** On the 60s tick, for each + host where `!AlwaysOn` AND the host has ≥1 enabled schedule AND + `LastBackupAt != nil` AND `now - LastBackupAt > 7*24h`: raise a + `warning` `stale_schedule` alert (dedup key `""`, one per host). + Auto-resolves when `LastBackupAt` advances past the threshold (i.e. + any successful backup, including the catch-up). Always-On hosts' + `stale_schedule` remains a no-op (unchanged, out of scope). + - If `LastBackupAt == nil` (intermittent host enrolled but never + backed up): no staleness alert in v1 — there is no baseline to + measure against, and onboarding probe state (`repo_status`) already + covers "never successfully set up." +- **Job-failure alerts:** untouched. A catch-up backup that runs and + fails alerts exactly like any other backup. + +### 4. Catch-up on reconnect + +A new small component — the **catch-up scheduler** — lives server-side +alongside the existing ticks. + +- **Arm:** on agent hello (`server/ws/handler.go` hello path / + `onAgentHello`), if the host is `!AlwaysOn`, record + `catchupDueAt[hostID] = now + 60s` in an in-memory map. Re-arming on a + subsequent hello just overwrites the timestamp (debounce — rapid + flapping does not stack catch-ups). In-memory is acceptable: catch-up + is best-effort and a server restart simply re-arms on the next hello. +- **Fire:** reuse the existing 30s server tick. For each due entry + (`catchupDueAt <= now`): + 1. Re-verify the agent is still connected (`Hub.Connected(hostID)`). + If it bounced back offline within the settle window, drop the entry + (it will re-arm on the next hello). + 2. Skip if a backup is already running or queued for the host + (`current_job_id` set, or a relevant `pending_runs` row exists) — + avoid double-firing alongside a normal dispatch or pending drain. + 3. For each **enabled** schedule on the host, compute overdue: + ``` + overdue := sched.Next(host.LastBackupAt) <= now + ``` + using `robfig/cron/v3` (already a dependency) to parse + `Schedule.CronExpr`. `Next(lastBackup)` is the first fire strictly + after the last successful backup; if that moment has already + passed, the window was missed → overdue. (If `LastBackupAt` is nil, + treat as overdue so a never-backed-up intermittent host with a + schedule gets its first run on connect.) + 4. For each overdue schedule, dispatch its source-groups via the + existing `dispatchBackupForGroupCore()`. + 5. Clear the entry. + +Net latency is ~60–90s after wake (60s settle + up to one 30s tick). +This path is independent of and complementary to the `pending_runs` +drain, which continues to handle the fired-but-not-sent case. + +### 5. UI + +- **CSS:** new grey `dot-asleep` token in `web/styles/input.css`, + visually distinct from red `dot-offline`. +- **`partials/host_row.html` and `partials/host_chrome.html`:** when + `!AlwaysOn && status=='offline'`, render the grey dot + label + `asleep`; the detail/last-seen line reads + `asleep · last seen · will catch up on return`. All other + states unchanged. +- **24×7 chip:** on the host detail header, render a small + `Always On` / `24×7` chip **only when `AlwaysOn` is true**. No chip + for intermittent hosts. (Chip and checkbox highlight the same fact.) +- **Toggle:** an `Always On` checkbox (default checked) on the host edit + surface. Operator-band `POST` (mirrors existing host-edit handlers), + audited as `host.mode_updated`. On save, if switching to intermittent, + trigger the resolve-on-toggle path for any open `agent_offline` alert. + +## Error handling & edge cases + +- **Toggle server→intermittent while offline+alerting:** open + `agent_offline` alert auto-resolved on save. +- **Toggle intermittent→server while asleep:** host resumes normal + offline/alert semantics; it will alert per the 15-minute floor once + the sweeper/tick next evaluates it. +- **No enabled schedules:** no catch-up and no staleness alert — there + is no backup expectation to measure against. +- **Catch-up vs in-flight work:** guarded by the running/queued check in + step 4.2 so catch-up never races a normal dispatch or pending drain. +- **Agent flaps during settle window:** entry dropped if not connected + at fire time; re-armed on the next hello. + +## Testing + +- **Alert engine (unit):** + - offline alert suppressed when `!AlwaysOn`. + - staleness alert raised when intermittent + schedule + last backup > + 7d; not raised for Always-On hosts; not raised when last backup is + recent; not raised when no enabled schedule. + - staleness alert auto-resolves after a backup advances `LastBackupAt`. + - server→intermittent toggle resolves an open `agent_offline` alert. +- **Overdue computation (unit, table-driven):** `(cronExpr, + lastBackupAt, now) → overdue?` including nil-last-backup and + daily/weekly cases. +- **Catch-up scheduler (unit):** fires only when still connected; skips + when a backup is running/queued; dispatches only overdue schedules. +- **UI (render test):** asleep state + 24×7 chip render under the right + conditions; offline state for Always-On hosts unchanged. +- `go vet ./...` and full `go test ./...` green before merge. + +## Out of scope + +- Per-host staleness thresholds (global 7d constant for v1). +- Continuous (non-reconnect) overdue evaluation. +- Agent-side catch-up cron — the server is the reliable arbiter. +- Wiring `stale_schedule` for Always-On hosts (separate concern). + +## Task tracking + +Add an entry to `tasks.md` under "Next steps from testing" (or a new +small section) once the plan is approved, per the repo's tasks.md +source-of-truth rule.