2026-06-15 23:01:04 +01:00
1 changed files with 217 additions and 0 deletions
@@ -0,0 +1,217 @@
+# Always-On vs Intermittent host mode
+
+**Date:** 2026-06-15
+**Branch:** `feat-laptop-host-mode`
+**Status:** Design — awaiting review
+
+## Problem
+
+The server currently assumes every host should be present 24×7. When an
+agent stops heartbeating for 90s it is flipped to `offline`, and after 15
+minutes that raises a `warning` alert. This is correct for a server, but
+wrong for a host that legitimately comes and goes — a workstation or
+laptop that sleeps overnight, travels, or is shut down on weekends. Such
+a host generates noise alerts every time it is closed, and — more
+importantly — there is **no mechanism to catch up a backup it missed
+while it was away.**
+
+Two distinct facts make the catch-up gap real:
+
+- **Backup cron runs on the agent, locally.** The agent fires
+  `MsgScheduleFire`; the server only dispatches in response. If the host
+  is asleep, the agent process is suspended, so the cron tick never
+  fires and no `MsgScheduleFire` is ever sent.
+- Therefore the existing `pending_runs` retry queue **does not** cover
+  this case. `pending_runs` only gets a row when a schedule *fired* but
+  the agent was momentarily disconnected at dispatch time. A window
+  missed entirely during sleep never enqueues anything.
+
+## Goal
+
+Let an operator mark a host as **not** always-on. Such a host:
+
+1. Does **not** raise offline/agent-down alerts when it is not visible.
+2. Renders a distinct, calm "asleep" state in the UI instead of the
+   alarming red "offline".
+3. Gets an automatic catch-up: when it reconnects, after a short settle
+   delay, the server checks whether it missed a scheduled backup and,
+   if so, triggers a catch-up backup.
+4. Still raises a *staleness* alert if it has genuinely gone too long
+   without any backup (a host left in a drawer), and still raises normal
+   job-failure alerts for backups that run and fail.
+
+Default behaviour is unchanged for the entire existing fleet.
+
+## Decisions (from brainstorming)
+
+- **Setting shape:** a single boolean `Always On` checkbox per host,
+  **default ON**. Checked = today's 24×7 server semantics. Unchecked =
+  intermittent host. Opt-in only; zero behaviour change for current and
+  future hosts unless explicitly toggled.
+- **Overdue trigger:** evaluated on **reconnect + behind schedule**
+  (not a continuous always-evaluating sweep).
+- **Alert policy for intermittent hosts:** suppress offline alerts;
+  keep a long-threshold **staleness** alert; keep job-failure alerts.
+- **Staleness threshold:** **7 days**, a global constant for v1. May
+  become per-host configurable later — out of scope now.
+- **Catch-up granularity:** **per enabled schedule.** A host with a
+  daily and a weekly schedule catches up only whichever is actually
+  behind.
+- **UI vocabulary:** not-visible intermittent host shows a grey
+  `asleep` state; detail line reads
+  `asleep · last seen <relTime> · will catch up on return`.
+- **Chip:** chip and checkbox highlight the **same** truth (24×7). Show
+  a chip for **Always-On** hosts; **no** chip for intermittent.
+
+## Architecture
+
+The change is deliberately a thin policy + presentation layer over the
+existing online/offline state machine. We do **not** add a new `status`
+enum value or alter heartbeat / `last_seen_at` tracking. "Asleep" is a
+reinterpretation of `status='offline' AND NOT always_on`.
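+
+One way to picture this reinterpretation is as a pure function over the
+two stored fields. The sketch below is illustrative only: `displayState`,
+the trimmed `Host` struct, and the `"online"` status string are
+assumptions, not existing code.
+
```go
package main

import "fmt"

// Host carries only the two fields this sketch needs; the real struct
// in store/types.go has more.
type Host struct {
	Status   string // "online" or "offline" (assumed values)
	AlwaysOn bool   // the new per-host flag
}

// displayState is a hypothetical helper that maps stored state to the
// label the UI shows. Nothing new is persisted: "asleep" exists only
// at presentation time.
func displayState(h Host) string {
	if h.Status == "offline" && !h.AlwaysOn {
		return "asleep"
	}
	return h.Status
}

func main() {
	fmt.Println(displayState(Host{Status: "offline", AlwaysOn: false})) // asleep
	fmt.Println(displayState(Host{Status: "offline", AlwaysOn: true}))  // offline
	fmt.Println(displayState(Host{Status: "online", AlwaysOn: false}))  // online
}
```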
+
+### 1. Data model
+
+- **Migration `0024_hosts_always_on.sql`:**
+  ```sql
+  ALTER TABLE hosts ADD COLUMN always_on INTEGER NOT NULL DEFAULT 1;
+  ```
+  Column-level ALTER per the repo's migration rules. Default `1` means
+  every existing row is Always-On — no behaviour change on upgrade.
+- `store/types.go`: add `AlwaysOn bool` to the `Host` struct; thread it
+  through every host SELECT scan and the host insert/update paths.
+- New store helper `SetHostAlwaysOn(ctx, hostID, bool) error`.
+
+### 2. Online/offline mechanics — UNCHANGED
+
+The 30s offline sweeper (`cmd/server/main.go:220`) still flips an unseen
+host to `status='offline'` and still calls
+`alertEngine.NotifyHostOffline(id)`. `TouchHost` / `MarkHostHello`
+behaviour is untouched. The intermittent distinction is applied
+*downstream* of this state, in the alert engine and the templates.
+
+### 3. Alert behaviour
+
+All changes key off `host.AlwaysOn`, which the engine already has access
+to via the host row it loads.
+
+- **Suppress offline alert** (`alert/engine.go` `handleHostOffline()`
+  and the 60s `tick()`): when `!host.AlwaysOn`, do not raise
+  `agent_offline`.
+- **Resolve-on-toggle:** when a host is switched server→intermittent and
+  has an open `agent_offline` alert, auto-resolve it. (Handled in the
+  mode-change handler, fanning through the normal resolve path so
+  channels/audit fire as usual.)
+- **Staleness alert** — wire up the currently-dead `KindStaleSchedule`
+  constant, **for intermittent hosts only.** On the 60s tick, for each
+  host where `!AlwaysOn` AND the host has ≥1 enabled schedule AND
+  `LastBackupAt != nil` AND `now - LastBackupAt > 7*24h`: raise a
+  `warning` `stale_schedule` alert (dedup key `""`, one per host).
+  Auto-resolves once `LastBackupAt` is again within the threshold (i.e.
+  after any successful backup, including the catch-up). Always-On hosts'
+  `stale_schedule` remains a no-op (unchanged, out of scope).
+  - If `LastBackupAt == nil` (intermittent host enrolled but never
+    backed up): no staleness alert in v1 — there is no baseline to
+    measure against, and onboarding probe state (`repo_status`) already
+    covers "never successfully set up."
+- **Job-failure alerts:** untouched. A catch-up backup that runs and
+  fails alerts exactly like any other backup.
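+
+The staleness rule above reduces to a single predicate. A minimal
+sketch, assuming a trimmed `Host` view with an `EnabledSchedules` count
+(hypothetical; the real engine would look schedules up) and a
+`staleAfter` constant standing in for the global 7-day threshold:
+
```go
package main

import (
	"fmt"
	"time"
)

// staleAfter mirrors the 7-day global constant from the design.
const staleAfter = 7 * 24 * time.Hour

// Host is a trimmed, hypothetical view of the host row.
type Host struct {
	AlwaysOn         bool
	EnabledSchedules int        // assumed field: number of enabled schedules
	LastBackupAt     *time.Time // nil if the host has never backed up
}

// staleScheduleDue reports whether a stale_schedule alert should be open
// for h at time now. Always-On hosts, hosts without an enabled schedule,
// and never-backed-up hosts never qualify.
func staleScheduleDue(h Host, now time.Time) bool {
	if h.AlwaysOn || h.EnabledSchedules == 0 || h.LastBackupAt == nil {
		return false
	}
	return now.Sub(*h.LastBackupAt) > staleAfter
}

// demoNow and daysAgo exist only to make the example deterministic.
var demoNow = time.Date(2026, time.June, 15, 12, 0, 0, 0, time.UTC)

func daysAgo(d int) *time.Time {
	t := demoNow.Add(-time.Duration(d) * 24 * time.Hour)
	return &t
}

func main() {
	drawer := Host{EnabledSchedules: 1, LastBackupAt: daysAgo(8)}
	recent := Host{EnabledSchedules: 1, LastBackupAt: daysAgo(2)}
	server := Host{AlwaysOn: true, EnabledSchedules: 1, LastBackupAt: daysAgo(30)}
	fresh := Host{EnabledSchedules: 1} // enrolled, never backed up

	fmt.Println(staleScheduleDue(drawer, demoNow)) // true
	fmt.Println(staleScheduleDue(recent, demoNow)) // false
	fmt.Println(staleScheduleDue(server, demoNow)) // false
	fmt.Println(staleScheduleDue(fresh, demoNow))  // false
}
```
+
+Because the predicate is recomputed on every tick, the same function
+can drive both raising and auto-resolving the alert.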
+
+### 4. Catch-up on reconnect
+
+A new small component — the **catch-up scheduler** — lives server-side
+alongside the existing ticks.
+
+- **Arm:** on agent hello (`server/ws/handler.go` hello path /
+  `onAgentHello`), if the host is `!AlwaysOn`, record
+  `catchupDueAt[hostID] = now + 60s` in an in-memory map. Re-arming on a
+  subsequent hello just overwrites the timestamp (debounce — rapid
+  flapping does not stack catch-ups). In-memory is acceptable: catch-up
+  is best-effort and a server restart simply re-arms on the next hello.
+- **Fire:** reuse the existing 30s server tick. For each due entry
+  (`catchupDueAt <= now`):
+  1. Re-verify the agent is still connected (`Hub.Connected(hostID)`).
+     If it bounced back offline within the settle window, drop the entry
+     (it will re-arm on the next hello).
+  2. Skip if a backup is already running or queued for the host
+     (`current_job_id` set, or a relevant `pending_runs` row exists) —
+     avoid double-firing alongside a normal dispatch or pending drain.
+  3. For each **enabled** schedule on the host, compute overdue:
+     ```
+     // time.Time has no <= operator; !After(now) means "at or before now"
+     overdue := !sched.Next(*host.LastBackupAt).After(now)
+     ```
+     using `robfig/cron/v3` (already a dependency) to parse
+     `Schedule.CronExpr`. `Next(lastBackup)` is the first fire strictly
+     after the last successful backup; if that moment has already
+     passed, the window was missed → overdue. (If `LastBackupAt` is nil,
+     treat as overdue so a never-backed-up intermittent host with a
+     schedule gets its first run on connect.)
+  4. For each overdue schedule, dispatch its source-groups via the
+     existing `dispatchBackupForGroupCore()`.
+  5. Clear the entry.
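+
+The overdue rule in step 3 can be sketched without pulling in the cron
+dependency by coding against the same method shape as `cron.Schedule`
+in `robfig/cron/v3` (a single `Next(time.Time) time.Time`). `dailyAt`
+is a stand-in schedule invented for the demo; `overdue` and `ts` are
+likewise hypothetical names.
+
```go
package main

import (
	"fmt"
	"time"
)

// schedule has the same method shape as cron.Schedule in robfig/cron/v3,
// so a parsed cron expression could be passed in unchanged.
type schedule interface {
	Next(time.Time) time.Time
}

// dailyAt is a demo-only schedule that fires once a day at hour:00.
type dailyAt struct{ hour int }

// Next returns the first fire time strictly after t.
func (d dailyAt) Next(t time.Time) time.Time {
	next := time.Date(t.Year(), t.Month(), t.Day(), d.hour, 0, 0, 0, t.Location())
	if !next.After(t) {
		next = next.AddDate(0, 0, 1)
	}
	return next
}

// overdue reports whether a fire was missed since the last successful
// backup. A nil lastBackup counts as overdue, per the design.
func overdue(s schedule, lastBackup *time.Time, now time.Time) bool {
	if lastBackup == nil {
		return true
	}
	return !s.Next(*lastBackup).After(now)
}

// ts builds a fixed June 2026 timestamp to keep the demo deterministic.
func ts(day, hour, min int) time.Time {
	return time.Date(2026, time.June, day, hour, min, 0, 0, time.UTC)
}

func main() {
	nightly := dailyAt{hour: 2}
	now := ts(15, 9, 0) // agent reconnects at 09:00 on 15 June

	missed := ts(12, 2, 5) // last backup three nights ago
	fmt.Println(overdue(nightly, &missed, now)) // true

	current := ts(15, 2, 5) // last backup ran this morning
	fmt.Println(overdue(nightly, &current, now)) // false

	fmt.Println(overdue(nightly, nil, now)) // true
}
```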
+
+Net latency is ~60–90s after reconnect (60s settle + up to one 30s tick).
+This path is independent of and complementary to the `pending_runs`
+drain, which continues to handle the fired-but-not-sent case.
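+
+The arm/debounce/fire bookkeeping could look roughly like the sketch
+below. It rests on assumptions: host IDs are strings; `catchUp`, `Arm`
+and `Due` are hypothetical names; and `Due` pops an entry as soon as it
+is due, leaving the connected, in-flight and overdue checks to the
+caller.
+
```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// settleDelay is the wait between agent hello and catch-up evaluation.
const settleDelay = 60 * time.Second

// catchUp is a hypothetical in-memory catch-up scheduler.
type catchUp struct {
	mu    sync.Mutex
	dueAt map[string]time.Time // hostID -> earliest evaluation time
}

func newCatchUp() *catchUp {
	return &catchUp{dueAt: make(map[string]time.Time)}
}

// Arm is called on agent hello for intermittent hosts. A repeated hello
// overwrites the previous deadline, so flapping never stacks catch-ups.
func (c *catchUp) Arm(hostID string, now time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.dueAt[hostID] = now.Add(settleDelay)
}

// Due removes and returns, in sorted order, every host whose settle
// delay has elapsed. It is meant to be called from the periodic tick.
func (c *catchUp) Due(now time.Time) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	var due []string
	for id, at := range c.dueAt {
		if !at.After(now) {
			due = append(due, id)
			delete(c.dueAt, id) // deleting during range is safe in Go
		}
	}
	sort.Strings(due)
	return due
}

// demoStart and at keep the example deterministic.
var demoStart = time.Date(2026, time.June, 15, 9, 0, 0, 0, time.UTC)

func at(sec int) time.Time {
	return demoStart.Add(time.Duration(sec) * time.Second)
}

func main() {
	c := newCatchUp()
	c.Arm("laptop", at(0))
	c.Arm("laptop", at(20)) // flap: second hello moves the deadline to +80s

	fmt.Println(c.Due(at(60)))  // []
	fmt.Println(c.Due(at(90)))  // [laptop]
	fmt.Println(c.Due(at(120))) // []
}
```
+
+Popping the entry inside `Due` keeps the map small and makes a dropped
+host rely on its next hello to re-arm, matching the best-effort stance
+taken above.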
+
+### 5. UI
+
+- **CSS:** new grey `dot-asleep` token in `web/styles/input.css`,
+  visually distinct from red `dot-offline`.
+- **`partials/host_row.html` and `partials/host_chrome.html`:** when
+  `!AlwaysOn && status=='offline'`, render the grey dot + label
+  `asleep`; the detail/last-seen line reads
+  `asleep · last seen <relTime> · will catch up on return`. All other
+  states unchanged.
+- **24×7 chip:** on the host detail header, render a small
+  `Always On` / `24×7` chip **only when `AlwaysOn` is true**. No chip
+  for intermittent hosts. (Chip and checkbox highlight the same fact.)
+- **Toggle:** an `Always On` checkbox (default checked) on the host edit
+  surface. Operator-band `POST` (mirrors existing host-edit handlers),
+  audited as `host.mode_updated`. On save, if switching to intermittent,
+  trigger the resolve-on-toggle path for any open `agent_offline` alert.
+
+## Error handling & edge cases
+
+- **Toggle server→intermittent while offline+alerting:** open
+  `agent_offline` alert auto-resolved on save.
+- **Toggle intermittent→server while asleep:** host resumes normal
+  offline/alert semantics; it will alert per the 15-minute floor once
+  the sweeper/tick next evaluates it.
+- **No enabled schedules:** no catch-up and no staleness alert — there
+  is no backup expectation to measure against.
+- **Catch-up vs in-flight work:** guarded by the running/queued check in
+  step 2 of the Fire sequence (§4), so catch-up never races a normal
+  dispatch or pending drain.
+- **Agent flaps during settle window:** entry dropped if not connected
+  at fire time; re-armed on the next hello.
+
+## Testing
+
+- **Alert engine (unit):**
+  - offline alert suppressed when `!AlwaysOn`.
+  - staleness alert raised when intermittent + schedule + last backup >
+    7d; not raised for Always-On hosts; not raised when last backup is
+    recent; not raised when no enabled schedule.
+  - staleness alert auto-resolves after a backup advances `LastBackupAt`.
+  - server→intermittent toggle resolves an open `agent_offline` alert.
+- **Overdue computation (unit, table-driven):** `(cronExpr,
+  lastBackupAt, now) → overdue?` including nil-last-backup and
+  daily/weekly cases.
+- **Catch-up scheduler (unit):** fires only when still connected; skips
+  when a backup is running/queued; dispatches only overdue schedules.
+- **UI (render test):** asleep state + 24×7 chip render under the right
+  conditions; offline state for Always-On hosts unchanged.
+- `go vet ./...` and full `go test ./...` green before merge.
+
+## Out of scope
+
+- Per-host staleness thresholds (global 7d constant for v1).
+- Continuous (non-reconnect) overdue evaluation.
+- Agent-side catch-up cron — the server is the reliable arbiter.
+- Wiring `stale_schedule` for Always-On hosts (separate concern).
+
+## Task tracking
+
+Add an entry to `tasks.md` under "Next steps from testing" (or a new
+small section) once the plan is approved, per the repo's tasks.md
+source-of-truth rule.