Always-On vs intermittent host mode (laptops): suppress offline noise, catch up missed backups #31

Merged
steve merged 17 commits from feat-laptop-host-mode into main 2026-06-15 23:01:04 +01:00
Showing only changes of commit 0c3a0844e4
@@ -0,0 +1,217 @@
# Always-On vs Intermittent host mode
**Date:** 2026-06-15
**Branch:** `feat-laptop-host-mode`
**Status:** Design — awaiting review
## Problem
The server currently assumes every host should be present 24×7. When an
agent stops heartbeating for 90s it is flipped to `offline`, and after 15
minutes that raises a `warning` alert. This is correct for a server, but
wrong for a host that legitimately comes and goes — a workstation or
laptop that sleeps overnight, travels, or is shut down on weekends. Such
a host generates noise alerts every time it is closed, and — more
importantly — there is **no mechanism to catch up a backup it missed
while it was away.**
Two distinct facts make the catch-up gap real:
- **Backup cron runs on the agent, locally.** The agent fires
`MsgScheduleFire`; the server only dispatches in response. If the host
is asleep, the agent process is suspended, so the cron tick never
fires and no `MsgScheduleFire` is ever sent.
- Therefore the existing `pending_runs` retry queue **does not** cover
this case. `pending_runs` only gets a row when a schedule *fired* but
the agent was momentarily disconnected at dispatch time. A window
missed entirely during sleep never enqueues anything.
## Goal
Let an operator mark a host as **not** always-on. Such a host:
1. Does **not** raise offline/agent-down alerts when it is not visible.
2. Renders a distinct, calm "asleep" state in the UI instead of the
alarming red "offline".
3. When it reconnects, after a short settle delay, the server checks
whether it missed a scheduled backup and — if so — triggers a
catch-up backup automatically.
4. Still raises a *staleness* alert if it has genuinely gone too long
without any backup (a host left in a drawer), and still raises normal
job-failure alerts for backups that run and fail.
Default behaviour is unchanged for the entire existing fleet.
## Decisions (from brainstorming)
- **Setting shape:** a single boolean `Always On` checkbox per host,
**default ON**. Checked = today's 24×7 server semantics. Unchecked =
intermittent host. Opt-in only; zero behaviour change for current and
future hosts unless explicitly toggled.
- **Overdue trigger:** evaluated on **reconnect + behind schedule**
(not by a continuous background sweep).
- **Alert policy for intermittent hosts:** suppress offline alerts;
keep a long-threshold **staleness** alert; keep job-failure alerts.
- **Staleness threshold:** **7 days**, a global constant for v1. May
become per-host configurable later — out of scope now.
- **Catch-up granularity:** **per enabled schedule.** A host with a
daily and a weekly schedule catches up only whichever is actually
behind.
- **UI vocabulary:** not-visible intermittent host shows a grey
`asleep` state; detail line reads
`asleep · last seen <relTime> · will catch up on return`.
- **Chip:** chip and checkbox highlight the **same** truth (24×7). Show
a chip for **Always-On** hosts; **no** chip for intermittent.
## Architecture
The change is deliberately a thin policy + presentation layer over the
existing online/offline state machine. We do **not** add a new `status`
enum value or alter heartbeat / `last_seen_at` tracking. "Asleep" is a
reinterpretation of `status='offline' AND NOT always_on`.
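As a rough illustration of that reinterpretation, the displayed state can be derived from the two existing facts rather than stored. This is a minimal sketch, not the repo's code: the `Host` shape, the `DisplayState` helper, and the `"online"` status string are assumptions made for the example.

```go
package main

import "fmt"

// Host carries only the two fields this sketch needs. The real struct in
// store/types.go has more; these field names are assumptions.
type Host struct {
	Status   string // "online" or "offline", as written by the sweeper
	AlwaysOn bool   // proposed column; true keeps today's behaviour
}

// DisplayState is a hypothetical helper: it derives the label shown in the
// UI without introducing a new stored status value.
func DisplayState(h Host) string {
	if h.Status == "offline" && !h.AlwaysOn {
		return "asleep"
	}
	return h.Status
}

func main() {
	fmt.Println(DisplayState(Host{Status: "offline", AlwaysOn: false})) // asleep
	fmt.Println(DisplayState(Host{Status: "offline", AlwaysOn: true}))  // offline
	fmt.Println(DisplayState(Host{Status: "online", AlwaysOn: false}))  // online
}
```

Because the label is computed, toggling `always_on` changes what the UI shows immediately, with no status migration or sweeper change.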
### 1. Data model
- **Migration `0024_hosts_always_on.sql`:**
```sql
ALTER TABLE hosts ADD COLUMN always_on INTEGER NOT NULL DEFAULT 1;
```
Column-level ALTER per the repo's migration rules. Default `1` means
every existing row is Always-On — no behaviour change on upgrade.
- `store/types.go`: add `AlwaysOn bool` to the `Host` struct; thread it
through every host SELECT scan and the host insert/update paths.
- New store helper `SetHostAlwaysOn(ctx, hostID, bool) error`.
### 2. Online/offline mechanics — UNCHANGED
The 30s offline sweeper (`cmd/server/main.go:220`) still flips an unseen
host to `status='offline'` and still calls
`alertEngine.NotifyHostOffline(id)`. `TouchHost` / `MarkHostHello`
behaviour is untouched. The intermittent distinction is applied
*downstream* of this state, in the alert engine and the templates.
### 3. Alert behaviour
All changes key off `host.AlwaysOn`, which the engine already has access
to via the host row it loads.
- **Suppress offline alert** (`alert/engine.go` `handleHostOffline()`
and the 60s `tick()`): when `!host.AlwaysOn`, do not raise
`agent_offline`.
- **Resolve-on-toggle:** when a host is switched server→intermittent and
has an open `agent_offline` alert, auto-resolve it. (Handled in the
mode-change handler, fanning through the normal resolve path so
channels/audit fire as usual.)
- **Staleness alert** — wire up the currently-dead `KindStaleSchedule`
constant, **for intermittent hosts only.** On the 60s tick, for each
host where `!AlwaysOn` AND the host has ≥1 enabled schedule AND
`LastBackupAt != nil` AND `now - LastBackupAt > 7*24h`: raise a
`warning` `stale_schedule` alert (dedup key `""`, one per host).
Auto-resolves once `LastBackupAt` is back within the threshold (i.e.
after any successful backup, including the catch-up). Always-On hosts'
`stale_schedule` remains a no-op (unchanged, out of scope).
- If `LastBackupAt == nil` (intermittent host enrolled but never
backed up): no staleness alert in v1 — there is no baseline to
measure against, and onboarding probe state (`repo_status`) already
covers "never successfully set up."
- **Job-failure alerts:** untouched. A catch-up backup that runs and
fails alerts exactly like any other backup.
### 4. Catch-up on reconnect
A new small component — the **catch-up scheduler** — lives server-side
alongside the existing ticks.
- **Arm:** on agent hello (`server/ws/handler.go` hello path /
`onAgentHello`), if the host is `!AlwaysOn`, record
`catchupDueAt[hostID] = now + 60s` in an in-memory map. Re-arming on a
subsequent hello just overwrites the timestamp (debounce — rapid
flapping does not stack catch-ups). In-memory is acceptable: catch-up
is best-effort and a server restart simply re-arms on the next hello.
- **Fire:** reuse the existing 30s server tick. For each due entry
(`catchupDueAt <= now`):
1. Re-verify the agent is still connected (`Hub.Connected(hostID)`).
If it bounced back offline within the settle window, drop the entry
(it will re-arm on the next hello).
2. Skip if a backup is already running or queued for the host
(`current_job_id` set, or a relevant `pending_runs` row exists) —
avoid double-firing alongside a normal dispatch or pending drain.
3. For each **enabled** schedule on the host, compute overdue:
```
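// time.Time has no <= operator; "not after now" is the same test. Nil LastBackupAt is handled separately.
overdue := !sched.Next(*host.LastBackupAt).After(now)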
```
using `robfig/cron/v3` (already a dependency) to parse
`Schedule.CronExpr`. `Next(lastBackup)` is the first fire strictly
after the last successful backup; if that moment has already
passed, the window was missed → overdue. (If `LastBackupAt` is nil,
treat as overdue so a never-backed-up intermittent host with a
schedule gets its first run on connect.)
4. For each overdue schedule, dispatch its source-groups via the
existing `dispatchBackupForGroupCore()`.
5. Clear the entry.
Net latency is ~60–90s after wake (60s settle + up to one 30s tick).
This path is independent of and complementary to the `pending_runs`
drain, which continues to handle the fired-but-not-sent case.
### 5. UI
- **CSS:** new grey `dot-asleep` token in `web/styles/input.css`,
visually distinct from red `dot-offline`.
- **`partials/host_row.html` and `partials/host_chrome.html`:** when
`!AlwaysOn && status=='offline'`, render the grey dot + label
`asleep`; the detail/last-seen line reads
`asleep · last seen <relTime> · will catch up on return`. All other
states unchanged.
- **24×7 chip:** on the host detail header, render a small
`Always On` / `24×7` chip **only when `AlwaysOn` is true**. No chip
for intermittent hosts. (Chip and checkbox highlight the same fact.)
- **Toggle:** an `Always On` checkbox (default checked) on the host edit
surface. Operator-band `POST` (mirrors existing host-edit handlers),
audited as `host.mode_updated`. On save, if switching to intermittent,
trigger the resolve-on-toggle path for any open `agent_offline` alert.
## Error handling & edge cases
- **Toggle server→intermittent while offline+alerting:** open
`agent_offline` alert auto-resolved on save.
- **Toggle intermittent→server while asleep:** host resumes normal
offline/alert semantics; it will alert per the 15-minute floor once
the sweeper/tick next evaluates it.
- **No enabled schedules:** no catch-up and no staleness alert — there
is no backup expectation to measure against.
- **Catch-up vs in-flight work:** guarded by the running/queued check in
Fire step 2 of §4, so catch-up never races a normal dispatch or pending drain.
- **Agent flaps during settle window:** entry dropped if not connected
at fire time; re-armed on the next hello.
## Testing
- **Alert engine (unit):**
- offline alert suppressed when `!AlwaysOn`.
- staleness alert raised when intermittent + schedule + last backup >
7d; not raised for Always-On hosts; not raised when last backup is
recent; not raised when no enabled schedule.
- staleness alert auto-resolves after a backup advances `LastBackupAt`.
- server→intermittent toggle resolves an open `agent_offline` alert.
- **Overdue computation (unit, table-driven):** `(cronExpr,
lastBackupAt, now) → overdue?` including nil-last-backup and
daily/weekly cases.
- **Catch-up scheduler (unit):** fires only when still connected; skips
when a backup is running/queued; dispatches only overdue schedules.
- **UI (render test):** asleep state + 24×7 chip render under the right
conditions; offline state for Always-On hosts unchanged.
- `go vet ./...` and full `go test ./...` green before merge.
## Out of scope
- Per-host staleness thresholds (global 7d constant for v1).
- Continuous (non-reconnect) overdue evaluation.
- Agent-side catch-up cron — the server is the reliable arbiter.
- Wiring `stale_schedule` for Always-On hosts (separate concern).
## Task tracking
Add an entry to `tasks.md` under "Next steps from testing" (or a new
small section) once the plan is approved, per the repo's tasks.md
source-of-truth rule.