restic-manager/docs/specs/2026-06-15-always-on-host-mode-design.md

# Always-On vs Intermittent host mode

**Date:** 2026-06-15
**Branch:** `feat-laptop-host-mode`
**Status:** Design — awaiting review

## Problem

The server currently assumes every host should be present 24×7. When an
agent stops heartbeating for 90s the host is flipped to `offline`, and
after 15 minutes in that state the server raises a `warning` alert. This
is correct for a server, but wrong for a host that legitimately comes and
goes: a workstation or laptop that sleeps overnight, travels, or is shut
down on weekends. Such a host generates a noise alert every time it goes
to sleep or is shut down, and, more importantly, there is **no mechanism
to catch up a backup it missed while it was away.**

Two distinct facts make the catch-up gap real:

- **Backup cron runs on the agent, locally.** The agent fires
  `MsgScheduleFire`; the server only dispatches in response. If the host
  is asleep, the agent process is suspended, so the cron tick never
  fires and no `MsgScheduleFire` is ever sent.
- Therefore the existing `pending_runs` retry queue **does not** cover
  this case. `pending_runs` only gets a row when a schedule *fired* but
  the agent was momentarily disconnected at dispatch time. A window
  missed entirely during sleep never enqueues anything.

## Goal

Let an operator mark a host as **not** always-on. Such a host:

1. Does **not** raise offline/agent-down alerts when it is not visible.
2. Renders a distinct, calm "asleep" state in the UI instead of the
   alarming red "offline".
3. When it reconnects, after a short settle delay, the server checks
   whether it missed a scheduled backup and — if so — triggers a
   catch-up backup automatically.
4. Still raises a *staleness* alert if it has genuinely gone too long
   without any backup (a host left in a drawer). This is the only
   alert covering an asleep host: while the agent is offline no job
   runs, so there is no failure to detect — staleness is the safety
   net for "no backups are happening at all."
5. Leaves normal job-failure alerting untouched: a backup that
   actually runs (scheduled or catch-up) and fails alerts as it does
   today. Failures can only occur while the agent is online and
   executing restic.

Default behaviour is unchanged for the entire existing fleet.

## Decisions (from brainstorming)

- **Setting shape:** a single boolean `Always On` checkbox per host,
  **default ON**. Checked = today's 24×7 server semantics. Unchecked =
  intermittent host. Intermittent mode is opt-in: behaviour is unchanged
  for current and future hosts unless an operator unchecks the box.
- **Overdue trigger:** evaluated on **reconnect + behind schedule**
  (not a continuous always-evaluating sweep).
- **Alert policy for intermittent hosts:** suppress offline alerts;
  keep a long-threshold **staleness** alert; keep job-failure alerts.
- **Staleness threshold:** **7 days**, a global constant for v1. May
  become per-host configurable later — out of scope now.
- **Catch-up granularity:** **per enabled schedule.** A host with a
  daily and a weekly schedule catches up only whichever is actually
  behind.
- **UI vocabulary:** not-visible intermittent host shows a grey
  `asleep` state; detail line reads
  `asleep · last seen <relTime> · will catch up on return`.
- **Chip:** the chip and the checkbox express the **same** fact (the
  host is 24×7). Show a chip for **Always-On** hosts; **no** chip for
  intermittent.

## Architecture

The change is deliberately a thin policy + presentation layer over the
existing online/offline state machine. We do **not** add a new `status`
enum value or alter heartbeat / `last_seen_at` tracking. "Asleep" is a
reinterpretation of `status='offline' AND NOT always_on`.

### 1. Data model

- **Migration `0024_hosts_always_on.sql`:**
  ```sql
  ALTER TABLE hosts ADD COLUMN always_on INTEGER NOT NULL DEFAULT 1;
  ```
  Column-level ALTER per the repo's migration rules. Default `1` means
  every existing row is Always-On — no behaviour change on upgrade.
- `store/types.go`: add `AlwaysOn bool` to the `Host` struct; thread it
  through every host SELECT scan and the host insert/update paths.
- New store helper `SetHostAlwaysOn(ctx, hostID, bool) error`.

### 2. Online/offline mechanics — UNCHANGED

The 30s offline sweeper (`cmd/server/main.go:220`) still flips an unseen
host to `status='offline'` and still calls
`alertEngine.NotifyHostOffline(id)`. `TouchHost` / `MarkHostHello`
behaviour is untouched. The intermittent distinction is applied
*downstream* of this state, in the alert engine and the templates.

### 3. Alert behaviour

All changes key off `host.AlwaysOn`, which the engine already has access
to via the host row it loads.

- **Suppress offline alert** (`alert/engine.go` `handleHostOffline()`
  and the 60s `tick()`): when `!host.AlwaysOn`, do not raise
  `agent_offline`.
- **Resolve-on-toggle:** when a host is switched Always-On→intermittent
  and has an open `agent_offline` alert, auto-resolve it. (Handled in
  the mode-change handler, fanning through the normal resolve path so
  channels/audit fire as usual.)
- **Staleness alert** — wire up the currently-dead `KindStaleSchedule`
  constant, **for intermittent hosts only.** On the 60s tick, for each
  host where `!AlwaysOn` AND the host has ≥1 enabled schedule AND
  `LastBackupAt != nil` AND `now - LastBackupAt > 7*24h`: raise a
  `warning` `stale_schedule` alert (dedup key `""`, one per host).
  Auto-resolves once `now - LastBackupAt` is back within 7 days (i.e.
  after any successful backup, including the catch-up). Always-On hosts'
  `stale_schedule` remains a no-op (unchanged, out of scope).
  - If `LastBackupAt == nil` (intermittent host enrolled but never
    backed up): no staleness alert in v1 — there is no baseline to
    measure against, and onboarding probe state (`repo_status`) already
    covers "never successfully set up."
- **Job-failure alerts:** untouched. A catch-up backup that runs and
  fails alerts exactly like any other backup.

### 4. Catch-up on reconnect

A new small component — the **catch-up scheduler** — lives server-side
alongside the existing ticks.

- **Arm:** on agent hello (`server/ws/handler.go` hello path /
  `onAgentHello`), if the host is `!AlwaysOn`, record
  `catchupDueAt[hostID] = now + 60s` in an in-memory map. Re-arming on a
  subsequent hello just overwrites the timestamp (debounce — rapid
  flapping does not stack catch-ups). In-memory is acceptable: catch-up
  is best-effort and a server restart simply re-arms on the next hello.
- **Fire:** reuse the existing 30s server tick. For each due entry
  (`catchupDueAt <= now`):
  1. Re-verify the agent is still connected (`Hub.Connected(hostID)`).
     If it bounced back offline within the settle window, drop the entry
     (it will re-arm on the next hello).
  2. Skip if a backup is already running or queued for the host
     (`current_job_id` set, or a relevant `pending_runs` row exists) —
     avoid double-firing alongside a normal dispatch or pending drain.
  3. For each **enabled** schedule on the host, compute overdue:
     ```
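     // time.Time has no <= operator; "not after now" means at or before now.
     overdue := host.LastBackupAt == nil || !sched.Next(*host.LastBackupAt).After(now)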
     ```
     using `robfig/cron/v3` (already a dependency) to parse
     `Schedule.CronExpr`. `Next(lastBackup)` is the first fire strictly
     after the last successful backup; if that moment has already
     passed, the window was missed → overdue. (If `LastBackupAt` is nil,
     treat as overdue so a never-backed-up intermittent host with a
     schedule gets its first run on connect.)
  4. For each overdue schedule, dispatch its source-groups via the
     existing `dispatchBackupForGroupCore()`.
  5. Clear the entry.

Net latency is ~60–90s after wake (60s settle + up to one 30s tick).
This path is independent of and complementary to the `pending_runs`
drain, which continues to handle the fired-but-not-sent case.

### 5. UI

- **CSS:** new grey `dot-asleep` token in `web/styles/input.css`,
  visually distinct from red `dot-offline`.
- **`partials/host_row.html` and `partials/host_chrome.html`:** when
  `!AlwaysOn && status=='offline'`, render the grey dot + label
  `asleep`; the detail/last-seen line reads
  `asleep · last seen <relTime> · will catch up on return`. All other
  states unchanged.
- **24×7 chip:** on the host detail header, render a small
  `Always On` / `24×7` chip **only when `AlwaysOn` is true**. No chip
  for intermittent hosts. (Chip and checkbox highlight the same fact.)
- **Toggle:** an `Always On` checkbox (default checked) on the host edit
  surface. Operator-band `POST` (mirrors existing host-edit handlers),
  audited as `host.mode_updated`. On save, if switching to intermittent,
  trigger the resolve-on-toggle path for any open `agent_offline` alert.
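
Since no new `status` value is stored, the templates need a small view helper to derive what to render. A minimal sketch, assuming hypothetical helper names (`displayState`, `detailLine`, `showAlwaysOnChip`) and a `dot-<status>` class naming pattern; only `dot-offline` and `dot-asleep` are named in this spec, so any other class is an assumption.

```go
package main

import "fmt"

// Host is trimmed to the fields the templates branch on (hypothetical shape).
type Host struct {
	Status   string // stored status, e.g. "online" or "offline"
	AlwaysOn bool
}

// displayState maps the stored status to the label and dot class to render.
// "asleep" is purely a presentation of offline + intermittent.
func displayState(h Host) (label, dotClass string) {
	if h.Status == "offline" && !h.AlwaysOn {
		return "asleep", "dot-asleep"
	}
	return h.Status, "dot-" + h.Status
}

// detailLine builds the last-seen line for an asleep host.
// relTime is the already formatted relative time, e.g. "3h ago".
func detailLine(relTime string) string {
	return fmt.Sprintf("asleep · last seen %s · will catch up on return", relTime)
}

// showAlwaysOnChip reports whether the 24×7 chip is rendered.
func showAlwaysOnChip(h Host) bool {
	return h.AlwaysOn
}
```

Keeping the mapping in one helper means `host_row.html` and `host_chrome.html` cannot drift apart on what counts as asleep.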

## Error handling & edge cases

- **Toggle Always-On→intermittent while offline+alerting:** open
  `agent_offline` alert auto-resolved on save.
- **Toggle intermittent→Always-On while asleep:** host resumes normal
  offline/alert semantics; it will alert per the 15-minute floor once
  the sweeper/tick next evaluates it.
- **No enabled schedules:** no catch-up and no staleness alert — there
  is no backup expectation to measure against.
- **Catch-up vs in-flight work:** guarded by the running/queued check in
  step 2 of section 4 (Catch-up on reconnect) so catch-up never races a
  normal dispatch or pending drain.
- **Agent flaps during settle window:** entry dropped if not connected
  at fire time; re-armed on the next hello.

## Testing

- **Alert engine (unit):**
  - offline alert suppressed when `!AlwaysOn`.
  - staleness alert raised when intermittent + schedule + last backup >
    7d; not raised for Always-On hosts; not raised when last backup is
    recent; not raised when no enabled schedule.
  - staleness alert auto-resolves after a backup advances `LastBackupAt`.
  - Always-On→intermittent toggle resolves an open `agent_offline` alert.
- **Overdue computation (unit, table-driven):** `(cronExpr,
  lastBackupAt, now) → overdue?` including nil-last-backup and
  daily/weekly cases.
- **Catch-up scheduler (unit):** fires only when still connected; skips
  when a backup is running/queued; dispatches only overdue schedules.
- **UI (render test):** asleep state + 24×7 chip render under the right
  conditions; offline state for Always-On hosts unchanged.
- `go vet ./...` and full `go test ./...` green before merge.

## Out of scope

- Per-host staleness thresholds (global 7d constant for v1).
- Continuous (non-reconnect) overdue evaluation.
- Agent-side catch-up cron — the server is the reliable arbiter.
- Wiring `stale_schedule` for Always-On hosts (separate concern).

## Task tracking

Add an entry to `tasks.md` under "Next steps from testing" (or a new
small section) once the plan is approved, per the repo's tasks.md
source-of-truth rule.