steve/restic-manager

steve 261b83ec26 docs(spec): clarify staleness vs job-failure alerting for asleep hosts

2026-06-15 20:42:00 +01:00

Always-On vs Intermittent host mode

Date: 2026-06-15
Branch: feat-laptop-host-mode
Status: Design, awaiting review

Problem

The server currently assumes every host should be present 24×7. When an agent stops heartbeating for 90s it is flipped to offline, and after 15 minutes that raises a warning alert. This is correct for a server, but wrong for a host that legitimately comes and goes — a workstation or laptop that sleeps overnight, travels, or is shut down on weekends. Such a host generates noise alerts every time it is closed, and — more importantly — there is no mechanism to catch up a backup it missed while it was away.

Two distinct facts make the catch-up gap real:

Backup cron runs on the agent, locally. The agent fires MsgScheduleFire; the server only dispatches in response. If the host is asleep, the agent process is suspended, so the cron tick never fires and no MsgScheduleFire is ever sent.
Therefore the existing pending_runs retry queue does not cover this case. pending_runs only gets a row when a schedule fired but the agent was momentarily disconnected at dispatch time. A window missed entirely during sleep never enqueues anything.

Goal

Let an operator mark a host as not always-on. Such a host:

Does not raise offline/agent-down alerts when it is not visible.
Renders a distinct, calm "asleep" state in the UI instead of the alarming red "offline".
Gets an automatic catch-up: when it reconnects, the server waits a short settle delay, checks whether the host missed a scheduled backup, and if so triggers a catch-up backup.
Still raises a staleness alert if it has genuinely gone too long without any backup (a host left in a drawer). This is the only alert covering an asleep host: while the agent is offline no job runs, so there is no failure to detect — staleness is the safety net for "no backups are happening at all."
Leaves normal job-failure alerting untouched: a backup that actually runs (scheduled or catch-up) and fails alerts as it does today. Failures can only occur while the agent is online and executing restic.

Default behaviour is unchanged for the entire existing fleet.

Decisions (from brainstorming)

Setting shape: a single boolean Always On checkbox per host, default ON. Checked = today's 24×7 server semantics. Unchecked = intermittent host. Opt-in only; zero behaviour change for current and future hosts unless explicitly toggled.
Overdue trigger: evaluated only when a host reconnects and is found to be behind schedule (not a continuous sweep over all hosts).
Alert policy for intermittent hosts: suppress offline alerts; keep a long-threshold staleness alert; keep job-failure alerts.
Staleness threshold: 7 days, a global constant for v1. May become per-host configurable later — out of scope now.
Catch-up granularity: per enabled schedule. A host with a daily and a weekly schedule catches up only whichever is actually behind.
UI vocabulary: not-visible intermittent host shows a grey asleep state; detail line reads asleep · last seen <relTime> · will catch up on return.
Chip: the chip and the checkbox express the same fact (the host is 24×7). Show a chip for Always-On hosts; no chip for intermittent hosts.

Architecture

The change is deliberately a thin policy + presentation layer over the existing online/offline state machine. We do not add a new status enum value or alter heartbeat / last_seen_at tracking. "Asleep" is a reinterpretation of status='offline' AND NOT always_on.
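
To make the reinterpretation concrete, here is a minimal Go sketch of deriving the display state at read time. The trimmed Host struct, the string status values, and the DisplayState helper are assumptions for illustration; the spec only fixes the rule status='offline' AND NOT always_on.

```go
package main

import "fmt"

// Host holds only the fields this sketch needs. The real struct in
// store/types.go has more; the field names here are assumptions.
type Host struct {
	Status   string // "online" or "offline", as maintained by the existing sweeper
	AlwaysOn bool   // proposed column; true preserves today's behaviour
}

// DisplayState maps the stored status to what the UI would show.
// Nothing new is stored: "asleep" is derived from offline + !AlwaysOn.
func DisplayState(h Host) string {
	if h.Status == "offline" && !h.AlwaysOn {
		return "asleep"
	}
	return h.Status
}

func main() {
	fmt.Println(DisplayState(Host{Status: "offline", AlwaysOn: true}))  // offline
	fmt.Println(DisplayState(Host{Status: "offline", AlwaysOn: false})) // asleep
	fmt.Println(DisplayState(Host{Status: "online", AlwaysOn: false}))  // online
}
```

Because the stored status never changes, flipping always_on back to 1 immediately restores the ordinary offline rendering with no data repair.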

1. Data model

Migration 0024_hosts_always_on.sql:
```
ALTER TABLE hosts ADD COLUMN always_on INTEGER NOT NULL DEFAULT 1;
```
Column-level ALTER per the repo's migration rules. Default 1 means every existing row is Always-On — no behaviour change on upgrade.
store/types.go: add AlwaysOn bool to the Host struct; thread it through every host SELECT scan and the host insert/update paths.
New store helper SetHostAlwaysOn(ctx, hostID, bool) error.

2. Online/offline mechanics — UNCHANGED

The 30s offline sweeper (cmd/server/main.go:220) still flips an unseen host to status='offline' and still calls alertEngine.NotifyHostOffline(id). TouchHost / MarkHostHello behaviour is untouched. The intermittent distinction is applied downstream of this state, in the alert engine and the templates.

3. Alert behaviour

All changes key off host.AlwaysOn, which the engine already has access to via the host row it loads.

Suppress offline alert (alert/engine.go handleHostOffline() and the 60s tick()): when !host.AlwaysOn, do not raise agent_offline.
Resolve-on-toggle: when a host is switched server→intermittent and has an open agent_offline alert, auto-resolve it. (Handled in the mode-change handler, fanning through the normal resolve path so channels/audit fire as usual.)
Staleness alert — wire up the currently-dead KindStaleSchedule constant, for intermittent hosts only. On the 60s tick, for each host where !AlwaysOn AND the host has ≥1 enabled schedule AND LastBackupAt != nil AND now - LastBackupAt > 7*24h: raise a warning stale_schedule alert (dedup key "", one per host). Auto-resolves when LastBackupAt advances past the threshold (i.e. any successful backup, including the catch-up). Always-On hosts' stale_schedule remains a no-op (unchanged, out of scope).
- If LastBackupAt == nil (intermittent host enrolled but never backed up): no staleness alert in v1 — there is no baseline to measure against, and onboarding probe state (repo_status) already covers "never successfully set up."
Job-failure alerts: untouched. A catch-up backup that runs and fails alerts exactly like any other backup.
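
The suppression and staleness rules above can be sketched as two pure predicates. This is an illustration under assumptions, not the engine's actual code: the trimmed Host struct, the EnabledSchedules count, and the helper names (ShouldRaiseOffline, IsStale, fixedNow, daysAgo) are hypothetical. Only the 7-day threshold and the conditions themselves come from this spec.

```go
package main

import (
	"fmt"
	"time"
)

// stalenessThreshold is the global v1 constant named in the spec.
const stalenessThreshold = 7 * 24 * time.Hour

// Host is a trimmed, hypothetical view of the host row.
type Host struct {
	AlwaysOn         bool
	EnabledSchedules int        // assumed: number of enabled schedules
	LastBackupAt     *time.Time // nil when the host has never backed up
}

// ShouldRaiseOffline reports whether agent_offline may be raised.
// Intermittent hosts never raise it.
func ShouldRaiseOffline(h Host) bool { return h.AlwaysOn }

// IsStale reports whether a stale_schedule alert should be open.
// It applies only to intermittent hosts that have at least one enabled
// schedule and a backup baseline to measure from.
func IsStale(h Host, now time.Time) bool {
	if h.AlwaysOn || h.EnabledSchedules == 0 || h.LastBackupAt == nil {
		return false
	}
	return now.Sub(*h.LastBackupAt) > stalenessThreshold
}

// fixedNow and daysAgo exist only to make the example deterministic.
func fixedNow() time.Time { return time.Date(2026, 6, 15, 12, 0, 0, 0, time.UTC) }

func daysAgo(now time.Time, d int) *time.Time {
	t := now.Add(-time.Duration(d) * 24 * time.Hour)
	return &t
}

func main() {
	now := fixedNow()
	laptop := Host{AlwaysOn: false, EnabledSchedules: 1, LastBackupAt: daysAgo(now, 8)}
	fmt.Println(ShouldRaiseOffline(laptop), IsStale(laptop, now)) // false true
}
```

Evaluating IsStale on every tick also gives the auto-resolve behaviour for free: once LastBackupAt advances, the predicate returns false and the open alert can be resolved.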

4. Catch-up on reconnect

A new small component — the catch-up scheduler — lives server-side alongside the existing ticks.

Arm: on agent hello (server/ws/handler.go hello path / onAgentHello), if the host is !AlwaysOn, record catchupDueAt[hostID] = now + 60s in an in-memory map. Re-arming on a subsequent hello just overwrites the timestamp (debounce — rapid flapping does not stack catch-ups). In-memory is acceptable: catch-up is best-effort and a server restart simply re-arms on the next hello.
Fire: reuse the existing 30s server tick. For each due entry (catchupDueAt <= now):
1. Re-verify the agent is still connected (Hub.Connected(hostID)). If it bounced back offline within the settle window, drop the entry (it will re-arm on the next hello).
2. Skip if a backup is already running or queued for the host (current_job_id set, or a relevant pending_runs row exists) — avoid double-firing alongside a normal dispatch or pending drain.
3. For each enabled schedule on the host, compute overdue:
```
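// A nil LastBackupAt counts as overdue; otherwise overdue once the first fire after it has passed.
overdue := host.LastBackupAt == nil || !sched.Next(*host.LastBackupAt).After(now)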
```
  using robfig/cron/v3 (already a dependency) to parse Schedule.CronExpr. Next(lastBackup) is the first fire strictly after the last successful backup; if that moment has already passed, the window was missed → overdue. (If LastBackupAt is nil, treat as overdue so a never-backed-up intermittent host with a schedule gets its first run on connect.)
4. For each overdue schedule, dispatch its source-groups via the existing dispatchBackupForGroupCore().
5. Clear the entry.
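
A minimal sketch of the overdue check in step 3. To stay runnable without external modules, it declares a local Schedule interface with the same Next(time.Time) time.Time method that robfig/cron/v3 schedules expose, and uses a hypothetical dailyAt type as a stand-in for a parsed expression such as 0 2 * * *. Overdue, dailyAt, and at are illustrative names, not existing code.

```go
package main

import (
	"fmt"
	"time"
)

// Schedule is a local interface shaped like a robfig/cron/v3 schedule:
// Next returns the first activation strictly after t.
type Schedule interface {
	Next(t time.Time) time.Time
}

// dailyAt is a stand-in for a parsed daily cron expression (assumption).
type dailyAt struct{ hour int }

func (d dailyAt) Next(t time.Time) time.Time {
	next := time.Date(t.Year(), t.Month(), t.Day(), d.hour, 0, 0, 0, t.Location())
	if !next.After(t) {
		next = next.AddDate(0, 0, 1)
	}
	return next
}

// Overdue reports whether a fire was missed since the last successful
// backup. A nil lastBackup is treated as overdue, per the spec.
func Overdue(s Schedule, lastBackup *time.Time, now time.Time) bool {
	if lastBackup == nil {
		return true
	}
	return !s.Next(*lastBackup).After(now)
}

// at builds a UTC timestamp; it exists only to keep the example short.
func at(y, m, d, h, minute int) time.Time {
	return time.Date(y, time.Month(m), d, h, minute, 0, 0, time.UTC)
}

func main() {
	sched := dailyAt{hour: 2}
	last := at(2026, 6, 14, 2, 5) // last successful backup

	fmt.Println(Overdue(sched, &last, at(2026, 6, 14, 23, 0))) // false
	fmt.Println(Overdue(sched, &last, at(2026, 6, 15, 9, 0)))  // true
	fmt.Println(Overdue(sched, nil, at(2026, 6, 15, 9, 0)))    // true
}
```

Keeping the check a pure function of (schedule, lastBackup, now) makes it easy to cover with a table-driven test.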

Net latency is ~60–90s after wake (60s settle + up to one 30s tick). This path is independent of and complementary to the pending_runs drain, which continues to handle the fired-but-not-sent case.
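
The arm/fire flow described in this section could look roughly like the sketch below. The CatchUp type, its method names, and the connected/busy callbacks are hypothetical stand-ins for Hub.Connected and the running/queued check. Clearing an entry when the host is busy is also an assumption, since the spec does not say whether a skipped entry is kept.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const settleDelay = 60 * time.Second // settle window from the spec

// CatchUp keeps due times in memory only; a restart loses them and the
// next hello re-arms.
type CatchUp struct {
	mu    sync.Mutex
	dueAt map[string]time.Time // keyed by host ID
}

func NewCatchUp() *CatchUp { return &CatchUp{dueAt: make(map[string]time.Time)} }

// Arm records or overwrites the due time, so rapid flapping debounces.
func (c *CatchUp) Arm(hostID string, now time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.dueAt[hostID] = now.Add(settleDelay)
}

// Fire clears every due entry and returns the hosts that are still
// connected and not busy. The caller would then run the overdue check.
// Callbacks run under the lock here for brevity.
func (c *CatchUp) Fire(now time.Time, connected, busy func(string) bool) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	var ready []string
	for id, due := range c.dueAt {
		if due.After(now) {
			continue // still inside the settle window
		}
		delete(c.dueAt, id)
		if connected(id) && !busy(id) {
			ready = append(ready, id)
		}
	}
	return ready
}

// at returns a fixed base time plus sec seconds (example helper).
func at(sec int) time.Time {
	base := time.Date(2026, 6, 15, 8, 0, 0, 0, time.UTC)
	return base.Add(time.Duration(sec) * time.Second)
}

func main() {
	yes := func(string) bool { return true }
	no := func(string) bool { return false }

	c := NewCatchUp()
	c.Arm("laptop", at(0))
	c.Arm("laptop", at(10)) // second hello overwrites: due at +70s

	fmt.Println(c.Fire(at(65), yes, no)) // []
	fmt.Println(c.Fire(at(90), yes, no)) // [laptop]
}
```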

5. UI

CSS: new grey dot-asleep token in web/styles/input.css, visually distinct from red dot-offline.
partials/host_row.html and partials/host_chrome.html: when !AlwaysOn && status=='offline', render the grey dot + label asleep; the detail/last-seen line reads asleep · last seen <relTime> · will catch up on return. All other states unchanged.
24×7 chip: on the host detail header, render a small Always On / 24×7 chip only when AlwaysOn is true. No chip for intermittent hosts. (Chip and checkbox highlight the same fact.)
Toggle: an Always On checkbox (default checked) on the host edit surface. Operator-band POST (mirrors existing host-edit handlers), audited as host.mode_updated. On save, if switching to intermittent, trigger the resolve-on-toggle path for any open agent_offline alert.
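
As an illustration of the rendering rule, the sketch below uses Go's html/template with an inline template string. The HostView type, its Asleep method, the markup, and the dot-offline/dot-online class names are assumptions; the real partials and view model may be organised differently. Only the dot-asleep token, the detail-line wording, and the chip condition come from this spec.

```go
package main

import (
	"fmt"
	"html/template"
	"strings"
)

// HostView is a hypothetical view model for a host row.
type HostView struct {
	Status   string // "online" or "offline"
	AlwaysOn bool
	LastSeen string // pre-formatted relative time, e.g. "3h ago"
}

// Asleep is true only for an offline intermittent host.
func (h HostView) Asleep() bool { return h.Status == "offline" && !h.AlwaysOn }

const rowTmpl = `{{if .Asleep}}<span class="dot-asleep"></span> asleep · last seen {{.LastSeen}} · will catch up on return{{else}}<span class="dot-{{.Status}}"></span> {{.Status}}{{end}}{{if .AlwaysOn}} <span class="chip">24×7</span>{{end}}`

var row = template.Must(template.New("row").Parse(rowTmpl))

// Render executes the row template and returns the markup.
func Render(h HostView) string {
	var b strings.Builder
	if err := row.Execute(&b, h); err != nil {
		return "render error: " + err.Error()
	}
	return b.String()
}

func main() {
	fmt.Println(Render(HostView{Status: "offline", AlwaysOn: false, LastSeen: "3h ago"}))
	fmt.Println(Render(HostView{Status: "offline", AlwaysOn: true, LastSeen: "3h ago"}))
}
```

Putting the condition behind a single Asleep method keeps host_row.html and host_chrome.html from each re-deriving it.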

Error handling & edge cases

Toggle server→intermittent while offline+alerting: open agent_offline alert auto-resolved on save.
Toggle intermittent→server while asleep: host resumes normal offline/alert semantics; it will alert per the 15-minute floor once the sweeper/tick next evaluates it.
No enabled schedules: no catch-up and no staleness alert — there is no backup expectation to measure against.
Catch-up vs in-flight work: guarded by the running/queued check in section 4, step 2, so catch-up never races a normal dispatch or pending drain.
Agent flaps during settle window: entry dropped if not connected at fire time; re-armed on the next hello.

Testing

Alert engine (unit):
- offline alert suppressed when !AlwaysOn.
- staleness alert raised when the host is intermittent, has an enabled schedule, and its last backup is older than 7d; not raised for Always-On hosts; not raised when last backup is recent; not raised when no enabled schedule.
- staleness alert auto-resolves after a backup advances LastBackupAt.
- server→intermittent toggle resolves an open agent_offline alert.
Overdue computation (unit, table-driven): (cronExpr, lastBackupAt, now) → overdue? including nil-last-backup and daily/weekly cases.
Catch-up scheduler (unit): fires only when still connected; skips when a backup is running/queued; dispatches only overdue schedules.
UI (render test): asleep state + 24×7 chip render under the right conditions; offline state for Always-On hosts unchanged.
go vet ./... and full go test ./... green before merge.

Out of scope

Per-host staleness thresholds (global 7d constant for v1).
Continuous (non-reconnect) overdue evaluation.
Agent-side catch-up cron — the server is the reliable arbiter.
Wiring stale_schedule for Always-On hosts (separate concern).

Task tracking

Add an entry to tasks.md under "Next steps from testing" (or a new small section) once the plan is approved, per the repo's tasks.md source-of-truth rule.
