diff --git a/docs/plans/2026-06-15-always-on-host-mode.md b/docs/plans/2026-06-15-always-on-host-mode.md new file mode 100644 index 0000000..76fdf6c --- /dev/null +++ b/docs/plans/2026-06-15-always-on-host-mode.md @@ -0,0 +1,1060 @@ +# Always-On vs Intermittent Host Mode — Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Let an operator mark a host as not-always-on so it stops raising offline alerts when it legitimately sleeps, renders a calm "asleep" state, auto-catches-up a missed backup ~1 minute after it reconnects, and still raises a long-threshold staleness alert if it goes too long with no backup. + +**Architecture:** A thin policy + presentation layer over the existing online/offline state machine. A new `hosts.always_on` boolean (default 1 = today's behaviour) gates three behaviours: offline-alert suppression + a 7-day staleness alert in the alert engine; an in-memory catch-up scheduler in the HTTP server armed on agent hello and fired from the existing 30s tick; and an "asleep" UI state plus a 24×7 chip. Online/offline tracking, heartbeat, and `pending_runs` are untouched. + +**Tech Stack:** Go, SQLite (modernc), `github.com/robfig/cron/v3` (already a dependency), Go `html/template`, Tailwind-in-`input.css`. + +**Spec:** `docs/specs/2026-06-15-always-on-host-mode-design.md` + +--- + +## File Structure + +- **Create** `internal/store/migrations/0024_hosts_always_on.sql` — add the column. +- **Modify** `internal/store/types.go` — add `Host.AlwaysOn bool`. +- **Modify** `internal/store/hosts.go` — add `always_on` to the 3 host SELECTs + `scanHostRow`; add `SetHostAlwaysOn`. +- **Create** `internal/store/hosts_always_on_test.go` — round-trip + default test. 
+- **Modify** `internal/alert/engine.go` — suppress offline for intermittent hosts; staleness sweep + threshold constant; resolve staleness on backup success. +- **Modify** `internal/alert/rules.go` — exported `ResolveOnModeChange` helper for the toggle handler. +- **Create** `internal/alert/intermittent_test.go` — suppression + staleness + resolve tests. +- **Create** `internal/server/http/catchup.go` — overdue helper + in-memory catch-up scheduler. +- **Create** `internal/server/http/catchup_test.go` — overdue table tests. +- **Modify** `internal/server/http/server.go` — catch-up map fields on `Server`, init in `New`. +- **Modify** `internal/server/http/host_credentials.go` — arm catch-up in `onAgentHello`. +- **Modify** `cmd/server/main.go` — call `srv.RunCatchupsDue` on the pending-drain tick. +- **Modify** `internal/server/http/ui_handlers.go` — `handleUIHostModeSave` handler. +- **Modify** `internal/server/http/server.go` (routes) — mount `POST /hosts/{id}/mode`. +- **Modify** `web/styles/input.css` — `dot-asleep` token. +- **Modify** `web/templates/partials/host_row.html` — asleep dot + text. +- **Modify** `web/templates/partials/host_chrome.html` — asleep dot/last-seen, 24×7 chip, mode toggle form. +- **Modify** `tasks.md` — record the feature. + +--- + +## Task 1: Schema + store field for `always_on` + +**Files:** +- Create: `internal/store/migrations/0024_hosts_always_on.sql` +- Modify: `internal/store/types.go:62-102` (Host struct) +- Modify: `internal/store/hosts.go` (3 SELECTs at lines 41-48, 56-63, 224-231; `scanHostRow` at 261-334) +- Test: `internal/store/hosts_always_on_test.go` + +- [ ] **Step 1: Write the migration** + +Create `internal/store/migrations/0024_hosts_always_on.sql`: + +```sql +-- 0024: distinguish always-on (24x7 server) hosts from intermittent +-- hosts (laptops/workstations that legitimately sleep). Default 1 so +-- every existing and future host keeps today's offline/alert +-- semantics unless explicitly opted out. 
Column-level ALTER per the +-- repo's migration rules (no table rebuild — hosts has inbound FKs). +ALTER TABLE hosts ADD COLUMN always_on INTEGER NOT NULL DEFAULT 1; +``` + +- [ ] **Step 2: Add the struct field** + +In `internal/store/types.go`, add to the `Host` struct (after `RepoStatusError` at line 101): + +```go + // AlwaysOn is true for 24x7 server hosts (the default). When false + // the host is intermittent (laptop/workstation): offline alerts are + // suppressed, the UI shows an "asleep" state, and a missed backup is + // caught up ~1 min after reconnect. See the always-on-host-mode spec. + AlwaysOn bool +``` + +- [ ] **Step 3: Thread `always_on` through reads** + +In `internal/store/hosts.go`, append `, always_on` to the SELECT column list in all three queries: `LookupHostByAgentToken` (line 47), `GetHost` (line 62), and `ListHosts` (line 230). Each currently ends `repo_status, repo_status_error` — change to `repo_status, repo_status_error, always_on`. + +Then in `scanHostRow` (line 261), add scanning. Add a local var and the scan target. Change the `Scan(...)` call's final args from `&h.RepoStatus, &h.RepoStatusError)` to `&h.RepoStatus, &h.RepoStatusError, &alwaysOn)` and declare `var alwaysOn int` in the var block, then after the existing post-scan assignments add: + +```go + h.AlwaysOn = alwaysOn != 0 +``` + +(SQLite stores the boolean as INTEGER; scan into int then compare to avoid driver bool-coercion surprises.) + +- [ ] **Step 4: Add `SetHostAlwaysOn`** + +In `internal/store/hosts.go`, after `SetHostTags` (line 379), add: + +```go +// SetHostAlwaysOn flips the host's always-on flag. true = 24x7 server +// (default); false = intermittent host (laptop). See the +// always-on-host-mode spec. +func (s *Store) SetHostAlwaysOn(ctx context.Context, hostID string, alwaysOn bool) error { + v := 0 + if alwaysOn { + v = 1 + } + _, err := s.db.ExecContext(ctx, + `UPDATE hosts SET always_on = ? 
WHERE id = ?`, v, hostID) + if err != nil { + return fmt.Errorf("store: set host always_on: %w", err) + } + return nil +} +``` + +- [ ] **Step 5: Write the round-trip test** + +Create `internal/store/hosts_always_on_test.go`. Use the existing test harness pattern — check a sibling test (e.g. `internal/store/hosts_test.go`) for the `newTestStore`/`testStore` helper name and the host-creation helper, and mirror it exactly. The test body: + +```go +package store + +import ( + "context" + "testing" + "time" +) + +func TestHostAlwaysOnDefaultAndToggle(t *testing.T) { + ctx := context.Background() + st := newTestStore(t) // mirror the helper used by hosts_test.go + + h := Host{ + ID: "h-always-on", Name: "lap", OS: "linux", Arch: "amd64", + ProtocolVersion: 1, EnrolledAt: time.Now().UTC(), + } + if err := st.CreateHost(ctx, h, "tok-hash", "pin"); err != nil { + t.Fatalf("create host: %v", err) + } + + got, err := st.GetHost(ctx, h.ID) + if err != nil { + t.Fatalf("get host: %v", err) + } + if !got.AlwaysOn { + t.Fatalf("new host should default to always_on=true, got false") + } + + if err := st.SetHostAlwaysOn(ctx, h.ID, false); err != nil { + t.Fatalf("set always_on: %v", err) + } + got, err = st.GetHost(ctx, h.ID) + if err != nil { + t.Fatalf("get host 2: %v", err) + } + if got.AlwaysOn { + t.Fatalf("expected always_on=false after toggle, got true") + } + + // ListHosts must surface the same value. + hosts, err := st.ListHosts(ctx) + if err != nil { + t.Fatalf("list hosts: %v", err) + } + if len(hosts) != 1 || hosts[0].AlwaysOn { + t.Fatalf("ListHosts should report always_on=false, got %+v", hosts) + } +} +``` + +- [ ] **Step 6: Run the test (expect FAIL first if written before code, else PASS)** + +Run: `go test ./internal/store/ -run TestHostAlwaysOnDefaultAndToggle -v` +Expected: PASS once Steps 1-4 are in. If you wrote the test first, it fails to compile on `AlwaysOn` / `SetHostAlwaysOn` — that is the expected red. 
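
The integer/boolean mapping that Steps 3 and 4 rely on can be sketched in isolation. This is a minimal illustration, not repo code: `boolToInt` and `intToBool` are hypothetical helper names, and the plan itself inlines the same logic in `SetHostAlwaysOn` and `scanHostRow`.

```go
package main

import "fmt"

// boolToInt is a hypothetical helper mirroring the write path:
// the always_on column stores 1 for true and 0 for false.
func boolToInt(b bool) int {
	if b {
		return 1
	}
	return 0
}

// intToBool is a hypothetical helper mirroring the read path:
// any non-zero stored value is treated as true.
func intToBool(v int) bool {
	return v != 0
}

func main() {
	// Round-trip both values through the integer representation.
	for _, b := range []bool{true, false} {
		stored := boolToInt(b)
		fmt.Printf("%v -> %d -> %v\n", b, stored, intToBool(stored))
	}
}
```

Scanning into an `int` and comparing against zero keeps the read path independent of how a given SQLite driver coerces INTEGER columns into Go `bool` values.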
+ +- [ ] **Step 7: Commit** + +```bash +go vet ./internal/store/... +git add internal/store/migrations/0024_hosts_always_on.sql internal/store/types.go internal/store/hosts.go internal/store/hosts_always_on_test.go +git commit -m "feat(store): add hosts.always_on flag (default on)" +``` + +--- + +## Task 2: Overdue computation helper + +This is a pure function so it can be unit-tested in isolation before the scheduler wires it up. It lives in the new `catchup.go` (the scheduler will follow in Task 3, same file). + +**Files:** +- Create: `internal/server/http/catchup.go` +- Test: `internal/server/http/catchup_test.go` + +- [ ] **Step 1: Write the failing test** + +Create `internal/server/http/catchup_test.go`: + +```go +package http + +import ( + "testing" + "time" +) + +func TestScheduleOverdue(t *testing.T) { + mustParse := func(s string) time.Time { + t.Helper() + v, err := time.Parse(time.RFC3339, s) + if err != nil { + t.Fatalf("parse %q: %v", s, err) + } + return v + } + daily := "0 2 * * *" // 02:00 every day + + cases := []struct { + name string + cron string + lastBackup *time.Time + now time.Time + want bool + }{ + { + name: "never backed up is overdue", + cron: daily, lastBackup: nil, + now: mustParse("2026-06-15T09:00:00Z"), + want: true, + }, + { + name: "missed last nights window", + cron: daily, + lastBackup: ptrTime(mustParse("2026-06-13T02:05:00Z")), + now: mustParse("2026-06-15T09:00:00Z"), + want: true, + }, + { + name: "backed up after the most recent window", + cron: daily, + lastBackup: ptrTime(mustParse("2026-06-15T02:05:00Z")), + now: mustParse("2026-06-15T09:00:00Z"), + want: false, + }, + { + name: "unparseable cron is never overdue", + cron: "not a cron", + lastBackup: nil, + now: mustParse("2026-06-15T09:00:00Z"), + want: false, + }, + } + for _, c := range cases { + t.Run(c.name, func(t *testing.T) { + got := scheduleOverdue(c.cron, c.lastBackup, c.now) + if got != c.want { + t.Fatalf("scheduleOverdue(%q, %v, %v) = %v, want %v", + 
c.cron, c.lastBackup, c.now, got, c.want) + } + }) + } +} + +func ptrTime(t time.Time) *time.Time { return &t } +``` + +- [ ] **Step 2: Run the test to verify it fails** + +Run: `go test ./internal/server/http/ -run TestScheduleOverdue -v` +Expected: FAIL — `undefined: scheduleOverdue`. + +- [ ] **Step 3: Implement `scheduleOverdue`** + +Create `internal/server/http/catchup.go` with the helper (the scheduler methods are added in Task 3): + +```go +// catchup.go — server-side catch-up for intermittent (non-always-on) +// hosts. When such a host reconnects we wait a short settle window, +// then dispatch a backup for any schedule whose window elapsed while +// the host was asleep. This is separate from pending_runs: a host that +// was asleep never fired its local cron, so no pending row exists. +package http + +import ( + "time" +) + +// scheduleOverdue reports whether a schedule's most recent expected +// fire is newer than the host's last successful backup — i.e. a window +// passed with no backup. A nil lastBackup means "never backed up" and +// is always overdue (provided the cron parses). An unparseable cron is +// treated as not-overdue so a bad expression can never trigger a +// surprise dispatch. Uses the same cronParser the agent's scheduler +// and schedule validation use, so interpretation is identical. +func scheduleOverdue(cronExpr string, lastBackup *time.Time, now time.Time) bool { + sched, err := cronParser.Parse(cronExpr) + if err != nil { + return false + } + if lastBackup == nil { + return true + } + next := sched.Next(*lastBackup) + return !next.After(now) +} +``` + +- [ ] **Step 4: Run the test to verify it passes** + +Run: `go test ./internal/server/http/ -run TestScheduleOverdue -v` +Expected: PASS (all four sub-cases). + +- [ ] **Step 5: Commit** + +```bash +go vet ./internal/server/http/... 
+git add internal/server/http/catchup.go internal/server/http/catchup_test.go +git commit -m "feat(catchup): scheduleOverdue helper for missed-window detection" +``` + +--- + +## Task 3: Catch-up scheduler (arm on hello, fire on tick) + +**Files:** +- Modify: `internal/server/http/server.go:68-93` (Server struct), `:96-112` (New) +- Modify: `internal/server/http/catchup.go` (add scheduler methods) +- Modify: `internal/server/http/host_credentials.go:463-486` (onAgentHello) +- Modify: `cmd/server/main.go:228-229` (pending-drain tick case) + +- [ ] **Step 1: Add catch-up state to the Server struct** + +In `internal/server/http/server.go`, add fields to `Server` (after `treeCache` at line 92): + +```go + // catchupDueAt tracks intermittent hosts that reconnected and are + // in their settle window. Keyed hostID → earliest time to evaluate + // catch-up. Best-effort + in-memory: a server restart simply re-arms + // on the next hello. Guarded by catchupMu. + catchupMu sync.Mutex + catchupDueAt map[string]time.Time +``` + +Add `"time"` to the imports if not already present (check the import block). + +- [ ] **Step 2: Initialise the map in New** + +In `New` (line 106), add to the `&Server{...}` literal: + +```go + catchupDueAt: make(map[string]time.Time), +``` + +- [ ] **Step 3: Add scheduler methods to catchup.go** + +Append to `internal/server/http/catchup.go`. Add `"context"`, `"log/slog"` to its imports: + +```go +// catchupSettle is how long after a reconnect we wait before evaluating +// catch-up, so a laptop that wakes briefly and sleeps again doesn't +// trigger a backup it can't finish. ~1 minute per the spec. +const catchupSettle = 60 * time.Second + +// ArmCatchup records that an intermittent host just reconnected and +// should be evaluated for a missed backup after the settle window. +// No-op for always-on hosts (caller passes only intermittent hosts). +// Re-arming overwrites the timer (debounce — flapping doesn't stack). 
+func (s *Server) ArmCatchup(hostID string, now time.Time) { + s.catchupMu.Lock() + defer s.catchupMu.Unlock() + if s.catchupDueAt == nil { + s.catchupDueAt = make(map[string]time.Time) + } + s.catchupDueAt[hostID] = now.Add(catchupSettle) +} + +// dueCatchups returns the hostIDs whose settle window has elapsed and +// removes them from the map. Caller evaluates each. +func (s *Server) dueCatchups(now time.Time) []string { + s.catchupMu.Lock() + defer s.catchupMu.Unlock() + var due []string + for id, at := range s.catchupDueAt { + if !now.Before(at) { + due = append(due, id) + delete(s.catchupDueAt, id) + } + } + return due +} + +// RunCatchupsDue is the tick entrypoint. For each host past its settle +// window it dispatches a backup for every enabled schedule that is +// overdue. Skips hosts that bounced back offline, that already have a +// job in flight, or that turned out to be always-on. +func (s *Server) RunCatchupsDue(ctx context.Context) { + if s.deps.Hub == nil { + return + } + now := time.Now().UTC() + for _, hostID := range s.dueCatchups(now) { + s.runCatchup(ctx, hostID, now) + } +} + +// runCatchup evaluates and dispatches catch-up backups for a single +// host. Per-host logic kept here so RunCatchupsDue reads cleanly. 
+func (s *Server) runCatchup(ctx context.Context, hostID string, now time.Time) { + conn := s.deps.Hub.Conn(hostID) + if conn == nil { + return // bounced offline during the settle window; re-arms on next hello + } + host, err := s.deps.Store.GetHost(ctx, hostID) + if err != nil { + slog.Warn("catchup: load host", "host_id", hostID, "err", err) + return + } + if host.AlwaysOn { + return // mode flipped during settle window + } + if host.CurrentJobID != nil { + return // a job is already running; don't pile on + } + schedules, err := s.deps.Store.ListSchedulesByHost(ctx, hostID) + if err != nil { + slog.Warn("catchup: list schedules", "host_id", hostID, "err", err) + return + } + for _, sc := range schedules { + if !sc.Enabled || len(sc.SourceGroupIDs) == 0 { + continue + } + if !scheduleOverdue(sc.CronExpr, host.LastBackupAt, now) { + continue + } + for _, gid := range sc.SourceGroupIDs { + g, err := s.deps.Store.GetSourceGroup(ctx, hostID, gid) + if err != nil { + slog.Warn("catchup: load source group", + "host_id", hostID, "schedule_id", sc.ID, "group_id", gid, "err", err) + continue + } + if _, derr := s.dispatchBackupForGroupCore(ctx, conn, hostID, sc.ID, g, now); derr != nil { + // Send failed — host dropped again. Re-arm so the next + // reconnect retries; stop processing this host. + s.ArmCatchup(hostID, now) + return + } + slog.Info("catchup: dispatched missed backup", + "host_id", hostID, "schedule_id", sc.ID, "group", g.Name) + } + } +} +``` + +- [ ] **Step 4: Arm catch-up on agent hello** + +In `internal/server/http/host_credentials.go`, in `onAgentHello` (line 463), after the `go s.DrainPending(...)` line (485), add: + +```go + // Intermittent hosts that just reconnected may have slept through a + // backup window. Arm a catch-up evaluation after a settle delay; the + // pending-drain tick fires it. Always-on hosts never need this. 
+ if host, err := s.deps.Store.GetHost(ctx, hostID); err == nil && !host.AlwaysOn { + s.ArmCatchup(hostID, time.Now().UTC()) + } +``` + +`time` should already be imported in this file (it is used elsewhere); add it only if the build reports otherwise. + +- [ ] **Step 5: Fire catch-up from the pending-drain tick** + +In `cmd/server/main.go`, in the `case <-pendingDrainTick.C:` block (line 228), change: + +```go + case <-pendingDrainTick.C: + srv.DrainAllDue(ctx) +``` + +to: + +```go + case <-pendingDrainTick.C: + srv.DrainAllDue(ctx) + srv.RunCatchupsDue(ctx) +``` + +- [ ] **Step 6: Build and vet** + +Run: `go build ./... && go vet ./...` +Expected: clean build, no vet errors. + +- [ ] **Step 7: Commit** + +```bash +git add internal/server/http/server.go internal/server/http/catchup.go internal/server/http/host_credentials.go cmd/server/main.go +git commit -m "feat(catchup): arm on hello, fire missed-window backups on tick" +``` + +--- + +## Task 4: Alert engine — suppress offline + staleness alert + +**Files:** +- Modify: `internal/alert/engine.go:121-153` (handleJobFinished), `:155-174` (handleHostOffline), `:188-216` (tick); add the staleness threshold constant and `hostHasEnabledSchedule` +- Modify: `internal/alert/rules.go` (add exported `ResolveOnModeChange` helper after `Resolve`) +- Test: `internal/alert/intermittent_test.go` + +- [ ] **Step 1: Add the staleness threshold constant** + +In `internal/alert/engine.go`, add near the top of the file (after imports, before `JobFinishedEvent`): + +```go +// staleBackupThreshold is how long an intermittent host may go without +// a successful backup before we raise a stale_schedule alert. Global +// constant for v1 (may become per-host later). Only intermittent hosts +// are evaluated — always-on hosts' stale_schedule stays a no-op. +const staleBackupThreshold = 7 * 24 * time.Hour +``` + +- [ ] **Step 2: Suppress the offline alert for intermittent hosts** + +In `handleHostOffline` (line 155), after loading the host and before the existing `if host.LastSeenAt == nil { return }` guard, add a mode check. 
Change: + +```go + if host.LastSeenAt == nil { + return + } + if time.Since(*host.LastSeenAt) < e.agentOfflineFloor { + return + } +``` + +to: + +```go + // Intermittent hosts (laptops) legitimately disappear — never raise + // agent_offline for them. The stale_schedule sweep in tick() is the + // only staleness signal for these hosts. + if !host.AlwaysOn { + return + } + if host.LastSeenAt == nil { + return + } + if time.Since(*host.LastSeenAt) < e.agentOfflineFloor { + return + } +``` + +- [ ] **Step 3: Suppress offline + add staleness in the tick sweep** + +In `tick` (line 188), the host loop currently raises agent_offline for every offline host. Replace the loop body (lines 205-214) with: + +```go + for _, h := range hosts { + // Intermittent hosts: suppress agent_offline entirely; instead + // raise stale_schedule when they have gone too long with no + // successful backup AND they have at least one enabled schedule + // to be measured against. A nil LastBackupAt (never backed up) + // has no baseline — onboarding/repo_status covers that case. + if !h.AlwaysOn { + if h.LastBackupAt == nil { + continue + } + if now.Sub(*h.LastBackupAt) < staleBackupThreshold { + continue + } + hasEnabled, err := e.hostHasEnabledSchedule(ctx, h.ID) + if err != nil || !hasEnabled { + continue + } + e.raiseAndNotify(ctx, h.ID, KindStaleSchedule, "", "warning", + fmt.Sprintf("No backup in %s (threshold %s)", + roundDur(now.Sub(*h.LastBackupAt)), staleBackupThreshold), now) + continue + } + // Always-on hosts: existing agent_offline re-evaluation. + if h.Status != "offline" || h.LastSeenAt == nil { + continue + } + if now.Sub(*h.LastSeenAt) >= e.agentOfflineFloor { + e.raiseAndNotify(ctx, h.ID, KindAgentOffline, "", "warning", + fmt.Sprintf("Agent offline for %s (threshold %s)", + roundDur(now.Sub(*h.LastSeenAt)), e.agentOfflineFloor), now) + } + } +``` + +Delete the trailing `// Stale-schedule sweep — no-op in v1.` comment at line 215. 
+ +- [ ] **Step 4: Add the `hostHasEnabledSchedule` helper** + +In `internal/alert/engine.go`, add at the end of the file: + +```go +// hostHasEnabledSchedule reports whether the host has at least one +// enabled backup schedule — the precondition for a stale_schedule +// alert (no schedule = no backup expectation to measure against). +func (e *Engine) hostHasEnabledSchedule(ctx context.Context, hostID string) (bool, error) { + schedules, err := e.store.ListSchedulesByHost(ctx, hostID) + if err != nil { + return false, err + } + for _, sc := range schedules { + if sc.Enabled { + return true, nil + } + } + return false, nil +} +``` + +- [ ] **Step 5: Resolve staleness on a successful backup** + +In `handleJobFinished` (line 146), the `case "succeeded":` currently resolves only the job-kind alert. For a successful backup, also clear any open stale_schedule. Change: + +```go + case "succeeded": + e.resolveAndNotify(ctx, ev.HostID, kind, dedupKey, ev.When) + } +``` + +to: + +```go + case "succeeded": + e.resolveAndNotify(ctx, ev.HostID, kind, dedupKey, ev.When) + if ev.Kind == "backup" { + // A fresh backup clears staleness for intermittent hosts. + e.resolveAndNotify(ctx, ev.HostID, KindStaleSchedule, "", ev.When) + } + } +``` + +- [ ] **Step 6: Add an exported mode-change resolve hook** + +The HTTP toggle handler (Task 5) needs to clear stale alerts when an operator changes a host's mode. Add to `internal/alert/rules.go` (after `Resolve`, around line 100): + +```go +// ResolveOnModeChange clears any open agent_offline and stale_schedule +// alerts for a host whose always-on flag was just toggled. The next +// 60s tick re-raises whichever still applies under the new mode, so +// this is a self-correcting "wipe and let the sweep settle" call. +// Safe to invoke from the HTTP layer (it only touches the store + hub). 
+func (e *Engine) ResolveOnModeChange(ctx context.Context, hostID string, when time.Time) { + e.resolveAndNotify(ctx, hostID, KindAgentOffline, "", when) + e.resolveAndNotify(ctx, hostID, KindStaleSchedule, "", when) +} +``` + +- [ ] **Step 7: Write the engine tests** + +Create `internal/alert/intermittent_test.go`. First inspect an existing engine test (e.g. grep `internal/alert/*_test.go` for how `NewEngine` is constructed with a test store + hub, and the helper that creates a host + schedule). Mirror those helpers. The tests to write: + +```go +package alert + +import ( + "context" + "testing" + "time" +) + +// Mirror the construction helpers used by the existing engine tests +// (newTestEngine / test store / host+schedule seeding). Replace the +// placeholder helpers below with the real ones from this package's +// existing _test.go files. + +func TestIntermittentHostSuppressesOfflineAlert(t *testing.T) { + ctx := context.Background() + e, st := newTestEngine(t) // mirror existing helper + + hostID := seedHost(t, st, false /* alwaysOn */) + // last seen well past the floor + touchHostSeen(t, st, hostID, time.Now().Add(-2*time.Hour)) + markHostOffline(t, st, hostID) + + e.handleHostOffline(ctx, hostID) + + if n := openAlertCount(t, st, hostID, KindAgentOffline); n != 0 { + t.Fatalf("intermittent host should not raise agent_offline, got %d", n) + } +} + +func TestAlwaysOnHostStillRaisesOfflineAlert(t *testing.T) { + ctx := context.Background() + e, st := newTestEngine(t) + + hostID := seedHost(t, st, true /* alwaysOn */) + touchHostSeen(t, st, hostID, time.Now().Add(-2*time.Hour)) + markHostOffline(t, st, hostID) + + e.handleHostOffline(ctx, hostID) + + if n := openAlertCount(t, st, hostID, KindAgentOffline); n != 1 { + t.Fatalf("always-on host should raise agent_offline, got %d", n) + } +} + +func TestStalenessAlertForIntermittentHost(t *testing.T) { + ctx := context.Background() + e, st := newTestEngine(t) + + hostID := seedHost(t, st, false) + 
seedEnabledSchedule(t, st, hostID) // "0 2 * * *" with a source group + setLastBackup(t, st, hostID, time.Now().Add(-8*24*time.Hour)) + + e.tick(ctx, time.Now().UTC()) + + if n := openAlertCount(t, st, hostID, KindStaleSchedule); n != 1 { + t.Fatalf("expected one stale_schedule alert, got %d", n) + } + + // A successful backup clears it. + e.handleJobFinished(ctx, JobFinishedEvent{ + HostID: hostID, JobID: "j1", Kind: "backup", + Status: "succeeded", When: time.Now().UTC(), + }) + if n := openAlertCount(t, st, hostID, KindStaleSchedule); n != 0 { + t.Fatalf("stale_schedule should resolve after backup, got %d", n) + } +} + +func TestNoStalenessWithoutEnabledSchedule(t *testing.T) { + ctx := context.Background() + e, st := newTestEngine(t) + + hostID := seedHost(t, st, false) + setLastBackup(t, st, hostID, time.Now().Add(-8*24*time.Hour)) + // no schedule seeded + + e.tick(ctx, time.Now().UTC()) + + if n := openAlertCount(t, st, hostID, KindStaleSchedule); n != 0 { + t.Fatalf("no schedule => no staleness alert, got %d", n) + } +} +``` + +> **Note for the implementer:** the `newTestEngine`, `seedHost`, `touchHostSeen`, `markHostOffline`, `openAlertCount`, `seedEnabledSchedule`, `setLastBackup` helpers must be replaced with the real equivalents in this package's existing tests. If a needed seeding helper doesn't exist, write it using the `store` methods directly (`CreateHost`, `SetHostAlwaysOn`, `CreateSchedule`, `SetHostLastBackup`, `MarkHostsOfflineStale`, `ListAlerts`). Do NOT invent store methods — all required ones exist as of Task 1. + +- [ ] **Step 8: Run the tests** + +Run: `go test ./internal/alert/ -v` +Expected: PASS for all four new tests plus the existing suite. + +- [ ] **Step 9: Commit** + +```bash +go vet ./internal/alert/... 
+git add internal/alert/engine.go internal/alert/rules.go internal/alert/intermittent_test.go +git commit -m "feat(alert): suppress offline + add staleness alert for intermittent hosts" +``` + +--- + +## Task 5: HTTP toggle handler + route + +**Files:** +- Modify: `internal/server/http/ui_handlers.go` (new handler near `handleUIHostTagsSave` at line 954) +- Modify: `internal/server/http/server.go:281` (route mount) + +- [ ] **Step 1: Add the handler** + +In `internal/server/http/ui_handlers.go`, after `handleUIHostTagsSave` (line 984), add: + +```go +// handleUIHostModeSave sets a host's always-on flag. Checkbox present +// in the form (any non-empty value) => always-on; absent => intermittent. +// Operator-band; mounted in server.go. On every save we clear open +// offline/staleness alerts via the engine so the next sweep re-raises +// only what still applies under the new mode. +func (s *Server) handleUIHostModeSave(w stdhttp.ResponseWriter, r *stdhttp.Request) { + u := s.requireUIUser(w, r) + if u == nil { + return + } + hostID := chi.URLParam(r, "id") + if _, err := s.deps.Store.GetHost(r.Context(), hostID); err != nil { + stdhttp.NotFound(w, r) + return + } + if err := r.ParseForm(); err != nil { + stdhttp.Error(w, "bad request", stdhttp.StatusBadRequest) + return + } + alwaysOn := r.PostForm.Get("always_on") != "" + if err := s.deps.Store.SetHostAlwaysOn(r.Context(), hostID, alwaysOn); err != nil { + slog.Error("ui host mode: save", "host_id", hostID, "err", err) + stdhttp.Error(w, "internal", stdhttp.StatusInternalServerError) + return + } + if s.deps.AlertEngine != nil { + s.deps.AlertEngine.ResolveOnModeChange(r.Context(), hostID, time.Now().UTC()) + } + _ = s.deps.Store.AppendAudit(r.Context(), store.AuditEntry{ + ID: ulid.Make().String(), UserID: &u.ID, Actor: "user", + Action: "host.mode_updated", + TargetKind: ptr("host"), TargetID: &hostID, + TS: time.Now().UTC(), + }) + stdhttp.Redirect(w, r, "/hosts/"+hostID, stdhttp.StatusSeeOther) +} +``` + +- [ ] **Step 2: 
Mount the route** + +In `internal/server/http/server.go`, next to the tags route (line 281): + +```go + r.Post("/hosts/{id}/tags", s.handleUIHostTagsSave) +``` + +add directly below: + +```go + r.Post("/hosts/{id}/mode", s.handleUIHostModeSave) +``` + +(Confirm it lands in the same operator-band route group as `/hosts/{id}/tags` — same indentation/block.) + +- [ ] **Step 3: Build and vet** + +Run: `go build ./... && go vet ./...` +Expected: clean. + +- [ ] **Step 4: Write a handler test** + +Add to the existing UI-handler test file (grep `internal/server/http/*_test.go` for the harness that builds a `Server` + does form POSTs against `/hosts/{id}/tags`; mirror it). The test posts to `/hosts/{id}/mode` with and without the `always_on` field and asserts the stored flag: + +```go +func TestHandleUIHostModeSave(t *testing.T) { + srv, st, sess := newUITestServer(t) // mirror tags-save test harness + hostID := seedHostForUI(t, st) // mirror existing host seeding + + // Uncheck: form without always_on => intermittent. + postForm(t, srv, sess, "/hosts/"+hostID+"/mode", map[string]string{}) + if h, _ := st.GetHost(context.Background(), hostID); h.AlwaysOn { + t.Fatalf("expected always_on=false after empty post") + } + + // Check: form with always_on=on => always-on. + postForm(t, srv, sess, "/hosts/"+hostID+"/mode", map[string]string{"always_on": "on"}) + if h, _ := st.GetHost(context.Background(), hostID); !h.AlwaysOn { + t.Fatalf("expected always_on=true after checked post") + } +} +``` + +> Replace `newUITestServer`/`seedHostForUI`/`postForm` with the real harness helpers from the existing UI handler tests. + +- [ ] **Step 5: Run the test** + +Run: `go test ./internal/server/http/ -run TestHandleUIHostModeSave -v` +Expected: PASS. 
+ +- [ ] **Step 6: Commit** + +```bash +git add internal/server/http/ui_handlers.go internal/server/http/server.go internal/server/http/*_test.go +git commit -m "feat(http): host mode toggle handler + route (host.mode_updated)" +``` + +--- + +## Task 6: UI — asleep state, 24×7 chip, mode toggle + +**Files:** +- Modify: `web/styles/input.css` (dot-asleep token) +- Modify: `web/templates/partials/host_row.html` +- Modify: `web/templates/partials/host_chrome.html` + +- [ ] **Step 1: Add the `dot-asleep` CSS token** + +In `web/styles/input.css`, find the `.dot-offline` definition (grep for `dot-offline`) and add a sibling `.dot-asleep` rule. Match the existing dot pattern; use a calm grey-blue distinct from offline's grey/red. Example (adapt colours to the file's existing tokens): + +```css +.dot-asleep { background: var(--ink-fade); opacity: 0.6; } +``` + +> Inspect the neighbouring `.dot-offline` / `.dot-degraded` rules first and follow their exact shape (size, border, etc.); only the colour/opacity should differ. + +- [ ] **Step 2: Rebuild CSS if the project precompiles it** + +Check the Makefile for a CSS build step (grep `css` in `Makefile`). If present, run it (e.g. `make css`). If the server serves `input.css` directly, skip. + +- [ ] **Step 3: Asleep dot + text in host_row.html** + +In `web/templates/partials/host_row.html`, change the status-dot block (lines 6-14). 
Replace the `{{- else if eq $h.Status "offline" -}}` dot branch: + +```html + {{- else if eq $h.Status "offline" -}} + +``` + +with: + +```html + {{- else if eq $h.Status "offline" -}} + {{- if $h.AlwaysOn -}} + + {{- else -}} + + {{- end -}} +``` + +Then change the last-seen text branch (lines 28-29): + +```html + {{- else if eq $h.Status "offline" -}} + last seen {{relTime $h.LastSeenAt}} +``` + +to: + +```html + {{- else if eq $h.Status "offline" -}} + {{- if $h.AlwaysOn -}} + last seen {{relTime $h.LastSeenAt}} + {{- else -}} + asleep · {{relTime $h.LastSeenAt}} · will catch up on return + {{- end -}} +``` + +And the row-action label (lines 55-56): + +```html + {{- if eq $h.Status "offline" -}} + offline +``` + +to: + +```html + {{- if eq $h.Status "offline" -}} + {{if $h.AlwaysOn}}offline{{else}}asleep{{end}} +``` + +- [ ] **Step 4: Asleep dot + last-seen in host_chrome.html** + +In `web/templates/partials/host_chrome.html`, change the offline dot branch (lines 36-37): + +```html + {{else if eq $host.Status "offline"}} + +``` + +to: + +```html + {{else if eq $host.Status "offline"}} + {{if $host.AlwaysOn}} + + {{else}} + + {{end}} +``` + +And the last-seen line (lines 90-94): + +```html + {{if eq $host.Status "offline"}} + last seen {{relTime $host.LastSeenAt}} + {{else}} + online · last heartbeat {{relTime $host.LastSeenAt}} + {{end}} +``` + +to: + +```html + {{if eq $host.Status "offline"}} + {{if $host.AlwaysOn}} + last seen {{relTime $host.LastSeenAt}} + {{else}} + asleep · last seen {{relTime $host.LastSeenAt}} · will catch up on return + {{end}} + {{else}} + online · last heartbeat {{relTime $host.LastSeenAt}} + {{end}} +``` + +- [ ] **Step 5: Add the 24×7 chip + mode toggle to host_chrome.html** + +In the header tags block (lines 42-48), after the tags `edit/add tags` button and before the closing `` at line 48, add the chip (shown only when always-on) and a small toggle button mirroring the tags-editor reveal pattern: + +```html + {{if 
$host.AlwaysOn}}24×7{{end}} + +``` + +Then add the toggle form right after the tags `
` block (after line 82, before the `
` at line 83): + +```html + {{/* Presence-mode editor — hidden by default; toggled by the + "presence" button. Checkbox present => always-on (24×7); + unchecked => intermittent (laptop): no offline alerts, shows + "asleep", auto-catches-up a missed backup on reconnect. */}} + + +
+ Uncheck for an intermittent host (laptop/workstation): it won’t + raise offline alerts when asleep, shows an “asleep” state, and + catches up a missed backup ~1 minute after it reconnects. +
+ + +``` + +- [ ] **Step 6: Verify templates parse** + +Run: `go build ./... && go test ./internal/server/... -run Template -v` (if a template-render test exists; otherwise rely on the smoke run in Step 7). At minimum: `go build ./...` must pass. + +- [ ] **Step 7: Manual smoke (per CLAUDE.md smoke targets)** + +```bash +make smoke-deploy +``` + +Then in a browser (or Playwright): open the dashboard and a host detail page. Toggle a host to intermittent via the "presence" control, confirm the 24×7 chip disappears, and confirm an offline/sleeping intermittent host renders the grey "asleep · … · will catch up on return" line instead of red "offline". Toggle back and confirm the chip returns. + +- [ ] **Step 8: Commit** + +```bash +git add web/styles/input.css web/templates/partials/host_row.html web/templates/partials/host_chrome.html +git commit -m "feat(ui): asleep state, 24×7 chip, presence toggle for host mode" +``` + +--- + +## Task 7: Record in tasks.md + final verification + +**Files:** +- Modify: `tasks.md` + +- [ ] **Step 1: Add a tasks.md entry** + +Add a `[x]` entry under "Next steps from testing" in `tasks.md` (mirroring the NS-07 style — one line + a short "As shipped" note) describing the always-on/intermittent host mode: `always_on` column (default on), offline-alert suppression + 7-day staleness alert for intermittent hosts, settle-then-catch-up on reconnect, and the asleep UI + 24×7 chip + presence toggle. + +- [ ] **Step 2: Full verification** + +```bash +go vet ./... +go test ./... +``` + +Expected: vet clean, all tests green. + +- [ ] **Step 3: Commit** + +```bash +git add tasks.md +git commit -m "docs(tasks): record always-on/intermittent host mode" +``` + +--- + +## Self-Review notes + +- **Spec coverage:** §1 data model → Task 1. §2 mechanics unchanged → no task needed (verified untouched). §3 alerts (suppress offline, staleness, resolve-on-backup, resolve-on-toggle) → Task 4 + Task 5 Step 1. 
§4 catch-up (arm on hello, settle, per-schedule overdue, dispatch, guards) → Tasks 2-3. §5 UI (dot-asleep, asleep text, 24×7 chip, toggle) → Task 6. Testing → tests in Tasks 1-5. Out-of-scope items respected (global 7d const, reconnect-only, no agent-side cron, always-on stale_schedule untouched). +- **Type consistency:** `scheduleOverdue(cronExpr string, *time.Time, time.Time) bool`, `ArmCatchup(hostID string, now time.Time)`, `RunCatchupsDue(ctx)`, `SetHostAlwaysOn(ctx, hostID, bool)`, `ResolveOnModeChange(ctx, hostID, when)`, and `Host.AlwaysOn bool` are used with the same names and shapes across tasks. +- **No invented store methods:** every existing call the plan relies on (GetHost, ListSchedulesByHost, GetSourceGroup, SetHostLastBackup, ListAlerts, AppendAudit, plus the non-store `dispatchBackupForGroupCore` and `Hub.Conn`/`Connected`) exists in the current tree; `SetHostAlwaysOn` is the only new store method and is defined in Task 1. +- **Test helper caveat:** the alert and HTTP handler tests reference package-local helpers (`newTestEngine`, `newUITestServer`, etc.) whose names must be matched to the real ones in the existing `_test.go` files at implementation time; this is flagged inline in each task.
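
The offline-versus-asleep branching that Task 6 (Steps 3-4) adds to the partials can be checked in isolation before editing the real templates. The sketch below is an illustration only, under stated assumptions: `hostView`, `lastSeenTmpl`, and `renderLastSeen` are hypothetical names, all markup is omitted, and `relTime` is stubbed to a fixed string rather than the server's real helper.

```go
package main

import (
	"fmt"
	"html/template"
	"strings"
	"time"
)

// hostView is a hypothetical stand-in for the few fields the partials read;
// it is not the real store.Host type.
type hostView struct {
	Status     string
	AlwaysOn   bool
	LastSeenAt time.Time
}

// lastSeenTmpl mirrors the last-seen branching from Task 6 Step 4, with the
// surrounding markup left out.
const lastSeenTmpl = `{{if eq .Status "offline"}}` +
	`{{if .AlwaysOn}}last seen {{relTime .LastSeenAt}}` +
	`{{else}}asleep · last seen {{relTime .LastSeenAt}} · will catch up on return{{end}}` +
	`{{else}}online · last heartbeat {{relTime .LastSeenAt}}{{end}}`

// renderLastSeen executes the template with a stubbed relTime so the output
// is deterministic.
func renderLastSeen(h hostView) (string, error) {
	funcs := template.FuncMap{
		// Assumption: the real helper formats a relative duration; this
		// stub always returns the same string.
		"relTime": func(time.Time) string { return "2h ago" },
	}
	t, err := template.New("lastSeen").Funcs(funcs).Parse(lastSeenTmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, h); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	cases := []hostView{
		{Status: "offline", AlwaysOn: true},  // always-on host that is down
		{Status: "offline", AlwaysOn: false}, // intermittent host that is asleep
		{Status: "online", AlwaysOn: false},  // any connected host
	}
	for _, h := range cases {
		out, err := renderLastSeen(h)
		if err != nil {
			panic(err)
		}
		fmt.Println(out)
	}
}
```

Running it prints one line per case. If the project has no template-render test yet, the same function shape could back a small table-driven test alongside the Step 6 build check.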