# Always-On vs Intermittent Host Mode — Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Let an operator mark a host as not-always-on so it stops raising offline alerts when it legitimately sleeps, renders a calm "asleep" state, auto-catches-up a missed backup ~1 minute after it reconnects, and still raises a long-threshold staleness alert if it goes too long with no backup.
**Architecture:** A thin policy + presentation layer over the existing online/offline state machine. A new `hosts.always_on` boolean (default 1 = today's behaviour) gates three behaviours: offline-alert suppression + a 7-day staleness alert in the alert engine; an in-memory catch-up scheduler in the HTTP server armed on agent hello and fired from the existing 30s tick; and an "asleep" UI state plus a 24×7 chip. Online/offline tracking, heartbeat, and `pending_runs` are untouched.
**Tech Stack:** Go, SQLite (modernc), `github.com/robfig/cron/v3` (already a dependency), Go `html/template`, Tailwind-in-`input.css`.
**Spec:** `docs/specs/2026-06-15-always-on-host-mode-design.md`
---
## File Structure
- **Create** `internal/store/migrations/0024_hosts_always_on.sql` — add the column.
- **Modify** `internal/store/types.go` — add `Host.AlwaysOn bool`.
- **Modify** `internal/store/hosts.go` — add `always_on` to the 3 host SELECTs + `scanHostRow`; add `SetHostAlwaysOn`.
- **Create** `internal/store/hosts_always_on_test.go` — round-trip + default test.
- **Modify** `internal/alert/engine.go` — suppress offline for intermittent hosts; staleness sweep; resolve staleness on backup success.
- **Modify** `internal/alert/rules.go` — exported `ResolveOnModeChange` helper for the toggle handler; staleness threshold constant.
- **Create** `internal/alert/intermittent_test.go` — suppression + staleness + resolve tests.
- **Create** `internal/server/http/catchup.go` — overdue helper + in-memory catch-up scheduler.
- **Create** `internal/server/http/catchup_test.go` — overdue table tests.
- **Modify** `internal/server/http/server.go` — catch-up map fields on `Server`, init in `New`.
- **Modify** `internal/server/http/host_credentials.go` — arm catch-up in `onAgentHello`.
- **Modify** `cmd/server/main.go` — call `srv.RunCatchupsDue` on the pending-drain tick.
- **Modify** `internal/server/http/ui_handlers.go` — add `handleUIHostModeSave` handler.
- **Modify** `internal/server/http/server.go` (routes) — mount `POST /hosts/{id}/mode`.
- **Modify** `web/styles/input.css` — add `dot-asleep` token.
- **Modify** `web/templates/partials/host_row.html` — asleep dot + text.
- **Modify** `web/templates/partials/host_chrome.html` — asleep dot/last-seen, 24×7 chip, mode toggle form.
- **Modify** `tasks.md` — record the feature.
---
## Task 1: Schema + store field for `always_on`
**Files:**
- Create: `internal/store/migrations/0024_hosts_always_on.sql`
- Modify: `internal/store/types.go:62-102` (Host struct)
- Modify: `internal/store/hosts.go` (3 SELECTs at lines 41-48, 56-63, 224-231; `scanHostRow` at 261-334)
- Test: `internal/store/hosts_always_on_test.go`
- [ ] **Step 1: Write the migration**
Create `internal/store/migrations/0024_hosts_always_on.sql`:
```sql
-- 0024: distinguish always-on (24x7 server) hosts from intermittent
-- hosts (laptops/workstations that legitimately sleep). Default 1 so
-- every existing and future host keeps today's offline/alert
-- semantics unless explicitly opted out. Column-level ALTER per the
-- repo's migration rules (no table rebuild — hosts has inbound FKs).
ALTER TABLE hosts ADD COLUMN always_on INTEGER NOT NULL DEFAULT 1;
```
- [ ] **Step 2: Add the struct field**
In `internal/store/types.go`, add to the `Host` struct (after `RepoStatusError` at line 101):
```go
// AlwaysOn is true for 24x7 server hosts (the default). When false
// the host is intermittent (laptop/workstation): offline alerts are
// suppressed, the UI shows an "asleep" state, and a missed backup is
// caught up ~1 min after reconnect. See the always-on-host-mode spec.
AlwaysOn bool
```
- [ ] **Step 3: Thread `always_on` through reads**
In `internal/store/hosts.go`, append `, always_on` to the SELECT column list in all three queries: `LookupHostByAgentToken` (line 47), `GetHost` (line 62), and `ListHosts` (line 230). Each currently ends `repo_status, repo_status_error` — change to `repo_status, repo_status_error, always_on`.
Then in `scanHostRow` (line 261), scan the new column: declare `var alwaysOn int` in the var block, change the `Scan(...)` call's final args from `&h.RepoStatus, &h.RepoStatusError)` to `&h.RepoStatus, &h.RepoStatusError, &alwaysOn)`, and after the existing post-scan assignments add:
```go
h.AlwaysOn = alwaysOn != 0
```
(SQLite stores the boolean as INTEGER; scan into int then compare to avoid driver bool-coercion surprises.)
- [ ] **Step 4: Add `SetHostAlwaysOn`**
In `internal/store/hosts.go`, after `SetHostTags` (line 379), add:
```go
// SetHostAlwaysOn flips the host's always-on flag. true = 24x7 server
// (default); false = intermittent host (laptop). See the
// always-on-host-mode spec.
func (s *Store) SetHostAlwaysOn(ctx context.Context, hostID string, alwaysOn bool) error {
	v := 0
	if alwaysOn {
		v = 1
	}
	_, err := s.db.ExecContext(ctx,
		`UPDATE hosts SET always_on = ? WHERE id = ?`, v, hostID)
	if err != nil {
		return fmt.Errorf("store: set host always_on: %w", err)
	}
	return nil
}
```
- [ ] **Step 5: Write the round-trip test**
Create `internal/store/hosts_always_on_test.go`. Use the existing test harness pattern — check a sibling test (e.g. `internal/store/hosts_test.go`) for the `newTestStore`/`testStore` helper name and the host-creation helper, and mirror it exactly. The test body:
```go
package store
import (
"context"
"testing"
"time"
)
func TestHostAlwaysOnDefaultAndToggle(t *testing.T) {
ctx := context.Background()
st := newTestStore(t) // mirror the helper used by hosts_test.go
h := Host{
ID: "h-always-on", Name: "lap", OS: "linux", Arch: "amd64",
ProtocolVersion: 1, EnrolledAt: time.Now().UTC(),
}
if err := st.CreateHost(ctx, h, "tok-hash", "pin"); err != nil {
t.Fatalf("create host: %v", err)
}
got, err := st.GetHost(ctx, h.ID)
if err != nil {
t.Fatalf("get host: %v", err)
}
if !got.AlwaysOn {
t.Fatalf("new host should default to always_on=true, got false")
}
if err := st.SetHostAlwaysOn(ctx, h.ID, false); err != nil {
t.Fatalf("set always_on: %v", err)
}
got, err = st.GetHost(ctx, h.ID)
if err != nil {
t.Fatalf("get host 2: %v", err)
}
if got.AlwaysOn {
t.Fatalf("expected always_on=false after toggle, got true")
}
// ListHosts must surface the same value.
hosts, err := st.ListHosts(ctx)
if err != nil {
t.Fatalf("list hosts: %v", err)
}
if len(hosts) != 1 || hosts[0].AlwaysOn {
t.Fatalf("ListHosts should report always_on=false, got %+v", hosts)
}
}
```
- [ ] **Step 6: Run the test (expect FAIL first if written before code, else PASS)**
Run: `go test ./internal/store/ -run TestHostAlwaysOnDefaultAndToggle -v`
Expected: PASS once Steps 1-4 are in. If you wrote the test first, it fails to compile on `AlwaysOn` / `SetHostAlwaysOn` — that is the expected red.
- [ ] **Step 7: Commit**
```bash
go vet ./internal/store/...
git add internal/store/migrations/0024_hosts_always_on.sql internal/store/types.go internal/store/hosts.go internal/store/hosts_always_on_test.go
git commit -m "feat(store): add hosts.always_on flag (default on)"
```
---
## Task 2: Overdue computation helper
This is a pure function so it can be unit-tested in isolation before the scheduler wires it up. It lives in the new `catchup.go` (the scheduler will follow in Task 3, same file).
**Files:**
- Create: `internal/server/http/catchup.go`
- Test: `internal/server/http/catchup_test.go`
- [ ] **Step 1: Write the failing test**
Create `internal/server/http/catchup_test.go`:
```go
package http
import (
"testing"
"time"
)
func TestScheduleOverdue(t *testing.T) {
mustParse := func(s string) time.Time {
t.Helper()
v, err := time.Parse(time.RFC3339, s)
if err != nil {
t.Fatalf("parse %q: %v", s, err)
}
return v
}
daily := "0 2 * * *" // 02:00 every day
cases := []struct {
name string
cron string
lastBackup *time.Time
now time.Time
want bool
}{
{
name: "never backed up is overdue",
cron: daily, lastBackup: nil,
now: mustParse("2026-06-15T09:00:00Z"),
want: true,
},
{
name: "missed last night's window",
cron: daily,
lastBackup: ptrTime(mustParse("2026-06-13T02:05:00Z")),
now: mustParse("2026-06-15T09:00:00Z"),
want: true,
},
{
name: "backed up after the most recent window",
cron: daily,
lastBackup: ptrTime(mustParse("2026-06-15T02:05:00Z")),
now: mustParse("2026-06-15T09:00:00Z"),
want: false,
},
{
name: "unparseable cron is never overdue",
cron: "not a cron",
lastBackup: nil,
now: mustParse("2026-06-15T09:00:00Z"),
want: false,
},
}
for _, c := range cases {
t.Run(c.name, func(t *testing.T) {
got := scheduleOverdue(c.cron, c.lastBackup, c.now)
if got != c.want {
t.Fatalf("scheduleOverdue(%q, %v, %v) = %v, want %v",
c.cron, c.lastBackup, c.now, got, c.want)
}
})
}
}
func ptrTime(t time.Time) *time.Time { return &t }
```
- [ ] **Step 2: Run the test to verify it fails**
Run: `go test ./internal/server/http/ -run TestScheduleOverdue -v`
Expected: FAIL — `undefined: scheduleOverdue`.
- [ ] **Step 3: Implement `scheduleOverdue`**
Create `internal/server/http/catchup.go` with the helper (the scheduler methods are added in Task 3):
```go
// catchup.go — server-side catch-up for intermittent (non-always-on)
// hosts. When such a host reconnects we wait a short settle window,
// then dispatch a backup for any schedule whose window elapsed while
// the host was asleep. This is separate from pending_runs: a host that
// was asleep never fired its local cron, so no pending row exists.
package http
import (
"time"
)
// scheduleOverdue reports whether a schedule's most recent expected
// fire is newer than the host's last successful backup — i.e. a window
// passed with no backup. A nil lastBackup means "never backed up" and
// is always overdue (provided the cron parses). An unparseable cron is
// treated as not-overdue so a bad expression can never trigger a
// surprise dispatch. Uses the same cronParser the agent's scheduler
// and schedule validation use, so interpretation is identical.
func scheduleOverdue(cronExpr string, lastBackup *time.Time, now time.Time) bool {
sched, err := cronParser.Parse(cronExpr)
if err != nil {
return false
}
if lastBackup == nil {
return true
}
next := sched.Next(*lastBackup)
return !next.After(now)
}
```
- [ ] **Step 4: Run the test to verify it passes**
Run: `go test ./internal/server/http/ -run TestScheduleOverdue -v`
Expected: PASS (all four sub-cases).
- [ ] **Step 5: Commit**
```bash
go vet ./internal/server/http/...
git add internal/server/http/catchup.go internal/server/http/catchup_test.go
git commit -m "feat(catchup): scheduleOverdue helper for missed-window detection"
```
---
## Task 3: Catch-up scheduler (arm on hello, fire on tick)
**Files:**
- Modify: `internal/server/http/server.go:68-93` (Server struct), `:96-112` (New)
- Modify: `internal/server/http/catchup.go` (add scheduler methods)
- Modify: `internal/server/http/host_credentials.go:463-486` (onAgentHello)
- Modify: `cmd/server/main.go:228-229` (pending-drain tick case)
- [ ] **Step 1: Add catch-up state to the Server struct**
In `internal/server/http/server.go`, add fields to `Server` (after `treeCache` at line 92):
```go
// catchupDueAt tracks intermittent hosts that reconnected and are
// in their settle window. Keyed hostID → earliest time to evaluate
// catch-up. Best-effort + in-memory: a server restart simply re-arms
// on the next hello. Guarded by catchupMu.
catchupMu sync.Mutex
catchupDueAt map[string]time.Time
```
Add `"sync"` and `"time"` to the imports if not already present (check the import block).
- [ ] **Step 2: Initialise the map in New**
In `New` (line 106), add to the `&Server{...}` literal:
```go
catchupDueAt: make(map[string]time.Time),
```
- [ ] **Step 3: Add scheduler methods to catchup.go**
Append to `internal/server/http/catchup.go`. Add `"context"`, `"log/slog"` to its imports:
```go
// catchupSettle is how long after a reconnect we wait before evaluating
// catch-up, so a laptop that wakes briefly and sleeps again doesn't
// trigger a backup it can't finish. ~1 minute per the spec.
const catchupSettle = 60 * time.Second
// ArmCatchup records that an intermittent host just reconnected and
// should be evaluated for a missed backup after the settle window.
// No-op for always-on hosts (caller passes only intermittent hosts).
// Re-arming overwrites the timer (debounce — flapping doesn't stack).
func (s *Server) ArmCatchup(hostID string, now time.Time) {
s.catchupMu.Lock()
defer s.catchupMu.Unlock()
if s.catchupDueAt == nil {
s.catchupDueAt = make(map[string]time.Time)
}
s.catchupDueAt[hostID] = now.Add(catchupSettle)
}
// dueCatchups returns the hostIDs whose settle window has elapsed and
// removes them from the map. Caller evaluates each.
func (s *Server) dueCatchups(now time.Time) []string {
s.catchupMu.Lock()
defer s.catchupMu.Unlock()
var due []string
for id, at := range s.catchupDueAt {
if !now.Before(at) {
due = append(due, id)
delete(s.catchupDueAt, id)
}
}
return due
}
// RunCatchupsDue is the tick entrypoint. For each host past its settle
// window it dispatches a backup for every enabled schedule that is
// overdue. Skips hosts that bounced back offline, that are already
// running/queued a job, or that turned out to be always-on.
func (s *Server) RunCatchupsDue(ctx context.Context) {
if s.deps.Hub == nil {
return
}
now := time.Now().UTC()
for _, hostID := range s.dueCatchups(now) {
s.runCatchup(ctx, hostID, now)
}
}
// runCatchup evaluates and dispatches catch-up backups for a single
// host. Split out as a helper so RunCatchupsDue reads cleanly.
func (s *Server) runCatchup(ctx context.Context, hostID string, now time.Time) {
conn := s.deps.Hub.Conn(hostID)
if conn == nil {
return // bounced offline during the settle window; re-arms on next hello
}
host, err := s.deps.Store.GetHost(ctx, hostID)
if err != nil {
slog.Warn("catchup: load host", "host_id", hostID, "err", err)
return
}
if host.AlwaysOn {
return // mode flipped during settle window
}
if host.CurrentJobID != nil {
return // a job is already running; don't pile on
}
schedules, err := s.deps.Store.ListSchedulesByHost(ctx, hostID)
if err != nil {
slog.Warn("catchup: list schedules", "host_id", hostID, "err", err)
return
}
for _, sc := range schedules {
if !sc.Enabled || len(sc.SourceGroupIDs) == 0 {
continue
}
if !scheduleOverdue(sc.CronExpr, host.LastBackupAt, now) {
continue
}
for _, gid := range sc.SourceGroupIDs {
g, err := s.deps.Store.GetSourceGroup(ctx, hostID, gid)
if err != nil {
slog.Warn("catchup: load source group",
"host_id", hostID, "schedule_id", sc.ID, "group_id", gid, "err", err)
continue
}
if _, derr := s.dispatchBackupForGroupCore(ctx, conn, hostID, sc.ID, g, now); derr != nil {
// Send failed — host dropped again. Re-arm so the next
// reconnect retries; stop processing this host.
s.ArmCatchup(hostID, now)
return
}
slog.Info("catchup: dispatched missed backup",
"host_id", hostID, "schedule_id", sc.ID, "group", g.Name)
}
}
}
```
- [ ] **Step 4: Arm catch-up on agent hello**
In `internal/server/http/host_credentials.go`, in `onAgentHello` (line 463), after the `go s.DrainPending(...)` line (485), add:
```go
// Intermittent hosts that just reconnected may have slept through a
// backup window. Arm a catch-up evaluation after a settle delay; the
// pending-drain tick fires it. Always-on hosts never need this.
if host, err := s.deps.Store.GetHost(ctx, hostID); err == nil && !host.AlwaysOn {
s.ArmCatchup(hostID, time.Now().UTC())
}
```
Verify `time` is already imported in this file (it is — used elsewhere). If not, add it.
- [ ] **Step 5: Fire catch-up from the pending-drain tick**
In `cmd/server/main.go`, in the `case <-pendingDrainTick.C:` block (line 228), change:
```go
case <-pendingDrainTick.C:
srv.DrainAllDue(ctx)
```
to:
```go
case <-pendingDrainTick.C:
srv.DrainAllDue(ctx)
srv.RunCatchupsDue(ctx)
```
- [ ] **Step 6: Build and vet**
Run: `go build ./... && go vet ./...`
Expected: clean build, no vet errors.
- [ ] **Step 7: Commit**
```bash
git add internal/server/http/server.go internal/server/http/catchup.go internal/server/http/host_credentials.go cmd/server/main.go
git commit -m "feat(catchup): arm on hello, fire missed-window backups on tick"
```
---
## Task 4: Alert engine — suppress offline + staleness alert
**Files:**
- Modify: `internal/alert/engine.go:121-153` (handleJobFinished), `:155-174` (handleHostOffline), `:188-216` (tick)
- Modify: `internal/alert/rules.go:13-39` (constants), add exported resolve helper
- Test: `internal/alert/intermittent_test.go`
- [ ] **Step 1: Add the staleness threshold constant**
In `internal/alert/engine.go`, add near the top of the file (after imports, before `JobFinishedEvent`):
```go
// staleBackupThreshold is how long an intermittent host may go without
// a successful backup before we raise a stale_schedule alert. Global
// constant for v1 (may become per-host later). Only intermittent hosts
// are evaluated — always-on hosts' stale_schedule stays a no-op.
const staleBackupThreshold = 7 * 24 * time.Hour
```
- [ ] **Step 2: Suppress the offline alert for intermittent hosts**
In `handleHostOffline` (line 155), after loading the host and before the existing `if host.LastSeenAt == nil { return }` guard, add a mode check. Change:
```go
if host.LastSeenAt == nil {
return
}
if time.Since(*host.LastSeenAt) < e.agentOfflineFloor {
return
}
```
to:
```go
// Intermittent hosts (laptops) legitimately disappear — never raise
// agent_offline for them. The stale_schedule sweep in tick() is the
// only staleness signal for these hosts.
if !host.AlwaysOn {
return
}
if host.LastSeenAt == nil {
return
}
if time.Since(*host.LastSeenAt) < e.agentOfflineFloor {
return
}
```
- [ ] **Step 3: Suppress offline + add staleness in the tick sweep**
In `tick` (line 188), the host loop currently raises agent_offline for every offline host. Replace the whole loop (lines 205-214) with:
```go
for _, h := range hosts {
// Intermittent hosts: suppress agent_offline entirely; instead
// raise stale_schedule when they have gone too long with no
// successful backup AND they have at least one enabled schedule
// to be measured against. A nil LastBackupAt (never backed up)
// has no baseline — onboarding/repo_status covers that case.
if !h.AlwaysOn {
if h.LastBackupAt == nil {
continue
}
if now.Sub(*h.LastBackupAt) < staleBackupThreshold {
continue
}
hasEnabled, err := e.hostHasEnabledSchedule(ctx, h.ID)
if err != nil || !hasEnabled {
continue
}
e.raiseAndNotify(ctx, h.ID, KindStaleSchedule, "", "warning",
fmt.Sprintf("No backup in %s (threshold %s)",
roundDur(now.Sub(*h.LastBackupAt)), staleBackupThreshold), now)
continue
}
// Always-on hosts: existing agent_offline re-evaluation.
if h.Status != "offline" || h.LastSeenAt == nil {
continue
}
if now.Sub(*h.LastSeenAt) >= e.agentOfflineFloor {
e.raiseAndNotify(ctx, h.ID, KindAgentOffline, "", "warning",
fmt.Sprintf("Agent offline for %s (threshold %s)",
roundDur(now.Sub(*h.LastSeenAt)), e.agentOfflineFloor), now)
}
}
```
Delete the trailing `// Stale-schedule sweep — no-op in v1.` comment at line 215.
- [ ] **Step 4: Add the `hostHasEnabledSchedule` helper**
In `internal/alert/engine.go`, add at the end of the file:
```go
// hostHasEnabledSchedule reports whether the host has at least one
// enabled backup schedule — the precondition for a stale_schedule
// alert (no schedule = no backup expectation to measure against).
func (e *Engine) hostHasEnabledSchedule(ctx context.Context, hostID string) (bool, error) {
schedules, err := e.store.ListSchedulesByHost(ctx, hostID)
if err != nil {
return false, err
}
for _, sc := range schedules {
if sc.Enabled {
return true, nil
}
}
return false, nil
}
```
- [ ] **Step 5: Resolve staleness on a successful backup**
In `handleJobFinished` (line 146), the `case "succeeded":` currently resolves only the job-kind alert. For a successful backup, also clear any open stale_schedule. Change:
```go
case "succeeded":
e.resolveAndNotify(ctx, ev.HostID, kind, dedupKey, ev.When)
}
```
to:
```go
case "succeeded":
e.resolveAndNotify(ctx, ev.HostID, kind, dedupKey, ev.When)
if ev.Kind == "backup" {
// A fresh backup clears staleness for intermittent hosts.
e.resolveAndNotify(ctx, ev.HostID, KindStaleSchedule, "", ev.When)
}
}
```
- [ ] **Step 6: Add an exported mode-change resolve hook**
The HTTP toggle handler (Task 5) needs to clear stale alerts when an operator changes a host's mode. Add to `internal/alert/rules.go` (after `Resolve`, around line 100):
```go
// ResolveOnModeChange clears any open agent_offline and stale_schedule
// alerts for a host whose always-on flag was just toggled. The next
// 60s tick re-raises whichever still applies under the new mode, so
// this is a self-correcting "wipe and let the sweep settle" call.
// Safe to invoke from the HTTP layer (it only touches the store + hub).
func (e *Engine) ResolveOnModeChange(ctx context.Context, hostID string, when time.Time) {
e.resolveAndNotify(ctx, hostID, KindAgentOffline, "", when)
e.resolveAndNotify(ctx, hostID, KindStaleSchedule, "", when)
}
```
- [ ] **Step 7: Write the engine tests**
Create `internal/alert/intermittent_test.go`. First inspect an existing engine test (e.g. grep `internal/alert/*_test.go` for how `NewEngine` is constructed with a test store + hub, and the helper that creates a host + schedule). Mirror those helpers. The tests to write:
```go
package alert
import (
"context"
"testing"
"time"
)
// Mirror the construction helpers used by the existing engine tests
// (newTestEngine / test store / host+schedule seeding). Replace the
// placeholder helpers below with the real ones from this package's
// existing _test.go files.
func TestIntermittentHostSuppressesOfflineAlert(t *testing.T) {
ctx := context.Background()
e, st := newTestEngine(t) // mirror existing helper
hostID := seedHost(t, st, false /* alwaysOn */)
// last seen well past the floor
touchHostSeen(t, st, hostID, time.Now().Add(-2*time.Hour))
markHostOffline(t, st, hostID)
e.handleHostOffline(ctx, hostID)
if n := openAlertCount(t, st, hostID, KindAgentOffline); n != 0 {
t.Fatalf("intermittent host should not raise agent_offline, got %d", n)
}
}
func TestAlwaysOnHostStillRaisesOfflineAlert(t *testing.T) {
ctx := context.Background()
e, st := newTestEngine(t)
hostID := seedHost(t, st, true /* alwaysOn */)
touchHostSeen(t, st, hostID, time.Now().Add(-2*time.Hour))
markHostOffline(t, st, hostID)
e.handleHostOffline(ctx, hostID)
if n := openAlertCount(t, st, hostID, KindAgentOffline); n != 1 {
t.Fatalf("always-on host should raise agent_offline, got %d", n)
}
}
func TestStalenessAlertForIntermittentHost(t *testing.T) {
ctx := context.Background()
e, st := newTestEngine(t)
hostID := seedHost(t, st, false)
seedEnabledSchedule(t, st, hostID) // "0 2 * * *" with a source group
setLastBackup(t, st, hostID, time.Now().Add(-8*24*time.Hour))
e.tick(ctx, time.Now().UTC())
if n := openAlertCount(t, st, hostID, KindStaleSchedule); n != 1 {
t.Fatalf("expected one stale_schedule alert, got %d", n)
}
// A successful backup clears it.
e.handleJobFinished(ctx, JobFinishedEvent{
HostID: hostID, JobID: "j1", Kind: "backup",
Status: "succeeded", When: time.Now().UTC(),
})
if n := openAlertCount(t, st, hostID, KindStaleSchedule); n != 0 {
t.Fatalf("stale_schedule should resolve after backup, got %d", n)
}
}
func TestNoStalenessWithoutEnabledSchedule(t *testing.T) {
ctx := context.Background()
e, st := newTestEngine(t)
hostID := seedHost(t, st, false)
setLastBackup(t, st, hostID, time.Now().Add(-8*24*time.Hour))
// no schedule seeded
e.tick(ctx, time.Now().UTC())
if n := openAlertCount(t, st, hostID, KindStaleSchedule); n != 0 {
t.Fatalf("no schedule => no staleness alert, got %d", n)
}
}
```
> **Note for the implementer:** the `newTestEngine`, `seedHost`, `touchHostSeen`, `markHostOffline`, `openAlertCount`, `seedEnabledSchedule`, `setLastBackup` helpers must be replaced with the real equivalents in this package's existing tests. If a needed seeding helper doesn't exist, write it using the `store` methods directly (`CreateHost`, `SetHostAlwaysOn`, `CreateSchedule`, `SetHostLastBackup`, `MarkHostsOfflineStale`, `ListAlerts`). Do NOT invent store methods — all required ones exist as of Task 1.
- [ ] **Step 8: Run the tests**
Run: `go test ./internal/alert/ -v`
Expected: PASS for all four new tests plus the existing suite.
- [ ] **Step 9: Commit**
```bash
go vet ./internal/alert/...
git add internal/alert/engine.go internal/alert/rules.go internal/alert/intermittent_test.go
git commit -m "feat(alert): suppress offline + add staleness alert for intermittent hosts"
```
---
## Task 5: HTTP toggle handler + route
**Files:**
- Modify: `internal/server/http/ui_handlers.go` (new handler near `handleUIHostTagsSave` at line 954)
- Modify: `internal/server/http/server.go:281` (route mount)
- [ ] **Step 1: Add the handler**
In `internal/server/http/ui_handlers.go`, after `handleUIHostTagsSave` (line 984), add:
```go
// handleUIHostModeSave flips a host's always-on flag. Checkbox present
// in the form (any non-empty value) => always-on; absent => intermittent.
// Operator-band; mounted in server.go. On change we clear open
// offline/staleness alerts via the engine so the next sweep re-raises
// only what still applies under the new mode.
func (s *Server) handleUIHostModeSave(w stdhttp.ResponseWriter, r *stdhttp.Request) {
u := s.requireUIUser(w, r)
if u == nil {
return
}
hostID := chi.URLParam(r, "id")
if _, err := s.deps.Store.GetHost(r.Context(), hostID); err != nil {
stdhttp.NotFound(w, r)
return
}
if err := r.ParseForm(); err != nil {
stdhttp.Error(w, "bad request", stdhttp.StatusBadRequest)
return
}
alwaysOn := r.PostForm.Get("always_on") != ""
if err := s.deps.Store.SetHostAlwaysOn(r.Context(), hostID, alwaysOn); err != nil {
slog.Error("ui host mode: save", "host_id", hostID, "err", err)
stdhttp.Error(w, "internal", stdhttp.StatusInternalServerError)
return
}
if s.deps.AlertEngine != nil {
s.deps.AlertEngine.ResolveOnModeChange(r.Context(), hostID, time.Now().UTC())
}
_ = s.deps.Store.AppendAudit(r.Context(), store.AuditEntry{
ID: ulid.Make().String(), UserID: &u.ID, Actor: "user",
Action: "host.mode_updated",
TargetKind: ptr("host"), TargetID: &hostID,
TS: time.Now().UTC(),
})
stdhttp.Redirect(w, r, "/hosts/"+hostID, stdhttp.StatusSeeOther)
}
```
- [ ] **Step 2: Mount the route**
In `internal/server/http/server.go`, next to the tags route (line 281):
```go
r.Post("/hosts/{id}/tags", s.handleUIHostTagsSave)
```
add directly below:
```go
r.Post("/hosts/{id}/mode", s.handleUIHostModeSave)
```
(Confirm it lands in the same operator-band route group as `/hosts/{id}/tags` — same indentation/block.)
- [ ] **Step 3: Build and vet**
Run: `go build ./... && go vet ./...`
Expected: clean.
- [ ] **Step 4: Write a handler test**
Add to the existing UI-handler test file (grep `internal/server/http/*_test.go` for the harness that builds a `Server` + does form POSTs against `/hosts/{id}/tags`; mirror it). The test posts to `/hosts/{id}/mode` with and without the `always_on` field and asserts the stored flag:
```go
func TestHandleUIHostModeSave(t *testing.T) {
	ctx := context.Background()
	srv, st, sess := newUITestServer(t) // mirror tags-save test harness
	hostID := seedHostForUI(t, st)      // mirror existing host seeding
	// Uncheck: form without always_on => intermittent.
	postForm(t, srv, sess, "/hosts/"+hostID+"/mode", map[string]string{})
	h, err := st.GetHost(ctx, hostID)
	if err != nil {
		t.Fatalf("get host: %v", err)
	}
	if h.AlwaysOn {
		t.Fatalf("expected always_on=false after empty post")
	}
	// Check: form with always_on=on => always-on.
	postForm(t, srv, sess, "/hosts/"+hostID+"/mode", map[string]string{"always_on": "on"})
	h, err = st.GetHost(ctx, hostID)
	if err != nil {
		t.Fatalf("get host 2: %v", err)
	}
	if !h.AlwaysOn {
		t.Fatalf("expected always_on=true after checked post")
	}
}
```
> Replace `newUITestServer`/`seedHostForUI`/`postForm` with the real harness helpers from the existing UI handler tests.
- [ ] **Step 5: Run the test**
Run: `go test ./internal/server/http/ -run TestHandleUIHostModeSave -v`
Expected: PASS.
- [ ] **Step 6: Commit**
```bash
git add internal/server/http/ui_handlers.go internal/server/http/server.go internal/server/http/*_test.go
git commit -m "feat(http): host mode toggle handler + route (host.mode_updated)"
```
---
## Task 6: UI — asleep state, 24×7 chip, mode toggle
**Files:**
- Modify: `web/styles/input.css` (dot-asleep token)
- Modify: `web/templates/partials/host_row.html`
- Modify: `web/templates/partials/host_chrome.html`
- [ ] **Step 1: Add the `dot-asleep` CSS token**
In `web/styles/input.css`, find the `.dot-offline` definition (grep for `dot-offline`) and add a sibling `.dot-asleep` rule. Match the existing dot pattern; use a calm grey-blue distinct from offline's grey/red. Example (adapt colours to the file's existing tokens):
```css
.dot-asleep { background: var(--ink-fade); opacity: 0.6; }
```
> Inspect the neighbouring `.dot-offline` / `.dot-degraded` rules first and follow their exact shape (size, border, etc.); only the colour/opacity should differ.
- [ ] **Step 2: Rebuild CSS if the project precompiles it**
Check the Makefile for a CSS build step (grep `css` in `Makefile`). If present, run it (e.g. `make css`). If the server serves `input.css` directly, skip.
- [ ] **Step 3: Asleep dot + text in host_row.html**
In `web/templates/partials/host_row.html`, change the status-dot block (lines 6-14). Replace the `{{- else if eq $h.Status "offline" -}}` dot branch:
```html
{{- else if eq $h.Status "offline" -}}
<span class="dot dot-offline"></span>
```
with:
```html
{{- else if eq $h.Status "offline" -}}
{{- if $h.AlwaysOn -}}
<span class="dot dot-offline"></span>
{{- else -}}
<span class="dot dot-asleep"></span>
{{- end -}}
```
Then change the last-seen text branch (lines 28-29):
```html
{{- else if eq $h.Status "offline" -}}
<span class="text-ink-mute">last seen <span class="mono">{{relTime $h.LastSeenAt}}</span></span>
```
to:
```html
{{- else if eq $h.Status "offline" -}}
{{- if $h.AlwaysOn -}}
<span class="text-ink-mute">last seen <span class="mono">{{relTime $h.LastSeenAt}}</span></span>
{{- else -}}
<span class="text-ink-mute">asleep · <span class="mono">{{relTime $h.LastSeenAt}}</span> · will catch up on return</span>
{{- end -}}
```
And the row-action label (lines 55-56):
```html
{{- if eq $h.Status "offline" -}}
<span class="mono text-xs text-ink-fade">offline</span>
```
to:
```html
{{- if eq $h.Status "offline" -}}
<span class="mono text-xs text-ink-fade">{{if $h.AlwaysOn}}offline{{else}}asleep{{end}}</span>
```
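The nested branch used for the row label can be checked outside the real partials. The sketch below assumes a hypothetical `row` view model and an inline template string; it is not the project's template, and the `online` fallback is added only so every case prints something.

```go
package main

import (
	"fmt"
	"html/template"
	"strings"
)

// row is a hypothetical view model carrying only the two fields the
// presence label depends on.
type row struct {
	Status   string
	AlwaysOn bool
}

// labelTmpl reproduces the row-action branch: an offline always-on host
// stays "offline", an offline intermittent host reads "asleep".
var labelTmpl = template.Must(template.New("label").Parse(
	`{{if eq .Status "offline"}}{{if .AlwaysOn}}offline{{else}}asleep{{end}}{{else}}online{{end}}`))

// label renders the template for one row.
func label(r row) (string, error) {
	var b strings.Builder
	if err := labelTmpl.Execute(&b, r); err != nil {
		return "", err
	}
	return b.String(), nil
}

func main() {
	for _, r := range []row{
		{Status: "online", AlwaysOn: true},
		{Status: "offline", AlwaysOn: true},
		{Status: "offline", AlwaysOn: false},
	} {
		s, err := label(r)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%+v -> %s\n", r, s)
	}
}
```

The underlying status value is still `offline` in both cases; only the presentation changes, which matches the plan's goal of leaving the online/offline state machine untouched.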
- [ ] **Step 4: Asleep dot + last-seen in host_chrome.html**
In `web/templates/partials/host_chrome.html`, change the offline dot branch (lines 36-37):
```html
{{else if eq $host.Status "offline"}}
<span class="dot dot-offline"></span>
```
to:
```html
{{else if eq $host.Status "offline"}}
{{if $host.AlwaysOn}}
<span class="dot dot-offline"></span>
{{else}}
<span class="dot dot-asleep"></span>
{{end}}
```
And the last-seen line (lines 90-94):
```html
{{if eq $host.Status "offline"}}
<span>last seen <span class="mono text-ink-mid">{{relTime $host.LastSeenAt}}</span></span>
{{else}}
<span>online · last heartbeat <span class="mono text-ink-mid">{{relTime $host.LastSeenAt}}</span></span>
{{end}}
```
to:
```html
{{if eq $host.Status "offline"}}
{{if $host.AlwaysOn}}
<span>last seen <span class="mono text-ink-mid">{{relTime $host.LastSeenAt}}</span></span>
{{else}}
<span>asleep · last seen <span class="mono text-ink-mid">{{relTime $host.LastSeenAt}}</span> · will catch up on return</span>
{{end}}
{{else}}
<span>online · last heartbeat <span class="mono text-ink-mid">{{relTime $host.LastSeenAt}}</span></span>
{{end}}
```
- [ ] **Step 5: Add the 24×7 chip + mode toggle to host_chrome.html**
In the header tags block (lines 42-48), after the tags `edit/add tags` button and before the closing `</div>` at line 48, add the chip (shown only when always-on) and a small toggle button mirroring the tags-editor reveal pattern:
```html
{{if $host.AlwaysOn}}<span class="tag" title="Expected online 24×7 — offline raises an alert">24×7</span>{{end}}
<button type="button" class="text-ink-fade text-[11px] hover:text-ink-mid whitespace-nowrap"
style="padding: 2px 8px; border: 1px dashed var(--line); border-radius: 3px; cursor: pointer;"
onclick="document.getElementById('mode-edit-{{$host.ID}}').classList.toggle('hidden')"
title="Change presence mode">presence</button>
```
Then add the toggle form right after the tags `<form>` block (after line 82, before the `<div class="flex items-center gap-3 mt-3 ...">` at line 83):
```html
{{/* Presence-mode editor — hidden by default; toggled by the
"presence" button. Checkbox present => always-on (24×7);
unchecked => intermittent (laptop): no offline alerts, shows
"asleep", auto-catches-up a missed backup on reconnect. */}}
<form id="mode-edit-{{$host.ID}}" method="post"
action="/hosts/{{$host.ID}}/mode"
class="hidden mt-3" style="max-width: 640px;">
<label class="flex items-center gap-2 text-[12px] text-ink-mid">
<input type="checkbox" name="always_on" value="on" {{if $host.AlwaysOn}}checked{{end}} />
Always On — expected online 24×7
</label>
<div class="field-help">
Uncheck for an intermittent host (laptop/workstation): it won't
raise offline alerts when asleep, shows an “asleep” state, and
catches up a missed backup ~1 minute after it reconnects.
</div>
<button type="submit" class="btn btn-primary mt-2 whitespace-nowrap">Save presence</button>
</form>
```
- [ ] **Step 6: Verify templates parse**
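Run: `go build ./... && go test ./internal/server/... -run Template -v` if a template-render test exists; otherwise rely on the smoke run in Step 7. At minimum, `go build ./...` must pass. Note that the build alone does not parse the templates, so a template syntax error only surfaces in a render test or the smoke run.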
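If no template-render test exists yet, a parse-only check may catch an unbalanced `{{end}}` before the smoke run, since `html/template` sources are parsed when the program runs, not at compile time. A minimal sketch, assuming a hypothetical `parseCheck` helper and a stub for `relTime` whose real signature is not shown in this plan:

```go
package main

import (
	"fmt"
	"html/template"
	"time"
)

// stubFuncs registers the helper names the partials call so that parsing
// does not fail on an undefined function. The relTime signature here is an
// assumption; only the name has to match for a parse check.
var stubFuncs = template.FuncMap{
	"relTime": func(*time.Time) string { return "" },
}

// parseCheck parses template source and returns any syntax error.
func parseCheck(name, src string) error {
	_, err := template.New(name).Funcs(stubFuncs).Parse(src)
	return err
}

func main() {
	good := `{{if eq .Status "offline"}}{{if .AlwaysOn}}last seen{{else}}asleep · {{relTime .LastSeenAt}}{{end}}{{end}}`
	// Same branch with the outer {{end}} dropped, as can happen when nesting
	// a new {{if}} inside an existing one.
	bad := `{{if eq .Status "offline"}}{{if .AlwaysOn}}last seen{{else}}asleep{{end}}`

	fmt.Println("good parses:", parseCheck("good", good) == nil)
	fmt.Println("bad parses:", parseCheck("bad", bad) == nil)
}
```

In the real tree the same idea would read the partials from disk (or the embedded filesystem) and register the server's actual function map in place of the stub.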
- [ ] **Step 7: Manual smoke (per CLAUDE.md smoke targets)**
```bash
make smoke-deploy
```
Then in a browser (or Playwright): open the dashboard and a host detail page. Toggle a host to intermittent via the "presence" control, confirm the 24×7 chip disappears, and confirm an offline/sleeping intermittent host renders the grey "asleep · … · will catch up on return" line instead of red "offline". Toggle back and confirm the chip returns.
- [ ] **Step 8: Commit**
```bash
git add web/styles/input.css web/templates/partials/host_row.html web/templates/partials/host_chrome.html
git commit -m "feat(ui): asleep state, 24×7 chip, presence toggle for host mode"
```
---
## Task 7: Record in tasks.md + final verification
**Files:**
- Modify: `tasks.md`
- [ ] **Step 1: Add a tasks.md entry**
Add a `[x]` entry under "Next steps from testing" in `tasks.md` (mirroring the NS-07 style — one line + a short "As shipped" note) describing the always-on/intermittent host mode: `always_on` column (default on), offline-alert suppression + 7-day staleness alert for intermittent hosts, settle-then-catch-up on reconnect, and the asleep UI + 24×7 chip + presence toggle.
- [ ] **Step 2: Full verification**
```bash
go vet ./...
go test ./...
```
Expected: vet clean, all tests green.
- [ ] **Step 3: Commit**
```bash
git add tasks.md
git commit -m "docs(tasks): record always-on/intermittent host mode"
```
---
## Self-Review notes
- **Spec coverage:** §1 data model → Task 1. §2 mechanics unchanged → no task needed (verified untouched). §3 alerts (suppress offline, staleness, resolve-on-backup, resolve-on-toggle) → Task 4 + Task 5 Step 1. §4 catch-up (arm on hello, settle, per-schedule overdue, dispatch, guards) → Tasks 2-3. §5 UI (dot-asleep, asleep text, 24×7 chip, toggle) → Task 6. Testing → tests in Tasks 1-5. Out-of-scope items respected (global 7d const, reconnect-only, no agent-side cron, always-on stale_schedule untouched).
- **Type consistency:** `scheduleOverdue(cronExpr string, *time.Time, time.Time) bool`, `ArmCatchup(hostID string, now time.Time)`, `RunCatchupsDue(ctx)`, `SetHostAlwaysOn(ctx, hostID, bool)`, `ResolveOnModeChange(ctx, hostID, when)`, `Host.AlwaysOn bool` — used consistently across tasks.
- **No invented store methods:** all calls into existing code (GetHost, ListSchedulesByHost, GetSourceGroup, SetHostLastBackup, ListAlerts, AppendAudit, dispatchBackupForGroupCore, Hub.Conn/Connected) exist in the current tree; `SetHostAlwaysOn` is the only new store method and is defined in Task 1.
- **Test helper caveat:** the alert and HTTP handler tests reference package-local helpers (`newTestEngine`, `newUITestServer`, etc.) that must be matched to the real names in existing `_test.go` files at implementation time — flagged inline in each task.