Always-On vs Intermittent Host Mode — Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Let an operator mark a host as not-always-on so it stops raising offline alerts when it legitimately sleeps, renders a calm "asleep" state, auto-catches-up a missed backup ~1 minute after it reconnects, and still raises a long-threshold staleness alert if it goes too long with no backup.

Architecture: A thin policy + presentation layer over the existing online/offline state machine. A new hosts.always_on boolean (default 1 = today's behaviour) gates three behaviours: offline-alert suppression + a 7-day staleness alert in the alert engine; an in-memory catch-up scheduler in the HTTP server armed on agent hello and fired from the existing 30s tick; and an "asleep" UI state plus a 24×7 chip. Online/offline tracking, heartbeat, and pending_runs are untouched.

Tech Stack: Go, SQLite (modernc), github.com/robfig/cron/v3 (already a dependency), Go html/template, Tailwind-in-input.css.

Spec: docs/specs/2026-06-15-always-on-host-mode-design.md


File Structure

  • Create internal/store/migrations/0024_hosts_always_on.sql — add the column.
  • Modify internal/store/types.go — add Host.AlwaysOn bool.
  • Modify internal/store/hosts.go — add always_on to the 3 host SELECTs + scanHostRow; add SetHostAlwaysOn.
  • Create internal/store/hosts_always_on_test.go — round-trip + default test.
  • Modify internal/alert/engine.go — suppress offline for intermittent hosts; staleness threshold constant; staleness sweep; resolve staleness on backup success.
  • Modify internal/alert/rules.go — exported ResolveOnModeChange helper for the toggle handler.
  • Create internal/alert/intermittent_test.go — suppression + staleness + resolve tests.
  • Create internal/server/http/catchup.go — overdue helper + in-memory catch-up scheduler.
  • Create internal/server/http/catchup_test.go — overdue table tests.
  • Modify internal/server/http/server.go — catch-up map fields on Server, init in New.
  • Modify internal/server/http/host_credentials.go — arm catch-up in onAgentHello.
  • Modify cmd/server/main.go — call srv.RunCatchupsDue on the pending-drain tick.
  • Modify internal/server/http/ui_handlers.go — handleUIHostModeSave handler.
  • Modify internal/server/http/server.go (routes) — mount POST /hosts/{id}/mode.
  • Modify web/styles/input.css — dot-asleep token.
  • Modify web/templates/partials/host_row.html — asleep dot + text.
  • Modify web/templates/partials/host_chrome.html — asleep dot/last-seen, 24×7 chip, mode toggle form.
  • Modify tasks.md — record the feature.

Task 1: Schema + store field for always_on

Files:

  • Create: internal/store/migrations/0024_hosts_always_on.sql

  • Modify: internal/store/types.go:62-102 (Host struct)

  • Modify: internal/store/hosts.go (3 SELECTs at lines 41-48, 56-63, 224-231; scanHostRow at 261-334)

  • Test: internal/store/hosts_always_on_test.go

  • Step 1: Write the migration

Create internal/store/migrations/0024_hosts_always_on.sql:

-- 0024: distinguish always-on (24x7 server) hosts from intermittent
-- hosts (laptops/workstations that legitimately sleep). Default 1 so
-- every existing and future host keeps today's offline/alert
-- semantics unless explicitly opted out. Column-level ALTER per the
-- repo's migration rules (no table rebuild — hosts has inbound FKs).
ALTER TABLE hosts ADD COLUMN always_on INTEGER NOT NULL DEFAULT 1;
  • Step 2: Add the struct field

In internal/store/types.go, add to the Host struct (after RepoStatusError at line 101):

	// AlwaysOn is true for 24x7 server hosts (the default). When false
	// the host is intermittent (laptop/workstation): offline alerts are
	// suppressed, the UI shows an "asleep" state, and a missed backup is
	// caught up ~1 min after reconnect. See the always-on-host-mode spec.
	AlwaysOn bool
  • Step 3: Thread always_on through reads

In internal/store/hosts.go, append , always_on to the SELECT column list in all three queries: LookupHostByAgentToken (line 47), GetHost (line 62), and ListHosts (line 230). Each currently ends repo_status, repo_status_error — change to repo_status, repo_status_error, always_on.

Then in scanHostRow (line 261), declare var alwaysOn int in the var block and change the Scan(...) call's final args from &h.RepoStatus, &h.RepoStatusError) to &h.RepoStatus, &h.RepoStatusError, &alwaysOn). After the existing post-scan assignments, add:

	h.AlwaysOn = alwaysOn != 0

(SQLite stores the boolean as INTEGER; scan into int then compare to avoid driver bool-coercion surprises.)
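
As a minimal sketch of that int/bool convention, the two conversions can be written as a pair of helpers and checked for a clean round trip. The names boolToInt and intToBool are hypothetical and not part of the store package; the plan inlines the same logic in scanHostRow and SetHostAlwaysOn.

```go
package main

import "fmt"

// boolToInt is a hypothetical helper: the value written to the
// always_on INTEGER column (1 = always-on, 0 = intermittent).
func boolToInt(b bool) int {
	if b {
		return 1
	}
	return 0
}

// intToBool is a hypothetical helper: any non-zero column value reads
// back as true, matching the "alwaysOn != 0" comparison in scanHostRow.
func intToBool(v int) bool { return v != 0 }

func main() {
	for _, b := range []bool{true, false} {
		v := boolToInt(b)
		fmt.Printf("%v -> %d -> %v\n", b, v, intToBool(v))
	}
}
```

Going through an explicit int keeps the read path independent of how a given SQLite driver coerces INTEGER into Go's bool.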

  • Step 4: Add SetHostAlwaysOn

In internal/store/hosts.go, after SetHostTags (line 379), add:

// SetHostAlwaysOn flips the host's always-on flag. true = 24x7 server
// (default); false = intermittent host (laptop). See the
// always-on-host-mode spec.
func (s *Store) SetHostAlwaysOn(ctx context.Context, hostID string, alwaysOn bool) error {
	v := 0
	if alwaysOn {
		v = 1
	}
	_, err := s.db.ExecContext(ctx,
		`UPDATE hosts SET always_on = ? WHERE id = ?`, v, hostID)
	if err != nil {
		return fmt.Errorf("store: set host always_on: %w", err)
	}
	return nil
}
  • Step 5: Write the round-trip test

Create internal/store/hosts_always_on_test.go. Use the existing test harness pattern — check a sibling test (e.g. internal/store/hosts_test.go) for the newTestStore/testStore helper name and the host-creation helper, and mirror it exactly. The test body:

package store

import (
	"context"
	"testing"
	"time"
)

func TestHostAlwaysOnDefaultAndToggle(t *testing.T) {
	ctx := context.Background()
	st := newTestStore(t) // mirror the helper used by hosts_test.go

	h := Host{
		ID: "h-always-on", Name: "lap", OS: "linux", Arch: "amd64",
		ProtocolVersion: 1, EnrolledAt: time.Now().UTC(),
	}
	if err := st.CreateHost(ctx, h, "tok-hash", "pin"); err != nil {
		t.Fatalf("create host: %v", err)
	}

	got, err := st.GetHost(ctx, h.ID)
	if err != nil {
		t.Fatalf("get host: %v", err)
	}
	if !got.AlwaysOn {
		t.Fatalf("new host should default to always_on=true, got false")
	}

	if err := st.SetHostAlwaysOn(ctx, h.ID, false); err != nil {
		t.Fatalf("set always_on: %v", err)
	}
	got, err = st.GetHost(ctx, h.ID)
	if err != nil {
		t.Fatalf("get host 2: %v", err)
	}
	if got.AlwaysOn {
		t.Fatalf("expected always_on=false after toggle, got true")
	}

	// ListHosts must surface the same value.
	hosts, err := st.ListHosts(ctx)
	if err != nil {
		t.Fatalf("list hosts: %v", err)
	}
	if len(hosts) != 1 || hosts[0].AlwaysOn {
		t.Fatalf("ListHosts should report always_on=false, got %+v", hosts)
	}
}
  • Step 6: Run the test (expect FAIL first if written before code, else PASS)

Run: go test ./internal/store/ -run TestHostAlwaysOnDefaultAndToggle -v

Expected: PASS once Steps 1-4 are in. If you wrote the test first, it fails to compile on AlwaysOn / SetHostAlwaysOn; that is the expected red.

  • Step 7: Commit
go vet ./internal/store/...
git add internal/store/migrations/0024_hosts_always_on.sql internal/store/types.go internal/store/hosts.go internal/store/hosts_always_on_test.go
git commit -m "feat(store): add hosts.always_on flag (default on)"

Task 2: Overdue computation helper

This is a pure function so it can be unit-tested in isolation before the scheduler wires it up. It lives in the new catchup.go (the scheduler will follow in Task 3, same file).

Files:

  • Create: internal/server/http/catchup.go

  • Test: internal/server/http/catchup_test.go

  • Step 1: Write the failing test

Create internal/server/http/catchup_test.go:

package http

import (
	"testing"
	"time"
)

func TestScheduleOverdue(t *testing.T) {
	mustParse := func(s string) time.Time {
		t.Helper()
		v, err := time.Parse(time.RFC3339, s)
		if err != nil {
			t.Fatalf("parse %q: %v", s, err)
		}
		return v
	}
	daily := "0 2 * * *" // 02:00 every day

	cases := []struct {
		name       string
		cron       string
		lastBackup *time.Time
		now        time.Time
		want       bool
	}{
		{
			name: "never backed up is overdue",
			cron: daily, lastBackup: nil,
			now:  mustParse("2026-06-15T09:00:00Z"),
			want: true,
		},
		{
			name: "missed last night's window",
			cron: daily,
			lastBackup: ptrTime(mustParse("2026-06-13T02:05:00Z")),
			now:        mustParse("2026-06-15T09:00:00Z"),
			want:       true,
		},
		{
			name: "backed up after the most recent window",
			cron: daily,
			lastBackup: ptrTime(mustParse("2026-06-15T02:05:00Z")),
			now:        mustParse("2026-06-15T09:00:00Z"),
			want:       false,
		},
		{
			name: "unparseable cron is never overdue",
			cron: "not a cron",
			lastBackup: nil,
			now:  mustParse("2026-06-15T09:00:00Z"),
			want: false,
		},
	}
	for _, c := range cases {
		t.Run(c.name, func(t *testing.T) {
			got := scheduleOverdue(c.cron, c.lastBackup, c.now)
			if got != c.want {
				t.Fatalf("scheduleOverdue(%q, %v, %v) = %v, want %v",
					c.cron, c.lastBackup, c.now, got, c.want)
			}
		})
	}
}

func ptrTime(t time.Time) *time.Time { return &t }
  • Step 2: Run the test to verify it fails

Run: go test ./internal/server/http/ -run TestScheduleOverdue -v

Expected: FAIL with undefined: scheduleOverdue.

  • Step 3: Implement scheduleOverdue

Create internal/server/http/catchup.go with the helper (the scheduler methods are added in Task 3):

// catchup.go — server-side catch-up for intermittent (non-always-on)
// hosts. When such a host reconnects we wait a short settle window,
// then dispatch a backup for any schedule whose window elapsed while
// the host was asleep. This is separate from pending_runs: a host that
// was asleep never fired its local cron, so no pending row exists.
package http

import (
	"time"
)

// scheduleOverdue reports whether a schedule's most recent expected
// fire is newer than the host's last successful backup — i.e. a window
// passed with no backup. A nil lastBackup means "never backed up" and
// is always overdue (provided the cron parses). An unparseable cron is
// treated as not-overdue so a bad expression can never trigger a
// surprise dispatch. Uses the same cronParser the agent's scheduler
// and schedule validation use, so interpretation is identical.
func scheduleOverdue(cronExpr string, lastBackup *time.Time, now time.Time) bool {
	sched, err := cronParser.Parse(cronExpr)
	if err != nil {
		return false
	}
	if lastBackup == nil {
		return true
	}
	next := sched.Next(*lastBackup)
	return !next.After(now)
}
  • Step 4: Run the test to verify it passes

Run: go test ./internal/server/http/ -run TestScheduleOverdue -v

Expected: PASS (all four sub-cases).

  • Step 5: Commit
go vet ./internal/server/http/...
git add internal/server/http/catchup.go internal/server/http/catchup_test.go
git commit -m "feat(catchup): scheduleOverdue helper for missed-window detection"

Task 3: Catch-up scheduler (arm on hello, fire on tick)

Files:

  • Modify: internal/server/http/server.go:68-93 (Server struct), :96-112 (New)

  • Modify: internal/server/http/catchup.go (add scheduler methods)

  • Modify: internal/server/http/host_credentials.go:463-486 (onAgentHello)

  • Modify: cmd/server/main.go:228-229 (pending-drain tick case)

  • Step 1: Add catch-up state to the Server struct

In internal/server/http/server.go, add fields to Server (after treeCache at line 92):

	// catchupDueAt tracks intermittent hosts that reconnected and are
	// in their settle window. Keyed hostID → earliest time to evaluate
	// catch-up. Best-effort + in-memory: a server restart simply re-arms
	// on the next hello. Guarded by catchupMu.
	catchupMu    sync.Mutex
	catchupDueAt map[string]time.Time

Add "sync" and "time" to the imports if not already present (check the import block).

  • Step 2: Initialise the map in New

In New (line 106), add to the &Server{...} literal:

		catchupDueAt: make(map[string]time.Time),
  • Step 3: Add scheduler methods to catchup.go

Append to internal/server/http/catchup.go. Add "context", "log/slog" to its imports:

// catchupSettle is how long after a reconnect we wait before evaluating
// catch-up, so a laptop that wakes briefly and sleeps again doesn't
// trigger a backup it can't finish. ~1 minute per the spec.
const catchupSettle = 60 * time.Second

// ArmCatchup records that an intermittent host just reconnected and
// should be evaluated for a missed backup after the settle window.
// It does not check the always-on flag itself: callers pass only
// intermittent hosts, and runCatchup re-checks the flag before dispatch.
// Re-arming overwrites the timer (debounce, so flapping doesn't stack).
func (s *Server) ArmCatchup(hostID string, now time.Time) {
	s.catchupMu.Lock()
	defer s.catchupMu.Unlock()
	if s.catchupDueAt == nil {
		s.catchupDueAt = make(map[string]time.Time)
	}
	s.catchupDueAt[hostID] = now.Add(catchupSettle)
}

// dueCatchups returns the hostIDs whose settle window has elapsed and
// removes them from the map. Caller evaluates each.
func (s *Server) dueCatchups(now time.Time) []string {
	s.catchupMu.Lock()
	defer s.catchupMu.Unlock()
	var due []string
	for id, at := range s.catchupDueAt {
		if !now.Before(at) {
			due = append(due, id)
			delete(s.catchupDueAt, id)
		}
	}
	return due
}

// RunCatchupsDue is the tick entrypoint. For each host past its settle
// window it dispatches a backup for every enabled schedule that is
// overdue. Skips hosts that bounced back offline, that are already
// running/queued a job, or that turned out to be always-on.
func (s *Server) RunCatchupsDue(ctx context.Context) {
	if s.deps.Hub == nil {
		return
	}
	now := time.Now().UTC()
	for _, hostID := range s.dueCatchups(now) {
		s.runCatchup(ctx, hostID, now)
	}
}

// runCatchup evaluates and dispatches catch-up backups for a single
// host. Split out so RunCatchupsDue reads cleanly.
func (s *Server) runCatchup(ctx context.Context, hostID string, now time.Time) {
	conn := s.deps.Hub.Conn(hostID)
	if conn == nil {
		return // bounced offline during the settle window; re-arms on next hello
	}
	host, err := s.deps.Store.GetHost(ctx, hostID)
	if err != nil {
		slog.Warn("catchup: load host", "host_id", hostID, "err", err)
		return
	}
	if host.AlwaysOn {
		return // mode flipped during settle window
	}
	if host.CurrentJobID != nil {
		return // a job is already running; don't pile on
	}
	schedules, err := s.deps.Store.ListSchedulesByHost(ctx, hostID)
	if err != nil {
		slog.Warn("catchup: list schedules", "host_id", hostID, "err", err)
		return
	}
	for _, sc := range schedules {
		if !sc.Enabled || len(sc.SourceGroupIDs) == 0 {
			continue
		}
		if !scheduleOverdue(sc.CronExpr, host.LastBackupAt, now) {
			continue
		}
		for _, gid := range sc.SourceGroupIDs {
			g, err := s.deps.Store.GetSourceGroup(ctx, hostID, gid)
			if err != nil {
				slog.Warn("catchup: load source group",
					"host_id", hostID, "schedule_id", sc.ID, "group_id", gid, "err", err)
				continue
			}
			if _, derr := s.dispatchBackupForGroupCore(ctx, conn, hostID, sc.ID, g, now); derr != nil {
				// Send failed, so the host likely dropped again. Re-arm so
				// it is re-evaluated after another settle window (the next
				// hello re-arms too); stop processing this host.
				s.ArmCatchup(hostID, now)
				return
			}
			slog.Info("catchup: dispatched missed backup",
				"host_id", hostID, "schedule_id", sc.ID, "group", g.Name)
		}
	}
}
  • Step 4: Arm catch-up on agent hello

In internal/server/http/host_credentials.go, in onAgentHello (line 463), after the go s.DrainPending(...) line (485), add:

	// Intermittent hosts that just reconnected may have slept through a
	// backup window. Arm a catch-up evaluation after a settle delay; the
	// pending-drain tick fires it. Always-on hosts never need this.
	if host, err := s.deps.Store.GetHost(ctx, hostID); err == nil && !host.AlwaysOn {
		s.ArmCatchup(hostID, time.Now().UTC())
	}

Verify time is already imported in this file (it is — used elsewhere). If not, add it.

  • Step 5: Fire catch-up from the pending-drain tick

In cmd/server/main.go, in the case <-pendingDrainTick.C: block (line 228), change:

				case <-pendingDrainTick.C:
					srv.DrainAllDue(ctx)

to:

				case <-pendingDrainTick.C:
					srv.DrainAllDue(ctx)
					srv.RunCatchupsDue(ctx)
  • Step 6: Build and vet

Run: go build ./... && go vet ./...

Expected: clean build, no vet errors.

  • Step 7: Commit
git add internal/server/http/server.go internal/server/http/catchup.go internal/server/http/host_credentials.go cmd/server/main.go
git commit -m "feat(catchup): arm on hello, fire missed-window backups on tick"

Task 4: Alert engine — suppress offline + staleness alert

Files:

  • Modify: internal/alert/engine.go:121-153 (handleJobFinished), :155-174 (handleHostOffline), :188-216 (tick)

  • Modify: internal/alert/rules.go:13-39 (constants), add exported resolve helper

  • Test: internal/alert/intermittent_test.go

  • Step 1: Add the staleness threshold constant

In internal/alert/engine.go, add near the top of the file (after imports, before JobFinishedEvent):

// staleBackupThreshold is how long an intermittent host may go without
// a successful backup before we raise a stale_schedule alert. Global
// constant for v1 (may become per-host later). Only intermittent hosts
// are evaluated — always-on hosts' stale_schedule stays a no-op.
const staleBackupThreshold = 7 * 24 * time.Hour
  • Step 2: Suppress the offline alert for intermittent hosts

In handleHostOffline (line 155), after loading the host and the existing if host.LastSeenAt == nil { return } guard, add a mode check. Change:

	if host.LastSeenAt == nil {
		return
	}
	if time.Since(*host.LastSeenAt) < e.agentOfflineFloor {
		return
	}

to:

	// Intermittent hosts (laptops) legitimately disappear — never raise
	// agent_offline for them. The stale_schedule sweep in tick() is the
	// only staleness signal for these hosts.
	if !host.AlwaysOn {
		return
	}
	if host.LastSeenAt == nil {
		return
	}
	if time.Since(*host.LastSeenAt) < e.agentOfflineFloor {
		return
	}
  • Step 3: Suppress offline + add staleness in the tick sweep

In tick (line 188), the host loop currently raises agent_offline for every offline host. Replace the loop body (lines 205-214) with:

	for _, h := range hosts {
		// Intermittent hosts: suppress agent_offline entirely; instead
		// raise stale_schedule when they have gone too long with no
		// successful backup AND they have at least one enabled schedule
		// to be measured against. A nil LastBackupAt (never backed up)
		// has no baseline — onboarding/repo_status covers that case.
		if !h.AlwaysOn {
			if h.LastBackupAt == nil {
				continue
			}
			if now.Sub(*h.LastBackupAt) < staleBackupThreshold {
				continue
			}
			hasEnabled, err := e.hostHasEnabledSchedule(ctx, h.ID)
			if err != nil || !hasEnabled {
				continue
			}
			e.raiseAndNotify(ctx, h.ID, KindStaleSchedule, "", "warning",
				fmt.Sprintf("No backup in %s (threshold %s)",
					roundDur(now.Sub(*h.LastBackupAt)), staleBackupThreshold), now)
			continue
		}
		// Always-on hosts: existing agent_offline re-evaluation.
		if h.Status != "offline" || h.LastSeenAt == nil {
			continue
		}
		if now.Sub(*h.LastSeenAt) >= e.agentOfflineFloor {
			e.raiseAndNotify(ctx, h.ID, KindAgentOffline, "", "warning",
				fmt.Sprintf("Agent offline for %s (threshold %s)",
					roundDur(now.Sub(*h.LastSeenAt)), e.agentOfflineFloor), now)
		}
	}

Delete the trailing // Stale-schedule sweep — no-op in v1. comment at line 215.

  • Step 4: Add the hostHasEnabledSchedule helper

In internal/alert/engine.go, add at the end of the file:

// hostHasEnabledSchedule reports whether the host has at least one
// enabled backup schedule — the precondition for a stale_schedule
// alert (no schedule = no backup expectation to measure against).
func (e *Engine) hostHasEnabledSchedule(ctx context.Context, hostID string) (bool, error) {
	schedules, err := e.store.ListSchedulesByHost(ctx, hostID)
	if err != nil {
		return false, err
	}
	for _, sc := range schedules {
		if sc.Enabled {
			return true, nil
		}
	}
	return false, nil
}
  • Step 5: Resolve staleness on a successful backup

In handleJobFinished (line 146), the case "succeeded": currently resolves only the job-kind alert. For a successful backup, also clear any open stale_schedule. Change:

	case "succeeded":
		e.resolveAndNotify(ctx, ev.HostID, kind, dedupKey, ev.When)
	}

to:

	case "succeeded":
		e.resolveAndNotify(ctx, ev.HostID, kind, dedupKey, ev.When)
		if ev.Kind == "backup" {
			// A fresh backup clears staleness for intermittent hosts.
			e.resolveAndNotify(ctx, ev.HostID, KindStaleSchedule, "", ev.When)
		}
	}
  • Step 6: Add an exported mode-change resolve hook

The HTTP toggle handler (Task 5) needs to clear stale alerts when an operator changes a host's mode. Add to internal/alert/rules.go (after Resolve, around line 100):

// ResolveOnModeChange clears any open agent_offline and stale_schedule
// alerts for a host whose always-on flag was just toggled. The next
// 60s tick re-raises whichever still applies under the new mode, so
// this is a self-correcting "wipe and let the sweep settle" call.
// Safe to invoke from the HTTP layer (it only touches the store + hub).
func (e *Engine) ResolveOnModeChange(ctx context.Context, hostID string, when time.Time) {
	e.resolveAndNotify(ctx, hostID, KindAgentOffline, "", when)
	e.resolveAndNotify(ctx, hostID, KindStaleSchedule, "", when)
}
  • Step 7: Write the engine tests

Create internal/alert/intermittent_test.go. First inspect an existing engine test (e.g. grep internal/alert/*_test.go for how NewEngine is constructed with a test store + hub, and the helper that creates a host + schedule). Mirror those helpers. The tests to write:

package alert

import (
	"context"
	"testing"
	"time"
)

// Mirror the construction helpers used by the existing engine tests
// (newTestEngine / test store / host+schedule seeding). Replace the
// placeholder helpers below with the real ones from this package's
// existing _test.go files.

func TestIntermittentHostSuppressesOfflineAlert(t *testing.T) {
	ctx := context.Background()
	e, st := newTestEngine(t) // mirror existing helper

	hostID := seedHost(t, st, false /* alwaysOn */)
	// last seen well past the floor
	touchHostSeen(t, st, hostID, time.Now().Add(-2*time.Hour))
	markHostOffline(t, st, hostID)

	e.handleHostOffline(ctx, hostID)

	if n := openAlertCount(t, st, hostID, KindAgentOffline); n != 0 {
		t.Fatalf("intermittent host should not raise agent_offline, got %d", n)
	}
}

func TestAlwaysOnHostStillRaisesOfflineAlert(t *testing.T) {
	ctx := context.Background()
	e, st := newTestEngine(t)

	hostID := seedHost(t, st, true /* alwaysOn */)
	touchHostSeen(t, st, hostID, time.Now().Add(-2*time.Hour))
	markHostOffline(t, st, hostID)

	e.handleHostOffline(ctx, hostID)

	if n := openAlertCount(t, st, hostID, KindAgentOffline); n != 1 {
		t.Fatalf("always-on host should raise agent_offline, got %d", n)
	}
}

func TestStalenessAlertForIntermittentHost(t *testing.T) {
	ctx := context.Background()
	e, st := newTestEngine(t)

	hostID := seedHost(t, st, false)
	seedEnabledSchedule(t, st, hostID) // "0 2 * * *" with a source group
	setLastBackup(t, st, hostID, time.Now().Add(-8*24*time.Hour))

	e.tick(ctx, time.Now().UTC())

	if n := openAlertCount(t, st, hostID, KindStaleSchedule); n != 1 {
		t.Fatalf("expected one stale_schedule alert, got %d", n)
	}

	// A successful backup clears it.
	e.handleJobFinished(ctx, JobFinishedEvent{
		HostID: hostID, JobID: "j1", Kind: "backup",
		Status: "succeeded", When: time.Now().UTC(),
	})
	if n := openAlertCount(t, st, hostID, KindStaleSchedule); n != 0 {
		t.Fatalf("stale_schedule should resolve after backup, got %d", n)
	}
}

func TestNoStalenessWithoutEnabledSchedule(t *testing.T) {
	ctx := context.Background()
	e, st := newTestEngine(t)

	hostID := seedHost(t, st, false)
	setLastBackup(t, st, hostID, time.Now().Add(-8*24*time.Hour))
	// no schedule seeded

	e.tick(ctx, time.Now().UTC())

	if n := openAlertCount(t, st, hostID, KindStaleSchedule); n != 0 {
		t.Fatalf("no schedule => no staleness alert, got %d", n)
	}
}

Note for the implementer: the newTestEngine, seedHost, touchHostSeen, markHostOffline, openAlertCount, seedEnabledSchedule, setLastBackup helpers must be replaced with the real equivalents in this package's existing tests. If a needed seeding helper doesn't exist, write it using the store methods directly (CreateHost, SetHostAlwaysOn, CreateSchedule, SetHostLastBackup, MarkHostsOfflineStale, ListAlerts). Do NOT invent store methods — all required ones exist as of Task 1.

  • Step 8: Run the tests

Run: go test ./internal/alert/ -v

Expected: PASS for all four new tests plus the existing suite.

  • Step 9: Commit
go vet ./internal/alert/...
git add internal/alert/engine.go internal/alert/rules.go internal/alert/intermittent_test.go
git commit -m "feat(alert): suppress offline + add staleness alert for intermittent hosts"

Task 5: HTTP toggle handler + route

Files:

  • Modify: internal/server/http/ui_handlers.go (new handler near handleUIHostTagsSave at line 954)

  • Modify: internal/server/http/server.go:281 (route mount)

  • Step 1: Add the handler

In internal/server/http/ui_handlers.go, after handleUIHostTagsSave (line 984), add:

// handleUIHostModeSave flips a host's always-on flag. Checkbox present
// in the form (any value) => always-on; absent => intermittent.
// Operator-band; mounted in server.go. On change we clear open
// offline/staleness alerts via the engine so the next sweep re-raises
// only what still applies under the new mode.
func (s *Server) handleUIHostModeSave(w stdhttp.ResponseWriter, r *stdhttp.Request) {
	u := s.requireUIUser(w, r)
	if u == nil {
		return
	}
	hostID := chi.URLParam(r, "id")
	if _, err := s.deps.Store.GetHost(r.Context(), hostID); err != nil {
		stdhttp.NotFound(w, r)
		return
	}
	if err := r.ParseForm(); err != nil {
		stdhttp.Error(w, "bad request", stdhttp.StatusBadRequest)
		return
	}
	alwaysOn := r.PostForm.Get("always_on") != ""
	if err := s.deps.Store.SetHostAlwaysOn(r.Context(), hostID, alwaysOn); err != nil {
		slog.Error("ui host mode: save", "host_id", hostID, "err", err)
		stdhttp.Error(w, "internal", stdhttp.StatusInternalServerError)
		return
	}
	if s.deps.AlertEngine != nil {
		s.deps.AlertEngine.ResolveOnModeChange(r.Context(), hostID, time.Now().UTC())
	}
	_ = s.deps.Store.AppendAudit(r.Context(), store.AuditEntry{
		ID: ulid.Make().String(), UserID: &u.ID, Actor: "user",
		Action:     "host.mode_updated",
		TargetKind: ptr("host"), TargetID: &hostID,
		TS: time.Now().UTC(),
	})
	stdhttp.Redirect(w, r, "/hosts/"+hostID, stdhttp.StatusSeeOther)
}
  • Step 2: Mount the route

In internal/server/http/server.go, next to the tags route (line 281):

			r.Post("/hosts/{id}/tags", s.handleUIHostTagsSave)

add directly below:

			r.Post("/hosts/{id}/mode", s.handleUIHostModeSave)

(Confirm it lands in the same operator-band route group as /hosts/{id}/tags — same indentation/block.)

  • Step 3: Build and vet

Run: go build ./... && go vet ./...

Expected: clean.

  • Step 4: Write a handler test

Add to the existing UI-handler test file (grep internal/server/http/*_test.go for the harness that builds a Server + does form POSTs against /hosts/{id}/tags; mirror it). The test posts to /hosts/{id}/mode with and without the always_on field and asserts the stored flag:

func TestHandleUIHostModeSave(t *testing.T) {
	srv, st, sess := newUITestServer(t) // mirror tags-save test harness
	hostID := seedHostForUI(t, st)      // mirror existing host seeding

	// Uncheck: form without always_on => intermittent.
	postForm(t, srv, sess, "/hosts/"+hostID+"/mode", map[string]string{})
	if h, _ := st.GetHost(context.Background(), hostID); h.AlwaysOn {
		t.Fatalf("expected always_on=false after empty post")
	}

	// Check: form with always_on=on => always-on.
	postForm(t, srv, sess, "/hosts/"+hostID+"/mode", map[string]string{"always_on": "on"})
	if h, _ := st.GetHost(context.Background(), hostID); !h.AlwaysOn {
		t.Fatalf("expected always_on=true after checked post")
	}
}

Replace newUITestServer/seedHostForUI/postForm with the real harness helpers from the existing UI handler tests.

  • Step 5: Run the test

Run: go test ./internal/server/http/ -run TestHandleUIHostModeSave -v

Expected: PASS.

  • Step 6: Commit
git add internal/server/http/ui_handlers.go internal/server/http/server.go internal/server/http/*_test.go
git commit -m "feat(http): host mode toggle handler + route (host.mode_updated)"

Task 6: UI — asleep state, 24×7 chip, mode toggle

Files:

  • Modify: web/styles/input.css (dot-asleep token)

  • Modify: web/templates/partials/host_row.html

  • Modify: web/templates/partials/host_chrome.html

  • Step 1: Add the dot-asleep CSS token

In web/styles/input.css, find the .dot-offline definition (grep for dot-offline) and add a sibling .dot-asleep rule. Match the existing dot pattern; use a calm grey-blue distinct from offline's grey/red. Example (adapt colours to the file's existing tokens):

.dot-asleep { background: var(--ink-fade); opacity: 0.6; }

Inspect the neighbouring .dot-offline / .dot-degraded rules first and follow their exact shape (size, border, etc.); only the colour/opacity should differ.

  • Step 2: Rebuild CSS if the project precompiles it

Check the Makefile for a CSS build step (grep css in Makefile). If present, run it (e.g. make css). If the server serves input.css directly, skip.

  • Step 3: Asleep dot + text in host_row.html

In web/templates/partials/host_row.html, change the status-dot block (lines 6-14). Replace the {{- else if eq $h.Status "offline" -}} dot branch:

    {{- else if eq $h.Status "offline" -}}
      <span class="dot dot-offline"></span>

with:

    {{- else if eq $h.Status "offline" -}}
      {{- if $h.AlwaysOn -}}
        <span class="dot dot-offline"></span>
      {{- else -}}
        <span class="dot dot-asleep"></span>
      {{- end -}}

Then change the last-seen text branch (lines 28-29):

    {{- else if eq $h.Status "offline" -}}
      <span class="text-ink-mute">last seen <span class="mono">{{relTime $h.LastSeenAt}}</span></span>

to:

    {{- else if eq $h.Status "offline" -}}
      {{- if $h.AlwaysOn -}}
        <span class="text-ink-mute">last seen <span class="mono">{{relTime $h.LastSeenAt}}</span></span>
      {{- else -}}
        <span class="text-ink-mute">asleep · <span class="mono">{{relTime $h.LastSeenAt}}</span> · will catch up on return</span>
      {{- end -}}

And the row-action label (lines 55-56):

    {{- if eq $h.Status "offline" -}}
      <span class="mono text-xs text-ink-fade">offline</span>

to:

    {{- if eq $h.Status "offline" -}}
      <span class="mono text-xs text-ink-fade">{{if $h.AlwaysOn}}offline{{else}}asleep{{end}}</span>
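
The nested if/else used for the row-action label can be tried outside the app with html/template. This sketch is a simplification: hostView, rowLabel and renderLabel are hypothetical names, the markup is dropped, and an "online" fallback is added so every case prints something.

```go
package main

import (
	"fmt"
	"html/template"
	"strings"
)

// rowLabel reproduces only the branching of the row-action label.
const rowLabel = `{{if eq .Status "offline"}}{{if .AlwaysOn}}offline{{else}}asleep{{end}}{{else}}online{{end}}`

var labelTmpl = template.Must(template.New("label").Parse(rowLabel))

// hostView is a hypothetical stand-in for the fields the template reads.
type hostView struct {
	Status   string
	AlwaysOn bool
}

// renderLabel executes the template and returns the rendered text.
func renderLabel(h hostView) string {
	var b strings.Builder
	if err := labelTmpl.Execute(&b, h); err != nil {
		return "error: " + err.Error()
	}
	return b.String()
}

func main() {
	fmt.Println(renderLabel(hostView{Status: "offline", AlwaysOn: true}))  // offline
	fmt.Println(renderLabel(hostView{Status: "offline", AlwaysOn: false})) // asleep
	fmt.Println(renderLabel(hostView{Status: "online", AlwaysOn: false}))  // online
}
```

The stored status stays "offline" in both cases; only the presentation changes, which matches the plan's goal of leaving the online/offline state machine untouched.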
  • Step 4: Asleep dot + last-seen in host_chrome.html

In web/templates/partials/host_chrome.html, change the offline dot branch (lines 36-37):

        {{else if eq $host.Status "offline"}}
          <span class="dot dot-offline"></span>

to:

        {{else if eq $host.Status "offline"}}
          {{if $host.AlwaysOn}}
            <span class="dot dot-offline"></span>
          {{else}}
            <span class="dot dot-asleep"></span>
          {{end}}

And the last-seen line (lines 90-94):

        {{if eq $host.Status "offline"}}
          <span>last seen <span class="mono text-ink-mid">{{relTime $host.LastSeenAt}}</span></span>
        {{else}}
          <span>online · last heartbeat <span class="mono text-ink-mid">{{relTime $host.LastSeenAt}}</span></span>
        {{end}}

to:

        {{if eq $host.Status "offline"}}
          {{if $host.AlwaysOn}}
            <span>last seen <span class="mono text-ink-mid">{{relTime $host.LastSeenAt}}</span></span>
          {{else}}
            <span>asleep · last seen <span class="mono text-ink-mid">{{relTime $host.LastSeenAt}}</span> · will catch up on return</span>
          {{end}}
        {{else}}
          <span>online · last heartbeat <span class="mono text-ink-mid">{{relTime $host.LastSeenAt}}</span></span>
        {{end}}
  • Step 5: Add the 24×7 chip + mode toggle to host_chrome.html

In the header tags block (lines 42-48), after the tags "edit" / "add tags" button and before the closing </div> at line 48, add the chip (shown only when the host is always-on) and a small toggle button that mirrors the tags-editor reveal pattern:

          {{if $host.AlwaysOn}}<span class="tag" title="Expected online 24×7 — offline raises an alert">24×7</span>{{end}}
          <button type="button" class="text-ink-fade text-[11px] hover:text-ink-mid whitespace-nowrap"
                  style="padding: 2px 8px; border: 1px dashed var(--line); border-radius: 3px; cursor: pointer;"
                  onclick="document.getElementById('mode-edit-{{$host.ID}}').classList.toggle('hidden')"
                  title="Change presence mode">presence</button>

Then add the toggle form right after the tags <form> block (after line 82, before the <div class="flex items-center gap-3 mt-3 ..."> at line 83):

      {{/* Presence-mode editor — hidden by default; toggled by the
           "presence" button. Checkbox present => always-on (24×7);
           unchecked => intermittent (laptop): no offline alerts, shows
           "asleep", auto-catches-up a missed backup on reconnect. */}}
      <form id="mode-edit-{{$host.ID}}" method="post"
            action="/hosts/{{$host.ID}}/mode"
            class="hidden mt-3" style="max-width: 640px;">
        <label class="flex items-center gap-2 text-[12px] text-ink-mid">
          <input type="checkbox" name="always_on" value="on" {{if $host.AlwaysOn}}checked{{end}} />
          Always On — expected online 24×7
        </label>
        <div class="field-help">
          Uncheck for an intermittent host (laptop/workstation): it won’t
          raise offline alerts when asleep, shows an “asleep” state, and
          catches up a missed backup ~1 minute after it reconnects.
        </div>
        <button type="submit" class="btn btn-primary mt-2 whitespace-nowrap">Save presence</button>
      </form>
  • Step 6: Verify templates parse

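Run: go build ./... && go test ./internal/server/... -run Template -v (if a template-render test exists; otherwise rely on the smoke run in Step 7). At minimum, go build ./... must pass. Note that go build only catches Go compile errors: html/template parses templates at runtime, so a template syntax error surfaces only in a render test or in the smoke run.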
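
If no template-render test exists yet, a check along these lines could cover the new branch. It is a sketch under assumptions, not the repository's test: hostView is a hypothetical stand-in for the data the partial receives, the template string copies only the row-action label branch from Step 3 (using . instead of $h), and the "online" else branch is invented so the snippet is a complete template.

```go
package main

import (
	"bytes"
	"fmt"
	"html/template"
)

// hostView is a hypothetical stand-in for the host fields the partial reads.
type hostView struct {
	Status   string
	AlwaysOn bool
}

// rowLabelTmpl mirrors the offline/asleep label branch from host_row.html.
// The "online" else branch is an assumption added to close the template.
const rowLabelTmpl = `{{- if eq .Status "offline" -}}` +
	`{{if .AlwaysOn}}offline{{else}}asleep{{end}}` +
	`{{- else -}}online{{- end -}}`

// renderRowLabel parses and executes the snippet, returning parse errors
// instead of panicking so a test can report them.
func renderRowLabel(h hostView) (string, error) {
	t, err := template.New("row_label").Parse(rowLabelTmpl)
	if err != nil {
		return "", fmt.Errorf("parse: %w", err)
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, h); err != nil {
		return "", fmt.Errorf("execute: %w", err)
	}
	return buf.String(), nil
}

func main() {
	cases := []hostView{
		{Status: "offline", AlwaysOn: true},
		{Status: "offline", AlwaysOn: false},
		{Status: "online", AlwaysOn: false},
	}
	for _, h := range cases {
		out, err := renderRowLabel(h)
		if err != nil {
			panic(err)
		}
		fmt.Printf("status=%s always_on=%t -> %s\n", h.Status, h.AlwaysOn, out)
	}
}
```

A real test would load the actual partials with the project's own template functions (relTime, for example) rather than an inline string; the inline copy here only shows the shape of the assertion.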

  • Step 7: Manual smoke (per CLAUDE.md smoke targets)
make smoke-deploy

Then in a browser (or Playwright): open the dashboard and a host detail page. Toggle a host to intermittent via the "presence" control, confirm the 24×7 chip disappears, and confirm an offline/sleeping intermittent host renders the grey "asleep · … · will catch up on return" line instead of red "offline". Toggle back and confirm the chip returns.

  • Step 8: Commit
git add web/styles/input.css web/templates/partials/host_row.html web/templates/partials/host_chrome.html
git commit -m "feat(ui): asleep state, 24×7 chip, presence toggle for host mode"

Task 7: Record in tasks.md + final verification

Files:

  • Modify: tasks.md

  • Step 1: Add a tasks.md entry

Add a [x] entry under "Next steps from testing" in tasks.md (mirroring the NS-07 style — one line + a short "As shipped" note) describing the always-on/intermittent host mode: always_on column (default on), offline-alert suppression + 7-day staleness alert for intermittent hosts, settle-then-catch-up on reconnect, and the asleep UI + 24×7 chip + presence toggle.

  • Step 2: Full verification
go vet ./...
go test ./...

Expected: vet clean, all tests green.

  • Step 3: Commit
git add tasks.md
git commit -m "docs(tasks): record always-on/intermittent host mode"

Self-Review notes

  • Spec coverage: §1 data model → Task 1. §2 mechanics unchanged → no task needed (verified untouched). §3 alerts (suppress offline, staleness, resolve-on-backup, resolve-on-toggle) → Task 4 + Task 5 Step 1. §4 catch-up (arm on hello, settle, per-schedule overdue, dispatch, guards) → Tasks 2-3. §5 UI (dot-asleep, asleep text, 24×7 chip, toggle) → Task 6. Testing → tests in Tasks 1-5. Out-of-scope items respected (global 7d const, reconnect-only, no agent-side cron, always-on stale_schedule untouched).
  • Type consistency: scheduleOverdue(cronExpr string, *time.Time, time.Time) bool, ArmCatchup(hostID string, now time.Time), RunCatchupsDue(ctx), SetHostAlwaysOn(ctx, hostID, bool), ResolveOnModeChange(ctx, hostID, when), Host.AlwaysOn bool — used consistently across tasks.
  • No invented methods: every existing call the plan relies on (GetHost, ListSchedulesByHost, GetSourceGroup, SetHostLastBackup, ListAlerts, AppendAudit, dispatchBackupForGroupCore, Hub.Conn/Connected) is present in the current tree; SetHostAlwaysOn is the only new store method and is defined in Task 1.
  • Test helper caveat: the alert and HTTP handler tests reference package-local helpers (newTestEngine, newUITestServer, etc.) that must be matched to the real names in existing _test.go files at implementation time — flagged inline in each task.