Compare commits
8 Commits
v1.0.1...7aaafceab5
| SHA1 |
|---|
| 7aaafceab5 |
| 4c9641b6ed |
| ff65d39f25 |
| 9d16e3f7e3 |
| 261b83ec26 |
| 0c3a0844e4 |
| 2dae61f678 |
| 55cb8909c7 |
@@ -0,0 +1,223 @@
# Always-On vs Intermittent host mode

**Date:** 2026-06-15
**Branch:** `feat-laptop-host-mode`
**Status:** Design, awaiting review

## Problem

The server currently assumes every host should be present 24×7. When an
agent stops heartbeating for 90s it is flipped to `offline`, and after 15
minutes that raises a `warning` alert. This is correct for a server, but
wrong for a host that legitimately comes and goes: a workstation or
laptop that sleeps overnight, travels, or is shut down on weekends. Such
a host generates noise alerts every time it is closed, and, more
importantly, there is **no mechanism to catch up a backup it missed
while it was away.**

Two distinct facts make the catch-up gap real:

- **Backup cron runs on the agent, locally.** The agent fires
  `MsgScheduleFire`; the server only dispatches in response. If the host
  is asleep, the agent process is suspended, so the cron tick never
  fires and no `MsgScheduleFire` is ever sent.
- Therefore the existing `pending_runs` retry queue **does not** cover
  this case. `pending_runs` only gets a row when a schedule *fired* but
  the agent was momentarily disconnected at dispatch time. A window
  missed entirely during sleep never enqueues anything.

## Goal

Let an operator mark a host as **not** always-on. Such a host:

1. Does **not** raise offline/agent-down alerts when it is not visible.
2. Renders a distinct, calm "asleep" state in the UI instead of the
   alarming red "offline".
3. When it reconnects, after a short settle delay, the server checks
   whether it missed a scheduled backup and, if so, triggers a
   catch-up backup automatically.
4. Still raises a *staleness* alert if it has genuinely gone too long
   without any backup (a host left in a drawer). This is the only
   alert covering an asleep host: while the agent is offline no job
   runs, so there is no failure to detect; staleness is the safety
   net for "no backups are happening at all."
5. Leaves normal job-failure alerting untouched: a backup that
   actually runs (scheduled or catch-up) and fails alerts as it does
   today. Failures can only occur while the agent is online and
   executing restic.

Default behaviour is unchanged for the entire existing fleet.

## Decisions (from brainstorming)

- **Setting shape:** a single boolean `Always On` checkbox per host,
  **default ON**. Checked = today's 24×7 server semantics. Unchecked =
  intermittent host. Opt-in only; zero behaviour change for current and
  future hosts unless explicitly toggled.
- **Overdue trigger:** evaluated on **reconnect + behind schedule**
  (not a continuous always-evaluating sweep).
- **Alert policy for intermittent hosts:** suppress offline alerts;
  keep a long-threshold **staleness** alert; keep job-failure alerts.
- **Staleness threshold:** **7 days**, a global constant for v1. May
  become per-host configurable later; out of scope now.
- **Catch-up granularity:** **per enabled schedule.** A host with a
  daily and a weekly schedule catches up only whichever is actually
  behind.
- **UI vocabulary:** not-visible intermittent host shows a grey
  `asleep` state; detail line reads
  `asleep · last seen <relTime> · will catch up on return`.
- **Chip:** chip and checkbox highlight the **same** truth (24×7). Show
  a chip for **Always-On** hosts; **no** chip for intermittent.

## Architecture

The change is deliberately a thin policy + presentation layer over the
existing online/offline state machine. We do **not** add a new `status`
enum value or alter heartbeat / `last_seen_at` tracking. "Asleep" is a
reinterpretation of `status='offline' AND NOT always_on`.

### 1. Data model

- **Migration `0024_hosts_always_on.sql`:**
  ```sql
  ALTER TABLE hosts ADD COLUMN always_on INTEGER NOT NULL DEFAULT 1;
  ```
  Column-level ALTER per the repo's migration rules. Default `1` means
  every existing row is Always-On, so there is no behaviour change on
  upgrade.
- `store/types.go`: add `AlwaysOn bool` to the `Host` struct; thread it
  through every host SELECT scan and the host insert/update paths.
- New store helper `SetHostAlwaysOn(ctx, hostID, bool) error`.

### 2. Online/offline mechanics: UNCHANGED

The 30s offline sweeper (`cmd/server/main.go:220`) still flips an unseen
host to `status='offline'` and still calls
`alertEngine.NotifyHostOffline(id)`. `TouchHost` / `MarkHostHello`
behaviour is untouched. The intermittent distinction is applied
*downstream* of this state, in the alert engine and the templates.

### 3. Alert behaviour

All changes key off `host.AlwaysOn`, which the engine already has access
to via the host row it loads.

- **Suppress offline alert** (`alert/engine.go` `handleHostOffline()`
  and the 60s `tick()`): when `!host.AlwaysOn`, do not raise
  `agent_offline`.
- **Resolve-on-toggle:** when a host is switched server→intermittent and
  has an open `agent_offline` alert, auto-resolve it. (Handled in the
  mode-change handler, fanning through the normal resolve path so
  channels/audit fire as usual.)
- **Staleness alert:** wire up the currently-dead `KindStaleSchedule`
  constant, **for intermittent hosts only.** On the 60s tick, for each
  host where `!AlwaysOn` AND the host has ≥1 enabled schedule AND
  `LastBackupAt != nil` AND `now - LastBackupAt > 7*24h`: raise a
  `warning` `stale_schedule` alert (dedup key `""`, one per host).
  Auto-resolves when `LastBackupAt` advances past the threshold (i.e.
  any successful backup, including the catch-up). Always-On hosts'
  `stale_schedule` remains a no-op (unchanged, out of scope).
  - If `LastBackupAt == nil` (intermittent host enrolled but never
    backed up): no staleness alert in v1. There is no baseline to
    measure against, and onboarding probe state (`repo_status`) already
    covers "never successfully set up."
- **Job-failure alerts:** untouched. A catch-up backup that runs and
  fails alerts exactly like any other backup.

### 4. Catch-up on reconnect

A new small component, the **catch-up scheduler**, lives server-side
alongside the existing ticks.

- **Arm:** on agent hello (`server/ws/handler.go` hello path /
  `onAgentHello`), if the host is `!AlwaysOn`, record
  `catchupDueAt[hostID] = now + 60s` in an in-memory map. Re-arming on a
  subsequent hello just overwrites the timestamp (debounce: rapid
  flapping does not stack catch-ups). In-memory is acceptable: catch-up
  is best-effort and a server restart simply re-arms on the next hello.
- **Fire:** reuse the existing 30s server tick. For each due entry
  (`catchupDueAt <= now`):
  1. Re-verify the agent is still connected (`Hub.Connected(hostID)`).
     If it bounced back offline within the settle window, drop the entry
     (it will re-arm on the next hello).
  2. Skip if a backup is already running or queued for the host
     (`current_job_id` set, or a relevant `pending_runs` row exists) to
     avoid double-firing alongside a normal dispatch or pending drain.
  3. For each **enabled** schedule on the host, compute overdue:
     ```
     overdue := !sched.Next(*host.LastBackupAt).After(now)
     ```
     using `robfig/cron/v3` (already a dependency) to parse
     `Schedule.CronExpr`. `Next(lastBackup)` is the first fire strictly
     after the last successful backup; if that moment has already
     passed, the window was missed → overdue. (If `LastBackupAt` is nil,
     treat as overdue so a never-backed-up intermittent host with a
     schedule gets its first run on connect.)
  4. For each overdue schedule, dispatch its source-groups via the
     existing `dispatchBackupForGroupCore()`.
  5. Clear the entry.

Net latency is ~60–90s after wake (60s settle + up to one 30s tick).
This path is independent of and complementary to the `pending_runs`
drain, which continues to handle the fired-but-not-sent case.

### 5. UI

- **CSS:** new grey `dot-asleep` token in `web/styles/input.css`,
  visually distinct from red `dot-offline`.
- **`partials/host_row.html` and `partials/host_chrome.html`:** when
  `!AlwaysOn && status=='offline'`, render the grey dot + label
  `asleep`; the detail/last-seen line reads
  `asleep · last seen <relTime> · will catch up on return`. All other
  states unchanged.
- **24×7 chip:** on the host detail header, render a small
  `Always On` / `24×7` chip **only when `AlwaysOn` is true**. No chip
  for intermittent hosts. (Chip and checkbox highlight the same fact.)
- **Toggle:** an `Always On` checkbox (default checked) on the host edit
  surface. Operator-band `POST` (mirrors existing host-edit handlers),
  audited as `host.mode_updated`. On save, if switching to intermittent,
  trigger the resolve-on-toggle path for any open `agent_offline` alert.

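Since "asleep" is a presentation-only reinterpretation of the stored status, the template logic comes down to a small mapping. The sketch below is illustrative: `displayState`, `asleepDetail`, and `showAlwaysOnChip` are hypothetical helper names, not functions confirmed to exist in the templates' function map.

```go
package main

import "fmt"

// displayState maps the stored status and the always-on flag to the
// label the UI shows. Only offline intermittent hosts become "asleep";
// every other combination passes through unchanged.
func displayState(status string, alwaysOn bool) string {
	if status == "offline" && !alwaysOn {
		return "asleep"
	}
	return status
}

// asleepDetail builds the detail line quoted in the spec from an
// already-formatted relative time such as "3h ago".
func asleepDetail(relTime string) string {
	return "asleep · last seen " + relTime + " · will catch up on return"
}

// showAlwaysOnChip reports whether the 24×7 chip is rendered.
func showAlwaysOnChip(alwaysOn bool) bool {
	return alwaysOn
}

func main() {
	fmt.Println(displayState("offline", false))
	fmt.Println(displayState("offline", true))
	fmt.Println(asleepDetail("3h ago"))
	fmt.Println(showAlwaysOnChip(false))
}
```

Deriving the label at render time keeps the stored `status` column untouched, which matches the decision not to add a new enum value.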
## Error handling & edge cases

- **Toggle server→intermittent while offline+alerting:** open
  `agent_offline` alert auto-resolved on save.
- **Toggle intermittent→server while asleep:** host resumes normal
  offline/alert semantics; it will alert per the 15-minute floor once
  the sweeper/tick next evaluates it.
- **No enabled schedules:** no catch-up and no staleness alert, because
  there is no backup expectation to measure against.
- **Catch-up vs in-flight work:** guarded by the running/queued check in
  step 4.2 so catch-up never races a normal dispatch or pending drain.
- **Agent flaps during settle window:** entry dropped if not connected
  at fire time; re-armed on the next hello.

## Testing

- **Alert engine (unit):**
  - offline alert suppressed when `!AlwaysOn`.
  - staleness alert raised when intermittent + schedule + last backup >
    7d; not raised for Always-On hosts; not raised when last backup is
    recent; not raised when no enabled schedule.
  - staleness alert auto-resolves after a backup advances `LastBackupAt`.
  - server→intermittent toggle resolves an open `agent_offline` alert.
- **Overdue computation (unit, table-driven):**
  `(cronExpr, lastBackupAt, now) → overdue?` including nil-last-backup and
  daily/weekly cases.
- **Catch-up scheduler (unit):** fires only when still connected; skips
  when a backup is running/queued; dispatches only overdue schedules.
- **UI (render test):** asleep state + 24×7 chip render under the right
  conditions; offline state for Always-On hosts unchanged.
- `go vet ./...` and full `go test ./...` green before merge.

## Out of scope

- Per-host staleness thresholds (global 7d constant for v1).
- Continuous (non-reconnect) overdue evaluation.
- Agent-side catch-up cron: the server is the reliable arbiter.
- Wiring `stale_schedule` for Always-On hosts (separate concern).

## Task tracking

Add an entry to `tasks.md` under "Next steps from testing" (or a new
small section) once the plan is approved, per the repo's tasks.md
source-of-truth rule.
@@ -0,0 +1,29 @@
// catchup.go: server-side catch-up for intermittent (non-always-on)
// hosts. When such a host reconnects we wait a short settle window,
// then dispatch a backup for any schedule whose window elapsed while
// the host was asleep. This is separate from pending_runs: a host that
// was asleep never fired its local cron, so no pending row exists.
package http

import (
	"time"
)

// scheduleOverdue reports whether a schedule's most recent expected
// fire is newer than the host's last successful backup, i.e. a window
// passed with no backup. A nil lastBackup means "never backed up" and
// is always overdue (provided the cron parses). An unparseable cron is
// treated as not-overdue so a bad expression can never trigger a
// surprise dispatch. Uses the same cronParser the agent's scheduler
// and schedule validation use, so interpretation is identical.
func scheduleOverdue(cronExpr string, lastBackup *time.Time, now time.Time) bool {
	sched, err := cronParser.Parse(cronExpr)
	if err != nil {
		return false
	}
	if lastBackup == nil {
		return true
	}
	next := sched.Next(*lastBackup)
	return !next.After(now)
}
@@ -0,0 +1,41 @@
package http

import (
	"testing"
	"time"
)

func TestScheduleOverdue(t *testing.T) {
	mustParse := func(s string) time.Time {
		t.Helper()
		v, err := time.Parse(time.RFC3339, s)
		if err != nil {
			t.Fatalf("parse %q: %v", s, err)
		}
		return v
	}
	daily := "0 2 * * *" // 02:00 every day

	cases := []struct {
		name       string
		cron       string
		lastBackup *time.Time
		now        time.Time
		want       bool
	}{
		{name: "never backed up is overdue", cron: daily, lastBackup: nil, now: mustParse("2026-06-15T09:00:00Z"), want: true},
		{name: "missed last night's window", cron: daily, lastBackup: ptrTime(mustParse("2026-06-13T02:05:00Z")), now: mustParse("2026-06-15T09:00:00Z"), want: true},
		{name: "backed up after the most recent window", cron: daily, lastBackup: ptrTime(mustParse("2026-06-15T02:05:00Z")), now: mustParse("2026-06-15T09:00:00Z"), want: false},
		{name: "unparseable cron is never overdue", cron: "not a cron", lastBackup: nil, now: mustParse("2026-06-15T09:00:00Z"), want: false},
	}
	for _, c := range cases {
		t.Run(c.name, func(t *testing.T) {
			got := scheduleOverdue(c.cron, c.lastBackup, c.now)
			if got != c.want {
				t.Fatalf("scheduleOverdue(%q, %v, %v) = %v, want %v", c.cron, c.lastBackup, c.now, got, c.want)
			}
		})
	}
}

func ptrTime(t time.Time) *time.Time { return &t }
+25 -4
@@ -44,7 +44,7 @@ func (s *Store) LookupHostByAgentToken(ctx context.Context, tokenHash string) (*
		repo_size_bytes, snapshot_count, open_alert_count,
		applied_schedule_version, bandwidth_up_kbps, bandwidth_down_kbps,
		pre_hook_default, post_hook_default,
-		repo_status, repo_status_error
+		repo_status, repo_status_error, always_on
		FROM hosts WHERE agent_token_hash = ?`,
		tokenHash)
	return scanHost(row)
@@ -59,7 +59,7 @@ func (s *Store) GetHost(ctx context.Context, id string) (*Host, error) {
		repo_size_bytes, snapshot_count, open_alert_count,
		applied_schedule_version, bandwidth_up_kbps, bandwidth_down_kbps,
		pre_hook_default, post_hook_default,
-		repo_status, repo_status_error
+		repo_status, repo_status_error, always_on
		FROM hosts WHERE id = ?`, id)
	return scanHost(row)
}
@@ -227,7 +227,7 @@ func (s *Store) ListHosts(ctx context.Context) ([]Host, error) {
		repo_size_bytes, snapshot_count, open_alert_count,
		applied_schedule_version, bandwidth_up_kbps, bandwidth_down_kbps,
		pre_hook_default, post_hook_default,
-		repo_status, repo_status_error
+		repo_status, repo_status_error, always_on
		FROM hosts ORDER BY name`)
	if err != nil {
		return nil, fmt.Errorf("store: list hosts: %w", err)
@@ -267,6 +267,7 @@ func scanHostRow(s hostScanner) (*Host, error) {
		tags string
		bwUp, bwDown sql.NullInt64
		preHook, postHook sql.NullString
+		alwaysOn int
	)
	err := s.Scan(&h.ID, &h.Name, &h.OS, &h.Arch,
		&h.AgentVersion, &h.ResticVersion, &h.ProtocolVersion,
@@ -275,7 +276,7 @@ func scanHostRow(s hostScanner) (*Host, error) {
		&h.RepoSizeBytes, &h.SnapshotCount, &h.OpenAlertCount,
		&h.AppliedScheduleVersion, &bwUp, &bwDown,
		&preHook, &postHook,
-		&h.RepoStatus, &h.RepoStatusError)
+		&h.RepoStatus, &h.RepoStatusError, &alwaysOn)
	if err != nil {
		if errors.Is(err, sql.ErrNoRows) {
			return nil, ErrNotFound
@@ -330,6 +331,7 @@ func scanHostRow(s hostScanner) (*Host, error) {
	if postHook.Valid {
		h.PostHookDefault = postHook.String
	}
+	h.AlwaysOn = alwaysOn != 0
	return &h, nil
}

@@ -378,6 +380,25 @@ func (s *Store) SetHostTags(ctx context.Context, hostID string, tags []string) e
	return nil
}

+// SetHostAlwaysOn flips the host's always-on flag. true = 24x7 server
+// (default); false = intermittent host (laptop). See the
+// always-on-host-mode spec.
+func (s *Store) SetHostAlwaysOn(ctx context.Context, hostID string, alwaysOn bool) error {
+	v := 0
+	if alwaysOn {
+		v = 1
+	}
+	res, err := s.db.ExecContext(ctx,
+		`UPDATE hosts SET always_on = ? WHERE id = ?`, v, hostID)
+	if err != nil {
+		return fmt.Errorf("store: set host always_on: %w", err)
+	}
+	if n, _ := res.RowsAffected(); n == 0 {
+		return ErrNotFound
+	}
+	return nil
+}
+
// DistinctHostTags returns the union of every tag in use across the
// fleet, sorted. Powers the autocomplete on the host-tags editor and
// the chip-row filter on the dashboard. Cheap at fleet sizes this
@@ -0,0 +1,55 @@
package store

import (
	"context"
	"testing"
	"time"
)

func TestHostAlwaysOnDefaultAndToggle(t *testing.T) {
	ctx := context.Background()
	st := openTestStore(t)

	h := Host{
		ID: "h-always-on", Name: "lap", OS: "linux", Arch: "amd64",
		ProtocolVersion: 1, EnrolledAt: time.Now().UTC(),
	}
	if err := st.CreateHost(ctx, h, "tok-hash", "pin"); err != nil {
		t.Fatalf("create host: %v", err)
	}
	got, err := st.GetHost(ctx, h.ID)
	if err != nil {
		t.Fatalf("get host: %v", err)
	}
	if !got.AlwaysOn {
		t.Fatalf("new host should default to always_on=true, got false")
	}

	if err := st.SetHostAlwaysOn(ctx, h.ID, false); err != nil {
		t.Fatalf("set always_on: %v", err)
	}
	got, err = st.GetHost(ctx, h.ID)
	if err != nil {
		t.Fatalf("get host 2: %v", err)
	}
	if got.AlwaysOn {
		t.Fatalf("expected always_on=false after toggle, got true")
	}

	hosts, err := st.ListHosts(ctx)
	if err != nil {
		t.Fatalf("list hosts: %v", err)
	}
	if len(hosts) != 1 || hosts[0].AlwaysOn {
		t.Fatalf("ListHosts should report always_on=false, got %+v", hosts)
	}

	// Verify the agent hot-path (LookupHostByAgentToken) also reflects the toggle.
	byToken, err := st.LookupHostByAgentToken(ctx, "tok-hash")
	if err != nil {
		t.Fatalf("lookup by agent token: %v", err)
	}
	if byToken.AlwaysOn {
		t.Fatalf("LookupHostByAgentToken: expected always_on=false after toggle, got true")
	}
}
@@ -0,0 +1,6 @@
-- 0024: distinguish always-on (24x7 server) hosts from intermittent
-- hosts (laptops/workstations that legitimately sleep). Default 1 so
-- every existing and future host keeps today's offline/alert
-- semantics unless explicitly opted out. Column-level ALTER per the
-- repo's migration rules (no table rebuild; hosts has inbound FKs).
ALTER TABLE hosts ADD COLUMN always_on INTEGER NOT NULL DEFAULT 1;
@@ -99,6 +99,12 @@ type Host struct {
	// agent-side message when RepoStatus == "init_failed".
	RepoStatus string
	RepoStatusError string
+
+	// AlwaysOn is true for 24x7 server hosts (the default). When false
+	// the host is intermittent (laptop/workstation): offline alerts are
+	// suppressed, the UI shows an "asleep" state, and a missed backup is
+	// caught up ~1 min after reconnect. See the always-on-host-mode spec.
+	AlwaysOn bool
}

// Schedule is now intentionally slim: cron + which groups + enabled.
@@ -498,6 +498,7 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
- [x] **NS-03** Auto-init repo on first onboard, surface credential failures eagerly. ✅ Landed: migration 0020 adds `hosts.repo_status` (`unknown`/`ready`/`init_failed`) + `repo_status_error`; WS handler projects every init job's terminal state onto the host row (with idempotent "config file already exists" → ready); creds-save handlers (UI + JSON API) reset status to `unknown` and dispatch a fresh init when the agent is online; new `/hosts/{id}/repo/probe` retry endpoint and a status banner on the repo page. Remainder of original scope below. surface credential failures eagerly. Today the operator types repo URL + creds during Add-host and the credentials are pushed to the agent on connect, but no `restic init`/probe runs until the first scheduled job — so a typo in the password or a wrong URL goes undetected for hours/days, manifesting as a silent missed-backup. Wanted behaviour: when the host completes enrolment (or when an admin saves new repo creds), the server dispatches a one-shot probe job that runs `restic cat config` (cheap, repo-existence + creds-validity in one call). On `Is there already a config file? unable to open config file` → run `restic init`. On success → mark the host's repo as ready. On any other error (network, auth, fingerprint) → surface a panel-level error on the host detail page and audit the failure, leaving the host in an "init pending" state with a "Retry" button. Needs: a new `JobKind` (or piggyback on an existing one) for the probe, server-side state on the host row (`repo_status` enum: `unknown`/`ready`/`init_pending`/`init_failed`), UI panel that shows the state, and clear copy on the Add-host page so the operator knows the save isn't fire-and-forget.
- [x] **NS-05** Drop redundant `actions/setup-go` from `.gitea/workflows/ci.yml`. ✅ Already gone — verified `.gitea/workflows/ci.yml` has zero `actions/setup-go@v5` invocations and no `GO_VERSION` env; the file's header comment now documents that the runner image (`gitea.dcglab.co.uk/steve/ci-runner-go`) is the single source of truth for the Go version. Closing as done; no further code change needed.
- [x] **NS-06** Remove the permanently-disabled "Run backup now" button from `web/templates/partials/host_chrome.html`. ✅ Landed: dropped the disabled tombstone button from the host header action row; only "Edit credentials" + the ⋯ menu remain. Per-source-group Run-now on `/hosts/{id}/sources` is the only path now. No e2e change needed — `smoke.spec.ts` does not assert on host_chrome's button row.
+- [x] **NS-07** Relative timestamps go stale on long-open tabs. ✅ Landed: `formatRelTime` now wraps its label in `<time data-rel-ts=…>` and both layouts (`base.html`, `chromeless.html`) carry a small ticker that re-renders every 30s, so a page rendered an hour ago no longer keeps showing "2h ago" when the wall-clock truth is "3h ago". Covered by `funcs_test.go`. The bug: every relative label was computed once at server render and never updated client-side, so a job-detail page left open drifted further from reality the longer it sat.
- [x] **NS-04** Dashboard parity with the alerts screen: live refresh, column sorting, filters. ✅ Landed: `/` now parses `q`/`status`/`repo_status`/`tag`/`sort`/`dir` query params (round-trip durable for bookmarks); table is wrapped in an `id="hosts-table"` htmx live-poll matching the alerts cadence (5s, gated on `document.visibilityState` and `localStorage.rm-dashboard-live`); filter row above the table with hostname free-text + status + repo_status selects + tag chips + clear; column headers (Host / OS · arch / Last backup / Repo size / Snapshots) are clickable links that toggle direction on the active column; pure-Go sort+filter pipeline covered by `dashboard_filter_test.go`. Original scope below. live refresh, column sorting, filters. The host list is currently a static render — operators have to reload to see new heartbeats / job state changes. Mirror the alerts pattern (`web/templates/pages/alerts.html` uses `hx-trigger="every 5s [document.visibilityState==='visible' && localStorage.getItem('rm-alerts-live')!=='off']"` plus a Live/Off toggle so background tabs and explicit-off don't burn server cycles). Add: server-side sort on every meaningful column (name, OS, last-backup time, last-backup status, agent online/offline, restic version, tags), and a small filter row above the table — at minimum free-text on hostname, status (online/offline/never-seen), and tag chips. Columns + filter state should round-trip through query string so a bookmarked / shared URL is durable. Re-use the `host_row` partial that already exists so the live-refresh swap is a clean OOB swap, not a full table re-render.

---