Adds /hosts/{id}/jobs page listing recent jobs for the host (newest
first, capped at 100) with click-through to /jobs/{id}. Converts the
Jobs placeholder <div> to a real <a> nav link; removes the Settings
stub entirely. Also registers durationHuman template func and a
.jobs-row CSS grid to match the existing .schd-row idiom.
The dashboard's 'Last backup' column reads hosts.last_backup_at /
last_backup_status, but the WS handler only updated hosts.repo_status
on job.finished — backup terminations were silently dropped. Add a
SetHostLastBackup store method and call it from the same job.finished
switch that already handles init jobs.
Also: CLAUDE.md restage block uses /tmp/rm-smoke (the original
default) but the actual dev env runs out of $HOME/smoke. Update the
paths in the doc to match.
- alert: update_failed (per-host, dedup=hostID) + fleet_update_halted
(system-scoped, host_id NULL via new RaiseOrTouchSystem helper).
- ws: UpdateWatcher tracks in-flight command.update dispatches and
reconciles them against incoming hello envelopes — success path
marks the job succeeded and auto-resolves the alert; 90s timeout
marks the job failed and raises update_failed.
- http: POST /api/hosts/{id}/update (admin-only JSON) + the HTMX
/hosts/{id}/update form variant. Pre-checks: host exists, online,
agent_version != current, no running update job. Refactored core
into Server.dispatchHostUpdate so the fleet worker can share it
without going through HTTP.
- fleetupdate: rolling worker iterating through host slots, halting
on first failure and raising fleet_update_halted. Polling-based
version-match (re-read hosts.agent_version every 1s up to 95s) —
no extra plumbing into the WS hello path. At-most-one-running is
enforced at the store layer (ErrFleetUpdateRunning).
- cmd/server: wire UpdateWatcher and FleetWorker into the main
goroutine; the worker uses a small serverDispatcher adapter that
delegates back into Server.DispatchHostUpdate.
Tests: watcher (success/timeout/mismatch/late-hello), HTTP endpoint
(happy + four pre-check branches + RBAC), worker (two-host happy,
timeout-halt, host-offline-halt, already-at-target skip, cancel
mid-run, double-Start guard).
Smoothes the rough edges that came up exercising a live deployment.
First-run bootstrap UI: /bootstrap renders a username + password form
that uses the in-memory token directly (operator no longer copies it
out of the log); /login redirects there while bootstrap is available.
Agent reliability: failJob synthetic envelopes so command.run early
returns no longer hang the server-side job; runtime probe of restic
restore --help drives --no-ownership instead of version sniffing
(0.18.x had it removed). Server unit re-shaped: ProtectSystem=full
plus ReadWritePaths=/etc/restic-manager, no ProtectHome — restore
can now write anywhere a user might want.
Restore wizard: default target is /root/rm-restore/<job-id>/ with
clearer help text. Re-init confirm input uses .field (was .input,
which doesn't exist — text was invisible).
NS-01 host delete: store DeleteHost, admin-band /hosts/{id}/delete
with hostname-confirm danger zone, audit, FK cascade, live WS close.
NS-02 enrollment-token recovery: outstanding-tokens panel on
/hosts/new, regenerate (preserves attachments) and revoke handlers
+ audit, store-level ListOutstandingEnrollmentTokens and
DeleteEnrollmentToken.
NS-03 repo init / probe surface: migration 0020 adds
hosts.repo_status + repo_status_error; WS handler projects every
init job's outcome onto the host row (idempotent already-initialised
collapses to ready); creds-save resets status and dispatches a fresh
probe; /hosts/{id}/repo/probe retry endpoint with banner.
NS-04 dashboard live + sort + filter: query-string filter
(q/status/repo_status/tag/sort/dir), 5s htmx live poll mirroring the
alerts pattern with a localStorage live toggle, sortable column
headers, filter row + clear.
Alerts page: ack'd-by line resolves user_id ULID to username.
Compose.yaml ignored — host-specific.
Until now the open-alert key was (host_id, kind, resolved_at IS NULL).
A host with two source groups both failing collapsed onto one
backup_failed row — second failure bumped last_seen_at and
overwrote the message but never re-fan-out. Operators saw one
alert that appeared to flap, not two distinct broken things.
Schema changes (column-level ALTER, no rebuild):
- 0015 jobs.source_group_id (FK → source_groups, ON DELETE SET NULL,
index). Populated for backup jobs in CreateJob.
- 0016 alerts.dedup_key (NOT NULL DEFAULT ''). The old alerts_open
partial index gets dropped and replaced with a UNIQUE partial
index on (host_id, kind, dedup_key) WHERE resolved_at IS NULL —
the index is now the actual dedup primitive.
Plumbing:
- RaiseOrTouch / AutoResolve / Alert struct gain dedup_key.
- engine.JobFinishedEvent gains SourceGroupID; handleJobFinished
passes it through for backup_failed only (forget/prune/check stay
repo-scoped with key='').
- ws.handler reads SourceGroupID off the freshly-loaded job row.
- dispatchJobWithPayload gains a *string sourceGroupID arg; the
per-group Run-now path and schedule.fire path pass &g.ID.
Test coverage: TestRaiseOrTouchDedupsPerSourceGroup proves two
distinct groups produce two distinct open alerts and that resolving
one does not auto-resolve the other.
Dev tool: cmd/_fake_alert gains -dedup-key flag.
Two bugs in the channel-enabled affordance:
1. List-row toggle was a static span with no handler; the row's
row-link overlay swallowed every click and routed to /edit. Add
POST /settings/notifications/{id}/toggle backed by a new store
method SetNotificationChannelEnabled, and turn the row toggle
into an htmx-driven button that swaps in the new state. Use
event.stopPropagation() on the toggle so it beats the row link.
2. Edit-form toggle visually flipped but the underlying checkbox
reverted: the visual span lives inside the <label>, so clicking
it fired the inline JS handler AND the label's native
checkbox-toggle, cancelling out. Bind to the checkbox 'change'
event instead and let the label do the toggling — the JS just
mirrors check.checked into the .on class.
The denormalised projection was never written by the alerts code
path, so the dashboard's OPEN ALERTS card and the per-host alerts
column always read 0 regardless of how many alerts were open.
fleet.GetStats sums hosts.open_alert_count; if it never moves, the
card is decoration.
Add refreshHostOpenAlertCount that recomputes from the alerts table
(self-healing — no +/- bookkeeping to drift). Call it after the
commit in RaiseOrTouch when a row was inserted, after Resolve, and
after AutoResolve.
Caught during the live sweep: a synthetic critical raised the count
to 1, but resolving it left the dashboard reading '1 unresolved'
indefinitely.
- ws.HandlerDeps gains an AlertEngine *alert.Engine field; populated
from http.Deps.AlertEngine (nil until G1 constructs the engine)
- runAgentLoop calls NotifyHostOnline after MarkHostHello succeeds
- dispatchAgentMessage MsgJobFinished case calls NotifyJobFinished,
looking up the job Kind via Store.GetJob before notifying
- store.MarkHostsOfflineStaleReturnIDs added: SELECT+UPDATE in one
transaction, returns the IDs that flipped to offline
- offline sweeper in cmd/server/main.go switched to the new variant;
TODO(G1) comment marks where NotifyHostOffline calls will land
Three small follow-ups from review:
1. Restore target is now operator-editable. Default value is the
literal '\$HOME/rm-restore/<job-id>/' (agent expands \$HOME at
run time using os.UserHomeDir(); also handles \${HOME} and ~/
prefixes). Operator can replace with any absolute path.
- ui_restore.go validates the input is either absolute or starts
with one of the recognised prefixes; other env-var refs (\$PATH
etc.) are deliberately rejected so operator paths can't pick up
arbitrary agent env values.
- host_restore.html replaces the read-only mono-text display with
a real <input>; help text spells out that \$HOME resolves
agent-side and <job-id> is substituted on dispatch.
- install.sh + the systemd unit prep /root/rm-restore so the
default works under the sandbox: ReadWritePaths gains a soft
'-/root/rm-restore' entry (the '-' makes the bind-mount soft-fail
if missing, but install.sh pre-creates it root-owned 0700).
2. --no-ownership flag now gated on restic version. The flag was
added in restic 0.17 and 0.16 rejects it. Previously dropped it
wholesale — that meant new-dir restores silently preserved
ownership against design intent on 0.17+. Now the agent threads
its detected restic version (sysinfo already collects it) through
runner.Config -> restic.Env, and RunRestore appends --no-ownership
only when AtLeastVersion(0, 17) returns true. 0.16 hosts still
restore with original uid/gid; help text in the wizard explicitly
notes this. The previous 'Original ownership is preserved' copy
was wrong for new-dir mode and is corrected.
3. golangci-lint misspell locale switched US -> UK and the codebase
swept (73 corrections, mostly behaviour/serialise/recognise/honour).
Wire-format ErrorCode 'unauthorized' -> 'unauthorised' is a tiny
contract change but the agent doesn't parse those codes today and
no external API consumers exist yet. Tests passed before + after.
Tests:
- internal/restic/version_test.go covers Env.AtLeastVersion across
edge cases (empty, exact match, patch above, minor below, non-
numeric) and expandHome on \$HOME / \${HOME} / ~/, plus
pass-through for absolute paths and refusal of other env vars.
- ui_restore_test updated: TargetDir now starts '\$HOME/rm-restore/'
with the job_id substituted into the placeholder.
Live verified on the smoke env: default target restored to
/root/rm-restore/<job-id>/ as the agent's expanded \$HOME (2 files,
14 bytes); custom override '/tmp/custom-restore/<job-id>/' restored
into the agent's PrivateTmp namespace (1 file, 6 bytes); both jobs
'succeeded', exit 0.
End-to-end wizard from /hosts/{id}/restore (or per-snapshot deep link
/hosts/{id}/snapshots/{sid}/restore) → tree-browse → dispatch →
restore-shaped live job page.
Backend (internal/server/http/ui_restore.go):
- GET handlers render the four-step wizard against the wireframe shape
in docs/superpowers/specs/2026-05-04-p3-restore-design.md.
- HTMX tree partial endpoint hits fetchTreeWithCache (P3-X2) so each
directory expansion is a sub-second cached lookup after the first
miss.
- POST validates: snapshot_id non-empty, ≥1 absolute path, in-place
mode requires confirm_hostname == host name, agent online. On error
re-renders the wizard with the operator's input intact. Happy path
mints a job_id, computes the new-directory target as
/var/restic-restore/<job-id>/ (operator can't escape the prefix —
server picks it), creates the job row, ships command.run with
kind=restore + RestorePayload, writes a host.restore audit row,
returns HX-Redirect (or 303) to the live job page.
Templates:
- host_restore.html: single-page progressively-enabled wizard matching
_diag/p3-restore-wizard wireframe. Form-state-driven JS computes a
running tally of selected paths and the step-4 confirm summary
client-side; the server re-renders on validation failure with form
fields preserved.
- partials/tree_node.html: recursive HTMX-served tree fragment.
- Top-level Restore button on host_detail right rail + per-snapshot
Restore action on snapshot rows replace the previous P3-stub.
Restore-shaped job page (job_detail.html):
- Progress widget rendered as a panel rather than a bare strip when
the job is active.
- Current-file display under the bar, updated from log.stream stdout
lines that look like absolute paths. Hidden for non-restore kinds.
Migration 0012:
- Add restore + diff to the jobs.kind CHECK. Rebuild required (SQLite
can't ALTER CHECK in place); follows the safe pattern from 0005.
Defensive: stash job_logs into a temp table before the rebuild and
INSERT OR IGNORE back afterwards so even if SQLite cascades on
DROP TABLE jobs the log history survives.
Tests:
- ui_restore_test covers GET step-1 render, GET pre-selected snapshot
summary card, POST missing snapshot, POST missing paths, POST
in-place wrong-hostname rejection (no command.run leaks to the
agent), POST happy path (HX-Redirect + correct payload + audit
row), POST against offline host returns 503.
Restage block (CLAUDE.md) deferred to the end of the restore phase.
Agent: new runner.BackupHooks struct + runHook helper invoked via
/bin/sh -c (cmd.exe /C on Windows). pre_hook non-zero exit aborts
the backup; post_hook always runs with RM_JOB_STATUS=succeeded|failed
in env. Output streamed as 'hook(<phase>): …' log.stream lines.
Hooks only run for kind=backup (other kinds skip both phases).
Server: resolveBackupHooks resolves group → host default → empty,
decrypts via crypto.AEAD with per-slot ad bytes, plumbs plaintext
into CommandRunPayload for both schedule.fire and per-group
Run-now dispatch sites. Decrypt failures degrade silently to no
hook so a malformed blob can't poison every backup.
Adds pre_hook/post_hook BLOB columns to source_groups and
pre_hook_default/post_hook_default to hosts. Bytes stored verbatim
(AEAD encrypt/decrypt happens at the HTTP layer where the AEAD key
lives). Round-trip tests cover set/clear semantics on both tables.
P2R-14. New store.LatestJobBySchedule query (per-schedule fired job).
Schedules-tab handler computes next-fire from cron + last-fire from
the jobs table per row. Schedules table grows two columns; dashboard
host row prepends 'next 12h ago/from now' to the existing last-backup
line when a single covering schedule is the run-now candidate.
Embeds store.Schedule into scheduleRow so existing template field
references keep working without bulk renames.
Widen the SQL query to consider all statuses (queued, running,
succeeded, failed, cancelled) rather than terminal-only. An in-flight
prune that outlasts the 60s tick interval previously produced
ErrNotFound, causing the ticker to anchor at now-24h and fire a second
prune concurrently with the first.
Update the doc comment and test: remove the "queued job filtered out"
case, add assertions that a running job and a queued job are each
returned as the latest.
When dispatchBackupForGroup's conn.Send errors, queue a pending_runs
row (attempt=1, next_attempt_at = now + group.RetryBackoffSeconds)
instead of silently dropping the fire. The orphaned queued job row
is left behind for forensic visibility — the drainer will create a
fresh job row on its retry.
Also adds Store.ListPendingRunsForHost — the on-reconnect drain
walks every row for the host, regardless of due-ness, since the
host being back makes 'due' irrelevant.
Cleanup pass over the repo so CI can enforce lint going forward
without the only-new-issues escape hatch:
* gofumpt -w across the tree (31 hits, all formatting)
* misspell --fix (25 hits, US-locale spelling) — but reverted on
api.JobCancelled = "cancelled" since that literal is the wire +
DB CHECK constraint value, plus matched the case in store/fleet.go
back to "cancelled" and added //nolint:misspell on both for the
next time someone reaches for the auto-fix
* Wrap every `defer rows.Close()` / `defer stmt.Close()` /
`defer res.Body.Close()` in `defer func() { _ = .Close() }()`
to satisfy errcheck without losing the close itself
* websocket.Dial callers (1 prod, 4 tests) now capture + close the
upgrade response Body — coder/websocket can return res with a nil
Body on success, so the test deferred-closes guard against that
* Annotate the two genuine-by-design nilerr cases with //nolint
comments explaining why nil-on-error is the contract (cookie
missing = no session; ctx cancelled mid-backoff = clean shutdown)
* Add brief godoc on the 10 exported const groups + types that
revive flagged (api.HostOS/HostArch/JobKind/JobStatus/LogStream/
ErrorCode, restic.EventKind, store.Role, web.FS)
* Drop the unused (*Server).userByID method
* Inline the unparam baseView(active) — every UI page is under
the dashboard primary nav today
Result: `golangci-lint run ./...` reports 0 issues. CI lint job
no longer needs only-new-issues: true; X-06 follow-up entry in
tasks.md removed.
Schedules CRUD now takes {cron, enabled, source_group_ids[]} with cron
parsed via robfig/cron/v3 and group membership scoped to the host.
New source-groups CRUD lives at /api/hosts/{id}/source-groups; delete
refuses with 409 if any schedule still references the group, returning
the schedule list so the UI can prompt 'remove from these schedules
first.' Repo-maintenance GET/PUT manages forget/prune/check cadences
on host_repo_maintenance — no version bump, the server-side ticker
(P2R-06) drives execution.
Per-source-group Run-now (POST /hosts/{id}/source-groups/{gid}/run)
resolves the group's includes/excludes/retention/tag and dispatches a
backup command.run with the new structured CommandRunPayload fields
(Includes/Excludes/Tag). Old per-host /hosts/{id}/run-backup and
/hosts/{id}/init-repo return 410 Gone with a redirect message.
schedule_push.go is rebuilt: buildScheduleSetPayload assembles the
slim wire shape, pushScheduleSetOnConn ships it during the on-hello
window, pushScheduleSetAsync fires after every CRUD mutation, and
dispatchScheduledJob handles agent schedule.fire by iterating the
schedule's source groups and dispatching one backup per group with
actor_kind=schedule and scheduled_id pointing at the schedule.
Auto-init at first WS connect: when the host has repo creds bound and
no init job in its history, server dispatches restic init. Restic's
'config file already exists' soft-success means re-runs against an
existing repo no-op; we don't auto-retry on failure (operator triggers
re-init manually via the danger zone in P2R-09).
api.Schedule drops Kind/Paths/Excludes/Tags/RetentionPolicy/Manual etc.
in favour of {id, cron, enabled, source_groups: [...]}. The agent
scheduler stops checking sch.Manual; cmd/agent's backup dispatch reads
Includes/Excludes/Tag instead of Args.
Tests cover the new HTTP surface end-to-end: source-groups CRUD with
in-use refusal, schedule validation (bad cron / missing groups /
foreign group), repo-maintenance auto-seed and validation, the 410
route, and buildScheduleSetPayload's wire-shape correctness. Full
suite passes; smoke env exercises auto-init dispatch on hello,
async push after schedule create, and per-source-group Run-now
landing the right paths/excludes/tag at the agent.
Go-side data model rebuilt against migration 0008. The fat-Schedule
shape (paths/excludes/tags/retention/manual/kind/options/hooks) is
gone; that surface lives on source_groups now.
* store/types.go
- Schedule slimmed to {id, host_id, cron, enabled, source_group_ids,
timestamps}. SourceGroupIDs populated by Get/List, accepted on
Create/Update so callers pass desired junction state in one shape.
- SourceGroup added: name (= snapshot tag), includes/excludes,
retention_policy, retry_max + retry_backoff_seconds, cached
conflict_dimension.
- HostRepoMaintenance added: forget/prune/check cadences + enabled.
- PendingRun added: offline-retry queue.
- Host loses RepoInitialisedAt; gains BandwidthUpKBps + BandwidthDownKBps.
- RetentionPolicy moves home from "schedule field" to "source group
field" but the type itself + Summary() method unchanged.
* store/sources.go (new) — CRUD + GetByName + ConflictDimension cache.
Group writes bump host_schedule_version; conflict cache writes don't
(server-internal projection, agent doesn't see it).
* store/maintenance.go (new) — CreateDefault is idempotent (INSERT OR
IGNORE). UpdateRepoMaintenance doesn't bump schedule version because
these run on the server's own ticker, not the agent's local cron.
* store/pending.go (new) — Enqueue / DueRunsForRetry / Bump / Delete.
* store/schedules.go — rewritten for slim shape + junction CRUD.
Update wipes the schedule_source_groups junction wholesale and
re-inserts (simpler than diffing). Adds SchedulesUsingGroup for
retention-conflict detection + UI labels.
* store/hosts.go — drops repo_initialised_at scan, adds bandwidth scan.
New SetHostBandwidth helper.
* HTTP layer — temporarily stubbed during this rewrite (501 returns
with redesign_in_progress error code). Phase 3 fills these in
against the new shape:
- schedules.go REST CRUD
- schedule_push.go agent reconciliation
- ui_schedules.go HTML form CRUD
Run-now-per-host + Init-repo handlers in ui_handlers.go also stubbed
— both go away in the new model (Run-now per source group; auto-init
at host enrolment).
* enrollment.go — replaces "seed manual schedule from typed paths"
with "seed default source group + repo-maintenance row." The default
group gets the typed paths as its includes; operator edits later
via Sources tab.
* ws/handler.go — drops the MarkHostRepoInitialised projection (column
is gone; auto-init makes it derivable from latest init job's status).
Tests:
* store: existing schedule test rewritten for slim shape + junction;
new sources_test.go covers source-group CRUD, name uniqueness,
conflict cache, repo-maintenance defaults + idempotent seed,
pending-runs queue lifecycle.
* http: schedules_test.go and schedule_push_test.go deleted — both
exercised the obsolete fat-schedule API. Phase 3 rewrites them
against the new endpoints.
go test ./... green. cmd/server + cmd/agent build. The UI is broken
end-to-end (schedules / sources / repo tabs all hit 501 stubs); Phase 3
restores REST + on-the-wire reconciliation; Phase 4 rewires the UI
templates against the new model.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Schema rebuild for the model collapse described in
design/v4-sources-redesign.html. Three nouns now stand on their
own:
* schedules — slim. Only cron + enabled + host_id. Fat-schedule
shape (paths/excludes/tags/retention/manual/kind/options/hooks)
is dropped wholesale. Schedule data wiped — by design (smoke env
was nuked before this ran; fresh installs have nothing to lose).
* source_groups — name + includes + excludes + retention_policy +
retry policy + cached conflict_dimension. Group name doubles as
the snapshot tag so retention can target it cleanly. UNIQUE
(host_id, name) enforces tag unambiguity.
* schedule_source_groups — N:M junction. One schedule can fire N
groups per tick; one group can be referenced by N schedules.
* host_repo_maintenance — 1:1 with hosts. Default cadences:
forget daily 03:00, prune weekly Sun 04:00, check monthly 1st
05:00 with --read-data-subset 5%. Operator can edit on Repo tab.
* pending_runs — offline-retry queue. Server-side ticker dispatches
due rows; bounded by source_groups.retry_max + retry_backoff_seconds.
Plus:
* hosts.bandwidth_up_kbps / .bandwidth_down_kbps — host-wide caps.
* hosts.repo_initialised_at — DROPPED. Auto-init on enrol makes
it derivable from the latest init job; the Init-repo button goes
too (failure surfaces via job history banner).
Note on FK safety: smoke env was wiped before migration ran, so
DROP TABLE schedules cascades to nothing. Fresh installs apply
0001-0007 then immediately 0008 — same story (no schedule rows
to lose). For an upgrade path on a populated DB, this migration
would need a data-preserving variant; not needed today.
Tests fail to compile/run after this — expected. The Go side
(store types, CRUD, REST handlers, agent runner, UI templates)
gets rebuilt in subsequent phases. tasks.md will track P2 redesign
progress.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>