Commit Graph

170 Commits

Author SHA1 Message Date
steve cffad4b4f3 fix: enabled toggle — list-row click + edit-form save
CI / Build (windows/amd64) (pull_request) Successful in 22s
CI / Build (linux/amd64) (pull_request) Successful in 24s
CI / Build (linux/arm64) (pull_request) Successful in 24s
CI / Lint (pull_request) Successful in 1m15s
CI / Test (linux/amd64) (pull_request) Successful in 1m36s
Two bugs in the channel-enabled affordance:

1. List-row toggle was a static span with no handler; the row's
   row-link overlay swallowed every click and routed to /edit. Add
   POST /settings/notifications/{id}/toggle backed by a new store
   method SetNotificationChannelEnabled, and turn the row toggle
   into an htmx-driven button that swaps in the new state. Use
   event.stopPropagation() on the toggle so it beats the row link.

2. Edit-form toggle visually flipped but the underlying checkbox
   reverted: the visual span lives inside the <label>, so clicking
   it fired the inline JS handler AND the label's native
   checkbox-toggle, cancelling out. Bind to the checkbox 'change'
   event instead and let the label do the toggling — the JS just
   mirrors check.checked into the .on class.
2026-05-04 22:21:45 +01:00
steve 84e121bb9c fix: read 'name' across all per-kind sub-forms when editing channels
CI / Build (windows/amd64) (pull_request) Successful in 22s
CI / Lint (pull_request) Successful in 38s
CI / Build (linux/amd64) (pull_request) Successful in 21s
CI / Build (linux/arm64) (pull_request) Successful in 22s
CI / Test (linux/amd64) (pull_request) Successful in 2m39s
The channel form has three inputs all named 'name' (one per kind
section: webhook / ntfy / smtp), but only the visible kind's input
is filled in. PostForm.Get returns the first regardless of
emptiness, so editing an ntfy or smtp channel always read '' from
the (hidden, unfilled) webhook section's name input and rejected
with 'name required'.

Add firstNonEmpty helper that scans the slice for the first
non-blank value. Same flavour of bug as the enabled checkbox fix
in 6466f8c — both fall out of having multiple inputs share a name
across the per-kind sub-forms.
2026-05-04 22:16:59 +01:00
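A minimal sketch of the helper described in 84e121bb9c, assuming it takes the raw []string for the field and returns the first value that is non-blank after trimming. The body and the trimming behaviour are assumptions; only the name firstNonEmpty comes from the commit message.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// firstNonEmpty returns the first value that is not blank after trimming
// whitespace, or "" when every value is blank. (Hypothetical body; the
// real helper may or may not trim.)
func firstNonEmpty(values []string) string {
	for _, v := range values {
		if s := strings.TrimSpace(v); s != "" {
			return s
		}
	}
	return ""
}

func main() {
	// Three inputs share the name "name"; only the ntfy section is filled in.
	form := url.Values{"name": {"", "phone-alerts", ""}}

	fmt.Printf("Get:           %q\n", form.Get("name"))            // ""
	fmt.Printf("firstNonEmpty: %q\n", firstNonEmpty(form["name"])) // "phone-alerts"
}
```

Indexing the url.Values map directly is what exposes every submitted value; Get only ever sees the first one.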
steve c5b884a22b tasks: tick P3-05/06/07 + Playwright sweep notes
CI / Build (windows/amd64) (pull_request) Successful in 22s
CI / Lint (pull_request) Successful in 32s
CI / Build (linux/amd64) (pull_request) Successful in 22s
CI / Build (linux/arm64) (pull_request) Successful in 21s
CI / Test (linux/amd64) (pull_request) Successful in 3m44s
Sweep against the live smoke env confirmed the alerts subsystem
end-to-end: three channels (webhook → local sink, ntfy → ntfy.sh,
SMTP → MailHog) created and verified via the Test button; synthetic
critical raised; ack + resolve fan out alert.acknowledged /
alert.resolved across all three; dashboard banner appears and
clears; nav badge tracks open count.

Three real bugs found and fixed mid-sweep — see preceding three
commits for the full reasoning.
2026-05-04 21:01:34 +01:00
steve 3d99306cea fix: refresh hosts.open_alert_count on Raise/Resolve/AutoResolve
The denormalised projection was never written by the alerts code
path, so the dashboard's OPEN ALERTS card and the per-host alerts
column always read 0 regardless of how many alerts were open.
fleet.GetStats sums hosts.open_alert_count; if it never moves, the
card is decoration.

Add refreshHostOpenAlertCount that recomputes from the alerts table
(self-healing — no +/- bookkeeping to drift). Call it after the
commit in RaiseOrTouch when a row was inserted, after Resolve, and
after AutoResolve.

Caught during the live sweep: a synthetic critical raised the count
to 1, but resolving it left the dashboard reading '1 unresolved'
indefinitely.
2026-05-04 21:01:17 +01:00
steve 6466f8c759 fix: read enabled checkbox correctly when paired with hidden=0 sibling
The notification channel form has a <input hidden name=enabled value=0>
plus a <input checkbox name=enabled value=1> so unchecking the box
still submits 'enabled=0' (otherwise the field would just be absent).
But Go's url.Values.Get returns the FIRST value, so even when the
checkbox is ticked the handler read '0' and persisted enabled=false.

Scan r.PostForm["enabled"] for any '1' instead. Caught during the
sweep — all three test channels saved with enabled=0 even though
the toggle visually rendered ON.
2026-05-04 21:00:54 +01:00
steve 9be3cead8e fix: dispatch alert.acknowledged + alert.resolved on UI ack/resolve
Spotted during the live Playwright sweep: clicking Acknowledge or
Resolve updated the alert row but never fanned out a notification.
The handlers went straight to Store.Acknowledge/Resolve, bypassing
the hub.

Add Engine.Acknowledge and Engine.Resolve that wrap the store call
and dispatch the matching event to every enabled channel. The UI
handlers prefer the engine path when wired, and fall back to the
direct store call so unit tests that construct a Server without an
engine still work.

Use context.WithoutCancel for the goroutine dispatch — the request
context is cancelled the instant the handler returns 204, so the
naive 'go e.hub.Dispatch(ctx, ...)' was racing the response and
losing the channel-list query with 'context canceled'.
2026-05-04 21:00:44 +01:00
steve ee410fcf95 alert: construct + run engine; expose hub to handlers
- Construct notification.NewHub and alert.NewEngine at boot in cmd/server/main.go
- Start go alertEngine.Run(ctx) after construction, before the HTTP listener
- Wire AlertEngine and NotificationHub into rmhttp.Deps (fields already existed)
- Remove the TODO(G1) in the offline sweeper; now calls NotifyHostOffline per ID
2026-05-04 20:32:10 +01:00
steve e0fbb8c980 ui: dashboard crit-alerts banner 2026-05-04 20:29:49 +01:00
steve 371fe734f3 ui: /settings/notifications list + edit form (3 kinds)
Add settings.html (shell + sub-tab nav + conditional list/edit body),
notifications.html and notification_edit.html (glob stubs), and the
supporting CSS tokens (.ch-row, .ch-icon, .toggle, .kind-grid,
.kind-card, .radio-pip, .test-pill) to input.css. Rebuild styles.css.
Add ui_parse_test.go to catch template regressions at test time.

The kind picker is JS-driven (no full page reload); the enabled toggle
mirrors the existing visual toggle pattern; the test-notification button
uses HTMX and renders the JSON response as a coloured pill client-side.
2026-05-04 20:25:06 +01:00
steve d373d19647 ui: F1 — populate OpenAlerts in baseView so nav badge updates everywhere
Flagged in review of cd38b40: the Alerts tab badge should show the
open count from any page, not just /alerts. baseView now takes the
request and queries store.ListAlerts(Status: "open") to fill
view.OpenAlerts on every page render. All call sites updated.
2026-05-04 20:19:09 +01:00
steve cd38b40516 ui: alerts list page + alert row partial + nav badge 2026-05-04 20:15:01 +01:00
steve de6939b3f6 http: /settings/notifications CRUD + test endpoint 2026-05-04 20:06:45 +01:00
steve 873821b871 http: /alerts list + ack/resolve handlers + /api/alerts JSON 2026-05-04 19:59:24 +01:00
steve 8c42b00228 alert: wire engine into ws hello + MarkJobFinished + offline sweep
- ws.HandlerDeps gains an AlertEngine *alert.Engine field; populated
  from http.Deps.AlertEngine (nil until G1 constructs the engine)
- runAgentLoop calls NotifyHostOnline after MarkHostHello succeeds
- dispatchAgentMessage MsgJobFinished case calls NotifyJobFinished,
  looking up the job Kind via Store.GetJob before notifying
- store.MarkHostsOfflineStaleReturnIDs added: SELECT+UPDATE in one
  transaction, returns the IDs that flipped to offline
- offline sweeper in cmd/server/main.go switched to the new variant;
  TODO(G1) comment marks where NotifyHostOffline calls will land
2026-05-04 19:54:39 +01:00
steve cb4695e09a alert: rule logic for the six v1 rules 2026-05-04 19:50:33 +01:00
steve f38930e2e6 alert: engine skeleton + event channels 2026-05-04 19:47:09 +01:00
steve 16e71a0708 notification: Hub fan-out + log writer 2026-05-04 19:44:31 +01:00
steve a6ac9ee71d notification: smtp channel 2026-05-04 19:40:21 +01:00
steve a99864c649 notification: B3 — Content-Type header + URL trim
Fixes flagged in spec review of f0a323e: ntfy POSTs need explicit
Content-Type: text/plain (the spec calls for it; ntfy works without
but explicit beats inferred); trim trailing slashes from server URL
to avoid double-slash when operators paste 'https://ntfy.sh/'.
2026-05-04 19:38:16 +01:00
steve f0a323ef91 notification: ntfy channel 2026-05-04 19:35:50 +01:00
steve c22fb24f5b notification: webhook channel 2026-05-04 19:33:29 +01:00
steve 6688b3f88a notification: payload + Channel interface 2026-05-04 19:31:27 +01:00
steve 69fc89143d store: notification_channels CRUD + AppendNotificationLog 2026-05-04 19:28:41 +01:00
steve b5a0aa4667 store: alerts CRUD with dedup + last_seen_at bump 2026-05-04 19:24:17 +01:00
steve f24dfa5214 store: migration 0014 — notification_channels + notification_log 2026-05-04 19:20:37 +01:00
steve 640b64710e store: A1 — check rows.Err() + Scan err in migrate_test
Code-quality nits flagged in review of e6d965d. Mirrors the existing
pattern in host_credentials_test.go.
2026-05-04 19:19:28 +01:00
steve e6d965d7a5 store: migration 0013 — alerts.last_seen_at 2026-05-04 19:16:59 +01:00
steve 4b70939ab5 docs: P3 alerts implementation plan 2026-05-04 19:00:18 +01:00
steve 518c29ddb3 docs: P3 alerts spec — add SMTP as first-class v1 channel
Post-brainstorm change after operator review: overnight-digest /
"don't ping me at 03:00, email me in the morning" use case is poorly
served by ntfy (push) and clumsy via webhook → email-gateway. SMTP joins
webhook + ntfy as the third v1 channel; Apprise stays deferred.

Spec updates:
- Decision 5 reworded: three channels in v1.
- Channel iface gains smtpChannel using net/smtp + crypto/tls. 10s
  timeout vs 5s for HTTP — STARTTLS handshake + DATA over a slow link
  legitimately needs the headroom.
- Migration 0014 CHECK now allows 'smtp'. New smtpConfig struct: host,
  port, encryption (starttls/tls/none), username, password (AEAD), from,
  to. One channel = one To-address; multi-recipient = multiple channels
  (keeps failure attribution per-recipient).
- Body shape documented: hardcoded subject pattern
  '[restic-manager] [<sev>] <host>: <kind>', Message-ID includes the
  alert id so threading groups raised → ack → resolved cleanly. Plain
  text only in v1.
- Encryption defaults to STARTTLS on 465/587; PLAIN auth over TLS, no
  XOAUTH2 yet (app passwords recommended for Gmail / M365).
- Test plan adds MailHog step in the Playwright sweep.
- Non-goals expanded: HTML emails, OAuth2/XOAUTH2, multi-recipient
  channels are explicitly out of v1.

Wireframe updates (_diag/p3-alerts-wireframe/wireframe.html):
- Kind picker grows from 2 cards to 3 (Webhook / Ntfy / SMTP @). SMTP
  gets the --ok green colour family so it visually separates from
  webhook (accent) and ntfy (warm).
- New SMTP variant section (3c): host+port+encryption row, user+pass
  row, from+to row, test result, plus right-rail email shape preview
  showing the RFC 5322 layout.
- Channel list grows a third row: 'overnight-digest · smtp://… →
  ops-overnight@example.com'.
2026-05-04 18:48:15 +01:00
steve 6165e34f6f docs: P3 alerts design spec
Phase 3 sub-spec covering the alerts engine, notification channels, and
UI (P3-05/06/07). Brainstorm ran 2026-05-04; all ten design decisions
locked before this spec was written.

Key decisions captured:

- Hardcoded rule set, no operator-tunable thresholds in v1. Six rules:
  backup_failed, forget_failed, prune_failed, check_failed,
  stale_schedule, agent_offline.
- Hybrid engine cadence: event hooks at MarkJobFinished + offline-sweeper
  for immediate triggers; one 60s ticker for stale-schedule detection +
  auto-resolution sweeps.
- Auto-resolve when underlying condition clears; manual Resolve any time;
  Acknowledge as a separate I-have-seen-it intermediate state that does
  NOT close the alert.
- v1 channels: native ntfy + webhook. Apprise + SMTP deferred. Channel
  scope is global only — no per-host or per-severity routing.
- Webhook payload is one stable JSON envelope shape across raised /
  acknowledged / resolved / test events; ntfy uses the standard publish
  format with severity → priority mapping.
- Per-channel Send Test Notification button hits the real send path with
  a synthetic info-severity event; inline green-tick / red-cross result.
- Dedup by (host_id, kind, resolved_at IS NULL); last_seen_at bumped on
  every confirming tick so the UI can render 'still happening · Ns ago'
  without re-notifying.
- Top-level /alerts page; Settings shell with Notifications sub-tab.
  Per-host vitals Open alerts cell deep-links into filtered list.
- Best-effort fire-and-forget delivery with 5s timeout; failures logged
  to a new notification_log table but never retried. Alert row in the DB
  is the source of truth.

Migrations:
- 0013 adds alerts.last_seen_at (column-level ALTER per CLAUDE.md)
- 0014 adds notification_channels + notification_log tables

Wireframe: _diag/p3-alerts-wireframe/wireframe.html
2026-05-04 18:39:26 +01:00
steve 64861a5fb8 Merge pull request 'Phase 3 — Restore (P3-X1, X2, 01, 02, 03, 09, X3-X6)' (#6) from p3-restore into main
Reviewed-on: #6
2026-05-04 17:06:18 +00:00
steve 28d5043eb0 test: lock-protect fakeSender so -race CI passes
CI / Lint (pull_request) Successful in 31s
CI / Build (linux/amd64) (pull_request) Successful in 20s
CI / Build (linux/arm64) (pull_request) Successful in 19s
CI / Test (linux/amd64) (pull_request) Successful in 1m27s
CI / Build (windows/amd64) (pull_request) Successful in 1m34s
The CI runs go test with -race; the agent runner has two pump goroutines
(pumpStdout + pumpStderr) writing through the sender concurrently, and
the unprotected fakeSender slice append raced. The cancel_test had a
local 'safeSender' workaround for the same issue; promote that mutex
onto fakeSender itself so every test in the package is race-clean
without per-test variants.

- fakeSender grows mu sync.Mutex; Send locks and unlocks it. New
  snapshot() helper for tests that want a stable copy.
- cancel_test drops its local safeSender + sync import; uses fakeSender.

Verified: go test -race ./... passes across all packages.
2026-05-04 18:01:35 +01:00
steve e4031d26fa P3 wrap: agent auto-creates restore target; tasks.md ticked
CI / Lint (pull_request) Successful in 35s
CI / Build (linux/amd64) (pull_request) Successful in 20s
CI / Build (windows/amd64) (pull_request) Successful in 1m18s
CI / Build (linux/arm64) (pull_request) Successful in 46s
CI / Test (linux/amd64) (pull_request) Failing after 2m46s
1. Agent-side MkdirAll on the new-dir restore target. Restic creates
   missing leaves but won't traverse multiple missing levels, and
   under the systemd sandbox writes outside ReadWritePaths fail
   anyway. Calling os.MkdirAll(target, 0700) before invoking restic
   means the operator never has to pre-create the per-job subdir,
   and a path the sandbox rejects surfaces as a clean
   'restic restore: prepare target ...: read-only file system' error
   in the job log instead of a cryptic restic-side stat failure.

2. tasks.md Phase 3 — Restore section refreshed:
   - P3-X4 added (job log download dropdown — txt + ndjson)
   - P3-X5 added (UK lint locale switch + 73-correction sweep)
   - P3-X6 added (SIZE/FILES tooltip when host's restic < 0.17)
   - P3-03 entry expanded to cover version-gated --no-ownership,
     editable target, $HOME expansion, agent-side MkdirAll
   - As-shipped sweep summary mentions custom-target restore +
     download dropdown + tooltip in addition to the original walk

Test: TestRunRestoreNewDirAutoCreatesTarget seeds a multi-level
target the operator hasn't created and confirms RunRestore creates
the whole directory chain before invoking restic.
2026-05-04 17:51:34 +01:00
steve 02250670c1 ui: snapshots SIZE/FILES tooltip when host's restic is < 0.17
Per-snapshot size + file-count come from the embedded summary block
restic added to 'snapshots --json' in 0.17 (the source comment in
internal/restic/snapshots.go incorrectly said 0.16+). Hosts running
0.16.x leave those columns blank.

- Fix the snapshots.go doc comment: '0.16+' -> '0.17+'.
- hostDetailPage carries a LegacyRestic bool computed from the host's
  reported ResticVersion via Env.AtLeastVersion(0, 17). Empty version
  also counts as legacy (conservative default).
- Template attaches title='Needs restic 0.17+ on the agent host. This
  host runs <ver>.' + cursor:help on the SIZE / FILES headers when
  the flag is true. Hosts already on 0.17+ get no tooltip and no
  extra styling.

A host upgrading restic to 0.17+ gets the columns populated on the
next backup automatically — no further code change needed.
2026-05-04 17:45:32 +01:00
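A sketch of the version gate used in 02250670c1, written as a free function rather than the Env.AtLeastVersion method. The parsing rules (empty or non-numeric versions count as too old) follow the conservative default the commit describes; everything else about the body is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// atLeastVersion reports whether version ("major.minor[.patch]") is at
// least major.minor. Empty or unparseable versions return false.
func atLeastVersion(version string, major, minor int) bool {
	parts := strings.SplitN(strings.TrimSpace(version), ".", 3)
	if len(parts) < 2 {
		return false
	}
	gotMajor, err := strconv.Atoi(parts[0])
	if err != nil {
		return false
	}
	gotMinor, err := strconv.Atoi(parts[1])
	if err != nil {
		return false
	}
	if gotMajor != major {
		return gotMajor > major
	}
	return gotMinor >= minor
}

func main() {
	for _, v := range []string{"", "0.16.4", "0.17.0", "0.17.3", "1.0.0", "dev"} {
		legacy := !atLeastVersion(v, 0, 17)
		fmt.Printf("%-8q legacy=%v\n", v, legacy)
	}
}
```

The same predicate would drive both the tooltip flag and the --no-ownership gate, with the patch component ignored.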
steve 8e06bc7924 ui: tidy job-page download into a single dropdown
Replace the floating 'Download log' button + bare '.ndjson' link with
one cohesive dropdown menu — same affordance as the rest of the
header, opens to two well-described options.

- Native <details><summary> for keyboard + no-JS support; only the
  click-outside-to-close handler is JS (a few lines).
- New .dropdown / .dropdown-menu / .dropdown-item tokens in
  web/styles/input.css. Reusable for future header menus
  (host-detail overflow, source-group action menus, etc).
- Chevron flips 180 degrees when open via .dropdown[open] selector.
- Each option has a label + a mono hint line explaining when to pick it
  (.txt for humans / paste into a ticket; .ndjson for jq / tooling).
2026-05-04 17:36:57 +01:00
steve f0dfa689fe P3 follow-up: editable target dir, conditional --no-ownership, UK lint
Three small follow-ups from review:

1. Restore target is now operator-editable. Default value is the
   literal '$HOME/rm-restore/<job-id>/' (agent expands $HOME at
   run time using os.UserHomeDir(); also handles ${HOME} and ~/
   prefixes). Operator can replace with any absolute path.
   - ui_restore.go validates the input is either absolute or starts
     with one of the recognised prefixes; other env-var refs ($PATH
     etc.) are deliberately rejected so operator paths can't pick up
     arbitrary agent env values.
   - host_restore.html replaces the read-only mono-text display with
     a real <input>; help text spells out that $HOME resolves
     agent-side and <job-id> is substituted on dispatch.
   - install.sh + the systemd unit prep /root/rm-restore so the
     default works under the sandbox: ReadWritePaths gains a soft
     '-/root/rm-restore' entry (the '-' makes the bind-mount soft-fail
     if missing, but install.sh pre-creates it root-owned 0700).

2. --no-ownership flag now gated on restic version. The flag was
   added in restic 0.17 and 0.16 rejects it. e22b41d452 dropped it
   wholesale, which meant new-dir restores silently preserved
   ownership against design intent on 0.17+. Now the agent threads
   its detected restic version (sysinfo already collects it) through
   runner.Config -> restic.Env, and RunRestore appends --no-ownership
   only when AtLeastVersion(0, 17) returns true. 0.16 hosts still
   restore with original uid/gid; help text in the wizard explicitly
   notes this. The previous 'Original ownership is preserved' copy
   was wrong for new-dir mode and is corrected.

3. golangci-lint misspell locale switched US -> UK and the codebase
   swept (73 corrections, mostly behaviour/serialise/recognise/honour).
   Wire-format ErrorCode 'unauthorized' -> 'unauthorised' is a tiny
   contract change but the agent doesn't parse those codes today and
   no external API consumers exist yet. Tests passed before + after.

Tests:
- internal/restic/version_test.go covers Env.AtLeastVersion across
  edge cases (empty, exact match, patch above, minor below,
  non-numeric) and expandHome on $HOME / ${HOME} / ~/, plus
  pass-through for absolute paths and refusal of other env vars.
- ui_restore_test updated: TargetDir now starts '$HOME/rm-restore/'
  with the job_id substituted into the placeholder.

Live verified on the smoke env: default target restored to
/root/rm-restore/<job-id>/ as the agent's expanded $HOME (2 files,
14 bytes); custom override '/tmp/custom-restore/<job-id>/' restored
into the agent's PrivateTmp namespace (1 file, 6 bytes); both jobs
'succeeded', exit 0.
2026-05-04 17:27:52 +01:00
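The prefix handling described in f0dfa689fe might be sketched as below. The home directory is passed in as a parameter to keep the example deterministic (the agent is said to use os.UserHomeDir()); the signature, the error text, and the Unix-style absolute-path check are all assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// expandHome resolves a leading $HOME, ${HOME} or ~ against home, passes
// absolute paths through unchanged, and refuses anything else, including
// other environment variable references such as $PATH.
func expandHome(path, home string) (string, error) {
	for _, prefix := range []string{"${HOME}", "$HOME", "~"} {
		if !strings.HasPrefix(path, prefix) {
			continue
		}
		rest := path[len(prefix):]
		// Only accept the prefix when it is a whole path component,
		// so "$HOMEDIR/x" or "~user/x" are not treated as $HOME.
		if rest == "" || strings.HasPrefix(rest, "/") {
			return home + rest, nil
		}
	}
	if strings.HasPrefix(path, "/") {
		return path, nil
	}
	return "", fmt.Errorf("target %q must be absolute or start with $HOME, ${HOME} or ~/", path)
}

func main() {
	for _, p := range []string{
		"$HOME/rm-restore/job-1/",
		"${HOME}/rm-restore/job-1/",
		"~/rm-restore/job-1/",
		"/tmp/custom-restore/job-1/",
		"$PATH/rm-restore/",
	} {
		got, err := expandHome(p, "/root")
		fmt.Printf("%-30s -> %q err=%v\n", p, got, err)
	}
}
```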
steve a2398d0b66 P3 follow-up: log download (txt + ndjson) on the live job page
The diff job's full output streams to the standard live job log page,
which can be a lot of text the operator wants to grep through or paste
into a ticket. Add a Download button.

Source of truth is the persisted job_logs table — works any time
(running or finished) and doesn't need to pause the live WS stream.
The download is 'everything the server has up to right now'; if the
operator wants a fuller snapshot of a still-running job, they hit
Download again.

- New endpoint GET /api/jobs/{id}/log.{txt,ndjson} (chi {format}
  matcher constrained to the two known suffixes). Auth via session
  cookie. 404 on unknown job.
- internal/server/http/job_download.go writeLogsText emits a small
  header + 'HH:MM:SS.mmm  TAG  payload' rows mirroring what the live
  page shows. writeLogsNDJSON emits one self-contained {seq,ts,stream,
  payload} JSON object per line — appending stays valid (each line
  stands alone), and the whole file pipes cleanly into jq. NDJSON is
  newline-delimited JSON; not the same as a JSON array.
- web/templates/pages/job_detail.html grows two header buttons:
  'Download log' (txt) + '.ndjson' ghost variant for tooling.

Tests cover the txt format (header + per-row shape), the ndjson
format (each line round-trips through json.Unmarshal), unknown job
404, unauthenticated 401.
2026-05-04 17:12:45 +01:00
steve e22b41d452 P3 sweep fixes: snap-row CSS, tree expand, --no-ownership drop, target path
Bug fixes from the Playwright sweep against the live smoke server:

1. Snapshot-picker layout. The .snap-row class was used in the wireframe
   but never landed in web/styles/input.css; rows rendered as vertical
   blocks instead of a 6-column grid. Added the token (mirrors host-row
   shape with restore-specific column widths).

2. Tree expansion. hx-target='closest .tree-row + .tree-children' isn't
   a valid HTMX selector — modifiers don't chain. Replaced HTMX-driven
   expansion with a small window.__rmTreeToggle helper that uses plain
   fetch + .tree-pair wrapper structure for trivial sibling lookup.
   Caches loaded state per node.

3. --no-ownership flag dropped. Restic 0.17 introduced --no-ownership;
   0.16 rejects it ('unknown flag') before doing any work. Since the
   agent runs as root in the systemd unit, restored files keep their
   original uid/gid either way and the parent dir is root-owned, so
   the 'cp without sudo' rationale doesn't hold. Drop the flag entirely.

4. Default target dir moved to /var/lib/restic-manager/restore. The
   systemd unit pins ReadWritePaths to /etc/restic-manager +
   /var/lib/restic-manager (with ProtectSystem=strict making the rest
   of /var read-only); writes to /var/restic-restore failed with
   'read-only file system'.

5. Confirm summary HTML escaping. defaultTarget JS literal evaluates
   to a string with literal angle brackets; insertion into innerHTML
   must escape them. Added an inline HTML-escape pass.

tasks.md ticked for the Restore sub-phase with a sweep summary
covering the live end-to-end test.
2026-05-04 15:57:42 +01:00
steve 1111124573 P3-09 + P3-X3: snapshot diff + recent-restores line
P3-09 — snapshot diff dispatcher.
- POST /api/hosts/{id}/snapshots/diff (and the unprefixed HTMX-form
  variant) takes {snapshot_a, snapshot_b}, validates both belong to
  the host (long id / short id / prefix match), checks the agent is
  online, mints a JobDiff, ships command.run with DiffPayload, writes
  a host.snapshot_diff audit row, returns HX-Redirect to the live
  job page (or JSON {job_id, job_url} for REST callers).
- Two-snapshot guard: POSTing diff(a,a) returns 422.
- UI: small panel on the host_detail right rail (visible when the
  host has 2+ snapshots) with two short-id inputs and a Diff button.
  Output renders on the standard live job page where the operator
  reads the per-line diff text directly.

P3-X3 — recent-restores line.
- hostChromeData grows RestoreStatus / RestoreAt / RestoreJobID
  populated via store.LatestJobByKind(host_id, 'restore') (already
  exists, used by the init line).
- host_chrome.html renders a small line below the existing init-status
  one with status-coloured copy + a link to the job log. Hidden when
  no restore has ever run on this host.

Tests:
- diff_test covers happy path (correct DiffPayload + HX-Redirect),
  same-id rejection (422), unknown-id rejection (422). Adds a
  seedTwoSnapshots helper since ReplaceHostSnapshots is atomic-swap
  (calling seedSnapshot twice would only leave the second).

Restage block (CLAUDE.md) deferred to the end of the restore phase.
2026-05-04 15:38:28 +01:00
steve 6e47efc146 P3-01/02/03: restore wizard backend + templates + restore-shaped job page
End-to-end wizard from /hosts/{id}/restore (or per-snapshot deep link
/hosts/{id}/snapshots/{sid}/restore) → tree-browse → dispatch →
restore-shaped live job page.

Backend (internal/server/http/ui_restore.go):
- GET handlers render the four-step wizard against the wireframe shape
  in docs/superpowers/specs/2026-05-04-p3-restore-design.md.
- HTMX tree partial endpoint hits fetchTreeWithCache (P3-X2) so each
  directory expansion is a sub-second cached lookup after the first
  miss.
- POST validates: snapshot_id non-empty, ≥1 absolute path, in-place
  mode requires confirm_hostname == host name, agent online. On error
  re-renders the wizard with the operator's input intact. Happy path
  mints a job_id, computes the new-directory target as
  /var/restic-restore/<job-id>/ (operator can't escape the prefix —
  server picks it), creates the job row, ships command.run with
  kind=restore + RestorePayload, writes a host.restore audit row,
  returns HX-Redirect (or 303) to the live job page.

Templates:
- host_restore.html: single-page progressively-enabled wizard matching
  _diag/p3-restore-wizard wireframe. Form-state-driven JS computes a
  running tally of selected paths and the step-4 confirm summary
  client-side; the server re-renders on validation failure with form
  fields preserved.
- partials/tree_node.html: recursive HTMX-served tree fragment.
- Top-level Restore button on host_detail right rail + per-snapshot
  Restore action on snapshot rows replace the previous P3-stub.

Restore-shaped job page (job_detail.html):
- Progress widget rendered as a panel rather than a bare strip when
  the job is active.
- Current-file display under the bar, updated from log.stream stdout
  lines that look like absolute paths. Hidden for non-restore kinds.

Migration 0012:
- Add restore + diff to the jobs.kind CHECK. Rebuild required (SQLite
  can't ALTER CHECK in place); follows the safe pattern from 0005.
  Defensive: stash job_logs into a temp table before the rebuild and
  INSERT OR IGNORE back afterwards so even if SQLite cascades on
  DROP TABLE jobs the log history survives.

Tests:
- ui_restore_test covers GET step-1 render, GET pre-selected snapshot
  summary card, POST missing snapshot, POST missing paths, POST
  in-place wrong-hostname rejection (no command.run leaks to the
  agent), POST happy path (HX-Redirect + correct payload + audit
  row), POST against offline host returns 503.

Restage block (CLAUDE.md) deferred to the end of the restore phase.
2026-05-04 15:34:29 +01:00
steve 265b4b6c5d P3-03: restic restore + diff execution path
Wires JobRestore and JobDiff end-to-end at the agent layer (the wizard
backend that drives this lands in the next slice).

- internal/api: JobRestore + JobDiff JobKind constants. CommandRunPayload
  grows nullable Restore + Diff sub-payloads. RestorePayload carries
  snapshot_id, paths, in_place, target_dir; DiffPayload carries
  snapshot_a + snapshot_b.
- internal/restic.RunRestore wraps 'restic restore <sid> --target ...
  [--no-ownership] [--include p]...' with --json. New pumpRestoreStdout
  parses the per-line status / summary objects (drops raw status from
  log.stream — the throttled job.progress envelope covers it). New
  RestoreStatus + RestoreSummary types mirror restic's wire shape.
- internal/restic.RunDiff wraps 'restic diff --json <a> <b>'.
- internal/agent/runner: RunRestore translates RestoreStatus into
  job.progress (mapping FilesRestored → FilesDone etc) with a small
  estimateETA helper since restic doesn't provide ETA for restore.
  RunDiff is a thin streamHandler wrapper.
- cmd/agent dispatcher gains JobRestore + JobDiff cases. Both reuse
  the spawn() helper from P3-X1 so cancel just works.
- Drive-by fix: lastProgress was initialised to time.Now() so the
  very first status event was suppressed by the 1s throttle if the
  agent reported quickly. Initialise to time.Time{} (zero) so the
  first event always emits. Affects backup + restore.

Tests:
- restore_test covers restore happy path (started → progress →
  finished, kind=restore on the started envelope), in-place argv
  asserts no --no-ownership, new-dir argv asserts --no-ownership +
  --target + --include, diff produces the expected log.stream lines.

Restage block (CLAUDE.md) is deferred to the end of the restore
sub-phase so we restage once with all changes.
2026-05-04 15:24:14 +01:00
steve 6d295bc9f6 P3-X2: tree.list synchronous WS RPC + per-session cache
Foundational for the restore wizard's tree browser. The wizard needs to
lazy-load directory contents from a snapshot as the operator drills
down; this lands the transport.

- internal/api adds MsgTreeList (server → agent) + MsgTreeListResult
  (agent → server) with TreeListRequestPayload / TreeListEntry /
  TreeListResultPayload types. Reply correlates by Envelope.ID.
- internal/restic.ListTreeChildren wraps 'restic ls --json' and
  filters its recursive output to direct children of the requested
  path. Parser + path-normalisation + isDirectChild are unit-tested.
- internal/server/ws/rpc.go introduces a generic SendRPC helper on
  Hub: register a buffered channel keyed by ULID, send the request,
  block on ctx.Done()/timeout/reply. Reply routing piggybacks on the
  existing dispatchAgentMessage by adding a MsgTreeListResult case
  that forwards to the registered waiter; if no waiter is registered
  (caller already gave up) the stray reply is dropped quietly.
- cmd/agent gains a tree.list handler that runs ListTreeChildren on a
  fresh per-call context (60s ceiling) and ships the matching
  tree.list.result envelope. Errors surface in result.Error rather
  than as transport failures so the server-side waiter can render a
  sensible UI message.
- internal/server/http/tree_cache.go is the per-wizard-session cache
  layer (~30min TTL, sweep-on-access) that fetchTreeWithCache uses
  before falling through to SendRPC. Cached on success only; agent
  errors aren't cached so a transient failure doesn't poison the
  session.

Tests:
- internal/restic/ls_test.go covers parseLsChildren at root / mid-tree
  / leaf, plus normalizeTreePath and isDirectChild edge cases.
- internal/server/ws/rpc_test.go unit-tests the registry: round-trip,
  release semantics, concurrent waiters, ctx-cancel.
- internal/server/http/tree_rpc_test.go is the full round-trip: server
  SendRPC → fake-agent over a real WS → reply → server gets the
  payload. Plus a timeout test that confirms ~300ms timeouts terminate
  in ~300ms rather than waiting forever.

The cache is plumbed but no UI handler hits fetchTreeWithCache yet —
that lands with P3-01 (wizard backend). The unused-linter is suppressed
via nolint until the wizard wires it in.
2026-05-04 15:19:22 +01:00
steve 9fa2ef48f0 P3-X1: cancel-job feature
Wires the existing job_detail Cancel button (which was a UI stub) into
real backend behaviour:

- internal/api already declared MsgCommandCancel + CommandCancelPayload;
  promote those from forward-declarations to a working envelope. Agent
  side: cmd/agent/main.go drops the TODO stub and gains a per-job
  context.CancelFunc map. runJob's switch is refactored around a small
  spawn() helper so each kind's goroutine derives a per-job context,
  registers the cancel, and removes itself on completion regardless of
  outcome. command.cancel looks up the func and fires it.
- internal/agent/runner.sendFinished now takes ctx and rebadges
  ctx.Canceled errors as JobCancelled (exit 130) rather than
  JobFailed. All Run* call sites updated.
- internal/restic.resticCmd sets cmd.Cancel to send SIGTERM (via
  build-tagged sigterm constant; os.Kill on Windows since SIGTERM
  isn't deliverable there) and cmd.WaitDelay=5s for the SIGKILL
  fallback. SIGTERM lets restic remove its lock file before exiting.
- New POST /api/jobs/{id}/cancel server endpoint validates the job
  is non-terminal and the host is online, sends command.cancel via
  the hub, writes a job.cancel audit row, returns 202. The agent's
  resulting job.finished (status=cancelled) is what actually
  transitions the row.

Tests:
- internal/server/http/cancel_test.go covers happy path (envelope
  shape + audit row), 409 for terminal jobs, 404 for missing jobs,
  503 for offline hosts.
- internal/agent/runner/cancel_test.go covers cancel mid-run: a fake
  restic that execs into 'sleep 30' is cancelled 150ms after start
  and the resulting job.finished reports JobCancelled with exit 130
  in well under the WaitDelay.

Foundational for P3 restore (the operator needs to be able to cancel a
running backup if they need to restore urgently). Independently useful
for prune/check/backup jobs that are stuck.
2026-05-04 15:11:49 +01:00
steve 454a2415dc docs: P3 restore design spec + scope-decompose Phase 3
Splits Phase 3 into three independently-shippable sub-phases (Restore,
Alerts, Audit UI) so they can land in separate PRs with their own brainstorm
→ spec → plan cycles. The Restore sub-phase is up first.

The brainstorm ran on 2026-05-04 and locked the following decisions:

- Single-host restore only this phase. P3-04 (cross-host restore) is moved
  to a new 'Future / unscheduled' section. Disaster recovery is already
  covered by re-enrolling a replacement host with the same repo creds; the
  remaining 'pull a file from host A onto host C' use case is genuinely
  different (file sharing / migration, not DR) and has no confirmed need.
- Default target is /var/restic-restore/<job-id>/ with --no-ownership;
  in-place restore preserves uid/gid/mode and is gated by typed-confirmation
  of the host name (mirroring the repo re-init danger zone).
- Tree browser is the path picker, lazy-loaded via a synchronous WS RPC
  (tree.list) over the existing correlation-ID infrastructure with a
  per-wizard-session in-memory cache (~30 min TTL).
- Single-page wizard with progressively-enabled sections; entry is a
  top-level Restore button on host detail (or per-snapshot Restore action
  for direct deep-link).
- Snapshot diff (P3-09) is a JobDiff JobKind, dispatched like every other
  agent operation; output streams to the standard live job log page.
- Restore-specific live job page variant with files-restored /
  bytes-restored / current-file widget.
- Single-flight per host across all kinds, plus a real cancel-job feature
  (command.cancel WS envelope, agent kills the restic subprocess via
  context cancel + SIGTERM/SIGKILL grace) so the operator can pre-empt a
  long-running backup if they need to restore urgently. Wires the existing
  job_detail Cancel button (which was a UI stub).
- Audit row host.restore on every dispatch + a recent-restores panel on
  host detail. Role gate deferred to P4-03 RBAC.
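
As one possible reading of the single-flight-per-host decision above, a guard keyed by host ID could look like the sketch below. The spec does not say where the guard lives or how it is shaped; hostGate, tryAcquire and release are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// hostGate is a hypothetical single-flight guard: at most one job of any
// kind may hold a given host at a time.
type hostGate struct {
	mu   sync.Mutex
	busy map[string]string // hostID -> ID of the job currently running
}

func newHostGate() *hostGate {
	return &hostGate{busy: make(map[string]string)}
}

// tryAcquire claims hostID for jobID. If the host is busy it returns the ID
// of the running job and false, so the caller can offer to cancel it.
func (g *hostGate) tryAcquire(hostID, jobID string) (string, bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if running, ok := g.busy[hostID]; ok {
		return running, false
	}
	g.busy[hostID] = jobID
	return jobID, true
}

// release frees the host, but only if jobID still owns it.
func (g *hostGate) release(hostID, jobID string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.busy[hostID] == jobID {
		delete(g.busy, hostID)
	}
}

func main() {
	g := newHostGate()
	fmt.Println(g.tryAcquire("host-a", "backup-1"))  // backup-1 true
	fmt.Println(g.tryAcquire("host-a", "restore-1")) // backup-1 false
	g.release("host-a", "backup-1")
	fmt.Println(g.tryAcquire("host-a", "restore-1")) // restore-1 true
}
```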

Wireframe at _diag/p3-restore-wizard/wireframe.html (gitignored —
transient design artefact); screenshot reviewed and approved 2026-05-04.
2026-05-04 15:02:32 +01:00
steve 0bd7a896c4 Merge pull request 'P2 completion (P2R-09/10/11/12/13/14, P2-16/17/18)' (#5) from p2-completion into main 2026-05-04 13:19:05 +00:00
steve bdabcfb68e docs: note Gitea repo + tea CLI in CLAUDE.md
CI / Build (windows/amd64) (pull_request) Successful in 19s
CI / Lint (pull_request) Successful in 21s
CI / Build (linux/amd64) (pull_request) Successful in 19s
CI / Build (linux/arm64) (pull_request) Successful in 19s
CI / Test (linux/amd64) (pull_request) Successful in 2m17s
2026-05-04 14:18:50 +01:00
steve c691dc8a56 tasks: tick P2 completion + Playwright sweep screenshots
CI / Build (windows/amd64) (pull_request) Successful in 20s
CI / Lint (pull_request) Successful in 41s
CI / Build (linux/amd64) (pull_request) Successful in 21s
CI / Test (linux/amd64) (pull_request) Successful in 53s
CI / Build (linux/arm64) (pull_request) Successful in 1m48s
P2R-09/10/11/12/13/14, P2-16/17/18 all marked done. Acceptance line
for Windows hosts annotated as 'compile-verified, untested in CI'.

_diag/p2-completion-sweep/ holds the dashboard + host-detail +
schedules + sources + repo + source-group-edit screenshots from a
clean sweep against :8080. Zero console errors throughout.

announce_test.go: rate-limit + global-cap subtests dropped t.Parallel
to avoid racing on the package-level tunables under -race.
2026-05-04 11:27:09 +01:00
steve 8ceb76c733 deploy: P2-17 install.ps1 (Windows installer)
Pwsh installer that detects arch, downloads
$Server/agent/binary?os=windows&arch=amd64 to
C:\Program Files\restic-manager\, runs the agent in -enroll-server
[+ -enroll-token] mode (token flow OR announce-and-approve), then
calls 'restic-manager-agent install' to register the SCM service.
Surfaces existing scheduled tasks named *restic* without disabling.

CLAUDE.md restage block updated to also stage install.ps1 alongside
install.sh.
2026-05-04 11:15:18 +01:00
steve d29475560d agent: P2-16 Windows service (SCM) integration
internal/agent/service: split by build tag into service_windows.go (svc.Handler
that listens for Stop/Shutdown + delegates to the agent loop) and
service_other.go (foreground stub for Linux/macOS). install_windows.go
wraps mgr.Connect+CreateService/Delete/Start/Stop for the new
'restic-manager-agent install|uninstall|start|stop' subcommands.

Cross-compile verified: GOOS=windows GOARCH=amd64 go build ./cmd/agent
succeeds. UNTESTED on Windows itself — the SCM round-trip can't be
exercised from Linux CI; treat as a starting point for the first
real Windows install.
2026-05-04 11:13:56 +01:00
steve bbdf631a01 ui+server: P2-18d pending hosts dashboard panel + expiry sweeper
Dashboard handler loads ListPendingHosts(now); template renders a
warn-bordered panel above the host table with hostname, OS/arch,
fingerprint (selectable / copyable), source IP, age, expiry. Each
row carries an inline accept form (repo URL/user/password) plus a
Reject button. cmd/server adds a 60s ticker calling
DeleteExpiredPendingHosts so 1h-stale rows drop off.
2026-05-04 11:11:32 +01:00