178 Commits

Author SHA1 Message Date
steve c446ca072e ui(alerts): make Acknowledge vs Resolve distinction visible
CI / Build (windows/amd64) (pull_request) Successful in 22s
CI / Lint (pull_request) Successful in 37s
CI / Build (linux/amd64) (pull_request) Successful in 22s
CI / Build (linux/arm64) (pull_request) Successful in 23s
CI / Test (linux/amd64) (pull_request) Successful in 3m55s
Both buttons make the row leave the Open tab, so on a quiet system
they look identical. The behavioural difference only manifests next
time the underlying condition fires:

  - Acknowledge silences fan-out while the problem persists; the
    alert parks on the Acknowledged tab and recurrences just touch
    last_seen_at without re-notifying.
  - Resolve closes the alert. If the same condition fires again
    later, a fresh alert with a new id is raised and the channels
    fan out as if it were the first time.

Add a one-line legend under the page header explaining both, and
title= tooltips on each button covering the same ground for keyboard
and assistive tech.
2026-05-04 23:11:46 +01:00
steve 84814b1386 Merge pull request 'Phase 3 — Alerts: per-source-group dedup' (#8) from p3-alerts-dedup into main
CI / Build (windows/amd64) (pull_request) Successful in 23s
CI / Build (linux/amd64) (pull_request) Successful in 23s
CI / Build (linux/arm64) (pull_request) Successful in 22s
CI / Lint (pull_request) Successful in 1m22s
CI / Test (linux/amd64) (pull_request) Successful in 1m28s
Reviewed-on: #8
2026-05-04 22:11:08 +00:00
steve a45c801884 feat(alerts): per-source-group dedup so two failing backups produce two alerts
Until now the open-alert key was (host_id, kind, resolved_at IS NULL).
A host with two source groups both failing collapsed onto one
backup_failed row: the second failure bumped last_seen_at and
overwrote the message but never fanned out again. Operators saw one
alert that appeared to flap, not two distinct broken things.

Schema changes (column-level ALTER, no rebuild):

- 0015 jobs.source_group_id (FK → source_groups, ON DELETE SET NULL,
  index). Populated for backup jobs in CreateJob.
- 0016 alerts.dedup_key (NOT NULL DEFAULT ''). The old alerts_open
  partial index gets dropped and replaced with a UNIQUE partial
  index on (host_id, kind, dedup_key) WHERE resolved_at IS NULL —
  the index is now the actual dedup primitive.

Plumbing:

- RaiseOrTouch / AutoResolve / Alert struct gain dedup_key.
- engine.JobFinishedEvent gains SourceGroupID; handleJobFinished
  passes it through for backup_failed only (forget/prune/check stay
  repo-scoped with key='').
- ws.handler reads SourceGroupID off the freshly-loaded job row.
- dispatchJobWithPayload gains a *string sourceGroupID arg; the
  per-group Run-now path and schedule.fire path pass &g.ID.

Test coverage: TestRaiseOrTouchDedupsPerSourceGroup proves two
distinct groups produce two distinct open alerts and that resolving
one does not auto-resolve the other.
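
A minimal in-memory sketch of the dedup behaviour described above. The
map key stands in for the UNIQUE partial index on (host_id, kind,
dedup_key) over open alerts; openKey, alertStore and their methods are
hypothetical names for illustration, not the store's real API.

```go
package main

import "fmt"

// openKey mirrors the (host_id, kind, dedup_key) tuple. Illustrative only.
type openKey struct {
	HostID   string
	Kind     string
	DedupKey string
}

// alertStore is a toy stand-in for the alerts table. The open map plays
// the role of the unique index over unresolved rows.
type alertStore struct {
	nextID int
	open   map[openKey]int // key -> open alert id
	seen   map[int]int     // alert id -> times raised or touched
}

func newAlertStore() *alertStore {
	return &alertStore{open: map[openKey]int{}, seen: map[int]int{}}
}

// raiseOrTouch returns the alert id and whether a new alert was raised.
// An existing open alert for the same key is only touched.
func (s *alertStore) raiseOrTouch(k openKey) (id int, raised bool) {
	if id, ok := s.open[k]; ok {
		s.seen[id]++
		return id, false
	}
	s.nextID++
	s.open[k] = s.nextID
	s.seen[s.nextID] = 1
	return s.nextID, true
}

// resolve closes the open alert for k, if there is one.
func (s *alertStore) resolve(k openKey) { delete(s.open, k) }

func main() {
	s := newAlertStore()
	a := openKey{HostID: "host-1", Kind: "backup_failed", DedupKey: "group-a"}
	b := openKey{HostID: "host-1", Kind: "backup_failed", DedupKey: "group-b"}
	idA, _ := s.raiseOrTouch(a)
	idB, _ := s.raiseOrTouch(b)
	_, raisedAgain := s.raiseOrTouch(a)
	fmt.Println(idA != idB, raisedAgain, len(s.open))
	s.resolve(a)
	fmt.Println(len(s.open))
}
```

With key='' for the repo-scoped kinds, the same structure collapses
back to one open alert per (host, kind).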

Dev tool: cmd/_fake_alert gains -dedup-key flag.
2026-05-04 22:59:48 +01:00
steve 7792aadb94 Merge pull request 'Phase 3 — Alerts (P3-05/06/07)' (#7) from p3-alerts into main
Reviewed-on: #7
2026-05-04 21:51:16 +00:00
steve 2eac324cec chore: ignore cmd/_* dev binaries + Tailwind rebuild
CI / Build (windows/amd64) (pull_request) Successful in 21s
CI / Build (linux/amd64) (pull_request) Successful in 21s
CI / Build (linux/arm64) (pull_request) Successful in 22s
CI / Lint (pull_request) Successful in 1m13s
CI / Test (linux/amd64) (pull_request) Successful in 1m20s
cmd/_fake_alert and similar one-shot dev tools live under cmd/_*
where Go's build tooling skips them. Add an explicit gitignore line
so an accidental 'git add cmd/.' can't drag them into a release.

styles.css is the regenerated Tailwind output — picks up the new
ntfy basic-auth fields and the right-rail preview ids.
2026-05-04 22:49:46 +01:00
steve 3cdaee63d4 fix: payload-preview rail follows kind switcher
CI / Lint (pull_request) Successful in 32s
CI / Build (windows/amd64) (pull_request) Successful in 43s
CI / Build (linux/amd64) (pull_request) Successful in 21s
CI / Test (linux/amd64) (pull_request) Successful in 1m18s
CI / Build (linux/arm64) (pull_request) Successful in 43s
Right-rail preview was rendered server-side via {{if eq $f.Kind ...}},
so it stayed on whatever kind the page loaded with. Editing an SMTP
channel and flipping to ntfy in the picker left the email RFC 5322
sample on screen.

Render all three preview panels with id='preview-<kind>' (only the
matching one visible on first render) and toggle their .hidden class
in the kind-switcher JS alongside the field panels. Same pattern
used for fields-<kind>.
2026-05-04 22:40:46 +01:00
steve 7f2a9964db fix: move channel delete-panel out of edit form (nested form bug)
CI / Build (windows/amd64) (pull_request) Successful in 21s
CI / Build (linux/amd64) (pull_request) Successful in 22s
CI / Build (linux/arm64) (pull_request) Successful in 21s
CI / Lint (pull_request) Successful in 1m11s
CI / Test (linux/amd64) (pull_request) Successful in 1m22s
The delete-panel <form action='.../delete'> was nested inside the
main <form action='.../edit'>. HTML doesn't allow nested forms —
browsers parse the inner form as if it didn't exist, so clicking
'Delete permanently' submitted the outer edit form to /edit
instead of /delete, leaving the channel intact.

Move the delete-panel block to a sibling of the main form. The
'Delete channel…' button still toggles its visibility via JS, the
panel still renders inside the page layout, and now its form
actually posts to the delete handler.
2026-05-04 22:35:58 +01:00
steve feaeff217d feat(ntfy): support HTTP Basic auth alongside access tokens
CI / Build (windows/amd64) (pull_request) Successful in 22s
CI / Build (linux/amd64) (pull_request) Successful in 22s
CI / Build (linux/arm64) (pull_request) Successful in 21s
CI / Lint (pull_request) Successful in 1m12s
CI / Test (linux/amd64) (pull_request) Successful in 1m18s
Self-hosted ntfy that doesn't expose a token-mint endpoint can still
authenticate over HTTP Basic. Add Username + Password fields to
NtfyConfig; the channel sends 'Authorization: Basic …' when token is
empty and username is set. Token wins when both are configured.

Form-side: two new optional fields next to the access token, with
the same write-only placeholder treatment as smtp_password (blank
on edit means 'keep stored value'). Username is round-tripped on
edit; password is masked.
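
A short sketch of the precedence rule above, assuming the token is sent
as a Bearer credential (ntfy accepts that form; the commit does not say
which form the channel uses). ntfyAuth, applyAuth and authHeader are
hypothetical names, and the URL is a placeholder.

```go
package main

import (
	"fmt"
	"net/http"
)

// ntfyAuth is a hypothetical subset of the channel config.
type ntfyAuth struct {
	Token    string
	Username string
	Password string
}

// applyAuth sets the Authorization header. The token wins when both are
// configured; Basic auth is used only when the token is empty and a
// username is set. With neither, no header is sent.
func applyAuth(req *http.Request, c ntfyAuth) {
	switch {
	case c.Token != "":
		req.Header.Set("Authorization", "Bearer "+c.Token)
	case c.Username != "":
		req.SetBasicAuth(c.Username, c.Password)
	}
}

// authHeader builds a throwaway request and reports the header applyAuth chose.
func authHeader(c ntfyAuth) string {
	req, err := http.NewRequest(http.MethodPost, "https://ntfy.example.com/backups", nil)
	if err != nil {
		return ""
	}
	applyAuth(req, c)
	return req.Header.Get("Authorization")
}

func main() {
	fmt.Println(authHeader(ntfyAuth{Username: "u", Password: "p"}))
	fmt.Println(authHeader(ntfyAuth{Token: "tk_example", Username: "u", Password: "p"}))
	fmt.Printf("%q\n", authHeader(ntfyAuth{}))
}
```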
2026-05-04 22:25:42 +01:00
steve cffad4b4f3 fix: enabled toggle — list-row click + edit-form save
CI / Build (windows/amd64) (pull_request) Successful in 22s
CI / Build (linux/amd64) (pull_request) Successful in 24s
CI / Build (linux/arm64) (pull_request) Successful in 24s
CI / Lint (pull_request) Successful in 1m15s
CI / Test (linux/amd64) (pull_request) Successful in 1m36s
Two bugs in the channel-enabled affordance:

1. List-row toggle was a static span with no handler; the row's
   row-link overlay swallowed every click and routed to /edit. Add
   POST /settings/notifications/{id}/toggle backed by a new store
   method SetNotificationChannelEnabled, and turn the row toggle
   into an htmx-driven button that swaps in the new state. Use
   event.stopPropagation() on the toggle so it beats the row link.

2. Edit-form toggle visually flipped but the underlying checkbox
   reverted: the visual span lives inside the <label>, so clicking
   it fired the inline JS handler AND the label's native
   checkbox-toggle, cancelling out. Bind to the checkbox 'change'
   event instead and let the label do the toggling — the JS just
   mirrors check.checked into the .on class.
2026-05-04 22:21:45 +01:00
steve 84e121bb9c fix: read 'name' across all per-kind sub-forms when editing channels
CI / Build (windows/amd64) (pull_request) Successful in 22s
CI / Lint (pull_request) Successful in 38s
CI / Build (linux/amd64) (pull_request) Successful in 21s
CI / Build (linux/arm64) (pull_request) Successful in 22s
CI / Test (linux/amd64) (pull_request) Successful in 2m39s
The channel form has three inputs all named 'name' (one per kind
section: webhook / ntfy / smtp), but only the visible kind's input
is filled in. PostForm.Get returns the first regardless of
emptiness, so editing an ntfy or smtp channel always read '' from
the (hidden, unfilled) webhook section's name input and rejected
with 'name required'.

Add firstNonEmpty helper that scans the slice for the first
non-blank value. Same flavour of bug as the enabled checkbox fix
in 6466f8c — both fall out of having multiple inputs share a name
across the per-kind sub-forms.
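
A sketch of what such a helper could look like. The signature is an
assumption (the commit only names the helper), and the form values are
made up to mirror the three 'name' inputs.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// firstNonEmpty returns the first value that is not blank after trimming
// whitespace, or "" when every value is blank.
func firstNonEmpty(vals []string) string {
	for _, v := range vals {
		if strings.TrimSpace(v) != "" {
			return v
		}
	}
	return ""
}

func main() {
	// Three inputs share the name; only the visible (ntfy) one is filled in.
	form := url.Values{"name": {"", "ops-ntfy", ""}}
	fmt.Printf("Get: %q, firstNonEmpty: %q\n", form.Get("name"), firstNonEmpty(form["name"]))
}
```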
2026-05-04 22:16:59 +01:00
steve c5b884a22b tasks: tick P3-05/06/07 + Playwright sweep notes
CI / Build (windows/amd64) (pull_request) Successful in 22s
CI / Lint (pull_request) Successful in 32s
CI / Build (linux/amd64) (pull_request) Successful in 22s
CI / Build (linux/arm64) (pull_request) Successful in 21s
CI / Test (linux/amd64) (pull_request) Successful in 3m44s
Sweep against the live smoke env confirmed the alerts subsystem
end-to-end: three channels (webhook → local sink, ntfy → ntfy.sh,
SMTP → MailHog) created and verified via the Test button; synthetic
critical raised; ack + resolve fan out alert.acknowledged /
alert.resolved across all three; dashboard banner appears and
clears; nav badge tracks open count.

Three real bugs found and fixed mid-sweep — see preceding three
commits for the full reasoning.
2026-05-04 21:01:34 +01:00
steve 3d99306cea fix: refresh hosts.open_alert_count on Raise/Resolve/AutoResolve
The denormalised projection was never written by the alerts code
path, so the dashboard's OPEN ALERTS card and the per-host alerts
column always read 0 regardless of how many alerts were open.
fleet.GetStats sums hosts.open_alert_count; if it never moves, the
card is decoration.

Add refreshHostOpenAlertCount that recomputes from the alerts table
(self-healing — no +/- bookkeeping to drift). Call it after the
commit in RaiseOrTouch when a row was inserted, after Resolve, and
after AutoResolve.

Caught during the live sweep: a synthetic critical raised the count
to 1, but resolving it left the dashboard reading '1 unresolved'
indefinitely.
2026-05-04 21:01:17 +01:00
steve 6466f8c759 fix: read enabled checkbox correctly when paired with hidden=0 sibling
The notification channel form has a <input hidden name=enabled value=0>
plus a <input checkbox name=enabled value=1> so unchecking the box
still submits 'enabled=0' (otherwise the field would just be absent).
But Go's url.Values.Get returns the FIRST value, so even when the
checkbox is ticked the handler read '0' and persisted enabled=false.

Scan r.PostForm["enabled"] for any '1' instead. Caught during the
sweep — all three test channels saved with enabled=0 even though
the toggle visually rendered ON.
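
The difference between Get and scanning the slice, as a small sketch.
anyIs is a hypothetical helper name; the handler's real code is not
shown in this commit.

```go
package main

import (
	"fmt"
	"net/url"
)

// anyIs reports whether any submitted value equals want.
func anyIs(vals []string, want string) bool {
	for _, v := range vals {
		if v == want {
			return true
		}
	}
	return false
}

func main() {
	// Ticked box: the hidden input submits 0, then the checkbox submits 1.
	ticked := url.Values{"enabled": {"0", "1"}}
	// Unticked box: only the hidden input is submitted.
	unticked := url.Values{"enabled": {"0"}}

	fmt.Printf("Get on ticked form: %q\n", ticked.Get("enabled"))
	fmt.Println(anyIs(ticked["enabled"], "1"), anyIs(unticked["enabled"], "1"))
}
```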
2026-05-04 21:00:54 +01:00
steve 9be3cead8e fix: dispatch alert.acknowledged + alert.resolved on UI ack/resolve
Spotted during the live Playwright sweep: clicking Acknowledge or
Resolve updated the alert row but never fanned out a notification.
The handlers went straight to Store.Acknowledge/Resolve, bypassing
the hub.

Add Engine.Acknowledge and Engine.Resolve that wrap the store call
and dispatch the matching event to every enabled channel. The UI
handlers prefer the engine path when wired, and fall back to the
direct store call so unit tests that construct a Server without an
engine still work.

Use context.WithoutCancel for the goroutine dispatch — the request
context is cancelled the instant the handler returns 204, so the
naive 'go e.hub.Dispatch(ctx, ...)' was racing the response and
losing the channel-list query with 'context canceled'.
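
A small demonstration of why the detached context survives the handler
returning. context.WithoutCancel needs Go 1.21 or later; errsAfterCancel
is a hypothetical helper that simulates the request lifecycle.

```go
package main

import (
	"context"
	"fmt"
)

// errsAfterCancel simulates a request context that is cancelled when the
// handler returns, and reports the error state of the request context
// and of a context detached from it with context.WithoutCancel.
func errsAfterCancel() (parent, detached error) {
	reqCtx, cancel := context.WithCancel(context.Background())
	bg := context.WithoutCancel(reqCtx) // keeps values, drops cancellation
	cancel()                            // the handler has returned
	return reqCtx.Err(), bg.Err()
}

func main() {
	parent, detached := errsAfterCancel()
	fmt.Println("request ctx:", parent)
	fmt.Println("detached ctx:", detached)
}
```

A detached context has no deadline of its own, so a real dispatch would
probably wrap it in context.WithTimeout to keep delivery bounded.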
2026-05-04 21:00:44 +01:00
steve ee410fcf95 alert: construct + run engine; expose hub to handlers
- Construct notification.NewHub and alert.NewEngine at boot in cmd/server/main.go
- Start go alertEngine.Run(ctx) after construction, before the HTTP listener
- Wire AlertEngine and NotificationHub into rmhttp.Deps (fields already existed)
- Remove the TODO(G1) in the offline sweeper; now calls NotifyHostOffline per ID
2026-05-04 20:32:10 +01:00
steve e0fbb8c980 ui: dashboard crit-alerts banner 2026-05-04 20:29:49 +01:00
steve 371fe734f3 ui: /settings/notifications list + edit form (3 kinds)
Add settings.html (shell + sub-tab nav + conditional list/edit body),
notifications.html and notification_edit.html (glob stubs), and the
supporting CSS tokens (.ch-row, .ch-icon, .toggle, .kind-grid,
.kind-card, .radio-pip, .test-pill) to input.css. Rebuild styles.css.
Add ui_parse_test.go to catch template regressions at test time.

The kind picker is JS-driven (no full page reload); the enabled toggle
mirrors the existing visual toggle pattern; the test-notification button
uses HTMX and renders the JSON response as a coloured pill client-side.
2026-05-04 20:25:06 +01:00
steve d373d19647 ui: F1 — populate OpenAlerts in baseView so nav badge updates everywhere
Flagged in review of cd38b40: the Alerts tab badge should show the
open count from any page, not just /alerts. baseView now takes the
request and queries store.ListAlerts(Status: "open") to fill
view.OpenAlerts on every page render. All call sites updated.
2026-05-04 20:19:09 +01:00
steve cd38b40516 ui: alerts list page + alert row partial + nav badge 2026-05-04 20:15:01 +01:00
steve de6939b3f6 http: /settings/notifications CRUD + test endpoint 2026-05-04 20:06:45 +01:00
steve 873821b871 http: /alerts list + ack/resolve handlers + /api/alerts JSON 2026-05-04 19:59:24 +01:00
steve 8c42b00228 alert: wire engine into ws hello + MarkJobFinished + offline sweep
- ws.HandlerDeps gains an AlertEngine *alert.Engine field; populated
  from http.Deps.AlertEngine (nil until G1 constructs the engine)
- runAgentLoop calls NotifyHostOnline after MarkHostHello succeeds
- dispatchAgentMessage MsgJobFinished case calls NotifyJobFinished,
  looking up the job Kind via Store.GetJob before notifying
- store.MarkHostsOfflineStaleReturnIDs added: SELECT+UPDATE in one
  transaction, returns the IDs that flipped to offline
- offline sweeper in cmd/server/main.go switched to the new variant;
  TODO(G1) comment marks where NotifyHostOffline calls will land
2026-05-04 19:54:39 +01:00
steve cb4695e09a alert: rule logic for the six v1 rules 2026-05-04 19:50:33 +01:00
steve f38930e2e6 alert: engine skeleton + event channels 2026-05-04 19:47:09 +01:00
steve 16e71a0708 notification: Hub fan-out + log writer 2026-05-04 19:44:31 +01:00
steve a6ac9ee71d notification: smtp channel 2026-05-04 19:40:21 +01:00
steve a99864c649 notification: B3 — Content-Type header + URL trim
Fixes flagged in spec review of f0a323e: ntfy POSTs need explicit
Content-Type: text/plain (the spec calls for it; ntfy works without
but explicit beats inferred); trim trailing slashes from server URL
to avoid double-slash when operators paste 'https://ntfy.sh/'.
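
Both fixes in a short sketch. topicURL and newPublish are hypothetical
names, and the topic 'backups' is a placeholder.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// topicURL joins server and topic, trimming trailing slashes from the
// server so 'https://ntfy.sh/' does not produce a double slash.
func topicURL(server, topic string) string {
	return strings.TrimRight(server, "/") + "/" + topic
}

// newPublish builds the POST with an explicit text/plain Content-Type.
func newPublish(server, topic, body string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, topicURL(server, topic), strings.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "text/plain")
	return req, nil
}

func main() {
	req, err := newPublish("https://ntfy.sh/", "backups", "backup failed on nas-01")
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.URL.String(), req.Header.Get("Content-Type"))
}
```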
2026-05-04 19:38:16 +01:00
steve f0a323ef91 notification: ntfy channel 2026-05-04 19:35:50 +01:00
steve c22fb24f5b notification: webhook channel 2026-05-04 19:33:29 +01:00
steve 6688b3f88a notification: payload + Channel interface 2026-05-04 19:31:27 +01:00
steve 69fc89143d store: notification_channels CRUD + AppendNotificationLog 2026-05-04 19:28:41 +01:00
steve b5a0aa4667 store: alerts CRUD with dedup + last_seen_at bump 2026-05-04 19:24:17 +01:00
steve f24dfa5214 store: migration 0014 — notification_channels + notification_log 2026-05-04 19:20:37 +01:00
steve 640b64710e store: A1 — check rows.Err() + Scan err in migrate_test
Code-quality nits flagged in review of e6d965d. Mirrors the existing
pattern in host_credentials_test.go.
2026-05-04 19:19:28 +01:00
steve e6d965d7a5 store: migration 0013 — alerts.last_seen_at 2026-05-04 19:16:59 +01:00
steve 4b70939ab5 docs: P3 alerts implementation plan 2026-05-04 19:00:18 +01:00
steve 518c29ddb3 docs: P3 alerts spec — add SMTP as first-class v1 channel
Post-brainstorm change after operator review: overnight-digest /
"don't ping me at 03:00, email me in the morning" use case is poorly
served by ntfy (push) and clumsy via webhook → email-gateway. SMTP joins
webhook + ntfy as the third v1 channel; Apprise stays deferred.

Spec updates:
- Decision 5 reworded: three channels in v1.
- Channel iface gains smtpChannel using net/smtp + crypto/tls. 10s
  timeout vs 5s for HTTP — STARTTLS handshake + DATA over a slow link
  legitimately needs the headroom.
- Migration 0014 CHECK now allows 'smtp'. New smtpConfig struct: host,
  port, encryption (starttls/tls/none), username, password (AEAD), from,
  to. One channel = one To-address; multi-recipient = multiple channels
  (keeps failure attribution per-recipient).
- Body shape documented: hardcoded subject pattern
  '[restic-manager] [<sev>] <host>: <kind>', Message-ID includes the
  alert id so threading groups raised → ack → resolved cleanly. Plain
  text only in v1.
- Encryption defaults to STARTTLS on 465/587; PLAIN auth over TLS, no
  XOAUTH2 yet (app passwords recommended for Gmail / M365).
- Test plan adds MailHog step in the Playwright sweep.
- Non-goals expanded: HTML emails, OAuth2/XOAUTH2, multi-recipient
  channels are explicitly out of v1.
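
The hardcoded subject pattern above, as a one-function sketch. The
function name and the sample host are illustrative; the Message-ID
format is not spelled out here, so it is left out.

```go
package main

import "fmt"

// subject renders '[restic-manager] [<sev>] <host>: <kind>'.
func subject(sev, host, kind string) string {
	return fmt.Sprintf("[restic-manager] [%s] %s: %s", sev, host, kind)
}

func main() {
	fmt.Println(subject("critical", "nas-01", "backup_failed"))
}
```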

Wireframe updates (_diag/p3-alerts-wireframe/wireframe.html):
- Kind picker grows from 2 cards to 3 (Webhook / Ntfy / SMTP @). SMTP
  gets the --ok green colour family so it visually separates from
  webhook (accent) and ntfy (warm).
- New SMTP variant section (3c): host+port+encryption row, user+pass
  row, from+to row, test result, plus right-rail email shape preview
  showing the RFC 5322 layout.
- Channel list grows a third row: 'overnight-digest · smtp://… →
  ops-overnight@example.com'.
2026-05-04 18:48:15 +01:00
steve 6165e34f6f docs: P3 alerts design spec
Phase 3 sub-spec covering the alerts engine, notification channels, and
UI (P3-05/06/07). Brainstorm ran 2026-05-04; all ten design decisions
locked before this spec was written.

Key decisions captured:

- Hardcoded rule set, no operator-tunable thresholds in v1. Six rules:
  backup_failed, forget_failed, prune_failed, check_failed,
  stale_schedule, agent_offline.
- Hybrid engine cadence: event hooks at MarkJobFinished + offline-sweeper
  for immediate triggers; one 60s ticker for stale-schedule detection +
  auto-resolution sweeps.
- Auto-resolve when underlying condition clears; manual Resolve any time;
  Acknowledge as a separate I-have-seen-it intermediate state that does
  NOT close the alert.
- v1 channels: native ntfy + webhook. Apprise + SMTP deferred. Channel
  scope is global only — no per-host or per-severity routing.
- Webhook payload is one stable JSON envelope shape across raised /
  acknowledged / resolved / test events; ntfy uses the standard publish
  format with severity → priority mapping.
- Per-channel Send Test Notification button hits the real send path with
  a synthetic info-severity event; inline green-tick / red-cross result.
- Dedup by (host_id, kind, resolved_at IS NULL); last_seen_at bumped on
  every confirming tick so the UI can render 'still happening · Ns ago'
  without re-notifying.
- Top-level /alerts page; Settings shell with Notifications sub-tab.
  Per-host vitals Open alerts cell deep-links into filtered list.
- Best-effort fire-and-forget delivery with 5s timeout; failures logged
  to a new notification_log table but never retried. Alert row in the DB
  is the source of truth.
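
One possible shape for the hybrid cadence: a single loop that selects
over an event channel and a tick channel. Everything here is a sketch
with hypothetical names; the tick channel is passed in so the demo can
drive it by hand, where a real engine would more likely pass
time.NewTicker(60 * time.Second).C.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// run handles immediate events as they arrive and runs the sweep on
// every tick, until ctx is cancelled.
func run(ctx context.Context, events <-chan string, tick <-chan time.Time,
	onEvent func(string), onTick func()) {
	for {
		select {
		case <-ctx.Done():
			return
		case ev := <-events:
			onEvent(ev)
		case <-tick:
			onTick()
		}
	}
}

// demo feeds one event and one tick through the loop and returns what
// the handlers saw, in order.
func demo() []string {
	ctx, cancel := context.WithCancel(context.Background())
	events := make(chan string)
	tick := make(chan time.Time)
	logCh := make(chan string)
	done := make(chan struct{})

	go func() {
		run(ctx, events, tick,
			func(ev string) { logCh <- "event:" + ev },
			func() { logCh <- "sweep" })
		close(done)
	}()

	var out []string
	events <- "job_finished" // immediate trigger
	out = append(out, <-logCh)
	tick <- time.Now() // periodic sweep
	out = append(out, <-logCh)

	cancel()
	<-done
	return out
}

func main() {
	fmt.Println(demo())
}
```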

Migrations:
- 0013 adds alerts.last_seen_at (column-level ALTER per CLAUDE.md)
- 0014 adds notification_channels + notification_log tables

Wireframe: _diag/p3-alerts-wireframe/wireframe.html
2026-05-04 18:39:26 +01:00
steve 64861a5fb8 Merge pull request 'Phase 3 — Restore (P3-X1, X2, 01, 02, 03, 09, X3-X6)' (#6) from p3-restore into main
Reviewed-on: #6
2026-05-04 17:06:18 +00:00
steve 28d5043eb0 test: lock-protect fakeSender so -race CI passes
CI / Lint (pull_request) Successful in 31s
CI / Build (linux/amd64) (pull_request) Successful in 20s
CI / Build (linux/arm64) (pull_request) Successful in 19s
CI / Test (linux/amd64) (pull_request) Successful in 1m27s
CI / Build (windows/amd64) (pull_request) Successful in 1m34s
The CI runs go test with -race; the agent runner has two pump goroutines
(pumpStdout + pumpStderr) writing through the sender concurrently, and
the unprotected fakeSender slice append raced. The cancel_test had a
local 'safeSender' workaround for the same issue; promote that mutex
onto fakeSender itself so every test in the package is race-clean
without per-test variants.

- fakeSender grows mu sync.Mutex; Send takes/releases. New snapshot()
  helper for tests that want a stable copy.
- cancel_test drops its local safeSender + sync import; uses fakeSender.

Verified: go test -race ./... passes across all packages.
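
A sketch of the mutex-protected fake. The message type is simplified to
a string and pumpBoth is a hypothetical stand-in for the two pump
goroutines; only the mu field, Send and snapshot() come from the commit
text.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeSender records everything sent through it. The mutex makes the
// append safe when several goroutines send concurrently.
type fakeSender struct {
	mu   sync.Mutex
	sent []string
}

// Send appends msg under the lock.
func (f *fakeSender) Send(msg string) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.sent = append(f.sent, msg)
	return nil
}

// snapshot returns a copy of the recorded messages, so tests can read
// it without holding the lock or seeing later appends.
func (f *fakeSender) snapshot() []string {
	f.mu.Lock()
	defer f.mu.Unlock()
	return append([]string(nil), f.sent...)
}

// pumpBoth starts two goroutines that each send n messages and waits
// for both to finish.
func pumpBoth(f *fakeSender, n int) {
	var wg sync.WaitGroup
	for _, stream := range []string{"stdout", "stderr"} {
		wg.Add(1)
		go func(stream string) {
			defer wg.Done()
			for i := 0; i < n; i++ {
				_ = f.Send(fmt.Sprintf("%s line %d", stream, i))
			}
		}(stream)
	}
	wg.Wait()
}

func main() {
	f := &fakeSender{}
	pumpBoth(f, 100)
	fmt.Println(len(f.snapshot()))
}
```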
2026-05-04 18:01:35 +01:00
steve e4031d26fa P3 wrap: agent auto-creates restore target; tasks.md ticked
CI / Lint (pull_request) Successful in 35s
CI / Build (linux/amd64) (pull_request) Successful in 20s
CI / Build (windows/amd64) (pull_request) Successful in 1m18s
CI / Build (linux/arm64) (pull_request) Successful in 46s
CI / Test (linux/amd64) (pull_request) Failing after 2m46s
1. Agent-side MkdirAll on the new-dir restore target. Restic creates
   missing leaves but won't traverse multiple missing levels, and
   under the systemd sandbox writes outside ReadWritePaths fail
   anyway. Calling os.MkdirAll(target, 0700) before invoking restic
   means the operator never has to pre-create the per-job subdir,
   and a path the sandbox rejects surfaces as a clean
   'restic restore: prepare target ...: read-only file system' error
   in the job log instead of a cryptic restic-side stat failure.

2. tasks.md Phase 3 — Restore section refreshed:
   - P3-X4 added (job log download dropdown — txt + ndjson)
   - P3-X5 added (UK lint locale switch + 73-correction sweep)
   - P3-X6 added (SIZE/FILES tooltip when host's restic < 0.17)
   - P3-03 entry expanded to cover version-gated --no-ownership,
     editable target, $HOME expansion, agent-side MkdirAll
   - As-shipped sweep summary mentions custom-target restore +
     download dropdown + tooltip in addition to the original walk

Test: TestRunRestoreNewDirAutoCreatesTarget seeds a multi-level
target the operator hasn't created and confirms RunRestore creates
the chain before invoking restic.
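
A sketch of the prepare step. prepareTarget and demo are hypothetical
names, and the error wording only approximates the message quoted
above.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// prepareTarget creates target and any missing parents with mode 0700,
// wrapping failures so the job log shows which step failed.
func prepareTarget(target string) error {
	if err := os.MkdirAll(target, 0o700); err != nil {
		return fmt.Errorf("restic restore: prepare target %s: %w", target, err)
	}
	return nil
}

// demo creates a multi-level target under a temp dir and reports
// whether it exists as a directory afterwards.
func demo() (bool, error) {
	tmp, err := os.MkdirTemp("", "rm-restore-")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(tmp)

	target := filepath.Join(tmp, "rm-restore", "job-123", "nested")
	if err := prepareTarget(target); err != nil {
		return false, err
	}
	info, err := os.Stat(target)
	if err != nil {
		return false, err
	}
	return info.IsDir(), nil
}

func main() {
	ok, err := demo()
	fmt.Println(ok, err)
}
```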
2026-05-04 17:51:34 +01:00
steve 02250670c1 ui: snapshots SIZE/FILES tooltip when host's restic is < 0.17
Per-snapshot size + file-count come from the embedded summary block
restic added to 'snapshots --json' in 0.17 (the source comment in
internal/restic/snapshots.go incorrectly said 0.16+). Hosts running
0.16.x leave those columns blank.

- Fix the snapshots.go doc comment: '0.16+' -> '0.17+'.
- hostDetailPage carries a LegacyRestic bool computed from the host's
  reported ResticVersion via Env.AtLeastVersion(0, 17). Empty version
  also counts as legacy (conservative default).
- Template attaches title='Needs restic 0.17+ on the agent host. This
  host runs <ver>.' + cursor:help on the SIZE / FILES headers when
  the flag is true. Hosts already on 0.17+ get no tooltip and no
  extra styling.

A host upgrading restic to 0.17+ gets the columns populated on the
next backup automatically — no further code change needed.
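
A sketch of a version gate with the conservative default described
above. The real check is a method on Env; atLeast is a hypothetical
free function with assumed parsing rules (major.minor[.patch], anything
unparseable counts as too old).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// atLeast reports whether version is at least major.minor. Empty or
// non-numeric versions return false, so callers treat them as legacy.
func atLeast(version string, major, minor int) bool {
	parts := strings.SplitN(strings.TrimSpace(version), ".", 3)
	if len(parts) < 2 {
		return false
	}
	mj, err := strconv.Atoi(parts[0])
	if err != nil {
		return false
	}
	mn, err := strconv.Atoi(parts[1])
	if err != nil {
		return false
	}
	if mj != major {
		return mj > major
	}
	return mn >= minor
}

func main() {
	for _, v := range []string{"", "0.16.4", "0.17.0", "0.17.3", "1.0.0", "unknown"} {
		legacy := !atLeast(v, 0, 17)
		fmt.Printf("%-8q legacy=%v\n", v, legacy)
	}
}
```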
2026-05-04 17:45:32 +01:00
steve 8e06bc7924 ui: tidy job-page download into a single dropdown
Replace the floating 'Download log' button + bare '.ndjson' link with
one cohesive dropdown menu — same affordance as the rest of the
header, opens to two well-described options.

- Native <details><summary> for keyboard + no-JS support; only the
  click-outside-to-close handler is JS (a few lines).
- New .dropdown / .dropdown-menu / .dropdown-item tokens in
  web/styles/input.css. Reusable for future header menus
  (host-detail overflow, source-group action menus, etc).
- Chevron flips 180 degrees when open via .dropdown[open] selector.
- Each option has a label + a mono hint line explaining when to pick it
  (.txt for humans / paste into a ticket; .ndjson for jq / tooling).
2026-05-04 17:36:57 +01:00
steve f0dfa689fe P3 follow-up: editable target dir, conditional --no-ownership, UK lint
Three small follow-ups from review:

1. Restore target is now operator-editable. Default value is the
   literal '$HOME/rm-restore/<job-id>/' (agent expands $HOME at
   run time using os.UserHomeDir(); also handles ${HOME} and ~/
   prefixes). Operator can replace with any absolute path.
   - ui_restore.go validates the input is either absolute or starts
     with one of the recognised prefixes; other env-var refs ($PATH
     etc.) are deliberately rejected so operator paths can't pick up
     arbitrary agent env values.
   - host_restore.html replaces the read-only mono-text display with
     a real <input>; help text spells out that $HOME resolves
     agent-side and <job-id> is substituted on dispatch.
   - install.sh + the systemd unit prep /root/rm-restore so the
     default works under the sandbox: ReadWritePaths gains a soft
     '-/root/rm-restore' entry (the '-' makes the bind-mount soft-fail
     if missing, but install.sh pre-creates it root-owned 0700).

2. --no-ownership flag now gated on restic version. The flag was
   added in restic 0.17 and 0.16 rejects it. The previous fix dropped
   it wholesale, which meant new-dir restores silently preserved
   ownership against design intent on 0.17+. Now the agent threads
   its detected restic version (sysinfo already collects it) through
   runner.Config -> restic.Env, and RunRestore appends --no-ownership
   only when AtLeastVersion(0, 17) returns true. 0.16 hosts still
   restore with original uid/gid; help text in the wizard explicitly
   notes this. The previous 'Original ownership is preserved' copy
   was wrong for new-dir mode and is corrected.

3. golangci-lint misspell locale switched US -> UK and the codebase
   swept (73 corrections, mostly behaviour/serialise/recognise/honour).
   Wire-format ErrorCode 'unauthorized' -> 'unauthorised' is a tiny
   contract change but the agent doesn't parse those codes today and
   no external API consumers exist yet. Tests passed before + after.

Tests:
- internal/restic/version_test.go covers Env.AtLeastVersion across
  edge cases (empty, exact match, patch above, minor below,
  non-numeric) and expandHome on $HOME / ${HOME} / ~/, plus
  pass-through for absolute paths and refusal of other env vars.
- ui_restore_test updated: TargetDir now starts '$HOME/rm-restore/'
  with the job_id substituted into the placeholder.
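
A sketch of the expansion rules covered by those tests. The signature
is an assumption: the home directory is passed in so the function can
be tested without touching the environment, whereas the agent resolves
it with os.UserHomeDir().

```go
package main

import (
	"fmt"
	"os"
	"path"
	"strings"
)

// expandHome replaces a leading ${HOME}, $HOME or ~ (followed by '/' or
// end of string) with home. Absolute paths pass through unchanged. Any
// other '$' reference, and any relative path, is refused.
func expandHome(p, home string) (string, error) {
	for _, prefix := range []string{"${HOME}", "$HOME", "~"} {
		rest, ok := strings.CutPrefix(p, prefix)
		if ok && (rest == "" || strings.HasPrefix(rest, "/")) {
			return home + rest, nil
		}
	}
	if strings.Contains(p, "$") {
		return "", fmt.Errorf("unsupported variable reference in %q", p)
	}
	if !path.IsAbs(p) {
		return "", fmt.Errorf("path %q is not absolute", p)
	}
	return p, nil
}

func main() {
	home, err := os.UserHomeDir()
	if err != nil {
		fmt.Println("no home dir:", err)
		return
	}
	for _, p := range []string{
		"$HOME/rm-restore/job-123/",
		"${HOME}/x",
		"~/x",
		"/tmp/custom-restore/job-123/",
		"$PATH/x",
	} {
		got, err := expandHome(p, home)
		fmt.Printf("%-30s -> %q (err: %v)\n", p, got, err)
	}
}
```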

Live verified on the smoke env: default target restored to
/root/rm-restore/<job-id>/ as the agent's expanded $HOME (2 files,
14 bytes); custom override '/tmp/custom-restore/<job-id>/' restored
into the agent's PrivateTmp namespace (1 file, 6 bytes); both jobs
'succeeded', exit 0.
2026-05-04 17:27:52 +01:00
steve a2398d0b66 P3 follow-up: log download (txt + ndjson) on the live job page
The diff job's full output streams to the standard live job log page,
which can be a lot of text the operator wants to grep through or paste
into a ticket. Add a Download button.

Source of truth is the persisted job_logs table — works any time
(running or finished) and doesn't need to pause the live WS stream.
The download is 'everything the server has up to right now'; if the
operator wants a fuller snapshot of a still-running job, they hit
Download again.

- New endpoint GET /api/jobs/{id}/log.{txt,ndjson} (chi {format}
  matcher constrained to the two known suffixes). Auth via session
  cookie. 404 on unknown job.
- internal/server/http/job_download.go writeLogsText emits a small
  header + 'HH:MM:SS.mmm  TAG  payload' rows mirroring what the live
  page shows. writeLogsNDJSON emits one self-contained {seq,ts,stream,
  payload} JSON object per line — appending stays valid (each line
  stands alone), and the whole file pipes cleanly into jq. NDJSON is
  newline-delimited JSON; not the same as a JSON array.
- web/templates/pages/job_detail.html grows two header buttons:
  'Download log' (txt) + '.ndjson' ghost variant for tooling.

Tests cover the txt format (header + per-row shape), the ndjson
format (each line round-trips through json.Unmarshal), unknown job
404, unauthenticated 401.
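
A sketch of the NDJSON writer using encoding/json, whose Encoder writes
one object followed by a newline per Encode call. The field names
follow the {seq,ts,stream,payload} shape above; the Go types, logLine
and renderNDJSON are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// logLine is one persisted log row. Field types are assumed.
type logLine struct {
	Seq     int64  `json:"seq"`
	TS      string `json:"ts"`
	Stream  string `json:"stream"`
	Payload string `json:"payload"`
}

// writeNDJSON emits one self-contained JSON object per line, so the
// output stays valid when more lines are appended later.
func writeNDJSON(w io.Writer, lines []logLine) error {
	enc := json.NewEncoder(w)
	for _, l := range lines {
		if err := enc.Encode(l); err != nil {
			return err
		}
	}
	return nil
}

// renderNDJSON is a convenience wrapper that returns the output as a string.
func renderNDJSON(lines []logLine) (string, error) {
	var sb strings.Builder
	if err := writeNDJSON(&sb, lines); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := renderNDJSON([]logLine{
		{Seq: 1, TS: "2026-05-04T17:00:00Z", Stream: "stdout", Payload: "hello"},
		{Seq: 2, TS: "2026-05-04T17:00:01Z", Stream: "stderr", Payload: "warning"},
	})
	if err != nil {
		fmt.Println("encode:", err)
		return
	}
	fmt.Print(out)
}
```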
2026-05-04 17:12:45 +01:00
steve e22b41d452 P3 sweep fixes: snap-row CSS, tree expand, --no-ownership drop, target path
Bug fixes from the Playwright sweep against the live smoke server:

1. Snapshot-picker layout. The .snap-row class was used in the wireframe
   but never landed in web/styles/input.css; rows rendered as vertical
   blocks instead of a 6-column grid. Added the token (mirrors host-row
   shape with restore-specific column widths).

2. Tree expansion. hx-target='closest .tree-row + .tree-children' isn't
   a valid HTMX selector — modifiers don't chain. Replaced HTMX-driven
   expansion with a small window.__rmTreeToggle helper that uses plain
   fetch + .tree-pair wrapper structure for trivial sibling lookup.
   Caches loaded state per node.

3. --no-ownership flag dropped. Restic 0.17 introduced --no-ownership;
   0.16 rejects it ('unknown flag') before doing any work. Since the
   agent runs as root in the systemd unit, restored files keep their
   original uid/gid either way and the parent dir is root-owned, so
   the 'cp without sudo' rationale doesn't hold. Drop the flag entirely.

4. Default target dir moved to /var/lib/restic-manager/restore. The
   systemd unit pins ReadWritePaths to /etc/restic-manager +
   /var/lib/restic-manager (with ProtectSystem=strict making the rest
   of /var read-only); writes to /var/restic-restore failed with
   'read-only file system'.

5. Confirm summary HTML escaping. defaultTarget JS literal evaluates
   to a string with literal angle brackets; insertion into innerHTML
   must escape them. Added an inline HTML-escape pass.

tasks.md ticked for the Restore sub-phase with a sweep summary
covering the live end-to-end test.
2026-05-04 15:57:42 +01:00
steve 1111124573 P3-09 + P3-X3: snapshot diff + recent-restores line
P3-09 — snapshot diff dispatcher.
- POST /api/hosts/{id}/snapshots/diff (and the unprefixed HTMX-form
  variant) takes {snapshot_a, snapshot_b}, validates both belong to
  the host (long id / short id / prefix match), checks the agent is
  online, mints a JobDiff, ships command.run with DiffPayload, writes
  a host.snapshot_diff audit row, returns HX-Redirect to the live
  job page (or JSON {job_id, job_url} for REST callers).
- Two-snapshot guard: POSTing diff(a,a) returns 422.
- UI: small panel on the host_detail right rail (visible when the
  host has 2+ snapshots) with two short-id inputs and a Diff button.
  Output renders on the standard live job page where the operator
  reads the per-line diff text directly.

P3-X3 — recent-restores line.
- hostChromeData grows RestoreStatus / RestoreAt / RestoreJobID
  populated via store.LatestJobByKind(host_id, 'restore') (already
  exists, used by the init line).
- host_chrome.html renders a small line below the existing init-status
  one with status-coloured copy + a link to the job log. Hidden when
  no restore has ever run on this host.

Tests:
- diff_test covers happy path (correct DiffPayload + HX-Redirect),
  same-id rejection (422), unknown-id rejection (422). Adds a
  seedTwoSnapshots helper since ReplaceHostSnapshots is atomic-swap
  (calling seedSnapshot twice would only leave the second).

Restage block (CLAUDE.md) deferred to the end of the restore phase.
2026-05-04 15:38:28 +01:00
steve 6e47efc146 P3-01/02/03: restore wizard backend + templates + restore-shaped job page
End-to-end wizard from /hosts/{id}/restore (or per-snapshot deep link
/hosts/{id}/snapshots/{sid}/restore) → tree-browse → dispatch →
restore-shaped live job page.

Backend (internal/server/http/ui_restore.go):
- GET handlers render the four-step wizard against the wireframe shape
  in docs/superpowers/specs/2026-05-04-p3-restore-design.md.
- HTMX tree partial endpoint hits fetchTreeWithCache (P3-X2) so each
  directory expansion is a sub-second cached lookup after the first
  miss.
- POST validates: snapshot_id non-empty, ≥1 absolute path, in-place
  mode requires confirm_hostname == host name, agent online. On error
  re-renders the wizard with the operator's input intact. Happy path
  mints a job_id, computes the new-directory target as
  /var/restic-restore/<job-id>/ (operator can't escape the prefix —
  server picks it), creates the job row, ships command.run with
  kind=restore + RestorePayload, writes a host.restore audit row,
  returns HX-Redirect (or 303) to the live job page.
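
The POST checks above can be sketched as a single validation function. This is a minimal illustration, not the handler in ui_restore.go: the restoreForm struct, validateRestore and targetDir are hypothetical names, and the absolute-path test assumes POSIX-style paths.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// restoreForm mirrors the wizard fields named in the commit message;
// the struct itself is hypothetical.
type restoreForm struct {
	SnapshotID      string
	Paths           []string
	InPlace         bool
	ConfirmHostname string
}

// validateRestore applies the POST checks in order and returns the first failure.
func validateRestore(f restoreForm, hostName string, agentOnline bool) error {
	if f.SnapshotID == "" {
		return errors.New("snapshot_id is required")
	}
	if len(f.Paths) == 0 {
		return errors.New("select at least one path")
	}
	for _, p := range f.Paths {
		if !strings.HasPrefix(p, "/") { // assumption: POSIX-style absolute paths
			return fmt.Errorf("path %q is not absolute", p)
		}
	}
	if f.InPlace && f.ConfirmHostname != hostName {
		return errors.New("confirm_hostname does not match the host name")
	}
	if !agentOnline {
		return errors.New("agent is offline")
	}
	return nil
}

// targetDir is built by the server from the job id alone, so no operator
// input can influence the prefix.
func targetDir(jobID string) string {
	return "/var/restic-restore/" + jobID + "/"
}

func main() {
	f := restoreForm{SnapshotID: "abc123", Paths: []string{"/etc/hosts"}}
	fmt.Println(validateRestore(f, "web-01", true))
	fmt.Println(targetDir("job-1"))
}
```

Returning the first failure keeps the re-rendered wizard focused on one message at a time; a real handler might collect all errors instead.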

Templates:
- host_restore.html: single-page progressively-enabled wizard matching
  _diag/p3-restore-wizard wireframe. Form-state-driven JS computes a
  running tally of selected paths and the step-4 confirm summary
  client-side; the server re-renders on validation failure with form
  fields preserved.
- partials/tree_node.html: recursive HTMX-served tree fragment.
- Top-level Restore button on host_detail right rail + per-snapshot
  Restore action on snapshot rows replace the previous P3-stub.

Restore-shaped job page (job_detail.html):
- Progress widget rendered as a panel rather than a bare strip when
  the job is active.
- Current-file display under the bar, updated from log.stream stdout
  lines that look like absolute paths. Hidden for non-restore kinds.

Migration 0012:
- Add restore + diff to the jobs.kind CHECK. Rebuild required (SQLite
  can't ALTER CHECK in place); follows the safe pattern from 0005.
  Defensive: stash job_logs into a temp table before the rebuild and
  INSERT OR IGNORE back afterwards so even if SQLite cascades on
  DROP TABLE jobs the log history survives.

Tests:
- ui_restore_test covers GET step-1 render, GET pre-selected snapshot
  summary card, POST missing snapshot, POST missing paths, POST
  in-place wrong-hostname rejection (no command.run leaks to the
  agent), POST happy path (HX-Redirect + correct payload + audit
  row), POST against offline host returns 503.

Restage block (CLAUDE.md) deferred to the end of the restore phase.
2026-05-04 15:34:29 +01:00
steve 265b4b6c5d P3-03: restic restore + diff execution path
Wires JobRestore and JobDiff end-to-end at the agent layer (the wizard
backend that drives this lands in the next slice).

- internal/api: JobRestore + JobDiff JobKind constants. CommandRunPayload
  grows nullable Restore + Diff sub-payloads. RestorePayload carries
  snapshot_id, paths, in_place, target_dir; DiffPayload carries
  snapshot_a + snapshot_b.
- internal/restic.RunRestore wraps 'restic restore <sid> --target ...
  [--no-ownership] [--include p]...' with --json. New pumpRestoreStdout
  parses the per-line status / summary objects (drops raw status from
  log.stream — the throttled job.progress envelope covers it). New
  RestoreStatus + RestoreSummary types mirror restic's wire shape.
- internal/restic.RunDiff wraps 'restic diff --json <a> <b>'.
- internal/agent/runner: RunRestore translates RestoreStatus into
  job.progress (mapping FilesRestored → FilesDone etc) with a small
  estimateETA helper since restic doesn't provide ETA for restore.
  RunDiff is a thin streamHandler wrapper.
- cmd/agent dispatcher gains JobRestore + JobDiff cases. Both reuse
  the spawn() helper from P3-X1 so cancel just works.
- Drive-by fix: lastProgress was initialised to time.Now() so the
  very first status event was suppressed by the 1s throttle if the
  agent reported quickly. Initialise to time.Time{} (zero) so the
  first event always emits. Affects backup + restore.

Tests:
- restore_test covers restore happy path (started → progress →
  finished, kind=restore on the started envelope), in-place argv
  asserts no --no-ownership, new-dir argv asserts --no-ownership +
  --target + --include, diff produces the expected log.stream lines.

Restage block (CLAUDE.md) is deferred to the end of the restore
sub-phase so we restage once with all changes.
2026-05-04 15:24:14 +01:00
steve 6d295bc9f6 P3-X2: tree.list synchronous WS RPC + per-session cache
Foundational for the restore wizard's tree browser. The wizard needs to
lazy-load directory contents from a snapshot as the operator drills
down; this lands the transport.

- internal/api adds MsgTreeList (server → agent) + MsgTreeListResult
  (agent → server) with TreeListRequestPayload / TreeListEntry /
  TreeListResultPayload types. Reply correlates by Envelope.ID.
- internal/restic.ListTreeChildren wraps 'restic ls --json' and
  filters its recursive output to direct children of the requested
  path. Parser + path-normalisation + isDirectChild are unit-tested.
- internal/server/ws/rpc.go introduces a generic SendRPC helper on
  Hub: register a buffered channel keyed by ULID, send the request,
  block on ctx.Done()/timeout/reply. Reply routing piggybacks on the
  existing dispatchAgentMessage by adding a MsgTreeListResult case
  that forwards to the registered waiter; if no waiter is registered
  (caller already gave up) the stray reply is dropped quietly.
- cmd/agent gains a tree.list handler that runs ListTreeChildren on a
  fresh per-call context (60s ceiling) and ships the matching
  tree.list.result envelope. Errors surface in result.Error rather
  than as transport failures so the server-side waiter can render a
  sensible UI message.
- internal/server/http/tree_cache.go is the per-wizard-session cache
  layer (~30min TTL, sweep-on-access) that fetchTreeWithCache uses
  before falling through to SendRPC. Cached on success only; agent
  errors aren't cached so a transient failure doesn't poison the
  session.

Tests:
- internal/restic/ls_test.go covers parseLsChildren at root / mid-tree
  / leaf, plus normalizeTreePath and isDirectChild edge cases.
- internal/server/ws/rpc_test.go unit-tests the registry: round-trip,
  release semantics, concurrent waiters, ctx-cancel.
- internal/server/http/tree_rpc_test.go is the full round-trip: server
  SendRPC → fake-agent over a real WS → reply → server gets the
  payload. Plus a timeout test that confirms ~300ms timeouts terminate
  in ~300ms rather than waiting forever.

The cache is plumbed but no UI handler hits fetchTreeWithCache yet —
that lands with P3-01 (wizard backend). The unused-linter is suppressed
via nolint until the wizard wires it in.
2026-05-04 15:19:22 +01:00
steve 9fa2ef48f0 P3-X1: cancel-job feature
Wires the existing job_detail Cancel button (which was a UI stub) into
real backend behaviour:

- internal/api already declared MsgCommandCancel + CommandCancelPayload;
  promote those from forward-declarations to a working envelope. Agent
  side: cmd/agent/main.go drops the TODO-stub and gains a per-job
  ctx.CancelFunc map. runJob's switch is refactored around a small
  spawn() helper so each kind's goroutine derives a per-job context,
  registers the cancel, and removes itself on completion regardless of
  outcome. command.cancel looks up the func and fires it.
- internal/agent/runner.sendFinished now takes ctx and rebadges
  ctx.Canceled errors as JobCancelled (exit 130) rather than
  JobFailed. All Run* call sites updated.
- internal/restic.resticCmd sets cmd.Cancel to send SIGTERM (via
  build-tagged sigterm constant; os.Kill on Windows since SIGTERM
  isn't deliverable there) and cmd.WaitDelay=5s for the SIGKILL
  fallback. SIGTERM lets restic remove its lock file before exiting.
- New POST /api/jobs/{id}/cancel server endpoint validates the job
  is non-terminal and the host is online, sends command.cancel via
  the hub, writes a job.cancel audit row, returns 202. The agent's
  resulting job.finished (status=cancelled) is what actually
  transitions the row.

Tests:
- internal/server/http/cancel_test.go covers happy path (envelope
  shape + audit row), 409 for terminal jobs, 404 for missing jobs,
  503 for offline hosts.
- internal/agent/runner/cancel_test.go covers cancel mid-run: a fake
  restic that execs into 'sleep 30' is cancelled 150ms after start
  and the resulting job.finished reports JobCancelled with exit 130
  in well under the WaitDelay.

Foundational for P3 restore (operator needs to be able to cancel a
running backup if they need to restore urgently). Independently useful
for prune/check/backup that are stuck.
2026-05-04 15:11:49 +01:00
steve 454a2415dc docs: P3 restore design spec + scope-decompose Phase 3
Splits Phase 3 into three independently-shippable sub-phases (Restore,
Alerts, Audit UI) so they can land in separate PRs with their own brainstorm
→ spec → plan cycles. The Restore sub-phase is up first.

The brainstorm ran on 2026-05-04 and locked the following decisions:

- Single-host restore only this phase. P3-04 (cross-host restore) is moved
  to a new 'Future / unscheduled' section. Disaster recovery is already
  covered by re-enrolling a replacement host with the same repo creds; the
  remaining 'pull a file from host A onto host C' use case is genuinely
  different (file sharing / migration, not DR) and has no confirmed need.
- Default target is /var/restic-restore/<job-id>/ with --no-ownership;
  in-place restore preserves uid/gid/mode and is gated by typed-confirmation
  of the host name (mirroring the repo re-init danger zone).
- Tree browser is the path picker, lazy-loaded via a synchronous WS RPC
  (tree.list) over the existing correlation-ID infrastructure with a
  per-wizard-session in-memory cache (~30 min TTL).
- Single-page wizard with progressively-enabled sections; entry is a
  top-level Restore button on host detail (or per-snapshot Restore action
  for direct deep-link).
- Snapshot diff (P3-09) is a JobDiff JobKind, dispatched like every other
  agent operation; output streams to the standard live job log page.
- Restore-specific live job page variant with files-restored /
  bytes-restored / current-file widget.
- Single-flight per host across all kinds, plus a real cancel-job feature
  (command.cancel WS envelope, agent kills the restic subprocess via
  context cancel + SIGTERM/SIGKILL grace) so the operator can pre-empt a
  long-running backup if they need to restore urgently. Wires the existing
  job_detail Cancel button (which was a UI stub).
- Audit row host.restore on every dispatch + a recent-restores panel on
  host detail. Role gate deferred to P4-03 RBAC.

Wireframe at _diag/p3-restore-wizard/wireframe.html (gitignored —
transient design artefact); screenshot reviewed and approved 2026-05-04.
2026-05-04 15:02:32 +01:00
steve 0bd7a896c4 Merge pull request 'P2 completion (P2R-09/10/11/12/13/14, P2-16/17/18)' (#5) from p2-completion into main 2026-05-04 13:19:05 +00:00
steve bdabcfb68e docs: note Gitea repo + tea CLI in CLAUDE.md
CI / Build (windows/amd64) (pull_request) Successful in 19s
CI / Lint (pull_request) Successful in 21s
CI / Build (linux/amd64) (pull_request) Successful in 19s
CI / Build (linux/arm64) (pull_request) Successful in 19s
CI / Test (linux/amd64) (pull_request) Successful in 2m17s
2026-05-04 14:18:50 +01:00
steve c691dc8a56 tasks: tick P2 completion + Playwright sweep screenshots
CI / Build (windows/amd64) (pull_request) Successful in 20s
CI / Lint (pull_request) Successful in 41s
CI / Build (linux/amd64) (pull_request) Successful in 21s
CI / Test (linux/amd64) (pull_request) Successful in 53s
CI / Build (linux/arm64) (pull_request) Successful in 1m48s
P2R-09/10/11/12/13/14, P2-16/17/18 all marked done. Acceptance line
for Windows hosts annotated as 'compile-verified, untested in CI'.

_diag/p2-completion-sweep/ holds the dashboard + host-detail +
schedules + sources + repo + source-group-edit screenshots from a
clean sweep against :8080. Zero console errors throughout.

announce_test.go: rate-limit + global-cap subtests dropped t.Parallel
to avoid racing on the package-level tunables under -race.
2026-05-04 11:27:09 +01:00
steve 8ceb76c733 deploy: P2-17 install.ps1 (Windows installer)
Pwsh installer that detects arch, downloads
$Server/agent/binary?os=windows&arch=amd64 to
C:\Program Files\restic-manager\, runs the agent in -enroll-server
[+ -enroll-token] mode (token flow OR announce-and-approve), then
calls 'restic-manager-agent install' to register the SCM service.
Surfaces existing scheduled tasks named *restic* without disabling them.

CLAUDE.md restage block updated to also stage install.ps1 alongside
install.sh.
2026-05-04 11:15:18 +01:00
steve d29475560d agent: P2-16 Windows service (SCM) integration
internal/agent/service: build-tagged into service_windows.go (svc.Handler
that listens for Stop/Shutdown + delegates to the agent loop) and
service_other.go (foreground stub for Linux/macOS). install_windows.go
wraps mgr.Connect+CreateService/Delete/Start/Stop for the new
'restic-manager-agent install|uninstall|start|stop' subcommands.

Cross-compile verified: GOOS=windows GOARCH=amd64 go build ./cmd/agent
succeeds. UNTESTED on Windows itself — the SCM round-trip can't be
exercised from Linux CI; treat as a starting point for the first
real Windows install.
2026-05-04 11:13:56 +01:00
steve bbdf631a01 ui+server: P2-18d pending hosts dashboard panel + expiry sweeper
Dashboard handler loads ListPendingHosts(now); template renders a
warn-bordered panel above the host table with hostname, OS/arch,
fingerprint (selectable / copyable), source IP, age, expiry. Each
row carries an inline accept form (repo URL/user/password) plus a
Reject button. cmd/server adds a 60s ticker calling
DeleteExpiredPendingHosts so 1h-stale rows drop off.
2026-05-04 11:11:32 +01:00
steve a3a53e3b87 agent: P2-18c announce-and-approve enrolment path
When -enroll-server is supplied without -enroll-token, the agent
mints (and persists) an Ed25519 keypair, POSTs /api/agents/announce,
prints the SHA256 fingerprint in a copy-friendly banner, opens
/ws/agent/pending, signs the server's nonce, and blocks until the
admin clicks Accept (1h ceiling). On accept, persists the bearer +
host_id from the 'enrolled' message; on reject (close code 4001)
exits with a clear error.

Repo creds are pushed via config.update on the first standard WS
hello (P1-32 path), not in the enrolled message itself.
2026-05-04 11:09:47 +01:00
steve 567561a6a3 server: P2-18b pending WS + admin accept/reject
GET /ws/agent/pending?pending_id=… runs an Ed25519 nonce-sign
handshake against the row's stored public key, then holds the
connection open. POST /api/pending-hosts/{id}/accept (admin)
mints a real Host row + bearer + AEAD-encrypted repo creds, pushes
the bearer down the open WS, deletes the pending row, and writes
a host.accept_pending audit entry. POST /api/pending-hosts/{id}/reject
closes the socket with code 4001 and audit-logs host.reject_pending.

In-memory pendingHub keyed by pending_id wires accept/reject to
their live socket.
2026-05-04 11:07:32 +01:00
steve a8e6c9d6d7 store+server: P2-18a announce-and-approve schema + endpoint
migration 0011 adds pending_hosts table (id, hostname, public_key,
fingerprint, expiry). store/pending_hosts.go covers full CRUD plus
hostname-collision count + expired-row sweeper.

POST /api/agents/announce takes {hostname, os, arch, agent_version,
restic_version, public_key (base64)}, returns {pending_id,
fingerprint, hostname_collision}. Per-source-IP token-bucket
rate limit (10/min) + global cap of 100 in-flight rows. Public
key must be exactly 32 bytes (Ed25519).
2026-05-04 11:03:41 +01:00
steve 1d3661470f ui: P2R-12 hook editor — source-group form + host-default Repo section
Source-group edit form gains pre/post hook textareas with a service-
user warning banner; bodies AEAD-encrypted on save (per-group AD).
Repo page adds a 'Host-default hooks' panel above the danger zone
with the same shape; saved via POST /hosts/{id}/repo/hooks.
2026-05-04 11:00:28 +01:00
steve 13c35b68d4 agent+server: P2R-11 pre/post hook execution for backup jobs
Agent: new runner.BackupHooks struct + runHook helper invoked via
/bin/sh -c (cmd.exe /C on Windows). pre_hook non-zero exit aborts
the backup; post_hook always runs with RM_JOB_STATUS=succeeded|failed
in env. Output streamed as 'hook(<phase>): …' log.stream lines.
Hooks only run for kind=backup (other kinds skip both phases).

Server: resolveBackupHooks resolves group → host default → empty,
decrypts via crypto.AEAD with per-slot ad bytes, plumbs plaintext
into CommandRunPayload for both schedule.fire and per-group
Run-now dispatch sites. Decrypt failures degrade silently to no
hook so a malformed blob can't poison every backup.
2026-05-04 10:57:28 +01:00
steve c20375eaf5 store: P2R-10 schema for source-group + host-default hooks (migration 0010)
Adds pre_hook/post_hook BLOB columns to source_groups and
pre_hook_default/post_hook_default to hosts. Bytes stored verbatim
(AEAD encrypt/decrypt happens at the HTTP layer where the AEAD key
lives). Round-trip tests cover set/clear semantics on both tables.
2026-05-04 10:52:16 +01:00
steve cce3cd8384 ui: P2R-09 auto-init UX — init line in chrome + danger-zone re-init
Latest 'init' job status surfaced under the host-detail vitals strip
(succeeded/failed/running/queued, with link to the live job log on
non-success). New POST /hosts/{id}/repo/reinit handler dispatches a
fresh init job after the operator types the host name to confirm;
audit row records 'host.repo_reinit'.
2026-05-04 10:49:57 +01:00
steve 93ab0ae84f ui+server: schedule next-run / last-run on dashboard + schedules tab
P2R-14. New store.LatestJobBySchedule query (per-schedule fired job).
Schedules-tab handler computes next-fire from cron + last-fire from
the jobs table per row. Schedules table grows two columns; dashboard
host row prepends 'next 12h ago/from now' to the existing last-backup
line when a single covering schedule is the run-now candidate.

Embeds store.Schedule into scheduleRow so existing template field
references keep working without bulk renames.
2026-05-04 10:44:31 +01:00
steve 6589f23313 ui+server: per-job bandwidth override on Run-now
P2R-13b. POST /hosts/{id}/source-groups/{gid}/run accepts optional
bandwidth_up_kbps / bandwidth_down_kbps form fields, plumbs them onto
CommandRunPayload. Agent dispatcher already prefers per-job override
over host-wide caps (T1). UI wraps the Run-now button in a form with
a <details> 'Limit bandwidth for this run' disclosure containing two
KB/s inputs.
2026-05-04 10:41:13 +01:00
steve ddc07609cb agent+server: apply host bandwidth caps to restic invocations
P2R-13a. restic.Env gains LimitUploadKBps/LimitDownloadKBps which are
emitted as global --limit-upload/--limit-download flags before the
subcommand on every invocation. Agent dispatcher tracks host-wide
caps received via config.update; server pushes them on hello and
after PUT /api/hosts/{id}/bandwidth.
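
How the caps might turn into argv can be sketched as follows. effectiveLimit and resticArgs are hypothetical helpers, and treating zero as "no limit" is an assumption; --limit-upload and --limit-download are restic's global flags.

```go
package main

import (
	"fmt"
	"strconv"
)

// effectiveLimit prefers the per-job override; zero means "not set".
func effectiveLimit(perJob, hostWide int) int {
	if perJob > 0 {
		return perJob
	}
	return hostWide
}

// resticArgs places the global limit flags before the subcommand.
func resticArgs(upKBps, downKBps int, sub ...string) []string {
	var args []string
	if upKBps > 0 {
		args = append(args, "--limit-upload", strconv.Itoa(upKBps))
	}
	if downKBps > 0 {
		args = append(args, "--limit-download", strconv.Itoa(downKBps))
	}
	return append(args, sub...)
}

func main() {
	up := effectiveLimit(512, 2048) // per-job override wins
	down := effectiveLimit(0, 1024) // falls back to the host-wide cap
	fmt.Println(resticArgs(up, down, "backup", "/srv"))
}
```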

Also extends api.CommandRunPayload with optional per-job overrides
(BandwidthUpKBps/Down + PreHook/PostHook); the override consumers
land in T2/T6.
2026-05-04 10:38:34 +01:00
steve 21d967a2cf plan: P2 completion (P2R-09/10/11/12/13/14, P2-16/17/18) 2026-05-04 10:33:34 +01:00
steve 24973bdc72 Merge pull request 'tasks: tasks.md sync left behind by PR #3 merge' (#4) from tasks-md-phase5-sync into main 2026-05-04 09:26:42 +00:00
steve cd510d2032 tasks: collapse Phase 5 header + fix P2R-03/04 cadence cross-refs
CI / Lint (pull_request) Successful in 19s
CI / Build (windows/amd64) (pull_request) Successful in 18s
CI / Build (linux/arm64) (pull_request) Successful in 18s
CI / Build (linux/amd64) (pull_request) Successful in 44s
CI / Test (linux/amd64) (pull_request) Successful in 1m23s
The Phase 5 section had drifted from the convention used by phases
1–4 (single section header carrying , no separate summary block).
Collapse to the existing pattern; fold the summary into a blockquote
sitting right under the header.

While there: P2R-03 and P2R-04 still carried forward-references
saying "cadence-driven dispatch lands in P2R-04 / P2R-05". Both
should point at P2R-06 (the maintenance ticker), not the next item
in the list. Updated descriptions to reflect what actually shipped:
LatestJobByKind anchor includes in-flight jobs, ForgetGroups
multi-group payload reshape, repo.stats envelope shape, per-host
drain mutex.
2026-05-04 10:26:24 +01:00
steve a07d7fc53e Merge pull request 'P2 redesign Phase 5 — prune/check/unlock + maintenance ticker + repo stats + pending-runs queue' (#3) from p2r-phase5-maintenance into main
Reviewed-on: #3
2026-05-04 09:25:00 +00:00
steve bc02fcb498 test: poll pending-row count in drain-on-reconnect test (race fix)
CI / Lint (pull_request) Successful in 17s
CI / Test (linux/amd64) (pull_request) Successful in 43s
CI / Build (linux/amd64) (pull_request) Successful in 22s
CI / Build (windows/amd64) (pull_request) Successful in 51s
CI / Build (linux/arm64) (pull_request) Successful in 21s
CI run #50 failed with:

  --- FAIL: TestDrainPendingDispatchesOnReconnect (1.03s)
      pending_drain_test.go:150: pending rows after drain: got 1, want 0

The test waits for a backup command.run envelope on the wire and
then checks the pending-row count. But conn.Send (the wire write)
returns BEFORE DeletePendingRun runs in the drain goroutine — both
fire serially inside drainOne, but the wire-side reader can observe
the Send while the delete is still pending.

Use the existing waitForPendingCount helper to poll the count with
a 2s deadline. Behaviour unchanged when the delete is fast (count
hits 0 immediately); only relevant under CI scheduling pressure.
-race -count=10 locally now passes consistently.
2026-05-04 10:20:54 +01:00
steve d8dd21b5e0 test: write-then-rename script-bin helpers (avoid ETXTBSY under -race)
CI / Build (windows/amd64) (pull_request) Successful in 18s
CI / Build (linux/amd64) (pull_request) Successful in 19s
CI / Lint (pull_request) Successful in 41s
CI / Build (linux/arm64) (pull_request) Successful in 18s
CI / Test (linux/amd64) (pull_request) Failing after 3m41s
CI run #48 failed with:

  --- FAIL: TestRunInitShipsStartedAndFinished
      RunInit: ... fork/exec /tmp/.../restic: text file busy

setupScript and setupScriptBin used os.WriteFile to write a shell
script directly at the final path, then exec'd it. Under -race +
many t.Parallel tests, a fork-from-another-goroutine could inherit
the still-open writable fd from one of those WriteFile calls; the
kernel returns ETXTBSY when the freshly-execed binary still has a
writable fd anywhere on the system.

Fix: write to "<path>.tmp", then os.Rename into place. The rename
is a pure dirent op; by the time the final path exists, no process
has a writable fd on its inode and exec is safe. -race + -count=5
on both runner packages now passes consistently.
2026-05-04 10:19:15 +01:00
steve b054e7b987 api+agent: document protocol-version stability and forget back-compat decisions
version.go: add a comment block explaining why Phase 5's wire changes
(CommandRunPayload, ConfigUpdatePayload, RepoStatsPayload reshapes) did
not bump CurrentProtocolVersion — lockstep deploy, no rolling-upgrade
path, smoke env restage enforces it. Notes where a version bump to 2
would be required if a multi-version path is ever introduced.

cmd/agent/main.go: document why the JobForget handler hard-errors on
empty ForgetGroups rather than falling back to a single-policy form.
The maintenance ticker is the only writer and always populates the
field; the fallback was specced but skipped given lockstep deploy.
2026-05-04 10:19:15 +01:00
steve 99ef2b7a71 server: serialize DrainPending per host (avoid drain double-dispatch)
Add a per-host drain mutex (drainLocks map guarded by drainLocksMu) on
the Server struct. DrainPending acquires it with TryLock: if a drain is
already in-flight for this host, the call returns immediately — the
running drain will see every pending row. This prevents the on-hello
goroutine and the 30s tick from both listing the same host's rows and
dispatching them twice.
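
The per-host TryLock guard can be sketched like this. The drainer type and its method names are hypothetical; only the drainLocks-map-plus-TryLock shape comes from the commit message. sync.Mutex.TryLock requires Go 1.18 or later.

```go
package main

import (
	"fmt"
	"sync"
)

type drainer struct {
	mu    sync.Mutex             // guards locks
	locks map[string]*sync.Mutex // one mutex per host ID
}

func newDrainer() *drainer {
	return &drainer{locks: make(map[string]*sync.Mutex)}
}

// lockFor returns the host's mutex, creating it on first use.
func (d *drainer) lockFor(hostID string) *sync.Mutex {
	d.mu.Lock()
	defer d.mu.Unlock()
	l, ok := d.locks[hostID]
	if !ok {
		l = &sync.Mutex{}
		d.locks[hostID] = l
	}
	return l
}

// drain runs work unless a drain for the same host is already in flight.
// It reports whether work ran.
func (d *drainer) drain(hostID string, work func()) bool {
	l := d.lockFor(hostID)
	if !l.TryLock() {
		return false // the running drain will see every pending row
	}
	defer l.Unlock()
	work()
	return true
}

func main() {
	d := newDrainer()
	ran := d.drain("host-a", func() {
		fmt.Println(d.drain("host-a", func() {})) // same host: skipped
		fmt.Println(d.drain("host-b", func() {})) // other host: runs
	})
	fmt.Println(ran)
}
```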

Update three existing tests that called srv.DrainPending explicitly
after the on-hello goroutine had already been spawned: replace the
now-redundant direct call with a waitForPendingCount poll so they don't
race the goroutine's mutex ownership. Add TestDrainPendingSerializesPerHost
which fires 10 concurrent DrainPending goroutines against a 5-row queue
and asserts exactly 5 job rows result.
2026-05-04 10:19:15 +01:00
steve b8c9c50a93 store: LatestJobByKind includes in-flight jobs (avoid maintenance double-fire)
Widen the SQL query to consider all statuses (queued, running,
succeeded, failed, cancelled) rather than terminal-only. An in-flight
prune that outlasts the 60s tick interval previously produced
ErrNotFound, causing the ticker to anchor at now-24h and fire a second
prune concurrently with the first.

Update the doc comment and test: remove the "queued job filtered out"
case, add assertions that a running job and a queued job are each
returned as the latest.
2026-05-04 10:19:15 +01:00
steve 18cc90d54e tasks: tick P2R-03 through P2R-08 done 2026-05-04 10:19:15 +01:00
steve a1db4ce4f7 diag: phase 5 Playwright sweep screenshots 2026-05-04 10:19:15 +01:00
steve 99b88d08c9 server/ws: persist repo.stats into host_repo_stats 2026-05-04 10:19:15 +01:00
steve 1629dc7146 server: drainer abandons only on ErrNotFound, not transient errors
GetSourceGroup errors in drainOne now gate on errors.Is(err,
store.ErrNotFound) before calling abandonPending, mirroring the
existing GetSchedule pattern. Transient errors (SQLITE_BUSY, context
cancellation) now log a warning and return without deleting the row.

Add regression test TestDrainPendingDropsRowsForGoneSourceGroup
confirming the ErrNotFound path still abandons correctly. Also add
a comment above the backoff-doubling loop explaining the progression.
2026-05-04 10:19:15 +01:00
steve 0c9ea75046 server: drainer uses dispatch-core to avoid duplicate pending_run enqueue
Extract dispatchBackupForGroupCore (persist+marshal+send, no enqueue on
failure) from dispatchBackupForGroup. drainOne now calls the core
directly, so a failed Send only bumps the existing pending_runs row via
BumpPendingRunAttempt instead of creating a second row. That stops the
geometric duplication on repeated drain failures.

dispatchBackupForGroup (schedule.fire path) wraps the core and keeps
its enqueue-on-failure behaviour unchanged.

TestDrainPendingBumpsOnSendFailure strengthened: asserts exactly 1 row
remains after a send failure (was tolerating >=1 duplicate rows).
2026-05-04 10:19:15 +01:00
steve 3e337dfb3c server: drain pending_runs on tick + on agent reconnect
Two trigger paths land here:

- A 30s ticker in cmd/server calls Server.DrainAllDue(ctx). It
  walks pending_runs rows whose next_attempt_at <= now, dedupes by
  host, skips offline hosts, and per online host runs DrainPending.

- onAgentHello spawns a background DrainPending(hostID). When a
  host comes back, every pending row for it is dispatchable now —
  due-ness becomes irrelevant once the wire is back.

Each row's schedule + group are reloaded; ErrNotFound, a disabled
schedule, or a missing group abandons the row with a
pending_run.abandoned audit. attempt >= retry_max also abandons.
Otherwise dispatchBackupForGroup is invoked; success deletes the
row, failure bumps attempt with exponential backoff capped at
30m.
2026-05-04 10:19:15 +01:00
steve e64cf25c0e server: enqueue pending_runs when scheduled-job dispatch fails
When dispatchBackupForGroup's conn.Send errors, queue a pending_runs
row (attempt=1, next_attempt_at = now + group.RetryBackoffSeconds)
instead of silently dropping the fire. The orphaned queued job row
is left behind for forensic visibility — the drainer will create a
fresh job row on its retry.

Also adds Store.ListPendingRunsForHost — the on-reconnect drain
walks every row for the host, regardless of due-ness, since the
host being back makes 'due' irrelevant.
2026-05-04 10:19:15 +01:00
steve 2794d5a821 server: fix stale RetentionPolicy comment + check Scan errors in maintenance test 2026-05-04 10:19:15 +01:00
steve c47cc682e0 server: maintenance ticker drives forget/prune/check on cadence
Wires a 60s server-side ticker to the pure-logic maintenance.Decide
introduced in the previous commit. Decisions flow through a new
DispatchMaintenance method on *Server, which:

  - skips offline hosts (no pending_runs queueing — maintenance is
    not a backup, missed fires shouldn't pile up)
  - silently skips prune when admin creds aren't bound
  - pushes admin creds before prune, then dispatches with
    RequiresAdminCreds=true (same as operator-driven prune)
  - persists job rows with actor_kind="system"

Reshapes the forget wire payload from a single RetentionPolicy to a
ForgetGroups list (one tag + per-group keep-* per source group). The
agent walks the groups and runs `restic forget --tag <name> --keep-*`
once per group. Dead-code removed: CommandRunPayload.RetentionPolicy,
the old forget JSON-decode in cmd/agent, and the single-policy form of
restic.RunForget.
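
The per-group forget invocation can be sketched as an argv builder. forgetGroup and forgetArgs are hypothetical, and the set of keep-* fields is an illustrative subset; --tag, --keep-last, --keep-daily and --keep-weekly are real restic forget flags.

```go
package main

import (
	"fmt"
	"strconv"
)

// forgetGroup carries one tag plus its keep-* policy (subset shown).
type forgetGroup struct {
	Tag        string
	KeepLast   int
	KeepDaily  int
	KeepWeekly int
}

// forgetArgs builds the argv for one group; unset (zero) policies are omitted.
func forgetArgs(g forgetGroup) []string {
	args := []string{"forget", "--tag", g.Tag}
	keep := func(flag string, n int) {
		if n > 0 {
			args = append(args, flag, strconv.Itoa(n))
		}
	}
	keep("--keep-last", g.KeepLast)
	keep("--keep-daily", g.KeepDaily)
	keep("--keep-weekly", g.KeepWeekly)
	return args
}

func main() {
	groups := []forgetGroup{
		{Tag: "etc", KeepDaily: 7},
		{Tag: "db", KeepLast: 3, KeepWeekly: 4},
	}
	for _, g := range groups {
		fmt.Println(forgetArgs(g)) // one restic invocation per group
	}
}
```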
2026-05-04 10:19:15 +01:00
steve e7e11454a8 maintenance: pure-logic ticker decides forget/prune/check fires 2026-05-04 10:19:15 +01:00
steve 77a8590e3a ui: hx-swap none on Run-now + truthful save banner + tailwind rebuild
Add hx-swap="none" to the three Run-now buttons (check/prune/unlock) in
host_repo.html to match the existing pattern on host_sources.html and
host_schedules.html. Fix all-blank admin-credentials save to redirect
without ?saved= query string so no false-positive banner is shown;
strengthen the corresponding test to assert Location has no ?saved=.
Rebuild CSS bundle via Tailwind to pick up max-w-[640px] JIT class.
2026-05-04 10:19:15 +01:00
steve 46ec123f95 ui: Slice E — admin creds form + run-now buttons + repo health panel
- hostRepoPage gains AdminURL/AdminUsername/HasAdminPassword, Online,
  and StatsView (pre-dereferenced projection of host_repo_stats).
- loadHostRepoPage loads the admin slot (tolerating ErrNotFound),
  hub.Connected, and stats (tolerating ErrNotFound).
- renderRepoPage gains an adminErr parameter; all callers updated.
- handleUIAdminCredentialsSave / handleUIAdminCredentialsDelete added
  (form-POST handlers mirroring the repo-creds pattern, with audit).
- Routes /hosts/{id}/admin-credentials POST and /delete POST registered.
- Template: Admin credentials form after Connection, Run-now HTMX
  buttons after Maintenance, Repo health stats panel in right rail.
- Tests: 9 new tests covering rendering, disabled states, save/delete
  round-trips, audit rows, and idempotent delete.
2026-05-04 10:19:15 +01:00
steve b35f1736f7 server: populate audit UserID on credential mutations + slog prune push errors
Switch handleSetHostCredentials, handleSetAdminCredentials, and
handleDeleteAdminCredentials from authedUser (bool) to requireUser
(*store.User) so AuditEntry.UserID and Actor are populated correctly.
Add slog.Warn on the non-ErrNotFound pushAdminCredsToAgent path in
handleRunRepoPrune so decrypt/send failures surface in the server log
rather than appearing as a generic host_offline 503.
2026-05-04 10:19:15 +01:00
steve a8aff2c62b server: cover HTMX auth-redirect path in repo-ops tests 2026-05-04 10:19:15 +01:00
steve 1ae567021a server: HTTP run-now for prune / check / unlock
Adds POST /api/hosts/{id}/repo/{prune,check,unlock} (and matching outer
routes for HTMX form posts). Prune pushes the admin-cred slot via
pushAdminCredsToAgent before dispatch and refuses with
admin_creds_required when the slot is not set. Check reads
check_subset_pct from host_repo_maintenance (overridable via ?subset=N,
clamped 0-100; non-numeric override falls back to DB value silently).
Unlock needs no admin creds. All three share the same wantsHTML/HX-Redirect
response split as the per-source-group run-now endpoint.
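
The ?subset=N handling can be sketched as below. resolveSubset is a hypothetical helper; clamping the stored value as well as the override is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// resolveSubset returns the check subset percentage: the override when it
// parses as an integer, otherwise the stored value, clamped to 0-100.
func resolveSubset(override string, dbValue int) int {
	v := dbValue
	if override != "" {
		if n, err := strconv.Atoi(override); err == nil {
			v = n
		} // non-numeric override: silently keep the stored value
	}
	if v < 0 {
		return 0
	}
	if v > 100 {
		return 100
	}
	return v
}

func main() {
	fmt.Println(resolveSubset("", 25))
	fmt.Println(resolveSubset("150", 25))
	fmt.Println(resolveSubset("-3", 25))
	fmt.Println(resolveSubset("abc", 25))
}
```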
2026-05-04 10:19:15 +01:00
steve 81a00202d0 server: admin-credentials REST + Slot:admin push helper
Adds GET/PUT/DELETE /api/hosts/{id}/admin-credentials handlers that
mirror the existing repo-credentials endpoints but write to
store.CredKindAdmin with AEAD additional-data "host:<id>:admin" (scoped
away from the repo slot to prevent cross-binding). PUT immediately pushes
a config.update(Slot:"admin") to the agent when it is connected, and the
new pushAdminCredsToAgent helper is wired for use by the upcoming prune
run-now endpoint (D2) to push on-demand before dispatch.
2026-05-04 10:19:15 +01:00
steve dafae84149 agent: secrets fail-loud on corrupt blob + small polish
Save and SaveAdmin now propagate loadBundle errors instead of silently
overwriting a corrupt file (data-loss fix). Tests added for both paths.
reportStats logs a Debug on RunStats failure; r in runJob gets a comment
explaining the prune-runner asymmetry; runner_test comment tightened.
2026-05-04 10:19:15 +01:00
steve d3c354cd97 agent/runner: ship repo.stats before job.finished in RunCheck/RunUnlock
RunCheck and RunUnlock were calling sendFinished before reportStats,
inverting the required job.started → log.stream → repo.stats →
job.finished envelope order. Move reportStats ahead of sendFinished in
both functions to match the pattern already correct in RunPrune.

Strengthen TestRunCheckShipsCheckStatus, TestRunCheckErrorsFoundShipsErrorsStatus,
and TestRunUnlockClearsLock with the same position-index ordering
assertions used by TestRunPruneShipsExpectedEnvelopes; these assertions
would have failed against the pre-fix code.
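
The position-index assertion can be sketched as below. This assumes the test records envelope type names in send order; indexOf and statsBeforeFinished are illustrative names, not the ones in runner_test.go.

```go
package main

import "fmt"

// indexOf returns the position of the first envelope of the given type,
// or -1 when it was never sent.
func indexOf(types []string, want string) int {
	for i, t := range types {
		if t == want {
			return i
		}
	}
	return -1
}

// statsBeforeFinished reports whether repo.stats was shipped, and shipped
// ahead of job.finished.
func statsBeforeFinished(types []string) bool {
	s := indexOf(types, "repo.stats")
	f := indexOf(types, "job.finished")
	return s >= 0 && f >= 0 && s < f
}

func main() {
	fixed := []string{"job.started", "log.stream", "repo.stats", "job.finished"}
	preFix := []string{"job.started", "log.stream", "job.finished", "repo.stats"}
	fmt.Println(statsBeforeFinished(fixed))
	fmt.Println(statsBeforeFinished(preFix))
}
```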
2026-05-04 10:19:15 +01:00
steve 1f600fa849 agent: RunPrune/RunCheck/RunUnlock + reportStats + admin-cred slot dispatch
Extract resticEnv/sendStarted/streamHandler/sendFinished helpers to remove
boilerplate duplication across Run* methods. Add RunPrune (ships repo.stats
with LastPruneAt before job.finished), RunCheck (ships stats with
LastCheckStatus/LockPresent regardless of outcome), RunUnlock (ships
LockPresent=false on success), and reportStats (fills size fields via
RunStats when caller didn't populate them).

Wire JobPrune/JobCheck/JobUnlock into the dispatcher switch; teach
MsgConfigUpdate about the Slot discriminator for admin vs repo creds;
add strconv import for subset-pct parsing.
2026-05-04 10:19:15 +01:00
steve 212fd3e400 agent/secrets: separate admin slot with backwards-compatible decode
Split the on-disk bundle into repo + admin slots. Legacy flat Repo blobs
are detected at load time by the presence of "repo_url" at the top level
and transparently promoted into the new shape on the next Save/SaveAdmin.
Adds ErrNoAdmin sentinel, LoadAdmin, SaveAdmin, and three new tests.
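
A rough sketch of the legacy detection, assuming the decrypted bundle is JSON. The Creds and Bundle shapes and the decodeBundle name are illustrative; only the top-level "repo_url" probe is taken from the message above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Creds is an assumed credential shape for illustration.
type Creds struct {
	RepoURL  string `json:"repo_url"`
	Password string `json:"password"`
}

// Bundle is the assumed two-slot on-disk shape.
type Bundle struct {
	Repo  *Creds `json:"repo,omitempty"`
	Admin *Creds `json:"admin,omitempty"`
}

// decodeBundle accepts both shapes: a legacy flat blob (recognised by a
// top-level "repo_url" key) is promoted into the repo slot in memory.
func decodeBundle(data []byte) (Bundle, error) {
	var probe map[string]json.RawMessage
	if err := json.Unmarshal(data, &probe); err != nil {
		return Bundle{}, err
	}
	if _, legacy := probe["repo_url"]; legacy {
		var c Creds
		if err := json.Unmarshal(data, &c); err != nil {
			return Bundle{}, err
		}
		return Bundle{Repo: &c}, nil
	}
	var b Bundle
	if err := json.Unmarshal(data, &b); err != nil {
		return Bundle{}, err
	}
	return b, nil
}

func main() {
	b, err := decodeBundle([]byte(`{"repo_url":"rest:https://example.invalid/r","password":"p"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(b.Repo.RepoURL, b.Admin == nil)
}
```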
2026-05-04 10:19:15 +01:00
steve c9be9040d9 api: stats partial-update payload + ConfigUpdate.Slot + CommandRun.RequiresAdminCreds
Reshape RepoStatsPayload into pointer-field partial-update form matching
store.HostRepoStats semantics; add Slot discriminator to ConfigUpdatePayload
for admin vs repo credential routing; add RequiresAdminCreds flag to
CommandRunPayload for prune/unlock jobs that need delete authority.
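
Pointer-field partial updates work roughly as follows: a nil field means "leave the stored value alone". The struct shapes here are assumptions for illustration (LockPresent and LastCheckStatus appear in later commits; SizeBytes and applyUpdate are hypothetical names).

```go
package main

import "fmt"

// RepoStats is an assumed stored projection.
type RepoStats struct {
	SizeBytes       int64
	LockPresent     bool
	LastCheckStatus string
}

// RepoStatsUpdate carries only the fields the sender knows about;
// nil pointers are skipped on merge.
type RepoStatsUpdate struct {
	SizeBytes       *int64
	LockPresent     *bool
	LastCheckStatus *string
}

// applyUpdate merges the non-nil fields of u into cur.
func applyUpdate(cur RepoStats, u RepoStatsUpdate) RepoStats {
	if u.SizeBytes != nil {
		cur.SizeBytes = *u.SizeBytes
	}
	if u.LockPresent != nil {
		cur.LockPresent = *u.LockPresent
	}
	if u.LastCheckStatus != nil {
		cur.LastCheckStatus = *u.LastCheckStatus
	}
	return cur
}

func main() {
	cur := RepoStats{SizeBytes: 4096, LockPresent: true, LastCheckStatus: "ok"}
	unlocked := false
	// An unlock job only knows the lock state; size and check status survive.
	next := applyUpdate(cur, RepoStatsUpdate{LockPresent: &unlocked})
	fmt.Printf("%+v\n", next)
}
```

Pointers let an explicit false or zero be distinguished from "not sent", which plain value fields cannot express.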
2026-05-04 10:19:15 +01:00
steve 7fd29427a0 restic: tighten RunCheck lock sniff + RunStats zero-snapshot test
Narrow the LockPresent predicate from bare "locked" (too broad) to
"stale lock" and "already locked" — the two phrases restic actually
emits. Replace TestRunCheckParsesLock with table-driven
TestRunCheckLockSniff covering both trigger phrases and a benign
"locked-file" line that must not set LockPresent. Add
TestRunStatsZeroSnapshots to pin that RunStats accepts zero-snapshot
JSON without error.
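
The narrowed predicate could look like this sketch. The two trigger phrases and the benign "locked-file" case come from the message above; the function name and the case-insensitive match are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// lockPresent reports whether an output line indicates a repository lock.
// Only "stale lock" and "already locked" count; a bare "locked" does not.
func lockPresent(line string) bool {
	l := strings.ToLower(line) // case folding is an assumption
	return strings.Contains(l, "stale lock") || strings.Contains(l, "already locked")
}

func main() {
	lines := []string{
		"Found stale lock",
		"repository is already locked by another process",
		"skipping locked-file /tmp/example",
	}
	for _, l := range lines {
		fmt.Printf("%-50q %v\n", l, lockPresent(l))
	}
}
```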
2026-05-04 10:19:15 +01:00
steve 49fd3f4441 restic: RunUnlock + RunStats (raw-data mode)
Add RunUnlock (delegates straight to runWithPump) and RunStats which
runs `restic stats --json --mode raw-data`, captures the single JSON
line from stdout into RepoStats, and returns an error if no JSON
arrives.  Tests cover arg plumbing for unlock, JSON parsing, and the
no-JSON error path.
2026-05-04 10:19:15 +01:00
steve f3eaf511be restic: RunCheck with subset% + lock-state sniffing
Add CheckResult (LockPresent, ErrorsFound) and RunCheck.  subsetPct>0
passes --read-data-subset N% to limit data reads.  Stderr is sniffed
for "Found stale lock"/"locked" to set LockPresent; a non-zero exit
from restic is absorbed as ErrorsFound=true rather than an error so
the caller can always persist last_check_status.  Tests cover lock
detection, exit-1 absorption, and subset-arg plumbing.
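
A sketch of the subset-arg plumbing and the exit absorption, under the assumption that the runner surfaces restic's exit as an *exec.ExitError. checkArgs and absorbExit are hypothetical names; lock sniffing is left out here.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"strconv"
)

// CheckResult mirrors the two fields named in the commit message.
type CheckResult struct {
	LockPresent bool
	ErrorsFound bool
}

// checkArgs builds the restic argument list; the subset flag is only
// added for a positive percentage.
func checkArgs(subsetPct int) []string {
	args := []string{"check"}
	if subsetPct > 0 {
		args = append(args, "--read-data-subset", strconv.Itoa(subsetPct)+"%")
	}
	return args
}

// absorbExit turns a non-zero process exit into ErrorsFound=true with a
// nil error, so the caller can still persist a check status. Any other
// failure (for example, the binary not starting) is returned unchanged.
func absorbExit(runErr error) (CheckResult, error) {
	var exitErr *exec.ExitError
	if errors.As(runErr, &exitErr) {
		return CheckResult{ErrorsFound: true}, nil
	}
	return CheckResult{}, runErr
}

func main() {
	fmt.Println(checkArgs(0))
	fmt.Println(checkArgs(5))
	res, err := absorbExit(&exec.ExitError{})
	fmt.Println(res.ErrorsFound, err == nil)
}
```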
2026-05-04 10:19:15 +01:00
steve 2caf7f1193 restic: RunPrune + runWithPump helper, refactor Forget/Init onto it
Add RunPrune for admin-credential prune invocations.  Extract
runWithPump to DRY the stdout+stderr pump pattern; refactor RunForget
and RunInit to delegate to it (RunInit preserves the "config file
already exists" soft-success sniff by wrapping the handler before the
call).  Add runner_test.go with TestRunPruneInvokesPrune.
2026-05-04 10:19:15 +01:00
steve 4ad0b5147a store: tighten CHECK constraint on host_repo_stats.last_check_status 2026-05-04 10:19:15 +01:00
steve f97f67eb67 store: wrap UpsertHostRepoStats in a transaction (concurrency safety) 2026-05-04 10:19:15 +01:00
steve bc77081366 store: assert CHECK constraint on host_credentials.kind 2026-05-04 10:19:15 +01:00
steve 87655cf0e4 store: HostRepoStats projection (size, lock, last-check, last-prune) 2026-05-04 10:19:15 +01:00
steve de6d51eeb1 store: host_credentials becomes kind-aware (repo + admin slots) 2026-05-04 10:19:15 +01:00
steve 212ddfe226 store: migration 0009 — admin-creds kind + host_repo_stats 2026-05-04 10:19:15 +01:00
steve b640775a61 plan: P2 redesign Phase 5 (P2R-03..P2R-08) 2026-05-04 10:19:15 +01:00
steve 13f58537ad infra: remove provision-gitea-runner.sh (now lives with the infra team)
The runner-provisioning script has been handed off to the infra
agent, who will own it going forward. ci.yml's header comment is
updated to point at "the infra team owns the script" rather than
the in-repo path, but the runner expectations themselves stay the
same — workflows still rely on the persistent volumes, pre-cloned
actions, and host-installed golangci-lint that any compliant
provisioning produces.
2026-05-04 10:19:09 +01:00
steve a24eee4c68 ci+infra: provisioning script for gitea runners + drop setup-go cache
scripts/provision-gitea-runner.sh is a one-shot, idempotent host
setup for an act_runner LXC. It mounts persistent host volumes for
GOMODCACHE / GOCACHE / act-clones, pre-pulls the runner image,
pre-clones the common GitHub actions, installs golangci-lint, and
sets up a nightly cron to refresh the lot. Generic — no per-project
state.

With those persistent volumes in place, `cache: true` on
actions/setup-go becomes a net negative — the action keeps tar-ing /
un-tar-ing GOMODCACHE+GOCACHE through the Gitea cache backend on
every job, adding ~10s per job and overwriting the volume contents.
Drop it from all three jobs in ci.yml. Add a header comment block
explaining the runner-side expectations and the Go version / build
matrix / upload-artifact context for anyone reading later.
2026-05-04 09:40:27 +01:00
steve 0ae62261e3 Merge pull request 'P2R-02: UI rewire against the slim-schedule + source-group model' (#2) from p2r-02-ui-rebuild into main
Reviewed-on: #2
2026-05-03 20:34:02 +00:00
steve dd7b37a5c1 lint: align local gofumpt rules with golangci-lint v2.5.0
CI / Test (linux/amd64) (pull_request) Successful in 21s
CI / Lint (pull_request) Successful in 24s
CI / Build (windows/amd64) (pull_request) Successful in 20s
CI / Build (linux/amd64) (pull_request) Successful in 21s
CI / Build (linux/arm64) (pull_request) Successful in 20s
Bumping CI to v2.5.0 surfaced two new gofumpt findings (in two test
files that gofumpt v2.1.6 considered fine). Local re-format with
the matching tool brings them in line.

Pre-commit hook config: prepend $GOPATH/bin to PATH inside the hook
entry so gofumpt + golangci-lint resolve when ~/go/bin isn't on the
operator's interactive shell PATH (common — go install puts them
there but PATH config varies). Without this, the hooks fail with
'Executable not found' even when the tools are installed.

Pin the Makefile setup target to v2.5.0 so a fresh clone gets the
same binary CI runs — keeps pre-commit and CI from drifting again.
2026-05-03 21:31:47 +01:00
steve 694d9d9bf3 ci: bump golangci-lint to v2.5.0 (Go 1.25-built binary)
CI / Test (linux/amd64) (pull_request) Successful in 19s
CI / Lint (pull_request) Failing after 27s
CI / Build (windows/amd64) (pull_request) Successful in 21s
CI / Build (linux/amd64) (pull_request) Successful in 22s
CI / Build (linux/arm64) (pull_request) Successful in 20s
The v2.1.6 release binary is built with Go 1.24, and golangci-lint
refuses to load a config targeting a newer toolchain than itself
('Go language version (go1.24) used to build golangci-lint is lower
than the targeted Go version (1.25.0)'). go.mod is on 1.25, so the
binary needs to be too.

Locally this didn't bite because 'go install …@v2.1.6' compiled
v2.1.6 against the local Go 1.25 toolchain; CI uses the prebuilt
release tarball which carries the build-time Go version.

v2.5.0 is the first v2.x line built with Go 1.25 — pin in lockstep
with go.mod going forward.
2026-05-03 21:29:02 +01:00
steve 2d40002355 ci: enforce lint locally via pre-commit hook
CI / Test (linux/amd64) (pull_request) Successful in 29s
CI / Lint (pull_request) Failing after 16s
CI / Build (windows/amd64) (pull_request) Successful in 21s
CI / Build (linux/amd64) (pull_request) Successful in 21s
CI / Build (linux/arm64) (pull_request) Successful in 21s
The repo had a .pre-commit-config.yaml entry for golangci-lint
already, but pinned to v1.61.0 — which doesn't grok the v2 schema
we just migrated to, so it would crash if anyone ever ran it. Hence
nobody did.

Replace the third-party hook blocks with local hooks that call
whatever tool is on the developer's PATH (gofumpt + go vet +
golangci-lint). That way the version of each tool tracks what the
developer would invoke by hand — no drift between hook config and
binary.

Add 'make setup' as a one-liner per-clone bootstrap:
  * installs gofumpt + golangci-lint via go install if missing
  * installs the pre-commit hooks via 'pre-commit install'

end-of-file-fixer auto-fixed two existing files (web/static/css/
styles.css and ask.md) — trailing newlines, harmless.
2026-05-03 21:26:24 +01:00
steve e871b05b38 lint: drive baseline to zero, drop only-new-issues gate
CI / Test (linux/amd64) (pull_request) Successful in 34s
CI / Lint (pull_request) Failing after 16s
CI / Build (windows/amd64) (pull_request) Successful in 22s
CI / Build (linux/amd64) (pull_request) Successful in 20s
CI / Build (linux/arm64) (pull_request) Successful in 21s
Cleanup pass over the repo so CI can enforce lint going forward
without the only-new-issues escape hatch:

* gofumpt -w across the tree (31 hits, all formatting)
* misspell --fix (25 hits, US-locale spelling), except for
  api.JobCancelled = "cancelled": that literal is the wire + DB CHECK
  constraint value, so the auto-fix was reverted there and in
  store/fleet.go, and both sites now carry //nolint:misspell to stop
  the next auto-fix from changing them
* Wrap every `defer rows.Close()` / `defer stmt.Close()` /
  `defer res.Body.Close()` in `defer func() { _ = x.Close() }()`
  (x being the receiver) to satisfy errcheck without losing the
  close itself
* websocket.Dial callers (1 prod, 4 tests) now capture + close the
  upgrade response Body — coder/websocket can return res with a nil
  Body on success, so the test deferred-closes guard against that
* Annotate the two genuine-by-design nilerr cases with //nolint
  comments explaining why nil-on-error is the contract (cookie
  missing = no session; ctx cancelled mid-backoff = clean shutdown)
* Add brief godoc on the 10 exported const groups + types that
  revive flagged (api.HostOS/HostArch/JobKind/JobStatus/LogStream/
  ErrorCode, restic.EventKind, store.Role, web.FS)
* Drop the unused (*Server).userByID method
* Inline the unparam baseView(active) — every UI page is under
  the dashboard primary nav today

Result: `golangci-lint run ./...` reports 0 issues. CI lint job
no longer needs only-new-issues: true; X-06 follow-up entry in
tasks.md removed.
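
For reference, the errcheck-friendly close wrapper from the list above looks like this in isolation. trackedReader and readAll are made-up names for the demo; the real call sites are rows, stmt, and response bodies.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// trackedReader is a demo io.ReadCloser that records whether Close ran.
type trackedReader struct {
	io.Reader
	closed bool
}

func (t *trackedReader) Close() error {
	t.closed = true
	return nil
}

func newTracked(s string) *trackedReader {
	return &trackedReader{Reader: strings.NewReader(s)}
}

// readAll drains rc. The deferred closure discards Close's error
// explicitly, which satisfies errcheck while still closing the resource.
func readAll(rc io.ReadCloser) (string, error) {
	defer func() { _ = rc.Close() }()
	b, err := io.ReadAll(rc)
	return string(b), err
}

func main() {
	tr := newTracked("hello")
	got, err := readAll(tr)
	fmt.Println(got, err, tr.closed)
}
```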
2026-05-03 16:15:17 +01:00
steve 18a9f6624e ci: migrate .golangci.yml to v2 schema + only-new-issues gate
CI / Test (linux/amd64) (pull_request) Successful in 29s
CI / Lint (pull_request) Failing after 16s
CI / Build (windows/amd64) (pull_request) Successful in 20s
CI / Build (linux/amd64) (pull_request) Successful in 20s
CI / Build (linux/arm64) (pull_request) Successful in 21s
The bump from golangci-lint-action@v6 → v7 (which downloads the v2.x
binary) was blocking CI lint with 'unsupported version of the
configuration: ""' because .golangci.yml was still in the v1 schema.

Migrate the config to v2:
* version: "2" prelude
* disable-all → default: none
* linters-settings → linters.settings
* gofumpt + goimports move into formatters.enable + formatters.settings
* exclude-rules move into linters.exclusions.rules
* gosimple drops (folded into staticcheck in v2)

Fix the four lint hits in the new P2R-02 code:
* host_bandwidth.go: convert hostBandwidthRequest directly to
  hostBandwidthView via type conversion (S1016)
* ui_repo.go: drop unparam savedSection + status arguments from
  renderRepoPage (always "" / always 422 — split GET render from
  validation-fail render)
* ui_schedules.go: gofumpt formatting on the scheduleEditPage struct

Add only-new-issues: true to the lint job. The repo carries ~90
pre-existing findings (gofumpt drift × 31, misspell × 25, missing
godoc × 10, bodyclose × 6, errcheck × 12, …) accumulated before
lint was actually wired into CI. Without this gate, every PR would
fail on baseline noise instead of its own changes.

Track the cleanup as X-06 in tasks.md so the gate is temporary.
2026-05-03 15:00:24 +01:00
steve 2a8dd1eba2 P2R-02 — mark Phase 4 complete, all 6 slices done
CI / Test (linux/amd64) (pull_request) Successful in 1m28s
CI / Lint (pull_request) Failing after 31s
CI / Build (windows/amd64) (pull_request) Successful in 20s
CI / Build (linux/amd64) (pull_request) Successful in 20s
CI / Build (linux/arm64) (pull_request) Successful in 24s
Update tasks.md: Phase 4 of the P2 redesign is done end-to-end.
Slices 1–5 wired the four host-detail tabs against the new
slim-schedule + source-group + repo-maintenance model; slice 6
ran a Playwright sweep against the live :8080 server (login,
walk every tab, create source group, create schedule, Run-now,
confirm a snapshot landed) — clean pass, no console errors.
Screenshots in _diag/p2r-02-sweep/.

Side-fix landed alongside slice 6: agent runner now drops
restic's noisy --json status events from log.stream (the
throttled job.progress envelope already covers them).

Phase 5 (server-side maintenance ticker — P2R-03..08) is next.
2026-05-03 14:49:40 +01:00
steve fab99b4a38 P2R-02 slice 5: dashboard row Run-now uses covering schedule
Replace the placeholder 'Open →' link with a per-host Run-now
decision computed server-side once per render:

* If the host has exactly one enabled schedule whose source-group
  set covers every group on the host → primary 'Run all groups'
  button (HX-POST to that schedule's /run endpoint, fires every
  backup the host knows about in one click).
* Otherwise (zero matches, multiple matches, or any ambiguity) →
  ghost 'Open →' link to /hosts/{id}/sources, where the operator
  picks per-group from the source-group rows.

dashboardPage.Hosts moves from []store.Host to []dashboardHostRow
to carry the precomputed RunAllScheduleID; host_row.html now reads
.Host.* and .RunAllScheduleID. Two extra store calls per host on
dashboard render — fine at fleet sizes we care about; if we ever
need to support thousands of hosts we'll batch these queries.
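
The Run-now decision can be sketched as a pure function. The Schedule shape and function names are assumptions, as is the choice to return 0 (meaning "show the Open link") for a host with no source groups.

```go
package main

import "fmt"

// Schedule is a trimmed, assumed shape holding only what the decision needs.
type Schedule struct {
	ID       int64
	Enabled  bool
	GroupIDs []int64
}

// covers reports whether scheduleGroups contains every host group.
func covers(scheduleGroups, hostGroups []int64) bool {
	have := make(map[int64]bool, len(scheduleGroups))
	for _, id := range scheduleGroups {
		have[id] = true
	}
	for _, id := range hostGroups {
		if !have[id] {
			return false
		}
	}
	return true
}

// runAllScheduleID returns the ID of the single enabled schedule that
// covers every group on the host, or 0 when there are zero or several
// candidates.
func runAllScheduleID(schedules []Schedule, hostGroups []int64) int64 {
	if len(hostGroups) == 0 {
		return 0 // assumption: nothing to run, fall back to the link
	}
	var match int64
	matches := 0
	for _, s := range schedules {
		if s.Enabled && covers(s.GroupIDs, hostGroups) {
			match = s.ID
			matches++
		}
	}
	if matches != 1 {
		return 0
	}
	return match
}

func main() {
	groups := []int64{1, 2}
	schedules := []Schedule{
		{ID: 10, Enabled: true, GroupIDs: []int64{1, 2}},
		{ID: 11, Enabled: true, GroupIDs: []int64{1}},
	}
	fmt.Println(runAllScheduleID(schedules, groups))
}
```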
2026-05-03 13:42:50 +01:00
steve ffba7371c5 agent runner: drop status-event spam from log.stream
restic --json emits a status frame ~every 16ms during a backup.
The runner was forwarding every line to log.stream verbatim, which
flooded the live log pane with duplicate status JSON for any
short-running backup (visible immediately on a 1000-file, ~4MB
test set: ~14 identical 'percent_done: 1' lines in 220ms).

The progress widget already covers the same information at a sane
sample rate (one per second via job.progress), so the raw status
lines in log.stream are double-bookkeeping. Skip them and forward
only non-status lines (file names, errors, summary).

Throttling logic for job.progress is unchanged.
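
A minimal sketch of the filter, assuming the runner sees restic's --json output one line at a time. restic marks progress frames with "message_type":"status"; the helper names are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// isStatusLine reports whether line is a restic --json status frame.
// Lines that are not JSON objects are treated as ordinary log output.
func isStatusLine(line string) bool {
	var probe struct {
		MessageType string `json:"message_type"`
	}
	if err := json.Unmarshal([]byte(line), &probe); err != nil {
		return false
	}
	return probe.MessageType == "status"
}

// filterLog keeps only the lines that should reach log.stream.
func filterLog(lines []string) []string {
	var out []string
	for _, l := range lines {
		if !isStatusLine(l) {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	lines := []string{
		`{"message_type":"status","percent_done":1}`,
		`{"message_type":"status","percent_done":1}`,
		`{"message_type":"summary","files_new":1000}`,
		`warning: plain text line`,
	}
	for _, l := range filterLog(lines) {
		fmt.Println(l)
	}
}
```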
2026-05-03 13:35:18 +01:00
steve 4035c44be3 P2R-02 follow-up: schedule Run-now feedback (single → job log, multi → toast)
Schedules tab Run-now used to silently HX-Redirect back to the
list, leaving the operator wondering whether the click registered.
Now:

* Single-source-group schedule → HX-Redirect to that one job's
  live log, matching the per-source-group Run-now UX from Sources.
* Multi-group schedule → stay on the schedules list and fire a
  success toast ("N backups dispatched: <group names>") via the
  existing rm:toast HX-Trigger channel, so the operator sees clear
  acknowledgement without losing their place.

dispatchBackupForGroup now returns the persisted job ID so the
caller can choose between job-log redirect and toast feedback;
on any internal failure it returns "" and the warning still
hits slog as before. The cron-fired path (dispatchScheduledJob)
ignores the return value, behaviour unchanged.
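
The single-vs-multi feedback split, sketched as a function that returns the response headers to set. The "/jobs/<id>" path and the toast payload shape are assumptions; HX-Redirect, HX-Trigger, and the rm:toast event name come from the message above. The sketch assumes every job ID passed in is non-empty.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// runNowHeaders picks the HTMX response headers for a Run-now click:
// one dispatched job redirects to its live log, several fire a toast.
func runNowHeaders(jobIDs, groupNames []string) map[string]string {
	if len(jobIDs) == 1 {
		// Hypothetical job-log URL; the real route may differ.
		return map[string]string{"HX-Redirect": "/jobs/" + jobIDs[0]}
	}
	msg := fmt.Sprintf("%d backups dispatched: %s",
		len(jobIDs), strings.Join(groupNames, ", "))
	payload, err := json.Marshal(map[string]map[string]string{
		"rm:toast": {"message": msg},
	})
	if err != nil {
		return map[string]string{}
	}
	return map[string]string{"HX-Trigger": string(payload)}
}

func main() {
	fmt.Println(runNowHeaders([]string{"j1"}, []string{"etc"}))
	fmt.Println(runNowHeaders([]string{"j1", "j2"}, []string{"etc", "home"}))
}
```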
2026-05-03 13:25:31 +01:00
steve d62b173712 P2R-02 slice 4: Repo tab — connection / bandwidth / maintenance
Three independent forms on /hosts/{id}/repo so saving one section
doesn't disturb the others:

* Connection: edits repo URL, username, password (pre-filled from
  the redacted GET /api/hosts/{id}/repo-credentials view; password
  field shows masked stored-creds placeholder; blank password = keep
  existing). On save, encrypts and pushes config.update to a
  connected agent.
* Bandwidth: host-wide upload/download caps (KB/s; blank = no cap)
  written via store.SetHostBandwidth. New REST endpoint
  PUT /api/hosts/{id}/bandwidth for JSON callers.
* Maintenance: forget/prune/check cadences + check subset %, with
  per-row enabled toggles. Reuses cronParser for validation;
  auto-seeds the row if a host pre-dates the migration.

Right-rail surfaces repo size, snapshot count, snapshots-by-tag
breakdown (counted from existing snapshot tag rows), and an
'untagged snapshots are left alone' note.

Danger-zone re-init button is rendered but disabled with a hint
pointing at P2R-09 (real implementation lands there).

Validation re-renders the page with the relevant form's banner and
all other section state intact. Successful saves redirect with a
?saved=<section> query param so the page surfaces a small ✓ saved
indicator on the relevant form.

ci.yml: bump golangci-lint-action v6→v7 (separate change picked up
in this commit).
2026-05-03 12:14:03 +01:00
steve 8b91d3037c P2R-02 follow-up: Run-now works on disabled schedules with confirm
CI / Test (linux/amd64) (pull_request) Successful in 33s
CI / Lint (pull_request) Failing after 15s
CI / Build (windows/amd64) (pull_request) Successful in 22s
CI / Build (linux/amd64) (pull_request) Successful in 23s
CI / Build (linux/arm64) (pull_request) Successful in 23s
Surface the Run-now button on every schedule when the host is online,
not just enabled ones. Disabled rows render the button as a non-primary
style + a HX-confirm dialog ("This schedule is paused — running it now
won't change that. Fire it once anyway?"); enabled rows keep the
zero-friction primary button.

Server-side, Run-now no longer short-circuits on !Enabled — it
dispatches the source groups inline rather than via dispatchScheduledJob
(which always bails on disabled schedules, since cron-tick semantics
are different from explicit operator intent). The audit-log entry
inside dispatchBackupForGroup still records every fire.
2026-05-03 12:07:26 +01:00
steve 64d2fcf7a3 P2R-02 follow-up: clickable rows on Sources/Schedules + cron-preset tooltips
CI / Test (linux/amd64) (pull_request) Successful in 1m57s
CI / Lint (pull_request) Failing after 15s
CI / Build (windows/amd64) (pull_request) Successful in 22s
CI / Build (linux/amd64) (pull_request) Successful in 22s
CI / Build (linux/arm64) (pull_request) Successful in 22s
Aligns Sources and Schedules tab rows with the dashboard's row-click
UX: whole-row click navigates to the row's edit page (mirroring
.host-row.clickable). Drops the redundant Edit buttons; Run-now and
Delete remain in .row-action cells that sit above the row-link
overlay via z-index.

Schedule edit form's cron preset chips now carry human-readable
title= tooltips ("Every day at 03:00", "Every Sunday at 03:00", etc).

tasks.md gets a binding row-design rule covering all current and
future list-row templates, and the P2R-02 entry is split into the
six slices already agreed with the operator (slices 1–3 marked
done, 4 next).
2026-05-03 12:01:55 +01:00
steve 67ca769686 P2R-02 slice 3: Schedules tab — slim list, new/edit form, delete, Run-now
CI / Test (linux/amd64) (pull_request) Failing after 44s
CI / Lint (pull_request) Failing after 13s
CI / Build (windows/amd64) (pull_request) Successful in 19s
CI / Build (linux/amd64) (pull_request) Successful in 19s
CI / Build (linux/arm64) (pull_request) Successful in 25s
Schedules list: status (enabled/paused) + cron + source-group tags +
actions (Run-now when enabled+online, Edit, Delete). Run-now reuses
dispatchScheduledJob — same path real cron fires take, so each
referenced source group runs as its own backup with its own tag.
Falls back to a 409 if the agent is offline.

Schedule new/edit form: cron input with five preset chips
(quick-pick @hourly / nightly / 6h / weekly / monthly), source-group
multi-pick rendered as styled checkbox cards (visual state tracks
the underlying box via a tiny inline script), enabled toggle. No
paths/excludes/retention/kind on the schedule itself — those live on
source groups now.

Server-side validation re-renders with the operator's input + ticked
groups intact. Every successful mutation calls pushScheduleSetAsync.

Adds .schd-row, .preset-chip, .picker styles.
2026-05-03 11:55:16 +01:00
steve dede74fd3a P2R-02 slice 2 follow-up: refuse to delete a host's last source group
CI / Test (linux/amd64) (pull_request) Failing after 45s
CI / Lint (pull_request) Failing after 12s
CI / Build (windows/amd64) (pull_request) Successful in 19s
CI / Build (linux/amd64) (pull_request) Successful in 19s
CI / Build (linux/arm64) (pull_request) Successful in 23s
Belt-and-braces: the UI now disables the Delete button when a group
is the only one on the host (with a tooltip explaining why), and the
server-side handler returns 409 if a curl/form-replay tries anyway.
Every host needs at least one source group to be backup-able, so the
'last group on a fresh host' case is a meaningful accident to guard
against.
2026-05-03 11:49:17 +01:00
steve 0ed9c3d1ec P2R-02 slice 2: Sources tab — list, new/edit form, delete, Run-now
Sources tab now lists every source group on the host with per-row
counts (used-by-N-schedules, snapshot count by tag), the v4
conflict tag (keep-* dimension that has no compatible cadence),
and Run-now / Edit / Delete actions. Run-now reuses the existing
HTMX-aware /hosts/{id}/source-groups/{gid}/run handler.

New /hosts/{id}/sources/new and /sources/{gid}/edit form: name +
includes/excludes textareas + the 3×2 keep-* retention grid +
retry-on-offline knobs. Server-side validation re-renders with the
operator's input intact; the inline conflict banner shows above the
retention grid when ConflictDimension is set.

Delete blocks (UI + server) when the group is referenced by any
schedule. Every successful mutation calls pushScheduleSetAsync so
an online agent re-arms within seconds.

Adds .src-row and .keep-cell to input.css for the row + retention
grid layout.
2026-05-03 11:44:43 +01:00
steve a535822ff3 P2R-02 slice 1: host-detail sub-tab skeleton
Extract header/vitals/sub-tabs into a host_chrome partial that every
host-detail tab page renders. Sources / Schedules / Repo go from
inert divs to real <a> links backed by stub pages that share the
chrome and a 'coming next' body — slices 2/3/4 fill them in.

Also re-establishes the version indicator (host_schedule_version vs
agent's applied_schedule_version) in the header.

Drops the legacy fat-schedule list/edit templates that referenced
fields removed by the P2 redesign (Manual / Paths / RetentionPolicy
on Schedule); the new templates land in slice 3.
2026-05-03 11:37:55 +01:00
steve 21841e38c4 ci: only trigger on PRs into main
Drop the push-to-main trigger; main is fast-forward only via PR, so
the post-merge run was redundant.
2026-05-03 11:25:13 +01:00
steve e968abc042 ci: fix race-trip in enrollment fixture + bump golangci-lint to v2.1.6
- host_credentials_test.go's CreateEnrollmentToken fixture passed 1<<20
  as the TTL (third arg, time.Duration) — that's ~1ms in nanoseconds.
  Local non-race runs finished inside the window, but -race overhead
  blew the deadline so the token was already expired by the time
  GetEnrollmentTokenAttachments / ConsumeEnrollmentToken ran. Use
  time.Hour instead, which matches the spirit of a per-test fixture.
- Lint pin v1.61.0 was built against Go 1.23 and refuses to load a
  config targeting newer toolchains. go.mod is on 1.25, so the lint
  step exited 3 ('the Go language version used to build golangci-lint
  is lower than the targeted Go version'). Bumping to v2.1.6, which
  supports Go 1.25.

Both failures showed up only on the Gitea runner because the local
make target runs go test without -race and lint hadn't been re-run
after the go.mod toolchain bump.
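
The TTL slip is easy to reproduce: an untyped integer passed where a time.Duration is expected is read as nanoseconds. fixtureTTL below is a stand-in for the parameter conversion, not the real CreateEnrollmentToken signature.

```go
package main

import (
	"fmt"
	"time"
)

// fixtureTTL mimics passing a bare integer into a time.Duration parameter.
func fixtureTTL(raw int64) time.Duration {
	return time.Duration(raw)
}

func main() {
	fmt.Println(fixtureTTL(1 << 20)) // 1.048576ms
	fmt.Println(time.Hour)           // 1h0m0s
}
```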
2026-05-03 11:13:22 +01:00
steve 713bc4a2bb P2R-01 follow-up: WS-path tests + drop unused retention from backup dispatch
Adds p2r01_ws_test.go covering the two paths the original commit's
in-process tests couldn't reach without a live conn:

- maybeAutoInit dispatches command.run(init) on first hello when creds
  are bound, skips on second hello once a job row exists, and skips
  entirely when the host has no creds.
- dispatchScheduledJob iterates a schedule's source groups and emits
  one backup per group with the right Tag/Includes; persists job rows
  with actor_kind=schedule + scheduled_id; no-ops on a disabled
  schedule.

Drops RetentionPolicy from the per-group Run-now and schedule.fire
backup payloads — the agent's RunBackup ignores it (forget is the
only consumer). Adds Hub.Conn() so tests can grab the live *Conn
post-hello.
2026-05-03 11:00:45 +01:00
steve d000fe7ec1 P2R-01: REST + WS rewire against the slim shape
Schedules CRUD now takes {cron, enabled, source_group_ids[]} with cron
parsed via robfig/cron/v3 and group membership scoped to the host.
New source-groups CRUD lives at /api/hosts/{id}/source-groups; delete
refuses with 409 if any schedule still references the group, returning
the schedule list so the UI can prompt 'remove from these schedules
first.' Repo-maintenance GET/PUT manages forget/prune/check cadences
on host_repo_maintenance — no version bump, the server-side ticker
(P2R-06) drives execution.

Per-source-group Run-now (POST /hosts/{id}/source-groups/{gid}/run)
resolves the group's includes/excludes/retention/tag and dispatches a
backup command.run with the new structured CommandRunPayload fields
(Includes/Excludes/Tag). Old per-host /hosts/{id}/run-backup and
/hosts/{id}/init-repo return 410 Gone with a redirect message.

schedule_push.go is rebuilt: buildScheduleSetPayload assembles the
slim wire shape, pushScheduleSetOnConn ships it during the on-hello
window, pushScheduleSetAsync fires after every CRUD mutation, and
dispatchScheduledJob handles agent schedule.fire by iterating the
schedule's source groups and dispatching one backup per group with
actor_kind=schedule and scheduled_id pointing at the schedule.

Auto-init at first WS connect: when the host has repo creds bound and
no init job in its history, server dispatches restic init. Restic's
'config file already exists' soft-success means re-runs against an
existing repo no-op; we don't auto-retry on failure (operator triggers
re-init manually via the danger zone in P2R-09).

api.Schedule drops Kind/Paths/Excludes/Tags/RetentionPolicy/Manual etc.
in favour of {id, cron, enabled, source_groups: [...]}. The agent
scheduler stops checking sch.Manual; cmd/agent's backup dispatch reads
Includes/Excludes/Tag instead of Args.

Tests cover the new HTTP surface end-to-end: source-groups CRUD with
in-use refusal, schedule validation (bad cron / missing groups /
foreign group), repo-maintenance auto-seed and validation, the 410
route, and buildScheduleSetPayload's wire-shape correctness. Full
suite passes; smoke env exercises auto-init dispatch on hello,
async push after schedule create, and per-source-group Run-now
landing the right paths/excludes/tag at the agent.
2026-05-03 10:56:40 +01:00
steve 337dcc0f0f fix(.mcp.json): wrap playwright under mcpServers key
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-03 10:35:57 +01:00
steve 813158b3d6 P2 redesign · phase 2.5: tasks.md rewrite + UI patch-up
CI / Test (linux/amd64) (push) Failing after 4m47s
CI / Lint (push) Failing after 26s
CI / Build (windows/amd64) (push) Successful in 54s
CI / Build (linux/amd64) (push) Successful in 46s
CI / Build (linux/arm64) (push) Successful in 46s
The store rewrite in 5667cdf left tasks.md describing a data shape
(fat schedules, host.repo_initialised_at, manual flag) that no longer
exists, and left the host-detail templates rendering against fields
the store no longer exposes. This commit reconciles both.

tasks.md
* Mid-phase pivot called out at the top of Phase 2 with commit hashes.
* P2-01..P2-05 kept as done but stamped ⚠️ "shipped against old shape
  — to re-validate under P2R-02".
* P2-04.5 (manual flag) struck as superseded.
* New P2R-NN section covering work that previously lived only in
  commit messages and code stubs:
    P2R-00.1/00.2/00.3/00.4 — phases already shipped (this commit
                              records 00.4)
    P2R-01 — REST + WS rewire against slim schedules + source groups
             + repo maintenance + auto-init
    P2R-02 — UI rewire against the v4 wireframes
    P2R-03..05 — prune / check / unlock command surfaces
    P2R-06 — server-side maintenance ticker (cadence-driven)
    P2R-07 — repo stats panel
    P2R-08 — pending_runs queue worker
    P2R-09 — auto-init UX polish
    P2R-10..12 — pre/post hooks rehomed from schedule onto source group
    P2R-13..14 — bandwidth + next/last-run surface
* P2-16/17/18 (Windows + announce-and-approve) untouched.
* Phase 2 acceptance criteria rewritten against the new model.

UI patch-up (P2R-00.4)
* host_detail.html + host_row.html: removed every $host.RepoInitialisedAt
  reference (column dropped in migration 0008 — render was 500'ing).
* Removed manual init-repo branches; the auto-init path replaces them.
* Schedules sub-tab demoted from active link to inert div until P2R-02
  rebuilds the page (it was linking to a raw 501 from the stubbed
  ui_schedules.go handlers).
* Disabled the four per-host Run-now buttons (dashboard row + host
  detail header + empty-snapshots state + right-rail) with a
  "lands in P2 Phase 4" hint — handler is 501-stubbed pending P2R-01,
  so leaving them clickable produced silent failures over htmx.
* Dashboard row-action becomes "Open →" instead of Run-now.

Project tooling
* .mcp.json at repo root: project-scoped Playwright MCP override.
  Forces --headless (so I don't pop a browser at the operator) and
  --output-dir _diag (so screenshots / traces land in the gitignored
  _diag/ directory rather than scattered at the repo root).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 09:13:05 +01:00
steve 5667cdf13a P2 redesign · phase 2: store rewrite — sources, slim schedules, repo maintenance
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
Go-side data model rebuilt against migration 0008. The fat-Schedule
shape (paths/excludes/tags/retention/manual/kind/options/hooks) is
gone; that surface lives on source_groups now.

* store/types.go
  - Schedule slimmed to {id, host_id, cron, enabled, source_group_ids,
    timestamps}. SourceGroupIDs populated by Get/List, accepted on
    Create/Update so callers pass desired junction state in one shape.
  - SourceGroup added: name (= snapshot tag), includes/excludes,
    retention_policy, retry_max + retry_backoff_seconds, cached
    conflict_dimension.
  - HostRepoMaintenance added: forget/prune/check cadences + enabled.
  - PendingRun added: offline-retry queue.
  - Host loses RepoInitialisedAt; gains BandwidthUpKBps + BandwidthDownKBps.
  - RetentionPolicy moves home from "schedule field" to "source group
    field" but the type itself + Summary() method unchanged.

* store/sources.go (new) — CRUD + GetByName + ConflictDimension cache.
  Group writes bump host_schedule_version; conflict cache writes don't
  (server-internal projection, agent doesn't see it).
* store/maintenance.go (new) — CreateDefault is idempotent (INSERT OR
  IGNORE). UpdateRepoMaintenance doesn't bump schedule version because
  these run on the server's own ticker, not the agent's local cron.
* store/pending.go (new) — Enqueue / DueRunsForRetry / Bump / Delete.
* store/schedules.go — rewritten for slim shape + junction CRUD.
  Update wipes the schedule_source_groups junction wholesale and
  re-inserts (simpler than diffing). Adds SchedulesUsingGroup for
  retention-conflict detection + UI labels.
* store/hosts.go — drops repo_initialised_at scan, adds bandwidth scan.
  New SetHostBandwidth helper.

* HTTP layer — temporarily stubbed during this rewrite (501 returns
  with redesign_in_progress error code). Phase 3 fills these in
  against the new shape:
    - schedules.go REST CRUD
    - schedule_push.go agent reconciliation
    - ui_schedules.go HTML form CRUD
  Run-now-per-host + Init-repo handlers in ui_handlers.go also stubbed
  — both go away in the new model (Run-now per source group; auto-init
  at host enrolment).

* enrollment.go — replaces "seed manual schedule from typed paths"
  with "seed default source group + repo-maintenance row." The default
  group gets the typed paths as its includes; operator edits later
  via Sources tab.

* ws/handler.go — drops the MarkHostRepoInitialised projection (column
  is gone; auto-init makes it derivable from latest init job's status).

Tests:
* store: existing schedule test rewritten for slim shape + junction;
  new sources_test.go covers source-group CRUD, name uniqueness,
  conflict cache, repo-maintenance defaults + idempotent seed,
  pending-runs queue lifecycle.
* http: schedules_test.go and schedule_push_test.go deleted — both
  exercised the obsolete fat-schedule API. Phase 3 rewrites them
  against the new endpoints.

go test ./... green. cmd/server + cmd/agent build. The UI is broken
end-to-end (schedules / sources / repo tabs all hit 501 stubs); Phase 3
restores REST + on-the-wire reconciliation; Phase 4 rewires the UI
templates against the new model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 21:30:41 +01:00
steve 666af41f46 design: v4 wireframes for P2 redesign (sources / schedules / repo)
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
Hi-fi mock of the four pages affected by the redesign:
* /hosts/{id}/sources — list of source groups with per-row meta
  line (includes/excludes count, retention summary, usage,
  snapshot count) and Run-now / Edit / Delete actions. Tweaks
  toggle flips between fresh-host (default empty group, Run-now
  + Delete disabled) and multi-group states.
* /hosts/{id}/sources/{gid}/edit — name (snapshot tag), includes/
  excludes textareas, retention as a 3×2 grid of keep-* cells,
  retry-on-offline, inline conflict banner above retention when
  granularity↔cadence mismatch detected.
* /hosts/{id}/schedules — slim list (status / cron / source-tags
  / actions) plus new-schedule form (cron with quick-pick chips,
  source-group multi-select via clickable check pickers, enabled
  toggle).
* /hosts/{id}/repo — connection (URL/user/password/cert pin),
  bandwidth caps, maintenance rows (forget daily / prune weekly /
  check monthly with 5% subset), danger zone re-init.

Footer carries the retention-conflict detection spec (granularity
vs cadence mismatch). Visual language matches v1: --accent cyan,
JetBrains Mono for IDs/cron, btn tokens, sub-tab nav, hairline
panels.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 20:54:14 +01:00
steve 7a7cac588c P2 redesign · phase 1: migration 0008 — sources + repo maintenance
Schema rebuild for the model collapse described in
design/v4-sources-redesign.html. Three nouns (sources, schedules,
repo) now stand on their own, backed by these tables:

* schedules — slim. Only cron + enabled + host_id. Fat-schedule
  shape (paths/excludes/tags/retention/manual/kind/options/hooks)
  is dropped wholesale. Schedule data wiped — by design (smoke env
  was nuked before this ran; fresh installs have nothing to lose).
* source_groups — name + includes + excludes + retention_policy +
  retry policy + cached conflict_dimension. Group name doubles as
  the snapshot tag so retention can target it cleanly. UNIQUE
  (host_id, name) enforces tag unambiguity.
* schedule_source_groups — N:M junction. One schedule can fire N
  groups per tick; one group can be referenced by N schedules.
* host_repo_maintenance — 1:1 with hosts. Default cadences:
  forget daily 03:00, prune weekly Sun 04:00, check monthly 1st
  05:00 with --read-data-subset 5%. Operator can edit on Repo tab.
* pending_runs — offline-retry queue. Server-side ticker dispatches
  due rows; bounded by source_groups.retry_max + retry_backoff_seconds.
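
A minimal sketch of how the retry bound on pending_runs could be
evaluated, assuming a fixed delay between attempts. RetryPolicy,
RetryDelay and the attempt counter are hypothetical names; only
retry_max and retry_backoff_seconds come from the schema notes above.

```go
package main

import (
	"fmt"
	"time"
)

// RetryPolicy carries the two per-group knobs named in the migration notes.
// The Go names are assumptions made for this sketch.
type RetryPolicy struct {
	RetryMax            int // maximum number of retries for one pending run
	RetryBackoffSeconds int // wait between retries (fixed spacing assumed)
}

// RetryDelay reports how long to wait before the next retry, given how many
// retries have already been made. ok is false once the budget is spent.
func RetryDelay(p RetryPolicy, attempts int) (delay time.Duration, ok bool) {
	if attempts >= p.RetryMax {
		return 0, false
	}
	return time.Duration(p.RetryBackoffSeconds) * time.Second, true
}

func main() {
	p := RetryPolicy{RetryMax: 3, RetryBackoffSeconds: 300}
	for attempts := 0; attempts <= p.RetryMax; attempts++ {
		delay, ok := RetryDelay(p, attempts)
		fmt.Printf("attempts=%d retry=%v delay=%s\n", attempts, ok, delay)
	}
}
```

If the real ticker uses exponential backoff instead, only the delay
calculation changes; the retry_max cut-off stays the same.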

Plus:
* hosts.bandwidth_up_kbps / .bandwidth_down_kbps — host-wide caps.
* hosts.repo_initialised_at — DROPPED. Auto-init on enrol makes
  it derivable from the latest init job; the Init-repo button goes
  too (failure surfaces via job history banner).

Note on FK safety: smoke env was wiped before migration ran, so
DROP TABLE schedules cascades to nothing. Fresh installs apply
0001-0007 then immediately 0008 — same story (no schedule rows
to lose). For an upgrade path on a populated DB, this migration
would need a data-preserving variant; not needed today.

Tests fail to compile/run after this — expected. The Go side
(store types, CRUD, REST handlers, agent runner, UI templates)
gets rebuilt in subsequent phases. tasks.md will track P2 redesign
progress.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 20:54:01 +01:00
steve fdecde0d5c P2-05: forget command with retention policy
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
End-to-end forget plumbing — operator can create a forget schedule
with keep-* values, agent runs restic forget --keep-* … on the
schedule's cron (or via per-row Run-now), snapshot list shrinks,
UI updates.

* api.CommandRunPayload gains retention_policy json.RawMessage so
  the agent doesn't need a typed copy of the server-side struct.
* restic.ForgetPolicy mirrors restic's --keep-* flags. Empty()
  reports zero dimensions; restic wrapper RunForget refuses to
  run an empty policy (would delete every snapshot). Does NOT
  pass --prune — pruning lives behind a separate admin-only
  credential (P2-06); forget just rewrites the snapshot index.
* runner.RunForget mirrors RunBackup's envelope shape so the
  live log viewer works without special-casing. On success
  triggers reportSnapshots (forget shrinks the index, the host's
  snapshot count almost certainly changed).
* cmd/agent dispatcher handles MsgCommandRun with kind=forget,
  decodes RetentionPolicy from the wire, builds restic.ForgetPolicy.
* Server dispatchScheduleNow marshals the schedule's
  RetentionPolicy into the wire payload for kind=forget jobs.
  Refuses to dispatch a forget schedule with empty retention.
* validateSchedule rejects kind=forget without at least one keep-*
  dimension (new error code: missing_retention).
* UI schedule edit form gains a Kind dropdown (backup or forget;
  immutable on edit). Paths block toggles by kind via inline
  data-kind attributes. Form help-text explains the prune
  separation.

Other kinds (prune, check, unlock) deferred to P2-06..08; the
Kind dropdown only offers backup and forget today.
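
A minimal sketch of the empty-policy guard described above, assuming
a struct with one integer per keep-* dimension. The field names, the
Args method and ErrEmptyPolicy are hypothetical; the flags are
restic's own --keep-* options, and --prune is deliberately never
added.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// ForgetPolicy mirrors restic's count-based --keep-* flags.
// Field names are assumptions made for this sketch.
type ForgetPolicy struct {
	KeepLast, KeepHourly, KeepDaily, KeepWeekly, KeepMonthly, KeepYearly int
}

// Empty reports whether no keep-* dimension is set.
func (p ForgetPolicy) Empty() bool {
	return p.KeepLast <= 0 && p.KeepHourly <= 0 && p.KeepDaily <= 0 &&
		p.KeepWeekly <= 0 && p.KeepMonthly <= 0 && p.KeepYearly <= 0
}

// ErrEmptyPolicy is returned instead of building a forget with no keep-* flags.
var ErrEmptyPolicy = errors.New("forget: refusing to run with an empty retention policy")

// Args builds the restic argument list for a forget run. It never adds --prune.
func (p ForgetPolicy) Args() ([]string, error) {
	if p.Empty() {
		return nil, ErrEmptyPolicy
	}
	args := []string{"forget"}
	add := func(flag string, n int) {
		if n > 0 {
			args = append(args, flag, strconv.Itoa(n))
		}
	}
	add("--keep-last", p.KeepLast)
	add("--keep-hourly", p.KeepHourly)
	add("--keep-daily", p.KeepDaily)
	add("--keep-weekly", p.KeepWeekly)
	add("--keep-monthly", p.KeepMonthly)
	add("--keep-yearly", p.KeepYearly)
	return args, nil
}

func main() {
	args, err := ForgetPolicy{KeepLast: 7, KeepDaily: 14, KeepWeekly: 4}.Args()
	fmt.Println(args, err)

	_, err = ForgetPolicy{}.Args()
	fmt.Println(err)
}
```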

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 14:07:42 +01:00
steve f62a90b4b3 ui: stop Run-now buttons wrapping to two lines
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
Three sites:
* Schedules list per-row Run-now / Edit / Delete column was 1fr
  next to a 1.3fr retention column — too narrow for the three
  buttons. Pin the action column to 240px and add
  whitespace-nowrap to each button so the layout can't squeeze
  them onto two lines regardless.
* Dashboard host_row Run-now button got whitespace-nowrap +
  &nbsp; for the same reason inside the 92px action column.
* Host detail header "Run backup now" — &nbsp; the words so the
  button never breaks across lines if the header gets crowded.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 13:59:42 +01:00
steve 1b947f5a2c restic: don't fall back to parent's HOME when picking the cache dir
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
Agent runs as root (HOME=/root from systemd) with ProtectHome=
read-only, so restic's `mkdir /root/.cache/restic` fails on the
first call. Backups still completed (restic falls back to no-cache)
but every job log started with a noisy red "unable to open cache"
warning.

Default to /var/lib/restic-manager unconditionally — that's already
in the unit's ReadWritePaths and survives ProtectHome. ExtraEnv
overrides still win for tests / unusual setups.
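
A rough sketch of the "fixed default, overrides win" merge, assuming
the default is applied through restic's RESTIC_CACHE_DIR environment
variable. The commit does not say which variable the agent sets, and
buildEnv is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// defaultCacheDir is the directory named in the commit message.
const defaultCacheDir = "/var/lib/restic-manager"

// buildEnv returns KEY=VALUE pairs for the restic child process. The cache
// default is always set first; entries in extra replace it on conflict.
func buildEnv(extra map[string]string) []string {
	env := map[string]string{"RESTIC_CACHE_DIR": defaultCacheDir}
	for k, v := range extra {
		env[k] = v // caller-supplied values win
	}
	out := make([]string, 0, len(env))
	for k, v := range env {
		out = append(out, k+"="+v)
	}
	sort.Strings(out) // stable order for logging and tests
	return out
}

func main() {
	fmt.Println(buildEnv(nil))
	fmt.Println(buildEnv(map[string]string{"RESTIC_CACHE_DIR": "/tmp/test-cache"}))
}
```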

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 13:43:10 +01:00
steve c565a7abd1 agent unit: drop SystemCallFilter — was killing restic with SIGSYS
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
The @system-service allow-list filter excludes some syscalls that
Go's runtime and restic's file scanner rely on; the init job died
immediately with "bad system call (core dumped)". CapabilityBoundingSet
already constrains what root can do; the Protect*/Restrict* toggles
still cover network / kernel / mount / namespace. Net effect on the
threat model is negligible vs the operational cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 13:40:43 +01:00
steve 7e49b62e0e Add CLAUDE.md with project-specific rules
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
Three rules to date:

* After every make build, restage the agent binary + install
  assets into /tmp/rm-smoke/data/ and replace the running agent
  on this dev box. Plain `make build` doesn't reach either, and
  forgetting has bitten the smoke env twice today (stale agent
  without mergeRestCreds; stale unit without User=root).

* Migrations: prefer ALTER TABLE DROP/RENAME COLUMN (SQLite
  3.35+) over the rebuild dance. With foreign_keys=ON in the DSN,
  DROP TABLE on a parent with ON DELETE CASCADE children wipes
  every dependent table — and PRAGMA foreign_keys=OFF inside a
  migration is a no-op (PRAGMA can only change outside a tx).

* Don't slog restic's merged URL. The user:pass@-embedded form
  exists only inside envSlice() at exec time; if any URL needs
  to be operator-visible, route it through restic.RedactURL.
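
For the third rule, a minimal sketch of what a redaction helper
might look like, assuming repo URLs of the form
rest:https://user:pass@host/path. redactURL here is a hypothetical
stand-in, not the project's restic.RedactURL.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// redactURL removes any user:pass@ section from a repo URL so the result
// is safe to log. A leading "rest:" backend prefix is preserved.
func redactURL(repo string) string {
	const prefix = "rest:"
	raw := strings.TrimPrefix(repo, prefix)
	u, err := url.Parse(raw)
	if err != nil {
		// Never echo an unparseable input: it may still contain a password.
		return "(unparseable repo URL)"
	}
	u.User = nil
	out := u.String()
	if strings.HasPrefix(repo, prefix) {
		out = prefix + out
	}
	return out
}

func main() {
	fmt.Println(redactURL("rest:https://backup01:s3cret@repo.example.com:8000/backup01/"))
}
```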

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 13:33:20 +01:00
steve e0037f0026 restic: treat 'config file already exists' on init as soft success
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
Re-running restic init on a repo that's already initialised exits
non-zero with "Fatal: ... config file already exists". Semantically
that's a no-op, not a failure — the repo IS initialised, the
caller's intent is satisfied. Sniff stderr for the magic string
and swallow the exit code in that case, emitting an event line
so the operator-facing log says what happened.

Caught while smoke-testing P2-04.5: I had initialised the repo
manually during a debug session, so an operator clicking the UI's
Init-repo button afterwards hit this error and the host's
repo_initialised_at never flipped.
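
A minimal sketch of the stderr sniff, assuming the caller already has
the exit code and captured stderr. classifyInit and its event strings
are hypothetical; the matched substring is the one quoted above.

```go
package main

import (
	"fmt"
	"strings"
)

// alreadyInitialised is the stderr fragment restic prints when the repo
// config is already present.
const alreadyInitialised = "config file already exists"

// classifyInit maps the exit code and stderr of `restic init` to an outcome
// plus an operator-facing event line.
func classifyInit(exitCode int, stderr string) (ok bool, event string) {
	switch {
	case exitCode == 0:
		return true, "repository initialised"
	case strings.Contains(stderr, alreadyInitialised):
		// Non-zero exit, but the caller's intent is already satisfied.
		return true, "repository already initialised; treating init as a no-op"
	default:
		return false, fmt.Sprintf("restic init failed (exit %d)", exitCode)
	}
}

func main() {
	fmt.Println(classifyInit(0, ""))
	fmt.Println(classifyInit(1, "Fatal: ... config file already exists"))
	fmt.Println(classifyInit(1, "Fatal: something else went wrong"))
}
```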

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 13:22:01 +01:00
steve 72d8081b0d Add-host: default repo username to hostname; always show htpasswd snippet
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
The pending page suppressed the htpasswd snippet when repo_username
was blank — but with --private-repos the username is required for
auth, and operators routinely leave the field blank assuming the
system will pick something sensible.

* handleUIAddHostPost defaults repo_username to the typed hostname
  when blank. Matches what --private-repos expects (URL path
  segment == username).
* pending_host.html: snippet now renders whenever a password is
  present (always true after the generate-on-blank logic landed
  earlier).
* Form help-text updated to describe the default explicitly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 13:08:23 +01:00
steve 8a05969953 Add-host: durable pending page + polled awaiting-agent panel
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
Two issues from a smoke session:
1. The awaiting-agent panel never refreshed — operator had to go
   back to the dashboard to see the host had connected.
2. Generated passwords were displayed only on the POST response.
   Navigating away (or even an accidental tab close) lost them
   permanently, so the operator couldn't update the rest-server's
   htpasswd.

Both are the same fix: convert the POST-rendered transient
"result state" into a durable GET page at /hosts/pending/{token}.

* New route GET /hosts/pending/{token} renders the install-command +
  htpasswd snippet view. Password is decrypted from the (still-
  encrypted-at-rest) token row on every render — operator can
  refresh, bookmark, navigate away and come back. Once the agent
  enrols, the page redirects to /hosts/{id}; once the token
  expires, redirect to /hosts/new.
* New route GET /hosts/pending/{token}/awaiting returns a polled
  HTML fragment that the pending page swaps in every 2s via HTMX.
  States: awaiting (keep polling) | connected (show "Open host →"
  + "View schedules" CTAs, polling stops) | expired (mint-new
  link, polling stops). Polling stops naturally because only the
  awaiting state's wrapper carries the hx-trigger attribute.
* POST /hosts/new now 303-redirects to /hosts/pending/{token}
  on success; validation errors keep re-rendering the form with
  banner.
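
A minimal sketch of how the three polled states could be derived from
the token row, assuming "consumed" takes precedence over "expired".
TokenStatus and panelState are hypothetical names; the fields follow
the {expires_at, consumed_at, consumed_host} shape mentioned below.

```go
package main

import (
	"fmt"
	"time"
)

// TokenStatus is the slice of the enrollment-token row the panel needs.
type TokenStatus struct {
	ExpiresAt    time.Time
	ConsumedAt   *time.Time // nil until an agent enrols with the token
	ConsumedHost string     // host id to link to once connected
}

// panelState picks which fragment to render. Only "awaiting" keeps polling.
func panelState(s TokenStatus, now time.Time) string {
	switch {
	case s.ConsumedAt != nil:
		return "connected"
	case !now.Before(s.ExpiresAt):
		return "expired"
	default:
		return "awaiting"
	}
}

func main() {
	now := time.Date(2026, 5, 2, 12, 0, 0, 0, time.UTC)
	s := TokenStatus{ExpiresAt: now.Add(time.Hour)}
	fmt.Println(panelState(s, now)) // token still valid, no agent yet

	s.ConsumedAt, s.ConsumedHost = &now, "host-1"
	fmt.Println(panelState(s, now)) // agent has enrolled
}
```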

Supporting changes:
* New store helper Store.GetEnrollmentTokenStatus(tokenHash) for
  the polling endpoint — returns {expires_at, consumed_at,
  consumed_host} in one round-trip without dragging in the
  attachments-decryption path.
* New ui.Renderer.RenderPartial(w, name, data) for HTMX fragment
  responses (no layout wrap). Picks an arbitrary page's template
  set as the lookup point — every page parses the full common-
  paths list, so they all see every partial.
* add_host.html stripped to form-only; pending_host.html owns the
  result-state UI; awaiting_agent.html is the polled partial.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 12:59:24 +01:00
steve 148e61b33b P2-04.5: kill host.default_paths in favour of manual schedules
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
Having two independent path lists for "what does this host back up?" was
a real divergence footgun — operator types one set at Add-host time
and a different set into a schedule, both end up in the same repo,
the snapshot history looks fine until restore. Resolution: drop
host.default_paths entirely; add a `manual` flag on schedules.
A manual schedule has paths/excludes/tags/retention like any other
but no cron — it fires only via per-schedule Run-now. Single source
of truth for what gets backed up.

Schema (migration 0007):
* schedules.manual INTEGER NOT NULL DEFAULT 0.
* For every host with non-empty default_paths, seed a manual
  schedule with those paths and bump host_schedule_version.
* ALTER TABLE hosts DROP COLUMN default_paths.
* ALTER TABLE enrollment_tokens RENAME COLUMN default_paths
  TO initial_paths.

Original draft of this migration rebuilt hosts via the
create-new + drop-old + rename-new pattern. With foreign_keys=ON
(set in the connection DSN), DROP TABLE on the parent fired
ON DELETE CASCADE on every child of hosts(id) — schedules /
jobs / snapshots / host_credentials all wiped on the smoke env
when I tried it. SQLite 3.35+ supports column-level ALTERs
directly, so we skip the rebuild dance and avoid the cascade
trap. Six lines of SQL instead of sixty, no FK risk.

Run-now rewiring:
* New `dispatchScheduleNow(hostID, scheduleID, conn?)` helper
  unifies the agent-driven path (cron fire → schedule.fire →
  OnScheduleFire callback) and the UI-driven path (operator
  clicks Run-now on a schedule row). Conn arg is optional; nil
  falls back to Hub.Send.
* New POST /hosts/{id}/schedules/{sid}/run endpoint — per-row
  Run-now button on the schedules list.
* Dashboard's per-host Run-now (handleUIRunBackup) now picks the
  host's only enabled manual schedule, falls back to the only
  enabled schedule, else returns "pick one in Schedules tab".
  Keeps one-click for the common case.
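
A minimal sketch of the dashboard fallback order described in the
last bullet. The Schedule struct, pickRunNow and ErrAmbiguous are
hypothetical; the selection rule is the one stated above.

```go
package main

import (
	"errors"
	"fmt"
)

// Schedule holds only the fields the selection rule looks at.
type Schedule struct {
	ID      string
	Enabled bool
	Manual  bool
}

// ErrAmbiguous tells the operator to choose a schedule explicitly.
var ErrAmbiguous = errors.New("pick one in Schedules tab")

// pickRunNow returns the single enabled manual schedule if there is exactly
// one, else the single enabled schedule if there is exactly one, else an error.
func pickRunNow(schedules []Schedule) (Schedule, error) {
	var enabled, manual []Schedule
	for _, s := range schedules {
		if !s.Enabled {
			continue
		}
		enabled = append(enabled, s)
		if s.Manual {
			manual = append(manual, s)
		}
	}
	switch {
	case len(manual) == 1:
		return manual[0], nil
	case len(enabled) == 1:
		return enabled[0], nil
	default:
		return Schedule{}, ErrAmbiguous
	}
}

func main() {
	s, err := pickRunNow([]Schedule{
		{ID: "nightly", Enabled: true},
		{ID: "adhoc", Enabled: true, Manual: true},
	})
	fmt.Println(s.ID, err)
}
```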

Agent:
* Scheduler skips manual schedules in cron build (silent — they're
  a normal data shape, not an error).
* Wire Schedule struct gains Manual flag.
* Schedule.fire flow unchanged — the agent only ever fires
  non-manual schedules anyway.

UI:
* Add-host form retitled "Initial schedule · manual" so the
  operator knows the paths become an editable schedule under
  the Schedules tab. Result page calls out the manual schedule
  + points at Host > Schedules.
* Schedule edit form: "Manual schedule" checkbox at the top of
  the When section; toggling it hides/shows the cron field via
  inline JS. Server-side validator skips the cron requirement
  when manual=true.
* Schedule list shows a "manual" tag under the status pill and
  renders the When column as "— run-now only —" for manual rows.
  Each row gets a Run-now button when the schedule is enabled
  and the host is online.

Tests + go test ./... green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 12:26:06 +01:00
steve 160d788bae P2-04: schedule editor UI
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
Closes the schedule foundations slice — operator can now drive the
plumbing P2-01..03 landed without touching the JSON API.

* New routes:
  - GET  /hosts/{id}/schedules          (list)
  - GET  /hosts/{id}/schedules/new      (create form)
  - POST /hosts/{id}/schedules/new      (create)
  - GET  /hosts/{id}/schedules/{sid}/edit (edit form)
  - POST /hosts/{id}/schedules/{sid}/edit (update)
  - POST /hosts/{id}/schedules/{sid}/delete (delete, confirm-then-redirect)

* List view (web/templates/pages/schedules_list.html):
  status, cron, paths, retention summary, tags, edit/delete buttons.
  Header shows "version N · agent in sync" or "agent at vM" when the
  push hasn't been ack'd yet — backed by host_schedule_version +
  applied_schedule_version. Empty-state CTA points at /schedules/new.

* Create/edit form (web/templates/pages/schedule_edit.html, shared):
  cron expression with five quick-pick presets (daily 3am / every 6h
  / @hourly / weekly Sun / monthly 1st), paths textarea (one per
  line), excludes textarea, tags (comma-separated), retention as six
  numeric fields (mirrors restic's --keep-* flags one-for-one),
  bandwidth caps, enabled toggle. Side panel explains the
  reconciliation flow so the operator knows what saving actually
  does. Validation errors re-render with operator's input intact.

* internal/server/http/ui_schedules.go owns the handlers; reuses
  the same validateSchedule + pushScheduleSetAsync used by the JSON
  API path. Each save audit-logs schedule.created / schedule.updated
  / schedule.deleted (matching the JSON API actions).

* store.RetentionPolicy gains a Summary() method ("last=7, d=14,
  w=4" or "—"). Used by the list view's table cell so templates
  don't have to do any conditional retention rendering.

* Two new template helpers: list (string varargs → []string, used
  for the cron preset row) and joinComma (sibling to joinDot for
  the rare list that wants commas). RetentionPolicy.Summary covers
  the schedule-list case but the helpers are general.

* host_detail.html secondary tabs row converted from inert <div>s
  into <a> links. Snapshots active by default; Schedules now points
  at the new page. Jobs/Repo/Settings remain inert until their
  P2 owners ship.

Hooks UI deferred to P2-15 (lands with the hook execution path).
Single-kind UI (backup only) by design — other kinds get a UI when
their job dispatch lands in P2-05..08.
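
A minimal sketch of a Summary() in the format quoted above
("last=7, d=14, w=4"). The field names and the h / m / y labels for
the hourly, monthly and yearly dimensions are assumptions; only the
last / d / w labels and the dash placeholder appear in this message.

```go
package main

import (
	"fmt"
	"strings"
)

// RetentionPolicy has one count per restic --keep-* dimension.
type RetentionPolicy struct {
	KeepLast, KeepHourly, KeepDaily, KeepWeekly, KeepMonthly, KeepYearly int
}

// Summary renders the non-zero dimensions as a short comma-separated string,
// or a single dash (U+2014) when nothing is set.
func (p RetentionPolicy) Summary() string {
	var parts []string
	add := func(label string, n int) {
		if n > 0 {
			parts = append(parts, fmt.Sprintf("%s=%d", label, n))
		}
	}
	add("last", p.KeepLast)
	add("h", p.KeepHourly)
	add("d", p.KeepDaily)
	add("w", p.KeepWeekly)
	add("m", p.KeepMonthly)
	add("y", p.KeepYearly)
	if len(parts) == 0 {
		return "\u2014"
	}
	return strings.Join(parts, ", ")
}

func main() {
	fmt.Println(RetentionPolicy{KeepLast: 7, KeepDaily: 14, KeepWeekly: 4}.Summary())
	fmt.Println(RetentionPolicy{}.Summary())
}
```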

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:44:40 +01:00
steve 6450bf1b88 P2-02 (agent side) + P2-03: agent scheduler + schedule.fire dispatch
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
Closes the schedule reconciliation loop end-to-end.

* New `internal/agent/scheduler` package wraps robfig/cron/v3 with
  the lifecycle the agent needs:
  - Apply(ScheduleSetPayload, Sender) stops the prior cron (waiting
    for in-flight entries to return), rebuilds from scratch, starts,
    and emits schedule.ack with the version we just applied.
  - Disabled entries skipped silently; bad cron exprs (which
    shouldn't reach us — the server validates — but defensive)
    log a warn and skip.
  - On each cron tick the entry sends a new schedule.fire envelope
    to the server with {schedule_id, scheduled_at}. The scheduler
    itself never builds CommandRunPayloads — server is the source
    of truth for jobs.
  - tx is swapped on every Apply, so reconnect is handled
    naturally: cron entries that fire against a dropped tx log
    "no active connection" and skip the tick.
  - Stop() is idempotent and waits for the cron's in-flight
    workers via cron.Stop().Done().

* New wire message api.MsgScheduleFire + api.ScheduleFirePayload
  for the agent → server "I just fired locally" RPC.

* Server-side dispatch (schedule_push.go: dispatchScheduledJob):
  looks up the schedule by id, validates ownership + that it's
  enabled, builds args from kind (paths for backup; other kinds
  are still arg-less in Phase 2 and grow as those job kinds land
  in P2-05..08), persists a jobs row with actor_kind=schedule +
  scheduled_id, and writes command.run back on the same conn so
  the agent runs through its existing dispatch path.

* store.CreateJob now writes scheduled_id. This column was in the
  schema since 0001 but never populated — the original P1 path
  only had operator-driven jobs, so actor_kind was always 'user'
  and scheduled_id was always nil.

* cmd/agent/main.go integration: dispatcher gains a
  *scheduler.Scheduler; the MsgScheduleSet case now hands the
  payload to scheduler.Apply (in a goroutine so the WS read loop
  keeps draining other messages).

* WS dispatcher gains OnScheduleFire alongside OnScheduleAck.

* Tests:
  - scheduler unit tests (4): ack-on-apply, cron tick fires
    schedule.fire envelope, disabled entries don't fire, replace-
    prior-state stops the old cron.
  - Server-side end-to-end: schedule.fire → command.run with the
    right job_id / kind / args, plus jobs row with actor_kind=
    "schedule" and scheduled_id linking back to the schedule.

Persistence of next-fire times across agent restarts is
deliberately deferred. A missed fire window during downtime
simply fires once on reconnect — that's the desirable behaviour
(the operator wants the missed backup to run, not be silently
skipped because we lost track of when it was due).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:29:12 +01:00
steve 946b6db137 P2-02 (server side): schedule reconciliation push + ack handling
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
Server is now the source of truth for the agent's cron set.

* Helpers in schedule_push.go:
  - loadScheduleSetPayload reads the host's schedules + canonical
    version into the wire shape.
  - pushScheduleSetOnConn writes directly to a just-handshaken conn
    (avoids racing against Hub.Register on a brand-new connection).
  - pushScheduleSetAsync is the post-CRUD flavour — no-op when the
    host is offline (the next reconnect's on-hello path catches it
    up, so a missed push is non-fatal).
  - applyScheduleAck records what version the agent has confirmed.

* onAgentHello restructured: it used to return early when the host had
  no repo credentials, which made the schedule push unreachable for
  fresh hosts. Split into pushRepoCredsOnHello (silent no-op on
  ErrNotFound) + pushScheduleSetOnConn (always runs). Empty schedule
  list is a valid push: tells the agent to drop stale cron entries.

* WS dispatcher gains an OnScheduleAck hook on HandlerDeps; the
  http server wires it to applyScheduleAck. MsgScheduleAck moves
  out of the "TODO(P2)" group into a real case that decodes the
  payload and forwards to the callback.

* Schedule CRUD handlers each fire pushScheduleSetAsync after the
  audit-log write so the agent picks up changes within seconds.

Tests cover:
  - On-hello push of an already-created schedule, agent acks,
    applied_schedule_version flips on the host row.
  - Connect-then-CRUD: empty initial push (version 0), then a
    follow-on push at version 1 after the operator creates a
    schedule via REST.

Agent-side `schedule.set` handler (parse, replace local cron,
emit `schedule.ack`) is the remainder of P2-02 and lands with
P2-03's local scheduler.
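
A minimal in-memory sketch of the version / ack bookkeeping exercised
by the tests above. hostVersions and its methods are hypothetical
stand-ins for the store-backed counters; persistence and the wire
protocol are left out.

```go
package main

import "fmt"

// hostVersions is an in-memory stand-in for the two per-host counters:
// the canonical schedule version and the version the agent last confirmed.
type hostVersions struct {
	Current int // bumped by every schedule create/update/delete
	Applied int // recorded when the agent sends schedule.ack
}

// Mutate models a CRUD write and returns the version to push.
func (h *hostVersions) Mutate() int {
	h.Current++
	return h.Current
}

// Ack records the version carried by a schedule.ack message.
func (h *hostVersions) Ack(version int) { h.Applied = version }

// InSync reports whether the agent has confirmed the latest version.
func (h *hostVersions) InSync() bool { return h.Applied == h.Current }

func main() {
	var h hostVersions

	h.Ack(0) // on-hello push of an empty set at version 0, then ack
	fmt.Println("after hello:", h.InSync())

	v := h.Mutate() // operator creates a schedule via REST
	fmt.Println("after create:", v, h.InSync())

	h.Ack(v) // agent applies the follow-on push and acks it
	fmt.Println("after ack:", h.InSync())
}
```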

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:22:06 +01:00
steve 4b075840a1 P2-01: schedule schema + CRUD API
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
The `schedules` table was already laid down in migration 0001; this
slice adds the Go-side data model, store CRUD with atomic version
bumps, and REST endpoints.

* `store.Schedule` + `RetentionPolicy` + `ScheduleOptions` typed
  views (the wire form on the agent side keeps retention/options
  as raw JSON since the agent just forwards them to restic).
* Store CRUD: CreateSchedule / GetSchedule / ListSchedulesByHost /
  UpdateSchedule / DeleteSchedule. Each mutation bumps the host's
  schedule version atomically, in the same tx, via an UPSERT on
  `host_schedule_version`. SetHostAppliedScheduleVersion records
  what the agent has confirmed via schedule.ack (P2-02 will use it).
* REST endpoints under /api/hosts/{id}/schedules + /{sid}:
  GET (list, with the version envelope so callers can detect
  drift), POST (create), PUT (update — kind is immutable), DELETE.
* Validation: cron expressions parse via robfig/cron/v3 (same
  parser the agent will use, so anything that validates here will
  fire there); kind ∈ {backup, forget, prune, check} (init/unlock
  are operator-only one-shot kinds, not schedulable); backup
  schedules require ≥1 path; hooks rejected on non-backup kinds
  (spec §14.3).
* All mutations audit-logged.
* Tests: store-level CRUD + version-bump invariants; REST happy
  path (create→list→update→delete with version progression); REST
  validation table covers each rejection code.

newTestServerWithHub now sets BootstrapToken so the schedules
handler tests can use the existing login flow without a parallel
test-server constructor.
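
A minimal sketch of the non-cron validation rules listed above. The
ScheduleInput type and the rejection codes are hypothetical, and cron
parsing is omitted because it depends on robfig/cron/v3.

```go
package main

import "fmt"

// ScheduleInput holds only the fields these rules inspect.
type ScheduleInput struct {
	Kind  string
	Paths []string
	Hooks []string
}

// validate returns a rejection code, or "" when the input passes.
// The code strings are placeholders invented for this sketch.
func validate(in ScheduleInput) string {
	switch in.Kind {
	case "backup":
		if len(in.Paths) == 0 {
			return "missing_paths" // backup schedules need at least one path
		}
	case "forget", "prune", "check":
		if len(in.Hooks) > 0 {
			return "hooks_not_allowed" // hooks are only valid on backup
		}
	default:
		return "invalid_kind" // init/unlock and unknown kinds are not schedulable
	}
	return ""
}

func main() {
	fmt.Printf("%q\n", validate(ScheduleInput{Kind: "backup", Paths: []string{"/etc"}}))
	fmt.Printf("%q\n", validate(ScheduleInput{Kind: "backup"}))
	fmt.Printf("%q\n", validate(ScheduleInput{Kind: "init"}))
}
```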

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:12:58 +01:00
steve ee3ee241ea P1 polish: agent-as-root, init-repo flow, rest creds passthrough, UX fixes
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
Cohesive batch from a smoke-test session against a real rest-server.
Themed bullets:

* Agent runs as root, sandboxed via systemd. CapabilityBoundingSet
  drops to CAP_DAC_READ_SEARCH + restore caps; ProtectSystem=strict
  with ReadWritePaths confined to /etc + /var/lib/restic-manager;
  NoNewPrivileges blocks escalation. Install script no longer
  creates a service user. spec.md §4.2 / §14.1 / §14.3 explain the
  rationale (matches UrBackup / Veeam / Bareos defaults; trying to
  back up "everything" as an unprivileged user creates silent skips
  on /home, /root, /var/lib/* with no upside vs the threat model
  the agent already implies).

* Init-repo end-to-end. New JobKind="init" wired through agent
  runner, restic.Env.RunInit, server dispatcher, and a UI button
  (red "Initialise repo" in the run-now panel). hosts.repo_initialised_at
  flips on init success, on backup success, or on a non-empty
  snapshots.report. The "Run now" / "Init" / "Retry" branching now
  drives both the dashboard host row and the host-detail panel.
  Migrations 0004 (column), 0005 (jobs.kind CHECK widened — using
  the safe create-new-then-rename pattern; first version corrupted
  job_logs.job_id FK), 0006 (cleans up job_logs FK on already-
  affected DBs).

* rest-server creds embedded at exec time only. restic.Env gains
  RepoUsername; mergeRestCreds() builds the user:pass@-prefixed URL
  inside envSlice() and never assigns it back to the struct, so
  nothing slog-able ever sees the cleartext form. RedactURL helper
  for any future surface that needs to log a URL safely. Both
  helpers tested.

* Add-host UX. Repo password is now optional — server mints a
  24-byte URL-safe random one and surfaces it once, alongside an
  htpasswd snippet ("echo PASS | htpasswd -B -i ... USERNAME") so
  the operator pastes one command on the rest-server host and one
  on the endpoint. Result page also links the install snippet at
  /install/install.sh (was /install.sh — 404'd before) and pipes
  to bash (not sh — script uses set -o pipefail and other
  bashisms; on Debian/Ubuntu sh is dash).

* Late-subscriber race in JobHub. A fast-failing job could finish
  (DB write + Broadcast) before the browser's HX-Redirect → page
  load → WS-connect path completed, so the JS sat forever waiting
  on a job.finished that already passed. JobHub split into
  Register + Send + Run; handleJobStream now subscribes first,
  re-fetches the job, and sends a synthetic job.finished if the
  state is already terminal.

* HTMX error visibility. New toast partial listens to
  htmx:responseError and surfaces the response body as a
  bottom-right toast — every server-side validation error now
  becomes visible without per-handler JS wiring. Also handles
  custom rm:toast events for future server-pushed notifications
  via the HX-Trigger header. Themed via existing CSS vars.

* Dashboard rows are now whole-row clickable to host detail
  (CSS card-link pattern: absolute-positioned anchor + .row-action
  z-index restoration so the action button stays clickable).
  "View →" on a running job links to /jobs/<id> rather than
  /hosts/<id> since the row click already covers the host page.

* "Run first" / "Run first backup" → "Run now" everywhere for
  consistency.

* runbook (docs/e2e-smoke.md) updated — live-log streaming step
  now reflects P1-26; mentions the browser-driven Run-now flow.

* _diag/dump-creds — moved out of cmd/ so go build doesn't pick
  it up; .gitignore now excludes /_diag/ entirely.
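
The late-subscriber fix in the JobHub bullet above can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: hub, register, broadcast, streamJob and lookupState are hypothetical names, and envelopes are reduced to plain strings.

```go
package main

import (
	"fmt"
	"sync"
)

// hub is a hypothetical, stripped-down stand-in for JobHub.
type hub struct {
	mu   sync.Mutex
	subs map[string][]chan string
}

// register adds a subscriber for jobID and returns its channel.
func (h *hub) register(jobID string) chan string {
	ch := make(chan string, 8)
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.subs == nil {
		h.subs = make(map[string][]chan string)
	}
	h.subs[jobID] = append(h.subs[jobID], ch)
	return ch
}

// broadcast delivers msg to current subscribers only; earlier events are not replayed.
func (h *hub) broadcast(jobID, msg string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, ch := range h.subs[jobID] {
		select {
		case ch <- msg:
		default: // slow subscriber: drop rather than block
		}
	}
}

func isTerminal(state string) bool {
	return state == "succeeded" || state == "failed"
}

// streamJob subscribes first, then re-reads the job state. If the job already
// finished, it injects a synthetic job.finished so the client never waits forever.
func streamJob(h *hub, jobID string, lookupState func(string) string) chan string {
	ch := h.register(jobID)
	if isTerminal(lookupState(jobID)) {
		select {
		case ch <- "job.finished":
		default:
		}
	}
	return ch
}

func main() {
	h := &hub{}
	h.broadcast("job-1", "job.finished") // fires before anyone is listening
	ch := streamJob(h, "job-1", func(string) string { return "failed" })
	fmt.Println(<-ch)
}
```

Because the subscription happens before the state lookup, a job that finishes in between can produce both the real and the synthetic event, so a client built on this sketch should tolerate a duplicate job.finished.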

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:02:12 +01:00
steve 12b72e7dde P1 polish: Host.default_paths interim + restic env hygiene + job_id JS quoting
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
Two fixes that close the loop on dashboard run-now and harden the
agent's restic invocation.

Default paths (interim until P2-01 schedules):
  - 0003 migration adds default_paths TEXT NOT NULL DEFAULT '[]'
    to hosts and to enrollment_tokens.
  - Operator types paths in the Add-host form (textarea, one per
    line). They ride on the enrol_token row alongside the
    encrypted creds (paths aren't secret — plain JSON column).
  - On consume, ConsumeEnrollmentToken still just burns the token;
    the new GetEnrollmentTokenAttachments returns both the
    re-bindable creds and the path list in one round trip, the
    handler transfers them onto the new host row inside CreateHost.
  - The dashboard's Run-now and host-detail's "Run backup now"
    button now read Host.DefaultPaths and pass them to dispatchJob.
    A host with no default paths returns 400 with a friendly
    "no paths set" message instead of dispatching a doomed
    `restic backup` with no positional args.
  - Doc comments explicitly call this out as a Phase 1 interim —
    schedules supersede.

Restic env hygiene:
  - envSlice() previously omitted HOME / XDG_CACHE_HOME, which
    bit the smoke runs whenever the agent was launched outside
    systemd (restic refused to start: "neither $XDG_CACHE_HOME
    nor $HOME are defined"). Now both are set explicitly: prefer
    Env.ExtraEnv overrides, fall back to the agent process's own
    HOME, and finally to /var/lib/restic-manager.
  - Comment makes the env policy explicit: parent's RESTIC_* /
    AWS_* / B2_* env is filtered out by design — control-plane
    is the unambiguous source of truth.

JS bug fix in the live log page:
  - {{$job.ID | printf "%q"}} produced a literal-quoted JS string,
    which then went into the WS URL as ".../jobs/"<ID>"/stream"
    → 404. Switched to '{{$job.ID}}' inside the literal so
    html/template's auto-escape does the right thing. Verified
    end-to-end: dashboard "Run now" → live progress + log lines
    arrive over the WS → succeeded pill renders.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:35:33 +01:00
steve bd434bd1d0 P1-26: live job log viewer + WS browser fan-out hub
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
Closes the P1-21 remainder.

internal/server/ws/jobhub.go — new JobHub. Per-job_id set of
subscribers; each gets a 64-deep buffered channel with a writer
goroutine. Broadcast is non-blocking: if a subscriber is slow,
its channel fills and messages are dropped for that subscriber
only — the agent's read loop is never blocked by a stuck browser.
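
The non-blocking broadcast described above might look roughly like this. It is a sketch only: jobHub, subscribe and broadcast are hypothetical names, the per-subscriber writer goroutine is omitted, and the returned drop count is added purely for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

const subscriberBuffer = 64 // matches the 64-deep channel mentioned above

type jobHub struct {
	mu   sync.Mutex
	subs map[string]map[chan []byte]struct{}
}

func newJobHub() *jobHub {
	return &jobHub{subs: make(map[string]map[chan []byte]struct{})}
}

// subscribe registers a buffered channel for jobID and returns it with a cancel func.
func (h *jobHub) subscribe(jobID string) (<-chan []byte, func()) {
	ch := make(chan []byte, subscriberBuffer)
	h.mu.Lock()
	if h.subs[jobID] == nil {
		h.subs[jobID] = make(map[chan []byte]struct{})
	}
	h.subs[jobID][ch] = struct{}{}
	h.mu.Unlock()
	cancel := func() {
		h.mu.Lock()
		defer h.mu.Unlock()
		delete(h.subs[jobID], ch)
		if len(h.subs[jobID]) == 0 {
			delete(h.subs, jobID)
		}
	}
	return ch, cancel
}

// broadcast never blocks: a full channel means that subscriber misses the message.
func (h *jobHub) broadcast(jobID string, msg []byte) (dropped int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for ch := range h.subs[jobID] {
		select {
		case ch <- msg:
		default:
			dropped++
		}
	}
	return dropped
}

func main() {
	h := newJobHub()
	ch, cancel := h.subscribe("job-1")
	defer cancel()
	dropped := 0
	for i := 0; i < 70; i++ { // nobody is reading, so the buffer fills up
		dropped += h.broadcast("job-1", []byte("log line"))
	}
	fmt.Println("buffered:", len(ch), "dropped:", dropped)
}
```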

The agent dispatchAgentMessage path mirrors job.started /
job.progress / log.stream / job.finished envelopes onto the hub
in addition to its existing persistence work. The wire shape is
the same end-to-end, so client-side JS switches on env.type the
same way Go code does.

GET /api/jobs/{id}/stream is the browser endpoint. Auth via
session cookie (HTTP layer); upgrade; subscribe; pump until
context closes.

GET /jobs/{id} renders the live log page. Four states (queued/
running/succeeded/failed) drive the header pill, the progress
bar block, the failure summary panel, and the action button
(Cancel job while running, Back to host afterwards). Already-
persisted log lines are server-rendered on initial load; new
lines arrive over the WS and append to #log-stream. Auto-scrolls
unless the user scrolls up (a "⇢ Follow" pill re-attaches).
On job.finished the page reloads after 600ms to pick up the
final-state header rendered server-side.

POST /hosts/{id}/run-backup now sets HX-Redirect → /jobs/{job_id}
on success so HTMX lands the operator straight on the live log.
For non-HTMX callers (curl / plain form post) it 303s to the
same target.

store.ListJobLogs returns persisted log lines for initial render
on page load.

Browser-verified end-to-end: enrol → run a real backup against a
sibling restic/rest-server → live progress + 11 log lines stream
in → succeeded pill + final stats land after page reload.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 21:45:56 +01:00
steve 26a2b85e13 P1-25: host detail page (snapshots tab default)
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
GET /hosts/{id} renders the v1 host detail layout:

  - persistent header: status dot (pulse if a job is in flight),
    monospace name, tags, plus a metadata strip (os/arch, agent
    version, restic version, "last seen Xs ago" or "online · last
    heartbeat …").
  - vitals strip: four tiles for last backup (status + relative
    time), repo size, snapshot count, open alerts.
  - sub-tabs: Snapshots is active; Jobs / Repo / Settings are
    visible but inert until P2.
  - snapshot table: short id, time (absolute), paths joined with
    " · ", size, file count, restore button (disabled — wires up
    in P3).
  - right rail: run-now stack (backup live, forget/prune/check/
    unlock disabled with the Phase tag), danger-zone remove panel
    (also disabled for now).

Empty state: when a host has no snapshots yet, the table replaces
itself with a "no snapshots yet" prompt that includes the run-now
button (provided the agent is online).

Pagination cap of 50 most-recent snapshots; full pagination lands
when fleet sizes demand it.

Template helpers grew: comma() now accepts int / int32 / int64 so
templates don't trip over Go's strict numeric types; joinDot() concatenates
a []string with " · "; absTime() formats time.Time as
YYYY-MM-DD HH:MM:SS; the existing relTime() already accepts T or
*T after P1-27.

Browser-verified end-to-end with seeded fixture data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 20:20:21 +01:00
steve dad8c7fe99 P1-27: Add host flow — form + minted-token result page
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
GET /hosts/new renders the focused two-column form (hostname,
tags, repo URL/username/password). POST /hosts/new validates,
mints a one-time token via the new mintEnrollmentToken helper —
shared with the existing JSON /api/enrollment-tokens endpoint —
and re-renders the same page in result state showing:

  - the install command with RM_SERVER + RM_TOKEN filled in (and
    an inline copy-to-clipboard button),
  - an "awaiting agent connection" panel with the hostname
    pre-filled,
  - a troubleshooting list pointing at the most common reasons
    the agent doesn't appear,
  - back-to-dashboard / add-another-host links.

publicURL() resolves RM_BASE_URL first, falling back to scheme +
Host on the inbound request — useful for local smoke without a
proxy.

Browser-verified end-to-end: form submit → token minted → install
command renders with the right values from the form input.

Template fn formatRelTime now accepts time.Time *or* *time.Time
so templates can pass either without fighting html/template's lack
of an address-of operator.

Deferred: download-preconfigured-installer (a templated .sh with
the values baked in) — copy-paste covers v1; nice-to-have later.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 20:16:54 +01:00
steve ee16bc7ce7 P1-24: live dashboard — fleet summary tiles + host table
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
Server-rendered HTML view backed by:
  - new store.FleetSummary aggregating host counts + repo bytes +
    snapshot total + open alerts + last-24h job rollup in two queries.
  - GET /api/hosts (JSON list of hosts in the dashboard projection).
  - GET /api/fleet/summary (JSON aggregate, same shape as above).

The HTML page (web/templates/pages/dashboard.html) renders the four
summary tiles + host table directly from store data — no separate
fetch. Per-row state colour comes from .host-row.{degraded,failed,
offline} which paint a 3px left edge so problem hosts are scannable
without reading. HTMX is loaded into the base layout so per-row
"Run now" buttons can hx-post to /hosts/{id}/run-backup, a thin
HTML wrapper that funnels into a new dispatchJob helper shared
with the JSON /api/hosts/{id}/jobs endpoint.

Empty state (zero hosts) collapses to the "no hosts yet" prompt
with the + Add host CTA — matches the v1 mockup.

Template helpers (internal/server/ui/funcs.go) added for byte
formatting (412 GB / 3.7 TB), relative time (3m ago / 2d ago), and
comma grouping (1,847). Pure Go, no template-magic dependency.

Browser-verified end-to-end with seeded fixture data: five hosts
across all four states render with correct dots, accents, last-
backup pills, sizes, snapshot counts, alerts, tags, and the right
action button (Run now / Retry / Run first / View → / offline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 19:29:11 +01:00
steve 229f89fee2 P1-23 / P1-28: base layout, login, session-aware nav + Tailwind build
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
P1-28: Tailwind standalone CLI wired into the Makefile. `make tailwind`
downloads the pinned v3.4.17 binary into bin/tailwindcss (gitignored),
builds web/styles/input.css → web/static/css/styles.css. `make build`
now runs the CSS pass first; `make tailwind-watch` for dev. Output is
embedded in the binary via web.FS — single static binary, no Node.

The CSS source carries every component class the v1 mockups defined
(status dots, buttons, host row, log viewer, progress bar, fields,
chips, snippet panel, empty state) so screens that land later can
just reach for them.

P1-23: html/template tree at web/templates with two layouts (base
with chrome, chromeless for login + bootstrap), one nav partial, and
two pages (dashboard placeholder, login). internal/server/ui parses
the tree at startup; ui_handlers.go in the http package wires:

  GET  /         dashboard (303 → /login when unauthed)
  GET  /login    sign-in form
  POST /login    consume form, mint session cookie, 303 → /
  POST /logout   drop cookie, 303 → /login
  GET  /static/* embedded Tailwind bundle

The HTML login flow shares store/session logic with /api/auth/login
via a new authenticateAndSession helper — same security guarantees,
two surface representations (HTML form / JSON).

Verified end-to-end: bootstrap → form-login → authed dashboard →
sign-out → 303 cycle works in the browser; Tailwind output emits
only the component classes referenced in the live templates (9.6kB
minified).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 19:19:06 +01:00
steve 136e1a1d8f design: extend v1 to login / add-host / host-detail / job-log + lock components
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
Five hi-fi screens completing the Phase 1 surface, all in v1's dark
operator-console register.

  v1-login          Sparse centred card. Sign-in + first-error variant.
                    No marketing chrome; build version sits in footer
                    so a returning operator can spot agent drift.

  v1-add-host       Focused two-column page (form left, contextual
                    "what happens next" right) — not a modal. Two
                    states: form (state A) and minted-token result
                    with install command (state B). Backed by
                    POST /api/enrollment-tokens (P1-32).

  v1-host-detail    Persistent header (status dot, mono name, tags,
                    primary CTAs, vitals strip) over four sub-tabs
                    (Snapshots / Jobs / Repo / Settings). Snapshots
                    is the default — the thing 90% of operators
                    want when they click a host name. Right rail
                    holds Recent activity, run-now stack, and a
                    danger-zone panel.

  v1-job-log        WS-streamed log view. Three states: running (live
                    progress bar + auto-scroll cursor), succeeded
                    (summary stats + final lines), failed (error
                    panel + tail). Backed by WS /api/jobs/{id}/stream
                    (P1-21 remainder).

  v1-components     The load-bearing reference. 14 sections covering
                    tokens (colour + type scale), status, buttons,
                    form fields, tags, tabs, host row, log viewer,
                    progress bar, stat tile, modal, toast, install
                    snippet, empty-state pattern. Every CSS class is
                    real and copy-able into the Go template build.

This locks the visual register before P1-23 onwards. Each Phase 1
template gets a {{define}} matching a section in v1-components.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 19:05:39 +01:00
steve f9c2351ab6 design: v1 polish — row accents, wider last-backup col, empty state
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
- Single .host-row CSS rule replaces 13 inline grid-template-columns
  copies; column widths bumped so "backup running…" doesn't wrap.
- Faint left-edge accent for degraded / failed / offline rows so
  problem hosts are scannable without reading.
- Empty-state hero added: top-bar + nav still present (Dashboard
  active, others dimmed) but body collapses to a calm "no hosts yet"
  prompt with the install command as the load-bearing affordance.
  Prerequisite note keeps the deliberate "restic must already be
  installed" decision visible to first-time operators.

This is the artefact P1-23/24/27 will template against.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 18:48:15 +01:00
steve 81c7825937 design: three hi-fi dashboard directions for review
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
Three deliberately differentiated takes on the dashboard so we can
lock the visual register before the UI work starts (P1-23 onwards).

  v1 — Operator console (Linear/Datadog dark register).
       Dense table, monospace numerics, restrained colour, pulsing
       status dot only when a job is running. The natural fit for
       the audience and the most defensible choice.

  v2 — Editorial calm (Stripe/Notion light register).
       Serif hero headline that humanises the data, cards with
       breathing room in a 2-up grid, demoted "quiet hosts" strip,
       subtle rust accent. Reads as trustworthy infrastructure.

  v3 — Print spec (Tufte/aerospace monospace register).
       Pure monospace, near-monochrome, status as typeset glyphs
       (●▶▲○✗) so the screen survives greyscale. "Requires
       attention" block groups problem hosts at the top; activity
       tail reads like a real log. Most polarising; highest
       craft ceiling.

Each file is self-contained (Tailwind via CDN + Google Fonts) and
includes a philosophy preamble + the dashboard hero + a component
vocabulary section so we can read the system, not just one screen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 18:39:57 +01:00
steve b6cfa99413 agent: log accept/complete on backup jobs; audit: populate host.enrolled payload
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
Two warts surfaced during the smoke run:

- Agent was silent between "config.update applied" and "job
  finished" — operators tailing journalctl saw no acknowledgement
  that a command.run had landed. Adds Info logs at job-accept
  ({job_id, paths}) and at successful completion.

- The host.enrolled audit row had an empty {} payload. Now
  carries {hostname, os, arch, has_repo_creds} so an audit-log
  reader can answer "what got enrolled and did the operator
  bundle creds with the token" without joining back to hosts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 18:24:56 +01:00
steve 2418e585db fix: enrollment FK race + log-when-rejected; runbook fixes from dry-run
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
The smoke runbook caught a real bug: ConsumeEnrollmentToken was
inserting into host_credentials (FK -> hosts) inside the same tx as
the token burn, but the host row didn't exist yet — CreateHost
runs in the *next* statement. The agent saw a generic 401 with no
clue why.

Fix: drop the host_credentials insert from ConsumeEnrollmentToken;
the HTTP handler now does Consume -> CreateHost ->
SetHostCredentials. SetHostCredentials failure is logged loudly
but doesn't fail the enrol — operator recovers via PUT
/api/hosts/{id}/repo-credentials.

Adds slog.Warn lines on both 401 paths in handleAgentEnroll so the
underlying cause is visible in server logs (the wire response stays
generic to avoid leaking which step failed).

Test: TestEnrollmentTransfersRepoCreds rewritten to mirror the new
order (consume -> create host -> SetHostCredentials).

Runbook (docs/e2e-smoke.md): rest-server moved off 8000 (commonly
in use); URLs use trailing slash on the rest path; clarified that
secrets_key is minted on first agent start, not at enrol time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 14:01:59 +01:00
steve 5d1951ad94 P1-34: e2e smoke runbook + redacted GET /repo-credentials
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
Adds docs/e2e-smoke.md — a ~5-minute runbook that walks the full
P1 happy path against a sibling restic/rest-server: bootstrap
admin, mint token with repo creds, enrol an agent, watch the
config.update push land, run a backup, confirm the snapshot, edit
creds and watch the second push fire. Per the design discussion
this is a runbook (not a Go integration test); the Playwright
version lands in P5-06.

GET /api/hosts/{id}/repo-credentials returns the redacted view —
{repo_url, repo_username, has_password} — so the UI can pre-fill
the edit form without ever pulling the password out of the AEAD
blob.

Marks P1-32 / P1-33 / P1-34 done in tasks.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 13:49:34 +01:00
steve ec276dbc91 P1-33: agent-side encrypted secrets store + push-on-update
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
New internal/agent/secrets package: AEAD blob at
/var/lib/restic-manager/secrets.enc, atomic write (os.CreateTemp +
Sync + Rename), 0600. Key lives in agent.yaml as base64
(SecretsKey) — same trust boundary as the bearer token, minted on
first start via EnsureSecretsKey.

cmd/agent: dispatcher reads creds fresh from secrets.Load() on
each job rather than from in-memory config. config.update merges
the push with what's on disk and persists, so a daemon restart
keeps the latest values. Legacy plaintext repo_url/repo_password
in agent.yaml are silently migrated into secrets.enc on next start
and stripped from the YAML on the following save.

Tests: round-trip + wrong-key rejection + atomic-write
post-condition for secrets; key idempotence + legacy-field
parse/clear for config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 12:41:28 +01:00
steve 0ba56ed30d P1-32: server-side encrypted repo creds + push-on-hello
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
Operator-minted enrollment tokens now carry the repo URL/username/
password as one AEAD blob bound (via additional-data) to the token
hash. ConsumeEnrollmentToken re-encrypts under host_id and writes a
host_credentials row in the same tx as token-burn, so the binding
moves with the credential.
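
To illustrate the additional-data binding described above, here is a sketch using AES-256-GCM from the standard library. The choice of cipher, the nonce-prefixed blob layout, and the names seal, open and rebind are all assumptions; the commit only says "AEAD".

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext and binds it to aad; the random nonce is prepended.
func seal(key, plaintext, aad []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, aad), nil
}

// open fails unless the caller presents the same aad the blob was sealed with.
func open(key, blob, aad []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(blob) < gcm.NonceSize() {
		return nil, errors.New("blob too short")
	}
	nonce, ct := blob[:gcm.NonceSize()], blob[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, aad)
}

// rebind decrypts under the old binding and re-encrypts under the new one,
// e.g. from a token hash to a host_id at enrollment time.
func rebind(key, blob, oldAAD, newAAD []byte) ([]byte, error) {
	pt, err := open(key, blob, oldAAD)
	if err != nil {
		return nil, err
	}
	return seal(key, pt, newAAD)
}

func main() {
	key := make([]byte, 32) // demo key only; a real key must be random
	blob, _ := seal(key, []byte(`{"repo_password":"s3cret"}`), []byte("token:abc"))
	moved, _ := rebind(key, blob, []byte("token:abc"), []byte("host:42"))
	_, err := open(key, moved, []byte("host:43"))
	fmt.Println("wrong host rejected:", err != nil)
}
```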

PUT /api/hosts/{id}/repo-credentials lets an operator edit creds
post-enrollment; merges with the existing blob, audits, and pushes
config.update if the agent is connected.

WS handler grows an OnHello hook that the HTTP layer wires to send
the host's decrypted creds as a config.update immediately after the
hello succeeds — synchronously, so a racing command.run lands after
the agent has its repo password.

Schema: 0002_host_credentials.sql adds enc_repo_creds to
enrollment_tokens and a host_credentials table (PK = host_id, FK
ON DELETE CASCADE).

Tests: round-trip token → consume → host_credentials with AAD swap
detection; no-creds path stays compatible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 12:38:35 +01:00
steve e58917106d spec/tasks: pull repo-credential plumbing into Phase 1
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
Adds P1-32/33/34: encrypted repo creds carried on the enrollment token,
agent-side AEAD secrets file, end-to-end smoke. spec.md §4.2 and §7.3
rewritten to describe the full flow (server-issued at token time,
pushed via config.update on hello, persisted encrypted on the agent)
and to make the encrypted-file-now / OS-keyring-Phase-2 split
explicit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 12:32:53 +01:00
steve 6c9558c703 tasks: add P2-18 announce-and-approve, expand P1-27 with preconfigured installer
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
P2-18 captures the keypair + fingerprint-comparison enrollment flow
as a Phase 2 alternative to the token model. Includes guards
(rate limit, pending cap, hostname-collision flagging) and explicit
acceptance criteria.

P1-27 grows to mint encrypted repo creds alongside the token and
expose a one-click preconfigured-installer download from the
"Add host" form (cf. UrBackup Internet-mode push installer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 12:31:28 +01:00
steve 3904a78f14 P1-22: snapshot listing via restic snapshots --json
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
Agent calls restic snapshots --json after each successful backup
(60s timeout, separate from the backup ctx) and ships the projection
over the existing snapshots.report WS envelope. Failure here is
logged but doesn't fail the job — the next successful backup catches
the projection up.

Server-side ReplaceHostSnapshots is delete-then-insert plus a
hosts.snapshot_count update in one transaction so the dashboard's
per-host count stays consistent with the projection. New read
endpoint GET /api/hosts/{id}/snapshots returns the cached list with
a refreshed_at marker so the UI can show staleness when an agent
has been offline.

Schema: dropped the unused snapshots.repo_id FK (repos as a
first-class entity is P2 work), added short_id and refreshed_at
columns, switched the time index to DESC for the most-recent-first
list query. api.Snapshot gains short_id; size_bytes/file_count come
from the embedded summary block on restic 0.16+ and stay zero on
older clients.

Tests cover round-trip, authoritative replacement after forget+prune
shrinkage, and empty-after-wipe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 11:20:57 +01:00
steve 41a4043af3 server: drop in-process TLS — HTTP-only behind reverse proxy
Self-hosted deployments already terminate TLS at Caddy/Traefik/nginx;
making the server do TLS too means double cert config, dual ACME
plumbing, and an untested code path. Drop RM_TLS_CERT/RM_TLS_KEY,
remove TLSEnabled() and the ListenAndServeTLS branch.

Replace the cookie's "Secure if TLS-in-process" check with a new
RM_COOKIE_SECURE flag (default true). Local HTTP-only testing sets
RM_COOKIE_SECURE=false; production is always behind a TLS proxy and
the cookie stays Secure.
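
One way the RM_COOKIE_SECURE default could be handled is sketched below. The parsing rules (empty or unparseable means true) and the cookie name rm_session are assumptions for illustration, not the server's actual behaviour.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"strconv"
)

// cookieSecure defaults to true and only turns off on an explicit false value.
func cookieSecure(raw string) bool {
	if raw == "" {
		return true
	}
	v, err := strconv.ParseBool(raw)
	if err != nil {
		return true // fail closed on typos such as "flase"
	}
	return v
}

// sessionCookie builds a session cookie; the name is hypothetical.
func sessionCookie(value string, secure bool) *http.Cookie {
	return &http.Cookie{
		Name:     "rm_session",
		Value:    value,
		Path:     "/",
		HttpOnly: true,
		Secure:   secure,
		SameSite: http.SameSiteLaxMode,
	}
}

func main() {
	secure := cookieSecure(os.Getenv("RM_COOKIE_SECURE"))
	fmt.Println(sessionCookie("opaque-session-id", secure).String())
}
```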

Default port :8443 → :8080. docker-compose binds 127.0.0.1 only and
populates RM_TRUSTED_PROXY. spec.md §4.1/§10.1 rewritten with a
Caddyfile snippet and a hard "do not expose RM_LISTEN publicly"
warning. enrollResponse keeps cert_pin_sha256 in the shape but the
server can't introspect a cert it doesn't terminate — operator
pastes the proxy's hash into -cert-pin at install time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 11:20:41 +01:00
steve 77a305d064 tasks.md: mark Phase 1 progress
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
Captures the state landed in this session:

Done (P1-01..03, P1-05, P1-06, P1-08..16, P1-17..20, P1-29):
  HTTP server, store + schema, crypto, first-run bootstrap,
  every API type with wire-shape tests, WS transport,
  enrollment + hello + heartbeat round-trip, agent config +
  service unit + WS client + sysinfo, restic wrapper, job
  lifecycle store + run-now endpoint, agent runner.

Partial (P1-04, P1-07, P1-21, P1-31):
  CSRF middleware lives with the UI work; audit middleware
  sweep lives with rest of API; live job-log fan-out needs
  the per-job browser hub; signed agent binaries deferred to
  Phase 5.

Open (P1-22..28):
  Snapshot listing, full UI suite (login, dashboard, host
  detail, live job log, add-host, Tailwind build).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:46:16 +01:00
steve 95b49ecab9 phase 1: run-now backup — restic wrapper, job lifecycle, end-to-end
Lands the operator → server → agent → restic → server roundtrip for
on-demand backups. The flow:

  POST /api/hosts/{id}/jobs {kind:"backup",args:["/path"]}
    → server creates a queued Job row
    → server emits command.run over WS to the host's agent
    → agent dispatcher spawns runner.RunBackup in a goroutine
    → runner spawns `restic backup --json`, parses each line
    → forwards: job.started, log.stream (every line), job.progress
      (throttled to 1/sec), job.finished (with summary stats blob)
    → server WS handler persists those into jobs / job_logs

P1-16 internal/restic: thin Locate + Env wrapper that runs `restic
  backup --json`, scans stdout/stderr, parses BackupStatus +
  BackupSummary, calls back into a LineHandler so the agent can fan
  out to log.stream + job.progress. Treats exit code 3 as
  "succeeded with issues" (matches restic's contract).

P1-18 store: jobs accessors (CreateJob, MarkJobStarted,
  MarkJobFinished, AppendJobLog, GetJob).

P1-19 server: POST /api/hosts/{id}/jobs creates the Job row,
  validates kind, dispatches via Hub.Send, audit-logs the action.

P1-20 agent runner: wraps restic.RunBackup with throttled progress
  emission. Sender abstraction was added to wsclient.Handler so
  background goroutines can keep replying after dispatch returns.

P1-21 server WS: dispatchAgentMessage now persists job.started,
  job.finished, log.stream into the database. Browser fan-out for
  live tailing lands with the UI work.

Agent gets repo_url + repo_password from agent.yaml in plaintext
for now (mode 0600, owned by service user); the keyring storage
described in spec.md §7.3 lands in P2. config.update over WS
overrides the in-memory copy (does not persist).

Build clean; all tests pass. End-to-end with a real restic still
needs a host that has restic installed — wire shape verified by
the existing hello/heartbeat round-trip test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:45:04 +01:00
steve e8eccd20c2 phase 1: agent install path — systemd unit, install.sh, asset endpoints
P1-14 deploy/install/restic-manager-agent.service: standard systemd
  unit with the usual hardening switches (NoNewPrivileges, Protect*,
  RestrictRealtime, MemoryDenyWriteExecute). Restart=always with a
  5s backoff. Runs as a dedicated unprivileged restic-manager-agent
  user; the install script creates it.

P1-29 deploy/install/install.sh: arch detection (amd64/arm64), pulls
  the agent binary from /agent/binary, creates the service user
  + dirs (/etc/restic-manager, /var/lib/restic-manager), runs
  enrollment via `agent -enroll-server -enroll-token`, lays down
  the systemd unit, enables and starts it.

  Honours the spec's "detect, don't auto-disable" rule for existing
  schedulers: scans systemd timers, /etc/cron.d/*, /etc/cron.daily/*,
  root crontab for restic-named entries and prints them with the
  exact disable command — operator decides.

P1-31 server endpoints to ship the agent installation payload:
  GET /agent/binary?os=linux&arch=amd64 → serves
    <DataDir>/agent-binaries/restic-manager-agent-linux-amd64
  GET /install/<file>                   → serves
    <DataDir>/install/<file>
  Both endpoints reject path traversal and return 404 if the file
  isn't published. Operators drop the binaries + service unit into
  these directories at release time. Signed-bundle verification is
  deferred to Phase 5 OSS readiness.

All tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:40:36 +01:00
steve f34773b505 phase 1: WS transport, enrollment, agent that hellos and heartbeats
Lands the protocol layer end-to-end: an agent can be enrolled
through the operator UI, store credentials, dial back to the server
over WS, complete the protocol_version handshake, and stay
connected with periodic heartbeats.

Server side:
- P1-09 ws.Hub: one Conn per host_id, last-write-wins eviction,
  json envelope writer with a write mutex, reader, error envelopes.
- P1-09 ws.AgentHandler: bearer-auth, accept upgrade, hello-stage
  (10s deadline, protocol_version checked against
  api.MinAgentProtocolVersion → ErrProtocolTooOld with help URL on
  reject), main read loop, defer hub register/unregister.
- P1-10 POST /api/agents/enroll consumes a one-time token, mints a
  persistent agent bearer (sha-256 stored), creates a host row.
- P1-10 POST /api/enrollment-tokens (operator, session-auth)
  issues a 1h one-time token.
- P1-11 hello upserts agent_version + restic_version +
  protocol_version on the host row, flips status to online.
- P1-12 heartbeat touches last_seen_at; background sweeper marks
  hosts offline after 90s without one.
- store: hosts table accessors, host_schedule_version,
  enrollment_tokens FK on consumed_host dropped (audit-only field;
  the token gets burned before the host row exists).

Agent side:
- P1-13 internal/agent/config: yaml at /etc/restic-manager/agent.yaml,
  atomic Save (tmp+fsync+rename), Enrolled() helper.
- P1-15 internal/agent/wsclient: dial with bearer + optional
  TLS cert pinning (sha-256 of leaf), exponential backoff with
  jitter (1s → 60s cap), heartbeat goroutine, fatal handling for
  ErrProtocolTooOld.
- P1-15 wsclient.Enroll: HTTP POST /api/agents/enroll with sysinfo.
- P1-17 internal/agent/sysinfo: hostname/OS/arch/restic-version
  collection. restic detected by `restic version` parse; absent
  restic doesn't block startup.
- cmd/agent: -enroll-server / -enroll-token flags drive first-run
  enrollment then exit (so the install script can hand off to
  systemd to run the persistent service).

End-to-end smoke verified: bootstrap → login → issue token →
enroll → run agent → server logs `ws agent connected` with the
right host_id and protocol_version 1.

All tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:39:00 +01:00
steve 84fd31ccaa phase 1: HTTP server + first-run bootstrap
P1-01 chi router, slog request log, graceful shutdown via signal
  context. Health endpoint, /api/auth/login, /api/auth/logout,
  /api/bootstrap. Background sweeper for expired sessions and
  enrollment tokens (15 min cadence).

P1-04 (sessions half) HttpOnly Secure-when-TLS cookie carrying a
  base64url token; server stores SHA-256(token) so a stolen DB
  doesn't yield credentials. Unknown user and bad password collapse
  to the same 401 response code so a probe can't enumerate names.
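
The token handling in P1-04 can be illustrated with the standard library alone. This is a sketch under assumptions: the 32-byte token length, unpadded base64url, hex encoding of the digest, and the names `newToken` and `hashToken` are illustrative and not confirmed by the commit message, which only specifies a base64url token and a stored SHA-256.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

// newToken returns 32 random bytes encoded as unpadded base64url.
// This value goes into the cookie and is never stored server-side.
func newToken() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(buf), nil
}

// hashToken returns the hex SHA-256 of a token. Only this digest is
// persisted, so a copy of the database does not reveal usable tokens.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

func main() {
	tok, err := newToken()
	if err != nil {
		panic(err)
	}
	fmt.Println("cookie value:", tok)
	fmt.Println("stored hash: ", hashToken(tok))
}
```

On each request the server would hash the presented cookie value and look the digest up, so the plaintext token never needs to be compared directly.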

P1-05 first-run admin bootstrap. On a fresh DB the server mints a
  one-time token and prints it to stderr inside a banner. The
  /api/bootstrap handler accepts {token, username, password},
  creates the first admin, then becomes a 409 forever.

P1-07 (partial) audit hooks fire on auth.login and auth.bootstrap.
  Full middleware-driven coverage lands with the rest of the API.

internal/server/config: env > YAML > defaults. RM_LISTEN /
  RM_DATA_DIR / RM_BASE_URL / RM_TLS_CERT / RM_TLS_KEY /
  RM_SECRET_KEY_FILE / RM_TRUSTED_PROXY (CIDR list, validated).
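
The "env > YAML > defaults" precedence above might be resolved per setting as in this sketch. The `pick` helper and the `:8080` default are assumptions for illustration; only the precedence order and the RM_LISTEN variable name come from the commit message.

```go
package main

import (
	"fmt"
	"os"
)

// pick returns the value for one setting using env > YAML > default.
// lookupEnv is injected so the precedence can be exercised without
// touching the real process environment.
func pick(envKey, yamlVal, def string, lookupEnv func(string) (string, bool)) string {
	if v, ok := lookupEnv(envKey); ok && v != "" {
		return v
	}
	if yamlVal != "" {
		return yamlVal
	}
	return def
}

func main() {
	// yamlListen stands in for a value already parsed from the YAML file.
	yamlListen := ""
	listen := pick("RM_LISTEN", yamlListen, ":8080", os.LookupEnv)
	fmt.Println("listen:", listen)
}
```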

End-to-end smoke test passes: server boots on a fresh dir,
prints the bootstrap token, POST /api/bootstrap creates the admin,
POST /api/auth/login returns 200 with a session cookie.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:28:18 +01:00
steve c275f4ff4c phase 1 foundations: api types, store, crypto, auth
Lands the bottom three layers of Phase 1:

P1-08 internal/api: protocol_version + envelope + every WS message
  shape from spec.md §6.2 (Hello, Heartbeat, Job*, Schedule*, etc).
  Wire-format tests pin the JSON shape so a rename here breaks
  tests instead of silently breaking the agent.

P1-02 + P1-03 internal/store: SQLite via modernc.org/sqlite,
  embed.FS + a tiny version table for hand-rolled migrations.
  0001_initial.sql covers every table from spec.md §5 plus
  enrollment_tokens and host_schedule_version. Typed accessors
  for users / sessions / enrollment / audit. WAL + foreign_keys
  + busy_timeout on by default.

P1-06 internal/crypto: XChaCha20-Poly1305 AEAD wrapper with
  per-message random nonce. Key file lifecycle (generate +
  refuse-to-overwrite, load with size validation). Optional
  additionalData binds ciphertext to the row that owns it.

P1-04 internal/auth (partial — passwords + tokens; sessions
  middleware lands with the HTTP handlers): argon2id following
  RFC 9106 (64 MiB / t=3 / p=4 / 32B), constant-time verify.
  HashToken stores SHA-256 of session/agent/enrollment tokens
  so a stolen DB doesn't hand over credentials.

Build floor moves to Go 1.25 (modernc.org/sqlite v1.50+ requires
it); CI + Dockerfile + README updated. Markdown lint diagnostics
on tasks.md cleared.

All packages tested. ~70 new tests pass in <1s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:24:40 +01:00
steve 595546afb9 spec/tasks: address pre-Phase-1 design feedback
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
Doc-only changes captured before any Phase 1 code lands.

spec.md:
- §4.1 nhooyr.io/websocket → github.com/coder/websocket (the
  maintained fork; the original is unmaintained)
- §4.1 RM_LISTEN documented as source of truth for the bind port;
  add RM_TRUSTED_PROXY env var for X-Forwarded-* handling behind
  Caddy/Traefik
- §4.2 Phase 1 ships Linux only; Windows binaries continue to build
  in CI to keep the codebase portable, but service integration +
  installer move to Phase 2
- §4.2 self-update via apt/choco, not bespoke signed binaries
- §5 add Host.protocol_version + Host.applied_schedule_version
- §6.2 lock protocol_version handshake semantics (clean error on
  mismatch, not weird JSON parse failures)
- §6.2 schedule reconciliation when server unreachable: agent keeps
  firing last-known-good indefinitely; server's view canonical on
  reconnect; UI surfaces drift via applied_schedule_version
- §6.2 schedule.set carries schedule_version; new schedule.ack
  agent→server message
- §10.1 cross-reference RM_LISTEN ↔ compose port mapping
- §14.3 hooks rejected at validation on non-backup schedule kinds

tasks.md:
- P1-14 / P1-30 (Windows service + install.ps1) → Phase 2 as
  P2-16 / P2-17
- P1-29 install.sh detects existing restic timers/cron and prints
  disable commands, doesn't auto-disable
- Phase 1 acceptance: drop Windows from end-to-end criterion,
  require windows cross-compile in CI
- P4-01 rewritten: package-manager-based update delivery
- P5-08 removed (duplicate of P4-08 Prometheus /metrics)
- Various references updated

No Go code changes; build still clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:12:55 +01:00
steve c9368de904 phase 0: project bootstrap
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
P0-01 Go module + cmd/server + cmd/agent skeletons + internal/ tree
P0-02 LICENSE (PolyForm NC 1.0.0), README, CONTRIBUTING
P0-03 golangci-lint, pre-commit, .editorconfig, .gitignore
P0-04 Gitea Actions CI: test (race+coverage), lint, cross-platform build matrix
P0-05 Dockerfile.server (multi-stage, distroless/static), docker-compose.yml
P0-06 Makefile with build/test/lint/fmt/run/release targets

build, vet, test, and cross-compile to linux/{amd64,arm64} + windows/amd64
all verified locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:03:59 +01:00
steve 7612687a14 initial setup ready 2026-04-30 23:55:52 +01:00
228 changed files with 6771 additions and 21633 deletions
-32
@@ -1,32 +0,0 @@
<!--
Thanks for the PR! A few quick checks before submitting:
* Did you open an issue first for non-trivial changes?
* `make lint test` is green locally?
* Commits are focused (one logical change per commit)?
* No `Co-Authored-By` trailers (repo policy)?
* No new dependencies without a one-line justification below?
-->
## Summary
<!-- One paragraph: what changed and why. -->
## Test plan
<!-- Bullet list of what you actually ran. Be specific.
- `make test` → green
- Manually exercised the new flow at /hosts/{id}/foo
- Smoke env: enrolled a fresh host, ran a backup end-to-end
-->
## Notes for the reviewer
<!-- Anything the reviewer needs to know that isn't obvious from the
diff: related issue, follow-up work that's intentionally not
in this PR, deferred concerns, design alternatives considered
and rejected. -->
## Linked issues
<!-- "Closes #123" / "Refs #456" / "Part of P5-06" -->
-52
@@ -1,52 +0,0 @@
---
name: Bug report
about: Something isn't behaving the way the docs / code suggest it should
title: "[bug] "
labels: bug
---
## What happened
<!-- A clear description of the actual behaviour. Include the exact
UI surface, API endpoint, or CLI invocation involved. -->
## What you expected
<!-- What you thought would happen, and where that expectation came from
(docs page, command output, prior behaviour). -->
## Steps to reproduce
1.
2.
3.
## Environment
- restic-manager server version: <!-- `restic-manager-server --version` or footer of the UI -->
- Agent version (if relevant): <!-- `restic-manager-agent --version` -->
- restic version on affected host: <!-- `restic version` -->
- Host OS: <!-- e.g. "Ubuntu 22.04 amd64" or "Windows Server 2022" -->
- How was the server installed: <!-- docker compose / source build / other -->
## Logs / output
<details><summary>Server log (sanitised)</summary>
```
<!-- paste relevant lines; redact tokens, passwords, repo URLs -->
```
</details>
<details><summary>Agent log (sanitised)</summary>
```
```
</details>
## Anything else
<!-- Screenshots, related issues, recent changes you made before the
bug appeared, anything that might help. -->
-34
@@ -1,34 +0,0 @@
---
name: Feature request
about: Suggest a new capability or change to existing behaviour
title: "[feature] "
labels: enhancement
---
## What you're trying to do
<!-- Describe the use case, not the proposed solution. Who is the
operator, what are they trying to accomplish, and what's
blocking them today? -->
## Why the current behaviour falls short
<!-- What does the system do today, and where does it stop short of
the use case above? -->
## Proposed direction (optional)
<!-- If you have a specific design in mind, describe it. Skip this
section if you'd rather leave it to the maintainer. -->
## Scope check
- [ ] I've read [`spec.md`](../spec.md) §2 (Goals & Non-Goals).
- [ ] This isn't already on the roadmap in [`tasks.md`](../tasks.md).
- [ ] This fits the project's "small fleet, one person operating"
target rather than enterprise / multi-tenant / SaaS use cases.
## Anything else
<!-- Related restic features, prior art in similar tools, links to
discussions you've had elsewhere. -->
+39 -79
@@ -2,34 +2,28 @@
#
# Notes for anyone editing this file:
#
# Custom runner image
# Every job runs inside `gitea.dcglab.co.uk/steve/ci-runner-go`
# (recipe: https://gitea.dcglab.co.uk/steve/ci/src/branch/main/images/ci-runner-go).
# That image already ships:
# * Go on PATH at /usr/local/go/bin (so `actions/setup-go` is
# redundant and intentionally NOT used here — the action would
# otherwise re-download Go on every job)
# * Node.js + npm (used by docs / e2e workflows)
# * Docker CLI, Buildx, Compose v2 (used by docker-build steps)
# When bumping the Go floor, push a new ci-runner-go image with
# the matching Go version and bump the date pin in IMAGE below.
#
# Self-hosted runner expectations
# Each runner host bind-mounts persistent volumes for
# /root/go/pkg/mod (GOMODCACHE), /root/.cache/go-build (GOCACHE),
# and /root/.cache/act (action clones) into every job container —
# regardless of which image the container is built from. As a
# The Gitea runners are provisioned out-of-band (the infra team owns
# the script). Each runner host bind-mounts persistent volumes for
# /root/go/pkg/mod (GOMODCACHE), /root/.cache/go-build (GOCACHE), and
# /root/.cache/act (action clones) into every job container. As a
# result:
# * Common GitHub actions (actions/checkout, actions/upload-artifact,
# golangci/golangci-lint-action) are pre-cloned into
# /root/.cache/act on the runner, so the per-job
# "git clone https://github.com/actions/..." step is a fetch,
# not a full clone.
# * `cache: true` on actions/setup-go is intentionally OMITTED — the
# action would otherwise tar/untar GOMODCACHE+GOCACHE through the
# Gitea cache backend on every job, undoing the host-volume cache
# and adding ~10s of redundant zstd round-trip per job.
# * Common GitHub actions (actions/checkout, actions/setup-go,
# actions/upload-artifact, golangci/golangci-lint-action) are
# pre-cloned into /root/.cache/act on the runner, so the per-job
# "git clone https://github.com/actions/..." step is a fetch, not
# a full clone.
# * golangci-lint is pre-installed at /usr/local/bin/golangci-lint
# on the runner host BUT that's outside the job's filesystem
# view; the golangci-lint-action below pins a specific version
# and re-downloads — that's fine (deterministic CI > marginal
# speed).
# on the runner (latest v2.x). The golangci-lint-action below
# still pins a specific version and re-downloads — that's fine
# (deterministic CI > marginal speed) but means the host-installed
# binary is currently unused. Drop the `version:` arg below to
# use the host-installed one if you want to trade determinism
# for speed.
#
# Build matrix
# Linux amd64 + arm64 + Windows amd64. CGO_ENABLED=0 throughout —
@@ -38,10 +32,10 @@
# binaries.
#
# Go version
# Anchored by the ci-runner-go image (currently Go 1.25.7). Floor
# is set by the heaviest dep (modernc.org/sqlite v1.50+ requires
# Go 1.23+; we run 1.25 so golangci-lint's Go-version compatibility
# check is happy — see the version pin in the lint job).
# The GO_VERSION env var anchors all three jobs. Floor is set by the
# heaviest dep (modernc.org/sqlite v1.50+ requires Go 1.23+ today;
# we run 1.25 so golangci-lint's Go-version compatibility check is
# happy — see the version pin in the lint job).
#
# upload-artifact
# Pinned at v3 historically; v3 was deprecated upstream. v4 should
@@ -54,68 +48,35 @@ on:
pull_request:
branches: [main]
# Force bash as the default shell. With `container:` set on every
# job, Gitea Actions otherwise picks `sh -e` and our `set -euo
# pipefail` fails on dash with "Illegal option -o pipefail".
defaults:
run:
shell: bash
env:
GO_VERSION: "1.25"
jobs:
test:
# Sharded by package group. server/http and store are the two
# heavy packages (~156s and ~75s in CI respectively under
# `-race`); pulling them onto their own runners lets each shard
# have all CPUs to itself instead of CPU-starving each other on
# one runner. The third shard ("rest") covers everything else.
name: Test (${{ matrix.name }})
name: Test (linux/amd64)
runs-on: ubuntu-latest
container:
image: docker.dcglab.co.uk/ci-runner-go:2026-05-15
credentials:
username: ${{ secrets.ZOT_USERNAME }}
password: ${{ secrets.ZOT_PASSWORD }}
strategy:
fail-fast: false
matrix:
include:
- name: server-http
packages: ./internal/server/http/...
- name: store
packages: ./internal/store/...
- name: rest
# Computed at runtime — see the "go test" step below.
packages: ""
steps:
- uses: actions/checkout@v4
- uses: actions/setup-go@v5
with:
go-version: ${{ env.GO_VERSION }}
# cache: true intentionally omitted — see header notes.
- name: go vet
run: go vet ./...
- name: go test
run: |
set -euo pipefail
if [ -n "${{ matrix.packages }}" ]; then
pkgs="${{ matrix.packages }}"
else
# "rest" shard: everything except the dedicated shards.
pkgs=$(go list ./... \
| grep -v '/internal/server/http$' \
| grep -v '/internal/store$')
fi
# shellcheck disable=SC2086
go test -race -coverprofile=coverage.out $pkgs
run: go test -race -coverprofile=coverage.out ./...
- name: coverage summary
run: go tool cover -func=coverage.out | tail -1
lint:
name: Lint
runs-on: ubuntu-latest
container:
image: docker.dcglab.co.uk/ci-runner-go:2026-05-15
credentials:
username: ${{ secrets.ZOT_USERNAME }}
password: ${{ secrets.ZOT_PASSWORD }}
steps:
- uses: actions/checkout@v4
- uses: actions/setup-go@v5
with:
go-version: ${{ env.GO_VERSION }}
# cache: true intentionally omitted — see header notes.
- uses: golangci/golangci-lint-action@v7
with:
# Must be built against the same Go release as go.mod targets,
@@ -129,11 +90,6 @@ jobs:
build:
name: Build (${{ matrix.goos }}/${{ matrix.goarch }})
runs-on: ubuntu-latest
container:
image: docker.dcglab.co.uk/ci-runner-go:2026-05-15
credentials:
username: ${{ secrets.ZOT_USERNAME }}
password: ${{ secrets.ZOT_PASSWORD }}
strategy:
fail-fast: false
matrix:
@@ -147,6 +103,10 @@ jobs:
ext: ".exe"
steps:
- uses: actions/checkout@v4
- uses: actions/setup-go@v5
with:
go-version: ${{ env.GO_VERSION }}
# cache: true intentionally omitted — see header notes.
- name: build server + agent
env:
GOOS: ${{ matrix.goos }}
-133
@@ -1,133 +0,0 @@
# P5-06 — End-to-end test suite.
#
# Spec : docs/superpowers/specs/2026-05-07-p5-oss-readiness-design.md
# Stack: e2e/compose.e2e.yml (server + agent + rest-server + playwright)
# Tests: e2e/playwright/tests/*.spec.ts
#
# Triggered on every PR into main and on workflow_dispatch. Runs
# longer than the unit-test workflow (~3-4 minutes for a clean run);
# kept separate so a slow e2e doesn't block the fast lint/test loop.
#
# Networking note: every interaction with the server (health probe,
# Playwright) happens from a container on the compose `rmnet`
# network, addressing the server as `http://server:8080`. We can't
# rely on `127.0.0.1:8080` because Gitea's runner executes steps
# inside its own container, where compose's host port-publish is
# not visible.
name: e2e
on:
pull_request:
branches: [main]
workflow_dispatch:
# Force bash as the default shell — see ci.yml header.
defaults:
run:
shell: bash
jobs:
e2e:
name: Playwright vs docker-compose
runs-on: ubuntu-latest
container: gitea.dcglab.co.uk/steve/ci-runner-go:2026-05-08
timeout-minutes: 15
steps:
- uses: actions/checkout@v4
- name: Build the e2e stack
# --profile test pulls in the playwright service which is
# otherwise gated. --pull refreshes base images so a bump
# to the Dockerfile's FROM tag (e.g. mcr.microsoft.com/
# playwright:vX.Y.Z-jammy) isn't masked by a stale runner
# cache that still has the old tag's layers.
run: docker compose --profile test -f e2e/compose.e2e.yml build --pull
- name: Bring up the stack
run: docker compose -f e2e/compose.e2e.yml up -d server rest-server source-fixture
- name: Wait for server health
run: |
set -eu
for i in $(seq 1 30); do
if docker run --rm --network e2e_rmnet curlimages/curl:8.10.1 \
-fsS http://server:8080/api/version >/dev/null 2>&1; then
echo "server up"; exit 0
fi
sleep 2
done
echo "server didn't come up"; docker compose -f e2e/compose.e2e.yml logs server; exit 1
- name: Capture bootstrap token from server logs
id: bootstrap
run: |
set -eu
for i in $(seq 1 15); do
line=$(docker compose -f e2e/compose.e2e.yml logs server 2>&1 | grep -E 'bootstrap token' -A2 | grep -Eo '[a-zA-Z0-9_-]{40,}' | head -1 || true)
if [ -n "$line" ]; then
echo "RM_BOOTSTRAP_TOKEN=$line" >> "$GITHUB_ENV"
echo "got bootstrap token (${#line} chars)"
exit 0
fi
sleep 1
done
echo "bootstrap token not found in logs"
docker compose -f e2e/compose.e2e.yml logs server
exit 1
- name: Start the agent
run: docker compose -f e2e/compose.e2e.yml up -d agent
- name: Run Playwright tests
id: playwright
env:
RM_BOOTSTRAP_TOKEN: ${{ env.RM_BOOTSTRAP_TOKEN }}
# --name pins a stable container ID so the next step can
# docker cp out of it before tear-down. We deliberately
# drop --rm so the container survives the test exit; the
# tear-down step removes it.
run: docker compose -f e2e/compose.e2e.yml run --name e2e-pw playwright
- name: Extract Playwright report
if: always() && steps.playwright.outcome != 'skipped'
run: |
mkdir -p e2e/playwright/playwright-report e2e/playwright/test-results
docker cp e2e-pw:/work/playwright-report/. e2e/playwright/playwright-report/ || true
docker cp e2e-pw:/work/test-results/. e2e/playwright/test-results/ || true
- name: Show Playwright failure context (on failure)
if: failure()
run: |
set +e
shopt -s nullglob globstar
for f in e2e/playwright/test-results/**/error-context.md; do
echo "::group::$f"
cat "$f"
echo "::endgroup::"
done
echo "Failure attachments (download via the playwright-report artifact):"
find e2e/playwright/test-results \( -name '*.png' -o -name '*.webm' -o -name 'trace.zip' \) -printf ' %p\n' | sort
- name: Compose logs (on failure)
if: failure()
run: |
docker compose -f e2e/compose.e2e.yml logs --tail=200 server
docker compose -f e2e/compose.e2e.yml logs --tail=200 agent
docker compose -f e2e/compose.e2e.yml logs --tail=200 rest-server
- name: Upload Playwright report (on failure)
if: failure()
uses: actions/upload-artifact@v4
with:
name: playwright-report
path: |
e2e/playwright/playwright-report
e2e/playwright/test-results
retention-days: 7
- name: Tear down
if: always()
run: |
docker rm -f e2e-pw 2>/dev/null || true
docker compose -f e2e/compose.e2e.yml down -v
-111
@@ -1,111 +0,0 @@
# Release workflow — P5-03 (docker-only release path).
#
# Spec : docs/superpowers/specs/2026-05-05-p5-03-docker-only-release.md
# Plan : docs/superpowers/plans/2026-05-05-p5-03-docker-only-release.md
#
# What it does
# * Triggered by either:
# - tag push matching v[0-9]+.[0-9]+.[0-9]+ (real release), or
# - workflow_dispatch (snapshot iteration without tagging).
# * Cross-builds a multi-arch (linux/amd64,linux/arm64) image of the
# server, with three agent binaries (linux amd64+arm64, windows amd64)
# plus install.sh / install.ps1 / the systemd unit baked in under
# /opt/restic-manager/dist (the read-only fallback path the server
# handlers use when <DataDir>/... is empty).
# * Pushes to zot OCI registry (docker.dcglab.co.uk).
#
# Tag fan-out
# * tag push: :vX.Y.Z, :X.Y, :X
# * tag push and X >= 1: also :latest
# * workflow_dispatch: only :snapshot-<shortsha>; nothing else moves.
name: Release
on:
push:
tags:
- 'v[0-9]+.[0-9]+.[0-9]+'
workflow_dispatch:
env:
REGISTRY: docker.dcglab.co.uk
IMAGE_NAME: restic-manager
# Force bash as the default shell — see ci.yml header.
defaults:
run:
shell: bash
jobs:
image:
name: Build + push image
runs-on: ubuntu-latest
container:
image: docker.dcglab.co.uk/ci-runner-go:2026-05-15
credentials:
username: ${{ secrets.ZOT_USERNAME }}
password: ${{ secrets.ZOT_PASSWORD }}
steps:
- uses: actions/checkout@v4
- uses: docker/setup-qemu-action@v3
- uses: docker/setup-buildx-action@v3
- name: Log in to zot registry
uses: docker/login-action@v3
with:
registry: ${{ env.REGISTRY }}
username: ${{ secrets.ZOT_USERNAME }}
password: ${{ secrets.ZOT_PASSWORD }}
- name: Compute tags + version
id: meta
shell: bash
run: |
set -euo pipefail
REG="${REGISTRY}/${IMAGE_NAME}"
DATE="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
SHORT_SHA="${GITHUB_SHA::7}"
if [ "${GITHUB_EVENT_NAME}" = "push" ] && [ "${GITHUB_REF_TYPE}" = "tag" ]; then
TAG="${GITHUB_REF_NAME}" # vX.Y.Z
VER="${TAG#v}" # X.Y.Z
MAJOR="${VER%%.*}"
MINOR="${VER#${MAJOR}.}"; MINOR="${MINOR%%.*}"
TAGS="${REG}:${TAG}"
TAGS="${TAGS},${REG}:${MAJOR}.${MINOR}"
TAGS="${TAGS},${REG}:${MAJOR}"
# Pre-1.0 holds back :latest by design; operators must
# pin a version explicitly until v1.0.0.
if [ "${MAJOR}" -ge 1 ]; then
TAGS="${TAGS},${REG}:latest"
fi
VERSION="${TAG}"
else
TAGS="${REG}:snapshot-${SHORT_SHA}"
VERSION="0.0.0-snapshot-${SHORT_SHA}"
fi
{
echo "tags=${TAGS}"
echo "version=${VERSION}"
echo "date=${DATE}"
} >> "${GITHUB_OUTPUT}"
- name: Build + push
uses: docker/build-push-action@v6
with:
context: .
file: deploy/Dockerfile.server
platforms: linux/amd64,linux/arm64
push: true
tags: ${{ steps.meta.outputs.tags }}
build-args: |
VERSION=${{ steps.meta.outputs.version }}
COMMIT=${{ gitea.sha }}
DATE=${{ steps.meta.outputs.date }}
labels: |
org.opencontainers.image.version=${{ steps.meta.outputs.version }}
org.opencontainers.image.revision=${{ gitea.sha }}
org.opencontainers.image.created=${{ steps.meta.outputs.date }}
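
The tag fan-out rules from the release workflow above (vX.Y.Z, X.Y, X, and latest only once X >= 1) can be restated in Go as a sketch. This is an illustrative port of the shell logic, not code from the repository; `releaseTags` is a hypothetical name and the snapshot branch is left out.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// releaseTags returns the image references pushed for a vX.Y.Z tag:
// the full tag, X.Y, X, and latest only when the major version is >= 1.
func releaseTags(repo, tag string) ([]string, error) {
	ver := strings.TrimPrefix(tag, "v")
	parts := strings.Split(ver, ".")
	if len(parts) != 3 {
		return nil, fmt.Errorf("tag %q is not of the form vX.Y.Z", tag)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return nil, fmt.Errorf("tag %q has a non-numeric major version", tag)
	}
	tags := []string{
		repo + ":" + tag,
		repo + ":" + parts[0] + "." + parts[1],
		repo + ":" + parts[0],
	}
	if major >= 1 {
		// Pre-1.0 releases hold back latest, matching the workflow comment.
		tags = append(tags, repo+":latest")
	}
	return tags, nil
}

func main() {
	for _, tag := range []string{"v0.9.2", "v1.1.0"} {
		tags, err := releaseTags("docker.dcglab.co.uk/restic-manager", tag)
		fmt.Println(tag, "->", tags, err)
	}
}
```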
-17
@@ -2,10 +2,6 @@
/bin/
/dist/
# Generated mdBook output (source under docs/book/src is committed,
# the rendered book/ directory is not).
/docs/book/book/
# Local data / runtime state
/data/
/certs/
@@ -30,12 +26,6 @@ coverage.html
.env.local
*.local
# Local docker-compose for the dev/test bench. Has host-specific IPs,
# hostnames, and ports — never committed; the canonical reference
# deployment lives in deploy/.
/compose.yaml
/compose.override.yaml
# Local diagnostic helpers (never shipped). Go's build tooling already
# skips paths beginning with _ or ., but ignore explicitly so nothing
# checked in here can leak into a release tarball.
@@ -45,10 +35,3 @@ coverage.html
# tooling already skips paths starting with _, but ignore explicitly
# so an accidental `git add cmd/.` can't sneak them into a release.
/cmd/_*/
# Local-only planning / scratch — never committed.
/ask.md
/docs/superpowers/
# Claude Code agent worktrees (transient, harness-created).
/.claude/worktrees/
-127
@@ -1,127 +0,0 @@
# Changelog
All notable changes to this project are documented here.
The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and the project follows [Semantic Versioning](https://semver.org/).
## [Unreleased]
## [1.1.0] - 2026-06-15
### Added
- **Always-On vs intermittent host mode.** A host can now be marked as
not always-on — for laptops/workstations that legitimately sleep,
travel, or shut down outside hours. An intermittent host no longer
raises "agent offline" alerts when it disappears; instead it shows a
calm "asleep" state in the UI ("asleep · last seen … · will catch up
on return") and is covered by a longer-horizon staleness alert (raised
only when it has an enabled schedule and no successful backup in 7
days). When such a host reconnects, the server waits a short settle
window and then automatically dispatches any scheduled backup whose
window elapsed while it was asleep. Toggle per host from the host
detail page (operator-band, audited as `host.mode_updated`). New and
existing hosts default to always-on, so current fleets are unaffected.
### Changed
- Host-detail header redesign: tags and presence are grouped into
labelled, boxed pills with click-to-edit; presence shows a `24x7` /
`Free` chip; the agent "out of date" indicator is simplified (the full
version detail remains in the Agent-update panel and on hover).
- Relative timestamps ("2h ago") now tick client-side, so a tab left
open no longer shows a stale value as wall-clock time moves on.
- Release and CI container images are now published to and pulled from
the zot OCI registry (`docker.dcglab.co.uk`).
## [1.0.1] - 2026-05-09
### Fixed
- Build version is now single-sourced from `internal/version`, and the
server Dockerfile's ldflags were corrected so docker-built binaries
report their real version. Previously `internal/version.Version` stayed
at its "dev" default in docker images, which made every host look
permanently out-of-date to the update logic.
## [1.0.0] - 2026-05-09
First tagged release. Six development phases brought the project from
empty repo to a self-hostable, multi-tenant restic backup orchestrator
with a web UI, JSON API, and self-updating agent fleet.
### Phase 1 — MVP: enrolment, visibility, on-demand backup
- HTTP server, SQLite store with migrations, AEAD-encrypted
credentials at rest, Argon2id password hashing, session cookies.
- WebSocket transport between server and agents (heartbeat, hello,
schedule fan-out, job log streaming).
- Agent install path for Linux (systemd unit + `install.sh`); one-time
enrolment tokens with embedded repo credentials.
- Run-now backup execution end-to-end, snapshot listing.
- Server-side encrypted repo creds pushed to the agent on hello.
### Phase 2 — Scheduling, retention, repo operations
- Source groups (paths + excludes + pre/post hooks + bandwidth caps)
decoupled from schedules; a schedule fires a source group.
- Cron-style schedules with retention policies, server-driven
reconciliation push and ack.
- `restic forget`, `prune`, `check`, `unlock` automation; periodic
maintenance ticker with per-host stagger.
- Pending-runs queue with backpressure (`max_concurrent_jobs` per
host).
- Repo stats panel on the host detail page (size, last-check, last-
prune, stale-lock banner).
- Auto-init of repos on first onboard with credential-failure surface
on the host detail page.
- Announce-and-approve enrolment path for hosts that don't have a
pre-minted token (Ed25519 fingerprint, operator approves).
- Windows agent: SCM service integration + `install.ps1` installer.
- Cross-platform alt-enrolment (announce flow on Windows).
### Phase 3 — Restore, alerts, audit
- Restore wizard: pick a snapshot, pick paths, pick a target
(in-place / new directory), live progress.
- Snapshot diff against parent.
- Alert engine: per-source-group dedup, severity tiers, ack / resolve.
- Live-refresh alerts table with severity cues.
- Audit log UI with filters, sort, CSV export, payload-detail modal.
### Phase 4 — RBAC, OIDC, host tags
- Role-based access control: viewer / operator / admin.
- User management UI (invite, role change, disable, password reset).
- Generic OIDC SSO with JIT user provisioning + role mapping.
- Per-host tags with chip-row filter on the dashboard.
### Phase 5 — OSS readiness
- mdBook-rendered docs site at `docs/book/`.
- Contributor onboarding (CONTRIBUTING.md, security policy, license).
- Docker-only release pipeline + reference deployment compose file.
- Playwright e2e harness covering the smoke runbook.
### Phase 6 — Update delivery + observability
- Agent self-update: server-side channel pin per host, signed binary
fetch via the WS transport, atomic swap with rollback on failure.
- Fleet-wide update orchestration with per-host stagger and an admin
pause switch.
- Prometheus `/metrics` endpoint + Grafana dashboard JSON.
- Repo size trend per host (90-day rolling) on the host detail page.
### Cross-cutting
- Live dashboard with column sort, filters, free-text host search,
background-tab-aware live refresh (5s cadence).
- Pure-Go binary with embedded UI, no Node/CGO at runtime.
- Reproducible `-trimpath -ldflags="-s -w"` builds for
linux/amd64, linux/arm64, windows/amd64.
- Sharded CI (server-http / store / rest), pre-commit hooks (gofumpt,
go vet, golangci-lint).
- Threat model published (`docs/threat-model.md`).
[Unreleased]: https://gitea.dcglab.co.uk/steve/restic-manager/compare/v1.0.0...HEAD
[1.0.0]: https://gitea.dcglab.co.uk/steve/restic-manager/releases/tag/v1.0.0
+10 -40
@@ -2,19 +2,10 @@
Project-specific rules for Claude when working in this repo.
## Commands
If the user types any of the following commands, follow the instructions in the table.
| Command | Action |
| --- | --- |
| :release | Trigger a subagent to commit (if needed), push (if needed), raise a PR, and wait for the PR to pass or fail. If it fails, report back. If it passes, merge into main. |
## Repo
The repo lives inside a Gitea instance; the `tea` CLI is available for use by agents.
## Run `go vet` before every commit
CI runs `go vet ./...` and will fail the build on any vet error.
@@ -38,7 +29,7 @@ but the **agent** is fetched by the install script from the server's
**install script** are fetched from `<DataDir>/install/`. Plain
`make build` doesn't touch any of those — the source-of-truth files
in the working tree (`deploy/install/*`, `bin/restic-manager-agent`)
must be copied into `$HOME/smoke/data/...` *and* the running agent
must be copied into `/tmp/rm-smoke/data/...` *and* the running agent
on this dev host needs replacing if the change touches agent code or
the unit file.
@@ -53,13 +44,13 @@ asking the operator to test.**
```sh
# 1. Restage what the install script serves (binary + unit + script).
cp bin/restic-manager-agent \
$HOME/smoke/data/agent-binaries/restic-manager-agent-linux-amd64
/tmp/rm-smoke/data/agent-binaries/restic-manager-agent-linux-amd64
cp deploy/install/install.sh \
$HOME/smoke/data/install/install.sh
/tmp/rm-smoke/data/install/install.sh
cp deploy/install/install.ps1 \
$HOME/smoke/data/install/install.ps1
/tmp/rm-smoke/data/install/install.ps1
cp deploy/install/restic-manager-agent.service \
$HOME/smoke/data/install/restic-manager-agent.service
/tmp/rm-smoke/data/install/restic-manager-agent.service
# 2. Replace the running agent on this dev box and restart the
# service. Skip only when the change is server-side only AND
@@ -74,36 +65,15 @@ sudo -n systemctl restart restic-manager-agent
# 3. The server runs from the working tree; restart it manually
# after a build that touches server code:
pkill -f restic-manager-server
RM_LISTEN=:8080 RM_DATA_DIR=$HOME/smoke/data \
RM_LISTEN=:8080 RM_DATA_DIR=/tmp/rm-smoke/data \
RM_BASE_URL=http://127.0.0.1:8080 \
RM_SECRET_KEY_FILE=$HOME/smoke/data/secret.key \
RM_SECRET_KEY_FILE=/tmp/rm-smoke/data/secret.key \
RM_COOKIE_SECURE=false \
./bin/restic-manager-server >> $HOME/smoke/server.log 2>&1 &
./bin/restic-manager-server >> /tmp/rm-smoke/server.log 2>&1 &
```
## Smoke server: use the Make targets, not raw `nohup`
The smoke server runs as a transient `systemd --user` unit named
`restic-manager-smoke.service` so it survives any sandbox or
process-group boundary that would otherwise SIGTERM a backgrounded
process. Use the Make targets:
```
make smoke-restart # rebuild server + (re)launch as systemd --user unit
make smoke-status # systemctl --user status
make smoke-logs # tail $HOME/smoke/server.log
make smoke-stop # stop the unit
make smoke-deploy # full rebuild + restage agent assets + restart
```
`./bin/restic-manager-server &` from inside a Bash tool call gets
reaped when the tool exits — don't do that. If the unit fails to
start: `systemctl --user status restic-manager-smoke` and
`$HOME/smoke/server.log` have the diagnosis.
`smoke-deploy` does NOT touch `/usr/local/bin/restic-manager-agent`
on this dev box; if your change requires the live agent here to
update, run the agent restage block above by hand.
A `make smoke-deploy` target that bundles all of this would be a
good follow-up.
## Migrations: prefer column-level ALTERs over table rebuilds
-69
@@ -1,69 +0,0 @@
# Code of Conduct
restic-manager is a small project run by one person. This Code of
Conduct sets out the basic expectations for participating in the
project's issue tracker, pull requests, and any other community
spaces (chat, mailing lists) we may run in future.
## Expected behaviour
- **Be civil.** Disagreement is fine; rudeness is not. The same
comment can usually be made without making it personal.
- **Assume good faith.** People asking what feels like a basic
question may be new to the project. People proposing what feels
like a duplicate idea may not have seen the prior discussion.
Point them to the right place politely.
- **Stay on topic.** Issue threads are for the issue. Tangential
conversations belong in their own thread.
- **Acknowledge the project's scope.** restic-manager is
intentionally small in scope (see `spec.md` §2). Reasonable
feature suggestions may still be declined for fit reasons.
## Unacceptable behaviour
- Harassment, threats, or insults — public or private.
- Discriminatory comments based on age, body size, disability,
ethnicity, gender identity or expression, level of experience,
nationality, personal appearance, race, religion, sexual identity
or orientation.
- Sustained disruption — derailing threads, ignoring repeated
requests to take a discussion elsewhere, brigading.
- Publishing other people's private information without permission.
## Reporting
If someone in the project's spaces is behaving in a way that
breaches this Code of Conduct, contact the maintainer directly
through the contact details on their Gitea profile, or via the
private security disclosure path documented in
[SECURITY.md](./SECURITY.md). Reports stay confidential.
The maintainer will review the report, gather context if needed,
and respond. Possible outcomes include a private warning, a public
clarification of expectations, a temporary or permanent ban from
project spaces, or no action if the report doesn't hold up.
There is no formal appeals process — this is a one-person project,
not a foundation. If you think a decision was wrong you can say
so, in writing, to the maintainer; that's it.
## Scope
This Code of Conduct applies to interactions in any space the
project owns or operates: the Gitea repository (issues, pull
requests, discussions, wiki), any chat channels we publish, and
any conferences or events the project is officially represented at.
It does not apply to:
- Forks of the project that aren't being submitted back upstream.
- Conversations between contributors that don't reference the
project.
- Public criticism of the project itself.
## Acknowledgement
This document borrows shape and language from the
[Contributor Covenant](https://www.contributor-covenant.org/) v2.1
but is intentionally shorter and adapted to the project's
single-maintainer reality.
+21 -159
@@ -1,168 +1,30 @@
# Contributing to restic-manager
# Contributing
Thanks for your interest in restic-manager. This document covers how
to set up a development environment, the conventions the project
follows, and how patches make it from your machine into `main`.
Thanks for your interest in contributing to restic-manager.
## Project status and scope
> This is a placeholder. The project is in pre-alpha (Phase 1 / MVP). A
> full contributor guide will land alongside the Phase 5 OSS-readiness
> work — see [`tasks.md`](./tasks.md) P5-02. Until then the notes below
> apply.
restic-manager is in pre-1.0. Core functionality (Phases 0-4) is
landed; OSS-readiness polish is in progress. The top of
[`tasks.md`](./tasks.md) tracks what's next; [`spec.md`](./spec.md)
is the canonical design doc and the source of truth for any
"why is it built this way" question.
## Before opening a PR
The project is **single-maintainer, hobbyist-scale, and licensed
under [PolyForm Noncommercial 1.0.0](./LICENSE)**. That has two
practical implications:
1. Open an issue first for non-trivial changes — the design is still
moving (see [`spec.md`](./spec.md)) and unsolicited large PRs may
conflict with in-flight work.
2. `make lint test` should pass.
3. Match the existing code style — `gofumpt`, `goimports`, no comments
that just restate what the code does.
4. Keep commits focused; one logical change per commit.
1. Big PRs without prior discussion may be declined for fit
reasons even when they're correct — opening an issue first lets
us check alignment cheaply.
2. Commercial use is not permitted by the license. Bug reports and
patches from operators of personal/community deployments are
very welcome.
## Reporting security issues
## Getting started
### Prerequisites
- Go 1.25 or newer (`go.mod` is the source of truth)
- `make`
- For the front-end CSS bundle: nothing extra — `make build`
downloads a pinned `tailwindcss` standalone binary into `bin/`.
- For the docs site: nothing extra — `make docs` does the same trick
with `mdbook`.
- For end-to-end tests: Docker + Docker Compose, plus `npx` for
Playwright.
### One-time setup
```sh
git clone https://gitea.dcglab.co.uk/steve/restic-manager.git
cd restic-manager
make build # compiles bin/restic-manager-{server,agent}
make test # full unit + integration test sweep
make lint # gofumpt + goimports + golangci-lint
```
### Running locally
For most development, the [smoke environment](./docs/e2e-smoke.md)
is the path of least resistance:
```sh
make smoke-restart # rebuilds, launches as a systemd --user unit
make smoke-logs # tail of the server log
```
Then point a browser at `http://127.0.0.1:8080`. The first run
prints a one-time bootstrap token to the log; use it to create the
admin user.
## Code conventions
### Style
- `gofumpt` for formatting; `goimports` for import grouping.
Both run via the pre-commit hook in this repo.
- `golangci-lint` with `.golangci.yml` defaults; CI rejects on lint
errors.
- UK English in identifiers, comments, log messages, and UI strings
(the misspell linter is configured for the UK locale — see
P3-X5 for the original sweep).
- Comments explain **why**, not what; avoid restating the code.
A surprising invariant or an external constraint is worth
writing down. "Adds 1 to x" is not.
- `slog` for structured logs. Never log secrets — and especially
never the merged-creds rest-server URL (see [`CLAUDE.md`](./CLAUDE.md)).
### File and package layout
- `cmd/server` and `cmd/agent` are the two binary entry points.
- `internal/` holds everything that's not part of the public Go
API (which is none of it — restic-manager isn't a library).
- Per-feature packages live under `internal/server/...` for the
control plane and `internal/agent/...` for the agent.
- `web/templates/` are HTML templates rendered with the standard
library; embedded via `web.FS`.
### Tests
- Unit tests live alongside the code as `*_test.go`. Use the
in-process sqlite store (`store.Open(":memory:")`) when you need
state — there is no test mock layer to maintain.
- HTTP handlers test through `httptest.NewServer` against the real
router; see `internal/server/http/auth_test.go` for the canonical
fixture pattern.
- End-to-end tests live in `e2e/` and run against a Docker Compose
stack. See [`docs/e2e.md`](./docs/e2e.md).
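
As a rough illustration of the handler-testing pattern described above, the sketch below drives a router through `httptest.NewServer` and a real HTTP round trip. `newRouter`, `probe`, and the JSON body are hypothetical stand-ins; the canonical fixture lives in `internal/server/http/auth_test.go` and will differ.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newRouter is a hypothetical stand-in for the project's real router.
// The response body here is invented for the example.
func newRouter() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/version", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `{"version":"dev"}`)
	})
	return mux
}

// probe serves h on a loopback listener, issues one GET for path, and
// returns the status code and body. The server is closed before returning.
func probe(h http.Handler, path string) (int, string, error) {
	srv := httptest.NewServer(h)
	defer srv.Close()

	resp, err := srv.Client().Get(srv.URL + path)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	return resp.StatusCode, string(body), err
}

func main() {
	code, body, err := probe(newRouter(), "/api/version")
	fmt.Println(code, body, err)
}
```

Going through a real listener exercises routing, middleware, and response encoding together, which is the point of testing against the real router rather than calling handlers directly.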
### Database migrations
- Migrations are hand-rolled SQL in `internal/store/migrations/`
and embedded via `embed.FS`.
- Prefer column-level `ALTER TABLE` over rebuilds — see
[`CLAUDE.md`](./CLAUDE.md) "Migrations" section for the FK-cascade
trap that bit migration 0007's first draft.
## Workflow
### Before opening a PR
1. **Open an issue first** for non-trivial changes. The design is
still moving; an issue lets us agree on direction cheaply.
2. Run `make lint test` locally — both must pass.
3. Match existing code style (see above).
4. Keep commits focused: one logical change per commit. Imperative
subject lines, body explaining why if it isn't obvious.
5. Don't add `Co-Authored-By` trailers — repo policy. If you used
AI assistance in writing the patch, that's fine; we just don't
pollute every commit message with attribution boilerplate.
### Pull requests
PRs target `main`. CI runs lint + tests on Linux amd64/arm64 and
Windows amd64; all three must be green to merge. Squash-merge is
the default; the PR title becomes the merge-commit subject, so
keep it short and informative.
The PR template asks for:
- A short description of what changed and why.
- A test plan (commands run, scenarios verified).
- Anything reviewers need to know to assess the change (related
issue, follow-up work, deferred concerns).
### Reporting bugs
Open an issue with:
- restic-manager version (`server --version`) and agent version.
- restic version on the affected host.
- Steps to reproduce.
- Server and agent logs (sanitise any tokens before pasting).
Security-sensitive bugs go through the [SECURITY.md](./SECURITY.md)
disclosure path instead — please don't open a public issue for
them.
### Suggesting features
Open an issue describing the use case (not just the proposed
solution). The roadmap in `tasks.md` shows where the project is
heading; if the suggestion fits a future phase we'll wire it in
there. If it falls outside the project's scope (multi-tenancy, SaaS,
non-restic backends — see `spec.md` §2 non-goals) we'll say so
early to save your time.
## Code of conduct
Project participation is governed by [CODE_OF_CONDUCT.md](./CODE_OF_CONDUCT.md).
The short version: be civil; assume good faith; harassment is not
tolerated.
Please do **not** open a public issue for security problems. A
`SECURITY.md` with a private disclosure path will be added in Phase 5
(P5-05). Until then, contact the repository owner directly via the
contact details on their Gitea profile.
## License
By contributing you agree that your contributions are licensed
under the [PolyForm Noncommercial 1.0.0](./LICENSE) license.
By contributing you agree that your contributions are licensed under
the [PolyForm Noncommercial 1.0.0](./LICENSE) license.
+5 -91
@@ -5,15 +5,9 @@ BIN_DIR := bin
SERVER_BIN := $(BIN_DIR)/restic-manager-server
AGENT_BIN := $(BIN_DIR)/restic-manager-agent
VERSION ?= $(shell git describe --tags --always --dirty 2>/dev/null || echo dev)
COMMIT ?= $(shell git rev-parse HEAD 2>/dev/null || echo none)
DATE ?= $(shell date -u +%Y-%m-%dT%H:%M:%SZ)
VERSION_PKG := gitea.dcglab.co.uk/steve/restic-manager/internal/version
LDFLAGS := -s -w \
-X $(VERSION_PKG).Version=$(VERSION) \
-X $(VERSION_PKG).Commit=$(COMMIT) \
-X $(VERSION_PKG).Date=$(DATE)
LDFLAGS := -s -w -X main.version=$(VERSION)
GOFLAGS := -trimpath
DOCKER_IMAGE ?= gitea.dcglab.co.uk/steve/restic-manager
DOCKER_IMAGE ?= ghcr.io/dcglab/restic-manager
DOCKER_TAG ?= dev
# Tailwind standalone CLI — single binary, no Node toolchain.
@@ -26,29 +20,7 @@ TAILWIND_URL := https://github.com/tailwindlabs/tailwindcss/releases/downlo
TAILWIND_INPUT := web/styles/input.css
TAILWIND_OUTPUT := web/static/css/styles.css
# mdBook for the docs site (P5-01). Single static binary, no
# Rust toolchain — same pattern as Tailwind.
MDBOOK_VERSION ?= v0.4.51
MDBOOK_OS := $(shell uname -s | tr A-Z a-z)
MDBOOK_TRIPLE := $(shell uname -m)-unknown-$(if $(filter darwin,$(MDBOOK_OS)),apple-darwin,linux-gnu)
MDBOOK_BIN := $(BIN_DIR)/mdbook
MDBOOK_TARBALL := mdbook-$(MDBOOK_VERSION)-$(MDBOOK_TRIPLE).tar.gz
MDBOOK_URL := https://github.com/rust-lang/mdBook/releases/download/$(MDBOOK_VERSION)/$(MDBOOK_TARBALL)
DOCS_BOOK_DIR := docs/book
DOCS_BOOK_OUT := $(DOCS_BOOK_DIR)/book
.PHONY: help build server agent test test-race lint fmt tidy clean run-server run-agent docker release tailwind tailwind-watch docs docs-watch setup hooks smoke-restart smoke-stop smoke-status smoke-logs smoke-deploy
# ---- smoke-env tooling -------------------------------------------------
# The smoke server runs as a transient user-systemd unit so it survives
# bash-tool boundaries and reboots-of-the-shell. Use `make smoke-restart`
# any time you've rebuilt the server. `make smoke-deploy` is the full
# rebuild + restage + restart workflow described in CLAUDE.md.
SMOKE_UNIT := restic-manager-smoke
SMOKE_DATA_DIR := $(HOME)/smoke/data
SMOKE_LOG_FILE := $(HOME)/smoke/server.log
SMOKE_BASE_URL := http://127.0.0.1:8080
SMOKE_LISTEN := :8080
.PHONY: help build server agent test test-race lint fmt tidy clean run-server run-agent docker release tailwind tailwind-watch setup hooks
help:
@grep -E '^[a-zA-Z_-]+:.*?## ' $(MAKEFILE_LIST) | awk 'BEGIN{FS=":.*?## "};{printf " \033[36m%-14s\033[0m %s\n",$$1,$$2}'
@@ -73,18 +45,6 @@ tailwind-watch: $(TAILWIND_BIN) ## Watch and rebuild on every save
@mkdir -p $$(dirname $(TAILWIND_OUTPUT))
$(TAILWIND_BIN) -c tailwind.config.js -i $(TAILWIND_INPUT) -o $(TAILWIND_OUTPUT) --watch
$(MDBOOK_BIN):
@mkdir -p $(BIN_DIR)
@echo "==> downloading mdbook $(MDBOOK_VERSION) ($(MDBOOK_TRIPLE))"
curl -fsSL "$(MDBOOK_URL)" | tar -xz -C $(BIN_DIR) mdbook
@chmod +x $@
docs: $(MDBOOK_BIN) ## Build the docs/book/ mdBook site into docs/book/book/
$(MDBOOK_BIN) build $(DOCS_BOOK_DIR)
docs-watch: $(MDBOOK_BIN) ## Serve the docs site at http://127.0.0.1:3000 with live reload
$(MDBOOK_BIN) serve $(DOCS_BOOK_DIR) -n 127.0.0.1 -p 3000
agent: ## Build the agent binary
@mkdir -p $(BIN_DIR)
CGO_ENABLED=0 go build $(GOFLAGS) -ldflags "$(LDFLAGS)" -o $(AGENT_BIN) ./cmd/agent
@@ -115,7 +75,7 @@ tidy: ## go mod tidy
go mod tidy
clean: ## Remove build artifacts
rm -rf $(BIN_DIR) coverage.out coverage.html $(TAILWIND_OUTPUT) $(DOCS_BOOK_OUT)
rm -rf $(BIN_DIR) coverage.out coverage.html $(TAILWIND_OUTPUT)
run-server: server ## Build and run the server
$(SERVER_BIN)
@@ -124,53 +84,7 @@ run-agent: agent ## Build and run the agent
$(AGENT_BIN)
docker: ## Build the server Docker image
docker build -f deploy/Dockerfile.server \
--build-arg VERSION=$(VERSION) \
--build-arg COMMIT=$(COMMIT) \
--build-arg DATE=$(DATE) \
-t $(DOCKER_IMAGE):$(DOCKER_TAG) .
smoke-restart: server ## (Re)start the smoke server as a transient user-systemd unit
@systemctl --user reset-failed $(SMOKE_UNIT) >/dev/null 2>&1 || true
@systemctl --user stop $(SMOKE_UNIT) >/dev/null 2>&1 || true
@echo "==> launching $(SMOKE_UNIT)"
systemd-run --user --unit=$(SMOKE_UNIT) \
--setenv=RM_LISTEN=$(SMOKE_LISTEN) \
--setenv=RM_DATA_DIR=$(SMOKE_DATA_DIR) \
--setenv=RM_BASE_URL=$(SMOKE_BASE_URL) \
--setenv=RM_SECRET_KEY_FILE=$(SMOKE_DATA_DIR)/secret.key \
--setenv=RM_COOKIE_SECURE=false \
--property=StandardOutput=append:$(SMOKE_LOG_FILE) \
--property=StandardError=append:$(SMOKE_LOG_FILE) \
--property=Restart=on-failure \
$(PWD)/$(SERVER_BIN)
@for i in 1 2 3 4 5; do \
curl -fsS -o /dev/null $(SMOKE_BASE_URL)/api/version 2>/dev/null && \
{ echo "==> smoke server up: $$(curl -s $(SMOKE_BASE_URL)/api/version)"; exit 0; }; \
sleep 1; \
done; \
echo "!! smoke server did not respond on $(SMOKE_BASE_URL) — check $(SMOKE_LOG_FILE)" >&2; \
systemctl --user status --no-pager $(SMOKE_UNIT) || true; \
exit 1
smoke-stop: ## Stop the smoke server
systemctl --user stop $(SMOKE_UNIT) || true
@systemctl --user reset-failed $(SMOKE_UNIT) >/dev/null 2>&1 || true
smoke-status: ## Show status of the smoke server
@systemctl --user status --no-pager $(SMOKE_UNIT) 2>&1 | head -20 || true
smoke-logs: ## Tail the smoke server log
tail -50 $(SMOKE_LOG_FILE)
smoke-deploy: build smoke-restart ## Rebuild + restage agent into smoke + restart server (full per-CLAUDE.md cycle)
@echo "==> restaging agent + install assets into $(SMOKE_DATA_DIR)"
cp $(AGENT_BIN) $(SMOKE_DATA_DIR)/agent-binaries/restic-manager-agent-linux-amd64
cp deploy/install/install.sh $(SMOKE_DATA_DIR)/install/install.sh
cp deploy/install/install.ps1 $(SMOKE_DATA_DIR)/install/install.ps1
cp deploy/install/restic-manager-agent.service $(SMOKE_DATA_DIR)/install/restic-manager-agent.service
@echo "==> NOTE: this dev box's installed agent at /usr/local/bin/restic-manager-agent is NOT updated by this target."
@echo " Run the agent restage block in CLAUDE.md if your change touches agent code or the unit file."
docker build -f deploy/Dockerfile.server --build-arg VERSION=$(VERSION) -t $(DOCKER_IMAGE):$(DOCKER_TAG) .
release: ## Cross-compile for all supported platforms
@mkdir -p $(BIN_DIR)
+33 -91
@@ -1,62 +1,36 @@
# restic-manager
Self-hosted, browser-based, single-pane-of-glass for managing
[restic](https://restic.net) backups across a fleet of Linux and
Windows endpoints.
[restic](https://restic.net) backups across a fleet of Linux and Windows
endpoints.
> **Status:** pre-1.0, feature-complete for the original use
> case. Phases 0-4 + 6 are landed (MVP, scheduling, restore,
> RBAC + OIDC, observability); Phase 5 (OSS readiness — docs site,
> contributor onboarding, end-to-end CI) is in flight. See
> [`spec.md`](./spec.md) for the design and [`tasks.md`](./tasks.md)
> for the live roadmap.
> Status: pre-alpha. Phase 0 (project bootstrap) complete; Phase 1 (MVP) in
> progress. See [`spec.md`](./spec.md) for the design and
> [`tasks.md`](./tasks.md) for the roadmap.
## What it does
## What it does (target)
- Central visibility into backup state for every endpoint.
- Trigger any restic operation remotely (`backup`, `forget`,
`prune`, `check`, `unlock`, `snapshots`, `stats`, `diff`,
`restore`).
- Per-host schedules with named source groups + retention.
- Live job log streamed to the browser; downloadable as
text/NDJSON afterwards.
- Restore wizard: browse a snapshot's tree, pick paths, restore
in-place or to a new directory.
- Repo health surfacing (size, raw size, last check, lock state),
plus a 30/90-day repo-size trend.
- Alerting over webhook, ntfy, or SMTP.
- Cross-platform agent (Linux systemd + Windows SCM).
- Append-only-friendly: separate admin credential for prune.
- Optional Prometheus `/metrics` endpoint + sample Grafana
dashboard.
- Optional OIDC SSO (Authelia, Authentik, etc.).
- Central visibility into backup state for every endpoint
- Trigger any restic operation remotely (`backup`, `forget`, `prune`,
`check`, `unlock`, `snapshots`, `stats`, `diff`, `restore`)
- Manage per-host backup schedules from the UI
- Live job progress streamed back to the UI
- Restore wizard (browse snapshots, pick paths, restore to original or
alternate host)
- Repo health surfacing (size, dedup ratio, last check, lock state)
- Alerting on failure or staleness
- Cross-platform agent (Linux + Windows)
- Ransomware-resistant repo access via append-only credentials
## Screenshots
## Architecture (one-line summary)
| Sign in | Empty dashboard | Add host |
|:-------:|:---------------:|:--------:|
| ![Sign in](docs/screenshots/01-login.png) | ![Dashboard, fresh](docs/screenshots/02-dashboard-empty.png) | ![Add host](docs/screenshots/03-add-host.png) |
| Alerts | Settings | Audit log |
|:------:|:--------:|:---------:|
| ![Alerts](docs/screenshots/04-alerts.png) | ![Settings](docs/screenshots/05-settings.png) | ![Audit log](docs/screenshots/06-audit.png) |
(Screenshots from a fresh smoke install with no hosts. A populated
fleet view and the live-log + restore wizard surfaces are part of
the docs site under [`docs/book/`](./docs/book) — `make docs` to
render locally.)
## Architecture (one-line)
A small Go control-plane in Docker, lightweight Go agents on each
endpoint holding an outbound WebSocket to the control-plane, and
a restic repository (rest-server, S3, B2, SFTP — anything restic
speaks) that holds the actual backup data. **The control-plane
never touches backup bytes.**
A small Go control-plane on the Proxmox host, lightweight Go agents on each
endpoint that hold an outbound WebSocket to the control-plane, and a
`restic/rest-server` on Unraid that holds the actual backup data. The
control-plane never touches backup bytes.
Full architecture diagram and component breakdown:
[`spec.md` §3](./spec.md), or the rendered version in the
[docs site](./docs/book/src/concepts/architecture.md).
[`spec.md` §3](./spec.md).
## Repository layout
@@ -64,63 +38,31 @@ Full architecture diagram and component breakdown:
cmd/server/ control-plane binary
cmd/agent/ endpoint agent binary
internal/api shared API types (REST + WS envelopes)
internal/server/ HTTP, WS, UI handlers, alert engine
internal/server/ HTTP, WS, UI handlers
internal/agent/ service integration, restic runner, local scheduler
internal/restic restic CLI wrapper
internal/store SQLite persistence
internal/crypto secret encryption (AEAD)
internal/crypto secret encryption
internal/auth passwords, sessions, agent tokens
web/ server-rendered templates + static assets
deploy/ Dockerfile, docker-compose.yml, install scripts, Grafana dashboard
docs/ prose docs + the mdBook site under docs/book
e2e/ compose stack + Playwright tests for end-to-end CI
deploy/ Dockerfile, docker-compose.yml, install scripts
design/ UI wireframes (Phase 0 design pass)
```
## Quickstart
The reference deployment is a single Docker container fronted by
your existing reverse proxy. See the [installation guide](docs/book/src/getting-started/install.md)
for the full path; the very short version:
```sh
export RM_VERSION=v0.9.0 # pin a real tag
export RM_BASE_URL=https://restic.example.com
export RM_TRUSTED_PROXY=10.0.0.0/8
docker compose -f deploy/docker-compose.yml up -d
```
The server prints a one-time bootstrap token to the log on first
start. POST it to `/api/bootstrap` (or open `/bootstrap` in a
browser) to create the admin user.
## Local development
Requires Go 1.25+. The floor is set by `modernc.org/sqlite` v1.50.
Requires Go 1.25+ (built and tested on 1.26). The floor is set by
`modernc.org/sqlite` v1.50.
```sh
make build # builds cmd/server and cmd/agent into ./bin
make test # runs go test ./...
make lint # runs golangci-lint
make smoke-restart # systemd --user smoke server (see CLAUDE.md)
make docs # renders the mdBook site to docs/book/book/
make run-server # runs the server (dev defaults)
```
End-to-end test harness against a Docker Compose stack with a
sibling Linux agent: see [`docs/e2e.md`](docs/e2e.md). Runs in CI
on every PR.
## Documentation
- **Concepts and operator guides**: [docs site](docs/book/src/intro.md),
rendered with `make docs`.
- **Reverse-proxy setup**: [docs/reverse-proxy.md](docs/reverse-proxy.md).
- **Prometheus + Grafana**: [docs/prometheus.md](docs/prometheus.md).
- **End-to-end test harness**: [docs/e2e.md](docs/e2e.md).
- **Security policy**: [SECURITY.md](SECURITY.md).
- **Contributing**: [CONTRIBUTING.md](CONTRIBUTING.md).
## License
[PolyForm Noncommercial 1.0.0](./LICENSE). Free for personal,
hobby, research, educational, governmental, and other noncommercial
use. Commercial use requires a separate license.
PolyForm Noncommercial 1.0.0 — see [`LICENSE`](./LICENSE). Free for personal,
hobby, research, educational, governmental, and other noncommercial use.
Commercial use requires a separate license.
-137
@@ -1,137 +0,0 @@
# Security policy
restic-manager handles credentials that grant access to backup
repositories — losing them means an attacker can read or destroy a
fleet's backups. We take security reports seriously even at this
project's small scale.
## Supported versions
Pre-1.0, only the latest tagged release on `main` is supported.
Backporting fixes to older tags is not currently offered.
| Version | Supported |
|--------------------|----------------|
| `main` HEAD | Yes |
| Latest released tag| Yes |
| Anything older | No |
## Reporting a vulnerability
**Please don't open a public issue for security problems.**
Instead, use one of these private channels:
1. **Gitea private message** to the repository owner. The
instance is at <https://gitea.dcglab.co.uk> and the owner's
profile (`steve`) has direct-message contact set up.
2. **Email** to the address on the maintainer's Gitea profile.
Use a subject like `[SECURITY] restic-manager: <one-line summary>`
so it doesn't get lost. PGP optional — if you want to encrypt,
ask for a key first.
If you don't get an acknowledgement within **3 working days**,
please escalate through the other channel — solo maintainers do
miss things, and the goal here is to fix the problem, not to
preserve protocol.
### What to include
- A description of the issue and the impact (what does an attacker
gain? confidentiality, integrity, availability?).
- Affected component (server, agent, install script, docs).
- Affected version (`restic-manager-server --version`).
- Reproduction steps if you have them. A working PoC is welcome
but not required — a credible threat model is enough.
- Whether you intend to publish a writeup, and any timing
preferences.
### What we'll do
1. Acknowledge receipt within 3 working days.
2. Confirm or refute the issue, and agree a rough severity (CVSS
or just "this is bad / this isn't"). Asking clarifying
questions is normal at this stage — please don't read it as
foot-dragging.
3. Develop a fix on a private branch, test it, and prepare a
release.
4. Coordinate disclosure timing with you. The default is **30
days from confirmed report to public disclosure**, with a
patched release published before the disclosure date. Faster
if a workable PoC is already circulating; slower only by
mutual agreement.
5. Credit the reporter in the release notes (or omit the credit
if you'd rather stay anonymous — your choice).
## Scope
In scope:
- The server binary (`cmd/server`) and any HTTP, WebSocket, or CLI
surface it exposes.
- The agent binary (`cmd/agent`) and the way it consumes commands
from the server.
- The install scripts (`deploy/install/install.sh`, `install.ps1`)
and the systemd unit shipped with them.
- The docker-compose reference deployment and the docker image we
publish.
- Any cryptographic primitive choice or implementation detail
(AEAD, token hashing, session handling, OIDC handshake).
- Documentation that, if followed, leads operators into an
insecure configuration.
Out of scope (not because they aren't real problems, just not ones
this report channel can act on):
- Vulnerabilities in restic itself — report those upstream at
<https://github.com/restic/restic>.
- Vulnerabilities in third-party dependencies that haven't yet been
patched upstream — report upstream first.
- Issues that require pre-authenticated admin access on the control
plane (admins can already do everything; that's not a privilege
escalation, that's the design).
- DoS via resource exhaustion on a deployment without the
recommended reverse proxy / rate limiting in front (see
`docs/reverse-proxy.md`).
- Social-engineering scenarios that don't have a technical hook
into the project's own surfaces.
## Threat model summary
For context (longer version in [`spec.md`](./spec.md) §11):
- The server is **HTTP-only**; TLS termination, ACME, HSTS, and
edge rate-limiting are the reverse proxy's job.
- Credentials are encrypted at rest with an AEAD key loaded from
`RM_SECRET_KEY_FILE`. The same key encrypts agent secrets that
travel to the agent over the WS channel.
- Agents authenticate with bearer tokens issued at enrolment and
hashed at rest. Compromise of the server DB does **not** leak
bearer tokens in plaintext, but does leak the hashes (which is
enough to log in *as* the agent until the operator revokes —
see [NS-01 / NS-02](./tasks.md) for the revoke + regenerate
flows).
- The control plane intentionally **never touches backup bytes**:
the agent runs `restic` directly against the repo. A
compromised control plane can dispatch new jobs but cannot
exfiltrate snapshot contents in-band.
- Append-only credentials are first-class. Forget/prune jobs use a
separate, admin-marked credential that the server only pushes
for the duration of a maintenance dispatch.
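
The threat model above says credentials are encrypted at rest with an AEAD key but does not name the cipher. The following sketch assumes AES-256-GCM with a random nonce prepended to the ciphertext; `seal` and `open` are hypothetical helpers and are not taken from `internal/crypto`.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newAEAD builds an AES-GCM AEAD. A 32-byte key selects AES-256.
// The cipher choice is an assumption made for this sketch.
func newAEAD(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext and returns nonce || ciphertext || tag.
func seal(key, plaintext []byte) ([]byte, error) {
	aead, err := newAEAD(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Passing nonce as dst prepends it to the sealed output.
	return aead.Seal(nonce, nonce, plaintext, nil), nil
}

// open reverses seal and fails if the value was tampered with.
func open(key, sealed []byte) ([]byte, error) {
	aead, err := newAEAD(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < aead.NonceSize() {
		return nil, errors.New("sealed value too short")
	}
	nonce, ct := sealed[:aead.NonceSize()], sealed[aead.NonceSize():]
	return aead.Open(nil, nonce, ct, nil)
}

func main() {
	key := make([]byte, 32) // in practice, loaded from RM_SECRET_KEY_FILE
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	sealed, err := seal(key, []byte("repo-password"))
	if err != nil {
		panic(err)
	}
	plain, err := open(key, sealed)
	fmt.Println(string(plain), err)
}
```

Because decryption needs the key file, this also shows why the hardening checklist asks operators to back up `RM_SECRET_KEY_FILE` separately from the database.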
## Hardening checklist for operators
- Run behind a TLS-terminating reverse proxy (Caddy/nginx/Traefik).
- Set `RM_TRUSTED_PROXY` to the proxy's CIDR so request IPs aren't
spoofable.
- Back up `RM_SECRET_KEY_FILE` separately from the database.
Without it the encrypted creds are unrecoverable.
- Use append-only credentials for the everyday backup path; only
the optional admin credential should have write/forget/prune
power.
- Disable users (don't delete) when staff change roles — bearer
tokens stay valid until rotated.
- Watch the alert and audit-log views during enrolment of new
hosts.
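
The `RM_TRUSTED_PROXY` item can be illustrated with a short sketch of the usual rule: forwarded headers are believed only when the immediate peer sits inside a trusted CIDR. This is an assumption about the general technique, not the server's actual handler; `resolveClientIP` and `mustPrefixes` are hypothetical helpers, and taking the right-most `X-Forwarded-For` entry assumes a single proxy hop.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// mustPrefixes parses CIDR strings such as "172.16.0.0/12" and panics on
// malformed input, which is acceptable for startup-time configuration.
func mustPrefixes(cidrs ...string) []netip.Prefix {
	out := make([]netip.Prefix, 0, len(cidrs))
	for _, c := range cidrs {
		out = append(out, netip.MustParsePrefix(c))
	}
	return out
}

// resolveClientIP returns the address a request should be attributed to.
// remoteAddr is the TCP peer ("ip:port"); forwardedFor is the raw
// X-Forwarded-For header value, which may be empty.
func resolveClientIP(remoteAddr, forwardedFor string, trusted []netip.Prefix) netip.Addr {
	peer, err := netip.ParseAddrPort(remoteAddr)
	if err != nil {
		return netip.Addr{} // zero value: caller should reject the request
	}
	peerAddr := peer.Addr().Unmap()
	for _, p := range trusted {
		if !p.Contains(peerAddr) {
			continue
		}
		// The right-most entry is the one our own proxy appended.
		parts := strings.Split(forwardedFor, ",")
		last := strings.TrimSpace(parts[len(parts)-1])
		if fwd, err := netip.ParseAddr(last); err == nil {
			return fwd
		}
		break
	}
	// Untrusted peer, or no usable header: never believe the header.
	return peerAddr
}

func main() {
	trusted := mustPrefixes("172.16.0.0/12")
	// Peer is the proxy: the forwarded address is used.
	fmt.Println(resolveClientIP("172.18.0.2:41000", "203.0.113.7", trusted))
	// Peer is outside the trusted range: the header is ignored.
	fmt.Println(resolveClientIP("198.51.100.9:41000", "203.0.113.7", trusted))
}
```

With an empty or overly broad trusted list, either every request appears to come from the proxy or any client can claim an arbitrary address, so the CIDR should be as narrow as the deployment allows.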
Thanks for helping keep restic-manager users safe.
+8
@@ -0,0 +1,8 @@
# The ask!
I have numerous servers deployed out in a lab, mainly Linux but some Windows
All have restic installed on them
I need to build a browser-based management service that gives me a central single pane of glass to monitor and manage all the endpoints
All endpoints will be enabled for SSH (unless other methods are better?)
Plan out how we would go about this please?
+33 -82
@@ -22,9 +22,10 @@ import (
"gitea.dcglab.co.uk/steve/restic-manager/internal/agent/wsclient"
"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
"gitea.dcglab.co.uk/steve/restic-manager/internal/restic"
"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
)
var version = "dev"
func main() {
if err := run(); err != nil {
slog.Error("agent fatal", "err", err)
@@ -61,7 +62,7 @@ func run() error {
flag.Parse()
if *showVersion {
fmt.Printf("restic-manager-agent %s (commit %s, built %s)\n", version.Version, version.Commit, version.Date)
fmt.Println("restic-manager-agent", version)
return nil
}
@@ -77,14 +78,14 @@ func run() error {
if *enrollServer == "" {
return errors.New("enrollment: -enroll-server is required with -enroll-token")
}
return doEnroll(*enrollServer, *enrollToken, cfg, version.Version)
return doEnroll(*enrollServer, *enrollToken, cfg, version)
}
// Announce-and-approve: -enroll-server set, no token, agent not
// yet enrolled. Run the announce flow inline; on success the cfg
// has the bearer + host_id and we drop into the normal run loop.
if !cfg.Enrolled() && *enrollServer != "" {
if err := doAnnounce(*enrollServer, cfg, version.Version); err != nil {
if err := doAnnounce(*enrollServer, cfg, version); err != nil {
return fmt.Errorf("announce: %w", err)
}
}
@@ -101,7 +102,7 @@ func run() error {
return fmt.Errorf("sysinfo: %w", err)
}
slog.Info("agent starting",
"version", version.Version,
"version", version,
"host_id", cfg.HostID,
"server", cfg.ServerURL,
"restic_version", snap.ResticVersion,
@@ -110,12 +111,6 @@ func run() error {
resticBin, _ := restic.Locate(cfg.ResticPath) // empty is fine; commands fail with a clear error later
// Probe the actual restic binary for restore-flag support. We used
// to gate --no-ownership on a SemVer comparison (added in 0.17),
// but a restic 0.18.1 build was observed in the wild that still
// rejects the flag. The help text is the only reliable signal.
resticSupportsNoOwnership := restic.SupportsRestoreNoOwnership(ctx, resticBin)
// Open the secrets store. If the agent is enrolled but has no
// secrets key yet (legacy YAML), mint one and migrate any
// plaintext repo fields into the encrypted blob.
@@ -131,7 +126,7 @@ func run() error {
CertPinSHA256: cfg.CertPinSHA256,
HelloPayload: api.HelloPayload{
ProtocolVersion: snap.ProtocolVersion,
AgentVersion: version.Version,
AgentVersion: version,
ResticVersion: snap.ResticVersion,
Hostname: snap.Hostname,
OS: snap.OS,
@@ -140,12 +135,10 @@ func run() error {
}
d := &dispatcher{
resticBin: resticBin,
resticVer: snap.ResticVersion,
resticSupportsNoOwnership: resticSupportsNoOwnership,
serverURL: cfg.ServerURL,
secrets: sec,
scheduler: scheduler.New(),
resticBin: resticBin,
resticVer: snap.ResticVersion,
secrets: sec,
scheduler: scheduler.New(),
}
if err := wsclient.Run(ctx, wsCfg, d.handle); err != nil {
return fmt.Errorf("ws run: %w", err)
@@ -207,12 +200,10 @@ func openSecretsStore(cfg *config.Config) (*secrets.Store, error) {
// secrets store on each job — config.update writes through to disk,
// so a job dispatched in the same session sees the latest values.
type dispatcher struct {
resticBin string
resticVer string // e.g. "0.17.1"; empty if restic isn't installed yet
resticSupportsNoOwnership bool // captured at startup from `restic restore --help`
serverURL string // base URL of the server (used by the self-update fetch)
secrets *secrets.Store
scheduler *scheduler.Scheduler
resticBin string
resticVer string // e.g. "0.17.1"; empty if restic isn't installed yet
secrets *secrets.Store
scheduler *scheduler.Scheduler
// Bandwidth caps in KB/s pushed via config.update. Mutated under
// bwMu by the config.update handler; read by runJob when building
@@ -392,12 +383,10 @@ func (d *dispatcher) handle(ctx context.Context, env api.Envelope, tx wsclient.S
"up_kbps", up, "down_kbps", down)
}
case api.MsgCommandUpdate:
var p api.CommandUpdatePayload
if err := env.UnmarshalPayload(&p); err != nil {
return fmt.Errorf("command.update: %w", err)
}
go d.runUpdate(ctx, p, tx)
case api.MsgAgentUpdateAvail:
var p api.AgentUpdateAvailablePayload
_ = env.UnmarshalPayload(&p)
slog.Info("ws agent: update available", "version", p.LatestVersion, "url", p.PackageURL)
default:
slog.Debug("ws agent: ignored message", "type", env.Type)
@@ -471,47 +460,17 @@ func (d *dispatcher) handleTreeList(ctx context.Context, reqID string, p api.Tre
reply(api.TreeListResultPayload{Entries: apiEntries})
}
// failJob ships a synthetic job.started + job.finished(failed) pair
// for a command.run we couldn't even spawn locally — missing restic
// binary, missing credentials, or a malformed payload. Without these
// envelopes the server has no way to know the job will never produce
// output: the row sits in "running", the live stream stays stuck on
// "awaiting agent output," and a subsequent command.cancel arrives
// for a job_id the agent never registered (we log "unknown job"
// because trackJob was never called). Sending a terminal envelope
// here closes the loop on both fronts.
func failJob(p api.CommandRunPayload, tx wsclient.Sender, errMsg string) {
now := time.Now().UTC()
if startedEnv, err := api.Marshal(api.MsgJobStarted, p.JobID, api.JobStartedPayload{
JobID: p.JobID, Kind: p.Kind, StartedAt: now,
}); err == nil {
_ = tx.Send(startedEnv)
}
if finEnv, err := api.Marshal(api.MsgJobFinished, p.JobID, api.JobFinishedPayload{
JobID: p.JobID,
Status: api.JobFailed,
ExitCode: -1,
FinishedAt: now,
Error: errMsg,
}); err == nil {
_ = tx.Send(finEnv)
}
}
// runJob spawns a runner for one job. We launch a goroutine so the
// WS read loop keeps draining messages while restic chugs along.
func (d *dispatcher) runJob(ctx context.Context, p api.CommandRunPayload, tx wsclient.Sender) error {
if d.resticBin == "" {
failJob(p, tx, "restic binary not located on this agent")
return fmt.Errorf("restic binary not located on this agent")
}
creds, err := d.secrets.Load()
if err != nil {
failJob(p, tx, "load repo credentials: "+err.Error())
return fmt.Errorf("load repo credentials: %w", err)
}
if creds.Empty() {
failJob(p, tx, "repo credentials not configured (waiting for server config.update push)")
return fmt.Errorf("repo credentials not configured (waiting for server config.update push)")
}
// r is the everyday runner — bound to the host's repo
@@ -535,14 +494,13 @@ func (d *dispatcher) runJob(ctx context.Context, p api.CommandRunPayload, tx wsc
}
r := runner.New(runner.Config{
ResticBin: d.resticBin,
ResticVersion: d.resticVer,
RepoURL: creds.URL,
RepoUsername: creds.Username,
RepoPassword: creds.Password,
SupportsRestoreNoOwnership: d.resticSupportsNoOwnership,
LimitUploadKBps: upKBps,
LimitDownloadKBps: downKBps,
ResticBin: d.resticBin,
ResticVersion: d.resticVer,
RepoURL: creds.URL,
RepoUsername: creds.Username,
RepoPassword: creds.Password,
LimitUploadKBps: upKBps,
LimitDownloadKBps: downKBps,
}, tx, time.Second)
// spawn wraps the kind-specific goroutine: derives a per-job
@@ -598,7 +556,6 @@ func (d *dispatcher) runJob(ctx context.Context, p api.CommandRunPayload, tx wsc
// policy fallback was specced but skipped — see the
// Phase 5 plan rationale and version.go's lockstep-deploy
// note for why.
failJob(p, tx, "forget: command.run carried no forget_groups (server didn't populate them)")
return fmt.Errorf("forget: command.run carried no forget_groups (server didn't populate them)")
}
groups := make([]restic.ForgetGroup, 0, len(p.ForgetGroups))
@@ -633,14 +590,13 @@ func (d *dispatcher) runJob(ctx context.Context, p api.CommandRunPayload, tx wsc
runCreds = ac
}
prr := runner.New(runner.Config{
ResticBin: d.resticBin,
ResticVersion: d.resticVer,
RepoURL: runCreds.URL,
RepoUsername: runCreds.Username,
RepoPassword: runCreds.Password,
SupportsRestoreNoOwnership: d.resticSupportsNoOwnership,
LimitUploadKBps: upKBps,
LimitDownloadKBps: downKBps,
ResticBin: d.resticBin,
ResticVersion: d.resticVer,
RepoURL: runCreds.URL,
RepoUsername: runCreds.Username,
RepoPassword: runCreds.Password,
LimitUploadKBps: upKBps,
LimitDownloadKBps: downKBps,
}, tx, time.Second)
slog.Info("agent: accepting prune job", "job_id", p.JobID, "admin_creds", p.RequiresAdminCreds)
spawn("prune", func(jobCtx context.Context) error {
@@ -662,16 +618,13 @@ func (d *dispatcher) runJob(ctx context.Context, p api.CommandRunPayload, tx wsc
})
case api.JobRestore:
if p.Restore == nil {
failJob(p, tx, "restore: command.run carried no restore payload")
return fmt.Errorf("restore: command.run carried no restore payload")
}
rp := *p.Restore
if rp.SnapshotID == "" {
failJob(p, tx, "restore: snapshot_id is required")
return fmt.Errorf("restore: snapshot_id is required")
}
if !rp.InPlace && rp.TargetDir == "" {
failJob(p, tx, "restore: target_dir required for non-in-place restore")
return fmt.Errorf("restore: target_dir required for non-in-place restore")
}
slog.Info("agent: accepting restore job",
@@ -682,7 +635,6 @@ func (d *dispatcher) runJob(ctx context.Context, p api.CommandRunPayload, tx wsc
})
case api.JobDiff:
if p.Diff == nil || p.Diff.SnapshotA == "" || p.Diff.SnapshotB == "" {
failJob(p, tx, "diff: command.run carried incomplete diff payload")
return fmt.Errorf("diff: command.run carried incomplete diff payload")
}
dp := *p.Diff
@@ -692,7 +644,6 @@ func (d *dispatcher) runJob(ctx context.Context, p api.CommandRunPayload, tx wsc
return r.RunDiff(jobCtx, p.JobID, dp.SnapshotA, dp.SnapshotB)
})
default:
failJob(p, tx, fmt.Sprintf("kind %q not implemented on this agent", p.Kind))
return fmt.Errorf("kind %q not implemented yet (Phase 2 lands the rest)", p.Kind)
}
return nil
-65
@@ -1,65 +0,0 @@
package main
import (
"context"
"fmt"
"log/slog"
"time"
"gitea.dcglab.co.uk/steve/restic-manager/internal/agent/updater"
"gitea.dcglab.co.uk/steve/restic-manager/internal/agent/wsclient"
"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
)
// runUpdate handles a server-dispatched command.update. It logs progress
// via log.stream so the live job page captures pre-restart state, then
// calls the platform updater. On Linux the updater calls os.Exit; on
// Windows it spawns a detached helper and returns, with the agent then
// exiting.
//
// The terminal job state is set by the server, not the agent: success
// is "agent re-hellos with matching version" rather than anything the
// agent itself can assert. The only `job.finished` we send from here is
// on the failure path, before any restart attempt.
func (d *dispatcher) runUpdate(ctx context.Context, p api.CommandUpdatePayload, tx wsclient.Sender) {
logf := func(format string, args ...any) {
line := fmt.Sprintf(format, args...)
slog.Info("ws agent: update: " + line)
env, err := api.Marshal(api.MsgLogStream, "", api.LogStreamLine{
JobID: p.JobID,
TS: time.Now().UTC(),
Stream: api.LogStdout,
Payload: line,
})
if err == nil {
_ = tx.Send(env)
}
}
startedEnv, err := api.Marshal(api.MsgJobStarted, "", api.JobStartedPayload{
JobID: p.JobID,
Kind: api.JobUpdate,
StartedAt: time.Now().UTC(),
})
if err == nil {
_ = tx.Send(startedEnv)
}
logf("fetching new binary from %s", d.serverURL)
if err := updater.Update(ctx, d.serverURL); err != nil {
logf("update failed: %v", err)
finishedEnv, mErr := api.Marshal(api.MsgJobFinished, "", api.JobFinishedPayload{
JobID: p.JobID,
Status: api.JobFailed,
FinishedAt: time.Now().UTC(),
Error: err.Error(),
})
if mErr == nil {
_ = tx.Send(finishedEnv)
}
return
}
// Unreachable on Linux (Update calls os.Exit). On Windows control
// returns here while the detached helper does the swap-and-restart;
// the agent then exits cleanly so SCM hands off.
}
+7 -52
@@ -9,7 +9,6 @@ import (
"os"
"os/signal"
"path/filepath"
"strings"
"syscall"
"time"
@@ -18,17 +17,15 @@ import (
"gitea.dcglab.co.uk/steve/restic-manager/internal/crypto"
"gitea.dcglab.co.uk/steve/restic-manager/internal/notification"
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/config"
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/fleetupdate"
rmhttp "gitea.dcglab.co.uk/steve/restic-manager/internal/server/http"
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/maintenance"
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/oidc"
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ui"
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ws"
"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
)
var version = "dev"
func main() {
if err := run(); err != nil {
slog.Error("server fatal", "err", err)
@@ -42,7 +39,7 @@ func run() error {
flag.Parse()
if *showVersion {
fmt.Printf("restic-manager-server %s (commit %s, built %s)\n", version.Version, version.Commit, version.Date)
fmt.Println("restic-manager-server", version)
return nil
}
@@ -86,28 +83,15 @@ func run() error {
hub := ws.NewHub()
jobHub := ws.NewJobHub()
metricsRegistry := metrics.NewRegistry()
notifHub := notification.NewHub(st, aead, cfg.BaseURL)
alertEngine := alert.NewEngine(st, notifHub)
updateWatcher := ws.NewUpdateWatcher(st, alertEngine, jobHub)
renderer, err := ui.New()
if err != nil {
return fmt.Errorf("ui: %w", err)
}
var oidcClient *oidc.Client
if cfg.OIDC != nil {
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
oidcClient, err = oidc.New(ctx, cfg.OIDC, cfg.BaseURL)
if err != nil {
return fmt.Errorf("oidc: %w", err)
}
slog.Info("oidc enabled", "issuer", cfg.OIDC.Issuer, "display", cfg.OIDC.DisplayName)
}
deps := rmhttp.Deps{
Cfg: cfg,
Store: st,
@@ -116,11 +100,8 @@ func run() error {
JobHub: jobHub,
AlertEngine: alertEngine,
NotificationHub: notifHub,
UpdateWatcher: updateWatcher,
UI: renderer,
Version: version.Version,
OIDC: oidcClient,
Metrics: metricsRegistry,
Version: version,
}
// First-run bootstrap: if the users table is empty, mint a one-time
@@ -141,38 +122,22 @@ func run() error {
// text exactly once; we hash it into BootstrapToken on the
// server-side handler.
fmt.Fprintln(os.Stderr, "================================================================")
fmt.Fprintln(os.Stderr, " FIRST RUN — no admin user exists yet.")
if cfg.BaseURL != "" {
fmt.Fprintln(os.Stderr, " Open this URL in a browser to create the first administrator:")
fmt.Fprintln(os.Stderr, " "+strings.TrimRight(cfg.BaseURL, "/")+"/bootstrap")
} else {
fmt.Fprintln(os.Stderr, " Open the server URL in a browser; you'll be sent to /bootstrap.")
fmt.Fprintln(os.Stderr, " (Set RM_BASE_URL to have a clickable link printed here.)")
}
fmt.Fprintln(os.Stderr, "")
fmt.Fprintln(os.Stderr, " Headless? POST {token, username, password} to /api/bootstrap")
fmt.Fprintln(os.Stderr, " with this one-shot bootstrap token (valid until first user exists):")
fmt.Fprintln(os.Stderr, " FIRST RUN — bootstrap token (use within 1 hour, then it's gone):")
fmt.Fprintln(os.Stderr, " "+token)
fmt.Fprintln(os.Stderr, " POST it to /api/bootstrap with {token, username, password}.")
fmt.Fprintln(os.Stderr, "================================================================")
}
srv := rmhttp.New(deps)
// Fleet-update worker — built after the HTTP server because the
// dispatcher delegates back into srv.DispatchHostUpdate.
fleetWorker := fleetupdate.NewWorker(st, hub,
&serverDispatcher{srv: srv}, alertEngine)
srv.SetFleetWorker(fleetWorker)
ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
defer stop()
go alertEngine.Run(ctx)
go updateWatcher.Run(ctx)
errCh := make(chan error, 1)
go func() {
slog.Info("server listening", "addr", cfg.Listen, "version", version.Version)
slog.Info("server listening", "addr", cfg.Listen, "version", version)
errCh <- srv.Start()
}()
@@ -227,7 +192,6 @@ func run() error {
}
case <-pendingDrainTick.C:
srv.DrainAllDue(ctx)
srv.RunCatchupsDue(ctx)
case <-pendingExpiryTick.C:
if n, err := st.DeleteExpiredPendingHosts(ctx, time.Now().UTC()); err == nil && n > 0 {
slog.Info("expired pending hosts swept", "n", n)
@@ -262,12 +226,3 @@ func run() error {
}
return nil
}
// serverDispatcher adapts the http.Server's DispatchHostUpdate method
// to the fleetupdate.Dispatcher interface. Lives in main so the
// http and fleetupdate packages don't need to know about each other.
type serverDispatcher struct{ srv *rmhttp.Server }
func (d *serverDispatcher) DispatchUpdate(ctx context.Context, hostID, actorUserID string) (string, string, error) {
return d.srv.DispatchHostUpdate(ctx, hostID, actorUserID)
}
+7 -61
@@ -1,17 +1,14 @@
# syntax=docker/dockerfile:1.7
# ---- Build stage --------------------------------------------------------
# Cross-compiles:
# * the server binary for the image's TARGETARCH (linux/amd64 or arm64),
# * three agent binaries (linux/amd64, linux/arm64, windows/amd64) that
# the running server hands out via /agent/binary.
# Pure-Go SQLite (modernc.org/sqlite) means CGO stays off; static binaries
# run on distroless/static.
FROM --platform=$BUILDPLATFORM golang:1.25-alpine AS build
FROM golang:1.25-alpine AS build
WORKDIR /src
# Pure-Go SQLite (modernc.org/sqlite) means we can keep CGO off and build a
# fully static binary that runs on distroless/static.
ENV CGO_ENABLED=0 \
GOOS=linux \
GOFLAGS="-trimpath"
# Cache module downloads in a separate layer.
@@ -21,45 +18,9 @@ RUN go mod download
COPY . .
ARG VERSION=dev
ARG COMMIT=none
ARG DATE=unknown
ARG TARGETOS
ARG TARGETARCH
ENV VERSION_PKG="gitea.dcglab.co.uk/steve/restic-manager/internal/version"
ENV LDFLAGS="-s -w \
-X ${VERSION_PKG}.Version=${VERSION} \
-X ${VERSION_PKG}.Commit=${COMMIT} \
-X ${VERSION_PKG}.Date=${DATE}"
# Server: built for the image's runtime arch.
RUN GOOS=${TARGETOS} GOARCH=${TARGETARCH} \
go build -ldflags="${LDFLAGS}" \
-o /out/restic-manager-server \
./cmd/server
# Empty /data skeleton so the runtime image carries an existing,
# nonroot-owned mount point. Docker copies that ownership onto a
# named volume the first time it's created, which avoids the
# "permission denied" trap on /data/secret.key when the operator
# uses a default `volumes: { rm-data: {} }` declaration.
RUN mkdir -p /out/data
# Agents: identical across image arches — an arm64 server image still
# ships an amd64 agent binary for amd64 endpoints to download.
RUN mkdir -p /out/agent-binaries && \
GOOS=linux GOARCH=amd64 \
go build -ldflags="${LDFLAGS}" \
-o /out/agent-binaries/restic-manager-agent-linux-amd64 \
./cmd/agent && \
GOOS=linux GOARCH=arm64 \
go build -ldflags="${LDFLAGS}" \
-o /out/agent-binaries/restic-manager-agent-linux-arm64 \
./cmd/agent && \
GOOS=windows GOARCH=amd64 \
go build -ldflags="${LDFLAGS}" \
-o /out/agent-binaries/restic-manager-agent-windows-amd64.exe \
./cmd/agent
RUN go build -ldflags="-s -w -X main.version=${VERSION}" \
-o /out/restic-manager-server \
./cmd/server
# ---- Runtime stage ------------------------------------------------------
FROM gcr.io/distroless/static-debian12:nonroot
@@ -70,22 +31,7 @@ LABEL org.opencontainers.image.licenses="PolyForm-Noncommercial-1.0.0"
USER nonroot:nonroot
WORKDIR /
# Server binary on PATH.
COPY --from=build /out/restic-manager-server /usr/local/bin/restic-manager-server
# Image-baked bundled assets (P5-03). Read-only; the /agent/binary and
# /install/* handlers fall back here when <DataDir>/... is empty, so a
# fresh container Just Works without first-run staging. Operators can
# still drop a custom build under <DataDir>/agent-binaries/<name> to
# override per-host.
COPY --from=build --chmod=0755 /out/agent-binaries/ /opt/restic-manager/dist/agent-binaries/
COPY --chmod=0755 deploy/install/install.sh /opt/restic-manager/dist/install/install.sh
COPY --chmod=0644 deploy/install/install.ps1 /opt/restic-manager/dist/install/install.ps1
COPY --chmod=0644 deploy/install/restic-manager-agent.service /opt/restic-manager/dist/install/restic-manager-agent.service
# Pre-created data dir owned by nonroot so a fresh named volume
# inherits the right ownership.
COPY --from=build --chown=nonroot:nonroot /out/data /data
EXPOSE 8443
ENTRYPOINT ["/usr/local/bin/restic-manager-server"]
+9 -40
@@ -1,52 +1,21 @@
# Reference deployment for the restic-manager control plane.
# Mirrors spec.md §10.1 and the P5-07 reference deployment.
# Mirrors spec.md §10.1. Adjust image tag and RM_BASE_URL for your env.
#
# Scope: this compose stands up the server only. TLS termination and
# the public hostname belong to a reverse proxy that lives outside
# this stack (Caddy, Traefik, nginx, HAProxy, your existing edge —
# whatever you already operate). See `docs/reverse-proxy.md` for the
# headers + CIDRs that proxy needs to forward.
#
# Architecture:
# * The server speaks plain HTTP on :8080.
# * The agent binaries + install scripts ship inside the image under
# /opt/restic-manager/dist/, so /agent/binary and /install/*
# serve out of the box without first-run staging.
# * The named volume holds *only* operator state (sqlite,
# secrets.enc, audit log, the AEAD key). Image upgrades replace
# the agents/scripts; the volume is untouched.
# * Pre-1.0 releases never publish :latest — pin to an exact
# vX.Y.Z tag and bump deliberately.
#
# Before first start:
# 1. Pick a version: export RM_VERSION=vX.Y.Z (or substitute below).
# 2. Set RM_BASE_URL to the public HTTPS URL the external proxy
# serves on.
# 3. Set RM_TRUSTED_PROXY to the IP/CIDR the proxy connects from
# (the X-Forwarded-* headers are honoured only when the immediate
# peer matches one of these).
# The server speaks plain HTTP. Front it with a TLS-terminating
# reverse proxy (Caddy/Traefik/nginx). RM_TRUSTED_PROXY must contain
# the proxy's IP/CIDR so X-Forwarded-* headers are honoured.
services:
restic-manager:
image: gitea.dcglab.co.uk/steve/restic-manager:${RM_VERSION:?set RM_VERSION to a vX.Y.Z tag}
image: ghcr.io/dcglab/restic-manager:latest
restart: unless-stopped
# Bind to localhost only — your reverse proxy reaches the server
# over loopback (or, if it runs in a separate compose / on
# another host, swap this for an internal docker network or a
# private LAN bind).
# Bind to localhost only — the proxy is what the public reaches.
ports:
- "127.0.0.1:8080:8080"
volumes:
- rm-data:/data
- ./data:/data
environment:
- RM_DATA_DIR=/data
- RM_LISTEN=:8080
- RM_BASE_URL=${RM_BASE_URL:?set RM_BASE_URL to the public https URL}
- RM_BASE_URL=https://restic.lab.example
- RM_SECRET_KEY_FILE=/data/secret.key
- RM_TRUSTED_PROXY=${RM_TRUSTED_PROXY:?set RM_TRUSTED_PROXY to the proxy CIDR}
# Cookies are Secure by default; keep that. Override only for
# local-HTTP smoke tests.
# - RM_COOKIE_SECURE=true
volumes:
rm-data:
- RM_TRUSTED_PROXY=172.16.0.0/12
@@ -1,325 +0,0 @@
{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": { "type": "grafana", "uid": "-- Grafana --" },
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
}
]
},
"description": "restic-manager fleet overview. Imports against any Prometheus data source.",
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": null,
"links": [],
"liveNow": false,
"panels": [
{
"id": 1,
"title": "Fleet status",
"type": "stat",
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
"gridPos": { "h": 6, "w": 6, "x": 0, "y": 0 },
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": null },
{ "color": "green", "value": 1 }
]
},
"unit": "short"
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"targets": [
{
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
"expr": "rm_hosts_online",
"legendFormat": "online",
"refId": "A"
},
{
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
"expr": "rm_hosts_total",
"legendFormat": "total",
"refId": "B"
}
]
},
{
"id": 2,
"title": "Open alerts",
"type": "stat",
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
"gridPos": { "h": 6, "w": 6, "x": 6, "y": 0 },
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "red", "value": 5 }
]
},
"unit": "short"
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "none",
"orientation": "horizontal",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"targets": [
{
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
"expr": "sum by (severity) (rm_active_alerts)",
"legendFormat": "{{severity}}",
"refId": "A"
}
]
},
{
"id": 3,
"title": "Backups failing (last reported run)",
"type": "stat",
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
"gridPos": { "h": 6, "w": 6, "x": 12, "y": 0 },
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 1 }
]
},
"unit": "short"
},
"overrides": []
},
"options": {
"colorMode": "value",
"graphMode": "area",
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
"textMode": "auto"
},
"targets": [
{
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
"expr": "count(rm_host_last_backup_success == 0)",
"legendFormat": "failing",
"refId": "A"
}
]
},
{
"id": 4,
"title": "Hosts",
"type": "table",
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
"gridPos": { "h": 10, "w": 24, "x": 0, "y": 6 },
"fieldConfig": {
"defaults": {
"custom": { "align": "auto", "displayMode": "auto" }
},
"overrides": [
{
"matcher": { "id": "byName", "options": "Value #B" },
"properties": [
{ "id": "displayName", "value": "Last backup (s ago)" },
{ "id": "unit", "value": "s" }
]
},
{
"matcher": { "id": "byName", "options": "Value #C" },
"properties": [
{ "id": "displayName", "value": "Repo size" },
{ "id": "unit", "value": "bytes" }
]
},
{
"matcher": { "id": "byName", "options": "Value #D" },
"properties": [
{ "id": "displayName", "value": "Snapshots" }
]
},
{
"matcher": { "id": "byName", "options": "Value #A" },
"properties": [
{ "id": "displayName", "value": "Online" }
]
},
{
"matcher": { "id": "byName", "options": "Value #E" },
"properties": [
{ "id": "displayName", "value": "Open alerts" }
]
}
]
},
"options": { "showHeader": true },
"transformations": [
{
"id": "merge",
"options": {}
}
],
"targets": [
{
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
"expr": "rm_host_agent_online",
"format": "table",
"instant": true,
"refId": "A"
},
{
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
"expr": "time() - rm_host_last_backup_timestamp_seconds",
"format": "table",
"instant": true,
"refId": "B"
},
{
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
"expr": "rm_host_repo_size_bytes",
"format": "table",
"instant": true,
"refId": "C"
},
{
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
"expr": "rm_host_snapshot_count",
"format": "table",
"instant": true,
"refId": "D"
},
{
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
"expr": "rm_host_open_alerts",
"format": "table",
"instant": true,
"refId": "E"
}
]
},
{
"id": 5,
"title": "Repo size over time",
"type": "timeseries",
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 },
"fieldConfig": {
"defaults": {
"color": { "mode": "palette-classic" },
"custom": {
"axisLabel": "",
"drawStyle": "line",
"fillOpacity": 10,
"lineWidth": 1,
"pointSize": 5,
"showPoints": "never"
},
"unit": "bytes"
},
"overrides": []
},
"options": {
"legend": { "calcs": ["last"], "displayMode": "list", "placement": "bottom", "showLegend": true },
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
"expr": "rm_host_repo_size_bytes",
"legendFormat": "{{host}}",
"refId": "A"
}
]
},
{
"id": 6,
"title": "Job duration p95 (last 1h, by kind)",
"type": "timeseries",
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 },
"fieldConfig": {
"defaults": {
"color": { "mode": "palette-classic" },
"custom": {
"drawStyle": "line",
"fillOpacity": 5,
"lineWidth": 1,
"pointSize": 4,
"showPoints": "never"
},
"unit": "s"
},
"overrides": []
},
"options": {
"legend": { "calcs": ["last"], "displayMode": "list", "placement": "bottom", "showLegend": true },
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
"expr": "histogram_quantile(0.95, sum by (kind, le) (rate(rm_job_duration_seconds_bucket[1h])))",
"legendFormat": "{{kind}}",
"refId": "A"
}
]
}
],
"refresh": "30s",
"schemaVersion": 39,
"style": "dark",
"tags": ["restic-manager", "backups"],
"templating": {
"list": [
{
"current": {},
"hide": 0,
"includeAll": false,
"label": "Prometheus",
"multi": false,
"name": "DS_PROMETHEUS",
"options": [],
"query": "prometheus",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"type": "datasource"
}
]
},
"time": { "from": "now-6h", "to": "now" },
"timepicker": {},
"timezone": "",
"title": "restic-manager — fleet",
"uid": "rm-fleet-overview",
"version": 1,
"weekStart": ""
}
+6 -4
@@ -49,10 +49,12 @@ detect_arch() {
ensure_dirs() {
install -d -m 0700 -o root -g root "$RM_CONFIG_DIR"
install -d -m 0700 -o root -g root "$RM_STATE_DIR"
# Default new-directory restore target: $HOME/rm-restore. With the
# current unit (ProtectSystem=full, no ReadWritePaths pin) the agent
# can mkdir anywhere on real filesystems, so this is just a courtesy
# pre-create so the wizard's default lands in a tidy spot.
# Default new-directory restore target: $HOME/rm-restore. Pre-create
# so the systemd unit's ReadWritePaths bind-mount applies cleanly
# (paths that don't exist when systemd starts get a soft-fail
# because of the '-' prefix, but the agent then can't mkdir into
# the read-only /root). Mode 0700 + root-owned matches the threat
# model — files restored here are operator-readable as root.
install -d -m 0700 -o root -g root /root/rm-restore
}
+10 -24
@@ -33,31 +33,17 @@ CapabilityBoundingSet=CAP_DAC_READ_SEARCH CAP_DAC_OVERRIDE CAP_FOWNER CAP_CHOWN
AmbientCapabilities=CAP_DAC_READ_SEARCH CAP_DAC_OVERRIDE CAP_FOWNER CAP_CHOWN
# Hardening — blocks privilege escalation even from root, and
# confines kernel / namespace / privilege surface. Filesystem reads
# stay open (that's the whole job) and restore writes are
# unrestricted: a backup tool whose entire purpose is "put files
# back where they belong" can't have ProtectHome=read-only or
# ProtectSystem=strict without breaking on the first cross-user
# restore. ProtectSystem=full keeps /usr, /boot, /efi read-only so a
# compromised agent can't swap out /usr/bin/restic or drop a kernel
# module, while leaving /home, /root, /var, /opt, /srv, /tmp etc.
# writable for arbitrary restore targets. The agent is treated as a
# high-trust component (it runs operator hooks as root and holds
# repo credentials); the residual hardening is about kernel + privesc
# protection, not write confinement.
# confines writes / network / kernel access to what restic actually
# needs. Filesystem reads stay open: that's the whole job.
NoNewPrivileges=true
ProtectSystem=full
# ProtectSystem=full mounts /usr, /boot, /efi *and* /etc read-only.
# The agent rewrites /etc/restic-manager/agent.yaml on enrolment and
# whenever a new SecretsKey is minted, so we need a targeted
# write-exemption for that dir. No exemption for the rest of /etc:
# the agent has no business editing /etc/passwd, /etc/sudoers, etc.
#
# /usr/local/bin is writable so the self-update flow (P6-01) can
# atomic-rename a fresh binary over the running one. Permitting the
# whole directory (rather than just the binary path) is required
# because os.Rename takes a write lock on the parent dir.
ReadWritePaths=/etc/restic-manager /usr/local/bin
ProtectSystem=strict
# /etc/restic-manager: agent.yaml + secrets.enc.
# /var/lib/restic-manager: agent state (currently unused but reserved).
# /root/rm-restore: default target for new-directory restores
# ($HOME/rm-restore/<job-id>/ resolves here for User=root).
# ReadWritePaths overrides ProtectHome=read-only on this subdir only.
ReadWritePaths=/etc/restic-manager /var/lib/restic-manager -/root/rm-restore
ProtectHome=read-only
ProtectHostname=true
ProtectKernelTunables=true
ProtectKernelModules=true
-249
@@ -1,249 +0,0 @@
# Onboarding a new host — agent instructions
How an automation agent (with a username + password for the
restic-manager server) brings a new host fully online.
The flow is two roles:
- **Controller side**: the agent calls JSON APIs on the
restic-manager server. Needs network reach to the server, plus
username/password.
- **Target side**: the host being onboarded runs the install
script, which calls back to the server with the one-time token.
If the agent is *both* sides (e.g. it can SSH into the target),
it does steps 1-2 against the server and steps 3-4 against the
target. If the agent only controls the server, it stops at
step 2 and hands the install snippet to whoever owns the target.
---
## Conventions
- Base URL: `$RM_SERVER` (e.g. `https://restic.lab.example`).
- Session cookie jar: persist `rm_session` between calls.
- All request/response bodies are JSON unless noted.
- On any non-2xx, response body is
`{"code": "...", "message": "..."}`.
---
## 1. Login
```
POST $RM_SERVER/api/auth/login
Content-Type: application/json
{"username": "...", "password": "..."}
```
→ 200 with `{"user_id": "...", "role": "..."}` and a `Set-Cookie:
rm_session=...` (HttpOnly, 24h TTL). Persist the cookie; reuse
it on every subsequent call.
Required role for the next step: **operator** or **admin**.
A viewer-only login can read but cannot mint tokens.
Session expires at 24h. On 401 from a later call, re-login.
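For a Go-based controller, the cookie persistence can be left to the standard library's cookie jar. This is a sketch under assumptions: `login`, `newClient`, and `fakeServer` are illustrative names, and the fake server only imitates the documented behaviour (a `rm_session` cookie on login, 401 without it) so the example runs without a real deployment.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/cookiejar"
	"net/http/httptest"
)

// newClient returns an HTTP client whose jar stores rm_session between calls.
func newClient() *http.Client {
	jar, _ := cookiejar.New(nil) // cookiejar.New never fails with nil options
	return &http.Client{Jar: jar}
}

// login posts the credentials to /api/auth/login; the jar keeps the cookie.
func login(c *http.Client, base, user, pass string) error {
	body, err := json.Marshal(map[string]string{"username": user, "password": pass})
	if err != nil {
		return err
	}
	resp, err := c.Post(base+"/api/auth/login", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("login failed: HTTP %d", resp.StatusCode)
	}
	return nil
}

// fakeServer imitates the two routes used here; it is a demo stand-in only.
func fakeServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/auth/login", func(w http.ResponseWriter, r *http.Request) {
		http.SetCookie(w, &http.Cookie{Name: "rm_session", Value: "demo", Path: "/", HttpOnly: true})
		w.Write([]byte(`{"user_id":"u1","role":"operator"}`))
	})
	mux.HandleFunc("/api/hosts", func(w http.ResponseWriter, r *http.Request) {
		if _, err := r.Cookie("rm_session"); err != nil {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		w.Write([]byte(`[]`))
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := fakeServer()
	defer srv.Close()
	c := newClient()
	if err := login(c, srv.URL, "alice", "secret"); err != nil {
		panic(err)
	}
	resp, err := c.Get(srv.URL + "/api/hosts")
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
	fmt.Println("hosts call status:", resp.StatusCode)
}
```

On a 401 from a later call, the same `login` function can be called again on the same client; the jar replaces the stale cookie.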
---
## 2. Mint an enrolment token
```
POST $RM_SERVER/api/enrollment-tokens
Cookie: rm_session=...
Content-Type: application/json
{
  "hostname": "newhost.example",
  "tags": ["prod", "london"], // optional
  "repo_url": "rest:https://rest.example/newhost",
  "repo_username": "...", // optional, for rest-server / S3
  "repo_password": "...", // optional
  "initial_paths": ["/etc", "/home", "/var/lib"] // optional; default source group
}
```
→ 200 with:
```json
{ "token": "<RAW_ONE_TIME_TOKEN>", "expires_at": "2026-05-09T..." }
```
**Capture `token` immediately — the server only stores its hash
and will never return the raw value again.** TTL is 1 hour.
The repo creds you provided are encrypted under the token hash
and pre-attached to the host. The agent will fetch and store
them at enrol-time; you will not need to push them again.
If you lose the token before the install runs, mint a new one
(the existing one becomes irrelevant; you can leave it to expire
or revoke it via the UI).
---
## 3. Install on the target host
The install script is hosted by the server itself. Running on the
target:
### Linux
```
curl -fsSL $RM_SERVER/install/install.sh | \
sudo RM_SERVER=$RM_SERVER RM_TOKEN=<RAW_ONE_TIME_TOKEN> bash
```
What it does, end-to-end:
1. detects arch (amd64 / arm64)
2. downloads `$RM_SERVER/agent/binary?os=linux&arch=<arch>` to
`/usr/local/bin/restic-manager-agent`
3. creates `/etc/restic-manager/` and `/var/lib/restic-manager/`
(root:root, 0700)
4. calls `POST /api/agents/enroll` with the token; server returns
the persistent agent bearer + `host_id`, written to
`/etc/restic-manager/agent.env`
5. installs the systemd unit, `daemon-reload`, `enable --now`
6. surfaces any pre-existing restic cron/timer entries so the
operator can decide whether to disable them (script does
*not* touch them automatically)
The script is idempotent. Re-running on an already-enrolled host
is a no-op unless `RM_FORCE_REENROLL=1`.
The agent runs as **root** by design — fleet backup needs to
read every file on the system. See
`deploy/install/restic-manager-agent.service` for rationale.
### Windows
```
iwr $RM_SERVER/install/install.ps1 -UseBasicParsing | iex
# (or download + run; needs an elevated PowerShell)
# Required env: $env:RM_SERVER, $env:RM_TOKEN
```
Same flow, lays down a Windows service instead of a systemd unit.
### Manual / non-script enrolment
If the install script can't be used, the wire-level enrol call is:
```
POST $RM_SERVER/api/agents/enroll
Content-Type: application/json
{
  "token": "<RAW_ONE_TIME_TOKEN>",
  "hostname": "newhost.example",
  "os": "linux", // linux | windows
  "arch": "amd64", // amd64 | arm64
  "agent_version": "...",
  "restic_version": "..."
}
```
→ 200 with
`{"host_id": "...", "agent_token": "...", "cert_pin_sha256": "..."}`.
The agent_token goes into `/etc/restic-manager/agent.env` as
`RM_AGENT_TOKEN=...`; subsequent agent → server traffic uses
`Authorization: Bearer $RM_AGENT_TOKEN`.
---
## 4. Verify the host is healthy
Poll until both conditions are true. Cap at ~5 minutes.
```
GET $RM_SERVER/api/hosts
Cookie: rm_session=...
```
→ array of host objects. Find the one with the matching hostname
and check:
- `"status": "online"`: the agent is connected to the WS heartbeat
- `"repo_status": "ready"`: `restic init` (or existing-config
  detection) completed successfully
If `repo_status` settles on `"init_failed"`, the repo creds are
wrong or the repo URL is unreachable from the target. Inspect
the matching job log:
```
GET $RM_SERVER/api/hosts/<host_id>/jobs (most recent init job)
GET $RM_SERVER/api/jobs/<job_id> (full output)
```
Fix the creds with a creds-update call (see Settings → Repo on
the UI for the exact route — currently form-only) or revoke the
host and start over.
---
## 5. (Optional) configure schedules
A new host gets one default source group covering `initial_paths`
(or `/etc`,`/home` if you didn't pass any) and **no schedule**.
Backups won't run until either:
- a schedule is attached (cron expression, retention, etc.), or
- you trigger an on-demand run via the source-group "Run now"
endpoint.
These are not yet exposed cleanly as JSON-only routes; if the
agent needs them, look at `internal/server/http/schedules*.go`
and `internal/server/http/source_groups*.go` — most are JSON-
capable, some are form-only with HTML 303 responses.
---
## Failure modes — quick reference
| Symptom | Likely cause | Fix |
|---|---|---|
| `401` on `/api/enrollment-tokens` | session expired or viewer role | re-login as operator+ |
| install.sh fails at "enrol": HTTP 410 | token expired (>1h) or already used | mint a fresh token |
| Host shows `status=offline` after install | systemd unit didn't start; firewall blocks WS | `systemctl status restic-manager-agent`, check `$RM_SERVER` reachability |
| `repo_status=init_failed` | bad repo creds or URL | inspect init job log; fix creds; retry probe via `/hosts/{id}/repo/probe` |
| Token list grows with stale rows | normal — they expire at 1h | optional cleanup via `/hosts/enrollment-tokens/{hash}/revoke` |
---
## Minimum reproducible script
```bash
#!/usr/bin/env bash
set -euo pipefail
: "${RM_SERVER:?}" "${RM_USER:?}" "${RM_PASS:?}" "${RM_HOSTNAME:?}" \
  "${RM_REPO_URL:?}" "${RM_REPO_USER:?}" "${RM_REPO_PASS:?}"
JAR=$(mktemp)
trap 'rm -f "$JAR"' EXIT
# 1. login (jq builds the body so quotes or backslashes in the
#    password still produce valid JSON)
curl -fsS -c "$JAR" -H 'Content-Type: application/json' \
  -d "$(jq -nc --arg u "$RM_USER" --arg p "$RM_PASS" \
        '{username:$u, password:$p}')" \
  "$RM_SERVER/api/auth/login" >/dev/null
# 2. mint token
TOKEN=$(curl -fsS -b "$JAR" -H 'Content-Type: application/json' \
  -d "$(jq -nc \
        --arg h "$RM_HOSTNAME" --arg u "$RM_REPO_USER" \
        --arg p "$RM_REPO_PASS" --arg r "$RM_REPO_URL" \
        '{hostname:$h, repo_url:$r, repo_username:$u, repo_password:$p}')" \
  "$RM_SERVER/api/enrollment-tokens" | jq -r .token)
# 3. emit the install snippet for the target machine
cat <<EOF
Run on $RM_HOSTNAME (as root):
curl -fsSL $RM_SERVER/install/install.sh | \\
sudo RM_SERVER=$RM_SERVER RM_TOKEN=$TOKEN bash
EOF
```
-19
@@ -1,19 +0,0 @@
[book]
title = "restic-manager"
description = "Self-hosted control plane for restic backups across a fleet of Linux and Windows endpoints."
authors = ["Steve Cliff"]
language = "en-GB"
multilingual = false
src = "src"
[output.html]
default-theme = "ayu"
preferred-dark-theme = "ayu"
git-repository-url = "https://gitea.dcglab.co.uk/steve/restic-manager"
git-repository-icon = "fa-code-fork"
edit-url-template = "https://gitea.dcglab.co.uk/steve/restic-manager/_edit/main/docs/book/{path}"
no-section-label = false
[output.html.fold]
enable = true
level = 2
-40
@@ -1,40 +0,0 @@
# Summary
[Introduction](./intro.md)
# Getting started
- [Installing the server](./getting-started/install.md)
- [Enrolling your first host](./getting-started/enrolling-hosts.md)
- [Running behind a reverse proxy](./getting-started/reverse-proxy.md)
# Concepts
- [Architecture](./concepts/architecture.md)
- [Credentials and how they flow](./concepts/credentials.md)
- [Schedules and source groups](./concepts/schedules-and-source-groups.md)
- [Repo maintenance](./concepts/repo-maintenance.md)
# Operations
- [Backups and restores](./operations/backups-and-restores.md)
- [Alerts and notifications](./operations/alerts.md)
- [Observability with Prometheus](./operations/observability.md)
- [Updating agents](./operations/updates.md)
# Security
- [Threat model](./security/threat-model.md)
- [Hardening checklist](./security/hardening.md)
- [Reporting vulnerabilities](./security/disclosure.md)
# Reference
- [Environment variables](./reference/env-vars.md)
- [HTTP endpoints](./reference/http-endpoints.md)
---
[Contributing](./contributing.md)
[Roadmap](./roadmap.md)
[License](./license.md)
-121
@@ -1,121 +0,0 @@
# Architecture
## Components
```
┌────────────────────────────────────────────────────────────┐
│ Server (control plane, single process)                     │
│  * chi-based HTTP API + HTMX server-rendered UI            │
│  * WebSocket hub for agent fan-out + browser fan-out       │
│  * SQLite store (modernc.org/sqlite, pure Go)              │
│  * AEAD encryption helpers                                 │
│  * Alert engine + notification hub                         │
└────────────┬───────────────────────────────────┬───────────┘
             │ outbound WS only                  │ HTTP(S)
             │                                   │
┌────────────▼─────────────┐        ┌────────────▼─────────────┐
│ Agent (per host)         │        │ Browser (operator)       │
│  * coder/websocket       │        │  * htmx + a tiny bit     │
│  * cron for schedules    │        │    of vanilla JS for     │
│  * restic wrapper        │        │    live job updates      │
│  * sysinfo collector     │        └──────────────────────────┘
└────────────┬─────────────┘
             │ subprocess: restic ...
┌────────────▼─────────────────────────────────────────────────┐
│ restic repository (rest-server, S3, B2, SFTP, local …)       │
│ Backup data flows directly here. Server never touches it.    │
└──────────────────────────────────────────────────────────────┘
```
## Why outbound-only WebSockets?
The agent dials the server on `/ws/agent` with a bearer token. The
server doesn't initiate connections to the agent. Three reasons:
1. **Firewall friendliness.** Nothing on the endpoint needs an
inbound port; this works behind the typical "branch office NAT"
without router config.
2. **Single auth point.** The bearer token is the only credential
that crosses the boundary; the agent never accepts an
incoming socket.
3. **Reconnect semantics are simpler.** When the connection drops
(NAT timeout, server restart, transient network glitch) the
agent backs off and re-dials; the server marks the host
offline after 90s and lets the alert engine raise a stale-host
alert.
## Why SQLite?
High availability is an explicit non-goal of the project, so SQLite
is enough. A small control plane managing twelve endpoints does not
need replication or a separate database tier. SQLite gives us:
- A single file to back up (plus the secret key).
- Hand-rolled migrations under `internal/store/migrations/`,
  with no migration framework lock-in.
- `WAL` mode plus per-connection foreign-key enforcement.
The migrations define the entire schema; there's no ORM or
query-builder layer between Go code and SQL.
## Why the agent runs `restic` itself, not via the server
The control plane never holds backup bytes in flight. That's
deliberate:
- A compromised control plane cannot exfiltrate snapshot
contents in-band — at worst it can dispatch new backup or
forget jobs (audit-logged) but the data path is between the
agent and the repository.
- The same agent process can target whichever transport restic
natively supports (rest-server, S3, B2, SFTP, local), no
separate mux on the server side.
## Job lifecycle
```
              ┌──────────────────────┐
   operator → │ POST /hosts/{id}/    │
              │ run-backup           │
              └──────────┬───────────┘
                         │ 1. INSERT INTO jobs (status='queued')
                         │ 2. dispatch command.run over WS
              ┌──────────────────────┐
              │ Agent dispatches     │
              │ restic subprocess    │
              └──────────┬───────────┘
                         │ 3. job.started  ───▶ store.MarkJobStarted
                         │ 4. job.progress ───▶ JobHub broadcast (live UI)
                         │ 5. log.stream   ───▶ append to job_logs
                         │ 6. job.finished ───▶ store.MarkJobFinished
                         │                      + alert engine eval
                         │                      + (P6) metrics histogram
              terminal: succeeded | failed | cancelled
```
Operators see live updates because the browser subscribes to
`/api/jobs/{id}/stream`, and the WS handler broadcasts each
agent-emitted envelope to all live subscribers in addition to
persisting it.
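The broadcast step can be pictured as a small in-process fan-out hub. This is a sketch of the general technique, not the project's `JobHub`: the type and method names, the string payload, the buffer size, and the drop-on-full policy are all assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// Hub fans each job event out to every live subscriber of that job.
type Hub struct {
	mu   sync.Mutex
	subs map[string]map[chan string]struct{} // job id -> subscriber set
}

func NewHub() *Hub {
	return &Hub{subs: make(map[string]map[chan string]struct{})}
}

// Subscribe registers a buffered channel for jobID and returns it
// together with a function that removes the subscription.
func (h *Hub) Subscribe(jobID string) (<-chan string, func()) {
	ch := make(chan string, 16)
	h.mu.Lock()
	if h.subs[jobID] == nil {
		h.subs[jobID] = make(map[chan string]struct{})
	}
	h.subs[jobID][ch] = struct{}{}
	h.mu.Unlock()
	return ch, func() {
		h.mu.Lock()
		delete(h.subs[jobID], ch)
		h.mu.Unlock()
	}
}

// Broadcast delivers msg without blocking. A subscriber whose buffer is
// full misses the event instead of stalling the agent-side reader.
func (h *Hub) Broadcast(jobID, msg string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for ch := range h.subs[jobID] {
		select {
		case ch <- msg:
		default:
		}
	}
}

func main() {
	hub := NewHub()
	a, cancelA := hub.Subscribe("job-1")
	b, cancelB := hub.Subscribe("job-1")
	defer cancelA()
	defer cancelB()
	hub.Broadcast("job-1", "job.progress 42%")
	fmt.Println(<-a, "|", <-b)
}
```

Dropping on a full buffer is tolerable in this design because the persisted copy remains the source of truth; a slow browser can reload the job log.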
## What scheduling looks like
- The agent runs a local `robfig/cron/v3` instance.
- The server pushes the desired schedule set to the agent on
hello + after every CRUD change.
- When the agent's cron fires, it sends `schedule.fire` to the
server. The server creates a job row, sends `command.run` back,
and the agent dispatches a normal backup.
- If the WS drops between fire and run, the server queues the
schedule firing into `pending_runs` and drains on agent
reconnect — no missed scheduled backups due to network blips.
For everything that isn't a backup (forget, prune, check), the
server runs a 60-second maintenance ticker against
`host_repo_maintenance` rows and dispatches the relevant command
when a cadence is due. The agent's local cron only handles
backups.
-98
@@ -1,98 +0,0 @@
# Credentials and how they flow
restic-manager handles three credential surfaces:
1. **Operator credentials** — the username + password (or OIDC
identity) that logs into the UI.
2. **Agent bearer tokens** — issued at enrolment, used by the
agent to authenticate its WebSocket to the server.
3. **Repo credentials** — the rest-server / S3 / B2 / SFTP
credentials the agent passes to `restic` itself.
Each has a different threat model and storage strategy.
## Operator credentials
- Local users are stored in `users` with a bcrypt password hash.
- Sessions are random tokens minted at login, stored hashed in
the `sessions` table, expired after 24h. Cookie is HttpOnly,
SameSite=Lax, and Secure (when `RM_COOKIE_SECURE=true`,
default).
- OIDC users carry `auth_source='oidc'` and an `oidc_subject`
pinning their IdP identity. Local password login is rejected
for OIDC users.
- Disabling a user soft-deletes them via `disabled_at`;
  pre-existing sessions are invalidated on the next request.
## Agent bearer tokens
- Minted at enrolment, hashed at rest with `auth.HashToken`.
- The plaintext token only exists in memory at enrolment time
and on the agent's filesystem (`/etc/restic-manager/agent.yaml`,
mode `0600`, owned by the service user).
- Compromise of the server DB leaks the hashes, which is enough
to *log in as that agent* until you revoke. Compromise of the
agent host leaks the plaintext (via the config file) — same
end result.
- Rotation: re-enrol the host. Today there's no in-place rotate;
the operator deletes the host (which cascades, including
revoking the bearer hash) and re-runs the install command.
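The hash-at-rest pattern can be illustrated as below. This document does not say which algorithm `auth.HashToken` uses, so SHA-256 here is an assumption, and `mintToken`, `hashToken`, and `verify` are illustrative names rather than the project's API.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// hashToken is a stand-in for auth.HashToken; SHA-256 is an assumption.
func hashToken(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

// mintToken returns a random raw token (handed to the agent once) and
// the hash that would be persisted server-side.
func mintToken() (raw, hash string, err error) {
	buf := make([]byte, 32)
	if _, err = rand.Read(buf); err != nil {
		return "", "", err
	}
	raw = hex.EncodeToString(buf)
	return raw, hashToken(raw), nil
}

// verify hashes the presented token and compares it with the stored
// hash in constant time.
func verify(presented, storedHash string) bool {
	return subtle.ConstantTimeCompare([]byte(hashToken(presented)), []byte(storedHash)) == 1
}

func main() {
	raw, hash, err := mintToken()
	if err != nil {
		panic(err)
	}
	fmt.Println("valid token accepted:", verify(raw, hash))
	fmt.Println("wrong token accepted:", verify("not-the-token", hash))
}
```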
## Repo credentials
This is the credential that ultimately matters for backup
integrity. restic-manager keeps two slots per host:
- **The everyday credential** (`host_credentials.kind = ''`).
Append-only-friendly: this is the one your backup schedule
uses. It can write but not delete or forget.
- **The admin credential** (`host_credentials.kind = 'admin'`).
Has full delete rights. Only pushed to the agent transiently
while a `prune` or `forget` job is dispatching, and discarded
by the agent after the job ends.
### Encryption flow
1. Operator types the credential into the UI or the install form.
2. Server AEAD-encrypts the cred (`crypto.AEAD.Encrypt`) using the
key in `RM_SECRET_KEY_FILE`. The plaintext is dropped from
memory.
3. Encrypted blob is stored in `host_credentials.cred_blob`.
4. When the agent connects, the server decrypts the blob and
sends the **plaintext** down the WebSocket inside a
`config.update` envelope.
5. The agent stores the plaintext in its in-memory secrets store
for the lifetime of the process; it's reloaded fresh on every
server-side push.
6. When a job runs, the agent merges the credential into the
restic environment (`restic.Env.RepoURL` stays bare; the
`user:pass@…` form is built only inside `envSlice()` at the
moment of `exec.Command`).
The merged form is **never logged**. The slog package's structured
output gets `restic.RedactURL()` for any URL it has cause to
mention.
### Why push plaintext over the wire?
The transport itself is the trust boundary: the WebSocket runs
inside the same TLS-terminated reverse-proxy connection your
browser uses, and the agent has already authenticated with its
bearer token. Re-encrypting the payload on top of that would just
move the key-management problem somewhere else.
If your reverse proxy isn't TLS-terminated, the deployment is
already broken — see [Hardening](../security/hardening.md).
## Setup tokens (admin-driven)
When an admin creates a new user, the server mints a one-time
setup link valid for 1 hour. The hash is stored; the raw token
is shown to the admin once. The user opens the link, sets a
password, and is dropped into a session. Expired tokens are
swept on the alert engine's 60s tick.
Same pattern for enrolment tokens: the raw token only exists in
memory at mint time, and the install snippet is the operator's
only chance to capture it. If you lose it, regenerate via the
**Add host** page (NS-02).
@@ -1,85 +0,0 @@
# Repo maintenance
Backups go in; without maintenance, repos grow forever and
eventually fall over. restic-manager runs three maintenance
operations on a per-host cadence:
| Command | What it does | Default cadence |
|----------|-------------------------------------------------------------|-----------------|
| `forget` | Marks snapshots eligible for removal per the retention policy attached to each source group. Cheap; runs append-only. | Daily after the last backup of the day |
| `prune` | Reclaims space from the repo. Requires the **admin** credential (write+delete). | Weekly, off-peak |
| `check` | Verifies repo integrity. Sub-options surface lock state. | Weekly, with `--read-data-subset N%` to sample pack files |
A new field on each host row, `host_repo_maintenance`, holds the
cron expressions and last-fire anchors. The maintenance ticker on
the server runs every 60s, finds hosts whose next-fire is due,
and dispatches the right command. The agent's local cron is
**only** for backups.
## Why server-side and not agent-side?
The agent's cron knows about backups because backups are
per-source-group. Maintenance is per-repo, not per-source-group,
so doing it server-side keeps the per-host wiring simple:
- One ticker, not N agent crons to keep in sync.
- Cancelling a maintenance dispatch is just "don't dispatch the
next one" — no agent-side state to clean up.
- Skipping offline hosts is trivial (no queue; only scheduled
*backups* queue into `pending_runs`).
## Forget and the multi-group payload
A single `forget` job can target several source groups at once.
The wire envelope (`ForgetGroups`) carries one entry per group,
each with its retention policy. The agent runs N
`restic forget --tag <name> --keep-...` invocations in sequence,
streams their output, and reports a single terminal status.
## Prune and the admin credential
Prune mutates the repo. The everyday append-only credential
**cannot** prune — that's the whole point of append-only.
restic-manager keeps a second slot per host (`kind = 'admin'`)
for the credential that can.
When a prune is dispatched (cadence-driven or operator-driven):
1. Server pushes the admin credential to the agent in a fresh
`config.update`.
2. Agent runs `restic prune` with the merged credential.
3. Job finishes; agent discards the admin credential from its
in-memory secrets store.
The server never logs the merged URL (see
[Credentials](./credentials.md)).
## Check and lock state
`restic check` warns about stale locks when it finds them. The
agent ships every check's output back as a `repo.stats` envelope
and a stream of log lines; if a stale lock is detected, the
**Repo** page surfaces a banner with an **Unlock** button. The
operator-only `unlock` command runs `restic unlock` and clears
the banner.
`unlock` has no cadence — it's a manual action, never automatic.
Auto-unlocking would mask the cause (probably a previously
crashed long-running operation) and risk corrupting an
operation the operator has merely lost track of.
## Repo stats
After every backup, check, prune, and unlock, the agent runs
`restic stats --json --mode raw-data` and ships the result as a
`repo.stats` envelope. The server stores this in
`host_repo_stats` (latest only) and `host_repo_stats_history`
(one row per host per day, last-write-wins per column — a
prune-only patch never nulls a backup-time size).
The host detail page surfaces:
- Total size + raw size in the vitals strip.
- Last-check timestamp + colour-coded status.
- Last-prune timestamp.
- 30/90-day repo size trend chart.
@@ -1,105 +0,0 @@
# Schedules and source groups
Two related but separable ideas:
- A **source group** is a named bundle of "what to back up":
include paths, exclude patterns, retention policy, retry
configuration, optional pre/post hooks. The group's name is
used as the restic snapshot tag, so retention can target it
with `restic forget --tag <name>`.
- A **schedule** is a cron expression that, when it fires,
triggers a backup of one or more source groups on a host.
Decoupling them means you can have one schedule covering several
groups (e.g. `0 1 * * *` running both `system` and `data`), and
each group has its own retention without duplicating policy
across schedules.
## Source group anatomy
```yaml
name: data
includes:
  - /var/lib/postgresql
  - /home
excludes:
  - /home/*/.cache
  - /home/*/Downloads
retention:
  keep_last: 7
  keep_daily: 14
  keep_weekly: 4
  keep_monthly: 6
retry_max: 3
retry_backoff_seconds: 600
pre_hook: |
  pg_dump -U postgres -F c -f /var/lib/postgresql/dumps/all.dump
post_hook: |
  rm -f /var/lib/postgresql/dumps/all.dump
```
### Conflict detection
If your retention policy says `keep_hourly: 24` but no schedule
points at this group sub-daily, the UI surfaces a
**conflict-dimension banner** ("`hourly` won't be honoured —
no schedule fires more often than once a day"). The flag is
stored on the source group (`conflict_dimension`) and refreshed
whenever a schedule or group changes.
### Hooks
`pre_hook` and `post_hook` run on the agent host inside
`/bin/sh -c` (`cmd.exe /C` on Windows). Output is streamed back
to the live job log as `hook(<phase>): …` lines.
- A non-zero `pre_hook` exit aborts the backup.
- `post_hook` always runs, with `RM_JOB_STATUS=succeeded|failed`
in the environment. Use this for cleanup that must happen
whether the backup worked or not.
- Hooks only run for `kind=backup` jobs. They do not run for
`forget`, `prune`, `check`, etc.
- AEAD-encrypted at rest at the HTTP layer; the agent receives
plaintext over the WS channel.
A "host default" pair of hooks lives on the host itself; a
source group's own hooks override them when set.
## Schedule anatomy
```yaml
cron: "0 2 * * *"
enabled: true
source_group_ids:
- <gid for "data">
- <gid for "system">
```
Slim by design: a schedule says **when** and **which groups**.
Everything else (paths, retention, hooks) lives on the groups.
The agent's local cron fires the schedule. If the WebSocket is
down at fire time, the server queues the firing into
`pending_runs` and drains it on the next agent reconnect — a
short network blip won't lose the backup.
### Last / next run
The schedules tab shows "next" (computed by parsing the cron
expression with `robfig/cron/v3`) and "last" (the latest
`actor_kind=schedule` job in the `jobs` table) for every
schedule. The dashboard host row also surfaces `next 12h ago/from
now` when a single covering schedule is the run-now candidate.
## Bandwidth limits
Two places set restic's `--limit-upload` / `--limit-download`:
1. **Host-wide caps** on the host row (`bandwidth_up_kbps`,
`bandwidth_down_kbps`). Pushed to the agent on hello and
after `PUT /api/hosts/{id}/bandwidth`. Apply to every restic
invocation on the host.
2. **Per-job overrides** on the per-source-group Run-now form.
Win over host caps for the lifetime of that one job.
If neither is set, restic runs unthrottled.
-17
@@ -1,17 +0,0 @@
# Contributing
Full contributor guide:
[`CONTRIBUTING.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/CONTRIBUTING.md)
in the repository root.
The short version:
- Open an issue first for non-trivial changes; the design is
still moving and unsolicited large PRs may conflict with
in-flight work.
- `make lint test` must pass.
- One logical change per commit, no `Co-Authored-By` trailers.
- UK English in identifiers and comments; comments explain the
**why** not the **what**.
Code of conduct: [`CODE_OF_CONDUCT.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/CODE_OF_CONDUCT.md).
@@ -1,113 +0,0 @@
# Enrolling your first host
The control plane only knows about hosts you've explicitly
enrolled. Two paths exist:
1. **Token-based enrolment** — admin generates a token, pastes it
into an install command on the host. The host appears immediately,
already mapped to the desired repo.
2. **Announce-and-approve** — the agent runs without a token,
"announces" itself to the server, and a human in the UI accepts
the announcement.
Token-based is the default and what most operators want; the
announce flow exists for the case where you can't easily paste a
secret onto the host (auto-imaged endpoints, scripted bring-ups
from a config repo).
## Token-based enrolment
### From the UI
1. Click **+ Add host** on the dashboard.
2. Fill in the hostname, the restic repo URL, and the repo
credentials. The credentials are AEAD-encrypted at the server
immediately; what you paste is what the agent receives.
3. Optionally pick the initial source paths — these become the
first source group on the host.
4. Submit. The server mints a one-time token and shows you a copy-
pasteable install snippet.
### On the host (Linux)
```sh
curl -fsSL https://restic.example.com/install/install.sh | \
sudo RM_SERVER=https://restic.example.com \
RM_ENROL_TOKEN=<token> \
bash
```
The script:
1. Detects architecture (`amd64` or `arm64`).
2. Downloads the agent binary from `/agent/binary?os=…&arch=…`.
3. Drops the systemd unit at
`/etc/systemd/system/restic-manager-agent.service`.
4. Runs the agent in `-enrol` mode, which posts the token and
stores the persistent bearer it gets back.
5. Enables and starts the unit.
Within seconds the host should appear on the dashboard as
**online**.
### On the host (Windows)
```pwsh
$env:RM_SERVER = "https://restic.example.com"
$env:RM_ENROL_TOKEN = "<token>"
iwr -useb $env:RM_SERVER/install/install.ps1 | iex
```
Equivalent shape: registers a Windows service via the SCM
(see P2-16 for details), runs `-enrol`, starts the service.
## Recovering a lost token
Tokens are single-use and short-lived (1h). If you closed the tab
before pasting the install command, head to the **Add host** page —
outstanding tokens are listed there with a **Regenerate** button.
Regenerating revokes the old token's hash and mints a fresh raw
token while preserving the original repo credentials and initial
paths. (NS-02 in `tasks.md` if you want the design rationale.)
## Announce-and-approve
If the host can reach the server but you don't want to paste a
secret on it, run the agent in `-announce` mode:
```sh
restic-manager-agent -announce \
-server https://restic.example.com \
-hostname myhost
```
The host appears in the **Pending hosts** panel on the dashboard
with its hostname, OS, arch, and the source IP that announced it.
Click **Accept**, fill in the repo URL + credentials, and the
server pushes the bearer over the still-open WebSocket. No
back-and-forth round trip.
If you don't accept within an hour the announcement is swept.
## What happens on the agent
After enrolment, the agent:
1. Connects via WebSocket to `/ws/agent` with its bearer token.
2. Sends a `hello` envelope with its OS, arch, agent version,
restic version, and protocol version.
3. Receives a `config.update` carrying its encrypted repo
credentials and any source-group paths.
4. Sits idle, sending a heartbeat every 30s. Operator-driven
"Run now" actions arrive as `command.run` envelopes; scheduled
jobs are driven by the agent's local cron.
## Auto-init of the repository
The first time a backup runs, the agent invokes `restic init`
against the repo you configured at enrolment. If the repo already
exists (`config file already exists`) the agent treats it as a
success and proceeds. The host's repo status (`unknown` /
`ready` / `init_failed`) is surfaced under the vitals strip on
the host detail page; if init fails, save fresh credentials in
the **Repo** tab to retry.
-92
@@ -1,92 +0,0 @@
# Installing the server
The reference deployment is a single Docker container fronted by
your existing reverse proxy. The image bundles the server binary,
the cross-compiled agent binaries, and the install scripts.
## Prerequisites
- A Linux host with Docker and Docker Compose.
- A reverse proxy in front (Caddy, nginx, Traefik) terminating
TLS on a public hostname. The server itself is HTTP-only by
design — see [Reverse proxy](./reverse-proxy.md) for why.
- A persistent volume for the server's data directory.
## Quick start
The reference compose file lives at
[`deploy/docker-compose.yml`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/deploy/docker-compose.yml):
```yaml
services:
  restic-manager:
    image: gitea.dcglab.co.uk/steve/restic-manager:${RM_VERSION:-latest}
    restart: unless-stopped
    environment:
      RM_LISTEN: ":8080"
      RM_DATA_DIR: "/data"
      RM_BASE_URL: "https://restic.example.com"
      # Trust your reverse proxy's CIDR so X-Forwarded-* are honoured.
      RM_TRUSTED_PROXY: "10.0.0.0/8"
    volumes:
      - rm-data:/data
    ports:
      # Bind localhost only — your reverse proxy is the public face.
      - "127.0.0.1:8080:8080"
volumes:
  rm-data:
```
Bring it up:
```sh
docker compose up -d
docker compose logs -f restic-manager
```
The first run prints a one-time **bootstrap token** to the log. Use
it within an hour or it expires; if you miss the window the
container prints it again on next start as long as no admin user
exists.
## First-run admin setup
Open `https://restic.example.com/bootstrap` (or whatever your
public URL is). Paste the bootstrap token, pick a username and a
password (≥ 12 characters), and submit. You'll land in the
dashboard logged in as the new admin.
If you'd rather curl it, the equivalent is:
```sh
curl -X POST https://restic.example.com/api/bootstrap \
-H 'Content-Type: application/json' \
-d '{"token":"<token-from-log>","username":"admin","password":"<≥12 chars>"}'
```
## Backing up the secret key
Inside the data volume, `secret.key` holds the AEAD key used to
encrypt every credential at rest. **Back it up separately from
the database.** Without it, encrypted credentials in the database
are unrecoverable; you'd have to re-enrol every host.
A simple working approach: copy `secret.key` to your password
manager or to a separately-backed-up secrets vault the day you
install. It doesn't change.
## Updating the server
```sh
# Pin a new version in your compose file (.env or docker-compose.yml),
# then:
docker compose pull
docker compose up -d
```
Migrations run automatically on startup; the server will refuse to
start if a migration fails (better to bail than to half-migrate).
For the agent self-update story, see
[Updating agents](../operations/updates.md).
@@ -1,95 +0,0 @@
# Running behind a reverse proxy
The restic-manager server is HTTP-only by design. TLS termination,
public hostname, ACME, HSTS, and edge-level rate limiting all
belong to a reverse proxy you already operate outside this project.
## What the proxy must forward
The server reads four headers when (and only when) the immediate
peer matches `RM_TRUSTED_PROXY`:
| Header | Value | Why |
|------------------------|----------------------------------------------------|-----|
| `X-Forwarded-For` | The original client IP | Rate-limit keys, audit log entries, OIDC redirect-URI checks. |
| `X-Forwarded-Proto` | `https` | Used for absolute URLs (e.g. OIDC redirect URIs). |
| `Host` | The public hostname clients use | Cookies are scoped to this; `RM_BASE_URL` must match. |
| `Connection` / `Upgrade` | Pass through unchanged | `/ws/agent` and `/api/jobs/{id}/stream` are WebSockets; without `Upgrade: websocket` they fail. |
Set `RM_TRUSTED_PROXY` to the CIDR (or comma-separated list of
CIDRs) the proxy connects from. Anything outside that range has
its `X-Forwarded-*` headers ignored, so a stray request that
bypasses the proxy can't spoof the client IP.
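The trust rule above can be sketched in a few lines. This is an illustration only, not the server's code: the function name `client_ip` is hypothetical, and taking the right-most `X-Forwarded-For` entry (the one the trusted proxy appended) is an assumption about how a multi-entry header would be read.

```python
import ipaddress


def client_ip(peer_ip, forwarded_for, trusted_proxy):
    """Pick the IP a request should be attributed to.

    peer_ip:        address of the immediate TCP peer
    forwarded_for:  raw X-Forwarded-For header value, or None
    trusted_proxy:  RM_TRUSTED_PROXY-style string, e.g. "10.0.0.0/8,127.0.0.1/32"
    """
    networks = [
        ipaddress.ip_network(cidr.strip())
        for cidr in trusted_proxy.split(",")
        if cidr.strip()
    ]
    peer = ipaddress.ip_address(peer_ip)
    if forwarded_for and any(peer in net for net in networks):
        # Assumption: use the right-most entry, i.e. the one added by the trusted proxy.
        return forwarded_for.split(",")[-1].strip()
    # Peer is not a trusted proxy: ignore the header entirely.
    return peer_ip
```

With `RM_TRUSTED_PROXY=10.0.0.0/8`, a request arriving from `10.0.0.5` is attributed to the forwarded address, while a request arriving directly from a public address keeps its own peer IP regardless of what the header claims.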
## Caddy
```caddyfile
restic.example.com {
    encode zstd gzip
    reverse_proxy 127.0.0.1:8080 {
        header_up X-Real-IP {remote_host}
    }
}
```
Caddy adds `X-Forwarded-For` / `X-Forwarded-Proto` automatically
and passes WebSocket headers through by default, so this is the
whole config.
## nginx
```nginx
server {
    listen 443 ssl http2;
    server_name restic.example.com;
    ssl_certificate /etc/letsencrypt/live/restic.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/restic.example.com/privkey.pem;
    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
        # WebSocket upgrade
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        # Long-lived agent WS: disable read timeout for this surface.
        proxy_read_timeout 86400s;
    }
}
```
## Traefik
```yaml
http:
  routers:
    restic-manager:
      rule: "Host(`restic.example.com`)"
      entryPoints: [websecure]
      tls:
        certResolver: letsencrypt
      service: restic-manager
  services:
    restic-manager:
      loadBalancer:
        servers:
          - url: "http://restic-manager:8080"
        passHostHeader: true
```
Traefik forwards WebSocket upgrades and the standard
`X-Forwarded-*` set out of the box.
## Verification
After bringing the proxy up, the audit log should show your real
client IP for an interactive login (not the proxy's local
address). If you see `127.0.0.1` or the proxy's container IP, your
`RM_TRUSTED_PROXY` is wrong or `X-Forwarded-For` isn't being
forwarded.
@@ -1,86 +0,0 @@
# restic-manager
restic-manager is a self-hosted, browser-based, single-pane-of-glass
for managing [restic](https://restic.net) backups across a fleet of
Linux and Windows endpoints. It's designed for **small fleets**
(the original target was twelve endpoints) and **one operator**.
## What it does
- Centralised view of every endpoint's last backup, repo size,
snapshot count, and recent jobs.
- Trigger any restic operation remotely (`backup`, `forget`, `prune`,
`check`, `unlock`, `snapshots`, `stats`, `diff`, `restore`).
- Per-host backup schedules with source groups (named bundles of
paths + retention policy).
- Live job log streamed to the browser; downloadable as text or NDJSON.
- Restore wizard with snapshot tree browse + path selection.
- Repo-level health surfacing (size, raw size, last-check, lock
state) plus a 30/90-day size trend.
- Alerting over webhook, ntfy, or SMTP.
- Cross-platform agent (Linux + Windows).
- Append-only-credential-friendly with a separate admin credential
for forget/prune.
## What it isn't
- **Not a SaaS.** Single-instance, single-tenant, by design.
- **Not a replacement for restic** — it's a control plane. The agent
shells out to a real `restic` binary.
- **Not highly available.** SQLite, single process; if you need
HA backups, you're shopping in the wrong aisle.
- **Not a multi-protocol backup tool.** restic only.
## How it fits together
```
┌──────────────────────────────────────────────┐
│ Server (control plane, Docker) │
│ - REST + WebSocket API │
│ - SQLite store │
│ - Embedded HTMX UI │
└──────────┬─────────────────────────┬─────────┘
│ outbound WS │ HTTP(S)
│ │
┌──────────▼──────────┐ ┌──────────▼─────────┐
│ Agent (per host) │ │ Browser (operator) │
│ - restic wrapper │ └─────────────────────┘
│ - cron for sched. │
└──────────┬──────────┘
│ restic
┌──────────▼──────────────────────────────────┐
│ rest-server / S3 / SFTP / local repo │
│ (the actual backup data — server never │
│ touches it) │
└─────────────────────────────────────────────┘
```
The control plane is a Go binary that runs in Docker. Each endpoint
runs a small Go agent that holds an outbound WebSocket to the
control plane. Backup data flows directly between the agent and the
restic repository — the control plane never sees a snapshot byte.
## Where to start
- [Installing the server](./getting-started/install.md) walks
through the Docker-based reference deployment.
- [Enrolling your first host](./getting-started/enrolling-hosts.md)
covers the install scripts and the announce-and-approve flow.
- [Architecture](./concepts/architecture.md) is the right read if
you want to know why something is the way it is before running
the install.
## Project status
Pre-1.0 but feature-complete for the original use case. Phases
0 to 4 have landed (MVP, scheduling, restore, RBAC + OIDC); Phase 5
(this docs site, contributor onboarding, end-to-end CI) is in
flight. See [`tasks.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/tasks.md)
for the live roadmap and [`spec.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/spec.md)
for the canonical design doc.
## License
[PolyForm Noncommercial 1.0.0](https://polyformproject.org/licenses/noncommercial/1.0.0/).
Personal and community deployments welcome; commercial use
requires a separate license.
@@ -1,39 +0,0 @@
# License
restic-manager is licensed under
[**PolyForm Noncommercial 1.0.0**](https://polyformproject.org/licenses/noncommercial/1.0.0/).
The full text lives at
[`LICENSE`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/LICENSE)
in the repository root.
## What this means
- **Personal, hobbyist, educational, charitable, and similar
noncommercial use** is fully permitted, including modification
and redistribution.
- **Commercial use is not permitted** without a separate
license. The maintainer is not currently offering one — if
you need commercial rights, open an issue to start the
conversation.
- The license is permissive about everything except commercial
use: you can fork, modify, deploy in your home/lab, and
contribute back.
## Why this license
The PolyForm Noncommercial license was chosen because:
- It's a real, legal, plainly-worded license (not a custom
half-written variant).
- It permits the realistic uses for a hobby project (the
maintainer's homelab, a friend's fleet, a charity's IT
closet) without inviting commercial vendors to repackage
the work.
- It's compatible with the project staying small and
maintainable — the maintainer doesn't want to be on the hook
for SLA-grade commercial support.
## Contributions
By contributing, you agree your contributions are licensed
under the same PolyForm Noncommercial 1.0.0 license.
@@ -1,73 +0,0 @@
# Alerts and notifications
restic-manager raises alerts on conditions that need human
attention. The alert engine evaluates rules on a 60s tick and
on every job-finished / host-online event.
## Built-in alert kinds
| Kind | Trigger | Severity |
|---------------------|---------|----------|
| `backup_failed` | A backup job ends in `failed` or `cancelled` | warning |
| `forget_failed` | A forget job ends in `failed` | warning |
| `prune_failed` | A prune job ends in `failed` | critical |
| `check_failed` | A check job ends in `failed` | critical |
| `agent_offline` | A host has been offline more than 90s past its heartbeat cadence | warning |
| `stale_schedule` | A schedule's "last run" is more than 1.5 × its interval ago | warning |
| `update_failed` | An agent self-update returned a fail or didn't reconnect within 90s | warning |
| `fleet_update_halted`| The rolling fleet-update worker stopped on a failure | critical |
Each alert has a `dedup_key` so re-firing the same condition
just bumps `last_seen_at` — the operator gets one row per
condition, not a thousand.
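A minimal sketch of that dedup behaviour, assuming an in-memory map of open alerts keyed by `dedup_key`. The `Alert` fields other than `last_seen_at`, the `raise_alert` helper, and the key strings in the usage note are hypothetical; the real engine persists alerts in the database.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    dedup_key: str
    message: str
    first_seen_at: float
    last_seen_at: float


def raise_alert(open_alerts, dedup_key, message, now):
    """Return (alert, is_new). A caller would fan out to channels only when is_new is True."""
    existing = open_alerts.get(dedup_key)
    if existing is not None:
        # Same condition firing again: touch the existing row instead of adding one.
        existing.last_seen_at = now
        existing.message = message
        return existing, False
    alert = Alert(dedup_key, message, first_seen_at=now, last_seen_at=now)
    open_alerts[dedup_key] = alert
    return alert, True
```

Calling `raise_alert` twice with a key such as `"host-1/backup_failed/group-a"` yields one row whose `last_seen_at` moves forward; a different key (say, a second source group) yields a second, independent row.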
## Lifecycle
```
raised ──acknowledge──▶ acknowledged ──resolve──▶ resolved
│ │
└────────auto-resolve──────┘
(e.g. agent_offline auto-resolves on agent_online)
```
- **Acknowledge** says "I've seen this, stop notifying about it".
- **Resolve** says "the underlying condition is gone".
- Some alerts auto-resolve when the condition clears
(`agent_offline` is the canonical example).
## Notification channels
Configure under **Settings → Notifications**. Each channel can
subscribe to all alerts or filter by severity.
### Webhook
Posts a JSON envelope to a URL of your choice. Useful for
piping into Slack via an Incoming Webhook URL or into your own
alerting tooling.
### ntfy
Pushes a plain-text alert to an [ntfy.sh](https://ntfy.sh/)
topic. Configure the topic URL; optional bearer token if you
self-host with auth.
### SMTP
Plain SMTP (with optional TLS). Configure host, port,
username, password, and the recipient list.
## Test fire
Each channel exposes a **Test fire** button that dispatches a
single synthetic alert through the channel without touching the
alert engine. Use this when you've added a channel and want to
verify connectivity before the next real failure happens.
## What gets logged
Every alert raise / acknowledge / resolve writes an audit log
entry. The audit log UI at **Settings → Audit log** filters by
user, action, target, and time range — useful for the
post-incident "who clicked acknowledge on the prune-failure
alert" question.
@@ -1,73 +0,0 @@
# Backups and restores
## Running a backup
Three ways to trigger one:
1. **Scheduled** — the agent's local cron fires at the time set
on the schedule.
2. **Run-now** — operator clicks **Run now** on the host detail
right rail. Posts to `/hosts/{id}/run-backup` (defaults to all
source groups) or to a per-group form for finer control.
3. **API**: `POST /api/hosts/{id}/jobs` with the appropriate
payload. Same audit + dispatch path.
In every case the server creates a `jobs` row, broadcasts a
`command.run` to the host, and lands the operator on the live
job log page (HTMX `HX-Redirect`).
## Cancelling a job
Any running job — backup, forget, prune, restore, anything —
exposes a **Cancel** button on its detail page. The server
broadcasts `command.cancel`, and the agent kills the running
restic subprocess via context cancel: SIGTERM first, SIGKILL
after a 5s grace (`cmd.Cancel` + `cmd.WaitDelay`). On Windows the
SIGTERM step is replaced with `os.Kill` because Windows can't
deliver SIGTERM. Result: a cancelled job lands as `cancelled`
within a couple of hundred milliseconds.
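The terminate-then-kill pattern can be illustrated outside Go as well. The sketch below is a Python analogue of the behaviour described above, not the agent's code; the `cancel` helper and the sleeping child process (a stand-in for a long-running `restic` subprocess) are assumptions made for the example.

```python
import subprocess
import sys


def cancel(proc, grace_seconds=5.0):
    """Ask the process to stop, then force-kill it if it outlives the grace period."""
    proc.terminate()  # SIGTERM on POSIX; TerminateProcess on Windows
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        return proc.wait()


if __name__ == "__main__":
    # Stand-in for a long-running restic subprocess.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit code:", cancel(child))
```

A well-behaved child exits on the first signal, so the grace period is only spent when the process ignores or is slow to handle the polite request.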
## Restore wizard
Restoring a file or path goes through a four-step wizard at
`/hosts/{id}/restore`:
1. **Pick a snapshot.** Search by id or by date; the page is
pre-populated when you launched the wizard from a snapshot row.
2. **Browse the snapshot tree.** Lazy-loaded children via the
`MsgTreeList` synchronous WS RPC; results are cached
per-wizard-session for 30 minutes. Pick the absolute paths
you want.
3. **Choose a target.** Either **In place** (overwrites the
live filesystem; requires you to type the hostname to
confirm) or **New directory** (default
`$HOME/rm-restore/<job-id>/`; agent expands `$HOME` /
`${HOME}` / `~/` and creates the directory chain).
4. **Review and submit.** Server mints a job, dispatches
`command.run` with a `RestorePayload`, and `HX-Redirect`s to
the live job log.
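The home-directory expansion in step 3 could look roughly like the following. This is a sketch under stated assumptions: `expand_home` is a hypothetical helper, the home directory is passed in explicitly, and only the three prefixes named above are recognised.

```python
def expand_home(path, home):
    """Expand a leading $HOME, ${HOME}, or ~ in a restore target path."""
    for prefix in ("${HOME}", "$HOME", "~"):
        # Only expand when the prefix is the whole path or is followed by a slash,
        # so "$HOMEDIR/x" is left untouched.
        if path == prefix or path.startswith(prefix + "/"):
            return home.rstrip("/") + path[len(prefix):]
    return path
```

Absolute paths pass through unchanged, so a target such as `/srv/restore` is used as given; creating the directory chain afterwards would be a separate step (for example `os.makedirs(target, exist_ok=True)`).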
`--no-ownership` is gated on restic ≥ 0.17 (the flag was added
in that release). Hosts running 0.16 don't get the flag and
restore as the running user instead.
## Snapshot diff
Two snapshot ids in the **Diff** form on the host detail page →
a `JobDiff` job that runs `restic diff <a> <b>`. Output streams
to the standard live job log. Useful when investigating a
suspiciously-sized backup.
## Job log artefacts
Every job's log is persisted in `job_logs` (one row per line),
not just streamed in-memory. That gives you:
- A live view at `/jobs/{id}` while the job runs.
- Two download formats from the same page header dropdown:
- **txt** — one line per row, `HH:MM:SS.mmm TAG payload`.
- **ndjson** — one self-contained JSON object per line
(`{seq, ts, stream, payload}`), perfect for `jq`.
Downloads work whether the job is running or finished —
the source is the DB, not the live socket.
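As an alternative to `jq`, a downloaded NDJSON log can be processed with a few lines of Python. The field names come from the format above; the helper names and the stream labels used in the usage note (`stdout`, `stderr`) are assumptions, so check a real download for the actual values.

```python
import json


def parse_ndjson(text):
    """Parse one JSON object per non-empty line."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def payloads_for_stream(records, stream):
    """Return payloads of one stream, ordered by sequence number."""
    ordered = sorted(records, key=lambda r: r["seq"])
    return [r["payload"] for r in ordered if r["stream"] == stream]
```

For example, `payloads_for_stream(parse_ndjson(open("job.ndjson").read()), "stderr")` would list only the error-stream lines, assuming the log was saved as `job.ndjson` and uses that stream label.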
@@ -1,61 +0,0 @@
# Observability with Prometheus
restic-manager can expose a Prometheus scrape endpoint at
`GET /metrics`. The endpoint is **opt-in** — without an explicit
auth gate it isn't even mounted, so a forgotten config can't
accidentally publish fleet state.
The full reference lives at
[`docs/prometheus.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/docs/prometheus.md);
the short version follows.
## Enable the endpoint
Set at least one of:
- `RM_METRICS_TOKEN`: `Authorization: Bearer <token>` required.
- `RM_METRICS_TRUSTED_CIDR`: restricts source IPs (comma-separated CIDRs).
When both are set, a request must satisfy both. The token is
compared in constant time, and the CIDR check honours
`X-Forwarded-For` only when the immediate hop matches
`RM_TRUSTED_PROXY`.
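The gate logic can be sketched as below. It is an illustration of the rules just described, not the server's implementation: `metrics_allowed` is a hypothetical name, and `source_ip` is assumed to be the address already resolved through the trusted-proxy check.

```python
import hmac
import ipaddress


def metrics_allowed(auth_header, source_ip, token=None, trusted_cidr=None):
    """Decide whether a scrape of /metrics should be served."""
    if not token and not trusted_cidr:
        # Neither gate configured: the endpoint is treated as not mounted.
        return False
    if token:
        expected = ("Bearer " + token).encode()
        supplied = (auth_header or "").encode()
        # compare_digest avoids leaking how many leading bytes matched.
        if not hmac.compare_digest(supplied, expected):
            return False
    if trusted_cidr:
        ip = ipaddress.ip_address(source_ip)
        networks = [
            ipaddress.ip_network(cidr.strip())
            for cidr in trusted_cidr.split(",")
            if cidr.strip()
        ]
        if not any(ip in net for net in networks):
            return False
    return True
```

Configuring only one of the two variables enforces only that check; configuring both means a valid token from an address outside the CIDR list is still refused.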
## Metrics emitted
- **Server gauges**: `rm_hosts_total`, `rm_hosts_online`,
`rm_active_alerts{severity}`, `rm_build_info{...}`.
- **Per-host gauges**: `rm_host_agent_online`,
`rm_host_last_backup_timestamp_seconds`,
`rm_host_last_backup_success`, `rm_host_repo_size_bytes`,
`rm_host_snapshot_count`, `rm_host_open_alerts`,
`rm_host_repo_status`.
- **Histogram**:
`rm_job_duration_seconds{kind,status,le=…}` (buckets
`1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf`).
In-memory histogram only. Prometheus persists the scrapes; if
you need durable history at hourly resolution that's
Prometheus's job.
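To make the bucket layout concrete, here is a small sketch of how observations map onto cumulative `le` buckets. Only the bucket boundaries come from the list above; the `cumulative_counts` helper is hypothetical and follows the usual Prometheus convention that each bucket counts every observation less than or equal to its bound.

```python
import bisect

# Upper bounds in seconds, as listed above; "+Inf" is handled separately.
BUCKETS = [1, 5, 30, 60, 300, 1800, 3600, 21600, 86400]


def cumulative_counts(durations):
    """Map job durations (seconds) to cumulative counts per `le` label."""
    per_bucket = [0] * (len(BUCKETS) + 1)  # final slot is +Inf
    for seconds in durations:
        # bisect_left finds the first bound >= seconds, so bounds are inclusive.
        per_bucket[bisect.bisect_left(BUCKETS, seconds)] += 1
    labels = [str(b) for b in BUCKETS] + ["+Inf"]
    running = 0
    result = {}
    for label, count in zip(labels, per_bucket):
        running += count
        result[label] = running
    return result
```

A 45-second backup therefore shows up in the `60` bucket and every larger one, which is what lets a p95 query interpolate across bucket boundaries.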
## Sample Grafana dashboard
[`deploy/grafana/restic-manager-dashboard.json`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/deploy/grafana/restic-manager-dashboard.json)
imports through Grafana's **+ → Import → Upload JSON file**.
Six panels:
1. Fleet status (online / total).
2. Open alerts by severity.
3. Backups failing on most-recent run.
4. Hosts table — last backup, repo size, snapshots, open alerts.
5. Repo size over time, one line per host.
6. Job-duration p95 over a 1h window per kind.
## Alerting
restic-manager already has a built-in alert engine
([Alerts](./alerts.md)). The dashboard intentionally doesn't
duplicate it as Prometheus alert rules. If you want
Prometheus-side alerts on top, write your own based on the
metrics above — `rm_host_last_backup_success == 0`,
`time() - rm_host_last_backup_timestamp_seconds > <max age>`,
or whatever suits your environment.
@@ -1,50 +0,0 @@
# Updating agents
Server updates are a `docker compose pull && docker compose up -d`
away. Agents update via the control plane.
## Single-host update
Each host's detail page shows an **Update agent** button when
the agent's reported version is older than the server's. The
button:
1. Dispatches a `command.update` to that host.
2. The agent fetches the appropriate binary from
`$RM_SERVER/agent/binary?os=…&arch=…` to
`<binary-path>.new`.
3. Copies the running binary to `<binary-path>.old` (one
revision back, in case rollback is needed).
4. Atomic-renames `.new` over the running binary.
5. Exits cleanly. systemd's `Restart=always` (or Windows SCM)
brings the process back on the new binary.
A 90-second timer on the server side waits for a hello at the
target version and marks the update succeeded — or, if the
agent doesn't reconnect at the expected version in time, marks
the update **failed** and raises an `update_failed` alert.
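Steps 2 to 4 of the agent-side sequence can be sketched as follows. This is a Python illustration of the file handling only, under the assumption that the downloaded bytes are already in hand; `swap_binary` is a hypothetical helper, and the real agent is a Go binary that performs the fetch itself.

```python
import os
import shutil
import tempfile


def swap_binary(binary_path, new_bytes):
    """Write <path>.new, keep the current binary as <path>.old, then rename over it."""
    new_path = binary_path + ".new"
    old_path = binary_path + ".old"
    with open(new_path, "wb") as fh:
        fh.write(new_bytes)
        fh.flush()
        os.fsync(fh.fileno())  # make sure the bytes are on disk before the rename
    shutil.copymode(binary_path, new_path)  # carry over the executable bits
    shutil.copy2(binary_path, old_path)     # one revision back, for rollback
    os.replace(new_path, binary_path)       # atomic on POSIX within one filesystem


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "agent")
        with open(path, "wb") as fh:
            fh.write(b"v1")
        swap_binary(path, b"v2")
        print(sorted(os.listdir(tmp)))
```

Writing to a sibling `.new` file first means a failed or partial download never touches the running binary; only the final rename changes what the service manager will start next.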
## Fleet update
The admin-only **Settings → Fleet update** page drives a rolling
update across every host in the fleet:
- One host at a time.
- Wait for hello-with-target-version (max 95s).
- On any host failing, **halt** the rollout, raise a
`fleet_update_halted` alert, leave the rest of the fleet on
the old version. No surprise mass-failures.
You can cancel an in-progress fleet update; the worker stops
after the current host finishes.
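The halt-on-first-failure rule is easy to express as a loop. The sketch below is illustrative: `rolling_update` and the `update_host` callback (which would dispatch the update and wait for the hello at the target version) are hypothetical names, not the worker's API.

```python
def rolling_update(hosts, update_host):
    """Update hosts one at a time; stop at the first failure.

    update_host(host) -> bool: True when the host came back on the target version in time.
    """
    updated = []
    for index, host in enumerate(hosts):
        if not update_host(host):
            # Halt: everything after this host stays on the old version.
            return {"updated": updated, "halted_at": host, "remaining": hosts[index + 1:]}
        updated.append(host)
    return {"updated": updated, "halted_at": None, "remaining": []}
```

Because the loop returns as soon as one host fails, a bad build reaches at most one endpoint before an operator gets the chance to look at the `fleet_update_halted` alert.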
## TLS and corruption
Updates rely on the reverse proxy's TLS to detect corruption in
transit. There's no separate sha256 verification step — we
chose the simpler model on the basis that the same TLS already
gates every other byte the server hands to the agent.
If you'd like a separate signature step before applying updates,
that's a future-phase enhancement (see `tasks.md` Phase 6
candidates).
@@ -1,58 +0,0 @@
# Environment variables
The server reads its configuration from environment variables
(canonical) with an optional YAML overlay. Env wins over YAML so
operators can tweak a single setting without rewriting the file.
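The precedence rule can be pictured as a three-layer merge. This is a sketch, not the server's loader: `effective_config` is a hypothetical helper, the YAML overlay is assumed to use the same key names as the environment variables, and treating an empty environment value as unset is also an assumption.

```python
def effective_config(defaults, yaml_values, env):
    """Merge configuration layers: defaults < YAML overlay < environment."""
    merged = dict(defaults)
    merged.update({k: v for k, v in yaml_values.items() if v is not None})
    # Only RM_* variables are considered; an empty value is treated as unset.
    merged.update({k: v for k, v in env.items() if k.startswith("RM_") and v != ""})
    return merged
```

With a YAML file setting `RM_LISTEN` to `:9090` and the environment setting it to `:7070`, the merged value is `:7070`, while settings that appear only in the file are kept as written.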
## Server
| Variable | Default | Meaning |
|---------------------------|----------------------------------|---------|
| `RM_LISTEN` | `:8080` | TCP listener for the HTTP server. |
| `RM_DATA_DIR` | `/data` | Persistent state directory (SQLite, secret key, agent assets). |
| `RM_BASE_URL` | (none) | Public URL clients use; required for OIDC redirects + cookie scope. |
| `RM_SECRET_KEY_FILE` | `${RM_DATA_DIR}/secret.key` | Path to the AEAD key file. Auto-generated on first run. |
| `RM_COOKIE_SECURE` | `true` | Set `false` only for local HTTP testing. Controls `Secure` on session cookies. |
| `RM_TRUSTED_PROXY` | (none) | Comma-separated CIDRs trusted for `X-Forwarded-*`. |
| `RM_BUNDLED_ASSETS_DIR` | `/opt/restic-manager/dist` | Read-only path with bundled agent binaries + install scripts (the docker image bakes them here). |
| `RM_METRICS_TOKEN` | (off) | When set, `GET /metrics` requires `Authorization: Bearer <token>`. |
| `RM_METRICS_TRUSTED_CIDR` | (off) | When set, `GET /metrics` restricts source IPs (comma-CIDR). |
OIDC variables (all optional; empty issuer disables OIDC):
| Variable | Meaning |
|--------------------------------|---------|
| `RM_OIDC_ISSUER` | OIDC discovery URL (e.g. `https://auth.example.com`). |
| `RM_OIDC_CLIENT_ID` | Client ID registered with the IdP. |
| `RM_OIDC_CLIENT_SECRET` | Client secret (or use `RM_OIDC_CLIENT_SECRET_FILE`). |
| `RM_OIDC_CLIENT_SECRET_FILE` | Path to a file holding the client secret. |
| `RM_OIDC_DISPLAY_NAME` | Button label on the login page (e.g. "Authelia"). |
| `RM_OIDC_ROLE_CLAIM` | Token claim that carries roles (default `groups`). |
| `RM_OIDC_ROLE_MAPPING` | `idp-group=role` entries, comma-separated (e.g. `rm-admin=admin,rm-ops=operator`). |
| `RM_OIDC_REDIRECT_URL` | Override for the redirect URL; defaults to `${RM_BASE_URL}/auth/oidc/callback`. |
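The `RM_OIDC_ROLE_MAPPING` format is simple enough to parse by hand. The sketch below shows one way to read it; the helper names are hypothetical, the role names are taken from the RBAC ordering `viewer < operator < admin`, and granting the highest matching role when a user is in several mapped groups is an assumption rather than documented behaviour.

```python
ROLE_RANK = {"viewer": 0, "operator": 1, "admin": 2}


def parse_role_mapping(raw):
    """Parse 'idp-group=role,idp-group=role' into a dict."""
    mapping = {}
    for entry in raw.split(","):
        entry = entry.strip()
        if not entry:
            continue
        group, sep, role = entry.partition("=")
        group, role = group.strip(), role.strip()
        if not sep or not group or role not in ROLE_RANK:
            raise ValueError(f"bad role mapping entry: {entry!r}")
        mapping[group] = role
    return mapping


def role_for_groups(groups, mapping):
    """Return the highest-ranked mapped role, or None when no group matches."""
    roles = [mapping[g] for g in groups if g in mapping]
    if not roles:
        return None
    return max(roles, key=ROLE_RANK.__getitem__)
```

Using the example value from the table, a user whose claim lists `rm-ops` and an unrelated group would resolve to `operator`, and a user with no mapped group would get no role at all.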
## Agent
| Variable | Default | Meaning |
|----------------------|---------|---------|
| `RM_AGENT_CONFIG` | `/etc/restic-manager/agent.yaml` (Linux) | Config file path. |
The agent's other settings live in the YAML file (server URL,
bearer token, optional cert pin). The install script writes that
file for you at enrolment.
## Build-time
The Makefile threads `-ldflags` from `git describe` into the
`internal/version` package so `--version` and the dashboard
footer show the right values:
```
-X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=$(VERSION)
-X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Commit=$(COMMIT)
```
If you build with `go build` directly (no Makefile), `Version`
falls back to `dev` and the agent-update comparison falls back
to "always equal". Source-build deployments can still run; they
just don't participate in the self-update flow.
@@ -1,82 +0,0 @@
# HTTP endpoints
A non-exhaustive map of the surfaces the control plane exposes.
All `/api/*` routes return JSON; all other paths render HTML
(server-rendered with HTMX in the loop).
The canonical wiring lives at
[`internal/server/http/server.go`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/internal/server/http/server.go);
when in doubt, read the routes block there.
## Public (no auth)
| Method | Path | Purpose |
|--------|----------------------------|---------|
| GET | `/healthz` | Liveness probe. Returns 204. |
| POST | `/api/auth/login` | Local-user login. JSON body: `{username, password}`. |
| POST | `/api/auth/logout` | Invalidate the session cookie. |
| POST | `/api/bootstrap` | First-run admin creation. Accepts the token printed at first start. |
| POST | `/api/agents/enroll` | Token-based agent enrolment. |
| POST | `/api/agents/announce` | Announce-and-approve agent enrolment. |
| GET | `/agent/binary?os=&arch=` | Serves the agent binary for the install scripts. |
| GET | `/install/*` | Serves the Linux + Windows install scripts and the systemd unit. |
| GET | `/api/version` | Build version + commit JSON. |
| GET | `/metrics` | Prometheus exposition (only when opted-in via `RM_METRICS_TOKEN` / `RM_METRICS_TRUSTED_CIDR`). |
| GET | `/login`, `/setup`, `/bootstrap` | UI pages. |
## Authenticated (any role)
| Method | Path | Purpose |
|--------|------------------------------------------|---------|
| GET | `/` | Dashboard. |
| GET | `/hosts/{id}` | Host detail. |
| GET | `/hosts/{id}/repo` | Repo tab. |
| GET | `/hosts/{id}/jobs` | Jobs tab. |
| GET | `/hosts/{id}/sources` | Source groups list. |
| GET | `/hosts/{id}/schedules` | Schedules list. |
| GET | `/jobs/{id}` | Live job log. |
| GET | `/api/hosts`, `/api/fleet/summary` | JSON list + summary. |
| GET | `/api/jobs/{id}/stream` | WebSocket subscription to a job's live log. |
| GET | `/api/jobs/{id}/log.{txt,ndjson}` | Persisted log download. |
## Operator role and above
| Method | Path | Purpose |
|--------|---------------------------------------|---------|
| POST | `/hosts/{id}/run-backup` | Run-now (HTMX form-post). |
| POST | `/hosts/{id}/sources/{gid}/run-now` | Per-source-group run-now. |
| POST | `/hosts/{id}/repo/{prune,check,unlock,reinit,probe}` | Maintenance actions. |
| POST | `/api/hosts/{id}/snapshots/diff` | Snapshot-diff job. |
| POST | `/hosts/{id}/restore` | Restore wizard submit. |
| POST | `/api/jobs/{id}/cancel` | Cancel a running job. |
| POST | `/hosts/{id}/tags` | Update host tags. |
| POST | `/hosts/{id}/sources` and friends | Source-group CRUD. |
| POST | `/hosts/{id}/schedules` and friends | Schedule CRUD. |
| POST | `/hosts/{id}/repo/credentials`, `/admin-credentials` | Credential update. |
## Admin role only
| Method | Path | Purpose |
|--------|---------------------------------------|---------|
| POST | `/hosts/new` | Mint enrolment token (Add host). |
| POST | `/hosts/{id}/delete` | Delete + cascade. |
| POST | `/hosts/{id}/update` | Dispatch a single agent update. |
| GET/POST | `/settings/users/...` | User management. |
| POST | `/settings/notifications/...` | Notification channel CRUD + test fire. |
| POST | `/settings/fleet-update/...` | Fleet-update worker. |
## WebSocket
| Path | Who connects | Auth |
|--------------------------------|--------------|------|
| `/ws/agent` | Agent | Bearer token issued at enrolment. |
| `/ws/agent/pending` | Agent (announce flow) | Pending-id query param. |
| `/api/jobs/{id}/stream` | Browser | Session cookie. |
## RBAC enforcement
Routes are grouped into chi route-groups by required role
(`viewer < operator < admin`); the `requireRole` middleware in
`internal/server/http/middleware.go` is the bouncer. Sessions
re-validate `disabled_at` on every request, so a disabled user's
cookie stops working immediately.
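The ordering check at the heart of that middleware can be sketched as a small function. This is not the Go `requireRole` code: `check_access`, the dict-shaped user record, and the specific status codes returned are assumptions made for illustration.

```python
ROLE_RANK = {"viewer": 0, "operator": 1, "admin": 2}


def check_access(user, minimum_role):
    """Return an HTTP-style status for a request needing at least minimum_role."""
    # No session, or the user was disabled since the cookie was issued.
    if user is None or user.get("disabled_at") is not None:
        return 401
    if ROLE_RANK[user["role"]] < ROLE_RANK[minimum_role]:
        return 403
    return 200
```

Checking `disabled_at` on every call, rather than once at login, is what makes disabling a user take effect on their very next request.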
@@ -1,32 +0,0 @@
# Roadmap
The live roadmap is in
[`tasks.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/tasks.md).
Phases ship in order; items inside a phase ship as the
opportunity arises.
## Status snapshot
| Phase | Theme | Status |
|-------|--------------------------------------------------|--------|
| 0 | Project bootstrap | ✅ done |
| 1 | MVP: enrolment, visibility, on-demand backup | ✅ done |
| 2 | Scheduling, retention, repo operations | ✅ done |
| 3 | Restore, alerts, audit | ✅ done |
| 4 | RBAC, OIDC, host tags | ✅ done |
| 5 | OSS readiness | 🚧 in flight (this docs site is part of it) |
| 6 | Update delivery + observability polish | ✅ done |
## What's not on the roadmap
The non-goals list in [`spec.md` §2](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/spec.md):
- Replacing restic itself or providing custom repo formats
- Managing non-restic backup tools
- Multi-tenancy / SaaS deployment
- High availability of the control plane (SQLite, single-instance)
- Mobile-native apps (responsive web only)
If something there is critical to your use case, restic-manager
isn't the right tool. That's not a closed door — it's a
deliberate scope decision so the project stays maintainable.
@@ -1,35 +0,0 @@
# Reporting vulnerabilities
The full disclosure policy lives in
[`SECURITY.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/SECURITY.md)
at the repo root. The short version:
- **Don't open a public issue.**
- Send a Gitea private message to `steve` on
<https://gitea.dcglab.co.uk>, or email the address on the
maintainer's profile, with a subject like
`[SECURITY] restic-manager: <one-line summary>`.
- Expect an acknowledgement within 3 working days; escalate
through the other channel if you don't get one.
- Default disclosure window is **30 days from confirmed report
to public disclosure**, faster if a PoC is already
circulating, slower only by mutual agreement.
## What to include
A description of the issue and the impact, the affected
component (server / agent / install script / docs), the version,
and reproduction steps. A working PoC is welcome but not
required — a credible threat model is enough.
## In scope vs. out of scope
See the full policy. Quick highlights:
- **In scope:** server, agent, install scripts, docker image,
docker-compose reference, crypto choices, docs that lead to
insecure configs.
- **Out of scope:** restic itself (report upstream), unpatched
third-party deps (report upstream first), pre-authenticated
admin abuse (admins are designed to have full power), DoS on
deployments without the recommended reverse proxy.
@@ -1,72 +0,0 @@
# Hardening checklist
A baseline for new deployments. Most of these are defaults; the
list is here to make audit easy.
## Server
- [ ] Reverse proxy in front, TLS terminating at the proxy
(Caddy/nginx/Traefik).
- [ ] `RM_TRUSTED_PROXY` set to the proxy's CIDR.
- [ ] `RM_BASE_URL` matches the public hostname and the cookie
scope you want.
- [ ] `RM_COOKIE_SECURE=true` (the default; only set `false`
for local HTTP testing).
- [ ] HTTP listener bound to **localhost** in the compose file,
not `0.0.0.0`. The reverse proxy is the only thing that
should reach it.
- [ ] `secret.key` backed up separately from the database.
- [ ] Bootstrap token consumed and the printed log line scrubbed
from any log archive.
## Authentication
- [ ] Admin user has a password ≥ 12 characters (the floor).
- [ ] OIDC enabled if you have an IdP — local password auth
stays as a break-glass.
- [ ] Disabled (not deleted) any users who change roles or leave
so their session is invalidated immediately.
- [ ] The last-admin guard isn't tripped — there's always at
least one enabled admin user.
## Repo credentials
- [ ] Append-only credential set as the everyday cred for every
host.
- [ ] Admin credential set only where prune cadence is enabled.
- [ ] No credentials reused across hosts. Each host should have
its own credential pair so a single host compromise has a
single blast radius.
- [ ] If using rest-server, `--append-only` flag is on for the
everyday user; the prune user is a separate identity.
## Agent
- [ ] Agent runs as `root` (Linux) or `LocalSystem` (Windows)
**only when** the source paths require it. Otherwise pin
a service user that has read access to what's backed up
and nothing else.
- [ ] systemd unit's sandboxing flags are intact
(`NoNewPrivileges`, `Protect*`, `MemoryDenyWriteExecute`).
- [ ] Agent's config file `/etc/restic-manager/agent.yaml` is
mode `0600` and owned by the service user. The bearer
token lives in there.
## Operations
- [ ] Alerts wired to a real channel (webhook into Slack,
ntfy topic, SMTP) — not just sitting in the UI.
- [ ] Test-fire each notification channel after configuring.
- [ ] Audit-log retention is long enough to cover the operator's
incident-response window.
- [ ] Prometheus endpoint, if enabled, gated by token AND CIDR
where practical (default is opt-in / off).
## Recovery
- [ ] A documented procedure for rotating a leaked agent bearer
(delete + re-enrol the host).
- [ ] A test-restore done at least once, end-to-end, before
relying on the system in anger.
- [ ] `secret.key` and the SQLite database covered by separate
backup paths so neither alone reconstitutes the other.
@@ -1,110 +0,0 @@
# Threat model
This page documents what restic-manager defends against, what it
doesn't, and the trust assumptions a deployment is making. The
canonical version lives in [`spec.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/spec.md)
§11; the summary here is shaped for operators rather than
implementers.
## Trust boundaries
```
┌──────────────────────────────────────────┐
│ TRUSTED zone │
│ ┌─────────────┐ ┌──────────────┐ │
│ │ Operator's │ │ Reverse │ │
│ │ browser │◄──►│ proxy │ │ TLS terminates here
│ └─────────────┘ └──────┬───────┘ │
└────────────────────────────┼─────────────┘
│ HTTP, plaintext
│ (loopback or trusted LAN)
┌────────────────────────────▼─────────────┐
│ Server (control plane) │
└────────────┬─────────────────────────────┘
│ outbound WebSocket (TLS to clients via proxy)
│ — bearer-authenticated
┌────────────▼──────────────┐
│ Agent (per host) │ ◄── attacker model: assume one
└────────────┬──────────────┘ endpoint can be compromised
│ subprocess
restic ──▶ repository (rest-server / S3 / SFTP / …)
```
## What we defend against
### Network attacker between operator and server
- HTTPS via the reverse proxy is the only operator-facing surface
on a sane deployment.
- `RM_COOKIE_SECURE=true` (default) means the session cookie
refuses to ride a non-HTTPS connection.
- `RM_TRUSTED_PROXY` gates whether `X-Forwarded-*` is honoured;
a bypassing request can't spoof the client IP.
### Compromised agent host
- The agent's bearer token can dispatch commands **only on its
own host**. It can't read other hosts' state, dispatch jobs
on other hosts, or escalate within the control plane.
- If you suspect a host compromise:
1. Disable the agent's host row from **Hosts → Delete**
(cascades the bearer hash).
2. Rotate the repo credential at the rest-server / object
store side.
3. Audit-log lists every action that bearer ever drove.
### DB compromise without the secret key
- Repo credentials are AEAD-encrypted at rest. A DB dump alone
doesn't expose them.
- Agent bearer **hashes** are leaked; that's enough to
authenticate as any agent until you revoke. A rotation
procedure is just "delete + re-enrol" today.
- Operator passwords are bcrypt-hashed; OIDC users have no
password to leak.
- Session tokens are hashed; an attacker can't replay a
session from a DB dump.
### DB compromise WITH the secret key
The attacker can decrypt every credential. Treat
`secret.key` with the same care as a password manager database.
Back it up to a separate vault, not to the same Docker volume
as the database.
### Forget/prune as a DoS vector
- The everyday backup credential cannot prune (append-only).
- The admin credential is only pushed to the agent at the
moment of dispatch and discarded after the job ends.
- Compromise of a single agent host does **not** grant prune
rights — at worst the attacker gets fresh write access until
the credential is rotated.
### Operator-side typo or bad copy-paste
- Repo credentials are stored encrypted; mis-typed creds fail
fast on the next `restic` invocation rather than silently
corrupting state.
- NS-03 added auto-init: the first dispatched job after creds
change runs `restic init`, surfaces the error eagerly under
the host's vitals strip if the creds are bad, and resets the
host's `repo_status` so the operator can retry without
hunting through job logs.
## What we don't defend against
- **Insider threat at the maintainer level.** A malicious
maintainer can publish a backdoored container; SBOM /
signing infrastructure (Phase 6 candidate) would help here
but isn't shipped today.
- **Supply chain.** We pin module versions (`go.sum`) and
pin the Tailwind binary's release tag, but a compromise in
one of those upstreams would land here.
- **Side-channel via restic itself.** A bug in restic that
enables snapshot-content disclosure is restic's problem; the
control plane doesn't see snapshot bytes either way.
- **DoS via resource exhaustion** without the recommended
reverse-proxy / rate-limit in front. Don't expose the
server's HTTP port to the public internet directly.
@@ -1,120 +0,0 @@
# End-to-end test harness
The e2e harness stands up the full production-shaped stack
(server + agent + rest-server) in Docker Compose and drives it
through Playwright. CI runs it on every PR; operators can run it
locally too.
## Files
```
e2e/
├── compose.e2e.yml          compose stack: server + rest-server + agent
├── Dockerfile.agent         Linux container for the agent (alpine + restic)
├── agent-entrypoint.sh      decides between announce / token-enrol / run
└── playwright/
    ├── package.json
    ├── playwright.config.ts
    └── tests/
        ├── lib/server.ts    bootstrap, login, accept, poll helpers
        └── smoke.spec.ts    happy-path: enrol → backup → succeeded
```
## Local run
Prerequisites: Docker + Docker Compose, and `npx` for Playwright.
```sh
# 1. Build + bring up the stack (server, rest-server, source data).
docker compose -f e2e/compose.e2e.yml up --build -d server rest-server source-fixture
# 2. Wait for the server, then scrape the bootstrap token from the log.
until curl -fsS http://127.0.0.1:8080/api/version >/dev/null; do sleep 1; done
RM_BOOTSTRAP_TOKEN=$(docker compose -f e2e/compose.e2e.yml logs server \
  | grep -Eo '[a-zA-Z0-9_-]{40,}' | head -1)
export RM_BOOTSTRAP_TOKEN
# 3. Start the agent (it announces against the running server).
docker compose -f e2e/compose.e2e.yml up -d agent
# 4. Install + run Playwright.
cd e2e/playwright
npm install
npx playwright install --with-deps chromium
npx playwright test
```
When the test passes you'll see:
```
Running 2 tests using 1 worker
✓ smoke: enrol-via-announce → backup happy path completes in under a minute (47s)
✓ smoke: scrape /metrics metrics endpoint exposes the host gauge (180ms)
2 passed (47.5s)
```
Tear-down:
```sh
docker compose -f e2e/compose.e2e.yml down -v
```
`-v` removes the named volumes too — important between runs because
the rest-server volume holds an initialised repo and the
agent-config volume holds a stale bearer.
## What the test exercises
1. **Bootstrap.** Posts the admin-creation request to
`/api/bootstrap` with the token scraped from the server log.
2. **Login (UI).** Drives the login form via Playwright; verifies
the dashboard loads with a session cookie set.
3. **Pending host appears.** Polls the dashboard for the inline
accept form generated by the announcing agent; reads the
pending-id out of its action URL.
4. **Accept.** POSTs `/api/pending-hosts/{id}/accept` with the
rest-server URL + repo password. The server mints a Host row
+ bearer + AEAD-encrypted creds and pushes the bearer down
the still-open pending WebSocket.
5. **Online + auto-init.** Polls `/api/hosts` until the new host
is `status=online`. Auto-init runs as part of this — the
first dispatched job after creds save is `restic init`.
6. **Run backup.** Submits the host detail page's `Run now`
form; expects `HX-Redirect` to the live job page.
7. **Verify.** Polls `/api/hosts` until the host's
`last_backup_status` flips to `succeeded`.
8. **Metrics.** Scrapes `/metrics` and asserts the
server-gauge + build-info lines are present (the compose
stack opens the endpoint via `RM_METRICS_TRUSTED_CIDR=0.0.0.0/0`).
## CI workflow
[`.gitea/workflows/e2e.yml`](../.gitea/workflows/e2e.yml) runs the
suite on every PR into `main`. On failure it dumps the last 200
lines of each container log as a workflow annotation and uploads
the Playwright HTML report as an artefact.
## When tests fail
- **Pending host never appears.** Agent container probably
couldn't reach the server. Check `docker compose logs agent`
for connection errors and `docker compose logs server` for
any 4xx on `/api/agents/announce`.
- **Backup hangs in `running`.** The agent shells out to
`restic`; check the live job log at
`http://127.0.0.1:8080/jobs/<id>` (still up after a
failed test as long as you didn't `down -v`).
- **`RM_BOOTSTRAP_TOKEN not set`.** The server log scrape
matched the wrong line or the token regex is too tight. The
server prints the token on a line starting with `    ` (four
spaces) inside a banner; widen the regex if your server log
format changes.
## Adding new tests
The harness is intentionally flat — one `*.spec.ts` per
scenario. Reuse the helpers in `lib/server.ts` and avoid
duplicating bootstrap / login boilerplate. Heavy fixtures
(custom users, OIDC IdP) belong in their own compose override
file rather than complicating `compose.e2e.yml`.
@@ -1,139 +0,0 @@
# Prometheus + Grafana
restic-manager exposes a Prometheus scrape endpoint at `GET /metrics`.
The endpoint is **opt-in** — it is not mounted at all unless you set
at least one of the auth gates below. Once enabled, it serves the
standard `text/plain` exposition format that every Prometheus
release since 2.x parses without configuration.
A sample Grafana dashboard lives at
`deploy/grafana/restic-manager-dashboard.json`.
## Enable the endpoint
Two switches, both off by default. If both are set, both must pass
(token AND source-IP); if only one is set, that gate alone
authorises a scrape.
| Env var | YAML key | Effect |
|----------------------------|------------------------|--------|
| `RM_METRICS_TOKEN` | `metrics_token` | Requires `Authorization: Bearer <token>`. Compared in constant time. |
| `RM_METRICS_TRUSTED_CIDR` | `metrics_trusted_cidrs` (list) | Restricts the source IP to one of the listed CIDRs. Comma-separated in env, list in YAML. Honours `X-Forwarded-For` only when the immediate hop matches `RM_TRUSTED_PROXY`. |
When neither is set, `GET /metrics` returns 404 — the route is not
registered with the chi router so a forgotten config can't
accidentally publish fleet state.
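The gating rule above can be sketched in Go as follows. This is a minimal illustration, not the server's actual handler: the `metricsGate` type and the `check` wrapper are hypothetical names, and the sketch assumes the source IP has already been resolved (including any `X-Forwarded-For` handling).

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/netip"
)

// metricsGate is a hypothetical holder for the two opt-in switches.
type metricsGate struct {
	token string         // empty: token gate off
	cidrs []netip.Prefix // empty: source-IP gate off
}

// enabled reports whether /metrics should be registered at all.
func (g metricsGate) enabled() bool {
	return g.token != "" || len(g.cidrs) > 0
}

// allow applies every configured gate; an unset gate is skipped.
func (g metricsGate) allow(bearer string, src netip.Addr) bool {
	if !g.enabled() {
		return false // route would not exist; the caller answers 404
	}
	if g.token != "" &&
		subtle.ConstantTimeCompare([]byte(bearer), []byte(g.token)) != 1 {
		return false
	}
	if len(g.cidrs) == 0 {
		return true
	}
	for _, p := range g.cidrs {
		if p.Contains(src) {
			return true
		}
	}
	return false
}

// check is a string-only convenience wrapper for the demo below.
// An empty cidr leaves the source-IP gate off.
func check(token, cidr, bearer, src string) bool {
	g := metricsGate{token: token}
	if cidr != "" {
		g.cidrs = []netip.Prefix{netip.MustParsePrefix(cidr)}
	}
	return g.allow(bearer, netip.MustParseAddr(src))
}

func main() {
	fmt.Println(check("s3cret", "10.0.0.0/8", "s3cret", "10.1.2.3"))  // both gates pass
	fmt.Println(check("s3cret", "10.0.0.0/8", "s3cret", "192.0.2.1")) // wrong source IP
	fmt.Println(check("", "", "anything", "10.1.2.3"))                // endpoint disabled
}
```

With only one switch configured, the other check is skipped entirely, which matches the "that gate alone authorises a scrape" wording.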
### Example: Docker
```yaml
services:
  restic-manager:
    image: gitea.dcglab.co.uk/steve/restic-manager:latest
    environment:
      RM_METRICS_TOKEN_FILE: /run/secrets/rm_metrics_token
      RM_METRICS_TRUSTED_CIDR: "10.0.0.0/8"
    secrets:
      - rm_metrics_token
```
(`RM_METRICS_TOKEN_FILE` is not currently supported — set
`RM_METRICS_TOKEN` directly. The `_FILE` convention is on the
roadmap.)
## Prometheus scrape config
Drop into your `prometheus.yml`:
```yaml
scrape_configs:
  - job_name: restic-manager
    metrics_path: /metrics
    scheme: https  # via your reverse proxy
    static_configs:
      - targets: ['restic.example.com']
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/secrets/rm_metrics_token
```
If you don't run a TLS-terminating proxy in front, drop `scheme:
https` (the server is HTTP-only — see `docs/reverse-proxy.md`).
## Metric reference
All names are `rm_`-prefixed. Per-host metrics carry a `host_id`
label (the stable ULID, immune to renames) and a `host` label
(the human-readable name).
### Server gauges
| Name | Labels | Description |
|-----------------------|------------------------------------|-------------|
| `rm_hosts_total` | — | Total number of enrolled hosts (excludes pending announces). |
| `rm_hosts_online` | — | Number of hosts with `status='online'`. |
| `rm_active_alerts` | `severity` ∈ {info, warning, critical} | Open alerts by severity. |
| `rm_build_info` | `version, commit, go_version` | Always 1; pure label-bag for joining. |
### Per-host gauges
| Name | Description |
|--------------------------------------------|-------------|
| `rm_host_agent_online` | 1 if the agent is currently online, 0 otherwise. |
| `rm_host_last_backup_timestamp_seconds` | Unix timestamp of the host's most recent backup. **Omitted** for hosts with no backup yet. |
| `rm_host_last_backup_success` | 1 if the most recent backup succeeded, 0 otherwise. **Omitted** for hosts with no backup yet. |
| `rm_host_repo_size_bytes` | Latest reported repo size from `restic stats --mode raw-data`. **Omitted** when unknown. |
| `rm_host_snapshot_count` | Number of restic snapshots known on the host's repo. |
| `rm_host_open_alerts` | Number of currently open alerts attached to this host. |
| `rm_host_repo_status` | Always 1; the `status` label carries `unknown` / `ready` / `init_failed`. |
### Job duration histogram
```
rm_job_duration_seconds_bucket{kind, status, le}
rm_job_duration_seconds_sum{kind, status}
rm_job_duration_seconds_count{kind, status}
```
`kind` ∈ {backup, forget, prune, check, unlock, restore, diff, init, update}.
`status` ∈ {succeeded, failed, cancelled}.
Buckets (seconds):
```
1,  5,  30,  60,  300,  1800,  3600,  21600,  86400,  +Inf
1s  5s  30s  1m   5m    30m    1h     6h      24h
```
The histogram is in-memory only — values reset on process restart.
Operators who want durable history should let Prometheus persist
the scrapes; restic-manager itself is a control plane, not a
metrics database.
## Grafana dashboard
Import `deploy/grafana/restic-manager-dashboard.json`:
1. In Grafana, **+ → Import → Upload JSON file**.
2. Pick the Prometheus data source you scrape with.
3. The dashboard's six panels populate from the metrics above:
* **Fleet status** — online/total stat panel.
* **Open alerts** — by severity.
* **Hosts** — per-host table (last backup, repo size, snapshots, alerts).
* **Repo size over time** — one line per host.
* **Backups failing** — count of hosts whose last backup didn't succeed.
* **Job duration p95** — `histogram_quantile(0.95, …)` over a 1h window per kind.
Alerting is intentionally not configured in the dashboard — the
control plane already has alerts (P3-05) with native channels for
webhook, ntfy, and SMTP. Re-implementing them in Prometheus would
just duplicate state. If you do want Prom-side alerts, copy the
recording rules into your usual location.
## Cardinality
Per scrape: O(hosts) gauge rows + O(kinds × statuses × buckets)
histogram rows. A 100-host fleet emits roughly 700 host rows + 270
histogram rows — well below any practical limit. There are no
`job_id` labels (cardinality bomb avoidance) and no per-source-group
labels.
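The estimate can be reproduced with a few lines of arithmetic. The sketch below takes its counts from the tables in this page (seven per-host gauges, nine job kinds, three statuses, ten buckets including `+Inf`); the function name is hypothetical, and the figures are upper bounds, since some gauges are omitted for hosts without data.

```go
package main

import "fmt"

// seriesPerScrape estimates an upper bound on rows per scrape.
// All constants are read off the metric reference tables above.
func seriesPerScrape(hosts int) (hostRows, bucketRows, sumCountRows int) {
	const perHostGauges = 7 // rows in the per-host gauge table
	const kinds = 9         // backup, forget, prune, check, unlock, restore, diff, init, update
	const statuses = 3      // succeeded, failed, cancelled
	const buckets = 10      // nine finite bounds plus +Inf

	hostRows = hosts * perHostGauges
	bucketRows = kinds * statuses * buckets
	sumCountRows = kinds * statuses * 2 // one _sum and one _count per label pair
	return hostRows, bucketRows, sumCountRows
}

func main() {
	h, b, sc := seriesPerScrape(100)
	fmt.Println(h, b, sc)
}
```

The 270 figure corresponds to the `_bucket` rows alone; the `_sum` and `_count` rows add a small constant on top, independent of fleet size.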
@@ -1,113 +0,0 @@
# Running behind a reverse proxy
The restic-manager server is HTTP-only by design (see `spec.md` §11):
TLS termination, public hostname, ACME, HSTS, and edge-level rate
limiting all belong to a reverse proxy that you already operate
outside this project. The reference compose in `deploy/docker-compose.yml`
stands up *only* the server; this page covers what your proxy needs
to do to make the rest of it work.
## What the proxy must forward
The server reads four headers when (and only when) the immediate peer
matches `RM_TRUSTED_PROXY`:
| Header | Value | Why |
|---------------------|----------------------------------------------------------|-----|
| `X-Forwarded-For` | The original client IP (single value, or comma chain) | Rate-limit keys, audit log entries, and OIDC redirect-URI checks all use the real client IP. |
| `X-Forwarded-Proto` | `https` | The server emits absolute URLs (e.g. OIDC redirect URIs) using this. |
| `Host` | The public hostname clients use | Cookies are scoped to this; `RM_BASE_URL` must match. |
| `Connection`/`Upgrade` | Pass through unchanged | The agent connects on `/ws/agent` and the live-log viewer connects on `/api/jobs/{id}/stream` — both are WebSockets and need `Upgrade: websocket` to survive the hop. |
Set `RM_TRUSTED_PROXY` to the CIDR (or comma-separated list of CIDRs)
the proxy connects from. Anything outside that range has its
`X-Forwarded-*` headers ignored, so a stray request that bypasses the
proxy can't spoof the client IP.
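A minimal Go sketch of that trust check follows. The function names are hypothetical, and the way the chain is walked (right to left, returning the first address that is not itself a trusted proxy) is an assumption for illustration; this page only states that forwarded headers are ignored unless the immediate peer is trusted.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// inAny reports whether a falls inside any of the given prefixes.
func inAny(a netip.Addr, nets []netip.Prefix) bool {
	for _, p := range nets {
		if p.Contains(a) {
			return true
		}
	}
	return false
}

// clientIP picks the address to use for rate limiting and audit entries.
// peer is the immediate TCP peer; xff is the raw X-Forwarded-For value.
func clientIP(peer netip.Addr, xff string, trusted []netip.Prefix) netip.Addr {
	if !inAny(peer, trusted) || xff == "" {
		return peer // untrusted hop: forwarded headers are ignored
	}
	// Assumption: walk the chain from the right, skipping trusted proxies.
	parts := strings.Split(xff, ",")
	for i := len(parts) - 1; i >= 0; i-- {
		a, err := netip.ParseAddr(strings.TrimSpace(parts[i]))
		if err != nil {
			return peer // malformed chain: fall back to the TCP peer
		}
		if !inAny(a, trusted) {
			return a
		}
	}
	return peer
}

// resolve is a string-only wrapper for the demo below.
func resolve(peer, xff, trustedCIDR string) string {
	trusted := []netip.Prefix{netip.MustParsePrefix(trustedCIDR)}
	return clientIP(netip.MustParseAddr(peer), xff, trusted).String()
}

func main() {
	// Request arrives via the local proxy: the forwarded address is used.
	fmt.Println(resolve("127.0.0.1", "203.0.113.7", "127.0.0.1/32"))
	// Request bypasses the proxy: the spoofed header is ignored.
	fmt.Println(resolve("198.51.100.9", "203.0.113.7", "127.0.0.1/32"))
}
```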
## Example: Caddy
```caddyfile
restic.example.com {
    # Caddy's default reverse_proxy preserves Host, sets
    # X-Forwarded-For/Proto, and passes Connection: upgrade through,
    # so a single directive covers HTTP + WebSocket.
    reverse_proxy 127.0.0.1:8080
    encode zstd gzip
}
```
`RM_TRUSTED_PROXY=127.0.0.1/32` if Caddy and the server share the
host; the docker-bridge CIDR (commonly `172.16.0.0/12`) if Caddy
runs in another container on the default bridge network.
## Example: nginx
```nginx
server {
    listen 443 ssl http2;
    server_name restic.example.com;
    ssl_certificate     /etc/ssl/restic.example.com.fullchain.pem;
    ssl_certificate_key /etc/ssl/restic.example.com.key.pem;
    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_http_version 1.1;
        # WebSocket support — agent + live-log endpoints need this.
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        # Trusted-proxy headers.
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
        # Live job logs are long-running streams. Bump read timeouts
        # so nginx doesn't drop them mid-backup.
        proxy_read_timeout 1h;
        proxy_send_timeout 1h;
    }
}
# Standard websocket upgrade map (define once at the http {} level).
map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      close;
}
```
`RM_TRUSTED_PROXY` for the same-host case: `127.0.0.1/32`.
## Example: Traefik (label-based)
```yaml
labels:
- "traefik.enable=true"
- "traefik.http.routers.restic-manager.rule=Host(`restic.example.com`)"
- "traefik.http.routers.restic-manager.entrypoints=websecure"
- "traefik.http.routers.restic-manager.tls.certresolver=letsencrypt"
- "traefik.http.services.restic-manager.loadbalancer.server.port=8080"
```
Traefik handles `X-Forwarded-*` and `Connection: upgrade` by default.
`RM_TRUSTED_PROXY` should be the docker network the Traefik container
shares with the server (commonly `172.16.0.0/12` for the default
bridge, or whatever your overlay network's CIDR is).
## Sanity-checking the wiring
After bringing the stack up:
1. `curl -fsS https://restic.example.com/healthz` — should return 200.
2. The login page should report HTTPS in the address bar; cookies
set after login should carry the `Secure` flag.
3. Check the server log for the `config resolved` line:
`trusted_proxies` must include the IP/CIDR your proxy actually
connects from.
4. Enrol a test agent — the WebSocket handshake hitting `/ws/agent`
confirms `Upgrade` is being forwarded correctly.
If any of those fail, the proxy is the first place to look — the
server itself is intentionally minimal.
@@ -1,223 +0,0 @@
# Always-On vs Intermittent host mode
**Date:** 2026-06-15
**Branch:** `feat-laptop-host-mode`
**Status:** Design — awaiting review
## Problem
The server currently assumes every host should be present 24×7. When an
agent stops heartbeating for 90s it is flipped to `offline`, and after 15
minutes that raises a `warning` alert. This is correct for a server, but
wrong for a host that legitimately comes and goes — a workstation or
laptop that sleeps overnight, travels, or is shut down on weekends. Such
a host generates noise alerts every time it is closed, and — more
importantly — there is **no mechanism to catch up a backup it missed
while it was away.**
Two distinct facts make the catch-up gap real:
- **Backup cron runs on the agent, locally.** The agent fires
`MsgScheduleFire`; the server only dispatches in response. If the host
is asleep, the agent process is suspended, so the cron tick never
fires and no `MsgScheduleFire` is ever sent.
- Therefore the existing `pending_runs` retry queue **does not** cover
this case. `pending_runs` only gets a row when a schedule *fired* but
the agent was momentarily disconnected at dispatch time. A window
missed entirely during sleep never enqueues anything.
## Goal
Let an operator mark a host as **not** always-on. Such a host:
1. Does **not** raise offline/agent-down alerts when it is not visible.
2. Renders a distinct, calm "asleep" state in the UI instead of the
alarming red "offline".
3. When it reconnects, after a short settle delay, the server checks
whether it missed a scheduled backup and — if so — triggers a
catch-up backup automatically.
4. Still raises a *staleness* alert if it has genuinely gone too long
without any backup (a host left in a drawer). This is the only
alert covering an asleep host: while the agent is offline no job
runs, so there is no failure to detect — staleness is the safety
net for "no backups are happening at all."
5. Leaves normal job-failure alerting untouched: a backup that
actually runs (scheduled or catch-up) and fails alerts as it does
today. Failures can only occur while the agent is online and
executing restic.
Default behaviour is unchanged for the entire existing fleet.
## Decisions (from brainstorming)
- **Setting shape:** a single boolean `Always On` checkbox per host,
**default ON**. Checked = today's 24×7 server semantics. Unchecked =
intermittent host. Opt-in only; zero behaviour change for current and
future hosts unless explicitly toggled.
- **Overdue trigger:** evaluated on **reconnect + behind schedule**
(not a continuous always-evaluating sweep).
- **Alert policy for intermittent hosts:** suppress offline alerts;
keep a long-threshold **staleness** alert; keep job-failure alerts.
- **Staleness threshold:** **7 days**, a global constant for v1. May
become per-host configurable later — out of scope now.
- **Catch-up granularity:** **per enabled schedule.** A host with a
daily and a weekly schedule catches up only whichever is actually
behind.
- **UI vocabulary:** not-visible intermittent host shows a grey
`asleep` state; detail line reads
`asleep · last seen <relTime> · will catch up on return`.
- **Chip:** chip and checkbox highlight the **same** truth (24×7). Show
a chip for **Always-On** hosts; **no** chip for intermittent.
## Architecture
The change is deliberately a thin policy + presentation layer over the
existing online/offline state machine. We do **not** add a new `status`
enum value or alter heartbeat / `last_seen_at` tracking. "Asleep" is a
reinterpretation of `status='offline' AND NOT always_on`.
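As a rough sketch of that reinterpretation, the presentation layer could derive the label without touching the stored status. The `Host` struct here is a trimmed, hypothetical stand-in for the real store type, and `displayState` is an illustrative name.

```go
package main

import "fmt"

// Host is a trimmed stand-in for the store type; only the two
// fields relevant to the derived state are shown.
type Host struct {
	Status   string // "online" or "offline", as written by the existing state machine
	AlwaysOn bool   // new column, default true
}

// displayState maps the stored status to what the UI shows.
// The stored status is never rewritten.
func displayState(h Host) string {
	if h.Status == "offline" && !h.AlwaysOn {
		return "asleep"
	}
	return h.Status
}

func main() {
	fmt.Println(displayState(Host{Status: "offline", AlwaysOn: false}))
	fmt.Println(displayState(Host{Status: "offline", AlwaysOn: true}))
	fmt.Println(displayState(Host{Status: "online", AlwaysOn: false}))
}
```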
### 1. Data model
- **Migration `0024_hosts_always_on.sql`:**
```sql
ALTER TABLE hosts ADD COLUMN always_on INTEGER NOT NULL DEFAULT 1;
```
Column-level ALTER per the repo's migration rules. Default `1` means
every existing row is Always-On — no behaviour change on upgrade.
- `store/types.go`: add `AlwaysOn bool` to the `Host` struct; thread it
through every host SELECT scan and the host insert/update paths.
- New store helper `SetHostAlwaysOn(ctx, hostID, bool) error`.
### 2. Online/offline mechanics — UNCHANGED
The 30s offline sweeper (`cmd/server/main.go:220`) still flips an unseen
host to `status='offline'` and still calls
`alertEngine.NotifyHostOffline(id)`. `TouchHost` / `MarkHostHello`
behaviour is untouched. The intermittent distinction is applied
*downstream* of this state, in the alert engine and the templates.
### 3. Alert behaviour
All changes key off `host.AlwaysOn`, which the engine already has access
to via the host row it loads.
- **Suppress offline alert** (`alert/engine.go` `handleHostOffline()`
and the 60s `tick()`): when `!host.AlwaysOn`, do not raise
`agent_offline`.
- **Resolve-on-toggle:** when a host is switched server→intermittent and
has an open `agent_offline` alert, auto-resolve it. (Handled in the
mode-change handler, fanning through the normal resolve path so
channels/audit fire as usual.)
- **Staleness alert** — wire up the currently-dead `KindStaleSchedule`
constant, **for intermittent hosts only.** On the 60s tick, for each
host where `!AlwaysOn` AND the host has ≥1 enabled schedule AND
`LastBackupAt != nil` AND `now - LastBackupAt > 7*24h`: raise a
`warning` `stale_schedule` alert (dedup key `""`, one per host).
Auto-resolves when `LastBackupAt` advances past the threshold (i.e.
any successful backup, including the catch-up). Always-On hosts'
`stale_schedule` remains a no-op (unchanged, out of scope).
- If `LastBackupAt == nil` (intermittent host enrolled but never
backed up): no staleness alert in v1 — there is no baseline to
measure against, and onboarding probe state (`repo_status`) already
covers "never successfully set up."
- **Job-failure alerts:** untouched. A catch-up backup that runs and
fails alerts exactly like any other backup.
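The staleness predicate described above reduces to a handful of guards. The sketch below is illustrative only: the `host` struct, `isStale`, and the `staleAfterDays` wrapper are hypothetical names, and the real engine would load these fields from the host row and its schedules.

```go
package main

import (
	"fmt"
	"time"
)

// staleAfter is the global v1 threshold from the decisions list.
const staleAfter = 7 * 24 * time.Hour

// host carries only the fields the staleness check needs.
type host struct {
	AlwaysOn         bool
	EnabledSchedules int
	LastBackupAt     *time.Time // nil: never backed up
}

// isStale reports whether a stale_schedule alert should be open.
func isStale(h host, now time.Time) bool {
	if h.AlwaysOn || h.EnabledSchedules == 0 || h.LastBackupAt == nil {
		return false // out of scope, no expectation, or no baseline
	}
	return now.Sub(*h.LastBackupAt) > staleAfter
}

// staleAfterDays is a wrapper for the demo: days < 0 means "never backed up".
func staleAfterDays(alwaysOn bool, schedules, days int) bool {
	now := time.Date(2026, 6, 15, 12, 0, 0, 0, time.UTC)
	h := host{AlwaysOn: alwaysOn, EnabledSchedules: schedules}
	if days >= 0 {
		last := now.Add(-time.Duration(days) * 24 * time.Hour)
		h.LastBackupAt = &last
	}
	return isStale(h, now)
}

func main() {
	fmt.Println(staleAfterDays(false, 1, 8))  // intermittent, 8 days since backup
	fmt.Println(staleAfterDays(false, 1, 6))  // intermittent, recent backup
	fmt.Println(staleAfterDays(true, 1, 8))   // Always-On: no-op
	fmt.Println(staleAfterDays(false, 1, -1)) // never backed up: no baseline
}
```

Because the same predicate is evaluated on every tick, auto-resolution falls out naturally: once `LastBackupAt` advances, `isStale` returns false and the open alert can be resolved.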
### 4. Catch-up on reconnect
A new small component — the **catch-up scheduler** — lives server-side
alongside the existing ticks.
- **Arm:** on agent hello (`server/ws/handler.go` hello path /
`onAgentHello`), if the host is `!AlwaysOn`, record
`catchupDueAt[hostID] = now + 60s` in an in-memory map. Re-arming on a
subsequent hello just overwrites the timestamp (debounce — rapid
flapping does not stack catch-ups). In-memory is acceptable: catch-up
is best-effort and a server restart simply re-arms on the next hello.
- **Fire:** reuse the existing 30s server tick. For each due entry
(`catchupDueAt <= now`):
1. Re-verify the agent is still connected (`Hub.Connected(hostID)`).
If it bounced back offline within the settle window, drop the entry
(it will re-arm on the next hello).
2. Skip if a backup is already running or queued for the host
(`current_job_id` set, or a relevant `pending_runs` row exists) —
avoid double-firing alongside a normal dispatch or pending drain.
3. For each **enabled** schedule on the host, compute overdue:
```
overdue := sched.Next(host.LastBackupAt) <= now
```
using `robfig/cron/v3` (already a dependency) to parse
`Schedule.CronExpr`. `Next(lastBackup)` is the first fire strictly
after the last successful backup; if that moment has already
passed, the window was missed → overdue. (If `LastBackupAt` is nil,
treat as overdue so a never-backed-up intermittent host with a
schedule gets its first run on connect.)
4. For each overdue schedule, dispatch its source-groups via the
existing `dispatchBackupForGroupCore()`.
5. Clear the entry.
Net latency is ~60–90s after wake (60s settle + up to one 30s tick).
This path is independent of and complementary to the `pending_runs`
drain, which continues to handle the fired-but-not-sent case.
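The arm/fire debounce and the overdue test can be sketched together. This is a hedged illustration, not the planned implementation: `Schedule` mirrors the single-method shape of `cron.Schedule` in `robfig/cron/v3`, while `dailyAt`, `catchup`, and the wrapper functions are hypothetical stand-ins so the example runs with the standard library alone.

```go
package main

import (
	"fmt"
	"time"
)

// Schedule has the same single method as robfig/cron/v3's cron.Schedule.
type Schedule interface {
	Next(time.Time) time.Time
}

// dailyAt is a stand-in schedule: fires once a day at a fixed UTC hour.
type dailyAt struct{ hour int }

func (d dailyAt) Next(t time.Time) time.Time {
	t = t.UTC()
	next := time.Date(t.Year(), t.Month(), t.Day(), d.hour, 0, 0, 0, time.UTC)
	if !next.After(t) {
		next = next.AddDate(0, 0, 1)
	}
	return next
}

// overdue: the first fire strictly after the last backup has already passed.
// A nil lastBackup counts as overdue.
func overdue(s Schedule, lastBackup *time.Time, now time.Time) bool {
	if lastBackup == nil {
		return true
	}
	return !s.Next(*lastBackup).After(now)
}

// catchup is the in-memory debounce map keyed by host ID.
type catchup struct {
	settle time.Duration
	dueAt  map[string]time.Time
}

// arm overwrites any earlier entry, so rapid flapping does not stack.
func (c *catchup) arm(hostID string, now time.Time) {
	c.dueAt[hostID] = now.Add(c.settle)
}

// due returns and clears the hosts whose settle delay has elapsed.
func (c *catchup) due(now time.Time) []string {
	var out []string
	for id, at := range c.dueAt {
		if !at.After(now) {
			out = append(out, id)
			delete(c.dueAt, id)
		}
	}
	return out
}

func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// missed checks a 03:00 UTC daily schedule; an empty lastBackup means "never".
func missed(lastBackup, now string) bool {
	if lastBackup == "" {
		return overdue(dailyAt{3}, nil, mustParse(now))
	}
	last := mustParse(lastBackup)
	return overdue(dailyAt{3}, &last, mustParse(now))
}

// dueCount arms one host at each offset (seconds) and counts due hosts at checkAt.
func dueCount(armOffsets []int, checkAt int) int {
	t0 := mustParse("2026-06-15T08:00:00Z")
	c := &catchup{settle: 60 * time.Second, dueAt: map[string]time.Time{}}
	for _, off := range armOffsets {
		c.arm("host-1", t0.Add(time.Duration(off)*time.Second))
	}
	return len(c.due(t0.Add(time.Duration(checkAt) * time.Second)))
}

func main() {
	fmt.Println(missed("2026-06-12T03:05:00Z", "2026-06-15T08:00:00Z")) // slept through several windows
	fmt.Println(missed("2026-06-15T03:05:00Z", "2026-06-15T08:00:00Z")) // backed up this morning
	fmt.Println(dueCount([]int{0, 30}, 70))                             // second hello pushed the deadline out
}
```

Accepting an interface rather than a cron expression keeps the overdue logic easy to table-test; a parsed cron schedule would satisfy it without changes.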
### 5. UI
- **CSS:** new grey `dot-asleep` token in `web/styles/input.css`,
visually distinct from red `dot-offline`.
- **`partials/host_row.html` and `partials/host_chrome.html`:** when
`!AlwaysOn && status=='offline'`, render the grey dot + label
`asleep`; the detail/last-seen line reads
`asleep · last seen <relTime> · will catch up on return`. All other
states unchanged.
- **24×7 chip:** on the host detail header, render a small
`Always On` / `24×7` chip **only when `AlwaysOn` is true**. No chip
for intermittent hosts. (Chip and checkbox highlight the same fact.)
- **Toggle:** an `Always On` checkbox (default checked) on the host edit
surface. Operator-band `POST` (mirrors existing host-edit handlers),
audited as `host.mode_updated`. On save, if switching to intermittent,
trigger the resolve-on-toggle path for any open `agent_offline` alert.
## Error handling & edge cases
- **Toggle server→intermittent while offline+alerting:** open
`agent_offline` alert auto-resolved on save.
- **Toggle intermittent→server while asleep:** host resumes normal
offline/alert semantics; it will alert per the 15-minute floor once
the sweeper/tick next evaluates it.
- **No enabled schedules:** no catch-up and no staleness alert — there
is no backup expectation to measure against.
- **Catch-up vs in-flight work:** guarded by the running/queued check in
step 4.2 so catch-up never races a normal dispatch or pending drain.
- **Agent flaps during settle window:** entry dropped if not connected
at fire time; re-armed on the next hello.
## Testing
- **Alert engine (unit):**
- offline alert suppressed when `!AlwaysOn`.
- staleness alert raised when intermittent + schedule + last backup >
7d; not raised for Always-On hosts; not raised when last backup is
recent; not raised when no enabled schedule.
- staleness alert auto-resolves after a backup advances `LastBackupAt`.
- server→intermittent toggle resolves an open `agent_offline` alert.
- **Overdue computation (unit, table-driven):** `(cronExpr,
lastBackupAt, now) → overdue?` including nil-last-backup and
daily/weekly cases.
- **Catch-up scheduler (unit):** fires only when still connected; skips
when a backup is running/queued; dispatches only overdue schedules.
- **UI (render test):** asleep state + 24×7 chip render under the right
conditions; offline state for Always-On hosts unchanged.
- `go vet ./...` and full `go test ./...` green before merge.
## Out of scope
- Per-host staleness thresholds (global 7d constant for v1).
- Continuous (non-reconnect) overdue evaluation.
- Agent-side catch-up cron — the server is the reliable arbiter.
- Wiring `stale_schedule` for Always-On hosts (separate concern).
## Task tracking
Add an entry to `tasks.md` under "Next steps from testing" (or a new
small section) once the plan is approved, per the repo's tasks.md
source-of-truth rule.
@@ -0,0 +1,259 @@
# P2 Completion Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Close every remaining P2 task in `tasks.md`: P2R-09 (auto-init UX), P2R-10/11/12 (hooks), P2R-13 (bandwidth wiring + per-job override), P2R-14 (schedule next/last run), P2-16 (Windows svc), P2-17 (`install.ps1`), P2-18 (announce-and-approve).
**Architecture:** Server stays HTTP+WS; agent stays a single binary that auto-restages via `make build`. Hooks live on `source_groups` (and host-level defaults). Announce-and-approve adds a separate WS path (`/ws/agent/pending`) and a Pending hosts panel; token-flow stays default. Windows service support uses `golang.org/x/sys/windows/svc` behind a `//go:build windows` tag — Linux builds untouched. **Operator is away — make best guesses on small UX choices, but commit each item separately so the choices are reviewable.**
**Tech Stack:** Go 1.23+, chi router, modernc/sqlite, `coder/websocket`, `robfig/cron/v3`, HTMX + Tailwind, `golang.org/x/sys/windows/svc`, Ed25519 (stdlib).
---
## Pre-flight
- [ ] **Run baseline:** `go vet ./... && go build ./... && go test ./...` — must be green before starting. Restage agent + restart server (per CLAUDE.md restage block) so smoke env is warm.
## Order of execution
Smallest blast-radius first. UI polish → bandwidth → next/last → hooks → announce → Windows. Commit and restage at each task boundary. Run `go vet ./... && go test ./...` before every commit.
---
## Task 1 — P2R-13a: Wire bandwidth caps into restic invocations
**Files:**
- Modify: `internal/restic/runner.go` (add `LimitUploadKBps`, `LimitDownloadKBps` to `Env` or to a per-call options struct already present; emit `--limit-upload N`/`--limit-download N` on `restic backup|forget|prune|check|restore`)
- Modify: `internal/agent/runner/*.go` — pass host-wide caps into the runner. Caps come from `agent.config.Config` or are pushed via `config.update`. Decision: ship caps in the existing `config.update` envelope as new fields `bandwidth_up_kbps`, `bandwidth_down_kbps`. Server pushes on hello + on `PUT /api/hosts/{id}/bandwidth`.
- Modify: `internal/api/messages.go` — extend `ConfigUpdatePayload` with the two int pointers.
- Modify: `internal/server/ws/handler.go` (or wherever hello/config push lives) — include caps in the pushed config.
- Modify: `internal/server/http/host_bandwidth.go` — after `SetHostBandwidth`, fan out a `config.update` to the connected agent (mirror the credentials-edit path).
- Test: `internal/restic/runner_test.go` — assert flag injection.
- Test: `internal/server/ws/*_test.go` — assert config.update carries caps on hello and on edit.
- [ ] **Step 1.1** Add `LimitUploadKBps *int`, `LimitDownloadKBps *int` to whatever per-host config the runner already consults. Existing pattern is `restic.Env{}`; extend it.
- [ ] **Step 1.2** Failing test in `internal/restic/runner_test.go`: build a backup command with `LimitUploadKBps=1024`, assert the resulting argv contains `--limit-upload 1024`.
- [ ] **Step 1.3** Implement: prepend the flags in argv builders for `backup`, `forget`, `prune`, `check`, `restore`. Skip when nil/<=0.
- [ ] **Step 1.4** Wire `config.update` payload — server reads `Host.BandwidthUpKBps`/`DownKBps`, includes them in the existing `ConfigUpdatePayload` push on hello and on bandwidth edit (mirror cred-edit fan-out in `internal/server/http/host_credentials.go`).
- [ ] **Step 1.5** Agent applies caps: store in the in-memory dispatcher state on `config.update`, attach to every restic call.
- [ ] **Step 1.6** `go vet ./... && go test ./... && make build && <restage block>`. Commit:
```
agent+server: apply host bandwidth caps to restic invocations
```
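For orientation, Steps 1.2 and 1.3 might look something like the sketch below. The helper names (`limitFlags`, `buildArgv`, `argvString`) are hypothetical and the real argv builders in `internal/restic/runner.go` may be shaped differently; `--limit-upload` and `--limit-download` are restic's own global flags.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// limitFlags returns the bandwidth flags for the given caps.
// A nil or non-positive cap emits nothing.
func limitFlags(upKBps, downKBps *int) []string {
	var flags []string
	if upKBps != nil && *upKBps > 0 {
		flags = append(flags, "--limit-upload", strconv.Itoa(*upKBps))
	}
	if downKBps != nil && *downKBps > 0 {
		flags = append(flags, "--limit-download", strconv.Itoa(*downKBps))
	}
	return flags
}

// buildArgv places the limit flags ahead of the subcommand.
func buildArgv(subcommand string, up, down *int, rest ...string) []string {
	argv := append([]string{"restic"}, limitFlags(up, down)...)
	argv = append(argv, subcommand)
	return append(argv, rest...)
}

// ptr turns a non-positive value into "unset" for the demo.
func ptr(v int) *int {
	if v <= 0 {
		return nil
	}
	return &v
}

// argvString joins the argv for easy assertion in a test.
func argvString(subcommand string, up, down int) string {
	return strings.Join(buildArgv(subcommand, ptr(up), ptr(down)), " ")
}

func main() {
	fmt.Println(argvString("backup", 1024, 0))
	fmt.Println(argvString("restore", 0, 0))
}
```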
## Task 2 — P2R-13b: Per-job override on Run-now confirm dialog
**Decision:** A small numeric input on the per-source-group Run-now button (and dashboard Run-all). Operator is away — keep it minimal: two optional inputs (up/down KB/s) on the dispatch endpoint; UI shows a `<details>` "Limit bandwidth for this run" disclosure with two number inputs.
**Files:**
- Modify: `internal/server/http/sources.go` (or wherever the per-group Run-now POST lives) — accept optional `bandwidth_up_kbps`/`bandwidth_down_kbps` form fields, pass through.
- Modify: dispatch path (`internal/server/dispatch_*.go` or `ws/handler.go` job-dispatch core) — accept overrides, include in the `command.run` payload.
- Modify: `internal/api/messages.go` (`CommandRunPayload` gains optional caps that take precedence over host-wide caps when present).
- Modify: agent dispatcher — use the payload override if present, else fall back to config caps.
- Modify: `web/templates/pages/host_sources.html` (and the schedules Run-now form) — `<details>` block.
- Test: HTTP test for the new form fields; agent runner test for override precedence.
- [ ] **Step 2.1** Failing test: POST to per-group Run-now with `bandwidth_up_kbps=512` → assert dispatched payload carries 512.
- [ ] **Step 2.2** Implement endpoint changes + payload extension.
- [ ] **Step 2.3** Agent override precedence test (payload wins over config).
- [ ] **Step 2.4** UI `<details>` blocks (one per Run-now form).
- [ ] **Step 2.5** Playwright spot-check via `:8080` smoke env: open Sources tab, expand the Run-now disclosure, fire with limit=128, then open the live job log and confirm the agent's restic argv (read `/tmp/rm-smoke/server.log` for the dispatched command — it logs argv) shows `--limit-upload 128`.
- [ ] **Step 2.6** Commit.
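The precedence rule tested in Step 2.3 is small enough to show in full. This is a sketch under assumptions: `effectiveCap` and `resolveCap` are hypothetical names, and treating any non-nil override (including zero) as "present" is a guess the plan does not settle.

```go
package main

import "fmt"

// effectiveCap returns the per-job override when the payload carries one,
// otherwise the host-wide cap pushed via config.update.
func effectiveCap(override, hostWide *int) *int {
	if override != nil {
		return override
	}
	return hostWide
}

// resolveCap is an int-only wrapper for the demo: -1 means "absent".
func resolveCap(override, hostWide int) int {
	toPtr := func(v int) *int {
		if v < 0 {
			return nil
		}
		return &v
	}
	got := effectiveCap(toPtr(override), toPtr(hostWide))
	if got == nil {
		return -1
	}
	return *got
}

func main() {
	fmt.Println(resolveCap(512, 1024)) // payload override wins
	fmt.Println(resolveCap(-1, 1024))  // falls back to the host-wide cap
	fmt.Println(resolveCap(-1, -1))    // no cap at all
}
```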
## Task 3 — P2R-14: Schedule "next run" / "last run"
**Files:**
- Modify: `internal/store/schedules.go` — add `NextRunAt(time.Time)` derivation helper and `LatestScheduledJobAt(host_id, schedule_id) (time.Time, error)` (or a single batched fetch for all schedules of a host).
- Modify: dashboard host row (`web/templates/partials/host_row.html`) — show "Next: …" and "Last: …" when there's a single covering schedule (already detected in slice 5).
- Modify: `web/templates/pages/host_schedules.html` — add Next/Last columns to the schedules table.
- Modify: relevant page handlers (`internal/server/http/ui_schedules.go`, dashboard handler) — populate the data.
- Test: `schedules_test.go` for next-run derivation (parse cron, compute next from a fixed `now`).
- [ ] **Step 3.1** Add `NextRun(cronExpr string, from time.Time) (time.Time, error)` helper using `robfig/cron/v3`'s `Parse(...).Next(from)`. Test with three crons.
- [ ] **Step 3.2** Add `LatestJobByActorKindForSchedule(host_id, schedule_id) (time.Time, status, error)` query against `jobs` (filter `actor_kind='schedule'` AND `schedule_id=?`, ORDER BY `started_at` DESC LIMIT 1).
- [ ] **Step 3.3** Wire schedules-page handler to populate Next/Last per row; render relative time + ISO tooltip (mirror existing `formatRelTime` template helper if it exists; otherwise use a simple "5m ago" helper).
- [ ] **Step 3.4** Wire dashboard row: when single covering schedule, surface "Next: 03:00" / "Last: 8h ago — succeeded".
- [ ] **Step 3.5** Playwright spot-check: a host with a schedule shows Next/Last; pause it → Next becomes "—" / "(paused)".
- [ ] **Step 3.6** Commit.
## Task 4 — P2R-09: Auto-init UX polish
**Files:**
- Modify: `web/templates/pages/host_repo.html` — danger-zone re-init button + two-step confirm (type the host name).
- Modify: `internal/server/http/ui_repo.go` (or new `repo_reinit.go`) — `POST /hosts/{id}/repo/reinit` admin-only, audit-logged. Server runs `restic init --force` (or wipes-then-inits — pick the safer of the two; restic doesn't truly wipe a repo, the operator must clear the bucket. **Best guess:** dispatch a normal `init` job with a flag that re-runs even if the repo claims to exist; if restic refuses, surface "the repo on the remote already has data — clear it manually before re-init" via the job log).
- Modify: host detail page header / vitals strip — surface init result line. Use the existing latest-`init`-job query to render "repo ready · initialised <relative time> ago" or "init failed · job N · retry".
- Test: HTTP test for re-init endpoint (auth, audit, host-name confirm); template test that the result line renders for both states.
- [ ] **Step 4.1** Add helper: `LatestJobByKind(host_id, "init")` — already exists from P2R-06 (`store.LatestJobByKind`). Reuse.
- [ ] **Step 4.2** Render init line into vitals strip; show "init failed" amber when latest init failed.
- [ ] **Step 4.3** Implement `POST /hosts/{id}/repo/reinit` handler — admin role check, requires a `confirm_hostname` form field that must equal `host.Name`, returns 400 otherwise. Dispatches a fresh `init` job.
- [ ] **Step 4.4** Add danger-zone re-init form to `host_repo.html` (currently disabled per slice 4). Two-step confirm with the typed hostname.
- [ ] **Step 4.5** Playwright: visit `/hosts/{id}/repo`, click re-init, type wrong hostname → blocked; type right hostname → dispatches init job → returns to live log.
- [ ] **Step 4.6** Commit.
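
The typed-hostname gate from Step 4.3 might look roughly like the handler below. It is a sketch under assumptions: `reinitHandler` and `post` are hypothetical helpers, the admin role check and audit log are omitted, and returning 202 on success is a placeholder (the plan has the UI return to the live log instead).

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
)

// reinitHandler sketches the confirm_hostname check. hostName stands in for
// the looked-up host.Name; dispatch stands in for enqueueing an init job.
func reinitHandler(hostName string, dispatch func()) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		if r.FormValue("confirm_hostname") != hostName {
			http.Error(w, "confirm_hostname does not match host name", http.StatusBadRequest)
			return
		}
		dispatch()
		w.WriteHeader(http.StatusAccepted) // assumption: real handler redirects to the job log
	}
}

// post fires a form POST at the handler and returns the response status.
func post(h http.Handler, confirm string) int {
	form := url.Values{"confirm_hostname": {confirm}}
	req := httptest.NewRequest(http.MethodPost, "/hosts/1/repo/reinit", strings.NewReader(form.Encode()))
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	dispatched := 0
	h := reinitHandler("alfa-01", func() { dispatched++ })
	fmt.Println(post(h, "wrong"), post(h, "alfa-01"), dispatched)
}
```
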
## Task 5 — P2R-10: Hook schema (migration 0010)
**Files:**
- Create: `internal/store/migrations/0010_hooks.sql`
  - `ALTER TABLE source_groups ADD COLUMN pre_hook BLOB;` (AEAD ciphertext, NULLable)
  - `ALTER TABLE source_groups ADD COLUMN post_hook BLOB;`
  - `ALTER TABLE hosts ADD COLUMN pre_hook_default BLOB;`
  - `ALTER TABLE hosts ADD COLUMN post_hook_default BLOB;`
  - All four are AEAD ciphertext (existing `crypto.AEAD`); BLOB column type.
- Modify: `internal/store/types.go` — add `PreHook *string` (decrypted), `PostHook *string` to `SourceGroup`; same to `Host`.
- Modify: `internal/store/sources.go` + `internal/store/hosts.go` — getters/setters encrypt on write, decrypt on read. Pass `crypto.AEAD` through (pattern mirrors `host_credentials.go`).
- Test: encrypt/decrypt round-trip; setting `nil` clears the column.
- [ ] **Step 5.1** Write migration SQL. Column-level ALTERs only (per CLAUDE.md).
- [ ] **Step 5.2** Update store types + getters/setters with AEAD encrypt/decrypt. Mirror `internal/store/host_credentials.go` patterns exactly.
- [ ] **Step 5.3** Round-trip test: set hook on a source group; reload; assert plaintext returned. Set nil; assert nil after reload.
- [ ] **Step 5.4** `go vet && go test`. Commit.
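
The round-trip in Step 5.3 (encrypt on write, decrypt on read, `nil` clears the column) can be illustrated with the standard library. This sketch uses AES-256-GCM from `crypto/cipher` as a stand-in for the project's `crypto.AEAD`, whose real interface is not shown in this plan; `sealHook`, `openHook`, `newAEAD`, and the nonce-prefixed blob layout are all assumptions.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// sealHook encrypts an optional hook body. A nil hook maps to a nil blob,
// which a store layer would write as SQL NULL.
func sealHook(aead cipher.AEAD, hook *string) ([]byte, error) {
	if hook == nil {
		return nil, nil
	}
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Layout: nonce || ciphertext, stored in a single BLOB column.
	return aead.Seal(nonce, nonce, []byte(*hook), nil), nil
}

// openHook reverses sealHook; a nil blob yields a nil hook.
func openHook(aead cipher.AEAD, blob []byte) (*string, error) {
	if blob == nil {
		return nil, nil
	}
	if len(blob) < aead.NonceSize() {
		return nil, errors.New("hook blob too short")
	}
	nonce, ct := blob[:aead.NonceSize()], blob[aead.NonceSize():]
	pt, err := aead.Open(nil, nonce, ct, nil)
	if err != nil {
		return nil, err
	}
	s := string(pt)
	return &s, nil
}

func newAEAD(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

func main() {
	aead, err := newAEAD(make([]byte, 32)) // all-zero demo key; never use in production
	if err != nil {
		panic(err)
	}
	hook := "pg_dump mydb > /var/backups/mydb.sql"
	blob, _ := sealHook(aead, &hook)
	back, _ := openHook(aead, blob)
	fmt.Println(*back == hook)
}
```
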
## Task 6 — P2R-11: Agent execution of hooks
**Files:**
- Modify: `internal/api/messages.go`: `ConfigUpdatePayload` (or the per-source-group bundle inside `ScheduleSetPayload`) carries `PreHook`, `PostHook` plaintext (server has decrypted by then; wire is authenticated WS, same trust boundary as repo creds).
- Modify: agent dispatcher — for `kind=backup` only:
  - Run `pre_hook` (if present) via `os/exec` with the host shell (`/bin/sh -c` on Linux, `cmd.exe /C` on Windows). Capture stdout+stderr → JobLog with `hook:` prefix. Non-zero exit aborts the backup, marks the job failed with `pre_hook` error.
  - Run `post_hook` (if present) **always** after the backup, with `RM_JOB_STATUS=succeeded|failed` env var. Capture into JobLog, prefix `hook:`. Non-zero exit on post_hook does NOT change job status (warning logged).
  - Skip both for `kind` ∈ {forget, prune, check, unlock, init} per spec.md §14.3.
- Test: dispatcher test with a `pre_hook` that exits 1 → backup not started; `post_hook` always runs and sees `RM_JOB_STATUS`.
- [ ] **Step 6.1** Plumb hooks through `ScheduleSetPayload` source-group bundle + per-group Run-now `command.run` payload (override host-default with group hook if both present). Server-side resolution: host default if group hook is empty.
- [ ] **Step 6.2** Agent dispatcher: factor hook execution into `internal/agent/runner/hooks.go`. Use `exec.CommandContext`, set env, plumb output to existing JobLog stream with `Source: "hook"` (or prefix the log lines `hook: …`).
- [ ] **Step 6.3** Failing test in `internal/agent/runner/runner_test.go` (create file if absent): `pre_hook=/bin/false` → job fails with `pre_hook failed (exit 1)` and the actual restic backup never runs (assert via mock-restic shim).
- [ ] **Step 6.4** Test: `post_hook` runs even when backup fails; receives `RM_JOB_STATUS=failed`.
- [ ] **Step 6.5** Test: hooks skipped on `forget`/`prune`/`check`/`unlock` jobs.
- [ ] **Step 6.6** `go vet && go test && make build && <restage block>`. Commit.
## Task 7 — P2R-12: Hook editor UI
**Files:**
- Modify: `web/templates/pages/source_group_edit.html` (new or extend existing source-group form) — `<textarea>` for pre_hook, `<textarea>` for post_hook, with the warning banner: "this hook runs as the agent service user (root on Linux; LocalSystem on Windows)".
- Modify: source-group HTTP handler (`internal/server/http/sources.go`) — accept hook fields on POST/PUT, encrypt-and-persist via store.
- Modify: Repo page, adding `pre_hook_default` / `post_hook_default` under a new "Hooks" section. **Decision:** use the Repo page rather than a new "Settings" tab section on host detail, since Settings is still inert (per P1-25).
- Modify: source-group form admin-only check; post-only edit allowed by operators? **Decision:** admin-only edit per spec; render but disable for operators.
- Modify: audit-log writer — emit `source_group.hook_updated` and `host.default_hook_updated` events (without the hook body).
- Test: HTTP test for create + update; admin-only enforcement; audit row written without secret.
- [ ] **Step 7.1** Source-group form extension + handler wiring.
- [ ] **Step 7.2** Repo page Hooks section (host defaults).
- [ ] **Step 7.3** Audit entries.
- [ ] **Step 7.4** Playwright: as admin, set a `pre_hook` of `echo hello`, fire Run-now, open live log, confirm `hook: hello` line appears.
- [ ] **Step 7.5** Commit.
## Task 8 — P2-18a: Announce schema + endpoint
**Files:**
- Create: `internal/store/migrations/0011_pending_hosts.sql`
```sql
CREATE TABLE pending_hosts (
  id                TEXT PRIMARY KEY,
  hostname          TEXT NOT NULL,
  os                TEXT NOT NULL,
  arch              TEXT NOT NULL,
  agent_version     TEXT NOT NULL,
  restic_version    TEXT NOT NULL,
  public_key        BLOB NOT NULL,  -- 32-byte Ed25519
  fingerprint       TEXT NOT NULL,  -- "SHA256:hex"
  announced_from_ip TEXT NOT NULL,
  first_seen_at     TEXT NOT NULL,
  last_seen_at      TEXT NOT NULL,
  expires_at        TEXT NOT NULL
);
CREATE INDEX pending_hosts_expires ON pending_hosts(expires_at);
CREATE INDEX pending_hosts_fingerprint ON pending_hosts(fingerprint);
```
- Create: `internal/store/pending_hosts.go`: `CreatePendingHost`, `GetPendingHostByFingerprint`, `ListPendingHosts`, `DeletePendingHost`, `TouchPendingHost`, `DeleteExpiredPendingHosts`.
- Create: `internal/server/http/announce.go`: `POST /api/agents/announce` accepts `{hostname, os, arch, agent_version, restic_version, public_key (base64)}`. Validates protocol_version implicitly via `agent_version` check. Token-bucket rate limit per source IP (10/min). Global cap 100 pending rows. Returns `{fingerprint, pending_id, hostname_collision: bool}`.
- Test: `announce_test.go` — happy path; rate limit; cap; collision flag.
- [ ] **Step 8.1** Migration + store layer + tests.
- [ ] **Step 8.2** Endpoint + tests (use a fake clock + in-process token bucket).
- [ ] **Step 8.3** Commit.
## Task 9 — P2-18b: Pending WS + accept/reject
**Files:**
- Create: `internal/server/ws/pending.go`: `GET /ws/agent/pending` upgrade. Server issues a 32-byte nonce; agent signs it with its Ed25519 private key; server verifies against the `public_key` stored on the pending row keyed by the supplied `pending_id`. If valid, hold the connection open; on accept, push a single `enrolled` message containing `{bearer_token, repo_credentials_aead_blob}` and close cleanly. On reject, close with code 4001 + reason "rejected".
- Create: `internal/server/http/pending.go` — admin-only `POST /api/pending-hosts/{id}/accept` (atomically: mint bearer, decrypt admin-supplied repo creds (passed in form), promote pending row → real `hosts` row, push `enrolled` to the open WS, audit-log) and `POST /api/pending-hosts/{id}/reject` (delete row + close socket).
- Modify: server `main.go` route registration.
- Test: integration test — fake agent opens pending WS, admin POST /accept, agent receives bearer.
- [ ] **Step 9.1** Pending WS handler with nonce-sign verify.
- [ ] **Step 9.2** Accept/reject endpoints. Accept reuses the existing token-consume path internally (mints persistent bearer from `crypto.RandomToken`-style helper, inserts host row + `host_credentials`).
- [ ] **Step 9.3** Tests.
- [ ] **Step 9.4** Commit.
## Task 10 — P2-18c: Agent announce path
**Files:**
- Modify: `cmd/agent/main.go` — when `RM_TOKEN` is unset, switch to announce mode instead of erroring out. `RM_SERVER` still required.
- Create: `internal/agent/announce/announce.go` — generate-or-load Ed25519 keypair (persisted as a file alongside `secrets.enc`, mode 0600). POST `/api/agents/announce`. Open `/ws/agent/pending`. Wait. On `enrolled` message, persist bearer to `agent.yaml`, persist repo creds via existing secrets store, exit announce mode and reconnect via the normal WS path.
- Modify: `deploy/install/install.sh` — when `RM_TOKEN` is missing, run agent in announce mode and `journalctl --follow` until the agent prints the fingerprint, print it to the operator's terminal in big copy-friendly format, then keep following until enrolled.
- Test: end-to-end test in `internal/server/...` using a fake agent.
- [ ] **Step 10.1** Keypair generation + persistence.
- [ ] **Step 10.2** Announce client + pending WS client; print `SHA256:…` fingerprint to stdout in a banner.
- [ ] **Step 10.3** Install script branch.
- [ ] **Step 10.4** Playwright: register a host via announce mode (run agent locally with no RM_TOKEN), log into UI, see Pending hosts panel with the fingerprint, click Accept, confirm host appears.
- [ ] **Step 10.5** Commit.
## Task 11 — P2-18d: Pending hosts UI panel
**Files:**
- Modify: `web/templates/pages/dashboard.html` — add Pending hosts panel above the host list when any pending rows exist.
- Modify: dashboard handler — `Store.ListPendingHosts(now)` (auto-skips expired).
- Add buttons → POST `/api/pending-hosts/{id}/accept` and `/reject` via HTMX.
- Background sweeper for `DeleteExpiredPendingHosts` every 60s (mirror the existing offline-sweeper goroutine pattern).
- [ ] **Step 11.1** Sweeper goroutine.
- [ ] **Step 11.2** Dashboard handler + template.
- [ ] **Step 11.3** Accept form must include the same repo URL/user/pw fields as the token-mint form (admin still supplies repo creds at accept time).
- [ ] **Step 11.4** Playwright sweep.
- [ ] **Step 11.5** Commit.
## Task 12 — P2-16: Windows service integration
**Decision:** Cannot test on Windows from WSL. Goal is a clean compile under `GOOS=windows GOARCH=amd64` and code that follows the canonical `golang.org/x/sys/windows/svc/example` pattern. Untestable beyond compile + manual review; mark in commit message.
**Files:**
- Create: `internal/agent/service/service_windows.go` (build tag `//go:build windows`) — implements `svc.Handler`. `Execute` starts the agent's main loop in a goroutine, listens for `svc.Stop`/`svc.Shutdown`, cancels ctx, waits.
- Create: `internal/agent/service/service_other.go` (build tag `//go:build !windows`) — stub `RunService` that just runs the agent loop in the foreground.
- Create: `internal/agent/service/install_windows.go`: `Install`, `Uninstall`, `Start`, `Stop` thin wrappers around the `mgr` package.
- Modify: `cmd/agent/main.go` — sub-commands: `install`, `uninstall`, `start`, `stop`, `run` (default). `run` delegates to `service.Run()` which on Windows checks `svc.IsWindowsService()` and dispatches accordingly.
- Test: `internal/agent/service/service_windows_test.go` (build-tagged) for argv parsing only — actual SCM interaction can't be tested in CI.
- [ ] **Step 12.1** Implement the svc.Handler shell.
- [ ] **Step 12.2** Install/uninstall wrappers (use `mgr.ConnectLocal()`, `m.CreateService(name, exepath, mgr.Config{...}, "run")`).
- [ ] **Step 12.3** Cross-compile check: `GOOS=windows GOARCH=amd64 go build ./cmd/agent` must succeed.
- [ ] **Step 12.4** Commit with note "untested on Windows; compile-verified only".
## Task 13 — P2-17: install.ps1
**Files:**
- Create: `deploy/install/install.ps1` — PowerShell 5.1+ compatible. Checks admin elevation. Downloads agent binary from `$RM_SERVER/agent/binary?os=windows&arch=amd64`. Drops it at `C:\Program Files\restic-manager\restic-manager-agent.exe`. Runs `restic-manager-agent.exe install` (registers service). Starts it. Detects existing tasks named `*restic*` via `Get-ScheduledTask` and prints them — does not auto-disable. Writes `C:\ProgramData\restic-manager\agent.yaml` with `RM_SERVER` + `RM_TOKEN` (or no token if announce-mode).
- Modify: `internal/server/http/install.go` (or wherever install scripts are served) to also serve `/install/install.ps1`.
- Modify: CLAUDE.md restage block to also stage `install.ps1`.
- [ ] **Step 13.1** Write the script.
- [ ] **Step 13.2** Wire serving + restage.
- [ ] **Step 13.3** Smoke parse, if pwsh is on PATH: `pwsh -NoProfile -Command 'Get-Command -Syntax (Get-ChildItem deploy/install/install.ps1)'`, or a parse-only check via `pwsh -NoProfile -c '$null = [scriptblock]::Create((Get-Content deploy/install/install.ps1 -Raw))'` (single quotes keep the calling shell from expanding `$null`). Skip if no pwsh is available and note that in the commit.
- [ ] **Step 13.4** Commit.
## Task 14 — Final integration sweep
- [ ] **Step 14.1** `go vet ./... && go test ./... -race`. Full build. Restage. Restart server.
- [ ] **Step 14.2** Playwright walkthrough on `:8080`: login → dashboard shows pending-hosts empty state → create source group → set a `pre_hook` → Run-now with bandwidth override → confirm hook fires + bandwidth applied → schedules tab shows next/last → repo page shows init-OK line → re-init flow gated by typed hostname.
- [ ] **Step 14.3** Update `tasks.md`: tick P2R-09, P2R-10, P2R-11, P2R-12, P2R-13, P2R-14, P2-16, P2-17, P2-18 done. Update Phase 2 acceptance line items as satisfied.
- [ ] **Step 14.4** Open PR `p2-completion → main` with a summary of every item closed.
---
## Decisions made on the operator's behalf (away)
1. **Bandwidth UI for per-job override:** small `<details>` disclosure under each Run-now button. Simpler than a modal; matches the rest of the app's progressive-disclosure style.
2. **Re-init UX:** server dispatches a fresh `init` job; if restic refuses because the repo already exists, surfaces the error in the job log and instructs the operator to clear the remote bucket. We don't try to forcibly wipe — too dangerous, and the agent doesn't have credentials to wipe S3/B2/etc generically.
3. **Hooks editor lives on the Repo page (host defaults) + on the source-group edit form (per-group override).** Skips inventing a new "Settings" tab since that surface is still inert.
4. **Announce flow:** admin still supplies repo creds at accept time (same form as the token-mint flow). The pending row only carries identity-of-the-endpoint material, never repo creds.
5. **Windows service:** compile-verified only; untested. Commit message will say so.
@@ -0,0 +1,473 @@
# P3 — Alerts (design)
> Phase 3 sub-spec covering the alerts engine, notification channels, and UI
> (P3-05 / P3-06 / P3-07).
>
> Wireframe: `_diag/p3-alerts-wireframe/wireframe.html`. Screenshots in the
> same directory. Spec brainstorm ran 2026-05-04; user approved all ten
> design decisions before this spec was written.
## Scope locked
Brainstorm decisions (in order asked):
1. **Rule model.** Hardcoded rule set, no operator-tunable thresholds in v1.
The engine knows about each rule type internally; per-rule config can land
later if/when an operator asks.
2. **Rule set.** Six rules: `backup_failed`, `forget_failed`, `prune_failed`,
`check_failed`, `stale_schedule`, `agent_offline`.
3. **Engine cadence.** Hybrid. Event hooks at the existing
`MarkJobFinished` and offline-sweeper sites for the immediate triggers;
one 60-second ticker handles stale-schedule detection and auto-resolution.
4. **Resolution.** Auto-resolve when the underlying condition clears + manual
Resolve at any time. Acknowledge is a separate "I've seen it" intermediate
state that does NOT close the alert.
5. **v1 channels.** Webhook + native ntfy + SMTP. Apprise deferred (the
channel plumbing accepts new kinds without reshaping). SMTP added as
a first-class channel post-brainstorm because the use case — overnight
alerts the operator wants to read in the morning rather than be pinged
on at 03:00 — is poorly served by ntfy's push model and clumsy via
webhook → email-gateway.
6. **Channel scope.** Global only. No per-host or per-severity routing in v1.
7. **Notification body.** Structured JSON for webhooks, formatted
title+body+click-URL for ntfy, plus a per-channel "Send test notification"
button with inline result feedback.
8. **Deduplication.** Open-alert uniqueness on `(host_id, kind)` with a
`last_seen_at` bump on every confirming tick. One notification per
occurrence; the UI shows "still happening · Ns ago" while a rule keeps
matching.
9. **Alert UI.** Top-level `/alerts` page (the existing nav stub becomes
real). Per-host vitals "Open alerts" cell links to `/alerts?host_id=...`.
Channel CRUD lives at `/settings/notifications`.
10. **Delivery semantics.** Best-effort fire-and-forget with a 5s timeout
per notification. Failures are logged but not retried. The alert row in
the DB is the source of truth.
## Architecture
The subsystem is three loosely-coupled units behind one `AlertEngine`
goroutine:
```
                              ┌───────────────────────────┐
event hooks ─────────────────►│                           │
                              │        AlertEngine        │ ──► raise/resolve
60s ticker ──────────────────►│     (rule evaluation)     │     alert row
                              │                           │
                              └────────────┬──────────────┘
                                ┌──────────────────────┐
                                │   notification.Hub   │
                                │  (fire-and-forget)   │
                                └──┬────────┬──────────┘
                                   │        │
                            ┌──────▼──┐  ┌──▼──────┐
                            │ Webhook │  │  Ntfy   │  …future channels
                            └─────────┘  └─────────┘
```
### Component boundaries
| Component | Purpose | Depends on |
| ---------------------------------------- | ---------------------------------------------------------------------------------------- | -------------------------------------- |
| `internal/alert.Engine` | Owns the rule evaluation. Exposes `OnJobFinished`, `OnHostOffline`, `OnHostOnline` event hooks; runs a 60s ticker for stale-schedule + auto-resolution sweeps. Persists raises/resolves through the store. | store, notification.Hub, slog |
| `internal/alert.Rule` + per-rule files | Each of the six rules is a small struct with `Kind() string`, `Severity() string`, `MessageFor(ctx) string`. The engine iterates over a registered slice. | store models |
| `internal/notification.Hub` | Receives "alert raised/resolved/test" events; fans out to enabled channels in parallel; logs results to a new `notification_log` table. | store, channel adapters |
| `internal/notification.Channel` (iface) | Single method `Send(ctx, payload) error` with a 5s context for HTTP channels, 10s for SMTP. Three impls in v1: `webhookChannel`, `ntfyChannel`, `smtpChannel`. | http.Client; net/smtp + crypto/tls for SMTP |
| `internal/store/alerts.go` | CRUD on `alerts` table: `RaiseOrTouch(host_id, kind, severity, message)`, `Acknowledge(id, user)`, `Resolve(id, by user)`, `AutoResolve(host_id, kind)`, `ListAlerts(filter)`, plus the `last_seen_at` bump. | sqlite |
| `internal/store/notification_channels.go` | CRUD on `notification_channels` (new table) + `notification_log` (new table). | sqlite, crypto.AEAD (for secrets) |
| `internal/server/http/ui_alerts.go` | `/alerts` page handler + filter parsing + ack/resolve form actions. | store |
| `internal/server/http/ui_notifications.go` | `/settings/notifications` page + channel CRUD + "Send test" handler. | store, notification.Hub |
### Engine event shape
The engine runs as one goroutine per server process started in
`cmd/server/main.go`. It exposes a small set of channels other code writes to:
```go
type Engine struct {
	store *store.Store
	hub   *notification.Hub

	// Event channels (buffered, drop-on-full with a slog warning to keep
	// hot paths non-blocking). The engine drains them on its own
	// goroutine, evaluates the rule, and acts.
	jobFinished chan jobFinishedEvent // from store.MarkJobFinished hook
	hostOffline chan string           // host_id; from offline sweeper
	hostOnline  chan string           // host_id; from ws handler hello

	// 60s ticker drives stale-schedule + auto-resolution sweeps.
	tick *time.Ticker
}
```
The hot-path call sites (`store.MarkJobFinished`, `ws.handler` offline
sweep, `ws.handler` hello) push to these channels via a tiny
`Engine.Notify*` method that does a non-blocking send. The engine's own
goroutine handles every match — keeps mutation off the hot path.
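
The non-blocking send described above is the standard `select` with a `default` arm. A minimal sketch, assuming a cut-down `engine` type with a single channel; the method name follows the `Engine.Notify*` pattern from the text, while the boolean return value is an addition for demonstration only.

```go
package main

import (
	"fmt"
	"log/slog"
)

// engine is a cut-down stand-in for the Engine struct above.
type engine struct {
	hostOffline chan string
}

// NotifyHostOffline is safe to call from a hot path: it never blocks.
// It reports whether the event was queued (false means it was dropped).
func (e *engine) NotifyHostOffline(hostID string) bool {
	select {
	case e.hostOffline <- hostID:
		return true
	default:
		slog.Warn("alert engine queue full; dropping event",
			"event", "host_offline", "host_id", hostID)
		return false
	}
}

func main() {
	e := &engine{hostOffline: make(chan string, 2)} // tiny buffer to force a drop
	for _, id := range []string{"h1", "h2", "h3"} {
		fmt.Println(id, "queued:", e.NotifyHostOffline(id))
	}
}
```

Dropping is tolerable here because the 60s ticker re-evaluates state on its next pass; the event channels are a latency optimisation rather than the only path to a raise.
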
### Rule catalogue
| Kind | Severity | Trigger | Auto-resolve when |
| ------------------- | -------- | ----------------------------------------------------------------------- | -------------------------------------------------- |
| `backup_failed` | warning | `MarkJobFinished` with kind=backup, status=failed | next backup for the same host succeeds |
| `forget_failed` | warning | `MarkJobFinished` with kind=forget, status=failed | next forget for the same host succeeds |
| `prune_failed` | warning | `MarkJobFinished` with kind=prune, status=failed | next prune for the same host succeeds |
| `check_failed` | critical | `MarkJobFinished` with kind=check, status=failed OR errors_found | next check for the same host succeeds without errors |
| `stale_schedule` | warning | 60s ticker: a schedule's next-fire time is more than 5 minutes in the past with no matching job since | next job for that schedule succeeds OR schedule deleted |
| `agent_offline` | warning | offline-sweeper marks the host offline AND the host has been offline > 15 min (engine checks `last_seen_at`) | hostOnline event for that host |
The 15-minute floor on `agent_offline` exists so a 30-second blip during
agent restart doesn't generate a notification storm. The store's existing
offline sweeper (`hosts.last_seen_at` with 90s threshold) already marks the
host offline; the engine sees the event but waits for the threshold before
raising.
### Dedup + last_seen_at
`store.RaiseOrTouch(host_id, kind, severity, message)`:
```sql
SELECT id, last_seen_at FROM alerts
WHERE host_id = ? AND kind = ? AND resolved_at IS NULL
LIMIT 1;
```
- Found: `UPDATE alerts SET last_seen_at = ?, message = ? WHERE id = ?`,
return `(id, didRaise=false)`.
- Not found: `INSERT INTO alerts (id, host_id, kind, severity, message,
created_at, last_seen_at) VALUES (?, ?, ?, ?, ?, ?, ?)`, return
`(id, didRaise=true)`.
The engine fires a notification through the Hub only when `didRaise=true`.
Touch-only events keep the row's `last_seen_at` fresh so the UI can render
"still happening · Ns ago" without spamming the operator's phone.
### Notification payload shapes
**Webhook** — a single JSON envelope per event:
```json
{
  "event": "alert.raised",
  "alert_id": "01KQT...",
  "severity": "warning",
  "kind": "backup_failed",
  "host_id": "01KQ...",
  "host_name": "alfa-01",
  "message": "Backup 'system-config' failed: rest-server returned 401",
  "raised_at": "2026-05-04T15:42:01Z",
  "link": "https://restic-manager.example/alerts/01KQT..."
}
```
`event` is one of `alert.raised | alert.acknowledged | alert.resolved |
alert.test`. The same envelope shape is reused across events — operators
build one bridge, switch on `event` and `severity`.
**SMTP** — single-recipient plain-text email per channel. The channel
config carries the SMTP server credentials and a `to` address; one
channel = one recipient (or one distribution-list address). Operators
who want multiple recipients add multiple channels — keeps the config
flat and the failure modes per-recipient.
Subject pattern is hardcoded (no per-channel template in v1):
```
Subject: [restic-manager] [<severity>] <host_name>: <kind>
From: <configured-from-address>
To: <configured-to-address>
Date: <RFC 5322>
Message-ID: <alert_id@<server-host>>

<message line — same string the webhook/ntfy gets>
Raised at: 2026-05-04T15:42:01Z
Severity: warning
Host: alfa-01
Kind: backup_failed
Open in restic-manager:
https://restic-manager.example/alerts/01KQT...
(This message was sent by restic-manager. Acknowledge or resolve in the UI.)
```
The body is plain text only in v1 — no HTML alternative — both because
the data is already structured well enough as text and because HTML
email opens a long tail of rendering / sanitisation concerns. The
`Message-ID` includes the alert id so a thread-aware client can group
related events (raised → acknowledged → resolved) together.
Encryption:
- **STARTTLS** (default, port 587). Opportunistic upgrade. Most
operator-facing relays.
- **Implicit TLS** (port 465). Connect-then-TLS-handshake.
- **None** (port 25). Plain. Hidden behind a "Yes I understand" warning
on the form because the password goes over the wire.
Auth:
- **PLAIN** (RFC 4616) over TLS. Default and almost always what's wanted.
- **CRAM-MD5** (RFC 2195). Offered if the server advertises it, no UI
toggle — automatic.
- No OAuth2 / XOAUTH2 in v1; that's a real next step if Gmail-without-
app-passwords becomes a recurring ask.
Per-message timeout is 10s (vs 5s for HTTP channels) — STARTTLS
handshake + DATA over a slow link can legitimately take that long.
**Ntfy** — uses the standard publish format:
```
POST /<topic> HTTP/1.1
Host: <server>
Authorization: Bearer <access-token> (if configured)
Title: [warning] alfa-01 backup failed
Priority: 4
Tags: warning,backup_failed
Click: https://restic-manager.example/alerts/01KQT...

Backup 'system-config' failed: rest-server returned 401
```
Severity → priority mapping:
| Severity | Priority |
| --------- | -------- |
| info | 3 (default) |
| warning | 4 (high) |
| critical | 5 (urgent) |
Per-channel `default_priority` setting overrides for non-critical alerts;
critical always goes urgent regardless.
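
The mapping and its override rule fit in one small function. In this sketch `ntfyPriority` is a hypothetical name, and representing the per-channel `default_priority` as an int with 0 meaning "unset" is an assumption (the schema stores it as TEXT).

```go
package main

import "fmt"

// ntfyPriority maps an alert severity to an ntfy priority (1..5).
// channelDefault is the per-channel default_priority; 0 means unset.
func ntfyPriority(severity string, channelDefault int) int {
	if severity == "critical" {
		return 5 // critical always goes urgent, regardless of channel default
	}
	if channelDefault >= 1 && channelDefault <= 5 {
		return channelDefault
	}
	switch severity {
	case "warning":
		return 4
	default: // info and anything unrecognised
		return 3
	}
}

func main() {
	fmt.Println(ntfyPriority("warning", 0))  // table default for warning
	fmt.Println(ntfyPriority("warning", 2))  // channel default overrides
	fmt.Println(ntfyPriority("critical", 2)) // override ignored for critical
}
```
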
### Test notification
`POST /api/notifications/{channel_id}/test` builds a synthetic event
(severity=info, kind=test_notification, message="Test from
restic-manager", link to the channel's edit page) and runs it through the
real send path. Returns `{ok: bool, latency_ms: int, status_code?: int,
error?: string}`. UI renders the green ✓ / red ✗ feedback inline.
## Routes added
| Method | Path | Purpose |
| ------- | ----------------------------------------------------- | ------------------------------------------------------------- |
| GET | `/alerts` | Fleet alerts list with filters (`?status=open&severity=warning&host_id=...&q=...`) |
| POST | `/alerts/{id}/acknowledge` | Mark alert acknowledged (HTMX form) |
| POST | `/alerts/{id}/resolve` | Manual resolve (HTMX form) |
| GET | `/settings/notifications` | Channel list page |
| GET | `/settings/notifications/new` | Channel kind picker + empty form |
| POST | `/settings/notifications/new` | Validate + create + redirect |
| GET | `/settings/notifications/{id}/edit` | Channel edit form |
| POST | `/settings/notifications/{id}/edit` | Validate + update |
| POST | `/settings/notifications/{id}/delete` | Delete channel (typed-confirm name in the form) |
| POST | `/api/notifications/{id}/test` | Fire test notification, return JSON result |
| GET | `/api/alerts` | JSON list (mirrors the UI filters) for future REST callers |
## Data model
### Migration 0013 — alerts.last_seen_at
```sql
ALTER TABLE alerts ADD COLUMN last_seen_at TEXT;
UPDATE alerts SET last_seen_at = created_at WHERE last_seen_at IS NULL;
```
Existing alerts (currently zero in production — nothing writes them yet)
get `last_seen_at = created_at`. Column is nullable for forwards-compat
with rows from the alert-engine-pre-bump period.
### Migration 0014 — notification_channels + notification_log
```sql
CREATE TABLE notification_channels (
  id               TEXT PRIMARY KEY,
  kind             TEXT NOT NULL CHECK (kind IN ('webhook', 'ntfy', 'smtp')),
  name             TEXT NOT NULL,
  enabled          INTEGER NOT NULL DEFAULT 1 CHECK (enabled IN (0, 1)),
  config           BLOB NOT NULL,  -- AEAD-encrypted JSON; per-kind shape
  default_priority TEXT,           -- ntfy only; null for webhook + smtp
  created_at       TEXT NOT NULL,
  updated_at       TEXT NOT NULL,
  last_fired_at    TEXT
);
CREATE INDEX notification_channels_enabled ON notification_channels(enabled) WHERE enabled = 1;

CREATE TABLE notification_log (
  id          TEXT PRIMARY KEY,
  channel_id  TEXT NOT NULL REFERENCES notification_channels(id) ON DELETE CASCADE,
  alert_id    TEXT REFERENCES alerts(id) ON DELETE SET NULL,
  event       TEXT NOT NULL,  -- alert.raised | alert.acknowledged | alert.resolved | alert.test
  ok          INTEGER NOT NULL CHECK (ok IN (0, 1)),
  status_code INTEGER,
  latency_ms  INTEGER,
  error       TEXT,
  fired_at    TEXT NOT NULL
);
CREATE INDEX notification_log_channel ON notification_log(channel_id, fired_at DESC);
CREATE INDEX notification_log_alert ON notification_log(alert_id);
```
`config` is an AEAD-encrypted JSON blob: bearer tokens for webhooks,
access tokens for ntfy, and SMTP passwords live there. Per-kind config shapes:
```go
type webhookConfig struct {
	URL         string `json:"url"`
	BearerToken string `json:"bearer_token,omitempty"`
	HeaderName  string `json:"header_name,omitempty"`
	HeaderValue string `json:"header_value,omitempty"`
}

type ntfyConfig struct {
	ServerURL   string `json:"server_url"` // default https://ntfy.sh
	Topic       string `json:"topic"`
	AccessToken string `json:"access_token,omitempty"`
}

type smtpConfig struct {
	Host       string `json:"host"`       // e.g. smtp.example.com
	Port       int    `json:"port"`       // default 587 (STARTTLS), 465 (TLS), 25 (none)
	Encryption string `json:"encryption"` // "starttls" | "tls" | "none"
	Username   string `json:"username"`
	Password   string `json:"password"` // sensitive; AEAD-encrypted with the rest of config
	From       string `json:"from"`     // RFC 5322 address; "alerts@example.com" or "Restic-Manager <alerts@…>"
	To         string `json:"to"`       // single recipient or distribution-list address; v1 = one channel = one to-line
}
```
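As a rough illustration of what "AEAD-encrypted JSON" could look like for these config shapes, the sketch below seals a config struct with AES-256-GCM from the standard library. The spec does not say which AEAD is used or how the nonce is stored, so the cipher choice, the nonce-prefix layout, and the helper names `sealConfig` / `openConfig` are all assumptions.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/json"
	"errors"
	"fmt"
)

// webhookConfig is a trimmed copy of the struct above.
type webhookConfig struct {
	URL         string `json:"url"`
	BearerToken string `json:"bearer_token,omitempty"`
}

// newAEAD builds an AES-GCM AEAD from a 16/24/32-byte key.
// Assumption: the real store may use a different AEAD.
func newAEAD(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// sealConfig marshals cfg to JSON and encrypts it.
// Output layout (hypothetical): nonce || ciphertext+tag.
func sealConfig(key []byte, cfg any) ([]byte, error) {
	plain, err := json.Marshal(cfg)
	if err != nil {
		return nil, err
	}
	aead, err := newAEAD(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Seal appends the ciphertext to its first argument, so the nonce leads.
	return aead.Seal(nonce, nonce, plain, nil), nil
}

// openConfig reverses sealConfig and fails if the blob was tampered with.
func openConfig(key, blob []byte, out any) error {
	aead, err := newAEAD(key)
	if err != nil {
		return err
	}
	n := aead.NonceSize()
	if len(blob) < n {
		return errors.New("config blob too short")
	}
	plain, err := aead.Open(nil, blob[:n], blob[n:], nil)
	if err != nil {
		return err
	}
	return json.Unmarshal(plain, out)
}

func main() {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	blob, err := sealConfig(key, webhookConfig{URL: "https://example.test/hook", BearerToken: "s3cret"})
	if err != nil {
		panic(err)
	}
	var got webhookConfig
	if err := openConfig(key, blob, &got); err != nil {
		panic(err)
	}
	fmt.Println(got.URL, len(blob))
}
```

Encrypting the whole JSON blob rather than individual fields means a new per-kind secret (such as the SMTP password) is covered without a schema change.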
### Engine state
The engine itself is stateless beyond the channels it owns; all
persisted state is in the existing `alerts` table + the new
`notification_log` table. A process restart re-evaluates from scratch:
on next tick the stale-schedule + auto-resolution sweeps catch up with
whatever happened during the downtime. No outbox to drain.
## UI templates
| Template | Purpose |
| ----------------------------------------- | ------------------------------------------------------ |
| `web/templates/pages/alerts.html` | Fleet alerts page |
| `web/templates/partials/alert_row.html` | One alert row (used by both list and detail-fragment swap) |
| `web/templates/pages/settings.html` | Settings shell with Notifications / Users / Auth sub-tabs |
| `web/templates/pages/notifications.html` | Channel list (Notifications sub-tab body) |
| `web/templates/pages/notification_edit.html` | Channel kind picker + per-kind form + test button + payload preview |
| `web/templates/partials/crit_banner.html` | Dashboard top-of-page banner |
| `web/templates/partials/nav.html` | Existing — gain a `data-alerts-count` attribute on the Alerts tab so the badge auto-updates |
The Settings shell + Notifications sub-tab is the new chrome the wireframe
introduced; Users + Authentication tabs are placeholder links that 404 in
v1 (or render a "Lands later" notice). Same pattern P2R-02 used for
inert sub-tabs.
## Tests (target coverage)
- `internal/alert/engine_test.go` — rule firing per kind: backup_failed
raises on `MarkJobFinished(kind=backup, status=failed)`; touch-only on
the second failure for the same host (no second notification);
auto-resolve on next success.
- `internal/alert/agent_offline_test.go` — `OnHostOffline` emits without
raising until the 15-min floor; `OnHostOnline` clears the alert.
- `internal/alert/stale_schedule_test.go` — synthetic schedule whose next
fire is in the past triggers; resets when a job lands.
- `internal/notification/webhook_test.go` — payload shape pinned;
authorisation header sent when bearer set; custom header echoed; 5s
timeout enforced; error in `notification_log`.
- `internal/notification/ntfy_test.go` — title/priority/tags/click headers
match the severity mapping; access token sent as `Authorization: Bearer
<token>`; default priority overridden by severity for critical.
- `internal/notification/smtp_test.go` — round-trip against a local
in-process fake SMTP server in the style of `httptest.NewServer` (Go's
`net/smtp` is client-only), or `mhog`/MailHog if convenient:
STARTTLS handshake completes against a self-signed cert; PLAIN auth
uses configured creds; subject + from + to + body bytes match the
spec'd format; Message-ID contains the alert id; 10s timeout enforced;
failure path (auth refused) lands in `notification_log` with the
server's error string.
- `internal/server/http/ui_alerts_test.go` — page renders with filters
applied; ack/resolve POSTs flip the row + write audit; HX-Redirect
bounces back to the filtered list.
- `internal/server/http/ui_notifications_test.go` — CRUD happy paths,
validation re-render, secrets-encrypted-at-rest assertion (load row,
decrypt, compare), test-button hits the real send path against a
test http.Server.
- Migration 0013 + 0014 round-trip tested via `store.Open` on a fresh
db.
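The webhook test bullet above could be exercised roughly as follows: a sender that sets the bearer header and enforces the 5s timeout, pointed at an `httptest.Server` sink. This is a sketch only; `sendWebhook`, `sendToLocalSink`, and the JSON body are hypothetical, and the real payload shape is pinned elsewhere in the spec.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// sendWebhook POSTs body to url, adding "Authorization: Bearer <token>"
// when a token is configured. The whole call is bounded to 5 seconds.
func sendWebhook(ctx context.Context, url, bearer string, body []byte) (int, error) {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	req.Header.Set("Content-Type", "application/json")
	if bearer != "" {
		req.Header.Set("Authorization", "Bearer "+bearer)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// sendToLocalSink starts a throwaway HTTP sink, fires one webhook at it,
// and reports the status code plus the Authorization header the sink saw.
func sendToLocalSink() (status int, auth string, err error) {
	sink := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		auth = r.Header.Get("Authorization")
		w.WriteHeader(http.StatusNoContent)
	}))
	defer sink.Close()

	// The body is a placeholder, not the real payload shape.
	status, err = sendWebhook(context.Background(), sink.URL, "tok", []byte(`{"event":"alert.test"}`))
	return status, auth, err
}

func main() {
	status, auth, err := sendToLocalSink()
	fmt.Println(status, auth, err)
}
```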
## Playwright sweep
End-of-phase sweep mirrors the P2R-02 / P3-restore pattern:
1. Login → `/alerts` (initially empty) → see "All clear · last alert
never" empty state.
2. Trigger a fake-failed-backup via `POST /api/hosts/{id}/jobs` against a
host with a deliberately-wrong rest-server URL. Wait for the
`backup_failed` alert to appear in the list within ~2s of the job
finishing.
3. Acknowledge → row tints + ack actor visible.
4. Take the agent offline (`systemctl stop`); wait 15 min OR mock
`last_seen_at` to 16 min ago via the test harness; confirm
`agent_offline` alert raises once.
5. Restart the agent → `agent_offline` auto-resolves; `backup_failed` is
still open.
6. Configure a webhook channel pointing at a local test sink; click "Send
test" → green ✓.
7. Configure a ntfy channel pointing at a local sink → click "Send test"
→ green ✓.
8. Configure an SMTP channel pointing at a local MailHog (Docker, port
1025, no TLS for the local-only sweep) → click "Send test" → green ✓
→ MailHog UI at :8025 shows the test email with the right subject
and Message-ID.
9. Trigger a fresh failed backup → all three channels receive the
notification (verified from sink logs + MailHog inbox);
`notification_log` has three rows `event=alert.raised, ok=true`.
10. Manually Resolve the open `backup_failed`; confirm all three channels
receive `event=alert.resolved`.
11. Critical-severity test: trigger `check_failed` (mocked) → dashboard
banner appears; clicking it lands on `/alerts?severity=critical&status=open`.
12. Empty the alerts again → banner disappears.
Screenshots into `_diag/p3-alerts-sweep/`. End-to-end clean, zero console
errors, before handing back.
## What does NOT change
- Existing chrome/templates beyond the small additions noted above.
- Existing `alerts.severity` CHECK (`info`/`warning`/`critical`) — already
the right shape; no migration needed for that.
- Audit log writer pattern — engine writes audit rows for ack/resolve
the same way every other state-changing handler does.
- The agent. Alerts are entirely a server concern; the agent doesn't
know they exist.
## Open questions / explicit non-goals
- **Per-rule cooldowns / re-raise on long-running issues.** Out of scope
(brainstorm question 8 ruled this out). Operators see "still happening"
in the UI; they don't get a reminder ping.
- **SMTP HTML emails.** v1 is plain text only — operators wanting rich
rendering can deploy a webhook → mail-merge bridge, or wait for a v2
template engine. The Message-ID threading + plain text body should be
enough for almost every overnight-digest workflow.
- **SMTP OAuth2 / XOAUTH2.** Out of scope. Gmail / Microsoft 365 with
modern OAuth requires an `app password` workaround in v1. Native
XOAUTH2 lands when an operator asks (or when Google starts refusing
app passwords for non-business accounts in earnest).
- **Multi-recipient SMTP channels.** A channel = one `To`. Operators
wanting multiple recipients add multiple channels. Keeps failure
attribution per-recipient.
- **Apprise sidecar integration.** Deferred per brainstorm. The
`Channel` interface accepts a fourth impl without reshaping when we get
there.
- **Per-host or per-severity channel routing.** Out of scope. Likely
next step if operators ask: a `min_severity` field on the channel row.
- **Snooze / mute.** Out of scope. Acknowledge is the closest analogue;
full silence-windows would need a new table and is YAGNI for v1.
- **PagerDuty / OpsGenie.** Both have webhook receivers; operators wire
them via the webhook channel today.
- **Alert "rules" UI.** No CRUD; the rule set is hardcoded.
@@ -0,0 +1,342 @@
# P3 — Restore (design)
> Phase 3 sub-spec covering single-host restore (P3-01, P3-02, P3-03, P3-09).
> P3-04 (cross-host restore) is deferred to a new "Future / unscheduled"
> section in `tasks.md` — disaster recovery is already covered by re-enrolling
> a replacement host with the same repo credentials.
>
> Wireframe: `_diag/p3-restore-wizard/wireframe.html`. Screenshot:
> `_diag/p3-restore-wizard/01-full-wizard.png`.
## Scope locked
Brainstorm decisions (in order asked):
1. **In-place vs new-directory.** Default is a new directory under
`/var/restic-restore/<job-id>/`. A "Restore in place (overwrite original
paths)" toggle is gated by typed-confirmation of the host name, mirroring
the repo re-init pattern.
2. **Path-selection granularity.** Tree browser as the path selector, lazy-
loaded via `restic ls --json <snapshot> <path>` per directory expansion.
3. **Cross-host restore (P3-04).** Out of scope this phase. Move to
"Future / unscheduled" in `tasks.md`. The disaster-recovery case is covered
by the standard enrolment flow: stand up a replacement host, paste the
original repo creds at enrolment, snapshots reappear, restore is
same-host.
4. **Snapshot diff (P3-09).** Diff-as-a-job. New `JobDiff` JobKind dispatched
like every other agent operation. Output streams as `log.stream` and
renders on the live job log page.
5. **Wizard entry points.** Top-level "Restore" button on host detail
(`/hosts/{id}/restore`, opens wizard at step 1) plus a per-snapshot
Restore action on snapshot rows (`/hosts/{id}/snapshots/{sid}/restore`,
skips step 1).
6. **Wizard interaction model.** Single-page, sections progressively enable;
tree-browser nodes lazy-load via HTMX partials. No `restore_drafts` table.
7. **Tree-browser data path.** Synchronous WS RPC (`tree.list` ↔
`tree.list.result`, correlation-ID) plus a per-wizard-session in-memory
cache keyed by `{snapshot_id, path}` with ~30-min TTL.
8. **Restore progress UI.** Restore-specific job-page variant: files-restored
/ bytes-restored / throughput / ETA / current-file display, driven by
restic restore's JSON status events surfaced through `job.progress`.
9. **Permissions/ownership.** Policy, not toggle. In-place restore preserves
original ownership; new-directory restore drops ownership
(`--no-ownership`).
10. **Concurrency.** Single-flight per host (one job at a time across all
kinds). Plus a real cancel-job feature: `command.cancel` envelope, agent
kills the `restic` subprocess via context cancel (SIGTERM, SIGKILL after
grace), server transitions the job to `cancelled`. The "Cancel" button
already in the `job_detail` template becomes real for any running job
kind.
11. **Audit + safety.** Audit row on every restore dispatch (`host.restore`
with snapshot ID, paths, target, in-place flag). Recent-restores panel
on the host page surfacing the latest restore job alongside last-backup
and last-init signals. Role gate deferred to P4-03.
## Architecture
Restore composes from existing primitives plus three new pieces:
- **New JobKind values**: `JobRestore`, `JobDiff`. Dispatcher cases mirror
the prune/check pattern. Agent-side handlers wrap `restic.RunRestore` and
`restic.RunDiff` (new methods on the `restic` package).
- **New WS RPC**: `tree.list` request (`{snapshot_id, path}`) ↔
`tree.list.result` reply (`{entries: [{name, type, size}], ...}` or
`{error}`). Reuses existing correlation-ID infrastructure from P1-09. No
`jobs` row.
- **New cancel surface**: `command.cancel` request (`{job_id}`), agent
cancels the running subprocess context, returns `command.ack` + `job.finished`
with status `cancelled`. Server endpoint `POST /api/jobs/{id}/cancel`
bridges UI button → WS envelope.
Everything else (job lifecycle, log streaming, progress envelope, snapshot
listing, audit log writer, host_chrome partial, danger-zone typed-confirmation)
already exists and is reused verbatim.
### Component boundaries
| Component | Purpose | Depends on |
| ---------------------------------- | ---------------------------------------------------- | ----------------------------------------- |
| `internal/restic.RunRestore` | Run `restic restore` with paths + target + ownership | `restic.Env` |
| `internal/restic.RunDiff` | Run `restic diff --json a b` | `restic.Env` |
| `internal/agent/runner` cases | Dispatch `JobRestore` / `JobDiff` jobs | `restic.Run*`, hooks (skipped: backup-only) |
| `internal/agent/runner` cancel hook | Wire WS `command.cancel` → ctx.CancelFunc per job | runner job map |
| `internal/agent/runner` tree-list | Sync RPC handler: `restic ls --json` for one path | `restic.Env` |
| `internal/server/ws/cancel.go` | Validate + send `command.cancel` envelope | hub.Send, store.UpdateJobStatus |
| `internal/server/ws/tree.go` | RPC mediator: `tree.list` request → reply, with cache | hub.SendRPC, in-memory cache |
| `internal/server/http/restore.go` | Wizard routes + dispatch endpoint | store, ws, audit |
| `internal/server/http/diff.go` | Snapshot-diff dispatch endpoint | store, ws |
| `internal/server/http/cancel.go` | `POST /api/jobs/{id}/cancel` | ws |
| `web/templates/pages/host_restore.html` | Wizard page | host_chrome partial |
| `web/templates/partials/tree_node.html` | Lazy-loaded tree node fragment for HTMX swap | — |
| `web/templates/pages/job_detail.html` | Restore-kind progress widget (variant) | existing job_detail |
### Data flow — wizard happy path
```
operator
├─ GET /hosts/{id}/restore
│ server renders wizard shell, snapshot table from store.ListSnapshotsByHost
├─ click snapshot row (or arrives via /hosts/{id}/snapshots/{sid}/restore)
│ wizard advances to step 2, snapshot summary card rendered
├─ expand a tree node (chevron click)
│ HTMX GET /hosts/{id}/restore/tree?snapshot={sid}&path=/etc
│ server checks per-session cache (keyed by sid+path)
│ hit → render tree_node fragment from cache
│ miss → hub.SendRPC(host_id, "tree.list", {sid, path}) → wait reply
│ cache result, render tree_node fragment
├─ tick file/dir checkboxes (form state, no round-trip)
├─ pick target radio (and optionally type host name to unlock in-place)
└─ POST /hosts/{id}/restore (form submit)
server validates: ≥1 path, target mode, in-place ⇒ host name match
write audit row host.restore
store.CreateJob{kind=restore, payload={snapshot_id, paths, target, in_place}}
hub.Send(host_id, "command.run", {job_id, kind=restore, payload})
HX-Redirect: /jobs/{job_id}
```
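The server-side validation step in the flow above (at least one path, and a host-name match when in-place is selected) might look like the sketch below. The `restoreForm` type, its field names, and the error values are illustrative assumptions, not the actual handler code.

```go
package main

import (
	"errors"
	"fmt"
)

// restoreForm is a hypothetical decoded form body for POST /hosts/{id}/restore.
type restoreForm struct {
	SnapshotID      string
	Paths           []string
	InPlace         bool
	ConfirmHostName string // typed confirmation, only required for in-place
}

var (
	errNoPaths         = errors.New("select at least one path to restore")
	errConfirmMismatch = errors.New("type the host name exactly to restore in place")
)

// validateRestoreForm applies the two rules from the data flow:
// at least one path, and in-place requires the typed host name to match.
func validateRestoreForm(f restoreForm, hostName string) error {
	if len(f.Paths) == 0 {
		return errNoPaths
	}
	if f.InPlace && f.ConfirmHostName != hostName {
		return errConfirmMismatch
	}
	return nil
}

func main() {
	f := restoreForm{SnapshotID: "abc123", Paths: []string{"/etc/hosts"}, InPlace: true}
	fmt.Println(validateRestoreForm(f, "alfa-01")) // mismatch: nothing was typed
	f.ConfirmHostName = "alfa-01"
	fmt.Println(validateRestoreForm(f, "alfa-01")) // passes
}
```

Returning sentinel errors keeps the handler free to re-render the form with the operator's input intact and a specific message, as the test plan expects.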
### Data flow — agent restore execution
```
agent.runner receives command.run kind=restore
├─ check single-flight: if r.activeJobID != "" → reply busy
│ (server queues to pending_runs only for kind=backup; restore returns busy)
├─ allocate ctx, ctxCancel — store cancelFunc against job_id in r.cancels
├─ sendStarted(job_id, JobRestore, now)
├─ build target path: if in_place → "/" else "/var/restic-restore/<job_id>/"
├─ build flags: paths from payload, --no-ownership when !in_place
├─ restic.RunRestore(ctx, env, snapshot_id, paths, target, in_place):
│ restic restore <sid> --target <path> [--no-ownership] --include <p1> --include <p2> ...
│ parse stdout JSON: forward "status" → job.progress (1Hz throttle), "summary" → final
├─ on success: sendFinished(job_id, succeeded, exit=0)
├─ on ctx.Err() == context.Canceled: sendFinished(job_id, cancelled, exit=130)
└─ delete cancel func from r.cancels
```
### Data flow — cancel
```
operator clicks Cancel on /jobs/{id} (running)
POST /api/jobs/{id}/cancel
server: lookup job, ensure status=running, find host
hub.Send(host_id, "command.cancel", {job_id})
→ agent.runner receives command.cancel
cancelFunc, ok := r.cancels[job_id]
ok && cancelFunc()
→ restic subprocess context done → exec.Cmd kills via SIGTERM
→ if still alive after 5s grace → SIGKILL
→ runner sendFinished(job_id, cancelled, exit=130)
→ server receives job.finished status=cancelled, persists, broadcasts
→ browser refresh shows cancelled state
```
The cancel surface is independently useful for any kind (prune/check/backup) —
not gated to restore. The button already in `job_detail.html` becomes real.
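One detail worth spelling out for the SIGTERM-then-SIGKILL step: by default `exec.CommandContext` kills the child outright when the context is cancelled. Since Go 1.20 the `Cmd.Cancel` hook and `Cmd.WaitDelay` field allow the graceful sequence described above. The sketch below is Unix-only, uses `sleep` as a stand-in for the restic subprocess, and the helper names are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// commandWithGracefulCancel returns a command that receives SIGTERM when ctx
// is cancelled. If it has not exited after grace, Wait kills it (SIGKILL).
func commandWithGracefulCancel(ctx context.Context, grace time.Duration, name string, args ...string) *exec.Cmd {
	cmd := exec.CommandContext(ctx, name, args...)
	cmd.Cancel = func() error {
		return cmd.Process.Signal(syscall.SIGTERM)
	}
	cmd.WaitDelay = grace
	return cmd
}

// runAndCancel starts a long-running child, cancels it after afterMS
// milliseconds, and reports whether the run ended because of the cancel
// and how long the whole thing took.
func runAndCancel(afterMS int) (cancelled bool, elapsedMS int64, err error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// "sleep 30" stands in for a restic invocation (assumption for the demo).
	cmd := commandWithGracefulCancel(ctx, 5*time.Second, "sleep", "30")
	start := time.Now()
	if err := cmd.Start(); err != nil {
		return false, 0, err
	}
	time.AfterFunc(time.Duration(afterMS)*time.Millisecond, cancel)

	waitErr := cmd.Wait()
	cancelled = waitErr != nil && ctx.Err() == context.Canceled
	return cancelled, time.Since(start).Milliseconds(), nil
}

func main() {
	cancelled, ms, err := runAndCancel(100)
	fmt.Println(cancelled, ms, err)
}
```

In a runner, the `cancel` function would be what gets stored in the per-job map, so the `command.cancel` handler only has to look it up and call it.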
### Tree-list RPC details
New WS message types (added to `internal/api/messages.go`):
```go
type TreeListRequestPayload struct {
SnapshotID string `json:"snapshot_id"`
Path string `json:"path"`
}
type TreeListEntry struct {
Name string `json:"name"`
Type string `json:"type"` // "dir" | "file" | "symlink"
Size int64 `json:"size,omitempty"`
}
type TreeListResultPayload struct {
SnapshotID string `json:"snapshot_id"`
Path string `json:"path"`
Entries []TreeListEntry `json:"entries,omitempty"`
Error string `json:"error,omitempty"`
}
```
Server-side mediator (`ws.SendRPC`) takes a request envelope, registers the
correlation ID in a pending map, sends, blocks on a per-call channel until
the matching reply arrives (or 30s timeout). The pattern is small enough
to inline in `internal/server/ws/rpc.go` as a generic helper — future
synchronous RPCs reuse it.
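A minimal sketch of such a mediator, assuming a simplified `envelope` type and an injected `send` function in place of the real hub: it registers the correlation ID, sends, then blocks until the matching reply or the timeout. All names here are illustrative.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// envelope is a stand-in for the real WS envelope type.
type envelope struct {
	CorrelationID string
	Type          string
	Payload       string
}

type rpcMediator struct {
	mu      sync.Mutex
	pending map[string]chan envelope
	send    func(envelope) error
}

func newRPCMediator(send func(envelope) error) *rpcMediator {
	return &rpcMediator{pending: make(map[string]chan envelope), send: send}
}

// SendRPC registers req's correlation ID, sends it, and waits for the reply.
func (m *rpcMediator) SendRPC(ctx context.Context, req envelope, timeout time.Duration) (envelope, error) {
	ch := make(chan envelope, 1) // buffered so Deliver never blocks
	m.mu.Lock()
	m.pending[req.CorrelationID] = ch
	m.mu.Unlock()
	defer func() {
		m.mu.Lock()
		delete(m.pending, req.CorrelationID)
		m.mu.Unlock()
	}()

	if err := m.send(req); err != nil {
		return envelope{}, err
	}
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	select {
	case reply := <-ch:
		return reply, nil
	case <-ctx.Done():
		return envelope{}, ctx.Err()
	}
}

// Deliver routes an incoming reply to its waiter. It returns false when no
// call is waiting (late reply after timeout) or a reply was already queued.
func (m *rpcMediator) Deliver(reply envelope) bool {
	m.mu.Lock()
	ch, ok := m.pending[reply.CorrelationID]
	m.mu.Unlock()
	if !ok {
		return false
	}
	select {
	case ch <- reply:
		return true
	default:
		return false
	}
}

// roundTrip wires a fake agent that answers after replyDelayMS milliseconds.
func roundTrip(replyDelayMS, timeoutMS int) (string, error) {
	var m *rpcMediator
	m = newRPCMediator(func(req envelope) error {
		go func() {
			time.Sleep(time.Duration(replyDelayMS) * time.Millisecond)
			m.Deliver(envelope{CorrelationID: req.CorrelationID, Type: "tree.list.result", Payload: "ok"})
		}()
		return nil
	})
	req := envelope{CorrelationID: "c-1", Type: "tree.list"}
	reply, err := m.SendRPC(context.Background(), req, time.Duration(timeoutMS)*time.Millisecond)
	return reply.Payload, err
}

func main() {
	fmt.Println(roundTrip(10, 30000)) // reply arrives inside the 30s budget
	fmt.Println(roundTrip(200, 20))   // reply arrives too late
}
```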
In-memory cache: `map[sessionID]map[cacheKey]TreeListResultPayload` with
`cacheKey = snapshot_id + "\x00" + path`. Session ID minted per wizard
load (HTTP-only cookie scoped to `/hosts/{id}/restore/tree`, lifetime 30
min). On wizard close (browser navigation away) the entry expires
naturally. No persistence, no migration.
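The cache described above could be sketched as a nested map guarded by a mutex. One simplification to note: this version expires each entry individually rather than tying expiry to the cookie lifetime, and `treeResult`, `treeCache`, and `fakeClock` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const wizardTTL = 30 * time.Minute

// treeResult stands in for TreeListResultPayload.
type treeResult struct {
	Entries []string
}

type cacheEntry struct {
	value   treeResult
	expires time.Time
}

type treeCache struct {
	mu       sync.Mutex
	ttl      time.Duration
	now      func() time.Time
	sessions map[string]map[string]cacheEntry
}

func newTreeCache(ttl time.Duration, now func() time.Time) *treeCache {
	return &treeCache{ttl: ttl, now: now, sessions: make(map[string]map[string]cacheEntry)}
}

// cacheKey joins snapshot ID and path with a NUL byte, which cannot appear
// in either component, so distinct pairs never collide.
func cacheKey(snapshotID, path string) string {
	return snapshotID + "\x00" + path
}

func (c *treeCache) Put(sessionID, snapshotID, path string, v treeResult) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.sessions[sessionID] == nil {
		c.sessions[sessionID] = make(map[string]cacheEntry)
	}
	c.sessions[sessionID][cacheKey(snapshotID, path)] = cacheEntry{value: v, expires: c.now().Add(c.ttl)}
}

func (c *treeCache) Get(sessionID, snapshotID, path string) (treeResult, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	key := cacheKey(snapshotID, path)
	e, ok := c.sessions[sessionID][key] // reading a missing session is safe
	if !ok {
		return treeResult{}, false
	}
	if !c.now().Before(e.expires) {
		delete(c.sessions[sessionID], key)
		return treeResult{}, false
	}
	return e.value, true
}

// fakeClock lets a test move time forward without sleeping.
type fakeClock struct{ t time.Time }

func (f *fakeClock) Now() time.Time          { return f.t }
func (f *fakeClock) Advance(d time.Duration) { f.t = f.t.Add(d) }

func main() {
	c := newTreeCache(wizardTTL, time.Now)
	c.Put("sess-1", "abc123", "/etc", treeResult{Entries: []string{"hosts", "ssh"}})
	v, ok := c.Get("sess-1", "abc123", "/etc")
	fmt.Println(v.Entries, ok)
}
```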
Agent handler runs `restic ls --json <sid> <path>` (non-recursive — restic
defaults to recursive but `restic ls` accepts `--long` and a path filter;
parse output line-by-line and emit only direct children of `path`). 60s
context timeout, mirroring existing `restic snapshots` invocation.
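The "emit only direct children" filter can be sketched independently of how restic is invoked. The example below assumes each node line of `restic ls --json` output is a JSON object carrying at least `name`, `type`, `path`, and `size`, and that non-node lines (such as the leading snapshot record) have no `path` field; check both assumptions against the restic version in use. `directChildren` and `lsNode` are hypothetical names.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"path"
	"strings"
)

// TreeListEntry matches the wire type defined above.
type TreeListEntry struct {
	Name string `json:"name"`
	Type string `json:"type"`
	Size int64  `json:"size,omitempty"`
}

// lsNode is the assumed subset of fields on one line of ls output.
type lsNode struct {
	Name string `json:"name"`
	Type string `json:"type"`
	Path string `json:"path"`
	Size int64  `json:"size"`
}

// directChildren parses line-delimited JSON and keeps only entries whose
// parent directory is exactly parent.
func directChildren(lsOutput, parent string) ([]TreeListEntry, error) {
	parent = path.Clean(parent)
	var out []TreeListEntry
	sc := bufio.NewScanner(strings.NewReader(lsOutput))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var n lsNode
		if err := json.Unmarshal([]byte(line), &n); err != nil {
			return nil, err
		}
		if n.Path == "" {
			continue // not a node line (e.g. the snapshot record)
		}
		p := path.Clean(n.Path)
		if p == parent || path.Dir(p) != parent {
			continue // the directory itself, or a deeper descendant
		}
		out = append(out, TreeListEntry{Name: n.Name, Type: n.Type, Size: n.Size})
	}
	return out, sc.Err()
}

func main() {
	sample := `{"name":"etc","type":"dir","path":"/etc"}
{"name":"hosts","type":"file","path":"/etc/hosts","size":120}
{"name":"sshd_config","type":"file","path":"/etc/ssh/sshd_config","size":3000}`
	entries, err := directChildren(sample, "/etc")
	fmt.Println(entries, err)
}
```

Filtering on the agent keeps the `tree.list.result` payload small regardless of how deep the listing restic returns actually is.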
### Restore payload
`api.CommandRunPayload` gains a nested optional `restore` field:
```go
type RestorePayload struct {
SnapshotID string `json:"snapshot_id"`
Paths []string `json:"paths"` // absolute paths inside the snapshot
InPlace bool `json:"in_place"`
TargetDir string `json:"target_dir"` // empty when in_place=true
PreserveOwner bool `json:"preserve_owner"` // mirrors policy: in_place=>true, else=>false
}
```
The payload is set by the server when dispatching `JobRestore` and ignored
on every other kind. Wire-shape test pinned in `wire_test.go`.
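The ownership and target rules from the brainstorm (in-place preserves ownership and has no target directory; new-directory drops ownership and restores under `/var/restic-restore/<job-id>/`) can be captured in one constructor so the server never builds an inconsistent payload. A sketch, with `newRestorePayload` as a hypothetical helper:

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"path"
)

// RestorePayload matches the wire type defined above.
type RestorePayload struct {
	SnapshotID    string   `json:"snapshot_id"`
	Paths         []string `json:"paths"`
	InPlace       bool     `json:"in_place"`
	TargetDir     string   `json:"target_dir"`
	PreserveOwner bool     `json:"preserve_owner"`
}

// newRestorePayload derives TargetDir and PreserveOwner from inPlace so the
// policy lives in one place instead of being set by each caller.
func newRestorePayload(jobID, snapshotID string, paths []string, inPlace bool) (RestorePayload, error) {
	if snapshotID == "" || len(paths) == 0 {
		return RestorePayload{}, errors.New("snapshot id and at least one path are required")
	}
	for _, p := range paths {
		if !path.IsAbs(p) {
			return RestorePayload{}, fmt.Errorf("path %q is not absolute", p)
		}
	}
	rp := RestorePayload{
		SnapshotID:    snapshotID,
		Paths:         paths,
		InPlace:       inPlace,
		PreserveOwner: inPlace, // policy: in-place keeps original ownership
	}
	if !inPlace {
		rp.TargetDir = "/var/restic-restore/" + jobID + "/"
	}
	return rp, nil
}

func main() {
	rp, err := newRestorePayload("job-1", "abc123", []string{"/etc/hosts"}, false)
	if err != nil {
		panic(err)
	}
	wire, _ := json.Marshal(rp)
	fmt.Println(string(wire))
}
```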
### Diff payload
`api.CommandRunPayload` gains:
```go
type DiffPayload struct {
SnapshotA string `json:"snapshot_a"`
SnapshotB string `json:"snapshot_b"`
}
```
Set on `JobDiff`. Output is plain `restic diff --json <a> <b>` forwarded as
`log.stream` lines. Job page renders unchanged — operator reads the diff
output directly.
### Recent-restores panel
A small panel rendered on the host detail page below the existing init-status
line:
```
last restore: succeeded 2h ago · job f73ab4c1… · 3 files to /var/restic-restore/...
```
Backed by `store.LatestJobByKind(host_id, JobRestore)`, reusing the query
already used for init/forget/prune/check in P2R-06. One template addition
in `host_chrome.html` next to the `InitStatus` block.
## Routes added
| Method | Path | Purpose |
| ------- | --------------------------------------------------------- | ----------------------------------------------------------- |
| GET | `/hosts/{id}/restore` | Wizard shell (step 1 = snapshot picker) |
| GET | `/hosts/{id}/snapshots/{sid}/restore` | Wizard shell with snapshot pre-selected (skips step 1) |
| GET | `/hosts/{id}/restore/tree` | HTMX partial: tree node listing for `?snapshot=&path=` |
| POST | `/hosts/{id}/restore` | Validate + dispatch restore job, redirect to live job page |
| POST | `/api/hosts/{id}/snapshots/diff` | Dispatch a diff job for `{snapshot_a, snapshot_b}` |
| POST | `/api/jobs/{id}/cancel` | Send `command.cancel` to host, transition job → cancelled |
## Migrations
None. Restore + diff piggyback on the existing `jobs` table (their `kind` is
new but the schema already accepts arbitrary kind strings — there's no
CHECK constraint on `kind`). The cancel feature uses the existing
`JobCancelled` terminal status. The tree-list cache lives in process memory.
## Tests (target coverage)
- `internal/restic/restore_test.go` — `RunRestore` invocation builds the
expected argv (paths, --target, --no-ownership flag presence, in-place
variant); JSON status parsing → `BackupStatus`-shaped progress envelopes.
- `internal/restic/diff_test.go` — `RunDiff` argv shape and JSON forwarding.
- `internal/agent/runner/restore_test.go` — happy path, cancel mid-run
produces `cancelled` finished, in-place vs new-directory dispatch,
single-flight rejects when another job is running.
- `internal/agent/runner/tree_test.go` — `tree.list` handler returns
direct children for a synthetic restic ls output, surfaces error on
missing snapshot.
- `internal/server/ws/rpc_test.go` — `SendRPC` correlation matching,
timeout, concurrent calls.
- `internal/server/http/restore_test.go` — wizard renders with snapshots,
POST validates ≥1 path + in-place host-name match, audit row written,
job dispatched with correct payload, in-place without typed-confirm
re-renders form with input intact and an error.
- `internal/server/http/diff_test.go` — POST dispatches `JobDiff`,
snapshot IDs validated against the host's snapshot list.
- `internal/server/http/cancel_test.go` — POST cancel happy path
(running → cancelled), 4xx for non-running jobs, 4xx when host offline.
- `internal/server/http/restore_e2e_test.go` — happy path: GET wizard,
expand `/etc` (HTMX call returns expected fragment), submit, follow
HX-Redirect to job page, see status.
- `web/templates/pages/host_restore_test.go` (template-render test) —
wizard renders all four sections; in-place card disabled until typed
confirm.
## Playwright iteration / sweep
A Playwright sweep at the end (mirroring P2R-02 Slice 6) runs against the
local smoke server with a real agent enrolled. Steps:
1. Login → navigate to alfa-01 host → click Restore.
2. Wizard step 1: pick the most recent snapshot.
3. Wizard step 2: expand a directory two levels, tick three files,
verify tally updates.
4. Wizard step 3: leave default new-directory.
5. Wizard step 4: dispatch.
6. Land on live job page, see progress widget animating, see log lines.
7. Click Cancel mid-flight, verify status transitions to cancelled and
the agent's subprocess actually died (log line `signal: killed` or exit
130).
8. Repeat with in-place mode: type host name, dispatch, verify red
primary button, verify files actually overwritten on host.
9. Snapshot diff: navigate to snapshots, pick two, dispatch diff, see
diff output streamed.
10. Screenshots into `_diag/p3-restore-sweep/`.
End-to-end clean, zero console errors, before handing back.
## What does NOT change
- `host_chrome.html` only grows the recent-restores line; sub-tab list
unchanged (Restore is a top-level button on the host page, not a sub-tab).
- `enrollment.go`, schedule reconciliation, source-group CRUD, repo
maintenance ticker, hook execution — none of these are touched.
- The CLAUDE.md restage block applies as-is when the agent binary changes
(it does — runner gains restore/diff/cancel/tree handlers). The unit
file does not change.
## Open questions / explicit non-goals
- **Restore preview / dry-run.** Restic doesn't have a dry-run for restore.
Out of scope.
- **Resumable restore.** Restic restore is idempotent per-file but not
resumable mid-stream from where it left off. If a restore is cancelled,
the operator re-runs (files already written are overwritten). No state
to track.
- **Restore to a glob/pattern (e.g. `*.conf`).** Out of scope; the tree
picker requires explicit ticks. Power users can edit the URL or use the
CLI.
- **Bandwidth caps for restore.** Honoured automatically — restic's
`--limit-download` is part of `restic.Env` already (P2R-13) and applies
to restore unchanged.
- **Pre/post hooks for restore.** Hooks today gate only `kind=backup`
(P2R-11). Out of scope.
@@ -1,126 +0,0 @@
# Threat model
A short, structured walkthrough of the assets restic-manager
protects, the actors that interact with it, the attack surfaces
exposed, and the mitigations in place. This document is written for
operators considering a deployment and for contributors evaluating
security-sensitive changes. It is **not** a formal certification —
restic-manager has not been third-party audited.
Last reviewed: **2026-05-09** (against v1.0.0).
---
## 1. Assets
In rough order of sensitivity:
| Asset | Why it matters |
|---|---|
| **Restic repository passwords** | Decrypt every backup in the repo. Server holds them encrypted at rest; agents need plaintext at backup-time. |
| **Repository URLs with embedded credentials** (e.g. `rest:https://user:pass@host/repo`) | Same as above — read access to the repo is leak-equivalent to the password. |
| **Agent bearer tokens** | Long-lived credentials authenticating each agent → server WS. Compromise lets an attacker impersonate that host (push fake snapshots, ack fake schedule versions, exfiltrate repo creds the server pushes back). |
| **Server session cookies** | Browser-side session for human operators. Compromise = full UI access at the user's role for the cookie's TTL (24h). |
| **Database secret key** | Wraps every encrypted-at-rest field (repo creds, agent enrolment payloads). Disclosure of the file makes those fields, and therefore the backups, decryptable; rotation requires re-pushing creds to every agent. |
| **Bootstrap / setup tokens** | One-shot, time-limited; mint admin or invited-user accounts. |
| **Audit log** | Tamper-evident record of admin actions; read-only via UI. |
| **Backup data on the wire** | Restic itself encrypts on the agent before sending — see "out of scope". |
---
## 2. Actors
| Actor | Trust |
|---|---|
| **Anonymous internet** | Untrusted. Should not reach the server unless proxied behind auth (see deployment guide). |
| **Authenticated viewer** | Read-only on hosts/jobs/alerts/audit. |
| **Authenticated operator** | Add/remove hosts, edit schedules, run backups/restores, mint enrolment tokens, ack alerts. |
| **Authenticated admin** | All of the above plus user management, role changes, fleet update controls, secret-key visibility (no — see below). |
| **Agent** | Trusted to backup-and-report on its own host only. Cannot read other hosts' creds. Bearer-authenticated. |
| **Restic backend (rest-server / S3 / B2 / etc.)** | Out of scope for this document — assumed to authenticate the credentials presented and not collude. |
---
## 3. Attack surfaces and mitigations
### 3.1 First-run bootstrap
- **Surface**: `/bootstrap` UI + `/api/bootstrap` JSON endpoint.
- **Risk**: race between server start and admin creation — an attacker who reaches the server first can claim admin.
- **Mitigations**:
- Bootstrap token printed to stderr exactly once; held in memory, not persisted.
- The UI form on `/bootstrap` uses the in-memory token automatically (no token field for the operator to type or expose).
- Both surfaces self-disable the moment any user row exists (`CountUsers > 0`).
- Token is also blanked from process memory after success (defence in depth).
- **Residual risk**: if an operator brings up the server on the public internet before reaching the bootstrap page, an attacker reaching `/bootstrap` first wins. **Recommendation**: bring the server up behind an existing trusted network or with the listener bound to `127.0.0.1` until first-run is complete.
### 3.2 Local user accounts
- **Surface**: `/login`, `/api/auth/login`.
- **Mitigations**: Argon2id password hashing with per-deployment params; constant-time password compare; session-cookie minting via `crypto/rand`; session rows hash-only (raw token only in cookie).
- **Rate limiting**: Currently not in place at the application layer — the project assumes a reverse proxy enforces login throttling. **Recommendation**: front the server with `caddy`/`nginx` rate-limit rules in production.
- **Password policy**: 12-character minimum on bootstrap and user-setup paths; no maximum, no rotation, no history. Sufficient for self-hosted ops; tighten in policy if a deployment requires it.
### 3.3 OIDC SSO
- **Surface**: `/auth/oidc/*` — generic OIDC client, JIT user provisioning.
- **Mitigations**: state + nonce per flow; role mapping is server-configured (claims trusted only to identify the user, not pick role); user-disabled gate runs after IdP success.
- **Residual risk**: misconfigured role-mapping rules can promote any IdP user to admin. **Recommendation**: review `cfg.OIDC.RoleMappings` carefully.
### 3.4 Agent enrolment
- **Surface**: `/api/agents/enroll` (token-authenticated), `/api/agents/announce` (anonymous, then operator-approves).
- **Mitigations**:
- Token path: one-shot, hashed at rest, 1h TTL; agent receives a fresh long-lived bearer in the response.
- Announce path: agent supplies an Ed25519 public key; operator sees a fingerprint to confirm out-of-band before accepting.
- Bearer tokens are SHA-256 hashed in the DB.
- **Residual risk**: an attacker on the network between operator and target host who intercepts the install snippet can enrol *as* the target. The install script must be served over TLS in production (the docker-only deployment is TLS-by-default; bare-metal deployers must configure their own).
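The "hashed at rest" mitigations above follow a common pattern: hand the raw token to the client once, store only its SHA-256 digest, and compare digests in constant time on each request. The sketch below illustrates that pattern; the token length, hex encoding, and function names are assumptions rather than the project's actual choices.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// mintToken returns a random token for the client and the digest to store.
// Only the digest should ever be written to the database.
func mintToken() (raw string, digest [32]byte, err error) {
	buf := make([]byte, 32)
	if _, err = rand.Read(buf); err != nil {
		return "", digest, err
	}
	raw = hex.EncodeToString(buf)
	return raw, sha256.Sum256([]byte(raw)), nil
}

// verifyToken hashes the presented token and compares it to the stored
// digest without leaking how many leading bytes matched.
func verifyToken(presented string, stored [32]byte) bool {
	sum := sha256.Sum256([]byte(presented))
	return subtle.ConstantTimeCompare(sum[:], stored[:]) == 1
}

func main() {
	raw, digest, err := mintToken()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(raw), verifyToken(raw, digest), verifyToken("guess", digest))
}
```

A plain SHA-256 is reasonable here because the tokens are high-entropy random values; passwords, by contrast, go through Argon2id as section 3.2 notes.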
### 3.5 Agent → server WebSocket
- **Surface**: persistent WS authenticated by agent bearer.
- **Mitigations**: bearer is presented per-connection; server pins the agent fingerprint for the announce flow; messages are envelope-typed and rejected if shape-invalid.
- **No payload-level signing** today — TLS is the integrity boundary. A man-in-the-middle with a valid cert chain could swap messages. **Recommendation**: pin the server cert via `RM_SERVER_CERT_PIN_SHA256` if running over a network you don't fully control.
### 3.6 Repo credential lifecycle
- Stored encrypted at rest under the AEAD secret key.
- Pushed to the agent over the WS on hello, on creds change, and on demand.
- Agent persists them encrypted (per-host secret key derived from a value known only to the agent).
- Logged surfaces use `restic.RedactURL()` to strip `user:pass@` from URLs before they reach `slog`.
- Plaintext form is constructed only at `exec.Command` time inside the agent, never stored on a struct field that could be slogged.
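To make the redaction step concrete, here is a rough sketch of stripping `user:pass@` from a repository URL before logging. It is not the project's `restic.RedactURL()`; the function name `redactURL`, the handling of the `rest:` prefix, and the fallback placeholder are assumptions for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// redactURL removes any userinfo component from a restic repository
// string. Strings without embedded credentials are returned unchanged.
func redactURL(repo string) string {
	prefix, rest := "", repo
	if strings.HasPrefix(repo, "rest:") {
		// rest: repositories wrap an ordinary http(s) URL.
		prefix, rest = "rest:", strings.TrimPrefix(repo, "rest:")
	}
	u, err := url.Parse(rest)
	if err != nil {
		// Never echo a string we could not inspect.
		return "<unparseable repository URL>"
	}
	if u.User == nil {
		return repo
	}
	u.User = nil
	return prefix + u.String()
}

func main() {
	fmt.Println(redactURL("rest:https://user:pass@host/repo"))
	fmt.Println(redactURL("/srv/restic-repo"))
}
```

Failing closed on parse errors matters more than pretty output here: a log line with a placeholder is recoverable, a leaked password is not.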
### 3.7 Restore
- Operators can restore to any path the agent (running as root) can write.
- Cross-host restore (host A's snapshot → host C) is **deferred** — see F-01. The current single-host restore does not require granting any cross-host privileges.
### 3.8 Audit log
- Append-only writes from the application; SQLite enforces no schema-level immutability.
- A compromise of the SQLite file (via OS-level access) can edit the audit log. **Recommendation**: ship audit entries to an append-only sink (syslog / Loki / Splunk) if tamper-evidence beyond the OS boundary is required.
### 3.9 Self-update channel (P6)
- Agents fetch new binaries via the WS transport from the server.
- Binaries are signature-checked by the agent against a key embedded in the existing agent (see `internal/fleetupdate/`).
- **Residual risk**: a server compromise lets the attacker push code to every agent (running as root). The signing-key compromise window is the same as the server compromise window because both live on the server. Splitting the signing key onto a separate signer is future work (not v1).
---
## 4. Out of scope
- **Restic itself** — its repository format, encryption, and backend protocol are upstream-trusted.
- **The host OS** — root compromise of a host obviously compromises that host's backups.
- **The backup destination** — restic-manager assumes the rest-server / object-store / SFTP target enforces its own auth.
- **Side-channel attacks** on the server process (RAM dump, process tracing).
- **Physical access** to the server's disk.
---
## 5. Reporting
Found something we missed? See `SECURITY.md` for the disclosure
process. Coordinated disclosure preferred; the project is
maintained by a small team and we'll respond as quickly as we
reasonably can.
@@ -1,42 +0,0 @@
# Build a Linux container that runs the restic-manager agent against a
# sibling rest-server in the e2e compose stack. Used only by tests
# (e2e/compose.e2e.yml + .gitea/workflows/e2e.yml).
#
# Two stages:
# 1. golang:alpine to build the agent binary.
# 2. alpine:3.20 with the `restic` package + the built binary.
#
# Base images are pinned by tag (not by digest) for CI reproducibility.
FROM golang:1.25-alpine AS build
WORKDIR /src
ENV CGO_ENABLED=0 \
GOFLAGS="-trimpath"
COPY go.mod go.sum* ./
RUN go mod download
COPY . .
ARG VERSION=e2e
RUN go build -ldflags="-s -w -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=${VERSION}" \
-o /out/restic-manager-agent ./cmd/agent
FROM alpine:3.20
RUN apk add --no-cache restic ca-certificates curl
COPY --from=build /out/restic-manager-agent /usr/local/bin/restic-manager-agent
# Agents normally run as root because backup paths often need it. The
# e2e fixture only backs up paths under /data which we own, so this
# container would tolerate a non-root user — but staying root keeps
# parity with the production install.
USER root
# The agent needs a writable directory for its config + secrets store.
RUN mkdir -p /etc/restic-manager /var/lib/restic-manager
ENV RM_AGENT_CONFIG=/etc/restic-manager/agent.yaml
# The compose entrypoint sets the announce URL via env.
COPY e2e/agent-entrypoint.sh /usr/local/bin/entrypoint.sh
RUN chmod +x /usr/local/bin/entrypoint.sh
ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]
-21
@@ -1,21 +0,0 @@
# Playwright runner for the e2e suite. Built and run by
# e2e/compose.e2e.yml so the test process sits on the same docker
# network as the server, agent, and rest-server. The previous setup
# ran Playwright on the workflow runner host and reached the server
# via 127.0.0.1:8080; that fails on Gitea's act-style runners
# because the workflow steps execute inside a runner container,
# not on the host where compose publishes its ports.
FROM mcr.microsoft.com/playwright:v1.59.1-jammy
WORKDIR /work
# Install npm deps in a separate layer keyed off package.json so
# changes to specs don't bust the dep cache.
COPY e2e/playwright/package.json /work/package.json
RUN npm install --no-audit --no-fund
COPY e2e/playwright/ /work/
ENV CI=1
ENTRYPOINT ["npx", "playwright", "test"]
-27
@@ -1,27 +0,0 @@
#!/bin/sh
# Entrypoint for the e2e agent container.
#
# Three states:
# 1. Already enrolled (agent.yaml has a bearer): run the agent.
# 2. Token supplied via $RM_ENROL_TOKEN: enrol then run.
# 3. Otherwise: announce against $RM_SERVER and wait for an admin to
# accept us. The announce flow blocks until accepted, then drops
# straight into the normal run loop, so this is the test-friendly
# path.
set -eu
CFG="${RM_AGENT_CONFIG:-/etc/restic-manager/agent.yaml}"
SERVER="${RM_SERVER:?set RM_SERVER}"
if [ -f "$CFG" ] && grep -q '^agent_token:' "$CFG"; then
  exec restic-manager-agent -config "$CFG"
fi
if [ -n "${RM_ENROL_TOKEN:-}" ]; then
  exec restic-manager-agent -config "$CFG" \
    -enroll-server "$SERVER" \
    -enroll-token "$RM_ENROL_TOKEN"
fi
# Announce-and-approve: blocks until an admin accepts, then runs.
exec restic-manager-agent -config "$CFG" -enroll-server "$SERVER"
-113
@@ -1,113 +0,0 @@
# End-to-end test stack — used by .gitea/workflows/e2e.yml and by
# operators who want to run the Playwright suite locally.
#
# Three services:
# * server — restic-manager built from the working tree
# * agent — restic-manager agent built from the working tree
# (announces; Playwright accepts it during the test)
# * rest-server — the actual restic backend, sibling of the agent
#
# Run from the repo root:
# docker compose -f e2e/compose.e2e.yml up --build --abort-on-container-exit
services:
  rest-server:
    image: restic/rest-server:0.13.0
    environment:
      DATA_DIR: /data
      OPTIONS: "--no-auth"
    volumes:
      - rest-data:/data
    networks: [rmnet]
  server:
    build:
      context: ..
      dockerfile: deploy/Dockerfile.server
      args:
        VERSION: e2e
    environment:
      RM_LISTEN: ":8080"
      RM_DATA_DIR: "/data"
      RM_BASE_URL: "http://server:8080"
      RM_COOKIE_SECURE: "false"
      # Bind the metrics endpoint loose for the test, so one of the
      # Playwright assertions can exercise it.
      RM_METRICS_TRUSTED_CIDR: "0.0.0.0/0"
    volumes:
      - server-data:/data
    ports:
      - "127.0.0.1:8080:8080"
    healthcheck:
      test: ["CMD", "/usr/local/bin/restic-manager-server", "--version"]
      interval: 2s
      timeout: 2s
      retries: 30
    networks: [rmnet]
  agent:
    build:
      context: ..
      dockerfile: e2e/Dockerfile.agent
      args:
        VERSION: e2e
    environment:
      RM_SERVER: "http://server:8080"
    depends_on:
      - server
    volumes:
      # Source paths the agent backs up. Compose pre-populates this
      # with a few files so the snapshot list isn't empty.
      - source-data:/source
      - agent-config:/etc/restic-manager
      - agent-state:/var/lib/restic-manager
    networks: [rmnet]
  # Playwright test runner. Profile-gated so `compose up` doesn't
  # start it; CI invokes it via `compose run` and `docker cp`s the
  # report+traces out (see .gitea/workflows/e2e.yml). Lives on
  # rmnet so it can reach the server via its compose-network DNS
  # name rather than depending on host port-publish (which doesn't
  # work on Gitea's container-based runners).
  #
  # Reports are NOT bind-mounted: when the runner job itself runs
  # inside a container, `./playwright/...` resolves to a path that
  # only exists inside the runner container, so the host docker
  # daemon would silently mount an empty dir. Instead the report
  # stays inside the playwright container and the workflow extracts
  # it via `docker cp` before tearing down.
  playwright:
    profiles: [test]
    build:
      context: ..
      dockerfile: e2e/Dockerfile.playwright
    environment:
      RM_BASE_URL: "http://server:8080"
      RM_BOOTSTRAP_TOKEN: "${RM_BOOTSTRAP_TOKEN:-}"
    depends_on:
      - server
      - agent
    networks: [rmnet]
  # One-shot init container that drops a couple of files into the
  # source volume so backups have something to snapshot.
  source-fixture:
    image: alpine:3.20
    command: >
      sh -c 'mkdir -p /source && echo "hello world" > /source/hello.txt &&
      echo "another file" > /source/two.txt && sleep 0.2'
    volumes:
      - source-data:/source
    networks: [rmnet]
    restart: "no"
volumes:
  server-data:
  rest-data:
  source-data:
  agent-config:
  agent-state:
networks:
  rmnet:
    driver: bridge
-14
@@ -1,14 +0,0 @@
{
  "name": "restic-manager-e2e",
  "version": "0.0.0",
  "private": true,
  "type": "module",
  "scripts": {
    "test": "playwright test",
    "test:headed": "playwright test --headed",
    "test:debug": "PWDEBUG=1 playwright test"
  },
  "devDependencies": {
    "@playwright/test": "1.59.1"
  }
}
-35
@@ -1,35 +0,0 @@
import { defineConfig, devices } from '@playwright/test';
// Single-target Chromium config: the e2e suite is narrow (smoke
// the production-shaped flow against the docker-compose stack).
// Cross-browser matrix doesn't add signal — what we're verifying is
// the server's HTML and the agent's WebSocket handshake, neither of
// which depends on browser engine.
const baseURL = process.env.RM_BASE_URL ?? 'http://127.0.0.1:8080';
export default defineConfig({
testDir: './tests',
// 4 minutes — the smoke test waits for: enrolment + bootstrap
// (~5s), auto-init landing (~10s), backup completion (~120s
// budget). 60s is far too tight in CI; 4m gives headroom even
// on a contended runner without masking real regressions.
timeout: 240_000,
expect: { timeout: 10_000 },
fullyParallel: false,
retries: process.env.CI ? 1 : 0,
workers: 1,
reporter: [['list'], ['html', { open: 'never' }]],
use: {
baseURL,
trace: 'retain-on-failure',
screenshot: 'only-on-failure',
video: 'retain-on-failure',
},
projects: [
{
name: 'chromium',
use: { ...devices['Desktop Chrome'] },
},
],
});
-152
@@ -1,152 +0,0 @@
// Helpers used by every test. The shape favours the JSON API for
// reads + accept/dispatch (deterministic, easy to assert) and the
// browser for human-facing surfaces (login form, dashboard render).
import { APIRequestContext, expect, Page } from '@playwright/test';
export const baseURL = process.env.RM_BASE_URL ?? 'http://127.0.0.1:8080';
export interface HostJSON {
id: string;
name: string;
status: string;
repo_status?: string;
last_backup_status?: string;
}
export async function readBootstrapToken(): Promise<string> {
const tok = process.env.RM_BOOTSTRAP_TOKEN;
if (!tok) {
throw new Error('RM_BOOTSTRAP_TOKEN not set — the harness scrapes it from server logs');
}
return tok;
}
export async function bootstrapAdmin(
request: APIRequestContext,
{
username = 'admin',
password = 'e2e-test-password-1234',
}: { username?: string; password?: string } = {},
): Promise<{ username: string; password: string }> {
const token = await readBootstrapToken();
const res = await request.post(`${baseURL}/api/bootstrap`, {
data: { token, username, password },
});
if (!res.ok() && res.status() !== 409 /* already bootstrapped */) {
throw new Error(`bootstrap: ${res.status()} ${await res.text()}`);
}
return { username, password };
}
export async function loginViaUI(page: Page, username: string, password: string): Promise<void> {
await page.goto(`${baseURL}/login`);
await page.locator('#login-username').fill(username);
await page.locator('#login-password').fill(password);
await Promise.all([
page.waitForURL(new RegExp(`^${baseURL}/?$`)),
page.locator('form[action="/login"] button[type="submit"]').click(),
]);
}
/**
* Polls the dashboard until a pending host card is visible, then
* extracts its pending-id from the inline accept form's action URL.
*/
export async function waitForPendingHostID(page: Page): Promise<string> {
const formLocator = page.locator('form[action^="/api/pending-hosts/"][action$="/accept"]').first();
await expect(formLocator).toBeVisible({ timeout: 60_000 });
const action = await formLocator.getAttribute('action');
if (!action) throw new Error('pending host form has no action attribute');
const m = action.match(/\/api\/pending-hosts\/([^/]+)\/accept/);
if (!m) throw new Error(`unexpected action URL: ${action}`);
return m[1];
}
export async function acceptPending(
request: APIRequestContext,
cookie: string,
pendingID: string,
repo: { url: string; username?: string; password: string },
): Promise<void> {
const res = await request.post(`${baseURL}/api/pending-hosts/${pendingID}/accept`, {
headers: { cookie, 'content-type': 'application/json' },
data: {
repo_url: repo.url,
repo_username: repo.username ?? '',
repo_password: repo.password,
},
});
if (!res.ok()) {
throw new Error(`accept: ${res.status()} ${await res.text()}`);
}
}
export async function listHosts(request: APIRequestContext, cookie: string): Promise<HostJSON[]> {
const res = await request.get(`${baseURL}/api/hosts`, { headers: { cookie } });
if (!res.ok()) throw new Error(`list hosts: ${res.status()} ${await res.text()}`);
const body = (await res.json()) as { items?: HostJSON[]; hosts?: HostJSON[] };
return body.items ?? body.hosts ?? [];
}
export async function waitForHostStatus(
request: APIRequestContext,
cookie: string,
matcher: (h: HostJSON) => boolean,
timeoutMs = 60_000,
): Promise<HostJSON> {
const deadline = Date.now() + timeoutMs;
let last: HostJSON | undefined;
while (Date.now() < deadline) {
const hosts = await listHosts(request, cookie);
const hit = hosts.find(matcher);
if (hit) return hit;
last = hosts[0];
await new Promise((r) => setTimeout(r, 1_000));
}
throw new Error(`waitForHostStatus: timeout. Last seen: ${JSON.stringify(last)}`);
}
export async function createSourceGroup(
request: APIRequestContext,
cookie: string,
hostID: string,
body: { name: string; includes: string[]; excludes?: string[] },
): Promise<string> {
const res = await request.post(`${baseURL}/api/hosts/${hostID}/source-groups`, {
headers: { cookie, 'content-type': 'application/json' },
data: {
name: body.name,
includes: body.includes,
excludes: body.excludes ?? [],
retention_policy: {},
retry_max: 0,
retry_backoff_seconds: 0,
},
});
if (!res.ok()) throw new Error(`createSourceGroup: ${res.status()} ${await res.text()}`);
const created = (await res.json()) as { id?: string; group?: { id?: string } };
const id = created.id ?? created.group?.id;
if (!id) throw new Error(`createSourceGroup: no id in response: ${JSON.stringify(created)}`);
return id;
}
export async function runSourceGroup(
request: APIRequestContext,
cookie: string,
hostID: string,
groupID: string,
): Promise<void> {
const res = await request.post(
`${baseURL}/api/hosts/${hostID}/source-groups/${groupID}/run`,
{ headers: { cookie } },
);
if (!res.ok()) throw new Error(`runSourceGroup: ${res.status()} ${await res.text()}`);
}
export async function getSessionCookie(page: Page): Promise<string> {
const cookies = await page.context().cookies();
const c = cookies.find((c) => c.name === 'rm_session');
if (!c) throw new Error('rm_session cookie not set after login');
return `${c.name}=${c.value}`;
}
-90
@@ -1,90 +0,0 @@
// End-to-end smoke: bootstrap → accept pending host → run backup → see succeeded.
//
// The compose stack stands up a server, a sibling rest-server, and an
// agent in announce-and-approve mode. This test drives the operator
// path through the UI (login + dashboard) and the API
// (accept + run-now + poll for terminal) — UI for the human surfaces,
// API for the deterministic ones.
import { test, expect } from '@playwright/test';
import {
baseURL,
bootstrapAdmin,
loginViaUI,
waitForPendingHostID,
acceptPending,
waitForHostStatus,
createSourceGroup,
runSourceGroup,
getSessionCookie,
} from './lib/server';
test.describe('smoke: enrol-via-announce → backup', () => {
test('happy path: enrol → accept → backup → succeeded', async ({ page, request }) => {
const { username, password } = await bootstrapAdmin(request);
await loginViaUI(page, username, password);
// Dashboard renders.
await expect(page.locator('main')).toContainText(/host|fleet|pending/i, { timeout: 10_000 });
// Pending host appears (the agent container has been
// announcing since startup).
const pendingID = await waitForPendingHostID(page);
const cookie = await getSessionCookie(page);
// Accept with the rest-server creds. compose's rest-server runs
// --no-auth, so any credentials work; restic still demands a
// password to encrypt the repo.
await acceptPending(request, cookie, pendingID, {
url: 'rest:http://rest-server:8000/',
password: 'e2e-repo-password',
});
// Wait for the host to come online AND for auto-init to
// finish. Coming online happens as soon as the agent's
// bearer-authed WS attaches (~1s after accept); repo_status
// flips to 'ready' once the auto-init job completes (a
// couple of seconds later). Loading the host page before
// that leaves the Run-backup button disabled because the
// server-rendered HTML reflects the still-in-progress init,
// and the page has no live-refresh on that field.
const readyHost = await waitForHostStatus(
request, cookie,
(h) => h.status === 'online' && h.repo_status === 'ready',
90_000,
);
expect(readyHost.id).toBeTruthy();
// Per-host Run-now is gone; backups are dispatched per
// source-group now. Create one that maps to the agent's
// /source mount, then kick it via the JSON API.
const groupID = await createSourceGroup(request, cookie, readyHost.id, {
name: 'default',
includes: ['/source'],
});
await runSourceGroup(request, cookie, readyHost.id, groupID);
// Wait for the host's last_backup_status to flip to 'succeeded'.
// The host record is the source of truth: it's what the
// dashboard projects from job-completion events on the WS
// channel.
const finishedHost = await waitForHostStatus(
request, cookie,
(h) => h.id === readyHost.id && h.last_backup_status === 'succeeded',
120_000,
);
expect(finishedHost.last_backup_status).toBe('succeeded');
});
});
test.describe('smoke: scrape /metrics', () => {
test('metrics endpoint exposes the host gauge', async ({ request }) => {
// Compose sets RM_METRICS_TRUSTED_CIDR=0.0.0.0/0 so the
// endpoint is open to the test runner.
const res = await request.get(`${baseURL}/metrics`);
expect(res.status()).toBe(200);
const body = await res.text();
expect(body).toContain('rm_hosts_total');
expect(body).toContain('rm_build_info{');
});
});
+3 -7
@@ -3,26 +3,22 @@ module gitea.dcglab.co.uk/steve/restic-manager
go 1.25.0
require (
github.com/coder/websocket v1.8.14
github.com/coreos/go-oidc/v3 v3.18.0
github.com/go-chi/chi/v5 v5.2.5
github.com/golang-jwt/jwt/v5 v5.3.1
github.com/oklog/ulid/v2 v2.1.1
github.com/robfig/cron/v3 v3.0.1
golang.org/x/crypto v0.50.0
golang.org/x/oauth2 v0.36.0
golang.org/x/sys v0.43.0
gopkg.in/yaml.v3 v3.0.1
modernc.org/sqlite v1.50.0
)
require (
github.com/coder/websocket v1.8.14 // indirect
github.com/dustin/go-humanize v1.0.1 // indirect
github.com/go-jose/go-jose/v4 v4.1.4 // indirect
github.com/google/uuid v1.6.0 // indirect
github.com/mattn/go-isatty v0.0.20 // indirect
github.com/ncruces/go-strftime v1.0.0 // indirect
github.com/remyoudompheng/bigfft v0.0.0-20230129092748-24d4a6f8daec // indirect
github.com/robfig/cron/v3 v3.0.1 // indirect
golang.org/x/sys v0.43.0 // indirect
modernc.org/libc v1.72.0 // indirect
modernc.org/mathutil v1.7.1 // indirect
modernc.org/memory v1.11.0 // indirect
-8
@@ -1,15 +1,9 @@
github.com/coder/websocket v1.8.14 h1:9L0p0iKiNOibykf283eHkKUHHrpG7f65OE3BhhO7v9g=
github.com/coder/websocket v1.8.14/go.mod h1:NX3SzP+inril6yawo5CQXx8+fk145lPDC6pumgx0mVg=
github.com/coreos/go-oidc/v3 v3.18.0 h1:V9orjXynvu5wiC9SemFTWnG4F45v403aIcjWo0d41+A=
github.com/coreos/go-oidc/v3 v3.18.0/go.mod h1:DYCf24+ncYi+XkIH97GY1+dqoRlbaSI26KVTCI9SrY4=
github.com/dustin/go-humanize v1.0.1 h1:GzkhY7T5VNhEkwH0PVJgjz+fX1rhBrR7pRT3mDkpeCY=
github.com/dustin/go-humanize v1.0.1/go.mod h1:Mu1zIs6XwVuF/gI1OepvI0qD18qycQx+mFykh5fBlto=
github.com/go-chi/chi/v5 v5.2.5 h1:Eg4myHZBjyvJmAFjFvWgrqDTXFyOzjj7YIm3L3mu6Ug=
github.com/go-chi/chi/v5 v5.2.5/go.mod h1:X7Gx4mteadT3eDOMTsXzmI4/rwUpOwBHLpAfupzFJP0=
github.com/go-jose/go-jose/v4 v4.1.4 h1:moDMcTHmvE6Groj34emNPLs/qtYXRVcd6S7NHbHz3kA=
github.com/go-jose/go-jose/v4 v4.1.4/go.mod h1:x4oUasVrzR7071A4TnHLGSPpNOm2a21K9Kf04k1rs08=
github.com/golang-jwt/jwt/v5 v5.3.1 h1:kYf81DTWFe7t+1VvL7eS+jKFVWaUnK9cB1qbwn63YCY=
github.com/golang-jwt/jwt/v5 v5.3.1/go.mod h1:fxCRLWMO43lRc8nhHWY6LGqRcf+1gQWArsqaEUEa5bE=
github.com/google/pprof v0.0.0-20250317173921-a4b03ec1a45e h1:ijClszYn+mADRFY17kjQEVQ1XRhq2/JR1M3sGqeJoxs=
github.com/google/pprof v0.0.0-20250317173921-a4b03ec1a45e/go.mod h1:boTsfXsheKC2y+lKOCMpSfarhxDeIzfZG1jqGcPl3cA=
github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0=
@@ -31,8 +25,6 @@ golang.org/x/crypto v0.50.0 h1:zO47/JPrL6vsNkINmLoo/PH1gcxpls50DNogFvB5ZGI=
golang.org/x/crypto v0.50.0/go.mod h1:3muZ7vA7PBCE6xgPX7nkzzjiUq87kRItoJQM1Yo8S+Q=
golang.org/x/mod v0.33.0 h1:tHFzIWbBifEmbwtGz65eaWyGiGZatSrT9prnU8DbVL8=
golang.org/x/mod v0.33.0/go.mod h1:swjeQEj+6r7fODbD2cqrnje9PnziFuw4bmLbBZFrQ5w=
golang.org/x/oauth2 v0.36.0 h1:peZ/1z27fi9hUOFCAZaHyrpWG5lwe0RJEEEeH0ThlIs=
golang.org/x/oauth2 v0.36.0/go.mod h1:YDBUJMTkDnJS+A4BP4eZBjCqtokkg1hODuPjwiGPO7Q=
golang.org/x/sync v0.20.0 h1:e0PTpb7pjO8GAtTs2dQ6jYa5BWYlMuX047Dco/pItO4=
golang.org/x/sync v0.20.0/go.mod h1:9xrNwdLfx4jkKbNva9FpL6vEN7evnE43NNNJQ2LF3+0=
golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
+7 -13
@@ -32,11 +32,6 @@ type Config struct {
RepoUsername string
RepoPassword string
// SupportsRestoreNoOwnership comes from a startup probe of
// `restic restore --help`; gates the new-dir-restore flag without
// relying on version sniffing.
SupportsRestoreNoOwnership bool
// Bandwidth caps in KB/s applied to every restic invocation.
// <=0 means "no cap". Per-job override: callers that build a
// runner per-dispatch can pass the override value here directly.
@@ -66,14 +61,13 @@ func New(cfg Config, tx Sender, progressMinPeriod time.Duration) *Runner {
// resticEnv builds the shared restic.Env from r.cfg.
func (r *Runner) resticEnv() restic.Env {
return restic.Env{
Bin: r.cfg.ResticBin,
Version: r.cfg.ResticVersion,
RepoURL: r.cfg.RepoURL,
RepoUsername: r.cfg.RepoUsername,
RepoPassword: r.cfg.RepoPassword,
SupportsRestoreNoOwnership: r.cfg.SupportsRestoreNoOwnership,
LimitUploadKBps: r.cfg.LimitUploadKBps,
LimitDownloadKBps: r.cfg.LimitDownloadKBps,
Bin: r.cfg.ResticBin,
Version: r.cfg.ResticVersion,
RepoURL: r.cfg.RepoURL,
RepoUsername: r.cfg.RepoUsername,
RepoPassword: r.cfg.RepoPassword,
LimitUploadKBps: r.cfg.LimitUploadKBps,
LimitDownloadKBps: r.cfg.LimitDownloadKBps,
}
}
+7 -34
@@ -2,14 +2,10 @@ package runner
import (
"context"
"errors"
"os"
"os/exec"
"path/filepath"
"sync"
"syscall"
"testing"
"time"
"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
"gitea.dcglab.co.uk/steve/restic-manager/internal/restic"
@@ -47,22 +43,13 @@ func (s *fakeSender) snapshot() []api.Envelope {
// setupScript writes a shell script (without shebang) to a temp dir,
// names it "restic", makes it executable, and returns the path.
//
// Writes to "<path>.tmp" then renames into place. The rename is the
// usual guard against ETXTBSY: under -race + many t.Parallel tests,
// a fork-from-another-goroutine can inherit the writable fd from
// Writes to "<path>.tmp" then renames into place. The rename is what
// makes this race-free: under -race + many t.Parallel tests, a
// fork-from-another-goroutine can inherit the writable fd from
// os.WriteFile before close completes, and exec'ing the file then
// returns ETXTBSY ("text file busy"). The renamed dirent points at
// an inode that has no writable fd open anywhere — exec is safe on
// a vanilla filesystem.
//
// On overlayfs (every job that runs inside a `container:` block on
// our Gitea runner), the rename can briefly leak ETXTBSY anyway —
// the upper layer's "writable inode" bookkeeping lags the userspace
// close. To make the helper deterministic across environments, we
// probe-exec the file with a benign argument until exec succeeds,
// then return. Each script body has a `case "$1" in ... esac` shape
// where unknown args fall through to a clean exit, so the probe is
// a no-op from the test's point of view.
// returns ETXTBSY ("text file busy"). Once the rename lands, the
// final path is a fresh dirent pointing at an inode that has no
// writable fd open anywhere — exec is safe.
func setupScript(t *testing.T, body string) string {
t.Helper()
dir := t.TempDir()
@@ -74,21 +61,7 @@ func setupScript(t *testing.T, body string) string {
if err := os.Rename(tmp, final); err != nil {
t.Fatalf("setupScript: rename: %v", err)
}
deadline := time.Now().Add(3 * time.Second)
for {
err := exec.Command(final, "__rm_probe__").Run()
if err == nil {
return final
}
if !errors.Is(err, syscall.ETXTBSY) {
t.Fatalf("setupScript: probe exec: %v", err)
}
if time.Now().After(deadline) {
t.Fatalf("setupScript: %s still ETXTBSY after 3s", final)
}
time.Sleep(10 * time.Millisecond)
}
return final
}
// firstEnvOfType returns the first envelope with the given type, or
-100
@@ -1,100 +0,0 @@
// Package updater carries the agent's self-update logic.
//
// The flow is operator-driven: the server dispatches a command.update
// WS envelope, the agent fetches a fresh binary from the server's
// /agent/binary endpoint, atomic-renames it over the running binary
// (Linux) or hands off to a detached helper script (Windows), and
// exits cleanly so the service manager restarts under the new
// binary. See docs/superpowers/specs/2026-05-06-p6-01-02-...
//
// Platform-specific code is build-tagged into updater_unix.go /
// updater_windows.go. This file holds the shared HTTP fetch + path
// helpers + the test seam.
package updater
import (
"context"
"fmt"
"io"
"net/http"
"os"
"path/filepath"
"runtime"
"time"
)
// fetch downloads the new binary into <binaryPath>.new, fsyncs, chmods.
// Returns the path of the staged file (always binaryPath + ".new").
func fetch(ctx context.Context, serverURL, binaryPath string) (string, error) {
url := fmt.Sprintf("%s/agent/binary?os=%s&arch=%s", serverURL, runtime.GOOS, runtime.GOARCH)
req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
if err != nil {
return "", err
}
c := &http.Client{Timeout: 5 * time.Minute}
res, err := c.Do(req)
if err != nil {
return "", err
}
defer func() { _ = res.Body.Close() }()
if res.StatusCode != http.StatusOK {
return "", fmt.Errorf("agent binary fetch: %s", res.Status)
}
stagePath := binaryPath + ".new"
f, err := os.OpenFile(stagePath, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
if err != nil {
return "", err
}
if _, copyErr := io.Copy(f, res.Body); copyErr != nil {
_ = f.Close()
_ = os.Remove(stagePath)
return "", copyErr
}
if syncErr := f.Sync(); syncErr != nil {
_ = f.Close()
_ = os.Remove(stagePath)
return "", syncErr
}
if closeErr := f.Close(); closeErr != nil {
_ = os.Remove(stagePath)
return "", closeErr
}
if err := os.Chmod(stagePath, 0o755); err != nil {
_ = os.Remove(stagePath)
return "", err
}
return stagePath, nil
}
// resolveOwnBinary returns the absolute path of the running binary.
// It rejects /proc/self/exe, which os.Executable can return on some
// systems: that path is a procfs symlink, so a rename cannot target it.
func resolveOwnBinary() (string, error) {
p, err := os.Executable()
if err != nil {
return "", err
}
abs, err := filepath.Abs(p)
if err != nil {
return "", err
}
if abs == "/proc/self/exe" {
return "", fmt.Errorf("cannot resolve own binary path (/proc/self/exe)")
}
return abs, nil
}
// UpdateForTest is the platform-neutral test seam. In production the
// platform-specific Update fetches, swaps, then exits the process.
// UpdateForTest stops short of the exit so unit tests can assert on
// file state.
func UpdateForTest(serverURL, binaryPath string) error {
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer cancel()
stage, err := fetch(ctx, serverURL, binaryPath)
if err != nil {
return err
}
return swap(stage, binaryPath)
}
-87
@@ -1,87 +0,0 @@
//go:build !windows
package updater
import (
"bytes"
"io"
"net/http"
"net/http/httptest"
"os"
"path/filepath"
"runtime"
"testing"
)
// TestUpdate_LinuxAtomicSwap stages a fake "running binary" file, runs
// UpdateForTest against a fake /agent/binary server, and asserts that
// the binary was swapped, .old preserves the previous bytes, and .new
// was renamed away.
func TestUpdate_LinuxAtomicSwap(t *testing.T) {
tmp := t.TempDir()
binPath := filepath.Join(tmp, "agent")
if err := os.WriteFile(binPath, []byte("OLD"), 0o755); err != nil {
t.Fatal(err)
}
newBytes := []byte("NEW BINARY CONTENTS")
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
if r.URL.Path != "/agent/binary" {
http.NotFound(w, r)
return
}
gotOS, gotArch := r.URL.Query().Get("os"), r.URL.Query().Get("arch")
if gotOS != runtime.GOOS || gotArch != runtime.GOARCH {
t.Errorf("query mismatch: got os=%s arch=%s want %s/%s",
gotOS, gotArch, runtime.GOOS, runtime.GOARCH)
}
_, _ = io.Copy(w, bytes.NewReader(newBytes))
}))
defer srv.Close()
if err := UpdateForTest(srv.URL, binPath); err != nil {
t.Fatalf("update: %v", err)
}
got, err := os.ReadFile(binPath)
if err != nil {
t.Fatal(err)
}
if string(got) != string(newBytes) {
t.Fatalf("binary contents: got %q want %q", got, newBytes)
}
old, err := os.ReadFile(binPath + ".old")
if err != nil {
t.Fatalf("agent.old missing: %v", err)
}
if string(old) != "OLD" {
t.Fatalf("agent.old contents: got %q want %q", old, "OLD")
}
if _, err := os.Stat(binPath + ".new"); !os.IsNotExist(err) {
t.Fatalf("agent.new should be absent after swap, got err=%v", err)
}
}
// TestUpdate_FetchHTTPError surfaces the server's status when the
// binary is not published for this os/arch.
func TestUpdate_FetchHTTPError(t *testing.T) {
tmp := t.TempDir()
binPath := filepath.Join(tmp, "agent")
if err := os.WriteFile(binPath, []byte("OLD"), 0o755); err != nil {
t.Fatal(err)
}
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
http.Error(w, `{"error":"binary_not_published"}`, http.StatusNotFound)
}))
defer srv.Close()
err := UpdateForTest(srv.URL, binPath)
if err == nil {
t.Fatal("expected error, got nil")
}
got, _ := os.ReadFile(binPath)
if string(got) != "OLD" {
t.Fatalf("binary should not have changed, got %q", got)
}
}
-73
@@ -1,73 +0,0 @@
//go:build !windows
package updater
import (
"context"
"fmt"
"io"
"log/slog"
"os"
"time"
)
// Update fetches the new binary, swaps it in, then exits so systemd
// restarts the process under the new binary. The caller should close
// the WS connection cleanly (so the server transitions the host to
// disconnected immediately rather than waiting for the heartbeat
// sweep) before invoking.
//
// Service-user assumption: the agent runs as root under the
// systemd-shipped unit, which can write the binary path directly.
// If the agent ever moves to a non-root service user, this breaks —
// would need a setuid helper or an out-of-process update service.
func Update(ctx context.Context, serverURL string) error {
binPath, err := resolveOwnBinary()
if err != nil {
return err
}
stage, err := fetch(ctx, serverURL, binPath)
if err != nil {
return err
}
if err := swap(stage, binPath); err != nil {
return err
}
slog.Info("agent self-update: binary swapped, exiting for systemd restart",
"binary", binPath)
// Give logger / WS close-frame a moment to flush, then exit.
time.Sleep(200 * time.Millisecond)
os.Exit(0)
return nil // unreachable
}
// swap copies the running binary to <bin>.old (M1: keep one revision
// back for hand-rolled rollback), then atomically renames the staged
// binary into place. Linux allows renaming over an executable that is
// currently running, so this works even though the process is still
// executing the old file; its inode lives on until the process exits.
func swap(stagePath, binPath string) error {
src, err := os.Open(binPath)
if err != nil {
return fmt.Errorf("open running binary: %w", err)
}
defer func() { _ = src.Close() }()
dst, err := os.OpenFile(binPath+".old", os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
if err != nil {
return fmt.Errorf("open .old: %w", err)
}
if _, err := io.Copy(dst, src); err != nil {
_ = dst.Close()
return fmt.Errorf("copy to .old: %w", err)
}
if err := dst.Sync(); err != nil {
_ = dst.Close()
return err
}
if err := dst.Close(); err != nil {
return err
}
if err := os.Rename(stagePath, binPath); err != nil {
return fmt.Errorf("rename .new over running binary: %w", err)
}
return nil
}
-73
@@ -1,73 +0,0 @@
//go:build windows
package updater
import (
"context"
"fmt"
"log/slog"
"os"
"os/exec"
"path/filepath"
"syscall"
"time"
)
// helperScript is rendered with fmt.Sprintf, args order:
//
// %[1]s — running binary path (source for the .old copy)
// %[2]s — .old path
// %[3]s — staged .new path
// %[4]s — running binary path (rename target)
const helperScript = `@echo off
timeout /t 3 /nobreak >nul
copy /Y "%[1]s" "%[2]s"
sc stop restic-manager-agent
:wait
sc query restic-manager-agent | find "STOPPED" >nul
if errorlevel 1 (timeout /t 1 /nobreak >nul & goto wait)
move /Y "%[3]s" "%[4]s"
sc start restic-manager-agent
del "%%~f0"
`
// Update on Windows can't overwrite the running .exe in-process
// (exclusive file lock), so we stage the new binary, write a small
// detached helper script that waits, stops the service, swaps the
// binary, and starts the service, then exit cleanly. SCM treats
// clean exits after sc stop as intentional and does not auto-restart;
// the helper's final sc start handles that.
func Update(ctx context.Context, serverURL string) error {
binPath, err := resolveOwnBinary()
if err != nil {
return err
}
stage, err := fetch(ctx, serverURL, binPath)
if err != nil {
return err
}
helperPath := filepath.Join(filepath.Dir(binPath), "agent-update.cmd")
body := fmt.Sprintf(helperScript, binPath, binPath+".old", stage, binPath)
if err := os.WriteFile(helperPath, []byte(body), 0o755); err != nil {
return err
}
cmd := exec.Command("cmd.exe", "/c", helperPath)
cmd.SysProcAttr = &syscall.SysProcAttr{
HideWindow: true,
CreationFlags: 0x00000008 | 0x08000000, // DETACHED_PROCESS | CREATE_NO_WINDOW
}
if err := cmd.Start(); err != nil {
return err
}
slog.Info("agent self-update: helper spawned, exiting cleanly",
"binary", binPath, "helper", helperPath)
time.Sleep(200 * time.Millisecond)
os.Exit(0)
return nil // unreachable
}
// swap is unused on Windows — the helper script does the swap.
// Defined to satisfy the build (UpdateForTest references it).
func swap(_, _ string) error {
return fmt.Errorf("updater.swap not implemented on Windows; use the helper script via Update")
}
+6 -75
@@ -22,12 +22,6 @@ import (
"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
)
// staleBackupThreshold is how long an intermittent host may go without
// a successful backup before we raise a stale_schedule alert. Global
// constant for v1 (may become per-host later). Only intermittent hosts
// are evaluated — always-on hosts' stale_schedule stays a no-op.
const staleBackupThreshold = 7 * 24 * time.Hour
// JobFinishedEvent carries everything the engine needs to evaluate
// the failed-X rules. Pushed via Engine.NotifyJobFinished from the
// MarkJobFinished site.
@@ -155,10 +149,6 @@ func (e *Engine) handleJobFinished(ctx context.Context, ev JobFinishedEvent) {
fmt.Sprintf("%s job %s failed", ev.Kind, ev.JobID), ev.When)
case "succeeded":
e.resolveAndNotify(ctx, ev.HostID, kind, dedupKey, ev.When)
if ev.Kind == "backup" {
// A fresh backup clears staleness for intermittent hosts.
e.resolveAndNotify(ctx, ev.HostID, KindStaleSchedule, "", ev.When)
}
}
}
@@ -167,12 +157,6 @@ func (e *Engine) handleHostOffline(ctx context.Context, hostID string) {
if err != nil {
return
}
// Intermittent hosts (laptops) legitimately disappear — never raise
// agent_offline for them. The stale_schedule sweep in tick() is the
// only staleness signal for these hosts.
if !host.AlwaysOn {
return
}
// Apply the 15-min floor — raise only when last_seen_at is older
// than agentOfflineFloor. A nil last_seen_at (host enrolled but
// never connected) is treated as "now" so we don't raise
@@ -196,56 +180,18 @@ func (e *Engine) handleHostOnline(ctx context.Context, hostID string) {
// tick is the 60-second sweep. Responsibilities:
// 1. Re-evaluate agent_offline for every offline host that may have
// crossed the floor between explicit events.
// 2. Stale-schedule detection for intermittent hosts — raises
// stale_schedule when LastBackupAt is older than 7 days and the
// host has an enabled schedule. Always-on hosts are excluded.
// 2. Stale-schedule detection — declared in the spec but intentionally
// left as a no-op in v1. The precise "expected to have fired but
// didn't" trigger requires a store helper that lands in a later
// task. The KindStaleSchedule constant is exported so UI code can
// reference the tag string today.
func (e *Engine) tick(ctx context.Context, now time.Time) {
// User-management cleanup piggy-backed here for now. Setup tokens
// have a 1h expiry; the alert engine tick is the cheapest existing
// 60s loop. If more housekeeping queries appear, extract a
// dedicated maintenance loop.
if _, err := e.store.CleanupExpiredSetupTokens(ctx, now); err != nil {
slog.Warn("alert: cleanup expired setup tokens", "err", err)
}
if _, err := e.store.CleanupExpiredOIDCState(ctx, now.Add(-5*time.Minute)); err != nil {
slog.Warn("alert: cleanup expired oidc state", "err", err)
}
hosts, err := e.store.ListHosts(ctx)
if err != nil {
slog.Warn("alert: tick list hosts", "err", err)
return
}
for _, h := range hosts {
// Intermittent hosts: suppress agent_offline entirely; instead
// raise stale_schedule when they have gone too long with no
// successful backup AND they have at least one enabled schedule
// to be measured against. A nil LastBackupAt (never backed up)
// has no baseline — onboarding/repo_status covers that case.
if !h.AlwaysOn {
if h.LastBackupAt == nil {
continue
}
if now.Sub(*h.LastBackupAt) < staleBackupThreshold {
continue
}
hasEnabled, err := e.hostHasEnabledSchedule(ctx, h.ID)
if err != nil {
slog.Warn("alert: tick list schedules", "host_id", h.ID, "err", err)
continue
}
if !hasEnabled {
continue
}
e.raiseAndNotify(ctx, h.ID, KindStaleSchedule, "", "warning",
fmt.Sprintf("No backup in %s (threshold %s)",
roundDur(now.Sub(*h.LastBackupAt)), staleBackupThreshold), now)
// Resolution is handled in handleJobFinished on a successful
// backup (and ResolveOnModeChange on toggle) — the tick only
// raises, it does not auto-resolve.
continue
}
// Always-on hosts: existing agent_offline re-evaluation.
if h.Status != "offline" || h.LastSeenAt == nil {
continue
}
@@ -255,6 +201,7 @@ func (e *Engine) tick(ctx context.Context, now time.Time) {
roundDur(now.Sub(*h.LastSeenAt)), e.agentOfflineFloor), now)
}
}
// Stale-schedule sweep — no-op in v1. See KindStaleSchedule doc comment.
}
// roundDur returns a human-readable duration string, rounding to the
@@ -266,19 +213,3 @@ func roundDur(d time.Duration) string {
}
return d.Round(time.Minute).String()
}
// hostHasEnabledSchedule reports whether the host has at least one
// enabled backup schedule — the precondition for a stale_schedule
// alert (no schedule = no backup expectation to measure against).
func (e *Engine) hostHasEnabledSchedule(ctx context.Context, hostID string) (bool, error) {
schedules, err := e.store.ListSchedulesByHost(ctx, hostID)
if err != nil {
return false, err
}
for _, sc := range schedules {
if sc.Enabled {
return true, nil
}
}
return false, nil
}
-255
@@ -1,255 +0,0 @@
package alert
import (
"context"
"testing"
"time"
"github.com/oklog/ulid/v2"
"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
)
// TestIntermittentHostSuppressesOfflineAlert checks that handleHostOffline
// does NOT raise agent_offline for a host with AlwaysOn=false.
func TestIntermittentHostSuppressesOfflineAlert(t *testing.T) {
t.Parallel()
eng, st, hostID := setupEngine(t)
ctx := context.Background()
// Make the host intermittent.
if err := st.SetHostAlwaysOn(ctx, hostID, false); err != nil {
t.Fatalf("SetHostAlwaysOn: %v", err)
}
// Give it a stale last_seen_at well past the floor.
if _, err := st.DB().Exec(
`UPDATE hosts SET last_seen_at = ?, status = ? WHERE id = ?`,
time.Now().UTC().Add(-2*time.Hour).Format(time.RFC3339Nano),
"offline",
hostID,
); err != nil {
t.Fatalf("update last_seen_at: %v", err)
}
eng.handleHostOffline(ctx, hostID)
open, _ := st.ListAlerts(ctx, store.AlertFilter{Status: "open", HostID: hostID})
if len(open) != 0 {
t.Fatalf("expected 0 open alerts for intermittent host; got %d: %+v", len(open), open)
}
}
// TestAlwaysOnHostStillRaisesOfflineAlert checks that always-on hosts still
// get an agent_offline alert when offline past the floor.
func TestAlwaysOnHostStillRaisesOfflineAlert(t *testing.T) {
t.Parallel()
eng, st, hostID := setupEngine(t)
ctx := context.Background()
// always_on=true is the default, but be explicit.
if err := st.SetHostAlwaysOn(ctx, hostID, true); err != nil {
t.Fatalf("SetHostAlwaysOn: %v", err)
}
// Give it a stale last_seen_at well past the 15m floor.
if _, err := st.DB().Exec(
`UPDATE hosts SET last_seen_at = ?, status = ? WHERE id = ?`,
time.Now().UTC().Add(-2*time.Hour).Format(time.RFC3339Nano),
"offline",
hostID,
); err != nil {
t.Fatalf("update last_seen_at: %v", err)
}
eng.handleHostOffline(ctx, hostID)
open, _ := st.ListAlerts(ctx, store.AlertFilter{Status: "open", HostID: hostID})
if len(open) != 1 || open[0].Kind != KindAgentOffline {
t.Fatalf("expected 1 agent_offline alert; got %d: %+v", len(open), open)
}
}
// TestStalenessAlertForIntermittentHost checks that tick raises stale_schedule
// for an intermittent host whose last backup is older than 7 days AND has an
// enabled schedule. Also verifies that a succeeded backup clears the alert.
func TestStalenessAlertForIntermittentHost(t *testing.T) {
t.Parallel()
eng, st, hostID := setupEngine(t)
ctx := context.Background()
// Make intermittent.
if err := st.SetHostAlwaysOn(ctx, hostID, false); err != nil {
t.Fatalf("SetHostAlwaysOn: %v", err)
}
// Create a source group to attach the schedule to.
sgID := ulid.Make().String()
if err := st.CreateSourceGroup(ctx, &store.SourceGroup{
ID: sgID,
HostID: hostID,
Name: "default",
Includes: []string{"/home"},
}); err != nil {
t.Fatalf("CreateSourceGroup: %v", err)
}
// Create an enabled schedule pointing at the source group.
schedID := ulid.Make().String()
if err := st.CreateSchedule(ctx, &store.Schedule{
ID: schedID,
HostID: hostID,
CronExpr: "0 2 * * *",
Enabled: true,
SourceGroupIDs: []string{sgID},
}); err != nil {
t.Fatalf("CreateSchedule: %v", err)
}
// Set last_backup_at to 8 days ago.
eightDaysAgo := time.Now().UTC().Add(-8 * 24 * time.Hour)
if err := st.SetHostLastBackup(ctx, hostID, "succeeded", eightDaysAgo); err != nil {
t.Fatalf("SetHostLastBackup: %v", err)
}
eng.tick(ctx, time.Now().UTC())
open, _ := st.ListAlerts(ctx, store.AlertFilter{Status: "open", HostID: hostID})
var staleCount int
for _, a := range open {
if a.Kind == KindStaleSchedule {
staleCount++
}
}
if staleCount != 1 {
t.Fatalf("expected 1 stale_schedule alert after tick; got %d (all open: %+v)", staleCount, open)
}
// A succeeded backup should clear the stale_schedule alert.
eng.handleJobFinished(ctx, JobFinishedEvent{
HostID: hostID,
JobID: ulid.Make().String(),
Kind: "backup",
Status: "succeeded",
SourceGroupID: sgID,
When: time.Now().UTC(),
})
open, _ = st.ListAlerts(ctx, store.AlertFilter{Status: "open", HostID: hostID})
for _, a := range open {
if a.Kind == KindStaleSchedule {
t.Fatalf("expected stale_schedule to be resolved after backup succeeded; still open: %+v", a)
}
}
}
// TestNoStalenessWithoutEnabledSchedule checks that no stale_schedule is
// raised for an intermittent host with a stale backup but no enabled schedule.
func TestNoStalenessWithoutEnabledSchedule(t *testing.T) {
t.Parallel()
eng, st, hostID := setupEngine(t)
ctx := context.Background()
// Make intermittent.
if err := st.SetHostAlwaysOn(ctx, hostID, false); err != nil {
t.Fatalf("SetHostAlwaysOn: %v", err)
}
// Set last_backup_at to 8 days ago — stale — but no schedule.
eightDaysAgo := time.Now().UTC().Add(-8 * 24 * time.Hour)
if err := st.SetHostLastBackup(ctx, hostID, "succeeded", eightDaysAgo); err != nil {
t.Fatalf("SetHostLastBackup: %v", err)
}
eng.tick(ctx, time.Now().UTC())
open, _ := st.ListAlerts(ctx, store.AlertFilter{Status: "open", HostID: hostID})
for _, a := range open {
if a.Kind == KindStaleSchedule {
t.Fatalf("expected no stale_schedule without an enabled schedule; got: %+v", a)
}
}
}
// TestResolveOnModeChangeClearsOfflineAlert checks that ResolveOnModeChange
// clears an open agent_offline alert when a host's mode is toggled.
func TestResolveOnModeChangeClearsOfflineAlert(t *testing.T) {
t.Parallel()
eng, st, hostID := setupEngine(t)
ctx := context.Background()
// Make always-on and set it offline with a stale last_seen_at.
if err := st.SetHostAlwaysOn(ctx, hostID, true); err != nil {
t.Fatalf("SetHostAlwaysOn: %v", err)
}
if _, err := st.DB().Exec(
`UPDATE hosts SET last_seen_at = ?, status = ? WHERE id = ?`,
time.Now().UTC().Add(-2*time.Hour).Format(time.RFC3339Nano),
"offline",
hostID,
); err != nil {
t.Fatalf("update last_seen_at: %v", err)
}
// Raise the offline alert.
eng.handleHostOffline(ctx, hostID)
open, _ := st.ListAlerts(ctx, store.AlertFilter{Status: "open", HostID: hostID})
if len(open) != 1 || open[0].Kind != KindAgentOffline {
t.Fatalf("expected 1 agent_offline alert before mode change; got %d: %+v", len(open), open)
}
// Toggle mode — should clear the alert.
eng.ResolveOnModeChange(ctx, hostID, time.Now().UTC())
open, _ = st.ListAlerts(ctx, store.AlertFilter{Status: "open", HostID: hostID})
for _, a := range open {
if a.Kind == KindAgentOffline {
t.Fatalf("expected agent_offline to be resolved after mode change; still open: %+v", a)
}
}
}
// TestNoStalenessWhenNeverBackedUp checks that no stale_schedule alert is
// raised for an intermittent host that has never backed up (nil LastBackupAt).
func TestNoStalenessWhenNeverBackedUp(t *testing.T) {
t.Parallel()
eng, st, hostID := setupEngine(t)
ctx := context.Background()
// Make intermittent.
if err := st.SetHostAlwaysOn(ctx, hostID, false); err != nil {
t.Fatalf("SetHostAlwaysOn: %v", err)
}
// Create a source group and an enabled schedule — but do NOT set LastBackupAt.
sgID := ulid.Make().String()
if err := st.CreateSourceGroup(ctx, &store.SourceGroup{
ID: sgID,
HostID: hostID,
Name: "default",
Includes: []string{"/home"},
}); err != nil {
t.Fatalf("CreateSourceGroup: %v", err)
}
schedID := ulid.Make().String()
if err := st.CreateSchedule(ctx, &store.Schedule{
ID: schedID,
HostID: hostID,
CronExpr: "0 2 * * *",
Enabled: true,
SourceGroupIDs: []string{sgID},
}); err != nil {
t.Fatalf("CreateSchedule: %v", err)
}
eng.tick(ctx, time.Now().UTC())
open, _ := st.ListAlerts(ctx, store.AlertFilter{Status: "open", HostID: hostID})
for _, a := range open {
if a.Kind == KindStaleSchedule {
t.Fatalf("expected no stale_schedule when never backed up; got: %+v", a)
}
}
}
+4 -14
@@ -27,10 +27,10 @@ const (
// integrity is at risk) when a check job fails.
KindCheckFailed = "check_failed"
// KindStaleSchedule is raised for intermittent (non-always-on) hosts
// when their last successful backup is older than staleBackupThreshold
// (7 days) and they have at least one enabled schedule. Resolved on
// backup success or when the host is switched to always-on mode.
// KindStaleSchedule is declared for completeness but intentionally
// left as a no-op in v1. The precise "expected to have fired but
// didn't" logic requires a store helper that lands in a follow-up
// task. Ask the team before implementing.
KindStaleSchedule = "stale_schedule"
// KindAgentOffline is raised when a host's last_seen_at is older
@@ -122,16 +122,6 @@ func alertPayload(ctx context.Context, st *store.Store, ev notification.Event, a
}
}
// ResolveOnModeChange clears any open agent_offline and stale_schedule
// alerts for a host whose always-on flag was just toggled. The next
// 60s tick re-raises whichever still applies under the new mode, so
// this is a self-correcting "wipe and let the sweep settle" call.
// Safe to invoke from the HTTP layer (it only touches the store + hub).
func (e *Engine) ResolveOnModeChange(ctx context.Context, hostID string, when time.Time) {
e.resolveAndNotify(ctx, hostID, KindAgentOffline, "", when)
e.resolveAndNotify(ctx, hostID, KindStaleSchedule, "", when)
}
// resolveAndNotify clears the open (or acknowledged) alert matching
// (host_id, kind, dedup_key) via store.AutoResolve, then fires
// alert.resolved for the row(s) actually closed. Best-effort —
-63
@@ -1,63 +0,0 @@
package alert
import (
"context"
"fmt"
"log/slog"
"time"
"gitea.dcglab.co.uk/steve/restic-manager/internal/notification"
)
// Alert-kind constants for P6 self-update flows.
const (
// KindUpdateFailed is raised when an agent fails to come back with
// the expected version after a command.update dispatch (timeout or
// version-mismatch). Resolved by a subsequent matching hello.
KindUpdateFailed = "update_failed"
// KindFleetUpdateHalted is raised when the fleet-update worker
// stops mid-run because a host failed to update or went offline.
// Host-less alert (system-scoped). Manually resolved by an admin.
KindFleetUpdateHalted = "fleet_update_halted"
)
// RaiseUpdateFailed records a per-host update failure. dedupKey is the
// hostID so a re-dispatch on the same host touches the existing alert
// rather than spawning a duplicate.
func (e *Engine) RaiseUpdateFailed(ctx context.Context, hostID, jobID, reason string, when time.Time) {
msg := fmt.Sprintf("Agent update failed (job %s): %s", jobID, reason)
e.raiseAndNotify(ctx, hostID, KindUpdateFailed, hostID, "warning", msg, when)
}
// ResolveUpdateFailed clears any open update_failed alert for hostID.
// Called from the WS hello path when the agent reconnects with the
// target version.
func (e *Engine) ResolveUpdateFailed(ctx context.Context, hostID string, when time.Time) {
e.resolveAndNotify(ctx, hostID, KindUpdateFailed, hostID, when)
}
// RaiseFleetUpdateHalted is host-less — the fleet update is a
// system-level concept. We persist it via the dedicated host-less
// alert path so the alerts table's host_id column carries NULL.
func (e *Engine) RaiseFleetUpdateHalted(ctx context.Context, fleetUpdateID, reason string, when time.Time) {
msg := fmt.Sprintf("Fleet update %s halted: %s", fleetUpdateID, reason)
id, didRaise, err := e.store.RaiseOrTouchSystem(ctx, KindFleetUpdateHalted, fleetUpdateID, "warning", msg, when)
if err != nil {
slog.Warn("alert: raise fleet_update_halted", "fu_id", fleetUpdateID, "err", err)
return
}
if !didRaise {
return
}
go e.hub.Dispatch(ctx, notification.Payload{
Event: notification.EventRaised,
AlertID: id,
Severity: "warning",
Kind: KindFleetUpdateHalted,
HostID: "",
HostName: "",
Message: msg,
RaisedAt: when,
})
}
+7 -9
@@ -63,7 +63,6 @@ const (
JobUnlock JobKind = "unlock"
JobRestore JobKind = "restore"
JobDiff JobKind = "diff"
JobUpdate JobKind = "update"
)
// JobStatus is the lifecycle state of a job.
@@ -362,14 +361,13 @@ type ConfigUpdatePayload struct {
BandwidthDownKBps *int `json:"bandwidth_down_kbps,omitempty"`
}
// CommandUpdatePayload carries no operational data — the agent
// already knows its own os/arch and fetches from its configured
// server URL via /agent/binary. JobID is the server-issued id of
// the update job; the agent echoes it on log.stream lines so the
// live job log captures pre-restart progress, then either exits
// (Linux) or hands off to a detached helper script (Windows).
type CommandUpdatePayload struct {
JobID string `json:"job_id"`
// AgentUpdateAvailablePayload — informational only; the agent does
// NOT self-update. See spec.md §4.2 for the package-manager-based
// update model.
type AgentUpdateAvailablePayload struct {
LatestVersion string `json:"latest_version"`
PackageURL string `json:"package_url"` // apt repo / choco source
Changelog string `json:"changelog,omitempty"`
}
// TreeListRequestPayload is the body of a tree.list RPC. Used by the
+6 -6
@@ -29,12 +29,12 @@ const (
// Server → agent message types.
const (
MsgCommandRun MessageType = "command.run"
MsgCommandCancel MessageType = "command.cancel"
MsgScheduleSet MessageType = "schedule.set"
MsgConfigUpdate MessageType = "config.update"
MsgCommandUpdate MessageType = "command.update"
MsgTreeList MessageType = "tree.list" // sync RPC: list a snapshot's children
MsgCommandRun MessageType = "command.run"
MsgCommandCancel MessageType = "command.cancel"
MsgScheduleSet MessageType = "schedule.set"
MsgConfigUpdate MessageType = "config.update"
MsgAgentUpdateAvail MessageType = "agent.update.available"
MsgTreeList MessageType = "tree.list" // sync RPC: list a snapshot's children
)
// Envelope is the framing for every WS message in either direction.
+2 -19
@@ -9,7 +9,6 @@ import (
"errors"
"fmt"
"strings"
"testing"
"golang.org/x/crypto/argon2"
)
@@ -28,38 +27,22 @@ const (
defaultKeyLen = 32
)
// Cheap params used only when the binary is a `go test` binary
// (testing.Testing() == true). Argon2id at production params costs
// 300-500 ms per hash and dominates wall time on CI runners under
// `-race`. Tests don't need real KDF strength — VerifyPassword reads
// params from the encoded hash, so verifying a cheap-params hash
// works the same way.
const (
testMemoryKiB = 8
testIterations = 1
testParallel = 1
)
// HashPassword returns an argon2id-encoded string of the form
//
// $argon2id$v=19$m=...,t=...,p=...$<salt>$<hash>
//
// safe to store in a TEXT column. The salt is freshly random per call.
func HashPassword(password string) (string, error) {
mem, iter, par := uint32(defaultMemoryKiB), uint32(defaultIterations), uint8(defaultParallel)
if testing.Testing() {
mem, iter, par = testMemoryKiB, testIterations, testParallel
}
salt := make([]byte, defaultSaltLen)
if _, err := rand.Read(salt); err != nil {
return "", fmt.Errorf("auth: read salt: %w", err)
}
hash := argon2.IDKey([]byte(password), salt,
iter, mem, par, defaultKeyLen)
defaultIterations, defaultMemoryKiB, defaultParallel, defaultKeyLen)
return fmt.Sprintf("$argon2id$v=%d$m=%d,t=%d,p=%d$%s$%s",
argon2.Version,
mem, iter, par,
defaultMemoryKiB, defaultIterations, defaultParallel,
base64.RawStdEncoding.EncodeToString(salt),
base64.RawStdEncoding.EncodeToString(hash),
), nil
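The comment in this hunk notes that VerifyPassword reads its parameters from the encoded hash, which is why a hash produced with cheaper parameters still verifies. A minimal sketch of that parsing step, using only the standard library, is below. `argonParams` and `parseArgonParams` are hypothetical names for illustration; the repository's actual verifier is not shown in this diff.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// argonParams holds the cost parameters embedded in an encoded hash.
type argonParams struct {
	Version     int
	MemoryKiB   uint32
	Iterations  uint32
	Parallelism uint8
}

// parseArgonParams extracts v, m, t and p from a string of the form
//
//	$argon2id$v=19$m=...,t=...,p=...$<salt>$<hash>
//
// It does not decode or check the salt and hash fields.
func parseArgonParams(encoded string) (argonParams, error) {
	var p argonParams
	// The leading "$" yields an empty first element, so six parts total.
	parts := strings.Split(encoded, "$")
	if len(parts) != 6 || parts[0] != "" || parts[1] != "argon2id" {
		return p, errors.New("not an argon2id encoded hash")
	}
	if _, err := fmt.Sscanf(parts[2], "v=%d", &p.Version); err != nil {
		return p, fmt.Errorf("parse version: %w", err)
	}
	if _, err := fmt.Sscanf(parts[3], "m=%d,t=%d,p=%d",
		&p.MemoryKiB, &p.Iterations, &p.Parallelism); err != nil {
		return p, fmt.Errorf("parse params: %w", err)
	}
	return p, nil
}

func main() {
	// Salt and hash here are placeholder base64 strings, not real output.
	p, err := parseArgonParams("$argon2id$v=19$m=8,t=1,p=1$c2FsdA$aGFzaA")
	fmt.Println(p, err)
}
```

A verifier built this way recomputes the key with the parsed parameters and the decoded salt, then compares in constant time, so stored hashes stay valid when the defaults change.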
+3 -23
@@ -58,34 +58,14 @@ func (c *NtfyChannel) Send(ctx context.Context, p Payload) (int, time.Duration,
server := strings.TrimRight(c.cfg.ServerURL, "/")
url := server + "/" + c.cfg.Topic
// Body carries the event verb so the body alone is unambiguous when
// it shows up on a phone lockscreen without the title.
body := p.Message
switch p.Event {
case EventResolved:
body = "Resolved · " + p.Message
case EventAcknowledged:
body = "Acknowledged · " + p.Message
}
req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewBufferString(body))
req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewBufferString(p.Message))
if err != nil {
return 0, 0, fmt.Errorf("ntfy: build request: %w", err)
}
// Title prefix tracks the event so raise vs ack vs resolve are
// visually distinct in the ntfy notification list.
verb := "raised"
switch p.Event {
case EventAcknowledged:
verb = "ack"
case EventResolved:
verb = "resolved"
case EventTest:
verb = "test"
}
req.Header.Set("Content-Type", "text/plain")
req.Header.Set("Title", fmt.Sprintf("[%s · %s] %s %s", verb, p.Severity, p.HostName, p.Kind))
req.Header.Set("Tags", verb+","+p.Severity+","+p.Kind)
req.Header.Set("Title", fmt.Sprintf("[%s] %s %s", p.Severity, p.HostName, p.Kind))
req.Header.Set("Tags", p.Severity+","+p.Kind)
req.Header.Set("Priority", priorityForSeverity(p.Severity, c.defaultPriority))
if p.Link != "" {
req.Header.Set("Click", p.Link)
+2 -2
@@ -60,13 +60,13 @@ func TestNtfySendsHeadersAndBody(t *testing.T) {
t.Fatalf("want 200, got %d", code)
}
if want := "[raised · critical] alfa-01 check_failed"; gotTitle != want {
if want := "[critical] alfa-01 check_failed"; gotTitle != want {
t.Errorf("Title: got %q want %q", gotTitle, want)
}
if gotPri != "5" {
t.Errorf("Priority: got %q want \"5\"", gotPri)
}
if want := "raised,critical,check_failed"; gotTags != want {
if want := "critical,check_failed"; gotTags != want {
t.Errorf("Tags: got %q want %q", gotTags, want)
}
if gotClick != "https://rm.example/a" {
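The two title formats asserted in this test differ only in whether an event verb is prefixed to the severity. A small sketch of the verb-carrying variant follows. The `event` type and its constant values are stand-ins invented for this example; only the verb strings and the title and tag layouts are taken from the diff.

```go
package main

import "fmt"

// event is a hypothetical stand-in for the notification event type.
type event string

const (
	eventRaised       event = "raised"
	eventAcknowledged event = "acknowledged"
	eventResolved     event = "resolved"
	eventTest         event = "test"
)

// verbFor maps an event to the short verb used in titles and tags.
// Unknown events fall back to "raised", matching the switch in the diff.
func verbFor(ev event) string {
	switch ev {
	case eventAcknowledged:
		return "ack"
	case eventResolved:
		return "resolved"
	case eventTest:
		return "test"
	}
	return "raised"
}

// ntfyTitle and ntfyTags build the header values in the verb-carrying layout.
func ntfyTitle(ev event, severity, hostName, kind string) string {
	return fmt.Sprintf("[%s · %s] %s %s", verbFor(ev), severity, hostName, kind)
}

func ntfyTags(ev event, severity, kind string) string {
	return verbFor(ev) + "," + severity + "," + kind
}

func main() {
	fmt.Println(ntfyTitle(eventRaised, "critical", "alfa-01", "check_failed"))
	fmt.Println(ntfyTags(eventResolved, "critical", "check_failed"))
}
```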
+1 -12
@@ -117,20 +117,9 @@ func extractAddr(s string) string {
// Plain text only; subject hardcoded.
func buildEmailBody(cfg SMTPConfig, msgIDDomain string, p Payload) []byte {
var b strings.Builder
// Subject prefix tracks the event verb so raise vs ack vs resolve
// are visually distinct in the inbox (and threaded by Message-ID).
verb := "raised"
switch p.Event {
case EventAcknowledged:
verb = "ack"
case EventResolved:
verb = "resolved"
case EventTest:
verb = "test"
}
b.WriteString("From: " + cfg.From + "\r\n")
b.WriteString("To: " + cfg.To + "\r\n")
b.WriteString(fmt.Sprintf("Subject: [restic-manager] [%s · %s] %s: %s\r\n", verb, p.Severity, p.HostName, p.Kind))
b.WriteString(fmt.Sprintf("Subject: [restic-manager] [%s] %s: %s\r\n", p.Severity, p.HostName, p.Kind))
b.WriteString("Date: " + p.RaisedAt.UTC().Format(time.RFC1123Z) + "\r\n")
b.WriteString("Message-ID: <" + p.AlertID + "@" + msgIDDomain + ">\r\n")
b.WriteString("MIME-Version: 1.0\r\n")
+1 -1
@@ -133,7 +133,7 @@ func TestSMTPSendsExpectedHeaders(t *testing.T) {
if !strings.Contains(srv.rcptTo, "ops@example.com") {
t.Errorf("RCPT TO: %q", srv.rcptTo)
}
if !strings.Contains(srv.data, "Subject: [restic-manager] [raised · warning] alfa-01: backup_failed") {
if !strings.Contains(srv.data, "Subject: [restic-manager] [warning] alfa-01: backup_failed") {
t.Errorf("subject missing or wrong: %q", srv.data)
}
if !strings.Contains(srv.data, "Message-ID: <01ABC@rm.example>") {
+7 -7
@@ -87,13 +87,13 @@ func (e Env) RunRestore(ctx context.Context, snapshotID string, paths []string,
}
}
args = append(args, "--target", target)
// --no-ownership is nominally a restic 0.17+ flag, but at least
// one downstream 0.18.1 build still rejects it. We rely on a
// runtime probe captured at agent startup (see
// SupportsRestoreNoOwnership) rather than version sniffing.
// In-place restores always preserve ownership — that's the whole
// point of in-place — so we only add the flag for new-dir mode.
if !inPlace && e.SupportsRestoreNoOwnership {
// --no-ownership was added in restic 0.17. Older versions reject
// the flag with "unknown flag: --no-ownership". For new-dir
// restores we want the files owned by the agent user (operator
// can cp them without juggling chown), so pass the flag iff the
// running restic supports it. In-place restores always preserve
// ownership — that's the whole point of in-place.
if !inPlace && e.AtLeastVersion(0, 17) {
args = append(args, "--no-ownership")
}
for _, p := range paths {
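This hunk contrasts two ways of deciding whether to pass --no-ownership: comparing the restic version against 0.17, or checking whether the flag appears in the `restore --help` text. Both checks can be sketched as pure functions. `atLeast` and `helpAdvertises` are hypothetical names; the repository's `AtLeastVersion` method and its probe are not reproduced here, and the version input is assumed to be a bare "major.minor[.patch]" string.

```go
package main

import (
	"fmt"
	"strings"
)

// atLeast reports whether a "major.minor[.patch]" version string is at
// least major.minor. Unparseable input is treated as "not supported".
func atLeast(version string, major, minor int) bool {
	var gotMajor, gotMinor int
	if _, err := fmt.Sscanf(version, "%d.%d", &gotMajor, &gotMinor); err != nil {
		return false
	}
	if gotMajor != major {
		return gotMajor > major
	}
	return gotMinor >= minor
}

// helpAdvertises reports whether captured help output mentions the flag.
// This is the capability-probe approach: it trusts what the binary
// says it accepts rather than what its version number implies.
func helpAdvertises(helpOutput, flag string) bool {
	return strings.Contains(helpOutput, flag)
}

func main() {
	fmt.Println(atLeast("0.18.1", 0, 17))
	fmt.Println(atLeast("0.16.4", 0, 17))
	fmt.Println(helpAdvertises("      --no-ownership   skip ownership restore", "--no-ownership"))
}
```

The two can disagree, which is the situation the comments describe: a build whose version passes the comparison but whose help text lacks the flag.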
+6 -37
@@ -15,26 +15,6 @@ import (
"time"
)
// SupportsRestoreNoOwnership probes the running restic for the
// `--no-ownership` flag on the `restore` subcommand. Some restic
// builds (≥ 0.17 in theory; observed missing on a downstream 0.18.1)
// do not expose it, so we ask the binary directly rather than
// inferring from the version string. Empty `bin` or any failure to
// run the help command returns false — the caller stays on the
// conservative path of not adding the flag.
func SupportsRestoreNoOwnership(ctx context.Context, bin string) bool {
if bin == "" {
return false
}
probeCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
defer cancel()
out, err := exec.CommandContext(probeCtx, bin, "restore", "--help").CombinedOutput()
if err != nil {
return false
}
return strings.Contains(string(out), "--no-ownership")
}
// Locate resolves the path to the restic binary. Honour an explicit
// override if provided, else fall back to PATH.
func Locate(override string) (string, error) {
@@ -69,15 +49,6 @@ type Env struct {
ExtraEnv map[string]string // any other RESTIC_* / passthrough
WorkDir string // CWD; default = current
// SupportsRestoreNoOwnership records whether the running restic's
// `restore --help` advertises the --no-ownership flag. The flag was
// added in 0.17, but at least one downstream build of 0.18.1 still
// rejects it ("unknown flag: --no-ownership") — version sniffing
// proved unreliable, so the agent now probes for the actual flag at
// startup (see internal/restic.SupportsRestoreNoOwnership) and
// passes the resulting boolean down here.
SupportsRestoreNoOwnership bool
// Bandwidth caps in KB/s. <=0 means "no cap" (omit the flag).
// Emitted as restic global flags --limit-upload / --limit-download
// before the subcommand on every invocation.
@@ -536,14 +507,12 @@ func pumpPlain(r io.Reader, stream string, handle LineHandler) error {
// on one or the other for its cache dir; without it the command
// fails before ever talking to the repo.
//
// Default to /var/lib/restic-manager. The unit no longer pins
// ProtectHome=read-only (a backup tool needs to restore anywhere),
// but the explicit HOME stays for two reasons: the parent's HOME
// can be unset under unusual init shapes, and pinning the cache
// under a known agent-owned dir keeps restic's metadata isolated
// from the actual operator home dirs that the agent can now write
// to. ExtraEnv overrides win for callers that want a different
// cache location.
// Default to /var/lib/restic-manager — that's in the systemd unit's
// ReadWritePaths and survives ProtectHome=read-only. We do NOT fall
// back to the parent's HOME env var: the agent runs as root with
// HOME=/root, but ProtectHome makes /root read-only, so restic's
// `mkdir /root/.cache/restic` fails. ExtraEnv overrides win for
// callers that explicitly want a different cache location.
func (e Env) envSlice() []string {
home := "/var/lib/restic-manager"
if h, ok := e.ExtraEnv["HOME"]; ok && h != "" {
+4 -64
@@ -30,35 +30,7 @@ type Config struct {
// Defaults to true. Set RM_COOKIE_SECURE=false only for local HTTP
// testing — production deployments are always behind a TLS proxy
// and the cookie must be Secure.
CookieSecure bool `yaml:"cookie_secure"`
OIDCRaw *OIDCConfig `yaml:"oidc"`
OIDC *OIDCConfig `yaml:"-"`
// BundledAssetsDir is the read-only path inside the image that
// holds agent binaries (under agent-binaries/) and install
// scripts (under install/). The /agent/binary and /install/*
// handlers fall back here when the file is not present in
// DataDir. Source-build deployments can override via
// RM_BUNDLED_ASSETS_DIR.
BundledAssetsDir string `yaml:"bundled_assets_dir"`
// MetricsToken, if set, gates the /metrics scrape endpoint
// behind a `Authorization: Bearer <token>` check (constant-time
// compare). When neither this nor MetricsTrustedCIDRs is set,
// the route is not mounted at all (the endpoint is opt-in).
MetricsToken string `yaml:"metrics_token"`
// MetricsTrustedCIDRs, if non-empty, gates /metrics so only
// callers from these networks may scrape. ANDed with
// MetricsToken when both are set.
MetricsTrustedCIDRs []string `yaml:"metrics_trusted_cidrs"`
}
// MetricsAuthEnabled reports whether the operator has opted into
// exposing the Prometheus scrape endpoint by configuring at least
// one auth gate.
func (c Config) MetricsAuthEnabled() bool {
return c.MetricsToken != "" || len(c.MetricsTrustedCIDRs) > 0
CookieSecure bool `yaml:"cookie_secure"`
}
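The MetricsToken comment describes an `Authorization: Bearer <token>` check with a constant-time compare. The handler itself is not part of this diff, so the following is only a sketch of how such a check is commonly written with crypto/subtle. `bearerOK` is a hypothetical name.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"strings"
)

// bearerOK reports whether the Authorization header value carries the
// expected bearer token. An empty configured token never matches, so an
// unconfigured gate cannot be passed with an empty credential.
func bearerOK(authHeader, token string) bool {
	const prefix = "Bearer "
	if token == "" || !strings.HasPrefix(authHeader, prefix) {
		return false
	}
	presented := strings.TrimPrefix(authHeader, prefix)
	// ConstantTimeCompare returns 1 only when both slices are equal;
	// its timing does not depend on where the contents differ.
	return subtle.ConstantTimeCompare([]byte(presented), []byte(token)) == 1
}

func main() {
	fmt.Println(bearerOK("Bearer s3cr3t", "s3cr3t"))
	fmt.Println(bearerOK("Bearer wrong", "s3cr3t"))
}
```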
// Load resolves config in this order:
@@ -70,10 +42,9 @@ func (c Config) MetricsAuthEnabled() bool {
// safe to start.
func Load(yamlPath string) (Config, error) {
c := Config{
Listen: ":8080",
DataDir: "/data",
CookieSecure: true,
BundledAssetsDir: "/opt/restic-manager/dist",
Listen: ":8080",
DataDir: "/data",
CookieSecure: true,
}
if yamlPath != "" {
@@ -108,22 +79,6 @@ func Load(yamlPath string) (Config, error) {
c.CookieSecure = true
}
}
if v, ok := os.LookupEnv("RM_BUNDLED_ASSETS_DIR"); ok {
c.BundledAssetsDir = v
}
if v, ok := os.LookupEnv("RM_METRICS_TOKEN"); ok {
c.MetricsToken = v
}
if v, ok := os.LookupEnv("RM_METRICS_TRUSTED_CIDR"); ok {
parts := strings.Split(v, ",")
c.MetricsTrustedCIDRs = c.MetricsTrustedCIDRs[:0]
for _, p := range parts {
p = strings.TrimSpace(p)
if p != "" {
c.MetricsTrustedCIDRs = append(c.MetricsTrustedCIDRs, p)
}
}
}
if v, ok := os.LookupEnv("RM_TRUSTED_PROXY"); ok {
// Comma-separated CIDRs; allow whitespace for readability.
parts := strings.Split(v, ",")
@@ -136,16 +91,6 @@ func Load(yamlPath string) (Config, error) {
}
}
var rawOIDC OIDCConfig
if c.OIDCRaw != nil {
rawOIDC = *c.OIDCRaw
}
oidc, err := loadOIDC(envSnapshot(), rawOIDC)
if err != nil {
return c, err
}
c.OIDC = oidc
return c, c.validate()
}
@@ -168,10 +113,5 @@ func (c *Config) validate() error {
return fmt.Errorf("config: RM_TRUSTED_PROXY entry %q is not a valid CIDR: %w", cidr, err)
}
}
for _, cidr := range c.MetricsTrustedCIDRs {
if _, err := netip.ParsePrefix(cidr); err != nil {
return fmt.Errorf("config: RM_METRICS_TRUSTED_CIDR entry %q is not a valid CIDR: %w", cidr, err)
}
}
return nil
}
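The env handling and validate() in this file follow one pattern for CIDR lists: split on commas, trim whitespace, drop empty entries, then reject anything netip.ParsePrefix cannot parse. A compact sketch that combines both steps is below. `parseCIDRList` is a hypothetical helper, not a function from the repository, which keeps the entries as strings and validates them separately.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// parseCIDRList splits a comma-separated list of CIDRs, tolerating
// whitespace and empty entries, and fails on the first invalid prefix.
func parseCIDRList(v string) ([]netip.Prefix, error) {
	var out []netip.Prefix
	for _, part := range strings.Split(v, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		pfx, err := netip.ParsePrefix(part)
		if err != nil {
			return nil, fmt.Errorf("entry %q is not a valid CIDR: %w", part, err)
		}
		out = append(out, pfx)
	}
	return out, nil
}

func main() {
	prefixes, err := parseCIDRList("10.0.0.0/8, 192.168.1.0/24")
	fmt.Println(prefixes, err)

	_, err = parseCIDRList("garbage")
	fmt.Println(err != nil)
}
```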
-39
@@ -98,45 +98,6 @@ func TestCookieSecureDefaultAndOverride(t *testing.T) {
}
}
func TestMetricsAuthGates(t *testing.T) {
t.Setenv("RM_LISTEN", ":8080")
t.Setenv("RM_DATA_DIR", "/tmp/x")
c, err := Load("")
if err != nil {
t.Fatalf("load: %v", err)
}
if c.MetricsAuthEnabled() {
t.Errorf("metrics endpoint should be off by default")
}
t.Setenv("RM_METRICS_TOKEN", "s3cr3t-token-with-enough-bytes")
t.Setenv("RM_METRICS_TRUSTED_CIDR", "10.0.0.0/8, 192.168.1.0/24")
c, err = Load("")
if err != nil {
t.Fatalf("load: %v", err)
}
if c.MetricsToken != "s3cr3t-token-with-enough-bytes" {
t.Errorf("token: %q", c.MetricsToken)
}
if got := c.MetricsTrustedCIDRs; len(got) != 2 || got[0] != "10.0.0.0/8" || got[1] != "192.168.1.0/24" {
t.Errorf("cidrs: %v", got)
}
if !c.MetricsAuthEnabled() {
t.Errorf("MetricsAuthEnabled should be true")
}
}
func TestMetricsTrustedCIDRRejectsGarbage(t *testing.T) {
t.Setenv("RM_LISTEN", ":8080")
t.Setenv("RM_DATA_DIR", "/tmp/x")
t.Setenv("RM_METRICS_TRUSTED_CIDR", "garbage")
if _, err := Load(""); err == nil {
t.Fatal("expected validation error, got nil")
}
}
func writeFile(path string, body []byte) error {
return writeFileImpl(path, body)
}
-103
@@ -1,103 +0,0 @@
// internal/server/config/oidc.go — OIDC subsection of the server
// config. Disabled when oidc.issuer is empty or absent.
package config
import (
"errors"
"fmt"
"os"
)
// OIDCConfig is the OIDC sub-block. The struct doubles as YAML schema;
// loadOIDC applies env overlays on top and fills defaults.
type OIDCConfig struct {
Issuer string `yaml:"issuer"`
ClientID string `yaml:"client_id"`
ClientSecret string `yaml:"client_secret"`
DisplayName string `yaml:"display_name"`
Scopes []string `yaml:"scopes"`
RoleClaim string `yaml:"role_claim"`
RoleMapping map[string]string `yaml:"role_mapping"`
RedirectURL string `yaml:"redirect_url"`
}
// loadOIDC merges YAML + env, applies defaults, validates. Returns
// nil + nil when OIDC is disabled (issuer empty after merge); a
// non-nil OIDCConfig means the caller should wire OIDC.
//
// Env vars (override YAML when set):
//
// RM_OIDC_ISSUER, RM_OIDC_CLIENT_ID, RM_OIDC_CLIENT_SECRET,
// RM_OIDC_CLIENT_SECRET_FILE, RM_OIDC_DISPLAY_NAME,
// RM_OIDC_REDIRECT_URL.
//
// envs is passed in (rather than read with os.LookupEnv) so unit
// tests can supply a fake env map.
func loadOIDC(envs map[string]string, yaml OIDCConfig) (*OIDCConfig, error) {
c := yaml
if v, ok := envs["RM_OIDC_ISSUER"]; ok {
c.Issuer = v
}
if v, ok := envs["RM_OIDC_CLIENT_ID"]; ok {
c.ClientID = v
}
if v, ok := envs["RM_OIDC_CLIENT_SECRET"]; ok {
c.ClientSecret = v
}
if v, ok := envs["RM_OIDC_CLIENT_SECRET_FILE"]; ok && v != "" {
body, err := os.ReadFile(v)
if err != nil {
return nil, fmt.Errorf("config: oidc client_secret_file: %w", err)
}
c.ClientSecret = string(body)
}
if v, ok := envs["RM_OIDC_DISPLAY_NAME"]; ok {
c.DisplayName = v
}
if v, ok := envs["RM_OIDC_REDIRECT_URL"]; ok {
c.RedirectURL = v
}
if c.Issuer == "" {
return nil, nil
}
if c.ClientID == "" {
return nil, errors.New("config: oidc.client_id required when issuer is set")
}
if c.ClientSecret == "" {
return nil, errors.New("config: oidc.client_secret required when issuer is set")
}
if len(c.RoleMapping) == 0 {
return nil, errors.New("config: oidc.role_mapping must have at least one entry")
}
if c.DisplayName == "" {
c.DisplayName = "SSO"
}
if c.RoleClaim == "" {
c.RoleClaim = "groups"
}
if len(c.Scopes) == 0 {
c.Scopes = []string{"openid", "profile", "email", "groups"}
}
return &c, nil
}
// envSnapshot reads the OIDC env vars into a map. Lets the production
// loadOIDC call site stay env-driven while tests pass an explicit
// map.
func envSnapshot() map[string]string {
keys := []string{
"RM_OIDC_ISSUER", "RM_OIDC_CLIENT_ID", "RM_OIDC_CLIENT_SECRET",
"RM_OIDC_CLIENT_SECRET_FILE", "RM_OIDC_DISPLAY_NAME",
"RM_OIDC_REDIRECT_URL",
}
out := make(map[string]string, len(keys))
for _, k := range keys {
if v, ok := os.LookupEnv(k); ok {
out[k] = v
}
}
return out
}
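Review note on the secret precedence in loadOIDC above: the YAML value is overridden by RM_OIDC_CLIENT_SECRET, which is in turn overridden by the contents of RM_OIDC_CLIENT_SECRET_FILE. The file body is used as-is (`string(body)`), so a trailing newline written by an editor or by `echo` would become part of the secret. The following is a minimal sketch of the same precedence with the newline trimmed. `resolveClientSecret` is a hypothetical helper, not part of this diff, and the trimming is an assumption about what operators expect rather than existing behaviour.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// resolveClientSecret is a hypothetical helper (not in the diff). It mirrors
// the precedence used by loadOIDC: YAML value first, then
// RM_OIDC_CLIENT_SECRET, then the contents of RM_OIDC_CLIENT_SECRET_FILE.
func resolveClientSecret(envs map[string]string, yamlSecret string) (string, error) {
	secret := yamlSecret
	if v, ok := envs["RM_OIDC_CLIENT_SECRET"]; ok {
		secret = v
	}
	if p, ok := envs["RM_OIDC_CLIENT_SECRET_FILE"]; ok && p != "" {
		body, err := os.ReadFile(p)
		if err != nil {
			return "", fmt.Errorf("config: oidc client_secret_file: %w", err)
		}
		// Assumption: trailing newlines are not part of the secret.
		// The code in the diff keeps string(body) unchanged.
		secret = strings.TrimRight(string(body), "\r\n")
	}
	return secret, nil
}

func main() {
	dir, err := os.MkdirTemp("", "oidc-secret")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// Write a secret file the way most tools do, with a trailing newline.
	path := filepath.Join(dir, "client_secret")
	if err := os.WriteFile(path, []byte("from-file\n"), 0o600); err != nil {
		panic(err)
	}

	envs := map[string]string{
		"RM_OIDC_CLIENT_SECRET":      "from-env",
		"RM_OIDC_CLIENT_SECRET_FILE": path,
	}
	secret, err := resolveClientSecret(envs, "from-yaml")
	fmt.Printf("%q %v\n", secret, err)
}
```

Passing the env map in, as loadOIDC does, keeps the helper testable without touching the process environment.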
-72
@@ -1,72 +0,0 @@
package config
import "testing"
func TestOIDCParseDisabledWhenIssuerEmpty(t *testing.T) {
t.Parallel()
c, err := loadOIDC(map[string]string{}, OIDCConfig{})
if err != nil {
t.Fatalf("load: %v", err)
}
if c != nil {
t.Errorf("expected nil OIDC config when issuer empty; got %+v", c)
}
}
func TestOIDCRejectMissingClientID(t *testing.T) {
t.Parallel()
yaml := OIDCConfig{Issuer: "https://x", ClientSecret: "s"}
if _, err := loadOIDC(map[string]string{}, yaml); err == nil {
t.Error("expected error for missing client_id")
}
}
func TestOIDCRejectMissingClientSecret(t *testing.T) {
t.Parallel()
yaml := OIDCConfig{Issuer: "https://x", ClientID: "rm"}
if _, err := loadOIDC(map[string]string{}, yaml); err == nil {
t.Error("expected error for missing client_secret")
}
}
func TestOIDCDefaultsApplied(t *testing.T) {
t.Parallel()
yaml := OIDCConfig{
Issuer: "https://x", ClientID: "rm", ClientSecret: "s",
RoleMapping: map[string]string{"a": "admin"},
}
c, err := loadOIDC(map[string]string{}, yaml)
if err != nil {
t.Fatalf("load: %v", err)
}
if c.RoleClaim != "groups" {
t.Errorf("role_claim default: got %q want groups", c.RoleClaim)
}
if c.DisplayName != "SSO" {
t.Errorf("display_name default: got %q want SSO", c.DisplayName)
}
wantScopes := []string{"openid", "profile", "email", "groups"}
if len(c.Scopes) != len(wantScopes) {
t.Errorf("scopes default: got %v want %v", c.Scopes, wantScopes)
}
}
func TestOIDCEnvOverrides(t *testing.T) {
t.Parallel()
yaml := OIDCConfig{
Issuer: "https://from-yaml", ClientID: "yaml-id", ClientSecret: "yaml-secret",
RoleMapping: map[string]string{"x": "admin"},
}
envs := map[string]string{
"RM_OIDC_ISSUER": "https://from-env",
"RM_OIDC_CLIENT_ID": "env-id",
"RM_OIDC_CLIENT_SECRET": "env-secret",
}
c, err := loadOIDC(envs, yaml)
if err != nil {
t.Fatalf("load: %v", err)
}
if c.Issuer != "https://from-env" || c.ClientID != "env-id" || c.ClientSecret != "env-secret" {
t.Errorf("env override: got %+v", c)
}
}
-221
@@ -1,221 +0,0 @@
// Package fleetupdate drives a rolling, sequential agent self-update
// over a list of hosts. One worker goroutine per Start() call (gated
// at the store layer to at-most-one-running-fleet-update).
package fleetupdate
import (
"context"
"errors"
"fmt"
"log/slog"
"time"
"github.com/oklog/ulid/v2"
"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
)
// Hub is the slim "is this host connected?" surface.
type Hub interface {
Connected(hostID string) bool
}
// Dispatcher sends one command.update envelope. The implementer also
// creates the jobs row, writes audit, and registers with the update
// watcher. Pre-checks are the dispatcher's responsibility — the worker
// passes through whatever error it returns.
type Dispatcher interface {
DispatchUpdate(ctx context.Context, hostID string, actorUserID string) (jobID string, code string, err error)
}
// AlertRaiser is the slim view of the alert engine's host-less raise
// path. Used to emit fleet_update_halted on first failure.
type AlertRaiser interface {
RaiseFleetUpdateHalted(ctx context.Context, fleetUpdateID, reason string, when time.Time)
}
// Worker is the long-lived fleet-update orchestrator. There is at most
// one *running* fleet update at a time (enforced by the store).
type Worker struct {
store *store.Store
hub Hub
disp Dispatcher
alerts AlertRaiser
// targetVersion is the version every dispatched agent is expected
// to come back with. Captured at Start time to avoid drift.
targetVersion string
// pollPeriod controls the cadence at which the worker re-reads the
// host row to check for the version transition. Exposed for tests.
pollPeriod time.Duration
// hostTimeout bounds how long the worker waits for one host to
// reach the target version before halting.
hostTimeout time.Duration
}
// NewWorker builds an unstarted worker. targetVersion is set on each
// Start call; the values here are defaults.
func NewWorker(st *store.Store, hub Hub, disp Dispatcher, alerts AlertRaiser) *Worker {
return &Worker{
store: st,
hub: hub,
disp: disp,
alerts: alerts,
pollPeriod: 1 * time.Second,
hostTimeout: 95 * time.Second,
}
}
// Start creates the parent + child rows, then spawns the per-host
// worker goroutine. Returns the new fleet_update_id on success.
// store.ErrFleetUpdateRunning bubbles up unchanged.
func (w *Worker) Start(ctx context.Context, userID, targetVersion string, hostIDs []string) (string, error) {
if userID == "" || targetVersion == "" {
return "", errors.New("fleetupdate: userID and targetVersion required")
}
if len(hostIDs) == 0 {
return "", errors.New("fleetupdate: at least one host required")
}
fuID := ulid.Make().String()
now := time.Now().UTC()
if err := w.store.CreateFleetUpdate(ctx, store.FleetUpdate{
ID: fuID,
StartedAt: now,
StartedByUserID: userID,
TargetVersion: targetVersion,
Status: "running",
}, hostIDs); err != nil {
return "", err
}
// The goroutine outlives the request that started it; carry a
// detached context so an HTTP-handler ctx cancel doesn't abort
// the long roll.
bg := context.WithoutCancel(ctx)
go w.run(bg, fuID, userID, targetVersion)
return fuID, nil
}
// Cancel marks the fleet update cancelled. The running goroutine
// observes the new status on its next pre-check and exits without
// dispatching further hosts. The currently-dispatched job is left to
// finish on its own — cancelling agent-side is out of scope for v1.
func (w *Worker) Cancel(ctx context.Context, fuID string) error {
return w.store.CancelFleetUpdate(ctx, fuID, time.Now().UTC())
}
// run is the per-host loop. Halts on first failure; emits one alert
// on transition.
func (w *Worker) run(ctx context.Context, fuID, userID, targetVersion string) {
w.targetVersion = targetVersion
for {
// Check the parent row's status — picks up Cancel.
fu, err := w.store.ActiveFleetUpdate(ctx)
if err != nil {
slog.Warn("fleetupdate: read active", "fu_id", fuID, "err", err)
return
}
if fu == nil || fu.ID != fuID {
// Cancelled, halted, or completed externally. Done.
return
}
pending, err := w.store.ListPendingFleetUpdateHosts(ctx, fuID)
if err != nil {
slog.Warn("fleetupdate: list pending", "fu_id", fuID, "err", err)
return
}
if len(pending) == 0 {
now := time.Now().UTC()
if err := w.store.CompleteFleetUpdate(ctx, fuID, now); err != nil {
slog.Warn("fleetupdate: complete", "fu_id", fuID, "err", err)
}
return
}
next := pending[0]
w.processHost(ctx, fuID, userID, next)
}
}
// processHost handles one host slot. Marks it skipped, succeeded, or
// failed (and halts the fleet on failure).
func (w *Worker) processHost(ctx context.Context, fuID, userID string, slot store.FleetUpdateHost) {
hostID := slot.HostID
_ = w.store.SetFleetUpdateCurrentHost(ctx, fuID, hostID)
// Pre-flight: re-read the host. The dispatch path repeats most of
// these checks but doing them up-front lets us emit the right
// per-host status (skipped vs failed) without consuming a job row.
host, err := w.store.GetHost(ctx, hostID)
if err != nil || host == nil {
_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, hostID, "skipped", "host not found", "")
return
}
if host.AgentVersion != "" && host.AgentVersion == w.targetVersion {
_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, hostID, "skipped", "already at target version", "")
return
}
if !w.hub.Connected(hostID) {
reason := fmt.Sprintf("host went offline: %s", hostID)
_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, hostID, "failed", reason, "")
w.halt(ctx, fuID, reason)
return
}
// Dispatch.
_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, hostID, "running", "", "")
jobID, code, err := w.disp.DispatchUpdate(ctx, hostID, userID)
if err != nil || code != "" {
reason := dispatchErrorReason(code, err)
_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, hostID, "failed", reason, jobID)
w.halt(ctx, fuID, reason)
return
}
// Poll until the host's recorded agent_version matches target, or
// timeout.
deadline := time.Now().Add(w.hostTimeout)
for time.Now().Before(deadline) {
// Honour cancellation between polls.
fu, err := w.store.ActiveFleetUpdate(ctx)
if err == nil && (fu == nil || fu.ID != fuID) {
// Cancelled mid-host; leave the slot in 'running' for the
// admin to inspect. No further dispatches.
return
}
time.Sleep(w.pollPeriod)
h, err := w.store.GetHost(ctx, hostID)
if err == nil && h != nil && h.AgentVersion == w.targetVersion {
if err := w.store.SetFleetUpdateHostStatus(ctx, fuID, hostID, "succeeded", "", jobID); err != nil {
slog.Warn("fleetupdate: set succeeded", "fu_id", fuID, "host_id", hostID, "err", err)
}
return
}
}
reason := fmt.Sprintf("timeout waiting for %s to reach %s", hostID, w.targetVersion)
_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, hostID, "failed", reason, jobID)
w.halt(ctx, fuID, reason)
}
func (w *Worker) halt(ctx context.Context, fuID, reason string) {
now := time.Now().UTC()
if err := w.store.HaltFleetUpdate(ctx, fuID, reason, now); err != nil {
slog.Warn("fleetupdate: halt", "fu_id", fuID, "err", err)
}
if w.alerts != nil {
w.alerts.RaiseFleetUpdateHalted(ctx, fuID, reason, now)
}
}
func dispatchErrorReason(code string, err error) string {
if code != "" {
return "dispatch failed: " + code
}
if err != nil {
return err.Error()
}
return "dispatch failed"
}
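Review note on the worker above: the control flow reduces to a sequential roll that skips hosts already at the target version, halts on the first offline host or failed update, and leaves every later host pending. The sketch below restates that flow against in-memory data so the status outcomes are easy to see. The `host` type, `roll`, and the `update` callback are hypothetical stand-ins for the store rows, `run`/`processHost`, and the Dispatcher plus version poll. It does not model cancellation or the per-host timeout.

```go
package main

import (
	"errors"
	"fmt"
)

// Status strings match the ones the worker writes to the store.
const (
	statusPending   = "pending"
	statusSkipped   = "skipped"
	statusSucceeded = "succeeded"
	statusFailed    = "failed"
)

// host is a hypothetical in-memory stand-in for a host row.
type host struct {
	id      string
	version string
	online  bool
}

// roll updates hosts one at a time and stops at the first failure.
// It returns the per-host status and the halt reason ("" when the
// roll completed).
func roll(hosts []host, target string, update func(h host) error) (map[string]string, string) {
	status := make(map[string]string, len(hosts))
	for _, h := range hosts {
		status[h.id] = statusPending
	}
	for _, h := range hosts {
		switch {
		case h.version == target:
			// Already at the target version: skip without dispatching.
			status[h.id] = statusSkipped
			continue
		case !h.online:
			// An offline host fails the slot and halts the fleet.
			status[h.id] = statusFailed
			return status, "host went offline: " + h.id
		}
		if err := update(h); err != nil {
			status[h.id] = statusFailed
			return status, err.Error()
		}
		status[h.id] = statusSucceeded
	}
	return status, ""
}

func main() {
	hosts := []host{
		{id: "h1", version: "v2", online: true},
		{id: "h2", version: "v0", online: true},
		{id: "h3", version: "v0", online: true},
		{id: "h4", version: "v0", online: true},
	}
	// Hypothetical update callback: h3 fails, everything else succeeds.
	update := func(h host) error {
		if h.id == "h3" {
			return errors.New("timeout waiting for h3 to reach v2")
		}
		return nil
	}
	status, reason := roll(hosts, "v2", update)
	for _, h := range hosts {
		fmt.Println(h.id, status[h.id])
	}
	fmt.Println("halt reason:", reason)
}
```

Here h4 stays pending after h3 fails, which is the same shape TestWorkerSecondHostTimesOutHalts asserts for the real worker.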
-344
@@ -1,344 +0,0 @@
package fleetupdate
import (
"context"
"errors"
"path/filepath"
"sync"
"testing"
"time"
"github.com/oklog/ulid/v2"
"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
)
type fakeHub struct {
mu sync.Mutex
online map[string]bool
}
func (f *fakeHub) Connected(hostID string) bool {
f.mu.Lock()
defer f.mu.Unlock()
return f.online[hostID]
}
type fakeDispatcher struct {
mu sync.Mutex
calls []string // host IDs
// after dispatch, set the host's agent_version to this on the
// store so the worker observes the version transition.
st *store.Store
target string
delayMS int
failOnHost map[string]string // host → error code
}
func (f *fakeDispatcher) DispatchUpdate(ctx context.Context, hostID, _ string) (string, string, error) {
f.mu.Lock()
f.calls = append(f.calls, hostID)
if code, ok := f.failOnHost[hostID]; ok {
f.mu.Unlock()
return "", code, nil
}
st := f.st
target := f.target
delay := f.delayMS
f.mu.Unlock()
jobID := ulid.Make().String()
if st != nil {
_ = st.CreateJob(context.Background(), store.Job{
ID: jobID, HostID: hostID, Kind: "update",
ActorKind: "user", CreatedAt: time.Now().UTC(),
})
}
if st != nil && target != "" {
go func() {
if delay > 0 {
time.Sleep(time.Duration(delay) * time.Millisecond)
}
_ = st.MarkHostHello(context.Background(), hostID, target, "0.17", api.CurrentProtocolVersion, time.Now().UTC())
}()
}
return jobID, "", nil
}
type recAlert struct {
mu sync.Mutex
reasons []string
}
func (r *recAlert) RaiseFleetUpdateHalted(_ context.Context, _ string, reason string, _ time.Time) {
r.mu.Lock()
r.reasons = append(r.reasons, reason)
r.mu.Unlock()
}
func openStore(t *testing.T) *store.Store {
t.Helper()
dir := t.TempDir()
st, err := store.Open(context.Background(), filepath.Join(dir, "rm.db"))
if err != nil {
t.Fatalf("open: %v", err)
}
t.Cleanup(func() { _ = st.Close() })
return st
}
func mustCreateAdmin(t *testing.T, st *store.Store) string {
t.Helper()
uid := ulid.Make().String()
if err := st.CreateUser(context.Background(), store.User{
ID: uid, Username: "u-" + uid[:6],
PasswordHash: "x", Role: store.RoleAdmin, CreatedAt: time.Now().UTC(),
}); err != nil {
t.Fatalf("user: %v", err)
}
return uid
}
func mustCreateHost(t *testing.T, st *store.Store, name, version string) string {
t.Helper()
hostID := ulid.Make().String()
if err := st.CreateHost(context.Background(), store.Host{
ID: hostID, Name: name, OS: "linux", Arch: "amd64",
EnrolledAt: time.Now().UTC(),
}, "deadbeef-"+hostID, ""); err != nil {
t.Fatalf("host: %v", err)
}
if version != "" {
if err := st.MarkHostHello(context.Background(), hostID, version, "0.17", api.CurrentProtocolVersion, time.Now().UTC()); err != nil {
t.Fatalf("hello: %v", err)
}
}
return hostID
}
func waitForStatus(t *testing.T, st *store.Store, fuID, want string, timeout time.Duration) *store.FleetUpdate {
t.Helper()
deadline := time.Now().Add(timeout)
for time.Now().Before(deadline) {
fu, _, err := st.GetFleetUpdate(context.Background(), fuID)
if err == nil && fu != nil && fu.Status == want {
return fu
}
time.Sleep(20 * time.Millisecond)
}
t.Fatalf("status never reached %q", want)
return nil
}
func TestWorkerTwoHostsBothSucceed(t *testing.T) {
st := openStore(t)
uid := mustCreateAdmin(t, st)
h1 := mustCreateHost(t, st, "h1", "v0")
h2 := mustCreateHost(t, st, "h2", "v0")
hub := &fakeHub{online: map[string]bool{h1: true, h2: true}}
disp := &fakeDispatcher{st: st, target: "v2", delayMS: 30}
alerts := &recAlert{}
w := NewWorker(st, hub, disp, alerts)
w.pollPeriod = 20 * time.Millisecond
w.hostTimeout = 2 * time.Second
fuID, err := w.Start(context.Background(), uid, "v2", []string{h1, h2})
if err != nil {
t.Fatalf("start: %v", err)
}
waitForStatus(t, st, fuID, "completed", 5*time.Second)
_, hosts, _ := st.GetFleetUpdate(context.Background(), fuID)
for _, h := range hosts {
if h.Status != "succeeded" {
t.Errorf("host %s status %q want succeeded", h.HostID, h.Status)
}
}
if n := len(alerts.reasons); n != 0 {
t.Errorf("unexpected halt alert: %v", alerts.reasons)
}
}
func TestWorkerSecondHostTimesOutHalts(t *testing.T) {
st := openStore(t)
uid := mustCreateAdmin(t, st)
h1 := mustCreateHost(t, st, "h1", "v0")
h2 := mustCreateHost(t, st, "h2", "v0")
h3 := mustCreateHost(t, st, "h3", "v0")
hub := &fakeHub{online: map[string]bool{h1: true, h2: true, h3: true}}
// h1 dispatches normally (transitions to v2). h2's dispatch returns
// success but the host never transitions, so the worker times out.
disp := &fakeDispatcher{st: st, target: "v2", delayMS: 20, failOnHost: map[string]string{
h2: "", // empty code, so not a dispatch failure; unused because disp only serves as the wrapper's base
}}
// perHostDispatcher wraps disp and skips the auto-transition for h2,
// which is what actually simulates the timeout.
_ = disp
customDisp := &perHostDispatcher{base: disp, st: st, target: "v2", noTransition: map[string]bool{h2: true}}
alerts := &recAlert{}
w := NewWorker(st, hub, customDisp, alerts)
w.pollPeriod = 20 * time.Millisecond
w.hostTimeout = 200 * time.Millisecond
fuID, err := w.Start(context.Background(), uid, "v2", []string{h1, h2, h3})
if err != nil {
t.Fatalf("start: %v", err)
}
waitForStatus(t, st, fuID, "halted", 3*time.Second)
_, hosts, _ := st.GetFleetUpdate(context.Background(), fuID)
gotStatus := map[string]string{}
for _, h := range hosts {
gotStatus[h.HostID] = h.Status
}
if gotStatus[h1] != "succeeded" {
t.Errorf("h1: %q", gotStatus[h1])
}
if gotStatus[h2] != "failed" {
t.Errorf("h2: %q", gotStatus[h2])
}
if gotStatus[h3] != "pending" {
t.Errorf("h3: %q", gotStatus[h3])
}
alerts.mu.Lock()
defer alerts.mu.Unlock()
if len(alerts.reasons) != 1 {
t.Errorf("alert reasons: %v", alerts.reasons)
}
}
// perHostDispatcher lets a test omit the auto-transition for selected
// hosts so we can simulate timeout.
type perHostDispatcher struct {
mu sync.Mutex
base *fakeDispatcher
st *store.Store
target string
noTransition map[string]bool
}
func (p *perHostDispatcher) DispatchUpdate(_ context.Context, hostID, _ string) (string, string, error) {
p.mu.Lock()
skip := p.noTransition[hostID]
p.mu.Unlock()
jobID := ulid.Make().String()
_ = p.st.CreateJob(context.Background(), store.Job{
ID: jobID, HostID: hostID, Kind: "update",
ActorKind: "user", CreatedAt: time.Now().UTC(),
})
if !skip {
go func() {
time.Sleep(20 * time.Millisecond)
_ = p.st.MarkHostHello(context.Background(), hostID, p.target, "0.17", api.CurrentProtocolVersion, time.Now().UTC())
}()
}
return jobID, "", nil
}
func TestWorkerHostOfflineHalts(t *testing.T) {
st := openStore(t)
uid := mustCreateAdmin(t, st)
h1 := mustCreateHost(t, st, "h1", "v0")
h2 := mustCreateHost(t, st, "h2", "v0")
hub := &fakeHub{online: map[string]bool{h1: false, h2: true}}
disp := &fakeDispatcher{st: st, target: "v2"}
alerts := &recAlert{}
w := NewWorker(st, hub, disp, alerts)
w.pollPeriod = 20 * time.Millisecond
w.hostTimeout = 500 * time.Millisecond
fuID, err := w.Start(context.Background(), uid, "v2", []string{h1, h2})
if err != nil {
t.Fatalf("start: %v", err)
}
waitForStatus(t, st, fuID, "halted", 2*time.Second)
_, hosts, _ := st.GetFleetUpdate(context.Background(), fuID)
if hosts[0].Status != "failed" {
t.Errorf("h1 status: %q", hosts[0].Status)
}
if hosts[1].Status != "pending" {
t.Errorf("h2 status: %q", hosts[1].Status)
}
}
func TestWorkerAlreadyAtTargetSkipped(t *testing.T) {
st := openStore(t)
uid := mustCreateAdmin(t, st)
h1 := mustCreateHost(t, st, "h1", "v2")
h2 := mustCreateHost(t, st, "h2", "v0")
hub := &fakeHub{online: map[string]bool{h1: true, h2: true}}
disp := &fakeDispatcher{st: st, target: "v2", delayMS: 20}
alerts := &recAlert{}
w := NewWorker(st, hub, disp, alerts)
w.pollPeriod = 20 * time.Millisecond
w.hostTimeout = 2 * time.Second
fuID, err := w.Start(context.Background(), uid, "v2", []string{h1, h2})
if err != nil {
t.Fatalf("start: %v", err)
}
waitForStatus(t, st, fuID, "completed", 4*time.Second)
_, hosts, _ := st.GetFleetUpdate(context.Background(), fuID)
want := map[string]string{h1: "skipped", h2: "succeeded"}
for _, h := range hosts {
if h.Status != want[h.HostID] {
t.Errorf("host %s: got %q want %q", h.HostID, h.Status, want[h.HostID])
}
}
}
func TestWorkerCancelMidRun(t *testing.T) {
st := openStore(t)
uid := mustCreateAdmin(t, st)
h1 := mustCreateHost(t, st, "h1", "v0")
h2 := mustCreateHost(t, st, "h2", "v0")
hub := &fakeHub{online: map[string]bool{h1: true, h2: true}}
// h1's transition is delayed long enough that we can cancel
// before it lands; h2 should never be touched.
disp := &fakeDispatcher{st: st, target: "v2", delayMS: 500}
alerts := &recAlert{}
w := NewWorker(st, hub, disp, alerts)
w.pollPeriod = 50 * time.Millisecond
w.hostTimeout = 5 * time.Second
fuID, err := w.Start(context.Background(), uid, "v2", []string{h1, h2})
if err != nil {
t.Fatalf("start: %v", err)
}
// Give the worker a moment to dispatch h1.
time.Sleep(100 * time.Millisecond)
if err := w.Cancel(context.Background(), fuID); err != nil {
t.Fatalf("cancel: %v", err)
}
waitForStatus(t, st, fuID, "cancelled", 2*time.Second)
// h2 should never be dispatched.
disp.mu.Lock()
defer disp.mu.Unlock()
for _, c := range disp.calls {
if c == h2 {
t.Errorf("h2 dispatched after cancel")
}
}
}
func TestWorkerStartWhileActiveErrors(t *testing.T) {
st := openStore(t)
uid := mustCreateAdmin(t, st)
h1 := mustCreateHost(t, st, "h1", "v0")
h2 := mustCreateHost(t, st, "h2", "v0")
hub := &fakeHub{online: map[string]bool{h1: true, h2: true}}
disp := &fakeDispatcher{st: st, target: "v2", delayMS: 5_000}
w := NewWorker(st, hub, disp, &recAlert{})
w.pollPeriod = 50 * time.Millisecond
w.hostTimeout = 2 * time.Second
if _, err := w.Start(context.Background(), uid, "v2", []string{h1}); err != nil {
t.Fatalf("first start: %v", err)
}
_, err := w.Start(context.Background(), uid, "v2", []string{h2})
if !errors.Is(err, store.ErrFleetUpdateRunning) {
t.Fatalf("err: %v want ErrFleetUpdateRunning", err)
}
}
+11 -35
@@ -11,23 +11,19 @@ import (
)
// agent_assets.go serves the agent binary (one per OS/arch) and the
-// install scripts. Lookup is dual-path:
-//
-// 1. <DataDir>/agent-binaries/<name> (or <DataDir>/install/<name>) —
-// operator-managed override; lets the operator hot-patch a
-// pre-release agent without rebuilding the server image.
-// 2. <BundledAssetsDir>/agent-binaries/<name> — read-only, baked
-// into the server image at build time (P5-03). This is what
-// makes a fresh container Just Work without first-run staging.
+// install scripts. The binaries live under <DataDir>/agent-binaries/,
+// laid down by the release pipeline (or copied by hand for now).
+// The install scripts live in <DataDir>/install/ alongside the
+// systemd unit.
//
// Both endpoints are intentionally unauthenticated: the install
// payload is unprivileged on its own — it's the one-time enrollment
// token that grants access. Anyone can pull the binary; only
// someone with a valid token can use it productively.
//
-// P1-31: signed-binary verification is deferred. The image is the
-// unit of trust; pull-by-digest is the verification primitive.
-// Future work bumps standalone-binary delivery to minisign/cosign.
+// P1-31: signed-binary verification is deferred. Today we serve
+// whatever the operator dropped on disk. Future work bumps this to
+// minisign/cosign signed bundles.
// installAssetsRoutes adds /agent/binary and /install/* to r.
func (s *Server) handleAgentBinary(w stdhttp.ResponseWriter, r *stdhttp.Request) {
@@ -49,8 +45,8 @@ func (s *Server) handleAgentBinary(w stdhttp.ResponseWriter, r *stdhttp.Request)
ext = ".exe"
}
name := fmt.Sprintf("restic-manager-agent-%s-%s%s", osTag, archTag, ext)
-path, ok := s.resolveBundledAsset("agent-binaries", name)
-if !ok {
+path := filepath.Join(s.deps.Cfg.DataDir, "agent-binaries", name)
+if _, err := os.Stat(path); err != nil {
writeJSONError(w, stdhttp.StatusNotFound, "binary_not_published",
fmt.Sprintf("agent binary for %s/%s not published on this server", osTag, archTag))
return
@@ -68,34 +64,14 @@ func (s *Server) handleInstallAsset(w stdhttp.ResponseWriter, r *stdhttp.Request
writeJSONError(w, stdhttp.StatusBadRequest, "bad_path", "")
return
}
-path, ok := s.resolveBundledAsset("install", rel)
-if !ok {
+path := filepath.Join(s.deps.Cfg.DataDir, "install", rel)
+if _, err := os.Stat(path); err != nil {
writeJSONError(w, stdhttp.StatusNotFound, "not_found", "")
return
}
stdhttp.ServeFile(w, r, path)
}
-// resolveBundledAsset looks up an asset by (subdir, name). DataDir
-// wins so an operator can override the image-baked copy by dropping
-// a file into <DataDir>/<subdir>/<name>. If neither path resolves,
-// returns ("", false).
-func (s *Server) resolveBundledAsset(subdir, name string) (string, bool) {
-candidates := []string{
-filepath.Join(s.deps.Cfg.DataDir, subdir, name),
-}
-if s.deps.Cfg.BundledAssetsDir != "" {
-candidates = append(candidates,
-filepath.Join(s.deps.Cfg.BundledAssetsDir, subdir, name))
-}
-for _, p := range candidates {
-if _, err := os.Stat(p); err == nil {
-return p, true
-}
-}
-return "", false
-}
func validOS(s string) bool {
switch api.HostOS(s) {
case api.OSLinux, api.OSWindows:
-167
@@ -1,167 +0,0 @@
package http
import (
"context"
"io"
stdhttp "net/http"
"net/http/httptest"
"os"
"path/filepath"
"testing"
"gitea.dcglab.co.uk/steve/restic-manager/internal/crypto"
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/config"
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ws"
"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
)
// newAssetsTestServer is a minimal scaffold for the /agent/binary and
// /install/* handlers. Two roots: one acts as DataDir, the other as
// the image-baked BundledAssetsDir. Either or both may be empty.
func newAssetsTestServer(t *testing.T, populate func(dataDir, bundleDir string)) string {
t.Helper()
root := t.TempDir()
dataDir := filepath.Join(root, "data")
bundleDir := filepath.Join(root, "dist")
for _, d := range []string{
filepath.Join(dataDir, "agent-binaries"),
filepath.Join(dataDir, "install"),
filepath.Join(bundleDir, "agent-binaries"),
filepath.Join(bundleDir, "install"),
} {
if err := os.MkdirAll(d, 0o755); err != nil {
t.Fatalf("mkdir: %v", err)
}
}
if populate != nil {
populate(dataDir, bundleDir)
}
st, err := store.Open(context.Background(), filepath.Join(root, "rm.db"))
if err != nil {
t.Fatalf("store: %v", err)
}
t.Cleanup(func() { _ = st.Close() })
keyPath := filepath.Join(root, "secret.key")
_ = crypto.GenerateKeyFile(keyPath)
key, _ := crypto.LoadKeyFromFile(keyPath)
aead, _ := crypto.NewAEAD(key)
deps := Deps{
Cfg: config.Config{
Listen: ":0",
DataDir: dataDir,
SecretKeyFile: keyPath,
BundledAssetsDir: bundleDir,
},
Store: st,
AEAD: aead,
Hub: ws.NewHub(),
BootstrapToken: "test-token",
}
s := New(deps)
ts := httptest.NewServer(s.srv.Handler)
t.Cleanup(ts.Close)
return ts.URL
}
func writeFile(t *testing.T, path string, body []byte) {
t.Helper()
if err := os.WriteFile(path, body, 0o644); err != nil {
t.Fatalf("write %s: %v", path, err)
}
}
func get(t *testing.T, url string) (int, []byte) {
t.Helper()
res, err := stdhttp.Get(url)
if err != nil {
t.Fatalf("GET %s: %v", url, err)
}
defer res.Body.Close()
body, _ := io.ReadAll(res.Body)
return res.StatusCode, body
}
func TestAgentBinary_DataDirHit(t *testing.T) {
t.Parallel()
url := newAssetsTestServer(t, func(dataDir, _ string) {
writeFile(t, filepath.Join(dataDir, "agent-binaries", "restic-manager-agent-linux-amd64"),
[]byte("from-datadir"))
})
code, body := get(t, url+"/agent/binary?os=linux&arch=amd64")
if code != 200 || string(body) != "from-datadir" {
t.Fatalf("got %d %q", code, string(body))
}
}
func TestAgentBinary_BundleFallback(t *testing.T) {
t.Parallel()
url := newAssetsTestServer(t, func(_, bundleDir string) {
writeFile(t, filepath.Join(bundleDir, "agent-binaries", "restic-manager-agent-linux-amd64"),
[]byte("from-bundle"))
})
code, body := get(t, url+"/agent/binary?os=linux&arch=amd64")
if code != 200 || string(body) != "from-bundle" {
t.Fatalf("got %d %q", code, string(body))
}
}
func TestAgentBinary_DataDirShadowsBundle(t *testing.T) {
t.Parallel()
url := newAssetsTestServer(t, func(dataDir, bundleDir string) {
writeFile(t, filepath.Join(dataDir, "agent-binaries", "restic-manager-agent-linux-amd64"),
[]byte("from-datadir"))
writeFile(t, filepath.Join(bundleDir, "agent-binaries", "restic-manager-agent-linux-amd64"),
[]byte("from-bundle"))
})
code, body := get(t, url+"/agent/binary?os=linux&arch=amd64")
if code != 200 || string(body) != "from-datadir" {
t.Fatalf("operator override should win: got %d %q", code, string(body))
}
}
func TestAgentBinary_BothMiss(t *testing.T) {
t.Parallel()
url := newAssetsTestServer(t, nil)
code, _ := get(t, url+"/agent/binary?os=linux&arch=amd64")
if code != 404 {
t.Fatalf("expected 404, got %d", code)
}
}
func TestAgentBinary_WindowsNameHasExe(t *testing.T) {
t.Parallel()
url := newAssetsTestServer(t, func(_, bundleDir string) {
writeFile(t, filepath.Join(bundleDir, "agent-binaries", "restic-manager-agent-windows-amd64.exe"),
[]byte("win"))
})
code, body := get(t, url+"/agent/binary?os=windows&arch=amd64")
if code != 200 || string(body) != "win" {
t.Fatalf("got %d %q", code, string(body))
}
}
func TestInstallAsset_BundleFallback(t *testing.T) {
t.Parallel()
url := newAssetsTestServer(t, func(_, bundleDir string) {
writeFile(t, filepath.Join(bundleDir, "install", "install.sh"), []byte("#!/bin/sh\n"))
})
code, body := get(t, url+"/install/install.sh")
if code != 200 || string(body) != "#!/bin/sh\n" {
t.Fatalf("got %d %q", code, string(body))
}
}
func TestInstallAsset_PathTraversalRejected(t *testing.T) {
t.Parallel()
url := newAssetsTestServer(t, nil)
// chi will normalise some traversal attempts, but the handler
// also rejects any rel containing a slash or backslash. The
// path component of the URL after /install/ is the rel.
code, _ := get(t, url+"/install/..%2fpasswd")
if code == 200 {
t.Fatalf("traversal should not return 200")
}
}
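Review note on the deleted tests above: they pin down the lookup order that resolveBundledAsset implemented, namely that the DataDir copy is served when present and the bundled copy is used only as a fallback. The sketch below isolates that first-match rule. `candidatePaths`, `firstExisting`, and `onDisk` are hypothetical names introduced for illustration. Taking the existence check as a parameter is a design choice made here so the ordering can be exercised without a filesystem; the code in the diff calls os.Stat directly.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// candidatePaths lists lookup locations in priority order: the
// operator-managed dataDir first, then bundleDir when one is configured.
func candidatePaths(dataDir, bundleDir, subdir, name string) []string {
	out := []string{filepath.Join(dataDir, subdir, name)}
	if bundleDir != "" {
		out = append(out, filepath.Join(bundleDir, subdir, name))
	}
	return out
}

// firstExisting returns the first candidate for which exists reports true.
func firstExisting(candidates []string, exists func(string) bool) (string, bool) {
	for _, p := range candidates {
		if exists(p) {
			return p, true
		}
	}
	return "", false
}

// onDisk reports whether os.Stat succeeds for p.
func onDisk(p string) bool {
	_, err := os.Stat(p)
	return err == nil
}

func main() {
	root, err := os.MkdirTemp("", "assets")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)

	dataDir := filepath.Join(root, "data")
	bundleDir := filepath.Join(root, "dist")
	name := "restic-manager-agent-linux-amd64"

	// Populate only the bundled copy, as in TestAgentBinary_BundleFallback.
	bundled := filepath.Join(bundleDir, "agent-binaries", name)
	if err := os.MkdirAll(filepath.Dir(bundled), 0o755); err != nil {
		panic(err)
	}
	if err := os.WriteFile(bundled, []byte("from-bundle"), 0o644); err != nil {
		panic(err)
	}

	path, ok := firstExisting(candidatePaths(dataDir, bundleDir, "agent-binaries", name), onDisk)
	fmt.Println("found:", ok, "served from bundle:", path == bundled)
}
```

After this change only the DataDir candidate remains, which corresponds to calling `candidatePaths` with an empty `bundleDir`.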

Some files were not shown because too many files have changed in this diff.