P5: OSS readiness — docs site, contributor onboarding, e2e harness

P5-01: Documentation site under docs/book/ rendered with mdBook (downloaded via Makefile, same static-binary pattern as Tailwind). Structured chapters: getting started, concepts, operations, security, reference. `make docs` / `make docs-watch`. Generated output gitignored.

P5-02: CONTRIBUTING.md rewritten from placeholder to a full guide. CODE_OF_CONDUCT.md adapted from Contributor Covenant for a single-maintainer project. .gitea/issue_template/{bug,feature}.md and PULL_REQUEST_TEMPLATE.md.

P5-04: Six README screenshots captured live from a fresh server bootstrap (login, empty dashboard, add-host, alerts, settings, audit log). README rewritten to centre the screenshot grid and link out to the docs site.

P5-05: SECURITY.md with disclosure policy (3-day ack, 30-day default window), scope in/out, threat-model summary, operator hardening checklist. Mirrored as a docs-site chapter.

P5-06: End-to-end test harness. e2e/compose.e2e.yml brings up server + sibling Linux agent (alpine + restic) + restic/rest-server. Agent uses announce-and-approve so Playwright can drive the full operator flow: bootstrap → login → accept pending → backup → verify terminal status. Second spec scrapes /metrics to assert the P6-04 endpoint surface. .gitea/workflows/e2e.yml runs on every PR; local how-to in docs/e2e.md.
2026-05-07 23:56:02 +01:00
parent ff8a5dbead
commit bb4ed3502d
47 changed files with 2818 additions and 61 deletions
@@ -0,0 +1,73 @@
+# Alerts and notifications
+
+restic-manager raises alerts on conditions that need human
+attention. The alert engine evaluates rules on a 60s tick and
+on every job-finished / host-online event.
+
+## Built-in alert kinds
+
+| Kind                  | Trigger | Severity |
+|-----------------------|---------|----------|
+| `backup_failed`       | A backup job ends in `failed` or `cancelled` | warning |
+| `forget_failed`       | A forget job ends in `failed` | warning |
+| `prune_failed`        | A prune job ends in `failed` | critical |
+| `check_failed`        | A check job ends in `failed` | critical |
+| `agent_offline`       | A host has been offline more than 90s past its heartbeat cadence | warning |
+| `stale_schedule`      | A schedule's "last run" is more than 1.5 × its interval ago | warning |
+| `update_failed`       | An agent self-update reported a failure or the agent didn't reconnect within 90s | warning |
+| `fleet_update_halted` | The rolling fleet-update worker stopped on a failure | critical |
+
+Each alert has a `dedup_key` so re-firing the same condition
+just bumps `last_seen_at` — the operator gets one row per
+condition, not a thousand.
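The dedup behaviour described above can be sketched in Go as an in-memory store. This is a minimal illustration, not the project's implementation: the `Alert` and `Store` types, the `Count` field, and the `kind:host` key shape are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

// demoStart is an arbitrary fixed timestamp used by the example.
var demoStart = time.Date(2026, 5, 7, 12, 0, 0, 0, time.UTC)

// Alert is a hypothetical in-memory stand-in for one alert row.
type Alert struct {
	DedupKey    string
	Kind        string
	FirstSeenAt time.Time
	LastSeenAt  time.Time
	Count       int // hypothetical: how many times the condition fired
}

// Store keeps at most one alert per dedup key.
type Store struct {
	byKey map[string]*Alert
}

// NewStore returns an empty store.
func NewStore() *Store {
	return &Store{byKey: make(map[string]*Alert)}
}

// Raise inserts a new alert, or bumps LastSeenAt when the key already exists.
// It reports whether a new row was created.
func (s *Store) Raise(kind, hostID string, now time.Time) bool {
	key := kind + ":" + hostID // assumed key shape, for illustration only
	if a, ok := s.byKey[key]; ok {
		a.LastSeenAt = now
		a.Count++
		return false
	}
	s.byKey[key] = &Alert{DedupKey: key, Kind: kind, FirstSeenAt: now, LastSeenAt: now, Count: 1}
	return true
}

// Len returns the number of distinct alert rows.
func (s *Store) Len() int { return len(s.byKey) }

func main() {
	s := NewStore()
	// The same condition fires a thousand times, one minute apart.
	for i := 0; i < 1000; i++ {
		s.Raise("agent_offline", "host-1", demoStart.Add(time.Duration(i)*time.Minute))
	}
	s.Raise("backup_failed", "host-1", demoStart)
	fmt.Println("rows:", s.Len())
}
```

A real store would do the same thing as a database upsert keyed on `dedup_key`; the map only stands in for that.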
+
+## Lifecycle
+
+```
+raised  ──acknowledge──▶  acknowledged  ──resolve──▶  resolved
+   │                          │
+   └────────auto-resolve──────┘
+   (e.g. agent_offline auto-resolves on agent_online)
+```
+
+- **Acknowledge** says "I've seen this, stop notifying about it".
+- **Resolve** says "the underlying condition is gone".
+- Some alerts auto-resolve when the condition clears
+  (`agent_offline` is the canonical example).
+
+## Notification channels
+
+Configure under **Settings → Notifications**. Each channel can
+subscribe to all alerts or filter by severity.
+
+### Webhook
+
+Posts a JSON envelope to a URL of your choice. Useful for
+piping into Slack via an Incoming Webhook URL or into your own
+alerting tooling.
+
+### ntfy
+
+Pushes a plain-text alert to an [ntfy.sh](https://ntfy.sh/)
+topic. Configure the topic URL; optional bearer token if you
+self-host with auth.
+
+### SMTP
+
+Plain SMTP (with optional TLS). Configure host, port,
+username, password, and the recipient list.
+
+## Test fire
+
+Each channel exposes a **Test fire** button that dispatches a
+single synthetic alert through the channel without touching the
+alert engine. Use this when you've added a channel and want to
+verify connectivity before the next real failure happens.
+
+## What gets logged
+
+Every alert raise / acknowledge / resolve writes an audit log
+entry. The audit log UI at **Settings → Audit log** filters by
+user, action, target, and time range — useful for the
+post-incident "who clicked acknowledge on the prune-failure
+alert" question.
@@ -0,0 +1,73 @@
+# Backups and restores
+
+## Running a backup
+
+Three ways to trigger one:
+
+1. **Scheduled** — the agent's local cron fires at the time set
+   on the schedule.
+2. **Run-now** — operator clicks **Run now** on the host detail
+   right rail. Posts to `/hosts/{id}/run-backup` (defaults to all
+   source groups) or to a per-group form for finer control.
+3. **API** — `POST /api/hosts/{id}/jobs` with the appropriate
+   payload. Same audit + dispatch path.
+
+In every case the server creates a `jobs` row, broadcasts a
+`command.run` to the host, and lands the operator on the live
+job log page (HTMX `HX-Redirect`).
+
+## Cancelling a job
+
+Any running job — backup, forget, prune, restore, anything —
+exposes a **Cancel** button on its detail page. The server
+broadcasts `command.cancel`, and the agent kills the running
+restic subprocess via context cancel: SIGTERM first, SIGKILL
+after a 5s grace (`cmd.Cancel` + `cmd.WaitDelay`). On Windows the
+SIGTERM step is replaced with `os.Kill` because Windows can't
+deliver SIGTERM. Result: when restic exits promptly on the signal,
+a cancelled job lands as `cancelled` within a couple of hundred
+milliseconds; the 5s grace only applies if it does not.
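The cancellation path above maps onto Go's `os/exec` fields roughly as follows. This is a sketch under assumptions: `newJobCommand` and `jobStatus` are hypothetical names, and `sleep 30` stands in for a long restic run on a Unix-like host. `cmd.Cancel` and `cmd.WaitDelay` are the standard library fields available since Go 1.20.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"runtime"
	"syscall"
	"time"
)

// gracePeriod mirrors the 5s grace described in the text.
const gracePeriod = 5 * time.Second

// newJobCommand builds a command that is signalled when ctx is cancelled.
func newJobCommand(ctx context.Context, name string, args ...string) *exec.Cmd {
	cmd := exec.CommandContext(ctx, name, args...)
	cmd.Cancel = func() error {
		if runtime.GOOS == "windows" {
			// Windows cannot deliver SIGTERM, so kill outright.
			return cmd.Process.Kill()
		}
		return cmd.Process.Signal(syscall.SIGTERM)
	}
	// If the process is still alive this long after Cancel, it is killed.
	cmd.WaitDelay = gracePeriod
	return cmd
}

// jobStatus maps the outcome of a run to a terminal status string.
func jobStatus(ctxErr, runErr error) string {
	switch {
	case ctxErr != nil:
		return "cancelled"
	case runErr != nil:
		return "failed"
	default:
		return "succeeded"
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	cmd := newJobCommand(ctx, "sleep", "30")
	if err := cmd.Start(); err != nil {
		fmt.Fprintln(os.Stderr, "start:", err)
		return
	}
	// Simulate the operator clicking Cancel shortly after the job starts.
	time.AfterFunc(100*time.Millisecond, cancel)

	runErr := cmd.Wait()
	fmt.Println("status:", jobStatus(ctx.Err(), runErr))
}
```

Checking the context error before the process error is what lets a signalled process report as `cancelled` rather than `failed`.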
+
+## Restore wizard
+
+Restoring a file or path goes through a four-step wizard at
+`/hosts/{id}/restore`:
+
+1. **Pick a snapshot.** Search by id or by date; the page is
+   pre-populated when you launched the wizard from a snapshot row.
+2. **Browse the snapshot tree.** Lazy-loaded children via the
+   `MsgTreeList` synchronous WS RPC; results are cached
+   per-wizard-session for 30 minutes. Pick the absolute paths
+   you want.
+3. **Choose a target.** Either **In place** (overwrites the
+   live filesystem; requires you to type the hostname to
+   confirm) or **New directory** (default
+   `$HOME/rm-restore/<job-id>/`; agent expands `$HOME` /
+   `${HOME}` / `~/` and creates the directory chain).
+4. **Review and submit.** Server mints a job, dispatches
+   `command.run` with a `RestorePayload`, and `HX-Redirect`s to
+   the live job log.
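The `$HOME` expansion from step 3 could look something like the sketch below. The function name `expandHome` and the rule that only a leading marker is expanded are assumptions for illustration; the agent's real behaviour may differ.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// expandHome replaces a leading $HOME, ${HOME} or ~ with home.
// Only a leading marker is expanded (an assumption about the agent).
func expandHome(path, home string) string {
	for _, prefix := range []string{"${HOME}", "$HOME", "~"} {
		rest, ok := strings.CutPrefix(path, prefix)
		if !ok {
			continue
		}
		// Require a separator or end of string, so "$HOMEWORK" and "~bob"
		// are left untouched.
		if rest == "" || strings.HasPrefix(rest, "/") {
			return filepath.Join(home, rest)
		}
	}
	return path
}

func main() {
	home := "/home/alice" // hypothetical home directory
	for _, p := range []string{
		"$HOME/rm-restore/42/",
		"${HOME}/rm-restore/42/",
		"~/rm-restore/42/",
		"/srv/restore",
	} {
		fmt.Printf("%-26s -> %s\n", p, expandHome(p, home))
	}
}
```

After expansion, creating the directory chain is a single `os.MkdirAll` call on the result.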
+
+`--no-ownership` is gated on restic ≥ 0.17 (the flag was added
+in that release). Hosts running 0.16 don't get the flag and
+restore as the running user instead.
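A version gate like the one described can be sketched as below. `supportsNoOwnership` and `restoreArgs` are hypothetical helpers, and the sketch assumes the flag is added whenever the host's restic supports it, which the text does not confirm.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// supportsNoOwnership reports whether a restic version string such as
// "0.17.3" is at least 0.17. Unparseable versions count as unsupported.
func supportsNoOwnership(version string) bool {
	parts := strings.SplitN(strings.TrimPrefix(version, "v"), ".", 3)
	if len(parts) < 2 {
		return false
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return false
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return false
	}
	return major > 0 || minor >= 17
}

// restoreArgs builds an illustrative restic argument list for a restore.
func restoreArgs(version, snapshot, target string) []string {
	args := []string{"restore", snapshot, "--target", target}
	if supportsNoOwnership(version) {
		args = append(args, "--no-ownership")
	}
	return args
}

func main() {
	for _, v := range []string{"0.16.4", "0.17.0", "0.18.1"} {
		fmt.Println(v, restoreArgs(v, "abc123", "/tmp/restore"))
	}
}
```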
+
+## Snapshot diff
+
+Enter two snapshot ids in the **Diff** form on the host detail
+page to start a `JobDiff` job that runs `restic diff <a> <b>`.
+Output streams to the standard live job log. Useful when
+investigating a suspiciously-sized backup.
+
+## Job log artefacts
+
+Every job's log is persisted in `job_logs` (one row per line),
+not just streamed in-memory. That gives you:
+
+- A live view at `/jobs/{id}` while the job runs.
+- Two download formats from the same page header dropdown:
+  - **txt** — one line per row, `HH:MM:SS.mmm  TAG  payload`.
+  - **ndjson** — one self-contained JSON object per line
+    (`{seq, ts, stream, payload}`), perfect for `jq`.
+
+Downloads work whether the job is running or finished —
+the source is the DB, not the live socket.
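For consumers that prefer Go over `jq`, the ndjson download can be read one line at a time. In this sketch the field types on `LogLine` and every value in `sampleLog` are assumptions; only the four field names come from the text above.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// LogLine mirrors the {seq, ts, stream, payload} object.
// The Go types chosen here are assumptions.
type LogLine struct {
	Seq     int64  `json:"seq"`
	TS      string `json:"ts"`
	Stream  string `json:"stream"`
	Payload string `json:"payload"`
}

// sampleLog is made-up data in the documented shape.
const sampleLog = `{"seq":1,"ts":"2026-05-07T23:56:02Z","stream":"stdout","payload":"open repository"}
{"seq":2,"ts":"2026-05-07T23:56:03Z","stream":"stderr","payload":"warning: lock is stale"}
{"seq":3,"ts":"2026-05-07T23:56:09Z","stream":"stdout","payload":"snapshot saved"}`

// payloadsFor returns the payloads of all lines on the given stream.
func payloadsFor(ndjson, stream string) ([]string, error) {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(ndjson))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var l LogLine
		if err := json.Unmarshal([]byte(line), &l); err != nil {
			return nil, fmt.Errorf("decode line: %w", err)
		}
		if l.Stream == stream {
			out = append(out, l.Payload)
		}
	}
	return out, sc.Err()
}

func main() {
	lines, err := payloadsFor(sampleLog, "stderr")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, p := range lines {
		fmt.Println(p)
	}
}
```

Because each line is a self-contained object, a reader never needs to hold the whole log in memory. `bufio.Scanner` caps a line at 64 KiB by default, so very long payloads would need a larger buffer.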
@@ -0,0 +1,61 @@
+# Observability with Prometheus
+
+restic-manager can expose a Prometheus scrape endpoint at
+`GET /metrics`. The endpoint is **opt-in** — without an explicit
+auth gate it isn't even mounted, so a forgotten config can't
+accidentally publish fleet state.
+
+The full reference lives at
+[`docs/prometheus.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/docs/prometheus.md);
+the short version follows.
+
+## Enable the endpoint
+
+Set at least one of:
+
+- `RM_METRICS_TOKEN` — `Authorization: Bearer <token>` required.
+- `RM_METRICS_TRUSTED_CIDR` — restricts source IPs to a
+  comma-separated list of CIDRs.
+
+When both are set, a request must pass both checks. The token
+is compared in constant time, and the CIDR check honours
+`X-Forwarded-For` only when the immediate hop matches
+`RM_TRUSTED_PROXY`.
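The gate logic can be sketched as follows. `MetricsGate`, `ParseGate`, `Mounted`, and `Allow` are hypothetical names, and the sketch assumes the client IP has already been resolved, so the `X-Forwarded-For` and `RM_TRUSTED_PROXY` handling is left out. `subtle.ConstantTimeCompare` and `net/netip` are standard library.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/netip"
	"strings"
)

// MetricsGate holds the two optional checks; empty means "not configured".
type MetricsGate struct {
	Token    string         // value of RM_METRICS_TOKEN
	Prefixes []netip.Prefix // parsed RM_METRICS_TRUSTED_CIDR
}

// ParseGate builds a gate from the raw environment values.
func ParseGate(token, cidrs string) (MetricsGate, error) {
	g := MetricsGate{Token: token}
	for _, c := range strings.Split(cidrs, ",") {
		c = strings.TrimSpace(c)
		if c == "" {
			continue
		}
		p, err := netip.ParsePrefix(c)
		if err != nil {
			return MetricsGate{}, fmt.Errorf("bad CIDR %q: %w", c, err)
		}
		g.Prefixes = append(g.Prefixes, p)
	}
	return g, nil
}

// Mounted reports whether the endpoint should exist at all.
func (g MetricsGate) Mounted() bool {
	return g.Token != "" || len(g.Prefixes) > 0
}

// Allow applies every configured check; all of them must pass.
func (g MetricsGate) Allow(authHeader, clientIP string) bool {
	if !g.Mounted() {
		return false
	}
	if g.Token != "" {
		want := "Bearer " + g.Token
		if subtle.ConstantTimeCompare([]byte(authHeader), []byte(want)) != 1 {
			return false
		}
	}
	if len(g.Prefixes) > 0 {
		addr, err := netip.ParseAddr(clientIP)
		if err != nil {
			return false
		}
		addr = addr.Unmap() // treat ::ffff:10.0.0.1 as 10.0.0.1
		inRange := false
		for _, p := range g.Prefixes {
			if p.Contains(addr) {
				inRange = true
				break
			}
		}
		if !inRange {
			return false
		}
	}
	return true
}

func main() {
	g, err := ParseGate("s3cret", "10.0.0.0/8, 192.168.1.0/24")
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println(g.Allow("Bearer s3cret", "10.1.2.3"))
	fmt.Println(g.Allow("Bearer s3cret", "8.8.8.8"))
}
```

`Mounted` returning false for an empty configuration corresponds to the route never being registered, which is stricter than registering it and rejecting every request.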
+
+## Metrics emitted
+
+- **Server gauges**: `rm_hosts_total`, `rm_hosts_online`,
+  `rm_active_alerts{severity}`, `rm_build_info{...}`.
+- **Per-host gauges**: `rm_host_agent_online`,
+  `rm_host_last_backup_timestamp_seconds`,
+  `rm_host_last_backup_success`, `rm_host_repo_size_bytes`,
+  `rm_host_snapshot_count`, `rm_host_open_alerts`,
+  `rm_host_repo_status`.
+- **Histogram**:
+  `rm_job_duration_seconds{kind,status,le=…}` (buckets
+  `1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf`).
+
+The histogram lives in memory only, so restic-manager itself
+keeps no history. Prometheus persists the scrapes; if you need
+durable history at hourly resolution, that's Prometheus's job.
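To make the bucket semantics concrete, here is a minimal cumulative histogram using the bounds listed above. The `Histogram` type and its methods are hypothetical; a real server would more likely use a Prometheus client library.

```go
package main

import (
	"fmt"
	"math"
)

// bucketBounds are the upper bounds from the text, in seconds.
var bucketBounds = []float64{1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, math.Inf(1)}

// Histogram is a minimal cumulative histogram in the Prometheus style.
type Histogram struct {
	counts []uint64 // counts[i] = observations <= bucketBounds[i]
	sum    float64
}

// NewHistogram returns an empty histogram.
func NewHistogram() *Histogram {
	return &Histogram{counts: make([]uint64, len(bucketBounds))}
}

// Observe records one job duration in every bucket whose bound it fits under.
func (h *Histogram) Observe(seconds float64) {
	h.sum += seconds
	for i, le := range bucketBounds {
		if seconds <= le {
			h.counts[i]++
		}
	}
}

// Count returns the cumulative count for the bucket with upper bound le,
// or 0 if there is no such bucket.
func (h *Histogram) Count(le float64) uint64 {
	for i, b := range bucketBounds {
		if b == le {
			return h.counts[i]
		}
	}
	return 0
}

// Total returns the number of observations (the +Inf bucket).
func (h *Histogram) Total() uint64 {
	return h.counts[len(h.counts)-1]
}

func main() {
	h := NewHistogram()
	for _, d := range []float64{0.5, 42, 4000} {
		h.Observe(d)
	}
	for _, le := range bucketBounds {
		fmt.Printf("le=%v count=%d\n", le, h.Count(le))
	}
}
```

Buckets are cumulative, so a 42-second job counts in `le="60"` and in every larger bucket as well.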
+
+## Sample Grafana dashboard
+
+[`deploy/grafana/restic-manager-dashboard.json`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/deploy/grafana/restic-manager-dashboard.json)
+imports through Grafana's **+ → Import → Upload JSON file**.
+Six panels:
+
+1. Fleet status (online / total).
+2. Open alerts by severity.
+3. Backups failing on most-recent run.
+4. Hosts table — last backup, repo size, snapshots, open alerts.
+5. Repo size over time, one line per host.
+6. Job-duration p95 over a 1h window per kind.
+
+## Alerting
+
+restic-manager already has a built-in alert engine
+([Alerts](./alerts.md)). The dashboard intentionally doesn't
+duplicate it as Prometheus alert rules. If you want
+Prometheus-side alerts on top, write your own based on the
+metrics above — `rm_host_last_backup_success == 0`,
+`time() - rm_host_last_backup_timestamp_seconds > <max age>`,
+or whatever suits your environment.
@@ -0,0 +1,50 @@
+# Updating agents
+
+Server updates are a `docker compose pull && docker compose up -d` away.
+Agents update via the control plane.
+
+## Single-host update
+
+Each host's detail page shows an **Update agent** button when
+the agent's reported version is older than the server's.
+Clicking it starts this sequence:
+
+1. The server dispatches a `command.update` to that host.
+2. The agent fetches the appropriate binary from
+   `$RM_SERVER/agent/binary?os=…&arch=…` to
+   `<binary-path>.new`.
+3. The agent copies the running binary to `<binary-path>.old`
+   (one revision back, in case rollback is needed).
+4. It atomically renames `.new` over the running binary.
+5. It exits cleanly. systemd's `Restart=always` (or Windows SCM)
+   brings the process back on the new binary.
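Steps 3 and 4 can be sketched as below, assuming a POSIX filesystem where `os.Rename` within one directory is atomic. `applyUpdate`, `copyFile`, and `runDemo` are hypothetical names, and the download in step 2 is replaced by a file written directly to `.new`.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// copyFile copies src to dst, creating or truncating dst with the given mode.
func copyFile(src, dst string, mode os.FileMode) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.OpenFile(dst, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, mode)
	if err != nil {
		return err
	}
	if _, err := io.Copy(out, in); err != nil {
		out.Close()
		return err
	}
	return out.Close()
}

// applyUpdate installs a binary that was already downloaded to path+".new".
func applyUpdate(path string) error {
	// Step 3: keep one revision back for rollback.
	if err := copyFile(path, path+".old", 0o755); err != nil {
		return fmt.Errorf("back up current binary: %w", err)
	}
	// The download may not be executable yet.
	if err := os.Chmod(path+".new", 0o755); err != nil {
		return fmt.Errorf("chmod new binary: %w", err)
	}
	// Step 4: atomic on POSIX when both paths share a filesystem.
	if err := os.Rename(path+".new", path); err != nil {
		return fmt.Errorf("swap binary: %w", err)
	}
	return nil
}

// runDemo exercises applyUpdate on throwaway files in a temp directory and
// returns the contents of the live and .old files afterwards.
func runDemo() (string, string, error) {
	dir, err := os.MkdirTemp("", "rm-update-demo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)

	bin := filepath.Join(dir, "agent")
	if err := os.WriteFile(bin, []byte("v1"), 0o755); err != nil {
		return "", "", err
	}
	if err := os.WriteFile(bin+".new", []byte("v2"), 0o644); err != nil {
		return "", "", err
	}
	if err := applyUpdate(bin); err != nil {
		return "", "", err
	}
	cur, err := os.ReadFile(bin)
	if err != nil {
		return "", "", err
	}
	old, err := os.ReadFile(bin + ".old")
	if err != nil {
		return "", "", err
	}
	return string(cur), string(old), nil
}

func main() {
	cur, old, err := runDemo()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Printf("live=%s old=%s\n", cur, old)
}
```

Writing to `.new` beside the live binary keeps both paths on one filesystem, which is what makes the rename atomic. Replacing a running executable behaves differently on Windows, so this sketch covers only the POSIX case.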
+
+A 90-second timer on the server side waits for a hello at the
+target version and marks the update succeeded — or, if the
+agent doesn't reconnect at the expected version in time, marks
+the update **failed** and raises an `update_failed` alert.
+
+## Fleet update
+
+The admin-only **Settings → Fleet update** page drives a rolling
+update across every host in the fleet:
+
+- One host at a time.
+- Wait for hello-with-target-version (max 95s).
+- On any host failing, **halt** the rollout, raise a
+  `fleet_update_halted` alert, leave the rest of the fleet on
+  the old version. No surprise mass-failures.
+
+You can cancel an in-progress fleet update; the worker stops
+after the current host finishes.
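The rolling behaviour, including halt-on-failure and cancel-between-hosts, can be sketched as a simple loop. `rollOut`, `Result`, and `errNoHello` are hypothetical. The `update` callback is assumed to block until the host reports the target version or the wait times out.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoHello stands in for "no hello at the target version before the deadline".
var errNoHello = errors.New("agent did not reconnect at target version")

// Result summarises one rollout.
type Result struct {
	Updated   []string // hosts that reached the target version
	HaltedAt  string   // host that failed; empty if none did
	Cancelled bool     // true if the operator cancelled between hosts
}

// rollOut updates hosts strictly one at a time and halts on the first failure.
// cancelled is polled only between hosts, so the current host always finishes.
func rollOut(hosts []string, update func(host string) error, cancelled func() bool) Result {
	var res Result
	for _, h := range hosts {
		if cancelled() {
			res.Cancelled = true
			return res
		}
		if err := update(h); err != nil {
			// A real worker would raise fleet_update_halted here.
			res.HaltedAt = h
			return res
		}
		res.Updated = append(res.Updated, h)
	}
	return res
}

func main() {
	hosts := []string{"web-1", "web-2", "db-1"}
	res := rollOut(hosts,
		func(h string) error {
			if h == "web-2" {
				return errNoHello
			}
			return nil
		},
		func() bool { return false },
	)
	fmt.Printf("updated=%v haltedAt=%q cancelled=%v\n", res.Updated, res.HaltedAt, res.Cancelled)
}
```

Hosts after the failing one are never touched, which is how the rest of the fleet stays on the old version.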
+
+## TLS and corruption
+
+Updates rely on the reverse proxy's TLS to detect corruption in
+transit. There's no separate sha256 verification step — we
+chose the simpler model on the basis that the same TLS already
+gates every other byte the server hands to the agent.
+
+If you'd like a separate signature step before applying updates,
+that's a future-phase enhancement (see `tasks.md` Phase 6
+candidates).