P5: OSS readiness — docs site, contributor onboarding, e2e harness

P5-01 — Documentation site under docs/book/ rendered with mdBook
(downloaded via Makefile, same static-binary pattern as Tailwind).
Structured chapters: getting started, concepts, operations,
security, reference. `make docs` / `make docs-watch`. Generated
output gitignored.

P5-02 — CONTRIBUTING.md rewritten from placeholder to a full
guide. CODE_OF_CONDUCT.md adapted from Contributor Covenant for a
single-maintainer project. .gitea/issue_template/{bug,feature}.md
and PULL_REQUEST_TEMPLATE.md.

P5-04 — Six README screenshots captured live from a fresh server
bootstrap (login, empty dashboard, add-host, alerts, settings,
audit log). README rewritten to centre the screenshot grid and
link out to the docs site.

P5-05 — SECURITY.md with disclosure policy (3-day ack, 30-day
default window), scope in/out, threat-model summary, operator
hardening checklist. Mirrored as a docs-site chapter.

P5-06 — End-to-end test harness. e2e/compose.e2e.yml brings up
server + sibling Linux agent (alpine + restic) + restic/rest-server.
Agent uses announce-and-approve so Playwright can drive the full
operator flow: bootstrap → login → accept pending → backup →
verify terminal status. Second spec scrapes /metrics to assert
the P6-04 endpoint surface. .gitea/workflows/e2e.yml runs on every
PR; local how-to in docs/e2e.md.
2026-05-07 23:56:02 +01:00
parent ff8a5dbead
commit bb4ed3502d
47 changed files with 2818 additions and 61 deletions
@@ -0,0 +1,73 @@
# Alerts and notifications
restic-manager raises alerts on conditions that need human
attention. The alert engine evaluates rules on a 60s tick and
on every job-finished / host-online event.
## Built-in alert kinds
| Kind | Trigger | Severity |
|---------------------|---------|----------|
| `backup_failed` | A backup job ends in `failed` or `cancelled` | warning |
| `forget_failed` | A forget job ends in `failed` | warning |
| `prune_failed` | A prune job ends in `failed` | critical |
| `check_failed` | A check job ends in `failed` | critical |
| `agent_offline` | A host has been offline more than 90s past its heartbeat cadence | warning |
| `stale_schedule` | A schedule's "last run" is more than 1.5 × its interval ago | warning |
| `update_failed` | An agent self-update returned a fail or didn't reconnect within 90s | warning |
| `fleet_update_halted` | The rolling fleet-update worker stopped on a failure | critical |
Each alert has a `dedup_key` so re-firing the same condition
just bumps `last_seen_at` — the operator gets one row per
condition, not a thousand.
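
The dedup behaviour can be sketched as an upsert keyed on `dedup_key`. This is a minimal in-memory illustration, not the server's code: the `Alert` fields other than `dedup_key` and `last_seen_at`, the `Store` type, and the `kind:host` key format are all assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// Alert is a hypothetical row shape; only dedup_key and last_seen_at
// are named in the text above.
type Alert struct {
	DedupKey    string
	Kind        string
	FirstSeenAt time.Time
	LastSeenAt  time.Time
}

// Store keeps open alerts indexed by dedup key.
type Store struct {
	open map[string]*Alert
	now  func() time.Time // injectable clock
}

func NewStore() *Store {
	return &Store{open: make(map[string]*Alert), now: time.Now}
}

// Raise records a condition. It returns true when a new row was created
// and false when an existing row was only refreshed.
func (s *Store) Raise(kind, hostID string) bool {
	key := kind + ":" + hostID // assumed key format
	now := s.now()
	if a, ok := s.open[key]; ok {
		a.LastSeenAt = now
		return false
	}
	s.open[key] = &Alert{DedupKey: key, Kind: kind, FirstSeenAt: now, LastSeenAt: now}
	return true
}

// Len reports how many distinct conditions are open.
func (s *Store) Len() int { return len(s.open) }

func main() {
	s := NewStore()
	for i := 0; i < 1000; i++ {
		s.Raise("backup_failed", "host-1")
	}
	s.Raise("agent_offline", "host-1")
	fmt.Println(s.Len()) // two distinct conditions, so two rows
}
```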
## Lifecycle
```
raised ──acknowledge──▶ acknowledged ──resolve──▶ resolved
   │                          │
   └────────auto-resolve──────┘
        (e.g. agent_offline auto-resolves on agent_online)
```
- **Acknowledge** says "I've seen this, stop notifying about it".
- **Resolve** says "the underlying condition is gone".
- Some alerts auto-resolve when the condition clears
(`agent_offline` is the canonical example).
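
One way to read the lifecycle is as a small state machine. The sketch below encodes only the transitions drawn in the diagram; whether the real engine also allows a manual resolve straight from `raised` is not stated, so that transition is rejected here as an assumption. All type and function names are hypothetical.

```go
package main

import "fmt"

type State string
type Event string

const (
	Raised       State = "raised"
	Acknowledged State = "acknowledged"
	Resolved     State = "resolved"

	Acknowledge Event = "acknowledge"
	Resolve     Event = "resolve"
	AutoResolve Event = "auto-resolve"
)

// Next returns the state reached from s on event e, or an error when the
// diagram shows no such transition.
func Next(s State, e Event) (State, error) {
	switch {
	case s == Raised && e == Acknowledge:
		return Acknowledged, nil
	case s == Acknowledged && e == Resolve:
		return Resolved, nil
	case (s == Raised || s == Acknowledged) && e == AutoResolve:
		// Assumed: auto-resolve applies to any alert that is still open.
		return Resolved, nil
	}
	return s, fmt.Errorf("no transition from %q on %q", s, e)
}

func main() {
	s := Raised
	for _, e := range []Event{Acknowledge, Resolve} {
		next, err := Next(s, e)
		if err != nil {
			fmt.Println(err)
			return
		}
		fmt.Println(s, "->", next)
		s = next
	}
}
```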
## Notification channels
Configure under **Settings → Notifications**. Each channel can
subscribe to all alerts or filter by severity.
### Webhook
Posts a JSON envelope to a URL of your choice. Useful for
piping into Slack via an Incoming Webhook URL or into your own
alerting tooling.
### ntfy
Pushes a plain-text alert to an [ntfy.sh](https://ntfy.sh/)
topic. Configure the topic URL; optional bearer token if you
self-host with auth.
### SMTP
Plain SMTP (with optional TLS). Configure host, port,
username, password, and the recipient list.
## Test fire
Each channel exposes a **Test fire** button that dispatches a
single synthetic alert through the channel without touching the
alert engine. Use this when you've added a channel and want to
verify connectivity before the next real failure happens.
## What gets logged
Every alert raise / acknowledge / resolve writes an audit log
entry. The audit log UI at **Settings → Audit log** filters by
user, action, target, and time range — useful for the
post-incident "who clicked acknowledge on the prune-failure
alert" question.
@@ -0,0 +1,73 @@
# Backups and restores
## Running a backup
Three ways to trigger one:
1. **Scheduled** — the agent's local cron fires at the time set
on the schedule.
2. **Run-now** — operator clicks **Run now** on the host detail
right rail. Posts to `/hosts/{id}/run-backup` (defaults to all
source groups) or to a per-group form for finer control.
3. **API**: `POST /api/hosts/{id}/jobs` with the appropriate
payload. Same audit + dispatch path.
In every case the server creates a `jobs` row, broadcasts a
`command.run` to the host, and lands the operator on the live
job log page (HTMX `HX-Redirect`).
## Cancelling a job
Any running job — backup, forget, prune, restore, anything —
exposes a **Cancel** button on its detail page. The server
broadcasts `command.cancel`, and the agent kills the running
restic subprocess via context cancel: SIGTERM first, SIGKILL
after a 5s grace (`cmd.Cancel` + `cmd.WaitDelay`). On Windows the
SIGTERM step is replaced with `os.Kill` because Windows can't
deliver SIGTERM. Result: a cancelled job typically lands as `cancelled`
within a few hundred milliseconds, or once the 5s grace expires if
restic does not exit on SIGTERM.
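
The SIGTERM-then-SIGKILL pattern maps onto `exec.Cmd`'s `Cancel` and `WaitDelay` fields (available since Go 1.20). The following is a sketch of that pattern rather than the agent's actual code: the helper names are hypothetical, and `sleep 30` stands in for a long-running restic process on a Unix-like host.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"runtime"
	"syscall"
	"time"
)

// newCancellableCmd builds a command that is asked to stop politely when ctx
// is cancelled and is killed outright if it is still alive after grace.
func newCancellableCmd(ctx context.Context, grace time.Duration, name string, args ...string) *exec.Cmd {
	cmd := exec.CommandContext(ctx, name, args...)
	cmd.Cancel = func() error {
		if runtime.GOOS == "windows" {
			// Windows cannot deliver SIGTERM, so go straight to a hard kill.
			return cmd.Process.Kill()
		}
		return cmd.Process.Signal(syscall.SIGTERM)
	}
	// Once ctx is done, Wait allows this long before killing the process.
	cmd.WaitDelay = grace
	return cmd
}

// runAndCancel starts the command, cancels it after cancelAfter, and reports
// the terminal status a job row might record (status names are assumed).
func runAndCancel(cancelAfter, grace time.Duration, name string, args ...string) (string, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	cmd := newCancellableCmd(ctx, grace, name, args...)
	if err := cmd.Start(); err != nil {
		return "", err
	}
	timer := time.AfterFunc(cancelAfter, cancel)
	defer timer.Stop()

	err := cmd.Wait()
	switch {
	case err == nil:
		return "succeeded", nil
	case errors.Is(ctx.Err(), context.Canceled):
		return "cancelled", nil
	default:
		return "failed", nil
	}
}

func main() {
	status, err := runAndCancel(200*time.Millisecond, 5*time.Second, "sleep", "30")
	fmt.Println(status, err)
}
```

Setting `WaitDelay` also bounds how long `Wait` blocks on the child's output pipes, which matters when the subprocess has spawned children that keep those pipes open.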
## Restore wizard
Restoring a file or path goes through a four-step wizard at
`/hosts/{id}/restore`:
1. **Pick a snapshot.** Search by id or by date; the page is
pre-populated when you launched the wizard from a snapshot row.
2. **Browse the snapshot tree.** Lazy-loaded children via the
`MsgTreeList` synchronous WS RPC; results are cached
per-wizard-session for 30 minutes. Pick the absolute paths
you want.
3. **Choose a target.** Either **In place** (overwrites the
live filesystem; requires you to type the hostname to
confirm) or **New directory** (default
`$HOME/rm-restore/<job-id>/`; agent expands `$HOME` /
`${HOME}` / `~/` and creates the directory chain).
4. **Review and submit.** Server mints a job, dispatches
`command.run` with a `RestorePayload`, and `HX-Redirect`s to
the live job log.
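
The target-path expansion in step 3 could look roughly like this. It is a sketch under assumptions: `expandHome` and `prepareTarget` are hypothetical names, the directory mode `0o755` is a guess, and the rule that `$HOME` only expands when it is a whole path element is a design choice of this example, not something the text specifies.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// expandHome replaces a leading $HOME, ${HOME}, or ~ path element with home.
func expandHome(path, home string) string {
	for _, prefix := range []string{"${HOME}", "$HOME", "~"} {
		rest, ok := strings.CutPrefix(path, prefix)
		if !ok {
			continue
		}
		// Only expand when the prefix is a whole path element, so
		// "$HOMEBREW/x" and "~alice/x" are left untouched.
		if rest == "" || strings.HasPrefix(rest, "/") {
			return filepath.Join(home, rest)
		}
	}
	return path
}

// prepareTarget expands the path and creates the full directory chain.
func prepareTarget(path, home string) (string, error) {
	target := expandHome(path, home)
	if err := os.MkdirAll(target, 0o755); err != nil {
		return "", fmt.Errorf("create restore target: %w", err)
	}
	return target, nil
}

func main() {
	home, err := os.UserHomeDir()
	if err != nil {
		fmt.Println("no home directory:", err)
		return
	}
	// "job-123" is a placeholder job id.
	fmt.Println(expandHome("$HOME/rm-restore/job-123/", home))
}
```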
`--no-ownership` is gated on restic ≥ 0.17 (the flag was added
in that release). Hosts running 0.16 don't get the flag and
restore as the running user instead.
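
A version gate of this kind can be sketched as below. The function names and the `noOwnership` request flag are hypothetical; the only facts taken from the text are the flag name and the 0.17 threshold.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// atLeast reports whether a "major.minor[.patch]" version string is at or
// above the given major.minor. Unparseable versions are treated as too old.
func atLeast(version string, major, minor int) bool {
	parts := strings.SplitN(strings.TrimPrefix(version, "v"), ".", 3)
	if len(parts) < 2 {
		return false
	}
	maj, err := strconv.Atoi(parts[0])
	if err != nil {
		return false
	}
	mnr, err := strconv.Atoi(parts[1])
	if err != nil {
		return false
	}
	return maj > major || (maj == major && mnr >= minor)
}

// restoreArgs builds the restic argument list, adding --no-ownership only
// when it was requested and the host's restic is new enough.
func restoreArgs(resticVersion, snapshot, target string, noOwnership bool) []string {
	args := []string{"restore", snapshot, "--target", target}
	if noOwnership && atLeast(resticVersion, 0, 17) {
		args = append(args, "--no-ownership")
	}
	return args
}

func main() {
	fmt.Println(restoreArgs("0.16.4", "abc123", "/tmp/out", true))
	fmt.Println(restoreArgs("0.17.1", "abc123", "/tmp/out", true))
}
```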
## Snapshot diff
Enter two snapshot ids in the **Diff** form on the host detail
page to start a `JobDiff` job that runs `restic diff <a> <b>`.
Output streams to the standard live job log. Useful when
investigating a suspiciously-sized backup.
## Job log artefacts
Every job's log is persisted in `job_logs` (one row per line),
not just streamed in-memory. That gives you:
- A live view at `/jobs/{id}` while the job runs.
- Two download formats from the same page header dropdown:
- **txt** — one line per row, `HH:MM:SS.mmm TAG payload`.
- **ndjson** — one self-contained JSON object per line
(`{seq, ts, stream, payload}`), perfect for `jq`.
Downloads work whether the job is running or finished —
the source is the DB, not the live socket.
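
For consumers that prefer code over `jq`, the ndjson download can be decoded one object at a time. This sketch assumes `seq` is an integer and that `ts`, `stream`, and `payload` are strings; the text only names the keys, so treat the field types and the sample lines as illustrative.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
)

// LogLine mirrors the {seq, ts, stream, payload} keys; field types are assumed.
type LogLine struct {
	Seq     int64  `json:"seq"`
	TS      string `json:"ts"`
	Stream  string `json:"stream"`
	Payload string `json:"payload"`
}

// decodeLog reads newline-delimited JSON objects until EOF.
func decodeLog(r io.Reader) ([]LogLine, error) {
	dec := json.NewDecoder(r)
	var lines []LogLine
	for {
		var l LogLine
		err := dec.Decode(&l)
		if errors.Is(err, io.EOF) {
			return lines, nil
		}
		if err != nil {
			return lines, err
		}
		lines = append(lines, l)
	}
}

// decodeLogString is a convenience wrapper for in-memory data.
func decodeLogString(s string) ([]LogLine, error) {
	return decodeLog(strings.NewReader(s))
}

func main() {
	// Made-up sample lines in the documented shape.
	sample := `{"seq":1,"ts":"2026-05-07T22:00:00Z","stream":"stdout","payload":"scan finished"}
{"seq":2,"ts":"2026-05-07T22:00:01Z","stream":"stderr","payload":"warning: file changed"}
`
	lines, err := decodeLogString(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, l := range lines {
		if l.Stream == "stderr" {
			fmt.Println(l.Seq, l.Payload)
		}
	}
}
```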
@@ -0,0 +1,61 @@
# Observability with Prometheus
restic-manager can expose a Prometheus scrape endpoint at
`GET /metrics`. The endpoint is **opt-in** — without an explicit
auth gate it isn't even mounted, so a forgotten config can't
accidentally publish fleet state.
The full reference lives at
[`docs/prometheus.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/docs/prometheus.md);
the short version follows.
## Enable the endpoint
Set at least one of:
- `RM_METRICS_TOKEN`: requests must send `Authorization: Bearer <token>`.
- `RM_METRICS_TRUSTED_CIDR`: restricts source IPs (comma-separated CIDR list).
When both are set, a request must pass both checks. The token is
compared in constant time, and the CIDR check honours
`X-Forwarded-For` only when the immediate hop matches
`RM_TRUSTED_PROXY`.
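
The gate described above can be sketched with `crypto/subtle` and `net/netip`. This is an illustration under assumptions, not the server's handler: the function names are hypothetical, and `X-Forwarded-For` / `RM_TRUSTED_PROXY` handling is left out, so `remoteIP` is taken to be the already-resolved client address.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/netip"
	"strings"
)

// parseCIDRs parses a comma-separated CIDR list.
func parseCIDRs(list string) ([]netip.Prefix, error) {
	var out []netip.Prefix
	for _, item := range strings.Split(list, ",") {
		item = strings.TrimSpace(item)
		if item == "" {
			continue
		}
		p, err := netip.ParsePrefix(item)
		if err != nil {
			return nil, fmt.Errorf("bad CIDR %q: %w", item, err)
		}
		out = append(out, p.Masked())
	}
	return out, nil
}

// tokenOK compares the bearer token without an early exit on the first
// differing byte. ConstantTimeCompare still reveals a length mismatch.
func tokenOK(authHeader, want string) bool {
	got, ok := strings.CutPrefix(authHeader, "Bearer ")
	if !ok {
		return false
	}
	return subtle.ConstantTimeCompare([]byte(got), []byte(want)) == 1
}

// ipOK reports whether remoteIP falls inside any allowed prefix.
func ipOK(remoteIP string, allowed []netip.Prefix) bool {
	addr, err := netip.ParseAddr(remoteIP)
	if err != nil {
		return false
	}
	addr = addr.Unmap() // treat ::ffff:a.b.c.d as plain IPv4
	for _, p := range allowed {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

// authorize applies every configured gate; with no gate configured the
// endpoint is treated as not mounted and every request is refused.
func authorize(authHeader, remoteIP, token string, allowed []netip.Prefix) bool {
	if token == "" && len(allowed) == 0 {
		return false
	}
	if token != "" && !tokenOK(authHeader, token) {
		return false
	}
	if len(allowed) > 0 && !ipOK(remoteIP, allowed) {
		return false
	}
	return true
}

func main() {
	allowed, err := parseCIDRs("10.0.0.0/8, 192.168.1.0/24")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(authorize("Bearer s3cret", "10.1.2.3", "s3cret", allowed))
	fmt.Println(authorize("Bearer s3cret", "203.0.113.9", "s3cret", allowed))
}
```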
## Metrics emitted
- **Server gauges**: `rm_hosts_total`, `rm_hosts_online`,
`rm_active_alerts{severity}`, `rm_build_info{...}`.
- **Per-host gauges**: `rm_host_agent_online`,
`rm_host_last_backup_timestamp_seconds`,
`rm_host_last_backup_success`, `rm_host_repo_size_bytes`,
`rm_host_snapshot_count`, `rm_host_open_alerts`,
`rm_host_repo_status`.
- **Histogram**:
`rm_job_duration_seconds{kind,status,le=…}` (buckets
`1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf`).
The histogram is held in memory only. Prometheus persists the
scrapes, so durable history (for example at hourly resolution)
is Prometheus's job.
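
To make the bucket layout concrete, here is a minimal in-memory histogram using the bounds listed above. It is a sketch only: it ignores the `kind` and `status` labels, is not safe for concurrent use, and the server may well rely on a Prometheus client library instead.

```go
package main

import (
	"fmt"
	"sort"
)

// Upper bounds in seconds; the final +Inf bucket is implicit.
var bounds = []float64{1, 5, 30, 60, 300, 1800, 3600, 21600, 86400}

// Histogram counts observations per bucket.
type Histogram struct {
	counts []uint64 // one slot per bound plus a final +Inf slot
	sum    float64
}

func NewHistogram() *Histogram {
	return &Histogram{counts: make([]uint64, len(bounds)+1)}
}

// Observe records one job duration in the first bucket whose bound is >= v.
func (h *Histogram) Observe(seconds float64) {
	i := sort.SearchFloat64s(bounds, seconds)
	h.counts[i]++
	h.sum += seconds
}

// Cumulative returns the running totals that the `le` series expose:
// each entry counts every observation at or below that bound.
func (h *Histogram) Cumulative() []uint64 {
	out := make([]uint64, len(h.counts))
	var running uint64
	for i, c := range h.counts {
		running += c
		out[i] = running
	}
	return out
}

func main() {
	h := NewHistogram()
	for _, d := range []float64{0.4, 12, 95, 4000} {
		h.Observe(d)
	}
	fmt.Println(h.Cumulative())
}
```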
## Sample Grafana dashboard
[`deploy/grafana/restic-manager-dashboard.json`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/deploy/grafana/restic-manager-dashboard.json)
imports through Grafana's **+ → Import → Upload JSON file**.
Six panels:
1. Fleet status (online / total).
2. Open alerts by severity.
3. Backups failing on most-recent run.
4. Hosts table — last backup, repo size, snapshots, open alerts.
5. Repo size over time, one line per host.
6. Job-duration p95 over a 1h window per kind.
## Alerting
restic-manager already has a built-in alert engine
([Alerts](./alerts.md)). The dashboard intentionally doesn't
duplicate it as Prometheus alert rules. If you want
Prometheus-side alerts on top, write your own based on the
metrics above — `rm_host_last_backup_success == 0`,
`time() - rm_host_last_backup_timestamp_seconds > <max age>`,
or whatever suits your environment.
@@ -0,0 +1,50 @@
# Updating agents
Server updates are a `docker compose pull && docker compose up -d` away.
Agents update via the control plane.
## Single-host update
Each host's detail page shows an **Update agent** button when
the agent's reported version is older than the server's.
Clicking it starts the following sequence:
1. The server dispatches a `command.update` to that host.
2. The agent fetches the appropriate binary from
`$RM_SERVER/agent/binary?os=…&arch=…` to
`<binary-path>.new`.
3. Copies the running binary to `<binary-path>.old` (one
revision back, in case rollback is needed).
4. Atomic-renames `.new` over the running binary.
5. Exits cleanly. systemd's `Restart=always` (or Windows SCM)
brings the process back on the new binary.
A 90-second timer on the server side waits for a hello at the
target version and marks the update succeeded — or, if the
agent doesn't reconnect at the expected version in time, marks
the update **failed** and raises an `update_failed` alert.
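
Steps 3 and 4 of the sequence (keep a `.old` copy, then rename `.new` into place) can be sketched as follows. The function names are hypothetical, the demo uses throwaway files in a temp directory rather than a real agent binary, and renaming over a running executable is assumed to be a Unix-like behaviour; Windows would need a different approach.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// copyFile copies src to dst, preserving permission bits and syncing to disk.
func copyFile(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	info, err := in.Stat()
	if err != nil {
		return err
	}
	out, err := os.OpenFile(dst, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, info.Mode().Perm())
	if err != nil {
		return err
	}
	if _, err := io.Copy(out, in); err != nil {
		out.Close()
		return err
	}
	if err := out.Sync(); err != nil {
		out.Close()
		return err
	}
	return out.Close()
}

// swapBinary keeps one rollback copy and then moves <bin>.new over <bin>.
func swapBinary(binPath string) error {
	if err := copyFile(binPath, binPath+".old"); err != nil {
		return fmt.Errorf("keep rollback copy: %w", err)
	}
	// Rename within one filesystem replaces the target in a single step.
	if err := os.Rename(binPath+".new", binPath); err != nil {
		return fmt.Errorf("activate new binary: %w", err)
	}
	return nil
}

// demo runs the swap on placeholder files and returns the contents of the
// active binary and of the rollback copy.
func demo() (string, string, error) {
	dir, err := os.MkdirTemp("", "rm-agent-swap")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)

	bin := filepath.Join(dir, "agent")
	if err := os.WriteFile(bin, []byte("v1"), 0o755); err != nil {
		return "", "", err
	}
	if err := os.WriteFile(bin+".new", []byte("v2"), 0o755); err != nil {
		return "", "", err
	}
	if err := swapBinary(bin); err != nil {
		return "", "", err
	}
	cur, err := os.ReadFile(bin)
	if err != nil {
		return "", "", err
	}
	old, err := os.ReadFile(bin + ".old")
	if err != nil {
		return "", "", err
	}
	return string(cur), string(old), nil
}

func main() {
	cur, old, err := demo()
	fmt.Println(cur, old, err)
}
```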
## Fleet update
The admin-only **Settings → Fleet update** page drives a rolling
update across every host in the fleet:
- One host at a time.
- Wait for hello-with-target-version (max 95s).
- On any host failing, **halt** the rollout, raise a
`fleet_update_halted` alert, leave the rest of the fleet on
the old version. No surprise mass-failures.
You can cancel an in-progress fleet update; the worker stops
after the current host finishes.
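
The rollout rules above amount to a sequential loop that halts on the first failure and checks for cancellation between hosts. A sketch under assumptions: `rollingUpdate`, `Result`, and `errNoHello` are hypothetical names, and `updateAndWait` stands in for "dispatch the update and wait for the hello at the target version".

```go
package main

import (
	"errors"
	"fmt"
)

// errNoHello is a placeholder failure for a host that never came back.
var errNoHello = errors.New("agent did not reconnect at target version")

// Result summarises how far the rollout got.
type Result struct {
	Updated   []string
	HaltedAt  string // host that failed; empty if none
	Cancelled bool
}

// rollingUpdate updates one host at a time. A failure halts the rollout and
// leaves the remaining hosts untouched. A closed stop channel ends the
// rollout before the next host starts, never in the middle of one.
func rollingUpdate(hosts []string, stop <-chan struct{}, updateAndWait func(host string) error) (Result, error) {
	var res Result
	for _, h := range hosts {
		select {
		case <-stop:
			res.Cancelled = true
			return res, nil
		default:
		}
		if err := updateAndWait(h); err != nil {
			res.HaltedAt = h
			return res, fmt.Errorf("fleet update halted at %s: %w", h, err)
		}
		res.Updated = append(res.Updated, h)
	}
	return res, nil
}

func main() {
	hosts := []string{"alpha", "bravo", "charlie"}
	res, err := rollingUpdate(hosts, nil, func(h string) error {
		if h == "bravo" {
			return errNoHello
		}
		return nil
	})
	fmt.Println(res.Updated, res.HaltedAt, err)
}
```

In a real worker the returned error is where a `fleet_update_halted` alert would be raised.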
## TLS and corruption
Updates rely on the reverse proxy's TLS to detect corruption in
transit. There's no separate sha256 verification step — we
chose the simpler model on the basis that the same TLS already
gates every other byte the server hands to the agent.
If you'd like a separate signature step before applying updates,
that's a future-phase enhancement (see `tasks.md` Phase 6
candidates).