P5: OSS readiness — docs site, contributor onboarding, e2e harness
P5-01 — Documentation site under docs/book/ rendered with mdBook
(downloaded via Makefile, same static-binary pattern as Tailwind).
Structured chapters: getting started, concepts, operations,
security, reference. `make docs` / `make docs-watch`. Generated
output gitignored.
P5-02 — CONTRIBUTING.md rewritten from placeholder to a full
guide. CODE_OF_CONDUCT.md adapted from Contributor Covenant for a
single-maintainer project. Added .gitea/issue_template/{bug,feature}.md
and PULL_REQUEST_TEMPLATE.md.
P5-04 — Six README screenshots captured live from a fresh server
bootstrap (login, empty dashboard, add-host, alerts, settings,
audit log). README rewritten to centre the screenshot grid and
link out to the docs site.
P5-05 — SECURITY.md with disclosure policy (3-day ack, 30-day
default window), scope in/out, threat-model summary, operator
hardening checklist. Mirrored as a docs-site chapter.
P5-06 — End-to-end test harness. e2e/compose.e2e.yml brings up
server + sibling Linux agent (alpine + restic) + restic/rest-server.
Agent uses announce-and-approve so Playwright can drive the full
operator flow: bootstrap → login → accept pending → backup →
verify terminal status. Second spec scrapes /metrics to assert
the P6-04 endpoint surface. .gitea/workflows/e2e.yml runs on every
PR; local how-to in docs/e2e.md.
@@ -0,0 +1,61 @@
# Observability with Prometheus

restic-manager can expose a Prometheus scrape endpoint at
`GET /metrics`. The endpoint is **opt-in**: without an explicit
auth gate it isn't even mounted, so a forgotten config can't
accidentally publish fleet state.

The full reference lives at
[`docs/prometheus.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/docs/prometheus.md);
the short version follows.

## Enable the endpoint

Set at least one of:

- `RM_METRICS_TOKEN`: `Authorization: Bearer <token>` is required.
- `RM_METRICS_TRUSTED_CIDR`: restricts source IPs to a comma-separated
  list of CIDRs.

When both are set, both checks must pass. The token is compared in
constant time; the CIDR check
honours `X-Forwarded-For` only when the immediate hop matches
`RM_TRUSTED_PROXY`.
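
The gate described above can be sketched in a few lines of Python. This is an illustration only, not the server's actual code: the function names (`parse_cidrs`, `client_ip`, `metrics_allowed`) are hypothetical, and two details are assumptions, namely that `RM_TRUSTED_PROXY` holds CIDRs in the same comma-separated form and that the right-most `X-Forwarded-For` entry is the one taken.

```python
import hmac
import ipaddress


def parse_cidrs(value):
    """Turn a comma-separated CIDR string into a list of network objects."""
    return [ipaddress.ip_network(part.strip()) for part in value.split(",") if part.strip()]


def client_ip(peer_ip, forwarded_for, trusted_proxies):
    """Pick the address to test against the CIDR allow-list.

    X-Forwarded-For is consulted only when the immediate hop (peer_ip)
    is inside one of the trusted proxy networks.
    """
    peer = ipaddress.ip_address(peer_ip)
    if forwarded_for and any(peer in net for net in trusted_proxies):
        # Assumption: use the right-most entry, the one the trusted proxy appended.
        return ipaddress.ip_address(forwarded_for.split(",")[-1].strip())
    return peer


def metrics_allowed(auth_header, peer_ip, forwarded_for, *,
                    token=None, cidrs=(), trusted_proxies=()):
    """Return True when the request passes every configured check."""
    if token is None and not cidrs:
        return False  # no gate configured: the endpoint would not be mounted
    if token is not None:
        expected = f"Bearer {token}".encode()
        # hmac.compare_digest runs in time independent of where the inputs differ.
        if not hmac.compare_digest((auth_header or "").encode(), expected):
            return False
    if cidrs:
        ip = client_ip(peer_ip, forwarded_for, trusted_proxies)
        if not any(ip in net for net in cidrs):
            return False
    return True
```

With `token="s3cret"` and `cidrs=parse_cidrs("10.0.0.0/8")`, a request from `10.1.2.3` carrying the right header passes, while the same header from `8.8.8.8` is rejected because both checks are ANDed.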

## Metrics emitted

- **Server gauges**: `rm_hosts_total`, `rm_hosts_online`,
  `rm_active_alerts{severity}`, `rm_build_info{...}`.
- **Per-host gauges**: `rm_host_agent_online`,
  `rm_host_last_backup_timestamp_seconds`,
  `rm_host_last_backup_success`, `rm_host_repo_size_bytes`,
  `rm_host_snapshot_count`, `rm_host_open_alerts`,
  `rm_host_repo_status`.
- **Histogram**:
  `rm_job_duration_seconds{kind,status,le=…}` (buckets
  `1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf`).

The histogram is held in memory only. Prometheus persists the scrapes; if
you need durable history at hourly resolution, that's
Prometheus's job.
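
To make the bucket layout concrete, here is a minimal sketch of an in-memory histogram using the bucket bounds listed above. The class name `JobDurationHistogram` is hypothetical and the `kind` / `status` labels are left out for brevity (one instance per label pair would be needed); it shows the general technique, not the server's implementation.

```python
import bisect
import math

# Upper bounds in seconds, as listed above; the last bucket catches everything.
BUCKETS = [1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, math.inf]


class JobDurationHistogram:
    def __init__(self, bounds=BUCKETS):
        self.bounds = list(bounds)
        self.counts = [0] * len(self.bounds)  # per-bucket, not yet cumulative
        self.total_seconds = 0.0

    def observe(self, seconds):
        # `le` is inclusive: a value equal to a bound lands in that bucket.
        idx = bisect.bisect_left(self.bounds, seconds)
        self.counts[idx] += 1
        self.total_seconds += seconds

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, the shape Prometheus scrapes."""
        running, out = 0, []
        for bound, n in zip(self.bounds, self.counts):
            running += n
            out.append((bound, running))
        return out
```

Because the state lives only in these counters, nothing survives beyond the process; the scraped samples in Prometheus are the durable record.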

## Sample Grafana dashboard

[`deploy/grafana/restic-manager-dashboard.json`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/deploy/grafana/restic-manager-dashboard.json)
can be imported through Grafana's **+ → Import → Upload JSON file**.
It has six panels:

1. Fleet status (online / total).
2. Open alerts by severity.
3. Backups failing on most-recent run.
4. Hosts table: last backup, repo size, snapshots, open alerts.
5. Repo size over time, one line per host.
6. Job-duration p95 over a 1h window per kind.
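
The exact query behind the p95 panel is in the dashboard JSON and is not reproduced here. As background, a percentile read from a bucketed histogram is an estimate: the usual approach (the one PromQL's `histogram_quantile` takes) finds the bucket containing the target rank and interpolates linearly inside it. The sketch below shows that technique on cumulative `(upper_bound, count)` pairs; the function name and sample numbers are illustrative only.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by bound and end with a +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-2][0]
```

With the bucket bounds used by `rm_job_duration_seconds`, a p95 that falls between 300 and 1800 seconds is interpolated across that whole range, so treat the panel as a trend indicator rather than an exact timing.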

## Alerting

restic-manager already has a built-in alert engine
([Alerts](./alerts.md)). The dashboard intentionally doesn't
duplicate it as Prometheus alert rules. If you want
Prometheus-side alerts on top, write your own based on the
metrics above, for example `rm_host_last_backup_success == 0` or
`time() - rm_host_last_backup_timestamp_seconds > <max age>`,
or whatever suits your environment.
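
To see what those two conditions select, the sketch below applies them to scraped text in the Prometheus exposition format. It is a client-side illustration, not a replacement for real alert rules. The label name `host` is an assumption (the per-host label is not named on this page), and `parse_samples` / `failing_hosts` are hypothetical helpers.

```python
import re
import time

# One sample line: metric name, optional {labels}, then the value.
SAMPLE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)')
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Parse exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        m = SAMPLE.match(line)
        if m:
            labels = dict(LABEL.findall(m.group("labels") or ""))
            samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples


def failing_hosts(samples, max_age_seconds, now=None):
    """Apply the two example conditions and return {host: [reasons]}."""
    now = time.time() if now is None else now
    problems = {}
    for name, labels, value in samples:
        host = labels.get("host", "?")  # assumed label name
        if name == "rm_host_last_backup_success" and value == 0:
            problems.setdefault(host, []).append("last backup failed")
        elif name == "rm_host_last_backup_timestamp_seconds" and now - value > max_age_seconds:
            problems.setdefault(host, []).append("backup too old")
    return problems
```

Calling `failing_hosts(parse_samples(body), max_age_seconds=86400)` on a scrape body lists every host whose last backup failed or is more than a day old.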