restic-manager/docs/book/src/operations/observability.md
steve 89537d417a P5: OSS readiness — docs site, contributor onboarding, e2e harness
P5-01 — Documentation site under docs/book/ rendered with mdBook
(downloaded via Makefile, same static-binary pattern as Tailwind).
Structured chapters: getting started, concepts, operations,
security, reference. `make docs` / `make docs-watch`. Generated
output gitignored.

P5-02 — CONTRIBUTING.md rewritten from placeholder to a full
guide. CODE_OF_CONDUCT.md adapted from Contributor Covenant for a
single-maintainer project. .gitea/issue_template/{bug,feature}.md
and PULL_REQUEST_TEMPLATE.md.

P5-04 — Six README screenshots captured live from a fresh server
bootstrap (login, empty dashboard, add-host, alerts, settings,
audit log). README rewritten to centre the screenshot grid and
link out to the docs site.

P5-05 — SECURITY.md with disclosure policy (3-day ack, 30-day
default window), scope in/out, threat-model summary, operator
hardening checklist. Mirrored as a docs-site chapter.

P5-06 — End-to-end test harness. e2e/compose.e2e.yml brings up
server + sibling Linux agent (alpine + restic) + restic/rest-server.
Agent uses announce-and-approve so Playwright can drive the full
operator flow: bootstrap → login → accept pending → backup →
verify terminal status. Second spec scrapes /metrics to assert
the P6-04 endpoint surface. .gitea/workflows/e2e.yml runs on every
PR; local how-to in docs/e2e.md.
2026-05-08 20:08:23 +01:00


Observability with Prometheus

restic-manager can expose a Prometheus scrape endpoint at GET /metrics. The endpoint is opt-in — without an explicit auth gate it isn't even mounted, so a forgotten config can't accidentally publish fleet state.

The full reference lives at docs/prometheus.md; the short version follows.

Enable the endpoint

Set at least one of:

  • RM_METRICS_TOKEN: requires Authorization: Bearer <token> on every scrape.
  • RM_METRICS_TRUSTED_CIDR: restricts source IPs to a comma-separated list of CIDR ranges.

When both are set, a request must satisfy both. The token comparison is constant-time. The CIDR check honours X-Forwarded-For only when the immediate hop matches RM_TRUSTED_PROXY.
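The gate described above can be sketched in a few lines of Python. This is an illustration of the rules, not the server's actual code: the function names are hypothetical, RM_TRUSTED_PROXY is assumed to hold a single IP or CIDR, and taking the right-most X-Forwarded-For entry is an assumption about how the forwarded address is chosen.

```python
import hmac
import ipaddress
from typing import Optional


def parse_cidrs(value: str) -> list:
    """Parse a comma-separated CIDR list such as '10.0.0.0/8, 192.168.1.0/24'."""
    return [ipaddress.ip_network(p.strip()) for p in value.split(",") if p.strip()]


def client_ip(peer_ip: str, forwarded_for: Optional[str], trusted_proxy: Optional[str]) -> str:
    """Use X-Forwarded-For only when the immediate hop is the trusted proxy."""
    if forwarded_for and trusted_proxy:
        if ipaddress.ip_address(peer_ip) in ipaddress.ip_network(trusted_proxy):
            # Assumption: the right-most entry is the one the trusted proxy appended.
            return forwarded_for.split(",")[-1].strip()
    return peer_ip


def is_scrape_allowed(auth_header: Optional[str], peer_ip: str,
                      forwarded_for: Optional[str] = None, *,
                      token: Optional[str] = None,
                      trusted_cidr: Optional[str] = None,
                      trusted_proxy: Optional[str] = None) -> bool:
    # Neither gate configured: the endpoint would not be mounted at all.
    if not token and not trusted_cidr:
        return False
    if token:
        expected = f"Bearer {token}".encode()
        supplied = (auth_header or "").encode()
        # hmac.compare_digest runs in constant time for equal-length inputs.
        if not hmac.compare_digest(supplied, expected):
            return False
    if trusted_cidr:
        try:
            ip = ipaddress.ip_address(client_ip(peer_ip, forwarded_for, trusted_proxy))
        except ValueError:
            return False  # unparseable address: reject rather than guess
        if not any(ip in net for net in parse_cidrs(trusted_cidr)):
            return False
    return True
```

With both settings present, a request passes only if the header and the source address both check out; a forwarded address from an untrusted hop is ignored and the peer address is used instead.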

Metrics emitted

  • Server gauges: rm_hosts_total, rm_hosts_online, rm_active_alerts{severity}, rm_build_info{...}.
  • Per-host gauges: rm_host_agent_online, rm_host_last_backup_timestamp_seconds, rm_host_last_backup_success, rm_host_repo_size_bytes, rm_host_snapshot_count, rm_host_open_alerts, rm_host_repo_status.
  • Histogram: rm_job_duration_seconds{kind,status,le=…} (buckets 1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf).

The histogram lives in memory only; restic-manager does not persist it. Prometheus stores the scraped samples, so durable history (for example at hourly resolution) is Prometheus's job.
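To make the bucket layout concrete, here is a minimal sketch of an in-memory histogram using the bucket bounds listed above. The class and method names are hypothetical and the real server may be structured quite differently; the point is only how an observation lands in a `le` bucket and how the cumulative counts a scrape would see are derived.

```python
import bisect

# Upper bounds in seconds, as listed above; +Inf is the implicit final bucket.
BUCKETS = [1, 5, 30, 60, 300, 1800, 3600, 21600, 86400]


class JobDurationHistogram:
    """Hypothetical in-memory histogram; its state is lost when the process exits."""

    def __init__(self):
        self.counts = [0] * (len(BUCKETS) + 1)  # last slot counts the +Inf bucket
        self.total_seconds = 0.0

    def observe(self, seconds: float) -> None:
        # bisect_left finds the first bound >= seconds, matching "le" semantics.
        self.counts[bisect.bisect_left(BUCKETS, seconds)] += 1
        self.total_seconds += seconds

    def cumulative(self) -> list:
        """Return (le_label, cumulative_count) pairs in exposition order."""
        labels = [str(b) for b in BUCKETS] + ["+Inf"]
        running, out = 0, []
        for label, count in zip(labels, self.counts):
            running += count
            out.append((label, running))
        return out
```

A job that took exactly 5 seconds counts toward `le="5"` and every larger bucket, while one that ran longer than a day only shows up under `+Inf`.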

Sample Grafana dashboard

deploy/grafana/restic-manager-dashboard.json imports through Grafana's + → Import → Upload JSON file. Six panels:

  1. Fleet status (online / total).
  2. Open alerts by severity.
  3. Backups failing on most-recent run.
  4. Hosts table — last backup, repo size, snapshots, open alerts.
  5. Repo size over time, one line per host.
  6. Job-duration p95 over a 1h window per kind.
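For readers wondering how a p95 is derived from bucket counts (panel 6), the sketch below estimates a quantile by linear interpolation inside the bucket that contains the target rank, which is the general approach Prometheus's `histogram_quantile` takes. The function name is hypothetical and the dashboard's actual query is not reproduced here.

```python
import math


def estimate_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by bound and end with float('inf').
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound  # only reachable when rank is 0
            # Assume observations are spread evenly within the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan
```

Because of the interpolation, the estimate is only as precise as the bucket it falls in: with the wide upper buckets (6 h to 24 h), a p95 in that range is a rough figure rather than an exact duration.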

Alerting

restic-manager already has a built-in alert engine (Alerts). The dashboard intentionally doesn't duplicate it as Prometheus alert rules. If you want Prometheus-side alerts on top, write your own based on the metrics above — rm_host_last_backup_success == 0, time() - rm_host_last_backup_timestamp_seconds > <max age>, or whatever suits your environment.
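As a rough illustration of those two conditions outside Prometheus, the following sketch applies them to scraped exposition text. It assumes each per-host sample carries a single `host` label, which is a guess about the label name; the function name and the simplified parser are likewise hypothetical and would need adjusting for the real label set.

```python
import re
import time
from typing import Optional

# Matches lines like: rm_host_last_backup_success{host="web1"} 0
# Assumption: a single label named "host"; real samples may carry other labels.
SAMPLE = re.compile(r'^(\w+)\{host="([^"]*)"\}\s+(\S+)$')


def failing_hosts(exposition_text: str, max_age_seconds: float,
                  now: Optional[float] = None) -> dict:
    """Return {host: [reasons]} for hosts whose last backup failed or is too old."""
    now = time.time() if now is None else now
    problems = {}
    for line in exposition_text.splitlines():
        match = SAMPLE.match(line.strip())
        if not match:
            continue  # comments, blank lines, and differently labelled samples
        name, host, value = match.group(1), match.group(2), float(match.group(3))
        if name == "rm_host_last_backup_success" and value == 0:
            problems.setdefault(host, []).append("last backup failed")
        elif name == "rm_host_last_backup_timestamp_seconds" and now - value > max_age_seconds:
            problems.setdefault(host, []).append("last backup too old")
    return problems
```

In a real deployment the same two expressions would more naturally live in Prometheus alert rules; the Python version is only meant to show what each condition tests.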