# Prometheus + Grafana

restic-manager exposes a Prometheus scrape endpoint at `GET /metrics`. The endpoint is **opt-in**: it is not mounted at all unless you set at least one of the auth gates below. Once enabled, it serves the standard `text/plain` exposition format, which Prometheus 2.x and later parse without extra configuration.

A sample Grafana dashboard lives at `deploy/grafana/restic-manager-dashboard.json`.

## Enable the endpoint

Two switches, both off by default. If both are set, both must pass (token AND source-IP); if only one is set, that gate alone authorises a scrape.

| Env var                   | YAML key                       | Effect |
|---------------------------|--------------------------------|--------|
| `RM_METRICS_TOKEN`        | `metrics_token`                | Requires `Authorization: Bearer <token>`. Compared in constant time. |
| `RM_METRICS_TRUSTED_CIDR` | `metrics_trusted_cidrs` (list) | Restricts the source IP to one of the listed CIDRs. Comma-separated in env, list in YAML. Honours `X-Forwarded-For` only when the immediate hop matches `RM_TRUSTED_PROXY`. |

When neither is set, `GET /metrics` returns 404: the route is not registered with the chi router, so a forgotten config can't accidentally publish fleet state.

### Example: Docker

```yaml
services:
  restic-manager:
    image: gitea.dcglab.co.uk/steve/restic-manager:latest
    environment:
      # Interpolated by Compose from the shell environment or a .env file.
      RM_METRICS_TOKEN: "${RM_METRICS_TOKEN}"
      RM_METRICS_TRUSTED_CIDR: "10.0.0.0/8"
```

`RM_METRICS_TOKEN_FILE` (reading the token from a Docker secret such as `/run/secrets/rm_metrics_token`) is not currently supported, so set `RM_METRICS_TOKEN` directly as shown. The `_FILE` convention is on the roadmap.
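
The gating rules under "Enable the endpoint" can be modelled as a small decision function. This is a minimal sketch, not the server's implementation: the function name `metrics_gate` is hypothetical, and the 401 and 403 status codes for a failed token or source-IP check are assumptions (the text above only specifies the 404 when neither gate is configured). `X-Forwarded-For` handling is left out.

```python
import hmac
import ipaddress


def metrics_gate(token_cfg, cidrs_cfg, auth_header, source_ip):
    """Return the HTTP status a scrape would receive.

    token_cfg:   configured token or None (RM_METRICS_TOKEN)
    cidrs_cfg:   list of CIDR strings, may be empty (RM_METRICS_TRUSTED_CIDR)
    auth_header: value of the Authorization header, or None
    source_ip:   client address as a string
    """
    if not token_cfg and not cidrs_cfg:
        return 404  # neither gate set: the route is not registered

    if token_cfg:
        expected = f"Bearer {token_cfg}".encode()
        presented = (auth_header or "").encode()
        # Constant-time comparison, as the table above describes.
        if not hmac.compare_digest(presented, expected):
            return 401  # assumed status code

    if cidrs_cfg:
        addr = ipaddress.ip_address(source_ip)
        nets = [ipaddress.ip_network(c, strict=False) for c in cidrs_cfg]
        if not any(addr in net for net in nets):
            return 403  # assumed status code

    return 200  # every configured gate passed
```

With both gates configured, a correct token from outside `10.0.0.0/8` is still rejected, which is the "both must pass" behaviour.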
## Prometheus scrape config

Drop into your `prometheus.yml`:

```yaml
scrape_configs:
  - job_name: restic-manager
    metrics_path: /metrics
    scheme: https           # via your reverse proxy
    static_configs:
      - targets: ['restic.example.com']
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/secrets/rm_metrics_token
```

If you don't run a TLS-terminating proxy in front, drop `scheme: https` (the server is HTTP-only; see `docs/reverse-proxy.md`).

## Metric reference

All names are `rm_`-prefixed. Per-host metrics carry a `host_id` label (the stable ULID, immune to renames) and a `host` label (the human-readable name).

### Server gauges

| Name              | Labels                                 | Description |
|-------------------|----------------------------------------|-------------|
| `rm_hosts_total`  | none                                   | Total number of enrolled hosts (excludes pending announces). |
| `rm_hosts_online` | none                                   | Number of hosts with `status='online'`. |
| `rm_active_alerts`| `severity` ∈ {info, warning, critical} | Open alerts by severity. |
| `rm_build_info`   | `version`, `commit`, `go_version`      | Always 1; pure label-bag for joining. |

### Per-host gauges

| Name                                    | Description |
|-----------------------------------------|-------------|
| `rm_host_agent_online`                  | 1 if the agent is currently online, 0 otherwise. |
| `rm_host_last_backup_timestamp_seconds` | Unix timestamp of the host's most recent backup. **Omitted** for hosts with no backup yet. |
| `rm_host_last_backup_success`           | 1 if the most recent backup succeeded, 0 otherwise. **Omitted** for hosts with no backup yet. |
| `rm_host_repo_size_bytes`               | Latest reported repo size from `restic stats --mode raw-data`. **Omitted** when unknown. |
| `rm_host_snapshot_count`                | Number of restic snapshots known on the host's repo. |
| `rm_host_open_alerts`                   | Number of currently open alerts attached to this host. |
| `rm_host_repo_status`                   | Always 1; the `status` label carries `unknown` / `ready` / `init_failed`. |

### Job duration histogram

```
rm_job_duration_seconds_bucket{kind, status, le}
rm_job_duration_seconds_sum{kind, status}
rm_job_duration_seconds_count{kind, status}
```

`kind` ∈ {backup, forget, prune, check, unlock, restore, diff, init, update}.
`status` ∈ {succeeded, failed, cancelled}.

Buckets (seconds):

```
1,  5,  30,  60,  300,  1800,  3600,  21600,  86400,  +Inf
1s  5s  30s  1m   5m    30m    1h     6h      24h
```

The histogram is in-memory only, so values reset on process restart. Operators who want durable history should let Prometheus persist the scrapes; restic-manager itself is a control plane, not a metrics database.

## Grafana dashboard

Import `deploy/grafana/restic-manager-dashboard.json`:

1. In Grafana, **+ → Import → Upload JSON file**.
2. Pick the Prometheus data source you scrape with.
3. The dashboard's six panels populate from the metrics above:
   * **Fleet status**: online/total stat panel.
   * **Open alerts**: by severity.
   * **Hosts**: per-host table (last backup, repo size, snapshots, alerts).
   * **Repo size over time**: one line per host.
   * **Backups failing**: count of hosts whose last backup didn't succeed.
   * **Job duration p95**: `histogram_quantile(0.95, …)` over a 1h window per kind.

Alerting is intentionally not configured in the dashboard: the control plane already has alerts (P3-05) with native channels for webhook, ntfy, and SMTP. Re-implementing them in Prometheus would just duplicate state. If you do want Prom-side alerts, copy the recording rules into your usual location.

## Cardinality

Per scrape: O(hosts) gauge rows + O(kinds × statuses × buckets) histogram rows. A 100-host fleet emits roughly 700 host rows (7 per-host gauges × 100 hosts) + 270 histogram bucket rows (9 kinds × 3 statuses × 10 buckets), well below any practical limit. There are no `job_id` labels (cardinality bomb avoidance) and no per-source-group labels.
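
The cardinality figures can be reproduced from the metric reference with a few lines of arithmetic. The sketch below is an upper-bound estimate under stated assumptions: every host emits all seven per-host gauges (the **Omitted** cases lower the count), `rm_host_repo_status` contributes one series per host, and every `kind` × `status` combination has been observed at least once. The function name `series_estimate` is hypothetical.

```python
KINDS = ["backup", "forget", "prune", "check", "unlock",
         "restore", "diff", "init", "update"]
STATUSES = ["succeeded", "failed", "cancelled"]
# Bucket upper bounds in seconds, including the implicit +Inf bucket.
BUCKETS = [1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, float("inf")]
PER_HOST_GAUGES = 7  # rows in the per-host gauge table


def series_estimate(hosts):
    """Return (host_rows, bucket_rows, sum_count_rows) for one scrape."""
    host_rows = hosts * PER_HOST_GAUGES
    combos = len(KINDS) * len(STATUSES)
    bucket_rows = combos * len(BUCKETS)  # *_bucket series
    sum_count_rows = combos * 2          # one *_sum and one *_count per combo
    return host_rows, bucket_rows, sum_count_rows


# series_estimate(100) returns (700, 270, 54)
```

The 270 figure counts only the `_bucket` series; the `_sum` and `_count` series add another 54 rows, and the server gauges add a handful more. Only the host term grows with fleet size.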