New internal/server/metrics package emits the legacy text/plain exposition format directly, so we don't pull in prometheus/client_golang. Endpoint is opt-in via RM_METRICS_TOKEN and/or RM_METRICS_TRUSTED_CIDR; route is not mounted at all if neither gate is set. Both gates ANDed when both configured. Per-host gauges (online, last_backup_*, repo_size_bytes, snapshot_count, open_alerts, repo_status), server gauges (hosts_total/online, active_alerts by severity, build_info), and an in-memory job-duration histogram observed from the existing MsgJobFinished branch in the WS handler. Docs in docs/prometheus.md (enable + scrape config + metric reference + dashboard import). Sample dashboard at deploy/grafana/restic-manager-dashboard.json - six panels, Grafana schema 39, single Prometheus datasource variable. Tests: golden render, concurrent observe, bucket boundaries in the metrics package; auth matrix (no auth -> 404, token gate, CIDR gate, both required) in the HTTP layer.
Prometheus + Grafana
restic-manager exposes a Prometheus scrape endpoint at GET /metrics.
The endpoint is opt-in — it is not mounted at all unless you set
at least one of the auth gates below. Once enabled, it serves the
standard text/plain exposition format that every Prometheus
release since 2.x parses without configuration.
A sample Grafana dashboard lives at
deploy/grafana/restic-manager-dashboard.json.
Enable the endpoint
Two switches, both off by default. If both are set, both must pass (token AND source-IP); if only one is set, that gate alone authorises a scrape.
| Env var | YAML key | Effect |
|---|---|---|
| RM_METRICS_TOKEN | metrics_token | Requires Authorization: Bearer <token>. Compared in constant time. |
| RM_METRICS_TRUSTED_CIDR | metrics_trusted_cidrs (list) | Restricts the source IP to one of the listed CIDRs. Comma-separated in env, list in YAML. Honours X-Forwarded-For only when the immediate hop matches RM_TRUSTED_PROXY. |
When neither is set, GET /metrics returns 404 — the route is not
registered with the chi router so a forgotten config can't
accidentally publish fleet state.
Example: Docker
services:
  restic-manager:
    image: gitea.dcglab.co.uk/steve/restic-manager:latest
    environment:
      RM_METRICS_TOKEN_FILE: /run/secrets/rm_metrics_token
      RM_METRICS_TRUSTED_CIDR: "10.0.0.0/8"
    secrets:
      - rm_metrics_token
(The example above shows the intended form, but RM_METRICS_TOKEN_FILE
is not currently supported. Until the _FILE convention, which is on
the roadmap, lands, set RM_METRICS_TOKEN directly.)
Prometheus scrape config
Drop into your prometheus.yml:
scrape_configs:
  - job_name: restic-manager
    metrics_path: /metrics
    scheme: https # via your reverse proxy
    static_configs:
      - targets: ['restic.example.com']
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/secrets/rm_metrics_token
If you don't run a TLS-terminating proxy in front, drop scheme: https (the server is HTTP-only — see docs/reverse-proxy.md).
Metric reference
All names are rm_-prefixed. Per-host metrics carry a host_id
label (the stable ULID, immune to renames) and a host label
(the human-readable name).
Server gauges
| Name | Labels | Description |
|---|---|---|
| rm_hosts_total | none | Total number of enrolled hosts (excludes pending announces). |
| rm_hosts_online | none | Number of hosts with status='online'. |
| rm_active_alerts | severity ∈ {info, warning, critical} | Open alerts by severity. |
| rm_build_info | version, commit, go_version | Always 1; pure label-bag for joining. |
Per-host gauges
| Name | Description |
|---|---|
| rm_host_agent_online | 1 if the agent is currently online, 0 otherwise. |
| rm_host_last_backup_timestamp_seconds | Unix timestamp of the host's most recent backup. Omitted for hosts with no backup yet. |
| rm_host_last_backup_success | 1 if the most recent backup succeeded, 0 otherwise. Omitted for hosts with no backup yet. |
| rm_host_repo_size_bytes | Latest reported repo size from restic stats --mode raw-data. Omitted when unknown. |
| rm_host_snapshot_count | Number of restic snapshots known on the host's repo. |
| rm_host_open_alerts | Number of currently open alerts attached to this host. |
| rm_host_repo_status | Always 1; the status label carries unknown / ready / init_failed. |
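For readers curious what a per-host row looks like on the wire, here is a small Go sketch that renders one sample line in the text exposition format. The helper name `hostGaugeLine` and the example ULID are hypothetical. The escaping rules (backslash, double quote, newline inside label values) are those of the Prometheus text format, not something specific to restic-manager.

```go
package main

import (
	"fmt"
	"strings"
)

// labelEscaper applies the text-format escaping for label values:
// backslash, double quote and line feed must be backslash-escaped.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// hostGaugeLine renders one sample for a per-host gauge, carrying the
// host_id and host labels described above. Hypothetical helper.
func hostGaugeLine(name, hostID, host string, value float64) string {
	return fmt.Sprintf("%s{host_id=\"%s\",host=\"%s\"} %g",
		name, labelEscaper.Replace(hostID), labelEscaper.Replace(host), value)
}
```

Calling `hostGaugeLine("rm_host_agent_online", "01HX", "nas", 1)` yields `rm_host_agent_online{host_id="01HX",host="nas"} 1`. Because `host` is user-chosen, escaping matters: a host name containing a quote would otherwise break the scrape.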
Job duration histogram
rm_job_duration_seconds_bucket{kind, status, le}
rm_job_duration_seconds_sum{kind, status}
rm_job_duration_seconds_count{kind, status}
kind ∈ {backup, forget, prune, check, unlock, restore, diff, init, update}.
status ∈ {succeeded, failed, cancelled}.
Buckets (seconds):
1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf
1s 5s 30s 1m 5m 30m 1h 6h 24h
The histogram is in-memory only — values reset on process restart. Operators who want durable history should let Prometheus persist the scrapes; restic-manager itself is a control plane, not a metrics database.
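The bucket semantics can be illustrated with a short Go sketch. It assumes one histogram value per (kind, status) pair and a plain mutex for concurrency; the type and function names are hypothetical and the actual implementation may differ. Buckets are cumulative, as the `le` label requires: an observation increments every bucket whose upper bound is greater than or equal to it.

```go
package main

import (
	"math"
	"sync"
)

// bounds are the upper bounds listed above, in seconds, ending in +Inf.
var bounds = []float64{1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, math.Inf(1)}

// histogram is a hypothetical in-memory histogram for one (kind, status) pair.
type histogram struct {
	mu     sync.Mutex
	counts []uint64 // cumulative: counts[i] = observations <= bounds[i]
	sum    float64  // backs the _sum row
	count  uint64   // backs the _count row
}

func newHistogram() *histogram {
	return &histogram{counts: make([]uint64, len(bounds))}
}

// observe records one job duration; safe for concurrent use.
func (h *histogram) observe(seconds float64) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for i, ub := range bounds {
		if seconds <= ub { // boundaries are inclusive, matching "le"
			h.counts[i]++
		}
	}
	h.sum += seconds
	h.count++
}
```

A job that takes exactly 1 second lands in the `le="1"` bucket and every larger one; a 25-hour job lands only in `+Inf`.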
Grafana dashboard
Import deploy/grafana/restic-manager-dashboard.json:
- In Grafana, + → Import → Upload JSON file.
- Pick the Prometheus data source you scrape with.
- The dashboard's six panels populate from the metrics above:
  - Fleet status: online/total stat panel.
  - Open alerts: by severity.
  - Hosts: per-host table (last backup, repo size, snapshots, alerts).
  - Repo size over time: one line per host.
  - Backups failing: count of hosts whose last backup didn't succeed.
  - Job duration p95: histogram_quantile(0.95, …) over a 1h window per kind.
Alerting is intentionally not configured in the dashboard — the control plane already has alerts (P3-05) with native channels for webhook, ntfy, and SMTP. Re-implementing them in Prometheus would just duplicate state. If you do want Prom-side alerts, copy the recording rules into your usual location.
Cardinality
Per scrape: O(hosts) gauge rows + O(kinds × statuses × buckets)
histogram rows. A 100-host fleet emits roughly 700 host rows
(7 per-host gauges × 100 hosts) and at most 270 histogram bucket rows
(9 kinds × 3 statuses × 10 buckets), plus up to 54 _sum/_count rows,
which is well below any practical limit. There are no
job_id labels (cardinality bomb avoidance) and no per-source-group
labels.
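The estimate can be reproduced with a few lines of Go. This is back-of-envelope arithmetic only: the function name is hypothetical, and it gives an upper bound, since some per-host gauges are omitted when unknown and it assumes every kind/status combination has been observed.

```go
package main

// rowsPerScrape returns an upper bound on sample rows per scrape.
// Hypothetical helper; inputs come from the metric reference above.
func rowsPerScrape(hosts, perHostGauges, kinds, statuses, buckets int) (hostRows, bucketRows, sumCountRows int) {
	hostRows = hosts * perHostGauges
	series := kinds * statuses   // one histogram per (kind, status)
	bucketRows = series * buckets // _bucket rows, including +Inf
	sumCountRows = series * 2     // one _sum and one _count per series
	return
}
```

With 100 hosts, 7 per-host gauges, 9 kinds, 3 statuses and 10 buckets, this gives 700 host rows, 270 bucket rows and 54 _sum/_count rows.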