steve ccd14f7cee P6-04+05: Prometheus /metrics endpoint + Grafana dashboard

New internal/server/metrics package emits the legacy text/plain
exposition format directly, so we don't pull in
prometheus/client_golang. Endpoint is opt-in via RM_METRICS_TOKEN
and/or RM_METRICS_TRUSTED_CIDR; route is not mounted at all if
neither gate is set. Both gates ANDed when both configured.

Per-host gauges (online, last_backup_*, repo_size_bytes,
snapshot_count, open_alerts, repo_status), server gauges
(hosts_total/online, active_alerts by severity, build_info), and
an in-memory job-duration histogram observed from the existing
MsgJobFinished branch in the WS handler.

Docs in docs/prometheus.md (enable + scrape config + metric
reference + dashboard import). Sample dashboard at
deploy/grafana/restic-manager-dashboard.json - six panels,
Grafana schema 39, single Prometheus datasource variable.

Tests: golden render, concurrent observe, bucket boundaries in
the metrics package; auth matrix (no auth -> 404, token gate,
CIDR gate, both required) in the HTTP layer.

2026-05-07 23:17:15 +01:00


Prometheus + Grafana

restic-manager exposes a Prometheus scrape endpoint at GET /metrics. The endpoint is opt-in: it is not mounted at all unless you set at least one of the auth gates below. Once enabled, it serves the standard Prometheus text exposition format (text/plain), which Prometheus 2.x and later parse without extra configuration.

A sample Grafana dashboard lives at deploy/grafana/restic-manager-dashboard.json.

Enable the endpoint

Two switches, both off by default. If both are set, both must pass (token AND source-IP); if only one is set, that gate alone authorises a scrape.

Env var	YAML key	Effect
`RM_METRICS_TOKEN`	`metrics_token`	Requires `Authorization: Bearer <token>`. Compared in constant time.
`RM_METRICS_TRUSTED_CIDR`	`metrics_trusted_cidrs` (list)	Restricts the source IP to one of the listed CIDRs. Comma-separated in env, list in YAML. Honours `X-Forwarded-For` only when the immediate hop matches `RM_TRUSTED_PROXY`.
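
The gate semantics in the table can be sketched in Go roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the `gates` type, `newGates`, and `allow` are hypothetical names, and proxy handling (`X-Forwarded-For`, `RM_TRUSTED_PROXY`) is deliberately left out.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/netip"
	"strings"
)

// gates is a hypothetical holder for the two optional switches.
// An empty token or an empty prefix list means that gate is not configured.
type gates struct {
	token string
	cidrs []netip.Prefix
}

// newGates parses the comma-separated CIDR form used by the env var.
func newGates(token, cidrCSV string) (gates, error) {
	g := gates{token: token}
	for _, part := range strings.Split(cidrCSV, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		p, err := netip.ParsePrefix(part)
		if err != nil {
			return gates{}, fmt.Errorf("bad CIDR %q: %w", part, err)
		}
		g.cidrs = append(g.cidrs, p)
	}
	return g, nil
}

// allow applies every configured gate; all configured gates must pass.
// authHeader is the raw Authorization header, remoteIP the peer address.
func (g gates) allow(authHeader, remoteIP string) bool {
	if g.token == "" && len(g.cidrs) == 0 {
		return false // nothing configured: the route would not even exist
	}
	if g.token != "" {
		got, ok := strings.CutPrefix(authHeader, "Bearer ")
		// ConstantTimeCompare returns 1 only when both byte slices are equal.
		if !ok || subtle.ConstantTimeCompare([]byte(got), []byte(g.token)) != 1 {
			return false
		}
	}
	if len(g.cidrs) > 0 {
		addr, err := netip.ParseAddr(remoteIP)
		if err != nil {
			return false
		}
		inside := false
		for _, p := range g.cidrs {
			if p.Contains(addr) {
				inside = true
				break
			}
		}
		if !inside {
			return false
		}
	}
	return true
}

func main() {
	g, err := newGates("s3cret", "10.0.0.0/8")
	if err != nil {
		panic(err)
	}
	fmt.Println(g.allow("Bearer s3cret", "10.1.2.3"))   // token and source IP both pass
	fmt.Println(g.allow("Bearer s3cret", "192.0.2.10")) // token passes, source IP does not
}
```

Checking the token before the source IP is an arbitrary choice in this sketch; since both gates must pass, the order does not change the outcome.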

When neither is set, GET /metrics returns 404 — the route is not registered with the chi router so a forgotten config can't accidentally publish fleet state.
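
The "not registered at all" behaviour can be illustrated with a small Go sketch. It uses the standard library's `http.ServeMux` rather than chi so that it stays self-contained; `config`, `newMux`, `stubMetrics`, and `statusFor` are hypothetical names introduced for the example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// config is a hypothetical subset of the server settings.
type config struct {
	metricsToken string
	trustedCIDRs []string
}

// newMux registers /metrics only when at least one gate is configured,
// so an empty config leaves the path unknown to the router.
func newMux(cfg config, metrics http.Handler) *http.ServeMux {
	mux := http.NewServeMux()
	if cfg.metricsToken != "" || len(cfg.trustedCIDRs) > 0 {
		mux.Handle("/metrics", metrics) // auth middleware would wrap this handler
	}
	return mux
}

// stubMetrics stands in for the real exposition handler.
func stubMetrics() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		fmt.Fprintln(w, "rm_hosts_total 0")
	})
}

// statusFor issues an in-process GET and returns the response status code.
func statusFor(mux *http.ServeMux, path string) int {
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	fmt.Println(statusFor(newMux(config{}, stubMetrics()), "/metrics"))
	fmt.Println(statusFor(newMux(config{metricsToken: "s3cret"}, stubMetrics()), "/metrics"))
}
```

Deciding at registration time, instead of returning 404 from inside a handler, means there is no code path that could serve metrics when both gates are unset.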

Example: Docker

services:
  restic-manager:
    image: gitea.dcglab.co.uk/steve/restic-manager:latest
    environment:
      RM_METRICS_TOKEN: "${RM_METRICS_TOKEN}"   # taken from the shell or an .env file
      RM_METRICS_TRUSTED_CIDR: "10.0.0.0/8"

(RM_METRICS_TOKEN_FILE is not currently supported, so set RM_METRICS_TOKEN directly rather than mounting a Docker secret. The _FILE convention is on the roadmap.)

Prometheus scrape config

Drop into your prometheus.yml:

scrape_configs:
  - job_name: restic-manager
    metrics_path: /metrics
    scheme: https            # via your reverse proxy
    static_configs:
      - targets: ['restic.example.com']
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/secrets/rm_metrics_token

If you don't run a TLS-terminating proxy in front, drop scheme: https (the server is HTTP-only — see docs/reverse-proxy.md).

Metric reference

All names are rm_-prefixed. Per-host metrics carry a host_id label (the stable ULID, immune to renames) and a host label (the human-readable name).

Server gauges

Name	Labels	Description
`rm_hosts_total`	—	Total number of enrolled hosts (excludes pending announces).
`rm_hosts_online`	—	Number of hosts with `status='online'`.
`rm_active_alerts`	`severity` ∈ {info, warning, critical}	Open alerts by severity.
`rm_build_info`	`version, commit, go_version`	Always 1; pure label-bag for joining.

Per-host gauges

Name	Description
`rm_host_agent_online`	1 if the agent is currently online, 0 otherwise.
`rm_host_last_backup_timestamp_seconds`	Unix timestamp of the host's most recent backup. Omitted for hosts with no backup yet.
`rm_host_last_backup_success`	1 if the most recent backup succeeded, 0 otherwise. Omitted for hosts with no backup yet.
`rm_host_repo_size_bytes`	Latest reported repo size from `restic stats --mode raw-data`. Omitted when unknown.
`rm_host_snapshot_count`	Number of restic snapshots known on the host's repo.
`rm_host_open_alerts`	Number of currently open alerts attached to this host.
`rm_host_repo_status`	Always 1; the `status` label carries `unknown` / `ready` / `init_failed`.
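
For orientation, this is roughly what rendering one per-host gauge family in the text exposition format involves. It is a minimal sketch, not the code in internal/server/metrics: `sample`, `labelEscaper`, and `renderGauge` are hypothetical names, and the help text is assumed to contain no backslashes or newlines.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sample is one labelled value of a per-host gauge family.
type sample struct {
	hostID, host string
	value        float64
}

// labelEscaper escapes backslash, double quote and newline, which the
// text exposition format requires inside label values.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// renderGauge writes the HELP and TYPE comment lines followed by one
// line per sample, each carrying the host_id and host labels.
func renderGauge(name, help string, samples []sample) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", name, help)
	fmt.Fprintf(&b, "# TYPE %s gauge\n", name)
	for _, s := range samples {
		fmt.Fprintf(&b, "%s{host_id=\"%s\",host=\"%s\"} %s\n",
			name,
			labelEscaper.Replace(s.hostID),
			labelEscaper.Replace(s.host),
			strconv.FormatFloat(s.value, 'g', -1, 64))
	}
	return b.String()
}

func main() {
	// The host_id value here is a made-up placeholder, not a real ULID.
	fmt.Print(renderGauge(
		"rm_host_agent_online",
		"1 if the agent is currently online, 0 otherwise.",
		[]sample{{hostID: "01HX", host: "nas", value: 1}},
	))
}
```

Metrics described as "omitted" in the table would simply contribute no sample line for that host; the HELP and TYPE lines are per family, not per host.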

Job duration histogram

rm_job_duration_seconds_bucket{kind, status, le}
rm_job_duration_seconds_sum{kind, status}
rm_job_duration_seconds_count{kind, status}

kind ∈ {backup, forget, prune, check, unlock, restore, diff, init, update}. status ∈ {succeeded, failed, cancelled}.

Buckets (seconds):

1      5      30     60     300    1800   3600   21600  86400  +Inf
1s     5s     30s    1m     5m     30m    1h     6h     24h

The histogram is in-memory only — values reset on process restart. Operators who want durable history should let Prometheus persist the scrapes; restic-manager itself is a control plane, not a metrics database.
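
An in-memory histogram with these buckets can be sketched as below. This is an assumption-laden illustration rather than the package's real implementation: `histogram`, `observe`, and `cumulative` are hypothetical names, and a single mutex is just one simple way to make concurrent observations safe.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"sync"
)

// bounds are the finite upper bounds listed above; +Inf is the implicit last slot.
var bounds = []float64{1, 5, 30, 60, 300, 1800, 3600, 21600, 86400}

type seriesKey struct{ kind, status string }

type series struct {
	perBucket []uint64 // non-cumulative counts, len(bounds)+1 slots
	sum       float64
	count     uint64
}

type histogram struct {
	mu   sync.Mutex
	data map[seriesKey]*series
}

func newHistogram() *histogram {
	return &histogram{data: map[seriesKey]*series{}}
}

// observe records one job duration under its kind/status pair.
func (h *histogram) observe(kind, status string, seconds float64) {
	h.mu.Lock()
	defer h.mu.Unlock()
	k := seriesKey{kind, status}
	s := h.data[k]
	if s == nil {
		s = &series{perBucket: make([]uint64, len(bounds)+1)}
		h.data[k] = s
	}
	// First bound >= seconds, so a value equal to a bound lands in that
	// bucket (le is inclusive); values above every bound land in +Inf.
	s.perBucket[sort.SearchFloat64s(bounds, seconds)]++
	s.sum += seconds
	s.count++
}

// cumulative returns the le-style running totals for one series, or nil.
func (h *histogram) cumulative(kind, status string) []uint64 {
	h.mu.Lock()
	defer h.mu.Unlock()
	s := h.data[seriesKey{kind, status}]
	if s == nil {
		return nil
	}
	out := make([]uint64, len(s.perBucket))
	var running uint64
	for i, n := range s.perBucket {
		running += n
		out[i] = running
	}
	return out
}

func main() {
	h := newHistogram()
	for _, d := range []float64{0.5, 5, 5.1, 100000} {
		h.observe("backup", "succeeded", d)
	}
	for i, n := range h.cumulative("backup", "succeeded") {
		le := "+Inf"
		if i < len(bounds) {
			le = strconv.FormatFloat(bounds[i], 'g', -1, 64)
		}
		fmt.Printf("rm_job_duration_seconds_bucket{kind=%q,status=%q,le=%q} %d\n",
			"backup", "succeeded", le, n)
	}
}
```

Storing per-bucket counts and accumulating them only at render time keeps `observe` to a single increment; because everything lives in a map in process memory, a restart starts again from zero, as described above.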

Grafana dashboard

Import deploy/grafana/restic-manager-dashboard.json:

1. In Grafana, + → Import → Upload JSON file.
2. Pick the Prometheus data source you scrape with.
3. The dashboard's six panels populate from the metrics above:
- Fleet status — online/total stat panel.
- Open alerts — by severity.
- Hosts — per-host table (last backup, repo size, snapshots, alerts).
- Repo size over time — one line per host.
- Backups failing — count of hosts whose last backup didn't succeed.
- Job duration p95 — histogram_quantile(0.95, …) over a 1h window per kind.
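
Because the p95 panel is estimated from bucket counts rather than raw durations, it helps to see how such an estimate is formed. The Go sketch below mirrors the linear interpolation that `histogram_quantile` performs for classic histograms; it is a simplified illustration (it assumes 0 < q ≤ 1 and non-negative bounds), and `bucket`, `fromCumulative`, and `quantile` are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
)

// bucket pairs an upper bound (le) with its cumulative count.
type bucket struct {
	upper      float64
	cumulative float64
}

// fromCumulative zips finite bounds with cumulative counts; the final
// count, which has no matching bound, becomes the +Inf bucket.
func fromCumulative(bounds, cumulative []float64) []bucket {
	out := make([]bucket, 0, len(cumulative))
	for i, c := range cumulative {
		upper := math.Inf(1)
		if i < len(bounds) {
			upper = bounds[i]
		}
		out = append(out, bucket{upper: upper, cumulative: c})
	}
	return out
}

// quantile estimates the q-quantile by interpolating linearly inside the
// first bucket whose cumulative count reaches q * total.
func quantile(q float64, buckets []bucket) float64 {
	if q <= 0 || q > 1 || len(buckets) < 2 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].cumulative
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	i := 0
	for i < len(buckets)-1 && buckets[i].cumulative < rank {
		i++
	}
	if i == len(buckets)-1 {
		// The rank falls in +Inf: report the highest finite bound instead.
		return buckets[len(buckets)-2].upper
	}
	lower, below := 0.0, 0.0
	if i > 0 {
		lower, below = buckets[i-1].upper, buckets[i-1].cumulative
	}
	inBucket := buckets[i].cumulative - below
	return lower + (buckets[i].upper-lower)*(rank-below)/inBucket
}

func main() {
	b := fromCumulative([]float64{1, 5, 30}, []float64{10, 60, 100, 100})
	fmt.Printf("p95 estimate: %.3f seconds\n", quantile(0.95, b))
}
```

One practical consequence: the estimate can never be finer than the bucket layout, so with the wide buckets above a p95 of "somewhere between 30m and 1h" is reported as an interpolated point inside that range.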

Alerting is intentionally not configured in the dashboard: the control plane already has alerts (P3-05) with native channels for webhook, ntfy, and SMTP, and re-implementing them in Prometheus would just duplicate state. If you do want Prometheus-side alerts, write rules against the metrics above and keep them wherever your other Prometheus rule files live.

Cardinality

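Per scrape: O(hosts) gauge rows + O(kinds × statuses × buckets) histogram rows. A 100-host fleet emits at most 700 per-host rows (7 gauges × 100 hosts) and at most 270 histogram bucket rows (9 kinds × 3 statuses × 10 buckets), plus one _sum and one _count row per kind/status pair, which is well below any practical limit. There are no job_id labels (cardinality bomb avoidance) and no per-source-group labels.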
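
To estimate the row count for a different fleet size, the arithmetic can be written out as a small Go sketch. The constant and function names are hypothetical; the figures come from the metric reference (seven per-host gauges, nine kinds, three statuses, ten buckets including +Inf), and the results are upper bounds because some per-host gauges are omitted when unknown.

```go
package main

import "fmt"

// perHostGauges is the number of metrics in the per-host gauge table.
const perHostGauges = 7

// hostRows is the upper bound on per-host gauge rows for a fleet.
func hostRows(hosts int) int {
	return perHostGauges * hosts
}

// histogramRows returns the upper bound on _bucket rows, and on the
// combined _sum and _count rows, for the job duration histogram.
func histogramRows(kinds, statuses, buckets int) (bucketRows, sumCountRows int) {
	pairs := kinds * statuses
	return pairs * buckets, pairs * 2
}

func main() {
	b, sc := histogramRows(9, 3, 10)
	fmt.Printf("host rows: %d, bucket rows: %d, sum+count rows: %d\n", hostRows(100), b, sc)
}
```

Only the host term grows with fleet size; the histogram term is fixed by the kind, status, and bucket sets.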
