restic-manager/docs/prometheus.md
steve ccd14f7cee P6-04+05: Prometheus /metrics endpoint + Grafana dashboard
New internal/server/metrics package emits the legacy text/plain
exposition format directly, so we don't pull in
prometheus/client_golang. Endpoint is opt-in via RM_METRICS_TOKEN
and/or RM_METRICS_TRUSTED_CIDR; route is not mounted at all if
neither gate is set. Both gates ANDed when both configured.

Per-host gauges (online, last_backup_*, repo_size_bytes,
snapshot_count, open_alerts, repo_status), server gauges
(hosts_total/online, active_alerts by severity, build_info), and
an in-memory job-duration histogram observed from the existing
MsgJobFinished branch in the WS handler.

Docs in docs/prometheus.md (enable + scrape config + metric
reference + dashboard import). Sample dashboard at
deploy/grafana/restic-manager-dashboard.json - six panels,
Grafana schema 39, single Prometheus datasource variable.

Tests: golden render, concurrent observe, bucket boundaries in
the metrics package; auth matrix (no auth -> 404, token gate,
CIDR gate, both required) in the HTTP layer.
2026-05-07 23:17:15 +01:00

# Prometheus + Grafana
restic-manager exposes a Prometheus scrape endpoint at `GET /metrics`.
The endpoint is **opt-in** — it is not mounted at all unless you set
at least one of the auth gates below. Once enabled, it serves the
standard `text/plain` exposition format, which Prometheus 2.x and
later parse without extra configuration.
A sample Grafana dashboard lives at
`deploy/grafana/restic-manager-dashboard.json`.
## Enable the endpoint
Two switches, both off by default. If both are set, both must pass
(token AND source-IP); if only one is set, that gate alone
authorises a scrape.

| Env var | YAML key | Effect |
|----------------------------|------------------------|--------|
| `RM_METRICS_TOKEN` | `metrics_token` | Requires `Authorization: Bearer <token>`. Compared in constant time. |
| `RM_METRICS_TRUSTED_CIDR` | `metrics_trusted_cidrs` (list) | Restricts the source IP to one of the listed CIDRs. Comma-separated in env, list in YAML. Honours `X-Forwarded-For` only when the immediate hop matches `RM_TRUSTED_PROXY`. |
When neither is set, `GET /metrics` returns 404 — the route is not
registered with the chi router so a forgotten config can't
accidentally publish fleet state.
### Example: Docker
```yaml
services:
  restic-manager:
    image: gitea.dcglab.co.uk/steve/restic-manager:latest
    environment:
      RM_METRICS_TOKEN_FILE: /run/secrets/rm_metrics_token
      RM_METRICS_TRUSTED_CIDR: "10.0.0.0/8"
    secrets:
      - rm_metrics_token
```
Note: the example shows the intended `_FILE` form, but
`RM_METRICS_TOKEN_FILE` is not currently supported. Until the `_FILE`
convention lands (it is on the roadmap), set `RM_METRICS_TOKEN`
directly.
## Prometheus scrape config
Drop into your `prometheus.yml`:
```yaml
scrape_configs:
  - job_name: restic-manager
    metrics_path: /metrics
    scheme: https  # via your reverse proxy
    static_configs:
      - targets: ['restic.example.com']
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/secrets/rm_metrics_token
```
If you don't run a TLS-terminating proxy in front, drop `scheme: https`
so Prometheus falls back to its default of `http` (the server is
HTTP-only; see `docs/reverse-proxy.md`).
## Metric reference
All names are `rm_`-prefixed. Per-host metrics carry a `host_id`
label (the stable ULID, immune to renames) and a `host` label
(the human-readable name).
### Server gauges
| Name | Labels | Description |
|-----------------------|------------------------------------|-------------|
| `rm_hosts_total` | — | Total number of enrolled hosts (excludes pending announces). |
| `rm_hosts_online` | — | Number of hosts with `status='online'`. |
| `rm_active_alerts` | `severity` ∈ {info, warning, critical} | Open alerts by severity. |
| `rm_build_info` | `version, commit, go_version` | Always 1; pure label-bag for joining. |
### Per-host gauges
| Name | Description |
|--------------------------------------------|-------------|
| `rm_host_agent_online` | 1 if the agent is currently online, 0 otherwise. |
| `rm_host_last_backup_timestamp_seconds` | Unix timestamp of the host's most recent backup. **Omitted** for hosts with no backup yet. |
| `rm_host_last_backup_success` | 1 if the most recent backup succeeded, 0 otherwise. **Omitted** for hosts with no backup yet. |
| `rm_host_repo_size_bytes` | Latest reported repo size from `restic stats --mode raw-data`. **Omitted** when unknown. |
| `rm_host_snapshot_count` | Number of restic snapshots known on the host's repo. |
| `rm_host_open_alerts` | Number of currently open alerts attached to this host. |
| `rm_host_repo_status` | Always 1; the `status` label carries `unknown` / `ready` / `init_failed`. |
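
For orientation, this is roughly what one per-host sample looks like on the wire. The sketch is an assumption-laden illustration rather than the server's renderer: `gaugeLine` is a hypothetical helper, the label order is a guess, and the host ID and name are placeholders. The escaping rules (backslash, double quote, newline) are those of the text exposition format.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// escapeLabel applies the text exposition format's label-value escaping:
// backslash, double quote, and newline.
var escapeLabel = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// gaugeLine renders one per-host gauge sample with the two documented
// labels. Label order here is an assumption.
func gaugeLine(name, hostID, host string, value float64) string {
	return fmt.Sprintf(`%s{host_id="%s",host="%s"} %s`,
		name,
		escapeLabel.Replace(hostID),
		escapeLabel.Replace(host),
		strconv.FormatFloat(value, 'f', -1, 64),
	)
}

func main() {
	// Placeholder identifiers; a real host_id is a ULID.
	fmt.Println(gaugeLine("rm_host_agent_online", "example-host-id", "nas-01", 1))
	fmt.Println(gaugeLine("rm_host_snapshot_count", "example-host-id", `nas "attic"`, 42))
}
```

Because `host_id` survives renames, dashboards and alert rules that need a stable key should join on it and treat `host` as display text.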
### Job duration histogram
```
rm_job_duration_seconds_bucket{kind, status, le}
rm_job_duration_seconds_sum{kind, status}
rm_job_duration_seconds_count{kind, status}
```
`kind` ∈ {backup, forget, prune, check, unlock, restore, diff, init, update}.
`status` ∈ {succeeded, failed, cancelled}.
Buckets (seconds):
```
1,  5,  30,  60,  300,  1800,  3600,  21600,  86400,  +Inf
1s  5s  30s  1m   5m    30m    1h     6h      24h
```
The histogram is in-memory only — values reset on process restart.
Operators who want durable history should let Prometheus persist
the scrapes; restic-manager itself is a control plane, not a
metrics database.
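
As a rough picture of how such an in-memory histogram behaves, the sketch below records observations into cumulative buckets using the documented bounds. It is not the code in `internal/server/metrics`: the `histogram` type and its methods are hypothetical, and the inclusive boundary (a 60 s job counts in the `le="60"` bucket) follows the usual Prometheus `le` convention, assumed to apply here.

```go
package main

import (
	"fmt"
	"sync"
)

// bounds mirrors the documented upper bounds in seconds. The +Inf bucket
// is implicit: it always equals the total observation count.
var bounds = []float64{1, 5, 30, 60, 300, 1800, 3600, 21600, 86400}

// histogram is a hypothetical in-memory histogram for one (kind, status)
// pair. All state is lost when the process exits.
type histogram struct {
	mu     sync.Mutex
	counts []uint64 // counts[i] = observations with value <= bounds[i]
	sum    float64
	count  uint64
}

func newHistogram() *histogram {
	return &histogram{counts: make([]uint64, len(bounds))}
}

// observe records one job duration. Buckets are cumulative, so every
// bucket whose bound is >= the value is incremented.
func (h *histogram) observe(seconds float64) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for i, le := range bounds {
		if seconds <= le {
			h.counts[i]++
		}
	}
	h.sum += seconds
	h.count++
}

func main() {
	h := newHistogram()
	for _, d := range []float64{0.4, 60, 61, 90000} {
		h.observe(d)
	}
	for i, le := range bounds {
		fmt.Printf("le=%g %d\n", le, h.counts[i])
	}
	fmt.Printf("le=+Inf %d sum=%g\n", h.count, h.sum)
}
```

A job longer than 24 h, such as the 90000 s observation above, only shows up in `+Inf`, `_sum`, and `_count`.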
## Grafana dashboard
Import `deploy/grafana/restic-manager-dashboard.json`:
1. In Grafana, **+ → Import → Upload JSON file**.
2. Pick the Prometheus data source you scrape with.
3. The dashboard's six panels populate from the metrics above:
   * **Fleet status**: online/total stat panel.
   * **Open alerts**: by severity.
   * **Hosts**: per-host table (last backup, repo size, snapshots, alerts).
   * **Repo size over time**: one line per host.
   * **Backups failing**: count of hosts whose last backup didn't succeed.
   * **Job duration p95**: `histogram_quantile(0.95, …)` over a 1h window per kind.
Alerting is intentionally not configured in the dashboard — the
control plane already has alerts (P3-05) with native channels for
webhook, ntfy, and SMTP. Re-implementing them in Prometheus would
just duplicate state. If you do want Prom-side alerts, copy the
recording rules into your usual location.
## Cardinality
Per scrape: O(hosts) gauge rows + O(kinds × statuses × buckets)
histogram rows. A 100-host fleet emits roughly 700 host rows (7
per-host gauges × 100 hosts) and up to 270 histogram bucket rows
(9 kinds × 3 statuses × 10 buckets), plus up to 54 `_sum` and `_count`
rows, well below any practical limit. There are no `job_id` labels
(cardinality bomb avoidance) and no per-source-group labels.
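
The arithmetic can be reproduced for other fleet sizes with a few lines of Go. The counts come from the metric reference in this document. `rowsPerScrape` is a hypothetical helper, and the histogram figures are upper bounds that assume every kind/status combination has been observed.

```go
package main

import "fmt"

// rowsPerScrape returns upper bounds on the rows emitted per scrape for
// a given fleet size, using the counts from the metric reference.
func rowsPerScrape(hosts int) (hostRows, bucketRows, sumCountRows int) {
	const (
		perHostGauges = 7  // rows in the per-host gauge table
		kinds         = 9  // backup, forget, prune, check, unlock, restore, diff, init, update
		statuses      = 3  // succeeded, failed, cancelled
		buckets       = 10 // nine finite bounds plus +Inf
	)
	hostRows = hosts * perHostGauges
	bucketRows = kinds * statuses * buckets
	sumCountRows = kinds * statuses * 2 // one _sum and one _count per pair
	return hostRows, bucketRows, sumCountRows
}

func main() {
	h, b, s := rowsPerScrape(100)
	fmt.Println(h, b, s) // 700 270 54
}
```

Only the host term grows with fleet size; the histogram contribution is fixed by the label sets.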