P6-04+05: Prometheus /metrics endpoint + Grafana dashboard
New internal/server/metrics package emits the legacy text/plain exposition format directly, so we don't pull in prometheus/client_golang.

Endpoint is opt-in via RM_METRICS_TOKEN and/or RM_METRICS_TRUSTED_CIDR; route is not mounted at all if neither gate is set. Both gates ANDed when both configured.

Per-host gauges (online, last_backup_*, repo_size_bytes, snapshot_count, open_alerts, repo_status), server gauges (hosts_total/online, active_alerts by severity, build_info), and an in-memory job-duration histogram observed from the existing MsgJobFinished branch in the WS handler.

Docs in docs/prometheus.md (enable + scrape config + metric reference + dashboard import). Sample dashboard at deploy/grafana/restic-manager-dashboard.json: six panels, Grafana schema 39, single Prometheus datasource variable.

Tests: golden render, concurrent observe, bucket boundaries in the metrics package; auth matrix (no auth -> 404, token gate, CIDR gate, both required) in the HTTP layer.
@@ -0,0 +1,139 @@
# Prometheus + Grafana

restic-manager exposes a Prometheus scrape endpoint at `GET /metrics`.
The endpoint is **opt-in**: it is not mounted at all unless you set
at least one of the auth gates below. Once enabled, it serves the
standard `text/plain` exposition format that every Prometheus
release since 2.x parses without configuration.

A sample Grafana dashboard lives at
`deploy/grafana/restic-manager-dashboard.json`.

## Enable the endpoint

Two switches, both off by default. If both are set, both must pass
(token AND source-IP); if only one is set, that gate alone
authorises a scrape.

| Env var                   | YAML key                       | Effect |
|---------------------------|--------------------------------|--------|
| `RM_METRICS_TOKEN`        | `metrics_token`                | Requires `Authorization: Bearer <token>`. Compared in constant time. |
| `RM_METRICS_TRUSTED_CIDR` | `metrics_trusted_cidrs` (list) | Restricts the source IP to one of the listed CIDRs. Comma-separated in env, list in YAML. Honours `X-Forwarded-For` only when the immediate hop matches `RM_TRUSTED_PROXY`. |

When neither is set, `GET /metrics` returns 404: the route is not
registered with the chi router, so a forgotten config can't
accidentally publish fleet state.

### Example: Docker

```yaml
services:
  restic-manager:
    image: gitea.dcglab.co.uk/steve/restic-manager:latest
    environment:
      # Set the token directly, e.g. from the shell or an .env file.
      RM_METRICS_TOKEN: "${RM_METRICS_TOKEN}"
      RM_METRICS_TRUSTED_CIDR: "10.0.0.0/8"
```

`RM_METRICS_TOKEN_FILE` is not currently supported, so the token
cannot yet be read from a Docker secret file; set `RM_METRICS_TOKEN`
directly. The `_FILE` convention is on the roadmap.

## Prometheus scrape config

Drop into your `prometheus.yml`:

```yaml
scrape_configs:
  - job_name: restic-manager
    metrics_path: /metrics
    scheme: https           # via your reverse proxy
    static_configs:
      - targets: ['restic.example.com']
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/secrets/rm_metrics_token
```

If you don't run a TLS-terminating proxy in front, drop `scheme: https`
(the server is HTTP-only; see `docs/reverse-proxy.md`). Prometheus
then falls back to its default scheme, `http`.

## Metric reference

All names are `rm_`-prefixed. Per-host metrics carry a `host_id`
label (the stable ULID, immune to renames) and a `host` label
(the human-readable name).

### Server gauges

| Name               | Labels                                 | Description |
|--------------------|----------------------------------------|-------------|
| `rm_hosts_total`   | none                                   | Total number of enrolled hosts (excludes pending announces). |
| `rm_hosts_online`  | none                                   | Number of hosts with `status='online'`. |
| `rm_active_alerts` | `severity` ∈ {info, warning, critical} | Open alerts by severity. |
| `rm_build_info`    | `version`, `commit`, `go_version`      | Always 1; pure label-bag for joining. |

### Per-host gauges

| Name                                    | Description |
|-----------------------------------------|-------------|
| `rm_host_agent_online`                  | 1 if the agent is currently online, 0 otherwise. |
| `rm_host_last_backup_timestamp_seconds` | Unix timestamp of the host's most recent backup. **Omitted** for hosts with no backup yet. |
| `rm_host_last_backup_success`           | 1 if the most recent backup succeeded, 0 otherwise. **Omitted** for hosts with no backup yet. |
| `rm_host_repo_size_bytes`               | Latest reported repo size from `restic stats --mode raw-data`. **Omitted** when unknown. |
| `rm_host_snapshot_count`                | Number of restic snapshots known on the host's repo. |
| `rm_host_open_alerts`                   | Number of currently open alerts attached to this host. |
| `rm_host_repo_status`                   | Always 1; the `status` label carries `unknown` / `ready` / `init_failed`. |

### Job duration histogram

```
rm_job_duration_seconds_bucket{kind, status, le}
rm_job_duration_seconds_sum{kind, status}
rm_job_duration_seconds_count{kind, status}
```

`kind` ∈ {backup, forget, prune, check, unlock, restore, diff, init, update}.
`status` ∈ {succeeded, failed, cancelled}.

Buckets (seconds):

```
1,  5,  30,  60,  300,  1800,  3600,  21600,  86400,  +Inf
1s  5s  30s  1m   5m    30m    1h     6h      24h
```

The histogram is in-memory only: values reset on process restart.
Operators who want durable history should let Prometheus persist
the scrapes; restic-manager itself is a control plane, not a
metrics database.
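
For readers unfamiliar with how such a histogram behaves at bucket boundaries, the following Go sketch shows one way to implement it. It is an assumption-laden illustration, not the project's implementation: the `histogram` type, `observe`, and `cumulative` are hypothetical names, and one instance stands for a single `kind`/`status` pair.

```go
package main

import (
	"fmt"
	"sync"
)

// Upper bounds in seconds, matching the bucket list above; +Inf is implicit.
var bounds = []float64{1, 5, 30, 60, 300, 1800, 3600, 21600, 86400}

// histogram is a hypothetical in-memory histogram for one (kind, status) pair.
type histogram struct {
	mu     sync.Mutex
	counts []uint64 // per-bucket, non-cumulative; the last slot is +Inf
	sum    float64
	count  uint64
}

func newHistogram() *histogram {
	return &histogram{counts: make([]uint64, len(bounds)+1)}
}

// observe records one duration. Prometheus buckets are inclusive
// (le means "less than or equal"), so exactly 5s lands in le="5".
func (h *histogram) observe(seconds float64) {
	i := 0
	for i < len(bounds) && seconds > bounds[i] {
		i++
	}
	h.mu.Lock()
	defer h.mu.Unlock()
	h.counts[i]++
	h.sum += seconds
	h.count++
}

// cumulative returns the running totals that the _bucket series expose.
func (h *histogram) cumulative() []uint64 {
	h.mu.Lock()
	defer h.mu.Unlock()
	out := make([]uint64, len(h.counts))
	var running uint64
	for i, c := range h.counts {
		running += c
		out[i] = running
	}
	return out
}

func main() {
	h := newHistogram()
	for _, d := range []float64{0.4, 5, 5.1, 7200} {
		h.observe(d)
	}
	fmt.Println(h.cumulative()) // [1 2 3 3 3 3 3 4 4 4]
}
```

The mutex keeps concurrent observations safe, and the cumulative view is computed only at render time so that `observe` stays cheap.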

## Grafana dashboard

Import `deploy/grafana/restic-manager-dashboard.json`:

1. In Grafana, **+ → Import → Upload JSON file**.
2. Pick the Prometheus data source you scrape with.
3. The dashboard's six panels populate from the metrics above:
   * **Fleet status**: online/total stat panel.
   * **Open alerts**: by severity.
   * **Hosts**: per-host table (last backup, repo size, snapshots, alerts).
   * **Repo size over time**: one line per host.
   * **Backups failing**: count of hosts whose last backup didn't succeed.
   * **Job duration p95**: `histogram_quantile(0.95, …)` over a 1h window per kind.

Alerting is intentionally not configured in the dashboard: the
control plane already has alerts (P3-05) with native channels for
webhook, ntfy, and SMTP. Re-implementing them in Prometheus would
just duplicate state. If you do want Prom-side alerts, copy the
recording rules into your usual location.

## Cardinality

Per scrape: O(hosts) gauge rows + O(kinds × statuses × buckets)
histogram rows. A 100-host fleet emits roughly 700 host rows
(7 per-host gauges × 100 hosts) and up to 270 histogram bucket rows
(9 kinds × 3 statuses × 10 buckets), plus a `_sum` and a `_count` row
per kind/status pair, which is well below any practical limit. There
are no `job_id` labels (cardinality bomb avoidance) and no
per-source-group labels.