# P6-04 + P6-05 — Prometheus `/metrics` + Grafana dashboard
Date: 2026-05-07
Author: Claude (autonomous, sensible-defaults brief from operator)
Tasks: P6-04 (M), P6-05 (S)
## Problem
The control plane already knows everything a backup operator needs
to monitor — last-backup timestamp + status, repo size, snapshot
count, agent online, open alerts, build version — but it surfaces
those only through the dashboard HTML and a few JSON endpoints. To
plug into the operator's existing observability stack we need a
plain Prometheus exposition endpoint and a Grafana dashboard JSON
that reads from it.
## Goals
- `GET /metrics` emits standard Prometheus text-format with the
per-host, server, and job-duration metrics enumerated in the
task entry (P6-04 in `tasks.md`).
- Endpoint is opt-in and gated by a bearer token and/or an IP
allow-list — never publicly readable by default.
- No new third-party dependency (`prometheus/client_golang` is not
pulled in). The exposition format is small and stable enough to
emit by hand; matches the repo's "no Tailwind/Node" style.
- Sample Grafana dashboard committed to the repo so a stranger can
drop it into a Grafana instance and get a working view.
## Non-goals
- OpenMetrics. The legacy text format with `# HELP`/`# TYPE` is
what every Prometheus server still parses and what every example
online demonstrates, so we pick the boring option.
- Pushgateway or remote-write integration.
- Per-job metric cardinality (no `job_id` labels — that would
make the histogram explode).
- Alerting rules. Operators already have alerts inside
restic-manager (P3-05); duplicating them in Prometheus is a
YAGNI hazard. The dashboard is read-only.
## Auth
Two switches, both off by default. If neither is set the route
isn't mounted at all (404 from the chi router) — this avoids any
accidental "wide-open scrape endpoint" deployment.
| env var | type | meaning |
| --- | --- | --- |
| `RM_METRICS_TOKEN` | string | If set, callers must send `Authorization: Bearer <token>`. Compared with `crypto/subtle.ConstantTimeCompare`. |
| `RM_METRICS_TRUSTED_CIDR` | comma-CIDR | If set, callers must hit from a source IP inside one of these CIDRs. Reuses the existing `RM_TRUSTED_PROXY` semantics for honouring `X-Forwarded-For`. |

If both are set, both must pass (AND, not OR): a token leak doesn't grant access from outside the network, and a trusted-network compromise doesn't grant access without the token. If only one is set, that one alone gates access.
YAML overlay mirrors env: `metrics_token`, `metrics_trusted_cidrs`.
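
The gating rules above can be sketched as a small pure function. This is a minimal illustration, not the actual `authoriseMetricsScrape`: the `metricsAuth` type, `parseMetricsAuth`, and the string-based `allow` signature are hypothetical, and the client IP is assumed to have been resolved already (including any `X-Forwarded-For` handling).

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/netip"
	"strings"
)

// metricsAuth is a hypothetical holder for the two switches.
type metricsAuth struct {
	Token   string         // RM_METRICS_TOKEN; empty means no token gate
	Trusted []netip.Prefix // RM_METRICS_TRUSTED_CIDR; empty means no CIDR gate
}

// parseMetricsAuth builds the config from the raw env values.
func parseMetricsAuth(token, cidrs string) (metricsAuth, error) {
	a := metricsAuth{Token: token}
	for _, s := range strings.Split(cidrs, ",") {
		s = strings.TrimSpace(s)
		if s == "" {
			continue
		}
		p, err := netip.ParsePrefix(s)
		if err != nil {
			return metricsAuth{}, fmt.Errorf("bad CIDR %q: %w", s, err)
		}
		a.Trusted = append(a.Trusted, p)
	}
	return a, nil
}

// enabled reports whether at least one gate is configured.
func (a metricsAuth) enabled() bool { return a.Token != "" || len(a.Trusted) > 0 }

// allow applies every configured gate; all configured gates must pass (AND).
func (a metricsAuth) allow(authHeader, srcIP string) bool {
	if !a.enabled() {
		return false // the route would not even be mounted in this case
	}
	if a.Token != "" {
		got, ok := strings.CutPrefix(authHeader, "Bearer ")
		if !ok || subtle.ConstantTimeCompare([]byte(got), []byte(a.Token)) != 1 {
			return false
		}
	}
	if len(a.Trusted) > 0 {
		addr, err := netip.ParseAddr(srcIP)
		if err != nil {
			return false
		}
		inside := false
		for _, p := range a.Trusted {
			if p.Contains(addr) {
				inside = true
				break
			}
		}
		if !inside {
			return false
		}
	}
	return true
}

func main() {
	a, err := parseMetricsAuth("s3cret", "10.0.0.0/8, 192.168.0.0/16")
	if err != nil {
		panic(err)
	}
	fmt.Println(a.allow("Bearer s3cret", "10.1.2.3"))    // true: both gates pass
	fmt.Println(a.allow("Bearer s3cret", "203.0.113.9")) // false: outside the CIDRs
	fmt.Println(a.allow("Bearer wrong", "10.1.2.3"))     // false: bad token
}
```

Keeping the check free of `*http.Request` in this sketch is only for brevity; the real helper is specified as a function over request + config.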
## Metrics
All metric names are prefixed `rm_`. Help text is concise.
### Per-host gauges (one row per `host_id`)
```
rm_host_agent_online{host_id,host} 1 if status='online' else 0
rm_host_last_backup_timestamp_seconds{host_id,host} unix seconds; omitted if no backup yet
rm_host_last_backup_success{host_id,host} 1 if last_backup_status='succeeded' else 0; omitted if no backup yet
rm_host_repo_size_bytes{host_id,host} total_size from latest repo stats; omitted if unknown
rm_host_snapshot_count{host_id,host} integer
rm_host_open_alerts{host_id,host} count of open + un-resolved alerts attached to this host
rm_host_repo_status{host_id,host,status} 1, with status ∈ {unknown,ready,init_failed} (info-style, exactly one row per host)
```
`host` label is `hosts.name` for human readability; `host_id` is
the stable ULID for joining across renames.
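
Because `host` carries a free-form `hosts.name`, a hand-rolled emitter has to escape label values. The sketch below shows one way to format a per-host sample line under the text format's escaping rules (backslash, double quote, and line feed). The function name `hostGauge` and the sample IDs are illustrative assumptions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// labelEscaper applies the text-format escaping for label values:
// backslash -> \\, double quote -> \", line feed -> \n.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// hostGauge formats one per-host sample line with host_id and host labels.
func hostGauge(name, hostID, host string, value float64) string {
	return fmt.Sprintf(`%s{host_id="%s",host="%s"} %s`,
		name,
		labelEscaper.Replace(hostID),
		labelEscaper.Replace(host),
		strconv.FormatFloat(value, 'f', -1, 64), // plain decimal, no exponent
	)
}

func main() {
	fmt.Println("# TYPE rm_host_agent_online gauge")
	fmt.Println(hostGauge("rm_host_agent_online", "01HX", `web "prod"`, 1))
}
```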
### Server gauges
```
rm_hosts_total count of hosts (excludes pending)
rm_hosts_online count of hosts with status='online'
rm_active_alerts{severity} count of open alerts by severity ∈ {info,warning,critical}
rm_build_info{version,commit,go_version} always 1; pure label-bag for joining
```
### Job duration histogram
```
rm_job_duration_seconds_bucket{kind,status,le=...}
rm_job_duration_seconds_sum{kind,status}
rm_job_duration_seconds_count{kind,status}
```
`kind` ∈ {backup,forget,prune,check,unlock,restore,diff,init,update}
(every JobKind we currently dispatch). `status` ∈
{succeeded,failed,cancelled}. Buckets cover the realistic range —
short admin commands (unlock, init) finish in seconds; backups can
be hours:
```
1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf
(1s 5s 30s 1m 5m 30m 1h 6h 24h)
```
In-memory only. Reset on process restart — operators who want
durable history scrape into Prom and let it persist.
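
Prometheus histogram buckets are cumulative: each `le` bucket counts every observation less than or equal to its bound, and `+Inf` equals the total count. A minimal sketch of that bookkeeping with the bounds listed above (the `histogram` type and its method names are assumptions for illustration):

```go
package main

import (
	"fmt"
	"time"
)

// upperBounds are the finite bucket bounds in seconds from the spec;
// the +Inf bucket is stored as one extra trailing slot.
var upperBounds = []float64{1, 5, 30, 60, 300, 1800, 3600, 21600, 86400}

// histogram holds cumulative counts: counts[i] is the number of
// observations <= upperBounds[i]; the last slot is the +Inf bucket.
type histogram struct {
	counts []uint64
	sum    float64
}

func newHistogram() *histogram {
	return &histogram{counts: make([]uint64, len(upperBounds)+1)}
}

// observeSeconds records one observation given in seconds.
func (h *histogram) observeSeconds(secs float64) {
	for i, ub := range upperBounds {
		if secs <= ub { // `le` is inclusive
			h.counts[i]++
		}
	}
	h.counts[len(upperBounds)]++ // +Inf matches every observation
	h.sum += secs
}

// observe is the time.Duration convenience wrapper.
func (h *histogram) observe(d time.Duration) { h.observeSeconds(d.Seconds()) }

func main() {
	h := newHistogram()
	h.observe(3 * time.Second)
	h.observe(time.Minute)
	h.observe(2 * time.Hour)
	fmt.Println(h.counts) // [0 1 1 2 2 2 2 3 3 3]
	fmt.Println(h.sum)    // 7263
}
```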
## Architecture
New package `internal/server/metrics`:
- `Registry` — owns the histogram state (sync.Mutex + map keyed by
`kind+status`). `ObserveJob(kind, status string, dur time.Duration)`
is the only mutator. Lookups via `Snapshot()` are read-only and
copy out.
- `Render(w io.Writer, snapshot Snapshot)` — emits the full
exposition body. The snapshot is supplied by the HTTP handler
pulling from `Store` on each scrape; the package itself has no
store dependency, which keeps it trivially unit-testable.
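
A rough sketch of how the `Registry` described above might look. The struct key `seriesKey` stands in for the `kind+status` map key, and the `Series` layout is an assumption; only `ObserveJob` and `Snapshot` are names taken from this spec.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// bounds are the finite bucket upper bounds in seconds, from the spec.
var bounds = []float64{1, 5, 30, 60, 300, 1800, 3600, 21600, 86400}

// seriesKey is a hypothetical struct key standing in for "kind+status".
type seriesKey struct{ Kind, Status string }

// Series is one histogram; Buckets is cumulative and its last slot is +Inf.
type Series struct {
	Buckets []uint64
	Sum     float64
	Count   uint64
}

// Registry owns all histogram state behind a single mutex.
type Registry struct {
	mu     sync.Mutex
	series map[seriesKey]*Series
}

func NewRegistry() *Registry {
	return &Registry{series: make(map[seriesKey]*Series)}
}

// ObserveJob is the only mutator.
func (r *Registry) ObserveJob(kind, status string, dur time.Duration) {
	secs := dur.Seconds()
	r.mu.Lock()
	defer r.mu.Unlock()
	k := seriesKey{kind, status}
	s := r.series[k]
	if s == nil {
		s = &Series{Buckets: make([]uint64, len(bounds)+1)}
		r.series[k] = s
	}
	for i, ub := range bounds {
		if secs <= ub {
			s.Buckets[i]++
		}
	}
	s.Buckets[len(bounds)]++ // +Inf
	s.Sum += secs
	s.Count++
}

// Snapshot deep-copies the state so the caller can render without the lock
// and cannot mutate the registry through the returned value.
func (r *Registry) Snapshot() map[seriesKey]Series {
	r.mu.Lock()
	defer r.mu.Unlock()
	out := make(map[seriesKey]Series, len(r.series))
	for k, s := range r.series {
		out[k] = Series{
			Buckets: append([]uint64(nil), s.Buckets...),
			Sum:     s.Sum,
			Count:   s.Count,
		}
	}
	return out
}

func main() {
	r := NewRegistry()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.ObserveJob("backup", "succeeded", 90*time.Second)
		}()
	}
	wg.Wait()
	fmt.Println(r.Snapshot()[seriesKey{"backup", "succeeded"}].Count) // 100
}
```

A single mutex is likely sufficient here because `ObserveJob` fires once per finished job, which is far less frequent than a hot-path counter.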
New file `internal/server/http/metrics.go`:
- `handleMetrics(w, r)` — auth check (bearer + CIDR), pull current
fleet snapshot from `Store`, ask `metrics.Render` to emit.
- Auth helper `authoriseMetricsScrape(r)` — pure function over
request + config; tested directly.
Wiring:
- `cmd/server` constructs the `metrics.Registry` once and threads
it into both `Deps` (for the HTTP layer) and `ws.HandlerDeps`
(so the job-finished branch can call `ObserveJob`).
- `ws/handler.go` MsgJobFinished branch grows a single line:
`if deps.Metrics != nil { deps.Metrics.ObserveJob(job.Kind, p.Status, p.FinishedAt.Sub(job.StartedAt)) }`.
Falls back gracefully if the registry was never wired (tests).
Route registration in `server.go`:
```go
if s.deps.Cfg.MetricsAuthEnabled() {
r.Get("/metrics", s.handleMetrics)
}
```
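
To illustrate the "not mounted means 404" behaviour, here is a sketch using the standard library `http.ServeMux` in place of the chi router. The `config` fields and the placeholder handler body are hypothetical; only `MetricsAuthEnabled` is a name from this spec.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// config is a hypothetical stand-in for the server config.
type config struct {
	MetricsToken        string
	MetricsTrustedCIDRs []string
}

// MetricsAuthEnabled is true when at least one gate is configured.
func (c config) MetricsAuthEnabled() bool {
	return c.MetricsToken != "" || len(c.MetricsTrustedCIDRs) > 0
}

// newMux mounts /metrics only when a gate is configured.
func newMux(c config) *http.ServeMux {
	mux := http.NewServeMux()
	if c.MetricsAuthEnabled() {
		mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
			// Auth checks and real rendering are omitted in this sketch.
			w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
			fmt.Fprintln(w, "# placeholder body")
		})
	}
	return mux
}

// statusFor issues GET /metrics against a mux built from c.
func statusFor(c config) int {
	rec := httptest.NewRecorder()
	newMux(c).ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	return rec.Code
}

func main() {
	fmt.Println(statusFor(config{}))                     // 404
	fmt.Println(statusFor(config{MetricsToken: "tok"})) // 200
}
```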
## Cardinality + cost
Per scrape: O(hosts) gauge rows + O(kinds × statuses × buckets) histogram rows. For a 100-host fleet that's ~700 host rows (7 per-host metrics × 100 hosts) + ~270 histogram bucket rows (9 kinds × 3 statuses × 10 buckets), plus 54 `_sum`/`_count` rows, per scrape: well under any practical limit. The store reads we issue per scrape are: `ListHosts` (already exists, one query), `ListAlerts` filtered by open status (one query), `GetHostRepoStats` already projected onto `Host` via `repo_size_bytes`. No N+1.
A 10s scrape interval against a 100-host fleet is cheap: each scrape is a couple of indexed sqlite reads + a small string render. We're nowhere near a place where caching the snapshot would be worthwhile.
## Documentation (P6-05)
- `docs/prometheus.md` — sibling to the existing `docs/reverse-proxy.md`. Sections: enabling the endpoint (env vars), Prometheus scrape config snippet (with bearer + tls), the metric reference table (copy-pasted from this spec), the dashboard import instructions.
- `deploy/grafana/restic-manager-dashboard.json` — Grafana 11+ dashboard JSON. Single Prometheus datasource variable, six panels:
1. **Fleet status** — stat panel showing `rm_hosts_online / rm_hosts_total` + a sparkline.
2. **Open alerts** — stat panel by severity (`sum by (severity) (rm_active_alerts)`).
3. **Hosts** — table of `host`, `online`, `last_backup` (relative time via `time() - rm_host_last_backup_timestamp_seconds`), `repo_size`, `snapshots`.
4. **Repo size over time** — time series, one line per host, `rm_host_repo_size_bytes`.
5. **Backups failing** — time series counting hosts where `rm_host_last_backup_success == 0`.
6. **Job duration p95**: `histogram_quantile(0.95, sum by (kind, le) (rate(rm_job_duration_seconds_bucket[1h])))` over a 1h window.
Dashboard is committed as plain JSON; an operator imports it through the Grafana UI ("+ → Import → upload JSON file") or Grafana provisioning.
## Testing
- Unit tests for `metrics.Render` against a fixed snapshot — golden-file style. Pin the exact line ordering (sorted by metric name + label set) so diffs stay tractable.
- Unit tests for `metrics.Registry.ObserveJob` — concurrent writes, bucket boundary correctness, snapshot independence.
- Handler tests for `/metrics` covering: no auth configured → 404; token configured + missing → 401; token configured + correct → 200 + body sniff; CIDR configured + wrong source → 401; CIDR configured + right source → 200; both configured → require both.
- End-to-end smoke verification deferred to manual operator walk-through; full Playwright pass is P5-06's job.
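
For the golden-file tests, output only stays byte-stable if the renderer sorts before writing. A minimal sketch of that ordering rule (metric name first, then label set); the `sample` type and `renderSorted` are hypothetical helpers, not the real `metrics.Render`:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sample is one exposition line with a pre-rendered label set.
type sample struct {
	Name   string
	Labels string // e.g. `host_id="a",host="web"`; empty for unlabelled metrics
	Value  string
}

// renderSorted emits samples ordered by metric name, then label set,
// so the output does not depend on input (or map iteration) order.
func renderSorted(samples []sample) string {
	sorted := append([]sample(nil), samples...) // leave the caller's slice untouched
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].Name != sorted[j].Name {
			return sorted[i].Name < sorted[j].Name
		}
		return sorted[i].Labels < sorted[j].Labels
	})
	var b strings.Builder
	for _, s := range sorted {
		if s.Labels == "" {
			fmt.Fprintf(&b, "%s %s\n", s.Name, s.Value)
		} else {
			fmt.Fprintf(&b, "%s{%s} %s\n", s.Name, s.Labels, s.Value)
		}
	}
	return b.String()
}

func main() {
	fmt.Print(renderSorted([]sample{
		{"rm_hosts_total", "", "2"},
		{"rm_host_agent_online", `host_id="b",host="db"`, "0"},
		{"rm_host_agent_online", `host_id="a",host="web"`, "1"},
	}))
}
```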
## Out of scope, explicitly
- Per-job latency tracking with `job_id` labels (cardinality bomb).
- Restore-specific metrics (P3 surfaces are still settling).
- Histograms keyed on host (kind × status × buckets × hosts is a Prom anti-pattern).
- Auto-discovery / file-SD generators for Prometheus.