P6-04 + P6-05 — Prometheus /metrics + Grafana dashboard
Date: 2026-05-07 Author: Claude (autonomous, sensible-defaults brief from operator) Tasks: P6-04 (M), P6-05 (S)
Problem
The control plane already knows everything a backup operator needs to monitor — last-backup timestamp + status, repo size, snapshot count, agent online, open alerts, build version — but it surfaces those only through the dashboard HTML and a few JSON endpoints. To plug into the operator's existing observability stack we need a plain Prometheus exposition endpoint and a Grafana dashboard JSON that reads from it.
Goals
- `GET /metrics` emits standard Prometheus text-format with the per-host, server, and job-duration metrics enumerated in the task entry (P6-04 in `tasks.md`).
- Endpoint is opt-in and gated by a bearer token and/or an IP allow-list; never publicly readable by default.
- No new third-party dependency (`prometheus/client_golang` is not pulled in). The exposition format is small and stable enough to emit by hand; matches the repo's "no Tailwind/Node" style.
- Sample Grafana dashboard committed to the repo so a stranger can drop it into a Grafana instance and get a working view.
Non-goals
- OpenMetrics (the legacy text format with `# HELP`/`# TYPE` is what every Prometheus server still parses and what every example online demonstrates; pick the boring option).
- Pushgateway or remote-write integration.
- Per-job metric cardinality (no `job_id` labels; that would make the histogram explode).
- Alerting rules. Operators already have alerts inside restic-manager (P3-05); duplicating them in Prometheus is a YAGNI hazard. The dashboard is read-only.
Auth
Two switches, both off by default. If neither is set the route isn't mounted at all (404 from the chi router) — this avoids any accidental "wide-open scrape endpoint" deployment.
| env var | type | meaning |
|---|---|---|
| `RM_METRICS_TOKEN` | string | If set, callers must send `Authorization: Bearer <token>`. Compared with `crypto/subtle.ConstantTimeCompare`. |
| `RM_METRICS_TRUSTED_CIDR` | comma-CIDR | If set, callers must hit from a source IP inside one of these CIDRs. Reuses the existing `RM_TRUSTED_PROXY` semantics for honouring `X-Forwarded-For`. |
If both are set, both must pass (AND, not OR — a token leak doesn't grant access from outside the network, and a trusted-network compromise doesn't grant access without the token). If only one is set, that one alone gates access.
YAML overlay mirrors env: metrics_token, metrics_trusted_cidrs.
Metrics
All metric names are prefixed rm_. Help text is concise.
Per-host gauges (one row per host_id)
rm_host_agent_online{host_id,host} 1 if status='online' else 0
rm_host_last_backup_timestamp_seconds{host_id,host} unix seconds; omitted if no backup yet
rm_host_last_backup_success{host_id,host} 1 if last_backup_status='succeeded' else 0; omitted if no backup yet
rm_host_repo_size_bytes{host_id,host} total_size from latest repo stats; omitted if unknown
rm_host_snapshot_count{host_id,host} integer
rm_host_open_alerts{host_id,host} count of open + un-resolved alerts attached to this host
rm_host_repo_status{host_id,host,status} 1, with status ∈ {unknown,ready,init_failed} (info-style, exactly one row per host)
host label is hosts.name for human readability; host_id is
the stable ULID for joining across renames.
Server gauges
rm_hosts_total count of hosts (excludes pending)
rm_hosts_online count of hosts with status='online'
rm_active_alerts{severity} count of open alerts by severity ∈ {info,warning,critical}
rm_build_info{version,commit,go_version} always 1; pure label-bag for joining
Job duration histogram
rm_job_duration_seconds_bucket{kind,status,le=...}
rm_job_duration_seconds_sum{kind,status}
rm_job_duration_seconds_count{kind,status}
kind ∈ {backup,forget,prune,check,unlock,restore,diff,init,update}
(every JobKind we currently dispatch). status ∈
{succeeded,failed,cancelled}. Buckets cover the realistic range —
short admin commands (unlock, init) finish in seconds; backups can
be hours:
1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf
(1s 5s 30s 1m 5m 30m 1h 6h 24h)
In-memory only. Reset on process restart — operators who want durable history scrape into Prom and let it persist.
Architecture
New package internal/server/metrics:
- `Registry`: owns the histogram state (`sync.Mutex` + map keyed by `kind+status`). `ObserveJob(kind, status string, dur time.Duration)` is the only mutator. Lookups via `Snapshot()` are read-only and copy out.
- `Render(w io.Writer, snapshot Snapshot)`: emits the full exposition body. The snapshot is supplied by the HTTP handler pulling from `Store` on each scrape; the package itself has no store dependency, which keeps it trivially unit-testable.
New file internal/server/http/metrics.go:
- `handleMetrics(w, r)`: auth check (bearer + CIDR), pull current fleet snapshot from `Store`, ask `metrics.Render` to emit.
- Auth helper `authoriseMetricsScrape(r)`: pure function over request + config; tested directly.
Wiring:
- `cmd/server` constructs the `metrics.Registry` once and threads it into both `Deps` (for the HTTP layer) and `ws.HandlerDeps` (so the job-finished branch can call `ObserveJob`).
- `ws/handler.go` MsgJobFinished branch grows a single line: `if deps.Metrics != nil { deps.Metrics.ObserveJob(job.Kind, p.Status, p.FinishedAt.Sub(job.StartedAt)) }`. Falls back gracefully if the registry was never wired (tests).
Route registration in server.go:
if s.deps.Cfg.MetricsAuthEnabled() {
r.Get("/metrics", s.handleMetrics)
}
Cardinality + cost
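Per scrape: O(hosts) gauge rows + O(kinds × statuses × buckets) histogram rows. For a 100-host fleet that's ~700 host rows (7 per-host gauges × 100 hosts) plus at most ~270 histogram bucket rows (9 kinds × 3 statuses × 10 buckets), or 324 histogram rows once the `_sum` and `_count` rows are counted; well under any practical limit. The store reads we issue per scrape are: ListHosts (already exists, one query) and ListAlerts filtered by open status (one query). A separate GetHostRepoStats call is not needed because its result is already projected onto Host via repo_size_bytes. No N+1.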
A 10s scrape interval against a 100-host fleet is cheap: each scrape is a couple of indexed sqlite reads + a small string render. We're nowhere near a place where caching the snapshot would be worthwhile.
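The row counts can be checked with a few lines of arithmetic. The constants below are taken from the metric lists in this spec; the histogram figures are an upper bound that assumes every kind/status pair has been observed at least once.

```go
package main

import "fmt"

const (
	perHostGauges    = 7  // rm_host_* families, one row per host each
	kinds            = 9  // backup, forget, prune, check, unlock, restore, diff, init, update
	statuses         = 3  // succeeded, failed, cancelled
	bucketsPerSeries = 10 // nine finite bounds plus +Inf
)

// hostRows is the number of per-host gauge rows for a fleet of the given size.
func hostRows(hosts int) int { return hosts * perHostGauges }

// histogramRows returns the bucket rows alone and the total including
// the _sum and _count row of every kind/status series.
func histogramRows() (bucketRows, totalRows int) {
	series := kinds * statuses
	return series * bucketsPerSeries, series * (bucketsPerSeries + 2)
}

func main() {
	b, t := histogramRows()
	fmt.Println(hostRows(100), b, t) // 700 270 324
}
```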
Documentation (P6-05)
- `docs/prometheus.md`: sibling to the existing `docs/reverse-proxy.md`. Sections: enabling the endpoint (env vars), Prometheus scrape config snippet (with bearer + TLS), the metric reference table (copy-pasted from this spec), the dashboard import instructions.
- `deploy/grafana/restic-manager-dashboard.json`: Grafana 11+ dashboard JSON. Single Prometheus datasource variable, six panels:
  - Fleet status: stat panel showing `rm_hosts_online / rm_hosts_total` + a sparkline.
  - Open alerts: stat panel by severity (`sum by (severity) (rm_active_alerts)`).
  - Hosts: table of `host`, `online`, `last_backup` (relative time via `time() - rm_host_last_backup_timestamp_seconds`), `repo_size`, `snapshots`.
  - Repo size over time: time series, one line per host, `rm_host_repo_size_bytes`.
  - Backups failing: time series counting hosts where `rm_host_last_backup_success == 0`.
  - Job duration p95: `histogram_quantile(0.95, sum by (kind, le) (rate(rm_job_duration_seconds_bucket[1h])))` over a 1h window.
Dashboard is committed as plain JSON; an operator imports it through the Grafana UI ("+ → Import → upload JSON file") or Grafana provisioning.
Testing
- Unit tests for `metrics.Render` against a fixed snapshot, golden-file style. Pin the exact line ordering (sorted by metric name + label set) so diffs stay tractable.
- Unit tests for `metrics.Registry.ObserveJob`: concurrent writes, bucket boundary correctness, snapshot independence.
- Handler tests for `/metrics` covering: no auth configured → 404; token configured + missing → 401; token configured + correct → 200 + body sniff; CIDR configured + wrong source → 401; CIDR configured + right source → 200; both configured → require both.
- End-to-end smoke verification deferred to manual operator walk-through; full Playwright pass is P5-06's job.
Out of scope, explicitly
- Per-job latency tracking with `job_id` labels (cardinality bomb).
- Restore-specific metrics (P3 surfaces are still settling).
- Histograms keyed on host (kind × status × buckets × hosts is a Prom anti-pattern).
- Auto-discovery / file-SD generators for Prometheus.