P6-04 + P6-05 — Prometheus /metrics + Grafana dashboard

Date: 2026-05-07
Author: Claude (autonomous, sensible-defaults brief from operator)
Tasks: P6-04 (M), P6-05 (S)

Problem

The control plane already knows everything a backup operator needs to monitor — last-backup timestamp + status, repo size, snapshot count, agent online, open alerts, build version — but it surfaces those only through the dashboard HTML and a few JSON endpoints. To plug into the operator's existing observability stack we need a plain Prometheus exposition endpoint and a Grafana dashboard JSON that reads from it.

Goals

  • GET /metrics emits standard Prometheus text-format with the per-host, server, and job-duration metrics enumerated in the task entry (P6-04 in tasks.md).
  • Endpoint is opt-in and gated by a bearer token and/or an IP allow-list — never publicly readable by default.
  • No new third-party dependency (prometheus/client_golang is not pulled in). The exposition format is small and stable enough to emit by hand; matches the repo's "no Tailwind/Node" style.
  • Sample Grafana dashboard committed to the repo so a stranger can drop it into a Grafana instance and get a working view.

Non-goals

  • OpenMetrics. The legacy text format with # HELP/# TYPE is what every Prometheus server still parses and what most online examples demonstrate, so we pick the boring option.
  • Pushgateway or remote-write integration.
  • Per-job metric cardinality (no job_id labels — that would make the histogram explode).
  • Alerting rules. Operators already have alerts inside restic-manager (P3-05); duplicating them in Prometheus is a YAGNI hazard. The dashboard is read-only.

Auth

Two switches, both off by default. If neither is set the route isn't mounted at all (404 from the chi router) — this avoids any accidental "wide-open scrape endpoint" deployment.

env var                  type        meaning
RM_METRICS_TOKEN         string      If set, callers must send Authorization: Bearer <token>. Compared with crypto/subtle.ConstantTimeCompare.
RM_METRICS_TRUSTED_CIDR  comma-CIDR  If set, callers must hit from a source IP inside one of these CIDRs. Reuses the existing RM_TRUSTED_PROXY semantics for honouring X-Forwarded-For.

If both are set, both must pass (AND, not OR — a token leak doesn't grant access from outside the network, and a trusted-network compromise doesn't grant access without the token). If only one is set, that one alone gates access.

YAML overlay mirrors env: metrics_token, metrics_trusted_cidrs.

Metrics

All metric names are prefixed rm_. Help text is concise.

Per-host gauges (one row per host_id)

rm_host_agent_online{host_id,host}                     1 if status='online' else 0
rm_host_last_backup_timestamp_seconds{host_id,host}    unix seconds; omitted if no backup yet
rm_host_last_backup_success{host_id,host}              1 if last_backup_status='succeeded' else 0; omitted if no backup yet
rm_host_repo_size_bytes{host_id,host}                  total_size from latest repo stats; omitted if unknown
rm_host_snapshot_count{host_id,host}                   integer
rm_host_open_alerts{host_id,host}                      count of open + un-resolved alerts attached to this host
rm_host_repo_status{host_id,host,status}               1, with status ∈ {unknown,ready,init_failed} (info-style, exactly one row per host)

host label is hosts.name for human readability; host_id is the stable ULID for joining across renames.

Server gauges

rm_hosts_total                              count of hosts (excludes pending)
rm_hosts_online                             count of hosts with status='online'
rm_active_alerts{severity}                  count of open alerts by severity ∈ {info,warning,critical}
rm_build_info{version,commit,go_version}    always 1; pure label-bag for joining

Job duration histogram

rm_job_duration_seconds_bucket{kind,status,le=...}
rm_job_duration_seconds_sum{kind,status}
rm_job_duration_seconds_count{kind,status}

kind ∈ {backup,forget,prune,check,unlock,restore,diff,init,update} (every JobKind we currently dispatch). status ∈ {succeeded,failed,cancelled}. Buckets cover the realistic range — short admin commands (unlock, init) finish in seconds; backups can be hours:

1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf
   (1s   5s  30s  1m   5m  30m   1h    6h   24h)

In-memory only. Reset on process restart — operators who want durable history scrape into Prom and let it persist.

Architecture

New package internal/server/metrics:

  • Registry — owns the histogram state (sync.Mutex + map keyed by kind+status). ObserveJob(kind, status string, dur time.Duration) is the only mutator. Lookups via Snapshot() are read-only and copy out.
  • Render(w io.Writer, snapshot Snapshot) — emits the full exposition body. The snapshot is supplied by the HTTP handler pulling from Store on each scrape; the package itself has no store dependency, which keeps it trivially unit-testable.

New file internal/server/http/metrics.go:

  • handleMetrics(w, r) — auth check (bearer + CIDR), pull current fleet snapshot from Store, ask metrics.Render to emit.
  • Auth helper authoriseMetricsScrape(r) — pure function over request + config; tested directly.

Wiring:

  • cmd/server constructs the metrics.Registry once and threads it into both Deps (for the HTTP layer) and ws.HandlerDeps (so the job-finished branch can call ObserveJob).
  • ws/handler.go MsgJobFinished branch grows a single line: if deps.Metrics != nil { deps.Metrics.ObserveJob(job.Kind, p.Status, p.FinishedAt.Sub(job.StartedAt)) }. Falls back gracefully if the registry was never wired (tests).

Route registration in server.go:

if s.deps.Cfg.MetricsAuthEnabled() {
    r.Get("/metrics", s.handleMetrics)
}

Cardinality + cost

Per scrape: O(hosts) gauge rows + O(kinds × statuses × buckets) histogram rows. For a 100-host fleet that's ~700 host rows (7 per-host metrics × 100 hosts) + ~270 histogram bucket rows (9 kinds × 3 statuses × 10 buckets including +Inf), plus 54 _sum/_count rows, per scrape. That is well under any practical limit. The store reads we issue per scrape are: ListHosts (already exists, one query) and ListAlerts filtered by open status (one query). Repo stats need no separate GetHostRepoStats call because they are already projected onto Host via repo_size_bytes. No N+1.

A 10s scrape interval against a 100-host fleet is cheap: each scrape is a couple of indexed sqlite reads + a small string render. We're nowhere near a place where caching the snapshot would be worthwhile.

Documentation (P6-05)

  • docs/prometheus.md — sibling to the existing docs/reverse-proxy.md. Sections: enabling the endpoint (env vars), Prometheus scrape config snippet (with bearer + tls), the metric reference table (copy-pasted from this spec), the dashboard import instructions.
  • deploy/grafana/restic-manager-dashboard.json — Grafana 11+ dashboard JSON. Single Prometheus datasource variable, six panels:
    1. Fleet status — stat panel showing rm_hosts_online / rm_hosts_total + a sparkline.
    2. Open alerts — stat panel by severity (sum by (severity) (rm_active_alerts)).
    3. Hosts — table of host, online, last_backup (relative time via time() - rm_host_last_backup_timestamp_seconds), repo_size, snapshots.
    4. Repo size over time — time series, one line per host, rm_host_repo_size_bytes.
    5. Backups failing — time series counting hosts where rm_host_last_backup_success == 0.
    6. Job duration p95: histogram_quantile(0.95, sum by (kind, le) (rate(rm_job_duration_seconds_bucket[1h]))) over a 1h window.

Dashboard is committed as plain JSON; an operator imports it through the Grafana UI ("+ → Import → upload JSON file") or Grafana provisioning.

Testing

  • Unit tests for metrics.Render against a fixed snapshot — golden-file style. Pin the exact line ordering (sorted by metric name + label set) so diffs stay tractable.
  • Unit tests for metrics.Registry.ObserveJob — concurrent writes, bucket boundary correctness, snapshot independence.
  • Handler tests for /metrics covering: no auth configured → 404; token configured + missing → 401; token configured + correct → 200 + body sniff; CIDR configured + wrong source → 401; CIDR configured + right source → 200; both configured → require both.
  • End-to-end smoke verification deferred to manual operator walk-through; full Playwright pass is P5-06's job.

Out of scope, explicitly

  • Per-job latency tracking with job_id labels (cardinality bomb).
  • Restore-specific metrics (P3 surfaces are still settling).
  • Histograms keyed on host (kind × status × buckets × hosts is a Prom anti-pattern).
  • Auto-discovery / file-SD generators for Prometheus.