restic-manager/docs/superpowers/plans/2026-05-07-p6-04-05-prometheus-metrics.md

Plan — P6-04 + P6-05 Prometheus metrics + Grafana dashboard

Spec: docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md

Step 1 — Config wiring

  • Add fields to internal/server/config/config.go:
    • MetricsToken string (yaml metrics_token)
    • MetricsTrustedCIDRs []string (yaml metrics_trusted_cidrs)
    • method (c Config) MetricsAuthEnabled() bool returning true iff at least one of the two is configured.
  • Env loading: RM_METRICS_TOKEN and RM_METRICS_TRUSTED_CIDR (a comma-separated list of CIDRs).
  • validate() extension: ensure each CIDR parses (reuse the same netip.ParsePrefix pattern that already validates TrustedProxies).
  • Tests: extend config_test.go to cover both env vars plus valid and invalid CIDR input.
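
The config wiring above could look roughly like the sketch below. It shows only the two new fields, the MetricsAuthEnabled method, and CIDR validation with netip.ParsePrefix. The helper names splitCIDRs and validateMetricsCIDRs are hypothetical, as is the error wording; the real validate() and env loader in config.go may be shaped differently.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// Config holds only the two metrics fields from Step 1; the real struct has more.
type Config struct {
	MetricsToken        string   `yaml:"metrics_token"`
	MetricsTrustedCIDRs []string `yaml:"metrics_trusted_cidrs"`
}

// MetricsAuthEnabled reports whether at least one of the two mechanisms is configured.
func (c Config) MetricsAuthEnabled() bool {
	return c.MetricsToken != "" || len(c.MetricsTrustedCIDRs) > 0
}

// splitCIDRs (hypothetical helper) turns the comma-separated env value into a
// slice, trimming whitespace and dropping empty entries.
func splitCIDRs(raw string) []string {
	var out []string
	for _, part := range strings.Split(raw, ",") {
		if p := strings.TrimSpace(part); p != "" {
			out = append(out, p)
		}
	}
	return out
}

// validateMetricsCIDRs (hypothetical helper) rejects any entry that
// netip.ParsePrefix cannot parse.
func validateMetricsCIDRs(cidrs []string) error {
	for _, raw := range cidrs {
		if _, err := netip.ParsePrefix(raw); err != nil {
			return fmt.Errorf("metrics_trusted_cidrs: invalid CIDR %q: %w", raw, err)
		}
	}
	return nil
}
```

With this shape, an empty Config leaves the endpoint disabled, and a value such as "10.0.0.0/8, 192.168.1.0/24" in the env var yields two validated entries.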

Step 2 — internal/server/metrics package

  • Registry struct: sync.Mutex, map[jobKey]*histogramState where jobKey = struct{kind,status string}.
  • ObserveJob(kind, status string, dur time.Duration) — clamps negative durations to 0; locks; bumps the right bucket + sum + count.
  • Snapshot() Snapshot — copies state under lock; returns plain value type.
  • Snapshot carries the Histogram rows (kind, status, buckets, sum, count) produced by the registry; the remaining fields (host rows, alert counts, build info) are filled in by the caller.
  • Render(w io.Writer, s Snapshot) error: emits the standard Prometheus text exposition format with stable line ordering. No external dependency; label values are escaped by hand (backslash, double quote, and newline) per the Prometheus format spec.
  • Unit tests: golden render, concurrent observe, bucket boundaries.

Step 3 — HTTP handler

  • New internal/server/http/metrics.go:
    • (s *Server) handleMetrics(w, r) — calls authoriseMetricsScrape, then gatherSnapshot(ctx) then metrics.Render.
    • authoriseMetricsScrape(r, cfg) (ok bool, status int): pure helper. The bearer token is compared with subtle.ConstantTimeCompare. The CIDR check runs against r.RemoteAddr first, then against X-Forwarded-For only if a trusted proxy fronted the request. This mirrors realIP's logic; the simplest path is to reuse the chi/middleware.RealIP-aware lookup that the existing handlers already use.
    • gatherSnapshot(ctx) — assembles the snapshot from Store.ListHosts, Store.ListAlerts({Status:"open"}), the metrics registry, and version.Version/version.Commit/runtime.Version().
  • Route mounted in server.go only if s.deps.Cfg.MetricsAuthEnabled().
  • Deps grows a Metrics *metrics.Registry field; nil-tolerant in handlers.

Step 4 — Hook job-finished

  • internal/server/ws/handler.go:
    • HandlerDeps grows Metrics *metrics.Registry.
    • In the MsgJobFinished branch, after the existing GetJob lookup, call ObserveJob(job.Kind, p.Status.String(), p.FinishedAt.Sub(deref(job.StartedAt))). Skip the observation if job.StartedAt is nil (a rare race).
  • cmd/server wires the registry into both Deps and HandlerDeps from a single instance.

Step 5 — Tests

  • internal/server/metrics/registry_test.go — observe + snapshot determinism.
  • internal/server/metrics/render_test.go — golden output for a fixed snapshot.
  • internal/server/http/metrics_test.go — auth matrix (six cases per the spec) using the existing newTestServer fixture pattern. Render snapshot includes ≥1 host so we exercise the gather path end-to-end.

Step 6 — Docs + dashboard (P6-05)

  • docs/prometheus.md — enable + scrape config + metric reference + dashboard import.
  • deploy/grafana/restic-manager-dashboard.json: six-panel dashboard, hand-authored against the Grafana 11 dashboard schema (uid, schemaVersion, panels with targets[].expr, datasource as a variable). It is validated manually by importing it into Grafana. Because Grafana cannot run in CI, the automated structural sniff test only checks that the JSON parses and contains the expected panel titles and the datasource variable.
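
The structural sniff test for the dashboard could be as small as the sketch below. checkDashboard is a hypothetical helper and the panel titles passed to it are placeholders; the JSON paths it reads (panels[].title and templating.list[].type) follow the Grafana dashboard JSON model.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dashboard decodes only the parts of the Grafana JSON the sniff test needs.
type dashboard struct {
	Panels []struct {
		Title string `json:"title"`
	} `json:"panels"`
	Templating struct {
		List []struct {
			Name string `json:"name"`
			Type string `json:"type"`
		} `json:"list"`
	} `json:"templating"`
}

// checkDashboard (hypothetical helper) verifies that the JSON parses, that every
// expected panel title is present, and that a datasource template variable exists.
func checkDashboard(raw []byte, wantTitles []string) error {
	var d dashboard
	if err := json.Unmarshal(raw, &d); err != nil {
		return fmt.Errorf("dashboard JSON does not parse: %w", err)
	}
	have := make(map[string]bool, len(d.Panels))
	for _, p := range d.Panels {
		have[p.Title] = true
	}
	for _, t := range wantTitles {
		if !have[t] {
			return fmt.Errorf("missing panel %q", t)
		}
	}
	for _, v := range d.Templating.List {
		if v.Type == "datasource" {
			return nil
		}
	}
	return fmt.Errorf("no datasource template variable")
}
```

In a real test, raw would come from reading deploy/grafana/restic-manager-dashboard.json, and wantTitles would list the six panel titles the dashboard ships with.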

Step 7 — Tasks.md + verification

  • Strike P6-04, P6-05 in tasks.md; add an "as shipped" note mirroring the prior P6 entries.
  • Run go vet ./..., go test ./..., make build.
  • Push branch (no PR per standing instruction).

Risk register

  • CIDR check for proxied scrapes — easy to mis-implement, easy to mis-document. The handler test must exercise both "direct hit" and "X-Forwarded-For" paths.
  • Histogram lock contention: every job finish takes the mutex. Job throughput is low (at most a few per minute per host) and ObserveJob is a couple of map lookups, so contention is negligible in practice.
  • Dashboard JSON drift — Grafana versions evolve. Pinning schemaVersion and using only well-supported panel types (timeseries, stat, table) keeps the import working across recent versions.