# Plan: P6-04 + P6-05 Prometheus metrics + Grafana dashboard

Spec: `docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md`

## Step 1: Config wiring

- Add fields to `internal/server/config/config.go`:
  - `MetricsToken string` (yaml `metrics_token`)
  - `MetricsTrustedCIDRs []string` (yaml `metrics_trusted_cidrs`)
  - method `(c Config) MetricsAuthEnabled() bool` returning true iff at least one of the two is configured.
- Env loading: `RM_METRICS_TOKEN` and `RM_METRICS_TRUSTED_CIDR` (comma-separated CIDRs).
- `validate()` extension: ensure each CIDR parses (reuse the same `netip.ParsePrefix` pattern that already validates `TrustedProxies`).
- Tests: extend `config_test.go` covering both env vars + happy/sad CIDR.

## Step 2: `internal/server/metrics` package

- `Registry` struct: `sync.Mutex`, `map[jobKey]*histogramState` where `jobKey = struct{kind,status string}`.
- `ObserveJob(kind, status string, dur time.Duration)`: clamps negative durations to 0; locks; bumps the right bucket + sum + count.
- `Snapshot() Snapshot`: copies state under lock; returns plain value type.
- `Snapshot` carries `Histogram` rows (kind, status, buckets, sum, count) and accepts the rest from the caller (host rows, alert counts, build info).
- `Render(w io.Writer, s Snapshot) error`: emits standard text exposition with stable line ordering. No external dep; manual escape of `\` `"` `\n` in label values per the Prom format spec.
- Unit tests: golden render, concurrent observe, bucket boundaries.

## Step 3: HTTP handler

- New `internal/server/http/metrics.go`:
  - `(s *Server) handleMetrics(w, r)`: calls `authoriseMetricsScrape`, then `gatherSnapshot(ctx)`, then `metrics.Render`.
  - `authoriseMetricsScrape(r, cfg) (ok bool, status int)`: pure helper. The bearer token is compared with `subtle.ConstantTimeCompare`. The CIDR check runs on `r.RemoteAddr` first, then on `X-Forwarded-For` if a trusted proxy fronted us (mirror `realIP`'s logic; the simplest path is to call the `chi/middleware.RealIP`-aware lookup the existing handlers already use).
  - `gatherSnapshot(ctx)`: assembles the snapshot from `Store.ListHosts`, `Store.ListAlerts({Status:"open"})`, the metrics registry, and `version.Version`/`version.Commit`/`runtime.Version()`.
- Route mounted in `server.go` only if `s.deps.Cfg.MetricsAuthEnabled()`.
- `Deps` grows a `Metrics *metrics.Registry` field; nil-tolerant in handlers.

## Step 4: Hook job-finished

- `internal/server/ws/handler.go`:
  - `HandlerDeps` grows `Metrics *metrics.Registry`.
  - In the `MsgJobFinished` branch, after the `GetJob` lookup we already do, observe `(job.Kind, p.Status.String(), p.FinishedAt.Sub(deref(job.StartedAt)))`. Skip if `job.StartedAt` is nil (rare race).
- `cmd/server` wires the registry into both `Deps` and `HandlerDeps` from a single instance.

## Step 5: Tests

- `internal/server/metrics/registry_test.go`: observe + snapshot determinism.
- `internal/server/metrics/render_test.go`: golden output for a fixed snapshot.
- `internal/server/http/metrics_test.go`: auth matrix (six cases per the spec) using the existing `newTestServer` fixture pattern. Render snapshot includes ≥1 host so we exercise the gather path end-to-end.

## Step 6: Docs + dashboard (P6-05)

- `docs/prometheus.md`: enable + scrape config + metric reference + dashboard import.
- `deploy/grafana/restic-manager-dashboard.json`: six-panel dashboard. Hand-authored against the Grafana 11 dashboard schema (uid, schemaVersion, panels with `targets[].expr`, datasource as variable). Validated by importing into Grafana; since we can't run Grafana in CI, the structural sniff test is just that the JSON parses and contains the expected panel titles + datasource variable.
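
For the `authoriseMetricsScrape` helper described in Step 3, a rough sketch of the direct-hit path might look like the following. It is illustrative only: `scrapeAuth`, `newScrapeAuth` and `authoriseScrape` are hypothetical names, the helper takes the `Authorization` header and `r.RemoteAddr` as plain strings rather than `(r, cfg)`, and it assumes that a matching token or a trusted source address is each sufficient on its own (the spec's six-case matrix is authoritative and may differ). The `X-Forwarded-For` path is deliberately left out.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/netip"
	"strings"
)

// scrapeAuth is a hypothetical stand-in for the MetricsToken and
// MetricsTrustedCIDRs config fields after validation.
type scrapeAuth struct {
	token string
	cidrs []netip.Prefix
}

// newScrapeAuth parses the CIDR strings up front, much as validate() would.
func newScrapeAuth(token string, cidrs ...string) (scrapeAuth, error) {
	a := scrapeAuth{token: token}
	for _, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			return scrapeAuth{}, fmt.Errorf("bad CIDR %q: %w", c, err)
		}
		a.cidrs = append(a.cidrs, p)
	}
	return a, nil
}

// authoriseScrape is a pure function over the Authorization header and the
// remote "ip:port" string. ASSUMPTION: token OR trusted CIDR is enough.
func authoriseScrape(authHeader, remoteAddr string, a scrapeAuth) (bool, int) {
	if got, ok := strings.CutPrefix(authHeader, "Bearer "); ok && a.token != "" {
		// Constant-time comparison avoids leaking how many bytes matched.
		if subtle.ConstantTimeCompare([]byte(got), []byte(a.token)) == 1 {
			return true, http.StatusOK
		}
	}
	if ap, err := netip.ParseAddrPort(remoteAddr); err == nil {
		addr := ap.Addr().Unmap() // treat ::ffff:10.0.0.1 as 10.0.0.1
		for _, p := range a.cidrs {
			if p.Contains(addr) {
				return true, http.StatusOK
			}
		}
	}
	return false, http.StatusUnauthorized
}

func main() {
	a, err := newScrapeAuth("s3cret", "10.0.0.0/8")
	if err != nil {
		panic(err)
	}
	fmt.Println(authoriseScrape("Bearer s3cret", "203.0.113.9:4000", a)) // token path
	fmt.Println(authoriseScrape("", "10.1.2.3:4000", a))                 // CIDR path
	fmt.Println(authoriseScrape("Bearer wrong", "203.0.113.9:4000", a))  // rejected
}
```

Keeping the helper free of `*http.Request` makes the auth matrix in `metrics_test.go` easy to drive as a table of strings; the real handler would pass `r.Header.Get("Authorization")` and the resolved client address.
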
## Step 7: Tasks.md + verification

- Strike P6-04, P6-05 in `tasks.md`; add an "as shipped" note mirroring the prior P6 entries.
- Run `go vet ./...`, `go test ./...`, `make build`.
- Push branch (no PR per standing instruction).

## Risk register

- **CIDR check for proxied scrapes**: easy to mis-implement, easy to mis-document. The handler test must exercise both "direct hit" and "X-Forwarded-For" paths.
- **Histogram lock contention**: every job finish takes the mutex. Job throughput is low (a few/min/host max), and `ObserveJob` is a couple of map lookups; no risk in practice.
- **Dashboard JSON drift**: Grafana versions evolve. Pinning `schemaVersion` and using only well-supported panel types (timeseries, stat, table) keeps the import working across recent versions.
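
To make the lock-contention item concrete, here is a minimal sketch of the Step 2 registry under stated assumptions: the bucket bounds are made up (the spec defines the real ones), `NewRegistry` is a hypothetical constructor, and `Snapshot` returns a plain map instead of the plan's `Snapshot` type. Bucket counts are stored non-cumulatively; the renderer would accumulate them when emitting `le` buckets.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// bucketBounds are HYPOTHETICAL upper bounds in seconds.
var bucketBounds = []float64{1, 10, 60, 600, 3600}

type jobKey struct{ kind, status string }

type histogramState struct {
	buckets []uint64 // per-bucket counts; the last slot is the +Inf overflow
	sum     float64  // total observed seconds
	count   uint64
}

type Registry struct {
	mu   sync.Mutex
	jobs map[jobKey]*histogramState
}

func NewRegistry() *Registry {
	return &Registry{jobs: make(map[jobKey]*histogramState)}
}

// ObserveJob clamps negative durations to zero, then updates one bucket,
// the sum and the count while holding the mutex.
func (r *Registry) ObserveJob(kind, status string, dur time.Duration) {
	if dur < 0 {
		dur = 0
	}
	secs := dur.Seconds()
	idx := len(bucketBounds) // default to the +Inf slot
	for i, ub := range bucketBounds {
		if secs <= ub {
			idx = i
			break
		}
	}

	r.mu.Lock()
	defer r.mu.Unlock()
	k := jobKey{kind, status}
	h := r.jobs[k]
	if h == nil {
		h = &histogramState{buckets: make([]uint64, len(bucketBounds)+1)}
		r.jobs[k] = h
	}
	h.buckets[idx]++
	h.sum += secs
	h.count++
}

// Snapshot deep-copies the state under the lock so callers can render
// without racing against concurrent observations.
func (r *Registry) Snapshot() map[jobKey]histogramState {
	r.mu.Lock()
	defer r.mu.Unlock()
	out := make(map[jobKey]histogramState, len(r.jobs))
	for k, h := range r.jobs {
		cp := *h
		cp.buckets = append([]uint64(nil), h.buckets...)
		out[k] = cp
	}
	return out
}

func main() {
	reg := NewRegistry()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			reg.ObserveJob("backup", "succeeded", 5*time.Second)
		}()
	}
	wg.Wait()
	fmt.Printf("%+v\n", reg.Snapshot())
}
```

The critical section is one map lookup plus three increments, which is why a single mutex is expected to be adequate at the job rates quoted above. Running the concurrent-observe unit test with `go test -race` is a cheap way to confirm the locking is complete.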