Plan — P6-04 + P6-05 Prometheus metrics + Grafana dashboard
Spec: docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md
Step 1 — Config wiring
- Add fields to internal/server/config/config.go:
  - MetricsToken string (yaml: metrics_token)
  - MetricsTrustedCIDRs []string (yaml: metrics_trusted_cidrs)
  - method (c Config) MetricsAuthEnabled() bool, returning true iff at least one of the two is configured.
- Env loading: RM_METRICS_TOKEN and RM_METRICS_TRUSTED_CIDR (comma-separated CIDRs).
- validate() extension: ensure each CIDR parses (reuse the same netip.ParsePrefix pattern that already validates TrustedProxies).
- Tests: extend config_test.go to cover both env vars plus happy/sad CIDR cases.
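The config wiring above could look roughly like the following. This is a minimal sketch, not the real config package: the Config struct is reduced to the two new fields, and splitCIDRs and validateMetricsCIDRs are hypothetical helper names standing in for the env-loading and validate() changes.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// Config carries only the two metrics fields from the plan; the real
// struct in internal/server/config has many more fields.
type Config struct {
	MetricsToken        string   `yaml:"metrics_token"`
	MetricsTrustedCIDRs []string `yaml:"metrics_trusted_cidrs"`
}

// MetricsAuthEnabled reports whether at least one of the two auth
// mechanisms is configured.
func (c Config) MetricsAuthEnabled() bool {
	return c.MetricsToken != "" || len(c.MetricsTrustedCIDRs) > 0
}

// splitCIDRs (hypothetical helper) turns the comma-separated value of
// RM_METRICS_TRUSTED_CIDR into a slice, dropping empty entries.
func splitCIDRs(raw string) []string {
	var out []string
	for _, part := range strings.Split(raw, ",") {
		if p := strings.TrimSpace(part); p != "" {
			out = append(out, p)
		}
	}
	return out
}

// validateMetricsCIDRs (hypothetical name) is the check validate() would
// gain: every configured CIDR must parse with netip.ParsePrefix.
func (c Config) validateMetricsCIDRs() error {
	for _, raw := range c.MetricsTrustedCIDRs {
		if _, err := netip.ParsePrefix(raw); err != nil {
			return fmt.Errorf("metrics_trusted_cidrs: invalid CIDR %q: %w", raw, err)
		}
	}
	return nil
}

func main() {
	cfg := Config{MetricsTrustedCIDRs: splitCIDRs("10.0.0.0/8, 192.168.1.0/24")}
	fmt.Println(cfg.MetricsAuthEnabled(), cfg.validateMetricsCIDRs())
}
```

Keeping MetricsAuthEnabled on the value receiver means the route-mounting code can call it without caring whether it holds a Config or a pointer to one.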
Step 2 — internal/server/metrics package
- Registry struct: sync.Mutex, map[jobKey]*histogramState where jobKey = struct{kind, status string}.
- ObserveJob(kind, status string, dur time.Duration): clamps negative durations to 0; locks; bumps the right bucket + sum + count.
- Snapshot() Snapshot: copies state under lock; returns a plain value type. Snapshot carries Histogram rows (kind, status, buckets, sum, count) and accepts the rest from the caller (host rows, alert counts, build info).
- Render(w io.Writer, s Snapshot) error: emits standard text exposition with stable line ordering. No external dep; manual escaping of backslash, double quote, and newline (\\, \", \n) in label values per the Prom format spec.
- Unit tests: golden render, concurrent observe, bucket boundaries.
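A minimal sketch of the registry described above, under stated assumptions: the bucket boundaries are placeholders (the plan does not list them), Snapshot is reduced to the histogram rows, and buckets are stored cumulatively at observe time so that rendering needs no extra pass. The real package may instead bump a single bucket and accumulate later.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// bucketBounds are illustrative upper bounds in seconds (an assumption).
var bucketBounds = []float64{1, 10, 60, 600, 3600}

type jobKey struct{ kind, status string }

type histogramState struct {
	buckets []uint64 // cumulative: buckets[i] counts observations <= bucketBounds[i]
	sum     float64
	count   uint64
}

// Histogram is one exported row of the snapshot.
type Histogram struct {
	Kind, Status string
	Buckets      []uint64
	Sum          float64
	Count        uint64
}

// Snapshot is reduced to the histogram rows for this sketch.
type Snapshot struct{ Histograms []Histogram }

type Registry struct {
	mu   sync.Mutex
	jobs map[jobKey]*histogramState
}

func NewRegistry() *Registry {
	return &Registry{jobs: make(map[jobKey]*histogramState)}
}

// ObserveJob clamps negative durations to zero, then updates the
// buckets, sum, and count for the (kind, status) pair under the lock.
func (r *Registry) ObserveJob(kind, status string, dur time.Duration) {
	if dur < 0 {
		dur = 0
	}
	secs := dur.Seconds()
	r.mu.Lock()
	defer r.mu.Unlock()
	key := jobKey{kind, status}
	h := r.jobs[key]
	if h == nil {
		h = &histogramState{buckets: make([]uint64, len(bucketBounds))}
		r.jobs[key] = h
	}
	for i, le := range bucketBounds {
		if secs <= le {
			h.buckets[i]++
		}
	}
	h.sum += secs
	h.count++
}

// Snapshot copies the state under the lock and sorts rows by kind, then
// status, so later rendering is deterministic.
func (r *Registry) Snapshot() Snapshot {
	r.mu.Lock()
	defer r.mu.Unlock()
	rows := make([]Histogram, 0, len(r.jobs))
	for k, h := range r.jobs {
		rows = append(rows, Histogram{
			Kind: k.kind, Status: k.status,
			Buckets: append([]uint64(nil), h.buckets...),
			Sum:     h.sum, Count: h.count,
		})
	}
	sort.Slice(rows, func(i, j int) bool {
		if rows[i].Kind != rows[j].Kind {
			return rows[i].Kind < rows[j].Kind
		}
		return rows[i].Status < rows[j].Status
	})
	return Snapshot{Histograms: rows}
}

func main() {
	r := NewRegistry()
	r.ObserveJob("backup", "succeeded", 5*time.Second)
	r.ObserveJob("backup", "failed", -time.Second) // clamped to 0
	for _, h := range r.Snapshot().Histograms {
		fmt.Printf("%s/%s count=%d sum=%g buckets=%v\n", h.Kind, h.Status, h.Count, h.Sum, h.Buckets)
	}
}
```

Copying the bucket slice inside Snapshot matters: returning the internal slice would let a concurrent ObserveJob mutate data the renderer is still reading.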
Step 3 — HTTP handler
- New internal/server/http/metrics.go:
  - (s *Server) handleMetrics(w, r): calls authoriseMetricsScrape, then gatherSnapshot(ctx), then metrics.Render.
  - authoriseMetricsScrape(r, cfg) (ok bool, status int): pure helper; bearer token compared with subtle.ConstantTimeCompare; CIDR check on r.RemoteAddr first, then X-Forwarded-For if a trusted proxy fronted us (mirror realIP's logic; the simplest path is to call the chi/middleware.RealIP-aware lookup the existing handlers use).
  - gatherSnapshot(ctx): assembles the snapshot from Store.ListHosts, Store.ListAlerts({Status:"open"}), the metrics registry, and version.Version / version.Commit / runtime.Version().
- Route mounted in server.go only if s.deps.Cfg.MetricsAuthEnabled(). Deps grows a Metrics *metrics.Registry field; nil-tolerant in handlers.
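The two primitive checks behind authoriseMetricsScrape can be sketched as below. The helper names bearerMatches and addrTrusted are hypothetical, and the sketch deliberately leaves out two things the spec owns: how the token and CIDR results combine into (ok, status), and the X-Forwarded-For path behind a trusted proxy.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/netip"
	"strings"
)

// bearerMatches (hypothetical helper) reports whether an Authorization
// header value carries the expected token. An empty configured token
// never matches, so an unset token cannot be "guessed" with "Bearer ".
func bearerMatches(header, want string) bool {
	const prefix = "Bearer "
	if want == "" || !strings.HasPrefix(header, prefix) {
		return false
	}
	got := strings.TrimPrefix(header, prefix)
	// ConstantTimeCompare returns 1 only when both slices are equal.
	return subtle.ConstantTimeCompare([]byte(got), []byte(want)) == 1
}

// addrTrusted (hypothetical helper) reports whether the IP in a
// "host:port" remote address falls inside any of the given CIDRs.
func addrTrusted(remoteAddr string, cidrs []string) bool {
	ap, err := netip.ParseAddrPort(remoteAddr)
	if err != nil {
		return false
	}
	addr := ap.Addr().Unmap() // treat ::ffff:a.b.c.d as plain IPv4
	for _, raw := range cidrs {
		p, err := netip.ParsePrefix(raw)
		if err == nil && p.Contains(addr) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(bearerMatches("Bearer s3cret", "s3cret"))
	fmt.Println(addrTrusted("10.1.2.3:40000", []string{"10.0.0.0/8"}))
}
```

Because validate() already rejects unparseable CIDRs at startup, a production version would likely parse the prefixes once and keep []netip.Prefix around rather than re-parsing per request as this sketch does.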
Step 4 — Hook job-finished
- internal/server/ws/handler.go: HandlerDeps grows Metrics *metrics.Registry.
- In the MsgJobFinished branch, after the GetJob lookup we already do, observe (job.Kind, p.Status.String(), p.FinishedAt.Sub(deref(job.StartedAt))). Skip if job.StartedAt is nil (rare race).
- cmd/server wires the registry into both Deps and HandlerDeps from a single instance.
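The job-finished hook, including the nil StartedAt skip and a nil-tolerant registry, might be shaped like this. Everything here is a stand-in: Job and Registry are reduced to what the example needs, and observeFinished and at are hypothetical names, not functions from the codebase.

```go
package main

import (
	"fmt"
	"time"
)

// Job is a stand-in with only the fields the hook reads.
type Job struct {
	Kind      string
	StartedAt *time.Time // nil until a start has been recorded
}

// Registry is a stand-in that just remembers observed durations.
type Registry struct{ observed []time.Duration }

// ObserveJob is safe to call on a nil *Registry, which keeps callers
// free of "is metrics wired up?" checks.
func (r *Registry) ObserveJob(kind, status string, dur time.Duration) {
	if r == nil {
		return
	}
	r.observed = append(r.observed, dur)
}

// observeFinished (hypothetical helper) records the job duration and
// reports false when the job has no start time and is therefore skipped.
func observeFinished(reg *Registry, job Job, status string, finishedAt time.Time) bool {
	if job.StartedAt == nil {
		return false // rare race: finish arrived before the start was stored
	}
	reg.ObserveJob(job.Kind, status, finishedAt.Sub(*job.StartedAt))
	return true
}

// at builds a timestamp from Unix seconds; a demo convenience only.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

func main() {
	reg := &Registry{}
	start := at(100)
	fmt.Println(observeFinished(reg, Job{Kind: "backup", StartedAt: &start}, "succeeded", at(130)))
	fmt.Println(observeFinished(reg, Job{Kind: "backup"}, "succeeded", at(130)))
	fmt.Println(reg.observed)
}
```

Checking StartedAt before dereferencing avoids both a nil-pointer panic and a bogus multi-decade duration that a zero time.Time would produce.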
Step 5 — Tests
- internal/server/metrics/registry_test.go: observe + snapshot determinism.
- internal/server/metrics/render_test.go: golden output for a fixed snapshot.
- internal/server/http/metrics_test.go: auth matrix (six cases per the spec) using the existing newTestServer fixture pattern. Render snapshot includes ≥1 host so we exercise the gather path end-to-end.
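To make the golden-render idea concrete, here is a sketch of rendering one histogram row in the Prometheus text exposition format, with the label escaping done by hand. The metric name passed in, the Bounds field, and the helper names are assumptions for illustration; the real Render takes a full Snapshot and emits more metric families.

```go
package main

import (
	"fmt"
	"io"
	"strconv"
	"strings"
)

// labelEscaper applies the three escapes the text format requires in
// label values: backslash, double quote, and newline.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

func escapeLabel(v string) string { return labelEscaper.Replace(v) }

// Histogram is a stand-in row; Buckets are cumulative and parallel to Bounds.
type Histogram struct {
	Kind, Status string
	Bounds       []float64
	Buckets      []uint64
	Sum          float64
	Count        uint64
}

func formatFloat(v float64) string { return strconv.FormatFloat(v, 'g', -1, 64) }

// renderHistogram writes the TYPE line, one _bucket line per bound, the
// mandatory +Inf bucket, and the _sum and _count lines, in fixed order.
func renderHistogram(w io.Writer, name string, h Histogram) error {
	labels := fmt.Sprintf(`kind="%s",status="%s"`, escapeLabel(h.Kind), escapeLabel(h.Status))
	var b strings.Builder
	fmt.Fprintf(&b, "# TYPE %s histogram\n", name)
	for i, le := range h.Bounds {
		fmt.Fprintf(&b, "%s_bucket{%s,le=\"%s\"} %d\n", name, labels, formatFloat(le), h.Buckets[i])
	}
	fmt.Fprintf(&b, "%s_bucket{%s,le=\"+Inf\"} %d\n", name, labels, h.Count)
	fmt.Fprintf(&b, "%s_sum{%s} %s\n", name, labels, formatFloat(h.Sum))
	fmt.Fprintf(&b, "%s_count{%s} %d\n", name, labels, h.Count)
	_, err := io.WriteString(w, b.String())
	return err
}

// renderToString is a convenience for golden comparisons.
func renderToString(name string, h Histogram) string {
	var b strings.Builder
	_ = renderHistogram(&b, name, h) // strings.Builder writes cannot fail
	return b.String()
}

func main() {
	fmt.Print(renderToString("demo_job_duration_seconds", Histogram{
		Kind: "backup", Status: "succeeded",
		Bounds: []float64{1, 10}, Buckets: []uint64{1, 2},
		Sum: 5, Count: 2,
	}))
}
```

A golden test then reduces to comparing renderToString's result for a fixed Histogram against a checked-in string, which only stays stable if the caller feeds rows in a deterministic order.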
Step 6 — Docs + dashboard (P6-05)
- docs/prometheus.md: enable + scrape config + metric reference + dashboard import.
- deploy/grafana/restic-manager-dashboard.json: six-panel dashboard. Hand-authored against the Grafana 11 dashboard schema (uid, schemaVersion, panels with targets[].expr, datasource as variable). Validated manually by importing into Grafana; because we can't run Grafana in CI, the automated structural sniff test only checks that the JSON parses and contains the expected panel titles and the datasource variable.
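The structural sniff test for the dashboard could be as small as the sketch below. The checkDashboard helper, the sample JSON, and the panel title in it are placeholders, not the shipped dashboard; only the field names (uid, schemaVersion, panels[].title, templating.list[].type) follow Grafana's dashboard JSON model.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// dashboard decodes only the fields the sniff test inspects.
type dashboard struct {
	UID           string `json:"uid"`
	SchemaVersion int    `json:"schemaVersion"`
	Panels        []struct {
		Title string `json:"title"`
	} `json:"panels"`
	Templating struct {
		List []struct {
			Name string `json:"name"`
			Type string `json:"type"`
		} `json:"list"`
	} `json:"templating"`
}

// checkDashboard (hypothetical helper) verifies that the JSON parses,
// has a uid and schemaVersion, contains every wanted panel title, and
// declares at least one template variable of type "datasource".
func checkDashboard(raw []byte, wantTitles []string) error {
	var d dashboard
	if err := json.Unmarshal(raw, &d); err != nil {
		return fmt.Errorf("dashboard does not parse: %w", err)
	}
	if d.UID == "" || d.SchemaVersion == 0 {
		return errors.New("dashboard is missing uid or schemaVersion")
	}
	have := make(map[string]bool, len(d.Panels))
	for _, p := range d.Panels {
		have[p.Title] = true
	}
	for _, t := range wantTitles {
		if !have[t] {
			return fmt.Errorf("missing panel %q", t)
		}
	}
	for _, v := range d.Templating.List {
		if v.Type == "datasource" {
			return nil
		}
	}
	return errors.New("no datasource template variable")
}

// sampleDashboard is a made-up minimal document for the demo.
const sampleDashboard = `{
  "uid": "demo",
  "schemaVersion": 39,
  "panels": [{"title": "Job duration", "type": "timeseries"}],
  "templating": {"list": [{"name": "datasource", "type": "datasource"}]}
}`

func main() {
	fmt.Println(checkDashboard([]byte(sampleDashboard), []string{"Job duration"}))
}
```

In the repository the same check would read deploy/grafana/restic-manager-dashboard.json from disk and pass the six real panel titles.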
Step 7 — Tasks.md + verification
- Strike P6-04, P6-05 in tasks.md; add an "as shipped" note mirroring the prior P6 entries.
- Run go vet ./..., go test ./..., make build.
- Push branch (no PR per standing instruction).
Risk register
- CIDR check for proxied scrapes: easy to mis-implement, easy to mis-document. The handler test must exercise both the "direct hit" and the "X-Forwarded-For" paths.
- Histogram lock contention: every job finish takes the mutex. Job throughput is low (a few per minute per host at most), and ObserveJob is a couple of map lookups; no risk in practice.
- Dashboard JSON drift: Grafana versions evolve. Pinning schemaVersion and using only well-supported panel types (timeseries, stat, table) keeps the import working across recent versions.