# Plan — P6-04 + P6-05 Prometheus metrics + Grafana dashboard

Spec: `docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md`

## Step 1 — Config wiring

- Add to `internal/server/config/config.go`:
  - `MetricsToken string` (yaml `metrics_token`)
  - `MetricsTrustedCIDRs []string` (yaml `metrics_trusted_cidrs`)
  - method `(c Config) MetricsAuthEnabled() bool`, returning true iff at least one of the two is configured.
- Env loading: `RM_METRICS_TOKEN` and `RM_METRICS_TRUSTED_CIDR` (comma-separated CIDRs).
- `validate()` extension: ensure each CIDR parses (reuse the `netip.ParsePrefix` pattern that already validates `TrustedProxies`).
- Tests: extend `config_test.go` to cover both env vars plus happy/sad CIDR cases.

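A minimal sketch of the config pieces listed above, assuming nothing about the rest of `Config`. The helper names `splitCIDRs` and `validateMetricsCIDRs` are hypothetical; the real code would live inside the existing env loader and `validate()`.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// Config shows only the two metrics fields from this step; the real struct
// in internal/server/config carries many more.
type Config struct {
	MetricsToken        string   `yaml:"metrics_token"`
	MetricsTrustedCIDRs []string `yaml:"metrics_trusted_cidrs"`
}

// MetricsAuthEnabled reports whether at least one auth mechanism is configured.
func (c Config) MetricsAuthEnabled() bool {
	return c.MetricsToken != "" || len(c.MetricsTrustedCIDRs) > 0
}

// splitCIDRs (hypothetical helper) parses the comma-separated env value,
// trimming whitespace and dropping empty entries.
func splitCIDRs(raw string) []string {
	var out []string
	for _, part := range strings.Split(raw, ",") {
		if p := strings.TrimSpace(part); p != "" {
			out = append(out, p)
		}
	}
	return out
}

// validateMetricsCIDRs (hypothetical helper) rejects any entry that
// netip.ParsePrefix cannot parse.
func validateMetricsCIDRs(cidrs []string) error {
	for _, raw := range cidrs {
		if _, err := netip.ParsePrefix(raw); err != nil {
			return fmt.Errorf("metrics_trusted_cidrs: invalid CIDR %q: %w", raw, err)
		}
	}
	return nil
}
```

Keeping the CIDRs as strings in the config and parsing them during validation matches the field type given in the plan; the handler can parse them once at startup.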
## Step 2 — `internal/server/metrics` package

- `Registry` struct: `sync.Mutex`, `map[jobKey]*histogramState` where `jobKey = struct{kind,status string}`.
- `ObserveJob(kind, status string, dur time.Duration)`: clamps negative durations to 0, takes the lock, then bumps the matching bucket, sum, and count.
- `Snapshot() Snapshot`: copies state under the lock; returns a plain value type.
- `Snapshot` carries `Histogram` rows (kind, status, buckets, sum, count); the caller fills in the rest (host rows, alert counts, build info).
- `Render(w io.Writer, s Snapshot) error`: emits the standard Prometheus text exposition format with stable line ordering. No external dependency; `\`, `"`, and `\n` in label values are escaped by hand per the Prometheus format spec.
- Unit tests: golden render, concurrent observe, bucket boundaries.

## Step 3 — HTTP handler

- New `internal/server/http/metrics.go`:
  - `(s *Server) handleMetrics(w, r)`: calls `authoriseMetricsScrape`, then `gatherSnapshot(ctx)`, then `metrics.Render`.
  - `authoriseMetricsScrape(r, cfg) (ok bool, status int)`: pure helper. The bearer token is compared with `subtle.ConstantTimeCompare`. The CIDR check runs against `r.RemoteAddr` first, then against `X-Forwarded-For` if a trusted proxy fronted us (mirror `realIP`'s logic; the simplest path is to reuse the `chi/middleware.RealIP`-aware lookup the existing handlers use).
  - `gatherSnapshot(ctx)`: assembles the snapshot from `Store.ListHosts`, `Store.ListAlerts({Status:"open"})`, the metrics registry, and `version.Version`/`version.Commit`/`runtime.Version()`.
- Route mounted in `server.go` only if `s.deps.Cfg.MetricsAuthEnabled()`.
- `Deps` grows a `Metrics *metrics.Registry` field; nil-tolerant in handlers.

## Step 4 — Hook job-finished

- `internal/server/ws/handler.go`:
  - `HandlerDeps` grows `Metrics *metrics.Registry`.
  - In the `MsgJobFinished` branch, after the `GetJob` lookup we already do, observe `(job.Kind, p.Status.String(), p.FinishedAt.Sub(deref(job.StartedAt)))`. Skip if `job.StartedAt` is nil (rare race).
- `cmd/server` wires the registry into both `Deps` and `HandlerDeps` from a single instance.

## Step 5 — Tests

- `internal/server/metrics/registry_test.go`: observe + snapshot determinism.
- `internal/server/metrics/render_test.go`: golden output for a fixed snapshot.
- `internal/server/http/metrics_test.go`: auth matrix (six cases per the spec) using the existing `newTestServer` fixture pattern. The rendered snapshot includes ≥1 host so the gather path is exercised end-to-end.

## Step 6 — Docs + dashboard (P6-05)

- `docs/prometheus.md`: enable + scrape config + metric reference + dashboard import.
- `deploy/grafana/restic-manager-dashboard.json`: six-panel dashboard, hand-authored against the Grafana 11 dashboard schema (uid, schemaVersion, panels with `targets[].expr`, datasource as variable). Validated manually by importing into Grafana; since we can't run Grafana in CI, the automated structural sniff test only checks that the JSON parses and contains the expected panel titles and the datasource variable.

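The structural sniff test mentioned above might be sketched as follows. `checkDashboard` and the `dashboard` struct are hypothetical, and the check assumes the datasource variable appears under `templating.list` with `type` set to `datasource`, which is how Grafana dashboard JSON usually models it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dashboard decodes only the fields the sniff test inspects.
type dashboard struct {
	UID           string `json:"uid"`
	SchemaVersion int    `json:"schemaVersion"`
	Panels        []struct {
		Title string `json:"title"`
	} `json:"panels"`
	Templating struct {
		List []struct {
			Name string `json:"name"`
			Type string `json:"type"`
		} `json:"list"`
	} `json:"templating"`
}

// checkDashboard verifies that raw parses as JSON, contains every
// expected panel title, and declares a datasource template variable.
func checkDashboard(raw []byte, wantTitles []string) error {
	var d dashboard
	if err := json.Unmarshal(raw, &d); err != nil {
		return fmt.Errorf("parse dashboard: %w", err)
	}
	have := make(map[string]bool, len(d.Panels))
	for _, p := range d.Panels {
		have[p.Title] = true
	}
	for _, t := range wantTitles {
		if !have[t] {
			return fmt.Errorf("missing panel %q", t)
		}
	}
	for _, v := range d.Templating.List {
		if v.Type == "datasource" {
			return nil
		}
	}
	return fmt.Errorf("no datasource template variable")
}
```

A test would read the dashboard file with `os.ReadFile` and pass the six expected titles; this catches accidental deletions but not rendering problems, which still need the manual import.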
## Step 7 — Tasks.md + verification

- Strike P6-04, P6-05 in `tasks.md`; add an "as shipped" note mirroring the prior P6 entries.
- Run `go vet ./...`, `go test ./...`, `make build`.
- Push branch (no PR per standing instruction).

## Risk register

- **CIDR check for proxied scrapes**: easy to mis-implement, easy to mis-document. The handler test must exercise both the "direct hit" and "X-Forwarded-For" paths.
- **Histogram lock contention**: every job finish takes the mutex. Job throughput is low (a few/min/host max) and `ObserveJob` is a couple of map lookups, so no risk in practice.
- **Dashboard JSON drift**: Grafana versions evolve. Pinning `schemaVersion` and using only well-supported panel types (timeseries, stat, table) keeps the import working across recent versions.