From 70ff554402108a166c67fd0ada46f93fa306618f Mon Sep 17 00:00:00 2001
From: Steve Cliff
Date: Thu, 7 May 2026 23:07:30 +0100
Subject: [PATCH] spec+plan: P6-04/05 prometheus /metrics + Grafana dashboard

---
 .../2026-05-07-p6-04-05-prometheus-metrics.md | 61 ++++++
 ...5-07-p6-04-05-prometheus-metrics-design.md | 175 ++++++++++++++++++
 2 files changed, 236 insertions(+)
 create mode 100644 docs/superpowers/plans/2026-05-07-p6-04-05-prometheus-metrics.md
 create mode 100644 docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md

diff --git a/docs/superpowers/plans/2026-05-07-p6-04-05-prometheus-metrics.md b/docs/superpowers/plans/2026-05-07-p6-04-05-prometheus-metrics.md
new file mode 100644
index 0000000..83c24c6
--- /dev/null
+++ b/docs/superpowers/plans/2026-05-07-p6-04-05-prometheus-metrics.md
@@ -0,0 +1,61 @@
+# Plan — P6-04 + P6-05 Prometheus metrics + Grafana dashboard
+
+Spec: `docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md`
+
+## Step 1 — Config wiring
+
+- Add fields to `internal/server/config/config.go`:
+  - `MetricsToken string` (yaml `metrics_token`)
+  - `MetricsTrustedCIDRs []string` (yaml `metrics_trusted_cidrs`)
+  - method `(c Config) MetricsAuthEnabled() bool` returning true iff at least one of the two is configured.
+- Env loading: `RM_METRICS_TOKEN` and `RM_METRICS_TRUSTED_CIDR` (comma-CIDR).
+- `validate()` extension: ensure each CIDR parses (reuse the same `netip.ParsePrefix` pattern that already validates `TrustedProxies`).
+- Tests: extend `config_test.go` covering both env vars + happy/sad CIDR.
+
+## Step 2 — `internal/server/metrics` package
+
+- `Registry` struct: `sync.Mutex`, `map[jobKey]*histogramState` where `jobKey = struct{kind,status string}`.
+- `ObserveJob(kind, status string, dur time.Duration)` — clamps negative durations to 0; locks; bumps the right bucket + sum + count.
+- `Snapshot() Snapshot` — copies state under lock; returns plain value type.
+- `Snapshot` carries `Histogram` rows (kind, status, buckets, sum, count) and accepts the rest from the caller (host rows, alert counts, build info).
+- `Render(w io.Writer, s Snapshot) error` — emits standard text exposition with stable line ordering. No external dep; manual escape of `\` `"` `\n` in label values per the Prom format spec.
+- Unit tests: golden render, concurrent observe, bucket boundaries.
+
+## Step 3 — HTTP handler
+
+- New `internal/server/http/metrics.go`:
+  - `(s *Server) handleMetrics(w, r)` — calls `authoriseMetricsScrape`, then `gatherSnapshot(ctx)`, then `metrics.Render`.
+  - `authoriseMetricsScrape(r, cfg) (ok bool, status int)` — pure helper; bearer token compared with `subtle.ConstantTimeCompare`; CIDR check on `r.RemoteAddr` first, then `X-Forwarded-For` if a trusted proxy fronted us (mirror `realIP`'s logic; simplest path is to call the `chi/middleware.RealIP`-aware lookup the existing handlers use).
+  - `gatherSnapshot(ctx)` — assembles the snapshot from `Store.ListHosts`, `Store.ListAlerts({Status:"open"})`, the metrics registry, and `version.Version`/`version.Commit`/`runtime.Version()`.
+- Route mounted in `server.go` only if `s.deps.Cfg.MetricsAuthEnabled()`.
+- `Deps` grows a `Metrics *metrics.Registry` field; nil-tolerant in handlers.
+
+## Step 4 — Hook job-finished
+
+- `internal/server/ws/handler.go`:
+  - `HandlerDeps` grows `Metrics *metrics.Registry`.
+  - In the `MsgJobFinished` branch, after the `GetJob` lookup we already do, observe `(job.Kind, p.Status.String(), p.FinishedAt.Sub(deref(job.StartedAt)))`. Skip if `job.StartedAt` is nil (rare race).
+- `cmd/server` wires the registry into both `Deps` and `HandlerDeps` from a single instance.
+
+## Step 5 — Tests
+
+- `internal/server/metrics/registry_test.go` — observe + snapshot determinism.
+- `internal/server/metrics/render_test.go` — golden output for a fixed snapshot.
+- `internal/server/http/metrics_test.go` — auth matrix (six cases per the spec) using the existing `newTestServer` fixture pattern. Render snapshot includes ≥1 host so we exercise the gather path end-to-end.
+
+## Step 6 — Docs + dashboard (P6-05)
+
+- `docs/prometheus.md` — enable + scrape config + metric reference + dashboard import.
+- `deploy/grafana/restic-manager-dashboard.json` — six-panel dashboard. Hand-authored against Grafana 11 dashboard schema (uid, schemaVersion, panels with `targets[].expr`, datasource as variable). Validated by importing into Grafana — but since we can't run Grafana in CI, the structural sniff test is just that the JSON parses and contains the expected panel titles + datasource variable.
+
+## Step 7 — Tasks.md + verification
+
+- Strike P6-04, P6-05 in `tasks.md`; add an "as shipped" note mirroring the prior P6 entries.
+- Run `go vet ./...`, `go test ./...`, `make build`.
+- Push branch (no PR per standing instruction).
+
+## Risk register
+
+- **CIDR check for proxied scrapes** — easy to mis-implement, easy to mis-document. The handler test must exercise both "direct hit" and "X-Forwarded-For" paths.
+- **Histogram lock contention** — every job finish takes the mutex. Job throughput is low (a few/min/host max), and `ObserveJob` is a couple of map lookups; no risk in practice.
+- **Dashboard JSON drift** — Grafana versions evolve. Pinning `schemaVersion` and using only well-supported panel types (timeseries, stat, table) keeps the import working across recent versions.
diff --git a/docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md b/docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md
new file mode 100644
index 0000000..6593c11
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md
@@ -0,0 +1,175 @@
+# P6-04 + P6-05 — Prometheus `/metrics` + Grafana dashboard
+
+Date: 2026-05-07
+Author: Claude (autonomous, sensible-defaults brief from operator)
+Tasks: P6-04 (M), P6-05 (S)
+
+## Problem
+
+The control plane already knows everything a backup operator needs
+to monitor — last-backup timestamp + status, repo size, snapshot
+count, agent online, open alerts, build version — but it surfaces
+those only through the dashboard HTML and a few JSON endpoints. To
+plug into the operator's existing observability stack we need a
+plain Prometheus exposition endpoint and a Grafana dashboard JSON
+that reads from it.
+
+## Goals
+
+- `GET /metrics` emits standard Prometheus text-format with the
+  per-host, server, and job-duration metrics enumerated in the
+  task entry (P6-04 in `tasks.md`).
+- Endpoint is opt-in and gated by a bearer token and/or an IP
+  allow-list — never publicly readable by default.
+- No new third-party dependency (`prometheus/client_golang` is not
+  pulled in). The exposition format is small and stable enough to
+  emit by hand; matches the repo's "no Tailwind/Node" style.
+- Sample Grafana dashboard committed to the repo so a stranger can
+  drop it into a Grafana instance and get a working view.
+
+## Non-goals
+
+- OpenMetrics (the legacy text format with `# HELP`/`# TYPE` is
+  what every prom server still parses and what every example
+  online demonstrates — pick the boring option).
+- Pushgateway or remote-write integration.
+- Per-job metric cardinality (no `job_id` labels — that would
+  make the histogram explode).
+- Alerting rules. Operators already have alerts inside
+  restic-manager (P3-05); duplicating them in Prometheus is a
+  YAGNI hazard. The dashboard is read-only.
+
+## Auth
+
+Two switches, both off by default. If neither is set the route
+isn't mounted at all (404 from the chi router) — this avoids any
+accidental "wide-open scrape endpoint" deployment.
+
+| env var | type | meaning |
+| --- | --- | --- |
+| `RM_METRICS_TOKEN` | string | If set, callers must send `Authorization: Bearer <token>`. Compared with `crypto/subtle.ConstantTimeCompare`. |
+| `RM_METRICS_TRUSTED_CIDR` | comma-CIDR | If set, callers must hit from a source IP inside one of these CIDRs. Reuses the existing `RM_TRUSTED_PROXY` semantics for honouring `X-Forwarded-For`. |
+
+If both are set, both must pass (AND, not OR — a token leak doesn't grant access from outside the network, and a trusted-network compromise doesn't grant access without the token). If only one is set, that one alone gates access.
+
+YAML overlay mirrors env: `metrics_token`, `metrics_trusted_cidrs`.
+
+## Metrics
+
+All metric names are prefixed `rm_`. Help text is concise.
+
+### Per-host gauges (one row per `host_id`)
+
+```
+rm_host_agent_online{host_id,host}                   1 if status='online' else 0
+rm_host_last_backup_timestamp_seconds{host_id,host}  unix seconds; omitted if no backup yet
+rm_host_last_backup_success{host_id,host}            1 if last_backup_status='succeeded' else 0; omitted if no backup yet
+rm_host_repo_size_bytes{host_id,host}                total_size from latest repo stats; omitted if unknown
+rm_host_snapshot_count{host_id,host}                 integer
+rm_host_open_alerts{host_id,host}                    count of open + un-resolved alerts attached to this host
+rm_host_repo_status{host_id,host,status}             1, with status ∈ {unknown,ready,init_failed} (info-style, exactly one row per host)
+```
+
+`host` label is `hosts.name` for human readability; `host_id` is
+the stable ULID for joining across renames.
+
+### Server gauges
+
+```
+rm_hosts_total                            count of hosts (excludes pending)
+rm_hosts_online                           count of hosts with status='online'
+rm_active_alerts{severity}                count of open alerts by severity ∈ {info,warning,critical}
+rm_build_info{version,commit,go_version}  always 1; pure label-bag for joining
+```
+
+### Job duration histogram
+
+```
+rm_job_duration_seconds_bucket{kind,status,le=...}
+rm_job_duration_seconds_sum{kind,status}
+rm_job_duration_seconds_count{kind,status}
+```
+
+`kind` ∈ {backup,forget,prune,check,unlock,restore,diff,init,update}
+(every JobKind we currently dispatch). `status` ∈
+{succeeded,failed,cancelled}. Buckets cover the realistic range —
+short admin commands (unlock, init) finish in seconds; backups can
+be hours:
+
+```
+1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf
+  (1s 5s 30s 1m 5m 30m 1h 6h 24h)
+```
+
+In-memory only. Reset on process restart — operators who want
+durable history scrape into Prom and let it persist.
+
+## Architecture
+
+New package `internal/server/metrics`:
+
+- `Registry` — owns the histogram state (sync.Mutex + map keyed by
+  `kind+status`). `ObserveJob(kind, status string, dur time.Duration)`
+  is the only mutator. Lookups via `Snapshot()` are read-only and
+  copy out.
+- `Render(w io.Writer, snapshot Snapshot)` — emits the full
+  exposition body. The snapshot is supplied by the HTTP handler
+  pulling from `Store` on each scrape; the package itself has no
+  store dependency, which keeps it trivially unit-testable.
+
+New file `internal/server/http/metrics.go`:
+
+- `handleMetrics(w, r)` — auth check (bearer + CIDR), pull current
+  fleet snapshot from `Store`, ask `metrics.Render` to emit.
+- Auth helper `authoriseMetricsScrape(r)` — pure function over
+  request + config; tested directly.
+
+Wiring:
+
+- `cmd/server` constructs the `metrics.Registry` once and threads
+  it into both `Deps` (for the HTTP layer) and `ws.HandlerDeps`
+  (so the job-finished branch can call `ObserveJob`).
+- `ws/handler.go` MsgJobFinished branch grows a single line:
+  `if deps.Metrics != nil && job.StartedAt != nil { deps.Metrics.ObserveJob(job.Kind, p.Status.String(), p.FinishedAt.Sub(*job.StartedAt)) }`.
+  Falls back gracefully if the registry was never wired (tests).
+
+Route registration in `server.go`:
+
+```go
+if s.deps.Cfg.MetricsAuthEnabled() {
+	r.Get("/metrics", s.handleMetrics)
+}
+```
+
+## Cardinality + cost
+
+Per scrape: O(hosts) gauge rows + O(kinds × statuses × buckets) histogram rows. For a 100-host fleet that's ~700 host rows + at most ~270 histogram bucket rows (9 kinds × 3 statuses × 10 buckets; ~320 once `_sum` and `_count` are included) per scrape — well under any practical limit. The store reads we issue per scrape are: `ListHosts` (already exists, one query), `ListAlerts` filtered by open status (one query), `GetHostRepoStats` already projected onto `Host` via `repo_size_bytes`. No N+1.
+
+A 10s scrape interval against a 100-host fleet is cheap: each scrape is a couple of indexed sqlite reads + a small string render. We're nowhere near a place where caching the snapshot would be worthwhile.
+
+## Documentation (P6-05)
+
+- `docs/prometheus.md` — sibling to the existing `docs/reverse-proxy.md`. Sections: enabling the endpoint (env vars), Prometheus scrape config snippet (with bearer + tls), the metric reference table (copy-pasted from this spec), the dashboard import instructions.
+- `deploy/grafana/restic-manager-dashboard.json` — Grafana 11+ dashboard JSON. Single Prometheus datasource variable, six panels:
+  1. **Fleet status** — stat panel showing `rm_hosts_online / rm_hosts_total` + a sparkline.
+  2. **Open alerts** — stat panel by severity (`sum by (severity) (rm_active_alerts)`).
+  3. **Hosts** — table of `host`, `online`, `last_backup` (relative time via `time() - rm_host_last_backup_timestamp_seconds`), `repo_size`, `snapshots`.
+  4. **Repo size over time** — time series, one line per host, `rm_host_repo_size_bytes`.
+  5. **Backups failing** — time series counting hosts where `rm_host_last_backup_success == 0`.
+  6. **Job duration p95** — `histogram_quantile(0.95, sum by (kind, le) (rate(rm_job_duration_seconds_bucket[1h])))` over a 1h window.
+
+Dashboard is committed as plain JSON; an operator imports it through the Grafana UI ("+ → Import → upload JSON file") or Grafana provisioning.
+
+## Testing
+
+- Unit tests for `metrics.Render` against a fixed snapshot — golden-file style. Pin the exact line ordering (sorted by metric name + label set) so diffs stay tractable.
+- Unit tests for `metrics.Registry.ObserveJob` — concurrent writes, bucket boundary correctness, snapshot independence.
+- Handler tests for `/metrics` covering: no auth configured → 404; token configured + missing → 401; token configured + correct → 200 + body sniff; CIDR configured + wrong source → 401; CIDR configured + right source → 200; both configured → require both.
+- End-to-end smoke verification deferred to manual operator walk-through; full Playwright pass is P5-06's job.
+
+## Out of scope, explicitly
+
+- Per-job latency tracking with `job_id` labels (cardinality bomb).
+- Restore-specific metrics (P3 surfaces are still settling).
+- Histograms keyed on host (kind × status × buckets × hosts is a Prom anti-pattern).
+- Auto-discovery / file-SD generators for Prometheus.