P6-04+05: Prometheus /metrics endpoint + Grafana dashboard

New internal/server/metrics package emits the legacy text/plain
exposition format directly, so we don't pull in
prometheus/client_golang. Endpoint is opt-in via RM_METRICS_TOKEN
and/or RM_METRICS_TRUSTED_CIDR; route is not mounted at all if
neither gate is set. When both are configured, both must pass.

Per-host gauges (online, last_backup_*, repo_size_bytes,
snapshot_count, open_alerts, repo_status), server gauges
(hosts_total/online, active_alerts by severity, build_info), and
an in-memory job-duration histogram observed from the existing
MsgJobFinished branch in the WS handler.

Docs in docs/prometheus.md (enable + scrape config + metric
reference + dashboard import). Sample dashboard at
deploy/grafana/restic-manager-dashboard.json - six panels,
Grafana schema 39, single Prometheus datasource variable.

Tests: golden render, concurrent observe, bucket boundaries in
the metrics package; auth matrix (no gate set -> 404, token gate,
CIDR gate, both required) in the HTTP layer.
2026-05-07 23:17:15 +01:00
parent 07bce16c84
commit ccd14f7cee
12 changed files with 1480 additions and 2 deletions
+39 -2
@@ -390,8 +390,45 @@ Sizes: **S** = under a day, **M** = 1-3 days, **L** = 3-7 days.
> swap, helper `buildRepoTrendView` shared between page-load and
> fragment endpoint). No new dependencies, no client JS, no agent
> change. CI green; in-browser smoke walk-through pending operator.
- [ ] **P6-04** (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list. _(Was P4-08.)_
- [ ] **P6-05** (S) Document Prometheus integration + sample Grafana dashboard JSON. _(Was P4-09.)_
- [x] **P6-04** (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list. _(Was P4-08.)_
- [x] **P6-05** (S) Document Prometheus integration + sample Grafana dashboard JSON. _(Was P4-09.)_
> **As shipped (2026-05-07, branch `p6-04-05-prometheus-metrics`):**
> Spec `docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md`,
> plan `docs/superpowers/plans/2026-05-07-p6-04-05-prometheus-metrics.md`.
> New `internal/server/metrics` package emits the legacy
> `text/plain; version=0.0.4` exposition format directly — no
> `prometheus/client_golang` dependency, matching the repo's
> "no Tailwind, no Node" minimal-deps style. `/metrics` is **opt-in**:
> `RM_METRICS_TOKEN` and/or `RM_METRICS_TRUSTED_CIDR` must be set or
> the route isn't mounted at all (404). When both are set, both must
> pass; either alone gates access. Token compare is constant-time.
> CIDR check honours `X-Forwarded-For` only when the immediate hop
> is a configured `RM_TRUSTED_PROXY` (mirrors the existing realIP
> resolution).
>
> **Metrics:** per-host gauges (`rm_host_agent_online`,
> `rm_host_last_backup_timestamp_seconds`, `rm_host_last_backup_success`,
> `rm_host_repo_size_bytes`, `rm_host_snapshot_count`,
> `rm_host_open_alerts`, `rm_host_repo_status`); server gauges
> (`rm_hosts_total`, `rm_hosts_online`, `rm_active_alerts{severity}`,
> `rm_build_info{version,commit,go_version}`); histogram
> `rm_job_duration_seconds_bucket{kind,status,le}` with buckets
> `1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf`.
> Histogram is in-memory; observations come from the existing
> `MsgJobFinished` branch in `internal/server/ws/handler.go`.
>
> **Docs:** `docs/prometheus.md` covers enable + scrape config +
> metric reference + dashboard import. **Dashboard:**
> `deploy/grafana/restic-manager-dashboard.json` — six panels
> (fleet status, open alerts, backups failing, hosts table, repo
> size over time, job-duration p95). Schema 39, single Prometheus
> datasource variable.
>
> **Tests:** golden-render + concurrent-observe + bucket-boundary
> in the metrics package; auth matrix (no gate configured → 404; token
> missing/wrong/right; CIDR matching/non-matching; token AND CIDR)
> in the HTTP layer.
### Phase 6 acceptance