P6-04+05: Prometheus /metrics endpoint + Grafana dashboard
CI / Test (rest) (pull_request) Successful in 41s
CI / Test (store) (pull_request) Successful in 43s
CI / Lint (pull_request) Successful in 29s
CI / Build (windows/amd64) (pull_request) Successful in 44s
CI / Test (server-http) (pull_request) Successful in 1m47s
CI / Build (linux/arm64) (pull_request) Successful in 43s
CI / Build (linux/amd64) (pull_request) Successful in 2m1s
New internal/server/metrics package emits the legacy text/plain exposition format directly, so we don't pull in prometheus/client_golang.

The endpoint is opt-in via RM_METRICS_TOKEN and/or RM_METRICS_TRUSTED_CIDR; the route is not mounted at all if neither gate is set. When both are configured, a request must pass both.

Metrics: per-host gauges (online, last_backup_*, repo_size_bytes, snapshot_count, open_alerts, repo_status), server gauges (hosts_total/online, active_alerts by severity, build_info), and an in-memory job-duration histogram observed from the existing MsgJobFinished branch in the WS handler.

Docs in docs/prometheus.md (enable + scrape config + metric reference + dashboard import). Sample dashboard at deploy/grafana/restic-manager-dashboard.json: six panels, Grafana schema 39, single Prometheus datasource variable.

Tests: golden render, concurrent observe, and bucket boundaries in the metrics package; auth matrix (no auth -> 404, token gate, CIDR gate, both required) in the HTTP layer.
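The gating rules described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: `gate`, `newGate`, `mount`, and `statusFor` are hypothetical names, the 403 returned on rejection is an assumption (the message only specifies a 404 when no gate is configured), and proxy-aware client IP resolution is left out, so the sketch checks `RemoteAddr` directly.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"strings"
)

// gate holds the two optional checks. An empty token or a nil cidr means
// that check is not configured. All names here are hypothetical.
type gate struct {
	token string
	cidr  *net.IPNet
}

// newGate builds a gate from the raw values of the two settings.
// A real server would report a config error instead of panicking.
func newGate(token, cidr string) gate {
	g := gate{token: token}
	if cidr != "" {
		_, n, err := net.ParseCIDR(cidr)
		if err != nil {
			panic(err)
		}
		g.cidr = n
	}
	return g
}

// allow applies every configured check; all of them must pass.
func (g gate) allow(r *http.Request) bool {
	if g.token != "" {
		got, ok := strings.CutPrefix(r.Header.Get("Authorization"), "Bearer ")
		if !ok || subtle.ConstantTimeCompare([]byte(got), []byte(g.token)) != 1 {
			return false
		}
	}
	if g.cidr != nil {
		host, _, err := net.SplitHostPort(r.RemoteAddr)
		if err != nil {
			return false
		}
		if ip := net.ParseIP(host); ip == nil || !g.cidr.Contains(ip) {
			return false
		}
	}
	return true
}

// mount registers /metrics only when at least one gate is configured.
func mount(mux *http.ServeMux, g gate, h http.Handler) {
	if g.token == "" && g.cidr == nil {
		return // neither gate set: route stays unmounted, so requests get 404
	}
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		if !g.allow(r) {
			http.Error(w, "forbidden", http.StatusForbidden) // status code is an assumption
			return
		}
		h.ServeHTTP(w, r)
	})
}

// statusFor issues one in-memory request and returns the response status.
func statusFor(g gate, authHeader, remoteAddr string) int {
	mux := http.NewServeMux()
	mount(mux, g, http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprintln(w, "# metrics would be rendered here")
	}))
	req := httptest.NewRequest(http.MethodGet, "/metrics", nil)
	req.RemoteAddr = remoteAddr
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	both := newGate("s3cret", "10.0.0.0/8")
	fmt.Println(statusFor(newGate("", ""), "", "10.1.2.3:5000"))    // no gate configured
	fmt.Println(statusFor(both, "Bearer s3cret", "10.1.2.3:5000"))  // token and CIDR both pass
	fmt.Println(statusFor(both, "Bearer s3cret", "192.0.2.7:5000")) // token passes, CIDR fails
}
```

Leaving the route unmounted when nothing is configured means a default install reveals nothing about the endpoint, which is why the first case yields a 404 rather than an auth error.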
@@ -390,8 +390,45 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
> swap, helper `buildRepoTrendView` shared between page-load and
> fragment endpoint). No new dependencies, no client JS, no agent
> change. CI green; in-browser smoke walk-through pending operator.
- [ ] **P6-04** (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list. _(Was P4-08.)_
- [ ] **P6-05** (S) Document Prometheus integration + sample Grafana dashboard JSON. _(Was P4-09.)_
- [x] **P6-04** (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list. _(Was P4-08.)_
- [x] **P6-05** (S) Document Prometheus integration + sample Grafana dashboard JSON. _(Was P4-09.)_
> **As shipped (2026-05-07, branch `p6-04-05-prometheus-metrics`):**
> Spec `docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md`,
> plan `docs/superpowers/plans/2026-05-07-p6-04-05-prometheus-metrics.md`.
> New `internal/server/metrics` package emits the legacy
> `text/plain; version=0.0.4` exposition format directly — no
> `prometheus/client_golang` dependency, matching the repo's
> "no Tailwind, no Node" minimal-deps style. `/metrics` is **opt-in**:
> `RM_METRICS_TOKEN` and/or `RM_METRICS_TRUSTED_CIDR` must be set or
> the route isn't mounted at all (404). When both are set, both must
> pass; either alone gates access. Token compare is constant-time.
> CIDR check honours `X-Forwarded-For` only when the immediate hop
> is a configured `RM_TRUSTED_PROXY` (mirrors the existing realIP
> resolution).
>
> **Metrics:** per-host gauges (`rm_host_agent_online`,
> `rm_host_last_backup_timestamp_seconds`, `rm_host_last_backup_success`,
> `rm_host_repo_size_bytes`, `rm_host_snapshot_count`,
> `rm_host_open_alerts`, `rm_host_repo_status`); server gauges
> (`rm_hosts_total`, `rm_hosts_online`, `rm_active_alerts{severity}`,
> `rm_build_info{version,commit,go_version}`); histogram
> `rm_job_duration_seconds_bucket{kind,status,le}` with buckets
> `1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf`.
> Histogram is in-memory; observations come from the existing
> `MsgJobFinished` branch in `internal/server/ws/handler.go`.
>
> **Docs:** `docs/prometheus.md` covers enable + scrape config +
> metric reference + dashboard import. **Dashboard:**
> `deploy/grafana/restic-manager-dashboard.json` — six panels
> (fleet status, open alerts, backups failing, hosts table, repo
> size over time, job-duration p95). Schema 39, single Prometheus
> datasource variable.
>
> **Tests:** golden-render + concurrent-observe + bucket-boundary
> in the metrics package; auth matrix (no auth → 404; token
> missing/wrong/right; CIDR matching/non-matching; token AND CIDR)
> in the HTTP layer.
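
As a rough illustration of the in-memory histogram and hand-written exposition output described in the note above, here is a minimal sketch. It is not the shipped package: `histogram`, `observe`, `cumulative`, and `render` are hypothetical names, the `backup` / `succeeded` label values are made up, the sketch tracks a single series rather than one per `kind`/`status` pair, and the `_sum` and `_count` lines are assumed from the standard Prometheus histogram convention rather than taken from the note.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"sync"
)

// bounds are the finite bucket upper limits in seconds, as listed in the note;
// +Inf is the implicit final bucket.
var bounds = []float64{1, 5, 30, 60, 300, 1800, 3600, 21600, 86400}

// histogram is a hypothetical mutex-guarded in-memory histogram.
type histogram struct {
	mu     sync.Mutex
	counts []uint64 // per-bucket counts; the last slot is +Inf
	sum    float64
}

func newHistogram() *histogram {
	return &histogram{counts: make([]uint64, len(bounds)+1)}
}

// observe records one duration. Upper bounds are inclusive ("le").
func (h *histogram) observe(seconds float64) {
	i := 0
	for i < len(bounds) && seconds > bounds[i] {
		i++
	}
	h.mu.Lock()
	h.counts[i]++
	h.sum += seconds
	h.mu.Unlock()
}

// cumulative returns running totals per bucket plus the sum of observations.
func (h *histogram) cumulative() ([]uint64, float64) {
	h.mu.Lock()
	defer h.mu.Unlock()
	out := make([]uint64, len(h.counts))
	var run uint64
	for i, c := range h.counts {
		run += c
		out[i] = run
	}
	return out, h.sum
}

// render writes the series in the text exposition format.
// %q is only adequate here because the label values are simple ASCII.
func (h *histogram) render(kind, status string) string {
	cum, sum := h.cumulative()
	var b strings.Builder
	b.WriteString("# TYPE rm_job_duration_seconds histogram\n")
	for i, n := range cum {
		le := "+Inf"
		if i < len(bounds) {
			le = strconv.FormatFloat(bounds[i], 'f', -1, 64)
		}
		fmt.Fprintf(&b, "rm_job_duration_seconds_bucket{kind=%q,status=%q,le=%q} %d\n", kind, status, le, n)
	}
	fmt.Fprintf(&b, "rm_job_duration_seconds_sum{kind=%q,status=%q} %s\n",
		kind, status, strconv.FormatFloat(sum, 'f', -1, 64))
	fmt.Fprintf(&b, "rm_job_duration_seconds_count{kind=%q,status=%q} %d\n",
		kind, status, cum[len(cum)-1])
	return b.String()
}

func main() {
	h := newHistogram()
	var wg sync.WaitGroup
	for _, d := range []float64{0.5, 5, 5.1, 100000} {
		wg.Add(1)
		go func(d float64) { // concurrent observers, as a WS handler might produce
			defer wg.Done()
			h.observe(d)
		}(d)
	}
	wg.Wait()
	fmt.Print(h.render("backup", "succeeded"))
}
```

Storing per-bucket counts and accumulating only at render time keeps `observe` to a single increment under the lock; the bucket-boundary case worth testing is an observation exactly equal to a bound, which must land in that bound's bucket.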
### Phase 6 acceptance