From 70ff554402108a166c67fd0ada46f93fa306618f Mon Sep 17 00:00:00 2001
From: Steve Cliff
Date: Thu, 7 May 2026 23:07:30 +0100
Subject: [PATCH 1/2] spec+plan: P6-04/05 prometheus /metrics + Grafana dashboard

---
 .../2026-05-07-p6-04-05-prometheus-metrics.md |  61 ++++++
 ...5-07-p6-04-05-prometheus-metrics-design.md | 175 ++++++++++++++++++
 2 files changed, 236 insertions(+)
 create mode 100644 docs/superpowers/plans/2026-05-07-p6-04-05-prometheus-metrics.md
 create mode 100644 docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md

diff --git a/docs/superpowers/plans/2026-05-07-p6-04-05-prometheus-metrics.md b/docs/superpowers/plans/2026-05-07-p6-04-05-prometheus-metrics.md
new file mode 100644
index 0000000..83c24c6
--- /dev/null
+++ b/docs/superpowers/plans/2026-05-07-p6-04-05-prometheus-metrics.md
@@ -0,0 +1,61 @@
+# Plan — P6-04 + P6-05 Prometheus metrics + Grafana dashboard
+
+Spec: `docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md`
+
+## Step 1 — Config wiring
+
+- Add fields to `internal/server/config/config.go`:
+  - `MetricsToken string` (yaml `metrics_token`)
+  - `MetricsTrustedCIDRs []string` (yaml `metrics_trusted_cidrs`)
+  - method `(c Config) MetricsAuthEnabled() bool` returning true iff at least one of the two is configured.
+- Env loading: `RM_METRICS_TOKEN` and `RM_METRICS_TRUSTED_CIDR` (comma-CIDR).
+- `validate()` extension: ensure each CIDR parses (reuse the same `netip.ParsePrefix` pattern that already validates `TrustedProxies`).
+- Tests: extend `config_test.go` covering both env vars + happy/sad CIDR.
+
+## Step 2 — `internal/server/metrics` package
+
+- `Registry` struct: `sync.Mutex`, `map[jobKey]*histogramState` where `jobKey = struct{kind,status string}`.
+- `ObserveJob(kind, status string, dur time.Duration)` — clamps negative durations to 0; locks; bumps the right bucket + sum + count.
+- `Snapshot() Snapshot` — copies state under lock; returns plain value type.
+- `Snapshot` carries `Histogram` rows (kind, status, buckets, sum, count) and accepts the rest from the caller (host rows, alert counts, build info).
+- `Render(w io.Writer, s Snapshot) error` — emits standard text exposition with stable line ordering. No external dep; manual escape of `\` `"` `\n` in label values per the Prom format spec.
+- Unit tests: golden render, concurrent observe, bucket boundaries.
+
+## Step 3 — HTTP handler
+
+- New `internal/server/http/metrics.go`:
+  - `(s *Server) handleMetrics(w, r)` — calls `authoriseMetricsScrape`, then `gatherSnapshot(ctx)` then `metrics.Render`.
+  - `authoriseMetricsScrape(r, cfg) (ok bool, status int)` — pure helper; bearer token compared with `subtle.ConstantTimeCompare`; CIDR check on `r.RemoteAddr` first, then `X-Forwarded-For` if a trusted proxy fronted us (mirror `realIP`'s logic; simplest path is to call `chi/middleware.RealIP`-aware lookup the existing handlers use).
+  - `gatherSnapshot(ctx)` — assembles the snapshot from `Store.ListHosts`, `Store.ListAlerts({Status:"open"})`, the metrics registry, and `version.Version`/`version.Commit`/`runtime.Version()`.
+- Route mounted in `server.go` only if `s.deps.Cfg.MetricsAuthEnabled()`.
+- `Deps` grows a `Metrics *metrics.Registry` field; nil-tolerant in handlers.
+
+## Step 4 — Hook job-finished
+
+- `internal/server/ws/handler.go`:
+  - `HandlerDeps` grows `Metrics *metrics.Registry`.
+  - In the `MsgJobFinished` branch, after the `GetJob` lookup we already do, observe `(job.Kind, p.Status.String(), p.FinishedAt.Sub(deref(job.StartedAt)))`. Skip if `job.StartedAt` is nil (rare race).
+- `cmd/server` wires the registry into both `Deps` and `HandlerDeps` from a single instance.
+
+## Step 5 — Tests
+
+- `internal/server/metrics/registry_test.go` — observe + snapshot determinism.
+- `internal/server/metrics/render_test.go` — golden output for a fixed snapshot.
+- `internal/server/http/metrics_test.go` — auth matrix (six cases per the spec) using the existing `newTestServer` fixture pattern. Render snapshot includes ≥1 host so we exercise the gather path end-to-end.
+
+## Step 6 — Docs + dashboard (P6-05)
+
+- `docs/prometheus.md` — enable + scrape config + metric reference + dashboard import.
+- `deploy/grafana/restic-manager-dashboard.json` — six-panel dashboard. Hand-authored against Grafana 11 dashboard schema (uid, schemaVersion, panels with `targets[].expr`, datasource as variable). Validated by importing into Grafana — but since we can't run Grafana in CI, the structural sniff test is just that the JSON parses and contains the expected panel titles + datasource variable.
+
+## Step 7 — Tasks.md + verification
+
+- Strike P6-04, P6-05 in `tasks.md`; add an "as shipped" note mirroring the prior P6 entries.
+- Run `go vet ./...`, `go test ./...`, `make build`.
+- Push branch (no PR per standing instruction).
+
+## Risk register
+
+- **CIDR check for proxied scrapes** — easy to mis-implement, easy to mis-document. The handler test must exercise both "direct hit" and "X-Forwarded-For" paths.
+- **Histogram lock contention** — every job finish takes the mutex. Job throughput is low (a few/min/host max), and `ObserveJob` is a couple of map lookups; no risk in practice.
+- **Dashboard JSON drift** — Grafana versions evolve. Pinning `schemaVersion` and using only well-supported panel types (timeseries, stat, table) keeps the import working across recent versions.
diff --git a/docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md b/docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md
new file mode 100644
index 0000000..6593c11
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md
@@ -0,0 +1,175 @@
+# P6-04 + P6-05 — Prometheus `/metrics` + Grafana dashboard
+
+Date: 2026-05-07
+Author: Claude (autonomous, sensible-defaults brief from operator)
+Tasks: P6-04 (M), P6-05 (S)
+
+## Problem
+
+The control plane already knows everything a backup operator needs
+to monitor — last-backup timestamp + status, repo size, snapshot
+count, agent online, open alerts, build version — but it surfaces
+those only through the dashboard HTML and a few JSON endpoints. To
+plug into the operator's existing observability stack we need a
+plain Prometheus exposition endpoint and a Grafana dashboard JSON
+that reads from it.
+
+## Goals
+
+- `GET /metrics` emits standard Prometheus text-format with the
+  per-host, server, and job-duration metrics enumerated in the
+  task entry (P6-04 in `tasks.md`).
+- Endpoint is opt-in and gated by a bearer token and/or an IP
+  allow-list — never publicly readable by default.
+- No new third-party dependency (`prometheus/client_golang` is not
+  pulled in). The exposition format is small and stable enough to
+  emit by hand; matches the repo's "no Tailwind/Node" style.
+- Sample Grafana dashboard committed to the repo so a stranger can
+  drop it into a Grafana instance and get a working view.
+
+## Non-goals
+
+- OpenMetrics (the legacy text format with `# HELP`/`# TYPE` is
+  what every prom server still parses and what every example
+  online demonstrates — pick the boring option).
+- Pushgateway or remote-write integration.
+- Per-job metric cardinality (no `job_id` labels — that would
+  make the histogram explode).
+- Alerting rules. Operators already have alerts inside
+  restic-manager (P3-05); duplicating them in Prometheus is a
+  YAGNI hazard. The dashboard is read-only.
+
+## Auth
+
+Two switches, both off by default. If neither is set the route
+isn't mounted at all (404 from the chi router) — this avoids any
+accidental "wide-open scrape endpoint" deployment.
+
+| env var | type | meaning |
+| --- | --- | --- |
+| `RM_METRICS_TOKEN` | string | If set, callers must send `Authorization: Bearer <token>`. Compared with `crypto/subtle.ConstantTimeCompare`. |
+| `RM_METRICS_TRUSTED_CIDR` | comma-CIDR | If set, callers must hit from a source IP inside one of these CIDRs. Reuses the existing `RM_TRUSTED_PROXY` semantics for honouring `X-Forwarded-For`. |
+
+If both are set, both must pass (AND, not OR — a token leak doesn't grant access from outside the network, and a trusted-network compromise doesn't grant access without the token). If only one is set, that one alone gates access.
+
+YAML overlay mirrors env: `metrics_token`, `metrics_trusted_cidrs`.
+
+## Metrics
+
+All metric names are prefixed `rm_`. Help text is concise.
+
+### Per-host gauges (one row per `host_id`)
+
+```
+rm_host_agent_online{host_id,host}                    1 if status='online' else 0
+rm_host_last_backup_timestamp_seconds{host_id,host}   unix seconds; omitted if no backup yet
+rm_host_last_backup_success{host_id,host}             1 if last_backup_status='succeeded' else 0; omitted if no backup yet
+rm_host_repo_size_bytes{host_id,host}                 total_size from latest repo stats; omitted if unknown
+rm_host_snapshot_count{host_id,host}                  integer
+rm_host_open_alerts{host_id,host}                     count of open + un-resolved alerts attached to this host
+rm_host_repo_status{host_id,host,status}              1, with status ∈ {unknown,ready,init_failed} (info-style, exactly one row per host)
+```
+
+`host` label is `hosts.name` for human readability; `host_id` is
+the stable ULID for joining across renames.
+
+### Server gauges
+
+```
+rm_hosts_total                              count of hosts (excludes pending)
+rm_hosts_online                             count of hosts with status='online'
+rm_active_alerts{severity}                  count of open alerts by severity ∈ {info,warning,critical}
+rm_build_info{version,commit,go_version}    always 1; pure label-bag for joining
+```
+
+### Job duration histogram
+
+```
+rm_job_duration_seconds_bucket{kind,status,le=...}
+rm_job_duration_seconds_sum{kind,status}
+rm_job_duration_seconds_count{kind,status}
+```
+
+`kind` ∈ {backup,forget,prune,check,unlock,restore,diff,init,update}
+(every JobKind we currently dispatch). `status` ∈
+{succeeded,failed,cancelled}. Buckets cover the realistic range —
+short admin commands (unlock, init) finish in seconds; backups can
+be hours:
+
+```
+1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf
+ (1s 5s 30s 1m 5m 30m 1h 6h 24h)
+```
+
+In-memory only. Reset on process restart — operators who want
+durable history scrape into Prom and let it persist.
+
+## Architecture
+
+New package `internal/server/metrics`:
+
+- `Registry` — owns the histogram state (sync.Mutex + map keyed by
+  `kind+status`). `ObserveJob(kind, status string, dur time.Duration)`
+  is the only mutator. Lookups via `Snapshot()` are read-only and
+  copy out.
+- `Render(w io.Writer, snapshot Snapshot)` — emits the full
+  exposition body. The snapshot is supplied by the HTTP handler
+  pulling from `Store` on each scrape; the package itself has no
+  store dependency, which keeps it trivially unit-testable.
+
+New file `internal/server/http/metrics.go`:
+
+- `handleMetrics(w, r)` — auth check (bearer + CIDR), pull current
+  fleet snapshot from `Store`, ask `metrics.Render` to emit.
+- Auth helper `authoriseMetricsScrape(r)` — pure function over
+  request + config; tested directly.
+
+Wiring:
+
+- `cmd/server` constructs the `metrics.Registry` once and threads
+  it into both `Deps` (for the HTTP layer) and `ws.HandlerDeps`
+  (so the job-finished branch can call `ObserveJob`).
+- `ws/handler.go` MsgJobFinished branch grows a single line:
+  `if deps.Metrics != nil { deps.Metrics.ObserveJob(job.Kind, p.Status, p.FinishedAt.Sub(job.StartedAt)) }`.
+  Falls back gracefully if the registry was never wired (tests).
+
+Route registration in `server.go`:
+
+```go
+if s.deps.Cfg.MetricsAuthEnabled() {
+	r.Get("/metrics", s.handleMetrics)
+}
+```
+
+## Cardinality + cost
+
+Per scrape: O(hosts) gauge rows + O(kinds × statuses × buckets) histogram rows. For a 100-host fleet that's ~700 host rows + ~270 histogram rows per scrape — well under any practical limit. The store reads we issue per scrape are: `ListHosts` (already exists, one query), `ListAlerts` filtered by open status (one query), `GetHostRepoStats` already projected onto `Host` via `repo_size_bytes`. No N+1.
+
+A 10s scrape interval against a 100-host fleet is cheap: each scrape is a couple of indexed sqlite reads + a small string render. We're nowhere near a place where caching the snapshot would be worthwhile.
+
+## Documentation (P6-05)
+
+- `docs/prometheus.md` — sibling to the existing `docs/reverse-proxy.md`. Sections: enabling the endpoint (env vars), Prometheus scrape config snippet (with bearer + tls), the metric reference table (copy-pasted from this spec), the dashboard import instructions.
+- `deploy/grafana/restic-manager-dashboard.json` — Grafana 11+ dashboard JSON. Single Prometheus datasource variable, six panels:
+  1. **Fleet status** — stat panel showing `rm_hosts_online / rm_hosts_total` + a sparkline.
+  2. **Open alerts** — stat panel by severity (`sum by (severity) (rm_active_alerts)`).
+  3. **Hosts** — table of `host`, `online`, `last_backup` (relative time via `time() - rm_host_last_backup_timestamp_seconds`), `repo_size`, `snapshots`.
+  4. **Repo size over time** — time series, one line per host, `rm_host_repo_size_bytes`.
+  5. **Backups failing** — time series counting hosts where `rm_host_last_backup_success == 0`.
+  6. **Job duration p95** — `histogram_quantile(0.95, sum by (kind, le) (rate(rm_job_duration_seconds_bucket[1h])))` over a 1h window.
+
+Dashboard is committed as plain JSON; an operator imports it through the Grafana UI ("+ → Import → upload JSON file") or Grafana provisioning.
+
+## Testing
+
+- Unit tests for `metrics.Render` against a fixed snapshot — golden-file style. Pin the exact line ordering (sorted by metric name + label set) so diffs stay tractable.
+- Unit tests for `metrics.Registry.ObserveJob` — concurrent writes, bucket boundary correctness, snapshot independence.
+- Handler tests for `/metrics` covering: no auth configured → 404; token configured + missing → 401; token configured + correct → 200 + body sniff; CIDR configured + wrong source → 401; CIDR configured + right source → 200; both configured → require both.
+- End-to-end smoke verification deferred to manual operator walk-through; full Playwright pass is P5-06's job.
+
+## Out of scope, explicitly
+
+- Per-job latency tracking with `job_id` labels (cardinality bomb).
+- Restore-specific metrics (P3 surfaces are still settling).
+- Histograms keyed on host (kind × status × buckets × hosts is a Prom anti-pattern).
+- Auto-discovery / file-SD generators for Prometheus.
-- 
2.52.0

From 73e733be61668e8633598413b84c63e7e2111cf8 Mon Sep 17 00:00:00 2001
From: Steve Cliff
Date: Thu, 7 May 2026 23:17:15 +0100
Subject: [PATCH 2/2] P6-04+05: Prometheus /metrics endpoint + Grafana dashboard

New internal/server/metrics package emits the legacy text/plain
exposition format directly, so we don't pull in
prometheus/client_golang. Endpoint is opt-in via RM_METRICS_TOKEN
and/or RM_METRICS_TRUSTED_CIDR; route is not mounted at all if
neither gate is set. Both gates ANDed when both configured.

Per-host gauges (online, last_backup_*, repo_size_bytes,
snapshot_count, open_alerts, repo_status), server gauges
(hosts_total/online, active_alerts by severity, build_info), and an
in-memory job-duration histogram observed from the existing
MsgJobFinished branch in the WS handler.

Docs in docs/prometheus.md (enable + scrape config + metric
reference + dashboard import). Sample dashboard at
deploy/grafana/restic-manager-dashboard.json - six panels, Grafana
schema 39, single Prometheus datasource variable.

Tests: golden render, concurrent observe, bucket boundaries in the
metrics package; auth matrix (no auth -> 404, token gate, CIDR gate,
both required) in the HTTP layer.
---
 cmd/server/main.go                           |   3 +
 deploy/grafana/restic-manager-dashboard.json | 325 +++++++++++++++++++
 docs/prometheus.md                           | 139 ++++++++
 internal/server/config/config.go             |  36 ++
 internal/server/config/config_test.go        |  39 +++
 internal/server/http/metrics.go              | 185 +++++++++++
 internal/server/http/metrics_test.go         | 209 ++++++++++++
 internal/server/http/server.go               |  11 +
 internal/server/metrics/metrics.go           | 301 +++++++++++++++++
 internal/server/metrics/metrics_test.go      | 182 +++++++++++
 internal/server/ws/handler.go                |  11 +
 tasks.md                                     |  41 ++-
 12 files changed, 1480 insertions(+), 2 deletions(-)
 create mode 100644 deploy/grafana/restic-manager-dashboard.json
 create mode 100644 docs/prometheus.md
 create mode 100644 internal/server/http/metrics.go
 create mode 100644 internal/server/http/metrics_test.go
 create mode 100644 internal/server/metrics/metrics.go
 create mode 100644 internal/server/metrics/metrics_test.go

diff --git a/cmd/server/main.go b/cmd/server/main.go
index b79d201..45f8f15 100644
--- a/cmd/server/main.go
+++ b/cmd/server/main.go
@@ -20,6 +20,7 @@ import (
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/fleetupdate"
 	rmhttp "gitea.dcglab.co.uk/steve/restic-manager/internal/server/http"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/maintenance"
+	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/oidc"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ui"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ws"
@@ -89,6 +90,7 @@ func run() error {
 
 	hub := ws.NewHub()
 	jobHub := ws.NewJobHub()
+	metricsRegistry := metrics.NewRegistry()
 	notifHub := notification.NewHub(st, aead, cfg.BaseURL)
 	alertEngine := alert.NewEngine(st, notifHub)
 
@@ -122,6 +124,7 @@ func run() error {
 		UI:      renderer,
 		Version: version,
 		OIDC:    oidcClient,
+		Metrics: metricsRegistry,
 	}
 
 	// First-run bootstrap: if the users table is empty, mint a one-time
diff --git a/deploy/grafana/restic-manager-dashboard.json b/deploy/grafana/restic-manager-dashboard.json
new file mode 100644
index 0000000..7f5d690
--- /dev/null
+++ b/deploy/grafana/restic-manager-dashboard.json
@@ -0,0 +1,325 @@
+{
+  "annotations": {
+    "list": [
+      {
+        "builtIn": 1,
+        "datasource": { "type": "grafana", "uid": "-- Grafana --" },
+        "enable": true,
+        "hide": true,
+        "iconColor": "rgba(0, 211, 255, 1)",
+        "name": "Annotations & Alerts",
+        "type": "dashboard"
+      }
+    ]
+  },
+  "description": "restic-manager fleet overview. Imports against any Prometheus data source.",
+  "editable": true,
+  "fiscalYearStartMonth": 0,
+  "graphTooltip": 0,
+  "id": null,
+  "links": [],
+  "liveNow": false,
+  "panels": [
+    {
+      "id": 1,
+      "title": "Fleet status",
+      "type": "stat",
+      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+      "gridPos": { "h": 6, "w": 6, "x": 0, "y": 0 },
+      "fieldConfig": {
+        "defaults": {
+          "color": { "mode": "thresholds" },
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              { "color": "red", "value": null },
+              { "color": "green", "value": 1 }
+            ]
+          },
+          "unit": "short"
+        },
+        "overrides": []
+      },
+      "options": {
+        "colorMode": "value",
+        "graphMode": "area",
+        "justifyMode": "auto",
+        "orientation": "auto",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
+        "textMode": "auto"
+      },
+      "targets": [
+        {
+          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+          "expr": "rm_hosts_online",
+          "legendFormat": "online",
+          "refId": "A"
+        },
+        {
+          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+          "expr": "rm_hosts_total",
+          "legendFormat": "total",
+          "refId": "B"
+        }
+      ]
+    },
+    {
+      "id": 2,
+      "title": "Open alerts",
+      "type": "stat",
+      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+      "gridPos": { "h": 6, "w": 6, "x": 6, "y": 0 },
+      "fieldConfig": {
+        "defaults": {
+          "color": { "mode": "thresholds" },
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              { "color": "green", "value": null },
+              { "color": "yellow", "value": 1 },
+              { "color": "red", "value": 5 }
+            ]
+          },
+          "unit": "short"
+        },
+        "overrides": []
+      },
+      "options": {
+        "colorMode": "value",
+        "graphMode": "none",
+        "orientation": "horizontal",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
+        "textMode": "auto"
+      },
+      "targets": [
+        {
+          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+          "expr": "sum by (severity) (rm_active_alerts)",
+          "legendFormat": "{{severity}}",
+          "refId": "A"
+        }
+      ]
+    },
+    {
+      "id": 3,
+      "title": "Backups failing (last reported run)",
+      "type": "stat",
+      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+      "gridPos": { "h": 6, "w": 6, "x": 12, "y": 0 },
+      "fieldConfig": {
+        "defaults": {
+          "color": { "mode": "thresholds" },
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              { "color": "green", "value": null },
+              { "color": "red", "value": 1 }
+            ]
+          },
+          "unit": "short"
+        },
+        "overrides": []
+      },
+      "options": {
+        "colorMode": "value",
+        "graphMode": "area",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
+        "textMode": "auto"
+      },
+      "targets": [
+        {
+          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+          "expr": "count(rm_host_last_backup_success == 0)",
+          "legendFormat": "failing",
+          "refId": "A"
+        }
+      ]
+    },
+    {
+      "id": 4,
+      "title": "Hosts",
+      "type": "table",
+      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+      "gridPos": { "h": 10, "w": 24, "x": 0, "y": 6 },
+      "fieldConfig": {
+        "defaults": {
+          "custom": { "align": "auto", "displayMode": "auto" }
+        },
+        "overrides": [
+          {
+            "matcher": { "id": "byName", "options": "Value #B" },
+            "properties": [
+              { "id": "displayName", "value": "Last backup (s ago)" },
+              { "id": "unit", "value": "s" }
+            ]
+          },
+          {
+            "matcher": { "id": "byName", "options": "Value #C" },
+            "properties": [
+              { "id": "displayName", "value": "Repo size" },
+              { "id": "unit", "value": "bytes" }
+            ]
+          },
+          {
+            "matcher": { "id": "byName", "options": "Value #D" },
+            "properties": [
+              { "id": "displayName", "value": "Snapshots" }
+            ]
+          },
+          {
+            "matcher": { "id": "byName", "options": "Value #A" },
+            "properties": [
+              { "id": "displayName", "value": "Online" }
+            ]
+          },
+          {
+            "matcher": { "id": "byName", "options": "Value #E" },
+            "properties": [
+              { "id": "displayName", "value": "Open alerts" }
+            ]
+          }
+        ]
+      },
+      "options": { "showHeader": true },
+      "transformations": [
+        {
+          "id": "merge",
+          "options": {}
+        }
+      ],
+      "targets": [
+        {
+          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+          "expr": "rm_host_agent_online",
+          "format": "table",
+          "instant": true,
+          "refId": "A"
+        },
+        {
+          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+          "expr": "time() - rm_host_last_backup_timestamp_seconds",
+          "format": "table",
+          "instant": true,
+          "refId": "B"
+        },
+        {
+          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+          "expr": "rm_host_repo_size_bytes",
+          "format": "table",
+          "instant": true,
+          "refId": "C"
+        },
+        {
+          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+          "expr": "rm_host_snapshot_count",
+          "format": "table",
+          "instant": true,
+          "refId": "D"
+        },
+        {
+          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+          "expr": "rm_host_open_alerts",
+          "format": "table",
+          "instant": true,
+          "refId": "E"
+        }
+      ]
+    },
+    {
+      "id": 5,
+      "title": "Repo size over time",
+      "type": "timeseries",
+      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 },
+      "fieldConfig": {
+        "defaults": {
+          "color": { "mode": "palette-classic" },
+          "custom": {
+            "axisLabel": "",
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "lineWidth": 1,
+            "pointSize": 5,
+            "showPoints": "never"
+          },
+          "unit": "bytes"
+        },
+        "overrides": []
+      },
+      "options": {
+        "legend": { "calcs": ["last"], "displayMode": "list", "placement": "bottom", "showLegend": true },
+        "tooltip": { "mode": "multi", "sort": "desc" }
+      },
+      "targets": [
+        {
+          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+          "expr": "rm_host_repo_size_bytes",
+          "legendFormat": "{{host}}",
+          "refId": "A"
+        }
+      ]
+    },
+    {
+      "id": 6,
+      "title": "Job duration p95 (last 1h, by kind)",
+      "type": "timeseries",
+      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 },
+      "fieldConfig": {
+        "defaults": {
+          "color": { "mode": "palette-classic" },
+          "custom": {
+            "drawStyle": "line",
+            "fillOpacity": 5,
+            "lineWidth": 1,
+            "pointSize": 4,
+            "showPoints": "never"
+          },
+          "unit": "s"
+        },
+        "overrides": []
+      },
+      "options": {
+        "legend": { "calcs": ["last"], "displayMode": "list", "placement": "bottom", "showLegend": true },
+        "tooltip": { "mode": "multi", "sort": "desc" }
+      },
+      "targets": [
+        {
+          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+          "expr": "histogram_quantile(0.95, sum by (kind, le) (rate(rm_job_duration_seconds_bucket[1h])))",
+          "legendFormat": "{{kind}}",
+          "refId": "A"
+        }
+      ]
+    }
+  ],
+  "refresh": "30s",
+  "schemaVersion": 39,
+  "style": "dark",
+  "tags": ["restic-manager", "backups"],
+  "templating": {
+    "list": [
+      {
+        "current": {},
+        "hide": 0,
+        "includeAll": false,
+        "label": "Prometheus",
+        "multi": false,
+        "name": "DS_PROMETHEUS",
+        "options": [],
+        "query": "prometheus",
+        "refresh": 1,
+        "regex": "",
+        "skipUrlSync": false,
+        "type": "datasource"
+      }
+    ]
+  },
+  "time": { "from": "now-6h", "to": "now" },
+  "timepicker": {},
+  "timezone": "",
+  "title": "restic-manager — fleet",
+  "uid": "rm-fleet-overview",
+  "version": 1,
+  "weekStart": ""
+}
diff --git a/docs/prometheus.md b/docs/prometheus.md
new file mode 100644
index 0000000..ebd83e1
--- /dev/null
+++ b/docs/prometheus.md
@@ -0,0 +1,139 @@
+# Prometheus + Grafana
+
+restic-manager exposes a Prometheus scrape endpoint at `GET /metrics`.
+The endpoint is **opt-in** — it is not mounted at all unless you set
+at least one of the auth gates below. Once enabled, it serves the
+standard `text/plain` exposition format that every Prometheus
+release since 2.x parses without configuration.
+
+A sample Grafana dashboard lives at
+`deploy/grafana/restic-manager-dashboard.json`.
+
+## Enable the endpoint
+
+Two switches, both off by default. If both are set, both must pass
+(token AND source-IP); if only one is set, that gate alone
+authorises a scrape.
+
+| Env var | YAML key | Effect |
+|----------------------------|------------------------|--------|
+| `RM_METRICS_TOKEN` | `metrics_token` | Requires `Authorization: Bearer <token>`. Compared in constant time. |
+| `RM_METRICS_TRUSTED_CIDR` | `metrics_trusted_cidrs` (list) | Restricts the source IP to one of the listed CIDRs. Comma-separated in env, list in YAML. Honours `X-Forwarded-For` only when the immediate hop matches `RM_TRUSTED_PROXY`. |
+
+When neither is set, `GET /metrics` returns 404 — the route is not
+registered with the chi router so a forgotten config can't
+accidentally publish fleet state.
+
+### Example: Docker
+
+```yaml
+services:
+  restic-manager:
+    image: gitea.dcglab.co.uk/steve/restic-manager:latest
+    environment:
+      RM_METRICS_TOKEN_FILE: /run/secrets/rm_metrics_token
+      RM_METRICS_TRUSTED_CIDR: "10.0.0.0/8"
+    secrets:
+      - rm_metrics_token
+```
+
+(`RM_METRICS_TOKEN_FILE` is not currently supported, so as written
+this example is gated by CIDR only. Set `RM_METRICS_TOKEN` directly
+to enable the token gate; the `_FILE` convention is on the roadmap.)
+
+## Prometheus scrape config
+
+Drop into your `prometheus.yml`:
+
+```yaml
+scrape_configs:
+  - job_name: restic-manager
+    metrics_path: /metrics
+    scheme: https                # via your reverse proxy
+    static_configs:
+      - targets: ['restic.example.com']
+    authorization:
+      type: Bearer
+      credentials_file: /etc/prometheus/secrets/rm_metrics_token
+```
+
+If you don't run a TLS-terminating proxy in front, drop `scheme:
+https` (the server is HTTP-only — see `docs/reverse-proxy.md`).
+
+## Metric reference
+
+All names are `rm_`-prefixed. Per-host metrics carry a `host_id`
+label (the stable ULID, immune to renames) and a `host` label
+(the human-readable name).
+
+### Server gauges
+
+| Name | Labels | Description |
+|-----------------------|------------------------------------|-------------|
+| `rm_hosts_total` | — | Total number of enrolled hosts (excludes pending announces). |
+| `rm_hosts_online` | — | Number of hosts with `status='online'`. |
+| `rm_active_alerts` | `severity` ∈ {info, warning, critical} | Open alerts by severity. |
+| `rm_build_info` | `version, commit, go_version` | Always 1; pure label-bag for joining. |
+
+### Per-host gauges
+
+| Name | Description |
+|--------------------------------------------|-------------|
+| `rm_host_agent_online` | 1 if the agent is currently online, 0 otherwise. |
+| `rm_host_last_backup_timestamp_seconds` | Unix timestamp of the host's most recent backup. **Omitted** for hosts with no backup yet. |
+| `rm_host_last_backup_success` | 1 if the most recent backup succeeded, 0 otherwise. **Omitted** for hosts with no backup yet. |
+| `rm_host_repo_size_bytes` | Latest reported repo size from `restic stats --mode raw-data`. **Omitted** when unknown. |
+| `rm_host_snapshot_count` | Number of restic snapshots known on the host's repo. |
+| `rm_host_open_alerts` | Number of currently open alerts attached to this host. |
+| `rm_host_repo_status` | Always 1; the `status` label carries `unknown` / `ready` / `init_failed`. |
+
+### Job duration histogram
+
+```
+rm_job_duration_seconds_bucket{kind, status, le}
+rm_job_duration_seconds_sum{kind, status}
+rm_job_duration_seconds_count{kind, status}
+```
+
+`kind` ∈ {backup, forget, prune, check, unlock, restore, diff, init, update}.
+`status` ∈ {succeeded, failed, cancelled}.
+
+Buckets (seconds):
+
+```
+1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf
+1s 5s 30s 1m 5m 30m 1h 6h 24h
+```
+
+The histogram is in-memory only — values reset on process restart.
+Operators who want durable history should let Prometheus persist
+the scrapes; restic-manager itself is a control plane, not a
+metrics database.
+
+## Grafana dashboard
+
+Import `deploy/grafana/restic-manager-dashboard.json`:
+
+1. In Grafana, **+ → Import → Upload JSON file**.
+2. Pick the Prometheus data source you scrape with.
+3. The dashboard's six panels populate from the metrics above:
+   * **Fleet status** — online/total stat panel.
+   * **Open alerts** — by severity.
+   * **Hosts** — per-host table (last backup, repo size, snapshots, alerts).
+   * **Repo size over time** — one line per host.
+   * **Backups failing** — count of hosts whose last backup didn't succeed.
+   * **Job duration p95** — `histogram_quantile(0.95, …)` over a 1h window per kind.
+
+Alerting is intentionally not configured in the dashboard — the
+control plane already has alerts (P3-05) with native channels for
+webhook, ntfy, and SMTP. Re-implementing them in Prometheus would
+just duplicate state. If you do want Prom-side alerts, write rules
+against the metrics above; none are shipped with this repo.
+
+## Cardinality
+
+Per scrape: O(hosts) gauge rows + O(kinds × statuses × buckets)
+histogram rows. A 100-host fleet emits roughly 700 host rows + 270
+histogram rows — well below any practical limit. There are no
+`job_id` labels (cardinality bomb avoidance) and no per-source-group
+labels.
diff --git a/internal/server/config/config.go b/internal/server/config/config.go
index ffb6363..2793913 100644
--- a/internal/server/config/config.go
+++ b/internal/server/config/config.go
@@ -41,6 +41,24 @@ type Config struct {
 	// DataDir. Source-build deployments can override via
 	// RM_BUNDLED_ASSETS_DIR.
 	BundledAssetsDir string `yaml:"bundled_assets_dir"`
+
+	// MetricsToken, if set, gates the /metrics scrape endpoint
+	// behind a `Authorization: Bearer <token>` check (constant-time
+	// compare). When neither this nor MetricsTrustedCIDRs is set,
+	// the route is not mounted at all (the endpoint is opt-in).
+	MetricsToken string `yaml:"metrics_token"`
+
+	// MetricsTrustedCIDRs, if non-empty, gates /metrics so only
+	// callers from these networks may scrape. ANDed with
+	// MetricsToken when both are set.
+	MetricsTrustedCIDRs []string `yaml:"metrics_trusted_cidrs"`
+}
+
+// MetricsAuthEnabled reports whether the operator has opted into
+// exposing the Prometheus scrape endpoint by configuring at least
+// one auth gate.
+func (c Config) MetricsAuthEnabled() bool {
+	return c.MetricsToken != "" || len(c.MetricsTrustedCIDRs) > 0
 }
 
 // Load resolves config in this order:
@@ -93,6 +111,19 @@ func Load(yamlPath string) (Config, error) {
 	if v, ok := os.LookupEnv("RM_BUNDLED_ASSETS_DIR"); ok {
 		c.BundledAssetsDir = v
 	}
+	if v, ok := os.LookupEnv("RM_METRICS_TOKEN"); ok {
+		c.MetricsToken = v
+	}
+	if v, ok := os.LookupEnv("RM_METRICS_TRUSTED_CIDR"); ok {
+		parts := strings.Split(v, ",")
+		c.MetricsTrustedCIDRs = c.MetricsTrustedCIDRs[:0]
+		for _, p := range parts {
+			p = strings.TrimSpace(p)
+			if p != "" {
+				c.MetricsTrustedCIDRs = append(c.MetricsTrustedCIDRs, p)
+			}
+		}
+	}
 	if v, ok := os.LookupEnv("RM_TRUSTED_PROXY"); ok {
 		// Comma-separated CIDRs; allow whitespace for readability.
 		parts := strings.Split(v, ",")
@@ -137,5 +168,10 @@ func (c *Config) validate() error {
 			return fmt.Errorf("config: RM_TRUSTED_PROXY entry %q is not a valid CIDR: %w", cidr, err)
 		}
 	}
+	for _, cidr := range c.MetricsTrustedCIDRs {
+		if _, err := netip.ParsePrefix(cidr); err != nil {
+			return fmt.Errorf("config: RM_METRICS_TRUSTED_CIDR entry %q is not a valid CIDR: %w", cidr, err)
+		}
+	}
 	return nil
 }
diff --git a/internal/server/config/config_test.go b/internal/server/config/config_test.go
index ba264f5..044af50 100644
--- a/internal/server/config/config_test.go
+++ b/internal/server/config/config_test.go
@@ -98,6 +98,45 @@ func TestCookieSecureDefaultAndOverride(t *testing.T) {
 	}
 }
 
+func TestMetricsAuthGates(t *testing.T) {
+	t.Setenv("RM_LISTEN", ":8080")
+	t.Setenv("RM_DATA_DIR", "/tmp/x")
+
+	c, err := Load("")
+	if err != nil {
+		t.Fatalf("load: %v", err)
+	}
+	if c.MetricsAuthEnabled() {
+		t.Errorf("metrics endpoint should be off by default")
+	}
+
+	t.Setenv("RM_METRICS_TOKEN", "s3cr3t-token-with-enough-bytes")
+	t.Setenv("RM_METRICS_TRUSTED_CIDR", "10.0.0.0/8, 192.168.1.0/24")
+	c, err = Load("")
+	if err != nil {
+		t.Fatalf("load: %v", err)
+	}
+	if c.MetricsToken != "s3cr3t-token-with-enough-bytes" {
+ t.Errorf("token: %q", c.MetricsToken) + } + if got := c.MetricsTrustedCIDRs; len(got) != 2 || got[0] != "10.0.0.0/8" || got[1] != "192.168.1.0/24" { + t.Errorf("cidrs: %v", got) + } + if !c.MetricsAuthEnabled() { + t.Errorf("MetricsAuthEnabled should be true") + } +} + +func TestMetricsTrustedCIDRRejectsGarbage(t *testing.T) { + t.Setenv("RM_LISTEN", ":8080") + t.Setenv("RM_DATA_DIR", "/tmp/x") + t.Setenv("RM_METRICS_TRUSTED_CIDR", "garbage") + + if _, err := Load(""); err == nil { + t.Fatal("expected validation error, got nil") + } +} + func writeFile(path string, body []byte) error { return writeFileImpl(path, body) } diff --git a/internal/server/http/metrics.go b/internal/server/http/metrics.go new file mode 100644 index 0000000..2c65ca8 --- /dev/null +++ b/internal/server/http/metrics.go @@ -0,0 +1,185 @@ +package http + +import ( + "context" + "crypto/subtle" + "net" + "net/http" + "net/netip" + "runtime" + "strings" + + "gitea.dcglab.co.uk/steve/restic-manager/internal/server/config" + "gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics" + "gitea.dcglab.co.uk/steve/restic-manager/internal/store" + "gitea.dcglab.co.uk/steve/restic-manager/internal/version" +) + +// handleMetrics serves the Prometheus exposition body. The route is +// only mounted when the operator has opted in via RM_METRICS_TOKEN +// or RM_METRICS_TRUSTED_CIDR (see Server.New + Cfg.MetricsAuthEnabled). +func (s *Server) handleMetrics(w http.ResponseWriter, r *http.Request) { + if !authoriseMetricsScrape(r, s.deps.Cfg) { + // 401 with no body; Prom respects this and surfaces the failed + // scrape. WWW-Authenticate hints at bearer when the operator + // actually configured a token. 
+ if s.deps.Cfg.MetricsToken != "" { + w.Header().Set("WWW-Authenticate", `Bearer realm="restic-manager metrics"`) + } + w.WriteHeader(http.StatusUnauthorized) + return + } + + snap, err := s.gatherMetricsSnapshot(r.Context()) + if err != nil { + http.Error(w, "snapshot: "+err.Error(), http.StatusInternalServerError) + return + } + + // 0.0.4 is the long-stable text-format version Prometheus accepts + // without negotiation; OpenMetrics is intentionally not used here. + w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8") + if err := metrics.Render(w, snap); err != nil { + // Headers (and possibly part of the body) are already sent, so the + // error cannot be reported to the scraper; give up on this scrape. + return + } +} + +// authoriseMetricsScrape applies bearer + CIDR gates per the spec. +// AND semantics when both are configured; either alone is sufficient +// when only it is configured. +func authoriseMetricsScrape(r *http.Request, cfg config.Config) bool { + tokenOK := true + if cfg.MetricsToken != "" { + tokenOK = false + hdr := r.Header.Get("Authorization") + const prefix = "Bearer " + if strings.HasPrefix(hdr, prefix) { + got := []byte(strings.TrimPrefix(hdr, prefix)) + want := []byte(cfg.MetricsToken) + if subtle.ConstantTimeCompare(got, want) == 1 { + tokenOK = true + } + } + } + + cidrOK := true + if len(cfg.MetricsTrustedCIDRs) > 0 { + cidrOK = false + ip := callerIP(r, cfg.TrustedProxies) + if ip.IsValid() { + for _, c := range cfg.MetricsTrustedCIDRs { + prefix, err := netip.ParsePrefix(c) + if err != nil { + continue + } + if prefix.Contains(ip) { + cidrOK = true + break + } + } + } + } + return tokenOK && cidrOK +} + +// callerIP resolves the client IP. When the request hit the server +// directly we use RemoteAddr; when the immediate hop is a trusted +// proxy we honour the right-most untrusted X-Forwarded-For entry +// (mirrors how realIP middlewares typically resolve).
+func callerIP(r *http.Request, trustedProxies []string) netip.Addr { + host, _, err := net.SplitHostPort(r.RemoteAddr) + if err != nil { + host = r.RemoteAddr + } + directAddr, err := netip.ParseAddr(host) + if err != nil { + return netip.Addr{} + } + + if !addrInAnyCIDR(directAddr, trustedProxies) { + return directAddr + } + + xff := r.Header.Get("X-Forwarded-For") + if xff == "" { + return directAddr + } + parts := strings.Split(xff, ",") + // Walk right→left, skipping trusted proxies, until we land on the + // first untrusted hop: that's the genuine client. + for i := len(parts) - 1; i >= 0; i-- { + p := strings.TrimSpace(parts[i]) + a, err := netip.ParseAddr(p) + if err != nil { + continue + } + if addrInAnyCIDR(a, trustedProxies) { + continue + } + return a + } + return directAddr +} + +func addrInAnyCIDR(a netip.Addr, cidrs []string) bool { + for _, c := range cidrs { + pre, err := netip.ParsePrefix(c) + if err != nil { + continue + } + if pre.Contains(a) { + return true + } + } + return false +} + +// gatherMetricsSnapshot pulls the data the renderer needs. One +// indexed, fleet-wide query each for hosts and open alerts; no N+1.
+func (s *Server) gatherMetricsSnapshot(ctx context.Context) (metrics.Snapshot, error) { + hosts, err := s.deps.Store.ListHosts(ctx) + if err != nil { + return metrics.Snapshot{}, err + } + hostRows := make([]metrics.HostRow, 0, len(hosts)) + for _, h := range hosts { + row := metrics.HostRow{ + ID: h.ID, + Name: h.Name, + Online: h.Status == "online", + SnapshotCount: h.SnapshotCount, + OpenAlertCount: h.OpenAlertCount, + RepoStatus: h.RepoStatus, + } + if h.LastBackupAt != nil { + ts := h.LastBackupAt.Unix() + row.LastBackupUnix = &ts + } + if h.LastBackupStatus != nil { + ok := *h.LastBackupStatus == "succeeded" + row.LastBackupSucceeded = &ok + } + if h.RepoSizeBytes > 0 { + sz := h.RepoSizeBytes + row.RepoSizeBytes = &sz + } + hostRows = append(hostRows, row) + } + + open, err := s.deps.Store.ListAlerts(ctx, store.AlertFilter{Status: "open"}) + if err != nil { + return metrics.Snapshot{}, err + } + bySeverity := map[string]int{"info": 0, "warning": 0, "critical": 0} + for _, a := range open { + bySeverity[a.Severity]++ + } + + reg := s.deps.Metrics + if reg == nil { + reg = metrics.NewRegistry() // empty histogram block + } + return reg.SnapshotWith(hostRows, bySeverity, version.Version, version.Commit, runtime.Version()), nil +} diff --git a/internal/server/http/metrics_test.go b/internal/server/http/metrics_test.go new file mode 100644 index 0000000..ef443e1 --- /dev/null +++ b/internal/server/http/metrics_test.go @@ -0,0 +1,209 @@ +package http + +import ( + "context" + "io" + stdhttp "net/http" + "net/http/httptest" + "path/filepath" + "strings" + "testing" + + "gitea.dcglab.co.uk/steve/restic-manager/internal/crypto" + "gitea.dcglab.co.uk/steve/restic-manager/internal/server/config" + "gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics" + "gitea.dcglab.co.uk/steve/restic-manager/internal/store" +) + +// newMetricsServer builds a Server with metrics enabled per cfg. 
+// Returns (URL, registry, store) so tests can both observe job durations +// directly and exercise the HTTP gate. +func newMetricsServer(t *testing.T, cfg config.Config) (string, *metrics.Registry, *store.Store) { + t.Helper() + dir := t.TempDir() + + st, err := store.Open(context.Background(), filepath.Join(dir, "rm.db")) + if err != nil { + t.Fatalf("store: %v", err) + } + t.Cleanup(func() { _ = st.Close() }) + + keyPath := filepath.Join(dir, "secret.key") + if err := crypto.GenerateKeyFile(keyPath); err != nil { + t.Fatalf("genkey: %v", err) + } + key, _ := crypto.LoadKeyFromFile(keyPath) + aead, _ := crypto.NewAEAD(key) + + cfg.Listen = ":0" + cfg.DataDir = dir + cfg.SecretKeyFile = keyPath + + reg := metrics.NewRegistry() + deps := Deps{ + Cfg: cfg, + Store: st, + AEAD: aead, + Metrics: reg, + } + s := New(deps) + ts := httptest.NewServer(s.srv.Handler) + t.Cleanup(ts.Close) + return ts.URL, reg, st +} + +func TestMetricsRouteNotMountedByDefault(t *testing.T) { + t.Parallel() + url, _, _ := newMetricsServer(t, config.Config{}) + res, err := stdhttp.Get(url + "/metrics") + if err != nil { + t.Fatalf("GET: %v", err) + } + defer res.Body.Close() + if res.StatusCode != stdhttp.StatusNotFound { + t.Errorf("status: got %d, want 404 (route should not be mounted)", res.StatusCode) + } +} + +func TestMetricsTokenRequired(t *testing.T) { + t.Parallel() + url, _, _ := newMetricsServer(t, config.Config{ + MetricsToken: "the-token", + }) + + // Missing token. + res, err := stdhttp.Get(url + "/metrics") + if err != nil { + t.Fatalf("GET: %v", err) + } + defer res.Body.Close() + if res.StatusCode != stdhttp.StatusUnauthorized { + t.Errorf("no token: got %d", res.StatusCode) + } + if !strings.Contains(res.Header.Get("WWW-Authenticate"), "Bearer") { + t.Errorf("WWW-Authenticate hint missing: %q", res.Header.Get("WWW-Authenticate")) + } + + // Wrong token.
+ req, _ := stdhttp.NewRequest(stdhttp.MethodGet, url+"/metrics", nil) + req.Header.Set("Authorization", "Bearer not-the-token") + res2, err := stdhttp.DefaultClient.Do(req) + if err != nil { + t.Fatalf("GET: %v", err) + } + defer res2.Body.Close() + if res2.StatusCode != stdhttp.StatusUnauthorized { + t.Errorf("wrong token: got %d", res2.StatusCode) + } + + // Right token. + req3, _ := stdhttp.NewRequest(stdhttp.MethodGet, url+"/metrics", nil) + req3.Header.Set("Authorization", "Bearer the-token") + res3, err3 := stdhttp.DefaultClient.Do(req3) + if err3 != nil { + t.Fatalf("GET: %v", err3) + } + defer res3.Body.Close() + if res3.StatusCode != stdhttp.StatusOK { + t.Errorf("right token: got %d", res3.StatusCode) + } + if ct := res3.Header.Get("Content-Type"); !strings.HasPrefix(ct, "text/plain") { + t.Errorf("content-type: %q", ct) + } +} + +func TestMetricsCIDRGate(t *testing.T) { + t.Parallel() + // 127.0.0.1 is what httptest hits with; pick a CIDR that excludes it + // to assert the "wrong source" branch. + url, _, _ := newMetricsServer(t, config.Config{ + MetricsTrustedCIDRs: []string{"10.0.0.0/8"}, + }) + res, err := stdhttp.Get(url + "/metrics") + if err != nil { + t.Fatalf("GET: %v", err) + } + defer res.Body.Close() + if res.StatusCode != stdhttp.StatusUnauthorized { + t.Errorf("loopback hitting non-matching CIDR: got %d, want 401", res.StatusCode) + } + + // Now allow loopback. 
+ url2, _, _ := newMetricsServer(t, config.Config{ + MetricsTrustedCIDRs: []string{"127.0.0.0/8"}, + }) + res2, err := stdhttp.Get(url2 + "/metrics") + if err != nil { + t.Fatalf("GET: %v", err) + } + defer res2.Body.Close() + if res2.StatusCode != stdhttp.StatusOK { + t.Errorf("loopback in allow CIDR: got %d, want 200", res2.StatusCode) + } +} + +func TestMetricsTokenAndCIDRBothRequired(t *testing.T) { + t.Parallel() + url, _, _ := newMetricsServer(t, config.Config{ + MetricsToken: "the-token", + MetricsTrustedCIDRs: []string{"127.0.0.0/8"}, + }) + // CIDR only: source is loopback (allowed) but the token is missing. + res, err := stdhttp.Get(url + "/metrics") + if err != nil { + t.Fatalf("GET: %v", err) + } + defer res.Body.Close() + if res.StatusCode != stdhttp.StatusUnauthorized { + t.Errorf("missing token but in CIDR: got %d", res.StatusCode) + } + + // Both right. + req, _ := stdhttp.NewRequest(stdhttp.MethodGet, url+"/metrics", nil) + req.Header.Set("Authorization", "Bearer the-token") + res2, err := stdhttp.DefaultClient.Do(req) + if err != nil { + t.Fatalf("GET: %v", err) + } + defer res2.Body.Close() + if res2.StatusCode != stdhttp.StatusOK { + t.Errorf("both right: got %d", res2.StatusCode) + } +} + +func readAll(t *testing.T, r io.Reader) string { + t.Helper() + b, err := io.ReadAll(r) + if err != nil { + t.Fatalf("read: %v", err) + } + return string(b) +} + +func TestMetricsBodyContainsExpectedLines(t *testing.T) { + t.Parallel() + url, reg, _ := newMetricsServer(t, config.Config{ + MetricsToken: "the-token", + }) + reg.ObserveJob("backup", "succeeded", 0) // produce one histogram row + + req, _ := stdhttp.NewRequest(stdhttp.MethodGet, url+"/metrics", nil) + req.Header.Set("Authorization", "Bearer the-token") + res, err := stdhttp.DefaultClient.Do(req) + if err != nil { + t.Fatalf("GET: %v", err) + } + defer res.Body.Close() + body := readAll(t, res.Body) + for _, want := range []string{ + "rm_hosts_total", + "rm_hosts_online", +
`rm_active_alerts{severity="critical"}`, + "rm_build_info{", + "rm_job_duration_seconds_count{kind=\"backup\",status=\"succeeded\"}", + } { + if !strings.Contains(body, want) { + t.Errorf("body missing %q\n--- body ---\n%s", want, body) + } + } +} diff --git a/internal/server/http/server.go b/internal/server/http/server.go index c2d90c3..7d79cbf 100644 --- a/internal/server/http/server.go +++ b/internal/server/http/server.go @@ -17,6 +17,7 @@ import ( "gitea.dcglab.co.uk/steve/restic-manager/internal/crypto" "gitea.dcglab.co.uk/steve/restic-manager/internal/notification" "gitea.dcglab.co.uk/steve/restic-manager/internal/server/config" + "gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics" "gitea.dcglab.co.uk/steve/restic-manager/internal/server/oidc" "gitea.dcglab.co.uk/steve/restic-manager/internal/server/ui" "gitea.dcglab.co.uk/steve/restic-manager/internal/server/ws" @@ -56,6 +57,12 @@ type Deps struct { // OIDC (optional). Non-nil when the operator has configured an // IdP — handlers under /auth/oidc/* are mounted only when set. OIDC *oidc.Client + // Metrics (optional). When non-nil the WS job-finished branch + // records job durations and the /metrics handler can pull a + // histogram snapshot. Independent of MetricsAuthEnabled — the + // recorder runs even if the scrape endpoint is gated off, so a + // later config flip doesn't lose the running window. + Metrics *metrics.Registry } // Server is the running HTTP server. 
@@ -131,12 +138,16 @@ func (s *Server) routes(r chi.Router) { r.Get("/agent/binary", s.handleAgentBinary) r.Get("/install/*", s.handleInstallAsset) r.Get("/api/version", s.handleVersion) + if s.deps.Cfg.MetricsAuthEnabled() { + r.Get("/metrics", s.handleMetrics) + } if s.deps.Hub != nil { hd := ws.HandlerDeps{ Hub: s.deps.Hub, Store: s.deps.Store, JobHub: s.deps.JobHub, AlertEngine: s.deps.AlertEngine, + Metrics: s.deps.Metrics, OnHello: s.onAgentHello, OnScheduleAck: s.applyScheduleAck, OnScheduleFire: s.dispatchScheduledJob, diff --git a/internal/server/metrics/metrics.go b/internal/server/metrics/metrics.go new file mode 100644 index 0000000..588d796 --- /dev/null +++ b/internal/server/metrics/metrics.go @@ -0,0 +1,301 @@ +// Package metrics owns the in-process Prometheus exposition for +// the control plane. It deliberately avoids prometheus/client_golang +// — the legacy text format is small and stable, and the repo's house +// style is to keep dependency surface minimal. +// +// Two halves: +// +// - Registry holds a job-duration histogram. Server hooks call +// Registry.ObserveJob from the WS job-finished branch. +// +// - Render emits a complete /metrics body from a Snapshot. The +// Snapshot is a plain value bag; the HTTP handler assembles it +// from store reads + Registry.Snapshot at scrape time. This +// keeps the package free of any database or HTTP dependency. +package metrics + +import ( + "fmt" + "io" + "sort" + "strings" + "sync" + "time" +) + +// JobDurationBuckets is the upper-bound ladder for the job duration +// histogram, in seconds. Covers admin commands (unlock/init/check +// finishing in seconds) up through hours-long backups; +Inf is +// implicit. +var JobDurationBuckets = []float64{1, 5, 30, 60, 300, 1800, 3600, 21600, 86400} + +// Registry is the in-memory store for the job-duration histogram. +// Concurrent observers and a single periodic snapshotter is the +// expected access pattern; both are guarded by a mutex. 
+type Registry struct { + mu sync.Mutex + jobs map[jobKey]*histogramState + clock func() time.Time +} + +type jobKey struct{ kind, status string } + +type histogramState struct { + // counts[i] = number of observations <= JobDurationBuckets[i]. + // counts[len(JobDurationBuckets)] is the implicit +Inf bucket + // (== total count, kept here for symmetry with the rendered + // _bucket{le="+Inf"} line and as a sanity check). + counts []uint64 + sum float64 + count uint64 +} + +// NewRegistry builds an empty registry. +func NewRegistry() *Registry { + return &Registry{ + jobs: make(map[jobKey]*histogramState), + clock: time.Now, + } +} + +// ObserveJob records one job-duration sample. Negative durations +// (clock-skew artefacts) are clamped to zero. Empty kind/status +// strings are tolerated but degrade the dashboard — callers should +// pass meaningful values. +func (r *Registry) ObserveJob(kind, status string, dur time.Duration) { + if r == nil { + return + } + if dur < 0 { + dur = 0 + } + secs := dur.Seconds() + + r.mu.Lock() + defer r.mu.Unlock() + k := jobKey{kind: kind, status: status} + hs, ok := r.jobs[k] + if !ok { + hs = &histogramState{counts: make([]uint64, len(JobDurationBuckets)+1)} + r.jobs[k] = hs + } + for i, ub := range JobDurationBuckets { + if secs <= ub { + hs.counts[i]++ + } + } + hs.counts[len(JobDurationBuckets)]++ // +Inf + hs.sum += secs + hs.count++ +} + +// HistogramRow is one (kind,status) row in a Snapshot. Buckets is +// the cumulative count per upper bound (matching JobDurationBuckets, +// last element is the +Inf total). +type HistogramRow struct { + Kind string + Status string + Buckets []uint64 + Sum float64 + Count uint64 +} + +// snapshotJobs returns a deterministic, sorted copy of the +// histogram state. Sort order: kind asc, status asc. 
+func (r *Registry) snapshotJobs() []HistogramRow { + if r == nil { + return nil + } + r.mu.Lock() + defer r.mu.Unlock() + rows := make([]HistogramRow, 0, len(r.jobs)) + for k, hs := range r.jobs { + buckets := make([]uint64, len(hs.counts)) + copy(buckets, hs.counts) + rows = append(rows, HistogramRow{ + Kind: k.kind, + Status: k.status, + Buckets: buckets, + Sum: hs.sum, + Count: hs.count, + }) + } + sort.Slice(rows, func(i, j int) bool { + if rows[i].Kind != rows[j].Kind { + return rows[i].Kind < rows[j].Kind + } + return rows[i].Status < rows[j].Status + }) + return rows +} + +// HostRow is one host's projection for the per-host gauges. +// Pointers carry "no value" semantics so we can omit a metric line +// when, e.g., a host has never run a backup. +type HostRow struct { + ID string + Name string + Online bool + LastBackupUnix *int64 // nil = no backup yet + LastBackupSucceeded *bool // nil = no backup yet + RepoSizeBytes *int64 // nil = no stats yet + SnapshotCount int + OpenAlertCount int + RepoStatus string // "unknown" | "ready" | "init_failed" +} + +// Snapshot is a frozen view of the data needed to render /metrics. +// Constructed by the HTTP handler from Store reads + Registry.snapshotJobs. +type Snapshot struct { + Hosts []HostRow + HostsTotal int + HostsOnline int + AlertsBySeverity map[string]int // severity → count + BuildVersion string + BuildCommit string + GoVersion string + JobDurationRows []HistogramRow +} + +// SnapshotWith builds a Snapshot from raw inputs and the registry's +// current job-duration state. Convenience for the HTTP handler. 
+func (r *Registry) SnapshotWith(hosts []HostRow, alerts map[string]int, buildVer, commit, goVer string) Snapshot { + online := 0 + for _, h := range hosts { + if h.Online { + online++ + } + } + return Snapshot{ + Hosts: hosts, + HostsTotal: len(hosts), + HostsOnline: online, + AlertsBySeverity: alerts, + BuildVersion: buildVer, + BuildCommit: commit, + GoVersion: goVer, + JobDurationRows: r.snapshotJobs(), + } +} + +// Render emits a complete Prometheus text-exposition body for s. +// Output is deterministic for a given Snapshot: metric names appear +// in a fixed order and per-host series are sorted by host ID. +func Render(w io.Writer, s Snapshot) error { + var b strings.Builder + + // --- Server gauges --------------------------------------------------- + b.WriteString("# HELP rm_hosts_total Total number of enrolled hosts (excludes pending announces).\n") + b.WriteString("# TYPE rm_hosts_total gauge\n") + fmt.Fprintf(&b, "rm_hosts_total %d\n", s.HostsTotal) + + b.WriteString("# HELP rm_hosts_online Number of hosts currently online (status='online').\n") + b.WriteString("# TYPE rm_hosts_online gauge\n") + fmt.Fprintf(&b, "rm_hosts_online %d\n", s.HostsOnline) + + b.WriteString("# HELP rm_active_alerts Open alerts grouped by severity.\n") + b.WriteString("# TYPE rm_active_alerts gauge\n") + severities := []string{"info", "warning", "critical"} + for _, sev := range severities { + fmt.Fprintf(&b, "rm_active_alerts{severity=%q} %d\n", sev, s.AlertsBySeverity[sev]) + } + + b.WriteString("# HELP rm_build_info Build identifying labels; value is always 1.\n") + b.WriteString("# TYPE rm_build_info gauge\n") + fmt.Fprintf(&b, "rm_build_info{version=%q,commit=%q,go_version=%q} 1\n", + s.BuildVersion, s.BuildCommit, s.GoVersion) + + // --- Per-host gauges ------------------------------------------------- + // Stable order: by host id. + hosts := append([]HostRow(nil), s.Hosts...)
+ sort.Slice(hosts, func(i, j int) bool { return hosts[i].ID < hosts[j].ID }) + + b.WriteString("# HELP rm_host_agent_online 1 if the agent is currently online, 0 otherwise.\n") + b.WriteString("# TYPE rm_host_agent_online gauge\n") + for _, h := range hosts { + v := 0 + if h.Online { + v = 1 + } + fmt.Fprintf(&b, "rm_host_agent_online{host_id=%q,host=%q} %d\n", + h.ID, h.Name, v) + } + + b.WriteString("# HELP rm_host_last_backup_timestamp_seconds Unix timestamp of the host's most recent backup. Omitted for hosts with no backup yet.\n") + b.WriteString("# TYPE rm_host_last_backup_timestamp_seconds gauge\n") + for _, h := range hosts { + if h.LastBackupUnix == nil { + continue + } + fmt.Fprintf(&b, "rm_host_last_backup_timestamp_seconds{host_id=%q,host=%q} %d\n", + h.ID, h.Name, *h.LastBackupUnix) + } + + b.WriteString("# HELP rm_host_last_backup_success 1 if the host's most recent backup succeeded, 0 otherwise. Omitted for hosts with no backup yet.\n") + b.WriteString("# TYPE rm_host_last_backup_success gauge\n") + for _, h := range hosts { + if h.LastBackupSucceeded == nil { + continue + } + v := 0 + if *h.LastBackupSucceeded { + v = 1 + } + fmt.Fprintf(&b, "rm_host_last_backup_success{host_id=%q,host=%q} %d\n", + h.ID, h.Name, v) + } + + b.WriteString("# HELP rm_host_repo_size_bytes Latest reported repo size from `restic stats --mode raw-data`. 
Omitted for hosts with no stats yet.\n") + b.WriteString("# TYPE rm_host_repo_size_bytes gauge\n") + for _, h := range hosts { + if h.RepoSizeBytes == nil { + continue + } + fmt.Fprintf(&b, "rm_host_repo_size_bytes{host_id=%q,host=%q} %d\n", + h.ID, h.Name, *h.RepoSizeBytes) + } + + b.WriteString("# HELP rm_host_snapshot_count Number of restic snapshots known on the host's repo.\n") + b.WriteString("# TYPE rm_host_snapshot_count gauge\n") + for _, h := range hosts { + fmt.Fprintf(&b, "rm_host_snapshot_count{host_id=%q,host=%q} %d\n", + h.ID, h.Name, h.SnapshotCount) + } + + b.WriteString("# HELP rm_host_open_alerts Number of currently open alerts attached to this host.\n") + b.WriteString("# TYPE rm_host_open_alerts gauge\n") + for _, h := range hosts { + fmt.Fprintf(&b, "rm_host_open_alerts{host_id=%q,host=%q} %d\n", + h.ID, h.Name, h.OpenAlertCount) + } + + b.WriteString("# HELP rm_host_repo_status Repo readiness state for the host. Exactly one row per host with status label set.\n") + b.WriteString("# TYPE rm_host_repo_status gauge\n") + for _, h := range hosts { + st := h.RepoStatus + if st == "" { + st = "unknown" + } + fmt.Fprintf(&b, "rm_host_repo_status{host_id=%q,host=%q,status=%q} 1\n", + h.ID, h.Name, st) + } + + // --- Histogram ------------------------------------------------------- + b.WriteString("# HELP rm_job_duration_seconds End-to-end duration of completed jobs, by kind and terminal status.\n") + b.WriteString("# TYPE rm_job_duration_seconds histogram\n") + for _, row := range s.JobDurationRows { + for i, ub := range JobDurationBuckets { + fmt.Fprintf(&b, "rm_job_duration_seconds_bucket{kind=%q,status=%q,le=\"%g\"} %d\n", + row.Kind, row.Status, ub, row.Buckets[i]) + } + fmt.Fprintf(&b, "rm_job_duration_seconds_bucket{kind=%q,status=%q,le=\"+Inf\"} %d\n", + row.Kind, row.Status, row.Buckets[len(JobDurationBuckets)]) + fmt.Fprintf(&b, "rm_job_duration_seconds_sum{kind=%q,status=%q} %g\n", + row.Kind, row.Status, row.Sum) + fmt.Fprintf(&b, 
"rm_job_duration_seconds_count{kind=%q,status=%q} %d\n", + row.Kind, row.Status, row.Count) + } + + _, err := io.WriteString(w, b.String()) + return err +} diff --git a/internal/server/metrics/metrics_test.go b/internal/server/metrics/metrics_test.go new file mode 100644 index 0000000..70c5ed7 --- /dev/null +++ b/internal/server/metrics/metrics_test.go @@ -0,0 +1,182 @@ +package metrics + +import ( + "bytes" + "strings" + "sync" + "testing" + "time" +) + +func TestObserveJobBuckets(t *testing.T) { + r := NewRegistry() + // Bucket boundaries: 1, 5, 30, 60, 300, 1800, 3600, 21600, 86400 + r.ObserveJob("backup", "succeeded", 500*time.Millisecond) // <= 1 + r.ObserveJob("backup", "succeeded", 30*time.Second) // == 30 (boundary) + r.ObserveJob("backup", "succeeded", 90*time.Second) // > 60, <= 300 + r.ObserveJob("backup", "succeeded", 2*time.Hour) // > 3600 → 21600 bucket + rows := r.snapshotJobs() + if len(rows) != 1 { + t.Fatalf("rows: %d", len(rows)) + } + row := rows[0] + if row.Count != 4 { + t.Errorf("count: %d", row.Count) + } + wantSum := 0.5 + 30 + 90 + 7200.0 + if row.Sum != wantSum { + t.Errorf("sum: got %v want %v", row.Sum, wantSum) + } + // Cumulative buckets: + // le=1 → 1 (the 0.5s) + // le=5 → 1 + // le=30 → 2 (boundary inclusive: 30s included) + // le=60 → 2 + // le=300 → 3 + // le=1800 → 3 + // le=3600 → 3 + // le=21600 → 4 + // le=86400 → 4 + // le=+Inf → 4 + want := []uint64{1, 1, 2, 2, 3, 3, 3, 4, 4, 4} + for i, w := range want { + if row.Buckets[i] != w { + t.Errorf("bucket[%d]=%d want %d", i, row.Buckets[i], w) + } + } +} + +func TestObserveJobNegativeClampedToZero(t *testing.T) { + r := NewRegistry() + r.ObserveJob("backup", "succeeded", -5*time.Second) + rows := r.snapshotJobs() + if len(rows) != 1 || rows[0].Sum != 0 || rows[0].Count != 1 { + t.Errorf("expected one zero-second observation, got %+v", rows) + } +} + +func TestObserveJobConcurrent(t *testing.T) { + r := NewRegistry() + const goroutines = 16 + const each = 200 + var wg 
sync.WaitGroup + for g := 0; g < goroutines; g++ { + wg.Add(1) + go func() { + defer wg.Done() + for i := 0; i < each; i++ { + r.ObserveJob("backup", "succeeded", time.Second) + } + }() + } + wg.Wait() + rows := r.snapshotJobs() + if len(rows) != 1 { + t.Fatalf("rows: %d", len(rows)) + } + if rows[0].Count != uint64(goroutines*each) { + t.Errorf("count: got %d want %d", rows[0].Count, goroutines*each) + } +} + +func TestObserveJobNilRegistryNoop(t *testing.T) { + var r *Registry // nil + r.ObserveJob("backup", "succeeded", time.Second) +} + +func TestRenderGolden(t *testing.T) { + r := NewRegistry() + r.ObserveJob("backup", "succeeded", 5*time.Second) + r.ObserveJob("forget", "succeeded", 100*time.Millisecond) + + pi64 := func(v int64) *int64 { return &v } + pbool := func(v bool) *bool { return &v } + + hosts := []HostRow{ + { + ID: "01H0001", Name: "alpha", + Online: true, + LastBackupUnix: pi64(1700000000), + LastBackupSucceeded: pbool(true), + RepoSizeBytes: pi64(123456789), + SnapshotCount: 42, + OpenAlertCount: 0, + RepoStatus: "ready", + }, + { + ID: "01H0002", Name: "bravo", + Online: false, + SnapshotCount: 0, + OpenAlertCount: 1, + RepoStatus: "init_failed", + }, + } + snap := r.SnapshotWith(hosts, + map[string]int{"info": 0, "warning": 1, "critical": 0}, + "v1.2.3", "deadbeef", "go1.25.0") + + var buf bytes.Buffer + if err := Render(&buf, snap); err != nil { + t.Fatalf("render: %v", err) + } + out := buf.String() + + for _, want := range []string{ + "# HELP rm_hosts_total ", + "rm_hosts_total 2\n", + "rm_hosts_online 1\n", + `rm_active_alerts{severity="warning"} 1`, + `rm_active_alerts{severity="info"} 0`, + `rm_active_alerts{severity="critical"} 0`, + `rm_build_info{version="v1.2.3",commit="deadbeef",go_version="go1.25.0"} 1`, + `rm_host_agent_online{host_id="01H0001",host="alpha"} 1`, + `rm_host_agent_online{host_id="01H0002",host="bravo"} 0`, + `rm_host_last_backup_timestamp_seconds{host_id="01H0001",host="alpha"} 1700000000`, + 
`rm_host_last_backup_success{host_id="01H0001",host="alpha"} 1`, + `rm_host_repo_size_bytes{host_id="01H0001",host="alpha"} 123456789`, + `rm_host_snapshot_count{host_id="01H0001",host="alpha"} 42`, + `rm_host_snapshot_count{host_id="01H0002",host="bravo"} 0`, + `rm_host_open_alerts{host_id="01H0002",host="bravo"} 1`, + `rm_host_repo_status{host_id="01H0001",host="alpha",status="ready"} 1`, + `rm_host_repo_status{host_id="01H0002",host="bravo",status="init_failed"} 1`, + `rm_job_duration_seconds_bucket{kind="backup",status="succeeded",le="1"} 0`, + `rm_job_duration_seconds_bucket{kind="backup",status="succeeded",le="5"} 1`, + `rm_job_duration_seconds_bucket{kind="backup",status="succeeded",le="+Inf"} 1`, + `rm_job_duration_seconds_sum{kind="backup",status="succeeded"} 5`, + `rm_job_duration_seconds_count{kind="backup",status="succeeded"} 1`, + `rm_job_duration_seconds_bucket{kind="forget",status="succeeded",le="1"} 1`, + } { + if !strings.Contains(out, want) { + t.Errorf("missing line:\n %s\n--- full output ---\n%s", want, out) + } + } + + // bravo had no last backup → those metric lines must be absent for it. + for _, ban := range []string{ + `rm_host_last_backup_timestamp_seconds{host_id="01H0002"`, + `rm_host_last_backup_success{host_id="01H0002"`, + `rm_host_repo_size_bytes{host_id="01H0002"`, + } { + if strings.Contains(out, ban) { + t.Errorf("unexpected line for bravo: %q", ban) + } + } +} + +func TestRenderEmptySnapshot(t *testing.T) { + r := NewRegistry() + snap := r.SnapshotWith(nil, nil, "dev", "", "go1.25.0") + var buf bytes.Buffer + if err := Render(&buf, snap); err != nil { + t.Fatalf("render: %v", err) + } + out := buf.String() + if !strings.Contains(out, "rm_hosts_total 0\n") { + t.Errorf("missing zero-host gauge:\n%s", out) + } + // Histogram block has its HELP/TYPE but no rows. The HELP/TYPE + // presence is correct and helps Prometheus pre-register the metric. 
+ if !strings.Contains(out, "# TYPE rm_job_duration_seconds histogram") { + t.Errorf("histogram HELP/TYPE missing") + } +} diff --git a/internal/server/ws/handler.go b/internal/server/ws/handler.go index 4fd0e4c..6c54b81 100644 --- a/internal/server/ws/handler.go +++ b/internal/server/ws/handler.go @@ -15,6 +15,7 @@ import ( "gitea.dcglab.co.uk/steve/restic-manager/internal/alert" "gitea.dcglab.co.uk/steve/restic-manager/internal/api" "gitea.dcglab.co.uk/steve/restic-manager/internal/auth" + "gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics" "gitea.dcglab.co.uk/steve/restic-manager/internal/store" "gitea.dcglab.co.uk/steve/restic-manager/internal/version" ) @@ -27,6 +28,9 @@ type HandlerDeps struct { // AlertEngine receives job-finished and host-online events so the // alert engine can evaluate its rules. Optional; nil = no-op. AlertEngine *alert.Engine + // Metrics records job-duration observations on every terminal + // status. Optional; nil = no-op (test fixtures pass nil). + Metrics *metrics.Registry // UpdateWatcher reconciles in-flight agent-update dispatches against // hello envelopes. Optional; nil = no-op. UpdateWatcher *UpdateWatcher @@ -239,6 +243,13 @@ func dispatchAgentMessage(ctx context.Context, c *Conn, hostID string, env api.E slog.Warn("ws: set host last backup", "host_id", hostID, "err", err) } } + // Job-duration histogram (P6-04). Skip when StartedAt is + // missing (race: agent shipped finished without a started, + // or the row predates this code). + if deps.Metrics != nil && job.StartedAt != nil { + deps.Metrics.ObserveJob(job.Kind, string(p.Status), + p.FinishedAt.Sub(*job.StartedAt)) + } } if deps.JobHub != nil { deps.JobHub.Broadcast(p.JobID, env) diff --git a/tasks.md b/tasks.md index 721a36a..a696930 100644 --- a/tasks.md +++ b/tasks.md @@ -390,8 +390,45 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days. > swap, helper `buildRepoTrendView` shared between page-load and > fragment endpoint). 
No new dependencies, no client JS, no agent > change. CI green; in-browser smoke walk-through pending operator. -- [ ] **P6-04** (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list. _(Was P4-08.)_ -- [ ] **P6-05** (S) Document Prometheus integration + sample Grafana dashboard JSON. _(Was P4-09.)_ +- [x] **P6-04** (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list. _(Was P4-08.)_ +- [x] **P6-05** (S) Document Prometheus integration + sample Grafana dashboard JSON. _(Was P4-09.)_ + +> **As shipped (2026-05-07, branch `p6-04-05-prometheus-metrics`):** +> Spec `docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md`, +> plan `docs/superpowers/plans/2026-05-07-p6-04-05-prometheus-metrics.md`. +> New `internal/server/metrics` package emits the legacy +> `text/plain; version=0.0.4` exposition format directly, with no +> `prometheus/client_golang` dependency, matching the repo's +> "no Tailwind, no Node" minimal-deps style. `/metrics` is **opt-in**: +> `RM_METRICS_TOKEN` and/or `RM_METRICS_TRUSTED_CIDR` must be set or +> the route isn't mounted at all (404). When both are set, both must +> pass; when only one is set, it alone gates access. Token compare is constant-time. +> CIDR check honours `X-Forwarded-For` only when the immediate hop +> is a configured `RM_TRUSTED_PROXY` (mirrors the existing realIP +> resolution).
+> +> **Metrics:** per-host gauges (`rm_host_agent_online`, +> `rm_host_last_backup_timestamp_seconds`, `rm_host_last_backup_success`, +> `rm_host_repo_size_bytes`, `rm_host_snapshot_count`, +> `rm_host_open_alerts`, `rm_host_repo_status`); server gauges +> (`rm_hosts_total`, `rm_hosts_online`, `rm_active_alerts{severity}`, +> `rm_build_info{version,commit,go_version}`); histogram +> `rm_job_duration_seconds_bucket{kind,status,le}` with buckets +> `1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf`. +> Histogram is in-memory; observations come from the existing +> `MsgJobFinished` branch in `internal/server/ws/handler.go`. +> +> **Docs:** `docs/prometheus.md` covers enable + scrape config + +> metric reference + dashboard import. **Dashboard:** +> `deploy/grafana/restic-manager-dashboard.json` — six panels +> (fleet status, open alerts, backups failing, hosts table, repo +> size over time, job-duration p95). Schema 39, single Prometheus +> datasource variable. +> +> **Tests:** golden-render + concurrent-observe + bucket-boundary +> in the metrics package; auth matrix (no auth → 404; token +> missing/wrong/right; CIDR matching/non-matching; token AND CIDR) +> in the HTTP layer. ### Phase 6 acceptance -- 2.52.0