# P6-04 + P6-05 — Prometheus `/metrics` + Grafana dashboard

Date: 2026-05-07
Author: Claude (autonomous, sensible-defaults brief from operator)
Tasks: P6-04 (M), P6-05 (S)

## Problem

The control plane already knows everything a backup operator needs
to monitor (last-backup timestamp and status, repo size, snapshot
count, agent online state, open alerts, build version), but it surfaces
those only through the dashboard HTML and a few JSON endpoints. To
plug into the operator's existing observability stack we need a
plain Prometheus exposition endpoint and a Grafana dashboard JSON
that reads from it.

## Goals

- `GET /metrics` emits the standard Prometheus text format with the
  per-host, server, and job-duration metrics enumerated in the
  task entry (P6-04 in `tasks.md`).
- The endpoint is opt-in and gated by a bearer token and/or an IP
  allow-list; it is never publicly readable by default.
- No new third-party dependency (`prometheus/client_golang` is not
  pulled in). The exposition format is small and stable enough to
  emit by hand, which matches the repo's "no Tailwind/Node" style.
- A sample Grafana dashboard is committed to the repo so a stranger can
  drop it into a Grafana instance and get a working view.

## Non-goals

- OpenMetrics. The legacy text format with `# HELP`/`# TYPE` is
  what every Prometheus server still parses and what every example
  online demonstrates, so pick the boring option.
- Pushgateway or remote-write integration.
- Per-job metric cardinality (no `job_id` labels; that would
  make the histogram explode).
- Alerting rules. Operators already have alerts inside
  restic-manager (P3-05); duplicating them in Prometheus is a
  YAGNI hazard. The dashboard is read-only.

## Auth

Two switches, both off by default. If neither is set the route
isn't mounted at all (404 from the chi router), which avoids any
accidental "wide-open scrape endpoint" deployment.

| env var | type | meaning |
| --- | --- | --- |
| `RM_METRICS_TOKEN` | string | If set, callers must send `Authorization: Bearer <token>`. Compared with `crypto/subtle.ConstantTimeCompare`. |
| `RM_METRICS_TRUSTED_CIDR` | comma-separated CIDRs | If set, callers must connect from a source IP inside one of these CIDRs. Reuses the existing `RM_TRUSTED_PROXY` semantics for honouring `X-Forwarded-For`. |

If both are set, both must pass (AND, not OR): a token leak doesn't grant access from outside the network, and a trusted-network compromise doesn't grant access without the token. If only one is set, that one alone gates access.

YAML overlay mirrors env: `metrics_token`, `metrics_trusted_cidrs`.

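The AND rule can be sketched as a pure function over the configured gates and the request's credentials. This is a minimal illustration, not the actual `authoriseMetricsScrape`: the `metricsAuth` type, `parseCIDRs`, and `allowScrape` are hypothetical names, and the client IP is assumed to have been resolved already (including any `X-Forwarded-For` handling).

```go
package main

import (
	"crypto/subtle"
	"net/netip"
	"strings"
)

// metricsAuth is a hypothetical config shape for the two gates.
type metricsAuth struct {
	Token   string         // RM_METRICS_TOKEN; "" means not configured
	Trusted []netip.Prefix // parsed RM_METRICS_TRUSTED_CIDR; empty means not configured
}

// parseCIDRs splits a comma-separated CIDR list such as "10.0.0.0/8, 192.168.1.0/24".
func parseCIDRs(s string) ([]netip.Prefix, error) {
	var out []netip.Prefix
	for _, part := range strings.Split(s, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		p, err := netip.ParsePrefix(part)
		if err != nil {
			return nil, err
		}
		out = append(out, p)
	}
	return out, nil
}

// allowScrape applies the AND rule: every configured gate must pass.
// authHeader is the raw Authorization header; clientIP is the resolved source IP.
func allowScrape(cfg metricsAuth, authHeader, clientIP string) bool {
	if cfg.Token == "" && len(cfg.Trusted) == 0 {
		return false // nothing configured: the route should not be mounted at all
	}
	if cfg.Token != "" {
		const prefix = "Bearer "
		if !strings.HasPrefix(authHeader, prefix) {
			return false
		}
		got := authHeader[len(prefix):]
		if subtle.ConstantTimeCompare([]byte(got), []byte(cfg.Token)) != 1 {
			return false
		}
	}
	if len(cfg.Trusted) > 0 {
		addr, err := netip.ParseAddr(clientIP)
		if err != nil {
			return false
		}
		addr = addr.Unmap() // treat IPv4-mapped IPv6 addresses as plain IPv4
		inside := false
		for _, p := range cfg.Trusted {
			if p.Contains(addr) {
				inside = true
				break
			}
		}
		if !inside {
			return false
		}
	}
	return true
}
```

Keeping the check free of `*http.Request` makes the six cases in the handler test matrix easy to cover with plain table-driven tests.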
## Metrics

All metric names are prefixed `rm_`. Help text is concise.

### Per-host gauges (one row per `host_id`)

```
rm_host_agent_online{host_id,host}                   1 if status='online' else 0
rm_host_last_backup_timestamp_seconds{host_id,host}  unix seconds; omitted if no backup yet
rm_host_last_backup_success{host_id,host}            1 if last_backup_status='succeeded' else 0; omitted if no backup yet
rm_host_repo_size_bytes{host_id,host}                total_size from latest repo stats; omitted if unknown
rm_host_snapshot_count{host_id,host}                 integer
rm_host_open_alerts{host_id,host}                    count of open + un-resolved alerts attached to this host
rm_host_repo_status{host_id,host,status}             1, with status ∈ {unknown,ready,init_failed} (info-style, exactly one row per host)
```

The `host` label is `hosts.name` for human readability; `host_id` is
the stable ULID for joining across renames.

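Because `host` carries an operator-chosen name, a hand-written emitter has to escape label values (backslash, double quote, and newline in the text format). The sketch below renders one gauge family under that rule. The `hostSample` type, the function names, and the help string are illustrative assumptions, not the spec's `Render`.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// hostSample is a hypothetical, trimmed-down view of one host row.
type hostSample struct {
	ID     string // stable ULID, emitted as host_id
	Name   string // hosts.name, emitted as host
	Online bool
}

// labelEscaper applies the text-format escaping rules for label values.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// writeAgentOnline emits the HELP/TYPE header and one gauge row per host.
func writeAgentOnline(w io.Writer, hosts []hostSample) error {
	header := "# HELP rm_host_agent_online 1 if the agent is online, else 0.\n" +
		"# TYPE rm_host_agent_online gauge\n"
	if _, err := io.WriteString(w, header); err != nil {
		return err
	}
	for _, h := range hosts {
		v := 0
		if h.Online {
			v = 1
		}
		_, err := fmt.Fprintf(w, "rm_host_agent_online{host_id=\"%s\",host=\"%s\"} %d\n",
			labelEscaper.Replace(h.ID), labelEscaper.Replace(h.Name), v)
		if err != nil {
			return err
		}
	}
	return nil
}

// renderAgentOnline returns the rendered body as a string.
func renderAgentOnline(hosts []hostSample) string {
	var b strings.Builder
	_ = writeAgentOnline(&b, hosts) // strings.Builder never returns a write error
	return b.String()
}
```

Writing to an `io.Writer` keeps the emitter usable both from the HTTP handler and from golden-file tests that render into a buffer.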
### Server gauges

```
rm_hosts_total                            count of hosts (excludes pending)
rm_hosts_online                           count of hosts with status='online'
rm_active_alerts{severity}                count of open alerts by severity ∈ {info,warning,critical}
rm_build_info{version,commit,go_version}  always 1; pure label-bag for joining
```

### Job duration histogram

```
rm_job_duration_seconds_bucket{kind,status,le=...}
rm_job_duration_seconds_sum{kind,status}
rm_job_duration_seconds_count{kind,status}
```

`kind` ∈ {backup,forget,prune,check,unlock,restore,diff,init,update}
(every JobKind we currently dispatch). `status` ∈
{succeeded,failed,cancelled}. Buckets cover the realistic range:
short admin commands (unlock, init) finish in seconds; backups can
take hours.

```
1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf
(1s 5s 30s 1m 5m 30m 1h 6h 24h)
```

The histogram state is in-memory only and resets on process restart; operators who want
durable history scrape into Prometheus and let it persist.

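Prometheus buckets are cumulative: an observation increments every bucket whose upper bound is at least the observed value, and `+Inf` always. A minimal sketch of the in-memory state under that rule follows. The `series` and `seriesKey` types and the snapshot shape are assumptions for illustration; only `ObserveJob`'s signature and the bucket bounds come from this spec.

```go
package main

import (
	"sync"
	"time"
)

// bucketBounds are the finite upper bounds in seconds; +Inf is the implicit last slot.
var bucketBounds = []float64{1, 5, 30, 60, 300, 1800, 3600, 21600, 86400}

// series holds cumulative bucket counts for one kind+status pair.
type series struct {
	Buckets []uint64 // len(bucketBounds)+1; the last slot is +Inf
	Sum     float64  // total observed seconds
	Count   uint64   // number of observations
}

// seriesKey identifies one histogram series (hypothetical name).
type seriesKey struct{ Kind, Status string }

// Registry is a sketch of the mutex-guarded histogram state.
type Registry struct {
	mu sync.Mutex
	m  map[seriesKey]*series
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{m: make(map[seriesKey]*series)}
}

// ObserveJob records one finished job in every bucket it falls under.
func (r *Registry) ObserveJob(kind, status string, dur time.Duration) {
	secs := dur.Seconds()
	r.mu.Lock()
	defer r.mu.Unlock()
	k := seriesKey{Kind: kind, Status: status}
	s := r.m[k]
	if s == nil {
		s = &series{Buckets: make([]uint64, len(bucketBounds)+1)}
		r.m[k] = s
	}
	for i, ub := range bucketBounds {
		if secs <= ub { // le is inclusive
			s.Buckets[i]++
		}
	}
	s.Buckets[len(bucketBounds)]++ // +Inf counts every observation
	s.Sum += secs
	s.Count++
}

// Snapshot deep-copies the state so rendering never holds the lock.
func (r *Registry) Snapshot() map[seriesKey]series {
	r.mu.Lock()
	defer r.mu.Unlock()
	out := make(map[seriesKey]series, len(r.m))
	for k, s := range r.m {
		cp := *s
		cp.Buckets = append([]uint64(nil), s.Buckets...)
		out[k] = cp
	}
	return out
}
```

Storing counts cumulatively at observe time keeps the render step a straight copy; the alternative (per-bucket counts summed at render time) would also work and trades a little render cost for cheaper observes.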
## Architecture

New package `internal/server/metrics`:

- `Registry`: owns the histogram state (`sync.Mutex` + map keyed by
  `kind+status`). `ObserveJob(kind, status string, dur time.Duration)`
  is the only mutator. Lookups via `Snapshot()` are read-only and
  copy out.
- `Render(w io.Writer, snapshot Snapshot)`: emits the full
  exposition body. The snapshot is supplied by the HTTP handler
  pulling from `Store` on each scrape; the package itself has no
  store dependency, which keeps it trivially unit-testable.

New file `internal/server/http/metrics.go`:

- `handleMetrics(w, r)`: auth check (bearer + CIDR), pull current
  fleet snapshot from `Store`, ask `metrics.Render` to emit.
- Auth helper `authoriseMetricsScrape(r)`: pure function over
  request + config; tested directly.

Wiring:

- `cmd/server` constructs the `metrics.Registry` once and threads
  it into both `Deps` (for the HTTP layer) and `ws.HandlerDeps`
  (so the job-finished branch can call `ObserveJob`).
- The `MsgJobFinished` branch in `ws/handler.go` grows a single line:
  `if deps.Metrics != nil { deps.Metrics.ObserveJob(job.Kind, p.Status, p.FinishedAt.Sub(job.StartedAt)) }`.
  The nil check lets it fall back gracefully if the registry was never wired (tests).

Route registration in `server.go`:

```go
// Mount the route only when a token or CIDR allow-list is configured;
// otherwise chi answers /metrics with its default 404.
if s.deps.Cfg.MetricsAuthEnabled() {
	r.Get("/metrics", s.handleMetrics)
}
```

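The "not mounted means 404" behaviour is easy to pin down in isolation. The sketch below swaps in the standard library's `http.ServeMux` for chi so it stays dependency-free; the `config` type, its fields, `newMux`, and `statusFor` are hypothetical stand-ins, and auth plus rendering are elided inside the handler.

```go
package main

import (
	"io"
	"net/http"
	"net/http/httptest"
)

// config is a hypothetical stand-in for the server config.
type config struct {
	MetricsToken        string
	MetricsTrustedCIDRs []string
}

// MetricsAuthEnabled reports whether at least one gate is configured.
func (c config) MetricsAuthEnabled() bool {
	return c.MetricsToken != "" || len(c.MetricsTrustedCIDRs) > 0
}

// newMux mounts /metrics only when a gate is configured.
func newMux(c config) *http.ServeMux {
	mux := http.NewServeMux()
	if c.MetricsAuthEnabled() {
		mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
			// Auth check and real rendering are elided in this sketch.
			io.WriteString(w, "# metrics body\n")
		})
	}
	return mux
}

// statusFor issues an in-process GET /metrics and returns the status code.
func statusFor(c config) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/metrics", nil)
	newMux(c).ServeHTTP(rec, req)
	return rec.Code
}
```

With an empty config `statusFor` reports 404, and setting either field flips it to 200, which mirrors the first row of the handler test matrix.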
## Cardinality + cost

Per scrape: O(hosts) gauge rows + O(kinds × statuses × buckets) histogram rows. For a 100-host fleet that's ~700 host rows (7 per-host metrics × 100 hosts) plus up to ~270 histogram bucket rows (9 kinds × 3 statuses × 10 buckets, with a further 54 `_sum`/`_count` rows), well under any practical limit. The histogram rows do not grow with fleet size. The store reads we issue per scrape are `ListHosts` (already exists, one query) and `ListAlerts` filtered by open status (one query); the `GetHostRepoStats` data is already projected onto `Host` via `repo_size_bytes`, so it costs no extra read. No N+1.

A 10s scrape interval against a 100-host fleet is cheap: each scrape is a couple of indexed SQLite reads + a small string render. We're nowhere near the point where caching the snapshot would be worthwhile.

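The row counts can be reproduced from the metric lists in this spec. The helper below is only a back-of-the-envelope sketch with hypothetical names: it counts sample rows (not `# HELP`/`# TYPE` lines) and assumes every kind × status combination has been observed at least once, which is the worst case.

```go
package main

const (
	perHostMetrics   = 7  // rows in the per-host gauge list
	serverRows       = 6  // rm_hosts_total, rm_hosts_online, 3 severities, rm_build_info
	jobKinds         = 9  // backup, forget, prune, check, unlock, restore, diff, init, update
	jobStatuses      = 3  // succeeded, failed, cancelled
	bucketsPerSeries = 10 // nine finite bounds plus +Inf
)

// hostRows is the upper bound on per-host gauge rows for a fleet.
func hostRows(hosts int) int {
	return perHostMetrics * hosts
}

// bucketRows counts only the _bucket rows of the job histogram.
func bucketRows() int {
	return jobKinds * jobStatuses * bucketsPerSeries
}

// histogramRows adds the _sum and _count row of each series.
func histogramRows() int {
	return jobKinds * jobStatuses * (bucketsPerSeries + 2)
}

// rowsPerScrape is the worst-case sample count for one scrape.
func rowsPerScrape(hosts int) int {
	return hostRows(hosts) + serverRows + histogramRows()
}
```

For 100 hosts this gives 700 host rows, 270 bucket rows, 324 histogram rows in total, and 1030 samples per scrape; only the first term grows with the fleet.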
## Documentation (P6-05)

- `docs/prometheus.md`: sibling to the existing `docs/reverse-proxy.md`. Sections: enabling the endpoint (env vars), Prometheus scrape config snippet (with bearer + TLS), the metric reference table (copy-pasted from this spec), the dashboard import instructions.
- `deploy/grafana/restic-manager-dashboard.json`: Grafana 11+ dashboard JSON. Single Prometheus datasource variable, six panels:
  1. **Fleet status**: stat panel showing `rm_hosts_online / rm_hosts_total` + a sparkline.
  2. **Open alerts**: stat panel by severity (`sum by (severity) (rm_active_alerts)`).
  3. **Hosts**: table of `host`, `online`, `last_backup` (relative time via `time() - rm_host_last_backup_timestamp_seconds`), `repo_size`, `snapshots`.
  4. **Repo size over time**: time series, one line per host, `rm_host_repo_size_bytes`.
  5. **Backups failing**: time series counting hosts where `rm_host_last_backup_success == 0`.
  6. **Job duration p95**: `histogram_quantile(0.95, sum by (kind, le) (rate(rm_job_duration_seconds_bucket[1h])))` over a 1h window.

Dashboard is committed as plain JSON; an operator imports it through the Grafana UI ("+ → Import → upload JSON file") or Grafana provisioning.

## Testing

- Unit tests for `metrics.Render` against a fixed snapshot, golden-file style. Pin the exact line ordering (sorted by metric name + label set) so diffs stay tractable.
- Unit tests for `metrics.Registry.ObserveJob`: concurrent writes, bucket boundary correctness, snapshot independence.
- Handler tests for `/metrics` covering: no auth configured → 404; token configured + missing → 401; token configured + correct → 200 + body sniff; CIDR configured + wrong source → 401; CIDR configured + right source → 200; both configured → require both.
- End-to-end smoke verification deferred to manual operator walk-through; full Playwright pass is P5-06's job.

## Out of scope, explicitly

- Per-job latency tracking with `job_id` labels (cardinality bomb).
- Restore-specific metrics (P3 surfaces are still settling).
- Histograms keyed on host (kind × status × buckets × hosts is a Prom anti-pattern).
- Auto-discovery / file-SD generators for Prometheus.