P6-04+05: Prometheus /metrics endpoint + Grafana dashboard
CI / Test (rest) (pull_request) Successful in 41s
CI / Test (store) (pull_request) Successful in 43s
CI / Lint (pull_request) Successful in 29s
CI / Build (windows/amd64) (pull_request) Successful in 44s
CI / Test (server-http) (pull_request) Successful in 1m47s
CI / Build (linux/arm64) (pull_request) Successful in 43s
CI / Build (linux/amd64) (pull_request) Successful in 2m1s

New internal/server/metrics package emits the legacy text/plain
exposition format directly, so we don't pull in
prometheus/client_golang. Endpoint is opt-in via RM_METRICS_TOKEN
and/or RM_METRICS_TRUSTED_CIDR; route is not mounted at all if
neither gate is set. Both gates ANDed when both configured.

Per-host gauges (online, last_backup_*, repo_size_bytes,
snapshot_count, open_alerts, repo_status), server gauges
(hosts_total/online, active_alerts by severity, build_info), and
an in-memory job-duration histogram observed from the existing
MsgJobFinished branch in the WS handler.

Docs in docs/prometheus.md (enable + scrape config + metric
reference + dashboard import). Sample dashboard at
deploy/grafana/restic-manager-dashboard.json - six panels,
Grafana schema 39, single Prometheus datasource variable.

Tests: golden render, concurrent observe, bucket boundaries in
the metrics package; auth matrix (no auth -> 404, token gate,
CIDR gate, both required) in the HTTP layer.
This commit is contained in:
2026-05-07 23:17:15 +01:00
parent 70ff554402
commit 73e733be61
12 changed files with 1480 additions and 2 deletions
+3
View File
@@ -20,6 +20,7 @@ import (
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/fleetupdate"
rmhttp "gitea.dcglab.co.uk/steve/restic-manager/internal/server/http"
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/maintenance"
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/oidc"
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ui"
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ws"
@@ -89,6 +90,7 @@ func run() error {
hub := ws.NewHub()
jobHub := ws.NewJobHub()
metricsRegistry := metrics.NewRegistry()
notifHub := notification.NewHub(st, aead, cfg.BaseURL)
alertEngine := alert.NewEngine(st, notifHub)
@@ -122,6 +124,7 @@ func run() error {
UI: renderer,
Version: version,
OIDC: oidcClient,
Metrics: metricsRegistry,
}
// First-run bootstrap: if the users table is empty, mint a one-time