# Observability with Prometheus

restic-manager can expose a Prometheus scrape endpoint at `GET /metrics`. The endpoint is **opt-in**: without an explicit auth gate it isn't even mounted, so a forgotten config can't accidentally publish fleet state.

The full reference lives at [`docs/prometheus.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/docs/prometheus.md); the short version follows.

## Enable the endpoint

Set at least one of:

- `RM_METRICS_TOKEN`: requests must carry this token in an `Authorization: Bearer` header.
- `RM_METRICS_TRUSTED_CIDR`: restricts source IPs to a comma-separated list of CIDRs.

When both are set, a request must satisfy both. The token is compared in constant time; the CIDR check honours `X-Forwarded-For` only when the immediate hop matches `RM_TRUSTED_PROXY`.

## Metrics emitted

- **Server gauges**: `rm_hosts_total`, `rm_hosts_online`, `rm_active_alerts{severity}`, `rm_build_info{...}`.
- **Per-host gauges**: `rm_host_agent_online`, `rm_host_last_backup_timestamp_seconds`, `rm_host_last_backup_success`, `rm_host_repo_size_bytes`, `rm_host_snapshot_count`, `rm_host_open_alerts`, `rm_host_repo_status`.
- **Histogram**: `rm_job_duration_seconds{kind,status,le=…}` (buckets `1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf`).

The histogram is kept in memory only. Prometheus persists the scrapes, so durable history (for example at hourly resolution) is Prometheus's job.

## Sample Grafana dashboard

[`deploy/grafana/restic-manager-dashboard.json`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/deploy/grafana/restic-manager-dashboard.json) imports through Grafana's **+ → Import → Upload JSON file**. It has six panels:

1. Fleet status (online / total).
2. Open alerts by severity.
3. Backups failing on most-recent run.
4. Hosts table: last backup, repo size, snapshots, open alerts.
5. Repo size over time, one line per host.
6. Job-duration p95 over a 1h window per kind.

## Alerting

restic-manager already has a built-in alert engine ([Alerts](./alerts.md)).
The dashboard intentionally doesn't duplicate it as Prometheus alert rules. If you want Prometheus-side alerts on top, write your own from the metrics above: for example `rm_host_last_backup_success == 0`, or `time() - rm_host_last_backup_timestamp_seconds` exceeding a staleness threshold you choose, or whatever suits your environment.
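
To try out the two conditions mentioned above before committing them to Prometheus rules, a small script can scrape `/metrics` directly and apply the same checks. The following is a minimal sketch, not part of restic-manager. It assumes the per-host gauges carry a label named `host` (the actual label name is not stated here, so it is a parameter), and the helper names `fetch_metrics`, `parse_samples`, and `problem_hosts` are hypothetical. The parser handles only plain sample lines of the Prometheus text format and does not unescape label values.

```python
import re
import time
import urllib.request

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def fetch_metrics(url, token):
    """Fetch the scrape body, sending the token as a Bearer credential."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8")


def parse_samples(text):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def problem_hosts(text, max_age_seconds, now=None, host_label="host"):
    """Map each host to the reasons it would trip the two example conditions.

    host_label is an assumption; adjust it to whatever label the
    per-host gauges actually use.
    """
    now = time.time() if now is None else now
    problems = {}
    for name, labels, value in parse_samples(text):
        host = labels.get(host_label, "<unknown>")
        if name == "rm_host_last_backup_success" and value == 0:
            problems.setdefault(host, []).append("last backup failed")
        elif (name == "rm_host_last_backup_timestamp_seconds"
              and now - value > max_age_seconds):
            problems.setdefault(host, []).append("last backup is stale")
    return problems
```

Calling `problem_hosts(fetch_metrics(url, token), max_age_seconds=86400)` would list hosts whose last backup failed or is more than a day old. The one-day threshold is only an illustration; pick a value that matches your backup schedule.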