restic-manager/docs/prometheus.md

# Prometheus + Grafana

restic-manager exposes a Prometheus scrape endpoint at `GET /metrics`.
The endpoint is **opt-in** — it is not mounted at all unless you set
at least one of the auth gates below. Once enabled, it serves the
standard `text/plain` exposition format, which Prometheus 2.x and
later parse without extra configuration.

A sample Grafana dashboard lives at
`deploy/grafana/restic-manager-dashboard.json`.

## Enable the endpoint

Two switches, both off by default. If both are set, both must pass
(token AND source-IP); if only one is set, that gate alone
authorises a scrape.

| Env var                    | YAML key               | Effect |
|----------------------------|------------------------|--------|
| `RM_METRICS_TOKEN`         | `metrics_token`        | Requires `Authorization: Bearer <token>`. Compared in constant time. |
| `RM_METRICS_TRUSTED_CIDR`  | `metrics_trusted_cidrs` (list) | Restricts the source IP to one of the listed CIDRs. Comma-separated in env, list in YAML. Honours `X-Forwarded-For` only when the immediate hop matches `RM_TRUSTED_PROXY`. |

When neither is set, `GET /metrics` returns 404 — the route is not
registered with the chi router so a forgotten config can't
accidentally publish fleet state.
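
For deployments configured through the YAML file rather than environment variables, the same two gates might look like the sketch below. The key names come from the table above. The token value and the second CIDR range are placeholders, and placing both keys at the top level of the file is an assumption rather than something confirmed here.

```yaml
# Sketch of the metrics gates in restic-manager's YAML config.
# Assumption: both keys sit at the top level of the config file.
metrics_token: "replace-with-a-long-random-string"   # placeholder value
metrics_trusted_cidrs:                                # list form, per the table above
  - "10.0.0.0/8"
  - "192.168.10.0/24"                                 # hypothetical second range
```

With both keys set, a scrape has to present the bearer token and originate from one of the listed ranges. Removing either key leaves the other gate as the sole check.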

### Example: Docker

```yaml
services:
  restic-manager:
    image: gitea.dcglab.co.uk/steve/restic-manager:latest
    environment:
      # Interpolated by Compose from the shell environment or an .env file.
      RM_METRICS_TOKEN: "${RM_METRICS_TOKEN}"
      RM_METRICS_TRUSTED_CIDR: "10.0.0.0/8"
```

(`RM_METRICS_TOKEN_FILE` is not currently supported, so the token
cannot be read from a Docker secret yet; set `RM_METRICS_TOKEN`
directly. The `_FILE` convention is on the roadmap.)

## Prometheus scrape config

Drop into your `prometheus.yml`:

```yaml
scrape_configs:
  - job_name: restic-manager
    metrics_path: /metrics
    scheme: https            # via your reverse proxy
    static_configs:
      - targets: ['restic.example.com']
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/secrets/rm_metrics_token
```

If you don't run a TLS-terminating proxy in front, remove the
`scheme: https` line so Prometheus falls back to its default of
`http` (the server is HTTP-only; see `docs/reverse-proxy.md`).

## Metric reference

All names are `rm_`-prefixed. Per-host metrics carry a `host_id`
label (the stable ULID, immune to renames) and a `host` label
(the human-readable name).

### Server gauges

| Name                  | Labels                             | Description |
|-----------------------|------------------------------------|-------------|
| `rm_hosts_total`      | —                                  | Total number of enrolled hosts (excludes pending announces). |
| `rm_hosts_online`     | —                                  | Number of hosts with `status='online'`. |
| `rm_active_alerts`    | `severity` ∈ {info, warning, critical} | Open alerts by severity. |
| `rm_build_info`       | `version`, `commit`, `go_version`  | Always 1; pure label-bag for joining. |

### Per-host gauges

| Name                                       | Description |
|--------------------------------------------|-------------|
| `rm_host_agent_online`                     | 1 if the agent is currently online, 0 otherwise. |
| `rm_host_last_backup_timestamp_seconds`    | Unix timestamp of the host's most recent backup. **Omitted** for hosts with no backup yet. |
| `rm_host_last_backup_success`              | 1 if the most recent backup succeeded, 0 otherwise. **Omitted** for hosts with no backup yet. |
| `rm_host_repo_size_bytes`                  | Latest reported repo size from `restic stats --mode raw-data`. **Omitted** when unknown. |
| `rm_host_snapshot_count`                   | Number of restic snapshots known on the host's repo. |
| `rm_host_open_alerts`                      | Number of currently open alerts attached to this host. |
| `rm_host_repo_status`                      | Always 1; the `status` label carries `unknown` / `ready` / `init_failed`. |
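
The omitted-series and always-1 conventions in this table shape how queries are written. The rule file below is a sketch of two derived series built on them. The group name and `record` names are hypothetical, and it assumes `rm_host_agent_online` is emitted for every enrolled host.

```yaml
groups:
  - name: restic-manager-derived        # hypothetical group name
    rules:
      # Hosts that have never completed a backup: their timestamp series
      # is omitted, so "unless" keeps only the hosts with no matching
      # host_id on the right-hand side.
      - record: rm:hosts_never_backed_up:count
        expr: count(rm_host_agent_online unless on (host_id) rm_host_last_backup_timestamp_seconds)
      # rm_host_repo_status is always 1, so counting series per status
      # label gives the number of hosts in each state.
      - record: rm:hosts_by_repo_status:count
        expr: count by (status) (rm_host_repo_status)
```

`count()` over an empty result returns no series rather than 0, so the first rule produces nothing once every host has a backup.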

### Job duration histogram

```
rm_job_duration_seconds_bucket{kind, status, le}
rm_job_duration_seconds_sum{kind, status}
rm_job_duration_seconds_count{kind, status}
```

`kind` ∈ {backup, forget, prune, check, unlock, restore, diff, init, update}.
`status` ∈ {succeeded, failed, cancelled}.

Buckets (seconds):

```
1,  5,  30,  60,  300,  1800,  3600,  21600,  86400,  +Inf
1s  5s  30s  1m   5m    30m    1h     6h      24h
```

The histogram is in-memory only — values reset on process restart.
Operators who want durable history should let Prometheus persist
the scrapes; restic-manager itself is a control plane, not a
metrics database.
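
One way to turn the buckets into a per-kind latency figure on the Prometheus side is a recording rule such as the sketch below. The group and `record` names are hypothetical, and the expression is an illustration, not necessarily the query the bundled dashboard uses.

```yaml
groups:
  - name: restic-manager-durations      # hypothetical group name
    rules:
      # p95 job duration per kind over the last hour. rate() tolerates
      # the counter resets caused by a restic-manager restart.
      - record: rm:job_duration_seconds:p95_1h
        expr: |
          histogram_quantile(
            0.95,
            sum by (kind, le) (rate(rm_job_duration_seconds_bucket[1h]))
          )
```

`histogram_quantile` interpolates linearly inside a bucket, so estimates are coarse where buckets are wide (for example between 300 s and 1800 s). A quantile that lands in the `+Inf` bucket is reported as 86400, the highest finite bound.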

## Grafana dashboard

Import `deploy/grafana/restic-manager-dashboard.json`:

1. In Grafana, **+ → Import → Upload JSON file**.
2. Pick the Prometheus data source you scrape with.
3. The dashboard's six panels populate from the metrics above:
   * **Fleet status** — online/total stat panel.
   * **Open alerts** — by severity.
   * **Hosts** — per-host table (last backup, repo size, snapshots, alerts).
   * **Repo size over time** — one line per host.
   * **Backups failing** — count of hosts whose last backup didn't succeed.
   * **Job duration p95** — `histogram_quantile(0.95, …)` over a 1h window per kind.

Alerting is intentionally not configured in the dashboard — the
control plane already has alerts (P3-05) with native channels for
webhook, ntfy, and SMTP. Re-implementing them in Prometheus would
just duplicate state. If you do want Prom-side alerts, copy the
recording rules into your usual location.
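
For readers who do want Prometheus-side alerts, a rules file built on the per-host gauges might look like the sketch below. The group name, alert names, thresholds, and `for` durations are illustrative assumptions, not something shipped with restic-manager.

```yaml
groups:
  - name: restic-manager-alerts         # hypothetical group name
    rules:
      - alert: ResticBackupStale        # hypothetical alert name
        # Fires when the newest backup is older than 48h (example threshold).
        expr: time() - rm_host_last_backup_timestamp_seconds > 172800
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "No recent backup for {{ $labels.host }}"
      - alert: ResticLastBackupFailed   # hypothetical alert name
        expr: rm_host_last_backup_success == 0
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Last backup failed on {{ $labels.host }}"
```

Neither rule fires for a host that has never backed up, because both source series are omitted until the first backup is recorded.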

## Cardinality

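Per scrape: O(hosts) gauge rows + O(kinds × statuses × buckets)
histogram rows. A 100-host fleet emits roughly 700 host rows (7
per-host gauges × 100 hosts) + 270 histogram bucket rows (9 kinds ×
3 statuses × 10 buckets), well below any practical limit. The `_sum`
and `_count` series add at most 2 rows per kind/status pair, 54 in
total. There are no `job_id` labels (cardinality bomb avoidance) and
no per-source-group labels.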