e2e: pin Playwright to 1.59.1

`@playwright/test` was loose-pinned to ^1.50.0; npm resolved it to 1.59.1 inside the runner image, which only ships browser binaries for 1.50.0. Pin both the package and the docker image to v1.59.1 so deps and binaries stay aligned.
e2e: run health probe + Playwright on the compose network
2026-05-08 20:09:17 +01:00 · 2026-05-08 20:08:23 +01:00 · 2026-05-08 20:08:23 +01:00 · 2026-05-08 18:31:57 +00:00 · 2026-05-07 23:17:15 +01:00 · 2026-05-07 23:07:30 +01:00
15 changed files with 1719 additions and 8 deletions
@@ -20,6 +20,7 @@ import (
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/fleetupdate"
 	rmhttp "gitea.dcglab.co.uk/steve/restic-manager/internal/server/http"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/maintenance"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/oidc"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ui"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ws"
@@ -89,6 +90,7 @@ func run() error {
 	hub := ws.NewHub()
 	jobHub := ws.NewJobHub()
 	metricsRegistry := metrics.NewRegistry()
 	notifHub := notification.NewHub(st, aead, cfg.BaseURL)
 	alertEngine := alert.NewEngine(st, notifHub)
@@ -122,6 +124,7 @@ func run() error {
 		UI:              renderer,
 		Version:         version,
 		OIDC:            oidcClient,
 		Metrics:         metricsRegistry,
 	}
 	// First-run bootstrap: if the users table is empty, mint a one-time
@@ -0,0 +1,325 @@
 {
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": { "type": "grafana", "uid": "-- Grafana --" },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "description": "restic-manager fleet overview. Imports against any Prometheus data source.",
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "id": 1,
      "title": "Fleet status",
      "type": "stat",
      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
      "gridPos": { "h": 6, "w": 6, "x": 0, "y": 0 },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "thresholds" },
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "red", "value": null },
              { "color": "green", "value": 1 }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
        "textMode": "auto"
      },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
          "expr": "rm_hosts_online",
          "legendFormat": "online",
          "refId": "A"
        },
        {
          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
          "expr": "rm_hosts_total",
          "legendFormat": "total",
          "refId": "B"
        }
      ]
    },
    {
      "id": 2,
      "title": "Open alerts",
      "type": "stat",
      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
      "gridPos": { "h": 6, "w": 6, "x": 6, "y": 0 },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "thresholds" },
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 1 },
              { "color": "red", "value": 5 }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "orientation": "horizontal",
        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
        "textMode": "auto"
      },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
          "expr": "sum by (severity) (rm_active_alerts)",
          "legendFormat": "{{severity}}",
          "refId": "A"
        }
      ]
    },
    {
      "id": 3,
      "title": "Backups failing (last reported run)",
      "type": "stat",
      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
      "gridPos": { "h": 6, "w": 6, "x": 12, "y": 0 },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "thresholds" },
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "red", "value": 1 }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
        "textMode": "auto"
      },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
          "expr": "count(rm_host_last_backup_success == 0)",
          "legendFormat": "failing",
          "refId": "A"
        }
      ]
    },
    {
      "id": 4,
      "title": "Hosts",
      "type": "table",
      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
      "gridPos": { "h": 10, "w": 24, "x": 0, "y": 6 },
      "fieldConfig": {
        "defaults": {
          "custom": { "align": "auto", "displayMode": "auto" }
        },
        "overrides": [
          {
            "matcher": { "id": "byName", "options": "Value #B" },
            "properties": [
              { "id": "displayName", "value": "Last backup (s ago)" },
              { "id": "unit", "value": "s" }
            ]
          },
          {
            "matcher": { "id": "byName", "options": "Value #C" },
            "properties": [
              { "id": "displayName", "value": "Repo size" },
              { "id": "unit", "value": "bytes" }
            ]
          },
          {
            "matcher": { "id": "byName", "options": "Value #D" },
            "properties": [
              { "id": "displayName", "value": "Snapshots" }
            ]
          },
          {
            "matcher": { "id": "byName", "options": "Value #A" },
            "properties": [
              { "id": "displayName", "value": "Online" }
            ]
          },
          {
            "matcher": { "id": "byName", "options": "Value #E" },
            "properties": [
              { "id": "displayName", "value": "Open alerts" }
            ]
          }
        ]
      },
      "options": { "showHeader": true },
      "transformations": [
        {
          "id": "merge",
          "options": {}
        }
      ],
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
          "expr": "rm_host_agent_online",
          "format": "table",
          "instant": true,
          "refId": "A"
        },
        {
          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
          "expr": "time() - rm_host_last_backup_timestamp_seconds",
          "format": "table",
          "instant": true,
          "refId": "B"
        },
        {
          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
          "expr": "rm_host_repo_size_bytes",
          "format": "table",
          "instant": true,
          "refId": "C"
        },
        {
          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
          "expr": "rm_host_snapshot_count",
          "format": "table",
          "instant": true,
          "refId": "D"
        },
        {
          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
          "expr": "rm_host_open_alerts",
          "format": "table",
          "instant": true,
          "refId": "E"
        }
      ]
    },
    {
      "id": 5,
      "title": "Repo size over time",
      "type": "timeseries",
      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "custom": {
            "axisLabel": "",
            "drawStyle": "line",
            "fillOpacity": 10,
            "lineWidth": 1,
            "pointSize": 5,
            "showPoints": "never"
          },
          "unit": "bytes"
        },
        "overrides": []
      },
      "options": {
        "legend": { "calcs": ["last"], "displayMode": "list", "placement": "bottom", "showLegend": true },
        "tooltip": { "mode": "multi", "sort": "desc" }
      },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
          "expr": "rm_host_repo_size_bytes",
          "legendFormat": "{{host}}",
          "refId": "A"
        }
      ]
    },
    {
      "id": 6,
      "title": "Job duration p95 (last 1h, by kind)",
      "type": "timeseries",
      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "custom": {
            "drawStyle": "line",
            "fillOpacity": 5,
            "lineWidth": 1,
            "pointSize": 4,
            "showPoints": "never"
          },
          "unit": "s"
        },
        "overrides": []
      },
      "options": {
        "legend": { "calcs": ["last"], "displayMode": "list", "placement": "bottom", "showLegend": true },
        "tooltip": { "mode": "multi", "sort": "desc" }
      },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
          "expr": "histogram_quantile(0.95, sum by (kind, le) (rate(rm_job_duration_seconds_bucket[1h])))",
          "legendFormat": "{{kind}}",
          "refId": "A"
        }
      ]
    }
  ],
  "refresh": "30s",
  "schemaVersion": 39,
  "style": "dark",
  "tags": ["restic-manager", "backups"],
  "templating": {
    "list": [
      {
        "current": {},
        "hide": 0,
        "includeAll": false,
        "label": "Prometheus",
        "multi": false,
        "name": "DS_PROMETHEUS",
        "options": [],
        "query": "prometheus",
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "type": "datasource"
      }
    ]
  },
  "time": { "from": "now-6h", "to": "now" },
  "timepicker": {},
  "timezone": "",
  "title": "restic-manager — fleet",
  "uid": "rm-fleet-overview",
  "version": 1,
  "weekStart": ""
 }
@@ -0,0 +1,139 @@
 # Prometheus + Grafana
 restic-manager exposes a Prometheus scrape endpoint at `GET /metrics`.
 The endpoint is **opt-in** — it is not mounted at all unless you set
 at least one of the auth gates below. Once enabled, it serves the
 standard `text/plain` exposition format that every Prometheus
 release since 2.x parses without configuration.
 A sample Grafana dashboard lives at
 `deploy/grafana/restic-manager-dashboard.json`.
 ## Enable the endpoint
 Two switches, both off by default. If both are set, both must pass
 (token AND source-IP); if only one is set, that gate alone
 authorises a scrape.
 | Env var                    | YAML key               | Effect |
 |----------------------------|------------------------|--------|
 | `RM_METRICS_TOKEN`         | `metrics_token`        | Requires `Authorization: Bearer <token>`. Compared in constant time. |
 | `RM_METRICS_TRUSTED_CIDR`  | `metrics_trusted_cidrs` (list) | Restricts the source IP to one of the listed CIDRs. Comma-separated in env, list in YAML. Honours `X-Forwarded-For` only when the immediate hop matches `RM_TRUSTED_PROXY`. |
 When neither is set, `GET /metrics` returns 404 — the route is not
 registered with the chi router so a forgotten config can't
 accidentally publish fleet state.
 ### Example: Docker
 ```yaml
 services:
  restic-manager:
    image: gitea.dcglab.co.uk/steve/restic-manager:latest
    environment:
      RM_METRICS_TOKEN_FILE: /run/secrets/rm_metrics_token
      RM_METRICS_TRUSTED_CIDR: "10.0.0.0/8"
    secrets:
      - rm_metrics_token
 ```
 (`RM_METRICS_TOKEN_FILE` is not currently supported — set
 `RM_METRICS_TOKEN` directly. The `_FILE` convention is on the
 roadmap.)
 ## Prometheus scrape config
 Drop into your `prometheus.yml`:
 ```yaml
 scrape_configs:
  - job_name: restic-manager
    metrics_path: /metrics
    scheme: https            # via your reverse proxy
    static_configs:
      - targets: ['restic.example.com']
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/secrets/rm_metrics_token
 ```
 If you don't run a TLS-terminating proxy in front, drop `scheme:
 https` (the server is HTTP-only — see `docs/reverse-proxy.md`).
 ## Metric reference
 All names are `rm_`-prefixed. Per-host metrics carry a `host_id`
 label (the stable ULID, immune to renames) and a `host` label
 (the human-readable name).
 ### Server gauges
 | Name                  | Labels                             | Description |
 |-----------------------|------------------------------------|-------------|
 | `rm_hosts_total`      | —                                  | Total number of enrolled hosts (excludes pending announces). |
 | `rm_hosts_online`     | —                                  | Number of hosts with `status='online'`. |
 | `rm_active_alerts`    | `severity` ∈ {info, warning, critical} | Open alerts by severity. |
 | `rm_build_info`       | `version, commit, go_version`      | Always 1; pure label-bag for joining. |
 ### Per-host gauges
 | Name                                       | Description |
 |--------------------------------------------|-------------|
 | `rm_host_agent_online`                     | 1 if the agent is currently online, 0 otherwise. |
 | `rm_host_last_backup_timestamp_seconds`    | Unix timestamp of the host's most recent backup. **Omitted** for hosts with no backup yet. |
 | `rm_host_last_backup_success`              | 1 if the most recent backup succeeded, 0 otherwise. **Omitted** for hosts with no backup yet. |
 | `rm_host_repo_size_bytes`                  | Latest reported repo size from `restic stats --mode raw-data`. **Omitted** when unknown. |
 | `rm_host_snapshot_count`                   | Number of restic snapshots known on the host's repo. |
 | `rm_host_open_alerts`                      | Number of currently open alerts attached to this host. |
 | `rm_host_repo_status`                      | Always 1; the `status` label carries `unknown` / `ready` / `init_failed`. |
 ### Job duration histogram
 ```
 rm_job_duration_seconds_bucket{kind, status, le}
 rm_job_duration_seconds_sum{kind, status}
 rm_job_duration_seconds_count{kind, status}
 ```
 `kind` ∈ {backup, forget, prune, check, unlock, restore, diff, init, update}.
 `status` ∈ {succeeded, failed, cancelled}.
 Buckets (seconds):
 ```
 1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf
 1s   5s  30s  1m  5m   30m   1h    6h    24h
 ```
 The histogram is in-memory only — values reset on process restart.
 Operators who want durable history should let Prometheus persist
 the scrapes; restic-manager itself is a control plane, not a
 metrics database.
 ## Grafana dashboard
 Import `deploy/grafana/restic-manager-dashboard.json`:
 1. In Grafana, **+ → Import → Upload JSON file**.
 2. Pick the Prometheus data source you scrape with.
 3. The dashboard's six panels populate from the metrics above:
   * **Fleet status** — online/total stat panel.
   * **Open alerts** — by severity.
   * **Hosts** — per-host table (last backup, repo size, snapshots, alerts).
   * **Repo size over time** — one line per host.
   * **Backups failing** — count of hosts whose last backup didn't succeed.
   * **Job duration p95** — `histogram_quantile(0.95, …)` over a 1h window per kind.
 Alerting is intentionally not configured in the dashboard — the
 control plane already has alerts (P3-05) with native channels for
 webhook, ntfy, and SMTP. Re-implementing them in Prometheus would
 just duplicate state. If you do want Prom-side alerts, copy the
 recording rules into your usual location.
 ## Cardinality
 Per scrape: O(hosts) gauge rows + O(kinds × statuses × buckets)
 histogram rows. A 100-host fleet emits roughly 700 host rows + 270
 histogram rows — well below any practical limit. There are no
 `job_id` labels (cardinality bomb avoidance) and no per-source-group
 labels.
@@ -0,0 +1,61 @@
 # Plan — P6-04 + P6-05 Prometheus metrics + Grafana dashboard
 Spec: `docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md`
 ## Step 1 — Config wiring
 - Add fields to `internal/server/config/config.go`:
  - `MetricsToken string` (yaml `metrics_token`)
  - `MetricsTrustedCIDRs []string` (yaml `metrics_trusted_cidrs`)
  - method `(c Config) MetricsAuthEnabled() bool` returning true iff at least one of the two is configured.
 - Env loading: `RM_METRICS_TOKEN` and `RM_METRICS_TRUSTED_CIDR` (comma-CIDR).
 - `validate()` extension: ensure each CIDR parses (reuse the same `netip.ParsePrefix` pattern that already validates `TrustedProxies`).
 - Tests: extend `config_test.go` covering both env vars + happy/sad CIDR.
 ## Step 2 — `internal/server/metrics` package
 - `Registry` struct: `sync.Mutex`, `map[jobKey]*histogramState` where `jobKey = struct{kind,status string}`.
 - `ObserveJob(kind, status string, dur time.Duration)` — clamps negative durations to 0; locks; bumps the right bucket + sum + count.
 - `Snapshot() Snapshot` — copies state under lock; returns plain value type.
 - `Snapshot` carries `Histogram` rows (kind, status, buckets, sum, count) and accepts the rest from the caller (host rows, alert counts, build info).
 - `Render(w io.Writer, s Snapshot) error` — emits standard text exposition with stable line ordering. No external dep; manual escape of `\` `"` `\n` in label values per the Prom format spec.
 - Unit tests: golden render, concurrent observe, bucket boundaries.
 ## Step 3 — HTTP handler
 - New `internal/server/http/metrics.go`:
  - `(s *Server) handleMetrics(w, r)` — calls `authoriseMetricsScrape`, then `gatherSnapshot(ctx)` then `metrics.Render`.
  - `authoriseMetricsScrape(r, cfg) (ok bool, status int)` — pure helper; bearer token compared with `subtle.ConstantTimeCompare`; CIDR check on `r.RemoteAddr` first, then `X-Forwarded-For` if a trusted proxy fronted us (mirror `realIP`'s logic; simplest path is to call `chi/middleware.RealIP`-aware lookup the existing handlers use).
  - `gatherSnapshot(ctx)` — assembles the snapshot from `Store.ListHosts`, `Store.ListAlerts({Status:"open"})`, the metrics registry, and `version.Version`/`version.Commit`/`runtime.Version()`.
 - Route mounted in `server.go` only if `s.deps.Cfg.MetricsAuthEnabled()`.
 - `Deps` grows a `Metrics *metrics.Registry` field; nil-tolerant in handlers.
 ## Step 4 — Hook job-finished
 - `internal/server/ws/handler.go`:
  - `HandlerDeps` grows `Metrics *metrics.Registry`.
  - In the `MsgJobFinished` branch, after the `GetJob` lookup we already do, observe `(job.Kind, p.Status.String(), p.FinishedAt.Sub(deref(job.StartedAt)))`. Skip if `job.StartedAt` is nil (rare race).
 - `cmd/server` wires the registry into both `Deps` and `HandlerDeps` from a single instance.
 ## Step 5 — Tests
 - `internal/server/metrics/registry_test.go` — observe + snapshot determinism.
 - `internal/server/metrics/render_test.go` — golden output for a fixed snapshot.
 - `internal/server/http/metrics_test.go` — auth matrix (six cases per the spec) using the existing `newTestServer` fixture pattern. Render snapshot includes ≥1 host so we exercise the gather path end-to-end.
 ## Step 6 — Docs + dashboard (P6-05)
 - `docs/prometheus.md` — enable + scrape config + metric reference + dashboard import.
 - `deploy/grafana/restic-manager-dashboard.json` — six-panel dashboard. Hand-authored against Grafana 11 dashboard schema (uid, schemaVersion, panels with `targets[].expr`, datasource as variable). Validated by importing into Grafana — but since we can't run Grafana in CI, the structural sniff test is just that the JSON parses and contains the expected panel titles + datasource variable.
 ## Step 7 — Tasks.md + verification
 - Strike P6-04, P6-05 in `tasks.md`; add an "as shipped" note mirroring the prior P6 entries.
 - Run `go vet ./...`, `go test ./...`, `make build`.
 - Push branch (no PR per standing instruction).
 ## Risk register
 - **CIDR check for proxied scrapes** — easy to mis-implement, easy to mis-document. The handler test must exercise both "direct hit" and "X-Forwarded-For" paths.
 - **Histogram lock contention** — every job finish takes the mutex. Job throughput is low (a few/min/host max), and `ObserveJob` is a couple of map lookups; no risk in practice.
 - **Dashboard JSON drift** — Grafana versions evolve. Pinning `schemaVersion` and using only well-supported panel types (timeseries, stat, table) keeps the import working across recent versions.
@@ -0,0 +1,175 @@
 # P6-04 + P6-05 — Prometheus `/metrics` + Grafana dashboard
 Date: 2026-05-07
 Author: Claude (autonomous, sensible-defaults brief from operator)
 Tasks: P6-04 (M), P6-05 (S)
 ## Problem
 The control plane already knows everything a backup operator needs
 to monitor — last-backup timestamp + status, repo size, snapshot
 count, agent online, open alerts, build version — but it surfaces
 those only through the dashboard HTML and a few JSON endpoints. To
 plug into the operator's existing observability stack we need a
 plain Prometheus exposition endpoint and a Grafana dashboard JSON
 that reads from it.
 ## Goals
 - `GET /metrics` emits standard Prometheus text-format with the
  per-host, server, and job-duration metrics enumerated in the
  task entry (P6-04 in `tasks.md`).
 - Endpoint is opt-in and gated by a bearer token and/or an IP
  allow-list — never publicly readable by default.
 - No new third-party dependency (`prometheus/client_golang` is not
  pulled in). The exposition format is small and stable enough to
  emit by hand; matches the repo's "no Tailwind/Node" style.
 - Sample Grafana dashboard committed to the repo so a stranger can
  drop it into a Grafana instance and get a working view.
 ## Non-goals
 - OpenMetrics (the legacy text format with `# HELP`/`# TYPE` is
  what every prom server still parses and what every example
  online demonstrates — pick the boring option).
 - Pushgateway or remote-write integration.
 - Per-job metric cardinality (no `job_id` labels — that would
  make the histogram explode).
 - Alerting rules. Operators already have alerts inside
  restic-manager (P3-05); duplicating them in Prometheus is a
  YAGNI hazard. The dashboard is read-only.
 ## Auth
 Two switches, both off by default. If neither is set the route
 isn't mounted at all (404 from the chi router) — this avoids any
 accidental "wide-open scrape endpoint" deployment.
 | env var | type | meaning |
 | --- | --- | --- |
 | `RM_METRICS_TOKEN` | string | If set, callers must send `Authorization: Bearer <token>`. Compared with `crypto/subtle.ConstantTimeCompare`. |
 | `RM_METRICS_TRUSTED_CIDR` | comma-CIDR | If set, callers must hit from a source IP inside one of these CIDRs. Reuses the existing `RM_TRUSTED_PROXY` semantics for honouring `X-Forwarded-For`. |
 If both are set, both must pass (AND, not OR — a token leak doesn't grant access from outside the network, and a trusted-network compromise doesn't grant access without the token). If only one is set, that one alone gates access.
 YAML overlay mirrors env: `metrics_token`, `metrics_trusted_cidrs`.
 ## Metrics
 All metric names are prefixed `rm_`. Help text is concise.
 ### Per-host gauges (one row per `host_id`)
 ```
 rm_host_agent_online{host_id,host}                     1 if status='online' else 0
 rm_host_last_backup_timestamp_seconds{host_id,host}    unix seconds; omitted if no backup yet
 rm_host_last_backup_success{host_id,host}              1 if last_backup_status='succeeded' else 0; omitted if no backup yet
 rm_host_repo_size_bytes{host_id,host}                  total_size from latest repo stats; omitted if unknown
 rm_host_snapshot_count{host_id,host}                   integer
 rm_host_open_alerts{host_id,host}                      count of open + un-resolved alerts attached to this host
 rm_host_repo_status{host_id,host,status}               1, with status ∈ {unknown,ready,init_failed} (info-style, exactly one row per host)
 ```
 `host` label is `hosts.name` for human readability; `host_id` is
 the stable ULID for joining across renames.
 ### Server gauges
 ```
 rm_hosts_total                              count of hosts (excludes pending)
 rm_hosts_online                             count of hosts with status='online'
 rm_active_alerts{severity}                  count of open alerts by severity ∈ {info,warning,critical}
 rm_build_info{version,commit,go_version}    always 1; pure label-bag for joining
 ```
 ### Job duration histogram
 ```
 rm_job_duration_seconds_bucket{kind,status,le=...}
 rm_job_duration_seconds_sum{kind,status}
 rm_job_duration_seconds_count{kind,status}
 ```
 `kind` ∈ {backup,forget,prune,check,unlock,restore,diff,init,update}
 (every JobKind we currently dispatch). `status` ∈
 {succeeded,failed,cancelled}. Buckets cover the realistic range —
 short admin commands (unlock, init) finish in seconds; backups can
 be hours:
 ```
 1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf
   (1s   5s  30s  1m   5m  30m   1h    6h   24h)
 ```
 In-memory only. Reset on process restart — operators who want
 durable history scrape into Prom and let it persist.
 ## Architecture
 New package `internal/server/metrics`:
 - `Registry` — owns the histogram state (sync.Mutex + map keyed by
  `kind+status`). `ObserveJob(kind, status string, dur time.Duration)`
  is the only mutator. Lookups via `Snapshot()` are read-only and
  copy out.
 - `Render(w io.Writer, snapshot Snapshot)` — emits the full
  exposition body. The snapshot is supplied by the HTTP handler
  pulling from `Store` on each scrape; the package itself has no
  store dependency, which keeps it trivially unit-testable.
 New file `internal/server/http/metrics.go`:
 - `handleMetrics(w, r)` — auth check (bearer + CIDR), pull current
  fleet snapshot from `Store`, ask `metrics.Render` to emit.
 - Auth helper `authoriseMetricsScrape(r)` — pure function over
  request + config; tested directly.
 Wiring:
 - `cmd/server` constructs the `metrics.Registry` once and threads
  it into both `Deps` (for the HTTP layer) and `ws.HandlerDeps`
  (so the job-finished branch can call `ObserveJob`).
 - `ws/handler.go` MsgJobFinished branch grows a single line:
  `if deps.Metrics != nil { deps.Metrics.ObserveJob(job.Kind, p.Status, p.FinishedAt.Sub(job.StartedAt)) }`.
  Falls back gracefully if the registry was never wired (tests).
 Route registration in `server.go`:
 ```go
 if s.deps.Cfg.MetricsAuthEnabled() {
    r.Get("/metrics", s.handleMetrics)
 }
 ```
 ## Cardinality + cost
 Per scrape: O(hosts) gauge rows + O(kinds × statuses × buckets) histogram rows. For a 100-host fleet that's ~700 host rows + ~270 histogram rows per scrape — well under any practical limit. The store reads we issue per scrape are: `ListHosts` (already exists, one query), `ListAlerts` filtered by open status (one query), `GetHostRepoStats` already projected onto `Host` via `repo_size_bytes`. No N+1.
 A 10s scrape interval against a 100-host fleet is cheap: each scrape is a couple of indexed sqlite reads + a small string render. We're nowhere near a place where caching the snapshot would be worthwhile.
 ## Documentation (P6-05)
 - `docs/prometheus.md` — sibling to the existing `docs/reverse-proxy.md`. Sections: enabling the endpoint (env vars), Prometheus scrape config snippet (with bearer + tls), the metric reference table (copy-pasted from this spec), the dashboard import instructions.
 - `deploy/grafana/restic-manager-dashboard.json` — Grafana 11+ dashboard JSON. Single Prometheus datasource variable, six panels:
  1. **Fleet status** — stat panel showing `rm_hosts_online / rm_hosts_total` + a sparkline.
  2. **Open alerts** — stat panel by severity (`sum by (severity) (rm_active_alerts)`).
  3. **Hosts** — table of `host`, `online`, `last_backup` (relative time via `time() - rm_host_last_backup_timestamp_seconds`), `repo_size`, `snapshots`.
  4. **Repo size over time** — time series, one line per host, `rm_host_repo_size_bytes`.
  5. **Backups failing** — time series counting hosts where `rm_host_last_backup_success == 0`.
  6. **Job duration p95** — `histogram_quantile(0.95, sum by (kind, le) (rate(rm_job_duration_seconds_bucket[1h])))` over a 1h window.
 Dashboard is committed as plain JSON; an operator imports it through the Grafana UI ("+ → Import → upload JSON file") or Grafana provisioning.
 ## Testing
 - Unit tests for `metrics.Render` against a fixed snapshot — golden-file style. Pin the exact line ordering (sorted by metric name + label set) so diffs stay tractable.
 - Unit tests for `metrics.Registry.ObserveJob` — concurrent writes, bucket boundary correctness, snapshot independence.
 - Handler tests for `/metrics` covering: no auth configured → 404; token configured + missing → 401; token configured + correct → 200 + body sniff; CIDR configured + wrong source → 401; CIDR configured + right source → 200; both configured → require both.
 - End-to-end smoke verification deferred to manual operator walk-through; full Playwright pass is P5-06's job.
 ## Out of scope, explicitly
 - Per-job latency tracking with `job_id` labels (cardinality bomb).
 - Restore-specific metrics (P3 surfaces are still settling).
 - Histograms keyed on host (kind × status × buckets × hosts is a Prom anti-pattern).
 - Auto-discovery / file-SD generators for Prometheus.
@@ -68,12 +68,9 @@ test.describe('smoke: enrol-via-announce → backup', () => {
 });
 test.describe('smoke: scrape /metrics', () => {
-    // The /metrics endpoint is documented (RM_METRICS_TOKEN /
+    test('metrics endpoint exposes the host gauge', async ({ request }) => {
-    // RM_METRICS_TRUSTED_CIDR, gauges rm_hosts_total / rm_build_info)
+        // Compose sets RM_METRICS_TRUSTED_CIDR=0.0.0.0/0 so the
-    // but not yet implemented in the server. Skipping until the
+        // endpoint is open to the test runner.
    // Prometheus exposition lands; tracked separately from this
    // e2e harness.
    test.skip('metrics endpoint exposes the host gauge', async ({ request }) => {
        const res = await request.get(`${baseURL}/metrics`);
        expect(res.status()).toBe(200);
        const body = await res.text();
@@ -41,6 +41,24 @@ type Config struct {
 	// DataDir. Source-build deployments can override via
 	// RM_BUNDLED_ASSETS_DIR.
 	BundledAssetsDir string `yaml:"bundled_assets_dir"`
 	// MetricsToken, if set, gates the /metrics scrape endpoint
 	// behind a `Authorization: Bearer <token>` check (constant-time
 	// compare). When neither this nor MetricsTrustedCIDRs is set,
 	// the route is not mounted at all (the endpoint is opt-in).
 	MetricsToken string `yaml:"metrics_token"`
 	// MetricsTrustedCIDRs, if non-empty, gates /metrics so only
 	// callers from these networks may scrape. ANDed with
 	// MetricsToken when both are set.
 	MetricsTrustedCIDRs []string `yaml:"metrics_trusted_cidrs"`
 }
 // MetricsAuthEnabled reports whether the operator has opted into
 // exposing the Prometheus scrape endpoint by configuring at least
 // one auth gate.
 func (c Config) MetricsAuthEnabled() bool {
 	return c.MetricsToken != "" || len(c.MetricsTrustedCIDRs) > 0
 }
 // Load resolves config in this order:
@@ -93,6 +111,19 @@ func Load(yamlPath string) (Config, error) {
 	if v, ok := os.LookupEnv("RM_BUNDLED_ASSETS_DIR"); ok {
 		c.BundledAssetsDir = v
 	}
 	if v, ok := os.LookupEnv("RM_METRICS_TOKEN"); ok {
 		c.MetricsToken = v
 	}
 	if v, ok := os.LookupEnv("RM_METRICS_TRUSTED_CIDR"); ok {
 		parts := strings.Split(v, ",")
 		c.MetricsTrustedCIDRs = c.MetricsTrustedCIDRs[:0]
 		for _, p := range parts {
 			p = strings.TrimSpace(p)
 			if p != "" {
 				c.MetricsTrustedCIDRs = append(c.MetricsTrustedCIDRs, p)
 			}
 		}
 	}
 	if v, ok := os.LookupEnv("RM_TRUSTED_PROXY"); ok {
 		// Comma-separated CIDRs; allow whitespace for readability.
 		parts := strings.Split(v, ",")
@@ -137,5 +168,10 @@ func (c *Config) validate() error {
 			return fmt.Errorf("config: RM_TRUSTED_PROXY entry %q is not a valid CIDR: %w", cidr, err)
 		}
 	}
 	for _, cidr := range c.MetricsTrustedCIDRs {
 		if _, err := netip.ParsePrefix(cidr); err != nil {
 			return fmt.Errorf("config: RM_METRICS_TRUSTED_CIDR entry %q is not a valid CIDR: %w", cidr, err)
 		}
 	}
 	return nil
 }
@@ -98,6 +98,45 @@ func TestCookieSecureDefaultAndOverride(t *testing.T) {
 	}
 }
 func TestMetricsAuthGates(t *testing.T) {
 	t.Setenv("RM_LISTEN", ":8080")
 	t.Setenv("RM_DATA_DIR", "/tmp/x")
 	c, err := Load("")
 	if err != nil {
 		t.Fatalf("load: %v", err)
 	}
 	if c.MetricsAuthEnabled() {
 		t.Errorf("metrics endpoint should be off by default")
 	}
 	t.Setenv("RM_METRICS_TOKEN", "s3cr3t-token-with-enough-bytes")
 	t.Setenv("RM_METRICS_TRUSTED_CIDR", "10.0.0.0/8, 192.168.1.0/24")
 	c, err = Load("")
 	if err != nil {
 		t.Fatalf("load: %v", err)
 	}
 	if c.MetricsToken != "s3cr3t-token-with-enough-bytes" {
 		t.Errorf("token: %q", c.MetricsToken)
 	}
 	if got := c.MetricsTrustedCIDRs; len(got) != 2 || got[0] != "10.0.0.0/8" || got[1] != "192.168.1.0/24" {
 		t.Errorf("cidrs: %v", got)
 	}
 	if !c.MetricsAuthEnabled() {
 		t.Errorf("MetricsAuthEnabled should be true")
 	}
 }
 func TestMetricsTrustedCIDRRejectsGarbage(t *testing.T) {
 	t.Setenv("RM_LISTEN", ":8080")
 	t.Setenv("RM_DATA_DIR", "/tmp/x")
 	t.Setenv("RM_METRICS_TRUSTED_CIDR", "garbage")
 	if _, err := Load(""); err == nil {
 		t.Fatal("expected validation error, got nil")
 	}
 }
 func writeFile(path string, body []byte) error {
 	return writeFileImpl(path, body)
 }
@@ -0,0 +1,185 @@
 package http
 import (
 	"context"
 	"crypto/subtle"
 	"net"
 	"net/http"
 	"net/netip"
 	"runtime"
 	"strings"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/config"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
 )
 // handleMetrics serves the Prometheus exposition body. The route is
 // only mounted when the operator has opted in via RM_METRICS_TOKEN
 // or RM_METRICS_TRUSTED_CIDR (see Server.New + Cfg.MetricsAuthEnabled).
 func (s *Server) handleMetrics(w http.ResponseWriter, r *http.Request) {
 	if !authoriseMetricsScrape(r, s.deps.Cfg) {
 		// 401 with no body; Prom respects this and surfaces the failed
 		// scrape. WWW-Authenticate hints at bearer when the operator
 		// actually configured a token.
 		if s.deps.Cfg.MetricsToken != "" {
 			w.Header().Set("WWW-Authenticate", `Bearer realm="restic-manager metrics"`)
 		}
 		w.WriteHeader(http.StatusUnauthorized)
 		return
 	}
 	snap, err := s.gatherMetricsSnapshot(r.Context())
 	if err != nil {
 		http.Error(w, "snapshot: "+err.Error(), http.StatusInternalServerError)
 		return
 	}
 	// 0.0.4 is the long-stable text-format version Prometheus accepts
 	// without negotiation; OpenMetrics is intentionally not used here.
 	w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
 	if err := metrics.Render(w, snap); err != nil {
 		// Body is partially written; nothing useful we can do beyond
 		// dropping the connection (chi's recoverer will log).
 		return
 	}
 }
 // authoriseMetricsScrape applies bearer + CIDR gates per the spec.
 // AND semantics when both are configured; either alone is sufficient
 // when only it is configured.
 func authoriseMetricsScrape(r *http.Request, cfg config.Config) bool {
 	tokenOK := true
 	if cfg.MetricsToken != "" {
 		tokenOK = false
 		hdr := r.Header.Get("Authorization")
 		const prefix = "Bearer "
 		if strings.HasPrefix(hdr, prefix) {
 			got := []byte(strings.TrimPrefix(hdr, prefix))
 			want := []byte(cfg.MetricsToken)
 			if subtle.ConstantTimeCompare(got, want) == 1 {
 				tokenOK = true
 			}
 		}
 	}
 	cidrOK := true
 	if len(cfg.MetricsTrustedCIDRs) > 0 {
 		cidrOK = false
 		ip := callerIP(r, cfg.TrustedProxies)
 		if ip.IsValid() {
 			for _, c := range cfg.MetricsTrustedCIDRs {
 				prefix, err := netip.ParsePrefix(c)
 				if err != nil {
 					continue
 				}
 				if prefix.Contains(ip) {
 					cidrOK = true
 					break
 				}
 			}
 		}
 	}
 	return tokenOK && cidrOK
 }
 // callerIP resolves the client IP. When the request hit the server
 // directly we use RemoteAddr; when the immediate hop is a trusted
 // proxy we honour the right-most untrusted X-Forwarded-For entry
 // (mirrors how realIP middlewares typically resolve).
 func callerIP(r *http.Request, trustedProxies []string) netip.Addr {
 	host, _, err := net.SplitHostPort(r.RemoteAddr)
 	if err != nil {
 		host = r.RemoteAddr
 	}
 	directAddr, err := netip.ParseAddr(host)
 	if err != nil {
 		return netip.Addr{}
 	}
 	if !addrInAnyCIDR(directAddr, trustedProxies) {
 		return directAddr
 	}
 	xff := r.Header.Get("X-Forwarded-For")
 	if xff == "" {
 		return directAddr
 	}
 	parts := strings.Split(xff, ",")
 	// Walk right→left, skipping trusted proxies, until we land on the
 	// first untrusted hop — that's the genuine client.
 	for i := len(parts) - 1; i >= 0; i-- {
 		p := strings.TrimSpace(parts[i])
 		a, err := netip.ParseAddr(p)
 		if err != nil {
 			continue
 		}
 		if addrInAnyCIDR(a, trustedProxies) {
 			continue
 		}
 		return a
 	}
 	return directAddr
 }
 func addrInAnyCIDR(a netip.Addr, cidrs []string) bool {
 	for _, c := range cidrs {
 		pre, err := netip.ParsePrefix(c)
 		if err != nil {
 			continue
 		}
 		if pre.Contains(a) {
 			return true
 		}
 	}
 	return false
 }
 // gatherMetricsSnapshot pulls the data the renderer needs. One
 // indexed query per per-host or fleet-wide read; no N+1.
 func (s *Server) gatherMetricsSnapshot(ctx context.Context) (metrics.Snapshot, error) {
 	hosts, err := s.deps.Store.ListHosts(ctx)
 	if err != nil {
 		return metrics.Snapshot{}, err
 	}
 	hostRows := make([]metrics.HostRow, 0, len(hosts))
 	for _, h := range hosts {
 		row := metrics.HostRow{
 			ID:             h.ID,
 			Name:           h.Name,
 			Online:         h.Status == "online",
 			SnapshotCount:  h.SnapshotCount,
 			OpenAlertCount: h.OpenAlertCount,
 			RepoStatus:     h.RepoStatus,
 		}
 		if h.LastBackupAt != nil {
 			ts := h.LastBackupAt.Unix()
 			row.LastBackupUnix = &ts
 		}
 		if h.LastBackupStatus != nil {
 			ok := *h.LastBackupStatus == "succeeded"
 			row.LastBackupSucceeded = &ok
 		}
 		if h.RepoSizeBytes > 0 {
 			sz := h.RepoSizeBytes
 			row.RepoSizeBytes = &sz
 		}
 		hostRows = append(hostRows, row)
 	}
 	open, err := s.deps.Store.ListAlerts(ctx, store.AlertFilter{Status: "open"})
 	if err != nil {
 		return metrics.Snapshot{}, err
 	}
 	bySeverity := map[string]int{"info": 0, "warning": 0, "critical": 0}
 	for _, a := range open {
 		bySeverity[a.Severity]++
 	}
 	reg := s.deps.Metrics
 	if reg == nil {
 		reg = metrics.NewRegistry() // empty histogram block
 	}
 	return reg.SnapshotWith(hostRows, bySeverity, version.Version, version.Commit, runtime.Version()), nil
 }
@@ -0,0 +1,209 @@
 package http
 import (
 	"context"
 	"io"
 	stdhttp "net/http"
 	"net/http/httptest"
 	"path/filepath"
 	"strings"
 	"testing"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/crypto"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/config"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
 )
 // newMetricsServer builds a Server with metrics enabled per cfg.
 // Returns (URL, registry) so tests can both observe job durations
 // directly and exercise the HTTP gate.
 func newMetricsServer(t *testing.T, cfg config.Config) (string, *metrics.Registry, *store.Store) {
 	t.Helper()
 	dir := t.TempDir()
 	st, err := store.Open(context.Background(), filepath.Join(dir, "rm.db"))
 	if err != nil {
 		t.Fatalf("store: %v", err)
 	}
 	t.Cleanup(func() { _ = st.Close() })
 	keyPath := filepath.Join(dir, "secret.key")
 	if err := crypto.GenerateKeyFile(keyPath); err != nil {
 		t.Fatalf("genkey: %v", err)
 	}
 	key, _ := crypto.LoadKeyFromFile(keyPath)
 	aead, _ := crypto.NewAEAD(key)
 	cfg.Listen = ":0"
 	cfg.DataDir = dir
 	cfg.SecretKeyFile = keyPath
 	reg := metrics.NewRegistry()
 	deps := Deps{
 		Cfg:     cfg,
 		Store:   st,
 		AEAD:    aead,
 		Metrics: reg,
 	}
 	s := New(deps)
 	ts := httptest.NewServer(s.srv.Handler)
 	t.Cleanup(ts.Close)
 	return ts.URL, reg, st
 }
 func TestMetricsRouteNotMountedByDefault(t *testing.T) {
 	t.Parallel()
 	url, _, _ := newMetricsServer(t, config.Config{})
 	res, err := stdhttp.Get(url + "/metrics")
 	if err != nil {
 		t.Fatalf("GET: %v", err)
 	}
 	defer res.Body.Close()
 	if res.StatusCode != stdhttp.StatusNotFound {
 		t.Errorf("status: got %d, want 404 (route should not be mounted)", res.StatusCode)
 	}
 }
 func TestMetricsTokenRequired(t *testing.T) {
 	t.Parallel()
 	url, _, _ := newMetricsServer(t, config.Config{
 		MetricsToken: "the-token",
 	})
 	// Missing token.
 	res, err := stdhttp.Get(url + "/metrics")
 	if err != nil {
 		t.Fatalf("GET: %v", err)
 	}
 	defer res.Body.Close()
 	if res.StatusCode != stdhttp.StatusUnauthorized {
 		t.Errorf("no token: got %d", res.StatusCode)
 	}
 	if !strings.Contains(res.Header.Get("WWW-Authenticate"), "Bearer") {
 		t.Errorf("WWW-Authenticate hint missing: %q", res.Header.Get("WWW-Authenticate"))
 	}
 	// Wrong token.
 	req, _ := stdhttp.NewRequest(stdhttp.MethodGet, url+"/metrics", nil)
 	req.Header.Set("Authorization", "Bearer not-the-token")
 	res2, err := stdhttp.DefaultClient.Do(req)
 	if err != nil {
 		t.Fatalf("GET: %v", err)
 	}
 	defer res2.Body.Close()
 	if res2.StatusCode != stdhttp.StatusUnauthorized {
 		t.Errorf("wrong token: got %d", res2.StatusCode)
 	}
 	// Right token.
 	req3, _ := stdhttp.NewRequest(stdhttp.MethodGet, url+"/metrics", nil)
 	req3.Header.Set("Authorization", "Bearer the-token")
 	res3, err3 := stdhttp.DefaultClient.Do(req3)
 	if err3 != nil {
 		t.Fatalf("GET: %v", err3)
 	}
 	defer res3.Body.Close()
 	if res3.StatusCode != stdhttp.StatusOK {
 		t.Errorf("right token: got %d", res3.StatusCode)
 	}
 	if ct := res3.Header.Get("Content-Type"); !strings.HasPrefix(ct, "text/plain") {
 		t.Errorf("content-type: %q", ct)
 	}
 }
 func TestMetricsCIDRGate(t *testing.T) {
 	t.Parallel()
 	// 127.0.0.1 is what httptest hits with; pick a CIDR that excludes it
 	// to assert the "wrong source" branch.
 	url, _, _ := newMetricsServer(t, config.Config{
 		MetricsTrustedCIDRs: []string{"10.0.0.0/8"},
 	})
 	res, err := stdhttp.Get(url + "/metrics")
 	if err != nil {
 		t.Fatalf("GET: %v", err)
 	}
 	defer res.Body.Close()
 	if res.StatusCode != stdhttp.StatusUnauthorized {
 		t.Errorf("loopback hitting non-matching CIDR: got %d, want 401", res.StatusCode)
 	}
 	// Now allow loopback.
 	url2, _, _ := newMetricsServer(t, config.Config{
 		MetricsTrustedCIDRs: []string{"127.0.0.0/8"},
 	})
 	res2, err := stdhttp.Get(url2 + "/metrics")
 	if err != nil {
 		t.Fatalf("GET: %v", err)
 	}
 	defer res2.Body.Close()
 	if res2.StatusCode != stdhttp.StatusOK {
 		t.Errorf("loopback in allow CIDR: got %d, want 200", res2.StatusCode)
 	}
 }
 func TestMetricsTokenAndCIDRBothRequired(t *testing.T) {
 	t.Parallel()
 	url, _, _ := newMetricsServer(t, config.Config{
 		MetricsToken:        "the-token",
 		MetricsTrustedCIDRs: []string{"127.0.0.0/8"},
 	})
 	// Token only — CIDR ok (loopback) but token missing.
 	res, err := stdhttp.Get(url + "/metrics")
 	if err != nil {
 		t.Fatalf("GET: %v", err)
 	}
 	defer res.Body.Close()
 	if res.StatusCode != stdhttp.StatusUnauthorized {
 		t.Errorf("missing token but in CIDR: got %d", res.StatusCode)
 	}
 	// Both right.
 	req, _ := stdhttp.NewRequest(stdhttp.MethodGet, url+"/metrics", nil)
 	req.Header.Set("Authorization", "Bearer the-token")
 	res2, err := stdhttp.DefaultClient.Do(req)
 	if err != nil {
 		t.Fatalf("GET: %v", err)
 	}
 	defer res2.Body.Close()
 	if res2.StatusCode != stdhttp.StatusOK {
 		t.Errorf("both right: got %d", res2.StatusCode)
 	}
 }
 func readAll(t *testing.T, r io.Reader) string {
 	t.Helper()
 	b, err := io.ReadAll(r)
 	if err != nil {
 		t.Fatalf("read: %v", err)
 	}
 	return string(b)
 }
 func TestMetricsBodyContainsExpectedLines(t *testing.T) {
 	t.Parallel()
 	url, reg, _ := newMetricsServer(t, config.Config{
 		MetricsToken: "the-token",
 	})
 	reg.ObserveJob("backup", "succeeded", 0) // produce one histogram row
 	req, _ := stdhttp.NewRequest(stdhttp.MethodGet, url+"/metrics", nil)
 	req.Header.Set("Authorization", "Bearer the-token")
 	res, err := stdhttp.DefaultClient.Do(req)
 	if err != nil {
 		t.Fatalf("GET: %v", err)
 	}
 	defer res.Body.Close()
 	body := readAll(t, res.Body)
 	for _, want := range []string{
 		"rm_hosts_total",
 		"rm_hosts_online",
 		`rm_active_alerts{severity="critical"}`,
 		"rm_build_info{",
 		"rm_job_duration_seconds_count{kind=\"backup\",status=\"succeeded\"}",
 	} {
 		if !strings.Contains(body, want) {
 			t.Errorf("body missing %q\n--- body ---\n%s", want, body)
 		}
 	}
 }
@@ -17,6 +17,7 @@ import (
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/crypto"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/notification"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/config"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/oidc"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ui"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ws"
@@ -56,6 +57,12 @@ type Deps struct {
 	// OIDC (optional). Non-nil when the operator has configured an
 	// IdP — handlers under /auth/oidc/* are mounted only when set.
 	OIDC *oidc.Client
 	// Metrics (optional). When non-nil the WS job-finished branch
 	// records job durations and the /metrics handler can pull a
 	// histogram snapshot. Independent of MetricsAuthEnabled — the
 	// recorder runs even if the scrape endpoint is gated off, so a
 	// later config flip doesn't lose the running window.
 	Metrics *metrics.Registry
 }
 // Server is the running HTTP server.
@@ -131,12 +138,16 @@ func (s *Server) routes(r chi.Router) {
 	r.Get("/agent/binary", s.handleAgentBinary)
 	r.Get("/install/*", s.handleInstallAsset)
 	r.Get("/api/version", s.handleVersion)
 	if s.deps.Cfg.MetricsAuthEnabled() {
 		r.Get("/metrics", s.handleMetrics)
 	}
 	if s.deps.Hub != nil {
 		hd := ws.HandlerDeps{
 			Hub:            s.deps.Hub,
 			Store:          s.deps.Store,
 			JobHub:         s.deps.JobHub,
 			AlertEngine:    s.deps.AlertEngine,
 			Metrics:        s.deps.Metrics,
 			OnHello:        s.onAgentHello,
 			OnScheduleAck:  s.applyScheduleAck,
 			OnScheduleFire: s.dispatchScheduledJob,
@@ -0,0 +1,301 @@
 // Package metrics owns the in-process Prometheus exposition for
 // the control plane. It deliberately avoids prometheus/client_golang
 // — the legacy text format is small and stable, and the repo's house
 // style is to keep dependency surface minimal.
 //
 // Two halves:
 //
 //   - Registry holds a job-duration histogram. Server hooks call
 //     Registry.ObserveJob from the WS job-finished branch.
 //
 //   - Render emits a complete /metrics body from a Snapshot. The
 //     Snapshot is a plain value bag; the HTTP handler assembles it
 //     from store reads + Registry.Snapshot at scrape time. This
 //     keeps the package free of any database or HTTP dependency.
 package metrics
 import (
 	"fmt"
 	"io"
 	"sort"
 	"strings"
 	"sync"
 	"time"
 )
 // JobDurationBuckets is the upper-bound ladder for the job duration
 // histogram, in seconds. Covers admin commands (unlock/init/check
 // finishing in seconds) up through hours-long backups; +Inf is
 // implicit.
 var JobDurationBuckets = []float64{1, 5, 30, 60, 300, 1800, 3600, 21600, 86400}
 // Registry is the in-memory store for the job-duration histogram.
 // Concurrent observers and a single periodic snapshotter is the
 // expected access pattern; both are guarded by a mutex.
 type Registry struct {
 	mu    sync.Mutex
 	jobs  map[jobKey]*histogramState
 	clock func() time.Time
 }
 type jobKey struct{ kind, status string }
 type histogramState struct {
 	// counts[i] = number of observations <= JobDurationBuckets[i].
 	// counts[len(JobDurationBuckets)] is the implicit +Inf bucket
 	// (== total count, kept here for symmetry with the rendered
 	// _bucket{le="+Inf"} line and as a sanity check).
 	counts []uint64
 	sum    float64
 	count  uint64
 }
 // NewRegistry builds an empty registry.
 func NewRegistry() *Registry {
 	return &Registry{
 		jobs:  make(map[jobKey]*histogramState),
 		clock: time.Now,
 	}
 }
 // ObserveJob records one job-duration sample. Negative durations
 // (clock-skew artefacts) are clamped to zero. Empty kind/status
 // strings are tolerated but degrade the dashboard — callers should
 // pass meaningful values.
 func (r *Registry) ObserveJob(kind, status string, dur time.Duration) {
 	if r == nil {
 		return
 	}
 	if dur < 0 {
 		dur = 0
 	}
 	secs := dur.Seconds()
 	r.mu.Lock()
 	defer r.mu.Unlock()
 	k := jobKey{kind: kind, status: status}
 	hs, ok := r.jobs[k]
 	if !ok {
 		hs = &histogramState{counts: make([]uint64, len(JobDurationBuckets)+1)}
 		r.jobs[k] = hs
 	}
 	for i, ub := range JobDurationBuckets {
 		if secs <= ub {
 			hs.counts[i]++
 		}
 	}
 	hs.counts[len(JobDurationBuckets)]++ // +Inf
 	hs.sum += secs
 	hs.count++
 }
 // HistogramRow is one (kind,status) row in a Snapshot. Buckets is
 // the cumulative count per upper bound (matching JobDurationBuckets,
 // last element is the +Inf total).
 type HistogramRow struct {
 	Kind    string
 	Status  string
 	Buckets []uint64
 	Sum     float64
 	Count   uint64
 }
 // snapshotJobs returns a deterministic, sorted copy of the
 // histogram state. Sort order: kind asc, status asc.
 func (r *Registry) snapshotJobs() []HistogramRow {
 	if r == nil {
 		return nil
 	}
 	r.mu.Lock()
 	defer r.mu.Unlock()
 	rows := make([]HistogramRow, 0, len(r.jobs))
 	for k, hs := range r.jobs {
 		buckets := make([]uint64, len(hs.counts))
 		copy(buckets, hs.counts)
 		rows = append(rows, HistogramRow{
 			Kind:    k.kind,
 			Status:  k.status,
 			Buckets: buckets,
 			Sum:     hs.sum,
 			Count:   hs.count,
 		})
 	}
 	sort.Slice(rows, func(i, j int) bool {
 		if rows[i].Kind != rows[j].Kind {
 			return rows[i].Kind < rows[j].Kind
 		}
 		return rows[i].Status < rows[j].Status
 	})
 	return rows
 }
 // HostRow is one host's projection for the per-host gauges.
 // Pointers carry "no value" semantics so we can omit a metric line
 // when, e.g., a host has never run a backup.
 type HostRow struct {
 	ID                  string
 	Name                string
 	Online              bool
 	LastBackupUnix      *int64 // nil = no backup yet
 	LastBackupSucceeded *bool  // nil = no backup yet
 	RepoSizeBytes       *int64 // nil = no stats yet
 	SnapshotCount       int
 	OpenAlertCount      int
 	RepoStatus          string // "unknown" | "ready" | "init_failed"
 }
 // Snapshot is a frozen view of the data needed to render /metrics.
 // Constructed by the HTTP handler from Store reads + Registry.snapshotJobs.
 type Snapshot struct {
 	Hosts            []HostRow
 	HostsTotal       int
 	HostsOnline      int
 	AlertsBySeverity map[string]int // severity → count
 	BuildVersion     string
 	BuildCommit      string
 	GoVersion        string
 	JobDurationRows  []HistogramRow
 }
 // SnapshotWith builds a Snapshot from raw inputs and the registry's
 // current job-duration state. Convenience for the HTTP handler.
 func (r *Registry) SnapshotWith(hosts []HostRow, alerts map[string]int, buildVer, commit, goVer string) Snapshot {
 	online := 0
 	for _, h := range hosts {
 		if h.Online {
 			online++
 		}
 	}
 	return Snapshot{
 		Hosts:            hosts,
 		HostsTotal:       len(hosts),
 		HostsOnline:      online,
 		AlertsBySeverity: alerts,
 		BuildVersion:     buildVer,
 		BuildCommit:      commit,
 		GoVersion:        goVer,
 		JobDurationRows:  r.snapshotJobs(),
 	}
 }
 // Render emits a complete Prometheus text-exposition body for s.
 // Output is deterministic: metric names appear in a fixed order and
 // labels within a metric are sorted by their first label value.
 func Render(w io.Writer, s Snapshot) error {
 	var b strings.Builder
 	// --- Server gauges ---------------------------------------------------
 	b.WriteString("# HELP rm_hosts_total Total number of enrolled hosts (excludes pending announces).\n")
 	b.WriteString("# TYPE rm_hosts_total gauge\n")
 	fmt.Fprintf(&b, "rm_hosts_total %d\n", s.HostsTotal)
 	b.WriteString("# HELP rm_hosts_online Number of hosts currently online (status='online').\n")
 	b.WriteString("# TYPE rm_hosts_online gauge\n")
 	fmt.Fprintf(&b, "rm_hosts_online %d\n", s.HostsOnline)
 	b.WriteString("# HELP rm_active_alerts Open alerts grouped by severity.\n")
 	b.WriteString("# TYPE rm_active_alerts gauge\n")
 	severities := []string{"info", "warning", "critical"}
 	for _, sev := range severities {
 		fmt.Fprintf(&b, "rm_active_alerts{severity=%q} %d\n", sev, s.AlertsBySeverity[sev])
 	}
 	b.WriteString("# HELP rm_build_info Build identifying labels; value is always 1.\n")
 	b.WriteString("# TYPE rm_build_info gauge\n")
 	fmt.Fprintf(&b, "rm_build_info{version=%q,commit=%q,go_version=%q} 1\n",
 		s.BuildVersion, s.BuildCommit, s.GoVersion)
 	// --- Per-host gauges -------------------------------------------------
 	// Stable order: by host id.
 	hosts := append([]HostRow(nil), s.Hosts...)
 	sort.Slice(hosts, func(i, j int) bool { return hosts[i].ID < hosts[j].ID })
 	b.WriteString("# HELP rm_host_agent_online 1 if the agent is currently online, 0 otherwise.\n")
 	b.WriteString("# TYPE rm_host_agent_online gauge\n")
 	for _, h := range hosts {
 		v := 0
 		if h.Online {
 			v = 1
 		}
 		fmt.Fprintf(&b, "rm_host_agent_online{host_id=%q,host=%q} %d\n",
 			h.ID, h.Name, v)
 	}
 	b.WriteString("# HELP rm_host_last_backup_timestamp_seconds Unix timestamp of the host's most recent backup. Omitted for hosts with no backup yet.\n")
 	b.WriteString("# TYPE rm_host_last_backup_timestamp_seconds gauge\n")
 	for _, h := range hosts {
 		if h.LastBackupUnix == nil {
 			continue
 		}
 		fmt.Fprintf(&b, "rm_host_last_backup_timestamp_seconds{host_id=%q,host=%q} %d\n",
 			h.ID, h.Name, *h.LastBackupUnix)
 	}
 	b.WriteString("# HELP rm_host_last_backup_success 1 if the host's most recent backup succeeded, 0 otherwise. Omitted for hosts with no backup yet.\n")
 	b.WriteString("# TYPE rm_host_last_backup_success gauge\n")
 	for _, h := range hosts {
 		if h.LastBackupSucceeded == nil {
 			continue
 		}
 		v := 0
 		if *h.LastBackupSucceeded {
 			v = 1
 		}
 		fmt.Fprintf(&b, "rm_host_last_backup_success{host_id=%q,host=%q} %d\n",
 			h.ID, h.Name, v)
 	}
 	b.WriteString("# HELP rm_host_repo_size_bytes Latest reported repo size from `restic stats --mode raw-data`. Omitted for hosts with no stats yet.\n")
 	b.WriteString("# TYPE rm_host_repo_size_bytes gauge\n")
 	for _, h := range hosts {
 		if h.RepoSizeBytes == nil {
 			continue
 		}
 		fmt.Fprintf(&b, "rm_host_repo_size_bytes{host_id=%q,host=%q} %d\n",
 			h.ID, h.Name, *h.RepoSizeBytes)
 	}
 	b.WriteString("# HELP rm_host_snapshot_count Number of restic snapshots known on the host's repo.\n")
 	b.WriteString("# TYPE rm_host_snapshot_count gauge\n")
 	for _, h := range hosts {
 		fmt.Fprintf(&b, "rm_host_snapshot_count{host_id=%q,host=%q} %d\n",
 			h.ID, h.Name, h.SnapshotCount)
 	}
 	b.WriteString("# HELP rm_host_open_alerts Number of currently open alerts attached to this host.\n")
 	b.WriteString("# TYPE rm_host_open_alerts gauge\n")
 	for _, h := range hosts {
 		fmt.Fprintf(&b, "rm_host_open_alerts{host_id=%q,host=%q} %d\n",
 			h.ID, h.Name, h.OpenAlertCount)
 	}
 	b.WriteString("# HELP rm_host_repo_status Repo readiness state for the host. Exactly one row per host with status label set.\n")
 	b.WriteString("# TYPE rm_host_repo_status gauge\n")
 	for _, h := range hosts {
 		st := h.RepoStatus
 		if st == "" {
 			st = "unknown"
 		}
 		fmt.Fprintf(&b, "rm_host_repo_status{host_id=%q,host=%q,status=%q} 1\n",
 			h.ID, h.Name, st)
 	}
 	// --- Histogram -------------------------------------------------------
 	b.WriteString("# HELP rm_job_duration_seconds End-to-end duration of completed jobs, by kind and terminal status.\n")
 	b.WriteString("# TYPE rm_job_duration_seconds histogram\n")
 	for _, row := range s.JobDurationRows {
 		for i, ub := range JobDurationBuckets {
 			fmt.Fprintf(&b, "rm_job_duration_seconds_bucket{kind=%q,status=%q,le=\"%g\"} %d\n",
 				row.Kind, row.Status, ub, row.Buckets[i])
 		}
 		fmt.Fprintf(&b, "rm_job_duration_seconds_bucket{kind=%q,status=%q,le=\"+Inf\"} %d\n",
 			row.Kind, row.Status, row.Buckets[len(JobDurationBuckets)])
 		fmt.Fprintf(&b, "rm_job_duration_seconds_sum{kind=%q,status=%q} %g\n",
 			row.Kind, row.Status, row.Sum)
 		fmt.Fprintf(&b, "rm_job_duration_seconds_count{kind=%q,status=%q} %d\n",
 			row.Kind, row.Status, row.Count)
 	}
 	_, err := io.WriteString(w, b.String())
 	return err
 }
@@ -0,0 +1,182 @@
 package metrics
 import (
 	"bytes"
 	"strings"
 	"sync"
 	"testing"
 	"time"
 )
 func TestObserveJobBuckets(t *testing.T) {
 	r := NewRegistry()
 	// Bucket boundaries: 1, 5, 30, 60, 300, 1800, 3600, 21600, 86400
 	r.ObserveJob("backup", "succeeded", 500*time.Millisecond) // <= 1
 	r.ObserveJob("backup", "succeeded", 30*time.Second)       // == 30 (boundary)
 	r.ObserveJob("backup", "succeeded", 90*time.Second)       // > 60, <= 300
 	r.ObserveJob("backup", "succeeded", 2*time.Hour)          // > 3600 → 21600 bucket
 	rows := r.snapshotJobs()
 	if len(rows) != 1 {
 		t.Fatalf("rows: %d", len(rows))
 	}
 	row := rows[0]
 	if row.Count != 4 {
 		t.Errorf("count: %d", row.Count)
 	}
 	wantSum := 0.5 + 30 + 90 + 7200.0
 	if row.Sum != wantSum {
 		t.Errorf("sum: got %v want %v", row.Sum, wantSum)
 	}
 	// Cumulative buckets:
 	//  le=1     → 1 (the 0.5s)
 	//  le=5     → 1
 	//  le=30    → 2 (boundary inclusive: 30s included)
 	//  le=60    → 2
 	//  le=300   → 3
 	//  le=1800  → 3
 	//  le=3600  → 3
 	//  le=21600 → 4
 	//  le=86400 → 4
 	//  le=+Inf  → 4
 	want := []uint64{1, 1, 2, 2, 3, 3, 3, 4, 4, 4}
 	for i, w := range want {
 		if row.Buckets[i] != w {
 			t.Errorf("bucket[%d]=%d want %d", i, row.Buckets[i], w)
 		}
 	}
 }
 func TestObserveJobNegativeClampedToZero(t *testing.T) {
 	r := NewRegistry()
 	r.ObserveJob("backup", "succeeded", -5*time.Second)
 	rows := r.snapshotJobs()
 	if len(rows) != 1 || rows[0].Sum != 0 || rows[0].Count != 1 {
 		t.Errorf("expected one zero-second observation, got %+v", rows)
 	}
 }
 func TestObserveJobConcurrent(t *testing.T) {
 	r := NewRegistry()
 	const goroutines = 16
 	const each = 200
 	var wg sync.WaitGroup
 	for g := 0; g < goroutines; g++ {
 		wg.Add(1)
 		go func() {
 			defer wg.Done()
 			for i := 0; i < each; i++ {
 				r.ObserveJob("backup", "succeeded", time.Second)
 			}
 		}()
 	}
 	wg.Wait()
 	rows := r.snapshotJobs()
 	if len(rows) != 1 {
 		t.Fatalf("rows: %d", len(rows))
 	}
 	if rows[0].Count != uint64(goroutines*each) {
 		t.Errorf("count: got %d want %d", rows[0].Count, goroutines*each)
 	}
 }
 func TestObserveJobNilRegistryNoop(t *testing.T) {
 	var r *Registry // nil
 	r.ObserveJob("backup", "succeeded", time.Second)
 }
 func TestRenderGolden(t *testing.T) {
 	r := NewRegistry()
 	r.ObserveJob("backup", "succeeded", 5*time.Second)
 	r.ObserveJob("forget", "succeeded", 100*time.Millisecond)
 	pi64 := func(v int64) *int64 { return &v }
 	pbool := func(v bool) *bool { return &v }
 	hosts := []HostRow{
 		{
 			ID: "01H0001", Name: "alpha",
 			Online:              true,
 			LastBackupUnix:      pi64(1700000000),
 			LastBackupSucceeded: pbool(true),
 			RepoSizeBytes:       pi64(123456789),
 			SnapshotCount:       42,
 			OpenAlertCount:      0,
 			RepoStatus:          "ready",
 		},
 		{
 			ID: "01H0002", Name: "bravo",
 			Online:         false,
 			SnapshotCount:  0,
 			OpenAlertCount: 1,
 			RepoStatus:     "init_failed",
 		},
 	}
 	snap := r.SnapshotWith(hosts,
 		map[string]int{"info": 0, "warning": 1, "critical": 0},
 		"v1.2.3", "deadbeef", "go1.25.0")
 	var buf bytes.Buffer
 	if err := Render(&buf, snap); err != nil {
 		t.Fatalf("render: %v", err)
 	}
 	out := buf.String()
 	for _, want := range []string{
 		"# HELP rm_hosts_total ",
 		"rm_hosts_total 2\n",
 		"rm_hosts_online 1\n",
 		`rm_active_alerts{severity="warning"} 1`,
 		`rm_active_alerts{severity="info"} 0`,
 		`rm_active_alerts{severity="critical"} 0`,
 		`rm_build_info{version="v1.2.3",commit="deadbeef",go_version="go1.25.0"} 1`,
 		`rm_host_agent_online{host_id="01H0001",host="alpha"} 1`,
 		`rm_host_agent_online{host_id="01H0002",host="bravo"} 0`,
 		`rm_host_last_backup_timestamp_seconds{host_id="01H0001",host="alpha"} 1700000000`,
 		`rm_host_last_backup_success{host_id="01H0001",host="alpha"} 1`,
 		`rm_host_repo_size_bytes{host_id="01H0001",host="alpha"} 123456789`,
 		`rm_host_snapshot_count{host_id="01H0001",host="alpha"} 42`,
 		`rm_host_snapshot_count{host_id="01H0002",host="bravo"} 0`,
 		`rm_host_open_alerts{host_id="01H0002",host="bravo"} 1`,
 		`rm_host_repo_status{host_id="01H0001",host="alpha",status="ready"} 1`,
 		`rm_host_repo_status{host_id="01H0002",host="bravo",status="init_failed"} 1`,
 		`rm_job_duration_seconds_bucket{kind="backup",status="succeeded",le="1"} 0`,
 		`rm_job_duration_seconds_bucket{kind="backup",status="succeeded",le="5"} 1`,
 		`rm_job_duration_seconds_bucket{kind="backup",status="succeeded",le="+Inf"} 1`,
 		`rm_job_duration_seconds_sum{kind="backup",status="succeeded"} 5`,
 		`rm_job_duration_seconds_count{kind="backup",status="succeeded"} 1`,
 		`rm_job_duration_seconds_bucket{kind="forget",status="succeeded",le="1"} 1`,
 	} {
 		if !strings.Contains(out, want) {
 			t.Errorf("missing line:\n  %s\n--- full output ---\n%s", want, out)
 		}
 	}
 	// bravo had no last backup → those metric lines must be absent for it.
 	for _, ban := range []string{
 		`rm_host_last_backup_timestamp_seconds{host_id="01H0002"`,
 		`rm_host_last_backup_success{host_id="01H0002"`,
 		`rm_host_repo_size_bytes{host_id="01H0002"`,
 	} {
 		if strings.Contains(out, ban) {
 			t.Errorf("unexpected line for bravo: %q", ban)
 		}
 	}
 }
 func TestRenderEmptySnapshot(t *testing.T) {
 	r := NewRegistry()
 	snap := r.SnapshotWith(nil, nil, "dev", "", "go1.25.0")
 	var buf bytes.Buffer
 	if err := Render(&buf, snap); err != nil {
 		t.Fatalf("render: %v", err)
 	}
 	out := buf.String()
 	if !strings.Contains(out, "rm_hosts_total 0\n") {
 		t.Errorf("missing zero-host gauge:\n%s", out)
 	}
 	// Histogram block has its HELP/TYPE but no rows. The HELP/TYPE
 	// presence is correct and helps Prometheus pre-register the metric.
 	if !strings.Contains(out, "# TYPE rm_job_duration_seconds histogram") {
 		t.Errorf("histogram HELP/TYPE missing")
 	}
 }
@@ -15,6 +15,7 @@ import (
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/alert"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/auth"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
 )
@@ -27,6 +28,9 @@ type HandlerDeps struct {
 	// AlertEngine receives job-finished and host-online events so the
 	// alert engine can evaluate its rules. Optional; nil = no-op.
 	AlertEngine *alert.Engine
 	// Metrics records job-duration observations on every terminal
 	// status. Optional; nil = no-op (test fixtures pass nil).
 	Metrics *metrics.Registry
 	// UpdateWatcher reconciles in-flight agent-update dispatches against
 	// hello envelopes. Optional; nil = no-op.
 	UpdateWatcher *UpdateWatcher
@@ -239,6 +243,13 @@ func dispatchAgentMessage(ctx context.Context, c *Conn, hostID string, env api.E
 					slog.Warn("ws: set host last backup", "host_id", hostID, "err", err)
 				}
 			}
 			// Job-duration histogram (P6-04). Skip when StartedAt is
 			// missing (race: agent shipped finished without a started,
 			// or the row predates this code).
 			if deps.Metrics != nil && job.StartedAt != nil {
 				deps.Metrics.ObserveJob(job.Kind, string(p.Status),
 					p.FinishedAt.Sub(*job.StartedAt))
 			}
 		}
 		if deps.JobHub != nil {
 			deps.JobHub.Broadcast(p.JobID, env)
@@ -432,8 +432,45 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
 > swap, helper `buildRepoTrendView` shared between page-load and
 > fragment endpoint). No new dependencies, no client JS, no agent
 > change. CI green; in-browser smoke walk-through pending operator.
- [ ] **P6-04** (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list. _(Was P4-08.)_
+- [x] **P6-04** (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list. _(Was P4-08.)_
- [ ] **P6-05** (S) Document Prometheus integration + sample Grafana dashboard JSON. _(Was P4-09.)_
+- [x] **P6-05** (S) Document Prometheus integration + sample Grafana dashboard JSON. _(Was P4-09.)_
 > **As shipped (2026-05-07, branch `p6-04-05-prometheus-metrics`):**
 > Spec `docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md`,
 > plan `docs/superpowers/plans/2026-05-07-p6-04-05-prometheus-metrics.md`.
 > New `internal/server/metrics` package emits the legacy
 > `text/plain; version=0.0.4` exposition format directly — no
 > `prometheus/client_golang` dependency, matching the repo's
 > "no Tailwind, no Node" minimal-deps style. `/metrics` is **opt-in**:
 > `RM_METRICS_TOKEN` and/or `RM_METRICS_TRUSTED_CIDR` must be set or
 > the route isn't mounted at all (404). When both are set, both must
 > pass; either alone gates access. Token compare is constant-time.
 > CIDR check honours `X-Forwarded-For` only when the immediate hop
 > is a configured `RM_TRUSTED_PROXY` (mirrors the existing realIP
 > resolution).
 >
 > **Metrics:** per-host gauges (`rm_host_agent_online`,
 > `rm_host_last_backup_timestamp_seconds`, `rm_host_last_backup_success`,
 > `rm_host_repo_size_bytes`, `rm_host_snapshot_count`,
 > `rm_host_open_alerts`, `rm_host_repo_status`); server gauges
 > (`rm_hosts_total`, `rm_hosts_online`, `rm_active_alerts{severity}`,
 > `rm_build_info{version,commit,go_version}`); histogram
 > `rm_job_duration_seconds_bucket{kind,status,le}` with buckets
 > `1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf`.
 > Histogram is in-memory; observations come from the existing
 > `MsgJobFinished` branch in `internal/server/ws/handler.go`.
 >
 > **Docs:** `docs/prometheus.md` covers enable + scrape config +
 > metric reference + dashboard import. **Dashboard:**
 > `deploy/grafana/restic-manager-dashboard.json` — six panels
 > (fleet status, open alerts, backups failing, hosts table, repo
 > size over time, job-duration p95). Schema 39, single Prometheus
 > datasource variable.
 >
 > **Tests:** golden-render + concurrent-observe + bucket-boundary
 > in the metrics package; auth matrix (no auth → 404; token
 > missing/wrong/right; CIDR matching/non-matching; token AND CIDR)
 > in the HTTP layer.
 ### Phase 6 acceptance
Author	SHA1	Message	Date
steve	a3f134bcd6	e2e: pin Playwright to 1.59.1 CI / Test (rest) (pull_request) Successful in 34s Details CI / Test (store) (pull_request) Successful in 54s Details CI / Lint (pull_request) Successful in 26s Details CI / Build (windows/amd64) (pull_request) Successful in 26s Details CI / Build (linux/amd64) (pull_request) Successful in 25s Details CI / Build (linux/arm64) (pull_request) Successful in 25s Details e2e / Playwright vs docker-compose (pull_request) Failing after 1m36s Details CI / Test (server-http) (pull_request) Successful in 3m19s Details `@playwright/test` was loose-pinned to ^1.50.0; npm resolved it to 1.59.1 inside the runner image, which only ships browser binaries for 1.50.0. Pin both the package and the docker image to v1.59.1 so deps and binaries stay aligned.	2026-05-08 20:09:17 +01:00
steve	17b9ee08b7	e2e: run health probe + Playwright on the compose network Gitea's act-style runners execute workflow steps inside a runner container, so compose's host port-publish (127.0.0.1:8080:8080) is not reachable from the steps. PR #23's e2e job timed out waiting for the server even though the container was up and listening. Move both the health probe and the Playwright run onto rmnet so they address the server as http://server:8080: * health probe: docker run --rm --network e2e_rmnet curlimages/curl * Playwright: new mcr.microsoft.com/playwright-based image, added as a profile-gated `playwright` service in compose.e2e.yml, invoked via `docker compose run --rm playwright`. Drops the setup-node + npm install runner steps.	2026-05-08 20:08:23 +01:00
steve	89537d417a	P5: OSS readiness — docs site, contributor onboarding, e2e harness P5-01 — Documentation site under docs/book/ rendered with mdBook (downloaded via Makefile, same static-binary pattern as Tailwind). Structured chapters: getting started, concepts, operations, security, reference. `make docs` / `make docs-watch`. Generated output gitignored. P5-02 — CONTRIBUTING.md rewritten from placeholder to a full guide. CODE_OF_CONDUCT.md adapted from Contributor Covenant for a single-maintainer project. .gitea/issue_template/{bug,feature}.md and PULL_REQUEST_TEMPLATE.md. P5-04 — Six README screenshots captured live from a fresh server bootstrap (login, empty dashboard, add-host, alerts, settings, audit log). README rewritten to centre the screenshot grid and link out to the docs site. P5-05 — SECURITY.md with disclosure policy (3-day ack, 30-day default window), scope in/out, threat-model summary, operator hardening checklist. Mirrored as a docs-site chapter. P5-06 — End-to-end test harness. e2e/compose.e2e.yml brings up server + sibling Linux agent (alpine + restic) + restic/rest-server. Agent uses announce-and-approve so Playwright can drive the full operator flow: bootstrap → login → accept pending → backup → verify terminal status. Second spec scrapes /metrics to assert the P6-04 endpoint surface. .gitea/workflows/e2e.yml runs on every PR; local how-to in docs/e2e.md.	2026-05-08 20:08:23 +01:00
steve	a252b25854	Merge pull request 'spec+plan: P6-04/05 prometheus /metrics + Grafana dashboard' (#22 ) from p6-04-05-prometheus-metrics into main Reviewed-on: #22	2026-05-08 18:31:57 +00:00
steve	73e733be61	P6-04+05: Prometheus /metrics endpoint + Grafana dashboard CI / Test (rest) (pull_request) Successful in 41s Details CI / Test (store) (pull_request) Successful in 43s Details CI / Lint (pull_request) Successful in 29s Details CI / Build (windows/amd64) (pull_request) Successful in 44s Details CI / Test (server-http) (pull_request) Successful in 1m47s Details CI / Build (linux/arm64) (pull_request) Successful in 43s Details CI / Build (linux/amd64) (pull_request) Successful in 2m1s Details New internal/server/metrics package emits the legacy text/plain exposition format directly, so we don't pull in prometheus/client_golang. Endpoint is opt-in via RM_METRICS_TOKEN and/or RM_METRICS_TRUSTED_CIDR; route is not mounted at all if neither gate is set. Both gates ANDed when both configured. Per-host gauges (online, last_backup_*, repo_size_bytes, snapshot_count, open_alerts, repo_status), server gauges (hosts_total/online, active_alerts by severity, build_info), and an in-memory job-duration histogram observed from the existing MsgJobFinished branch in the WS handler. Docs in docs/prometheus.md (enable + scrape config + metric reference + dashboard import). Sample dashboard at deploy/grafana/restic-manager-dashboard.json - six panels, Grafana schema 39, single Prometheus datasource variable. Tests: golden render, concurrent observe, bucket boundaries in the metrics package; auth matrix (no auth -> 404, token gate, CIDR gate, both required) in the HTTP layer.	2026-05-07 23:17:15 +01:00
steve	70ff554402	spec+plan: P6-04/05 prometheus /metrics + Grafana dashboard	2026-05-07 23:07:30 +01:00