e2e: pin Playwright to 1.59.1

`@playwright/test` was loose-pinned to ^1.50.0; npm resolved it to 1.59.1 inside the runner image, which only ships browser binaries for 1.50.0. Pin both the package and the docker image to v1.59.1 so deps and binaries stay aligned.
e2e: run health probe + Playwright on the compose network
2026-05-08 20:09:17 +01:00 · 2026-05-08 20:08:23 +01:00 · 2026-05-08 20:08:23 +01:00 · 2026-05-08 18:31:57 +00:00 · 2026-05-07 23:17:15 +01:00 · 2026-05-07 23:07:30 +01:00
61 changed files with 4577 additions and 63 deletions
@@ -0,0 +1,32 @@
 <!--
 Thanks for the PR! A few quick checks before submitting:
 * Did you open an issue first for non-trivial changes?
 * `make lint test` is green locally?
 * Commits are focused (one logical change per commit)?
 * No `Co-Authored-By` trailers (repo policy)?
 * No new dependencies without a one-line justification below?
 -->
 ## Summary
 <!-- One paragraph: what changed and why. -->
 ## Test plan
 <!-- Bullet list of what you actually ran. Be specific.
     - `make test` → green
     - Manually exercised the new flow at /hosts/{id}/foo
     - Smoke env: enrolled a fresh host, ran a backup end-to-end
 -->
 ## Notes for the reviewer
 <!-- Anything the reviewer needs to know that isn't obvious from the
     diff: related issue, follow-up work that's intentionally not
     in this PR, deferred concerns, design alternatives considered
     and rejected. -->
 ## Linked issues
 <!-- "Closes #123" / "Refs #456" / "Part of P5-06" -->
@@ -0,0 +1,52 @@
 ---
 name: Bug report
 about: Something isn't behaving the way the docs / code suggest it should
 title: "[bug] "
 labels: bug
 ---
 ## What happened
 <!-- A clear description of the actual behaviour. Include the exact
     UI surface, API endpoint, or CLI invocation involved. -->
 ## What you expected
 <!-- What you thought would happen, and where that expectation came from
     (docs page, command output, prior behaviour). -->
 ## Steps to reproduce
 1.
 2.
 3.
 ## Environment
 - restic-manager server version: <!-- `restic-manager-server --version` or footer of the UI -->
 - Agent version (if relevant): <!-- `restic-manager-agent --version` -->
 - restic version on affected host: <!-- `restic version` -->
 - Host OS: <!-- e.g. "Ubuntu 22.04 amd64" or "Windows Server 2022" -->
 - How was the server installed: <!-- docker compose / source build / other -->
 ## Logs / output
 <details><summary>Server log (sanitised)</summary>
 ```
 <!-- paste relevant lines; redact tokens, passwords, repo URLs -->
 ```
 </details>
 <details><summary>Agent log (sanitised)</summary>
 ```
 ```
 </details>
 ## Anything else
 <!-- Screenshots, related issues, recent changes you made before the
     bug appeared, anything that might help. -->
@@ -0,0 +1,34 @@
 ---
 name: Feature request
 about: Suggest a new capability or change to existing behaviour
 title: "[feature] "
 labels: enhancement
 ---
 ## What you're trying to do
 <!-- Describe the use case, not the proposed solution. Who is the
     operator, what are they trying to accomplish, and what's
     blocking them today? -->
 ## Why the current behaviour falls short
 <!-- What does the system do today, and where does it stop short of
     the use case above? -->
 ## Proposed direction (optional)
 <!-- If you have a specific design in mind, describe it. Skip this
     section if you'd rather leave it to the maintainer. -->
 ## Scope check
 - [ ] I've read [`spec.md`](../spec.md) §2 (Goals & Non-Goals).
 - [ ] This isn't already on the roadmap in [`tasks.md`](../tasks.md).
 - [ ] This fits the project's "small fleet, one person operating"
      target rather than enterprise / multi-tenant / SaaS use cases.
 ## Anything else
 <!-- Related restic features, prior art in similar tools, links to
     discussions you've had elsewhere. -->
@@ -0,0 +1,98 @@
 # P5-06 — End-to-end test suite.
 #
 # Spec : docs/superpowers/specs/2026-05-07-p5-oss-readiness-design.md
 # Stack: e2e/compose.e2e.yml (server + agent + rest-server + playwright)
 # Tests: e2e/playwright/tests/*.spec.ts
 #
 # Triggered on every PR into main and on workflow_dispatch. Runs
 # longer than the unit-test workflow (~3-4 minutes for a clean run);
 # kept separate so a slow e2e doesn't block the fast lint/test loop.
 #
 # Networking note: every interaction with the server (health probe,
 # Playwright) happens from a container on the compose `rmnet`
 # network, addressing the server as `http://server:8080`. We can't
 # rely on `127.0.0.1:8080` because Gitea's runner executes steps
 # inside its own container, where compose's host port-publish is
 # not visible.
 name: e2e
 on:
  pull_request:
    branches: [main]
  workflow_dispatch:
 jobs:
  e2e:
    name: Playwright vs docker-compose
    runs-on: ubuntu-latest
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v4
      - name: Build the e2e stack
        run: docker compose -f e2e/compose.e2e.yml build
      - name: Bring up the stack
        run: docker compose -f e2e/compose.e2e.yml up -d server rest-server source-fixture
      - name: Wait for server health
        run: |
          set -eu
          for i in $(seq 1 30); do
            if docker run --rm --network e2e_rmnet curlimages/curl:8.10.1 \
                  -fsS http://server:8080/api/version >/dev/null 2>&1; then
              echo "server up"; exit 0
            fi
            sleep 2
          done
          echo "server didn't come up"; docker compose -f e2e/compose.e2e.yml logs server; exit 1
      - name: Capture bootstrap token from server logs
        id: bootstrap
        run: |
          set -eu
          for i in $(seq 1 15); do
            line=$(docker compose -f e2e/compose.e2e.yml logs server 2>&1 | grep -E 'bootstrap token' -A2 | grep -Eo '[a-zA-Z0-9_-]{40,}' | head -1 || true)
            if [ -n "$line" ]; then
              echo "RM_BOOTSTRAP_TOKEN=$line" >> "$GITHUB_ENV"
              echo "got bootstrap token (${#line} chars)"
              exit 0
            fi
            sleep 1
          done
          echo "bootstrap token not found in logs"
          docker compose -f e2e/compose.e2e.yml logs server
          exit 1
      - name: Start the agent
        run: docker compose -f e2e/compose.e2e.yml up -d agent
      - name: Prepare report mounts
        run: |
          mkdir -p e2e/playwright/playwright-report e2e/playwright/test-results
          chmod -R a+rwX e2e/playwright/playwright-report e2e/playwright/test-results
      - name: Run Playwright tests
        env:
          RM_BOOTSTRAP_TOKEN: ${{ env.RM_BOOTSTRAP_TOKEN }}
        run: docker compose -f e2e/compose.e2e.yml run --rm playwright
      - name: Compose logs (on failure)
        if: failure()
        run: |
          docker compose -f e2e/compose.e2e.yml logs --tail=200 server
          docker compose -f e2e/compose.e2e.yml logs --tail=200 agent
          docker compose -f e2e/compose.e2e.yml logs --tail=200 rest-server
      - name: Upload Playwright report (on failure)
        if: failure()
        uses: actions/upload-artifact@v3
        with:
          name: playwright-report
          path: e2e/playwright/playwright-report
          retention-days: 7
      - name: Tear down
        if: always()
        run: docker compose -f e2e/compose.e2e.yml down -v
@@ -2,6 +2,10 @@
 /bin/
 /dist/
 # Generated mdBook output (source under docs/book/src is committed,
 # the rendered book/ directory is not).
 /docs/book/book/
 # Local data / runtime state
 /data/
 /certs/
@@ -0,0 +1,69 @@
 # Code of Conduct
 restic-manager is a small project run by one person. This Code of
 Conduct sets out the basic expectations for participating in the
 project's issue tracker, pull requests, and any other community
 spaces (chat, mailing lists) we may run in future.
 ## Expected behaviour
 - **Be civil.** Disagreement is fine; rudeness is not. The same
  comment can usually be made without making it personal.
 - **Assume good faith.** People asking what feels like a basic
  question may be new to the project. People proposing what feels
  like a duplicate idea may not have seen the prior discussion.
  Point them to the right place politely.
 - **Stay on topic.** Issue threads are for the issue. Tangential
  conversations belong in their own thread.
 - **Acknowledge the project's scope.** restic-manager is
  intentionally small in scope (see `spec.md` §2). Reasonable
  feature suggestions may still be declined for fit reasons.
 ## Unacceptable behaviour
 - Harassment, threats, or insults — public or private.
 - Discriminatory comments based on age, body size, disability,
  ethnicity, gender identity or expression, level of experience,
  nationality, personal appearance, race, religion, sexual identity
  or orientation.
 - Sustained disruption — derailing threads, ignoring repeated
  requests to take a discussion elsewhere, brigading.
 - Publishing other people's private information without permission.
 ## Reporting
 If someone in the project's spaces is behaving in a way that
 breaches this Code of Conduct, contact the maintainer directly
 through the contact details on their Gitea profile, or via the
 private security disclosure path documented in
 [SECURITY.md](./SECURITY.md). Reports stay confidential.
 The maintainer will review the report, gather context if needed,
 and respond. Possible outcomes include a private warning, a public
 clarification of expectations, a temporary or permanent ban from
 project spaces, or no action if the report doesn't hold up.
 There is no formal appeals process — this is a one-person project,
 not a foundation. If you think a decision was wrong you can say
 so, in writing, to the maintainer; that's it.
 ## Scope
 This Code of Conduct applies to interactions in any space the
 project owns or operates: the Gitea repository (issues, pull
 requests, discussions, wiki), any chat channels we publish, and
 any conferences or events the project is officially represented at.
 It does not apply to:
 - Forks of the project that aren't being submitted back upstream.
 - Conversations between contributors that don't reference the
  project.
 - Public criticism of the project itself.
 ## Acknowledgement
 This document borrows shape and language from the
 [Contributor Covenant](https://www.contributor-covenant.org/) v2.1
 but is intentionally shorter and adapted to the project's
 single-maintainer reality.
@@ -1,30 +1,168 @@
-# Contributing
+# Contributing to restic-manager
-Thanks for your interest in contributing to restic-manager.
+Thanks for your interest in restic-manager. This document covers how
 to set up a development environment, the conventions the project
 follows, and how patches make it from your machine into `main`.
-> This is a placeholder. The project is in pre-alpha (Phase 1 / MVP). A
+## Project status and scope
 > full contributor guide will land alongside the Phase 5 OSS-readiness
 > work — see [`tasks.md`](./tasks.md) P5-02. Until then the notes below
 > apply.
-## Before opening a PR
+restic-manager is in pre-1.0. Core functionality (Phases 0–4) is
 landed; OSS-readiness polish is in progress. The top of
 [`tasks.md`](./tasks.md) tracks what's next; [`spec.md`](./spec.md)
 is the canonical design doc and the source of truth for any
 "why is it built this way" question.
-1. Open an issue first for non-trivial changes — the design is still
+The project is **single-maintainer, hobbyist-scale, and licensed
-   moving (see [`spec.md`](./spec.md)) and unsolicited large PRs may
+under [PolyForm Noncommercial 1.0.0](./LICENSE)**. That has two
-   conflict with in-flight work.
+practical implications:
 2. `make lint test` should pass.
 3. Match the existing code style — `gofumpt`, `goimports`, no comments
   that just restate what the code does.
 4. Keep commits focused; one logical change per commit.
-## Reporting security issues
+1. Big PRs without prior discussion may be declined for fit
   reasons even when they're correct — opening an issue first lets
   us check alignment cheaply.
 2. Commercial use is not permitted by the license. Bug reports and
   patches from operators of personal/community deployments are
   very welcome.
-Please do **not** open a public issue for security problems. A
+## Getting started
-`SECURITY.md` with a private disclosure path will be added in Phase 5
+
-(P5-05). Until then, contact the repository owner directly via the
+### Prerequisites
-contact details on their gitea profile.
+
 - Go 1.25 or newer (`go.mod` is the source of truth)
 - `make`
 - For the front-end CSS bundle: nothing extra — `make build`
  downloads a pinned `tailwindcss` standalone binary into `bin/`.
 - For the docs site: nothing extra — `make docs` does the same trick
  with `mdbook`.
 - For end-to-end tests: Docker + Docker Compose, plus `npx` for
  Playwright.
 ### One-time setup
 ```sh
 git clone https://gitea.dcglab.co.uk/steve/restic-manager.git
 cd restic-manager
 make build          # compiles bin/restic-manager-{server,agent}
 make test           # full unit + integration test sweep
 make lint           # gofumpt + goimports + golangci-lint
 ```
 ### Running locally
 For most development, the [smoke environment](./docs/e2e-smoke.md)
 is the path of least resistance:
 ```sh
 make smoke-restart  # rebuilds, launches as a systemd --user unit
 make smoke-logs     # tail of the server log
 ```
 Then point a browser at `http://127.0.0.1:8080`. The first run
 prints a one-time bootstrap token to the log; use it to create the
 admin user.
 ## Code conventions
 ### Style
 - `gofumpt` for formatting; `goimports` for import grouping.
  Both run via the pre-commit hook in this repo.
 - `golangci-lint` with `.golangci.yml` defaults; CI rejects on lint
  errors.
 - UK English in identifiers, comments, log messages, and UI strings
  (the misspell linter is configured for the UK locale — see
  P3-X5 for the original sweep).
 - Comments explain **why**, not what; avoid restating the code.
  A surprising invariant or an external constraint is worth
  writing down. "Adds 1 to x" is not.
 - `slog` for structured logs. Never log secrets — and especially
  never the merged-creds rest-server URL (see [`CLAUDE.md`](./CLAUDE.md)).
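As a hedged illustration of the logging rule above, here is a minimal sketch that strips credentials from a repository URL before it reaches `slog`. `redactRepoURL` is a hypothetical helper, not a function from this repo, and the URL is an example value. It assumes the restic `rest:` prefix may or may not be present.

```go
package main

import (
	"log/slog"
	"net/url"
	"os"
	"strings"
)

// redactRepoURL is a hypothetical helper: it drops the user:password
// component of a repository URL so the result is safe to log. Input that
// does not parse to a URL with a host is replaced by a placeholder
// rather than logged verbatim.
func redactRepoURL(raw string) string {
	// restic writes rest-server repos as "rest:https://...". Strip the
	// prefix first, otherwise net/url treats the remainder as opaque.
	rest, hadPrefix := strings.CutPrefix(raw, "rest:")
	u, err := url.Parse(rest)
	if err != nil || u.Host == "" {
		return "<unparseable repo url>"
	}
	u.User = nil
	if hadPrefix {
		return "rest:" + u.String()
	}
	return u.String()
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	// Example value only; not a real endpoint.
	raw := "rest:https://backup:s3cret@rest.example.com:8000/host-a/"
	logger.Info("repo configured", "repo_url", redactRepoURL(raw))
}
```

Dropping the userinfo outright, rather than masking only the password, keeps the username out of the logs as well.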
 ### File and package layout
 - `cmd/server` and `cmd/agent` are the two binary entry points.
 - `internal/` holds everything that's not part of the public Go
  API (which is none of it — restic-manager isn't a library).
 - Per-feature packages live under `internal/server/...` for the
  control plane and `internal/agent/...` for the agent.
 - `web/templates/` are HTML templates rendered with the standard
  library; embedded via `web.FS`.
 ### Tests
 - Unit tests live alongside the code as `*_test.go`. Use the
  in-process sqlite store (`store.Open(":memory:")`) when you need
  state — there is no test mock layer to maintain.
 - HTTP handlers test through `httptest.NewServer` against the real
  router; see `internal/server/http/auth_test.go` for the canonical
  fixture pattern.
 - End-to-end tests live in `e2e/` and run against a Docker Compose
  stack. See [`docs/e2e.md`](./docs/e2e.md).
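The handler-testing pattern above can be sketched roughly as follows. This is an illustration only: `newRouter` is a hypothetical stand-in for the project's real router, and the JSON shape of `/api/version` is assumed. See `internal/server/http/auth_test.go` for the actual fixture.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newRouter is a hypothetical stand-in for the real router. The response
// body shape is an assumption made for this example.
func newRouter(version string) http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/version", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(map[string]string{"version": version})
	})
	return mux
}

// fetchVersion serves h on a loopback port via httptest.NewServer and
// issues a real HTTP request, so routing and any middleware wrapped
// around h run exactly as they would in production.
func fetchVersion(h http.Handler) (int, string, error) {
	srv := httptest.NewServer(h)
	defer srv.Close()

	resp, err := srv.Client().Get(srv.URL + "/api/version")
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()

	var body map[string]string
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return resp.StatusCode, "", err
	}
	return resp.StatusCode, body["version"], nil
}

func main() {
	status, version, err := fetchVersion(newRouter("dev"))
	fmt.Println(status, version, err)
}
```

In a real `*_test.go` file the body of `fetchVersion` would sit inside a `TestXxx(t *testing.T)` function and report failures through `t`.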
 ### Database migrations
 - Migrations are hand-rolled SQL in `internal/store/migrations/`
  and embedded via `embed.FS`.
 - Prefer column-level `ALTER TABLE` over rebuilds — see
  [`CLAUDE.md`](./CLAUDE.md) "Migrations" section for the FK-cascade
  trap that bit migration 0007's first draft.
 ## Workflow
 ### Before opening a PR
 1. **Open an issue first** for non-trivial changes. The design is
   still moving; an issue lets us agree on direction cheaply.
 2. Run `make lint test` locally — both must pass.
 3. Match existing code style (see above).
 4. Keep commits focused: one logical change per commit. Imperative
   subject lines, body explaining why if it isn't obvious.
 5. Don't add `Co-Authored-By` trailers — repo policy. If you used
   AI assistance in writing the patch, that's fine; we just don't
   pollute every commit message with attribution boilerplate.
 ### Pull requests
 PRs target `main`. CI runs lint + tests on Linux amd64/arm64 and
 Windows amd64; all three must be green to merge. Squash-merge is
 the default; the PR title becomes the merge-commit subject, so
 keep it short and informative.
 The PR template asks for:
 - A short description of what changed and why.
 - A test plan (commands run, scenarios verified).
 - Anything reviewers need to know to assess the change (related
  issue, follow-up work, deferred concerns).
 ### Reporting bugs
 Open an issue with:
 - restic-manager version (`restic-manager-server --version`) and agent version (`restic-manager-agent --version`).
 - restic version on the affected host.
 - Steps to reproduce.
 - Server and agent logs (sanitise any tokens before pasting).
 Security-sensitive bugs go through the [SECURITY.md](./SECURITY.md)
 disclosure path instead — please don't open a public issue for
 them.
 ### Suggesting features
 Open an issue describing the use case (not just the proposed
 solution). The roadmap in `tasks.md` shows where the project is
 heading; if the suggestion fits a future phase we'll wire it in
 there. If it falls outside the project's scope (multi-tenancy, SaaS,
 non-restic backends — see `spec.md` §2 non-goals) we'll say so
 early to save your time.
 ## Code of conduct
 Project participation is governed by [CODE_OF_CONDUCT.md](./CODE_OF_CONDUCT.md).
 The short version: be civil; assume good faith; harassment is not
 tolerated.
 ## License
-By contributing you agree that your contributions are licensed under
+By contributing you agree that your contributions are licensed
-the [PolyForm Noncommercial 1.0.0](./LICENSE) license.
+under the [PolyForm Noncommercial 1.0.0](./LICENSE) license.
@@ -24,7 +24,18 @@ TAILWIND_URL      := https://github.com/tailwindlabs/tailwindcss/releases/downlo
 TAILWIND_INPUT    := web/styles/input.css
 TAILWIND_OUTPUT   := web/static/css/styles.css
-.PHONY: help build server agent test test-race lint fmt tidy clean run-server run-agent docker release tailwind tailwind-watch setup hooks smoke-restart smoke-stop smoke-status smoke-logs smoke-deploy
+# mdBook for the docs site (P5-01). Single static binary, no
 # Rust toolchain — same pattern as Tailwind.
 MDBOOK_VERSION    ?= v0.4.51
 MDBOOK_OS         := $(shell uname -s | tr A-Z a-z)
 MDBOOK_TRIPLE     := $(shell uname -m)-$(if $(filter darwin,$(MDBOOK_OS)),apple-darwin,unknown-linux-gnu)
 MDBOOK_BIN        := $(BIN_DIR)/mdbook
 MDBOOK_TARBALL    := mdbook-$(MDBOOK_VERSION)-$(MDBOOK_TRIPLE).tar.gz
 MDBOOK_URL        := https://github.com/rust-lang/mdBook/releases/download/$(MDBOOK_VERSION)/$(MDBOOK_TARBALL)
 DOCS_BOOK_DIR     := docs/book
 DOCS_BOOK_OUT     := $(DOCS_BOOK_DIR)/book
 .PHONY: help build server agent test test-race lint fmt tidy clean run-server run-agent docker release tailwind tailwind-watch docs docs-watch setup hooks smoke-restart smoke-stop smoke-status smoke-logs smoke-deploy
 # ---- smoke-env tooling -------------------------------------------------
 # The smoke server runs as a transient user-systemd unit so it survives
@@ -60,6 +71,18 @@ tailwind-watch: $(TAILWIND_BIN) ## Watch and rebuild on every save
 	@mkdir -p $$(dirname $(TAILWIND_OUTPUT))
 	$(TAILWIND_BIN) -c tailwind.config.js -i $(TAILWIND_INPUT) -o $(TAILWIND_OUTPUT) --watch
 $(MDBOOK_BIN):
 	@mkdir -p $(BIN_DIR)
 	@echo "==> downloading mdbook $(MDBOOK_VERSION) ($(MDBOOK_TRIPLE))"
 	curl -fsSL "$(MDBOOK_URL)" | tar -xz -C $(BIN_DIR) mdbook
 	@chmod +x $@
 docs: $(MDBOOK_BIN) ## Build the docs/book/ mdBook site into docs/book/book/
 	$(MDBOOK_BIN) build $(DOCS_BOOK_DIR)
 docs-watch: $(MDBOOK_BIN) ## Serve the docs site at http://127.0.0.1:3000 with live reload
 	$(MDBOOK_BIN) serve $(DOCS_BOOK_DIR) -n 127.0.0.1 -p 3000
 agent: ## Build the agent binary
 	@mkdir -p $(BIN_DIR)
 	CGO_ENABLED=0 go build $(GOFLAGS) -ldflags "$(LDFLAGS)" -o $(AGENT_BIN) ./cmd/agent
@@ -90,7 +113,7 @@ tidy: ## go mod tidy
 	go mod tidy
 clean: ## Remove build artifacts
-	rm -rf $(BIN_DIR) coverage.out coverage.html $(TAILWIND_OUTPUT)
+	rm -rf $(BIN_DIR) coverage.out coverage.html $(TAILWIND_OUTPUT) $(DOCS_BOOK_OUT)
 run-server: server ## Build and run the server
 	$(SERVER_BIN)
@@ -1,36 +1,62 @@
 # restic-manager
 Self-hosted, browser-based, single-pane-of-glass for managing
-[restic](https://restic.net) backups across a fleet of Linux and Windows
+[restic](https://restic.net) backups across a fleet of Linux and
-endpoints.
+Windows endpoints.
-> Status: pre-alpha. Phase 0 (project bootstrap) complete; Phase 1 (MVP) in
+> **Status:** pre-1.0, feature-complete for the original use
-> progress. See [`spec.md`](./spec.md) for the design and
+> case. Phases 0–4 + 6 are landed (MVP, scheduling, restore,
-> [`tasks.md`](./tasks.md) for the roadmap.
+> RBAC + OIDC, observability); Phase 5 (OSS readiness — docs site,
 > contributor onboarding, end-to-end CI) is in flight. See
 > [`spec.md`](./spec.md) for the design and [`tasks.md`](./tasks.md)
 > for the live roadmap.
-## What it does (target)
+## What it does
- Central visibility into backup state for every endpoint
+- Central visibility into backup state for every endpoint.
- Trigger any restic operation remotely (`backup`, `forget`, `prune`,
+- Trigger any restic operation remotely (`backup`, `forget`,
-  `check`, `unlock`, `snapshots`, `stats`, `diff`, `restore`)
+  `prune`, `check`, `unlock`, `snapshots`, `stats`, `diff`,
- Manage per-host backup schedules from the UI
+  `restore`).
- Live job progress streamed back to the UI
+- Per-host schedules with named source groups + retention.
- Restore wizard (browse snapshots, pick paths, restore to original or
+- Live job log streamed to the browser; downloadable as
-  alternate host)
+  text/NDJSON afterwards.
- Repo health surfacing (size, dedup ratio, last check, lock state)
+- Restore wizard: browse a snapshot's tree, pick paths, restore
- Alerting on failure or staleness
+  in-place or to a new directory.
- Cross-platform agent (Linux + Windows)
+- Repo health surfacing (size, raw size, last check, lock state),
- Ransomware-resistant repo access via append-only credentials
+  plus a 30/90-day repo-size trend.
 - Alerting over webhook, ntfy, or SMTP.
 - Cross-platform agent (Linux systemd + Windows SCM).
 - Append-only-friendly: separate admin credential for prune.
 - Optional Prometheus `/metrics` endpoint + sample Grafana
  dashboard.
 - Optional OIDC SSO (Authelia, Authentik, etc.).
-## Architecture (one-line summary)
+## Screenshots
-A small Go control-plane on the Proxmox host, lightweight Go agents on each
+| Sign in | Empty dashboard | Add host |
-endpoint that hold an outbound WebSocket to the control-plane, and a
+|:-------:|:---------------:|:--------:|
-`restic/rest-server` on Unraid that holds the actual backup data. The
+| ![Sign in](docs/screenshots/01-login.png) | ![Dashboard, fresh](docs/screenshots/02-dashboard-empty.png) | ![Add host](docs/screenshots/03-add-host.png) |
-control-plane never touches backup bytes.
+
 | Alerts | Settings | Audit log |
 |:------:|:--------:|:---------:|
 | ![Alerts](docs/screenshots/04-alerts.png) | ![Settings](docs/screenshots/05-settings.png) | ![Audit log](docs/screenshots/06-audit.png) |
 (Screenshots from a fresh smoke install with no hosts. A populated
 fleet view and the live-log + restore wizard surfaces are part of
 the docs site under [`docs/book/`](./docs/book) — `make docs` to
 render locally.)
 ## Architecture (one-line)
 A small Go control-plane in Docker, lightweight Go agents on each
 endpoint holding an outbound WebSocket to the control-plane, and
 a restic repository (rest-server, S3, B2, SFTP — anything restic
 speaks) that holds the actual backup data. **The control-plane
 never touches backup bytes.**
 Full architecture diagram and component breakdown:
-[`spec.md` §3](./spec.md).
+[`spec.md` §3](./spec.md), or the rendered version in the
 [docs site](./docs/book/src/concepts/architecture.md).
 ## Repository layout
@@ -38,31 +64,63 @@ Full architecture diagram and component breakdown:
 cmd/server/        control-plane binary
 cmd/agent/         endpoint agent binary
 internal/api       shared API types (REST + WS envelopes)
-internal/server/   HTTP, WS, UI handlers
+internal/server/   HTTP, WS, UI handlers, alert engine
 internal/agent/    service integration, restic runner, local scheduler
 internal/restic    restic CLI wrapper
 internal/store     SQLite persistence
-internal/crypto    secret encryption
+internal/crypto    secret encryption (AEAD)
 internal/auth      passwords, sessions, agent tokens
 web/               server-rendered templates + static assets
-deploy/            Dockerfile, docker-compose.yml, install scripts
+deploy/            Dockerfile, docker-compose.yml, install scripts, Grafana dashboard
-design/            UI wireframes (Phase 0 design pass)
+docs/              prose docs + the mdBook site under docs/book
 e2e/               compose stack + Playwright tests for end-to-end CI
 ```
 ## Quickstart
 The reference deployment is a single Docker container fronted by
 your existing reverse proxy. See the [installation guide](docs/book/src/getting-started/install.md)
 for the full path; the very short version:
 ```sh
 export RM_VERSION=v0.9.0    # pin a real tag
 export RM_BASE_URL=https://restic.example.com
 export RM_TRUSTED_PROXY=10.0.0.0/8
 docker compose -f deploy/docker-compose.yml up -d
 ```
 The server prints a one-time bootstrap token to the log on first
 start. POST it to `/api/bootstrap` (or open `/bootstrap` in a
 browser) to create the admin user.
 ## Local development
-Requires Go 1.25+ (built and tested on 1.26). The floor is set by
+Requires Go 1.25+. The floor is set by
 `modernc.org/sqlite` v1.50.
 ```sh
 make build           # builds cmd/server and cmd/agent into ./bin
 make test            # runs go test ./...
 make lint            # runs golangci-lint
-make run-server      # runs the server (dev defaults)
+make smoke-restart   # systemd --user smoke server (see CLAUDE.md)
 make docs            # renders the mdBook site to docs/book/book/
 ```
 End-to-end test harness against a Docker Compose stack with a
 sibling Linux agent: see [`docs/e2e.md`](docs/e2e.md). Runs in CI
 on every PR.
 ## Documentation
 - **Concepts and operator guides**: [docs site](docs/book/src/intro.md),
  rendered with `make docs`.
 - **Reverse-proxy setup**: [docs/reverse-proxy.md](docs/reverse-proxy.md).
 - **Prometheus + Grafana**: [docs/prometheus.md](docs/prometheus.md).
 - **End-to-end test harness**: [docs/e2e.md](docs/e2e.md).
 - **Security policy**: [SECURITY.md](SECURITY.md).
 - **Contributing**: [CONTRIBUTING.md](CONTRIBUTING.md).
 ## License
-PolyForm Noncommercial 1.0.0 — see [`LICENSE`](./LICENSE). Free for personal,
+[PolyForm Noncommercial 1.0.0](./LICENSE). Free for personal,
-hobby, research, educational, governmental, and other noncommercial use.
+hobby, research, educational, governmental, and other noncommercial
-Commercial use requires a separate license.
+use. Commercial use requires a separate license.
@@ -0,0 +1,137 @@
 # Security policy
 restic-manager handles credentials that grant access to backup
 repositories — losing them means an attacker can read or destroy a
 fleet's backups. We take security reports seriously even at this
 project's small scale.
 ## Supported versions
 Pre-1.0, only `main` HEAD and the latest tagged release are supported.
 Backporting fixes to older tags is not currently offered.
 | Version            | Supported      |
 |--------------------|----------------|
 | `main` HEAD        | Yes            |
 | Latest released tag| Yes            |
 | Anything older     | No             |
 ## Reporting a vulnerability
 **Please don't open a public issue for security problems.**
 Instead, use one of these private channels:
 1. **Gitea private message** to the repository owner. The
   instance is at <https://gitea.dcglab.co.uk> and the owner's
   profile (`steve`) has direct-message contact set up.
 2. **Email** to the address on the maintainer's Gitea profile.
   Use a subject like `[SECURITY] restic-manager: <one-line summary>`
   so it doesn't get lost. PGP optional — if you want to encrypt,
   ask for a key first.
 If you don't get an acknowledgement within **3 working days**,
 please escalate through the other channel — solo maintainers do
 miss things, and the goal here is to fix the problem, not to
 preserve protocol.
 ### What to include
 - A description of the issue and the impact (what does an attacker
  gain? confidentiality, integrity, availability?).
 - Affected component (server, agent, install script, docs).
 - Affected version (`restic-manager-server --version`).
 - Reproduction steps if you have them. A working PoC is welcome
  but not required — a credible threat model is enough.
 - Whether you intend to publish a writeup, and any timing
  preferences.
 ### What we'll do
 1. Acknowledge receipt within 3 working days.
 2. Confirm or refute the issue, and agree a rough severity (CVSS
   or just "this is bad / this isn't"). Asking clarifying
   questions is normal at this stage — please don't read it as
   foot-dragging.
 3. Develop a fix on a private branch, test it, and prepare a
   release.
 4. Coordinate disclosure timing with you. The default is **30
   days from confirmed report to public disclosure**, with a
   patched release published before the disclosure date. Faster
   if a workable PoC is already circulating; slower only by
   mutual agreement.
 5. Credit the reporter in the release notes (or omit the credit
   if you'd rather stay anonymous — your choice).
 ## Scope
 In scope:
 - The server binary (`cmd/server`) and any HTTP, WebSocket, or CLI
  surface it exposes.
 - The agent binary (`cmd/agent`) and the way it consumes commands
  from the server.
 - The install scripts (`deploy/install/install.sh`, `install.ps1`)
  and the systemd unit shipped with them.
 - The docker-compose reference deployment and the docker image we
  publish.
 - Any cryptographic primitive choice or implementation detail
  (AEAD, token hashing, session handling, OIDC handshake).
 - Documentation that, if followed, leads operators into an
  insecure configuration.
 Out of scope (not because they aren't real problems, just not ones
 this report channel can act on):
 - Vulnerabilities in restic itself — report those upstream at
  <https://github.com/restic/restic>.
 - Vulnerabilities in third-party dependencies that haven't yet been
  patched upstream — report upstream first.
 - Issues that require pre-authenticated admin access on the control
  plane (admins can already do everything; that's not a privilege
  escalation, that's the design).
 - DoS via resource exhaustion on a deployment without the
  recommended reverse proxy / rate limiting in front (see
  `docs/reverse-proxy.md`).
 - Social-engineering scenarios that don't have a technical hook
  into the project's own surfaces.
 ## Threat model summary
 For context (longer version in [`spec.md`](./spec.md) §11):
 - The server is **HTTP-only**; TLS termination, ACME, HSTS, and
  edge rate-limiting are the reverse proxy's job.
 - Credentials are encrypted at rest with an AEAD key loaded from
  `RM_SECRET_KEY_FILE`. The same key encrypts agent secrets that
  travel to the agent over the WS channel.
 - Agents authenticate with bearer tokens issued at enrolment and
  hashed at rest. Compromise of the server DB does **not** leak
  bearer tokens in plaintext, but does leak the hashes (which is
  enough to log in *as* the agent until the operator revokes —
  see [NS-01 / NS-02](./tasks.md) for the revoke + regenerate
  flows).
 - The control plane intentionally **never touches backup bytes** —
  the agent runs `restic` directly against the repo. A
  compromised control plane can dispatch new jobs but cannot
  exfiltrate snapshot contents in-band.
 - Append-only credentials are first-class. Forget/prune jobs use a
  separate, admin-marked credential that the server only pushes
  for the duration of a maintenance dispatch.
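To make the "encrypted at rest with an AEAD key" point concrete, here is a minimal sketch of sealing and opening a credential. It assumes AES-256-GCM with a random nonce prepended to the ciphertext; the policy above only says "AEAD", so the cipher choice, the layout, and the `seal`/`open` names are illustrative rather than a description of `internal/crypto`.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD from a 16, 24, or 32 byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext and returns nonce || ciphertext || tag.
func seal(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open reverses seal. It fails if the key is wrong or the sealed value
// has been modified in any way.
func open(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	n := gcm.NonceSize()
	if len(sealed) < n {
		return nil, errors.New("sealed value too short")
	}
	return gcm.Open(nil, sealed[:n], sealed[n:], nil)
}

func main() {
	key := make([]byte, 32) // in practice, loaded from the key file
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	sealed, err := seal(key, []byte("example-repo-password"))
	if err != nil {
		panic(err)
	}
	plain, err := open(key, sealed)
	fmt.Println(string(plain), err)
}
```

The failure on a wrong key is the practical reason the key file needs its own backup: without the original key, `open` cannot recover anything.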
 ## Hardening checklist for operators
 - Run behind a TLS-terminating reverse proxy (Caddy/nginx/Traefik).
 - Set `RM_TRUSTED_PROXY` to the proxy's CIDR so request IPs aren't
  spoofable.
 - Back up `RM_SECRET_KEY_FILE` separately from the database.
  Without it the encrypted creds are unrecoverable.
 - Use append-only credentials for the everyday backup path; only
  the optional admin credential should have write/forget/prune
  power.
 - Disable users (don't delete) when staff change roles — bearer
  tokens stay valid until rotated.
 - Watch the alert and audit-log views during enrolment of new
  hosts.
 Thanks for helping keep restic-manager users safe.
@@ -20,6 +20,7 @@ import (
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/fleetupdate"
 	rmhttp "gitea.dcglab.co.uk/steve/restic-manager/internal/server/http"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/maintenance"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/oidc"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ui"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ws"
@@ -89,6 +90,7 @@ func run() error {
 	hub := ws.NewHub()
 	jobHub := ws.NewJobHub()
 	metricsRegistry := metrics.NewRegistry()
 	notifHub := notification.NewHub(st, aead, cfg.BaseURL)
 	alertEngine := alert.NewEngine(st, notifHub)
@@ -122,6 +124,7 @@ func run() error {
 		UI:              renderer,
 		Version:         version,
 		OIDC:            oidcClient,
 		Metrics:         metricsRegistry,
 	}
 	// First-run bootstrap: if the users table is empty, mint a one-time
@@ -0,0 +1,325 @@
 {
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": { "type": "grafana", "uid": "-- Grafana --" },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "description": "restic-manager fleet overview. Imports against any Prometheus data source.",
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "id": 1,
      "title": "Fleet status",
      "type": "stat",
      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
      "gridPos": { "h": 6, "w": 6, "x": 0, "y": 0 },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "thresholds" },
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "red", "value": null },
              { "color": "green", "value": 1 }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
        "textMode": "auto"
      },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
          "expr": "rm_hosts_online",
          "legendFormat": "online",
          "refId": "A"
        },
        {
          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
          "expr": "rm_hosts_total",
          "legendFormat": "total",
          "refId": "B"
        }
      ]
    },
    {
      "id": 2,
      "title": "Open alerts",
      "type": "stat",
      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
      "gridPos": { "h": 6, "w": 6, "x": 6, "y": 0 },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "thresholds" },
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 1 },
              { "color": "red", "value": 5 }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "orientation": "horizontal",
        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
        "textMode": "auto"
      },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
          "expr": "sum by (severity) (rm_active_alerts)",
          "legendFormat": "{{severity}}",
          "refId": "A"
        }
      ]
    },
    {
      "id": 3,
      "title": "Backups failing (last reported run)",
      "type": "stat",
      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
      "gridPos": { "h": 6, "w": 6, "x": 12, "y": 0 },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "thresholds" },
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "red", "value": 1 }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
        "textMode": "auto"
      },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
          "expr": "count(rm_host_last_backup_success == 0)",
          "legendFormat": "failing",
          "refId": "A"
        }
      ]
    },
    {
      "id": 4,
      "title": "Hosts",
      "type": "table",
      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
      "gridPos": { "h": 10, "w": 24, "x": 0, "y": 6 },
      "fieldConfig": {
        "defaults": {
          "custom": { "align": "auto", "displayMode": "auto" }
        },
        "overrides": [
          {
            "matcher": { "id": "byName", "options": "Value #B" },
            "properties": [
              { "id": "displayName", "value": "Last backup (s ago)" },
              { "id": "unit", "value": "s" }
            ]
          },
          {
            "matcher": { "id": "byName", "options": "Value #C" },
            "properties": [
              { "id": "displayName", "value": "Repo size" },
              { "id": "unit", "value": "bytes" }
            ]
          },
          {
            "matcher": { "id": "byName", "options": "Value #D" },
            "properties": [
              { "id": "displayName", "value": "Snapshots" }
            ]
          },
          {
            "matcher": { "id": "byName", "options": "Value #A" },
            "properties": [
              { "id": "displayName", "value": "Online" }
            ]
          },
          {
            "matcher": { "id": "byName", "options": "Value #E" },
            "properties": [
              { "id": "displayName", "value": "Open alerts" }
            ]
          }
        ]
      },
      "options": { "showHeader": true },
      "transformations": [
        {
          "id": "merge",
          "options": {}
        }
      ],
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
          "expr": "rm_host_agent_online",
          "format": "table",
          "instant": true,
          "refId": "A"
        },
        {
          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
          "expr": "time() - rm_host_last_backup_timestamp_seconds",
          "format": "table",
          "instant": true,
          "refId": "B"
        },
        {
          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
          "expr": "rm_host_repo_size_bytes",
          "format": "table",
          "instant": true,
          "refId": "C"
        },
        {
          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
          "expr": "rm_host_snapshot_count",
          "format": "table",
          "instant": true,
          "refId": "D"
        },
        {
          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
          "expr": "rm_host_open_alerts",
          "format": "table",
          "instant": true,
          "refId": "E"
        }
      ]
    },
    {
      "id": 5,
      "title": "Repo size over time",
      "type": "timeseries",
      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "custom": {
            "axisLabel": "",
            "drawStyle": "line",
            "fillOpacity": 10,
            "lineWidth": 1,
            "pointSize": 5,
            "showPoints": "never"
          },
          "unit": "bytes"
        },
        "overrides": []
      },
      "options": {
        "legend": { "calcs": ["last"], "displayMode": "list", "placement": "bottom", "showLegend": true },
        "tooltip": { "mode": "multi", "sort": "desc" }
      },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
          "expr": "rm_host_repo_size_bytes",
          "legendFormat": "{{host}}",
          "refId": "A"
        }
      ]
    },
    {
      "id": 6,
      "title": "Job duration p95 (last 1h, by kind)",
      "type": "timeseries",
      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "custom": {
            "drawStyle": "line",
            "fillOpacity": 5,
            "lineWidth": 1,
            "pointSize": 4,
            "showPoints": "never"
          },
          "unit": "s"
        },
        "overrides": []
      },
      "options": {
        "legend": { "calcs": ["last"], "displayMode": "list", "placement": "bottom", "showLegend": true },
        "tooltip": { "mode": "multi", "sort": "desc" }
      },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
          "expr": "histogram_quantile(0.95, sum by (kind, le) (rate(rm_job_duration_seconds_bucket[1h])))",
          "legendFormat": "{{kind}}",
          "refId": "A"
        }
      ]
    }
  ],
  "refresh": "30s",
  "schemaVersion": 39,
  "style": "dark",
  "tags": ["restic-manager", "backups"],
  "templating": {
    "list": [
      {
        "current": {},
        "hide": 0,
        "includeAll": false,
        "label": "Prometheus",
        "multi": false,
        "name": "DS_PROMETHEUS",
        "options": [],
        "query": "prometheus",
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "type": "datasource"
      }
    ]
  },
  "time": { "from": "now-6h", "to": "now" },
  "timepicker": {},
  "timezone": "",
  "title": "restic-manager — fleet",
  "uid": "rm-fleet-overview",
  "version": 1,
  "weekStart": ""
 }
@@ -0,0 +1,19 @@
 [book]
 title = "restic-manager"
 description = "Self-hosted control plane for restic backups across a fleet of Linux and Windows endpoints."
 authors = ["Steve Cliff"]
 language = "en-GB"
 multilingual = false
 src = "src"
 [output.html]
 default-theme = "ayu"
 preferred-dark-theme = "ayu"
 git-repository-url = "https://gitea.dcglab.co.uk/steve/restic-manager"
 git-repository-icon = "fa-code-fork"
 edit-url-template = "https://gitea.dcglab.co.uk/steve/restic-manager/_edit/main/docs/book/{path}"
 no-section-label = false
 [output.html.fold]
 enable = true
 level = 2
@@ -0,0 +1,40 @@
 # Summary
 [Introduction](./intro.md)
 # Getting started
 - [Installing the server](./getting-started/install.md)
 - [Enrolling your first host](./getting-started/enrolling-hosts.md)
 - [Running behind a reverse proxy](./getting-started/reverse-proxy.md)
 # Concepts
 - [Architecture](./concepts/architecture.md)
 - [Credentials and how they flow](./concepts/credentials.md)
 - [Schedules and source groups](./concepts/schedules-and-source-groups.md)
 - [Repo maintenance](./concepts/repo-maintenance.md)
 # Operations
 - [Backups and restores](./operations/backups-and-restores.md)
 - [Alerts and notifications](./operations/alerts.md)
 - [Observability with Prometheus](./operations/observability.md)
 - [Updating agents](./operations/updates.md)
 # Security
 - [Threat model](./security/threat-model.md)
 - [Hardening checklist](./security/hardening.md)
 - [Reporting vulnerabilities](./security/disclosure.md)
 # Reference
 - [Environment variables](./reference/env-vars.md)
 - [HTTP endpoints](./reference/http-endpoints.md)
 ---
 [Contributing](./contributing.md)
 [Roadmap](./roadmap.md)
 [License](./license.md)
@@ -0,0 +1,121 @@
 # Architecture
 ## Components
 ```
 ┌────────────────────────────────────────────────────────────┐
 │  Server (control plane, single process)                    │
 │   * chi-based HTTP API + HTMX server-rendered UI           │
 │   * WebSocket hub for agent fan-out + browser fan-out      │
 │   * SQLite store (modernc.org/sqlite, pure Go)             │
 │   * AEAD encryption helpers                                │
 │   * Alert engine + notification hub                        │
 └────────────┬───────────────────────────────────┬───────────┘
             │ outbound WS only                   │ HTTP(S)
             │                                    │
 ┌────────────▼─────────────┐         ┌────────────▼─────────────┐
 │  Agent (per host)        │         │  Browser (operator)      │
 │   * coder/websocket      │         │   * htmx + a tiny bit    │
 │   * cron for schedules   │         │     of vanilla JS for    │
 │   * restic wrapper       │         │     live job updates     │
 │   * sysinfo collector    │         └──────────────────────────┘
 └────────────┬─────────────┘
             │ subprocess: restic ...
             │
 ┌────────────▼─────────────────────────────────────────────────┐
 │  restic repository (rest-server, S3, B2, SFTP, local …)      │
 │  Backup data flows directly here. Server never touches it.   │
 └──────────────────────────────────────────────────────────────┘
 ```
 ## Why outbound-only WebSockets?
 The agent dials the server on `/ws/agent` with a bearer token. The
 server doesn't initiate connections to the agent. Three reasons:
 1. **Firewall friendliness.** Nothing on the endpoint needs an
   inbound port; this works behind the typical "branch office NAT"
   without router config.
 2. **Single auth point.** The bearer token is the only credential
   that crosses the boundary; the agent never accepts an
   incoming socket.
 3. **Reconnect semantics are simpler.** When the connection drops
   (NAT timeout, server restart, transient network glitch) the
   agent backs off and re-dials; the server marks the host
   offline after 90s and lets the alert engine raise a stale-host
   alert.
 ## Why SQLite?
 High availability is an explicit non-goal, and SQLite fits that
 choice: a small control plane managing twelve endpoints does not
 need replication or a separate database tier. SQLite gives us:
 - A single file to back up (plus the secret key).
 - Hand-rolled migrations under `internal/store/migrations/` —
  no migration framework lock-in.
 - `WAL` mode plus per-connection foreign-key enforcement.
 The migrations define the entire schema; there's no ORM or
 query-builder layer between Go code and SQL.
 ## Why the agent runs `restic` itself, not via the server
 The control plane never holds backup bytes in flight. That's
 deliberate:
 - A compromised control plane cannot exfiltrate snapshot
  contents in-band — at worst it can dispatch new backup or
  forget jobs (audit-logged) but the data path is between the
  agent and the repository.
 - The same agent process can target whichever transport restic
  natively supports (rest-server, S3, B2, SFTP, local), with no
  separate mux on the server side.
 ## Job lifecycle
 ```
            ┌──────────────────────┐
 operator →  │ POST /hosts/{id}/    │
            │       run-backup     │
            └──────────┬───────────┘
                       │   1. INSERT INTO jobs (status='queued')
                       │   2. dispatch command.run over WS
                       ▼
            ┌──────────────────────┐
            │ Agent dispatches     │
            │ restic subprocess    │
            └──────────┬───────────┘
                       │
                       │   3. job.started   ───▶ store.MarkJobStarted
                       │   4. job.progress  ───▶ JobHub broadcast (live UI)
                       │   5. log.stream    ───▶ append to job_logs
                       │   6. job.finished  ───▶ store.MarkJobFinished
                       │                          + alert engine eval
                       │                          + (P6) metrics histogram
                       ▼
                  terminal: succeeded | failed | cancelled
 ```
 Operators see live updates because the browser subscribes to
 `/api/jobs/{id}/stream`, and the WS handler broadcasts each
 agent-emitted envelope to all live subscribers in addition to
 persisting it.
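The fan-out to live subscribers might look something like the following minimal hub. This is a sketch under assumptions, not the project's `ws.JobHub`: the type and method names, the buffer size, and the drop-on-full policy are all invented for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// hub fans one job's envelopes out to every live subscriber.
type hub struct {
	mu   sync.Mutex
	subs map[string]map[chan string]struct{} // job ID -> subscriber channels
}

func newHub() *hub {
	return &hub{subs: make(map[string]map[chan string]struct{})}
}

// subscribe registers a buffered channel for one job and returns it.
func (h *hub) subscribe(jobID string) chan string {
	ch := make(chan string, 16)
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.subs[jobID] == nil {
		h.subs[jobID] = make(map[chan string]struct{})
	}
	h.subs[jobID][ch] = struct{}{}
	return ch
}

// broadcast sends msg to each subscriber of jobID and returns how many
// received it. A subscriber with a full buffer misses the message rather
// than blocking the agent-facing handler.
func (h *hub) broadcast(jobID, msg string) int {
	h.mu.Lock()
	defer h.mu.Unlock()
	delivered := 0
	for ch := range h.subs[jobID] {
		select {
		case ch <- msg:
			delivered++
		default:
		}
	}
	return delivered
}

func main() {
	h := newHub()
	a, b := h.subscribe("job-1"), h.subscribe("job-1")
	h.broadcast("job-1", "job.progress 42%")
	fmt.Println(<-a, <-b)
}
```

Dropping on a full buffer is one possible trade-off: the persisted job log stays the source of truth, so a slow browser only loses live updates.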
 ## What scheduling looks like
 - The agent runs a local `robfig/cron/v3` instance.
 - The server pushes the desired schedule set to the agent on
  hello + after every CRUD change.
 - When the agent's cron fires, it sends `schedule.fire` to the
  server. The server creates a job row, sends `command.run` back,
  and the agent dispatches a normal backup.
 - If the WS drops between fire and run, the server queues the
  schedule firing into `pending_runs` and drains on agent
  reconnect — no missed scheduled backups due to network blips.
 For everything that isn't a backup (forget, prune, check), the
 server runs a 60-second maintenance ticker against
 `host_repo_maintenance` rows and dispatches the relevant command
 when a cadence is due. The agent's local cron only handles
 backups.
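The queue-then-drain behaviour for schedule firings can be sketched with an in-memory map. The real `pending_runs` storage is not shown in this document, so the map, the `fire` / `drain` names, and the `dispatch` callback are assumptions for illustration only.

```go
package main

import "fmt"

// pendingRuns holds schedule firings that could not be dispatched yet.
type pendingRuns struct {
	queue map[string][]string // host ID -> schedule IDs awaiting dispatch
}

// fire dispatches immediately when the host is online, otherwise queues.
func (p *pendingRuns) fire(hostID, scheduleID string, online bool, dispatch func(host, sched string)) {
	if online {
		dispatch(hostID, scheduleID)
		return
	}
	p.queue[hostID] = append(p.queue[hostID], scheduleID)
}

// drain dispatches everything queued for a host, in order, and empties its
// queue. It returns the number of runs dispatched.
func (p *pendingRuns) drain(hostID string, dispatch func(host, sched string)) int {
	runs := p.queue[hostID]
	delete(p.queue, hostID)
	for _, s := range runs {
		dispatch(hostID, s)
	}
	return len(runs)
}

func main() {
	p := &pendingRuns{queue: map[string][]string{}}
	dispatch := func(host, sched string) { fmt.Println("command.run ->", host, sched) }
	p.fire("web-1", "nightly", false, dispatch) // WS is down: queued
	fmt.Println("drained:", p.drain("web-1", dispatch))
}
```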
@@ -0,0 +1,98 @@
 # Credentials and how they flow
 restic-manager handles three credential surfaces:
 1. **Operator credentials** — the username + password (or OIDC
   identity) that logs into the UI.
 2. **Agent bearer tokens** — issued at enrolment, used by the
   agent to authenticate its WebSocket to the server.
 3. **Repo credentials** — the rest-server / S3 / B2 / SFTP
   credentials the agent passes to `restic` itself.
 Each has a different threat model and storage strategy.
 ## Operator credentials
 - Local users are stored in `users` with a bcrypt password hash.
 - Sessions are random tokens minted at login, stored hashed in
  the `sessions` table, expired after 24h. Cookie is HttpOnly,
  SameSite=Lax, and Secure (when `RM_COOKIE_SECURE=true`,
  default).
 - OIDC users carry `auth_source='oidc'` and an `oidc_subject`
  pinning their IdP identity. Local password login is rejected
  for OIDC users.
 - Disabling a user soft-deletes them via `disabled_at` —
  pre-existing sessions are invalidated on the next request.
 ## Agent bearer tokens
 - Minted at enrolment, hashed at rest with `auth.HashToken`.
 - The plaintext token only exists in memory at enrolment time
  and on the agent's filesystem (`/etc/restic-manager/agent.yaml`,
  mode `0600`, owned by the service user).
 - Compromise of the server DB leaks the hashes, which is enough
  to *log in as that agent* until you revoke. Compromise of the
  agent host leaks the plaintext (via the config file) — same
  end result.
 - Rotation: re-enrol the host. Today there's no in-place rotate;
  the operator deletes the host (which cascades, including
  revoking the bearer hash) and re-runs the install command.
 ## Repo credentials
 This is the credential that ultimately matters for backup
 integrity. restic-manager keeps two slots per host:
 - **The everyday credential** (`host_credentials.kind = ''`).
  Append-only-friendly: this is the one your backup schedule
  uses. It can write but not delete or forget.
 - **The admin credential** (`host_credentials.kind = 'admin'`).
  Has full delete rights. Only pushed to the agent transiently
  while a `prune` or `forget` job is dispatching, and discarded
  by the agent after the job ends.
 ### Encryption flow
 1. Operator types the credential into the UI or the install form.
 2. Server AEAD-encrypts the cred (`crypto.AEAD.Encrypt`) using the
   key in `RM_SECRET_KEY_FILE`. The plaintext is dropped from
   memory.
 3. Encrypted blob is stored in `host_credentials.cred_blob`.
 4. When the agent connects, the server decrypts the blob and
   sends the **plaintext** down the WebSocket inside a
   `config.update` envelope.
 5. The agent stores the plaintext in its in-memory secrets store
   for the lifetime of the process; it's reloaded fresh on every
   server-side push.
 6. When a job runs, the agent merges the credential into the
   restic environment (`restic.Env.RepoURL` stays bare; the
   `user:pass@…` form is built only inside `envSlice()` at the
   moment of `exec.Command`).
 The merged form is **never logged**. Any URL that reaches the
 structured `slog` output is passed through `restic.RedactURL()`
 first.
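Step 6 and the no-logging rule could look roughly like this with `net/url` from the standard library. The helpers `withCredentials` and `redact` are hypothetical stand-ins for the project's `envSlice()` and `restic.RedactURL()`, whose real implementations are not shown here; the example also uses a plain `https://` URL and ignores restic's `rest:` prefix.

```go
package main

import (
	"fmt"
	"net/url"
)

// withCredentials embeds user:pass into a bare repo URL. The intent is to
// call it only while building the subprocess environment.
func withCredentials(repoURL, user, pass string) (string, error) {
	u, err := url.Parse(repoURL)
	if err != nil {
		return "", err
	}
	u.User = url.UserPassword(user, pass)
	return u.String(), nil
}

// redact masks any password in rawURL so the result is safe to log.
func redact(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "<unparseable url>"
	}
	return u.Redacted()
}

func main() {
	merged, err := withCredentials("https://backup.example.com:8000/myhost", "agent", "s3cret")
	if err != nil {
		panic(err)
	}
	fmt.Println(redact(merged)) // the only form that should reach a log line
}
```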
 ### Why push plaintext over the wire?
 The transport itself is the trust boundary: the WebSocket runs
 inside the same TLS-terminated reverse-proxy connection your
 browser uses, and the agent has already authenticated with its
 bearer token. Re-encrypting the payload on top of that would just
 move the key-management problem somewhere else.
 If your reverse proxy isn't terminating TLS, the deployment is
 already broken; see [Hardening](../security/hardening.md).
 ## Setup tokens (admin-driven)
 When an admin creates a new user, the server mints a one-time
 setup link valid for 1 hour. The hash is stored; the raw token
 is shown to the admin once. The user opens the link, sets a
 password, and is dropped into a session. Expired tokens are
 swept on the alert engine's 60s tick.
 Same pattern for enrolment tokens: the raw token only exists in
 memory at mint time, and the install snippet is the operator's
 only chance to capture it. If you lose it, regenerate via the
 **Add host** page (NS-02).
@@ -0,0 +1,85 @@
 # Repo maintenance
 Backups go in; without maintenance, repos grow forever and
 eventually fall over. restic-manager runs three maintenance
 operations on a per-host cadence:
 | Command  | What it does                                                | Default cadence |
 |----------|-------------------------------------------------------------|-----------------|
 | `forget` | Marks snapshots eligible for removal per the retention policy attached to each source group. Cheap, but it removes snapshot records, so it needs the **admin** credential. | Daily after the last backup of the day |
 | `prune`  | Reclaims space from the repo. Requires the **admin** credential (write+delete). | Weekly, off-peak |
 | `check`  | Verifies repo integrity and reports stale locks. | Weekly, with `--read-data-subset N%` to sample pack files |
 A per-host row in `host_repo_maintenance` holds the cron
 expressions and last-fire anchors. The maintenance ticker on
 the server runs every 60s, finds hosts whose next-fire is due,
 and dispatches the right command. The agent's local cron is
 **only** for backups.
 ## Why server-side and not agent-side?
 The agent's cron knows about backups because backups are
 per-source-group. Maintenance is per-repo, not per-source-group,
 so doing it server-side keeps the per-host wiring simple:
 - One ticker, not N agent crons to keep in sync.
 - Cancelling a maintenance dispatch is just "don't dispatch the
  next one" — no agent-side state to clean up.
 - Skipping offline hosts is trivial (no queue; only scheduled
  *backups* queue into `pending_runs`).
 ## Forget and the multi-group payload
 A single `forget` job can target several source groups at once.
 The wire envelope (`ForgetGroups`) carries one entry per group,
 each with its retention policy. The agent runs N
 `restic forget --tag <name> --keep-...` invocations in sequence,
 streams their output, and reports a single terminal status.
 ## Prune and the admin credential
 Prune mutates the repo. The everyday append-only credential
 **cannot** prune — that's the whole point of append-only.
 restic-manager keeps a second slot per host (`kind = 'admin'`)
 for the credential that can.
 When a prune is dispatched (cadence-driven or operator-driven):
 1. Server pushes the admin credential to the agent in a fresh
   `config.update`.
 2. Agent runs `restic prune` with the merged credential.
 3. Job finishes; agent discards the admin credential from its
   in-memory secrets store.
 The server never logs the merged URL (see
 [Credentials](./credentials.md)).
 ## Check and lock state
 `restic check` warns about stale locks when it finds them. The
 agent ships every check's output back as a `repo.stats` envelope
 and a stream of log lines; if a stale lock is detected, the
 **Repo** page surfaces a banner with an **Unlock** button. The
 operator-only `unlock` command runs `restic unlock` and clears
 the banner.
 `unlock` has no cadence — it's a manual action, never automatic.
 Auto-unlocking would mask the cause (probably a previously
 crashed long-running operation) and risk corrupting an
 operation the operator has merely lost track of.
 ## Repo stats
 After every backup, check, prune, and unlock, the agent runs
 `restic stats --json --mode raw-data` and ships the result as a
 `repo.stats` envelope. The server stores this in
 `host_repo_stats` (latest only) and `host_repo_stats_history`
 (one row per host per day, last-write-wins per column — a
 prune-only patch never nulls a backup-time size).
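The "last-write-wins per column" rule can be illustrated with nullable fields: a patch overwrites only the columns it actually carries. The field names below are invented for the example and are not the real `host_repo_stats_history` columns.

```go
package main

import "fmt"

// repoStats uses pointers so that "not reported by this job" (nil) is
// distinguishable from a reported zero.
type repoStats struct {
	TotalSizeBytes *int64
	SnapshotCount  *int64
}

// mergeStats applies patch on top of stored, column by column. Columns the
// patch leaves nil keep their stored value.
func mergeStats(stored, patch repoStats) repoStats {
	if patch.TotalSizeBytes != nil {
		stored.TotalSizeBytes = patch.TotalSizeBytes
	}
	if patch.SnapshotCount != nil {
		stored.SnapshotCount = patch.SnapshotCount
	}
	return stored
}

func ptr(v int64) *int64 { return &v }

func main() {
	day := repoStats{TotalSizeBytes: ptr(1 << 30), SnapshotCount: ptr(12)}
	// A later job that only reports the snapshot count leaves the size alone.
	day = mergeStats(day, repoStats{SnapshotCount: ptr(9)})
	fmt.Println(*day.TotalSizeBytes, *day.SnapshotCount)
}
```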
 The host detail page surfaces:
 - Total size + raw size in the vitals strip.
 - Last-check timestamp + colour-coded status.
 - Last-prune timestamp.
 - 30/90-day repo size trend chart.
@@ -0,0 +1,105 @@
 # Schedules and source groups
 Two related but separable ideas:
 - A **source group** is a named bundle of "what to back up":
  include paths, exclude patterns, retention policy, retry
  configuration, optional pre/post hooks. The group's name is
  used as the restic snapshot tag, so retention can target it
  with `restic forget --tag <name>`.
 - A **schedule** is a cron expression that, when it fires,
  triggers a backup of one or more source groups on a host.
 Decoupling them means you can have one schedule covering several
 groups (e.g. `0 1 * * *` running both `system` and `data`), and
 each group has its own retention without duplicating policy
 across schedules.
 ## Source group anatomy
 ```yaml
 name: data
 includes:
  - /var/lib/postgresql
  - /home
 excludes:
  - /home/*/.cache
  - /home/*/Downloads
 retention:
  keep_last: 7
  keep_daily: 14
  keep_weekly: 4
  keep_monthly: 6
 retry_max: 3
 retry_backoff_seconds: 600
 pre_hook: |
  pg_dump -U postgres -F c -f /var/lib/postgresql/dumps/all.dump
 post_hook: |
  rm -f /var/lib/postgresql/dumps/all.dump
 ```
 ### Conflict detection
 If your retention policy says `keep_hourly: 24` but no schedule
 points at this group sub-daily, the UI surfaces a
 **conflict-dimension banner** ("`hourly` won't be honoured —
 no schedule fires more often than once a day"). The flag is
 stored on the source group (`conflict_dimension`) and refreshed
 whenever a schedule or group changes.
 ### Hooks
 `pre_hook` and `post_hook` run on the agent host inside
 `/bin/sh -c` (`cmd.exe /C` on Windows). Output is streamed back
 to the live job log as `hook(<phase>): …` lines.
 - A non-zero `pre_hook` exit aborts the backup.
 - `post_hook` always runs, with `RM_JOB_STATUS=succeeded|failed`
  in the environment. Use this for cleanup that must happen
  whether the backup worked or not.
 - Hooks only run for `kind=backup` jobs. They do not run for
  `forget`, `prune`, `check`, etc.
 - Hook scripts are AEAD-encrypted at rest (encryption is applied
  at the HTTP layer); the agent receives plaintext over the WS
  channel.
 A "host default" pair of hooks lives on the host itself; a
 source group's own hooks override them when set.
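The hook rules above come down to three small decisions: which shell to use, which script wins, and what goes into the environment. The helpers below are a sketch with hypothetical names; they only build the command line and environment and leave the actual `os/exec` call to the caller.

```go
package main

import (
	"fmt"
	"runtime"
)

// hookCommand picks the shell wrapper for the platform.
func hookCommand(goos, script string) (string, []string) {
	if goos == "windows" {
		return "cmd.exe", []string{"/C", script}
	}
	return "/bin/sh", []string{"-c", script}
}

// effectiveHook lets a source group's own hook override the host default.
func effectiveHook(hostDefault, group string) string {
	if group != "" {
		return group
	}
	return hostDefault
}

// hookEnv returns a copy of base with RM_JOB_STATUS appended, as a
// post_hook would see it.
func hookEnv(base []string, jobStatus string) []string {
	env := make([]string, 0, len(base)+1)
	env = append(env, base...)
	return append(env, "RM_JOB_STATUS="+jobStatus)
}

func main() {
	script := effectiveHook("echo host-default", "rm -f /tmp/all.dump")
	name, args := hookCommand(runtime.GOOS, script)
	fmt.Println(name, args, hookEnv(nil, "failed"))
}
```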
 ## Schedule anatomy
 ```yaml
 cron: "0 2 * * *"
 enabled: true
 source_group_ids:
  - <gid for "data">
  - <gid for "system">
 ```
 Slim by design: a schedule says **when** and **which groups**.
 Everything else (paths, retention, hooks) lives on the groups.
 The agent's local cron fires the schedule. If the WebSocket is
 down at fire time, the server queues the firing into
 `pending_runs` and drains it on the next agent reconnect — a
 short network blip won't lose the backup.
 ### Last / next run
 The schedules tab shows "next" (computed by parsing the cron
 expression with `robfig/cron/v3`) and "last" (the latest
 `actor_kind=schedule` job in the `jobs` table) for every
 schedule. The dashboard host row also surfaces `next 12h ago/from
 now` when a single covering schedule is the run-now candidate.
 ## Bandwidth limits
 Two places set restic's `--limit-upload` / `--limit-download`:
 1. **Host-wide caps** on the host row (`bandwidth_up_kbps`,
   `bandwidth_down_kbps`). Pushed to the agent on hello and
   after `PUT /api/hosts/{id}/bandwidth`. Apply to every restic
   invocation on the host.
 2. **Per-job overrides** on the per-source-group Run-now form.
   Win over host caps for the lifetime of that one job.
 If neither is set, restic runs unthrottled.
@@ -0,0 +1,17 @@
 # Contributing
 Full contributor guide:
 [`CONTRIBUTING.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/CONTRIBUTING.md)
 in the repository root.
 The short version:
 - Open an issue first for non-trivial changes; the design is
  still moving and unsolicited large PRs may conflict with
  in-flight work.
 - `make lint test` must pass.
 - One logical change per commit, no `Co-Authored-By` trailers.
 - UK English in identifiers and comments; comments explain the
  **why** not the **what**.
 Code of conduct: [`CODE_OF_CONDUCT.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/CODE_OF_CONDUCT.md).
@@ -0,0 +1,113 @@
 # Enrolling your first host
 The control plane only knows about hosts you've explicitly
 enrolled. Two paths exist:
 1. **Token-based enrolment** — admin generates a token, pastes it
   into an install command on the host. The host appears immediately,
   already mapped to the desired repo.
 2. **Announce-and-approve** — the agent runs without a token,
   "announces" itself to the server, and a human in the UI accepts
   the announcement.
 Token-based is the default and what most operators want; the
 announce flow exists for the case where you can't easily paste a
 secret onto the host (auto-imaged endpoints, scripted bring-ups
 from a config repo).
 ## Token-based enrolment
 ### From the UI
 1. Click **+ Add host** on the dashboard.
 2. Fill in the hostname, the restic repo URL, and the repo
   credentials. The credentials are AEAD-encrypted at the server
   immediately; what you paste is what the agent receives.
 3. Optionally pick the initial source paths — these become the
   first source group on the host.
 4. Submit. The server mints a one-time token and shows you a
   copy-pasteable install snippet.
 ### On the host (Linux)
 ```sh
 curl -fsSL https://restic.example.com/install/install.sh | \
    sudo RM_SERVER=https://restic.example.com \
         RM_ENROL_TOKEN=<token> \
         bash
 ```
 The script:
 1. Detects architecture (`amd64` or `arm64`).
 2. Downloads the agent binary from `/agent/binary?os=…&arch=…`.
 3. Drops the systemd unit at
   `/etc/systemd/system/restic-manager-agent.service`.
 4. Runs the agent in `-enrol` mode, which posts the token and
   stores the persistent bearer it gets back.
 5. Enables and starts the unit.
 Within seconds the host should appear on the dashboard as
 **online**.
 ### On the host (Windows)
 ```pwsh
 $env:RM_SERVER  = "https://restic.example.com"
 $env:RM_ENROL_TOKEN = "<token>"
 iwr -useb $env:RM_SERVER/install/install.ps1 | iex
 ```
 Equivalent shape: registers a Windows service via the SCM
 (see P2-16 for details), runs `-enrol`, starts the service.
 ## Recovering a lost token
 Tokens are single-use and short-lived (1h). If you closed the tab
 before pasting the install command, head to the **Add host** page —
 outstanding tokens are listed there with a **Regenerate** button.
 Regenerating revokes the old token's hash and mints a fresh raw
 token while preserving the original repo credentials and initial
 paths. (NS-02 in `tasks.md` if you want the design rationale.)
 ## Announce-and-approve
 If the host can reach the server but you don't want to paste a
 secret on it, run the agent in `-announce` mode:
 ```sh
 restic-manager-agent -announce \
                     -server https://restic.example.com \
                     -hostname myhost
 ```
 The host appears in the **Pending hosts** panel on the dashboard
 with its hostname, OS, arch, and the source IP that announced it.
 Click **Accept**, fill in the repo URL + credentials, and the
 server pushes the bearer over the still-open WebSocket. No
 back-and-forth round trip.
 If you don't accept within an hour the announcement is swept.
 ## What happens on the agent
 After enrolment, the agent:
 1. Connects via WebSocket to `/ws/agent` with its bearer token.
 2. Sends a `hello` envelope with its OS, arch, agent version,
   restic version, and protocol version.
 3. Receives a `config.update` carrying its encrypted repo
   credentials and any source-group paths.
 4. Sits idle, sending a heartbeat every 30s. Operator-driven
   "Run now" actions arrive as `command.run` envelopes; scheduled
   jobs are driven by the agent's local cron.
 ## Auto-init of the repository
 The first time a backup runs, the agent invokes `restic init`
 against the repo you configured at enrolment. If the repo already
 exists (`config file already exists`) the agent treats it as a
 success and proceeds. The host's repo status (`unknown` →
 `ready` / `init_failed`) is surfaced under the vitals strip on
 the host detail page; if init fails, save fresh credentials in
 the **Repo** tab to retry.
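 The "already exists counts as success" rule can be sketched in Go as
 follows. This is a minimal illustration, not the agent's actual code:
 `classifyInit`, its parameters, and the `repoStatus` type are
 hypothetical names, and the stderr strings in `main` are made up. Only
 the `config file already exists` message and the `ready` /
 `init_failed` statuses come from the text above.

 ```go
 package main

 import (
 	"fmt"
 	"strings"
 )

 // repoStatus mirrors the status names shown in the UI. The type is hypothetical.
 type repoStatus string

 const (
 	statusReady      repoStatus = "ready"
 	statusInitFailed repoStatus = "init_failed"
 )

 // classifyInit maps the outcome of `restic init` onto a repo status.
 // exitCode is the process exit code; stderr is its captured error output.
 func classifyInit(exitCode int, stderr string) repoStatus {
 	if exitCode == 0 {
 		return statusReady
 	}
 	// An already-initialised repo is treated as success.
 	if strings.Contains(stderr, "config file already exists") {
 		return statusReady
 	}
 	return statusInitFailed
 }

 func main() {
 	fmt.Println(classifyInit(0, ""))
 	fmt.Println(classifyInit(1, "Fatal: config file already exists"))
 	fmt.Println(classifyInit(1, "Fatal: wrong password"))
 }
 ```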
@@ -0,0 +1,92 @@
 # Installing the server
 The reference deployment is a single Docker container fronted by
 your existing reverse proxy. The image bundles the server binary,
 the cross-compiled agent binaries, and the install scripts.
 ## Prerequisites
 - A Linux host with Docker and Docker Compose.
 - A reverse proxy in front (Caddy, nginx, Traefik) terminating
  TLS on a public hostname. The server itself is HTTP-only by
  design — see [Reverse proxy](./reverse-proxy.md) for why.
 - A persistent volume for the server's data directory.
 ## Quick start
 The reference compose file lives at
 [`deploy/docker-compose.yml`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/deploy/docker-compose.yml):
 ```yaml
 services:
  restic-manager:
    image: gitea.dcglab.co.uk/steve/restic-manager:${RM_VERSION:-latest}
    restart: unless-stopped
    environment:
      RM_LISTEN: ":8080"
      RM_DATA_DIR: "/data"
      RM_BASE_URL: "https://restic.example.com"
      # Trust your reverse proxy's CIDR so X-Forwarded-* are honoured.
      RM_TRUSTED_PROXY: "10.0.0.0/8"
    volumes:
      - rm-data:/data
    ports:
      # Bind localhost only — your reverse proxy is the public face.
      - "127.0.0.1:8080:8080"
 volumes:
  rm-data:
 ```
 Bring it up:
 ```sh
 docker compose up -d
 docker compose logs -f restic-manager
 ```
 The first run prints a one-time **bootstrap token** to the log. Use
 it within an hour or it expires; if you miss the window, the
 container prints it again on the next start, as long as no admin
 user exists.
 ## First-run admin setup
 Open `https://restic.example.com/bootstrap` (or whatever your
 public URL is). Paste the bootstrap token, pick a username and a
 password (≥ 12 characters), and submit. You'll land in the
 dashboard logged in as the new admin.
 If you'd rather curl it, the equivalent is:
 ```sh
 curl -X POST https://restic.example.com/api/bootstrap \
     -H 'Content-Type: application/json' \
     -d '{"token":"<token-from-log>","username":"admin","password":"<≥12 chars>"}'
 ```
 ## Backing up the secret key
 Inside the data volume, `secret.key` holds the AEAD key used to
 encrypt every credential at rest. **Back it up separately from
 the database.** Without it, encrypted credentials in the database
 are unrecoverable; you'd have to re-enrol every host.
 A simple working approach: copy `secret.key` to your password
 manager or to a separately-backed-up secrets vault the day you
 install. It doesn't change.
 ## Updating the server
 ```sh
 # Pin a new version in your compose file (.env or docker-compose.yml),
 # then:
 docker compose pull
 docker compose up -d
 ```
 Migrations run automatically on startup; the server will refuse to
 start if a migration fails (better to bail than to half-migrate).
 For the agent self-update story, see
 [Updating agents](../operations/updates.md).
@@ -0,0 +1,95 @@
 # Running behind a reverse proxy
 The restic-manager server is HTTP-only by design. TLS termination,
 public hostname, ACME, HSTS, and edge-level rate limiting all
 belong to a reverse proxy you already operate outside this project.
 ## What the proxy must forward
 The server reads four headers when (and only when) the immediate
 peer matches `RM_TRUSTED_PROXY`:
 | Header                 | Value                                              | Why |
 |------------------------|----------------------------------------------------|-----|
 | `X-Forwarded-For`      | The original client IP                             | Rate-limit keys, audit log entries, OIDC redirect-URI checks. |
 | `X-Forwarded-Proto`    | `https`                                            | Used for absolute URLs (e.g. OIDC redirect URIs). |
 | `Host`                 | The public hostname clients use                    | Cookies are scoped to this; `RM_BASE_URL` must match. |
 | `Connection` / `Upgrade` | Pass through unchanged                           | `/ws/agent` and `/api/jobs/{id}/stream` are WebSockets; without `Upgrade: websocket` they fail. |
 Set `RM_TRUSTED_PROXY` to the CIDR (or comma-separated list of
 CIDRs) the proxy connects from. Anything outside that range has
 its `X-Forwarded-*` headers ignored, so a stray request that
 bypasses the proxy can't spoof the client IP.
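 The trust rule can be sketched in Go with `net/netip`. This is a
 hedged illustration rather than the server's real code: `parseTrusted`
 and `clientIP` are hypothetical names, and taking the right-most
 `X-Forwarded-For` entry is an assumption that fits a single proxy hop.

 ```go
 package main

 import (
 	"fmt"
 	"net/netip"
 	"strings"
 )

 // parseTrusted parses a comma-separated CIDR list such as the value
 // of RM_TRUSTED_PROXY.
 func parseTrusted(s string) ([]netip.Prefix, error) {
 	var out []netip.Prefix
 	for _, part := range strings.Split(s, ",") {
 		part = strings.TrimSpace(part)
 		if part == "" {
 			continue
 		}
 		p, err := netip.ParsePrefix(part)
 		if err != nil {
 			return nil, err
 		}
 		out = append(out, p)
 	}
 	return out, nil
 }

 // clientIP returns the address a request should be attributed to.
 // peer is the immediate TCP peer ("ip:port"); xff is the raw
 // X-Forwarded-For header value.
 func clientIP(peer, xff string, trusted []netip.Prefix) string {
 	ap, err := netip.ParseAddrPort(peer)
 	if err != nil {
 		return ""
 	}
 	addr := ap.Addr().Unmap()
 	for _, p := range trusted {
 		if p.Contains(addr) {
 			// Trusted proxy: use the right-most entry, the one the
 			// proxy itself appended (assumption: one proxy hop).
 			parts := strings.Split(xff, ",")
 			last := strings.TrimSpace(parts[len(parts)-1])
 			if fwd, err := netip.ParseAddr(last); err == nil {
 				return fwd.String()
 			}
 			break
 		}
 	}
 	// Untrusted peer, or no usable header: ignore X-Forwarded-For.
 	return addr.String()
 }

 func main() {
 	trusted, err := parseTrusted("10.0.0.0/8")
 	if err != nil {
 		panic(err)
 	}
 	fmt.Println(clientIP("10.0.0.5:41234", "203.0.113.7", trusted))
 	fmt.Println(clientIP("198.51.100.9:5555", "203.0.113.7", trusted))
 }
 ```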
 ## Caddy
 ```caddyfile
 restic.example.com {
    encode zstd gzip
    reverse_proxy 127.0.0.1:8080 {
        header_up X-Real-IP {remote_host}
    }
 }
 ```
 Caddy adds `X-Forwarded-For` / `X-Forwarded-Proto` automatically
 and passes WebSocket headers through by default, so this is the
 whole config.
 ## nginx
 ```nginx
 server {
    listen 443 ssl http2;
    server_name restic.example.com;
    ssl_certificate     /etc/letsencrypt/live/restic.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/restic.example.com/privkey.pem;
    location / {
        proxy_pass         http://127.0.0.1:8080;
        proxy_http_version 1.1;
        proxy_set_header   Host              $host;
        proxy_set_header   X-Forwarded-For   $proxy_add_x_forwarded_for;
        proxy_set_header   X-Forwarded-Proto https;
        # WebSocket upgrade
        proxy_set_header   Upgrade           $http_upgrade;
        proxy_set_header   Connection        "upgrade";
         # Long-lived agent WS: raise the read timeout to 24h for this surface.
        proxy_read_timeout 86400s;
    }
 }
 ```
 ## Traefik
 ```yaml
 http:
  routers:
    restic-manager:
      rule: "Host(`restic.example.com`)"
      entryPoints: [websecure]
      tls:
        certResolver: letsencrypt
      service: restic-manager
  services:
    restic-manager:
      loadBalancer:
        servers:
          - url: "http://restic-manager:8080"
        passHostHeader: true
 ```
 Traefik forwards WebSocket upgrades and the standard
 `X-Forwarded-*` set out of the box.
 ## Verification
 After bringing the proxy up, the audit log should show your real
 client IP for an interactive login (not the proxy's local
 address). If you see `127.0.0.1` or the proxy's container IP, your
 `RM_TRUSTED_PROXY` is wrong or `X-Forwarded-For` isn't being
 forwarded.
@@ -0,0 +1,86 @@
 # restic-manager
 restic-manager is a self-hosted, browser-based, single-pane-of-glass
 for managing [restic](https://restic.net) backups across a fleet of
 Linux and Windows endpoints. It's designed for **small fleets** —
 the original target was twelve endpoints — and **one operator**.
 ## What it does
 - Centralised view of every endpoint's last backup, repo size,
  snapshot count, and recent jobs.
 - Trigger any restic operation remotely (`backup`, `forget`, `prune`,
  `check`, `unlock`, `snapshots`, `stats`, `diff`, `restore`).
 - Per-host backup schedules with source groups (named bundles of
  paths + retention policy).
 - Live job log streamed to the browser; downloadable as text or NDJSON.
 - Restore wizard with snapshot tree browse + path selection.
 - Repo-level health surfacing (size, raw size, last-check, lock
  state) plus a 30/90-day size trend.
 - Alerting over webhook, ntfy, or SMTP.
 - Cross-platform agent (Linux + Windows).
 - Append-only-credential-friendly with a separate admin credential
  for forget/prune.
 ## What it isn't
 - **Not a SaaS.** Single-instance, single-tenant, by design.
 - **Not a replacement for restic** — it's a control plane. The agent
  shells out to a real `restic` binary.
 - **Not highly available.** SQLite, single process; if you need
  HA backups, you're shopping in the wrong aisle.
 - **Not a multi-protocol backup tool.** restic only.
 ## How it fits together
 ```
 ┌──────────────────────────────────────────────┐
 │  Server (control plane, Docker)              │
 │   - REST + WebSocket API                     │
 │   - SQLite store                             │
 │   - Embedded HTMX UI                         │
 └──────────┬─────────────────────────┬─────────┘
           │ outbound WS              │ HTTP(S)
           │                          │
 ┌──────────▼──────────┐    ┌──────────▼─────────┐
 │  Agent (per host)   │    │  Browser (operator) │
 │   - restic wrapper  │    └─────────────────────┘
 │   - cron for sched. │
 └──────────┬──────────┘
           │ restic
 ┌──────────▼──────────────────────────────────┐
 │  rest-server / S3 / SFTP / local repo       │
 │  (the actual backup data — server never     │
 │   touches it)                               │
 └─────────────────────────────────────────────┘
 ```
 The control plane is a Go binary that runs in Docker. Each endpoint
 runs a small Go agent that holds an outbound WebSocket to the
 control plane. Backup data flows directly between the agent and the
 restic repository — the control plane never sees a snapshot byte.
 ## Where to start
 - [Installing the server](./getting-started/install.md) walks
  through the Docker-based reference deployment.
 - [Enrolling your first host](./getting-started/enrolling-hosts.md)
  covers the install scripts and the announce-and-approve flow.
 - [Architecture](./concepts/architecture.md) is the right read if
  you want to know why something is the way it is before running
  the install.
 ## Project status
 Pre-1.0 but feature-complete for the original use case. Phases
 0–4 are landed (MVP, scheduling, restore, RBAC + OIDC); Phase 5
 (this docs site, contributor onboarding, end-to-end CI) is in
 flight. See [`tasks.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/tasks.md)
 for the live roadmap and [`spec.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/spec.md)
 for the canonical design doc.
 ## License
 [PolyForm Noncommercial 1.0.0](https://polyformproject.org/licenses/noncommercial/1.0.0/).
 Personal and community deployments welcome; commercial use
 requires a separate license.
@@ -0,0 +1,39 @@
 # License
 restic-manager is licensed under
 [**PolyForm Noncommercial 1.0.0**](https://polyformproject.org/licenses/noncommercial/1.0.0/).
 The full text lives at
 [`LICENSE`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/LICENSE)
 in the repository root.
 ## What this means
 - **Personal, hobbyist, educational, charitable, and similar
  noncommercial use** is fully permitted, including modification
  and redistribution.
 - **Commercial use is not permitted** without a separate
  license. The maintainer is not currently offering one — if
  you need commercial rights, open an issue to start the
  conversation.
 - The license is permissive about everything except commercial
  use: you can fork, modify, deploy in your home/lab, and
  contribute back.
 ## Why this license
 The PolyForm Noncommercial license was chosen because:
 - It's a real, legal, plainly-worded license (not a custom
  half-written variant).
 - It permits the realistic uses for a hobby project (the
  maintainer's homelab, a friend's fleet, a charity's IT
  closet) without inviting commercial vendors to repackage
  the work.
 - It's compatible with the project staying small and
  maintainable — the maintainer doesn't want to be on the hook
  for SLA-grade commercial support.
 ## Contributions
 By contributing, you agree your contributions are licensed
 under the same PolyForm Noncommercial 1.0.0 license.
@@ -0,0 +1,73 @@
 # Alerts and notifications
 restic-manager raises alerts on conditions that need human
 attention. The alert engine evaluates rules on a 60s tick and
 on every job-finished / host-online event.
 ## Built-in alert kinds
 | Kind                | Trigger | Severity |
 |---------------------|---------|----------|
 | `backup_failed`     | A backup job ends in `failed` or `cancelled` | warning |
 | `forget_failed`     | A forget job ends in `failed` | warning |
 | `prune_failed`      | A prune job ends in `failed` | critical |
 | `check_failed`      | A check job ends in `failed` | critical |
 | `agent_offline`     | A host has been offline more than 90s past its heartbeat cadence | warning |
 | `stale_schedule`    | A schedule's "last run" is more than 1.5 × its interval ago | warning |
 | `update_failed`     | An agent self-update reported a failure, or the agent didn't reconnect within 90s | warning |
 | `fleet_update_halted`| The rolling fleet-update worker stopped on a failure | critical |
 Each alert has a `dedup_key` so re-firing the same condition
 just bumps `last_seen_at` — the operator gets one row per
 condition, not a thousand.
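 The dedup behaviour can be sketched as a map keyed by `dedup_key`.
 Everything here is illustrative: the `alertStore` type, the key
 format `agent_offline:host-1`, and the use of Unix seconds for
 timestamps are assumptions, and the real engine persists to SQLite
 rather than to memory.

 ```go
 package main

 import "fmt"

 // alert is a simplified row; timestamps are Unix seconds for brevity.
 type alert struct {
 	Kind        string
 	DedupKey    string
 	FirstSeenAt int64
 	LastSeenAt  int64
 }

 // alertStore holds open alerts keyed by dedup key.
 type alertStore struct {
 	open map[string]*alert
 }

 func newAlertStore() *alertStore {
 	return &alertStore{open: make(map[string]*alert)}
 }

 // raise records a condition. It returns true when a new row was
 // created and false when an existing row was only refreshed.
 func (s *alertStore) raise(kind, dedupKey string, now int64) bool {
 	if a, ok := s.open[dedupKey]; ok {
 		a.LastSeenAt = now // same condition again: bump last_seen_at
 		return false
 	}
 	s.open[dedupKey] = &alert{Kind: kind, DedupKey: dedupKey, FirstSeenAt: now, LastSeenAt: now}
 	return true
 }

 func main() {
 	s := newAlertStore()
 	fmt.Println(s.raise("agent_offline", "agent_offline:host-1", 100))
 	fmt.Println(s.raise("agent_offline", "agent_offline:host-1", 160))
 	fmt.Println(len(s.open), s.open["agent_offline:host-1"].LastSeenAt)
 }
 ```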
 ## Lifecycle
 ```
 raised  ──acknowledge──▶  acknowledged  ──resolve──▶  resolved
   │                          │
   └────────auto-resolve──────┘
   (e.g. agent_offline auto-resolves on agent_online)
 ```
 - **Acknowledge** says "I've seen this, stop notifying about it".
 - **Resolve** says "the underlying condition is gone".
 - Some alerts auto-resolve when the condition clears
  (`agent_offline` is the canonical example).
 ## Notification channels
 Configure under **Settings → Notifications**. Each channel can
 subscribe to all alerts or filter by severity.
 ### Webhook
 Posts a JSON envelope to a URL of your choice. Useful for
 piping into Slack via an Incoming Webhook URL or into your own
 alerting tooling.
 ### ntfy
 Pushes a plain-text alert to an [ntfy.sh](https://ntfy.sh/)
 topic. Configure the topic URL; optional bearer token if you
 self-host with auth.
 ### SMTP
 Plain SMTP (with optional TLS). Configure host, port,
 username, password, and the recipient list.
 ## Test fire
 Each channel exposes a **Test fire** button that dispatches a
 single synthetic alert through the channel without touching the
 alert engine. Use this when you've added a channel and want to
 verify connectivity before the next real failure happens.
 ## What gets logged
 Every alert raise / acknowledge / resolve writes an audit log
 entry. The audit log UI at **Settings → Audit log** filters by
 user, action, target, and time range — useful for the
 post-incident "who clicked acknowledge on the prune-failure
 alert" question.
@@ -0,0 +1,73 @@
 # Backups and restores
 ## Running a backup
 Three ways to trigger one:
 1. **Scheduled** — the agent's local cron fires at the time set
   on the schedule.
 2. **Run-now** — operator clicks **Run now** on the host detail
   right rail. Posts to `/hosts/{id}/run-backup` (defaults to all
   source groups) or to a per-group form for finer control.
 3. **API** — `POST /api/hosts/{id}/jobs` with the appropriate
   payload. Same audit + dispatch path.
 In every case the server creates a `jobs` row, broadcasts a
 `command.run` to the host, and lands the operator on the live
 job log page (HTMX `HX-Redirect`).
 ## Cancelling a job
 Any running job — backup, forget, prune, restore, anything —
 exposes a **Cancel** button on its detail page. The server
 broadcasts `command.cancel`, and the agent kills the running
 restic subprocess via context cancel: SIGTERM first, SIGKILL
 after a 5s grace (`cmd.Cancel` + `cmd.WaitDelay`). On Windows the
 SIGTERM step is replaced with `os.Kill` because Windows can't
 deliver SIGTERM. Result: a cancelled job lands as `cancelled`
 within a couple of hundred milliseconds.
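 For readers unfamiliar with the `cmd.Cancel` + `cmd.WaitDelay` pair
 (both part of `os/exec` since Go 1.20), here is a minimal Unix-only
 sketch of the pattern. `newCancellable` and `cancelAfterMs` are
 hypothetical helpers, and `sleep 30` stands in for a long-running
 restic process; the agent's real wiring may differ.

 ```go
 package main

 import (
 	"context"
 	"fmt"
 	"os/exec"
 	"syscall"
 	"time"
 )

 // newCancellable builds a command that receives SIGTERM when ctx is
 // cancelled and is force-killed if it is still running after grace.
 func newCancellable(ctx context.Context, grace time.Duration, name string, args ...string) *exec.Cmd {
 	cmd := exec.CommandContext(ctx, name, args...)
 	// Ask politely first instead of the default immediate kill.
 	cmd.Cancel = func() error { return cmd.Process.Signal(syscall.SIGTERM) }
 	// After the grace period os/exec kills the process itself.
 	cmd.WaitDelay = grace
 	return cmd
 }

 // cancelAfterMs starts the command, cancels it after delayMs, and
 // reports how long Wait took to return (in milliseconds).
 func cancelAfterMs(delayMs int, name string, args ...string) (int64, error) {
 	ctx, cancel := context.WithCancel(context.Background())
 	defer cancel()
 	cmd := newCancellable(ctx, 5*time.Second, name, args...)
 	start := time.Now()
 	if err := cmd.Start(); err != nil {
 		return 0, err
 	}
 	timer := time.AfterFunc(time.Duration(delayMs)*time.Millisecond, cancel)
 	defer timer.Stop()
 	err := cmd.Wait()
 	return time.Since(start).Milliseconds(), err
 }

 func main() {
 	ms, err := cancelAfterMs(200, "sleep", "30")
 	fmt.Printf("stopped after %dms: %v\n", ms, err)
 }
 ```

 A process that honours SIGTERM exits well inside the grace window;
 the 5s `WaitDelay` only matters for one that ignores the signal.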
 ## Restore wizard
 Restoring a file or path goes through a four-step wizard at
 `/hosts/{id}/restore`:
 1. **Pick a snapshot.** Search by id or by date; the page is
   pre-populated when you launched the wizard from a snapshot row.
 2. **Browse the snapshot tree.** Lazy-loaded children via the
   `MsgTreeList` synchronous WS RPC; results are cached
   per-wizard-session for 30 minutes. Pick the absolute paths
   you want.
 3. **Choose a target.** Either **In place** (overwrites the
   live filesystem; requires you to type the hostname to
   confirm) or **New directory** (default
   `$HOME/rm-restore/<job-id>/`; agent expands `$HOME` /
   `${HOME}` / `~/` and creates the directory chain).
 4. **Review and submit.** Server mints a job, dispatches
   `command.run` with a `RestorePayload`, and `HX-Redirect`s to
   the live job log.
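 The target-path expansion from step 3 can be sketched as below. This
 is an assumption-laden illustration: `expandHome` is a hypothetical
 name, and it only expands a leading `$HOME`, `${HOME}`, or `~`
 followed by `/` or end of string. Creating the directory chain would
 be a separate `os.MkdirAll` call, omitted here.

 ```go
 package main

 import (
 	"fmt"
 	"strings"
 )

 // expandHome replaces a leading $HOME, ${HOME}, or ~ in target with
 // home. Lookalikes such as $HOMEDIR or ~user are left untouched.
 func expandHome(target, home string) string {
 	for _, prefix := range []string{"${HOME}", "$HOME", "~"} {
 		rest, ok := strings.CutPrefix(target, prefix)
 		if !ok {
 			continue
 		}
 		if rest == "" || strings.HasPrefix(rest, "/") {
 			return home + rest
 		}
 	}
 	return target
 }

 func main() {
 	fmt.Println(expandHome("$HOME/rm-restore/42/", "/home/alice"))
 	fmt.Println(expandHome("~/rm-restore/42/", "/home/alice"))
 	fmt.Println(expandHome("/srv/restore", "/home/alice"))
 }
 ```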
 `--no-ownership` is gated on restic ≥ 0.17 (the flag was added
 in that release). Hosts running 0.16 don't get the flag and
 restore as the running user instead.
 ## Snapshot diff
 Enter two snapshot ids in the **Diff** form on the host detail page
 to start a `JobDiff` job, which runs `restic diff <a> <b>`. Output
 streams to the standard live job log. Useful when investigating a
 suspiciously-sized backup.
 ## Job log artefacts
 Every job's log is persisted in `job_logs` (one row per line),
 not just streamed in-memory. That gives you:
 - A live view at `/jobs/{id}` while the job runs.
 - Two download formats from the same page header dropdown:
  - **txt** — one line per row, `HH:MM:SS.mmm  TAG  payload`.
  - **ndjson** — one self-contained JSON object per line
    (`{seq, ts, stream, payload}`), perfect for `jq`.
 Downloads work whether the job is running or finished —
 the source is the DB, not the live socket.
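 To make the two download formats concrete, here is a small sketch
 that renders one persisted row both ways. The `logLine` struct is
 hypothetical; in particular, storing `ts` as Unix milliseconds and
 using the stream name as the `TAG` column are assumptions, since the
 text only names the fields.

 ```go
 package main

 import (
 	"encoding/json"
 	"fmt"
 	"time"
 )

 // logLine models one job_logs row with the four documented fields.
 type logLine struct {
 	Seq     int64  `json:"seq"`
 	TS      int64  `json:"ts"` // Unix milliseconds (assumed encoding)
 	Stream  string `json:"stream"`
 	Payload string `json:"payload"`
 }

 // txt renders the row as "HH:MM:SS.mmm  TAG  payload" in UTC.
 func (l logLine) txt() string {
 	clock := time.UnixMilli(l.TS).UTC().Format("15:04:05.000")
 	return fmt.Sprintf("%s  %s  %s", clock, l.Stream, l.Payload)
 }

 // ndjson renders the row as one self-contained JSON object.
 func (l logLine) ndjson() (string, error) {
 	b, err := json.Marshal(l)
 	return string(b), err
 }

 func main() {
 	l := logLine{Seq: 7, TS: 1700000000123, Stream: "stdout", Payload: "snapshot saved"}
 	fmt.Println(l.txt())
 	js, err := l.ndjson()
 	if err != nil {
 		panic(err)
 	}
 	fmt.Println(js)
 }
 ```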
@@ -0,0 +1,61 @@
 # Observability with Prometheus
 restic-manager can expose a Prometheus scrape endpoint at
 `GET /metrics`. The endpoint is **opt-in** — without an explicit
 auth gate it isn't even mounted, so a forgotten config can't
 accidentally publish fleet state.
 The full reference lives at
 [`docs/prometheus.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/docs/prometheus.md);
 the short version follows.
 ## Enable the endpoint
 Set at least one of:
 - `RM_METRICS_TOKEN` — `Authorization: Bearer <token>` required.
 - `RM_METRICS_TRUSTED_CIDR` — restricts source IPs (comma-CIDR).
 When both are set, a request must satisfy both. The token is
 compared in constant time; the CIDR check honours `X-Forwarded-For`
 only when the immediate hop matches `RM_TRUSTED_PROXY`.
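 The gate logic can be sketched as a single predicate. This is an
 illustration under stated assumptions, not the server's handler:
 `metricsAllowed` and `inAnyCIDR` are hypothetical names, and the
 source IP is assumed to have been resolved already (including any
 trusted `X-Forwarded-For` handling).

 ```go
 package main

 import (
 	"crypto/subtle"
 	"fmt"
 	"net/netip"
 	"strings"
 )

 // metricsAllowed reports whether a scrape passes the configured gates.
 // wantToken and cidrList are the two settings ("" means unset).
 func metricsAllowed(wantToken, cidrList, authHeader, srcIP string) bool {
 	if wantToken == "" && cidrList == "" {
 		return false // no gate configured: endpoint is not mounted
 	}
 	if wantToken != "" {
 		got, ok := strings.CutPrefix(authHeader, "Bearer ")
 		if !ok || subtle.ConstantTimeCompare([]byte(got), []byte(wantToken)) != 1 {
 			return false
 		}
 	}
 	if cidrList != "" {
 		src, err := netip.ParseAddr(srcIP)
 		if err != nil || !inAnyCIDR(src.Unmap(), cidrList) {
 			return false
 		}
 	}
 	return true
 }

 // inAnyCIDR checks addr against a comma-separated CIDR list,
 // skipping entries that fail to parse.
 func inAnyCIDR(addr netip.Addr, cidrList string) bool {
 	for _, part := range strings.Split(cidrList, ",") {
 		p, err := netip.ParsePrefix(strings.TrimSpace(part))
 		if err == nil && p.Contains(addr) {
 			return true
 		}
 	}
 	return false
 }

 func main() {
 	fmt.Println(metricsAllowed("s3cret", "10.0.0.0/8", "Bearer s3cret", "10.1.2.3"))
 	fmt.Println(metricsAllowed("s3cret", "10.0.0.0/8", "Bearer s3cret", "203.0.113.1"))
 }
 ```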
 ## Metrics emitted
 - **Server gauges**: `rm_hosts_total`, `rm_hosts_online`,
  `rm_active_alerts{severity}`, `rm_build_info{...}`.
 - **Per-host gauges**: `rm_host_agent_online`,
  `rm_host_last_backup_timestamp_seconds`,
  `rm_host_last_backup_success`, `rm_host_repo_size_bytes`,
  `rm_host_snapshot_count`, `rm_host_open_alerts`,
  `rm_host_repo_status`.
 - **Histogram**:
  `rm_job_duration_seconds{kind,status,le=…}` (buckets
  `1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf`).
 The histogram is held in memory only, so it resets when the server
 restarts. Prometheus persists the scrapes; if you need durable
 history at hourly resolution, that's Prometheus's job.
 ## Sample Grafana dashboard
 [`deploy/grafana/restic-manager-dashboard.json`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/deploy/grafana/restic-manager-dashboard.json)
 imports through Grafana's **+ → Import → Upload JSON file**.
 Six panels:
 1. Fleet status (online / total).
 2. Open alerts by severity.
 3. Backups failing on most-recent run.
 4. Hosts table — last backup, repo size, snapshots, open alerts.
 5. Repo size over time, one line per host.
 6. Job-duration p95 over a 1h window per kind.
 ## Alerting
 restic-manager already has a built-in alert engine
 ([Alerts](./alerts.md)). The dashboard intentionally doesn't
 duplicate it as Prometheus alert rules. If you want
 Prometheus-side alerts on top, write your own based on the
 metrics above — `rm_host_last_backup_success == 0`,
 `time() - rm_host_last_backup_timestamp_seconds > <max age>`,
 or whatever suits your environment.
@@ -0,0 +1,50 @@
 # Updating agents
 Server updates are a `docker compose pull && docker compose up -d`
 away. Agents update via the control plane.
 ## Single-host update
 Each host's detail page shows an **Update agent** button when
 the agent's reported version is older than the server's.
 Clicking it starts this sequence:
 1. The server dispatches a `command.update` to that host.
 2. The agent fetches the appropriate binary from
   `$RM_SERVER/agent/binary?os=…&arch=…` to
   `<binary-path>.new`.
 3. Copies the running binary to `<binary-path>.old` (one
   revision back, in case rollback is needed).
 4. Atomic-renames `.new` over the running binary.
 5. Exits cleanly. systemd's `Restart=always` (or Windows SCM)
   brings the process back on the new binary.
 A 90-second timer on the server side waits for a hello at the
 target version and marks the update succeeded — or, if the
 agent doesn't reconnect at the expected version in time, marks
 the update **failed** and raises an `update_failed` alert.
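 The copy-then-rename steps of the single-host update can be sketched
 like this. It is a Unix-flavoured illustration with hypothetical
 helper names (`swapBinary`, `copyFile`, `demo`); the download, the
 clean exit, and any Windows-specific handling are left out.

 ```go
 package main

 import (
 	"fmt"
 	"io"
 	"os"
 	"path/filepath"
 )

 // swapBinary keeps a copy of the current binary at path+".old", then
 // renames path+".new" over path. A rename within one filesystem
 // replaces the target atomically on POSIX systems.
 func swapBinary(path string) error {
 	if err := copyFile(path, path+".old"); err != nil {
 		return fmt.Errorf("keep rollback copy: %w", err)
 	}
 	if err := os.Rename(path+".new", path); err != nil {
 		return fmt.Errorf("activate new binary: %w", err)
 	}
 	return nil
 }

 // copyFile copies src to dst, creating or truncating dst.
 func copyFile(src, dst string) error {
 	in, err := os.Open(src)
 	if err != nil {
 		return err
 	}
 	defer in.Close()
 	out, err := os.OpenFile(dst, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o755)
 	if err != nil {
 		return err
 	}
 	if _, err := io.Copy(out, in); err != nil {
 		out.Close()
 		return err
 	}
 	return out.Close()
 }

 // demo stages fake binaries in a temp dir, swaps them, and returns
 // the contents of the live file and of the .old copy.
 func demo() (string, string, error) {
 	dir, err := os.MkdirTemp("", "swap-demo")
 	if err != nil {
 		return "", "", err
 	}
 	defer os.RemoveAll(dir)
 	bin := filepath.Join(dir, "agent")
 	if err := os.WriteFile(bin, []byte("v1"), 0o755); err != nil {
 		return "", "", err
 	}
 	if err := os.WriteFile(bin+".new", []byte("v2"), 0o755); err != nil {
 		return "", "", err
 	}
 	if err := swapBinary(bin); err != nil {
 		return "", "", err
 	}
 	live, err := os.ReadFile(bin)
 	if err != nil {
 		return "", "", err
 	}
 	old, err := os.ReadFile(bin + ".old")
 	if err != nil {
 		return "", "", err
 	}
 	return string(live), string(old), nil
 }

 func main() {
 	live, old, err := demo()
 	fmt.Println(live, old, err)
 }
 ```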
 ## Fleet update
 The admin-only **Settings → Fleet update** page drives a rolling
 update across every host in the fleet:
 - One host at a time.
 - Wait for hello-with-target-version (max 90s).
 - On any host failing, **halt** the rollout, raise a
  `fleet_update_halted` alert, leave the rest of the fleet on
  the old version. No surprise mass-failures.
 You can cancel an in-progress fleet update; the worker stops
 after the current host finishes.
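 The halt-on-first-failure policy amounts to a short loop. The sketch
 below uses hypothetical names (`rollOut`, the `update` callback) and
 leaves out alert raising and cancellation; it only shows that hosts
 after a failure are never touched.

 ```go
 package main

 import "fmt"

 // rollOut updates hosts one at a time. update reports whether a host
 // came back at the target version. On the first failure the rollout
 // halts and the failing host is returned; "" means every host succeeded.
 func rollOut(hosts []string, update func(host string) bool) (updated []string, haltedAt string) {
 	for _, h := range hosts {
 		if !update(h) {
 			return updated, h
 		}
 		updated = append(updated, h)
 	}
 	return updated, ""
 }

 func main() {
 	done, halted := rollOut([]string{"a", "b", "c"}, func(h string) bool { return h != "b" })
 	fmt.Println(done, halted)
 }
 ```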
 ## TLS and corruption
 Updates rely on the reverse proxy's TLS to detect corruption in
 transit. There's no separate sha256 verification step — we
 chose the simpler model on the basis that the same TLS already
 gates every other byte the server hands to the agent.
 If you'd like a separate signature step before applying updates,
 that's a future-phase enhancement (see `tasks.md` Phase 6
 candidates).
@@ -0,0 +1,58 @@
 # Environment variables
 The server reads its configuration from environment variables
 (canonical) with an optional YAML overlay. Env wins over YAML so
 operators can tweak a single setting without rewriting the file.
 ## Server
 | Variable                  | Default                          | Meaning |
 |---------------------------|----------------------------------|---------|
 | `RM_LISTEN`               | `:8080`                          | TCP listener for the HTTP server. |
 | `RM_DATA_DIR`             | `/data`                          | Persistent state directory (SQLite, secret key, agent assets). |
 | `RM_BASE_URL`             | (none)                           | Public URL clients use; required for OIDC redirects + cookie scope. |
 | `RM_SECRET_KEY_FILE`      | `${RM_DATA_DIR}/secret.key`      | Path to the AEAD key file. Auto-generated on first run. |
 | `RM_COOKIE_SECURE`        | `true`                           | Set `false` only for local HTTP testing. Controls `Secure` on session cookies. |
 | `RM_TRUSTED_PROXY`        | (none)                           | Comma-separated CIDRs trusted for `X-Forwarded-*`. |
 | `RM_BUNDLED_ASSETS_DIR`   | `/opt/restic-manager/dist`       | Read-only path with bundled agent binaries + install scripts (the docker image bakes them here). |
 | `RM_METRICS_TOKEN`        | (off)                            | When set, `GET /metrics` requires `Authorization: Bearer <token>`. |
 | `RM_METRICS_TRUSTED_CIDR` | (off)                            | When set, `GET /metrics` restricts source IPs (comma-CIDR). |
 OIDC variables (all optional; empty issuer disables OIDC):
 | Variable                       | Meaning |
 |--------------------------------|---------|
 | `RM_OIDC_ISSUER`               | OIDC discovery URL (e.g. `https://auth.example.com`). |
 | `RM_OIDC_CLIENT_ID`            | Client ID registered with the IdP. |
 | `RM_OIDC_CLIENT_SECRET`        | Client secret (or use `RM_OIDC_CLIENT_SECRET_FILE`). |
 | `RM_OIDC_CLIENT_SECRET_FILE`   | Path to a file holding the client secret. |
 | `RM_OIDC_DISPLAY_NAME`         | Button label on the login page (e.g. "Authelia"). |
 | `RM_OIDC_ROLE_CLAIM`           | Token claim that carries roles (default `groups`). |
 | `RM_OIDC_ROLE_MAPPING`         | `idp-group=role` entries, comma-separated (e.g. `rm-admin=admin,rm-ops=operator`). |
 | `RM_OIDC_REDIRECT_URL`         | Override for the redirect URL; defaults to `${RM_BASE_URL}/auth/oidc/callback`. |
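 As an illustration of the `RM_OIDC_ROLE_MAPPING` format, the sketch
 below parses `idp-group=role` entries into a map. `parseRoleMapping`
 is a hypothetical name, and the whitespace trimming and error
 behaviour are assumptions rather than documented server behaviour.

 ```go
 package main

 import (
 	"fmt"
 	"strings"
 )

 // parseRoleMapping turns "rm-admin=admin,rm-ops=operator" into a map
 // from IdP group name to role name.
 func parseRoleMapping(s string) (map[string]string, error) {
 	out := make(map[string]string)
 	for _, entry := range strings.Split(s, ",") {
 		entry = strings.TrimSpace(entry)
 		if entry == "" {
 			continue
 		}
 		group, role, ok := strings.Cut(entry, "=")
 		group, role = strings.TrimSpace(group), strings.TrimSpace(role)
 		if !ok || group == "" || role == "" {
 			return nil, fmt.Errorf("bad mapping entry %q, want idp-group=role", entry)
 		}
 		out[group] = role
 	}
 	return out, nil
 }

 func main() {
 	m, err := parseRoleMapping("rm-admin=admin,rm-ops=operator")
 	fmt.Println(m, err)
 }
 ```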
 ## Agent
 | Variable             | Default | Meaning |
 |----------------------|---------|---------|
 | `RM_AGENT_CONFIG`    | `/etc/restic-manager/agent.yaml` (Linux) | Config file path. |
 The agent's other settings live in the YAML file (server URL,
 bearer token, optional cert pin). The install script writes that
 file for you at enrolment.
 ## Build-time
 The Makefile threads `-ldflags` from `git describe` into the
 `internal/version` package so `--version` and the dashboard
 footer show the right values:
 ```
 -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=$(VERSION)
 -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Commit=$(COMMIT)
 ```
 If you build with `go build` directly (no Makefile), `Version`
 falls back to `dev` and the agent-update comparison falls back
 to "always equal". Source-build deployments can still run; they
 just don't participate in the self-update flow.
@@ -0,0 +1,82 @@
 # HTTP endpoints
 A non-exhaustive map of the surfaces the control plane exposes.
 All `/api/*` routes return JSON; all other paths render HTML
 (server-rendered with HTMX in the loop).
 The canonical wiring lives at
 [`internal/server/http/server.go`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/internal/server/http/server.go);
 when in doubt, read the routes block there.
 ## Public (no auth)
 | Method | Path                       | Purpose |
 |--------|----------------------------|---------|
 | GET    | `/healthz`                 | Liveness probe. Returns 204. |
 | POST   | `/api/auth/login`          | Local-user login. JSON body: `{username, password}`. |
 | POST   | `/api/auth/logout`         | Invalidate the session cookie. |
 | POST   | `/api/bootstrap`           | First-run admin creation. Accepts the token printed at first start. |
 | POST   | `/api/agents/enroll`       | Token-based agent enrolment. |
 | POST   | `/api/agents/announce`     | Announce-and-approve agent enrolment. |
 | GET    | `/agent/binary?os=&arch=`  | Serves the agent binary for the install scripts. |
 | GET    | `/install/*`               | Serves the Linux + Windows install scripts and the systemd unit. |
 | GET    | `/api/version`             | Build version + commit JSON. |
 | GET    | `/metrics`                 | Prometheus exposition (only when opted-in via `RM_METRICS_TOKEN` / `RM_METRICS_TRUSTED_CIDR`). |
 | GET    | `/login`, `/setup`, `/bootstrap` | UI pages. |
 ## Authenticated (any role)
 | Method | Path                                     | Purpose |
 |--------|------------------------------------------|---------|
 | GET    | `/`                                      | Dashboard. |
 | GET    | `/hosts/{id}`                            | Host detail. |
 | GET    | `/hosts/{id}/repo`                       | Repo tab. |
 | GET    | `/hosts/{id}/jobs`                       | Jobs tab. |
 | GET    | `/hosts/{id}/sources`                    | Source groups list. |
 | GET    | `/hosts/{id}/schedules`                  | Schedules list. |
 | GET    | `/jobs/{id}`                             | Live job log. |
 | GET    | `/api/hosts`, `/api/fleet/summary`       | JSON list + summary. |
 | GET    | `/api/jobs/{id}/stream`                  | WebSocket subscription to a job's live log. |
 | GET    | `/api/jobs/{id}/log.{txt,ndjson}`        | Persisted log download. |
 ## Operator role and above
 | Method | Path                                  | Purpose |
 |--------|---------------------------------------|---------|
 | POST   | `/hosts/{id}/run-backup`              | Run-now (HTMX form-post). |
 | POST   | `/hosts/{id}/sources/{gid}/run-now`   | Per-source-group run-now. |
 | POST   | `/hosts/{id}/repo/{prune,check,unlock,reinit,probe}` | Maintenance actions. |
 | POST   | `/api/hosts/{id}/snapshots/diff`      | Snapshot-diff job. |
 | POST   | `/hosts/{id}/restore`                 | Restore wizard submit. |
 | POST   | `/api/jobs/{id}/cancel`               | Cancel a running job. |
 | POST   | `/hosts/{id}/tags`                    | Update host tags. |
 | POST   | `/hosts/{id}/sources` and friends     | Source-group CRUD. |
 | POST   | `/hosts/{id}/schedules` and friends   | Schedule CRUD. |
 | POST   | `/hosts/{id}/repo/credentials`, `/admin-credentials` | Credential update. |
 ## Admin role only
 | Method | Path                                  | Purpose |
 |--------|---------------------------------------|---------|
 | POST   | `/hosts/new`                          | Mint enrolment token (Add host). |
 | POST   | `/hosts/{id}/delete`                  | Delete + cascade. |
 | POST   | `/hosts/{id}/update`                  | Dispatch a single agent update. |
 | GET/POST | `/settings/users/...`                | User management. |
 | POST   | `/settings/notifications/...`         | Notification channel CRUD + test fire. |
 | POST   | `/settings/fleet-update/...`          | Fleet-update worker. |
 ## WebSocket
 | Path                           | Who connects | Auth |
 |--------------------------------|--------------|------|
 | `/ws/agent`                    | Agent        | Bearer token issued at enrolment. |
 | `/ws/agent/pending`            | Agent (announce flow) | Pending-id query param. |
 | `/api/jobs/{id}/stream`        | Browser      | Session cookie. |
 ## RBAC enforcement
 Routes are grouped into chi route-groups by required role
 (`viewer < operator < admin`); the `requireRole` middleware in
 `internal/server/http/middleware.go` is the bouncer. Sessions
 re-validate `disabled_at` on every request, so a disabled user's
 cookie stops working immediately.
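 The role ordering lends itself to a small middleware sketch. This
 version uses only `net/http` so it runs without chi, and it is not
 the project's `requireRole`: the signature, the `roleOf` lookup, and
 the 401/403 status choices are assumptions for illustration.

 ```go
 package main

 import (
 	"fmt"
 	"net/http"
 	"net/http/httptest"
 )

 // role is ordered so that viewer < operator < admin.
 type role int

 const (
 	viewer role = iota
 	operator
 	admin
 )

 // requireRole rejects requests whose session role is below need.
 // roleOf stands in for the real session lookup.
 func requireRole(need role, roleOf func(*http.Request) (role, bool)) func(http.Handler) http.Handler {
 	return func(next http.Handler) http.Handler {
 		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
 			have, ok := roleOf(r)
 			if !ok {
 				http.Error(w, "unauthenticated", http.StatusUnauthorized)
 				return
 			}
 			if have < need {
 				http.Error(w, "forbidden", http.StatusForbidden)
 				return
 			}
 			next.ServeHTTP(w, r)
 		})
 	}
 }

 // statusFor runs one request through the middleware and returns the
 // HTTP status code the client would see.
 func statusFor(need, have role, loggedIn bool) int {
 	lookup := func(*http.Request) (role, bool) { return have, loggedIn }
 	ok := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
 		w.WriteHeader(http.StatusOK)
 	})
 	rec := httptest.NewRecorder()
 	req := httptest.NewRequest(http.MethodPost, "/hosts/1/run-backup", nil)
 	requireRole(need, lookup)(ok).ServeHTTP(rec, req)
 	return rec.Code
 }

 func main() {
 	fmt.Println(statusFor(operator, viewer, true))
 	fmt.Println(statusFor(operator, admin, true))
 	fmt.Println(statusFor(viewer, viewer, false))
 }
 ```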
@@ -0,0 +1,32 @@
 # Roadmap
 The live roadmap is in
 [`tasks.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/tasks.md).
 Phases ship in order; items inside a phase ship as the
 opportunity arises.
 ## Status snapshot
 | Phase | Theme                                            | Status |
 |-------|--------------------------------------------------|--------|
 | 0     | Project bootstrap                                | ✅ done |
 | 1     | MVP: enrolment, visibility, on-demand backup     | ✅ done |
 | 2     | Scheduling, retention, repo operations           | ✅ done |
 | 3     | Restore, alerts, audit                           | ✅ done |
 | 4     | RBAC, OIDC, host tags                            | ✅ done |
 | 5     | OSS readiness                                    | 🚧 in flight (this docs site is part of it) |
 | 6     | Update delivery + observability polish           | ✅ done |
 ## What's not on the roadmap
 The non-goals list in [`spec.md` §2](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/spec.md):
 - Replacing restic itself or providing custom repo formats
 - Managing non-restic backup tools
 - Multi-tenancy / SaaS deployment
 - High availability of the control plane (SQLite, single-instance)
 - Mobile-native apps (responsive web only)
 If something there is critical to your use case, restic-manager
 isn't the right tool. That's not a closed door — it's a
 deliberate scope decision so the project stays maintainable.
@@ -0,0 +1,35 @@
 # Reporting vulnerabilities
 The full disclosure policy lives in
 [`SECURITY.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/SECURITY.md)
 at the repo root. The short version:
 - **Don't open a public issue.**
 - Send a Gitea private message to `steve` on
  <https://gitea.dcglab.co.uk>, or email the address on the
  maintainer's profile, with a subject like
  `[SECURITY] restic-manager: <one-line summary>`.
 - Expect an acknowledgement within 3 working days; escalate
  through the other channel if you don't get one.
 - Default disclosure window is **30 days from confirmed report
  to public disclosure**, faster if a PoC is already
  circulating, slower only by mutual agreement.
 ## What to include
 A description of the issue and the impact, the affected
 component (server / agent / install script / docs), the version,
 and reproduction steps. A working PoC is welcome but not
 required — a credible threat model is enough.
 ## In scope vs. out of scope
 See the full policy. Quick highlights:
 - **In scope:** server, agent, install scripts, docker image,
  docker-compose reference, crypto choices, docs that lead to
  insecure configs.
 - **Out of scope:** restic itself (report upstream), unpatched
   third-party deps (report upstream first), abuse by an
   already-authenticated admin (admins are designed to have full
   power), DoS on deployments without the recommended reverse proxy.
@@ -0,0 +1,72 @@
 # Hardening checklist
 A baseline for new deployments. Most of these are defaults; the
 list is here to make audit easy.
 ## Server
 - [ ] Reverse proxy in front, TLS terminating at the proxy
      (Caddy/nginx/Traefik).
 - [ ] `RM_TRUSTED_PROXY` set to the proxy's CIDR.
 - [ ] `RM_BASE_URL` matches the public hostname and the cookie
      scope you want.
 - [ ] `RM_COOKIE_SECURE=true` (the default; only set `false`
      for local HTTP testing).
 - [ ] HTTP listener bound to **localhost** in the compose file,
      not `0.0.0.0`. The reverse proxy is the only thing that
      should reach it.
 - [ ] `secret.key` backed up separately from the database.
 - [ ] Bootstrap token consumed and the printed log line scrubbed
      from any log archive.
 ## Authentication
 - [ ] Admin user has a password ≥ 12 characters (the floor).
 - [ ] OIDC enabled if you have an IdP — local password auth
      stays as a break-glass.
 - [ ] Disabled (not deleted) any users who change roles or leave
      so their session is invalidated immediately.
 - [ ] The last-admin guard isn't tripped — there's always at
      least one enabled admin user.
 ## Repo credentials
 - [ ] Append-only credential set as the everyday cred for every
      host.
 - [ ] Admin credential set only where prune cadence is enabled.
 - [ ] No credentials reused across hosts. Each host should have
      its own credential pair so a single host compromise has a
      single blast radius.
 - [ ] If using rest-server, `--append-only` flag is on for the
      everyday user; the prune user is a separate identity.
 ## Agent
 - [ ] Agent runs as `root` (Linux) or `LocalSystem` (Windows)
      **only when** the source paths require it. Otherwise pin
      a service user that has read access to what's backed up
      and nothing else.
 - [ ] systemd unit's sandboxing flags are intact
      (`NoNewPrivileges`, `Protect*`, `MemoryDenyWriteExecute`).
 - [ ] Agent's config file `/etc/restic-manager/agent.yaml` is
      mode `0600` and owned by the service user. The bearer
      token lives in there.
 ## Operations
 - [ ] Alerts wired to a real channel (webhook into Slack,
      ntfy topic, SMTP) — not just sitting in the UI.
 - [ ] Test-fire each notification channel after configuring.
 - [ ] Audit-log retention is long enough to cover the operator's
      incident-response window.
 - [ ] Prometheus endpoint, if enabled, gated by token AND CIDR
      where practical (default is opt-in / off).
 ## Recovery
 - [ ] A documented procedure for rotating a leaked agent bearer
      (delete + re-enrol the host).
 - [ ] A test-restore done at least once, end-to-end, before
      relying on the system in anger.
 - [ ] `secret.key` and the SQLite database covered by separate
       backup paths, so no single backup holds both the encrypted
       credentials and the key that decrypts them.
@@ -0,0 +1,110 @@
 # Threat model
 This page documents what restic-manager defends against, what it
 doesn't, and the trust assumptions a deployment is making. The
 canonical version lives in [`spec.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/spec.md)
 §11; the summary here is shaped for operators rather than
 implementers.
 ## Trust boundaries
 ```
 ┌──────────────────────────────────────────┐
 │  TRUSTED zone                            │
 │  ┌─────────────┐    ┌──────────────┐     │
 │  │  Operator's │    │   Reverse    │     │
 │  │   browser   │◄──►│    proxy     │     │  TLS terminates here
 │  └─────────────┘    └──────┬───────┘     │
 └────────────────────────────┼─────────────┘
                             │ HTTP, plaintext
                             │ (loopback or trusted LAN)
 ┌────────────────────────────▼─────────────┐
 │  Server (control plane)                  │
 └────────────┬─────────────────────────────┘
             │ outbound WebSocket (TLS to clients via proxy)
             │ — bearer-authenticated
 ┌────────────▼──────────────┐
 │  Agent (per host)         │  ◄── attacker model: assume one
 └────────────┬──────────────┘       endpoint can be compromised
             │ subprocess
             ▼
   restic ──▶ repository (rest-server / S3 / SFTP / …)
 ```
 ## What we defend against
 ### Network attacker between operator and server
 - HTTPS via the reverse proxy is the only operator-facing surface
  on a sane deployment.
 - `RM_COOKIE_SECURE=true` (default) means the session cookie
  refuses to ride a non-HTTPS connection.
 - `RM_TRUSTED_PROXY` gates whether `X-Forwarded-*` is honoured;
   a request that bypasses the proxy can't spoof the client IP.
 ### Compromised agent host
 - The agent's bearer token can dispatch commands **only on its
  own host**. It can't read other hosts' state, dispatch jobs
  on other hosts, or escalate within the control plane.
 - If you suspect a host compromise:
   1. Delete the agent's host row via **Hosts → Delete**
      (this cascades to the bearer hash).
   2. Rotate the repo credential at the rest-server / object
      store side.
   3. Review the audit log, which lists every action that
      bearer ever drove.
 ### DB compromise without the secret key
 - Repo credentials are AEAD-encrypted at rest. A DB dump alone
  doesn't expose them.
 - Agent bearer **hashes** are leaked; that's enough to
  authenticate as any agent until you revoke. A rotation
  procedure is just "delete + re-enrol" today.
 - Operator passwords are bcrypt-hashed; OIDC users have no
  password to leak.
 - Session tokens are hashed; an attacker can't replay a
  session from a DB dump.
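
To make the "AEAD-encrypted at rest" bullet concrete, here is a minimal sketch using AES-256-GCM from the Go standard library. The cipher choice, the nonce-prefixed layout, and the `seal`/`open` helper names are assumptions for illustration; this page does not state which AEAD the server actually uses.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newAEAD builds an AES-GCM instance; a 32-byte key selects AES-256.
func newAEAD(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext and returns nonce || ciphertext || tag.
func seal(key, plaintext []byte) ([]byte, error) {
	aead, err := newAEAD(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return aead.Seal(nonce, nonce, plaintext, nil), nil
}

// open reverses seal; it fails if the key is wrong or the data was altered.
func open(key, sealed []byte) ([]byte, error) {
	aead, err := newAEAD(key)
	if err != nil {
		return nil, err
	}
	n := aead.NonceSize()
	if len(sealed) < n {
		return nil, errors.New("ciphertext too short")
	}
	return aead.Open(nil, sealed[:n], sealed[n:], nil)
}

func main() {
	key := make([]byte, 32) // stands in for the contents of secret.key
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	sealed, err := seal(key, []byte("repo-password"))
	if err != nil {
		panic(err)
	}
	plain, err := open(key, sealed)
	fmt.Println(string(plain), err)

	// A DB dump without the key: decryption with any other key fails.
	_, err = open(make([]byte, 32), sealed)
	fmt.Println(err != nil)
}
```

The stored blob is useless without the key, which is why the key and the database are treated as separate secrets.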
 ### DB compromise WITH the secret key
 The attacker can decrypt every credential. Treat
 `secret.key` with the same care as a password manager database.
 Back it up to a separate vault, not to the same Docker volume
 as the database.
 ### Forget/prune as a DoS vector
 - The everyday backup credential cannot prune (append-only).
 - The admin credential is only pushed to the agent at the
  moment of dispatch and discarded after the job ends.
 - Compromise of a single agent host does **not** grant prune
  rights — at worst the attacker gets fresh write access until
  the credential is rotated.
 ### Operator-side typo or bad copy-paste
 - Repo credentials are stored encrypted; mis-typed creds fail
  fast on the next `restic` invocation rather than silently
  corrupting state.
 - NS-03 added auto-init: the first dispatched job after creds
  change runs `restic init`, surfaces the error eagerly under
  the host's vitals strip if the creds are bad, and resets the
  host's `repo_status` so the operator can retry without
  hunting through job logs.
 ## What we don't defend against
 - **Insider threat at the maintainer level.** A malicious
  maintainer can publish a backdoored container; SBOM /
  signing infrastructure (Phase 6 candidate) would help here
  but isn't shipped today.
 - **Supply chain.** We pin module versions (`go.sum`) and
  pin the Tailwind binary's release tag, but a compromise in
  one of those upstreams would land here.
 - **Side-channel via restic itself.** A bug in restic that
  enables snapshot-content disclosure is restic's problem; the
  control plane doesn't see snapshot bytes either way.
 - **DoS via resource exhaustion** without the recommended
  reverse-proxy / rate-limit in front. Don't expose the
  server's HTTP port to the public internet directly.
@@ -0,0 +1,120 @@
 # End-to-end test harness
 The e2e harness stands up the full production-shaped stack
 (server + agent + rest-server) in Docker Compose and drives it
 through Playwright. CI runs it on every PR; operators can run it
 locally too.
 ## Files
 ```
 e2e/
 ├── compose.e2e.yml         compose stack: server + rest-server + agent
 ├── Dockerfile.agent        Linux container for the agent (alpine + restic)
 ├── agent-entrypoint.sh     decides between announce / token-enrol / run
 └── playwright/
    ├── package.json
    ├── playwright.config.ts
    └── tests/
        ├── lib/server.ts   bootstrap, login, accept, poll helpers
        └── smoke.spec.ts   happy-path: enrol → backup → succeeded
 ```
 ## Local run
 Prerequisites: Docker + Docker Compose, and Node.js (for `npm` and `npx`) to run Playwright.
 ```sh
 # 1. Build + bring up the stack (server, rest-server, source data).
 docker compose -f e2e/compose.e2e.yml up --build -d server rest-server source-fixture
 # 2. Wait for the server, then scrape the bootstrap token from the log.
 until curl -fsS http://127.0.0.1:8080/api/version >/dev/null; do sleep 1; done
 RM_BOOTSTRAP_TOKEN=$(docker compose -f e2e/compose.e2e.yml logs server \
    | grep -Eo '[a-zA-Z0-9_-]{40,}' | head -1)
 export RM_BOOTSTRAP_TOKEN
 # 3. Start the agent (it announces against the running server).
 docker compose -f e2e/compose.e2e.yml up -d agent
 # 4. Install + run Playwright.
 cd e2e/playwright
 npm install
 npx playwright install --with-deps chromium
 npx playwright test
 ```
 When the tests pass you'll see output like:
 ```
 Running 2 tests using 1 worker
  ✓  smoke: enrol-via-announce → backup › happy path completes in under a minute (47s)
  ✓  smoke: scrape /metrics › metrics endpoint exposes the host gauge (180ms)
  2 passed (47.5s)
 ```
 Tear-down:
 ```sh
 docker compose -f e2e/compose.e2e.yml down -v
 ```
 `-v` removes the named volumes too — important between runs because
 the rest-server volume holds an initialised repo and the
 agent-config volume holds a stale bearer.
 ## What the test exercises
 1. **Bootstrap.** Posts the admin-creation request to
   `/api/bootstrap` with the token scraped from the server log.
 2. **Login (UI).** Drives the login form via Playwright; verifies
   the dashboard loads with a session cookie set.
 3. **Pending host appears.** Polls the dashboard for the inline
   accept form generated by the announcing agent; reads the
   pending-id out of its action URL.
 4. **Accept.** POSTs `/api/pending-hosts/{id}/accept` with the
   rest-server URL + repo password. The server mints a Host row
   + bearer + AEAD-encrypted creds and pushes the bearer down
   the still-open pending WebSocket.
 5. **Online + auto-init.** Polls `/api/hosts` until the new host
   is `status=online`. Auto-init runs as part of this — the
   first dispatched job after creds save is `restic init`.
 6. **Run backup.** Submits the host detail page's `Run now`
   form; expects `HX-Redirect` to the live job page.
 7. **Verify.** Polls `/api/hosts` until the host's
   `last_backup_status` flips to `succeeded`.
 8. **Metrics.** Scrapes `/metrics` and asserts the
   server-gauge + build-info lines are present (the compose
   stack opens the endpoint via `RM_METRICS_TRUSTED_CIDR=0.0.0.0/0`).
 ## CI workflow
 [`.gitea/workflows/e2e.yml`](../.gitea/workflows/e2e.yml) runs the
 suite on every PR into `main`. On failure it dumps the last 200
 lines of each container log as a workflow annotation and uploads
 the Playwright HTML report as an artefact.
 ## When tests fail
 - **Pending host never appears.** Agent container probably
  couldn't reach the server. Check `docker compose logs agent`
  for connection errors and `docker compose logs server` for
  any 4xx on `/api/agents/announce`.
 - **Backup hangs in `running`.** The agent shells out to
  `restic`; check the live job log at
  `http://127.0.0.1:8080/jobs/<id>` (still up after a
  failed test as long as you didn't `down -v`).
 - **`RM_BOOTSTRAP_TOKEN not set`.** The server log scrape
  matched the wrong line or the token regex is too tight. The
  server prints the token on a line starting with `    ` (four
  spaces) inside a banner; widen the regex if your server log
  format changes.
 ## Adding new tests
 The harness is intentionally flat — one `*.spec.ts` per
 scenario. Reuse the helpers in `lib/server.ts` and avoid
 duplicating bootstrap / login boilerplate. Heavy fixtures
 (custom users, OIDC IdP) belong in their own compose override
 file rather than complicating `compose.e2e.yml`.
@@ -0,0 +1,139 @@
 # Prometheus + Grafana
 restic-manager exposes a Prometheus scrape endpoint at `GET /metrics`.
 The endpoint is **opt-in** — it is not mounted at all unless you set
 at least one of the auth gates below. Once enabled, it serves the
 standard `text/plain` exposition format that every Prometheus
 release since 2.x parses without configuration.
 A sample Grafana dashboard lives at
 `deploy/grafana/restic-manager-dashboard.json`.
 ## Enable the endpoint
 Two switches, both off by default. If both are set, both must pass
 (token AND source-IP); if only one is set, that gate alone
 authorises a scrape.
 | Env var                    | YAML key               | Effect |
 |----------------------------|------------------------|--------|
 | `RM_METRICS_TOKEN`         | `metrics_token`        | Requires `Authorization: Bearer <token>`. Compared in constant time. |
 | `RM_METRICS_TRUSTED_CIDR`  | `metrics_trusted_cidrs` (list) | Restricts the source IP to one of the listed CIDRs. Comma-separated in env, list in YAML. Honours `X-Forwarded-For` only when the immediate hop matches `RM_TRUSTED_PROXY`. |
 When neither is set, `GET /metrics` returns 404 — the route is not
 registered with the chi router so a forgotten config can't
 accidentally publish fleet state.
 ### Example: Docker
 ```yaml
 services:
  restic-manager:
    image: gitea.dcglab.co.uk/steve/restic-manager:latest
    environment:
      RM_METRICS_TOKEN: "${RM_METRICS_TOKEN}"   # interpolated from the host environment or an .env file
      RM_METRICS_TRUSTED_CIDR: "10.0.0.0/8"
 ```
 (The `_FILE` convention, e.g. `RM_METRICS_TOKEN_FILE` pointing at a
 Docker secret, is not currently supported; set `RM_METRICS_TOKEN`
 directly. `_FILE` support is on the roadmap.)
 ## Prometheus scrape config
 Drop into your `prometheus.yml`:
 ```yaml
 scrape_configs:
  - job_name: restic-manager
    metrics_path: /metrics
    scheme: https            # via your reverse proxy
    static_configs:
      - targets: ['restic.example.com']
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/secrets/rm_metrics_token
 ```
 If you don't run a TLS-terminating proxy in front, drop `scheme:
 https` (the server is HTTP-only — see `docs/reverse-proxy.md`).
 ## Metric reference
 All names are `rm_`-prefixed. Per-host metrics carry a `host_id`
 label (the stable ULID, immune to renames) and a `host` label
 (the human-readable name).
 ### Server gauges
 | Name                  | Labels                             | Description |
 |-----------------------|------------------------------------|-------------|
 | `rm_hosts_total`      | —                                  | Total number of enrolled hosts (excludes pending announces). |
 | `rm_hosts_online`     | —                                  | Number of hosts with `status='online'`. |
 | `rm_active_alerts`    | `severity` ∈ {info, warning, critical} | Open alerts by severity. |
 | `rm_build_info`       | `version, commit, go_version`      | Always 1; pure label-bag for joining. |
 ### Per-host gauges
 | Name                                       | Description |
 |--------------------------------------------|-------------|
 | `rm_host_agent_online`                     | 1 if the agent is currently online, 0 otherwise. |
 | `rm_host_last_backup_timestamp_seconds`    | Unix timestamp of the host's most recent backup. **Omitted** for hosts with no backup yet. |
 | `rm_host_last_backup_success`              | 1 if the most recent backup succeeded, 0 otherwise. **Omitted** for hosts with no backup yet. |
 | `rm_host_repo_size_bytes`                  | Latest reported repo size from `restic stats --mode raw-data`. **Omitted** when unknown. |
 | `rm_host_snapshot_count`                   | Number of restic snapshots known on the host's repo. |
 | `rm_host_open_alerts`                      | Number of currently open alerts attached to this host. |
 | `rm_host_repo_status`                      | Always 1; the `status` label carries `unknown` / `ready` / `init_failed`. |
 ### Job duration histogram
 ```
 rm_job_duration_seconds_bucket{kind, status, le}
 rm_job_duration_seconds_sum{kind, status}
 rm_job_duration_seconds_count{kind, status}
 ```
 `kind` ∈ {backup, forget, prune, check, unlock, restore, diff, init, update}.
 `status` ∈ {succeeded, failed, cancelled}.
 Buckets (seconds):
 ```
 1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf
 1s   5s  30s  1m  5m   30m   1h    6h    24h
 ```
 The histogram is in-memory only — values reset on process restart.
 Operators who want durable history should let Prometheus persist
 the scrapes; restic-manager itself is a control plane, not a
 metrics database.
 ## Grafana dashboard
 Import `deploy/grafana/restic-manager-dashboard.json`:
 1. In Grafana, **+ → Import → Upload JSON file**.
 2. Pick the Prometheus data source you scrape with.
 3. The dashboard's six panels populate from the metrics above:
   * **Fleet status** — online/total stat panel.
   * **Open alerts** — by severity.
   * **Hosts** — per-host table (last backup, repo size, snapshots, alerts).
   * **Repo size over time** — one line per host.
   * **Backups failing** — count of hosts whose last backup didn't succeed.
   * **Job duration p95** — `histogram_quantile(0.95, …)` over a 1h window per kind.
 Alerting is intentionally not configured in the dashboard — the
 control plane already has alerts (P3-05) with native channels for
 webhook, ntfy, and SMTP. Re-implementing them in Prometheus would
 just duplicate state. If you do want Prom-side alerts, write the
 rules in your own Prometheus rule files against the metrics above.
 ## Cardinality
 Per scrape: O(hosts) gauge rows + O(kinds × statuses × buckets)
 histogram rows. A 100-host fleet emits roughly 700 host rows + 270
 histogram rows — well below any practical limit. There are no
 `job_id` labels (cardinality bomb avoidance) and no per-source-group
 labels.
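
The figures above follow from the tables on this page: seven per-host gauges, nine job kinds, three statuses, and ten buckets (nine finite bounds plus `+Inf`). A small worked calculation, with constant names chosen here for illustration:

```go
package main

import "fmt"

const (
	perHostGauges = 7  // rows in the per-host gauge table
	jobKinds      = 9  // backup, forget, prune, check, unlock, restore, diff, init, update
	jobStatuses   = 3  // succeeded, failed, cancelled
	bucketsPerKey = 10 // nine finite upper bounds plus +Inf
)

// hostRows is an upper bound: some gauges are omitted for hosts with no data.
func hostRows(hosts int) int { return hosts * perHostGauges }

// histogramBucketRows counts the _bucket series if every kind/status pair has been observed.
func histogramBucketRows() int { return jobKinds * jobStatuses * bucketsPerKey }

// histogramSumCountRows counts the accompanying _sum and _count series.
func histogramSumCountRows() int { return jobKinds * jobStatuses * 2 }

func main() {
	fmt.Println(hostRows(100), histogramBucketRows(), histogramSumCountRows()) // 700 270 54
}
```

Only the host term grows with fleet size; the histogram term is a fixed ceiling.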
@@ -0,0 +1,61 @@
 # Plan — P6-04 + P6-05 Prometheus metrics + Grafana dashboard
 Spec: `docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md`
 ## Step 1 — Config wiring
 - Add fields to `internal/server/config/config.go`:
  - `MetricsToken string` (yaml `metrics_token`)
  - `MetricsTrustedCIDRs []string` (yaml `metrics_trusted_cidrs`)
  - method `(c Config) MetricsAuthEnabled() bool` returning true iff at least one of the two is configured.
 - Env loading: `RM_METRICS_TOKEN` and `RM_METRICS_TRUSTED_CIDR` (comma-CIDR).
 - `validate()` extension: ensure each CIDR parses (reuse the same `netip.ParsePrefix` pattern that already validates `TrustedProxies`).
 - Tests: extend `config_test.go` covering both env vars + happy/sad CIDR.
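
A rough sketch of what Step 1 amounts to, assuming a trimmed-down `Config` with only the two new fields. The `splitCIDRs` helper and the error wording are hypothetical; the real `validate()` covers more than this.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// Config carries only the two metrics fields for the purposes of this sketch.
type Config struct {
	MetricsToken        string
	MetricsTrustedCIDRs []string
}

// MetricsAuthEnabled reports whether at least one gate is configured.
func (c Config) MetricsAuthEnabled() bool {
	return c.MetricsToken != "" || len(c.MetricsTrustedCIDRs) > 0
}

// splitCIDRs turns the comma-separated env value into a list, dropping blanks.
func splitCIDRs(raw string) []string {
	var out []string
	for _, part := range strings.Split(raw, ",") {
		if p := strings.TrimSpace(part); p != "" {
			out = append(out, p)
		}
	}
	return out
}

// validate rejects any entry that netip.ParsePrefix cannot parse.
func (c Config) validate() error {
	for _, cidr := range c.MetricsTrustedCIDRs {
		if _, err := netip.ParsePrefix(cidr); err != nil {
			return fmt.Errorf("metrics_trusted_cidrs: %q: %w", cidr, err)
		}
	}
	return nil
}

func main() {
	good := Config{MetricsTrustedCIDRs: splitCIDRs("10.0.0.0/8, 192.168.1.0/24")}
	fmt.Println(good.MetricsAuthEnabled(), good.validate())

	bad := Config{MetricsTrustedCIDRs: splitCIDRs("10.0.0.0/33")}
	fmt.Println(bad.validate() != nil)

	fmt.Println(Config{}.MetricsAuthEnabled())
}
```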
 ## Step 2 — `internal/server/metrics` package
 - `Registry` struct: `sync.Mutex`, `map[jobKey]*histogramState` where `jobKey = struct{kind,status string}`.
 - `ObserveJob(kind, status string, dur time.Duration)` — clamps negative durations to 0; locks; bumps the right bucket + sum + count.
 - `Snapshot() Snapshot` — copies state under lock; returns plain value type.
 - `Snapshot` carries `Histogram` rows (kind, status, buckets, sum, count) and accepts the rest from the caller (host rows, alert counts, build info).
 - `Render(w io.Writer, s Snapshot) error` — emits standard text exposition with stable line ordering. No external dep; manual escape of `\` `"` `\n` in label values per the Prom format spec.
 - Unit tests: golden render, concurrent observe, bucket boundaries.
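
The manual escaping mentioned for `Render` is small. The text exposition format requires backslash, double quote, and line feed to be escaped inside label values; a sketch of that, with hypothetical helper names and a placeholder host id:

```go
package main

import (
	"fmt"
	"strings"
)

// labelEscaper rewrites the three characters the text format requires escaping.
// strings.Replacer scans the input once, so inserted backslashes are not re-escaped.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// escapeLabelValue makes s safe to place between double quotes in a label.
func escapeLabelValue(s string) string { return labelEscaper.Replace(s) }

// gaugeLine renders one per-host sample line.
func gaugeLine(name, hostID, host string, value float64) string {
	return fmt.Sprintf("%s{host_id=\"%s\",host=\"%s\"} %g",
		name, escapeLabelValue(hostID), escapeLabelValue(host), value)
}

func main() {
	// "01HEXAMPLE" is a placeholder, not a real ULID.
	fmt.Println(gaugeLine("rm_host_agent_online", "01HEXAMPLE", `db "primary"`, 1))
}
```

Host names are operator-supplied, so the `host` label is the value most likely to need this.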
 ## Step 3 — HTTP handler
 - New `internal/server/http/metrics.go`:
  - `(s *Server) handleMetrics(w, r)` — calls `authoriseMetricsScrape`, then `gatherSnapshot(ctx)` then `metrics.Render`.
   - `authoriseMetricsScrape(r, cfg) (ok bool, status int)`: pure helper; bearer token compared with `subtle.ConstantTimeCompare`; CIDR check on `r.RemoteAddr` first, then on `X-Forwarded-For` if a trusted proxy fronted us (mirror `realIP`'s logic; the simplest path is to call the `chi/middleware.RealIP`-aware lookup that the existing handlers use).
  - `gatherSnapshot(ctx)` — assembles the snapshot from `Store.ListHosts`, `Store.ListAlerts({Status:"open"})`, the metrics registry, and `version.Version`/`version.Commit`/`runtime.Version()`.
 - Route mounted in `server.go` only if `s.deps.Cfg.MetricsAuthEnabled()`.
 - `Deps` grows a `Metrics *metrics.Registry` field; nil-tolerant in handlers.
 ## Step 4 — Hook job-finished
 - `internal/server/ws/handler.go`:
  - `HandlerDeps` grows `Metrics *metrics.Registry`.
  - In the `MsgJobFinished` branch, after the `GetJob` lookup we already do, observe `(job.Kind, p.Status.String(), p.FinishedAt.Sub(deref(job.StartedAt)))`. Skip if `job.StartedAt` is nil (rare race).
 - `cmd/server` wires the registry into both `Deps` and `HandlerDeps` from a single instance.
 ## Step 5 — Tests
 - `internal/server/metrics/registry_test.go` — observe + snapshot determinism.
 - `internal/server/metrics/render_test.go` — golden output for a fixed snapshot.
 - `internal/server/http/metrics_test.go` — auth matrix (six cases per the spec) using the existing `newTestServer` fixture pattern. Render snapshot includes ≥1 host so we exercise the gather path end-to-end.
 ## Step 6 — Docs + dashboard (P6-05)
 - `docs/prometheus.md` — enable + scrape config + metric reference + dashboard import.
 - `deploy/grafana/restic-manager-dashboard.json` — six-panel dashboard. Hand-authored against Grafana 11 dashboard schema (uid, schemaVersion, panels with `targets[].expr`, datasource as variable). Validated by importing into Grafana — but since we can't run Grafana in CI, the structural sniff test is just that the JSON parses and contains the expected panel titles + datasource variable.
 ## Step 7 — Tasks.md + verification
 - Strike P6-04, P6-05 in `tasks.md`; add an "as shipped" note mirroring the prior P6 entries.
 - Run `go vet ./...`, `go test ./...`, `make build`.
 - Push branch (no PR per standing instruction).
 ## Risk register
 - **CIDR check for proxied scrapes** — easy to mis-implement, easy to mis-document. The handler test must exercise both "direct hit" and "X-Forwarded-For" paths.
 - **Histogram lock contention** — every job finish takes the mutex. Job throughput is low (a few/min/host max), and `ObserveJob` is a couple of map lookups; no risk in practice.
 - **Dashboard JSON drift** — Grafana versions evolve. Pinning `schemaVersion` and using only well-supported panel types (timeseries, stat, table) keeps the import working across recent versions.
@@ -0,0 +1,175 @@
 # P6-04 + P6-05 — Prometheus `/metrics` + Grafana dashboard
 Date: 2026-05-07
 Author: Claude (autonomous, sensible-defaults brief from operator)
 Tasks: P6-04 (M), P6-05 (S)
 ## Problem
 The control plane already knows everything a backup operator needs
 to monitor — last-backup timestamp + status, repo size, snapshot
 count, agent online, open alerts, build version — but it surfaces
 those only through the dashboard HTML and a few JSON endpoints. To
 plug into the operator's existing observability stack we need a
 plain Prometheus exposition endpoint and a Grafana dashboard JSON
 that reads from it.
 ## Goals
 - `GET /metrics` emits standard Prometheus text-format with the
  per-host, server, and job-duration metrics enumerated in the
  task entry (P6-04 in `tasks.md`).
 - Endpoint is opt-in and gated by a bearer token and/or an IP
  allow-list — never publicly readable by default.
 - No new third-party dependency (`prometheus/client_golang` is not
  pulled in). The exposition format is small and stable enough to
  emit by hand; matches the repo's "no Tailwind/Node" style.
 - Sample Grafana dashboard committed to the repo so a stranger can
  drop it into a Grafana instance and get a working view.
 ## Non-goals
 - OpenMetrics (the legacy text format with `# HELP`/`# TYPE` is
  what every prom server still parses and what every example
  online demonstrates — pick the boring option).
 - Pushgateway or remote-write integration.
 - Per-job metric cardinality (no `job_id` labels — that would
  make the histogram explode).
 - Alerting rules. Operators already have alerts inside
  restic-manager (P3-05); duplicating them in Prometheus is a
  YAGNI hazard. The dashboard is read-only.
 ## Auth
 Two switches, both off by default. If neither is set the route
 isn't mounted at all (404 from the chi router) — this avoids any
 accidental "wide-open scrape endpoint" deployment.
 | env var | type | meaning |
 | --- | --- | --- |
 | `RM_METRICS_TOKEN` | string | If set, callers must send `Authorization: Bearer <token>`. Compared with `crypto/subtle.ConstantTimeCompare`. |
 | `RM_METRICS_TRUSTED_CIDR` | comma-CIDR | If set, callers must hit from a source IP inside one of these CIDRs. Reuses the existing `RM_TRUSTED_PROXY` semantics for honouring `X-Forwarded-For`. |
 If both are set, both must pass (AND, not OR — a token leak doesn't grant access from outside the network, and a trusted-network compromise doesn't grant access without the token). If only one is set, that one alone gates access.
 YAML overlay mirrors env: `metrics_token`, `metrics_trusted_cidrs`.
 ## Metrics
 All metric names are prefixed `rm_`. Help text is concise.
 ### Per-host gauges (one row per `host_id`)
 ```
 rm_host_agent_online{host_id,host}                     1 if status='online' else 0
 rm_host_last_backup_timestamp_seconds{host_id,host}    unix seconds; omitted if no backup yet
 rm_host_last_backup_success{host_id,host}              1 if last_backup_status='succeeded' else 0; omitted if no backup yet
 rm_host_repo_size_bytes{host_id,host}                  total_size from latest repo stats; omitted if unknown
 rm_host_snapshot_count{host_id,host}                   integer
 rm_host_open_alerts{host_id,host}                      count of open + un-resolved alerts attached to this host
 rm_host_repo_status{host_id,host,status}               1, with status ∈ {unknown,ready,init_failed} (info-style, exactly one row per host)
 ```
 `host` label is `hosts.name` for human readability; `host_id` is
 the stable ULID for joining across renames.
 ### Server gauges
 ```
 rm_hosts_total                              count of hosts (excludes pending)
 rm_hosts_online                             count of hosts with status='online'
 rm_active_alerts{severity}                  count of open alerts by severity ∈ {info,warning,critical}
 rm_build_info{version,commit,go_version}    always 1; pure label-bag for joining
 ```
 ### Job duration histogram
 ```
 rm_job_duration_seconds_bucket{kind,status,le=...}
 rm_job_duration_seconds_sum{kind,status}
 rm_job_duration_seconds_count{kind,status}
 ```
 `kind` ∈ {backup,forget,prune,check,unlock,restore,diff,init,update}
 (every JobKind we currently dispatch). `status` ∈
 {succeeded,failed,cancelled}. Buckets cover the realistic range —
 short admin commands (unlock, init) finish in seconds; backups can
 be hours:
 ```
 1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf
   (1s   5s  30s  1m   5m  30m   1h    6h   24h)
 ```
 In-memory only. Reset on process restart — operators who want
 durable history scrape into Prom and let it persist.
 ## Architecture
 New package `internal/server/metrics`:
 - `Registry` — owns the histogram state (sync.Mutex + map keyed by
  `kind+status`). `ObserveJob(kind, status string, dur time.Duration)`
  is the only mutator. Lookups via `Snapshot()` are read-only and
  copy out.
 - `Render(w io.Writer, snapshot Snapshot)` — emits the full
  exposition body. The snapshot is supplied by the HTTP handler
  pulling from `Store` on each scrape; the package itself has no
  store dependency, which keeps it trivially unit-testable.
 New file `internal/server/http/metrics.go`:
 - `handleMetrics(w, r)` — auth check (bearer + CIDR), pull current
  fleet snapshot from `Store`, ask `metrics.Render` to emit.
 - Auth helper `authoriseMetricsScrape(r)` — pure function over
  request + config; tested directly.
 Wiring:
 - `cmd/server` constructs the `metrics.Registry` once and threads
  it into both `Deps` (for the HTTP layer) and `ws.HandlerDeps`
  (so the job-finished branch can call `ObserveJob`).
 - `ws/handler.go` MsgJobFinished branch grows a single guarded call:
   `if deps.Metrics != nil && job.StartedAt != nil { deps.Metrics.ObserveJob(job.Kind, p.Status.String(), p.FinishedAt.Sub(*job.StartedAt)) }`.
   It is a no-op if the registry was never wired (tests) or the job has no start time.
 Route registration in `server.go`:
 ```go
 if s.deps.Cfg.MetricsAuthEnabled() {
    r.Get("/metrics", s.handleMetrics)
 }
 ```
 ## Cardinality + cost
 Per scrape: O(hosts) gauge rows + O(kinds × statuses × buckets) histogram rows. For a 100-host fleet that's ~700 host rows + ~270 histogram rows per scrape — well under any practical limit. The store reads we issue per scrape are: `ListHosts` (already exists, one query), `ListAlerts` filtered by open status (one query), `GetHostRepoStats` already projected onto `Host` via `repo_size_bytes`. No N+1.
 A 10s scrape interval against a 100-host fleet is cheap: each scrape is a couple of indexed sqlite reads + a small string render. We're nowhere near a place where caching the snapshot would be worthwhile.
 ## Documentation (P6-05)
 - `docs/prometheus.md` — sibling to the existing `docs/reverse-proxy.md`. Sections: enabling the endpoint (env vars), Prometheus scrape config snippet (with bearer + tls), the metric reference table (copy-pasted from this spec), the dashboard import instructions.
 - `deploy/grafana/restic-manager-dashboard.json` — Grafana 11+ dashboard JSON. Single Prometheus datasource variable, six panels:
  1. **Fleet status** — stat panel showing `rm_hosts_online / rm_hosts_total` + a sparkline.
  2. **Open alerts** — stat panel by severity (`sum by (severity) (rm_active_alerts)`).
  3. **Hosts** — table of `host`, `online`, `last_backup` (relative time via `time() - rm_host_last_backup_timestamp_seconds`), `repo_size`, `snapshots`.
  4. **Repo size over time** — time series, one line per host, `rm_host_repo_size_bytes`.
  5. **Backups failing** — time series counting hosts where `rm_host_last_backup_success == 0`.
  6. **Job duration p95** — `histogram_quantile(0.95, sum by (kind, le) (rate(rm_job_duration_seconds_bucket[1h])))` over a 1h window.
 Dashboard is committed as plain JSON; an operator imports it through the Grafana UI ("+ → Import → upload JSON file") or Grafana provisioning.
 ## Testing
 - Unit tests for `metrics.Render` against a fixed snapshot — golden-file style. Pin the exact line ordering (sorted by metric name + label set) so diffs stay tractable.
 - Unit tests for `metrics.Registry.ObserveJob` — concurrent writes, bucket boundary correctness, snapshot independence.
 - Handler tests for `/metrics` covering: no auth configured → 404; token configured + missing → 401; token configured + correct → 200 + body sniff; CIDR configured + wrong source → 401; CIDR configured + right source → 200; both configured → require both.
 - End-to-end smoke verification deferred to manual operator walk-through; full Playwright pass is P5-06's job.
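The golden-file approach in the first bullet depends on the renderer producing a stable line order. A minimal sketch of that ordering rule follows, assuming a flattened `sample` type and a `renderSorted` helper; both are hypothetical and stand in for whatever shape `metrics.Render` actually uses.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sample is a hypothetical flattened metric line.
type sample struct {
	name   string
	labels string // pre-rendered label set, empty when there are no labels
	value  float64
}

// renderSorted emits one line per sample, ordered by metric name and
// then by label set, so repeated renders of the same data are identical.
func renderSorted(samples []sample) string {
	sorted := append([]sample(nil), samples...) // leave the caller's slice untouched
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].name != sorted[j].name {
			return sorted[i].name < sorted[j].name
		}
		return sorted[i].labels < sorted[j].labels
	})

	var b strings.Builder
	for _, s := range sorted {
		if s.labels == "" {
			fmt.Fprintf(&b, "%s %g\n", s.name, s.value)
		} else {
			fmt.Fprintf(&b, "%s{%s} %g\n", s.name, s.labels, s.value)
		}
	}
	return b.String()
}

func main() {
	fmt.Print(renderSorted([]sample{
		{name: "rm_hosts_total", value: 2},
		{name: "rm_host_last_backup_success", labels: `host="beta"`, value: 0},
		{name: "rm_host_last_backup_success", labels: `host="alpha"`, value: 1},
	}))
}
```

A golden test can then compare the rendered string byte-for-byte against a checked-in file; any diff points at a real change in the exposed metrics rather than at map iteration order.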
 ## Out of scope, explicitly
 - Per-job latency tracking with `job_id` labels (cardinality bomb).
 - Restore-specific metrics (P3 surfaces are still settling).
 - Histograms keyed on host (kind × status × buckets × hosts is a Prom anti-pattern).
 - Auto-discovery / file-SD generators for Prometheus.
@@ -0,0 +1,42 @@
 # Build a Linux container that runs the restic-manager agent against a
 # sibling rest-server in the e2e compose stack. Used only by tests
 # (e2e/compose.e2e.yml + .gitea/workflows/e2e.yml).
 #
 # Two stages:
 #   1. golang:alpine to build the agent binary.
 #   2. alpine:3.20 with the `restic` package + the built binary.
 #
 # Base images are pinned by version tag (not by digest) for CI reproducibility.
 FROM golang:1.25-alpine AS build
 WORKDIR /src
 ENV CGO_ENABLED=0 \
    GOFLAGS="-trimpath"
 COPY go.mod go.sum* ./
 RUN go mod download
 COPY . .
 ARG VERSION=e2e
 RUN go build -ldflags="-s -w -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=${VERSION}" \
        -o /out/restic-manager-agent ./cmd/agent
 FROM alpine:3.20
 RUN apk add --no-cache restic ca-certificates curl
 COPY --from=build /out/restic-manager-agent /usr/local/bin/restic-manager-agent
 # Agents normally run as root because backup paths often need it. The
 # e2e fixture only backs up paths under /source (the volume mounted by
 # e2e/compose.e2e.yml), so this container would tolerate a non-root
 # user, but staying root keeps parity with the production install.
 USER root
 # The agent needs a writable directory for its config + secrets store.
 RUN mkdir -p /etc/restic-manager /var/lib/restic-manager-agent
 ENV RM_AGENT_CONFIG=/etc/restic-manager/agent.yaml
 # The compose entrypoint sets the announce URL via env.
 COPY e2e/agent-entrypoint.sh /usr/local/bin/entrypoint.sh
 RUN chmod +x /usr/local/bin/entrypoint.sh
 ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]
@@ -0,0 +1,21 @@
 # Playwright runner for the e2e suite. Built and run by
 # e2e/compose.e2e.yml so the test process sits on the same docker
 # network as the server, agent, and rest-server. The previous setup
 # ran Playwright on the workflow runner host and reached the server
 # via 127.0.0.1:8080; that fails on Gitea's act-style runners
 # because the workflow steps execute inside a runner container,
 # not on the host where compose publishes its ports.
 FROM mcr.microsoft.com/playwright:v1.59.1-jammy
 WORKDIR /work
 # Install npm deps in a separate layer keyed off package.json so
 # changes to specs don't bust the dep cache.
 COPY e2e/playwright/package.json /work/package.json
 RUN npm install --no-audit --no-fund
 COPY e2e/playwright/ /work/
 ENV CI=1
 ENTRYPOINT ["npx", "playwright", "test"]
@@ -0,0 +1,27 @@
 #!/bin/sh
 # Entrypoint for the e2e agent container.
 #
 # Three states:
 #   1. Already enrolled (agent.yaml has a bearer): run the agent.
 #   2. Token supplied via $RM_ENROL_TOKEN: enrol then run.
 #   3. Otherwise: announce against $RM_SERVER and wait for an admin to
 #      accept us. The announce flow blocks until accepted, then drops
 #      straight into the normal run loop, so this is the test-friendly
 #      path.
 set -eu
 CFG="${RM_AGENT_CONFIG:-/etc/restic-manager/agent.yaml}"
 SERVER="${RM_SERVER:?set RM_SERVER}"
 if [ -f "$CFG" ] && grep -q '^agent_token:' "$CFG"; then
    exec restic-manager-agent -config "$CFG"
 fi
 if [ -n "${RM_ENROL_TOKEN:-}" ]; then
    exec restic-manager-agent -config "$CFG" \
        -enroll-server "$SERVER" \
        -enroll-token "$RM_ENROL_TOKEN"
 fi
 # Announce-and-approve: blocks until an admin accepts, then runs.
 exec restic-manager-agent -config "$CFG" -enroll-server "$SERVER"
@@ -0,0 +1,108 @@
 # End-to-end test stack — used by .gitea/workflows/e2e.yml and by
 # operators who want to run the Playwright suite locally.
 #
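 # Three long-running services (plus the profile-gated playwright runner
 # and the one-shot source-fixture container defined further down):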
 #   * server      — restic-manager built from the working tree
 #   * agent       — restic-manager agent built from the working tree
 #                   (announces; Playwright accepts it during the test)
 #   * rest-server — the actual restic backend, sibling of the agent
 #
 # Run from the repo root:
 #   docker compose -f e2e/compose.e2e.yml up --build --abort-on-container-exit
 services:
  rest-server:
    image: restic/rest-server:0.13.0
    environment:
      DATA_DIR: /data
      OPTIONS: "--no-auth"
    volumes:
      - rest-data:/data
    networks: [rmnet]
  server:
    build:
      context: ..
      dockerfile: deploy/Dockerfile.server
      args:
        VERSION: e2e
    environment:
      RM_LISTEN: ":8080"
      RM_DATA_DIR: "/data"
      RM_BASE_URL: "http://server:8080"
      RM_COOKIE_SECURE: "false"
      # Bind the metrics endpoint loose for the test, so one of the
      # Playwright assertions can exercise it.
      RM_METRICS_TRUSTED_CIDR: "0.0.0.0/0"
    volumes:
      - server-data:/data
    ports:
      - "127.0.0.1:8080:8080"
    healthcheck:
      test: ["CMD", "/usr/local/bin/restic-manager-server", "--version"]
      interval: 2s
      timeout: 2s
      retries: 30
    networks: [rmnet]
  agent:
    build:
      context: ..
      dockerfile: e2e/Dockerfile.agent
      args:
        VERSION: e2e
    environment:
      RM_SERVER: "http://server:8080"
    depends_on:
      - server
    volumes:
       # Source paths the agent backs up. The source-fixture service
       # pre-populates this with a few files so the snapshot list isn't empty.
      - source-data:/source
      - agent-config:/etc/restic-manager
      - agent-state:/var/lib/restic-manager-agent
    networks: [rmnet]
  # Playwright test runner. Profile-gated so `compose up` doesn't
  # start it; CI runs it via `compose run --rm playwright`. Lives on
  # rmnet so it can reach the server via its compose-network DNS
  # name rather than depending on host port-publish (which doesn't
  # work on Gitea's container-based runners).
  playwright:
    profiles: [test]
    build:
      context: ..
      dockerfile: e2e/Dockerfile.playwright
    environment:
      RM_BASE_URL: "http://server:8080"
      RM_BOOTSTRAP_TOKEN: "${RM_BOOTSTRAP_TOKEN:-}"
    volumes:
      - ./playwright/playwright-report:/work/playwright-report
      - ./playwright/test-results:/work/test-results
    depends_on:
      - server
      - agent
    networks: [rmnet]
  # One-shot init container that drops a couple of files into the
  # source volume so backups have something to snapshot.
  source-fixture:
    image: alpine:3.20
    command: >
      sh -c 'mkdir -p /source && echo "hello world" > /source/hello.txt &&
             echo "another file" > /source/two.txt && sleep 0.2'
    volumes:
      - source-data:/source
    networks: [rmnet]
    restart: "no"
 volumes:
  server-data:
  rest-data:
  source-data:
  agent-config:
  agent-state:
 networks:
  rmnet:
    driver: bridge
@@ -0,0 +1,14 @@
 {
  "name": "restic-manager-e2e",
  "version": "0.0.0",
  "private": true,
  "type": "module",
  "scripts": {
    "test": "playwright test",
    "test:headed": "playwright test --headed",
    "test:debug": "PWDEBUG=1 playwright test"
  },
  "devDependencies": {
    "@playwright/test": "1.59.1"
  }
 }
@@ -0,0 +1,31 @@
 import { defineConfig, devices } from '@playwright/test';
 // Single-target Chromium config: the e2e suite is narrow (smoke
 // the production-shaped flow against the docker-compose stack).
 // Cross-browser matrix doesn't add signal — what we're verifying is
 // the server's HTML and the agent's WebSocket handshake, neither of
 // which depends on browser engine.
 const baseURL = process.env.RM_BASE_URL ?? 'http://127.0.0.1:8080';
 export default defineConfig({
    testDir: './tests',
    timeout: 60_000,
    expect: { timeout: 10_000 },
    fullyParallel: false,
    retries: process.env.CI ? 1 : 0,
    workers: 1,
    reporter: [['list'], ['html', { open: 'never' }]],
    use: {
        baseURL,
        trace: 'retain-on-failure',
        screenshot: 'only-on-failure',
        video: 'retain-on-failure',
    },
    projects: [
        {
            name: 'chromium',
            use: { ...devices['Desktop Chrome'] },
        },
    ],
 });
@@ -0,0 +1,114 @@
 // Helpers used by every test. The shape favours the JSON API for
 // reads + accept/dispatch (deterministic, easy to assert) and the
 // browser for human-facing surfaces (login form, dashboard render).
 import { APIRequestContext, expect, Page } from '@playwright/test';
 export const baseURL = process.env.RM_BASE_URL ?? 'http://127.0.0.1:8080';
 export interface HostJSON {
    id: string;
    name: string;
    status: string;
    last_backup_status?: string;
 }
 export async function readBootstrapToken(): Promise<string> {
    const tok = process.env.RM_BOOTSTRAP_TOKEN;
    if (!tok) {
        throw new Error('RM_BOOTSTRAP_TOKEN not set — the harness scrapes it from server logs');
    }
    return tok;
 }
 export async function bootstrapAdmin(
    request: APIRequestContext,
    {
        username = 'admin',
        password = 'e2e-test-password-1234',
    }: { username?: string; password?: string } = {},
 ): Promise<{ username: string; password: string }> {
    const token = await readBootstrapToken();
    const res = await request.post(`${baseURL}/api/bootstrap`, {
        data: { token, username, password },
    });
    if (!res.ok() && res.status() !== 409 /* already bootstrapped */) {
        throw new Error(`bootstrap: ${res.status()} ${await res.text()}`);
    }
    return { username, password };
 }
 export async function loginViaUI(page: Page, username: string, password: string): Promise<void> {
    await page.goto(`${baseURL}/login`);
    await page.locator('#login-username').fill(username);
    await page.locator('#login-password').fill(password);
    await Promise.all([
        page.waitForURL(new RegExp(`^${baseURL}/?$`)),
        page.locator('form[action="/login"] button[type="submit"]').click(),
    ]);
 }
 /**
 * Polls the dashboard until a pending host card is visible, then
 * extracts its pending-id from the inline accept form's action URL.
 */
 export async function waitForPendingHostID(page: Page): Promise<string> {
    const formLocator = page.locator('form[action^="/api/pending-hosts/"][action$="/accept"]').first();
    await expect(formLocator).toBeVisible({ timeout: 60_000 });
    const action = await formLocator.getAttribute('action');
    if (!action) throw new Error('pending host form has no action attribute');
    const m = action.match(/\/api\/pending-hosts\/([^/]+)\/accept/);
    if (!m) throw new Error(`unexpected action URL: ${action}`);
    return m[1];
 }
 export async function acceptPending(
    request: APIRequestContext,
    cookie: string,
    pendingID: string,
    repo: { url: string; username?: string; password: string },
 ): Promise<void> {
    const res = await request.post(`${baseURL}/api/pending-hosts/${pendingID}/accept`, {
        headers: { cookie, 'content-type': 'application/json' },
        data: {
            repo_url: repo.url,
            repo_username: repo.username ?? '',
            repo_password: repo.password,
        },
    });
    if (!res.ok()) {
        throw new Error(`accept: ${res.status()} ${await res.text()}`);
    }
 }
 export async function listHosts(request: APIRequestContext, cookie: string): Promise<HostJSON[]> {
    const res = await request.get(`${baseURL}/api/hosts`, { headers: { cookie } });
    if (!res.ok()) throw new Error(`list hosts: ${res.status()} ${await res.text()}`);
    const body = (await res.json()) as { items?: HostJSON[]; hosts?: HostJSON[] };
    return body.items ?? body.hosts ?? [];
 }
 export async function waitForHostStatus(
    request: APIRequestContext,
    cookie: string,
    matcher: (h: HostJSON) => boolean,
    timeoutMs = 60_000,
 ): Promise<HostJSON> {
    const deadline = Date.now() + timeoutMs;
    let last: HostJSON | undefined;
    while (Date.now() < deadline) {
        const hosts = await listHosts(request, cookie);
        const hit = hosts.find(matcher);
        if (hit) return hit;
        last = hosts[0];
        await new Promise((r) => setTimeout(r, 1_000));
    }
    throw new Error(`waitForHostStatus: timeout. Last seen: ${JSON.stringify(last)}`);
 }
 export async function getSessionCookie(page: Page): Promise<string> {
    const cookies = await page.context().cookies();
    const c = cookies.find((c) => c.name === 'rm_session');
    if (!c) throw new Error('rm_session cookie not set after login');
    return `${c.name}=${c.value}`;
 }
@@ -0,0 +1,80 @@
 // End-to-end smoke: bootstrap → accept pending host → run backup → see succeeded.
 //
 // The compose stack stands up a server, a sibling rest-server, and an
 // agent in announce-and-approve mode. This test drives the operator
 // path through the UI (login + dashboard) and the API
 // (accept + run-now + poll for terminal) — UI for the human surfaces,
 // API for the deterministic ones.
 import { test, expect } from '@playwright/test';
 import {
    baseURL,
    bootstrapAdmin,
    loginViaUI,
    waitForPendingHostID,
    acceptPending,
    waitForHostStatus,
    getSessionCookie,
 } from './lib/server';
 test.describe('smoke: enrol-via-announce → backup', () => {
    test('happy path completes in under a minute', async ({ page, request }) => {
        const { username, password } = await bootstrapAdmin(request);
        await loginViaUI(page, username, password);
        // Dashboard renders.
        await expect(page.locator('main')).toContainText(/host|fleet|pending/i, { timeout: 10_000 });
        // Pending host appears (the agent container has been
        // announcing since startup).
        const pendingID = await waitForPendingHostID(page);
        const cookie = await getSessionCookie(page);
        // Accept with the rest-server creds. compose's rest-server runs
        // --no-auth, so any credentials work; restic still demands a
        // password to encrypt the repo.
        await acceptPending(request, cookie, pendingID, {
            url: 'rest:http://rest-server:8000/',
            password: 'e2e-repo-password',
        });
        // Wait for the host to come online + auto-init to land.
        const onlineHost = await waitForHostStatus(
            request, cookie,
            (h) => h.status === 'online',
            60_000,
        );
        expect(onlineHost.id).toBeTruthy();
        // Trigger a backup via the UI form-post (HX-Redirect to /jobs/{id}).
        await page.goto(`${baseURL}/hosts/${onlineHost.id}`);
        await Promise.all([
            page.waitForURL(/\/jobs\//),
            page.locator('form[action$="/run-backup"] button[type="submit"]').first().click(),
        ]);
        // Wait for the host's last_backup_status to flip to 'succeeded'.
        // The job page itself is harder to assert on (it uses
        // server-pushed updates and a reload-on-finish pattern); the
        // host record is the source of truth and is what the dashboard
        // surfaces.
        const finishedHost = await waitForHostStatus(
            request, cookie,
            (h) => h.id === onlineHost.id && h.last_backup_status === 'succeeded',
            120_000,
        );
        expect(finishedHost.last_backup_status).toBe('succeeded');
    });
 });
 test.describe('smoke: scrape /metrics', () => {
    test('metrics endpoint exposes the host gauge', async ({ request }) => {
        // Compose sets RM_METRICS_TRUSTED_CIDR=0.0.0.0/0 so the
        // endpoint is open to the test runner.
        const res = await request.get(`${baseURL}/metrics`);
        expect(res.status()).toBe(200);
        const body = await res.text();
        expect(body).toContain('rm_hosts_total');
        expect(body).toContain('rm_build_info{');
    });
 });
@@ -41,6 +41,24 @@ type Config struct {
 	// DataDir. Source-build deployments can override via
 	// RM_BUNDLED_ASSETS_DIR.
 	BundledAssetsDir string `yaml:"bundled_assets_dir"`
 	// MetricsToken, if set, gates the /metrics scrape endpoint
 	// behind an `Authorization: Bearer <token>` check (constant-time
 	// compare). When neither this nor MetricsTrustedCIDRs is set,
 	// the route is not mounted at all (the endpoint is opt-in).
 	MetricsToken string `yaml:"metrics_token"`
 	// MetricsTrustedCIDRs, if non-empty, gates /metrics so only
 	// callers from these networks may scrape. ANDed with
 	// MetricsToken when both are set.
 	MetricsTrustedCIDRs []string `yaml:"metrics_trusted_cidrs"`
 }
 // MetricsAuthEnabled reports whether the operator has opted into
 // exposing the Prometheus scrape endpoint by configuring at least
 // one auth gate.
 func (c Config) MetricsAuthEnabled() bool {
 	return c.MetricsToken != "" || len(c.MetricsTrustedCIDRs) > 0
 }
 // Load resolves config in this order:
@@ -93,6 +111,19 @@ func Load(yamlPath string) (Config, error) {
 	if v, ok := os.LookupEnv("RM_BUNDLED_ASSETS_DIR"); ok {
 		c.BundledAssetsDir = v
 	}
 	if v, ok := os.LookupEnv("RM_METRICS_TOKEN"); ok {
 		c.MetricsToken = v
 	}
 	if v, ok := os.LookupEnv("RM_METRICS_TRUSTED_CIDR"); ok {
 		parts := strings.Split(v, ",")
 		c.MetricsTrustedCIDRs = c.MetricsTrustedCIDRs[:0]
 		for _, p := range parts {
 			p = strings.TrimSpace(p)
 			if p != "" {
 				c.MetricsTrustedCIDRs = append(c.MetricsTrustedCIDRs, p)
 			}
 		}
 	}
 	if v, ok := os.LookupEnv("RM_TRUSTED_PROXY"); ok {
 		// Comma-separated CIDRs; allow whitespace for readability.
 		parts := strings.Split(v, ",")
@@ -137,5 +168,10 @@ func (c *Config) validate() error {
 			return fmt.Errorf("config: RM_TRUSTED_PROXY entry %q is not a valid CIDR: %w", cidr, err)
 		}
 	}
 	for _, cidr := range c.MetricsTrustedCIDRs {
 		if _, err := netip.ParsePrefix(cidr); err != nil {
 			return fmt.Errorf("config: RM_METRICS_TRUSTED_CIDR entry %q is not a valid CIDR: %w", cidr, err)
 		}
 	}
 	return nil
 }
@@ -98,6 +98,45 @@ func TestCookieSecureDefaultAndOverride(t *testing.T) {
 	}
 }
 func TestMetricsAuthGates(t *testing.T) {
 	t.Setenv("RM_LISTEN", ":8080")
 	t.Setenv("RM_DATA_DIR", "/tmp/x")
 	c, err := Load("")
 	if err != nil {
 		t.Fatalf("load: %v", err)
 	}
 	if c.MetricsAuthEnabled() {
 		t.Errorf("metrics endpoint should be off by default")
 	}
 	t.Setenv("RM_METRICS_TOKEN", "s3cr3t-token-with-enough-bytes")
 	t.Setenv("RM_METRICS_TRUSTED_CIDR", "10.0.0.0/8, 192.168.1.0/24")
 	c, err = Load("")
 	if err != nil {
 		t.Fatalf("load: %v", err)
 	}
 	if c.MetricsToken != "s3cr3t-token-with-enough-bytes" {
 		t.Errorf("token: %q", c.MetricsToken)
 	}
 	if got := c.MetricsTrustedCIDRs; len(got) != 2 || got[0] != "10.0.0.0/8" || got[1] != "192.168.1.0/24" {
 		t.Errorf("cidrs: %v", got)
 	}
 	if !c.MetricsAuthEnabled() {
 		t.Errorf("MetricsAuthEnabled should be true")
 	}
 }
 func TestMetricsTrustedCIDRRejectsGarbage(t *testing.T) {
 	t.Setenv("RM_LISTEN", ":8080")
 	t.Setenv("RM_DATA_DIR", "/tmp/x")
 	t.Setenv("RM_METRICS_TRUSTED_CIDR", "garbage")
 	if _, err := Load(""); err == nil {
 		t.Fatal("expected validation error, got nil")
 	}
 }
 func writeFile(path string, body []byte) error {
 	return writeFileImpl(path, body)
 }
@@ -0,0 +1,185 @@
 package http
 import (
 	"context"
 	"crypto/subtle"
 	"net"
 	"net/http"
 	"net/netip"
 	"runtime"
 	"strings"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/config"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
 )
 // handleMetrics serves the Prometheus exposition body. The route is
 // only mounted when the operator has opted in via RM_METRICS_TOKEN
 // or RM_METRICS_TRUSTED_CIDR (see Server.routes + Config.MetricsAuthEnabled).
 func (s *Server) handleMetrics(w http.ResponseWriter, r *http.Request) {
 	if !authoriseMetricsScrape(r, s.deps.Cfg) {
 		// 401 with no body; Prom respects this and surfaces the failed
 		// scrape. WWW-Authenticate hints at bearer when the operator
 		// actually configured a token.
 		if s.deps.Cfg.MetricsToken != "" {
 			w.Header().Set("WWW-Authenticate", `Bearer realm="restic-manager metrics"`)
 		}
 		w.WriteHeader(http.StatusUnauthorized)
 		return
 	}
 	snap, err := s.gatherMetricsSnapshot(r.Context())
 	if err != nil {
 		http.Error(w, "snapshot: "+err.Error(), http.StatusInternalServerError)
 		return
 	}
 	// 0.0.4 is the long-stable text-format version Prometheus accepts
 	// without negotiation; OpenMetrics is intentionally not used here.
 	w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
 	if err := metrics.Render(w, snap); err != nil {
 		// The body is partially written and the 200 status has already
 		// gone out, so the error cannot be reported to the client; the
 		// scraper may see a truncated body. Nothing more to do here.
 		return
 	}
 }
 // authoriseMetricsScrape applies bearer + CIDR gates per the spec.
 // AND semantics when both are configured; either alone is sufficient
 // when only it is configured.
 func authoriseMetricsScrape(r *http.Request, cfg config.Config) bool {
 	tokenOK := true
 	if cfg.MetricsToken != "" {
 		tokenOK = false
 		hdr := r.Header.Get("Authorization")
 		const prefix = "Bearer "
 		if strings.HasPrefix(hdr, prefix) {
 			got := []byte(strings.TrimPrefix(hdr, prefix))
 			want := []byte(cfg.MetricsToken)
 			if subtle.ConstantTimeCompare(got, want) == 1 {
 				tokenOK = true
 			}
 		}
 	}
 	cidrOK := true
 	if len(cfg.MetricsTrustedCIDRs) > 0 {
 		cidrOK = false
 		ip := callerIP(r, cfg.TrustedProxies)
 		if ip.IsValid() {
 			for _, c := range cfg.MetricsTrustedCIDRs {
 				prefix, err := netip.ParsePrefix(c)
 				if err != nil {
 					continue
 				}
 				if prefix.Contains(ip) {
 					cidrOK = true
 					break
 				}
 			}
 		}
 	}
 	return tokenOK && cidrOK
 }
 // callerIP resolves the client IP. When the request hit the server
 // directly we use RemoteAddr; when the immediate hop is a trusted
 // proxy we honour the right-most untrusted X-Forwarded-For entry
 // (mirrors how realIP middlewares typically resolve).
 func callerIP(r *http.Request, trustedProxies []string) netip.Addr {
 	host, _, err := net.SplitHostPort(r.RemoteAddr)
 	if err != nil {
 		host = r.RemoteAddr
 	}
 	directAddr, err := netip.ParseAddr(host)
 	if err != nil {
 		return netip.Addr{}
 	}
 	if !addrInAnyCIDR(directAddr, trustedProxies) {
 		return directAddr
 	}
 	xff := r.Header.Get("X-Forwarded-For")
 	if xff == "" {
 		return directAddr
 	}
 	parts := strings.Split(xff, ",")
 	// Walk right→left, skipping trusted proxies, until we land on the
 	// first untrusted hop — that's the genuine client.
 	for i := len(parts) - 1; i >= 0; i-- {
 		p := strings.TrimSpace(parts[i])
 		a, err := netip.ParseAddr(p)
 		if err != nil {
 			continue
 		}
 		if addrInAnyCIDR(a, trustedProxies) {
 			continue
 		}
 		return a
 	}
 	return directAddr
 }
 func addrInAnyCIDR(a netip.Addr, cidrs []string) bool {
 	for _, c := range cidrs {
 		pre, err := netip.ParsePrefix(c)
 		if err != nil {
 			continue
 		}
 		if pre.Contains(a) {
 			return true
 		}
 	}
 	return false
 }
 // gatherMetricsSnapshot pulls the data the renderer needs. One
 // indexed query per fleet-wide read (hosts, open alerts); no N+1.
 func (s *Server) gatherMetricsSnapshot(ctx context.Context) (metrics.Snapshot, error) {
 	hosts, err := s.deps.Store.ListHosts(ctx)
 	if err != nil {
 		return metrics.Snapshot{}, err
 	}
 	hostRows := make([]metrics.HostRow, 0, len(hosts))
 	for _, h := range hosts {
 		row := metrics.HostRow{
 			ID:             h.ID,
 			Name:           h.Name,
 			Online:         h.Status == "online",
 			SnapshotCount:  h.SnapshotCount,
 			OpenAlertCount: h.OpenAlertCount,
 			RepoStatus:     h.RepoStatus,
 		}
 		if h.LastBackupAt != nil {
 			ts := h.LastBackupAt.Unix()
 			row.LastBackupUnix = &ts
 		}
 		if h.LastBackupStatus != nil {
 			ok := *h.LastBackupStatus == "succeeded"
 			row.LastBackupSucceeded = &ok
 		}
 		if h.RepoSizeBytes > 0 {
 			sz := h.RepoSizeBytes
 			row.RepoSizeBytes = &sz
 		}
 		hostRows = append(hostRows, row)
 	}
 	open, err := s.deps.Store.ListAlerts(ctx, store.AlertFilter{Status: "open"})
 	if err != nil {
 		return metrics.Snapshot{}, err
 	}
 	bySeverity := map[string]int{"info": 0, "warning": 0, "critical": 0}
 	for _, a := range open {
 		bySeverity[a.Severity]++
 	}
 	reg := s.deps.Metrics
 	if reg == nil {
 		reg = metrics.NewRegistry() // empty histogram block
 	}
 	return reg.SnapshotWith(hostRows, bySeverity, version.Version, version.Commit, runtime.Version()), nil
 }
@@ -0,0 +1,209 @@
 package http
 import (
 	"context"
 	"io"
 	stdhttp "net/http"
 	"net/http/httptest"
 	"path/filepath"
 	"strings"
 	"testing"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/crypto"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/config"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
 )
 // newMetricsServer builds a Server with metrics enabled per cfg.
 // Returns (URL, registry, store) so tests can both observe job
 // durations directly and exercise the HTTP gate.
 func newMetricsServer(t *testing.T, cfg config.Config) (string, *metrics.Registry, *store.Store) {
 	t.Helper()
 	dir := t.TempDir()
 	st, err := store.Open(context.Background(), filepath.Join(dir, "rm.db"))
 	if err != nil {
 		t.Fatalf("store: %v", err)
 	}
 	t.Cleanup(func() { _ = st.Close() })
 	keyPath := filepath.Join(dir, "secret.key")
 	if err := crypto.GenerateKeyFile(keyPath); err != nil {
 		t.Fatalf("genkey: %v", err)
 	}
 	key, _ := crypto.LoadKeyFromFile(keyPath)
 	aead, _ := crypto.NewAEAD(key)
 	cfg.Listen = ":0"
 	cfg.DataDir = dir
 	cfg.SecretKeyFile = keyPath
 	reg := metrics.NewRegistry()
 	deps := Deps{
 		Cfg:     cfg,
 		Store:   st,
 		AEAD:    aead,
 		Metrics: reg,
 	}
 	s := New(deps)
 	ts := httptest.NewServer(s.srv.Handler)
 	t.Cleanup(ts.Close)
 	return ts.URL, reg, st
 }
 func TestMetricsRouteNotMountedByDefault(t *testing.T) {
 	t.Parallel()
 	url, _, _ := newMetricsServer(t, config.Config{})
 	res, err := stdhttp.Get(url + "/metrics")
 	if err != nil {
 		t.Fatalf("GET: %v", err)
 	}
 	defer res.Body.Close()
 	if res.StatusCode != stdhttp.StatusNotFound {
 		t.Errorf("status: got %d, want 404 (route should not be mounted)", res.StatusCode)
 	}
 }
 func TestMetricsTokenRequired(t *testing.T) {
 	t.Parallel()
 	url, _, _ := newMetricsServer(t, config.Config{
 		MetricsToken: "the-token",
 	})
 	// Missing token.
 	res, err := stdhttp.Get(url + "/metrics")
 	if err != nil {
 		t.Fatalf("GET: %v", err)
 	}
 	defer res.Body.Close()
 	if res.StatusCode != stdhttp.StatusUnauthorized {
 		t.Errorf("no token: got %d", res.StatusCode)
 	}
 	if !strings.Contains(res.Header.Get("WWW-Authenticate"), "Bearer") {
 		t.Errorf("WWW-Authenticate hint missing: %q", res.Header.Get("WWW-Authenticate"))
 	}
 	// Wrong token.
 	req, _ := stdhttp.NewRequest(stdhttp.MethodGet, url+"/metrics", nil)
 	req.Header.Set("Authorization", "Bearer not-the-token")
 	res2, err := stdhttp.DefaultClient.Do(req)
 	if err != nil {
 		t.Fatalf("GET: %v", err)
 	}
 	defer res2.Body.Close()
 	if res2.StatusCode != stdhttp.StatusUnauthorized {
 		t.Errorf("wrong token: got %d", res2.StatusCode)
 	}
 	// Right token.
 	req3, _ := stdhttp.NewRequest(stdhttp.MethodGet, url+"/metrics", nil)
 	req3.Header.Set("Authorization", "Bearer the-token")
 	res3, err3 := stdhttp.DefaultClient.Do(req3)
 	if err3 != nil {
 		t.Fatalf("GET: %v", err3)
 	}
 	defer res3.Body.Close()
 	if res3.StatusCode != stdhttp.StatusOK {
 		t.Errorf("right token: got %d", res3.StatusCode)
 	}
 	if ct := res3.Header.Get("Content-Type"); !strings.HasPrefix(ct, "text/plain") {
 		t.Errorf("content-type: %q", ct)
 	}
 }
 func TestMetricsCIDRGate(t *testing.T) {
 	t.Parallel()
 	// httptest requests arrive from 127.0.0.1; pick a CIDR that excludes
 	// it to assert the "wrong source" branch.
 	url, _, _ := newMetricsServer(t, config.Config{
 		MetricsTrustedCIDRs: []string{"10.0.0.0/8"},
 	})
 	res, err := stdhttp.Get(url + "/metrics")
 	if err != nil {
 		t.Fatalf("GET: %v", err)
 	}
 	defer res.Body.Close()
 	if res.StatusCode != stdhttp.StatusUnauthorized {
 		t.Errorf("loopback hitting non-matching CIDR: got %d, want 401", res.StatusCode)
 	}
 	// Now allow loopback.
 	url2, _, _ := newMetricsServer(t, config.Config{
 		MetricsTrustedCIDRs: []string{"127.0.0.0/8"},
 	})
 	res2, err := stdhttp.Get(url2 + "/metrics")
 	if err != nil {
 		t.Fatalf("GET: %v", err)
 	}
 	defer res2.Body.Close()
 	if res2.StatusCode != stdhttp.StatusOK {
 		t.Errorf("loopback in allow CIDR: got %d, want 200", res2.StatusCode)
 	}
 }
 func TestMetricsTokenAndCIDRBothRequired(t *testing.T) {
 	t.Parallel()
 	url, _, _ := newMetricsServer(t, config.Config{
 		MetricsToken:        "the-token",
 		MetricsTrustedCIDRs: []string{"127.0.0.0/8"},
 	})
 	// CIDR only: the source is OK (loopback) but the token is missing.
 	res, err := stdhttp.Get(url + "/metrics")
 	if err != nil {
 		t.Fatalf("GET: %v", err)
 	}
 	defer res.Body.Close()
 	if res.StatusCode != stdhttp.StatusUnauthorized {
 		t.Errorf("missing token but in CIDR: got %d", res.StatusCode)
 	}
 	// Both right.
 	req, _ := stdhttp.NewRequest(stdhttp.MethodGet, url+"/metrics", nil)
 	req.Header.Set("Authorization", "Bearer the-token")
 	res2, err := stdhttp.DefaultClient.Do(req)
 	if err != nil {
 		t.Fatalf("GET: %v", err)
 	}
 	defer res2.Body.Close()
 	if res2.StatusCode != stdhttp.StatusOK {
 		t.Errorf("both right: got %d", res2.StatusCode)
 	}
 }
 func readAll(t *testing.T, r io.Reader) string {
 	t.Helper()
 	b, err := io.ReadAll(r)
 	if err != nil {
 		t.Fatalf("read: %v", err)
 	}
 	return string(b)
 }
 func TestMetricsBodyContainsExpectedLines(t *testing.T) {
 	t.Parallel()
 	url, reg, _ := newMetricsServer(t, config.Config{
 		MetricsToken: "the-token",
 	})
 	reg.ObserveJob("backup", "succeeded", 0) // produce one histogram row
 	req, _ := stdhttp.NewRequest(stdhttp.MethodGet, url+"/metrics", nil)
 	req.Header.Set("Authorization", "Bearer the-token")
 	res, err := stdhttp.DefaultClient.Do(req)
 	if err != nil {
 		t.Fatalf("GET: %v", err)
 	}
 	defer res.Body.Close()
 	body := readAll(t, res.Body)
 	for _, want := range []string{
 		"rm_hosts_total",
 		"rm_hosts_online",
 		`rm_active_alerts{severity="critical"}`,
 		"rm_build_info{",
 		"rm_job_duration_seconds_count{kind=\"backup\",status=\"succeeded\"}",
 	} {
 		if !strings.Contains(body, want) {
 			t.Errorf("body missing %q\n--- body ---\n%s", want, body)
 		}
 	}
 }
@@ -17,6 +17,7 @@ import (
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/crypto"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/notification"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/config"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/oidc"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ui"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ws"
@@ -56,6 +57,12 @@ type Deps struct {
 	// OIDC (optional). Non-nil when the operator has configured an
 	// IdP — handlers under /auth/oidc/* are mounted only when set.
 	OIDC *oidc.Client
 	// Metrics (optional). When non-nil the WS job-finished branch
 	// records job durations and the /metrics handler can pull a
 	// histogram snapshot. Independent of MetricsAuthEnabled — the
 	// recorder runs even if the scrape endpoint is gated off, so a
 	// later config flip doesn't lose the running window.
 	Metrics *metrics.Registry
 }
 // Server is the running HTTP server.
@@ -131,12 +138,16 @@ func (s *Server) routes(r chi.Router) {
 	r.Get("/agent/binary", s.handleAgentBinary)
 	r.Get("/install/*", s.handleInstallAsset)
 	r.Get("/api/version", s.handleVersion)
 	if s.deps.Cfg.MetricsAuthEnabled() {
 		r.Get("/metrics", s.handleMetrics)
 	}
 	if s.deps.Hub != nil {
 		hd := ws.HandlerDeps{
 			Hub:            s.deps.Hub,
 			Store:          s.deps.Store,
 			JobHub:         s.deps.JobHub,
 			AlertEngine:    s.deps.AlertEngine,
 			Metrics:        s.deps.Metrics,
 			OnHello:        s.onAgentHello,
 			OnScheduleAck:  s.applyScheduleAck,
 			OnScheduleFire: s.dispatchScheduledJob,
@@ -0,0 +1,301 @@
 // Package metrics owns the in-process Prometheus exposition for
 // the control plane. It deliberately avoids prometheus/client_golang
 // — the legacy text format is small and stable, and the repo's house
 // style is to keep dependency surface minimal.
 //
 // Two halves:
 //
 //   - Registry holds a job-duration histogram. Server hooks call
 //     Registry.ObserveJob from the WS job-finished branch.
 //
 //   - Render emits a complete /metrics body from a Snapshot. The
 //     Snapshot is a plain value bag; the HTTP handler assembles it
 //     from store reads + Registry.SnapshotWith at scrape time. This
 //     keeps the package free of any database or HTTP dependency.
 package metrics
 import (
 	"fmt"
 	"io"
 	"sort"
 	"strings"
 	"sync"
 	"time"
 )
 // JobDurationBuckets is the upper-bound ladder for the job duration
 // histogram, in seconds. Covers admin commands (unlock/init/check
 // finishing in seconds) up through hours-long backups; +Inf is
 // implicit.
 var JobDurationBuckets = []float64{1, 5, 30, 60, 300, 1800, 3600, 21600, 86400}
 // Registry is the in-memory store for the job-duration histogram.
 // Concurrent observers plus a single periodic snapshotter are the
 // expected access pattern; both are guarded by a mutex.
 type Registry struct {
 	mu    sync.Mutex
 	jobs  map[jobKey]*histogramState
 	clock func() time.Time
 }
 type jobKey struct{ kind, status string }
 type histogramState struct {
 	// counts[i] = number of observations <= JobDurationBuckets[i].
 	// counts[len(JobDurationBuckets)] is the implicit +Inf bucket
 	// (== total count, kept here for symmetry with the rendered
 	// _bucket{le="+Inf"} line and as a sanity check).
 	counts []uint64
 	sum    float64
 	count  uint64
 }
 // NewRegistry builds an empty registry.
 func NewRegistry() *Registry {
 	return &Registry{
 		jobs:  make(map[jobKey]*histogramState),
 		clock: time.Now,
 	}
 }
 // ObserveJob records one job-duration sample. Negative durations
 // (clock-skew artefacts) are clamped to zero. Empty kind/status
 // strings are tolerated but degrade the dashboard — callers should
 // pass meaningful values.
 func (r *Registry) ObserveJob(kind, status string, dur time.Duration) {
 	if r == nil {
 		return
 	}
 	if dur < 0 {
 		dur = 0
 	}
 	secs := dur.Seconds()
 	r.mu.Lock()
 	defer r.mu.Unlock()
 	k := jobKey{kind: kind, status: status}
 	hs, ok := r.jobs[k]
 	if !ok {
 		hs = &histogramState{counts: make([]uint64, len(JobDurationBuckets)+1)}
 		r.jobs[k] = hs
 	}
 	for i, ub := range JobDurationBuckets {
 		if secs <= ub {
 			hs.counts[i]++
 		}
 	}
 	hs.counts[len(JobDurationBuckets)]++ // +Inf
 	hs.sum += secs
 	hs.count++
 }
 // HistogramRow is one (kind,status) row in a Snapshot. Buckets is
 // the cumulative count per upper bound (matching JobDurationBuckets,
 // last element is the +Inf total).
 type HistogramRow struct {
 	Kind    string
 	Status  string
 	Buckets []uint64
 	Sum     float64
 	Count   uint64
 }
 // snapshotJobs returns a deterministic, sorted copy of the
 // histogram state. Sort order: kind asc, status asc.
 func (r *Registry) snapshotJobs() []HistogramRow {
 	if r == nil {
 		return nil
 	}
 	r.mu.Lock()
 	defer r.mu.Unlock()
 	rows := make([]HistogramRow, 0, len(r.jobs))
 	for k, hs := range r.jobs {
 		buckets := make([]uint64, len(hs.counts))
 		copy(buckets, hs.counts)
 		rows = append(rows, HistogramRow{
 			Kind:    k.kind,
 			Status:  k.status,
 			Buckets: buckets,
 			Sum:     hs.sum,
 			Count:   hs.count,
 		})
 	}
 	sort.Slice(rows, func(i, j int) bool {
 		if rows[i].Kind != rows[j].Kind {
 			return rows[i].Kind < rows[j].Kind
 		}
 		return rows[i].Status < rows[j].Status
 	})
 	return rows
 }
 // HostRow is one host's projection for the per-host gauges.
 // Pointers carry "no value" semantics so we can omit a metric line
 // when, e.g., a host has never run a backup.
 type HostRow struct {
 	ID                  string
 	Name                string
 	Online              bool
 	LastBackupUnix      *int64 // nil = no backup yet
 	LastBackupSucceeded *bool  // nil = no backup yet
 	RepoSizeBytes       *int64 // nil = no stats yet
 	SnapshotCount       int
 	OpenAlertCount      int
 	RepoStatus          string // "unknown" | "ready" | "init_failed"
 }
 // Snapshot is a frozen view of the data needed to render /metrics.
 // Constructed by the HTTP handler via Registry.SnapshotWith, which
 // combines Store reads with the output of Registry.snapshotJobs.
 type Snapshot struct {
 	Hosts            []HostRow
 	HostsTotal       int
 	HostsOnline      int
 	AlertsBySeverity map[string]int // severity → count
 	BuildVersion     string
 	BuildCommit      string
 	GoVersion        string
 	JobDurationRows  []HistogramRow
 }
 // SnapshotWith builds a Snapshot from raw inputs and the registry's
 // current job-duration state. Convenience for the HTTP handler.
 func (r *Registry) SnapshotWith(hosts []HostRow, alerts map[string]int, buildVer, commit, goVer string) Snapshot {
 	online := 0
 	for _, h := range hosts {
 		if h.Online {
 			online++
 		}
 	}
 	return Snapshot{
 		Hosts:            hosts,
 		HostsTotal:       len(hosts),
 		HostsOnline:      online,
 		AlertsBySeverity: alerts,
 		BuildVersion:     buildVer,
 		BuildCommit:      commit,
 		GoVersion:        goVer,
 		JobDurationRows:  r.snapshotJobs(),
 	}
 }
 // Render emits a complete Prometheus text-exposition body for s.
 // Output is deterministic: metric families appear in a fixed order,
 // per-host rows are sorted by host ID, and histogram rows keep the
 // (kind, status) order produced by snapshotJobs.
 func Render(w io.Writer, s Snapshot) error {
 	var b strings.Builder
 	// --- Server gauges ---------------------------------------------------
 	b.WriteString("# HELP rm_hosts_total Total number of enrolled hosts (excludes pending announces).\n")
 	b.WriteString("# TYPE rm_hosts_total gauge\n")
 	fmt.Fprintf(&b, "rm_hosts_total %d\n", s.HostsTotal)
 	b.WriteString("# HELP rm_hosts_online Number of hosts currently online (status='online').\n")
 	b.WriteString("# TYPE rm_hosts_online gauge\n")
 	fmt.Fprintf(&b, "rm_hosts_online %d\n", s.HostsOnline)
 	b.WriteString("# HELP rm_active_alerts Open alerts grouped by severity.\n")
 	b.WriteString("# TYPE rm_active_alerts gauge\n")
 	severities := []string{"info", "warning", "critical"}
 	for _, sev := range severities {
 		fmt.Fprintf(&b, "rm_active_alerts{severity=%q} %d\n", sev, s.AlertsBySeverity[sev])
 	}
 	b.WriteString("# HELP rm_build_info Build identifying labels; value is always 1.\n")
 	b.WriteString("# TYPE rm_build_info gauge\n")
 	fmt.Fprintf(&b, "rm_build_info{version=%q,commit=%q,go_version=%q} 1\n",
 		s.BuildVersion, s.BuildCommit, s.GoVersion)
 	// --- Per-host gauges -------------------------------------------------
 	// Stable order: by host id.
 	hosts := append([]HostRow(nil), s.Hosts...)
 	sort.Slice(hosts, func(i, j int) bool { return hosts[i].ID < hosts[j].ID })
 	b.WriteString("# HELP rm_host_agent_online 1 if the agent is currently online, 0 otherwise.\n")
 	b.WriteString("# TYPE rm_host_agent_online gauge\n")
 	for _, h := range hosts {
 		v := 0
 		if h.Online {
 			v = 1
 		}
 		fmt.Fprintf(&b, "rm_host_agent_online{host_id=%q,host=%q} %d\n",
 			h.ID, h.Name, v)
 	}
 	b.WriteString("# HELP rm_host_last_backup_timestamp_seconds Unix timestamp of the host's most recent backup. Omitted for hosts with no backup yet.\n")
 	b.WriteString("# TYPE rm_host_last_backup_timestamp_seconds gauge\n")
 	for _, h := range hosts {
 		if h.LastBackupUnix == nil {
 			continue
 		}
 		fmt.Fprintf(&b, "rm_host_last_backup_timestamp_seconds{host_id=%q,host=%q} %d\n",
 			h.ID, h.Name, *h.LastBackupUnix)
 	}
 	b.WriteString("# HELP rm_host_last_backup_success 1 if the host's most recent backup succeeded, 0 otherwise. Omitted for hosts with no backup yet.\n")
 	b.WriteString("# TYPE rm_host_last_backup_success gauge\n")
 	for _, h := range hosts {
 		if h.LastBackupSucceeded == nil {
 			continue
 		}
 		v := 0
 		if *h.LastBackupSucceeded {
 			v = 1
 		}
 		fmt.Fprintf(&b, "rm_host_last_backup_success{host_id=%q,host=%q} %d\n",
 			h.ID, h.Name, v)
 	}
 	b.WriteString("# HELP rm_host_repo_size_bytes Latest reported repo size from `restic stats --mode raw-data`. Omitted for hosts with no stats yet.\n")
 	b.WriteString("# TYPE rm_host_repo_size_bytes gauge\n")
 	for _, h := range hosts {
 		if h.RepoSizeBytes == nil {
 			continue
 		}
 		fmt.Fprintf(&b, "rm_host_repo_size_bytes{host_id=%q,host=%q} %d\n",
 			h.ID, h.Name, *h.RepoSizeBytes)
 	}
 	b.WriteString("# HELP rm_host_snapshot_count Number of restic snapshots known on the host's repo.\n")
 	b.WriteString("# TYPE rm_host_snapshot_count gauge\n")
 	for _, h := range hosts {
 		fmt.Fprintf(&b, "rm_host_snapshot_count{host_id=%q,host=%q} %d\n",
 			h.ID, h.Name, h.SnapshotCount)
 	}
 	b.WriteString("# HELP rm_host_open_alerts Number of currently open alerts attached to this host.\n")
 	b.WriteString("# TYPE rm_host_open_alerts gauge\n")
 	for _, h := range hosts {
 		fmt.Fprintf(&b, "rm_host_open_alerts{host_id=%q,host=%q} %d\n",
 			h.ID, h.Name, h.OpenAlertCount)
 	}
 	b.WriteString("# HELP rm_host_repo_status Repo readiness state for the host. Exactly one row per host with status label set.\n")
 	b.WriteString("# TYPE rm_host_repo_status gauge\n")
 	for _, h := range hosts {
 		st := h.RepoStatus
 		if st == "" {
 			st = "unknown"
 		}
 		fmt.Fprintf(&b, "rm_host_repo_status{host_id=%q,host=%q,status=%q} 1\n",
 			h.ID, h.Name, st)
 	}
 	// --- Histogram -------------------------------------------------------
 	b.WriteString("# HELP rm_job_duration_seconds End-to-end duration of completed jobs, by kind and terminal status.\n")
 	b.WriteString("# TYPE rm_job_duration_seconds histogram\n")
 	for _, row := range s.JobDurationRows {
 		for i, ub := range JobDurationBuckets {
 			fmt.Fprintf(&b, "rm_job_duration_seconds_bucket{kind=%q,status=%q,le=\"%g\"} %d\n",
 				row.Kind, row.Status, ub, row.Buckets[i])
 		}
 		fmt.Fprintf(&b, "rm_job_duration_seconds_bucket{kind=%q,status=%q,le=\"+Inf\"} %d\n",
 			row.Kind, row.Status, row.Buckets[len(JobDurationBuckets)])
 		fmt.Fprintf(&b, "rm_job_duration_seconds_sum{kind=%q,status=%q} %g\n",
 			row.Kind, row.Status, row.Sum)
 		fmt.Fprintf(&b, "rm_job_duration_seconds_count{kind=%q,status=%q} %d\n",
 			row.Kind, row.Status, row.Count)
 	}
 	_, err := io.WriteString(w, b.String())
 	return err
 }
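Render quotes label values with `%q`. For double quotes, backslashes, and line feeds that matches the three escapes the Prometheus text format defines for label values, but Go's quoting also rewrites other control characters (a tab becomes `\t`, for example), and the format does not define those escapes. The sketch below shows one way a dedicated escaper might look, assuming host names could ever contain such characters; `escapeLabelValue` is a hypothetical helper and is not part of this diff.

```go
package main

import (
	"fmt"
	"strings"
)

// labelEscaper applies only the three escapes the Prometheus text
// format defines for label values: backslash, double quote, line feed.
// strings.Replacer works in a single pass, so output is never re-escaped.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// escapeLabelValue is a hypothetical helper (an assumption for this
// sketch); it leaves every other character, including tabs, untouched.
func escapeLabelValue(v string) string {
	return labelEscaper.Replace(v)
}

func main() {
	name := "db\t\"primary\""
	// Dedicated escaper: the tab is passed through as a raw byte.
	fmt.Printf("rm_host_agent_online{host=\"%s\"} 1\n", escapeLabelValue(name))
	// %q: the tab is rewritten as the two characters \t.
	fmt.Printf("rm_host_agent_online{host=%q} 1\n", name)
}
```

For plain printable host names and IDs such as the ones in the golden test, both approaches should produce identical output, so this only matters if free-form names are allowed.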
@@ -0,0 +1,182 @@
 package metrics
 import (
 	"bytes"
 	"strings"
 	"sync"
 	"testing"
 	"time"
 )
 func TestObserveJobBuckets(t *testing.T) {
 	r := NewRegistry()
 	// Bucket boundaries: 1, 5, 30, 60, 300, 1800, 3600, 21600, 86400
 	r.ObserveJob("backup", "succeeded", 500*time.Millisecond) // <= 1
 	r.ObserveJob("backup", "succeeded", 30*time.Second)       // == 30 (boundary)
 	r.ObserveJob("backup", "succeeded", 90*time.Second)       // > 60, <= 300
 	r.ObserveJob("backup", "succeeded", 2*time.Hour)          // > 3600 → 21600 bucket
 	rows := r.snapshotJobs()
 	if len(rows) != 1 {
 		t.Fatalf("rows: %d", len(rows))
 	}
 	row := rows[0]
 	if row.Count != 4 {
 		t.Errorf("count: %d", row.Count)
 	}
 	wantSum := 0.5 + 30 + 90 + 7200.0
 	if row.Sum != wantSum {
 		t.Errorf("sum: got %v want %v", row.Sum, wantSum)
 	}
 	// Cumulative buckets:
 	//  le=1     → 1 (the 0.5s)
 	//  le=5     → 1
 	//  le=30    → 2 (boundary inclusive: 30s included)
 	//  le=60    → 2
 	//  le=300   → 3
 	//  le=1800  → 3
 	//  le=3600  → 3
 	//  le=21600 → 4
 	//  le=86400 → 4
 	//  le=+Inf  → 4
 	want := []uint64{1, 1, 2, 2, 3, 3, 3, 4, 4, 4}
 	for i, w := range want {
 		if row.Buckets[i] != w {
 			t.Errorf("bucket[%d]=%d want %d", i, row.Buckets[i], w)
 		}
 	}
 }
 func TestObserveJobNegativeClampedToZero(t *testing.T) {
 	r := NewRegistry()
 	r.ObserveJob("backup", "succeeded", -5*time.Second)
 	rows := r.snapshotJobs()
 	if len(rows) != 1 || rows[0].Sum != 0 || rows[0].Count != 1 {
 		t.Errorf("expected one zero-second observation, got %+v", rows)
 	}
 }
 func TestObserveJobConcurrent(t *testing.T) {
 	r := NewRegistry()
 	const goroutines = 16
 	const each = 200
 	var wg sync.WaitGroup
 	for g := 0; g < goroutines; g++ {
 		wg.Add(1)
 		go func() {
 			defer wg.Done()
 			for i := 0; i < each; i++ {
 				r.ObserveJob("backup", "succeeded", time.Second)
 			}
 		}()
 	}
 	wg.Wait()
 	rows := r.snapshotJobs()
 	if len(rows) != 1 {
 		t.Fatalf("rows: %d", len(rows))
 	}
 	if rows[0].Count != uint64(goroutines*each) {
 		t.Errorf("count: got %d want %d", rows[0].Count, goroutines*each)
 	}
 }
 func TestObserveJobNilRegistryNoop(t *testing.T) {
 	var r *Registry // nil
 	r.ObserveJob("backup", "succeeded", time.Second)
 }
 func TestRenderGolden(t *testing.T) {
 	r := NewRegistry()
 	r.ObserveJob("backup", "succeeded", 5*time.Second)
 	r.ObserveJob("forget", "succeeded", 100*time.Millisecond)
 	pi64 := func(v int64) *int64 { return &v }
 	pbool := func(v bool) *bool { return &v }
 	hosts := []HostRow{
 		{
 			ID: "01H0001", Name: "alpha",
 			Online:              true,
 			LastBackupUnix:      pi64(1700000000),
 			LastBackupSucceeded: pbool(true),
 			RepoSizeBytes:       pi64(123456789),
 			SnapshotCount:       42,
 			OpenAlertCount:      0,
 			RepoStatus:          "ready",
 		},
 		{
 			ID: "01H0002", Name: "bravo",
 			Online:         false,
 			SnapshotCount:  0,
 			OpenAlertCount: 1,
 			RepoStatus:     "init_failed",
 		},
 	}
 	snap := r.SnapshotWith(hosts,
 		map[string]int{"info": 0, "warning": 1, "critical": 0},
 		"v1.2.3", "deadbeef", "go1.25.0")
 	var buf bytes.Buffer
 	if err := Render(&buf, snap); err != nil {
 		t.Fatalf("render: %v", err)
 	}
 	out := buf.String()
 	for _, want := range []string{
 		"# HELP rm_hosts_total ",
 		"rm_hosts_total 2\n",
 		"rm_hosts_online 1\n",
 		`rm_active_alerts{severity="warning"} 1`,
 		`rm_active_alerts{severity="info"} 0`,
 		`rm_active_alerts{severity="critical"} 0`,
 		`rm_build_info{version="v1.2.3",commit="deadbeef",go_version="go1.25.0"} 1`,
 		`rm_host_agent_online{host_id="01H0001",host="alpha"} 1`,
 		`rm_host_agent_online{host_id="01H0002",host="bravo"} 0`,
 		`rm_host_last_backup_timestamp_seconds{host_id="01H0001",host="alpha"} 1700000000`,
 		`rm_host_last_backup_success{host_id="01H0001",host="alpha"} 1`,
 		`rm_host_repo_size_bytes{host_id="01H0001",host="alpha"} 123456789`,
 		`rm_host_snapshot_count{host_id="01H0001",host="alpha"} 42`,
 		`rm_host_snapshot_count{host_id="01H0002",host="bravo"} 0`,
 		`rm_host_open_alerts{host_id="01H0002",host="bravo"} 1`,
 		`rm_host_repo_status{host_id="01H0001",host="alpha",status="ready"} 1`,
 		`rm_host_repo_status{host_id="01H0002",host="bravo",status="init_failed"} 1`,
 		`rm_job_duration_seconds_bucket{kind="backup",status="succeeded",le="1"} 0`,
 		`rm_job_duration_seconds_bucket{kind="backup",status="succeeded",le="5"} 1`,
 		`rm_job_duration_seconds_bucket{kind="backup",status="succeeded",le="+Inf"} 1`,
 		`rm_job_duration_seconds_sum{kind="backup",status="succeeded"} 5`,
 		`rm_job_duration_seconds_count{kind="backup",status="succeeded"} 1`,
 		`rm_job_duration_seconds_bucket{kind="forget",status="succeeded",le="1"} 1`,
 	} {
 		if !strings.Contains(out, want) {
 			t.Errorf("missing line:\n  %s\n--- full output ---\n%s", want, out)
 		}
 	}
 	// bravo has no last backup and no repo stats → those metric lines must be absent for it.
 	for _, ban := range []string{
 		`rm_host_last_backup_timestamp_seconds{host_id="01H0002"`,
 		`rm_host_last_backup_success{host_id="01H0002"`,
 		`rm_host_repo_size_bytes{host_id="01H0002"`,
 	} {
 		if strings.Contains(out, ban) {
 			t.Errorf("unexpected line for bravo: %q", ban)
 		}
 	}
 }
 func TestRenderEmptySnapshot(t *testing.T) {
 	r := NewRegistry()
 	snap := r.SnapshotWith(nil, nil, "dev", "", "go1.25.0")
 	var buf bytes.Buffer
 	if err := Render(&buf, snap); err != nil {
 		t.Fatalf("render: %v", err)
 	}
 	out := buf.String()
 	if !strings.Contains(out, "rm_hosts_total 0\n") {
 		t.Errorf("missing zero-host gauge:\n%s", out)
 	}
 	// Histogram block has its HELP/TYPE but no rows. The HELP/TYPE
 	// presence is correct and helps Prometheus pre-register the metric.
 	if !strings.Contains(out, "# TYPE rm_job_duration_seconds histogram") {
 		t.Errorf("histogram TYPE line missing")
 	}
 }
@@ -15,6 +15,7 @@ import (
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/alert"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/auth"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
 )
@@ -27,6 +28,9 @@ type HandlerDeps struct {
 	// AlertEngine receives job-finished and host-online events so the
 	// alert engine can evaluate its rules. Optional; nil = no-op.
 	AlertEngine *alert.Engine
 	// Metrics records job-duration observations on every terminal
 	// status. Optional; nil = no-op (test fixtures pass nil).
 	Metrics *metrics.Registry
 	// UpdateWatcher reconciles in-flight agent-update dispatches against
 	// hello envelopes. Optional; nil = no-op.
 	UpdateWatcher *UpdateWatcher
@@ -239,6 +243,13 @@ func dispatchAgentMessage(ctx context.Context, c *Conn, hostID string, env api.E
 					slog.Warn("ws: set host last backup", "host_id", hostID, "err", err)
 				}
 			}
 			// Job-duration histogram (P6-04). Skip when StartedAt is
 			// missing (race: agent shipped finished without a started,
 			// or the row predates this code).
 			if deps.Metrics != nil && job.StartedAt != nil {
 				deps.Metrics.ObserveJob(job.Kind, string(p.Status),
 					p.FinishedAt.Sub(*job.StartedAt))
 			}
 		}
 		if deps.JobHub != nil {
 			deps.JobHub.Broadcast(p.JobID, env)
@@ -326,12 +326,54 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
 ## Phase 5 — OSS readiness
- [ ] **P5-01** (M) Documentation site (mdBook or similar) with install, concepts, security model, screenshots
+- [x] **P5-01** (M) Documentation site (mdBook or similar) with install, concepts, security model, screenshots
- [ ] **P5-02** (S) `CONTRIBUTING.md`, `CODE_OF_CONDUCT.md`, issue + PR templates
+- [x] **P5-02** (S) `CONTRIBUTING.md`, `CODE_OF_CONDUCT.md`, issue + PR templates
 - [x] **P5-03** (S) Release automation — **pivoted away from goreleaser/binary archives** on 2026-05-05 (spec: `docs/superpowers/specs/2026-05-05-p5-03-docker-only-release.md`). Single deliverable per tag: a multi-arch (linux amd64+arm64) server image, with cross-compiled agent binaries (linux amd64+arm64, windows amd64) + `install.sh` + `install.ps1` + the systemd unit baked under `/opt/restic-manager/dist/`. The `/agent/binary` and `/install/*` handlers fall back from `<DataDir>/...` to `<BundledAssetsDir>/...` so a fresh container Just Works. Workflow `.gitea/workflows/release.yml` triggers on `v*.*.*` tag-push (real release: fan-out `:vX.Y.Z`, `:X.Y`, `:X`, plus `:latest` once `MAJOR>=1`) and `workflow_dispatch` (snapshot: `:snapshot-<shortsha>` only). Pushed to the Gitea container registry on this instance — no external creds, no GHCR mirror. Cosign / SBOM / minisign / GHCR mirror deferred to Phase 6. Source builds via `make build` remain a first-class path.
- [ ] **P5-04** (S) Demo screenshots / short Loom walkthrough in README
+- [x] **P5-04** (S) Demo screenshots / short Loom walkthrough in README
- [ ] **P5-05** (S) `SECURITY.md` with disclosure process
+- [x] **P5-05** (S) `SECURITY.md` with disclosure process
- [ ] **P5-06** (M) End-to-end test suite in CI (Playwright vs. compose stack with sibling Linux agent)
+- [x] **P5-06** (M) End-to-end test suite in CI (Playwright vs. compose stack with sibling Linux agent)
 > **As shipped (2026-05-07, branch `p5-oss-readiness`):**
 >
 > **P5-01 — docs site.** mdBook under `docs/book/` with structured
 > chapters: getting-started (install, enrolling hosts, reverse
 > proxy), concepts (architecture, credentials, schedules + source
 > groups, repo maintenance), operations (backups + restores, alerts,
 > observability, updates), security (threat model, hardening,
 > disclosure), reference (env vars, HTTP endpoints), plus
 > contributing / roadmap / license pages. mdBook binary downloaded
 > via Makefile (`make docs` / `make docs-watch`) — same "static
 > binary, no toolchain" pattern as Tailwind. Generated `book/`
 > dir gitignored.
 >
 > **P5-02 — CONTRIBUTING + CoC + templates.** `CONTRIBUTING.md`
 > rewritten from placeholder to full guide (setup, conventions,
 > workflow, RBAC of the project itself). `CODE_OF_CONDUCT.md`
 > shaped on the Contributor Covenant but adapted for a
 > single-maintainer project. `.gitea/issue_template/{bug_report,feature_request}.md`
 > + `.gitea/PULL_REQUEST_TEMPLATE.md`.
 >
 > **P5-04 — README screenshots.** Six full-page captures from a
 > fresh server bootstrap under `docs/screenshots/` (login, empty
 > dashboard, add host, alerts, settings, audit log). README
 > rewritten to centre the screenshot grid + link out to docs site.
 > Captured live from a working build via Playwright; replaceable
 > as the UI evolves without breaking layout.
 >
 > **P5-05 — SECURITY.md.** Disclosure policy (3-day ack, 30-day
 > default disclosure window), supported-versions matrix, scope
 > in/out, threat-model summary, hardening checklist for
 > operators. Mirrored as a chapter in the docs site.
 >
 > **P5-06 — e2e harness.** `e2e/compose.e2e.yml` stands up
 > server + sibling Linux agent (alpine + restic) + restic/rest-server
 > backend, with announce-and-approve as the enrolment path so
 > Playwright drives the operator flow end-to-end. Tests under
 > `e2e/playwright/tests/`: smoke spec covers bootstrap → login →
 > accept-pending → backup → terminal-status; second spec scrapes
 > `/metrics` to verify the P6-04 endpoint. New
 > `.gitea/workflows/e2e.yml` runs on every PR (separate from the
 > fast lint/test workflow). Local how-to in `docs/e2e.md`.
 - [x] **P5-07** (S) Reference deployment landed alongside P5-03. `deploy/docker-compose.yml` stands up *only* the server (image-pinned via `RM_VERSION`, named volume for operator state, bound to localhost) — TLS termination is left to whichever reverse proxy the operator already runs. `docs/reverse-proxy.md` documents the headers + WebSocket pass-through the proxy must forward, the `RM_TRUSTED_PROXY` CIDR rule, and worked examples for Caddy, nginx, and Traefik.
 ### Phase 5 acceptance
@@ -390,8 +432,45 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
 > swap, helper `buildRepoTrendView` shared between page-load and
 > fragment endpoint). No new dependencies, no client JS, no agent
 > change. CI green; in-browser smoke walk-through pending operator.
- [ ] **P6-04** (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list. _(Was P4-08.)_
+- [x] **P6-04** (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list. _(Was P4-08.)_
- [ ] **P6-05** (S) Document Prometheus integration + sample Grafana dashboard JSON. _(Was P4-09.)_
+- [x] **P6-05** (S) Document Prometheus integration + sample Grafana dashboard JSON. _(Was P4-09.)_
 > **As shipped (2026-05-07, branch `p6-04-05-prometheus-metrics`):**
 > Spec `docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md`,
 > plan `docs/superpowers/plans/2026-05-07-p6-04-05-prometheus-metrics.md`.
 > New `internal/server/metrics` package emits the legacy
 > `text/plain; version=0.0.4` exposition format directly — no
 > `prometheus/client_golang` dependency, matching the repo's
 > "no Tailwind, no Node" minimal-deps style. `/metrics` is **opt-in**:
 > `RM_METRICS_TOKEN` and/or `RM_METRICS_TRUSTED_CIDR` must be set or
 > the route isn't mounted at all (404). When both are set, both must
 > pass; when only one is set, that check alone gates access. Token
 > compare is constant-time.
 > CIDR check honours `X-Forwarded-For` only when the immediate hop
 > is a configured `RM_TRUSTED_PROXY` (mirrors the existing realIP
 > resolution).
 >
 > **Metrics:** per-host gauges (`rm_host_agent_online`,
 > `rm_host_last_backup_timestamp_seconds`, `rm_host_last_backup_success`,
 > `rm_host_repo_size_bytes`, `rm_host_snapshot_count`,
 > `rm_host_open_alerts`, `rm_host_repo_status`); server gauges
 > (`rm_hosts_total`, `rm_hosts_online`, `rm_active_alerts{severity}`,
 > `rm_build_info{version,commit,go_version}`); histogram
 > `rm_job_duration_seconds_bucket{kind,status,le}` with buckets
 > `1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf`.
 > Histogram is in-memory; observations come from the existing
 > `MsgJobFinished` branch in `internal/server/ws/handler.go`.
 >
 > **Docs:** `docs/prometheus.md` covers enable + scrape config +
 > metric reference + dashboard import. **Dashboard:**
 > `deploy/grafana/restic-manager-dashboard.json` — six panels
 > (fleet status, open alerts, backups failing, hosts table, repo
 > size over time, job-duration p95). Schema 39, single Prometheus
 > datasource variable.
 >
 > **Tests:** golden-render + concurrent-observe + bucket-boundary
 > in the metrics package; auth matrix (no auth → 404; token
 > missing/wrong/right; CIDR matching/non-matching; token AND CIDR)
 > in the HTTP layer.
 ### Phase 6 acceptance
Author	SHA1	Message	Date
steve	a3f134bcd6	e2e: pin Playwright to 1.59.1 `@playwright/test` was loose-pinned to ^1.50.0; npm resolved it to 1.59.1 inside the runner image, which only ships browser binaries for 1.50.0. Pin both the package and the docker image to v1.59.1 so deps and binaries stay aligned.	2026-05-08 20:09:17 +01:00
steve	17b9ee08b7	e2e: run health probe + Playwright on the compose network Gitea's act-style runners execute workflow steps inside a runner container, so compose's host port-publish (127.0.0.1:8080:8080) is not reachable from the steps. PR #23's e2e job timed out waiting for the server even though the container was up and listening. Move both the health probe and the Playwright run onto rmnet so they address the server as http://server:8080: * health probe: docker run --rm --network e2e_rmnet curlimages/curl * Playwright: new mcr.microsoft.com/playwright-based image, added as a profile-gated `playwright` service in compose.e2e.yml, invoked via `docker compose run --rm playwright`. Drops the setup-node + npm install runner steps.	2026-05-08 20:08:23 +01:00
steve	89537d417a	P5: OSS readiness — docs site, contributor onboarding, e2e harness P5-01 — Documentation site under docs/book/ rendered with mdBook (downloaded via Makefile, same static-binary pattern as Tailwind). Structured chapters: getting started, concepts, operations, security, reference. `make docs` / `make docs-watch`. Generated output gitignored. P5-02 — CONTRIBUTING.md rewritten from placeholder to a full guide. CODE_OF_CONDUCT.md adapted from Contributor Covenant for a single-maintainer project. .gitea/issue_template/{bug,feature}.md and PULL_REQUEST_TEMPLATE.md. P5-04 — Six README screenshots captured live from a fresh server bootstrap (login, empty dashboard, add-host, alerts, settings, audit log). README rewritten to centre the screenshot grid and link out to the docs site. P5-05 — SECURITY.md with disclosure policy (3-day ack, 30-day default window), scope in/out, threat-model summary, operator hardening checklist. Mirrored as a docs-site chapter. P5-06 — End-to-end test harness. e2e/compose.e2e.yml brings up server + sibling Linux agent (alpine + restic) + restic/rest-server. Agent uses announce-and-approve so Playwright can drive the full operator flow: bootstrap → login → accept pending → backup → verify terminal status. Second spec scrapes /metrics to assert the P6-04 endpoint surface. .gitea/workflows/e2e.yml runs on every PR; local how-to in docs/e2e.md.	2026-05-08 20:08:23 +01:00
steve	a252b25854	Merge pull request 'spec+plan: P6-04/05 prometheus /metrics + Grafana dashboard' (#22) from p6-04-05-prometheus-metrics into main Reviewed-on: #22	2026-05-08 18:31:57 +00:00
steve	73e733be61	P6-04+05: Prometheus /metrics endpoint + Grafana dashboard New internal/server/metrics package emits the legacy text/plain exposition format directly, so we don't pull in prometheus/client_golang. Endpoint is opt-in via RM_METRICS_TOKEN and/or RM_METRICS_TRUSTED_CIDR; route is not mounted at all if neither gate is set. Both gates ANDed when both configured. Per-host gauges (online, last_backup_*, repo_size_bytes, snapshot_count, open_alerts, repo_status), server gauges (hosts_total/online, active_alerts by severity, build_info), and an in-memory job-duration histogram observed from the existing MsgJobFinished branch in the WS handler. Docs in docs/prometheus.md (enable + scrape config + metric reference + dashboard import). Sample dashboard at deploy/grafana/restic-manager-dashboard.json - six panels, Grafana schema 39, single Prometheus datasource variable. Tests: golden render, concurrent observe, bucket boundaries in the metrics package; auth matrix (no auth -> 404, token gate, CIDR gate, both required) in the HTTP layer.	2026-05-07 23:17:15 +01:00
steve	70ff554402	spec+plan: P6-04/05 prometheus /metrics + Grafana dashboard	2026-05-07 23:07:30 +01:00