e2e: pin Playwright to 1.59.1

`@playwright/test` was loose-pinned to ^1.50.0; npm resolved it to 1.59.1 inside the runner image, which only ships browser binaries for 1.50.0. Pin both the package and the docker image to v1.59.1 so deps and binaries stay aligned.
e2e: run health probe + Playwright on the compose network
2026-05-08 20:09:17 +01:00 · 2026-05-08 20:08:23 +01:00 · 2026-05-08 20:08:23 +01:00 · 2026-05-08 18:31:57 +00:00 · 2026-05-07 23:17:15 +01:00 · 2026-05-07 23:07:30 +01:00
61 changed files with 4577 additions and 63 deletions
@@ -0,0 +1,32 @@
+<!--
+Thanks for the PR! A few quick checks before submitting:
+
+* Did you open an issue first for non-trivial changes?
+* `make lint test` is green locally?
+* Commits are focused (one logical change per commit)?
+* No `Co-Authored-By` trailers (repo policy)?
+* No new dependencies without a one-line justification below?
+-->
+
+## Summary
+
+<!-- One paragraph: what changed and why. -->
+
+## Test plan
+
+<!-- Bullet list of what you actually ran. Be specific.
+     - `make test` → green
+     - Manually exercised the new flow at /hosts/{id}/foo
+     - Smoke env: enrolled a fresh host, ran a backup end-to-end
+-->
+
+## Notes for the reviewer
+
+<!-- Anything the reviewer needs to know that isn't obvious from the
+     diff: related issue, follow-up work that's intentionally not
+     in this PR, deferred concerns, design alternatives considered
+     and rejected. -->
+
+## Linked issues
+
+<!-- "Closes #123" / "Refs #456" / "Part of P5-06" -->
@@ -0,0 +1,52 @@
+---
+name: Bug report
+about: Something isn't behaving the way the docs / code suggest it should
+title: "[bug] "
+labels: bug
+---
+
+## What happened
+
+<!-- A clear description of the actual behaviour. Include the exact
+     UI surface, API endpoint, or CLI invocation involved. -->
+
+## What you expected
+
+<!-- What you thought would happen, and where that expectation came from
+     (docs page, command output, prior behaviour). -->
+
+## Steps to reproduce
+
+1.
+2.
+3.
+
+## Environment
+
+- restic-manager server version: <!-- `restic-manager-server --version` or footer of the UI -->
+- Agent version (if relevant): <!-- `restic-manager-agent --version` -->
+- restic version on affected host: <!-- `restic version` -->
+- Host OS: <!-- e.g. "Ubuntu 22.04 amd64" or "Windows Server 2022" -->
+- How was the server installed: <!-- docker compose / source build / other -->
+
+## Logs / output
+
+<details><summary>Server log (sanitised)</summary>
+
+```
+<!-- paste relevant lines; redact tokens, passwords, repo URLs -->
+```
+
+</details>
+
+<details><summary>Agent log (sanitised)</summary>
+
+```
+```
+
+</details>
+
+## Anything else
+
+<!-- Screenshots, related issues, recent changes you made before the
+     bug appeared, anything that might help. -->
@@ -0,0 +1,34 @@
+---
+name: Feature request
+about: Suggest a new capability or change to existing behaviour
+title: "[feature] "
+labels: enhancement
+---
+
+## What you're trying to do
+
+<!-- Describe the use case, not the proposed solution. Who is the
+     operator, what are they trying to accomplish, and what's
+     blocking them today? -->
+
+## Why the current behaviour falls short
+
+<!-- What does the system do today, and where does it stop short of
+     the use case above? -->
+
+## Proposed direction (optional)
+
+<!-- If you have a specific design in mind, describe it. Skip this
+     section if you'd rather leave it to the maintainer. -->
+
+## Scope check
+
+- [ ] I've read [`spec.md`](../spec.md) §2 (Goals & Non-Goals).
+- [ ] This isn't already on the roadmap in [`tasks.md`](../tasks.md).
+- [ ] This fits the project's "small fleet, one person operating"
+      target rather than enterprise / multi-tenant / SaaS use cases.
+
+## Anything else
+
+<!-- Related restic features, prior art in similar tools, links to
+     discussions you've had elsewhere. -->
@@ -0,0 +1,98 @@
+# P5-06 — End-to-end test suite.
+#
+# Spec : docs/superpowers/specs/2026-05-07-p5-oss-readiness-design.md
+# Stack: e2e/compose.e2e.yml (server + agent + rest-server + playwright)
+# Tests: e2e/playwright/tests/*.spec.ts
+#
+# Triggered on every PR into main and on workflow_dispatch. Runs
+# longer than the unit-test workflow (~3-4 minutes for a clean run);
+# kept separate so a slow e2e doesn't block the fast lint/test loop.
+#
+# Networking note: every interaction with the server (health probe,
+# Playwright) happens from a container on the compose `rmnet`
+# network, addressing the server as `http://server:8080`. We can't
+# rely on `127.0.0.1:8080` because Gitea's runner executes steps
+# inside its own container, where compose's host port-publish is
+# not visible.
+
+name: e2e
+
+on:
+  pull_request:
+    branches: [main]
+  workflow_dispatch:
+
+jobs:
+  e2e:
+    name: Playwright vs docker-compose
+    runs-on: ubuntu-latest
+    timeout-minutes: 15
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Build the e2e stack
+        run: docker compose -f e2e/compose.e2e.yml build
+
+      - name: Bring up the stack
+        run: docker compose -f e2e/compose.e2e.yml up -d server rest-server source-fixture
+
+      - name: Wait for server health
+        run: |
+          set -eu
+          for i in $(seq 1 30); do
+            if docker run --rm --network e2e_rmnet curlimages/curl:8.10.1 \
+                  -fsS http://server:8080/api/version >/dev/null 2>&1; then
+              echo "server up"; exit 0
+            fi
+            sleep 2
+          done
+          echo "server didn't come up"; docker compose -f e2e/compose.e2e.yml logs server; exit 1
+
+      - name: Capture bootstrap token from server logs
+        id: bootstrap
+        run: |
+          set -eu
+          for i in $(seq 1 15); do
+            line=$(docker compose -f e2e/compose.e2e.yml logs server 2>&1 | grep -E 'bootstrap token' -A2 | grep -Eo '[a-zA-Z0-9_-]{40,}' | head -1 || true)
+            if [ -n "$line" ]; then
+              echo "RM_BOOTSTRAP_TOKEN=$line" >> "$GITHUB_ENV"
+              echo "got bootstrap token (${#line} chars)"
+              exit 0
+            fi
+            sleep 1
+          done
+          echo "bootstrap token not found in logs"
+          docker compose -f e2e/compose.e2e.yml logs server
+          exit 1
+
+      - name: Start the agent
+        run: docker compose -f e2e/compose.e2e.yml up -d agent
+
+      - name: Prepare report mounts
+        run: |
+          mkdir -p e2e/playwright/playwright-report e2e/playwright/test-results
+          chmod -R a+rwX e2e/playwright/playwright-report e2e/playwright/test-results
+
+      - name: Run Playwright tests
+        env:
+          RM_BOOTSTRAP_TOKEN: ${{ env.RM_BOOTSTRAP_TOKEN }}
+        run: docker compose -f e2e/compose.e2e.yml run --rm playwright
+
+      - name: Compose logs (on failure)
+        if: failure()
+        run: |
+          docker compose -f e2e/compose.e2e.yml logs --tail=200 server
+          docker compose -f e2e/compose.e2e.yml logs --tail=200 agent
+          docker compose -f e2e/compose.e2e.yml logs --tail=200 rest-server
+
+      - name: Upload Playwright report (on failure)
+        if: failure()
+        uses: actions/upload-artifact@v3
+        with:
+          name: playwright-report
+          path: e2e/playwright/playwright-report
+          retention-days: 7
+
+      - name: Tear down
+        if: always()
+        run: docker compose -f e2e/compose.e2e.yml down -v
@@ -2,6 +2,10 @@
 /bin/
 /dist/

+# Generated mdBook output (source under docs/book/src is committed,
+# the rendered book/ directory is not).
+/docs/book/book/
+
 # Local data / runtime state
 /data/
 /certs/
@@ -0,0 +1,69 @@
+# Code of Conduct
+
+restic-manager is a small project run by one person. This Code of
+Conduct sets out the basic expectations for participating in the
+project's issue tracker, pull requests, and any other community
+spaces (chat, mailing lists) we may run in future.
+
+## Expected behaviour
+
+- **Be civil.** Disagreement is fine; rudeness is not. The same
+  comment can usually be made without making it personal.
+- **Assume good faith.** People asking what feels like a basic
+  question may be new to the project. People proposing what feels
+  like a duplicate idea may not have seen the prior discussion.
+  Point them to the right place politely.
+- **Stay on topic.** Issue threads are for the issue. Tangential
+  conversations belong in their own thread.
+- **Acknowledge the project's scope.** restic-manager is
+  intentionally small in scope (see `spec.md` §2). Reasonable
+  feature suggestions may still be declined for fit reasons.
+
+## Unacceptable behaviour
+
+- Harassment, threats, or insults — public or private.
+- Discriminatory comments based on age, body size, disability,
+  ethnicity, gender identity or expression, level of experience,
+  nationality, personal appearance, race, religion, sexual identity
+  or orientation.
+- Sustained disruption — derailing threads, ignoring repeated
+  requests to take a discussion elsewhere, brigading.
+- Publishing other people's private information without permission.
+
+## Reporting
+
+If someone in the project's spaces is behaving in a way that
+breaches this Code of Conduct, contact the maintainer directly
+through the contact details on their Gitea profile, or via the
+private security disclosure path documented in
+[SECURITY.md](./SECURITY.md). Reports stay confidential.
+
+The maintainer will review the report, gather context if needed,
+and respond. Possible outcomes include a private warning, a public
+clarification of expectations, a temporary or permanent ban from
+project spaces, or no action if the report doesn't hold up.
+
+There is no formal appeals process — this is a one-person project,
+not a foundation. If you think a decision was wrong you can say
+so, in writing, to the maintainer; that's it.
+
+## Scope
+
+This Code of Conduct applies to interactions in any space the
+project owns or operates: the Gitea repository (issues, pull
+requests, discussions, wiki), any chat channels we publish, and
+any conferences or events the project is officially represented at.
+
+It does not apply to:
+
+- Forks of the project that aren't being submitted back upstream.
+- Conversations between contributors that don't reference the
+  project.
+- Public criticism of the project itself.
+
+## Acknowledgement
+
+This document borrows shape and language from the
+[Contributor Covenant](https://www.contributor-covenant.org/) v2.1
+but is intentionally shorter and adapted to the project's
+single-maintainer reality.
@@ -1,30 +1,168 @@
-# Contributing
+# Contributing to restic-manager

-Thanks for your interest in contributing to restic-manager.
+Thanks for your interest in restic-manager. This document covers how
+to set up a development environment, the conventions the project
+follows, and how patches make it from your machine into `main`.

-> This is a placeholder. The project is in pre-alpha (Phase 1 / MVP). A
-> full contributor guide will land alongside the Phase 5 OSS-readiness
-> work — see [`tasks.md`](./tasks.md) P5-02. Until then the notes below
-> apply.
+## Project status and scope

-## Before opening a PR
+restic-manager is pre-1.0. Core functionality (Phases 0–4) is
+landed; OSS-readiness polish is in progress. The top of
+[`tasks.md`](./tasks.md) tracks what's next; [`spec.md`](./spec.md)
+is the canonical design doc and the source of truth for any
+"why is it built this way" question.

-1. Open an issue first for non-trivial changes — the design is still
-   moving (see [`spec.md`](./spec.md)) and unsolicited large PRs may
-   conflict with in-flight work.
-2. `make lint test` should pass.
-3. Match the existing code style — `gofumpt`, `goimports`, no comments
-   that just restate what the code does.
-4. Keep commits focused; one logical change per commit.
+The project is **single-maintainer, hobbyist-scale, and licensed
+under [PolyForm Noncommercial 1.0.0](./LICENSE)**. That has two
+practical implications:

-## Reporting security issues
+1. Big PRs without prior discussion may be declined for fit
+   reasons even when they're correct — opening an issue first lets
+   us check alignment cheaply.
+2. Commercial use is not permitted by the license. Bug reports and
+   patches from operators of personal/community deployments are
+   very welcome.

-Please do **not** open a public issue for security problems. A
-`SECURITY.md` with a private disclosure path will be added in Phase 5
-(P5-05). Until then, contact the repository owner directly via the
-contact details on their gitea profile.
+## Getting started
+
+### Prerequisites
+
+- Go 1.25 or newer (`go.mod` is the source of truth)
+- `make`
+- For the front-end CSS bundle: nothing extra — `make build`
+  downloads a pinned `tailwindcss` standalone binary into `bin/`.
+- For the docs site: nothing extra — `make docs` does the same trick
+  with `mdbook`.
+- For end-to-end tests: Docker + Docker Compose, plus `npx` for
+  Playwright.
+
+### One-time setup
+
+```sh
+git clone https://gitea.dcglab.co.uk/steve/restic-manager.git
+cd restic-manager
+make build          # compiles bin/restic-manager-{server,agent}
+make test           # full unit + integration test sweep
+make lint           # gofumpt + goimports + golangci-lint
+```
+
+### Running locally
+
+For most development, the [smoke environment](./docs/e2e-smoke.md)
+is the path of least resistance:
+
+```sh
+make smoke-restart  # rebuilds, launches as a systemd --user unit
+make smoke-logs     # tail of the server log
+```
+
+Then point a browser at `http://127.0.0.1:8080`. The first run
+prints a one-time bootstrap token to the log; use it to create the
+admin user.
+
+## Code conventions
+
+### Style
+
+- `gofumpt` for formatting; `goimports` for import grouping.
+  Both run via the pre-commit hook in this repo.
+- `golangci-lint` with `.golangci.yml` defaults; CI rejects on lint
+  errors.
+- UK English in identifiers, comments, log messages, and UI strings
+  (the misspell linter is configured for the UK locale — see
+  P3-X5 for the original sweep).
+- Comments explain **why**, not what; avoid restating the code.
+  A surprising invariant or an external constraint is worth
+  writing down. "Adds 1 to x" is not.
+- `slog` for structured logs. Never log secrets — and especially
+  never the merged-creds rest-server URL (see [`CLAUDE.md`](./CLAUDE.md)).
+
+### File and package layout
+
+- `cmd/server` and `cmd/agent` are the two binary entry points.
+- `internal/` holds everything that's not part of the public Go
+  API (which is none of it — restic-manager isn't a library).
+- Per-feature packages live under `internal/server/...` for the
+  control plane and `internal/agent/...` for the agent.
+- `web/templates/` are HTML templates rendered with the standard
+  library; embedded via `web.FS`.
+
+### Tests
+
+- Unit tests live alongside the code as `*_test.go`. Use the
+  in-process sqlite store (`store.Open(":memory:")`) when you need
+  state — there is no test mock layer to maintain.
+- HTTP handlers are tested through `httptest.NewServer` against the real
+  router; see `internal/server/http/auth_test.go` for the canonical
+  fixture pattern.
+- End-to-end tests live in `e2e/` and run against a Docker Compose
+  stack. See [`docs/e2e.md`](./docs/e2e.md).
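The handler-testing bullet above can be sketched as follows. This is an illustration of the `httptest.NewServer` pattern only: `newRouter` is a hypothetical stand-in for the project's real router, the response body is invented, and the canonical fixture in `internal/server/http/auth_test.go` may look quite different.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newRouter is a hypothetical stand-in for the real router. Only the
// /api/version path is taken from the repository; the body is invented.
func newRouter() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/version", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"version":"dev"}`)
	})
	return mux
}

// fetch starts a throwaway HTTP server around h, requests path over a
// real loopback connection, and returns the status code and body.
func fetch(h http.Handler, path string) (int, string, error) {
	srv := httptest.NewServer(h)
	defer srv.Close()

	resp, err := http.Get(srv.URL + path)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return 0, "", err
	}
	return resp.StatusCode, string(body), nil
}

func main() {
	status, body, err := fetch(newRouter(), "/api/version")
	if err != nil {
		panic(err)
	}
	fmt.Println(status, body)
}
```

Going through a real listener exercises routing and middleware together, which is the point of testing against the router rather than calling handlers directly.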
+
+### Database migrations
+
+- Migrations are hand-rolled SQL in `internal/store/migrations/`
+  and embedded via `embed.FS`.
+- Prefer column-level `ALTER TABLE` over rebuilds — see
+  [`CLAUDE.md`](./CLAUDE.md) "Migrations" section for the FK-cascade
+  trap that bit migration 0007's first draft.
+
+## Workflow
+
+### Before opening a PR
+
+1. **Open an issue first** for non-trivial changes. The design is
+   still moving; an issue lets us agree on direction cheaply.
+2. Run `make lint test` locally — both must pass.
+3. Match existing code style (see above).
+4. Keep commits focused: one logical change per commit. Imperative
+   subject lines, body explaining why if it isn't obvious.
+5. Don't add `Co-Authored-By` trailers — repo policy. If you used
+   AI assistance in writing the patch, that's fine; we just don't
+   pollute every commit message with attribution boilerplate.
+
+### Pull requests
+
+PRs target `main`. CI runs lint + tests on Linux amd64/arm64 and
+Windows amd64; all three must be green to merge. Squash-merge is
+the default; the PR title becomes the merge-commit subject, so
+keep it short and informative.
+
+The PR template asks for:
+
+- A short description of what changed and why.
+- A test plan (commands run, scenarios verified).
+- Anything reviewers need to know to assess the change (related
+  issue, follow-up work, deferred concerns).
+
+### Reporting bugs
+
+Open an issue with:
+
+- restic-manager server version (`restic-manager-server --version`)
+  and agent version (`restic-manager-agent --version`).
+- restic version on the affected host.
+- Steps to reproduce.
+- Server and agent logs (sanitise any tokens before pasting).
+
+Security-sensitive bugs go through the [SECURITY.md](./SECURITY.md)
+disclosure path instead — please don't open a public issue for
+them.
+
+### Suggesting features
+
+Open an issue describing the use case (not just the proposed
+solution). The roadmap in `tasks.md` shows where the project is
+heading; if the suggestion fits a future phase we'll wire it in
+there. If it falls outside the project's scope (multi-tenancy, SaaS,
+non-restic backends — see `spec.md` §2 non-goals) we'll say so
+early to save your time.
+
+## Code of conduct
+
+Project participation is governed by [CODE_OF_CONDUCT.md](./CODE_OF_CONDUCT.md).
+The short version: be civil; assume good faith; harassment is not
+tolerated.

 ## License

-By contributing you agree that your contributions are licensed under
-the [PolyForm Noncommercial 1.0.0](./LICENSE) license.
+By contributing you agree that your contributions are licensed
+under the [PolyForm Noncommercial 1.0.0](./LICENSE) license.
@@ -24,7 +24,18 @@ TAILWIND_URL      := https://github.com/tailwindlabs/tailwindcss/releases/downlo
 TAILWIND_INPUT    := web/styles/input.css
 TAILWIND_OUTPUT   := web/static/css/styles.css

-.PHONY: help build server agent test test-race lint fmt tidy clean run-server run-agent docker release tailwind tailwind-watch setup hooks smoke-restart smoke-stop smoke-status smoke-logs smoke-deploy
+# mdBook for the docs site (P5-01). Single static binary, no
+# Rust toolchain — same pattern as Tailwind.
+MDBOOK_VERSION    ?= v0.4.51
+MDBOOK_OS         := $(shell uname -s | tr A-Z a-z)
+MDBOOK_TRIPLE     := $(shell uname -m)-$(if $(filter darwin,$(MDBOOK_OS)),apple-darwin,unknown-linux-gnu)
+MDBOOK_BIN        := $(BIN_DIR)/mdbook
+MDBOOK_TARBALL    := mdbook-$(MDBOOK_VERSION)-$(MDBOOK_TRIPLE).tar.gz
+MDBOOK_URL        := https://github.com/rust-lang/mdBook/releases/download/$(MDBOOK_VERSION)/$(MDBOOK_TARBALL)
+DOCS_BOOK_DIR     := docs/book
+DOCS_BOOK_OUT     := $(DOCS_BOOK_DIR)/book
+
+.PHONY: help build server agent test test-race lint fmt tidy clean run-server run-agent docker release tailwind tailwind-watch docs docs-watch setup hooks smoke-restart smoke-stop smoke-status smoke-logs smoke-deploy

 # ---- smoke-env tooling -------------------------------------------------
 # The smoke server runs as a transient user-systemd unit so it survives
@@ -60,6 +71,18 @@ tailwind-watch: $(TAILWIND_BIN) ## Watch and rebuild on every save
 	@mkdir -p $$(dirname $(TAILWIND_OUTPUT))
 	$(TAILWIND_BIN) -c tailwind.config.js -i $(TAILWIND_INPUT) -o $(TAILWIND_OUTPUT) --watch

+$(MDBOOK_BIN):
+	@mkdir -p $(BIN_DIR)
+	@echo "==> downloading mdbook $(MDBOOK_VERSION) ($(MDBOOK_TRIPLE))"
+	curl -fsSL "$(MDBOOK_URL)" | tar -xz -C $(BIN_DIR) mdbook
+	@chmod +x $@
+
+docs: $(MDBOOK_BIN) ## Build the docs/book/ mdBook site into docs/book/book/
+	$(MDBOOK_BIN) build $(DOCS_BOOK_DIR)
+
+docs-watch: $(MDBOOK_BIN) ## Serve the docs site at http://127.0.0.1:3000 with live reload
+	$(MDBOOK_BIN) serve $(DOCS_BOOK_DIR) -n 127.0.0.1 -p 3000
+
 agent: ## Build the agent binary
 	@mkdir -p $(BIN_DIR)
 	CGO_ENABLED=0 go build $(GOFLAGS) -ldflags "$(LDFLAGS)" -o $(AGENT_BIN) ./cmd/agent
@@ -90,7 +113,7 @@ tidy: ## go mod tidy
 	go mod tidy

 clean: ## Remove build artifacts
-	rm -rf $(BIN_DIR) coverage.out coverage.html $(TAILWIND_OUTPUT)
+	rm -rf $(BIN_DIR) coverage.out coverage.html $(TAILWIND_OUTPUT) $(DOCS_BOOK_OUT)

 run-server: server ## Build and run the server
 	$(SERVER_BIN)
@@ -1,36 +1,62 @@
 # restic-manager

 Self-hosted, browser-based, single-pane-of-glass for managing
-[restic](https://restic.net) backups across a fleet of Linux and Windows
-endpoints.
+[restic](https://restic.net) backups across a fleet of Linux and
+Windows endpoints.

-> Status: pre-alpha. Phase 0 (project bootstrap) complete; Phase 1 (MVP) in
-> progress. See [`spec.md`](./spec.md) for the design and
-> [`tasks.md`](./tasks.md) for the roadmap.
+> **Status:** pre-1.0, feature-complete for the original use
+> case. Phases 0–4 + 6 are landed (MVP, scheduling, restore,
+> RBAC + OIDC, observability); Phase 5 (OSS readiness — docs site,
+> contributor onboarding, end-to-end CI) is in flight. See
+> [`spec.md`](./spec.md) for the design and [`tasks.md`](./tasks.md)
+> for the live roadmap.

-## What it does (target)
+## What it does

- Central visibility into backup state for every endpoint
- Trigger any restic operation remotely (`backup`, `forget`, `prune`,
-  `check`, `unlock`, `snapshots`, `stats`, `diff`, `restore`)
- Manage per-host backup schedules from the UI
- Live job progress streamed back to the UI
- Restore wizard (browse snapshots, pick paths, restore to original or
-  alternate host)
- Repo health surfacing (size, dedup ratio, last check, lock state)
- Alerting on failure or staleness
- Cross-platform agent (Linux + Windows)
- Ransomware-resistant repo access via append-only credentials
+- Central visibility into backup state for every endpoint.
+- Trigger any restic operation remotely (`backup`, `forget`,
+  `prune`, `check`, `unlock`, `snapshots`, `stats`, `diff`,
+  `restore`).
+- Per-host schedules with named source groups + retention.
+- Live job log streamed to the browser; downloadable as
+  text/NDJSON afterwards.
+- Restore wizard: browse a snapshot's tree, pick paths, restore
+  in-place or to a new directory.
+- Repo health surfacing (size, raw size, last check, lock state),
+  plus a 30/90-day repo-size trend.
+- Alerting over webhook, ntfy, or SMTP.
+- Cross-platform agent (Linux systemd + Windows SCM).
+- Append-only-friendly: separate admin credential for prune.
+- Optional Prometheus `/metrics` endpoint + sample Grafana
+  dashboard.
+- Optional OIDC SSO (Authelia, Authentik, etc.).

-## Architecture (one-line summary)
+## Screenshots

-A small Go control-plane on the Proxmox host, lightweight Go agents on each
-endpoint that hold an outbound WebSocket to the control-plane, and a
-`restic/rest-server` on Unraid that holds the actual backup data. The
-control-plane never touches backup bytes.
+| Sign in | Empty dashboard | Add host |
+|:-------:|:---------------:|:--------:|
+| ![Sign in](docs/screenshots/01-login.png) | ![Dashboard, fresh](docs/screenshots/02-dashboard-empty.png) | ![Add host](docs/screenshots/03-add-host.png) |
+
+| Alerts | Settings | Audit log |
+|:------:|:--------:|:---------:|
+| ![Alerts](docs/screenshots/04-alerts.png) | ![Settings](docs/screenshots/05-settings.png) | ![Audit log](docs/screenshots/06-audit.png) |
+
+(Screenshots from a fresh smoke install with no hosts. A populated
+fleet view and the live-log + restore wizard surfaces are part of
+the docs site under [`docs/book/`](./docs/book) — `make docs` to
+render locally.)
+
+## Architecture (one-line)
+
+A small Go control-plane in Docker, lightweight Go agents on each
+endpoint holding an outbound WebSocket to the control-plane, and
+a restic repository (rest-server, S3, B2, SFTP — anything restic
+speaks) that holds the actual backup data. **The control-plane
+never touches backup bytes.**

 Full architecture diagram and component breakdown:
-[`spec.md` §3](./spec.md).
+[`spec.md` §3](./spec.md), or the rendered version in the
+[docs site](./docs/book/src/concepts/architecture.md).

 ## Repository layout

@@ -38,31 +64,63 @@ Full architecture diagram and component breakdown:
 cmd/server/        control-plane binary
 cmd/agent/         endpoint agent binary
 internal/api       shared API types (REST + WS envelopes)
-internal/server/   HTTP, WS, UI handlers
+internal/server/   HTTP, WS, UI handlers, alert engine
 internal/agent/    service integration, restic runner, local scheduler
 internal/restic    restic CLI wrapper
 internal/store     SQLite persistence
-internal/crypto    secret encryption
+internal/crypto    secret encryption (AEAD)
 internal/auth      passwords, sessions, agent tokens
 web/               server-rendered templates + static assets
-deploy/            Dockerfile, docker-compose.yml, install scripts
-design/            UI wireframes (Phase 0 design pass)
+deploy/            Dockerfile, docker-compose.yml, install scripts, Grafana dashboard
+docs/              prose docs + the mdBook site under docs/book
+e2e/               compose stack + Playwright tests for end-to-end CI
 ```

+## Quickstart
+
+The reference deployment is a single Docker container fronted by
+your existing reverse proxy. See the [installation guide](docs/book/src/getting-started/install.md)
+for the full path; the very short version:
+
+```sh
+export RM_VERSION=v0.9.0    # pin a real tag
+export RM_BASE_URL=https://restic.example.com
+export RM_TRUSTED_PROXY=10.0.0.0/8
+docker compose -f deploy/docker-compose.yml up -d
+```
+
+The server prints a one-time bootstrap token to the log on first
+start. POST it to `/api/bootstrap` (or open `/bootstrap` in a
+browser) to create the admin user.
+
 ## Local development

-Requires Go 1.25+ (built and tested on 1.26). The floor is set by
-`modernc.org/sqlite` v1.50.
+Requires Go 1.25+. The floor is set by `modernc.org/sqlite` v1.50.

 ```sh
 make build           # builds cmd/server and cmd/agent into ./bin
 make test            # runs go test ./...
 make lint            # runs golangci-lint
-make run-server      # runs the server (dev defaults)
+make smoke-restart   # systemd --user smoke server (see CLAUDE.md)
+make docs            # renders the mdBook site to docs/book/book/
 ```

+End-to-end test harness against a Docker Compose stack with a
+sibling Linux agent: see [`docs/e2e.md`](docs/e2e.md). Runs in CI
+on every PR.
+
+## Documentation
+
+- **Concepts and operator guides**: [docs site](docs/book/src/intro.md),
+  rendered with `make docs`.
+- **Reverse-proxy setup**: [docs/reverse-proxy.md](docs/reverse-proxy.md).
+- **Prometheus + Grafana**: [docs/prometheus.md](docs/prometheus.md).
+- **End-to-end test harness**: [docs/e2e.md](docs/e2e.md).
+- **Security policy**: [SECURITY.md](SECURITY.md).
+- **Contributing**: [CONTRIBUTING.md](CONTRIBUTING.md).
+
 ## License

-PolyForm Noncommercial 1.0.0 — see [`LICENSE`](./LICENSE). Free for personal,
-hobby, research, educational, governmental, and other noncommercial use.
-Commercial use requires a separate license.
+[PolyForm Noncommercial 1.0.0](./LICENSE). Free for personal,
+hobby, research, educational, governmental, and other noncommercial
+use. Commercial use requires a separate license.
@@ -0,0 +1,137 @@
+# Security policy
+
+restic-manager handles credentials that grant access to backup
+repositories — losing them means an attacker can read or destroy a
+fleet's backups. We take security reports seriously even at this
+project's small scale.
+
+## Supported versions
+
+Pre-1.0, only `main` HEAD and the latest tagged release are
+supported. Backporting fixes to older tags is not currently offered.
+
+| Version            | Supported      |
+|--------------------|----------------|
+| `main` HEAD        | Yes            |
+| Latest released tag| Yes            |
+| Anything older     | No             |
+
+## Reporting a vulnerability
+
+**Please don't open a public issue for security problems.**
+
+Instead, use one of these private channels:
+
+1. **Gitea private message** to the repository owner. The
+   instance is at <https://gitea.dcglab.co.uk> and the owner's
+   profile (`steve`) has direct-message contact set up.
+2. **Email** to the address on the maintainer's Gitea profile.
+   Use a subject like `[SECURITY] restic-manager: <one-line summary>`
+   so it doesn't get lost. PGP optional — if you want to encrypt,
+   ask for a key first.
+
+If you don't get an acknowledgement within **3 working days**,
+please escalate through the other channel — solo maintainers do
+miss things, and the goal here is to fix the problem, not to
+preserve protocol.
+
+### What to include
+
+- A description of the issue and the impact (what does an attacker
+  gain? confidentiality, integrity, availability?).
+- Affected component (server, agent, install script, docs).
+- Affected version (`restic-manager-server --version`).
+- Reproduction steps if you have them. A working PoC is welcome
+  but not required — a credible threat model is enough.
+- Whether you intend to publish a writeup, and any timing
+  preferences.
+
+### What we'll do
+
+1. Acknowledge receipt within 3 working days.
+2. Confirm or refute the issue, and agree a rough severity (CVSS
+   or just "this is bad / this isn't"). Asking clarifying
+   questions is normal at this stage — please don't read it as
+   foot-dragging.
+3. Develop a fix on a private branch, test it, and prepare a
+   release.
+4. Coordinate disclosure timing with you. The default is **30
+   days from confirmed report to public disclosure**, with a
+   patched release published before the disclosure date. Faster
+   if a workable PoC is already circulating; slower only by
+   mutual agreement.
+5. Credit the reporter in the release notes (or omit the credit
+   if you'd rather stay anonymous — your choice).
+
+## Scope
+
+In scope:
+
+- The server binary (`cmd/server`) and any HTTP, WebSocket, or CLI
+  surface it exposes.
+- The agent binary (`cmd/agent`) and the way it consumes commands
+  from the server.
+- The install scripts (`deploy/install/install.sh`, `install.ps1`)
+  and the systemd unit shipped with them.
+- The docker-compose reference deployment and the docker image we
+  publish.
+- Any cryptographic primitive choice or implementation detail
+  (AEAD, token hashing, session handling, OIDC handshake).
+- Documentation that, if followed, leads operators into an
+  insecure configuration.
+
+Out of scope (not because they aren't real problems, just not ones
+this report channel can act on):
+
+- Vulnerabilities in restic itself — report those upstream at
+  <https://github.com/restic/restic>.
+- Vulnerabilities in third-party dependencies that haven't yet been
+  patched upstream — report upstream first.
+- Issues that require pre-authenticated admin access on the control
+  plane (admins can already do everything; that's not a privilege
+  escalation, that's the design).
+- DoS via resource exhaustion on a deployment without the
+  recommended reverse proxy / rate limiting in front (see
+  `docs/reverse-proxy.md`).
+- Social-engineering scenarios that don't have a technical hook
+  into the project's own surfaces.
+
+## Threat model summary
+
+For context (longer version in [`spec.md`](./spec.md) §11):
+
+- The server is **HTTP-only**; TLS termination, ACME, HSTS, and
+  edge rate-limiting are the reverse proxy's job.
+- Credentials are encrypted at rest with an AEAD key loaded from
+  `RM_SECRET_KEY_FILE`. The same key encrypts agent secrets that
+  travel to the agent over the WS channel.
+- Agents authenticate with bearer tokens issued at enrolment and
+  hashed at rest. Compromise of the server DB does **not** leak
+  bearer tokens in plaintext, but does leak the hashes (which is
+  enough to log in *as* the agent until the operator revokes —
+  see [NS-01 / NS-02](./tasks.md) for the revoke + regenerate
+  flows).
+- The control plane intentionally **never touches backup bytes** —
+  the agent runs `restic` directly against the repo. A
+  compromised control plane can dispatch new jobs but cannot
+  exfiltrate snapshot contents in-band.
+- Append-only credentials are first-class. Forget/prune jobs use a
+  separate, admin-marked credential that the server only pushes
+  for the duration of a maintenance dispatch.
+
+## Hardening checklist for operators
+
+- Run behind a TLS-terminating reverse proxy (Caddy/nginx/Traefik).
+- Set `RM_TRUSTED_PROXY` to the proxy's CIDR so request IPs aren't
+  spoofable.
+- Back up `RM_SECRET_KEY_FILE` separately from the database.
+  Without it the encrypted creds are unrecoverable.
+- Use append-only credentials for the everyday backup path; only
+  the optional admin credential should have write/forget/prune
+  power.
+- Disable users (don't delete) when staff change roles — bearer
+  tokens stay valid until rotated.
+- Watch the alert and audit-log views during enrolment of new
+  hosts.
+
+Thanks for helping keep restic-manager users safe.
@@ -20,6 +20,7 @@ import (
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/fleetupdate"
 	rmhttp "gitea.dcglab.co.uk/steve/restic-manager/internal/server/http"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/maintenance"
+	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/oidc"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ui"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ws"
@@ -89,6 +90,7 @@ func run() error {

 	hub := ws.NewHub()
 	jobHub := ws.NewJobHub()
+	metricsRegistry := metrics.NewRegistry()

 	notifHub := notification.NewHub(st, aead, cfg.BaseURL)
 	alertEngine := alert.NewEngine(st, notifHub)
@@ -122,6 +124,7 @@ func run() error {
 		UI:              renderer,
 		Version:         version,
 		OIDC:            oidcClient,
+		Metrics:         metricsRegistry,
 	}

 	// First-run bootstrap: if the users table is empty, mint a one-time
@@ -0,0 +1,325 @@
+{
+  "annotations": {
+    "list": [
+      {
+        "builtIn": 1,
+        "datasource": { "type": "grafana", "uid": "-- Grafana --" },
+        "enable": true,
+        "hide": true,
+        "iconColor": "rgba(0, 211, 255, 1)",
+        "name": "Annotations & Alerts",
+        "type": "dashboard"
+      }
+    ]
+  },
+  "description": "restic-manager fleet overview. Imports against any Prometheus data source.",
+  "editable": true,
+  "fiscalYearStartMonth": 0,
+  "graphTooltip": 0,
+  "id": null,
+  "links": [],
+  "liveNow": false,
+  "panels": [
+    {
+      "id": 1,
+      "title": "Fleet status",
+      "type": "stat",
+      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+      "gridPos": { "h": 6, "w": 6, "x": 0, "y": 0 },
+      "fieldConfig": {
+        "defaults": {
+          "color": { "mode": "thresholds" },
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              { "color": "red", "value": null },
+              { "color": "green", "value": 1 }
+            ]
+          },
+          "unit": "short"
+        },
+        "overrides": []
+      },
+      "options": {
+        "colorMode": "value",
+        "graphMode": "area",
+        "justifyMode": "auto",
+        "orientation": "auto",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
+        "textMode": "auto"
+      },
+      "targets": [
+        {
+          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+          "expr": "rm_hosts_online",
+          "legendFormat": "online",
+          "refId": "A"
+        },
+        {
+          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+          "expr": "rm_hosts_total",
+          "legendFormat": "total",
+          "refId": "B"
+        }
+      ]
+    },
+    {
+      "id": 2,
+      "title": "Open alerts",
+      "type": "stat",
+      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+      "gridPos": { "h": 6, "w": 6, "x": 6, "y": 0 },
+      "fieldConfig": {
+        "defaults": {
+          "color": { "mode": "thresholds" },
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              { "color": "green", "value": null },
+              { "color": "yellow", "value": 1 },
+              { "color": "red", "value": 5 }
+            ]
+          },
+          "unit": "short"
+        },
+        "overrides": []
+      },
+      "options": {
+        "colorMode": "value",
+        "graphMode": "none",
+        "orientation": "horizontal",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
+        "textMode": "auto"
+      },
+      "targets": [
+        {
+          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+          "expr": "sum by (severity) (rm_active_alerts)",
+          "legendFormat": "{{severity}}",
+          "refId": "A"
+        }
+      ]
+    },
+    {
+      "id": 3,
+      "title": "Backups failing (last reported run)",
+      "type": "stat",
+      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+      "gridPos": { "h": 6, "w": 6, "x": 12, "y": 0 },
+      "fieldConfig": {
+        "defaults": {
+          "color": { "mode": "thresholds" },
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              { "color": "green", "value": null },
+              { "color": "red", "value": 1 }
+            ]
+          },
+          "unit": "short"
+        },
+        "overrides": []
+      },
+      "options": {
+        "colorMode": "value",
+        "graphMode": "area",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
+        "textMode": "auto"
+      },
+      "targets": [
+        {
+          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+          "expr": "count(rm_host_last_backup_success == 0) or vector(0)",
+          "legendFormat": "failing",
+          "refId": "A"
+        }
+      ]
+    },
+    {
+      "id": 4,
+      "title": "Hosts",
+      "type": "table",
+      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+      "gridPos": { "h": 10, "w": 24, "x": 0, "y": 6 },
+      "fieldConfig": {
+        "defaults": {
+          "custom": { "align": "auto", "displayMode": "auto" }
+        },
+        "overrides": [
+          {
+            "matcher": { "id": "byName", "options": "Value #B" },
+            "properties": [
+              { "id": "displayName", "value": "Last backup (s ago)" },
+              { "id": "unit", "value": "s" }
+            ]
+          },
+          {
+            "matcher": { "id": "byName", "options": "Value #C" },
+            "properties": [
+              { "id": "displayName", "value": "Repo size" },
+              { "id": "unit", "value": "bytes" }
+            ]
+          },
+          {
+            "matcher": { "id": "byName", "options": "Value #D" },
+            "properties": [
+              { "id": "displayName", "value": "Snapshots" }
+            ]
+          },
+          {
+            "matcher": { "id": "byName", "options": "Value #A" },
+            "properties": [
+              { "id": "displayName", "value": "Online" }
+            ]
+          },
+          {
+            "matcher": { "id": "byName", "options": "Value #E" },
+            "properties": [
+              { "id": "displayName", "value": "Open alerts" }
+            ]
+          }
+        ]
+      },
+      "options": { "showHeader": true },
+      "transformations": [
+        {
+          "id": "merge",
+          "options": {}
+        }
+      ],
+      "targets": [
+        {
+          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+          "expr": "rm_host_agent_online",
+          "format": "table",
+          "instant": true,
+          "refId": "A"
+        },
+        {
+          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+          "expr": "time() - rm_host_last_backup_timestamp_seconds",
+          "format": "table",
+          "instant": true,
+          "refId": "B"
+        },
+        {
+          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+          "expr": "rm_host_repo_size_bytes",
+          "format": "table",
+          "instant": true,
+          "refId": "C"
+        },
+        {
+          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+          "expr": "rm_host_snapshot_count",
+          "format": "table",
+          "instant": true,
+          "refId": "D"
+        },
+        {
+          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+          "expr": "rm_host_open_alerts",
+          "format": "table",
+          "instant": true,
+          "refId": "E"
+        }
+      ]
+    },
+    {
+      "id": 5,
+      "title": "Repo size over time",
+      "type": "timeseries",
+      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 },
+      "fieldConfig": {
+        "defaults": {
+          "color": { "mode": "palette-classic" },
+          "custom": {
+            "axisLabel": "",
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "lineWidth": 1,
+            "pointSize": 5,
+            "showPoints": "never"
+          },
+          "unit": "bytes"
+        },
+        "overrides": []
+      },
+      "options": {
+        "legend": { "calcs": ["last"], "displayMode": "list", "placement": "bottom", "showLegend": true },
+        "tooltip": { "mode": "multi", "sort": "desc" }
+      },
+      "targets": [
+        {
+          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+          "expr": "rm_host_repo_size_bytes",
+          "legendFormat": "{{host}}",
+          "refId": "A"
+        }
+      ]
+    },
+    {
+      "id": 6,
+      "title": "Job duration p95 (last 1h, by kind)",
+      "type": "timeseries",
+      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 },
+      "fieldConfig": {
+        "defaults": {
+          "color": { "mode": "palette-classic" },
+          "custom": {
+            "drawStyle": "line",
+            "fillOpacity": 5,
+            "lineWidth": 1,
+            "pointSize": 4,
+            "showPoints": "never"
+          },
+          "unit": "s"
+        },
+        "overrides": []
+      },
+      "options": {
+        "legend": { "calcs": ["last"], "displayMode": "list", "placement": "bottom", "showLegend": true },
+        "tooltip": { "mode": "multi", "sort": "desc" }
+      },
+      "targets": [
+        {
+          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
+          "expr": "histogram_quantile(0.95, sum by (kind, le) (rate(rm_job_duration_seconds_bucket[1h])))",
+          "legendFormat": "{{kind}}",
+          "refId": "A"
+        }
+      ]
+    }
+  ],
+  "refresh": "30s",
+  "schemaVersion": 39,
+  "style": "dark",
+  "tags": ["restic-manager", "backups"],
+  "templating": {
+    "list": [
+      {
+        "current": {},
+        "hide": 0,
+        "includeAll": false,
+        "label": "Prometheus",
+        "multi": false,
+        "name": "DS_PROMETHEUS",
+        "options": [],
+        "query": "prometheus",
+        "refresh": 1,
+        "regex": "",
+        "skipUrlSync": false,
+        "type": "datasource"
+      }
+    ]
+  },
+  "time": { "from": "now-6h", "to": "now" },
+  "timepicker": {},
+  "timezone": "",
+  "title": "restic-manager — fleet",
+  "uid": "rm-fleet-overview",
+  "version": 1,
+  "weekStart": ""
+}
@@ -0,0 +1,19 @@
+[book]
+title = "restic-manager"
+description = "Self-hosted control plane for restic backups across a fleet of Linux and Windows endpoints."
+authors = ["Steve Cliff"]
+language = "en-GB"
+multilingual = false
+src = "src"
+
+[output.html]
+default-theme = "ayu"
+preferred-dark-theme = "ayu"
+git-repository-url = "https://gitea.dcglab.co.uk/steve/restic-manager"
+git-repository-icon = "fa-code-fork"
+edit-url-template = "https://gitea.dcglab.co.uk/steve/restic-manager/_edit/main/docs/book/{path}"
+no-section-label = false
+
+[output.html.fold]
+enable = true
+level = 2
@@ -0,0 +1,40 @@
+# Summary
+
+[Introduction](./intro.md)
+
+# Getting started
+
+- [Installing the server](./getting-started/install.md)
+- [Enrolling your first host](./getting-started/enrolling-hosts.md)
+- [Running behind a reverse proxy](./getting-started/reverse-proxy.md)
+
+# Concepts
+
+- [Architecture](./concepts/architecture.md)
+- [Credentials and how they flow](./concepts/credentials.md)
+- [Schedules and source groups](./concepts/schedules-and-source-groups.md)
+- [Repo maintenance](./concepts/repo-maintenance.md)
+
+# Operations
+
+- [Backups and restores](./operations/backups-and-restores.md)
+- [Alerts and notifications](./operations/alerts.md)
+- [Observability with Prometheus](./operations/observability.md)
+- [Updating agents](./operations/updates.md)
+
+# Security
+
+- [Threat model](./security/threat-model.md)
+- [Hardening checklist](./security/hardening.md)
+- [Reporting vulnerabilities](./security/disclosure.md)
+
+# Reference
+
+- [Environment variables](./reference/env-vars.md)
+- [HTTP endpoints](./reference/http-endpoints.md)
+
+---
+
+[Contributing](./contributing.md)
+[Roadmap](./roadmap.md)
+[License](./license.md)
@@ -0,0 +1,121 @@
+# Architecture
+
+## Components
+
+```
+┌────────────────────────────────────────────────────────────┐
+│  Server (control plane, single process)                    │
+│   * chi-based HTTP API + HTMX server-rendered UI           │
+│   * WebSocket hub for agent fan-out + browser fan-out      │
+│   * SQLite store (modernc.org/sqlite, pure Go)             │
+│   * AEAD encryption helpers                                │
+│   * Alert engine + notification hub                        │
+└────────────┬───────────────────────────────────┬───────────┘
+             │ outbound WS only                   │ HTTP(S)
+             │                                    │
+┌────────────▼─────────────┐         ┌────────────▼─────────────┐
+│  Agent (per host)        │         │  Browser (operator)      │
+│   * coder/websocket      │         │   * htmx + a tiny bit    │
+│   * cron for schedules   │         │     of vanilla JS for    │
+│   * restic wrapper       │         │     live job updates     │
+│   * sysinfo collector    │         └──────────────────────────┘
+└────────────┬─────────────┘
+             │ subprocess: restic ...
+             │
+┌────────────▼─────────────────────────────────────────────────┐
+│  restic repository (rest-server, S3, B2, SFTP, local …)      │
+│  Backup data flows directly here. Server never touches it.   │
+└──────────────────────────────────────────────────────────────┘
+```
+
+## Why outbound-only WebSockets?
+
+The agent dials the server on `/ws/agent` with a bearer token. The
+server doesn't initiate connections to the agent. Three reasons:
+
+1. **Firewall friendliness.** Nothing on the endpoint needs an
+   inbound port; this works behind the typical "branch office NAT"
+   without router config.
+2. **Single auth point.** The bearer token is the only credential
+   that crosses the boundary; the agent never accepts an
+   incoming socket.
+3. **Reconnect semantics are simpler.** When the connection drops
+   (NAT timeout, server restart, transient network glitch) the
+   agent backs off and re-dials; the server marks the host
+   offline after 90s and lets the alert engine raise a stale-host
+   alert.
+
+## Why SQLite?
+
+High availability is an explicit non-goal, and SQLite fits that. A
+small control plane managing twelve endpoints does not need
+replication or a separate database tier. SQLite gives us:
+
+- A single file to back up (plus the secret key).
+- Hand-rolled migrations under `internal/store/migrations/` —
+  no migration framework lock-in.
+- `WAL` mode plus per-connection foreign-key enforcement.
+
+The migration files define the entire schema; there's no ORM or
+query-builder layer between Go code and SQL.
+
+## Why the agent runs `restic` itself, not via the server
+
+The control plane never holds backup bytes in flight. That's
+deliberate:
+
+- A compromised control plane cannot exfiltrate snapshot
+  contents in-band — at worst it can dispatch new backup or
+  forget jobs (audit-logged) but the data path is between the
+  agent and the repository.
+- The same agent process can target whichever transport restic
+  natively supports (rest-server, S3, B2, SFTP, local), with no
+  separate mux on the server side.
+
+## Job lifecycle
+
+```
+            ┌──────────────────────┐
+operator →  │ POST /hosts/{id}/    │
+            │       run-backup     │
+            └──────────┬───────────┘
+                       │   1. INSERT INTO jobs (status='queued')
+                       │   2. dispatch command.run over WS
+                       ▼
+            ┌──────────────────────┐
+            │ Agent dispatches     │
+            │ restic subprocess    │
+            └──────────┬───────────┘
+                       │
+                       │   3. job.started   ───▶ store.MarkJobStarted
+                       │   4. job.progress  ───▶ JobHub broadcast (live UI)
+                       │   5. log.stream    ───▶ append to job_logs
+                       │   6. job.finished  ───▶ store.MarkJobFinished
+                       │                          + alert engine eval
+                       │                          + (P6) metrics histogram
+                       ▼
+                  terminal: succeeded | failed | cancelled
+```
+
+Operators see live updates because the browser subscribes to
+`/api/jobs/{id}/stream`, and the WS handler broadcasts each
+agent-emitted envelope to all live subscribers in addition to
+persisting it.
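
The lifecycle above amounts to a small state machine. The sketch below is an illustration under assumptions: the `queued` status and the three terminal statuses appear in the diagram, while the intermediate `running` name, the `Transition` helper, and the exact set of allowed edges are hypothetical.

```go
package main

import "fmt"

// JobStatus is the value stored in the jobs table's status column.
type JobStatus string

const (
	StatusQueued    JobStatus = "queued"
	StatusRunning   JobStatus = "running" // assumed name for the post-job.started state
	StatusSucceeded JobStatus = "succeeded"
	StatusFailed    JobStatus = "failed"
	StatusCancelled JobStatus = "cancelled"
)

// allowed lists the legal next states. Terminal states have no entry,
// so nothing can move a finished job back into flight.
var allowed = map[JobStatus][]JobStatus{
	StatusQueued:  {StatusRunning, StatusFailed, StatusCancelled},
	StatusRunning: {StatusSucceeded, StatusFailed, StatusCancelled},
}

// Transition returns nil when moving from one status to another is legal.
func Transition(from, to JobStatus) error {
	for _, next := range allowed[from] {
		if next == to {
			return nil
		}
	}
	return fmt.Errorf("invalid job transition %q -> %q", from, to)
}
```

Rejecting transitions out of a terminal state keeps a late or duplicated `job.finished` envelope from overwriting the recorded outcome.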
+
+## What scheduling looks like
+
+- The agent runs a local `robfig/cron/v3` instance.
+- The server pushes the desired schedule set to the agent on
+  hello + after every CRUD change.
+- When the agent's cron fires, it sends `schedule.fire` to the
+  server. The server creates a job row, sends `command.run` back,
+  and the agent dispatches a normal backup.
+- If the WS drops between fire and run, the server queues the
+  schedule firing into `pending_runs` and drains on agent
+  reconnect — no missed scheduled backups due to network blips.
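
The queue-and-drain behaviour in the last bullet can be sketched with an in-memory stand-in for `pending_runs`. The real store is presumably a database table; `PendingRuns`, `Enqueue`, and `Drain` below are hypothetical names used only to show the ordering guarantee.

```go
package main

import "sync"

// PendingRun records one schedule firing that could not be dispatched.
type PendingRun struct {
	ScheduleID string
}

// PendingRuns is an in-memory stand-in for the pending_runs table.
type PendingRuns struct {
	mu     sync.Mutex
	byHost map[string][]PendingRun
}

// NewPendingRuns returns an empty queue.
func NewPendingRuns() *PendingRuns {
	return &PendingRuns{byHost: make(map[string][]PendingRun)}
}

// Enqueue stores a firing for a host whose WebSocket is currently down.
func (p *PendingRuns) Enqueue(hostID string, run PendingRun) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.byHost[hostID] = append(p.byHost[hostID], run)
}

// Drain removes and returns every queued firing for the host, oldest
// first. It is meant to be called once when the agent reconnects.
func (p *PendingRuns) Drain(hostID string) []PendingRun {
	p.mu.Lock()
	defer p.mu.Unlock()
	runs := p.byHost[hostID]
	delete(p.byHost, hostID)
	return runs
}
```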
+
+For everything that isn't a backup (forget, prune, check), the
+server runs a 60-second maintenance ticker against
+`host_repo_maintenance` rows and dispatches the relevant command
+when a cadence is due. The agent's local cron only handles
+backups.
@@ -0,0 +1,98 @@
+# Credentials and how they flow
+
+restic-manager handles three credential surfaces:
+
+1. **Operator credentials** — the username + password (or OIDC
+   identity) that logs into the UI.
+2. **Agent bearer tokens** — issued at enrolment, used by the
+   agent to authenticate its WebSocket to the server.
+3. **Repo credentials** — the rest-server / S3 / B2 / SFTP
+   credentials the agent passes to `restic` itself.
+
+Each has a different threat model and storage strategy.
+
+## Operator credentials
+
+- Local users are stored in `users` with a bcrypt password hash.
+- Sessions are random tokens minted at login, stored hashed in
+  the `sessions` table, and expired after 24h. The cookie is
+  HttpOnly, SameSite=Lax, and Secure (when
+  `RM_COOKIE_SECURE=true`, the default).
+- OIDC users carry `auth_source='oidc'` and an `oidc_subject`
+  pinning their IdP identity. Local password login is rejected
+  for OIDC users.
+- Disabling a user soft-deletes them via `disabled_at` —
+  pre-existing sessions are invalidated on the next request.
+
+## Agent bearer tokens
+
+- Minted at enrolment, hashed at rest with `auth.HashToken`.
+- The plaintext token only exists in memory at enrolment time
+  and on the agent's filesystem (`/etc/restic-manager/agent.yaml`,
+  mode `0600`, owned by the service user).
+- Compromise of the server DB leaks the hashes, which is enough
+  to *log in as that agent* until you revoke. Compromise of the
+  agent host leaks the plaintext (via the config file) — same
+  end result.
+- Rotation: re-enrol the host. Today there's no in-place rotate;
+  the operator deletes the host (which cascades, including
+  revoking the bearer hash) and re-runs the install command.
+
+## Repo credentials
+
+This is the credential that ultimately matters for backup
+integrity. restic-manager keeps two slots per host:
+
+- **The everyday credential** (`host_credentials.kind = ''`).
+  Append-only-friendly: this is the one your backup schedule
+  uses. It can write but not delete or forget.
+- **The admin credential** (`host_credentials.kind = 'admin'`).
+  Has full delete rights. Only pushed to the agent transiently
+  while a `prune` or `forget` job is dispatching, and discarded
+  by the agent after the job ends.
+
+### Encryption flow
+
+1. Operator types the credential into the UI or the install form.
+2. Server AEAD-encrypts the cred (`crypto.AEAD.Encrypt`) using the
+   key in `RM_SECRET_KEY_FILE`. The plaintext is dropped from
+   memory.
+3. Encrypted blob is stored in `host_credentials.cred_blob`.
+4. When the agent connects, the server decrypts the blob and
+   sends the **plaintext** down the WebSocket inside a
+   `config.update` envelope.
+5. The agent stores the plaintext in its in-memory secrets store
+   for the lifetime of the process; it's reloaded fresh on every
+   server-side push.
+6. When a job runs, the agent merges the credential into the
+   restic environment (`restic.Env.RepoURL` stays bare; the
+   `user:pass@…` form is built only inside `envSlice()` at the
+   moment of `exec.Command`).
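
Steps 2 to 4 can be sketched with an AEAD from the Go standard library. The text does not say which AEAD `crypto.AEAD.Encrypt` uses, so AES-256-GCM here is an assumption, as are the `encrypt` and `decrypt` names and the nonce-prefixed blob layout.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
)

// newGCM wraps a 16-, 24-, or 32-byte key in AES-GCM.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// encrypt returns nonce||ciphertext, the form that would be stored in
// host_credentials.cred_blob under this sketch's assumptions.
func encrypt(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Seal appends the ciphertext and tag to its first argument.
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// decrypt reverses encrypt and fails if the blob was tampered with.
func decrypt(key, blob []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(blob) < gcm.NonceSize() {
		return nil, errors.New("blob shorter than nonce")
	}
	nonce, ct := blob[:gcm.NonceSize()], blob[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}
```

Because the tag is checked in `decrypt`, a modified blob in the database surfaces as an error at push time instead of reaching the agent as a corrupted credential.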
+
+The merged form is **never logged**. Any URL that reaches the
+structured `slog` output goes through `restic.RedactURL()`
+first.
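
The split between a bare stored URL, a merged URL built at exec time, and a redacted URL for logs can be sketched with `net/url`. `mergeCredential` and `redactURL` are hypothetical stand-ins, not the bodies of `envSlice()` or `restic.RedactURL()`; the handling of a `rest:` prefix is likewise an assumption about the repo URL format.

```go
package main

import (
	"net/url"
	"strings"
)

const restPrefix = "rest:"

// splitPrefix separates restic's "rest:" backend prefix from the URL
// proper, because net/url would otherwise treat "rest" as the scheme.
func splitPrefix(raw string) (prefix, rest string) {
	if strings.HasPrefix(raw, restPrefix) {
		return restPrefix, raw[len(restPrefix):]
	}
	return "", raw
}

// mergeCredential builds the user:pass@host form from a bare repo URL.
// It is meant to be called only at the moment the subprocess is started.
func mergeCredential(repoURL, user, pass string) (string, error) {
	prefix, rest := splitPrefix(repoURL)
	u, err := url.Parse(rest)
	if err != nil {
		return "", err
	}
	u.User = url.UserPassword(user, pass)
	return prefix + u.String(), nil
}

// redactURL returns a log-safe form with any password masked.
func redactURL(raw string) string {
	prefix, rest := splitPrefix(raw)
	u, err := url.Parse(rest)
	if err != nil {
		return "<unparseable url>"
	}
	return prefix + u.Redacted()
}
```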
+
+### Why push plaintext over the wire?
+
+The transport itself is the trust boundary: the WebSocket runs
+inside the same TLS-terminated reverse-proxy connection your
+browser uses, and the agent has already authenticated with its
+bearer token. Re-encrypting the payload on top of that would just
+move the key-management problem somewhere else.
+
+If your reverse proxy isn't terminating TLS, the deployment is
+already broken; see [Hardening](../security/hardening.md).
+
+## Setup tokens (admin-driven)
+
+When an admin creates a new user, the server mints a one-time
+setup link valid for 1 hour. The hash is stored; the raw token
+is shown to the admin once. The user opens the link, sets a
+password, and is dropped into a session. Expired tokens are
+swept on the alert engine's 60s tick.
+
+Same pattern for enrolment tokens: the raw token only exists in
+memory at mint time, and the install snippet is the operator's
+only chance to capture it. If you lose it, regenerate via the
+**Add host** page (NS-02).
@@ -0,0 +1,85 @@
+# Repo maintenance
+
+Backups go in; without maintenance, repos grow forever and
+eventually fall over. restic-manager runs three maintenance
+operations on a per-host cadence:
+
+| Command  | What it does                                                | Default cadence |
+|----------|-------------------------------------------------------------|-----------------|
+| `forget` | Marks snapshots eligible for removal per the retention policy attached to each source group. Cheap, but like `prune` it needs the **admin** credential. | Daily after the last backup of the day |
+| `prune`  | Reclaims space from the repo. Requires the **admin** credential (write+delete). | Weekly, off-peak |
+| `check`  | Verifies repo integrity. Sub-options surface lock state. | Weekly, with `--read-data-subset N%` to sample pack files |
+
+A per-host row in `host_repo_maintenance` holds the cron
+expressions and last-fire anchors. The maintenance ticker on
+the server runs every 60s, finds hosts whose next-fire is due,
+and dispatches the right command. The agent's local cron is
+**only** for backups.
+
+## Why server-side and not agent-side?
+
+The agent's cron knows about backups because backups are
+per-source-group. Maintenance is per-repo, not per-source-group,
+so doing it server-side keeps the per-host wiring simple:
+
+- One ticker, not N agent crons to keep in sync.
+- Cancelling a maintenance dispatch is just "don't dispatch the
+  next one" — no agent-side state to clean up.
+- Skipping offline hosts is trivial (no queue; only scheduled
+  *backups* queue into `pending_runs`).
+
+## Forget and the multi-group payload
+
+A single `forget` job can target several source groups at once.
+The wire envelope (`ForgetGroups`) carries one entry per group,
+each with its retention policy. The agent runs N
+`restic forget --tag <name> --keep-...` invocations in sequence,
+streams their output, and reports a single terminal status.
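
The per-group invocation can be sketched as an argument builder. The struct shapes below are assumptions modelled on the retention keys shown in the source-group YAML; the real `ForgetGroups` wire type may differ. The `--tag` and `--keep-*` flags are standard `restic forget` options.

```go
package main

import "strconv"

// Retention mirrors the retention keys of a source group (assumed shape).
type Retention struct {
	KeepLast, KeepDaily, KeepWeekly, KeepMonthly int
}

// ForgetGroup is one entry of the multi-group payload (assumed shape).
type ForgetGroup struct {
	Name      string
	Retention Retention
}

// forgetArgs builds the restic arguments for one group. Zero-valued
// keep_* fields are omitted so restic does not see "--keep-x 0".
func forgetArgs(g ForgetGroup) []string {
	args := []string{"forget", "--tag", g.Name}
	add := func(flag string, n int) {
		if n > 0 {
			args = append(args, flag, strconv.Itoa(n))
		}
	}
	add("--keep-last", g.Retention.KeepLast)
	add("--keep-daily", g.Retention.KeepDaily)
	add("--keep-weekly", g.Retention.KeepWeekly)
	add("--keep-monthly", g.Retention.KeepMonthly)
	return args
}
```

The agent would loop over the groups, run `restic` with each argument list in turn, and fold the exit codes into the single terminal status.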
+
+## Prune and the admin credential
+
+Prune mutates the repo. The everyday append-only credential
+**cannot** prune — that's the whole point of append-only.
+restic-manager keeps a second slot per host (`kind = 'admin'`)
+for the credential that can.
+
+When a prune is dispatched (cadence-driven or operator-driven):
+
+1. Server pushes the admin credential to the agent in a fresh
+   `config.update`.
+2. Agent runs `restic prune` with the merged credential.
+3. Job finishes; agent discards the admin credential from its
+   in-memory secrets store.
+
+The server never logs the merged URL (see
+[Credentials](./credentials.md)).
+
+## Check and lock state
+
+`restic check` warns about stale locks when it finds them. The
+agent ships every check's output back as a `repo.stats` envelope
+and a stream of log lines; if a stale lock is detected, the
+**Repo** page surfaces a banner with an **Unlock** button. The
+operator-only `unlock` command runs `restic unlock` and clears
+the banner.
+
+`unlock` has no cadence — it's a manual action, never automatic.
+Auto-unlocking would mask the cause (probably a previously
+crashed long-running operation) and risk corrupting an
+operation the operator has merely lost track of.
+
+## Repo stats
+
+After every backup, check, prune, and unlock, the agent runs
+`restic stats --json --mode raw-data` and ships the result as a
+`repo.stats` envelope. The server stores this in
+`host_repo_stats` (latest only) and `host_repo_stats_history`
+(one row per host per day, last-write-wins per column — a
+prune-only patch never nulls a backup-time size).
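
The last-write-wins-per-column rule for the history row can be sketched with nullable fields. The column names below are hypothetical; the point is that a patch only overwrites the columns it actually carries.

```go
package main

// RepoStatsDay is one host's history row for one day. A nil pointer
// means the column has not been reported. (Field names are assumed.)
type RepoStatsDay struct {
	TotalSizeBytes *int64
	SnapshotCount  *int64
	LastPruneUnix  *int64
}

// mergeDay applies patch on top of existing, column by column. Columns
// missing from the patch keep their existing value, so a prune-only
// patch cannot null out a size recorded at backup time.
func mergeDay(existing, patch RepoStatsDay) RepoStatsDay {
	if patch.TotalSizeBytes != nil {
		existing.TotalSizeBytes = patch.TotalSizeBytes
	}
	if patch.SnapshotCount != nil {
		existing.SnapshotCount = patch.SnapshotCount
	}
	if patch.LastPruneUnix != nil {
		existing.LastPruneUnix = patch.LastPruneUnix
	}
	return existing
}
```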
+
+The host detail page surfaces:
+
+- Total size + raw size in the vitals strip.
+- Last-check timestamp + colour-coded status.
+- Last-prune timestamp.
+- 30/90-day repo size trend chart.
@@ -0,0 +1,105 @@
+# Schedules and source groups
+
+Two related but separable ideas:
+
+- A **source group** is a named bundle of "what to back up":
+  include paths, exclude patterns, retention policy, retry
+  configuration, optional pre/post hooks. The group's name is
+  used as the restic snapshot tag, so retention can target it
+  with `restic forget --tag <name>`.
+- A **schedule** is a cron expression that, when it fires,
+  triggers a backup of one or more source groups on a host.
+
+Decoupling them means you can have one schedule covering several
+groups (e.g. `0 1 * * *` running both `system` and `data`), and
+each group has its own retention without duplicating policy
+across schedules.
+
+## Source group anatomy
+
+```yaml
+name: data
+includes:
+  - /var/lib/postgresql
+  - /home
+excludes:
+  - /home/*/.cache
+  - /home/*/Downloads
+retention:
+  keep_last: 7
+  keep_daily: 14
+  keep_weekly: 4
+  keep_monthly: 6
+retry_max: 3
+retry_backoff_seconds: 600
+pre_hook: |
+  pg_dump -U postgres -F c -f /var/lib/postgresql/dumps/all.dump
+post_hook: |
+  rm -f /var/lib/postgresql/dumps/all.dump
+```
+
+### Conflict detection
+
+If your retention policy says `keep_hourly: 24` but no schedule
+points at this group sub-daily, the UI surfaces a
+**conflict-dimension banner** ("`hourly` won't be honoured —
+no schedule fires more often than once a day"). The flag is
+stored on the source group (`conflict_dimension`) and refreshed
+whenever a schedule or group changes.
+
+### Hooks
+
+`pre_hook` and `post_hook` run on the agent host inside
+`/bin/sh -c` (`cmd.exe /C` on Windows). Output is streamed back
+to the live job log as `hook(<phase>): …` lines.
+
+- A non-zero `pre_hook` exit aborts the backup.
+- `post_hook` always runs, with `RM_JOB_STATUS=succeeded|failed`
+  in the environment. Use this for cleanup that must happen
+  whether the backup worked or not.
+- Hooks only run for `kind=backup` jobs. They do not run for
+  `forget`, `prune`, `check`, etc.
+- Hook bodies are AEAD-encrypted at rest (at the HTTP layer); the
+  agent receives plaintext over the WS channel.
+
+A "host default" pair of hooks lives on the host itself; a
+source group's own hooks override them when set.
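
The hook rules above reduce to three small decisions: which shell to use, which hook body wins, and what the `post_hook` sees in its environment. The helper names below are hypothetical and the sketch stops short of actually spawning the process.

```go
package main

// shellFor returns the interpreter and flag used to run a hook body
// on the given operating system (a runtime.GOOS value).
func shellFor(goos string) (string, string) {
	if goos == "windows" {
		return "cmd.exe", "/C"
	}
	return "/bin/sh", "-c"
}

// resolveHook picks the source group's hook when it is set and falls
// back to the host default otherwise.
func resolveHook(hostDefault, group string) string {
	if group != "" {
		return group
	}
	return hostDefault
}

// postHookEnv returns a copy of base with RM_JOB_STATUS appended, so
// the post_hook can tell whether the backup worked.
func postHookEnv(base []string, succeeded bool) []string {
	status := "failed"
	if succeeded {
		status = "succeeded"
	}
	env := append([]string{}, base...)
	return append(env, "RM_JOB_STATUS="+status)
}
```

An agent would feed these into `exec.Command(shell, flag, body)` and set `cmd.Env` from `postHookEnv` for the post phase.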
+
+## Schedule anatomy
+
+```yaml
+cron: "0 2 * * *"
+enabled: true
+source_group_ids:
+  - <gid for "data">
+  - <gid for "system">
+```
+
+Slim by design: a schedule says **when** and **which groups**.
+Everything else (paths, retention, hooks) lives on the groups.
+
+The agent's local cron fires the schedule. If the WebSocket is
+down at fire time, the server queues the firing into
+`pending_runs` and drains it on the next agent reconnect — a
+short network blip won't lose the backup.
+
+### Last / next run
+
+The schedules tab shows "next" (computed by parsing the cron
+expression with `robfig/cron/v3`) and "last" (the latest
+`actor_kind=schedule` job in the `jobs` table) for every
+schedule. The dashboard host row also surfaces `next 12h ago/from
+now` when a single covering schedule is the run-now candidate.
+
+## Bandwidth limits
+
+Two places set restic's `--limit-upload` / `--limit-download`:
+
+1. **Host-wide caps** on the host row (`bandwidth_up_kbps`,
+   `bandwidth_down_kbps`). Pushed to the agent on hello and
+   after `PUT /api/hosts/{id}/bandwidth`. Apply to every restic
+   invocation on the host.
+2. **Per-job overrides** on the per-source-group Run-now form.
+   Win over host caps for the lifetime of that one job.
+
+If neither is set, restic runs unthrottled.
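
The precedence between host caps and per-job overrides can be sketched as below. This assumes that zero means "unset", that an override applies per direction, and that the stored kbps values are passed to restic unchanged; none of that is confirmed by the text, and `Limits`, `effectiveLimits`, and `limitFlags` are hypothetical names.

```go
package main

import "strconv"

// Limits holds upload and download caps; 0 means no cap is set.
type Limits struct {
	UpKbps, DownKbps int
}

// effectiveLimits lets a per-job override win over the host-wide cap,
// one direction at a time.
func effectiveLimits(host, job Limits) Limits {
	out := host
	if job.UpKbps > 0 {
		out.UpKbps = job.UpKbps
	}
	if job.DownKbps > 0 {
		out.DownKbps = job.DownKbps
	}
	return out
}

// limitFlags turns the effective limits into restic flags. With no
// limits set it returns nothing, so restic runs unthrottled.
func limitFlags(l Limits) []string {
	var flags []string
	if l.UpKbps > 0 {
		flags = append(flags, "--limit-upload", strconv.Itoa(l.UpKbps))
	}
	if l.DownKbps > 0 {
		flags = append(flags, "--limit-download", strconv.Itoa(l.DownKbps))
	}
	return flags
}
```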
@@ -0,0 +1,17 @@
+# Contributing
+
+Full contributor guide:
+[`CONTRIBUTING.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/CONTRIBUTING.md)
+in the repository root.
+
+The short version:
+
+- Open an issue first for non-trivial changes; the design is
+  still moving and unsolicited large PRs may conflict with
+  in-flight work.
+- `make lint test` must pass.
+- One logical change per commit, no `Co-Authored-By` trailers.
+- UK English in identifiers and comments; comments explain the
+  **why** not the **what**.
+
+Code of conduct: [`CODE_OF_CONDUCT.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/CODE_OF_CONDUCT.md).
@@ -0,0 +1,113 @@
+# Enrolling your first host
+
+The control plane only knows about hosts you've explicitly
+enrolled. Two paths exist:
+
+1. **Token-based enrolment** — admin generates a token, pastes it
+   into an install command on the host. The host appears immediately,
+   already mapped to the desired repo.
+2. **Announce-and-approve** — the agent runs without a token,
+   "announces" itself to the server, and a human in the UI accepts
+   the announcement.
+
+Token-based is the default and what most operators want; the
+announce flow exists for the case where you can't easily paste a
+secret onto the host (auto-imaged endpoints, scripted bring-ups
+from a config repo).
+
+## Token-based enrolment
+
+### From the UI
+
+1. Click **+ Add host** on the dashboard.
+2. Fill in the hostname, the restic repo URL, and the repo
+   credentials. The credentials are AEAD-encrypted at the server
+   immediately; what you paste is what the agent receives.
+3. Optionally pick the initial source paths — these become the
+   first source group on the host.
+4. Submit. The server mints a one-time token and shows you a
+   copy-pasteable install snippet.
+
+### On the host (Linux)
+
+```sh
+curl -fsSL https://restic.example.com/install/install.sh | \
+    sudo RM_SERVER=https://restic.example.com \
+         RM_ENROL_TOKEN=<token> \
+         bash
+```
+
+The script:
+
+1. Detects architecture (`amd64` or `arm64`).
+2. Downloads the agent binary from `/agent/binary?os=…&arch=…`.
+3. Drops the systemd unit at
+   `/etc/systemd/system/restic-manager-agent.service`.
+4. Runs the agent in `-enrol` mode, which posts the token and
+   stores the persistent bearer it gets back.
+5. Enables and starts the unit.
+
+Within seconds the host should appear on the dashboard as
+**online**.
+
+### On the host (Windows)
+
+```pwsh
+$env:RM_SERVER  = "https://restic.example.com"
+$env:RM_ENROL_TOKEN = "<token>"
+iwr -useb $env:RM_SERVER/install/install.ps1 | iex
+```
+
+The script follows the same shape: it registers a Windows service via
+the SCM (see P2-16 for details), runs `-enrol`, and starts the service.
+
+## Recovering a lost token
+
+Tokens are single-use and short-lived (1h). If you closed the tab
+before pasting the install command, head to the **Add host** page —
+outstanding tokens are listed there with a **Regenerate** button.
+Regenerating revokes the old token's hash and mints a fresh raw
+token while preserving the original repo credentials and initial
+paths. (See NS-02 in `tasks.md` for the design rationale.)
+
+## Announce-and-approve
+
+If the host can reach the server but you don't want to paste a
+secret on it, run the agent in `-announce` mode:
+
+```sh
+restic-manager-agent -announce \
+                     -server https://restic.example.com \
+                     -hostname myhost
+```
+
+The host appears in the **Pending hosts** panel on the dashboard
+with its hostname, OS, arch, and the source IP that announced it.
+Click **Accept**, fill in the repo URL + credentials, and the
+server pushes the bearer over the still-open WebSocket. No
+back-and-forth round trip.
+
+If you don't accept within an hour the announcement is swept.
+
+## What happens on the agent
+
+After enrolment, the agent:
+
+1. Connects via WebSocket to `/ws/agent` with its bearer token.
+2. Sends a `hello` envelope with its OS, arch, agent version,
+   restic version, and protocol version.
+3. Receives a `config.update` carrying its encrypted repo
+   credentials and any source-group paths.
+4. Sits idle, sending a heartbeat every 30s. Operator-driven
+   "Run now" actions arrive as `command.run` envelopes; scheduled
+   jobs are driven by the agent's local cron.
+
+## Auto-init of the repository
+
+The first time a backup runs, the agent invokes `restic init`
+against the repo you configured at enrolment. If the repo already
+exists (`config file already exists`) the agent treats it as a
+success and proceeds. The host's repo status (`unknown` →
+`ready` / `init_failed`) is surfaced under the vitals strip on
+the host detail page; if init fails, save fresh credentials in
+the **Repo** tab to retry.
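+
+The "already exists counts as success" rule can be sketched as a small
+classifier. This is an illustration only: `classifyInit` and the
+`RepoStatus` type are hypothetical, and matching on the stderr text is
+an assumption about how the agent detects the case.
+
+```go
+package main
+
+import (
+	"fmt"
+	"strings"
+)
+
+// RepoStatus mirrors the three states named in the text.
+type RepoStatus string
+
+const (
+	StatusUnknown    RepoStatus = "unknown"
+	StatusReady      RepoStatus = "ready"
+	StatusInitFailed RepoStatus = "init_failed"
+)
+
+// classifyInit maps the outcome of `restic init` to a repo status.
+// exitCode and stderr are whatever the caller captured from the subprocess.
+func classifyInit(exitCode int, stderr string) RepoStatus {
+	if exitCode == 0 {
+		return StatusReady
+	}
+	// An already-initialised repo is treated as success.
+	if strings.Contains(stderr, "config file already exists") {
+		return StatusReady
+	}
+	return StatusInitFailed
+}
+
+func main() {
+	fmt.Println(classifyInit(1, "Fatal: create repository failed: config file already exists"))
+	fmt.Println(classifyInit(1, "Fatal: wrong password"))
+}
+```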
@@ -0,0 +1,92 @@
+# Installing the server
+
+The reference deployment is a single Docker container fronted by
+your existing reverse proxy. The image bundles the server binary,
+the cross-compiled agent binaries, and the install scripts.
+
+## Prerequisites
+
+- A Linux host with Docker and Docker Compose.
+- A reverse proxy in front (Caddy, nginx, Traefik) terminating
+  TLS on a public hostname. The server itself is HTTP-only by
+  design — see [Reverse proxy](./reverse-proxy.md) for why.
+- A persistent volume for the server's data directory.
+
+## Quick start
+
+The reference compose file lives at
+[`deploy/docker-compose.yml`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/deploy/docker-compose.yml):
+
+```yaml
+services:
+  restic-manager:
+    image: gitea.dcglab.co.uk/steve/restic-manager:${RM_VERSION:-latest}
+    restart: unless-stopped
+    environment:
+      RM_LISTEN: ":8080"
+      RM_DATA_DIR: "/data"
+      RM_BASE_URL: "https://restic.example.com"
+      # Trust your reverse proxy's CIDR so X-Forwarded-* are honoured.
+      RM_TRUSTED_PROXY: "10.0.0.0/8"
+    volumes:
+      - rm-data:/data
+    ports:
+      # Bind localhost only — your reverse proxy is the public face.
+      - "127.0.0.1:8080:8080"
+
+volumes:
+  rm-data:
+```
+
+Bring it up:
+
+```sh
+docker compose up -d
+docker compose logs -f restic-manager
+```
+
+The first run prints a one-time **bootstrap token** to the log. Use
+it within an hour or it expires; if you miss the window, the
+container prints it again on the next start as long as no admin user
+exists.
+
+## First-run admin setup
+
+Open `https://restic.example.com/bootstrap` (or whatever your
+public URL is). Paste the bootstrap token, pick a username and a
+password (≥ 12 characters), and submit. You'll land in the
+dashboard logged in as the new admin.
+
+If you'd rather curl it, the equivalent is:
+
+```sh
+curl -X POST https://restic.example.com/api/bootstrap \
+     -H 'Content-Type: application/json' \
+     -d '{"token":"<token-from-log>","username":"admin","password":"<≥12 chars>"}'
+```
+
+## Backing up the secret key
+
+Inside the data volume, `secret.key` holds the AEAD key used to
+encrypt every credential at rest. **Back it up separately from
+the database.** Without it, encrypted credentials in the database
+are unrecoverable; you'd have to re-enrol every host.
+
+A simple working approach: copy `secret.key` to your password
+manager or to a separately-backed-up secrets vault the day you
+install. It doesn't change.
+
+## Updating the server
+
+```sh
+# Pin a new version in your compose file (.env or docker-compose.yml),
+# then:
+docker compose pull
+docker compose up -d
+```
+
+Migrations run automatically on startup; the server will refuse to
+start if a migration fails (better to bail than to half-migrate).
+
+For the agent self-update story, see
+[Updating agents](../operations/updates.md).
@@ -0,0 +1,95 @@
+# Running behind a reverse proxy
+
+The restic-manager server is HTTP-only by design. TLS termination,
+public hostname, ACME, HSTS, and edge-level rate limiting all
+belong to a reverse proxy you already operate outside this project.
+
+## What the proxy must forward
+
+The server reads four headers when (and only when) the immediate
+peer matches `RM_TRUSTED_PROXY`:
+
+| Header                 | Value                                              | Why |
+|------------------------|----------------------------------------------------|-----|
+| `X-Forwarded-For`      | The original client IP                             | Rate-limit keys, audit log entries, OIDC redirect-URI checks. |
+| `X-Forwarded-Proto`    | `https`                                            | Used for absolute URLs (e.g. OIDC redirect URIs). |
+| `Host`                 | The public hostname clients use                    | Cookies are scoped to this; `RM_BASE_URL` must match. |
+| `Connection` / `Upgrade` | Pass through unchanged                           | `/ws/agent` and `/api/jobs/{id}/stream` are WebSockets; without `Upgrade: websocket` they fail. |
+
+Set `RM_TRUSTED_PROXY` to the CIDR (or comma-separated list of
+CIDRs) the proxy connects from. Anything outside that range has
+its `X-Forwarded-*` headers ignored, so a stray request that
+bypasses the proxy can't spoof the client IP.
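+
+A rough Go sketch of that trust check, using only the standard
+library. `parseTrusted` and `clientIP` are hypothetical helpers, and
+taking the rightmost `X-Forwarded-For` entry (the one the trusted
+proxy itself appended) is an assumption about the server's behaviour,
+not a statement of it.
+
+```go
+package main
+
+import (
+	"fmt"
+	"net/netip"
+	"strings"
+)
+
+// parseTrusted turns a comma-separated CIDR list into prefixes.
+func parseTrusted(list string) ([]netip.Prefix, error) {
+	var out []netip.Prefix
+	for _, part := range strings.Split(list, ",") {
+		part = strings.TrimSpace(part)
+		if part == "" {
+			continue
+		}
+		p, err := netip.ParsePrefix(part)
+		if err != nil {
+			return nil, err
+		}
+		out = append(out, p)
+	}
+	return out, nil
+}
+
+// clientIP returns the address a request should be attributed to.
+// peer is the TCP peer ("ip:port"); xff is the raw X-Forwarded-For value.
+func clientIP(peer, xff string, trusted []netip.Prefix) string {
+	ap, err := netip.ParseAddrPort(peer)
+	if err != nil {
+		return peer
+	}
+	addr := ap.Addr().Unmap()
+	for _, p := range trusted {
+		if p.Contains(addr) {
+			// Only a trusted peer gets its forwarded header honoured.
+			parts := strings.Split(xff, ",")
+			last := strings.TrimSpace(parts[len(parts)-1])
+			if fwd, err := netip.ParseAddr(last); err == nil {
+				return fwd.String()
+			}
+			break
+		}
+	}
+	return addr.String()
+}
+
+func main() {
+	trusted, err := parseTrusted("10.0.0.0/8")
+	if err != nil {
+		panic(err)
+	}
+	fmt.Println(clientIP("10.0.0.5:4321", "203.0.113.7", trusted))
+	fmt.Println(clientIP("198.51.100.9:5555", "203.0.113.7", trusted))
+}
+```
+
+The second call models a request that bypassed the proxy: the header
+is ignored and the TCP peer address is used instead.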
+
+## Caddy
+
+```caddyfile
+restic.example.com {
+    encode zstd gzip
+    reverse_proxy 127.0.0.1:8080 {
+        header_up X-Real-IP {remote_host}
+    }
+}
+```
+
+Caddy adds `X-Forwarded-For` / `X-Forwarded-Proto` automatically
+and passes WebSocket headers through by default, so this is the
+whole config.
+
+## nginx
+
+```nginx
+server {
+    listen 443 ssl http2;
+    server_name restic.example.com;
+
+    ssl_certificate     /etc/letsencrypt/live/restic.example.com/fullchain.pem;
+    ssl_certificate_key /etc/letsencrypt/live/restic.example.com/privkey.pem;
+
+    location / {
+        proxy_pass         http://127.0.0.1:8080;
+        proxy_http_version 1.1;
+        proxy_set_header   Host              $host;
+        proxy_set_header   X-Forwarded-For   $proxy_add_x_forwarded_for;
+        proxy_set_header   X-Forwarded-Proto https;
+
+        # WebSocket upgrade
+        proxy_set_header   Upgrade           $http_upgrade;
+        proxy_set_header   Connection        "upgrade";
+
+        # Long-lived agent WS — disable read timeout for this surface.
+        proxy_read_timeout 86400s;
+    }
+}
+```
+
+## Traefik
+
+```yaml
+http:
+  routers:
+    restic-manager:
+      rule: "Host(`restic.example.com`)"
+      entryPoints: [websecure]
+      tls:
+        certResolver: letsencrypt
+      service: restic-manager
+
+  services:
+    restic-manager:
+      loadBalancer:
+        servers:
+          - url: "http://restic-manager:8080"
+        passHostHeader: true
+```
+
+Traefik forwards WebSocket upgrades and the standard
+`X-Forwarded-*` set out of the box.
+
+## Verification
+
+After bringing the proxy up, the audit log should show your real
+client IP for an interactive login (not the proxy's local
+address). If you see `127.0.0.1` or the proxy's container IP, your
+`RM_TRUSTED_PROXY` is wrong or `X-Forwarded-For` isn't being
+forwarded.
@@ -0,0 +1,86 @@
+# restic-manager
+
+restic-manager is a self-hosted, browser-based, single-pane-of-glass
+for managing [restic](https://restic.net) backups across a fleet of
+Linux and Windows endpoints. It's designed for **small fleets** —
+the original target was twelve endpoints — and **one operator**.
+
+## What it does
+
+- Centralised view of every endpoint's last backup, repo size,
+  snapshot count, and recent jobs.
+- Trigger any restic operation remotely (`backup`, `forget`, `prune`,
+  `check`, `unlock`, `snapshots`, `stats`, `diff`, `restore`).
+- Per-host backup schedules with source groups (named bundles of
+  paths + retention policy).
+- Live job log streamed to the browser; downloadable as text or NDJSON.
+- Restore wizard with snapshot tree browse + path selection.
+- Repo-level health surfacing (size, raw size, last-check, lock
+  state) plus a 30/90-day size trend.
+- Alerting over webhook, ntfy, or SMTP.
+- Cross-platform agent (Linux + Windows).
+- Works with append-only repo credentials, using a separate admin
+  credential for forget/prune.
+
+## What it isn't
+
+- **Not a SaaS.** Single-instance, single-tenant, by design.
+- **Not a replacement for restic** — it's a control plane. The agent
+  shells out to a real `restic` binary.
+- **Not highly available.** SQLite, single process; if you need
+  HA backups, you're shopping in the wrong aisle.
+- **Not a multi-protocol backup tool.** restic only.
+
+## How it fits together
+
+```
+┌──────────────────────────────────────────────┐
+│  Server (control plane, Docker)              │
+│   - REST + WebSocket API                     │
+│   - SQLite store                             │
+│   - Embedded HTMX UI                         │
+└──────────┬─────────────────────────┬─────────┘
+           │ outbound WS              │ HTTP(S)
+           │                          │
+┌──────────▼──────────┐    ┌──────────▼─────────┐
+│  Agent (per host)   │    │  Browser (operator) │
+│   - restic wrapper  │    └─────────────────────┘
+│   - cron for sched. │
+└──────────┬──────────┘
+           │ restic
+┌──────────▼──────────────────────────────────┐
+│  rest-server / S3 / SFTP / local repo       │
+│  (the actual backup data — server never     │
+│   touches it)                               │
+└─────────────────────────────────────────────┘
+```
+
+The control plane is a Go binary that runs in Docker. Each endpoint
+runs a small Go agent that holds an outbound WebSocket to the
+control plane. Backup data flows directly between the agent and the
+restic repository — the control plane never sees a snapshot byte.
+
+## Where to start
+
+- [Installing the server](./getting-started/install.md) walks
+  through the Docker-based reference deployment.
+- [Enrolling your first host](./getting-started/enrolling-hosts.md)
+  covers the install scripts and the announce-and-approve flow.
+- [Architecture](./concepts/architecture.md) is the right read if
+  you want to know why something is the way it is before running
+  the install.
+
+## Project status
+
+Pre-1.0 but feature-complete for the original use case. Phases
+0–4 are landed (MVP, scheduling, restore, RBAC + OIDC); Phase 5
+(this docs site, contributor onboarding, end-to-end CI) is in
+flight. See [`tasks.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/tasks.md)
+for the live roadmap and [`spec.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/spec.md)
+for the canonical design doc.
+
+## License
+
+[PolyForm Noncommercial 1.0.0](https://polyformproject.org/licenses/noncommercial/1.0.0/).
+Personal and community deployments welcome; commercial use
+requires a separate license.
@@ -0,0 +1,39 @@
+# License
+
+restic-manager is licensed under
+[**PolyForm Noncommercial 1.0.0**](https://polyformproject.org/licenses/noncommercial/1.0.0/).
+The full text lives at
+[`LICENSE`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/LICENSE)
+in the repository root.
+
+## What this means
+
+- **Personal, hobbyist, educational, charitable, and similar
+  noncommercial use** is fully permitted, including modification
+  and redistribution.
+- **Commercial use is not permitted** without a separate
+  license. The maintainer is not currently offering one — if
+  you need commercial rights, open an issue to start the
+  conversation.
+- The license is permissive about everything except commercial
+  use: you can fork, modify, deploy in your home/lab, and
+  contribute back.
+
+## Why this license
+
+The PolyForm Noncommercial license was chosen because:
+
+- It's a real, legal, plainly-worded license (not a custom
+  half-written variant).
+- It permits the realistic uses for a hobby project (the
+  maintainer's homelab, a friend's fleet, a charity's IT
+  closet) without inviting commercial vendors to repackage
+  the work.
+- It's compatible with the project staying small and
+  maintainable — the maintainer doesn't want to be on the hook
+  for SLA-grade commercial support.
+
+## Contributions
+
+By contributing, you agree your contributions are licensed
+under the same PolyForm Noncommercial 1.0.0 license.
@@ -0,0 +1,73 @@
+# Alerts and notifications
+
+restic-manager raises alerts on conditions that need human
+attention. The alert engine evaluates rules on a 60s tick and
+on every job-finished / host-online event.
+
+## Built-in alert kinds
+
+| Kind                | Trigger | Severity |
+|---------------------|---------|----------|
+| `backup_failed`     | A backup job ends in `failed` or `cancelled` | warning |
+| `forget_failed`     | A forget job ends in `failed` | warning |
+| `prune_failed`      | A prune job ends in `failed` | critical |
+| `check_failed`      | A check job ends in `failed` | critical |
+| `agent_offline`     | A host has been offline more than 90s past its heartbeat cadence | warning |
+| `stale_schedule`    | A schedule's "last run" is more than 1.5 × its interval ago | warning |
+| `update_failed`     | An agent self-update reported a failure or didn't reconnect within 90s | warning |
+| `fleet_update_halted` | The rolling fleet-update worker stopped on a failure | critical |
+
+Each alert has a `dedup_key` so re-firing the same condition
+just bumps `last_seen_at` — the operator gets one row per
+condition, not a thousand.
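+
+The dedup behaviour can be sketched with an in-memory map. This is an
+illustration under assumptions: the real store is the database, and
+the `Alert` fields, the `kind:host` key format, and the `Store` type
+are all hypothetical.
+
+```go
+package main
+
+import "fmt"
+
+// Alert is a hypothetical, trimmed-down alert row.
+type Alert struct {
+	DedupKey    string
+	Kind        string
+	FirstSeenAt int64 // Unix seconds
+	LastSeenAt  int64 // Unix seconds
+}
+
+// Store keeps at most one open alert per dedup key.
+type Store struct {
+	open map[string]*Alert
+}
+
+func NewStore() *Store { return &Store{open: make(map[string]*Alert)} }
+
+// Raise inserts a new alert or, if the key is already open, only bumps
+// LastSeenAt. It reports whether a new row was created.
+func (s *Store) Raise(kind, hostID string, now int64) bool {
+	key := kind + ":" + hostID // key format is an assumption
+	if a, ok := s.open[key]; ok {
+		a.LastSeenAt = now
+		return false
+	}
+	s.open[key] = &Alert{DedupKey: key, Kind: kind, FirstSeenAt: now, LastSeenAt: now}
+	return true
+}
+
+// Open returns the number of open alerts.
+func (s *Store) Open() int { return len(s.open) }
+
+func main() {
+	s := NewStore()
+	fmt.Println(s.Raise("agent_offline", "host-1", 1000))
+	fmt.Println(s.Raise("agent_offline", "host-1", 1060))
+	fmt.Println(s.Open())
+}
+```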
+
+## Lifecycle
+
+```
+raised  ──acknowledge──▶  acknowledged  ──resolve──▶  resolved
+   │                          │
+   └────────auto-resolve──────┘
+   (e.g. agent_offline auto-resolves on agent_online)
+```
+
+- **Acknowledge** says "I've seen this, stop notifying about it".
+- **Resolve** says "the underlying condition is gone".
+- Some alerts auto-resolve when the condition clears
+  (`agent_offline` is the canonical example).
+
+## Notification channels
+
+Configure under **Settings → Notifications**. Each channel can
+subscribe to all alerts or filter by severity.
+
+### Webhook
+
+Posts a JSON envelope to a URL of your choice. Useful for
+piping into Slack via an Incoming Webhook URL or into your own
+alerting tooling.
+
+### ntfy
+
+Pushes a plain-text alert to an [ntfy.sh](https://ntfy.sh/)
+topic. Configure the topic URL; optional bearer token if you
+self-host with auth.
+
+### SMTP
+
+Plain SMTP (with optional TLS). Configure host, port,
+username, password, and the recipient list.
+
+## Test fire
+
+Each channel exposes a **Test fire** button that dispatches a
+single synthetic alert through the channel without touching the
+alert engine. Use this when you've added a channel and want to
+verify connectivity before the next real failure happens.
+
+## What gets logged
+
+Every alert raise / acknowledge / resolve writes an audit log
+entry. The audit log UI at **Settings → Audit log** filters by
+user, action, target, and time range — useful for the
+post-incident "who clicked acknowledge on the prune-failure
+alert" question.
@@ -0,0 +1,73 @@
+# Backups and restores
+
+## Running a backup
+
+Three ways to trigger one:
+
+1. **Scheduled** — the agent's local cron fires at the time set
+   on the schedule.
+2. **Run-now** — operator clicks **Run now** on the host detail
+   right rail. Posts to `/hosts/{id}/run-backup` (defaults to all
+   source groups) or to a per-group form for finer control.
+3. **API** — `POST /api/hosts/{id}/jobs` with the appropriate
+   payload. Same audit + dispatch path.
+
+In every case the server creates a `jobs` row, broadcasts a
+`command.run` to the host, and lands the operator on the live
+job log page (HTMX `HX-Redirect`).
+
+## Cancelling a job
+
+Any running job — backup, forget, prune, restore, anything —
+exposes a **Cancel** button on its detail page. The server
+broadcasts `command.cancel`, and the agent kills the running
+restic subprocess via context cancel: SIGTERM first, SIGKILL
+after a 5s grace (`cmd.Cancel` + `cmd.WaitDelay`). On Windows the
+SIGTERM step is replaced with `os.Kill` because Windows can't
+deliver SIGTERM. Result: a cancelled job lands as `cancelled`
+within a couple of hundred milliseconds.
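+
+The SIGTERM-then-SIGKILL pattern maps onto two fields of `exec.Cmd`
+available since Go 1.20. The sketch below is Unix-only (it uses
+`syscall.SIGTERM`) and uses `sleep 30` as a stand-in for a restic
+subprocess; `runCancellable` and `cancelAfter` are hypothetical names.
+
+```go
+package main
+
+import (
+	"context"
+	"fmt"
+	"os/exec"
+	"syscall"
+	"time"
+)
+
+// runCancellable starts a subprocess tied to ctx and waits for it.
+func runCancellable(ctx context.Context, grace time.Duration, name string, args ...string) error {
+	cmd := exec.CommandContext(ctx, name, args...)
+	// Called when ctx is done: ask the process to stop politely.
+	cmd.Cancel = func() error {
+		return cmd.Process.Signal(syscall.SIGTERM)
+	}
+	// If the process is still alive after the grace period, it is killed.
+	cmd.WaitDelay = grace
+	return cmd.Run()
+}
+
+// cancelAfter runs the command, cancels it after delayMillis, and returns
+// the elapsed time in milliseconds together with the error from the run.
+func cancelAfter(delayMillis int, name string, args ...string) (int64, error) {
+	ctx, cancel := context.WithCancel(context.Background())
+	defer cancel()
+	timer := time.AfterFunc(time.Duration(delayMillis)*time.Millisecond, cancel)
+	defer timer.Stop()
+	start := time.Now()
+	err := runCancellable(ctx, 5*time.Second, name, args...)
+	return time.Since(start).Milliseconds(), err
+}
+
+func main() {
+	elapsed, err := cancelAfter(200, "sleep", "30")
+	fmt.Printf("stopped after ~%dms: %v\n", elapsed, err)
+}
+```
+
+A process that exits on SIGTERM returns well inside the grace period;
+the forced kill only matters for one that ignores the signal.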
+
+## Restore wizard
+
+Restoring a file or path goes through a four-step wizard at
+`/hosts/{id}/restore`:
+
+1. **Pick a snapshot.** Search by id or by date; the page is
+   pre-populated when you launched the wizard from a snapshot row.
+2. **Browse the snapshot tree.** Lazy-loaded children via the
+   `MsgTreeList` synchronous WS RPC; results are cached
+   per-wizard-session for 30 minutes. Pick the absolute paths
+   you want.
+3. **Choose a target.** Either **In place** (overwrites the
+   live filesystem; requires you to type the hostname to
+   confirm) or **New directory** (default
+   `$HOME/rm-restore/<job-id>/`; agent expands `$HOME` /
+   `${HOME}` / `~/` and creates the directory chain).
+4. **Review and submit.** Server mints a job, dispatches
+   `command.run` with a `RestorePayload`, and `HX-Redirect`s to
+   the live job log.
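+
+The home-directory expansion from step 3 could look roughly like the
+following. `expandHome` is a hypothetical helper; expanding only a
+leading marker, and passing the home directory in as a parameter, are
+assumptions made to keep the sketch self-contained.
+
+```go
+package main
+
+import (
+	"fmt"
+	"path/filepath"
+	"strings"
+)
+
+// expandHome replaces a leading ${HOME}, $HOME or ~ with home.
+// Markers elsewhere in the path are left untouched.
+func expandHome(path, home string) string {
+	for _, prefix := range []string{"${HOME}", "$HOME", "~"} {
+		if path == prefix {
+			return home
+		}
+		if strings.HasPrefix(path, prefix+"/") {
+			return filepath.Join(home, path[len(prefix)+1:])
+		}
+	}
+	return path
+}
+
+func main() {
+	home := "/home/steve" // illustrative value
+	for _, p := range []string{"$HOME/rm-restore/42/", "${HOME}/x", "~/y", "/srv/z"} {
+		fmt.Println(expandHome(p, home))
+	}
+}
+```
+
+Creating the directory chain afterwards would typically be a single
+`os.MkdirAll` call on the expanded path.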
+
+`--no-ownership` is gated on restic ≥ 0.17 (the flag was added
+in that release). Hosts running 0.16 don't get the flag and
+restore as the running user instead.
+
+## Snapshot diff
+
+Enter two snapshot ids in the **Diff** form on the host detail page to
+create a `JobDiff` job that runs `restic diff <a> <b>`. Output streams
+to the standard live job log. Useful when investigating a
+suspiciously-sized backup.
+
+## Job log artefacts
+
+Every job's log is persisted in `job_logs` (one row per line),
+not just streamed in-memory. That gives you:
+
+- A live view at `/jobs/{id}` while the job runs.
+- Two download formats from the same page header dropdown:
+  - **txt** — one line per row, `HH:MM:SS.mmm  TAG  payload`.
+  - **ndjson** — one self-contained JSON object per line
+    (`{seq, ts, stream, payload}`), perfect for `jq`.
+
+Downloads work whether the job is running or finished —
+the source is the DB, not the live socket.
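+
+For consumers that prefer code over `jq`, the NDJSON download can be
+read line by line. The sketch below assumes `seq` is an integer and
+`stream` a string; `ts` and `payload` are kept as raw JSON because
+their exact types are not stated here. `LogLine` and `countByStream`
+are hypothetical names.
+
+```go
+package main
+
+import (
+	"bufio"
+	"encoding/json"
+	"fmt"
+	"strings"
+)
+
+// LogLine mirrors the four keys of one NDJSON record.
+type LogLine struct {
+	Seq     int64           `json:"seq"`
+	TS      json.RawMessage `json:"ts"`
+	Stream  string          `json:"stream"`
+	Payload json.RawMessage `json:"payload"`
+}
+
+// countByStream reads NDJSON text and tallies records per stream.
+func countByStream(ndjson string) (map[string]int, error) {
+	counts := make(map[string]int)
+	sc := bufio.NewScanner(strings.NewReader(ndjson))
+	// Allow lines longer than the scanner's 64 KiB default.
+	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
+	for sc.Scan() {
+		line := strings.TrimSpace(sc.Text())
+		if line == "" {
+			continue
+		}
+		var l LogLine
+		if err := json.Unmarshal([]byte(line), &l); err != nil {
+			return nil, err
+		}
+		counts[l.Stream]++
+	}
+	return counts, sc.Err()
+}
+
+func main() {
+	// Sample records are made up for the example.
+	in := `{"seq":1,"ts":"t1","stream":"stdout","payload":"scanning"}
+{"seq":2,"ts":"t2","stream":"stderr","payload":"warning"}`
+	fmt.Println(countByStream(in))
+}
+```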
@@ -0,0 +1,61 @@
+# Observability with Prometheus
+
+restic-manager can expose a Prometheus scrape endpoint at
+`GET /metrics`. The endpoint is **opt-in** — without an explicit
+auth gate it isn't even mounted, so a forgotten config can't
+accidentally publish fleet state.
+
+The full reference lives at
+[`docs/prometheus.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/docs/prometheus.md);
+the short version follows.
+
+## Enable the endpoint
+
+Set at least one of:
+
+- `RM_METRICS_TOKEN` — `Authorization: Bearer <token>` required.
+- `RM_METRICS_TRUSTED_CIDR` — restricts source IPs (comma-CIDR).
+
+When both are set, a request must satisfy both. The token is compared
+in constant time, and the CIDR check honours `X-Forwarded-For` only
+when the immediate hop matches `RM_TRUSTED_PROXY`.
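+
+A sketch of such a gate in Go, assuming the client address has
+already been resolved (including any `X-Forwarded-For` handling)
+before the check runs. The `gate` type and its methods are
+hypothetical; `subtle.ConstantTimeCompare` is the standard-library
+primitive for the token comparison.
+
+```go
+package main
+
+import (
+	"crypto/subtle"
+	"fmt"
+	"net/netip"
+	"strings"
+)
+
+// gate holds the two optional checks; an empty field means "not configured".
+type gate struct {
+	token string
+	cidrs []netip.Prefix
+}
+
+// newGate builds a gate from a token and a comma-separated CIDR list.
+func newGate(token, cidrList string) (gate, error) {
+	g := gate{token: token}
+	for _, part := range strings.Split(cidrList, ",") {
+		part = strings.TrimSpace(part)
+		if part == "" {
+			continue
+		}
+		p, err := netip.ParsePrefix(part)
+		if err != nil {
+			return gate{}, err
+		}
+		g.cidrs = append(g.cidrs, p)
+	}
+	return g, nil
+}
+
+// enabled reports whether the endpoint should be mounted at all.
+func (g gate) enabled() bool { return g.token != "" || len(g.cidrs) > 0 }
+
+// allow applies every configured check; all of them must pass.
+func (g gate) allow(authHeader, clientIP string) bool {
+	if !g.enabled() {
+		return false
+	}
+	if g.token != "" {
+		got, ok := strings.CutPrefix(authHeader, "Bearer ")
+		if !ok || subtle.ConstantTimeCompare([]byte(got), []byte(g.token)) != 1 {
+			return false
+		}
+	}
+	if len(g.cidrs) > 0 {
+		addr, err := netip.ParseAddr(clientIP)
+		if err != nil {
+			return false
+		}
+		inRange := false
+		for _, p := range g.cidrs {
+			if p.Contains(addr) {
+				inRange = true
+				break
+			}
+		}
+		if !inRange {
+			return false
+		}
+	}
+	return true
+}
+
+func main() {
+	g, err := newGate("s3cret", "10.0.0.0/8")
+	if err != nil {
+		panic(err)
+	}
+	fmt.Println(g.allow("Bearer s3cret", "10.1.2.3"))
+	fmt.Println(g.allow("Bearer s3cret", "203.0.113.7"))
+}
+```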
+
+## Metrics emitted
+
+- **Server gauges**: `rm_hosts_total`, `rm_hosts_online`,
+  `rm_active_alerts{severity}`, `rm_build_info{...}`.
+- **Per-host gauges**: `rm_host_agent_online`,
+  `rm_host_last_backup_timestamp_seconds`,
+  `rm_host_last_backup_success`, `rm_host_repo_size_bytes`,
+  `rm_host_snapshot_count`, `rm_host_open_alerts`,
+  `rm_host_repo_status`.
+- **Histogram**:
+  `rm_job_duration_seconds{kind,status,le=…}` (buckets
+  `1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf`).
+
+The histogram is held in memory only. Prometheus persists the
+scrapes; if you need durable history at hourly resolution that's
+Prometheus's job.
+
+## Sample Grafana dashboard
+
+[`deploy/grafana/restic-manager-dashboard.json`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/deploy/grafana/restic-manager-dashboard.json)
+imports through Grafana's **+ → Import → Upload JSON file**.
+Six panels:
+
+1. Fleet status (online / total).
+2. Open alerts by severity.
+3. Backups failing on most-recent run.
+4. Hosts table — last backup, repo size, snapshots, open alerts.
+5. Repo size over time, one line per host.
+6. Job-duration p95 over a 1h window per kind.
+
+## Alerting
+
+restic-manager already has a built-in alert engine
+([Alerts](./alerts.md)). The dashboard intentionally doesn't
+duplicate it as Prometheus alert rules. If you want
+Prometheus-side alerts on top, write your own based on the
+metrics above — `rm_host_last_backup_success == 0`,
+`time() - rm_host_last_backup_timestamp_seconds > <max age>`,
+or whatever suits your environment.
@@ -0,0 +1,50 @@
+# Updating agents
+
+Server updates are a `docker compose pull && docker compose up -d`
+away. Agents update via the control plane.
+
+## Single-host update
+
+Each host's detail page shows an **Update agent** button when
+the agent's reported version is older than the server's. The
+button:
+
+1. Dispatches a `command.update` to that host.
+2. The agent fetches the appropriate binary from
+   `$RM_SERVER/agent/binary?os=…&arch=…` to
+   `<binary-path>.new`.
+3. Copies the running binary to `<binary-path>.old` (one
+   revision back, in case rollback is needed).
+4. Atomic-renames `.new` over the running binary.
+5. Exits cleanly. systemd's `Restart=always` (or Windows SCM)
+   brings the process back on the new binary.
+
+A 90-second timer on the server side waits for a hello at the
+target version and marks the update succeeded — or, if the
+agent doesn't reconnect at the expected version in time, marks
+the update **failed** and raises an `update_failed` alert.
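+
+Steps 3 and 4 of the sequence above can be sketched with the standard
+library. This is a POSIX-flavoured illustration, not the agent's code:
+`swapBinary`, `copyFile`, and `demo` are hypothetical, and renaming
+over a running executable behaves differently on Windows.
+
+```go
+package main
+
+import (
+	"fmt"
+	"io"
+	"os"
+	"path/filepath"
+)
+
+// copyFile copies src to dst, creating dst with the given mode.
+func copyFile(src, dst string, mode os.FileMode) error {
+	in, err := os.Open(src)
+	if err != nil {
+		return err
+	}
+	defer in.Close()
+	out, err := os.OpenFile(dst, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, mode)
+	if err != nil {
+		return err
+	}
+	if _, err := io.Copy(out, in); err != nil {
+		out.Close()
+		return err
+	}
+	return out.Close()
+}
+
+// swapBinary keeps a one-revision backup, then renames the staged
+// ".new" file over the binary path.
+func swapBinary(binPath string) error {
+	if err := copyFile(binPath, binPath+".old", 0o755); err != nil {
+		return fmt.Errorf("backup: %w", err)
+	}
+	if err := os.Rename(binPath+".new", binPath); err != nil {
+		return fmt.Errorf("swap: %w", err)
+	}
+	return nil
+}
+
+// demo exercises swapBinary in a temp dir and returns the resulting
+// contents of the binary and its backup.
+func demo() (string, string, error) {
+	dir, err := os.MkdirTemp("", "swap-demo")
+	if err != nil {
+		return "", "", err
+	}
+	defer os.RemoveAll(dir)
+	bin := filepath.Join(dir, "agent")
+	if err := os.WriteFile(bin, []byte("v1"), 0o755); err != nil {
+		return "", "", err
+	}
+	if err := os.WriteFile(bin+".new", []byte("v2"), 0o755); err != nil {
+		return "", "", err
+	}
+	if err := swapBinary(bin); err != nil {
+		return "", "", err
+	}
+	cur, err := os.ReadFile(bin)
+	if err != nil {
+		return "", "", err
+	}
+	old, err := os.ReadFile(bin + ".old")
+	if err != nil {
+		return "", "", err
+	}
+	return string(cur), string(old), nil
+}
+
+func main() {
+	fmt.Println(demo())
+}
+```
+
+The rename is only atomic when `.new` sits on the same filesystem as
+the target, which is why the download is staged next to the binary.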
+
+## Fleet update
+
+The admin-only **Settings → Fleet update** page drives a rolling
+update across every host in the fleet:
+
+- One host at a time.
+- Wait for hello-with-target-version (max 95s).
+- On any host failing, **halt** the rollout, raise a
+  `fleet_update_halted` alert, leave the rest of the fleet on
+  the old version. No surprise mass-failures.
+
+You can cancel an in-progress fleet update; the worker stops
+after the current host finishes.
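+
+The halt-on-first-failure loop is simple enough to sketch directly.
+Everything here is illustrative: `rollOut`, `updateFunc`, and `failOn`
+are hypothetical, and `updateFunc` stands in for "dispatch the update
+and wait for the hello at the target version".
+
+```go
+package main
+
+import (
+	"errors"
+	"fmt"
+)
+
+// updateFunc updates one host and blocks until it succeeds or gives up.
+type updateFunc func(host string) error
+
+var errUpdate = errors.New("update failed")
+var errCancelled = errors.New("rollout cancelled")
+
+// rollOut updates hosts one at a time. It stops at the first failure and
+// checks for cancellation only between hosts, so the current host always
+// finishes. It returns the hosts that were updated successfully.
+func rollOut(hosts []string, update updateFunc, cancelled func() bool) ([]string, error) {
+	var done []string
+	for _, h := range hosts {
+		if cancelled() {
+			return done, errCancelled
+		}
+		if err := update(h); err != nil {
+			return done, fmt.Errorf("halted at %s: %w", h, err)
+		}
+		done = append(done, h)
+	}
+	return done, nil
+}
+
+// failOn returns an updateFunc that fails for one named host.
+func failOn(bad string) updateFunc {
+	return func(host string) error {
+		if host == bad {
+			return errUpdate
+		}
+		return nil
+	}
+}
+
+func main() {
+	never := func() bool { return false }
+	done, err := rollOut([]string{"alpha", "bravo", "charlie"}, failOn("bravo"), never)
+	fmt.Println(done, err)
+}
+```
+
+Hosts after the failing one are never touched, which is what leaves
+the rest of the fleet on the old version.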
+
+## TLS and corruption
+
+Updates rely on the reverse proxy's TLS to detect corruption in
+transit. There's no separate sha256 verification step — we
+chose the simpler model on the basis that the same TLS already
+gates every other byte the server hands to the agent.
+
+If you'd like a separate signature step before applying updates,
+that's a future-phase enhancement (see `tasks.md` Phase 6
+candidates).
@@ -0,0 +1,58 @@
+# Environment variables
+
+The server reads its configuration from environment variables
+(canonical) with an optional YAML overlay. Env wins over YAML so
+operators can tweak a single setting without rewriting the file.
+
+## Server
+
+| Variable                  | Default                          | Meaning |
+|---------------------------|----------------------------------|---------|
+| `RM_LISTEN`               | `:8080`                          | TCP listener for the HTTP server. |
+| `RM_DATA_DIR`             | `/data`                          | Persistent state directory (SQLite, secret key, agent assets). |
+| `RM_BASE_URL`             | (none)                           | Public URL clients use; required for OIDC redirects + cookie scope. |
+| `RM_SECRET_KEY_FILE`      | `${RM_DATA_DIR}/secret.key`      | Path to the AEAD key file. Auto-generated on first run. |
+| `RM_COOKIE_SECURE`        | `true`                           | Set `false` only for local HTTP testing. Controls `Secure` on session cookies. |
+| `RM_TRUSTED_PROXY`        | (none)                           | Comma-separated CIDRs trusted for `X-Forwarded-*`. |
+| `RM_BUNDLED_ASSETS_DIR`   | `/opt/restic-manager/dist`       | Read-only path with bundled agent binaries + install scripts (the docker image bakes them here). |
+| `RM_METRICS_TOKEN`        | (off)                            | When set, `GET /metrics` requires `Authorization: Bearer <token>`. |
+| `RM_METRICS_TRUSTED_CIDR` | (off)                            | When set, `GET /metrics` restricts source IPs (comma-CIDR). |
+
+OIDC variables (all optional; empty issuer disables OIDC):
+
+| Variable                       | Meaning |
+|--------------------------------|---------|
+| `RM_OIDC_ISSUER`               | OIDC discovery URL (e.g. `https://auth.example.com`). |
+| `RM_OIDC_CLIENT_ID`            | Client ID registered with the IdP. |
+| `RM_OIDC_CLIENT_SECRET`        | Client secret (or use `RM_OIDC_CLIENT_SECRET_FILE`). |
+| `RM_OIDC_CLIENT_SECRET_FILE`   | Path to a file holding the client secret. |
+| `RM_OIDC_DISPLAY_NAME`         | Button label on the login page (e.g. "Authelia"). |
+| `RM_OIDC_ROLE_CLAIM`           | Token claim that carries roles (default `groups`). |
+| `RM_OIDC_ROLE_MAPPING`         | `idp-group=role` entries, comma-separated (e.g. `rm-admin=admin,rm-ops=operator`). |
+| `RM_OIDC_REDIRECT_URL`         | Override for the redirect URL; defaults to `${RM_BASE_URL}/auth/oidc/callback`. |
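+
+The `RM_OIDC_ROLE_MAPPING` format can be parsed with a few lines of
+Go. `parseRoleMapping` is a hypothetical helper, and rejecting roles
+other than `viewer`, `operator`, and `admin` is an assumption about
+validation rather than documented behaviour.
+
+```go
+package main
+
+import (
+	"fmt"
+	"strings"
+)
+
+var validRoles = map[string]bool{"viewer": true, "operator": true, "admin": true}
+
+// parseRoleMapping parses "idp-group=role,idp-group=role" into a map
+// from IdP group name to role.
+func parseRoleMapping(s string) (map[string]string, error) {
+	out := make(map[string]string)
+	for _, entry := range strings.Split(s, ",") {
+		entry = strings.TrimSpace(entry)
+		if entry == "" {
+			continue
+		}
+		group, role, ok := strings.Cut(entry, "=")
+		group, role = strings.TrimSpace(group), strings.TrimSpace(role)
+		if !ok || group == "" {
+			return nil, fmt.Errorf("malformed entry %q", entry)
+		}
+		if !validRoles[role] {
+			return nil, fmt.Errorf("unknown role %q in entry %q", role, entry)
+		}
+		out[group] = role
+	}
+	return out, nil
+}
+
+func main() {
+	m, err := parseRoleMapping("rm-admin=admin,rm-ops=operator")
+	fmt.Println(m, err)
+}
+```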
+
+## Agent
+
+| Variable             | Default | Meaning |
+|----------------------|---------|---------|
+| `RM_AGENT_CONFIG`    | `/etc/restic-manager/agent.yaml` (Linux) | Config file path. |
+
+The agent's other settings live in the YAML file (server URL,
+bearer token, optional cert pin). The install script writes that
+file for you at enrolment.
+
+## Build-time
+
+The Makefile threads `-ldflags` from `git describe` into the
+`internal/version` package so `--version` and the dashboard
+footer show the right values:
+
+```
+-X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=$(VERSION)
+-X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Commit=$(COMMIT)
+```
+
+If you build with `go build` directly (no Makefile), `Version`
+falls back to `dev` and the agent-update comparison falls back
+to "always equal". Source-build deployments can still run; they
+just don't participate in the self-update flow.
@@ -0,0 +1,82 @@
+# HTTP endpoints
+
+A non-exhaustive map of the surfaces the control plane exposes.
+All `/api/*` routes return JSON; all other paths render HTML
+(server-rendered with HTMX in the loop).
+
+The canonical wiring lives at
+[`internal/server/http/server.go`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/internal/server/http/server.go);
+when in doubt, read the routes block there.
+
+## Public (no auth)
+
+| Method | Path                       | Purpose |
+|--------|----------------------------|---------|
+| GET    | `/healthz`                 | Liveness probe. Returns 204. |
+| POST   | `/api/auth/login`          | Local-user login. JSON body: `{username, password}`. |
+| POST   | `/api/auth/logout`         | Invalidate the session cookie. |
+| POST   | `/api/bootstrap`           | First-run admin creation. Accepts the token printed at first start. |
+| POST   | `/api/agents/enroll`       | Token-based agent enrolment. |
+| POST   | `/api/agents/announce`     | Announce-and-approve agent enrolment. |
+| GET    | `/agent/binary?os=&arch=`  | Serves the agent binary for the install scripts. |
+| GET    | `/install/*`               | Serves the Linux + Windows install scripts and the systemd unit. |
+| GET    | `/api/version`             | Build version + commit JSON. |
+| GET    | `/metrics`                 | Prometheus exposition (only when opted-in via `RM_METRICS_TOKEN` / `RM_METRICS_TRUSTED_CIDR`). |
+| GET    | `/login`, `/setup`, `/bootstrap` | UI pages. |
+
+## Authenticated (any role)
+
+| Method | Path                                     | Purpose |
+|--------|------------------------------------------|---------|
+| GET    | `/`                                      | Dashboard. |
+| GET    | `/hosts/{id}`                            | Host detail. |
+| GET    | `/hosts/{id}/repo`                       | Repo tab. |
+| GET    | `/hosts/{id}/jobs`                       | Jobs tab. |
+| GET    | `/hosts/{id}/sources`                    | Source groups list. |
+| GET    | `/hosts/{id}/schedules`                  | Schedules list. |
+| GET    | `/jobs/{id}`                             | Live job log. |
+| GET    | `/api/hosts`, `/api/fleet/summary`       | JSON list + summary. |
+| GET    | `/api/jobs/{id}/stream`                  | WebSocket subscription to a job's live log. |
+| GET    | `/api/jobs/{id}/log.{txt,ndjson}`        | Persisted log download. |
+
+## Operator role and above
+
+| Method | Path                                  | Purpose |
+|--------|---------------------------------------|---------|
+| POST   | `/hosts/{id}/run-backup`              | Run-now (HTMX form-post). |
+| POST   | `/hosts/{id}/sources/{gid}/run-now`   | Per-source-group run-now. |
+| POST   | `/hosts/{id}/repo/{prune,check,unlock,reinit,probe}` | Maintenance actions. |
+| POST   | `/api/hosts/{id}/snapshots/diff`      | Snapshot-diff job. |
+| POST   | `/hosts/{id}/restore`                 | Restore wizard submit. |
+| POST   | `/api/jobs/{id}/cancel`               | Cancel a running job. |
+| POST   | `/hosts/{id}/tags`                    | Update host tags. |
+| POST   | `/hosts/{id}/sources` and friends     | Source-group CRUD. |
+| POST   | `/hosts/{id}/schedules` and friends   | Schedule CRUD. |
+| POST   | `/hosts/{id}/repo/credentials`, `/admin-credentials` | Credential update. |
+
+## Admin role only
+
+| Method | Path                                  | Purpose |
+|--------|---------------------------------------|---------|
+| POST   | `/hosts/new`                          | Mint enrolment token (Add host). |
+| POST   | `/hosts/{id}/delete`                  | Delete + cascade. |
+| POST   | `/hosts/{id}/update`                  | Dispatch a single agent update. |
+| GET/POST | `/settings/users/...`                | User management. |
+| POST   | `/settings/notifications/...`         | Notification channel CRUD + test fire. |
+| POST   | `/settings/fleet-update/...`          | Fleet-update worker. |
+
+## WebSocket
+
+| Path                           | Who connects | Auth |
+|--------------------------------|--------------|------|
+| `/ws/agent`                    | Agent        | Bearer token issued at enrolment. |
+| `/ws/agent/pending`            | Agent (announce flow) | Pending-id query param. |
+| `/api/jobs/{id}/stream`        | Browser      | Session cookie. |
+
+## RBAC enforcement
+
+Routes are grouped into chi route-groups by required role
+(`viewer < operator < admin`); the `requireRole` middleware in
+`internal/server/http/middleware.go` is the bouncer. Sessions
+re-validate `disabled_at` on every request, so a disabled user's
+cookie stops working immediately.
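+
+A minimal sketch of how an ordered-role check like this could
+look, using only `net/http`. The `Role` type, `withRole`, and
+`statusFor` are hypothetical names chosen for illustration; the
+real `requireRole` and its session lookup are not shown on this
+page and may differ.
+
+```go
+package main
+
+import (
+	"context"
+	"fmt"
+	"net/http"
+	"net/http/httptest"
+)
+
+// Role is ordered so that a plain integer comparison expresses
+// viewer < operator < admin.
+type Role int
+
+const (
+	RoleViewer Role = iota
+	RoleOperator
+	RoleAdmin
+)
+
+// ctxKey is a private context key type (hypothetical).
+type ctxKey struct{}
+
+// withRole stands in for the session middleware that would load
+// the user and re-check disabled_at on every request.
+func withRole(r *http.Request, role Role) *http.Request {
+	return r.WithContext(context.WithValue(r.Context(), ctxKey{}, role))
+}
+
+// requireRole rejects requests whose role is below minRole.
+func requireRole(minRole Role) func(http.Handler) http.Handler {
+	return func(next http.Handler) http.Handler {
+		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+			role, ok := r.Context().Value(ctxKey{}).(Role)
+			if !ok {
+				http.Error(w, "unauthenticated", http.StatusUnauthorized)
+				return
+			}
+			if role < minRole {
+				http.Error(w, "forbidden", http.StatusForbidden)
+				return
+			}
+			next.ServeHTTP(w, r)
+		})
+	}
+}
+
+// statusFor sends one request with the given role through an
+// operator-only handler and returns the response status code.
+func statusFor(role Role) int {
+	h := requireRole(RoleOperator)(http.HandlerFunc(
+		func(w http.ResponseWriter, _ *http.Request) {
+			w.WriteHeader(http.StatusNoContent)
+		}))
+	rec := httptest.NewRecorder()
+	req := httptest.NewRequest(http.MethodPost, "/hosts/abc/run-backup", nil)
+	h.ServeHTTP(rec, withRole(req, role))
+	return rec.Code
+}
+
+func main() {
+	fmt.Println(statusFor(RoleViewer), statusFor(RoleOperator), statusFor(RoleAdmin))
+}
+```
+
+Running it prints `403 204 204`: a viewer is refused on an
+operator route, while operators and admins pass because the
+comparison is "at least this role", not "exactly this role".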
@@ -0,0 +1,32 @@
+# Roadmap
+
+The live roadmap is in
+[`tasks.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/tasks.md).
+Phases ship roughly in order; items inside a phase ship as the
+opportunity arises.
+
+## Status snapshot
+
+| Phase | Theme                                            | Status |
+|-------|--------------------------------------------------|--------|
+| 0     | Project bootstrap                                | ✅ done |
+| 1     | MVP: enrolment, visibility, on-demand backup     | ✅ done |
+| 2     | Scheduling, retention, repo operations           | ✅ done |
+| 3     | Restore, alerts, audit                           | ✅ done |
+| 4     | RBAC, OIDC, host tags                            | ✅ done |
+| 5     | OSS readiness                                    | 🚧 in flight (this docs site is part of it) |
+| 6     | Update delivery + observability polish           | ✅ done |
+
+## What's not on the roadmap
+
+The non-goals are listed in [`spec.md` §2](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/spec.md):
+
+- Replacing restic itself or providing custom repo formats
+- Managing non-restic backup tools
+- Multi-tenancy / SaaS deployment
+- High availability of the control plane (SQLite, single-instance)
+- Mobile-native apps (responsive web only)
+
+If something there is critical to your use case, restic-manager
+isn't the right tool. That's not a closed door — it's a
+deliberate scope decision so the project stays maintainable.
@@ -0,0 +1,35 @@
+# Reporting vulnerabilities
+
+The full disclosure policy lives in
+[`SECURITY.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/SECURITY.md)
+at the repo root. The short version:
+
+- **Don't open a public issue.**
+- Send a Gitea private message to `steve` on
+  <https://gitea.dcglab.co.uk>, or email the address on the
+  maintainer's profile, with a subject like
+  `[SECURITY] restic-manager: <one-line summary>`.
+- Expect an acknowledgement within 3 working days; escalate
+  through the other channel if you don't get one.
+- Default disclosure window is **30 days from confirmed report
+  to public disclosure**, faster if a PoC is already
+  circulating, slower only by mutual agreement.
+
+## What to include
+
+A description of the issue and the impact, the affected
+component (server / agent / install script / docs), the version,
+and reproduction steps. A working PoC is welcome but not
+required — a credible threat model is enough.
+
+## In scope vs. out of scope
+
+See the full policy. Quick highlights:
+
+- **In scope:** server, agent, install scripts, docker image,
+  docker-compose reference, crypto choices, docs that lead to
+  insecure configs.
+- **Out of scope:** restic itself (report upstream), unpatched
+  third-party deps (report upstream first), abuse by an
+  already-authenticated admin (admins are designed to have full
+  power), DoS on
+  deployments without the recommended reverse proxy.
@@ -0,0 +1,72 @@
+# Hardening checklist
+
+A baseline for new deployments. Most of these are defaults; the
+list is here to make auditing easy.
+
+## Server
+
+- [ ] Reverse proxy in front, TLS terminating at the proxy
+      (Caddy/nginx/Traefik).
+- [ ] `RM_TRUSTED_PROXY` set to the proxy's CIDR.
+- [ ] `RM_BASE_URL` matches the public hostname and the cookie
+      scope you want.
+- [ ] `RM_COOKIE_SECURE=true` (the default; only set `false`
+      for local HTTP testing).
+- [ ] HTTP listener bound to **localhost** in the compose file,
+      not `0.0.0.0`. The reverse proxy is the only thing that
+      should reach it.
+- [ ] `secret.key` backed up separately from the database.
+- [ ] Bootstrap token consumed and the printed log line scrubbed
+      from any log archive.
+
+## Authentication
+
+- [ ] Admin user has a password ≥ 12 characters (the floor).
+- [ ] OIDC enabled if you have an IdP — local password auth
+      stays as a break-glass.
+- [ ] Disabled (not deleted) any users who change roles or leave
+      so their session is invalidated immediately.
+- [ ] The last-admin guard isn't tripped — there's always at
+      least one enabled admin user.
+
+## Repo credentials
+
+- [ ] Append-only credential set as the everyday cred for every
+      host.
+- [ ] Admin credential set only where prune cadence is enabled.
+- [ ] No credentials reused across hosts. Each host should have
+      its own credential pair so a single host compromise has a
+      single blast radius.
+- [ ] If using rest-server, `--append-only` flag is on for the
+      everyday user; the prune user is a separate identity.
+
+## Agent
+
+- [ ] Agent runs as `root` (Linux) or `LocalSystem` (Windows)
+      **only when** the source paths require it. Otherwise pin
+      a service user that has read access to what's backed up
+      and nothing else.
+- [ ] systemd unit's sandboxing flags are intact
+      (`NoNewPrivileges`, `Protect*`, `MemoryDenyWriteExecute`).
+- [ ] Agent's config file `/etc/restic-manager/agent.yaml` is
+      mode `0600` and owned by the service user. The bearer
+      token lives in there.
+
+## Operations
+
+- [ ] Alerts wired to a real channel (webhook into Slack,
+      ntfy topic, SMTP) — not just sitting in the UI.
+- [ ] Test-fire each notification channel after configuring.
+- [ ] Audit-log retention is long enough to cover the operator's
+      incident-response window.
+- [ ] Prometheus endpoint, if enabled, gated by token AND CIDR
+      where practical (default is opt-in / off).
+
+## Recovery
+
+- [ ] A documented procedure for rotating a leaked agent bearer
+      (delete + re-enrol the host).
+- [ ] A test-restore done at least once, end-to-end, before
+      relying on the system in anger.
+- [ ] `secret.key` and the SQLite database covered by separate
+      backup paths so neither alone reconstitutes the other.
@@ -0,0 +1,110 @@
+# Threat model
+
+This page documents what restic-manager defends against, what it
+doesn't, and the trust assumptions a deployment is making. The
+canonical version lives in [`spec.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/spec.md)
+§11; the summary here is shaped for operators rather than
+implementers.
+
+## Trust boundaries
+
+```
+┌──────────────────────────────────────────┐
+│  TRUSTED zone                            │
+│  ┌─────────────┐    ┌──────────────┐     │
+│  │  Operator's │    │   Reverse    │     │
+│  │   browser   │◄──►│    proxy     │     │  TLS terminates here
+│  └─────────────┘    └──────┬───────┘     │
+└────────────────────────────┼─────────────┘
+                             │ HTTP, plaintext
+                             │ (loopback or trusted LAN)
+┌────────────────────────────▼─────────────┐
+│  Server (control plane)                  │
+└────────────┬─────────────────────────────┘
+             │ outbound WebSocket (TLS to clients via proxy)
+             │ — bearer-authenticated
+┌────────────▼──────────────┐
+│  Agent (per host)         │  ◄── attacker model: assume one
+└────────────┬──────────────┘       endpoint can be compromised
+             │ subprocess
+             ▼
+   restic ──▶ repository (rest-server / S3 / SFTP / …)
+```
+
+## What we defend against
+
+### Network attacker between operator and server
+
+- HTTPS via the reverse proxy is the only operator-facing surface
+  on a sane deployment.
+- `RM_COOKIE_SECURE=true` (default) means the session cookie
+  refuses to ride a non-HTTPS connection.
+- `RM_TRUSTED_PROXY` gates whether `X-Forwarded-*` is honoured;
+  a bypassing request can't spoof the client IP.
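+
+The proxy gate in the last bullet can be sketched as follows.
+This is an illustration under stated assumptions, not the
+server's actual code: `clientIP` and `resolve` are hypothetical
+names, and the choice of the right-most `X-Forwarded-For` entry
+assumes a single trusted proxy in front of the server.
+
+```go
+package main
+
+import (
+	"fmt"
+	"net/netip"
+	"strings"
+)
+
+// clientIP picks the address to treat as the caller. peer is the
+// TCP peer; xff is the raw X-Forwarded-For header value. The header
+// is honoured only when peer falls inside a trusted proxy prefix.
+func clientIP(peer netip.Addr, xff string, trusted []netip.Prefix) netip.Addr {
+	peer = peer.Unmap() // treat ::ffff:a.b.c.d as plain IPv4
+	viaProxy := false
+	for _, p := range trusted {
+		if p.Contains(peer) {
+			viaProxy = true
+			break
+		}
+	}
+	if !viaProxy || xff == "" {
+		return peer
+	}
+	// Assumption: one trusted proxy, so the right-most entry is the
+	// address that proxy itself observed.
+	parts := strings.Split(xff, ",")
+	addr, err := netip.ParseAddr(strings.TrimSpace(parts[len(parts)-1]))
+	if err != nil {
+		return peer // malformed header: fall back to the TCP peer
+	}
+	return addr.Unmap()
+}
+
+// resolve is a string-level wrapper for quick experiments.
+func resolve(peer, xff, trustedCIDR string) string {
+	return clientIP(
+		netip.MustParseAddr(peer), xff,
+		[]netip.Prefix{netip.MustParsePrefix(trustedCIDR)},
+	).String()
+}
+
+func main() {
+	// Request relayed by the proxy: the forwarded address wins.
+	fmt.Println(resolve("172.18.0.5", "198.51.100.7", "172.18.0.0/16"))
+	// Request that bypassed the proxy: the header is ignored.
+	fmt.Println(resolve("203.0.113.9", "10.0.0.1", "172.18.0.0/16"))
+}
+```
+
+The first call prints `198.51.100.7` and the second prints
+`203.0.113.9`: a caller that does not arrive through the trusted
+proxy cannot make itself look like `10.0.0.1` by sending a header.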
+
+### Compromised agent host
+
+- The agent's bearer token can dispatch commands **only on its
+  own host**. It can't read other hosts' state, dispatch jobs
+  on other hosts, or escalate within the control plane.
+- If you suspect a host compromise:
+  1. Delete the host via **Hosts → Delete** (the delete cascades
+     to the bearer hash).
+  2. Rotate the repo credential at the rest-server / object
+     store side.
+  3. Review the audit log, which lists every action that bearer
+     ever drove.
+
+### DB compromise without the secret key
+
+- Repo credentials are AEAD-encrypted at rest. A DB dump alone
+  doesn't expose them.
+- Agent bearer **hashes** are leaked; that's enough to
+  authenticate as any agent until you revoke. The rotation
+  procedure today is simply "delete + re-enrol".
+- Operator passwords are bcrypt-hashed; OIDC users have no
+  password to leak.
+- Session tokens are hashed; an attacker can't replay a
+  session from a DB dump.
+
+### DB compromise WITH the secret key
+
+The attacker can decrypt every credential. Treat
+`secret.key` with the same care as a password manager database.
+Back it up to a separate vault, not to the same Docker volume
+as the database.
+
+### Forget/prune as a DoS vector
+
+- The everyday backup credential cannot prune (append-only).
+- The admin credential is only pushed to the agent at the
+  moment of dispatch and discarded after the job ends.
+- Compromise of a single agent host does **not** grant prune
+  rights — at worst the attacker gets fresh write access until
+  the credential is rotated.
+
+### Operator-side typo or bad copy-paste
+
+- Repo credentials are stored encrypted; mis-typed creds fail
+  fast on the next `restic` invocation rather than silently
+  corrupting state.
+- NS-03 added auto-init: the first dispatched job after creds
+  change runs `restic init`, surfaces the error eagerly under
+  the host's vitals strip if the creds are bad, and resets the
+  host's `repo_status` so the operator can retry without
+  hunting through job logs.
+
+## What we don't defend against
+
+- **Insider threat at the maintainer level.** A malicious
+  maintainer can publish a backdoored container; SBOM /
+  signing infrastructure (Phase 6 candidate) would help here
+  but isn't shipped today.
+- **Supply chain.** We pin module versions (`go.sum`) and
+  pin the Tailwind binary's release tag, but a compromise in
+  one of those upstreams would land here.
+- **Side-channel via restic itself.** A bug in restic that
+  enables snapshot-content disclosure is restic's problem; the
+  control plane doesn't see snapshot bytes either way.
+- **DoS via resource exhaustion** without the recommended
+  reverse-proxy / rate-limit in front. Don't expose the
+  server's HTTP port to the public internet directly.
@@ -0,0 +1,120 @@
+# End-to-end test harness
+
+The e2e harness stands up the full production-shaped stack
+(server + agent + rest-server) in Docker Compose and drives it
+through Playwright. CI runs it on every PR; operators can run it
+locally too.
+
+## Files
+
+```
+e2e/
+├── compose.e2e.yml         compose stack: server + rest-server + agent
+├── Dockerfile.agent        Linux container for the agent (alpine + restic)
+├── agent-entrypoint.sh     decides between announce / token-enrol / run
+└── playwright/
+    ├── package.json
+    ├── playwright.config.ts
+    └── tests/
+        ├── lib/server.ts   bootstrap, login, accept, poll helpers
+        └── smoke.spec.ts   happy-path: enrol → backup → succeeded
+```
+
+## Local run
+
+Prerequisites: Docker + Docker Compose, and Node.js (`npm` and
+`npx`) for Playwright.
+
+```sh
+# 1. Build + bring up the stack (server, rest-server, source data).
+docker compose -f e2e/compose.e2e.yml up --build -d server rest-server source-fixture
+
+# 2. Wait for the server, then scrape the bootstrap token from the log.
+until curl -fsS http://127.0.0.1:8080/api/version >/dev/null; do sleep 1; done
+RM_BOOTSTRAP_TOKEN=$(docker compose -f e2e/compose.e2e.yml logs server \
+    | grep -Eo '[a-zA-Z0-9_-]{40,}' | head -1)
+export RM_BOOTSTRAP_TOKEN
+
+# 3. Start the agent (it announces against the running server).
+docker compose -f e2e/compose.e2e.yml up -d agent
+
+# 4. Install + run Playwright.
+cd e2e/playwright
+npm install
+npx playwright install --with-deps chromium
+npx playwright test
+```
+
+When the suite passes you'll see output like:
+
+```
+Running 2 tests using 1 worker
+  ✓  smoke: enrol-via-announce → backup › happy path completes in under a minute (47s)
+  ✓  smoke: scrape /metrics › metrics endpoint exposes the host gauge (180ms)
+
+  2 passed (47.5s)
+```
+
+Tear-down:
+
+```sh
+docker compose -f e2e/compose.e2e.yml down -v
+```
+
+`-v` removes the named volumes too — important between runs because
+the rest-server volume holds an initialised repo and the
+agent-config volume holds a stale bearer.
+
+## What the test exercises
+
+1. **Bootstrap.** Posts the admin-creation request to
+   `/api/bootstrap` with the token scraped from the server log.
+2. **Login (UI).** Drives the login form via Playwright; verifies
+   the dashboard loads with a session cookie set.
+3. **Pending host appears.** Polls the dashboard for the inline
+   accept form generated by the announcing agent; reads the
+   pending-id out of its action URL.
+4. **Accept.** POSTs `/api/pending-hosts/{id}/accept` with the
+   rest-server URL + repo password. The server mints a Host row
+   + bearer + AEAD-encrypted creds and pushes the bearer down
+   the still-open pending WebSocket.
+5. **Online + auto-init.** Polls `/api/hosts` until the new host
+   is `status=online`. Auto-init runs as part of this — the
+   first dispatched job after creds save is `restic init`.
+6. **Run backup.** Submits the host detail page's `Run now`
+   form; expects `HX-Redirect` to the live job page.
+7. **Verify.** Polls `/api/hosts` until the host's
+   `last_backup_status` flips to `succeeded`.
+8. **Metrics.** Scrapes `/metrics` and asserts the
+   server-gauge + build-info lines are present (the compose
+   stack opens the endpoint via `RM_METRICS_TRUSTED_CIDR=0.0.0.0/0`).
+
+## CI workflow
+
+[`.gitea/workflows/e2e.yml`](../.gitea/workflows/e2e.yml) runs the
+suite on every PR into `main`. On failure it dumps the last 200
+lines of each container log as a workflow annotation and uploads
+the Playwright HTML report as an artefact.
+
+## When tests fail
+
+- **Pending host never appears.** Agent container probably
+  couldn't reach the server. Check `docker compose logs agent`
+  for connection errors and `docker compose logs server` for
+  any 4xx on `/api/agents/announce`.
+- **Backup hangs in `running`.** The agent shells out to
+  `restic`; check the live job log at
+  `http://127.0.0.1:8080/jobs/<id>` (still up after a
+  failed test as long as you didn't `down -v`).
+- **`RM_BOOTSTRAP_TOKEN not set`.** The server log scrape
+  matched the wrong line or the token regex is too tight. The
+  server prints the token on a line starting with `    ` (four
+  spaces) inside a banner; widen the regex if your server log
+  format changes.
+
+## Adding new tests
+
+The harness is intentionally flat — one `*.spec.ts` per
+scenario. Reuse the helpers in `lib/server.ts` and avoid
+duplicating bootstrap / login boilerplate. Heavy fixtures
+(custom users, OIDC IdP) belong in their own compose override
+file rather than complicating `compose.e2e.yml`.
@@ -0,0 +1,139 @@
+# Prometheus + Grafana
+
+restic-manager exposes a Prometheus scrape endpoint at `GET /metrics`.
+The endpoint is **opt-in** — it is not mounted at all unless you set
+at least one of the auth gates below. Once enabled, it serves the
+standard `text/plain` exposition format that every Prometheus
+release since 2.x parses without configuration.
+
+A sample Grafana dashboard lives at
+`deploy/grafana/restic-manager-dashboard.json`.
+
+## Enable the endpoint
+
+Two switches, both off by default. If both are set, both must pass
+(token AND source-IP); if only one is set, that gate alone
+authorises a scrape.
+
+| Env var                    | YAML key               | Effect |
+|----------------------------|------------------------|--------|
+| `RM_METRICS_TOKEN`         | `metrics_token`        | Requires `Authorization: Bearer <token>`. Compared in constant time. |
+| `RM_METRICS_TRUSTED_CIDR`  | `metrics_trusted_cidrs` (list) | Restricts the source IP to one of the listed CIDRs. Comma-separated in env, list in YAML. Honours `X-Forwarded-For` only when the immediate hop matches `RM_TRUSTED_PROXY`. |
+
+When neither is set, `GET /metrics` returns 404 — the route is not
+registered with the chi router so a forgotten config can't
+accidentally publish fleet state.
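+
+The gate combination described above can be sketched like this.
+It is an illustration only: the `gate` type and the `check`
+wrapper are hypothetical names, and the source IP is assumed to
+have been resolved already (including any `X-Forwarded-For`
+handling).
+
+```go
+package main
+
+import (
+	"crypto/subtle"
+	"fmt"
+	"net/netip"
+)
+
+// gate holds the two optional switches. An empty token or an empty
+// CIDR list means "this gate is not configured".
+type gate struct {
+	token string
+	cidrs []netip.Prefix
+}
+
+// enabled mirrors the rule that the route exists only when at
+// least one gate is configured.
+func (g gate) enabled() bool { return g.token != "" || len(g.cidrs) > 0 }
+
+// allow reports whether a scrape passes every configured gate.
+// bearer is the token from the Authorization header ("" if absent);
+// src is the already-resolved client IP.
+func (g gate) allow(bearer string, src netip.Addr) bool {
+	if !g.enabled() {
+		return false // the route would not be mounted at all
+	}
+	if g.token != "" &&
+		subtle.ConstantTimeCompare([]byte(bearer), []byte(g.token)) != 1 {
+		return false
+	}
+	if len(g.cidrs) > 0 {
+		inside := false
+		for _, p := range g.cidrs {
+			if p.Contains(src.Unmap()) {
+				inside = true
+				break
+			}
+		}
+		if !inside {
+			return false
+		}
+	}
+	return true
+}
+
+// check is a string-level wrapper; an empty cidr means "not set".
+func check(token, cidr, bearer, src string) bool {
+	g := gate{token: token}
+	if cidr != "" {
+		g.cidrs = []netip.Prefix{netip.MustParsePrefix(cidr)}
+	}
+	return g.allow(bearer, netip.MustParseAddr(src))
+}
+
+func main() {
+	fmt.Println(check("s3cret", "10.0.0.0/8", "s3cret", "10.1.2.3"))  // both gates pass
+	fmt.Println(check("s3cret", "10.0.0.0/8", "s3cret", "192.0.2.1")) // token ok, IP outside
+	fmt.Println(check("", "10.0.0.0/8", "", "10.1.2.3"))              // CIDR-only gate
+	fmt.Println(check("", "", "anything", "10.1.2.3"))                // nothing configured
+}
+```
+
+The four calls print `true`, `false`, `true`, `false`. The second
+case is the point of the AND rule: a leaked token is not enough
+from outside the allowed network.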
+
+### Example: Docker
+
+```yaml
+services:
+  restic-manager:
+    image: gitea.dcglab.co.uk/steve/restic-manager:latest
+    environment:
+      # Interpolated by Compose from the shell or an .env file.
+      RM_METRICS_TOKEN: "${RM_METRICS_TOKEN}"
+      RM_METRICS_TRUSTED_CIDR: "10.0.0.0/8"
+```
+
+(Set `RM_METRICS_TOKEN` directly. A `RM_METRICS_TOKEN_FILE`
+variant for Docker secrets is not currently supported; the `_FILE`
+convention is on the roadmap.)
+
+## Prometheus scrape config
+
+Drop into your `prometheus.yml`:
+
+```yaml
+scrape_configs:
+  - job_name: restic-manager
+    metrics_path: /metrics
+    scheme: https            # via your reverse proxy
+    static_configs:
+      - targets: ['restic.example.com']
+    authorization:
+      type: Bearer
+      credentials_file: /etc/prometheus/secrets/rm_metrics_token
+```
+
+If you don't run a TLS-terminating proxy in front, drop `scheme:
+https` (the server is HTTP-only — see `docs/reverse-proxy.md`).
+
+## Metric reference
+
+All names are `rm_`-prefixed. Per-host metrics carry a `host_id`
+label (the stable ULID, immune to renames) and a `host` label
+(the human-readable name).
+
+### Server gauges
+
+| Name                  | Labels                             | Description |
+|-----------------------|------------------------------------|-------------|
+| `rm_hosts_total`      | —                                  | Total number of enrolled hosts (excludes pending announces). |
+| `rm_hosts_online`     | —                                  | Number of hosts with `status='online'`. |
+| `rm_active_alerts`    | `severity` ∈ {info, warning, critical} | Open alerts by severity. |
+| `rm_build_info`       | `version, commit, go_version`      | Always 1; pure label-bag for joining. |
+
+### Per-host gauges
+
+| Name                                       | Description |
+|--------------------------------------------|-------------|
+| `rm_host_agent_online`                     | 1 if the agent is currently online, 0 otherwise. |
+| `rm_host_last_backup_timestamp_seconds`    | Unix timestamp of the host's most recent backup. **Omitted** for hosts with no backup yet. |
+| `rm_host_last_backup_success`              | 1 if the most recent backup succeeded, 0 otherwise. **Omitted** for hosts with no backup yet. |
+| `rm_host_repo_size_bytes`                  | Latest reported repo size from `restic stats --mode raw-data`. **Omitted** when unknown. |
+| `rm_host_snapshot_count`                   | Number of restic snapshots known on the host's repo. |
+| `rm_host_open_alerts`                      | Number of currently open alerts attached to this host. |
+| `rm_host_repo_status`                      | Always 1; the `status` label carries `unknown` / `ready` / `init_failed`. |
+
+### Job duration histogram
+
+```
+rm_job_duration_seconds_bucket{kind, status, le}
+rm_job_duration_seconds_sum{kind, status}
+rm_job_duration_seconds_count{kind, status}
+```
+
+`kind` ∈ {backup, forget, prune, check, unlock, restore, diff, init, update}.
+`status` ∈ {succeeded, failed, cancelled}.
+
+Buckets (seconds):
+
+```
+1   5   30   60  300  1800  3600  21600  86400  +Inf
+1s  5s  30s  1m  5m   30m   1h    6h     24h
+```
+
+The histogram is in-memory only — values reset on process restart.
+Operators who want durable history should let Prometheus persist
+the scrapes; restic-manager itself is a control plane, not a
+metrics database.
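+
+To make the bucket semantics concrete, here is a small sketch of
+a cumulative histogram with the bounds listed above. The
+`registry`, `series`, and `demo` names are hypothetical and the
+code is not taken from the server; it only illustrates how one
+observation lands in every bucket whose bound it does not exceed.
+
+```go
+package main
+
+import (
+	"fmt"
+	"sync"
+	"time"
+)
+
+// bounds are the finite upper limits in seconds; the +Inf bucket is
+// implicit and always equals count.
+var bounds = []float64{1, 5, 30, 60, 300, 1800, 3600, 21600, 86400}
+
+// series is the state behind one {kind, status} label pair.
+type series struct {
+	buckets []uint64 // cumulative: buckets[i] counts samples <= bounds[i]
+	sum     float64
+	count   uint64
+}
+
+// registry is a hypothetical stand-in for the in-memory store.
+type registry struct {
+	mu sync.Mutex
+	m  map[[2]string]*series
+}
+
+// observe records one finished job, clamping negative durations.
+func (r *registry) observe(kind, status string, d time.Duration) {
+	if d < 0 {
+		d = 0
+	}
+	secs := d.Seconds()
+	r.mu.Lock()
+	defer r.mu.Unlock()
+	if r.m == nil {
+		r.m = make(map[[2]string]*series)
+	}
+	key := [2]string{kind, status}
+	s, ok := r.m[key]
+	if !ok {
+		s = &series{buckets: make([]uint64, len(bounds))}
+		r.m[key] = s
+	}
+	for i, le := range bounds {
+		if secs <= le {
+			s.buckets[i]++
+		}
+	}
+	s.sum += secs
+	s.count++
+}
+
+// demo observes a 45 s and a 2 h backup and returns a copy of the
+// cumulative bucket counters for that series.
+func demo() []uint64 {
+	var r registry
+	r.observe("backup", "succeeded", 45*time.Second)
+	r.observe("backup", "succeeded", 2*time.Hour)
+	r.mu.Lock()
+	defer r.mu.Unlock()
+	return append([]uint64(nil), r.m[[2]string{"backup", "succeeded"}].buckets...)
+}
+
+func main() {
+	fmt.Println(demo())
+}
+```
+
+This prints `[0 0 0 1 1 1 1 2 2]`: the 45 s job counts from the
+`le="60"` bucket upwards, and the 2 h job only from `le="21600"`
+upwards. All of this state lives in the process, which is why a
+restart resets it.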
+
+## Grafana dashboard
+
+Import `deploy/grafana/restic-manager-dashboard.json`:
+
+1. In Grafana, **+ → Import → Upload JSON file**.
+2. Pick the Prometheus data source you scrape with.
+3. The dashboard's six panels populate from the metrics above:
+   * **Fleet status** — online/total stat panel.
+   * **Open alerts** — by severity.
+   * **Hosts** — per-host table (last backup, repo size, snapshots, alerts).
+   * **Repo size over time** — one line per host.
+   * **Backups failing** — count of hosts whose last backup didn't succeed.
+   * **Job duration p95** — `histogram_quantile(0.95, …)` over a 1h window per kind.
+
+Alerting is intentionally not configured in the dashboard — the
+control plane already has alerts (P3-05) with native channels for
+webhook, ntfy, and SMTP. Re-implementing them in Prometheus would
+just duplicate state. If you do want Prometheus-side alerts, write
+your own rules against the metrics above; none ship with the
+dashboard.
+
+## Cardinality
+
+Per scrape: O(hosts) gauge rows + O(kinds × statuses × buckets)
+histogram rows. A 100-host fleet emits roughly 700 host rows + 270
+histogram rows — well below any practical limit. There are no
+`job_id` labels (cardinality bomb avoidance) and no per-source-group
+labels.
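+
+The arithmetic behind those figures, as a small worked example.
+The counts come from the tables on this page (seven per-host
+gauges, nine kinds, three statuses, ten buckets including
+`+Inf`); the function names are only for illustration. It
+computes an upper bound, since gauges marked **Omitted** and
+kind/status pairs that have not been observed may not appear in
+a given scrape.
+
+```go
+package main
+
+import "fmt"
+
+// hostRows is the upper bound on per-host gauge rows per scrape.
+func hostRows(hosts, perHostGauges int) int {
+	return hosts * perHostGauges
+}
+
+// bucketRows is the upper bound on _bucket rows per scrape.
+func bucketRows(kinds, statuses, buckets int) int {
+	return kinds * statuses * buckets
+}
+
+// sumCountRows counts the _sum and _count rows that accompany the
+// buckets: two per {kind, status} pair.
+func sumCountRows(kinds, statuses int) int {
+	return 2 * kinds * statuses
+}
+
+func main() {
+	const (
+		hosts         = 100
+		perHostGauges = 7  // rows in the per-host gauge table
+		kinds         = 9  // backup ... update
+		statuses      = 3  // succeeded, failed, cancelled
+		buckets       = 10 // nine finite bounds plus +Inf
+	)
+	fmt.Println(hostRows(hosts, perHostGauges))       // 700
+	fmt.Println(bucketRows(kinds, statuses, buckets)) // 270
+	fmt.Println(sumCountRows(kinds, statuses))        // 54
+}
+```
+
+Only the first term grows with fleet size; the histogram terms
+are constants, which is what keeping `job_id` out of the labels
+buys.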
@@ -0,0 +1,61 @@
+# Plan — P6-04 + P6-05 Prometheus metrics + Grafana dashboard
+
+Spec: `docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md`
+
+## Step 1 — Config wiring
+
+- Add fields to `internal/server/config/config.go`:
+  - `MetricsToken string` (yaml `metrics_token`)
+  - `MetricsTrustedCIDRs []string` (yaml `metrics_trusted_cidrs`)
+  - method `(c Config) MetricsAuthEnabled() bool` returning true iff at least one of the two is configured.
+- Env loading: `RM_METRICS_TOKEN` and `RM_METRICS_TRUSTED_CIDR` (comma-separated CIDRs).
+- `validate()` extension: ensure each CIDR parses (reuse the same `netip.ParsePrefix` pattern that already validates `TrustedProxies`).
+- Tests: extend `config_test.go` covering both env vars + happy/sad CIDR.
+
+## Step 2 — `internal/server/metrics` package
+
+- `Registry` struct: `sync.Mutex`, `map[jobKey]*histogramState` where `jobKey = struct{kind,status string}`.
+- `ObserveJob(kind, status string, dur time.Duration)` — clamps negative durations to 0; locks; bumps the right bucket + sum + count.
+- `Snapshot() Snapshot` — copies state under lock; returns plain value type.
+- `Snapshot` carries `Histogram` rows (kind, status, buckets, sum, count) and accepts the rest from the caller (host rows, alert counts, build info).
+- `Render(w io.Writer, s Snapshot) error` — emits standard text exposition with stable line ordering. No external dep; manual escape of `\` `"` `\n` in label values per the Prom format spec.
+- Unit tests: golden render, concurrent observe, bucket boundaries.
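+
+A sketch of the manual label escaping mentioned for `Render`,
+assuming nothing beyond the standard library. `escapeLabel` and
+`gaugeLine` are hypothetical helper names; the three escapes
+(backslash, double quote, newline) are the ones the Prometheus
+text format requires inside a quoted label value.
+
+```go
+package main
+
+import (
+	"fmt"
+	"strings"
+)
+
+// labelEscaper rewrites the three characters that must be escaped
+// inside a quoted label value. strings.Replacer scans the input
+// once, so an inserted backslash is never escaped a second time.
+var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)
+
+// escapeLabel returns v in a form that is safe between double quotes.
+func escapeLabel(v string) string { return labelEscaper.Replace(v) }
+
+// gaugeLine renders one per-host sample line.
+func gaugeLine(name, hostID, host string, value float64) string {
+	return fmt.Sprintf("%s{host_id=\"%s\",host=\"%s\"} %g",
+		name, escapeLabel(hostID), escapeLabel(host), value)
+}
+
+func main() {
+	fmt.Println(gaugeLine("rm_host_agent_online", "01HX", `web "prod"`, 1))
+}
+```
+
+This prints `rm_host_agent_online{host_id="01HX",host="web \"prod\""} 1`.
+Host names are operator-supplied, so the `host` label is the one
+most likely to need escaping.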
+
+## Step 3 — HTTP handler
+
+- New `internal/server/http/metrics.go`:
+  - `(s *Server) handleMetrics(w, r)` — calls `authoriseMetricsScrape`, then `gatherSnapshot(ctx)` then `metrics.Render`.
+  - `authoriseMetricsScrape(r, cfg) (ok bool, status int)` — pure helper; bearer token compared with `subtle.ConstantTimeCompare`; CIDR check on `r.RemoteAddr` first, then `X-Forwarded-For` if a trusted proxy fronted us (mirror `realIP`'s logic; simplest path is to call `chi/middleware.RealIP`-aware lookup the existing handlers use).
+  - `gatherSnapshot(ctx)` — assembles the snapshot from `Store.ListHosts`, `Store.ListAlerts({Status:"open"})`, the metrics registry, and `version.Version`/`version.Commit`/`runtime.Version()`.
+- Route mounted in `server.go` only if `s.deps.Cfg.MetricsAuthEnabled()`.
+- `Deps` grows a `Metrics *metrics.Registry` field; nil-tolerant in handlers.
+
+## Step 4 — Hook job-finished
+
+- `internal/server/ws/handler.go`:
+  - `HandlerDeps` grows `Metrics *metrics.Registry`.
+  - In the `MsgJobFinished` branch, after the `GetJob` lookup we already do, observe `(job.Kind, p.Status.String(), p.FinishedAt.Sub(deref(job.StartedAt)))`. Skip if `job.StartedAt` is nil (rare race).
+- `cmd/server` wires the registry into both `Deps` and `HandlerDeps` from a single instance.
+
+## Step 5 — Tests
+
+- `internal/server/metrics/registry_test.go` — observe + snapshot determinism.
+- `internal/server/metrics/render_test.go` — golden output for a fixed snapshot.
+- `internal/server/http/metrics_test.go` — auth matrix (six cases per the spec) using the existing `newTestServer` fixture pattern. Render snapshot includes ≥1 host so we exercise the gather path end-to-end.
+
+## Step 6 — Docs + dashboard (P6-05)
+
+- `docs/prometheus.md` — enable + scrape config + metric reference + dashboard import.
+- `deploy/grafana/restic-manager-dashboard.json` — six-panel dashboard. Hand-authored against Grafana 11 dashboard schema (uid, schemaVersion, panels with `targets[].expr`, datasource as variable). Validated by importing into Grafana — but since we can't run Grafana in CI, the structural sniff test is just that the JSON parses and contains the expected panel titles + datasource variable.
+
+## Step 7 — Tasks.md + verification
+
+- Strike P6-04, P6-05 in `tasks.md`; add an "as shipped" note mirroring the prior P6 entries.
+- Run `go vet ./...`, `go test ./...`, `make build`.
+- Push branch (no PR per standing instruction).
+
+## Risk register
+
+- **CIDR check for proxied scrapes** — easy to mis-implement, easy to mis-document. The handler test must exercise both "direct hit" and "X-Forwarded-For" paths.
+- **Histogram lock contention** — every job finish takes the mutex. Job throughput is low (a few/min/host max), and `ObserveJob` is a couple of map lookups; no risk in practice.
+- **Dashboard JSON drift** — Grafana versions evolve. Pinning `schemaVersion` and using only well-supported panel types (timeseries, stat, table) keeps the import working across recent versions.
@@ -0,0 +1,175 @@
+# P6-04 + P6-05 — Prometheus `/metrics` + Grafana dashboard
+
+Date: 2026-05-07
+Author: Claude (autonomous, sensible-defaults brief from operator)
+Tasks: P6-04 (M), P6-05 (S)
+
+## Problem
+
+The control plane already knows everything a backup operator needs
+to monitor — last-backup timestamp + status, repo size, snapshot
+count, agent online, open alerts, build version — but it surfaces
+those only through the dashboard HTML and a few JSON endpoints. To
+plug into the operator's existing observability stack we need a
+plain Prometheus exposition endpoint and a Grafana dashboard JSON
+that reads from it.
+
+## Goals
+
+- `GET /metrics` emits standard Prometheus text-format with the
+  per-host, server, and job-duration metrics enumerated in the
+  task entry (P6-04 in `tasks.md`).
+- Endpoint is opt-in and gated by a bearer token and/or an IP
+  allow-list — never publicly readable by default.
+- No new third-party dependency (`prometheus/client_golang` is not
+  pulled in). The exposition format is small and stable enough to
+  emit by hand; matches the repo's "no Tailwind/Node" style.
+- Sample Grafana dashboard committed to the repo so a stranger can
+  drop it into a Grafana instance and get a working view.
+
+## Non-goals
+
+- OpenMetrics (the legacy text format with `# HELP`/`# TYPE` is
+  what every prom server still parses and what every example
+  online demonstrates — pick the boring option).
+- Pushgateway or remote-write integration.
+- Per-job metric cardinality (no `job_id` labels — that would
+  make the histogram explode).
+- Alerting rules. Operators already have alerts inside
+  restic-manager (P3-05); duplicating them in Prometheus is a
+  YAGNI hazard. The dashboard is read-only.
+
+## Auth
+
+Two switches, both off by default. If neither is set the route
+isn't mounted at all (404 from the chi router) — this avoids any
+accidental "wide-open scrape endpoint" deployment.
+
+| env var | type | meaning |
+| --- | --- | --- |
+| `RM_METRICS_TOKEN` | string | If set, callers must send `Authorization: Bearer <token>`. Compared with `crypto/subtle.ConstantTimeCompare`. |
+| `RM_METRICS_TRUSTED_CIDR` | comma-CIDR | If set, callers must hit from a source IP inside one of these CIDRs. Reuses the existing `RM_TRUSTED_PROXY` semantics for honouring `X-Forwarded-For`. |
+
+If both are set, both must pass (AND, not OR — a token leak doesn't grant access from outside the network, and a trusted-network compromise doesn't grant access without the token). If only one is set, that one alone gates access.
+
+YAML overlay mirrors env: `metrics_token`, `metrics_trusted_cidrs`.
+
+## Metrics
+
+All metric names are prefixed `rm_`. Help text is concise.
+
+### Per-host gauges (one row per `host_id`)
+
+```
+rm_host_agent_online{host_id,host}                     1 if status='online' else 0
+rm_host_last_backup_timestamp_seconds{host_id,host}    unix seconds; omitted if no backup yet
+rm_host_last_backup_success{host_id,host}              1 if last_backup_status='succeeded' else 0; omitted if no backup yet
+rm_host_repo_size_bytes{host_id,host}                  total_size from latest repo stats; omitted if unknown
+rm_host_snapshot_count{host_id,host}                   integer
+rm_host_open_alerts{host_id,host}                      count of open + un-resolved alerts attached to this host
+rm_host_repo_status{host_id,host,status}               1, with status ∈ {unknown,ready,init_failed} (info-style, exactly one row per host)
+```
+
+`host` label is `hosts.name` for human readability; `host_id` is
+the stable ULID for joining across renames.
+
+### Server gauges
+
+```
+rm_hosts_total                              count of hosts (excludes pending)
+rm_hosts_online                             count of hosts with status='online'
+rm_active_alerts{severity}                  count of open alerts by severity ∈ {info,warning,critical}
+rm_build_info{version,commit,go_version}    always 1; pure label-bag for joining
+```
+
+### Job duration histogram
+
+```
+rm_job_duration_seconds_bucket{kind,status,le=...}
+rm_job_duration_seconds_sum{kind,status}
+rm_job_duration_seconds_count{kind,status}
+```
+
+`kind` ∈ {backup,forget,prune,check,unlock,restore,diff,init,update}
+(every JobKind we currently dispatch). `status` ∈
+{succeeded,failed,cancelled}. Buckets cover the realistic range —
+short admin commands (unlock, init) finish in seconds; backups can
+be hours:
+
+```
+1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf
+   (1s   5s  30s  1m   5m  30m   1h    6h   24h)
+```
+
+In-memory only. Reset on process restart — operators who want
+durable history scrape into Prom and let it persist.
+
+## Architecture
+
+New package `internal/server/metrics`:
+
+- `Registry` — owns the histogram state (sync.Mutex + map keyed by
+  `kind+status`). `ObserveJob(kind, status string, dur time.Duration)`
+  is the only mutator. Lookups via `Snapshot()` are read-only and
+  copy out.
+- `Render(w io.Writer, snapshot Snapshot)` — emits the full
+  exposition body. The snapshot is supplied by the HTTP handler
+  pulling from `Store` on each scrape; the package itself has no
+  store dependency, which keeps it trivially unit-testable.
+
+New file `internal/server/http/metrics.go`:
+
+- `handleMetrics(w, r)` — auth check (bearer + CIDR), pull current
+  fleet snapshot from `Store`, ask `metrics.Render` to emit.
+- Auth helper `authoriseMetricsScrape(r)` — pure function over
+  request + config; tested directly.
+
+Wiring:
+
+- `cmd/server` constructs the `metrics.Registry` once and threads
+  it into both `Deps` (for the HTTP layer) and `ws.HandlerDeps`
+  (so the job-finished branch can call `ObserveJob`).
+- `ws/handler.go` MsgJobFinished branch grows a single line:
+  `if deps.Metrics != nil { deps.Metrics.ObserveJob(job.Kind, p.Status, p.FinishedAt.Sub(job.StartedAt)) }`.
+  Falls back gracefully if the registry was never wired (tests).
+
+Route registration in `server.go`:
+
+```go
+if s.deps.Cfg.MetricsAuthEnabled() {
+    r.Get("/metrics", s.handleMetrics)
+}
+```
+
+## Cardinality + cost
+
+Per scrape: O(hosts) gauge rows + O(kinds × statuses × buckets) histogram rows. For a 100-host fleet that's ~700 host rows + ~270 histogram rows per scrape, well under any practical limit. Each scrape issues two store reads: `ListHosts` (already exists, one query) and `ListAlerts` filtered by open status (one query). Repo stats need no extra read because `GetHostRepoStats` is already projected onto `Host` via `repo_size_bytes`. No N+1.
+
+A 10s scrape interval against a 100-host fleet is cheap: each scrape is a couple of indexed sqlite reads + a small string render. We're nowhere near a place where caching the snapshot would be worthwhile.
+
+## Documentation (P6-05)
+
+- `docs/prometheus.md` — sibling to the existing `docs/reverse-proxy.md`. Sections: enabling the endpoint (env vars), Prometheus scrape config snippet (with bearer + tls), the metric reference table (copy-pasted from this spec), the dashboard import instructions.
+- `deploy/grafana/restic-manager-dashboard.json` — Grafana 11+ dashboard JSON. Single Prometheus datasource variable, six panels:
+  1. **Fleet status** — stat panel showing `rm_hosts_online / rm_hosts_total` + a sparkline.
+  2. **Open alerts** — stat panel by severity (`sum by (severity) (rm_active_alerts)`).
+  3. **Hosts** — table of `host`, `online`, `last_backup` (relative time via `time() - rm_host_last_backup_timestamp_seconds`), `repo_size`, `snapshots`.
+  4. **Repo size over time** — time series, one line per host, `rm_host_repo_size_bytes`.
+  5. **Backups failing** — time series counting hosts where `rm_host_last_backup_success == 0`.
+  6. **Job duration p95** — `histogram_quantile(0.95, sum by (kind, le) (rate(rm_job_duration_seconds_bucket[1h])))` over a 1h window.
+
+Dashboard is committed as plain JSON; an operator imports it through the Grafana UI ("+ → Import → upload JSON file") or Grafana provisioning.
+
+## Testing
+
+- Unit tests for `metrics.Render` against a fixed snapshot — golden-file style. Pin the exact line ordering (sorted by metric name + label set) so diffs stay tractable.
+- Unit tests for `metrics.Registry.ObserveJob` — concurrent writes, bucket boundary correctness, snapshot independence.
+- Handler tests for `/metrics` covering: no auth configured → 404; token configured + missing → 401; token configured + correct → 200 + body sniff; CIDR configured + wrong source → 401; CIDR configured + right source → 200; both configured → require both.
+- End-to-end smoke verification deferred to manual operator walk-through; full Playwright pass is P5-06's job.
+
+## Out of scope, explicitly
+
+- Per-job latency tracking with `job_id` labels (cardinality bomb).
+- Restore-specific metrics (P3 surfaces are still settling).
+- Histograms keyed on host (kind × status × buckets × hosts is a Prom anti-pattern).
+- Auto-discovery / file-SD generators for Prometheus.
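The Architecture section above describes `Registry` as a mutex-guarded map keyed by `kind+status`, with `ObserveJob` as the only mutator and a copy-out read path. A minimal sketch of that shape follows, assuming the bucket bounds listed in the spec. The type layout, the `ObserveSeconds` helper, and the `Cumulative` accessor are hypothetical names chosen for illustration, not the package's actual API.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"time"
)

// bucketBounds mirrors the upper bounds from the spec, in seconds.
var bucketBounds = []float64{1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, math.Inf(1)}

// series holds one histogram: per-bucket (non-cumulative) counts, sum, and count.
type series struct {
	buckets []uint64
	sum     float64
	count   uint64
}

// Registry is a hypothetical stand-in for metrics.Registry.
type Registry struct {
	mu     sync.Mutex
	series map[string]*series // keyed by kind + "\x00" + status
}

func NewRegistry() *Registry { return &Registry{series: map[string]*series{}} }

// ObserveSeconds increments the first bucket whose upper bound is >= secs.
func (r *Registry) ObserveSeconds(kind, status string, secs float64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	key := kind + "\x00" + status
	s, ok := r.series[key]
	if !ok {
		s = &series{buckets: make([]uint64, len(bucketBounds))}
		r.series[key] = s
	}
	for i, ub := range bucketBounds {
		if secs <= ub {
			s.buckets[i]++
			break
		}
	}
	s.sum += secs
	s.count++
}

// ObserveJob is the duration-typed entry point named in the spec.
func (r *Registry) ObserveJob(kind, status string, dur time.Duration) {
	r.ObserveSeconds(kind, status, dur.Seconds())
}

// Cumulative copies out cumulative counts (Prometheus `le` semantics).
// It returns nil when the series has never been observed.
func (r *Registry) Cumulative(kind, status string) []uint64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	s, ok := r.series[kind+"\x00"+status]
	if !ok {
		return nil
	}
	out := make([]uint64, len(s.buckets))
	var running uint64
	for i, c := range s.buckets {
		running += c
		out[i] = running
	}
	return out
}

func main() {
	reg := NewRegistry()
	reg.ObserveJob("backup", "succeeded", 3*time.Second)
	reg.ObserveJob("backup", "succeeded", 45*time.Minute)
	fmt.Println(reg.Cumulative("backup", "succeeded"))
}
```

Storing non-cumulative counts and accumulating at read time keeps the write path to a single increment. The copy made under the lock is what gives a renderer a snapshot that later observations cannot mutate.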
@@ -0,0 +1,42 @@
+# Build a Linux container that runs the restic-manager agent against a
+# sibling rest-server in the e2e compose stack. Used only by tests
+# (e2e/compose.e2e.yml + .gitea/workflows/e2e.yml).
+#
+# Two stages:
+#   1. golang:1.25-alpine to build the agent binary.
+#   2. alpine:3.20 with the `restic` package + the built binary.
+#
+# Base images are pinned by tag (not digest) for CI reproducibility.
+
+FROM golang:1.25-alpine AS build
+WORKDIR /src
+
+ENV CGO_ENABLED=0 \
+    GOFLAGS="-trimpath"
+
+COPY go.mod go.sum* ./
+RUN go mod download
+
+COPY . .
+ARG VERSION=e2e
+RUN go build -ldflags="-s -w -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=${VERSION}" \
+        -o /out/restic-manager-agent ./cmd/agent
+
+FROM alpine:3.20
+RUN apk add --no-cache restic ca-certificates curl
+COPY --from=build /out/restic-manager-agent /usr/local/bin/restic-manager-agent
+
+# Agents normally run as root because backup paths often need it. The
+# e2e fixture only backs up paths under /data which we own, so this
+# container would tolerate a non-root user — but staying root keeps
+# parity with the production install.
+USER root
+
+# The agent needs a writable directory for its config + secrets store.
+RUN mkdir -p /etc/restic-manager /var/lib/restic-manager-agent
+ENV RM_AGENT_CONFIG=/etc/restic-manager/agent.yaml
+
+# The entrypoint reads the server URL from $RM_SERVER, which compose sets.
+COPY e2e/agent-entrypoint.sh /usr/local/bin/entrypoint.sh
+RUN chmod +x /usr/local/bin/entrypoint.sh
+ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]
@@ -0,0 +1,21 @@
+# Playwright runner for the e2e suite. Built and run by
+# e2e/compose.e2e.yml so the test process sits on the same docker
+# network as the server, agent, and rest-server. The previous setup
+# ran Playwright on the workflow runner host and reached the server
+# via 127.0.0.1:8080; that fails on Gitea's act-style runners
+# because the workflow steps execute inside a runner container,
+# not on the host where compose publishes its ports.
+
+FROM mcr.microsoft.com/playwright:v1.59.1-jammy
+
+WORKDIR /work
+
+# Install npm deps in a separate layer keyed off package.json so
+# changes to specs don't bust the dep cache.
+COPY e2e/playwright/package.json /work/package.json
+RUN npm install --no-audit --no-fund
+
+COPY e2e/playwright/ /work/
+
+ENV CI=1
+ENTRYPOINT ["npx", "playwright", "test"]
@@ -0,0 +1,27 @@
+#!/bin/sh
+# Entrypoint for the e2e agent container.
+#
+# Three states:
+#   1. Already enrolled (agent.yaml has an agent_token): run the agent.
+#   2. Token supplied via $RM_ENROL_TOKEN: enrol then run.
+#   3. Otherwise: announce against $RM_SERVER and wait for an admin to
+#      accept us. The announce flow blocks until accepted, then drops
+#      straight into the normal run loop, so this is the test-friendly
+#      path.
+set -eu
+
+CFG="${RM_AGENT_CONFIG:-/etc/restic-manager/agent.yaml}"
+SERVER="${RM_SERVER:?set RM_SERVER}"
+
+if [ -f "$CFG" ] && grep -q '^agent_token:' "$CFG"; then
+    exec restic-manager-agent -config "$CFG"
+fi
+
+if [ -n "${RM_ENROL_TOKEN:-}" ]; then
+    exec restic-manager-agent -config "$CFG" \
+        -enroll-server "$SERVER" \
+        -enroll-token "$RM_ENROL_TOKEN"
+fi
+
+# Announce-and-approve: blocks until an admin accepts, then runs.
+exec restic-manager-agent -config "$CFG" -enroll-server "$SERVER"
@@ -0,0 +1,108 @@
+# End-to-end test stack — used by .gitea/workflows/e2e.yml and by
+# operators who want to run the Playwright suite locally.
+#
+# Three services:
+#   * server      — restic-manager built from the working tree
+#   * agent       — restic-manager agent built from the working tree
+#                   (announces; Playwright accepts it during the test)
+#   * rest-server — the actual restic backend, sibling of the agent
+#
+# Run from the repo root:
+#   docker compose -f e2e/compose.e2e.yml up --build --abort-on-container-exit
+
+services:
+  rest-server:
+    image: restic/rest-server:0.13.0
+    environment:
+      DATA_DIR: /data
+      OPTIONS: "--no-auth"
+    volumes:
+      - rest-data:/data
+    networks: [rmnet]
+
+  server:
+    build:
+      context: ..
+      dockerfile: deploy/Dockerfile.server
+      args:
+        VERSION: e2e
+    environment:
+      RM_LISTEN: ":8080"
+      RM_DATA_DIR: "/data"
+      RM_BASE_URL: "http://server:8080"
+      RM_COOKIE_SECURE: "false"
+      # Open the metrics endpoint to any source address for the test,
+      # so one of the Playwright assertions can exercise it.
+      RM_METRICS_TRUSTED_CIDR: "0.0.0.0/0"
+    volumes:
+      - server-data:/data
+    ports:
+      - "127.0.0.1:8080:8080"
+    healthcheck:
+      test: ["CMD", "/usr/local/bin/restic-manager-server", "--version"]
+      interval: 2s
+      timeout: 2s
+      retries: 30
+    networks: [rmnet]
+
+  agent:
+    build:
+      context: ..
+      dockerfile: e2e/Dockerfile.agent
+      args:
+        VERSION: e2e
+    environment:
+      RM_SERVER: "http://server:8080"
+    depends_on:
+      - server
+    volumes:
+      # Source paths the agent backs up. The source-fixture service
+      # pre-populates this with a few files so the snapshot list isn't empty.
+      - source-data:/source
+      - agent-config:/etc/restic-manager
+      - agent-state:/var/lib/restic-manager-agent
+    networks: [rmnet]
+
+  # Playwright test runner. Profile-gated so `compose up` doesn't
+  # start it; CI runs it via `compose run --rm playwright`. Lives on
+  # rmnet so it can reach the server via its compose-network DNS
+  # name rather than depending on host port-publish (which doesn't
+  # work on Gitea's container-based runners).
+  playwright:
+    profiles: [test]
+    build:
+      context: ..
+      dockerfile: e2e/Dockerfile.playwright
+    environment:
+      RM_BASE_URL: "http://server:8080"
+      RM_BOOTSTRAP_TOKEN: "${RM_BOOTSTRAP_TOKEN:-}"
+    volumes:
+      - ./playwright/playwright-report:/work/playwright-report
+      - ./playwright/test-results:/work/test-results
+    depends_on:
+      - server
+      - agent
+    networks: [rmnet]
+
+  # One-shot init container that drops a couple of files into the
+  # source volume so backups have something to snapshot.
+  source-fixture:
+    image: alpine:3.20
+    command: >
+      sh -c 'mkdir -p /source && echo "hello world" > /source/hello.txt &&
+             echo "another file" > /source/two.txt && sleep 0.2'
+    volumes:
+      - source-data:/source
+    networks: [rmnet]
+    restart: "no"
+
+volumes:
+  server-data:
+  rest-data:
+  source-data:
+  agent-config:
+  agent-state:
+
+networks:
+  rmnet:
+    driver: bridge
@@ -0,0 +1,14 @@
+{
+  "name": "restic-manager-e2e",
+  "version": "0.0.0",
+  "private": true,
+  "type": "module",
+  "scripts": {
+    "test": "playwright test",
+    "test:headed": "playwright test --headed",
+    "test:debug": "PWDEBUG=1 playwright test"
+  },
+  "devDependencies": {
+    "@playwright/test": "1.59.1"
+  }
+}
@@ -0,0 +1,31 @@
+import { defineConfig, devices } from '@playwright/test';
+
+// Single-target Chromium config: the e2e suite is narrow (smoke
+// the production-shaped flow against the docker-compose stack).
+// Cross-browser matrix doesn't add signal — what we're verifying is
+// the server's HTML and the agent's WebSocket handshake, neither of
+// which depends on browser engine.
+
+const baseURL = process.env.RM_BASE_URL ?? 'http://127.0.0.1:8080';
+
+export default defineConfig({
+    testDir: './tests',
+    timeout: 60_000,
+    expect: { timeout: 10_000 },
+    fullyParallel: false,
+    retries: process.env.CI ? 1 : 0,
+    workers: 1,
+    reporter: [['list'], ['html', { open: 'never' }]],
+    use: {
+        baseURL,
+        trace: 'retain-on-failure',
+        screenshot: 'only-on-failure',
+        video: 'retain-on-failure',
+    },
+    projects: [
+        {
+            name: 'chromium',
+            use: { ...devices['Desktop Chrome'] },
+        },
+    ],
+});
@@ -0,0 +1,114 @@
+// Helpers used by every test. The shape favours the JSON API for
+// reads + accept/dispatch (deterministic, easy to assert) and the
+// browser for human-facing surfaces (login form, dashboard render).
+
+import { APIRequestContext, expect, Page } from '@playwright/test';
+
+export const baseURL = process.env.RM_BASE_URL ?? 'http://127.0.0.1:8080';
+
+export interface HostJSON {
+    id: string;
+    name: string;
+    status: string;
+    last_backup_status?: string;
+}
+
+export async function readBootstrapToken(): Promise<string> {
+    const tok = process.env.RM_BOOTSTRAP_TOKEN;
+    if (!tok) {
+        throw new Error('RM_BOOTSTRAP_TOKEN not set — the harness scrapes it from server logs');
+    }
+    return tok;
+}
+
+export async function bootstrapAdmin(
+    request: APIRequestContext,
+    {
+        username = 'admin',
+        password = 'e2e-test-password-1234',
+    }: { username?: string; password?: string } = {},
+): Promise<{ username: string; password: string }> {
+    const token = await readBootstrapToken();
+    const res = await request.post(`${baseURL}/api/bootstrap`, {
+        data: { token, username, password },
+    });
+    if (!res.ok() && res.status() !== 409 /* already bootstrapped */) {
+        throw new Error(`bootstrap: ${res.status()} ${await res.text()}`);
+    }
+    return { username, password };
+}
+
+export async function loginViaUI(page: Page, username: string, password: string): Promise<void> {
+    await page.goto(`${baseURL}/login`);
+    await page.locator('#login-username').fill(username);
+    await page.locator('#login-password').fill(password);
+    await Promise.all([
+        page.waitForURL(new RegExp(`^${baseURL}/?$`)),
+        page.locator('form[action="/login"] button[type="submit"]').click(),
+    ]);
+}
+
+/**
+ * Polls the dashboard until a pending host card is visible, then
+ * extracts its pending-id from the inline accept form's action URL.
+ */
+export async function waitForPendingHostID(page: Page): Promise<string> {
+    const formLocator = page.locator('form[action^="/api/pending-hosts/"][action$="/accept"]').first();
+    await expect(formLocator).toBeVisible({ timeout: 60_000 });
+    const action = await formLocator.getAttribute('action');
+    if (!action) throw new Error('pending host form has no action attribute');
+    const m = action.match(/\/api\/pending-hosts\/([^/]+)\/accept/);
+    if (!m) throw new Error(`unexpected action URL: ${action}`);
+    return m[1];
+}
+
+export async function acceptPending(
+    request: APIRequestContext,
+    cookie: string,
+    pendingID: string,
+    repo: { url: string; username?: string; password: string },
+): Promise<void> {
+    const res = await request.post(`${baseURL}/api/pending-hosts/${pendingID}/accept`, {
+        headers: { cookie, 'content-type': 'application/json' },
+        data: {
+            repo_url: repo.url,
+            repo_username: repo.username ?? '',
+            repo_password: repo.password,
+        },
+    });
+    if (!res.ok()) {
+        throw new Error(`accept: ${res.status()} ${await res.text()}`);
+    }
+}
+
+export async function listHosts(request: APIRequestContext, cookie: string): Promise<HostJSON[]> {
+    const res = await request.get(`${baseURL}/api/hosts`, { headers: { cookie } });
+    if (!res.ok()) throw new Error(`list hosts: ${res.status()} ${await res.text()}`);
+    const body = (await res.json()) as { items?: HostJSON[]; hosts?: HostJSON[] };
+    return body.items ?? body.hosts ?? [];
+}
+
+export async function waitForHostStatus(
+    request: APIRequestContext,
+    cookie: string,
+    matcher: (h: HostJSON) => boolean,
+    timeoutMs = 60_000,
+): Promise<HostJSON> {
+    const deadline = Date.now() + timeoutMs;
+    let last: HostJSON | undefined;
+    while (Date.now() < deadline) {
+        const hosts = await listHosts(request, cookie);
+        const hit = hosts.find(matcher);
+        if (hit) return hit;
+        last = hosts[0];
+        await new Promise((r) => setTimeout(r, 1_000));
+    }
+    throw new Error(`waitForHostStatus: timeout. Last seen: ${JSON.stringify(last)}`);
+}
+
+export async function getSessionCookie(page: Page): Promise<string> {
+    const cookies = await page.context().cookies();
+    const c = cookies.find((c) => c.name === 'rm_session');
+    if (!c) throw new Error('rm_session cookie not set after login');
+    return `${c.name}=${c.value}`;
+}
@@ -0,0 +1,80 @@
+// End-to-end smoke: bootstrap → accept pending host → run backup → see succeeded.
+//
+// The compose stack stands up a server, a sibling rest-server, and an
+// agent in announce-and-approve mode. This test drives the operator
+// path through the UI (login + dashboard) and the API
+// (accept + run-now + poll for terminal) — UI for the human surfaces,
+// API for the deterministic ones.
+
+import { test, expect } from '@playwright/test';
+import {
+    baseURL,
+    bootstrapAdmin,
+    loginViaUI,
+    waitForPendingHostID,
+    acceptPending,
+    waitForHostStatus,
+    getSessionCookie,
+} from './lib/server';
+
+test.describe('smoke: enrol-via-announce → backup', () => {
+    test('happy path completes in under a minute', async ({ page, request }) => {
+        const { username, password } = await bootstrapAdmin(request);
+        await loginViaUI(page, username, password);
+
+        // Dashboard renders.
+        await expect(page.locator('main')).toContainText(/host|fleet|pending/i, { timeout: 10_000 });
+
+        // Pending host appears (the agent container has been
+        // announcing since startup).
+        const pendingID = await waitForPendingHostID(page);
+        const cookie = await getSessionCookie(page);
+
+        // Accept with the rest-server creds. compose's rest-server runs
+        // --no-auth, so any credentials work; restic still demands a
+        // password to encrypt the repo.
+        await acceptPending(request, cookie, pendingID, {
+            url: 'rest:http://rest-server:8000/',
+            password: 'e2e-repo-password',
+        });
+
+        // Wait for the host to come online + auto-init to land.
+        const onlineHost = await waitForHostStatus(
+            request, cookie,
+            (h) => h.status === 'online',
+            60_000,
+        );
+        expect(onlineHost.id).toBeTruthy();
+
+        // Trigger a backup via the UI form-post (HX-Redirect to /jobs/{id}).
+        await page.goto(`${baseURL}/hosts/${onlineHost.id}`);
+        await Promise.all([
+            page.waitForURL(/\/jobs\//),
+            page.locator('form[action$="/run-backup"] button[type="submit"]').first().click(),
+        ]);
+
+        // Wait for the host's last_backup_status to flip to 'succeeded'.
+        // The job page itself is harder to assert on (it uses
+        // server-pushed updates and a reload-on-finish pattern); the
+        // host record is the source of truth and is what the dashboard
+        // surfaces.
+        const finishedHost = await waitForHostStatus(
+            request, cookie,
+            (h) => h.id === onlineHost.id && h.last_backup_status === 'succeeded',
+            120_000,
+        );
+        expect(finishedHost.last_backup_status).toBe('succeeded');
+    });
+});
+
+test.describe('smoke: scrape /metrics', () => {
+    test('metrics endpoint exposes the host gauge', async ({ request }) => {
+        // Compose sets RM_METRICS_TRUSTED_CIDR=0.0.0.0/0 so the
+        // endpoint is open to the test runner.
+        const res = await request.get(`${baseURL}/metrics`);
+        expect(res.status()).toBe(200);
+        const body = await res.text();
+        expect(body).toContain('rm_hosts_total');
+        expect(body).toContain('rm_build_info{');
+    });
+});
@@ -41,6 +41,24 @@ type Config struct {
 	// DataDir. Source-build deployments can override via
 	// RM_BUNDLED_ASSETS_DIR.
 	BundledAssetsDir string `yaml:"bundled_assets_dir"`
+
+	// MetricsToken, if set, gates the /metrics scrape endpoint
+	// behind an `Authorization: Bearer <token>` check (constant-time
+	// compare). When neither this nor MetricsTrustedCIDRs is set,
+	// the route is not mounted at all (the endpoint is opt-in).
+	MetricsToken string `yaml:"metrics_token"`
+
+	// MetricsTrustedCIDRs, if non-empty, gates /metrics so only
+	// callers from these networks may scrape. ANDed with
+	// MetricsToken when both are set.
+	MetricsTrustedCIDRs []string `yaml:"metrics_trusted_cidrs"`
+}
+
+// MetricsAuthEnabled reports whether the operator has opted into
+// exposing the Prometheus scrape endpoint by configuring at least
+// one auth gate.
+func (c Config) MetricsAuthEnabled() bool {
+	return c.MetricsToken != "" || len(c.MetricsTrustedCIDRs) > 0
 }

 // Load resolves config in this order:
@@ -93,6 +111,19 @@ func Load(yamlPath string) (Config, error) {
 	if v, ok := os.LookupEnv("RM_BUNDLED_ASSETS_DIR"); ok {
 		c.BundledAssetsDir = v
 	}
+	if v, ok := os.LookupEnv("RM_METRICS_TOKEN"); ok {
+		c.MetricsToken = v
+	}
+	if v, ok := os.LookupEnv("RM_METRICS_TRUSTED_CIDR"); ok {
+		parts := strings.Split(v, ",")
+		c.MetricsTrustedCIDRs = c.MetricsTrustedCIDRs[:0]
+		for _, p := range parts {
+			p = strings.TrimSpace(p)
+			if p != "" {
+				c.MetricsTrustedCIDRs = append(c.MetricsTrustedCIDRs, p)
+			}
+		}
+	}
 	if v, ok := os.LookupEnv("RM_TRUSTED_PROXY"); ok {
 		// Comma-separated CIDRs; allow whitespace for readability.
 		parts := strings.Split(v, ",")
@@ -137,5 +168,10 @@ func (c *Config) validate() error {
 			return fmt.Errorf("config: RM_TRUSTED_PROXY entry %q is not a valid CIDR: %w", cidr, err)
 		}
 	}
+	for _, cidr := range c.MetricsTrustedCIDRs {
+		if _, err := netip.ParsePrefix(cidr); err != nil {
+			return fmt.Errorf("config: RM_METRICS_TRUSTED_CIDR entry %q is not a valid CIDR: %w", cidr, err)
+		}
+	}
 	return nil
 }
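The load path above keeps the CIDRs as strings, validates them once in `validate`, and the request path parses them again on every scrape. One possible alternative, sketched here under the assumption that the config could carry `[]netip.Prefix` instead, is to parse and validate in a single step at load time. `parseCIDRList` is a hypothetical helper and is not part of this diff.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// parseCIDRList splits a comma-separated list, tolerates surrounding
// whitespace and empty entries, and rejects anything that is not a CIDR.
// Hypothetical helper: the diff stores strings and validates separately.
func parseCIDRList(v string) ([]netip.Prefix, error) {
	var out []netip.Prefix
	for _, p := range strings.Split(v, ",") {
		p = strings.TrimSpace(p)
		if p == "" {
			continue
		}
		pre, err := netip.ParsePrefix(p)
		if err != nil {
			return nil, fmt.Errorf("entry %q is not a valid CIDR: %w", p, err)
		}
		// Masked zeroes the host bits, so "10.1.2.3/8" becomes "10.0.0.0/8".
		out = append(out, pre.Masked())
	}
	return out, nil
}

func main() {
	prefixes, err := parseCIDRList("10.0.0.0/8, 192.168.1.0/24")
	fmt.Println(prefixes, err)

	_, err = parseCIDRList("garbage")
	fmt.Println(err != nil)
}
```

The trade-off is a type change on `Config`, which would ripple into YAML decoding. The string form in the diff avoids that at the cost of re-parsing per request, which is cheap at scrape frequency.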
@@ -98,6 +98,45 @@ func TestCookieSecureDefaultAndOverride(t *testing.T) {
 	}
 }

+func TestMetricsAuthGates(t *testing.T) {
+	t.Setenv("RM_LISTEN", ":8080")
+	t.Setenv("RM_DATA_DIR", "/tmp/x")
+
+	c, err := Load("")
+	if err != nil {
+		t.Fatalf("load: %v", err)
+	}
+	if c.MetricsAuthEnabled() {
+		t.Errorf("metrics endpoint should be off by default")
+	}
+
+	t.Setenv("RM_METRICS_TOKEN", "s3cr3t-token-with-enough-bytes")
+	t.Setenv("RM_METRICS_TRUSTED_CIDR", "10.0.0.0/8, 192.168.1.0/24")
+	c, err = Load("")
+	if err != nil {
+		t.Fatalf("load: %v", err)
+	}
+	if c.MetricsToken != "s3cr3t-token-with-enough-bytes" {
+		t.Errorf("token: %q", c.MetricsToken)
+	}
+	if got := c.MetricsTrustedCIDRs; len(got) != 2 || got[0] != "10.0.0.0/8" || got[1] != "192.168.1.0/24" {
+		t.Errorf("cidrs: %v", got)
+	}
+	if !c.MetricsAuthEnabled() {
+		t.Errorf("MetricsAuthEnabled should be true")
+	}
+}
+
+func TestMetricsTrustedCIDRRejectsGarbage(t *testing.T) {
+	t.Setenv("RM_LISTEN", ":8080")
+	t.Setenv("RM_DATA_DIR", "/tmp/x")
+	t.Setenv("RM_METRICS_TRUSTED_CIDR", "garbage")
+
+	if _, err := Load(""); err == nil {
+		t.Fatal("expected validation error, got nil")
+	}
+}
+
 func writeFile(path string, body []byte) error {
 	return writeFileImpl(path, body)
 }
@@ -0,0 +1,185 @@
+package http
+
+import (
+	"context"
+	"crypto/subtle"
+	"net"
+	"net/http"
+	"net/netip"
+	"runtime"
+	"strings"
+
+	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/config"
+	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
+	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
+	"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
+)
+
+// handleMetrics serves the Prometheus exposition body. The route is
+// only mounted when the operator has opted in via RM_METRICS_TOKEN
+// or RM_METRICS_TRUSTED_CIDR (see Server.New + Cfg.MetricsAuthEnabled).
+func (s *Server) handleMetrics(w http.ResponseWriter, r *http.Request) {
+	if !authoriseMetricsScrape(r, s.deps.Cfg) {
+		// 401 with no body; Prom respects this and surfaces the failed
+		// scrape. WWW-Authenticate hints at bearer when the operator
+		// actually configured a token.
+		if s.deps.Cfg.MetricsToken != "" {
+			w.Header().Set("WWW-Authenticate", `Bearer realm="restic-manager metrics"`)
+		}
+		w.WriteHeader(http.StatusUnauthorized)
+		return
+	}
+
+	snap, err := s.gatherMetricsSnapshot(r.Context())
+	if err != nil {
+		http.Error(w, "snapshot: "+err.Error(), http.StatusInternalServerError)
+		return
+	}
+
+	// 0.0.4 is the long-stable text-format version Prometheus accepts
+	// without negotiation; OpenMetrics is intentionally not used here.
+	w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
+	if err := metrics.Render(w, snap); err != nil {
+		// Body is partially written and the status line is already
+		// sent, so there is nothing useful left to do but stop writing.
+		return
+	}
+}
+
+// authoriseMetricsScrape applies bearer + CIDR gates per the spec.
+// AND semantics when both are configured; when only one gate is
+// configured, that gate alone decides.
+func authoriseMetricsScrape(r *http.Request, cfg config.Config) bool {
+	tokenOK := true
+	if cfg.MetricsToken != "" {
+		tokenOK = false
+		hdr := r.Header.Get("Authorization")
+		const prefix = "Bearer "
+		if strings.HasPrefix(hdr, prefix) {
+			got := []byte(strings.TrimPrefix(hdr, prefix))
+			want := []byte(cfg.MetricsToken)
+			if subtle.ConstantTimeCompare(got, want) == 1 {
+				tokenOK = true
+			}
+		}
+	}
+
+	cidrOK := true
+	if len(cfg.MetricsTrustedCIDRs) > 0 {
+		cidrOK = false
+		ip := callerIP(r, cfg.TrustedProxies)
+		if ip.IsValid() {
+			for _, c := range cfg.MetricsTrustedCIDRs {
+				prefix, err := netip.ParsePrefix(c)
+				if err != nil {
+					continue
+				}
+				if prefix.Contains(ip) {
+					cidrOK = true
+					break
+				}
+			}
+		}
+	}
+	return tokenOK && cidrOK
+}
+
+// callerIP resolves the client IP. When the request hit the server
+// directly we use RemoteAddr; when the immediate hop is a trusted
+// proxy we honour the right-most untrusted X-Forwarded-For entry
+// (mirrors how realIP middlewares typically resolve).
+func callerIP(r *http.Request, trustedProxies []string) netip.Addr {
+	host, _, err := net.SplitHostPort(r.RemoteAddr)
+	if err != nil {
+		host = r.RemoteAddr
+	}
+	directAddr, err := netip.ParseAddr(host)
+	if err != nil {
+		return netip.Addr{}
+	}
+
+	if !addrInAnyCIDR(directAddr, trustedProxies) {
+		return directAddr
+	}
+
+	xff := r.Header.Get("X-Forwarded-For")
+	if xff == "" {
+		return directAddr
+	}
+	parts := strings.Split(xff, ",")
+	// Walk right→left, skipping trusted proxies, until we land on the
+	// first untrusted hop — that's the genuine client.
+	for i := len(parts) - 1; i >= 0; i-- {
+		p := strings.TrimSpace(parts[i])
+		a, err := netip.ParseAddr(p)
+		if err != nil {
+			continue
+		}
+		if addrInAnyCIDR(a, trustedProxies) {
+			continue
+		}
+		return a
+	}
+	return directAddr
+}
+
+func addrInAnyCIDR(a netip.Addr, cidrs []string) bool {
+	for _, c := range cidrs {
+		pre, err := netip.ParsePrefix(c)
+		if err != nil {
+			continue
+		}
+		if pre.Contains(a) {
+			return true
+		}
+	}
+	return false
+}
+
+// gatherMetricsSnapshot pulls the data the renderer needs. One
+// indexed query per fleet-wide read (hosts, open alerts); no N+1.
+func (s *Server) gatherMetricsSnapshot(ctx context.Context) (metrics.Snapshot, error) {
+	hosts, err := s.deps.Store.ListHosts(ctx)
+	if err != nil {
+		return metrics.Snapshot{}, err
+	}
+	hostRows := make([]metrics.HostRow, 0, len(hosts))
+	for _, h := range hosts {
+		row := metrics.HostRow{
+			ID:             h.ID,
+			Name:           h.Name,
+			Online:         h.Status == "online",
+			SnapshotCount:  h.SnapshotCount,
+			OpenAlertCount: h.OpenAlertCount,
+			RepoStatus:     h.RepoStatus,
+		}
+		if h.LastBackupAt != nil {
+			ts := h.LastBackupAt.Unix()
+			row.LastBackupUnix = &ts
+		}
+		if h.LastBackupStatus != nil {
+			ok := *h.LastBackupStatus == "succeeded"
+			row.LastBackupSucceeded = &ok
+		}
+		if h.RepoSizeBytes > 0 {
+			sz := h.RepoSizeBytes
+			row.RepoSizeBytes = &sz
+		}
+		hostRows = append(hostRows, row)
+	}
+
+	open, err := s.deps.Store.ListAlerts(ctx, store.AlertFilter{Status: "open"})
+	if err != nil {
+		return metrics.Snapshot{}, err
+	}
+	bySeverity := map[string]int{"info": 0, "warning": 0, "critical": 0}
+	for _, a := range open {
+		bySeverity[a.Severity]++
+	}
+
+	reg := s.deps.Metrics
+	if reg == nil {
+		reg = metrics.NewRegistry() // empty histogram block
+	}
+	return reg.SnapshotWith(hostRows, bySeverity, version.Version, version.Commit, runtime.Version()), nil
+}
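The right-most-untrusted rule in `callerIP` is easier to follow with concrete inputs. The sketch below restates the same walk as a standalone function over plain strings so it can be run in isolation. `resolveClientIP` and `inAny` are hypothetical names, and the addresses are documentation ranges chosen for the example.

```go
package main

import (
	"fmt"
	"net"
	"net/netip"
	"strings"
)

// inAny reports whether a falls inside any of the trusted prefixes.
func inAny(a netip.Addr, trusted []netip.Prefix) bool {
	for _, p := range trusted {
		if p.Contains(a) {
			return true
		}
	}
	return false
}

// resolveClientIP returns the address the CIDR gate should judge.
// It returns "" when remoteAddr cannot be parsed as an IP.
func resolveClientIP(remoteAddr, xff string, trustedCIDRs []string) string {
	var trusted []netip.Prefix
	for _, c := range trustedCIDRs {
		if p, err := netip.ParsePrefix(c); err == nil {
			trusted = append(trusted, p)
		}
	}
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		host = remoteAddr // no port present
	}
	direct, err := netip.ParseAddr(host)
	if err != nil {
		return ""
	}
	// Only a trusted proxy may speak for someone else.
	if !inAny(direct, trusted) || xff == "" {
		return direct.String()
	}
	parts := strings.Split(xff, ",")
	// Walk right to left: entries appended by trusted proxies are skipped,
	// and the first untrusted entry is taken as the client.
	for i := len(parts) - 1; i >= 0; i-- {
		a, err := netip.ParseAddr(strings.TrimSpace(parts[i]))
		if err != nil || inAny(a, trusted) {
			continue
		}
		return a.String()
	}
	return direct.String()
}

func main() {
	trusted := []string{"172.16.0.0/12"}
	// Direct caller is not a trusted proxy, so its header is ignored.
	fmt.Println(resolveClientIP("203.0.113.9:5555", "10.1.2.3", trusted))
	// Trusted proxy in front: the right-most untrusted hop wins.
	fmt.Println(resolveClientIP("172.16.0.2:443", "1.2.3.4, 198.51.100.7, 172.16.0.5", trusted))
}
```

The first call yields `203.0.113.9`, because a caller outside the trusted proxies cannot smuggle an allowed address in through the header. The second yields `198.51.100.7`: the left-most `1.2.3.4` is client-supplied and never consulted. One consequence for operators is that a `/metrics` CIDR gate behind a reverse proxy only sees the real scraper address when the proxy is listed in `RM_TRUSTED_PROXY`.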
@@ -0,0 +1,209 @@
+package http
+
+import (
+	"context"
+	"io"
+	stdhttp "net/http"
+	"net/http/httptest"
+	"path/filepath"
+	"strings"
+	"testing"
+
+	"gitea.dcglab.co.uk/steve/restic-manager/internal/crypto"
+	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/config"
+	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
+	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
+)
+
+// newMetricsServer builds a Server with metrics enabled per cfg.
+// Returns (URL, registry, store) so tests can both observe job durations
+// directly and exercise the HTTP gate.
+func newMetricsServer(t *testing.T, cfg config.Config) (string, *metrics.Registry, *store.Store) {
+	t.Helper()
+	dir := t.TempDir()
+
+	st, err := store.Open(context.Background(), filepath.Join(dir, "rm.db"))
+	if err != nil {
+		t.Fatalf("store: %v", err)
+	}
+	t.Cleanup(func() { _ = st.Close() })
+
+	keyPath := filepath.Join(dir, "secret.key")
+	if err := crypto.GenerateKeyFile(keyPath); err != nil {
+		t.Fatalf("genkey: %v", err)
+	}
+	key, _ := crypto.LoadKeyFromFile(keyPath)
+	aead, _ := crypto.NewAEAD(key)
+
+	cfg.Listen = ":0"
+	cfg.DataDir = dir
+	cfg.SecretKeyFile = keyPath
+
+	reg := metrics.NewRegistry()
+	deps := Deps{
+		Cfg:     cfg,
+		Store:   st,
+		AEAD:    aead,
+		Metrics: reg,
+	}
+	s := New(deps)
+	ts := httptest.NewServer(s.srv.Handler)
+	t.Cleanup(ts.Close)
+	return ts.URL, reg, st
+}
+
+func TestMetricsRouteNotMountedByDefault(t *testing.T) {
+	t.Parallel()
+	url, _, _ := newMetricsServer(t, config.Config{})
+	res, err := stdhttp.Get(url + "/metrics")
+	if err != nil {
+		t.Fatalf("GET: %v", err)
+	}
+	defer res.Body.Close()
+	if res.StatusCode != stdhttp.StatusNotFound {
+		t.Errorf("status: got %d, want 404 (route should not be mounted)", res.StatusCode)
+	}
+}
+
+func TestMetricsTokenRequired(t *testing.T) {
+	t.Parallel()
+	url, _, _ := newMetricsServer(t, config.Config{
+		MetricsToken: "the-token",
+	})
+
+	// Missing token.
+	res, err := stdhttp.Get(url + "/metrics")
+	if err != nil {
+		t.Fatalf("GET: %v", err)
+	}
+	defer res.Body.Close()
+	if res.StatusCode != stdhttp.StatusUnauthorized {
+		t.Errorf("no token: got %d", res.StatusCode)
+	}
+	if !strings.Contains(res.Header.Get("WWW-Authenticate"), "Bearer") {
+		t.Errorf("WWW-Authenticate hint missing: %q", res.Header.Get("WWW-Authenticate"))
+	}
+
+	// Wrong token.
+	req, _ := stdhttp.NewRequest(stdhttp.MethodGet, url+"/metrics", nil)
+	req.Header.Set("Authorization", "Bearer not-the-token")
+	res2, err := stdhttp.DefaultClient.Do(req)
+	if err != nil {
+		t.Fatalf("GET: %v", err)
+	}
+	defer res2.Body.Close()
+	if res2.StatusCode != stdhttp.StatusUnauthorized {
+		t.Errorf("wrong token: got %d", res2.StatusCode)
+	}
+
+	// Right token.
+	req3, _ := stdhttp.NewRequest(stdhttp.MethodGet, url+"/metrics", nil)
+	req3.Header.Set("Authorization", "Bearer the-token")
+	res3, err3 := stdhttp.DefaultClient.Do(req3)
+	if err3 != nil {
+		t.Fatalf("GET: %v", err3)
+	}
+	defer res3.Body.Close()
+	if res3.StatusCode != stdhttp.StatusOK {
+		t.Errorf("right token: got %d", res3.StatusCode)
+	}
+	if ct := res3.Header.Get("Content-Type"); !strings.HasPrefix(ct, "text/plain") {
+		t.Errorf("content-type: %q", ct)
+	}
+}
+
+func TestMetricsCIDRGate(t *testing.T) {
+	t.Parallel()
+	// httptest clients connect from 127.0.0.1; pick a CIDR that excludes
+	// it to assert the "wrong source" branch.
+	url, _, _ := newMetricsServer(t, config.Config{
+		MetricsTrustedCIDRs: []string{"10.0.0.0/8"},
+	})
+	res, err := stdhttp.Get(url + "/metrics")
+	if err != nil {
+		t.Fatalf("GET: %v", err)
+	}
+	defer res.Body.Close()
+	if res.StatusCode != stdhttp.StatusUnauthorized {
+		t.Errorf("loopback hitting non-matching CIDR: got %d, want 401", res.StatusCode)
+	}
+
+	// Now allow loopback.
+	url2, _, _ := newMetricsServer(t, config.Config{
+		MetricsTrustedCIDRs: []string{"127.0.0.0/8"},
+	})
+	res2, err := stdhttp.Get(url2 + "/metrics")
+	if err != nil {
+		t.Fatalf("GET: %v", err)
+	}
+	defer res2.Body.Close()
+	if res2.StatusCode != stdhttp.StatusOK {
+		t.Errorf("loopback in allow CIDR: got %d, want 200", res2.StatusCode)
+	}
+}
+
+func TestMetricsTokenAndCIDRBothRequired(t *testing.T) {
+	t.Parallel()
+	url, _, _ := newMetricsServer(t, config.Config{
+		MetricsToken:        "the-token",
+		MetricsTrustedCIDRs: []string{"127.0.0.0/8"},
+	})
+	// CIDR only: source is loopback (allowed) but the token is missing.
+	res, err := stdhttp.Get(url + "/metrics")
+	if err != nil {
+		t.Fatalf("GET: %v", err)
+	}
+	defer res.Body.Close()
+	if res.StatusCode != stdhttp.StatusUnauthorized {
+		t.Errorf("missing token but in CIDR: got %d", res.StatusCode)
+	}
+
+	// Both right.
+	req, _ := stdhttp.NewRequest(stdhttp.MethodGet, url+"/metrics", nil)
+	req.Header.Set("Authorization", "Bearer the-token")
+	res2, err := stdhttp.DefaultClient.Do(req)
+	if err != nil {
+		t.Fatalf("GET: %v", err)
+	}
+	defer res2.Body.Close()
+	if res2.StatusCode != stdhttp.StatusOK {
+		t.Errorf("both right: got %d", res2.StatusCode)
+	}
+}
+
+func readAll(t *testing.T, r io.Reader) string {
+	t.Helper()
+	b, err := io.ReadAll(r)
+	if err != nil {
+		t.Fatalf("read: %v", err)
+	}
+	return string(b)
+}
+
+func TestMetricsBodyContainsExpectedLines(t *testing.T) {
+	t.Parallel()
+	url, reg, _ := newMetricsServer(t, config.Config{
+		MetricsToken: "the-token",
+	})
+	reg.ObserveJob("backup", "succeeded", 0) // produce one histogram row
+
+	req, _ := stdhttp.NewRequest(stdhttp.MethodGet, url+"/metrics", nil)
+	req.Header.Set("Authorization", "Bearer the-token")
+	res, err := stdhttp.DefaultClient.Do(req)
+	if err != nil {
+		t.Fatalf("GET: %v", err)
+	}
+	defer res.Body.Close()
+	body := readAll(t, res.Body)
+	for _, want := range []string{
+		"rm_hosts_total",
+		"rm_hosts_online",
+		`rm_active_alerts{severity="critical"}`,
+		"rm_build_info{",
+		"rm_job_duration_seconds_count{kind=\"backup\",status=\"succeeded\"}",
+	} {
+		if !strings.Contains(body, want) {
+			t.Errorf("body missing %q\n--- body ---\n%s", want, body)
+		}
+	}
+}
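Review note: the tests above pin down the gate's observable behaviour (token and CIDR are ANDed when both are configured, and a failed check returns 401), but the handler itself is not part of this hunk. The block below is a minimal sketch of how such a gate could be written, not the repository's implementation. The names `gate`, `mustGate` and `allow` are hypothetical; the standard-library calls (`subtle.ConstantTimeCompare`, `netip.ParsePrefix`, `strings.CutPrefix`) are real.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/netip"
	"strings"
)

// gate is a hypothetical stand-in for the server's metrics auth config.
// An empty token or an empty prefix list means "that check is not configured".
type gate struct {
	token    string
	prefixes []netip.Prefix
}

// mustGate builds a gate from a token and CIDR strings; it panics on a bad CIDR.
func mustGate(token string, cidrs ...string) gate {
	g := gate{token: token}
	for _, c := range cidrs {
		g.prefixes = append(g.prefixes, netip.MustParsePrefix(c))
	}
	return g
}

// allow reports whether a request with the given Authorization header value
// and client IP passes every configured check.
func (g gate) allow(authHeader, clientIP string) bool {
	if g.token == "" && len(g.prefixes) == 0 {
		return false // nothing configured: the real server does not mount the route
	}
	if g.token != "" {
		got, ok := strings.CutPrefix(authHeader, "Bearer ")
		// ConstantTimeCompare returns 1 only when both slices are equal.
		if !ok || subtle.ConstantTimeCompare([]byte(got), []byte(g.token)) != 1 {
			return false
		}
	}
	if len(g.prefixes) == 0 {
		return true // token-only configuration
	}
	addr, err := netip.ParseAddr(clientIP)
	if err != nil {
		return false
	}
	addr = addr.Unmap() // treat ::ffff:127.0.0.1 like 127.0.0.1
	for _, p := range g.prefixes {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

func main() {
	g := mustGate("the-token", "127.0.0.0/8")
	fmt.Println(g.allow("", "127.0.0.1"))                 // false: token missing
	fmt.Println(g.allow("Bearer the-token", "127.0.0.1")) // true: both checks pass
	fmt.Println(g.allow("Bearer the-token", "10.1.2.3"))  // false: source outside the CIDR
}
```

Returning a plain bool keeps the sketch small; a real handler would also have to decide which status code to send for each failure.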
@@ -17,6 +17,7 @@ import (
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/crypto"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/notification"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/config"
+	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/oidc"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ui"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ws"
@@ -56,6 +57,12 @@ type Deps struct {
 	// OIDC (optional). Non-nil when the operator has configured an
 	// IdP — handlers under /auth/oidc/* are mounted only when set.
 	OIDC *oidc.Client
+	// Metrics (optional). When non-nil the WS job-finished branch
+	// records job durations and the /metrics handler can pull a
+	// histogram snapshot. Independent of MetricsAuthEnabled — the
+	// recorder runs even if the scrape endpoint is gated off, so a
+	// later config flip doesn't lose the running window.
+	Metrics *metrics.Registry
 }

 // Server is the running HTTP server.
@@ -131,12 +138,16 @@ func (s *Server) routes(r chi.Router) {
 	r.Get("/agent/binary", s.handleAgentBinary)
 	r.Get("/install/*", s.handleInstallAsset)
 	r.Get("/api/version", s.handleVersion)
+	if s.deps.Cfg.MetricsAuthEnabled() {
+		r.Get("/metrics", s.handleMetrics)
+	}
 	if s.deps.Hub != nil {
 		hd := ws.HandlerDeps{
 			Hub:            s.deps.Hub,
 			Store:          s.deps.Store,
 			JobHub:         s.deps.JobHub,
 			AlertEngine:    s.deps.AlertEngine,
+			Metrics:        s.deps.Metrics,
 			OnHello:        s.onAgentHello,
 			OnScheduleAck:  s.applyScheduleAck,
 			OnScheduleFire: s.dispatchScheduledJob,
@@ -0,0 +1,301 @@
+// Package metrics owns the in-process Prometheus exposition for
+// the control plane. It deliberately avoids prometheus/client_golang
+// — the legacy text format is small and stable, and the repo's house
+// style is to keep dependency surface minimal.
+//
+// Two halves:
+//
+//   - Registry holds a job-duration histogram. Server hooks call
+//     Registry.ObserveJob from the WS job-finished branch.
+//
+//   - Render emits a complete /metrics body from a Snapshot. The
+//     Snapshot is a plain value bag; the HTTP handler assembles it
+//     from store reads + Registry.SnapshotWith at scrape time. This
+//     keeps the package free of any database or HTTP dependency.
+package metrics
+
+import (
+	"fmt"
+	"io"
+	"sort"
+	"strings"
+	"sync"
+	"time"
+)
+
+// JobDurationBuckets is the upper-bound ladder for the job duration
+// histogram, in seconds. Covers admin commands (unlock/init/check
+// finishing in seconds) up through hours-long backups; +Inf is
+// implicit.
+var JobDurationBuckets = []float64{1, 5, 30, 60, 300, 1800, 3600, 21600, 86400}
+
+// Registry is the in-memory store for the job-duration histogram.
+// Concurrent observers plus a single periodic snapshotter are the
+// expected access pattern; all access is guarded by a mutex.
+type Registry struct {
+	mu    sync.Mutex
+	jobs  map[jobKey]*histogramState
+	clock func() time.Time
+}
+
+type jobKey struct{ kind, status string }
+
+type histogramState struct {
+	// counts[i] = number of observations <= JobDurationBuckets[i].
+	// counts[len(JobDurationBuckets)] is the implicit +Inf bucket
+	// (== total count, kept here for symmetry with the rendered
+	// _bucket{le="+Inf"} line and as a sanity check).
+	counts []uint64
+	sum    float64
+	count  uint64
+}
+
+// NewRegistry builds an empty registry.
+func NewRegistry() *Registry {
+	return &Registry{
+		jobs:  make(map[jobKey]*histogramState),
+		clock: time.Now,
+	}
+}
+
+// ObserveJob records one job-duration sample. Negative durations
+// (clock-skew artefacts) are clamped to zero. Empty kind/status
+// strings are tolerated but degrade the dashboard — callers should
+// pass meaningful values.
+func (r *Registry) ObserveJob(kind, status string, dur time.Duration) {
+	if r == nil {
+		return
+	}
+	if dur < 0 {
+		dur = 0
+	}
+	secs := dur.Seconds()
+
+	r.mu.Lock()
+	defer r.mu.Unlock()
+	k := jobKey{kind: kind, status: status}
+	hs, ok := r.jobs[k]
+	if !ok {
+		hs = &histogramState{counts: make([]uint64, len(JobDurationBuckets)+1)}
+		r.jobs[k] = hs
+	}
+	for i, ub := range JobDurationBuckets {
+		if secs <= ub {
+			hs.counts[i]++
+		}
+	}
+	hs.counts[len(JobDurationBuckets)]++ // +Inf
+	hs.sum += secs
+	hs.count++
+}
+
+// HistogramRow is one (kind,status) row in a Snapshot. Buckets is
+// the cumulative count per upper bound (matching JobDurationBuckets,
+// last element is the +Inf total).
+type HistogramRow struct {
+	Kind    string
+	Status  string
+	Buckets []uint64
+	Sum     float64
+	Count   uint64
+}
+
+// snapshotJobs returns a deterministic, sorted copy of the
+// histogram state. Sort order: kind asc, status asc.
+func (r *Registry) snapshotJobs() []HistogramRow {
+	if r == nil {
+		return nil
+	}
+	r.mu.Lock()
+	defer r.mu.Unlock()
+	rows := make([]HistogramRow, 0, len(r.jobs))
+	for k, hs := range r.jobs {
+		buckets := make([]uint64, len(hs.counts))
+		copy(buckets, hs.counts)
+		rows = append(rows, HistogramRow{
+			Kind:    k.kind,
+			Status:  k.status,
+			Buckets: buckets,
+			Sum:     hs.sum,
+			Count:   hs.count,
+		})
+	}
+	sort.Slice(rows, func(i, j int) bool {
+		if rows[i].Kind != rows[j].Kind {
+			return rows[i].Kind < rows[j].Kind
+		}
+		return rows[i].Status < rows[j].Status
+	})
+	return rows
+}
+
+// HostRow is one host's projection for the per-host gauges.
+// Pointers carry "no value" semantics so we can omit a metric line
+// when, e.g., a host has never run a backup.
+type HostRow struct {
+	ID                  string
+	Name                string
+	Online              bool
+	LastBackupUnix      *int64 // nil = no backup yet
+	LastBackupSucceeded *bool  // nil = no backup yet
+	RepoSizeBytes       *int64 // nil = no stats yet
+	SnapshotCount       int
+	OpenAlertCount      int
+	RepoStatus          string // "unknown" | "ready" | "init_failed"
+}
+
+// Snapshot is a frozen view of the data needed to render /metrics.
+// Constructed by the HTTP handler from Store reads via Registry.SnapshotWith.
+type Snapshot struct {
+	Hosts            []HostRow
+	HostsTotal       int
+	HostsOnline      int
+	AlertsBySeverity map[string]int // severity → count
+	BuildVersion     string
+	BuildCommit      string
+	GoVersion        string
+	JobDurationRows  []HistogramRow
+}
+
+// SnapshotWith builds a Snapshot from raw inputs and the registry's
+// current job-duration state. Convenience for the HTTP handler.
+func (r *Registry) SnapshotWith(hosts []HostRow, alerts map[string]int, buildVer, commit, goVer string) Snapshot {
+	online := 0
+	for _, h := range hosts {
+		if h.Online {
+			online++
+		}
+	}
+	return Snapshot{
+		Hosts:            hosts,
+		HostsTotal:       len(hosts),
+		HostsOnline:      online,
+		AlertsBySeverity: alerts,
+		BuildVersion:     buildVer,
+		BuildCommit:      commit,
+		GoVersion:        goVer,
+		JobDurationRows:  r.snapshotJobs(),
+	}
+}
+
+// Render emits a complete Prometheus text-exposition body for s.
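+// Output is deterministic: metric names appear in a fixed order,
+// per-host series are sorted by host id, and histogram rows keep
+// the order of s.JobDurationRows (kind, then status, when the
+// snapshot was built by SnapshotWith).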
+func Render(w io.Writer, s Snapshot) error {
+	var b strings.Builder
+
+	// --- Server gauges ---------------------------------------------------
+	b.WriteString("# HELP rm_hosts_total Total number of enrolled hosts (excludes pending announces).\n")
+	b.WriteString("# TYPE rm_hosts_total gauge\n")
+	fmt.Fprintf(&b, "rm_hosts_total %d\n", s.HostsTotal)
+
+	b.WriteString("# HELP rm_hosts_online Number of hosts currently online (status='online').\n")
+	b.WriteString("# TYPE rm_hosts_online gauge\n")
+	fmt.Fprintf(&b, "rm_hosts_online %d\n", s.HostsOnline)
+
+	b.WriteString("# HELP rm_active_alerts Open alerts grouped by severity.\n")
+	b.WriteString("# TYPE rm_active_alerts gauge\n")
+	severities := []string{"info", "warning", "critical"}
+	for _, sev := range severities {
+		fmt.Fprintf(&b, "rm_active_alerts{severity=%q} %d\n", sev, s.AlertsBySeverity[sev])
+	}
+
+	b.WriteString("# HELP rm_build_info Build identifying labels; value is always 1.\n")
+	b.WriteString("# TYPE rm_build_info gauge\n")
+	fmt.Fprintf(&b, "rm_build_info{version=%q,commit=%q,go_version=%q} 1\n",
+		s.BuildVersion, s.BuildCommit, s.GoVersion)
+
+	// --- Per-host gauges -------------------------------------------------
+	// Stable order: by host id.
+	hosts := append([]HostRow(nil), s.Hosts...)
+	sort.Slice(hosts, func(i, j int) bool { return hosts[i].ID < hosts[j].ID })
+
+	b.WriteString("# HELP rm_host_agent_online 1 if the agent is currently online, 0 otherwise.\n")
+	b.WriteString("# TYPE rm_host_agent_online gauge\n")
+	for _, h := range hosts {
+		v := 0
+		if h.Online {
+			v = 1
+		}
+		fmt.Fprintf(&b, "rm_host_agent_online{host_id=%q,host=%q} %d\n",
+			h.ID, h.Name, v)
+	}
+
+	b.WriteString("# HELP rm_host_last_backup_timestamp_seconds Unix timestamp of the host's most recent backup. Omitted for hosts with no backup yet.\n")
+	b.WriteString("# TYPE rm_host_last_backup_timestamp_seconds gauge\n")
+	for _, h := range hosts {
+		if h.LastBackupUnix == nil {
+			continue
+		}
+		fmt.Fprintf(&b, "rm_host_last_backup_timestamp_seconds{host_id=%q,host=%q} %d\n",
+			h.ID, h.Name, *h.LastBackupUnix)
+	}
+
+	b.WriteString("# HELP rm_host_last_backup_success 1 if the host's most recent backup succeeded, 0 otherwise. Omitted for hosts with no backup yet.\n")
+	b.WriteString("# TYPE rm_host_last_backup_success gauge\n")
+	for _, h := range hosts {
+		if h.LastBackupSucceeded == nil {
+			continue
+		}
+		v := 0
+		if *h.LastBackupSucceeded {
+			v = 1
+		}
+		fmt.Fprintf(&b, "rm_host_last_backup_success{host_id=%q,host=%q} %d\n",
+			h.ID, h.Name, v)
+	}
+
+	b.WriteString("# HELP rm_host_repo_size_bytes Latest reported repo size from `restic stats --mode raw-data`. Omitted for hosts with no stats yet.\n")
+	b.WriteString("# TYPE rm_host_repo_size_bytes gauge\n")
+	for _, h := range hosts {
+		if h.RepoSizeBytes == nil {
+			continue
+		}
+		fmt.Fprintf(&b, "rm_host_repo_size_bytes{host_id=%q,host=%q} %d\n",
+			h.ID, h.Name, *h.RepoSizeBytes)
+	}
+
+	b.WriteString("# HELP rm_host_snapshot_count Number of restic snapshots known on the host's repo.\n")
+	b.WriteString("# TYPE rm_host_snapshot_count gauge\n")
+	for _, h := range hosts {
+		fmt.Fprintf(&b, "rm_host_snapshot_count{host_id=%q,host=%q} %d\n",
+			h.ID, h.Name, h.SnapshotCount)
+	}
+
+	b.WriteString("# HELP rm_host_open_alerts Number of currently open alerts attached to this host.\n")
+	b.WriteString("# TYPE rm_host_open_alerts gauge\n")
+	for _, h := range hosts {
+		fmt.Fprintf(&b, "rm_host_open_alerts{host_id=%q,host=%q} %d\n",
+			h.ID, h.Name, h.OpenAlertCount)
+	}
+
+	b.WriteString("# HELP rm_host_repo_status Repo readiness state for the host. Exactly one row per host with status label set.\n")
+	b.WriteString("# TYPE rm_host_repo_status gauge\n")
+	for _, h := range hosts {
+		st := h.RepoStatus
+		if st == "" {
+			st = "unknown"
+		}
+		fmt.Fprintf(&b, "rm_host_repo_status{host_id=%q,host=%q,status=%q} 1\n",
+			h.ID, h.Name, st)
+	}
+
+	// --- Histogram -------------------------------------------------------
+	b.WriteString("# HELP rm_job_duration_seconds End-to-end duration of completed jobs, by kind and terminal status.\n")
+	b.WriteString("# TYPE rm_job_duration_seconds histogram\n")
+	for _, row := range s.JobDurationRows {
+		for i, ub := range JobDurationBuckets {
+			fmt.Fprintf(&b, "rm_job_duration_seconds_bucket{kind=%q,status=%q,le=\"%g\"} %d\n",
+				row.Kind, row.Status, ub, row.Buckets[i])
+		}
+		fmt.Fprintf(&b, "rm_job_duration_seconds_bucket{kind=%q,status=%q,le=\"+Inf\"} %d\n",
+			row.Kind, row.Status, row.Buckets[len(JobDurationBuckets)])
+		fmt.Fprintf(&b, "rm_job_duration_seconds_sum{kind=%q,status=%q} %g\n",
+			row.Kind, row.Status, row.Sum)
+		fmt.Fprintf(&b, "rm_job_duration_seconds_count{kind=%q,status=%q} %d\n",
+			row.Kind, row.Status, row.Count)
+	}
+
+	_, err := io.WriteString(w, b.String())
+	return err
+}
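Review note: `Render` formats label values with `%q`, which applies Go string-literal quoting. The Prometheus text exposition format defines only three escapes inside a label value (backslash, double quote, line feed), so a value containing, say, a tab comes out of `%q` as `\t`, which is not one of them. For ordinary host names the two forms are identical. The block below is a sketch of a stricter alternative; `quoteLabelValue` and `labelEscaper` are hypothetical names, not part of this diff.

```go
package main

import (
	"fmt"
	"strings"
)

// labelEscaper applies the three escapes the Prometheus text exposition
// format defines for label values: backslash, double quote and line feed.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// quoteLabelValue is a hypothetical helper that returns v wrapped in
// double quotes with only those three escapes applied.
func quoteLabelValue(v string) string {
	return `"` + labelEscaper.Replace(v) + `"`
}

func main() {
	name := "db\t01" // a host name that happens to contain a tab
	fmt.Printf("Go-quoted:   host=%q\n", name)                  // tab becomes the two characters \t
	fmt.Printf("text format: host=%s\n", quoteLabelValue(name)) // tab is passed through unchanged
}
```

With such a helper the format strings would use `%s` and pass `quoteLabelValue(h.Name)` instead of relying on `%q`.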
@@ -0,0 +1,182 @@
+package metrics
+
+import (
+	"bytes"
+	"strings"
+	"sync"
+	"testing"
+	"time"
+)
+
+func TestObserveJobBuckets(t *testing.T) {
+	r := NewRegistry()
+	// Bucket boundaries: 1, 5, 30, 60, 300, 1800, 3600, 21600, 86400
+	r.ObserveJob("backup", "succeeded", 500*time.Millisecond) // <= 1
+	r.ObserveJob("backup", "succeeded", 30*time.Second)       // == 30 (boundary)
+	r.ObserveJob("backup", "succeeded", 90*time.Second)       // > 60, <= 300
+	r.ObserveJob("backup", "succeeded", 2*time.Hour)          // > 3600 → 21600 bucket
+	rows := r.snapshotJobs()
+	if len(rows) != 1 {
+		t.Fatalf("rows: %d", len(rows))
+	}
+	row := rows[0]
+	if row.Count != 4 {
+		t.Errorf("count: %d", row.Count)
+	}
+	wantSum := 0.5 + 30 + 90 + 7200.0
+	if row.Sum != wantSum {
+		t.Errorf("sum: got %v want %v", row.Sum, wantSum)
+	}
+	// Cumulative buckets:
+	//  le=1     → 1 (the 0.5s)
+	//  le=5     → 1
+	//  le=30    → 2 (boundary inclusive: 30s included)
+	//  le=60    → 2
+	//  le=300   → 3
+	//  le=1800  → 3
+	//  le=3600  → 3
+	//  le=21600 → 4
+	//  le=86400 → 4
+	//  le=+Inf  → 4
+	want := []uint64{1, 1, 2, 2, 3, 3, 3, 4, 4, 4}
+	for i, w := range want {
+		if row.Buckets[i] != w {
+			t.Errorf("bucket[%d]=%d want %d", i, row.Buckets[i], w)
+		}
+	}
+}
+
+func TestObserveJobNegativeClampedToZero(t *testing.T) {
+	r := NewRegistry()
+	r.ObserveJob("backup", "succeeded", -5*time.Second)
+	rows := r.snapshotJobs()
+	if len(rows) != 1 || rows[0].Sum != 0 || rows[0].Count != 1 {
+		t.Errorf("expected one zero-second observation, got %+v", rows)
+	}
+}
+
+func TestObserveJobConcurrent(t *testing.T) {
+	r := NewRegistry()
+	const goroutines = 16
+	const each = 200
+	var wg sync.WaitGroup
+	for g := 0; g < goroutines; g++ {
+		wg.Add(1)
+		go func() {
+			defer wg.Done()
+			for i := 0; i < each; i++ {
+				r.ObserveJob("backup", "succeeded", time.Second)
+			}
+		}()
+	}
+	wg.Wait()
+	rows := r.snapshotJobs()
+	if len(rows) != 1 {
+		t.Fatalf("rows: %d", len(rows))
+	}
+	if rows[0].Count != uint64(goroutines*each) {
+		t.Errorf("count: got %d want %d", rows[0].Count, goroutines*each)
+	}
+}
+
+func TestObserveJobNilRegistryNoop(t *testing.T) {
+	var r *Registry // nil
+	r.ObserveJob("backup", "succeeded", time.Second)
+}
+
+func TestRenderGolden(t *testing.T) {
+	r := NewRegistry()
+	r.ObserveJob("backup", "succeeded", 5*time.Second)
+	r.ObserveJob("forget", "succeeded", 100*time.Millisecond)
+
+	pi64 := func(v int64) *int64 { return &v }
+	pbool := func(v bool) *bool { return &v }
+
+	hosts := []HostRow{
+		{
+			ID: "01H0001", Name: "alpha",
+			Online:              true,
+			LastBackupUnix:      pi64(1700000000),
+			LastBackupSucceeded: pbool(true),
+			RepoSizeBytes:       pi64(123456789),
+			SnapshotCount:       42,
+			OpenAlertCount:      0,
+			RepoStatus:          "ready",
+		},
+		{
+			ID: "01H0002", Name: "bravo",
+			Online:         false,
+			SnapshotCount:  0,
+			OpenAlertCount: 1,
+			RepoStatus:     "init_failed",
+		},
+	}
+	snap := r.SnapshotWith(hosts,
+		map[string]int{"info": 0, "warning": 1, "critical": 0},
+		"v1.2.3", "deadbeef", "go1.25.0")
+
+	var buf bytes.Buffer
+	if err := Render(&buf, snap); err != nil {
+		t.Fatalf("render: %v", err)
+	}
+	out := buf.String()
+
+	for _, want := range []string{
+		"# HELP rm_hosts_total ",
+		"rm_hosts_total 2\n",
+		"rm_hosts_online 1\n",
+		`rm_active_alerts{severity="warning"} 1`,
+		`rm_active_alerts{severity="info"} 0`,
+		`rm_active_alerts{severity="critical"} 0`,
+		`rm_build_info{version="v1.2.3",commit="deadbeef",go_version="go1.25.0"} 1`,
+		`rm_host_agent_online{host_id="01H0001",host="alpha"} 1`,
+		`rm_host_agent_online{host_id="01H0002",host="bravo"} 0`,
+		`rm_host_last_backup_timestamp_seconds{host_id="01H0001",host="alpha"} 1700000000`,
+		`rm_host_last_backup_success{host_id="01H0001",host="alpha"} 1`,
+		`rm_host_repo_size_bytes{host_id="01H0001",host="alpha"} 123456789`,
+		`rm_host_snapshot_count{host_id="01H0001",host="alpha"} 42`,
+		`rm_host_snapshot_count{host_id="01H0002",host="bravo"} 0`,
+		`rm_host_open_alerts{host_id="01H0002",host="bravo"} 1`,
+		`rm_host_repo_status{host_id="01H0001",host="alpha",status="ready"} 1`,
+		`rm_host_repo_status{host_id="01H0002",host="bravo",status="init_failed"} 1`,
+		`rm_job_duration_seconds_bucket{kind="backup",status="succeeded",le="1"} 0`,
+		`rm_job_duration_seconds_bucket{kind="backup",status="succeeded",le="5"} 1`,
+		`rm_job_duration_seconds_bucket{kind="backup",status="succeeded",le="+Inf"} 1`,
+		`rm_job_duration_seconds_sum{kind="backup",status="succeeded"} 5`,
+		`rm_job_duration_seconds_count{kind="backup",status="succeeded"} 1`,
+		`rm_job_duration_seconds_bucket{kind="forget",status="succeeded",le="1"} 1`,
+	} {
+		if !strings.Contains(out, want) {
+			t.Errorf("missing line:\n  %s\n--- full output ---\n%s", want, out)
+		}
+	}
+
+	// bravo had no last backup → those metric lines must be absent for it.
+	for _, ban := range []string{
+		`rm_host_last_backup_timestamp_seconds{host_id="01H0002"`,
+		`rm_host_last_backup_success{host_id="01H0002"`,
+		`rm_host_repo_size_bytes{host_id="01H0002"`,
+	} {
+		if strings.Contains(out, ban) {
+			t.Errorf("unexpected line for bravo: %q", ban)
+		}
+	}
+}
+
+func TestRenderEmptySnapshot(t *testing.T) {
+	r := NewRegistry()
+	snap := r.SnapshotWith(nil, nil, "dev", "", "go1.25.0")
+	var buf bytes.Buffer
+	if err := Render(&buf, snap); err != nil {
+		t.Fatalf("render: %v", err)
+	}
+	out := buf.String()
+	if !strings.Contains(out, "rm_hosts_total 0\n") {
+		t.Errorf("missing zero-host gauge:\n%s", out)
+	}
+	// Histogram block has its HELP/TYPE but no rows. The HELP/TYPE
+	// presence is correct and helps Prometheus pre-register the metric.
+	if !strings.Contains(out, "# TYPE rm_job_duration_seconds histogram") {
+		t.Errorf("histogram TYPE line missing:\n%s", out)
+	}
+}
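Review note: the cumulative layout asserted in `TestObserveJobBuckets` is the shape a quantile query consumes, for example the dashboard's job-duration p95 panel. The block below sketches the usual linear-interpolation estimate over cumulative buckets. It is a simplified illustration in the spirit of PromQL's `histogram_quantile`, not a copy of it, and `estimateQuantile` and `bounds` are hypothetical names.

```go
package main

import "fmt"

// bounds mirrors JobDurationBuckets from the diff (seconds); +Inf is implicit.
var bounds = []float64{1, 5, 30, 60, 300, 1800, 3600, 21600, 86400}

// estimateQuantile is a hypothetical helper. cum holds cumulative counts,
// one per entry in bounds plus a trailing +Inf total, laid out like
// HistogramRow.Buckets. ok is false for malformed input or no observations.
func estimateQuantile(q float64, cum []uint64) (value float64, ok bool) {
	if len(cum) != len(bounds)+1 || q < 0 || q > 1 {
		return 0, false
	}
	total := cum[len(cum)-1]
	if total == 0 {
		return 0, false
	}
	rank := q * float64(total)
	for i, ub := range bounds {
		if float64(cum[i]) < rank {
			continue // the target rank lies in a later bucket
		}
		lower, below := 0.0, uint64(0)
		if i > 0 {
			lower, below = bounds[i-1], cum[i-1]
		}
		inBucket := float64(cum[i] - below)
		if inBucket == 0 {
			return lower, true // only reachable for q == 0 with an empty first bucket
		}
		// Assume observations are spread evenly inside the bucket.
		return lower + (ub-lower)*(rank-float64(below))/inBucket, true
	}
	// The rank falls in the +Inf bucket: report the largest finite bound.
	return bounds[len(bounds)-1], true
}

func main() {
	// Cumulative counts from TestObserveJobBuckets: 0.5s, 30s, 90s and 2h.
	cum := []uint64{1, 1, 2, 2, 3, 3, 3, 4, 4, 4}
	for _, q := range []float64{0.5, 0.95} {
		v, ok := estimateQuantile(q, cum)
		fmt.Printf("q=%.2f estimate=%.0fs ok=%v\n", q, v, ok)
	}
}
```

The estimate is only as fine as the bucket ladder: with four samples the p95 lands somewhere inside the 3600 to 21600 second bucket regardless of where the 2h job actually sits.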
@@ -15,6 +15,7 @@ import (
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/alert"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/auth"
+	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
 )
@@ -27,6 +28,9 @@ type HandlerDeps struct {
 	// AlertEngine receives job-finished and host-online events so the
 	// alert engine can evaluate its rules. Optional; nil = no-op.
 	AlertEngine *alert.Engine
+	// Metrics records job-duration observations on every terminal
+	// status. Optional; nil = no-op (test fixtures pass nil).
+	Metrics *metrics.Registry
 	// UpdateWatcher reconciles in-flight agent-update dispatches against
 	// hello envelopes. Optional; nil = no-op.
 	UpdateWatcher *UpdateWatcher
@@ -239,6 +243,13 @@ func dispatchAgentMessage(ctx context.Context, c *Conn, hostID string, env api.E
 					slog.Warn("ws: set host last backup", "host_id", hostID, "err", err)
 				}
 			}
+			// Job-duration histogram (P6-04). Skip when StartedAt is
+			// missing (race: agent shipped finished without a started,
+			// or the row predates this code).
+			if deps.Metrics != nil && job.StartedAt != nil {
+				deps.Metrics.ObserveJob(job.Kind, string(p.Status),
+					p.FinishedAt.Sub(*job.StartedAt))
+			}
 		}
 		if deps.JobHub != nil {
 			deps.JobHub.Broadcast(p.JobID, env)
@@ -326,12 +326,54 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.

 ## Phase 5 — OSS readiness

- [ ] **P5-01** (M) Documentation site (mdBook or similar) with install, concepts, security model, screenshots
- [ ] **P5-02** (S) `CONTRIBUTING.md`, `CODE_OF_CONDUCT.md`, issue + PR templates
+- [x] **P5-01** (M) Documentation site (mdBook or similar) with install, concepts, security model, screenshots
+- [x] **P5-02** (S) `CONTRIBUTING.md`, `CODE_OF_CONDUCT.md`, issue + PR templates
 - [x] **P5-03** (S) Release automation — **pivoted away from goreleaser/binary archives** on 2026-05-05 (spec: `docs/superpowers/specs/2026-05-05-p5-03-docker-only-release.md`). Single deliverable per tag: a multi-arch (linux amd64+arm64) server image, with cross-compiled agent binaries (linux amd64+arm64, windows amd64) + `install.sh` + `install.ps1` + the systemd unit baked under `/opt/restic-manager/dist/`. The `/agent/binary` and `/install/*` handlers fall back from `<DataDir>/...` to `<BundledAssetsDir>/...` so a fresh container Just Works. Workflow `.gitea/workflows/release.yml` triggers on `v*.*.*` tag-push (real release: fan-out `:vX.Y.Z`, `:X.Y`, `:X`, plus `:latest` once `MAJOR>=1`) and `workflow_dispatch` (snapshot: `:snapshot-<shortsha>` only). Pushed to the Gitea container registry on this instance — no external creds, no GHCR mirror. Cosign / SBOM / minisign / GHCR mirror deferred to Phase 6. Source builds via `make build` remain a first-class path.
- [ ] **P5-04** (S) Demo screenshots / short Loom walkthrough in README
- [ ] **P5-05** (S) `SECURITY.md` with disclosure process
- [ ] **P5-06** (M) End-to-end test suite in CI (Playwright vs. compose stack with sibling Linux agent)
+- [x] **P5-04** (S) Demo screenshots / short Loom walkthrough in README
+- [x] **P5-05** (S) `SECURITY.md` with disclosure process
+- [x] **P5-06** (M) End-to-end test suite in CI (Playwright vs. compose stack with sibling Linux agent)
+
+> **As shipped (2026-05-07, branch `p5-oss-readiness`):**
+>
+> **P5-01 — docs site.** mdBook under `docs/book/` with structured
+> chapters: getting-started (install, enrolling hosts, reverse
+> proxy), concepts (architecture, credentials, schedules + source
+> groups, repo maintenance), operations (backups + restores, alerts,
+> observability, updates), security (threat model, hardening,
+> disclosure), reference (env vars, HTTP endpoints), plus
+> contributing / roadmap / license pages. mdBook binary downloaded
+> via Makefile (`make docs` / `make docs-watch`) — same "static
+> binary, no toolchain" pattern as Tailwind. Generated `book/`
+> dir gitignored.
+>
+> **P5-02 — CONTRIBUTING + CoC + templates.** `CONTRIBUTING.md`
+> rewritten from placeholder to full guide (setup, conventions,
+> workflow, RBAC of the project itself). `CODE_OF_CONDUCT.md`
+> shaped on the Contributor Covenant but adapted for a
+> single-maintainer project. `.gitea/issue_template/{bug_report,feature_request}.md`
+> + `.gitea/PULL_REQUEST_TEMPLATE.md`.
+>
+> **P5-04 — README screenshots.** Six full-page captures from a
+> fresh server bootstrap under `docs/screenshots/` (login, empty
+> dashboard, add host, alerts, settings, audit log). README
+> rewritten to centre the screenshot grid + link out to docs site.
+> Captured live from a working build via Playwright; replaceable
+> as the UI evolves without breaking layout.
+>
+> **P5-05 — SECURITY.md.** Disclosure policy (3-day ack, 30-day
+> default disclosure window), supported-versions matrix, scope
+> in/out, threat-model summary, hardening checklist for
+> operators. Mirrored as a chapter in the docs site.
+>
+> **P5-06 — e2e harness.** `e2e/compose.e2e.yml` stands up
+> server + sibling Linux agent (alpine + restic) + restic/rest-server
+> backend, with announce-and-approve as the enrolment path so
+> Playwright drives the operator flow end-to-end. Tests under
+> `e2e/playwright/tests/`: smoke spec covers bootstrap → login →
+> accept-pending → backup → terminal-status; second spec scrapes
+> `/metrics` to verify the P6-04 endpoint. New
+> `.gitea/workflows/e2e.yml` runs on every PR (separate from the
+> fast lint/test workflow). Local how-to in `docs/e2e.md`.
 - [x] **P5-07** (S) Reference deployment landed alongside P5-03. `deploy/docker-compose.yml` stands up *only* the server (image-pinned via `RM_VERSION`, named volume for operator state, bound to localhost) — TLS termination is left to whichever reverse proxy the operator already runs. `docs/reverse-proxy.md` documents the headers + WebSocket pass-through the proxy must forward, the `RM_TRUSTED_PROXY` CIDR rule, and worked examples for Caddy, nginx, and Traefik.

 ### Phase 5 acceptance
@@ -390,8 +432,45 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
 > swap, helper `buildRepoTrendView` shared between page-load and
 > fragment endpoint). No new dependencies, no client JS, no agent
 > change. CI green; in-browser smoke walk-through pending operator.
- [ ] **P6-04** (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list. _(Was P4-08.)_
- [ ] **P6-05** (S) Document Prometheus integration + sample Grafana dashboard JSON. _(Was P4-09.)_
+- [x] **P6-04** (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list. _(Was P4-08.)_
+- [x] **P6-05** (S) Document Prometheus integration + sample Grafana dashboard JSON. _(Was P4-09.)_
+
+> **As shipped (2026-05-07, branch `p6-04-05-prometheus-metrics`):**
+> Spec `docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md`,
+> plan `docs/superpowers/plans/2026-05-07-p6-04-05-prometheus-metrics.md`.
+> New `internal/server/metrics` package emits the legacy
+> `text/plain; version=0.0.4` exposition format directly — no
+> `prometheus/client_golang` dependency, matching the repo's
+> "no Tailwind, no Node" minimal-deps style. `/metrics` is **opt-in**:
+> `RM_METRICS_TOKEN` and/or `RM_METRICS_TRUSTED_CIDR` must be set or
+> the route isn't mounted at all (404). When both are set, both must
+> pass; when only one is set, that gate alone controls access.
+> Token compare is constant-time.
+> CIDR check honours `X-Forwarded-For` only when the immediate hop
+> is a configured `RM_TRUSTED_PROXY` (mirrors the existing realIP
+> resolution).
+>
+> **Metrics:** per-host gauges (`rm_host_agent_online`,
+> `rm_host_last_backup_timestamp_seconds`, `rm_host_last_backup_success`,
+> `rm_host_repo_size_bytes`, `rm_host_snapshot_count`,
+> `rm_host_open_alerts`, `rm_host_repo_status`); server gauges
+> (`rm_hosts_total`, `rm_hosts_online`, `rm_active_alerts{severity}`,
+> `rm_build_info{version,commit,go_version}`); histogram
+> `rm_job_duration_seconds_bucket{kind,status,le}` with buckets
+> `1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf`.
+> Histogram is in-memory; observations come from the existing
+> `MsgJobFinished` branch in `internal/server/ws/handler.go`.
+>
+> **Docs:** `docs/prometheus.md` covers enable + scrape config +
+> metric reference + dashboard import. **Dashboard:**
+> `deploy/grafana/restic-manager-dashboard.json` — six panels
+> (fleet status, open alerts, backups failing, hosts table, repo
+> size over time, job-duration p95). Schema 39, single Prometheus
+> datasource variable.
+>
+> **Tests:** golden-render + concurrent-observe + bucket-boundary
+> in the metrics package; auth matrix (no auth → 404; token
+> missing/wrong/right; CIDR matching/non-matching; token AND CIDR)
+> in the HTTP layer.

 ### Phase 6 acceptance
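Review note: the P6-04 summary says the CIDR check honours `X-Forwarded-For` only when the immediate hop is a configured `RM_TRUSTED_PROXY`. The repository's realIP code is not in this diff, so the block below is only a sketch of that rule under stated assumptions: a single trusted-proxy CIDR, and the right-most header entry taken as the client (the one the trusted proxy appended). `clientIP` is a hypothetical name.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// clientIP is a hypothetical helper. remoteAddr is the TCP peer in
// "host:port" form (as in http.Request.RemoteAddr), xff is the raw
// X-Forwarded-For value, and trustedProxy is a CIDR or "" when unset.
func clientIP(remoteAddr, xff, trustedProxy string) (string, error) {
	ap, err := netip.ParseAddrPort(remoteAddr)
	if err != nil {
		return "", fmt.Errorf("parse remote addr: %w", err)
	}
	peer := ap.Addr().Unmap()
	if trustedProxy == "" || xff == "" {
		return peer.String(), nil // nothing to trust or nothing forwarded
	}
	proxy, err := netip.ParsePrefix(trustedProxy)
	if err != nil {
		return "", fmt.Errorf("parse trusted proxy: %w", err)
	}
	if !proxy.Contains(peer) {
		return peer.String(), nil // untrusted hop: ignore the header entirely
	}
	// Assumption: the right-most entry is the one the trusted proxy appended.
	parts := strings.Split(xff, ",")
	last := strings.TrimSpace(parts[len(parts)-1])
	fwd, err := netip.ParseAddr(last)
	if err != nil {
		return "", fmt.Errorf("parse X-Forwarded-For entry: %w", err)
	}
	return fwd.Unmap().String(), nil
}

func main() {
	// Peer is inside the trusted range, so the forwarded address is used.
	fmt.Println(clientIP("172.18.0.5:40000", "203.0.113.7", "172.16.0.0/12"))
	// Peer is outside the trusted range, so the header is ignored.
	fmt.Println(clientIP("198.51.100.9:40000", "10.0.0.1", "172.16.0.0/12"))
}
```

Failing closed on an unparsable header avoids silently evaluating the CIDR gate against the proxy's own address.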
Author	SHA1	Message	Date
steve	a3f134bcd6	e2e: pin Playwright to 1.59.1 `@playwright/test` was loose-pinned to ^1.50.0; npm resolved it to 1.59.1 inside the runner image, which only ships browser binaries for 1.50.0. Pin both the package and the docker image to v1.59.1 so deps and binaries stay aligned.	2026-05-08 20:09:17 +01:00
steve	17b9ee08b7	e2e: run health probe + Playwright on the compose network Gitea's act-style runners execute workflow steps inside a runner container, so compose's host port-publish (127.0.0.1:8080:8080) is not reachable from the steps. PR #23's e2e job timed out waiting for the server even though the container was up and listening. Move both the health probe and the Playwright run onto rmnet so they address the server as http://server:8080: * health probe: docker run --rm --network e2e_rmnet curlimages/curl * Playwright: new mcr.microsoft.com/playwright-based image, added as a profile-gated `playwright` service in compose.e2e.yml, invoked via `docker compose run --rm playwright`. Drops the setup-node + npm install runner steps.	2026-05-08 20:08:23 +01:00
steve	89537d417a	P5: OSS readiness — docs site, contributor onboarding, e2e harness P5-01 — Documentation site under docs/book/ rendered with mdBook (downloaded via Makefile, same static-binary pattern as Tailwind). Structured chapters: getting started, concepts, operations, security, reference. `make docs` / `make docs-watch`. Generated output gitignored. P5-02 — CONTRIBUTING.md rewritten from placeholder to a full guide. CODE_OF_CONDUCT.md adapted from Contributor Covenant for a single-maintainer project. .gitea/issue_template/{bug,feature}.md and PULL_REQUEST_TEMPLATE.md. P5-04 — Six README screenshots captured live from a fresh server bootstrap (login, empty dashboard, add-host, alerts, settings, audit log). README rewritten to centre the screenshot grid and link out to the docs site. P5-05 — SECURITY.md with disclosure policy (3-day ack, 30-day default window), scope in/out, threat-model summary, operator hardening checklist. Mirrored as a docs-site chapter. P5-06 — End-to-end test harness. e2e/compose.e2e.yml brings up server + sibling Linux agent (alpine + restic) + restic/rest-server. Agent uses announce-and-approve so Playwright can drive the full operator flow: bootstrap → login → accept pending → backup → verify terminal status. Second spec scrapes /metrics to assert the P6-04 endpoint surface. .gitea/workflows/e2e.yml runs on every PR; local how-to in docs/e2e.md.	2026-05-08 20:08:23 +01:00
steve	a252b25854	Merge pull request 'spec+plan: P6-04/05 prometheus /metrics + Grafana dashboard' (#22) from p6-04-05-prometheus-metrics into main Reviewed-on: #22	2026-05-08 18:31:57 +00:00
steve	73e733be61	P6-04+05: Prometheus /metrics endpoint + Grafana dashboard New internal/server/metrics package emits the legacy text/plain exposition format directly, so we don't pull in prometheus/client_golang. Endpoint is opt-in via RM_METRICS_TOKEN and/or RM_METRICS_TRUSTED_CIDR; route is not mounted at all if neither gate is set. Both gates ANDed when both configured. Per-host gauges (online, last_backup_*, repo_size_bytes, snapshot_count, open_alerts, repo_status), server gauges (hosts_total/online, active_alerts by severity, build_info), and an in-memory job-duration histogram observed from the existing MsgJobFinished branch in the WS handler. Docs in docs/prometheus.md (enable + scrape config + metric reference + dashboard import). Sample dashboard at deploy/grafana/restic-manager-dashboard.json - six panels, Grafana schema 39, single Prometheus datasource variable. Tests: golden render, concurrent observe, bucket boundaries in the metrics package; auth matrix (no auth -> 404, token gate, CIDR gate, both required) in the HTTP layer.	2026-05-07 23:17:15 +01:00
steve	70ff554402	spec+plan: P6-04/05 prometheus /metrics + Grafana dashboard	2026-05-07 23:07:30 +01:00