ui(alerts): make Acknowledge vs Resolve distinction visible

Both buttons make the row leave the Open tab, so on a quiet system they look identical. The behavioural difference only manifests next time the underlying condition fires:

- Acknowledge silences fan-out while the problem persists; the alert parks on the Acknowledged tab and recurrences just touch last_seen_at without re-notifying.
- Resolve closes the alert. If the same condition fires again later, a fresh alert with a new id raises and the channels fan out as if it were the first time.

Add a one-line legend under the page header explaining both, and title= tooltips on each button covering the same ground for keyboard and assistive tech.
Merge pull request 'Phase 3 — Alerts: per-source-group dedup' (#8) from p3-alerts-dedup into main
2026-05-04 23:11:46 +01:00 · 2026-05-04 22:11:08 +00:00 · 2026-05-04 22:59:48 +01:00 · 2026-05-04 21:51:16 +00:00 · 2026-05-04 22:49:46 +01:00 · 2026-05-04 22:40:46 +01:00
228 changed files with 6771 additions and 21633 deletions
@@ -1,32 +0,0 @@
-<!--
-Thanks for the PR! A few quick checks before submitting:
-
-* Did you open an issue first for non-trivial changes?
-* `make lint test` is green locally?
-* Commits are focused (one logical change per commit)?
-* No `Co-Authored-By` trailers (repo policy)?
-* No new dependencies without a one-line justification below?
-->
-
-## Summary
-
-<!-- One paragraph: what changed and why. -->
-
-## Test plan
-
-<!-- Bullet list of what you actually ran. Be specific.
-     - `make test` → green
-     - Manually exercised the new flow at /hosts/{id}/foo
-     - Smoke env: enrolled a fresh host, ran a backup end-to-end
-->
-
-## Notes for the reviewer
-
-<!-- Anything the reviewer needs to know that isn't obvious from the
-     diff: related issue, follow-up work that's intentionally not
-     in this PR, deferred concerns, design alternatives considered
-     and rejected. -->
-
-## Linked issues
-
-<!-- "Closes #123" / "Refs #456" / "Part of P5-06" -->
@@ -1,52 +0,0 @@
---
-name: Bug report
-about: Something isn't behaving the way the docs / code suggest it should
-title: "[bug] "
-labels: bug
---
-
-## What happened
-
-<!-- A clear description of the actual behaviour. Include the exact
-     UI surface, API endpoint, or CLI invocation involved. -->
-
-## What you expected
-
-<!-- What you thought would happen, and where that expectation came from
-     (docs page, command output, prior behaviour). -->
-
-## Steps to reproduce
-
-1.
-2.
-3.
-
-## Environment
-
- restic-manager server version: <!-- `restic-manager-server --version` or footer of the UI -->
- Agent version (if relevant): <!-- `restic-manager-agent --version` -->
- restic version on affected host: <!-- `restic version` -->
- Host OS: <!-- e.g. "Ubuntu 22.04 amd64" or "Windows Server 2022" -->
- How was the server installed: <!-- docker compose / source build / other -->
-
-## Logs / output
-
-<details><summary>Server log (sanitised)</summary>
-
-```
-<!-- paste relevant lines; redact tokens, passwords, repo URLs -->
-```
-
-</details>
-
-<details><summary>Agent log (sanitised)</summary>
-
-```
-```
-
-</details>
-
-## Anything else
-
-<!-- Screenshots, related issues, recent changes you made before the
-     bug appeared, anything that might help. -->
@@ -1,34 +0,0 @@
---
-name: Feature request
-about: Suggest a new capability or change to existing behaviour
-title: "[feature] "
-labels: enhancement
---
-
-## What you're trying to do
-
-<!-- Describe the use case, not the proposed solution. Who is the
-     operator, what are they trying to accomplish, and what's
-     blocking them today? -->
-
-## Why the current behaviour falls short
-
-<!-- What does the system do today, and where does it stop short of
-     the use case above? -->
-
-## Proposed direction (optional)
-
-<!-- If you have a specific design in mind, describe it. Skip this
-     section if you'd rather leave it to the maintainer. -->
-
-## Scope check
-
- [ ] I've read [`spec.md`](../spec.md) §2 (Goals & Non-Goals).
- [ ] This isn't already on the roadmap in [`tasks.md`](../tasks.md).
- [ ] This fits the project's "small fleet, one person operating"
-      target rather than enterprise / multi-tenant / SaaS use cases.
-
-## Anything else
-
-<!-- Related restic features, prior art in similar tools, links to
-     discussions you've had elsewhere. -->
@@ -2,34 +2,28 @@
 #
 # Notes for anyone editing this file:
 #
-# Custom runner image
-#   Every job runs inside `gitea.dcglab.co.uk/steve/ci-runner-go`
-#   (recipe: https://gitea.dcglab.co.uk/steve/ci/src/branch/main/images/ci-runner-go).
-#   That image already ships:
-#     * Go on PATH at /usr/local/go/bin (so `actions/setup-go` is
-#       redundant and intentionally NOT used here — the action would
-#       otherwise re-download Go on every job)
-#     * Node.js + npm (used by docs / e2e workflows)
-#     * Docker CLI, Buildx, Compose v2 (used by docker-build steps)
-#   When bumping the Go floor, push a new ci-runner-go image with
-#   the matching Go version and bump the date pin in IMAGE below.
-#
 # Self-hosted runner expectations
-#   Each runner host bind-mounts persistent volumes for
-#   /root/go/pkg/mod (GOMODCACHE), /root/.cache/go-build (GOCACHE),
-#   and /root/.cache/act (action clones) into every job container —
-#   regardless of which image the container is built from. As a
+#   The Gitea runners are provisioned out-of-band (the infra team owns
+#   the script). Each runner host bind-mounts persistent volumes for
+#   /root/go/pkg/mod (GOMODCACHE), /root/.cache/go-build (GOCACHE), and
+#   /root/.cache/act (action clones) into every job container. As a
 #   result:
-#     * Common GitHub actions (actions/checkout, actions/upload-artifact,
-#       golangci/golangci-lint-action) are pre-cloned into
-#       /root/.cache/act on the runner, so the per-job
-#       "git clone https://github.com/actions/..." step is a fetch,
-#       not a full clone.
+#     * `cache: true` on actions/setup-go is intentionally OMITTED — the
+#       action would otherwise tar/untar GOMODCACHE+GOCACHE through the
+#       Gitea cache backend on every job, undoing the host-volume cache
+#       and adding ~10s of redundant zstd round-trip per job.
+#     * Common GitHub actions (actions/checkout, actions/setup-go,
+#       actions/upload-artifact, golangci/golangci-lint-action) are
+#       pre-cloned into /root/.cache/act on the runner, so the per-job
+#       "git clone https://github.com/actions/..." step is a fetch, not
+#       a full clone.
 #     * golangci-lint is pre-installed at /usr/local/bin/golangci-lint
-#       on the runner host BUT that's outside the job's filesystem
-#       view; the golangci-lint-action below pins a specific version
-#       and re-downloads — that's fine (deterministic CI > marginal
-#       speed).
+#       on the runner (latest v2.x). The golangci-lint-action below
+#       still pins a specific version and re-downloads — that's fine
+#       (deterministic CI > marginal speed) but means the host-installed
+#       binary is currently unused. Drop the `version:` arg below to
+#       use the host-installed one if you want to trade determinism
+#       for speed.
 #
 # Build matrix
 #   Linux amd64 + arm64 + Windows amd64. CGO_ENABLED=0 throughout —
@@ -38,10 +32,10 @@
 #   binaries.
 #
 # Go version
-#   Anchored by the ci-runner-go image (currently Go 1.25.7). Floor
-#   is set by the heaviest dep (modernc.org/sqlite v1.50+ requires
-#   Go 1.23+; we run 1.25 so golangci-lint's Go-version compatibility
-#   check is happy — see the version pin in the lint job).
+#   The GO_VERSION env var anchors all three jobs. Floor is set by the
+#   heaviest dep (modernc.org/sqlite v1.50+ requires Go 1.23+ today;
+#   we run 1.25 so golangci-lint's Go-version compatibility check is
+#   happy — see the version pin in the lint job).
 #
 # upload-artifact
 #   Pinned at v3 historically; v3 was deprecated upstream. v4 should
@@ -54,68 +48,35 @@ on:
  pull_request:
    branches: [main]

-# Force bash as the default shell. With `container:` set on every
-# job, Gitea Actions otherwise picks `sh -e` and our `set -euo
-# pipefail` fails on dash with "Illegal option -o pipefail".
-defaults:
-  run:
-    shell: bash
+env:
+  GO_VERSION: "1.25"

 jobs:
  test:
-    # Sharded by package group. server/http and store are the two
-    # heavy packages (~156s and ~75s in CI respectively under
-    # `-race`); pulling them onto their own runners lets each shard
-    # have all CPUs to itself instead of CPU-starving each other on
-    # one runner. The third shard ("rest") covers everything else.
-    name: Test (${{ matrix.name }})
+    name: Test (linux/amd64)
    runs-on: ubuntu-latest
-    container:
-      image: docker.dcglab.co.uk/ci-runner-go:2026-05-15
-      credentials:
-        username: ${{ secrets.ZOT_USERNAME }}
-        password: ${{ secrets.ZOT_PASSWORD }}
-    strategy:
-      fail-fast: false
-      matrix:
-        include:
-          - name: server-http
-            packages: ./internal/server/http/...
-          - name: store
-            packages: ./internal/store/...
-          - name: rest
-            # Computed at runtime — see the "go test" step below.
-            packages: ""
    steps:
      - uses: actions/checkout@v4
+      - uses: actions/setup-go@v5
+        with:
+          go-version: ${{ env.GO_VERSION }}
+          # cache: true intentionally omitted — see header notes.
      - name: go vet
        run: go vet ./...
      - name: go test
-        run: |
-          set -euo pipefail
-          if [ -n "${{ matrix.packages }}" ]; then
-            pkgs="${{ matrix.packages }}"
-          else
-            # "rest" shard: everything except the dedicated shards.
-            pkgs=$(go list ./... \
-              | grep -v '/internal/server/http$' \
-              | grep -v '/internal/store$')
-          fi
-          # shellcheck disable=SC2086
-          go test -race -coverprofile=coverage.out $pkgs
+        run: go test -race -coverprofile=coverage.out ./...
      - name: coverage summary
        run: go tool cover -func=coverage.out | tail -1

  lint:
    name: Lint
    runs-on: ubuntu-latest
-    container:
-      image: docker.dcglab.co.uk/ci-runner-go:2026-05-15
-      credentials:
-        username: ${{ secrets.ZOT_USERNAME }}
-        password: ${{ secrets.ZOT_PASSWORD }}
    steps:
      - uses: actions/checkout@v4
+      - uses: actions/setup-go@v5
+        with:
+          go-version: ${{ env.GO_VERSION }}
+          # cache: true intentionally omitted — see header notes.
      - uses: golangci/golangci-lint-action@v7
        with:
          # Must be built against the same Go release as go.mod targets,
@@ -129,11 +90,6 @@ jobs:
  build:
    name: Build (${{ matrix.goos }}/${{ matrix.goarch }})
    runs-on: ubuntu-latest
-    container:
-      image: docker.dcglab.co.uk/ci-runner-go:2026-05-15
-      credentials:
-        username: ${{ secrets.ZOT_USERNAME }}
-        password: ${{ secrets.ZOT_PASSWORD }}
    strategy:
      fail-fast: false
      matrix:
@@ -147,6 +103,10 @@ jobs:
            ext: ".exe"
    steps:
      - uses: actions/checkout@v4
+      - uses: actions/setup-go@v5
+        with:
+          go-version: ${{ env.GO_VERSION }}
+          # cache: true intentionally omitted — see header notes.
      - name: build server + agent
        env:
          GOOS: ${{ matrix.goos }}
@@ -1,133 +0,0 @@
-# P5-06 — End-to-end test suite.
-#
-# Spec : docs/superpowers/specs/2026-05-07-p5-oss-readiness-design.md
-# Stack: e2e/compose.e2e.yml (server + agent + rest-server + playwright)
-# Tests: e2e/playwright/tests/*.spec.ts
-#
-# Triggered on every PR into main and on workflow_dispatch. Runs
-# longer than the unit-test workflow (~3-4 minutes for a clean run);
-# kept separate so a slow e2e doesn't block the fast lint/test loop.
-#
-# Networking note: every interaction with the server (health probe,
-# Playwright) happens from a container on the compose `rmnet`
-# network, addressing the server as `http://server:8080`. We can't
-# rely on `127.0.0.1:8080` because Gitea's runner executes steps
-# inside its own container, where compose's host port-publish is
-# not visible.
-
-name: e2e
-
-on:
-  pull_request:
-    branches: [main]
-  workflow_dispatch:
-
-# Force bash as the default shell — see ci.yml header.
-defaults:
-  run:
-    shell: bash
-
-jobs:
-  e2e:
-    name: Playwright vs docker-compose
-    runs-on: ubuntu-latest
-    container: gitea.dcglab.co.uk/steve/ci-runner-go:2026-05-08
-    timeout-minutes: 15
-    steps:
-      - uses: actions/checkout@v4
-
-      - name: Build the e2e stack
-        # --profile test pulls in the playwright service which is
-        # otherwise gated. --pull refreshes base images so a bump
-        # to the Dockerfile's FROM tag (e.g. mcr.microsoft.com/
-        # playwright:vX.Y.Z-jammy) isn't masked by a stale runner
-        # cache that still has the old tag's layers.
-        run: docker compose --profile test -f e2e/compose.e2e.yml build --pull
-
-      - name: Bring up the stack
-        run: docker compose -f e2e/compose.e2e.yml up -d server rest-server source-fixture
-
-      - name: Wait for server health
-        run: |
-          set -eu
-          for i in $(seq 1 30); do
-            if docker run --rm --network e2e_rmnet curlimages/curl:8.10.1 \
-                  -fsS http://server:8080/api/version >/dev/null 2>&1; then
-              echo "server up"; exit 0
-            fi
-            sleep 2
-          done
-          echo "server didn't come up"; docker compose -f e2e/compose.e2e.yml logs server; exit 1
-
-      - name: Capture bootstrap token from server logs
-        id: bootstrap
-        run: |
-          set -eu
-          for i in $(seq 1 15); do
-            line=$(docker compose -f e2e/compose.e2e.yml logs server 2>&1 | grep -E 'bootstrap token' -A2 | grep -Eo '[a-zA-Z0-9_-]{40,}' | head -1 || true)
-            if [ -n "$line" ]; then
-              echo "RM_BOOTSTRAP_TOKEN=$line" >> "$GITHUB_ENV"
-              echo "got bootstrap token (${#line} chars)"
-              exit 0
-            fi
-            sleep 1
-          done
-          echo "bootstrap token not found in logs"
-          docker compose -f e2e/compose.e2e.yml logs server
-          exit 1
-
-      - name: Start the agent
-        run: docker compose -f e2e/compose.e2e.yml up -d agent
-
-      - name: Run Playwright tests
-        id: playwright
-        env:
-          RM_BOOTSTRAP_TOKEN: ${{ env.RM_BOOTSTRAP_TOKEN }}
-        # --name pins a stable container ID so the next step can
-        # docker cp out of it before tear-down. We deliberately
-        # drop --rm so the container survives the test exit; the
-        # tear-down step removes it.
-        run: docker compose -f e2e/compose.e2e.yml run --name e2e-pw playwright
-
-      - name: Extract Playwright report
-        if: always() && steps.playwright.outcome != 'skipped'
-        run: |
-          mkdir -p e2e/playwright/playwright-report e2e/playwright/test-results
-          docker cp e2e-pw:/work/playwright-report/. e2e/playwright/playwright-report/ || true
-          docker cp e2e-pw:/work/test-results/. e2e/playwright/test-results/ || true
-
-      - name: Show Playwright failure context (on failure)
-        if: failure()
-        run: |
-          set +e
-          shopt -s nullglob globstar
-          for f in e2e/playwright/test-results/**/error-context.md; do
-            echo "::group::$f"
-            cat "$f"
-            echo "::endgroup::"
-          done
-          echo "Failure attachments (download via the playwright-report artifact):"
-          find e2e/playwright/test-results \( -name '*.png' -o -name '*.webm' -o -name 'trace.zip' \) -printf '  %p\n' | sort
-
-      - name: Compose logs (on failure)
-        if: failure()
-        run: |
-          docker compose -f e2e/compose.e2e.yml logs --tail=200 server
-          docker compose -f e2e/compose.e2e.yml logs --tail=200 agent
-          docker compose -f e2e/compose.e2e.yml logs --tail=200 rest-server
-
-      - name: Upload Playwright report (on failure)
-        if: failure()
-        uses: actions/upload-artifact@v4
-        with:
-          name: playwright-report
-          path: |
-            e2e/playwright/playwright-report
-            e2e/playwright/test-results
-          retention-days: 7
-
-      - name: Tear down
-        if: always()
-        run: |
-          docker rm -f e2e-pw 2>/dev/null || true
-          docker compose -f e2e/compose.e2e.yml down -v
@@ -1,111 +0,0 @@
-# Release workflow — P5-03 (docker-only release path).
-#
-# Spec : docs/superpowers/specs/2026-05-05-p5-03-docker-only-release.md
-# Plan : docs/superpowers/plans/2026-05-05-p5-03-docker-only-release.md
-#
-# What it does
-#   * Triggered by either:
-#       - tag push matching v[0-9]+.[0-9]+.[0-9]+ (real release), or
-#       - workflow_dispatch (snapshot iteration without tagging).
-#   * Cross-builds a multi-arch (linux/amd64,linux/arm64) image of the
-#     server, with three agent binaries (linux amd64+arm64, windows amd64)
-#     plus install.sh / install.ps1 / the systemd unit baked in under
-#     /opt/restic-manager/dist (the read-only fallback path the server
-#     handlers use when <DataDir>/... is empty).
-#   * Pushes to zot OCI registry (docker.dcglab.co.uk).
-#
-# Tag fan-out
-#   * tag push: :vX.Y.Z, :X.Y, :X
-#   * tag push and X >= 1: also :latest
-#   * workflow_dispatch: only :snapshot-<shortsha>; nothing else moves.
-
-name: Release
-
-on:
-  push:
-    tags:
-      - 'v[0-9]+.[0-9]+.[0-9]+'
-  workflow_dispatch:
-
-env:
-  REGISTRY: docker.dcglab.co.uk
-  IMAGE_NAME: restic-manager
-
-# Force bash as the default shell — see ci.yml header.
-defaults:
-  run:
-    shell: bash
-
-jobs:
-  image:
-    name: Build + push image
-    runs-on: ubuntu-latest
-    container:
-      image: docker.dcglab.co.uk/ci-runner-go:2026-05-15
-      credentials:
-        username: ${{ secrets.ZOT_USERNAME }}
-        password: ${{ secrets.ZOT_PASSWORD }}
-    steps:
-      - uses: actions/checkout@v4
-
-      - uses: docker/setup-qemu-action@v3
-      - uses: docker/setup-buildx-action@v3
-
-      - name: Log in to zot registry
-        uses: docker/login-action@v3
-        with:
-          registry: ${{ env.REGISTRY }}
-          username: ${{ secrets.ZOT_USERNAME }}
-          password: ${{ secrets.ZOT_PASSWORD }}
-
-      - name: Compute tags + version
-        id: meta
-        shell: bash
-        run: |
-          set -euo pipefail
-          REG="${REGISTRY}/${IMAGE_NAME}"
-          DATE="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
-          SHORT_SHA="${GITHUB_SHA::7}"
-
-          if [ "${GITHUB_EVENT_NAME}" = "push" ] && [ "${GITHUB_REF_TYPE}" = "tag" ]; then
-            TAG="${GITHUB_REF_NAME}"            # vX.Y.Z
-            VER="${TAG#v}"                       # X.Y.Z
-            MAJOR="${VER%%.*}"
-            MINOR="${VER#${MAJOR}.}"; MINOR="${MINOR%%.*}"
-
-            TAGS="${REG}:${TAG}"
-            TAGS="${TAGS},${REG}:${MAJOR}.${MINOR}"
-            TAGS="${TAGS},${REG}:${MAJOR}"
-            # Pre-1.0 holds back :latest by design; operators must
-            # pin a version explicitly until v1.0.0.
-            if [ "${MAJOR}" -ge 1 ]; then
-              TAGS="${TAGS},${REG}:latest"
-            fi
-            VERSION="${TAG}"
-          else
-            TAGS="${REG}:snapshot-${SHORT_SHA}"
-            VERSION="0.0.0-snapshot-${SHORT_SHA}"
-          fi
-
-          {
-            echo "tags=${TAGS}"
-            echo "version=${VERSION}"
-            echo "date=${DATE}"
-          } >> "${GITHUB_OUTPUT}"
-
-      - name: Build + push
-        uses: docker/build-push-action@v6
-        with:
-          context: .
-          file: deploy/Dockerfile.server
-          platforms: linux/amd64,linux/arm64
-          push: true
-          tags: ${{ steps.meta.outputs.tags }}
-          build-args: |
-            VERSION=${{ steps.meta.outputs.version }}
-            COMMIT=${{ gitea.sha }}
-            DATE=${{ steps.meta.outputs.date }}
-          labels: |
-            org.opencontainers.image.version=${{ steps.meta.outputs.version }}
-            org.opencontainers.image.revision=${{ gitea.sha }}
-            org.opencontainers.image.created=${{ steps.meta.outputs.date }}
@@ -2,10 +2,6 @@
 /bin/
 /dist/

-# Generated mdBook output (source under docs/book/src is committed,
-# the rendered book/ directory is not).
-/docs/book/book/
-
 # Local data / runtime state
 /data/
 /certs/
@@ -30,12 +26,6 @@ coverage.html
 .env.local
 *.local

-# Local docker-compose for the dev/test bench. Has host-specific IPs,
-# hostnames, and ports — never committed; the canonical reference
-# deployment lives in deploy/.
-/compose.yaml
-/compose.override.yaml
-
 # Local diagnostic helpers (never shipped). Go's build tooling already
 # skips paths beginning with _ or ., but ignore explicitly so nothing
 # checked in here can leak into a release tarball.
@@ -45,10 +35,3 @@ coverage.html
 # tooling already skips paths starting with _, but ignore explicitly
 # so an accidental `git add cmd/.` can't sneak them into a release.
 /cmd/_*/
-
-# Local-only planning / scratch — never committed.
-/ask.md
-/docs/superpowers/
-
-# Claude Code agent worktrees (transient, harness-created).
-/.claude/worktrees/
@@ -1,127 +0,0 @@
-# Changelog
-
-All notable changes to this project are documented here.
-The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
-and the project follows [Semantic Versioning](https://semver.org/).
-
-## [Unreleased]
-
-## [1.1.0] - 2026-06-15
-
-### Added
-
- **Always-On vs intermittent host mode.** A host can now be marked as
-  not always-on — for laptops/workstations that legitimately sleep,
-  travel, or shut down outside hours. An intermittent host no longer
-  raises "agent offline" alerts when it disappears; instead it shows a
-  calm "asleep" state in the UI ("asleep · last seen … · will catch up
-  on return") and is covered by a longer-horizon staleness alert (raised
-  only when it has an enabled schedule and no successful backup in 7
-  days). When such a host reconnects, the server waits a short settle
-  window and then automatically dispatches any scheduled backup whose
-  window elapsed while it was asleep. Toggle per host from the host
-  detail page (operator-band, audited as `host.mode_updated`). New and
-  existing hosts default to always-on, so current fleets are unaffected.
-
-### Changed
-
- Host-detail header redesign: tags and presence are grouped into
-  labelled, boxed pills with click-to-edit; presence shows a `24x7` /
-  `Free` chip; the agent "out of date" indicator is simplified (the full
-  version detail remains in the Agent-update panel and on hover).
- Relative timestamps ("2h ago") now tick client-side, so a tab left
-  open no longer shows a stale value as wall-clock time moves on.
- Release and CI container images are now published to and pulled from
-  the zot OCI registry (`docker.dcglab.co.uk`).
-
-## [1.0.1] - 2026-05-09
-
-### Fixed
-
- Build version is now single-sourced from `internal/version`, and the
-  server Dockerfile's ldflags were corrected so docker-built binaries
-  report their real version. Previously `internal/version.Version` stayed
-  at its "dev" default in docker images, which made every host look
-  permanently out-of-date to the update logic.
-
-## [1.0.0] - 2026-05-09
-
-First tagged release. Six development phases brought the project from
-empty repo to a self-hostable, multi-tenant restic backup orchestrator
-with a web UI, JSON API, and self-updating agent fleet.
-
-### Phase 1 — MVP: enrolment, visibility, on-demand backup
-
- HTTP server, SQLite store with migrations, AEAD-encrypted
-  credentials at rest, Argon2id password hashing, session cookies.
- WebSocket transport between server and agents (heartbeat, hello,
-  schedule fan-out, job log streaming).
- Agent install path for Linux (systemd unit + `install.sh`); one-time
-  enrolment tokens with embedded repo credentials.
- Run-now backup execution end-to-end, snapshot listing.
- Server-side encrypted repo creds pushed to the agent on hello.
-
-### Phase 2 — Scheduling, retention, repo operations
-
- Source groups (paths + excludes + pre/post hooks + bandwidth caps)
-  decoupled from schedules; a schedule fires a source group.
- Cron-style schedules with retention policies, server-driven
-  reconciliation push and ack.
- `restic forget`, `prune`, `check`, `unlock` automation; periodic
-  maintenance ticker with per-host stagger.
- Pending-runs queue with backpressure (`max_concurrent_jobs` per
-  host).
- Repo stats panel on the host detail page (size, last-check, last-
-  prune, stale-lock banner).
- Auto-init of repos on first onboard with credential-failure surface
-  on the host detail page.
- Announce-and-approve enrolment path for hosts that don't have a
-  pre-minted token (Ed25519 fingerprint, operator approves).
- Windows agent: SCM service integration + `install.ps1` installer.
- Cross-platform alt-enrolment (announce flow on Windows).
-
-### Phase 3 — Restore, alerts, audit
-
- Restore wizard: pick a snapshot, pick paths, pick a target
-  (in-place / new directory), live progress.
- Snapshot diff against parent.
- Alert engine: per-source-group dedup, severity tiers, ack / resolve.
- Live-refresh alerts table with severity cues.
- Audit log UI with filters, sort, CSV export, payload-detail modal.
-
-### Phase 4 — RBAC, OIDC, host tags
-
- Role-based access control: viewer / operator / admin.
- User management UI (invite, role change, disable, password reset).
- Generic OIDC SSO with JIT user provisioning + role mapping.
- Per-host tags with chip-row filter on the dashboard.
-
-### Phase 5 — OSS readiness
-
- mdBook-rendered docs site at `docs/book/`.
- Contributor onboarding (CONTRIBUTING.md, security policy, license).
- Docker-only release pipeline + reference deployment compose file.
- Playwright e2e harness covering the smoke runbook.
-
-### Phase 6 — Update delivery + observability
-
- Agent self-update: server-side channel pin per host, signed binary
-  fetch via the WS transport, atomic swap with rollback on failure.
- Fleet-wide update orchestration with per-host stagger and an admin
-  pause switch.
- Prometheus `/metrics` endpoint + Grafana dashboard JSON.
- Repo size trend per host (90-day rolling) on the host detail page.
-
-### Cross-cutting
-
- Live dashboard with column sort, filters, free-text host search,
-  background-tab-aware live refresh (5s cadence).
- Pure-Go binary with embedded UI, no Node/CGO at runtime.
- Reproducible `-trimpath -ldflags="-s -w"` builds for
-  linux/amd64, linux/arm64, windows/amd64.
- Sharded CI (server-http / store / rest), pre-commit hooks (gofumpt,
-  go vet, golangci-lint).
- Threat model published (`docs/threat-model.md`).
-
-[Unreleased]: https://gitea.dcglab.co.uk/steve/restic-manager/compare/v1.0.0...HEAD
-[1.0.0]: https://gitea.dcglab.co.uk/steve/restic-manager/releases/tag/v1.0.0
@@ -2,19 +2,10 @@

 Project-specific rules for Claude when working in this repo.

-## Commands
-
-Is the user types in any of the following, follow the instructions in the table
-
-| Command | Action |
-| --- | --- |
-| :release | trigger subagent to commit (if needed), push (if needed), raise PR, wait for PR to pass or fail. If fail, report back. If pass, merge in to main |
-
 ## Repo

 The repo lives inside a Gitea instance; `tea` CLI is available for use by agents

-
 ## Run `go vet` before every commit

 CI runs `go vet ./...` and will fail the build on any vet error.
@@ -38,7 +29,7 @@ but the **agent** is fetched by the install script from the server's
 **install script** are fetched from `<DataDir>/install/`. Plain
 `make build` doesn't touch any of those — the source-of-truth files
 in the working tree (`deploy/install/*`, `bin/restic-manager-agent`)
-must be copied into `$HOME/smoke/data/...` *and* the running agent
+must be copied into `/tmp/rm-smoke/data/...` *and* the running agent
 on this dev host needs replacing if the change touches agent code or
 the unit file.

@@ -53,13 +44,13 @@ asking the operator to test.**
 ```sh
 # 1. Restage what the install script serves (binary + unit + script).
 cp bin/restic-manager-agent \
-   $HOME/smoke/data/agent-binaries/restic-manager-agent-linux-amd64
+   /tmp/rm-smoke/data/agent-binaries/restic-manager-agent-linux-amd64
 cp deploy/install/install.sh \
-   $HOME/smoke/data/install/install.sh
+   /tmp/rm-smoke/data/install/install.sh
 cp deploy/install/install.ps1 \
-   $HOME/smoke/data/install/install.ps1
+   /tmp/rm-smoke/data/install/install.ps1
 cp deploy/install/restic-manager-agent.service \
-   $HOME/smoke/data/install/restic-manager-agent.service
+   /tmp/rm-smoke/data/install/restic-manager-agent.service

 # 2. Replace the running agent on this dev box and restart the
 #    service. Skip only when the change is server-side only AND
@@ -74,36 +65,15 @@ sudo -n systemctl restart restic-manager-agent
 # 3. The server runs from the working tree; restart it manually
 #    after a build that touches server code:
 pkill -f restic-manager-server
-RM_LISTEN=:8080 RM_DATA_DIR=$HOME/smoke/data \
+RM_LISTEN=:8080 RM_DATA_DIR=/tmp/rm-smoke/data \
 RM_BASE_URL=http://127.0.0.1:8080 \
-RM_SECRET_KEY_FILE=$HOME/smoke/data/secret.key \
+RM_SECRET_KEY_FILE=/tmp/rm-smoke/data/secret.key \
 RM_COOKIE_SECURE=false \
-./bin/restic-manager-server >> $HOME/smoke/server.log 2>&1 &
+./bin/restic-manager-server >> /tmp/rm-smoke/server.log 2>&1 &
 ```

-## Smoke server: use the Make targets, not raw `nohup`
-
-The smoke server runs as a transient `systemd --user` unit named
-`restic-manager-smoke.service` so it survives any sandbox or
-process-group boundary that would otherwise SIGTERM a backgrounded
-process. Use the Make targets:
-
-```
-make smoke-restart   # rebuild server + (re)launch as systemd --user unit
-make smoke-status    # systemctl --user status
-make smoke-logs      # tail $HOME/smoke/server.log
-make smoke-stop      # stop the unit
-make smoke-deploy    # full rebuild + restage agent assets + restart
-```
-
-`./bin/restic-manager-server &` from inside a Bash tool call gets
-reaped when the tool exits — don't do that. If the unit fails to
-start: `systemctl --user status restic-manager-smoke` and
-`$HOME/smoke/server.log` have the diagnosis.
-
-`smoke-deploy` does NOT touch `/usr/local/bin/restic-manager-agent`
-on this dev box; if your change requires the live agent here to
-update, run the agent restage block above by hand.
+A `make smoke-deploy` target that bundles all of this would be a
+good follow-up.

 ## Migrations: prefer column-level ALTERs over table rebuilds

@@ -1,69 +0,0 @@
-# Code of Conduct
-
-restic-manager is a small project run by one person. This Code of
-Conduct sets out the basic expectations for participating in the
-project's issue tracker, pull requests, and any other community
-spaces (chat, mailing lists) we may run in future.
-
-## Expected behaviour
-
- **Be civil.** Disagreement is fine; rudeness is not. The same
-  comment can usually be made without making it personal.
- **Assume good faith.** People asking what feels like a basic
-  question may be new to the project. People proposing what feels
-  like a duplicate idea may not have seen the prior discussion.
-  Point them to the right place politely.
- **Stay on topic.** Issue threads are for the issue. Tangential
-  conversations belong in their own thread.
- **Acknowledge the project's scope.** restic-manager is
-  intentionally small in scope (see `spec.md` §2). Reasonable
-  feature suggestions may still be declined for fit reasons.
-
-## Unacceptable behaviour
-
-- Harassment, threats, or insults — public or private.
-- Discriminatory comments based on age, body size, disability,
-  ethnicity, gender identity or expression, level of experience,
-  nationality, personal appearance, race, religion, sexual identity
-  or orientation.
-- Sustained disruption — derailing threads, ignoring repeated
-  requests to take a discussion elsewhere, brigading.
-- Publishing other people's private information without permission.
-
-## Reporting
-
-If someone in the project's spaces is behaving in a way that
-breaches this Code of Conduct, contact the maintainer directly
-through the contact details on their Gitea profile, or via the
-private security disclosure path documented in
-[SECURITY.md](./SECURITY.md). Reports stay confidential.
-
-The maintainer will review the report, gather context if needed,
-and respond. Possible outcomes include a private warning, a public
-clarification of expectations, a temporary or permanent ban from
-project spaces, or no action if the report doesn't hold up.
-
-There is no formal appeals process — this is a one-person project,
-not a foundation. If you think a decision was wrong you can say
-so, in writing, to the maintainer; that's it.
-
-## Scope
-
-This Code of Conduct applies to interactions in any space the
-project owns or operates: the Gitea repository (issues, pull
-requests, discussions, wiki), any chat channels we publish, and
-any conferences or events the project is officially represented at.
-
-It does not apply to:
-
-- Forks of the project that aren't being submitted back upstream.
-- Conversations between contributors that don't reference the
-  project.
-- Public criticism of the project itself.
-
-## Acknowledgement
-
-This document borrows shape and language from the
-[Contributor Covenant](https://www.contributor-covenant.org/) v2.1
-but is intentionally shorter and adapted to the project's
-single-maintainer reality.
@@ -1,168 +1,30 @@
-# Contributing to restic-manager
+# Contributing

-Thanks for your interest in restic-manager. This document covers how
-to set up a development environment, the conventions the project
-follows, and how patches make it from your machine into `main`.
+Thanks for your interest in contributing to restic-manager.

-## Project status and scope
+> This is a placeholder. The project is in pre-alpha (Phase 1 / MVP). A
+> full contributor guide will land alongside the Phase 5 OSS-readiness
+> work — see [`tasks.md`](./tasks.md) P5-02. Until then the notes below
+> apply.

-restic-manager is in pre-1.0. Core functionality (Phases 0–4) is
-landed; OSS-readiness polish is in progress. The top of
-[`tasks.md`](./tasks.md) tracks what's next; [`spec.md`](./spec.md)
-is the canonical design doc and the source of truth for any
-"why is it built this way" question.
+## Before opening a PR

-The project is **single-maintainer, hobbyist-scale, and licensed
-under [PolyForm Noncommercial 1.0.0](./LICENSE)**. That has two
-practical implications:
+1. Open an issue first for non-trivial changes — the design is still
+   moving (see [`spec.md`](./spec.md)) and unsolicited large PRs may
+   conflict with in-flight work.
+2. `make lint test` should pass.
+3. Match the existing code style — `gofumpt`, `goimports`, no comments
+   that just restate what the code does.
+4. Keep commits focused; one logical change per commit.

-1. Big PRs without prior discussion may be declined for fit
-   reasons even when they're correct — opening an issue first lets
-   us check alignment cheaply.
-2. Commercial use is not permitted by the license. Bug reports and
-   patches from operators of personal/community deployments are
-   very welcome.
+## Reporting security issues

-## Getting started
-
-### Prerequisites
-
-- Go 1.25 or newer (`go.mod` is the source of truth)
-- `make`
-- For the front-end CSS bundle: nothing extra — `make build`
-  downloads a pinned `tailwindcss` standalone binary into `bin/`.
-- For the docs site: nothing extra — `make docs` does the same trick
-  with `mdbook`.
-- For end-to-end tests: Docker + Docker Compose, plus `npx` for
-  Playwright.
-
-### One-time setup
-
-```sh
-git clone https://gitea.dcglab.co.uk/steve/restic-manager.git
-cd restic-manager
-make build          # compiles bin/restic-manager-{server,agent}
-make test           # full unit + integration test sweep
-make lint           # gofumpt + goimports + golangci-lint
-```
-
-### Running locally
-
-For most development, the [smoke environment](./docs/e2e-smoke.md)
-is the path of least resistance:
-
-```sh
-make smoke-restart  # rebuilds, launches as a systemd --user unit
-make smoke-logs     # tail of the server log
-```
-
-Then point a browser at `http://127.0.0.1:8080`. The first run
-prints a one-time bootstrap token to the log; use it to create the
-admin user.
-
-## Code conventions
-
-### Style
-
-- `gofumpt` for formatting; `goimports` for import grouping.
-  Both run via the pre-commit hook in this repo.
-- `golangci-lint` with `.golangci.yml` defaults; CI rejects on lint
-  errors.
-- UK English in identifiers, comments, log messages, and UI strings
-  (the misspell linter is configured for the UK locale — see
-  P3-X5 for the original sweep).
-- Comments explain **why**, not what; avoid restating the code.
-  A surprising invariant or an external constraint is worth
-  writing down. "Adds 1 to x" is not.
-- `slog` for structured logs. Never log secrets — and especially
-  never the merged-creds rest-server URL (see [`CLAUDE.md`](./CLAUDE.md)).
-
-### File and package layout
-
-- `cmd/server` and `cmd/agent` are the two binary entry points.
-- `internal/` holds everything that's not part of the public Go
-  API (which is none of it — restic-manager isn't a library).
-- Per-feature packages live under `internal/server/...` for the
-  control plane and `internal/agent/...` for the agent.
-- `web/templates/` are HTML templates rendered with the standard
-  library; embedded via `web.FS`.
-
-### Tests
-
-- Unit tests live alongside the code as `*_test.go`. Use the
-  in-process sqlite store (`store.Open(":memory:")`) when you need
-  state — there is no test mock layer to maintain.
-- HTTP handlers test through `httptest.NewServer` against the real
-  router; see `internal/server/http/auth_test.go` for the canonical
-  fixture pattern.
-- End-to-end tests live in `e2e/` and run against a Docker Compose
-  stack. See [`docs/e2e.md`](./docs/e2e.md).
-
-### Database migrations
-
-- Migrations are hand-rolled SQL in `internal/store/migrations/`
-  and embedded via `embed.FS`.
-- Prefer column-level `ALTER TABLE` over rebuilds — see
-  [`CLAUDE.md`](./CLAUDE.md) "Migrations" section for the FK-cascade
-  trap that bit migration 0007's first draft.
-
-## Workflow
-
-### Before opening a PR
-
-1. **Open an issue first** for non-trivial changes. The design is
-   still moving; an issue lets us agree on direction cheaply.
-2. Run `make lint test` locally — both must pass.
-3. Match existing code style (see above).
-4. Keep commits focused: one logical change per commit. Imperative
-   subject lines, body explaining why if it isn't obvious.
-5. Don't add `Co-Authored-By` trailers — repo policy. If you used
-   AI assistance in writing the patch, that's fine; we just don't
-   pollute every commit message with attribution boilerplate.
-
-### Pull requests
-
-PRs target `main`. CI runs lint + tests on Linux amd64/arm64 and
-Windows amd64; all three must be green to merge. Squash-merge is
-the default; the PR title becomes the merge-commit subject, so
-keep it short and informative.
-
-The PR template asks for:
-
-- A short description of what changed and why.
-- A test plan (commands run, scenarios verified).
-- Anything reviewers need to know to assess the change (related
-  issue, follow-up work, deferred concerns).
-
-### Reporting bugs
-
-Open an issue with:
-
-- restic-manager version (`server --version`) and agent version.
-- restic version on the affected host.
-- Steps to reproduce.
-- Server and agent logs (sanitise any tokens before pasting).
-
-Security-sensitive bugs go through the [SECURITY.md](./SECURITY.md)
-disclosure path instead — please don't open a public issue for
-them.
-
-### Suggesting features
-
-Open an issue describing the use case (not just the proposed
-solution). The roadmap in `tasks.md` shows where the project is
-heading; if the suggestion fits a future phase we'll wire it in
-there. If it falls outside the project's scope (multi-tenancy, SaaS,
-non-restic backends — see `spec.md` §2 non-goals) we'll say so
-early to save your time.
-
-## Code of conduct
-
-Project participation is governed by [CODE_OF_CONDUCT.md](./CODE_OF_CONDUCT.md).
-The short version: be civil; assume good faith; harassment is not
-tolerated.
+Please do **not** open a public issue for security problems. A
+`SECURITY.md` with a private disclosure path will be added in Phase 5
+(P5-05). Until then, contact the repository owner directly via the
+contact details on their Gitea profile.

 ## License

-By contributing you agree that your contributions are licensed
-under the [PolyForm Noncommercial 1.0.0](./LICENSE) license.
+By contributing you agree that your contributions are licensed under
+the [PolyForm Noncommercial 1.0.0](./LICENSE) license.
@@ -5,15 +5,9 @@ BIN_DIR        := bin
 SERVER_BIN     := $(BIN_DIR)/restic-manager-server
 AGENT_BIN      := $(BIN_DIR)/restic-manager-agent
 VERSION        ?= $(shell git describe --tags --always --dirty 2>/dev/null || echo dev)
-COMMIT         ?= $(shell git rev-parse HEAD 2>/dev/null || echo none)
-DATE           ?= $(shell date -u +%Y-%m-%dT%H:%M:%SZ)
-VERSION_PKG    := gitea.dcglab.co.uk/steve/restic-manager/internal/version
-LDFLAGS        := -s -w \
-                  -X $(VERSION_PKG).Version=$(VERSION) \
-                  -X $(VERSION_PKG).Commit=$(COMMIT) \
-                  -X $(VERSION_PKG).Date=$(DATE)
+LDFLAGS        := -s -w -X main.version=$(VERSION)
 GOFLAGS        := -trimpath
-DOCKER_IMAGE   ?= gitea.dcglab.co.uk/steve/restic-manager
+DOCKER_IMAGE   ?= ghcr.io/dcglab/restic-manager
 DOCKER_TAG     ?= dev

 # Tailwind standalone CLI — single binary, no Node toolchain.
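The simplified `LDFLAGS` line above relies on the Go linker's `-X importpath.name=value` option, which overwrites a package-level string variable at link time. A minimal sketch of the receiving side, assuming a `version` variable in package `main` with a `"dev"` fallback; the `versionLine` helper is hypothetical and exists only to make the behaviour easy to check:

```go
package main

import "fmt"

// version is the fallback used for plain `go build` / `go run`.
// A release build can replace it at link time, for example:
//
//	go build -ldflags "-X main.version=v1.2.3" ./cmd/agent
//
// -X only works on package-level string variables, which is why this
// is a var rather than a const.
var version = "dev"

// versionLine is a hypothetical helper that formats the line printed
// for a -version style flag.
func versionLine(binary string) string {
	return binary + " " + version
}

func main() {
	fmt.Println(versionLine("restic-manager-agent"))
}
```

The trade-off against the removed `internal/version` package is that each binary now needs its own `version` variable, and the commit hash and build date are no longer stamped into the binary.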
@@ -26,29 +20,7 @@ TAILWIND_URL      := https://github.com/tailwindlabs/tailwindcss/releases/downlo
 TAILWIND_INPUT    := web/styles/input.css
 TAILWIND_OUTPUT   := web/static/css/styles.css

-# mdBook for the docs site (P5-01). Single static binary, no
-# Rust toolchain — same pattern as Tailwind.
-MDBOOK_VERSION    ?= v0.4.51
-MDBOOK_OS         := $(shell uname -s | tr A-Z a-z)
-MDBOOK_TRIPLE     := $(shell uname -m)-unknown-$(if $(filter darwin,$(MDBOOK_OS)),apple-darwin,linux-gnu)
-MDBOOK_BIN        := $(BIN_DIR)/mdbook
-MDBOOK_TARBALL    := mdbook-$(MDBOOK_VERSION)-$(MDBOOK_TRIPLE).tar.gz
-MDBOOK_URL        := https://github.com/rust-lang/mdBook/releases/download/$(MDBOOK_VERSION)/$(MDBOOK_TARBALL)
-DOCS_BOOK_DIR     := docs/book
-DOCS_BOOK_OUT     := $(DOCS_BOOK_DIR)/book
-
-.PHONY: help build server agent test test-race lint fmt tidy clean run-server run-agent docker release tailwind tailwind-watch docs docs-watch setup hooks smoke-restart smoke-stop smoke-status smoke-logs smoke-deploy
-
-# ---- smoke-env tooling -------------------------------------------------
-# The smoke server runs as a transient user-systemd unit so it survives
-# bash-tool boundaries and reboots-of-the-shell. Use `make smoke-restart`
-# any time you've rebuilt the server. `make smoke-deploy` is the full
-# rebuild + restage + restart workflow described in CLAUDE.md.
-SMOKE_UNIT       := restic-manager-smoke
-SMOKE_DATA_DIR   := $(HOME)/smoke/data
-SMOKE_LOG_FILE   := $(HOME)/smoke/server.log
-SMOKE_BASE_URL   := http://127.0.0.1:8080
-SMOKE_LISTEN     := :8080
+.PHONY: help build server agent test test-race lint fmt tidy clean run-server run-agent docker release tailwind tailwind-watch setup hooks

 help:
 	@grep -E '^[a-zA-Z_-]+:.*?## ' $(MAKEFILE_LIST) | awk 'BEGIN{FS=":.*?## "};{printf "  \033[36m%-14s\033[0m %s\n",$$1,$$2}'
@@ -73,18 +45,6 @@ tailwind-watch: $(TAILWIND_BIN) ## Watch and rebuild on every save
 	@mkdir -p $$(dirname $(TAILWIND_OUTPUT))
 	$(TAILWIND_BIN) -c tailwind.config.js -i $(TAILWIND_INPUT) -o $(TAILWIND_OUTPUT) --watch

-$(MDBOOK_BIN):
-	@mkdir -p $(BIN_DIR)
-	@echo "==> downloading mdbook $(MDBOOK_VERSION) ($(MDBOOK_TRIPLE))"
-	curl -fsSL "$(MDBOOK_URL)" | tar -xz -C $(BIN_DIR) mdbook
-	@chmod +x $@
-
-docs: $(MDBOOK_BIN) ## Build the docs/book/ mdBook site into docs/book/book/
-	$(MDBOOK_BIN) build $(DOCS_BOOK_DIR)
-
-docs-watch: $(MDBOOK_BIN) ## Serve the docs site at http://127.0.0.1:3000 with live reload
-	$(MDBOOK_BIN) serve $(DOCS_BOOK_DIR) -n 127.0.0.1 -p 3000
-
 agent: ## Build the agent binary
 	@mkdir -p $(BIN_DIR)
 	CGO_ENABLED=0 go build $(GOFLAGS) -ldflags "$(LDFLAGS)" -o $(AGENT_BIN) ./cmd/agent
@@ -115,7 +75,7 @@ tidy: ## go mod tidy
 	go mod tidy

 clean: ## Remove build artifacts
-	rm -rf $(BIN_DIR) coverage.out coverage.html $(TAILWIND_OUTPUT) $(DOCS_BOOK_OUT)
+	rm -rf $(BIN_DIR) coverage.out coverage.html $(TAILWIND_OUTPUT)

 run-server: server ## Build and run the server
 	$(SERVER_BIN)
@@ -124,53 +84,7 @@ run-agent: agent ## Build and run the agent
 	$(AGENT_BIN)

 docker: ## Build the server Docker image
-	docker build -f deploy/Dockerfile.server \
-	  --build-arg VERSION=$(VERSION) \
-	  --build-arg COMMIT=$(COMMIT) \
-	  --build-arg DATE=$(DATE) \
-	  -t $(DOCKER_IMAGE):$(DOCKER_TAG) .
-
-smoke-restart: server ## (Re)start the smoke server as a transient user-systemd unit
-	@systemctl --user reset-failed $(SMOKE_UNIT) >/dev/null 2>&1 || true
-	@systemctl --user stop $(SMOKE_UNIT) >/dev/null 2>&1 || true
-	@echo "==> launching $(SMOKE_UNIT)"
-	systemd-run --user --unit=$(SMOKE_UNIT) \
-	  --setenv=RM_LISTEN=$(SMOKE_LISTEN) \
-	  --setenv=RM_DATA_DIR=$(SMOKE_DATA_DIR) \
-	  --setenv=RM_BASE_URL=$(SMOKE_BASE_URL) \
-	  --setenv=RM_SECRET_KEY_FILE=$(SMOKE_DATA_DIR)/secret.key \
-	  --setenv=RM_COOKIE_SECURE=false \
-	  --property=StandardOutput=append:$(SMOKE_LOG_FILE) \
-	  --property=StandardError=append:$(SMOKE_LOG_FILE) \
-	  --property=Restart=on-failure \
-	  $(PWD)/$(SERVER_BIN)
-	@for i in 1 2 3 4 5; do \
-	  curl -fsS -o /dev/null $(SMOKE_BASE_URL)/api/version 2>/dev/null && \
-	    { echo "==> smoke server up: $$(curl -s $(SMOKE_BASE_URL)/api/version)"; exit 0; }; \
-	  sleep 1; \
-	done; \
-	echo "!! smoke server did not respond on $(SMOKE_BASE_URL) — check $(SMOKE_LOG_FILE)" >&2; \
-	systemctl --user status --no-pager $(SMOKE_UNIT) || true; \
-	exit 1
-
-smoke-stop: ## Stop the smoke server
-	systemctl --user stop $(SMOKE_UNIT) || true
-	@systemctl --user reset-failed $(SMOKE_UNIT) >/dev/null 2>&1 || true
-
-smoke-status: ## Show status of the smoke server
-	@systemctl --user status --no-pager $(SMOKE_UNIT) 2>&1 | head -20 || true
-
-smoke-logs: ## Tail the smoke server log
-	tail -50 $(SMOKE_LOG_FILE)
-
-smoke-deploy: build smoke-restart ## Rebuild + restage agent into smoke + restart server (full per-CLAUDE.md cycle)
-	@echo "==> restaging agent + install assets into $(SMOKE_DATA_DIR)"
-	cp $(AGENT_BIN) $(SMOKE_DATA_DIR)/agent-binaries/restic-manager-agent-linux-amd64
-	cp deploy/install/install.sh $(SMOKE_DATA_DIR)/install/install.sh
-	cp deploy/install/install.ps1 $(SMOKE_DATA_DIR)/install/install.ps1
-	cp deploy/install/restic-manager-agent.service $(SMOKE_DATA_DIR)/install/restic-manager-agent.service
-	@echo "==> NOTE: this dev box's installed agent at /usr/local/bin/restic-manager-agent is NOT updated by this target."
-	@echo "    Run the agent restage block in CLAUDE.md if your change touches agent code or the unit file."
+	docker build -f deploy/Dockerfile.server --build-arg VERSION=$(VERSION) -t $(DOCKER_IMAGE):$(DOCKER_TAG) .

 release: ## Cross-compile for all supported platforms
 	@mkdir -p $(BIN_DIR)
@@ -1,62 +1,36 @@
 # restic-manager

 Self-hosted, browser-based, single-pane-of-glass for managing
-[restic](https://restic.net) backups across a fleet of Linux and
-Windows endpoints.
+[restic](https://restic.net) backups across a fleet of Linux and Windows
+endpoints.

-> **Status:** pre-1.0, feature-complete for the original use
-> case. Phases 0–4 + 6 are landed (MVP, scheduling, restore,
-> RBAC + OIDC, observability); Phase 5 (OSS readiness — docs site,
-> contributor onboarding, end-to-end CI) is in flight. See
-> [`spec.md`](./spec.md) for the design and [`tasks.md`](./tasks.md)
-> for the live roadmap.
+> Status: pre-alpha. Phase 0 (project bootstrap) complete; Phase 1 (MVP) in
+> progress. See [`spec.md`](./spec.md) for the design and
+> [`tasks.md`](./tasks.md) for the roadmap.

-## What it does
+## What it does (target)

-- Central visibility into backup state for every endpoint.
-- Trigger any restic operation remotely (`backup`, `forget`,
-  `prune`, `check`, `unlock`, `snapshots`, `stats`, `diff`,
-  `restore`).
-- Per-host schedules with named source groups + retention.
-- Live job log streamed to the browser; downloadable as
-  text/NDJSON afterwards.
-- Restore wizard: browse a snapshot's tree, pick paths, restore
-  in-place or to a new directory.
-- Repo health surfacing (size, raw size, last check, lock state),
-  plus a 30/90-day repo-size trend.
-- Alerting over webhook, ntfy, or SMTP.
-- Cross-platform agent (Linux systemd + Windows SCM).
-- Append-only-friendly: separate admin credential for prune.
-- Optional Prometheus `/metrics` endpoint + sample Grafana
-  dashboard.
-- Optional OIDC SSO (Authelia, Authentik, etc.).
+- Central visibility into backup state for every endpoint
+- Trigger any restic operation remotely (`backup`, `forget`, `prune`,
+  `check`, `unlock`, `snapshots`, `stats`, `diff`, `restore`)
+- Manage per-host backup schedules from the UI
+- Live job progress streamed back to the UI
+- Restore wizard (browse snapshots, pick paths, restore to original or
+  alternate host)
+- Repo health surfacing (size, dedup ratio, last check, lock state)
+- Alerting on failure or staleness
+- Cross-platform agent (Linux + Windows)
+- Ransomware-resistant repo access via append-only credentials

-## Screenshots
+## Architecture (one-line summary)

-| Sign in | Empty dashboard | Add host |
-|:-------:|:---------------:|:--------:|
-| ![Sign in](docs/screenshots/01-login.png) | ![Dashboard, fresh](docs/screenshots/02-dashboard-empty.png) | ![Add host](docs/screenshots/03-add-host.png) |
-
-| Alerts | Settings | Audit log |
-|:------:|:--------:|:---------:|
-| ![Alerts](docs/screenshots/04-alerts.png) | ![Settings](docs/screenshots/05-settings.png) | ![Audit log](docs/screenshots/06-audit.png) |
-
-(Screenshots from a fresh smoke install with no hosts. A populated
-fleet view and the live-log + restore wizard surfaces are part of
-the docs site under [`docs/book/`](./docs/book) — `make docs` to
-render locally.)
-
-## Architecture (one-line)
-
-A small Go control-plane in Docker, lightweight Go agents on each
-endpoint holding an outbound WebSocket to the control-plane, and
-a restic repository (rest-server, S3, B2, SFTP — anything restic
-speaks) that holds the actual backup data. **The control-plane
-never touches backup bytes.**
+A small Go control-plane on the Proxmox host, lightweight Go agents on each
+endpoint that hold an outbound WebSocket to the control-plane, and a
+`restic/rest-server` on Unraid that holds the actual backup data. The
+control-plane never touches backup bytes.

 Full architecture diagram and component breakdown:
-[`spec.md` §3](./spec.md), or the rendered version in the
-[docs site](./docs/book/src/concepts/architecture.md).
+[`spec.md` §3](./spec.md).

 ## Repository layout

@@ -64,63 +38,31 @@ Full architecture diagram and component breakdown:
 cmd/server/        control-plane binary
 cmd/agent/         endpoint agent binary
 internal/api       shared API types (REST + WS envelopes)
-internal/server/   HTTP, WS, UI handlers, alert engine
+internal/server/   HTTP, WS, UI handlers
 internal/agent/    service integration, restic runner, local scheduler
 internal/restic    restic CLI wrapper
 internal/store     SQLite persistence
-internal/crypto    secret encryption (AEAD)
+internal/crypto    secret encryption
 internal/auth      passwords, sessions, agent tokens
 web/               server-rendered templates + static assets
-deploy/            Dockerfile, docker-compose.yml, install scripts, Grafana dashboard
-docs/              prose docs + the mdBook site under docs/book
-e2e/               compose stack + Playwright tests for end-to-end CI
+deploy/            Dockerfile, docker-compose.yml, install scripts
+design/            UI wireframes (Phase 0 design pass)
 ```

-## Quickstart
-
-The reference deployment is a single Docker container fronted by
-your existing reverse proxy. See the [installation guide](docs/book/src/getting-started/install.md)
-for the full path; the very short version:
-
-```sh
-export RM_VERSION=v0.9.0    # pin a real tag
-export RM_BASE_URL=https://restic.example.com
-export RM_TRUSTED_PROXY=10.0.0.0/8
-docker compose -f deploy/docker-compose.yml up -d
-```
-
-The server prints a one-time bootstrap token to the log on first
-start. POST it to `/api/bootstrap` (or open `/bootstrap` in a
-browser) to create the admin user.
-
 ## Local development

-Requires Go 1.25+. The floor is set by `modernc.org/sqlite` v1.50.
+Requires Go 1.25+ (built and tested on 1.26). The floor is set by
+`modernc.org/sqlite` v1.50.

 ```sh
 make build           # builds cmd/server and cmd/agent into ./bin
 make test            # runs go test ./...
 make lint            # runs golangci-lint
-make smoke-restart   # systemd --user smoke server (see CLAUDE.md)
-make docs            # renders the mdBook site to docs/book/book/
+make run-server      # runs the server (dev defaults)
 ```

-End-to-end test harness against a Docker Compose stack with a
-sibling Linux agent: see [`docs/e2e.md`](docs/e2e.md). Runs in CI
-on every PR.
-
-## Documentation
-
-- **Concepts and operator guides**: [docs site](docs/book/src/intro.md),
-  rendered with `make docs`.
-- **Reverse-proxy setup**: [docs/reverse-proxy.md](docs/reverse-proxy.md).
-- **Prometheus + Grafana**: [docs/prometheus.md](docs/prometheus.md).
-- **End-to-end test harness**: [docs/e2e.md](docs/e2e.md).
-- **Security policy**: [SECURITY.md](SECURITY.md).
-- **Contributing**: [CONTRIBUTING.md](CONTRIBUTING.md).
-
 ## License

-[PolyForm Noncommercial 1.0.0](./LICENSE). Free for personal,
-hobby, research, educational, governmental, and other noncommercial
-use. Commercial use requires a separate license.
+PolyForm Noncommercial 1.0.0 — see [`LICENSE`](./LICENSE). Free for personal,
+hobby, research, educational, governmental, and other noncommercial use.
+Commercial use requires a separate license.
@@ -1,137 +0,0 @@
-# Security policy
-
-restic-manager handles credentials that grant access to backup
-repositories — losing them means an attacker can read or destroy a
-fleet's backups. We take security reports seriously even at this
-project's small scale.
-
-## Supported versions
-
-Pre-1.0, only the latest tagged release on `main` is supported.
-Backporting fixes to older tags is not currently offered.
-
-| Version            | Supported      |
-|--------------------|----------------|
-| `main` HEAD        | Yes            |
-| Latest released tag| Yes            |
-| Anything older     | No             |
-
-## Reporting a vulnerability
-
-**Please don't open a public issue for security problems.**
-
-Instead, use one of these private channels:
-
-1. **Gitea private message** to the repository owner. The
-   instance is at <https://gitea.dcglab.co.uk> and the owner's
-   profile (`steve`) has direct-message contact set up.
-2. **Email** to the address on the maintainer's Gitea profile.
-   Use a subject like `[SECURITY] restic-manager: <one-line summary>`
-   so it doesn't get lost. PGP optional — if you want to encrypt,
-   ask for a key first.
-
-If you don't get an acknowledgement within **3 working days**,
-please escalate through the other channel — solo maintainers do
-miss things, and the goal here is to fix the problem, not to
-preserve protocol.
-
-### What to include
-
-- A description of the issue and the impact (what does an attacker
-  gain? confidentiality, integrity, availability?).
-- Affected component (server, agent, install script, docs).
-- Affected version (`restic-manager-server --version`).
-- Reproduction steps if you have them. A working PoC is welcome
-  but not required — a credible threat model is enough.
-- Whether you intend to publish a writeup, and any timing
-  preferences.
-
-### What we'll do
-
-1. Acknowledge receipt within 3 working days.
-2. Confirm or refute the issue, and agree a rough severity (CVSS
-   or just "this is bad / this isn't"). Asking clarifying
-   questions is normal at this stage — please don't read it as
-   foot-dragging.
-3. Develop a fix on a private branch, test it, and prepare a
-   release.
-4. Coordinate disclosure timing with you. The default is **30
-   days from confirmed report to public disclosure**, with a
-   patched release published before the disclosure date. Faster
-   if a workable PoC is already circulating; slower only by
-   mutual agreement.
-5. Credit the reporter in the release notes (or omit the credit
-   if you'd rather stay anonymous — your choice).
-
-## Scope
-
-In scope:
-
-- The server binary (`cmd/server`) and any HTTP, WebSocket, or CLI
-  surface it exposes.
-- The agent binary (`cmd/agent`) and the way it consumes commands
-  from the server.
-- The install scripts (`deploy/install/install.sh`, `install.ps1`)
-  and the systemd unit shipped with them.
-- The docker-compose reference deployment and the docker image we
-  publish.
-- Any cryptographic primitive choice or implementation detail
-  (AEAD, token hashing, session handling, OIDC handshake).
-- Documentation that, if followed, leads operators into an
-  insecure configuration.
-
-Out of scope (not because they aren't real problems, just not ones
-this report channel can act on):
-
-- Vulnerabilities in restic itself — report those upstream at
-  <https://github.com/restic/restic>.
-- Vulnerabilities in third-party dependencies that haven't yet been
-  patched upstream — report upstream first.
-- Issues that require pre-authenticated admin access on the control
-  plane (admins can already do everything; that's not a privilege
-  escalation, that's the design).
-- DoS via resource exhaustion on a deployment without the
-  recommended reverse proxy / rate limiting in front (see
-  `docs/reverse-proxy.md`).
-- Social-engineering scenarios that don't have a technical hook
-  into the project's own surfaces.
-
-## Threat model summary
-
-For context (longer version in [`spec.md`](./spec.md) §11):
-
-- The server is **HTTP-only**; TLS termination, ACME, HSTS, and
-  edge rate-limiting are the reverse proxy's job.
-- Credentials are encrypted at rest with an AEAD key loaded from
-  `RM_SECRET_KEY_FILE`. The same key encrypts agent secrets that
-  travel to the agent over the WS channel.
-- Agents authenticate with bearer tokens issued at enrolment and
-  hashed at rest. Compromise of the server DB does **not** leak
-  bearer tokens in plaintext, but does leak the hashes (which is
-  enough to log in *as* the agent until the operator revokes —
-  see [NS-01 / NS-02](./tasks.md) for the revoke + regenerate
-  flows).
-- The control plane intentionally **never touches backup bytes** —
-  the agent runs `restic` directly against the repo. A
-  compromised control plane can dispatch new jobs but cannot
-  exfiltrate snapshot contents in-band.
-- Append-only credentials are first-class. Forget/prune jobs use a
-  separate, admin-marked credential that the server only pushes
-  for the duration of a maintenance dispatch.
-
-## Hardening checklist for operators
-
-- Run behind a TLS-terminating reverse proxy (Caddy/nginx/Traefik).
-- Set `RM_TRUSTED_PROXY` to the proxy's CIDR so request IPs aren't
-  spoofable.
-- Back up `RM_SECRET_KEY_FILE` separately from the database.
-  Without it the encrypted creds are unrecoverable.
-- Use append-only credentials for the everyday backup path; only
-  the optional admin credential should have write/forget/prune
-  power.
-- Disable users (don't delete) when staff change roles — bearer
-  tokens stay valid until rotated.
-- Watch the alert and audit-log views during enrolment of new
-  hosts.
-
-Thanks for helping keep restic-manager users safe.
@@ -0,0 +1,8 @@
+# The ask!
+
+I have numerous servers deployed out in a lab, mainly Linux but some Windows.
+All have restic installed on them.
+I need to build a browser-based management service that gives me a central single-pane-of-glass to monitor and manage all the endpoints.
+All endpoints will be enabled for SSH (unless other methods are better?).
+
+Plan out how we would go about this please?
@@ -22,9 +22,10 @@ import (
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/agent/wsclient"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/restic"
-	"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
 )

+var version = "dev"
+
 func main() {
 	if err := run(); err != nil {
 		slog.Error("agent fatal", "err", err)
@@ -61,7 +62,7 @@ func run() error {
 	flag.Parse()

 	if *showVersion {
-		fmt.Printf("restic-manager-agent %s (commit %s, built %s)\n", version.Version, version.Commit, version.Date)
+		fmt.Println("restic-manager-agent", version)
 		return nil
 	}

@@ -77,14 +78,14 @@ func run() error {
 		if *enrollServer == "" {
 			return errors.New("enrollment: -enroll-server is required with -enroll-token")
 		}
-		return doEnroll(*enrollServer, *enrollToken, cfg, version.Version)
+		return doEnroll(*enrollServer, *enrollToken, cfg, version)
 	}

 	// Announce-and-approve: -enroll-server set, no token, agent not
 	// yet enrolled. Run the announce flow inline; on success the cfg
 	// has the bearer + host_id and we drop into the normal run loop.
 	if !cfg.Enrolled() && *enrollServer != "" {
-		if err := doAnnounce(*enrollServer, cfg, version.Version); err != nil {
+		if err := doAnnounce(*enrollServer, cfg, version); err != nil {
 			return fmt.Errorf("announce: %w", err)
 		}
 	}
@@ -101,7 +102,7 @@ func run() error {
 		return fmt.Errorf("sysinfo: %w", err)
 	}
 	slog.Info("agent starting",
-		"version", version.Version,
+		"version", version,
 		"host_id", cfg.HostID,
 		"server", cfg.ServerURL,
 		"restic_version", snap.ResticVersion,
@@ -110,12 +111,6 @@ func run() error {

 	resticBin, _ := restic.Locate(cfg.ResticPath) // empty is fine; commands fail with a clear error later

-	// Probe the actual restic binary for restore-flag support. We used
-	// to gate --no-ownership on a SemVer comparison (added in 0.17),
-	// but a restic 0.18.1 build was observed in the wild that still
-	// rejects the flag. The help text is the only reliable signal.
-	resticSupportsNoOwnership := restic.SupportsRestoreNoOwnership(ctx, resticBin)
-
 	// Open the secrets store. If the agent is enrolled but has no
 	// secrets key yet (legacy YAML), mint one and migrate any
 	// plaintext repo fields into the encrypted blob.
@@ -131,7 +126,7 @@ func run() error {
 		CertPinSHA256: cfg.CertPinSHA256,
 		HelloPayload: api.HelloPayload{
 			ProtocolVersion: snap.ProtocolVersion,
-			AgentVersion:    version.Version,
+			AgentVersion:    version,
 			ResticVersion:   snap.ResticVersion,
 			Hostname:        snap.Hostname,
 			OS:              snap.OS,
@@ -140,12 +135,10 @@ func run() error {
 	}

 	d := &dispatcher{
-		resticBin:                 resticBin,
-		resticVer:                 snap.ResticVersion,
-		resticSupportsNoOwnership: resticSupportsNoOwnership,
-		serverURL:                 cfg.ServerURL,
-		secrets:                   sec,
-		scheduler:                 scheduler.New(),
+		resticBin: resticBin,
+		resticVer: snap.ResticVersion,
+		secrets:   sec,
+		scheduler: scheduler.New(),
 	}
 	if err := wsclient.Run(ctx, wsCfg, d.handle); err != nil {
 		return fmt.Errorf("ws run: %w", err)
@@ -207,12 +200,10 @@ func openSecretsStore(cfg *config.Config) (*secrets.Store, error) {
 // secrets store on each job — config.update writes through to disk,
 // so a job dispatched in the same session sees the latest values.
 type dispatcher struct {
-	resticBin                 string
-	resticVer                 string // e.g. "0.17.1"; empty if restic isn't installed yet
-	resticSupportsNoOwnership bool   // captured at startup from `restic restore --help`
-	serverURL                 string // base URL of the server (used by the self-update fetch)
-	secrets                   *secrets.Store
-	scheduler                 *scheduler.Scheduler
+	resticBin string
+	resticVer string // e.g. "0.17.1"; empty if restic isn't installed yet
+	secrets   *secrets.Store
+	scheduler *scheduler.Scheduler

 	// Bandwidth caps in KB/s pushed via config.update. Mutated under
 	// bwMu by the config.update handler; read by runJob when building
@@ -392,12 +383,10 @@ func (d *dispatcher) handle(ctx context.Context, env api.Envelope, tx wsclient.S
 				"up_kbps", up, "down_kbps", down)
 		}

-	case api.MsgCommandUpdate:
-		var p api.CommandUpdatePayload
-		if err := env.UnmarshalPayload(&p); err != nil {
-			return fmt.Errorf("command.update: %w", err)
-		}
-		go d.runUpdate(ctx, p, tx)
+	case api.MsgAgentUpdateAvail:
+		var p api.AgentUpdateAvailablePayload
+		_ = env.UnmarshalPayload(&p)
+		slog.Info("ws agent: update available", "version", p.LatestVersion, "url", p.PackageURL)

 	default:
 		slog.Debug("ws agent: ignored message", "type", env.Type)
@@ -471,47 +460,17 @@ func (d *dispatcher) handleTreeList(ctx context.Context, reqID string, p api.Tre
 	reply(api.TreeListResultPayload{Entries: apiEntries})
 }

-// failJob ships a synthetic job.started + job.finished(failed) pair
-// for a command.run we couldn't even spawn locally — missing restic
-// binary, missing credentials, or a malformed payload. Without these
-// envelopes the server has no way to know the job will never produce
-// output: the row sits in "running", the live stream stays stuck on
-// "awaiting agent output," and a subsequent command.cancel arrives
-// for a job_id the agent never registered (we log "unknown job"
-// because trackJob was never called). Sending a terminal envelope
-// here closes the loop on both fronts.
-func failJob(p api.CommandRunPayload, tx wsclient.Sender, errMsg string) {
-	now := time.Now().UTC()
-	if startedEnv, err := api.Marshal(api.MsgJobStarted, p.JobID, api.JobStartedPayload{
-		JobID: p.JobID, Kind: p.Kind, StartedAt: now,
-	}); err == nil {
-		_ = tx.Send(startedEnv)
-	}
-	if finEnv, err := api.Marshal(api.MsgJobFinished, p.JobID, api.JobFinishedPayload{
-		JobID:      p.JobID,
-		Status:     api.JobFailed,
-		ExitCode:   -1,
-		FinishedAt: now,
-		Error:      errMsg,
-	}); err == nil {
-		_ = tx.Send(finEnv)
-	}
-}
-
 // runJob spawns a runner for one job. We launch a goroutine so the
 // WS read loop keeps draining messages while restic chugs along.
 func (d *dispatcher) runJob(ctx context.Context, p api.CommandRunPayload, tx wsclient.Sender) error {
 	if d.resticBin == "" {
-		failJob(p, tx, "restic binary not located on this agent")
 		return fmt.Errorf("restic binary not located on this agent")
 	}
 	creds, err := d.secrets.Load()
 	if err != nil {
-		failJob(p, tx, "load repo credentials: "+err.Error())
 		return fmt.Errorf("load repo credentials: %w", err)
 	}
 	if creds.Empty() {
-		failJob(p, tx, "repo credentials not configured (waiting for server config.update push)")
 		return fmt.Errorf("repo credentials not configured (waiting for server config.update push)")
 	}
 	// r is the everyday runner — bound to the host's repo
@@ -535,14 +494,13 @@ func (d *dispatcher) runJob(ctx context.Context, p api.CommandRunPayload, tx wsc
 	}

 	r := runner.New(runner.Config{
-		ResticBin:                  d.resticBin,
-		ResticVersion:              d.resticVer,
-		RepoURL:                    creds.URL,
-		RepoUsername:               creds.Username,
-		RepoPassword:               creds.Password,
-		SupportsRestoreNoOwnership: d.resticSupportsNoOwnership,
-		LimitUploadKBps:            upKBps,
-		LimitDownloadKBps:          downKBps,
+		ResticBin:         d.resticBin,
+		ResticVersion:     d.resticVer,
+		RepoURL:           creds.URL,
+		RepoUsername:      creds.Username,
+		RepoPassword:      creds.Password,
+		LimitUploadKBps:   upKBps,
+		LimitDownloadKBps: downKBps,
 	}, tx, time.Second)

 	// spawn wraps the kind-specific goroutine: derives a per-job
@@ -598,7 +556,6 @@ func (d *dispatcher) runJob(ctx context.Context, p api.CommandRunPayload, tx wsc
 			// policy fallback was specced but skipped — see the
 			// Phase 5 plan rationale and version.go's lockstep-deploy
 			// note for why.
-			failJob(p, tx, "forget: command.run carried no forget_groups (server didn't populate them)")
 			return fmt.Errorf("forget: command.run carried no forget_groups (server didn't populate them)")
 		}
 		groups := make([]restic.ForgetGroup, 0, len(p.ForgetGroups))
@@ -633,14 +590,13 @@ func (d *dispatcher) runJob(ctx context.Context, p api.CommandRunPayload, tx wsc
 			runCreds = ac
 		}
 		prr := runner.New(runner.Config{
-			ResticBin:                  d.resticBin,
-			ResticVersion:              d.resticVer,
-			RepoURL:                    runCreds.URL,
-			RepoUsername:               runCreds.Username,
-			RepoPassword:               runCreds.Password,
-			SupportsRestoreNoOwnership: d.resticSupportsNoOwnership,
-			LimitUploadKBps:            upKBps,
-			LimitDownloadKBps:          downKBps,
+			ResticBin:         d.resticBin,
+			ResticVersion:     d.resticVer,
+			RepoURL:           runCreds.URL,
+			RepoUsername:      runCreds.Username,
+			RepoPassword:      runCreds.Password,
+			LimitUploadKBps:   upKBps,
+			LimitDownloadKBps: downKBps,
 		}, tx, time.Second)
 		slog.Info("agent: accepting prune job", "job_id", p.JobID, "admin_creds", p.RequiresAdminCreds)
 		spawn("prune", func(jobCtx context.Context) error {
@@ -662,16 +618,13 @@ func (d *dispatcher) runJob(ctx context.Context, p api.CommandRunPayload, tx wsc
 		})
 	case api.JobRestore:
 		if p.Restore == nil {
-			failJob(p, tx, "restore: command.run carried no restore payload")
 			return fmt.Errorf("restore: command.run carried no restore payload")
 		}
 		rp := *p.Restore
 		if rp.SnapshotID == "" {
-			failJob(p, tx, "restore: snapshot_id is required")
 			return fmt.Errorf("restore: snapshot_id is required")
 		}
 		if !rp.InPlace && rp.TargetDir == "" {
-			failJob(p, tx, "restore: target_dir required for non-in-place restore")
 			return fmt.Errorf("restore: target_dir required for non-in-place restore")
 		}
 		slog.Info("agent: accepting restore job",
@@ -682,7 +635,6 @@ func (d *dispatcher) runJob(ctx context.Context, p api.CommandRunPayload, tx wsc
 		})
 	case api.JobDiff:
 		if p.Diff == nil || p.Diff.SnapshotA == "" || p.Diff.SnapshotB == "" {
-			failJob(p, tx, "diff: command.run carried incomplete diff payload")
 			return fmt.Errorf("diff: command.run carried incomplete diff payload")
 		}
 		dp := *p.Diff
@@ -692,7 +644,6 @@ func (d *dispatcher) runJob(ctx context.Context, p api.CommandRunPayload, tx wsc
 			return r.RunDiff(jobCtx, p.JobID, dp.SnapshotA, dp.SnapshotB)
 		})
 	default:
-		failJob(p, tx, fmt.Sprintf("kind %q not implemented on this agent", p.Kind))
 		return fmt.Errorf("kind %q not implemented yet (Phase 2 lands the rest)", p.Kind)
 	}
 	return nil
@@ -1,65 +0,0 @@
-package main
-
-import (
-	"context"
-	"fmt"
-	"log/slog"
-	"time"
-
-	"gitea.dcglab.co.uk/steve/restic-manager/internal/agent/updater"
-	"gitea.dcglab.co.uk/steve/restic-manager/internal/agent/wsclient"
-	"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
-)
-
-// runUpdate handles a server-dispatched command.update. It logs progress
-// via log.stream so the live job page captures pre-restart state, then
-// calls the platform updater. On Linux the updater calls os.Exit; on
-// Windows it spawns a detached helper and returns, with the agent then
-// exiting.
-//
-// The terminal job state is set by the server, not the agent: success
-// is "agent re-hellos with matching version" rather than anything the
-// agent itself can assert. The only `job.finished` we send from here is
-// on the failure path, before any restart attempt.
-func (d *dispatcher) runUpdate(ctx context.Context, p api.CommandUpdatePayload, tx wsclient.Sender) {
-	logf := func(format string, args ...any) {
-		line := fmt.Sprintf(format, args...)
-		slog.Info("ws agent: update: " + line)
-		env, err := api.Marshal(api.MsgLogStream, "", api.LogStreamLine{
-			JobID:   p.JobID,
-			TS:      time.Now().UTC(),
-			Stream:  api.LogStdout,
-			Payload: line,
-		})
-		if err == nil {
-			_ = tx.Send(env)
-		}
-	}
-
-	startedEnv, err := api.Marshal(api.MsgJobStarted, "", api.JobStartedPayload{
-		JobID:     p.JobID,
-		Kind:      api.JobUpdate,
-		StartedAt: time.Now().UTC(),
-	})
-	if err == nil {
-		_ = tx.Send(startedEnv)
-	}
-
-	logf("fetching new binary from %s", d.serverURL)
-	if err := updater.Update(ctx, d.serverURL); err != nil {
-		logf("update failed: %v", err)
-		finishedEnv, mErr := api.Marshal(api.MsgJobFinished, "", api.JobFinishedPayload{
-			JobID:      p.JobID,
-			Status:     api.JobFailed,
-			FinishedAt: time.Now().UTC(),
-			Error:      err.Error(),
-		})
-		if mErr == nil {
-			_ = tx.Send(finishedEnv)
-		}
-		return
-	}
-	// Unreachable on Linux (Update calls os.Exit). On Windows control
-	// returns here while the detached helper does the swap-and-restart;
-	// the agent then exits cleanly so SCM hands off.
-}
@@ -9,7 +9,6 @@ import (
 	"os"
 	"os/signal"
 	"path/filepath"
-	"strings"
 	"syscall"
 	"time"

@@ -18,17 +17,15 @@ import (
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/crypto"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/notification"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/config"
-	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/fleetupdate"
 	rmhttp "gitea.dcglab.co.uk/steve/restic-manager/internal/server/http"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/maintenance"
-	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
-	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/oidc"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ui"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ws"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
-	"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
 )

+var version = "dev"
+
 func main() {
 	if err := run(); err != nil {
 		slog.Error("server fatal", "err", err)
@@ -42,7 +39,7 @@ func run() error {
 	flag.Parse()

 	if *showVersion {
-		fmt.Printf("restic-manager-server %s (commit %s, built %s)\n", version.Version, version.Commit, version.Date)
+		fmt.Println("restic-manager-server", version)
 		return nil
 	}

@@ -86,28 +83,15 @@ func run() error {

 	hub := ws.NewHub()
 	jobHub := ws.NewJobHub()
-	metricsRegistry := metrics.NewRegistry()

 	notifHub := notification.NewHub(st, aead, cfg.BaseURL)
 	alertEngine := alert.NewEngine(st, notifHub)
-	updateWatcher := ws.NewUpdateWatcher(st, alertEngine, jobHub)

 	renderer, err := ui.New()
 	if err != nil {
 		return fmt.Errorf("ui: %w", err)
 	}

-	var oidcClient *oidc.Client
-	if cfg.OIDC != nil {
-		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
-		defer cancel()
-		oidcClient, err = oidc.New(ctx, cfg.OIDC, cfg.BaseURL)
-		if err != nil {
-			return fmt.Errorf("oidc: %w", err)
-		}
-		slog.Info("oidc enabled", "issuer", cfg.OIDC.Issuer, "display", cfg.OIDC.DisplayName)
-	}
-
 	deps := rmhttp.Deps{
 		Cfg:             cfg,
 		Store:           st,
@@ -116,11 +100,8 @@ func run() error {
 		JobHub:          jobHub,
 		AlertEngine:     alertEngine,
 		NotificationHub: notifHub,
-		UpdateWatcher:   updateWatcher,
 		UI:              renderer,
-		Version:         version.Version,
-		OIDC:            oidcClient,
-		Metrics:         metricsRegistry,
+		Version:         version,
 	}

 	// First-run bootstrap: if the users table is empty, mint a one-time
@@ -141,38 +122,22 @@ func run() error {
 		// text exactly once; we hash it into BootstrapToken on the
 		// server-side handler.
 		fmt.Fprintln(os.Stderr, "================================================================")
-		fmt.Fprintln(os.Stderr, "  FIRST RUN — no admin user exists yet.")
-		if cfg.BaseURL != "" {
-			fmt.Fprintln(os.Stderr, "  Open this URL in a browser to create the first administrator:")
-			fmt.Fprintln(os.Stderr, "    "+strings.TrimRight(cfg.BaseURL, "/")+"/bootstrap")
-		} else {
-			fmt.Fprintln(os.Stderr, "  Open the server URL in a browser; you'll be sent to /bootstrap.")
-			fmt.Fprintln(os.Stderr, "  (Set RM_BASE_URL to have a clickable link printed here.)")
-		}
-		fmt.Fprintln(os.Stderr, "")
-		fmt.Fprintln(os.Stderr, "  Headless? POST {token, username, password} to /api/bootstrap")
-		fmt.Fprintln(os.Stderr, "  with this one-shot bootstrap token (valid until first user exists):")
+		fmt.Fprintln(os.Stderr, "  FIRST RUN — bootstrap token (use within 1 hour, then it's gone):")
 		fmt.Fprintln(os.Stderr, "    "+token)
+		fmt.Fprintln(os.Stderr, "  POST it to /api/bootstrap with {token, username, password}.")
 		fmt.Fprintln(os.Stderr, "================================================================")
 	}

 	srv := rmhttp.New(deps)

-	// Fleet-update worker — built after the HTTP server because the
-	// dispatcher delegates back into srv.DispatchHostUpdate.
-	fleetWorker := fleetupdate.NewWorker(st, hub,
-		&serverDispatcher{srv: srv}, alertEngine)
-	srv.SetFleetWorker(fleetWorker)
-
 	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
 	defer stop()

 	go alertEngine.Run(ctx)
-	go updateWatcher.Run(ctx)

 	errCh := make(chan error, 1)
 	go func() {
-		slog.Info("server listening", "addr", cfg.Listen, "version", version.Version)
+		slog.Info("server listening", "addr", cfg.Listen, "version", version)
 		errCh <- srv.Start()
 	}()

@@ -227,7 +192,6 @@ func run() error {
 				}
 			case <-pendingDrainTick.C:
 				srv.DrainAllDue(ctx)
-				srv.RunCatchupsDue(ctx)
 			case <-pendingExpiryTick.C:
 				if n, err := st.DeleteExpiredPendingHosts(ctx, time.Now().UTC()); err == nil && n > 0 {
 					slog.Info("expired pending hosts swept", "n", n)
@@ -262,12 +226,3 @@ func run() error {
 	}
 	return nil
 }
-
-// serverDispatcher adapts the http.Server's DispatchHostUpdate method
-// to the fleetupdate.Dispatcher interface. Lives in main so the
-// http and fleetupdate packages don't need to know about each other.
-type serverDispatcher struct{ srv *rmhttp.Server }
-
-func (d *serverDispatcher) DispatchUpdate(ctx context.Context, hostID, actorUserID string) (string, string, error) {
-	return d.srv.DispatchHostUpdate(ctx, hostID, actorUserID)
-}
@@ -1,17 +1,14 @@
 # syntax=docker/dockerfile:1.7

 # ---- Build stage --------------------------------------------------------
-# Cross-compiles:
-#   * the server binary for the image's TARGETARCH (linux/amd64 or arm64),
-#   * three agent binaries (linux/amd64, linux/arm64, windows/amd64) that
-#     the running server hands out via /agent/binary.
-# Pure-Go SQLite (modernc.org/sqlite) means CGO stays off; static binaries
-# run on distroless/static.
-FROM --platform=$BUILDPLATFORM golang:1.25-alpine AS build
+FROM golang:1.25-alpine AS build

 WORKDIR /src

+# Pure-Go SQLite (modernc.org/sqlite) means we can keep CGO off and build a
+# fully static binary that runs on distroless/static.
 ENV CGO_ENABLED=0 \
+    GOOS=linux \
    GOFLAGS="-trimpath"

 # Cache module downloads in a separate layer.
@@ -21,45 +18,9 @@ RUN go mod download
 COPY . .

 ARG VERSION=dev
-ARG COMMIT=none
-ARG DATE=unknown
-ARG TARGETOS
-ARG TARGETARCH
-
-ENV VERSION_PKG="gitea.dcglab.co.uk/steve/restic-manager/internal/version"
-ENV LDFLAGS="-s -w \
-    -X ${VERSION_PKG}.Version=${VERSION} \
-    -X ${VERSION_PKG}.Commit=${COMMIT} \
-    -X ${VERSION_PKG}.Date=${DATE}"
-
-# Server: built for the image's runtime arch.
-RUN GOOS=${TARGETOS} GOARCH=${TARGETARCH} \
-    go build -ldflags="${LDFLAGS}" \
-        -o /out/restic-manager-server \
-        ./cmd/server
-
-# Empty /data skeleton so the runtime image carries an existing,
-# nonroot-owned mount point. Docker copies that ownership onto a
-# named volume the first time it's created, which avoids the
-# "permission denied" trap on /data/secret.key when the operator
-# uses a default `volumes: { rm-data: {} }` declaration.
-RUN mkdir -p /out/data
-
-# Agents: identical across image arches — an arm64 server image still
-# ships an amd64 agent binary for amd64 endpoints to download.
-RUN mkdir -p /out/agent-binaries && \
-    GOOS=linux GOARCH=amd64 \
-        go build -ldflags="${LDFLAGS}" \
-            -o /out/agent-binaries/restic-manager-agent-linux-amd64 \
-            ./cmd/agent && \
-    GOOS=linux GOARCH=arm64 \
-        go build -ldflags="${LDFLAGS}" \
-            -o /out/agent-binaries/restic-manager-agent-linux-arm64 \
-            ./cmd/agent && \
-    GOOS=windows GOARCH=amd64 \
-        go build -ldflags="${LDFLAGS}" \
-            -o /out/agent-binaries/restic-manager-agent-windows-amd64.exe \
-            ./cmd/agent
+RUN go build -ldflags="-s -w -X main.version=${VERSION}" \
+    -o /out/restic-manager-server \
+    ./cmd/server

 # ---- Runtime stage ------------------------------------------------------
 FROM gcr.io/distroless/static-debian12:nonroot
@@ -70,22 +31,7 @@ LABEL org.opencontainers.image.licenses="PolyForm-Noncommercial-1.0.0"
 USER nonroot:nonroot
 WORKDIR /

-# Server binary on PATH.
 COPY --from=build /out/restic-manager-server /usr/local/bin/restic-manager-server

-# Image-baked bundled assets (P5-03). Read-only; the /agent/binary and
-# /install/* handlers fall back here when <DataDir>/... is empty, so a
-# fresh container Just Works without first-run staging. Operators can
-# still drop a custom build under <DataDir>/agent-binaries/<name> to
-# override per-host.
-COPY --from=build --chmod=0755 /out/agent-binaries/ /opt/restic-manager/dist/agent-binaries/
-COPY --chmod=0755 deploy/install/install.sh /opt/restic-manager/dist/install/install.sh
-COPY --chmod=0644 deploy/install/install.ps1 /opt/restic-manager/dist/install/install.ps1
-COPY --chmod=0644 deploy/install/restic-manager-agent.service /opt/restic-manager/dist/install/restic-manager-agent.service
-
-# Pre-created data dir owned by nonroot so a fresh named volume
-# inherits the right ownership.
-COPY --from=build --chown=nonroot:nonroot /out/data /data
-
 EXPOSE 8443
 ENTRYPOINT ["/usr/local/bin/restic-manager-server"]
@@ -1,52 +1,21 @@
 # Reference deployment for the restic-manager control plane.
-# Mirrors spec.md §10.1 and the P5-07 reference deployment.
+# Mirrors spec.md §10.1. Adjust image tag and RM_BASE_URL for your env.
 #
-# Scope: this compose stands up the server only. TLS termination and
-# the public hostname belong to a reverse proxy that lives outside
-# this stack (Caddy, Traefik, nginx, HAProxy, your existing edge —
-# whatever you already operate). See `docs/reverse-proxy.md` for the
-# headers + CIDRs that proxy needs to forward.
-#
-# Architecture:
-#   * The server speaks plain HTTP on :8080.
-#   * The agent binaries + install scripts ship inside the image under
-#     /opt/restic-manager/dist/, so /agent/binary and /install/*
-#     serve out of the box without first-run staging.
-#   * The named volume holds *only* operator state (sqlite,
-#     secrets.enc, audit log, the AEAD key). Image upgrades replace
-#     the agents/scripts; the volume is untouched.
-#   * Pre-1.0 releases never publish :latest — pin to an exact
-#     vX.Y.Z tag and bump deliberately.
-#
-# Before first start:
-#   1. Pick a version: export RM_VERSION=vX.Y.Z (or substitute below).
-#   2. Set RM_BASE_URL to the public HTTPS URL the external proxy
-#      serves on.
-#   3. Set RM_TRUSTED_PROXY to the IP/CIDR the proxy connects from
-#      (the X-Forwarded-* headers are honoured only when the immediate
-#      peer matches one of these).
-
+# The server speaks plain HTTP. Front it with a TLS-terminating
+# reverse proxy (Caddy/Traefik/nginx). RM_TRUSTED_PROXY must contain
+# the proxy's IP/CIDR so X-Forwarded-* headers are honoured.
 services:
  restic-manager:
-    image: gitea.dcglab.co.uk/steve/restic-manager:${RM_VERSION:?set RM_VERSION to a vX.Y.Z tag}
+    image: ghcr.io/dcglab/restic-manager:latest
    restart: unless-stopped
-    # Bind to localhost only — your reverse proxy reaches the server
-    # over loopback (or, if it runs in a separate compose / on
-    # another host, swap this for an internal docker network or a
-    # private LAN bind).
+    # Bind to localhost only — the proxy is what the public reaches.
    ports:
      - "127.0.0.1:8080:8080"
    volumes:
-      - rm-data:/data
+      - ./data:/data
    environment:
      - RM_DATA_DIR=/data
      - RM_LISTEN=:8080
-      - RM_BASE_URL=${RM_BASE_URL:?set RM_BASE_URL to the public https URL}
+      - RM_BASE_URL=https://restic.lab.example
      - RM_SECRET_KEY_FILE=/data/secret.key
-      - RM_TRUSTED_PROXY=${RM_TRUSTED_PROXY:?set RM_TRUSTED_PROXY to the proxy CIDR}
-      # Cookies are Secure by default; keep that. Override only for
-      # local-HTTP smoke tests.
-      # - RM_COOKIE_SECURE=true
-
-volumes:
-  rm-data:
+      - RM_TRUSTED_PROXY=172.16.0.0/12
@@ -1,325 +0,0 @@
-{
-  "annotations": {
-    "list": [
-      {
-        "builtIn": 1,
-        "datasource": { "type": "grafana", "uid": "-- Grafana --" },
-        "enable": true,
-        "hide": true,
-        "iconColor": "rgba(0, 211, 255, 1)",
-        "name": "Annotations & Alerts",
-        "type": "dashboard"
-      }
-    ]
-  },
-  "description": "restic-manager fleet overview. Imports against any Prometheus data source.",
-  "editable": true,
-  "fiscalYearStartMonth": 0,
-  "graphTooltip": 0,
-  "id": null,
-  "links": [],
-  "liveNow": false,
-  "panels": [
-    {
-      "id": 1,
-      "title": "Fleet status",
-      "type": "stat",
-      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
-      "gridPos": { "h": 6, "w": 6, "x": 0, "y": 0 },
-      "fieldConfig": {
-        "defaults": {
-          "color": { "mode": "thresholds" },
-          "thresholds": {
-            "mode": "absolute",
-            "steps": [
-              { "color": "red", "value": null },
-              { "color": "green", "value": 1 }
-            ]
-          },
-          "unit": "short"
-        },
-        "overrides": []
-      },
-      "options": {
-        "colorMode": "value",
-        "graphMode": "area",
-        "justifyMode": "auto",
-        "orientation": "auto",
-        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
-        "textMode": "auto"
-      },
-      "targets": [
-        {
-          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
-          "expr": "rm_hosts_online",
-          "legendFormat": "online",
-          "refId": "A"
-        },
-        {
-          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
-          "expr": "rm_hosts_total",
-          "legendFormat": "total",
-          "refId": "B"
-        }
-      ]
-    },
-    {
-      "id": 2,
-      "title": "Open alerts",
-      "type": "stat",
-      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
-      "gridPos": { "h": 6, "w": 6, "x": 6, "y": 0 },
-      "fieldConfig": {
-        "defaults": {
-          "color": { "mode": "thresholds" },
-          "thresholds": {
-            "mode": "absolute",
-            "steps": [
-              { "color": "green", "value": null },
-              { "color": "yellow", "value": 1 },
-              { "color": "red", "value": 5 }
-            ]
-          },
-          "unit": "short"
-        },
-        "overrides": []
-      },
-      "options": {
-        "colorMode": "value",
-        "graphMode": "none",
-        "orientation": "horizontal",
-        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
-        "textMode": "auto"
-      },
-      "targets": [
-        {
-          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
-          "expr": "sum by (severity) (rm_active_alerts)",
-          "legendFormat": "{{severity}}",
-          "refId": "A"
-        }
-      ]
-    },
-    {
-      "id": 3,
-      "title": "Backups failing (last reported run)",
-      "type": "stat",
-      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
-      "gridPos": { "h": 6, "w": 6, "x": 12, "y": 0 },
-      "fieldConfig": {
-        "defaults": {
-          "color": { "mode": "thresholds" },
-          "thresholds": {
-            "mode": "absolute",
-            "steps": [
-              { "color": "green", "value": null },
-              { "color": "red", "value": 1 }
-            ]
-          },
-          "unit": "short"
-        },
-        "overrides": []
-      },
-      "options": {
-        "colorMode": "value",
-        "graphMode": "area",
-        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
-        "textMode": "auto"
-      },
-      "targets": [
-        {
-          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
-          "expr": "count(rm_host_last_backup_success == 0)",
-          "legendFormat": "failing",
-          "refId": "A"
-        }
-      ]
-    },
-    {
-      "id": 4,
-      "title": "Hosts",
-      "type": "table",
-      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
-      "gridPos": { "h": 10, "w": 24, "x": 0, "y": 6 },
-      "fieldConfig": {
-        "defaults": {
-          "custom": { "align": "auto", "displayMode": "auto" }
-        },
-        "overrides": [
-          {
-            "matcher": { "id": "byName", "options": "Value #B" },
-            "properties": [
-              { "id": "displayName", "value": "Last backup (s ago)" },
-              { "id": "unit", "value": "s" }
-            ]
-          },
-          {
-            "matcher": { "id": "byName", "options": "Value #C" },
-            "properties": [
-              { "id": "displayName", "value": "Repo size" },
-              { "id": "unit", "value": "bytes" }
-            ]
-          },
-          {
-            "matcher": { "id": "byName", "options": "Value #D" },
-            "properties": [
-              { "id": "displayName", "value": "Snapshots" }
-            ]
-          },
-          {
-            "matcher": { "id": "byName", "options": "Value #A" },
-            "properties": [
-              { "id": "displayName", "value": "Online" }
-            ]
-          },
-          {
-            "matcher": { "id": "byName", "options": "Value #E" },
-            "properties": [
-              { "id": "displayName", "value": "Open alerts" }
-            ]
-          }
-        ]
-      },
-      "options": { "showHeader": true },
-      "transformations": [
-        {
-          "id": "merge",
-          "options": {}
-        }
-      ],
-      "targets": [
-        {
-          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
-          "expr": "rm_host_agent_online",
-          "format": "table",
-          "instant": true,
-          "refId": "A"
-        },
-        {
-          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
-          "expr": "time() - rm_host_last_backup_timestamp_seconds",
-          "format": "table",
-          "instant": true,
-          "refId": "B"
-        },
-        {
-          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
-          "expr": "rm_host_repo_size_bytes",
-          "format": "table",
-          "instant": true,
-          "refId": "C"
-        },
-        {
-          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
-          "expr": "rm_host_snapshot_count",
-          "format": "table",
-          "instant": true,
-          "refId": "D"
-        },
-        {
-          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
-          "expr": "rm_host_open_alerts",
-          "format": "table",
-          "instant": true,
-          "refId": "E"
-        }
-      ]
-    },
-    {
-      "id": 5,
-      "title": "Repo size over time",
-      "type": "timeseries",
-      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
-      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 },
-      "fieldConfig": {
-        "defaults": {
-          "color": { "mode": "palette-classic" },
-          "custom": {
-            "axisLabel": "",
-            "drawStyle": "line",
-            "fillOpacity": 10,
-            "lineWidth": 1,
-            "pointSize": 5,
-            "showPoints": "never"
-          },
-          "unit": "bytes"
-        },
-        "overrides": []
-      },
-      "options": {
-        "legend": { "calcs": ["last"], "displayMode": "list", "placement": "bottom", "showLegend": true },
-        "tooltip": { "mode": "multi", "sort": "desc" }
-      },
-      "targets": [
-        {
-          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
-          "expr": "rm_host_repo_size_bytes",
-          "legendFormat": "{{host}}",
-          "refId": "A"
-        }
-      ]
-    },
-    {
-      "id": 6,
-      "title": "Job duration p95 (last 1h, by kind)",
-      "type": "timeseries",
-      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
-      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 },
-      "fieldConfig": {
-        "defaults": {
-          "color": { "mode": "palette-classic" },
-          "custom": {
-            "drawStyle": "line",
-            "fillOpacity": 5,
-            "lineWidth": 1,
-            "pointSize": 4,
-            "showPoints": "never"
-          },
-          "unit": "s"
-        },
-        "overrides": []
-      },
-      "options": {
-        "legend": { "calcs": ["last"], "displayMode": "list", "placement": "bottom", "showLegend": true },
-        "tooltip": { "mode": "multi", "sort": "desc" }
-      },
-      "targets": [
-        {
-          "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
-          "expr": "histogram_quantile(0.95, sum by (kind, le) (rate(rm_job_duration_seconds_bucket[1h])))",
-          "legendFormat": "{{kind}}",
-          "refId": "A"
-        }
-      ]
-    }
-  ],
-  "refresh": "30s",
-  "schemaVersion": 39,
-  "style": "dark",
-  "tags": ["restic-manager", "backups"],
-  "templating": {
-    "list": [
-      {
-        "current": {},
-        "hide": 0,
-        "includeAll": false,
-        "label": "Prometheus",
-        "multi": false,
-        "name": "DS_PROMETHEUS",
-        "options": [],
-        "query": "prometheus",
-        "refresh": 1,
-        "regex": "",
-        "skipUrlSync": false,
-        "type": "datasource"
-      }
-    ]
-  },
-  "time": { "from": "now-6h", "to": "now" },
-  "timepicker": {},
-  "timezone": "",
-  "title": "restic-manager — fleet",
-  "uid": "rm-fleet-overview",
-  "version": 1,
-  "weekStart": ""
-}
@@ -49,10 +49,12 @@ detect_arch() {
 ensure_dirs() {
  install -d -m 0700 -o root -g root "$RM_CONFIG_DIR"
  install -d -m 0700 -o root -g root "$RM_STATE_DIR"
-  # Default new-directory restore target: $HOME/rm-restore. With the
-  # current unit (ProtectSystem=full, no ReadWritePaths pin) the agent
-  # can mkdir anywhere on real filesystems, so this is just a courtesy
-  # pre-create so the wizard's default lands in a tidy spot.
+  # Default new-directory restore target: $HOME/rm-restore. Pre-create
+  # it so the systemd unit's ReadWritePaths bind-mount applies cleanly.
+  # (A path missing when systemd starts only soft-fails thanks to the
+  # '-' prefix, but the agent then cannot mkdir it under the read-only
+  # /root.) Mode 0700 and root ownership match the threat model: files
+  # restored here are readable by the operator as root.
  install -d -m 0700 -o root -g root /root/rm-restore
 }

@@ -33,31 +33,17 @@ CapabilityBoundingSet=CAP_DAC_READ_SEARCH CAP_DAC_OVERRIDE CAP_FOWNER CAP_CHOWN
 AmbientCapabilities=CAP_DAC_READ_SEARCH CAP_DAC_OVERRIDE CAP_FOWNER CAP_CHOWN

 # Hardening — blocks privilege escalation even from root, and
-# confines kernel / namespace / privilege surface. Filesystem reads
-# stay open (that's the whole job) and restore writes are
-# unrestricted: a backup tool whose entire purpose is "put files
-# back where they belong" can't have ProtectHome=read-only or
-# ProtectSystem=strict without breaking on the first cross-user
-# restore. ProtectSystem=full keeps /usr, /boot, /efi read-only so a
-# compromised agent can't swap out /usr/bin/restic or drop a kernel
-# module, while leaving /home, /root, /var, /opt, /srv, /tmp etc.
-# writable for arbitrary restore targets. The agent is treated as a
-# high-trust component (it runs operator hooks as root and holds
-# repo credentials); the residual hardening is about kernel + privesc
-# protection, not write confinement.
+# confines writes / network / kernel access to what restic actually
+# needs. Filesystem reads stay open: that's the whole job.
 NoNewPrivileges=true
-ProtectSystem=full
-# ProtectSystem=full mounts /usr, /boot, /efi *and* /etc read-only.
-# The agent rewrites /etc/restic-manager/agent.yaml on enrolment and
-# whenever a new SecretsKey is minted, so we need a targeted
-# write-exemption for that dir. No exemption for the rest of /etc:
-# the agent has no business editing /etc/passwd, /etc/sudoers, etc.
-#
-# /usr/local/bin is writable so the self-update flow (P6-01) can
-# atomic-rename a fresh binary over the running one. Permitting the
-# whole directory (rather than just the binary path) is required
-# because os.Rename takes a write lock on the parent dir.
-ReadWritePaths=/etc/restic-manager /usr/local/bin
+ProtectSystem=strict
+# /etc/restic-manager: agent.yaml + secrets.enc.
+# /var/lib/restic-manager: agent state (currently unused but reserved).
+# /root/rm-restore: default target for new-directory restores
+#   ($HOME/rm-restore/<job-id>/ resolves here for User=root).
+#   ReadWritePaths overrides ProtectHome=read-only on this subdir only.
+ReadWritePaths=/etc/restic-manager /var/lib/restic-manager -/root/rm-restore
+ProtectHome=read-only
 ProtectHostname=true
 ProtectKernelTunables=true
 ProtectKernelModules=true
@@ -1,249 +0,0 @@
-# Onboarding a new host — agent instructions
-
-How an automation agent (with a username + password for the
-restic-manager server) brings a new host fully online.
-
-The flow is two roles:
-
- **Controller side**: the agent calls JSON APIs on the
-  restic-manager server. Needs network reach to the server, plus
-  username/password.
- **Target side**: the host being onboarded runs the install
-  script, which calls back to the server with the one-time token.
-
-If the agent is *both* sides (e.g. it can SSH into the target),
-it does steps 1–2 against the server and steps 3–4 against the
-target. If the agent only controls the server, it stops at
-step 2 and hands the install snippet to whoever owns the target.
-
---
-
-## Conventions
-
- Base URL: `$RM_SERVER` (e.g. `https://restic.lab.example`).
- Session cookie jar: persist `rm_session` between calls.
- All request/response bodies are JSON unless noted.
- On any non-2xx, response body is
-  `{"code": "...", "message": "..."}`.
-
---
-
-## 1. Login
-
-```
-POST $RM_SERVER/api/auth/login
-Content-Type: application/json
-
-{"username": "...", "password": "..."}
-```
-
-→ 200 with `{"user_id": "...", "role": "..."}` and a `Set-Cookie:
-rm_session=...` (HttpOnly, 24h TTL). Persist the cookie; reuse
-it on every subsequent call.
-
-Required role for the next step: **operator** or **admin**.
-A viewer-only login can read but cannot mint tokens.
-
-Session expires at 24h. On 401 from a later call, re-login.
-
---
-
-## 2. Mint an enrolment token
-
-```
-POST $RM_SERVER/api/enrollment-tokens
-Cookie: rm_session=...
-Content-Type: application/json
-
-{
-  "hostname":      "newhost.example",
-  "tags":          ["prod", "london"],          // optional
-  "repo_url":      "rest:https://rest.example/newhost",
-  "repo_username": "...",                        // optional, for rest-server / S3
-  "repo_password": "...",                        // optional
-  "initial_paths": ["/etc", "/home", "/var/lib"] // optional; default source group
-}
-```
-
-→ 200 with:
-
-```json
-{ "token": "<RAW_ONE_TIME_TOKEN>", "expires_at": "2026-05-09T..." }
-```
-
-**Capture `token` immediately — the server only stores its hash
-and will never return the raw value again.** TTL is 1 hour.
-
-The repo creds you provided are encrypted under the token hash
-and pre-attached to the host. The agent will fetch and store
-them at enrol-time; you will not need to push them again.
-
-If you lose the token before the install runs, mint a new one
-(the existing one becomes irrelevant; you can leave it to expire
-or revoke it via the UI).
-
---
-
-## 3. Install on the target host
-
-The install script is hosted by the server itself. Running on the
-target:
-
-### Linux
-
-```
-curl -fsSL $RM_SERVER/install/install.sh | \
-  sudo RM_SERVER=$RM_SERVER RM_TOKEN=<RAW_ONE_TIME_TOKEN> bash
-```
-
-What it does, end-to-end:
-
-1. detects arch (amd64 / arm64)
-2. downloads `$RM_SERVER/agent/binary?os=linux&arch=<arch>` to
-   `/usr/local/bin/restic-manager-agent`
-3. creates `/etc/restic-manager/` and `/var/lib/restic-manager/`
-   (root:root, 0700)
-4. calls `POST /api/agents/enroll` with the token; server returns
-   the persistent agent bearer + `host_id`, written to
-   `/etc/restic-manager/agent.env`
-5. installs the systemd unit, `daemon-reload`, `enable --now`
-6. surfaces any pre-existing restic cron/timer entries so the
-   operator can decide whether to disable them (script does
-   *not* touch them automatically)
-
-The script is idempotent. Re-running on an already-enrolled host
-is a no-op unless `RM_FORCE_REENROLL=1`.
-
-The agent runs as **root** by design — fleet backup needs to
-read every file on the system. See
-`deploy/install/restic-manager-agent.service` for rationale.
-
-### Windows
-
-```
-iwr $RM_SERVER/install/install.ps1 -UseBasicParsing | iex
-# (or download + run; needs an elevated PowerShell)
-# Required env: $env:RM_SERVER, $env:RM_TOKEN
-```
-
-Same flow, lays down a Windows service instead of a systemd unit.
-
-### Manual / non-script enrolment
-
-If the install script can't be used, the wire-level enrol call is:
-
-```
-POST $RM_SERVER/api/agents/enroll
-Content-Type: application/json
-
-{
-  "token":          "<RAW_ONE_TIME_TOKEN>",
-  "hostname":       "newhost.example",
-  "os":             "linux",                  // linux | windows
-  "arch":           "amd64",                  // amd64 | arm64
-  "agent_version":  "...",
-  "restic_version": "..."
-}
-```
-
-→ 200 with
-`{"host_id": "...", "agent_token": "...", "cert_pin_sha256": "..."}`.
-
-The agent_token goes into `/etc/restic-manager/agent.env` as
-`RM_AGENT_TOKEN=...`; subsequent agent → server traffic uses
-`Authorization: Bearer $RM_AGENT_TOKEN`.
-
---
-
-## 4. Verify the host is healthy
-
-Poll until both conditions are true. Cap at ~5 minutes.
-
-```
-GET $RM_SERVER/api/hosts
-Cookie: rm_session=...
-```
-
-→ array of host objects. Find the one with the matching hostname
-and check:
-
- `"status": "online"` — agent connected to the WS heartbeat
- `"repo_status": "ready"` — `restic init` (or existing-config
-  detection) completed successfully
-
-If `repo_status` settles on `"init_failed"`, the repo creds are
-wrong or the repo URL is unreachable from the target. Inspect
-the matching job log:
-
-```
-GET $RM_SERVER/api/hosts/<host_id>/jobs   (most recent init job)
-GET $RM_SERVER/api/jobs/<job_id>          (full output)
-```
-
-Fix the creds with a creds-update call (see Settings → Repo on
-the UI for the exact route — currently form-only) or revoke the
-host and start over.
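The polling rule above can be sketched as a small decision function. This is an illustrative sketch, not the project's client: it assumes each host object carries a `hostname` field next to `status` and `repo_status`, and it treats the first sighting of `init_failed` as final. `fetch_hosts` and `sleep` are hypothetical callables supplied by the caller (for example a wrapper around `GET /api/hosts` and `time.sleep`).

```python
from typing import Callable, Optional


def classify_host(hosts: list, hostname: str) -> str:
    """Classify one poll result as 'healthy', 'failed' or 'waiting'."""
    host: Optional[dict] = next(
        (h for h in hosts if h.get("hostname") == hostname), None
    )
    if host is None:
        return "waiting"  # not enrolled yet
    if host.get("repo_status") == "init_failed":
        return "failed"  # bad creds or unreachable repo
    if host.get("status") == "online" and host.get("repo_status") == "ready":
        return "healthy"
    return "waiting"


def wait_until_healthy(
    fetch_hosts: Callable[[], list],
    hostname: str,
    sleep: Callable[[float], None],
    attempts: int = 30,
    interval: float = 10.0,
) -> str:
    """Poll up to attempts * interval seconds (5 minutes by default)."""
    for _ in range(attempts):
        verdict = classify_host(fetch_hosts(), hostname)
        if verdict != "waiting":
            return verdict
        sleep(interval)
    return "timeout"
```

Passing `sleep` in keeps the loop testable without real delays.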
-
---
-
-## 5. (Optional) configure schedules
-
-A new host gets one default source group covering `initial_paths`
-(or `/etc`, `/home` if you didn't pass any) and **no schedule**.
-Backups won't run until either:
-
- a schedule is attached (cron expression, retention, etc.), or
- you trigger an on-demand run via the source-group "Run now"
-  endpoint.
-
-These are not yet exposed cleanly as JSON-only routes; if the
-agent needs them, look at `internal/server/http/schedules*.go`
-and `internal/server/http/source_groups*.go`. Most are
-JSON-capable; some are form-only with HTML 303 responses.
-
---
-
-## Failure modes — quick reference
-
-| Symptom | Likely cause | Fix |
-|---|---|---|
-| `401` on `/api/enrollment-tokens` | session expired or viewer role | re-login as operator+ |
-| install.sh fails at "enrol": HTTP 410 | token expired (>1h) or already used | mint a fresh token |
-| Host shows `status=offline` after install | systemd unit didn't start; firewall blocks WS | `systemctl status restic-manager-agent`, check `$RM_SERVER` reachability |
-| `repo_status=init_failed` | bad repo creds or URL | inspect init job log; fix creds; retry probe via `/hosts/{id}/repo/probe` |
-| Token list grows with stale rows | normal — they expire at 1h | optional cleanup via `/hosts/enrollment-tokens/{hash}/revoke` |
-
---
-
-## Minimum reproducible script
-
-```bash
-#!/usr/bin/env bash
-set -euo pipefail
-: "${RM_SERVER:?}" "${RM_USER:?}" "${RM_PASS:?}" "${RM_HOSTNAME:?}" \
-  "${RM_REPO_URL:?}" "${RM_REPO_USER:?}" "${RM_REPO_PASS:?}"
-
-JAR=$(mktemp)
-trap 'rm -f "$JAR"' EXIT
-
-# 1. login
-curl -fsS -c "$JAR" -H 'Content-Type: application/json' \
-  -d "{\"username\":\"$RM_USER\",\"password\":\"$RM_PASS\"}" \
-  "$RM_SERVER/api/auth/login" >/dev/null
-
-# 2. mint token
-TOKEN=$(curl -fsS -b "$JAR" -H 'Content-Type: application/json' \
-  -d "$(jq -nc \
-        --arg h "$RM_HOSTNAME" --arg u "$RM_REPO_USER" \
-        --arg p "$RM_REPO_PASS" --arg r "$RM_REPO_URL" \
-        '{hostname:$h, repo_url:$r, repo_username:$u, repo_password:$p}')" \
-  "$RM_SERVER/api/enrollment-tokens" | jq -r .token)
-
-# 3. emit the install snippet for the target machine
-cat <<EOF
-Run on $RM_HOSTNAME (as root):
-
-  curl -fsSL $RM_SERVER/install/install.sh | \\
-    sudo RM_SERVER=$RM_SERVER RM_TOKEN=$TOKEN bash
-EOF
-```
@@ -1,19 +0,0 @@
-[book]
-title = "restic-manager"
-description = "Self-hosted control plane for restic backups across a fleet of Linux and Windows endpoints."
-authors = ["Steve Cliff"]
-language = "en-GB"
-multilingual = false
-src = "src"
-
-[output.html]
-default-theme = "ayu"
-preferred-dark-theme = "ayu"
-git-repository-url = "https://gitea.dcglab.co.uk/steve/restic-manager"
-git-repository-icon = "fa-code-fork"
-edit-url-template = "https://gitea.dcglab.co.uk/steve/restic-manager/_edit/main/docs/book/{path}"
-no-section-label = false
-
-[output.html.fold]
-enable = true
-level = 2
@@ -1,40 +0,0 @@
-# Summary
-
-[Introduction](./intro.md)
-
-# Getting started
-
- [Installing the server](./getting-started/install.md)
- [Enrolling your first host](./getting-started/enrolling-hosts.md)
- [Running behind a reverse proxy](./getting-started/reverse-proxy.md)
-
-# Concepts
-
- [Architecture](./concepts/architecture.md)
- [Credentials and how they flow](./concepts/credentials.md)
- [Schedules and source groups](./concepts/schedules-and-source-groups.md)
- [Repo maintenance](./concepts/repo-maintenance.md)
-
-# Operations
-
- [Backups and restores](./operations/backups-and-restores.md)
- [Alerts and notifications](./operations/alerts.md)
- [Observability with Prometheus](./operations/observability.md)
- [Updating agents](./operations/updates.md)
-
-# Security
-
- [Threat model](./security/threat-model.md)
- [Hardening checklist](./security/hardening.md)
- [Reporting vulnerabilities](./security/disclosure.md)
-
-# Reference
-
- [Environment variables](./reference/env-vars.md)
- [HTTP endpoints](./reference/http-endpoints.md)
-
---
-
-[Contributing](./contributing.md)
-[Roadmap](./roadmap.md)
-[License](./license.md)
@@ -1,121 +0,0 @@
-# Architecture
-
-## Components
-
-```
-┌────────────────────────────────────────────────────────────┐
-│  Server (control plane, single process)                    │
-│   * chi-based HTTP API + HTMX server-rendered UI           │
-│   * WebSocket hub for agent fan-out + browser fan-out      │
-│   * SQLite store (modernc.org/sqlite, pure Go)             │
-│   * AEAD encryption helpers                                │
-│   * Alert engine + notification hub                        │
-└────────────┬───────────────────────────────────┬───────────┘
-             │ outbound WS only                   │ HTTP(S)
-             │                                    │
-┌────────────▼─────────────┐         ┌────────────▼─────────────┐
-│  Agent (per host)        │         │  Browser (operator)      │
-│   * coder/websocket      │         │   * htmx + a tiny bit    │
-│   * cron for schedules   │         │     of vanilla JS for    │
-│   * restic wrapper       │         │     live job updates     │
-│   * sysinfo collector    │         └──────────────────────────┘
-└────────────┬─────────────┘
-             │ subprocess: restic ...
-             │
-┌────────────▼─────────────────────────────────────────────────┐
-│  restic repository (rest-server, S3, B2, SFTP, local …)      │
-│  Backup data flows directly here. Server never touches it.   │
-└──────────────────────────────────────────────────────────────┘
-```
-
-## Why outbound-only WebSockets?
-
-The agent dials the server on `/ws/agent` with a bearer token. The
-server doesn't initiate connections to the agent. Three reasons:
-
-1. **Firewall friendliness.** Nothing on the endpoint needs an
-   inbound port; this works behind the typical "branch office NAT"
-   without router config.
-2. **Single auth point.** The bearer token is the only credential
-   that crosses the boundary; the agent never accepts an
-   incoming socket.
-3. **Reconnect semantics are simpler.** When the connection drops
-   (NAT timeout, server restart, transient network glitch) the
-   agent backs off and re-dials; the server marks the host
-   offline after 90s and lets the alert engine raise a stale-host
-   alert.
-
-## Why SQLite?
-
-High availability is an explicit non-goal, which is what makes
-SQLite a fit. A small control plane managing twelve endpoints does
-not need replication or a separate database tier. SQLite gives us:
-
- A single file to back up (plus the secret key).
- Hand-rolled migrations under `internal/store/migrations/` —
-  no migration framework lock-in.
- `WAL` mode plus per-connection foreign-key enforcement.
-
-The migrations define the entire schema; there's no ORM or
-query-builder layer between Go code and SQL.
-
-## Why the agent runs `restic` itself, not via the server
-
-The control plane never holds backup bytes in flight. That's
-deliberate:
-
- A compromised control plane cannot exfiltrate snapshot
-  contents in-band — at worst it can dispatch new backup or
-  forget jobs (audit-logged) but the data path is between the
-  agent and the repository.
- The same agent process can target whichever transport restic
-  natively supports (rest-server, S3, B2, SFTP, local), no
-  separate mux on the server side.
-
-## Job lifecycle
-
-```
-            ┌──────────────────────┐
-operator →  │ POST /hosts/{id}/    │
-            │       run-backup     │
-            └──────────┬───────────┘
-                       │   1. INSERT INTO jobs (status='queued')
-                       │   2. dispatch command.run over WS
-                       ▼
-            ┌──────────────────────┐
-            │ Agent dispatches     │
-            │ restic subprocess    │
-            └──────────┬───────────┘
-                       │
-                       │   3. job.started   ───▶ store.MarkJobStarted
-                       │   4. job.progress  ───▶ JobHub broadcast (live UI)
-                       │   5. log.stream    ───▶ append to job_logs
-                       │   6. job.finished  ───▶ store.MarkJobFinished
-                       │                          + alert engine eval
-                       │                          + (P6) metrics histogram
-                       ▼
-                  terminal: succeeded | failed | cancelled
-```
-
-Operators see live updates because the browser subscribes to
-`/api/jobs/{id}/stream`, and the WS handler broadcasts each
-agent-emitted envelope to all live subscribers in addition to
-persisting it.
-
-## What scheduling looks like
-
- The agent runs a local `robfig/cron/v3` instance.
- The server pushes the desired schedule set to the agent on
-  hello + after every CRUD change.
- When the agent's cron fires, it sends `schedule.fire` to the
-  server. The server creates a job row, sends `command.run` back,
-  and the agent dispatches a normal backup.
- If the WS drops between fire and run, the server queues the
-  schedule firing into `pending_runs` and drains on agent
-  reconnect — no missed scheduled backups due to network blips.
-
-For everything that isn't a backup (forget, prune, check), the
-server runs a 60-second maintenance ticker against
-`host_repo_maintenance` rows and dispatches the relevant command
-when a cadence is due. The agent's local cron only handles
-backups.
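The queue-and-drain behaviour described for scheduled backups can be modelled in a few lines. This is a toy sketch under stated assumptions: an in-memory deque stands in for the `pending_runs` table, `sent` stands in for `command.run` envelopes written to the WebSocket, and the class and method names are hypothetical.

```python
from collections import deque


class ScheduleDispatcher:
    """Toy model of queueing schedule firings while an agent is offline."""

    def __init__(self) -> None:
        self.connected = set()        # host ids with a live socket
        self.pending_runs = deque()   # stand-in for the pending_runs table
        self.sent = []                # stand-in for command.run envelopes

    def on_schedule_fire(self, host_id: str, schedule_id: str) -> None:
        # If the socket is gone by the time we want to send command.run,
        # park the firing instead of dropping it.
        if host_id in self.connected:
            self.sent.append((host_id, schedule_id))
        else:
            self.pending_runs.append((host_id, schedule_id))

    def on_agent_connect(self, host_id: str) -> None:
        self.connected.add(host_id)
        still_waiting = deque()
        while self.pending_runs:
            run = self.pending_runs.popleft()
            if run[0] == host_id:
                self.sent.append(run)   # drain in original firing order
            else:
                still_waiting.append(run)
        self.pending_runs = still_waiting

    def on_agent_disconnect(self, host_id: str) -> None:
        self.connected.discard(host_id)
```

Only firings for the reconnecting host are drained; entries for other offline hosts stay queued.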
@@ -1,98 +0,0 @@
-# Credentials and how they flow
-
-restic-manager handles three credential surfaces:
-
-1. **Operator credentials** — the username + password (or OIDC
-   identity) that logs into the UI.
-2. **Agent bearer tokens** — issued at enrolment, used by the
-   agent to authenticate its WebSocket to the server.
-3. **Repo credentials** — the rest-server / S3 / B2 / SFTP
-   credentials the agent passes to `restic` itself.
-
-Each has a different threat model and storage strategy.
-
-## Operator credentials
-
- Local users are stored in `users` with a bcrypt password hash.
- Sessions are random tokens minted at login, stored hashed in
-  the `sessions` table, expired after 24h. Cookie is HttpOnly,
-  SameSite=Lax, and Secure (when `RM_COOKIE_SECURE=true`,
-  default).
- OIDC users carry `auth_source='oidc'` and an `oidc_subject`
-  pinning their IdP identity. Local password login is rejected
-  for OIDC users.
- Disabling a user soft-deletes them via `disabled_at` —
-  pre-existing sessions are invalidated on the next request.
-
-## Agent bearer tokens
-
- Minted at enrolment, hashed at rest with `auth.HashToken`.
- The plaintext token only exists in memory at enrolment time
-  and on the agent's filesystem (`/etc/restic-manager/agent.yaml`,
-  mode `0600`, owned by the service user).
- Compromise of the server DB leaks the hashes, which is enough
-  to *log in as that agent* until you revoke. Compromise of the
-  agent host leaks the plaintext (via the config file) — same
-  end result.
- Rotation: re-enrol the host. Today there's no in-place rotate;
-  the operator deletes the host (which cascades, including
-  revoking the bearer hash) and re-runs the install command.
-
-## Repo credentials
-
-This is the credential that ultimately matters for backup
-integrity. restic-manager keeps two slots per host:
-
- **The everyday credential** (`host_credentials.kind = ''`).
-  Append-only-friendly: this is the one your backup schedule
-  uses. It can write but not delete or forget.
- **The admin credential** (`host_credentials.kind = 'admin'`).
-  Has full delete rights. Only pushed to the agent transiently
-  while a `prune` or `forget` job is dispatching, and discarded
-  by the agent after the job ends.
-
-### Encryption flow
-
-1. Operator types the credential into the UI or the install form.
-2. Server AEAD-encrypts the cred (`crypto.AEAD.Encrypt`) using the
-   key in `RM_SECRET_KEY_FILE`. The plaintext is dropped from
-   memory.
-3. Encrypted blob is stored in `host_credentials.cred_blob`.
-4. When the agent connects, the server decrypts the blob and
-   sends the **plaintext** down the WebSocket inside a
-   `config.update` envelope.
-5. The agent stores the plaintext in its in-memory secrets store
-   for the lifetime of the process; it's reloaded fresh on every
-   server-side push.
-6. When a job runs, the agent merges the credential into the
-   restic environment (`restic.Env.RepoURL` stays bare; the
-   `user:pass@…` form is built only inside `envSlice()` at the
-   moment of `exec.Command`).
-
-The merged form is **never logged**. Any URL that reaches the
-structured `slog` output is passed through `restic.RedactURL()`
-first.
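A minimal sketch of the "merge late, redact always" idea, assuming a `rest:` repository URL. The functions `merge_credentials` and `redact_url` are hypothetical stand-ins for illustration; they are not the project's `envSlice()` or `restic.RedactURL()`, whose exact output format is not shown in this document.

```python
from urllib.parse import quote, urlsplit, urlunsplit

REST_PREFIX = "rest:"


def merge_credentials(repo_url: str, username: str, password: str) -> str:
    """Build the user:pass@host form only at the moment it is needed."""
    if not repo_url.startswith(REST_PREFIX):
        raise ValueError("this sketch only handles rest: repositories")
    parts = urlsplit(repo_url[len(REST_PREFIX):])
    # Percent-encode so '@' or ':' in a password cannot break the URL.
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    merged = parts._replace(netloc=f"{userinfo}@{parts.netloc}")
    return REST_PREFIX + urlunsplit(merged)


def redact_url(url: str) -> str:
    """Replace any userinfo with a placeholder before the URL is logged."""
    prefix = REST_PREFIX if url.startswith(REST_PREFIX) else ""
    parts = urlsplit(url[len(prefix):])
    if "@" not in parts.netloc:
        return url
    host = parts.netloc.rsplit("@", 1)[1]
    return prefix + urlunsplit(parts._replace(netloc=f"***@{host}"))
```

Keeping the stored URL bare means an accidental log of the configured value leaks nothing, and the redactor is a second line of defence for the merged form.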
-
-### Why push plaintext over the wire?
-
-The transport itself is the trust boundary: the WebSocket runs
-inside the same TLS-terminated reverse-proxy connection your
-browser uses, and the agent has already authenticated with its
-bearer token. Re-encrypting the payload on top of that would just
-move the key-management problem somewhere else.
-
-If your reverse proxy isn't TLS-terminated, the deployment is
-already broken — see [Hardening](../security/hardening.md).
-
-## Setup tokens (admin-driven)
-
-When an admin creates a new user, the server mints a one-time
-setup link valid for 1 hour. The hash is stored; the raw token
-is shown to the admin once. The user opens the link, sets a
-password, and is dropped into a session. Expired tokens are
-swept on the alert engine's 60s tick.
-
-Same pattern for enrolment tokens: the raw token only exists in
-memory at mint time, and the install snippet is the operator's
-only chance to capture it. If you lose it, regenerate via the
-**Add host** page (NS-02).
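The store-the-hash, show-the-raw-token-once pattern can be sketched as follows. Assumptions are loud here: SHA-256 is used as a stand-in because this document does not say what `auth.HashToken` does, the in-memory dict stands in for the database table, and `TokenStore` and its methods are hypothetical names.

```python
import hashlib
import secrets

TTL_SECONDS = 3600  # the 1 hour validity described above


def hash_token(raw: str) -> str:
    # Assumption: SHA-256 as a placeholder for the real hashing helper.
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class TokenStore:
    """Keeps only token hashes; the raw value is returned once at mint time."""

    def __init__(self) -> None:
        self._rows = {}  # token hash -> expiry (seconds)

    def mint(self, now: float) -> str:
        raw = secrets.token_urlsafe(32)
        self._rows[hash_token(raw)] = now + TTL_SECONDS
        return raw  # caller must capture this; it cannot be recovered later

    def redeem(self, raw: str, now: float) -> bool:
        # pop() makes the token single-use, whether or not it was still valid.
        expires_at = self._rows.pop(hash_token(raw), None)
        return expires_at is not None and now < expires_at

    def sweep(self, now: float) -> int:
        expired = [h for h, exp in self._rows.items() if exp <= now]
        for h in expired:
            del self._rows[h]
        return len(expired)
```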
@@ -1,85 +0,0 @@
-# Repo maintenance
-
-Backups go in; without maintenance, repos grow forever and
-eventually fall over. restic-manager runs three maintenance
-operations on a per-host cadence:
-
-| Command  | What it does                                                | Default cadence |
-|----------|-------------------------------------------------------------|-----------------|
-| `forget` | Marks snapshots eligible for removal per the retention policy attached to each source group. Cheap; runs append-only. | Daily after the last backup of the day |
-| `prune`  | Reclaims space from the repo. Requires the **admin** credential (write+delete). | Weekly, off-peak |
-| `check`  | Verifies repo integrity. Sub-options surface lock state. | Weekly, with `--read-data-subset N%` to sample pack files |
-
-A per-host row in `host_repo_maintenance` holds the
-cron expressions and last-fire anchors. The maintenance ticker on
-the server runs every 60s, finds hosts whose next-fire is due,
-and dispatches the right command. The agent's local cron is
-**only** for backups.
-
-## Why server-side and not agent-side?
-
-The agent's cron knows about backups because backups are
-per-source-group. Maintenance is per-repo, not per-source-group,
-so doing it server-side keeps the per-host wiring simple:
-
- One ticker, not N agent crons to keep in sync.
- Cancelling a maintenance dispatch is just "don't dispatch the
-  next one" — no agent-side state to clean up.
- Skipping offline hosts is trivial (no queue; only scheduled
-  *backups* queue into `pending_runs`).
-
-## Forget and the multi-group payload
-
-A single `forget` job can target several source groups at once.
-The wire envelope (`ForgetGroups`) carries one entry per group,
-each with its retention policy. The agent runs N
-`restic forget --tag <name> --keep-...` invocations in sequence,
-streams their output, and reports a single terminal status.
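As a rough illustration of the multi-group payload, the sketch below turns a list of groups into one `restic forget` argument vector per group. The payload shape (a `name` plus a `retention` mapping with `keep_*` keys) is an assumption borrowed from the source-group YAML in these docs, not the actual `ForgetGroups` wire format; the `--tag` and `--keep-*` flags are standard restic options.

```python
KEEP_DIMENSIONS = ("last", "hourly", "daily", "weekly", "monthly", "yearly")


def forget_commands(groups: list) -> list:
    """Return one `restic forget` argv per source group, in payload order."""
    commands = []
    for group in groups:
        argv = ["restic", "forget", "--tag", group["name"]]
        for dimension in KEEP_DIMENSIONS:
            count = group.get("retention", {}).get(f"keep_{dimension}")
            if count:  # skip unset or zero dimensions
                argv += [f"--keep-{dimension}", str(count)]
        commands.append(argv)
    return commands
```

An agent following the description above would run these sequentially and fold the individual exit codes into a single terminal status.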
-
-## Prune and the admin credential
-
-Prune mutates the repo. The everyday append-only credential
-**cannot** prune — that's the whole point of append-only.
-restic-manager keeps a second slot per host (`kind = 'admin'`)
-for the credential that can.
-
-When a prune is dispatched (cadence-driven or operator-driven):
-
-1. Server pushes the admin credential to the agent in a fresh
-   `config.update`.
-2. Agent runs `restic prune` with the merged credential.
-3. Job finishes; agent discards the admin credential from its
-   in-memory secrets store.
-
-The server never logs the merged URL (see
-[Credentials](./credentials.md)).
-
-## Check and lock state
-
-`restic check` warns about stale locks when it finds them. The
-agent ships every check's output back as a `repo.stats` envelope
-and a stream of log lines; if a stale lock is detected, the
-**Repo** page surfaces a banner with an **Unlock** button. The
-operator-only `unlock` command runs `restic unlock` and clears
-the banner.
-
-`unlock` has no cadence — it's a manual action, never automatic.
-Auto-unlocking would mask the cause (probably a previously
-crashed long-running operation) and risk corrupting an
-operation the operator has merely lost track of.
-
-## Repo stats
-
-After every backup, check, prune, and unlock, the agent runs
-`restic stats --json --mode raw-data` and ships the result as a
-`repo.stats` envelope. The server stores this in
-`host_repo_stats` (latest only) and `host_repo_stats_history`
-(one row per host per day, last-write-wins per column — a
-prune-only patch never nulls a backup-time size).
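The "last-write-wins per column" rule for the daily history row can be sketched like this. The dict keyed by `(host_id, day)` stands in for `host_repo_stats_history`, and the column names in the usage example are hypothetical.

```python
def apply_stats_patch(history: dict, host_id: str, day: str, patch: dict) -> dict:
    """Merge a partial stats report into the (host, day) row.

    Columns present in the patch overwrite earlier values; columns that are
    missing or None leave the stored value untouched.
    """
    row = history.setdefault((host_id, day), {})
    for column, value in patch.items():
        if value is not None:
            row[column] = value
    return row
```

For example, a backup-time report of `{"total_size": 100}` followed by a prune-time report of `{"total_size": None, "last_prune_at": "03:00"}` leaves `total_size` at 100.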
-
-The host detail page surfaces:
-
- Total size + raw size in the vitals strip.
- Last-check timestamp + colour-coded status.
- Last-prune timestamp.
- 30/90-day repo size trend chart.
@@ -1,105 +0,0 @@
-# Schedules and source groups
-
-Two related but separable ideas:
-
- A **source group** is a named bundle of "what to back up":
-  include paths, exclude patterns, retention policy, retry
-  configuration, optional pre/post hooks. The group's name is
-  used as the restic snapshot tag, so retention can target it
-  with `restic forget --tag <name>`.
- A **schedule** is a cron expression that, when it fires,
-  triggers a backup of one or more source groups on a host.
-
-Decoupling them means you can have one schedule covering several
-groups (e.g. `0 1 * * *` running both `system` and `data`), and
-each group has its own retention without duplicating policy
-across schedules.
-
-## Source group anatomy
-
-```yaml
-name: data
-includes:
-  - /var/lib/postgresql
-  - /home
-excludes:
-  - /home/*/.cache
-  - /home/*/Downloads
-retention:
-  keep_last: 7
-  keep_daily: 14
-  keep_weekly: 4
-  keep_monthly: 6
-retry_max: 3
-retry_backoff_seconds: 600
-pre_hook: |
-  pg_dump -U postgres -F c -f /var/lib/postgresql/dumps/all.dump
-post_hook: |
-  rm -f /var/lib/postgresql/dumps/all.dump
-```
-
-### Conflict detection
-
-If your retention policy says `keep_hourly: 24` but no schedule
-points at this group sub-daily, the UI surfaces a
-**conflict-dimension banner** ("`hourly` won't be honoured —
-no schedule fires more often than once a day"). The flag is
-stored on the source group (`conflict_dimension`) and refreshed
-whenever a schedule or group changes.
-
-### Hooks
-
-`pre_hook` and `post_hook` run on the agent host inside
-`/bin/sh -c` (`cmd.exe /C` on Windows). Output is streamed back
-to the live job log as `hook(<phase>): …` lines.
-
- A non-zero `pre_hook` exit aborts the backup.
- `post_hook` always runs, with `RM_JOB_STATUS=succeeded|failed`
-  in the environment. Use this for cleanup that must happen
-  whether the backup worked or not.
- Hooks only run for `kind=backup` jobs. They do not run for
-  `forget`, `prune`, `check`, etc.
- Hook bodies are AEAD-encrypted at rest, with encryption applied
-  at the HTTP layer; the agent receives plaintext over the WS channel.
-
-A "host default" pair of hooks lives on the host itself; a
-source group's own hooks override them when set.
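The hook ordering rules can be sketched as below, for POSIX hosts only. This is a hedged illustration, not the agent's code: `run_backup_job` and `run_hook` are hypothetical names, `backup` is any callable returning True on success, and the sketch assumes `post_hook` still runs when `pre_hook` aborts the job (the text says it "always runs" but does not spell out that case).

```python
import os
import subprocess
from typing import Callable, Optional


def run_hook(script: str, extra_env: Optional[dict] = None) -> int:
    """Run a hook body through /bin/sh -c and return its exit code."""
    env = dict(os.environ, **(extra_env or {}))
    return subprocess.run(["/bin/sh", "-c", script], env=env).returncode


def run_backup_job(
    pre_hook: Optional[str],
    post_hook: Optional[str],
    backup: Callable[[], bool],
) -> str:
    status = "failed"
    try:
        if pre_hook and run_hook(pre_hook) != 0:
            return status  # non-zero pre_hook aborts before the backup
        status = "succeeded" if backup() else "failed"
        return status
    finally:
        # Runs on every path, including the abort and exceptions in backup().
        if post_hook:
            run_hook(post_hook, {"RM_JOB_STATUS": status})
```

Putting the post hook in `finally` is what makes cleanup unconditional; the status variable is read at that point, so the hook sees the final outcome.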
-
-## Schedule anatomy
-
-```yaml
-cron: "0 2 * * *"
-enabled: true
-source_group_ids:
-  - <gid for "data">
-  - <gid for "system">
-```
-
-Slim by design: a schedule says **when** and **which groups**.
-Everything else (paths, retention, hooks) lives on the groups.
-
-The agent's local cron fires the schedule. If the WebSocket is
-down at fire time, the server queues the firing into
-`pending_runs` and drains it on the next agent reconnect — a
-short network blip won't lose the backup.
-
-### Last / next run
-
-The schedules tab shows "next" (computed by parsing the cron
-expression with `robfig/cron/v3`) and "last" (the latest
-`actor_kind=schedule` job in the `jobs` table) for every
-schedule. The dashboard host row also surfaces `next 12h ago/from
-now` when a single covering schedule is the run-now candidate.
-
-## Bandwidth limits
-
-Two places set restic's `--limit-upload` / `--limit-download`:
-
-1. **Host-wide caps** on the host row (`bandwidth_up_kbps`,
-   `bandwidth_down_kbps`). Pushed to the agent on hello and
-   after `PUT /api/hosts/{id}/bandwidth`. Apply to every restic
-   invocation on the host.
-2. **Per-job overrides** on the per-source-group Run-now form.
-   Win over host caps for the lifetime of that one job.
-
-If neither is set, restic runs unthrottled.
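The precedence between host-wide caps and per-job overrides can be sketched as a small resolver. The dictionary keys are hypothetical, and the sketch assumes the stored values can be passed straight to restic's `--limit-upload` and `--limit-download` flags (restic expects KiB/s; whether the `*_kbps` columns use the same unit is not stated here).

```python
def bandwidth_flags(host_caps: dict, job_overrides: dict) -> list:
    """Resolve restic bandwidth flags; a per-job override beats the host cap."""
    flags = []
    for key, flag in (("up", "--limit-upload"), ("down", "--limit-download")):
        value = job_overrides.get(key)
        if value is None:
            value = host_caps.get(key)
        if value:  # unset or zero means unthrottled in this sketch
            flags += [flag, str(value)]
    return flags
```

Each direction is resolved independently, so a job can override upload while still inheriting the host's download cap.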
@@ -1,17 +0,0 @@
-# Contributing
-
-Full contributor guide:
-[`CONTRIBUTING.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/CONTRIBUTING.md)
-in the repository root.
-
-The short version:
-
- Open an issue first for non-trivial changes; the design is
-  still moving and unsolicited large PRs may conflict with
-  in-flight work.
- `make lint test` must pass.
- One logical change per commit, no `Co-Authored-By` trailers.
- UK English in identifiers and comments; comments explain the
-  **why** not the **what**.
-
-Code of conduct: [`CODE_OF_CONDUCT.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/CODE_OF_CONDUCT.md).
@@ -1,113 +0,0 @@
-# Enrolling your first host
-
-The control plane only knows about hosts you've explicitly
-enrolled. Two paths exist:
-
-1. **Token-based enrolment** — admin generates a token, pastes it
-   into an install command on the host. The host appears immediately,
-   already mapped to the desired repo.
-2. **Announce-and-approve** — the agent runs without a token,
-   "announces" itself to the server, and a human in the UI accepts
-   the announcement.
-
-Token-based is the default and what most operators want; the
-announce flow exists for the case where you can't easily paste a
-secret onto the host (auto-imaged endpoints, scripted bring-ups
-from a config repo).
-
-## Token-based enrolment
-
-### From the UI
-
-1. Click **+ Add host** on the dashboard.
-2. Fill in the hostname, the restic repo URL, and the repo
-   credentials. The credentials are AEAD-encrypted at the server
-   immediately; what you paste is what the agent receives.
-3. Optionally pick the initial source paths — these become the
-   first source group on the host.
-4. Submit. The server mints a one-time token and shows you a
-   copy-pasteable install snippet.
-
-### On the host (Linux)
-
-```sh
-curl -fsSL https://restic.example.com/install/install.sh | \
-    sudo RM_SERVER=https://restic.example.com \
-         RM_ENROL_TOKEN=<token> \
-         bash
-```
-
-The script:
-
-1. Detects architecture (`amd64` or `arm64`).
-2. Downloads the agent binary from `/agent/binary?os=…&arch=…`.
-3. Drops the systemd unit at
-   `/etc/systemd/system/restic-manager-agent.service`.
-4. Runs the agent in `-enrol` mode, which posts the token and
-   stores the persistent bearer it gets back.
-5. Enables and starts the unit.
-
-Within seconds the host should appear on the dashboard as
-**online**.
-
-### On the host (Windows)
-
-```pwsh
-$env:RM_SERVER  = "https://restic.example.com"
-$env:RM_ENROL_TOKEN = "<token>"
-iwr -useb $env:RM_SERVER/install/install.ps1 | iex
-```
-
-Equivalent shape: registers a Windows service via the SCM
-(see P2-16 for details), runs `-enrol`, starts the service.
-
-## Recovering a lost token
-
-Tokens are single-use and short-lived (1h). If you closed the tab
-before pasting the install command, head to the **Add host** page —
-outstanding tokens are listed there with a **Regenerate** button.
-Regenerating revokes the old token's hash and mints a fresh raw
-token while preserving the original repo credentials and initial
-paths. (NS-02 in `tasks.md` if you want the design rationale.)
-
-## Announce-and-approve
-
-If the host can reach the server but you don't want to paste a
-secret on it, run the agent in `-announce` mode:
-
-```sh
-restic-manager-agent -announce \
-                     -server https://restic.example.com \
-                     -hostname myhost
-```
-
-The host appears in the **Pending hosts** panel on the dashboard
-with its hostname, OS, arch, and the source IP that announced it.
-Click **Accept**, fill in the repo URL + credentials, and the
-server pushes the bearer over the still-open WebSocket. No
-back-and-forth round trip.
-
-If you don't accept within an hour the announcement is swept.
-
-## What happens on the agent
-
-After enrolment, the agent:
-
-1. Connects via WebSocket to `/ws/agent` with its bearer token.
-2. Sends a `hello` envelope with its OS, arch, agent version,
-   restic version, and protocol version.
-3. Receives a `config.update` carrying its encrypted repo
-   credentials and any source-group paths.
-4. Sits idle, sending a heartbeat every 30s. Operator-driven
-   "Run now" actions arrive as `command.run` envelopes; scheduled
-   jobs are driven by the agent's local cron.
-
-## Auto-init of the repository
-
-The first time a backup runs, the agent invokes `restic init`
-against the repo you configured at enrolment. If the repo already
-exists (`config file already exists`) the agent treats it as a
-success and proceeds. The host's repo status (`unknown` →
-`ready` / `init_failed`) is surfaced under the vitals strip on
-the host detail page; if init fails, save fresh credentials in
-the **Repo** tab to retry.
@@ -1,92 +0,0 @@
-# Installing the server
-
-The reference deployment is a single Docker container fronted by
-your existing reverse proxy. The image bundles the server binary,
-the cross-compiled agent binaries, and the install scripts.
-
-## Prerequisites
-
-- A Linux host with Docker and Docker Compose.
-- A reverse proxy in front (Caddy, nginx, Traefik) terminating
-  TLS on a public hostname. The server itself is HTTP-only by
-  design; see [Reverse proxy](./reverse-proxy.md) for why.
-- A persistent volume for the server's data directory.
-
-## Quick start
-
-The reference compose file lives at
-[`deploy/docker-compose.yml`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/deploy/docker-compose.yml):
-
-```yaml
-services:
-  restic-manager:
-    image: gitea.dcglab.co.uk/steve/restic-manager:${RM_VERSION:-latest}
-    restart: unless-stopped
-    environment:
-      RM_LISTEN: ":8080"
-      RM_DATA_DIR: "/data"
-      RM_BASE_URL: "https://restic.example.com"
-      # Trust your reverse proxy's CIDR so X-Forwarded-* are honoured.
-      RM_TRUSTED_PROXY: "10.0.0.0/8"
-    volumes:
-      - rm-data:/data
-    ports:
-      # Bind localhost only — your reverse proxy is the public face.
-      - "127.0.0.1:8080:8080"
-
-volumes:
-  rm-data:
-```
-
-Bring it up:
-
-```sh
-docker compose up -d
-docker compose logs -f restic-manager
-```
-
-The first run prints a one-time **bootstrap token** to the log. Use
-it within an hour or it expires; if you miss the window, the
-container prints it again on the next start as long as no admin
-user exists.
-
-## First-run admin setup
-
-Open `https://restic.example.com/bootstrap` (or whatever your
-public URL is). Paste the bootstrap token, pick a username and a
-password (≥ 12 characters), and submit. You'll land in the
-dashboard logged in as the new admin.
-
-If you'd rather curl it, the equivalent is:
-
-```sh
-curl -X POST https://restic.example.com/api/bootstrap \
-     -H 'Content-Type: application/json' \
-     -d '{"token":"<token-from-log>","username":"admin","password":"<≥12 chars>"}'
-```
-
-## Backing up the secret key
-
-Inside the data volume, `secret.key` holds the AEAD key used to
-encrypt every credential at rest. **Back it up separately from
-the database.** Without it, encrypted credentials in the database
-are unrecoverable; you'd have to re-enrol every host.
-
-A simple working approach: copy `secret.key` to your password
-manager or to a separately-backed-up secrets vault the day you
-install. It doesn't change.
-
-## Updating the server
-
-```sh
-# Pin a new version in your compose file (.env or docker-compose.yml),
-# then:
-docker compose pull
-docker compose up -d
-```
-
-Migrations run automatically on startup; the server will refuse to
-start if a migration fails (better to bail than to half-migrate).
-
-For the agent self-update story, see
-[Updating agents](../operations/updates.md).
@@ -1,95 +0,0 @@
-# Running behind a reverse proxy
-
-The restic-manager server is HTTP-only by design. TLS termination,
-public hostname, ACME, HSTS, and edge-level rate limiting all
-belong to a reverse proxy you already operate outside this project.
-
-## What the proxy must forward
-
-The server reads four headers when (and only when) the immediate
-peer matches `RM_TRUSTED_PROXY`:
-
-| Header                 | Value                                              | Why |
-|------------------------|----------------------------------------------------|-----|
-| `X-Forwarded-For`      | The original client IP                             | Rate-limit keys, audit log entries, OIDC redirect-URI checks. |
-| `X-Forwarded-Proto`    | `https`                                            | Used for absolute URLs (e.g. OIDC redirect URIs). |
-| `Host`                 | The public hostname clients use                    | Cookies are scoped to this; `RM_BASE_URL` must match. |
-| `Connection` / `Upgrade` | Pass through unchanged                           | `/ws/agent` and `/api/jobs/{id}/stream` are WebSockets; without `Upgrade: websocket` they fail. |
-
-Set `RM_TRUSTED_PROXY` to the CIDR (or comma-separated list of
-CIDRs) the proxy connects from. Anything outside that range has
-its `X-Forwarded-*` headers ignored, so a stray request that
-bypasses the proxy can't spoof the client IP.
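-
-The trust rule above can be sketched in a few lines of Go. This is
-an illustration, not the server's actual code: `clientIP` is a
-hypothetical helper, and taking the right-most `X-Forwarded-For`
-entry assumes a single trusted proxy hop.
-
-```go
-package main
-
-import (
-	"fmt"
-	"net/netip"
-	"strings"
-)
-
-// clientIP returns the address a request should be attributed to.
-// peer is the immediate TCP peer; xff is the raw X-Forwarded-For value.
-// The header is consulted only when peer falls inside a trusted CIDR.
-func clientIP(peer netip.Addr, xff string, trusted []netip.Prefix) netip.Addr {
-	for _, p := range trusted {
-		if !p.Contains(peer) {
-			continue
-		}
-		// Trusted proxy: use the right-most entry, the one it appended.
-		parts := strings.Split(xff, ",")
-		last := strings.TrimSpace(parts[len(parts)-1])
-		if addr, err := netip.ParseAddr(last); err == nil {
-			return addr
-		}
-		break // header missing or malformed: fall back to the peer
-	}
-	return peer
-}
-
-func main() {
-	trusted := []netip.Prefix{netip.MustParsePrefix("10.0.0.0/8")}
-	// Request arriving through the proxy: the header is honoured.
-	fmt.Println(clientIP(netip.MustParseAddr("10.0.0.5"), "203.0.113.7", trusted))
-	// Request bypassing the proxy: the spoofed header is ignored.
-	fmt.Println(clientIP(netip.MustParseAddr("198.51.100.9"), "203.0.113.7", trusted))
-}
-```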
-
-## Caddy
-
-```caddyfile
-restic.example.com {
-    encode zstd gzip
-    reverse_proxy 127.0.0.1:8080 {
-        header_up X-Real-IP {remote_host}
-    }
-}
-```
-
-Caddy adds `X-Forwarded-For` / `X-Forwarded-Proto` automatically
-and passes WebSocket headers through by default, so this is the
-whole config.
-
-## nginx
-
-```nginx
-server {
-    listen 443 ssl http2;
-    server_name restic.example.com;
-
-    ssl_certificate     /etc/letsencrypt/live/restic.example.com/fullchain.pem;
-    ssl_certificate_key /etc/letsencrypt/live/restic.example.com/privkey.pem;
-
-    location / {
-        proxy_pass         http://127.0.0.1:8080;
-        proxy_http_version 1.1;
-        proxy_set_header   Host              $host;
-        proxy_set_header   X-Forwarded-For   $proxy_add_x_forwarded_for;
-        proxy_set_header   X-Forwarded-Proto https;
-
-        # WebSocket upgrade
-        proxy_set_header   Upgrade           $http_upgrade;
-        proxy_set_header   Connection        "upgrade";
-
-        # Long-lived agent WS — disable read timeout for this surface.
-        proxy_read_timeout 86400s;
-    }
-}
-```
-
-## Traefik
-
-```yaml
-http:
-  routers:
-    restic-manager:
-      rule: "Host(`restic.example.com`)"
-      entryPoints: [websecure]
-      tls:
-        certResolver: letsencrypt
-      service: restic-manager
-
-  services:
-    restic-manager:
-      loadBalancer:
-        servers:
-          - url: "http://restic-manager:8080"
-        passHostHeader: true
-```
-
-Traefik forwards WebSocket upgrades and the standard
-`X-Forwarded-*` set out of the box.
-
-## Verification
-
-After bringing the proxy up, the audit log should show your real
-client IP for an interactive login (not the proxy's local
-address). If you see `127.0.0.1` or the proxy's container IP, your
-`RM_TRUSTED_PROXY` is wrong or `X-Forwarded-For` isn't being
-forwarded.
@@ -1,86 +0,0 @@
-# restic-manager
-
-restic-manager is a self-hosted, browser-based single pane of glass
-for managing [restic](https://restic.net) backups across a fleet of
-Linux and Windows endpoints. It's designed for **small fleets**
-(the original target was twelve endpoints) and **one operator**.
-
-## What it does
-
-- Centralised view of every endpoint's last backup, repo size,
-  snapshot count, and recent jobs.
-- Trigger any restic operation remotely (`backup`, `forget`, `prune`,
-  `check`, `unlock`, `snapshots`, `stats`, `diff`, `restore`).
-- Per-host backup schedules with source groups (named bundles of
-  paths + retention policy).
-- Live job log streamed to the browser; downloadable as text or NDJSON.
-- Restore wizard with snapshot tree browse + path selection.
-- Repo-level health surfacing (size, raw size, last-check, lock
-  state) plus a 30/90-day size trend.
-- Alerting over webhook, ntfy, or SMTP.
-- Cross-platform agent (Linux + Windows).
-- Works with append-only repo credentials, with a separate admin
-  credential for forget/prune.
-
-## What it isn't
-
-- **Not a SaaS.** Single-instance, single-tenant, by design.
-- **Not a replacement for restic.** It's a control plane; the agent
-  shells out to a real `restic` binary.
-- **Not highly available.** SQLite, single process; if you need
-  HA backups, you're shopping in the wrong aisle.
-- **Not a multi-protocol backup tool.** restic only.
-
-## How it fits together
-
-```
-┌──────────────────────────────────────────────┐
-│  Server (control plane, Docker)              │
-│   - REST + WebSocket API                     │
-│   - SQLite store                             │
-│   - Embedded HTMX UI                         │
-└──────────┬─────────────────────────┬─────────┘
-           │ outbound WS              │ HTTP(S)
-           │                          │
-┌──────────▼──────────┐    ┌──────────▼─────────┐
-│  Agent (per host)   │    │  Browser (operator) │
-│   - restic wrapper  │    └─────────────────────┘
-│   - cron for sched. │
-└──────────┬──────────┘
-           │ restic
-┌──────────▼──────────────────────────────────┐
-│  rest-server / S3 / SFTP / local repo       │
-│  (the actual backup data — server never     │
-│   touches it)                               │
-└─────────────────────────────────────────────┘
-```
-
-The control plane is a Go binary that runs in Docker. Each endpoint
-runs a small Go agent that holds an outbound WebSocket to the
-control plane. Backup data flows directly between the agent and the
-restic repository — the control plane never sees a snapshot byte.
-
-## Where to start
-
-- [Installing the server](./getting-started/install.md) walks
-  through the Docker-based reference deployment.
-- [Enrolling your first host](./getting-started/enrolling-hosts.md)
-  covers the install scripts and the announce-and-approve flow.
-- [Architecture](./concepts/architecture.md) is the right read if
-  you want to know why something is the way it is before running
-  the install.
-
-## Project status
-
-Pre-1.0 but feature-complete for the original use case. Phases
-0–4 are landed (MVP, scheduling, restore, RBAC + OIDC); Phase 5
-(this docs site, contributor onboarding, end-to-end CI) is in
-flight. See [`tasks.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/tasks.md)
-for the live roadmap and [`spec.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/spec.md)
-for the canonical design doc.
-
-## License
-
-[PolyForm Noncommercial 1.0.0](https://polyformproject.org/licenses/noncommercial/1.0.0/).
-Personal and community deployments welcome; commercial use
-requires a separate license.
@@ -1,39 +0,0 @@
-# License
-
-restic-manager is licensed under
-[**PolyForm Noncommercial 1.0.0**](https://polyformproject.org/licenses/noncommercial/1.0.0/).
-The full text lives at
-[`LICENSE`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/LICENSE)
-in the repository root.
-
-## What this means
-
-- **Personal, hobbyist, educational, charitable, and similar
-  noncommercial use** is fully permitted, including modification
-  and redistribution.
-- **Commercial use is not permitted** without a separate
-  license. The maintainer is not currently offering one; if
-  you need commercial rights, open an issue to start the
-  conversation.
-- The license is permissive about everything except commercial
-  use: you can fork, modify, deploy in your home/lab, and
-  contribute back.
-
-## Why this license
-
-The PolyForm Noncommercial license was chosen because:
-
-- It's a real, legal, plainly-worded license (not a custom
-  half-written variant).
-- It permits the realistic uses for a hobby project (the
-  maintainer's homelab, a friend's fleet, a charity's IT
-  closet) without inviting commercial vendors to repackage
-  the work.
-- It's compatible with the project staying small and
-  maintainable: the maintainer doesn't want to be on the hook
-  for SLA-grade commercial support.
-
-## Contributions
-
-By contributing, you agree your contributions are licensed
-under the same PolyForm Noncommercial 1.0.0 license.
@@ -1,73 +0,0 @@
-# Alerts and notifications
-
-restic-manager raises alerts on conditions that need human
-attention. The alert engine evaluates rules on a 60s tick and
-on every job-finished / host-online event.
-
-## Built-in alert kinds
-
-| Kind                | Trigger | Severity |
-|---------------------|---------|----------|
-| `backup_failed`     | A backup job ends in `failed` or `cancelled` | warning |
-| `forget_failed`     | A forget job ends in `failed` | warning |
-| `prune_failed`      | A prune job ends in `failed` | critical |
-| `check_failed`      | A check job ends in `failed` | critical |
-| `agent_offline`     | A host has been offline more than 90s past its heartbeat cadence | warning |
-| `stale_schedule`    | A schedule's "last run" is more than 1.5 × its interval ago | warning |
-| `update_failed`     | An agent self-update reported a failure or didn't reconnect within 90s | warning |
-| `fleet_update_halted` | The rolling fleet-update worker stopped on a failure | critical |
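-
-As an illustration of the `stale_schedule` threshold in the table,
-here is a minimal Go sketch. `isStale` is a hypothetical helper, and
-how the engine derives a schedule's interval from its cron
-expression is not covered here; the interval is simply passed in.
-
-```go
-package main
-
-import (
-	"fmt"
-	"time"
-)
-
-// isStale reports whether lastRun is more than 1.5 × interval before now.
-func isStale(lastRun, now time.Time, interval time.Duration) bool {
-	limit := interval + interval/2 // 1.5 × interval without floating point
-	return now.Sub(lastRun) > limit
-}
-
-func main() {
-	now := time.Date(2026, 5, 4, 12, 0, 0, 0, time.UTC)
-	daily := 24 * time.Hour
-	// 30h since the last run of a daily schedule: inside the 36h limit.
-	fmt.Println(isStale(now.Add(-30*time.Hour), now, daily))
-	// 37h since the last run: past the 36h limit, so the alert would fire.
-	fmt.Println(isStale(now.Add(-37*time.Hour), now, daily))
-}
-```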
-
-Each alert has a `dedup_key` so re-firing the same condition
-just bumps `last_seen_at` — the operator gets one row per
-condition, not a thousand.
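-
-A rough in-memory sketch of that dedup behaviour, in Go. The
-`Store` type, its methods, and the `agent_offline:host-7` key
-format are assumptions made for illustration; the real engine
-persists alerts in SQLite.
-
-```go
-package main
-
-import (
-	"fmt"
-	"time"
-)
-
-// Alert is a pared-down alert row.
-type Alert struct {
-	ID         int
-	DedupKey   string
-	LastSeenAt time.Time
-}
-
-// Store tracks unresolved alerts by dedup key.
-type Store struct {
-	nextID int
-	open   map[string]*Alert
-}
-
-func NewStore() *Store { return &Store{nextID: 1, open: map[string]*Alert{}} }
-
-// Raise records a firing condition. It returns the alert and whether the
-// alert is new; a re-fire only bumps LastSeenAt on the existing row.
-func (s *Store) Raise(key string, now time.Time) (*Alert, bool) {
-	if a, ok := s.open[key]; ok {
-		a.LastSeenAt = now
-		return a, false
-	}
-	a := &Alert{ID: s.nextID, DedupKey: key, LastSeenAt: now}
-	s.nextID++
-	s.open[key] = a
-	return a, true
-}
-
-// Resolve closes the alert, so the next Raise for the key starts fresh.
-func (s *Store) Resolve(key string) { delete(s.open, key) }
-
-func main() {
-	s := NewStore()
-	t0 := time.Date(2026, 5, 4, 22, 0, 0, 0, time.UTC)
-	a, isNew := s.Raise("agent_offline:host-7", t0)
-	fmt.Println(a.ID, isNew)
-	a, isNew = s.Raise("agent_offline:host-7", t0.Add(time.Minute))
-	fmt.Println(a.ID, isNew)
-	s.Resolve("agent_offline:host-7")
-	a, isNew = s.Raise("agent_offline:host-7", t0.Add(time.Hour))
-	fmt.Println(a.ID, isNew)
-}
-```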
-
-## Lifecycle
-
-```
-raised  ──acknowledge──▶  acknowledged  ──resolve──▶  resolved
-   │                          │
-   └────────auto-resolve──────┘
-   (e.g. agent_offline auto-resolves on agent_online)
-```
-
-- **Acknowledge** says "I've seen this, stop notifying about it".
-- **Resolve** says "the underlying condition is gone".
-- Some alerts auto-resolve when the condition clears
-  (`agent_offline` is the canonical example).
-
-## Notification channels
-
-Configure under **Settings → Notifications**. Each channel can
-subscribe to all alerts or filter by severity.
-
-### Webhook
-
-Posts a JSON envelope to a URL of your choice. Useful for
-piping into Slack via an Incoming Webhook URL or into your own
-alerting tooling.
-
-### ntfy
-
-Pushes a plain-text alert to an [ntfy.sh](https://ntfy.sh/)
-topic. Configure the topic URL; optional bearer token if you
-self-host with auth.
-
-### SMTP
-
-Plain SMTP (with optional TLS). Configure host, port,
-username, password, and the recipient list.
-
-## Test fire
-
-Each channel exposes a **Test fire** button that dispatches a
-single synthetic alert through the channel without touching the
-alert engine. Use this when you've added a channel and want to
-verify connectivity before the next real failure happens.
-
-## What gets logged
-
-Every alert raise / acknowledge / resolve writes an audit log
-entry. The audit log UI at **Settings → Audit log** filters by
-user, action, target, and time range — useful for the
-post-incident "who clicked acknowledge on the prune-failure
-alert" question.
@@ -1,73 +0,0 @@
-# Backups and restores
-
-## Running a backup
-
-Three ways to trigger one:
-
-1. **Scheduled** — the agent's local cron fires at the time set
-   on the schedule.
-2. **Run-now** — operator clicks **Run now** on the host detail
-   right rail. Posts to `/hosts/{id}/run-backup` (defaults to all
-   source groups) or to a per-group form for finer control.
-3. **API** — `POST /api/hosts/{id}/jobs` with the appropriate
-   payload. Same audit + dispatch path.
-
-In every case the server creates a `jobs` row, broadcasts a
-`command.run` to the host, and lands the operator on the live
-job log page (HTMX `HX-Redirect`).
-
-## Cancelling a job
-
-Any running job — backup, forget, prune, restore, anything —
-exposes a **Cancel** button on its detail page. The server
-broadcasts `command.cancel`, and the agent kills the running
-restic subprocess via context cancel: SIGTERM first, SIGKILL
-after a 5s grace (`cmd.Cancel` + `cmd.WaitDelay`). On Windows the
-SIGTERM step is replaced with `os.Kill` because Windows can't
-deliver SIGTERM. Result: a cancelled job typically lands as
-`cancelled` within a couple of hundred milliseconds, or after the
-5s grace period if restic ignores SIGTERM.
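-
-For readers unfamiliar with `cmd.Cancel` and `cmd.WaitDelay`
-(available in `os/exec` since Go 1.20), a minimal Unix-only sketch
-of the pattern follows. `runCancellable` is a hypothetical helper
-and `sleep 30` merely stands in for a long-running restic process.
-
-```go
-package main
-
-import (
-	"context"
-	"fmt"
-	"os/exec"
-	"syscall"
-	"time"
-)
-
-// runCancellable runs a command that is sent SIGTERM when ctx is
-// cancelled and is killed if it has not exited 5s after that.
-func runCancellable(ctx context.Context, name string, args ...string) error {
-	cmd := exec.CommandContext(ctx, name, args...)
-	// Replace the default Cancel (an immediate kill) with a polite SIGTERM.
-	cmd.Cancel = func() error { return cmd.Process.Signal(syscall.SIGTERM) }
-	// Grace period before os/exec kills the process itself.
-	cmd.WaitDelay = 5 * time.Second
-	return cmd.Run()
-}
-
-func main() {
-	ctx, cancel := context.WithCancel(context.Background())
-	go func() {
-		time.Sleep(100 * time.Millisecond)
-		cancel() // plays the role of an incoming command.cancel
-	}()
-	start := time.Now()
-	err := runCancellable(ctx, "sleep", "30")
-	fmt.Printf("stopped after %v: %v\n", time.Since(start).Round(100*time.Millisecond), err)
-}
-```
-
-Because `sleep` exits on SIGTERM, the 5s `WaitDelay` is never
-reached in this example; it only matters for a child that ignores
-the signal.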
-
-## Restore wizard
-
-Restoring a file or path goes through a four-step wizard at
-`/hosts/{id}/restore`:
-
-1. **Pick a snapshot.** Search by id or by date; the page is
-   pre-populated when you launched the wizard from a snapshot row.
-2. **Browse the snapshot tree.** Lazy-loaded children via the
-   `MsgTreeList` synchronous WS RPC; results are cached
-   per-wizard-session for 30 minutes. Pick the absolute paths
-   you want.
-3. **Choose a target.** Either **In place** (overwrites the
-   live filesystem; requires you to type the hostname to
-   confirm) or **New directory** (default
-   `$HOME/rm-restore/<job-id>/`; agent expands `$HOME` /
-   `${HOME}` / `~/` and creates the directory chain).
-4. **Review and submit.** Server mints a job, dispatches
-   `command.run` with a `RestorePayload`, and `HX-Redirect`s to
-   the live job log.
-
-`--no-ownership` is gated on restic ≥ 0.17 (the flag was added
-in that release). Hosts running 0.16 don't get the flag and
-restore as the running user instead.
-
-## Snapshot diff
-
-Two snapshot ids in the **Diff** form on the host detail page →
-a `JobDiff` job that runs `restic diff <a> <b>`. Output streams
-to the standard live job log. Useful when investigating a
-suspiciously-sized backup.
-
-## Job log artefacts
-
-Every job's log is persisted in `job_logs` (one row per line),
-not just streamed in-memory. That gives you:
-
-- A live view at `/jobs/{id}` while the job runs.
-- Two download formats from the same page header dropdown:
-  - **txt** — one line per row, `HH:MM:SS.mmm  TAG  payload`.
-  - **ndjson** — one self-contained JSON object per line
-    (`{seq, ts, stream, payload}`), perfect for `jq`.
-
-Downloads work whether the job is running or finished —
-the source is the DB, not the live socket.
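-
-If you'd rather post-process an ndjson download in Go than in
-`jq`, a decoder might look like the sketch below. The field types
-in `LogLine` are assumptions (the docs above name the keys but not
-their types), and the two sample lines are made up for the example.
-
-```go
-package main
-
-import (
-	"bufio"
-	"encoding/json"
-	"fmt"
-	"io"
-	"strings"
-)
-
-// LogLine mirrors the documented {seq, ts, stream, payload} keys.
-type LogLine struct {
-	Seq     int64  `json:"seq"`
-	TS      string `json:"ts"`
-	Stream  string `json:"stream"`
-	Payload string `json:"payload"`
-}
-
-// readLog decodes one JSON object per line, skipping blank lines.
-func readLog(r io.Reader) ([]LogLine, error) {
-	var out []LogLine
-	sc := bufio.NewScanner(r)
-	for sc.Scan() {
-		if len(sc.Bytes()) == 0 {
-			continue
-		}
-		var l LogLine
-		if err := json.Unmarshal(sc.Bytes(), &l); err != nil {
-			return nil, fmt.Errorf("entry %d: %w", len(out)+1, err)
-		}
-		out = append(out, l)
-	}
-	return out, sc.Err()
-}
-
-func main() {
-	// Made-up sample input, not real agent output.
-	sample := strings.Join([]string{
-		`{"seq":1,"ts":"2026-05-04T22:11:08Z","stream":"stdout","payload":"scanning"}`,
-		`{"seq":2,"ts":"2026-05-04T22:11:09Z","stream":"stderr","payload":"warning"}`,
-	}, "\n")
-	lines, err := readLog(strings.NewReader(sample))
-	if err != nil {
-		panic(err)
-	}
-	for _, l := range lines {
-		fmt.Printf("%d [%s] %s\n", l.Seq, l.Stream, l.Payload)
-	}
-}
-```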
@@ -1,61 +0,0 @@
-# Observability with Prometheus
-
-restic-manager can expose a Prometheus scrape endpoint at
-`GET /metrics`. The endpoint is **opt-in** — without an explicit
-auth gate it isn't even mounted, so a forgotten config can't
-accidentally publish fleet state.
-
-The full reference lives at
-[`docs/prometheus.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/docs/prometheus.md);
-the short version follows.
-
-## Enable the endpoint
-
-Set at least one of:
-
-- `RM_METRICS_TOKEN`: `Authorization: Bearer <token>` required.
-- `RM_METRICS_TRUSTED_CIDR`: restricts source IPs (comma-CIDR).
-
-When both are set, a request must pass both checks. The token
-compare is constant-time; the CIDR check honours
-`X-Forwarded-For` only when the immediate hop matches
-`RM_TRUSTED_PROXY`.
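-
-A hedged Go sketch of that gate, assuming the client address has
-already been resolved. `metricsGate` and `allow` are hypothetical
-names; only `subtle.ConstantTimeCompare` and `net/netip` are real
-library calls.
-
-```go
-package main
-
-import (
-	"crypto/subtle"
-	"fmt"
-	"net/netip"
-	"strings"
-)
-
-// metricsGate holds the two optional checks. An empty field means
-// that check is not configured.
-type metricsGate struct {
-	token string
-	cidrs []netip.Prefix
-}
-
-// allow applies every configured check; all of them must pass.
-func (g metricsGate) allow(authHeader string, client netip.Addr) bool {
-	if g.token == "" && len(g.cidrs) == 0 {
-		return false // nothing configured: the endpoint is off
-	}
-	if g.token != "" {
-		if !strings.HasPrefix(authHeader, "Bearer ") {
-			return false
-		}
-		got := strings.TrimPrefix(authHeader, "Bearer ")
-		if subtle.ConstantTimeCompare([]byte(got), []byte(g.token)) != 1 {
-			return false
-		}
-	}
-	if len(g.cidrs) > 0 {
-		inRange := false
-		for _, p := range g.cidrs {
-			if p.Contains(client) {
-				inRange = true
-				break
-			}
-		}
-		if !inRange {
-			return false
-		}
-	}
-	return true
-}
-
-func main() {
-	g := metricsGate{
-		token: "s3cret",
-		cidrs: []netip.Prefix{netip.MustParsePrefix("10.0.0.0/8")},
-	}
-	fmt.Println(g.allow("Bearer s3cret", netip.MustParseAddr("10.1.2.3")))
-	fmt.Println(g.allow("Bearer s3cret", netip.MustParseAddr("192.0.2.1")))
-}
-```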
-
-## Metrics emitted
-
-- **Server gauges**: `rm_hosts_total`, `rm_hosts_online`,
-  `rm_active_alerts{severity}`, `rm_build_info{...}`.
-- **Per-host gauges**: `rm_host_agent_online`,
-  `rm_host_last_backup_timestamp_seconds`,
-  `rm_host_last_backup_success`, `rm_host_repo_size_bytes`,
-  `rm_host_snapshot_count`, `rm_host_open_alerts`,
-  `rm_host_repo_status`.
-- **Histogram**:
-  `rm_job_duration_seconds{kind,status,le=…}` (buckets
-  `1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf`).
-
-The histogram lives in memory only, so it resets when the server
-restarts. Prometheus persists the scrapes; if you need durable
-history at hourly resolution, that's Prometheus's job.
-
-## Sample Grafana dashboard
-
-[`deploy/grafana/restic-manager-dashboard.json`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/deploy/grafana/restic-manager-dashboard.json)
-imports through Grafana's **+ → Import → Upload JSON file**.
-Six panels:
-
-1. Fleet status (online / total).
-2. Open alerts by severity.
-3. Backups failing on most-recent run.
-4. Hosts table — last backup, repo size, snapshots, open alerts.
-5. Repo size over time, one line per host.
-6. Job-duration p95 over a 1h window per kind.
-
-## Alerting
-
-restic-manager already has a built-in alert engine
-([Alerts](./alerts.md)). The dashboard intentionally doesn't
-duplicate it as Prometheus alert rules. If you want
-Prometheus-side alerts on top, write your own based on the
-metrics above — `rm_host_last_backup_success == 0`,
-`time() - rm_host_last_backup_timestamp_seconds > <max age>`,
-or whatever suits your environment.
@@ -1,50 +0,0 @@
-# Updating agents
-
-Server updates are a `docker compose pull && docker compose up -d` away.
-Agents update via the control plane.
-
-## Single-host update
-
-Each host's detail page shows an **Update agent** button when
-the agent's reported version is older than the server's. The
-button:
-
-1. Dispatches a `command.update` to that host.
-2. The agent fetches the appropriate binary from
-   `$RM_SERVER/agent/binary?os=…&arch=…` to
-   `<binary-path>.new`.
-3. Copies the running binary to `<binary-path>.old` (one
-   revision back, in case rollback is needed).
-4. Atomic-renames `.new` over the running binary.
-5. Exits cleanly. systemd's `Restart=always` (or Windows SCM)
-   brings the process back on the new binary.
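-
-Steps 3 and 4 can be sketched in Go as below. `swapBinary` and
-`copyFile` are hypothetical helpers, not the agent's real code, and
-the rename is only atomic when `.new` sits on the same filesystem
-as the binary (POSIX semantics; Windows handles a running
-executable differently).
-
-```go
-package main
-
-import (
-	"fmt"
-	"io"
-	"os"
-	"path/filepath"
-)
-
-// copyFile copies src to dst, creating or truncating dst.
-func copyFile(src, dst string) error {
-	in, err := os.Open(src)
-	if err != nil {
-		return err
-	}
-	defer in.Close()
-	out, err := os.OpenFile(dst, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
-	if err != nil {
-		return err
-	}
-	if _, err := io.Copy(out, in); err != nil {
-		out.Close()
-		return err
-	}
-	return out.Close()
-}
-
-// swapBinary keeps one revision at bin+".old", then renames
-// bin+".new" over bin.
-func swapBinary(bin string) error {
-	if err := copyFile(bin, bin+".old"); err != nil {
-		return fmt.Errorf("backup current binary: %w", err)
-	}
-	return os.Rename(bin+".new", bin)
-}
-
-func main() {
-	dir, err := os.MkdirTemp("", "swap-demo")
-	if err != nil {
-		panic(err)
-	}
-	defer os.RemoveAll(dir)
-	bin := filepath.Join(dir, "agent")
-	// Stand-in file contents; a real update would hold the binaries.
-	if err := os.WriteFile(bin, []byte("v1"), 0o755); err != nil {
-		panic(err)
-	}
-	if err := os.WriteFile(bin+".new", []byte("v2"), 0o755); err != nil {
-		panic(err)
-	}
-	if err := swapBinary(bin); err != nil {
-		panic(err)
-	}
-	cur, _ := os.ReadFile(bin)
-	old, _ := os.ReadFile(bin + ".old")
-	fmt.Printf("current=%s old=%s\n", cur, old)
-}
-```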
-
-A 90-second timer on the server side waits for a hello at the
-target version and marks the update succeeded — or, if the
-agent doesn't reconnect at the expected version in time, marks
-the update **failed** and raises an `update_failed` alert.
-
-## Fleet update
-
-The admin-only **Settings → Fleet update** page drives a rolling
-update across every host in the fleet:
-
-- One host at a time.
-- Wait for hello-with-target-version (max 95s).
-- On any host failing, **halt** the rollout, raise a
-  `fleet_update_halted` alert, leave the rest of the fleet on
-  the old version. No surprise mass-failures.
-
-You can cancel an in-progress fleet update; the worker stops
-after the current host finishes.
-
-## TLS and corruption
-
-Updates rely on the reverse proxy's TLS to detect corruption in
-transit. There's no separate sha256 verification step — we
-chose the simpler model on the basis that the same TLS already
-gates every other byte the server hands to the agent.
-
-If you'd like a separate signature step before applying updates,
-that's a future-phase enhancement (see `tasks.md` Phase 6
-candidates).
@@ -1,58 +0,0 @@
-# Environment variables
-
-The server reads its configuration from environment variables
-(canonical) with an optional YAML overlay. Env wins over YAML so
-operators can tweak a single setting without rewriting the file.
-
-## Server
-
-| Variable                  | Default                          | Meaning |
-|---------------------------|----------------------------------|---------|
-| `RM_LISTEN`               | `:8080`                          | TCP listener for the HTTP server. |
-| `RM_DATA_DIR`             | `/data`                          | Persistent state directory (SQLite, secret key, agent assets). |
-| `RM_BASE_URL`             | (none)                           | Public URL clients use; required for OIDC redirects + cookie scope. |
-| `RM_SECRET_KEY_FILE`      | `${RM_DATA_DIR}/secret.key`      | Path to the AEAD key file. Auto-generated on first run. |
-| `RM_COOKIE_SECURE`        | `true`                           | Set `false` only for local HTTP testing. Controls `Secure` on session cookies. |
-| `RM_TRUSTED_PROXY`        | (none)                           | Comma-separated CIDRs trusted for `X-Forwarded-*`. |
-| `RM_BUNDLED_ASSETS_DIR`   | `/opt/restic-manager/dist`       | Read-only path with bundled agent binaries + install scripts (the docker image bakes them here). |
-| `RM_METRICS_TOKEN`        | (off)                            | When set, `GET /metrics` requires `Authorization: Bearer <token>`. |
-| `RM_METRICS_TRUSTED_CIDR` | (off)                            | When set, `GET /metrics` restricts source IPs (comma-CIDR). |
-
-OIDC variables (all optional; empty issuer disables OIDC):
-
-| Variable                       | Meaning |
-|--------------------------------|---------|
-| `RM_OIDC_ISSUER`               | OIDC discovery URL (e.g. `https://auth.example.com`). |
-| `RM_OIDC_CLIENT_ID`            | Client ID registered with the IdP. |
-| `RM_OIDC_CLIENT_SECRET`        | Client secret (or use `RM_OIDC_CLIENT_SECRET_FILE`). |
-| `RM_OIDC_CLIENT_SECRET_FILE`   | Path to a file holding the client secret. |
-| `RM_OIDC_DISPLAY_NAME`         | Button label on the login page (e.g. "Authelia"). |
-| `RM_OIDC_ROLE_CLAIM`           | Token claim that carries roles (default `groups`). |
-| `RM_OIDC_ROLE_MAPPING`         | `idp-group=role` entries, comma-separated (e.g. `rm-admin=admin,rm-ops=operator`). |
-| `RM_OIDC_REDIRECT_URL`         | Override for the redirect URL; defaults to `${RM_BASE_URL}/auth/oidc/callback`. |
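-
-To make the `RM_OIDC_ROLE_MAPPING` format concrete, here is one way
-such a value could be parsed in Go. `parseRoleMapping` is a
-hypothetical function; rejecting roles outside `viewer`,
-`operator`, and `admin` is an assumption about validation, not
-documented behaviour.
-
-```go
-package main
-
-import (
-	"fmt"
-	"strings"
-)
-
-// parseRoleMapping turns "idp-group=role,..." into a group → role map.
-func parseRoleMapping(s string) (map[string]string, error) {
-	out := map[string]string{}
-	for _, entry := range strings.Split(s, ",") {
-		entry = strings.TrimSpace(entry)
-		if entry == "" {
-			continue
-		}
-		group, role, ok := strings.Cut(entry, "=")
-		if !ok || group == "" || role == "" {
-			return nil, fmt.Errorf("bad mapping entry %q", entry)
-		}
-		switch role {
-		case "viewer", "operator", "admin":
-			out[group] = role
-		default:
-			return nil, fmt.Errorf("unknown role %q in entry %q", role, entry)
-		}
-	}
-	return out, nil
-}
-
-func main() {
-	m, err := parseRoleMapping("rm-admin=admin,rm-ops=operator")
-	if err != nil {
-		panic(err)
-	}
-	fmt.Println(m["rm-admin"], m["rm-ops"])
-}
-```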
-
-## Agent
-
-| Variable             | Default | Meaning |
-|----------------------|---------|---------|
-| `RM_AGENT_CONFIG`    | `/etc/restic-manager/agent.yaml` (Linux) | Config file path. |
-
-The agent's other settings live in the YAML file (server URL,
-bearer token, optional cert pin). The install script writes that
-file for you at enrolment.
-
-## Build-time
-
-The Makefile threads `-ldflags` from `git describe` into the
-`internal/version` package so `--version` and the dashboard
-footer show the right values:
-
-```
--X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=$(VERSION)
--X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Commit=$(COMMIT)
-```
-
-If you build with `go build` directly (no Makefile), `Version`
-falls back to `dev` and the agent-update comparison falls back
-to "always equal". Source-build deployments can still run; they
-just don't participate in the self-update flow.
@@ -1,82 +0,0 @@
-# HTTP endpoints
-
-A non-exhaustive map of the surfaces the control plane exposes.
-All `/api/*` routes return JSON; all other paths render HTML
-(server-rendered with HTMX in the loop).
-
-The canonical wiring lives at
-[`internal/server/http/server.go`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/internal/server/http/server.go);
-when in doubt, read the routes block there.
-
-## Public (no auth)
-
-| Method | Path                       | Purpose |
-|--------|----------------------------|---------|
-| GET    | `/healthz`                 | Liveness probe. Returns 204. |
-| POST   | `/api/auth/login`          | Local-user login. JSON body: `{username, password}`. |
-| POST   | `/api/auth/logout`         | Invalidate the session cookie. |
-| POST   | `/api/bootstrap`           | First-run admin creation. Accepts the token printed at first start. |
-| POST   | `/api/agents/enroll`       | Token-based agent enrolment. |
-| POST   | `/api/agents/announce`     | Announce-and-approve agent enrolment. |
-| GET    | `/agent/binary?os=&arch=`  | Serves the agent binary for the install scripts. |
-| GET    | `/install/*`               | Serves the Linux + Windows install scripts and the systemd unit. |
-| GET    | `/api/version`             | Build version + commit JSON. |
-| GET    | `/metrics`                 | Prometheus exposition (only when opted-in via `RM_METRICS_TOKEN` / `RM_METRICS_TRUSTED_CIDR`). |
-| GET    | `/login`, `/setup`, `/bootstrap` | UI pages. |
-
-## Authenticated (any role)
-
-| Method | Path                                     | Purpose |
-|--------|------------------------------------------|---------|
-| GET    | `/`                                      | Dashboard. |
-| GET    | `/hosts/{id}`                            | Host detail. |
-| GET    | `/hosts/{id}/repo`                       | Repo tab. |
-| GET    | `/hosts/{id}/jobs`                       | Jobs tab. |
-| GET    | `/hosts/{id}/sources`                    | Source groups list. |
-| GET    | `/hosts/{id}/schedules`                  | Schedules list. |
-| GET    | `/jobs/{id}`                             | Live job log. |
-| GET    | `/api/hosts`, `/api/fleet/summary`       | JSON list + summary. |
-| GET    | `/api/jobs/{id}/stream`                  | WebSocket subscription to a job's live log. |
-| GET    | `/api/jobs/{id}/log.{txt,ndjson}`        | Persisted log download. |
-
-## Operator role and above
-
-| Method | Path                                  | Purpose |
-|--------|---------------------------------------|---------|
-| POST   | `/hosts/{id}/run-backup`              | Run-now (HTMX form-post). |
-| POST   | `/hosts/{id}/sources/{gid}/run-now`   | Per-source-group run-now. |
-| POST   | `/hosts/{id}/repo/{prune,check,unlock,reinit,probe}` | Maintenance actions. |
-| POST   | `/api/hosts/{id}/snapshots/diff`      | Snapshot-diff job. |
-| POST   | `/hosts/{id}/restore`                 | Restore wizard submit. |
-| POST   | `/api/jobs/{id}/cancel`               | Cancel a running job. |
-| POST   | `/hosts/{id}/tags`                    | Update host tags. |
-| POST   | `/hosts/{id}/sources` and friends     | Source-group CRUD. |
-| POST   | `/hosts/{id}/schedules` and friends   | Schedule CRUD. |
-| POST   | `/hosts/{id}/repo/credentials`, `/admin-credentials` | Credential update. |
-
-## Admin role only
-
-| Method | Path                                  | Purpose |
-|--------|---------------------------------------|---------|
-| POST   | `/hosts/new`                          | Mint enrolment token (Add host). |
-| POST   | `/hosts/{id}/delete`                  | Delete + cascade. |
-| POST   | `/hosts/{id}/update`                  | Dispatch a single agent update. |
-| GET/POST | `/settings/users/...`                | User management. |
-| POST   | `/settings/notifications/...`         | Notification channel CRUD + test fire. |
-| POST   | `/settings/fleet-update/...`          | Fleet-update worker. |
-
-## WebSocket
-
-| Path                           | Who connects | Auth |
-|--------------------------------|--------------|------|
-| `/ws/agent`                    | Agent        | Bearer token issued at enrolment. |
-| `/ws/agent/pending`            | Agent (announce flow) | Pending-id query param. |
-| `/api/jobs/{id}/stream`        | Browser      | Session cookie. |
-
-## RBAC enforcement
-
-Routes are grouped into chi route-groups by required role
-(`viewer < operator < admin`); the `requireRole` middleware in
-`internal/server/http/middleware.go` is the bouncer. Sessions
-re-validate `disabled_at` on every request, so a disabled user's
-cookie stops working immediately.
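-
-The role ordering lends itself to a small middleware. The sketch
-below uses only `net/http` and is not the project's `requireRole`;
-the `Role` type, the context key, and `withRole` (standing in for
-session lookup) are assumptions for illustration. The
-`func(http.Handler) http.Handler` shape is what chi expects.
-
-```go
-package main
-
-import (
-	"context"
-	"fmt"
-	"net/http"
-	"net/http/httptest"
-)
-
-// Role is ordered so that viewer < operator < admin.
-type Role int
-
-const (
-	Viewer Role = iota
-	Operator
-	Admin
-)
-
-type roleKey struct{}
-
-// withRole attaches a role to the request, as a session layer might.
-func withRole(r *http.Request, role Role) *http.Request {
-	return r.WithContext(context.WithValue(r.Context(), roleKey{}, role))
-}
-
-// requireRole rejects requests whose role is below need.
-func requireRole(need Role) func(http.Handler) http.Handler {
-	return func(next http.Handler) http.Handler {
-		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-			role, ok := r.Context().Value(roleKey{}).(Role)
-			if !ok {
-				http.Error(w, "unauthenticated", http.StatusUnauthorized)
-				return
-			}
-			if role < need {
-				http.Error(w, "forbidden", http.StatusForbidden)
-				return
-			}
-			next.ServeHTTP(w, r)
-		})
-	}
-}
-
-// statusFor runs one request with the given role through the gate.
-func statusFor(role, need Role) int {
-	h := requireRole(need)(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
-		w.WriteHeader(http.StatusNoContent)
-	}))
-	rec := httptest.NewRecorder()
-	req := httptest.NewRequest(http.MethodPost, "/hosts/1/run-backup", nil)
-	h.ServeHTTP(rec, withRole(req, role))
-	return rec.Code
-}
-
-func main() {
-	fmt.Println(statusFor(Viewer, Operator)) // viewer on an operator route
-	fmt.Println(statusFor(Admin, Operator))  // admin on an operator route
-}
-```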
@@ -1,32 +0,0 @@
-# Roadmap
-
-The live roadmap is in
-[`tasks.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/tasks.md).
-Phases ship in order; items inside a phase ship as the
-opportunity arises.
-
-## Status snapshot
-
-| Phase | Theme                                            | Status |
-|-------|--------------------------------------------------|--------|
-| 0     | Project bootstrap                                | ✅ done |
-| 1     | MVP: enrolment, visibility, on-demand backup     | ✅ done |
-| 2     | Scheduling, retention, repo operations           | ✅ done |
-| 3     | Restore, alerts, audit                           | ✅ done |
-| 4     | RBAC, OIDC, host tags                            | ✅ done |
-| 5     | OSS readiness                                    | 🚧 in flight (this docs site is part of it) |
-| 6     | Update delivery + observability polish           | ✅ done |
-
-## What's not on the roadmap
-
-The non-goals list in [`spec.md` §2](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/spec.md):
-
-- Replacing restic itself or providing custom repo formats
-- Managing non-restic backup tools
-- Multi-tenancy / SaaS deployment
-- High availability of the control plane (SQLite, single-instance)
-- Mobile-native apps (responsive web only)
-
-If something there is critical to your use case, restic-manager
-isn't the right tool. That's not a closed door — it's a
-deliberate scope decision so the project stays maintainable.
@@ -1,35 +0,0 @@
-# Reporting vulnerabilities
-
-The full disclosure policy lives in
-[`SECURITY.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/SECURITY.md)
-at the repo root. The short version:
-
- **Don't open a public issue.**
- Send a Gitea private message to `steve` on
-  <https://gitea.dcglab.co.uk>, or email the address on the
-  maintainer's profile, with a subject like
-  `[SECURITY] restic-manager: <one-line summary>`.
- Expect an acknowledgement within 3 working days; escalate
-  through the other channel if you don't get one.
- Default disclosure window is **30 days from confirmed report
-  to public disclosure**, faster if a PoC is already
-  circulating, slower only by mutual agreement.
-
-## What to include
-
-A description of the issue and the impact, the affected
-component (server / agent / install script / docs), the version,
-and reproduction steps. A working PoC is welcome but not
-required — a credible threat model is enough.
-
-## In scope vs. out of scope
-
-See the full policy. Quick highlights:
-
- **In scope:** server, agent, install scripts, docker image,
-  docker-compose reference, crypto choices, docs that lead to
-  insecure configs.
- **Out of scope:** restic itself (report upstream), unpatched
-  third-party deps (report upstream first), pre-authenticated
-  admin abuse (admins are designed to have full power), DoS on
-  deployments without the recommended reverse proxy.
@@ -1,72 +0,0 @@
-# Hardening checklist
-
-A baseline for new deployments. Most of these are defaults; the
-list is here to make audit easy.
-
-## Server
-
- [ ] Reverse proxy in front, TLS terminating at the proxy
-      (Caddy/nginx/Traefik).
- [ ] `RM_TRUSTED_PROXY` set to the proxy's CIDR.
- [ ] `RM_BASE_URL` matches the public hostname and the cookie
-      scope you want.
- [ ] `RM_COOKIE_SECURE=true` (the default; only set `false`
-      for local HTTP testing).
- [ ] HTTP listener bound to **localhost** in the compose file,
-      not `0.0.0.0`. The reverse proxy is the only thing that
-      should reach it.
- [ ] `secret.key` backed up separately from the database.
- [ ] Bootstrap token consumed and the printed log line scrubbed
-      from any log archive.
-
-## Authentication
-
- [ ] Admin user has a password ≥ 12 characters (the floor).
- [ ] OIDC enabled if you have an IdP — local password auth
-      stays as a break-glass.
- [ ] Users who change roles or leave are disabled (not deleted),
-      so their sessions are invalidated immediately.
- [ ] The last-admin guard isn't tripped — there's always at
-      least one enabled admin user.
-
-## Repo credentials
-
- [ ] Append-only credential set as the everyday cred for every
-      host.
- [ ] Admin credential set only where prune cadence is enabled.
- [ ] No credentials reused across hosts. Each host should have
-      its own credential pair so a single host compromise has a
-      single blast radius.
- [ ] If using rest-server, `--append-only` flag is on for the
-      everyday user; the prune user is a separate identity.
-
-## Agent
-
- [ ] Agent runs as `root` (Linux) or `LocalSystem` (Windows)
-      **only when** the source paths require it. Otherwise pin
-      a service user that has read access to what's backed up
-      and nothing else.
- [ ] systemd unit's sandboxing flags are intact
-      (`NoNewPrivileges`, `Protect*`, `MemoryDenyWriteExecute`).
- [ ] Agent's config file `/etc/restic-manager/agent.yaml` is
-      mode `0600` and owned by the service user. The bearer
-      token lives in there.
-
-## Operations
-
- [ ] Alerts wired to a real channel (webhook into Slack,
-      ntfy topic, SMTP) — not just sitting in the UI.
- [ ] Test-fire each notification channel after configuring.
- [ ] Audit-log retention is long enough to cover the operator's
-      incident-response window.
- [ ] Prometheus endpoint, if enabled, gated by token AND CIDR
-      where practical (default is opt-in / off).
-
-## Recovery
-
- [ ] A documented procedure for rotating a leaked agent bearer
-      (delete + re-enrol the host).
- [ ] A test-restore done at least once, end-to-end, before
-      relying on the system in anger.
- [ ] `secret.key` and the SQLite database covered by separate
-      backup paths so neither alone reconstitutes the other.
@@ -1,110 +0,0 @@
-# Threat model
-
-This page documents what restic-manager defends against, what it
-doesn't, and the trust assumptions a deployment is making. The
-canonical version lives in [`spec.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/spec.md)
-§11; the summary here is shaped for operators rather than
-implementers.
-
-## Trust boundaries
-
-```
-┌──────────────────────────────────────────┐
-│  TRUSTED zone                            │
-│  ┌─────────────┐    ┌──────────────┐     │
-│  │  Operator's │    │   Reverse    │     │
-│  │   browser   │◄──►│    proxy     │     │  TLS terminates here
-│  └─────────────┘    └──────┬───────┘     │
-└────────────────────────────┼─────────────┘
-                             │ HTTP, plaintext
-                             │ (loopback or trusted LAN)
-┌────────────────────────────▼─────────────┐
-│  Server (control plane)                  │
-└────────────┬─────────────────────────────┘
-             │ outbound WebSocket (TLS to clients via proxy)
-             │ — bearer-authenticated
-┌────────────▼──────────────┐
-│  Agent (per host)         │  ◄── attacker model: assume one
-└────────────┬──────────────┘       endpoint can be compromised
-             │ subprocess
-             ▼
-   restic ──▶ repository (rest-server / S3 / SFTP / …)
-```
-
-## What we defend against
-
-### Network attacker between operator and server
-
- HTTPS via the reverse proxy is the only operator-facing surface
-  on a sane deployment.
- `RM_COOKIE_SECURE=true` (default) means the session cookie
-  refuses to ride a non-HTTPS connection.
- `RM_TRUSTED_PROXY` gates whether `X-Forwarded-*` is honoured;
-  a bypassing request can't spoof the client IP.
-
-### Compromised agent host
-
- The agent's bearer token can dispatch commands **only on its
-  own host**. It can't read other hosts' state, dispatch jobs
-  on other hosts, or escalate within the control plane.
- If you suspect a host compromise:
-  1. Delete the agent's host row via **Hosts → Delete**
-     (the delete cascades to the bearer hash, revoking it).
-  2. Rotate the repo credential at the rest-server / object
-     store side.
-  3. Review the audit log, which lists every action that
-     bearer ever drove.
-
-### DB compromise without the secret key
-
- Repo credentials are AEAD-encrypted at rest. A DB dump alone
-  doesn't expose them.
- Agent bearer **hashes** are leaked; that's enough to
-  authenticate as any agent until you revoke. A rotation
-  procedure is just "delete + re-enrol" today.
- Operator passwords are bcrypt-hashed; OIDC users have no
-  password to leak.
- Session tokens are hashed; an attacker can't replay a
-  session from a DB dump.
-
-### DB compromise WITH the secret key
-
-The attacker can decrypt every credential. Treat
-`secret.key` with the same care as a password manager database.
-Back it up to a separate vault, not to the same Docker volume
-as the database.
-
-### Forget/prune as a DoS vector
-
- The everyday backup credential cannot prune (append-only).
- The admin credential is only pushed to the agent at the
-  moment of dispatch and discarded after the job ends.
- Compromise of a single agent host does **not** grant prune
-  rights — at worst the attacker gets fresh write access until
-  the credential is rotated.
-
-### Operator-side typo or bad copy-paste
-
- Repo credentials are stored encrypted; mis-typed creds fail
-  fast on the next `restic` invocation rather than silently
-  corrupting state.
- NS-03 added auto-init: the first dispatched job after creds
-  change runs `restic init`, surfaces the error eagerly under
-  the host's vitals strip if the creds are bad, and resets the
-  host's `repo_status` so the operator can retry without
-  hunting through job logs.
-
-## What we don't defend against
-
- **Insider threat at the maintainer level.** A malicious
-  maintainer can publish a backdoored container; SBOM /
-  signing infrastructure (Phase 6 candidate) would help here
-  but isn't shipped today.
- **Supply chain.** We pin module versions (`go.sum`) and
-  pin the Tailwind binary's release tag, but a compromise in
-  one of those upstreams would land here.
- **Side-channel via restic itself.** A bug in restic that
-  enables snapshot-content disclosure is restic's problem; the
-  control plane doesn't see snapshot bytes either way.
- **DoS via resource exhaustion** without the recommended
-  reverse-proxy / rate-limit in front. Don't expose the
-  server's HTTP port to the public internet directly.
@@ -1,120 +0,0 @@
-# End-to-end test harness
-
-The e2e harness stands up the full production-shaped stack
-(server + agent + rest-server) in Docker Compose and drives it
-through Playwright. CI runs it on every PR; operators can run it
-locally too.
-
-## Files
-
-```
-e2e/
-├── compose.e2e.yml         compose stack: server + rest-server + agent
-├── Dockerfile.agent        Linux container for the agent (alpine + restic)
-├── agent-entrypoint.sh     decides between announce / token-enrol / run
-└── playwright/
-    ├── package.json
-    ├── playwright.config.ts
-    └── tests/
-        ├── lib/server.ts   bootstrap, login, accept, poll helpers
-        └── smoke.spec.ts   happy-path: enrol → backup → succeeded
-```
-
-## Local run
-
-Prerequisites: Docker + Docker Compose, and `npx` for Playwright.
-
-```sh
-# 1. Build + bring up the stack (server, rest-server, source data).
-docker compose -f e2e/compose.e2e.yml up --build -d server rest-server source-fixture
-
-# 2. Wait for the server, then scrape the bootstrap token from the log.
-until curl -fsS http://127.0.0.1:8080/api/version >/dev/null; do sleep 1; done
-RM_BOOTSTRAP_TOKEN=$(docker compose -f e2e/compose.e2e.yml logs server \
-    | grep -Eo '[a-zA-Z0-9_-]{40,}' | head -1)
-export RM_BOOTSTRAP_TOKEN
-
-# 3. Start the agent (it announces against the running server).
-docker compose -f e2e/compose.e2e.yml up -d agent
-
-# 4. Install + run Playwright.
-cd e2e/playwright
-npm install
-npx playwright install --with-deps chromium
-npx playwright test
-```
-
-When the test passes you'll see:
-
-```
-Running 2 tests using 1 worker
-  ✓  smoke: enrol-via-announce → backup › happy path completes in under a minute (47s)
-  ✓  smoke: scrape /metrics › metrics endpoint exposes the host gauge (180ms)
-
-  2 passed (47.5s)
-```
-
-Tear-down:
-
-```sh
-docker compose -f e2e/compose.e2e.yml down -v
-```
-
-`-v` removes the named volumes too — important between runs because
-the rest-server volume holds an initialised repo and the
-agent-config volume holds a stale bearer.
-
-## What the test exercises
-
-1. **Bootstrap.** Posts the admin-creation request to
-   `/api/bootstrap` with the token scraped from the server log.
-2. **Login (UI).** Drives the login form via Playwright; verifies
-   the dashboard loads with a session cookie set.
-3. **Pending host appears.** Polls the dashboard for the inline
-   accept form generated by the announcing agent; reads the
-   pending-id out of its action URL.
-4. **Accept.** POSTs `/api/pending-hosts/{id}/accept` with the
-   rest-server URL + repo password. The server mints a Host row
-   + bearer + AEAD-encrypted creds and pushes the bearer down
-   the still-open pending WebSocket.
-5. **Online + auto-init.** Polls `/api/hosts` until the new host
-   is `status=online`. Auto-init runs as part of this — the
-   first dispatched job after creds save is `restic init`.
-6. **Run backup.** Submits the host detail page's `Run now`
-   form; expects `HX-Redirect` to the live job page.
-7. **Verify.** Polls `/api/hosts` until the host's
-   `last_backup_status` flips to `succeeded`.
-8. **Metrics.** Scrapes `/metrics` and asserts the
-   server-gauge + build-info lines are present (the compose
-   stack opens the endpoint via `RM_METRICS_TRUSTED_CIDR=0.0.0.0/0`).
-
-## CI workflow
-
-[`.gitea/workflows/e2e.yml`](../.gitea/workflows/e2e.yml) runs the
-suite on every PR into `main`. On failure it dumps the last 200
-lines of each container log as a workflow annotation and uploads
-the Playwright HTML report as an artefact.
-
-## When tests fail
-
- **Pending host never appears.** Agent container probably
-  couldn't reach the server. Check `docker compose logs agent`
-  for connection errors and `docker compose logs server` for
-  any 4xx on `/api/agents/announce`.
- **Backup hangs in `running`.** The agent shells out to
-  `restic`; check the live job log at
-  `http://127.0.0.1:8080/jobs/<id>` (still up after a
-  failed test as long as you didn't `down -v`).
- **`RM_BOOTSTRAP_TOKEN not set`.** The server log scrape
-  matched the wrong line or the token regex is too tight. The
-  server prints the token on a line starting with `    ` (four
-  spaces) inside a banner; widen the regex if your server log
-  format changes.
-
-## Adding new tests
-
-The harness is intentionally flat — one `*.spec.ts` per
-scenario. Reuse the helpers in `lib/server.ts` and avoid
-duplicating bootstrap / login boilerplate. Heavy fixtures
-(custom users, OIDC IdP) belong in their own compose override
-file rather than complicating `compose.e2e.yml`.
@@ -1,139 +0,0 @@
-# Prometheus + Grafana
-
-restic-manager exposes a Prometheus scrape endpoint at `GET /metrics`.
-The endpoint is **opt-in** — it is not mounted at all unless you set
-at least one of the auth gates below. Once enabled, it serves the
-standard `text/plain` exposition format that every Prometheus
-release since 2.x parses without configuration.
-
-A sample Grafana dashboard lives at
-`deploy/grafana/restic-manager-dashboard.json`.
-
-## Enable the endpoint
-
-Two switches, both off by default. If both are set, both must pass
-(token AND source-IP); if only one is set, that gate alone
-authorises a scrape.
-
-| Env var                    | YAML key               | Effect |
-|----------------------------|------------------------|--------|
-| `RM_METRICS_TOKEN`         | `metrics_token`        | Requires `Authorization: Bearer <token>`. Compared in constant time. |
-| `RM_METRICS_TRUSTED_CIDR`  | `metrics_trusted_cidrs` (list) | Restricts the source IP to one of the listed CIDRs. Comma-separated in env, list in YAML. Honours `X-Forwarded-For` only when the immediate hop matches `RM_TRUSTED_PROXY`. |
-
-When neither is set, `GET /metrics` returns 404 — the route is not
-registered with the chi router so a forgotten config can't
-accidentally publish fleet state.
-
-### Example: Docker
-
-```yaml
-services:
-  restic-manager:
-    image: gitea.dcglab.co.uk/steve/restic-manager:latest
-    environment:
-      # Supplied from the shell or an .env file; keep it out of version control.
-      RM_METRICS_TOKEN: "${RM_METRICS_TOKEN}"
-      RM_METRICS_TRUSTED_CIDR: "10.0.0.0/8"
-```
-
-(`RM_METRICS_TOKEN_FILE` is not currently supported, so Docker
-secrets can't feed the token yet. The `_FILE` convention is on
-the roadmap.)
-
-## Prometheus scrape config
-
-Drop into your `prometheus.yml`:
-
-```yaml
-scrape_configs:
-  - job_name: restic-manager
-    metrics_path: /metrics
-    scheme: https            # via your reverse proxy
-    static_configs:
-      - targets: ['restic.example.com']
-    authorization:
-      type: Bearer
-      credentials_file: /etc/prometheus/secrets/rm_metrics_token
-```
-
-If you don't run a TLS-terminating proxy in front, drop the
-`scheme: https` line (Prometheus defaults to `http`); the server
-is HTTP-only, see `docs/reverse-proxy.md`.
-
-## Metric reference
-
-All names are `rm_`-prefixed. Per-host metrics carry a `host_id`
-label (the stable ULID, immune to renames) and a `host` label
-(the human-readable name).
-
-### Server gauges
-
-| Name                  | Labels                             | Description |
-|-----------------------|------------------------------------|-------------|
-| `rm_hosts_total`      | —                                  | Total number of enrolled hosts (excludes pending announces). |
-| `rm_hosts_online`     | —                                  | Number of hosts with `status='online'`. |
-| `rm_active_alerts`    | `severity` ∈ {info, warning, critical} | Open alerts by severity. |
-| `rm_build_info`       | `version, commit, go_version`      | Always 1; pure label-bag for joining. |
-
-### Per-host gauges
-
-| Name                                       | Description |
-|--------------------------------------------|-------------|
-| `rm_host_agent_online`                     | 1 if the agent is currently online, 0 otherwise. |
-| `rm_host_last_backup_timestamp_seconds`    | Unix timestamp of the host's most recent backup. **Omitted** for hosts with no backup yet. |
-| `rm_host_last_backup_success`              | 1 if the most recent backup succeeded, 0 otherwise. **Omitted** for hosts with no backup yet. |
-| `rm_host_repo_size_bytes`                  | Latest reported repo size from `restic stats --mode raw-data`. **Omitted** when unknown. |
-| `rm_host_snapshot_count`                   | Number of restic snapshots known on the host's repo. |
-| `rm_host_open_alerts`                      | Number of currently open alerts attached to this host. |
-| `rm_host_repo_status`                      | Always 1; the `status` label carries `unknown` / `ready` / `init_failed`. |
-
-### Job duration histogram
-
-```
-rm_job_duration_seconds_bucket{kind, status, le}
-rm_job_duration_seconds_sum{kind, status}
-rm_job_duration_seconds_count{kind, status}
-```
-
-`kind` ∈ {backup, forget, prune, check, unlock, restore, diff, init, update}.
-`status` ∈ {succeeded, failed, cancelled}.
-
-Buckets (seconds):
-
-```
-1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf
-1s   5s  30s  1m  5m   30m   1h    6h    24h
-```
-
-The histogram is in-memory only — values reset on process restart.
-Operators who want durable history should let Prometheus persist
-the scrapes; restic-manager itself is a control plane, not a
-metrics database.
-
-## Grafana dashboard
-
-Import `deploy/grafana/restic-manager-dashboard.json`:
-
-1. In Grafana, **+ → Import → Upload JSON file**.
-2. Pick the Prometheus data source you scrape with.
-3. The dashboard's six panels populate from the metrics above:
-   * **Fleet status** — online/total stat panel.
-   * **Open alerts** — by severity.
-   * **Hosts** — per-host table (last backup, repo size, snapshots, alerts).
-   * **Repo size over time** — one line per host.
-   * **Backups failing** — count of hosts whose last backup didn't succeed.
-   * **Job duration p95** — `histogram_quantile(0.95, …)` over a 1h window per kind.
-
-Alerting is intentionally not configured in the dashboard — the
-control plane already has alerts (P3-05) with native channels for
-webhook, ntfy, and SMTP. Re-implementing them in Prometheus would
-just duplicate state. If you do want Prom-side alerts, copy the
-recording rules into your usual location.
-
-## Cardinality
-
-Per scrape: O(hosts) gauge rows + O(kinds × statuses × buckets)
-histogram rows. A 100-host fleet emits roughly 700 host rows
-(7 per-host gauges × 100) + up to 270 histogram bucket rows
-(9 kinds × 3 statuses × 10 buckets), plus up to 54 `_sum`/`_count`
-rows: well below any practical limit. There are no `job_id`
-labels (cardinality bomb avoidance) and no per-source-group
-labels.
@@ -1,113 +0,0 @@
-# Running behind a reverse proxy
-
-The restic-manager server is HTTP-only by design (see `spec.md` §11):
-TLS termination, public hostname, ACME, HSTS, and edge-level rate
-limiting all belong to a reverse proxy that you already operate
-outside this project. The reference compose in `deploy/docker-compose.yml`
-stands up *only* the server; this page covers what your proxy needs
-to do to make the rest of it work.
-
-## What the proxy must forward
-
-The server reads the headers below when (and only when) the
-immediate peer matches `RM_TRUSTED_PROXY`:
-
-| Header              | Value                                                    | Why |
-|---------------------|----------------------------------------------------------|-----|
-| `X-Forwarded-For`   | The original client IP (single value, or comma chain)    | Rate-limit keys, audit log entries, and OIDC redirect-URI checks all use the real client IP. |
-| `X-Forwarded-Proto` | `https`                                                  | The server emits absolute URLs (e.g. OIDC redirect URIs) using this. |
-| `Host`              | The public hostname clients use                          | Cookies are scoped to this; `RM_BASE_URL` must match. |
-| `Connection`/`Upgrade` | Pass through unchanged                                | The agent connects on `/ws/agent` and the live-log viewer connects on `/api/jobs/{id}/stream` — both are WebSockets and need `Upgrade: websocket` to survive the hop. |
-
-Set `RM_TRUSTED_PROXY` to the CIDR (or comma-separated list of CIDRs)
-the proxy connects from. Anything outside that range has its
-`X-Forwarded-*` headers ignored, so a stray request that bypasses the
-proxy can't spoof the client IP.
-
-## Example: Caddy
-
-```caddyfile
-restic.example.com {
-    # Caddy's default reverse_proxy preserves Host, sets
-    # X-Forwarded-For/Proto, and passes Connection: upgrade through,
-    # so a single directive covers HTTP + WebSocket.
-    reverse_proxy 127.0.0.1:8080
-
-    encode zstd gzip
-}
-```
-
-`RM_TRUSTED_PROXY=127.0.0.1/32` if Caddy and the server share the
-host; the docker-bridge CIDR (commonly `172.16.0.0/12`) if Caddy
-runs in another container on the default bridge network.
-
-## Example: nginx
-
-```nginx
-server {
-    listen 443 ssl http2;
-    server_name restic.example.com;
-
-    ssl_certificate     /etc/ssl/restic.example.com.fullchain.pem;
-    ssl_certificate_key /etc/ssl/restic.example.com.key.pem;
-
-    location / {
-        proxy_pass         http://127.0.0.1:8080;
-        proxy_http_version 1.1;
-
-        # WebSocket support — agent + live-log endpoints need this.
-        proxy_set_header   Upgrade           $http_upgrade;
-        proxy_set_header   Connection        $connection_upgrade;
-
-        # Trusted-proxy headers.
-        proxy_set_header   Host              $host;
-        proxy_set_header   X-Forwarded-For   $proxy_add_x_forwarded_for;
-        proxy_set_header   X-Forwarded-Proto https;
-
-        # Live job logs are long-running streams. Bump read timeouts
-        # so nginx doesn't drop them mid-backup.
-        proxy_read_timeout 1h;
-        proxy_send_timeout 1h;
-    }
-}
-
-# Standard websocket upgrade map (define once at the http {} level).
-map $http_upgrade $connection_upgrade {
-    default upgrade;
-    ''      close;
-}
-```
-
-`RM_TRUSTED_PROXY` for the same-host case: `127.0.0.1/32`.
-
-## Example: Traefik (label-based)
-
-```yaml
-labels:
-  - "traefik.enable=true"
-  - "traefik.http.routers.restic-manager.rule=Host(`restic.example.com`)"
-  - "traefik.http.routers.restic-manager.entrypoints=websecure"
-  - "traefik.http.routers.restic-manager.tls.certresolver=letsencrypt"
-  - "traefik.http.services.restic-manager.loadbalancer.server.port=8080"
-```
-
-Traefik handles `X-Forwarded-*` and `Connection: upgrade` by default.
-`RM_TRUSTED_PROXY` should be the docker network the Traefik container
-shares with the server (commonly `172.16.0.0/12` for the default
-bridge, or whatever your overlay network's CIDR is).
-
-## Sanity-checking the wiring
-
-After bringing the stack up:
-
-1. `curl -fsS https://restic.example.com/healthz` — should return 200.
-2. The login page should report HTTPS in the address bar; cookies
-   set after login should carry the `Secure` flag.
-3. Check the server log for the `config resolved` line:
-   `trusted_proxies` must include the IP/CIDR your proxy actually
-   connects from.
-4. Enrol a test agent — the WebSocket handshake hitting `/ws/agent`
-   confirms `Upgrade` is being forwarded correctly.
-
-If any of those fail, the proxy is the first place to look — the
-server itself is intentionally minimal.
@@ -1,223 +0,0 @@
-# Always-On vs Intermittent host mode
-
-**Date:** 2026-06-15
-**Branch:** `feat-laptop-host-mode`
-**Status:** Design — awaiting review
-
-## Problem
-
-The server currently assumes every host should be present 24×7. When an
-agent stops heartbeating for 90s it is flipped to `offline`, and after 15
-minutes that raises a `warning` alert. This is correct for a server, but
-wrong for a host that legitimately comes and goes — a workstation or
-laptop that sleeps overnight, travels, or is shut down on weekends. Such
-a host generates noise alerts every time it is closed, and — more
-importantly — there is **no mechanism to catch up a backup it missed
-while it was away.**
-
-Two distinct facts make the catch-up gap real:
-
- **Backup cron runs on the agent, locally.** The agent fires
-  `MsgScheduleFire`; the server only dispatches in response. If the host
-  is asleep, the agent process is suspended, so the cron tick never
-  fires and no `MsgScheduleFire` is ever sent.
- Therefore the existing `pending_runs` retry queue **does not** cover
-  this case. `pending_runs` only gets a row when a schedule *fired* but
-  the agent was momentarily disconnected at dispatch time. A window
-  missed entirely during sleep never enqueues anything.
-
-## Goal
-
-Let an operator mark a host as **not** always-on. Such a host:
-
-1. Does **not** raise offline/agent-down alerts when it is not visible.
-2. Renders a distinct, calm "asleep" state in the UI instead of the
-   alarming red "offline".
-3. When it reconnects, after a short settle delay, the server checks
-   whether it missed a scheduled backup and — if so — triggers a
-   catch-up backup automatically.
-4. Still raises a *staleness* alert if it has genuinely gone too long
-   without any backup (a host left in a drawer). This is the only
-   alert covering an asleep host: while the agent is offline no job
-   runs, so there is no failure to detect — staleness is the safety
-   net for "no backups are happening at all."
-5. Leaves normal job-failure alerting untouched: a backup that
-   actually runs (scheduled or catch-up) and fails alerts as it does
-   today. Failures can only occur while the agent is online and
-   executing restic.
-
-Default behaviour is unchanged for the entire existing fleet.
-
-## Decisions (from brainstorming)
-
- **Setting shape:** a single boolean `Always On` checkbox per host,
-  **default ON**. Checked = today's 24×7 server semantics. Unchecked =
-  intermittent host. Opt-in only; zero behaviour change for current and
-  future hosts unless explicitly toggled.
- **Overdue trigger:** evaluated on **reconnect + behind schedule**
-  (not a continuous always-evaluating sweep).
- **Alert policy for intermittent hosts:** suppress offline alerts;
-  keep a long-threshold **staleness** alert; keep job-failure alerts.
- **Staleness threshold:** **7 days**, a global constant for v1. May
-  become per-host configurable later — out of scope now.
- **Catch-up granularity:** **per enabled schedule.** A host with a
-  daily and a weekly schedule catches up only whichever is actually
-  behind.
- **UI vocabulary:** not-visible intermittent host shows a grey
-  `asleep` state; detail line reads
-  `asleep · last seen <relTime> · will catch up on return`.
- **Chip:** chip and checkbox highlight the **same** truth (24×7). Show
-  a chip for **Always-On** hosts; **no** chip for intermittent.
-
-## Architecture
-
-The change is deliberately a thin policy + presentation layer over the
-existing online/offline state machine. We do **not** add a new `status`
-enum value or alter heartbeat / `last_seen_at` tracking. "Asleep" is a
-reinterpretation of `status='offline' AND NOT always_on`.
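
To make the "no new enum value" point concrete, here is a small sketch of the derived state. `displayState` is a hypothetical helper name, not something this design mandates; the templates could just as well inline the condition.

```go
package main

import "fmt"

// displayState maps the stored status plus the always_on flag to the label
// shown in the UI. The stored status itself is never changed.
func displayState(status string, alwaysOn bool) string {
	if status == "offline" && !alwaysOn {
		return "asleep"
	}
	return status
}

func main() {
	fmt.Println(displayState("offline", true))
	fmt.Println(displayState("offline", false))
	fmt.Println(displayState("online", false))
}
```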
-
-### 1. Data model
-
- **Migration `0024_hosts_always_on.sql`:**
-  ```sql
-  ALTER TABLE hosts ADD COLUMN always_on INTEGER NOT NULL DEFAULT 1;
-  ```
-  Column-level ALTER per the repo's migration rules. Default `1` means
-  every existing row is Always-On — no behaviour change on upgrade.
- `store/types.go`: add `AlwaysOn bool` to the `Host` struct; thread it
-  through every host SELECT scan and the host insert/update paths.
- New store helper `SetHostAlwaysOn(ctx, hostID, bool) error`.
-
-### 2. Online/offline mechanics — UNCHANGED
-
-The 30s offline sweeper (`cmd/server/main.go:220`) still flips an unseen
-host to `status='offline'` and still calls
-`alertEngine.NotifyHostOffline(id)`. `TouchHost` / `MarkHostHello`
-behaviour is untouched. The intermittent distinction is applied
-*downstream* of this state, in the alert engine and the templates.
-
-### 3. Alert behaviour
-
-All changes key off `host.AlwaysOn`, which the engine already has access
-to via the host row it loads.
-
- **Suppress offline alert** (`alert/engine.go` `handleHostOffline()`
-  and the 60s `tick()`): when `!host.AlwaysOn`, do not raise
-  `agent_offline`.
- **Resolve-on-toggle:** when a host is switched Always-On→intermittent
-  and has an open `agent_offline` alert, auto-resolve it. (Handled in
-  the mode-change handler, fanning through the normal resolve path so
-  channels/audit fire as usual.)
- **Staleness alert** — wire up the currently-dead `KindStaleSchedule`
-  constant, **for intermittent hosts only.** On the 60s tick, for each
-  host where `!AlwaysOn` AND the host has ≥1 enabled schedule AND
-  `LastBackupAt != nil` AND `now - LastBackupAt > 7*24h`: raise a
-  `warning` `stale_schedule` alert (dedup key `""`, one per host).
-  Auto-resolves once `LastBackupAt` is within the threshold again
-  (i.e. after any successful backup, including the catch-up).
-  Always-On hosts' `stale_schedule` remains a no-op (unchanged, out
-  of scope).
-  - If `LastBackupAt == nil` (intermittent host enrolled but never
-    backed up): no staleness alert in v1 — there is no baseline to
-    measure against, and onboarding probe state (`repo_status`) already
-    covers "never successfully set up."
- **Job-failure alerts:** untouched. A catch-up backup that runs and
-  fails alerts exactly like any other backup.
-
-### 4. Catch-up on reconnect
-
-A new small component — the **catch-up scheduler** — lives server-side
-alongside the existing ticks.
-
- **Arm:** on agent hello (`server/ws/handler.go` hello path /
-  `onAgentHello`), if the host is `!AlwaysOn`, record
-  `catchupDueAt[hostID] = now + 60s` in an in-memory map. Re-arming on a
-  subsequent hello just overwrites the timestamp (debounce — rapid
-  flapping does not stack catch-ups). In-memory is acceptable: catch-up
-  is best-effort and a server restart simply re-arms on the next hello.
- **Fire:** reuse the existing 30s server tick. For each due entry
-  (`catchupDueAt <= now`):
-  1. Re-verify the agent is still connected (`Hub.Connected(hostID)`).
-     If it bounced back offline within the settle window, drop the entry
-     (it will re-arm on the next hello).
-  2. Skip if a backup is already running or queued for the host
-     (`current_job_id` set, or a relevant `pending_runs` row exists) —
-     avoid double-firing alongside a normal dispatch or pending drain.
-  3. For each **enabled** schedule on the host, compute overdue:
-     ```
-     overdue := !sched.Next(*host.LastBackupAt).After(now) // Next(lastBackup) <= now
-     ```
-     using `robfig/cron/v3` (already a dependency) to parse
-     `Schedule.CronExpr`. `Next(lastBackup)` is the first fire strictly
-     after the last successful backup; if that moment has already
-     passed, the window was missed → overdue. (If `LastBackupAt` is nil,
-     treat as overdue so a never-backed-up intermittent host with a
-     schedule gets its first run on connect.)
-  4. For each overdue schedule, dispatch its source-groups via the
-     existing `dispatchBackupForGroupCore()`.
-  5. Clear the entry.
-
-Net latency is ~60–90s after wake (60s settle + up to one 30s tick).
-This path is independent of and complementary to the `pending_runs`
-drain, which continues to handle the fired-but-not-sent case.
-
-### 5. UI
-
- **CSS:** new grey `dot-asleep` token in `web/styles/input.css`,
-  visually distinct from red `dot-offline`.
- **`partials/host_row.html` and `partials/host_chrome.html`:** when
-  `!AlwaysOn && status=='offline'`, render the grey dot + label
-  `asleep`; the detail/last-seen line reads
-  `asleep · last seen <relTime> · will catch up on return`. All other
-  states unchanged.
- **24×7 chip:** on the host detail header, render a small
-  `Always On` / `24×7` chip **only when `AlwaysOn` is true**. No chip
-  for intermittent hosts. (Chip and checkbox highlight the same fact.)
- **Toggle:** an `Always On` checkbox (default checked) on the host edit
-  surface. Operator-band `POST` (mirrors existing host-edit handlers),
-  audited as `host.mode_updated`. On save, if switching to intermittent,
-  trigger the resolve-on-toggle path for any open `agent_offline` alert.
-
-## Error handling & edge cases
-
- **Toggle server→intermittent while offline+alerting:** open
-  `agent_offline` alert auto-resolved on save.
- **Toggle intermittent→server while asleep:** host resumes normal
-  offline/alert semantics; it will alert per the 15-minute floor once
-  the sweeper/tick next evaluates it.
- **No enabled schedules:** no catch-up and no staleness alert — there
-  is no backup expectation to measure against.
- **Catch-up vs in-flight work:** guarded by the running/queued check in
-  step 4.2 so catch-up never races a normal dispatch or pending drain.
- **Agent flaps during settle window:** entry dropped if not connected
-  at fire time; re-armed on the next hello.
-
-## Testing
-
- **Alert engine (unit):**
-  - offline alert suppressed when `!AlwaysOn`.
-  - staleness alert raised when intermittent + schedule + last backup >
-    7d; not raised for Always-On hosts; not raised when last backup is
-    recent; not raised when no enabled schedule.
-  - staleness alert auto-resolves after a backup advances `LastBackupAt`.
-  - server→intermittent toggle resolves an open `agent_offline` alert.
- **Overdue computation (unit, table-driven):** `(cronExpr,
-  lastBackupAt, now) → overdue?` including nil-last-backup and
-  daily/weekly cases.
- **Catch-up scheduler (unit):** fires only when still connected; skips
-  when a backup is running/queued; dispatches only overdue schedules.
- **UI (render test):** asleep state + 24×7 chip render under the right
-  conditions; offline state for Always-On hosts unchanged.
- `go vet ./...` and full `go test ./...` green before merge.
-
-## Out of scope
-
- Per-host staleness thresholds (global 7d constant for v1).
- Continuous (non-reconnect) overdue evaluation.
- Agent-side catch-up cron — the server is the reliable arbiter.
- Wiring `stale_schedule` for Always-On hosts (separate concern).
-
-## Task tracking
-
-Add an entry to `tasks.md` under "Next steps from testing" (or a new
-small section) once the plan is approved, per the repo's tasks.md
-source-of-truth rule.
@@ -0,0 +1,259 @@
+# P2 Completion Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
+
+**Goal:** Close every remaining P2 task in `tasks.md`: P2R-09 (auto-init UX), P2R-10/11/12 (hooks), P2R-13 (bandwidth wiring + per-job override), P2R-14 (schedule next/last run), P2-16 (Windows svc), P2-17 (`install.ps1`), P2-18 (announce-and-approve).
+
+**Architecture:** Server stays HTTP+WS; agent stays a single binary that auto-restages via `make build`. Hooks live on `source_groups` (and host-level defaults). Announce-and-approve adds a separate WS path (`/ws/agent/pending`) and a Pending hosts panel; token-flow stays default. Windows service support uses `golang.org/x/sys/windows/svc` behind a `//go:build windows` tag — Linux builds untouched. **Operator is away — make best guesses on small UX choices, but commit each item separately so the choices are reviewable.**
+
+**Tech Stack:** Go 1.23+, chi router, modernc/sqlite, `coder/websocket`, `robfig/cron/v3`, HTMX + Tailwind, `golang.org/x/sys/windows/svc`, Ed25519 (stdlib).
+
+---
+
+## Pre-flight
+
+- [ ] **Run baseline:** `go vet ./... && go build ./... && go test ./...` — must be green before starting. Restage agent + restart server (per CLAUDE.md restage block) so smoke env is warm.
+
+## Order of execution
+
+Smallest blast-radius first. UI polish → bandwidth → next/last → hooks → announce → Windows. Commit and restage at each task boundary. Run `go vet ./... && go test ./...` before every commit.
+
+---
+
+## Task 1 — P2R-13a: Wire bandwidth caps into restic invocations
+
+**Files:**
+- Modify: `internal/restic/runner.go` (add `LimitUploadKBps`, `LimitDownloadKBps` to `Env` or to a per-call options struct already present; emit `--limit-upload N`/`--limit-download N` on `restic backup|forget|prune|check|restore`)
+- Modify: `internal/agent/runner/*.go` — pass host-wide caps into the runner. Caps come from `agent.config.Config` or are pushed via `config.update`. Decision: ship caps in the existing `config.update` envelope as new fields `bandwidth_up_kbps`, `bandwidth_down_kbps`. Server pushes on hello + on `PUT /api/hosts/{id}/bandwidth`.
+- Modify: `internal/api/messages.go` — extend `ConfigUpdatePayload` with the two int pointers.
+- Modify: `internal/server/ws/handler.go` (or wherever hello/config push lives) — include caps in the pushed config.
+- Modify: `internal/server/http/host_bandwidth.go` — after `SetHostBandwidth`, fan out a `config.update` to the connected agent (mirror the credentials-edit path).
+- Test: `internal/restic/runner_test.go` — assert flag injection.
+- Test: `internal/server/ws/*_test.go` — assert config.update carries caps on hello and on edit.
+
+- [ ] **Step 1.1** Add `LimitUploadKBps *int`, `LimitDownloadKBps *int` to whatever per-host config the runner already consults. Existing pattern is `restic.Env{}`; extend it.
+- [ ] **Step 1.2** Failing test in `internal/restic/runner_test.go`: build a backup command with `LimitUploadKBps=1024`, assert the resulting argv contains `--limit-upload 1024`.
+- [ ] **Step 1.3** Implement: prepend the flags in argv builders for `backup`, `forget`, `prune`, `check`, `restore`. Skip when nil/<=0.
+- [ ] **Step 1.4** Wire `config.update` payload — server reads `Host.BandwidthUpKBps`/`DownKBps`, includes them in the existing `ConfigUpdatePayload` push on hello and on bandwidth edit (mirror cred-edit fan-out in `internal/server/http/host_credentials.go`).
+- [ ] **Step 1.5** Agent applies caps: store in the in-memory dispatcher state on `config.update`, attach to every restic call.
+- [ ] **Step 1.6** `go vet ./... && go test ./... && make build && <restage block>`. Commit:
+```
+agent+server: apply host bandwidth caps to restic invocations
+```
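
A minimal sketch of the flag injection from Steps 1.2 and 1.3, assuming the caps arrive as optional integers in KiB/s. `limitArgs` and `backupArgv` are hypothetical names and the argv shape is an assumption; only the `--limit-upload` / `--limit-download` flags and the skip-when-nil-or-non-positive rule come from the plan.

```go
package main

import (
	"fmt"
	"strconv"
)

// limitArgs turns optional caps into restic's --limit-upload and
// --limit-download flags. A nil or non-positive cap emits nothing.
func limitArgs(upKBps, downKBps *int) []string {
	var args []string
	if upKBps != nil && *upKBps > 0 {
		args = append(args, "--limit-upload", strconv.Itoa(*upKBps))
	}
	if downKBps != nil && *downKBps > 0 {
		args = append(args, "--limit-download", strconv.Itoa(*downKBps))
	}
	return args
}

// backupArgv prepends the limit flags to an assumed backup argv shape.
func backupArgv(upKBps, downKBps *int, paths ...string) []string {
	argv := append([]string{"restic"}, limitArgs(upKBps, downKBps)...)
	argv = append(argv, "backup")
	return append(argv, paths...)
}

func main() {
	up := 1024
	fmt.Println(backupArgv(&up, nil, "/etc")) // [restic --limit-upload 1024 backup /etc]
}
```

The same `limitArgs` call could be reused by the `forget`, `prune`, `check`, and `restore` builders, so the skip rule lives in one place.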
+
+## Task 2 — P2R-13b: Per-job override on Run-now confirm dialog
+
+**Decision:** A small numeric input on the per-source-group Run-now button (and dashboard Run-all). Operator is away — keep it minimal: two optional inputs (up/down KB/s) on the dispatch endpoint; UI shows a `<details>` "Limit bandwidth for this run" disclosure with two number inputs.
+
+**Files:**
+- Modify: `internal/server/http/sources.go` (or wherever the per-group Run-now POST lives) — accept optional `bandwidth_up_kbps`/`bandwidth_down_kbps` form fields, pass through.
+- Modify: dispatch path (`internal/server/dispatch_*.go` or `ws/handler.go` job-dispatch core) — accept overrides, include in the `command.run` payload.
+- Modify: `internal/api/messages.go` — `CommandRunPayload` gains optional caps that take precedence over host-wide caps when present.
+- Modify: agent dispatcher: use the payload override if present, else fall back to config caps.
+- Modify: `web/templates/pages/host_sources.html` (and the schedules Run-now form) — `<details>` block.
+- Test: HTTP test for the new form fields; agent runner test for override precedence.
+
+- [ ] **Step 2.1** Failing test: POST to per-group Run-now with `bandwidth_up_kbps=512` → assert dispatched payload carries 512.
+- [ ] **Step 2.2** Implement endpoint changes + payload extension.
+- [ ] **Step 2.3** Agent override precedence test (payload wins over config).
+- [ ] **Step 2.4** UI `<details>` blocks (one per Run-now form).
+- [ ] **Step 2.5** Playwright spot-check via `:8080` smoke env: open Sources tab, expand the Run-now disclosure, fire with limit=128, then open the live job log and confirm the agent's restic argv (read `/tmp/rm-smoke/server.log` for the dispatched command — it logs argv) shows `--limit-upload 128`.
+- [ ] **Step 2.6** Commit.
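
The precedence rule in Step 2.3 (payload wins over config) might look like the sketch below. `effectiveCap` is a hypothetical helper, and treating a non-positive override as "absent" is an assumption, not something the plan states.

```go
package main

import "fmt"

// effectiveCap picks the cap for one run: a positive per-job override
// wins, otherwise the host-wide value applies. nil means "no cap".
func effectiveCap(override, hostWide *int) *int {
	if override != nil && *override > 0 {
		return override
	}
	return hostWide
}

func main() {
	host, job := 1024, 128
	fmt.Println(*effectiveCap(&job, &host)) // 128: per-job override wins
	fmt.Println(*effectiveCap(nil, &host))  // 1024: falls back to the host-wide cap
}
```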
+
+## Task 3 — P2R-14: Schedule "next run" / "last run"
+
+**Files:**
+- Modify: `internal/store/schedules.go` — add `NextRunAt(time.Time)` derivation helper and `LatestScheduledJobAt(host_id, schedule_id) (time.Time, error)` (or a single batched fetch for all schedules of a host).
+- Modify: dashboard host row (`web/templates/partials/host_row.html`) — show "Next: …" and "Last: …" when there's a single covering schedule (already detected in slice 5).
+- Modify: `web/templates/pages/host_schedules.html` — add Next/Last columns to the schedules table.
+- Modify: relevant page handlers (`internal/server/http/ui_schedules.go`, dashboard handler) — populate the data.
+- Test: `schedules_test.go` for next-run derivation (parse cron, compute next from a fixed `now`).
+
+- [ ] **Step 3.1** Add `NextRun(cronExpr string, from time.Time) (time.Time, error)` helper using `robfig/cron/v3`'s `Parse(...).Next(from)`. Test with three crons.
+- [ ] **Step 3.2** Add `LatestJobByActorKindForSchedule(host_id, schedule_id) (time.Time, status, error)` query against `jobs` (filter `actor_kind='schedule'` AND `schedule_id=?`, ORDER BY `started_at` DESC LIMIT 1).
+- [ ] **Step 3.3** Wire schedules-page handler to populate Next/Last per row; render relative time + ISO tooltip (mirror existing `formatRelTime` template helper if it exists; otherwise use a simple "5m ago" helper).
+- [ ] **Step 3.4** Wire dashboard row: when single covering schedule, surface "Next: 03:00" / "Last: 8h ago — succeeded".
+- [ ] **Step 3.5** Playwright spot-check: a host with a schedule shows Next/Last; pause it → Next becomes "—" / "(paused)".
+- [ ] **Step 3.6** Commit.
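
If no `formatRelTime` helper exists yet, the simple "5m ago" fallback mentioned in Step 3.3 could be as small as this sketch. `relTime` is a hypothetical name, and the unit cut-offs and the "just now" wording are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// relTime renders the age of a timestamp using the largest whole unit.
func relTime(age time.Duration) string {
	switch {
	case age < time.Minute:
		return "just now"
	case age < time.Hour:
		return fmt.Sprintf("%dm ago", int(age/time.Minute))
	case age < 24*time.Hour:
		return fmt.Sprintf("%dh ago", int(age/time.Hour))
	default:
		return fmt.Sprintf("%dd ago", int(age/(24*time.Hour)))
	}
}

func main() {
	fmt.Println(relTime(5 * time.Minute)) // 5m ago
	fmt.Println(relTime(8 * time.Hour))   // 8h ago
}
```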
+
+## Task 4 — P2R-09: Auto-init UX polish
+
+**Files:**
+- Modify: `web/templates/pages/host_repo.html` — danger-zone re-init button + two-step confirm (type the host name).
+- Modify: `internal/server/http/ui_repo.go` (or new `repo_reinit.go`) — `POST /hosts/{id}/repo/reinit` admin-only, audit-logged. Server runs `restic init --force` (or wipes-then-inits — pick the safer of the two; restic doesn't truly wipe a repo, the operator must clear the bucket. **Best guess:** dispatch a normal `init` job with a flag that re-runs even if the repo claims to exist; if restic refuses, surface "the repo on the remote already has data — clear it manually before re-init" via the job log).
+- Modify: host detail page header / vitals strip — surface init result line. Use the existing latest-`init`-job query to render "repo ready · initialised <relative time> ago" or "init failed · job N · retry".
+- Test: HTTP test for re-init endpoint (auth, audit, host-name confirm); template test that the result line renders for both states.
+
+- [ ] **Step 4.1** Add helper: `LatestJobByKind(host_id, "init")` — already exists from P2R-06 (`store.LatestJobByKind`). Reuse.
+- [ ] **Step 4.2** Render init line into vitals strip; show "init failed" amber when latest init failed.
+- [ ] **Step 4.3** Implement `POST /hosts/{id}/repo/reinit` handler — admin role check, requires a `confirm_hostname` form field that must equal `host.Name`, returns 400 otherwise. Dispatches a fresh `init` job.
+- [ ] **Step 4.4** Add danger-zone re-init form to `host_repo.html` (currently disabled per slice 4). Two-step confirm with the typed hostname.
+- [ ] **Step 4.5** Playwright: visit `/hosts/{id}/repo`, click re-init, type wrong hostname → blocked; type right hostname → dispatches init job → returns to live log.
+- [ ] **Step 4.6** Commit.
+
+## Task 5 — P2R-10: Hook schema (migration 0010)
+
+**Files:**
+- Create: `internal/store/migrations/0010_hooks.sql`
+  - `ALTER TABLE source_groups ADD COLUMN pre_hook BLOB;`  (AEAD ciphertext, NULLable)
+  - `ALTER TABLE source_groups ADD COLUMN post_hook BLOB;`
+  - `ALTER TABLE hosts ADD COLUMN pre_hook_default BLOB;`
+  - `ALTER TABLE hosts ADD COLUMN post_hook_default BLOB;`
+  - All four are AEAD ciphertext (existing `crypto.AEAD`); BLOB column type.
+- Modify: `internal/store/types.go` — add `PreHook *string` (decrypted), `PostHook *string` to `SourceGroup`; same to `Host`.
+- Modify: `internal/store/sources.go` + `internal/store/hosts.go` — getters/setters encrypt on write, decrypt on read. Pass `crypto.AEAD` through (pattern mirrors `host_credentials.go`).
+- Test: encrypt/decrypt round-trip; setting `nil` clears the column.
+
+- [ ] **Step 5.1** Write migration SQL. Column-level ALTERs only (per CLAUDE.md).
+- [ ] **Step 5.2** Update store types + getters/setters with AEAD encrypt/decrypt. Mirror `internal/store/host_credentials.go` patterns exactly.
+- [ ] **Step 5.3** Round-trip test: set hook on a source group; reload; assert plaintext returned. Set nil; assert nil after reload.
+- [ ] **Step 5.4** `go vet && go test`. Commit.
+
+## Task 6 — P2R-11: Agent execution of hooks
+
+**Files:**
+- Modify: `internal/api/messages.go` — `ConfigUpdatePayload` (or the per-source-group bundle inside `ScheduleSetPayload`) carries `PreHook`, `PostHook` plaintext (server has decrypted by then; wire is authenticated WS, same trust boundary as repo creds).
+- Modify: agent dispatcher — for `kind=backup` only:
+  - Run `pre_hook` (if present) via `os/exec` with the host shell (`/bin/sh -c` on Linux, `cmd.exe /C` on Windows). Capture stdout+stderr → JobLog with `hook:` prefix. Non-zero exit aborts the backup, marks the job failed with `pre_hook` error.
+  - Run `post_hook` (if present) **always** after the backup, with `RM_JOB_STATUS=succeeded|failed` env var. Capture into JobLog, prefix `hook:`. Non-zero exit on post_hook does NOT change job status (warning logged).
+  - Skip both for `kind` ∈ {forget, prune, check, unlock, init} per spec.md §14.3.
+- Test: dispatcher test with a `pre_hook` that exits 1 → backup not started; `post_hook` always runs and sees `RM_JOB_STATUS`.
+
+- [ ] **Step 6.1** Plumb hooks through `ScheduleSetPayload` source-group bundle + per-group Run-now `command.run` payload (override host-default with group hook if both present). Server-side resolution: host default if group hook is empty.
+- [ ] **Step 6.2** Agent dispatcher: factor hook execution into `internal/agent/runner/hooks.go`. Use `exec.CommandContext`, set env, plumb output to existing JobLog stream with `Source: "hook"` (or prefix the log lines `hook: …`).
+- [ ] **Step 6.3** Failing test in `internal/agent/runner/runner_test.go` (create file if absent): `pre_hook=/bin/false` → job fails with `pre_hook failed (exit 1)` and the actual restic backup never runs (assert via mock-restic shim).
+- [ ] **Step 6.4** Test: `post_hook` runs even when backup fails; receives `RM_JOB_STATUS=failed`.
+- [ ] **Step 6.5** Test: hooks skipped on `forget`/`prune`/`check`/`unlock` jobs.
+- [ ] **Step 6.6** `go vet && go test && make build && <restage block>`. Commit.
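
The ordering rules above can be sketched without touching `os/exec`, by modelling each hook and the backup as a function value. All names here are hypothetical. One assumption is labelled in the code: the post-hook is skipped when the pre-hook aborts the job, since the plan only says it runs "always after the backup".

```go
package main

import (
	"errors"
	"fmt"
)

// errHook stands in for a hook that exited non-zero.
var errHook = errors.New("hook exited non-zero")

// step is one unit of work (a hook or the restic backup); env carries
// extra environment variables for that step.
type step func(env map[string]string) error

type result struct {
	Status  string // "succeeded" or "failed"
	Err     error  // why the job failed, if it did
	Warning error  // post-hook failure; never changes Status
}

func runBackupWithHooks(pre, backup, post step) result {
	if pre != nil {
		if err := pre(nil); err != nil {
			// Pre-hook failure aborts the job; the backup never starts.
			// Assumption: the post-hook is skipped in this case.
			return result{Status: "failed", Err: fmt.Errorf("pre_hook failed: %w", err)}
		}
	}
	res := result{Status: "succeeded"}
	if err := backup(nil); err != nil {
		res.Status, res.Err = "failed", err
	}
	if post != nil {
		// The post-hook sees the final status of the attempted backup.
		res.Warning = post(map[string]string{"RM_JOB_STATUS": res.Status})
	}
	return res
}

func main() {
	backup := func(map[string]string) error { return errHook }
	post := func(env map[string]string) error {
		fmt.Println("post_hook saw", env["RM_JOB_STATUS"])
		return nil
	}
	fmt.Println(runBackupWithHooks(nil, backup, post).Status)
}
```

Keeping the orchestration separate from process spawning would let the tests in Steps 6.3 to 6.5 run without a shell.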
+
+## Task 7 — P2R-12: Hook editor UI
+
+**Files:**
+- Modify: `web/templates/pages/source_group_edit.html` (new or extend existing source-group form) — `<textarea>` for pre_hook, `<textarea>` for post_hook, with the warning banner: "this hook runs as the agent service user (root on Linux; LocalSystem on Windows)".
+- Modify: source-group HTTP handler (`internal/server/http/sources.go`) — accept hook fields on POST/PUT, encrypt-and-persist via store.
+- Modify: Repo page (`web/templates/pages/host_repo.html`): add a new "Hooks" section for `pre_hook_default` / `post_hook_default`. **Decision:** host defaults live here rather than on a new "Settings" tab, since Settings is still inert (per P1-25).
+- Modify: source-group form: enforce an admin-only check on hook edits. **Decision:** admin-only edit per spec; render the fields but disable them for operators.
+- Modify: audit-log writer — emit `source_group.hook_updated` and `host.default_hook_updated` events (without the hook body).
+- Test: HTTP test for create + update; admin-only enforcement; audit row written without secret.
+
+- [ ] **Step 7.1** Source-group form extension + handler wiring.
+- [ ] **Step 7.2** Repo page Hooks section (host defaults).
+- [ ] **Step 7.3** Audit entries.
+- [ ] **Step 7.4** Playwright: as admin, set a `pre_hook` of `echo hello`, fire Run-now, open live log, confirm `hook: hello` line appears.
+- [ ] **Step 7.5** Commit.
+
+## Task 8 — P2-18a: Announce schema + endpoint
+
+**Files:**
+- Create: `internal/store/migrations/0011_pending_hosts.sql`
+  ```sql
+  CREATE TABLE pending_hosts (
+    id                 TEXT PRIMARY KEY,
+    hostname           TEXT NOT NULL,
+    os                 TEXT NOT NULL,
+    arch               TEXT NOT NULL,
+    agent_version      TEXT NOT NULL,
+    restic_version     TEXT NOT NULL,
+    public_key         BLOB NOT NULL,             -- 32-byte Ed25519
+    fingerprint        TEXT NOT NULL,             -- "SHA256:hex"
+    announced_from_ip  TEXT NOT NULL,
+    first_seen_at      TEXT NOT NULL,
+    last_seen_at       TEXT NOT NULL,
+    expires_at         TEXT NOT NULL
+  );
+  CREATE INDEX pending_hosts_expires ON pending_hosts(expires_at);
+  CREATE INDEX pending_hosts_fingerprint ON pending_hosts(fingerprint);
+  ```
+- Create: `internal/store/pending_hosts.go` — `CreatePendingHost`, `GetPendingHostByFingerprint`, `ListPendingHosts`, `DeletePendingHost`, `TouchPendingHost`, `DeleteExpiredPendingHosts`.
+- Create: `internal/server/http/announce.go` — `POST /api/agents/announce` accepts `{hostname, os, arch, agent_version, restic_version, public_key (base64)}`. Validates protocol_version implicitly via `agent_version` check. Token-bucket rate limit per source IP (10/min). Global cap 100 pending rows. Returns `{fingerprint, pending_id, hostname_collision: bool}`.
+- Test: `announce_test.go` — happy path; rate limit; cap; collision flag.
+
+- [ ] **Step 8.1** Migration + store layer + tests.
+- [ ] **Step 8.2** Endpoint + tests (use a fake clock + in-process token bucket).
+- [ ] **Step 8.3** Commit.
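
The `"SHA256:hex"` fingerprint column could be derived as in the sketch below. `fingerprint` is a hypothetical helper, and hashing the raw 32-byte Ed25519 public key (rather than some encoded form) is an assumption.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprint renders a public key as "SHA256:<hex digest>".
// Assumption: the digest covers the raw 32-byte Ed25519 public key.
func fingerprint(pub ed25519.PublicKey) string {
	sum := sha256.Sum256(pub)
	return "SHA256:" + hex.EncodeToString(sum[:])
}

func main() {
	pub, _, err := ed25519.GenerateKey(nil) // nil selects crypto/rand.Reader
	if err != nil {
		panic(err)
	}
	fmt.Println(fingerprint(pub))
}
```

Because the value is a pure function of the key, the agent and the server can each compute it independently and the operator can compare the two by eye.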
+
+## Task 9 — P2-18b: Pending WS + accept/reject
+
+**Files:**
+- Create: `internal/server/ws/pending.go` — `GET /ws/agent/pending` upgrade. Server issues a 32-byte nonce; agent signs it with its Ed25519 private key; server verifies against the `public_key` stored on the pending row keyed by the supplied `pending_id`. If valid, hold the connection open; on accept, push a single `enrolled` message containing `{bearer_token, repo_credentials_aead_blob}` and close cleanly. On reject, close with code 4001 + reason "rejected".
+- Create: `internal/server/http/pending.go` — admin-only `POST /api/pending-hosts/{id}/accept` (atomically: mint bearer, decrypt admin-supplied repo creds (passed in form), promote pending row → real `hosts` row, push `enrolled` to the open WS, audit-log) and `POST /api/pending-hosts/{id}/reject` (delete row + close socket).
+- Modify: server `main.go` route registration.
+- Test: integration test — fake agent opens pending WS, admin POST /accept, agent receives bearer.
+
+- [ ] **Step 9.1** Pending WS handler with nonce-sign verify.
+- [ ] **Step 9.2** Accept/reject endpoints. Accept reuses the existing token-consume path internally (mints persistent bearer from `crypto.RandomToken`-style helper, inserts host row + `host_credentials`).
+- [ ] **Step 9.3** Tests.
+- [ ] **Step 9.4** Commit.
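
A rough sketch of the nonce-sign verify in Step 9.1, using only the standard library. The function names are hypothetical, and signing the bare nonce with no extra framing is an assumption; the real handler would also carry the `pending_id` lookup and WebSocket messaging.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

// newNonce returns the 32-byte challenge the server would issue.
func newNonce() []byte {
	n := make([]byte, 32)
	if _, err := rand.Read(n); err != nil {
		panic(err)
	}
	return n
}

// newAgentKey generates the agent-side Ed25519 keypair.
func newAgentKey() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	return pub, priv
}

// signNonce is the agent side: sign the server's challenge.
func signNonce(priv ed25519.PrivateKey, nonce []byte) []byte {
	return ed25519.Sign(priv, nonce)
}

// verifyNonce is the server side: check the signature against the public
// key stored on the pending row. The length guard avoids the panic that
// ed25519.Verify raises on a malformed key.
func verifyNonce(pub ed25519.PublicKey, nonce, sig []byte) bool {
	return len(pub) == ed25519.PublicKeySize && ed25519.Verify(pub, nonce, sig)
}

func main() {
	pub, priv := newAgentKey()
	nonce := newNonce()
	fmt.Println(verifyNonce(pub, nonce, signNonce(priv, nonce))) // true
}
```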
+
+## Task 10 — P2-18c: Agent announce path
+
+**Files:**
+- Modify: `cmd/agent/main.go` — when `RM_TOKEN` is unset, switch to announce mode instead of erroring out. `RM_SERVER` still required.
+- Create: `internal/agent/announce/announce.go` — generate-or-load Ed25519 keypair (persisted as a file alongside `secrets.enc`, mode 0600). POST `/api/agents/announce`. Open `/ws/agent/pending`. Wait. On `enrolled` message, persist bearer to `agent.yaml`, persist repo creds via existing secrets store, exit announce mode and reconnect via the normal WS path.
+- Modify: `deploy/install/install.sh` — when `RM_TOKEN` is missing, run agent in announce mode and `journalctl --follow` until the agent prints the fingerprint, print it to the operator's terminal in big copy-friendly format, then keep following until enrolled.
+- Test: end-to-end test in `internal/server/...` using a fake agent.
+
+- [ ] **Step 10.1** Keypair generation + persistence.
+- [ ] **Step 10.2** Announce client + pending WS client; print `SHA256:…` fingerprint to stdout in a banner.
+- [ ] **Step 10.3** Install script branch.
+- [ ] **Step 10.4** Playwright: register a host via announce mode (run agent locally with no RM_TOKEN), log into UI, see Pending hosts panel with the fingerprint, click Accept, confirm host appears.
+- [ ] **Step 10.5** Commit.
+
+## Task 11 — P2-18d: Pending hosts UI panel
+
+**Files:**
+- Modify: `web/templates/pages/dashboard.html` — add Pending hosts panel above the host list when any pending rows exist.
+- Modify: dashboard handler — `Store.ListPendingHosts(now)` (auto-skips expired).
+- Add buttons → POST `/api/pending-hosts/{id}/accept` and `/reject` via HTMX.
+- Background sweeper for `DeleteExpiredPendingHosts` every 60s (mirror the existing offline-sweeper goroutine pattern).
+
+- [ ] **Step 11.1** Sweeper goroutine.
+- [ ] **Step 11.2** Dashboard handler + template.
+- [ ] **Step 11.3** Accept form must include the same repo URL/user/pw fields as the token-mint form (admin still supplies repo creds at accept time).
+- [ ] **Step 11.4** Playwright sweep.
+- [ ] **Step 11.5** Commit.
+
+## Task 12 — P2-16: Windows service integration
+
+**Decision:** Cannot test on Windows from WSL. Goal is a clean compile under `GOOS=windows GOARCH=amd64` and code that follows the canonical `golang.org/x/sys/windows/svc/example` pattern. Untestable beyond compile + manual review; mark in commit message.
+
+**Files:**
+- Create: `internal/agent/service/service_windows.go` (build tag `//go:build windows`) — implements `svc.Handler`. `Execute` starts the agent's main loop in a goroutine, listens for `svc.Stop`/`svc.Shutdown`, cancels ctx, waits.
+- Create: `internal/agent/service/service_other.go` (build tag `//go:build !windows`) — stub `RunService` that just runs the agent loop in the foreground.
+- Create: `internal/agent/service/install_windows.go` — `Install`, `Uninstall`, `Start`, `Stop` thin wrappers around `mgr` package.
+- Modify: `cmd/agent/main.go` — sub-commands: `install`, `uninstall`, `start`, `stop`, `run` (default). `run` delegates to `service.Run()` which on Windows checks `svc.IsWindowsService()` and dispatches accordingly.
+- Test: `internal/agent/service/service_windows_test.go` (build-tagged) for argv parsing only — actual SCM interaction can't be tested in CI.
+
+- [ ] **Step 12.1** Implement the svc.Handler shell.
+- [ ] **Step 12.2** Install/uninstall wrappers (use `mgr.Connect()`, `m.CreateService(name, exepath, mgr.Config{...}, "run")`).
+- [ ] **Step 12.3** Cross-compile check: `GOOS=windows GOARCH=amd64 go build ./cmd/agent` must succeed.
+- [ ] **Step 12.4** Commit with note "untested on Windows; compile-verified only".
+
+## Task 13 — P2-17: install.ps1
+
+**Files:**
+- Create: `deploy/install/install.ps1` — PowerShell 5.1+ compatible. Checks admin elevation. Downloads agent binary from `$RM_SERVER/agent/binary?os=windows&arch=amd64`. Drops it at `C:\Program Files\restic-manager\restic-manager-agent.exe`. Runs `restic-manager-agent.exe install` (registers service). Starts it. Detects existing tasks named `*restic*` via `Get-ScheduledTask` and prints them — does not auto-disable. Writes `C:\ProgramData\restic-manager\agent.yaml` with `RM_SERVER` + `RM_TOKEN` (or no token if announce-mode).
+- Modify: `internal/server/http/install.go` (or wherever install scripts are served) to also serve `/install/install.ps1`.
+- Modify: CLAUDE.md restage block to also stage `install.ps1`.
+
+- [ ] **Step 13.1** Write the script.
+- [ ] **Step 13.2** Wire serving + restage.
+- [ ] **Step 13.3** Smoke parse, if `pwsh` is on PATH: `pwsh -NoProfile -Command 'Get-Command -Syntax (Get-ChildItem deploy/install/install.ps1)'`, or a plain parse via `pwsh -NoProfile -Command '$null = [scriptblock]::Create((Get-Content deploy/install/install.ps1 -Raw))'` (single quotes so the calling shell does not expand `$null`). Skip if no pwsh is available and note that in the commit.
+- [ ] **Step 13.4** Commit.
+
+## Task 14 — Final integration sweep
+
+- [ ] **Step 14.1** `go vet ./... && go test ./... -race`. Full build. Restage. Restart server.
+- [ ] **Step 14.2** Playwright walkthrough on `:8080`: login → dashboard shows pending-hosts empty state → create source group → set a `pre_hook` → Run-now with bandwidth override → confirm hook fires + bandwidth applied → schedules tab shows next/last → repo page shows init-OK line → re-init flow gated by typed hostname.
+- [ ] **Step 14.3** Update `tasks.md`: tick P2R-09, P2R-10, P2R-11, P2R-12, P2R-13, P2R-14, P2-16, P2-17, P2-18 done. Update Phase 2 acceptance line items as satisfied.
+- [ ] **Step 14.4** Open PR `p2-completion → main` with a summary of every item closed.
+
+---
+
+## Decisions made on the operator's behalf (away)
+
+1. **Bandwidth UI for per-job override:** small `<details>` disclosure under each Run-now button. Simpler than a modal; matches the rest of the app's progressive-disclosure style.
+2. **Re-init UX:** server dispatches a fresh `init` job; if restic refuses because the repo already exists, surfaces the error in the job log and instructs the operator to clear the remote bucket. We don't try to forcibly wipe — too dangerous, and the agent doesn't have credentials to wipe S3/B2/etc generically.
+3. **Hooks editor lives on the Repo page (host defaults) + on the source-group edit form (per-group override).** Skips inventing a new "Settings" tab since that surface is still inert.
+4. **Announce flow:** admin still supplies repo creds at accept time (same form as the token-mint flow). The pending row only carries identity-of-the-endpoint material, never repo creds.
+5. **Windows service:** compile-verified only; untested. Commit message will say so.
@@ -0,0 +1,473 @@
+# P3 — Alerts (design)
+
+> Phase 3 sub-spec covering the alerts engine, notification channels, and UI
+> (P3-05 / P3-06 / P3-07).
+>
+> Wireframe: `_diag/p3-alerts-wireframe/wireframe.html`. Screenshots in the
+> same directory. Spec brainstorm ran 2026-05-04; user approved all ten
+> design decisions before this spec was written.
+
+## Scope locked
+
+Brainstorm decisions (in order asked):
+
+1. **Rule model.** Hardcoded rule set, no operator-tunable thresholds in v1.
+   The engine knows about each rule type internally; per-rule config can land
+   later if/when an operator asks.
+2. **Rule set.** Six rules: `backup_failed`, `forget_failed`, `prune_failed`,
+   `check_failed`, `stale_schedule`, `agent_offline`.
+3. **Engine cadence.** Hybrid. Event hooks at the existing
+   `MarkJobFinished` and offline-sweeper sites for the immediate triggers;
+   one 60-second ticker handles stale-schedule detection and auto-resolution.
+4. **Resolution.** Auto-resolve when the underlying condition clears + manual
+   Resolve at any time. Acknowledge is a separate "I've seen it" intermediate
+   state that does NOT close the alert.
+5. **v1 channels.** Webhook + native ntfy + SMTP. Apprise deferred (the
+   channel plumbing accepts new kinds without reshaping). SMTP added as
+   a first-class channel post-brainstorm because the use case — overnight
+   alerts the operator wants to read in the morning rather than be pinged
+   on at 03:00 — is poorly served by ntfy's push model and clumsy via
+   webhook → email-gateway.
+6. **Channel scope.** Global only. No per-host or per-severity routing in v1.
+7. **Notification body.** Structured JSON for webhooks, formatted
+   title+body+click-URL for ntfy, plus a per-channel "Send test notification"
+   button with inline result feedback.
+8. **Deduplication.** Open-alert uniqueness on `(host_id, kind)` with a
+   `last_seen_at` bump on every confirming tick. One notification per
+   occurrence; the UI shows "still happening · Ns ago" while a rule keeps
+   matching.
+9. **Alert UI.** Top-level `/alerts` page (the existing nav stub becomes
+   real). Per-host vitals "Open alerts" cell links to `/alerts?host_id=...`.
+   Channel CRUD lives at `/settings/notifications`.
+10. **Delivery semantics.** Best-effort fire-and-forget with a 5s timeout
+    per notification. Failures are logged but not retried. The alert row in
+    the DB is the source of truth.
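
Decision 8 (open-alert uniqueness on `(host_id, kind)`) can be illustrated with an in-memory stand-in for the `alerts` table. This is only a sketch: the types and method names are hypothetical, and the real store would enforce the rule in SQLite rather than in a map.

```go
package main

import (
	"fmt"
	"time"
)

type alertKey struct{ HostID, Kind string }

type openAlert struct {
	ID         int
	LastSeenAt time.Time
}

// openAlerts holds at most one open alert per (host_id, kind).
type openAlerts struct {
	nextID int
	byKey  map[alertKey]*openAlert
}

func newOpenAlerts() *openAlerts {
	return &openAlerts{byKey: map[alertKey]*openAlert{}}
}

// raiseOrTouch returns the alert id and whether a notification is due.
// A repeat match only bumps LastSeenAt and does not notify again.
func (s *openAlerts) raiseOrTouch(hostID, kind string) (id int, notify bool) {
	k := alertKey{hostID, kind}
	if a, ok := s.byKey[k]; ok {
		a.LastSeenAt = time.Now()
		return a.ID, false
	}
	s.nextID++
	s.byKey[k] = &openAlert{ID: s.nextID, LastSeenAt: time.Now()}
	return s.nextID, true
}

// resolve closes the open alert so the next match raises a fresh one.
func (s *openAlerts) resolve(hostID, kind string) {
	delete(s.byKey, alertKey{hostID, kind})
}

func main() {
	s := newOpenAlerts()
	fmt.Println(s.raiseOrTouch("h1", "agent_offline")) // 1 true
	fmt.Println(s.raiseOrTouch("h1", "agent_offline")) // 1 false
	s.resolve("h1", "agent_offline")
	fmt.Println(s.raiseOrTouch("h1", "agent_offline")) // 2 true
}
```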
+
+## Architecture
+
+The subsystem is three loosely-coupled units behind one `AlertEngine`
+goroutine:
+
+```
+                                 ┌───────────────────────────┐
+   event hooks ─────────────────►│                           │
+                                 │   AlertEngine             │ ──► raise/resolve
+   60s ticker ──────────────────►│   (rule evaluation)       │     alert row
+                                 │                           │
+                                 └────────────┬──────────────┘
+                                              │
+                                              ▼
+                                  ┌──────────────────────┐
+                                  │   notification.Hub   │
+                                  │   (fire-and-forget)  │
+                                  └──┬────────┬──────────┘
+                                     │        │
+                              ┌──────▼──┐  ┌──▼──────┐
+                              │ Webhook │  │  Ntfy   │  …future channels
+                              └─────────┘  └─────────┘
+```
+
+### Component boundaries
+
+| Component                                | Purpose                                                                                  | Depends on                             |
+| ---------------------------------------- | ---------------------------------------------------------------------------------------- | -------------------------------------- |
+| `internal/alert.Engine`                  | Owns the rule evaluation. Exposes `OnJobFinished`, `OnHostOffline`, `OnHostOnline` event hooks; runs a 60s ticker for stale-schedule + auto-resolution sweeps. Persists raises/resolves through the store. | store, notification.Hub, slog          |
+| `internal/alert.Rule` + per-rule files   | Each of the six rules is a small struct with `Kind() string`, `Severity() string`, `MessageFor(ctx) string`. The engine iterates over a registered slice. | store models                           |
+| `internal/notification.Hub`              | Receives "alert raised/resolved/test" events; fans out to enabled channels in parallel; logs results to a new `notification_log` table.        | store, channel adapters                |
+| `internal/notification.Channel` (iface)  | Single method `Send(ctx, payload) error` with a 5s context for HTTP channels, 10s for SMTP. Three impls in v1: `webhookChannel`, `ntfyChannel`, `smtpChannel`. | http.Client; net/smtp + crypto/tls for SMTP |
+| `internal/store/alerts.go`               | CRUD on `alerts` table: `RaiseOrTouch(host_id, kind, severity, message)`, `Acknowledge(id, user)`, `Resolve(id, by user)`, `AutoResolve(host_id, kind)`, `ListAlerts(filter)`, plus the `last_seen_at` bump. | sqlite                                 |
+| `internal/store/notification_channels.go` | CRUD on `notification_channels` (new table) + `notification_log` (new table).            | sqlite, crypto.AEAD (for secrets)      |
+| `internal/server/http/ui_alerts.go`      | `/alerts` page handler + filter parsing + ack/resolve form actions.                      | store                                  |
+| `internal/server/http/ui_notifications.go` | `/settings/notifications` page + channel CRUD + "Send test" handler.                   | store, notification.Hub                |
+
+### Engine event shape
+
+The engine runs as one goroutine per server process started in
+`cmd/server/main.go`. It exposes a small set of channels other code writes to:
+
+```go
+type Engine struct {
+    store *store.Store
+    hub   *notification.Hub
+
+    // Event channels (buffered, drop-on-full with a slog warning to keep
+    // hot paths non-blocking). The engine drains them on its own
+    // goroutine, evaluates the rule, and acts.
+    jobFinished chan jobFinishedEvent  // from store.MarkJobFinished hook
+    hostOffline chan string            // host_id; from offline sweeper
+    hostOnline  chan string            // host_id; from ws handler hello
+
+    // 60s ticker drives stale-schedule + auto-resolution sweeps.
+    tick *time.Ticker
+}
+```
+
+The hot-path call sites (`store.MarkJobFinished`, `ws.handler` offline
+sweep, `ws.handler` hello) push to these channels via a tiny
+`Engine.Notify*` method that does a non-blocking send. The engine's own
+goroutine handles every match — keeps mutation off the hot path.
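
The drop-on-full behaviour described above can be sketched as follows. This is a minimal illustration under assumptions, not the project's code: the `engine` type, the `NotifyHostOffline` name, and the buffer size are hypothetical stand-ins for the `Engine.Notify*` methods.

```go
package main

import (
	"fmt"
	"log/slog"
)

// engine is a cut-down, hypothetical stand-in for alert.Engine with a
// single buffered event channel.
type engine struct {
	hostOffline chan string
}

// NotifyHostOffline attempts a non-blocking send and reports whether the
// event was queued. When the buffer is full the event is dropped and a
// warning is logged, so the caller never blocks.
func (e *engine) NotifyHostOffline(hostID string) bool {
	select {
	case e.hostOffline <- hostID:
		return true
	default:
		slog.Warn("alert engine queue full, dropping event",
			"event", "host_offline", "host_id", hostID)
		return false
	}
}

func main() {
	// Buffer of 1 so the second send demonstrates the drop path.
	e := &engine{hostOffline: make(chan string, 1)}
	fmt.Println(e.NotifyHostOffline("host-a")) // queued
	fmt.Println(e.NotifyHostOffline("host-b")) // buffer full, dropped
}
```

A dropped event is not necessarily lost for good: per the catalogue below, the 60s ticker re-evaluates stale schedules and auto-resolution on its own.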
+
+### Rule catalogue
+
+| Kind                | Severity | Trigger                                                                 | Auto-resolve when                                  |
+| ------------------- | -------- | ----------------------------------------------------------------------- | -------------------------------------------------- |
+| `backup_failed`     | warning  | `MarkJobFinished` with kind=backup, status=failed                       | next backup for the same host succeeds             |
+| `forget_failed`     | warning  | `MarkJobFinished` with kind=forget, status=failed                       | next forget for the same host succeeds             |
+| `prune_failed`      | warning  | `MarkJobFinished` with kind=prune, status=failed                        | next prune for the same host succeeds              |
+| `check_failed`      | critical | `MarkJobFinished` with kind=check, status=failed OR errors_found        | next check for the same host succeeds without errors |
+| `stale_schedule`    | warning  | 60s ticker: a schedule's next-fire time is more than 5 minutes in the past with no matching job since | next job for that schedule succeeds OR schedule deleted |
+| `agent_offline`     | warning  | offline-sweeper marks the host offline AND the host has been offline > 15 min (engine checks `last_seen_at`) | hostOnline event for that host                     |
+
+The 15-minute floor on `agent_offline` exists so a 30-second blip during
+agent restart doesn't generate a notification storm. The store's existing
+offline sweeper (`hosts.last_seen_at` with a 90s threshold) already marks the
+host offline; the engine sees that event but holds off raising until the
+host has been offline for more than 15 minutes.
+
+### Dedup + last_seen_at
+
+`store.RaiseOrTouch(host_id, kind, severity, message)`:
+
+```sql
+SELECT id, last_seen_at FROM alerts
+ WHERE host_id = ? AND kind = ? AND resolved_at IS NULL
+ LIMIT 1;
+```
+
+- Found: `UPDATE alerts SET last_seen_at = ?, message = ? WHERE id = ?`,
+  return `(id, didRaise=false)`.
+- Not found: `INSERT INTO alerts (id, host_id, kind, severity, message,
+  created_at, last_seen_at) VALUES (?, ?, ?, ?, ?, ?, ?)`, return
+  `(id, didRaise=true)`.
+
+The engine fires a notification through the Hub only when `didRaise=true`.
+Touch-only events keep the row's `last_seen_at` fresh so the UI can render
+"still happening · Ns ago" without spamming the operator's phone.
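
The raise-or-touch semantics can be illustrated without a database. The sketch below is an in-memory stand-in, assuming one open alert per `(host_id, kind)`; `memStore`, its fields, and the `alert-N` id format are hypothetical and only mirror the SQL above.

```go
package main

import (
	"fmt"
	"time"
)

type alertKey struct{ hostID, kind string }

type alert struct {
	id, severity, message string
	createdAt, lastSeenAt time.Time
}

// memStore keeps at most one open (unresolved) alert per (host, kind),
// standing in for the "resolved_at IS NULL" lookup.
type memStore struct {
	open   map[alertKey]*alert
	nextID int
	now    func() time.Time
}

func newMemStore() *memStore {
	return &memStore{open: map[alertKey]*alert{}, now: time.Now}
}

// RaiseOrTouch bumps last_seen_at and message on an existing open alert
// (didRaise=false) or inserts a new one (didRaise=true).
func (s *memStore) RaiseOrTouch(hostID, kind, severity, message string) (id string, didRaise bool) {
	k := alertKey{hostID, kind}
	now := s.now()
	if a, ok := s.open[k]; ok {
		a.lastSeenAt, a.message = now, message
		return a.id, false
	}
	s.nextID++
	a := &alert{
		id: fmt.Sprintf("alert-%d", s.nextID), severity: severity, message: message,
		createdAt: now, lastSeenAt: now,
	}
	s.open[k] = a
	return a.id, true
}

// Resolve closes the open alert so the next occurrence raises afresh.
func (s *memStore) Resolve(hostID, kind string) { delete(s.open, alertKey{hostID, kind}) }

func main() {
	s := newMemStore()
	for _, step := range []string{"fail", "fail", "resolve", "fail"} {
		if step == "resolve" {
			s.Resolve("host-a", "backup_failed")
			continue
		}
		id, didRaise := s.RaiseOrTouch("host-a", "backup_failed", "warning", "backup failed")
		fmt.Printf("%s notify=%v\n", id, didRaise)
	}
}
```

Only the first and third calls report `didRaise=true`; the second one is a touch, so the Hub would stay silent for it.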
+
+### Notification payload shapes
+
+**Webhook** — a single JSON envelope per event:
+
+```json
+{
+  "event":     "alert.raised",
+  "alert_id":  "01KQT...",
+  "severity":  "warning",
+  "kind":      "backup_failed",
+  "host_id":   "01KQ...",
+  "host_name": "alfa-01",
+  "message":   "Backup 'system-config' failed: rest-server returned 401",
+  "raised_at": "2026-05-04T15:42:01Z",
+  "link":      "https://restic-manager.example/alerts/01KQT..."
+}
+```
+
+`event` is one of `alert.raised | alert.acknowledged | alert.resolved |
+alert.test`. The same envelope shape is reused across events — operators
+build one bridge, switch on `event` and `severity`.
+
+**SMTP** — single-recipient plain-text email per channel. The channel
+config carries the SMTP server credentials and a `to` address; one
+channel = one recipient (or one distribution-list address). Operators
+who want multiple recipients add multiple channels — keeps the config
+flat and the failure modes per-recipient.
+
+Subject pattern is hardcoded (no per-channel template in v1):
+
+```
+Subject: [restic-manager] [<severity>] <host_name>: <kind>
+From: <configured-from-address>
+To: <configured-to-address>
+Date: <RFC 5322>
+Message-ID: <alert_id@<server-host>>
+
+<message line — same string the webhook/ntfy gets>
+
+—
+Raised at: 2026-05-04T15:42:01Z
+Severity:  warning
+Host:      alfa-01
+Kind:      backup_failed
+
+Open in restic-manager:
+https://restic-manager.example/alerts/01KQT...
+
+(This message was sent by restic-manager. Acknowledge or resolve in the UI.)
+```
+
+The body is plain text only in v1 — no HTML alternative — both because
+the data is already structured well enough as text and because HTML
+email opens a long tail of rendering / sanitisation concerns. The
+`Message-ID` includes the alert id so a thread-aware client can group
+related events (raised → acknowledged → resolved) together.
+
+Encryption:
+- **STARTTLS** (default, port 587). Opportunistic upgrade; what most
+  operator-facing relays expect.
+- **Implicit TLS** (port 465). Connect-then-TLS-handshake.
+- **None** (port 25). Plain. Hidden behind a "Yes I understand" warning
+  on the form because the password goes over the wire.
+
+Auth:
+- **PLAIN** (RFC 4616) over TLS. Default and almost always what's wanted.
+- **CRAM-MD5** (RFC 2195). Used automatically if the server advertises it;
+  no UI toggle.
+- No OAuth2 / XOAUTH2 in v1; that's a real next step if Gmail-without-
+  app-passwords becomes a recurring ask.
+
+Per-message timeout is 10s (vs 5s for HTTP channels) — STARTTLS
+handshake + DATA over a slow link can legitimately take that long.
+
+**Ntfy** — uses the standard publish format:
+
+```
+POST /<topic> HTTP/1.1
+Host: <server>
+Authorization: Bearer <access-token>   (if configured)
+Title: [warning] alfa-01 backup failed
+Priority: 4
+Tags: warning,backup_failed
+Click: https://restic-manager.example/alerts/01KQT...
+
+Backup 'system-config' failed: rest-server returned 401
+```
+
+Severity → priority mapping:
+
+| Severity  | Priority |
+| --------- | -------- |
+| info      | 3 (default) |
+| warning   | 4 (high)    |
+| critical  | 5 (urgent)  |
+
+Per-channel `default_priority` setting overrides for non-critical alerts;
+critical always goes urgent regardless.
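
One way to read the mapping plus the override rule is the small function below. It is a sketch under assumptions: `ntfyPriority` is a hypothetical name, and the per-channel `default_priority` (stored as TEXT) is assumed to have been parsed into an int already, with 0 meaning "unset".

```go
package main

import "fmt"

// ntfyPriority maps an alert severity to an ntfy priority (1..5).
// defaultPriority is the per-channel override; 0 means "not configured".
// Critical alerts always go out as 5 (urgent), ignoring the override.
func ntfyPriority(severity string, defaultPriority int) int {
	if severity == "critical" {
		return 5
	}
	if defaultPriority >= 1 && defaultPriority <= 5 {
		return defaultPriority
	}
	switch severity {
	case "warning":
		return 4
	default: // "info" and anything unrecognised
		return 3
	}
}

func main() {
	fmt.Println(ntfyPriority("info", 0))     // table default for info
	fmt.Println(ntfyPriority("warning", 2))  // channel override applies
	fmt.Println(ntfyPriority("critical", 2)) // override ignored
}
```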
+
+### Test notification
+
+`POST /api/notifications/{channel_id}/test` builds a synthetic event
+(severity=info, kind=test_notification, message="Test from
+restic-manager", link to the channel's edit page) and runs it through the
+real send path. Returns `{ok: bool, latency_ms: int, status_code?: int,
+error?: string}`. UI renders the green ✓ / red ✗ feedback inline.
+
+## Routes added
+
+| Method  | Path                                                  | Purpose                                                       |
+| ------- | ----------------------------------------------------- | ------------------------------------------------------------- |
+| GET     | `/alerts`                                             | Fleet alerts list with filters (`?status=open&severity=warning&host_id=...&q=...`) |
+| POST    | `/alerts/{id}/acknowledge`                            | Mark alert acknowledged (HTMX form)                           |
+| POST    | `/alerts/{id}/resolve`                                | Manual resolve (HTMX form)                                    |
+| GET     | `/settings/notifications`                             | Channel list page                                             |
+| GET     | `/settings/notifications/new`                         | Channel kind picker + empty form                              |
+| POST    | `/settings/notifications/new`                         | Validate + create + redirect                                  |
+| GET     | `/settings/notifications/{id}/edit`                   | Channel edit form                                             |
+| POST    | `/settings/notifications/{id}/edit`                   | Validate + update                                             |
+| POST    | `/settings/notifications/{id}/delete`                 | Delete channel (typed-confirm name in the form)               |
+| POST    | `/api/notifications/{id}/test`                        | Fire test notification, return JSON result                    |
+| GET     | `/api/alerts`                                         | JSON list (mirrors the UI filters) for future REST callers    |
+
+## Data model
+
+### Migration 0013 — alerts.last_seen_at
+
+```sql
+ALTER TABLE alerts ADD COLUMN last_seen_at TEXT;
+UPDATE alerts SET last_seen_at = created_at WHERE last_seen_at IS NULL;
+```
+
+Existing alerts (currently zero in production — nothing writes them yet)
+get `last_seen_at = created_at`. The column stays nullable so rows written
+before the engine starts bumping it remain valid.
+
+### Migration 0014 — notification_channels + notification_log
+
+```sql
+CREATE TABLE notification_channels (
+  id              TEXT PRIMARY KEY,
+  kind            TEXT NOT NULL CHECK (kind IN ('webhook', 'ntfy', 'smtp')),
+  name            TEXT NOT NULL,
+  enabled         INTEGER NOT NULL DEFAULT 1 CHECK (enabled IN (0, 1)),
+  config          BLOB NOT NULL,        -- AEAD-encrypted JSON; per-kind shape
+  default_priority TEXT,                -- ntfy only; null for webhook + smtp
+  created_at      TEXT NOT NULL,
+  updated_at      TEXT NOT NULL,
+  last_fired_at   TEXT
+);
+
+CREATE INDEX notification_channels_enabled ON notification_channels(enabled) WHERE enabled = 1;
+
+CREATE TABLE notification_log (
+  id           TEXT PRIMARY KEY,
+  channel_id   TEXT NOT NULL REFERENCES notification_channels(id) ON DELETE CASCADE,
+  alert_id     TEXT REFERENCES alerts(id) ON DELETE SET NULL,
+  event        TEXT NOT NULL,           -- alert.raised | alert.acknowledged | alert.resolved | alert.test
+  ok           INTEGER NOT NULL CHECK (ok IN (0, 1)),
+  status_code  INTEGER,
+  latency_ms   INTEGER,
+  error        TEXT,
+  fired_at     TEXT NOT NULL
+);
+
+CREATE INDEX notification_log_channel ON notification_log(channel_id, fired_at DESC);
+CREATE INDEX notification_log_alert ON notification_log(alert_id);
+```
+
+`config` is an AEAD-encrypted JSON blob; bearer tokens for webhooks, access
+tokens for ntfy, and SMTP passwords live there. Per-kind config shapes:
+
+```go
+type webhookConfig struct {
+    URL          string `json:"url"`
+    BearerToken  string `json:"bearer_token,omitempty"`
+    HeaderName   string `json:"header_name,omitempty"`
+    HeaderValue  string `json:"header_value,omitempty"`
+}
+
+type ntfyConfig struct {
+    ServerURL    string `json:"server_url"`     // default https://ntfy.sh
+    Topic        string `json:"topic"`
+    AccessToken  string `json:"access_token,omitempty"`
+}
+
+type smtpConfig struct {
+    Host       string `json:"host"`         // e.g. smtp.example.com
+    Port       int    `json:"port"`         // default 587 (STARTTLS), 465 (TLS), 25 (none)
+    Encryption string `json:"encryption"`   // "starttls" | "tls" | "none"
+    Username   string `json:"username"`
+    Password   string `json:"password"`     // sensitive — AEAD-encrypted with the rest of config
+    From       string `json:"from"`         // RFC 5322 address; "alerts@example.com" or "Restic-Manager <alerts@…>"
+    To         string `json:"to"`           // single recipient or distribution-list address; v1 = one channel = one to-line
+}
+```
+
+### Engine state
+
+The engine itself is stateless beyond the channels it owns; all
+persisted state is in the existing `alerts` table + the new
+`notification_log` table. A process restart re-evaluates from scratch:
+on next tick the stale-schedule + auto-resolution sweeps catch up with
+whatever happened during the downtime. No outbox to drain.
+
+## UI templates
+
+| Template                                  | Purpose                                                |
+| ----------------------------------------- | ------------------------------------------------------ |
+| `web/templates/pages/alerts.html`         | Fleet alerts page                                      |
+| `web/templates/partials/alert_row.html`   | One alert row (used by both list and detail-fragment swap) |
+| `web/templates/pages/settings.html`       | Settings shell with Notifications / Users / Auth sub-tabs |
+| `web/templates/pages/notifications.html`  | Channel list (Notifications sub-tab body)              |
+| `web/templates/pages/notification_edit.html` | Channel kind picker + per-kind form + test button + payload preview |
+| `web/templates/partials/crit_banner.html` | Dashboard top-of-page banner                           |
+| `web/templates/partials/nav.html`         | Existing — gain a `data-alerts-count` attribute on the Alerts tab so the badge auto-updates |
+
+The Settings shell + Notifications sub-tab is the new chrome the wireframe
+introduced; Users + Authentication tabs are placeholder links that 404 in
+v1 (or render a "Lands later" notice). Same pattern P2R-02 used for
+inert sub-tabs.
+
+## Tests (target coverage)
+
+- `internal/alert/engine_test.go` — rule firing per kind: backup_failed
+  raises on `MarkJobFinished(kind=backup, status=failed)`; touch-only on
+  the second failure for the same host (no second notification);
+  auto-resolve on next success.
+- `internal/alert/agent_offline_test.go` — `OnHostOffline` emits without
+  raising until the 15-min floor; `OnHostOnline` clears the alert.
+- `internal/alert/stale_schedule_test.go` — synthetic schedule whose next
+  fire is in the past triggers; resets when a job lands.
+- `internal/notification/webhook_test.go` — payload shape pinned;
+  authorisation header sent when bearer set; custom header echoed; 5s
+  timeout enforced; error in `notification_log`.
+- `internal/notification/ntfy_test.go` — title/priority/tags/click headers
+  match the severity mapping; access token sent as `Authorization: Bearer
+  <token>`; default priority overridden by severity for critical.
+- `internal/notification/smtp_test.go` — round-trip against a local
+  in-process fake SMTP server (`net/smtp` is client-only; MailHog if convenient):
+  STARTTLS handshake completes against a self-signed cert; PLAIN auth
+  uses configured creds; subject + from + to + body bytes match the
+  spec'd format; Message-ID contains the alert id; 10s timeout enforced;
+  failure path (auth refused) lands in `notification_log` with the
+  server's error string.
+- `internal/server/http/ui_alerts_test.go` — page renders with filters
+  applied; ack/resolve POSTs flip the row + write audit; HX-Redirect
+  bounces back to the filtered list.
+- `internal/server/http/ui_notifications_test.go` — CRUD happy paths,
+  validation re-render, secrets-encrypted-at-rest assertion (load row,
+  decrypt, compare), test-button hits the real send path against a
+  test http.Server.
+- Migration 0013 + 0014 round-trip tested via `store.Open` on a fresh
+  db.
+
+## Playwright sweep
+
+End-of-phase sweep mirrors the P2R-02 / P3-restore pattern:
+
+1. Login → `/alerts` (initially empty) → see "All clear · last alert
+   never" empty state.
+2. Trigger a fake-failed-backup via `POST /api/hosts/{id}/jobs` against a
+   host with a deliberately-wrong rest-server URL. Wait for the
+   `backup_failed` alert to appear in the list within ~2s of the job
+   finishing.
+3. Acknowledge → row tints + ack actor visible.
+4. Take the agent offline (`systemctl stop`); wait 15 min OR mock
+   `last_seen_at` to 16 min ago via the test harness; confirm
+   `agent_offline` alert raises once.
+5. Restart the agent → `agent_offline` auto-resolves; `backup_failed` is
+   still open.
+6. Configure a webhook channel pointing at a local test sink; click "Send
+   test" → green ✓.
+7. Configure a ntfy channel pointing at a local sink → click "Send test"
+   → green ✓.
+8. Configure an SMTP channel pointing at a local MailHog (Docker, port
+   1025, no TLS for the local-only sweep) → click "Send test" → green ✓
+   → MailHog UI at :8025 shows the test email with the right subject
+   and Message-ID.
+9. Trigger a fresh failed backup → all three channels receive the
+   notification (verified from sink logs + MailHog inbox);
+   `notification_log` has three rows `event=alert.raised, ok=true`.
+10. Manually Resolve the open `backup_failed`; confirm all three channels
+    receive `event=alert.resolved`.
+11. Critical-severity test: trigger `check_failed` (mocked) → dashboard
+    banner appears; clicking it lands on `/alerts?severity=critical&status=open`.
+12. Empty the alerts again → banner disappears.
+
+Screenshots into `_diag/p3-alerts-sweep/`. End-to-end clean, zero console
+errors, before handing back.
+
+## What does NOT change
+
+- Existing chrome/templates beyond the small additions noted above.
+- Existing `alerts.severity` CHECK (`info`/`warning`/`critical`) — already
+  the right shape; no migration needed for that.
+- Audit log writer pattern — engine writes audit rows for ack/resolve
+  the same way every other state-changing handler does.
+- The agent. Alerts are entirely a server concern; the agent doesn't
+  know they exist.
+
+## Open questions / explicit non-goals
+
+- **Per-rule cooldowns / re-raise on long-running issues.** Out of scope
+  (brainstorm question 8 ruled this out). Operators see "still happening"
+  in the UI; they don't get a reminder ping.
+- **SMTP HTML emails.** v1 is plain text only — operators wanting rich
+  rendering can deploy a webhook → mail-merge bridge, or wait for a v2
+  template engine. The Message-ID threading + plain text body should be
+  enough for almost every overnight-digest workflow.
+- **SMTP OAuth2 / XOAUTH2.** Out of scope. Gmail / Microsoft 365 with
+  modern OAuth requires an `app password` workaround in v1. Native
+  XOAUTH2 lands when an operator asks (or when Google starts refusing
+  app passwords for non-business accounts in earnest).
+- **Multi-recipient SMTP channels.** A channel = one `To`. Operators
+  wanting multiple recipients add multiple channels. Keeps failure
+  attribution per-recipient.
+- **Apprise sidecar integration.** Deferred per brainstorm. The
+  `Channel` interface accepts a fourth impl without reshaping when we get
+  there.
+- **Per-host or per-severity channel routing.** Out of scope. Likely
+  next step if operators ask: a `min_severity` field on the channel row.
+- **Snooze / mute.** Out of scope. Acknowledge is the closest analogue;
+  full silence-windows would need a new table and is YAGNI for v1.
+- **PagerDuty / OpsGenie.** Both have webhook receivers; operators wire
+  them via the webhook channel today.
+- **Alert "rules" UI.** No CRUD; the rule set is hardcoded.
@@ -0,0 +1,342 @@
+# P3 — Restore (design)
+
+> Phase 3 sub-spec covering single-host restore (P3-01, P3-02, P3-03, P3-09).
+> P3-04 (cross-host restore) is deferred to a new "Future / unscheduled"
+> section in `tasks.md` — disaster recovery is already covered by re-enrolling
+> a replacement host with the same repo credentials.
+>
+> Wireframe: `_diag/p3-restore-wizard/wireframe.html`. Screenshot:
+> `_diag/p3-restore-wizard/01-full-wizard.png`.
+
+## Scope locked
+
+Brainstorm decisions (in order asked):
+
+1. **In-place vs new-directory.** Default is a new directory under
+   `/var/restic-restore/<job-id>/`. A "Restore in place (overwrite original
+   paths)" toggle is gated by typed-confirmation of the host name, mirroring
+   the repo re-init pattern.
+2. **Path-selection granularity.** Tree browser as the path selector, lazy-
+   loaded via `restic ls --json <snapshot> <path>` per directory expansion.
+3. **Cross-host restore (P3-04).** Out of scope this phase. Move to
+   "Future / unscheduled" in `tasks.md`. The disaster-recovery case is covered
+   by the standard enrolment flow: stand up a replacement host, paste the
+   original repo creds at enrolment, snapshots reappear, restore is
+   same-host.
+4. **Snapshot diff (P3-09).** Diff-as-a-job. New `JobDiff` JobKind dispatched
+   like every other agent operation. Output streams as `log.stream` and
+   renders on the live job log page.
+5. **Wizard entry points.** Top-level "Restore" button on host detail
+   (`/hosts/{id}/restore`, opens wizard at step 1) plus a per-snapshot
+   Restore action on snapshot rows (`/hosts/{id}/snapshots/{sid}/restore`,
+   skips step 1).
+6. **Wizard interaction model.** Single-page, sections progressively enable;
+   tree-browser nodes lazy-load via HTMX partials. No `restore_drafts` table.
+7. **Tree-browser data path.** Synchronous WS RPC (`tree.list` ↔
+   `tree.list.result`, correlation-ID) plus a per-wizard-session in-memory
+   cache keyed by `{snapshot_id, path}` with ~30-min TTL.
+8. **Restore progress UI.** Restore-specific job-page variant: files-restored
+   / bytes-restored / throughput / ETA / current-file display, driven by
+   restic restore's JSON status events surfaced through `job.progress`.
+9. **Permissions/ownership.** Policy, not toggle. In-place restore preserves
+   original ownership; new-directory restore drops ownership
+   (`--no-ownership`).
+10. **Concurrency.** Single-flight per host (one job at a time across all
+    kinds). Plus a real cancel-job feature: `command.cancel` envelope, agent
+    kills the `restic` subprocess via context cancel (SIGTERM, SIGKILL after
+    grace), server transitions the job to `cancelled`. The "Cancel" button
+    already in the `job_detail` template becomes real for any running job
+    kind.
+11. **Audit + safety.** Audit row on every restore dispatch (`host.restore`
+    with snapshot ID, paths, target, in-place flag). Recent-restores panel
+    on the host page surfacing the latest restore job alongside last-backup
+    and last-init signals. Role gate deferred to P4-03.
+
+## Architecture
+
+Restore composes from existing primitives plus three new pieces:
+
+- **New JobKind values**: `JobRestore`, `JobDiff`. Dispatcher cases mirror
+  the prune/check pattern. Agent-side handlers wrap `restic.RunRestore` and
+  `restic.RunDiff` (new methods on the `restic` package).
+- **New WS RPC**: `tree.list` request (`{snapshot_id, path}`) ↔
+  `tree.list.result` reply (`{entries: [{name, type, size}], ...}` or
+  `{error}`). Reuses existing correlation-ID infrastructure from P1-09. No
+  `jobs` row.
+- **New cancel surface**: `command.cancel` request (`{job_id}`), agent
+  cancels the running subprocess context, returns `command.ack` + `job.finished`
+  with status `cancelled`. Server endpoint `POST /api/jobs/{id}/cancel`
+  bridges UI button → WS envelope.
+
+Everything else (job lifecycle, log streaming, progress envelope, snapshot
+listing, audit log writer, host_chrome partial, danger-zone typed-confirmation)
+already exists and is reused verbatim.
+
+### Component boundaries
+
+| Component                          | Purpose                                              | Depends on                                |
+| ---------------------------------- | ---------------------------------------------------- | ----------------------------------------- |
+| `internal/restic.RunRestore`        | Run `restic restore` with paths + target + ownership | `restic.Env`                              |
+| `internal/restic.RunDiff`           | Run `restic diff --json a b`                         | `restic.Env`                              |
+| `internal/agent/runner` cases       | Dispatch `JobRestore` / `JobDiff` jobs               | `restic.Run*`, hooks (skipped: backup-only) |
+| `internal/agent/runner` cancel hook | Wire WS `command.cancel` → ctx.CancelFunc per job   | runner job map                            |
+| `internal/agent/runner` tree-list   | Sync RPC handler: `restic ls --json` for one path   | `restic.Env`                              |
+| `internal/server/ws/cancel.go`      | Validate + send `command.cancel` envelope            | hub.Send, store.UpdateJobStatus           |
+| `internal/server/ws/tree.go`        | RPC mediator: `tree.list` request → reply, with cache | hub.SendRPC, in-memory cache              |
+| `internal/server/http/restore.go`   | Wizard routes + dispatch endpoint                    | store, ws, audit                          |
+| `internal/server/http/diff.go`      | Snapshot-diff dispatch endpoint                      | store, ws                                 |
+| `internal/server/http/cancel.go`    | `POST /api/jobs/{id}/cancel`                         | ws                                        |
+| `web/templates/pages/host_restore.html` | Wizard page                                      | host_chrome partial                       |
+| `web/templates/partials/tree_node.html` | Lazy-loaded tree node fragment for HTMX swap     | —                                         |
+| `web/templates/pages/job_detail.html` | Restore-kind progress widget (variant)             | existing job_detail                       |
+
+### Data flow — wizard happy path
+
+```
+operator
+  ├─ GET /hosts/{id}/restore
+  │     server renders wizard shell, snapshot table from store.ListSnapshotsByHost
+  │
+  ├─ click snapshot row (or arrives via /hosts/{id}/snapshots/{sid}/restore)
+  │     wizard advances to step 2, snapshot summary card rendered
+  │
+  ├─ expand a tree node (chevron click)
+  │     HTMX GET /hosts/{id}/restore/tree?snapshot={sid}&path=/etc
+  │       server checks per-session cache (keyed by sid+path)
+  │         hit  → render tree_node fragment from cache
+  │         miss → hub.SendRPC(host_id, "tree.list", {sid, path}) → wait reply
+  │                cache result, render tree_node fragment
+  │
+  ├─ tick file/dir checkboxes (form state, no round-trip)
+  │
+  ├─ pick target radio (and optionally type host name to unlock in-place)
+  │
+  └─ POST /hosts/{id}/restore  (form submit)
+        server validates: ≥1 path, target mode, in-place ⇒ host name match
+        write audit row host.restore
+        store.CreateJob{kind=restore, payload={snapshot_id, paths, target, in_place}}
+        hub.Send(host_id, "command.run", {job_id, kind=restore, payload})
+        HX-Redirect: /jobs/{job_id}
+```
+
+### Data flow — agent restore execution
+
+```
+agent.runner receives command.run kind=restore
+  ├─ check single-flight: if r.activeJobID != "" → reply busy
+  │   (server queues to pending_runs only for kind=backup; restore returns busy)
+  ├─ allocate ctx, ctxCancel — store cancelFunc against job_id in r.cancels
+  ├─ sendStarted(job_id, JobRestore, now)
+  ├─ build target path: if in_place → "/" else "/var/restic-restore/<job_id>/"
+  ├─ build flags: paths from payload, --no-ownership when !in_place
+  ├─ restic.RunRestore(ctx, env, snapshot_id, paths, target, in_place):
+  │   restic restore <sid> --target <path> [--no-ownership] -- <p1> <p2> ...
+  │   parse stdout JSON: forward "status" → job.progress (1Hz throttle), "summary" → final
+  ├─ on success: sendFinished(job_id, succeeded, exit=0)
+  ├─ on ctx.Err() == context.Canceled: sendFinished(job_id, cancelled, exit=130)
+  └─ delete cancel func from r.cancels
+```
+
+### Data flow — cancel
+
+```
+operator clicks Cancel on /jobs/{id} (running)
+  POST /api/jobs/{id}/cancel
+    server: lookup job, ensure status=running, find host
+    hub.Send(host_id, "command.cancel", {job_id})
+  → agent.runner receives command.cancel
+       cancelFunc, ok := r.cancels[job_id]
+       ok && cancelFunc()
+       → restic subprocess context done → exec.Cmd kills via SIGTERM
+       → if still alive after 5s grace → SIGKILL
+       → runner sendFinished(job_id, cancelled, exit=130)
+  → server receives job.finished status=cancelled, persists, broadcasts
+  → browser refresh shows cancelled state
+```
+
+The cancel surface is independently useful for any kind (prune/check/backup) —
+not gated to restore. The button already in `job_detail.html` becomes real.
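
A note on the SIGTERM-then-SIGKILL step: Go's `exec.CommandContext` kills the process outright on context cancellation unless told otherwise, so the grace period has to be wired explicitly. The sketch below shows one way to do that with `Cmd.Cancel` and `Cmd.WaitDelay` (Go 1.20+). It assumes a Unix host with a `sleep` binary; `runWithGrace` and `cancelDemo` are hypothetical names, and `sleep` stands in for the `restic` subprocess.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// runWithGrace runs a command that receives SIGTERM when ctx is cancelled.
// If the process is still alive after grace has elapsed, os/exec kills it.
func runWithGrace(ctx context.Context, grace time.Duration, name string, args ...string) error {
	cmd := exec.CommandContext(ctx, name, args...)
	// Replace the default cancel behaviour (hard kill) with SIGTERM.
	cmd.Cancel = func() error { return cmd.Process.Signal(syscall.SIGTERM) }
	// After ctx is done, wait this long before falling back to a hard kill.
	cmd.WaitDelay = grace
	err := cmd.Run()
	if err != nil && ctx.Err() != nil {
		// Report the cancellation rather than "signal: terminated".
		return ctx.Err()
	}
	return err
}

// cancelDemo cancels a long-running child shortly after start and reports
// whether the run was classified as cancelled.
func cancelDemo() bool {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	time.AfterFunc(200*time.Millisecond, cancel)
	err := runWithGrace(ctx, 5*time.Second, "sleep", "30")
	return errors.Is(err, context.Canceled)
}

func main() {
	fmt.Println("cancelled:", cancelDemo())
}
```

Mapping the error back to `context.Canceled` is what lets a runner choose the `cancelled` status over `failed`.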
+
+### Tree-list RPC details
+
+New WS message types (added to `internal/api/messages.go`):
+
+```go
+type TreeListRequestPayload struct {
+    SnapshotID string `json:"snapshot_id"`
+    Path       string `json:"path"`
+}
+
+type TreeListEntry struct {
+    Name string `json:"name"`
+    Type string `json:"type"`        // "dir" | "file" | "symlink"
+    Size int64  `json:"size,omitempty"`
+}
+
+type TreeListResultPayload struct {
+    SnapshotID string          `json:"snapshot_id"`
+    Path       string          `json:"path"`
+    Entries    []TreeListEntry `json:"entries,omitempty"`
+    Error      string          `json:"error,omitempty"`
+}
+```
+
+Server-side mediator (`ws.SendRPC`) takes a request envelope, registers the
+correlation ID in a pending map, sends, blocks on a per-call channel until
+the matching reply arrives (or 30s timeout). The pattern is small enough
+to inline in `internal/server/ws/rpc.go` as a generic helper — future
+synchronous RPCs reuse it.
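
The pending-map pattern might look like the sketch below. It is a transport-free illustration under assumptions: `rpcMux`, `Call`, `Deliver`, and the `send` callback are hypothetical names, and the demo uses a short timeout in place of the 30s in the text.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

type rpcReply struct {
	CorrelationID string
	Body          string
}

// rpcMux matches replies to in-flight calls by correlation ID.
type rpcMux struct {
	mu      sync.Mutex
	pending map[string]chan rpcReply
	send    func(correlationID, body string) error // transport, injected
}

func newRPCMux() *rpcMux { return &rpcMux{pending: map[string]chan rpcReply{}} }

// Call registers the correlation ID, sends the request, and blocks until
// the matching reply arrives or the timeout expires.
func (m *rpcMux) Call(ctx context.Context, id, body string, timeout time.Duration) (rpcReply, error) {
	ch := make(chan rpcReply, 1) // buffered so Deliver never blocks
	m.mu.Lock()
	m.pending[id] = ch
	m.mu.Unlock()
	defer func() {
		m.mu.Lock()
		delete(m.pending, id)
		m.mu.Unlock()
	}()

	if err := m.send(id, body); err != nil {
		return rpcReply{}, err
	}
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	select {
	case r := <-ch:
		return r, nil
	case <-ctx.Done():
		return rpcReply{}, ctx.Err()
	}
}

// Deliver routes an incoming reply to its waiting caller. It returns false
// for unknown or already-abandoned correlation IDs.
func (m *rpcMux) Deliver(r rpcReply) bool {
	m.mu.Lock()
	ch, ok := m.pending[r.CorrelationID]
	m.mu.Unlock()
	if !ok {
		return false
	}
	select {
	case ch <- r:
		return true
	default:
		return false
	}
}

// rpcDemo runs one echoed call and one call against a silent transport.
func rpcDemo() (string, bool) {
	m := newRPCMux()
	m.send = func(id, body string) error {
		go m.Deliver(rpcReply{CorrelationID: id, Body: "echo:" + body})
		return nil
	}
	r, _ := m.Call(context.Background(), "c1", "tree.list /etc", time.Second)

	m.send = func(string, string) error { return nil } // reply never arrives
	_, err := m.Call(context.Background(), "c2", "tree.list /var", 50*time.Millisecond)
	return r.Body, errors.Is(err, context.DeadlineExceeded)
}

func main() {
	body, timedOut := rpcDemo()
	fmt.Println(body, timedOut)
}
```

Late replies find no pending entry and are discarded by `Deliver`, which keeps the map from growing after timeouts.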
+
+In-memory cache: `map[sessionID]map[cacheKey]TreeListResultPayload` with
+`cacheKey = snapshot_id + "\x00" + path`. Session ID minted per wizard
+load (HTTP-only cookie scoped to `/hosts/{id}/restore/tree`, lifetime 30
+min). On wizard close (browser navigation away) the entry expires
+naturally. No persistence, no migration.
+
+Agent handler runs `restic ls --json <sid> <path>` without `--recursive`.
+Whether restic already limits the listing to one level for a given path
+should be checked against the pinned restic version, so the handler parses
+the output line-by-line and emits only direct children of `path` either way.
+60s context timeout, mirroring the existing `restic snapshots` invocation.
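
The direct-children filter can be sketched as below. It assumes each output line is one JSON object and that file and directory nodes carry `name`, `type`, `path`, and `size` fields; those field names and the sample listing are assumptions for illustration and should be checked against real `restic ls --json` output. `directChildren` and `treeEntry` are hypothetical names.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"path"
	"strings"
)

// lsNode holds the fields this sketch reads from each output line.
type lsNode struct {
	Name string `json:"name"`
	Type string `json:"type"`
	Path string `json:"path"`
	Size int64  `json:"size"`
}

type treeEntry struct {
	Name string
	Type string
	Size int64
}

// directChildren parses line-delimited JSON and keeps only nodes whose
// parent directory is exactly dir. Lines without a path (such as a
// snapshot header) and the directory itself are skipped.
func directChildren(output, dir string) ([]treeEntry, error) {
	dir = path.Clean(dir)
	var out []treeEntry
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var n lsNode
		if err := json.Unmarshal([]byte(line), &n); err != nil {
			return nil, fmt.Errorf("parse ls line: %w", err)
		}
		if n.Path == "" {
			continue
		}
		p := path.Clean(n.Path)
		if p == dir || path.Dir(p) != dir {
			continue
		}
		out = append(out, treeEntry{Name: n.Name, Type: n.Type, Size: n.Size})
	}
	return out, sc.Err()
}

// sampleLs is synthetic output used only for the demo.
const sampleLs = `{"struct_type":"snapshot","id":"abc123"}
{"name":"etc","type":"dir","path":"/etc"}
{"name":"hosts","type":"file","path":"/etc/hosts","size":220}
{"name":"ssh","type":"dir","path":"/etc/ssh"}
{"name":"sshd_config","type":"file","path":"/etc/ssh/sshd_config","size":3200}
`

func main() {
	entries, err := directChildren(sampleLs, "/etc")
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		fmt.Printf("%s (%s)\n", e.Name, e.Type)
	}
}
```

Filtering on the parent path keeps the handler correct whether or not the listing it receives is recursive.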
+
+### Restore payload
+
+`api.CommandRunPayload` gains a nested optional `restore` field:
+
+```go
+type RestorePayload struct {
+    SnapshotID    string   `json:"snapshot_id"`
+    Paths         []string `json:"paths"`           // absolute paths inside the snapshot
+    InPlace       bool     `json:"in_place"`
+    TargetDir     string   `json:"target_dir"`      // empty when in_place=true
+    PreserveOwner bool     `json:"preserve_owner"`  // mirrors policy: in_place=>true, else=>false
+}
+```
+
+The payload is set by the server when dispatching `JobRestore` and ignored
+on every other kind. Wire-shape test pinned in `wire_test.go`.
+
+### Diff payload
+
+`api.CommandRunPayload` gains:
+
+```go
+type DiffPayload struct {
+    SnapshotA string `json:"snapshot_a"`
+    SnapshotB string `json:"snapshot_b"`
+}
+```
+
+Set on `JobDiff`. Output is plain `restic diff --json <a> <b>` forwarded as
+`log.stream` lines. Job page renders unchanged — operator reads the diff
+output directly.
+
+### Recent-restores panel
+
+A small panel rendered on the host detail page below the existing init-status
+line:
+
+```
+last restore: succeeded 2h ago · job f73ab4c1… · 3 files to /var/restic-restore/...
+```
+
+Backed by `store.LatestJobByKind(host_id, JobRestore)`, reusing the query
+already used for init/forget/prune/check in P2R-06. One template addition
+in `host_chrome.html` next to the `InitStatus` block.
+
+## Routes added
+
+| Method  | Path                                                      | Purpose                                                     |
+| ------- | --------------------------------------------------------- | ----------------------------------------------------------- |
+| GET     | `/hosts/{id}/restore`                                     | Wizard shell (step 1 = snapshot picker)                     |
+| GET     | `/hosts/{id}/snapshots/{sid}/restore`                     | Wizard shell with snapshot pre-selected (skips step 1)      |
+| GET     | `/hosts/{id}/restore/tree`                                | HTMX partial: tree node listing for `?snapshot=&path=`      |
+| POST    | `/hosts/{id}/restore`                                     | Validate + dispatch restore job, redirect to live job page  |
+| POST    | `/api/hosts/{id}/snapshots/diff`                          | Dispatch a diff job for `{snapshot_a, snapshot_b}`          |
+| POST    | `/api/jobs/{id}/cancel`                                   | Send `command.cancel` to host, transition job → cancelled   |
+
+## Migrations
+
+None. Restore + diff piggyback on the existing `jobs` table (their `kind` is
+new but the schema already accepts arbitrary kind strings — there's no
+CHECK constraint on `kind`). The cancel feature uses the existing
+`JobCancelled` terminal status. The tree-list cache lives in process memory.
+
+## Tests (target coverage)
+
+- `internal/restic/restore_test.go` — `RunRestore` invocation builds the
+  expected argv (paths, --target, --no-ownership flag presence, in-place
+  variant); JSON status parsing → `BackupStatus`-shaped progress envelopes.
+- `internal/restic/diff_test.go` — `RunDiff` argv shape and JSON forwarding.
+- `internal/agent/runner/restore_test.go` — happy path, cancel mid-run
+  produces `cancelled` finished, in-place vs new-directory dispatch,
+  single-flight rejects when another job is running.
+- `internal/agent/runner/tree_test.go` — `tree.list` handler returns
+  direct children for a synthetic restic ls output, surfaces error on
+  missing snapshot.
+- `internal/server/ws/rpc_test.go` — `SendRPC` correlation matching,
+  timeout, concurrent calls.
+- `internal/server/http/restore_test.go` — wizard renders with snapshots,
+  POST validates ≥1 path + in-place host-name match, audit row written,
+  job dispatched with correct payload, in-place without typed-confirm
+  re-renders form with input intact and an error.
+- `internal/server/http/diff_test.go` — POST dispatches `JobDiff`,
+  snapshot IDs validated against the host's snapshot list.
+- `internal/server/http/cancel_test.go` — POST cancel happy path
+  (running → cancelled), 4xx for non-running jobs, 4xx when host offline.
+- `internal/server/http/restore_e2e_test.go` — happy path: GET wizard,
+  expand `/etc` (HTMX call returns expected fragment), submit, follow
+  HX-Redirect to job page, see status.
+- `web/templates/pages/host_restore_test.go` (template-render test) —
+  wizard renders all four sections; in-place card disabled until typed
+  confirm.
+
+## Playwright iteration / sweep
+
+A Playwright sweep at the end (mirroring P2R-02 Slice 6) runs against the
+local smoke server with a real agent enrolled. Steps:
+
+1. Login → navigate to alfa-01 host → click Restore.
+2. Wizard step 1: pick the most recent snapshot.
+3. Wizard step 2: expand a directory two levels, tick three files,
+   verify tally updates.
+4. Wizard step 3: leave default new-directory.
+5. Wizard step 4: dispatch.
+6. Land on live job page, see progress widget animating, see log lines.
+7. Click Cancel mid-flight, verify status transitions to cancelled and
+   the agent's subprocess actually died (log line `signal: killed` or exit
+   130).
+8. Repeat with in-place mode: type host name, dispatch, verify red
+   primary button, verify files actually overwritten on host.
+9. Snapshot diff: navigate to snapshots, pick two, dispatch diff, see
+   diff output streamed.
+10. Screenshots into `_diag/p3-restore-sweep/`.
+
+The sweep must run clean end to end, with zero console errors, before handing back.
+
+## What does NOT change
+
+- `host_chrome.html` only grows the recent-restores line; sub-tab list
+  unchanged (Restore is a top-level button on the host page, not a sub-tab).
+- `enrollment.go`, schedule reconciliation, source-group CRUD, repo
+  maintenance ticker, hook execution — none of these are touched.
+- The CLAUDE.md restage block applies as-is when the agent binary changes
+  (it does — runner gains restore/diff/cancel/tree handlers). The unit
+  file does not change.
+
+## Open questions / explicit non-goals
+
+- **Restore preview / dry-run.** Restic doesn't have a dry-run for restore.
+  Out of scope.
+- **Resumable restore.** Restic restore is idempotent per file but cannot
+  resume mid-stream. If a restore is cancelled, the operator re-runs it
+  (files already written are overwritten). No state to track.
+- **Restore to a glob/pattern (e.g. `*.conf`).** Out of scope; the tree
+  picker requires explicit ticks. Power users can edit the URL or use the
+  CLI.
+- **Bandwidth caps for restore.** Honoured automatically — restic's
+  `--limit-download` is part of `restic.Env` already (P2R-13) and applies
+  to restore unchanged.
+- **Pre/post hooks for restore.** Hooks today gate only `kind=backup`
+  (P2R-11). Out of scope.
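Editorial sketch (not part of the commit): the rules the design notes attach to `RestorePayload` (at least one path, absolute paths, `target_dir` empty for in-place restores, `preserve_owner` mirroring `in_place`) can be expressed as a small validation step. The `Validate` method, the requirement that `target_dir` be non-empty for new-directory restores, and the sample values are assumptions made for illustration; the notes do not say where or how the server enforces these rules.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"path"
)

// RestorePayload copies the struct shown in the design notes.
type RestorePayload struct {
	SnapshotID    string   `json:"snapshot_id"`
	Paths         []string `json:"paths"`
	InPlace       bool     `json:"in_place"`
	TargetDir     string   `json:"target_dir"`
	PreserveOwner bool     `json:"preserve_owner"`
}

// Validate is a hypothetical helper; the design notes do not name one.
func (p RestorePayload) Validate() error {
	if p.SnapshotID == "" {
		return errors.New("snapshot_id is required")
	}
	if len(p.Paths) == 0 {
		return errors.New("at least one path is required")
	}
	for _, item := range p.Paths {
		if !path.IsAbs(item) {
			return fmt.Errorf("path %q is not absolute", item)
		}
	}
	if p.InPlace && p.TargetDir != "" {
		return errors.New("target_dir must be empty when in_place is true")
	}
	// Assumption: a new-directory restore needs somewhere to write.
	if !p.InPlace && p.TargetDir == "" {
		return errors.New("target_dir is required when in_place is false")
	}
	// The notes state the policy: in_place => true, otherwise false.
	if p.PreserveOwner != p.InPlace {
		return errors.New("preserve_owner must mirror in_place")
	}
	return nil
}

func main() {
	// Made-up values, used only to exercise the sketch.
	p := RestorePayload{
		SnapshotID: "abc123",
		Paths:      []string{"/etc/hosts"},
		TargetDir:  "/var/restic-restore/demo",
	}
	if err := p.Validate(); err != nil {
		fmt.Println("invalid payload:", err)
		return
	}
	wire, err := json.Marshal(p)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(wire))
}
```

Checking the mirror rule explicitly, rather than deriving `preserve_owner` on the agent, keeps the wire shape self-describing; whether the real code validates or derives the field is not stated in the notes.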
@@ -1,126 +0,0 @@
-# Threat model
-
-A short, structured walkthrough of the assets restic-manager
-protects, the actors that interact with it, the attack surfaces
-exposed, and the mitigations in place. This document is written for
-operators considering a deployment and for contributors evaluating
-security-sensitive changes. It is **not** a formal certification —
-restic-manager has not been third-party audited.
-
-Last reviewed: **2026-05-09** (against v1.0.0).
-
---
-
-## 1. Assets
-
-In rough order of sensitivity:
-
-| Asset | Why it matters |
-|---|---|
-| **Restic repository passwords** | Decrypt every backup in the repo. Server holds them encrypted at rest; agents need plaintext at backup-time. |
-| **Repository URLs with embedded credentials** (e.g. `rest:https://user:pass@host/repo`) | Same as above — read access to the repo is leak-equivalent to the password. |
-| **Agent bearer tokens** | Long-lived credentials authenticating each agent → server WS. Compromise lets an attacker impersonate that host (push fake snapshots, ack fake schedule versions, exfiltrate repo creds the server pushes back). |
-| **Server session cookies** | Browser-side session for human operators. Compromise = full UI access at the user's role for the cookie's TTL (24h). |
-| **Database secret key** | Wraps every encrypted-at-rest field (repo creds, agent enrolment payloads). A leaked key file makes the stored repo credentials, and therefore the backups, decryptable; rotation requires re-pushing creds to every agent. |
-| **Bootstrap / setup tokens** | One-shot, time-limited; mint admin or invited-user accounts. |
-| **Audit log** | Tamper-evident record of admin actions; read-only via UI. |
-| **Backup data on the wire** | Restic itself encrypts on the agent before sending — see "out of scope". |
-
---
-
-## 2. Actors
-
-| Actor | Trust |
-|---|---|
-| **Anonymous internet** | Untrusted. Should not reach the server unless proxied behind auth (see deployment guide). |
-| **Authenticated viewer** | Read-only on hosts/jobs/alerts/audit. |
-| **Authenticated operator** | Add/remove hosts, edit schedules, run backups/restores, mint enrolment tokens, ack alerts. |
-| **Authenticated admin** | All of the above plus user management, role changes, and fleet update controls. Secret-key visibility is not granted (see below). |
-| **Agent** | Trusted to backup-and-report on its own host only. Cannot read other hosts' creds. Bearer-authenticated. |
-| **Restic backend (rest-server / S3 / B2 / etc.)** | Out of scope for this document — assumed to authenticate the credentials presented and not collude. |
-
---
-
-## 3. Attack surfaces and mitigations
-
-### 3.1 First-run bootstrap
-
- **Surface**: `/bootstrap` UI + `/api/bootstrap` JSON endpoint.
- **Risk**: race between server start and admin creation — an attacker who reaches the server first can claim admin.
- **Mitigations**:
-  - Bootstrap token printed to stderr exactly once; held in memory, not persisted.
-  - The UI form on `/bootstrap` uses the in-memory token automatically (no token field for the operator to type or expose).
-  - Both surfaces self-disable the moment any user row exists (`CountUsers > 0`).
-  - Token is also blanked from process memory after success (defence in depth).
- **Residual risk**: if an operator brings up the server on the public internet before reaching the bootstrap page, an attacker reaching `/bootstrap` first wins. **Recommendation**: bring the server up behind an existing trusted network or with the listener bound to `127.0.0.1` until first-run is complete.
-
-### 3.2 Local user accounts
-
- **Surface**: `/login`, `/api/auth/login`.
- **Mitigations**: Argon2id password hashing with per-deployment params; constant-time password compare; session-cookie minting via `crypto/rand`; session rows hash-only (raw token only in cookie).
- **Rate limiting**: Currently not in place at the application layer — the project assumes a reverse proxy enforces login throttling. **Recommendation**: front the server with `caddy`/`nginx` rate-limit rules in production.
- **Password policy**: 12-character minimum on bootstrap and user-setup paths; no maximum, no rotation, no history. Sufficient for self-hosted ops; tighten in policy if a deployment requires it.
-
-### 3.3 OIDC SSO
-
- **Surface**: `/auth/oidc/*` — generic OIDC client, JIT user provisioning.
- **Mitigations**: state + nonce per flow; role mapping is server-configured (claims trusted only to identify the user, not pick role); user-disabled gate runs after IdP success.
- **Residual risk**: misconfigured role-mapping rules can promote any IdP user to admin. **Recommendation**: review `cfg.OIDC.RoleMappings` carefully.
-
-### 3.4 Agent enrolment
-
- **Surface**: `/api/agents/enroll` (token-authenticated), `/api/agents/announce` (anonymous, then operator-approves).
- **Mitigations**:
-  - Token path: one-shot, hashed at rest, 1h TTL; agent receives a fresh long-lived bearer in the response.
-  - Announce path: agent supplies an Ed25519 public key; operator sees a fingerprint to confirm out-of-band before accepting.
-  - Bearer tokens are SHA-256 hashed in the DB.
- **Residual risk**: an attacker on the network between operator and target host who intercepts the install snippet can enrol *as* the target. The install script must be served over TLS in production (the docker-only deployment enables TLS by default; bare-metal deployers must configure their own).
-
-### 3.5 Agent → server WebSocket
-
- **Surface**: persistent WS authenticated by agent bearer.
- **Mitigations**: bearer is presented per-connection; server pins the agent fingerprint for the announce flow; messages are envelope-typed and rejected if shape-invalid.
- **No payload-level signing** today — TLS is the integrity boundary. A man-in-the-middle with a valid cert chain could swap messages. **Recommendation**: pin the server cert via `RM_SERVER_CERT_PIN_SHA256` if running over a network you don't fully control.
-
-### 3.6 Repo credential lifecycle
-
- Stored encrypted at rest under the AEAD secret key.
- Pushed to the agent over the WS on hello, on creds change, and on demand.
- Agent persists them encrypted (per-host secret key derived from a value known only to the agent).
- Logged surfaces use `restic.RedactURL()` to strip `user:pass@` from URLs before they reach `slog`.
- Plaintext form is constructed only at `exec.Command` time inside the agent, never stored on a struct field that could be slogged.
-
-### 3.7 Restore
-
- Operators can restore to any path the agent (running as root) can write.
- Cross-host restore (host A's snapshot → host C) is **deferred** — see F-01. The current single-host restore does not require granting any cross-host privileges.
-
-### 3.8 Audit log
-
- Append-only writes from the application; SQLite enforces no schema-level immutability.
- A compromise of the SQLite file (via OS-level access) can edit the audit log. **Recommendation**: ship audit entries to an append-only sink (syslog / Loki / Splunk) if tamper-evidence beyond the OS boundary is required.
-
-### 3.9 Self-update channel (P6)
-
- Agents fetch new binaries via the WS transport from the server.
- Binaries are signature-checked by the agent against a key embedded in the existing agent (see `internal/fleetupdate/`).
- **Residual risk**: a server compromise lets the attacker push code to every agent (running as root). The signing-key compromise window is the same as the server compromise window because both live on the server. Splitting the signing key onto a separate signer is future work (not v1).
-
---
-
-## 4. Out of scope
-
- **Restic itself** — its repository format, encryption, and backend protocol are upstream-trusted.
- **The host OS** — root compromise of a host obviously compromises that host's backups.
- **The backup destination** — restic-manager assumes the rest-server / object-store / SFTP target enforces its own auth.
- **Side-channel attacks** on the server process (RAM dump, process tracing).
- **Physical access** to the server's disk.
-
---
-
-## 5. Reporting
-
-Found something we missed? See `SECURITY.md` for the disclosure
-process. Coordinated disclosure preferred; the project is
-maintained by a small team and we'll respond as quickly as we
-reasonably can.
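Editorial sketch (not part of the commit): section 3.6 of the removed threat model says logged surfaces strip `user:pass@` from repository URLs via `restic.RedactURL()`. The real function's signature and behaviour are not shown in this diff, so the `redactURL` below is a hypothetical stand-in that only illustrates the idea for plain URLs and the `rest:` form quoted in the assets table.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// redactURL is a hypothetical stand-in for restic.RedactURL; the project's
// actual implementation may handle more backends and edge cases.
func redactURL(raw string) string {
	// The REST backend form wraps an ordinary URL in a "rest:" prefix.
	prefix := ""
	inner := raw
	if strings.HasPrefix(raw, "rest:") {
		prefix = "rest:"
		inner = strings.TrimPrefix(raw, "rest:")
	}

	u, err := url.Parse(inner)
	if err != nil {
		// Fail closed: never echo a string we could not inspect.
		return "[unparseable repository URL]"
	}
	if u.User == nil {
		// Nothing to strip (local paths, URLs without userinfo).
		return raw
	}
	u.User = nil
	return prefix + u.String()
}

func main() {
	fmt.Println(redactURL("rest:https://user:pass@host/repo"))
}
```

Returning a placeholder on parse failure is a deliberate choice in this sketch: a log line that hides too much is preferable to one that leaks a credential.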
@@ -1,42 +0,0 @@
-# Build a Linux container that runs the restic-manager agent against a
-# sibling rest-server in the e2e compose stack. Used only by tests
-# (e2e/compose.e2e.yml + .gitea/workflows/e2e.yml).
-#
-# Two stages:
-#   1. golang:alpine to build the agent binary.
-#   2. alpine:3.20 with the `restic` package + the built binary.
-#
-# Pinning image versions by tag is intentional for CI reproducibility.
-
-FROM golang:1.25-alpine AS build
-WORKDIR /src
-
-ENV CGO_ENABLED=0 \
-    GOFLAGS="-trimpath"
-
-COPY go.mod go.sum* ./
-RUN go mod download
-
-COPY . .
-ARG VERSION=e2e
-RUN go build -ldflags="-s -w -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=${VERSION}" \
-        -o /out/restic-manager-agent ./cmd/agent
-
-FROM alpine:3.20
-RUN apk add --no-cache restic ca-certificates curl
-COPY --from=build /out/restic-manager-agent /usr/local/bin/restic-manager-agent
-
-# Agents normally run as root because backup paths often need it. The
-# e2e fixture only backs up paths under /data which we own, so this
-# container would tolerate a non-root user — but staying root keeps
-# parity with the production install.
-USER root
-
-# The agent needs a writable directory for its config + secrets store.
-RUN mkdir -p /etc/restic-manager /var/lib/restic-manager
-ENV RM_AGENT_CONFIG=/etc/restic-manager/agent.yaml
-
-# The compose entrypoint sets the announce URL via env.
-COPY e2e/agent-entrypoint.sh /usr/local/bin/entrypoint.sh
-RUN chmod +x /usr/local/bin/entrypoint.sh
-ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]
@@ -1,21 +0,0 @@
-# Playwright runner for the e2e suite. Built and run by
-# e2e/compose.e2e.yml so the test process sits on the same docker
-# network as the server, agent, and rest-server. The previous setup
-# ran Playwright on the workflow runner host and reached the server
-# via 127.0.0.1:8080; that fails on Gitea's act-style runners
-# because the workflow steps execute inside a runner container,
-# not on the host where compose publishes its ports.
-
-FROM mcr.microsoft.com/playwright:v1.59.1-jammy
-
-WORKDIR /work
-
-# Install npm deps in a separate layer keyed off package.json so
-# changes to specs don't bust the dep cache.
-COPY e2e/playwright/package.json /work/package.json
-RUN npm install --no-audit --no-fund
-
-COPY e2e/playwright/ /work/
-
-ENV CI=1
-ENTRYPOINT ["npx", "playwright", "test"]
@@ -1,27 +0,0 @@
-#!/bin/sh
-# Entrypoint for the e2e agent container.
-#
-# Three states:
-#   1. Already enrolled (agent.yaml has a bearer): run the agent.
-#   2. Token supplied via $RM_ENROL_TOKEN: enrol then run.
-#   3. Otherwise: announce against $RM_SERVER and wait for an admin to
-#      accept us. The announce flow blocks until accepted, then drops
-#      straight into the normal run loop, so this is the test-friendly
-#      path.
-set -eu
-
-CFG="${RM_AGENT_CONFIG:-/etc/restic-manager/agent.yaml}"
-SERVER="${RM_SERVER:?set RM_SERVER}"
-
-if [ -f "$CFG" ] && grep -q '^agent_token:' "$CFG"; then
-    exec restic-manager-agent -config "$CFG"
-fi
-
-if [ -n "${RM_ENROL_TOKEN:-}" ]; then
-    exec restic-manager-agent -config "$CFG" \
-        -enroll-server "$SERVER" \
-        -enroll-token "$RM_ENROL_TOKEN"
-fi
-
-# Announce-and-approve: blocks until an admin accepts, then runs.
-exec restic-manager-agent -config "$CFG" -enroll-server "$SERVER"
@@ -1,113 +0,0 @@
-# End-to-end test stack — used by .gitea/workflows/e2e.yml and by
-# operators who want to run the Playwright suite locally.
-#
-# Three services:
-#   * server      — restic-manager built from the working tree
-#   * agent       — restic-manager agent built from the working tree
-#                   (announces; Playwright accepts it during the test)
-#   * rest-server — the actual restic backend, sibling of the agent
-#
-# Run from the repo root:
-#   docker compose -f e2e/compose.e2e.yml up --build --abort-on-container-exit
-
-services:
-  rest-server:
-    image: restic/rest-server:0.13.0
-    environment:
-      DATA_DIR: /data
-      OPTIONS: "--no-auth"
-    volumes:
-      - rest-data:/data
-    networks: [rmnet]
-
-  server:
-    build:
-      context: ..
-      dockerfile: deploy/Dockerfile.server
-      args:
-        VERSION: e2e
-    environment:
-      RM_LISTEN: ":8080"
-      RM_DATA_DIR: "/data"
-      RM_BASE_URL: "http://server:8080"
-      RM_COOKIE_SECURE: "false"
-      # Bind the metrics endpoint loose for the test, so one of the
-      # Playwright assertions can exercise it.
-      RM_METRICS_TRUSTED_CIDR: "0.0.0.0/0"
-    volumes:
-      - server-data:/data
-    ports:
-      - "127.0.0.1:8080:8080"
-    healthcheck:
-      test: ["CMD", "/usr/local/bin/restic-manager-server", "--version"]
-      interval: 2s
-      timeout: 2s
-      retries: 30
-    networks: [rmnet]
-
-  agent:
-    build:
-      context: ..
-      dockerfile: e2e/Dockerfile.agent
-      args:
-        VERSION: e2e
-    environment:
-      RM_SERVER: "http://server:8080"
-    depends_on:
-      - server
-    volumes:
-      # Source paths the agent backs up. Compose pre-populates this
-      # with a few files so the snapshot list isn't empty.
-      - source-data:/source
-      - agent-config:/etc/restic-manager
-      - agent-state:/var/lib/restic-manager
-    networks: [rmnet]
-
-  # Playwright test runner. Profile-gated so `compose up` doesn't
-  # start it; CI invokes it via `compose run` and `docker cp`s the
-  # report+traces out (see .gitea/workflows/e2e.yml). Lives on
-  # rmnet so it can reach the server via its compose-network DNS
-  # name rather than depending on host port-publish (which doesn't
-  # work on Gitea's container-based runners).
-  #
-  # Reports are NOT bind-mounted: when the runner job itself runs
-  # inside a container, `./playwright/...` resolves to a path that
-  # only exists inside the runner container, so the host docker
-  # daemon would silently mount an empty dir. Instead the report
-  # stays inside the playwright container and the workflow extracts
-  # it via `docker cp` before tearing down.
-  playwright:
-    profiles: [test]
-    build:
-      context: ..
-      dockerfile: e2e/Dockerfile.playwright
-    environment:
-      RM_BASE_URL: "http://server:8080"
-      RM_BOOTSTRAP_TOKEN: "${RM_BOOTSTRAP_TOKEN:-}"
-    depends_on:
-      - server
-      - agent
-    networks: [rmnet]
-
-  # One-shot init container that drops a couple of files into the
-  # source volume so backups have something to snapshot.
-  source-fixture:
-    image: alpine:3.20
-    command: >
-      sh -c 'mkdir -p /source && echo "hello world" > /source/hello.txt &&
-             echo "another file" > /source/two.txt && sleep 0.2'
-    volumes:
-      - source-data:/source
-    networks: [rmnet]
-    restart: "no"
-
-volumes:
-  server-data:
-  rest-data:
-  source-data:
-  agent-config:
-  agent-state:
-
-networks:
-  rmnet:
-    driver: bridge
@@ -1,14 +0,0 @@
-{
-  "name": "restic-manager-e2e",
-  "version": "0.0.0",
-  "private": true,
-  "type": "module",
-  "scripts": {
-    "test": "playwright test",
-    "test:headed": "playwright test --headed",
-    "test:debug": "PWDEBUG=1 playwright test"
-  },
-  "devDependencies": {
-    "@playwright/test": "1.59.1"
-  }
-}
@@ -1,35 +0,0 @@
-import { defineConfig, devices } from '@playwright/test';
-
-// Single-target Chromium config: the e2e suite is narrow (smoke
-// the production-shaped flow against the docker-compose stack).
-// Cross-browser matrix doesn't add signal — what we're verifying is
-// the server's HTML and the agent's WebSocket handshake, neither of
-// which depends on browser engine.
-
-const baseURL = process.env.RM_BASE_URL ?? 'http://127.0.0.1:8080';
-
-export default defineConfig({
-    testDir: './tests',
-    // 4 minutes — the smoke test waits for: enrolment + bootstrap
-    // (~5s), auto-init landing (~10s), backup completion (~120s
-    // budget). 60s is far too tight in CI; 4m gives headroom even
-    // on a contended runner without masking real regressions.
-    timeout: 240_000,
-    expect: { timeout: 10_000 },
-    fullyParallel: false,
-    retries: process.env.CI ? 1 : 0,
-    workers: 1,
-    reporter: [['list'], ['html', { open: 'never' }]],
-    use: {
-        baseURL,
-        trace: 'retain-on-failure',
-        screenshot: 'only-on-failure',
-        video: 'retain-on-failure',
-    },
-    projects: [
-        {
-            name: 'chromium',
-            use: { ...devices['Desktop Chrome'] },
-        },
-    ],
-});
@@ -1,152 +0,0 @@
-// Helpers used by every test. The shape favours the JSON API for
-// reads + accept/dispatch (deterministic, easy to assert) and the
-// browser for human-facing surfaces (login form, dashboard render).
-
-import { APIRequestContext, expect, Page } from '@playwright/test';
-
-export const baseURL = process.env.RM_BASE_URL ?? 'http://127.0.0.1:8080';
-
-export interface HostJSON {
-    id: string;
-    name: string;
-    status: string;
-    repo_status?: string;
-    last_backup_status?: string;
-}
-
-export async function readBootstrapToken(): Promise<string> {
-    const tok = process.env.RM_BOOTSTRAP_TOKEN;
-    if (!tok) {
-        throw new Error('RM_BOOTSTRAP_TOKEN not set — the harness scrapes it from server logs');
-    }
-    return tok;
-}
-
-export async function bootstrapAdmin(
-    request: APIRequestContext,
-    {
-        username = 'admin',
-        password = 'e2e-test-password-1234',
-    }: { username?: string; password?: string } = {},
-): Promise<{ username: string; password: string }> {
-    const token = await readBootstrapToken();
-    const res = await request.post(`${baseURL}/api/bootstrap`, {
-        data: { token, username, password },
-    });
-    if (!res.ok() && res.status() !== 409 /* already bootstrapped */) {
-        throw new Error(`bootstrap: ${res.status()} ${await res.text()}`);
-    }
-    return { username, password };
-}
-
-export async function loginViaUI(page: Page, username: string, password: string): Promise<void> {
-    await page.goto(`${baseURL}/login`);
-    await page.locator('#login-username').fill(username);
-    await page.locator('#login-password').fill(password);
-    await Promise.all([
-        page.waitForURL(new RegExp(`^${baseURL}/?$`)),
-        page.locator('form[action="/login"] button[type="submit"]').click(),
-    ]);
-}
-
-/**
- * Polls the dashboard until a pending host card is visible, then
- * extracts its pending-id from the inline accept form's action URL.
- */
-export async function waitForPendingHostID(page: Page): Promise<string> {
-    const formLocator = page.locator('form[action^="/api/pending-hosts/"][action$="/accept"]').first();
-    await expect(formLocator).toBeVisible({ timeout: 60_000 });
-    const action = await formLocator.getAttribute('action');
-    if (!action) throw new Error('pending host form has no action attribute');
-    const m = action.match(/\/api\/pending-hosts\/([^/]+)\/accept/);
-    if (!m) throw new Error(`unexpected action URL: ${action}`);
-    return m[1];
-}
-
-export async function acceptPending(
-    request: APIRequestContext,
-    cookie: string,
-    pendingID: string,
-    repo: { url: string; username?: string; password: string },
-): Promise<void> {
-    const res = await request.post(`${baseURL}/api/pending-hosts/${pendingID}/accept`, {
-        headers: { cookie, 'content-type': 'application/json' },
-        data: {
-            repo_url: repo.url,
-            repo_username: repo.username ?? '',
-            repo_password: repo.password,
-        },
-    });
-    if (!res.ok()) {
-        throw new Error(`accept: ${res.status()} ${await res.text()}`);
-    }
-}
-
-export async function listHosts(request: APIRequestContext, cookie: string): Promise<HostJSON[]> {
-    const res = await request.get(`${baseURL}/api/hosts`, { headers: { cookie } });
-    if (!res.ok()) throw new Error(`list hosts: ${res.status()} ${await res.text()}`);
-    const body = (await res.json()) as { items?: HostJSON[]; hosts?: HostJSON[] };
-    return body.items ?? body.hosts ?? [];
-}
-
-export async function waitForHostStatus(
-    request: APIRequestContext,
-    cookie: string,
-    matcher: (h: HostJSON) => boolean,
-    timeoutMs = 60_000,
-): Promise<HostJSON> {
-    const deadline = Date.now() + timeoutMs;
-    let last: HostJSON | undefined;
-    while (Date.now() < deadline) {
-        const hosts = await listHosts(request, cookie);
-        const hit = hosts.find(matcher);
-        if (hit) return hit;
-        last = hosts[0];
-        await new Promise((r) => setTimeout(r, 1_000));
-    }
-    throw new Error(`waitForHostStatus: timeout. Last seen: ${JSON.stringify(last)}`);
-}
-
-export async function createSourceGroup(
-    request: APIRequestContext,
-    cookie: string,
-    hostID: string,
-    body: { name: string; includes: string[]; excludes?: string[] },
-): Promise<string> {
-    const res = await request.post(`${baseURL}/api/hosts/${hostID}/source-groups`, {
-        headers: { cookie, 'content-type': 'application/json' },
-        data: {
-            name: body.name,
-            includes: body.includes,
-            excludes: body.excludes ?? [],
-            retention_policy: {},
-            retry_max: 0,
-            retry_backoff_seconds: 0,
-        },
-    });
-    if (!res.ok()) throw new Error(`createSourceGroup: ${res.status()} ${await res.text()}`);
-    const created = (await res.json()) as { id?: string; group?: { id?: string } };
-    const id = created.id ?? created.group?.id;
-    if (!id) throw new Error(`createSourceGroup: no id in response: ${JSON.stringify(created)}`);
-    return id;
-}
-
-export async function runSourceGroup(
-    request: APIRequestContext,
-    cookie: string,
-    hostID: string,
-    groupID: string,
-): Promise<void> {
-    const res = await request.post(
-        `${baseURL}/api/hosts/${hostID}/source-groups/${groupID}/run`,
-        { headers: { cookie } },
-    );
-    if (!res.ok()) throw new Error(`runSourceGroup: ${res.status()} ${await res.text()}`);
-}
-
-export async function getSessionCookie(page: Page): Promise<string> {
-    const cookies = await page.context().cookies();
-    const c = cookies.find((c) => c.name === 'rm_session');
-    if (!c) throw new Error('rm_session cookie not set after login');
-    return `${c.name}=${c.value}`;
-}
@@ -1,90 +0,0 @@
-// End-to-end smoke: bootstrap → accept pending host → run backup → see succeeded.
-//
-// The compose stack stands up a server, a sibling rest-server, and an
-// agent in announce-and-approve mode. This test drives the operator
-// path through the UI (login + dashboard) and the API
-// (accept + run-now + poll for terminal) — UI for the human surfaces,
-// API for the deterministic ones.
-
-import { test, expect } from '@playwright/test';
-import {
-    baseURL,
-    bootstrapAdmin,
-    loginViaUI,
-    waitForPendingHostID,
-    acceptPending,
-    waitForHostStatus,
-    createSourceGroup,
-    runSourceGroup,
-    getSessionCookie,
-} from './lib/server';
-
-test.describe('smoke: enrol-via-announce → backup', () => {
-    test('happy path: enrol → accept → backup → succeeded', async ({ page, request }) => {
-        const { username, password } = await bootstrapAdmin(request);
-        await loginViaUI(page, username, password);
-
-        // Dashboard renders.
-        await expect(page.locator('main')).toContainText(/host|fleet|pending/i, { timeout: 10_000 });
-
-        // Pending host appears (the agent container has been
-        // announcing since startup).
-        const pendingID = await waitForPendingHostID(page);
-        const cookie = await getSessionCookie(page);
-
-        // Accept with the rest-server creds. compose's rest-server runs
-        // --no-auth, so any credentials work; restic still demands a
-        // password to encrypt the repo.
-        await acceptPending(request, cookie, pendingID, {
-            url: 'rest:http://rest-server:8000/',
-            password: 'e2e-repo-password',
-        });
-
-        // Wait for the host to come online AND for auto-init to
-        // finish. Coming online happens as soon as the agent's
-        // bearer-authed WS attaches (~1s after accept); repo_status
-        // flips to 'ready' once the auto-init job completes (a
-        // couple of seconds later). Loading the host page before
-        // that leaves the Run-backup button disabled because the
-        // server-rendered HTML reflects the still-in-progress init,
-        // and the page has no live-refresh on that field.
-        const readyHost = await waitForHostStatus(
-            request, cookie,
-            (h) => h.status === 'online' && h.repo_status === 'ready',
-            90_000,
-        );
-        expect(readyHost.id).toBeTruthy();
-
-        // Per-host Run-now is gone; backups are dispatched per
-        // source-group now. Create one that maps to the agent's
-        // /source mount, then kick it via the JSON API.
-        const groupID = await createSourceGroup(request, cookie, readyHost.id, {
-            name: 'default',
-            includes: ['/source'],
-        });
-        await runSourceGroup(request, cookie, readyHost.id, groupID);
-
-        // Wait for the host's last_backup_status to flip to 'succeeded'.
-        // The host record is the source of truth: it's what the
-        // dashboard projects from job-completion events on the WS
-        // channel.
-        const finishedHost = await waitForHostStatus(
-            request, cookie,
-            (h) => h.id === readyHost.id && h.last_backup_status === 'succeeded',
-            120_000,
-        );
-        expect(finishedHost.last_backup_status).toBe('succeeded');
-    });
-});
-
-test.describe('smoke: scrape /metrics', () => {
-    test('metrics endpoint exposes the host gauge', async ({ request }) => {
-        // Compose sets RM_METRICS_TRUSTED_CIDR=0.0.0.0/0 so the
-        // endpoint is open to the test runner.
-        const res = await request.get(`${baseURL}/metrics`);
-        expect(res.status()).toBe(200);
-        const body = await res.text();
-        expect(body).toContain('rm_hosts_total');
-        expect(body).toContain('rm_build_info{');
-    });
-});
@@ -3,26 +3,22 @@ module gitea.dcglab.co.uk/steve/restic-manager
 go 1.25.0

 require (
-	github.com/coder/websocket v1.8.14
-	github.com/coreos/go-oidc/v3 v3.18.0
 	github.com/go-chi/chi/v5 v5.2.5
-	github.com/golang-jwt/jwt/v5 v5.3.1
 	github.com/oklog/ulid/v2 v2.1.1
-	github.com/robfig/cron/v3 v3.0.1
 	golang.org/x/crypto v0.50.0
-	golang.org/x/oauth2 v0.36.0
-	golang.org/x/sys v0.43.0
 	gopkg.in/yaml.v3 v3.0.1
 	modernc.org/sqlite v1.50.0
 )

 require (
+	github.com/coder/websocket v1.8.14 // indirect
 	github.com/dustin/go-humanize v1.0.1 // indirect
-	github.com/go-jose/go-jose/v4 v4.1.4 // indirect
 	github.com/google/uuid v1.6.0 // indirect
 	github.com/mattn/go-isatty v0.0.20 // indirect
 	github.com/ncruces/go-strftime v1.0.0 // indirect
 	github.com/remyoudompheng/bigfft v0.0.0-20230129092748-24d4a6f8daec // indirect
+	github.com/robfig/cron/v3 v3.0.1 // indirect
+	golang.org/x/sys v0.43.0 // indirect
 	modernc.org/libc v1.72.0 // indirect
 	modernc.org/mathutil v1.7.1 // indirect
 	modernc.org/memory v1.11.0 // indirect
@@ -1,15 +1,9 @@
 github.com/coder/websocket v1.8.14 h1:9L0p0iKiNOibykf283eHkKUHHrpG7f65OE3BhhO7v9g=
 github.com/coder/websocket v1.8.14/go.mod h1:NX3SzP+inril6yawo5CQXx8+fk145lPDC6pumgx0mVg=
-github.com/coreos/go-oidc/v3 v3.18.0 h1:V9orjXynvu5wiC9SemFTWnG4F45v403aIcjWo0d41+A=
-github.com/coreos/go-oidc/v3 v3.18.0/go.mod h1:DYCf24+ncYi+XkIH97GY1+dqoRlbaSI26KVTCI9SrY4=
 github.com/dustin/go-humanize v1.0.1 h1:GzkhY7T5VNhEkwH0PVJgjz+fX1rhBrR7pRT3mDkpeCY=
 github.com/dustin/go-humanize v1.0.1/go.mod h1:Mu1zIs6XwVuF/gI1OepvI0qD18qycQx+mFykh5fBlto=
 github.com/go-chi/chi/v5 v5.2.5 h1:Eg4myHZBjyvJmAFjFvWgrqDTXFyOzjj7YIm3L3mu6Ug=
 github.com/go-chi/chi/v5 v5.2.5/go.mod h1:X7Gx4mteadT3eDOMTsXzmI4/rwUpOwBHLpAfupzFJP0=
-github.com/go-jose/go-jose/v4 v4.1.4 h1:moDMcTHmvE6Groj34emNPLs/qtYXRVcd6S7NHbHz3kA=
-github.com/go-jose/go-jose/v4 v4.1.4/go.mod h1:x4oUasVrzR7071A4TnHLGSPpNOm2a21K9Kf04k1rs08=
-github.com/golang-jwt/jwt/v5 v5.3.1 h1:kYf81DTWFe7t+1VvL7eS+jKFVWaUnK9cB1qbwn63YCY=
-github.com/golang-jwt/jwt/v5 v5.3.1/go.mod h1:fxCRLWMO43lRc8nhHWY6LGqRcf+1gQWArsqaEUEa5bE=
 github.com/google/pprof v0.0.0-20250317173921-a4b03ec1a45e h1:ijClszYn+mADRFY17kjQEVQ1XRhq2/JR1M3sGqeJoxs=
 github.com/google/pprof v0.0.0-20250317173921-a4b03ec1a45e/go.mod h1:boTsfXsheKC2y+lKOCMpSfarhxDeIzfZG1jqGcPl3cA=
 github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0=
@@ -31,8 +25,6 @@ golang.org/x/crypto v0.50.0 h1:zO47/JPrL6vsNkINmLoo/PH1gcxpls50DNogFvB5ZGI=
 golang.org/x/crypto v0.50.0/go.mod h1:3muZ7vA7PBCE6xgPX7nkzzjiUq87kRItoJQM1Yo8S+Q=
 golang.org/x/mod v0.33.0 h1:tHFzIWbBifEmbwtGz65eaWyGiGZatSrT9prnU8DbVL8=
 golang.org/x/mod v0.33.0/go.mod h1:swjeQEj+6r7fODbD2cqrnje9PnziFuw4bmLbBZFrQ5w=
-golang.org/x/oauth2 v0.36.0 h1:peZ/1z27fi9hUOFCAZaHyrpWG5lwe0RJEEEeH0ThlIs=
-golang.org/x/oauth2 v0.36.0/go.mod h1:YDBUJMTkDnJS+A4BP4eZBjCqtokkg1hODuPjwiGPO7Q=
 golang.org/x/sync v0.20.0 h1:e0PTpb7pjO8GAtTs2dQ6jYa5BWYlMuX047Dco/pItO4=
 golang.org/x/sync v0.20.0/go.mod h1:9xrNwdLfx4jkKbNva9FpL6vEN7evnE43NNNJQ2LF3+0=
 golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
@@ -32,11 +32,6 @@ type Config struct {
 	RepoUsername  string
 	RepoPassword  string

-	// SupportsRestoreNoOwnership comes from a startup probe of
-	// `restic restore --help`; gates the new-dir-restore flag without
-	// relying on version sniffing.
-	SupportsRestoreNoOwnership bool
-
 	// Bandwidth caps in KB/s applied to every restic invocation.
 	// <=0 means "no cap". Per-job override: callers that build a
 	// runner per-dispatch can pass the override value here directly.
@@ -66,14 +61,13 @@ func New(cfg Config, tx Sender, progressMinPeriod time.Duration) *Runner {
 // resticEnv builds the shared restic.Env from r.cfg.
 func (r *Runner) resticEnv() restic.Env {
 	return restic.Env{
-		Bin:                        r.cfg.ResticBin,
-		Version:                    r.cfg.ResticVersion,
-		RepoURL:                    r.cfg.RepoURL,
-		RepoUsername:               r.cfg.RepoUsername,
-		RepoPassword:               r.cfg.RepoPassword,
-		SupportsRestoreNoOwnership: r.cfg.SupportsRestoreNoOwnership,
-		LimitUploadKBps:            r.cfg.LimitUploadKBps,
-		LimitDownloadKBps:          r.cfg.LimitDownloadKBps,
+		Bin:               r.cfg.ResticBin,
+		Version:           r.cfg.ResticVersion,
+		RepoURL:           r.cfg.RepoURL,
+		RepoUsername:      r.cfg.RepoUsername,
+		RepoPassword:      r.cfg.RepoPassword,
+		LimitUploadKBps:   r.cfg.LimitUploadKBps,
+		LimitDownloadKBps: r.cfg.LimitDownloadKBps,
 	}
 }

@@ -2,14 +2,10 @@ package runner

 import (
 	"context"
-	"errors"
 	"os"
-	"os/exec"
 	"path/filepath"
 	"sync"
-	"syscall"
 	"testing"
-	"time"

 	"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/restic"
@@ -47,22 +43,13 @@ func (s *fakeSender) snapshot() []api.Envelope {
 // setupScript writes a shell script (without shebang) to a temp dir,
 // names it "restic", makes it executable, and returns the path.
 //
-// Writes to "<path>.tmp" then renames into place. The rename is the
-// usual guard against ETXTBSY: under -race + many t.Parallel tests,
-// a fork-from-another-goroutine can inherit the writable fd from
+// Writes to "<path>.tmp" then renames into place. The rename is what
+// makes this race-free: under -race + many t.Parallel tests, a
+// fork-from-another-goroutine can inherit the writable fd from
 // os.WriteFile before close completes, and exec'ing the file then
-// returns ETXTBSY ("text file busy"). The renamed dirent points at
-// an inode that has no writable fd open anywhere — exec is safe on
-// a vanilla filesystem.
-//
-// On overlayfs (every job that runs inside a `container:` block on
-// our Gitea runner), the rename can briefly leak ETXTBSY anyway —
-// the upper layer's "writable inode" bookkeeping lags the userspace
-// close. To make the helper deterministic across environments, we
-// probe-exec the file with a benign argument until exec succeeds,
-// then return. Each script body has a `case "$1" in ... esac` shape
-// where unknown args fall through to a clean exit, so the probe is
-// a no-op from the test's point of view.
+// returns ETXTBSY ("text file busy"). Once the rename lands, the
+// final path is a fresh dirent pointing at an inode that has no
+// writable fd open anywhere — exec is safe.
 func setupScript(t *testing.T, body string) string {
 	t.Helper()
 	dir := t.TempDir()
@@ -74,21 +61,7 @@ func setupScript(t *testing.T, body string) string {
 	if err := os.Rename(tmp, final); err != nil {
 		t.Fatalf("setupScript: rename: %v", err)
 	}
-
-	deadline := time.Now().Add(3 * time.Second)
-	for {
-		err := exec.Command(final, "__rm_probe__").Run()
-		if err == nil {
-			return final
-		}
-		if !errors.Is(err, syscall.ETXTBSY) {
-			t.Fatalf("setupScript: probe exec: %v", err)
-		}
-		if time.Now().After(deadline) {
-			t.Fatalf("setupScript: %s still ETXTBSY after 3s", final)
-		}
-		time.Sleep(10 * time.Millisecond)
-	}
+	return final
 }

 // firstEnvOfType returns the first envelope with the given type, or
@@ -1,100 +0,0 @@
-// Package updater carries the agent's self-update logic.
-//
-// The flow is operator-driven: the server dispatches a command.update
-// WS envelope, the agent fetches a fresh binary from the server's
-// /agent/binary endpoint, atomic-renames it over the running binary
-// (Linux) or hands off to a detached helper script (Windows), and
-// exits cleanly so the service manager restarts under the new
-// binary. See docs/superpowers/specs/2026-05-06-p6-01-02-...
-//
-// Platform-specific code is build-tagged into updater_unix.go /
-// updater_windows.go. This file holds the shared HTTP fetch + path
-// helpers + the test seam.
-package updater
-
-import (
-	"context"
-	"fmt"
-	"io"
-	"net/http"
-	"os"
-	"path/filepath"
-	"runtime"
-	"time"
-)
-
-// fetch downloads the new binary into <binaryPath>.new, fsyncs, chmods.
-// Returns the path of the staged file (always binaryPath + ".new").
-func fetch(ctx context.Context, serverURL, binaryPath string) (string, error) {
-	url := fmt.Sprintf("%s/agent/binary?os=%s&arch=%s", serverURL, runtime.GOOS, runtime.GOARCH)
-	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
-	if err != nil {
-		return "", err
-	}
-	c := &http.Client{Timeout: 5 * time.Minute}
-	res, err := c.Do(req)
-	if err != nil {
-		return "", err
-	}
-	defer func() { _ = res.Body.Close() }()
-	if res.StatusCode != http.StatusOK {
-		return "", fmt.Errorf("agent binary fetch: %s", res.Status)
-	}
-
-	stagePath := binaryPath + ".new"
-	f, err := os.OpenFile(stagePath, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
-	if err != nil {
-		return "", err
-	}
-	if _, copyErr := io.Copy(f, res.Body); copyErr != nil {
-		_ = f.Close()
-		_ = os.Remove(stagePath)
-		return "", copyErr
-	}
-	if syncErr := f.Sync(); syncErr != nil {
-		_ = f.Close()
-		_ = os.Remove(stagePath)
-		return "", syncErr
-	}
-	if closeErr := f.Close(); closeErr != nil {
-		_ = os.Remove(stagePath)
-		return "", closeErr
-	}
-	if err := os.Chmod(stagePath, 0o755); err != nil {
-		_ = os.Remove(stagePath)
-		return "", err
-	}
-	return stagePath, nil
-}
-
-// resolveOwnBinary returns the absolute path of the running binary.
-// Refuses /proc/self/exe: os.Executable can return it on some
-// systems, and that path cannot be used as a rename target.
-func resolveOwnBinary() (string, error) {
-	p, err := os.Executable()
-	if err != nil {
-		return "", err
-	}
-	abs, err := filepath.Abs(p)
-	if err != nil {
-		return "", err
-	}
-	if abs == "/proc/self/exe" {
-		return "", fmt.Errorf("cannot resolve own binary path (/proc/self/exe)")
-	}
-	return abs, nil
-}
-
-// UpdateForTest is the platform-neutral test seam. In production the
-// platform-specific Update fetches, swaps, then exits the process.
-// UpdateForTest stops short of the exit so unit tests can assert on
-// file state.
-func UpdateForTest(serverURL, binaryPath string) error {
-	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
-	defer cancel()
-	stage, err := fetch(ctx, serverURL, binaryPath)
-	if err != nil {
-		return err
-	}
-	return swap(stage, binaryPath)
-}
@@ -1,87 +0,0 @@
-//go:build !windows
-
-package updater
-
-import (
-	"bytes"
-	"io"
-	"net/http"
-	"net/http/httptest"
-	"os"
-	"path/filepath"
-	"runtime"
-	"testing"
-)
-
-// TestUpdate_LinuxAtomicSwap stages a fake "running binary" file, runs
-// UpdateForTest against a fake /agent/binary server, and asserts that
-// the binary was swapped, .old preserves the previous bytes, and .new
-// was renamed away.
-func TestUpdate_LinuxAtomicSwap(t *testing.T) {
-	tmp := t.TempDir()
-	binPath := filepath.Join(tmp, "agent")
-	if err := os.WriteFile(binPath, []byte("OLD"), 0o755); err != nil {
-		t.Fatal(err)
-	}
-	newBytes := []byte("NEW BINARY CONTENTS")
-
-	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-		if r.URL.Path != "/agent/binary" {
-			http.NotFound(w, r)
-			return
-		}
-		gotOS, gotArch := r.URL.Query().Get("os"), r.URL.Query().Get("arch")
-		if gotOS != runtime.GOOS || gotArch != runtime.GOARCH {
-			t.Errorf("query mismatch: got os=%s arch=%s want %s/%s",
-				gotOS, gotArch, runtime.GOOS, runtime.GOARCH)
-		}
-		_, _ = io.Copy(w, bytes.NewReader(newBytes))
-	}))
-	defer srv.Close()
-
-	if err := UpdateForTest(srv.URL, binPath); err != nil {
-		t.Fatalf("update: %v", err)
-	}
-
-	got, err := os.ReadFile(binPath)
-	if err != nil {
-		t.Fatal(err)
-	}
-	if string(got) != string(newBytes) {
-		t.Fatalf("binary contents: got %q want %q", got, newBytes)
-	}
-	old, err := os.ReadFile(binPath + ".old")
-	if err != nil {
-		t.Fatalf("agent.old missing: %v", err)
-	}
-	if string(old) != "OLD" {
-		t.Fatalf("agent.old contents: got %q want %q", old, "OLD")
-	}
-	if _, err := os.Stat(binPath + ".new"); !os.IsNotExist(err) {
-		t.Fatalf("agent.new should be absent after swap, got err=%v", err)
-	}
-}
-
-// TestUpdate_FetchHTTPError surfaces the server's status when the
-// binary is not published for this os/arch.
-func TestUpdate_FetchHTTPError(t *testing.T) {
-	tmp := t.TempDir()
-	binPath := filepath.Join(tmp, "agent")
-	if err := os.WriteFile(binPath, []byte("OLD"), 0o755); err != nil {
-		t.Fatal(err)
-	}
-
-	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-		http.Error(w, `{"error":"binary_not_published"}`, http.StatusNotFound)
-	}))
-	defer srv.Close()
-
-	err := UpdateForTest(srv.URL, binPath)
-	if err == nil {
-		t.Fatal("expected error, got nil")
-	}
-	got, _ := os.ReadFile(binPath)
-	if string(got) != "OLD" {
-		t.Fatalf("binary should not have changed, got %q", got)
-	}
-}
@@ -1,73 +0,0 @@
-//go:build !windows
-
-package updater
-
-import (
-	"context"
-	"fmt"
-	"io"
-	"log/slog"
-	"os"
-	"time"
-)
-
-// Update fetches the new binary, swaps it in, then exits so systemd
-// restarts the process under the new binary. The caller should close
-// the WS connection cleanly (so the server transitions the host to
-// disconnected immediately rather than waiting for the heartbeat
-// sweep) before invoking.
-//
-// Service-user assumption: the agent runs as root under the
-// systemd-shipped unit, which can write the binary path directly.
-// If the agent ever moves to a non-root service user, this breaks;
-// it would need a setuid helper or an out-of-process update service.
-func Update(ctx context.Context, serverURL string) error {
-	binPath, err := resolveOwnBinary()
-	if err != nil {
-		return err
-	}
-	stage, err := fetch(ctx, serverURL, binPath)
-	if err != nil {
-		return err
-	}
-	if err := swap(stage, binPath); err != nil {
-		return err
-	}
-	slog.Info("agent self-update: binary swapped, exiting for systemd restart",
-		"binary", binPath)
-	// Give logger / WS close-frame a moment to flush, then exit.
-	time.Sleep(200 * time.Millisecond)
-	os.Exit(0)
-	return nil // unreachable
-}
-
-// swap copies the running binary to <bin>.old (M1 — keep one revision
-// back for hand-rolled rollback), then atomic-renames the staged
-// binary into place. Linux allows rename over an open file, so this
-// works even though the running process holds the old binary open.
-func swap(stagePath, binPath string) error {
-	src, err := os.Open(binPath)
-	if err != nil {
-		return fmt.Errorf("open running binary: %w", err)
-	}
-	defer func() { _ = src.Close() }()
-	dst, err := os.OpenFile(binPath+".old", os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
-	if err != nil {
-		return fmt.Errorf("open .old: %w", err)
-	}
-	if _, err := io.Copy(dst, src); err != nil {
-		_ = dst.Close()
-		return fmt.Errorf("copy to .old: %w", err)
-	}
-	if err := dst.Sync(); err != nil {
-		_ = dst.Close()
-		return err
-	}
-	if err := dst.Close(); err != nil {
-		return err
-	}
-	if err := os.Rename(stagePath, binPath); err != nil {
-		return fmt.Errorf("rename .new over running binary: %w", err)
-	}
-	return nil
-}
@@ -1,73 +0,0 @@
-//go:build windows
-
-package updater
-
-import (
-	"context"
-	"fmt"
-	"log/slog"
-	"os"
-	"os/exec"
-	"path/filepath"
-	"syscall"
-	"time"
-)
-
-// helperScript is rendered with fmt.Sprintf, args order:
-//
-//	%[1]s — running binary path (source for the .old copy)
-//	%[2]s — .old path
-//	%[3]s — staged .new path
-//	%[4]s — running binary path (rename target)
-const helperScript = `@echo off
-timeout /t 3 /nobreak >nul
-copy /Y "%[1]s" "%[2]s"
-sc stop restic-manager-agent
-:wait
-sc query restic-manager-agent | find "STOPPED" >nul
-if errorlevel 1 (timeout /t 1 /nobreak >nul & goto wait)
-move /Y "%[3]s" "%[4]s"
-sc start restic-manager-agent
-del "%%~f0"
-`
-
-// Update on Windows can't overwrite the running .exe in-process
-// (exclusive file lock), so we stage the new binary, write a small
-// detached helper script that waits, stops the service, swaps the
-// binary, and starts the service, then exit cleanly. SCM treats
-// clean exits after sc stop as intentional and does not auto-restart;
-// the helper's final sc start handles that.
-func Update(ctx context.Context, serverURL string) error {
-	binPath, err := resolveOwnBinary()
-	if err != nil {
-		return err
-	}
-	stage, err := fetch(ctx, serverURL, binPath)
-	if err != nil {
-		return err
-	}
-	helperPath := filepath.Join(filepath.Dir(binPath), "agent-update.cmd")
-	body := fmt.Sprintf(helperScript, binPath, binPath+".old", stage, binPath)
-	if err := os.WriteFile(helperPath, []byte(body), 0o755); err != nil {
-		return err
-	}
-	cmd := exec.Command("cmd.exe", "/c", helperPath)
-	cmd.SysProcAttr = &syscall.SysProcAttr{
-		HideWindow:    true,
-		CreationFlags: 0x00000008 | 0x08000000, // DETACHED_PROCESS | CREATE_NO_WINDOW
-	}
-	if err := cmd.Start(); err != nil {
-		return err
-	}
-	slog.Info("agent self-update: helper spawned, exiting cleanly",
-		"binary", binPath, "helper", helperPath)
-	time.Sleep(200 * time.Millisecond)
-	os.Exit(0)
-	return nil // unreachable
-}
-
-// swap is unused on Windows — the helper script does the swap.
-// Defined to satisfy the build (UpdateForTest references it).
-func swap(_, _ string) error {
-	return fmt.Errorf("updater.swap not implemented on Windows; use the helper script via Update")
-}
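Reviewer note: the removed `helperScript` relies on two `fmt.Sprintf` details that are easy to miss: explicit argument indexes (`%[1]s`) and the `%%` escape that leaves a literal `%` for cmd.exe's `%~f0`. The sketch below renders a shortened, hypothetical template to show both. `demoScript` and `renderHelper` are illustrative names, and the sketch reuses `%[1]s` for the rename target where the original passed the binary path a second time as `%[4]s`.

```go
package main

import "fmt"

// demoScript is a shortened, hypothetical stand-in for the removed
// helper template. %[N]s selects the Nth argument; %% emits one
// literal percent sign, so cmd.exe still sees %~f0.
const demoScript = `copy /Y "%[1]s" "%[2]s"
move /Y "%[3]s" "%[1]s"
del "%%~f0"
`

// renderHelper fills the template for a given binary path, deriving
// the .old and .new paths the same way the removed code did.
func renderHelper(binPath string) string {
	return fmt.Sprintf(demoScript, binPath, binPath+".old", binPath+".new")
}

func main() {
	fmt.Print(renderHelper(`C:\agent\agent.exe`))
}
```

Indexed verbs keep the template readable when the same path appears more than once, at the cost of having to keep the argument order comment in sync with the call site.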
@@ -22,12 +22,6 @@ import (
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
 )

-// staleBackupThreshold is how long an intermittent host may go without
-// a successful backup before we raise a stale_schedule alert. Global
-// constant for v1 (may become per-host later). Only intermittent hosts
-// are evaluated — always-on hosts' stale_schedule stays a no-op.
-const staleBackupThreshold = 7 * 24 * time.Hour
-
 // JobFinishedEvent carries everything the engine needs to evaluate
 // the failed-X rules. Pushed via Engine.NotifyJobFinished from the
 // MarkJobFinished site.
@@ -155,10 +149,6 @@ func (e *Engine) handleJobFinished(ctx context.Context, ev JobFinishedEvent) {
 			fmt.Sprintf("%s job %s failed", ev.Kind, ev.JobID), ev.When)
 	case "succeeded":
 		e.resolveAndNotify(ctx, ev.HostID, kind, dedupKey, ev.When)
-		if ev.Kind == "backup" {
-			// A fresh backup clears staleness for intermittent hosts.
-			e.resolveAndNotify(ctx, ev.HostID, KindStaleSchedule, "", ev.When)
-		}
 	}
 }

@@ -167,12 +157,6 @@ func (e *Engine) handleHostOffline(ctx context.Context, hostID string) {
 	if err != nil {
 		return
 	}
-	// Intermittent hosts (laptops) legitimately disappear — never raise
-	// agent_offline for them. The stale_schedule sweep in tick() is the
-	// only staleness signal for these hosts.
-	if !host.AlwaysOn {
-		return
-	}
 	// Apply the 15-min floor — raise only when last_seen_at is older
 	// than agentOfflineFloor. A nil last_seen_at (host enrolled but
 	// never connected) is treated as "now" so we don't raise
@@ -196,56 +180,18 @@ func (e *Engine) handleHostOnline(ctx context.Context, hostID string) {
 // tick is the 60-second sweep. Responsibilities:
 //  1. Re-evaluate agent_offline for every offline host that may have
 //     crossed the floor between explicit events.
-//  2. Stale-schedule detection for intermittent hosts — raises
-//     stale_schedule when LastBackupAt is older than 7 days and the
-//     host has an enabled schedule. Always-on hosts are excluded.
+//  2. Stale-schedule detection — declared in the spec but intentionally
+//     left as a no-op in v1. The precise "expected to have fired but
+//     didn't" trigger requires a store helper that lands in a later
+//     task. The KindStaleSchedule constant is exported so UI code can
+//     reference the tag string today.
 func (e *Engine) tick(ctx context.Context, now time.Time) {
-	// User-management cleanup piggy-backed here for now. Setup tokens
-	// have a 1h expiry; the alert engine tick is the cheapest existing
-	// 60s loop. If more housekeeping queries appear, extract a
-	// dedicated maintenance loop.
-	if _, err := e.store.CleanupExpiredSetupTokens(ctx, now); err != nil {
-		slog.Warn("alert: cleanup expired setup tokens", "err", err)
-	}
-	if _, err := e.store.CleanupExpiredOIDCState(ctx, now.Add(-5*time.Minute)); err != nil {
-		slog.Warn("alert: cleanup expired oidc state", "err", err)
-	}
-
 	hosts, err := e.store.ListHosts(ctx)
 	if err != nil {
 		slog.Warn("alert: tick list hosts", "err", err)
 		return
 	}
 	for _, h := range hosts {
-		// Intermittent hosts: suppress agent_offline entirely; instead
-		// raise stale_schedule when they have gone too long with no
-		// successful backup AND they have at least one enabled schedule
-		// to be measured against. A nil LastBackupAt (never backed up)
-		// has no baseline — onboarding/repo_status covers that case.
-		if !h.AlwaysOn {
-			if h.LastBackupAt == nil {
-				continue
-			}
-			if now.Sub(*h.LastBackupAt) < staleBackupThreshold {
-				continue
-			}
-			hasEnabled, err := e.hostHasEnabledSchedule(ctx, h.ID)
-			if err != nil {
-				slog.Warn("alert: tick list schedules", "host_id", h.ID, "err", err)
-				continue
-			}
-			if !hasEnabled {
-				continue
-			}
-			e.raiseAndNotify(ctx, h.ID, KindStaleSchedule, "", "warning",
-				fmt.Sprintf("No backup in %s (threshold %s)",
-					roundDur(now.Sub(*h.LastBackupAt)), staleBackupThreshold), now)
-			// Resolution is handled in handleJobFinished on a successful
-			// backup (and ResolveOnModeChange on toggle) — the tick only
-			// raises, it does not auto-resolve.
-			continue
-		}
-		// Always-on hosts: existing agent_offline re-evaluation.
 		if h.Status != "offline" || h.LastSeenAt == nil {
 			continue
 		}
@@ -255,6 +201,7 @@ func (e *Engine) tick(ctx context.Context, now time.Time) {
 					roundDur(now.Sub(*h.LastSeenAt)), e.agentOfflineFloor), now)
 		}
 	}
+	// Stale-schedule sweep — no-op in v1. See KindStaleSchedule doc comment.
 }

 // roundDur returns a human-readable duration string, rounding to the
@@ -266,19 +213,3 @@ func roundDur(d time.Duration) string {
 	}
 	return d.Round(time.Minute).String()
 }
-
-// hostHasEnabledSchedule reports whether the host has at least one
-// enabled backup schedule — the precondition for a stale_schedule
-// alert (no schedule = no backup expectation to measure against).
-func (e *Engine) hostHasEnabledSchedule(ctx context.Context, hostID string) (bool, error) {
-	schedules, err := e.store.ListSchedulesByHost(ctx, hostID)
-	if err != nil {
-		return false, err
-	}
-	for _, sc := range schedules {
-		if sc.Enabled {
-			return true, nil
-		}
-	}
-	return false, nil
-}
@@ -1,255 +0,0 @@
-package alert
-
-import (
-	"context"
-	"testing"
-	"time"
-
-	"github.com/oklog/ulid/v2"
-
-	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
-)
-
-// TestIntermittentHostSuppressesOfflineAlert checks that handleHostOffline
-// does NOT raise agent_offline for a host with AlwaysOn=false.
-func TestIntermittentHostSuppressesOfflineAlert(t *testing.T) {
-	t.Parallel()
-	eng, st, hostID := setupEngine(t)
-	ctx := context.Background()
-
-	// Make the host intermittent.
-	if err := st.SetHostAlwaysOn(ctx, hostID, false); err != nil {
-		t.Fatalf("SetHostAlwaysOn: %v", err)
-	}
-
-	// Give it a stale last_seen_at well past the floor.
-	if _, err := st.DB().Exec(
-		`UPDATE hosts SET last_seen_at = ?, status = ? WHERE id = ?`,
-		time.Now().UTC().Add(-2*time.Hour).Format(time.RFC3339Nano),
-		"offline",
-		hostID,
-	); err != nil {
-		t.Fatalf("update last_seen_at: %v", err)
-	}
-
-	eng.handleHostOffline(ctx, hostID)
-
-	open, _ := st.ListAlerts(ctx, store.AlertFilter{Status: "open", HostID: hostID})
-	if len(open) != 0 {
-		t.Fatalf("expected 0 open alerts for intermittent host; got %d: %+v", len(open), open)
-	}
-}
-
-// TestAlwaysOnHostStillRaisesOfflineAlert checks that always-on hosts still
-// get an agent_offline alert when offline past the floor.
-func TestAlwaysOnHostStillRaisesOfflineAlert(t *testing.T) {
-	t.Parallel()
-	eng, st, hostID := setupEngine(t)
-	ctx := context.Background()
-
-	// always_on=true is the default, but be explicit.
-	if err := st.SetHostAlwaysOn(ctx, hostID, true); err != nil {
-		t.Fatalf("SetHostAlwaysOn: %v", err)
-	}
-
-	// Give it a stale last_seen_at well past the 15m floor.
-	if _, err := st.DB().Exec(
-		`UPDATE hosts SET last_seen_at = ?, status = ? WHERE id = ?`,
-		time.Now().UTC().Add(-2*time.Hour).Format(time.RFC3339Nano),
-		"offline",
-		hostID,
-	); err != nil {
-		t.Fatalf("update last_seen_at: %v", err)
-	}
-
-	eng.handleHostOffline(ctx, hostID)
-
-	open, _ := st.ListAlerts(ctx, store.AlertFilter{Status: "open", HostID: hostID})
-	if len(open) != 1 || open[0].Kind != KindAgentOffline {
-		t.Fatalf("expected 1 agent_offline alert; got %d: %+v", len(open), open)
-	}
-}
-
-// TestStalenessAlertForIntermittentHost checks that tick raises stale_schedule
-// for an intermittent host whose last backup is older than 7 days AND has an
-// enabled schedule. Also verifies that a succeeded backup clears the alert.
-func TestStalenessAlertForIntermittentHost(t *testing.T) {
-	t.Parallel()
-	eng, st, hostID := setupEngine(t)
-	ctx := context.Background()
-
-	// Make intermittent.
-	if err := st.SetHostAlwaysOn(ctx, hostID, false); err != nil {
-		t.Fatalf("SetHostAlwaysOn: %v", err)
-	}
-
-	// Create a source group to attach the schedule to.
-	sgID := ulid.Make().String()
-	if err := st.CreateSourceGroup(ctx, &store.SourceGroup{
-		ID:       sgID,
-		HostID:   hostID,
-		Name:     "default",
-		Includes: []string{"/home"},
-	}); err != nil {
-		t.Fatalf("CreateSourceGroup: %v", err)
-	}
-
-	// Create an enabled schedule pointing at the source group.
-	schedID := ulid.Make().String()
-	if err := st.CreateSchedule(ctx, &store.Schedule{
-		ID:             schedID,
-		HostID:         hostID,
-		CronExpr:       "0 2 * * *",
-		Enabled:        true,
-		SourceGroupIDs: []string{sgID},
-	}); err != nil {
-		t.Fatalf("CreateSchedule: %v", err)
-	}
-
-	// Set last_backup_at to 8 days ago.
-	eightDaysAgo := time.Now().UTC().Add(-8 * 24 * time.Hour)
-	if err := st.SetHostLastBackup(ctx, hostID, "succeeded", eightDaysAgo); err != nil {
-		t.Fatalf("SetHostLastBackup: %v", err)
-	}
-
-	eng.tick(ctx, time.Now().UTC())
-
-	open, _ := st.ListAlerts(ctx, store.AlertFilter{Status: "open", HostID: hostID})
-	var staleCount int
-	for _, a := range open {
-		if a.Kind == KindStaleSchedule {
-			staleCount++
-		}
-	}
-	if staleCount != 1 {
-		t.Fatalf("expected 1 stale_schedule alert after tick; got %d (all open: %+v)", staleCount, open)
-	}
-
-	// A succeeded backup should clear the stale_schedule alert.
-	eng.handleJobFinished(ctx, JobFinishedEvent{
-		HostID:        hostID,
-		JobID:         ulid.Make().String(),
-		Kind:          "backup",
-		Status:        "succeeded",
-		SourceGroupID: sgID,
-		When:          time.Now().UTC(),
-	})
-
-	open, _ = st.ListAlerts(ctx, store.AlertFilter{Status: "open", HostID: hostID})
-	for _, a := range open {
-		if a.Kind == KindStaleSchedule {
-			t.Fatalf("expected stale_schedule to be resolved after backup succeeded; still open: %+v", a)
-		}
-	}
-}
-
-// TestNoStalenessWithoutEnabledSchedule checks that no stale_schedule is
-// raised for an intermittent host with a stale backup but no enabled schedule.
-func TestNoStalenessWithoutEnabledSchedule(t *testing.T) {
-	t.Parallel()
-	eng, st, hostID := setupEngine(t)
-	ctx := context.Background()
-
-	// Make intermittent.
-	if err := st.SetHostAlwaysOn(ctx, hostID, false); err != nil {
-		t.Fatalf("SetHostAlwaysOn: %v", err)
-	}
-
-	// Set last_backup_at to 8 days ago — stale — but no schedule.
-	eightDaysAgo := time.Now().UTC().Add(-8 * 24 * time.Hour)
-	if err := st.SetHostLastBackup(ctx, hostID, "succeeded", eightDaysAgo); err != nil {
-		t.Fatalf("SetHostLastBackup: %v", err)
-	}
-
-	eng.tick(ctx, time.Now().UTC())
-
-	open, _ := st.ListAlerts(ctx, store.AlertFilter{Status: "open", HostID: hostID})
-	for _, a := range open {
-		if a.Kind == KindStaleSchedule {
-			t.Fatalf("expected no stale_schedule without an enabled schedule; got: %+v", a)
-		}
-	}
-}
-
-// TestResolveOnModeChangeClearsOfflineAlert checks that ResolveOnModeChange
-// clears an open agent_offline alert when a host's mode is toggled.
-func TestResolveOnModeChangeClearsOfflineAlert(t *testing.T) {
-	t.Parallel()
-	eng, st, hostID := setupEngine(t)
-	ctx := context.Background()
-
-	// Make always-on and set it offline with a stale last_seen_at.
-	if err := st.SetHostAlwaysOn(ctx, hostID, true); err != nil {
-		t.Fatalf("SetHostAlwaysOn: %v", err)
-	}
-	if _, err := st.DB().Exec(
-		`UPDATE hosts SET last_seen_at = ?, status = ? WHERE id = ?`,
-		time.Now().UTC().Add(-2*time.Hour).Format(time.RFC3339Nano),
-		"offline",
-		hostID,
-	); err != nil {
-		t.Fatalf("update last_seen_at: %v", err)
-	}
-
-	// Raise the offline alert.
-	eng.handleHostOffline(ctx, hostID)
-
-	open, _ := st.ListAlerts(ctx, store.AlertFilter{Status: "open", HostID: hostID})
-	if len(open) != 1 || open[0].Kind != KindAgentOffline {
-		t.Fatalf("expected 1 agent_offline alert before mode change; got %d: %+v", len(open), open)
-	}
-
-	// Toggle mode — should clear the alert.
-	eng.ResolveOnModeChange(ctx, hostID, time.Now().UTC())
-
-	open, _ = st.ListAlerts(ctx, store.AlertFilter{Status: "open", HostID: hostID})
-	for _, a := range open {
-		if a.Kind == KindAgentOffline {
-			t.Fatalf("expected agent_offline to be resolved after mode change; still open: %+v", a)
-		}
-	}
-}
-
-// TestNoStalenessWhenNeverBackedUp checks that no stale_schedule alert is
-// raised for an intermittent host that has never backed up (nil LastBackupAt).
-func TestNoStalenessWhenNeverBackedUp(t *testing.T) {
-	t.Parallel()
-	eng, st, hostID := setupEngine(t)
-	ctx := context.Background()
-
-	// Make intermittent.
-	if err := st.SetHostAlwaysOn(ctx, hostID, false); err != nil {
-		t.Fatalf("SetHostAlwaysOn: %v", err)
-	}
-
-	// Create a source group and an enabled schedule — but do NOT set LastBackupAt.
-	sgID := ulid.Make().String()
-	if err := st.CreateSourceGroup(ctx, &store.SourceGroup{
-		ID:       sgID,
-		HostID:   hostID,
-		Name:     "default",
-		Includes: []string{"/home"},
-	}); err != nil {
-		t.Fatalf("CreateSourceGroup: %v", err)
-	}
-
-	schedID := ulid.Make().String()
-	if err := st.CreateSchedule(ctx, &store.Schedule{
-		ID:             schedID,
-		HostID:         hostID,
-		CronExpr:       "0 2 * * *",
-		Enabled:        true,
-		SourceGroupIDs: []string{sgID},
-	}); err != nil {
-		t.Fatalf("CreateSchedule: %v", err)
-	}
-
-	eng.tick(ctx, time.Now().UTC())
-
-	open, _ := st.ListAlerts(ctx, store.AlertFilter{Status: "open", HostID: hostID})
-	for _, a := range open {
-		if a.Kind == KindStaleSchedule {
-			t.Fatalf("expected no stale_schedule when never backed up; got: %+v", a)
-		}
-	}
-}
@@ -27,10 +27,10 @@ const (
 	// integrity is at risk) when a check job fails.
 	KindCheckFailed = "check_failed"

-	// KindStaleSchedule is raised for intermittent (non-always-on) hosts
-	// when their last successful backup is older than staleBackupThreshold
-	// (7 days) and they have at least one enabled schedule. Resolved on
-	// backup success or when the host is switched to always-on mode.
+	// KindStaleSchedule is declared for completeness but intentionally
+	// left as a no-op in v1. The precise "expected to have fired but
+	// didn't" logic requires a store helper that lands in a follow-up
+	// task. Ask the team before implementing.
 	KindStaleSchedule = "stale_schedule"

 	// KindAgentOffline is raised when a host's last_seen_at is older
@@ -122,16 +122,6 @@ func alertPayload(ctx context.Context, st *store.Store, ev notification.Event, a
 	}
 }

-// ResolveOnModeChange clears any open agent_offline and stale_schedule
-// alerts for a host whose always-on flag was just toggled. The next
-// 60s tick re-raises whichever still applies under the new mode, so
-// this is a self-correcting "wipe and let the sweep settle" call.
-// Safe to invoke from the HTTP layer (it only touches the store + hub).
-func (e *Engine) ResolveOnModeChange(ctx context.Context, hostID string, when time.Time) {
-	e.resolveAndNotify(ctx, hostID, KindAgentOffline, "", when)
-	e.resolveAndNotify(ctx, hostID, KindStaleSchedule, "", when)
-}
-
 // resolveAndNotify clears the open (or acknowledged) alert matching
 // (host_id, kind, dedup_key) via store.AutoResolve, then fires
 // alert.resolved for the row(s) actually closed. Best-effort —
@@ -1,63 +0,0 @@
-package alert
-
-import (
-	"context"
-	"fmt"
-	"log/slog"
-	"time"
-
-	"gitea.dcglab.co.uk/steve/restic-manager/internal/notification"
-)
-
-// Alert-kind constants for P6 self-update flows.
-const (
-	// KindUpdateFailed is raised when an agent fails to come back with
-	// the expected version after a command.update dispatch (timeout or
-	// version-mismatch). Resolved by a subsequent matching hello.
-	KindUpdateFailed = "update_failed"
-
-	// KindFleetUpdateHalted is raised when the fleet-update worker
-	// stops mid-run because a host failed to update or went offline.
-	// Host-less alert (system-scoped). Manually resolved by an admin.
-	KindFleetUpdateHalted = "fleet_update_halted"
-)
-
-// RaiseUpdateFailed records a per-host update failure. dedupKey is the
-// hostID so a re-dispatch on the same host touches the existing alert
-// rather than spawning a duplicate.
-func (e *Engine) RaiseUpdateFailed(ctx context.Context, hostID, jobID, reason string, when time.Time) {
-	msg := fmt.Sprintf("Agent update failed (job %s): %s", jobID, reason)
-	e.raiseAndNotify(ctx, hostID, KindUpdateFailed, hostID, "warning", msg, when)
-}
-
-// ResolveUpdateFailed clears any open update_failed alert for hostID.
-// Called from the WS hello path when the agent reconnects with the
-// target version.
-func (e *Engine) ResolveUpdateFailed(ctx context.Context, hostID string, when time.Time) {
-	e.resolveAndNotify(ctx, hostID, KindUpdateFailed, hostID, when)
-}
-
-// RaiseFleetUpdateHalted is host-less — the fleet update is a
-// system-level concept. We persist it via the dedicated host-less
-// alert path so the alerts table's host_id column carries NULL.
-func (e *Engine) RaiseFleetUpdateHalted(ctx context.Context, fleetUpdateID, reason string, when time.Time) {
-	msg := fmt.Sprintf("Fleet update %s halted: %s", fleetUpdateID, reason)
-	id, didRaise, err := e.store.RaiseOrTouchSystem(ctx, KindFleetUpdateHalted, fleetUpdateID, "warning", msg, when)
-	if err != nil {
-		slog.Warn("alert: raise fleet_update_halted", "fu_id", fleetUpdateID, "err", err)
-		return
-	}
-	if !didRaise {
-		return
-	}
-	go e.hub.Dispatch(ctx, notification.Payload{
-		Event:    notification.EventRaised,
-		AlertID:  id,
-		Severity: "warning",
-		Kind:     KindFleetUpdateHalted,
-		HostID:   "",
-		HostName: "",
-		Message:  msg,
-		RaisedAt: when,
-	})
-}
@@ -63,7 +63,6 @@ const (
 	JobUnlock  JobKind = "unlock"
 	JobRestore JobKind = "restore"
 	JobDiff    JobKind = "diff"
-	JobUpdate  JobKind = "update"
 )

 // JobStatus is the lifecycle state of a job.
@@ -362,14 +361,13 @@ type ConfigUpdatePayload struct {
 	BandwidthDownKBps *int `json:"bandwidth_down_kbps,omitempty"`
 }

-// CommandUpdatePayload carries no operational data — the agent
-// already knows its own os/arch and fetches from its configured
-// server URL via /agent/binary. JobID is the server-issued id of
-// the update job; the agent echoes it on log.stream lines so the
-// live job log captures pre-restart progress, then either exits
-// (Linux) or hands off to a detached helper script (Windows).
-type CommandUpdatePayload struct {
-	JobID string `json:"job_id"`
+// AgentUpdateAvailablePayload — informational only; the agent does
+// NOT self-update. See spec.md §4.2 for the package-manager-based
+// update model.
+type AgentUpdateAvailablePayload struct {
+	LatestVersion string `json:"latest_version"`
+	PackageURL    string `json:"package_url"` // apt repo / choco source
+	Changelog     string `json:"changelog,omitempty"`
 }

 // TreeListRequestPayload is the body of a tree.list RPC. Used by the
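Reviewer note: the new payload is informational only, so its wire shape is the whole contract. A minimal sketch of how it would serialise, assuming the struct tags shown in the hunk above; `encodeUpdateNotice`, the version string and the URL are hypothetical and not part of this change.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AgentUpdateAvailablePayload mirrors the struct added in the hunk above.
type AgentUpdateAvailablePayload struct {
	LatestVersion string `json:"latest_version"`
	PackageURL    string `json:"package_url"`
	Changelog     string `json:"changelog,omitempty"`
}

// encodeUpdateNotice is a hypothetical helper (not in the repository) that
// returns the JSON body an agent.update.available message might carry.
func encodeUpdateNotice(p AgentUpdateAvailablePayload) (string, error) {
	b, err := json.Marshal(p)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Placeholder values; Changelog is left empty so omitempty drops the key.
	body, err := encodeUpdateNotice(AgentUpdateAvailablePayload{
		LatestVersion: "1.4.0",
		PackageURL:    "https://packages.example/apt",
	})
	if err != nil {
		fmt.Println("encode:", err)
		return
	}
	fmt.Println(body)
}
```

Because `Changelog` is tagged `omitempty`, an agent should treat a missing `changelog` key the same as an empty one.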
@@ -29,12 +29,12 @@ const (

 // Server → agent message types.
 const (
-	MsgCommandRun    MessageType = "command.run"
-	MsgCommandCancel MessageType = "command.cancel"
-	MsgScheduleSet   MessageType = "schedule.set"
-	MsgConfigUpdate  MessageType = "config.update"
-	MsgCommandUpdate MessageType = "command.update"
-	MsgTreeList      MessageType = "tree.list" // sync RPC: list a snapshot's children
+	MsgCommandRun       MessageType = "command.run"
+	MsgCommandCancel    MessageType = "command.cancel"
+	MsgScheduleSet      MessageType = "schedule.set"
+	MsgConfigUpdate     MessageType = "config.update"
+	MsgAgentUpdateAvail MessageType = "agent.update.available"
+	MsgTreeList         MessageType = "tree.list" // sync RPC: list a snapshot's children
 )

 // Envelope is the framing for every WS message in either direction.
@@ -9,7 +9,6 @@ import (
 	"errors"
 	"fmt"
 	"strings"
-	"testing"

 	"golang.org/x/crypto/argon2"
 )
@@ -28,38 +27,22 @@ const (
 	defaultKeyLen     = 32
 )

-// Cheap params used only when the binary is a `go test` binary
-// (testing.Testing() == true). Argon2id at production params costs
-// 300–500 ms per hash and dominates wall time on CI runners under
-// `-race`. Tests don't need real KDF strength — VerifyPassword reads
-// params from the encoded hash, so verifying a cheap-params hash
-// works the same way.
-const (
-	testMemoryKiB  = 8
-	testIterations = 1
-	testParallel   = 1
-)
-
 // HashPassword returns an argon2id-encoded string of the form
 //
 //	$argon2id$v=19$m=...,t=...,p=...$<salt>$<hash>
 //
 // safe to store in a TEXT column. The salt is freshly random per call.
 func HashPassword(password string) (string, error) {
-	mem, iter, par := uint32(defaultMemoryKiB), uint32(defaultIterations), uint8(defaultParallel)
-	if testing.Testing() {
-		mem, iter, par = testMemoryKiB, testIterations, testParallel
-	}
 	salt := make([]byte, defaultSaltLen)
 	if _, err := rand.Read(salt); err != nil {
 		return "", fmt.Errorf("auth: read salt: %w", err)
 	}
 	hash := argon2.IDKey([]byte(password), salt,
-		iter, mem, par, defaultKeyLen)
+		defaultIterations, defaultMemoryKiB, defaultParallel, defaultKeyLen)

 	return fmt.Sprintf("$argon2id$v=%d$m=%d,t=%d,p=%d$%s$%s",
 		argon2.Version,
-		mem, iter, par,
+		defaultMemoryKiB, defaultIterations, defaultParallel,
 		base64.RawStdEncoding.EncodeToString(salt),
 		base64.RawStdEncoding.EncodeToString(hash),
 	), nil
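Reviewer note: the encoded string carries its own parameters, which is what lets a verifier check a hash without knowing which constants produced it. A rough sketch of reading those fields back using only the standard library; `parseEncoded` and `argonParams` are hypothetical names, and the parameter values in `main` are placeholders rather than this package's defaults.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
)

// argonParams holds the fields recovered from an encoded argon2id string.
type argonParams struct {
	Version    int
	MemoryKiB  uint32
	Iterations uint32
	Parallel   uint8
	Salt       []byte
	Hash       []byte
}

// parseEncoded splits "$argon2id$v=..$m=..,t=..,p=..$<salt>$<hash>".
// Hypothetical helper; the repository's VerifyPassword is not shown here.
func parseEncoded(encoded string) (argonParams, error) {
	var p argonParams
	// The leading "$" yields an empty first element, so six parts in total.
	parts := strings.Split(encoded, "$")
	if len(parts) != 6 || parts[0] != "" || parts[1] != "argon2id" {
		return p, errors.New("not an argon2id encoded hash")
	}
	if _, err := fmt.Sscanf(parts[2], "v=%d", &p.Version); err != nil {
		return p, fmt.Errorf("parse version: %w", err)
	}
	if _, err := fmt.Sscanf(parts[3], "m=%d,t=%d,p=%d",
		&p.MemoryKiB, &p.Iterations, &p.Parallel); err != nil {
		return p, fmt.Errorf("parse params: %w", err)
	}
	var err error
	if p.Salt, err = base64.RawStdEncoding.DecodeString(parts[4]); err != nil {
		return p, fmt.Errorf("decode salt: %w", err)
	}
	if p.Hash, err = base64.RawStdEncoding.DecodeString(parts[5]); err != nil {
		return p, fmt.Errorf("decode hash: %w", err)
	}
	return p, nil
}

func main() {
	// Placeholder salt/hash bytes; a real hash comes from argon2.IDKey.
	p, err := parseEncoded("$argon2id$v=19$m=65536,t=3,p=2$c2FsdHNhbHQ$aGFzaA")
	if err != nil {
		fmt.Println("parse:", err)
		return
	}
	fmt.Println(p.MemoryKiB, p.Iterations, p.Parallel, len(p.Salt))
}
```

A verifier would then re-derive the key with the parsed parameters and compare it to `Hash` in constant time (for example with `crypto/subtle`).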
@@ -58,34 +58,14 @@ func (c *NtfyChannel) Send(ctx context.Context, p Payload) (int, time.Duration,
 	server := strings.TrimRight(c.cfg.ServerURL, "/")
 	url := server + "/" + c.cfg.Topic

-	// Body carries the event verb so the body alone is unambiguous when
-	// it shows up on a phone lockscreen without the title.
-	body := p.Message
-	switch p.Event {
-	case EventResolved:
-		body = "Resolved · " + p.Message
-	case EventAcknowledged:
-		body = "Acknowledged · " + p.Message
-	}
-	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewBufferString(body))
+	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewBufferString(p.Message))
 	if err != nil {
 		return 0, 0, fmt.Errorf("ntfy: build request: %w", err)
 	}

-	// Title prefix tracks the event so raise vs ack vs resolve are
-	// visually distinct in the ntfy notification list.
-	verb := "raised"
-	switch p.Event {
-	case EventAcknowledged:
-		verb = "ack"
-	case EventResolved:
-		verb = "resolved"
-	case EventTest:
-		verb = "test"
-	}
 	req.Header.Set("Content-Type", "text/plain")
-	req.Header.Set("Title", fmt.Sprintf("[%s · %s] %s %s", verb, p.Severity, p.HostName, p.Kind))
-	req.Header.Set("Tags", verb+","+p.Severity+","+p.Kind)
+	req.Header.Set("Title", fmt.Sprintf("[%s] %s %s", p.Severity, p.HostName, p.Kind))
+	req.Header.Set("Tags", p.Severity+","+p.Kind)
 	req.Header.Set("Priority", priorityForSeverity(p.Severity, c.defaultPriority))
 	if p.Link != "" {
 		req.Header.Set("Click", p.Link)
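Reviewer note: with the event verb gone, the Title and Tags headers depend only on severity, host name and kind. A small sketch of that header construction in isolation, assuming a trimmed-down `payload` type; `ntfyHeaders` is a hypothetical helper, and Priority is left out because `priorityForSeverity` is not shown in this hunk.

```go
package main

import (
	"fmt"
	"net/http"
)

// payload is a trimmed stand-in for the notification Payload type.
type payload struct {
	Severity string
	HostName string
	Kind     string
	Link     string
}

// ntfyHeaders builds the headers set in the hunk above, minus Priority.
// Hypothetical helper for illustration only.
func ntfyHeaders(p payload) http.Header {
	h := http.Header{}
	h.Set("Content-Type", "text/plain")
	h.Set("Title", fmt.Sprintf("[%s] %s %s", p.Severity, p.HostName, p.Kind))
	h.Set("Tags", p.Severity+","+p.Kind)
	if p.Link != "" {
		// Click is only sent when the payload carries a link.
		h.Set("Click", p.Link)
	}
	return h
}

func main() {
	h := ntfyHeaders(payload{
		Severity: "critical",
		HostName: "alfa-01",
		Kind:     "check_failed",
		Link:     "https://rm.example/a",
	})
	fmt.Println(h.Get("Title"))
	fmt.Println(h.Get("Tags"))
}
```

The values in `main` match the fixture used by `TestNtfySendsHeadersAndBody`, so the two printed lines correspond to the strings that test now expects.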
@@ -60,13 +60,13 @@ func TestNtfySendsHeadersAndBody(t *testing.T) {
 		t.Fatalf("want 200, got %d", code)
 	}

-	if want := "[raised · critical] alfa-01 check_failed"; gotTitle != want {
+	if want := "[critical] alfa-01 check_failed"; gotTitle != want {
 		t.Errorf("Title: got %q want %q", gotTitle, want)
 	}
 	if gotPri != "5" {
 		t.Errorf("Priority: got %q want \"5\"", gotPri)
 	}
-	if want := "raised,critical,check_failed"; gotTags != want {
+	if want := "critical,check_failed"; gotTags != want {
 		t.Errorf("Tags: got %q want %q", gotTags, want)
 	}
 	if gotClick != "https://rm.example/a" {
@@ -117,20 +117,9 @@ func extractAddr(s string) string {
 // Plain text only; subject hardcoded.
 func buildEmailBody(cfg SMTPConfig, msgIDDomain string, p Payload) []byte {
 	var b strings.Builder
-	// Subject prefix tracks the event verb so raise vs ack vs resolve
-	// are visually distinct in the inbox (and threaded by Message-ID).
-	verb := "raised"
-	switch p.Event {
-	case EventAcknowledged:
-		verb = "ack"
-	case EventResolved:
-		verb = "resolved"
-	case EventTest:
-		verb = "test"
-	}
 	b.WriteString("From: " + cfg.From + "\r\n")
 	b.WriteString("To: " + cfg.To + "\r\n")
-	b.WriteString(fmt.Sprintf("Subject: [restic-manager] [%s · %s] %s: %s\r\n", verb, p.Severity, p.HostName, p.Kind))
+	b.WriteString(fmt.Sprintf("Subject: [restic-manager] [%s] %s: %s\r\n", p.Severity, p.HostName, p.Kind))
 	b.WriteString("Date: " + p.RaisedAt.UTC().Format(time.RFC1123Z) + "\r\n")
 	b.WriteString("Message-ID: <" + p.AlertID + "@" + msgIDDomain + ">\r\n")
 	b.WriteString("MIME-Version: 1.0\r\n")
@@ -133,7 +133,7 @@ func TestSMTPSendsExpectedHeaders(t *testing.T) {
 	if !strings.Contains(srv.rcptTo, "ops@example.com") {
 		t.Errorf("RCPT TO: %q", srv.rcptTo)
 	}
-	if !strings.Contains(srv.data, "Subject: [restic-manager] [raised · warning] alfa-01: backup_failed") {
+	if !strings.Contains(srv.data, "Subject: [restic-manager] [warning] alfa-01: backup_failed") {
 		t.Errorf("subject missing or wrong: %q", srv.data)
 	}
 	if !strings.Contains(srv.data, "Message-ID: <01ABC@rm.example>") {
@@ -87,13 +87,13 @@ func (e Env) RunRestore(ctx context.Context, snapshotID string, paths []string,
 		}
 	}
 	args = append(args, "--target", target)
-	// --no-ownership is nominally a restic 0.17+ flag, but at least
-	// one downstream 0.18.1 build still rejects it. We rely on a
-	// runtime probe captured at agent startup (see
-	// SupportsRestoreNoOwnership) rather than version sniffing.
-	// In-place restores always preserve ownership — that's the whole
-	// point of in-place — so we only add the flag for new-dir mode.
-	if !inPlace && e.SupportsRestoreNoOwnership {
+	// --no-ownership was added in restic 0.17. Older versions reject
+	// the flag with "unknown flag: --no-ownership". For new-dir
+	// restores we want the files owned by the agent user (operator
+	// can cp them without juggling chown), so pass the flag iff the
+	// running restic supports it. In-place restores always preserve
+	// ownership — that's the whole point of in-place.
+	if !inPlace && e.AtLeastVersion(0, 17) {
 		args = append(args, "--no-ownership")
 	}
 	for _, p := range paths {
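Reviewer note: the gate now rests on a version comparison, and the body of `AtLeastVersion` is not part of this hunk. A minimal sketch of what such a check could look like, assuming a plain `major.minor.patch` version string; `atLeast` and `restoreArgs` are hypothetical names, and the argument list is simplified. The removed comments record one downstream 0.18.1 build that rejected the flag, so a version check is an approximation rather than a guarantee.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// atLeast reports whether version (e.g. "0.17.3") is >= major.minor.
// Anything unparseable returns false, the conservative answer.
func atLeast(version string, major, minor int) bool {
	parts := strings.SplitN(strings.TrimPrefix(version, "v"), ".", 3)
	if len(parts) < 2 {
		return false
	}
	gotMajor, err := strconv.Atoi(parts[0])
	if err != nil {
		return false
	}
	gotMinor, err := strconv.Atoi(parts[1])
	if err != nil {
		return false
	}
	if gotMajor != major {
		return gotMajor > major
	}
	return gotMinor >= minor
}

// restoreArgs is a simplified stand-in for the argument assembly above:
// --no-ownership is added only for new-directory restores on 0.17+.
func restoreArgs(version, target string, inPlace bool) []string {
	args := []string{"restore", "--target", target}
	if !inPlace && atLeast(version, 0, 17) {
		args = append(args, "--no-ownership")
	}
	return args
}

func main() {
	fmt.Println(restoreArgs("0.17.3", "/tmp/restore", false))
	fmt.Println(restoreArgs("0.16.4", "/tmp/restore", false))
	fmt.Println(restoreArgs("0.17.3", "/srv/data", true))
}
```

Only the first call should include `--no-ownership`; the second is too old and the third is an in-place restore, which always preserves ownership.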
@@ -15,26 +15,6 @@ import (
 	"time"
 )

-// SupportsRestoreNoOwnership probes the running restic for the
-// `--no-ownership` flag on the `restore` subcommand. Some restic
-// builds (≥ 0.17 in theory; observed missing on a downstream 0.18.1)
-// do not expose it, so we ask the binary directly rather than
-// inferring from the version string. Empty `bin` or any failure to
-// run the help command returns false — the caller stays on the
-// conservative path of not adding the flag.
-func SupportsRestoreNoOwnership(ctx context.Context, bin string) bool {
-	if bin == "" {
-		return false
-	}
-	probeCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
-	defer cancel()
-	out, err := exec.CommandContext(probeCtx, bin, "restore", "--help").CombinedOutput()
-	if err != nil {
-		return false
-	}
-	return strings.Contains(string(out), "--no-ownership")
-}
-
 // Locate resolves the path to the restic binary. Honour an explicit
 // override if provided, else fall back to PATH.
 func Locate(override string) (string, error) {
@@ -69,15 +49,6 @@ type Env struct {
 	ExtraEnv     map[string]string // any other RESTIC_* / passthrough
 	WorkDir      string            // CWD; default = current

-	// SupportsRestoreNoOwnership records whether the running restic's
-	// `restore --help` advertises the --no-ownership flag. The flag was
-	// added in 0.17, but at least one downstream build of 0.18.1 still
-	// rejects it ("unknown flag: --no-ownership") — version sniffing
-	// proved unreliable, so the agent now probes for the actual flag at
-	// startup (see internal/restic.SupportsRestoreNoOwnership) and
-	// passes the resulting boolean down here.
-	SupportsRestoreNoOwnership bool
-
 	// Bandwidth caps in KB/s. <=0 means "no cap" (omit the flag).
 	// Emitted as restic global flags --limit-upload / --limit-download
 	// before the subcommand on every invocation.
@@ -536,14 +507,12 @@ func pumpPlain(r io.Reader, stream string, handle LineHandler) error {
 // on one or the other for its cache dir; without it the command
 // fails before ever talking to the repo.
 //
-// Default to /var/lib/restic-manager. The unit no longer pins
-// ProtectHome=read-only (a backup tool needs to restore anywhere),
-// but the explicit HOME stays for two reasons: the parent's HOME
-// can be unset under unusual init shapes, and pinning the cache
-// under a known agent-owned dir keeps restic's metadata isolated
-// from the actual operator home dirs that the agent can now write
-// to. ExtraEnv overrides win for callers that want a different
-// cache location.
+// Default to /var/lib/restic-manager — that's in the systemd unit's
+// ReadWritePaths and survives ProtectHome=read-only. We do NOT fall
+// back to the parent's HOME env var: the agent runs as root with
+// HOME=/root, but ProtectHome makes /root read-only, so restic's
+// `mkdir /root/.cache/restic` fails. ExtraEnv overrides win for
+// callers that explicitly want a different cache location.
 func (e Env) envSlice() []string {
 	home := "/var/lib/restic-manager"
 	if h, ok := e.ExtraEnv["HOME"]; ok && h != "" {
@@ -30,35 +30,7 @@ type Config struct {
 	// Defaults to true. Set RM_COOKIE_SECURE=false only for local HTTP
 	// testing — production deployments are always behind a TLS proxy
 	// and the cookie must be Secure.
-	CookieSecure bool        `yaml:"cookie_secure"`
-	OIDCRaw      *OIDCConfig `yaml:"oidc"`
-	OIDC         *OIDCConfig `yaml:"-"`
-
-	// BundledAssetsDir is the read-only path inside the image that
-	// holds agent binaries (under agent-binaries/) and install
-	// scripts (under install/). The /agent/binary and /install/*
-	// handlers fall back here when the file is not present in
-	// DataDir. Source-build deployments can override via
-	// RM_BUNDLED_ASSETS_DIR.
-	BundledAssetsDir string `yaml:"bundled_assets_dir"`
-
-	// MetricsToken, if set, gates the /metrics scrape endpoint
-	// behind a `Authorization: Bearer <token>` check (constant-time
-	// compare). When neither this nor MetricsTrustedCIDRs is set,
-	// the route is not mounted at all (the endpoint is opt-in).
-	MetricsToken string `yaml:"metrics_token"`
-
-	// MetricsTrustedCIDRs, if non-empty, gates /metrics so only
-	// callers from these networks may scrape. ANDed with
-	// MetricsToken when both are set.
-	MetricsTrustedCIDRs []string `yaml:"metrics_trusted_cidrs"`
-}
-
-// MetricsAuthEnabled reports whether the operator has opted into
-// exposing the Prometheus scrape endpoint by configuring at least
-// one auth gate.
-func (c Config) MetricsAuthEnabled() bool {
-	return c.MetricsToken != "" || len(c.MetricsTrustedCIDRs) > 0
+	CookieSecure bool `yaml:"cookie_secure"`
 }

 // Load resolves config in this order:
@@ -70,10 +42,9 @@ func (c Config) MetricsAuthEnabled() bool {
 // safe to start.
 func Load(yamlPath string) (Config, error) {
 	c := Config{
-		Listen:           ":8080",
-		DataDir:          "/data",
-		CookieSecure:     true,
-		BundledAssetsDir: "/opt/restic-manager/dist",
+		Listen:       ":8080",
+		DataDir:      "/data",
+		CookieSecure: true,
 	}

 	if yamlPath != "" {
@@ -108,22 +79,6 @@ func Load(yamlPath string) (Config, error) {
 			c.CookieSecure = true
 		}
 	}
-	if v, ok := os.LookupEnv("RM_BUNDLED_ASSETS_DIR"); ok {
-		c.BundledAssetsDir = v
-	}
-	if v, ok := os.LookupEnv("RM_METRICS_TOKEN"); ok {
-		c.MetricsToken = v
-	}
-	if v, ok := os.LookupEnv("RM_METRICS_TRUSTED_CIDR"); ok {
-		parts := strings.Split(v, ",")
-		c.MetricsTrustedCIDRs = c.MetricsTrustedCIDRs[:0]
-		for _, p := range parts {
-			p = strings.TrimSpace(p)
-			if p != "" {
-				c.MetricsTrustedCIDRs = append(c.MetricsTrustedCIDRs, p)
-			}
-		}
-	}
 	if v, ok := os.LookupEnv("RM_TRUSTED_PROXY"); ok {
 		// Comma-separated CIDRs; allow whitespace for readability.
 		parts := strings.Split(v, ",")
@@ -136,16 +91,6 @@ func Load(yamlPath string) (Config, error) {
 		}
 	}

-	var rawOIDC OIDCConfig
-	if c.OIDCRaw != nil {
-		rawOIDC = *c.OIDCRaw
-	}
-	oidc, err := loadOIDC(envSnapshot(), rawOIDC)
-	if err != nil {
-		return c, err
-	}
-	c.OIDC = oidc
-
 	return c, c.validate()
 }

@@ -168,10 +113,5 @@ func (c *Config) validate() error {
 			return fmt.Errorf("config: RM_TRUSTED_PROXY entry %q is not a valid CIDR: %w", cidr, err)
 		}
 	}
-	for _, cidr := range c.MetricsTrustedCIDRs {
-		if _, err := netip.ParsePrefix(cidr); err != nil {
-			return fmt.Errorf("config: RM_METRICS_TRUSTED_CIDR entry %q is not a valid CIDR: %w", cidr, err)
-		}
-	}
 	return nil
 }
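Reviewer note: after this change `RM_TRUSTED_PROXY` is the only comma-separated CIDR list left in the loader. A short sketch of the split, trim and validate pattern it relies on, folded into one function for illustration; `parseCIDRList` is a hypothetical name, whereas the real code splits in `Load` and validates separately in `validate`.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// parseCIDRList splits a comma-separated list, tolerates surrounding
// whitespace, skips empty entries, and rejects anything that is not a
// valid CIDR prefix. Hypothetical helper for illustration only.
func parseCIDRList(v string) ([]netip.Prefix, error) {
	var out []netip.Prefix
	for _, part := range strings.Split(v, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		pfx, err := netip.ParsePrefix(part)
		if err != nil {
			return nil, fmt.Errorf("entry %q is not a valid CIDR: %w", part, err)
		}
		out = append(out, pfx)
	}
	return out, nil
}

func main() {
	prefixes, err := parseCIDRList("10.0.0.0/8, 192.168.1.0/24")
	if err != nil {
		fmt.Println("config:", err)
		return
	}
	for _, p := range prefixes {
		fmt.Println(p)
	}
}
```

Returning parsed `netip.Prefix` values instead of strings would let callers use `Prefix.Contains` directly, at the cost of changing the config struct's field type.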
@@ -98,45 +98,6 @@ func TestCookieSecureDefaultAndOverride(t *testing.T) {
 	}
 }

-func TestMetricsAuthGates(t *testing.T) {
-	t.Setenv("RM_LISTEN", ":8080")
-	t.Setenv("RM_DATA_DIR", "/tmp/x")
-
-	c, err := Load("")
-	if err != nil {
-		t.Fatalf("load: %v", err)
-	}
-	if c.MetricsAuthEnabled() {
-		t.Errorf("metrics endpoint should be off by default")
-	}
-
-	t.Setenv("RM_METRICS_TOKEN", "s3cr3t-token-with-enough-bytes")
-	t.Setenv("RM_METRICS_TRUSTED_CIDR", "10.0.0.0/8, 192.168.1.0/24")
-	c, err = Load("")
-	if err != nil {
-		t.Fatalf("load: %v", err)
-	}
-	if c.MetricsToken != "s3cr3t-token-with-enough-bytes" {
-		t.Errorf("token: %q", c.MetricsToken)
-	}
-	if got := c.MetricsTrustedCIDRs; len(got) != 2 || got[0] != "10.0.0.0/8" || got[1] != "192.168.1.0/24" {
-		t.Errorf("cidrs: %v", got)
-	}
-	if !c.MetricsAuthEnabled() {
-		t.Errorf("MetricsAuthEnabled should be true")
-	}
-}
-
-func TestMetricsTrustedCIDRRejectsGarbage(t *testing.T) {
-	t.Setenv("RM_LISTEN", ":8080")
-	t.Setenv("RM_DATA_DIR", "/tmp/x")
-	t.Setenv("RM_METRICS_TRUSTED_CIDR", "garbage")
-
-	if _, err := Load(""); err == nil {
-		t.Fatal("expected validation error, got nil")
-	}
-}
-
 func writeFile(path string, body []byte) error {
 	return writeFileImpl(path, body)
 }
@@ -1,103 +0,0 @@
-// internal/server/config/oidc.go — OIDC subsection of the server
-// config. Disabled when oidc.issuer is empty or absent.
-package config
-
-import (
-	"errors"
-	"fmt"
-	"os"
-)
-
-// OIDCConfig is the OIDC sub-block. The struct doubles as YAML schema;
-// loadOIDC applies env overlays on top and fills defaults.
-type OIDCConfig struct {
-	Issuer       string            `yaml:"issuer"`
-	ClientID     string            `yaml:"client_id"`
-	ClientSecret string            `yaml:"client_secret"`
-	DisplayName  string            `yaml:"display_name"`
-	Scopes       []string          `yaml:"scopes"`
-	RoleClaim    string            `yaml:"role_claim"`
-	RoleMapping  map[string]string `yaml:"role_mapping"`
-	RedirectURL  string            `yaml:"redirect_url"`
-}
-
-// loadOIDC merges YAML + env, applies defaults, validates. Returns
-// nil + nil when OIDC is disabled (issuer empty after merge); a
-// non-nil OIDCConfig means the caller should wire OIDC.
-//
-// Env vars (override YAML when set):
-//
-//	RM_OIDC_ISSUER, RM_OIDC_CLIENT_ID, RM_OIDC_CLIENT_SECRET,
-//	RM_OIDC_CLIENT_SECRET_FILE, RM_OIDC_DISPLAY_NAME,
-//	RM_OIDC_REDIRECT_URL.
-//
-// envs is passed in (rather than read with os.LookupEnv) so unit
-// tests can supply a fake env map.
-func loadOIDC(envs map[string]string, yaml OIDCConfig) (*OIDCConfig, error) {
-	c := yaml
-	if v, ok := envs["RM_OIDC_ISSUER"]; ok {
-		c.Issuer = v
-	}
-	if v, ok := envs["RM_OIDC_CLIENT_ID"]; ok {
-		c.ClientID = v
-	}
-	if v, ok := envs["RM_OIDC_CLIENT_SECRET"]; ok {
-		c.ClientSecret = v
-	}
-	if v, ok := envs["RM_OIDC_CLIENT_SECRET_FILE"]; ok && v != "" {
-		body, err := os.ReadFile(v)
-		if err != nil {
-			return nil, fmt.Errorf("config: oidc client_secret_file: %w", err)
-		}
-		c.ClientSecret = string(body)
-	}
-	if v, ok := envs["RM_OIDC_DISPLAY_NAME"]; ok {
-		c.DisplayName = v
-	}
-	if v, ok := envs["RM_OIDC_REDIRECT_URL"]; ok {
-		c.RedirectURL = v
-	}
-
-	if c.Issuer == "" {
-		return nil, nil
-	}
-
-	if c.ClientID == "" {
-		return nil, errors.New("config: oidc.client_id required when issuer is set")
-	}
-	if c.ClientSecret == "" {
-		return nil, errors.New("config: oidc.client_secret required when issuer is set")
-	}
-	if len(c.RoleMapping) == 0 {
-		return nil, errors.New("config: oidc.role_mapping must have at least one entry")
-	}
-
-	if c.DisplayName == "" {
-		c.DisplayName = "SSO"
-	}
-	if c.RoleClaim == "" {
-		c.RoleClaim = "groups"
-	}
-	if len(c.Scopes) == 0 {
-		c.Scopes = []string{"openid", "profile", "email", "groups"}
-	}
-	return &c, nil
-}
-
-// envSnapshot reads the OIDC env vars into a map. Lets the production
-// loadOIDC call site stay env-driven while tests pass an explicit
-// map.
-func envSnapshot() map[string]string {
-	keys := []string{
-		"RM_OIDC_ISSUER", "RM_OIDC_CLIENT_ID", "RM_OIDC_CLIENT_SECRET",
-		"RM_OIDC_CLIENT_SECRET_FILE", "RM_OIDC_DISPLAY_NAME",
-		"RM_OIDC_REDIRECT_URL",
-	}
-	out := make(map[string]string, len(keys))
-	for _, k := range keys {
-		if v, ok := os.LookupEnv(k); ok {
-			out[k] = v
-		}
-	}
-	return out
-}
@@ -1,72 +0,0 @@
-package config
-
-import "testing"
-
-func TestOIDCParseDisabledWhenIssuerEmpty(t *testing.T) {
-	t.Parallel()
-	c, err := loadOIDC(map[string]string{}, OIDCConfig{})
-	if err != nil {
-		t.Fatalf("load: %v", err)
-	}
-	if c != nil {
-		t.Errorf("expected nil OIDC config when issuer empty; got %+v", c)
-	}
-}
-
-func TestOIDCRejectMissingClientID(t *testing.T) {
-	t.Parallel()
-	yaml := OIDCConfig{Issuer: "https://x", ClientSecret: "s"}
-	if _, err := loadOIDC(map[string]string{}, yaml); err == nil {
-		t.Error("expected error for missing client_id")
-	}
-}
-
-func TestOIDCRejectMissingClientSecret(t *testing.T) {
-	t.Parallel()
-	yaml := OIDCConfig{Issuer: "https://x", ClientID: "rm"}
-	if _, err := loadOIDC(map[string]string{}, yaml); err == nil {
-		t.Error("expected error for missing client_secret")
-	}
-}
-
-func TestOIDCDefaultsApplied(t *testing.T) {
-	t.Parallel()
-	yaml := OIDCConfig{
-		Issuer: "https://x", ClientID: "rm", ClientSecret: "s",
-		RoleMapping: map[string]string{"a": "admin"},
-	}
-	c, err := loadOIDC(map[string]string{}, yaml)
-	if err != nil {
-		t.Fatalf("load: %v", err)
-	}
-	if c.RoleClaim != "groups" {
-		t.Errorf("role_claim default: got %q want groups", c.RoleClaim)
-	}
-	if c.DisplayName != "SSO" {
-		t.Errorf("display_name default: got %q want SSO", c.DisplayName)
-	}
-	wantScopes := []string{"openid", "profile", "email", "groups"}
-	if len(c.Scopes) != len(wantScopes) {
-		t.Errorf("scopes default: got %v want %v", c.Scopes, wantScopes)
-	}
-}
-
-func TestOIDCEnvOverrides(t *testing.T) {
-	t.Parallel()
-	yaml := OIDCConfig{
-		Issuer: "https://from-yaml", ClientID: "yaml-id", ClientSecret: "yaml-secret",
-		RoleMapping: map[string]string{"x": "admin"},
-	}
-	envs := map[string]string{
-		"RM_OIDC_ISSUER":        "https://from-env",
-		"RM_OIDC_CLIENT_ID":     "env-id",
-		"RM_OIDC_CLIENT_SECRET": "env-secret",
-	}
-	c, err := loadOIDC(envs, yaml)
-	if err != nil {
-		t.Fatalf("load: %v", err)
-	}
-	if c.Issuer != "https://from-env" || c.ClientID != "env-id" || c.ClientSecret != "env-secret" {
-		t.Errorf("env override: got %+v", c)
-	}
-}
@@ -1,221 +0,0 @@
-// Package fleetupdate drives a rolling, sequential agent self-update
-// over a list of hosts. One worker goroutine per Start() call (gated
-// at the store layer to at-most-one-running-fleet-update).
-package fleetupdate
-
-import (
-	"context"
-	"errors"
-	"fmt"
-	"log/slog"
-	"time"
-
-	"github.com/oklog/ulid/v2"
-
-	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
-)
-
-// Hub is the slim "is this host connected?" surface.
-type Hub interface {
-	Connected(hostID string) bool
-}
-
-// Dispatcher sends one command.update envelope. The implementer also
-// creates the jobs row, writes audit, and registers with the update
-// watcher. Pre-checks are the dispatcher's responsibility — the worker
-// passes through whatever error it returns.
-type Dispatcher interface {
-	DispatchUpdate(ctx context.Context, hostID string, actorUserID string) (jobID string, code string, err error)
-}
-
-// AlertRaiser is the slim view of the alert engine's host-less raise
-// path. Used to emit fleet_update_halted on first failure.
-type AlertRaiser interface {
-	RaiseFleetUpdateHalted(ctx context.Context, fleetUpdateID, reason string, when time.Time)
-}
-
-// Worker is the long-lived fleet-update orchestrator. There is at most
-// one *running* fleet update at a time (enforced by the store).
-type Worker struct {
-	store  *store.Store
-	hub    Hub
-	disp   Dispatcher
-	alerts AlertRaiser
-
-	// targetVersion is the version every dispatched agent is expected
-	// to come back with. Captured at Start time to avoid drift.
-	targetVersion string
-
-	// pollPeriod controls the cadence at which the worker re-reads the
-	// host row to check for the version transition. Exposed for tests.
-	pollPeriod time.Duration
-	// hostTimeout bounds how long the worker waits for one host to
-	// reach the target version before halting.
-	hostTimeout time.Duration
-}
-
-// NewWorker builds an unstarted worker. targetVersion is set on each
-// Start call; the values here are defaults.
-func NewWorker(st *store.Store, hub Hub, disp Dispatcher, alerts AlertRaiser) *Worker {
-	return &Worker{
-		store:       st,
-		hub:         hub,
-		disp:        disp,
-		alerts:      alerts,
-		pollPeriod:  1 * time.Second,
-		hostTimeout: 95 * time.Second,
-	}
-}
-
-// Start creates the parent + child rows, then spawns the per-host
-// worker goroutine. Returns the new fleet_update_id on success.
-// store.ErrFleetUpdateRunning bubbles up unchanged.
-func (w *Worker) Start(ctx context.Context, userID, targetVersion string, hostIDs []string) (string, error) {
-	if userID == "" || targetVersion == "" {
-		return "", errors.New("fleetupdate: userID and targetVersion required")
-	}
-	if len(hostIDs) == 0 {
-		return "", errors.New("fleetupdate: at least one host required")
-	}
-	fuID := ulid.Make().String()
-	now := time.Now().UTC()
-	if err := w.store.CreateFleetUpdate(ctx, store.FleetUpdate{
-		ID:              fuID,
-		StartedAt:       now,
-		StartedByUserID: userID,
-		TargetVersion:   targetVersion,
-		Status:          "running",
-	}, hostIDs); err != nil {
-		return "", err
-	}
-
-	// The goroutine outlives the request that started it; carry a
-	// detached context so an HTTP-handler ctx cancel doesn't abort
-	// the long roll.
-	bg := context.WithoutCancel(ctx)
-	go w.run(bg, fuID, userID, targetVersion)
-	return fuID, nil
-}
-
-// Cancel marks the fleet update cancelled. The running goroutine
-// observes the new status on its next pre-check and exits without
-// dispatching further hosts. The currently-dispatched job is left to
-// finish on its own — cancelling agent-side is out of scope for v1.
-func (w *Worker) Cancel(ctx context.Context, fuID string) error {
-	return w.store.CancelFleetUpdate(ctx, fuID, time.Now().UTC())
-}
-
-// run is the per-host loop. Halts on first failure; emits one alert
-// on transition.
-func (w *Worker) run(ctx context.Context, fuID, userID, targetVersion string) {
-	w.targetVersion = targetVersion
-
-	for {
-		// Check the parent row's status — picks up Cancel.
-		fu, err := w.store.ActiveFleetUpdate(ctx)
-		if err != nil {
-			slog.Warn("fleetupdate: read active", "fu_id", fuID, "err", err)
-			return
-		}
-		if fu == nil || fu.ID != fuID {
-			// Cancelled, halted, or completed externally. Done.
-			return
-		}
-
-		pending, err := w.store.ListPendingFleetUpdateHosts(ctx, fuID)
-		if err != nil {
-			slog.Warn("fleetupdate: list pending", "fu_id", fuID, "err", err)
-			return
-		}
-		if len(pending) == 0 {
-			now := time.Now().UTC()
-			if err := w.store.CompleteFleetUpdate(ctx, fuID, now); err != nil {
-				slog.Warn("fleetupdate: complete", "fu_id", fuID, "err", err)
-			}
-			return
-		}
-
-		next := pending[0]
-		w.processHost(ctx, fuID, userID, next)
-	}
-}
-
-// processHost handles one host slot. Marks it skipped, succeeded, or
-// failed (and halts the fleet on failure).
-func (w *Worker) processHost(ctx context.Context, fuID, userID string, slot store.FleetUpdateHost) {
-	hostID := slot.HostID
-	_ = w.store.SetFleetUpdateCurrentHost(ctx, fuID, hostID)
-
-	// Pre-flight: re-read the host. The dispatch path repeats most of
-	// these checks but doing them up-front lets us emit the right
-	// per-host status (skipped vs failed) without consuming a job row.
-	host, err := w.store.GetHost(ctx, hostID)
-	if err != nil || host == nil {
-		_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, hostID, "skipped", "host not found", "")
-		return
-	}
-	if host.AgentVersion != "" && host.AgentVersion == w.targetVersion {
-		_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, hostID, "skipped", "already at target version", "")
-		return
-	}
-	if !w.hub.Connected(hostID) {
-		reason := fmt.Sprintf("host went offline: %s", hostID)
-		_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, hostID, "failed", reason, "")
-		w.halt(ctx, fuID, reason)
-		return
-	}
-
-	// Dispatch.
-	_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, hostID, "running", "", "")
-	jobID, code, err := w.disp.DispatchUpdate(ctx, hostID, userID)
-	if err != nil || code != "" {
-		reason := dispatchErrorReason(code, err)
-		_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, hostID, "failed", reason, jobID)
-		w.halt(ctx, fuID, reason)
-		return
-	}
-
-	// Poll until the host's recorded agent_version matches target, or
-	// timeout.
-	deadline := time.Now().Add(w.hostTimeout)
-	for time.Now().Before(deadline) {
-		// Honour cancellation between polls.
-		fu, err := w.store.ActiveFleetUpdate(ctx)
-		if err == nil && (fu == nil || fu.ID != fuID) {
-			// Cancelled mid-host; leave the slot in 'running' for the
-			// admin to inspect. No further dispatches.
-			return
-		}
-		time.Sleep(w.pollPeriod)
-		h, err := w.store.GetHost(ctx, hostID)
-		if err == nil && h != nil && h.AgentVersion == w.targetVersion {
-			if err := w.store.SetFleetUpdateHostStatus(ctx, fuID, hostID, "succeeded", "", jobID); err != nil {
-				slog.Warn("fleetupdate: set succeeded", "fu_id", fuID, "host_id", hostID, "err", err)
-			}
-			return
-		}
-	}
-	reason := fmt.Sprintf("timeout waiting for %s to reach %s", hostID, w.targetVersion)
-	_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, hostID, "failed", reason, jobID)
-	w.halt(ctx, fuID, reason)
-}
-
-func (w *Worker) halt(ctx context.Context, fuID, reason string) {
-	now := time.Now().UTC()
-	if err := w.store.HaltFleetUpdate(ctx, fuID, reason, now); err != nil {
-		slog.Warn("fleetupdate: halt", "fu_id", fuID, "err", err)
-	}
-	if w.alerts != nil {
-		w.alerts.RaiseFleetUpdateHalted(ctx, fuID, reason, now)
-	}
-}
-
-func dispatchErrorReason(code string, err error) string {
-	if code != "" {
-		return "dispatch failed: " + code
-	}
-	if err != nil {
-		return err.Error()
-	}
-	return "dispatch failed"
-}
@@ -1,344 +0,0 @@
-package fleetupdate
-
-import (
-	"context"
-	"errors"
-	"path/filepath"
-	"sync"
-	"testing"
-	"time"
-
-	"github.com/oklog/ulid/v2"
-
-	"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
-	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
-)
-
-type fakeHub struct {
-	mu     sync.Mutex
-	online map[string]bool
-}
-
-func (f *fakeHub) Connected(hostID string) bool {
-	f.mu.Lock()
-	defer f.mu.Unlock()
-	return f.online[hostID]
-}
-
-type fakeDispatcher struct {
-	mu    sync.Mutex
-	calls []string // host IDs
-	// after dispatch, set the host's agent_version to this on the
-	// store so the worker observes the version transition.
-	st         *store.Store
-	target     string
-	delayMS    int
-	failOnHost map[string]string // host → error code
-}
-
-func (f *fakeDispatcher) DispatchUpdate(ctx context.Context, hostID, _ string) (string, string, error) {
-	f.mu.Lock()
-	f.calls = append(f.calls, hostID)
-	if code, ok := f.failOnHost[hostID]; ok {
-		f.mu.Unlock()
-		return "", code, nil
-	}
-	st := f.st
-	target := f.target
-	delay := f.delayMS
-	f.mu.Unlock()
-
-	jobID := ulid.Make().String()
-	if st != nil {
-		_ = st.CreateJob(context.Background(), store.Job{
-			ID: jobID, HostID: hostID, Kind: "update",
-			ActorKind: "user", CreatedAt: time.Now().UTC(),
-		})
-	}
-	if st != nil && target != "" {
-		go func() {
-			if delay > 0 {
-				time.Sleep(time.Duration(delay) * time.Millisecond)
-			}
-			_ = st.MarkHostHello(context.Background(), hostID, target, "0.17", api.CurrentProtocolVersion, time.Now().UTC())
-		}()
-	}
-	return jobID, "", nil
-}
-
-type recAlert struct {
-	mu      sync.Mutex
-	reasons []string
-}
-
-func (r *recAlert) RaiseFleetUpdateHalted(_ context.Context, _ string, reason string, _ time.Time) {
-	r.mu.Lock()
-	r.reasons = append(r.reasons, reason)
-	r.mu.Unlock()
-}
-
-func openStore(t *testing.T) *store.Store {
-	t.Helper()
-	dir := t.TempDir()
-	st, err := store.Open(context.Background(), filepath.Join(dir, "rm.db"))
-	if err != nil {
-		t.Fatalf("open: %v", err)
-	}
-	t.Cleanup(func() { _ = st.Close() })
-	return st
-}
-
-func mustCreateAdmin(t *testing.T, st *store.Store) string {
-	t.Helper()
-	uid := ulid.Make().String()
-	if err := st.CreateUser(context.Background(), store.User{
-		ID: uid, Username: "u-" + uid[:6],
-		PasswordHash: "x", Role: store.RoleAdmin, CreatedAt: time.Now().UTC(),
-	}); err != nil {
-		t.Fatalf("user: %v", err)
-	}
-	return uid
-}
-
-func mustCreateHost(t *testing.T, st *store.Store, name, version string) string {
-	t.Helper()
-	hostID := ulid.Make().String()
-	if err := st.CreateHost(context.Background(), store.Host{
-		ID: hostID, Name: name, OS: "linux", Arch: "amd64",
-		EnrolledAt: time.Now().UTC(),
-	}, "deadbeef-"+hostID, ""); err != nil {
-		t.Fatalf("host: %v", err)
-	}
-	if version != "" {
-		if err := st.MarkHostHello(context.Background(), hostID, version, "0.17", api.CurrentProtocolVersion, time.Now().UTC()); err != nil {
-			t.Fatalf("hello: %v", err)
-		}
-	}
-	return hostID
-}
-
-func waitForStatus(t *testing.T, st *store.Store, fuID, want string, timeout time.Duration) *store.FleetUpdate {
-	t.Helper()
-	deadline := time.Now().Add(timeout)
-	for time.Now().Before(deadline) {
-		fu, _, err := st.GetFleetUpdate(context.Background(), fuID)
-		if err == nil && fu != nil && fu.Status == want {
-			return fu
-		}
-		time.Sleep(20 * time.Millisecond)
-	}
-	t.Fatalf("status never reached %q", want)
-	return nil
-}
-
-func TestWorkerTwoHostsBothSucceed(t *testing.T) {
-	st := openStore(t)
-	uid := mustCreateAdmin(t, st)
-	h1 := mustCreateHost(t, st, "h1", "v0")
-	h2 := mustCreateHost(t, st, "h2", "v0")
-
-	hub := &fakeHub{online: map[string]bool{h1: true, h2: true}}
-	disp := &fakeDispatcher{st: st, target: "v2", delayMS: 30}
-	alerts := &recAlert{}
-	w := NewWorker(st, hub, disp, alerts)
-	w.pollPeriod = 20 * time.Millisecond
-	w.hostTimeout = 2 * time.Second
-
-	fuID, err := w.Start(context.Background(), uid, "v2", []string{h1, h2})
-	if err != nil {
-		t.Fatalf("start: %v", err)
-	}
-	waitForStatus(t, st, fuID, "completed", 5*time.Second)
-	_, hosts, _ := st.GetFleetUpdate(context.Background(), fuID)
-	for _, h := range hosts {
-		if h.Status != "succeeded" {
-			t.Errorf("host %s status %q want succeeded", h.HostID, h.Status)
-		}
-	}
-	if n := len(alerts.reasons); n != 0 {
-		t.Errorf("unexpected halt alert: %v", alerts.reasons)
-	}
-}
-
-func TestWorkerSecondHostTimesOutHalts(t *testing.T) {
-	st := openStore(t)
-	uid := mustCreateAdmin(t, st)
-	h1 := mustCreateHost(t, st, "h1", "v0")
-	h2 := mustCreateHost(t, st, "h2", "v0")
-	h3 := mustCreateHost(t, st, "h3", "v0")
-
-	hub := &fakeHub{online: map[string]bool{h1: true, h2: true, h3: true}}
-	// h1 dispatches normally (transitions to v2). h2 dispatch returns
-	// success but never transitions.
-	disp := &fakeDispatcher{st: st, target: "v2", delayMS: 20, failOnHost: map[string]string{
-		h2: "", // not a code-failure; simulate by clearing target on this disp run
-	}}
-	// Actually: drop h2 from the auto-transition by faking with a
-	// per-host store setter. Easiest: subclass via a wrapper.
-	_ = disp
-	customDisp := &perHostDispatcher{base: disp, st: st, target: "v2", noTransition: map[string]bool{h2: true}}
-
-	alerts := &recAlert{}
-	w := NewWorker(st, hub, customDisp, alerts)
-	w.pollPeriod = 20 * time.Millisecond
-	w.hostTimeout = 200 * time.Millisecond
-
-	fuID, err := w.Start(context.Background(), uid, "v2", []string{h1, h2, h3})
-	if err != nil {
-		t.Fatalf("start: %v", err)
-	}
-	waitForStatus(t, st, fuID, "halted", 3*time.Second)
-	_, hosts, _ := st.GetFleetUpdate(context.Background(), fuID)
-	gotStatus := map[string]string{}
-	for _, h := range hosts {
-		gotStatus[h.HostID] = h.Status
-	}
-	if gotStatus[h1] != "succeeded" {
-		t.Errorf("h1: %q", gotStatus[h1])
-	}
-	if gotStatus[h2] != "failed" {
-		t.Errorf("h2: %q", gotStatus[h2])
-	}
-	if gotStatus[h3] != "pending" {
-		t.Errorf("h3: %q", gotStatus[h3])
-	}
-	alerts.mu.Lock()
-	defer alerts.mu.Unlock()
-	if len(alerts.reasons) != 1 {
-		t.Errorf("alert reasons: %v", alerts.reasons)
-	}
-}
-
-// perHostDispatcher lets a test omit the auto-transition for selected
-// hosts so we can simulate timeout.
-type perHostDispatcher struct {
-	mu           sync.Mutex
-	base         *fakeDispatcher
-	st           *store.Store
-	target       string
-	noTransition map[string]bool
-}
-
-func (p *perHostDispatcher) DispatchUpdate(_ context.Context, hostID, _ string) (string, string, error) {
-	p.mu.Lock()
-	skip := p.noTransition[hostID]
-	p.mu.Unlock()
-	jobID := ulid.Make().String()
-	_ = p.st.CreateJob(context.Background(), store.Job{
-		ID: jobID, HostID: hostID, Kind: "update",
-		ActorKind: "user", CreatedAt: time.Now().UTC(),
-	})
-	if !skip {
-		go func() {
-			time.Sleep(20 * time.Millisecond)
-			_ = p.st.MarkHostHello(context.Background(), hostID, p.target, "0.17", api.CurrentProtocolVersion, time.Now().UTC())
-		}()
-	}
-	return jobID, "", nil
-}
-
-func TestWorkerHostOfflineHalts(t *testing.T) {
-	st := openStore(t)
-	uid := mustCreateAdmin(t, st)
-	h1 := mustCreateHost(t, st, "h1", "v0")
-	h2 := mustCreateHost(t, st, "h2", "v0")
-	hub := &fakeHub{online: map[string]bool{h1: false, h2: true}}
-	disp := &fakeDispatcher{st: st, target: "v2"}
-	alerts := &recAlert{}
-	w := NewWorker(st, hub, disp, alerts)
-	w.pollPeriod = 20 * time.Millisecond
-	w.hostTimeout = 500 * time.Millisecond
-
-	fuID, err := w.Start(context.Background(), uid, "v2", []string{h1, h2})
-	if err != nil {
-		t.Fatalf("start: %v", err)
-	}
-	waitForStatus(t, st, fuID, "halted", 2*time.Second)
-	_, hosts, _ := st.GetFleetUpdate(context.Background(), fuID)
-	if hosts[0].Status != "failed" {
-		t.Errorf("h1 status: %q", hosts[0].Status)
-	}
-	if hosts[1].Status != "pending" {
-		t.Errorf("h2 status: %q", hosts[1].Status)
-	}
-}
-
-func TestWorkerAlreadyAtTargetSkipped(t *testing.T) {
-	st := openStore(t)
-	uid := mustCreateAdmin(t, st)
-	h1 := mustCreateHost(t, st, "h1", "v2")
-	h2 := mustCreateHost(t, st, "h2", "v0")
-	hub := &fakeHub{online: map[string]bool{h1: true, h2: true}}
-	disp := &fakeDispatcher{st: st, target: "v2", delayMS: 20}
-	alerts := &recAlert{}
-	w := NewWorker(st, hub, disp, alerts)
-	w.pollPeriod = 20 * time.Millisecond
-	w.hostTimeout = 2 * time.Second
-
-	fuID, err := w.Start(context.Background(), uid, "v2", []string{h1, h2})
-	if err != nil {
-		t.Fatalf("start: %v", err)
-	}
-	waitForStatus(t, st, fuID, "completed", 4*time.Second)
-	_, hosts, _ := st.GetFleetUpdate(context.Background(), fuID)
-	want := map[string]string{h1: "skipped", h2: "succeeded"}
-	for _, h := range hosts {
-		if h.Status != want[h.HostID] {
-			t.Errorf("host %s: got %q want %q", h.HostID, h.Status, want[h.HostID])
-		}
-	}
-}
-
-func TestWorkerCancelMidRun(t *testing.T) {
-	st := openStore(t)
-	uid := mustCreateAdmin(t, st)
-	h1 := mustCreateHost(t, st, "h1", "v0")
-	h2 := mustCreateHost(t, st, "h2", "v0")
-	hub := &fakeHub{online: map[string]bool{h1: true, h2: true}}
-	// h1's transition is delayed long enough that we can cancel
-	// before it lands; h2 should never be touched.
-	disp := &fakeDispatcher{st: st, target: "v2", delayMS: 500}
-	alerts := &recAlert{}
-	w := NewWorker(st, hub, disp, alerts)
-	w.pollPeriod = 50 * time.Millisecond
-	w.hostTimeout = 5 * time.Second
-
-	fuID, err := w.Start(context.Background(), uid, "v2", []string{h1, h2})
-	if err != nil {
-		t.Fatalf("start: %v", err)
-	}
-	// Give the worker a moment to dispatch h1.
-	time.Sleep(100 * time.Millisecond)
-	if err := w.Cancel(context.Background(), fuID); err != nil {
-		t.Fatalf("cancel: %v", err)
-	}
-	waitForStatus(t, st, fuID, "cancelled", 2*time.Second)
-
-	// h2 should never be dispatched.
-	disp.mu.Lock()
-	defer disp.mu.Unlock()
-	for _, c := range disp.calls {
-		if c == h2 {
-			t.Errorf("h2 dispatched after cancel")
-		}
-	}
-}
-
-func TestWorkerStartWhileActiveErrors(t *testing.T) {
-	st := openStore(t)
-	uid := mustCreateAdmin(t, st)
-	h1 := mustCreateHost(t, st, "h1", "v0")
-	h2 := mustCreateHost(t, st, "h2", "v0")
-	hub := &fakeHub{online: map[string]bool{h1: true, h2: true}}
-	disp := &fakeDispatcher{st: st, target: "v2", delayMS: 5_000}
-	w := NewWorker(st, hub, disp, &recAlert{})
-	w.pollPeriod = 50 * time.Millisecond
-	w.hostTimeout = 2 * time.Second
-	if _, err := w.Start(context.Background(), uid, "v2", []string{h1}); err != nil {
-		t.Fatalf("first start: %v", err)
-	}
-	_, err := w.Start(context.Background(), uid, "v2", []string{h2})
-	if !errors.Is(err, store.ErrFleetUpdateRunning) {
-		t.Fatalf("err: %v want ErrFleetUpdateRunning", err)
-	}
-}
@@ -11,23 +11,19 @@ import (
 )

 // agent_assets.go serves the agent binary (one per OS/arch) and the
-// install scripts. Lookup is dual-path:
-//
-//  1. <DataDir>/agent-binaries/<name>  (or <DataDir>/install/<name>) —
-//     operator-managed override; lets the operator hot-patch a
-//     pre-release agent without rebuilding the server image.
-//  2. <BundledAssetsDir>/agent-binaries/<name> — read-only, baked
-//     into the server image at build time (P5-03). This is what
-//     makes a fresh container Just Work without first-run staging.
+// install scripts. The binaries live under <DataDir>/agent-binaries/,
+// laid down by the release pipeline (or copied by hand for now).
+// The install scripts live in <DataDir>/install/ alongside the
+// systemd unit.
 //
 // Both endpoints are intentionally unauthenticated: the install
 // payload is unprivileged on its own — it's the one-time enrollment
 // token that grants access. Anyone can pull the binary; only
 // someone with a valid token can use it productively.
 //
-// P1-31: signed-binary verification is deferred. The image is the
-// unit of trust; pull-by-digest is the verification primitive.
-// Future work bumps standalone-binary delivery to minisign/cosign.
+// P1-31: signed-binary verification is deferred. Today we serve
+// whatever the operator dropped on disk. Future work bumps this to
+// minisign/cosign signed bundles.

 // handleAgentBinary serves the agent binary for the requested os/arch pair.
 func (s *Server) handleAgentBinary(w stdhttp.ResponseWriter, r *stdhttp.Request) {
@@ -49,8 +45,8 @@ func (s *Server) handleAgentBinary(w stdhttp.ResponseWriter, r *stdhttp.Request)
 		ext = ".exe"
 	}
 	name := fmt.Sprintf("restic-manager-agent-%s-%s%s", osTag, archTag, ext)
-	path, ok := s.resolveBundledAsset("agent-binaries", name)
-	if !ok {
+	path := filepath.Join(s.deps.Cfg.DataDir, "agent-binaries", name)
+	if _, err := os.Stat(path); err != nil {
 		writeJSONError(w, stdhttp.StatusNotFound, "binary_not_published",
 			fmt.Sprintf("agent binary for %s/%s not published on this server", osTag, archTag))
 		return
@@ -68,34 +64,14 @@ func (s *Server) handleInstallAsset(w stdhttp.ResponseWriter, r *stdhttp.Request
 		writeJSONError(w, stdhttp.StatusBadRequest, "bad_path", "")
 		return
 	}
-	path, ok := s.resolveBundledAsset("install", rel)
-	if !ok {
+	path := filepath.Join(s.deps.Cfg.DataDir, "install", rel)
+	if _, err := os.Stat(path); err != nil {
 		writeJSONError(w, stdhttp.StatusNotFound, "not_found", "")
 		return
 	}
 	stdhttp.ServeFile(w, r, path)
 }

-// resolveBundledAsset looks up an asset by (subdir, name). DataDir
-// wins so an operator can override the image-baked copy by dropping
-// a file into <DataDir>/<subdir>/<name>. If neither path resolves,
-// returns ("", false).
-func (s *Server) resolveBundledAsset(subdir, name string) (string, bool) {
-	candidates := []string{
-		filepath.Join(s.deps.Cfg.DataDir, subdir, name),
-	}
-	if s.deps.Cfg.BundledAssetsDir != "" {
-		candidates = append(candidates,
-			filepath.Join(s.deps.Cfg.BundledAssetsDir, subdir, name))
-	}
-	for _, p := range candidates {
-		if _, err := os.Stat(p); err == nil {
-			return p, true
-		}
-	}
-	return "", false
-}
-
 func validOS(s string) bool {
 	switch api.HostOS(s) {
 	case api.OSLinux, api.OSWindows:
@@ -1,167 +0,0 @@
-package http
-
-import (
-	"context"
-	"io"
-	stdhttp "net/http"
-	"net/http/httptest"
-	"os"
-	"path/filepath"
-	"testing"
-
-	"gitea.dcglab.co.uk/steve/restic-manager/internal/crypto"
-	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/config"
-	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ws"
-	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
-)
-
-// newAssetsTestServer is a minimal scaffold for the /agent/binary and
-// /install/* handlers. Two roots: one acts as DataDir, the other as
-// the image-baked BundledAssetsDir. Either or both may be empty.
-func newAssetsTestServer(t *testing.T, populate func(dataDir, bundleDir string)) string {
-	t.Helper()
-	root := t.TempDir()
-	dataDir := filepath.Join(root, "data")
-	bundleDir := filepath.Join(root, "dist")
-	for _, d := range []string{
-		filepath.Join(dataDir, "agent-binaries"),
-		filepath.Join(dataDir, "install"),
-		filepath.Join(bundleDir, "agent-binaries"),
-		filepath.Join(bundleDir, "install"),
-	} {
-		if err := os.MkdirAll(d, 0o755); err != nil {
-			t.Fatalf("mkdir: %v", err)
-		}
-	}
-	if populate != nil {
-		populate(dataDir, bundleDir)
-	}
-
-	st, err := store.Open(context.Background(), filepath.Join(root, "rm.db"))
-	if err != nil {
-		t.Fatalf("store: %v", err)
-	}
-	t.Cleanup(func() { _ = st.Close() })
-
-	keyPath := filepath.Join(root, "secret.key")
-	_ = crypto.GenerateKeyFile(keyPath)
-	key, _ := crypto.LoadKeyFromFile(keyPath)
-	aead, _ := crypto.NewAEAD(key)
-
-	deps := Deps{
-		Cfg: config.Config{
-			Listen:           ":0",
-			DataDir:          dataDir,
-			SecretKeyFile:    keyPath,
-			BundledAssetsDir: bundleDir,
-		},
-		Store:          st,
-		AEAD:           aead,
-		Hub:            ws.NewHub(),
-		BootstrapToken: "test-token",
-	}
-	s := New(deps)
-	ts := httptest.NewServer(s.srv.Handler)
-	t.Cleanup(ts.Close)
-	return ts.URL
-}
-
-func writeFile(t *testing.T, path string, body []byte) {
-	t.Helper()
-	if err := os.WriteFile(path, body, 0o644); err != nil {
-		t.Fatalf("write %s: %v", path, err)
-	}
-}
-
-func get(t *testing.T, url string) (int, []byte) {
-	t.Helper()
-	res, err := stdhttp.Get(url)
-	if err != nil {
-		t.Fatalf("GET %s: %v", url, err)
-	}
-	defer res.Body.Close()
-	body, _ := io.ReadAll(res.Body)
-	return res.StatusCode, body
-}
-
-func TestAgentBinary_DataDirHit(t *testing.T) {
-	t.Parallel()
-	url := newAssetsTestServer(t, func(dataDir, _ string) {
-		writeFile(t, filepath.Join(dataDir, "agent-binaries", "restic-manager-agent-linux-amd64"),
-			[]byte("from-datadir"))
-	})
-	code, body := get(t, url+"/agent/binary?os=linux&arch=amd64")
-	if code != 200 || string(body) != "from-datadir" {
-		t.Fatalf("got %d %q", code, string(body))
-	}
-}
-
-func TestAgentBinary_BundleFallback(t *testing.T) {
-	t.Parallel()
-	url := newAssetsTestServer(t, func(_, bundleDir string) {
-		writeFile(t, filepath.Join(bundleDir, "agent-binaries", "restic-manager-agent-linux-amd64"),
-			[]byte("from-bundle"))
-	})
-	code, body := get(t, url+"/agent/binary?os=linux&arch=amd64")
-	if code != 200 || string(body) != "from-bundle" {
-		t.Fatalf("got %d %q", code, string(body))
-	}
-}
-
-func TestAgentBinary_DataDirShadowsBundle(t *testing.T) {
-	t.Parallel()
-	url := newAssetsTestServer(t, func(dataDir, bundleDir string) {
-		writeFile(t, filepath.Join(dataDir, "agent-binaries", "restic-manager-agent-linux-amd64"),
-			[]byte("from-datadir"))
-		writeFile(t, filepath.Join(bundleDir, "agent-binaries", "restic-manager-agent-linux-amd64"),
-			[]byte("from-bundle"))
-	})
-	code, body := get(t, url+"/agent/binary?os=linux&arch=amd64")
-	if code != 200 || string(body) != "from-datadir" {
-		t.Fatalf("operator override should win: got %d %q", code, string(body))
-	}
-}
-
-func TestAgentBinary_BothMiss(t *testing.T) {
-	t.Parallel()
-	url := newAssetsTestServer(t, nil)
-	code, _ := get(t, url+"/agent/binary?os=linux&arch=amd64")
-	if code != 404 {
-		t.Fatalf("expected 404, got %d", code)
-	}
-}
-
-func TestAgentBinary_WindowsNameHasExe(t *testing.T) {
-	t.Parallel()
-	url := newAssetsTestServer(t, func(_, bundleDir string) {
-		writeFile(t, filepath.Join(bundleDir, "agent-binaries", "restic-manager-agent-windows-amd64.exe"),
-			[]byte("win"))
-	})
-	code, body := get(t, url+"/agent/binary?os=windows&arch=amd64")
-	if code != 200 || string(body) != "win" {
-		t.Fatalf("got %d %q", code, string(body))
-	}
-}
-
-func TestInstallAsset_BundleFallback(t *testing.T) {
-	t.Parallel()
-	url := newAssetsTestServer(t, func(_, bundleDir string) {
-		writeFile(t, filepath.Join(bundleDir, "install", "install.sh"), []byte("#!/bin/sh\n"))
-	})
-	code, body := get(t, url+"/install/install.sh")
-	if code != 200 || string(body) != "#!/bin/sh\n" {
-		t.Fatalf("got %d %q", code, string(body))
-	}
-}
-
-func TestInstallAsset_PathTraversalRejected(t *testing.T) {
-	t.Parallel()
-	url := newAssetsTestServer(t, nil)
-	// chi will normalise some traversal attempts, but the handler
-	// also rejects any rel containing a slash or backslash. The
-	// path component of the URL after /install/ is the rel.
-	code, _ := get(t, url+"/install/..%2fpasswd")
-	if code == 200 {
-		t.Fatalf("traversal should not return 200")
-	}
-}