Compare commits: v0.9.0..6c6b962e24 (91 commits)

| SHA1 | Author | Date | |
|---|---|---|---|
| 6c6b962e24 | |||
| e64075d5d7 | |||
| 0f5110f3d9 | |||
| 0fbacf9f98 | |||
| d8fd4110b0 | |||
| e17932d797 | |||
| 39030a3bbe | |||
| a30f824a3c | |||
| 239d55b65b | |||
| 74e5b75380 | |||
| 9371b7b777 | |||
| 10b2518323 | |||
| 6694dfdc3a | |||
| f88f2cc1f2 | |||
| 1a07fbb217 | |||
| 9e6524788f | |||
| 25c55e5e4d | |||
| e408de9610 | |||
| 5c4e0275d9 | |||
| 7aaafceab5 | |||
| 4c9641b6ed | |||
| ff65d39f25 | |||
| 9d16e3f7e3 | |||
| 261b83ec26 | |||
| 0c3a0844e4 | |||
| 2dae61f678 | |||
| 55cb8909c7 | |||
| 06748f5582 | |||
| a4d705db6b | |||
| c6f73f790d | |||
| 068f08d96d | |||
| 28ef9750d3 | |||
| f4db0b17e8 | |||
| 8afda7cd8c | |||
| 123e4f4915 | |||
| 7b035a8f09 | |||
| 7a813cacd3 | |||
| 1d36dcd668 | |||
| 755840d9ff | |||
| cc638f6456 | |||
| e046be98b2 | |||
| a9c47deb26 | |||
| 8a7706407d | |||
| 3101024d1a | |||
| 7f98524cfa | |||
| 41def51977 | |||
| b9439da467 | |||
| 5925d09e8b | |||
| cc6844605f | |||
| 4cd36d83e3 | |||
| 68276810ec | |||
| e8804922b5 | |||
| a9c6a060d4 | |||
| a8026608ae | |||
| 6c23bdbe63 | |||
| a087321570 | |||
| e8f7502a7f | |||
| af2cb292b8 | |||
| bb4ed3502d | |||
| ff8a5dbead | |||
| ccd14f7cee | |||
| 07bce16c84 | |||
| a28bda2031 | |||
| 51192c3603 | |||
| 06fd440dd4 | |||
| 28c8b58f93 | |||
| 6ef58a707e | |||
| 001575ae9c | |||
| 28cc55711d | |||
| 98cc490ea8 | |||
| be4ac02ddd | |||
| 6e8a1c5b45 | |||
| e7d25cd704 | |||
| db88c5a7d1 | |||
| bb2a88be24 | |||
| b9c7ec6ebf | |||
| da518de3e6 | |||
| 55453300b0 | |||
| 0a75b82c17 | |||
| b60c2c6f6b | |||
| 1909f71f90 | |||
| dddff10b99 | |||
| 39304b08d0 | |||
| 9bcd8bc5fe | |||
| e6cfb1cd9f | |||
| 9d5775fb47 | |||
| c37954aa3f | |||
| efed96f67a | |||
| f31f6edde7 | |||
| 516c50fa16 | |||
| a8256f5aff |
@@ -0,0 +1,32 @@
<!--
Thanks for the PR! A few quick checks before submitting:

* Did you open an issue first for non-trivial changes?
* `make lint test` is green locally?
* Commits are focused (one logical change per commit)?
* No `Co-Authored-By` trailers (repo policy)?
* No new dependencies without a one-line justification below?
-->

## Summary

<!-- One paragraph: what changed and why. -->

## Test plan

<!-- Bullet list of what you actually ran. Be specific.
- `make test` → green
- Manually exercised the new flow at /hosts/{id}/foo
- Smoke env: enrolled a fresh host, ran a backup end-to-end
-->

## Notes for the reviewer

<!-- Anything the reviewer needs to know that isn't obvious from the
diff: related issue, follow-up work that's intentionally not
in this PR, deferred concerns, design alternatives considered
and rejected. -->

## Linked issues

<!-- "Closes #123" / "Refs #456" / "Part of P5-06" -->
@@ -0,0 +1,52 @@
---
name: Bug report
about: Something isn't behaving the way the docs / code suggest it should
title: "[bug] "
labels: bug
---

## What happened

<!-- A clear description of the actual behaviour. Include the exact
UI surface, API endpoint, or CLI invocation involved. -->

## What you expected

<!-- What you thought would happen, and where that expectation came from
(docs page, command output, prior behaviour). -->

## Steps to reproduce

1.
2.
3.

## Environment

- restic-manager server version: <!-- `restic-manager-server --version` or footer of the UI -->
- Agent version (if relevant): <!-- `restic-manager-agent --version` -->
- restic version on affected host: <!-- `restic version` -->
- Host OS: <!-- e.g. "Ubuntu 22.04 amd64" or "Windows Server 2022" -->
- How was the server installed: <!-- docker compose / source build / other -->

## Logs / output

<details><summary>Server log (sanitised)</summary>

```
<!-- paste relevant lines; redact tokens, passwords, repo URLs -->
```

</details>

<details><summary>Agent log (sanitised)</summary>

```
```

</details>

## Anything else

<!-- Screenshots, related issues, recent changes you made before the
bug appeared, anything that might help. -->
@@ -0,0 +1,34 @@
---
name: Feature request
about: Suggest a new capability or change to existing behaviour
title: "[feature] "
labels: enhancement
---

## What you're trying to do

<!-- Describe the use case, not the proposed solution. Who is the
operator, what are they trying to accomplish, and what's
blocking them today? -->

## Why the current behaviour falls short

<!-- What does the system do today, and where does it stop short of
the use case above? -->

## Proposed direction (optional)

<!-- If you have a specific design in mind, describe it. Skip this
section if you'd rather leave it to the maintainer. -->

## Scope check

- [ ] I've read [`spec.md`](../spec.md) §2 (Goals & Non-Goals).
- [ ] This isn't already on the roadmap in [`tasks.md`](../tasks.md).
- [ ] This fits the project's "small fleet, one person operating"
  target rather than enterprise / multi-tenant / SaaS use cases.

## Anything else

<!-- Related restic features, prior art in similar tools, links to
discussions you've had elsewhere. -->
+50
-37
@@ -2,28 +2,34 @@
 #
 # Notes for anyone editing this file:
 #
+# Custom runner image
+# Every job runs inside `gitea.dcglab.co.uk/steve/ci-runner-go`
+# (recipe: https://gitea.dcglab.co.uk/steve/ci/src/branch/main/images/ci-runner-go).
+# That image already ships:
+# * Go on PATH at /usr/local/go/bin (so `actions/setup-go` is
+#   redundant and intentionally NOT used here — the action would
+#   otherwise re-download Go on every job)
+# * Node.js + npm (used by docs / e2e workflows)
+# * Docker CLI, Buildx, Compose v2 (used by docker-build steps)
+# When bumping the Go floor, push a new ci-runner-go image with
+# the matching Go version and bump the date pin in IMAGE below.
+#
 # Self-hosted runner expectations
-# The Gitea runners are provisioned out-of-band (the infra team owns
-# the script). Each runner host bind-mounts persistent volumes for
-# /root/go/pkg/mod (GOMODCACHE), /root/.cache/go-build (GOCACHE), and
-# /root/.cache/act (action clones) into every job container. As a
+# Each runner host bind-mounts persistent volumes for
+# /root/go/pkg/mod (GOMODCACHE), /root/.cache/go-build (GOCACHE),
+# and /root/.cache/act (action clones) into every job container —
+# regardless of which image the container is built from. As a
 # result:
-# * `cache: true` on actions/setup-go is intentionally OMITTED — the
-#   action would otherwise tar/untar GOMODCACHE+GOCACHE through the
-#   Gitea cache backend on every job, undoing the host-volume cache
-#   and adding ~10s of redundant zstd round-trip per job.
-# * Common GitHub actions (actions/checkout, actions/setup-go,
-#   actions/upload-artifact, golangci/golangci-lint-action) are
-#   pre-cloned into /root/.cache/act on the runner, so the per-job
-#   "git clone https://github.com/actions/..." step is a fetch, not
-#   a full clone.
+# * Common GitHub actions (actions/checkout, actions/upload-artifact,
+#   golangci/golangci-lint-action) are pre-cloned into
+#   /root/.cache/act on the runner, so the per-job
+#   "git clone https://github.com/actions/..." step is a fetch,
+#   not a full clone.
 # * golangci-lint is pre-installed at /usr/local/bin/golangci-lint
-#   on the runner (latest v2.x). The golangci-lint-action below
-#   still pins a specific version and re-downloads — that's fine
-#   (deterministic CI > marginal speed) but means the host-installed
-#   binary is currently unused. Drop the `version:` arg below to
-#   use the host-installed one if you want to trade determinism
-#   for speed.
+#   on the runner host BUT that's outside the job's filesystem
+#   view; the golangci-lint-action below pins a specific version
+#   and re-downloads — that's fine (deterministic CI > marginal
+#   speed).
 #
 # Build matrix
 # Linux amd64 + arm64 + Windows amd64. CGO_ENABLED=0 throughout —
@@ -32,10 +38,10 @@
 # binaries.
 #
 # Go version
-# The GO_VERSION env var anchors all three jobs. Floor is set by the
-# heaviest dep (modernc.org/sqlite v1.50+ requires Go 1.23+ today;
-# we run 1.25 so golangci-lint's Go-version compatibility check is
-# happy — see the version pin in the lint job).
+# Anchored by the ci-runner-go image (currently Go 1.25.7). Floor
+# is set by the heaviest dep (modernc.org/sqlite v1.50+ requires
+# Go 1.23+; we run 1.25 so golangci-lint's Go-version compatibility
+# check is happy — see the version pin in the lint job).
 #
 # upload-artifact
 # Pinned at v3 historically; v3 was deprecated upstream. v4 should
@@ -48,8 +54,12 @@ on:
   pull_request:
     branches: [main]

-env:
-  GO_VERSION: "1.25"
+# Force bash as the default shell. With `container:` set on every
+# job, Gitea Actions otherwise picks `sh -e` and our `set -euo
+# pipefail` fails on dash with "Illegal option -o pipefail".
+defaults:
+  run:
+    shell: bash

 jobs:
   test:
@@ -60,6 +70,11 @@ jobs:
     # one runner. The third shard ("rest") covers everything else.
     name: Test (${{ matrix.name }})
     runs-on: ubuntu-latest
+    container:
+      image: docker.dcglab.co.uk/ci-runner-go:2026-05-15
+      credentials:
+        username: ${{ secrets.ZOT_USERNAME }}
+        password: ${{ secrets.ZOT_PASSWORD }}
     strategy:
       fail-fast: false
       matrix:
@@ -73,10 +88,6 @@ jobs:
             packages: ""
     steps:
       - uses: actions/checkout@v4
-      - uses: actions/setup-go@v5
-        with:
-          go-version: ${{ env.GO_VERSION }}
-          # cache: true intentionally omitted — see header notes.
       - name: go vet
         run: go vet ./...
       - name: go test
@@ -98,12 +109,13 @@ jobs:
   lint:
     name: Lint
     runs-on: ubuntu-latest
+    container:
+      image: docker.dcglab.co.uk/ci-runner-go:2026-05-15
+      credentials:
+        username: ${{ secrets.ZOT_USERNAME }}
+        password: ${{ secrets.ZOT_PASSWORD }}
     steps:
       - uses: actions/checkout@v4
-      - uses: actions/setup-go@v5
-        with:
-          go-version: ${{ env.GO_VERSION }}
-          # cache: true intentionally omitted — see header notes.
       - uses: golangci/golangci-lint-action@v7
         with:
           # Must be built against the same Go release as go.mod targets,
@@ -117,6 +129,11 @@ jobs:
   build:
     name: Build (${{ matrix.goos }}/${{ matrix.goarch }})
     runs-on: ubuntu-latest
+    container:
+      image: docker.dcglab.co.uk/ci-runner-go:2026-05-15
+      credentials:
+        username: ${{ secrets.ZOT_USERNAME }}
+        password: ${{ secrets.ZOT_PASSWORD }}
     strategy:
       fail-fast: false
       matrix:
@@ -130,10 +147,6 @@ jobs:
             ext: ".exe"
     steps:
       - uses: actions/checkout@v4
-      - uses: actions/setup-go@v5
-        with:
-          go-version: ${{ env.GO_VERSION }}
-          # cache: true intentionally omitted — see header notes.
       - name: build server + agent
         env:
           GOOS: ${{ matrix.goos }}
@@ -0,0 +1,133 @@
# P5-06 — End-to-end test suite.
#
# Spec : docs/superpowers/specs/2026-05-07-p5-oss-readiness-design.md
# Stack: e2e/compose.e2e.yml (server + agent + rest-server + playwright)
# Tests: e2e/playwright/tests/*.spec.ts
#
# Triggered on every PR into main and on workflow_dispatch. Runs
# longer than the unit-test workflow (~3-4 minutes for a clean run);
# kept separate so a slow e2e doesn't block the fast lint/test loop.
#
# Networking note: every interaction with the server (health probe,
# Playwright) happens from a container on the compose `rmnet`
# network, addressing the server as `http://server:8080`. We can't
# rely on `127.0.0.1:8080` because Gitea's runner executes steps
# inside its own container, where compose's host port-publish is
# not visible.

name: e2e

on:
  pull_request:
    branches: [main]
  workflow_dispatch:

# Force bash as the default shell — see ci.yml header.
defaults:
  run:
    shell: bash

jobs:
  e2e:
    name: Playwright vs docker-compose
    runs-on: ubuntu-latest
    container: gitea.dcglab.co.uk/steve/ci-runner-go:2026-05-08
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v4

      - name: Build the e2e stack
        # --profile test pulls in the playwright service which is
        # otherwise gated. --pull refreshes base images so a bump
        # to the Dockerfile's FROM tag (e.g. mcr.microsoft.com/
        # playwright:vX.Y.Z-jammy) isn't masked by a stale runner
        # cache that still has the old tag's layers.
        run: docker compose --profile test -f e2e/compose.e2e.yml build --pull

      - name: Bring up the stack
        run: docker compose -f e2e/compose.e2e.yml up -d server rest-server source-fixture

      - name: Wait for server health
        run: |
          set -eu
          for i in $(seq 1 30); do
            if docker run --rm --network e2e_rmnet curlimages/curl:8.10.1 \
                -fsS http://server:8080/api/version >/dev/null 2>&1; then
              echo "server up"; exit 0
            fi
            sleep 2
          done
          echo "server didn't come up"; docker compose -f e2e/compose.e2e.yml logs server; exit 1

      - name: Capture bootstrap token from server logs
        id: bootstrap
        run: |
          set -eu
          for i in $(seq 1 15); do
            line=$(docker compose -f e2e/compose.e2e.yml logs server 2>&1 | grep -E 'bootstrap token' -A2 | grep -Eo '[a-zA-Z0-9_-]{40,}' | head -1 || true)
            if [ -n "$line" ]; then
              echo "RM_BOOTSTRAP_TOKEN=$line" >> "$GITHUB_ENV"
              echo "got bootstrap token (${#line} chars)"
              exit 0
            fi
            sleep 1
          done
          echo "bootstrap token not found in logs"
          docker compose -f e2e/compose.e2e.yml logs server
          exit 1

      - name: Start the agent
        run: docker compose -f e2e/compose.e2e.yml up -d agent

      - name: Run Playwright tests
        id: playwright
        env:
          RM_BOOTSTRAP_TOKEN: ${{ env.RM_BOOTSTRAP_TOKEN }}
        # --name pins a stable container ID so the next step can
        # docker cp out of it before tear-down. We deliberately
        # drop --rm so the container survives the test exit; the
        # tear-down step removes it.
        run: docker compose -f e2e/compose.e2e.yml run --name e2e-pw playwright

      - name: Extract Playwright report
        if: always() && steps.playwright.outcome != 'skipped'
        run: |
          mkdir -p e2e/playwright/playwright-report e2e/playwright/test-results
          docker cp e2e-pw:/work/playwright-report/. e2e/playwright/playwright-report/ || true
          docker cp e2e-pw:/work/test-results/. e2e/playwright/test-results/ || true

      - name: Show Playwright failure context (on failure)
        if: failure()
        run: |
          set +e
          shopt -s nullglob globstar
          for f in e2e/playwright/test-results/**/error-context.md; do
            echo "::group::$f"
            cat "$f"
            echo "::endgroup::"
          done
          echo "Failure attachments (download via the playwright-report artifact):"
          find e2e/playwright/test-results \( -name '*.png' -o -name '*.webm' -o -name 'trace.zip' \) -printf ' %p\n' | sort

      - name: Compose logs (on failure)
        if: failure()
        run: |
          docker compose -f e2e/compose.e2e.yml logs --tail=200 server
          docker compose -f e2e/compose.e2e.yml logs --tail=200 agent
          docker compose -f e2e/compose.e2e.yml logs --tail=200 rest-server

      - name: Upload Playwright report (on failure)
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: playwright-report
          path: |
            e2e/playwright/playwright-report
            e2e/playwright/test-results
          retention-days: 7

      - name: Tear down
        if: always()
        run: |
          docker rm -f e2e-pw 2>/dev/null || true
          docker compose -f e2e/compose.e2e.yml down -v
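
The two polling steps in the e2e workflow above (server health, bootstrap token) share one shape: retry a check a fixed number of times, sleep between attempts, and fail if it never succeeds. A minimal shell sketch of that pattern follows. The function names `wait_for` and `extract_token` are hypothetical and do not appear in the repository; the grep pipeline mirrors the one in the workflow step, and the real server log format is not shown in this diff, so any input fed to `extract_token` here is illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; neither function exists in the repository.

# wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt failed.
wait_for() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# extract_token: reads log text on stdin and prints the first run of 40 or
# more token characters found on a line mentioning "bootstrap token" or on
# one of the two lines after it. Prints nothing when no such run exists.
extract_token() {
  grep -E -A2 'bootstrap token' | grep -Eo '[a-zA-Z0-9_-]{40,}' | head -n 1
}
```

Under these assumptions the health probe could be written as `wait_for 30 2 docker run --rm --network e2e_rmnet curlimages/curl:8.10.1 -fsS http://server:8080/api/version`, which keeps the retry budget (30 attempts, 2 seconds apart) visible in one place.
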
@@ -12,18 +12,12 @@
 # plus install.sh / install.ps1 / the systemd unit baked in under
 # /opt/restic-manager/dist (the read-only fallback path the server
 # handlers use when <DataDir>/... is empty).
-# * Pushes to this Gitea instance's container registry under
-#   <gitea-host>/<owner>/restic-manager.
+# * Pushes to zot OCI registry (docker.dcglab.co.uk).
 #
 # Tag fan-out
 # * tag push: :vX.Y.Z, :X.Y, :X
 # * tag push and X >= 1: also :latest
 # * workflow_dispatch: only :snapshot-<shortsha>; nothing else moves.
-#
-# Why no goreleaser
-# The architecture already routes agent distribution through the
-# server's /agent/binary endpoint. The image is the only deliverable;
-# binary archives would just be a second source of truth.

 name: Release

@@ -34,25 +28,35 @@ on:
   workflow_dispatch:

 env:
-  REGISTRY: gitea.dcglab.co.uk
-  IMAGE_NAME: ${{ gitea.repository }}
+  REGISTRY: docker.dcglab.co.uk
+  IMAGE_NAME: restic-manager

+# Force bash as the default shell — see ci.yml header.
+defaults:
+  run:
+    shell: bash
+
 jobs:
   image:
     name: Build + push image
     runs-on: ubuntu-latest
+    container:
+      image: docker.dcglab.co.uk/ci-runner-go:2026-05-15
+      credentials:
+        username: ${{ secrets.ZOT_USERNAME }}
+        password: ${{ secrets.ZOT_PASSWORD }}
     steps:
       - uses: actions/checkout@v4

       - uses: docker/setup-qemu-action@v3
       - uses: docker/setup-buildx-action@v3

-      - name: Log in to Gitea registry
+      - name: Log in to zot registry
         uses: docker/login-action@v3
         with:
           registry: ${{ env.REGISTRY }}
-          username: ${{ gitea.actor }}
-          password: ${{ secrets.DEV_TOKEN }}
+          username: ${{ secrets.ZOT_USERNAME }}
+          password: ${{ secrets.ZOT_PASSWORD }}

       - name: Compute tags + version
         id: meta
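
The tag fan-out rules in the release workflow header can be read as a small function from (trigger, tag, short SHA) to a list of image tags. The sketch below illustrates those rules only; it is not the workflow's `Compute tags + version` step, whose body this diff does not show. The name `compute_tags` and its argument order are assumptions, and it handles only plain `vX.Y.Z` tags.

```shell
#!/usr/bin/env bash
# Hypothetical helper illustrating the documented tag fan-out.
# compute_tags EVENT TAG SHORTSHA
#   EVENT    "push" for a tag push, "workflow_dispatch" for a manual run
#   TAG      git tag such as v1.2.3 (ignored for workflow_dispatch)
#   SHORTSHA abbreviated commit SHA
# Prints one image tag per line.
compute_tags() {
  local event="$1" tag="$2" shortsha="$3"

  # Manual runs publish a snapshot tag only; no release tag moves.
  if [ "$event" = "workflow_dispatch" ]; then
    echo "snapshot-$shortsha"
    return 0
  fi

  local version="${tag#v}"        # v1.2.3 -> 1.2.3
  local major="${version%%.*}"    # 1.2.3  -> 1
  local rest="${version#*.}"      # 1.2.3  -> 2.3
  local minor="${rest%%.*}"       # 2.3    -> 2

  echo "v$version"
  echo "$major.$minor"
  echo "$major"

  # :latest only follows releases at 1.x or later.
  if [ "$major" -ge 1 ]; then
    echo "latest"
  fi
}
```

With this sketch, `compute_tags push v0.9.0 abc1234` prints `v0.9.0`, `0.9` and `0` but not `latest`, which matches the "X >= 1" condition in the header.
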
+11
@@ -2,6 +2,10 @@
 /bin/
 /dist/

+# Generated mdBook output (source under docs/book/src is committed,
+# the rendered book/ directory is not).
+/docs/book/book/
+
 # Local data / runtime state
 /data/
 /certs/
@@ -41,3 +45,10 @@ coverage.html
 # tooling already skips paths starting with _, but ignore explicitly
 # so an accidental `git add cmd/.` can't sneak them into a release.
 /cmd/_*/
+
+# Local-only planning / scratch — never committed.
+/ask.md
+/docs/superpowers/
+
+# Claude Code agent worktrees (transient, harness-created).
+/.claude/worktrees/
+127
@@ -0,0 +1,127 @@
# Changelog

All notable changes to this project are documented here.
The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and the project follows [Semantic Versioning](https://semver.org/).

## [Unreleased]

## [1.1.0] - 2026-06-15

### Added

- **Always-On vs intermittent host mode.** A host can now be marked as
  not always-on — for laptops/workstations that legitimately sleep,
  travel, or shut down outside hours. An intermittent host no longer
  raises "agent offline" alerts when it disappears; instead it shows a
  calm "asleep" state in the UI ("asleep · last seen … · will catch up
  on return") and is covered by a longer-horizon staleness alert (raised
  only when it has an enabled schedule and no successful backup in 7
  days). When such a host reconnects, the server waits a short settle
  window and then automatically dispatches any scheduled backup whose
  window elapsed while it was asleep. Toggle per host from the host
  detail page (operator-band, audited as `host.mode_updated`). New and
  existing hosts default to always-on, so current fleets are unaffected.

### Changed

- Host-detail header redesign: tags and presence are grouped into
  labelled, boxed pills with click-to-edit; presence shows a `24x7` /
  `Free` chip; the agent "out of date" indicator is simplified (the full
  version detail remains in the Agent-update panel and on hover).
- Relative timestamps ("2h ago") now tick client-side, so a tab left
  open no longer shows a stale value as wall-clock time moves on.
- Release and CI container images are now published to and pulled from
  the zot OCI registry (`docker.dcglab.co.uk`).

## [1.0.1] - 2026-05-09

### Fixed

- Build version is now single-sourced from `internal/version`, and the
  server Dockerfile's ldflags were corrected so docker-built binaries
  report their real version. Previously `internal/version.Version` stayed
  at its "dev" default in docker images, which made every host look
  permanently out-of-date to the update logic.

## [1.0.0] - 2026-05-09

First tagged release. Six development phases brought the project from
empty repo to a self-hostable, multi-tenant restic backup orchestrator
with a web UI, JSON API, and self-updating agent fleet.

### Phase 1 — MVP: enrolment, visibility, on-demand backup

- HTTP server, SQLite store with migrations, AEAD-encrypted
  credentials at rest, Argon2id password hashing, session cookies.
- WebSocket transport between server and agents (heartbeat, hello,
  schedule fan-out, job log streaming).
- Agent install path for Linux (systemd unit + `install.sh`); one-time
  enrolment tokens with embedded repo credentials.
- Run-now backup execution end-to-end, snapshot listing.
- Server-side encrypted repo creds pushed to the agent on hello.

### Phase 2 — Scheduling, retention, repo operations

- Source groups (paths + excludes + pre/post hooks + bandwidth caps)
  decoupled from schedules; a schedule fires a source group.
- Cron-style schedules with retention policies, server-driven
  reconciliation push and ack.
- `restic forget`, `prune`, `check`, `unlock` automation; periodic
  maintenance ticker with per-host stagger.
- Pending-runs queue with backpressure (`max_concurrent_jobs` per
  host).
- Repo stats panel on the host detail page (size, last-check, last-
  prune, stale-lock banner).
- Auto-init of repos on first onboard with credential-failure surface
  on the host detail page.
- Announce-and-approve enrolment path for hosts that don't have a
  pre-minted token (Ed25519 fingerprint, operator approves).
- Windows agent: SCM service integration + `install.ps1` installer.
- Cross-platform alt-enrolment (announce flow on Windows).

### Phase 3 — Restore, alerts, audit

- Restore wizard: pick a snapshot, pick paths, pick a target
  (in-place / new directory), live progress.
- Snapshot diff against parent.
- Alert engine: per-source-group dedup, severity tiers, ack / resolve.
- Live-refresh alerts table with severity cues.
- Audit log UI with filters, sort, CSV export, payload-detail modal.

### Phase 4 — RBAC, OIDC, host tags

- Role-based access control: viewer / operator / admin.
- User management UI (invite, role change, disable, password reset).
- Generic OIDC SSO with JIT user provisioning + role mapping.
- Per-host tags with chip-row filter on the dashboard.

### Phase 5 — OSS readiness

- mdBook-rendered docs site at `docs/book/`.
- Contributor onboarding (CONTRIBUTING.md, security policy, license).
- Docker-only release pipeline + reference deployment compose file.
- Playwright e2e harness covering the smoke runbook.

### Phase 6 — Update delivery + observability

- Agent self-update: server-side channel pin per host, signed binary
  fetch via the WS transport, atomic swap with rollback on failure.
- Fleet-wide update orchestration with per-host stagger and an admin
  pause switch.
- Prometheus `/metrics` endpoint + Grafana dashboard JSON.
- Repo size trend per host (90-day rolling) on the host detail page.

### Cross-cutting

- Live dashboard with column sort, filters, free-text host search,
  background-tab-aware live refresh (5s cadence).
- Pure-Go binary with embedded UI, no Node/CGO at runtime.
- Reproducible `-trimpath -ldflags="-s -w"` builds for
  linux/amd64, linux/arm64, windows/amd64.
- Sharded CI (server-http / store / rest), pre-commit hooks (gofumpt,
  go vet, golangci-lint).
- Threat model published (`docs/threat-model.md`).

[Unreleased]: https://gitea.dcglab.co.uk/steve/restic-manager/compare/v1.0.0...HEAD
[1.0.0]: https://gitea.dcglab.co.uk/steve/restic-manager/releases/tag/v1.0.0
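
The staleness rule described under 1.1.0 (alert only for an intermittent host that has an enabled schedule and no successful backup in 7 days) amounts to a three-part predicate. A rough shell sketch, for illustration only: the server's actual check is not part of this diff, so the function name, the argument encoding, and the inclusive 7-day boundary are all assumptions, and a host that has never completed a backup is not covered.

```shell
#!/usr/bin/env bash
# Hypothetical predicate; not taken from the restic-manager code base.
# staleness_alert_due ALWAYS_ON HAS_ENABLED_SCHEDULE LAST_SUCCESS_EPOCH NOW_EPOCH
#   ALWAYS_ON / HAS_ENABLED_SCHEDULE are the strings "true" or "false".
#   The two timestamps are Unix epoch seconds.
# Returns 0 when the long-horizon staleness alert should be raised.
staleness_alert_due() {
  local always_on="$1" has_schedule="$2" last_success="$3" now="$4"
  local horizon=$((7 * 24 * 60 * 60))   # 7 days in seconds

  # Always-on hosts use the ordinary "agent offline" alert instead.
  [ "$always_on" = "false" ] || return 1
  # Without an enabled schedule there is nothing to be late for.
  [ "$has_schedule" = "true" ] || return 1
  # Assumed boundary: exactly 7 days without success counts as stale.
  [ $((now - last_success)) -ge "$horizon" ]
}
```

Keeping the three conditions as separate early returns makes it easy to see which one suppressed an alert when tracing a specific host.
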
@@ -38,7 +38,7 @@ but the **agent** is fetched by the install script from the server's
 **install script** are fetched from `<DataDir>/install/`. Plain
 `make build` doesn't touch any of those — the source-of-truth files
 in the working tree (`deploy/install/*`, `bin/restic-manager-agent`)
-must be copied into `/tmp/rm-smoke/data/...` *and* the running agent
+must be copied into `$HOME/smoke/data/...` *and* the running agent
 on this dev host needs replacing if the change touches agent code or
 the unit file.

@@ -53,13 +53,13 @@ asking the operator to test.**
 ```sh
 # 1. Restage what the install script serves (binary + unit + script).
 cp bin/restic-manager-agent \
-   /tmp/rm-smoke/data/agent-binaries/restic-manager-agent-linux-amd64
+   $HOME/smoke/data/agent-binaries/restic-manager-agent-linux-amd64
 cp deploy/install/install.sh \
-   /tmp/rm-smoke/data/install/install.sh
+   $HOME/smoke/data/install/install.sh
 cp deploy/install/install.ps1 \
-   /tmp/rm-smoke/data/install/install.ps1
+   $HOME/smoke/data/install/install.ps1
 cp deploy/install/restic-manager-agent.service \
-   /tmp/rm-smoke/data/install/restic-manager-agent.service
+   $HOME/smoke/data/install/restic-manager-agent.service

 # 2. Replace the running agent on this dev box and restart the
 #    service. Skip only when the change is server-side only AND
@@ -74,15 +74,36 @@ sudo -n systemctl restart restic-manager-agent
 # 3. The server runs from the working tree; restart it manually
 #    after a build that touches server code:
 pkill -f restic-manager-server
-RM_LISTEN=:8080 RM_DATA_DIR=/tmp/rm-smoke/data \
+RM_LISTEN=:8080 RM_DATA_DIR=$HOME/smoke/data \
   RM_BASE_URL=http://127.0.0.1:8080 \
-  RM_SECRET_KEY_FILE=/tmp/rm-smoke/data/secret.key \
+  RM_SECRET_KEY_FILE=$HOME/smoke/data/secret.key \
   RM_COOKIE_SECURE=false \
-  ./bin/restic-manager-server >> /tmp/rm-smoke/server.log 2>&1 &
+  ./bin/restic-manager-server >> $HOME/smoke/server.log 2>&1 &
 ```

-A `make smoke-deploy` target that bundles all of this would be a
-good follow-up.
+## Smoke server: use the Make targets, not raw `nohup`
+
+The smoke server runs as a transient `systemd --user` unit named
+`restic-manager-smoke.service` so it survives any sandbox or
+process-group boundary that would otherwise SIGTERM a backgrounded
+process. Use the Make targets:
+
+```
+make smoke-restart # rebuild server + (re)launch as systemd --user unit
+make smoke-status # systemctl --user status
+make smoke-logs # tail $HOME/smoke/server.log
+make smoke-stop # stop the unit
+make smoke-deploy # full rebuild + restage agent assets + restart
+```
+
+`./bin/restic-manager-server &` from inside a Bash tool call gets
+reaped when the tool exits — don't do that. If the unit fails to
+start: `systemctl --user status restic-manager-smoke` and
+`$HOME/smoke/server.log` have the diagnosis.
+
+`smoke-deploy` does NOT touch `/usr/local/bin/restic-manager-agent`
+on this dev box; if your change requires the live agent here to
+update, run the agent restage block above by hand.

 ## Migrations: prefer column-level ALTERs over table rebuilds

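
The restage step in the shell block above copies one binary and three install assets into the smoke data directory. A sketch of how that could be wrapped in one function with the data directory as a parameter is below. It is not the recipe behind `make smoke-deploy`, which this diff does not show; `restage_smoke_assets` is a hypothetical name, and the `mkdir -p` is an added assumption so the function also works on an empty directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper; mirrors the four cp commands from the restage block.
# restage_smoke_assets REPO_ROOT SMOKE_DATA_DIR
# Copies the agent binary and the install assets from the working tree
# into the directory the smoke server serves them from.
restage_smoke_assets() {
  local repo="$1" data="$2" f

  # Assumption: create the target directories if they are missing.
  mkdir -p "$data/agent-binaries" "$data/install" || return 1

  cp "$repo/bin/restic-manager-agent" \
     "$data/agent-binaries/restic-manager-agent-linux-amd64" || return 1

  # Installer scripts and the systemd unit keep their file names.
  for f in install.sh install.ps1 restic-manager-agent.service; do
    cp "$repo/deploy/install/$f" "$data/install/$f" || return 1
  done
}
```

Called from the repository root as `restage_smoke_assets . "$HOME/smoke/data"`, this performs the same copies as the block above; quoting the paths also keeps it working if `$HOME` contains spaces.
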
@@ -0,0 +1,69 @@
# Code of Conduct

restic-manager is a small project run by one person. This Code of
Conduct sets out the basic expectations for participating in the
project's issue tracker, pull requests, and any other community
spaces (chat, mailing lists) we may run in future.

## Expected behaviour

- **Be civil.** Disagreement is fine; rudeness is not. The same
  comment can usually be made without making it personal.
- **Assume good faith.** People asking what feels like a basic
  question may be new to the project. People proposing what feels
  like a duplicate idea may not have seen the prior discussion.
  Point them to the right place politely.
- **Stay on topic.** Issue threads are for the issue. Tangential
  conversations belong in their own thread.
- **Acknowledge the project's scope.** restic-manager is
  intentionally small in scope (see `spec.md` §2). Reasonable
  feature suggestions may still be declined for fit reasons.

## Unacceptable behaviour

- Harassment, threats, or insults — public or private.
- Discriminatory comments based on age, body size, disability,
  ethnicity, gender identity or expression, level of experience,
  nationality, personal appearance, race, religion, sexual identity
  or orientation.
- Sustained disruption — derailing threads, ignoring repeated
  requests to take a discussion elsewhere, brigading.
- Publishing other people's private information without permission.

## Reporting

If someone in the project's spaces is behaving in a way that
breaches this Code of Conduct, contact the maintainer directly
through the contact details on their Gitea profile, or via the
private security disclosure path documented in
[SECURITY.md](./SECURITY.md). Reports stay confidential.

The maintainer will review the report, gather context if needed,
and respond. Possible outcomes include a private warning, a public
clarification of expectations, a temporary or permanent ban from
project spaces, or no action if the report doesn't hold up.

There is no formal appeals process — this is a one-person project,
not a foundation. If you think a decision was wrong you can say
so, in writing, to the maintainer; that's it.

## Scope

This Code of Conduct applies to interactions in any space the
project owns or operates: the Gitea repository (issues, pull
requests, discussions, wiki), any chat channels we publish, and
any conferences or events the project is officially represented at.

It does not apply to:

- Forks of the project that aren't being submitted back upstream.
- Conversations between contributors that don't reference the
  project.
- Public criticism of the project itself.

## Acknowledgement

This document borrows shape and language from the
[Contributor Covenant](https://www.contributor-covenant.org/) v2.1
but is intentionally shorter and adapted to the project's
single-maintainer reality.
+159
-21
@@ -1,30 +1,168 @@
# Contributing
# Contributing to restic-manager

Thanks for your interest in contributing to restic-manager.
Thanks for your interest in restic-manager. This document covers how
to set up a development environment, the conventions the project
follows, and how patches make it from your machine into `main`.

> This is a placeholder. The project is in pre-alpha (Phase 1 / MVP). A
> full contributor guide will land alongside the Phase 5 OSS-readiness
> work — see [`tasks.md`](./tasks.md) P5-02. Until then the notes below
> apply.
## Project status and scope

## Before opening a PR
restic-manager is in pre-1.0. Core functionality (Phases 0–4) is
landed; OSS-readiness polish is in progress. The top of
[`tasks.md`](./tasks.md) tracks what's next; [`spec.md`](./spec.md)
is the canonical design doc and the source of truth for any
"why is it built this way" question.

1. Open an issue first for non-trivial changes — the design is still
   moving (see [`spec.md`](./spec.md)) and unsolicited large PRs may
   conflict with in-flight work.
2. `make lint test` should pass.
3. Match the existing code style — `gofumpt`, `goimports`, no comments
   that just restate what the code does.
4. Keep commits focused; one logical change per commit.
The project is **single-maintainer, hobbyist-scale, and licensed
under [PolyForm Noncommercial 1.0.0](./LICENSE)**. That has two
practical implications:

## Reporting security issues
1. Big PRs without prior discussion may be declined for fit
   reasons even when they're correct — opening an issue first lets
   us check alignment cheaply.
2. Commercial use is not permitted by the license. Bug reports and
   patches from operators of personal/community deployments are
   very welcome.

Please do **not** open a public issue for security problems. A
`SECURITY.md` with a private disclosure path will be added in Phase 5
(P5-05). Until then, contact the repository owner directly via the
contact details on their gitea profile.
## Getting started

### Prerequisites

- Go 1.25 or newer (`go.mod` is the source of truth)
- `make`
- For the front-end CSS bundle: nothing extra — `make build`
  downloads a pinned `tailwindcss` standalone binary into `bin/`.
- For the docs site: nothing extra — `make docs` does the same trick
  with `mdbook`.
- For end-to-end tests: Docker + Docker Compose, plus `npx` for
  Playwright.

### One-time setup

```sh
git clone https://gitea.dcglab.co.uk/steve/restic-manager.git
cd restic-manager
make build # compiles bin/restic-manager-{server,agent}
make test # full unit + integration test sweep
make lint # gofumpt + goimports + golangci-lint
```

### Running locally

For most development, the [smoke environment](./docs/e2e-smoke.md)
is the path of least resistance:

```sh
make smoke-restart # rebuilds, launches as a systemd --user unit
make smoke-logs # tail of the server log
```

Then point a browser at `http://127.0.0.1:8080`. The first run
prints a one-time bootstrap token to the log; use it to create the
admin user.

## Code conventions

### Style

- `gofumpt` for formatting; `goimports` for import grouping.
  Both run via the pre-commit hook in this repo.
- `golangci-lint` with `.golangci.yml` defaults; CI rejects on lint
  errors.
- UK English in identifiers, comments, log messages, and UI strings
  (the misspell linter is configured for the UK locale — see
  P3-X5 for the original sweep).
- Comments explain **why**, not what; avoid restating the code.
  A surprising invariant or an external constraint is worth
  writing down. "Adds 1 to x" is not.
- `slog` for structured logs. Never log secrets — and especially
  never the merged-creds rest-server URL (see [`CLAUDE.md`](./CLAUDE.md)).
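
The "never log secrets" rule can be illustrated with a minimal sketch, assuming the sensitive value is a URL with embedded credentials. The standard library's `url.URL.Redacted` masks the password before the value reaches `slog`. The helper name `redactURL` and the example URL are hypothetical and are not taken from the codebase.

```go
package main

import (
	"log/slog"
	"net/url"
	"os"
)

// redactURL is a hypothetical helper: it parses raw and returns it with
// any password replaced by "xxxxx". Unparseable input is never echoed back.
func redactURL(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return "<unparseable url>"
	}
	return u.Redacted()
}

func main() {
	logger := slog.New(slog.NewTextHandler(os.Stdout, nil))
	// Example value only; a real repository URL would come from config.
	repo := "https://backup:s3cret@rest.example.com/host-a/"
	logger.Info("repo configured", "url", redactURL(repo))
}
```

`Redacted` keeps the username, so if the username is itself sensitive it would need separate handling.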

### File and package layout

- `cmd/server` and `cmd/agent` are the two binary entry points.
- `internal/` holds everything that's not part of the public Go
  API (which is none of it — restic-manager isn't a library).
- Per-feature packages live under `internal/server/...` for the
  control plane and `internal/agent/...` for the agent.
- `web/templates/` are HTML templates rendered with the standard
  library; embedded via `web.FS`.

### Tests

- Unit tests live alongside the code as `*_test.go`. Use the
  in-process sqlite store (`store.Open(":memory:")`) when you need
  state — there is no test mock layer to maintain.
- HTTP handlers test through `httptest.NewServer` against the real
  router; see `internal/server/http/auth_test.go` for the canonical
  fixture pattern.
- End-to-end tests live in `e2e/` and run against a Docker Compose
  stack. See [`docs/e2e.md`](./docs/e2e.md).

### Database migrations

- Migrations are hand-rolled SQL in `internal/store/migrations/`
  and embedded via `embed.FS`.
- Prefer column-level `ALTER TABLE` over rebuilds — see
  [`CLAUDE.md`](./CLAUDE.md) "Migrations" section for the FK-cascade
  trap that bit migration 0007's first draft.

## Workflow

### Before opening a PR

1. **Open an issue first** for non-trivial changes. The design is
   still moving; an issue lets us agree on direction cheaply.
2. Run `make lint test` locally — both must pass.
3. Match existing code style (see above).
4. Keep commits focused: one logical change per commit. Imperative
   subject lines, body explaining why if it isn't obvious.
5. Don't add `Co-Authored-By` trailers — repo policy. If you used
   AI assistance in writing the patch, that's fine; we just don't
   pollute every commit message with attribution boilerplate.

### Pull requests

PRs target `main`. CI runs lint + tests on Linux amd64/arm64 and
Windows amd64; all three must be green to merge. Squash-merge is
the default; the PR title becomes the merge-commit subject, so
keep it short and informative.

The PR template asks for:

- A short description of what changed and why.
- A test plan (commands run, scenarios verified).
- Anything reviewers need to know to assess the change (related
  issue, follow-up work, deferred concerns).

### Reporting bugs

Open an issue with:

- restic-manager version (`server --version`) and agent version.
- restic version on the affected host.
- Steps to reproduce.
- Server and agent logs (sanitise any tokens before pasting).

Security-sensitive bugs go through the [SECURITY.md](./SECURITY.md)
disclosure path instead — please don't open a public issue for
them.

### Suggesting features

Open an issue describing the use case (not just the proposed
solution). The roadmap in `tasks.md` shows where the project is
heading; if the suggestion fits a future phase we'll wire it in
there. If it falls outside the project's scope (multi-tenancy, SaaS,
non-restic backends — see `spec.md` §2 non-goals) we'll say so
early to save your time.

## Code of conduct

Project participation is governed by [CODE_OF_CONDUCT.md](./CODE_OF_CONDUCT.md).
The short version: be civil; assume good faith; harassment is not
tolerated.

## License

By contributing you agree that your contributions are licensed under
the [PolyForm Noncommercial 1.0.0](./LICENSE) license.
By contributing you agree that your contributions are licensed
under the [PolyForm Noncommercial 1.0.0](./LICENSE) license.

@@ -7,7 +7,11 @@ AGENT_BIN := $(BIN_DIR)/restic-manager-agent
VERSION ?= $(shell git describe --tags --always --dirty 2>/dev/null || echo dev)
COMMIT ?= $(shell git rev-parse HEAD 2>/dev/null || echo none)
DATE ?= $(shell date -u +%Y-%m-%dT%H:%M:%SZ)
LDFLAGS := -s -w -X main.version=$(VERSION) -X main.commit=$(COMMIT) -X main.date=$(DATE)
VERSION_PKG := gitea.dcglab.co.uk/steve/restic-manager/internal/version
LDFLAGS := -s -w \
	-X $(VERSION_PKG).Version=$(VERSION) \
	-X $(VERSION_PKG).Commit=$(COMMIT) \
	-X $(VERSION_PKG).Date=$(DATE)
GOFLAGS := -trimpath
DOCKER_IMAGE ?= gitea.dcglab.co.uk/steve/restic-manager
DOCKER_TAG ?= dev
@@ -22,7 +26,29 @@ TAILWIND_URL := https://github.com/tailwindlabs/tailwindcss/releases/downlo
TAILWIND_INPUT := web/styles/input.css
TAILWIND_OUTPUT := web/static/css/styles.css

.PHONY: help build server agent test test-race lint fmt tidy clean run-server run-agent docker release tailwind tailwind-watch setup hooks
# mdBook for the docs site (P5-01). Single static binary, no
# Rust toolchain — same pattern as Tailwind.
MDBOOK_VERSION ?= v0.4.51
MDBOOK_OS := $(shell uname -s | tr A-Z a-z)
MDBOOK_TRIPLE := $(shell uname -m)-unknown-$(if $(filter darwin,$(MDBOOK_OS)),apple-darwin,linux-gnu)
MDBOOK_BIN := $(BIN_DIR)/mdbook
MDBOOK_TARBALL := mdbook-$(MDBOOK_VERSION)-$(MDBOOK_TRIPLE).tar.gz
MDBOOK_URL := https://github.com/rust-lang/mdBook/releases/download/$(MDBOOK_VERSION)/$(MDBOOK_TARBALL)
DOCS_BOOK_DIR := docs/book
DOCS_BOOK_OUT := $(DOCS_BOOK_DIR)/book

.PHONY: help build server agent test test-race lint fmt tidy clean run-server run-agent docker release tailwind tailwind-watch docs docs-watch setup hooks smoke-restart smoke-stop smoke-status smoke-logs smoke-deploy

# ---- smoke-env tooling -------------------------------------------------
# The smoke server runs as a transient user-systemd unit so it survives
# bash-tool boundaries and reboots-of-the-shell. Use `make smoke-restart`
# any time you've rebuilt the server. `make smoke-deploy` is the full
# rebuild + restage + restart workflow described in CLAUDE.md.
SMOKE_UNIT := restic-manager-smoke
SMOKE_DATA_DIR := $(HOME)/smoke/data
SMOKE_LOG_FILE := $(HOME)/smoke/server.log
SMOKE_BASE_URL := http://127.0.0.1:8080
SMOKE_LISTEN := :8080

help:
	@grep -E '^[a-zA-Z_-]+:.*?## ' $(MAKEFILE_LIST) | awk 'BEGIN{FS=":.*?## "};{printf " \033[36m%-14s\033[0m %s\n",$$1,$$2}'
@@ -47,6 +73,18 @@ tailwind-watch: $(TAILWIND_BIN) ## Watch and rebuild on every save
	@mkdir -p $$(dirname $(TAILWIND_OUTPUT))
	$(TAILWIND_BIN) -c tailwind.config.js -i $(TAILWIND_INPUT) -o $(TAILWIND_OUTPUT) --watch

$(MDBOOK_BIN):
	@mkdir -p $(BIN_DIR)
	@echo "==> downloading mdbook $(MDBOOK_VERSION) ($(MDBOOK_TRIPLE))"
	curl -fsSL "$(MDBOOK_URL)" | tar -xz -C $(BIN_DIR) mdbook
	@chmod +x $@

docs: $(MDBOOK_BIN) ## Build the docs/book/ mdBook site into docs/book/book/
	$(MDBOOK_BIN) build $(DOCS_BOOK_DIR)

docs-watch: $(MDBOOK_BIN) ## Serve the docs site at http://127.0.0.1:3000 with live reload
	$(MDBOOK_BIN) serve $(DOCS_BOOK_DIR) -n 127.0.0.1 -p 3000

agent: ## Build the agent binary
	@mkdir -p $(BIN_DIR)
	CGO_ENABLED=0 go build $(GOFLAGS) -ldflags "$(LDFLAGS)" -o $(AGENT_BIN) ./cmd/agent
@@ -77,7 +115,7 @@ tidy: ## go mod tidy
	go mod tidy

clean: ## Remove build artifacts
	rm -rf $(BIN_DIR) coverage.out coverage.html $(TAILWIND_OUTPUT)
	rm -rf $(BIN_DIR) coverage.out coverage.html $(TAILWIND_OUTPUT) $(DOCS_BOOK_OUT)

run-server: server ## Build and run the server
	$(SERVER_BIN)
@@ -92,6 +130,48 @@ docker: ## Build the server Docker image
		--build-arg DATE=$(DATE) \
		-t $(DOCKER_IMAGE):$(DOCKER_TAG) .

smoke-restart: server ## (Re)start the smoke server as a transient user-systemd unit
	@systemctl --user reset-failed $(SMOKE_UNIT) >/dev/null 2>&1 || true
	@systemctl --user stop $(SMOKE_UNIT) >/dev/null 2>&1 || true
	@echo "==> launching $(SMOKE_UNIT)"
	systemd-run --user --unit=$(SMOKE_UNIT) \
		--setenv=RM_LISTEN=$(SMOKE_LISTEN) \
		--setenv=RM_DATA_DIR=$(SMOKE_DATA_DIR) \
		--setenv=RM_BASE_URL=$(SMOKE_BASE_URL) \
		--setenv=RM_SECRET_KEY_FILE=$(SMOKE_DATA_DIR)/secret.key \
		--setenv=RM_COOKIE_SECURE=false \
		--property=StandardOutput=append:$(SMOKE_LOG_FILE) \
		--property=StandardError=append:$(SMOKE_LOG_FILE) \
		--property=Restart=on-failure \
		$(PWD)/$(SERVER_BIN)
	@for i in 1 2 3 4 5; do \
		curl -fsS -o /dev/null $(SMOKE_BASE_URL)/api/version 2>/dev/null && \
			{ echo "==> smoke server up: $$(curl -s $(SMOKE_BASE_URL)/api/version)"; exit 0; }; \
		sleep 1; \
	done; \
	echo "!! smoke server did not respond on $(SMOKE_BASE_URL) — check $(SMOKE_LOG_FILE)" >&2; \
	systemctl --user status --no-pager $(SMOKE_UNIT) || true; \
	exit 1

smoke-stop: ## Stop the smoke server
	systemctl --user stop $(SMOKE_UNIT) || true
	@systemctl --user reset-failed $(SMOKE_UNIT) >/dev/null 2>&1 || true

smoke-status: ## Show status of the smoke server
	@systemctl --user status --no-pager $(SMOKE_UNIT) 2>&1 | head -20 || true

smoke-logs: ## Tail the smoke server log
	tail -50 $(SMOKE_LOG_FILE)

smoke-deploy: build smoke-restart ## Rebuild + restage agent into smoke + restart server (full per-CLAUDE.md cycle)
	@echo "==> restaging agent + install assets into $(SMOKE_DATA_DIR)"
	cp $(AGENT_BIN) $(SMOKE_DATA_DIR)/agent-binaries/restic-manager-agent-linux-amd64
	cp deploy/install/install.sh $(SMOKE_DATA_DIR)/install/install.sh
	cp deploy/install/install.ps1 $(SMOKE_DATA_DIR)/install/install.ps1
	cp deploy/install/restic-manager-agent.service $(SMOKE_DATA_DIR)/install/restic-manager-agent.service
	@echo "==> NOTE: this dev box's installed agent at /usr/local/bin/restic-manager-agent is NOT updated by this target."
	@echo " Run the agent restage block in CLAUDE.md if your change touches agent code or the unit file."

release: ## Cross-compile for all supported platforms
	@mkdir -p $(BIN_DIR)
	@for target in linux/amd64 linux/arm64 windows/amd64; do \

@@ -1,36 +1,62 @@
# restic-manager

Self-hosted, browser-based, single-pane-of-glass for managing
[restic](https://restic.net) backups across a fleet of Linux and Windows
endpoints.
[restic](https://restic.net) backups across a fleet of Linux and
Windows endpoints.

> Status: pre-alpha. Phase 0 (project bootstrap) complete; Phase 1 (MVP) in
> progress. See [`spec.md`](./spec.md) for the design and
> [`tasks.md`](./tasks.md) for the roadmap.
> **Status:** pre-1.0, feature-complete for the original use
> case. Phases 0–4 + 6 are landed (MVP, scheduling, restore,
> RBAC + OIDC, observability); Phase 5 (OSS readiness — docs site,
> contributor onboarding, end-to-end CI) is in flight. See
> [`spec.md`](./spec.md) for the design and [`tasks.md`](./tasks.md)
> for the live roadmap.

## What it does (target)
## What it does

- Central visibility into backup state for every endpoint
- Trigger any restic operation remotely (`backup`, `forget`, `prune`,
  `check`, `unlock`, `snapshots`, `stats`, `diff`, `restore`)
- Manage per-host backup schedules from the UI
- Live job progress streamed back to the UI
- Restore wizard (browse snapshots, pick paths, restore to original or
  alternate host)
- Repo health surfacing (size, dedup ratio, last check, lock state)
- Alerting on failure or staleness
- Cross-platform agent (Linux + Windows)
- Ransomware-resistant repo access via append-only credentials
- Central visibility into backup state for every endpoint.
- Trigger any restic operation remotely (`backup`, `forget`,
  `prune`, `check`, `unlock`, `snapshots`, `stats`, `diff`,
  `restore`).
- Per-host schedules with named source groups + retention.
- Live job log streamed to the browser; downloadable as
  text/NDJSON afterwards.
- Restore wizard: browse a snapshot's tree, pick paths, restore
  in-place or to a new directory.
- Repo health surfacing (size, raw size, last check, lock state),
  plus a 30/90-day repo-size trend.
- Alerting over webhook, ntfy, or SMTP.
- Cross-platform agent (Linux systemd + Windows SCM).
- Append-only-friendly: separate admin credential for prune.
- Optional Prometheus `/metrics` endpoint + sample Grafana
  dashboard.
- Optional OIDC SSO (Authelia, Authentik, etc.).

## Architecture (one-line summary)
## Screenshots

A small Go control-plane on the Proxmox host, lightweight Go agents on each
endpoint that hold an outbound WebSocket to the control-plane, and a
`restic/rest-server` on Unraid that holds the actual backup data. The
control-plane never touches backup bytes.
| Sign in | Empty dashboard | Add host |
|:-------:|:---------------:|:--------:|
|  |  |  |

| Alerts | Settings | Audit log |
|:------:|:--------:|:---------:|
|  |  |  |

(Screenshots from a fresh smoke install with no hosts. A populated
fleet view and the live-log + restore wizard surfaces are part of
the docs site under [`docs/book/`](./docs/book) — `make docs` to
render locally.)

## Architecture (one-line)

A small Go control-plane in Docker, lightweight Go agents on each
endpoint holding an outbound WebSocket to the control-plane, and
a restic repository (rest-server, S3, B2, SFTP — anything restic
speaks) that holds the actual backup data. **The control-plane
never touches backup bytes.**

Full architecture diagram and component breakdown:
[`spec.md` §3](./spec.md).
[`spec.md` §3](./spec.md), or the rendered version in the
[docs site](./docs/book/src/concepts/architecture.md).

## Repository layout

@@ -38,31 +64,63 @@ Full architecture diagram and component breakdown:
cmd/server/        control-plane binary
cmd/agent/         endpoint agent binary
internal/api       shared API types (REST + WS envelopes)
internal/server/   HTTP, WS, UI handlers
internal/server/   HTTP, WS, UI handlers, alert engine
internal/agent/    service integration, restic runner, local scheduler
internal/restic    restic CLI wrapper
internal/store     SQLite persistence
internal/crypto    secret encryption
internal/crypto    secret encryption (AEAD)
internal/auth      passwords, sessions, agent tokens
web/               server-rendered templates + static assets
deploy/            Dockerfile, docker-compose.yml, install scripts
design/            UI wireframes (Phase 0 design pass)
deploy/            Dockerfile, docker-compose.yml, install scripts, Grafana dashboard
docs/              prose docs + the mdBook site under docs/book
e2e/               compose stack + Playwright tests for end-to-end CI
```

## Quickstart

The reference deployment is a single Docker container fronted by
your existing reverse proxy. See the [installation guide](docs/book/src/getting-started/install.md)
for the full path; the very short version:

```sh
export RM_VERSION=v0.9.0 # pin a real tag
export RM_BASE_URL=https://restic.example.com
export RM_TRUSTED_PROXY=10.0.0.0/8
docker compose -f deploy/docker-compose.yml up -d
```

The server prints a one-time bootstrap token to the log on first
start. POST it to `/api/bootstrap` (or open `/bootstrap` in a
browser) to create the admin user.

## Local development

Requires Go 1.25+ (built and tested on 1.26). The floor is set by
`modernc.org/sqlite` v1.50.
Requires Go 1.25+. The floor is set by `modernc.org/sqlite` v1.50.

```sh
make build # builds cmd/server and cmd/agent into ./bin
make test # runs go test ./...
make lint # runs golangci-lint
make run-server # runs the server (dev defaults)
make smoke-restart # systemd --user smoke server (see CLAUDE.md)
make docs # renders the mdBook site to docs/book/book/
```

End-to-end test harness against a Docker Compose stack with a
sibling Linux agent: see [`docs/e2e.md`](docs/e2e.md). Runs in CI
on every PR.

## Documentation

- **Concepts and operator guides**: [docs site](docs/book/src/intro.md),
  rendered with `make docs`.
- **Reverse-proxy setup**: [docs/reverse-proxy.md](docs/reverse-proxy.md).
- **Prometheus + Grafana**: [docs/prometheus.md](docs/prometheus.md).
- **End-to-end test harness**: [docs/e2e.md](docs/e2e.md).
- **Security policy**: [SECURITY.md](SECURITY.md).
- **Contributing**: [CONTRIBUTING.md](CONTRIBUTING.md).

## License

PolyForm Noncommercial 1.0.0 — see [`LICENSE`](./LICENSE). Free for personal,
hobby, research, educational, governmental, and other noncommercial use.
Commercial use requires a separate license.
[PolyForm Noncommercial 1.0.0](./LICENSE). Free for personal,
hobby, research, educational, governmental, and other noncommercial
use. Commercial use requires a separate license.

+137
@@ -0,0 +1,137 @@
# Security policy

restic-manager handles credentials that grant access to backup
repositories — losing them means an attacker can read or destroy a
fleet's backups. We take security reports seriously even at this
project's small scale.

## Supported versions

Pre-1.0, only the latest tagged release on `main` is supported.
Backporting fixes to older tags is not currently offered.

| Version             | Supported |
|---------------------|-----------|
| `main` HEAD         | Yes       |
| Latest released tag | Yes       |
| Anything older      | No        |

## Reporting a vulnerability

**Please don't open a public issue for security problems.**

Instead, use one of these private channels:

1. **Gitea private message** to the repository owner. The
   instance is at <https://gitea.dcglab.co.uk> and the owner's
   profile (`steve`) has direct-message contact set up.
2. **Email** to the address on the maintainer's Gitea profile.
   Use a subject like `[SECURITY] restic-manager: <one-line summary>`
   so it doesn't get lost. PGP optional — if you want to encrypt,
   ask for a key first.

If you don't get an acknowledgement within **3 working days**,
please escalate through the other channel — solo maintainers do
miss things, and the goal here is to fix the problem, not to
preserve protocol.

### What to include

- A description of the issue and the impact (what does an attacker
  gain? confidentiality, integrity, availability?).
- Affected component (server, agent, install script, docs).
- Affected version (`restic-manager-server --version`).
- Reproduction steps if you have them. A working PoC is welcome
  but not required — a credible threat model is enough.
- Whether you intend to publish a writeup, and any timing
  preferences.

### What we'll do

1. Acknowledge receipt within 3 working days.
2. Confirm or refute the issue, and agree a rough severity (CVSS
   or just "this is bad / this isn't"). Asking clarifying
   questions is normal at this stage — please don't read it as
   foot-dragging.
3. Develop a fix on a private branch, test it, and prepare a
   release.
4. Coordinate disclosure timing with you. The default is **30
   days from confirmed report to public disclosure**, with a
   patched release published before the disclosure date. Faster
   if a workable PoC is already circulating; slower only by
   mutual agreement.
5. Credit the reporter in the release notes (or omit the credit
   if you'd rather stay anonymous — your choice).

## Scope

In scope:

- The server binary (`cmd/server`) and any HTTP, WebSocket, or CLI
  surface it exposes.
- The agent binary (`cmd/agent`) and the way it consumes commands
  from the server.
- The install scripts (`deploy/install/install.sh`, `install.ps1`)
  and the systemd unit shipped with them.
- The docker-compose reference deployment and the docker image we
  publish.
- Any cryptographic primitive choice or implementation detail
  (AEAD, token hashing, session handling, OIDC handshake).
- Documentation that, if followed, leads operators into an
  insecure configuration.

Out of scope (not because they aren't real problems, just not ones
this report channel can act on):

- Vulnerabilities in restic itself — report those upstream at
  <https://github.com/restic/restic>.
- Vulnerabilities in third-party dependencies that haven't yet been
  patched upstream — report upstream first.
- Issues that require pre-authenticated admin access on the control
  plane (admins can already do everything; that's not a privilege
  escalation, that's the design).
- DoS via resource exhaustion on a deployment without the
  recommended reverse proxy / rate limiting in front (see
  `docs/reverse-proxy.md`).
- Social-engineering scenarios that don't have a technical hook
  into the project's own surfaces.

## Threat model summary

For context (longer version in [`spec.md`](./spec.md) §11):

- The server is **HTTP-only**; TLS termination, ACME, HSTS, and
  edge rate-limiting are the reverse proxy's job.
- Credentials are encrypted at rest with an AEAD key loaded from
  `RM_SECRET_KEY_FILE`. The same key encrypts agent secrets that
  travel to the agent over the WS channel.
- Agents authenticate with bearer tokens issued at enrolment and
  hashed at rest. Compromise of the server DB does **not** leak
  bearer tokens in plaintext, but does leak the hashes (which is
  enough to log in *as* the agent until the operator revokes —
  see [NS-01 / NS-02](./tasks.md) for the revoke + regenerate
  flows).
- The control plane intentionally **never touches backup bytes** —
  the agent runs `restic` directly against the repo. A
  compromised control plane can dispatch new jobs but cannot
  exfiltrate snapshot contents in-band.
- Append-only credentials are first-class. Forget/prune jobs use a
  separate, admin-marked credential that the server only pushes
  for the duration of a maintenance dispatch.
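
To make the "encrypted at rest with an AEAD key" point concrete, here is a minimal sketch using AES-256-GCM from the Go standard library. The policy above only says "AEAD", so the choice of AES-GCM, the nonce-prefix layout, and the `seal`/`open` helper names are all assumptions for illustration rather than a description of `internal/crypto`.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD from a 16-, 24-, or 32-byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext and prepends the random nonce to the result.
func seal(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open reverses seal. It fails if the key is wrong or the data was altered.
func open(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

func main() {
	key := make([]byte, 32) // stands in for the contents of a key file
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	sealed, err := seal(key, []byte("repo-password"))
	if err != nil {
		panic(err)
	}
	plain, err := open(key, sealed)
	fmt.Println(string(plain), err)

	otherKey := make([]byte, 32)
	if _, err := rand.Read(otherKey); err != nil {
		panic(err)
	}
	_, err = open(otherKey, sealed)
	fmt.Println("wrong key rejected:", err != nil)
}
```

The second call shows why losing the key file makes stored credentials unrecoverable: authenticated decryption fails outright rather than returning garbage.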

## Hardening checklist for operators

- Run behind a TLS-terminating reverse proxy (Caddy/nginx/Traefik).
- Set `RM_TRUSTED_PROXY` to the proxy's CIDR so request IPs aren't
  spoofable.
- Back up `RM_SECRET_KEY_FILE` separately from the database.
  Without it the encrypted creds are unrecoverable.
- Use append-only credentials for the everyday backup path; only
  the optional admin credential should have write/forget/prune
  power.
- Disable users (don't delete) when staff change roles — bearer
  tokens stay valid until rotated.
- Watch the alert and audit-log views during enrolment of new
  hosts.

Thanks for helping keep restic-manager users safe.
+14
-15
@@ -22,12 +22,7 @@ import (
	"gitea.dcglab.co.uk/steve/restic-manager/internal/agent/wsclient"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/restic"
)

var (
	version = "dev"
	commit = "none"
	date = "unknown"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
)

func main() {
@@ -66,7 +61,7 @@ func run() error {
	flag.Parse()

	if *showVersion {
		fmt.Printf("restic-manager-agent %s (commit %s, built %s)\n", version, commit, date)
		fmt.Printf("restic-manager-agent %s (commit %s, built %s)\n", version.Version, version.Commit, version.Date)
		return nil
	}

@@ -82,14 +77,14 @@ func run() error {
		if *enrollServer == "" {
			return errors.New("enrollment: -enroll-server is required with -enroll-token")
		}
		return doEnroll(*enrollServer, *enrollToken, cfg, version)
		return doEnroll(*enrollServer, *enrollToken, cfg, version.Version)
	}

	// Announce-and-approve: -enroll-server set, no token, agent not
	// yet enrolled. Run the announce flow inline; on success the cfg
	// has the bearer + host_id and we drop into the normal run loop.
	if !cfg.Enrolled() && *enrollServer != "" {
		if err := doAnnounce(*enrollServer, cfg, version); err != nil {
		if err := doAnnounce(*enrollServer, cfg, version.Version); err != nil {
			return fmt.Errorf("announce: %w", err)
		}
	}
@@ -106,7 +101,7 @@ func run() error {
		return fmt.Errorf("sysinfo: %w", err)
	}
	slog.Info("agent starting",
		"version", version,
		"version", version.Version,
		"host_id", cfg.HostID,
		"server", cfg.ServerURL,
		"restic_version", snap.ResticVersion,
@@ -136,7 +131,7 @@ func run() error {
		CertPinSHA256: cfg.CertPinSHA256,
		HelloPayload: api.HelloPayload{
			ProtocolVersion: snap.ProtocolVersion,
			AgentVersion: version,
			AgentVersion: version.Version,
			ResticVersion: snap.ResticVersion,
			Hostname: snap.Hostname,
			OS: snap.OS,
@@ -148,6 +143,7 @@ func run() error {
		resticBin: resticBin,
		resticVer: snap.ResticVersion,
		resticSupportsNoOwnership: resticSupportsNoOwnership,
		serverURL: cfg.ServerURL,
		secrets: sec,
		scheduler: scheduler.New(),
	}
@@ -214,6 +210,7 @@ type dispatcher struct {
	resticBin string
	resticVer string // e.g. "0.17.1"; empty if restic isn't installed yet
	resticSupportsNoOwnership bool // captured at startup from `restic restore --help`
	serverURL string // base URL of the server (used by the self-update fetch)
	secrets *secrets.Store
	scheduler *scheduler.Scheduler

@@ -395,10 +392,12 @@ func (d *dispatcher) handle(ctx context.Context, env api.Envelope, tx wsclient.S
			"up_kbps", up, "down_kbps", down)
		}

	case api.MsgAgentUpdateAvail:
		var p api.AgentUpdateAvailablePayload
		_ = env.UnmarshalPayload(&p)
		slog.Info("ws agent: update available", "version", p.LatestVersion, "url", p.PackageURL)
	case api.MsgCommandUpdate:
		var p api.CommandUpdatePayload
		if err := env.UnmarshalPayload(&p); err != nil {
			return fmt.Errorf("command.update: %w", err)
		}
		go d.runUpdate(ctx, p, tx)

	default:
		slog.Debug("ws agent: ignored message", "type", env.Type)

@@ -0,0 +1,65 @@
|
||||
package main
|
||||
|
||||
import (
|
||||
"context"
|
||||
"fmt"
|
||||
"log/slog"
|
||||
"time"
|
||||
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/agent/updater"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/agent/wsclient"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
|
||||
)
|
||||
|
||||
// runUpdate handles a server-dispatched command.update. It logs progress
|
||||
// via log.stream so the live job page captures pre-restart state, then
|
||||
// calls the platform updater. On Linux the updater calls os.Exit; on
|
||||
// Windows it spawns a detached helper and returns, with the agent then
|
||||
// exiting.
|
||||
//
|
||||
// The terminal job state is set by the server, not the agent: success
|
||||
// is "agent re-hellos with matching version" rather than anything the
|
||||
// agent itself can assert. The only `job.finished` we send from here is
|
||||
// on the failure path, before any restart attempt.
|
||||
func (d *dispatcher) runUpdate(ctx context.Context, p api.CommandUpdatePayload, tx wsclient.Sender) {
|
||||
logf := func(format string, args ...any) {
|
||||
line := fmt.Sprintf(format, args...)
|
||||
slog.Info("ws agent: update: " + line)
|
||||
env, err := api.Marshal(api.MsgLogStream, "", api.LogStreamLine{
|
||||
JobID: p.JobID,
|
||||
TS: time.Now().UTC(),
|
||||
Stream: api.LogStdout,
|
||||
Payload: line,
|
||||
})
|
||||
if err == nil {
|
||||
_ = tx.Send(env)
|
||||
}
|
||||
}
|
||||
|
||||
startedEnv, err := api.Marshal(api.MsgJobStarted, "", api.JobStartedPayload{
|
||||
JobID: p.JobID,
|
||||
Kind: api.JobUpdate,
|
||||
StartedAt: time.Now().UTC(),
|
||||
})
|
||||
if err == nil {
|
||||
_ = tx.Send(startedEnv)
|
||||
}
|
||||
|
||||
logf("fetching new binary from %s", d.serverURL)
|
||||
if err := updater.Update(ctx, d.serverURL); err != nil {
|
||||
logf("update failed: %v", err)
|
||||
finishedEnv, mErr := api.Marshal(api.MsgJobFinished, "", api.JobFinishedPayload{
|
||||
JobID: p.JobID,
|
||||
Status: api.JobFailed,
|
||||
FinishedAt: time.Now().UTC(),
|
||||
Error: err.Error(),
|
||||
})
|
||||
if mErr == nil {
|
||||
_ = tx.Send(finishedEnv)
|
||||
}
|
||||
return
|
||||
}
|
||||
// Unreachable on Linux (Update calls os.Exit). On Windows control
|
||||
// returns here while the detached helper does the swap-and-restart;
|
||||
// the agent then exits cleanly so SCM hands off.
|
||||
}
|
||||
+39
-11
@@ -9,6 +9,7 @@ import (
|
||||
"os"
|
||||
"os/signal"
|
||||
"path/filepath"
|
||||
"strings"
|
||||
"syscall"
|
||||
"time"
|
||||
|
||||
@@ -17,18 +18,15 @@ import (
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/crypto"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/notification"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/config"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/fleetupdate"
|
||||
rmhttp "gitea.dcglab.co.uk/steve/restic-manager/internal/server/http"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/maintenance"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/oidc"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ui"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ws"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
|
||||
)
|
||||
|
||||
var (
|
||||
version = "dev"
|
||||
commit = "none"
|
||||
date = "unknown"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
|
||||
)
|
||||
|
||||
func main() {
|
||||
@@ -44,7 +42,7 @@ func run() error {
|
||||
flag.Parse()
|
||||
|
||||
if *showVersion {
|
||||
fmt.Printf("restic-manager-server %s (commit %s, built %s)\n", version, commit, date)
|
||||
fmt.Printf("restic-manager-server %s (commit %s, built %s)\n", version.Version, version.Commit, version.Date)
|
||||
return nil
|
||||
}
|
||||
|
||||
@@ -88,9 +86,11 @@ func run() error {
|
||||
|
||||
hub := ws.NewHub()
|
||||
jobHub := ws.NewJobHub()
|
||||
metricsRegistry := metrics.NewRegistry()
|
||||
|
||||
notifHub := notification.NewHub(st, aead, cfg.BaseURL)
|
||||
alertEngine := alert.NewEngine(st, notifHub)
|
||||
updateWatcher := ws.NewUpdateWatcher(st, alertEngine, jobHub)
|
||||
|
||||
renderer, err := ui.New()
|
||||
if err != nil {
|
||||
@@ -116,9 +116,11 @@ func run() error {
|
||||
JobHub: jobHub,
|
||||
AlertEngine: alertEngine,
|
||||
NotificationHub: notifHub,
|
||||
UpdateWatcher: updateWatcher,
|
||||
UI: renderer,
|
||||
Version: version,
|
||||
Version: version.Version,
|
||||
OIDC: oidcClient,
|
||||
Metrics: metricsRegistry,
|
||||
}
|
||||
|
||||
// First-run bootstrap: if the users table is empty, mint a one-time
|
||||
@@ -139,22 +141,38 @@ func run() error {
|
||||
// text exactly once; we hash it into BootstrapToken on the
|
||||
// server-side handler.
|
||||
fmt.Fprintln(os.Stderr, "================================================================")
|
||||
fmt.Fprintln(os.Stderr, " FIRST RUN — bootstrap token (use within 1 hour, then it's gone):")
|
||||
fmt.Fprintln(os.Stderr, " FIRST RUN — no admin user exists yet.")
|
||||
if cfg.BaseURL != "" {
|
||||
fmt.Fprintln(os.Stderr, " Open this URL in a browser to create the first administrator:")
|
||||
fmt.Fprintln(os.Stderr, " "+strings.TrimRight(cfg.BaseURL, "/")+"/bootstrap")
|
||||
} else {
|
||||
fmt.Fprintln(os.Stderr, " Open the server URL in a browser; you'll be sent to /bootstrap.")
|
||||
fmt.Fprintln(os.Stderr, " (Set RM_BASE_URL to have a clickable link printed here.)")
|
||||
}
|
||||
fmt.Fprintln(os.Stderr, "")
|
||||
fmt.Fprintln(os.Stderr, " Headless? POST {token, username, password} to /api/bootstrap")
|
||||
fmt.Fprintln(os.Stderr, " with this one-shot bootstrap token (valid until first user exists):")
|
||||
fmt.Fprintln(os.Stderr, " "+token)
|
||||
fmt.Fprintln(os.Stderr, " POST it to /api/bootstrap with {token, username, password}.")
|
||||
fmt.Fprintln(os.Stderr, "================================================================")
|
||||
}
|
||||
|
||||
srv := rmhttp.New(deps)
|
||||
|
||||
// Fleet-update worker — built after the HTTP server because the
|
||||
// dispatcher delegates back into srv.DispatchHostUpdate.
|
||||
fleetWorker := fleetupdate.NewWorker(st, hub,
|
||||
&serverDispatcher{srv: srv}, alertEngine)
|
||||
srv.SetFleetWorker(fleetWorker)
|
||||
|
||||
ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
|
||||
defer stop()
|
||||
|
||||
go alertEngine.Run(ctx)
|
||||
go updateWatcher.Run(ctx)
|
||||
|
||||
errCh := make(chan error, 1)
|
||||
go func() {
|
||||
slog.Info("server listening", "addr", cfg.Listen, "version", version)
|
||||
slog.Info("server listening", "addr", cfg.Listen, "version", version.Version)
|
||||
errCh <- srv.Start()
|
||||
}()
|
||||
|
||||
@@ -209,6 +227,7 @@ func run() error {
|
||||
}
|
||||
case <-pendingDrainTick.C:
|
||||
srv.DrainAllDue(ctx)
|
||||
srv.RunCatchupsDue(ctx)
|
||||
case <-pendingExpiryTick.C:
|
||||
if n, err := st.DeleteExpiredPendingHosts(ctx, time.Now().UTC()); err == nil && n > 0 {
|
||||
slog.Info("expired pending hosts swept", "n", n)
|
||||
@@ -243,3 +262,12 @@ func run() error {
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// serverDispatcher adapts the http.Server's DispatchHostUpdate method
|
||||
// to the fleetupdate.Dispatcher interface. Lives in main so the
|
||||
// http and fleetupdate packages don't need to know about each other.
|
||||
type serverDispatcher struct{ srv *rmhttp.Server }
|
||||
|
||||
func (d *serverDispatcher) DispatchUpdate(ctx context.Context, hostID, actorUserID string) (string, string, error) {
|
||||
return d.srv.DispatchHostUpdate(ctx, hostID, actorUserID)
|
||||
}
|
||||
|
||||
@@ -26,7 +26,11 @@ ARG DATE=unknown
|
||||
ARG TARGETOS
|
||||
ARG TARGETARCH
|
||||
|
||||
ENV LDFLAGS="-s -w -X main.version=${VERSION} -X main.commit=${COMMIT} -X main.date=${DATE}"
|
||||
ENV VERSION_PKG="gitea.dcglab.co.uk/steve/restic-manager/internal/version"
|
||||
ENV LDFLAGS="-s -w \
|
||||
-X ${VERSION_PKG}.Version=${VERSION} \
|
||||
-X ${VERSION_PKG}.Commit=${COMMIT} \
|
||||
-X ${VERSION_PKG}.Date=${DATE}"
|
||||
|
||||
# Server: built for the image's runtime arch.
|
||||
RUN GOOS=${TARGETOS} GOARCH=${TARGETARCH} \
|
||||
|
||||
@@ -0,0 +1,325 @@
|
||||
{
|
||||
"annotations": {
|
||||
"list": [
|
||||
{
|
||||
"builtIn": 1,
|
||||
"datasource": { "type": "grafana", "uid": "-- Grafana --" },
|
||||
"enable": true,
|
||||
"hide": true,
|
||||
"iconColor": "rgba(0, 211, 255, 1)",
|
||||
"name": "Annotations & Alerts",
|
||||
"type": "dashboard"
|
||||
}
|
||||
]
|
||||
},
|
||||
"description": "restic-manager fleet overview. Imports against any Prometheus data source.",
|
||||
"editable": true,
|
||||
"fiscalYearStartMonth": 0,
|
||||
"graphTooltip": 0,
|
||||
"id": null,
|
||||
"links": [],
|
||||
"liveNow": false,
|
||||
"panels": [
|
||||
{
|
||||
"id": 1,
|
||||
"title": "Fleet status",
|
||||
"type": "stat",
|
||||
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||
"gridPos": { "h": 6, "w": 6, "x": 0, "y": 0 },
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"color": { "mode": "thresholds" },
|
||||
"thresholds": {
|
||||
"mode": "absolute",
|
||||
"steps": [
|
||||
{ "color": "red", "value": null },
|
||||
{ "color": "green", "value": 1 }
|
||||
]
|
||||
},
|
||||
"unit": "short"
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"options": {
|
||||
"colorMode": "value",
|
||||
"graphMode": "area",
|
||||
"justifyMode": "auto",
|
||||
"orientation": "auto",
|
||||
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
|
||||
"textMode": "auto"
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||
"expr": "rm_hosts_online",
|
||||
"legendFormat": "online",
|
||||
"refId": "A"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||
"expr": "rm_hosts_total",
|
||||
"legendFormat": "total",
|
||||
"refId": "B"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"id": 2,
|
||||
"title": "Open alerts",
|
||||
"type": "stat",
|
||||
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||
"gridPos": { "h": 6, "w": 6, "x": 6, "y": 0 },
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"color": { "mode": "thresholds" },
|
||||
"thresholds": {
|
||||
"mode": "absolute",
|
||||
"steps": [
|
||||
{ "color": "green", "value": null },
|
||||
{ "color": "yellow", "value": 1 },
|
||||
{ "color": "red", "value": 5 }
|
||||
]
|
||||
},
|
||||
"unit": "short"
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"options": {
|
||||
"colorMode": "value",
|
||||
"graphMode": "none",
|
||||
"orientation": "horizontal",
|
||||
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
|
||||
"textMode": "auto"
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||
"expr": "sum by (severity) (rm_active_alerts)",
|
||||
"legendFormat": "{{severity}}",
|
||||
"refId": "A"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"id": 3,
|
||||
"title": "Backups failing (last reported run)",
|
||||
"type": "stat",
|
||||
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||
"gridPos": { "h": 6, "w": 6, "x": 12, "y": 0 },
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"color": { "mode": "thresholds" },
|
||||
"thresholds": {
|
||||
"mode": "absolute",
|
||||
"steps": [
|
||||
{ "color": "green", "value": null },
|
||||
{ "color": "red", "value": 1 }
|
||||
]
|
||||
},
|
||||
"unit": "short"
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"options": {
|
||||
"colorMode": "value",
|
||||
"graphMode": "area",
|
||||
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
|
||||
"textMode": "auto"
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||
"expr": "count(rm_host_last_backup_success == 0)",
|
||||
"legendFormat": "failing",
|
||||
"refId": "A"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"id": 4,
|
||||
"title": "Hosts",
|
||||
"type": "table",
|
||||
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||
"gridPos": { "h": 10, "w": 24, "x": 0, "y": 6 },
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"custom": { "align": "auto", "displayMode": "auto" }
|
||||
},
|
||||
"overrides": [
|
||||
{
|
||||
"matcher": { "id": "byName", "options": "Value #B" },
|
||||
"properties": [
|
||||
{ "id": "displayName", "value": "Last backup (s ago)" },
|
||||
{ "id": "unit", "value": "s" }
|
||||
]
|
||||
},
|
||||
{
|
||||
"matcher": { "id": "byName", "options": "Value #C" },
|
||||
"properties": [
|
||||
{ "id": "displayName", "value": "Repo size" },
|
||||
{ "id": "unit", "value": "bytes" }
|
||||
]
|
||||
},
|
||||
{
|
||||
"matcher": { "id": "byName", "options": "Value #D" },
|
||||
"properties": [
|
||||
{ "id": "displayName", "value": "Snapshots" }
|
||||
]
|
||||
},
|
||||
{
|
||||
"matcher": { "id": "byName", "options": "Value #A" },
|
||||
"properties": [
|
||||
{ "id": "displayName", "value": "Online" }
|
||||
]
|
||||
},
|
||||
{
|
||||
"matcher": { "id": "byName", "options": "Value #E" },
|
||||
"properties": [
|
||||
{ "id": "displayName", "value": "Open alerts" }
|
||||
]
|
||||
}
|
||||
]
|
||||
},
|
||||
"options": { "showHeader": true },
|
||||
"transformations": [
|
||||
{
|
||||
"id": "merge",
|
||||
"options": {}
|
||||
}
|
||||
],
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||
"expr": "rm_host_agent_online",
|
||||
"format": "table",
|
||||
"instant": true,
|
||||
"refId": "A"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||
"expr": "time() - rm_host_last_backup_timestamp_seconds",
|
||||
"format": "table",
|
||||
"instant": true,
|
||||
"refId": "B"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||
"expr": "rm_host_repo_size_bytes",
|
||||
"format": "table",
|
||||
"instant": true,
|
||||
"refId": "C"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||
"expr": "rm_host_snapshot_count",
|
||||
"format": "table",
|
||||
"instant": true,
|
||||
"refId": "D"
|
||||
},
|
||||
{
|
||||
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||
"expr": "rm_host_open_alerts",
|
||||
"format": "table",
|
||||
"instant": true,
|
||||
"refId": "E"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"id": 5,
|
||||
"title": "Repo size over time",
|
||||
"type": "timeseries",
|
||||
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 },
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"color": { "mode": "palette-classic" },
|
||||
"custom": {
|
||||
"axisLabel": "",
|
||||
"drawStyle": "line",
|
||||
"fillOpacity": 10,
|
||||
"lineWidth": 1,
|
||||
"pointSize": 5,
|
||||
"showPoints": "never"
|
||||
},
|
||||
"unit": "bytes"
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"options": {
|
||||
"legend": { "calcs": ["last"], "displayMode": "list", "placement": "bottom", "showLegend": true },
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||
"expr": "rm_host_repo_size_bytes",
|
||||
"legendFormat": "{{host}}",
|
||||
"refId": "A"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"id": 6,
|
||||
"title": "Job duration p95 (last 1h, by kind)",
|
||||
"type": "timeseries",
|
||||
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 },
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"color": { "mode": "palette-classic" },
|
||||
"custom": {
|
||||
"drawStyle": "line",
|
||||
"fillOpacity": 5,
|
||||
"lineWidth": 1,
|
||||
"pointSize": 4,
|
||||
"showPoints": "never"
|
||||
},
|
||||
"unit": "s"
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"options": {
|
||||
"legend": { "calcs": ["last"], "displayMode": "list", "placement": "bottom", "showLegend": true },
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||
"expr": "histogram_quantile(0.95, sum by (kind, le) (rate(rm_job_duration_seconds_bucket[1h])))",
|
||||
"legendFormat": "{{kind}}",
|
||||
"refId": "A"
|
||||
}
|
||||
]
|
||||
}
|
||||
],
|
||||
"refresh": "30s",
|
||||
"schemaVersion": 39,
|
||||
"style": "dark",
|
||||
"tags": ["restic-manager", "backups"],
|
||||
"templating": {
|
||||
"list": [
|
||||
{
|
||||
"current": {},
|
||||
"hide": 0,
|
||||
"includeAll": false,
|
||||
"label": "Prometheus",
|
||||
"multi": false,
|
||||
"name": "DS_PROMETHEUS",
|
||||
"options": [],
|
||||
"query": "prometheus",
|
||||
"refresh": 1,
|
||||
"regex": "",
|
||||
"skipUrlSync": false,
|
||||
"type": "datasource"
|
||||
}
|
||||
]
|
||||
},
|
||||
"time": { "from": "now-6h", "to": "now" },
|
||||
"timepicker": {},
|
||||
"timezone": "",
|
||||
"title": "restic-manager — fleet",
|
||||
"uid": "rm-fleet-overview",
|
||||
"version": 1,
|
||||
"weekStart": ""
|
||||
}
|
||||
@@ -52,7 +52,12 @@ ProtectSystem=full
|
||||
# whenever a new SecretsKey is minted, so we need a targeted
|
||||
# write-exemption for that dir. No exemption for the rest of /etc:
|
||||
# the agent has no business editing /etc/passwd, /etc/sudoers, etc.
|
||||
ReadWritePaths=/etc/restic-manager
|
||||
#
|
||||
# /usr/local/bin is writable so the self-update flow (P6-01) can
|
||||
# atomic-rename a fresh binary over the running one. Permitting the
|
||||
# whole directory (rather than just the binary path) is required
|
||||
# because a rename modifies the parent directory, which must be writable.
|
||||
ReadWritePaths=/etc/restic-manager /usr/local/bin
|
||||
ProtectHostname=true
|
||||
ProtectKernelTunables=true
|
||||
ProtectKernelModules=true
|
||||
|
||||
@@ -0,0 +1,249 @@
|
||||
# Onboarding a new host — agent instructions
|
||||
|
||||
How an automation agent (with a username + password for the
|
||||
restic-manager server) brings a new host fully online.
|
||||
|
||||
The flow involves two roles:
|
||||
|
||||
- **Controller side**: the agent calls JSON APIs on the
|
||||
restic-manager server. Needs network reach to the server, plus
|
||||
username/password.
|
||||
- **Target side**: the host being onboarded runs the install
|
||||
script, which calls back to the server with the one-time token.
|
||||
|
||||
If the agent is *both* sides (e.g. it can SSH into the target),
|
||||
it does steps 1–2 against the server and steps 3–4 against the
|
||||
target. If the agent only controls the server, it stops at
|
||||
step 2 and hands the install snippet to whoever owns the target.
|
||||
|
||||
---
|
||||
|
||||
## Conventions
|
||||
|
||||
- Base URL: `$RM_SERVER` (e.g. `https://restic.lab.example`).
|
||||
- Session cookie jar: persist `rm_session` between calls.
|
||||
- All request/response bodies are JSON unless noted.
|
||||
- On any non-2xx, response body is
|
||||
`{"code": "...", "message": "..."}`.
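A client can turn that error body into a typed Go error. The following is a minimal sketch, assuming only the `code` and `message` fields described above; the `apiError` type, the `decodeAPIError` helper, and the sample `code` value are hypothetical and not part of restic-manager.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// apiError mirrors the documented non-2xx body. Hypothetical client-side type.
type apiError struct {
	Status  int    `json:"-"` // HTTP status, filled in by the caller
	Code    string `json:"code"`
	Message string `json:"message"`
}

func (e *apiError) Error() string {
	return fmt.Sprintf("HTTP %d: %s: %s", e.Status, e.Code, e.Message)
}

// decodeAPIError returns nil for 2xx statuses and an error otherwise.
func decodeAPIError(status int, body []byte) error {
	if status >= 200 && status < 300 {
		return nil
	}
	e := &apiError{Status: status}
	if err := json.Unmarshal(body, e); err != nil {
		return fmt.Errorf("HTTP %d: undecodable error body: %w", status, err)
	}
	return e
}

func main() {
	// The code value here is illustrative, not a documented constant.
	err := decodeAPIError(401, []byte(`{"code":"unauthorized","message":"session expired"}`))
	fmt.Println(err)
}
```

Keeping the HTTP status next to the decoded body lets a caller branch on 401 (log in again) without parsing the message text.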
|
||||
|
||||
---
|
||||
|
||||
## 1. Login
|
||||
|
||||
```
|
||||
POST $RM_SERVER/api/auth/login
|
||||
Content-Type: application/json
|
||||
|
||||
{"username": "...", "password": "..."}
|
||||
```
|
||||
|
||||
→ 200 with `{"user_id": "...", "role": "..."}` and a `Set-Cookie:
|
||||
rm_session=...` (HttpOnly, 24h TTL). Persist the cookie; reuse
|
||||
it on every subsequent call.
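In Go, the cookie persistence described above can be delegated to a cookie jar. This is a sketch only: the handlers below are a stand-in server built with `httptest` so the example runs on its own, and the `login` and `demo` helpers and the demo credentials are assumptions for illustration. Only the `/api/auth/login` path, the `/api/hosts` path, and the `rm_session` cookie name come from this document.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/cookiejar"
	"net/http/httptest"
)

// login posts the credentials and relies on the client's cookie jar to
// capture rm_session from the Set-Cookie header.
func login(c *http.Client, server, user, pass string) error {
	body, err := json.Marshal(map[string]string{"username": user, "password": pass})
	if err != nil {
		return err
	}
	resp, err := c.Post(server+"/api/auth/login", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("login: unexpected status %d", resp.StatusCode)
	}
	return nil
}

// demo logs in against a stand-in server, makes one follow-up call, and
// returns that call's status code.
func demo() (int, error) {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/auth/login", func(w http.ResponseWriter, r *http.Request) {
		http.SetCookie(w, &http.Cookie{Name: "rm_session", Value: "demo", Path: "/", HttpOnly: true})
		fmt.Fprint(w, `{"user_id":"u1","role":"operator"}`)
	})
	mux.HandleFunc("/api/hosts", func(w http.ResponseWriter, r *http.Request) {
		if _, err := r.Cookie("rm_session"); err != nil {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		fmt.Fprint(w, "[]")
	})
	srv := httptest.NewServer(mux)
	defer srv.Close()

	jar, err := cookiejar.New(nil)
	if err != nil {
		return 0, err
	}
	client := &http.Client{Jar: jar}
	if err := login(client, srv.URL, "operator", "secret"); err != nil {
		return 0, err
	}
	resp, err := client.Get(srv.URL + "/api/hosts")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	status, err := demo()
	fmt.Println(status, err)
}
```

Building the body with `json.Marshal` rather than string concatenation keeps passwords containing quotes or backslashes from producing invalid JSON.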
|
||||
|
||||
Required role for the next step: **operator** or **admin**.
|
||||
A viewer-only login can read but cannot mint tokens.
|
||||
|
||||
The session expires after 24 hours. On a 401 from a later call, log in again.
|
||||
|
||||
---
|
||||
|
||||
## 2. Mint an enrolment token
|
||||
|
||||
```
|
||||
POST $RM_SERVER/api/enrollment-tokens
|
||||
Cookie: rm_session=...
|
||||
Content-Type: application/json
|
||||
|
||||
{
|
||||
"hostname": "newhost.example",
|
||||
"tags": ["prod", "london"], // optional
|
||||
"repo_url": "rest:https://rest.example/newhost",
|
||||
"repo_username": "...", // optional, for rest-server / S3
|
||||
"repo_password": "...", // optional
|
||||
"initial_paths": ["/etc", "/home", "/var/lib"] // optional; default source group
|
||||
}
|
||||
```
|
||||
|
||||
→ 200 with:
|
||||
|
||||
```json
|
||||
{ "token": "<RAW_ONE_TIME_TOKEN>", "expires_at": "2026-05-09T..." }
|
||||
```
|
||||
|
||||
**Capture `token` immediately — the server only stores its hash
|
||||
and will never return the raw value again.** TTL is 1 hour.
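Why the raw value cannot be recovered later can be illustrated with a hash-only store. The sketch below assumes SHA-256 with hex encoding, which this document does not confirm; `hashToken` and `tokenMatches` are hypothetical helper names, not the server's actual code.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// hashToken returns the hex SHA-256 digest of a raw token.
// SHA-256 is an assumption made for illustration.
func hashToken(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

// tokenMatches hashes the presented token and compares it with the
// stored digest in constant time.
func tokenMatches(presented, storedHash string) bool {
	got := hashToken(presented)
	return subtle.ConstantTimeCompare([]byte(got), []byte(storedHash)) == 1
}

func main() {
	stored := hashToken("example-raw-token") // only this digest would be persisted
	fmt.Println(tokenMatches("example-raw-token", stored))
	fmt.Println(tokenMatches("some-other-token", stored))
}
```

Because only the digest is kept, a lost token can be replaced but never re-displayed, which matches the advice to capture it at mint time.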
|
||||
|
||||
The repo creds you provided are encrypted under the token hash
|
||||
and pre-attached to the host. The agent will fetch and store
|
||||
them at enrol-time; you will not need to push them again.
|
||||
|
||||
If you lose the token before the install runs, mint a new one
|
||||
(the existing one becomes irrelevant; you can leave it to expire
|
||||
or revoke it via the UI).
|
||||
|
||||
---
|
||||
|
||||
## 3. Install on the target host
|
||||
|
||||
The install script is hosted by the server itself. Run it on the
|
||||
target:
|
||||
|
||||
### Linux
|
||||
|
||||
```
|
||||
curl -fsSL $RM_SERVER/install/install.sh | \
|
||||
sudo RM_SERVER=$RM_SERVER RM_TOKEN=<RAW_ONE_TIME_TOKEN> bash
|
||||
```
|
||||
|
||||
What it does, end-to-end:
|
||||
|
||||
1. detects arch (amd64 / arm64)
|
||||
2. downloads `$RM_SERVER/agent/binary?os=linux&arch=<arch>` to
|
||||
`/usr/local/bin/restic-manager-agent`
|
||||
3. creates `/etc/restic-manager/` and `/var/lib/restic-manager/`
|
||||
(root:root, 0700)
|
||||
4. calls `POST /api/agents/enroll` with the token; server returns
|
||||
the persistent agent bearer + `host_id`, written to
|
||||
`/etc/restic-manager/agent.env`
|
||||
5. installs the systemd unit, `daemon-reload`, `enable --now`
|
||||
6. surfaces any pre-existing restic cron/timer entries so the
|
||||
operator can decide whether to disable them (script does
|
||||
*not* touch them automatically)
|
||||
|
||||
The script is idempotent. Re-running on an already-enrolled host
|
||||
is a no-op unless `RM_FORCE_REENROLL=1`.
|
||||
|
||||
The agent runs as **root** by design — fleet backup needs to
|
||||
read every file on the system. See
|
||||
`deploy/install/restic-manager-agent.service` for rationale.
|
||||
|
||||
### Windows
|
||||
|
||||
```
|
||||
iwr "$env:RM_SERVER/install/install.ps1" -UseBasicParsing | iex
|
||||
# (or download + run; needs an elevated PowerShell)
|
||||
# Required env: $env:RM_SERVER, $env:RM_TOKEN
|
||||
```
|
||||
|
||||
Same flow, but it installs a Windows service instead of a systemd unit.
|
||||
|
||||
### Manual / non-script enrolment
|
||||
|
||||
If the install script can't be used, the wire-level enrol call is:
|
||||
|
||||
```
|
||||
POST $RM_SERVER/api/agents/enroll
|
||||
Content-Type: application/json
|
||||
|
||||
{
|
||||
"token": "<RAW_ONE_TIME_TOKEN>",
|
||||
"hostname": "newhost.example",
|
||||
"os": "linux", // linux | windows
|
||||
"arch": "amd64", // amd64 | arm64
|
||||
"agent_version": "...",
|
||||
"restic_version": "..."
|
||||
}
|
||||
```
|
||||
|
||||
→ 200 with
|
||||
`{"host_id": "...", "agent_token": "...", "cert_pin_sha256": "..."}`.
|
||||
|
||||
The agent_token goes into `/etc/restic-manager/agent.env` as
|
||||
`RM_AGENT_TOKEN=...`; subsequent agent → server traffic uses
|
||||
`Authorization: Bearer $RM_AGENT_TOKEN`.
|
||||
|
||||
---
|
||||
|
||||
## 4. Verify the host is healthy
|
||||
|
||||
Poll until both conditions are true. Cap at ~5 minutes.
|
||||
|
||||
```
|
||||
GET $RM_SERVER/api/hosts
|
||||
Cookie: rm_session=...
|
||||
```
|
||||
|
||||
→ array of host objects. Find the one with the matching hostname
|
||||
and check:
|
||||
|
||||
- `"status": "online"` — agent connected to the WS heartbeat
|
||||
- `"repo_status": "ready"` — `restic init` (or existing-config
|
||||
detection) completed successfully
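The two checks above can be expressed as one predicate over the `/api/hosts` response. This is a sketch under assumptions: the `hostname` JSON field name is a guess (only `status` and `repo_status` are named in this document), and `hostReady` is a hypothetical helper. It treats `init_failed` as a terminal error so a poller can stop early instead of waiting out the full five minutes.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// host holds the fields the check needs. The hostname key is assumed.
type host struct {
	Hostname   string `json:"hostname"`
	Status     string `json:"status"`
	RepoStatus string `json:"repo_status"`
}

// hostReady reports whether the named host is online with a ready repo.
// It returns an error when the body cannot be decoded or the repo
// initialisation has failed; an absent host is simply "not ready yet".
func hostReady(body []byte, hostname string) (bool, error) {
	var hosts []host
	if err := json.Unmarshal(body, &hosts); err != nil {
		return false, fmt.Errorf("decode hosts: %w", err)
	}
	for _, h := range hosts {
		if h.Hostname != hostname {
			continue
		}
		if h.RepoStatus == "init_failed" {
			return false, fmt.Errorf("repo init failed for %s", hostname)
		}
		return h.Status == "online" && h.RepoStatus == "ready", nil
	}
	return false, nil
}

func main() {
	sample := []byte(`[{"hostname":"newhost.example","status":"online","repo_status":"ready"}]`)
	ok, err := hostReady(sample, "newhost.example")
	fmt.Println(ok, err)
}
```

A caller would invoke this on each poll response and stop on `true`, on an error, or once the overall deadline passes.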
|
||||
|
||||
If `repo_status` settles on `"init_failed"`, the repo creds are
|
||||
wrong or the repo URL is unreachable from the target. Inspect
|
||||
the matching job log:
|
||||
|
||||
```
|
||||
GET $RM_SERVER/api/hosts/<host_id>/jobs (most recent init job)
|
||||
GET $RM_SERVER/api/jobs/<job_id> (full output)
|
||||
```
|
||||
|
||||
Fix the creds with a creds-update call (see Settings → Repo on
|
||||
the UI for the exact route — currently form-only) or revoke the
|
||||
host and start over.
|
||||
|
||||
---
|
||||
|
||||
## 5. (Optional) configure schedules
|
||||
|
||||
A new host gets one default source group covering `initial_paths`
|
||||
(or `/etc`, `/home` if you didn't pass any) and **no schedule**.
|
||||
Backups won't run until either:
|
||||
|
||||
- a schedule is attached (cron expression, retention, etc.), or
|
||||
- you trigger an on-demand run via the source-group "Run now"
|
||||
endpoint.
|
||||
|
||||
These are not yet exposed cleanly as JSON-only routes; if the
|
||||
agent needs them, look at `internal/server/http/schedules*.go`
|
||||
and `internal/server/http/source_groups*.go` — most are JSON-
|
||||
capable, some are form-only with HTML 303 responses.
|
||||
|
||||
---
|
||||
|
||||
## Failure modes — quick reference
|
||||
|
||||
| Symptom | Likely cause | Fix |
|
||||
|---|---|---|
|
||||
| `401` on `/api/enrollment-tokens` | session expired or viewer role | re-login as operator+ |
|
||||
| install.sh fails at "enrol": HTTP 410 | token expired (>1h) or already used | mint a fresh token |
|
||||
| Host shows `status=offline` after install | systemd unit didn't start; firewall blocks WS | `systemctl status restic-manager-agent`, check `$RM_SERVER` reachability |
|
||||
| `repo_status=init_failed` | bad repo creds or URL | inspect init job log; fix creds; retry probe via `/hosts/{id}/repo/probe` |
|
||||
| Token list grows with stale rows | normal — they expire at 1h | optional cleanup via `/hosts/enrollment-tokens/{hash}/revoke` |
|
||||
|
||||
---
|
||||
|
||||
## Minimum reproducible script
|
||||
|
||||
```bash
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
: "${RM_SERVER:?}" "${RM_USER:?}" "${RM_PASS:?}" "${RM_HOSTNAME:?}" \
|
||||
"${RM_REPO_URL:?}" "${RM_REPO_USER:?}" "${RM_REPO_PASS:?}"
|
||||
|
||||
JAR=$(mktemp)
|
||||
trap 'rm -f "$JAR"' EXIT
|
||||
|
||||
# 1. login
|
||||
curl -fsS -c "$JAR" -H 'Content-Type: application/json' \
|
||||
-d "$(jq -nc --arg u "$RM_USER" --arg p "$RM_PASS" '{username:$u, password:$p}')" \
|
||||
"$RM_SERVER/api/auth/login" >/dev/null
|
||||
|
||||
# 2. mint token
|
||||
TOKEN=$(curl -fsS -b "$JAR" -H 'Content-Type: application/json' \
|
||||
-d "$(jq -nc \
|
||||
--arg h "$RM_HOSTNAME" --arg u "$RM_REPO_USER" \
|
||||
--arg p "$RM_REPO_PASS" --arg r "$RM_REPO_URL" \
|
||||
'{hostname:$h, repo_url:$r, repo_username:$u, repo_password:$p}')" \
|
||||
"$RM_SERVER/api/enrollment-tokens" | jq -r .token)
|
||||
|
||||
# 3. emit the install snippet for the target machine
|
||||
cat <<EOF
|
||||
Run on $RM_HOSTNAME (as root):
|
||||
|
||||
curl -fsSL $RM_SERVER/install/install.sh | \\
|
||||
sudo RM_SERVER=$RM_SERVER RM_TOKEN=$TOKEN bash
|
||||
EOF
|
||||
```
|
||||
@@ -0,0 +1,19 @@
|
||||
[book]
|
||||
title = "restic-manager"
|
||||
description = "Self-hosted control plane for restic backups across a fleet of Linux and Windows endpoints."
|
||||
authors = ["Steve Cliff"]
|
||||
language = "en-GB"
|
||||
multilingual = false
|
||||
src = "src"
|
||||
|
||||
[output.html]
|
||||
default-theme = "ayu"
|
||||
preferred-dark-theme = "ayu"
|
||||
git-repository-url = "https://gitea.dcglab.co.uk/steve/restic-manager"
|
||||
git-repository-icon = "fa-code-fork"
|
||||
edit-url-template = "https://gitea.dcglab.co.uk/steve/restic-manager/_edit/main/docs/book/{path}"
|
||||
no-section-label = false
|
||||
|
||||
[output.html.fold]
|
||||
enable = true
|
||||
level = 2
|
||||
@@ -0,0 +1,40 @@
|
||||
# Summary
|
||||
|
||||
[Introduction](./intro.md)
|
||||
|
||||
# Getting started
|
||||
|
||||
- [Installing the server](./getting-started/install.md)
|
||||
- [Enrolling your first host](./getting-started/enrolling-hosts.md)
|
||||
- [Running behind a reverse proxy](./getting-started/reverse-proxy.md)
|
||||
|
||||
# Concepts
|
||||
|
||||
- [Architecture](./concepts/architecture.md)
|
||||
- [Credentials and how they flow](./concepts/credentials.md)
|
||||
- [Schedules and source groups](./concepts/schedules-and-source-groups.md)
|
||||
- [Repo maintenance](./concepts/repo-maintenance.md)
|
||||
|
||||
# Operations
|
||||
|
||||
- [Backups and restores](./operations/backups-and-restores.md)
|
||||
- [Alerts and notifications](./operations/alerts.md)
|
||||
- [Observability with Prometheus](./operations/observability.md)
|
||||
- [Updating agents](./operations/updates.md)
|
||||
|
||||
# Security
|
||||
|
||||
- [Threat model](./security/threat-model.md)
|
||||
- [Hardening checklist](./security/hardening.md)
|
||||
- [Reporting vulnerabilities](./security/disclosure.md)
|
||||
|
||||
# Reference
|
||||
|
||||
- [Environment variables](./reference/env-vars.md)
|
||||
- [HTTP endpoints](./reference/http-endpoints.md)
|
||||
|
||||
---
|
||||
|
||||
[Contributing](./contributing.md)
|
||||
[Roadmap](./roadmap.md)
|
||||
[License](./license.md)
|
||||
@@ -0,0 +1,121 @@
|
||||
# Architecture
|
||||
|
||||
## Components
|
||||
|
||||
```
┌────────────────────────────────────────────────────────────┐
│  Server (control plane, single process)                    │
│    * chi-based HTTP API + HTMX server-rendered UI          │
│    * WebSocket hub for agent fan-out + browser fan-out     │
│    * SQLite store (modernc.org/sqlite, pure Go)            │
│    * AEAD encryption helpers                               │
│    * Alert engine + notification hub                       │
└────────────┬───────────────────────────────────┬───────────┘
             │ outbound WS only                  │ HTTP(S)
             │                                   │
┌────────────▼─────────────┐        ┌────────────▼─────────────┐
│  Agent (per host)        │        │  Browser (operator)      │
│    * coder/websocket     │        │    * htmx + a tiny bit   │
│    * cron for schedules  │        │      of vanilla JS for   │
│    * restic wrapper      │        │      live job updates    │
│    * sysinfo collector   │        └──────────────────────────┘
└────────────┬─────────────┘
             │ subprocess: restic ...
             │
┌────────────▼─────────────────────────────────────────────────┐
│  restic repository (rest-server, S3, B2, SFTP, local …)      │
│  Backup data flows directly here. Server never touches it.   │
└──────────────────────────────────────────────────────────────┘
```
|
||||
|
||||
## Why outbound-only WebSockets?
|
||||
|
||||
The agent dials the server on `/ws/agent` with a bearer token. The
|
||||
server doesn't initiate connections to the agent. Three reasons:
|
||||
|
||||
1. **Firewall friendliness.** Nothing on the endpoint needs an
|
||||
inbound port; this works behind the typical "branch office NAT"
|
||||
without router config.
|
||||
2. **Single auth point.** The bearer token is the only credential
|
||||
that crosses the boundary; the agent never accepts an
|
||||
incoming socket.
|
||||
3. **Reconnect semantics are simpler.** When the connection drops
|
||||
(NAT timeout, server restart, transient network glitch) the
|
||||
agent backs off and re-dials; the server marks the host
|
||||
offline after 90s and lets the alert engine raise a stale-host
|
||||
alert.
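
The reconnect behaviour in point 3 can be sketched roughly as follows. The 90s stale threshold comes from the text above; the exponential schedule (1s base, 60s cap) and the names `backoff` and `isStale` are assumptions made for illustration, not the agent's actual implementation.

```go
package main

import (
	"fmt"
	"time"
)

// staleAfter mirrors the 90s offline threshold described above.
const staleAfter = 90 * time.Second

// backoff returns the wait before reconnect attempt n (0-based).
// The 1s base and 60s cap are assumptions for this sketch.
func backoff(attempt int) time.Duration {
	const base, maxWait = time.Second, 60 * time.Second
	if attempt < 0 {
		attempt = 0
	}
	if attempt >= 6 { // 1s<<6 = 64s, already past the cap
		return maxWait
	}
	return base << uint(attempt)
}

// isStale reports whether the server should mark a host offline,
// given how long ago it last heard from the agent.
func isStale(sinceLastSeen time.Duration) bool {
	return sinceLastSeen > staleAfter
}

func main() {
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Printf("attempt %d: wait %s\n", attempt, backoff(attempt))
	}
	fmt.Println("stale after 2m of silence:", isStale(2*time.Minute))
}
```

A real agent would likely add jitter so that many agents do not re-dial in lockstep after a server restart.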
|
||||
|
||||
## Why SQLite?
|
||||
|
||||
High availability is an explicit non-goal, so SQLite is enough. A small
|
||||
control plane managing twelve endpoints does not need replication
|
||||
or a separate database tier. SQLite gives us:
|
||||
|
||||
- A single file to back up (plus the secret key).
|
||||
- Hand-rolled migrations under `internal/store/migrations/` —
|
||||
no migration framework lock-in.
|
||||
- `WAL` mode plus per-connection foreign-key enforcement.
|
||||
|
||||
The migrations define the entire schema; there's no ORM or
|
||||
query-builder layer between Go code and SQL.
|
||||
|
||||
## Why the agent runs `restic` itself, not via the server
|
||||
|
||||
The control plane never holds backup bytes in flight. That's
|
||||
deliberate:
|
||||
|
||||
- A compromised control plane cannot exfiltrate snapshot
|
||||
contents in-band — at worst it can dispatch new backup or
|
||||
forget jobs (audit-logged) but the data path is between the
|
||||
agent and the repository.
|
||||
- The same agent process can target whichever transport restic
|
||||
natively supports (rest-server, S3, B2, SFTP, local), no
|
||||
separate mux on the server side.
|
||||
|
||||
## Job lifecycle
|
||||
|
||||
```
              ┌──────────────────────┐
  operator →  │  POST /hosts/{id}/   │
              │  run-backup          │
              └──────────┬───────────┘
                         │ 1. INSERT INTO jobs (status='queued')
                         │ 2. dispatch command.run over WS
                         ▼
              ┌──────────────────────┐
              │  Agent dispatches    │
              │  restic subprocess   │
              └──────────┬───────────┘
                         │
                         │ 3. job.started  ───▶ store.MarkJobStarted
                         │ 4. job.progress ───▶ JobHub broadcast (live UI)
                         │ 5. log.stream   ───▶ append to job_logs
                         │ 6. job.finished ───▶ store.MarkJobFinished
                         │                      + alert engine eval
                         │                      + (P6) metrics histogram
                         ▼
           terminal: succeeded | failed | cancelled
```
|
||||
|
||||
Operators see live updates because the browser subscribes to
|
||||
`/api/jobs/{id}/stream`, and the WS handler broadcasts each
|
||||
agent-emitted envelope to all live subscribers in addition to
|
||||
persisting it.
|
||||
|
||||
## What scheduling looks like
|
||||
|
||||
- The agent runs a local `robfig/cron/v3` instance.
|
||||
- The server pushes the desired schedule set to the agent on
|
||||
hello + after every CRUD change.
|
||||
- When the agent's cron fires, it sends `schedule.fire` to the
|
||||
server. The server creates a job row, sends `command.run` back,
|
||||
and the agent dispatches a normal backup.
|
||||
- If the WS drops between fire and run, the server queues the
|
||||
schedule firing into `pending_runs` and drains on agent
|
||||
reconnect — no missed scheduled backups due to network blips.
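
The queue-and-drain behaviour in the last bullet can be sketched as below. This is an in-memory illustration only: the `dispatcher` type and its fields are hypothetical, and a real implementation would persist the queue in the `pending_runs` table rather than a map.

```go
package main

import "fmt"

// pendingRun is a hypothetical stand-in for a row in pending_runs.
type pendingRun struct {
	HostID     string
	ScheduleID string
}

// dispatcher models the server side of schedule firing.
type dispatcher struct {
	online  map[string]bool
	pending map[string][]pendingRun
	sent    []pendingRun // firings that went out as command.run
}

func newDispatcher() *dispatcher {
	return &dispatcher{
		online:  map[string]bool{},
		pending: map[string][]pendingRun{},
	}
}

// fire dispatches immediately when the host is connected and
// queues the firing otherwise.
func (d *dispatcher) fire(r pendingRun) {
	if !d.online[r.HostID] {
		d.pending[r.HostID] = append(d.pending[r.HostID], r)
		return
	}
	d.sent = append(d.sent, r)
}

// reconnect marks the host online and drains its queue in FIFO order.
func (d *dispatcher) reconnect(hostID string) {
	d.online[hostID] = true
	d.sent = append(d.sent, d.pending[hostID]...)
	delete(d.pending, hostID)
}

func main() {
	d := newDispatcher()
	d.fire(pendingRun{HostID: "h1", ScheduleID: "nightly"}) // WS is down: queued
	fmt.Println("sent before reconnect:", len(d.sent))
	d.reconnect("h1")
	fmt.Println("sent after reconnect:", len(d.sent))
}
```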
|
||||
|
||||
For everything that isn't a backup (forget, prune, check), the
|
||||
server runs a 60-second maintenance ticker against
|
||||
`host_repo_maintenance` rows and dispatches the relevant command
|
||||
when a cadence is due. The agent's local cron only handles
|
||||
backups.
|
||||
@@ -0,0 +1,98 @@
|
||||
# Credentials and how they flow
|
||||
|
||||
restic-manager handles three credential surfaces:
|
||||
|
||||
1. **Operator credentials** — the username + password (or OIDC
|
||||
identity) that logs into the UI.
|
||||
2. **Agent bearer tokens** — issued at enrolment, used by the
|
||||
agent to authenticate its WebSocket to the server.
|
||||
3. **Repo credentials** — the rest-server / S3 / B2 / SFTP
|
||||
credentials the agent passes to `restic` itself.
|
||||
|
||||
Each has a different threat model and storage strategy.
|
||||
|
||||
## Operator credentials
|
||||
|
||||
- Local users are stored in `users` with a bcrypt password hash.
|
||||
- Sessions are random tokens minted at login, stored hashed in
|
||||
the `sessions` table, expired after 24h. Cookie is HttpOnly,
|
||||
SameSite=Lax, and Secure (when `RM_COOKIE_SECURE=true`,
|
||||
default).
|
||||
- OIDC users carry `auth_source='oidc'` and an `oidc_subject`
|
||||
pinning their IdP identity. Local password login is rejected
|
||||
for OIDC users.
|
||||
- Disabling a user soft-deletes them via `disabled_at` —
|
||||
pre-existing sessions are invalidated on the next request.
|
||||
|
||||
## Agent bearer tokens
|
||||
|
||||
- Minted at enrolment, hashed at rest with `auth.HashToken`.
|
||||
- The plaintext token only exists in memory at enrolment time
|
||||
and on the agent's filesystem (`/etc/restic-manager/agent.yaml`,
|
||||
mode `0600`, owned by the service user).
|
||||
- Compromise of the server DB leaks the hashes, which is enough
|
||||
to *log in as that agent* until you revoke. Compromise of the
|
||||
agent host leaks the plaintext (via the config file) — same
|
||||
end result.
|
||||
- Rotation: re-enrol the host. Today there's no in-place rotate;
|
||||
the operator deletes the host (which cascades, including
|
||||
revoking the bearer hash) and re-runs the install command.
|
||||
|
||||
## Repo credentials
|
||||
|
||||
This is the credential that ultimately matters for backup
|
||||
integrity. restic-manager keeps two slots per host:
|
||||
|
||||
- **The everyday credential** (`host_credentials.kind = ''`).
|
||||
Append-only-friendly: this is the one your backup schedule
|
||||
uses. It can write but not delete or forget.
|
||||
- **The admin credential** (`host_credentials.kind = 'admin'`).
|
||||
Has full delete rights. Only pushed to the agent transiently
|
||||
while a `prune` or `forget` job is dispatching, and discarded
|
||||
by the agent after the job ends.
|
||||
|
||||
### Encryption flow
|
||||
|
||||
1. Operator types the credential into the UI or the install form.
|
||||
2. Server AEAD-encrypts the cred (`crypto.AEAD.Encrypt`) using the
|
||||
key in `RM_SECRET_KEY_FILE`. The plaintext is dropped from
|
||||
memory.
|
||||
3. Encrypted blob is stored in `host_credentials.cred_blob`.
|
||||
4. When the agent connects, the server decrypts the blob and
|
||||
sends the **plaintext** down the WebSocket inside a
|
||||
`config.update` envelope.
|
||||
5. The agent stores the plaintext in its in-memory secrets store
|
||||
for the lifetime of the process; it's reloaded fresh on every
|
||||
server-side push.
|
||||
6. When a job runs, the agent merges the credential into the
|
||||
restic environment (`restic.Env.RepoURL` stays bare; the
|
||||
`user:pass@…` form is built only inside `envSlice()` at the
|
||||
moment of `exec.Command`).
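
Steps 2 to 4 amount to an AEAD seal on write and an AEAD open on agent connect. The source does not say which AEAD construction `crypto.AEAD` uses, so the sketch below assumes AES-256-GCM with a random nonce prepended to the ciphertext; the function names are illustrative, not the project's API.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newAEAD builds an AES-GCM AEAD; a 32-byte key selects AES-256.
func newAEAD(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// encrypt returns nonce || ciphertext, suitable for a blob column.
func encrypt(key, plaintext []byte) ([]byte, error) {
	aead, err := newAEAD(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return aead.Seal(nonce, nonce, plaintext, nil), nil
}

// decrypt splits the nonce off the blob and authenticates the rest.
func decrypt(key, blob []byte) ([]byte, error) {
	aead, err := newAEAD(key)
	if err != nil {
		return nil, err
	}
	n := aead.NonceSize()
	if len(blob) < n {
		return nil, errors.New("blob shorter than nonce")
	}
	return aead.Open(nil, blob[:n], blob[n:], nil)
}

func main() {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	blob, err := encrypt(key, []byte("rest-user:rest-pass"))
	if err != nil {
		panic(err)
	}
	plain, err := decrypt(key, blob)
	if err != nil {
		panic(err)
	}
	fmt.Printf("blob is %d bytes, round trip ok: %v\n", len(blob), string(plain) == "rest-user:rest-pass")
}
```

Because the tag is authenticated, a blob that was modified at rest fails to decrypt rather than yielding a corrupted credential.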
|
||||
|
||||
The merged form is **never logged**. Any URL that reaches the
|
||||
structured `slog` output is first passed through
|
||||
`restic.RedactURL()`.
|
||||
|
||||
### Why push plaintext over the wire?
|
||||
|
||||
The transport itself is the trust boundary: the WebSocket runs
|
||||
inside the same TLS-terminated reverse-proxy connection your
|
||||
browser uses, and the agent has already authenticated with its
|
||||
bearer token. Re-encrypting the payload on top of that would just
|
||||
move the key-management problem somewhere else.
|
||||
|
||||
If your reverse proxy isn't terminating TLS, the deployment is
|
||||
already broken — see [Hardening](../security/hardening.md).
|
||||
|
||||
## Setup tokens (admin-driven)
|
||||
|
||||
When an admin creates a new user, the server mints a one-time
|
||||
setup link valid for 1 hour. The hash is stored; the raw token
|
||||
is shown to the admin once. The user opens the link, sets a
|
||||
password, and is dropped into a session. Expired tokens are
|
||||
swept on the alert engine's 60s tick.
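
A minimal sketch of the store-the-hash, show-the-raw-token-once pattern, with the 1-hour validity from the text. The hash function is an assumption (SHA-256 here; the source does not say what the server uses), and `setupToken`, `mintSetupToken`, and `accepts` are hypothetical names.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"time"
)

// setupTTLSeconds is the 1-hour validity described above.
const setupTTLSeconds = 3600

// setupToken is what the server would persist: never the raw token.
type setupToken struct {
	Hash      string
	ExpiresAt int64 // Unix seconds
}

// hashToken hashes a raw token; SHA-256 is an assumption of this sketch.
func hashToken(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

// mintSetupToken returns the raw token (shown once) and the stored record.
func mintSetupToken(nowUnix int64) (string, setupToken, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", setupToken{}, err
	}
	raw := hex.EncodeToString(buf)
	return raw, setupToken{Hash: hashToken(raw), ExpiresAt: nowUnix + setupTTLSeconds}, nil
}

// accepts checks expiry first, then compares hashes in constant time.
func (s setupToken) accepts(raw string, nowUnix int64) bool {
	if nowUnix >= s.ExpiresAt {
		return false
	}
	return subtle.ConstantTimeCompare([]byte(hashToken(raw)), []byte(s.Hash)) == 1
}

func main() {
	now := time.Now().Unix()
	raw, stored, err := mintSetupToken(now)
	if err != nil {
		panic(err)
	}
	fmt.Println("valid now:", stored.accepts(raw, now))
	fmt.Println("valid in 2h:", stored.accepts(raw, now+7200))
}
```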
|
||||
|
||||
Same pattern for enrolment tokens: the raw token only exists in
|
||||
memory at mint time, and the install snippet is the operator's
|
||||
only chance to capture it. If you lose it, regenerate via the
|
||||
**Add host** page (NS-02).
|
||||
@@ -0,0 +1,85 @@
|
||||
# Repo maintenance
|
||||
|
||||
Backups go in; without maintenance, repos grow forever and
|
||||
eventually fall over. restic-manager runs three maintenance
|
||||
operations on a per-host cadence:
|
||||
|
||||
| Command | What it does | Default cadence |
|
||||
|----------|-------------------------------------------------------------|-----------------|
|
||||
| `forget` | Marks snapshots eligible for removal per the retention policy attached to each source group. Cheap; runs append-only. | Daily after the last backup of the day |
|
||||
| `prune` | Reclaims space from the repo. Requires the **admin** credential (write+delete). | Weekly, off-peak |
|
||||
| `check` | Verifies repo integrity. Sub-options surface lock state. | Weekly, with `--read-data-subset N%` to sample pack files |
|
||||
|
||||
A per-host row in `host_repo_maintenance` holds the
|
||||
cron expressions and last-fire anchors. The maintenance ticker on
|
||||
the server runs every 60s, finds hosts whose next-fire is due,
|
||||
and dispatches the right command. The agent's local cron is
|
||||
**only** for backups.
|
||||
|
||||
## Why server-side and not agent-side?
|
||||
|
||||
The agent's cron knows about backups because backups are
|
||||
per-source-group. Maintenance is per-repo, not per-source-group,
|
||||
so doing it server-side keeps the per-host wiring simple:
|
||||
|
||||
- One ticker, not N agent crons to keep in sync.
|
||||
- Cancelling a maintenance dispatch is just "don't dispatch the
|
||||
next one" — no agent-side state to clean up.
|
||||
- Skipping offline hosts is trivial (no queue; only scheduled
|
||||
*backups* queue into `pending_runs`).
|
||||
|
||||
## Forget and the multi-group payload
|
||||
|
||||
A single `forget` job can target several source groups at once.
|
||||
The wire envelope (`ForgetGroups`) carries one entry per group,
|
||||
each with its retention policy. The agent runs N
|
||||
`restic forget --tag <name> --keep-...` invocations in sequence,
|
||||
streams their output, and reports a single terminal status.
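
As a rough illustration, each group entry could be turned into one `restic forget` argument list like this. The `--tag` and `--keep-*` flags are restic's own; the `retention` and `forgetGroup` types are hypothetical stand-ins for the `ForgetGroups` wire payload, whose real field names are not shown in this document.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// retention is a hypothetical mirror of a source group's policy.
type retention struct {
	KeepLast, KeepHourly, KeepDaily, KeepWeekly, KeepMonthly, KeepYearly int
}

// forgetGroup is one entry of the multi-group payload.
type forgetGroup struct {
	Name   string
	Policy retention
}

// forgetArgs builds the argument list for one restic forget call.
// Dimensions left at zero are omitted.
func forgetArgs(g forgetGroup) []string {
	args := []string{"forget", "--tag", g.Name}
	add := func(flag string, n int) {
		if n > 0 {
			args = append(args, flag, strconv.Itoa(n))
		}
	}
	add("--keep-last", g.Policy.KeepLast)
	add("--keep-hourly", g.Policy.KeepHourly)
	add("--keep-daily", g.Policy.KeepDaily)
	add("--keep-weekly", g.Policy.KeepWeekly)
	add("--keep-monthly", g.Policy.KeepMonthly)
	add("--keep-yearly", g.Policy.KeepYearly)
	return args
}

func main() {
	groups := []forgetGroup{
		{Name: "data", Policy: retention{KeepLast: 7, KeepDaily: 14, KeepWeekly: 4, KeepMonthly: 6}},
		{Name: "system", Policy: retention{KeepDaily: 7}},
	}
	// One invocation per group, run in sequence.
	for _, g := range groups {
		fmt.Println("restic", strings.Join(forgetArgs(g), " "))
	}
}
```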
|
||||
|
||||
## Prune and the admin credential
|
||||
|
||||
Prune mutates the repo. The everyday append-only credential
|
||||
**cannot** prune — that's the whole point of append-only.
|
||||
restic-manager keeps a second slot per host (`kind = 'admin'`)
|
||||
for the credential that can.
|
||||
|
||||
When a prune is dispatched (cadence-driven or operator-driven):
|
||||
|
||||
1. Server pushes the admin credential to the agent in a fresh
|
||||
`config.update`.
|
||||
2. Agent runs `restic prune` with the merged credential.
|
||||
3. Job finishes; agent discards the admin credential from its
|
||||
in-memory secrets store.
|
||||
|
||||
The server never logs the merged URL (see
|
||||
[Credentials](./credentials.md)).
|
||||
|
||||
## Check and lock state
|
||||
|
||||
`restic check` warns about stale locks when it finds them. The
|
||||
agent ships every check's output back as a `repo.stats` envelope
|
||||
and a stream of log lines; if a stale lock is detected, the
|
||||
**Repo** page surfaces a banner with an **Unlock** button. The
|
||||
operator-only `unlock` command runs `restic unlock` and clears
|
||||
the banner.
|
||||
|
||||
`unlock` has no cadence — it's a manual action, never automatic.
|
||||
Auto-unlocking would mask the cause (probably a previously
|
||||
crashed long-running operation) and risk corrupting an
|
||||
operation the operator has merely lost track of.
|
||||
|
||||
## Repo stats
|
||||
|
||||
After every backup, check, prune, and unlock, the agent runs
|
||||
`restic stats --json --mode raw-data` and ships the result as a
|
||||
`repo.stats` envelope. The server stores this in
|
||||
`host_repo_stats` (latest only) and `host_repo_stats_history`
|
||||
(one row per host per day, last-write-wins per column — a
|
||||
prune-only patch never nulls a backup-time size).
|
||||
|
||||
The host detail page surfaces:
|
||||
|
||||
- Total size + raw size in the vitals strip.
|
||||
- Last-check timestamp + colour-coded status.
|
||||
- Last-prune timestamp.
|
||||
- 30/90-day repo size trend chart.
|
||||
@@ -0,0 +1,105 @@
|
||||
# Schedules and source groups
|
||||
|
||||
Two related but separable ideas:
|
||||
|
||||
- A **source group** is a named bundle of "what to back up":
|
||||
include paths, exclude patterns, retention policy, retry
|
||||
configuration, optional pre/post hooks. The group's name is
|
||||
used as the restic snapshot tag, so retention can target it
|
||||
with `restic forget --tag <name>`.
|
||||
- A **schedule** is a cron expression that, when it fires,
|
||||
triggers a backup of one or more source groups on a host.
|
||||
|
||||
Decoupling them means you can have one schedule covering several
|
||||
groups (e.g. `0 1 * * *` running both `system` and `data`), and
|
||||
each group has its own retention without duplicating policy
|
||||
across schedules.
|
||||
|
||||
## Source group anatomy
|
||||
|
||||
```yaml
name: data
includes:
  - /var/lib/postgresql
  - /home
excludes:
  - /home/*/.cache
  - /home/*/Downloads
retention:
  keep_last: 7
  keep_daily: 14
  keep_weekly: 4
  keep_monthly: 6
retry_max: 3
retry_backoff_seconds: 600
pre_hook: |
  pg_dump -U postgres -F c -f /var/lib/postgresql/dumps/all.dump
post_hook: |
  rm -f /var/lib/postgresql/dumps/all.dump
```
|
||||
|
||||
### Conflict detection
|
||||
|
||||
If your retention policy says `keep_hourly: 24` but no schedule
|
||||
points at this group sub-daily, the UI surfaces a
|
||||
**conflict-dimension banner** ("`hourly` won't be honoured —
|
||||
no schedule fires more often than once a day"). The flag is
|
||||
stored on the source group (`conflict_dimension`) and refreshed
|
||||
whenever a schedule or group changes.
|
||||
|
||||
### Hooks
|
||||
|
||||
`pre_hook` and `post_hook` run on the agent host inside
|
||||
`/bin/sh -c` (`cmd.exe /C` on Windows). Output is streamed back
|
||||
to the live job log as `hook(<phase>): …` lines.
|
||||
|
||||
- A non-zero `pre_hook` exit aborts the backup.
|
||||
- `post_hook` always runs, with `RM_JOB_STATUS=succeeded|failed`
|
||||
in the environment. Use this for cleanup that must happen
|
||||
whether the backup worked or not.
|
||||
- Hooks only run for `kind=backup` jobs. They do not run for
|
||||
`forget`, `prune`, `check`, etc.
|
||||
- Hook scripts are stored AEAD-encrypted (encrypted at the HTTP layer); the agent receives
|
||||
plaintext over the WS channel.
|
||||
|
||||
A "host default" pair of hooks lives on the host itself; a
|
||||
source group's own hooks override them when set.
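
The hook rules above could be sketched as follows. `hookCmd` and `runJob` are hypothetical helpers, not the agent's real functions, and the sketch assumes that `post_hook` also runs when `pre_hook` aborts the backup (the text says it "always runs" but does not spell out that case).

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"runtime"
)

// hookCmd wraps a hook script in the platform shell. A non-empty
// jobStatus is exported as RM_JOB_STATUS (post_hook only).
func hookCmd(script, jobStatus string) *exec.Cmd {
	var cmd *exec.Cmd
	if runtime.GOOS == "windows" {
		cmd = exec.Command("cmd.exe", "/C", script)
	} else {
		cmd = exec.Command("/bin/sh", "-c", script)
	}
	cmd.Env = os.Environ()
	if jobStatus != "" {
		cmd.Env = append(cmd.Env, "RM_JOB_STATUS="+jobStatus)
	}
	return cmd
}

// runJob runs pre_hook, the backup, then post_hook, and returns the
// final job status plus any error from post_hook itself.
func runJob(preHook, postHook string, backup func() error) (status string, postErr error) {
	status = "succeeded"
	if preHook != "" {
		if err := hookCmd(preHook, "").Run(); err != nil {
			status = "failed" // non-zero pre_hook exit aborts the backup
		}
	}
	if status == "succeeded" {
		if err := backup(); err != nil {
			status = "failed"
		}
	}
	if postHook != "" {
		postErr = hookCmd(postHook, status).Run()
	}
	return status, postErr
}

func main() {
	status, err := runJob("exit 1", "", func() error { return nil })
	fmt.Println(status, err)
}
```

The sketch omits output streaming; the real agent forwards hook output to the job log as `hook(<phase>): …` lines.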
|
||||
|
||||
## Schedule anatomy
|
||||
|
||||
```yaml
cron: "0 2 * * *"
enabled: true
source_group_ids:
  - <gid for "data">
  - <gid for "system">
```
|
||||
|
||||
Slim by design: a schedule says **when** and **which groups**.
|
||||
Everything else (paths, retention, hooks) lives on the groups.
|
||||
|
||||
The agent's local cron fires the schedule. If the WebSocket is
|
||||
down at fire time, the server queues the firing into
|
||||
`pending_runs` and drains it on the next agent reconnect — a
|
||||
short network blip won't lose the backup.
|
||||
|
||||
### Last / next run
|
||||
|
||||
The schedules tab shows "next" (computed by parsing the cron
|
||||
expression with `robfig/cron/v3`) and "last" (the latest
|
||||
`actor_kind=schedule` job in the `jobs` table) for every
|
||||
schedule. The dashboard host row also surfaces `next 12h ago/from
|
||||
now` when a single covering schedule is the run-now candidate.
|
||||
|
||||
## Bandwidth limits
|
||||
|
||||
Two places set restic's `--limit-upload` / `--limit-download`:
|
||||
|
||||
1. **Host-wide caps** on the host row (`bandwidth_up_kbps`,
|
||||
`bandwidth_down_kbps`). Pushed to the agent on hello and
|
||||
after `PUT /api/hosts/{id}/bandwidth`. Apply to every restic
|
||||
invocation on the host.
|
||||
2. **Per-job overrides** on the per-source-group Run-now form.
|
||||
Win over host caps for the lifetime of that one job.
|
||||
|
||||
If neither is set, restic runs unthrottled.
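
The precedence rule can be sketched in a few lines. `--limit-upload` and `--limit-download` are real restic flags (restic interprets the values as KiB/s); the function names are hypothetical, and treating `0` as "not set" is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strconv"
)

// effective picks the per-job override when set (> 0), else the host cap.
func effective(hostCap, jobOverride int) int {
	if jobOverride > 0 {
		return jobOverride
	}
	return hostCap
}

// limitArgs returns the restic flags for one job; an empty slice
// means restic runs unthrottled.
func limitArgs(hostUp, hostDown, jobUp, jobDown int) []string {
	var args []string
	if up := effective(hostUp, jobUp); up > 0 {
		args = append(args, "--limit-upload", strconv.Itoa(up))
	}
	if down := effective(hostDown, jobDown); down > 0 {
		args = append(args, "--limit-download", strconv.Itoa(down))
	}
	return args
}

func main() {
	fmt.Println(limitArgs(1000, 500, 0, 0))   // host caps only
	fmt.Println(limitArgs(1000, 500, 200, 0)) // job overrides upload
	fmt.Println(limitArgs(0, 0, 0, 0))        // unthrottled
}
```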
|
||||
@@ -0,0 +1,17 @@
|
||||
# Contributing
|
||||
|
||||
Full contributor guide:
|
||||
[`CONTRIBUTING.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/CONTRIBUTING.md)
|
||||
in the repository root.
|
||||
|
||||
The short version:
|
||||
|
||||
- Open an issue first for non-trivial changes; the design is
|
||||
still moving and unsolicited large PRs may conflict with
|
||||
in-flight work.
|
||||
- `make lint test` must pass.
|
||||
- One logical change per commit, no `Co-Authored-By` trailers.
|
||||
- UK English in identifiers and comments; comments explain the
|
||||
**why** not the **what**.
|
||||
|
||||
Code of conduct: [`CODE_OF_CONDUCT.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/CODE_OF_CONDUCT.md).
|
||||
@@ -0,0 +1,113 @@
|
||||
# Enrolling your first host
|
||||
|
||||
The control plane only knows about hosts you've explicitly
|
||||
enrolled. Two paths exist:
|
||||
|
||||
1. **Token-based enrolment** — admin generates a token, pastes it
|
||||
into an install command on the host. The host appears immediately,
|
||||
already mapped to the desired repo.
|
||||
2. **Announce-and-approve** — the agent runs without a token,
|
||||
"announces" itself to the server, and a human in the UI accepts
|
||||
the announcement.
|
||||
|
||||
Token-based is the default and what most operators want; the
|
||||
announce flow exists for the case where you can't easily paste a
|
||||
secret onto the host (auto-imaged endpoints, scripted bring-ups
|
||||
from a config repo).
|
||||
|
||||
## Token-based enrolment
|
||||
|
||||
### From the UI
|
||||
|
||||
1. Click **+ Add host** on the dashboard.
|
||||
2. Fill in the hostname, the restic repo URL, and the repo
|
||||
credentials. The credentials are AEAD-encrypted at the server
|
||||
immediately; what you paste is what the agent receives.
|
||||
3. Optionally pick the initial source paths — these become the
|
||||
first source group on the host.
|
||||
4. Submit. The server mints a one-time token and shows you a
|
||||
copy-pasteable install snippet.
|
||||
|
||||
### On the host (Linux)
|
||||
|
||||
```sh
curl -fsSL https://restic.example.com/install/install.sh | \
  sudo RM_SERVER=https://restic.example.com \
       RM_ENROL_TOKEN=<token> \
       bash
```
|
||||
|
||||
The script:
|
||||
|
||||
1. Detects architecture (`amd64` or `arm64`).
|
||||
2. Downloads the agent binary from `/agent/binary?os=…&arch=…`.
|
||||
3. Drops the systemd unit at
|
||||
`/etc/systemd/system/restic-manager-agent.service`.
|
||||
4. Runs the agent in `-enrol` mode, which posts the token and
|
||||
stores the persistent bearer it gets back.
|
||||
5. Enables and starts the unit.
|
||||
|
||||
Within seconds the host should appear on the dashboard as
|
||||
**online**.
|
||||
|
||||
### On the host (Windows)
|
||||
|
||||
```pwsh
$env:RM_SERVER = "https://restic.example.com"
$env:RM_ENROL_TOKEN = "<token>"
iwr -useb $env:RM_SERVER/install/install.ps1 | iex
```
|
||||
|
||||
Equivalent shape: registers a Windows service via the SCM
|
||||
(see P2-16 for details), runs `-enrol`, starts the service.
|
||||
|
||||
## Recovering a lost token
|
||||
|
||||
Tokens are single-use and short-lived (1h). If you closed the tab
|
||||
before pasting the install command, head to the **Add host** page —
|
||||
outstanding tokens are listed there with a **Regenerate** button.
|
||||
Regenerating revokes the old token's hash and mints a fresh raw
|
||||
token while preserving the original repo credentials and initial
|
||||
paths. (NS-02 in `tasks.md` if you want the design rationale.)
|
||||
|
||||
## Announce-and-approve
|
||||
|
||||
If the host can reach the server but you don't want to paste a
|
||||
secret on it, run the agent in `-announce` mode:
|
||||
|
||||
```sh
restic-manager-agent -announce \
  -server https://restic.example.com \
  -hostname myhost
```
|
||||
|
||||
The host appears in the **Pending hosts** panel on the dashboard
|
||||
with its hostname, OS, arch, and the source IP that announced it.
|
||||
Click **Accept**, fill in the repo URL + credentials, and the
|
||||
server pushes the bearer over the still-open WebSocket. No
|
||||
back-and-forth round trip.
|
||||
|
||||
If you don't accept within an hour the announcement is swept.
|
||||
|
||||
## What happens on the agent
|
||||
|
||||
After enrolment, the agent:
|
||||
|
||||
1. Connects via WebSocket to `/ws/agent` with its bearer token.
|
||||
2. Sends a `hello` envelope with its OS, arch, agent version,
|
||||
restic version, and protocol version.
|
||||
3. Receives a `config.update` carrying its repo
|
||||
credentials and any source-group paths.
|
||||
4. Sits idle, sending a heartbeat every 30s. Operator-driven
|
||||
"Run now" actions arrive as `command.run` envelopes; scheduled
|
||||
jobs are driven by the agent's local cron.
|
||||
|
||||
## Auto-init of the repository
|
||||
|
||||
The first time a backup runs, the agent invokes `restic init`
|
||||
against the repo you configured at enrolment. If the repo already
|
||||
exists (`config file already exists`) the agent treats it as a
|
||||
success and proceeds. The host's repo status (`unknown` →
|
||||
`ready` / `init_failed`) is surfaced under the vitals strip on
|
||||
the host detail page; if init fails, save fresh credentials in
|
||||
the **Repo** tab to retry.
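
The "already exists counts as success" rule might look like this. The status strings `ready` and `init_failed` and the `config file already exists` message come from the text above; `initStatus` and `errExit` are hypothetical names, and matching on stderr text is an assumption about how the agent detects the case.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errExit stands in for the error os/exec returns on a non-zero exit.
var errExit = errors.New("exit status 1")

// initStatus maps the outcome of `restic init` to a repo status.
func initStatus(runErr error, stderr string) string {
	if runErr == nil {
		return "ready"
	}
	// An existing repo is not a failure: proceed with the backup.
	if strings.Contains(stderr, "config file already exists") {
		return "ready"
	}
	return "init_failed"
}

func main() {
	fmt.Println(initStatus(nil, ""))
	fmt.Println(initStatus(errExit, "Fatal: create repository failed: config file already exists"))
	fmt.Println(initStatus(errExit, "Fatal: unable to open repository: permission denied"))
}
```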
|
||||
@@ -0,0 +1,92 @@
|
||||
# Installing the server
|
||||
|
||||
The reference deployment is a single Docker container fronted by
|
||||
your existing reverse proxy. The image bundles the server binary,
|
||||
the cross-compiled agent binaries, and the install scripts.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- A Linux host with Docker and Docker Compose.
|
||||
- A reverse proxy in front (Caddy, nginx, Traefik) terminating
|
||||
TLS on a public hostname. The server itself is HTTP-only by
|
||||
design — see [Reverse proxy](./reverse-proxy.md) for why.
|
||||
- A persistent volume for the server's data directory.
|
||||
|
||||
## Quick start
|
||||
|
||||
The reference compose file lives at
|
||||
[`deploy/docker-compose.yml`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/deploy/docker-compose.yml):
|
||||
|
||||
```yaml
services:
  restic-manager:
    image: gitea.dcglab.co.uk/steve/restic-manager:${RM_VERSION:-latest}
    restart: unless-stopped
    environment:
      RM_LISTEN: ":8080"
      RM_DATA_DIR: "/data"
      RM_BASE_URL: "https://restic.example.com"
      # Trust your reverse proxy's CIDR so X-Forwarded-* are honoured.
      RM_TRUSTED_PROXY: "10.0.0.0/8"
    volumes:
      - rm-data:/data
    ports:
      # Bind localhost only — your reverse proxy is the public face.
      - "127.0.0.1:8080:8080"

volumes:
  rm-data:
```
|
||||
|
||||
Bring it up:
|
||||
|
||||
```sh
docker compose up -d
docker compose logs -f restic-manager
```
|
||||
|
||||
The first run prints a one-time **bootstrap token** to the log. Use
|
||||
it within an hour or it expires; if you miss the window the
|
||||
container prints it again on next start as long as no admin user
|
||||
exists.
|
||||
|
||||
## First-run admin setup
|
||||
|
||||
Open `https://restic.example.com/bootstrap` (or whatever your
|
||||
public URL is). Paste the bootstrap token, pick a username and a
|
||||
password (≥ 12 characters), and submit. You'll land in the
|
||||
dashboard logged in as the new admin.
|
||||
|
||||
If you'd rather curl it, the equivalent is:
|
||||
|
||||
```sh
curl -X POST https://restic.example.com/api/bootstrap \
  -H 'Content-Type: application/json' \
  -d '{"token":"<token-from-log>","username":"admin","password":"<≥12 chars>"}'
```
|
||||
|
||||
## Backing up the secret key
|
||||
|
||||
Inside the data volume, `secret.key` holds the AEAD key used to
|
||||
encrypt every credential at rest. **Back it up separately from
|
||||
the database.** Without it, encrypted credentials in the database
|
||||
are unrecoverable; you'd have to re-enrol every host.
|
||||
|
||||
A simple working approach: copy `secret.key` to your password
|
||||
manager or to a separately-backed-up secrets vault the day you
|
||||
install. It doesn't change.
|
||||
|
||||
## Updating the server
|
||||
|
||||
```sh
# Pin a new version in your compose file (.env or docker-compose.yml),
# then:
docker compose pull
docker compose up -d
```
|
||||
|
||||
Migrations run automatically on startup; the server will refuse to
|
||||
start if a migration fails (better to bail than to half-migrate).
|
||||
|
||||
For the agent self-update story, see
|
||||
[Updating agents](../operations/updates.md).
|
||||
@@ -0,0 +1,95 @@
|
||||
# Running behind a reverse proxy
|
||||
|
||||
The restic-manager server is HTTP-only by design. TLS termination,
|
||||
public hostname, ACME, HSTS, and edge-level rate limiting all
|
||||
belong to a reverse proxy you already operate outside this project.
|
||||
|
||||
## What the proxy must forward
|
||||
|
||||
The server reads four headers when (and only when) the immediate
|
||||
peer matches `RM_TRUSTED_PROXY`:
|
||||
|
||||
| Header | Value | Why |
|
||||
|------------------------|----------------------------------------------------|-----|
|
||||
| `X-Forwarded-For` | The original client IP | Rate-limit keys, audit log entries, OIDC redirect-URI checks. |
|
||||
| `X-Forwarded-Proto` | `https` | Used for absolute URLs (e.g. OIDC redirect URIs). |
|
||||
| `Host` | The public hostname clients use | Cookies are scoped to this; `RM_BASE_URL` must match. |
|
||||
| `Connection` / `Upgrade` | Pass through unchanged | `/ws/agent` and `/api/jobs/{id}/stream` are WebSockets; without `Upgrade: websocket` they fail. |
|
||||
|
||||
Set `RM_TRUSTED_PROXY` to the CIDR (or comma-separated list of
|
||||
CIDRs) the proxy connects from. Anything outside that range has
|
||||
its `X-Forwarded-*` headers ignored, so a stray request that
|
||||
bypasses the proxy can't spoof the client IP.
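
The trust check can be sketched as below. This is not the server's actual middleware: `parseTrusted` and `clientIP` are hypothetical names, the peer is passed as a bare IP (a real handler would first strip the port from `RemoteAddr`), and taking the right-most `X-Forwarded-For` entry is an assumption that fits a single proxy hop.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// parseTrusted parses a comma-separated CIDR list such as the
// value of RM_TRUSTED_PROXY.
func parseTrusted(list string) ([]netip.Prefix, error) {
	var out []netip.Prefix
	for _, s := range strings.Split(list, ",") {
		s = strings.TrimSpace(s)
		if s == "" {
			continue
		}
		p, err := netip.ParsePrefix(s)
		if err != nil {
			return nil, fmt.Errorf("bad CIDR %q: %w", s, err)
		}
		out = append(out, p)
	}
	return out, nil
}

// clientIP honours X-Forwarded-For only when the immediate peer is
// inside a trusted range; otherwise the peer address is used.
func clientIP(peer, xff string, trusted []netip.Prefix) string {
	addr, err := netip.ParseAddr(peer)
	if err != nil {
		return peer
	}
	addr = addr.Unmap()
	for _, p := range trusted {
		if !p.Contains(addr) {
			continue
		}
		// The right-most entry is the one the trusted proxy appended.
		parts := strings.Split(xff, ",")
		last := strings.TrimSpace(parts[len(parts)-1])
		if fwd, err := netip.ParseAddr(last); err == nil {
			return fwd.String()
		}
		break
	}
	return addr.String()
}

func main() {
	trusted, err := parseTrusted("10.0.0.0/8, 192.168.1.0/24")
	if err != nil {
		panic(err)
	}
	fmt.Println(clientIP("10.1.2.3", "203.0.113.7", trusted)) // via proxy
	fmt.Println(clientIP("8.8.8.8", "203.0.113.7", trusted))  // bypassed proxy
}
```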
|
||||
|
||||
## Caddy
|
||||
|
||||
```caddyfile
restic.example.com {
    encode zstd gzip
    reverse_proxy 127.0.0.1:8080 {
        header_up X-Real-IP {remote_host}
    }
}
```
|
||||
|
||||
Caddy adds `X-Forwarded-For` / `X-Forwarded-Proto` automatically
|
||||
and passes WebSocket headers through by default, so this is the
|
||||
whole config.
|
||||
|
||||
## nginx
|
||||
|
||||
```nginx
server {
    listen 443 ssl http2;
    server_name restic.example.com;

    ssl_certificate     /etc/letsencrypt/live/restic.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/restic.example.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;

        # WebSocket upgrade
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";

        # Long-lived agent WS: raise the read timeout to 24h for this surface.
        proxy_read_timeout 86400s;
    }
}
```
|
||||
|
||||
## Traefik
|
||||
|
||||
```yaml
http:
  routers:
    restic-manager:
      rule: "Host(`restic.example.com`)"
      entryPoints: [websecure]
      tls:
        certResolver: letsencrypt
      service: restic-manager

  services:
    restic-manager:
      loadBalancer:
        servers:
          - url: "http://restic-manager:8080"
        passHostHeader: true
```
|
||||
|
||||
Traefik forwards WebSocket upgrades and the standard
|
||||
`X-Forwarded-*` set out of the box.
|
||||
|
||||
## Verification
|
||||
|
||||
After bringing the proxy up, the audit log should show your real
|
||||
client IP for an interactive login (not the proxy's local
|
||||
address). If you see `127.0.0.1` or the proxy's container IP, your
|
||||
`RM_TRUSTED_PROXY` is wrong or `X-Forwarded-For` isn't being
|
||||
forwarded.
|
||||
@@ -0,0 +1,86 @@
|
||||
# restic-manager
|
||||
|
||||
restic-manager is a self-hosted, browser-based, single-pane-of-glass
|
||||
for managing [restic](https://restic.net) backups across a fleet of
|
||||
Linux and Windows endpoints. It's designed for **small fleets** —
|
||||
the original target was twelve endpoints — and **one operator**.
|
||||
|
||||
## What it does
|
||||
|
||||
- Centralised view of every endpoint's last backup, repo size,
|
||||
snapshot count, and recent jobs.
|
||||
- Trigger any restic operation remotely (`backup`, `forget`, `prune`,
|
||||
`check`, `unlock`, `snapshots`, `stats`, `diff`, `restore`).
|
||||
- Per-host backup schedules with source groups (named bundles of
|
||||
paths + retention policy).
|
||||
- Live job log streamed to the browser; downloadable as text or NDJSON.
|
||||
- Restore wizard with snapshot tree browse + path selection.
|
||||
- Repo-level health surfacing (size, raw size, last-check, lock
|
||||
state) plus a 30/90-day size trend.
|
||||
- Alerting over webhook, ntfy, or SMTP.
|
||||
- Cross-platform agent (Linux + Windows).
|
||||
- Append-only-credential-friendly with a separate admin credential
|
||||
for forget/prune.
|
||||
|
||||
## What it isn't
|
||||
|
||||
- **Not a SaaS.** Single-instance, single-tenant, by design.
|
||||
- **Not a replacement for restic** — it's a control plane. The agent
|
||||
shells out to a real `restic` binary.
|
||||
- **Not highly available.** SQLite, single process; if you need
|
||||
HA backups, you're shopping in the wrong aisle.
|
||||
- **Not a multi-protocol backup tool.** restic only.
|
||||
|
||||
## How it fits together
|
||||
|
||||
```
┌──────────────────────────────────────────────┐
│  Server (control plane, Docker)              │
│    - REST + WebSocket API                    │
│    - SQLite store                            │
│    - Embedded HTMX UI                        │
└──────────┬─────────────────────────┬─────────┘
           │ outbound WS             │ HTTP(S)
           │                         │
┌──────────▼──────────┐   ┌──────────▼─────────┐
│  Agent (per host)   │   │ Browser (operator) │
│   - restic wrapper  │   └────────────────────┘
│   - cron for sched. │
└──────────┬──────────┘
           │ restic
┌──────────▼──────────────────────────────────┐
│  rest-server / S3 / SFTP / local repo       │
│  (the actual backup data — server never     │
│   touches it)                               │
└─────────────────────────────────────────────┘
```
|
||||
|
||||
The control plane is a Go binary that runs in Docker. Each endpoint
|
||||
runs a small Go agent that holds an outbound WebSocket to the
|
||||
control plane. Backup data flows directly between the agent and the
|
||||
restic repository — the control plane never sees a snapshot byte.
|
||||
|
||||
## Where to start
|
||||
|
||||
- [Installing the server](./getting-started/install.md) walks
|
||||
through the Docker-based reference deployment.
|
||||
- [Enrolling your first host](./getting-started/enrolling-hosts.md)
|
||||
covers the install scripts and the announce-and-approve flow.
|
||||
- [Architecture](./concepts/architecture.md) is the right read if
|
||||
you want to know why something is the way it is before running
|
||||
the install.
|
||||
|
||||
## Project status
|
||||
|
||||
Pre-1.0 but feature-complete for the original use case. Phases
|
||||
0–4 are landed (MVP, scheduling, restore, RBAC + OIDC); Phase 5
|
||||
(this docs site, contributor onboarding, end-to-end CI) is in
|
||||
flight. See [`tasks.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/tasks.md)
|
||||
for the live roadmap and [`spec.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/spec.md)
|
||||
for the canonical design doc.
|
||||
|
||||
## License
|
||||
|
||||
[PolyForm Noncommercial 1.0.0](https://polyformproject.org/licenses/noncommercial/1.0.0/).
|
||||
Personal and community deployments welcome; commercial use
|
||||
requires a separate license.
|
||||
@@ -0,0 +1,39 @@
|
||||
# License
|
||||
|
||||
restic-manager is licensed under
|
||||
[**PolyForm Noncommercial 1.0.0**](https://polyformproject.org/licenses/noncommercial/1.0.0/).
|
||||
The full text lives at
|
||||
[`LICENSE`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/LICENSE)
|
||||
in the repository root.
|
||||
|
||||
## What this means
|
||||
|
||||
- **Personal, hobbyist, educational, charitable, and similar
|
||||
noncommercial use** is fully permitted, including modification
|
||||
and redistribution.
|
||||
- **Commercial use is not permitted** without a separate
|
||||
license. The maintainer is not currently offering one — if
|
||||
you need commercial rights, open an issue to start the
|
||||
conversation.
|
||||
- The license is permissive about everything except commercial
|
||||
use: you can fork, modify, deploy in your home/lab, and
|
||||
contribute back.
|
||||
|
||||
## Why this license
|
||||
|
||||
The PolyForm Noncommercial license was chosen because:
|
||||
|
||||
- It's a real, legal, plainly-worded license (not a custom
|
||||
half-written variant).
|
||||
- It permits the realistic uses for a hobby project (the
|
||||
maintainer's homelab, a friend's fleet, a charity's IT
|
||||
closet) without inviting commercial vendors to repackage
|
||||
the work.
|
||||
- It's compatible with the project staying small and
|
||||
maintainable — the maintainer doesn't want to be on the hook
|
||||
for SLA-grade commercial support.
|
||||
|
||||
## Contributions
|
||||
|
||||
By contributing, you agree your contributions are licensed
|
||||
under the same PolyForm Noncommercial 1.0.0 license.
|
||||
@@ -0,0 +1,73 @@
|
||||
# Alerts and notifications
|
||||
|
||||
restic-manager raises alerts on conditions that need human
|
||||
attention. The alert engine evaluates rules on a 60s tick and
|
||||
on every job-finished / host-online event.
|
||||
|
||||
## Built-in alert kinds
|
||||
|
||||
| Kind | Trigger | Severity |
|---------------------|---------|----------|
| `backup_failed` | A backup job ends in `failed` or `cancelled` | warning |
| `forget_failed` | A forget job ends in `failed` | warning |
| `prune_failed` | A prune job ends in `failed` | critical |
| `check_failed` | A check job ends in `failed` | critical |
| `agent_offline` | A host has been offline more than 90s past its heartbeat cadence | warning |
| `stale_schedule` | A schedule's "last run" is more than 1.5 × its interval ago | warning |
| `update_failed` | An agent self-update reported a failure or the agent didn't reconnect within 90s | warning |
| `fleet_update_halted`| The rolling fleet-update worker stopped on a failure | critical |
|
||||
|
||||
Each alert has a `dedup_key` so re-firing the same condition
|
||||
just bumps `last_seen_at` — the operator gets one row per
|
||||
condition, not a thousand.
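
The dedup behaviour can be sketched in a few lines. This is a minimal in-memory illustration, not the project's code: the `Alert` class, the `raise_alert` function, and the `kind:host` key format are assumptions made for the example; only `dedup_key` and `last_seen_at` come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    dedup_key: str
    kind: str
    first_seen_at: float
    last_seen_at: float


def raise_alert(store: dict, kind: str, host: str, now: float) -> Alert:
    """Raise an alert, or bump last_seen_at if the same condition is already open."""
    key = f"{kind}:{host}"  # hypothetical key format
    existing = store.get(key)
    if existing is not None:
        existing.last_seen_at = now  # re-fire: same row, newer timestamp
        return existing
    alert = Alert(dedup_key=key, kind=kind, first_seen_at=now, last_seen_at=now)
    store[key] = alert
    return alert


store = {}
raise_alert(store, "agent_offline", "host-a", now=100.0)
raise_alert(store, "agent_offline", "host-a", now=160.0)  # same condition again
raise_alert(store, "backup_failed", "host-a", now=170.0)
print(len(store))  # one row per distinct condition
```

Keying the store on the dedup key is what keeps a flapping condition from producing a new row on every tick.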
|
||||
|
||||
## Lifecycle
|
||||
|
||||
```
raised ──acknowledge──▶ acknowledged ──resolve──▶ resolved
│ │
└────────auto-resolve──────┘
(e.g. agent_offline auto-resolves on agent_online)
```
|
||||
|
||||
- **Acknowledge** says "I've seen this, stop notifying about it".
|
||||
- **Resolve** says "the underlying condition is gone".
|
||||
- Some alerts auto-resolve when the condition clears
|
||||
(`agent_offline` is the canonical example).
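
As a rough sketch, the lifecycle can be modelled as a small transition table. The state and event names mirror the diagram; the assumption that auto-resolve may fire from either open state is mine, and `next_state` is a hypothetical helper rather than anything in the codebase.

```python
TRANSITIONS = {
    ("raised", "acknowledge"): "acknowledged",
    ("acknowledged", "resolve"): "resolved",
    # Assumption: auto-resolve may fire from either open state.
    ("raised", "auto_resolve"): "resolved",
    ("acknowledged", "auto_resolve"): "resolved",
}


def next_state(state: str, event: str) -> str:
    """Return the state reached by applying event, or raise on an invalid move."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state!r}") from None


state = next_state("raised", "acknowledge")
state = next_state(state, "auto_resolve")
print(state)
```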
|
||||
|
||||
## Notification channels
|
||||
|
||||
Configure under **Settings → Notifications**. Each channel can
|
||||
subscribe to all alerts or filter by severity.
|
||||
|
||||
### Webhook
|
||||
|
||||
Posts a JSON envelope to a URL of your choice. Useful for
|
||||
piping into Slack via an Incoming Webhook URL or into your own
|
||||
alerting tooling.
|
||||
|
||||
### ntfy
|
||||
|
||||
Pushes a plain-text alert to an [ntfy.sh](https://ntfy.sh/)
|
||||
topic. Configure the topic URL; optional bearer token if you
|
||||
self-host with auth.
|
||||
|
||||
### SMTP
|
||||
|
||||
Plain SMTP (with optional TLS). Configure host, port,
|
||||
username, password, and the recipient list.
|
||||
|
||||
## Test fire
|
||||
|
||||
Each channel exposes a **Test fire** button that dispatches a
|
||||
single synthetic alert through the channel without touching the
|
||||
alert engine. Use this when you've added a channel and want to
|
||||
verify connectivity before the next real failure happens.
|
||||
|
||||
## What gets logged
|
||||
|
||||
Every alert raise / acknowledge / resolve writes an audit log
|
||||
entry. The audit log UI at **Settings → Audit log** filters by
|
||||
user, action, target, and time range — useful for the
|
||||
post-incident "who clicked acknowledge on the prune-failure
|
||||
alert" question.
|
||||
@@ -0,0 +1,73 @@
|
||||
# Backups and restores
|
||||
|
||||
## Running a backup
|
||||
|
||||
Three ways to trigger one:
|
||||
|
||||
1. **Scheduled** — the agent's local cron fires at the time set
|
||||
on the schedule.
|
||||
2. **Run-now** — operator clicks **Run now** on the host detail
|
||||
right rail. Posts to `/hosts/{id}/run-backup` (defaults to all
|
||||
source groups) or to a per-group form for finer control.
|
||||
3. **API** — `POST /api/hosts/{id}/jobs` with the appropriate
|
||||
payload. Same audit + dispatch path.
|
||||
|
||||
In every case the server creates a `jobs` row, broadcasts a
|
||||
`command.run` to the host, and lands the operator on the live
|
||||
job log page (HTMX `HX-Redirect`).
|
||||
|
||||
## Cancelling a job
|
||||
|
||||
Any running job — backup, forget, prune, restore, anything —
|
||||
exposes a **Cancel** button on its detail page. The server
|
||||
broadcasts `command.cancel`, and the agent kills the running
|
||||
restic subprocess via context cancel: SIGTERM first, SIGKILL
|
||||
after a 5s grace (`cmd.Cancel` + `cmd.WaitDelay`). On Windows the
|
||||
SIGTERM step is replaced with `os.Kill` because Windows can't
|
||||
deliver SIGTERM. Result: a cancelled job lands as `cancelled`
|
||||
within a couple of hundred milliseconds.
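
The agent is written in Go and relies on `cmd.Cancel` + `cmd.WaitDelay`; the snippet below is only a Python analogue of the same terminate-then-kill pattern, with a sleeping child process standing in for restic. The `cancel` helper and its parameter names are illustrative.

```python
import subprocess
import sys


def cancel(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Ask the process to stop, then force-kill it if it outlives the grace period."""
    proc.terminate()  # SIGTERM on POSIX; TerminateProcess on Windows
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        return proc.wait()


# Stand-in for a long-running restic subprocess.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = cancel(child)
print("child exited with", exit_code)
```

A well-behaved child exits on the first signal, so the grace period is only spent when the process ignores it.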
|
||||
|
||||
## Restore wizard
|
||||
|
||||
Restoring a file or path goes through a four-step wizard at
|
||||
`/hosts/{id}/restore`:
|
||||
|
||||
1. **Pick a snapshot.** Search by id or by date; the page is
|
||||
pre-populated when you launched the wizard from a snapshot row.
|
||||
2. **Browse the snapshot tree.** Lazy-loaded children via the
|
||||
`MsgTreeList` synchronous WS RPC; results are cached
|
||||
per-wizard-session for 30 minutes. Pick the absolute paths
|
||||
you want.
|
||||
3. **Choose a target.** Either **In place** (overwrites the
|
||||
live filesystem; requires you to type the hostname to
|
||||
confirm) or **New directory** (default
|
||||
`$HOME/rm-restore/<job-id>/`; agent expands `$HOME` /
|
||||
`${HOME}` / `~/` and creates the directory chain).
|
||||
4. **Review and submit.** Server mints a job, dispatches
|
||||
`command.run` with a `RestorePayload`, and `HX-Redirect`s to
|
||||
the live job log.
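
The home-directory expansion in step 3 could look roughly like this. It is a sketch under assumptions: `expand_home` and `prepare_target` are hypothetical names, and the real agent may handle more edge cases than a leading `$HOME`, `${HOME}`, or `~/`.

```python
import os
import tempfile


def expand_home(path: str, home: str) -> str:
    """Expand a leading $HOME, ${HOME}, or ~/ prefix; leave other paths untouched."""
    for prefix in ("${HOME}", "$HOME", "~"):
        if path == prefix or path.startswith(prefix + "/"):
            return home + path[len(prefix):]
    return path


def prepare_target(path: str, home: str) -> str:
    """Expand the path and create the whole directory chain."""
    target = expand_home(path, home)
    os.makedirs(target, exist_ok=True)
    return target


home = tempfile.mkdtemp()  # throwaway directory standing in for the user's home
target = prepare_target("$HOME/rm-restore/job-123/", home)
print(target)
```

Matching on `prefix + "/"` keeps a path such as `$HOMEDIR/x` from being expanded by accident.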
|
||||
|
||||
`--no-ownership` is gated on restic ≥ 0.17 (the flag was added
|
||||
in that release). Hosts running 0.16 don't get the flag and
|
||||
restore as the running user instead.
|
||||
|
||||
## Snapshot diff
|
||||
|
||||
Enter two snapshot ids in the **Diff** form on the host detail
page to start a `JobDiff` job that runs `restic diff <a> <b>`.
Output streams to the standard live job log. This is useful when
investigating a suspiciously sized backup.
|
||||
|
||||
## Job log artefacts
|
||||
|
||||
Every job's log is persisted in `job_logs` (one row per line),
|
||||
not just streamed in-memory. That gives you:
|
||||
|
||||
- A live view at `/jobs/{id}` while the job runs.
|
||||
- Two download formats from the same page header dropdown:
|
||||
- **txt** — one line per row, `HH:MM:SS.mmm TAG payload`.
|
||||
- **ndjson** — one self-contained JSON object per line
|
||||
(`{seq, ts, stream, payload}`), perfect for `jq`.
|
||||
|
||||
Downloads work whether the job is running or finished —
|
||||
the source is the DB, not the live socket.
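
For readers who prefer a script to `jq`, an NDJSON download can be filtered with the standard library alone. The sample lines below are invented for illustration: the field names follow the `{seq, ts, stream, payload}` shape described above, but the timestamp format and the `stdout` / `stderr` stream values are assumptions.

```python
import json

# Hypothetical sample in the {seq, ts, stream, payload} shape.
sample = "\n".join([
    '{"seq": 1, "ts": "2024-01-01T02:00:00.120Z", "stream": "stdout", "payload": "scanning"}',
    '{"seq": 2, "ts": "2024-01-01T02:00:01.480Z", "stream": "stderr", "payload": "file vanished"}',
    '{"seq": 3, "ts": "2024-01-01T02:00:09.002Z", "stream": "stdout", "payload": "snapshot saved"}',
])


def parse_ndjson(text: str) -> list:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


entries = parse_ndjson(sample)
stderr_lines = [e["payload"] for e in entries if e["stream"] == "stderr"]
print(stderr_lines)
```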
|
||||
@@ -0,0 +1,61 @@
|
||||
# Observability with Prometheus
|
||||
|
||||
restic-manager can expose a Prometheus scrape endpoint at
|
||||
`GET /metrics`. The endpoint is **opt-in** — without an explicit
|
||||
auth gate it isn't even mounted, so a forgotten config can't
|
||||
accidentally publish fleet state.
|
||||
|
||||
The full reference lives at
|
||||
[`docs/prometheus.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/docs/prometheus.md);
|
||||
the short version follows.
|
||||
|
||||
## Enable the endpoint
|
||||
|
||||
Set at least one of:
|
||||
|
||||
- `RM_METRICS_TOKEN` — `Authorization: Bearer <token>` required.
|
||||
- `RM_METRICS_TRUSTED_CIDR` — restricts source IPs (comma-CIDR).
|
||||
|
||||
When both are set, a request must satisfy both. The token is
compared in constant time, and the CIDR check honours
`X-Forwarded-For` only when the immediate hop matches
`RM_TRUSTED_PROXY`.
|
||||
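
The gate logic can be approximated as below. This is a sketch, not the server's implementation: `metrics_allowed` and its parameters are hypothetical, and `X-Forwarded-For` handling is left out, so `remote_ip` is assumed to be the already-resolved client address.

```python
import hmac
import ipaddress


def metrics_allowed(auth_header, remote_ip, token=None, trusted_cidrs=None):
    """Return True if a scrape passes every configured gate."""
    if not token and not trusted_cidrs:
        return False  # neither gate set: the endpoint is not mounted at all
    if token:
        expected = f"Bearer {token}".encode()
        supplied = (auth_header or "").encode()
        if not hmac.compare_digest(supplied, expected):  # constant-time compare
            return False
    if trusted_cidrs:
        addr = ipaddress.ip_address(remote_ip)
        networks = [ipaddress.ip_network(c.strip()) for c in trusted_cidrs.split(",")]
        if not any(addr in net for net in networks):
            return False
    return True


print(metrics_allowed("Bearer s3cret", "10.0.0.5", token="s3cret", trusted_cidrs="10.0.0.0/8"))
```

Each configured gate can only reject, which is what makes the two settings combine as a logical AND.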
|
||||
## Metrics emitted
|
||||
|
||||
- **Server gauges**: `rm_hosts_total`, `rm_hosts_online`,
|
||||
`rm_active_alerts{severity}`, `rm_build_info{...}`.
|
||||
- **Per-host gauges**: `rm_host_agent_online`,
|
||||
`rm_host_last_backup_timestamp_seconds`,
|
||||
`rm_host_last_backup_success`, `rm_host_repo_size_bytes`,
|
||||
`rm_host_snapshot_count`, `rm_host_open_alerts`,
|
||||
`rm_host_repo_status`.
|
||||
- **Histogram**:
|
||||
`rm_job_duration_seconds{kind,status,le=…}` (buckets
|
||||
`1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf`).
|
||||
|
||||
In-memory histogram only. Prometheus persists the scrapes; if
|
||||
you need durable history at hourly resolution that's
|
||||
Prometheus's job.
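
To make the bucket list concrete, here is a small sketch of how a cumulative Prometheus-style histogram counts observations. The bucket bounds are the ones listed above; the `observe` helper and the sample durations are illustrative only.

```python
BUCKETS = [1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, float("inf")]


def observe(counts: dict, seconds: float) -> None:
    """Buckets are cumulative: bump every bucket whose upper bound covers the value."""
    for le in BUCKETS:
        if seconds <= le:
            counts[le] = counts.get(le, 0) + 1


counts = {}
for duration in (0.4, 42.0, 42.0, 7200.0):
    observe(counts, duration)
print(counts[60], counts[float("inf")])  # 3 4
```

Because the counts live in process memory, they reset when the server restarts; the scraped series in Prometheus is what survives.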
|
||||
|
||||
## Sample Grafana dashboard
|
||||
|
||||
[`deploy/grafana/restic-manager-dashboard.json`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/deploy/grafana/restic-manager-dashboard.json)
|
||||
imports through Grafana's **+ → Import → Upload JSON file**.
|
||||
Six panels:
|
||||
|
||||
1. Fleet status (online / total).
|
||||
2. Open alerts by severity.
|
||||
3. Backups failing on most-recent run.
|
||||
4. Hosts table — last backup, repo size, snapshots, open alerts.
|
||||
5. Repo size over time, one line per host.
|
||||
6. Job-duration p95 over a 1h window per kind.
|
||||
|
||||
## Alerting
|
||||
|
||||
restic-manager already has a built-in alert engine
|
||||
([Alerts](./alerts.md)). The dashboard intentionally doesn't
|
||||
duplicate it as Prometheus alert rules. If you want
|
||||
Prometheus-side alerts on top, write your own based on the
|
||||
metrics above — `rm_host_last_backup_success == 0`,
|
||||
`time() - rm_host_last_backup_timestamp_seconds > <max age>`,
|
||||
or whatever suits your environment.
|
||||
@@ -0,0 +1,50 @@
|
||||
# Updating agents
|
||||
|
||||
Server updates are a `docker compose pull && up -d` away.
|
||||
Agents update via the control plane.
|
||||
|
||||
## Single-host update
|
||||
|
||||
Each host's detail page shows an **Update agent** button when
the agent's reported version is older than the server's.
Clicking it starts the following sequence:

1. The server dispatches a `command.update` to that host.
2. The agent fetches the appropriate binary from
   `$RM_SERVER/agent/binary?os=…&arch=…` to
   `<binary-path>.new`.
3. The agent copies the running binary to `<binary-path>.old`
   (one revision back, in case rollback is needed).
4. The agent atomically renames `.new` over the running binary.
5. The agent exits cleanly. systemd's `Restart=always` (or the
   Windows SCM) brings the process back on the new binary.
|
||||
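
The stage, back up, and rename part of that sequence (steps 2 to 4) can be sketched as follows. The real agent is Go; this Python version only illustrates the file choreography, uses an in-memory byte string in place of the download, and `swap_binary` is a hypothetical name.

```python
import os
import shutil
import tempfile


def swap_binary(binary_path: str, new_bytes: bytes) -> None:
    """Stage the new binary, keep one revision back, then rename into place."""
    new_path = binary_path + ".new"
    old_path = binary_path + ".old"
    with open(new_path, "wb") as f:         # stage the downloaded bytes
        f.write(new_bytes)
    shutil.copymode(binary_path, new_path)  # carry over the permission bits
    shutil.copy2(binary_path, old_path)     # keep the previous revision
    os.replace(new_path, binary_path)       # rename over the running binary


workdir = tempfile.mkdtemp()
binary = os.path.join(workdir, "agent")
with open(binary, "wb") as f:
    f.write(b"v1")
swap_binary(binary, b"v2")
print(sorted(os.listdir(workdir)))
```

Writing to a sibling `.new` file first means the rename happens within one filesystem, which is the condition under which it is atomic on POSIX systems.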
|
||||
A 90-second timer on the server side waits for a hello at the
|
||||
target version and marks the update succeeded — or, if the
|
||||
agent doesn't reconnect at the expected version in time, marks
|
||||
the update **failed** and raises an `update_failed` alert.
|
||||
|
||||
## Fleet update
|
||||
|
||||
The admin-only **Settings → Fleet update** page drives a rolling
|
||||
update across every host in the fleet:
|
||||
|
||||
- One host at a time.
|
||||
- Wait for hello-with-target-version (max 95s).
|
||||
- On any host failing, **halt** the rollout, raise a
|
||||
`fleet_update_halted` alert, leave the rest of the fleet on
|
||||
the old version. No surprise mass-failures.
|
||||
|
||||
You can cancel an in-progress fleet update; the worker stops
|
||||
after the current host finishes.
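
The halt-on-first-failure rule is easy to express in code. The sketch below is illustrative: `rolling_update` and the `update_one` callback are hypothetical, and the callback stands in for "dispatch the update and wait for the hello at the target version".

```python
def rolling_update(hosts, update_one):
    """Update hosts one at a time; stop at the first failure."""
    updated = []
    for host in hosts:
        if not update_one(host):
            remaining = hosts[len(updated) + 1:]  # never attempted
            return {"updated": updated, "failed": host, "untouched": remaining}
        updated.append(host)
    return {"updated": updated, "failed": None, "untouched": []}


# Pretend host "c" fails to come back on the new version.
result = rolling_update(["a", "b", "c", "d"], lambda h: h != "c")
print(result)
```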
|
||||
|
||||
## TLS and corruption
|
||||
|
||||
Updates rely on the reverse proxy's TLS to detect corruption in
|
||||
transit. There's no separate sha256 verification step — we
|
||||
chose the simpler model on the basis that the same TLS already
|
||||
gates every other byte the server hands to the agent.
|
||||
|
||||
If you'd like a separate signature step before applying updates,
|
||||
that's a future-phase enhancement (see `tasks.md` Phase 6
|
||||
candidates).
|
||||
@@ -0,0 +1,58 @@
|
||||
# Environment variables
|
||||
|
||||
The server reads its configuration from environment variables
|
||||
(canonical) with an optional YAML overlay. Env wins over YAML so
|
||||
operators can tweak a single setting without rewriting the file.
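
The precedence rule (environment first, then the YAML overlay, then the built-in default) maps naturally onto a layered lookup. In this sketch the overlay contents are invented, and the assumption that the overlay uses the same key names as the environment variables is mine.

```python
from collections import ChainMap

defaults = {"RM_LISTEN": ":8080", "RM_DATA_DIR": "/data"}
yaml_overlay = {"RM_LISTEN": ":8081", "RM_DATA_DIR": "/srv/rm"}  # hypothetical overlay
env = {"RM_LISTEN": ":9090"}  # stand-in for os.environ

# Earlier maps win: env, then the overlay, then defaults.
config = ChainMap(env, yaml_overlay, defaults)
print(config["RM_LISTEN"], config["RM_DATA_DIR"])  # :9090 /srv/rm
```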
|
||||
|
||||
## Server
|
||||
|
||||
| Variable | Default | Meaning |
|---------------------------|----------------------------------|---------|
| `RM_LISTEN` | `:8080` | TCP listener for the HTTP server. |
| `RM_DATA_DIR` | `/data` | Persistent state directory (SQLite, secret key, agent assets). |
| `RM_BASE_URL` | (none) | Public URL clients use; required for OIDC redirects + cookie scope. |
| `RM_SECRET_KEY_FILE` | `${RM_DATA_DIR}/secret.key` | Path to the AEAD key file. Auto-generated on first run. |
| `RM_COOKIE_SECURE` | `true` | Set `false` only for local HTTP testing. Controls `Secure` on session cookies. |
| `RM_TRUSTED_PROXY` | (none) | Comma-separated CIDRs trusted for `X-Forwarded-*`. |
| `RM_BUNDLED_ASSETS_DIR` | `/opt/restic-manager/dist` | Read-only path with bundled agent binaries + install scripts (the docker image bakes them here). |
| `RM_METRICS_TOKEN` | (off) | When set, `GET /metrics` requires `Authorization: Bearer <token>`. |
| `RM_METRICS_TRUSTED_CIDR` | (off) | When set, `GET /metrics` restricts source IPs (comma-CIDR). |
|
||||
|
||||
OIDC variables (all optional; empty issuer disables OIDC):
|
||||
|
||||
| Variable | Meaning |
|--------------------------------|---------|
| `RM_OIDC_ISSUER` | OIDC discovery URL (e.g. `https://auth.example.com`). |
| `RM_OIDC_CLIENT_ID` | Client ID registered with the IdP. |
| `RM_OIDC_CLIENT_SECRET` | Client secret (or use `RM_OIDC_CLIENT_SECRET_FILE`). |
| `RM_OIDC_CLIENT_SECRET_FILE` | Path to a file holding the client secret. |
| `RM_OIDC_DISPLAY_NAME` | Button label on the login page (e.g. "Authelia"). |
| `RM_OIDC_ROLE_CLAIM` | Token claim that carries roles (default `groups`). |
| `RM_OIDC_ROLE_MAPPING` | `idp-group=role` entries, comma-separated (e.g. `rm-admin=admin,rm-ops=operator`). |
| `RM_OIDC_REDIRECT_URL` | Override for the redirect URL; defaults to `${RM_BASE_URL}/auth/oidc/callback`. |
|
||||
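
The `RM_OIDC_ROLE_MAPPING` format is simple enough to illustrate with a short parser. This is a sketch of the documented `idp-group=role` syntax only; `parse_role_mapping` is a hypothetical helper, and how the server treats whitespace or malformed entries is an assumption.

```python
def parse_role_mapping(raw: str) -> dict:
    """Parse 'group=role,group=role' into a dict, rejecting malformed entries."""
    mapping = {}
    for entry in raw.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing comma
        group, sep, role = entry.partition("=")
        if not sep or not group.strip() or not role.strip():
            raise ValueError(f"malformed mapping entry: {entry!r}")
        mapping[group.strip()] = role.strip()
    return mapping


mapping = parse_role_mapping("rm-admin=admin,rm-ops=operator")
print(mapping)
```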
|
||||
## Agent
|
||||
|
||||
| Variable | Default | Meaning |
|----------------------|---------|---------|
| `RM_AGENT_CONFIG` | `/etc/restic-manager/agent.yaml` (Linux) | Config file path. |
|
||||
|
||||
The agent's other settings live in the YAML file (server URL,
|
||||
bearer token, optional cert pin). The install script writes that
|
||||
file for you at enrolment.
|
||||
|
||||
## Build-time
|
||||
|
||||
The Makefile threads `-ldflags` from `git describe` into the
|
||||
`internal/version` package so `--version` and the dashboard
|
||||
footer show the right values:
|
||||
|
||||
```
-X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=$(VERSION)
-X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Commit=$(COMMIT)
```
|
||||
|
||||
If you build with `go build` directly (no Makefile), `Version`
|
||||
falls back to `dev` and the agent-update comparison falls back
|
||||
to "always equal". Source-build deployments can still run; they
|
||||
just don't participate in the self-update flow.
|
||||
@@ -0,0 +1,82 @@
|
||||
# HTTP endpoints
|
||||
|
||||
A non-exhaustive map of the surfaces the control plane exposes.
|
||||
All `/api/*` routes return JSON; all other paths render HTML
|
||||
(server-rendered with HTMX in the loop).
|
||||
|
||||
The canonical wiring lives at
|
||||
[`internal/server/http/server.go`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/internal/server/http/server.go);
|
||||
when in doubt, read the routes block there.
|
||||
|
||||
## Public (no auth)
|
||||
|
||||
| Method | Path | Purpose |
|--------|----------------------------|---------|
| GET | `/healthz` | Liveness probe. Returns 204. |
| POST | `/api/auth/login` | Local-user login. JSON body: `{username, password}`. |
| POST | `/api/auth/logout` | Invalidate the session cookie. |
| POST | `/api/bootstrap` | First-run admin creation. Accepts the token printed at first start. |
| POST | `/api/agents/enroll` | Token-based agent enrolment. |
| POST | `/api/agents/announce` | Announce-and-approve agent enrolment. |
| GET | `/agent/binary?os=&arch=` | Serves the agent binary for the install scripts. |
| GET | `/install/*` | Serves the Linux + Windows install scripts and the systemd unit. |
| GET | `/api/version` | Build version + commit JSON. |
| GET | `/metrics` | Prometheus exposition (only when opted-in via `RM_METRICS_TOKEN` / `RM_METRICS_TRUSTED_CIDR`). |
| GET | `/login`, `/setup`, `/bootstrap` | UI pages. |
|
||||
|
||||
## Authenticated (any role)
|
||||
|
||||
| Method | Path | Purpose |
|--------|------------------------------------------|---------|
| GET | `/` | Dashboard. |
| GET | `/hosts/{id}` | Host detail. |
| GET | `/hosts/{id}/repo` | Repo tab. |
| GET | `/hosts/{id}/jobs` | Jobs tab. |
| GET | `/hosts/{id}/sources` | Source groups list. |
| GET | `/hosts/{id}/schedules` | Schedules list. |
| GET | `/jobs/{id}` | Live job log. |
| GET | `/api/hosts`, `/api/fleet/summary` | JSON list + summary. |
| GET | `/api/jobs/{id}/stream` | WebSocket subscription to a job's live log. |
| GET | `/api/jobs/{id}/log.{txt,ndjson}` | Persisted log download. |
|
||||
|
||||
## Operator role and above
|
||||
|
||||
| Method | Path | Purpose |
|--------|---------------------------------------|---------|
| POST | `/hosts/{id}/run-backup` | Run-now (HTMX form-post). |
| POST | `/hosts/{id}/sources/{gid}/run-now` | Per-source-group run-now. |
| POST | `/hosts/{id}/repo/{prune,check,unlock,reinit,probe}` | Maintenance actions. |
| POST | `/api/hosts/{id}/snapshots/diff` | Snapshot-diff job. |
| POST | `/hosts/{id}/restore` | Restore wizard submit. |
| POST | `/api/jobs/{id}/cancel` | Cancel a running job. |
| POST | `/hosts/{id}/tags` | Update host tags. |
| POST | `/hosts/{id}/sources` and friends | Source-group CRUD. |
| POST | `/hosts/{id}/schedules` and friends | Schedule CRUD. |
| POST | `/hosts/{id}/repo/credentials`, `/admin-credentials` | Credential update. |
|
||||
|
||||
## Admin role only
|
||||
|
||||
| Method | Path | Purpose |
|--------|---------------------------------------|---------|
| POST | `/hosts/new` | Mint enrolment token (Add host). |
| POST | `/hosts/{id}/delete` | Delete + cascade. |
| POST | `/hosts/{id}/update` | Dispatch a single agent update. |
| GET/POST | `/settings/users/...` | User management. |
| POST | `/settings/notifications/...` | Notification channel CRUD + test fire. |
| POST | `/settings/fleet-update/...` | Fleet-update worker. |
|
||||
|
||||
## WebSocket
|
||||
|
||||
| Path | Who connects | Auth |
|--------------------------------|--------------|------|
| `/ws/agent` | Agent | Bearer token issued at enrolment. |
| `/ws/agent/pending` | Agent (announce flow) | Pending-id query param. |
| `/api/jobs/{id}/stream` | Browser | Session cookie. |
|
||||
|
||||
## RBAC enforcement
|
||||
|
||||
Routes are grouped into chi route-groups by required role
|
||||
(`viewer < operator < admin`); the `requireRole` middleware in
|
||||
`internal/server/http/middleware.go` is the bouncer. Sessions
|
||||
re-validate `disabled_at` on every request, so a disabled user's
|
||||
cookie stops working immediately.
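
The role ordering and the per-request `disabled_at` check amount to a small predicate. The real enforcement is the Go `requireRole` middleware; the Python below is only a sketch of the rule, and the `user` dict shape is an assumption.

```python
ROLE_RANK = {"viewer": 0, "operator": 1, "admin": 2}


def authorize(user: dict, required: str) -> bool:
    """Allow the request if the user is enabled and ranks at or above the required role."""
    if user.get("disabled_at") is not None:
        return False  # re-checked on every request, so disabling takes effect at once
    return ROLE_RANK[user["role"]] >= ROLE_RANK[required]


operator = {"role": "operator", "disabled_at": None}
print(authorize(operator, "viewer"), authorize(operator, "admin"))
```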
|
||||
@@ -0,0 +1,32 @@
|
||||
# Roadmap
|
||||
|
||||
The live roadmap is in
|
||||
[`tasks.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/tasks.md).
|
||||
Phases ship in order; items inside a phase ship as the
|
||||
opportunity arises.
|
||||
|
||||
## Status snapshot
|
||||
|
||||
| Phase | Theme | Status |
|-------|--------------------------------------------------|--------|
| 0 | Project bootstrap | ✅ done |
| 1 | MVP: enrolment, visibility, on-demand backup | ✅ done |
| 2 | Scheduling, retention, repo operations | ✅ done |
| 3 | Restore, alerts, audit | ✅ done |
| 4 | RBAC, OIDC, host tags | ✅ done |
| 5 | OSS readiness | 🚧 in flight (this docs site is part of it) |
| 6 | Update delivery + observability polish | ✅ done |
|
||||
|
||||
## What's not on the roadmap
|
||||
|
||||
The non-goals list in [`spec.md` §2](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/spec.md):
|
||||
|
||||
- Replacing restic itself or providing custom repo formats
|
||||
- Managing non-restic backup tools
|
||||
- Multi-tenancy / SaaS deployment
|
||||
- High availability of the control plane (SQLite, single-instance)
|
||||
- Mobile-native apps (responsive web only)
|
||||
|
||||
If something there is critical to your use case, restic-manager
|
||||
isn't the right tool. That's not a closed door — it's a
|
||||
deliberate scope decision so the project stays maintainable.
|
||||
@@ -0,0 +1,35 @@
|
||||
# Reporting vulnerabilities
|
||||
|
||||
The full disclosure policy lives in
|
||||
[`SECURITY.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/SECURITY.md)
|
||||
at the repo root. The short version:
|
||||
|
||||
- **Don't open a public issue.**
|
||||
- Send a Gitea private message to `steve` on
|
||||
<https://gitea.dcglab.co.uk>, or email the address on the
|
||||
maintainer's profile, with a subject like
|
||||
`[SECURITY] restic-manager: <one-line summary>`.
|
||||
- Expect an acknowledgement within 3 working days; escalate
|
||||
through the other channel if you don't get one.
|
||||
- Default disclosure window is **30 days from confirmed report
|
||||
to public disclosure**, faster if a PoC is already
|
||||
circulating, slower only by mutual agreement.
|
||||
|
||||
## What to include
|
||||
|
||||
A description of the issue and the impact, the affected
|
||||
component (server / agent / install script / docs), the version,
|
||||
and reproduction steps. A working PoC is welcome but not
|
||||
required — a credible threat model is enough.
|
||||
|
||||
## In scope vs. out of scope
|
||||
|
||||
See the full policy. Quick highlights:
|
||||
|
||||
- **In scope:** server, agent, install scripts, docker image,
|
||||
docker-compose reference, crypto choices, docs that lead to
|
||||
insecure configs.
|
||||
- **Out of scope:** restic itself (report upstream), unpatched
|
||||
third-party deps (report upstream first), pre-authenticated
|
||||
admin abuse (admins are designed to have full power), DoS on
|
||||
deployments without the recommended reverse proxy.
|
||||
@@ -0,0 +1,72 @@
|
||||
# Hardening checklist
|
||||
|
||||
A baseline for new deployments. Most of these are defaults; the
|
||||
list is here to make audit easy.
|
||||
|
||||
## Server
|
||||
|
||||
- [ ] Reverse proxy in front, TLS terminating at the proxy
|
||||
(Caddy/nginx/Traefik).
|
||||
- [ ] `RM_TRUSTED_PROXY` set to the proxy's CIDR.
|
||||
- [ ] `RM_BASE_URL` matches the public hostname and the cookie
|
||||
scope you want.
|
||||
- [ ] `RM_COOKIE_SECURE=true` (the default; only set `false`
|
||||
for local HTTP testing).
|
||||
- [ ] HTTP listener bound to **localhost** in the compose file,
|
||||
not `0.0.0.0`. The reverse proxy is the only thing that
|
||||
should reach it.
|
||||
- [ ] `secret.key` backed up separately from the database.
|
||||
- [ ] Bootstrap token consumed and the printed log line scrubbed
|
||||
from any log archive.
|
||||
|
||||
## Authentication
|
||||
|
||||
- [ ] Admin user has a password ≥ 12 characters (the floor).
|
||||
- [ ] OIDC enabled if you have an IdP — local password auth
|
||||
stays as a break-glass.
|
||||
- [ ] Disabled (not deleted) any users who change roles or leave
|
||||
so their session is invalidated immediately.
|
||||
- [ ] The last-admin guard isn't tripped — there's always at
|
||||
least one enabled admin user.
|
||||
|
||||
## Repo credentials
|
||||
|
||||
- [ ] Append-only credential set as the everyday cred for every
|
||||
host.
|
||||
- [ ] Admin credential set only where prune cadence is enabled.
|
||||
- [ ] No credentials reused across hosts. Each host should have
|
||||
its own credential pair so a single host compromise has a
|
||||
single blast radius.
|
||||
- [ ] If using rest-server, `--append-only` flag is on for the
|
||||
everyday user; the prune user is a separate identity.
|
||||
|
||||
## Agent
|
||||
|
||||
- [ ] Agent runs as `root` (Linux) or `LocalSystem` (Windows)
|
||||
**only when** the source paths require it. Otherwise pin
|
||||
a service user that has read access to what's backed up
|
||||
and nothing else.
|
||||
- [ ] systemd unit's sandboxing flags are intact
|
||||
(`NoNewPrivileges`, `Protect*`, `MemoryDenyWriteExecute`).
|
||||
- [ ] Agent's config file `/etc/restic-manager/agent.yaml` is
|
||||
mode `0600` and owned by the service user. The bearer
|
||||
token lives in there.
|
||||
|
||||
## Operations
|
||||
|
||||
- [ ] Alerts wired to a real channel (webhook into Slack,
|
||||
ntfy topic, SMTP) — not just sitting in the UI.
|
||||
- [ ] Test-fire each notification channel after configuring.
|
||||
- [ ] Audit-log retention is long enough to cover the operator's
|
||||
incident-response window.
|
||||
- [ ] Prometheus endpoint, if enabled, gated by token AND CIDR
|
||||
where practical (default is opt-in / off).
|
||||
|
||||
## Recovery
|
||||
|
||||
- [ ] A documented procedure for rotating a leaked agent bearer
|
||||
(delete + re-enrol the host).
|
||||
- [ ] A test-restore done at least once, end-to-end, before
|
||||
relying on the system in anger.
|
||||
- [ ] `secret.key` and the SQLite database covered by separate
|
||||
backup paths so neither alone reconstitutes the other.
|
||||
@@ -0,0 +1,110 @@
|
||||
# Threat model
|
||||
|
||||
This page documents what restic-manager defends against, what it
|
||||
doesn't, and the trust assumptions a deployment is making. The
|
||||
canonical version lives in [`spec.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/spec.md)
|
||||
§11; the summary here is shaped for operators rather than
|
||||
implementers.
|
||||
|
||||
## Trust boundaries
|
||||
|
||||
```
┌──────────────────────────────────────────┐
│  TRUSTED zone                            │
│  ┌─────────────┐    ┌──────────────┐     │
│  │ Operator's  │    │ Reverse      │     │
│  │ browser     │◄──►│ proxy        │     │  TLS terminates here
│  └─────────────┘    └──────┬───────┘     │
└────────────────────────────┼─────────────┘
                             │ HTTP, plaintext
                             │ (loopback or trusted LAN)
┌────────────────────────────▼─────────────┐
│  Server (control plane)                  │
└────────────┬─────────────────────────────┘
             │ outbound WebSocket (TLS to clients via proxy)
             │ — bearer-authenticated
┌────────────▼──────────────┐
│  Agent (per host)         │  ◄── attacker model: assume one
└────────────┬──────────────┘      endpoint can be compromised
             │ subprocess
             ▼
          restic ──▶ repository (rest-server / S3 / SFTP / …)
```
|
||||
|
||||
## What we defend against
|
||||
|
||||
### Network attacker between operator and server
|
||||
|
||||
- HTTPS via the reverse proxy is the only operator-facing surface
|
||||
on a sane deployment.
|
||||
- `RM_COOKIE_SECURE=true` (default) means the session cookie
|
||||
refuses to ride a non-HTTPS connection.
|
||||
- `RM_TRUSTED_PROXY` gates whether `X-Forwarded-*` is honoured;
|
||||
a bypassing request can't spoof the client IP.
|
||||
|
||||
### Compromised agent host
|
||||
|
||||
- The agent's bearer token can dispatch commands **only on its
|
||||
own host**. It can't read other hosts' state, dispatch jobs
|
||||
on other hosts, or escalate within the control plane.
|
||||
- If you suspect a host compromise:
  1. Delete the agent's host row via **Hosts → Delete**
     (the delete cascades to the bearer hash).
  2. Rotate the repo credential at the rest-server / object
     store side.
  3. Review the audit log, which lists every action that
     bearer ever drove.
|
||||
|
||||
### DB compromise without the secret key
|
||||
|
||||
- Repo credentials are AEAD-encrypted at rest. A DB dump alone
|
||||
doesn't expose them.
|
||||
- Agent bearer **hashes** are leaked; that's enough to
|
||||
authenticate as any agent until you revoke. A rotation
|
||||
procedure is just "delete + re-enrol" today.
|
||||
- Operator passwords are bcrypt-hashed; OIDC users have no
|
||||
password to leak.
|
||||
- Session tokens are hashed; an attacker can't replay a
|
||||
session from a DB dump.
|
||||
|
||||
### DB compromise WITH the secret key
|
||||
|
||||
The attacker can decrypt every credential. Treat
|
||||
`secret.key` with the same care as a password manager database.
|
||||
Back it up to a separate vault, not to the same Docker volume
|
||||
as the database.
|
||||
|
||||
### Forget/prune as a DoS vector
|
||||
|
||||
- The everyday backup credential cannot prune (append-only).
|
||||
- The admin credential is only pushed to the agent at the
|
||||
moment of dispatch and discarded after the job ends.
|
||||
- Compromise of a single agent host does **not** grant prune
|
||||
rights — at worst the attacker gets fresh write access until
|
||||
the credential is rotated.
|
||||
|
||||
### Operator-side typo or bad copy-paste
|
||||
|
||||
- Repo credentials are stored encrypted; mis-typed creds fail
|
||||
fast on the next `restic` invocation rather than silently
|
||||
corrupting state.
|
||||
- NS-03 added auto-init: the first dispatched job after creds
|
||||
change runs `restic init`, surfaces the error eagerly under
|
||||
the host's vitals strip if the creds are bad, and resets the
|
||||
host's `repo_status` so the operator can retry without
|
||||
hunting through job logs.
|
||||
|
||||
## What we don't defend against
|
||||
|
||||
- **Insider threat at the maintainer level.** A malicious
|
||||
maintainer can publish a backdoored container; SBOM /
|
||||
signing infrastructure (Phase 6 candidate) would help here
|
||||
but isn't shipped today.
|
||||
- **Supply chain.** We pin module versions (`go.sum`) and
|
||||
pin the Tailwind binary's release tag, but a compromise in
|
||||
one of those upstreams would land here.
|
||||
- **Side-channel via restic itself.** A bug in restic that
|
||||
enables snapshot-content disclosure is restic's problem; the
|
||||
control plane doesn't see snapshot bytes either way.
|
||||
- **DoS via resource exhaustion** without the recommended
|
||||
reverse-proxy / rate-limit in front. Don't expose the
|
||||
server's HTTP port to the public internet directly.
|
||||
@@ -0,0 +1,120 @@
|
||||
# End-to-end test harness
|
||||
|
||||
The e2e harness stands up the full production-shaped stack
|
||||
(server + agent + rest-server) in Docker Compose and drives it
|
||||
through Playwright. CI runs it on every PR; operators can run it
|
||||
locally too.
|
||||
|
||||
## Files
|
||||
|
||||
```
|
||||
e2e/
|
||||
├── compose.e2e.yml compose stack: server + rest-server + agent
|
||||
├── Dockerfile.agent Linux container for the agent (alpine + restic)
|
||||
├── agent-entrypoint.sh decides between announce / token-enrol / run
|
||||
└── playwright/
|
||||
├── package.json
|
||||
├── playwright.config.ts
|
||||
└── tests/
|
||||
├── lib/server.ts bootstrap, login, accept, poll helpers
|
||||
└── smoke.spec.ts happy-path: enrol → backup → succeeded
|
||||
```
|
||||
|
||||
## Local run
|
||||
|
||||
Prerequisites: Docker + Docker Compose, and `npx` for Playwright.
|
||||
|
||||
```sh
|
||||
# 1. Build + bring up the stack (server, rest-server, source data).
|
||||
docker compose -f e2e/compose.e2e.yml up --build -d server rest-server source-fixture
|
||||
|
||||
# 2. Wait for the server, then scrape the bootstrap token from the log.
|
||||
until curl -fsS http://127.0.0.1:8080/api/version >/dev/null; do sleep 1; done
|
||||
RM_BOOTSTRAP_TOKEN=$(docker compose -f e2e/compose.e2e.yml logs server \
|
||||
| grep -Eo '[a-zA-Z0-9_-]{40,}' | head -1)
|
||||
export RM_BOOTSTRAP_TOKEN
|
||||
|
||||
# 3. Start the agent (it announces against the running server).
|
||||
docker compose -f e2e/compose.e2e.yml up -d agent
|
||||
|
||||
# 4. Install + run Playwright.
|
||||
cd e2e/playwright
|
||||
npm install
|
||||
npx playwright install --with-deps chromium
|
||||
npx playwright test
|
||||
```
|
||||
|
||||
When the test passes you'll see:
|
||||
|
||||
```
|
||||
Running 2 tests using 1 worker
|
||||
✓ smoke: enrol-via-announce → backup › happy path completes in under a minute (47s)
|
||||
✓ smoke: scrape /metrics › metrics endpoint exposes the host gauge (180ms)
|
||||
|
||||
2 passed (47.5s)
|
||||
```
|
||||
|
||||
Tear-down:
|
||||
|
||||
```sh
|
||||
docker compose -f e2e/compose.e2e.yml down -v
|
||||
```
|
||||
|
||||
`-v` removes the named volumes too — important between runs because
|
||||
the rest-server volume holds an initialised repo and the
|
||||
agent-config volume holds a stale bearer.
|
||||
|
||||
## What the test exercises
|
||||
|
||||
1. **Bootstrap.** Posts the admin-creation request to
|
||||
`/api/bootstrap` with the token scraped from the server log.
|
||||
2. **Login (UI).** Drives the login form via Playwright; verifies
|
||||
the dashboard loads with a session cookie set.
|
||||
3. **Pending host appears.** Polls the dashboard for the inline
|
||||
accept form generated by the announcing agent; reads the
|
||||
pending-id out of its action URL.
|
||||
4. **Accept.** POSTs `/api/pending-hosts/{id}/accept` with the
|
||||
rest-server URL + repo password. The server mints a Host row
|
||||
+ bearer + AEAD-encrypted creds and pushes the bearer down
|
||||
the still-open pending WebSocket.
|
||||
5. **Online + auto-init.** Polls `/api/hosts` until the new host
|
||||
is `status=online`. Auto-init runs as part of this — the
|
||||
first dispatched job after creds save is `restic init`.
|
||||
6. **Run backup.** Submits the host detail page's `Run now`
|
||||
form; expects `HX-Redirect` to the live job page.
|
||||
7. **Verify.** Polls `/api/hosts` until the host's
|
||||
`last_backup_status` flips to `succeeded`.
|
||||
8. **Metrics.** Scrapes `/metrics` and asserts the
|
||||
server-gauge + build-info lines are present (the compose
|
||||
stack opens the endpoint via `RM_METRICS_TRUSTED_CIDR=0.0.0.0/0`).
|
||||
|
||||
## CI workflow
|
||||
|
||||
[`.gitea/workflows/e2e.yml`](../.gitea/workflows/e2e.yml) runs the
|
||||
suite on every PR into `main`. On failure it dumps the last 200
|
||||
lines of each container log as a workflow annotation and uploads
|
||||
the Playwright HTML report as an artefact.
|
||||
|
||||
## When tests fail
|
||||
|
||||
- **Pending host never appears.** Agent container probably
|
||||
couldn't reach the server. Check `docker compose logs agent`
|
||||
for connection errors and `docker compose logs server` for
|
||||
any 4xx on `/api/agents/announce`.
|
||||
- **Backup hangs in `running`.** The agent shells out to
|
||||
`restic`; check the live job log at
|
||||
`http://127.0.0.1:8080/jobs/<id>` (still up after a
|
||||
failed test as long as you didn't `down -v`).
|
||||
- **`RM_BOOTSTRAP_TOKEN not set`.** The server log scrape
|
||||
matched the wrong line or the token regex is too tight. The
|
||||
server prints the token on a line starting with `    ` (four
|
||||
spaces) inside a banner; widen the regex if your server log
|
||||
format changes.
|
||||
|
||||
## Adding new tests
|
||||
|
||||
The harness is intentionally flat — one `*.spec.ts` per
|
||||
scenario. Reuse the helpers in `lib/server.ts` and avoid
|
||||
duplicating bootstrap / login boilerplate. Heavy fixtures
|
||||
(custom users, OIDC IdP) belong in their own compose override
|
||||
file rather than complicating `compose.e2e.yml`.
|
||||
File diff suppressed because it is too large
@@ -0,0 +1,139 @@
|
||||
# Prometheus + Grafana
|
||||
|
||||
restic-manager exposes a Prometheus scrape endpoint at `GET /metrics`.
|
||||
The endpoint is **opt-in** — it is not mounted at all unless you set
|
||||
at least one of the auth gates below. Once enabled, it serves the
|
||||
standard `text/plain` exposition format that every Prometheus
|
||||
release since 2.x parses without configuration.
|
||||
|
||||
A sample Grafana dashboard lives at
|
||||
`deploy/grafana/restic-manager-dashboard.json`.
|
||||
|
||||
## Enable the endpoint
|
||||
|
||||
Two switches, both off by default. If both are set, both must pass
|
||||
(token AND source-IP); if only one is set, that gate alone
|
||||
authorises a scrape.
|
||||
|
||||
| Env var | YAML key | Effect |
|
||||
|----------------------------|------------------------|--------|
|
||||
| `RM_METRICS_TOKEN` | `metrics_token` | Requires `Authorization: Bearer <token>`. Compared in constant time. |
|
||||
| `RM_METRICS_TRUSTED_CIDR` | `metrics_trusted_cidrs` (list) | Restricts the source IP to one of the listed CIDRs. Comma-separated in env, list in YAML. Honours `X-Forwarded-For` only when the immediate hop matches `RM_TRUSTED_PROXY`. |
|
||||
|
||||
When neither is set, `GET /metrics` returns 404 — the route is not
|
||||
registered with the chi router so a forgotten config can't
|
||||
accidentally publish fleet state.
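
The gate logic described above can be sketched roughly as follows. This is a minimal illustration, not the server's actual code: the `metricsGate` type, `newMetricsGate`, and `allow` are hypothetical names, and `X-Forwarded-For` / `RM_TRUSTED_PROXY` handling is left out. It shows the three cases: no gate configured (route off), one gate configured (that gate alone decides), both configured (both must pass).

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/netip"
	"strings"
)

// metricsGate is a hypothetical holder for the two optional gates.
type metricsGate struct {
	token string         // empty: token gate is off
	cidrs []netip.Prefix // empty: source-IP gate is off
}

// newMetricsGate parses the CIDR strings once at start-up.
func newMetricsGate(token string, cidrs ...string) (metricsGate, error) {
	g := metricsGate{token: token}
	for _, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			return metricsGate{}, fmt.Errorf("bad CIDR %q: %w", c, err)
		}
		g.cidrs = append(g.cidrs, p)
	}
	return g, nil
}

// enabled reports whether the route should be registered at all.
func (g metricsGate) enabled() bool {
	return g.token != "" || len(g.cidrs) > 0
}

// allow applies every configured gate; all configured gates must pass.
func (g metricsGate) allow(authHeader, srcIP string) bool {
	if !g.enabled() {
		return false // route not mounted; the caller would answer 404
	}
	if g.token != "" {
		got, ok := strings.CutPrefix(authHeader, "Bearer ")
		// ConstantTimeCompare returns 1 only when both slices are equal.
		if !ok || subtle.ConstantTimeCompare([]byte(got), []byte(g.token)) != 1 {
			return false
		}
	}
	if len(g.cidrs) > 0 {
		src, err := netip.ParseAddr(srcIP)
		if err != nil {
			return false
		}
		inside := false
		for _, p := range g.cidrs {
			if p.Contains(src) {
				inside = true
				break
			}
		}
		if !inside {
			return false
		}
	}
	return true
}

func main() {
	g, err := newMetricsGate("s3cret", "10.0.0.0/8")
	if err != nil {
		panic(err)
	}
	fmt.Println(g.allow("Bearer s3cret", "10.1.2.3"))  // true
	fmt.Println(g.allow("Bearer s3cret", "192.0.2.1")) // false
}
```

Parsing the CIDRs once and failing fast on a bad value keeps a typo in `RM_METRICS_TRUSTED_CIDR` from silently widening or closing the gate.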
|
||||
|
||||
### Example: Docker
|
||||
|
||||
```yaml
|
||||
services:
|
||||
restic-manager:
|
||||
image: gitea.dcglab.co.uk/steve/restic-manager:latest
|
||||
environment:
|
||||
RM_METRICS_TOKEN_FILE: /run/secrets/rm_metrics_token
|
||||
RM_METRICS_TRUSTED_CIDR: "10.0.0.0/8"
|
||||
secrets:
|
||||
- rm_metrics_token
|
||||
```
|
||||
|
||||
(Note: `RM_METRICS_TOKEN_FILE`, as shown above, is not currently
supported; the `_FILE` convention is on the roadmap. Until it ships,
set `RM_METRICS_TOKEN` directly.)
|
||||
|
||||
## Prometheus scrape config
|
||||
|
||||
Drop into your `prometheus.yml`:
|
||||
|
||||
```yaml
|
||||
scrape_configs:
|
||||
- job_name: restic-manager
|
||||
metrics_path: /metrics
|
||||
scheme: https # via your reverse proxy
|
||||
static_configs:
|
||||
- targets: ['restic.example.com']
|
||||
authorization:
|
||||
type: Bearer
|
||||
credentials_file: /etc/prometheus/secrets/rm_metrics_token
|
||||
```
|
||||
|
||||
If you don't run a TLS-terminating proxy in front, drop `scheme:
|
||||
https` (the server is HTTP-only — see `docs/reverse-proxy.md`).
|
||||
|
||||
## Metric reference
|
||||
|
||||
All names are `rm_`-prefixed. Per-host metrics carry a `host_id`
|
||||
label (the stable ULID, immune to renames) and a `host` label
|
||||
(the human-readable name).
|
||||
|
||||
### Server gauges
|
||||
|
||||
| Name | Labels | Description |
|
||||
|-----------------------|------------------------------------|-------------|
|
||||
| `rm_hosts_total` | — | Total number of enrolled hosts (excludes pending announces). |
|
||||
| `rm_hosts_online` | — | Number of hosts with `status='online'`. |
|
||||
| `rm_active_alerts` | `severity` ∈ {info, warning, critical} | Open alerts by severity. |
|
||||
| `rm_build_info` | `version, commit, go_version` | Always 1; pure label-bag for joining. |
|
||||
|
||||
### Per-host gauges
|
||||
|
||||
| Name | Description |
|
||||
|--------------------------------------------|-------------|
|
||||
| `rm_host_agent_online` | 1 if the agent is currently online, 0 otherwise. |
|
||||
| `rm_host_last_backup_timestamp_seconds` | Unix timestamp of the host's most recent backup. **Omitted** for hosts with no backup yet. |
|
||||
| `rm_host_last_backup_success` | 1 if the most recent backup succeeded, 0 otherwise. **Omitted** for hosts with no backup yet. |
|
||||
| `rm_host_repo_size_bytes` | Latest reported repo size from `restic stats --mode raw-data`. **Omitted** when unknown. |
|
||||
| `rm_host_snapshot_count` | Number of restic snapshots known on the host's repo. |
|
||||
| `rm_host_open_alerts` | Number of currently open alerts attached to this host. |
|
||||
| `rm_host_repo_status` | Always 1; the `status` label carries `unknown` / `ready` / `init_failed`. |
|
||||
|
||||
### Job duration histogram
|
||||
|
||||
```
|
||||
rm_job_duration_seconds_bucket{kind, status, le}
|
||||
rm_job_duration_seconds_sum{kind, status}
|
||||
rm_job_duration_seconds_count{kind, status}
|
||||
```
|
||||
|
||||
`kind` ∈ {backup, forget, prune, check, unlock, restore, diff, init, update}.
|
||||
`status` ∈ {succeeded, failed, cancelled}.
|
||||
|
||||
Buckets (seconds):
|
||||
|
||||
```
|
||||
1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf
|
||||
1s 5s 30s 1m 5m 30m 1h 6h 24h
|
||||
```
|
||||
|
||||
The histogram is in-memory only — values reset on process restart.
|
||||
Operators who want durable history should let Prometheus persist
|
||||
the scrapes; restic-manager itself is a control plane, not a
|
||||
metrics database.
|
||||
|
||||
## Grafana dashboard
|
||||
|
||||
Import `deploy/grafana/restic-manager-dashboard.json`:
|
||||
|
||||
1. In Grafana, **+ → Import → Upload JSON file**.
|
||||
2. Pick the Prometheus data source you scrape with.
|
||||
3. The dashboard's six panels populate from the metrics above:
|
||||
* **Fleet status** — online/total stat panel.
|
||||
* **Open alerts** — by severity.
|
||||
* **Hosts** — per-host table (last backup, repo size, snapshots, alerts).
|
||||
* **Repo size over time** — one line per host.
|
||||
* **Backups failing** — count of hosts whose last backup didn't succeed.
|
||||
* **Job duration p95** — `histogram_quantile(0.95, …)` over a 1h window per kind.
|
||||
|
||||
Alerting is intentionally not configured in the dashboard — the
|
||||
control plane already has alerts (P3-05) with native channels for
|
||||
webhook, ntfy, and SMTP. Re-implementing them in Prometheus would
|
||||
just duplicate state. If you do want Prom-side alerts, copy the
|
||||
recording rules into your usual location.
|
||||
|
||||
## Cardinality
|
||||
|
||||
Per scrape: O(hosts) gauge rows + O(kinds × statuses × buckets)
|
||||
histogram rows. A 100-host fleet emits roughly 700 host rows (7 per-host gauges × 100 hosts) + 270
histogram bucket rows (9 kinds × 3 statuses × 10 buckets), well below any practical limit. There are no
|
||||
`job_id` labels (cardinality bomb avoidance) and no per-source-group
|
||||
labels.
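
As a quick check of those figures, the arithmetic can be written out. The sketch below is illustrative only (the function name is made up); it counts the seven per-host gauges from the table above and treats the histogram as fully populated, which is an upper bound if label combinations only appear once observed. The `_sum` and `_count` rows are reported separately.

```go
package main

import "fmt"

// seriesPerScrape estimates exposition rows for a fleet of the given size.
func seriesPerScrape(hosts int) (hostRows, bucketRows, sumCountRows int) {
	const (
		perHostGauges = 7  // rows in the per-host gauge table
		kinds         = 9  // backup, forget, prune, check, unlock, restore, diff, init, update
		statuses      = 3  // succeeded, failed, cancelled
		buckets       = 10 // nine finite bounds plus +Inf
	)
	hostRows = hosts * perHostGauges
	bucketRows = kinds * statuses * buckets
	sumCountRows = kinds * statuses * 2 // one _sum and one _count per label pair
	return hostRows, bucketRows, sumCountRows
}

func main() {
	h, b, s := seriesPerScrape(100)
	fmt.Println(h, b, s) // 700 270 54
}
```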
|
||||
Binary file not shown.
|
After Width: | Height: | Size: 27 KiB |
@@ -0,0 +1,223 @@
|
||||
# Always-On vs Intermittent host mode
|
||||
|
||||
**Date:** 2026-06-15
|
||||
**Branch:** `feat-laptop-host-mode`
|
||||
**Status:** Design — awaiting review
|
||||
|
||||
## Problem
|
||||
|
||||
The server currently assumes every host should be present 24×7. When an
|
||||
agent stops heartbeating for 90s it is flipped to `offline`, and after 15
|
||||
minutes that raises a `warning` alert. This is correct for a server, but
|
||||
wrong for a host that legitimately comes and goes — a workstation or
|
||||
laptop that sleeps overnight, travels, or is shut down on weekends. Such
|
||||
a host generates noise alerts every time it is closed, and — more
|
||||
importantly — there is **no mechanism to catch up a backup it missed
|
||||
while it was away.**
|
||||
|
||||
Two distinct facts make the catch-up gap real:
|
||||
|
||||
- **Backup cron runs on the agent, locally.** The agent fires
|
||||
`MsgScheduleFire`; the server only dispatches in response. If the host
|
||||
is asleep, the agent process is suspended, so the cron tick never
|
||||
fires and no `MsgScheduleFire` is ever sent.
|
||||
- Therefore the existing `pending_runs` retry queue **does not** cover
|
||||
this case. `pending_runs` only gets a row when a schedule *fired* but
|
||||
the agent was momentarily disconnected at dispatch time. A window
|
||||
missed entirely during sleep never enqueues anything.
|
||||
|
||||
## Goal
|
||||
|
||||
Let an operator mark a host as **not** always-on. Such a host:
|
||||
|
||||
1. Does **not** raise offline/agent-down alerts when it is not visible.
|
||||
2. Renders a distinct, calm "asleep" state in the UI instead of the
|
||||
alarming red "offline".
|
||||
3. When it reconnects, after a short settle delay, the server checks
|
||||
whether it missed a scheduled backup and — if so — triggers a
|
||||
catch-up backup automatically.
|
||||
4. Still raises a *staleness* alert if it has genuinely gone too long
|
||||
without any backup (a host left in a drawer). This is the only
|
||||
alert covering an asleep host: while the agent is offline no job
|
||||
runs, so there is no failure to detect — staleness is the safety
|
||||
net for "no backups are happening at all."
|
||||
5. Leaves normal job-failure alerting untouched: a backup that
|
||||
actually runs (scheduled or catch-up) and fails alerts as it does
|
||||
today. Failures can only occur while the agent is online and
|
||||
executing restic.
|
||||
|
||||
Default behaviour is unchanged for the entire existing fleet.
|
||||
|
||||
## Decisions (from brainstorming)
|
||||
|
||||
- **Setting shape:** a single boolean `Always On` checkbox per host,
|
||||
**default ON**. Checked = today's 24×7 server semantics. Unchecked =
|
||||
intermittent host. Opt-in only; zero behaviour change for current and
|
||||
future hosts unless explicitly toggled.
|
||||
- **Overdue trigger:** evaluated on **reconnect + behind schedule**
|
||||
(not a continuous always-evaluating sweep).
|
||||
- **Alert policy for intermittent hosts:** suppress offline alerts;
|
||||
keep a long-threshold **staleness** alert; keep job-failure alerts.
|
||||
- **Staleness threshold:** **7 days**, a global constant for v1. May
|
||||
become per-host configurable later — out of scope now.
|
||||
- **Catch-up granularity:** **per enabled schedule.** A host with a
|
||||
daily and a weekly schedule catches up only whichever is actually
|
||||
behind.
|
||||
- **UI vocabulary:** not-visible intermittent host shows a grey
|
||||
`asleep` state; detail line reads
|
||||
`asleep · last seen <relTime> · will catch up on return`.
|
||||
- **Chip:** chip and checkbox highlight the **same** truth (24×7). Show
|
||||
a chip for **Always-On** hosts; **no** chip for intermittent.
|
||||
|
||||
## Architecture
|
||||
|
||||
The change is deliberately a thin policy + presentation layer over the
|
||||
existing online/offline state machine. We do **not** add a new `status`
|
||||
enum value or alter heartbeat / `last_seen_at` tracking. "Asleep" is a
|
||||
reinterpretation of `status='offline' AND NOT always_on`.
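
In code terms the reinterpretation is a pure function of two existing fields. A minimal sketch, with a hypothetical helper name, assuming the status values are the plain strings `online` and `offline`:

```go
package main

import "fmt"

// displayState maps the stored status plus the always_on flag to the label
// shown in the UI. "asleep" is never stored; it is derived at render time.
func displayState(status string, alwaysOn bool) string {
	if status == "offline" && !alwaysOn {
		return "asleep"
	}
	return status
}

func main() {
	fmt.Println(displayState("offline", false)) // asleep
	fmt.Println(displayState("offline", true))  // offline
	fmt.Println(displayState("online", false))  // online
}
```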
|
||||
|
||||
### 1. Data model
|
||||
|
||||
- **Migration `0024_hosts_always_on.sql`:**
|
||||
```sql
|
||||
ALTER TABLE hosts ADD COLUMN always_on INTEGER NOT NULL DEFAULT 1;
|
||||
```
|
||||
Column-level ALTER per the repo's migration rules. Default `1` means
|
||||
every existing row is Always-On — no behaviour change on upgrade.
|
||||
- `store/types.go`: add `AlwaysOn bool` to the `Host` struct; thread it
|
||||
through every host SELECT scan and the host insert/update paths.
|
||||
- New store helper `SetHostAlwaysOn(ctx, hostID, bool) error`.
|
||||
|
||||
### 2. Online/offline mechanics — UNCHANGED
|
||||
|
||||
The 30s offline sweeper (`cmd/server/main.go:220`) still flips an unseen
|
||||
host to `status='offline'` and still calls
|
||||
`alertEngine.NotifyHostOffline(id)`. `TouchHost` / `MarkHostHello`
|
||||
behaviour is untouched. The intermittent distinction is applied
|
||||
*downstream* of this state, in the alert engine and the templates.
|
||||
|
||||
### 3. Alert behaviour
|
||||
|
||||
All changes key off `host.AlwaysOn`, which the engine already has access
|
||||
to via the host row it loads.
|
||||
|
||||
- **Suppress offline alert** (`alert/engine.go` `handleHostOffline()`
|
||||
and the 60s `tick()`): when `!host.AlwaysOn`, do not raise
|
||||
`agent_offline`.
|
||||
- **Resolve-on-toggle:** when a host is switched server→intermittent and
|
||||
has an open `agent_offline` alert, auto-resolve it. (Handled in the
|
||||
mode-change handler, fanning through the normal resolve path so
|
||||
channels/audit fire as usual.)
|
||||
- **Staleness alert** — wire up the currently-dead `KindStaleSchedule`
|
||||
constant, **for intermittent hosts only.** On the 60s tick, for each
|
||||
host where `!AlwaysOn` AND the host has ≥1 enabled schedule AND
|
||||
`LastBackupAt != nil` AND `now - LastBackupAt > 7*24h`: raise a
|
||||
`warning` `stale_schedule` alert (dedup key `""`, one per host).
|
||||
Auto-resolves when `LastBackupAt` advances past the threshold (i.e.
|
||||
any successful backup, including the catch-up). Always-On hosts'
|
||||
`stale_schedule` remains a no-op (unchanged, out of scope).
|
||||
- If `LastBackupAt == nil` (intermittent host enrolled but never
|
||||
backed up): no staleness alert in v1 — there is no baseline to
|
||||
measure against, and onboarding probe state (`repo_status`) already
|
||||
covers "never successfully set up."
|
||||
- **Job-failure alerts:** untouched. A catch-up backup that runs and
|
||||
fails alerts exactly like any other backup.
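
The alert rules above reduce to two small predicates. The sketch below is an illustration under assumptions: the `host` struct is a trimmed stand-in for the real store type, and `raisesOfflineAlert`, `isStale`, `at`, and `ptr` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// staleAfter is the global v1 threshold from the decisions list.
const staleAfter = 7 * 24 * time.Hour

// host is a trimmed, hypothetical stand-in for the store's Host row.
type host struct {
	AlwaysOn         bool
	EnabledSchedules int
	LastBackupAt     *time.Time
}

// raisesOfflineAlert: only Always-On hosts may raise agent_offline.
func raisesOfflineAlert(h host) bool { return h.AlwaysOn }

// isStale implements the stale_schedule condition for intermittent hosts.
func isStale(h host, now time.Time) bool {
	if h.AlwaysOn || h.EnabledSchedules == 0 || h.LastBackupAt == nil {
		return false // out of scope, no expectation, or no baseline yet
	}
	return now.Sub(*h.LastBackupAt) > staleAfter
}

// at parses an RFC 3339 timestamp; demo helper that panics on bad input.
func at(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// ptr returns a pointer to a copy of t.
func ptr(t time.Time) *time.Time { return &t }

func main() {
	now := at("2026-06-15T00:00:00Z")
	laptop := host{EnabledSchedules: 1, LastBackupAt: ptr(at("2026-06-01T00:00:00Z"))}
	fmt.Println(raisesOfflineAlert(laptop), isStale(laptop, now)) // false true
}
```

Because `isStale` depends only on `LastBackupAt`, the same predicate also gives the auto-resolve condition: once a backup advances the timestamp, the next tick sees `false`.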
|
||||
|
||||
### 4. Catch-up on reconnect
|
||||
|
||||
A new small component — the **catch-up scheduler** — lives server-side
|
||||
alongside the existing ticks.
|
||||
|
||||
- **Arm:** on agent hello (`server/ws/handler.go` hello path /
|
||||
`onAgentHello`), if the host is `!AlwaysOn`, record
|
||||
`catchupDueAt[hostID] = now + 60s` in an in-memory map. Re-arming on a
|
||||
subsequent hello just overwrites the timestamp (debounce — rapid
|
||||
flapping does not stack catch-ups). In-memory is acceptable: catch-up
|
||||
is best-effort and a server restart simply re-arms on the next hello.
|
||||
- **Fire:** reuse the existing 30s server tick. For each due entry
|
||||
(`catchupDueAt <= now`):
|
||||
1. Re-verify the agent is still connected (`Hub.Connected(hostID)`).
|
||||
If it bounced back offline within the settle window, drop the entry
|
||||
(it will re-arm on the next hello).
|
||||
2. Skip if a backup is already running or queued for the host
|
||||
(`current_job_id` set, or a relevant `pending_runs` row exists) —
|
||||
avoid double-firing alongside a normal dispatch or pending drain.
|
||||
3. For each **enabled** schedule on the host, compute overdue:
|
||||
```
|
||||
overdue := !sched.Next(*host.LastBackupAt).After(now) // i.e. Next(lastBackup) <= now
|
||||
```
|
||||
using `robfig/cron/v3` (already a dependency) to parse
|
||||
`Schedule.CronExpr`. `Next(lastBackup)` is the first fire strictly
|
||||
after the last successful backup; if that moment has already
|
||||
passed, the window was missed → overdue. (If `LastBackupAt` is nil,
|
||||
treat as overdue so a never-backed-up intermittent host with a
|
||||
schedule gets its first run on connect.)
|
||||
4. For each overdue schedule, dispatch its source-groups via the
|
||||
existing `dispatchBackupForGroupCore()`.
|
||||
5. Clear the entry.
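
The overdue rule in step 3 can be sketched with the standard library alone. The `dailyAt` type below is a simplified stand-in for a parsed cron expression (an assumption for illustration only); it has the same `Next(time.Time) time.Time` method shape as the `Schedule` interface in `robfig/cron/v3`, so the `overdue` function would read the same with a real parsed schedule. The helper names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// schedule mirrors the method shape of a parsed cron schedule.
type schedule interface {
	Next(time.Time) time.Time
}

// dailyAt fires once a day at hour:minute; a stand-in for a cron expression.
type dailyAt struct{ hour, minute int }

// Next returns the first fire time strictly after the given instant.
func (d dailyAt) Next(after time.Time) time.Time {
	t := time.Date(after.Year(), after.Month(), after.Day(), d.hour, d.minute, 0, 0, after.Location())
	if !t.After(after) {
		t = t.AddDate(0, 0, 1)
	}
	return t
}

// overdue reports whether a fire was missed since the last backup.
// A nil lastBackup counts as overdue so a new host gets its first run.
func overdue(s schedule, lastBackup *time.Time, now time.Time) bool {
	if lastBackup == nil {
		return true
	}
	return !s.Next(*lastBackup).After(now)
}

// at parses an RFC 3339 timestamp; demo helper that panics on bad input.
func at(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// ptr returns a pointer to a copy of t.
func ptr(t time.Time) *time.Time { return &t }

func main() {
	nightly := dailyAt{hour: 2}
	last := ptr(at("2026-06-14T02:05:00Z"))
	fmt.Println(overdue(nightly, last, at("2026-06-14T20:00:00Z"))) // false
	fmt.Println(overdue(nightly, last, at("2026-06-15T09:00:00Z"))) // true
	fmt.Println(overdue(nightly, nil, at("2026-06-15T09:00:00Z")))  // true
}
```

Keeping `overdue` a pure function of `(schedule, lastBackup, now)` is what makes the table-driven unit test listed under Testing straightforward.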
|
||||
|
||||
Net latency is ~60–90s after wake (60s settle + up to one 30s tick).
|
||||
This path is independent of and complementary to the `pending_runs`
|
||||
drain, which continues to handle the fired-but-not-sent case.
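
The arm/fire debounce can be sketched as a small in-memory structure. This is an illustration under assumptions, not the planned implementation: `catchupArmer` and its methods are hypothetical names, the `connected` callback stands in for `Hub.Connected`, and steps 2 to 4 of the fire path (in-flight check, overdue computation, dispatch) are left to the caller.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// settleDelay is the 60s settle window applied after an agent hello.
const settleDelay = 60 * time.Second

// catchupArmer is a hypothetical holder for the catchupDueAt map.
type catchupArmer struct {
	mu    sync.Mutex
	dueAt map[string]time.Time
}

func newCatchupArmer() *catchupArmer {
	return &catchupArmer{dueAt: make(map[string]time.Time)}
}

// arm records the due time on hello; a later hello overwrites it (debounce).
func (c *catchupArmer) arm(hostID string, now time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.dueAt[hostID] = now.Add(settleDelay)
}

// fire runs on the periodic tick. Every due entry is cleared; only hosts
// that are still connected are returned, sorted for stable output.
func (c *catchupArmer) fire(now time.Time, connected func(string) bool) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	var ready []string
	for id, due := range c.dueAt {
		if due.After(now) {
			continue // still inside the settle window
		}
		delete(c.dueAt, id) // dropped entries re-arm on the next hello
		if connected(id) {
			ready = append(ready, id)
		}
	}
	sort.Strings(ready)
	return ready
}

// at parses an RFC 3339 timestamp; demo helper that panics on bad input.
func at(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	c := newCatchupArmer()
	up := func(string) bool { return true }
	c.arm("host-1", at("2026-06-15T08:00:00Z"))
	c.arm("host-1", at("2026-06-15T08:00:45Z"))         // flap: overwrites, does not stack
	fmt.Println(c.fire(at("2026-06-15T08:01:30Z"), up)) // []
	fmt.Println(c.fire(at("2026-06-15T08:02:00Z"), up)) // [host-1]
}
```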
|
||||
|
||||
### 5. UI
|
||||
|
||||
- **CSS:** new grey `dot-asleep` token in `web/styles/input.css`,
|
||||
visually distinct from red `dot-offline`.
|
||||
- **`partials/host_row.html` and `partials/host_chrome.html`:** when
|
||||
`!AlwaysOn && status=='offline'`, render the grey dot + label
|
||||
`asleep`; the detail/last-seen line reads
|
||||
`asleep · last seen <relTime> · will catch up on return`. All other
|
||||
states unchanged.
|
||||
- **24×7 chip:** on the host detail header, render a small
|
||||
`Always On` / `24×7` chip **only when `AlwaysOn` is true**. No chip
|
||||
for intermittent hosts. (Chip and checkbox highlight the same fact.)
|
||||
- **Toggle:** an `Always On` checkbox (default checked) on the host edit
|
||||
surface. Operator-band `POST` (mirrors existing host-edit handlers),
|
||||
audited as `host.mode_updated`. On save, if switching to intermittent,
|
||||
trigger the resolve-on-toggle path for any open `agent_offline` alert.
|
||||
|
||||
## Error handling & edge cases
|
||||
|
||||
- **Toggle server→intermittent while offline+alerting:** open
|
||||
`agent_offline` alert auto-resolved on save.
|
||||
- **Toggle intermittent→server while asleep:** host resumes normal
|
||||
offline/alert semantics; it will alert per the 15-minute floor once
|
||||
the sweeper/tick next evaluates it.
|
||||
- **No enabled schedules:** no catch-up and no staleness alert — there
|
||||
is no backup expectation to measure against.
|
||||
- **Catch-up vs in-flight work:** guarded by the running/queued check in
|
||||
step 4.2 so catch-up never races a normal dispatch or pending drain.
|
||||
- **Agent flaps during settle window:** entry dropped if not connected
|
||||
at fire time; re-armed on the next hello.
|
||||
|
||||
## Testing
|
||||
|
||||
- **Alert engine (unit):**
|
||||
- offline alert suppressed when `!AlwaysOn`.
|
||||
- staleness alert raised when intermittent + schedule + last backup >
|
||||
7d; not raised for Always-On hosts; not raised when last backup is
|
||||
recent; not raised when no enabled schedule.
|
||||
- staleness alert auto-resolves after a backup advances `LastBackupAt`.
|
||||
- server→intermittent toggle resolves an open `agent_offline` alert.
|
||||
- **Overdue computation (unit, table-driven):** `(cronExpr,
|
||||
lastBackupAt, now) → overdue?` including nil-last-backup and
|
||||
daily/weekly cases.
|
||||
- **Catch-up scheduler (unit):** fires only when still connected; skips
|
||||
when a backup is running/queued; dispatches only overdue schedules.
|
||||
- **UI (render test):** asleep state + 24×7 chip render under the right
|
||||
conditions; offline state for Always-On hosts unchanged.
|
||||
- `go vet ./...` and full `go test ./...` green before merge.
|
||||
|
||||
## Out of scope
|
||||
|
||||
- Per-host staleness thresholds (global 7d constant for v1).
|
||||
- Continuous (non-reconnect) overdue evaluation.
|
||||
- Agent-side catch-up cron — the server is the reliable arbiter.
|
||||
- Wiring `stale_schedule` for Always-On hosts (separate concern).
|
||||
|
||||
## Task tracking
|
||||
|
||||
Add an entry to `tasks.md` under "Next steps from testing" (or a new
|
||||
small section) once the plan is approved, per the repo's tasks.md
|
||||
source-of-truth rule.
|
||||
@@ -0,0 +1,126 @@
|
||||
# Threat model
|
||||
|
||||
A short, structured walkthrough of the assets restic-manager
|
||||
protects, the actors that interact with it, the attack surfaces
|
||||
exposed, and the mitigations in place. This document is written for
|
||||
operators considering a deployment and for contributors evaluating
|
||||
security-sensitive changes. It is **not** a formal certification —
|
||||
restic-manager has not been third-party audited.
|
||||
|
||||
Last reviewed: **2026-05-09** (against v1.0.0).
|
||||
|
||||
---
|
||||
|
||||
## 1. Assets
|
||||
|
||||
In rough order of sensitivity:
|
||||
|
||||
| Asset | Why it matters |
|
||||
|---|---|
|
||||
| **Restic repository passwords** | Decrypt every backup in the repo. Server holds them encrypted at rest; agents need plaintext at backup-time. |
|
||||
| **Repository URLs with embedded credentials** (e.g. `rest:https://user:pass@host/repo`) | Same as above — read access to the repo is leak-equivalent to the password. |
|
||||
| **Agent bearer tokens** | Long-lived credentials authenticating each agent → server WS. Compromise lets an attacker impersonate that host (push fake snapshots, ack fake schedule versions, exfiltrate repo creds the server pushes back). |
|
||||
| **Server session cookies** | Browser-side session for human operators. Compromise = full UI access at the user's role for the cookie's TTL (24h). |
|
||||
| **Database secret key** | Wraps every encrypted-at-rest field (repo creds, agent enrolment payloads). Disclosure of the file makes those fields decryptable, and with them the backups they protect; rotation requires re-pushing creds to every agent. |
|
||||
| **Bootstrap / setup tokens** | One-shot, time-limited; mint admin or invited-user accounts. |
|
||||
| **Audit log** | Append-only record of admin actions; read-only via UI. Tamper-evident only up to the OS boundary (see 3.8). |
|
||||
| **Backup data on the wire** | Restic itself encrypts on the agent before sending — see "out of scope". |
|
||||
|
||||
---
|
||||
|
||||
## 2. Actors
|
||||
|
||||
| Actor | Trust |
|
||||
|---|---|
|
||||
| **Anonymous internet** | Untrusted. Should not reach the server unless proxied behind auth (see deployment guide). |
|
||||
| **Authenticated viewer** | Read-only on hosts/jobs/alerts/audit. |
|
||||
| **Authenticated operator** | Add/remove hosts, edit schedules, run backups/restores, mint enrolment tokens, ack alerts. |
|
||||
| **Authenticated admin** | All of the above plus user management, role changes, and fleet update controls. Admins do **not** get secret-key visibility (see below). |
|
||||
| **Agent** | Trusted to backup-and-report on its own host only. Cannot read other hosts' creds. Bearer-authenticated. |
|
||||
| **Restic backend (rest-server / S3 / B2 / etc.)** | Out of scope for this document — assumed to authenticate the credentials presented and not collude. |
|
||||
|
||||
---
|
||||
|
||||
## 3. Attack surfaces and mitigations
|
||||
|
||||
### 3.1 First-run bootstrap
|
||||
|
||||
- **Surface**: `/bootstrap` UI + `/api/bootstrap` JSON endpoint.
|
||||
- **Risk**: race between server start and admin creation — an attacker who reaches the server first can claim admin.
|
||||
- **Mitigations**:
|
||||
- Bootstrap token printed to stderr exactly once; held in memory, not persisted.
|
||||
- The UI form on `/bootstrap` uses the in-memory token automatically (no token field for the operator to type or expose).
|
||||
- Both surfaces self-disable the moment any user row exists (`CountUsers > 0`).
|
||||
- Token is also blanked from process memory after success (defence in depth).
|
||||
- **Residual risk**: if an operator brings up the server on the public internet before reaching the bootstrap page, an attacker reaching `/bootstrap` first wins. **Recommendation**: bring the server up behind an existing trusted network or with the listener bound to `127.0.0.1` until first-run is complete.
|
||||
|
||||
### 3.2 Local user accounts
|
||||
|
||||
- **Surface**: `/login`, `/api/auth/login`.
|
||||
- **Mitigations**: Argon2id password hashing with per-deployment params; constant-time password compare; session-cookie minting via `crypto/rand`; session rows hash-only (raw token only in cookie).
|
||||
- **Rate limiting**: Currently not in place at the application layer — the project assumes a reverse proxy enforces login throttling. **Recommendation**: front the server with `caddy`/`nginx` rate-limit rules in production.
|
||||
- **Password policy**: 12-character minimum on bootstrap and user-setup paths; no maximum, no rotation, no history. Sufficient for self-hosted ops; tighten in policy if a deployment requires it.
|
||||
|
||||
### 3.3 OIDC SSO
|
||||
|
||||
- **Surface**: `/auth/oidc/*` — generic OIDC client, JIT user provisioning.
|
||||
- **Mitigations**: state + nonce per flow; role mapping is server-configured (claims trusted only to identify the user, not pick role); user-disabled gate runs after IdP success.
|
||||
- **Residual risk**: misconfigured role-mapping rules can promote any IdP user to admin. **Recommendation**: review `cfg.OIDC.RoleMappings` carefully.
|
||||
|
||||
### 3.4 Agent enrolment
|
||||
|
||||
- **Surface**: `/api/agents/enroll` (token-authenticated), `/api/agents/announce` (anonymous, then operator-approves).
|
||||
- **Mitigations**:
|
||||
- Token path: one-shot, hashed at rest, 1h TTL; agent receives a fresh long-lived bearer in the response.
|
||||
- Announce path: agent supplies an Ed25519 public key; operator sees a fingerprint to confirm out-of-band before accepting.
|
||||
- Bearer tokens are SHA-256 hashed in the DB.
|
||||
- **Residual risk**: an attacker on the network between operator and target host who intercepts the install snippet can enrol *as* the target. The install script must be served over TLS in production (the docker-only deployment uses TLS by default; bare-metal deployers must configure their own).
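
The "hashed at rest" pattern for bearer tokens can be sketched as follows. This is a generic illustration, not the project's code: the function names and the token encoding (32 random bytes, base64url) are assumptions; only the use of SHA-256 for the stored value comes from the list above.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

// hashToken returns the hex SHA-256 digest that would be stored in the DB.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// newBearer mints a random token and the hash to persist. The raw token is
// returned to the agent once and never stored server-side.
func newBearer() (token, hash string, err error) {
	raw := make([]byte, 32)
	if _, err = rand.Read(raw); err != nil {
		return "", "", err
	}
	token = base64.RawURLEncoding.EncodeToString(raw)
	return token, hashToken(token), nil
}

// verifyBearer hashes the presented token and compares in constant time.
func verifyBearer(presented, storedHash string) bool {
	return subtle.ConstantTimeCompare([]byte(hashToken(presented)), []byte(storedHash)) == 1
}

func main() {
	token, hash, err := newBearer()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(token), len(hash))             // 43 64
	fmt.Println(verifyBearer(token, hash))         // true
	fmt.Println(verifyBearer(token+"x", hash))     // false
}
```

A plain fast hash is reasonable here only because the token is high-entropy random data; human-chosen passwords need a slow, salted scheme instead.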
|
||||
|
||||
### 3.5 Agent → server WebSocket
|
||||
|
||||
- **Surface**: persistent WS authenticated by agent bearer.
|
||||
- **Mitigations**: bearer is presented per-connection; server pins the agent fingerprint for the announce flow; messages are envelope-typed and rejected if shape-invalid.
|
||||
- **No payload-level signing** today — TLS is the integrity boundary. A man-in-the-middle with a valid cert chain could swap messages. **Recommendation**: pin the server cert via `RM_SERVER_CERT_PIN_SHA256` if running over a network you don't fully control.
|
||||
|
||||
### 3.6 Repo credential lifecycle
|
||||
|
||||
- Stored encrypted at rest under the AEAD secret key.
|
||||
- Pushed to the agent over the WS on hello, on creds change, and on demand.
|
||||
- Agent persists them encrypted (per-host secret key derived from a value known only to the agent).
|
||||
- Logged surfaces use `restic.RedactURL()` to strip `user:pass@` from URLs before they reach `slog`.
|
||||
- Plaintext form is constructed only at `exec.Command` time inside the agent, never stored on a struct field that could be slogged.
|
||||
|
||||
### 3.7 Restore
|
||||
|
||||
- Operators can restore to any path the agent (running as root) can write.
|
||||
- Cross-host restore (host A's snapshot → host C) is **deferred** — see F-01. The current single-host restore does not require granting any cross-host privileges.
|
||||
|
||||
### 3.8 Audit log
|
||||
|
||||
- Append-only writes from the application; SQLite enforces no schema-level immutability.
|
||||
- A compromise of the SQLite file (via OS-level access) can edit the audit log. **Recommendation**: ship audit entries to an append-only sink (syslog / Loki / Splunk) if tamper-evidence beyond the OS boundary is required.
|
||||
|
||||
### 3.9 Self-update channel (P6)
|
||||
|
||||
- Agents fetch new binaries via the WS transport from the server.
|
||||
- Binaries are signature-checked by the agent against a key embedded in the existing agent (see `internal/fleetupdate/`).
|
||||
- **Residual risk**: a server compromise lets the attacker push code to every agent (running as root). The signing-key compromise window is the same as the server compromise window because both live on the server. Splitting the signing key onto a separate signer is future work (not v1).
|
||||
|
||||
---
|
||||
|
||||
## 4. Out of scope
|
||||
|
||||
- **Restic itself** — its repository format, encryption, and backend protocol are upstream-trusted.
|
||||
- **The host OS** — root compromise of a host obviously compromises that host's backups.
|
||||
- **The backup destination** — restic-manager assumes the rest-server / object-store / SFTP target enforces its own auth.
|
||||
- **Side-channel attacks** on the server process (RAM dump, process tracing).
|
||||
- **Physical access** to the server's disk.
|
||||
|
||||
---
|
||||
|
||||
## 5. Reporting
|
||||
|
||||
Found something we missed? See `SECURITY.md` for the disclosure
process. Coordinated disclosure is preferred; the project is
maintained by a small team and we'll respond as quickly as we
reasonably can.
|
||||
@@ -0,0 +1,42 @@
|
||||
# Build a Linux container that runs the restic-manager agent against a
|
||||
# sibling rest-server in the e2e compose stack. Used only by tests
|
||||
# (e2e/compose.e2e.yml + .gitea/workflows/e2e.yml).
|
||||
#
|
||||
# Two stages:
|
||||
# 1. golang:alpine to build the agent binary.
|
||||
# 2. alpine:3.20 with the `restic` package + the built binary.
|
||||
#
|
||||
# Base images are pinned by tag (not by digest) for CI reproducibility.
|
||||
|
||||
FROM golang:1.25-alpine AS build
|
||||
WORKDIR /src
|
||||
|
||||
ENV CGO_ENABLED=0 \
|
||||
GOFLAGS="-trimpath"
|
||||
|
||||
COPY go.mod go.sum* ./
|
||||
RUN go mod download
|
||||
|
||||
COPY . .
|
||||
ARG VERSION=e2e
|
||||
RUN go build -ldflags="-s -w -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=${VERSION}" \
|
||||
-o /out/restic-manager-agent ./cmd/agent
|
||||
|
||||
FROM alpine:3.20
|
||||
RUN apk add --no-cache restic ca-certificates curl
|
||||
COPY --from=build /out/restic-manager-agent /usr/local/bin/restic-manager-agent
|
||||
|
||||
# Agents normally run as root because backup paths often need it. The
|
||||
# e2e fixture only backs up paths under /data which we own, so this
|
||||
# container would tolerate a non-root user — but staying root keeps
|
||||
# parity with the production install.
|
||||
USER root
|
||||
|
||||
# The agent needs a writable directory for its config + secrets store.
|
||||
RUN mkdir -p /etc/restic-manager /var/lib/restic-manager
|
||||
ENV RM_AGENT_CONFIG=/etc/restic-manager/agent.yaml
|
||||
|
||||
# The compose entrypoint sets the announce URL via env.
|
||||
COPY e2e/agent-entrypoint.sh /usr/local/bin/entrypoint.sh
|
||||
RUN chmod +x /usr/local/bin/entrypoint.sh
|
||||
ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]
|
||||
@@ -0,0 +1,21 @@
|
||||
# Playwright runner for the e2e suite. Built and run by
|
||||
# e2e/compose.e2e.yml so the test process sits on the same docker
|
||||
# network as the server, agent, and rest-server. The previous setup
|
||||
# ran Playwright on the workflow runner host and reached the server
|
||||
# via 127.0.0.1:8080; that fails on Gitea's act-style runners
|
||||
# because the workflow steps execute inside a runner container,
|
||||
# not on the host where compose publishes its ports.
|
||||
|
||||
FROM mcr.microsoft.com/playwright:v1.59.1-jammy
|
||||
|
||||
WORKDIR /work
|
||||
|
||||
# Install npm deps in a separate layer keyed off package.json so
|
||||
# changes to specs don't bust the dep cache.
|
||||
COPY e2e/playwright/package.json /work/package.json
|
||||
RUN npm install --no-audit --no-fund
|
||||
|
||||
COPY e2e/playwright/ /work/
|
||||
|
||||
ENV CI=1
|
||||
ENTRYPOINT ["npx", "playwright", "test"]
|
||||
Executable
+27
@@ -0,0 +1,27 @@
|
||||
#!/bin/sh
|
||||
# Entrypoint for the e2e agent container.
|
||||
#
|
||||
# Three states:
|
||||
# 1. Already enrolled (agent.yaml has a bearer): run the agent.
|
||||
# 2. Token supplied via $RM_ENROL_TOKEN: enrol then run.
|
||||
# 3. Otherwise: announce against $RM_SERVER and wait for an admin to
|
||||
# accept us. The announce flow blocks until accepted, then drops
|
||||
# straight into the normal run loop, so this is the test-friendly
|
||||
# path.
|
||||
set -eu
|
||||
|
||||
CFG="${RM_AGENT_CONFIG:-/etc/restic-manager/agent.yaml}"
|
||||
SERVER="${RM_SERVER:?set RM_SERVER}"
|
||||
|
||||
if [ -f "$CFG" ] && grep -q '^agent_token:' "$CFG"; then
|
||||
exec restic-manager-agent -config "$CFG"
|
||||
fi
|
||||
|
||||
if [ -n "${RM_ENROL_TOKEN:-}" ]; then
|
||||
exec restic-manager-agent -config "$CFG" \
|
||||
-enroll-server "$SERVER" \
|
||||
-enroll-token "$RM_ENROL_TOKEN"
|
||||
fi
|
||||
|
||||
# Announce-and-approve: blocks until an admin accepts, then runs.
|
||||
exec restic-manager-agent -config "$CFG" -enroll-server "$SERVER"
|
||||
@@ -0,0 +1,113 @@
|
||||
# End-to-end test stack — used by .gitea/workflows/e2e.yml and by
|
||||
# operators who want to run the Playwright suite locally.
|
||||
#
|
||||
# Three services:
|
||||
# * server — restic-manager built from the working tree
|
||||
# * agent — restic-manager agent built from the working tree
|
||||
# (announces; Playwright accepts it during the test)
|
||||
# * rest-server — the actual restic backend, sibling of the agent
|
||||
#
|
||||
# Run from the repo root:
|
||||
# docker compose -f e2e/compose.e2e.yml up --build --abort-on-container-exit
|
||||
|
||||
services:
  rest-server:
    image: restic/rest-server:0.13.0
    environment:
      DATA_DIR: /data
      OPTIONS: "--no-auth"
    volumes:
      - rest-data:/data
    networks: [rmnet]

  server:
    build:
      context: ..
      dockerfile: deploy/Dockerfile.server
      args:
        VERSION: e2e
    environment:
      RM_LISTEN: ":8080"
      RM_DATA_DIR: "/data"
      RM_BASE_URL: "http://server:8080"
      RM_COOKIE_SECURE: "false"
      # Bind the metrics endpoint loose for the test, so one of the
      # Playwright assertions can exercise it.
      RM_METRICS_TRUSTED_CIDR: "0.0.0.0/0"
    volumes:
      - server-data:/data
    ports:
      - "127.0.0.1:8080:8080"
    healthcheck:
      test: ["CMD", "/usr/local/bin/restic-manager-server", "--version"]
      interval: 2s
      timeout: 2s
      retries: 30
    networks: [rmnet]

  agent:
    build:
      context: ..
      dockerfile: e2e/Dockerfile.agent
      args:
        VERSION: e2e
    environment:
      RM_SERVER: "http://server:8080"
    depends_on:
      - server
    volumes:
      # Source paths the agent backs up. Compose pre-populates this
      # with a few files so the snapshot list isn't empty.
      - source-data:/source
      - agent-config:/etc/restic-manager
      - agent-state:/var/lib/restic-manager
    networks: [rmnet]

  # Playwright test runner. Profile-gated so `compose up` doesn't
  # start it; CI invokes it via `compose run` and `docker cp`s the
  # report+traces out (see .gitea/workflows/e2e.yml). Lives on
  # rmnet so it can reach the server via its compose-network DNS
  # name rather than depending on host port-publish (which doesn't
  # work on Gitea's container-based runners).
  #
  # Reports are NOT bind-mounted: when the runner job itself runs
  # inside a container, `./playwright/...` resolves to a path that
  # only exists inside the runner container, so the host docker
  # daemon would silently mount an empty dir. Instead the report
  # stays inside the playwright container and the workflow extracts
  # it via `docker cp` before tearing down.
  playwright:
    profiles: [test]
    build:
      context: ..
      dockerfile: e2e/Dockerfile.playwright
    environment:
      RM_BASE_URL: "http://server:8080"
      RM_BOOTSTRAP_TOKEN: "${RM_BOOTSTRAP_TOKEN:-}"
    depends_on:
      - server
      - agent
    networks: [rmnet]

  # One-shot init container that drops a couple of files into the
  # source volume so backups have something to snapshot.
  source-fixture:
    image: alpine:3.20
    command: >
      sh -c 'mkdir -p /source && echo "hello world" > /source/hello.txt &&
      echo "another file" > /source/two.txt && sleep 0.2'
    volumes:
      - source-data:/source
    networks: [rmnet]
    restart: "no"

volumes:
  server-data:
  rest-data:
  source-data:
  agent-config:
  agent-state:

networks:
  rmnet:
    driver: bridge
|
||||
@@ -0,0 +1,14 @@
|
||||
{
|
||||
"name": "restic-manager-e2e",
|
||||
"version": "0.0.0",
|
||||
"private": true,
|
||||
"type": "module",
|
||||
"scripts": {
|
||||
"test": "playwright test",
|
||||
"test:headed": "playwright test --headed",
|
||||
"test:debug": "PWDEBUG=1 playwright test"
|
||||
},
|
||||
"devDependencies": {
|
||||
"@playwright/test": "1.59.1"
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,35 @@
|
||||
import { defineConfig, devices } from '@playwright/test';
|
||||
|
||||
// Single-target Chromium config: the e2e suite is narrow (smoke
|
||||
// the production-shaped flow against the docker-compose stack).
|
||||
// Cross-browser matrix doesn't add signal — what we're verifying is
|
||||
// the server's HTML and the agent's WebSocket handshake, neither of
|
||||
// which depends on browser engine.
|
||||
|
||||
const baseURL = process.env.RM_BASE_URL ?? 'http://127.0.0.1:8080';
|
||||
|
||||
export default defineConfig({
|
||||
testDir: './tests',
|
||||
// 4 minutes — the smoke test waits for: enrolment + bootstrap
|
||||
// (~5s), auto-init landing (~10s), backup completion (~120s
|
||||
// budget). 60s is far too tight in CI; 4m gives headroom even
|
||||
// on a contended runner without masking real regressions.
|
||||
timeout: 240_000,
|
||||
expect: { timeout: 10_000 },
|
||||
fullyParallel: false,
|
||||
retries: process.env.CI ? 1 : 0,
|
||||
workers: 1,
|
||||
reporter: [['list'], ['html', { open: 'never' }]],
|
||||
use: {
|
||||
baseURL,
|
||||
trace: 'retain-on-failure',
|
||||
screenshot: 'only-on-failure',
|
||||
video: 'retain-on-failure',
|
||||
},
|
||||
projects: [
|
||||
{
|
||||
name: 'chromium',
|
||||
use: { ...devices['Desktop Chrome'] },
|
||||
},
|
||||
],
|
||||
});
|
||||
@@ -0,0 +1,152 @@
|
||||
// Helpers used by every test. The shape favours the JSON API for
|
||||
// reads + accept/dispatch (deterministic, easy to assert) and the
|
||||
// browser for human-facing surfaces (login form, dashboard render).
|
||||
|
||||
import { APIRequestContext, expect, Page } from '@playwright/test';
|
||||
|
||||
export const baseURL = process.env.RM_BASE_URL ?? 'http://127.0.0.1:8080';
|
||||
|
||||
export interface HostJSON {
|
||||
id: string;
|
||||
name: string;
|
||||
status: string;
|
||||
repo_status?: string;
|
||||
last_backup_status?: string;
|
||||
}
|
||||
|
||||
export async function readBootstrapToken(): Promise<string> {
|
||||
const tok = process.env.RM_BOOTSTRAP_TOKEN;
|
||||
if (!tok) {
|
||||
throw new Error('RM_BOOTSTRAP_TOKEN not set — the harness scrapes it from server logs');
|
||||
}
|
||||
return tok;
|
||||
}
|
||||
|
||||
export async function bootstrapAdmin(
|
||||
request: APIRequestContext,
|
||||
{
|
||||
username = 'admin',
|
||||
password = 'e2e-test-password-1234',
|
||||
}: { username?: string; password?: string } = {},
|
||||
): Promise<{ username: string; password: string }> {
|
||||
const token = await readBootstrapToken();
|
||||
const res = await request.post(`${baseURL}/api/bootstrap`, {
|
||||
data: { token, username, password },
|
||||
});
|
||||
if (!res.ok() && res.status() !== 409 /* already bootstrapped */) {
|
||||
throw new Error(`bootstrap: ${res.status()} ${await res.text()}`);
|
||||
}
|
||||
return { username, password };
|
||||
}
|
||||
|
||||
export async function loginViaUI(page: Page, username: string, password: string): Promise<void> {
|
||||
await page.goto(`${baseURL}/login`);
|
||||
await page.locator('#login-username').fill(username);
|
||||
await page.locator('#login-password').fill(password);
|
||||
await Promise.all([
|
||||
page.waitForURL(new RegExp(`^${baseURL}/?$`)),
|
||||
page.locator('form[action="/login"] button[type="submit"]').click(),
|
||||
]);
|
||||
}
|
||||
|
||||
/**
|
||||
* Polls the dashboard until a pending host card is visible, then
|
||||
* extracts its pending-id from the inline accept form's action URL.
|
||||
*/
|
||||
export async function waitForPendingHostID(page: Page): Promise<string> {
|
||||
const formLocator = page.locator('form[action^="/api/pending-hosts/"][action$="/accept"]').first();
|
||||
await expect(formLocator).toBeVisible({ timeout: 60_000 });
|
||||
const action = await formLocator.getAttribute('action');
|
||||
if (!action) throw new Error('pending host form has no action attribute');
|
||||
const m = action.match(/\/api\/pending-hosts\/([^/]+)\/accept/);
|
||||
if (!m) throw new Error(`unexpected action URL: ${action}`);
|
||||
return m[1];
|
||||
}
|
||||
|
||||
export async function acceptPending(
|
||||
request: APIRequestContext,
|
||||
cookie: string,
|
||||
pendingID: string,
|
||||
repo: { url: string; username?: string; password: string },
|
||||
): Promise<void> {
|
||||
const res = await request.post(`${baseURL}/api/pending-hosts/${pendingID}/accept`, {
|
||||
headers: { cookie, 'content-type': 'application/json' },
|
||||
data: {
|
||||
repo_url: repo.url,
|
||||
repo_username: repo.username ?? '',
|
||||
repo_password: repo.password,
|
||||
},
|
||||
});
|
||||
if (!res.ok()) {
|
||||
throw new Error(`accept: ${res.status()} ${await res.text()}`);
|
||||
}
|
||||
}
|
||||
|
||||
export async function listHosts(request: APIRequestContext, cookie: string): Promise<HostJSON[]> {
|
||||
const res = await request.get(`${baseURL}/api/hosts`, { headers: { cookie } });
|
||||
if (!res.ok()) throw new Error(`list hosts: ${res.status()} ${await res.text()}`);
|
||||
const body = (await res.json()) as { items?: HostJSON[]; hosts?: HostJSON[] };
|
||||
return body.items ?? body.hosts ?? [];
|
||||
}
|
||||
|
||||
export async function waitForHostStatus(
|
||||
request: APIRequestContext,
|
||||
cookie: string,
|
||||
matcher: (h: HostJSON) => boolean,
|
||||
timeoutMs = 60_000,
|
||||
): Promise<HostJSON> {
|
||||
const deadline = Date.now() + timeoutMs;
|
||||
let last: HostJSON | undefined;
|
||||
while (Date.now() < deadline) {
|
||||
const hosts = await listHosts(request, cookie);
|
||||
const hit = hosts.find(matcher);
|
||||
if (hit) return hit;
|
||||
last = hosts[0];
|
||||
await new Promise((r) => setTimeout(r, 1_000));
|
||||
}
|
||||
throw new Error(`waitForHostStatus: timeout. Last seen: ${JSON.stringify(last)}`);
|
||||
}
|
||||
|
||||
export async function createSourceGroup(
|
||||
request: APIRequestContext,
|
||||
cookie: string,
|
||||
hostID: string,
|
||||
body: { name: string; includes: string[]; excludes?: string[] },
|
||||
): Promise<string> {
|
||||
const res = await request.post(`${baseURL}/api/hosts/${hostID}/source-groups`, {
|
||||
headers: { cookie, 'content-type': 'application/json' },
|
||||
data: {
|
||||
name: body.name,
|
||||
includes: body.includes,
|
||||
excludes: body.excludes ?? [],
|
||||
retention_policy: {},
|
||||
retry_max: 0,
|
||||
retry_backoff_seconds: 0,
|
||||
},
|
||||
});
|
||||
if (!res.ok()) throw new Error(`createSourceGroup: ${res.status()} ${await res.text()}`);
|
||||
const created = (await res.json()) as { id?: string; group?: { id?: string } };
|
||||
const id = created.id ?? created.group?.id;
|
||||
if (!id) throw new Error(`createSourceGroup: no id in response: ${JSON.stringify(created)}`);
|
||||
return id;
|
||||
}
|
||||
|
||||
export async function runSourceGroup(
|
||||
request: APIRequestContext,
|
||||
cookie: string,
|
||||
hostID: string,
|
||||
groupID: string,
|
||||
): Promise<void> {
|
||||
const res = await request.post(
|
||||
`${baseURL}/api/hosts/${hostID}/source-groups/${groupID}/run`,
|
||||
{ headers: { cookie } },
|
||||
);
|
||||
if (!res.ok()) throw new Error(`runSourceGroup: ${res.status()} ${await res.text()}`);
|
||||
}
|
||||
|
||||
export async function getSessionCookie(page: Page): Promise<string> {
|
||||
const cookies = await page.context().cookies();
|
||||
const c = cookies.find((c) => c.name === 'rm_session');
|
||||
if (!c) throw new Error('rm_session cookie not set after login');
|
||||
return `${c.name}=${c.value}`;
|
||||
}
|
||||
@@ -0,0 +1,90 @@
|
||||
// End-to-end smoke: bootstrap → accept pending host → run backup → see succeeded.
|
||||
//
|
||||
// The compose stack stands up a server, a sibling rest-server, and an
|
||||
// agent in announce-and-approve mode. This test drives the operator
|
||||
// path through the UI (login + dashboard) and the API
|
||||
// (accept + run-now + poll for terminal) — UI for the human surfaces,
|
||||
// API for the deterministic ones.
|
||||
|
||||
import { test, expect } from '@playwright/test';
|
||||
import {
|
||||
baseURL,
|
||||
bootstrapAdmin,
|
||||
loginViaUI,
|
||||
waitForPendingHostID,
|
||||
acceptPending,
|
||||
waitForHostStatus,
|
||||
createSourceGroup,
|
||||
runSourceGroup,
|
||||
getSessionCookie,
|
||||
} from './lib/server';
|
||||
|
||||
test.describe('smoke: enrol-via-announce → backup', () => {
|
||||
test('happy path: enrol → accept → backup → succeeded', async ({ page, request }) => {
|
||||
const { username, password } = await bootstrapAdmin(request);
|
||||
await loginViaUI(page, username, password);
|
||||
|
||||
// Dashboard renders.
|
||||
await expect(page.locator('main')).toContainText(/host|fleet|pending/i, { timeout: 10_000 });
|
||||
|
||||
// Pending host appears (the agent container has been
|
||||
// announcing since startup).
|
||||
const pendingID = await waitForPendingHostID(page);
|
||||
const cookie = await getSessionCookie(page);
|
||||
|
||||
// Accept with the rest-server creds. compose's rest-server runs
|
||||
// --no-auth, so any credentials work; restic still demands a
|
||||
// password to encrypt the repo.
|
||||
await acceptPending(request, cookie, pendingID, {
|
||||
url: 'rest:http://rest-server:8000/',
|
||||
password: 'e2e-repo-password',
|
||||
});
|
||||
|
||||
// Wait for the host to come online AND for auto-init to
|
||||
// finish. Coming online happens as soon as the agent's
|
||||
// bearer-authed WS attaches (~1s after accept); repo_status
|
||||
// flips to 'ready' once the auto-init job completes (a
|
||||
// couple of seconds later). Loading the host page before
|
||||
// that leaves the Run-backup button disabled because the
|
||||
// server-rendered HTML reflects the still-in-progress init,
|
||||
// and the page has no live-refresh on that field.
|
||||
const readyHost = await waitForHostStatus(
|
||||
request, cookie,
|
||||
(h) => h.status === 'online' && h.repo_status === 'ready',
|
||||
90_000,
|
||||
);
|
||||
expect(readyHost.id).toBeTruthy();
|
||||
|
||||
// Per-host Run-now is gone; backups are dispatched per
|
||||
// source-group now. Create one that maps to the agent's
|
||||
// /source mount, then kick it via the JSON API.
|
||||
const groupID = await createSourceGroup(request, cookie, readyHost.id, {
|
||||
name: 'default',
|
||||
includes: ['/source'],
|
||||
});
|
||||
await runSourceGroup(request, cookie, readyHost.id, groupID);
|
||||
|
||||
// Wait for the host's last_backup_status to flip to 'succeeded'.
|
||||
// The host record is the source of truth: it's what the
|
||||
// dashboard projects from job-completion events on the WS
|
||||
// channel.
|
||||
const finishedHost = await waitForHostStatus(
|
||||
request, cookie,
|
||||
(h) => h.id === readyHost.id && h.last_backup_status === 'succeeded',
|
||||
120_000,
|
||||
);
|
||||
expect(finishedHost.last_backup_status).toBe('succeeded');
|
||||
});
|
||||
});
|
||||
|
||||
test.describe('smoke: scrape /metrics', () => {
|
||||
test('metrics endpoint exposes the host gauge', async ({ request }) => {
|
||||
// Compose sets RM_METRICS_TRUSTED_CIDR=0.0.0.0/0 so the
|
||||
// endpoint is open to the test runner.
|
||||
const res = await request.get(`${baseURL}/metrics`);
|
||||
expect(res.status()).toBe(200);
|
||||
const body = await res.text();
|
||||
expect(body).toContain('rm_hosts_total');
|
||||
expect(body).toContain('rm_build_info{');
|
||||
});
|
||||
});
|
||||
@@ -2,10 +2,14 @@ package runner
|
||||
|
||||
import (
|
||||
"context"
|
||||
"errors"
|
||||
"os"
|
||||
"os/exec"
|
||||
"path/filepath"
|
||||
"sync"
|
||||
"syscall"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/restic"
|
||||
@@ -43,13 +47,22 @@ func (s *fakeSender) snapshot() []api.Envelope {
|
||||
// setupScript writes a shell script (without shebang) to a temp dir,
|
||||
// names it "restic", makes it executable, and returns the path.
|
||||
//
|
||||
// Writes to "<path>.tmp" then renames into place. The rename is what
|
||||
// makes this race-free: under -race + many t.Parallel tests, a
|
||||
// fork-from-another-goroutine can inherit the writable fd from
|
||||
// Writes to "<path>.tmp" then renames into place. The rename is the
|
||||
// usual guard against ETXTBSY: under -race + many t.Parallel tests,
|
||||
// a fork-from-another-goroutine can inherit the writable fd from
|
||||
// os.WriteFile before close completes, and exec'ing the file then
|
||||
// returns ETXTBSY ("text file busy"). Once the rename lands, the
|
||||
// final path is a fresh dirent pointing at an inode that has no
|
||||
// writable fd open anywhere — exec is safe.
|
||||
// returns ETXTBSY ("text file busy"). The renamed dirent points at
|
||||
// an inode that has no writable fd open anywhere — exec is safe on
|
||||
// a vanilla filesystem.
|
||||
//
|
||||
// On overlayfs (every job that runs inside a `container:` block on
|
||||
// our Gitea runner), the rename can briefly leak ETXTBSY anyway —
|
||||
// the upper layer's "writable inode" bookkeeping lags the userspace
|
||||
// close. To make the helper deterministic across environments, we
|
||||
// probe-exec the file with a benign argument until exec succeeds,
|
||||
// then return. Each script body has a `case "$1" in ... esac` shape
|
||||
// where unknown args fall through to a clean exit, so the probe is
|
||||
// a no-op from the test's point of view.
|
||||
func setupScript(t *testing.T, body string) string {
|
||||
t.Helper()
|
||||
dir := t.TempDir()
|
||||
@@ -61,7 +74,21 @@ func setupScript(t *testing.T, body string) string {
|
||||
if err := os.Rename(tmp, final); err != nil {
|
||||
t.Fatalf("setupScript: rename: %v", err)
|
||||
}
|
||||
return final
|
||||
|
||||
deadline := time.Now().Add(3 * time.Second)
|
||||
for {
|
||||
err := exec.Command(final, "__rm_probe__").Run()
|
||||
if err == nil {
|
||||
return final
|
||||
}
|
||||
if !errors.Is(err, syscall.ETXTBSY) {
|
||||
t.Fatalf("setupScript: probe exec: %v", err)
|
||||
}
|
||||
if time.Now().After(deadline) {
|
||||
t.Fatalf("setupScript: %s still ETXTBSY after 3s", final)
|
||||
}
|
||||
time.Sleep(10 * time.Millisecond)
|
||||
}
|
||||
}
|
||||
|
||||
// firstEnvOfType returns the first envelope with the given type, or
|
||||
|
||||
@@ -0,0 +1,100 @@
|
||||
// Package updater carries the agent's self-update logic.
|
||||
//
|
||||
// The flow is operator-driven: the server dispatches a command.update
|
||||
// WS envelope, the agent fetches a fresh binary from the server's
|
||||
// /agent/binary endpoint, atomic-renames it over the running binary
|
||||
// (Linux) or hands off to a detached helper script (Windows), and
|
||||
// exits cleanly so the service manager restarts under the new
|
||||
// binary. See docs/superpowers/specs/2026-05-06-p6-01-02-...
|
||||
//
|
||||
// Platform-specific code is build-tagged into updater_unix.go /
|
||||
// updater_windows.go. This file holds the shared HTTP fetch + path
|
||||
// helpers + the test seam.
|
||||
package updater
|
||||
|
||||
import (
|
||||
"context"
|
||||
"fmt"
|
||||
"io"
|
||||
"net/http"
|
||||
"os"
|
||||
"path/filepath"
|
||||
"runtime"
|
||||
"time"
|
||||
)
|
||||
|
||||
// fetch downloads the new binary into <binaryPath>.new, fsyncs, chmods.
|
||||
// Returns the path of the staged file (always binaryPath + ".new").
|
||||
func fetch(ctx context.Context, serverURL, binaryPath string) (string, error) {
|
||||
url := fmt.Sprintf("%s/agent/binary?os=%s&arch=%s", serverURL, runtime.GOOS, runtime.GOARCH)
|
||||
req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
c := &http.Client{Timeout: 5 * time.Minute}
|
||||
res, err := c.Do(req)
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
defer func() { _ = res.Body.Close() }()
|
||||
if res.StatusCode != http.StatusOK {
|
||||
return "", fmt.Errorf("agent binary fetch: %s", res.Status)
|
||||
}
|
||||
|
||||
stagePath := binaryPath + ".new"
|
||||
f, err := os.OpenFile(stagePath, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
if _, copyErr := io.Copy(f, res.Body); copyErr != nil {
|
||||
_ = f.Close()
|
||||
_ = os.Remove(stagePath)
|
||||
return "", copyErr
|
||||
}
|
||||
if syncErr := f.Sync(); syncErr != nil {
|
||||
_ = f.Close()
|
||||
_ = os.Remove(stagePath)
|
||||
return "", syncErr
|
||||
}
|
||||
if closeErr := f.Close(); closeErr != nil {
|
||||
_ = os.Remove(stagePath)
|
||||
return "", closeErr
|
||||
}
|
||||
if err := os.Chmod(stagePath, 0o755); err != nil {
|
||||
_ = os.Remove(stagePath)
|
||||
return "", err
|
||||
}
|
||||
return stagePath, nil
|
||||
}
|
||||
|
||||
// resolveOwnBinary returns the absolute path of the running binary.
|
||||
// Refuses /proc/self/exe: os.Executable can return that path on some
// systems, and a staged binary cannot be renamed over it.
|
||||
func resolveOwnBinary() (string, error) {
|
||||
p, err := os.Executable()
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
abs, err := filepath.Abs(p)
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
if abs == "/proc/self/exe" {
|
||||
return "", fmt.Errorf("cannot resolve own binary path (/proc/self/exe)")
|
||||
}
|
||||
return abs, nil
|
||||
}
|
||||
|
||||
// UpdateForTest is the platform-neutral test seam. In production the
|
||||
// platform-specific Update fetches, swaps, then exits the process.
|
||||
// UpdateForTest stops short of the exit so unit tests can assert on
|
||||
// file state.
|
||||
func UpdateForTest(serverURL, binaryPath string) error {
|
||||
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
|
||||
defer cancel()
|
||||
stage, err := fetch(ctx, serverURL, binaryPath)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
return swap(stage, binaryPath)
|
||||
}
|
||||
@@ -0,0 +1,87 @@
|
||||
//go:build !windows
|
||||
|
||||
package updater
|
||||
|
||||
import (
|
||||
"bytes"
|
||||
"io"
|
||||
"net/http"
|
||||
"net/http/httptest"
|
||||
"os"
|
||||
"path/filepath"
|
||||
"runtime"
|
||||
"testing"
|
||||
)
|
||||
|
||||
// TestUpdate_LinuxAtomicSwap stages a fake "running binary" file, runs
|
||||
// UpdateForTest against a fake /agent/binary server, and asserts that
|
||||
// the binary was swapped, .old preserves the previous bytes, and .new
|
||||
// was renamed away.
|
||||
func TestUpdate_LinuxAtomicSwap(t *testing.T) {
|
||||
tmp := t.TempDir()
|
||||
binPath := filepath.Join(tmp, "agent")
|
||||
if err := os.WriteFile(binPath, []byte("OLD"), 0o755); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
newBytes := []byte("NEW BINARY CONTENTS")
|
||||
|
||||
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
if r.URL.Path != "/agent/binary" {
|
||||
http.NotFound(w, r)
|
||||
return
|
||||
}
|
||||
gotOS, gotArch := r.URL.Query().Get("os"), r.URL.Query().Get("arch")
|
||||
if gotOS != runtime.GOOS || gotArch != runtime.GOARCH {
|
||||
t.Errorf("query mismatch: got os=%s arch=%s want %s/%s",
|
||||
gotOS, gotArch, runtime.GOOS, runtime.GOARCH)
|
||||
}
|
||||
_, _ = io.Copy(w, bytes.NewReader(newBytes))
|
||||
}))
|
||||
defer srv.Close()
|
||||
|
||||
if err := UpdateForTest(srv.URL, binPath); err != nil {
|
||||
t.Fatalf("update: %v", err)
|
||||
}
|
||||
|
||||
got, err := os.ReadFile(binPath)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
if string(got) != string(newBytes) {
|
||||
t.Fatalf("binary contents: got %q want %q", got, newBytes)
|
||||
}
|
||||
old, err := os.ReadFile(binPath + ".old")
|
||||
if err != nil {
|
||||
t.Fatalf("agent.old missing: %v", err)
|
||||
}
|
||||
if string(old) != "OLD" {
|
||||
t.Fatalf("agent.old contents: got %q want %q", old, "OLD")
|
||||
}
|
||||
if _, err := os.Stat(binPath + ".new"); !os.IsNotExist(err) {
|
||||
t.Fatalf("agent.new should be absent after swap, got err=%v", err)
|
||||
}
|
||||
}
|
||||
|
||||
// TestUpdate_FetchHTTPError surfaces the server's status when the
|
||||
// binary is not published for this os/arch.
|
||||
func TestUpdate_FetchHTTPError(t *testing.T) {
|
||||
tmp := t.TempDir()
|
||||
binPath := filepath.Join(tmp, "agent")
|
||||
if err := os.WriteFile(binPath, []byte("OLD"), 0o755); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
|
||||
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
http.Error(w, `{"error":"binary_not_published"}`, http.StatusNotFound)
|
||||
}))
|
||||
defer srv.Close()
|
||||
|
||||
err := UpdateForTest(srv.URL, binPath)
|
||||
if err == nil {
|
||||
t.Fatal("expected error, got nil")
|
||||
}
|
||||
got, _ := os.ReadFile(binPath)
|
||||
if string(got) != "OLD" {
|
||||
t.Fatalf("binary should not have changed, got %q", got)
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,73 @@
//go:build !windows

package updater

import (
	"context"
	"fmt"
	"io"
	"log/slog"
	"os"
	"time"
)

// Update fetches the new binary, swaps it in, then exits so systemd
// restarts the process under the new binary. The caller should close
// the WS connection cleanly (so the server transitions the host to
// disconnected immediately rather than waiting for the heartbeat
// sweep) before invoking.
//
// Service-user assumption: the agent runs as root under the
// systemd-shipped unit, which can write the binary path directly.
// If the agent ever moves to a non-root service user, this breaks —
// would need a setuid helper or an out-of-process update service.
func Update(ctx context.Context, serverURL string) error {
	binPath, err := resolveOwnBinary()
	if err != nil {
		return err
	}
	stage, err := fetch(ctx, serverURL, binPath)
	if err != nil {
		return err
	}
	if err := swap(stage, binPath); err != nil {
		return err
	}
	slog.Info("agent self-update: binary swapped, exiting for systemd restart",
		"binary", binPath)
	// Give logger / WS close-frame a moment to flush, then exit.
	time.Sleep(200 * time.Millisecond)
	os.Exit(0)
	return nil // unreachable
}

// swap copies the running binary to <bin>.old (M1 — keep one revision
// back for hand-rolled rollback), then atomic-renames the staged
// binary into place. Linux supports rename-while-open so this works
// even though the running process holds the source open.
func swap(stagePath, binPath string) error {
	src, err := os.Open(binPath)
	if err != nil {
		return fmt.Errorf("open running binary: %w", err)
	}
	defer func() { _ = src.Close() }()
	dst, err := os.OpenFile(binPath+".old", os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
	if err != nil {
		return fmt.Errorf("open .old: %w", err)
	}
	if _, err := io.Copy(dst, src); err != nil {
		_ = dst.Close()
		return fmt.Errorf("copy to .old: %w", err)
	}
	if err := dst.Sync(); err != nil {
		_ = dst.Close()
		return err
	}
	if err := dst.Close(); err != nil {
		return err
	}
	if err := os.Rename(stagePath, binPath); err != nil {
		return fmt.Errorf("rename .new over running binary: %w", err)
	}
	return nil
}
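The backup-then-rename sequence used by `swap` above can be tried in isolation. The following is a minimal sketch, not the project's code: `copyFile`, `backupAndReplace` and `demo` are hypothetical names, and the demo works on throwaway files in a temp directory rather than a running binary.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// copyFile copies src to dst, truncating dst if it exists, and syncs
// the result to disk before closing it.
func copyFile(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return fmt.Errorf("open source: %w", err)
	}
	defer in.Close()

	out, err := os.OpenFile(dst, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
	if err != nil {
		return fmt.Errorf("open destination: %w", err)
	}
	if _, err := io.Copy(out, in); err != nil {
		out.Close()
		return fmt.Errorf("copy: %w", err)
	}
	if err := out.Sync(); err != nil {
		out.Close()
		return fmt.Errorf("sync: %w", err)
	}
	return out.Close()
}

// backupAndReplace keeps one previous revision at target+".old", then
// renames the staged file over target. On the same filesystem the
// rename replaces the directory entry in a single step.
func backupAndReplace(staged, target string) error {
	if err := copyFile(target, target+".old"); err != nil {
		return err
	}
	return os.Rename(staged, target)
}

// demo runs the sequence on temp files and returns the contents of
// the target and of its backup afterwards.
func demo() (string, string, error) {
	dir, err := os.MkdirTemp("", "swap-demo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)

	target := filepath.Join(dir, "agent")
	staged := filepath.Join(dir, "agent.new")
	if err := os.WriteFile(target, []byte("OLD"), 0o755); err != nil {
		return "", "", err
	}
	if err := os.WriteFile(staged, []byte("NEW"), 0o755); err != nil {
		return "", "", err
	}
	if err := backupAndReplace(staged, target); err != nil {
		return "", "", err
	}
	cur, err := os.ReadFile(target)
	if err != nil {
		return "", "", err
	}
	old, err := os.ReadFile(target + ".old")
	if err != nil {
		return "", "", err
	}
	return string(cur), string(old), nil
}

func main() {
	cur, old, err := demo()
	fmt.Println(cur, old, err)
}
```

Copying (rather than renaming) the current file to `.old` means the target path is never missing: until the final rename it still names the old file, and afterwards it names the new one.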
@@ -0,0 +1,73 @@
//go:build windows

package updater

import (
	"context"
	"fmt"
	"log/slog"
	"os"
	"os/exec"
	"path/filepath"
	"syscall"
	"time"
)

// helperScript is rendered with fmt.Sprintf, args order:
//
//	%[1]s — running binary path (source for the .old copy)
//	%[2]s — .old path
//	%[3]s — staged .new path
//	%[4]s — running binary path (rename target)
const helperScript = `@echo off
timeout /t 3 /nobreak >nul
copy /Y "%[1]s" "%[2]s"
sc stop restic-manager-agent
:wait
sc query restic-manager-agent | find "STOPPED" >nul
if errorlevel 1 (timeout /t 1 /nobreak >nul & goto wait)
move /Y "%[3]s" "%[4]s"
sc start restic-manager-agent
del "%%~f0"
`

// Update on Windows can't overwrite the running .exe in-process
// (exclusive file lock), so we stage the new binary, write a small
// detached helper script that waits, stops the service, swaps the
// binary, and starts the service, then exit cleanly. SCM treats
// clean exits after sc stop as intentional and does not auto-restart;
// the helper's final sc start handles that.
func Update(ctx context.Context, serverURL string) error {
	binPath, err := resolveOwnBinary()
	if err != nil {
		return err
	}
	stage, err := fetch(ctx, serverURL, binPath)
	if err != nil {
		return err
	}
	helperPath := filepath.Join(filepath.Dir(binPath), "agent-update.cmd")
	body := fmt.Sprintf(helperScript, binPath, binPath+".old", stage, binPath)
	if err := os.WriteFile(helperPath, []byte(body), 0o755); err != nil {
		return err
	}
	cmd := exec.Command("cmd.exe", "/c", helperPath)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		HideWindow:    true,
		CreationFlags: 0x00000008 | 0x08000000, // DETACHED_PROCESS | CREATE_NO_WINDOW
	}
	if err := cmd.Start(); err != nil {
		return err
	}
	slog.Info("agent self-update: helper spawned, exiting cleanly",
		"binary", binPath, "helper", helperPath)
	time.Sleep(200 * time.Millisecond)
	os.Exit(0)
	return nil // unreachable
}

// swap is unused on Windows — the helper script does the swap.
// Defined to satisfy the build (UpdateForTest references it).
func swap(_, _ string) error {
	return fmt.Errorf("updater.swap not implemented on Windows; use the helper script via Update")
}
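The helper script above relies on two `fmt` features: explicit argument indexes (`%[1]s`) and the `%%` escape that emits a literal percent sign for the batch file's `%~f0`. A small sketch of that rendering step, using a shortened hypothetical template (`script`, `render` and the example paths are illustrative only):

```go
package main

import "fmt"

// script is a cut-down stand-in for the real helper template. %[1]s is
// reused as both the backup source and the move target, and %% renders
// as a single % so the batch file sees %~f0 (its own path).
const script = `copy /Y "%[1]s" "%[2]s"
move /Y "%[3]s" "%[1]s"
del "%%~f0"
`

// render fills the template for a binary path and its staged replacement.
func render(bin, stage string) string {
	return fmt.Sprintf(script, bin, bin+".old", stage)
}

func main() {
	fmt.Print(render(`C:\agent\agent.exe`, `C:\agent\agent.exe.new`))
}
```

Because an index can be repeated, the binary path only needs to be passed once in this variant; the original passes it twice (as arguments 1 and 4), which is equivalent.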
@@ -22,6 +22,12 @@ import (
 	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
 )

+// staleBackupThreshold is how long an intermittent host may go without
+// a successful backup before we raise a stale_schedule alert. Global
+// constant for v1 (may become per-host later). Only intermittent hosts
+// are evaluated — always-on hosts' stale_schedule stays a no-op.
+const staleBackupThreshold = 7 * 24 * time.Hour
+
 // JobFinishedEvent carries everything the engine needs to evaluate
 // the failed-X rules. Pushed via Engine.NotifyJobFinished from the
 // MarkJobFinished site.
@@ -149,6 +155,10 @@ func (e *Engine) handleJobFinished(ctx context.Context, ev JobFinishedEvent) {
 			fmt.Sprintf("%s job %s failed", ev.Kind, ev.JobID), ev.When)
 	case "succeeded":
 		e.resolveAndNotify(ctx, ev.HostID, kind, dedupKey, ev.When)
+		if ev.Kind == "backup" {
+			// A fresh backup clears staleness for intermittent hosts.
+			e.resolveAndNotify(ctx, ev.HostID, KindStaleSchedule, "", ev.When)
+		}
 	}
 }
@@ -157,6 +167,12 @@ func (e *Engine) handleHostOffline(ctx context.Context, hostID string) {
 	if err != nil {
 		return
 	}
+	// Intermittent hosts (laptops) legitimately disappear — never raise
+	// agent_offline for them. The stale_schedule sweep in tick() is the
+	// only staleness signal for these hosts.
+	if !host.AlwaysOn {
+		return
+	}
 	// Apply the 15-min floor — raise only when last_seen_at is older
 	// than agentOfflineFloor. A nil last_seen_at (host enrolled but
 	// never connected) is treated as "now" so we don't raise
@@ -180,11 +196,9 @@ func (e *Engine) handleHostOnline(ctx context.Context, hostID string) {
 // tick is the 60-second sweep. Responsibilities:
 //  1. Re-evaluate agent_offline for every offline host that may have
 //     crossed the floor between explicit events.
-//  2. Stale-schedule detection — declared in the spec but intentionally
-//     left as a no-op in v1. The precise "expected to have fired but
-//     didn't" trigger requires a store helper that lands in a later
-//     task. The KindStaleSchedule constant is exported so UI code can
-//     reference the tag string today.
+//  2. Stale-schedule detection for intermittent hosts — raises
+//     stale_schedule when LastBackupAt is older than 7 days and the
+//     host has an enabled schedule. Always-on hosts are excluded.
 func (e *Engine) tick(ctx context.Context, now time.Time) {
 	// User-management cleanup piggy-backed here for now. Setup tokens
 	// have a 1h expiry; the alert engine tick is the cheapest existing
@@ -203,6 +217,35 @@ func (e *Engine) tick(ctx context.Context, now time.Time) {
 		return
 	}
 	for _, h := range hosts {
+		// Intermittent hosts: suppress agent_offline entirely; instead
+		// raise stale_schedule when they have gone too long with no
+		// successful backup AND they have at least one enabled schedule
+		// to be measured against. A nil LastBackupAt (never backed up)
+		// has no baseline — onboarding/repo_status covers that case.
+		if !h.AlwaysOn {
+			if h.LastBackupAt == nil {
+				continue
+			}
+			if now.Sub(*h.LastBackupAt) < staleBackupThreshold {
+				continue
+			}
+			hasEnabled, err := e.hostHasEnabledSchedule(ctx, h.ID)
+			if err != nil {
+				slog.Warn("alert: tick list schedules", "host_id", h.ID, "err", err)
+				continue
+			}
+			if !hasEnabled {
+				continue
+			}
+			e.raiseAndNotify(ctx, h.ID, KindStaleSchedule, "", "warning",
+				fmt.Sprintf("No backup in %s (threshold %s)",
+					roundDur(now.Sub(*h.LastBackupAt)), staleBackupThreshold), now)
+			// Resolution is handled in handleJobFinished on a successful
+			// backup (and ResolveOnModeChange on toggle) — the tick only
+			// raises, it does not auto-resolve.
+			continue
+		}
+		// Always-on hosts: existing agent_offline re-evaluation.
 		if h.Status != "offline" || h.LastSeenAt == nil {
 			continue
 		}
@@ -212,7 +255,6 @@ func (e *Engine) tick(ctx context.Context, now time.Time) {
 				roundDur(now.Sub(*h.LastSeenAt)), e.agentOfflineFloor), now)
 		}
 	}
-	// Stale-schedule sweep — no-op in v1. See KindStaleSchedule doc comment.
 }

 // roundDur returns a human-readable duration string, rounding to the
@@ -224,3 +266,19 @@ func roundDur(d time.Duration) string {
 	}
 	return d.Round(time.Minute).String()
 }
+
+// hostHasEnabledSchedule reports whether the host has at least one
+// enabled backup schedule — the precondition for a stale_schedule
+// alert (no schedule = no backup expectation to measure against).
+func (e *Engine) hostHasEnabledSchedule(ctx context.Context, hostID string) (bool, error) {
+	schedules, err := e.store.ListSchedulesByHost(ctx, hostID)
+	if err != nil {
+		return false, err
+	}
+	for _, sc := range schedules {
+		if sc.Enabled {
+			return true, nil
+		}
+	}
+	return false, nil
+}
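The conditions the tick applies before raising `stale_schedule` reduce to a small pure predicate. The sketch below restates them outside the engine; `hostView`, `shouldRaiseStale`, `daysAgo` and `demoNow` are hypothetical names for illustration, and the 7-day threshold is taken from the constant in the diff.

```go
package main

import (
	"fmt"
	"time"
)

// staleThreshold mirrors the 7-day staleBackupThreshold from the diff.
const staleThreshold = 7 * 24 * time.Hour

// hostView is a hypothetical, trimmed-down view of a host row.
type hostView struct {
	AlwaysOn           bool
	LastBackupAt       *time.Time
	HasEnabledSchedule bool
}

// shouldRaiseStale reports whether a stale_schedule alert applies:
// only intermittent hosts, only with a backup baseline, only once the
// baseline is at least the threshold old, and only with an enabled schedule.
func shouldRaiseStale(h hostView, now time.Time) bool {
	if h.AlwaysOn || h.LastBackupAt == nil {
		return false
	}
	if now.Sub(*h.LastBackupAt) < staleThreshold {
		return false
	}
	return h.HasEnabledSchedule
}

// demoNow is a fixed reference time so the examples are deterministic.
var demoNow = time.Date(2024, 6, 15, 12, 0, 0, 0, time.UTC)

// daysAgo returns a pointer to the time d days before now.
func daysAgo(now time.Time, d int) *time.Time {
	t := now.Add(-time.Duration(d) * 24 * time.Hour)
	return &t
}

func main() {
	stale := hostView{LastBackupAt: daysAgo(demoNow, 8), HasEnabledSchedule: true}
	fresh := hostView{LastBackupAt: daysAgo(demoNow, 2), HasEnabledSchedule: true}
	never := hostView{HasEnabledSchedule: true}
	fmt.Println(shouldRaiseStale(stale, demoNow), shouldRaiseStale(fresh, demoNow), shouldRaiseStale(never, demoNow))
}
```

Keeping the decision free of store and clock access makes each branch testable without a database, which is roughly what the test file that follows exercises end to end.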
@@ -0,0 +1,255 @@
package alert

import (
	"context"
	"testing"
	"time"

	"github.com/oklog/ulid/v2"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
)

// TestIntermittentHostSuppressesOfflineAlert checks that handleHostOffline
// does NOT raise agent_offline for a host with AlwaysOn=false.
func TestIntermittentHostSuppressesOfflineAlert(t *testing.T) {
	t.Parallel()
	eng, st, hostID := setupEngine(t)
	ctx := context.Background()

	// Make the host intermittent.
	if err := st.SetHostAlwaysOn(ctx, hostID, false); err != nil {
		t.Fatalf("SetHostAlwaysOn: %v", err)
	}

	// Give it a stale last_seen_at well past the floor.
	if _, err := st.DB().Exec(
		`UPDATE hosts SET last_seen_at = ?, status = ? WHERE id = ?`,
		time.Now().UTC().Add(-2*time.Hour).Format(time.RFC3339Nano),
		"offline",
		hostID,
	); err != nil {
		t.Fatalf("update last_seen_at: %v", err)
	}

	eng.handleHostOffline(ctx, hostID)

	open, _ := st.ListAlerts(ctx, store.AlertFilter{Status: "open", HostID: hostID})
	if len(open) != 0 {
		t.Fatalf("expected 0 open alerts for intermittent host; got %d: %+v", len(open), open)
	}
}

// TestAlwaysOnHostStillRaisesOfflineAlert checks that always-on hosts still
// get an agent_offline alert when offline past the floor.
func TestAlwaysOnHostStillRaisesOfflineAlert(t *testing.T) {
	t.Parallel()
	eng, st, hostID := setupEngine(t)
	ctx := context.Background()

	// always_on=true is the default, but be explicit.
	if err := st.SetHostAlwaysOn(ctx, hostID, true); err != nil {
		t.Fatalf("SetHostAlwaysOn: %v", err)
	}

	// Give it a stale last_seen_at well past the 15m floor.
	if _, err := st.DB().Exec(
		`UPDATE hosts SET last_seen_at = ?, status = ? WHERE id = ?`,
		time.Now().UTC().Add(-2*time.Hour).Format(time.RFC3339Nano),
		"offline",
		hostID,
	); err != nil {
		t.Fatalf("update last_seen_at: %v", err)
	}

	eng.handleHostOffline(ctx, hostID)

	open, _ := st.ListAlerts(ctx, store.AlertFilter{Status: "open", HostID: hostID})
	if len(open) != 1 || open[0].Kind != KindAgentOffline {
		t.Fatalf("expected 1 agent_offline alert; got %d: %+v", len(open), open)
	}
}

// TestStalenessAlertForIntermittentHost checks that tick raises stale_schedule
// for an intermittent host whose last backup is older than 7 days AND has an
// enabled schedule. Also verifies that a succeeded backup clears the alert.
func TestStalenessAlertForIntermittentHost(t *testing.T) {
	t.Parallel()
	eng, st, hostID := setupEngine(t)
	ctx := context.Background()

	// Make intermittent.
	if err := st.SetHostAlwaysOn(ctx, hostID, false); err != nil {
		t.Fatalf("SetHostAlwaysOn: %v", err)
	}

	// Create a source group to attach the schedule to.
	sgID := ulid.Make().String()
	if err := st.CreateSourceGroup(ctx, &store.SourceGroup{
		ID:       sgID,
		HostID:   hostID,
		Name:     "default",
		Includes: []string{"/home"},
	}); err != nil {
		t.Fatalf("CreateSourceGroup: %v", err)
	}

	// Create an enabled schedule pointing at the source group.
	schedID := ulid.Make().String()
	if err := st.CreateSchedule(ctx, &store.Schedule{
		ID:             schedID,
		HostID:         hostID,
		CronExpr:       "0 2 * * *",
		Enabled:        true,
		SourceGroupIDs: []string{sgID},
	}); err != nil {
		t.Fatalf("CreateSchedule: %v", err)
	}

	// Set last_backup_at to 8 days ago.
	eightDaysAgo := time.Now().UTC().Add(-8 * 24 * time.Hour)
	if err := st.SetHostLastBackup(ctx, hostID, "succeeded", eightDaysAgo); err != nil {
		t.Fatalf("SetHostLastBackup: %v", err)
	}

	eng.tick(ctx, time.Now().UTC())

	open, _ := st.ListAlerts(ctx, store.AlertFilter{Status: "open", HostID: hostID})
	var staleCount int
	for _, a := range open {
		if a.Kind == KindStaleSchedule {
			staleCount++
		}
	}
	if staleCount != 1 {
		t.Fatalf("expected 1 stale_schedule alert after tick; got %d (all open: %+v)", staleCount, open)
	}

	// A succeeded backup should clear the stale_schedule alert.
	eng.handleJobFinished(ctx, JobFinishedEvent{
		HostID:        hostID,
		JobID:         ulid.Make().String(),
		Kind:          "backup",
		Status:        "succeeded",
		SourceGroupID: sgID,
		When:          time.Now().UTC(),
	})

	open, _ = st.ListAlerts(ctx, store.AlertFilter{Status: "open", HostID: hostID})
	for _, a := range open {
		if a.Kind == KindStaleSchedule {
			t.Fatalf("expected stale_schedule to be resolved after backup succeeded; still open: %+v", a)
		}
	}
}

// TestNoStalenessWithoutEnabledSchedule checks that no stale_schedule is
// raised for an intermittent host with a stale backup but no enabled schedule.
func TestNoStalenessWithoutEnabledSchedule(t *testing.T) {
	t.Parallel()
	eng, st, hostID := setupEngine(t)
	ctx := context.Background()

	// Make intermittent.
	if err := st.SetHostAlwaysOn(ctx, hostID, false); err != nil {
		t.Fatalf("SetHostAlwaysOn: %v", err)
	}

	// Set last_backup_at to 8 days ago — stale — but no schedule.
	eightDaysAgo := time.Now().UTC().Add(-8 * 24 * time.Hour)
	if err := st.SetHostLastBackup(ctx, hostID, "succeeded", eightDaysAgo); err != nil {
		t.Fatalf("SetHostLastBackup: %v", err)
	}

	eng.tick(ctx, time.Now().UTC())

	open, _ := st.ListAlerts(ctx, store.AlertFilter{Status: "open", HostID: hostID})
	for _, a := range open {
		if a.Kind == KindStaleSchedule {
			t.Fatalf("expected no stale_schedule without an enabled schedule; got: %+v", a)
		}
	}
}

// TestResolveOnModeChangeClearsOfflineAlert checks that ResolveOnModeChange
// clears an open agent_offline alert when a host's mode is toggled.
func TestResolveOnModeChangeClearsOfflineAlert(t *testing.T) {
	t.Parallel()
	eng, st, hostID := setupEngine(t)
	ctx := context.Background()

	// Make always-on and set it offline with a stale last_seen_at.
	if err := st.SetHostAlwaysOn(ctx, hostID, true); err != nil {
		t.Fatalf("SetHostAlwaysOn: %v", err)
	}
	if _, err := st.DB().Exec(
		`UPDATE hosts SET last_seen_at = ?, status = ? WHERE id = ?`,
		time.Now().UTC().Add(-2*time.Hour).Format(time.RFC3339Nano),
		"offline",
		hostID,
	); err != nil {
		t.Fatalf("update last_seen_at: %v", err)
	}

	// Raise the offline alert.
	eng.handleHostOffline(ctx, hostID)

	open, _ := st.ListAlerts(ctx, store.AlertFilter{Status: "open", HostID: hostID})
	if len(open) != 1 || open[0].Kind != KindAgentOffline {
		t.Fatalf("expected 1 agent_offline alert before mode change; got %d: %+v", len(open), open)
	}

	// Toggle mode — should clear the alert.
	eng.ResolveOnModeChange(ctx, hostID, time.Now().UTC())

	open, _ = st.ListAlerts(ctx, store.AlertFilter{Status: "open", HostID: hostID})
	for _, a := range open {
		if a.Kind == KindAgentOffline {
			t.Fatalf("expected agent_offline to be resolved after mode change; still open: %+v", a)
		}
	}
}

// TestNoStalenessWhenNeverBackedUp checks that no stale_schedule alert is
// raised for an intermittent host that has never backed up (nil LastBackupAt).
func TestNoStalenessWhenNeverBackedUp(t *testing.T) {
	t.Parallel()
	eng, st, hostID := setupEngine(t)
	ctx := context.Background()

	// Make intermittent.
	if err := st.SetHostAlwaysOn(ctx, hostID, false); err != nil {
		t.Fatalf("SetHostAlwaysOn: %v", err)
	}

	// Create a source group and an enabled schedule — but do NOT set LastBackupAt.
	sgID := ulid.Make().String()
	if err := st.CreateSourceGroup(ctx, &store.SourceGroup{
		ID:       sgID,
		HostID:   hostID,
		Name:     "default",
		Includes: []string{"/home"},
	}); err != nil {
		t.Fatalf("CreateSourceGroup: %v", err)
	}

	schedID := ulid.Make().String()
	if err := st.CreateSchedule(ctx, &store.Schedule{
		ID:             schedID,
		HostID:         hostID,
		CronExpr:       "0 2 * * *",
		Enabled:        true,
		SourceGroupIDs: []string{sgID},
	}); err != nil {
		t.Fatalf("CreateSchedule: %v", err)
	}

	eng.tick(ctx, time.Now().UTC())

	open, _ := st.ListAlerts(ctx, store.AlertFilter{Status: "open", HostID: hostID})
	for _, a := range open {
		if a.Kind == KindStaleSchedule {
			t.Fatalf("expected no stale_schedule when never backed up; got: %+v", a)
		}
	}
}
+14
-4
@@ -27,10 +27,10 @@ const (
 	// integrity is at risk) when a check job fails.
 	KindCheckFailed = "check_failed"

-	// KindStaleSchedule is declared for completeness but intentionally
-	// left as a no-op in v1. The precise "expected to have fired but
-	// didn't" logic requires a store helper that lands in a follow-up
-	// task. Ask the team before implementing.
+	// KindStaleSchedule is raised for intermittent (non-always-on) hosts
+	// when their last successful backup is older than staleBackupThreshold
+	// (7 days) and they have at least one enabled schedule. Resolved on
+	// backup success or when the host is switched to always-on mode.
 	KindStaleSchedule = "stale_schedule"

 	// KindAgentOffline is raised when a host's last_seen_at is older
@@ -122,6 +122,16 @@ func alertPayload(ctx context.Context, st *store.Store, ev notification.Event, a
 	}
 }

+// ResolveOnModeChange clears any open agent_offline and stale_schedule
+// alerts for a host whose always-on flag was just toggled. The next
+// 60s tick re-raises whichever still applies under the new mode, so
+// this is a self-correcting "wipe and let the sweep settle" call.
+// Safe to invoke from the HTTP layer (it only touches the store + hub).
+func (e *Engine) ResolveOnModeChange(ctx context.Context, hostID string, when time.Time) {
+	e.resolveAndNotify(ctx, hostID, KindAgentOffline, "", when)
+	e.resolveAndNotify(ctx, hostID, KindStaleSchedule, "", when)
+}
+
 // resolveAndNotify clears the open (or acknowledged) alert matching
 // (host_id, kind, dedup_key) via store.AutoResolve, then fires
 // alert.resolved for the row(s) actually closed. Best-effort —
@@ -0,0 +1,63 @@
package alert

import (
	"context"
	"fmt"
	"log/slog"
	"time"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/notification"
)

// Alert-kind constants for P6 self-update flows.
const (
	// KindUpdateFailed is raised when an agent fails to come back with
	// the expected version after a command.update dispatch (timeout or
	// version-mismatch). Resolved by a subsequent matching hello.
	KindUpdateFailed = "update_failed"

	// KindFleetUpdateHalted is raised when the fleet-update worker
	// stops mid-run because a host failed to update or went offline.
	// Host-less alert (system-scoped). Manually resolved by an admin.
	KindFleetUpdateHalted = "fleet_update_halted"
)

// RaiseUpdateFailed records a per-host update failure. dedupKey is the
// hostID so a re-dispatch on the same host touches the existing alert
// rather than spawning a duplicate.
func (e *Engine) RaiseUpdateFailed(ctx context.Context, hostID, jobID, reason string, when time.Time) {
	msg := fmt.Sprintf("Agent update failed (job %s): %s", jobID, reason)
	e.raiseAndNotify(ctx, hostID, KindUpdateFailed, hostID, "warning", msg, when)
}

// ResolveUpdateFailed clears any open update_failed alert for hostID.
// Called from the WS hello path when the agent reconnects with the
// target version.
func (e *Engine) ResolveUpdateFailed(ctx context.Context, hostID string, when time.Time) {
	e.resolveAndNotify(ctx, hostID, KindUpdateFailed, hostID, when)
}

// RaiseFleetUpdateHalted is host-less — the fleet update is a
// system-level concept. We persist it via the dedicated host-less
// alert path so the alerts table's host_id column carries NULL.
func (e *Engine) RaiseFleetUpdateHalted(ctx context.Context, fleetUpdateID, reason string, when time.Time) {
	msg := fmt.Sprintf("Fleet update %s halted: %s", fleetUpdateID, reason)
	id, didRaise, err := e.store.RaiseOrTouchSystem(ctx, KindFleetUpdateHalted, fleetUpdateID, "warning", msg, when)
	if err != nil {
		slog.Warn("alert: raise fleet_update_halted", "fu_id", fleetUpdateID, "err", err)
		return
	}
	if !didRaise {
		return
	}
	go e.hub.Dispatch(ctx, notification.Payload{
		Event:    notification.EventRaised,
		AlertID:  id,
		Severity: "warning",
		Kind:     KindFleetUpdateHalted,
		HostID:   "",
		HostName: "",
		Message:  msg,
		RaisedAt: when,
	})
}
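The raise-or-touch behaviour these helpers depend on (a repeat raise with the same kind and dedup key updates the open alert instead of creating a second one, and only a genuinely new alert triggers a notification) can be modelled in memory. This is an illustrative sketch only: `memStore`, `raiseOrTouch` and `demoRaiseTwice` are hypothetical and say nothing about how the real store persists alerts.

```go
package main

import (
	"fmt"
	"time"
)

// alertKey identifies an open alert for dedup purposes.
type alertKey struct{ kind, dedupKey string }

// openAlert is a hypothetical in-memory alert row.
type openAlert struct {
	ID       string
	Message  string
	LastSeen time.Time
}

// memStore is a toy stand-in for the alert table.
type memStore struct {
	next int
	open map[alertKey]*openAlert
}

func newMemStore() *memStore {
	return &memStore{open: map[alertKey]*openAlert{}}
}

// raiseOrTouch returns the alert ID and whether a new alert was raised.
// An existing open alert with the same key is updated ("touched").
func (s *memStore) raiseOrTouch(kind, dedupKey, msg string, when time.Time) (string, bool) {
	k := alertKey{kind, dedupKey}
	if a, ok := s.open[k]; ok {
		a.Message = msg
		a.LastSeen = when
		return a.ID, false
	}
	s.next++
	a := &openAlert{ID: fmt.Sprintf("alert-%d", s.next), Message: msg, LastSeen: when}
	s.open[k] = a
	return a.ID, true
}

// demoRaiseTwice raises the same alert twice and reports whether each
// call created a row and whether both calls returned the same ID.
func demoRaiseTwice() (bool, bool, bool) {
	s := newMemStore()
	now := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	id1, first := s.raiseOrTouch("update_failed", "host-1", "timeout", now)
	id2, second := s.raiseOrTouch("update_failed", "host-1", "timeout again", now.Add(time.Minute))
	return first, second, id1 == id2
}

func main() {
	fmt.Println(demoRaiseTwice())
}
```

A caller that notifies only when the second return value is true gets one notification per incident, which matches the `didRaise` check in `RaiseFleetUpdateHalted`.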
@@ -63,6 +63,7 @@ const (
 	JobUnlock  JobKind = "unlock"
 	JobRestore JobKind = "restore"
 	JobDiff    JobKind = "diff"
+	JobUpdate  JobKind = "update"
 )

 // JobStatus is the lifecycle state of a job.
@@ -361,13 +362,14 @@ type ConfigUpdatePayload struct {
 	BandwidthDownKBps *int `json:"bandwidth_down_kbps,omitempty"`
 }

-// AgentUpdateAvailablePayload — informational only; the agent does
-// NOT self-update. See spec.md §4.2 for the package-manager-based
-// update model.
-type AgentUpdateAvailablePayload struct {
-	LatestVersion string `json:"latest_version"`
-	PackageURL    string `json:"package_url"` // apt repo / choco source
-	Changelog     string `json:"changelog,omitempty"`
+// CommandUpdatePayload carries no operational data — the agent
+// already knows its own os/arch and fetches from its configured
+// server URL via /agent/binary. JobID is the server-issued id of
+// the update job; the agent echoes it on log.stream lines so the
+// live job log captures pre-restart progress, then either exits
+// (Linux) or hands off to a detached helper script (Windows).
+type CommandUpdatePayload struct {
+	JobID string `json:"job_id"`
 }

 // TreeListRequestPayload is the body of a tree.list RPC. Used by the
@@ -29,12 +29,12 @@ const (

 // Server → agent message types.
 const (
-	MsgCommandRun       MessageType = "command.run"
-	MsgCommandCancel    MessageType = "command.cancel"
-	MsgScheduleSet      MessageType = "schedule.set"
-	MsgConfigUpdate     MessageType = "config.update"
-	MsgAgentUpdateAvail MessageType = "agent.update.available"
-	MsgTreeList         MessageType = "tree.list" // sync RPC: list a snapshot's children
+	MsgCommandRun    MessageType = "command.run"
+	MsgCommandCancel MessageType = "command.cancel"
+	MsgScheduleSet   MessageType = "schedule.set"
+	MsgConfigUpdate  MessageType = "config.update"
+	MsgCommandUpdate MessageType = "command.update"
+	MsgTreeList      MessageType = "tree.list" // sync RPC: list a snapshot's children
 )

 // Envelope is the framing for every WS message in either direction.
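For orientation, here is roughly what a `command.update` message could look like on the wire. The `job_id` field and the `command.update` type string come from the diff; the envelope shape (`type` plus a raw `payload`) is an assumption made for this sketch, since the real `Envelope` definition is not shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// envelope is a hypothetical framing struct; the project's actual
// Envelope type may carry additional fields.
type envelope struct {
	Type    string          `json:"type"`
	Payload json.RawMessage `json:"payload"`
}

// commandUpdatePayload mirrors CommandUpdatePayload from the diff.
type commandUpdatePayload struct {
	JobID string `json:"job_id"`
}

// encodeCommandUpdate wraps the payload in an envelope and serialises it.
func encodeCommandUpdate(jobID string) ([]byte, error) {
	p, err := json.Marshal(commandUpdatePayload{JobID: jobID})
	if err != nil {
		return nil, err
	}
	return json.Marshal(envelope{Type: "command.update", Payload: p})
}

// decodeJobID reverses encodeCommandUpdate and returns the job id.
func decodeJobID(raw []byte) (string, error) {
	var env envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return "", err
	}
	if env.Type != "command.update" {
		return "", fmt.Errorf("unexpected message type %q", env.Type)
	}
	var p commandUpdatePayload
	if err := json.Unmarshal(env.Payload, &p); err != nil {
		return "", err
	}
	return p.JobID, nil
}

func main() {
	raw, err := encodeCommandUpdate("job-1")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))
	fmt.Println(decodeJobID(raw))
}
```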
@@ -41,6 +41,24 @@ type Config struct {
 	// DataDir. Source-build deployments can override via
 	// RM_BUNDLED_ASSETS_DIR.
 	BundledAssetsDir string `yaml:"bundled_assets_dir"`
+
+	// MetricsToken, if set, gates the /metrics scrape endpoint
+	// behind an `Authorization: Bearer <token>` check (constant-time
+	// compare). When neither this nor MetricsTrustedCIDRs is set,
+	// the route is not mounted at all (the endpoint is opt-in).
+	MetricsToken string `yaml:"metrics_token"`
+
+	// MetricsTrustedCIDRs, if non-empty, gates /metrics so only
+	// callers from these networks may scrape. ANDed with
+	// MetricsToken when both are set.
+	MetricsTrustedCIDRs []string `yaml:"metrics_trusted_cidrs"`
+}
+
+// MetricsAuthEnabled reports whether the operator has opted into
+// exposing the Prometheus scrape endpoint by configuring at least
+// one auth gate.
+func (c Config) MetricsAuthEnabled() bool {
+	return c.MetricsToken != "" || len(c.MetricsTrustedCIDRs) > 0
 }

 // Load resolves config in this order:
@@ -93,6 +111,19 @@ func Load(yamlPath string) (Config, error) {
 	if v, ok := os.LookupEnv("RM_BUNDLED_ASSETS_DIR"); ok {
 		c.BundledAssetsDir = v
 	}
+	if v, ok := os.LookupEnv("RM_METRICS_TOKEN"); ok {
+		c.MetricsToken = v
+	}
+	if v, ok := os.LookupEnv("RM_METRICS_TRUSTED_CIDR"); ok {
+		parts := strings.Split(v, ",")
+		c.MetricsTrustedCIDRs = c.MetricsTrustedCIDRs[:0]
+		for _, p := range parts {
+			p = strings.TrimSpace(p)
+			if p != "" {
+				c.MetricsTrustedCIDRs = append(c.MetricsTrustedCIDRs, p)
+			}
+		}
+	}
 	if v, ok := os.LookupEnv("RM_TRUSTED_PROXY"); ok {
 		// Comma-separated CIDRs; allow whitespace for readability.
 		parts := strings.Split(v, ",")
@@ -137,5 +168,10 @@ func (c *Config) validate() error {
 			return fmt.Errorf("config: RM_TRUSTED_PROXY entry %q is not a valid CIDR: %w", cidr, err)
 		}
 	}
+	for _, cidr := range c.MetricsTrustedCIDRs {
+		if _, err := netip.ParsePrefix(cidr); err != nil {
+			return fmt.Errorf("config: RM_METRICS_TRUSTED_CIDR entry %q is not a valid CIDR: %w", cidr, err)
+		}
+	}
 	return nil
 }
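The config comments above describe two gates for `/metrics`: a bearer token compared in constant time and a CIDR allow-list, ANDed when both are set, with the endpoint off when neither is. The sketch below shows one way such a check could be written. It is not the project's middleware: `parseCIDRList`, `metricsAllowed`, `demoGate` and the sample token are hypothetical.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/netip"
	"strings"
)

// parseCIDRList splits a comma-separated list, trims whitespace, skips
// empty entries and parses each remaining entry as a CIDR prefix.
func parseCIDRList(v string) ([]netip.Prefix, error) {
	var out []netip.Prefix
	for _, p := range strings.Split(v, ",") {
		p = strings.TrimSpace(p)
		if p == "" {
			continue
		}
		pfx, err := netip.ParsePrefix(p)
		if err != nil {
			return nil, fmt.Errorf("entry %q is not a valid CIDR: %w", p, err)
		}
		out = append(out, pfx)
	}
	return out, nil
}

// metricsAllowed applies every configured gate; with no gate configured
// the endpoint is treated as disabled.
func metricsAllowed(token string, cidrs []netip.Prefix, authHeader string, remote netip.Addr) bool {
	if token == "" && len(cidrs) == 0 {
		return false
	}
	if token != "" {
		got, ok := strings.CutPrefix(authHeader, "Bearer ")
		if !ok || subtle.ConstantTimeCompare([]byte(got), []byte(token)) != 1 {
			return false
		}
	}
	if len(cidrs) > 0 {
		inside := false
		for _, c := range cidrs {
			if c.Contains(remote) {
				inside = true
				break
			}
		}
		if !inside {
			return false
		}
	}
	return true
}

// demoGate checks a request against a fixed sample token and allow-list.
func demoGate(authHeader, remoteIP string) bool {
	cidrs, err := parseCIDRList("10.0.0.0/8, 192.168.1.0/24")
	if err != nil {
		return false
	}
	return metricsAllowed("s3cr3t", cidrs, authHeader, netip.MustParseAddr(remoteIP))
}

func main() {
	fmt.Println(demoGate("Bearer s3cr3t", "10.1.2.3"))
	fmt.Println(demoGate("Bearer wrong", "10.1.2.3"))
	fmt.Println(demoGate("Bearer s3cr3t", "8.8.8.8"))
}
```

Parsing the prefixes once at load time, as `validate` does for its error check, would also let a real handler avoid re-parsing strings on every scrape.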
@@ -98,6 +98,45 @@ func TestCookieSecureDefaultAndOverride(t *testing.T) {
 	}
 }

+func TestMetricsAuthGates(t *testing.T) {
+	t.Setenv("RM_LISTEN", ":8080")
+	t.Setenv("RM_DATA_DIR", "/tmp/x")
+
+	c, err := Load("")
+	if err != nil {
+		t.Fatalf("load: %v", err)
+	}
+	if c.MetricsAuthEnabled() {
+		t.Errorf("metrics endpoint should be off by default")
+	}
+
+	t.Setenv("RM_METRICS_TOKEN", "s3cr3t-token-with-enough-bytes")
+	t.Setenv("RM_METRICS_TRUSTED_CIDR", "10.0.0.0/8, 192.168.1.0/24")
+	c, err = Load("")
+	if err != nil {
+		t.Fatalf("load: %v", err)
+	}
+	if c.MetricsToken != "s3cr3t-token-with-enough-bytes" {
+		t.Errorf("token: %q", c.MetricsToken)
+	}
+	if got := c.MetricsTrustedCIDRs; len(got) != 2 || got[0] != "10.0.0.0/8" || got[1] != "192.168.1.0/24" {
+		t.Errorf("cidrs: %v", got)
+	}
+	if !c.MetricsAuthEnabled() {
+		t.Errorf("MetricsAuthEnabled should be true")
+	}
+}
+
+func TestMetricsTrustedCIDRRejectsGarbage(t *testing.T) {
+	t.Setenv("RM_LISTEN", ":8080")
+	t.Setenv("RM_DATA_DIR", "/tmp/x")
+	t.Setenv("RM_METRICS_TRUSTED_CIDR", "garbage")
+
+	if _, err := Load(""); err == nil {
+		t.Fatal("expected validation error, got nil")
+	}
+}
+
 func writeFile(path string, body []byte) error {
 	return writeFileImpl(path, body)
 }
@@ -0,0 +1,221 @@
// Package fleetupdate drives a rolling, sequential agent self-update
// over a list of hosts. One worker goroutine per Start() call (gated
// at the store layer to at-most-one-running-fleet-update).
package fleetupdate

import (
	"context"
	"errors"
	"fmt"
	"log/slog"
	"time"

	"github.com/oklog/ulid/v2"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
)

// Hub is the slim "is this host connected?" surface.
type Hub interface {
	Connected(hostID string) bool
}

// Dispatcher sends one command.update envelope. The implementer also
// creates the jobs row, writes audit, and registers with the update
// watcher. Pre-checks are the dispatcher's responsibility — the worker
// passes through whatever error it returns.
type Dispatcher interface {
	DispatchUpdate(ctx context.Context, hostID string, actorUserID string) (jobID string, code string, err error)
}

// AlertRaiser is the slim view of the alert engine's host-less raise
// path. Used to emit fleet_update_halted on first failure.
type AlertRaiser interface {
	RaiseFleetUpdateHalted(ctx context.Context, fleetUpdateID, reason string, when time.Time)
}

// Worker is the long-lived fleet-update orchestrator. There is at most
// one *running* fleet update at a time (enforced by the store).
type Worker struct {
	store  *store.Store
	hub    Hub
	disp   Dispatcher
	alerts AlertRaiser

	// targetVersion is the version every dispatched agent is expected
	// to come back with. Captured at Start time to avoid drift.
	targetVersion string

	// pollPeriod controls the cadence at which the worker re-reads the
	// host row to check for the version transition. Exposed for tests.
	pollPeriod time.Duration
	// hostTimeout bounds how long the worker waits for one host to
	// reach the target version before halting.
	hostTimeout time.Duration
}

// NewWorker builds an unstarted worker. targetVersion is set on each
// Start call; the values here are defaults.
func NewWorker(st *store.Store, hub Hub, disp Dispatcher, alerts AlertRaiser) *Worker {
	return &Worker{
		store:       st,
		hub:         hub,
		disp:        disp,
		alerts:      alerts,
		pollPeriod:  1 * time.Second,
		hostTimeout: 95 * time.Second,
	}
}
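The `pollPeriod` and `hostTimeout` fields above imply a poll-until-version-or-timeout wait per host. The worker's actual loop is not shown in this part of the diff, so the following is only a sketch of that general pattern; `waitForVersion`, `errHostTimeout`, the demo helpers and the version strings are all hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errHostTimeout is returned when the host never reports the target version.
var errHostTimeout = errors.New("host did not reach target version in time")

// waitForVersion calls current() every poll interval until it returns
// target, the timeout elapses, or the parent context is cancelled.
func waitForVersion(ctx context.Context, current func() string, target string, poll, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	ticker := time.NewTicker(poll)
	defer ticker.Stop()
	for {
		if current() == target {
			return nil
		}
		select {
		case <-ctx.Done():
			if errors.Is(ctx.Err(), context.DeadlineExceeded) {
				return errHostTimeout
			}
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// demoReaches simulates a host that reports the new version on the third poll.
func demoReaches() error {
	calls := 0
	current := func() string {
		calls++
		if calls >= 3 {
			return "v1.0.0"
		}
		return "v0.9.0"
	}
	return waitForVersion(context.Background(), current, "v1.0.0", time.Millisecond, time.Second)
}

// demoTimesOut simulates a host that never comes back updated.
func demoTimesOut() error {
	current := func() string { return "v0.9.0" }
	return waitForVersion(context.Background(), current, "v1.0.0", time.Millisecond, 20*time.Millisecond)
}

func main() {
	fmt.Println(demoReaches())
	fmt.Println(demoTimesOut())
}
```

Passing the version lookup as a function keeps the wait independent of the store, so tests can shorten both durations, which is presumably why the diff exposes `pollPeriod` for tests.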
// Start creates the parent + child rows, then spawns the per-host
// worker goroutine. Returns the new fleet_update_id on success.
// store.ErrFleetUpdateRunning bubbles up unchanged.
func (w *Worker) Start(ctx context.Context, userID, targetVersion string, hostIDs []string) (string, error) {
	if userID == "" || targetVersion == "" {
		return "", errors.New("fleetupdate: userID and targetVersion required")
	}
	if len(hostIDs) == 0 {
		return "", errors.New("fleetupdate: at least one host required")
	}
	fuID := ulid.Make().String()
	now := time.Now().UTC()
	if err := w.store.CreateFleetUpdate(ctx, store.FleetUpdate{
		ID:              fuID,
		StartedAt:       now,
		StartedByUserID: userID,
		TargetVersion:   targetVersion,
		Status:          "running",
	}, hostIDs); err != nil {
		return "", err
	}

	// The goroutine outlives the request that started it; carry a
	// detached context so an HTTP-handler ctx cancel doesn't abort
	// the long roll.
	bg := context.WithoutCancel(ctx)
	go w.run(bg, fuID, userID, targetVersion)
	return fuID, nil
}
// Cancel marks the fleet update cancelled. The running goroutine
// observes the new status on its next pre-check and exits without
// dispatching further hosts. The currently-dispatched job is left to
// finish on its own; cancelling agent-side is out of scope for v1.
func (w *Worker) Cancel(ctx context.Context, fuID string) error {
	return w.store.CancelFleetUpdate(ctx, fuID, time.Now().UTC())
}

// run is the per-host loop. Halts on first failure; emits one alert
// on transition.
func (w *Worker) run(ctx context.Context, fuID, userID, targetVersion string) {
	w.targetVersion = targetVersion

	for {
		// Check the parent row's status; this picks up Cancel.
		fu, err := w.store.ActiveFleetUpdate(ctx)
		if err != nil {
			slog.Warn("fleetupdate: read active", "fu_id", fuID, "err", err)
			return
		}
		if fu == nil || fu.ID != fuID {
			// Cancelled, halted, or completed externally. Done.
			return
		}

		pending, err := w.store.ListPendingFleetUpdateHosts(ctx, fuID)
		if err != nil {
			slog.Warn("fleetupdate: list pending", "fu_id", fuID, "err", err)
			return
		}
		if len(pending) == 0 {
			now := time.Now().UTC()
			if err := w.store.CompleteFleetUpdate(ctx, fuID, now); err != nil {
				slog.Warn("fleetupdate: complete", "fu_id", fuID, "err", err)
			}
			return
		}

		next := pending[0]
		w.processHost(ctx, fuID, userID, next)
	}
}

// processHost handles one host slot. Marks it skipped, succeeded, or
// failed (and halts the fleet on failure).
func (w *Worker) processHost(ctx context.Context, fuID, userID string, slot store.FleetUpdateHost) {
	hostID := slot.HostID
	_ = w.store.SetFleetUpdateCurrentHost(ctx, fuID, hostID)

	// Pre-flight: re-read the host. The dispatch path repeats most of
	// these checks but doing them up-front lets us emit the right
	// per-host status (skipped vs failed) without consuming a job row.
	host, err := w.store.GetHost(ctx, hostID)
	if err != nil || host == nil {
		_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, hostID, "skipped", "host not found", "")
		return
	}
	if host.AgentVersion != "" && host.AgentVersion == w.targetVersion {
		_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, hostID, "skipped", "already at target version", "")
		return
	}
	if !w.hub.Connected(hostID) {
		reason := fmt.Sprintf("host went offline: %s", hostID)
		_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, hostID, "failed", reason, "")
		w.halt(ctx, fuID, reason)
		return
	}

	// Dispatch.
	_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, hostID, "running", "", "")
	jobID, code, err := w.disp.DispatchUpdate(ctx, hostID, userID)
	if err != nil || code != "" {
		reason := dispatchErrorReason(code, err)
		_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, hostID, "failed", reason, jobID)
		w.halt(ctx, fuID, reason)
		return
	}

	// Poll until the host's recorded agent_version matches target, or
	// timeout.
	deadline := time.Now().Add(w.hostTimeout)
	for time.Now().Before(deadline) {
		// Honour cancellation between polls.
		fu, err := w.store.ActiveFleetUpdate(ctx)
		if err == nil && (fu == nil || fu.ID != fuID) {
			// Cancelled mid-host; leave the slot in 'running' for the
			// admin to inspect. No further dispatches.
			return
		}
		time.Sleep(w.pollPeriod)
		h, err := w.store.GetHost(ctx, hostID)
		if err == nil && h != nil && h.AgentVersion == w.targetVersion {
			if err := w.store.SetFleetUpdateHostStatus(ctx, fuID, hostID, "succeeded", "", jobID); err != nil {
				slog.Warn("fleetupdate: set succeeded", "fu_id", fuID, "host_id", hostID, "err", err)
			}
			return
		}
	}
	reason := fmt.Sprintf("timeout waiting for %s to reach %s", hostID, w.targetVersion)
	_ = w.store.SetFleetUpdateHostStatus(ctx, fuID, hostID, "failed", reason, jobID)
	w.halt(ctx, fuID, reason)
}

func (w *Worker) halt(ctx context.Context, fuID, reason string) {
	now := time.Now().UTC()
	if err := w.store.HaltFleetUpdate(ctx, fuID, reason, now); err != nil {
		slog.Warn("fleetupdate: halt", "fu_id", fuID, "err", err)
	}
	if w.alerts != nil {
		w.alerts.RaiseFleetUpdateHalted(ctx, fuID, reason, now)
	}
}

func dispatchErrorReason(code string, err error) string {
	if code != "" {
		return "dispatch failed: " + code
	}
	if err != nil {
		return err.Error()
	}
	return "dispatch failed"
}
@@ -0,0 +1,344 @@
package fleetupdate

import (
	"context"
	"errors"
	"path/filepath"
	"sync"
	"testing"
	"time"

	"github.com/oklog/ulid/v2"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
)

type fakeHub struct {
	mu     sync.Mutex
	online map[string]bool
}

func (f *fakeHub) Connected(hostID string) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.online[hostID]
}

type fakeDispatcher struct {
	mu    sync.Mutex
	calls []string // host IDs
	// after dispatch, set the host's agent_version to this on the
	// store so the worker observes the version transition.
	st         *store.Store
	target     string
	delayMS    int
	failOnHost map[string]string // host → error code
}

func (f *fakeDispatcher) DispatchUpdate(ctx context.Context, hostID, _ string) (string, string, error) {
	f.mu.Lock()
	f.calls = append(f.calls, hostID)
	if code, ok := f.failOnHost[hostID]; ok {
		f.mu.Unlock()
		return "", code, nil
	}
	st := f.st
	target := f.target
	delay := f.delayMS
	f.mu.Unlock()

	jobID := ulid.Make().String()
	if st != nil {
		_ = st.CreateJob(context.Background(), store.Job{
			ID: jobID, HostID: hostID, Kind: "update",
			ActorKind: "user", CreatedAt: time.Now().UTC(),
		})
	}
	if st != nil && target != "" {
		go func() {
			if delay > 0 {
				time.Sleep(time.Duration(delay) * time.Millisecond)
			}
			_ = st.MarkHostHello(context.Background(), hostID, target, "0.17", api.CurrentProtocolVersion, time.Now().UTC())
		}()
	}
	return jobID, "", nil
}

type recAlert struct {
	mu      sync.Mutex
	reasons []string
}

func (r *recAlert) RaiseFleetUpdateHalted(_ context.Context, _ string, reason string, _ time.Time) {
	r.mu.Lock()
	r.reasons = append(r.reasons, reason)
	r.mu.Unlock()
}

func openStore(t *testing.T) *store.Store {
	t.Helper()
	dir := t.TempDir()
	st, err := store.Open(context.Background(), filepath.Join(dir, "rm.db"))
	if err != nil {
		t.Fatalf("open: %v", err)
	}
	t.Cleanup(func() { _ = st.Close() })
	return st
}

func mustCreateAdmin(t *testing.T, st *store.Store) string {
	t.Helper()
	uid := ulid.Make().String()
	if err := st.CreateUser(context.Background(), store.User{
		ID: uid, Username: "u-" + uid[:6],
		PasswordHash: "x", Role: store.RoleAdmin, CreatedAt: time.Now().UTC(),
	}); err != nil {
		t.Fatalf("user: %v", err)
	}
	return uid
}

func mustCreateHost(t *testing.T, st *store.Store, name, version string) string {
	t.Helper()
	hostID := ulid.Make().String()
	if err := st.CreateHost(context.Background(), store.Host{
		ID: hostID, Name: name, OS: "linux", Arch: "amd64",
		EnrolledAt: time.Now().UTC(),
	}, "deadbeef-"+hostID, ""); err != nil {
		t.Fatalf("host: %v", err)
	}
	if version != "" {
		if err := st.MarkHostHello(context.Background(), hostID, version, "0.17", api.CurrentProtocolVersion, time.Now().UTC()); err != nil {
			t.Fatalf("hello: %v", err)
		}
	}
	return hostID
}

func waitForStatus(t *testing.T, st *store.Store, fuID, want string, timeout time.Duration) *store.FleetUpdate {
	t.Helper()
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		fu, _, err := st.GetFleetUpdate(context.Background(), fuID)
		if err == nil && fu != nil && fu.Status == want {
			return fu
		}
		time.Sleep(20 * time.Millisecond)
	}
	t.Fatalf("status never reached %q", want)
	return nil
}

func TestWorkerTwoHostsBothSucceed(t *testing.T) {
	st := openStore(t)
	uid := mustCreateAdmin(t, st)
	h1 := mustCreateHost(t, st, "h1", "v0")
	h2 := mustCreateHost(t, st, "h2", "v0")

	hub := &fakeHub{online: map[string]bool{h1: true, h2: true}}
	disp := &fakeDispatcher{st: st, target: "v2", delayMS: 30}
	alerts := &recAlert{}
	w := NewWorker(st, hub, disp, alerts)
	w.pollPeriod = 20 * time.Millisecond
	w.hostTimeout = 2 * time.Second

	fuID, err := w.Start(context.Background(), uid, "v2", []string{h1, h2})
	if err != nil {
		t.Fatalf("start: %v", err)
	}
	waitForStatus(t, st, fuID, "completed", 5*time.Second)
	_, hosts, _ := st.GetFleetUpdate(context.Background(), fuID)
	for _, h := range hosts {
		if h.Status != "succeeded" {
			t.Errorf("host %s status %q want succeeded", h.HostID, h.Status)
		}
	}
	if n := len(alerts.reasons); n != 0 {
		t.Errorf("unexpected halt alert: %v", alerts.reasons)
	}
}

func TestWorkerSecondHostTimesOutHalts(t *testing.T) {
	st := openStore(t)
	uid := mustCreateAdmin(t, st)
	h1 := mustCreateHost(t, st, "h1", "v0")
	h2 := mustCreateHost(t, st, "h2", "v0")
	h3 := mustCreateHost(t, st, "h3", "v0")

	hub := &fakeHub{online: map[string]bool{h1: true, h2: true, h3: true}}
	// h1 dispatches normally (transitions to v2). h2's dispatch returns
	// success but the host never transitions, so the worker times out.
	// perHostDispatcher suppresses the auto-transition for h2; the base
	// fakeDispatcher is carried along but never consulted.
	disp := &fakeDispatcher{st: st, target: "v2", delayMS: 20}
	customDisp := &perHostDispatcher{base: disp, st: st, target: "v2", noTransition: map[string]bool{h2: true}}

	alerts := &recAlert{}
	w := NewWorker(st, hub, customDisp, alerts)
	w.pollPeriod = 20 * time.Millisecond
	w.hostTimeout = 200 * time.Millisecond

	fuID, err := w.Start(context.Background(), uid, "v2", []string{h1, h2, h3})
	if err != nil {
		t.Fatalf("start: %v", err)
	}
	waitForStatus(t, st, fuID, "halted", 3*time.Second)
	_, hosts, _ := st.GetFleetUpdate(context.Background(), fuID)
	gotStatus := map[string]string{}
	for _, h := range hosts {
		gotStatus[h.HostID] = h.Status
	}
	if gotStatus[h1] != "succeeded" {
		t.Errorf("h1: %q", gotStatus[h1])
	}
	if gotStatus[h2] != "failed" {
		t.Errorf("h2: %q", gotStatus[h2])
	}
	if gotStatus[h3] != "pending" {
		t.Errorf("h3: %q", gotStatus[h3])
	}
	alerts.mu.Lock()
	defer alerts.mu.Unlock()
	if len(alerts.reasons) != 1 {
		t.Errorf("alert reasons: %v", alerts.reasons)
	}
}

// perHostDispatcher lets a test omit the auto-transition for selected
// hosts so we can simulate timeout.
type perHostDispatcher struct {
	mu           sync.Mutex
	base         *fakeDispatcher
	st           *store.Store
	target       string
	noTransition map[string]bool
}

func (p *perHostDispatcher) DispatchUpdate(_ context.Context, hostID, _ string) (string, string, error) {
	p.mu.Lock()
	skip := p.noTransition[hostID]
	p.mu.Unlock()
	jobID := ulid.Make().String()
	_ = p.st.CreateJob(context.Background(), store.Job{
		ID: jobID, HostID: hostID, Kind: "update",
		ActorKind: "user", CreatedAt: time.Now().UTC(),
	})
	if !skip {
		go func() {
			time.Sleep(20 * time.Millisecond)
			_ = p.st.MarkHostHello(context.Background(), hostID, p.target, "0.17", api.CurrentProtocolVersion, time.Now().UTC())
		}()
	}
	return jobID, "", nil
}

func TestWorkerHostOfflineHalts(t *testing.T) {
	st := openStore(t)
	uid := mustCreateAdmin(t, st)
	h1 := mustCreateHost(t, st, "h1", "v0")
	h2 := mustCreateHost(t, st, "h2", "v0")
	hub := &fakeHub{online: map[string]bool{h1: false, h2: true}}
	disp := &fakeDispatcher{st: st, target: "v2"}
	alerts := &recAlert{}
	w := NewWorker(st, hub, disp, alerts)
	w.pollPeriod = 20 * time.Millisecond
	w.hostTimeout = 500 * time.Millisecond

	fuID, err := w.Start(context.Background(), uid, "v2", []string{h1, h2})
	if err != nil {
		t.Fatalf("start: %v", err)
	}
	waitForStatus(t, st, fuID, "halted", 2*time.Second)
	_, hosts, _ := st.GetFleetUpdate(context.Background(), fuID)
	if hosts[0].Status != "failed" {
		t.Errorf("h1 status: %q", hosts[0].Status)
	}
	if hosts[1].Status != "pending" {
		t.Errorf("h2 status: %q", hosts[1].Status)
	}
}

func TestWorkerAlreadyAtTargetSkipped(t *testing.T) {
	st := openStore(t)
	uid := mustCreateAdmin(t, st)
	h1 := mustCreateHost(t, st, "h1", "v2")
	h2 := mustCreateHost(t, st, "h2", "v0")
	hub := &fakeHub{online: map[string]bool{h1: true, h2: true}}
	disp := &fakeDispatcher{st: st, target: "v2", delayMS: 20}
	alerts := &recAlert{}
	w := NewWorker(st, hub, disp, alerts)
	w.pollPeriod = 20 * time.Millisecond
	w.hostTimeout = 2 * time.Second

	fuID, err := w.Start(context.Background(), uid, "v2", []string{h1, h2})
	if err != nil {
		t.Fatalf("start: %v", err)
	}
	waitForStatus(t, st, fuID, "completed", 4*time.Second)
	_, hosts, _ := st.GetFleetUpdate(context.Background(), fuID)
	want := map[string]string{h1: "skipped", h2: "succeeded"}
	for _, h := range hosts {
		if h.Status != want[h.HostID] {
			t.Errorf("host %s: got %q want %q", h.HostID, h.Status, want[h.HostID])
		}
	}
}

func TestWorkerCancelMidRun(t *testing.T) {
	st := openStore(t)
	uid := mustCreateAdmin(t, st)
	h1 := mustCreateHost(t, st, "h1", "v0")
	h2 := mustCreateHost(t, st, "h2", "v0")
	hub := &fakeHub{online: map[string]bool{h1: true, h2: true}}
	// h1's transition is delayed long enough that we can cancel
	// before it lands; h2 should never be touched.
	disp := &fakeDispatcher{st: st, target: "v2", delayMS: 500}
	alerts := &recAlert{}
	w := NewWorker(st, hub, disp, alerts)
	w.pollPeriod = 50 * time.Millisecond
	w.hostTimeout = 5 * time.Second

	fuID, err := w.Start(context.Background(), uid, "v2", []string{h1, h2})
	if err != nil {
		t.Fatalf("start: %v", err)
	}
	// Give the worker a moment to dispatch h1.
	time.Sleep(100 * time.Millisecond)
	if err := w.Cancel(context.Background(), fuID); err != nil {
		t.Fatalf("cancel: %v", err)
	}
	waitForStatus(t, st, fuID, "cancelled", 2*time.Second)

	// h2 should never be dispatched.
	disp.mu.Lock()
	defer disp.mu.Unlock()
	for _, c := range disp.calls {
		if c == h2 {
			t.Errorf("h2 dispatched after cancel")
		}
	}
}

func TestWorkerStartWhileActiveErrors(t *testing.T) {
	st := openStore(t)
	uid := mustCreateAdmin(t, st)
	h1 := mustCreateHost(t, st, "h1", "v0")
	h2 := mustCreateHost(t, st, "h2", "v0")
	hub := &fakeHub{online: map[string]bool{h1: true, h2: true}}
	disp := &fakeDispatcher{st: st, target: "v2", delayMS: 5_000}
	w := NewWorker(st, hub, disp, &recAlert{})
	w.pollPeriod = 50 * time.Millisecond
	w.hostTimeout = 2 * time.Second
	if _, err := w.Start(context.Background(), uid, "v2", []string{h1}); err != nil {
		t.Fatalf("first start: %v", err)
	}
	_, err := w.Start(context.Background(), uid, "v2", []string{h2})
	if !errors.Is(err, store.ErrFleetUpdateRunning) {
		t.Fatalf("err: %v want ErrFleetUpdateRunning", err)
	}
}
@@ -0,0 +1,141 @@
// catchup.go: server-side catch-up for intermittent (non-always-on)
// hosts. When such a host reconnects we wait a short settle window,
// then dispatch a backup for any schedule whose window elapsed while
// the host was asleep. This is separate from pending_runs: a host that
// was asleep never fired its local cron, so no pending row exists.
package http

import (
	"context"
	"log/slog"
	"time"
)

// scheduleOverdue reports whether the schedule's first expected fire
// after the host's last successful backup has already passed, i.e. a
// window elapsed with no backup. A nil lastBackup means "never backed
// up" and is always overdue (provided the cron parses). An unparseable
// cron is treated as not-overdue so a bad expression can never trigger
// a surprise dispatch. Uses the same cronParser the agent's scheduler
// and schedule validation use, so interpretation is identical.
func scheduleOverdue(cronExpr string, lastBackup *time.Time, now time.Time) bool {
	sched, err := cronParser.Parse(cronExpr)
	if err != nil {
		return false
	}
	if lastBackup == nil {
		return true
	}
	next := sched.Next(*lastBackup)
	return !next.After(now)
}

// catchupSettle is how long after a reconnect we wait before evaluating
// catch-up, so a laptop that wakes briefly and sleeps again doesn't
// trigger a backup it can't finish. ~1 minute per the spec.
const catchupSettle = 60 * time.Second

// ArmCatchup records that an intermittent host just reconnected and
// should be evaluated for a missed backup after the settle window.
// No-op for always-on hosts (caller passes only intermittent hosts).
// Re-arming overwrites the timer (debounce: flapping doesn't stack).
func (s *Server) ArmCatchup(hostID string, now time.Time) {
	s.catchupMu.Lock()
	defer s.catchupMu.Unlock()
	s.catchupDueAt[hostID] = now.Add(catchupSettle)
}

// dueCatchups returns the hostIDs whose settle window has elapsed and
// removes them from the map. Caller evaluates each.
func (s *Server) dueCatchups(now time.Time) []string {
	s.catchupMu.Lock()
	defer s.catchupMu.Unlock()
	var due []string
	for id, at := range s.catchupDueAt {
		if !now.Before(at) {
			due = append(due, id)
			delete(s.catchupDueAt, id)
		}
	}
	return due
}

// RunCatchupsDue is the tick entrypoint. For each host past its settle
// window it dispatches a backup for every enabled schedule that is
// overdue. Skips hosts that bounced back offline, that are already
// running/queued a job, or that turned out to be always-on.
func (s *Server) RunCatchupsDue(ctx context.Context) {
	if s.deps.Hub == nil {
		return
	}
	now := time.Now().UTC()
	for _, hostID := range s.dueCatchups(now) {
		s.runCatchup(ctx, hostID, now)
	}
}

// runCatchup evaluates and dispatches catch-up backups for a single
// host. Kept separate so RunCatchupsDue reads cleanly.
func (s *Server) runCatchup(ctx context.Context, hostID string, now time.Time) {
	conn := s.deps.Hub.Conn(hostID)
	if conn == nil {
		return // bounced offline during the settle window; re-arms on next hello
	}
	host, err := s.deps.Store.GetHost(ctx, hostID)
	if err != nil {
		slog.Warn("catchup: load host", "host_id", hostID, "err", err)
		return
	}
	if host == nil || host.AlwaysOn {
		return // host row missing, or mode flipped during settle window
	}
	// Skip if a backup is already queued or running for this host;
	// don't pile a catch-up on top of in-flight work. (hosts.current_job_id
	// is not maintained, so we check the jobs table directly.)
	active, err := s.deps.Store.HasActiveBackupJob(ctx, hostID)
	if err != nil {
		slog.Warn("catchup: check active backup", "host_id", hostID, "err", err)
		return
	}
	if active {
		return
	}
	schedules, err := s.deps.Store.ListSchedulesByHost(ctx, hostID)
	if err != nil {
		slog.Warn("catchup: list schedules", "host_id", hostID, "err", err)
		return
	}
	// NOTE: overdue is measured against host.LastBackupAt, which is the
	// most recent *successful backup of any schedule* on this host, not
	// a per-schedule timestamp. For the common intermittent host (a
	// single backup schedule) this is exact. With multiple schedules of
	// different cadences, a recent backup from one schedule can mask
	// another schedule's missed window. Acceptable for v1; revisit with
	// per-schedule last-success tracking if multi-cadence laptops appear.
	for _, sc := range schedules {
		if !sc.Enabled || len(sc.SourceGroupIDs) == 0 {
			continue
		}
		if !scheduleOverdue(sc.CronExpr, host.LastBackupAt, now) {
			continue
		}
		for _, gid := range sc.SourceGroupIDs {
			g, err := s.deps.Store.GetSourceGroup(ctx, hostID, gid)
			if err != nil {
				slog.Warn("catchup: load source group",
					"host_id", hostID, "schedule_id", sc.ID, "group_id", gid, "err", err)
				continue
			}
			if _, derr := s.dispatchBackupForGroupCore(ctx, conn, hostID, sc.ID, g, now); derr != nil {
				// Send failed for this group; the host may have dropped
				// again. Earlier groups in this batch were already
				// dispatched. Re-arm so any still-overdue schedules are
				// re-evaluated after another settle window (or on the
				// next hello if the host is gone by then).
				s.ArmCatchup(hostID, now)
				return
			}
			slog.Info("catchup: dispatched missed backup",
				"host_id", hostID, "schedule_id", sc.ID, "group", g.Name)
		}
	}
}
@@ -0,0 +1,246 @@
// catchup_scheduler_test.go: integration tests for the catch-up scheduler.
package http

import (
	"context"
	"testing"
	"time"

	"github.com/oklog/ulid/v2"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
)

// TestRunCatchupDispatchesOverdue verifies four properties of the
// catch-up scheduler in separate sub-tests sharing no state.
func TestRunCatchupDispatchesOverdue(t *testing.T) {
	t.Parallel()

	// --- 1. Overdue host with connected agent → backup dispatched -------
	t.Run("overdue_dispatch", func(t *testing.T) {
		t.Parallel()
		srv, ts, st := rawTestServer(t)
		hostID, token := enrolHostForWS(t, srv, st, "catchup-overdue")

		if err := st.SetHostAlwaysOn(context.Background(), hostID, false); err != nil {
			t.Fatalf("set always_on: %v", err)
		}
		// Last backup ~8 days ago → schedule overdue.
		eightDaysAgo := time.Now().UTC().Add(-8 * 24 * time.Hour)
		if err := st.SetHostLastBackup(context.Background(), hostID, "succeeded", eightDaysAgo); err != nil {
			t.Fatalf("set last backup: %v", err)
		}

		if err := st.CreateJob(context.Background(), store.Job{
			ID: ulid.Make().String(), HostID: hostID, Kind: "init",
			ActorKind: "system", CreatedAt: time.Now().UTC(),
		}); err != nil {
			t.Fatalf("seed init: %v", err)
		}

		gid := ulid.Make().String()
		if err := st.CreateSourceGroup(context.Background(), &store.SourceGroup{
			ID: gid, HostID: hostID, Name: "home", Includes: []string{"/home"},
		}); err != nil {
			t.Fatalf("source group: %v", err)
		}
		sid := ulid.Make().String()
		if err := st.CreateSchedule(context.Background(), &store.Schedule{
			ID: sid, HostID: hostID, CronExpr: "0 2 * * *", Enabled: true,
			SourceGroupIDs: []string{gid},
		}); err != nil {
			t.Fatalf("schedule: %v", err)
		}

		c := agentDial(t, srv, ts, hostID, token)
		sendHello(t, c, "catchup-overdue")
		_ = drainUntil(t, c, api.MsgScheduleSet)

		// Arm with a past time so the settle window is already elapsed.
		srv.ArmCatchup(hostID, time.Now().UTC().Add(-2*time.Minute))
		srv.RunCatchupsDue(context.Background())

		// Give the dispatch goroutine a moment to write the job row.
		time.Sleep(100 * time.Millisecond)

		var n int
		if err := st.DB().QueryRow(
			`SELECT COUNT(*) FROM jobs WHERE host_id = ? AND kind = 'backup'`, hostID).Scan(&n); err != nil {
			t.Fatalf("count: %v", err)
		}
		if n < 1 {
			t.Errorf("overdue host: want ≥1 backup job, got %d", n)
		}
	})

	// --- 2. Not overdue → no dispatch -----------------------------------
	t.Run("not_overdue_no_dispatch", func(t *testing.T) {
		t.Parallel()
		srv, ts, st := rawTestServer(t)
		hostID, token := enrolHostForWS(t, srv, st, "catchup-notoverdue")

		if err := st.SetHostAlwaysOn(context.Background(), hostID, false); err != nil {
			t.Fatalf("set always_on: %v", err)
		}
		// Last backup just now → not overdue.
		now := time.Now().UTC()
		if err := st.SetHostLastBackup(context.Background(), hostID, "succeeded", now); err != nil {
			t.Fatalf("set last backup: %v", err)
		}

		if err := st.CreateJob(context.Background(), store.Job{
			ID: ulid.Make().String(), HostID: hostID, Kind: "init",
			ActorKind: "system", CreatedAt: now,
		}); err != nil {
			t.Fatalf("seed init: %v", err)
		}

		gid := ulid.Make().String()
		if err := st.CreateSourceGroup(context.Background(), &store.SourceGroup{
			ID: gid, HostID: hostID, Name: "home", Includes: []string{"/home"},
		}); err != nil {
			t.Fatalf("source group: %v", err)
		}
		sid := ulid.Make().String()
		if err := st.CreateSchedule(context.Background(), &store.Schedule{
			ID: sid, HostID: hostID, CronExpr: "0 2 * * *", Enabled: true,
			SourceGroupIDs: []string{gid},
		}); err != nil {
			t.Fatalf("schedule: %v", err)
		}

		c := agentDial(t, srv, ts, hostID, token)
		sendHello(t, c, "catchup-notoverdue")
		_ = drainUntil(t, c, api.MsgScheduleSet)

		srv.ArmCatchup(hostID, time.Now().UTC().Add(-2*time.Minute))
		srv.RunCatchupsDue(context.Background())

		time.Sleep(100 * time.Millisecond)

		var n int
		if err := st.DB().QueryRow(
			`SELECT COUNT(*) FROM jobs WHERE host_id = ? AND kind = 'backup'`, hostID).Scan(&n); err != nil {
			t.Fatalf("count: %v", err)
		}
		if n != 0 {
			t.Errorf("not-overdue host: want 0 backup jobs, got %d", n)
		}
	})

	// --- 3. Active backup in flight → no new dispatch -------------------
	t.Run("active_backup_blocks_dispatch", func(t *testing.T) {
		t.Parallel()
		srv, ts, st := rawTestServer(t)
		hostID, token := enrolHostForWS(t, srv, st, "catchup-active")

		if err := st.SetHostAlwaysOn(context.Background(), hostID, false); err != nil {
			t.Fatalf("set always_on: %v", err)
		}
		eightDaysAgo := time.Now().UTC().Add(-8 * 24 * time.Hour)
		if err := st.SetHostLastBackup(context.Background(), hostID, "succeeded", eightDaysAgo); err != nil {
			t.Fatalf("set last backup: %v", err)
		}

		if err := st.CreateJob(context.Background(), store.Job{
			ID: ulid.Make().String(), HostID: hostID, Kind: "init",
			ActorKind: "system", CreatedAt: time.Now().UTC(),
		}); err != nil {
			t.Fatalf("seed init: %v", err)
		}

		gid := ulid.Make().String()
		if err := st.CreateSourceGroup(context.Background(), &store.SourceGroup{
			ID: gid, HostID: hostID, Name: "home", Includes: []string{"/home"},
		}); err != nil {
			t.Fatalf("source group: %v", err)
		}
		sid := ulid.Make().String()
		if err := st.CreateSchedule(context.Background(), &store.Schedule{
			ID: sid, HostID: hostID, CronExpr: "0 2 * * *", Enabled: true,
			SourceGroupIDs: []string{gid},
		}); err != nil {
			t.Fatalf("schedule: %v", err)
		}

		// Seed a queued backup job; this is "already in flight".
		if err := st.CreateJob(context.Background(), store.Job{
			ID: ulid.Make().String(), HostID: hostID, Kind: "backup",
			ActorKind: "schedule", CreatedAt: time.Now().UTC(),
		}); err != nil {
			t.Fatalf("seed queued backup: %v", err)
		}

		c := agentDial(t, srv, ts, hostID, token)
		sendHello(t, c, "catchup-active")
		_ = drainUntil(t, c, api.MsgScheduleSet)

		srv.ArmCatchup(hostID, time.Now().UTC().Add(-2*time.Minute))
		srv.RunCatchupsDue(context.Background())

		time.Sleep(100 * time.Millisecond)

		var n int
		if err := st.DB().QueryRow(
			`SELECT COUNT(*) FROM jobs WHERE host_id = ? AND kind = 'backup'`, hostID).Scan(&n); err != nil {
			t.Fatalf("count: %v", err)
		}
		// Count must still be exactly 1: no second job added.
		if n != 1 {
			t.Errorf("active backup guard: want 1 job (the seeded one), got %d", n)
		}
	})

	// --- 4. Disconnected host → no dispatch -----------------------------
	t.Run("disconnected_no_dispatch", func(t *testing.T) {
		t.Parallel()
		srv, _, st := rawTestServer(t)
		hostID, _ := enrolHostForWS(t, srv, st, "catchup-disconnected")

		if err := st.SetHostAlwaysOn(context.Background(), hostID, false); err != nil {
			t.Fatalf("set always_on: %v", err)
		}
		eightDaysAgo := time.Now().UTC().Add(-8 * 24 * time.Hour)
		if err := st.SetHostLastBackup(context.Background(), hostID, "succeeded", eightDaysAgo); err != nil {
			t.Fatalf("set last backup: %v", err)
		}

		if err := st.CreateJob(context.Background(), store.Job{
			ID: ulid.Make().String(), HostID: hostID, Kind: "init",
			ActorKind: "system", CreatedAt: time.Now().UTC(),
		}); err != nil {
			t.Fatalf("seed init: %v", err)
		}

		gid := ulid.Make().String()
		if err := st.CreateSourceGroup(context.Background(), &store.SourceGroup{
			ID: gid, HostID: hostID, Name: "home", Includes: []string{"/home"},
		}); err != nil {
			t.Fatalf("source group: %v", err)
		}
		sid := ulid.Make().String()
		if err := st.CreateSchedule(context.Background(), &store.Schedule{
			ID: sid, HostID: hostID, CronExpr: "0 2 * * *", Enabled: true,
			SourceGroupIDs: []string{gid},
		}); err != nil {
			t.Fatalf("schedule: %v", err)
		}

		// Host is NOT connected: no agentDial.

		srv.ArmCatchup(hostID, time.Now().UTC().Add(-2*time.Minute))
		srv.RunCatchupsDue(context.Background())

		time.Sleep(100 * time.Millisecond)

		var n int
		if err := st.DB().QueryRow(
			`SELECT COUNT(*) FROM jobs WHERE host_id = ? AND kind = 'backup'`, hostID).Scan(&n); err != nil {
			t.Fatalf("count: %v", err)
		}
		if n != 0 {
			t.Errorf("disconnected host: want 0 backup jobs, got %d", n)
		}
	})
}
@@ -0,0 +1,41 @@
package http

import (
	"testing"
	"time"
)

func TestScheduleOverdue(t *testing.T) {
	mustParse := func(s string) time.Time {
		t.Helper()
		v, err := time.Parse(time.RFC3339, s)
		if err != nil {
			t.Fatalf("parse %q: %v", s, err)
		}
		return v
	}
	daily := "0 2 * * *" // 02:00 every day

	cases := []struct {
		name       string
		cron       string
		lastBackup *time.Time
		now        time.Time
		want       bool
	}{
		{name: "never backed up is overdue", cron: daily, lastBackup: nil, now: mustParse("2026-06-15T09:00:00Z"), want: true},
		{name: "missed last nights window", cron: daily, lastBackup: ptrTime(mustParse("2026-06-13T02:05:00Z")), now: mustParse("2026-06-15T09:00:00Z"), want: true},
		{name: "backed up after the most recent window", cron: daily, lastBackup: ptrTime(mustParse("2026-06-15T02:05:00Z")), now: mustParse("2026-06-15T09:00:00Z"), want: false},
		{name: "unparseable cron is never overdue", cron: "not a cron", lastBackup: nil, now: mustParse("2026-06-15T09:00:00Z"), want: false},
	}
	for _, c := range cases {
		t.Run(c.name, func(t *testing.T) {
			got := scheduleOverdue(c.cron, c.lastBackup, c.now)
			if got != c.want {
				t.Fatalf("scheduleOverdue(%q, %v, %v) = %v, want %v", c.cron, c.lastBackup, c.now, got, c.want)
			}
		})
	}
}

func ptrTime(t time.Time) *time.Time { return &t }
@@ -11,6 +11,7 @@ import (
	"time"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
)

func makeFilterHosts() []store.Host {
@@ -98,6 +99,23 @@ func TestSortDashboardHostsColumns(t *testing.T) {
	}
}

// TestFilterAndSortDashboardUpdatesBehind: ?updates=behind narrows
// to hosts whose agent_version is non-empty AND != server's version.
func TestFilterAndSortDashboardUpdatesBehind(t *testing.T) {
	t.Parallel()
	hosts := []store.Host{
		{ID: "01a", Name: "alpha", AgentVersion: "v0.0.1", Status: "online"},
		{ID: "01b", Name: "bravo", AgentVersion: version.Version, Status: "online"},
		{ID: "01c", Name: "charlie", AgentVersion: "", Status: "online"}, // never seen
		{ID: "01d", Name: "delta", AgentVersion: "v0.0.1", Status: "offline"},
	}
	got := filterAndSortDashboardHosts(hosts, dashboardFilter{Updates: "behind", Sort: "name", Dir: "asc"})
	// alpha + delta both behind; bravo (current) and charlie (empty) excluded.
	if len(got) != 2 || got[0].Name != "alpha" || got[1].Name != "delta" {
		t.Errorf("updates=behind: got %v", namesOf(got))
	}
}

// TestParseDashboardFilterDefaults: empty query gives sort=name asc.
func TestParseDashboardFilterDefaults(t *testing.T) {
	t.Parallel()

@@ -0,0 +1,379 @@
// fleet_update.go: admin-only fleet rolling-update endpoints + page.
//
// Surface:
//   - POST /api/fleet/update               → starts a fleet update (JSON)
//   - POST /api/fleet-updates/{id}/cancel
//   - GET  /api/fleet-updates/{id}         → JSON parent + per-host array
//   - GET  /settings/fleet-update          → admin UI page
//   - GET  /settings/fleet-update/partial  → htmx polling fragment
//
// All routes are mounted in the admin band (see routes()).
package http

import (
	"context"
	"encoding/json"
	"errors"
	"log/slog"
	stdhttp "net/http"
	"time"

	"github.com/go-chi/chi/v5"
	"github.com/oklog/ulid/v2"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
)

// fleetUpdateStartReq is the JSON body for POST /api/fleet/update.
// Both fields are optional: empty target_version defaults to the
// server's current version, empty host_ids derives the out-of-date
// online subset.
type fleetUpdateStartReq struct {
	TargetVersion string   `json:"target_version,omitempty"`
	HostIDs       []string `json:"host_ids,omitempty"`
}

// fleetUpdateHostView is one row in the JSON response for GET
// /api/fleet-updates/{id}. Hostname is hydrated from the store so
// callers don't need a second round-trip per host.
type fleetUpdateHostView struct {
	HostID       string `json:"host_id"`
	HostName     string `json:"host_name,omitempty"`
	Position     int    `json:"position"`
	Status       string `json:"status"`
	JobID        string `json:"job_id,omitempty"`
	FailedReason string `json:"failed_reason,omitempty"`
}

// fleetUpdateView is the JSON projection of the parent + children.
type fleetUpdateView struct {
	ID              string                `json:"id"`
	StartedAt       string                `json:"started_at"`
	StartedByUserID string                `json:"started_by_user_id"`
	TargetVersion   string                `json:"target_version"`
	Status          string                `json:"status"`
	CurrentHostID   string                `json:"current_host_id,omitempty"`
	HaltedReason    string                `json:"halted_reason,omitempty"`
	CompletedAt     *string               `json:"completed_at,omitempty"`
	Hosts           []fleetUpdateHostView `json:"hosts"`
}

// fleetUpdatePage backs both the full /settings/fleet-update page
// and the partial polled fragment. Idle / Active are mutually
// exclusive: if Active is non-nil, render the progress view.
type fleetUpdatePage struct {
	// Idle-state fields.
	OutOfDateHosts []store.Host // online hosts whose version != target
	TargetVersion  string

	// Active-state fields. Nil when no fleet update has ever run.
	Active     *store.FleetUpdate
	ActiveRows []fleetUpdateHostView

	// Common.
	HostNames map[string]string
	// PollURL is the partial endpoint htmx polls every few seconds.
	PollURL string
}

// handleAPIFleetUpdateStart is POST /api/fleet/update.
func (s *Server) handleAPIFleetUpdateStart(w stdhttp.ResponseWriter, r *stdhttp.Request) {
	user, ok := s.requireUser(r)
	if !ok {
		writeJSONError(w, stdhttp.StatusUnauthorized, "unauthorised", "")
		return
	}
	if s.deps.FleetWorker == nil {
		writeJSONError(w, stdhttp.StatusServiceUnavailable, "fleet_worker_unavailable", "")
		return
	}
	var body fleetUpdateStartReq
	// Empty body is fine: both fields are optional.
	if r.ContentLength != 0 {
		if err := json.NewDecoder(r.Body).Decode(&body); err != nil {
			writeJSONError(w, stdhttp.StatusBadRequest, "invalid_json", err.Error())
			return
		}
	}
	target := body.TargetVersion
	if target == "" {
		target = version.Version
	}
	hostIDs := body.HostIDs
	if len(hostIDs) == 0 {
		derived, err := s.deriveOutOfDateOnlineHostIDs(r.Context(), target)
		if err != nil {
			writeJSONError(w, stdhttp.StatusInternalServerError, "internal", err.Error())
			return
		}
		hostIDs = derived
	}
	if len(hostIDs) == 0 {
		writeJSONError(w, stdhttp.StatusConflict, "no_hosts_eligible",
			"no online hosts are out of date")
		return
	}

	fuID, err := s.deps.FleetWorker.Start(r.Context(), user.ID, target, hostIDs)
	if err != nil {
		if errors.Is(err, store.ErrFleetUpdateRunning) {
			writeJSONError(w, stdhttp.StatusConflict, "fleet_update_in_progress", err.Error())
			return
		}
		writeJSONError(w, stdhttp.StatusInternalServerError, "internal", err.Error())
		return
	}

	auditPayload, _ := json.Marshal(map[string]any{
		"fleet_update_id": fuID,
		"target_version":  target,
		"host_count":      len(hostIDs),
	})
	_ = s.deps.Store.AppendAudit(r.Context(), store.AuditEntry{
		ID: ulid.Make().String(), UserID: &user.ID, Actor: "user",
		Action: "fleet.update_started",
		TargetKind: ptr("fleet_update"), TargetID: &fuID,
		TS: time.Now().UTC(),
		Payload: auditPayload,
	})

	writeJSON(w, stdhttp.StatusAccepted, map[string]string{"fleet_update_id": fuID})
}

// handleAPIFleetUpdateCancel is POST /api/fleet-updates/{id}/cancel.
func (s *Server) handleAPIFleetUpdateCancel(w stdhttp.ResponseWriter, r *stdhttp.Request) {
	user, ok := s.requireUser(r)
	if !ok {
		writeJSONError(w, stdhttp.StatusUnauthorized, "unauthorised", "")
		return
	}
	if s.deps.FleetWorker == nil {
		writeJSONError(w, stdhttp.StatusServiceUnavailable, "fleet_worker_unavailable", "")
		return
	}
	fuID := chi.URLParam(r, "id")
	if fuID == "" {
		writeJSONError(w, stdhttp.StatusBadRequest, "missing_id", "")
		return
	}
	fu, _, err := s.deps.Store.GetFleetUpdate(r.Context(), fuID)
	if err != nil {
		if errors.Is(err, store.ErrNotFound) {
			writeJSONError(w, stdhttp.StatusNotFound, "fleet_update_not_found", "")
			return
		}
		writeJSONError(w, stdhttp.StatusInternalServerError, "internal", err.Error())
		return
	}
	if fu.Status != "running" {
		writeJSONError(w, stdhttp.StatusConflict, "fleet_update_not_running",
			"fleet update is not in the running state")
		return
	}
	if err := s.deps.FleetWorker.Cancel(r.Context(), fuID); err != nil {
		writeJSONError(w, stdhttp.StatusInternalServerError, "internal", err.Error())
		return
	}
	_ = s.deps.Store.AppendAudit(r.Context(), store.AuditEntry{
		ID: ulid.Make().String(), UserID: &user.ID, Actor: "user",
		Action: "fleet.update_cancelled",
		TargetKind: ptr("fleet_update"), TargetID: &fuID,
		TS: time.Now().UTC(),
	})
	w.WriteHeader(stdhttp.StatusNoContent)
}

// handleAPIFleetUpdateGet is GET /api/fleet-updates/{id}.
func (s *Server) handleAPIFleetUpdateGet(w stdhttp.ResponseWriter, r *stdhttp.Request) {
	if _, ok := s.requireUser(r); !ok {
		writeJSONError(w, stdhttp.StatusUnauthorized, "unauthorised", "")
		return
	}
	fuID := chi.URLParam(r, "id")
	fu, hosts, err := s.deps.Store.GetFleetUpdate(r.Context(), fuID)
	if err != nil {
		if errors.Is(err, store.ErrNotFound) {
			writeJSONError(w, stdhttp.StatusNotFound, "fleet_update_not_found", "")
			return
		}
		writeJSONError(w, stdhttp.StatusInternalServerError, "internal", err.Error())
		return
	}
	names := s.hostNameMap(r)
	view := fleetUpdateView{
		ID:              fu.ID,
		StartedAt:       fu.StartedAt.UTC().Format(time.RFC3339Nano),
		StartedByUserID: fu.StartedByUserID,
		TargetVersion:   fu.TargetVersion,
		Status:          fu.Status,
		CurrentHostID:   fu.CurrentHostID,
		HaltedReason:    fu.HaltedReason,
		Hosts:           make([]fleetUpdateHostView, 0, len(hosts)),
	}
	if fu.CompletedAt != nil {
		// Named completedAt rather than s so the receiver is not shadowed.
		completedAt := fu.CompletedAt.UTC().Format(time.RFC3339Nano)
		view.CompletedAt = &completedAt
	}
	for _, h := range hosts {
		view.Hosts = append(view.Hosts, fleetUpdateHostView{
			HostID:       h.HostID,
			HostName:     names[h.HostID],
			Position:     h.Position,
			Status:       h.Status,
			JobID:        h.JobID,
			FailedReason: h.FailedReason,
		})
	}
	writeJSON(w, stdhttp.StatusOK, view)
}

// handleUIFleetUpdate renders /settings/fleet-update.
func (s *Server) handleUIFleetUpdate(w stdhttp.ResponseWriter, r *stdhttp.Request) {
	u := s.requireUIUser(w, r)
	if u == nil {
		return
	}
	page, err := s.buildFleetUpdatePage(r)
	if err != nil {
		slog.Error("ui fleet update: build page", "err", err)
		stdhttp.Error(w, "internal", stdhttp.StatusInternalServerError)
		return
	}
	view := s.baseView(r, u)
	view.Title = "Fleet update · restic-manager"
	view.Active = "settings"
	view.Page = page
	if err := s.deps.UI.Render(w, "fleet_update", view); err != nil {
		slog.Error("ui fleet update: render", "err", err)
	}
}

// handleUIFleetUpdatePartial renders just the inner panel for htmx
// auto-refresh polling: same data, no chrome.
func (s *Server) handleUIFleetUpdatePartial(w stdhttp.ResponseWriter, r *stdhttp.Request) {
	u := s.requireUIUser(w, r)
	if u == nil {
		return
	}
	page, err := s.buildFleetUpdatePage(r)
	if err != nil {
		slog.Error("ui fleet update partial: build page", "err", err)
		stdhttp.Error(w, "internal", stdhttp.StatusInternalServerError)
		return
	}
	view := s.baseView(r, u)
	view.Page = page
	if err := s.deps.UI.RenderPartial(w, "fleet_update_inner", view); err != nil {
		slog.Error("ui fleet update partial: render", "err", err)
	}
}

// buildFleetUpdatePage assembles the data both /settings/fleet-update
// and its partial render against. Resolves the most-recent fleet
// update (active OR completed/cancelled/halted) so the page can show
// the last roll's result instead of disappearing into "idle" the
// instant a roll finishes.
func (s *Server) buildFleetUpdatePage(r *stdhttp.Request) (fleetUpdatePage, error) {
	page := fleetUpdatePage{
		TargetVersion: version.Version,
		HostNames:     map[string]string{},
		PollURL:       "/settings/fleet-update/partial",
	}
	hosts, err := s.deps.Store.ListHosts(r.Context())
	if err != nil {
		return page, err
	}
	for _, h := range hosts {
		page.HostNames[h.ID] = h.Name
	}

	active, err := s.deps.Store.ActiveFleetUpdate(r.Context())
	if err != nil {
		return page, err
	}
	mostRecent := active
	if mostRecent == nil {
		// Fall back to the most recent terminal row so the page can
		// show "completed" / "halted" / "cancelled" once the worker
		// finishes. One small bespoke query keeps the page from
		// flashing back to "idle" the instant a roll wraps up.
		var id string
		err := s.deps.Store.DB().QueryRowContext(r.Context(),
			`SELECT id FROM fleet_updates ORDER BY started_at DESC LIMIT 1`).
			Scan(&id)
		if err == nil {
			fu, _, gerr := s.deps.Store.GetFleetUpdate(r.Context(), id)
			if gerr == nil {
				mostRecent = fu
			}
		}
	}

	if mostRecent != nil {
		_, rows, gerr := s.deps.Store.GetFleetUpdate(r.Context(), mostRecent.ID)
		if gerr == nil {
			page.Active = mostRecent
			page.ActiveRows = make([]fleetUpdateHostView, 0, len(rows))
			for _, hr := range rows {
				page.ActiveRows = append(page.ActiveRows, fleetUpdateHostView{
					HostID:       hr.HostID,
					HostName:     page.HostNames[hr.HostID],
					Position:     hr.Position,
					Status:       hr.Status,
					JobID:        hr.JobID,
					FailedReason: hr.FailedReason,
				})
			}
		}
	}

	// Idle list (or "still out of date" reference even when an active
	// roll is running; cheap to compute, harmless to attach).
	for _, h := range hosts {
		if h.Status != "online" {
			continue
		}
		if h.AgentVersion == "" || h.AgentVersion == page.TargetVersion {
			continue
		}
		page.OutOfDateHosts = append(page.OutOfDateHosts, h)
	}
	return page, nil
}

// deriveOutOfDateOnlineHostIDs returns the list of host IDs that
// (a) are online (Hub.Connected) and (b) have an agent_version that's
// non-empty AND != target. Used by the start endpoint when the caller
// omits host_ids.
func (s *Server) deriveOutOfDateOnlineHostIDs(ctx context.Context, target string) ([]string, error) {
	hosts, err := s.deps.Store.ListHosts(ctx)
	if err != nil {
		return nil, err
	}
	out := []string{}
	for _, h := range hosts {
		if h.AgentVersion == "" || h.AgentVersion == target {
			continue
		}
		if !s.deps.Hub.Connected(h.ID) {
			continue
		}
		out = append(out, h.ID)
	}
	return out, nil
}

// hostNameMap returns hostID → name; used to hydrate fleet-update
// JSON responses.
func (s *Server) hostNameMap(r *stdhttp.Request) map[string]string {
	out := map[string]string{}
	hosts, err := s.deps.Store.ListHosts(r.Context())
	if err != nil {
		return out
	}
	for _, h := range hosts {
		out[h.ID] = h.Name
	}
	return out
}
@@ -0,0 +1,334 @@
// fleet_update_test.go: coverage for the P6-15 fleet-update HTTP
// surface: start/cancel/get JSON endpoints + RBAC.
package http

import (
	"bytes"
	"context"
	"encoding/json"
	stdhttp "net/http"
	"sync"
	"testing"
	"time"

	"github.com/oklog/ulid/v2"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ws"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
	"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
)

// fakeFleetWorker stands in for *fleetupdate.Worker in HTTP tests.
// It records what was passed to Start/Cancel and lets tests inject
// canned errors. Satisfies the FleetWorker interface in
// host_update.go.
type fakeFleetWorker struct {
	mu sync.Mutex

	startCalls []fakeStartCall
	startID    string
	startErr   error

	cancelCalls []string
	cancelErr   error
}

type fakeStartCall struct {
	UserID  string
	Target  string
	HostIDs []string
}

func (f *fakeFleetWorker) Start(_ context.Context, userID, target string, hostIDs []string) (string, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.startCalls = append(f.startCalls, fakeStartCall{userID, target, append([]string(nil), hostIDs...)})
	if f.startErr != nil {
		return "", f.startErr
	}
	return f.startID, nil
}

func (f *fakeFleetWorker) Cancel(_ context.Context, id string) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.cancelCalls = append(f.cancelCalls, id)
	return f.cancelErr
}

// helloOnlineHost is the smallest setup that lets the dispatch /
// derivation logic see a host as "online + version mismatch".
// Returns the host id.
func helloOnlineHost(t *testing.T, srv *Server, st *store.Store, name, agentVer string) string {
	t.Helper()
	id := makeHost(t, st, name)
	if err := st.MarkHostHello(context.Background(), id, agentVer, "0.17", api.CurrentProtocolVersion, time.Now().UTC()); err != nil {
		t.Fatalf("mark hello: %v", err)
	}
	// Mark connected on the hub so deriveOutOfDateOnlineHostIDs
	// considers it online without needing a real WS handshake. The
	// Conn has a nil websocket pointer; tests never call Send on it.
	srv.deps.Hub.Register(id, ws.NewConn(id, nil))
	return id
}

func TestFleetUpdateStartHappyPath(t *testing.T) {
	t.Parallel()
	srv, ts, st := rawTestServer(t)
	worker := &fakeFleetWorker{startID: ulid.Make().String()}
	srv.deps.FleetWorker = worker

	cookie, uid := loginAsAdminWithID(t, st)
	hostID := helloOnlineHost(t, srv, st, "fu-host", "v0")

	body := map[string]any{"host_ids": []string{hostID}}
	raw, _ := json.Marshal(body)
	req, _ := stdhttp.NewRequest("POST", ts.URL+"/api/fleet/update", bytes.NewReader(raw))
	req.AddCookie(cookie)
	req.Header.Set("Content-Type", "application/json")
	res, err := stdhttp.DefaultClient.Do(req)
	if err != nil {
		t.Fatalf("do: %v", err)
	}
	defer res.Body.Close()
	if res.StatusCode != stdhttp.StatusAccepted {
		t.Fatalf("status: got %d, want 202", res.StatusCode)
	}
	var out struct {
		FleetUpdateID string `json:"fleet_update_id"`
	}
	if err := json.NewDecoder(res.Body).Decode(&out); err != nil {
		t.Fatalf("decode: %v", err)
	}
	if out.FleetUpdateID != worker.startID {
		t.Fatalf("fleet_update_id: got %q, want %q", out.FleetUpdateID, worker.startID)
	}
	worker.mu.Lock()
	if len(worker.startCalls) != 1 || worker.startCalls[0].UserID != uid {
		t.Fatalf("start calls: %+v", worker.startCalls)
	}
	if got := worker.startCalls[0].HostIDs; len(got) != 1 || got[0] != hostID {
		t.Fatalf("host_ids: %v", got)
	}
	worker.mu.Unlock()

	// Audit row.
	var n int
	if err := st.DB().QueryRow(
		`SELECT COUNT(*) FROM audit_log WHERE action = 'fleet.update_started' AND target_id = ?`,
		out.FleetUpdateID).Scan(&n); err != nil {
		t.Fatalf("audit count: %v", err)
	}
	if n != 1 {
		t.Fatalf("audit rows: got %d, want 1", n)
	}
}

func TestFleetUpdateStartConflictWhenAlreadyRunning(t *testing.T) {
	t.Parallel()
	srv, ts, st := rawTestServer(t)
	worker := &fakeFleetWorker{startErr: store.ErrFleetUpdateRunning}
	srv.deps.FleetWorker = worker
	cookie := loginAsAdmin(t, st)
	_ = helloOnlineHost(t, srv, st, "fu-host", "v0")

	req, _ := stdhttp.NewRequest("POST", ts.URL+"/api/fleet/update", bytes.NewReader([]byte(`{}`)))
	req.AddCookie(cookie)
	req.Header.Set("Content-Type", "application/json")
	res, err := stdhttp.DefaultClient.Do(req)
	if err != nil {
		t.Fatalf("do: %v", err)
	}
	defer res.Body.Close()
	if res.StatusCode != stdhttp.StatusConflict {
		t.Fatalf("status: got %d, want 409", res.StatusCode)
	}
	body := readJSONError(t, res.Body)
	if body.Code != "fleet_update_in_progress" {
		t.Fatalf("code: %q", body.Code)
	}
}

func TestFleetUpdateStartDerivesHostIDsWhenEmpty(t *testing.T) {
	t.Parallel()
	srv, ts, st := rawTestServer(t)
	worker := &fakeFleetWorker{startID: ulid.Make().String()}
	srv.deps.FleetWorker = worker
	cookie := loginAsAdmin(t, st)

	// Two online + out-of-date, one online + at-target, one offline.
	a := helloOnlineHost(t, srv, st, "behind-a", "v0")
	b := helloOnlineHost(t, srv, st, "behind-b", "v0")
	_ = helloOnlineHost(t, srv, st, "uptodate", version.Version)
	offlineID := makeHost(t, st, "offline-host")
	if err := st.MarkHostHello(context.Background(), offlineID, "v0", "0.17", api.CurrentProtocolVersion, time.Now().UTC()); err != nil {
		t.Fatalf("mark hello: %v", err)
	}
	// Not registered on the hub → derivation should skip it.

	req, _ := stdhttp.NewRequest("POST", ts.URL+"/api/fleet/update", bytes.NewReader([]byte(`{}`)))
	req.AddCookie(cookie)
	req.Header.Set("Content-Type", "application/json")
	res, err := stdhttp.DefaultClient.Do(req)
	if err != nil {
		t.Fatalf("do: %v", err)
	}
	defer res.Body.Close()
	if res.StatusCode != stdhttp.StatusAccepted {
		t.Fatalf("status: got %d, want 202", res.StatusCode)
	}
	worker.mu.Lock()
	defer worker.mu.Unlock()
	if len(worker.startCalls) != 1 {
		t.Fatalf("start calls: %d", len(worker.startCalls))
	}
	got := worker.startCalls[0].HostIDs
	want := map[string]bool{a: true, b: true}
	if len(got) != 2 || !want[got[0]] || !want[got[1]] {
		t.Fatalf("derived host_ids: got %v, want both of %v", got, []string{a, b})
	}
}

func TestFleetUpdateCancelHappyPath(t *testing.T) {
	t.Parallel()
	srv, ts, st := rawTestServer(t)
	worker := &fakeFleetWorker{}
	srv.deps.FleetWorker = worker
	cookie := loginAsAdmin(t, st)

	// Seed a running fleet update directly.
	fuID := ulid.Make().String()
	uid := ulid.Make().String()
	if err := st.CreateUser(context.Background(), store.User{
		ID: uid, Username: "starter", PasswordHash: "x",
		Role: store.RoleAdmin, CreatedAt: time.Now().UTC(),
	}); err != nil {
		t.Fatalf("seed user: %v", err)
	}
	hostID := makeHost(t, st, "fu-cancel-host")
	if err := st.CreateFleetUpdate(context.Background(),
		store.FleetUpdate{ID: fuID, StartedByUserID: uid, TargetVersion: "v1"},
		[]string{hostID}); err != nil {
		t.Fatalf("seed fleet update: %v", err)
	}

	req, _ := stdhttp.NewRequest("POST", ts.URL+"/api/fleet-updates/"+fuID+"/cancel", nil)
	req.AddCookie(cookie)
	res, err := stdhttp.DefaultClient.Do(req)
	if err != nil {
		t.Fatalf("do: %v", err)
	}
	defer res.Body.Close()
	if res.StatusCode != stdhttp.StatusNoContent {
		t.Fatalf("status: got %d, want 204", res.StatusCode)
	}
	worker.mu.Lock()
	if len(worker.cancelCalls) != 1 || worker.cancelCalls[0] != fuID {
		t.Fatalf("cancel calls: %v", worker.cancelCalls)
	}
	worker.mu.Unlock()
}

func TestFleetUpdateCancelNotRunning(t *testing.T) {
	t.Parallel()
	srv, ts, st := rawTestServer(t)
	srv.deps.FleetWorker = &fakeFleetWorker{}
	cookie := loginAsAdmin(t, st)

	// Seed + complete one so it's no longer running.
	fuID := ulid.Make().String()
	uid := ulid.Make().String()
	_ = st.CreateUser(context.Background(), store.User{
		ID: uid, Username: "starter2", PasswordHash: "x",
		Role: store.RoleAdmin, CreatedAt: time.Now().UTC(),
	})
	hostID := makeHost(t, st, "fu-done-host")
	_ = st.CreateFleetUpdate(context.Background(),
		store.FleetUpdate{ID: fuID, StartedByUserID: uid, TargetVersion: "v1"},
		[]string{hostID})
	if err := st.CompleteFleetUpdate(context.Background(), fuID, time.Now().UTC()); err != nil {
		t.Fatalf("complete: %v", err)
	}

	req, _ := stdhttp.NewRequest("POST", ts.URL+"/api/fleet-updates/"+fuID+"/cancel", nil)
	req.AddCookie(cookie)
	res, err := stdhttp.DefaultClient.Do(req)
	if err != nil {
		t.Fatalf("do: %v", err)
	}
	defer res.Body.Close()
	if res.StatusCode != stdhttp.StatusConflict {
		t.Fatalf("status: got %d, want 409", res.StatusCode)
	}
	body := readJSONError(t, res.Body)
	if body.Code != "fleet_update_not_running" {
		t.Fatalf("code: %q", body.Code)
	}
}

func TestFleetUpdateGetHydrates(t *testing.T) {
	t.Parallel()
	_, ts, st := rawTestServer(t)
	cookie := loginAsAdmin(t, st)

	uid := ulid.Make().String()
	_ = st.CreateUser(context.Background(), store.User{
		ID: uid, Username: "starter3", PasswordHash: "x",
		Role: store.RoleAdmin, CreatedAt: time.Now().UTC(),
	})
	hostID := makeHost(t, st, "fu-get-host")
	fuID := ulid.Make().String()
	if err := st.CreateFleetUpdate(context.Background(),
		store.FleetUpdate{ID: fuID, StartedByUserID: uid, TargetVersion: "v1.2.3"},
		[]string{hostID}); err != nil {
		t.Fatalf("seed: %v", err)
	}

	req, _ := stdhttp.NewRequest("GET", ts.URL+"/api/fleet-updates/"+fuID, nil)
	req.AddCookie(cookie)
	res, err := stdhttp.DefaultClient.Do(req)
	if err != nil {
		t.Fatalf("do: %v", err)
	}
	defer res.Body.Close()
	if res.StatusCode != stdhttp.StatusOK {
		t.Fatalf("status: got %d, want 200", res.StatusCode)
	}
	var got fleetUpdateView
	if err := json.NewDecoder(res.Body).Decode(&got); err != nil {
		t.Fatalf("decode: %v", err)
	}
	if got.ID != fuID || got.TargetVersion != "v1.2.3" || got.Status != "running" {
		t.Fatalf("parent: %+v", got)
	}
	if len(got.Hosts) != 1 || got.Hosts[0].HostID != hostID || got.Hosts[0].HostName != "fu-get-host" {
		t.Fatalf("hosts: %+v", got.Hosts)
	}
}

func TestFleetUpdateRBAC(t *testing.T) {
	t.Parallel()
	_, ts, st := rawTestServer(t)

	for _, role := range []store.Role{store.RoleViewer, store.RoleOperator} {
		role := role
		t.Run(string(role), func(t *testing.T) {
			cookie := loginAsRole(t, st, role)
			req, _ := stdhttp.NewRequest("POST", ts.URL+"/api/fleet/update", bytes.NewReader([]byte(`{}`)))
			req.AddCookie(cookie)
			req.Header.Set("Content-Type", "application/json")
			res, err := stdhttp.DefaultClient.Do(req)
			if err != nil {
				t.Fatalf("do: %v", err)
			}
			defer res.Body.Close()
			if res.StatusCode != stdhttp.StatusForbidden {
				t.Fatalf("status: got %d, want 403", res.StatusCode)
			}
		})
	}
}

// Sanity check that fakeFleetWorker satisfies the FleetWorker iface.
var _ FleetWorker = (*fakeFleetWorker)(nil)
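Several of the tests above post a bare `{}` to `/api/fleet/update`. Since both request fields carry `omitempty`, that is exactly what a zero-valued request encodes to. The sketch below shows the wire format under that assumption: `startReq` copies the field tags of `fleetUpdateStartReq` from the diff, while the `encode` helper and the sample host ID `01HX` are illustrative only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// startReq copies the JSON tags of fleetUpdateStartReq so the example
// compiles without the server package.
type startReq struct {
	TargetVersion string   `json:"target_version,omitempty"`
	HostIDs       []string `json:"host_ids,omitempty"`
}

// encode is a hypothetical helper that returns the JSON body as a string.
func encode(r startReq) string {
	b, err := json.Marshal(r)
	if err != nil {
		return "" // not expected for a struct of strings and string slices
	}
	return string(b)
}

func main() {
	// Zero value: the server picks the target version and derives the hosts.
	fmt.Println(encode(startReq{})) // {}

	// Explicit host list, default target version.
	fmt.Println(encode(startReq{HostIDs: []string{"01HX"}})) // {"host_ids":["01HX"]}

	// Explicit target version, derived host list.
	fmt.Println(encode(startReq{TargetVersion: "v1.2.3"})) // {"target_version":"v1.2.3"}
}
```

On the server side an empty `host_ids` array and an omitted one are treated the same, because the handler checks `len(hostIDs) == 0` before deriving the list.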
@@ -483,6 +483,12 @@ func (s *Server) onAgentHello(ctx context.Context, hostID string, conn *ws.Conn)
	// and the drain may take seconds across many rows. A non-blocking
	// goroutine keeps the hello path snappy.
	go s.DrainPending(context.Background(), hostID)
	// Intermittent hosts that just reconnected may have slept through a
	// backup window. Arm a catch-up evaluation after a settle delay; the
	// pending-drain tick fires it. Always-on hosts never need this.
	if host, err := s.deps.Store.GetHost(ctx, hostID); err == nil && !host.AlwaysOn {
		s.ArmCatchup(hostID, time.Now().UTC())
	}
}

// maybeAutoInit dispatches a `restic init` job iff the host has no

@@ -0,0 +1,217 @@
|
||||
package http
|
||||
|
||||
import (
|
||||
"context"
|
||||
"encoding/json"
|
||||
stdhttp "net/http"
|
||||
"time"
|
||||
|
||||
"github.com/go-chi/chi/v5"
|
||||
"github.com/oklog/ulid/v2"
|
||||
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
|
||||
)
|
||||
|
||||
// UpdateWatcher is the slim view of the ws.updateWatcher this package
|
||||
// uses for tracking in-flight update dispatches. Defined as an
|
||||
// interface so a test can inject a stub.
|
||||
type UpdateWatcher interface {
|
||||
Track(jobID, hostID string)
|
||||
}
|
||||
|
||||
// FleetWorker is the slim view of the fleetupdate.Worker this package
|
||||
// uses. Kept here for forward compatibility with P6-15 — the host
|
||||
// update endpoint itself does not use it.
|
||||
type FleetWorker interface {
|
||||
Start(ctx context.Context, userID, targetVersion string, hostIDs []string) (string, error)
|
||||
Cancel(ctx context.Context, fleetUpdateID string) error
|
||||
}
|
||||
|
||||
// dispatchHostUpdateResult communicates structured outcomes from the
|
||||
// shared dispatch path so both the HTTP handler and the fleet worker
|
||||
// can format errors in their own idiom.
|
||||
type dispatchHostUpdateResult struct {
|
||||
JobID string
|
||||
Code string // "" on success
|
||||
Status int // HTTP status the JSON handler should use on error
|
||||
Msg string // human-readable detail (optional)
|
||||
}
|
||||
|
||||
// dispatchHostUpdate is the shared "send command.update to one host"
|
||||
// path. It performs every pre-check (host exists, online, version
|
||||
// mismatch, no in-flight update) and on success creates the jobs row,
|
||||
// audits, dispatches the WS envelope, and tracks the watcher entry.
|
||||
//
|
||||
// Pre-checks are returned as structured codes rather than HTTP errors
|
||||
// so the fleet worker can map them onto its own per-host status enum
|
||||
// without parsing strings.
|
||||
func (s *Server) dispatchHostUpdate(ctx context.Context, hostID string, actorKind string, actorID *string) dispatchHostUpdateResult {
|
||||
host, err := s.deps.Store.GetHost(ctx, hostID)
|
||||
if err != nil || host == nil {
|
||||
return dispatchHostUpdateResult{Code: "host_not_found", Status: stdhttp.StatusNotFound}
|
||||
}
|
||||
if !s.deps.Hub.Connected(host.ID) {
|
||||
return dispatchHostUpdateResult{
|
||||
Code: "host_offline", Status: stdhttp.StatusConflict,
|
||||
Msg: "agent is not currently connected",
|
||||
}
|
||||
}
|
||||
if host.AgentVersion != "" && host.AgentVersion == version.Version {
|
||||
return dispatchHostUpdateResult{
|
||||
Code: "already_up_to_date", Status: stdhttp.StatusConflict,
|
||||
Msg: "agent already running version " + version.Version,
|
||||
}
|
||||
}
|
||||
existing, err := s.deps.Store.RunningUpdateJobForHost(ctx, hostID)
|
||||
if err != nil {
|
||||
return dispatchHostUpdateResult{Code: "internal", Status: stdhttp.StatusInternalServerError, Msg: err.Error()}
|
||||
}
|
||||
if existing != "" {
|
||||
return dispatchHostUpdateResult{
|
||||
Code: "update_in_progress", Status: stdhttp.StatusConflict,
|
||||
Msg: "an update job is already in flight for this host",
|
||||
JobID: existing,
|
||||
}
|
||||
}
|
||||
|
||||
jobID := ulid.Make().String()
|
||||
now := time.Now().UTC()
|
||||
if err := s.deps.Store.CreateJob(ctx, store.Job{
|
||||
ID: jobID, HostID: hostID, Kind: "update",
|
||||
ActorKind: actorKind, ActorID: actorID,
|
||||
CreatedAt: now,
|
||||
}); err != nil {
|
||||
return dispatchHostUpdateResult{Code: "internal", Status: stdhttp.StatusInternalServerError, Msg: err.Error()}
|
||||
}
|
||||
env, err := api.Marshal(api.MsgCommandUpdate, ulid.Make().String(), api.CommandUpdatePayload{
|
||||
JobID: jobID,
|
||||
})
|
||||
if err != nil {
|
||||
return dispatchHostUpdateResult{Code: "internal", Status: stdhttp.StatusInternalServerError, Msg: err.Error()}
|
||||
}
|
||||
if err := s.deps.Hub.Send(ctx, hostID, env); err != nil {
|
||||
// Roll the job to failed so we don't leak a queued row.
|
||||
_ = s.deps.Store.MarkJobFinished(ctx, jobID, "failed", -1, nil, err.Error(), time.Now().UTC())
|
||||
return dispatchHostUpdateResult{
|
||||
Code: "host_offline", Status: stdhttp.StatusConflict, Msg: err.Error(),
|
||||
}
|
||||
}
|
||||
if s.deps.UpdateWatcher != nil {
|
||||
s.deps.UpdateWatcher.Track(jobID, hostID)
|
||||
}
|
||||
|
||||
auditPayload, _ := json.Marshal(map[string]string{
|
||||
"job_id": jobID,
|
||||
"target_version": version.Version,
|
||||
})
|
||||
_ = s.deps.Store.AppendAudit(ctx, store.AuditEntry{
|
||||
ID: ulid.Make().String(),
|
||||
UserID: actorID,
|
||||
Actor: actorKind,
|
||||
Action: "host.update_dispatched",
|
||||
TargetKind: ptr("host"),
|
||||
TargetID: &hostID,
|
||||
TS: now,
|
||||
Payload: auditPayload,
|
||||
})
|
||||
|
||||
return dispatchHostUpdateResult{JobID: jobID}
|
||||
}
|
||||
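The comment on dispatchHostUpdate says the structured codes exist so a fleet worker can map them onto its own per-host status without parsing strings. A minimal sketch of such a mapping follows. The `hostOutcome` type, its values, and `outcomeForCode` are hypothetical names invented for illustration; the real fleetupdate.Worker enum is not shown in this diff and may differ.

```go
package main

import "fmt"

// hostOutcome is a hypothetical per-host status enum; the names below are
// assumptions, not the fleet worker's actual values.
type hostOutcome string

const (
	outcomeDispatched hostOutcome = "dispatched"
	outcomeSkipped    hostOutcome = "skipped"
	outcomeOffline    hostOutcome = "offline"
	outcomeFailed     hostOutcome = "failed"
)

// outcomeForCode switches on the Code strings that dispatchHostUpdate
// returns in this diff. An empty code means the dispatch succeeded.
func outcomeForCode(code string) hostOutcome {
	switch code {
	case "":
		return outcomeDispatched
	case "already_up_to_date", "update_in_progress":
		return outcomeSkipped
	case "host_offline":
		return outcomeOffline
	default: // "host_not_found", "internal", or any code added later
		return outcomeFailed
	}
}

func main() {
	for _, c := range []string{"", "already_up_to_date", "host_offline", "internal"} {
		fmt.Printf("%q -> %s\n", c, outcomeForCode(c))
	}
}
```

Treating unknown codes as failures keeps the mapping safe if new pre-checks are added later.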
|
||||
// handleHostUpdate is POST /api/hosts/{id}/update — JSON, admin-only.
|
||||
func (s *Server) handleHostUpdate(w stdhttp.ResponseWriter, r *stdhttp.Request) {
|
||||
user, ok := s.requireUser(r)
|
||||
if !ok {
|
||||
writeJSONError(w, stdhttp.StatusUnauthorized, "unauthorised", "")
|
||||
return
|
||||
}
|
||||
hostID := chi.URLParam(r, "id")
|
||||
if hostID == "" {
|
||||
writeJSONError(w, stdhttp.StatusBadRequest, "missing_host_id", "")
|
||||
return
|
||||
}
|
||||
actor := "user"
|
||||
var actorID *string
|
||||
if user != nil {
|
||||
actorID = &user.ID
|
||||
}
|
||||
res := s.dispatchHostUpdate(r.Context(), hostID, actor, actorID)
|
||||
if res.Code != "" {
|
||||
writeJSONError(w, res.Status, res.Code, res.Msg)
|
||||
return
|
||||
}
|
||||
writeJSON(w, stdhttp.StatusAccepted, map[string]string{"job_id": res.JobID})
|
||||
}
|
||||
|
||||
// handleHostUpdateForm is the HTMX-friendly POST /hosts/{id}/update
|
||||
// variant. On success it sets HX-Redirect to the job detail page; on
|
||||
// pre-check failures it renders an inline error banner.
|
||||
func (s *Server) handleHostUpdateForm(w stdhttp.ResponseWriter, r *stdhttp.Request) {
|
||||
user, ok := s.requireUser(r)
|
||||
if !ok {
|
||||
stdhttp.Error(w, "unauthorised", stdhttp.StatusUnauthorized)
|
||||
return
|
||||
}
|
||||
hostID := chi.URLParam(r, "id")
|
||||
if hostID == "" {
|
||||
stdhttp.Error(w, "missing host_id", stdhttp.StatusBadRequest)
|
||||
return
|
||||
}
|
||||
actor := "user"
|
||||
var actorID *string
|
||||
if user != nil {
|
||||
actorID = &user.ID
|
||||
}
|
||||
res := s.dispatchHostUpdate(r.Context(), hostID, actor, actorID)
|
||||
if res.Code != "" {
|
||||
// Inline banner for HTMX swaps. Mirrors what host_credentials
|
||||
// returns on validation errors — small text/html fragment.
|
||||
w.Header().Set("Content-Type", "text/html; charset=utf-8")
|
||||
w.WriteHeader(res.Status)
|
||||
msg := hostUpdateErrorMessage(res.Code, res.Msg)
|
||||
_, _ = w.Write([]byte(`<div class="banner banner-error" role="alert">` + htmlEscape(msg) + `</div>`))
|
||||
return
|
||||
}
|
||||
w.Header().Set("HX-Redirect", "/jobs/"+res.JobID)
|
||||
w.WriteHeader(stdhttp.StatusOK)
|
||||
}
|
||||
|
||||
func hostUpdateErrorMessage(code, msg string) string {
|
||||
switch code {
|
||||
case "host_not_found":
|
||||
return "Host not found."
|
||||
case "host_offline":
|
||||
return "Agent is offline; can't deliver the update command."
|
||||
case "already_up_to_date":
|
||||
return "Agent is already running the current version."
|
||||
case "update_in_progress":
|
||||
return "An update is already in progress for this host."
|
||||
}
|
||||
if msg != "" {
|
||||
return msg
|
||||
}
|
||||
return "Update dispatch failed."
|
||||
}
|
||||
|
||||
// htmlEscape is a minimal escaper for HTML text and double-quoted attribute values (single quotes are left as-is). Avoids pulling html/template
|
||||
// for a one-shot inline banner.
|
||||
func htmlEscape(s string) string {
|
||||
out := make([]byte, 0, len(s))
|
||||
for i := 0; i < len(s); i++ {
|
||||
switch s[i] {
|
||||
case '&':
|
||||
out = append(out, []byte("&amp;")...)
|
||||
case '<':
|
||||
out = append(out, []byte("&lt;")...)
|
||||
case '>':
|
||||
out = append(out, []byte("&gt;")...)
|
||||
case '"':
|
||||
out = append(out, []byte("&quot;")...)
|
||||
default:
|
||||
out = append(out, s[i])
|
||||
}
|
||||
}
|
||||
return string(out)
|
||||
}
|
||||
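For comparison, the standard library's `html` package (separate from `html/template`) offers `html.EscapeString`, which escapes the same four characters plus the single quote. The sketch below shows how the inline banner could be built with it. `errorBanner` is a hypothetical helper name, and this is an alternative under that assumption, not what the diff itself does.

```go
package main

import (
	"fmt"
	"html"
)

// errorBanner is a hypothetical helper that wraps an escaped message in the
// same banner markup handleHostUpdateForm writes.
func errorBanner(msg string) string {
	return `<div class="banner banner-error" role="alert">` + html.EscapeString(msg) + `</div>`
}

func main() {
	// html.EscapeString rewrites <, >, &, ' and " as entities.
	fmt.Println(errorBanner(`agent "a<b>" & 'c'`))
}
```

`html.EscapeString` emits numeric entities for quotes (`&#34;` and `&#39;`), which browsers render the same way as the named forms.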
@@ -0,0 +1,270 @@
|
||||
// host_update_test.go — covers POST /api/hosts/{id}/update.
|
||||
package http
|
||||
|
||||
import (
|
||||
"context"
|
||||
"encoding/json"
|
||||
"io"
|
||||
stdhttp "net/http"
|
||||
"strings"
|
||||
"sync"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"github.com/coder/websocket"
|
||||
"github.com/oklog/ulid/v2"
|
||||
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
|
||||
)
|
||||
|
||||
// stubWatcher records Track calls so tests can assert the watcher was
|
||||
// notified.
|
||||
type stubWatcher struct {
|
||||
mu sync.Mutex
|
||||
tracked []string // hostIDs
|
||||
}
|
||||
|
||||
func (s *stubWatcher) Track(_, hostID string) {
|
||||
s.mu.Lock()
|
||||
defer s.mu.Unlock()
|
||||
s.tracked = append(s.tracked, hostID)
|
||||
}
|
||||
|
||||
func TestHostUpdateHappyPath(t *testing.T) {
|
||||
t.Parallel()
|
||||
srv, ts, st := rawTestServer(t)
|
||||
watcher := &stubWatcher{}
|
||||
srv.deps.UpdateWatcher = watcher
|
||||
hostID, token := enrolHostForWS(t, srv, st, "upd-host")
|
||||
c := agentDial(t, srv, ts, hostID, token)
|
||||
sendHello(t, c, "upd-host")
|
||||
_ = drainUntil(t, c, api.MsgScheduleSet)
|
||||
|
||||
// Force a version mismatch so the dispatch isn't short-circuited.
|
||||
if err := st.MarkHostHello(context.Background(), hostID, "v0", "0.17", api.CurrentProtocolVersion, time.Now().UTC()); err != nil {
|
||||
t.Fatalf("mark hello: %v", err)
|
||||
}
|
||||
|
||||
cookie := loginAsAdmin(t, st)
|
||||
req, _ := stdhttp.NewRequest("POST", ts.URL+"/api/hosts/"+hostID+"/update", nil)
|
||||
req.AddCookie(cookie)
|
||||
res, err := stdhttp.DefaultClient.Do(req)
|
||||
if err != nil {
|
||||
t.Fatalf("do: %v", err)
|
||||
}
|
||||
defer res.Body.Close()
|
||||
if res.StatusCode != stdhttp.StatusAccepted {
|
||||
t.Fatalf("status: got %d, want 202", res.StatusCode)
|
||||
}
|
||||
var out struct {
|
||||
JobID string `json:"job_id"`
|
||||
}
|
||||
if err := json.NewDecoder(res.Body).Decode(&out); err != nil {
|
||||
t.Fatalf("decode: %v", err)
|
||||
}
|
||||
if out.JobID == "" {
|
||||
t.Fatal("missing job_id in response")
|
||||
}
|
||||
|
||||
// command.update envelope arrives.
|
||||
deadline := time.Now().Add(2 * time.Second)
|
||||
var got api.Envelope
|
||||
for time.Now().Before(deadline) {
|
||||
ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
|
||||
mt, raw, rerr := c.Read(ctx)
|
||||
cancel()
|
||||
if rerr != nil {
|
||||
break
|
||||
}
|
||||
if mt != websocket.MessageText {
|
||||
continue
|
||||
}
|
||||
if !strings.Contains(string(raw), `"command.update"`) {
|
||||
continue
|
||||
}
|
||||
_ = json.Unmarshal(raw, &got)
|
||||
break
|
||||
}
|
||||
if got.Type != api.MsgCommandUpdate {
|
||||
t.Fatal("never received command.update envelope")
|
||||
}
|
||||
var cp api.CommandUpdatePayload
|
||||
if err := got.UnmarshalPayload(&cp); err != nil {
|
||||
t.Fatalf("payload: %v", err)
|
||||
}
|
||||
if cp.JobID != out.JobID {
|
||||
t.Fatalf("payload job_id: got %q want %q", cp.JobID, out.JobID)
|
||||
}
|
||||
|
||||
// Watcher tracked.
|
||||
watcher.mu.Lock()
|
||||
defer watcher.mu.Unlock()
|
||||
if len(watcher.tracked) != 1 || watcher.tracked[0] != hostID {
|
||||
t.Fatalf("watcher tracked: %v", watcher.tracked)
|
||||
}
|
||||
|
||||
// Audit row exists.
|
||||
var n int
|
||||
if err := st.DB().QueryRow(
|
||||
`SELECT COUNT(*) FROM audit_log WHERE action = 'host.update_dispatched' AND target_id = ?`,
|
||||
hostID).Scan(&n); err != nil {
|
||||
t.Fatalf("audit count: %v", err)
|
||||
}
|
||||
if n != 1 {
|
||||
t.Fatalf("audit rows: got %d, want 1", n)
|
||||
}
|
||||
}
|
||||
|
||||
func TestHostUpdateNotFound(t *testing.T) {
|
||||
t.Parallel()
|
||||
_, ts, st := rawTestServer(t)
|
||||
cookie := loginAsAdmin(t, st)
|
||||
req, _ := stdhttp.NewRequest("POST", ts.URL+"/api/hosts/no-such/update", nil)
|
||||
req.AddCookie(cookie)
|
||||
res, err := stdhttp.DefaultClient.Do(req)
|
||||
if err != nil {
|
||||
t.Fatalf("do: %v", err)
|
||||
}
|
||||
defer res.Body.Close()
|
||||
if res.StatusCode != stdhttp.StatusNotFound {
|
||||
t.Fatalf("status: got %d want 404", res.StatusCode)
|
||||
}
|
||||
}
|
||||
|
||||
func TestHostUpdateOffline(t *testing.T) {
|
||||
t.Parallel()
|
||||
_, ts, st := rawTestServer(t)
|
||||
hostID := ulid.Make().String()
|
||||
if err := st.CreateHost(context.Background(), store.Host{
|
||||
ID: hostID, Name: "off", OS: "linux", Arch: "amd64",
|
||||
EnrolledAt: time.Now().UTC(),
|
||||
}, "deadbeef", ""); err != nil {
|
||||
t.Fatalf("create: %v", err)
|
||||
}
|
||||
cookie := loginAsAdmin(t, st)
|
||||
req, _ := stdhttp.NewRequest("POST", ts.URL+"/api/hosts/"+hostID+"/update", nil)
|
||||
req.AddCookie(cookie)
|
||||
res, err := stdhttp.DefaultClient.Do(req)
|
||||
if err != nil {
|
||||
t.Fatalf("do: %v", err)
|
||||
}
|
||||
defer res.Body.Close()
|
||||
if res.StatusCode != stdhttp.StatusConflict {
|
||||
t.Fatalf("status: got %d want 409", res.StatusCode)
|
||||
}
|
||||
body := readJSONError(t, res.Body)
|
||||
if body.Code != "host_offline" {
|
||||
t.Fatalf("code: %q", body.Code)
|
||||
}
|
||||
}
|
||||
|
||||
func TestHostUpdateAlreadyUpToDate(t *testing.T) {
|
||||
t.Parallel()
|
||||
srv, ts, st := rawTestServer(t)
|
||||
hostID, token := enrolHostForWS(t, srv, st, "uptodate-host")
|
||||
c := agentDial(t, srv, ts, hostID, token)
|
||||
sendHello(t, c, "uptodate-host")
|
||||
_ = drainUntil(t, c, api.MsgScheduleSet)
|
||||
|
||||
// Force agent_version == version.Version.
|
||||
if err := st.MarkHostHello(context.Background(), hostID, version.Version, "0.17", api.CurrentProtocolVersion, time.Now().UTC()); err != nil {
|
||||
t.Fatalf("mark hello: %v", err)
|
||||
}
|
||||
|
||||
cookie := loginAsAdmin(t, st)
|
||||
req, _ := stdhttp.NewRequest("POST", ts.URL+"/api/hosts/"+hostID+"/update", nil)
|
||||
req.AddCookie(cookie)
|
||||
res, err := stdhttp.DefaultClient.Do(req)
|
||||
if err != nil {
|
||||
t.Fatalf("do: %v", err)
|
||||
}
|
||||
defer res.Body.Close()
|
||||
if res.StatusCode != stdhttp.StatusConflict {
|
||||
t.Fatalf("status: got %d want 409", res.StatusCode)
|
||||
}
|
||||
body := readJSONError(t, res.Body)
|
||||
if body.Code != "already_up_to_date" {
|
||||
t.Fatalf("code: %q", body.Code)
|
||||
}
|
||||
}
|
||||
|
||||
func TestHostUpdateInProgress(t *testing.T) {
|
||||
t.Parallel()
|
||||
srv, ts, st := rawTestServer(t)
|
||||
hostID, token := enrolHostForWS(t, srv, st, "inprog-host")
|
||||
c := agentDial(t, srv, ts, hostID, token)
|
||||
sendHello(t, c, "inprog-host")
|
||||
_ = drainUntil(t, c, api.MsgScheduleSet)
|
||||
if err := st.MarkHostHello(context.Background(), hostID, "v0", "0.17", api.CurrentProtocolVersion, time.Now().UTC()); err != nil {
|
||||
t.Fatalf("mark hello: %v", err)
|
||||
}
|
||||
|
||||
// Pre-seed an in-flight update job.
|
||||
jobID := ulid.Make().String()
|
||||
if err := st.CreateJob(context.Background(), store.Job{
|
||||
ID: jobID, HostID: hostID, Kind: "update",
|
||||
ActorKind: "user", CreatedAt: time.Now().UTC(),
|
||||
}); err != nil {
|
||||
t.Fatalf("seed job: %v", err)
|
||||
}
|
||||
|
||||
cookie := loginAsAdmin(t, st)
|
||||
req, _ := stdhttp.NewRequest("POST", ts.URL+"/api/hosts/"+hostID+"/update", nil)
|
||||
req.AddCookie(cookie)
|
||||
res, err := stdhttp.DefaultClient.Do(req)
|
||||
if err != nil {
|
||||
t.Fatalf("do: %v", err)
|
||||
}
|
||||
defer res.Body.Close()
|
||||
if res.StatusCode != stdhttp.StatusConflict {
|
||||
t.Fatalf("status: got %d want 409", res.StatusCode)
|
||||
}
|
||||
body := readJSONError(t, res.Body)
|
||||
if body.Code != "update_in_progress" {
|
||||
t.Fatalf("code: %q", body.Code)
|
||||
}
|
||||
}
|
||||
|
||||
func TestHostUpdateRBAC(t *testing.T) {
|
||||
t.Parallel()
|
||||
_, ts, st := rawTestServer(t)
|
||||
hostID := ulid.Make().String()
|
||||
if err := st.CreateHost(context.Background(), store.Host{
|
||||
ID: hostID, Name: "rbac-host", OS: "linux", Arch: "amd64",
|
||||
EnrolledAt: time.Now().UTC(),
|
||||
}, "deadbeef", ""); err != nil {
|
||||
t.Fatalf("create: %v", err)
|
||||
}
|
||||
for _, role := range []store.Role{store.RoleViewer, store.RoleOperator} {
|
||||
role := role
|
||||
t.Run(string(role), func(t *testing.T) {
|
||||
cookie := loginAsRole(t, st, role)
|
||||
req, _ := stdhttp.NewRequest("POST", ts.URL+"/api/hosts/"+hostID+"/update", nil)
|
||||
req.AddCookie(cookie)
|
||||
res, err := stdhttp.DefaultClient.Do(req)
|
||||
if err != nil {
|
||||
t.Fatalf("do: %v", err)
|
||||
}
|
||||
defer res.Body.Close()
|
||||
if res.StatusCode != stdhttp.StatusForbidden {
|
||||
t.Fatalf("status for %s: got %d want 403", role, res.StatusCode)
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
type jsonErrBody struct {
|
||||
Code string `json:"code"`
|
||||
Message string `json:"message,omitempty"`
|
||||
}
|
||||
|
||||
func readJSONError(t *testing.T, body io.Reader) jsonErrBody {
|
||||
t.Helper()
|
||||
var out jsonErrBody
|
||||
if err := json.NewDecoder(body).Decode(&out); err != nil {
|
||||
t.Fatalf("decode error body: %v", err)
|
||||
}
|
||||
return out
|
||||
}
|
||||
@@ -4,6 +4,7 @@ import (
|
||||
stdhttp "net/http"
|
||||
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
|
||||
)
|
||||
|
||||
// hostView is the JSON projection of a Host row. Same shape as the
|
||||
@@ -24,9 +25,12 @@ type hostView struct {
|
||||
CurrentJobID *string `json:"current_job_id,omitempty"`
|
||||
LastBackupAt *string `json:"last_backup_at,omitempty"`
|
||||
LastBackupStatus *string `json:"last_backup_status,omitempty"`
|
||||
RepoStatus string `json:"repo_status,omitempty"`
|
||||
RepoSizeBytes int64 `json:"repo_size_bytes"`
|
||||
SnapshotCount int `json:"snapshot_count"`
|
||||
OpenAlertCount int `json:"open_alert_count"`
|
||||
UpdateAvailable bool `json:"update_available"`
|
||||
TargetVersion string `json:"target_version,omitempty"`
|
||||
}
|
||||
|
||||
// handleListHosts returns the full fleet as JSON. Authenticated; the
|
||||
@@ -82,9 +86,12 @@ func hostToView(h store.Host) hostView {
|
||||
Tags: h.Tags,
|
||||
CurrentJobID: h.CurrentJobID,
|
||||
LastBackupStatus: h.LastBackupStatus,
|
||||
RepoStatus: h.RepoStatus,
|
||||
RepoSizeBytes: h.RepoSizeBytes,
|
||||
SnapshotCount: h.SnapshotCount,
|
||||
OpenAlertCount: h.OpenAlertCount,
|
||||
TargetVersion: version.Version,
|
||||
UpdateAvailable: h.AgentVersion != "" && h.AgentVersion != version.Version,
|
||||
}
|
||||
if v.Tags == nil {
|
||||
v.Tags = []string{}
|
||||
|
||||
@@ -0,0 +1,185 @@
|
||||
package http
|
||||
|
||||
import (
|
||||
"context"
|
||||
"crypto/subtle"
|
||||
"net"
|
||||
"net/http"
|
||||
"net/netip"
|
||||
"runtime"
|
||||
"strings"
|
||||
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/config"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
|
||||
)
|
||||
|
||||
// handleMetrics serves the Prometheus exposition body. The route is
|
||||
// only mounted when the operator has opted in via RM_METRICS_TOKEN
|
||||
// or RM_METRICS_TRUSTED_CIDR (see Server.New + Cfg.MetricsAuthEnabled).
|
||||
func (s *Server) handleMetrics(w http.ResponseWriter, r *http.Request) {
|
||||
if !authoriseMetricsScrape(r, s.deps.Cfg) {
|
||||
// 401 with no body; Prom respects this and surfaces the failed
|
||||
// scrape. WWW-Authenticate hints at bearer when the operator
|
||||
// actually configured a token.
|
||||
if s.deps.Cfg.MetricsToken != "" {
|
||||
w.Header().Set("WWW-Authenticate", `Bearer realm="restic-manager metrics"`)
|
||||
}
|
||||
w.WriteHeader(http.StatusUnauthorized)
|
||||
return
|
||||
}
|
||||
|
||||
snap, err := s.gatherMetricsSnapshot(r.Context())
|
||||
if err != nil {
|
||||
http.Error(w, "snapshot: "+err.Error(), http.StatusInternalServerError)
|
||||
return
|
||||
}
|
||||
|
||||
// 0.0.4 is the long-stable text-format version Prometheus accepts
|
||||
// without negotiation; OpenMetrics is intentionally not used here.
|
||||
w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
|
||||
if err := metrics.Render(w, snap); err != nil {
|
||||
// Body is partially written; nothing useful we can do beyond
|
||||
// dropping the connection (chi's recoverer will log).
|
||||
return
|
||||
}
|
||||
}
|
||||
|
||||
// authoriseMetricsScrape applies bearer + CIDR gates per the spec.
|
||||
// AND semantics when both are configured; either alone is sufficient
|
||||
// when only it is configured.
|
||||
func authoriseMetricsScrape(r *http.Request, cfg config.Config) bool {
|
||||
tokenOK := true
|
||||
if cfg.MetricsToken != "" {
|
||||
tokenOK = false
|
||||
hdr := r.Header.Get("Authorization")
|
||||
const prefix = "Bearer "
|
||||
if strings.HasPrefix(hdr, prefix) {
|
||||
got := []byte(strings.TrimPrefix(hdr, prefix))
|
||||
want := []byte(cfg.MetricsToken)
|
||||
if subtle.ConstantTimeCompare(got, want) == 1 {
|
||||
tokenOK = true
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
cidrOK := true
|
||||
if len(cfg.MetricsTrustedCIDRs) > 0 {
|
||||
cidrOK = false
|
||||
ip := callerIP(r, cfg.TrustedProxies)
|
||||
if ip.IsValid() {
|
||||
for _, c := range cfg.MetricsTrustedCIDRs {
|
||||
prefix, err := netip.ParsePrefix(c)
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
if prefix.Contains(ip) {
|
||||
cidrOK = true
|
||||
break
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
return tokenOK && cidrOK
|
||||
}
|
||||
|
||||
// callerIP resolves the client IP. When the request hit the server
|
||||
// directly we use RemoteAddr; when the immediate hop is a trusted
|
||||
// proxy we honour the right-most untrusted X-Forwarded-For entry
|
||||
// (mirrors how realIP middlewares typically resolve).
|
||||
func callerIP(r *http.Request, trustedProxies []string) netip.Addr {
|
||||
host, _, err := net.SplitHostPort(r.RemoteAddr)
|
||||
if err != nil {
|
||||
host = r.RemoteAddr
|
||||
}
|
||||
directAddr, err := netip.ParseAddr(host)
|
||||
if err != nil {
|
||||
return netip.Addr{}
|
||||
}
|
||||
|
||||
if !addrInAnyCIDR(directAddr, trustedProxies) {
|
||||
return directAddr
|
||||
}
|
||||
|
||||
xff := r.Header.Get("X-Forwarded-For")
|
||||
if xff == "" {
|
||||
return directAddr
|
||||
}
|
||||
parts := strings.Split(xff, ",")
|
||||
// Walk right→left, skipping trusted proxies, until we land on the
|
||||
// first untrusted hop — that's the genuine client.
|
||||
for i := len(parts) - 1; i >= 0; i-- {
|
||||
p := strings.TrimSpace(parts[i])
|
||||
a, err := netip.ParseAddr(p)
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
if addrInAnyCIDR(a, trustedProxies) {
|
||||
continue
|
||||
}
|
||||
return a
|
||||
}
|
||||
return directAddr
|
||||
}
|
||||
|
||||
func addrInAnyCIDR(a netip.Addr, cidrs []string) bool {
|
||||
for _, c := range cidrs {
|
||||
pre, err := netip.ParsePrefix(c)
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
if pre.Contains(a) {
|
||||
return true
|
||||
}
|
||||
}
|
||||
return false
|
||||
}
|
||||
|
||||
// gatherMetricsSnapshot pulls the data the renderer needs. One
|
||||
// indexed query for each read, per-host or fleet-wide; no N+1.
|
||||
func (s *Server) gatherMetricsSnapshot(ctx context.Context) (metrics.Snapshot, error) {
|
||||
hosts, err := s.deps.Store.ListHosts(ctx)
|
||||
if err != nil {
|
||||
return metrics.Snapshot{}, err
|
||||
}
|
||||
hostRows := make([]metrics.HostRow, 0, len(hosts))
|
||||
for _, h := range hosts {
|
||||
row := metrics.HostRow{
|
||||
ID: h.ID,
|
||||
Name: h.Name,
|
||||
Online: h.Status == "online",
|
||||
SnapshotCount: h.SnapshotCount,
|
||||
OpenAlertCount: h.OpenAlertCount,
|
||||
RepoStatus: h.RepoStatus,
|
||||
}
|
||||
if h.LastBackupAt != nil {
|
||||
ts := h.LastBackupAt.Unix()
|
||||
row.LastBackupUnix = &ts
|
||||
}
|
||||
if h.LastBackupStatus != nil {
|
||||
ok := *h.LastBackupStatus == "succeeded"
|
||||
row.LastBackupSucceeded = &ok
|
||||
}
|
||||
if h.RepoSizeBytes > 0 {
|
||||
sz := h.RepoSizeBytes
|
||||
row.RepoSizeBytes = &sz
|
||||
}
|
||||
hostRows = append(hostRows, row)
|
||||
}
|
||||
|
||||
open, err := s.deps.Store.ListAlerts(ctx, store.AlertFilter{Status: "open"})
|
||||
if err != nil {
|
||||
return metrics.Snapshot{}, err
|
||||
}
|
||||
bySeverity := map[string]int{"info": 0, "warning": 0, "critical": 0}
|
||||
for _, a := range open {
|
||||
bySeverity[a.Severity]++
|
||||
}
|
||||
|
||||
reg := s.deps.Metrics
|
||||
if reg == nil {
|
||||
reg = metrics.NewRegistry() // empty histogram block
|
||||
}
|
||||
return reg.SnapshotWith(hostRows, bySeverity, version.Version, version.Commit, runtime.Version()), nil
|
||||
}
|
||||
@@ -0,0 +1,209 @@
|
||||
package http
|
||||
|
||||
import (
|
||||
"context"
|
||||
"io"
|
||||
stdhttp "net/http"
|
||||
"net/http/httptest"
|
||||
"path/filepath"
|
||||
"strings"
|
||||
"testing"
|
||||
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/crypto"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/config"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
|
||||
)
|
||||
|
||||
// newMetricsServer builds a Server with metrics enabled per cfg.
|
||||
// Returns (URL, registry, store) so tests can both observe job durations
|
||||
// directly and exercise the HTTP gate.
|
||||
func newMetricsServer(t *testing.T, cfg config.Config) (string, *metrics.Registry, *store.Store) {
|
||||
t.Helper()
|
||||
dir := t.TempDir()
|
||||
|
||||
st, err := store.Open(context.Background(), filepath.Join(dir, "rm.db"))
|
||||
if err != nil {
|
||||
t.Fatalf("store: %v", err)
|
||||
}
|
||||
t.Cleanup(func() { _ = st.Close() })
|
||||
|
||||
keyPath := filepath.Join(dir, "secret.key")
|
||||
if err := crypto.GenerateKeyFile(keyPath); err != nil {
|
||||
t.Fatalf("genkey: %v", err)
|
||||
}
|
||||
key, _ := crypto.LoadKeyFromFile(keyPath)
|
||||
aead, _ := crypto.NewAEAD(key)
|
||||
|
||||
cfg.Listen = ":0"
|
||||
cfg.DataDir = dir
|
||||
cfg.SecretKeyFile = keyPath
|
||||
|
||||
reg := metrics.NewRegistry()
|
||||
deps := Deps{
|
||||
Cfg: cfg,
|
||||
Store: st,
|
||||
AEAD: aead,
|
||||
Metrics: reg,
|
||||
}
|
||||
s := New(deps)
|
||||
ts := httptest.NewServer(s.srv.Handler)
|
||||
t.Cleanup(ts.Close)
|
||||
return ts.URL, reg, st
|
||||
}
|
||||
|
||||
func TestMetricsRouteNotMountedByDefault(t *testing.T) {
|
||||
t.Parallel()
|
||||
url, _, _ := newMetricsServer(t, config.Config{})
|
||||
res, err := stdhttp.Get(url + "/metrics")
|
||||
if err != nil {
|
||||
t.Fatalf("GET: %v", err)
|
||||
}
|
||||
defer res.Body.Close()
|
||||
if res.StatusCode != stdhttp.StatusNotFound {
|
||||
t.Errorf("status: got %d, want 404 (route should not be mounted)", res.StatusCode)
|
||||
}
|
||||
}
|
||||
|
||||
func TestMetricsTokenRequired(t *testing.T) {
|
||||
t.Parallel()
|
||||
url, _, _ := newMetricsServer(t, config.Config{
|
||||
MetricsToken: "the-token",
|
||||
})
|
||||
|
||||
// Missing token.
|
||||
res, err := stdhttp.Get(url + "/metrics")
|
||||
if err != nil {
|
||||
t.Fatalf("GET: %v", err)
|
||||
}
|
||||
defer res.Body.Close()
|
||||
if res.StatusCode != stdhttp.StatusUnauthorized {
|
||||
t.Errorf("no token: got %d", res.StatusCode)
|
||||
}
|
||||
if !strings.Contains(res.Header.Get("WWW-Authenticate"), "Bearer") {
|
||||
t.Errorf("WWW-Authenticate hint missing: %q", res.Header.Get("WWW-Authenticate"))
|
||||
}
|
||||
|
||||
// Wrong token.
|
||||
req, _ := stdhttp.NewRequest(stdhttp.MethodGet, url+"/metrics", nil)
|
||||
req.Header.Set("Authorization", "Bearer not-the-token")
|
||||
res2, err := stdhttp.DefaultClient.Do(req)
|
||||
if err != nil {
|
||||
t.Fatalf("GET: %v", err)
|
||||
}
|
||||
defer res2.Body.Close()
|
||||
if res2.StatusCode != stdhttp.StatusUnauthorized {
|
||||
t.Errorf("wrong token: got %d", res2.StatusCode)
|
||||
}
|
||||
|
||||
// Right token.
|
||||
req3, _ := stdhttp.NewRequest(stdhttp.MethodGet, url+"/metrics", nil)
|
||||
req3.Header.Set("Authorization", "Bearer the-token")
|
||||
res3, err3 := stdhttp.DefaultClient.Do(req3)
|
||||
if err3 != nil {
|
||||
t.Fatalf("GET: %v", err3)
|
||||
}
|
||||
defer res3.Body.Close()
|
||||
if res3.StatusCode != stdhttp.StatusOK {
|
||||
t.Errorf("right token: got %d", res3.StatusCode)
|
||||
}
|
||||
if ct := res3.Header.Get("Content-Type"); !strings.HasPrefix(ct, "text/plain") {
|
||||
t.Errorf("content-type: %q", ct)
|
||||
}
|
||||
}
|
||||
|
||||
func TestMetricsCIDRGate(t *testing.T) {
|
||||
t.Parallel()
|
||||
// httptest requests arrive from 127.0.0.1; pick a CIDR that excludes it
|
||||
// to assert the "wrong source" branch.
|
||||
url, _, _ := newMetricsServer(t, config.Config{
|
||||
MetricsTrustedCIDRs: []string{"10.0.0.0/8"},
|
||||
})
|
||||
res, err := stdhttp.Get(url + "/metrics")
|
||||
if err != nil {
|
||||
t.Fatalf("GET: %v", err)
|
||||
}
|
||||
defer res.Body.Close()
|
||||
if res.StatusCode != stdhttp.StatusUnauthorized {
|
||||
t.Errorf("loopback hitting non-matching CIDR: got %d, want 401", res.StatusCode)
|
||||
}
|
||||
|
||||
// Now allow loopback.
|
||||
url2, _, _ := newMetricsServer(t, config.Config{
|
||||
MetricsTrustedCIDRs: []string{"127.0.0.0/8"},
|
||||
})
|
||||
res2, err := stdhttp.Get(url2 + "/metrics")
|
||||
if err != nil {
|
||||
t.Fatalf("GET: %v", err)
|
||||
}
|
||||
defer res2.Body.Close()
|
||||
if res2.StatusCode != stdhttp.StatusOK {
|
||||
t.Errorf("loopback in allow CIDR: got %d, want 200", res2.StatusCode)
|
||||
}
|
||||
}
|
||||
|
||||
func TestMetricsTokenAndCIDRBothRequired(t *testing.T) {
|
||||
t.Parallel()
|
||||
url, _, _ := newMetricsServer(t, config.Config{
|
||||
MetricsToken: "the-token",
|
||||
MetricsTrustedCIDRs: []string{"127.0.0.0/8"},
|
||||
})
|
||||
// CIDR only: loopback is inside the trusted range, but the token is missing.
|
||||
res, err := stdhttp.Get(url + "/metrics")
|
||||
if err != nil {
|
||||
t.Fatalf("GET: %v", err)
|
||||
}
|
||||
defer res.Body.Close()
|
||||
if res.StatusCode != stdhttp.StatusUnauthorized {
|
||||
t.Errorf("missing token but in CIDR: got %d", res.StatusCode)
|
||||
}
|
||||
|
||||
// Both right.
|
||||
req, _ := stdhttp.NewRequest(stdhttp.MethodGet, url+"/metrics", nil)
|
||||
req.Header.Set("Authorization", "Bearer the-token")
|
||||
res2, err := stdhttp.DefaultClient.Do(req)
|
||||
if err != nil {
|
||||
t.Fatalf("GET: %v", err)
|
||||
}
|
||||
defer res2.Body.Close()
|
||||
if res2.StatusCode != stdhttp.StatusOK {
|
||||
t.Errorf("both right: got %d", res2.StatusCode)
|
||||
}
|
||||
}
|
||||
|
||||
func readAll(t *testing.T, r io.Reader) string {
|
||||
t.Helper()
|
||||
b, err := io.ReadAll(r)
|
||||
if err != nil {
|
||||
t.Fatalf("read: %v", err)
|
||||
}
|
||||
return string(b)
|
||||
}
|
||||
|
||||
func TestMetricsBodyContainsExpectedLines(t *testing.T) {
|
||||
t.Parallel()
|
||||
url, reg, _ := newMetricsServer(t, config.Config{
|
||||
MetricsToken: "the-token",
|
||||
})
|
||||
reg.ObserveJob("backup", "succeeded", 0) // produce one histogram row
|
||||
|
||||
req, _ := stdhttp.NewRequest(stdhttp.MethodGet, url+"/metrics", nil)
|
||||
req.Header.Set("Authorization", "Bearer the-token")
|
||||
res, err := stdhttp.DefaultClient.Do(req)
|
||||
if err != nil {
|
||||
t.Fatalf("GET: %v", err)
|
||||
}
|
||||
defer res.Body.Close()
|
||||
body := readAll(t, res.Body)
|
||||
for _, want := range []string{
|
||||
"rm_hosts_total",
|
||||
"rm_hosts_online",
|
||||
`rm_active_alerts{severity="critical"}`,
|
||||
"rm_build_info{",
|
||||
"rm_job_duration_seconds_count{kind=\"backup\",status=\"succeeded\"}",
|
||||
} {
|
||||
if !strings.Contains(body, want) {
|
||||
t.Errorf("body missing %q\n--- body ---\n%s", want, body)
|
||||
}
|
||||
}
|
||||
}
|
||||
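The metrics tests above pin down the observable behaviour of the `/metrics` gate: a loopback request without a token gets 401, and a loopback request with the right bearer token gets 200. The handler itself is not shown in this part of the diff, so the following is only a minimal sketch of one way such a gate could work. `metricsGate`, `inAny` and `gateStatus` are hypothetical names, and the 403 for an untrusted source address is an assumption, not something the tests confirm.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"strings"
)

// metricsGate is a hypothetical middleware (not the project's handler).
// A request must come from a trusted CIDR (when any are configured) and
// carry the bearer token (when one is configured).
func metricsGate(token string, trusted []*net.IPNet, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if len(trusted) > 0 && !inAny(r.RemoteAddr, trusted) {
			// Assumption: 403 for an untrusted source address.
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		if token != "" {
			const prefix = "Bearer "
			h := r.Header.Get("Authorization")
			if !strings.HasPrefix(h, prefix) ||
				subtle.ConstantTimeCompare([]byte(h[len(prefix):]), []byte(token)) != 1 {
				http.Error(w, "unauthorized", http.StatusUnauthorized)
				return
			}
		}
		next.ServeHTTP(w, r)
	})
}

// inAny reports whether the host part of remoteAddr is inside any network.
func inAny(remoteAddr string, nets []*net.IPNet) bool {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return false
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return false
	}
	for _, n := range nets {
		if n.Contains(ip) {
			return true
		}
	}
	return false
}

// gateStatus runs one request through the gate and returns the status code.
func gateStatus(remoteAddr, authHeader string) int {
	_, loopback, _ := net.ParseCIDR("127.0.0.0/8")
	h := metricsGate("the-token", []*net.IPNet{loopback},
		http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
			w.WriteHeader(http.StatusOK)
		}))
	req := httptest.NewRequest(http.MethodGet, "/metrics", nil)
	req.RemoteAddr = remoteAddr
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(gateStatus("127.0.0.1:4000", ""))                 // in CIDR, no token
	fmt.Println(gateStatus("127.0.0.1:4000", "Bearer the-token")) // both right
	fmt.Println(gateStatus("10.1.2.3:4000", "Bearer the-token"))  // outside CIDR
}
```

The token comparison uses `subtle.ConstantTimeCompare` so that response timing does not leak how much of a guessed token matched.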
@@ -512,11 +512,27 @@ func TestDrainPendingSerializesPerHost(t *testing.T) {
|
||||
// Connect the agent so DrainPending can dispatch.
|
||||
c := agentDial(t, srv, ts, hostID, token)
|
||||
sendHello(t, c, "serialise-host")
|
||||
// Drain the on-hello goroutine's pass first (no pending rows yet),
|
||||
// then wait for the schedule.set so the connection is fully settled.
|
||||
// Wait for the on-hello push to settle.
|
||||
_ = drainUntil(t, c, api.MsgScheduleSet)
|
||||
|
||||
// Insert 5 pending rows now that the on-hello drain has already run.
|
||||
// A real agent is always in a read loop. Keep this test client
|
||||
// reading in the background for the rest of the test: without an
|
||||
// active reader the server-side conn can be dropped under parallel
|
||||
// load, which unregisters it from the hub and makes DrainPending
|
||||
// no-op (conn == nil) — the historical source of this test's
|
||||
// flakiness (it would observe 0 or a partial drain). The reader also
|
||||
// consumes the command.run envelopes our drains emit.
|
||||
readerCtx, stopReader := context.WithCancel(context.Background())
|
||||
defer stopReader()
|
||||
go func() {
|
||||
for {
|
||||
if _, _, err := c.Read(readerCtx); err != nil {
|
||||
return
|
||||
}
|
||||
}
|
||||
}()
|
||||
|
||||
// Insert 5 due pending rows.
|
||||
now := time.Now().UTC()
|
||||
for i := range 5 {
|
||||
pid := ulid.Make().String()
|
||||
@@ -533,7 +549,8 @@ func TestDrainPendingSerializesPerHost(t *testing.T) {
|
||||
}
|
||||
}
|
||||
|
||||
// Spawn 10 goroutines all calling DrainPending concurrently.
|
||||
// Fire 10 concurrent DrainPending calls. The per-host mutex must
|
||||
// ensure each row is dispatched at most once (no double-dispatch).
|
||||
var wg sync.WaitGroup
|
||||
for range 10 {
|
||||
wg.Add(1)
|
||||
@@ -544,24 +561,26 @@ func TestDrainPendingSerializesPerHost(t *testing.T) {
|
||||
}
|
||||
wg.Wait()
|
||||
|
||||
// Drain any envelopes the agent received so we don't block below.
|
||||
// We read with short timeouts and stop when the connection goes quiet.
|
||||
drainDeadline := time.Now().Add(500 * time.Millisecond)
|
||||
for time.Now().Before(drainDeadline) {
|
||||
ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
|
||||
_, _, err := c.Read(ctx)
|
||||
cancel()
|
||||
if err != nil {
|
||||
break
|
||||
}
|
||||
// Drain to completion. The fire-and-forget on-hello DrainPending
|
||||
// shares the same per-host mutex and can hold it during the burst,
|
||||
// leaving rows for a later pass — exactly how production drains
|
||||
// (repeatedly, via the 30s tick / on reconnect). Re-drain until the
|
||||
// queue is empty; because every drain is still serialised, each row
|
||||
// is dispatched at most once, so the exactly-5 job count below proves
|
||||
// there was no double-dispatch.
|
||||
deadline := time.Now().Add(5 * time.Second)
|
||||
for countPendingForHost(t, st, hostID) > 0 && time.Now().Before(deadline) {
|
||||
srv.DrainPending(context.Background(), hostID)
|
||||
time.Sleep(10 * time.Millisecond)
|
||||
}
|
||||
|
||||
// All 5 pending rows must be gone.
|
||||
// All 5 pending rows must be drained.
|
||||
if n := countPendingForHost(t, st, hostID); n != 0 {
|
||||
t.Errorf("pending rows after concurrent drain: got %d, want 0", n)
|
||||
t.Errorf("pending rows after drain-to-completion: got %d, want 0", n)
|
||||
}
|
||||
|
||||
// Exactly 5 backup job rows (one per pending row), not 10+ from a race.
|
||||
// Exactly 5 backup job rows (one per pending row) — never more, which
|
||||
// would mean the per-host mutex failed to prevent double-dispatch.
|
||||
var n int
|
||||
_ = st.DB().QueryRow(
|
||||
`SELECT COUNT(*) FROM jobs WHERE host_id = ? AND kind = 'backup' AND actor_kind = 'schedule'`,
|
||||
|
||||
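The test hunk above relies on a per-host mutex to guarantee that concurrent `DrainPending` calls never dispatch the same pending row twice. The server's real drain path is not shown here, so the following is a sketch under stated assumptions: `drainer`, `hostLock`, `drain` and `runConcurrentDrains` are hypothetical, the queue is an in-memory map rather than the store, and no rows are enqueued while drains are running.

```go
package main

import (
	"fmt"
	"sync"
)

// drainer is a hypothetical stand-in for the server's drain path.
type drainer struct {
	mu         sync.Mutex             // guards every map below
	locks      map[string]*sync.Mutex // one lock per host ID
	pending    map[string][]string    // host ID -> queued row IDs
	dispatched map[string]int         // row ID -> times dispatched
}

func newDrainer() *drainer {
	return &drainer{
		locks:      make(map[string]*sync.Mutex),
		pending:    make(map[string][]string),
		dispatched: make(map[string]int),
	}
}

// hostLock returns the mutex for hostID, creating it on first use.
func (d *drainer) hostLock(hostID string) *sync.Mutex {
	d.mu.Lock()
	defer d.mu.Unlock()
	l, ok := d.locks[hostID]
	if !ok {
		l = &sync.Mutex{}
		d.locks[hostID] = l
	}
	return l
}

// drain reads the queue, dispatches each row, then clears the queue.
// The read and the clear are separate steps, so without the per-host
// lock two concurrent drains could both read the same rows.
func (d *drainer) drain(hostID string) {
	l := d.hostLock(hostID)
	l.Lock()
	defer l.Unlock()

	d.mu.Lock()
	rows := append([]string(nil), d.pending[hostID]...)
	d.mu.Unlock()

	for _, id := range rows {
		d.mu.Lock()
		d.dispatched[id]++
		d.mu.Unlock()
	}

	d.mu.Lock()
	delete(d.pending, hostID)
	d.mu.Unlock()
}

// runConcurrentDrains queues rows for one host, fires the given number
// of drains at once, and returns the highest per-row dispatch count and
// the number of rows still pending.
func runConcurrentDrains(rows, workers int) (maxDispatch, remaining int) {
	d := newDrainer()
	for i := 0; i < rows; i++ {
		d.pending["h1"] = append(d.pending["h1"], fmt.Sprintf("row-%d", i))
	}
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			d.drain("h1")
		}()
	}
	wg.Wait()
	for _, n := range d.dispatched {
		if n > maxDispatch {
			maxDispatch = n
		}
	}
	return maxDispatch, len(d.pending["h1"])
}

func main() {
	maxDispatch, remaining := runConcurrentDrains(5, 10)
	fmt.Println(maxDispatch, remaining)
}
```

Keying the lock by host ID keeps drains for different hosts independent while serialising drains for the same host, which is the property the exactly-5 job count in the test is checking.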
@@ -17,6 +17,7 @@ import (
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/crypto"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/notification"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/config"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/oidc"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ui"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ws"
|
||||
@@ -39,6 +40,13 @@ type Deps struct {
|
||||
// NotificationHub (optional, wired in G1) is used by the test-fire
|
||||
// endpoint to dispatch a single synthetic payload through a channel.
|
||||
NotificationHub *notification.Hub
|
||||
// UpdateWatcher tracks in-flight agent self-update dispatches and
|
||||
// reconciles them against incoming hello envelopes. Optional;
|
||||
// nil = no-op (handlers degrade by skipping the Track call).
|
||||
UpdateWatcher UpdateWatcher
|
||||
// FleetWorker drives the rolling fleet-update worker. Optional;
|
||||
// nil = fleet update endpoints (P6-15) report unavailable.
|
||||
FleetWorker FleetWorker
|
||||
// Version is the binary's build version, surfaced in the chrome.
|
||||
// Empty falls back to "dev".
|
||||
Version string
|
||||
@@ -49,6 +57,12 @@ type Deps struct {
|
||||
// OIDC (optional). Non-nil when the operator has configured an
|
||||
// IdP — handlers under /auth/oidc/* are mounted only when set.
|
||||
OIDC *oidc.Client
|
||||
// Metrics (optional). When non-nil the WS job-finished branch
|
||||
// records job durations and the /metrics handler can pull a
|
||||
// histogram snapshot. Independent of MetricsAuthEnabled — the
|
||||
// recorder runs even if the scrape endpoint is gated off, so a
|
||||
// later config flip doesn't lose the running window.
|
||||
Metrics *metrics.Registry
|
||||
}
|
||||
|
||||
// Server is the running HTTP server.
|
||||
@@ -76,6 +90,13 @@ type Server struct {
|
||||
// directories (P3-X2). Pre-allocated in New so the lazy-init
|
||||
// race is impossible.
|
||||
treeCache *treeCache
|
||||
|
||||
// catchupDueAt tracks intermittent hosts that reconnected and are
|
||||
// in their settle window. Keyed hostID → earliest time to evaluate
|
||||
// catch-up. Best-effort + in-memory: a server restart simply re-arms
|
||||
// on the next hello. Guarded by catchupMu.
|
||||
catchupMu sync.Mutex
|
||||
catchupDueAt map[string]time.Time
|
||||
}
|
||||
|
||||
// New builds a configured but not-yet-started server.
|
||||
@@ -90,11 +111,12 @@ func New(deps Deps) *Server {
|
||||
r.Use(requestLogger)
|
||||
|
||||
s := &Server{
|
||||
deps: deps,
|
||||
drainLocks: make(map[string]*sync.Mutex),
|
||||
announceRL: newAnnounceLimiter(),
|
||||
pendingHub: newPendingHub(),
|
||||
treeCache: newTreeCache(),
|
||||
deps: deps,
|
||||
drainLocks: make(map[string]*sync.Mutex),
|
||||
announceRL: newAnnounceLimiter(),
|
||||
pendingHub: newPendingHub(),
|
||||
treeCache: newTreeCache(),
|
||||
catchupDueAt: make(map[string]time.Time),
|
||||
}
|
||||
s.routes(r)
|
||||
|
||||
@@ -123,16 +145,25 @@ func (s *Server) routes(r chi.Router) {
|
||||
r.Post("/api/agents/announce", s.handleAnnounce)
|
||||
r.Get("/agent/binary", s.handleAgentBinary)
|
||||
r.Get("/install/*", s.handleInstallAsset)
|
||||
r.Get("/api/version", s.handleVersion)
|
||||
if s.deps.Cfg.MetricsAuthEnabled() {
|
||||
r.Get("/metrics", s.handleMetrics)
|
||||
}
|
||||
if s.deps.Hub != nil {
|
||||
r.Mount("/ws/agent", ws.AgentHandler(ws.HandlerDeps{
|
||||
hd := ws.HandlerDeps{
|
||||
Hub: s.deps.Hub,
|
||||
Store: s.deps.Store,
|
||||
JobHub: s.deps.JobHub,
|
||||
AlertEngine: s.deps.AlertEngine,
|
||||
Metrics: s.deps.Metrics,
|
||||
OnHello: s.onAgentHello,
|
||||
OnScheduleAck: s.applyScheduleAck,
|
||||
OnScheduleFire: s.dispatchScheduledJob,
|
||||
}))
|
||||
}
|
||||
if w, ok := s.deps.UpdateWatcher.(*ws.UpdateWatcher); ok && w != nil {
|
||||
hd.UpdateWatcher = w
|
||||
}
|
||||
r.Mount("/ws/agent", ws.AgentHandler(hd))
|
||||
}
|
||||
r.Get("/ws/agent/pending", s.handlePendingWS)
|
||||
r.Mount("/static/", staticHandler())
|
||||
@@ -183,7 +214,9 @@ func (s *Server) routes(r chi.Router) {
|
||||
r.Get("/hosts/{id}/sources", s.handleUIHostSources)
|
||||
r.Get("/hosts/{id}/sources/new", s.handleUISourceGroupNewGet)
|
||||
r.Get("/hosts/{id}/sources/{gid}/edit", s.handleUISourceGroupEditGet)
|
||||
r.Get("/hosts/{id}/jobs", s.handleUIHostJobs)
|
||||
r.Get("/hosts/{id}/repo", s.handleUIHostRepo)
|
||||
r.Get("/hosts/{id}/repo/trend", s.handleUIRepoTrend)
|
||||
r.Get("/hosts/{id}/schedules", s.handleUISchedulesList)
|
||||
r.Get("/hosts/{id}/schedules/new", s.handleUIScheduleNewGet)
|
||||
r.Get("/hosts/{id}/schedules/{sid}/edit", s.handleUIScheduleEditGet)
|
||||
@@ -254,6 +287,7 @@ func (s *Server) routes(r chi.Router) {
|
||||
r.Post("/hosts/{id}/repo/probe", s.handleUIRepoProbe)
|
||||
r.Post("/hosts/{id}/repo/hooks", s.handleUIRepoHooksSave)
|
||||
r.Post("/hosts/{id}/tags", s.handleUIHostTagsSave)
|
||||
r.Post("/hosts/{id}/mode", s.handleUIHostModeSave)
|
||||
r.Post("/hosts/{id}/admin-credentials", s.handleUIAdminCredentialsSave)
|
||||
r.Post("/hosts/{id}/admin-credentials/delete", s.handleUIAdminCredentialsDelete)
|
||||
r.Post("/hosts/{id}/schedules/new", s.handleUIScheduleSave)
|
||||
@@ -270,6 +304,14 @@ func (s *Server) routes(r chi.Router) {
|
||||
r.Group(func(r chi.Router) {
|
||||
r.Use(s.requireRole(store.RoleAdmin))
|
||||
|
||||
r.Post("/api/hosts/{id}/update", s.handleHostUpdate)
|
||||
r.Post("/hosts/{id}/update", s.handleHostUpdateForm)
|
||||
|
||||
// Fleet update (P6-15): rolling update across many hosts.
|
||||
r.Post("/api/fleet/update", s.handleAPIFleetUpdateStart)
|
||||
r.Post("/api/fleet-updates/{id}/cancel", s.handleAPIFleetUpdateCancel)
|
||||
r.Get("/api/fleet-updates/{id}", s.handleAPIFleetUpdateGet)
|
||||
|
||||
r.Get("/api/users", s.handleAPIUsersList)
|
||||
r.Post("/api/users", s.handleAPIUserCreate)
|
||||
r.Get("/api/users/{id}", s.handleAPIUserGet)
|
||||
@@ -283,6 +325,8 @@ func (s *Server) routes(r chi.Router) {
|
||||
if s.deps.UI != nil {
|
||||
r.Post("/hosts/{id}/delete", s.handleUIHostDelete)
|
||||
r.Get("/settings", s.handleUISettings)
|
||||
r.Get("/settings/fleet-update", s.handleUIFleetUpdate)
|
||||
r.Get("/settings/fleet-update/partial", s.handleUIFleetUpdatePartial)
|
||||
r.Get("/settings/users", s.handleUIUsersList)
|
||||
r.Get("/settings/users/new", s.handleUIUserNewGet)
|
||||
r.Post("/settings/users/new", s.handleUIUserNewPost)
|
||||
@@ -321,6 +365,27 @@ func (s *Server) Shutdown(ctx context.Context) error {
|
||||
return s.srv.Shutdown(ctx)
|
||||
}
|
||||
|
||||
// SetFleetWorker installs the fleet-update worker post-construction.
|
||||
// Used to break the wiring loop in cmd/server (the worker depends on a
|
||||
// dispatcher that delegates back into the server's host-update path).
|
||||
func (s *Server) SetFleetWorker(fw FleetWorker) { s.deps.FleetWorker = fw }
|
||||
|
||||
// DispatchHostUpdate is the public entry point for callers (the fleet
|
||||
// worker) that need to drive the same dispatch path the HTTP handler
|
||||
// uses, without going through HTTP. Returns the structured result so
|
||||
// the caller can map error codes to its own status enum.
|
||||
func (s *Server) DispatchHostUpdate(ctx context.Context, hostID, actorUserID string) (jobID string, code string, err error) {
|
||||
var actorID *string
|
||||
if actorUserID != "" {
|
||||
actorID = &actorUserID
|
||||
}
|
||||
res := s.dispatchHostUpdate(ctx, hostID, "user", actorID)
|
||||
if res.Code != "" {
|
||||
return res.JobID, res.Code, nil
|
||||
}
|
||||
return res.JobID, "", nil
|
||||
}
|
||||
|
||||
// Addr returns the configured listen address. Useful in tests when
|
||||
// the caller passes :0 to get a random port.
|
||||
func (s *Server) Addr() string { return s.srv.Addr }
|
||||
|
||||
@@ -0,0 +1,89 @@
|
||||
package http

import (
	"context"
	stdhttp "net/http"
	"strings"
	"testing"
	"time"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
)

func getDashboard(t *testing.T, baseURL string, cookie *stdhttp.Cookie) string {
	t.Helper()
	client := &stdhttp.Client{
		CheckRedirect: func(_ *stdhttp.Request, _ []*stdhttp.Request) error {
			return stdhttp.ErrUseLastResponse
		},
	}
	req, err := stdhttp.NewRequest("GET", baseURL+"/", nil)
	if err != nil {
		t.Fatalf("new request: %v", err)
	}
	req.AddCookie(cookie)
	res, err := client.Do(req)
	if err != nil {
		t.Fatalf("GET /: %v", err)
	}
	defer res.Body.Close()
	if res.StatusCode != stdhttp.StatusOK {
		t.Fatalf("GET /: want 200, got %d", res.StatusCode)
	}
	body := make([]byte, 0, 1<<20)
	buf := make([]byte, 4096)
	for {
		n, rerr := res.Body.Read(buf)
		body = append(body, buf[:n]...)
		if rerr != nil {
			break
		}
	}
	return string(body)
}

func TestDashboard_HostRowSparklineRendersWithHistory(t *testing.T) {
	t.Parallel()
	_, baseURL, st := newTestServerWithUI(t)
	cookie := loginAsAdmin(t, st)
	hostID := makeHost(t, st, "h-spark")
	ctx := context.Background()

	// Two history points → polyline must render. Use dates relative to
	// now so the points always fall inside the dashboard's rolling
	// 30-day window (ui_handlers.go: since = now-30d); hard-coded dates
	// silently age out of the window and break this test over time.
	for i, day := range []string{
		time.Now().UTC().AddDate(0, 0, -2).Format("2006-01-02"),
		time.Now().UTC().AddDate(0, 0, -1).Format("2006-01-02"),
	} {
		v := int64(100 + i*50)
		if err := st.UpsertHostRepoStatsHistory(ctx, hostID, day,
			store.HostRepoStats{TotalSizeBytes: &v}, time.Now().UTC()); err != nil {
			t.Fatalf("upsert %s: %v", day, err)
		}
	}

	body := getDashboard(t, baseURL, cookie)
	if !strings.Contains(body, `class="repo-sparkline"`) {
		t.Errorf("expected sparkline SVG in dashboard body (class=repo-sparkline missing)")
	}
	if !strings.Contains(body, `<polyline`) {
		t.Errorf("expected <polyline> in dashboard body")
	}
}

func TestDashboard_HostRowSparklineEmptyState(t *testing.T) {
	t.Parallel()
	_, baseURL, st := newTestServerWithUI(t)
	cookie := loginAsAdmin(t, st)
	makeHost(t, st, "h-empty")

	body := getDashboard(t, baseURL, cookie)
	if !strings.Contains(body, `class="repo-sparkline"`) {
		t.Errorf("expected sparkline SVG element on dashboard")
	}
	if !strings.Contains(body, `>—<`) {
		t.Errorf("expected em-dash placeholder in empty sparkline cell")
	}
}
@@ -5,8 +5,10 @@ import (
|
||||
"encoding/base64"
|
||||
"encoding/json"
|
||||
"errors"
|
||||
"html/template"
|
||||
"io/fs"
|
||||
"log/slog"
|
||||
"math"
|
||||
stdhttp "net/http"
|
||||
"net/url"
|
||||
"sort"
|
||||
@@ -23,6 +25,8 @@ import (
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ui"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ws"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/web/sparkline"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/web"
|
||||
)
|
||||
|
||||
@@ -155,6 +159,10 @@ type dashboardPage struct {
|
||||
// when it's already active). Pre-computed so the template stays
|
||||
// dumb.
|
||||
SortURL map[string]string
|
||||
// UpdatesBehind is the count of online hosts whose agent_version
|
||||
// trails the server. Surfaces as the dashboard "N hosts behind"
|
||||
// hero tile and links to ?updates=behind.
|
||||
UpdatesBehind int
|
||||
}
|
||||
|
||||
// dashboardFilter holds the parsed query-string filter state.
|
||||
@@ -165,6 +173,10 @@ type dashboardFilter struct {
|
||||
Tag string // mirrors ActiveTag for round-trip on links
|
||||
Sort string // column key (see sortDashboard)
|
||||
Dir string // "asc" | "desc"
|
||||
// Updates narrows to hosts whose agent is behind the server's
|
||||
// version. Only valid value today is "behind"; empty means no
|
||||
// filter.
|
||||
Updates string
|
||||
}
|
||||
|
||||
// dashboardHostRow carries a host plus the per-row Run-now decision
|
||||
@@ -180,6 +192,17 @@ type dashboardHostRow struct {
|
||||
// NextRun is the next-fire time of RunAllScheduleID (when set),
|
||||
// computed server-side from its cron. nil otherwise.
|
||||
NextRun *time.Time
|
||||
// UpdateAvailable is true when the host's agent has connected at
|
||||
// least once AND its agent_version differs from the server's. Used
|
||||
// by the host_row partial to render the update-available chip.
|
||||
UpdateAvailable bool
|
||||
// TargetVersion is the server's build version, surfaced in the
|
||||
// chip's tooltip and label.
|
||||
TargetVersion string
|
||||
// RepoSparklineSVG is a server-rendered inline SVG showing the
|
||||
// 30-day repo-size trend. Empty-state SVG (em-dash) is returned
|
||||
// when no history rows exist for the host.
|
||||
RepoSparklineSVG template.HTML
|
||||
}
|
||||
|
||||
// pickRunAllSchedule returns the ID of the single schedule whose
|
||||
@@ -255,7 +278,11 @@ func (s *Server) handleUIDashboard(w stdhttp.ResponseWriter, r *stdhttp.Request)
|
||||
// calls per host — fine at fleet sizes we care about.
|
||||
rows := make([]dashboardHostRow, 0, len(hosts))
|
||||
for _, h := range hosts {
|
||||
row := dashboardHostRow{Host: h}
|
||||
row := dashboardHostRow{
|
||||
Host: h,
|
||||
TargetVersion: version.Version,
|
||||
UpdateAvailable: h.AgentVersion != "" && h.AgentVersion != version.Version,
|
||||
}
|
||||
groups, gerr := s.deps.Store.ListSourceGroupsByHost(r.Context(), h.ID)
|
||||
if gerr != nil {
|
||||
slog.Warn("ui dashboard: list source groups", "host_id", h.ID, "err", gerr)
|
||||
@@ -276,6 +303,20 @@ func (s *Server) handleUIDashboard(w stdhttp.ResponseWriter, r *stdhttp.Request)
|
||||
}
|
||||
}
|
||||
}
|
||||
since := time.Now().UTC().AddDate(0, 0, -30)
|
||||
pts, herr := s.deps.Store.ListHostRepoStatsHistory(r.Context(), h.ID, since)
|
||||
if herr != nil {
|
||||
slog.Warn("ui dashboard: list repo history", "host_id", h.ID, "err", herr)
|
||||
}
|
||||
sparkPoints := make([]float64, len(pts))
|
||||
for i, p := range pts {
|
||||
if p.TotalSizeBytes == nil {
|
||||
sparkPoints[i] = math.NaN()
|
||||
} else {
|
||||
sparkPoints[i] = float64(*p.TotalSizeBytes)
|
||||
}
|
||||
}
|
||||
row.RepoSparklineSVG = sparkline.RenderSparkline(sparkPoints, 88, 20)
|
||||
rows = append(rows, row)
|
||||
}
|
||||
|
||||
@@ -289,6 +330,13 @@ func (s *Server) handleUIDashboard(w stdhttp.ResponseWriter, r *stdhttp.Request)
|
||||
critOpenCount = len(crit)
|
||||
}
|
||||
|
||||
updatesBehind := 0
|
||||
for _, h := range allHosts {
|
||||
if h.Status == "online" && h.AgentVersion != "" && h.AgentVersion != version.Version {
|
||||
updatesBehind++
|
||||
}
|
||||
}
|
||||
|
||||
view := s.baseView(r, u)
|
||||
view.Page = dashboardPage{
|
||||
Hosts: rows,
|
||||
@@ -302,6 +350,7 @@ func (s *Server) handleUIDashboard(w stdhttp.ResponseWriter, r *stdhttp.Request)
|
||||
Filter: filter,
|
||||
RefreshURL: "/?" + filter.encode(),
|
||||
SortURL: buildDashboardSortURLs(filter),
|
||||
UpdatesBehind: updatesBehind,
|
||||
}
|
||||
if err := s.deps.UI.Render(w, "dashboard", view); err != nil {
|
||||
slog.Error("ui: render dashboard", "err", err)
|
||||
@@ -320,6 +369,7 @@ func parseDashboardFilter(q url.Values) dashboardFilter {
|
||||
Tag: q.Get("tag"),
|
||||
Sort: q.Get("sort"),
|
||||
Dir: q.Get("dir"),
|
||||
Updates: q.Get("updates"),
|
||||
}
|
||||
if f.Sort == "" {
|
||||
f.Sort = "name"
|
||||
@@ -352,6 +402,9 @@ func (f dashboardFilter) encode() string {
|
||||
if f.Dir != "" && f.Dir != "asc" {
|
||||
v.Set("dir", f.Dir)
|
||||
}
|
||||
if f.Updates != "" {
|
||||
v.Set("updates", f.Updates)
|
||||
}
|
||||
return v.Encode()
|
||||
}
|
||||
|
||||
@@ -402,6 +455,11 @@ func filterAndSortDashboardHosts(hosts []store.Host, f dashboardFilter) []store.
|
||||
continue
|
||||
}
|
||||
}
|
||||
if f.Updates == "behind" {
|
||||
if h.AgentVersion == "" || h.AgentVersion == version.Version {
|
||||
continue
|
||||
}
|
||||
}
|
||||
out = append(out, h)
|
||||
}
|
||||
sortDashboardHosts(out, f.Sort, f.Dir)
|
||||
@@ -809,6 +867,20 @@ type hostChromeData struct {
|
||||
SourceGroupCount int
|
||||
ScheduleCount int
|
||||
ScheduleVersion int64 // host_schedule_version (latest desired)
|
||||
// UpdateAvailable + TargetVersion drive the agent-out-of-date chip
|
||||
// in the host detail header. UpdateAvailable is true iff the host
|
||||
// has connected at least once AND its agent_version != server's.
|
||||
UpdateAvailable bool
|
||||
TargetVersion string
|
||||
// Online + UpdateInProgress drive the per-host "Update agent"
|
||||
// button on host_detail. Online mirrors hub.Connected; pulled here
|
||||
// so the button can disable when the host is unreachable.
|
||||
Online bool
|
||||
UpdateInProgress bool
|
||||
// CanAdmin is true when the viewing user has admin role; used to
|
||||
// gate the "Update agent" button. Kept on the chrome struct so any
|
||||
// page reusing host_chrome already has it for free.
|
||||
CanAdmin bool
|
||||
// KnownTags is the union of tags already in use across the fleet,
|
||||
// used for autocomplete on the host-tags edit form. Cheap query.
|
||||
KnownTags []string
|
||||
@@ -834,6 +906,14 @@ type hostChromeData struct {
|
||||
// render the page with stale counts than 500 the whole tab.
|
||||
func (s *Server) loadHostChrome(r *stdhttp.Request, host store.Host, subtab, crumb string) hostChromeData {
|
||||
d := hostChromeData{Host: host, SubTab: subtab, Crumb: crumb}
|
||||
d.TargetVersion = version.Version
|
||||
d.UpdateAvailable = host.AgentVersion != "" && host.AgentVersion != version.Version
|
||||
if s.deps.Hub != nil {
|
||||
d.Online = s.deps.Hub.Connected(host.ID)
|
||||
}
|
||||
if existing, _ := s.deps.Store.RunningUpdateJobForHost(r.Context(), host.ID); existing != "" {
|
||||
d.UpdateInProgress = true
|
||||
}
|
||||
if groups, err := s.deps.Store.ListSourceGroupsByHost(r.Context(), host.ID); err == nil {
|
||||
d.SourceGroupCount = len(groups)
|
||||
} else {
|
||||
@@ -903,6 +983,43 @@ func (s *Server) handleUIHostTagsSave(w stdhttp.ResponseWriter, r *stdhttp.Reque
|
||||
stdhttp.Redirect(w, r, "/hosts/"+hostID, stdhttp.StatusSeeOther)
|
||||
}
|
||||
|
||||
// handleUIHostModeSave flips a host's always-on flag. Checkbox present
|
||||
// in the form (value any) => always-on; absent => intermittent.
|
||||
// Operator-band; mounted in server.go. On change we clear open
|
||||
// offline/staleness alerts via the engine so the next sweep re-raises
|
||||
// only what still applies under the new mode.
|
||||
func (s *Server) handleUIHostModeSave(w stdhttp.ResponseWriter, r *stdhttp.Request) {
|
||||
u := s.requireUIUser(w, r)
|
||||
if u == nil {
|
||||
return
|
||||
}
|
||||
hostID := chi.URLParam(r, "id")
|
||||
if _, err := s.deps.Store.GetHost(r.Context(), hostID); err != nil {
|
||||
stdhttp.NotFound(w, r)
|
||||
return
|
||||
}
|
||||
if err := r.ParseForm(); err != nil {
|
||||
stdhttp.Error(w, "bad request", stdhttp.StatusBadRequest)
|
||||
return
|
||||
}
|
||||
alwaysOn := r.PostForm.Get("always_on") != ""
|
||||
if err := s.deps.Store.SetHostAlwaysOn(r.Context(), hostID, alwaysOn); err != nil {
|
||||
slog.Error("ui host mode: save", "host_id", hostID, "err", err)
|
||||
stdhttp.Error(w, "internal", stdhttp.StatusInternalServerError)
|
||||
return
|
||||
}
|
||||
if s.deps.AlertEngine != nil {
|
||||
s.deps.AlertEngine.ResolveOnModeChange(r.Context(), hostID, time.Now().UTC())
|
||||
}
|
||||
_ = s.deps.Store.AppendAudit(r.Context(), store.AuditEntry{
|
||||
ID: ulid.Make().String(), UserID: &u.ID, Actor: "user",
|
||||
Action: "host.mode_updated",
|
||||
TargetKind: ptr("host"), TargetID: &hostID,
|
||||
TS: time.Now().UTC(),
|
||||
})
|
||||
stdhttp.Redirect(w, r, "/hosts/"+hostID, stdhttp.StatusSeeOther)
|
||||
}
|
||||
|
||||
// normaliseTags splits a comma-separated string, lowercases each token,
|
||||
// trims whitespace, drops empties, and dedupes. Order is preserved
|
||||
// from first occurrence (so the user's typing order shows on screen).
|
||||
@@ -972,8 +1089,10 @@ func (s *Server) handleUIHostDetail(w stdhttp.ResponseWriter, r *stdhttp.Request
|
||||
|
||||
view := s.baseView(r, u)
|
||||
view.Title = host.Name + " · restic-manager"
|
||||
chrome := s.loadHostChrome(r, *host, "snapshots", "snapshots")
|
||||
chrome.CanAdmin = u.Role == string(store.RoleAdmin)
|
||||
view.Page = hostDetailPage{
|
||||
hostChromeData: s.loadHostChrome(r, *host, "snapshots", "snapshots"),
|
||||
hostChromeData: chrome,
|
||||
Snapshots: shown,
|
||||
SnapshotsShown: len(shown),
|
||||
LegacyRestic: !restic.Env{Version: host.ResticVersion}.AtLeastVersion(0, 17),
|
||||
|
||||
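The `ui_handlers.go` hunks above use one predicate in three places: the per-row `UpdateAvailable` flag, the `UpdatesBehind` hero count (online hosts only), and the `?updates=behind` filter. The sketch below restates that logic on a trimmed-down host type so the three uses can be compared side by side. The `host` struct, the helper names and the sample fleet are hypothetical; only the predicate itself is taken from the diff.

```go
package main

import "fmt"

// host is a trimmed-down, hypothetical stand-in for store.Host.
type host struct {
	Name         string
	Status       string
	AgentVersion string
}

// updateAvailable mirrors the predicate in the diff: the agent has
// reported a version at least once and it differs from the server's.
func updateAvailable(agentVersion, serverVersion string) bool {
	return agentVersion != "" && agentVersion != serverVersion
}

// countOnlineBehind counts online hosts with an update available,
// matching the hero-tile loop in the diff.
func countOnlineBehind(hosts []host, serverVersion string) int {
	n := 0
	for _, h := range hosts {
		if h.Status == "online" && updateAvailable(h.AgentVersion, serverVersion) {
			n++
		}
	}
	return n
}

// filterBehind keeps every host with an update available, online or
// not, matching the updates=behind filter in the diff.
func filterBehind(hosts []host, serverVersion string) []host {
	var out []host
	for _, h := range hosts {
		if updateAvailable(h.AgentVersion, serverVersion) {
			out = append(out, h)
		}
	}
	return out
}

// demoFleet returns sample data for the example.
func demoFleet() []host {
	return []host{
		{Name: "a", Status: "online", AgentVersion: "v1.0.0"},
		{Name: "b", Status: "online", AgentVersion: "v1.1.0"},
		{Name: "c", Status: "offline", AgentVersion: "v0.9.0"},
		{Name: "d", Status: "offline", AgentVersion: ""},
	}
}

func main() {
	fleet := demoFleet()
	fmt.Println(countOnlineBehind(fleet, "v1.1.0"))
	fmt.Println(len(filterBehind(fleet, "v1.1.0")))
}
```

Two details follow from the code as written. The predicate is a plain inequality rather than a version ordering, so an agent newer than the server would also be flagged. And the hero count only includes online hosts while the filter does not check status, so the filtered list can be longer than the tile's number.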
@@ -0,0 +1,47 @@
|
||||
package http

import (
	"log/slog"
	stdhttp "net/http"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
)

// hostJobsPage is the page-data struct for /hosts/{id}/jobs.
type hostJobsPage struct {
	hostChromeData
	Jobs []store.Job
}

// handleUIHostJobs renders the per-host jobs list. Read-only — no
// actions, just a click-through to the existing /jobs/{id} detail
// page for any row.
func (s *Server) handleUIHostJobs(w stdhttp.ResponseWriter, r *stdhttp.Request) {
	u := s.requireUIUser(w, r)
	if u == nil {
		return
	}
	host, ok := s.loadHostForUI(w, r)
	if !ok {
		return
	}

	jobs, err := s.deps.Store.ListJobsByHost(r.Context(), host.ID, 100)
	if err != nil {
		slog.Error("ui host jobs: list", "host_id", host.ID, "err", err)
		stdhttp.Error(w, "internal", stdhttp.StatusInternalServerError)
		return
	}

	page := hostJobsPage{
		hostChromeData: s.loadHostChrome(r, *host, "jobs", "jobs"),
		Jobs:           jobs,
	}
	view := s.baseView(r, u)
	view.Title = host.Name + " jobs · restic-manager"
	view.Page = page
	if err := s.deps.UI.Render(w, "host_jobs", view); err != nil {
		slog.Error("ui: render host_jobs", "err", err)
		stdhttp.Error(w, "internal", stdhttp.StatusInternalServerError)
	}
}
@@ -0,0 +1,85 @@
|
||||
package http

import (
	"context"
	"io"
	stdhttp "net/http"
	"strings"
	"testing"
	"time"

	"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
)

func TestUIHostJobs_RendersList(t *testing.T) {
	t.Parallel()
	_, baseURL, st := newTestServerWithUI(t)
	cookie := loginAsAdmin(t, st)
	hostID := makeHost(t, st, "h-jobs-render")

	// Two jobs with distinct kinds + statuses.
	now := time.Now().UTC()
	ctx := context.Background()
	if err := st.CreateJob(ctx, store.Job{
		ID: "01HZZZZZZZZZZZZZZZZZZZZZ10", HostID: hostID, Kind: "backup",
		ActorKind: "user", CreatedAt: now.Add(-time.Hour),
	}); err != nil {
		t.Fatalf("create job: %v", err)
	}
	if err := st.MarkJobFinished(ctx, "01HZZZZZZZZZZZZZZZZZZZZZ10", "succeeded", 0, nil, "", now.Add(-time.Hour+time.Minute)); err != nil {
		t.Fatalf("finish job: %v", err)
	}
	if err := st.CreateJob(ctx, store.Job{
		ID: "01HZZZZZZZZZZZZZZZZZZZZZ11", HostID: hostID, Kind: "prune",
		ActorKind: "schedule", CreatedAt: now,
	}); err != nil {
		t.Fatalf("create job: %v", err)
	}
	if err := st.MarkJobFinished(ctx, "01HZZZZZZZZZZZZZZZZZZZZZ11", "failed", 1, nil, "boom", now.Add(time.Minute)); err != nil {
		t.Fatalf("finish job: %v", err)
	}

	body := getHostJobsPage(t, baseURL, hostID, cookie)
	for _, want := range []string{"backup", "prune", "succeeded", "failed", "schedule", "user", `class="jobs-row`} {
		if !strings.Contains(body, want) {
			t.Errorf("expected %q in body, missing", want)
		}
	}
}

func TestUIHostJobs_EmptyState(t *testing.T) {
	t.Parallel()
	_, baseURL, st := newTestServerWithUI(t)
	cookie := loginAsAdmin(t, st)
	hostID := makeHost(t, st, "h-jobs-empty")

	body := getHostJobsPage(t, baseURL, hostID, cookie)
	if !strings.Contains(body, "No jobs yet.") {
		t.Error("expected empty-state heading")
	}
}

// getHostJobsPage fetches /hosts/{id}/jobs and returns the body string.
func getHostJobsPage(t *testing.T, baseURL, hostID string, cookie *stdhttp.Cookie) string {
	t.Helper()
	client := &stdhttp.Client{
		CheckRedirect: func(_ *stdhttp.Request, _ []*stdhttp.Request) error {
			return stdhttp.ErrUseLastResponse
		},
	}
	req, err := stdhttp.NewRequest("GET", baseURL+"/hosts/"+hostID+"/jobs", nil)
	if err != nil {
		t.Fatalf("new request: %v", err)
	}
	req.AddCookie(cookie)
	res, err := client.Do(req)
	if err != nil {
		t.Fatalf("GET /hosts/%s/jobs: %v", hostID, err)
	}
	defer res.Body.Close()
	if res.StatusCode != stdhttp.StatusOK {
		t.Fatalf("GET /hosts/%s/jobs: want 200, got %d", hostID, res.StatusCode)
	}
	raw, _ := io.ReadAll(res.Body)
	return string(raw)
}
@@ -0,0 +1,88 @@
|
||||
// ui_host_mode_test.go — covers handleUIHostModeSave: toggling a
// host's always-on flag via POST /hosts/{id}/mode.
package http

import (
	"context"
	stdhttp "net/http"
	"net/url"
	"strings"
	"testing"
)

// TestHostModeSaveToggle verifies the checkbox-absent ⇒ intermittent
// and checkbox-present ⇒ always-on semantics, and that the audit row
// lands for each request.
func TestHostModeSaveToggle(t *testing.T) {
	t.Parallel()
	_, ts, st := rawTestServerWithUI(t)
	hostID, _ := enrolHostForUI(t, nil, st, "mode-toggle-host")

	cookie := loginAsAdmin(t, st)

	cli := &stdhttp.Client{
		CheckRedirect: func(*stdhttp.Request, []*stdhttp.Request) error {
			return stdhttp.ErrUseLastResponse
		},
	}

	// --- POST with no always_on field => intermittent ---
	form := url.Values{}
	req, _ := stdhttp.NewRequest("POST", ts.URL+"/hosts/"+hostID+"/mode",
		strings.NewReader(form.Encode()))
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	req.AddCookie(cookie)
	res, err := cli.Do(req)
	if err != nil {
		t.Fatalf("do: %v", err)
	}
	_ = res.Body.Close()
	if res.StatusCode != stdhttp.StatusSeeOther {
		t.Fatalf("status: got %d, want 303", res.StatusCode)
	}
	if loc := res.Header.Get("Location"); loc != "/hosts/"+hostID {
		t.Errorf("Location: got %q, want /hosts/%s", loc, hostID)
	}

	got, err := st.GetHost(context.Background(), hostID)
	if err != nil {
		t.Fatalf("GetHost: %v", err)
	}
	if got.AlwaysOn {
		t.Errorf("AlwaysOn after empty form: got true, want false")
	}

	// --- POST with always_on=on => always-on ---
	form2 := url.Values{"always_on": {"on"}}
	req2, _ := stdhttp.NewRequest("POST", ts.URL+"/hosts/"+hostID+"/mode",
		strings.NewReader(form2.Encode()))
	req2.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	req2.AddCookie(cookie)
	res2, err := cli.Do(req2)
	if err != nil {
		t.Fatalf("do: %v", err)
	}
	_ = res2.Body.Close()
	if res2.StatusCode != stdhttp.StatusSeeOther {
		t.Fatalf("status: got %d, want 303", res2.StatusCode)
	}

	got2, err := st.GetHost(context.Background(), hostID)
	if err != nil {
		t.Fatalf("GetHost: %v", err)
	}
	if !got2.AlwaysOn {
		t.Errorf("AlwaysOn after always_on=on: got false, want true")
	}

	// Audit rows must exist (one per request).
	var n int
	if err := st.DB().QueryRow(
		`SELECT COUNT(*) FROM audit_log WHERE action = 'host.mode_updated' AND target_id = ?`,
		hostID).Scan(&n); err != nil {
		t.Fatalf("count audit: %v", err)
	}
	if n != 2 {
		t.Errorf("audit rows: got %d, want 2", n)
	}
}
@@ -1,9 +1,12 @@
|
||||
package http
|
||||
|
||||
import (
|
||||
"context"
|
||||
"encoding/json"
|
||||
"errors"
|
||||
"html/template"
|
||||
"log/slog"
|
||||
"math"
|
||||
stdhttp "net/http"
|
||||
"strconv"
|
||||
"strings"
|
||||
@@ -13,6 +16,7 @@ import (
|
||||
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ui"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/web/sparkline"
|
||||
)
|
||||
|
||||
// ui_repo.go — HTML form-driven repo-tab handlers (connection,
|
||||
@@ -27,6 +31,15 @@ import (
|
||||
// POST /hosts/{id}/admin-credentials — admin (prune) creds
|
||||
// POST /hosts/{id}/admin-credentials/delete — clear admin creds
|
||||
|
||||
// repoTrendView is the data the repo_size_chart partial needs.
|
||||
// HostID + Range round-trip through the htmx range pills; ChartSVG
|
||||
// is pre-rendered server-side so the partial is just a wrapper.
|
||||
type repoTrendView struct {
|
||||
HostID string
|
||||
Range string
|
||||
ChartSVG template.HTML
|
||||
}
|
||||
|
||||
// repoStatsView is a flat, pre-dereferenced projection of
|
||||
// store.HostRepoStats for use in templates. Nil pointer fields are
|
||||
// collapsed to zero/false and accompanied by a Has* sentinel so the
|
||||
@@ -74,6 +87,10 @@ type hostRepoPage struct {
|
||||
// Nil when no row exists yet (fresh hosts).
|
||||
StatsView *repoStatsView
|
||||
|
||||
// Trend holds the pre-rendered chart fragment data for the
|
||||
// 30/90/365-day repo-size + snapshot-count overlay chart.
|
||||
Trend repoTrendView
|
||||
|
||||
// Snapshots-by-tag — map[group_name]count, plus an "untagged" row.
|
||||
SnapshotsByTag map[string]int
|
||||
UntaggedSnapshots int
|
||||
@@ -225,9 +242,52 @@ func (s *Server) loadHostRepoPage(r *stdhttp.Request, host store.Host) (*hostRep
|
||||
}
|
||||
}
|
||||
}
|
||||
p.Trend = s.buildRepoTrendView(r.Context(), host.ID, "30d")
|
||||
|
||||
return p, nil
|
||||
}
|
||||
|
||||
// buildRepoTrendView builds the chart-partial data for a host. Used
|
||||
// both by the page-load (initial 30d render) and the htmx fragment
|
||||
// endpoint (range switching). An invalid rangeKey falls back to "30d".
|
||||
func (s *Server) buildRepoTrendView(ctx context.Context, hostID, rangeKey string) repoTrendView {
|
||||
days := 30
|
||||
switch rangeKey {
|
||||
case "90d":
|
||||
days = 90
|
||||
case "1y":
|
||||
days = 365
|
||||
default:
|
||||
rangeKey = "30d"
|
||||
}
|
||||
since := time.Now().UTC().AddDate(0, 0, -days)
|
||||
pts, err := s.deps.Store.ListHostRepoStatsHistory(ctx, hostID, since)
|
||||
if err != nil {
|
||||
slog.Warn("ui repo trend: list history", "host_id", hostID, "err", err)
|
||||
}
|
||||
sizes := make([]float64, len(pts))
|
||||
counts := make([]float64, len(pts))
|
||||
dayList := make([]time.Time, len(pts))
|
||||
for i, p := range pts {
|
||||
dayList[i] = p.Day
|
||||
if p.TotalSizeBytes == nil {
|
||||
sizes[i] = math.NaN()
|
||||
} else {
|
||||
sizes[i] = float64(*p.TotalSizeBytes)
|
||||
}
|
||||
if p.SnapshotCount == nil {
|
||||
counts[i] = math.NaN()
|
||||
} else {
|
||||
counts[i] = float64(*p.SnapshotCount)
|
||||
}
|
||||
}
|
||||
chartSVG := sparkline.RenderChart([]sparkline.Series{
|
||||
{Name: "size", Stroke: "#3b82f6", Axis: sparkline.AxisLeft, Format: sparkline.FormatBytes, Points: sizes},
|
||||
{Name: "snapshots", Stroke: "#f59e0b", Axis: sparkline.AxisRight, Format: sparkline.FormatCount, Points: counts},
|
||||
}, dayList, sparkline.ChartOpts{Width: 640, Height: 220})
|
||||
return repoTrendView{HostID: hostID, Range: rangeKey, ChartSVG: chartSVG}
|
||||
}
|
||||
|
||||
func (s *Server) handleUIHostRepo(w stdhttp.ResponseWriter, r *stdhttp.Request) {
|
||||
u := s.requireUIUser(w, r)
|
||||
if u == nil {
|
||||
|
||||
@@ -0,0 +1,25 @@
|
||||
// ui_repo_trend.go — htmx fragment endpoint for the repo-page
|
||||
// trend chart. Returns just the chart partial wrapped in
|
||||
// <div id="repo-trend-chart"> so htmx can outerHTML-swap it.
|
||||
//
|
||||
// GET /hosts/{id}/repo/trend?range=30d|90d|1y
|
||||
package http
|
||||
|
||||
import (
|
||||
stdhttp "net/http"
|
||||
|
||||
"github.com/go-chi/chi/v5"
|
||||
)
|
||||
|
||||
func (s *Server) handleUIRepoTrend(w stdhttp.ResponseWriter, r *stdhttp.Request) {
|
||||
u := s.requireUIUser(w, r)
|
||||
if u == nil {
|
||||
return
|
||||
}
|
||||
hostID := chi.URLParam(r, "id")
|
||||
view := s.baseView(r, u)
|
||||
view.Page = s.buildRepoTrendView(r.Context(), hostID, r.URL.Query().Get("range"))
|
||||
if err := s.deps.UI.RenderPartial(w, "repo_size_chart", view); err != nil {
|
||||
stdhttp.Error(w, "internal", stdhttp.StatusInternalServerError)
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,123 @@
|
||||
package http
|
||||
|
||||
import (
|
||||
"context"
|
||||
stdhttp "net/http"
|
||||
"strings"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
|
||||
)
|
||||
|
||||
func getTrend(t *testing.T, baseURL, hostID, rangeKey string, cookie *stdhttp.Cookie) string {
|
||||
t.Helper()
|
||||
client := &stdhttp.Client{
|
||||
CheckRedirect: func(_ *stdhttp.Request, _ []*stdhttp.Request) error {
|
||||
return stdhttp.ErrUseLastResponse
|
||||
},
|
||||
}
|
||||
url := baseURL + "/hosts/" + hostID + "/repo/trend"
|
||||
if rangeKey != "" {
|
||||
url += "?range=" + rangeKey
|
||||
}
|
||||
req, err := stdhttp.NewRequest("GET", url, nil)
|
||||
if err != nil {
|
||||
t.Fatalf("new request: %v", err)
|
||||
}
|
||||
req.AddCookie(cookie)
|
||||
res, err := client.Do(req)
|
||||
if err != nil {
|
||||
t.Fatalf("GET %s: %v", url, err)
|
||||
}
|
||||
defer res.Body.Close()
|
||||
if res.StatusCode != stdhttp.StatusOK {
|
||||
t.Fatalf("GET %s: want 200, got %d", url, res.StatusCode)
|
||||
}
|
||||
body := make([]byte, 0, 1<<20)
|
||||
buf := make([]byte, 4096)
|
||||
for {
|
||||
n, rerr := res.Body.Read(buf)
|
||||
body = append(body, buf[:n]...)
|
||||
if rerr != nil {
|
||||
break
|
||||
}
|
||||
}
|
||||
return string(body)
|
||||
}
|
||||
|
||||
func TestUIRepoTrend_30dRange(t *testing.T) {
|
||||
t.Parallel()
|
||||
_, baseURL, st := newTestServerWithUI(t)
|
||||
cookie := loginAsAdmin(t, st)
|
||||
hostID := makeHost(t, st, "h-trend")
|
||||
ctx := context.Background()
|
||||
|
||||
now := time.Now().UTC()
|
||||
for i := 0; i < 5; i++ {
|
||||
day := now.AddDate(0, 0, -i).Format("2006-01-02")
|
||||
v := int64(1000 + i*100)
|
||||
c := int64(10 + i)
|
||||
if err := st.UpsertHostRepoStatsHistory(ctx, hostID, day,
|
||||
store.HostRepoStats{TotalSizeBytes: &v, SnapshotCount: &c}, now); err != nil {
|
||||
t.Fatalf("seed %s: %v", day, err)
|
||||
}
|
||||
}
|
||||
|
||||
body := getTrend(t, baseURL, hostID, "30d", cookie)
|
||||
if !strings.Contains(body, `class="repo-trend-chart"`) {
|
||||
t.Errorf("expected repo-trend-chart SVG in fragment")
|
||||
}
|
||||
if !strings.Contains(body, `id="repo-trend-chart"`) {
|
||||
t.Errorf("expected outer wrapper id=repo-trend-chart")
|
||||
}
|
||||
if !strings.Contains(body, `data-range="30d"`) {
|
||||
t.Errorf("expected data-range=30d")
|
||||
}
|
||||
}
|
||||
|
||||
func TestUIRepoTrend_InvalidRangeFallsBackTo30d(t *testing.T) {
|
||||
t.Parallel()
|
||||
_, baseURL, st := newTestServerWithUI(t)
|
||||
cookie := loginAsAdmin(t, st)
|
||||
hostID := makeHost(t, st, "h-trend2")
|
||||
|
||||
body := getTrend(t, baseURL, hostID, "banana", cookie)
|
||||
if !strings.Contains(body, `data-range="30d"`) {
|
||||
t.Errorf("expected data-range=30d on invalid range fallback")
|
||||
}
|
||||
}
|
||||
|
||||
// TestUIRepoPageRendersTrendPanel — full-page render path: seed 3
|
||||
// history rows, fetch /hosts/{id}/repo, assert the Trend panel with
|
||||
// SVG chart ID, class, and heading text appear embedded in the page.
|
||||
func TestUIRepoPageRendersTrendPanel(t *testing.T) {
|
||||
t.Parallel()
|
||||
_, baseURL, st := newTestServerWithUI(t)
|
||||
cookie := loginAsAdmin(t, st)
|
||||
hostID := makeHost(t, st, "h-trend-page")
|
||||
ctx := context.Background()
|
||||
|
||||
now := time.Now().UTC()
|
||||
for i := 0; i < 3; i++ {
|
||||
day := now.AddDate(0, 0, -i).Format("2006-01-02")
|
||||
v := int64(2000 + i*200)
|
||||
c := int64(20 + i)
|
||||
if err := st.UpsertHostRepoStatsHistory(ctx, hostID, day,
|
||||
store.HostRepoStats{TotalSizeBytes: &v, SnapshotCount: &c}, now); err != nil {
|
||||
t.Fatalf("seed %s: %v", day, err)
|
||||
}
|
||||
}
|
||||
|
||||
body := getRepoPage(t, baseURL, hostID, cookie)
|
||||
|
||||
if !strings.Contains(body, `id="repo-trend-chart"`) {
|
||||
t.Errorf("expected id=\"repo-trend-chart\" in full-page render")
|
||||
}
|
||||
if !strings.Contains(body, `class="repo-trend-chart"`) {
|
||||
t.Errorf("expected class=\"repo-trend-chart\" in full-page render")
|
||||
}
|
||||
if !strings.Contains(body, ">Trend<") {
|
||||
t.Errorf("expected panel heading '>Trend<' in full-page render")
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,20 @@
|
||||
package http
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
stdhttp "net/http"
|
||||
|
||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
|
||||
)
|
||||
|
||||
// handleVersion exposes the server's build-time identifying constants
|
||||
// (set via -ldflags). Public-band: no secrets surface here. The agent
|
||||
// updater compares its own agent_version byte-for-byte against the
|
||||
// Version field to drive the "out of date" signal.
|
||||
func (s *Server) handleVersion(w stdhttp.ResponseWriter, r *stdhttp.Request) {
|
||||
w.Header().Set("Content-Type", "application/json")
|
||||
_ = json.NewEncoder(w).Encode(map[string]string{
|
||||
"version": version.Version,
|
||||
"commit": version.Commit,
|
||||
})
|
||||
}
|
||||
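The comment on `handleVersion` says the agent updater compares its own version byte-for-byte against the `Version` field. A rough client-side sketch of that check is shown below. Everything beyond the two JSON keys is an assumption: the `/version` path, the `versionInfo` type, and the `fetchVersion` and `outOfDate` helpers are hypothetical, and a stub server stands in for the real one.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// versionInfo mirrors the two keys the handler encodes.
type versionInfo struct {
	Version string `json:"version"`
	Commit  string `json:"commit"`
}

// outOfDate is a plain string comparison: any difference, including a
// newer agent, counts as "out of date".
func outOfDate(agentVersion string, server versionInfo) bool {
	return agentVersion != server.Version
}

// fetchVersion reads the version document from baseURL. The "/version"
// path is an assumption; the route registration is not in this diff.
func fetchVersion(baseURL string) (versionInfo, error) {
	var v versionInfo
	res, err := http.Get(baseURL + "/version")
	if err != nil {
		return v, err
	}
	defer res.Body.Close()
	if res.StatusCode != http.StatusOK {
		return v, fmt.Errorf("unexpected status %d", res.StatusCode)
	}
	err = json.NewDecoder(res.Body).Decode(&v)
	return v, err
}

func main() {
	// Stub server returning made-up values, for illustration only.
	ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(map[string]string{
			"version": "v1.0.0",
			"commit":  "abc1234",
		})
	}))
	defer ts.Close()

	v, err := fetchVersion(ts.URL)
	if err != nil {
		fmt.Println("fetch:", err)
		return
	}
	fmt.Println("agent v0.9.0 out of date:", outOfDate("v0.9.0", v))
}
```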
Some files were not shown because too many files have changed in this diff.