Compare commits: 1af02f4495...a3f134bcd6 (6 commits)

| SHA1 |
|---|
| a3f134bcd6 |
| 17b9ee08b7 |
| 89537d417a |
| a252b25854 |
| 73e733be61 |
| 70ff554402 |

@@ -0,0 +1,32 @@
<!--
Thanks for the PR! A few quick checks before submitting:

* Did you open an issue first for non-trivial changes?
* `make lint test` is green locally?
* Commits are focused (one logical change per commit)?
* No `Co-Authored-By` trailers (repo policy)?
* No new dependencies without a one-line justification below?
-->

## Summary

<!-- One paragraph: what changed and why. -->

## Test plan

<!-- Bullet list of what you actually ran. Be specific.
- `make test` → green
- Manually exercised the new flow at /hosts/{id}/foo
- Smoke env: enrolled a fresh host, ran a backup end-to-end
-->

## Notes for the reviewer

<!-- Anything the reviewer needs to know that isn't obvious from the
diff: related issue, follow-up work that's intentionally not
in this PR, deferred concerns, design alternatives considered
and rejected. -->

## Linked issues

<!-- "Closes #123" / "Refs #456" / "Part of P5-06" -->

@@ -0,0 +1,52 @@
---
name: Bug report
about: Something isn't behaving the way the docs / code suggest it should
title: "[bug] "
labels: bug
---

## What happened

<!-- A clear description of the actual behaviour. Include the exact
UI surface, API endpoint, or CLI invocation involved. -->

## What you expected

<!-- What you thought would happen, and where that expectation came from
(docs page, command output, prior behaviour). -->

## Steps to reproduce

1.
2.
3.

## Environment

- restic-manager server version: <!-- `restic-manager-server --version` or footer of the UI -->
- Agent version (if relevant): <!-- `restic-manager-agent --version` -->
- restic version on affected host: <!-- `restic version` -->
- Host OS: <!-- e.g. "Ubuntu 22.04 amd64" or "Windows Server 2022" -->
- How was the server installed: <!-- docker compose / source build / other -->

## Logs / output

<details><summary>Server log (sanitised)</summary>

```
<!-- paste relevant lines; redact tokens, passwords, repo URLs -->
```

</details>

<details><summary>Agent log (sanitised)</summary>

```
```

</details>

## Anything else

<!-- Screenshots, related issues, recent changes you made before the
bug appeared, anything that might help. -->

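The bug-report template above asks reporters to redact tokens, passwords, and repo URLs before pasting logs. A minimal sketch of a first-pass filter follows. It assumes secrets look like the 40+ character URL-safe strings that the e2e workflow greps for, and that credentials show up in `user:pass@host` URL form. `redact_log` is a hypothetical helper, not something restic-manager ships, and its output still needs a manual read-through before posting.

```sh
#!/bin/sh
# redact_log: hypothetical helper, not part of restic-manager.
# Reads log text on stdin and writes a redacted copy to stdout.
redact_log() {
  # 1st expression: blank out "user:pass" between "://" and "@" in URLs.
  # 2nd expression: blank out any run of 40+ URL-safe characters.
  sed -E \
    -e 's#(://)[^/@[:space:]]+@#\1[REDACTED]@#g' \
    -e 's/[A-Za-z0-9_-]{40,}/[REDACTED]/g'
}

# Example input: a made-up log line with credentials embedded in a URL.
printf '%s\n' 'repo=rest:https://user:s3cret@backup.example.com/host1' | redact_log
```

The length-based pattern is deliberately broad, so it may also hide long hashes or IDs that are harmless; that seems the safer direction for a public issue tracker.
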
@@ -0,0 +1,34 @@
---
name: Feature request
about: Suggest a new capability or change to existing behaviour
title: "[feature] "
labels: enhancement
---

## What you're trying to do

<!-- Describe the use case, not the proposed solution. Who is the
operator, what are they trying to accomplish, and what's
blocking them today? -->

## Why the current behaviour falls short

<!-- What does the system do today, and where does it stop short of
the use case above? -->

## Proposed direction (optional)

<!-- If you have a specific design in mind, describe it. Skip this
section if you'd rather leave it to the maintainer. -->

## Scope check

- [ ] I've read [`spec.md`](../spec.md) §2 (Goals & Non-Goals).
- [ ] This isn't already on the roadmap in [`tasks.md`](../tasks.md).
- [ ] This fits the project's "small fleet, one person operating"
      target rather than enterprise / multi-tenant / SaaS use cases.

## Anything else

<!-- Related restic features, prior art in similar tools, links to
discussions you've had elsewhere. -->

@@ -0,0 +1,98 @@
# P5-06 — End-to-end test suite.
#
# Spec : docs/superpowers/specs/2026-05-07-p5-oss-readiness-design.md
# Stack: e2e/compose.e2e.yml (server + agent + rest-server + playwright)
# Tests: e2e/playwright/tests/*.spec.ts
#
# Triggered on every PR into main and on workflow_dispatch. Runs
# longer than the unit-test workflow (~3-4 minutes for a clean run);
# kept separate so a slow e2e doesn't block the fast lint/test loop.
#
# Networking note: every interaction with the server (health probe,
# Playwright) happens from a container on the compose `rmnet`
# network, addressing the server as `http://server:8080`. We can't
# rely on `127.0.0.1:8080` because Gitea's runner executes steps
# inside its own container, where compose's host port-publish is
# not visible.

name: e2e

on:
  pull_request:
    branches: [main]
  workflow_dispatch:

jobs:
  e2e:
    name: Playwright vs docker-compose
    runs-on: ubuntu-latest
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v4

      - name: Build the e2e stack
        run: docker compose -f e2e/compose.e2e.yml build

      - name: Bring up the stack
        run: docker compose -f e2e/compose.e2e.yml up -d server rest-server source-fixture

      - name: Wait for server health
        run: |
          set -eu
          for i in $(seq 1 30); do
            if docker run --rm --network e2e_rmnet curlimages/curl:8.10.1 \
                -fsS http://server:8080/api/version >/dev/null 2>&1; then
              echo "server up"; exit 0
            fi
            sleep 2
          done
          echo "server didn't come up"; docker compose -f e2e/compose.e2e.yml logs server; exit 1

      - name: Capture bootstrap token from server logs
        id: bootstrap
        run: |
          set -eu
          for i in $(seq 1 15); do
            line=$(docker compose -f e2e/compose.e2e.yml logs server 2>&1 | grep -E 'bootstrap token' -A2 | grep -Eo '[a-zA-Z0-9_-]{40,}' | head -1 || true)
            if [ -n "$line" ]; then
              echo "RM_BOOTSTRAP_TOKEN=$line" >> "$GITHUB_ENV"
              echo "got bootstrap token (${#line} chars)"
              exit 0
            fi
            sleep 1
          done
          echo "bootstrap token not found in logs"
          docker compose -f e2e/compose.e2e.yml logs server
          exit 1

      - name: Start the agent
        run: docker compose -f e2e/compose.e2e.yml up -d agent

      - name: Prepare report mounts
        run: |
          mkdir -p e2e/playwright/playwright-report e2e/playwright/test-results
          chmod -R a+rwX e2e/playwright/playwright-report e2e/playwright/test-results

      - name: Run Playwright tests
        env:
          RM_BOOTSTRAP_TOKEN: ${{ env.RM_BOOTSTRAP_TOKEN }}
        run: docker compose -f e2e/compose.e2e.yml run --rm playwright

      - name: Compose logs (on failure)
        if: failure()
        run: |
          docker compose -f e2e/compose.e2e.yml logs --tail=200 server
          docker compose -f e2e/compose.e2e.yml logs --tail=200 agent
          docker compose -f e2e/compose.e2e.yml logs --tail=200 rest-server

      - name: Upload Playwright report (on failure)
        if: failure()
        uses: actions/upload-artifact@v3
        with:
          name: playwright-report
          path: e2e/playwright/playwright-report
          retention-days: 7

      - name: Tear down
        if: always()
        run: docker compose -f e2e/compose.e2e.yml down -v

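The "Capture bootstrap token" step in the workflow above chains two greps: the first keeps the marker line plus the two lines after it, the second pulls out the first run of 40 or more URL-safe characters. A small sketch of that pipeline as a standalone function may make the matching behaviour easier to see. The function name `extract_bootstrap_token` and the sample log text are assumptions made for illustration; the real server's log format is not shown in this diff.

```sh
#!/bin/sh
# extract_bootstrap_token: hypothetical wrapper around the same grep
# pipeline the workflow step uses. Reads log text on stdin and prints
# the first 40+ character URL-safe string found on a "bootstrap token"
# line or within the two lines that follow it. Prints nothing if no
# such string exists.
extract_bootstrap_token() {
  grep -E -A2 'bootstrap token' | grep -Eo '[a-zA-Z0-9_-]{40,}' | head -1 || true
}

# Invented log excerpt, for illustration only.
sample_log='level=INFO msg="server listening"
level=WARN msg="bootstrap token (shown once)"
  Zx9_kQ2-mN7vB4cT1yR8pL5wJ3hF6dS0aG9uE2iO7tK4
level=INFO msg="ready"'

printf '%s\n' "$sample_log" | extract_bootstrap_token
```

Because the match is on length rather than on a fixed prefix, any other 40+ character run that happens to sit within two lines of the marker would be picked up as well, which is worth keeping in mind if the log layout around the token ever changes.
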
@@ -2,6 +2,10 @@
 /bin/
 /dist/
 
+# Generated mdBook output (source under docs/book/src is committed,
+# the rendered book/ directory is not).
+/docs/book/book/
+
 # Local data / runtime state
 /data/
 /certs/

@@ -0,0 +1,69 @@
# Code of Conduct

restic-manager is a small project run by one person. This Code of
Conduct sets out the basic expectations for participating in the
project's issue tracker, pull requests, and any other community
spaces (chat, mailing lists) we may run in future.

## Expected behaviour

- **Be civil.** Disagreement is fine; rudeness is not. The same
  comment can usually be made without making it personal.
- **Assume good faith.** People asking what feels like a basic
  question may be new to the project. People proposing what feels
  like a duplicate idea may not have seen the prior discussion.
  Point them to the right place politely.
- **Stay on topic.** Issue threads are for the issue. Tangential
  conversations belong in their own thread.
- **Acknowledge the project's scope.** restic-manager is
  intentionally small in scope (see `spec.md` §2). Reasonable
  feature suggestions may still be declined for fit reasons.

## Unacceptable behaviour

- Harassment, threats, or insults — public or private.
- Discriminatory comments based on age, body size, disability,
  ethnicity, gender identity or expression, level of experience,
  nationality, personal appearance, race, religion, sexual identity
  or orientation.
- Sustained disruption — derailing threads, ignoring repeated
  requests to take a discussion elsewhere, brigading.
- Publishing other people's private information without permission.

## Reporting

If someone in the project's spaces is behaving in a way that
breaches this Code of Conduct, contact the maintainer directly
through the contact details on their Gitea profile, or via the
private security disclosure path documented in
[SECURITY.md](./SECURITY.md). Reports stay confidential.

The maintainer will review the report, gather context if needed,
and respond. Possible outcomes include a private warning, a public
clarification of expectations, a temporary or permanent ban from
project spaces, or no action if the report doesn't hold up.

There is no formal appeals process — this is a one-person project,
not a foundation. If you think a decision was wrong you can say
so, in writing, to the maintainer; that's it.

## Scope

This Code of Conduct applies to interactions in any space the
project owns or operates: the Gitea repository (issues, pull
requests, discussions, wiki), any chat channels we publish, and
any conferences or events the project is officially represented at.

It does not apply to:

- Forks of the project that aren't being submitted back upstream.
- Conversations between contributors that don't reference the
  project.
- Public criticism of the project itself.

## Acknowledgement

This document borrows shape and language from the
[Contributor Covenant](https://www.contributor-covenant.org/) v2.1
but is intentionally shorter and adapted to the project's
single-maintainer reality.

+159
-21
@@ -1,30 +1,168 @@
-# Contributing
+# Contributing to restic-manager
 
-Thanks for your interest in contributing to restic-manager.
+Thanks for your interest in restic-manager. This document covers how
+to set up a development environment, the conventions the project
+follows, and how patches make it from your machine into `main`.
 
-> This is a placeholder. The project is in pre-alpha (Phase 1 / MVP). A
-> full contributor guide will land alongside the Phase 5 OSS-readiness
-> work — see [`tasks.md`](./tasks.md) P5-02. Until then the notes below
-> apply.
+## Project status and scope
 
-## Before opening a PR
+restic-manager is in pre-1.0. Core functionality (Phases 0–4) is
+landed; OSS-readiness polish is in progress. The top of
+[`tasks.md`](./tasks.md) tracks what's next; [`spec.md`](./spec.md)
+is the canonical design doc and the source of truth for any
+"why is it built this way" question.
 
-1. Open an issue first for non-trivial changes — the design is still
-   moving (see [`spec.md`](./spec.md)) and unsolicited large PRs may
-   conflict with in-flight work.
-2. `make lint test` should pass.
-3. Match the existing code style — `gofumpt`, `goimports`, no comments
-   that just restate what the code does.
-4. Keep commits focused; one logical change per commit.
+The project is **single-maintainer, hobbyist-scale, and licensed
+under [PolyForm Noncommercial 1.0.0](./LICENSE)**. That has two
+practical implications:
 
-## Reporting security issues
+1. Big PRs without prior discussion may be declined for fit
+   reasons even when they're correct — opening an issue first lets
+   us check alignment cheaply.
+2. Commercial use is not permitted by the license. Bug reports and
+   patches from operators of personal/community deployments are
+   very welcome.
 
-Please do **not** open a public issue for security problems. A
-`SECURITY.md` with a private disclosure path will be added in Phase 5
-(P5-05). Until then, contact the repository owner directly via the
-contact details on their gitea profile.
+## Getting started
+
+### Prerequisites
+
+- Go 1.25 or newer (`go.mod` is the source of truth)
+- `make`
+- For the front-end CSS bundle: nothing extra — `make build`
+  downloads a pinned `tailwindcss` standalone binary into `bin/`.
+- For the docs site: nothing extra — `make docs` does the same trick
+  with `mdbook`.
+- For end-to-end tests: Docker + Docker Compose, plus `npx` for
+  Playwright.
+
+### One-time setup
+
+```sh
+git clone https://gitea.dcglab.co.uk/steve/restic-manager.git
+cd restic-manager
+make build      # compiles bin/restic-manager-{server,agent}
+make test       # full unit + integration test sweep
+make lint       # gofumpt + goimports + golangci-lint
+```
+
+### Running locally
+
+For most development, the [smoke environment](./docs/e2e-smoke.md)
+is the path of least resistance:
+
+```sh
+make smoke-restart   # rebuilds, launches as a systemd --user unit
+make smoke-logs      # tail of the server log
+```
+
+Then point a browser at `http://127.0.0.1:8080`. The first run
+prints a one-time bootstrap token to the log; use it to create the
+admin user.
+
+## Code conventions
+
+### Style
+
+- `gofumpt` for formatting; `goimports` for import grouping.
+  Both run via the pre-commit hook in this repo.
+- `golangci-lint` with `.golangci.yml` defaults; CI rejects on lint
+  errors.
+- UK English in identifiers, comments, log messages, and UI strings
+  (the misspell linter is configured for the UK locale — see
+  P3-X5 for the original sweep).
+- Comments explain **why**, not what; avoid restating the code.
+  A surprising invariant or an external constraint is worth
+  writing down. "Adds 1 to x" is not.
+- `slog` for structured logs. Never log secrets — and especially
+  never the merged-creds rest-server URL (see [`CLAUDE.md`](./CLAUDE.md)).
+
+### File and package layout
+
+- `cmd/server` and `cmd/agent` are the two binary entry points.
+- `internal/` holds everything that's not part of the public Go
+  API (which is none of it — restic-manager isn't a library).
+- Per-feature packages live under `internal/server/...` for the
+  control plane and `internal/agent/...` for the agent.
+- `web/templates/` are HTML templates rendered with the standard
+  library; embedded via `web.FS`.
+
+### Tests
+
+- Unit tests live alongside the code as `*_test.go`. Use the
+  in-process sqlite store (`store.Open(":memory:")`) when you need
+  state — there is no test mock layer to maintain.
+- HTTP handlers test through `httptest.NewServer` against the real
+  router; see `internal/server/http/auth_test.go` for the canonical
+  fixture pattern.
+- End-to-end tests live in `e2e/` and run against a Docker Compose
+  stack. See [`docs/e2e.md`](./docs/e2e.md).
+
+### Database migrations
+
+- Migrations are hand-rolled SQL in `internal/store/migrations/`
+  and embedded via `embed.FS`.
+- Prefer column-level `ALTER TABLE` over rebuilds — see
+  [`CLAUDE.md`](./CLAUDE.md) "Migrations" section for the FK-cascade
+  trap that bit migration 0007's first draft.
+
+## Workflow
+
+### Before opening a PR
+
+1. **Open an issue first** for non-trivial changes. The design is
+   still moving; an issue lets us agree on direction cheaply.
+2. Run `make lint test` locally — both must pass.
+3. Match existing code style (see above).
+4. Keep commits focused: one logical change per commit. Imperative
+   subject lines, body explaining why if it isn't obvious.
+5. Don't add `Co-Authored-By` trailers — repo policy. If you used
+   AI assistance in writing the patch, that's fine; we just don't
+   pollute every commit message with attribution boilerplate.
+
+### Pull requests
+
+PRs target `main`. CI runs lint + tests on Linux amd64/arm64 and
+Windows amd64; all three must be green to merge. Squash-merge is
+the default; the PR title becomes the merge-commit subject, so
+keep it short and informative.
+
+The PR template asks for:
+
+- A short description of what changed and why.
+- A test plan (commands run, scenarios verified).
+- Anything reviewers need to know to assess the change (related
+  issue, follow-up work, deferred concerns).
+
+### Reporting bugs
+
+Open an issue with:
+
+- restic-manager version (`server --version`) and agent version.
+- restic version on the affected host.
+- Steps to reproduce.
+- Server and agent logs (sanitise any tokens before pasting).
+
+Security-sensitive bugs go through the [SECURITY.md](./SECURITY.md)
+disclosure path instead — please don't open a public issue for
+them.
+
+### Suggesting features
+
+Open an issue describing the use case (not just the proposed
+solution). The roadmap in `tasks.md` shows where the project is
+heading; if the suggestion fits a future phase we'll wire it in
+there. If it falls outside the project's scope (multi-tenancy, SaaS,
+non-restic backends — see `spec.md` §2 non-goals) we'll say so
+early to save your time.
+
+## Code of conduct
+
+Project participation is governed by [CODE_OF_CONDUCT.md](./CODE_OF_CONDUCT.md).
+The short version: be civil; assume good faith; harassment is not
+tolerated.
 
 ## License
 
-By contributing you agree that your contributions are licensed under
-the [PolyForm Noncommercial 1.0.0](./LICENSE) license.
+By contributing you agree that your contributions are licensed
+under the [PolyForm Noncommercial 1.0.0](./LICENSE) license.

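The contributing guide above rules out `Co-Authored-By` trailers as repo policy. One way a contributor might catch them locally is a check on the commit message, sketched below. This is an illustration only: `has_coauthor_trailer` is a hypothetical function, and nothing in this diff says the repo's own pre-commit hook does this.

```sh
#!/bin/sh
# has_coauthor_trailer: hypothetical check, not the repo's actual hook.
# Reads a commit message on stdin. Exits 0 if any line starts with a
# Co-Authored-By trailer (matched case-insensitively), 1 otherwise.
has_coauthor_trailer() {
  grep -qiE '^Co-Authored-By:'
}

# Possible use inside .git/hooks/commit-msg, where git passes the path
# of the message file as $1:
#
#   if has_coauthor_trailer < "$1"; then
#     echo "Co-Authored-By trailers are not allowed (repo policy)" >&2
#     exit 1
#   fi
```

Matching case-insensitively covers the `Co-authored-by:` spelling that some tools emit as well as the capitalised form used in the guide.
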
@@ -24,7 +24,18 @@ TAILWIND_URL := https://github.com/tailwindlabs/tailwindcss/releases/downlo
 TAILWIND_INPUT := web/styles/input.css
 TAILWIND_OUTPUT := web/static/css/styles.css
 
-.PHONY: help build server agent test test-race lint fmt tidy clean run-server run-agent docker release tailwind tailwind-watch setup hooks smoke-restart smoke-stop smoke-status smoke-logs smoke-deploy
+# mdBook for the docs site (P5-01). Single static binary, no
+# Rust toolchain — same pattern as Tailwind.
+MDBOOK_VERSION ?= v0.4.51
+MDBOOK_OS := $(shell uname -s | tr A-Z a-z)
+MDBOOK_TRIPLE := $(shell uname -m)-unknown-$(if $(filter darwin,$(MDBOOK_OS)),apple-darwin,linux-gnu)
+MDBOOK_BIN := $(BIN_DIR)/mdbook
+MDBOOK_TARBALL := mdbook-$(MDBOOK_VERSION)-$(MDBOOK_TRIPLE).tar.gz
+MDBOOK_URL := https://github.com/rust-lang/mdBook/releases/download/$(MDBOOK_VERSION)/$(MDBOOK_TARBALL)
+DOCS_BOOK_DIR := docs/book
+DOCS_BOOK_OUT := $(DOCS_BOOK_DIR)/book
+
+.PHONY: help build server agent test test-race lint fmt tidy clean run-server run-agent docker release tailwind tailwind-watch docs docs-watch setup hooks smoke-restart smoke-stop smoke-status smoke-logs smoke-deploy
 
 # ---- smoke-env tooling -------------------------------------------------
 # The smoke server runs as a transient user-systemd unit so it survives

@@ -60,6 +71,18 @@ tailwind-watch: $(TAILWIND_BIN) ## Watch and rebuild on every save
 	@mkdir -p $$(dirname $(TAILWIND_OUTPUT))
 	$(TAILWIND_BIN) -c tailwind.config.js -i $(TAILWIND_INPUT) -o $(TAILWIND_OUTPUT) --watch
 
+$(MDBOOK_BIN):
+	@mkdir -p $(BIN_DIR)
+	@echo "==> downloading mdbook $(MDBOOK_VERSION) ($(MDBOOK_TRIPLE))"
+	curl -fsSL "$(MDBOOK_URL)" | tar -xz -C $(BIN_DIR) mdbook
+	@chmod +x $@
+
+docs: $(MDBOOK_BIN) ## Build the docs/book/ mdBook site into docs/book/book/
+	$(MDBOOK_BIN) build $(DOCS_BOOK_DIR)
+
+docs-watch: $(MDBOOK_BIN) ## Serve the docs site at http://127.0.0.1:3000 with live reload
+	$(MDBOOK_BIN) serve $(DOCS_BOOK_DIR) -n 127.0.0.1 -p 3000
+
 agent: ## Build the agent binary
 	@mkdir -p $(BIN_DIR)
 	CGO_ENABLED=0 go build $(GOFLAGS) -ldflags "$(LDFLAGS)" -o $(AGENT_BIN) ./cmd/agent

@@ -90,7 +113,7 @@ tidy: ## go mod tidy
 	go mod tidy
 
 clean: ## Remove build artifacts
-	rm -rf $(BIN_DIR) coverage.out coverage.html $(TAILWIND_OUTPUT)
+	rm -rf $(BIN_DIR) coverage.out coverage.html $(TAILWIND_OUTPUT) $(DOCS_BOOK_OUT)
 
 run-server: server ## Build and run the server
 	$(SERVER_BIN)

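The Makefile changes above build the mdBook download name from `uname` output. A shell sketch that mirrors the `MDBOOK_TRIPLE` and `MDBOOK_TARBALL` expressions can help show what a given machine would try to fetch. The function names `mdbook_triple` and `mdbook_tarball` are made up for this illustration; the logic is a direct transcription of the Make expressions, not a recommended replacement.

```sh
#!/bin/sh
# mdbook_triple: mirrors the Makefile's MDBOOK_TRIPLE expression.
#   $1 = lowercased `uname -s` output (e.g. "linux", "darwin")
#   $2 = `uname -m` output (e.g. "x86_64", "arm64")
mdbook_triple() {
  os=$1
  arch=$2
  if [ "$os" = "darwin" ]; then
    suffix="apple-darwin"
  else
    suffix="linux-gnu"
  fi
  printf '%s-unknown-%s\n' "$arch" "$suffix"
}

# mdbook_tarball: mirrors MDBOOK_TARBALL for a version ($1) and triple ($2).
mdbook_tarball() {
  printf 'mdbook-%s-%s.tar.gz\n' "$1" "$2"
}

# Print the tarball name this machine would request for v0.4.51.
mdbook_tarball v0.4.51 "$(mdbook_triple "$(uname -s | tr A-Z a-z)" "$(uname -m)")"
```

Printing the resolved name makes it easy to compare against the asset list on the mdBook release page before relying on the download. That comparison seems most worthwhile on macOS and on non-x86_64 Linux, where the computed triple may not correspond to a published asset name.
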
@@ -1,36 +1,62 @@
 # restic-manager
 
 Self-hosted, browser-based, single-pane-of-glass for managing
-[restic](https://restic.net) backups across a fleet of Linux and Windows
-endpoints.
+[restic](https://restic.net) backups across a fleet of Linux and
+Windows endpoints.
 
-> Status: pre-alpha. Phase 0 (project bootstrap) complete; Phase 1 (MVP) in
-> progress. See [`spec.md`](./spec.md) for the design and
-> [`tasks.md`](./tasks.md) for the roadmap.
+> **Status:** pre-1.0, feature-complete for the original use
+> case. Phases 0–4 + 6 are landed (MVP, scheduling, restore,
+> RBAC + OIDC, observability); Phase 5 (OSS readiness — docs site,
+> contributor onboarding, end-to-end CI) is in flight. See
+> [`spec.md`](./spec.md) for the design and [`tasks.md`](./tasks.md)
+> for the live roadmap.
 
-## What it does (target)
+## What it does
 
-- Central visibility into backup state for every endpoint
-- Trigger any restic operation remotely (`backup`, `forget`, `prune`,
-  `check`, `unlock`, `snapshots`, `stats`, `diff`, `restore`)
-- Manage per-host backup schedules from the UI
-- Live job progress streamed back to the UI
-- Restore wizard (browse snapshots, pick paths, restore to original or
-  alternate host)
-- Repo health surfacing (size, dedup ratio, last check, lock state)
-- Alerting on failure or staleness
-- Cross-platform agent (Linux + Windows)
-- Ransomware-resistant repo access via append-only credentials
+- Central visibility into backup state for every endpoint.
+- Trigger any restic operation remotely (`backup`, `forget`,
+  `prune`, `check`, `unlock`, `snapshots`, `stats`, `diff`,
+  `restore`).
+- Per-host schedules with named source groups + retention.
+- Live job log streamed to the browser; downloadable as
+  text/NDJSON afterwards.
+- Restore wizard: browse a snapshot's tree, pick paths, restore
+  in-place or to a new directory.
+- Repo health surfacing (size, raw size, last check, lock state),
+  plus a 30/90-day repo-size trend.
+- Alerting over webhook, ntfy, or SMTP.
+- Cross-platform agent (Linux systemd + Windows SCM).
+- Append-only-friendly: separate admin credential for prune.
+- Optional Prometheus `/metrics` endpoint + sample Grafana
+  dashboard.
+- Optional OIDC SSO (Authelia, Authentik, etc.).
 
-## Architecture (one-line summary)
+## Screenshots
 
-A small Go control-plane on the Proxmox host, lightweight Go agents on each
-endpoint that hold an outbound WebSocket to the control-plane, and a
-`restic/rest-server` on Unraid that holds the actual backup data. The
-control-plane never touches backup bytes.
+| Sign in | Empty dashboard | Add host |
+|:-------:|:---------------:|:--------:|
+| ![Sign in](docs/book/src/screenshots/login.png) | ![Dashboard](docs/book/src/screenshots/dashboard.png) | ![Add host](docs/book/src/screenshots/add-host.png) |
+
+| Alerts | Settings | Audit log |
+|:------:|:--------:|:---------:|
+| ![Alerts](docs/book/src/screenshots/alerts.png) | ![Settings](docs/book/src/screenshots/settings.png) | ![Audit](docs/book/src/screenshots/audit.png) |
+
+(Screenshots from a fresh smoke install with no hosts. A populated
+fleet view and the live-log + restore wizard surfaces are part of
+the docs site under [`docs/book/`](./docs/book) — `make docs` to
+render locally.)
+
+## Architecture (one-line)
+
+A small Go control-plane in Docker, lightweight Go agents on each
+endpoint holding an outbound WebSocket to the control-plane, and
+a restic repository (rest-server, S3, B2, SFTP — anything restic
+speaks) that holds the actual backup data. **The control-plane
+never touches backup bytes.**
 
 Full architecture diagram and component breakdown:
-[`spec.md` §3](./spec.md).
+[`spec.md` §3](./spec.md), or the rendered version in the
+[docs site](./docs/book/src/concepts/architecture.md).
 
 ## Repository layout
 
@@ -38,31 +64,63 @@ Full architecture diagram and component breakdown:
 cmd/server/        control-plane binary
 cmd/agent/         endpoint agent binary
 internal/api       shared API types (REST + WS envelopes)
-internal/server/   HTTP, WS, UI handlers
+internal/server/   HTTP, WS, UI handlers, alert engine
 internal/agent/    service integration, restic runner, local scheduler
 internal/restic    restic CLI wrapper
 internal/store     SQLite persistence
-internal/crypto    secret encryption
+internal/crypto    secret encryption (AEAD)
 internal/auth      passwords, sessions, agent tokens
 web/               server-rendered templates + static assets
-deploy/            Dockerfile, docker-compose.yml, install scripts
-design/            UI wireframes (Phase 0 design pass)
+deploy/            Dockerfile, docker-compose.yml, install scripts, Grafana dashboard
+docs/              prose docs + the mdBook site under docs/book
+e2e/               compose stack + Playwright tests for end-to-end CI
 ```
 
+## Quickstart
+
+The reference deployment is a single Docker container fronted by
+your existing reverse proxy. See the [installation guide](docs/book/src/getting-started/install.md)
+for the full path; the very short version:
+
+```sh
+export RM_VERSION=v0.9.0   # pin a real tag
+export RM_BASE_URL=https://restic.example.com
+export RM_TRUSTED_PROXY=10.0.0.0/8
+docker compose -f deploy/docker-compose.yml up -d
+```
+
+The server prints a one-time bootstrap token to the log on first
+start. POST it to `/api/bootstrap` (or open `/bootstrap` in a
+browser) to create the admin user.
+
 ## Local development
 
-Requires Go 1.25+ (built and tested on 1.26). The floor is set by
-`modernc.org/sqlite` v1.50.
+Requires Go 1.25+. The floor is set by `modernc.org/sqlite` v1.50.
 
 ```sh
 make build           # builds cmd/server and cmd/agent into ./bin
 make test            # runs go test ./...
 make lint            # runs golangci-lint
-make run-server      # runs the server (dev defaults)
+make smoke-restart   # systemd --user smoke server (see CLAUDE.md)
+make docs            # renders the mdBook site to docs/book/book/
 ```
 
+End-to-end test harness against a Docker Compose stack with a
+sibling Linux agent: see [`docs/e2e.md`](docs/e2e.md). Runs in CI
+on every PR.
+
+## Documentation
+
+- **Concepts and operator guides**: [docs site](docs/book/src/intro.md),
+  rendered with `make docs`.
+- **Reverse-proxy setup**: [docs/reverse-proxy.md](docs/reverse-proxy.md).
+- **Prometheus + Grafana**: [docs/prometheus.md](docs/prometheus.md).
+- **End-to-end test harness**: [docs/e2e.md](docs/e2e.md).
+- **Security policy**: [SECURITY.md](SECURITY.md).
+- **Contributing**: [CONTRIBUTING.md](CONTRIBUTING.md).
+
 ## License
 
-PolyForm Noncommercial 1.0.0 — see [`LICENSE`](./LICENSE). Free for personal,
-hobby, research, educational, governmental, and other noncommercial use.
-Commercial use requires a separate license.
+[PolyForm Noncommercial 1.0.0](./LICENSE). Free for personal,
+hobby, research, educational, governmental, and other noncommercial
+use. Commercial use requires a separate license.

+137
@@ -0,0 +1,137 @@
|
|||||||
|
# Security policy
|
||||||
|
|
||||||
|
restic-manager handles credentials that grant access to backup
|
||||||
|
repositories — losing them means an attacker can read or destroy a
|
||||||
|
fleet's backups. We take security reports seriously even at this
|
||||||
|
project's small scale.
|
||||||
|
|
||||||
|
## Supported versions
|
||||||
|
|
||||||
|
Pre-1.0, only the latest tagged release on `main` is supported.
|
||||||
|
Backporting fixes to older tags is not currently offered.
|
||||||
|
|
||||||
|
| Version             | Supported |
|---------------------|-----------|
| `main` HEAD         | Yes       |
| Latest released tag | Yes       |
| Anything older      | No        |
|
||||||
|
|
||||||
|
## Reporting a vulnerability
|
||||||
|
|
||||||
|
**Please don't open a public issue for security problems.**
|
||||||
|
|
||||||
|
Instead, use one of these private channels:
|
||||||
|
|
||||||
|
1. **Gitea private message** to the repository owner. The
|
||||||
|
instance is at <https://gitea.dcglab.co.uk> and the owner's
|
||||||
|
profile (`steve`) has direct-message contact set up.
|
||||||
|
2. **Email** to the address on the maintainer's Gitea profile.
|
||||||
|
Use a subject like `[SECURITY] restic-manager: <one-line summary>`
|
||||||
|
so it doesn't get lost. PGP optional — if you want to encrypt,
|
||||||
|
ask for a key first.
|
||||||
|
|
||||||
|
If you don't get an acknowledgement within **3 working days**,
|
||||||
|
please escalate through the other channel — solo maintainers do
|
||||||
|
miss things, and the goal here is to fix the problem, not to
|
||||||
|
preserve protocol.
|
||||||
|
|
||||||
|
### What to include
|
||||||
|
|
||||||
|
- A description of the issue and the impact (what does an attacker
|
||||||
|
gain? confidentiality, integrity, availability?).
|
||||||
|
- Affected component (server, agent, install script, docs).
|
||||||
|
- Affected version (`restic-manager-server --version`).
|
||||||
|
- Reproduction steps if you have them. A working PoC is welcome
|
||||||
|
but not required — a credible threat model is enough.
|
||||||
|
- Whether you intend to publish a writeup, and any timing
|
||||||
|
preferences.
|
||||||
|
|
||||||
|
### What we'll do
|
||||||
|
|
||||||
|
1. Acknowledge receipt within 3 working days.
|
||||||
|
2. Confirm or refute the issue, and agree a rough severity (CVSS
|
||||||
|
or just "this is bad / this isn't"). Asking clarifying
|
||||||
|
questions is normal at this stage — please don't read it as
|
||||||
|
foot-dragging.
|
||||||
|
3. Develop a fix on a private branch, test it, and prepare a
|
||||||
|
release.
|
||||||
|
4. Coordinate disclosure timing with you. The default is **30
|
||||||
|
days from confirmed report to public disclosure**, with a
|
||||||
|
patched release published before the disclosure date. Faster
|
||||||
|
if a workable PoC is already circulating; slower only by
|
||||||
|
mutual agreement.
|
||||||
|
5. Credit the reporter in the release notes (or omit the credit
|
||||||
|
if you'd rather stay anonymous — your choice).
|
||||||
|
|
||||||
|
## Scope
|
||||||
|
|
||||||
|
In scope:
|
||||||
|
|
||||||
|
- The server binary (`cmd/server`) and any HTTP, WebSocket, or CLI
|
||||||
|
surface it exposes.
|
||||||
|
- The agent binary (`cmd/agent`) and the way it consumes commands
|
||||||
|
from the server.
|
||||||
|
- The install scripts (`deploy/install/install.sh`, `install.ps1`)
|
||||||
|
and the systemd unit shipped with them.
|
||||||
|
- The docker-compose reference deployment and the docker image we
|
||||||
|
publish.
|
||||||
|
- Any cryptographic primitive choice or implementation detail
|
||||||
|
(AEAD, token hashing, session handling, OIDC handshake).
|
||||||
|
- Documentation that, if followed, leads operators into an
|
||||||
|
insecure configuration.
|
||||||
|
|
||||||
|
Out of scope (not because they aren't real problems, just not ones
|
||||||
|
this report channel can act on):
|
||||||
|
|
||||||
|
- Vulnerabilities in restic itself — report those upstream at
|
||||||
|
<https://github.com/restic/restic>.
|
||||||
|
- Vulnerabilities in third-party dependencies that haven't yet been
|
||||||
|
patched upstream — report upstream first.
|
||||||
|
- Issues that require pre-authenticated admin access on the control
|
||||||
|
plane (admins can already do everything; that's not a privilege
|
||||||
|
escalation, that's the design).
|
||||||
|
- DoS via resource exhaustion on a deployment without the
|
||||||
|
recommended reverse proxy / rate limiting in front (see
|
||||||
|
`docs/reverse-proxy.md`).
|
||||||
|
- Social-engineering scenarios that don't have a technical hook
|
||||||
|
into the project's own surfaces.
|
||||||
|
|
||||||
|
## Threat model summary
|
||||||
|
|
||||||
|
For context (longer version in [`spec.md`](./spec.md) §11):
|
||||||
|
|
||||||
|
- The server is **HTTP-only**; TLS termination, ACME, HSTS, and
|
||||||
|
edge rate-limiting are the reverse proxy's job.
|
||||||
|
- Credentials are encrypted at rest with an AEAD key loaded from
|
||||||
|
`RM_SECRET_KEY_FILE`. The same key encrypts agent secrets that
|
||||||
|
travel to the agent over the WS channel.
|
||||||
|
- Agents authenticate with bearer tokens issued at enrolment and
|
||||||
|
hashed at rest. Compromise of the server DB does **not** leak
|
||||||
|
bearer tokens in plaintext, but does leak the hashes (which is
|
||||||
|
enough to log in *as* the agent until the operator revokes —
|
||||||
|
see [NS-01 / NS-02](./tasks.md) for the revoke + regenerate
|
||||||
|
flows).
|
||||||
|
- The control plane intentionally **never touches backup bytes** —
|
||||||
|
the agent runs `restic` directly against the repo. A
|
||||||
|
compromised control plane can dispatch new jobs but cannot
|
||||||
|
exfiltrate snapshot contents in-band.
|
||||||
|
- Append-only credentials are first-class. Forget/prune jobs use a
|
||||||
|
separate, admin-marked credential that the server only pushes
|
||||||
|
for the duration of a maintenance dispatch.
|
||||||
|
|
||||||
|
## Hardening checklist for operators
|
||||||
|
|
||||||
|
- Run behind a TLS-terminating reverse proxy (Caddy/nginx/Traefik).
|
||||||
|
- Set `RM_TRUSTED_PROXY` to the proxy's CIDR so request IPs aren't
|
||||||
|
spoofable.
|
||||||
|
- Back up `RM_SECRET_KEY_FILE` separately from the database.
|
||||||
|
Without it the encrypted creds are unrecoverable.
|
||||||
|
- Use append-only credentials for the everyday backup path; only
|
||||||
|
the optional admin credential should have write/forget/prune
|
||||||
|
power.
|
||||||
|
- Disable users (don't delete) when staff change roles — bearer
|
||||||
|
tokens stay valid until rotated.
|
||||||
|
- Watch the alert and audit-log views during enrolment of new
|
||||||
|
hosts.
|
||||||
|
|
||||||
|
Thanks for helping keep restic-manager users safe.
|
||||||
@@ -20,6 +20,7 @@ import (
|
|||||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/fleetupdate"
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/fleetupdate"
|
||||||
rmhttp "gitea.dcglab.co.uk/steve/restic-manager/internal/server/http"
|
rmhttp "gitea.dcglab.co.uk/steve/restic-manager/internal/server/http"
|
||||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/maintenance"
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/maintenance"
|
||||||
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
|
||||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/oidc"
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/oidc"
|
||||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ui"
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ui"
|
||||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ws"
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ws"
|
||||||
@@ -89,6 +90,7 @@ func run() error {
|
|||||||
|
|
||||||
hub := ws.NewHub()
|
hub := ws.NewHub()
|
||||||
jobHub := ws.NewJobHub()
|
jobHub := ws.NewJobHub()
|
||||||
|
metricsRegistry := metrics.NewRegistry()
|
||||||
|
|
||||||
notifHub := notification.NewHub(st, aead, cfg.BaseURL)
|
notifHub := notification.NewHub(st, aead, cfg.BaseURL)
|
||||||
alertEngine := alert.NewEngine(st, notifHub)
|
alertEngine := alert.NewEngine(st, notifHub)
|
||||||
@@ -122,6 +124,7 @@ func run() error {
|
|||||||
UI: renderer,
|
UI: renderer,
|
||||||
Version: version,
|
Version: version,
|
||||||
OIDC: oidcClient,
|
OIDC: oidcClient,
|
||||||
|
Metrics: metricsRegistry,
|
||||||
}
|
}
|
||||||
|
|
||||||
// First-run bootstrap: if the users table is empty, mint a one-time
|
// First-run bootstrap: if the users table is empty, mint a one-time
|
||||||
|
|||||||
@@ -0,0 +1,325 @@
|
|||||||
|
{
|
||||||
|
"annotations": {
|
||||||
|
"list": [
|
||||||
|
{
|
||||||
|
"builtIn": 1,
|
||||||
|
"datasource": { "type": "grafana", "uid": "-- Grafana --" },
|
||||||
|
"enable": true,
|
||||||
|
"hide": true,
|
||||||
|
"iconColor": "rgba(0, 211, 255, 1)",
|
||||||
|
"name": "Annotations & Alerts",
|
||||||
|
"type": "dashboard"
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"description": "restic-manager fleet overview. Imports against any Prometheus data source.",
|
||||||
|
"editable": true,
|
||||||
|
"fiscalYearStartMonth": 0,
|
||||||
|
"graphTooltip": 0,
|
||||||
|
"id": null,
|
||||||
|
"links": [],
|
||||||
|
"liveNow": false,
|
||||||
|
"panels": [
|
||||||
|
{
|
||||||
|
"id": 1,
|
||||||
|
"title": "Fleet status",
|
||||||
|
"type": "stat",
|
||||||
|
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||||
|
"gridPos": { "h": 6, "w": 6, "x": 0, "y": 0 },
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": { "mode": "thresholds" },
|
||||||
|
"thresholds": {
|
||||||
|
"mode": "absolute",
|
||||||
|
"steps": [
|
||||||
|
{ "color": "red", "value": null },
|
||||||
|
{ "color": "green", "value": 1 }
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"unit": "short"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"options": {
|
||||||
|
"colorMode": "value",
|
||||||
|
"graphMode": "area",
|
||||||
|
"justifyMode": "auto",
|
||||||
|
"orientation": "auto",
|
||||||
|
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
|
||||||
|
"textMode": "auto"
|
||||||
|
},
|
||||||
|
"targets": [
|
||||||
|
{
|
||||||
|
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||||
|
"expr": "rm_hosts_online",
|
||||||
|
"legendFormat": "online",
|
||||||
|
"refId": "A"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||||
|
"expr": "rm_hosts_total",
|
||||||
|
"legendFormat": "total",
|
||||||
|
"refId": "B"
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"id": 2,
|
||||||
|
"title": "Open alerts",
|
||||||
|
"type": "stat",
|
||||||
|
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||||
|
"gridPos": { "h": 6, "w": 6, "x": 6, "y": 0 },
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": { "mode": "thresholds" },
|
||||||
|
"thresholds": {
|
||||||
|
"mode": "absolute",
|
||||||
|
"steps": [
|
||||||
|
{ "color": "green", "value": null },
|
||||||
|
{ "color": "yellow", "value": 1 },
|
||||||
|
{ "color": "red", "value": 5 }
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"unit": "short"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"options": {
|
||||||
|
"colorMode": "value",
|
||||||
|
"graphMode": "none",
|
||||||
|
"orientation": "horizontal",
|
||||||
|
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
|
||||||
|
"textMode": "auto"
|
||||||
|
},
|
||||||
|
"targets": [
|
||||||
|
{
|
||||||
|
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||||
|
"expr": "sum by (severity) (rm_active_alerts)",
|
||||||
|
"legendFormat": "{{severity}}",
|
||||||
|
"refId": "A"
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"id": 3,
|
||||||
|
"title": "Backups failing (last reported run)",
|
||||||
|
"type": "stat",
|
||||||
|
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||||
|
"gridPos": { "h": 6, "w": 6, "x": 12, "y": 0 },
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": { "mode": "thresholds" },
|
||||||
|
"thresholds": {
|
||||||
|
"mode": "absolute",
|
||||||
|
"steps": [
|
||||||
|
{ "color": "green", "value": null },
|
||||||
|
{ "color": "red", "value": 1 }
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"unit": "short"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"options": {
|
||||||
|
"colorMode": "value",
|
||||||
|
"graphMode": "area",
|
||||||
|
"reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
|
||||||
|
"textMode": "auto"
|
||||||
|
},
|
||||||
|
"targets": [
|
||||||
|
{
|
||||||
|
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||||
|
"expr": "count(rm_host_last_backup_success == 0)",
|
||||||
|
"legendFormat": "failing",
|
||||||
|
"refId": "A"
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"id": 4,
|
||||||
|
"title": "Hosts",
|
||||||
|
"type": "table",
|
||||||
|
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||||
|
"gridPos": { "h": 10, "w": 24, "x": 0, "y": 6 },
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"custom": { "align": "auto", "displayMode": "auto" }
|
||||||
|
},
|
||||||
|
"overrides": [
|
||||||
|
{
|
||||||
|
"matcher": { "id": "byName", "options": "Value #B" },
|
||||||
|
"properties": [
|
||||||
|
{ "id": "displayName", "value": "Last backup (s ago)" },
|
||||||
|
{ "id": "unit", "value": "s" }
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"matcher": { "id": "byName", "options": "Value #C" },
|
||||||
|
"properties": [
|
||||||
|
{ "id": "displayName", "value": "Repo size" },
|
||||||
|
{ "id": "unit", "value": "bytes" }
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"matcher": { "id": "byName", "options": "Value #D" },
|
||||||
|
"properties": [
|
||||||
|
{ "id": "displayName", "value": "Snapshots" }
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"matcher": { "id": "byName", "options": "Value #A" },
|
||||||
|
"properties": [
|
||||||
|
{ "id": "displayName", "value": "Online" }
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"matcher": { "id": "byName", "options": "Value #E" },
|
||||||
|
"properties": [
|
||||||
|
{ "id": "displayName", "value": "Open alerts" }
|
||||||
|
]
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"options": { "showHeader": true },
|
||||||
|
"transformations": [
|
||||||
|
{
|
||||||
|
"id": "merge",
|
||||||
|
"options": {}
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"targets": [
|
||||||
|
{
|
||||||
|
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||||
|
"expr": "rm_host_agent_online",
|
||||||
|
"format": "table",
|
||||||
|
"instant": true,
|
||||||
|
"refId": "A"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||||
|
"expr": "time() - rm_host_last_backup_timestamp_seconds",
|
||||||
|
"format": "table",
|
||||||
|
"instant": true,
|
||||||
|
"refId": "B"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||||
|
"expr": "rm_host_repo_size_bytes",
|
||||||
|
"format": "table",
|
||||||
|
"instant": true,
|
||||||
|
"refId": "C"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||||
|
"expr": "rm_host_snapshot_count",
|
||||||
|
"format": "table",
|
||||||
|
"instant": true,
|
||||||
|
"refId": "D"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||||
|
"expr": "rm_host_open_alerts",
|
||||||
|
"format": "table",
|
||||||
|
"instant": true,
|
||||||
|
"refId": "E"
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"id": 5,
|
||||||
|
"title": "Repo size over time",
|
||||||
|
"type": "timeseries",
|
||||||
|
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||||
|
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 },
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": { "mode": "palette-classic" },
|
||||||
|
"custom": {
|
||||||
|
"axisLabel": "",
|
||||||
|
"drawStyle": "line",
|
||||||
|
"fillOpacity": 10,
|
||||||
|
"lineWidth": 1,
|
||||||
|
"pointSize": 5,
|
||||||
|
"showPoints": "never"
|
||||||
|
},
|
||||||
|
"unit": "bytes"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"options": {
|
||||||
|
"legend": { "calcs": ["last"], "displayMode": "list", "placement": "bottom", "showLegend": true },
|
||||||
|
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||||
|
},
|
||||||
|
"targets": [
|
||||||
|
{
|
||||||
|
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||||
|
"expr": "rm_host_repo_size_bytes",
|
||||||
|
"legendFormat": "{{host}}",
|
||||||
|
"refId": "A"
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"id": 6,
|
||||||
|
"title": "Job duration p95 (last 1h, by kind)",
|
||||||
|
"type": "timeseries",
|
||||||
|
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||||
|
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 },
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": { "mode": "palette-classic" },
|
||||||
|
"custom": {
|
||||||
|
"drawStyle": "line",
|
||||||
|
"fillOpacity": 5,
|
||||||
|
"lineWidth": 1,
|
||||||
|
"pointSize": 4,
|
||||||
|
"showPoints": "never"
|
||||||
|
},
|
||||||
|
"unit": "s"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"options": {
|
||||||
|
"legend": { "calcs": ["last"], "displayMode": "list", "placement": "bottom", "showLegend": true },
|
||||||
|
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||||
|
},
|
||||||
|
"targets": [
|
||||||
|
{
|
||||||
|
"datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" },
|
||||||
|
"expr": "histogram_quantile(0.95, sum by (kind, le) (rate(rm_job_duration_seconds_bucket[1h])))",
|
||||||
|
"legendFormat": "{{kind}}",
|
||||||
|
"refId": "A"
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"refresh": "30s",
|
||||||
|
"schemaVersion": 39,
|
||||||
|
"style": "dark",
|
||||||
|
"tags": ["restic-manager", "backups"],
|
||||||
|
"templating": {
|
||||||
|
"list": [
|
||||||
|
{
|
||||||
|
"current": {},
|
||||||
|
"hide": 0,
|
||||||
|
"includeAll": false,
|
||||||
|
"label": "Prometheus",
|
||||||
|
"multi": false,
|
||||||
|
"name": "DS_PROMETHEUS",
|
||||||
|
"options": [],
|
||||||
|
"query": "prometheus",
|
||||||
|
"refresh": 1,
|
||||||
|
"regex": "",
|
||||||
|
"skipUrlSync": false,
|
||||||
|
"type": "datasource"
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"time": { "from": "now-6h", "to": "now" },
|
||||||
|
"timepicker": {},
|
||||||
|
"timezone": "",
|
||||||
|
"title": "restic-manager — fleet",
|
||||||
|
"uid": "rm-fleet-overview",
|
||||||
|
"version": 1,
|
||||||
|
"weekStart": ""
|
||||||
|
}
|
||||||
@@ -0,0 +1,19 @@
|
|||||||
|
[book]
|
||||||
|
title = "restic-manager"
|
||||||
|
description = "Self-hosted control plane for restic backups across a fleet of Linux and Windows endpoints."
|
||||||
|
authors = ["Steve Cliff"]
|
||||||
|
language = "en-GB"
|
||||||
|
multilingual = false
|
||||||
|
src = "src"
|
||||||
|
|
||||||
|
[output.html]
|
||||||
|
default-theme = "ayu"
|
||||||
|
preferred-dark-theme = "ayu"
|
||||||
|
git-repository-url = "https://gitea.dcglab.co.uk/steve/restic-manager"
|
||||||
|
git-repository-icon = "fa-code-fork"
|
||||||
|
edit-url-template = "https://gitea.dcglab.co.uk/steve/restic-manager/_edit/main/docs/book/{path}"
|
||||||
|
no-section-label = false
|
||||||
|
|
||||||
|
[output.html.fold]
|
||||||
|
enable = true
|
||||||
|
level = 2
|
||||||
@@ -0,0 +1,40 @@
|
|||||||
|
# Summary
|
||||||
|
|
||||||
|
[Introduction](./intro.md)
|
||||||
|
|
||||||
|
# Getting started
|
||||||
|
|
||||||
|
- [Installing the server](./getting-started/install.md)
|
||||||
|
- [Enrolling your first host](./getting-started/enrolling-hosts.md)
|
||||||
|
- [Running behind a reverse proxy](./getting-started/reverse-proxy.md)
|
||||||
|
|
||||||
|
# Concepts
|
||||||
|
|
||||||
|
- [Architecture](./concepts/architecture.md)
|
||||||
|
- [Credentials and how they flow](./concepts/credentials.md)
|
||||||
|
- [Schedules and source groups](./concepts/schedules-and-source-groups.md)
|
||||||
|
- [Repo maintenance](./concepts/repo-maintenance.md)
|
||||||
|
|
||||||
|
# Operations
|
||||||
|
|
||||||
|
- [Backups and restores](./operations/backups-and-restores.md)
|
||||||
|
- [Alerts and notifications](./operations/alerts.md)
|
||||||
|
- [Observability with Prometheus](./operations/observability.md)
|
||||||
|
- [Updating agents](./operations/updates.md)
|
||||||
|
|
||||||
|
# Security
|
||||||
|
|
||||||
|
- [Threat model](./security/threat-model.md)
|
||||||
|
- [Hardening checklist](./security/hardening.md)
|
||||||
|
- [Reporting vulnerabilities](./security/disclosure.md)
|
||||||
|
|
||||||
|
# Reference
|
||||||
|
|
||||||
|
- [Environment variables](./reference/env-vars.md)
|
||||||
|
- [HTTP endpoints](./reference/http-endpoints.md)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
[Contributing](./contributing.md)
|
||||||
|
[Roadmap](./roadmap.md)
|
||||||
|
[License](./license.md)
|
||||||
@@ -0,0 +1,121 @@
|
|||||||
|
# Architecture
|
||||||
|
|
||||||
|
## Components
|
||||||
|
|
||||||
|
```
|
||||||
|
┌────────────────────────────────────────────────────────────┐
|
||||||
|
│ Server (control plane, single process) │
|
||||||
|
│ * chi-based HTTP API + HTMX server-rendered UI │
|
||||||
|
│ * WebSocket hub for agent fan-out + browser fan-out │
|
||||||
|
│ * SQLite store (modernc.org/sqlite, pure Go) │
|
||||||
|
│ * AEAD encryption helpers │
|
||||||
|
│ * Alert engine + notification hub │
|
||||||
|
└────────────┬───────────────────────────────────┬───────────┘
|
||||||
|
│ outbound WS only │ HTTP(S)
|
||||||
|
│ │
|
||||||
|
┌────────────▼─────────────┐ ┌────────────▼─────────────┐
|
||||||
|
│ Agent (per host) │ │ Browser (operator) │
|
||||||
|
│ * coder/websocket │ │ * htmx + a tiny bit │
|
||||||
|
│ * cron for schedules │ │ of vanilla JS for │
|
||||||
|
│ * restic wrapper │ │ live job updates │
|
||||||
|
│ * sysinfo collector │ └──────────────────────────┘
|
||||||
|
└────────────┬─────────────┘
|
||||||
|
│ subprocess: restic ...
|
||||||
|
│
|
||||||
|
┌────────────▼─────────────────────────────────────────────────┐
|
||||||
|
│ restic repository (rest-server, S3, B2, SFTP, local …) │
|
||||||
|
│ Backup data flows directly here. Server never touches it. │
|
||||||
|
└──────────────────────────────────────────────────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
## Why outbound-only WebSockets?
|
||||||
|
|
||||||
|
The agent dials the server on `/ws/agent` with a bearer token. The
|
||||||
|
server doesn't initiate connections to the agent. Three reasons:
|
||||||
|
|
||||||
|
1. **Firewall friendliness.** Nothing on the endpoint needs an
|
||||||
|
inbound port; this works behind the typical "branch office NAT"
|
||||||
|
without router config.
|
||||||
|
2. **Single auth point.** The bearer token is the only credential
|
||||||
|
that crosses the boundary; the agent never accepts an
|
||||||
|
incoming socket.
|
||||||
|
3. **Reconnect semantics are simpler.** When the connection drops
|
||||||
|
(NAT timeout, server restart, transient network glitch) the
|
||||||
|
agent backs off and re-dials; the server marks the host
|
||||||
|
offline after 90s and lets the alert engine raise a stale-host
|
||||||
|
alert.
|
||||||
|
|
||||||
|
## Why SQLite?
|
||||||
|
|
||||||
|
SQLite fits the project's stance on high availability: it is a non-goal. A small
|
||||||
|
control plane managing twelve endpoints does not need replication
|
||||||
|
or a separate database tier. SQLite gives us:
|
||||||
|
|
||||||
|
- A single file to back up (plus the secret key).
|
||||||
|
- Hand-rolled migrations under `internal/store/migrations/` —
|
||||||
|
no migration framework lock-in.
|
||||||
|
- `WAL` mode plus per-connection foreign-key enforcement.
|
||||||
|
|
||||||
|
The migrations define the entire schema; there's no ORM or
|
||||||
|
query-builder layer between Go code and SQL.
|
||||||
|
|
||||||
|
## Why the agent runs `restic` itself, not via the server
|
||||||
|
|
||||||
|
The control plane never holds backup bytes in flight. That's
|
||||||
|
deliberate:
|
||||||
|
|
||||||
|
- A compromised control plane cannot exfiltrate snapshot
|
||||||
|
contents in-band — at worst it can dispatch new backup or
|
||||||
|
forget jobs (audit-logged) but the data path is between the
|
||||||
|
agent and the repository.
|
||||||
|
- The same agent process can target whichever transport restic
|
||||||
|
natively supports (rest-server, S3, B2, SFTP, local), with no
|
||||||
|
separate mux on the server side.
|
||||||
|
|
||||||
|
## Job lifecycle
|
||||||
|
|
||||||
|
```
|
||||||
|
┌──────────────────────┐
|
||||||
|
operator → │ POST /hosts/{id}/ │
|
||||||
|
│ run-backup │
|
||||||
|
└──────────┬───────────┘
|
||||||
|
│ 1. INSERT INTO jobs (status='queued')
|
||||||
|
│ 2. dispatch command.run over WS
|
||||||
|
▼
|
||||||
|
┌──────────────────────┐
|
||||||
|
│ Agent dispatches │
|
||||||
|
│ restic subprocess │
|
||||||
|
└──────────┬───────────┘
|
||||||
|
│
|
||||||
|
│ 3. job.started ───▶ store.MarkJobStarted
|
||||||
|
│ 4. job.progress ───▶ JobHub broadcast (live UI)
|
||||||
|
│ 5. log.stream ───▶ append to job_logs
|
||||||
|
│ 6. job.finished ───▶ store.MarkJobFinished
|
||||||
|
│ + alert engine eval
|
||||||
|
│ + (P6) metrics histogram
|
||||||
|
▼
|
||||||
|
terminal: succeeded | failed | cancelled
|
||||||
|
```
|
||||||
|
|
||||||
|
Operators see live updates because the browser subscribes to
|
||||||
|
`/api/jobs/{id}/stream`, and the WS handler broadcasts each
|
||||||
|
agent-emitted envelope to all live subscribers in addition to
|
||||||
|
persisting it.
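
The broadcast step can be pictured as a small fan-out keyed by job ID. The Go sketch below is an assumption about the general shape, not the project's `JobHub`: the type `fanout`, its methods, and the choice to skip slow subscribers rather than block are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// fanout delivers job events to every live subscriber of a job.
// Hypothetical stand-in for a hub; not the project's implementation.
type fanout struct {
	mu   sync.Mutex
	subs map[string]map[chan string]struct{}
}

func newFanout() *fanout {
	return &fanout{subs: make(map[string]map[chan string]struct{})}
}

// subscribe registers a buffered channel for one job and returns it
// together with a cancel function that removes the subscription.
func (f *fanout) subscribe(jobID string) (<-chan string, func()) {
	ch := make(chan string, 16)
	f.mu.Lock()
	if f.subs[jobID] == nil {
		f.subs[jobID] = make(map[chan string]struct{})
	}
	f.subs[jobID][ch] = struct{}{}
	f.mu.Unlock()
	return ch, func() {
		f.mu.Lock()
		delete(f.subs[jobID], ch)
		f.mu.Unlock()
	}
}

// broadcast sends event to each subscriber of jobID and returns how
// many received it. A subscriber with a full buffer is skipped so one
// slow browser cannot stall the agent's message handler.
func (f *fanout) broadcast(jobID, event string) int {
	f.mu.Lock()
	defer f.mu.Unlock()
	delivered := 0
	for ch := range f.subs[jobID] {
		select {
		case ch <- event:
			delivered++
		default:
		}
	}
	return delivered
}

func main() {
	hub := newFanout()
	events, cancel := hub.subscribe("job-1")
	defer cancel()
	n := hub.broadcast("job-1", "job.progress")
	fmt.Println(n, <-events)
}
```

Persisting the envelope would happen alongside `broadcast`, so a browser that connects late can still read the stored log.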
|
||||||
|
|
||||||
|
## What scheduling looks like
|
||||||
|
|
||||||
|
- The agent runs a local `robfig/cron/v3` instance.
|
||||||
|
- The server pushes the desired schedule set to the agent on
|
||||||
|
hello + after every CRUD change.
|
||||||
|
- When the agent's cron fires, it sends `schedule.fire` to the
|
||||||
|
server. The server creates a job row, sends `command.run` back,
|
||||||
|
and the agent dispatches a normal backup.
|
||||||
|
- If the WS drops between fire and run, the server queues the
|
||||||
|
schedule firing into `pending_runs` and drains on agent
|
||||||
|
reconnect — no missed scheduled backups due to network blips.
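
The queue-and-drain behaviour in the last bullet can be sketched in Go as follows. This is an in-memory illustration only: the text describes a `pending_runs` table, while the `dispatcher` type and its fields here are hypothetical.

```go
package main

import "fmt"

// dispatcher tracks schedule firings for one host.
// Hypothetical in-memory stand-in for the pending_runs table.
type dispatcher struct {
	online  bool
	pending []string // schedule IDs waiting for the agent to return
	sent    []string // schedule IDs for which command.run was sent
}

// fire handles one schedule firing: send command.run right away if the
// agent is connected, otherwise queue the firing.
func (d *dispatcher) fire(scheduleID string) {
	if d.online {
		d.sent = append(d.sent, scheduleID)
		return
	}
	d.pending = append(d.pending, scheduleID)
}

// reconnect marks the agent online and drains the queue in FIFO order.
func (d *dispatcher) reconnect() {
	d.online = true
	d.sent = append(d.sent, d.pending...)
	d.pending = nil
}

func main() {
	d := &dispatcher{}
	d.fire("nightly") // agent offline: queued
	d.reconnect()     // queue drained
	d.fire("hourly")  // agent online: sent immediately
	fmt.Println(d.sent, len(d.pending))
}
```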
|
||||||
|
|
||||||
|
For everything that isn't a backup (forget, prune, check), the
|
||||||
|
server runs a 60-second maintenance ticker against
|
||||||
|
`host_repo_maintenance` rows and dispatches the relevant command
|
||||||
|
when a cadence is due. The agent's local cron only handles
|
||||||
|
backups.
|
||||||
@@ -0,0 +1,98 @@
|
|||||||
|
# Credentials and how they flow
|
||||||
|
|
||||||
|
restic-manager handles three credential surfaces:
|
||||||
|
|
||||||
|
1. **Operator credentials** — the username + password (or OIDC
|
||||||
|
identity) that logs into the UI.
|
||||||
|
2. **Agent bearer tokens** — issued at enrolment, used by the
|
||||||
|
agent to authenticate its WebSocket to the server.
|
||||||
|
3. **Repo credentials** — the rest-server / S3 / B2 / SFTP
|
||||||
|
credentials the agent passes to `restic` itself.
|
||||||
|
|
||||||
|
Each has a different threat model and storage strategy.
|
||||||
|
|
||||||
|
## Operator credentials
|
||||||
|
|
||||||
|
- Local users are stored in `users` with a bcrypt password hash.
|
||||||
|
- Sessions are random tokens minted at login, stored hashed in
|
||||||
|
the `sessions` table, expired after 24h. Cookie is HttpOnly,
|
||||||
|
SameSite=Lax, and Secure (when `RM_COOKIE_SECURE=true`,
|
||||||
|
default).
|
||||||
|
- OIDC users carry `auth_source='oidc'` and an `oidc_subject`
|
||||||
|
pinning their IdP identity. Local password login is rejected
|
||||||
|
for OIDC users.
|
||||||
|
- Disabling a user soft-deletes them via `disabled_at` —
|
||||||
|
pre-existing sessions are invalidated on the next request.
|
||||||
|
|
||||||
|
## Agent bearer tokens
|
||||||
|
|
||||||
|
- Minted at enrolment, hashed at rest with `auth.HashToken`.
|
||||||
|
- The plaintext token only exists in memory at enrolment time
|
||||||
|
and on the agent's filesystem (`/etc/restic-manager/agent.yaml`,
|
||||||
|
mode `0600`, owned by the service user).
|
||||||
|
- Compromise of the server DB leaks the hashes, which is enough
|
||||||
|
to *log in as that agent* until you revoke. Compromise of the
|
||||||
|
agent host leaks the plaintext (via the config file) — same
|
||||||
|
end result.
|
||||||
|
- Rotation: re-enrol the host. Today there's no in-place rotation;
|
||||||
|
the operator deletes the host (which cascades, including
|
||||||
|
revoking the bearer hash) and re-runs the install command.
|
||||||
|
|
||||||
|
## Repo credentials
|
||||||
|
|
||||||
|
This is the credential that ultimately matters for backup
|
||||||
|
integrity. restic-manager keeps two slots per host:
|
||||||
|
|
||||||
|
- **The everyday credential** (`host_credentials.kind = ''`).
|
||||||
|
Append-only-friendly: this is the one your backup schedule
|
||||||
|
uses. It can write but not delete or forget.
|
||||||
|
- **The admin credential** (`host_credentials.kind = 'admin'`).
|
||||||
|
Has full delete rights. Only pushed to the agent transiently
|
||||||
|
while a `prune` or `forget` job is dispatching, and discarded
|
||||||
|
by the agent after the job ends.
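
The slot choice can be illustrated with a short Go sketch. The `kind` values `''` and `'admin'` come from the text above; the job-kind strings and the function name are assumptions.

```go
package main

import "fmt"

// credentialKindFor maps a job kind to the host_credentials.kind value
// it should run with. Job-kind strings here are illustrative guesses.
func credentialKindFor(jobKind string) string {
	switch jobKind {
	case "forget", "prune":
		return "admin" // needs delete rights, pushed only for this job
	default:
		return "" // everyday, append-only-friendly credential
	}
}

func main() {
	for _, kind := range []string{"backup", "prune"} {
		fmt.Printf("%s -> %q\n", kind, credentialKindFor(kind))
	}
}
```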
|
||||||
|
|
||||||
|
### Encryption flow
|
||||||
|
|
||||||
|
1. Operator types the credential into the UI or the install form.
|
||||||
|
2. Server AEAD-encrypts the cred (`crypto.AEAD.Encrypt`) using the
|
||||||
|
key in `RM_SECRET_KEY_FILE`. The plaintext is dropped from
|
||||||
|
memory.
|
||||||
|
3. Encrypted blob is stored in `host_credentials.cred_blob`.
|
||||||
|
4. When the agent connects, the server decrypts the blob and
|
||||||
|
sends the **plaintext** down the WebSocket inside a
|
||||||
|
`config.update` envelope.
|
||||||
|
5. The agent stores the plaintext in its in-memory secrets store
|
||||||
|
for the lifetime of the process; it's reloaded fresh on every
|
||||||
|
server-side push.
|
||||||
|
6. When a job runs, the agent merges the credential into the
|
||||||
|
restic environment (`restic.Env.RepoURL` stays bare; the
|
||||||
|
`user:pass@…` form is built only inside `envSlice()` at the
|
||||||
|
moment of `exec.Command`).
|
||||||
|
|
||||||
|
The merged form is **never logged**. Any URL that reaches the
structured `slog` output is passed through `restic.RedactURL()`
first.
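
As a rough illustration of steps 5 and 6 and of the redaction rule, the sketch below keeps the repo URL bare, merges the credential only when asked, and produces a log-safe form. It is not the project's code: `mergeCredential`, `redactURL`, and `splitBackend` are hypothetical stand-ins for `envSlice()` and `restic.RedactURL()`, and the handling of the `rest:` prefix is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// splitBackend peels off restic's "rest:" backend prefix so the remainder
// can be parsed as an ordinary URL. (Hypothetical helper.)
func splitBackend(repo string) (prefix, rest string) {
	if strings.HasPrefix(repo, "rest:") {
		return "rest:", strings.TrimPrefix(repo, "rest:")
	}
	return "", repo
}

// mergeCredential embeds user:pass into the bare repo URL. In the flow
// described above this would happen only at exec time. (Hypothetical.)
func mergeCredential(repo, user, pass string) (string, error) {
	prefix, rest := splitBackend(repo)
	u, err := url.Parse(rest)
	if err != nil {
		// Deliberately do not echo the input: it may contain a secret.
		return "", errors.New("invalid repo URL")
	}
	u.User = url.UserPassword(user, pass)
	return prefix + u.String(), nil
}

// redactURL returns a form that is safe to log: the password, if any,
// is replaced by url.URL.Redacted. (Hypothetical.)
func redactURL(repo string) string {
	prefix, rest := splitBackend(repo)
	u, err := url.Parse(rest)
	if err != nil {
		return "<unparseable repo URL>"
	}
	return prefix + u.Redacted()
}

func main() {
	merged, err := mergeCredential("rest:https://repo.example.com/host1", "backup", "s3cret")
	if err != nil {
		panic(err)
	}
	// Only the redacted form ever goes near a logger.
	fmt.Println(redactURL(merged))
}
```

Keeping the merge and the redaction as two small functions makes it easy to check in review that no call site logs the output of the first one.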
|
||||||
|
|
||||||
|
### Why push plaintext over the wire?
|
||||||
|
|
||||||
|
The transport itself is the trust boundary: the WebSocket runs
|
||||||
|
inside the same TLS-terminated reverse-proxy connection your
|
||||||
|
browser uses, and the agent has already authenticated with its
|
||||||
|
bearer token. Re-encrypting the payload on top of that would just
|
||||||
|
move the key-management problem somewhere else.
|
||||||
|
|
||||||
|
If your reverse proxy isn't terminating TLS, the deployment is
already broken; see [Hardening](../security/hardening.md).
|
||||||
|
|
||||||
|
## Setup tokens (admin-driven)
|
||||||
|
|
||||||
|
When an admin creates a new user, the server mints a one-time
|
||||||
|
setup link valid for 1 hour. The hash is stored; the raw token
|
||||||
|
is shown to the admin once. The user opens the link, sets a
|
||||||
|
password, and is dropped into a session. Expired tokens are
|
||||||
|
swept on the alert engine's 60s tick.
|
||||||
|
|
||||||
|
Same pattern for enrolment tokens: the raw token only exists in
|
||||||
|
memory at mint time, and the install snippet is the operator's
|
||||||
|
only chance to capture it. If you lose it, regenerate via the
|
||||||
|
**Add host** page (NS-02).
|
||||||
@@ -0,0 +1,85 @@
|
|||||||
|
# Repo maintenance
|
||||||
|
|
||||||
|
Backups go in; without maintenance, repos grow forever and
|
||||||
|
eventually fall over. restic-manager runs three maintenance
|
||||||
|
operations on a per-host cadence:
|
||||||
|
|
||||||
|
| Command  | What it does | Default cadence |
|----------|--------------|-----------------|
| `forget` | Marks snapshots eligible for removal per the retention policy attached to each source group. Cheap; runs append-only. | Daily after the last backup of the day |
| `prune`  | Reclaims space from the repo. Requires the **admin** credential (write+delete). | Weekly, off-peak |
| `check`  | Verifies repo integrity. Sub-options surface lock state. | Weekly, with `--read-data-subset N%` to sample pack files |
|
||||||
|
|
||||||
|
A new field on each host row, `host_repo_maintenance`, holds the
|
||||||
|
cron expressions and last-fire anchors. The maintenance ticker on
|
||||||
|
the server runs every 60s, finds hosts whose next-fire is due,
|
||||||
|
and dispatches the right command. The agent's local cron is
|
||||||
|
**only** for backups.
|
||||||
|
|
||||||
|
## Why server-side and not agent-side?
|
||||||
|
|
||||||
|
The agent's cron knows about backups because backups are
|
||||||
|
per-source-group. Maintenance is per-repo, not per-source-group,
|
||||||
|
so doing it server-side keeps the per-host wiring simple:
|
||||||
|
|
||||||
|
- One ticker, not N agent crons to keep in sync.
|
||||||
|
- Cancelling a maintenance dispatch is just "don't dispatch the
|
||||||
|
next one" — no agent-side state to clean up.
|
||||||
|
- Skipping offline hosts is trivial (no queue; only scheduled
|
||||||
|
*backups* queue into `pending_runs`).
|
||||||
|
|
||||||
|
## Forget and the multi-group payload
|
||||||
|
|
||||||
|
A single `forget` job can target several source groups at once.
|
||||||
|
The wire envelope (`ForgetGroups`) carries one entry per group,
|
||||||
|
each with its retention policy. The agent runs N
|
||||||
|
`restic forget --tag <name> --keep-...` invocations in sequence,
|
||||||
|
streams their output, and reports a single terminal status.
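
A minimal sketch of how one entry of the payload could be turned into a `restic forget` argument list. The `ForgetGroup` and `Retention` types are assumptions modelled on the source-group fields; only the `--tag` and `--keep-*` flags are restic's own.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Retention mirrors the retention fields of a source group (assumed shape).
type Retention struct {
	KeepLast, KeepHourly, KeepDaily, KeepWeekly, KeepMonthly int
}

// ForgetGroup is one entry of the multi-group payload (assumed shape).
type ForgetGroup struct {
	Name      string
	Retention Retention
}

// forgetArgs builds the arguments for one invocation, skipping any
// keep-* dimension that is unset (zero).
func forgetArgs(g ForgetGroup) []string {
	args := []string{"forget", "--tag", g.Name}
	add := func(flag string, n int) {
		if n > 0 {
			args = append(args, flag, strconv.Itoa(n))
		}
	}
	add("--keep-last", g.Retention.KeepLast)
	add("--keep-hourly", g.Retention.KeepHourly)
	add("--keep-daily", g.Retention.KeepDaily)
	add("--keep-weekly", g.Retention.KeepWeekly)
	add("--keep-monthly", g.Retention.KeepMonthly)
	return args
}

func main() {
	groups := []ForgetGroup{
		{Name: "system", Retention: Retention{KeepLast: 3}},
		{Name: "data", Retention: Retention{KeepLast: 7, KeepDaily: 14, KeepWeekly: 4, KeepMonthly: 6}},
	}
	// One invocation per group, in sequence.
	for _, g := range groups {
		fmt.Println("restic " + strings.Join(forgetArgs(g), " "))
	}
}
```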
|
||||||
|
|
||||||
|
## Prune and the admin credential
|
||||||
|
|
||||||
|
Prune mutates the repo. The everyday append-only credential
|
||||||
|
**cannot** prune — that's the whole point of append-only.
|
||||||
|
restic-manager keeps a second slot per host (`kind = 'admin'`)
|
||||||
|
for the credential that can.
|
||||||
|
|
||||||
|
When a prune is dispatched (cadence-driven or operator-driven):
|
||||||
|
|
||||||
|
1. Server pushes the admin credential to the agent in a fresh
|
||||||
|
`config.update`.
|
||||||
|
2. Agent runs `restic prune` with the merged credential.
|
||||||
|
3. Job finishes; agent discards the admin credential from its
|
||||||
|
in-memory secrets store.
|
||||||
|
|
||||||
|
The server never logs the merged URL (see
|
||||||
|
[Credentials](./credentials.md)).
|
||||||
|
|
||||||
|
## Check and lock state
|
||||||
|
|
||||||
|
`restic check` warns about stale locks when it finds them. The
|
||||||
|
agent ships every check's output back as a `repo.stats` envelope
|
||||||
|
and a stream of log lines; if a stale lock is detected, the
|
||||||
|
**Repo** page surfaces a banner with an **Unlock** button. The
|
||||||
|
operator-only `unlock` command runs `restic unlock` and clears
|
||||||
|
the banner.
|
||||||
|
|
||||||
|
`unlock` has no cadence — it's a manual action, never automatic.
|
||||||
|
Auto-unlocking would mask the cause (probably a previously
|
||||||
|
crashed long-running operation) and risk corrupting an
|
||||||
|
operation the operator has merely lost track of.
|
||||||
|
|
||||||
|
## Repo stats
|
||||||
|
|
||||||
|
After every backup, check, prune, and unlock, the agent runs
|
||||||
|
`restic stats --json --mode raw-data` and ships the result as a
|
||||||
|
`repo.stats` envelope. The server stores this in
|
||||||
|
`host_repo_stats` (latest only) and `host_repo_stats_history`
|
||||||
|
(one row per host per day, last-write-wins per column — a
|
||||||
|
prune-only patch never nulls a backup-time size).
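
The per-column last-write-wins rule can be pictured as a merge in which unset fields in a patch leave the stored value alone. The sketch below assumes a `DayStats` row with three nullable columns; the real column set is not shown here, so treat the field names as placeholders.

```go
package main

import "fmt"

// DayStats is one history row; a nil pointer means "this report did not
// carry that column". Field names are illustrative.
type DayStats struct {
	TotalSize     *int64
	RawSize       *int64
	SnapshotCount *int64
}

// mergeDay applies patch on top of existing. Only columns present in the
// patch overwrite, so a partial report never nulls an earlier value.
func mergeDay(existing, patch DayStats) DayStats {
	if patch.TotalSize != nil {
		existing.TotalSize = patch.TotalSize
	}
	if patch.RawSize != nil {
		existing.RawSize = patch.RawSize
	}
	if patch.SnapshotCount != nil {
		existing.SnapshotCount = patch.SnapshotCount
	}
	return existing
}

// i64 is a small helper for building pointer values.
func i64(v int64) *int64 { return &v }

func main() {
	// A backup-time report first, then a prune-time patch with one column.
	row := DayStats{TotalSize: i64(1 << 30), SnapshotCount: i64(42)}
	row = mergeDay(row, DayStats{RawSize: i64(3 << 30)})
	fmt.Println(*row.TotalSize, *row.RawSize, *row.SnapshotCount)
}
```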
|
||||||
|
|
||||||
|
The host detail page surfaces:
|
||||||
|
|
||||||
|
- Total size + raw size in the vitals strip.
|
||||||
|
- Last-check timestamp + colour-coded status.
|
||||||
|
- Last-prune timestamp.
|
||||||
|
- 30/90-day repo size trend chart.
|
||||||
@@ -0,0 +1,105 @@
|
|||||||
|
# Schedules and source groups
|
||||||
|
|
||||||
|
Two related but separable ideas:
|
||||||
|
|
||||||
|
- A **source group** is a named bundle of "what to back up":
|
||||||
|
include paths, exclude patterns, retention policy, retry
|
||||||
|
configuration, optional pre/post hooks. The group's name is
|
||||||
|
used as the restic snapshot tag, so retention can target it
|
||||||
|
with `restic forget --tag <name>`.
|
||||||
|
- A **schedule** is a cron expression that, when it fires,
|
||||||
|
triggers a backup of one or more source groups on a host.
|
||||||
|
|
||||||
|
Decoupling them means you can have one schedule covering several
|
||||||
|
groups (e.g. `0 1 * * *` running both `system` and `data`), and
|
||||||
|
each group has its own retention without duplicating policy
|
||||||
|
across schedules.
|
||||||
|
|
||||||
|
## Source group anatomy
|
||||||
|
|
||||||
|
```yaml
name: data
includes:
  - /var/lib/postgresql
  - /home
excludes:
  - /home/*/.cache
  - /home/*/Downloads
retention:
  keep_last: 7
  keep_daily: 14
  keep_weekly: 4
  keep_monthly: 6
retry_max: 3
retry_backoff_seconds: 600
pre_hook: |
  pg_dump -U postgres -F c -f /var/lib/postgresql/dumps/all.dump
post_hook: |
  rm -f /var/lib/postgresql/dumps/all.dump
```
|
||||||
|
|
||||||
|
### Conflict detection
|
||||||
|
|
||||||
|
If your retention policy says `keep_hourly: 24` but no schedule
|
||||||
|
points at this group sub-daily, the UI surfaces a
|
||||||
|
**conflict-dimension banner** ("`hourly` won't be honoured —
|
||||||
|
no schedule fires more often than once a day"). The flag is
|
||||||
|
stored on the source group (`conflict_dimension`) and refreshed
|
||||||
|
whenever a schedule or group changes.
|
||||||
|
|
||||||
|
### Hooks
|
||||||
|
|
||||||
|
`pre_hook` and `post_hook` run on the agent host inside
|
||||||
|
`/bin/sh -c` (`cmd.exe /C` on Windows). Output is streamed back
|
||||||
|
to the live job log as `hook(<phase>): …` lines.
|
||||||
|
|
||||||
|
- A non-zero `pre_hook` exit aborts the backup.
|
||||||
|
- `post_hook` always runs, with `RM_JOB_STATUS=succeeded|failed`
|
||||||
|
in the environment. Use this for cleanup that must happen
|
||||||
|
whether the backup worked or not.
|
||||||
|
- Hooks only run for `kind=backup` jobs. They do not run for
|
||||||
|
`forget`, `prune`, `check`, etc.
|
||||||
|
- Hook bodies are AEAD-encrypted at rest at the HTTP layer; the
  agent receives them as plaintext over the WS channel.
|
||||||
|
|
||||||
|
A "host default" pair of hooks lives on the host itself; a
|
||||||
|
source group's own hooks override them when set.
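
The rules above can be sketched as three small pure functions: choosing the shell per OS, resolving group hooks over host defaults, and setting `RM_JOB_STATUS` for the post hook. All function names are hypothetical, and treating an empty group hook as "not set" is an assumption.

```go
package main

import (
	"fmt"
	"runtime"
)

// hookCommand picks the interpreter for a hook body based on the OS.
func hookCommand(goos, script string) (string, []string) {
	if goos == "windows" {
		return "cmd.exe", []string{"/C", script}
	}
	return "/bin/sh", []string{"-c", script}
}

// effectiveHook applies the override rule: a source group's own hook wins
// when set, otherwise the host default is used.
func effectiveHook(hostDefault, group string) string {
	if group != "" {
		return group
	}
	return hostDefault
}

// postHookEnv returns a copy of base with RM_JOB_STATUS appended.
func postHookEnv(base []string, succeeded bool) []string {
	status := "failed"
	if succeeded {
		status = "succeeded"
	}
	return append(append([]string{}, base...), "RM_JOB_STATUS="+status)
}

func main() {
	script := effectiveHook("echo host-default", "")
	name, args := hookCommand(runtime.GOOS, script)
	fmt.Println(name, args, postHookEnv(nil, true))
}
```

The resulting name, arguments, and environment would then be handed to `exec.Command`; that step is omitted here so the sketch has no side effects.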
|
||||||
|
|
||||||
|
## Schedule anatomy
|
||||||
|
|
||||||
|
```yaml
cron: "0 2 * * *"
enabled: true
source_group_ids:
  - <gid for "data">
  - <gid for "system">
```
|
||||||
|
|
||||||
|
Slim by design: a schedule says **when** and **which groups**.
|
||||||
|
Everything else (paths, retention, hooks) lives on the groups.
|
||||||
|
|
||||||
|
The agent's local cron fires the schedule. If the WebSocket is
|
||||||
|
down at fire time, the server queues the firing into
|
||||||
|
`pending_runs` and drains it on the next agent reconnect — a
|
||||||
|
short network blip won't lose the backup.
|
||||||
|
|
||||||
|
### Last / next run
|
||||||
|
|
||||||
|
The schedules tab shows "next" (computed by parsing the cron
|
||||||
|
expression with `robfig/cron/v3`) and "last" (the latest
|
||||||
|
`actor_kind=schedule` job in the `jobs` table) for every
|
||||||
|
schedule. The dashboard host row also surfaces `next 12h ago/from
|
||||||
|
now` when a single covering schedule is the run-now candidate.
|
||||||
|
|
||||||
|
## Bandwidth limits
|
||||||
|
|
||||||
|
Two places set restic's `--limit-upload` / `--limit-download`:
|
||||||
|
|
||||||
|
1. **Host-wide caps** on the host row (`bandwidth_up_kbps`,
|
||||||
|
`bandwidth_down_kbps`). Pushed to the agent on hello and
|
||||||
|
after `PUT /api/hosts/{id}/bandwidth`. Apply to every restic
|
||||||
|
invocation on the host.
|
||||||
|
2. **Per-job overrides** on the per-source-group Run-now form.
|
||||||
|
Win over host caps for the lifetime of that one job.
|
||||||
|
|
||||||
|
If neither is set, restic runs unthrottled.
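
A sketch of the precedence rule, assuming the override is applied per direction (an unset override field falls back to the host cap) and that the `*_kbps` values map directly onto restic's KiB/s flags. Both points are assumptions; the `Limits` type is illustrative.

```go
package main

import (
	"fmt"
	"strconv"
)

// Limits holds bandwidth caps; zero means "not set".
type Limits struct {
	UpKbps, DownKbps int
}

// effectiveLimits lets a per-job override win over the host-wide caps.
func effectiveLimits(host Limits, override *Limits) Limits {
	out := host
	if override != nil {
		if override.UpKbps > 0 {
			out.UpKbps = override.UpKbps
		}
		if override.DownKbps > 0 {
			out.DownKbps = override.DownKbps
		}
	}
	return out
}

// limitArgs renders the caps as restic flags; no flags means unthrottled.
func limitArgs(l Limits) []string {
	var args []string
	if l.UpKbps > 0 {
		args = append(args, "--limit-upload", strconv.Itoa(l.UpKbps))
	}
	if l.DownKbps > 0 {
		args = append(args, "--limit-download", strconv.Itoa(l.DownKbps))
	}
	return args
}

func main() {
	host := Limits{UpKbps: 2048, DownKbps: 4096}
	fmt.Println(limitArgs(effectiveLimits(host, nil)))
	fmt.Println(limitArgs(effectiveLimits(host, &Limits{UpKbps: 512})))
}
```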
|
||||||
@@ -0,0 +1,17 @@
|
|||||||
|
# Contributing
|
||||||
|
|
||||||
|
Full contributor guide:
|
||||||
|
[`CONTRIBUTING.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/CONTRIBUTING.md)
|
||||||
|
in the repository root.
|
||||||
|
|
||||||
|
The short version:
|
||||||
|
|
||||||
|
- Open an issue first for non-trivial changes; the design is
|
||||||
|
still moving and unsolicited large PRs may conflict with
|
||||||
|
in-flight work.
|
||||||
|
- `make lint test` must pass.
|
||||||
|
- One logical change per commit, no `Co-Authored-By` trailers.
|
||||||
|
- UK English in identifiers and comments; comments explain the
|
||||||
|
**why** not the **what**.
|
||||||
|
|
||||||
|
Code of conduct: [`CODE_OF_CONDUCT.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/CODE_OF_CONDUCT.md).
|
||||||
@@ -0,0 +1,113 @@
|
|||||||
|
# Enrolling your first host
|
||||||
|
|
||||||
|
The control plane only knows about hosts you've explicitly
|
||||||
|
enrolled. Two paths exist:
|
||||||
|
|
||||||
|
1. **Token-based enrolment** — admin generates a token, pastes it
|
||||||
|
into an install command on the host. The host appears immediately,
|
||||||
|
already mapped to the desired repo.
|
||||||
|
2. **Announce-and-approve** — the agent runs without a token,
|
||||||
|
"announces" itself to the server, and a human in the UI accepts
|
||||||
|
the announcement.
|
||||||
|
|
||||||
|
Token-based is the default and what most operators want; the
|
||||||
|
announce flow exists for the case where you can't easily paste a
|
||||||
|
secret onto the host (auto-imaged endpoints, scripted bring-ups
|
||||||
|
from a config repo).
|
||||||
|
|
||||||
|
## Token-based enrolment
|
||||||
|
|
||||||
|
### From the UI
|
||||||
|
|
||||||
|
1. Click **+ Add host** on the dashboard.
|
||||||
|
2. Fill in the hostname, the restic repo URL, and the repo
|
||||||
|
credentials. The credentials are AEAD-encrypted at the server
|
||||||
|
immediately; what you paste is what the agent receives.
|
||||||
|
3. Optionally pick the initial source paths — these become the
|
||||||
|
first source group on the host.
|
||||||
|
4. Submit. The server mints a one-time token and shows you a
   copy-pasteable install snippet.
|
||||||
|
|
||||||
|
### On the host (Linux)
|
||||||
|
|
||||||
|
```sh
|
||||||
|
curl -fsSL https://restic.example.com/install/install.sh | \
|
||||||
|
sudo RM_SERVER=https://restic.example.com \
|
||||||
|
RM_ENROL_TOKEN=<token> \
|
||||||
|
bash
|
||||||
|
```
|
||||||
|
|
||||||
|
The script:
|
||||||
|
|
||||||
|
1. Detects architecture (`amd64` or `arm64`).
|
||||||
|
2. Downloads the agent binary from `/agent/binary?os=…&arch=…`.
|
||||||
|
3. Drops the systemd unit at
|
||||||
|
`/etc/systemd/system/restic-manager-agent.service`.
|
||||||
|
4. Runs the agent in `-enrol` mode, which posts the token and
|
||||||
|
stores the persistent bearer it gets back.
|
||||||
|
5. Enables and starts the unit.
|
||||||
|
|
||||||
|
Within seconds the host should appear on the dashboard as
|
||||||
|
**online**.
|
||||||
|
|
||||||
|
### On the host (Windows)
|
||||||
|
|
||||||
|
```pwsh
|
||||||
|
$env:RM_SERVER = "https://restic.example.com"
|
||||||
|
$env:RM_ENROL_TOKEN = "<token>"
|
||||||
|
iwr -useb $env:RM_SERVER/install/install.ps1 | iex
|
||||||
|
```
|
||||||
|
|
||||||
|
Equivalent shape: registers a Windows service via the SCM
|
||||||
|
(see P2-16 for details), runs `-enrol`, starts the service.
|
||||||
|
|
||||||
|
## Recovering a lost token
|
||||||
|
|
||||||
|
Tokens are single-use and short-lived (1h). If you closed the tab
|
||||||
|
before pasting the install command, head to the **Add host** page —
|
||||||
|
outstanding tokens are listed there with a **Regenerate** button.
|
||||||
|
Regenerating revokes the old token's hash and mints a fresh raw
|
||||||
|
token while preserving the original repo credentials and initial
|
||||||
|
paths. (NS-02 in `tasks.md` if you want the design rationale.)
|
||||||
|
|
||||||
|
## Announce-and-approve
|
||||||
|
|
||||||
|
If the host can reach the server but you don't want to paste a
|
||||||
|
secret on it, run the agent in `-announce` mode:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
restic-manager-agent -announce \
|
||||||
|
-server https://restic.example.com \
|
||||||
|
-hostname myhost
|
||||||
|
```
|
||||||
|
|
||||||
|
The host appears in the **Pending hosts** panel on the dashboard
|
||||||
|
with its hostname, OS, arch, and the source IP that announced it.
|
||||||
|
Click **Accept**, fill in the repo URL + credentials, and the
|
||||||
|
server pushes the bearer over the still-open WebSocket. No
|
||||||
|
back-and-forth round trip.
|
||||||
|
|
||||||
|
If you don't accept within an hour the announcement is swept.
|
||||||
|
|
||||||
|
## What happens on the agent
|
||||||
|
|
||||||
|
After enrolment, the agent:
|
||||||
|
|
||||||
|
1. Connects via WebSocket to `/ws/agent` with its bearer token.
|
||||||
|
2. Sends a `hello` envelope with its OS, arch, agent version,
|
||||||
|
restic version, and protocol version.
|
||||||
|
3. Receives a `config.update` carrying its repo credentials
   (decrypted by the server and sent as plaintext inside the
   TLS-protected WebSocket) and any source-group paths.
|
||||||
|
4. Sits idle, sending a heartbeat every 30s. Operator-driven
|
||||||
|
"Run now" actions arrive as `command.run` envelopes; scheduled
|
||||||
|
jobs are driven by the agent's local cron.
|
||||||
|
|
||||||
|
## Auto-init of the repository
|
||||||
|
|
||||||
|
The first time a backup runs, the agent invokes `restic init`
|
||||||
|
against the repo you configured at enrolment. If the repo already
|
||||||
|
exists (`config file already exists`) the agent treats it as a
|
||||||
|
success and proceeds. The host's repo status (`unknown` →
|
||||||
|
`ready` / `init_failed`) is surfaced under the vitals strip on
|
||||||
|
the host detail page; if init fails, save fresh credentials in
|
||||||
|
the **Repo** tab to retry.
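
The "already exists counts as success" rule could look roughly like this. Matching on the stderr substring and the function name `repoStatusAfterInit` are assumptions for illustration; the status strings are the ones named above.

```go
package main

import (
	"fmt"
	"strings"
)

// repoStatusAfterInit maps the outcome of `restic init` to a repo status.
// A repo that already exists is treated the same as a fresh init.
func repoStatusAfterInit(exitCode int, stderr string) string {
	if exitCode == 0 {
		return "ready"
	}
	if strings.Contains(stderr, "config file already exists") {
		return "ready"
	}
	return "init_failed"
}

func main() {
	fmt.Println(repoStatusAfterInit(0, ""))
	fmt.Println(repoStatusAfterInit(1, "Fatal: create repository failed: config file already exists"))
	fmt.Println(repoStatusAfterInit(1, "Fatal: unable to open repository"))
}
```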
|
||||||
@@ -0,0 +1,92 @@
|
|||||||
|
# Installing the server
|
||||||
|
|
||||||
|
The reference deployment is a single Docker container fronted by
|
||||||
|
your existing reverse proxy. The image bundles the server binary,
|
||||||
|
the cross-compiled agent binaries, and the install scripts.
|
||||||
|
|
||||||
|
## Prerequisites
|
||||||
|
|
||||||
|
- A Linux host with Docker and Docker Compose.
|
||||||
|
- A reverse proxy in front (Caddy, nginx, Traefik) terminating
|
||||||
|
TLS on a public hostname. The server itself is HTTP-only by
|
||||||
|
design — see [Reverse proxy](./reverse-proxy.md) for why.
|
||||||
|
- A persistent volume for the server's data directory.
|
||||||
|
|
||||||
|
## Quick start
|
||||||
|
|
||||||
|
The reference compose file lives at
|
||||||
|
[`deploy/docker-compose.yml`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/deploy/docker-compose.yml):
|
||||||
|
|
||||||
|
```yaml
services:
  restic-manager:
    image: gitea.dcglab.co.uk/steve/restic-manager:${RM_VERSION:-latest}
    restart: unless-stopped
    environment:
      RM_LISTEN: ":8080"
      RM_DATA_DIR: "/data"
      RM_BASE_URL: "https://restic.example.com"
      # Trust your reverse proxy's CIDR so X-Forwarded-* are honoured.
      RM_TRUSTED_PROXY: "10.0.0.0/8"
    volumes:
      - rm-data:/data
    ports:
      # Bind localhost only; your reverse proxy is the public face.
      - "127.0.0.1:8080:8080"

volumes:
  rm-data:
```
|
||||||
|
|
||||||
|
Bring it up:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
docker compose up -d
|
||||||
|
docker compose logs -f restic-manager
|
||||||
|
```
|
||||||
|
|
||||||
|
The first run prints a one-time **bootstrap token** to the log. Use
|
||||||
|
it within an hour or it expires; if you miss the window the
|
||||||
|
container prints it again on next start as long as no admin user
|
||||||
|
exists.
|
||||||
|
|
||||||
|
## First-run admin setup
|
||||||
|
|
||||||
|
Open `https://restic.example.com/bootstrap` (or whatever your
|
||||||
|
public URL is). Paste the bootstrap token, pick a username and a
|
||||||
|
password (≥ 12 characters), and submit. You'll land in the
|
||||||
|
dashboard logged in as the new admin.
|
||||||
|
|
||||||
|
If you'd rather curl it, the equivalent is:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
curl -X POST https://restic.example.com/api/bootstrap \
|
||||||
|
-H 'Content-Type: application/json' \
|
||||||
|
-d '{"token":"<token-from-log>","username":"admin","password":"<≥12 chars>"}'
|
||||||
|
```
|
||||||
|
|
||||||
|
## Backing up the secret key
|
||||||
|
|
||||||
|
Inside the data volume, `secret.key` holds the AEAD key used to
|
||||||
|
encrypt every credential at rest. **Back it up separately from
|
||||||
|
the database.** Without it, encrypted credentials in the database
|
||||||
|
are unrecoverable; you'd have to re-enrol every host.
|
||||||
|
|
||||||
|
A simple working approach: copy `secret.key` to your password
|
||||||
|
manager or to a separately-backed-up secrets vault the day you
|
||||||
|
install. It doesn't change.
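
To see why a lost key makes the stored blobs unrecoverable, here is a small AEAD round trip. It uses AES-256-GCM from the Go standard library purely as an example; the cipher restic-manager actually uses and the blob layout (nonce prepended to the ciphertext) are assumptions.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// encrypt seals plaintext under key and returns nonce || ciphertext.
func encrypt(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// decrypt opens a blob produced by encrypt. A wrong key or a tampered
// blob fails authentication and returns an error.
func decrypt(key, blob []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	if len(blob) < gcm.NonceSize() {
		return nil, errors.New("blob too short")
	}
	nonce, ciphertext := blob[:gcm.NonceSize()], blob[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ciphertext, nil)
}

func main() {
	key := make([]byte, 32) // stands in for the contents of secret.key
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	blob, err := encrypt(key, []byte("repo-password"))
	if err != nil {
		panic(err)
	}
	plain, err := decrypt(key, blob)
	fmt.Println(string(plain), err)

	// With any other key the blob cannot be opened.
	_, err = decrypt(make([]byte, 32), blob)
	fmt.Println(err != nil)
}
```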
|
||||||
|
|
||||||
|
## Updating the server
|
||||||
|
|
||||||
|
```sh
|
||||||
|
# Pin a new version in your compose file (.env or docker-compose.yml),
|
||||||
|
# then:
|
||||||
|
docker compose pull
|
||||||
|
docker compose up -d
|
||||||
|
```
|
||||||
|
|
||||||
|
Migrations run automatically on startup; the server will refuse to
|
||||||
|
start if a migration fails (better to bail than to half-migrate).
|
||||||
|
|
||||||
|
For the agent self-update story, see
|
||||||
|
[Updating agents](../operations/updates.md).
|
||||||
@@ -0,0 +1,95 @@
|
|||||||
|
# Running behind a reverse proxy
|
||||||
|
|
||||||
|
The restic-manager server is HTTP-only by design. TLS termination,
|
||||||
|
public hostname, ACME, HSTS, and edge-level rate limiting all
|
||||||
|
belong to a reverse proxy you already operate outside this project.
|
||||||
|
|
||||||
|
## What the proxy must forward
|
||||||
|
|
||||||
|
The server reads four headers when (and only when) the immediate
|
||||||
|
peer matches `RM_TRUSTED_PROXY`:
|
||||||
|
|
||||||
|
| Header | Value | Why |
|--------|-------|-----|
| `X-Forwarded-For` | The original client IP | Rate-limit keys, audit log entries, OIDC redirect-URI checks. |
| `X-Forwarded-Proto` | `https` | Used for absolute URLs (e.g. OIDC redirect URIs). |
| `Host` | The public hostname clients use | Cookies are scoped to this; `RM_BASE_URL` must match. |
| `Connection` / `Upgrade` | Pass through unchanged | `/ws/agent` and `/api/jobs/{id}/stream` are WebSockets; without `Upgrade: websocket` they fail. |
|
||||||
|
|
||||||
|
Set `RM_TRUSTED_PROXY` to the CIDR (or comma-separated list of
|
||||||
|
CIDRs) the proxy connects from. Anything outside that range has
|
||||||
|
its `X-Forwarded-*` headers ignored, so a stray request that
|
||||||
|
bypasses the proxy can't spoof the client IP.
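
The trust rule can be sketched with `net/netip`. The sketch assumes a single proxy hop and therefore takes the right-most `X-Forwarded-For` entry (the one the trusted proxy appended); `parseTrusted` and `clientIP` are hypothetical names, not the server's actual functions.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// parseTrusted parses a comma-separated CIDR list such as the value of
// RM_TRUSTED_PROXY.
func parseTrusted(csv string) ([]netip.Prefix, error) {
	var out []netip.Prefix
	for _, part := range strings.Split(csv, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		p, err := netip.ParsePrefix(part)
		if err != nil {
			return nil, fmt.Errorf("bad CIDR %q: %w", part, err)
		}
		out = append(out, p)
	}
	return out, nil
}

// clientIP honours X-Forwarded-For only when the immediate peer is inside
// a trusted range; otherwise the peer address itself is used.
func clientIP(remoteAddr, xff string, trusted []netip.Prefix) string {
	ap, err := netip.ParseAddrPort(remoteAddr)
	if err != nil {
		return remoteAddr
	}
	peer := ap.Addr().Unmap()
	isTrusted := false
	for _, p := range trusted {
		if p.Contains(peer) {
			isTrusted = true
			break
		}
	}
	if !isTrusted {
		return peer.String() // header ignored: request bypassed the proxy
	}
	parts := strings.Split(xff, ",")
	last := strings.TrimSpace(parts[len(parts)-1])
	if ip, err := netip.ParseAddr(last); err == nil {
		return ip.Unmap().String()
	}
	return peer.String()
}

func main() {
	trusted, err := parseTrusted("10.0.0.0/8")
	if err != nil {
		panic(err)
	}
	fmt.Println(clientIP("10.0.0.5:41234", "203.0.113.7", trusted))
	fmt.Println(clientIP("198.51.100.9:5000", "203.0.113.7", trusted))
}
```

Taking the right-most entry means a client-supplied `X-Forwarded-For` prefix cannot override what the proxy observed; a deployment with several chained proxies would need a different rule.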
|
||||||
|
|
||||||
|
## Caddy
|
||||||
|
|
||||||
|
```caddyfile
|
||||||
|
restic.example.com {
|
||||||
|
encode zstd gzip
|
||||||
|
reverse_proxy 127.0.0.1:8080 {
|
||||||
|
header_up X-Real-IP {remote_host}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Caddy adds `X-Forwarded-For` / `X-Forwarded-Proto` automatically
|
||||||
|
and passes WebSocket headers through by default, so this is the
|
||||||
|
whole config.
|
||||||
|
|
||||||
|
## nginx
|
||||||
|
|
||||||
|
```nginx
server {
    listen 443 ssl http2;
    server_name restic.example.com;

    ssl_certificate     /etc/letsencrypt/live/restic.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/restic.example.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;

        # WebSocket upgrade
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";

        # Long-lived agent WS: raise the read timeout to 24h so idle
        # connections are not cut off by nginx.
        proxy_read_timeout 86400s;
    }
}
```
|
||||||
|
|
||||||
|
## Traefik
|
||||||
|
|
||||||
|
```yaml
http:
  routers:
    restic-manager:
      rule: "Host(`restic.example.com`)"
      entryPoints: [websecure]
      tls:
        certResolver: letsencrypt
      service: restic-manager

  services:
    restic-manager:
      loadBalancer:
        servers:
          - url: "http://restic-manager:8080"
        passHostHeader: true
```
|
||||||
|
|
||||||
|
Traefik forwards WebSocket upgrades and the standard
|
||||||
|
`X-Forwarded-*` set out of the box.
|
||||||
|
|
||||||
|
## Verification
|
||||||
|
|
||||||
|
After bringing the proxy up, the audit log should show your real
|
||||||
|
client IP for an interactive login (not the proxy's local
|
||||||
|
address). If you see `127.0.0.1` or the proxy's container IP, your
|
||||||
|
`RM_TRUSTED_PROXY` is wrong or `X-Forwarded-For` isn't being
|
||||||
|
forwarded.
|
||||||
@@ -0,0 +1,86 @@
|
|||||||
|
# restic-manager
|
||||||
|
|
||||||
|
restic-manager is a self-hosted, browser-based, single-pane-of-glass
|
||||||
|
for managing [restic](https://restic.net) backups across a fleet of
|
||||||
|
Linux and Windows endpoints. It's designed for **small fleets** —
|
||||||
|
the original target was twelve endpoints — and **one operator**.
|
||||||
|
|
||||||
|
## What it does
|
||||||
|
|
||||||
|
- Centralised view of every endpoint's last backup, repo size,
|
||||||
|
snapshot count, and recent jobs.
|
||||||
|
- Trigger any restic operation remotely (`backup`, `forget`, `prune`,
|
||||||
|
`check`, `unlock`, `snapshots`, `stats`, `diff`, `restore`).
|
||||||
|
- Per-host backup schedules with source groups (named bundles of
|
||||||
|
paths + retention policy).
|
||||||
|
- Live job log streamed to the browser; downloadable as text or NDJSON.
|
||||||
|
- Restore wizard with snapshot tree browse + path selection.
|
||||||
|
- Repo-level health surfacing (size, raw size, last-check, lock
|
||||||
|
state) plus a 30/90-day size trend.
|
||||||
|
- Alerting over webhook, ntfy, or SMTP.
|
||||||
|
- Cross-platform agent (Linux + Windows).
|
||||||
|
- Append-only-credential-friendly with a separate admin credential
|
||||||
|
for forget/prune.
|
||||||
|
|
||||||
|
## What it isn't
|
||||||
|
|
||||||
|
- **Not a SaaS.** Single-instance, single-tenant, by design.
|
||||||
|
- **Not a replacement for restic** — it's a control plane. The agent
|
||||||
|
shells out to a real `restic` binary.
|
||||||
|
- **Not highly available.** SQLite, single process; if you need
|
||||||
|
HA backups, you're shopping in the wrong aisle.
|
||||||
|
- **Not a multi-protocol backup tool.** restic only.
|
||||||
|
|
||||||
|
## How it fits together
|
||||||
|
|
||||||
|
```
┌──────────────────────────────────────────────┐
│ Server (control plane, Docker)               │
│   - REST + WebSocket API                     │
│   - SQLite store                             │
│   - Embedded HTMX UI                         │
└──────────┬─────────────────────────┬─────────┘
           │ outbound WS             │ HTTP(S)
           │                         │
┌──────────▼──────────┐   ┌──────────▼─────────┐
│ Agent (per host)    │   │ Browser (operator) │
│ - restic wrapper    │   └────────────────────┘
│ - cron for sched.   │
└──────────┬──────────┘
           │ restic
┌──────────▼──────────────────────────────────┐
│ rest-server / S3 / SFTP / local repo        │
│ (the actual backup data; server never       │
│  touches it)                                │
└─────────────────────────────────────────────┘
```
|
||||||
|
|
||||||
|
The control plane is a Go binary that runs in Docker. Each endpoint
|
||||||
|
runs a small Go agent that holds an outbound WebSocket to the
|
||||||
|
control plane. Backup data flows directly between the agent and the
|
||||||
|
restic repository — the control plane never sees a snapshot byte.
|
||||||
|
|
||||||
|
## Where to start
|
||||||
|
|
||||||
|
- [Installing the server](./getting-started/install.md) walks
|
||||||
|
through the Docker-based reference deployment.
|
||||||
|
- [Enrolling your first host](./getting-started/enrolling-hosts.md)
|
||||||
|
covers the install scripts and the announce-and-approve flow.
|
||||||
|
- [Architecture](./concepts/architecture.md) is the right read if
|
||||||
|
you want to know why something is the way it is before running
|
||||||
|
the install.
|
||||||
|
|
||||||
|
## Project status
|
||||||
|
|
||||||
|
Pre-1.0 but feature-complete for the original use case. Phases
|
||||||
|
0–4 are landed (MVP, scheduling, restore, RBAC + OIDC); Phase 5
|
||||||
|
(this docs site, contributor onboarding, end-to-end CI) is in
|
||||||
|
flight. See [`tasks.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/tasks.md)
|
||||||
|
for the live roadmap and [`spec.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/spec.md)
|
||||||
|
for the canonical design doc.
|
||||||
|
|
||||||
|
## License
|
||||||
|
|
||||||
|
[PolyForm Noncommercial 1.0.0](https://polyformproject.org/licenses/noncommercial/1.0.0/).
|
||||||
|
Personal and community deployments welcome; commercial use
|
||||||
|
requires a separate license.
|
||||||
@@ -0,0 +1,39 @@
|
|||||||
|
# License
|
||||||
|
|
||||||
|
restic-manager is licensed under
|
||||||
|
[**PolyForm Noncommercial 1.0.0**](https://polyformproject.org/licenses/noncommercial/1.0.0/).
|
||||||
|
The full text lives at
|
||||||
|
[`LICENSE`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/LICENSE)
|
||||||
|
in the repository root.
|
||||||
|
|
||||||
|
## What this means
|
||||||
|
|
||||||
|
- **Personal, hobbyist, educational, charitable, and similar
|
||||||
|
noncommercial use** is fully permitted, including modification
|
||||||
|
and redistribution.
|
||||||
|
- **Commercial use is not permitted** without a separate
|
||||||
|
license. The maintainer is not currently offering one — if
|
||||||
|
you need commercial rights, open an issue to start the
|
||||||
|
conversation.
|
||||||
|
- The license is permissive about everything except commercial
|
||||||
|
use: you can fork, modify, deploy in your home/lab, and
|
||||||
|
contribute back.
|
||||||
|
|
||||||
|
## Why this license
|
||||||
|
|
||||||
|
The PolyForm Noncommercial license was chosen because:
|
||||||
|
|
||||||
|
- It's a real, legal, plainly-worded license (not a custom
|
||||||
|
half-written variant).
|
||||||
|
- It permits the realistic uses for a hobby project (the
|
||||||
|
maintainer's homelab, a friend's fleet, a charity's IT
|
||||||
|
closet) without inviting commercial vendors to repackage
|
||||||
|
the work.
|
||||||
|
- It's compatible with the project staying small and
|
||||||
|
maintainable — the maintainer doesn't want to be on the hook
|
||||||
|
for SLA-grade commercial support.
|
||||||
|
|
||||||
|
## Contributions
|
||||||
|
|
||||||
|
By contributing, you agree your contributions are licensed
|
||||||
|
under the same PolyForm Noncommercial 1.0.0 license.
|
||||||
@@ -0,0 +1,73 @@
|
|||||||
|
# Alerts and notifications
|
||||||
|
|
||||||
|
restic-manager raises alerts on conditions that need human
|
||||||
|
attention. The alert engine evaluates rules on a 60s tick and
|
||||||
|
on every job-finished / host-online event.
|
||||||
|
|
||||||
|
## Built-in alert kinds
|
||||||
|
|
||||||
|
| Kind | Trigger | Severity |
|------|---------|----------|
| `backup_failed` | A backup job ends in `failed` or `cancelled` | warning |
| `forget_failed` | A forget job ends in `failed` | warning |
| `prune_failed` | A prune job ends in `failed` | critical |
| `check_failed` | A check job ends in `failed` | critical |
| `agent_offline` | A host has been offline more than 90s past its heartbeat cadence | warning |
| `stale_schedule` | A schedule's "last run" is more than 1.5 × its interval ago | warning |
| `update_failed` | An agent self-update returned a fail or didn't reconnect within 90s | warning |
| `fleet_update_halted` | The rolling fleet-update worker stopped on a failure | critical |
|
||||||
|
|
||||||
|
Each alert has a `dedup_key` so re-firing the same condition
|
||||||
|
just bumps `last_seen_at` — the operator gets one row per
|
||||||
|
condition, not a thousand.
|
||||||
|
|
||||||
|
## Lifecycle
|
||||||
|
|
||||||
|
```
|
||||||
|
raised ──acknowledge──▶ acknowledged ──resolve──▶ resolved
|
||||||
|
│ │
|
||||||
|
└────────auto-resolve──────┘
|
||||||
|
(e.g. agent_offline auto-resolves on agent_online)
|
||||||
|
```
|
||||||
|
|
||||||
|
- **Acknowledge** says "I've seen this, stop notifying about it".
|
||||||
|
- **Resolve** says "the underlying condition is gone".
|
||||||
|
- Some alerts auto-resolve when the condition clears
|
||||||
|
(`agent_offline` is the canonical example).
|
||||||
|
|
||||||
|
## Notification channels
|
||||||
|
|
||||||
|
Configure under **Settings → Notifications**. Each channel can
|
||||||
|
subscribe to all alerts or filter by severity.
|
||||||
|
|
||||||
|
### Webhook
|
||||||
|
|
||||||
|
Posts a JSON envelope to a URL of your choice. Useful for
|
||||||
|
piping into Slack via an Incoming Webhook URL or into your own
|
||||||
|
alerting tooling.
|
||||||
|
|
||||||
|
### ntfy
|
||||||
|
|
||||||
|
Pushes a plain-text alert to an [ntfy.sh](https://ntfy.sh/)
|
||||||
|
topic. Configure the topic URL; optional bearer token if you
|
||||||
|
self-host with auth.
|
||||||
|
|
||||||
|
### SMTP
|
||||||
|
|
||||||
|
Plain SMTP (with optional TLS). Configure host, port,
|
||||||
|
username, password, and the recipient list.
|
||||||
|
|
||||||
|
## Test fire
|
||||||
|
|
||||||
|
Each channel exposes a **Test fire** button that dispatches a
|
||||||
|
single synthetic alert through the channel without touching the
|
||||||
|
alert engine. Use this when you've added a channel and want to
|
||||||
|
verify connectivity before the next real failure happens.
|
||||||
|
|
||||||
|
## What gets logged
|
||||||
|
|
||||||
|
Every alert raise / acknowledge / resolve writes an audit log
|
||||||
|
entry. The audit log UI at **Settings → Audit log** filters by
|
||||||
|
user, action, target, and time range — useful for the
|
||||||
|
post-incident "who clicked acknowledge on the prune-failure
|
||||||
|
alert" question.
|
||||||
@@ -0,0 +1,73 @@
|
|||||||
|
# Backups and restores
|
||||||
|
|
||||||
|
## Running a backup
|
||||||
|
|
||||||
|
Three ways to trigger one:
|
||||||
|
|
||||||
|
1. **Scheduled** — the agent's local cron fires at the time set
|
||||||
|
on the schedule.
|
||||||
|
2. **Run-now** — operator clicks **Run now** on the host detail
|
||||||
|
right rail. Posts to `/hosts/{id}/run-backup` (defaults to all
|
||||||
|
source groups) or to a per-group form for finer control.
|
||||||
|
3. **API** — `POST /api/hosts/{id}/jobs` with the appropriate
|
||||||
|
payload. Same audit + dispatch path.
|
||||||
|
|
||||||
|
In every case the server creates a `jobs` row, broadcasts a
|
||||||
|
`command.run` to the host, and lands the operator on the live
|
||||||
|
job log page (HTMX `HX-Redirect`).
|
||||||
|
|
||||||
|
## Cancelling a job
|
||||||
|
|
||||||
|
Any running job — backup, forget, prune, restore, anything —
|
||||||
|
exposes a **Cancel** button on its detail page. The server
|
||||||
|
broadcasts `command.cancel`, and the agent kills the running
|
||||||
|
restic subprocess via context cancel: SIGTERM first, SIGKILL
|
||||||
|
after a 5s grace (`cmd.Cancel` + `cmd.WaitDelay`). On Windows the
|
||||||
|
SIGTERM step is replaced with `os.Kill` because Windows can't
|
||||||
|
deliver SIGTERM. Result: a cancelled job lands as `cancelled`
|
||||||
|
within a couple of hundred milliseconds.
|
||||||
|
|
||||||
|
## Restore wizard
|
||||||
|
|
||||||
|
Restoring a file or path goes through a four-step wizard at
|
||||||
|
`/hosts/{id}/restore`:
|
||||||
|
|
||||||
|
1. **Pick a snapshot.** Search by id or by date; the page is
|
||||||
|
pre-populated when you launched the wizard from a snapshot row.
|
||||||
|
2. **Browse the snapshot tree.** Lazy-loaded children via the
|
||||||
|
`MsgTreeList` synchronous WS RPC; results are cached
|
||||||
|
per-wizard-session for 30 minutes. Pick the absolute paths
|
||||||
|
you want.
|
||||||
|
3. **Choose a target.** Either **In place** (overwrites the
|
||||||
|
live filesystem; requires you to type the hostname to
|
||||||
|
confirm) or **New directory** (default
|
||||||
|
`$HOME/rm-restore/<job-id>/`; agent expands `$HOME` /
|
||||||
|
`${HOME}` / `~/` and creates the directory chain).
|
||||||
|
4. **Review and submit.** Server mints a job, dispatches
|
||||||
|
`command.run` with a `RestorePayload`, and `HX-Redirect`s to
|
||||||
|
the live job log.
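
The target-path expansion in step 3 can be sketched as a small pure function. This is an illustration under assumptions: `expandHome` is a hypothetical name, and the rule that only a leading, whole path segment is expanded is a guess at sensible behaviour rather than the agent's documented contract.

```go
package main

import (
	"fmt"
	"strings"
)

// expandHome replaces a leading $HOME, ${HOME} or ~ with home.
func expandHome(path, home string) string {
	for _, prefix := range []string{"${HOME}", "$HOME", "~"} {
		rest, ok := strings.CutPrefix(path, prefix)
		if !ok {
			continue
		}
		// Expand only when the prefix is the whole path or a complete
		// segment, so "$HOMEWORK/x" and "~bob/x" are left untouched.
		if rest == "" || strings.HasPrefix(rest, "/") {
			return home + rest
		}
	}
	return path
}

func main() {
	fmt.Println(expandHome("$HOME/rm-restore/42/", "/home/op"))   // /home/op/rm-restore/42/
	fmt.Println(expandHome("${HOME}/rm-restore/42/", "/home/op")) // /home/op/rm-restore/42/
	fmt.Println(expandHome("~/rm-restore/42/", "/home/op"))       // /home/op/rm-restore/42/
	fmt.Println(expandHome("/srv/restore", "/home/op"))           // /srv/restore
}
```

In a real agent the expanded path would then go to something like `os.MkdirAll` to create the directory chain.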
|
||||||
|
|
||||||
|
`--no-ownership` is gated on restic ≥ 0.17 (the flag was added
|
||||||
|
in that release). Hosts running 0.16 don't get the flag and
|
||||||
|
restore as the running user instead.
|
||||||
|
|
||||||
|
## Snapshot diff
|
||||||
|
|
||||||
|
Enter two snapshot ids in the **Diff** form on the host detail page to start
|
||||||
|
a `JobDiff` job that runs `restic diff <a> <b>`. Output streams
|
||||||
|
to the standard live job log. Useful when investigating a
|
||||||
|
suspiciously-sized backup.
|
||||||
|
|
||||||
|
## Job log artefacts
|
||||||
|
|
||||||
|
Every job's log is persisted in `job_logs` (one row per line),
|
||||||
|
not just streamed in-memory. That gives you:
|
||||||
|
|
||||||
|
- A live view at `/jobs/{id}` while the job runs.
|
||||||
|
- Two download formats from the same page header dropdown:
|
||||||
|
- **txt** — one line per row, `HH:MM:SS.mmm TAG payload`.
|
||||||
|
- **ndjson** — one self-contained JSON object per line
|
||||||
|
(`{seq, ts, stream, payload}`), perfect for `jq`.
|
||||||
|
|
||||||
|
Downloads work whether the job is running or finished —
|
||||||
|
the source is the DB, not the live socket.
|
||||||
@@ -0,0 +1,61 @@
|
|||||||
|
# Observability with Prometheus
|
||||||
|
|
||||||
|
restic-manager can expose a Prometheus scrape endpoint at
|
||||||
|
`GET /metrics`. The endpoint is **opt-in** — without an explicit
|
||||||
|
auth gate it isn't even mounted, so a forgotten config can't
|
||||||
|
accidentally publish fleet state.
|
||||||
|
|
||||||
|
The full reference lives at
|
||||||
|
[`docs/prometheus.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/docs/prometheus.md);
|
||||||
|
the short version follows.
|
||||||
|
|
||||||
|
## Enable the endpoint
|
||||||
|
|
||||||
|
Set at least one of:
|
||||||
|
|
||||||
|
- `RM_METRICS_TOKEN` — `Authorization: Bearer <token>` required.
|
||||||
|
- `RM_METRICS_TRUSTED_CIDR` — restricts source IPs (comma-CIDR).
|
||||||
|
|
||||||
|
When both are set, a request must pass both. The token compare is constant-time; the CIDR check
|
||||||
|
honours `X-Forwarded-For` only when the immediate hop matches
|
||||||
|
`RM_TRUSTED_PROXY`.
|
||||||
|
|
||||||
|
## Metrics emitted
|
||||||
|
|
||||||
|
- **Server gauges**: `rm_hosts_total`, `rm_hosts_online`,
|
||||||
|
`rm_active_alerts{severity}`, `rm_build_info{...}`.
|
||||||
|
- **Per-host gauges**: `rm_host_agent_online`,
|
||||||
|
`rm_host_last_backup_timestamp_seconds`,
|
||||||
|
`rm_host_last_backup_success`, `rm_host_repo_size_bytes`,
|
||||||
|
`rm_host_snapshot_count`, `rm_host_open_alerts`,
|
||||||
|
`rm_host_repo_status`.
|
||||||
|
- **Histogram**:
|
||||||
|
`rm_job_duration_seconds{kind,status,le=…}` (buckets
|
||||||
|
`1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf`).
|
||||||
|
|
||||||
|
The histogram is held in memory only. Prometheus persists the scrapes; if
|
||||||
|
you need durable history at hourly resolution that's
|
||||||
|
Prometheus's job.
|
||||||
|
|
||||||
|
## Sample Grafana dashboard
|
||||||
|
|
||||||
|
[`deploy/grafana/restic-manager-dashboard.json`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/deploy/grafana/restic-manager-dashboard.json)
|
||||||
|
imports through Grafana's **+ → Import → Upload JSON file**.
|
||||||
|
Six panels:
|
||||||
|
|
||||||
|
1. Fleet status (online / total).
|
||||||
|
2. Open alerts by severity.
|
||||||
|
3. Backups failing on most-recent run.
|
||||||
|
4. Hosts table — last backup, repo size, snapshots, open alerts.
|
||||||
|
5. Repo size over time, one line per host.
|
||||||
|
6. Job-duration p95 over a 1h window per kind.
|
||||||
|
|
||||||
|
## Alerting
|
||||||
|
|
||||||
|
restic-manager already has a built-in alert engine
|
||||||
|
([Alerts](./alerts.md)). The dashboard intentionally doesn't
|
||||||
|
duplicate it as Prometheus alert rules. If you want
|
||||||
|
Prometheus-side alerts on top, write your own based on the
|
||||||
|
metrics above — `rm_host_last_backup_success == 0`,
|
||||||
|
`time() - rm_host_last_backup_timestamp_seconds > <max age>`,
|
||||||
|
or whatever suits your environment.
|
||||||
@@ -0,0 +1,50 @@
|
|||||||
|
# Updating agents
|
||||||
|
|
||||||
|
Server updates are a `docker compose pull && docker compose up -d` away.
|
||||||
|
Agents update via the control plane.
|
||||||
|
|
||||||
|
## Single-host update
|
||||||
|
|
||||||
|
Each host's detail page shows an **Update agent** button when
|
||||||
|
the agent's reported version is older than the server's. The
|
||||||
|
button:
|
||||||
|
|
||||||
|
1. Dispatches a `command.update` to that host.
|
||||||
|
2. The agent fetches the appropriate binary from
|
||||||
|
`$RM_SERVER/agent/binary?os=…&arch=…` to
|
||||||
|
`<binary-path>.new`.
|
||||||
|
3. Copies the running binary to `<binary-path>.old` (one
|
||||||
|
revision back, in case rollback is needed).
|
||||||
|
4. Atomic-renames `.new` over the running binary.
|
||||||
|
5. Exits cleanly. systemd's `Restart=always` (or Windows SCM)
|
||||||
|
brings the process back on the new binary.
|
||||||
|
|
||||||
|
A 90-second timer on the server side waits for a hello at the
|
||||||
|
target version and marks the update succeeded — or, if the
|
||||||
|
agent doesn't reconnect at the expected version in time, marks
|
||||||
|
the update **failed** and raises an `update_failed` alert.
|
||||||
|
|
||||||
|
## Fleet update
|
||||||
|
|
||||||
|
The admin-only **Settings → Fleet update** page drives a rolling
|
||||||
|
update across every host in the fleet:
|
||||||
|
|
||||||
|
- One host at a time.
|
||||||
|
- Wait for hello-with-target-version (max 95s).
|
||||||
|
- On any host failing, **halt** the rollout, raise a
|
||||||
|
`fleet_update_halted` alert, leave the rest of the fleet on
|
||||||
|
the old version. No surprise mass-failures.
|
||||||
|
|
||||||
|
You can cancel an in-progress fleet update; the worker stops
|
||||||
|
after the current host finishes.
|
||||||
|
|
||||||
|
## TLS and corruption
|
||||||
|
|
||||||
|
Updates rely on the reverse proxy's TLS to detect corruption in
|
||||||
|
transit. There's no separate sha256 verification step — we
|
||||||
|
chose the simpler model on the basis that the same TLS already
|
||||||
|
gates every other byte the server hands to the agent.
|
||||||
|
|
||||||
|
If you'd like a separate signature step before applying updates,
|
||||||
|
that's a future-phase enhancement (see `tasks.md` Phase 6
|
||||||
|
candidates).
|
||||||
@@ -0,0 +1,58 @@
|
|||||||
|
# Environment variables
|
||||||
|
|
||||||
|
The server reads its configuration from environment variables
|
||||||
|
(canonical) with an optional YAML overlay. Env wins over YAML so
|
||||||
|
operators can tweak a single setting without rewriting the file.
|
||||||
|
|
||||||
|
## Server
|
||||||
|
|
||||||
|
| Variable | Default | Meaning |
|
||||||
|
|---------------------------|----------------------------------|---------|
|
||||||
|
| `RM_LISTEN` | `:8080` | TCP listener for the HTTP server. |
|
||||||
|
| `RM_DATA_DIR` | `/data` | Persistent state directory (SQLite, secret key, agent assets). |
|
||||||
|
| `RM_BASE_URL` | (none) | Public URL clients use; required for OIDC redirects + cookie scope. |
|
||||||
|
| `RM_SECRET_KEY_FILE` | `${RM_DATA_DIR}/secret.key` | Path to the AEAD key file. Auto-generated on first run. |
|
||||||
|
| `RM_COOKIE_SECURE` | `true` | Set `false` only for local HTTP testing. Controls `Secure` on session cookies. |
|
||||||
|
| `RM_TRUSTED_PROXY` | (none) | Comma-separated CIDRs trusted for `X-Forwarded-*`. |
|
||||||
|
| `RM_BUNDLED_ASSETS_DIR` | `/opt/restic-manager/dist` | Read-only path with bundled agent binaries + install scripts (the docker image bakes them here). |
|
||||||
|
| `RM_METRICS_TOKEN` | (off) | When set, `GET /metrics` requires `Authorization: Bearer <token>`. |
|
||||||
|
| `RM_METRICS_TRUSTED_CIDR` | (off) | When set, `GET /metrics` restricts source IPs (comma-CIDR). |
|
||||||
|
|
||||||
|
OIDC variables (all optional; empty issuer disables OIDC):
|
||||||
|
|
||||||
|
| Variable | Meaning |
|
||||||
|
|--------------------------------|---------|
|
||||||
|
| `RM_OIDC_ISSUER` | OIDC discovery URL (e.g. `https://auth.example.com`). |
|
||||||
|
| `RM_OIDC_CLIENT_ID` | Client ID registered with the IdP. |
|
||||||
|
| `RM_OIDC_CLIENT_SECRET` | Client secret (or use `RM_OIDC_CLIENT_SECRET_FILE`). |
|
||||||
|
| `RM_OIDC_CLIENT_SECRET_FILE` | Path to a file holding the client secret. |
|
||||||
|
| `RM_OIDC_DISPLAY_NAME` | Button label on the login page (e.g. "Authelia"). |
|
||||||
|
| `RM_OIDC_ROLE_CLAIM` | Token claim that carries roles (default `groups`). |
|
||||||
|
| `RM_OIDC_ROLE_MAPPING` | `idp-group=role` entries, comma-separated (e.g. `rm-admin=admin,rm-ops=operator`). |
|
||||||
|
| `RM_OIDC_REDIRECT_URL` | Override for the redirect URL; defaults to `${RM_BASE_URL}/auth/oidc/callback`. |
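
The `RM_OIDC_ROLE_MAPPING` format is compact enough to parse by hand. A minimal sketch of such a parser: `parseRoleMapping` is a hypothetical name, and the whitespace trimming and rejection of malformed entries are assumptions about reasonable behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// parseRoleMapping turns "idp-group=role,idp-group=role" into a map
// from IdP group to restic-manager role.
func parseRoleMapping(s string) (map[string]string, error) {
	out := make(map[string]string)
	for _, entry := range strings.Split(s, ",") {
		entry = strings.TrimSpace(entry)
		if entry == "" {
			continue // tolerate trailing commas and an empty variable
		}
		group, role, ok := strings.Cut(entry, "=")
		group, role = strings.TrimSpace(group), strings.TrimSpace(role)
		if !ok || group == "" || role == "" {
			return nil, fmt.Errorf("invalid role mapping entry %q", entry)
		}
		out[group] = role
	}
	return out, nil
}

func main() {
	m, err := parseRoleMapping("rm-admin=admin,rm-ops=operator")
	fmt.Println(m, err) // map[rm-admin:admin rm-ops:operator] <nil>
}
```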
|
||||||
|
|
||||||
|
## Agent
|
||||||
|
|
||||||
|
| Variable | Default | Meaning |
|
||||||
|
|----------------------|---------|---------|
|
||||||
|
| `RM_AGENT_CONFIG` | `/etc/restic-manager/agent.yaml` (Linux) | Config file path. |
|
||||||
|
|
||||||
|
The agent's other settings live in the YAML file (server URL,
|
||||||
|
bearer token, optional cert pin). The install script writes that
|
||||||
|
file for you at enrolment.
|
||||||
|
|
||||||
|
## Build-time
|
||||||
|
|
||||||
|
The Makefile threads `-ldflags` from `git describe` into the
|
||||||
|
`internal/version` package so `--version` and the dashboard
|
||||||
|
footer show the right values:
|
||||||
|
|
||||||
|
```
|
||||||
|
-X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=$(VERSION)
|
||||||
|
-X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Commit=$(COMMIT)
|
||||||
|
```
|
||||||
|
|
||||||
|
If you build with `go build` directly (no Makefile), `Version`
|
||||||
|
falls back to `dev` and the agent-update comparison falls back
|
||||||
|
to "always equal". Source-build deployments can still run; they
|
||||||
|
just don't participate in the self-update flow.
|
||||||
@@ -0,0 +1,82 @@
|
|||||||
|
# HTTP endpoints
|
||||||
|
|
||||||
|
A non-exhaustive map of the surfaces the control plane exposes.
|
||||||
|
All `/api/*` routes return JSON; all other paths render HTML
|
||||||
|
(server-rendered with HTMX in the loop).
|
||||||
|
|
||||||
|
The canonical wiring lives at
|
||||||
|
[`internal/server/http/server.go`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/internal/server/http/server.go);
|
||||||
|
when in doubt, read the routes block there.
|
||||||
|
|
||||||
|
## Public (no auth)
|
||||||
|
|
||||||
|
| Method | Path | Purpose |
|
||||||
|
|--------|----------------------------|---------|
|
||||||
|
| GET | `/healthz` | Liveness probe. Returns 204. |
|
||||||
|
| POST | `/api/auth/login` | Local-user login. JSON body: `{username, password}`. |
|
||||||
|
| POST | `/api/auth/logout` | Invalidate the session cookie. |
|
||||||
|
| POST | `/api/bootstrap` | First-run admin creation. Accepts the token printed at first start. |
|
||||||
|
| POST | `/api/agents/enroll` | Token-based agent enrolment. |
|
||||||
|
| POST | `/api/agents/announce` | Announce-and-approve agent enrolment. |
|
||||||
|
| GET | `/agent/binary?os=&arch=` | Serves the agent binary for the install scripts. |
|
||||||
|
| GET | `/install/*` | Serves the Linux + Windows install scripts and the systemd unit. |
|
||||||
|
| GET | `/api/version` | Build version + commit JSON. |
|
||||||
|
| GET | `/metrics` | Prometheus exposition (only when opted-in via `RM_METRICS_TOKEN` / `RM_METRICS_TRUSTED_CIDR`). |
|
||||||
|
| GET | `/login`, `/setup`, `/bootstrap` | UI pages. |
|
||||||
|
|
||||||
|
## Authenticated (any role)
|
||||||
|
|
||||||
|
| Method | Path | Purpose |
|
||||||
|
|--------|------------------------------------------|---------|
|
||||||
|
| GET | `/` | Dashboard. |
|
||||||
|
| GET | `/hosts/{id}` | Host detail. |
|
||||||
|
| GET | `/hosts/{id}/repo` | Repo tab. |
|
||||||
|
| GET | `/hosts/{id}/jobs` | Jobs tab. |
|
||||||
|
| GET | `/hosts/{id}/sources` | Source groups list. |
|
||||||
|
| GET | `/hosts/{id}/schedules` | Schedules list. |
|
||||||
|
| GET | `/jobs/{id}` | Live job log. |
|
||||||
|
| GET | `/api/hosts`, `/api/fleet/summary` | JSON list + summary. |
|
||||||
|
| GET | `/api/jobs/{id}/stream` | WebSocket subscription to a job's live log. |
|
||||||
|
| GET | `/api/jobs/{id}/log.{txt,ndjson}` | Persisted log download. |
|
||||||
|
|
||||||
|
## Operator role and above
|
||||||
|
|
||||||
|
| Method | Path | Purpose |
|
||||||
|
|--------|---------------------------------------|---------|
|
||||||
|
| POST | `/hosts/{id}/run-backup` | Run-now (HTMX form-post). |
|
||||||
|
| POST | `/hosts/{id}/sources/{gid}/run-now` | Per-source-group run-now. |
|
||||||
|
| POST | `/hosts/{id}/repo/{prune,check,unlock,reinit,probe}` | Maintenance actions. |
|
||||||
|
| POST | `/api/hosts/{id}/snapshots/diff` | Snapshot-diff job. |
|
||||||
|
| POST | `/hosts/{id}/restore` | Restore wizard submit. |
|
||||||
|
| POST | `/api/jobs/{id}/cancel` | Cancel a running job. |
|
||||||
|
| POST | `/hosts/{id}/tags` | Update host tags. |
|
||||||
|
| POST | `/hosts/{id}/sources` and friends | Source-group CRUD. |
|
||||||
|
| POST | `/hosts/{id}/schedules` and friends | Schedule CRUD. |
|
||||||
|
| POST | `/hosts/{id}/repo/credentials`, `/admin-credentials` | Credential update. |
|
||||||
|
|
||||||
|
## Admin role only
|
||||||
|
|
||||||
|
| Method | Path | Purpose |
|
||||||
|
|--------|---------------------------------------|---------|
|
||||||
|
| POST | `/hosts/new` | Mint enrolment token (Add host). |
|
||||||
|
| POST | `/hosts/{id}/delete` | Delete + cascade. |
|
||||||
|
| POST | `/hosts/{id}/update` | Dispatch a single agent update. |
|
||||||
|
| GET/POST | `/settings/users/...` | User management. |
|
||||||
|
| POST | `/settings/notifications/...` | Notification channel CRUD + test fire. |
|
||||||
|
| POST | `/settings/fleet-update/...` | Fleet-update worker. |
|
||||||
|
|
||||||
|
## WebSocket
|
||||||
|
|
||||||
|
| Path | Who connects | Auth |
|
||||||
|
|--------------------------------|--------------|------|
|
||||||
|
| `/ws/agent` | Agent | Bearer token issued at enrolment. |
|
||||||
|
| `/ws/agent/pending` | Agent (announce flow) | Pending-id query param. |
|
||||||
|
| `/api/jobs/{id}/stream` | Browser | Session cookie. |
|
||||||
|
|
||||||
|
## RBAC enforcement
|
||||||
|
|
||||||
|
Routes are grouped into chi route-groups by required role
|
||||||
|
(`viewer < operator < admin`); the `requireRole` middleware in
|
||||||
|
`internal/server/http/middleware.go` is the bouncer. Sessions
|
||||||
|
re-validate `disabled_at` on every request, so a disabled user's
|
||||||
|
cookie stops working immediately.
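
The `viewer < operator < admin` ordering maps naturally onto an integer rank. The sketch below shows the shape such a middleware could take using only `net/http`; it is not the project's `requireRole`. The `X-Demo-Role` header is a made-up stand-in for the real session lookup, and the status codes are assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Role is ordered so that a plain comparison expresses "at least".
type Role int

const (
	Viewer Role = iota
	Operator
	Admin
)

// roleFromRequest is a placeholder for the session lookup. A real
// server would resolve the session cookie instead of trusting a header.
func roleFromRequest(r *http.Request) (Role, bool) {
	switch r.Header.Get("X-Demo-Role") {
	case "viewer":
		return Viewer, true
	case "operator":
		return Operator, true
	case "admin":
		return Admin, true
	}
	return 0, false
}

// requireRole rejects requests whose role ranks below required.
func requireRole(required Role) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			role, ok := roleFromRequest(r)
			if !ok {
				http.Error(w, "unauthenticated", http.StatusUnauthorized)
				return
			}
			if role < required {
				http.Error(w, "forbidden", http.StatusForbidden)
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}

// statusFor sends one request through the middleware and returns the
// resulting HTTP status code.
func statusFor(required Role, roleHeader string) int {
	h := requireRole(required)(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNoContent)
	}))
	req := httptest.NewRequest(http.MethodPost, "/hosts/1/run-backup", nil)
	if roleHeader != "" {
		req.Header.Set("X-Demo-Role", roleHeader)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(Operator, "viewer")) // 403
	fmt.Println(statusFor(Operator, "admin"))  // 204
	fmt.Println(statusFor(Operator, ""))       // 401
}
```

The `func(http.Handler) http.Handler` signature is the standard middleware shape, so a function like this can be attached to a chi route group with `Use`.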
|
||||||
@@ -0,0 +1,32 @@
|
|||||||
|
# Roadmap
|
||||||
|
|
||||||
|
The live roadmap is in
|
||||||
|
[`tasks.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/tasks.md).
|
||||||
|
Phases ship in order; items inside a phase ship as the
|
||||||
|
opportunity arises.
|
||||||
|
|
||||||
|
## Status snapshot
|
||||||
|
|
||||||
|
| Phase | Theme | Status |
|
||||||
|
|-------|--------------------------------------------------|--------|
|
||||||
|
| 0 | Project bootstrap | ✅ done |
|
||||||
|
| 1 | MVP: enrolment, visibility, on-demand backup | ✅ done |
|
||||||
|
| 2 | Scheduling, retention, repo operations | ✅ done |
|
||||||
|
| 3 | Restore, alerts, audit | ✅ done |
|
||||||
|
| 4 | RBAC, OIDC, host tags | ✅ done |
|
||||||
|
| 5 | OSS readiness | 🚧 in flight (this docs site is part of it) |
|
||||||
|
| 6 | Update delivery + observability polish | ✅ done |
|
||||||
|
|
||||||
|
## What's not on the roadmap
|
||||||
|
|
||||||
|
The non-goals list in [`spec.md` §2](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/spec.md):
|
||||||
|
|
||||||
|
- Replacing restic itself or providing custom repo formats
|
||||||
|
- Managing non-restic backup tools
|
||||||
|
- Multi-tenancy / SaaS deployment
|
||||||
|
- High availability of the control plane (SQLite, single-instance)
|
||||||
|
- Mobile-native apps (responsive web only)
|
||||||
|
|
||||||
|
If something there is critical to your use case, restic-manager
|
||||||
|
isn't the right tool. That's not a closed door — it's a
|
||||||
|
deliberate scope decision so the project stays maintainable.
|
||||||
@@ -0,0 +1,35 @@
|
|||||||
|
# Reporting vulnerabilities
|
||||||
|
|
||||||
|
The full disclosure policy lives in
|
||||||
|
[`SECURITY.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/SECURITY.md)
|
||||||
|
at the repo root. The short version:
|
||||||
|
|
||||||
|
- **Don't open a public issue.**
|
||||||
|
- Send a Gitea private message to `steve` on
|
||||||
|
<https://gitea.dcglab.co.uk>, or email the address on the
|
||||||
|
maintainer's profile, with a subject like
|
||||||
|
`[SECURITY] restic-manager: <one-line summary>`.
|
||||||
|
- Expect an acknowledgement within 3 working days; escalate
|
||||||
|
through the other channel if you don't get one.
|
||||||
|
- Default disclosure window is **30 days from confirmed report
|
||||||
|
to public disclosure**, faster if a PoC is already
|
||||||
|
circulating, slower only by mutual agreement.
|
||||||
|
|
||||||
|
## What to include
|
||||||
|
|
||||||
|
A description of the issue and the impact, the affected
|
||||||
|
component (server / agent / install script / docs), the version,
|
||||||
|
and reproduction steps. A working PoC is welcome but not
|
||||||
|
required — a credible threat model is enough.
|
||||||
|
|
||||||
|
## In scope vs. out of scope
|
||||||
|
|
||||||
|
See the full policy. Quick highlights:
|
||||||
|
|
||||||
|
- **In scope:** server, agent, install scripts, docker image,
|
||||||
|
docker-compose reference, crypto choices, docs that lead to
|
||||||
|
insecure configs.
|
||||||
|
- **Out of scope:** restic itself (report upstream), unpatched
|
||||||
|
third-party deps (report upstream first), pre-authenticated
|
||||||
|
admin abuse (admins are designed to have full power), DoS on
|
||||||
|
deployments without the recommended reverse proxy.
|
||||||
@@ -0,0 +1,72 @@
|
|||||||
|
# Hardening checklist
|
||||||
|
|
||||||
|
A baseline for new deployments. Most of these are defaults; the
|
||||||
|
list is here to make audit easy.
|
||||||
|
|
||||||
|
## Server
|
||||||
|
|
||||||
|
- [ ] Reverse proxy in front, TLS terminating at the proxy
|
||||||
|
(Caddy/nginx/Traefik).
|
||||||
|
- [ ] `RM_TRUSTED_PROXY` set to the proxy's CIDR.
|
||||||
|
- [ ] `RM_BASE_URL` matches the public hostname and the cookie
|
||||||
|
scope you want.
|
||||||
|
- [ ] `RM_COOKIE_SECURE=true` (the default; only set `false`
|
||||||
|
for local HTTP testing).
|
||||||
|
- [ ] HTTP listener bound to **localhost** in the compose file,
|
||||||
|
not `0.0.0.0`. The reverse proxy is the only thing that
|
||||||
|
should reach it.
|
||||||
|
- [ ] `secret.key` backed up separately from the database.
|
||||||
|
- [ ] Bootstrap token consumed and the printed log line scrubbed
|
||||||
|
from any log archive.
|
||||||
|
|
||||||
|
## Authentication
|
||||||
|
|
||||||
|
- [ ] Admin user has a password ≥ 12 characters (the floor).
|
||||||
|
- [ ] OIDC enabled if you have an IdP — local password auth
|
||||||
|
stays as a break-glass.
|
||||||
|
- [ ] Disabled (not deleted) any users who change roles or leave
|
||||||
|
so their session is invalidated immediately.
|
||||||
|
- [ ] The last-admin guard isn't tripped — there's always at
|
||||||
|
least one enabled admin user.
|
||||||
|
|
||||||
|
## Repo credentials
|
||||||
|
|
||||||
|
- [ ] Append-only credential set as the everyday cred for every
|
||||||
|
host.
|
||||||
|
- [ ] Admin credential set only where prune cadence is enabled.
|
||||||
|
- [ ] No credentials reused across hosts. Each host should have
|
||||||
|
its own credential pair so a single host compromise has a
|
||||||
|
single blast radius.
|
||||||
|
- [ ] If using rest-server, `--append-only` flag is on for the
|
||||||
|
everyday user; the prune user is a separate identity.
|
||||||
|
|
||||||
|
## Agent
|
||||||
|
|
||||||
|
- [ ] Agent runs as `root` (Linux) or `LocalSystem` (Windows)
|
||||||
|
**only when** the source paths require it. Otherwise pin
|
||||||
|
a service user that has read access to what's backed up
|
||||||
|
and nothing else.
|
||||||
|
- [ ] systemd unit's sandboxing flags are intact
|
||||||
|
(`NoNewPrivileges`, `Protect*`, `MemoryDenyWriteExecute`).
|
||||||
|
- [ ] Agent's config file `/etc/restic-manager/agent.yaml` is
|
||||||
|
mode `0600` and owned by the service user. The bearer
|
||||||
|
token lives in there.
|
||||||
|
|
||||||
|
## Operations
|
||||||
|
|
||||||
|
- [ ] Alerts wired to a real channel (webhook into Slack,
|
||||||
|
ntfy topic, SMTP) — not just sitting in the UI.
|
||||||
|
- [ ] Test-fire each notification channel after configuring.
|
||||||
|
- [ ] Audit-log retention is long enough to cover the operator's
|
||||||
|
incident-response window.
|
||||||
|
- [ ] Prometheus endpoint, if enabled, gated by token AND CIDR
|
||||||
|
where practical (default is opt-in / off).
|
||||||
|
|
||||||
|
## Recovery
|
||||||
|
|
||||||
|
- [ ] A documented procedure for rotating a leaked agent bearer
|
||||||
|
(delete + re-enrol the host).
|
||||||
|
- [ ] A test-restore done at least once, end-to-end, before
|
||||||
|
relying on the system in anger.
|
||||||
|
- [ ] `secret.key` and the SQLite database covered by separate
|
||||||
|
backup paths so neither alone reconstitutes the other.
|
||||||
@@ -0,0 +1,110 @@
|
|||||||
|
# Threat model
|
||||||
|
|
||||||
|
This page documents what restic-manager defends against, what it
|
||||||
|
doesn't, and the trust assumptions a deployment is making. The
|
||||||
|
canonical version lives in [`spec.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/spec.md)
|
||||||
|
§11; the summary here is shaped for operators rather than
|
||||||
|
implementers.
|
||||||
|
|
||||||
|
## Trust boundaries
|
||||||
|
|
||||||
|
```
┌──────────────────────────────────────────┐
│  TRUSTED zone                            │
│  ┌─────────────┐    ┌──────────────┐     │
│  │ Operator's  │    │   Reverse    │     │
│  │  browser    │◄──►│    proxy     │     │  TLS terminates here
│  └─────────────┘    └──────┬───────┘     │
└────────────────────────────┼─────────────┘
                             │ HTTP, plaintext
                             │ (loopback or trusted LAN)
┌────────────────────────────▼─────────────┐
│   Server (control plane)                 │
└────────────┬─────────────────────────────┘
             │ outbound WebSocket (TLS to clients via proxy)
             │   (bearer-authenticated)
┌────────────▼──────────────┐
│   Agent (per host)        │ ◄── attacker model: assume one
└────────────┬──────────────┘     endpoint can be compromised
             │ subprocess
             ▼
           restic ──▶ repository (rest-server / S3 / SFTP / …)
```
|
||||||
|
|
||||||
|
## What we defend against
|
||||||
|
|
||||||
|
### Network attacker between operator and server
|
||||||
|
|
||||||
|
- HTTPS via the reverse proxy is the only operator-facing surface
|
||||||
|
on a sane deployment.
|
||||||
|
- `RM_COOKIE_SECURE=true` (default) means the session cookie
|
||||||
|
refuses to ride a non-HTTPS connection.
|
||||||
|
- `RM_TRUSTED_PROXY` gates whether `X-Forwarded-*` is honoured;
|
||||||
|
a bypassing request can't spoof the client IP.
|
||||||
|
|
||||||
|
### Compromised agent host
|
||||||
|
|
||||||
|
- The agent's bearer token can dispatch commands **only on its
|
||||||
|
own host**. It can't read other hosts' state, dispatch jobs
|
||||||
|
on other hosts, or escalate within the control plane.
|
||||||
|
- If you suspect a host compromise:
|
||||||
|
1. Delete the agent's host row via **Hosts → Delete**
|
||||||
|
(cascades the bearer hash).
|
||||||
|
2. Rotate the repo credential at the rest-server / object
|
||||||
|
store side.
|
||||||
|
3. Review the audit log, which lists every action that bearer ever drove.
|
||||||
|
|
||||||
|
### DB compromise without the secret key
|
||||||
|
|
||||||
|
- Repo credentials are AEAD-encrypted at rest. A DB dump alone
|
||||||
|
doesn't expose them.
|
||||||
|
- Agent bearer **hashes** are leaked; that's enough to
|
||||||
|
authenticate as any agent until you revoke. A rotation
|
||||||
|
procedure is just "delete + re-enrol" today.
|
||||||
|
- Operator passwords are bcrypt-hashed; OIDC users have no
|
||||||
|
password to leak.
|
||||||
|
- Session tokens are hashed; an attacker can't replay a
|
||||||
|
session from a DB dump.
|
||||||
|
|
||||||
|
### DB compromise WITH the secret key
|
||||||
|
|
||||||
|
The attacker can decrypt every credential. Treat
|
||||||
|
`secret.key` with the same care as a password manager database.
|
||||||
|
Back it up to a separate vault, not to the same Docker volume
|
||||||
|
as the database.
|
||||||
|
|
||||||
|
### Forget/prune as a DoS vector
|
||||||
|
|
||||||
|
- The everyday backup credential cannot prune (append-only).
|
||||||
|
- The admin credential is only pushed to the agent at the
|
||||||
|
moment of dispatch and discarded after the job ends.
|
||||||
|
- Compromise of a single agent host does **not** grant prune
|
||||||
|
rights — at worst the attacker gets fresh write access until
|
||||||
|
the credential is rotated.
|
||||||
|
|
||||||
|
### Operator-side typo or bad copy-paste
|
||||||
|
|
||||||
|
- Repo credentials are stored encrypted; mis-typed creds fail
|
||||||
|
fast on the next `restic` invocation rather than silently
|
||||||
|
corrupting state.
|
||||||
|
- NS-03 added auto-init: the first dispatched job after creds
|
||||||
|
change runs `restic init`, surfaces the error eagerly under
|
||||||
|
the host's vitals strip if the creds are bad, and resets the
|
||||||
|
host's `repo_status` so the operator can retry without
|
||||||
|
hunting through job logs.
|
||||||
|
|
||||||
|
## What we don't defend against
|
||||||
|
|
||||||
|
- **Insider threat at the maintainer level.** A malicious
|
||||||
|
maintainer can publish a backdoored container; SBOM /
|
||||||
|
signing infrastructure (Phase 6 candidate) would help here
|
||||||
|
but isn't shipped today.
|
||||||
|
- **Supply chain.** We pin module versions (`go.sum`) and
|
||||||
|
pin the Tailwind binary's release tag, but a compromise in
|
||||||
|
one of those upstreams would land here.
|
||||||
|
- **Side-channel via restic itself.** A bug in restic that
|
||||||
|
enables snapshot-content disclosure is restic's problem; the
|
||||||
|
control plane doesn't see snapshot bytes either way.
|
||||||
|
- **DoS via resource exhaustion** without the recommended
|
||||||
|
reverse-proxy / rate-limit in front. Don't expose the
|
||||||
|
server's HTTP port to the public internet directly.
|
||||||
@@ -0,0 +1,120 @@
|
|||||||
|
# End-to-end test harness
|
||||||
|
|
||||||
|
The e2e harness stands up the full production-shaped stack
|
||||||
|
(server + agent + rest-server) in Docker Compose and drives it
|
||||||
|
through Playwright. CI runs it on every PR; operators can run it
|
||||||
|
locally too.
|
||||||
|
|
||||||
|
## Files
|
||||||
|
|
||||||
|
```
|
||||||
|
e2e/
|
||||||
|
├── compose.e2e.yml compose stack: server + rest-server + agent
|
||||||
|
├── Dockerfile.agent Linux container for the agent (alpine + restic)
|
||||||
|
├── agent-entrypoint.sh decides between announce / token-enrol / run
|
||||||
|
└── playwright/
|
||||||
|
├── package.json
|
||||||
|
├── playwright.config.ts
|
||||||
|
└── tests/
|
||||||
|
├── lib/server.ts bootstrap, login, accept, poll helpers
|
||||||
|
└── smoke.spec.ts happy-path: enrol → backup → succeeded
|
||||||
|
```
|
||||||
|
|
||||||
|
## Local run
|
||||||
|
|
||||||
|
Prerequisites: Docker + Docker Compose, and `npx` for Playwright.
|
||||||
|
|
||||||
|
```sh
|
||||||
|
# 1. Build + bring up the stack (server, rest-server, source data).
|
||||||
|
docker compose -f e2e/compose.e2e.yml up --build -d server rest-server source-fixture
|
||||||
|
|
||||||
|
# 2. Wait for the server, then scrape the bootstrap token from the log.
|
||||||
|
until curl -fsS http://127.0.0.1:8080/api/version >/dev/null; do sleep 1; done
|
||||||
|
RM_BOOTSTRAP_TOKEN=$(docker compose -f e2e/compose.e2e.yml logs server \
|
||||||
|
| grep -Eo '[a-zA-Z0-9_-]{40,}' | head -1)
|
||||||
|
export RM_BOOTSTRAP_TOKEN
|
||||||
|
|
||||||
|
# 3. Start the agent (it announces against the running server).
|
||||||
|
docker compose -f e2e/compose.e2e.yml up -d agent
|
||||||
|
|
||||||
|
# 4. Install + run Playwright.
|
||||||
|
cd e2e/playwright
|
||||||
|
npm install
|
||||||
|
npx playwright install --with-deps chromium
|
||||||
|
npx playwright test
|
||||||
|
```
|
||||||
|
|
||||||
|
When the test passes you'll see:
|
||||||
|
|
||||||
|
```
|
||||||
|
Running 2 tests using 1 worker
|
||||||
|
✓ smoke: enrol-via-announce → backup › happy path completes in under a minute (47s)
|
||||||
|
✓ smoke: scrape /metrics › metrics endpoint exposes the host gauge (180ms)
|
||||||
|
|
||||||
|
2 passed (47.5s)
|
||||||
|
```
|
||||||
|
|
||||||
|
Tear-down:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
docker compose -f e2e/compose.e2e.yml down -v
|
||||||
|
```
|
||||||
|
|
||||||
|
`-v` removes the named volumes too — important between runs because
|
||||||
|
the rest-server volume holds an initialised repo and the
|
||||||
|
agent-config volume holds a stale bearer.
|
||||||
|
|
||||||
|
## What the test exercises
|
||||||
|
|
||||||
|
1. **Bootstrap.** Posts the admin-creation request to
|
||||||
|
`/api/bootstrap` with the token scraped from the server log.
|
||||||
|
2. **Login (UI).** Drives the login form via Playwright; verifies
|
||||||
|
the dashboard loads with a session cookie set.
|
||||||
|
3. **Pending host appears.** Polls the dashboard for the inline
|
||||||
|
accept form generated by the announcing agent; reads the
|
||||||
|
pending-id out of its action URL.
|
||||||
|
4. **Accept.** POSTs `/api/pending-hosts/{id}/accept` with the
|
||||||
|
rest-server URL + repo password. The server mints a Host row
|
||||||
|
+ bearer + AEAD-encrypted creds and pushes the bearer down
|
||||||
|
the still-open pending WebSocket.
|
||||||
|
5. **Online + auto-init.** Polls `/api/hosts` until the new host
|
||||||
|
is `status=online`. Auto-init runs as part of this — the
|
||||||
|
first dispatched job after creds save is `restic init`.
|
||||||
|
6. **Run backup.** Submits the host detail page's `Run now`
|
||||||
|
form; expects `HX-Redirect` to the live job page.
|
||||||
|
7. **Verify.** Polls `/api/hosts` until the host's
|
||||||
|
`last_backup_status` flips to `succeeded`.
|
||||||
|
8. **Metrics.** Scrapes `/metrics` and asserts the
|
||||||
|
server-gauge + build-info lines are present (the compose
|
||||||
|
stack opens the endpoint via `RM_METRICS_TRUSTED_CIDR=0.0.0.0/0`).
|
||||||
|
|
||||||
|
## CI workflow
|
||||||
|
|
||||||
|
[`.gitea/workflows/e2e.yml`](../.gitea/workflows/e2e.yml) runs the
|
||||||
|
suite on every PR into `main`. On failure it dumps the last 200
|
||||||
|
lines of each container log as a workflow annotation and uploads
|
||||||
|
the Playwright HTML report as an artefact.
|
||||||
|
|
||||||
|
## When tests fail
|
||||||
|
|
||||||
|
- **Pending host never appears.** Agent container probably
|
||||||
|
couldn't reach the server. Check `docker compose logs agent`
|
||||||
|
for connection errors and `docker compose logs server` for
|
||||||
|
any 4xx on `/api/agents/announce`.
|
||||||
|
- **Backup hangs in `running`.** The agent shells out to
|
||||||
|
`restic`; check the live job log at
|
||||||
|
`http://127.0.0.1:8080/jobs/<id>` (still up after a
|
||||||
|
failed test as long as you didn't `down -v`).
|
||||||
|
- **`RM_BOOTSTRAP_TOKEN not set`.** The server log scrape
|
||||||
|
matched the wrong line or the token regex is too tight. The
|
||||||
|
server prints the token on a line starting with `    ` (four
|
||||||
|
spaces) inside a banner; widen the regex if your server log
|
||||||
|
format changes.
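
The local-run scrape keeps the first run of 40 or more URL-safe characters in the server log, so any earlier long identifier can shadow the real token. A minimal Go sketch of the same match follows; `firstToken` and the sample log lines are hypothetical, and nothing is assumed about the real banner beyond what is described above.

```go
package main

import (
	"fmt"
	"regexp"
)

// Same character class and minimum length as the grep used in the local run.
var tokenRE = regexp.MustCompile(`[a-zA-Z0-9_-]{40,}`)

// firstToken returns the leftmost match in log, mirroring
// `grep -Eo ... | head -1`. The bool is false when nothing matched.
func firstToken(log string) (string, bool) {
	m := tokenRE.FindString(log)
	return m, m != ""
}

func main() {
	// Hypothetical log: a 40-character hex string appears before the banner.
	log := "build 0123456789abcdef0123456789abcdef01234567\n" +
		"    AbCdEfGhIjKlMnOpQrStUvWxYz0123456789_-AbCdEf\n"
	tok, ok := firstToken(log)
	fmt.Println(tok, ok) // the hex string is matched first, not the banner token
}
```

If the scrape picks up the wrong value, anchoring the pattern on the four-space banner prefix is one way to narrow it.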
|
||||||
|
|
||||||
|
## Adding new tests
|
||||||
|
|
||||||
|
The harness is intentionally flat — one `*.spec.ts` per
|
||||||
|
scenario. Reuse the helpers in `lib/server.ts` and avoid
|
||||||
|
duplicating bootstrap / login boilerplate. Heavy fixtures
|
||||||
|
(custom users, OIDC IdP) belong in their own compose override
|
||||||
|
file rather than complicating `compose.e2e.yml`.
|
||||||
@@ -0,0 +1,139 @@
|
|||||||
|
# Prometheus + Grafana
|
||||||
|
|
||||||
|
restic-manager exposes a Prometheus scrape endpoint at `GET /metrics`.
|
||||||
|
The endpoint is **opt-in** — it is not mounted at all unless you set
|
||||||
|
at least one of the auth gates below. Once enabled, it serves the
|
||||||
|
standard `text/plain` exposition format that every Prometheus
|
||||||
|
release from 2.x onward parses without extra configuration.
|
||||||
|
|
||||||
|
A sample Grafana dashboard lives at
|
||||||
|
`deploy/grafana/restic-manager-dashboard.json`.
|
||||||
|
|
||||||
|
## Enable the endpoint
|
||||||
|
|
||||||
|
Two switches, both off by default. If both are set, both must pass
|
||||||
|
(token AND source-IP); if only one is set, that gate alone
|
||||||
|
authorises a scrape.
|
||||||
|
|
||||||
|
| Env var | YAML key | Effect |
|
||||||
|
|----------------------------|------------------------|--------|
|
||||||
|
| `RM_METRICS_TOKEN` | `metrics_token` | Requires `Authorization: Bearer <token>`. Compared in constant time. |
|
||||||
|
| `RM_METRICS_TRUSTED_CIDR` | `metrics_trusted_cidrs` (list) | Restricts the source IP to one of the listed CIDRs. Comma-separated in env, list in YAML. Honours `X-Forwarded-For` only when the immediate hop matches `RM_TRUSTED_PROXY`. |
|
||||||
|
|
||||||
|
When neither is set, `GET /metrics` returns 404 — the route is not
|
||||||
|
registered with the chi router so a forgotten config can't
|
||||||
|
accidentally publish fleet state.
|
||||||
|
|
||||||
|
### Example: Docker
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
services:
|
||||||
|
restic-manager:
|
||||||
|
image: gitea.dcglab.co.uk/steve/restic-manager:latest
|
||||||
|
environment:
|
||||||
|
RM_METRICS_TOKEN_FILE: /run/secrets/rm_metrics_token
|
||||||
|
RM_METRICS_TRUSTED_CIDR: "10.0.0.0/8"
|
||||||
|
secrets:
|
||||||
|
- rm_metrics_token
|
||||||
|
```
|
||||||
|
|
||||||
|
(Caution: `RM_METRICS_TOKEN_FILE`, as used in the example above, is not currently supported; set
|
||||||
|
`RM_METRICS_TOKEN` directly. The `_FILE` convention is on the
|
||||||
|
roadmap.)
|
||||||
|
|
||||||
|
## Prometheus scrape config
|
||||||
|
|
||||||
|
Drop into your `prometheus.yml`:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
scrape_configs:
|
||||||
|
- job_name: restic-manager
|
||||||
|
metrics_path: /metrics
|
||||||
|
scheme: https # via your reverse proxy
|
||||||
|
static_configs:
|
||||||
|
- targets: ['restic.example.com']
|
||||||
|
authorization:
|
||||||
|
type: Bearer
|
||||||
|
credentials_file: /etc/prometheus/secrets/rm_metrics_token
|
||||||
|
```
|
||||||
|
|
||||||
|
If you don't run a TLS-terminating proxy in front, drop `scheme:
|
||||||
|
https` (the server is HTTP-only — see `docs/reverse-proxy.md`).
|
||||||
|
|
||||||
|
## Metric reference
|
||||||
|
|
||||||
|
All names are `rm_`-prefixed. Per-host metrics carry a `host_id`
|
||||||
|
label (the stable ULID, immune to renames) and a `host` label
|
||||||
|
(the human-readable name).
|
||||||
|
|
||||||
|
### Server gauges
|
||||||
|
|
||||||
|
| Name | Labels | Description |
|
||||||
|
|-----------------------|------------------------------------|-------------|
|
||||||
|
| `rm_hosts_total` | — | Total number of enrolled hosts (excludes pending announces). |
|
||||||
|
| `rm_hosts_online` | — | Number of hosts with `status='online'`. |
|
||||||
|
| `rm_active_alerts` | `severity` ∈ {info, warning, critical} | Open alerts by severity. |
|
||||||
|
| `rm_build_info` | `version, commit, go_version` | Always 1; pure label-bag for joining. |
|
||||||
|
|
||||||
|
### Per-host gauges
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
|--------------------------------------------|-------------|
|
||||||
|
| `rm_host_agent_online` | 1 if the agent is currently online, 0 otherwise. |
|
||||||
|
| `rm_host_last_backup_timestamp_seconds` | Unix timestamp of the host's most recent backup. **Omitted** for hosts with no backup yet. |
|
||||||
|
| `rm_host_last_backup_success` | 1 if the most recent backup succeeded, 0 otherwise. **Omitted** for hosts with no backup yet. |
|
||||||
|
| `rm_host_repo_size_bytes` | Latest reported repo size from `restic stats --mode raw-data`. **Omitted** when unknown. |
|
||||||
|
| `rm_host_snapshot_count` | Number of restic snapshots known on the host's repo. |
|
||||||
|
| `rm_host_open_alerts` | Number of currently open alerts attached to this host. |
|
||||||
|
| `rm_host_repo_status` | Always 1; the `status` label carries `unknown` / `ready` / `init_failed`. |
|
||||||
|
|
||||||
|
### Job duration histogram
|
||||||
|
|
||||||
|
```
|
||||||
|
rm_job_duration_seconds_bucket{kind, status, le}
|
||||||
|
rm_job_duration_seconds_sum{kind, status}
|
||||||
|
rm_job_duration_seconds_count{kind, status}
|
||||||
|
```
|
||||||
|
|
||||||
|
`kind` ∈ {backup, forget, prune, check, unlock, restore, diff, init, update}.
|
||||||
|
`status` ∈ {succeeded, failed, cancelled}.
|
||||||
|
|
||||||
|
Buckets (seconds):
|
||||||
|
|
||||||
|
```
|
||||||
|
1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf
|
||||||
|
1s 5s 30s 1m 5m 30m 1h 6h 24h
|
||||||
|
```
|
||||||
|
|
||||||
|
The histogram is in-memory only — values reset on process restart.
|
||||||
|
Operators who want durable history should let Prometheus persist
|
||||||
|
the scrapes; restic-manager itself is a control plane, not a
|
||||||
|
metrics database.
|
||||||
|
|
||||||
|
## Grafana dashboard
|
||||||
|
|
||||||
|
Import `deploy/grafana/restic-manager-dashboard.json`:
|
||||||
|
|
||||||
|
1. In Grafana, **+ → Import → Upload JSON file**.
|
||||||
|
2. Pick the Prometheus data source you scrape with.
|
||||||
|
3. The dashboard's six panels populate from the metrics above:
|
||||||
|
* **Fleet status** — online/total stat panel.
|
||||||
|
* **Open alerts** — by severity.
|
||||||
|
* **Hosts** — per-host table (last backup, repo size, snapshots, alerts).
|
||||||
|
* **Repo size over time** — one line per host.
|
||||||
|
* **Backups failing** — count of hosts whose last backup didn't succeed.
|
||||||
|
* **Job duration p95** — `histogram_quantile(0.95, …)` over a 1h window per kind.
|
||||||
|
|
||||||
|
Alerting is intentionally not configured in the dashboard — the
|
||||||
|
control plane already has alerts (P3-05) with native channels for
|
||||||
|
webhook, ntfy, and SMTP. Re-implementing them in Prometheus would
|
||||||
|
just duplicate state. If you do want Prom-side alerts, copy the
|
||||||
|
recording rules into your usual location.
|
||||||
|
|
||||||
|
## Cardinality
|
||||||
|
|
||||||
|
Per scrape: O(hosts) gauge rows + O(kinds × statuses × buckets)
|
||||||
|
histogram rows. A 100-host fleet emits roughly 700 host rows + 270
|
||||||
|
histogram rows — well below any practical limit. There are no
|
||||||
|
`job_id` labels (cardinality bomb avoidance) and no per-source-group
|
||||||
|
labels.
|
||||||
Binary file not shown.
|
After Width: | Height: | Size: 27 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 98 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 178 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 48 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 92 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 47 KiB |
@@ -0,0 +1,61 @@
|
|||||||
|
# Plan — P6-04 + P6-05 Prometheus metrics + Grafana dashboard
|
||||||
|
|
||||||
|
Spec: `docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md`
|
||||||
|
|
||||||
|
## Step 1 — Config wiring
|
||||||
|
|
||||||
|
- Add fields to `internal/server/config/config.go`:
|
||||||
|
- `MetricsToken string` (yaml `metrics_token`)
|
||||||
|
- `MetricsTrustedCIDRs []string` (yaml `metrics_trusted_cidrs`)
|
||||||
|
- method `(c Config) MetricsAuthEnabled() bool` returning true iff at least one of the two is configured.
|
||||||
|
- Env loading: `RM_METRICS_TOKEN` and `RM_METRICS_TRUSTED_CIDR` (comma-CIDR).
|
||||||
|
- `validate()` extension: ensure each CIDR parses (reuse the same `netip.ParsePrefix` pattern that already validates `TrustedProxies`).
|
||||||
|
- Tests: extend `config_test.go` covering both env vars + happy/sad CIDR.
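
A minimal sketch of the env-side parsing and validation, assuming the comma-separated form described above. `parseCIDRList` is a hypothetical name; the real config package may split parsing and validation differently.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// parseCIDRList splits a comma-separated value such as the one carried by
// RM_METRICS_TRUSTED_CIDR and parses every entry with netip.ParsePrefix.
// Hypothetical helper, not the project's actual function.
func parseCIDRList(raw string) ([]netip.Prefix, error) {
	var out []netip.Prefix
	for _, part := range strings.Split(raw, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue // tolerate blanks and trailing commas
		}
		p, err := netip.ParsePrefix(part)
		if err != nil {
			return nil, fmt.Errorf("metrics trusted CIDR %q: %w", part, err)
		}
		out = append(out, p.Masked()) // normalise host bits, e.g. 10.1.2.3/8 -> 10.0.0.0/8
	}
	return out, nil
}

func main() {
	prefixes, err := parseCIDRList("10.0.0.0/8, 192.168.1.0/24")
	fmt.Println(prefixes, err)

	_, err = parseCIDRList("10.0.0.0/33") // prefix length out of range
	fmt.Println(err != nil)
}
```

Failing at load time keeps a typo in the allow-list from silently turning into a gate that never matches.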
|
||||||
|
|
||||||
|
## Step 2 — `internal/server/metrics` package
|
||||||
|
|
||||||
|
- `Registry` struct: `sync.Mutex`, `map[jobKey]*histogramState` where `jobKey = struct{kind,status string}`.
|
||||||
|
- `ObserveJob(kind, status string, dur time.Duration)` — clamps negative durations to 0; locks; bumps the right bucket + sum + count.
|
||||||
|
- `Snapshot() Snapshot` — copies state under lock; returns plain value type.
|
||||||
|
- `Snapshot` carries `Histogram` rows (kind, status, buckets, sum, count) and accepts the rest from the caller (host rows, alert counts, build info).
|
||||||
|
- `Render(w io.Writer, s Snapshot) error` — emits standard text exposition with stable line ordering. No external dep; manual escape of `\` `"` `\n` in label values per the Prom format spec.
|
||||||
|
- Unit tests: golden render, concurrent observe, bucket boundaries.
|
||||||
|
|
||||||
|
## Step 3 — HTTP handler
|
||||||
|
|
||||||
|
- New `internal/server/http/metrics.go`:
|
||||||
|
- `(s *Server) handleMetrics(w, r)` — calls `authoriseMetricsScrape`, then `gatherSnapshot(ctx)` then `metrics.Render`.
|
||||||
|
  - `authoriseMetricsScrape(r, cfg) (ok bool, status int)` — pure helper; bearer token compared with `subtle.ConstantTimeCompare`; CIDR check on `r.RemoteAddr` first, then `X-Forwarded-For` if a trusted proxy fronted us (mirror `realIP`'s logic; the simplest path is to reuse the `chi/middleware.RealIP`-aware lookup that the existing handlers use).
|
||||||
|
- `gatherSnapshot(ctx)` — assembles the snapshot from `Store.ListHosts`, `Store.ListAlerts({Status:"open"})`, the metrics registry, and `version.Version`/`version.Commit`/`runtime.Version()`.
|
||||||
|
- Route mounted in `server.go` only if `s.deps.Cfg.MetricsAuthEnabled()`.
|
||||||
|
- `Deps` grows a `Metrics *metrics.Registry` field; nil-tolerant in handlers.
|
||||||
|
|
||||||
|
## Step 4 — Hook job-finished
|
||||||
|
|
||||||
|
- `internal/server/ws/handler.go`:
|
||||||
|
- `HandlerDeps` grows `Metrics *metrics.Registry`.
|
||||||
|
- In the `MsgJobFinished` branch, after the `GetJob` lookup we already do, observe `(job.Kind, p.Status.String(), p.FinishedAt.Sub(deref(job.StartedAt)))`. Skip if `job.StartedAt` is nil (rare race).
|
||||||
|
- `cmd/server` wires the registry into both `Deps` and `HandlerDeps` from a single instance.
|
||||||
|
|
||||||
|
## Step 5 — Tests
|
||||||
|
|
||||||
|
- `internal/server/metrics/registry_test.go` — observe + snapshot determinism.
|
||||||
|
- `internal/server/metrics/render_test.go` — golden output for a fixed snapshot.
|
||||||
|
- `internal/server/http/metrics_test.go` — auth matrix (six cases per the spec) using the existing `newTestServer` fixture pattern. Render snapshot includes ≥1 host so we exercise the gather path end-to-end.
|
||||||
|
|
||||||
|
## Step 6 — Docs + dashboard (P6-05)
|
||||||
|
|
||||||
|
- `docs/prometheus.md` — enable + scrape config + metric reference + dashboard import.
|
||||||
|
- `deploy/grafana/restic-manager-dashboard.json` — six-panel dashboard. Hand-authored against Grafana 11 dashboard schema (uid, schemaVersion, panels with `targets[].expr`, datasource as variable). Validated by importing into Grafana — but since we can't run Grafana in CI, the structural sniff test is just that the JSON parses and contains the expected panel titles + datasource variable.
|
||||||
|
|
||||||
|
## Step 7 — Tasks.md + verification
|
||||||
|
|
||||||
|
- Strike P6-04, P6-05 in `tasks.md`; add an "as shipped" note mirroring the prior P6 entries.
|
||||||
|
- Run `go vet ./...`, `go test ./...`, `make build`.
|
||||||
|
- Push branch (no PR per standing instruction).
|
||||||
|
|
||||||
|
## Risk register
|
||||||
|
|
||||||
|
- **CIDR check for proxied scrapes** — easy to mis-implement, easy to mis-document. The handler test must exercise both "direct hit" and "X-Forwarded-For" paths.
|
||||||
|
- **Histogram lock contention** — every job finish takes the mutex. Job throughput is low (a few/min/host max), and `ObserveJob` is a couple of map lookups; no risk in practice.
|
||||||
|
- **Dashboard JSON drift** — Grafana versions evolve. Pinning `schemaVersion` and using only well-supported panel types (timeseries, stat, table) keeps the import working across recent versions.
|
||||||
@@ -0,0 +1,175 @@
|
|||||||
|
# P6-04 + P6-05 — Prometheus `/metrics` + Grafana dashboard
|
||||||
|
|
||||||
|
Date: 2026-05-07
|
||||||
|
Author: Claude (autonomous, sensible-defaults brief from operator)
|
||||||
|
Tasks: P6-04 (M), P6-05 (S)
|
||||||
|
|
||||||
|
## Problem
|
||||||
|
|
||||||
|
The control plane already knows everything a backup operator needs
|
||||||
|
to monitor — last-backup timestamp + status, repo size, snapshot
|
||||||
|
count, agent online, open alerts, build version — but it surfaces
|
||||||
|
those only through the dashboard HTML and a few JSON endpoints. To
|
||||||
|
plug into the operator's existing observability stack we need a
|
||||||
|
plain Prometheus exposition endpoint and a Grafana dashboard JSON
|
||||||
|
that reads from it.
|
||||||
|
|
||||||
|
## Goals
|
||||||
|
|
||||||
|
- `GET /metrics` emits standard Prometheus text-format with the
|
||||||
|
per-host, server, and job-duration metrics enumerated in the
|
||||||
|
task entry (P6-04 in `tasks.md`).
|
||||||
|
- Endpoint is opt-in and gated by a bearer token and/or an IP
|
||||||
|
allow-list — never publicly readable by default.
|
||||||
|
- No new third-party dependency (`prometheus/client_golang` is not
|
||||||
|
pulled in). The exposition format is small and stable enough to
|
||||||
|
emit by hand; matches the repo's "no Tailwind/Node" style.
|
||||||
|
- Sample Grafana dashboard committed to the repo so a stranger can
|
||||||
|
drop it into a Grafana instance and get a working view.
|
||||||
|
|
||||||
|
## Non-goals
|
||||||
|
|
||||||
|
- OpenMetrics output (the classic text format with `# HELP`/`# TYPE` is
|
||||||
|
what every prom server still parses and what every example
|
||||||
|
online demonstrates — pick the boring option).
|
||||||
|
- Pushgateway or remote-write integration.
|
||||||
|
- Per-job metric cardinality (no `job_id` labels — that would
|
||||||
|
make the histogram explode).
|
||||||
|
- Alerting rules. Operators already have alerts inside
|
||||||
|
restic-manager (P3-05); duplicating them in Prometheus is a
|
||||||
|
YAGNI hazard. The dashboard is read-only.
|
||||||
|
|
||||||
|
## Auth
|
||||||
|
|
||||||
|
Two switches, both off by default. If neither is set the route
|
||||||
|
isn't mounted at all (404 from the chi router) — this avoids any
|
||||||
|
accidental "wide-open scrape endpoint" deployment.
|
||||||
|
|
||||||
|
| env var | type | meaning |
|
||||||
|
| --- | --- | --- |
|
||||||
|
| `RM_METRICS_TOKEN` | string | If set, callers must send `Authorization: Bearer <token>`. Compared with `crypto/subtle.ConstantTimeCompare`. |
|
||||||
|
| `RM_METRICS_TRUSTED_CIDR` | comma-CIDR | If set, callers must hit from a source IP inside one of these CIDRs. Reuses the existing `RM_TRUSTED_PROXY` semantics for honouring `X-Forwarded-For`. |
|
||||||
|
|
||||||
|
If both are set, both must pass (AND, not OR — a token leak doesn't grant access from outside the network, and a trusted-network compromise doesn't grant access without the token). If only one is set, that one alone gates access.
|
||||||
|
|
||||||
|
YAML overlay mirrors env: `metrics_token`, `metrics_trusted_cidrs`.
|
||||||
|
|
||||||
|
## Metrics
|
||||||
|
|
||||||
|
All metric names are prefixed `rm_`. Help text is concise.
|
||||||
|
|
||||||
|
### Per-host gauges (one row per `host_id`)
|
||||||
|
|
||||||
|
```
|
||||||
|
rm_host_agent_online{host_id,host} 1 if status='online' else 0
|
||||||
|
rm_host_last_backup_timestamp_seconds{host_id,host} unix seconds; omitted if no backup yet
|
||||||
|
rm_host_last_backup_success{host_id,host} 1 if last_backup_status='succeeded' else 0; omitted if no backup yet
|
||||||
|
rm_host_repo_size_bytes{host_id,host} total_size from latest repo stats; omitted if unknown
|
||||||
|
rm_host_snapshot_count{host_id,host} integer
|
||||||
|
rm_host_open_alerts{host_id,host} count of open + un-resolved alerts attached to this host
|
||||||
|
rm_host_repo_status{host_id,host,status} 1, with status ∈ {unknown,ready,init_failed} (info-style, exactly one row per host)
|
||||||
|
```
|
||||||
|
|
||||||
|
`host` label is `hosts.name` for human readability; `host_id` is
|
||||||
|
the stable ULID for joining across renames.
|
||||||
|
|
||||||
|
### Server gauges
|
||||||
|
|
||||||
|
```
|
||||||
|
rm_hosts_total count of hosts (excludes pending)
|
||||||
|
rm_hosts_online count of hosts with status='online'
|
||||||
|
rm_active_alerts{severity} count of open alerts by severity ∈ {info,warning,critical}
|
||||||
|
rm_build_info{version,commit,go_version} always 1; pure label-bag for joining
|
||||||
|
```
|
||||||
|
|
||||||
|
### Job duration histogram
|
||||||
|
|
||||||
|
```
|
||||||
|
rm_job_duration_seconds_bucket{kind,status,le=...}
|
||||||
|
rm_job_duration_seconds_sum{kind,status}
|
||||||
|
rm_job_duration_seconds_count{kind,status}
|
||||||
|
```
|
||||||
|
|
||||||
|
`kind` ∈ {backup,forget,prune,check,unlock,restore,diff,init,update}
|
||||||
|
(every JobKind we currently dispatch). `status` ∈
|
||||||
|
{succeeded,failed,cancelled}. Buckets cover the realistic range —
|
||||||
|
short admin commands (unlock, init) finish in seconds; backups can
|
||||||
|
be hours:
|
||||||
|
|
||||||
|
```
|
||||||
|
1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf
|
||||||
|
(1s 5s 30s 1m 5m 30m 1h 6h 24h)
|
||||||
|
```
|
||||||
|
|
||||||
|
In-memory only. Reset on process restart — operators who want
|
||||||
|
durable history scrape into Prom and let it persist.
|
||||||
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
New package `internal/server/metrics`:
|
||||||
|
|
||||||
|
- `Registry` — owns the histogram state (sync.Mutex + map keyed by
|
||||||
|
`kind+status`). `ObserveJob(kind, status string, dur time.Duration)`
|
||||||
|
is the only mutator. Lookups via `Snapshot()` are read-only and
|
||||||
|
copy out.
|
||||||
|
- `Render(w io.Writer, snapshot Snapshot)` — emits the full
|
||||||
|
exposition body. The snapshot is supplied by the HTTP handler
|
||||||
|
pulling from `Store` on each scrape; the package itself has no
|
||||||
|
store dependency, which keeps it trivially unit-testable.
|
||||||
|
|
||||||
|
New file `internal/server/http/metrics.go`:
|
||||||
|
|
||||||
|
- `handleMetrics(w, r)` — auth check (bearer + CIDR), pull current
|
||||||
|
fleet snapshot from `Store`, ask `metrics.Render` to emit.
|
||||||
|
- Auth helper `authoriseMetricsScrape(r)` — pure function over
|
||||||
|
request + config; tested directly.
|
||||||
|
|
||||||
|
Wiring:
|
||||||
|
|
||||||
|
- `cmd/server` constructs the `metrics.Registry` once and threads
|
||||||
|
it into both `Deps` (for the HTTP layer) and `ws.HandlerDeps`
|
||||||
|
(so the job-finished branch can call `ObserveJob`).
|
||||||
|
- `ws/handler.go` MsgJobFinished branch grows a single line:
|
||||||
|
`if deps.Metrics != nil { deps.Metrics.ObserveJob(job.Kind, p.Status, p.FinishedAt.Sub(job.StartedAt)) }`.
|
||||||
|
Falls back gracefully if the registry was never wired (tests).
|
||||||
|
|
||||||
|
Route registration in `server.go`:
|
||||||
|
|
||||||
|
```go
|
||||||
|
if s.deps.Cfg.MetricsAuthEnabled() {
|
||||||
|
r.Get("/metrics", s.handleMetrics)
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
## Cardinality + cost
|
||||||
|
|
||||||
|
Per scrape: O(hosts) gauge rows + O(kinds × statuses × buckets) histogram rows. For a 100-host fleet that's ~700 host rows + ~270 histogram rows per scrape — well under any practical limit. The store reads we issue per scrape are: `ListHosts` (already exists, one query), `ListAlerts` filtered by open status (one query), `GetHostRepoStats` already projected onto `Host` via `repo_size_bytes`. No N+1.
|
||||||
|
|
||||||
|
A 10s scrape interval against a 100-host fleet is cheap: each scrape is a couple of indexed sqlite reads + a small string render. We're nowhere near a place where caching the snapshot would be worthwhile.
|
||||||
|
|
||||||
|
## Documentation (P6-05)
|
||||||
|
|
||||||
|
- `docs/prometheus.md` — sibling to the existing `docs/reverse-proxy.md`. Sections: enabling the endpoint (env vars), Prometheus scrape config snippet (with bearer + tls), the metric reference table (copy-pasted from this spec), the dashboard import instructions.
|
||||||
|
- `deploy/grafana/restic-manager-dashboard.json` — Grafana 11+ dashboard JSON. Single Prometheus datasource variable, six panels:
|
||||||
|
1. **Fleet status** — stat panel showing `rm_hosts_online / rm_hosts_total` + a sparkline.
|
||||||
|
2. **Open alerts** — stat panel by severity (`sum by (severity) (rm_active_alerts)`).
|
||||||
|
3. **Hosts** — table of `host`, `online`, `last_backup` (relative time via `time() - rm_host_last_backup_timestamp_seconds`), `repo_size`, `snapshots`.
|
||||||
|
4. **Repo size over time** — time series, one line per host, `rm_host_repo_size_bytes`.
|
||||||
|
5. **Backups failing** — time series counting hosts where `rm_host_last_backup_success == 0`.
|
||||||
|
6. **Job duration p95** — `histogram_quantile(0.95, sum by (kind, le) (rate(rm_job_duration_seconds_bucket[1h])))` over a 1h window.
|
||||||
|
|
||||||
|
Dashboard is committed as plain JSON; an operator imports it through the Grafana UI ("+ → Import → upload JSON file") or Grafana provisioning.
|
||||||
|
|
||||||
|
## Testing
|
||||||
|
|
||||||
|
- Unit tests for `metrics.Render` against a fixed snapshot — golden-file style. Pin the exact line ordering (sorted by metric name + label set) so diffs stay tractable.
|
||||||
|
- Unit tests for `metrics.Registry.ObserveJob` — concurrent writes, bucket boundary correctness, snapshot independence.
|
||||||
|
- Handler tests for `/metrics` covering: no auth configured → 404; token configured + missing → 401; token configured + correct → 200 + body sniff; CIDR configured + wrong source → 401; CIDR configured + right source → 200; both configured → require both.
|
||||||
|
- End-to-end smoke verification deferred to manual operator walk-through; full Playwright pass is P5-06's job.
|
||||||
|
|
||||||
|
## Out of scope, explicitly
|
||||||
|
|
||||||
|
- Per-job latency tracking with `job_id` labels (cardinality bomb).
|
||||||
|
- Restore-specific metrics (P3 surfaces are still settling).
|
||||||
|
- Histograms keyed on host (kind × status × buckets × hosts is a Prom anti-pattern).
|
||||||
|
- Auto-discovery / file-SD generators for Prometheus.
|
||||||
@@ -0,0 +1,42 @@
|
|||||||
|
# Build a Linux container that runs the restic-manager agent against a
|
||||||
|
# sibling rest-server in the e2e compose stack. Used only by tests
|
||||||
|
# (e2e/compose.e2e.yml + .gitea/workflows/e2e.yml).
|
||||||
|
#
|
||||||
|
# Two stages:
|
||||||
|
#   1. golang:1.25-alpine to build the agent binary.
|
||||||
|
# 2. alpine:3.20 with the `restic` package + the built binary.
|
||||||
|
#
|
||||||
|
# Base images are pinned by version tag for CI reproducibility.
|
||||||
|
|
||||||
|
FROM golang:1.25-alpine AS build
|
||||||
|
WORKDIR /src
|
||||||
|
|
||||||
|
ENV CGO_ENABLED=0 \
|
||||||
|
GOFLAGS="-trimpath"
|
||||||
|
|
||||||
|
COPY go.mod go.sum* ./
|
||||||
|
RUN go mod download
|
||||||
|
|
||||||
|
COPY . .
|
||||||
|
ARG VERSION=e2e
|
||||||
|
RUN go build -ldflags="-s -w -X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=${VERSION}" \
|
||||||
|
-o /out/restic-manager-agent ./cmd/agent
|
||||||
|
|
||||||
|
FROM alpine:3.20
|
||||||
|
RUN apk add --no-cache restic ca-certificates curl
|
||||||
|
COPY --from=build /out/restic-manager-agent /usr/local/bin/restic-manager-agent
|
||||||
|
|
||||||
|
# Agents normally run as root because backup paths often need it. The
|
||||||
|
# e2e fixture only backs up paths under /data which we own, so this
|
||||||
|
# container would tolerate a non-root user — but staying root keeps
|
||||||
|
# parity with the production install.
|
||||||
|
USER root
|
||||||
|
|
||||||
|
# The agent needs a writable directory for its config + secrets store.
|
||||||
|
RUN mkdir -p /etc/restic-manager /var/lib/restic-manager-agent
|
||||||
|
ENV RM_AGENT_CONFIG=/etc/restic-manager/agent.yaml
|
||||||
|
|
||||||
|
# The compose entrypoint sets the announce URL via env.
|
||||||
|
COPY e2e/agent-entrypoint.sh /usr/local/bin/entrypoint.sh
|
||||||
|
RUN chmod +x /usr/local/bin/entrypoint.sh
|
||||||
|
ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]
|
||||||
@@ -0,0 +1,21 @@
|
|||||||
|
# Playwright runner for the e2e suite. Built and run by
|
||||||
|
# e2e/compose.e2e.yml so the test process sits on the same docker
|
||||||
|
# network as the server, agent, and rest-server. The previous setup
|
||||||
|
# ran Playwright on the workflow runner host and reached the server
|
||||||
|
# via 127.0.0.1:8080; that fails on Gitea's act-style runners
|
||||||
|
# because the workflow steps execute inside a runner container,
|
||||||
|
# not on the host where compose publishes its ports.
|
||||||
|
|
||||||
|
FROM mcr.microsoft.com/playwright:v1.59.1-jammy
|
||||||
|
|
||||||
|
WORKDIR /work
|
||||||
|
|
||||||
|
# Install npm deps in a separate layer keyed off package.json so
|
||||||
|
# changes to specs don't bust the dep cache.
|
||||||
|
COPY e2e/playwright/package.json /work/package.json
|
||||||
|
RUN npm install --no-audit --no-fund
|
||||||
|
|
||||||
|
COPY e2e/playwright/ /work/
|
||||||
|
|
||||||
|
ENV CI=1
|
||||||
|
ENTRYPOINT ["npx", "playwright", "test"]
|
||||||
Executable
+27
@@ -0,0 +1,27 @@
|
|||||||
|
#!/bin/sh
|
||||||
|
# Entrypoint for the e2e agent container.
|
||||||
|
#
|
||||||
|
# Three states:
|
||||||
|
# 1. Already enrolled (agent.yaml has a bearer): run the agent.
|
||||||
|
# 2. Token supplied via $RM_ENROL_TOKEN: enrol then run.
|
||||||
|
# 3. Otherwise: announce against $RM_SERVER and wait for an admin to
|
||||||
|
# accept us. The announce flow blocks until accepted, then drops
|
||||||
|
# straight into the normal run loop, so this is the test-friendly
|
||||||
|
# path.
|
||||||
|
set -eu
|
||||||
|
|
||||||
|
CFG="${RM_AGENT_CONFIG:-/etc/restic-manager/agent.yaml}"
|
||||||
|
SERVER="${RM_SERVER:?set RM_SERVER}"
|
||||||
|
|
||||||
|
if [ -f "$CFG" ] && grep -q '^agent_token:' "$CFG"; then
|
||||||
|
exec restic-manager-agent -config "$CFG"
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ -n "${RM_ENROL_TOKEN:-}" ]; then
|
||||||
|
exec restic-manager-agent -config "$CFG" \
|
||||||
|
-enroll-server "$SERVER" \
|
||||||
|
-enroll-token "$RM_ENROL_TOKEN"
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Announce-and-approve: blocks until an admin accepts, then runs.
|
||||||
|
exec restic-manager-agent -config "$CFG" -enroll-server "$SERVER"
|
||||||
@@ -0,0 +1,108 @@
|
|||||||
|
# End-to-end test stack — used by .gitea/workflows/e2e.yml and by
|
||||||
|
# operators who want to run the Playwright suite locally.
|
||||||
|
#
|
||||||
|
# Three services:
|
||||||
|
# * server — restic-manager built from the working tree
|
||||||
|
# * agent — restic-manager agent built from the working tree
|
||||||
|
# (announces; Playwright accepts it during the test)
|
||||||
|
# * rest-server — the actual restic backend, sibling of the agent
|
||||||
|
#
|
||||||
|
# Run from the repo root:
|
||||||
|
# docker compose -f e2e/compose.e2e.yml up --build --abort-on-container-exit
|
||||||
|
|
||||||
|
services:
|
||||||
|
rest-server:
|
||||||
|
image: restic/rest-server:0.13.0
|
||||||
|
environment:
|
||||||
|
DATA_DIR: /data
|
||||||
|
OPTIONS: "--no-auth"
|
||||||
|
volumes:
|
||||||
|
- rest-data:/data
|
||||||
|
networks: [rmnet]
|
||||||
|
|
||||||
|
server:
|
||||||
|
build:
|
||||||
|
context: ..
|
||||||
|
dockerfile: deploy/Dockerfile.server
|
||||||
|
args:
|
||||||
|
VERSION: e2e
|
||||||
|
environment:
|
||||||
|
RM_LISTEN: ":8080"
|
||||||
|
RM_DATA_DIR: "/data"
|
||||||
|
RM_BASE_URL: "http://server:8080"
|
||||||
|
RM_COOKIE_SECURE: "false"
|
||||||
|
# Open the metrics endpoint to any source address for the test, so one of the
|
||||||
|
# Playwright assertions can exercise it.
|
||||||
|
RM_METRICS_TRUSTED_CIDR: "0.0.0.0/0"
|
||||||
|
volumes:
|
||||||
|
- server-data:/data
|
||||||
|
ports:
|
||||||
|
- "127.0.0.1:8080:8080"
|
||||||
|
healthcheck:
|
||||||
|
test: ["CMD", "/usr/local/bin/restic-manager-server", "--version"]
|
||||||
|
interval: 2s
|
||||||
|
timeout: 2s
|
||||||
|
retries: 30
|
||||||
|
networks: [rmnet]
|
||||||
|
|
||||||
|
agent:
|
||||||
|
build:
|
||||||
|
context: ..
|
||||||
|
dockerfile: e2e/Dockerfile.agent
|
||||||
|
args:
|
||||||
|
VERSION: e2e
|
||||||
|
environment:
|
||||||
|
RM_SERVER: "http://server:8080"
|
||||||
|
depends_on:
|
||||||
|
- server
|
||||||
|
volumes:
|
||||||
|
# Source paths the agent backs up. The source-fixture service pre-populates this
|
||||||
|
# with a few files so the snapshot list isn't empty.
|
||||||
|
- source-data:/source
|
||||||
|
- agent-config:/etc/restic-manager
|
||||||
|
- agent-state:/var/lib/restic-manager-agent
|
||||||
|
networks: [rmnet]
|
||||||
|
|
||||||
|
# Playwright test runner. Profile-gated so `compose up` doesn't
|
||||||
|
# start it; CI runs it via `compose run --rm playwright`. Lives on
|
||||||
|
# rmnet so it can reach the server via its compose-network DNS
|
||||||
|
# name rather than depending on host port-publish (which doesn't
|
||||||
|
# work on Gitea's container-based runners).
|
||||||
|
playwright:
|
||||||
|
profiles: [test]
|
||||||
|
build:
|
||||||
|
context: ..
|
||||||
|
dockerfile: e2e/Dockerfile.playwright
|
||||||
|
environment:
|
||||||
|
RM_BASE_URL: "http://server:8080"
|
||||||
|
RM_BOOTSTRAP_TOKEN: "${RM_BOOTSTRAP_TOKEN:-}"
|
||||||
|
volumes:
|
||||||
|
- ./playwright/playwright-report:/work/playwright-report
|
||||||
|
- ./playwright/test-results:/work/test-results
|
||||||
|
depends_on:
|
||||||
|
- server
|
||||||
|
- agent
|
||||||
|
networks: [rmnet]
|
||||||
|
|
||||||
|
# One-shot init container that drops a couple of files into the
|
||||||
|
# source volume so backups have something to snapshot.
|
||||||
|
source-fixture:
|
||||||
|
image: alpine:3.20
|
||||||
|
command: >
|
||||||
|
sh -c 'mkdir -p /source && echo "hello world" > /source/hello.txt &&
|
||||||
|
echo "another file" > /source/two.txt && sleep 0.2'
|
||||||
|
volumes:
|
||||||
|
- source-data:/source
|
||||||
|
networks: [rmnet]
|
||||||
|
restart: "no"
|
||||||
|
|
||||||
|
volumes:
|
||||||
|
server-data:
|
||||||
|
rest-data:
|
||||||
|
source-data:
|
||||||
|
agent-config:
|
||||||
|
agent-state:
|
||||||
|
|
||||||
|
networks:
|
||||||
|
rmnet:
|
||||||
|
driver: bridge
|
||||||
@@ -0,0 +1,14 @@
|
|||||||
|
{
|
||||||
|
"name": "restic-manager-e2e",
|
||||||
|
"version": "0.0.0",
|
||||||
|
"private": true,
|
||||||
|
"type": "module",
|
||||||
|
"scripts": {
|
||||||
|
"test": "playwright test",
|
||||||
|
"test:headed": "playwright test --headed",
|
||||||
|
"test:debug": "PWDEBUG=1 playwright test"
|
||||||
|
},
|
||||||
|
"devDependencies": {
|
||||||
|
"@playwright/test": "1.59.1"
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,31 @@
|
|||||||
|
import { defineConfig, devices } from '@playwright/test';
|
||||||
|
|
||||||
|
// Single-target Chromium config: the e2e suite is narrow (smoke
|
||||||
|
// the production-shaped flow against the docker-compose stack).
|
||||||
|
// Cross-browser matrix doesn't add signal — what we're verifying is
|
||||||
|
// the server's HTML and the agent's WebSocket handshake, neither of
|
||||||
|
// which depends on browser engine.
|
||||||
|
|
||||||
|
const baseURL = process.env.RM_BASE_URL ?? 'http://127.0.0.1:8080';
|
||||||
|
|
||||||
|
export default defineConfig({
|
||||||
|
testDir: './tests',
|
||||||
|
timeout: 60_000,
|
||||||
|
expect: { timeout: 10_000 },
|
||||||
|
fullyParallel: false,
|
||||||
|
retries: process.env.CI ? 1 : 0,
|
||||||
|
workers: 1,
|
||||||
|
reporter: [['list'], ['html', { open: 'never' }]],
|
||||||
|
use: {
|
||||||
|
baseURL,
|
||||||
|
trace: 'retain-on-failure',
|
||||||
|
screenshot: 'only-on-failure',
|
||||||
|
video: 'retain-on-failure',
|
||||||
|
},
|
||||||
|
projects: [
|
||||||
|
{
|
||||||
|
name: 'chromium',
|
||||||
|
use: { ...devices['Desktop Chrome'] },
|
||||||
|
},
|
||||||
|
],
|
||||||
|
});
|
||||||
@@ -0,0 +1,114 @@
|
|||||||
|
// Helpers used by every test. The shape favours the JSON API for
|
||||||
|
// reads + accept/dispatch (deterministic, easy to assert) and the
|
||||||
|
// browser for human-facing surfaces (login form, dashboard render).
|
||||||
|
|
||||||
|
import { APIRequestContext, expect, Page } from '@playwright/test';
|
||||||
|
|
||||||
|
export const baseURL = process.env.RM_BASE_URL ?? 'http://127.0.0.1:8080';
|
||||||
|
|
||||||
|
export interface HostJSON {
|
||||||
|
id: string;
|
||||||
|
name: string;
|
||||||
|
status: string;
|
||||||
|
last_backup_status?: string;
|
||||||
|
}
|
||||||
|
|
||||||
|
export async function readBootstrapToken(): Promise<string> {
|
||||||
|
const tok = process.env.RM_BOOTSTRAP_TOKEN;
|
||||||
|
if (!tok) {
|
||||||
|
throw new Error('RM_BOOTSTRAP_TOKEN not set — the harness scrapes it from server logs');
|
||||||
|
}
|
||||||
|
return tok;
|
||||||
|
}
|
||||||
|
|
||||||
|
export async function bootstrapAdmin(
|
||||||
|
request: APIRequestContext,
|
||||||
|
{
|
||||||
|
username = 'admin',
|
||||||
|
password = 'e2e-test-password-1234',
|
||||||
|
}: { username?: string; password?: string } = {},
|
||||||
|
): Promise<{ username: string; password: string }> {
|
||||||
|
const token = await readBootstrapToken();
|
||||||
|
const res = await request.post(`${baseURL}/api/bootstrap`, {
|
||||||
|
data: { token, username, password },
|
||||||
|
});
|
||||||
|
if (!res.ok() && res.status() !== 409 /* already bootstrapped */) {
|
||||||
|
throw new Error(`bootstrap: ${res.status()} ${await res.text()}`);
|
||||||
|
}
|
||||||
|
return { username, password };
|
||||||
|
}
|
||||||
|
|
||||||
|
export async function loginViaUI(page: Page, username: string, password: string): Promise<void> {
|
||||||
|
await page.goto(`${baseURL}/login`);
|
||||||
|
await page.locator('#login-username').fill(username);
|
||||||
|
await page.locator('#login-password').fill(password);
|
||||||
|
await Promise.all([
|
||||||
|
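page.waitForURL((url) => url.href.replace(/\/$/, '') === baseURL.replace(/\/$/, '')), // exact match; avoids unescaped regex metacharacters in baseURL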
|
||||||
|
page.locator('form[action="/login"] button[type="submit"]').click(),
|
||||||
|
]);
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Polls the dashboard until a pending host card is visible, then
|
||||||
|
* extracts its pending-id from the inline accept form's action URL.
|
||||||
|
*/
|
||||||
|
export async function waitForPendingHostID(page: Page): Promise<string> {
|
||||||
|
const formLocator = page.locator('form[action^="/api/pending-hosts/"][action$="/accept"]').first();
|
||||||
|
await expect(formLocator).toBeVisible({ timeout: 60_000 });
|
||||||
|
const action = await formLocator.getAttribute('action');
|
||||||
|
if (!action) throw new Error('pending host form has no action attribute');
|
||||||
|
const m = action.match(/\/api\/pending-hosts\/([^/]+)\/accept/);
|
||||||
|
if (!m) throw new Error(`unexpected action URL: ${action}`);
|
||||||
|
return m[1];
|
||||||
|
}
|
||||||
|
|
||||||
|
export async function acceptPending(
|
||||||
|
request: APIRequestContext,
|
||||||
|
cookie: string,
|
||||||
|
pendingID: string,
|
||||||
|
repo: { url: string; username?: string; password: string },
|
||||||
|
): Promise<void> {
|
||||||
|
const res = await request.post(`${baseURL}/api/pending-hosts/${pendingID}/accept`, {
|
||||||
|
headers: { cookie, 'content-type': 'application/json' },
|
||||||
|
data: {
|
||||||
|
repo_url: repo.url,
|
||||||
|
repo_username: repo.username ?? '',
|
||||||
|
repo_password: repo.password,
|
||||||
|
},
|
||||||
|
});
|
||||||
|
if (!res.ok()) {
|
||||||
|
throw new Error(`accept: ${res.status()} ${await res.text()}`);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
export async function listHosts(request: APIRequestContext, cookie: string): Promise<HostJSON[]> {
|
||||||
|
const res = await request.get(`${baseURL}/api/hosts`, { headers: { cookie } });
|
||||||
|
if (!res.ok()) throw new Error(`list hosts: ${res.status()} ${await res.text()}`);
|
||||||
|
const body = (await res.json()) as { items?: HostJSON[]; hosts?: HostJSON[] };
|
||||||
|
return body.items ?? body.hosts ?? [];
|
||||||
|
}
|
||||||
|
|
||||||
|
export async function waitForHostStatus(
|
||||||
|
request: APIRequestContext,
|
||||||
|
cookie: string,
|
||||||
|
matcher: (h: HostJSON) => boolean,
|
||||||
|
timeoutMs = 60_000,
|
||||||
|
): Promise<HostJSON> {
|
||||||
|
const deadline = Date.now() + timeoutMs;
|
||||||
|
let last: HostJSON | undefined;
|
||||||
|
while (Date.now() < deadline) {
|
||||||
|
const hosts = await listHosts(request, cookie);
|
||||||
|
const hit = hosts.find(matcher);
|
||||||
|
if (hit) return hit;
|
||||||
|
last = hosts[0];
|
||||||
|
await new Promise((r) => setTimeout(r, 1_000));
|
||||||
|
}
|
||||||
|
throw new Error(`waitForHostStatus: timeout. Last seen: ${JSON.stringify(last)}`);
|
||||||
|
}
|
||||||
|
|
||||||
|
export async function getSessionCookie(page: Page): Promise<string> {
|
||||||
|
const cookies = await page.context().cookies();
|
||||||
|
const c = cookies.find((c) => c.name === 'rm_session');
|
||||||
|
if (!c) throw new Error('rm_session cookie not set after login');
|
||||||
|
return `${c.name}=${c.value}`;
|
||||||
|
}
|
||||||
@@ -0,0 +1,80 @@
|
|||||||
|
// End-to-end smoke: bootstrap → accept pending host → run backup → see succeeded.
|
||||||
|
//
|
||||||
|
// The compose stack stands up a server, a sibling rest-server, and an
|
||||||
|
// agent in announce-and-approve mode. This test drives the operator
|
||||||
|
// path through the UI (login + dashboard) and the API
|
||||||
|
// (accept + run-now + poll for terminal) — UI for the human surfaces,
|
||||||
|
// API for the deterministic ones.
|
||||||
|
|
||||||
|
import { test, expect } from '@playwright/test';
|
||||||
|
import {
|
||||||
|
baseURL,
|
||||||
|
bootstrapAdmin,
|
||||||
|
loginViaUI,
|
||||||
|
waitForPendingHostID,
|
||||||
|
acceptPending,
|
||||||
|
waitForHostStatus,
|
||||||
|
getSessionCookie,
|
||||||
|
} from './lib/server';
|
||||||
|
|
||||||
|
test.describe('smoke: enrol-via-announce → backup', () => {
|
||||||
|
test('happy path completes in under a minute', async ({ page, request }) => {
|
||||||
|
const { username, password } = await bootstrapAdmin(request);
|
||||||
|
await loginViaUI(page, username, password);
|
||||||
|
|
||||||
|
// Dashboard renders.
|
||||||
|
await expect(page.locator('main')).toContainText(/host|fleet|pending/i, { timeout: 10_000 });
|
||||||
|
|
||||||
|
// Pending host appears (the agent container has been
|
||||||
|
// announcing since startup).
|
||||||
|
const pendingID = await waitForPendingHostID(page);
|
||||||
|
const cookie = await getSessionCookie(page);
|
||||||
|
|
||||||
|
// Accept with the rest-server creds. compose's rest-server runs
|
||||||
|
// --no-auth, so any credentials work; restic still demands a
|
||||||
|
// password to encrypt the repo.
|
||||||
|
await acceptPending(request, cookie, pendingID, {
|
||||||
|
url: 'rest:http://rest-server:8000/',
|
||||||
|
password: 'e2e-repo-password',
|
||||||
|
});
|
||||||
|
|
||||||
|
// Wait for the host to come online + auto-init to land.
|
||||||
|
const onlineHost = await waitForHostStatus(
|
||||||
|
request, cookie,
|
||||||
|
(h) => h.status === 'online',
|
||||||
|
60_000,
|
||||||
|
);
|
||||||
|
expect(onlineHost.id).toBeTruthy();
|
||||||
|
|
||||||
|
// Trigger a backup via the UI form-post (HX-Redirect to /jobs/{id}).
|
||||||
|
await page.goto(`${baseURL}/hosts/${onlineHost.id}`);
|
||||||
|
await Promise.all([
|
||||||
|
page.waitForURL(/\/jobs\//),
|
||||||
|
page.locator('form[action$="/run-backup"] button[type="submit"]').first().click(),
|
||||||
|
]);
|
||||||
|
|
||||||
|
// Wait for the host's last_backup_status to flip to 'succeeded'.
|
||||||
|
// The job page itself is harder to assert on (it uses
|
||||||
|
// server-pushed updates and a reload-on-finish pattern); the
|
||||||
|
// host record is the source of truth and is what the dashboard
|
||||||
|
// surfaces.
|
||||||
|
const finishedHost = await waitForHostStatus(
|
||||||
|
request, cookie,
|
||||||
|
(h) => h.id === onlineHost.id && h.last_backup_status === 'succeeded',
|
||||||
|
120_000,
|
||||||
|
);
|
||||||
|
expect(finishedHost.last_backup_status).toBe('succeeded');
|
||||||
|
});
|
||||||
|
});
|
||||||
|
|
||||||
|
test.describe('smoke: scrape /metrics', () => {
|
||||||
|
test('metrics endpoint exposes the host gauge', async ({ request }) => {
|
||||||
|
// Compose sets RM_METRICS_TRUSTED_CIDR=0.0.0.0/0 so the
|
||||||
|
// endpoint is open to the test runner.
|
||||||
|
const res = await request.get(`${baseURL}/metrics`);
|
||||||
|
expect(res.status()).toBe(200);
|
||||||
|
const body = await res.text();
|
||||||
|
expect(body).toContain('rm_hosts_total');
|
||||||
|
expect(body).toContain('rm_build_info{');
|
||||||
|
});
|
||||||
|
});
|
||||||
@@ -41,6 +41,24 @@ type Config struct {
|
|||||||
// DataDir. Source-build deployments can override via
|
// DataDir. Source-build deployments can override via
|
||||||
// RM_BUNDLED_ASSETS_DIR.
|
// RM_BUNDLED_ASSETS_DIR.
|
||||||
BundledAssetsDir string `yaml:"bundled_assets_dir"`
|
BundledAssetsDir string `yaml:"bundled_assets_dir"`
|
||||||
|
|
||||||
|
// MetricsToken, if set, gates the /metrics scrape endpoint
|
||||||
|
// behind an `Authorization: Bearer <token>` check (constant-time
|
||||||
|
// compare). When neither this nor MetricsTrustedCIDRs is set,
|
||||||
|
// the route is not mounted at all (the endpoint is opt-in).
|
||||||
|
MetricsToken string `yaml:"metrics_token"`
|
||||||
|
|
||||||
|
// MetricsTrustedCIDRs, if non-empty, gates /metrics so only
|
||||||
|
// callers from these networks may scrape. ANDed with
|
||||||
|
// MetricsToken when both are set.
|
||||||
|
MetricsTrustedCIDRs []string `yaml:"metrics_trusted_cidrs"`
|
||||||
|
}
|
||||||
|
|
||||||
|
// MetricsAuthEnabled reports whether the operator has opted into
|
||||||
|
// exposing the Prometheus scrape endpoint by configuring at least
|
||||||
|
// one auth gate.
|
||||||
|
func (c Config) MetricsAuthEnabled() bool {
|
||||||
|
return c.MetricsToken != "" || len(c.MetricsTrustedCIDRs) > 0
|
||||||
}
|
}
|
||||||
|
|
||||||
// Load resolves config in this order:
|
// Load resolves config in this order:
|
||||||
@@ -93,6 +111,19 @@ func Load(yamlPath string) (Config, error) {
|
|||||||
if v, ok := os.LookupEnv("RM_BUNDLED_ASSETS_DIR"); ok {
|
if v, ok := os.LookupEnv("RM_BUNDLED_ASSETS_DIR"); ok {
|
||||||
c.BundledAssetsDir = v
|
c.BundledAssetsDir = v
|
||||||
}
|
}
|
||||||
|
if v, ok := os.LookupEnv("RM_METRICS_TOKEN"); ok {
|
||||||
|
c.MetricsToken = v
|
||||||
|
}
|
||||||
|
if v, ok := os.LookupEnv("RM_METRICS_TRUSTED_CIDR"); ok {
|
||||||
|
parts := strings.Split(v, ",")
|
||||||
|
c.MetricsTrustedCIDRs = c.MetricsTrustedCIDRs[:0]
|
||||||
|
for _, p := range parts {
|
||||||
|
p = strings.TrimSpace(p)
|
||||||
|
if p != "" {
|
||||||
|
c.MetricsTrustedCIDRs = append(c.MetricsTrustedCIDRs, p)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
if v, ok := os.LookupEnv("RM_TRUSTED_PROXY"); ok {
|
if v, ok := os.LookupEnv("RM_TRUSTED_PROXY"); ok {
|
||||||
// Comma-separated CIDRs; allow whitespace for readability.
|
// Comma-separated CIDRs; allow whitespace for readability.
|
||||||
parts := strings.Split(v, ",")
|
parts := strings.Split(v, ",")
|
||||||
@@ -137,5 +168,10 @@ func (c *Config) validate() error {
|
|||||||
return fmt.Errorf("config: RM_TRUSTED_PROXY entry %q is not a valid CIDR: %w", cidr, err)
|
return fmt.Errorf("config: RM_TRUSTED_PROXY entry %q is not a valid CIDR: %w", cidr, err)
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
for _, cidr := range c.MetricsTrustedCIDRs {
|
||||||
|
if _, err := netip.ParsePrefix(cidr); err != nil {
|
||||||
|
return fmt.Errorf("config: RM_METRICS_TRUSTED_CIDR entry %q is not a valid CIDR: %w", cidr, err)
|
||||||
|
}
|
||||||
|
}
|
||||||
return nil
|
return nil
|
||||||
}
|
}
|
||||||
|
|||||||
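The `RM_METRICS_TRUSTED_CIDR` handling above splits on commas, trims whitespace, drops empty entries, and validates each entry with `netip.ParsePrefix`. A minimal stand-alone sketch of that flow follows. The helper name `parseCIDRList` is hypothetical and not part of the PR; unlike the diff, it returns parsed `netip.Prefix` values instead of keeping strings.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// parseCIDRList is a hypothetical helper (not in the PR). It mirrors the
// split / trim / skip-empty / validate steps applied to
// RM_METRICS_TRUSTED_CIDR, but returns parsed prefixes.
func parseCIDRList(raw string) ([]netip.Prefix, error) {
	var out []netip.Prefix
	for _, part := range strings.Split(raw, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue // tolerate trailing commas and doubled separators
		}
		p, err := netip.ParsePrefix(part)
		if err != nil {
			return nil, fmt.Errorf("entry %q is not a valid CIDR: %w", part, err)
		}
		out = append(out, p)
	}
	return out, nil
}

func main() {
	prefixes, err := parseCIDRList("10.0.0.0/8, 192.168.1.0/24")
	fmt.Println(len(prefixes), err)

	_, err = parseCIDRList("garbage")
	fmt.Println(err != nil)
}
```

Parsing once at load time would let the request path skip the per-request `netip.ParsePrefix` calls; whether that is worth a change to the `Config` field type is a design call for the author.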
@@ -98,6 +98,45 @@ func TestCookieSecureDefaultAndOverride(t *testing.T) {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
func TestMetricsAuthGates(t *testing.T) {
|
||||||
|
t.Setenv("RM_LISTEN", ":8080")
|
||||||
|
t.Setenv("RM_DATA_DIR", "/tmp/x")
|
||||||
|
|
||||||
|
c, err := Load("")
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("load: %v", err)
|
||||||
|
}
|
||||||
|
if c.MetricsAuthEnabled() {
|
||||||
|
t.Errorf("metrics endpoint should be off by default")
|
||||||
|
}
|
||||||
|
|
||||||
|
t.Setenv("RM_METRICS_TOKEN", "s3cr3t-token-with-enough-bytes")
|
||||||
|
t.Setenv("RM_METRICS_TRUSTED_CIDR", "10.0.0.0/8, 192.168.1.0/24")
|
||||||
|
c, err = Load("")
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("load: %v", err)
|
||||||
|
}
|
||||||
|
if c.MetricsToken != "s3cr3t-token-with-enough-bytes" {
|
||||||
|
t.Errorf("token: %q", c.MetricsToken)
|
||||||
|
}
|
||||||
|
if got := c.MetricsTrustedCIDRs; len(got) != 2 || got[0] != "10.0.0.0/8" || got[1] != "192.168.1.0/24" {
|
||||||
|
t.Errorf("cidrs: %v", got)
|
||||||
|
}
|
||||||
|
if !c.MetricsAuthEnabled() {
|
||||||
|
t.Errorf("MetricsAuthEnabled should be true")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestMetricsTrustedCIDRRejectsGarbage(t *testing.T) {
|
||||||
|
t.Setenv("RM_LISTEN", ":8080")
|
||||||
|
t.Setenv("RM_DATA_DIR", "/tmp/x")
|
||||||
|
t.Setenv("RM_METRICS_TRUSTED_CIDR", "garbage")
|
||||||
|
|
||||||
|
if _, err := Load(""); err == nil {
|
||||||
|
t.Fatal("expected validation error, got nil")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
func writeFile(path string, body []byte) error {
|
func writeFile(path string, body []byte) error {
|
||||||
return writeFileImpl(path, body)
|
return writeFileImpl(path, body)
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -0,0 +1,185 @@
|
|||||||
|
package http
|
||||||
|
|
||||||
|
import (
|
||||||
|
"context"
|
||||||
|
"crypto/subtle"
|
||||||
|
"net"
|
||||||
|
"net/http"
|
||||||
|
"net/netip"
|
||||||
|
"runtime"
|
||||||
|
"strings"
|
||||||
|
|
||||||
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/config"
|
||||||
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
|
||||||
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
|
||||||
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
|
||||||
|
)
|
||||||
|
|
||||||
|
// handleMetrics serves the Prometheus exposition body. The route is
|
||||||
|
// only mounted when the operator has opted in via RM_METRICS_TOKEN
|
||||||
|
// or RM_METRICS_TRUSTED_CIDR (see Server.New + Cfg.MetricsAuthEnabled).
|
||||||
|
func (s *Server) handleMetrics(w http.ResponseWriter, r *http.Request) {
|
||||||
|
if !authoriseMetricsScrape(r, s.deps.Cfg) {
|
||||||
|
// 401 with no body; Prom respects this and surfaces the failed
|
||||||
|
// scrape. WWW-Authenticate hints at bearer when the operator
|
||||||
|
// actually configured a token.
|
||||||
|
if s.deps.Cfg.MetricsToken != "" {
|
||||||
|
w.Header().Set("WWW-Authenticate", `Bearer realm="restic-manager metrics"`)
|
||||||
|
}
|
||||||
|
w.WriteHeader(http.StatusUnauthorized)
|
||||||
|
return
|
||||||
|
}
|
||||||
|
|
||||||
|
snap, err := s.gatherMetricsSnapshot(r.Context())
|
||||||
|
if err != nil {
|
||||||
|
http.Error(w, "snapshot: "+err.Error(), http.StatusInternalServerError)
|
||||||
|
return
|
||||||
|
}
|
||||||
|
|
||||||
|
// 0.0.4 is the long-stable text-format version Prometheus accepts
|
||||||
|
// without negotiation; OpenMetrics is intentionally not used here.
|
||||||
|
w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
|
||||||
|
if err := metrics.Render(w, snap); err != nil {
|
||||||
|
// Body is partially written; nothing useful we can do beyond
|
||||||
|
// returning early (nothing is logged here; chi's recoverer only handles panics).
|
||||||
|
return
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// authoriseMetricsScrape applies bearer + CIDR gates per the spec.
|
||||||
|
// AND semantics when both are configured; either alone is sufficient
|
||||||
|
// when only it is configured.
|
||||||
|
func authoriseMetricsScrape(r *http.Request, cfg config.Config) bool {
|
||||||
|
tokenOK := true
|
||||||
|
if cfg.MetricsToken != "" {
|
||||||
|
tokenOK = false
|
||||||
|
hdr := r.Header.Get("Authorization")
|
||||||
|
const prefix = "Bearer "
|
||||||
|
if strings.HasPrefix(hdr, prefix) {
|
||||||
|
got := []byte(strings.TrimPrefix(hdr, prefix))
|
||||||
|
want := []byte(cfg.MetricsToken)
|
||||||
|
if subtle.ConstantTimeCompare(got, want) == 1 {
|
||||||
|
tokenOK = true
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
cidrOK := true
|
||||||
|
if len(cfg.MetricsTrustedCIDRs) > 0 {
|
||||||
|
cidrOK = false
|
||||||
|
ip := callerIP(r, cfg.TrustedProxies)
|
||||||
|
if ip.IsValid() {
|
||||||
|
for _, c := range cfg.MetricsTrustedCIDRs {
|
||||||
|
prefix, err := netip.ParsePrefix(c)
|
||||||
|
if err != nil {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
if prefix.Contains(ip) {
|
||||||
|
cidrOK = true
|
||||||
|
break
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return tokenOK && cidrOK
|
||||||
|
}
|
||||||
|
|
||||||
|
// callerIP resolves the client IP. When the request hit the server
|
||||||
|
// directly we use RemoteAddr; when the immediate hop is a trusted
|
||||||
|
// proxy we honour the right-most untrusted X-Forwarded-For entry
|
||||||
|
// (mirrors how realIP middlewares typically resolve).
|
||||||
|
func callerIP(r *http.Request, trustedProxies []string) netip.Addr {
|
||||||
|
host, _, err := net.SplitHostPort(r.RemoteAddr)
|
||||||
|
if err != nil {
|
||||||
|
host = r.RemoteAddr
|
||||||
|
}
|
||||||
|
directAddr, err := netip.ParseAddr(host)
|
||||||
|
if err != nil {
|
||||||
|
return netip.Addr{}
|
||||||
|
}
|
||||||
|
|
||||||
|
if !addrInAnyCIDR(directAddr, trustedProxies) {
|
||||||
|
return directAddr
|
||||||
|
}
|
||||||
|
|
||||||
|
xff := r.Header.Get("X-Forwarded-For")
|
||||||
|
if xff == "" {
|
||||||
|
return directAddr
|
||||||
|
}
|
||||||
|
parts := strings.Split(xff, ",")
|
||||||
|
// Walk right→left, skipping trusted proxies, until we land on the
|
||||||
|
// first untrusted hop — that's the genuine client.
|
||||||
|
for i := len(parts) - 1; i >= 0; i-- {
|
||||||
|
p := strings.TrimSpace(parts[i])
|
||||||
|
a, err := netip.ParseAddr(p)
|
||||||
|
if err != nil {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
if addrInAnyCIDR(a, trustedProxies) {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
return a
|
||||||
|
}
|
||||||
|
return directAddr
|
||||||
|
}
|
||||||
|
|
||||||
|
func addrInAnyCIDR(a netip.Addr, cidrs []string) bool {
|
||||||
|
for _, c := range cidrs {
|
||||||
|
pre, err := netip.ParsePrefix(c)
|
||||||
|
if err != nil {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
if pre.Contains(a) {
|
||||||
|
return true
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return false
|
||||||
|
}
|
||||||
|
|
||||||
|
// gatherMetricsSnapshot pulls the data the renderer needs. One
|
||||||
|
// indexed query for each per-host or fleet-wide read; no N+1.
|
||||||
|
func (s *Server) gatherMetricsSnapshot(ctx context.Context) (metrics.Snapshot, error) {
|
||||||
|
hosts, err := s.deps.Store.ListHosts(ctx)
|
||||||
|
if err != nil {
|
||||||
|
return metrics.Snapshot{}, err
|
||||||
|
}
|
||||||
|
hostRows := make([]metrics.HostRow, 0, len(hosts))
|
||||||
|
for _, h := range hosts {
|
||||||
|
row := metrics.HostRow{
|
||||||
|
ID: h.ID,
|
||||||
|
Name: h.Name,
|
||||||
|
Online: h.Status == "online",
|
||||||
|
SnapshotCount: h.SnapshotCount,
|
||||||
|
OpenAlertCount: h.OpenAlertCount,
|
||||||
|
RepoStatus: h.RepoStatus,
|
||||||
|
}
|
||||||
|
if h.LastBackupAt != nil {
|
||||||
|
ts := h.LastBackupAt.Unix()
|
||||||
|
row.LastBackupUnix = &ts
|
||||||
|
}
|
||||||
|
if h.LastBackupStatus != nil {
|
||||||
|
ok := *h.LastBackupStatus == "succeeded"
|
||||||
|
row.LastBackupSucceeded = &ok
|
||||||
|
}
|
||||||
|
if h.RepoSizeBytes > 0 {
|
||||||
|
sz := h.RepoSizeBytes
|
||||||
|
row.RepoSizeBytes = &sz
|
||||||
|
}
|
||||||
|
hostRows = append(hostRows, row)
|
||||||
|
}
|
||||||
|
|
||||||
|
open, err := s.deps.Store.ListAlerts(ctx, store.AlertFilter{Status: "open"})
|
||||||
|
if err != nil {
|
||||||
|
return metrics.Snapshot{}, err
|
||||||
|
}
|
||||||
|
bySeverity := map[string]int{"info": 0, "warning": 0, "critical": 0}
|
||||||
|
for _, a := range open {
|
||||||
|
bySeverity[a.Severity]++
|
||||||
|
}
|
||||||
|
|
||||||
|
reg := s.deps.Metrics
|
||||||
|
if reg == nil {
|
||||||
|
reg = metrics.NewRegistry() // empty histogram block
|
||||||
|
}
|
||||||
|
return reg.SnapshotWith(hostRows, bySeverity, version.Version, version.Commit, runtime.Version()), nil
|
||||||
|
}
|
||||||
@@ -0,0 +1,209 @@
|
|||||||
|
package http
|
||||||
|
|
||||||
|
import (
|
||||||
|
"context"
|
||||||
|
"io"
|
||||||
|
stdhttp "net/http"
|
||||||
|
"net/http/httptest"
|
||||||
|
"path/filepath"
|
||||||
|
"strings"
|
||||||
|
"testing"
|
||||||
|
|
||||||
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/crypto"
|
||||||
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/config"
|
||||||
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
|
||||||
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
|
||||||
|
)
|
||||||
|
|
||||||
|
// newMetricsServer builds a Server with metrics enabled per cfg.
|
||||||
|
// Returns (URL, registry, store) so tests can both observe job durations
|
||||||
|
// directly and exercise the HTTP gate.
|
||||||
|
func newMetricsServer(t *testing.T, cfg config.Config) (string, *metrics.Registry, *store.Store) {
|
||||||
|
t.Helper()
|
||||||
|
dir := t.TempDir()
|
||||||
|
|
||||||
|
st, err := store.Open(context.Background(), filepath.Join(dir, "rm.db"))
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("store: %v", err)
|
||||||
|
}
|
||||||
|
t.Cleanup(func() { _ = st.Close() })
|
||||||
|
|
||||||
|
keyPath := filepath.Join(dir, "secret.key")
|
||||||
|
if err := crypto.GenerateKeyFile(keyPath); err != nil {
|
||||||
|
t.Fatalf("genkey: %v", err)
|
||||||
|
}
|
||||||
|
key, _ := crypto.LoadKeyFromFile(keyPath)
|
||||||
|
aead, _ := crypto.NewAEAD(key)
|
||||||
|
|
||||||
|
cfg.Listen = ":0"
|
||||||
|
cfg.DataDir = dir
|
||||||
|
cfg.SecretKeyFile = keyPath
|
||||||
|
|
||||||
|
reg := metrics.NewRegistry()
|
||||||
|
deps := Deps{
|
||||||
|
Cfg: cfg,
|
||||||
|
Store: st,
|
||||||
|
AEAD: aead,
|
||||||
|
Metrics: reg,
|
||||||
|
}
|
||||||
|
s := New(deps)
|
||||||
|
ts := httptest.NewServer(s.srv.Handler)
|
||||||
|
t.Cleanup(ts.Close)
|
||||||
|
return ts.URL, reg, st
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestMetricsRouteNotMountedByDefault(t *testing.T) {
|
||||||
|
t.Parallel()
|
||||||
|
url, _, _ := newMetricsServer(t, config.Config{})
|
||||||
|
res, err := stdhttp.Get(url + "/metrics")
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("GET: %v", err)
|
||||||
|
}
|
||||||
|
defer res.Body.Close()
|
||||||
|
if res.StatusCode != stdhttp.StatusNotFound {
|
||||||
|
t.Errorf("status: got %d, want 404 (route should not be mounted)", res.StatusCode)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestMetricsTokenRequired(t *testing.T) {
|
||||||
|
t.Parallel()
|
||||||
|
url, _, _ := newMetricsServer(t, config.Config{
|
||||||
|
MetricsToken: "the-token",
|
||||||
|
})
|
||||||
|
|
||||||
|
// Missing token.
|
||||||
|
res, err := stdhttp.Get(url + "/metrics")
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("GET: %v", err)
|
||||||
|
}
|
||||||
|
defer res.Body.Close()
|
||||||
|
if res.StatusCode != stdhttp.StatusUnauthorized {
|
||||||
|
t.Errorf("no token: got %d", res.StatusCode)
|
||||||
|
}
|
||||||
|
if !strings.Contains(res.Header.Get("WWW-Authenticate"), "Bearer") {
|
||||||
|
t.Errorf("WWW-Authenticate hint missing: %q", res.Header.Get("WWW-Authenticate"))
|
||||||
|
}
|
||||||
|
|
||||||
|
// Wrong token.
|
||||||
|
req, _ := stdhttp.NewRequest(stdhttp.MethodGet, url+"/metrics", nil)
|
||||||
|
req.Header.Set("Authorization", "Bearer not-the-token")
|
||||||
|
res2, err := stdhttp.DefaultClient.Do(req)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("GET: %v", err)
|
||||||
|
}
|
||||||
|
defer res2.Body.Close()
|
||||||
|
if res2.StatusCode != stdhttp.StatusUnauthorized {
|
||||||
|
t.Errorf("wrong token: got %d", res2.StatusCode)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Right token.
|
||||||
|
req3, _ := stdhttp.NewRequest(stdhttp.MethodGet, url+"/metrics", nil)
|
||||||
|
req3.Header.Set("Authorization", "Bearer the-token")
|
||||||
|
res3, err3 := stdhttp.DefaultClient.Do(req3)
|
||||||
|
if err3 != nil {
|
||||||
|
t.Fatalf("GET: %v", err3)
|
||||||
|
}
|
||||||
|
defer res3.Body.Close()
|
||||||
|
if res3.StatusCode != stdhttp.StatusOK {
|
||||||
|
t.Errorf("right token: got %d", res3.StatusCode)
|
||||||
|
}
|
||||||
|
if ct := res3.Header.Get("Content-Type"); !strings.HasPrefix(ct, "text/plain") {
|
||||||
|
t.Errorf("content-type: %q", ct)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestMetricsCIDRGate(t *testing.T) {
|
||||||
|
t.Parallel()
|
||||||
|
// httptest clients connect from 127.0.0.1; pick a CIDR that excludes it
|
||||||
|
// to assert the "wrong source" branch.
|
||||||
|
url, _, _ := newMetricsServer(t, config.Config{
|
||||||
|
MetricsTrustedCIDRs: []string{"10.0.0.0/8"},
|
||||||
|
})
|
||||||
|
res, err := stdhttp.Get(url + "/metrics")
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("GET: %v", err)
|
||||||
|
}
|
||||||
|
defer res.Body.Close()
|
||||||
|
if res.StatusCode != stdhttp.StatusUnauthorized {
|
||||||
|
t.Errorf("loopback hitting non-matching CIDR: got %d, want 401", res.StatusCode)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Now allow loopback.
|
||||||
|
url2, _, _ := newMetricsServer(t, config.Config{
|
||||||
|
MetricsTrustedCIDRs: []string{"127.0.0.0/8"},
|
||||||
|
})
|
||||||
|
res2, err := stdhttp.Get(url2 + "/metrics")
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("GET: %v", err)
|
||||||
|
}
|
||||||
|
defer res2.Body.Close()
|
||||||
|
if res2.StatusCode != stdhttp.StatusOK {
|
||||||
|
t.Errorf("loopback in allow CIDR: got %d, want 200", res2.StatusCode)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestMetricsTokenAndCIDRBothRequired(t *testing.T) {
|
||||||
|
t.Parallel()
|
||||||
|
url, _, _ := newMetricsServer(t, config.Config{
|
||||||
|
MetricsToken: "the-token",
|
||||||
|
MetricsTrustedCIDRs: []string{"127.0.0.0/8"},
|
||||||
|
})
|
||||||
|
// CIDR only: the source is allowed (loopback) but the token is missing.
|
||||||
|
res, err := stdhttp.Get(url + "/metrics")
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("GET: %v", err)
|
||||||
|
}
|
||||||
|
defer res.Body.Close()
|
||||||
|
if res.StatusCode != stdhttp.StatusUnauthorized {
|
||||||
|
t.Errorf("missing token but in CIDR: got %d", res.StatusCode)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Both right.
|
||||||
|
req, _ := stdhttp.NewRequest(stdhttp.MethodGet, url+"/metrics", nil)
|
||||||
|
req.Header.Set("Authorization", "Bearer the-token")
|
||||||
|
res2, err := stdhttp.DefaultClient.Do(req)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("GET: %v", err)
|
||||||
|
}
|
||||||
|
defer res2.Body.Close()
|
||||||
|
if res2.StatusCode != stdhttp.StatusOK {
|
||||||
|
t.Errorf("both right: got %d, want 200", res2.StatusCode)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func readAll(t *testing.T, r io.Reader) string {
|
||||||
|
t.Helper()
|
||||||
|
b, err := io.ReadAll(r)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("read: %v", err)
|
||||||
|
}
|
||||||
|
return string(b)
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestMetricsBodyContainsExpectedLines(t *testing.T) {
|
||||||
|
t.Parallel()
|
||||||
|
url, reg, _ := newMetricsServer(t, config.Config{
|
||||||
|
MetricsToken: "the-token",
|
||||||
|
})
|
||||||
|
reg.ObserveJob("backup", "succeeded", 0) // produce one histogram row
|
||||||
|
|
||||||
|
req, _ := stdhttp.NewRequest(stdhttp.MethodGet, url+"/metrics", nil)
|
||||||
|
req.Header.Set("Authorization", "Bearer the-token")
|
||||||
|
res, err := stdhttp.DefaultClient.Do(req)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("GET: %v", err)
|
||||||
|
}
|
||||||
|
defer res.Body.Close()
|
||||||
|
body := readAll(t, res.Body)
|
||||||
|
for _, want := range []string{
|
||||||
|
"rm_hosts_total",
|
||||||
|
"rm_hosts_online",
|
||||||
|
`rm_active_alerts{severity="critical"}`,
|
||||||
|
"rm_build_info{",
|
||||||
|
"rm_job_duration_seconds_count{kind=\"backup\",status=\"succeeded\"}",
|
||||||
|
} {
|
||||||
|
if !strings.Contains(body, want) {
|
||||||
|
t.Errorf("body missing %q\n--- body ---\n%s", want, body)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
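The tests above pin down the gate's observable behaviour: when both a token and a CIDR list are configured, a scrape must satisfy both, and either failure yields 401. The handler itself is not part of this excerpt, so the following is only a minimal sketch of such a gate under stated assumptions. `gateConfig`, `newGate`, and `allow` are hypothetical names, and the real `handleMetrics` may differ in structure and in how it derives the client address.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/netip"
	"strings"
)

// gateConfig is a hypothetical stand-in for the two config fields the
// tests exercise (MetricsToken and MetricsTrustedCIDRs).
type gateConfig struct {
	Token        string
	TrustedCIDRs []netip.Prefix
}

// newGate parses the CIDR strings once, up front, so a malformed entry
// is reported at startup rather than on every scrape.
func newGate(token string, cidrs ...string) (gateConfig, error) {
	g := gateConfig{Token: token}
	for _, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			return gateConfig{}, fmt.Errorf("parse CIDR %q: %w", c, err)
		}
		g.TrustedCIDRs = append(g.TrustedCIDRs, p.Masked())
	}
	return g, nil
}

// allow reports whether a scrape passes every configured check. A check
// that is not configured is skipped; a configured check must pass.
// (Assumption: with neither check configured the route is never
// mounted, so that case is out of scope here.)
func (g gateConfig) allow(remoteAddr, authHeader string) bool {
	if len(g.TrustedCIDRs) > 0 {
		ap, err := netip.ParseAddrPort(remoteAddr)
		if err != nil {
			return false
		}
		addr := ap.Addr().Unmap() // treat ::ffff:127.0.0.1 as 127.0.0.1
		inRange := false
		for _, p := range g.TrustedCIDRs {
			if p.Contains(addr) {
				inRange = true
				break
			}
		}
		if !inRange {
			return false
		}
	}
	if g.Token != "" {
		got, ok := strings.CutPrefix(authHeader, "Bearer ")
		// Constant-time compare avoids leaking the token via timing.
		if !ok || subtle.ConstantTimeCompare([]byte(got), []byte(g.Token)) != 1 {
			return false
		}
	}
	return true
}

func main() {
	g, err := newGate("the-token", "127.0.0.0/8")
	if err != nil {
		panic(err)
	}
	fmt.Println(g.allow("127.0.0.1:5000", ""))                 // in CIDR, no token
	fmt.Println(g.allow("127.0.0.1:5000", "Bearer the-token")) // both satisfied
	fmt.Println(g.allow("10.1.2.3:5000", "Bearer the-token"))  // token ok, wrong source
}
```

Parsing prefixes ahead of time keeps the per-request path allocation-free apart from the token comparison. Behind a reverse proxy the remote address would need to come from a trusted forwarding header instead, which this sketch does not attempt.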
@@ -17,6 +17,7 @@ import (
|
|||||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/crypto"
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/crypto"
|
||||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/notification"
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/notification"
|
||||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/config"
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/config"
|
||||||
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
|
||||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/oidc"
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/oidc"
|
||||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ui"
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ui"
|
||||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ws"
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/ws"
|
||||||
@@ -56,6 +57,12 @@ type Deps struct {
|
|||||||
// OIDC (optional). Non-nil when the operator has configured an
|
// OIDC (optional). Non-nil when the operator has configured an
|
||||||
// IdP — handlers under /auth/oidc/* are mounted only when set.
|
// IdP — handlers under /auth/oidc/* are mounted only when set.
|
||||||
OIDC *oidc.Client
|
OIDC *oidc.Client
|
||||||
|
// Metrics (optional). When non-nil the WS job-finished branch
|
||||||
|
// records job durations and the /metrics handler can pull a
|
||||||
|
// histogram snapshot. Independent of MetricsAuthEnabled — the
|
||||||
|
// recorder runs even if the scrape endpoint is gated off, so a
|
||||||
|
// later config flip doesn't lose the running window.
|
||||||
|
Metrics *metrics.Registry
|
||||||
}
|
}
|
||||||
|
|
||||||
// Server is the running HTTP server.
|
// Server is the running HTTP server.
|
||||||
@@ -131,12 +138,16 @@ func (s *Server) routes(r chi.Router) {
|
|||||||
r.Get("/agent/binary", s.handleAgentBinary)
|
r.Get("/agent/binary", s.handleAgentBinary)
|
||||||
r.Get("/install/*", s.handleInstallAsset)
|
r.Get("/install/*", s.handleInstallAsset)
|
||||||
r.Get("/api/version", s.handleVersion)
|
r.Get("/api/version", s.handleVersion)
|
||||||
|
if s.deps.Cfg.MetricsAuthEnabled() {
|
||||||
|
r.Get("/metrics", s.handleMetrics)
|
||||||
|
}
|
||||||
if s.deps.Hub != nil {
|
if s.deps.Hub != nil {
|
||||||
hd := ws.HandlerDeps{
|
hd := ws.HandlerDeps{
|
||||||
Hub: s.deps.Hub,
|
Hub: s.deps.Hub,
|
||||||
Store: s.deps.Store,
|
Store: s.deps.Store,
|
||||||
JobHub: s.deps.JobHub,
|
JobHub: s.deps.JobHub,
|
||||||
AlertEngine: s.deps.AlertEngine,
|
AlertEngine: s.deps.AlertEngine,
|
||||||
|
Metrics: s.deps.Metrics,
|
||||||
OnHello: s.onAgentHello,
|
OnHello: s.onAgentHello,
|
||||||
OnScheduleAck: s.applyScheduleAck,
|
OnScheduleAck: s.applyScheduleAck,
|
||||||
OnScheduleFire: s.dispatchScheduledJob,
|
OnScheduleFire: s.dispatchScheduledJob,
|
||||||
|
|||||||
@@ -0,0 +1,301 @@
|
|||||||
|
// Package metrics owns the in-process Prometheus exposition for
|
||||||
|
// the control plane. It deliberately avoids prometheus/client_golang
|
||||||
|
// — the legacy text format is small and stable, and the repo's house
|
||||||
|
// style is to keep dependency surface minimal.
|
||||||
|
//
|
||||||
|
// Two halves:
|
||||||
|
//
|
||||||
|
// - Registry holds a job-duration histogram. Server hooks call
|
||||||
|
// Registry.ObserveJob from the WS job-finished branch.
|
||||||
|
//
|
||||||
|
// - Render emits a complete /metrics body from a Snapshot. The
|
||||||
|
// Snapshot is a plain value bag; the HTTP handler assembles it
|
||||||
|
// from store reads via Registry.SnapshotWith at scrape time. This
|
||||||
|
// keeps the package free of any database or HTTP dependency.
|
||||||
|
package metrics
|
||||||
|
|
||||||
|
import (
|
||||||
|
"fmt"
|
||||||
|
"io"
|
||||||
|
"sort"
|
||||||
|
"strings"
|
||||||
|
"sync"
|
||||||
|
"time"
|
||||||
|
)
|
||||||
|
|
||||||
|
// JobDurationBuckets is the upper-bound ladder for the job duration
|
||||||
|
// histogram, in seconds. Covers admin commands (unlock/init/check
|
||||||
|
// finishing in seconds) up through hours-long backups; +Inf is
|
||||||
|
// implicit.
|
||||||
|
var JobDurationBuckets = []float64{1, 5, 30, 60, 300, 1800, 3600, 21600, 86400}
|
||||||
|
|
||||||
|
// Registry is the in-memory store for the job-duration histogram.
|
||||||
|
// Concurrent observers and a single periodic snapshotter are the
|
||||||
|
// expected access pattern; both are guarded by a mutex.
|
||||||
|
type Registry struct {
|
||||||
|
mu sync.Mutex
|
||||||
|
jobs map[jobKey]*histogramState
|
||||||
|
clock func() time.Time
|
||||||
|
}
|
||||||
|
|
||||||
|
type jobKey struct{ kind, status string }
|
||||||
|
|
||||||
|
type histogramState struct {
|
||||||
|
// counts[i] = number of observations <= JobDurationBuckets[i].
|
||||||
|
// counts[len(JobDurationBuckets)] is the implicit +Inf bucket
|
||||||
|
// (== total count, kept here for symmetry with the rendered
|
||||||
|
// _bucket{le="+Inf"} line and as a sanity check).
|
||||||
|
counts []uint64
|
||||||
|
sum float64
|
||||||
|
count uint64
|
||||||
|
}
|
||||||
|
|
||||||
|
// NewRegistry builds an empty registry.
|
||||||
|
func NewRegistry() *Registry {
|
||||||
|
return &Registry{
|
||||||
|
jobs: make(map[jobKey]*histogramState),
|
||||||
|
clock: time.Now,
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ObserveJob records one job-duration sample. Negative durations
|
||||||
|
// (clock-skew artefacts) are clamped to zero. Empty kind/status
|
||||||
|
// strings are tolerated but degrade the dashboard — callers should
|
||||||
|
// pass meaningful values.
|
||||||
|
func (r *Registry) ObserveJob(kind, status string, dur time.Duration) {
|
||||||
|
if r == nil {
|
||||||
|
return
|
||||||
|
}
|
||||||
|
if dur < 0 {
|
||||||
|
dur = 0
|
||||||
|
}
|
||||||
|
secs := dur.Seconds()
|
||||||
|
|
||||||
|
r.mu.Lock()
|
||||||
|
defer r.mu.Unlock()
|
||||||
|
k := jobKey{kind: kind, status: status}
|
||||||
|
hs, ok := r.jobs[k]
|
||||||
|
if !ok {
|
||||||
|
hs = &histogramState{counts: make([]uint64, len(JobDurationBuckets)+1)}
|
||||||
|
r.jobs[k] = hs
|
||||||
|
}
|
||||||
|
for i, ub := range JobDurationBuckets {
|
||||||
|
if secs <= ub {
|
||||||
|
hs.counts[i]++
|
||||||
|
}
|
||||||
|
}
|
||||||
|
hs.counts[len(JobDurationBuckets)]++ // +Inf
|
||||||
|
hs.sum += secs
|
||||||
|
hs.count++
|
||||||
|
}
|
||||||
|
|
||||||
|
// HistogramRow is one (kind,status) row in a Snapshot. Buckets is
|
||||||
|
// the cumulative count per upper bound (matching JobDurationBuckets,
|
||||||
|
// last element is the +Inf total).
|
||||||
|
type HistogramRow struct {
|
||||||
|
Kind string
|
||||||
|
Status string
|
||||||
|
Buckets []uint64
|
||||||
|
Sum float64
|
||||||
|
Count uint64
|
||||||
|
}
|
||||||
|
|
||||||
|
// snapshotJobs returns a deterministic, sorted copy of the
|
||||||
|
// histogram state. Sort order: kind asc, status asc.
|
||||||
|
func (r *Registry) snapshotJobs() []HistogramRow {
|
||||||
|
if r == nil {
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
r.mu.Lock()
|
||||||
|
defer r.mu.Unlock()
|
||||||
|
rows := make([]HistogramRow, 0, len(r.jobs))
|
||||||
|
for k, hs := range r.jobs {
|
||||||
|
buckets := make([]uint64, len(hs.counts))
|
||||||
|
copy(buckets, hs.counts)
|
||||||
|
rows = append(rows, HistogramRow{
|
||||||
|
Kind: k.kind,
|
||||||
|
Status: k.status,
|
||||||
|
Buckets: buckets,
|
||||||
|
Sum: hs.sum,
|
||||||
|
Count: hs.count,
|
||||||
|
})
|
||||||
|
}
|
||||||
|
sort.Slice(rows, func(i, j int) bool {
|
||||||
|
if rows[i].Kind != rows[j].Kind {
|
||||||
|
return rows[i].Kind < rows[j].Kind
|
||||||
|
}
|
||||||
|
return rows[i].Status < rows[j].Status
|
||||||
|
})
|
||||||
|
return rows
|
||||||
|
}
|
||||||
|
|
||||||
|
// HostRow is one host's projection for the per-host gauges.
|
||||||
|
// Pointers carry "no value" semantics so we can omit a metric line
|
||||||
|
// when, e.g., a host has never run a backup.
|
||||||
|
type HostRow struct {
|
||||||
|
ID string
|
||||||
|
Name string
|
||||||
|
Online bool
|
||||||
|
LastBackupUnix *int64 // nil = no backup yet
|
||||||
|
LastBackupSucceeded *bool // nil = no backup yet
|
||||||
|
RepoSizeBytes *int64 // nil = no stats yet
|
||||||
|
SnapshotCount int
|
||||||
|
OpenAlertCount int
|
||||||
|
RepoStatus string // "unknown" | "ready" | "init_failed"
|
||||||
|
}
|
||||||
|
|
||||||
|
// Snapshot is a frozen view of the data needed to render /metrics.
|
||||||
|
// Constructed by the HTTP handler from Store reads + Registry.snapshotJobs.
|
||||||
|
type Snapshot struct {
|
||||||
|
Hosts []HostRow
|
||||||
|
HostsTotal int
|
||||||
|
HostsOnline int
|
||||||
|
AlertsBySeverity map[string]int // severity → count
|
||||||
|
BuildVersion string
|
||||||
|
BuildCommit string
|
||||||
|
GoVersion string
|
||||||
|
JobDurationRows []HistogramRow
|
||||||
|
}
|
||||||
|
|
||||||
|
// SnapshotWith builds a Snapshot from raw inputs and the registry's
|
||||||
|
// current job-duration state. Convenience for the HTTP handler.
|
||||||
|
func (r *Registry) SnapshotWith(hosts []HostRow, alerts map[string]int, buildVer, commit, goVer string) Snapshot {
|
||||||
|
online := 0
|
||||||
|
for _, h := range hosts {
|
||||||
|
if h.Online {
|
||||||
|
online++
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return Snapshot{
|
||||||
|
Hosts: hosts,
|
||||||
|
HostsTotal: len(hosts),
|
||||||
|
HostsOnline: online,
|
||||||
|
AlertsBySeverity: alerts,
|
||||||
|
BuildVersion: buildVer,
|
||||||
|
BuildCommit: commit,
|
||||||
|
GoVersion: goVer,
|
||||||
|
JobDurationRows: r.snapshotJobs(),
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Render emits a complete Prometheus text-exposition body for s.
|
||||||
|
// Output is deterministic: metric names appear in a fixed order and
|
||||||
|
// series within a metric follow a stable order (hosts by ID, job rows as given).
|
||||||
|
func Render(w io.Writer, s Snapshot) error {
|
||||||
|
var b strings.Builder
|
||||||
|
|
||||||
|
// --- Server gauges ---------------------------------------------------
|
||||||
|
b.WriteString("# HELP rm_hosts_total Total number of enrolled hosts (excludes pending announces).\n")
|
||||||
|
b.WriteString("# TYPE rm_hosts_total gauge\n")
|
||||||
|
fmt.Fprintf(&b, "rm_hosts_total %d\n", s.HostsTotal)
|
||||||
|
|
||||||
|
b.WriteString("# HELP rm_hosts_online Number of hosts currently online (status='online').\n")
|
||||||
|
b.WriteString("# TYPE rm_hosts_online gauge\n")
|
||||||
|
fmt.Fprintf(&b, "rm_hosts_online %d\n", s.HostsOnline)
|
||||||
|
|
||||||
|
b.WriteString("# HELP rm_active_alerts Open alerts grouped by severity.\n")
|
||||||
|
b.WriteString("# TYPE rm_active_alerts gauge\n")
|
||||||
|
severities := []string{"info", "warning", "critical"}
|
||||||
|
for _, sev := range severities {
|
||||||
|
fmt.Fprintf(&b, "rm_active_alerts{severity=%q} %d\n", sev, s.AlertsBySeverity[sev])
|
||||||
|
}
|
||||||
|
|
||||||
|
b.WriteString("# HELP rm_build_info Build identifying labels; value is always 1.\n")
|
||||||
|
b.WriteString("# TYPE rm_build_info gauge\n")
|
||||||
|
fmt.Fprintf(&b, "rm_build_info{version=%q,commit=%q,go_version=%q} 1\n",
|
||||||
|
s.BuildVersion, s.BuildCommit, s.GoVersion)
|
||||||
|
|
||||||
|
// --- Per-host gauges -------------------------------------------------
|
||||||
|
// Stable order: by host id.
|
||||||
|
hosts := append([]HostRow(nil), s.Hosts...)
|
||||||
|
sort.Slice(hosts, func(i, j int) bool { return hosts[i].ID < hosts[j].ID })
|
||||||
|
|
||||||
|
b.WriteString("# HELP rm_host_agent_online 1 if the agent is currently online, 0 otherwise.\n")
|
||||||
|
b.WriteString("# TYPE rm_host_agent_online gauge\n")
|
||||||
|
for _, h := range hosts {
|
||||||
|
v := 0
|
||||||
|
if h.Online {
|
||||||
|
v = 1
|
||||||
|
}
|
||||||
|
fmt.Fprintf(&b, "rm_host_agent_online{host_id=%q,host=%q} %d\n",
|
||||||
|
h.ID, h.Name, v)
|
||||||
|
}
|
||||||
|
|
||||||
|
b.WriteString("# HELP rm_host_last_backup_timestamp_seconds Unix timestamp of the host's most recent backup. Omitted for hosts with no backup yet.\n")
|
||||||
|
b.WriteString("# TYPE rm_host_last_backup_timestamp_seconds gauge\n")
|
||||||
|
for _, h := range hosts {
|
||||||
|
if h.LastBackupUnix == nil {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
fmt.Fprintf(&b, "rm_host_last_backup_timestamp_seconds{host_id=%q,host=%q} %d\n",
|
||||||
|
h.ID, h.Name, *h.LastBackupUnix)
|
||||||
|
}
|
||||||
|
|
||||||
|
b.WriteString("# HELP rm_host_last_backup_success 1 if the host's most recent backup succeeded, 0 otherwise. Omitted for hosts with no backup yet.\n")
|
||||||
|
b.WriteString("# TYPE rm_host_last_backup_success gauge\n")
|
||||||
|
for _, h := range hosts {
|
||||||
|
if h.LastBackupSucceeded == nil {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
v := 0
|
||||||
|
if *h.LastBackupSucceeded {
|
||||||
|
v = 1
|
||||||
|
}
|
||||||
|
fmt.Fprintf(&b, "rm_host_last_backup_success{host_id=%q,host=%q} %d\n",
|
||||||
|
h.ID, h.Name, v)
|
||||||
|
}
|
||||||
|
|
||||||
|
b.WriteString("# HELP rm_host_repo_size_bytes Latest reported repo size from `restic stats --mode raw-data`. Omitted for hosts with no stats yet.\n")
|
||||||
|
b.WriteString("# TYPE rm_host_repo_size_bytes gauge\n")
|
||||||
|
for _, h := range hosts {
|
||||||
|
if h.RepoSizeBytes == nil {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
fmt.Fprintf(&b, "rm_host_repo_size_bytes{host_id=%q,host=%q} %d\n",
|
||||||
|
h.ID, h.Name, *h.RepoSizeBytes)
|
||||||
|
}
|
||||||
|
|
||||||
|
b.WriteString("# HELP rm_host_snapshot_count Number of restic snapshots known on the host's repo.\n")
|
||||||
|
b.WriteString("# TYPE rm_host_snapshot_count gauge\n")
|
||||||
|
for _, h := range hosts {
|
||||||
|
fmt.Fprintf(&b, "rm_host_snapshot_count{host_id=%q,host=%q} %d\n",
|
||||||
|
h.ID, h.Name, h.SnapshotCount)
|
||||||
|
}
|
||||||
|
|
||||||
|
b.WriteString("# HELP rm_host_open_alerts Number of currently open alerts attached to this host.\n")
|
||||||
|
b.WriteString("# TYPE rm_host_open_alerts gauge\n")
|
||||||
|
for _, h := range hosts {
|
||||||
|
fmt.Fprintf(&b, "rm_host_open_alerts{host_id=%q,host=%q} %d\n",
|
||||||
|
h.ID, h.Name, h.OpenAlertCount)
|
||||||
|
}
|
||||||
|
|
||||||
|
b.WriteString("# HELP rm_host_repo_status Repo readiness state for the host. Exactly one row per host with status label set.\n")
|
||||||
|
b.WriteString("# TYPE rm_host_repo_status gauge\n")
|
||||||
|
for _, h := range hosts {
|
||||||
|
st := h.RepoStatus
|
||||||
|
if st == "" {
|
||||||
|
st = "unknown"
|
||||||
|
}
|
||||||
|
fmt.Fprintf(&b, "rm_host_repo_status{host_id=%q,host=%q,status=%q} 1\n",
|
||||||
|
h.ID, h.Name, st)
|
||||||
|
}
|
||||||
|
|
||||||
|
// --- Histogram -------------------------------------------------------
|
||||||
|
b.WriteString("# HELP rm_job_duration_seconds End-to-end duration of completed jobs, by kind and terminal status.\n")
|
||||||
|
b.WriteString("# TYPE rm_job_duration_seconds histogram\n")
|
||||||
|
for _, row := range s.JobDurationRows {
|
||||||
|
for i, ub := range JobDurationBuckets {
|
||||||
|
fmt.Fprintf(&b, "rm_job_duration_seconds_bucket{kind=%q,status=%q,le=\"%g\"} %d\n",
|
||||||
|
row.Kind, row.Status, ub, row.Buckets[i])
|
||||||
|
}
|
||||||
|
fmt.Fprintf(&b, "rm_job_duration_seconds_bucket{kind=%q,status=%q,le=\"+Inf\"} %d\n",
|
||||||
|
row.Kind, row.Status, row.Buckets[len(JobDurationBuckets)])
|
||||||
|
fmt.Fprintf(&b, "rm_job_duration_seconds_sum{kind=%q,status=%q} %g\n",
|
||||||
|
row.Kind, row.Status, row.Sum)
|
||||||
|
fmt.Fprintf(&b, "rm_job_duration_seconds_count{kind=%q,status=%q} %d\n",
|
||||||
|
row.Kind, row.Status, row.Count)
|
||||||
|
}
|
||||||
|
|
||||||
|
_, err := io.WriteString(w, b.String())
|
||||||
|
return err
|
||||||
|
}
|
||||||
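The bucket bookkeeping in `ObserveJob` is cumulative: a sample increments every bucket whose upper bound it does not exceed, plus the trailing `+Inf` slot. The standalone sketch below illustrates that rule with a shortened ladder. `bounds`, `observe`, and `cumulative` are illustrative names, not part of the package, and the three-step ladder is an assumption chosen to keep the output short.

```go
package main

import "fmt"

// bounds is a shortened, illustrative ladder in seconds; the package's
// real ladder is JobDurationBuckets.
var bounds = []float64{1, 5, 30}

// observe increments every bucket whose upper bound is >= secs, plus
// the trailing +Inf slot, so counts stay cumulative as the Prometheus
// histogram format expects.
func observe(counts []uint64, secs float64) {
	for i, ub := range bounds {
		if secs <= ub {
			counts[i]++
		}
	}
	counts[len(bounds)]++ // +Inf always counts the sample
}

// cumulative feeds the samples through observe and returns the counts,
// one slot per bound followed by the +Inf total.
func cumulative(samples ...float64) []uint64 {
	counts := make([]uint64, len(bounds)+1)
	for _, s := range samples {
		observe(counts, s)
	}
	return counts
}

func main() {
	// 0.5 lands in every bucket, 5 in le=5 and above, 90 only in +Inf.
	fmt.Println(cumulative(0.5, 5, 90)) // [1 2 2 3]
}
```

Storing cumulative counts directly means rendering is a straight copy, at the cost of touching up to every bucket per observation. With a ladder this short that trade-off is negligible.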
@@ -0,0 +1,182 @@
|
|||||||
|
package metrics
|
||||||
|
|
||||||
|
import (
|
||||||
|
"bytes"
|
||||||
|
"strings"
|
||||||
|
"sync"
|
||||||
|
"testing"
|
||||||
|
"time"
|
||||||
|
)
|
||||||
|
|
||||||
|
func TestObserveJobBuckets(t *testing.T) {
|
||||||
|
r := NewRegistry()
|
||||||
|
// Bucket boundaries: 1, 5, 30, 60, 300, 1800, 3600, 21600, 86400
|
||||||
|
r.ObserveJob("backup", "succeeded", 500*time.Millisecond) // <= 1
|
||||||
|
r.ObserveJob("backup", "succeeded", 30*time.Second) // == 30 (boundary)
|
||||||
|
r.ObserveJob("backup", "succeeded", 90*time.Second) // > 60, <= 300
|
||||||
|
r.ObserveJob("backup", "succeeded", 2*time.Hour) // > 3600 → 21600 bucket
|
||||||
|
rows := r.snapshotJobs()
|
||||||
|
if len(rows) != 1 {
|
||||||
|
t.Fatalf("rows: %d", len(rows))
|
||||||
|
}
|
||||||
|
row := rows[0]
|
||||||
|
if row.Count != 4 {
|
||||||
|
t.Errorf("count: %d", row.Count)
|
||||||
|
}
|
||||||
|
wantSum := 0.5 + 30 + 90 + 7200.0
|
||||||
|
if row.Sum != wantSum {
|
||||||
|
t.Errorf("sum: got %v want %v", row.Sum, wantSum)
|
||||||
|
}
|
||||||
|
// Cumulative buckets:
|
||||||
|
// le=1 → 1 (the 0.5s)
|
||||||
|
// le=5 → 1
|
||||||
|
// le=30 → 2 (boundary inclusive: 30s included)
|
||||||
|
// le=60 → 2
|
||||||
|
// le=300 → 3
|
||||||
|
// le=1800 → 3
|
||||||
|
// le=3600 → 3
|
||||||
|
// le=21600 → 4
|
||||||
|
// le=86400 → 4
|
||||||
|
// le=+Inf → 4
|
||||||
|
want := []uint64{1, 1, 2, 2, 3, 3, 3, 4, 4, 4}
|
||||||
|
for i, w := range want {
|
||||||
|
if row.Buckets[i] != w {
|
||||||
|
t.Errorf("bucket[%d]=%d want %d", i, row.Buckets[i], w)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestObserveJobNegativeClampedToZero(t *testing.T) {
|
||||||
|
r := NewRegistry()
|
||||||
|
r.ObserveJob("backup", "succeeded", -5*time.Second)
|
||||||
|
rows := r.snapshotJobs()
|
||||||
|
if len(rows) != 1 || rows[0].Sum != 0 || rows[0].Count != 1 {
|
||||||
|
t.Errorf("expected one zero-second observation, got %+v", rows)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestObserveJobConcurrent(t *testing.T) {
|
||||||
|
r := NewRegistry()
|
||||||
|
const goroutines = 16
|
||||||
|
const each = 200
|
||||||
|
var wg sync.WaitGroup
|
||||||
|
for g := 0; g < goroutines; g++ {
|
||||||
|
wg.Add(1)
|
||||||
|
go func() {
|
||||||
|
defer wg.Done()
|
||||||
|
for i := 0; i < each; i++ {
|
||||||
|
r.ObserveJob("backup", "succeeded", time.Second)
|
||||||
|
}
|
||||||
|
}()
|
||||||
|
}
|
||||||
|
wg.Wait()
|
||||||
|
rows := r.snapshotJobs()
|
||||||
|
if len(rows) != 1 {
|
||||||
|
t.Fatalf("rows: %d", len(rows))
|
||||||
|
}
|
||||||
|
if rows[0].Count != uint64(goroutines*each) {
|
||||||
|
t.Errorf("count: got %d want %d", rows[0].Count, goroutines*each)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestObserveJobNilRegistryNoop(t *testing.T) {
|
||||||
|
var r *Registry // nil
|
||||||
|
r.ObserveJob("backup", "succeeded", time.Second)
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestRenderGolden(t *testing.T) {
|
||||||
|
r := NewRegistry()
|
||||||
|
r.ObserveJob("backup", "succeeded", 5*time.Second)
|
||||||
|
r.ObserveJob("forget", "succeeded", 100*time.Millisecond)
|
||||||
|
|
||||||
|
pi64 := func(v int64) *int64 { return &v }
|
||||||
|
pbool := func(v bool) *bool { return &v }
|
||||||
|
|
||||||
|
hosts := []HostRow{
|
||||||
|
{
|
||||||
|
ID: "01H0001", Name: "alpha",
|
||||||
|
Online: true,
|
||||||
|
LastBackupUnix: pi64(1700000000),
|
||||||
|
LastBackupSucceeded: pbool(true),
|
||||||
|
RepoSizeBytes: pi64(123456789),
|
||||||
|
SnapshotCount: 42,
|
||||||
|
OpenAlertCount: 0,
|
||||||
|
RepoStatus: "ready",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
ID: "01H0002", Name: "bravo",
|
||||||
|
Online: false,
|
||||||
|
SnapshotCount: 0,
|
||||||
|
OpenAlertCount: 1,
|
||||||
|
RepoStatus: "init_failed",
|
||||||
|
},
|
||||||
|
}
|
||||||
|
snap := r.SnapshotWith(hosts,
|
||||||
|
map[string]int{"info": 0, "warning": 1, "critical": 0},
|
||||||
|
"v1.2.3", "deadbeef", "go1.25.0")
|
||||||
|
|
||||||
|
var buf bytes.Buffer
|
||||||
|
if err := Render(&buf, snap); err != nil {
|
||||||
|
t.Fatalf("render: %v", err)
|
||||||
|
}
|
||||||
|
out := buf.String()
|
||||||
|
|
||||||
|
for _, want := range []string{
|
||||||
|
"# HELP rm_hosts_total ",
|
||||||
|
"rm_hosts_total 2\n",
|
||||||
|
"rm_hosts_online 1\n",
|
||||||
|
`rm_active_alerts{severity="warning"} 1`,
|
||||||
|
`rm_active_alerts{severity="info"} 0`,
|
||||||
|
`rm_active_alerts{severity="critical"} 0`,
|
||||||
|
`rm_build_info{version="v1.2.3",commit="deadbeef",go_version="go1.25.0"} 1`,
|
||||||
|
`rm_host_agent_online{host_id="01H0001",host="alpha"} 1`,
|
||||||
|
`rm_host_agent_online{host_id="01H0002",host="bravo"} 0`,
|
||||||
|
`rm_host_last_backup_timestamp_seconds{host_id="01H0001",host="alpha"} 1700000000`,
|
||||||
|
`rm_host_last_backup_success{host_id="01H0001",host="alpha"} 1`,
|
||||||
|
`rm_host_repo_size_bytes{host_id="01H0001",host="alpha"} 123456789`,
|
||||||
|
`rm_host_snapshot_count{host_id="01H0001",host="alpha"} 42`,
|
||||||
|
`rm_host_snapshot_count{host_id="01H0002",host="bravo"} 0`,
|
||||||
|
`rm_host_open_alerts{host_id="01H0002",host="bravo"} 1`,
|
||||||
|
`rm_host_repo_status{host_id="01H0001",host="alpha",status="ready"} 1`,
|
||||||
|
`rm_host_repo_status{host_id="01H0002",host="bravo",status="init_failed"} 1`,
|
||||||
|
`rm_job_duration_seconds_bucket{kind="backup",status="succeeded",le="1"} 0`,
|
||||||
|
`rm_job_duration_seconds_bucket{kind="backup",status="succeeded",le="5"} 1`,
|
||||||
|
`rm_job_duration_seconds_bucket{kind="backup",status="succeeded",le="+Inf"} 1`,
|
||||||
|
`rm_job_duration_seconds_sum{kind="backup",status="succeeded"} 5`,
|
||||||
|
`rm_job_duration_seconds_count{kind="backup",status="succeeded"} 1`,
|
||||||
|
`rm_job_duration_seconds_bucket{kind="forget",status="succeeded",le="1"} 1`,
|
||||||
|
} {
|
||||||
|
if !strings.Contains(out, want) {
|
||||||
|
t.Errorf("missing line:\n %s\n--- full output ---\n%s", want, out)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// bravo had no last backup → those metric lines must be absent for it.
|
||||||
|
for _, ban := range []string{
|
||||||
|
`rm_host_last_backup_timestamp_seconds{host_id="01H0002"`,
|
||||||
|
`rm_host_last_backup_success{host_id="01H0002"`,
|
||||||
|
`rm_host_repo_size_bytes{host_id="01H0002"`,
|
||||||
|
} {
|
||||||
|
if strings.Contains(out, ban) {
|
||||||
|
t.Errorf("unexpected line for bravo: %q", ban)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestRenderEmptySnapshot(t *testing.T) {
|
||||||
|
r := NewRegistry()
|
||||||
|
snap := r.SnapshotWith(nil, nil, "dev", "", "go1.25.0")
|
||||||
|
var buf bytes.Buffer
|
||||||
|
if err := Render(&buf, snap); err != nil {
|
||||||
|
t.Fatalf("render: %v", err)
|
||||||
|
}
|
||||||
|
out := buf.String()
|
||||||
|
if !strings.Contains(out, "rm_hosts_total 0\n") {
|
||||||
|
t.Errorf("missing zero-host gauge:\n%s", out)
|
||||||
|
}
|
||||||
|
// Histogram block has its HELP/TYPE but no rows. The HELP/TYPE
|
||||||
|
// presence is correct and helps Prometheus pre-register the metric.
|
||||||
|
if !strings.Contains(out, "# TYPE rm_job_duration_seconds histogram") {
|
||||||
|
t.Errorf("histogram TYPE line missing:\n%s", out)
|
||||||
|
}
|
||||||
|
}
|
||||||
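The golden test above asserts on individual lines of the Prometheus text exposition format. As a rough illustration of how one histogram series maps to those lines, here is a standalone sketch. `renderHistogram` and the metric name `demo_seconds` are hypothetical, and the single `kind` label is a simplification of the package's `kind` and `status` pair.

```go
package main

import (
	"fmt"
	"strings"
)

// renderHistogram is a hypothetical helper that writes one series of a
// histogram called name. counts must hold len(bounds)+1 cumulative
// values, the last one being the +Inf total.
func renderHistogram(name, kind string, bounds []float64, counts []uint64, sum float64) string {
	var b strings.Builder
	for i, ub := range bounds {
		// %g prints whole-number bounds without a trailing ".0".
		fmt.Fprintf(&b, "%s_bucket{kind=%q,le=\"%g\"} %d\n", name, kind, ub, counts[i])
	}
	total := counts[len(bounds)]
	fmt.Fprintf(&b, "%s_bucket{kind=%q,le=\"+Inf\"} %d\n", name, kind, total)
	fmt.Fprintf(&b, "%s_sum{kind=%q} %g\n", name, kind, sum)
	fmt.Fprintf(&b, "%s_count{kind=%q} %d\n", name, kind, total)
	return b.String()
}

func main() {
	// One 5-second sample: nothing in le=1, one in le=5 and +Inf.
	fmt.Print(renderHistogram("demo_seconds", "backup",
		[]float64{1, 5}, []uint64{0, 1, 1}, 5))
}
```

Go's `%q` quoting matches the exposition format's label escaping for ordinary identifiers. Label values that can contain unusual control characters would likely need a dedicated escaper, which this sketch leaves out.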
@@ -15,6 +15,7 @@ import (
|
|||||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/alert"
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/alert"
|
||||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/api"
|
||||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/auth"
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/auth"
|
||||||
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/server/metrics"
|
||||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/store"
|
||||||
"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
|
"gitea.dcglab.co.uk/steve/restic-manager/internal/version"
|
||||||
)
|
)
|
||||||
@@ -27,6 +28,9 @@ type HandlerDeps struct {
|
|||||||
// AlertEngine receives job-finished and host-online events so the
|
// AlertEngine receives job-finished and host-online events so the
|
||||||
// alert engine can evaluate its rules. Optional; nil = no-op.
|
// alert engine can evaluate its rules. Optional; nil = no-op.
|
||||||
AlertEngine *alert.Engine
|
AlertEngine *alert.Engine
|
||||||
|
// Metrics records job-duration observations on every terminal
|
||||||
|
// status. Optional; nil = no-op (test fixtures pass nil).
|
||||||
|
Metrics *metrics.Registry
|
||||||
// UpdateWatcher reconciles in-flight agent-update dispatches against
|
// UpdateWatcher reconciles in-flight agent-update dispatches against
|
||||||
// hello envelopes. Optional; nil = no-op.
|
// hello envelopes. Optional; nil = no-op.
|
||||||
UpdateWatcher *UpdateWatcher
|
UpdateWatcher *UpdateWatcher
|
||||||
@@ -239,6 +243,13 @@ func dispatchAgentMessage(ctx context.Context, c *Conn, hostID string, env api.E
|
|||||||
slog.Warn("ws: set host last backup", "host_id", hostID, "err", err)
|
slog.Warn("ws: set host last backup", "host_id", hostID, "err", err)
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
// Job-duration histogram (P6-04). Skip when StartedAt is
|
||||||
|
// missing (race: agent shipped finished without a started,
|
||||||
|
// or the row predates this code).
|
||||||
|
if deps.Metrics != nil && job.StartedAt != nil {
|
||||||
|
deps.Metrics.ObserveJob(job.Kind, string(p.Status),
|
||||||
|
p.FinishedAt.Sub(*job.StartedAt))
|
||||||
|
}
|
||||||
}
|
}
|
||||||
if deps.JobHub != nil {
|
if deps.JobHub != nil {
|
||||||
deps.JobHub.Broadcast(p.JobID, env)
|
deps.JobHub.Broadcast(p.JobID, env)
|
||||||
|
|||||||
@@ -326,12 +326,54 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
|
|||||||
|
|
||||||
## Phase 5 — OSS readiness
|
## Phase 5 — OSS readiness
|
||||||
|
|
||||||
- [ ] **P5-01** (M) Documentation site (mdBook or similar) with install, concepts, security model, screenshots
|
- [x] **P5-01** (M) Documentation site (mdBook or similar) with install, concepts, security model, screenshots
|
||||||
- [ ] **P5-02** (S) `CONTRIBUTING.md`, `CODE_OF_CONDUCT.md`, issue + PR templates
|
- [x] **P5-02** (S) `CONTRIBUTING.md`, `CODE_OF_CONDUCT.md`, issue + PR templates
|
||||||
- [x] **P5-03** (S) Release automation — **pivoted away from goreleaser/binary archives** on 2026-05-05 (spec: `docs/superpowers/specs/2026-05-05-p5-03-docker-only-release.md`). Single deliverable per tag: a multi-arch (linux amd64+arm64) server image, with cross-compiled agent binaries (linux amd64+arm64, windows amd64) + `install.sh` + `install.ps1` + the systemd unit baked under `/opt/restic-manager/dist/`. The `/agent/binary` and `/install/*` handlers fall back from `<DataDir>/...` to `<BundledAssetsDir>/...` so a fresh container Just Works. Workflow `.gitea/workflows/release.yml` triggers on `v*.*.*` tag-push (real release: fan-out `:vX.Y.Z`, `:X.Y`, `:X`, plus `:latest` once `MAJOR>=1`) and `workflow_dispatch` (snapshot: `:snapshot-<shortsha>` only). Pushed to the Gitea container registry on this instance — no external creds, no GHCR mirror. Cosign / SBOM / minisign / GHCR mirror deferred to Phase 6. Source builds via `make build` remain a first-class path.
|
- [x] **P5-03** (S) Release automation — **pivoted away from goreleaser/binary archives** on 2026-05-05 (spec: `docs/superpowers/specs/2026-05-05-p5-03-docker-only-release.md`). Single deliverable per tag: a multi-arch (linux amd64+arm64) server image, with cross-compiled agent binaries (linux amd64+arm64, windows amd64) + `install.sh` + `install.ps1` + the systemd unit baked under `/opt/restic-manager/dist/`. The `/agent/binary` and `/install/*` handlers fall back from `<DataDir>/...` to `<BundledAssetsDir>/...` so a fresh container Just Works. Workflow `.gitea/workflows/release.yml` triggers on `v*.*.*` tag-push (real release: fan-out `:vX.Y.Z`, `:X.Y`, `:X`, plus `:latest` once `MAJOR>=1`) and `workflow_dispatch` (snapshot: `:snapshot-<shortsha>` only). Pushed to the Gitea container registry on this instance — no external creds, no GHCR mirror. Cosign / SBOM / minisign / GHCR mirror deferred to Phase 6. Source builds via `make build` remain a first-class path.
|
||||||
- [ ] **P5-04** (S) Demo screenshots / short Loom walkthrough in README
|
- [x] **P5-04** (S) Demo screenshots / short Loom walkthrough in README
|
||||||
- [ ] **P5-05** (S) `SECURITY.md` with disclosure process
|
- [x] **P5-05** (S) `SECURITY.md` with disclosure process
|
||||||
- [ ] **P5-06** (M) End-to-end test suite in CI (Playwright vs. compose stack with sibling Linux agent)
|
- [x] **P5-06** (M) End-to-end test suite in CI (Playwright vs. compose stack with sibling Linux agent)
|
||||||
|
|
||||||
|
> **As shipped (2026-05-07, branch `p5-oss-readiness`):**
|
||||||
|
>
|
||||||
|
> **P5-01 — docs site.** mdBook under `docs/book/` with structured
|
||||||
|
> chapters: getting-started (install, enrolling hosts, reverse
|
||||||
|
> proxy), concepts (architecture, credentials, schedules + source
|
||||||
|
> groups, repo maintenance), operations (backups + restores, alerts,
|
||||||
|
> observability, updates), security (threat model, hardening,
|
||||||
|
> disclosure), reference (env vars, HTTP endpoints), plus
|
||||||
|
> contributing / roadmap / license pages. mdBook binary downloaded
|
||||||
|
> via Makefile (`make docs` / `make docs-watch`) — same "static
|
||||||
|
> binary, no toolchain" pattern as Tailwind. Generated `book/`
|
||||||
|
> dir gitignored.
|
||||||
|
>
|
||||||
|
> **P5-02 — CONTRIBUTING + CoC + templates.** `CONTRIBUTING.md`
|
||||||
|
> rewritten from placeholder to full guide (setup, conventions,
|
||||||
|
> workflow, RBAC of the project itself). `CODE_OF_CONDUCT.md`
|
||||||
|
> shaped on the Contributor Covenant but adapted for a
|
||||||
|
> single-maintainer project. `.gitea/issue_template/{bug_report,feature_request}.md`
|
||||||
|
> + `.gitea/PULL_REQUEST_TEMPLATE.md`.
|
||||||
|
>
|
||||||
|
> **P5-04 — README screenshots.** Six full-page captures from a
|
||||||
|
> fresh server bootstrap under `docs/screenshots/` (login, empty
|
||||||
|
> dashboard, add host, alerts, settings, audit log). README
|
||||||
|
> rewritten to centre the screenshot grid + link out to the docs site.
|
||||||
|
> Captured live from a working build via Playwright; replaceable
|
||||||
|
> as the UI evolves without breaking layout.
|
||||||
|
>
|
||||||
|
> **P5-05 — SECURITY.md.** Disclosure policy (3-day ack, 30-day
|
||||||
|
> default disclosure window), supported-versions matrix, scope
|
||||||
|
> in/out, threat-model summary, hardening checklist for
|
||||||
|
> operators. Mirrored as a chapter in the docs site.
|
||||||
|
>
|
||||||
|
> **P5-06 — e2e harness.** `e2e/compose.e2e.yml` stands up
|
||||||
|
> server + sibling Linux agent (alpine + restic) + restic/rest-server
|
||||||
|
> backend, with announce-and-approve as the enrolment path so
|
||||||
|
> Playwright drives the operator flow end-to-end. Tests under
|
||||||
|
> `e2e/playwright/tests/`: smoke spec covers bootstrap → login →
|
||||||
|
> accept-pending → backup → terminal-status; second spec scrapes
|
||||||
|
> `/metrics` to verify the P6-04 endpoint. New
|
||||||
|
> `.gitea/workflows/e2e.yml` runs on every PR (separate from the
|
||||||
|
> fast lint/test workflow). Local how-to in `docs/e2e.md`.
|
||||||
- [x] **P5-07** (S) Reference deployment landed alongside P5-03. `deploy/docker-compose.yml` stands up *only* the server (image-pinned via `RM_VERSION`, named volume for operator state, bound to localhost) — TLS termination is left to whichever reverse proxy the operator already runs. `docs/reverse-proxy.md` documents the headers + WebSocket pass-through the proxy must forward, the `RM_TRUSTED_PROXY` CIDR rule, and worked examples for Caddy, nginx, and Traefik.
|
||||||
|
|
||||||
### Phase 5 acceptance
|
||||||
@@ -390,8 +432,45 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
|
|||||||
> swap, helper `buildRepoTrendView` shared between page-load and
|
||||||
> fragment endpoint). No new dependencies, no client JS, no agent
|
||||||
> change. CI green; in-browser smoke walk-through pending operator.
|
||||||
- [ ] **P6-04** (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list. _(Was P4-08.)_
|
- [x] **P6-04** (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list. _(Was P4-08.)_
|
||||||
- [ ] **P6-05** (S) Document Prometheus integration + sample Grafana dashboard JSON. _(Was P4-09.)_
|
- [x] **P6-05** (S) Document Prometheus integration + sample Grafana dashboard JSON. _(Was P4-09.)_
|
||||||
|
|
||||||
|
> **As shipped (2026-05-07, branch `p6-04-05-prometheus-metrics`):**
|
||||||
|
> Spec `docs/superpowers/specs/2026-05-07-p6-04-05-prometheus-metrics-design.md`,
|
||||||
|
> plan `docs/superpowers/plans/2026-05-07-p6-04-05-prometheus-metrics.md`.
|
||||||
|
> New `internal/server/metrics` package emits the legacy
|
||||||
|
> `text/plain; version=0.0.4` exposition format directly — no
|
||||||
|
> `prometheus/client_golang` dependency, matching the repo's
|
||||||
|
> "no Tailwind, no Node" minimal-deps style. `/metrics` is **opt-in**:
|
||||||
|
> `RM_METRICS_TOKEN` and/or `RM_METRICS_TRUSTED_CIDR` must be set or
|
||||||
|
> the route isn't mounted at all (404). When both are set, both must
|
||||||
|
> pass; when only one is set, that one alone gates access. Token comparison is constant-time.
|
||||||
|
> CIDR check honours `X-Forwarded-For` only when the immediate hop
|
||||||
|
> is a configured `RM_TRUSTED_PROXY` (mirrors the existing realIP
|
||||||
|
> resolution).
|
||||||
|
>
|
||||||
|
> **Metrics:** per-host gauges (`rm_host_agent_online`,
|
||||||
|
> `rm_host_last_backup_timestamp_seconds`, `rm_host_last_backup_success`,
|
||||||
|
> `rm_host_repo_size_bytes`, `rm_host_snapshot_count`,
|
||||||
|
> `rm_host_open_alerts`, `rm_host_repo_status`); server gauges
|
||||||
|
> (`rm_hosts_total`, `rm_hosts_online`, `rm_active_alerts{severity}`,
|
||||||
|
> `rm_build_info{version,commit,go_version}`); histogram
|
||||||
|
> `rm_job_duration_seconds_bucket{kind,status,le}` with buckets
|
||||||
|
> `1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf`.
|
||||||
|
> Histogram is in-memory; observations come from the existing
|
||||||
|
> `MsgJobFinished` branch in `internal/server/ws/handler.go`.
|
||||||
|
>
|
||||||
|
> **Docs:** `docs/prometheus.md` covers enable + scrape config +
|
||||||
|
> metric reference + dashboard import. **Dashboard:**
|
||||||
|
> `deploy/grafana/restic-manager-dashboard.json` — six panels
|
||||||
|
> (fleet status, open alerts, backups failing, hosts table, repo
|
||||||
|
> size over time, job-duration p95). Schema 39, single Prometheus
|
||||||
|
> datasource variable.
|
||||||
|
>
|
||||||
|
> **Tests:** golden-render + concurrent-observe + bucket-boundary
|
||||||
|
> in the metrics package; auth matrix (no auth → 404; token
|
||||||
|
> missing/wrong/right; CIDR matching/non-matching; token AND CIDR)
|
||||||
|
> in the HTTP layer.
|
||||||
|
|
||||||
### Phase 6 acceptance
|
||||||
|
|
||||||
|
|||||||