P5: OSS readiness — docs site, contributor onboarding, e2e harness
P5-01 — Documentation site under docs/book/ rendered with mdBook
(downloaded via Makefile, same static-binary pattern as Tailwind).
Structured chapters: getting started, concepts, operations,
security, reference. `make docs` / `make docs-watch`. Generated
output gitignored.
P5-02 — CONTRIBUTING.md rewritten from placeholder to a full
guide. CODE_OF_CONDUCT.md adapted from Contributor Covenant for a
single-maintainer project. .gitea/issue_template/{bug,feature}.md
and PULL_REQUEST_TEMPLATE.md.
P5-04 — Six README screenshots captured live from a fresh server
bootstrap (login, empty dashboard, add-host, alerts, settings,
audit log). README rewritten to centre the screenshot grid and
link out to the docs site.
P5-05 — SECURITY.md with disclosure policy (3-day ack, 30-day
default window), scope in/out, threat-model summary, operator
hardening checklist. Mirrored as a docs-site chapter.
P5-06 — End-to-end test harness. e2e/compose.e2e.yml brings up
server + sibling Linux agent (alpine + restic) + restic/rest-server.
Agent uses announce-and-approve so Playwright can drive the full
operator flow: bootstrap → login → accept pending → backup →
verify terminal status. Second spec scrapes /metrics to assert
the P6-04 endpoint surface. .gitea/workflows/e2e.yml runs on every
PR; local how-to in docs/e2e.md.
@@ -0,0 +1,19 @@
[book]
title = "restic-manager"
description = "Self-hosted control plane for restic backups across a fleet of Linux and Windows endpoints."
authors = ["Steve Cliff"]
language = "en-GB"
multilingual = false
src = "src"

[output.html]
default-theme = "ayu"
preferred-dark-theme = "ayu"
git-repository-url = "https://gitea.dcglab.co.uk/steve/restic-manager"
git-repository-icon = "fa-code-fork"
edit-url-template = "https://gitea.dcglab.co.uk/steve/restic-manager/_edit/main/docs/book/{path}"
no-section-label = false

[output.html.fold]
enable = true
level = 2
@@ -0,0 +1,40 @@
# Summary

[Introduction](./intro.md)

# Getting started

- [Installing the server](./getting-started/install.md)
- [Enrolling your first host](./getting-started/enrolling-hosts.md)
- [Running behind a reverse proxy](./getting-started/reverse-proxy.md)

# Concepts

- [Architecture](./concepts/architecture.md)
- [Credentials and how they flow](./concepts/credentials.md)
- [Schedules and source groups](./concepts/schedules-and-source-groups.md)
- [Repo maintenance](./concepts/repo-maintenance.md)

# Operations

- [Backups and restores](./operations/backups-and-restores.md)
- [Alerts and notifications](./operations/alerts.md)
- [Observability with Prometheus](./operations/observability.md)
- [Updating agents](./operations/updates.md)

# Security

- [Threat model](./security/threat-model.md)
- [Hardening checklist](./security/hardening.md)
- [Reporting vulnerabilities](./security/disclosure.md)

# Reference

- [Environment variables](./reference/env-vars.md)
- [HTTP endpoints](./reference/http-endpoints.md)

---

[Contributing](./contributing.md)
[Roadmap](./roadmap.md)
[License](./license.md)
@@ -0,0 +1,121 @@
# Architecture

## Components

```
┌────────────────────────────────────────────────────────────┐
│  Server (control plane, single process)                    │
│   * chi-based HTTP API + HTMX server-rendered UI           │
│   * WebSocket hub for agent fan-out + browser fan-out      │
│   * SQLite store (modernc.org/sqlite, pure Go)             │
│   * AEAD encryption helpers                                │
│   * Alert engine + notification hub                        │
└────────────┬───────────────────────────────────┬───────────┘
             │ outbound WS only                  │ HTTP(S)
             │                                   │
┌────────────▼─────────────┐        ┌────────────▼─────────────┐
│  Agent (per host)        │        │  Browser (operator)      │
│   * coder/websocket      │        │   * htmx + a tiny bit    │
│   * cron for schedules   │        │     of vanilla JS for    │
│   * restic wrapper       │        │     live job updates     │
│   * sysinfo collector    │        └──────────────────────────┘
└────────────┬─────────────┘
             │ subprocess: restic ...
             │
┌────────────▼─────────────────────────────────────────────────┐
│  restic repository (rest-server, S3, B2, SFTP, local …)      │
│  Backup data flows directly here. Server never touches it.   │
└──────────────────────────────────────────────────────────────┘
```

## Why outbound-only WebSockets?

The agent dials the server on `/ws/agent` with a bearer token. The
server doesn't initiate connections to the agent. Three reasons:

1. **Firewall friendliness.** Nothing on the endpoint needs an
   inbound port; this works behind the typical "branch office NAT"
   without router config.
2. **Single auth point.** The bearer token is the only credential
   that crosses the boundary; the agent never accepts an
   incoming socket.
3. **Reconnect semantics are simpler.** When the connection drops
   (NAT timeout, server restart, transient network glitch) the
   agent backs off and re-dials; the server marks the host
   offline after 90s and lets the alert engine raise a stale-host
   alert.
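
The reconnect behaviour in point 3 can be sketched roughly as follows. This is an illustrative Python sketch, not the agent's Go code: the base delay, the 60s cap, the full-jitter strategy, and both function names are assumptions. Only the 90s offline threshold comes from the text above.

```python
import random

OFFLINE_AFTER_SECONDS = 90  # from the text: the server marks the host offline after 90s


def backoff_delays(base=1.0, cap=60.0, attempts=8, rng=random.random):
    """Yield capped exponential delays with full jitter (assumed strategy)."""
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        # rng() is in [0, 1), so each delay is spread below the ceiling.
        yield rng() * ceiling


def is_offline(last_seen, now, threshold=OFFLINE_AFTER_SECONDS):
    """Server-side view: offline once the host has been silent for the threshold."""
    return (now - last_seen) >= threshold
```

Jitter matters mostly after a server restart, when every agent would otherwise re-dial on the same schedule.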

## Why SQLite?

High availability is an explicit non-goal of the project, and SQLite
fits that. A small control plane managing twelve endpoints does not
need replication or a separate database tier. SQLite gives us:

- A single file to back up (plus the secret key).
- Hand-rolled migrations under `internal/store/migrations/` —
  no migration framework lock-in.
- `WAL` mode plus per-connection foreign-key enforcement.

The migrations define the entire schema; there's no ORM or
query-builder layer between Go code and SQL.
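
A minimal sketch of the two SQLite settings named above, using Python's standard `sqlite3` module for brevity (the project itself uses `modernc.org/sqlite` from Go). The `hosts` and `jobs` tables are hypothetical stand-ins, not the project's schema.

```python
import os
import sqlite3
import tempfile


def open_store(path):
    """Open a connection with WAL journalling and foreign keys enforced."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    # foreign_keys is a per-connection setting in SQLite and defaults to off,
    # so it has to be switched on every time a connection is opened.
    conn.execute("PRAGMA foreign_keys=ON")
    return conn


def demo():
    """Create a throwaway database and try to insert an orphaned job row."""
    path = os.path.join(tempfile.mkdtemp(), "rm.db")
    conn = open_store(path)
    conn.executescript(
        """
        CREATE TABLE hosts (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE jobs (
            id      INTEGER PRIMARY KEY,
            host_id INTEGER NOT NULL REFERENCES hosts(id) ON DELETE CASCADE,
            status  TEXT NOT NULL
        );
        """
    )
    mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    try:
        conn.execute("INSERT INTO jobs (host_id, status) VALUES (999, 'queued')")
        rejected = False
    except sqlite3.IntegrityError:
        rejected = True  # no host 999 exists, so the foreign key blocks the row
    conn.close()
    return mode, rejected
```

Without the `foreign_keys` pragma the orphaned insert would succeed silently, which is why the enforcement is described as per-connection.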

## Why the agent runs `restic` itself, not via the server

The control plane never holds backup bytes in flight. That's
deliberate:

- A compromised control plane cannot exfiltrate snapshot
  contents in-band — at worst it can dispatch new backup or
  forget jobs (audit-logged) but the data path is between the
  agent and the repository.
- The same agent process can target whichever transport restic
  natively supports (rest-server, S3, B2, SFTP, local), no
  separate mux on the server side.

## Job lifecycle

```
             ┌──────────────────────┐
operator →   │ POST /hosts/{id}/    │
             │   run-backup         │
             └──────────┬───────────┘
                        │ 1. INSERT INTO jobs (status='queued')
                        │ 2. dispatch command.run over WS
                        ▼
             ┌──────────────────────┐
             │ Agent dispatches     │
             │ restic subprocess    │
             └──────────┬───────────┘
                        │
                        │ 3. job.started  ───▶ store.MarkJobStarted
                        │ 4. job.progress ───▶ JobHub broadcast (live UI)
                        │ 5. log.stream   ───▶ append to job_logs
                        │ 6. job.finished ───▶ store.MarkJobFinished
                        │                      + alert engine eval
                        │                      + (P6) metrics histogram
                        ▼
        terminal: succeeded | failed | cancelled
```

Operators see live updates because the browser subscribes to
`/api/jobs/{id}/stream`, and the WS handler broadcasts each
agent-emitted envelope to all live subscribers in addition to
persisting it.
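
The lifecycle above amounts to a small state machine driven by agent envelopes. The following Python sketch is illustrative only: the envelope field names (`type`, `percent`, `line`, `status`), the intermediate `running` status, and the policy of ignoring envelopes that arrive after a terminal state are all assumptions, not the server's documented behaviour.

```python
TERMINAL = {"succeeded", "failed", "cancelled"}


def apply_envelope(job, envelope):
    """Apply one agent-emitted envelope to a job record (a plain dict here)."""
    if job["status"] in TERMINAL:
        return job  # assumed policy: late envelopes for finished jobs are dropped
    kind = envelope["type"]
    if kind == "job.started":
        job["status"] = "running"
    elif kind == "job.progress":
        job["progress"] = envelope["percent"]  # live-UI data, per the diagram
    elif kind == "log.stream":
        job["logs"].append(envelope["line"])
    elif kind == "job.finished":
        if envelope["status"] not in TERMINAL:
            raise ValueError("job.finished must carry a terminal status")
        job["status"] = envelope["status"]
    return job
```

In the real server each branch would also fan the envelope out to browser subscribers; the sketch covers only the persisted side.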

## What scheduling looks like

- The agent runs a local `robfig/cron/v3` instance.
- The server pushes the desired schedule set to the agent on
  hello + after every CRUD change.
- When the agent's cron fires, it sends `schedule.fire` to the
  server. The server creates a job row, sends `command.run` back,
  and the agent dispatches a normal backup.
- If the WS drops between fire and run, the server queues the
  schedule firing into `pending_runs` and drains on agent
  reconnect — no missed scheduled backups due to network blips.

For everything that isn't a backup (forget, prune, check), the
server runs a 60-second maintenance ticker against
`host_repo_maintenance` rows and dispatches the relevant command
when a cadence is due. The agent's local cron only handles
backups.
@@ -0,0 +1,98 @@
# Credentials and how they flow

restic-manager handles three credential surfaces:

1. **Operator credentials** — the username + password (or OIDC
   identity) that logs into the UI.
2. **Agent bearer tokens** — issued at enrolment, used by the
   agent to authenticate its WebSocket to the server.
3. **Repo credentials** — the rest-server / S3 / B2 / SFTP
   credentials the agent passes to `restic` itself.

Each has a different threat model and storage strategy.

## Operator credentials

- Local users are stored in `users` with a bcrypt password hash.
- Sessions are random tokens minted at login, stored hashed in
  the `sessions` table, expired after 24h. Cookie is HttpOnly,
  SameSite=Lax, and Secure (when `RM_COOKIE_SECURE=true`,
  default).
- OIDC users carry `auth_source='oidc'` and an `oidc_subject`
  pinning their IdP identity. Local password login is rejected
  for OIDC users.
- Disabling a user soft-deletes them via `disabled_at` —
  pre-existing sessions are invalidated on the next request.

## Agent bearer tokens

- Minted at enrolment, hashed at rest with `auth.HashToken`.
- The plaintext token only exists in memory at enrolment time
  and on the agent's filesystem (`/etc/restic-manager/agent.yaml`,
  mode `0600`, owned by the service user).
- Compromise of the server DB leaks the hashes, which is enough
  to *log in as that agent* until you revoke. Compromise of the
  agent host leaks the plaintext (via the config file) — same
  end result.
- Rotation: re-enrol the host. Today there's no in-place rotate;
  the operator deletes the host (which cascades, including
  revoking the bearer hash) and re-runs the install command.

## Repo credentials

This is the credential that ultimately matters for backup
integrity. restic-manager keeps two slots per host:

- **The everyday credential** (`host_credentials.kind = ''`).
  Append-only-friendly: this is the one your backup schedule
  uses. It can write but not delete or forget.
- **The admin credential** (`host_credentials.kind = 'admin'`).
  Has full delete rights. Only pushed to the agent transiently
  while a `prune` or `forget` job is dispatching, and discarded
  by the agent after the job ends.

### Encryption flow

1. Operator types the credential into the UI or the install form.
2. Server AEAD-encrypts the cred (`crypto.AEAD.Encrypt`) using the
   key in `RM_SECRET_KEY_FILE`. The plaintext is dropped from
   memory.
3. Encrypted blob is stored in `host_credentials.cred_blob`.
4. When the agent connects, the server decrypts the blob and
   sends the **plaintext** down the WebSocket inside a
   `config.update` envelope.
5. The agent stores the plaintext in its in-memory secrets store
   for the lifetime of the process; it's reloaded fresh on every
   server-side push.
6. When a job runs, the agent merges the credential into the
   restic environment (`restic.Env.RepoURL` stays bare; the
   `user:pass@…` form is built only inside `envSlice()` at the
   moment of `exec.Command`).

The merged form is **never logged**. The slog package's structured
output gets `restic.RedactURL()` for any URL it has cause to
mention.
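
Step 6 and the redaction rule can be illustrated with a short Python sketch. It is not the project's `envSlice()` or `restic.RedactURL()`: the function names, the `***` placeholder, and the handling of restic's `rest:` URL prefix shown here are assumptions made for the example.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def _split_prefix(repo_url):
    # restic's REST backend URLs take the form "rest:https://host/path".
    if repo_url.startswith("rest:"):
        return "rest:", repo_url[len("rest:"):]
    return "", repo_url


def merge_credentials(repo_url, user, password):
    """Build the user:pass@host form; intended to be called only at exec time."""
    prefix, url = _split_prefix(repo_url)
    parts = urlsplit(url)
    # Percent-encode both halves so characters like "@" or "/" stay unambiguous.
    userinfo = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return prefix + urlunsplit(parts._replace(netloc=f"{userinfo}@{parts.netloc}"))


def redact_url(repo_url):
    """Replace any password in the URL with *** before it reaches a log line."""
    prefix, url = _split_prefix(repo_url)
    parts = urlsplit(url)
    if parts.password is None:
        return repo_url
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return prefix + urlunsplit(parts._replace(netloc=f"{parts.username}:***@{host}"))
```

Keeping the bare URL as the stored value and deriving the merged form late means a stray log call sees only the bare or redacted form.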

### Why push plaintext over the wire?

The transport itself is the trust boundary: the WebSocket runs
inside the same TLS-terminated reverse-proxy connection your
browser uses, and the agent has already authenticated with its
bearer token. Re-encrypting the payload on top of that would just
move the key-management problem somewhere else.

If your reverse proxy isn't TLS-terminated, the deployment is
already broken — see [Hardening](../security/hardening.md).

## Setup tokens (admin-driven)

When an admin creates a new user, the server mints a one-time
setup link valid for 1 hour. The hash is stored; the raw token
is shown to the admin once. The user opens the link, sets a
password, and is dropped into a session. Expired tokens are
swept on the alert engine's 60s tick.

Same pattern for enrolment tokens: the raw token only exists in
memory at mint time, and the install snippet is the operator's
only chance to capture it. If you lose it, regenerate via the
**Add host** page (NS-02).
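
The mint, redeem, and sweep cycle for these one-time tokens might look like the Python sketch below. The choice of SHA-256, the record layout, and the function names are assumptions; the text states only that the hash is stored, the raw token is shown once, and validity is 1 hour.

```python
import hashlib
import hmac
import secrets

SETUP_TOKEN_TTL_SECONDS = 3600  # "valid for 1 hour"


def hash_token(raw):
    # SHA-256 is an assumption; the text does not say which hash the server uses.
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def mint(now):
    """Return the raw token (shown once) and the record that gets persisted."""
    raw = secrets.token_urlsafe(32)
    record = {"token_hash": hash_token(raw), "expires_at": now + SETUP_TOKEN_TTL_SECONDS}
    return raw, record


def redeem(raw, record, now):
    """Accept the token only if it is unexpired and its hash matches."""
    if now >= record["expires_at"]:
        return False
    return hmac.compare_digest(hash_token(raw), record["token_hash"])


def sweep(records, now):
    """Drop expired records, as a periodic tick would."""
    return [r for r in records if now < r["expires_at"]]
```

`hmac.compare_digest` keeps the comparison constant-time, so response timing does not reveal how much of a guessed hash matched.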
@@ -0,0 +1,85 @@
# Repo maintenance

Backups go in; without maintenance, repos grow forever and
eventually fall over. restic-manager runs three maintenance
operations on a per-host cadence:

| Command  | What it does | Default cadence |
|----------|--------------|-----------------|
| `forget` | Marks snapshots eligible for removal per the retention policy attached to each source group. Cheap; runs append-only. | Daily after the last backup of the day |
| `prune`  | Reclaims space from the repo. Requires the **admin** credential (write+delete). | Weekly, off-peak |
| `check`  | Verifies repo integrity. Sub-options surface lock state. | Weekly, with `--read-data-subset N%` to sample pack files |

A per-host row in the `host_repo_maintenance` table holds the
cron expressions and last-fire anchors. The maintenance ticker on
the server runs every 60s, finds hosts whose next-fire is due,
and dispatches the right command. The agent's local cron is
**only** for backups.

## Why server-side and not agent-side?

The agent's cron knows about backups because backups are
per-source-group. Maintenance is per-repo, not per-source-group,
so doing it server-side keeps the per-host wiring simple:

- One ticker, not N agent crons to keep in sync.
- Cancelling a maintenance dispatch is just "don't dispatch the
  next one" — no agent-side state to clean up.
- Skipping offline hosts is trivial (no queue; only scheduled
  *backups* queue into `pending_runs`).

## Forget and the multi-group payload

A single `forget` job can target several source groups at once.
The wire envelope (`ForgetGroups`) carries one entry per group,
each with its retention policy. The agent runs N
`restic forget --tag <name> --keep-...` invocations in sequence,
streams their output, and reports a single terminal status.
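
As a rough illustration of how a multi-group payload could expand into per-group invocations, here is a Python sketch. The payload shape (a list of dicts with `name` and `retention`) is an assumption about `ForgetGroups`, not its real wire format; the `--tag` and `--keep-*` flags are restic's own.

```python
# Retention keys (as in the source-group YAML) mapped to restic forget flags.
KEEP_FLAGS = [
    ("keep_last", "--keep-last"),
    ("keep_hourly", "--keep-hourly"),
    ("keep_daily", "--keep-daily"),
    ("keep_weekly", "--keep-weekly"),
    ("keep_monthly", "--keep-monthly"),
    ("keep_yearly", "--keep-yearly"),
]


def forget_commands(groups):
    """Return one restic argv per source group, in dispatch order."""
    commands = []
    for group in groups:
        argv = ["restic", "forget", "--tag", group["name"]]
        retention = group.get("retention", {})
        for key, flag in KEEP_FLAGS:
            value = retention.get(key)
            if value:  # unset or zero dimensions are left off the command line
                argv += [flag, str(value)]
        commands.append(argv)
    return commands
```

Building an argv list rather than a shell string avoids quoting problems if a group name contains spaces.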

## Prune and the admin credential

Prune mutates the repo. The everyday append-only credential
**cannot** prune — that's the whole point of append-only.
restic-manager keeps a second slot per host (`kind = 'admin'`)
for the credential that can.

When a prune is dispatched (cadence-driven or operator-driven):

1. Server pushes the admin credential to the agent in a fresh
   `config.update`.
2. Agent runs `restic prune` with the merged credential.
3. Job finishes; agent discards the admin credential from its
   in-memory secrets store.

The server never logs the merged URL (see
[Credentials](./credentials.md)).

## Check and lock state

`restic check` warns about stale locks when it finds them. The
agent ships every check's output back as a `repo.stats` envelope
and a stream of log lines; if a stale lock is detected, the
**Repo** page surfaces a banner with an **Unlock** button. The
operator-only `unlock` command runs `restic unlock` and clears
the banner.

`unlock` has no cadence — it's a manual action, never automatic.
Auto-unlocking would mask the cause (probably a previously
crashed long-running operation) and risk corrupting an
operation the operator has merely lost track of.

## Repo stats

After every backup, check, prune, and unlock, the agent runs
`restic stats --json --mode raw-data` and ships the result as a
`repo.stats` envelope. The server stores this in
`host_repo_stats` (latest only) and `host_repo_stats_history`
(one row per host per day, last-write-wins per column — a
prune-only patch never nulls a backup-time size).
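
The per-column last-write-wins rule maps naturally onto an SQLite upsert with `COALESCE`. The sketch below uses Python's `sqlite3` for brevity; the column names (`host_id`, `day`, `total_size`, `last_prune`) are hypothetical and the real `host_repo_stats_history` schema may differ.

```python
import sqlite3

SCHEMA = """
CREATE TABLE host_repo_stats_history (
    host_id    INTEGER NOT NULL,
    day        TEXT    NOT NULL,
    total_size INTEGER,
    last_prune TEXT,
    PRIMARY KEY (host_id, day)
);
"""

# A NULL in the incoming row keeps whatever the existing row already holds.
UPSERT = """
INSERT INTO host_repo_stats_history (host_id, day, total_size, last_prune)
VALUES (:host_id, :day, :total_size, :last_prune)
ON CONFLICT (host_id, day) DO UPDATE SET
    total_size = COALESCE(excluded.total_size, total_size),
    last_prune = COALESCE(excluded.last_prune, last_prune);
"""


def record(conn, host_id, day, total_size=None, last_prune=None):
    """Merge one partial stats patch into the host's row for that day."""
    conn.execute(UPSERT, {
        "host_id": host_id,
        "day": day,
        "total_size": total_size,
        "last_prune": last_prune,
    })
```

A prune-only patch passes `total_size=None`, so the size written by an earlier backup on the same day survives.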

The host detail page surfaces:

- Total size + raw size in the vitals strip.
- Last-check timestamp + colour-coded status.
- Last-prune timestamp.
- 30/90-day repo size trend chart.
@@ -0,0 +1,105 @@
# Schedules and source groups

Two related but separable ideas:

- A **source group** is a named bundle of "what to back up":
  include paths, exclude patterns, retention policy, retry
  configuration, optional pre/post hooks. The group's name is
  used as the restic snapshot tag, so retention can target it
  with `restic forget --tag <name>`.
- A **schedule** is a cron expression that, when it fires,
  triggers a backup of one or more source groups on a host.

Decoupling them means you can have one schedule covering several
groups (e.g. `0 1 * * *` running both `system` and `data`), and
each group has its own retention without duplicating policy
across schedules.

## Source group anatomy

```yaml
name: data
includes:
  - /var/lib/postgresql
  - /home
excludes:
  - /home/*/.cache
  - /home/*/Downloads
retention:
  keep_last: 7
  keep_daily: 14
  keep_weekly: 4
  keep_monthly: 6
retry_max: 3
retry_backoff_seconds: 600
pre_hook: |
  pg_dump -U postgres -F c -f /var/lib/postgresql/dumps/all.dump
post_hook: |
  rm -f /var/lib/postgresql/dumps/all.dump
```

### Conflict detection

If your retention policy says `keep_hourly: 24` but no schedule
points at this group sub-daily, the UI surfaces a
**conflict-dimension banner** ("`hourly` won't be honoured —
no schedule fires more often than once a day"). The flag is
stored on the source group (`conflict_dimension`) and refreshed
whenever a schedule or group changes.

### Hooks

`pre_hook` and `post_hook` run on the agent host inside
`/bin/sh -c` (`cmd.exe /C` on Windows). Output is streamed back
to the live job log as `hook(<phase>): …` lines.

- A non-zero `pre_hook` exit aborts the backup.
- `post_hook` always runs, with `RM_JOB_STATUS=succeeded|failed`
  in the environment. Use this for cleanup that must happen
  whether the backup worked or not.
- Hooks only run for `kind=backup` jobs. They do not run for
  `forget`, `prune`, `check`, etc.
- Hook bodies are AEAD-encrypted at rest at the HTTP layer; the
  agent receives plaintext over the WS channel.

A "host default" pair of hooks lives on the host itself; a
source group's own hooks override them when set.
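
The hook rules above can be sketched in a few lines of Python. This is an illustration of the stated semantics on a POSIX host, not the agent's implementation: `run_hook`, `run_backup_with_hooks`, and the `backup` callable standing in for the restic run are hypothetical names, and running `post_hook` after a failed `pre_hook` is an assumed reading of "always runs".

```python
import os
import subprocess


def run_hook(script, extra_env=None):
    """Run a hook body through /bin/sh -c and return its exit code."""
    env = dict(os.environ, **(extra_env or {}))
    return subprocess.run(["/bin/sh", "-c", script], env=env).returncode


def run_backup_with_hooks(pre_hook, post_hook, backup):
    """backup is a callable returning True on success (stands in for restic)."""
    status = "failed"
    try:
        if pre_hook and run_hook(pre_hook) != 0:
            return status  # a non-zero pre_hook exit aborts the backup
        status = "succeeded" if backup() else "failed"
        return status
    finally:
        if post_hook:
            # post_hook runs on every path and reads the outcome from the env.
            run_hook(post_hook, {"RM_JOB_STATUS": status})
```

A `post_hook` can then branch on the outcome with ordinary shell, for example `[ "$RM_JOB_STATUS" = failed ] && logger "backup failed"`.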

## Schedule anatomy

```yaml
cron: "0 2 * * *"
enabled: true
source_group_ids:
  - <gid for "data">
  - <gid for "system">
```

Slim by design: a schedule says **when** and **which groups**.
Everything else (paths, retention, hooks) lives on the groups.

The agent's local cron fires the schedule. If the WebSocket is
down at fire time, the server queues the firing into
`pending_runs` and drains it on the next agent reconnect — a
short network blip won't lose the backup.

### Last / next run

The schedules tab shows "next" (computed by parsing the cron
expression with `robfig/cron/v3`) and "last" (the latest
`actor_kind=schedule` job in the `jobs` table) for every
schedule. The dashboard host row also surfaces `next 12h ago/from
now` when a single covering schedule is the run-now candidate.

## Bandwidth limits

Two places set restic's `--limit-upload` / `--limit-download`:

1. **Host-wide caps** on the host row (`bandwidth_up_kbps`,
   `bandwidth_down_kbps`). Pushed to the agent on hello and
   after `PUT /api/hosts/{id}/bandwidth`. Apply to every restic
   invocation on the host.
2. **Per-job overrides** on the per-source-group Run-now form.
   Win over host caps for the lifetime of that one job.

If neither is set, restic runs unthrottled.
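
The precedence between the two sources can be expressed as a small resolver. This Python sketch is illustrative: the function name and parameters are hypothetical, and it assumes the stored values are passed to restic unconverted (restic interprets `--limit-upload` and `--limit-download` in KiB/s).

```python
def bandwidth_flags(host_up=None, host_down=None, job_up=None, job_down=None):
    """Resolve the effective limits and return restic flags; job overrides win."""
    up = job_up if job_up is not None else host_up
    down = job_down if job_down is not None else host_down
    flags = []
    if up:
        flags += ["--limit-upload", str(up)]
    if down:
        flags += ["--limit-download", str(down)]
    return flags  # an empty list means restic runs unthrottled
```

Each direction is resolved independently, so a job can override only the upload cap and still inherit the host's download cap.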
@@ -0,0 +1,17 @@
# Contributing

Full contributor guide:
[`CONTRIBUTING.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/CONTRIBUTING.md)
in the repository root.

The short version:

- Open an issue first for non-trivial changes; the design is
  still moving and unsolicited large PRs may conflict with
  in-flight work.
- `make lint test` must pass.
- One logical change per commit, no `Co-Authored-By` trailers.
- UK English in identifiers and comments; comments explain the
  **why** not the **what**.

Code of conduct: [`CODE_OF_CONDUCT.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/CODE_OF_CONDUCT.md).
@@ -0,0 +1,113 @@
# Enrolling your first host

The control plane only knows about hosts you've explicitly
enrolled. Two paths exist:

1. **Token-based enrolment** — admin generates a token, pastes it
   into an install command on the host. The host appears immediately,
   already mapped to the desired repo.
2. **Announce-and-approve** — the agent runs without a token,
   "announces" itself to the server, and a human in the UI accepts
   the announcement.

Token-based is the default and what most operators want; the
announce flow exists for the case where you can't easily paste a
secret onto the host (auto-imaged endpoints, scripted bring-ups
from a config repo).

## Token-based enrolment

### From the UI

1. Click **+ Add host** on the dashboard.
2. Fill in the hostname, the restic repo URL, and the repo
   credentials. The credentials are AEAD-encrypted at the server
   immediately; what you paste is what the agent receives.
3. Optionally pick the initial source paths — these become the
   first source group on the host.
4. Submit. The server mints a one-time token and shows you a
   copy-pasteable install snippet.

### On the host (Linux)

```sh
curl -fsSL https://restic.example.com/install/install.sh | \
  sudo RM_SERVER=https://restic.example.com \
       RM_ENROL_TOKEN=<token> \
       bash
```

The script:

1. Detects architecture (`amd64` or `arm64`).
2. Downloads the agent binary from `/agent/binary?os=…&arch=…`.
3. Drops the systemd unit at
   `/etc/systemd/system/restic-manager-agent.service`.
4. Runs the agent in `-enrol` mode, which posts the token and
   stores the persistent bearer it gets back.
5. Enables and starts the unit.

Within seconds the host should appear on the dashboard as
**online**.

### On the host (Windows)

```pwsh
$env:RM_SERVER = "https://restic.example.com"
$env:RM_ENROL_TOKEN = "<token>"
iwr -useb $env:RM_SERVER/install/install.ps1 | iex
```

Equivalent shape: registers a Windows service via the SCM
(see P2-16 for details), runs `-enrol`, starts the service.

## Recovering a lost token

Tokens are single-use and short-lived (1h). If you closed the tab
before pasting the install command, head to the **Add host** page —
outstanding tokens are listed there with a **Regenerate** button.
Regenerating revokes the old token's hash and mints a fresh raw
token while preserving the original repo credentials and initial
paths. (NS-02 in `tasks.md` if you want the design rationale.)

## Announce-and-approve

If the host can reach the server but you don't want to paste a
secret on it, run the agent in `-announce` mode:

```sh
restic-manager-agent -announce \
  -server https://restic.example.com \
  -hostname myhost
```

The host appears in the **Pending hosts** panel on the dashboard
with its hostname, OS, arch, and the source IP that announced it.
Click **Accept**, fill in the repo URL + credentials, and the
server pushes the bearer over the still-open WebSocket. No
back-and-forth round trip.

If you don't accept within an hour the announcement is swept.

## What happens on the agent

After enrolment, the agent:

1. Connects via WebSocket to `/ws/agent` with its bearer token.
2. Sends a `hello` envelope with its OS, arch, agent version,
   restic version, and protocol version.
3. Receives a `config.update` carrying its repo credentials
   (encrypted at rest on the server, sent as plaintext over the
   WebSocket; see [Credentials](../concepts/credentials.md)) and
   any source-group paths.
4. Sits idle, sending a heartbeat every 30s. Operator-driven
   "Run now" actions arrive as `command.run` envelopes; scheduled
   jobs are driven by the agent's local cron.

## Auto-init of the repository

The first time a backup runs, the agent invokes `restic init`
against the repo you configured at enrolment. If the repo already
exists (`config file already exists`) the agent treats it as a
success and proceeds. The host's repo status (`unknown` →
`ready` / `init_failed`) is surfaced under the vitals strip on
the host detail page; if init fails, save fresh credentials in
the **Repo** tab to retry.
@@ -0,0 +1,92 @@
# Installing the server

The reference deployment is a single Docker container fronted by
your existing reverse proxy. The image bundles the server binary,
the cross-compiled agent binaries, and the install scripts.

## Prerequisites

- A Linux host with Docker and Docker Compose.
- A reverse proxy in front (Caddy, nginx, Traefik) terminating
  TLS on a public hostname. The server itself is HTTP-only by
  design — see [Reverse proxy](./reverse-proxy.md) for why.
- A persistent volume for the server's data directory.

## Quick start

The reference compose file lives at
[`deploy/docker-compose.yml`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/deploy/docker-compose.yml):

```yaml
services:
  restic-manager:
    image: gitea.dcglab.co.uk/steve/restic-manager:${RM_VERSION:-latest}
    restart: unless-stopped
    environment:
      RM_LISTEN: ":8080"
      RM_DATA_DIR: "/data"
      RM_BASE_URL: "https://restic.example.com"
      # Trust your reverse proxy's CIDR so X-Forwarded-* are honoured.
      RM_TRUSTED_PROXY: "10.0.0.0/8"
    volumes:
      - rm-data:/data
    ports:
      # Bind localhost only — your reverse proxy is the public face.
      - "127.0.0.1:8080:8080"

volumes:
  rm-data:
```

Bring it up:

```sh
docker compose up -d
docker compose logs -f restic-manager
```

The first run prints a one-time **bootstrap token** to the log. Use
it within an hour or it expires; if you miss the window the
container prints it again on next start as long as no admin user
exists.

## First-run admin setup

Open `https://restic.example.com/bootstrap` (or whatever your
public URL is). Paste the bootstrap token, pick a username and a
password (≥ 12 characters), and submit. You'll land in the
dashboard logged in as the new admin.

If you'd rather curl it, the equivalent is:

```sh
curl -X POST https://restic.example.com/api/bootstrap \
  -H 'Content-Type: application/json' \
  -d '{"token":"<token-from-log>","username":"admin","password":"<≥12 chars>"}'
```

## Backing up the secret key

Inside the data volume, `secret.key` holds the AEAD key used to
encrypt every credential at rest. **Back it up separately from
the database.** Without it, encrypted credentials in the database
are unrecoverable; you'd have to re-enrol every host.

A simple working approach: copy `secret.key` to your password
manager or to a separately-backed-up secrets vault the day you
install. It doesn't change.

## Updating the server

```sh
# Pin a new version in your compose file (.env or docker-compose.yml),
# then:
docker compose pull
docker compose up -d
```

Migrations run automatically on startup; the server will refuse to
start if a migration fails (better to bail than to half-migrate).

For the agent self-update story, see
[Updating agents](../operations/updates.md).
@@ -0,0 +1,95 @@
# Running behind a reverse proxy

The restic-manager server is HTTP-only by design. TLS termination,
public hostname, ACME, HSTS, and edge-level rate limiting all
belong to a reverse proxy you already operate outside this project.

## What the proxy must forward

The server reads four headers when (and only when) the immediate
peer matches `RM_TRUSTED_PROXY`:

| Header                   | Value                           | Why |
|--------------------------|---------------------------------|-----|
| `X-Forwarded-For`        | The original client IP          | Rate-limit keys, audit log entries, OIDC redirect-URI checks. |
| `X-Forwarded-Proto`      | `https`                         | Used for absolute URLs (e.g. OIDC redirect URIs). |
| `Host`                   | The public hostname clients use | Cookies are scoped to this; `RM_BASE_URL` must match. |
| `Connection` / `Upgrade` | Pass through unchanged          | `/ws/agent` and `/api/jobs/{id}/stream` are WebSockets; without `Upgrade: websocket` they fail. |

Set `RM_TRUSTED_PROXY` to the CIDR (or comma-separated list of
CIDRs) the proxy connects from. Anything outside that range has
its `X-Forwarded-*` headers ignored, so a stray request that
bypasses the proxy can't spoof the client IP.
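
The trust rule can be made concrete with Python's `ipaddress` module. This is a sketch of the idea rather than the server's logic: the function names are hypothetical, and walking `X-Forwarded-For` from right to left to skip trusted hops is an assumed (though common) policy.

```python
import ipaddress


def parse_trusted(value):
    """Parse a comma-separated CIDR list such as the RM_TRUSTED_PROXY value."""
    return [ipaddress.ip_network(part.strip()) for part in value.split(",") if part.strip()]


def client_ip(peer_ip, headers, trusted):
    """Return the address to use for rate limiting and audit entries."""
    peer = ipaddress.ip_address(peer_ip)
    if not any(peer in net for net in trusted):
        return peer_ip  # untrusted peer: X-Forwarded-* is ignored entirely
    hops = [h.strip() for h in headers.get("X-Forwarded-For", "").split(",") if h.strip()]
    # Walk right to left and take the first hop that is not itself a trusted proxy.
    for hop in reversed(hops):
        try:
            addr = ipaddress.ip_address(hop)
        except ValueError:
            break  # malformed entry: stop trusting the header
        if not any(addr in net for net in trusted):
            return hop
    return peer_ip
```

Taking the rightmost untrusted hop means a client cannot spoof its address by sending a forged `X-Forwarded-For` value, because the proxy appends the real peer after anything the client supplied.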
|
||||
|
||||
## Caddy
|
||||
|
||||
```caddyfile
|
||||
restic.example.com {
|
||||
encode zstd gzip
|
||||
reverse_proxy 127.0.0.1:8080 {
|
||||
header_up X-Real-IP {remote_host}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Caddy adds `X-Forwarded-For` / `X-Forwarded-Proto` automatically
|
||||
and passes WebSocket headers through by default, so this is the
|
||||
whole config.
|
||||
|
||||
## nginx
|
||||
|
||||
```nginx
server {
    listen 443 ssl http2;
    server_name restic.example.com;

    ssl_certificate     /etc/letsencrypt/live/restic.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/restic.example.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;

        # WebSocket upgrade
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";

        # Long-lived agent WS: raise the read timeout to 24h so idle
        # sockets are not cut by nginx's 60s default.
        proxy_read_timeout 86400s;
    }
}
```
|
||||
|
||||
## Traefik
|
||||
|
||||
```yaml
http:
  routers:
    restic-manager:
      rule: "Host(`restic.example.com`)"
      entryPoints: [websecure]
      tls:
        certResolver: letsencrypt
      service: restic-manager

  services:
    restic-manager:
      loadBalancer:
        servers:
          - url: "http://restic-manager:8080"
        passHostHeader: true
```
|
||||
|
||||
Traefik forwards WebSocket upgrades and the standard
|
||||
`X-Forwarded-*` set out of the box.
|
||||
|
||||
## Verification
|
||||
|
||||
After bringing the proxy up, the audit log should show your real
|
||||
client IP for an interactive login (not the proxy's local
|
||||
address). If you see `127.0.0.1` or the proxy's container IP, your
|
||||
`RM_TRUSTED_PROXY` is wrong or `X-Forwarded-For` isn't being
|
||||
forwarded.
|
||||
@@ -0,0 +1,86 @@
|
||||
# restic-manager
|
||||
|
||||
restic-manager is a self-hosted, browser-based, single-pane-of-glass
|
||||
for managing [restic](https://restic.net) backups across a fleet of
|
||||
Linux and Windows endpoints. It's designed for **small fleets** —
|
||||
the original target was twelve endpoints — and **one operator**.
|
||||
|
||||
## What it does
|
||||
|
||||
- Centralised view of every endpoint's last backup, repo size,
|
||||
snapshot count, and recent jobs.
|
||||
- Trigger any restic operation remotely (`backup`, `forget`, `prune`,
|
||||
`check`, `unlock`, `snapshots`, `stats`, `diff`, `restore`).
|
||||
- Per-host backup schedules with source groups (named bundles of
|
||||
paths + retention policy).
|
||||
- Live job log streamed to the browser; downloadable as text or NDJSON.
|
||||
- Restore wizard with snapshot tree browse + path selection.
|
||||
- Repo-level health surfacing (size, raw size, last-check, lock
|
||||
state) plus a 30/90-day size trend.
|
||||
- Alerting over webhook, ntfy, or SMTP.
|
||||
- Cross-platform agent (Linux + Windows).
|
||||
- Append-only-credential-friendly with a separate admin credential
|
||||
for forget/prune.
|
||||
|
||||
## What it isn't
|
||||
|
||||
- **Not a SaaS.** Single-instance, single-tenant, by design.
|
||||
- **Not a replacement for restic** — it's a control plane. The agent
|
||||
shells out to a real `restic` binary.
|
||||
- **Not highly available.** SQLite, single process; if you need
|
||||
HA backups, you're shopping in the wrong aisle.
|
||||
- **Not a multi-protocol backup tool.** restic only.
|
||||
|
||||
## How it fits together
|
||||
|
||||
```
┌──────────────────────────────────────────────┐
│  Server (control plane, Docker)              │
│  - REST + WebSocket API                      │
│  - SQLite store                              │
│  - Embedded HTMX UI                          │
└──────────┬─────────────────────────┬─────────┘
           │ outbound WS             │ HTTP(S)
           │                         │
┌──────────▼──────────┐   ┌──────────▼─────────┐
│  Agent (per host)   │   │ Browser (operator) │
│  - restic wrapper   │   └────────────────────┘
│  - cron for sched.  │
└──────────┬──────────┘
           │ restic
┌──────────▼──────────────────────────────────┐
│ rest-server / S3 / SFTP / local repo        │
│ (the actual backup data — server never      │
│  touches it)                                │
└─────────────────────────────────────────────┘
```
|
||||
|
||||
The control plane is a Go binary that runs in Docker. Each endpoint
|
||||
runs a small Go agent that holds an outbound WebSocket to the
|
||||
control plane. Backup data flows directly between the agent and the
|
||||
restic repository — the control plane never sees a snapshot byte.
|
||||
|
||||
## Where to start
|
||||
|
||||
- [Installing the server](./getting-started/install.md) walks
|
||||
through the Docker-based reference deployment.
|
||||
- [Enrolling your first host](./getting-started/enrolling-hosts.md)
|
||||
covers the install scripts and the announce-and-approve flow.
|
||||
- [Architecture](./concepts/architecture.md) is the right read if
|
||||
you want to know why something is the way it is before running
|
||||
the install.
|
||||
|
||||
## Project status
|
||||
|
||||
Pre-1.0 but feature-complete for the original use case. Phases
|
||||
0–4 are landed (MVP, scheduling, restore, RBAC + OIDC); Phase 5
|
||||
(this docs site, contributor onboarding, end-to-end CI) is in
|
||||
flight. See [`tasks.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/tasks.md)
|
||||
for the live roadmap and [`spec.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/spec.md)
|
||||
for the canonical design doc.
|
||||
|
||||
## License
|
||||
|
||||
[PolyForm Noncommercial 1.0.0](https://polyformproject.org/licenses/noncommercial/1.0.0/).
|
||||
Personal and community deployments welcome; commercial use
|
||||
requires a separate license.
|
||||
@@ -0,0 +1,39 @@
|
||||
# License
|
||||
|
||||
restic-manager is licensed under
|
||||
[**PolyForm Noncommercial 1.0.0**](https://polyformproject.org/licenses/noncommercial/1.0.0/).
|
||||
The full text lives at
|
||||
[`LICENSE`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/LICENSE)
|
||||
in the repository root.
|
||||
|
||||
## What this means
|
||||
|
||||
- **Personal, hobbyist, educational, charitable, and similar
|
||||
noncommercial use** is fully permitted, including modification
|
||||
and redistribution.
|
||||
- **Commercial use is not permitted** without a separate
|
||||
license. The maintainer is not currently offering one — if
|
||||
you need commercial rights, open an issue to start the
|
||||
conversation.
|
||||
- The license is permissive about everything except commercial
|
||||
use: you can fork, modify, deploy in your home/lab, and
|
||||
contribute back.
|
||||
|
||||
## Why this license
|
||||
|
||||
The PolyForm Noncommercial license was chosen because:
|
||||
|
||||
- It's a real, legal, plainly-worded license (not a custom
|
||||
half-written variant).
|
||||
- It permits the realistic uses for a hobby project (the
|
||||
maintainer's homelab, a friend's fleet, a charity's IT
|
||||
closet) without inviting commercial vendors to repackage
|
||||
the work.
|
||||
- It's compatible with the project staying small and
|
||||
maintainable — the maintainer doesn't want to be on the hook
|
||||
for SLA-grade commercial support.
|
||||
|
||||
## Contributions
|
||||
|
||||
By contributing, you agree your contributions are licensed
|
||||
under the same PolyForm Noncommercial 1.0.0 license.
|
||||
@@ -0,0 +1,73 @@
|
||||
# Alerts and notifications
|
||||
|
||||
restic-manager raises alerts on conditions that need human
|
||||
attention. The alert engine evaluates rules on a 60s tick and
|
||||
on every job-finished / host-online event.
|
||||
|
||||
## Built-in alert kinds
|
||||
|
||||
| Kind                  | Trigger | Severity |
|-----------------------|---------|----------|
| `backup_failed`       | A backup job ends in `failed` or `cancelled` | warning |
| `forget_failed`       | A forget job ends in `failed` | warning |
| `prune_failed`        | A prune job ends in `failed` | critical |
| `check_failed`        | A check job ends in `failed` | critical |
| `agent_offline`       | A host has been offline more than 90s past its heartbeat cadence | warning |
| `stale_schedule`      | A schedule's "last run" is more than 1.5 × its interval ago | warning |
| `update_failed`       | An agent self-update returned a fail or didn't reconnect within 90s | warning |
| `fleet_update_halted` | The rolling fleet-update worker stopped on a failure | critical |
|
||||
|
||||
Each alert has a `dedup_key` so re-firing the same condition
|
||||
just bumps `last_seen_at` — the operator gets one row per
|
||||
condition, not a thousand.
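
The dedup behaviour can be pictured as an upsert keyed on `dedup_key`. The sketch below is illustrative only: the `Alert` class, the `first_seen_at` field, the in-memory dict, and the `agent_offline:host-1` key format are all assumptions, not the server's schema.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    dedup_key: str
    kind: str
    first_seen_at: float  # hypothetical field, kept to show it is never bumped
    last_seen_at: float


def raise_alert(open_alerts: dict, dedup_key: str, kind: str, now: float) -> Alert:
    """Insert a new alert, or bump last_seen_at when the condition re-fires."""
    existing = open_alerts.get(dedup_key)
    if existing is not None:
        existing.last_seen_at = now  # same condition: no second row
        return existing
    alert = Alert(dedup_key, kind, first_seen_at=now, last_seen_at=now)
    open_alerts[dedup_key] = alert
    return alert
```

Raising the same key on every 60s tick therefore leaves one row whose `last_seen_at` keeps moving forward.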
|
||||
|
||||
## Lifecycle
|
||||
|
||||
```
raised ──acknowledge──▶ acknowledged ──resolve──▶ resolved
  │                          │
  └────────auto-resolve──────┘
     (e.g. agent_offline auto-resolves on agent_online)
```
|
||||
|
||||
- **Acknowledge** says "I've seen this, stop notifying about it".
|
||||
- **Resolve** says "the underlying condition is gone".
|
||||
- Some alerts auto-resolve when the condition clears
|
||||
(`agent_offline` is the canonical example).
|
||||
|
||||
## Notification channels
|
||||
|
||||
Configure under **Settings → Notifications**. Each channel can
|
||||
subscribe to all alerts or filter by severity.
|
||||
|
||||
### Webhook
|
||||
|
||||
Posts a JSON envelope to a URL of your choice. Useful for
|
||||
piping into Slack via an Incoming Webhook URL or into your own
|
||||
alerting tooling.
|
||||
|
||||
### ntfy
|
||||
|
||||
Pushes a plain-text alert to an [ntfy.sh](https://ntfy.sh/)
|
||||
topic. Configure the topic URL; optional bearer token if you
|
||||
self-host with auth.
|
||||
|
||||
### SMTP
|
||||
|
||||
Plain SMTP (with optional TLS). Configure host, port,
|
||||
username, password, and the recipient list.
|
||||
|
||||
## Test fire
|
||||
|
||||
Each channel exposes a **Test fire** button that dispatches a
|
||||
single synthetic alert through the channel without touching the
|
||||
alert engine. Use this when you've added a channel and want to
|
||||
verify connectivity before the next real failure happens.
|
||||
|
||||
## What gets logged
|
||||
|
||||
Every alert raise / acknowledge / resolve writes an audit log
|
||||
entry. The audit log UI at **Settings → Audit log** filters by
|
||||
user, action, target, and time range — useful for the
|
||||
post-incident "who clicked acknowledge on the prune-failure
|
||||
alert" question.
|
||||
@@ -0,0 +1,73 @@
|
||||
# Backups and restores
|
||||
|
||||
## Running a backup
|
||||
|
||||
Three ways to trigger one:
|
||||
|
||||
1. **Scheduled** — the agent's local cron fires at the time set
|
||||
on the schedule.
|
||||
2. **Run-now** — operator clicks **Run now** on the host detail
|
||||
right rail. Posts to `/hosts/{id}/run-backup` (defaults to all
|
||||
source groups) or to a per-group form for finer control.
|
||||
3. **API** — `POST /api/hosts/{id}/jobs` with the appropriate
|
||||
payload. Same audit + dispatch path.
|
||||
|
||||
In every case the server creates a `jobs` row, broadcasts a
|
||||
`command.run` to the host, and lands the operator on the live
|
||||
job log page (HTMX `HX-Redirect`).
|
||||
|
||||
## Cancelling a job
|
||||
|
||||
Any running job — backup, forget, prune, restore, anything —
|
||||
exposes a **Cancel** button on its detail page. The server
|
||||
broadcasts `command.cancel`, and the agent kills the running
|
||||
restic subprocess via context cancel: SIGTERM first, SIGKILL
|
||||
after a 5s grace (`cmd.Cancel` + `cmd.WaitDelay`). On Windows the
|
||||
SIGTERM step is replaced with `os.Kill` because Windows can't
|
||||
deliver SIGTERM. Result: a cancelled job lands as `cancelled`
|
||||
within a couple of hundred milliseconds.
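
The terminate-then-kill sequence is a common pattern and can be sketched briefly. The agent itself is Go and relies on `cmd.Cancel` and `cmd.WaitDelay`; the Python below is only an analogue of the same idea, and the `cancel` helper is a hypothetical name.

```python
import subprocess
import sys


def cancel(proc: subprocess.Popen, grace: float = 5.0) -> int:
    """Stop a child process politely, then forcefully after a grace period.

    Returns the child's exit code (negative signal number on POSIX).
    """
    # SIGTERM on POSIX; on Windows this calls TerminateProcess, which is
    # already a hard kill, mirroring the Windows note above.
    proc.terminate()
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: the child ignored or outlived the grace period
        return proc.wait()


if __name__ == "__main__":
    # Stand-in for a long-running restic subprocess.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    print("exit code:", cancel(child))
```

A well-behaved child exits on the first signal, which is why most cancellations complete long before the 5s grace expires.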
|
||||
|
||||
## Restore wizard
|
||||
|
||||
Restoring a file or path goes through a four-step wizard at
|
||||
`/hosts/{id}/restore`:
|
||||
|
||||
1. **Pick a snapshot.** Search by id or by date; the page is
|
||||
pre-populated when you launched the wizard from a snapshot row.
|
||||
2. **Browse the snapshot tree.** Lazy-loaded children via the
|
||||
`MsgTreeList` synchronous WS RPC; results are cached
|
||||
per-wizard-session for 30 minutes. Pick the absolute paths
|
||||
you want.
|
||||
3. **Choose a target.** Either **In place** (overwrites the
|
||||
live filesystem; requires you to type the hostname to
|
||||
confirm) or **New directory** (default
|
||||
`$HOME/rm-restore/<job-id>/`; agent expands `$HOME` /
|
||||
`${HOME}` / `~/` and creates the directory chain).
|
||||
4. **Review and submit.** Server mints a job, dispatches
|
||||
`command.run` with a `RestorePayload`, and `HX-Redirect`s to
|
||||
the live job log.
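
The target-path handling in step 3 can be sketched as follows. This is a minimal illustration under stated assumptions: `expand_home` and `prepare_target` are hypothetical names, and the agent's real expansion rules may differ in edge cases.

```python
import os


def expand_home(path: str, home: str) -> str:
    """Expand a leading $HOME, ${HOME} or ~ in a restore target path."""
    for prefix in ("${HOME}", "$HOME", "~"):
        # Match only a whole leading component, so "$HOMEDIR/x" is left alone.
        if path == prefix or path.startswith(prefix + "/"):
            return home.rstrip("/") + path[len(prefix):]
    return path


def prepare_target(path: str, home: str) -> str:
    """Expand the path and create the whole directory chain if missing."""
    target = expand_home(path, home)
    os.makedirs(target, exist_ok=True)
    return target
```

For example, `expand_home("$HOME/rm-restore/42/", "/root")` yields `/root/rm-restore/42/`, while an absolute path such as `/srv/restore` is returned unchanged.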
|
||||
|
||||
`--no-ownership` is gated on restic ≥ 0.17 (the flag was added
|
||||
in that release). Hosts running 0.16 don't get the flag and
|
||||
restore as the running user instead.
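
A version gate of this kind reduces to a tuple comparison. The sketch below assumes the agent already holds the restic version as a plain `major.minor[.patch]` string; how it obtains that string, and the `supports_no_ownership` name, are assumptions.

```python
def supports_no_ownership(restic_version: str) -> bool:
    """True when the restic version is 0.17 or newer."""
    major, minor = (int(part) for part in restic_version.split(".")[:2])
    # Tuple comparison handles 1.x correctly, unlike comparing minor alone.
    return (major, minor) >= (0, 17)


def restore_flags(restic_version: str) -> list:
    """Extra restore flags for this host; empty on older restic releases."""
    return ["--no-ownership"] if supports_no_ownership(restic_version) else []
```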
|
||||
|
||||
## Snapshot diff
|
||||
|
||||
Enter two snapshot ids in the **Diff** form on the host detail
page to start a `JobDiff` job, which runs `restic diff <a> <b>`.
Output streams to the standard live job log. Useful when
investigating a suspiciously-sized backup.
|
||||
|
||||
## Job log artefacts
|
||||
|
||||
Every job's log is persisted in `job_logs` (one row per line),
|
||||
not just streamed in-memory. That gives you:
|
||||
|
||||
- A live view at `/jobs/{id}` while the job runs.
|
||||
- Two download formats from the same page header dropdown:
|
||||
- **txt** — one line per row, `HH:MM:SS.mmm TAG payload`.
|
||||
- **ndjson** — one self-contained JSON object per line
|
||||
(`{seq, ts, stream, payload}`), perfect for `jq`.
|
||||
|
||||
Downloads work whether the job is running or finished —
|
||||
the source is the DB, not the live socket.
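
Because each ndjson line is a standalone JSON object, a downloaded log is easy to post-process without `jq` as well. In the sketch below the field names come from the list above, but the sample values (timestamp format, `stdout` / `stderr` stream names, payload text) are invented for illustration.

```python
import json


def read_ndjson(text: str) -> list:
    """Parse an ndjson log download into a list of dicts, one per log line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def payloads(rows: list, stream: str) -> list:
    """Return the payloads of one stream, ordered by sequence number."""
    ordered = sorted(rows, key=lambda row: row["seq"])
    return [row["payload"] for row in ordered if row["stream"] == stream]


# Hypothetical sample, deliberately out of order to show the sort by seq.
SAMPLE = "\n".join([
    '{"seq": 2, "ts": "2024-01-01T02:00:01Z", "stream": "stderr", "payload": "warning: example"}',
    '{"seq": 1, "ts": "2024-01-01T02:00:00Z", "stream": "stdout", "payload": "starting backup"}',
])

if __name__ == "__main__":
    print(payloads(read_ndjson(SAMPLE), "stdout"))
```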
|
||||
@@ -0,0 +1,61 @@
|
||||
# Observability with Prometheus
|
||||
|
||||
restic-manager can expose a Prometheus scrape endpoint at
|
||||
`GET /metrics`. The endpoint is **opt-in** — without an explicit
|
||||
auth gate it isn't even mounted, so a forgotten config can't
|
||||
accidentally publish fleet state.
|
||||
|
||||
The full reference lives at
|
||||
[`docs/prometheus.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/docs/prometheus.md);
|
||||
the short version follows.
|
||||
|
||||
## Enable the endpoint
|
||||
|
||||
Set at least one of:
|
||||
|
||||
- `RM_METRICS_TOKEN`: requests must send `Authorization: Bearer <token>`.
- `RM_METRICS_TRUSTED_CIDR`: restricts source IPs to a comma-separated
  list of CIDRs.

When both are set, a request must pass both checks. The token is
compared in constant time, and the CIDR check honours
`X-Forwarded-For` only when the immediate hop matches
`RM_TRUSTED_PROXY`.
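
The gate logic can be sketched as a single predicate. This is an illustration under assumptions, not the server's implementation: `metrics_allowed` is a hypothetical name, and `source_ip` is taken to be the address already resolved through the trusted-proxy rule.

```python
import hmac
import ipaddress
from typing import Optional


def metrics_allowed(
    auth_header: Optional[str],
    source_ip: str,
    token: Optional[str] = None,
    trusted_cidrs: Optional[str] = None,
) -> bool:
    """Decide whether a /metrics scrape is allowed."""
    if not token and not trusted_cidrs:
        return False  # neither gate configured: the endpoint is not mounted
    if token:
        expected = ("Bearer " + token).encode()
        presented = (auth_header or "").encode()
        # compare_digest avoids leaking how many leading bytes matched.
        if not hmac.compare_digest(presented, expected):
            return False
    if trusted_cidrs:
        ip = ipaddress.ip_address(source_ip)
        nets = [
            ipaddress.ip_network(c.strip(), strict=False)
            for c in trusted_cidrs.split(",")
            if c.strip()
        ]
        if not any(ip in net for net in nets):
            return False
    return True
```

With both settings present, a correct token from an address outside the CIDR list is still refused, which is the AND behaviour described above.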
|
||||
|
||||
## Metrics emitted
|
||||
|
||||
- **Server gauges**: `rm_hosts_total`, `rm_hosts_online`,
|
||||
`rm_active_alerts{severity}`, `rm_build_info{...}`.
|
||||
- **Per-host gauges**: `rm_host_agent_online`,
|
||||
`rm_host_last_backup_timestamp_seconds`,
|
||||
`rm_host_last_backup_success`, `rm_host_repo_size_bytes`,
|
||||
`rm_host_snapshot_count`, `rm_host_open_alerts`,
|
||||
`rm_host_repo_status`.
|
||||
- **Histogram**:
|
||||
`rm_job_duration_seconds{kind,status,le=…}` (buckets
|
||||
`1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf`).
|
||||
|
||||
The histogram is held in memory only, so its counts reset when
the server restarts. Prometheus persists the scrapes; durable
history at hourly resolution is Prometheus's job.
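
Prometheus histogram buckets are cumulative: an observation increments every bucket whose upper bound (`le`) is at or above the value. The sketch below shows that bookkeeping for the bucket list above; the `observe` helper and the flat counter list are illustrative, not the server's data structure.

```python
import math

# Upper bounds in seconds, as listed above; math.inf stands for +Inf.
BUCKETS = [1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, math.inf]


def observe(counts: list, seconds: float) -> None:
    """Record one job duration into cumulative bucket counters."""
    for i, upper in enumerate(BUCKETS):
        if seconds <= upper:
            counts[i] += 1


if __name__ == "__main__":
    counts = [0] * len(BUCKETS)
    observe(counts, 45)    # a 45-second backup
    observe(counts, 7200)  # a two-hour prune
    print(counts)
```

After those two observations the `le="60"` bucket holds 1 and the `le="21600"` bucket holds 2, since both jobs finished within six hours.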
|
||||
|
||||
## Sample Grafana dashboard
|
||||
|
||||
[`deploy/grafana/restic-manager-dashboard.json`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/deploy/grafana/restic-manager-dashboard.json)
|
||||
imports through Grafana's **+ → Import → Upload JSON file**.
|
||||
Six panels:
|
||||
|
||||
1. Fleet status (online / total).
|
||||
2. Open alerts by severity.
|
||||
3. Backups failing on most-recent run.
|
||||
4. Hosts table — last backup, repo size, snapshots, open alerts.
|
||||
5. Repo size over time, one line per host.
|
||||
6. Job-duration p95 over a 1h window per kind.
|
||||
|
||||
## Alerting
|
||||
|
||||
restic-manager already has a built-in alert engine
|
||||
([Alerts](./alerts.md)). The dashboard intentionally doesn't
|
||||
duplicate it as Prometheus alert rules. If you want
|
||||
Prometheus-side alerts on top, write your own based on the
|
||||
metrics above — `rm_host_last_backup_success == 0`,
|
||||
`time() - rm_host_last_backup_timestamp_seconds > <max age>`,
|
||||
or whatever suits your environment.
|
||||
@@ -0,0 +1,50 @@
|
||||
# Updating agents
|
||||
|
||||
Server updates are a `docker compose pull && docker compose up -d`
away. Agents update via the control plane.
|
||||
|
||||
## Single-host update
|
||||
|
||||
Each host's detail page shows an **Update agent** button when
|
||||
the agent's reported version is older than the server's. The
|
||||
button:
|
||||
|
||||
1. Dispatches a `command.update` to that host.
|
||||
2. The agent fetches the appropriate binary from
|
||||
`$RM_SERVER/agent/binary?os=…&arch=…` to
|
||||
`<binary-path>.new`.
|
||||
3. Copies the running binary to `<binary-path>.old` (one
|
||||
revision back, in case rollback is needed).
|
||||
4. Atomic-renames `.new` over the running binary.
|
||||
5. Exits cleanly. systemd's `Restart=always` (or Windows SCM)
|
||||
brings the process back on the new binary.
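
Steps 2 to 4 amount to a write, a copy, and an atomic rename. The sketch below shows that sequence on a local file; it leaves out the download and the process exit, and `swap_binary` is a hypothetical name rather than the agent's actual function.

```python
import os
import shutil


def swap_binary(binary_path: str, new_bytes: bytes) -> None:
    """Stage a new binary, keep one rollback copy, then swap atomically."""
    new_path = binary_path + ".new"
    old_path = binary_path + ".old"

    # Step 2: write the fetched binary next to the running one.
    with open(new_path, "wb") as fh:
        fh.write(new_bytes)
    os.chmod(new_path, 0o755)

    # Step 3: keep the current binary as the rollback copy.
    shutil.copy2(binary_path, old_path)

    # Step 4: os.replace is an atomic rename when both paths are on the
    # same filesystem, so there is never a moment without a binary.
    os.replace(new_path, binary_path)
```

Staging `.new` in the same directory as the binary is what keeps the rename on one filesystem.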
|
||||
|
||||
A 90-second timer on the server side waits for a hello at the
|
||||
target version and marks the update succeeded — or, if the
|
||||
agent doesn't reconnect at the expected version in time, marks
|
||||
the update **failed** and raises an `update_failed` alert.
|
||||
|
||||
## Fleet update
|
||||
|
||||
The admin-only **Settings → Fleet update** page drives a rolling
|
||||
update across every host in the fleet:
|
||||
|
||||
- One host at a time.
|
||||
- Wait for hello-with-target-version (max 90s).
|
||||
- On any host failing, **halt** the rollout, raise a
|
||||
`fleet_update_halted` alert, leave the rest of the fleet on
|
||||
the old version. No surprise mass-failures.
|
||||
|
||||
You can cancel an in-progress fleet update; the worker stops
|
||||
after the current host finishes.
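
The halt-on-first-failure behaviour can be sketched as a simple loop. Here `update_one` is a hypothetical callback standing in for "dispatch the update and wait for the hello"; the real worker also handles cancellation and raises the alert.

```python
from typing import Callable, List, Optional, Tuple


def rolling_update(
    hosts: List[str], update_one: Callable[[str], bool]
) -> Tuple[List[str], Optional[str]]:
    """Update hosts one at a time and stop at the first failure.

    Returns the hosts that were updated and the host that halted the
    rollout (None when every host succeeded).
    """
    updated = []
    for host in hosts:
        if not update_one(host):
            # Halt: everything after this host stays on the old version.
            return updated, host
        updated.append(host)
    return updated, None
```

With hosts `["a", "b", "c"]` and a failure on `b`, the result is `(["a"], "b")` and `c` is never touched.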
|
||||
|
||||
## TLS and corruption
|
||||
|
||||
Updates rely on the reverse proxy's TLS to detect corruption in
|
||||
transit. There's no separate sha256 verification step — we
|
||||
chose the simpler model on the basis that the same TLS already
|
||||
gates every other byte the server hands to the agent.
|
||||
|
||||
If you'd like a separate signature step before applying updates,
|
||||
that's a future-phase enhancement (see `tasks.md` Phase 6
|
||||
candidates).
|
||||
@@ -0,0 +1,58 @@
|
||||
# Environment variables
|
||||
|
||||
The server reads its configuration from environment variables
|
||||
(canonical) with an optional YAML overlay. Env wins over YAML so
|
||||
operators can tweak a single setting without rewriting the file.
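
The precedence can be pictured as three layers merged in order: defaults, then YAML, then environment. The sketch below is a simplification: it assumes the YAML overlay has already been parsed into a dict keyed by variable name, which may not match the real file's key names.

```python
# Two defaults taken from the table below; the rest are omitted for brevity.
DEFAULTS = {"RM_LISTEN": ":8080", "RM_DATA_DIR": "/data"}


def effective_config(yaml_overlay: dict, environ: dict, defaults: dict = DEFAULTS) -> dict:
    """Merge defaults < YAML overlay < environment variables."""
    cfg = dict(defaults)
    cfg.update(yaml_overlay)
    for name, value in environ.items():
        if name.startswith("RM_"):  # ignore unrelated variables such as PATH
            cfg[name] = value
    return cfg
```

Passing `os.environ` as `environ` gives the behaviour described: a single exported `RM_LISTEN` overrides the file without touching it.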
|
||||
|
||||
## Server
|
||||
|
||||
| Variable                  | Default                     | Meaning |
|---------------------------|-----------------------------|---------|
| `RM_LISTEN`               | `:8080`                     | TCP listener for the HTTP server. |
| `RM_DATA_DIR`             | `/data`                     | Persistent state directory (SQLite, secret key, agent assets). |
| `RM_BASE_URL`             | (none)                      | Public URL clients use; required for OIDC redirects + cookie scope. |
| `RM_SECRET_KEY_FILE`      | `${RM_DATA_DIR}/secret.key` | Path to the AEAD key file. Auto-generated on first run. |
| `RM_COOKIE_SECURE`        | `true`                      | Set `false` only for local HTTP testing. Controls `Secure` on session cookies. |
| `RM_TRUSTED_PROXY`        | (none)                      | Comma-separated CIDRs trusted for `X-Forwarded-*`. |
| `RM_BUNDLED_ASSETS_DIR`   | `/opt/restic-manager/dist`  | Read-only path with bundled agent binaries + install scripts (the docker image bakes them here). |
| `RM_METRICS_TOKEN`        | (off)                       | When set, `GET /metrics` requires `Authorization: Bearer <token>`. |
| `RM_METRICS_TRUSTED_CIDR` | (off)                       | When set, `GET /metrics` restricts source IPs (comma-CIDR). |
|
||||
|
||||
OIDC variables (all optional; empty issuer disables OIDC):
|
||||
|
||||
| Variable                     | Meaning |
|------------------------------|---------|
| `RM_OIDC_ISSUER`             | OIDC discovery URL (e.g. `https://auth.example.com`). |
| `RM_OIDC_CLIENT_ID`          | Client ID registered with the IdP. |
| `RM_OIDC_CLIENT_SECRET`      | Client secret (or use `RM_OIDC_CLIENT_SECRET_FILE`). |
| `RM_OIDC_CLIENT_SECRET_FILE` | Path to a file holding the client secret. |
| `RM_OIDC_DISPLAY_NAME`       | Button label on the login page (e.g. "Authelia"). |
| `RM_OIDC_ROLE_CLAIM`         | Token claim that carries roles (default `groups`). |
| `RM_OIDC_ROLE_MAPPING`       | `idp-group=role` entries, comma-separated (e.g. `rm-admin=admin,rm-ops=operator`). |
| `RM_OIDC_REDIRECT_URL`       | Override for the redirect URL; defaults to `${RM_BASE_URL}/auth/oidc/callback`. |
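
The `RM_OIDC_ROLE_MAPPING` format is simple enough to parse by hand. The sketch below shows one way to read it; `parse_role_mapping` is a hypothetical name, and rejecting malformed entries with `ValueError` is an assumption about error handling.

```python
def parse_role_mapping(spec: str) -> dict:
    """Parse 'idp-group=role' entries separated by commas into a dict."""
    mapping = {}
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate trailing commas and stray spaces
        group, sep, role = entry.partition("=")
        group, role = group.strip(), role.strip()
        if not sep or not group or not role:
            raise ValueError(f"malformed role mapping entry: {entry!r}")
        mapping[group] = role
    return mapping
```

The example value from the table, `rm-admin=admin,rm-ops=operator`, parses to `{"rm-admin": "admin", "rm-ops": "operator"}`.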
|
||||
|
||||
## Agent
|
||||
|
||||
| Variable          | Default                                  | Meaning |
|-------------------|------------------------------------------|---------|
| `RM_AGENT_CONFIG` | `/etc/restic-manager/agent.yaml` (Linux) | Config file path. |
|
||||
|
||||
The agent's other settings live in the YAML file (server URL,
|
||||
bearer token, optional cert pin). The install script writes that
|
||||
file for you at enrolment.
|
||||
|
||||
## Build-time
|
||||
|
||||
The Makefile threads `-ldflags` from `git describe` into the
|
||||
`internal/version` package so `--version` and the dashboard
|
||||
footer show the right values:
|
||||
|
||||
```
-X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Version=$(VERSION)
-X gitea.dcglab.co.uk/steve/restic-manager/internal/version.Commit=$(COMMIT)
```
|
||||
|
||||
If you build with `go build` directly (no Makefile), `Version`
|
||||
falls back to `dev` and the agent-update comparison falls back
|
||||
to "always equal". Source-build deployments can still run; they
|
||||
just don't participate in the self-update flow.
|
||||
@@ -0,0 +1,82 @@
|
||||
# HTTP endpoints
|
||||
|
||||
A non-exhaustive map of the surfaces the control plane exposes.
|
||||
All `/api/*` routes return JSON; all other paths render HTML
|
||||
(server-rendered with HTMX in the loop).
|
||||
|
||||
The canonical wiring lives at
|
||||
[`internal/server/http/server.go`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/internal/server/http/server.go);
|
||||
when in doubt, read the routes block there.
|
||||
|
||||
## Public (no auth)
|
||||
|
||||
| Method | Path                             | Purpose |
|--------|----------------------------------|---------|
| GET    | `/healthz`                       | Liveness probe. Returns 204. |
| POST   | `/api/auth/login`                | Local-user login. JSON body: `{username, password}`. |
| POST   | `/api/auth/logout`               | Invalidate the session cookie. |
| POST   | `/api/bootstrap`                 | First-run admin creation. Accepts the token printed at first start. |
| POST   | `/api/agents/enroll`             | Token-based agent enrolment. |
| POST   | `/api/agents/announce`           | Announce-and-approve agent enrolment. |
| GET    | `/agent/binary?os=&arch=`        | Serves the agent binary for the install scripts. |
| GET    | `/install/*`                     | Serves the Linux + Windows install scripts and the systemd unit. |
| GET    | `/api/version`                   | Build version + commit JSON. |
| GET    | `/metrics`                       | Prometheus exposition (only when opted-in via `RM_METRICS_TOKEN` / `RM_METRICS_TRUSTED_CIDR`). |
| GET    | `/login`, `/setup`, `/bootstrap` | UI pages. |
|
||||
|
||||
## Authenticated (any role)
|
||||
|
||||
| Method | Path                               | Purpose |
|--------|------------------------------------|---------|
| GET    | `/`                                | Dashboard. |
| GET    | `/hosts/{id}`                      | Host detail. |
| GET    | `/hosts/{id}/repo`                 | Repo tab. |
| GET    | `/hosts/{id}/jobs`                 | Jobs tab. |
| GET    | `/hosts/{id}/sources`              | Source groups list. |
| GET    | `/hosts/{id}/schedules`            | Schedules list. |
| GET    | `/jobs/{id}`                       | Live job log. |
| GET    | `/api/hosts`, `/api/fleet/summary` | JSON list + summary. |
| GET    | `/api/jobs/{id}/stream`            | WebSocket subscription to a job's live log. |
| GET    | `/api/jobs/{id}/log.{txt,ndjson}`  | Persisted log download. |
|
||||
|
||||
## Operator role and above
|
||||
|
||||
| Method | Path                                                  | Purpose |
|--------|-------------------------------------------------------|---------|
| POST   | `/hosts/{id}/run-backup`                              | Run-now (HTMX form-post). |
| POST   | `/hosts/{id}/sources/{gid}/run-now`                   | Per-source-group run-now. |
| POST   | `/hosts/{id}/repo/{prune,check,unlock,reinit,probe}`  | Maintenance actions. |
| POST   | `/api/hosts/{id}/snapshots/diff`                      | Snapshot-diff job. |
| POST   | `/hosts/{id}/restore`                                 | Restore wizard submit. |
| POST   | `/api/jobs/{id}/cancel`                               | Cancel a running job. |
| POST   | `/hosts/{id}/tags`                                    | Update host tags. |
| POST   | `/hosts/{id}/sources` and friends                     | Source-group CRUD. |
| POST   | `/hosts/{id}/schedules` and friends                   | Schedule CRUD. |
| POST   | `/hosts/{id}/repo/credentials`, `/admin-credentials`  | Credential update. |
|
||||
|
||||
## Admin role only
|
||||
|
||||
| Method   | Path                          | Purpose |
|----------|-------------------------------|---------|
| POST     | `/hosts/new`                  | Mint enrolment token (Add host). |
| POST     | `/hosts/{id}/delete`          | Delete + cascade. |
| POST     | `/hosts/{id}/update`          | Dispatch a single agent update. |
| GET/POST | `/settings/users/...`         | User management. |
| POST     | `/settings/notifications/...` | Notification channel CRUD + test fire. |
| POST     | `/settings/fleet-update/...`  | Fleet-update worker. |
|
||||
|
||||
## WebSocket
|
||||
|
||||
| Path                    | Who connects          | Auth |
|-------------------------|-----------------------|------|
| `/ws/agent`             | Agent                 | Bearer token issued at enrolment. |
| `/ws/agent/pending`     | Agent (announce flow) | Pending-id query param. |
| `/api/jobs/{id}/stream` | Browser               | Session cookie. |
|
||||
|
||||
## RBAC enforcement
|
||||
|
||||
Routes are grouped into chi route-groups by required role
|
||||
(`viewer < operator < admin`); the `requireRole` middleware in
|
||||
`internal/server/http/middleware.go` is the bouncer. Sessions
|
||||
re-validate `disabled_at` on every request, so a disabled user's
|
||||
cookie stops working immediately.
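
The role ordering and the per-request `disabled_at` check can be sketched as one predicate. The real enforcement is Go middleware in chi route groups; the Python below only illustrates the comparison, and `allowed` and `ROLE_RANK` are hypothetical names.

```python
from typing import Optional

# viewer < operator < admin, as stated above.
ROLE_RANK = {"viewer": 0, "operator": 1, "admin": 2}


def allowed(user_role: str, required_role: str, disabled_at: Optional[str] = None) -> bool:
    """True when an enabled user holds at least the required role."""
    if disabled_at is not None:
        # Checked on every request, so a disabled user's session fails at once.
        return False
    return ROLE_RANK[user_role] >= ROLE_RANK[required_role]
```

An operator therefore passes every viewer route, and an admin whose `disabled_at` is set passes none.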
|
||||
@@ -0,0 +1,32 @@
|
||||
# Roadmap
|
||||
|
||||
The live roadmap is in
|
||||
[`tasks.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/tasks.md).
|
||||
Phases ship in order; items inside a phase ship as the
|
||||
opportunity arises.
|
||||
|
||||
## Status snapshot
|
||||
|
||||
| Phase | Theme                                        | Status |
|-------|----------------------------------------------|--------|
| 0     | Project bootstrap                            | ✅ done |
| 1     | MVP: enrolment, visibility, on-demand backup | ✅ done |
| 2     | Scheduling, retention, repo operations       | ✅ done |
| 3     | Restore, alerts, audit                       | ✅ done |
| 4     | RBAC, OIDC, host tags                        | ✅ done |
| 5     | OSS readiness                                | 🚧 in flight (this docs site is part of it) |
| 6     | Update delivery + observability polish       | ✅ done |
|
||||
|
||||
## What's not on the roadmap
|
||||
|
||||
The non-goals list in [`spec.md` §2](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/spec.md):
|
||||
|
||||
- Replacing restic itself or providing custom repo formats
|
||||
- Managing non-restic backup tools
|
||||
- Multi-tenancy / SaaS deployment
|
||||
- High availability of the control plane (SQLite, single-instance)
|
||||
- Mobile-native apps (responsive web only)
|
||||
|
||||
If something there is critical to your use case, restic-manager
|
||||
isn't the right tool. That's not a closed door — it's a
|
||||
deliberate scope decision so the project stays maintainable.
|
||||
@@ -0,0 +1,35 @@
|
||||
# Reporting vulnerabilities
|
||||
|
||||
The full disclosure policy lives in
|
||||
[`SECURITY.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/SECURITY.md)
|
||||
at the repo root. The short version:
|
||||
|
||||
- **Don't open a public issue.**
|
||||
- Send a Gitea private message to `steve` on
|
||||
<https://gitea.dcglab.co.uk>, or email the address on the
|
||||
maintainer's profile, with a subject like
|
||||
`[SECURITY] restic-manager: <one-line summary>`.
|
||||
- Expect an acknowledgement within 3 working days; escalate
|
||||
through the other channel if you don't get one.
|
||||
- Default disclosure window is **30 days from confirmed report
|
||||
to public disclosure**, faster if a PoC is already
|
||||
circulating, slower only by mutual agreement.
|
||||
|
||||
## What to include
|
||||
|
||||
A description of the issue and the impact, the affected
|
||||
component (server / agent / install script / docs), the version,
|
||||
and reproduction steps. A working PoC is welcome but not
|
||||
required — a credible threat model is enough.
|
||||
|
||||
## In scope vs. out of scope
|
||||
|
||||
See the full policy. Quick highlights:
|
||||
|
||||
- **In scope:** server, agent, install scripts, docker image,
|
||||
docker-compose reference, crypto choices, docs that lead to
|
||||
insecure configs.
|
||||
- **Out of scope:** restic itself (report upstream), unpatched
|
||||
third-party deps (report upstream first), pre-authenticated
|
||||
admin abuse (admins are designed to have full power), DoS on
|
||||
deployments without the recommended reverse proxy.
|
||||
@@ -0,0 +1,72 @@
|
||||
# Hardening checklist
|
||||
|
||||
A baseline for new deployments. Most of these are defaults; the
|
||||
list is here to make audit easy.
|
||||
|
||||
## Server
|
||||
|
||||
- [ ] Reverse proxy in front, TLS terminating at the proxy
|
||||
(Caddy/nginx/Traefik).
|
||||
- [ ] `RM_TRUSTED_PROXY` set to the proxy's CIDR.
|
||||
- [ ] `RM_BASE_URL` matches the public hostname and the cookie
|
||||
scope you want.
|
||||
- [ ] `RM_COOKIE_SECURE=true` (the default; only set `false`
|
||||
for local HTTP testing).
|
||||
- [ ] HTTP listener bound to **localhost** in the compose file,
|
||||
not `0.0.0.0`. The reverse proxy is the only thing that
|
||||
should reach it.
|
||||
- [ ] `secret.key` backed up separately from the database.
|
||||
- [ ] Bootstrap token consumed and the printed log line scrubbed
|
||||
from any log archive.
|
||||
|
||||
## Authentication
|
||||
|
||||
- [ ] Admin user has a password ≥ 12 characters (the floor).
|
||||
- [ ] OIDC enabled if you have an IdP — local password auth
|
||||
stays as a break-glass.
|
||||
- [ ] Disabled (not deleted) any users who change roles or leave
|
||||
so their session is invalidated immediately.
|
||||
- [ ] The last-admin guard isn't tripped — there's always at
|
||||
least one enabled admin user.
|
||||
|
||||
## Repo credentials
|
||||
|
||||
- [ ] Append-only credential set as the everyday cred for every
|
||||
host.
|
||||
- [ ] Admin credential set only where prune cadence is enabled.
|
||||
- [ ] No credentials reused across hosts. Each host should have
|
||||
its own credential pair so a single host compromise has a
|
||||
single blast radius.
|
||||
- [ ] If using rest-server, `--append-only` flag is on for the
|
||||
everyday user; the prune user is a separate identity.
|
||||
|
||||
## Agent
|
||||
|
||||
- [ ] Agent runs as `root` (Linux) or `LocalSystem` (Windows)
|
||||
**only when** the source paths require it. Otherwise pin
|
||||
a service user that has read access to what's backed up
|
||||
and nothing else.
|
||||
- [ ] systemd unit's sandboxing flags are intact
|
||||
(`NoNewPrivileges`, `Protect*`, `MemoryDenyWriteExecute`).
|
||||
- [ ] Agent's config file `/etc/restic-manager/agent.yaml` is
|
||||
mode `0600` and owned by the service user. The bearer
|
||||
token lives in there.
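
The file-mode item is easy to check from a script on Linux hosts. The sketch below is one possible audit helper, not part of the project: `config_mode_ok` is a hypothetical name, and it checks permission bits only, not ownership.

```python
import os
import stat


def config_mode_ok(path: str) -> bool:
    """True when the file grants no permissions to group or others.

    Mode 0600 passes; 0640 or 0644 fail because the bearer token in the
    file would be readable by other accounts.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0


if __name__ == "__main__":
    path = "/etc/restic-manager/agent.yaml"
    if os.path.exists(path):
        print(path, "ok" if config_mode_ok(path) else "too permissive")
```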
|
||||
|
||||
## Operations
|
||||
|
||||
- [ ] Alerts wired to a real channel (webhook into Slack,
|
||||
ntfy topic, SMTP) — not just sitting in the UI.
|
||||
- [ ] Test-fire each notification channel after configuring.
|
||||
- [ ] Audit-log retention is long enough to cover the operator's
|
||||
incident-response window.
|
||||
- [ ] Prometheus endpoint, if enabled, gated by token AND CIDR
|
||||
where practical (default is opt-in / off).
|
||||
|
||||
## Recovery
|
||||
|
||||
- [ ] A documented procedure for rotating a leaked agent bearer
|
||||
(delete + re-enrol the host).
|
||||
- [ ] A test-restore done at least once, end-to-end, before
|
||||
relying on the system in anger.
|
||||
- [ ] `secret.key` and the SQLite database covered by separate
|
||||
backup paths so that neither backup alone can decrypt the stored credentials.
|
||||
@@ -0,0 +1,110 @@
|
||||
# Threat model
|
||||
|
||||
This page documents what restic-manager defends against, what it
|
||||
doesn't, and the trust assumptions a deployment is making. The
|
||||
canonical version lives in [`spec.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/spec.md)
|
||||
§11; the summary here is shaped for operators rather than
|
||||
implementers.
|
||||
|
||||
## Trust boundaries
|
||||
|
||||
```
|
||||
┌──────────────────────────────────────────┐
|
||||
│ TRUSTED zone │
|
||||
│ ┌─────────────┐ ┌──────────────┐ │
|
||||
│ │ Operator's │ │ Reverse │ │
|
||||
│ │ browser │◄──►│ proxy │ │ TLS terminates here
|
||||
│ └─────────────┘ └──────┬───────┘ │
|
||||
└────────────────────────────┼─────────────┘
|
||||
│ HTTP, plaintext
|
||||
│ (loopback or trusted LAN)
|
||||
┌────────────────────────────▼─────────────┐
|
||||
│ Server (control plane) │
|
||||
└────────────┬─────────────────────────────┘
|
||||
│ outbound WebSocket (TLS to clients via proxy)
|
||||
│ — bearer-authenticated
|
||||
┌────────────▼──────────────┐
|
||||
│ Agent (per host) │ ◄── attacker model: assume one
|
||||
└────────────┬──────────────┘ endpoint can be compromised
|
||||
│ subprocess
|
||||
▼
|
||||
restic ──▶ repository (rest-server / S3 / SFTP / …)
|
||||
```
|
||||
|
||||
## What we defend against
|
||||
|
||||
### Network attacker between operator and server
|
||||
|
||||
- HTTPS via the reverse proxy is the only operator-facing surface
|
||||
on a sane deployment.
|
||||
- `RM_COOKIE_SECURE=true` (default) means the session cookie
|
||||
refuses to ride a non-HTTPS connection.
|
||||
- `RM_TRUSTED_PROXY` gates whether `X-Forwarded-*` is honoured;
|
||||
a request that bypasses the proxy can't spoof the client IP.
|
||||
|
||||
### Compromised agent host
|
||||
|
||||
- The agent's bearer token can dispatch commands **only on its
|
||||
own host**. It can't read other hosts' state, dispatch jobs
|
||||
on other hosts, or escalate within the control plane.
|
||||
- If you suspect a host compromise:
|
||||
1. Delete the agent's host row via **Hosts → Delete**
|
||||
(this cascades to the bearer hash).
|
||||
2. Rotate the repo credential at the rest-server / object
|
||||
store side.
|
||||
3. Review the audit log, which lists every action that bearer ever drove.
|
||||
|
||||
### DB compromise without the secret key
|
||||
|
||||
- Repo credentials are AEAD-encrypted at rest. A DB dump alone
|
||||
doesn't expose them.
|
||||
- Agent bearer **hashes** are leaked; that's enough to
|
||||
authenticate as any agent until you revoke. Today the only
|
||||
rotation procedure is "delete + re-enrol".
|
||||
- Operator passwords are bcrypt-hashed; OIDC users have no
|
||||
password to leak.
|
||||
- Session tokens are hashed; an attacker can't replay a
|
||||
session from a DB dump.
|
||||
|
||||
### DB compromise WITH the secret key
|
||||
|
||||
The attacker can decrypt every credential. Treat
|
||||
`secret.key` with the same care as a password manager database.
|
||||
Back it up to a separate vault, not to the same Docker volume
|
||||
as the database.
|
||||
|
||||
### Forget/prune as a DoS vector
|
||||
|
||||
- The everyday backup credential cannot prune (append-only).
|
||||
- The admin credential is only pushed to the agent at the
|
||||
moment of dispatch and discarded after the job ends.
|
||||
- Compromise of a single agent host does **not** grant prune
|
||||
rights — at worst the attacker gets fresh write access until
|
||||
the credential is rotated.
|
||||
|
||||
### Operator-side typo or bad copy-paste
|
||||
|
||||
- Repo credentials are stored encrypted; mis-typed creds fail
|
||||
fast on the next `restic` invocation rather than silently
|
||||
corrupting state.
|
||||
- NS-03 added auto-init: the first dispatched job after creds
|
||||
change runs `restic init`, surfaces the error eagerly under
|
||||
the host's vitals strip if the creds are bad, and resets the
|
||||
host's `repo_status` so the operator can retry without
|
||||
hunting through job logs.
|
||||
|
||||
## What we don't defend against
|
||||
|
||||
- **Insider threat at the maintainer level.** A malicious
|
||||
maintainer can publish a backdoored container; SBOM /
|
||||
signing infrastructure (Phase 6 candidate) would help here
|
||||
but isn't shipped today.
|
||||
- **Supply chain.** We pin module versions (`go.sum`) and
|
||||
pin the Tailwind binary's release tag, but a compromise in
|
||||
one of those upstreams would land here.
|
||||
- **Side-channel via restic itself.** A bug in restic that
|
||||
enables snapshot-content disclosure is restic's problem; the
|
||||
control plane doesn't see snapshot bytes either way.
|
||||
- **DoS via resource exhaustion** without the recommended
|
||||
reverse-proxy / rate-limit in front. Don't expose the
|
||||
server's HTTP port to the public internet directly.
|
||||
@@ -0,0 +1,120 @@
|
||||
# End-to-end test harness
|
||||
|
||||
The e2e harness stands up the full production-shaped stack
|
||||
(server + agent + rest-server) in Docker Compose and drives it
|
||||
through Playwright. CI runs it on every PR; operators can run it
|
||||
locally too.
|
||||
|
||||
## Files
|
||||
|
||||
```
|
||||
e2e/
|
||||
├── compose.e2e.yml compose stack: server + rest-server + agent
|
||||
├── Dockerfile.agent Linux container for the agent (alpine + restic)
|
||||
├── agent-entrypoint.sh decides between announce / token-enrol / run
|
||||
└── playwright/
|
||||
├── package.json
|
||||
├── playwright.config.ts
|
||||
└── tests/
|
||||
├── lib/server.ts bootstrap, login, accept, poll helpers
|
||||
└── smoke.spec.ts happy-path: enrol → backup → succeeded
|
||||
```
|
||||
|
||||
## Local run
|
||||
|
||||
Prerequisites: Docker + Docker Compose, and `npx` for Playwright.
|
||||
|
||||
```sh
|
||||
# 1. Build + bring up the stack (server, rest-server, source data).
|
||||
docker compose -f e2e/compose.e2e.yml up --build -d server rest-server source-fixture
|
||||
|
||||
# 2. Wait for the server, then scrape the bootstrap token from the log.
|
||||
until curl -fsS http://127.0.0.1:8080/api/version >/dev/null; do sleep 1; done
|
||||
RM_BOOTSTRAP_TOKEN=$(docker compose -f e2e/compose.e2e.yml logs server \
|
||||
| grep -Eo '[a-zA-Z0-9_-]{40,}' | head -1)
|
||||
export RM_BOOTSTRAP_TOKEN
|
||||
|
||||
# 3. Start the agent (it announces against the running server).
|
||||
docker compose -f e2e/compose.e2e.yml up -d agent
|
||||
|
||||
# 4. Install + run Playwright.
|
||||
cd e2e/playwright
|
||||
npm install
|
||||
npx playwright install --with-deps chromium
|
||||
npx playwright test
|
||||
```
|
||||
|
||||
When the tests pass you'll see:
|
||||
|
||||
```
|
||||
Running 2 tests using 1 worker
|
||||
✓ smoke: enrol-via-announce → backup › happy path completes in under a minute (47s)
|
||||
✓ smoke: scrape /metrics › metrics endpoint exposes the host gauge (180ms)
|
||||
|
||||
2 passed (47.5s)
|
||||
```
|
||||
|
||||
Tear-down:
|
||||
|
||||
```sh
|
||||
docker compose -f e2e/compose.e2e.yml down -v
|
||||
```
|
||||
|
||||
`-v` removes the named volumes too — important between runs because
|
||||
the rest-server volume holds an initialised repo and the
|
||||
agent-config volume holds a stale bearer.
|
||||
|
||||
## What the test exercises
|
||||
|
||||
1. **Bootstrap.** Posts the admin-creation request to
|
||||
`/api/bootstrap` with the token scraped from the server log.
|
||||
2. **Login (UI).** Drives the login form via Playwright; verifies
|
||||
the dashboard loads with a session cookie set.
|
||||
3. **Pending host appears.** Polls the dashboard for the inline
|
||||
accept form generated by the announcing agent; reads the
|
||||
pending-id out of its action URL.
|
||||
4. **Accept.** POSTs `/api/pending-hosts/{id}/accept` with the
|
||||
rest-server URL + repo password. The server mints a Host row
|
||||
+ bearer + AEAD-encrypted creds and pushes the bearer down
|
||||
the still-open pending WebSocket.
|
||||
5. **Online + auto-init.** Polls `/api/hosts` until the new host
|
||||
is `status=online`. Auto-init runs as part of this — the
|
||||
first dispatched job after creds save is `restic init`.
|
||||
6. **Run backup.** Submits the host detail page's `Run now`
|
||||
form; expects `HX-Redirect` to the live job page.
|
||||
7. **Verify.** Polls `/api/hosts` until the host's
|
||||
`last_backup_status` flips to `succeeded`.
|
||||
8. **Metrics.** Scrapes `/metrics` and asserts the
|
||||
server-gauge + build-info lines are present (the compose
|
||||
stack opens the endpoint via `RM_METRICS_TRUSTED_CIDR=0.0.0.0/0`).
|
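Steps 3, 5 and 7 above all share the same shape: repeat a check until it succeeds or a deadline passes. The real helpers live in `lib/server.ts`; the snippet below is only a shell sketch of that polling pattern for manual debugging, and `poll_until` is a hypothetical name.

```sh
# Hypothetical helper: poll_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it exits 0, or fails once TIMEOUT_SECONDS have elapsed.
poll_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Example check (assumptions: the JSON shape of /api/hosts, and that the
# request is already authenticated; neither is specified on this page):
#   backup_succeeded() {
#     curl -fsS http://127.0.0.1:8080/api/hosts \
#       | grep -q '"last_backup_status":"succeeded"'
#   }
#   poll_until 60 1 backup_succeeded
```

Keeping the timeout explicit means a stuck stack fails with a clear message instead of hanging the shell.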
||||
|
||||
## CI workflow
|
||||
|
||||
[`.gitea/workflows/e2e.yml`](../.gitea/workflows/e2e.yml) runs the
|
||||
suite on every PR into `main`. On failure it dumps the last 200
|
||||
lines of each container log as a workflow annotation and uploads
|
||||
the Playwright HTML report as an artefact.
|
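To get a similar log dump when a local run fails, something like the following may help. It is a local approximation only: the workflow's actual commands are not shown on this page, and `dump_e2e_logs` is a hypothetical name. The service names are the ones used by the compose stack described above.

```sh
# Hypothetical helper: print the last 200 log lines of each e2e container.
# Takes the compose file as an optional first argument.
dump_e2e_logs() {
  local compose_file="${1:-e2e/compose.e2e.yml}"
  local svc
  for svc in server rest-server agent; do
    echo "===== $svc (last 200 lines) ====="
    docker compose -f "$compose_file" logs --no-color --tail 200 "$svc"
  done
}
```

Run `dump_e2e_logs` from the repository root before tearing the stack down, since `down -v` removes the containers along with their logs.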
||||
|
||||
## When tests fail
|
||||
|
||||
- **Pending host never appears.** Agent container probably
|
||||
couldn't reach the server. Check `docker compose logs agent`
|
||||
for connection errors and `docker compose logs server` for
|
||||
any 4xx on `/api/agents/announce`.
|
||||
- **Backup hangs in `running`.** The agent shells out to
|
||||
`restic`; check the live job log at
|
||||
`http://127.0.0.1:8080/jobs/<id>` (still up after a
|
||||
failed test as long as you didn't `down -v`).
|
||||
- **`RM_BOOTSTRAP_TOKEN not set`.** The server log scrape
|
||||
matched the wrong line or the token regex is too tight. The
|
||||
server prints the token on a line indented by four spaces
|
||||
inside a banner; widen the regex if your server log
|
||||
format changes.
|
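For the bootstrap-token case, a scrape that fails loudly is easier to debug than an empty variable. The sketch below wraps the same regex used in the local-run steps in a hypothetical function, `extract_bootstrap_token`, that reads the server log on stdin and returns non-zero when nothing matches.

```sh
# Hypothetical helper: print the first token-shaped string found on stdin,
# or fail with a message if the log contains none.
extract_bootstrap_token() {
  local token
  token=$(grep -Eo '[a-zA-Z0-9_-]{40,}' | head -n 1)
  if [ -z "$token" ]; then
    echo "no bootstrap token found in server log" >&2
    return 1
  fi
  printf '%s\n' "$token"
}
```

Usage would look like `RM_BOOTSTRAP_TOKEN=$(docker compose -f e2e/compose.e2e.yml logs server | extract_bootstrap_token) || exit 1`. If the wrong line matches, inspect the raw log first and adjust the pattern from there.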
||||
|
||||
## Adding new tests
|
||||
|
||||
The harness is intentionally flat — one `*.spec.ts` per
|
||||
scenario. Reuse the helpers in `lib/server.ts` and avoid
|
||||
duplicating bootstrap / login boilerplate. Heavy fixtures
|
||||
(custom users, OIDC IdP) belong in their own compose override
|
||||
file rather than complicating `compose.e2e.yml`.
|
||||
Binary file not shown.
|
After Width: | Height: | Size: 27 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 98 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 178 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 48 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 92 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 47 KiB |