P5: OSS readiness — docs site, contributor onboarding, e2e harness

P5-01 — Documentation site under docs/book/ rendered with mdBook
(downloaded via Makefile, same static-binary pattern as Tailwind).
Structured chapters: getting started, concepts, operations,
security, reference. `make docs` / `make docs-watch`. Generated
output gitignored.

P5-02 — CONTRIBUTING.md rewritten from placeholder to a full
guide. CODE_OF_CONDUCT.md adapted from Contributor Covenant for a
single-maintainer project. .gitea/issue_template/{bug,feature}.md
and PULL_REQUEST_TEMPLATE.md.

P5-04 — Six README screenshots captured live from a fresh server
bootstrap (login, empty dashboard, add-host, alerts, settings,
audit log). README rewritten to centre the screenshot grid and
link out to the docs site.

P5-05 — SECURITY.md with disclosure policy (3-day ack, 30-day
default window), scope in/out, threat-model summary, operator
hardening checklist. Mirrored as a docs-site chapter.

P5-06 — End-to-end test harness. e2e/compose.e2e.yml brings up
server + sibling Linux agent (alpine + restic) + restic/rest-server.
Agent uses announce-and-approve so Playwright can drive the full
operator flow: bootstrap → login → accept pending → backup →
verify terminal status. Second spec scrapes /metrics to assert
the P6-04 endpoint surface. .gitea/workflows/e2e.yml runs on every
PR; local how-to in docs/e2e.md.
2026-05-07 23:56:02 +01:00
parent ff8a5dbead
commit bb4ed3502d
47 changed files with 2818 additions and 61 deletions
@@ -0,0 +1,121 @@
# Architecture
## Components
```
┌────────────────────────────────────────────────────────────┐
│  Server (control plane, single process)                    │
│   * chi-based HTTP API + HTMX server-rendered UI           │
│   * WebSocket hub for agent fan-out + browser fan-out      │
│   * SQLite store (modernc.org/sqlite, pure Go)             │
│   * AEAD encryption helpers                                │
│   * Alert engine + notification hub                        │
└────────────┬───────────────────────────────────┬───────────┘
             │ outbound WS only                  │ HTTP(S)
             │                                   │
┌────────────▼─────────────┐        ┌────────────▼─────────────┐
│  Agent (per host)        │        │  Browser (operator)      │
│   * coder/websocket      │        │   * htmx + a tiny bit    │
│   * cron for schedules   │        │     of vanilla JS for    │
│   * restic wrapper       │        │     live job updates     │
│   * sysinfo collector    │        └──────────────────────────┘
└────────────┬─────────────┘
             │ subprocess: restic ...
┌────────────▼─────────────────────────────────────────────────┐
│  restic repository (rest-server, S3, B2, SFTP, local …)      │
│  Backup data flows directly here. Server never touches it.   │
└──────────────────────────────────────────────────────────────┘
```
## Why outbound-only WebSockets?
The agent dials the server on `/ws/agent` with a bearer token. The
server doesn't initiate connections to the agent. Three reasons:
1. **Firewall friendliness.** Nothing on the endpoint needs an
inbound port; this works behind the typical "branch office NAT"
without router config.
2. **Single auth point.** The bearer token is the only credential
that crosses the boundary; the agent never accepts an
incoming socket.
3. **Reconnect semantics are simpler.** When the connection drops
(NAT timeout, server restart, transient network glitch) the
agent backs off and re-dials; the server marks the host
offline after 90s and lets the alert engine raise a stale-host
alert.
## Why SQLite?
High availability is an explicit non-goal, so SQLite is enough. A small
control plane managing twelve endpoints does not need replication
or a separate database tier. SQLite gives us:
- A single file to back up (plus the secret key).
- Hand-rolled migrations under `internal/store/migrations/`, with
no migration framework lock-in.
- `WAL` mode plus per-connection foreign-key enforcement.
The migrations define the entire schema; there's no ORM or
query-builder layer between Go code and SQL.
## Why the agent runs `restic` itself, not via the server
The control plane never holds backup bytes in flight. That's
deliberate:
- A compromised control plane cannot exfiltrate snapshot
contents in-band — at worst it can dispatch new backup or
forget jobs (audit-logged) but the data path is between the
agent and the repository.
- The same agent process can target whichever transport restic
natively supports (rest-server, S3, B2, SFTP, local), no
separate mux on the server side.
## Job lifecycle
```
           ┌──────────────────────┐
operator → │ POST /hosts/{id}/    │
           │   run-backup         │
           └──────────┬───────────┘
                      │ 1. INSERT INTO jobs (status='queued')
                      │ 2. dispatch command.run over WS
           ┌──────────────────────┐
           │ Agent dispatches     │
           │ restic subprocess    │
           └──────────┬───────────┘
                      │ 3. job.started  ───▶ store.MarkJobStarted
                      │ 4. job.progress ───▶ JobHub broadcast (live UI)
                      │ 5. log.stream   ───▶ append to job_logs
                      │ 6. job.finished ───▶ store.MarkJobFinished
                      │                      + alert engine eval
                      │                      + (P6) metrics histogram
           terminal: succeeded | failed | cancelled
```
Operators see live updates because the browser subscribes to
`/api/jobs/{id}/stream`, and the WS handler broadcasts each
agent-emitted envelope to all live subscribers in addition to
persisting it.
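
The fan-out to live subscribers might look something like the sketch below. `Hub`, `Subscribe`, and `Broadcast` are hypothetical names (the real `JobHub` API is not shown in these docs), and dropping messages for slow subscribers is an assumed policy, chosen here so a stalled browser cannot block the agent path.

```go
package main

import (
	"fmt"
	"sync"
)

// Hub fans agent envelopes out to every subscriber of a job.
type Hub struct {
	mu   sync.Mutex
	subs map[string]map[chan string]struct{}
}

func NewHub() *Hub {
	return &Hub{subs: make(map[string]map[chan string]struct{})}
}

// Subscribe registers a buffered channel for jobID and returns it
// together with a cancel func that unregisters and closes it.
func (h *Hub) Subscribe(jobID string) (<-chan string, func()) {
	ch := make(chan string, 16)
	h.mu.Lock()
	if h.subs[jobID] == nil {
		h.subs[jobID] = make(map[chan string]struct{})
	}
	h.subs[jobID][ch] = struct{}{}
	h.mu.Unlock()
	cancel := func() {
		h.mu.Lock()
		defer h.mu.Unlock()
		if _, ok := h.subs[jobID][ch]; ok {
			delete(h.subs[jobID], ch)
			close(ch)
		}
	}
	return ch, cancel
}

// Broadcast delivers envelope to all subscribers of jobID without blocking.
func (h *Hub) Broadcast(jobID, envelope string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for ch := range h.subs[jobID] {
		select {
		case ch <- envelope:
		default: // buffer full: drop for this subscriber only
		}
	}
}

func main() {
	h := NewHub()
	events, cancel := h.Subscribe("job-1")
	defer cancel()
	h.Broadcast("job-1", "job.progress")
	fmt.Println(<-events)
}
```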
## What scheduling looks like
- The agent runs a local `robfig/cron/v3` instance.
- The server pushes the desired schedule set to the agent on
hello + after every CRUD change.
- When the agent's cron fires, it sends `schedule.fire` to the
server. The server creates a job row, sends `command.run` back,
and the agent dispatches a normal backup.
- If the WS drops between fire and run, the server queues the
schedule firing into `pending_runs` and drains on agent
reconnect — no missed scheduled backups due to network blips.
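
The queue-and-drain behaviour in the last bullet can be illustrated with an in-memory stand-in for the `pending_runs` table. The type and method names are hypothetical, and the real implementation persists to SQLite rather than a map.

```go
package main

import (
	"fmt"
	"sync"
)

// PendingRuns is an in-memory stand-in for the pending_runs table.
type PendingRuns struct {
	mu     sync.Mutex
	byHost map[string][]string // host ID -> schedule IDs, oldest first
}

// Enqueue records a schedule firing that could not be dispatched.
func (p *PendingRuns) Enqueue(hostID, scheduleID string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.byHost == nil {
		p.byHost = make(map[string][]string)
	}
	p.byHost[hostID] = append(p.byHost[hostID], scheduleID)
}

// Drain returns and clears everything queued for hostID; the server
// would call it when that host's agent reconnects.
func (p *PendingRuns) Drain(hostID string) []string {
	p.mu.Lock()
	defer p.mu.Unlock()
	runs := p.byHost[hostID]
	delete(p.byHost, hostID)
	return runs
}

func main() {
	var q PendingRuns
	q.Enqueue("host-1", "nightly")
	q.Enqueue("host-1", "weekly")
	fmt.Println(q.Drain("host-1"))
	fmt.Println(len(q.Drain("host-1")))
}
```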
For everything that isn't a backup (forget, prune, check), the
server runs a 60-second maintenance ticker against
`host_repo_maintenance` rows and dispatches the relevant command
when a cadence is due. The agent's local cron only handles
backups.
@@ -0,0 +1,98 @@
# Credentials and how they flow
restic-manager handles three credential surfaces:
1. **Operator credentials** — the username + password (or OIDC
identity) that logs into the UI.
2. **Agent bearer tokens** — issued at enrolment, used by the
agent to authenticate its WebSocket to the server.
3. **Repo credentials** — the rest-server / S3 / B2 / SFTP
credentials the agent passes to `restic` itself.
Each has a different threat model and storage strategy.
## Operator credentials
- Local users are stored in `users` with a bcrypt password hash.
- Sessions are random tokens minted at login, stored hashed in
the `sessions` table, expired after 24h. Cookie is HttpOnly,
SameSite=Lax, and Secure (when `RM_COOKIE_SECURE=true`,
default).
- OIDC users carry `auth_source='oidc'` and an `oidc_subject`
pinning their IdP identity. Local password login is rejected
for OIDC users.
- Disabling a user soft-deletes them via `disabled_at`;
pre-existing sessions are invalidated on the next request.
## Agent bearer tokens
- Minted at enrolment, hashed at rest with `auth.HashToken`.
- The plaintext token only exists in memory at enrolment time
and on the agent's filesystem (`/etc/restic-manager/agent.yaml`,
mode `0600`, owned by the service user).
- Compromise of the server DB leaks the hashes, which is enough
to *log in as that agent* until you revoke. Compromise of the
agent host leaks the plaintext (via the config file) — same
end result.
- Rotation: re-enrol the host. Today there's no in-place rotate;
the operator deletes the host (which cascades, including
revoking the bearer hash) and re-runs the install command.
## Repo credentials
This is the credential that ultimately matters for backup
integrity. restic-manager keeps two slots per host:
- **The everyday credential** (`host_credentials.kind = ''`).
Append-only-friendly: this is the one your backup schedule
uses. It can write but not delete or forget.
- **The admin credential** (`host_credentials.kind = 'admin'`).
Has full delete rights. Only pushed to the agent transiently
while a `prune` or `forget` job is dispatching, and discarded
by the agent after the job ends.
### Encryption flow
1. Operator types the credential into the UI or the install form.
2. Server AEAD-encrypts the cred (`crypto.AEAD.Encrypt`) using the
key in `RM_SECRET_KEY_FILE`. The plaintext is dropped from
memory.
3. Encrypted blob is stored in `host_credentials.cred_blob`.
4. When the agent connects, the server decrypts the blob and
sends the **plaintext** down the WebSocket inside a
`config.update` envelope.
5. The agent stores the plaintext in its in-memory secrets store
for the lifetime of the process; it's reloaded fresh on every
server-side push.
6. When a job runs, the agent merges the credential into the
restic environment (`restic.Env.RepoURL` stays bare; the
`user:pass@…` form is built only inside `envSlice()` at the
moment of `exec.Command`).
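
Steps 2 to 4 amount to an AEAD seal on write and an open on read. The sketch below uses AES-256-GCM from the standard library purely as an example; the docs do not name the cipher behind `crypto.AEAD.Encrypt`, and the nonce-prefixed blob layout is an assumption.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD from a 16-, 24- or 32-byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// encrypt returns nonce || ciphertext, the form assumed for cred_blob.
func encrypt(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// decrypt splits off the nonce and authenticates before returning plaintext.
func decrypt(key, blob []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(blob) < gcm.NonceSize() {
		return nil, errors.New("blob shorter than nonce")
	}
	nonce, ct := blob[:gcm.NonceSize()], blob[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

func main() {
	key := make([]byte, 32) // in practice: read from RM_SECRET_KEY_FILE
	blob, err := encrypt(key, []byte("repo-password"))
	if err != nil {
		panic(err)
	}
	plain, err := decrypt(key, blob)
	fmt.Println(string(plain), err)
}
```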
The merged form is **never logged**. Any URL that reaches the
structured `slog` output is first passed through
`restic.RedactURL()`.
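
To make the bare-versus-merged distinction concrete, here is a small sketch using only `net/url`. `withCredentials` and `redact` are hypothetical helpers, not the project's `envSlice()` or `restic.RedactURL()`, and the example works on a plain `https://` URL; a restic `rest:` prefix would need to be stripped first.

```go
package main

import (
	"fmt"
	"net/url"
)

// withCredentials merges user:pass into a bare URL at the last moment.
func withCredentials(bare, user, pass string) (string, error) {
	u, err := url.Parse(bare)
	if err != nil {
		return "", err
	}
	u.User = url.UserPassword(user, pass)
	return u.String(), nil
}

// redact returns a log-safe form of a URL with the password masked.
func redact(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return "<unparseable url>"
	}
	return u.Redacted()
}

func main() {
	merged, err := withCredentials("https://backup.example.com:8000/host1", "alice", "s3cret")
	if err != nil {
		panic(err)
	}
	// Only the redacted form should ever reach a logger.
	fmt.Println(redact(merged))
}
```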
### Why push plaintext over the wire?
The transport itself is the trust boundary: the WebSocket runs
inside the same TLS-terminated reverse-proxy connection your
browser uses, and the agent has already authenticated with its
bearer token. Re-encrypting the payload on top of that would just
move the key-management problem somewhere else.
If your reverse proxy doesn't terminate TLS, the deployment is
already broken — see [Hardening](../security/hardening.md).
## Setup tokens (admin-driven)
When an admin creates a new user, the server mints a one-time
setup link valid for 1 hour. The hash is stored; the raw token
is shown to the admin once. The user opens the link, sets a
password, and is dropped into a session. Expired tokens are
swept on the alert engine's 60s tick.
Same pattern for enrolment tokens: the raw token only exists in
memory at mint time, and the install snippet is the operator's
only chance to capture it. If you lose it, regenerate via the
**Add host** page (NS-02).
@@ -0,0 +1,85 @@
# Repo maintenance
Backups go in; without maintenance, repos grow forever and
eventually fall over. restic-manager runs three maintenance
operations on a per-host cadence:
| Command | What it does | Default cadence |
|----------|-------------------------------------------------------------|-----------------|
| `forget` | Marks snapshots eligible for removal per the retention policy attached to each source group. Cheap; runs append-only. | Daily after the last backup of the day |
| `prune` | Reclaims space from the repo. Requires the **admin** credential (write+delete). | Weekly, off-peak |
| `check` | Verifies repo integrity. Sub-options surface lock state. | Weekly, with `--read-data-subset N%` to sample pack files |
A per-host row in `host_repo_maintenance` holds the
cron expressions and last-fire anchors. The maintenance ticker on
the server runs every 60s, finds hosts whose next-fire is due,
and dispatches the right command. The agent's local cron is
**only** for backups.
## Why server-side and not agent-side?
The agent's cron knows about backups because backups are
per-source-group. Maintenance is per-repo, not per-source-group,
so doing it server-side keeps the per-host wiring simple:
- One ticker, not N agent crons to keep in sync.
- Cancelling a maintenance dispatch is just "don't dispatch the
next one" — no agent-side state to clean up.
- Skipping offline hosts is trivial (no queue; only scheduled
*backups* queue into `pending_runs`).
## Forget and the multi-group payload
A single `forget` job can target several source groups at once.
The wire envelope (`ForgetGroups`) carries one entry per group,
each with its retention policy. The agent runs N
`restic forget --tag <name> --keep-...` invocations in sequence,
streams their output, and reports a single terminal status.
## Prune and the admin credential
Prune mutates the repo. The everyday append-only credential
**cannot** prune — that's the whole point of append-only.
restic-manager keeps a second slot per host (`kind = 'admin'`)
for the credential that can.
When a prune is dispatched (cadence-driven or operator-driven):
1. Server pushes the admin credential to the agent in a fresh
`config.update`.
2. Agent runs `restic prune` with the merged credential.
3. Job finishes; agent discards the admin credential from its
in-memory secrets store.
The server never logs the merged URL (see
[Credentials](./credentials.md)).
## Check and lock state
`restic check` warns about stale locks when it finds them. The
agent ships every check's output back as a `repo.stats` envelope
and a stream of log lines; if a stale lock is detected, the
**Repo** page surfaces a banner with an **Unlock** button. The
operator-only `unlock` command runs `restic unlock` and clears
the banner.
`unlock` has no cadence — it's a manual action, never automatic.
Auto-unlocking would mask the cause (probably a previously
crashed long-running operation) and risk corrupting an
operation the operator has merely lost track of.
## Repo stats
After every backup, check, prune, and unlock, the agent runs
`restic stats --json --mode raw-data` and ships the result as a
`repo.stats` envelope. The server stores this in
`host_repo_stats` (latest only) and `host_repo_stats_history`
(one row per host per day, last-write-wins per column — a
prune-only patch never nulls a backup-time size).
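
The last-write-wins-per-column rule can be sketched with nullable fields: a patch overwrites only the columns it actually carries. The struct and its column names are hypothetical; the real `host_repo_stats_history` schema is not shown in these docs.

```go
package main

import "fmt"

// RepoStats uses pointers so "not reported" (nil) differs from zero.
type RepoStats struct {
	TotalSize     *int64
	RawSize       *int64
	LastPruneUnix *int64
}

// merge applies patch on top of existing, column by column. A nil
// column in the patch leaves the existing value untouched.
func merge(existing, patch RepoStats) RepoStats {
	if patch.TotalSize != nil {
		existing.TotalSize = patch.TotalSize
	}
	if patch.RawSize != nil {
		existing.RawSize = patch.RawSize
	}
	if patch.LastPruneUnix != nil {
		existing.LastPruneUnix = patch.LastPruneUnix
	}
	return existing
}

// i64 is a small helper for building pointer fields.
func i64(v int64) *int64 { return &v }

func main() {
	afterBackup := RepoStats{TotalSize: i64(5000), RawSize: i64(4200)}
	pruneOnly := RepoStats{LastPruneUnix: i64(1700000000)}
	day := merge(afterBackup, pruneOnly)
	fmt.Println(*day.TotalSize, *day.RawSize, *day.LastPruneUnix)
}
```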
The host detail page surfaces:
- Total size + raw size in the vitals strip.
- Last-check timestamp + colour-coded status.
- Last-prune timestamp.
- 30/90-day repo size trend chart.
@@ -0,0 +1,105 @@
# Schedules and source groups
Two related but separable ideas:
- A **source group** is a named bundle of "what to back up":
include paths, exclude patterns, retention policy, retry
configuration, optional pre/post hooks. The group's name is
used as the restic snapshot tag, so retention can target it
with `restic forget --tag <name>`.
- A **schedule** is a cron expression that, when it fires,
triggers a backup of one or more source groups on a host.
Decoupling them means you can have one schedule covering several
groups (e.g. `0 1 * * *` running both `system` and `data`), and
each group has its own retention without duplicating policy
across schedules.
## Source group anatomy
```yaml
name: data
includes:
- /var/lib/postgresql
- /home
excludes:
- /home/*/.cache
- /home/*/Downloads
retention:
keep_last: 7
keep_daily: 14
keep_weekly: 4
keep_monthly: 6
retry_max: 3
retry_backoff_seconds: 600
pre_hook: |
pg_dump -U postgres -F c -f /var/lib/postgresql/dumps/all.dump
post_hook: |
rm -f /var/lib/postgresql/dumps/all.dump
```
### Conflict detection
If your retention policy says `keep_hourly: 24` but no schedule
points at this group sub-daily, the UI surfaces a
**conflict-dimension banner** ("`hourly` won't be honoured —
no schedule fires more often than once a day"). The flag is
stored on the source group (`conflict_dimension`) and refreshed
whenever a schedule or group changes.
### Hooks
`pre_hook` and `post_hook` run on the agent host inside
`/bin/sh -c` (`cmd.exe /C` on Windows). Output is streamed back
to the live job log as `hook(<phase>): …` lines.
- A non-zero `pre_hook` exit aborts the backup.
- `post_hook` always runs, with `RM_JOB_STATUS=succeeded|failed`
in the environment. Use this for cleanup that must happen
whether the backup worked or not.
- Hooks only run for `kind=backup` jobs. They do not run for
`forget`, `prune`, `check`, etc.
- Hook scripts are AEAD-encrypted at rest (the encryption is
applied at the HTTP layer); the agent receives plaintext over
the WS channel.
A "host default" pair of hooks lives on the host itself; a
source group's own hooks override them when set.
## Schedule anatomy
```yaml
cron: "0 2 * * *"
enabled: true
source_group_ids:
- <gid for "data">
- <gid for "system">
```
Slim by design: a schedule says **when** and **which groups**.
Everything else (paths, retention, hooks) lives on the groups.
The agent's local cron fires the schedule. If the WebSocket is
down at fire time, the server queues the firing into
`pending_runs` and drains it on the next agent reconnect — a
short network blip won't lose the backup.
### Last / next run
The schedules tab shows "next" (computed by parsing the cron
expression with `robfig/cron/v3`) and "last" (the latest
`actor_kind=schedule` job in the `jobs` table) for every
schedule. The dashboard host row also surfaces `next 12h ago/from
now` when a single covering schedule is the run-now candidate.
## Bandwidth limits
Two places set restic's `--limit-upload` / `--limit-download`:
1. **Host-wide caps** on the host row (`bandwidth_up_kbps`,
`bandwidth_down_kbps`). Pushed to the agent on hello and
after `PUT /api/hosts/{id}/bandwidth`. Apply to every restic
invocation on the host.
2. **Per-job overrides** on the per-source-group Run-now form.
Win over host caps for the lifetime of that one job.
If neither is set, restic runs unthrottled.