P5: OSS readiness — docs site, contributor onboarding, e2e harness
P5-01 — Documentation site under docs/book/ rendered with mdBook
(downloaded via Makefile, same static-binary pattern as Tailwind).
Structured chapters: getting started, concepts, operations,
security, reference. `make docs` / `make docs-watch`. Generated
output gitignored.
P5-02 — CONTRIBUTING.md rewritten from placeholder to a full
guide. CODE_OF_CONDUCT.md adapted from Contributor Covenant for a
single-maintainer project. .gitea/issue_template/{bug,feature}.md
and PULL_REQUEST_TEMPLATE.md.
P5-04 — Six README screenshots captured live from a fresh server
bootstrap (login, empty dashboard, add-host, alerts, settings,
audit log). README rewritten to centre the screenshot grid and
link out to the docs site.
P5-05 — SECURITY.md with disclosure policy (3-day ack, 30-day
default window), scope in/out, threat-model summary, operator
hardening checklist. Mirrored as a docs-site chapter.
P5-06 — End-to-end test harness. e2e/compose.e2e.yml brings up
server + sibling Linux agent (alpine + restic) + restic/rest-server.
Agent uses announce-and-approve so Playwright can drive the full
operator flow: bootstrap → login → accept pending → backup →
verify terminal status. Second spec scrapes /metrics to assert
the P6-04 endpoint surface. .gitea/workflows/e2e.yml runs on every
PR; local how-to in docs/e2e.md.
@@ -0,0 +1,121 @@
# Architecture

## Components

```
┌────────────────────────────────────────────────────────────┐
│ Server (control plane, single process)                     │
│  * chi-based HTTP API + HTMX server-rendered UI            │
│  * WebSocket hub for agent fan-out + browser fan-out       │
│  * SQLite store (modernc.org/sqlite, pure Go)              │
│  * AEAD encryption helpers                                 │
│  * Alert engine + notification hub                         │
└────────────┬───────────────────────────────────┬───────────┘
             │ outbound WS only                  │ HTTP(S)
             │                                   │
┌────────────▼─────────────┐        ┌────────────▼─────────────┐
│ Agent (per host)         │        │ Browser (operator)       │
│  * coder/websocket       │        │  * htmx + a tiny bit     │
│  * cron for schedules    │        │    of vanilla JS for     │
│  * restic wrapper        │        │    live job updates      │
│  * sysinfo collector     │        └──────────────────────────┘
└────────────┬─────────────┘
             │ subprocess: restic ...
             │
┌────────────▼─────────────────────────────────────────────────┐
│ restic repository (rest-server, S3, B2, SFTP, local …)       │
│ Backup data flows directly here. Server never touches it.    │
└──────────────────────────────────────────────────────────────┘
```

## Why outbound-only WebSockets?

The agent dials the server on `/ws/agent` with a bearer token. The
server doesn't initiate connections to the agent. Three reasons:

1. **Firewall friendliness.** Nothing on the endpoint needs an
   inbound port; this works behind the typical "branch office NAT"
   without router config.
2. **Single auth point.** The bearer token is the only credential
   that crosses the boundary; the agent never accepts an
   incoming socket.
3. **Reconnect semantics are simpler.** When the connection drops
   (NAT timeout, server restart, transient network glitch) the
   agent backs off and re-dials; the server marks the host
   offline after 90s and lets the alert engine raise a stale-host
   alert.
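
The reconnect behaviour in point 3 can be sketched as follows. This is a minimal illustration, not the project's code: the doubling strategy, the 1s base, the 30s ceiling, and the names `nextBackoff` and `isOffline` are all assumptions; only the 90s offline threshold comes from the text above.

```go
package main

import (
	"fmt"
	"time"
)

// offlineAfter mirrors the 90s threshold described above.
const offlineAfter = 90 * time.Second

// nextBackoff returns the wait before reconnect attempt n (0-based).
// It doubles from base and caps at ceiling. Jitter is omitted for brevity;
// the real agent's strategy is not specified in this chapter.
func nextBackoff(attempt int, base, ceiling time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= ceiling {
			return ceiling
		}
	}
	return d
}

// isOffline reports whether a host that has been silent for
// sinceLastSeen should be marked offline by the server.
func isOffline(sinceLastSeen time.Duration) bool {
	return sinceLastSeen > offlineAfter
}

func main() {
	for attempt := 0; attempt < 7; attempt++ {
		fmt.Println("attempt", attempt, "wait", nextBackoff(attempt, time.Second, 30*time.Second))
	}
	fmt.Println("offline after 2m of silence:", isOffline(2*time.Minute))
}
```

Capping the delay keeps a long outage from pushing the next dial attempt far beyond the server's 90s offline window.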

## Why SQLite?

High availability is an explicit non-goal of the project, and
SQLite fits that. A small control plane managing twelve endpoints
does not need replication or a separate database tier. SQLite
gives us:

- A single file to back up (plus the secret key).
- Hand-rolled migrations under `internal/store/migrations/`,
  with no migration framework lock-in.
- `WAL` mode plus per-connection foreign-key enforcement.

The migrations define the entire schema; there's no ORM or
query-builder layer between Go code and SQL.

## Why the agent runs `restic` itself, not via the server

The control plane never holds backup bytes in flight. That's
deliberate:

- A compromised control plane cannot exfiltrate snapshot
  contents in-band. At worst it can dispatch new backup or
  forget jobs (audit-logged), but the data path is between the
  agent and the repository.
- The same agent process can target whichever transport restic
  natively supports (rest-server, S3, B2, SFTP, local), with no
  separate mux on the server side.

## Job lifecycle

```
            ┌──────────────────────┐
operator →  │ POST /hosts/{id}/    │
            │   run-backup         │
            └──────────┬───────────┘
                       │ 1. INSERT INTO jobs (status='queued')
                       │ 2. dispatch command.run over WS
                       ▼
            ┌──────────────────────┐
            │ Agent dispatches     │
            │ restic subprocess    │
            └──────────┬───────────┘
                       │
                       │ 3. job.started  ───▶ store.MarkJobStarted
                       │ 4. job.progress ───▶ JobHub broadcast (live UI)
                       │ 5. log.stream   ───▶ append to job_logs
                       │ 6. job.finished ───▶ store.MarkJobFinished
                       │                      + alert engine eval
                       │                      + (P6) metrics histogram
                       ▼
            terminal: succeeded | failed | cancelled
```

Operators see live updates because the browser subscribes to
`/api/jobs/{id}/stream`, and the WS handler broadcasts each
agent-emitted envelope to all live subscribers in addition to
persisting it.
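
The fan-out half of that handler might look roughly like the sketch below. `JobHub` is named in the diagram, but its API is not shown in this chapter, so the `Envelope` fields, the method names, the buffer size, and the drop-on-slow-consumer policy are all assumptions made for illustration. Persistence is left out.

```go
package main

import (
	"fmt"
	"sync"
)

// Envelope is a hypothetical stand-in for an agent-emitted message.
type Envelope struct {
	JobID string
	Type  string // e.g. "job.progress"
	Body  string
}

// JobHub fans envelopes out to live subscribers, keyed by job ID.
type JobHub struct {
	mu   sync.Mutex
	subs map[string]map[chan Envelope]struct{}
}

func NewJobHub() *JobHub {
	return &JobHub{subs: make(map[string]map[chan Envelope]struct{})}
}

// Subscribe registers a buffered channel for one job and returns it
// together with a function that removes the subscription.
func (h *JobHub) Subscribe(jobID string) (<-chan Envelope, func()) {
	ch := make(chan Envelope, 16)
	h.mu.Lock()
	if h.subs[jobID] == nil {
		h.subs[jobID] = make(map[chan Envelope]struct{})
	}
	h.subs[jobID][ch] = struct{}{}
	h.mu.Unlock()
	return ch, func() {
		h.mu.Lock()
		delete(h.subs[jobID], ch)
		h.mu.Unlock()
	}
}

// Broadcast delivers env to every subscriber of env.JobID without
// blocking. A subscriber whose buffer is full misses the message.
// It returns the number of subscribers that received the envelope.
func (h *JobHub) Broadcast(env Envelope) int {
	h.mu.Lock()
	defer h.mu.Unlock()
	delivered := 0
	for ch := range h.subs[env.JobID] {
		select {
		case ch <- env:
			delivered++
		default:
		}
	}
	return delivered
}

func main() {
	hub := NewJobHub()
	ui, cancel := hub.Subscribe("job-1")
	defer cancel()
	hub.Broadcast(Envelope{JobID: "job-1", Type: "job.progress", Body: "42%"})
	fmt.Println(<-ui)
}
```

The non-blocking send means one stalled browser tab cannot hold up the agent's WebSocket read loop; the persisted log remains the source of truth for anything a subscriber missed.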

## What scheduling looks like

- The agent runs a local `robfig/cron/v3` instance.
- The server pushes the desired schedule set to the agent on
  hello and after every CRUD change.
- When the agent's cron fires, it sends `schedule.fire` to the
  server. The server creates a job row, sends `command.run` back,
  and the agent dispatches a normal backup.
- If the WS drops between fire and run, the server queues the
  schedule firing into `pending_runs` and drains on agent
  reconnect, so network blips don't cause missed scheduled backups.

For everything that isn't a backup (forget, prune, check), the
server runs a 60-second maintenance ticker against
`host_repo_maintenance` rows and dispatches the relevant command
when a cadence is due. The agent's local cron only handles
backups.
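
The queue-and-drain behaviour for scheduled backups can be modelled in memory as below. This is a sketch under stated assumptions: the `dispatcher` type and its fields are hypothetical, the `pending` map stands in for the `pending_runs` table, and `sent` stands in for `command.run` messages written to the WebSocket.

```go
package main

import "fmt"

// dispatcher is an in-memory model of the server's handling of
// schedule firings. It is illustrative only.
type dispatcher struct {
	connected map[string]bool     // host ID -> agent WS currently up
	pending   map[string][]string // host ID -> queued schedule IDs
	sent      []string            // dispatched "host/schedule" pairs, in order
}

func newDispatcher() *dispatcher {
	return &dispatcher{
		connected: make(map[string]bool),
		pending:   make(map[string][]string),
	}
}

// fire handles one schedule firing: dispatch immediately when the agent
// is connected, otherwise queue it for later.
func (d *dispatcher) fire(host, schedule string) {
	if d.connected[host] {
		d.sent = append(d.sent, host+"/"+schedule)
		return
	}
	d.pending[host] = append(d.pending[host], schedule)
}

// reconnect marks the host connected and drains its queue in order.
func (d *dispatcher) reconnect(host string) {
	d.connected[host] = true
	for _, schedule := range d.pending[host] {
		d.sent = append(d.sent, host+"/"+schedule)
	}
	delete(d.pending, host)
}

func main() {
	d := newDispatcher()
	d.fire("host-1", "nightly") // agent offline: queued
	d.reconnect("host-1")       // drained on reconnect
	fmt.Println(d.sent)
}
```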
@@ -0,0 +1,98 @@
# Credentials and how they flow

restic-manager handles three credential surfaces:

1. **Operator credentials**: the username + password (or OIDC
   identity) that logs into the UI.
2. **Agent bearer tokens**: issued at enrolment, used by the
   agent to authenticate its WebSocket to the server.
3. **Repo credentials**: the rest-server / S3 / B2 / SFTP
   credentials the agent passes to `restic` itself.

Each has a different threat model and storage strategy.

## Operator credentials

- Local users are stored in `users` with a bcrypt password hash.
- Sessions are random tokens minted at login, stored hashed in
  the `sessions` table, and expired after 24h. The cookie is
  HttpOnly, SameSite=Lax, and Secure when `RM_COOKIE_SECURE=true`
  (the default).
- OIDC users carry `auth_source='oidc'` and an `oidc_subject`
  pinning their IdP identity. Local password login is rejected
  for OIDC users.
- Disabling a user soft-deletes them via `disabled_at`;
  pre-existing sessions are invalidated on the next request.
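
A session token of this kind can be sketched as below. The chapter does not say which hash or token length is used, so SHA-256, the 32-byte token, and the `session` type are assumptions; only the "random token, stored hashed, expired after 24h" shape comes from the list above.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"time"
)

// sessionTTL mirrors the 24h expiry described above.
const sessionTTL = 24 * time.Hour

// session is a hypothetical row shape: only the hash is stored.
type session struct {
	tokenHash [32]byte
	expiresAt time.Time
}

// mintSession returns the raw token (destined for the cookie) and the
// record to persist. The raw token is never stored server-side.
func mintSession(now time.Time) (string, session, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", session{}, err
	}
	raw := hex.EncodeToString(buf)
	return raw, session{
		tokenHash: sha256.Sum256([]byte(raw)),
		expiresAt: now.Add(sessionTTL),
	}, nil
}

// valid checks a presented token against the stored hash in constant
// time and rejects it once the session has expired.
func (s session) valid(raw string, now time.Time) bool {
	sum := sha256.Sum256([]byte(raw))
	match := subtle.ConstantTimeCompare(sum[:], s.tokenHash[:]) == 1
	return match && now.Before(s.expiresAt)
}

func main() {
	now := time.Now()
	raw, s, err := mintSession(now)
	if err != nil {
		panic(err)
	}
	fmt.Println("valid now:", s.valid(raw, now))
	fmt.Println("valid after 25h:", s.valid(raw, now.Add(25*time.Hour)))
}
```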

## Agent bearer tokens

- Minted at enrolment, hashed at rest with `auth.HashToken`.
- The plaintext token only exists in memory at enrolment time
  and on the agent's filesystem (`/etc/restic-manager/agent.yaml`,
  mode `0600`, owned by the service user).
- Compromise of the server DB leaks the hashes, which is enough
  to *log in as that agent* until you revoke. Compromise of the
  agent host leaks the plaintext (via the config file), with the
  same end result.
- Rotation: re-enrol the host. Today there's no in-place rotate;
  the operator deletes the host (which cascades, including
  revoking the bearer hash) and re-runs the install command.

## Repo credentials

This is the credential that ultimately matters for backup
integrity. restic-manager keeps two slots per host:

- **The everyday credential** (`host_credentials.kind = ''`).
  Append-only-friendly: this is the one your backup schedule
  uses. It can write but not delete or forget.
- **The admin credential** (`host_credentials.kind = 'admin'`).
  Has full delete rights. Only pushed to the agent transiently
  while a `prune` or `forget` job is dispatching, and discarded
  by the agent after the job ends.

### Encryption flow

1. Operator types the credential into the UI or the install form.
2. Server AEAD-encrypts the credential (`crypto.AEAD.Encrypt`)
   using the key in `RM_SECRET_KEY_FILE`. The plaintext is dropped
   from memory.
3. Encrypted blob is stored in `host_credentials.cred_blob`.
4. When the agent connects, the server decrypts the blob and
   sends the **plaintext** down the WebSocket inside a
   `config.update` envelope.
5. The agent stores the plaintext in its in-memory secrets store
   for the lifetime of the process; it's reloaded fresh on every
   server-side push.
6. When a job runs, the agent merges the credential into the
   restic environment (`restic.Env.RepoURL` stays bare; the
   `user:pass@…` form is built only inside `envSlice()` at the
   moment of `exec.Command`).
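
Steps 2 to 4 amount to an AEAD seal on write and an open on read. The sketch below uses AES-256-GCM from the Go standard library as one possible AEAD; the chapter only says "AEAD", so the cipher choice, the nonce-prefixed blob layout, and the `seal`/`open` names are assumptions, not the behaviour of `crypto.AEAD.Encrypt`.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD from a 32-byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext and returns nonce || ciphertext || tag.
func seal(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Passing nonce as dst prepends it to the output.
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open reverses seal and fails if the blob was tampered with.
func open(key, blob []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(blob) < gcm.NonceSize() {
		return nil, errors.New("blob shorter than nonce")
	}
	nonce, ciphertext := blob[:gcm.NonceSize()], blob[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ciphertext, nil)
}

func main() {
	key := make([]byte, 32) // in practice, loaded from the secret key file
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	blob, err := seal(key, []byte("repo-password"))
	if err != nil {
		panic(err)
	}
	plain, err := open(key, blob)
	fmt.Println(string(plain), err)
}
```

Because the authentication tag covers the whole ciphertext, a modified blob fails to decrypt rather than yielding a corrupted credential.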

The merged form is **never logged**. Any URL that the structured
`slog` output has cause to mention is passed through
`restic.RedactURL()` first.
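
The merge-late, redact-always pattern can be illustrated with `net/url`. The functions below are hypothetical stand-ins, not the project's `envSlice()` or `restic.RedactURL()`; the `rest:` prefix handling and the example host are assumptions.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// restPrefix is the scheme prefix restic uses for rest-server repos.
const restPrefix = "rest:"

// splitPrefix separates an optional "rest:" prefix from the URL proper.
func splitPrefix(raw string) (prefix, rest string) {
	if strings.HasPrefix(raw, restPrefix) {
		return restPrefix, strings.TrimPrefix(raw, restPrefix)
	}
	return "", raw
}

// mergeCredential embeds user:pass into a bare repo URL. It is meant to
// be called only at the moment the subprocess environment is built.
func mergeCredential(repoURL, user, pass string) (string, error) {
	prefix, rest := splitPrefix(repoURL)
	u, err := url.Parse(rest)
	if err != nil {
		return "", err
	}
	u.User = url.UserPassword(user, pass)
	return prefix + u.String(), nil
}

// redactURL masks any password so the URL is safe to log.
func redactURL(raw string) string {
	prefix, rest := splitPrefix(raw)
	u, err := url.Parse(rest)
	if err != nil {
		return "<unparseable repo URL>"
	}
	return prefix + u.Redacted()
}

func main() {
	merged, err := mergeCredential("rest:https://backup.example.com:8000/host1", "agent", "s3cret")
	if err != nil {
		panic(err)
	}
	fmt.Println("for exec only:", merged)
	fmt.Println("for logs:     ", redactURL(merged))
}
```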

### Why push plaintext over the wire?

The transport itself is the trust boundary: the WebSocket runs
inside the same TLS-terminated reverse-proxy connection your
browser uses, and the agent has already authenticated with its
bearer token. Re-encrypting the payload on top of that would just
move the key-management problem somewhere else.

If your reverse proxy doesn't terminate TLS, the deployment is
already broken; see [Hardening](../security/hardening.md).

## Setup tokens (admin-driven)

When an admin creates a new user, the server mints a one-time
setup link valid for 1 hour. The hash is stored; the raw token
is shown to the admin once. The user opens the link, sets a
password, and is dropped into a session. Expired tokens are
swept on the alert engine's 60s tick.

Same pattern for enrolment tokens: the raw token only exists in
memory at mint time, and the install snippet is the operator's
only chance to capture it. If you lose it, regenerate via the
**Add host** page (NS-02).
@@ -0,0 +1,85 @@
# Repo maintenance

Backups go in; without maintenance, repos grow forever and
eventually fall over. restic-manager runs three maintenance
operations on a per-host cadence:

| Command  | What it does | Default cadence |
|----------|--------------|-----------------|
| `forget` | Marks snapshots eligible for removal per the retention policy attached to each source group. Cheap; runs append-only. | Daily after the last backup of the day |
| `prune`  | Reclaims space from the repo. Requires the **admin** credential (write+delete). | Weekly, off-peak |
| `check`  | Verifies repo integrity. Sub-options surface lock state. | Weekly, with `--read-data-subset N%` to sample pack files |

Each host has a `host_repo_maintenance` row that holds the
cron expressions and last-fire anchors. The maintenance ticker on
the server runs every 60s, finds hosts whose next-fire is due,
and dispatches the right command. The agent's local cron is
**only** for backups.

## Why server-side and not agent-side?

The agent's cron knows about backups because backups are
per-source-group. Maintenance is per-repo, not per-source-group,
so doing it server-side keeps the per-host wiring simple:

- One ticker, not N agent crons to keep in sync.
- Cancelling a maintenance dispatch is just "don't dispatch the
  next one"; there is no agent-side state to clean up.
- Skipping offline hosts is trivial (no queue; only scheduled
  *backups* queue into `pending_runs`).
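
One pass of such a ticker reduces to a filter over maintenance rows. In the sketch below the `maintRow` shape is hypothetical, and `NextFire` is assumed to have been computed elsewhere from the cron expression and last-fire anchor; cron parsing is deliberately left out.

```go
package main

import (
	"fmt"
	"time"
)

// maintRow is a hypothetical, flattened view of one pending
// maintenance cadence for a host.
type maintRow struct {
	Host     string
	Command  string // "forget", "prune" or "check"
	Online   bool
	NextFire time.Time
}

// dueDispatches returns the rows to dispatch on this tick. Offline
// hosts are skipped outright: maintenance is not queued, so they are
// simply picked up on a later tick once they are back.
func dueDispatches(rows []maintRow, now time.Time) []maintRow {
	var due []maintRow
	for _, r := range rows {
		if !r.Online || r.NextFire.After(now) {
			continue
		}
		due = append(due, r)
	}
	return due
}

func main() {
	now := time.Now()
	rows := []maintRow{
		{Host: "web-1", Command: "prune", Online: true, NextFire: now.Add(-time.Minute)},
		{Host: "db-1", Command: "check", Online: false, NextFire: now.Add(-time.Minute)},
		{Host: "nas-1", Command: "forget", Online: true, NextFire: now.Add(time.Hour)},
	}
	// In a server this would run inside a time.NewTicker(60 * time.Second) loop.
	for _, r := range dueDispatches(rows, now) {
		fmt.Println("dispatch", r.Command, "to", r.Host)
	}
}
```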

## Forget and the multi-group payload

A single `forget` job can target several source groups at once.
The wire envelope (`ForgetGroups`) carries one entry per group,
each with its retention policy. The agent runs N
`restic forget --tag <name> --keep-...` invocations in sequence,
streams their output, and reports a single terminal status.
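
Building the argument list for each invocation might look like this. The `forgetGroup` and `retention` types are hypothetical (the real `ForgetGroups` envelope is not shown here); the `--tag` and `--keep-*` flags are standard `restic forget` options.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// retention is a hypothetical mirror of a source group's policy.
// A zero value means "not set" and produces no flag.
type retention struct {
	KeepLast, KeepHourly, KeepDaily, KeepWeekly, KeepMonthly int
}

// forgetGroup is one entry of the multi-group payload.
type forgetGroup struct {
	Name      string
	Retention retention
}

// forgetArgs builds the restic arguments for one group.
func forgetArgs(g forgetGroup) []string {
	args := []string{"forget", "--tag", g.Name}
	add := func(flag string, n int) {
		if n > 0 {
			args = append(args, flag, strconv.Itoa(n))
		}
	}
	add("--keep-last", g.Retention.KeepLast)
	add("--keep-hourly", g.Retention.KeepHourly)
	add("--keep-daily", g.Retention.KeepDaily)
	add("--keep-weekly", g.Retention.KeepWeekly)
	add("--keep-monthly", g.Retention.KeepMonthly)
	return args
}

func main() {
	groups := []forgetGroup{
		{Name: "data", Retention: retention{KeepLast: 7, KeepDaily: 14, KeepWeekly: 4, KeepMonthly: 6}},
		{Name: "system", Retention: retention{KeepDaily: 7}},
	}
	// One invocation per group, run in sequence by the agent.
	for _, g := range groups {
		fmt.Println("restic", strings.Join(forgetArgs(g), " "))
	}
}
```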

## Prune and the admin credential

Prune mutates the repo. The everyday append-only credential
**cannot** prune; that's the whole point of append-only.
restic-manager keeps a second slot per host (`kind = 'admin'`)
for the credential that can.

When a prune is dispatched (cadence-driven or operator-driven):

1. Server pushes the admin credential to the agent in a fresh
   `config.update`.
2. Agent runs `restic prune` with the merged credential.
3. Job finishes; agent discards the admin credential from its
   in-memory secrets store.

The server never logs the merged URL (see
[Credentials](./credentials.md)).
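
On the agent side, the "push, use, discard" sequence maps naturally onto a deferred cleanup. The `secretStore` type and its methods below are assumptions made for illustration; the agent's actual secrets store is not shown in this chapter.

```go
package main

import (
	"fmt"
	"sync"
)

// secretStore is a hypothetical in-memory store keyed by credential kind.
type secretStore struct {
	mu    sync.Mutex
	creds map[string]string
}

func newSecretStore() *secretStore {
	return &secretStore{creds: make(map[string]string)}
}

func (s *secretStore) put(kind, cred string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.creds[kind] = cred
}

func (s *secretStore) get(kind string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	cred, ok := s.creds[kind]
	return cred, ok
}

func (s *secretStore) drop(kind string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.creds, kind)
}

// runWithAdmin holds the admin credential only for the duration of job.
// The deferred drop runs whether the job succeeds, fails, or panics.
func (s *secretStore) runWithAdmin(adminCred string, job func(cred string) error) error {
	s.put("admin", adminCred)
	defer s.drop("admin")
	cred, _ := s.get("admin")
	return job(cred)
}

func main() {
	store := newSecretStore()
	err := store.runWithAdmin("admin-password", func(cred string) error {
		fmt.Println("running prune with admin credential of length", len(cred))
		return nil
	})
	_, stillThere := store.get("admin")
	fmt.Println("error:", err, "admin credential retained:", stillThere)
}
```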

## Check and lock state

`restic check` warns about stale locks when it finds them. The
agent ships every check's output back as a `repo.stats` envelope
and a stream of log lines; if a stale lock is detected, the
**Repo** page surfaces a banner with an **Unlock** button. The
operator-only `unlock` command runs `restic unlock` and clears
the banner.

`unlock` has no cadence: it's a manual action, never automatic.
Auto-unlocking would mask the cause (probably a previously
crashed long-running operation) and risk corrupting an
operation the operator has merely lost track of.

## Repo stats

After every backup, check, prune, and unlock, the agent runs
`restic stats --json --mode raw-data` and ships the result as a
`repo.stats` envelope. The server stores this in
`host_repo_stats` (latest only) and `host_repo_stats_history`
(one row per host per day, last-write-wins per column, so a
prune-only patch never nulls a backup-time size).

The host detail page surfaces:

- Total size + raw size in the vitals strip.
- Last-check timestamp + colour-coded status.
- Last-prune timestamp.
- 30/90-day repo size trend chart.
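
The per-column last-write-wins rule for `host_repo_stats_history` can be modelled with optional fields, where an absent value leaves the stored column alone. The `repoStats` field names below are hypothetical, not the table's real columns.

```go
package main

import "fmt"

// repoStats is a hypothetical daily row. A nil pointer means the
// reporting job did not produce that column.
type repoStats struct {
	TotalSizeBytes *int64
	RawSizeBytes   *int64
	LastPruneDay   *string
}

// merge overlays the non-nil columns of patch onto existing, so a
// partial report never clears a value written earlier the same day.
func merge(existing, patch repoStats) repoStats {
	out := existing
	if patch.TotalSizeBytes != nil {
		out.TotalSizeBytes = patch.TotalSizeBytes
	}
	if patch.RawSizeBytes != nil {
		out.RawSizeBytes = patch.RawSizeBytes
	}
	if patch.LastPruneDay != nil {
		out.LastPruneDay = patch.LastPruneDay
	}
	return out
}

// ptr is a small helper for building optional values.
func ptr[T any](v T) *T { return &v }

func main() {
	afterBackup := repoStats{TotalSizeBytes: ptr(int64(1 << 30))}
	afterPrune := merge(afterBackup, repoStats{LastPruneDay: ptr("2024-01-01")})
	fmt.Println(*afterPrune.TotalSizeBytes, *afterPrune.LastPruneDay)
}
```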
@@ -0,0 +1,105 @@
# Schedules and source groups

Two related but separable ideas:

- A **source group** is a named bundle of "what to back up":
  include paths, exclude patterns, retention policy, retry
  configuration, optional pre/post hooks. The group's name is
  used as the restic snapshot tag, so retention can target it
  with `restic forget --tag <name>`.
- A **schedule** is a cron expression that, when it fires,
  triggers a backup of one or more source groups on a host.

Decoupling them means you can have one schedule covering several
groups (e.g. `0 1 * * *` running both `system` and `data`), and
each group has its own retention without duplicating policy
across schedules.

## Source group anatomy

```yaml
name: data
includes:
  - /var/lib/postgresql
  - /home
excludes:
  - /home/*/.cache
  - /home/*/Downloads
retention:
  keep_last: 7
  keep_daily: 14
  keep_weekly: 4
  keep_monthly: 6
retry_max: 3
retry_backoff_seconds: 600
pre_hook: |
  pg_dump -U postgres -F c -f /var/lib/postgresql/dumps/all.dump
post_hook: |
  rm -f /var/lib/postgresql/dumps/all.dump
```

### Conflict detection

If your retention policy says `keep_hourly: 24` but no schedule
points at this group sub-daily, the UI surfaces a
**conflict-dimension banner** ("`hourly` won't be honoured —
no schedule fires more often than once a day"). The flag is
stored on the source group (`conflict_dimension`) and refreshed
whenever a schedule or group changes.
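
A check of this kind could be sketched as follows. The hourly rule follows the banner text above; the daily rule is an assumption added by analogy, and `minInterval` (the shortest gap between firings of any schedule covering the group) is assumed to be computed elsewhere from the cron expressions.

```go
package main

import (
	"fmt"
	"time"
)

// retention holds only the dimensions this sketch inspects.
type retention struct {
	KeepHourly int
	KeepDaily  int
}

const day = 24 * time.Hour

// conflictDimension returns the retention dimension the covering
// schedules cannot honour, or "" when there is no conflict.
func conflictDimension(r retention, minInterval time.Duration) string {
	switch {
	case r.KeepHourly > 0 && minInterval >= day:
		// No schedule fires more often than once a day.
		return "hourly"
	case r.KeepDaily > 0 && minInterval >= 7*day:
		// Assumed by analogy: nothing fires more often than weekly.
		return "daily"
	default:
		return ""
	}
}

func main() {
	policy := retention{KeepHourly: 24}
	fmt.Printf("daily schedule:  %q\n", conflictDimension(policy, day))
	fmt.Printf("hourly schedule: %q\n", conflictDimension(policy, time.Hour))
}
```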

### Hooks

`pre_hook` and `post_hook` run on the agent host inside
`/bin/sh -c` (`cmd.exe /C` on Windows). Output is streamed back
to the live job log as `hook(<phase>): …` lines.

- A non-zero `pre_hook` exit aborts the backup.
- `post_hook` always runs, with `RM_JOB_STATUS=succeeded|failed`
  in the environment. Use this for cleanup that must happen
  whether the backup worked or not.
- Hooks only run for `kind=backup` jobs. They do not run for
  `forget`, `prune`, `check`, etc.
- Hook bodies are AEAD-encrypted at rest, with encryption applied
  at the HTTP layer; the agent receives them as plaintext over
  the WS channel.

A "host default" pair of hooks lives on the host itself; a
source group's own hooks override them when set.
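
Constructing the `post_hook` command with `os/exec` might look like the sketch below. `postHookCommand` is a hypothetical helper; the shell choice per platform and the `RM_JOB_STATUS` variable come from the text above, while inheriting the agent's environment is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"runtime"
)

// postHookCommand wraps script in the platform shell and exposes the
// job's terminal status to it through RM_JOB_STATUS.
func postHookCommand(goos, script, jobStatus string) *exec.Cmd {
	var cmd *exec.Cmd
	if goos == "windows" {
		cmd = exec.Command("cmd.exe", "/C", script)
	} else {
		cmd = exec.Command("/bin/sh", "-c", script)
	}
	// Assumption: the hook inherits the agent's environment.
	cmd.Env = append(os.Environ(), "RM_JOB_STATUS="+jobStatus)
	return cmd
}

func main() {
	cmd := postHookCommand(runtime.GOOS, "echo cleanup, backup status: $RM_JOB_STATUS", "succeeded")
	out, err := cmd.CombinedOutput()
	// Each output line would be forwarded to the job log as hook(post): …
	fmt.Printf("hook(post): %s", out)
	if err != nil {
		fmt.Println("hook failed:", err)
	}
}
```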

## Schedule anatomy

```yaml
cron: "0 2 * * *"
enabled: true
source_group_ids:
  - <gid for "data">
  - <gid for "system">
```

Slim by design: a schedule says **when** and **which groups**.
Everything else (paths, retention, hooks) lives on the groups.

The agent's local cron fires the schedule. If the WebSocket is
down at fire time, the server queues the firing into
`pending_runs` and drains it on the next agent reconnect, so a
short network blip won't lose the backup.

### Last / next run

The schedules tab shows "next" (computed by parsing the cron
expression with `robfig/cron/v3`) and "last" (the latest
`actor_kind=schedule` job in the `jobs` table) for every
schedule. The dashboard host row also surfaces `next 12h ago/from
now` when a single covering schedule is the run-now candidate.

## Bandwidth limits

Two places set restic's `--limit-upload` / `--limit-download`:

1. **Host-wide caps** on the host row (`bandwidth_up_kbps`,
   `bandwidth_down_kbps`). Pushed to the agent on hello and
   after `PUT /api/hosts/{id}/bandwidth`. Apply to every restic
   invocation on the host.
2. **Per-job overrides** on the per-source-group Run-now form.
   Win over host caps for the lifetime of that one job.

If neither is set, restic runs unthrottled.
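
The precedence rule can be written down in a few lines. `limitArgs` is a hypothetical helper; `0` stands for "not set", and the values are passed to restic unchanged, so any unit conversion between the stored `*_kbps` fields and restic's KiB/s is deliberately left out.

```go
package main

import (
	"fmt"
	"strconv"
)

// limitArgs resolves the effective caps (a per-job override wins over
// the host-wide cap) and returns the matching restic flags. A value of
// 0 means "not set"; with nothing set, no flags are emitted.
func limitArgs(hostUp, hostDown, jobUp, jobDown int) []string {
	up, down := hostUp, hostDown
	if jobUp > 0 {
		up = jobUp
	}
	if jobDown > 0 {
		down = jobDown
	}
	var args []string
	if up > 0 {
		args = append(args, "--limit-upload", strconv.Itoa(up))
	}
	if down > 0 {
		args = append(args, "--limit-download", strconv.Itoa(down))
	}
	return args
}

func main() {
	fmt.Println("host caps only:   ", limitArgs(1000, 500, 0, 0))
	fmt.Println("job overrides up: ", limitArgs(1000, 500, 200, 0))
	fmt.Println("nothing set:      ", limitArgs(0, 0, 0, 0))
}
```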