P5: OSS readiness — docs site, contributor onboarding, e2e harness

P5-01: Documentation site under docs/book/ rendered with mdBook (downloaded via Makefile, same static-binary pattern as Tailwind). Structured chapters: getting started, concepts, operations, security, reference. `make docs` / `make docs-watch`. Generated output gitignored.

P5-02: CONTRIBUTING.md rewritten from placeholder to a full guide. CODE_OF_CONDUCT.md adapted from Contributor Covenant for a single-maintainer project. .gitea/issue_template/{bug,feature}.md and PULL_REQUEST_TEMPLATE.md.

P5-04: Six README screenshots captured live from a fresh server bootstrap (login, empty dashboard, add-host, alerts, settings, audit log). README rewritten to centre the screenshot grid and link out to the docs site.

P5-05: SECURITY.md with disclosure policy (3-day ack, 30-day default window), scope in/out, threat-model summary, operator hardening checklist. Mirrored as a docs-site chapter.

P5-06: End-to-end test harness. e2e/compose.e2e.yml brings up server + sibling Linux agent (alpine + restic) + restic/rest-server. Agent uses announce-and-approve so Playwright can drive the full operator flow: bootstrap → login → accept pending → backup → verify terminal status. Second spec scrapes /metrics to assert the P6-04 endpoint surface. .gitea/workflows/e2e.yml runs on every PR; local how-to in docs/e2e.md.
2026-05-07 23:56:02 +01:00
parent ff8a5dbead
commit bb4ed3502d
47 changed files with 2818 additions and 61 deletions
@@ -0,0 +1,121 @@
+# Architecture
+
+## Components
+
+```
+┌────────────────────────────────────────────────────────────┐
+│  Server (control plane, single process)                    │
+│   * chi-based HTTP API + HTMX server-rendered UI           │
+│   * WebSocket hub for agent fan-out + browser fan-out      │
+│   * SQLite store (modernc.org/sqlite, pure Go)             │
+│   * AEAD encryption helpers                                │
+│   * Alert engine + notification hub                        │
+└────────────┬───────────────────────────────────┬───────────┘
+             │ outbound WS only                   │ HTTP(S)
+             │                                    │
+┌────────────▼─────────────┐         ┌────────────▼─────────────┐
+│  Agent (per host)        │         │  Browser (operator)      │
+│   * coder/websocket      │         │   * htmx + a tiny bit    │
+│   * cron for schedules   │         │     of vanilla JS for    │
+│   * restic wrapper       │         │     live job updates     │
+│   * sysinfo collector    │         └──────────────────────────┘
+└────────────┬─────────────┘
+             │ subprocess: restic ...
+             │
+┌────────────▼─────────────────────────────────────────────────┐
+│  restic repository (rest-server, S3, B2, SFTP, local …)      │
+│  Backup data flows directly here. Server never touches it.   │
+└──────────────────────────────────────────────────────────────┘
+```
+
+## Why outbound-only WebSockets?
+
+The agent dials the server on `/ws/agent` with a bearer token. The
+server doesn't initiate connections to the agent. Three reasons:
+
+1. **Firewall friendliness.** Nothing on the endpoint needs an
+   inbound port; this works behind the typical "branch office NAT"
+   without router config.
+2. **Single auth point.** The bearer token is the only credential
+   that crosses the boundary; the agent never accepts an
+   incoming socket.
+3. **Reconnect semantics are simpler.** When the connection drops
+   (NAT timeout, server restart, transient network glitch) the
+   agent backs off and re-dials; the server marks the host
+   offline after 90s and lets the alert engine raise a stale-host
+   alert.
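A minimal sketch of the reconnect behaviour described in point 3, written as an annotation on this hunk and not as part of the commit. Only the 90s offline threshold comes from the text; the base delay, the cap, and the function names `nextBackoff` and `isOffline` are assumptions made for illustration, and a real agent would usually add jitter as well.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed tuning values; the source only fixes the 90s offline threshold.
const (
	baseDelay    = 1 * time.Second
	maxDelay     = 60 * time.Second
	offlineAfter = 90 * time.Second
)

// nextBackoff returns the delay before re-dial attempt number `attempt`
// (0-based). The delay doubles from baseDelay and is capped at maxDelay.
func nextBackoff(attempt int) time.Duration {
	d := baseDelay
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

// isOffline reports whether a host that has been silent for `silence`
// should be marked offline by the server.
func isOffline(silence time.Duration) bool {
	return silence >= offlineAfter
}

func main() {
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Printf("attempt %d: wait %s\n", attempt, nextBackoff(attempt))
	}
	fmt.Println("offline after 95s of silence:", isOffline(95*time.Second))
}
```

With these assumed values the agent retries well inside the 90s window for short outages, so a brief blip never reaches the stale-host alert.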
+
+## Why SQLite?
+
+High availability is an explicit non-goal of the project, so
+SQLite is enough. A small control plane managing twelve endpoints
+does not need replication or a separate database tier. SQLite
+gives us:
+
+- A single file to back up (plus the secret key).
+- Hand-rolled migrations under `internal/store/migrations/` —
+  no migration framework lock-in.
+- `WAL` mode plus per-connection foreign-key enforcement.
+
+The migrations define the entire schema; there's no ORM or
+query-builder layer between Go code and SQL.
+
+## Why the agent runs `restic` itself, not via the server
+
+The control plane never holds backup bytes in flight. That's
+deliberate:
+
+- A compromised control plane cannot exfiltrate snapshot
+  contents in-band — at worst it can dispatch new backup or
+  forget jobs (audit-logged) but the data path is between the
+  agent and the repository.
+- The same agent process can target whichever transport restic
+  natively supports (rest-server, S3, B2, SFTP, local), with no
+  separate mux needed on the server side.
+
+## Job lifecycle
+
+```
+            ┌──────────────────────┐
+operator →  │ POST /hosts/{id}/    │
+            │       run-backup     │
+            └──────────┬───────────┘
+                       │   1. INSERT INTO jobs (status='queued')
+                       │   2. dispatch command.run over WS
+                       ▼
+            ┌──────────────────────┐
+            │ Agent dispatches     │
+            │ restic subprocess    │
+            └──────────┬───────────┘
+                       │
+                       │   3. job.started   ───▶ store.MarkJobStarted
+                       │   4. job.progress  ───▶ JobHub broadcast (live UI)
+                       │   5. log.stream    ───▶ append to job_logs
+                       │   6. job.finished  ───▶ store.MarkJobFinished
+                       │                          + alert engine eval
+                       │                          + (P6) metrics histogram
+                       ▼
+                  terminal: succeeded | failed | cancelled
+```
+
+Operators see live updates because the browser subscribes to
+`/api/jobs/{id}/stream`, and the WS handler broadcasts each
+agent-emitted envelope to all live subscribers in addition to
+persisting it.
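The lifecycle in the diagram above can be read as a small state machine. The sketch below is an annotation, not code from the commit: the envelope kinds and the `queued` and terminal status names come from the diagram, while the intermediate `running` status, the `Status` type, and the `apply` function are assumed names used only for illustration.

```go
package main

import "fmt"

type Status string

const (
	Queued    Status = "queued"
	Running   Status = "running" // assumed name for the state after job.started
	Succeeded Status = "succeeded"
	Failed    Status = "failed"
	Cancelled Status = "cancelled"
)

// terminal reports whether no further transitions are allowed.
func (s Status) terminal() bool {
	return s == Succeeded || s == Failed || s == Cancelled
}

// apply returns the status after an agent envelope of the given kind.
// result is only consulted for job.finished.
func apply(cur Status, kind string, result Status) (Status, error) {
	if cur.terminal() {
		return cur, fmt.Errorf("job already %s, ignoring %s", cur, kind)
	}
	switch kind {
	case "job.started":
		if cur != Queued {
			return cur, fmt.Errorf("job.started in state %s", cur)
		}
		return Running, nil
	case "job.progress", "log.stream":
		return cur, nil // broadcast or append only, no state change
	case "job.finished":
		if !result.terminal() {
			return cur, fmt.Errorf("job.finished with non-terminal result %q", result)
		}
		return result, nil
	}
	return cur, fmt.Errorf("unknown envelope kind %q", kind)
}

func main() {
	s := Queued
	for _, kind := range []string{"job.started", "job.progress", "log.stream", "job.finished"} {
		next, err := apply(s, kind, Succeeded)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("%-13s %s -> %s\n", kind, s, next)
		s = next
	}
}
```

Rejecting envelopes that arrive after a terminal status keeps a late or duplicated `job.finished` from rewriting history.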
+
+## What scheduling looks like
+
+- The agent runs a local `robfig/cron/v3` instance.
+- The server pushes the desired schedule set to the agent on
+  hello + after every CRUD change.
+- When the agent's cron fires, it sends `schedule.fire` to the
+  server. The server creates a job row, sends `command.run` back,
+  and the agent dispatches a normal backup.
+- If the WS drops between fire and run, the server queues the
+  schedule firing into `pending_runs` and drains on agent
+  reconnect — no missed scheduled backups due to network blips.
+
+For everything that isn't a backup (forget, prune, check), the
+server runs a 60-second maintenance ticker against
+`host_repo_maintenance` rows and dispatches the relevant command
+when a cadence is due. The agent's local cron only handles
+backups.
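A minimal in-memory sketch of the queue-and-drain behaviour described for `pending_runs`, added here as an annotation. The real project persists queued firings in a table; the `Dispatcher` type, its fields, and its method names are hypothetical stand-ins.

```go
package main

import "fmt"

// Firing is one schedule firing waiting to be turned into command.run.
type Firing struct {
	HostID     string
	ScheduleID string
}

// Dispatcher is a hypothetical stand-in for the server side. The real
// project stores queued firings in the pending_runs table.
type Dispatcher struct {
	online  map[string]bool
	pending map[string][]Firing
	sent    []Firing // firings dispatched so far, in order
}

func NewDispatcher() *Dispatcher {
	return &Dispatcher{
		online:  make(map[string]bool),
		pending: make(map[string][]Firing),
	}
}

// Fire dispatches immediately when the host is connected and queues
// the firing otherwise. It reports whether the firing was dispatched.
func (d *Dispatcher) Fire(f Firing) bool {
	if d.online[f.HostID] {
		d.sent = append(d.sent, f)
		return true
	}
	d.pending[f.HostID] = append(d.pending[f.HostID], f)
	return false
}

// Reconnect marks the host online and drains its queue in FIFO order.
// It returns the number of firings that were drained.
func (d *Dispatcher) Reconnect(hostID string) int {
	d.online[hostID] = true
	q := d.pending[hostID]
	delete(d.pending, hostID)
	d.sent = append(d.sent, q...)
	return len(q)
}

func main() {
	d := NewDispatcher()
	fmt.Println("dispatched while offline:", d.Fire(Firing{"host-1", "nightly"}))
	fmt.Println("drained on reconnect:", d.Reconnect("host-1"))
	fmt.Println("dispatched while online:", d.Fire(Firing{"host-1", "nightly"}))
}
```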
@@ -0,0 +1,98 @@
+# Credentials and how they flow
+
+restic-manager handles three credential surfaces:
+
+1. **Operator credentials** — the username + password (or OIDC
+   identity) that logs into the UI.
+2. **Agent bearer tokens** — issued at enrolment, used by the
+   agent to authenticate its WebSocket to the server.
+3. **Repo credentials** — the rest-server / S3 / B2 / SFTP
+   credentials the agent passes to `restic` itself.
+
+Each has a different threat model and storage strategy.
+
+## Operator credentials
+
+- Local users are stored in `users` with a bcrypt password hash.
+- Sessions are random tokens minted at login, stored hashed in
+  the `sessions` table, and expired after 24h. The cookie is
+  HttpOnly, SameSite=Lax, and Secure when `RM_COOKIE_SECURE=true`
+  (the default).
+- OIDC users carry `auth_source='oidc'` and an `oidc_subject`
+  pinning their IdP identity. Local password login is rejected
+  for OIDC users.
+- Disabling a user soft-deletes them via `disabled_at` —
+  pre-existing sessions are invalidated on the next request.
+
+## Agent bearer tokens
+
+- Minted at enrolment, hashed at rest with `auth.HashToken`.
+- The plaintext token only exists in memory at enrolment time
+  and on the agent's filesystem (`/etc/restic-manager/agent.yaml`,
+  mode `0600`, owned by the service user).
+- Compromise of the server DB leaks the hashes, which is enough
+  to *log in as that agent* until you revoke. Compromise of the
+  agent host leaks the plaintext (via the config file) — same
+  end result.
+- Rotation: re-enrol the host. Today there's no in-place rotate;
+  the operator deletes the host (which cascades, including
+  revoking the bearer hash) and re-runs the install command.
+
+## Repo credentials
+
+This is the credential that ultimately matters for backup
+integrity. restic-manager keeps two slots per host:
+
+- **The everyday credential** (`host_credentials.kind = ''`).
+  Append-only-friendly: this is the one your backup schedule
+  uses. It can write but not delete or forget.
+- **The admin credential** (`host_credentials.kind = 'admin'`).
+  Has full delete rights. Only pushed to the agent transiently
+  while a `prune` or `forget` job is dispatching, and discarded
+  by the agent after the job ends.
+
+### Encryption flow
+
+1. Operator types the credential into the UI or the install form.
+2. Server AEAD-encrypts the cred (`crypto.AEAD.Encrypt`) using the
+   key in `RM_SECRET_KEY_FILE`. The plaintext is dropped from
+   memory.
+3. Encrypted blob is stored in `host_credentials.cred_blob`.
+4. When the agent connects, the server decrypts the blob and
+   sends the **plaintext** down the WebSocket inside a
+   `config.update` envelope.
+5. The agent stores the plaintext in its in-memory secrets store
+   for the lifetime of the process; it's reloaded fresh on every
+   server-side push.
+6. When a job runs, the agent merges the credential into the
+   restic environment (`restic.Env.RepoURL` stays bare; the
+   `user:pass@…` form is built only inside `envSlice()` at the
+   moment of `exec.Command`).
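Steps 2 to 4 of the flow above can be illustrated with a short round trip. This is an annotation, not the project's `crypto.AEAD` helper: the source does not name the cipher, so AES-256-GCM with the nonce prepended to the ciphertext is an assumption, and `seal`, `open`, and `newGCM` are hypothetical names.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD from a 16, 24, or 32 byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext and returns nonce || ciphertext || tag.
func seal(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Passing nonce as dst appends the ciphertext after the nonce.
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open reverses seal and fails if the blob was tampered with.
func open(key, blob []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(blob) < gcm.NonceSize() {
		return nil, errors.New("blob shorter than nonce")
	}
	nonce, ct := blob[:gcm.NonceSize()], blob[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

func main() {
	key := make([]byte, 32) // all-zero demo key; a real key would come from RM_SECRET_KEY_FILE
	blob, err := seal(key, []byte("s3cret"))
	if err != nil {
		panic(err)
	}
	pt, err := open(key, blob)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d-byte blob decrypts to %q\n", len(blob), pt)
}
```

Because the mode is authenticated, a modified `cred_blob` fails to decrypt instead of yielding a silently wrong credential.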
+
+The merged form is **never logged**. Any URL that structured
+`slog` output has cause to mention is passed through
+`restic.RedactURL()` first.
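To make the merge-late, redact-always idea concrete, here is a sketch using only the Go standard library, added as an annotation. It is not the project's `envSlice()` or `restic.RedactURL()`: the functions `withCredentials` and `redact` are hypothetical, and handling of the `rest:` prefix is an assumption based on restic's REST repository syntax.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

const restPrefix = "rest:"

// splitPrefix separates restic's "rest:" prefix from the URL proper.
func splitPrefix(repo string) (string, string) {
	if strings.HasPrefix(repo, restPrefix) {
		return restPrefix, strings.TrimPrefix(repo, restPrefix)
	}
	return "", repo
}

// withCredentials returns the repository string with user:pass embedded.
// The idea is to call it only at exec time and never store the result.
func withCredentials(repo, user, pass string) (string, error) {
	prefix, raw := splitPrefix(repo)
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.User = url.UserPassword(user, pass)
	return prefix + u.String(), nil
}

// redact replaces any password in the repository string so the result
// is safe to log.
func redact(repo string) string {
	prefix, raw := splitPrefix(repo)
	u, err := url.Parse(raw)
	if err != nil {
		return "<unparseable repository URL>"
	}
	return prefix + u.Redacted()
}

func main() {
	merged, err := withCredentials("rest:https://backup.example.com:8000/host1/", "host1", "hunter2")
	if err != nil {
		panic(err)
	}
	fmt.Println("safe to log:", redact(merged))
}
```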
+
+### Why push plaintext over the wire?
+
+The transport itself is the trust boundary: the WebSocket runs
+inside the same TLS-terminated reverse-proxy connection your
+browser uses, and the agent has already authenticated with its
+bearer token. Re-encrypting the payload on top of that would just
+move the key-management problem somewhere else.
+
+If your reverse proxy isn't terminating TLS, the deployment is
+already broken; see [Hardening](../security/hardening.md).
+
+## Setup tokens (admin-driven)
+
+When an admin creates a new user, the server mints a one-time
+setup link valid for 1 hour. The hash is stored; the raw token
+is shown to the admin once. The user opens the link, sets a
+password, and is dropped into a session. Expired tokens are
+swept on the alert engine's 60s tick.
+
+Same pattern for enrolment tokens: the raw token only exists in
+memory at mint time, and the install snippet is the operator's
+only chance to capture it. If you lose it, regenerate via the
+**Add host** page (NS-02).
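The store-the-hash, show-the-raw-token-once pattern described above can be sketched as follows. This is an annotation, not the project's code: the source does not say which hash `auth.HashToken` uses, so SHA-256 is an assumption, and `tokenRecord`, `mintSetupToken`, and `valid` are hypothetical names. Only the 1 hour lifetime comes from the text.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"time"
)

const setupTokenTTL = time.Hour

// tokenRecord is what the server would persist: never the raw token.
type tokenRecord struct {
	Hash      [sha256.Size]byte
	ExpiresAt time.Time
}

// mintSetupToken returns the raw token (to be shown once) together
// with the record to store.
func mintSetupToken() (string, tokenRecord, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", tokenRecord{}, err
	}
	raw := hex.EncodeToString(buf)
	rec := tokenRecord{
		Hash:      sha256.Sum256([]byte(raw)),
		ExpiresAt: time.Now().Add(setupTokenTTL),
	}
	return raw, rec, nil
}

// valid reports whether raw matches the stored hash and the record
// has not expired. The comparison is constant-time.
func (r tokenRecord) valid(raw string) bool {
	sum := sha256.Sum256([]byte(raw))
	match := subtle.ConstantTimeCompare(sum[:], r.Hash[:]) == 1
	return match && time.Now().Before(r.ExpiresAt)
}

func main() {
	raw, rec, err := mintSetupToken()
	if err != nil {
		panic(err)
	}
	fmt.Println("raw token length:", len(raw))
	fmt.Println("valid right after minting:", rec.valid(raw))
}
```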
@@ -0,0 +1,85 @@
+# Repo maintenance
+
+Backups go in; without maintenance, repos grow forever and
+eventually fall over. restic-manager runs three maintenance
+operations on a per-host cadence:
+
+| Command  | What it does                                                | Default cadence |
+|----------|-------------------------------------------------------------|-----------------|
+| `forget` | Marks snapshots eligible for removal per the retention policy attached to each source group. Cheap; runs append-only. | Daily after the last backup of the day |
+| `prune`  | Reclaims space from the repo. Requires the **admin** credential (write+delete). | Weekly, off-peak |
+| `check`  | Verifies repo integrity. Sub-options surface lock state. | Weekly, with `--read-data-subset N%` to sample pack files |
+
+A per-host row in `host_repo_maintenance` holds the cron
+expressions and last-fire anchors. The maintenance ticker on
+the server runs every 60s, finds hosts whose next-fire is due,
+and dispatches the right command. The agent's local cron is
+**only** for backups.
+
+## Why server-side and not agent-side?
+
+The agent's cron knows about backups because backups are
+per-source-group. Maintenance is per-repo, not per-source-group,
+so doing it server-side keeps the per-host wiring simple:
+
+- One ticker, not N agent crons to keep in sync.
+- Cancelling a maintenance dispatch is just "don't dispatch the
+  next one" — no agent-side state to clean up.
+- Skipping offline hosts is trivial (no queue; only scheduled
+  *backups* queue into `pending_runs`).
+
+## Forget and the multi-group payload
+
+A single `forget` job can target several source groups at once.
+The wire envelope (`ForgetGroups`) carries one entry per group,
+each with its retention policy. The agent runs N
+`restic forget --tag <name> --keep-...` invocations in sequence,
+streams their output, and reports a single terminal status.
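A sketch of how one group entry could be turned into a `restic forget` argument list, added as an annotation. The `--tag` and `--keep-*` flags are real restic flags; the `ForgetGroup` and `Retention` struct shapes and the `forgetArgs` helper are assumptions and do not reflect the actual `ForgetGroups` wire format.

```go
package main

import (
	"fmt"
	"strconv"
)

// Retention mirrors the retention keys shown in the source group YAML.
type Retention struct {
	KeepLast, KeepHourly, KeepDaily, KeepWeekly, KeepMonthly int
}

// ForgetGroup is a hypothetical per-group entry of a forget job.
type ForgetGroup struct {
	Name      string
	Retention Retention
}

// forgetArgs builds the restic arguments for one group. Retention
// dimensions that are zero are left out.
func forgetArgs(g ForgetGroup) []string {
	args := []string{"forget", "--tag", g.Name}
	add := func(flag string, n int) {
		if n > 0 {
			args = append(args, flag, strconv.Itoa(n))
		}
	}
	add("--keep-last", g.Retention.KeepLast)
	add("--keep-hourly", g.Retention.KeepHourly)
	add("--keep-daily", g.Retention.KeepDaily)
	add("--keep-weekly", g.Retention.KeepWeekly)
	add("--keep-monthly", g.Retention.KeepMonthly)
	return args
}

func main() {
	groups := []ForgetGroup{
		{Name: "data", Retention: Retention{KeepLast: 7, KeepDaily: 14, KeepWeekly: 4, KeepMonthly: 6}},
		{Name: "system", Retention: Retention{KeepDaily: 7}},
	}
	// One invocation per group, in sequence.
	for _, g := range groups {
		fmt.Println("restic", forgetArgs(g))
	}
}
```

An agent would pass each slice to `exec.Command("restic", args...)`; building an argument slice instead of a shell string avoids quoting problems with group names.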
+
+## Prune and the admin credential
+
+Prune mutates the repo. The everyday append-only credential
+**cannot** prune — that's the whole point of append-only.
+restic-manager keeps a second slot per host (`kind = 'admin'`)
+for the credential that can.
+
+When a prune is dispatched (cadence-driven or operator-driven):
+
+1. Server pushes the admin credential to the agent in a fresh
+   `config.update`.
+2. Agent runs `restic prune` with the merged credential.
+3. Job finishes; agent discards the admin credential from its
+   in-memory secrets store.
+
+The server never logs the merged URL (see
+[Credentials](./credentials.md)).
+
+## Check and lock state
+
+`restic check` warns about stale locks when it finds them. The
+agent ships every check's output back as a `repo.stats` envelope
+and a stream of log lines; if a stale lock is detected, the
+**Repo** page surfaces a banner with an **Unlock** button. The
+operator-only `unlock` command runs `restic unlock` and clears
+the banner.
+
+`unlock` has no cadence — it's a manual action, never automatic.
+Auto-unlocking would mask the cause (probably a previously
+crashed long-running operation) and risk corrupting an
+operation the operator has merely lost track of.
+
+## Repo stats
+
+After every backup, check, prune, and unlock, the agent runs
+`restic stats --json --mode raw-data` and ships the result as a
+`repo.stats` envelope. The server stores this in
+`host_repo_stats` (latest only) and `host_repo_stats_history`
+(one row per host per day, last-write-wins per column — a
+prune-only patch never nulls a backup-time size).
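The per-column last-write-wins rule can be sketched with nullable fields, added here as an annotation. The column names in `DayStats` and the `merge` helper are illustrative assumptions; the source only states the rule, not the schema.

```go
package main

import "fmt"

// DayStats stands in for one host_repo_stats_history row. A nil field
// means "this report did not include the column".
type DayStats struct {
	TotalSize     *int64
	RawSize       *int64
	SnapshotCount *int64
}

// merge overlays patch onto row column by column: a non-nil column in
// patch wins, a nil column leaves the stored value untouched.
func merge(row, patch DayStats) DayStats {
	if patch.TotalSize != nil {
		row.TotalSize = patch.TotalSize
	}
	if patch.RawSize != nil {
		row.RawSize = patch.RawSize
	}
	if patch.SnapshotCount != nil {
		row.SnapshotCount = patch.SnapshotCount
	}
	return row
}

// i64 is a small helper for building pointer fields.
func i64(v int64) *int64 { return &v }

func main() {
	// A backup reports total size and snapshot count.
	row := merge(DayStats{}, DayStats{TotalSize: i64(100), SnapshotCount: i64(7)})
	// A later prune reports only the raw size.
	row = merge(row, DayStats{RawSize: i64(80)})
	fmt.Println("total:", *row.TotalSize, "raw:", *row.RawSize, "snapshots:", *row.SnapshotCount)
}
```

In SQL the same effect is commonly achieved with an upsert that wraps each column in `COALESCE(new, old)`, though the source does not say how the project implements it.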
+
+The host detail page surfaces:
+
+- Total size + raw size in the vitals strip.
+- Last-check timestamp + colour-coded status.
+- Last-prune timestamp.
+- 30/90-day repo size trend chart.
@@ -0,0 +1,105 @@
+# Schedules and source groups
+
+Two related but separable ideas:
+
+- A **source group** is a named bundle of "what to back up":
+  include paths, exclude patterns, retention policy, retry
+  configuration, optional pre/post hooks. The group's name is
+  used as the restic snapshot tag, so retention can target it
+  with `restic forget --tag <name>`.
+- A **schedule** is a cron expression that, when it fires,
+  triggers a backup of one or more source groups on a host.
+
+Decoupling them means you can have one schedule covering several
+groups (e.g. `0 1 * * *` running both `system` and `data`), and
+each group has its own retention without duplicating policy
+across schedules.
+
+## Source group anatomy
+
+```yaml
+name: data
+includes:
+  - /var/lib/postgresql
+  - /home
+excludes:
+  - /home/*/.cache
+  - /home/*/Downloads
+retention:
+  keep_last: 7
+  keep_daily: 14
+  keep_weekly: 4
+  keep_monthly: 6
+retry_max: 3
+retry_backoff_seconds: 600
+pre_hook: |
+  pg_dump -U postgres -F c -f /var/lib/postgresql/dumps/all.dump
+post_hook: |
+  rm -f /var/lib/postgresql/dumps/all.dump
+```
+
+### Conflict detection
+
+If your retention policy says `keep_hourly: 24` but no schedule
+points at this group sub-daily, the UI surfaces a
+**conflict-dimension banner** ("`hourly` won't be honoured —
+no schedule fires more often than once a day"). The flag is
+stored on the source group (`conflict_dimension`) and refreshed
+whenever a schedule or group changes.
+
+### Hooks
+
+`pre_hook` and `post_hook` run on the agent host inside
+`/bin/sh -c` (`cmd.exe /C` on Windows). Output is streamed back
+to the live job log as `hook(<phase>): …` lines.
+
+- A non-zero `pre_hook` exit aborts the backup.
+- `post_hook` always runs, with `RM_JOB_STATUS=succeeded|failed`
+  in the environment. Use this for cleanup that must happen
+  whether the backup worked or not.
+- Hooks only run for `kind=backup` jobs. They do not run for
+  `forget`, `prune`, `check`, etc.
+- Hook bodies are AEAD-encrypted at rest (the HTTP layer encrypts
+  them before they are stored); the agent receives plaintext over
+  the WS channel.
+
+A "host default" pair of hooks lives on the host itself; a
+source group's own hooks override them when set.
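A rough sketch of the hook execution rules listed above, added as an annotation. The shell choice, the `RM_JOB_STATUS` variable, and the `hook(<phase>): …` line prefix come from the text; the `runHook` function is hypothetical, and it buffers output for brevity where the real agent is described as streaming it.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"os"
	"os/exec"
	"runtime"
)

// runHook runs script through the platform shell and returns its
// combined output as "hook(<phase>): ..." lines. A non-empty status is
// exported to the script as RM_JOB_STATUS.
func runHook(phase, script, status string) ([]string, error) {
	shell, flag := "/bin/sh", "-c"
	if runtime.GOOS == "windows" {
		shell, flag = "cmd.exe", "/C"
	}
	cmd := exec.Command(shell, flag, script)
	cmd.Env = os.Environ()
	if status != "" {
		cmd.Env = append(cmd.Env, "RM_JOB_STATUS="+status)
	}
	out, err := cmd.CombinedOutput()

	var lines []string
	sc := bufio.NewScanner(bytes.NewReader(out))
	for sc.Scan() {
		lines = append(lines, fmt.Sprintf("hook(%s): %s", phase, sc.Text()))
	}
	return lines, err
}

func main() {
	lines, err := runHook("pre", "echo dumping database", "")
	for _, l := range lines {
		fmt.Println(l)
	}
	if err != nil {
		fmt.Println("pre_hook failed, backup aborted:", err)
		return
	}
	// The post hook runs whatever the outcome and sees the job status.
	lines, _ = runHook("post", "echo cleaning up", "succeeded")
	for _, l := range lines {
		fmt.Println(l)
	}
}
```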
+
+## Schedule anatomy
+
+```yaml
+cron: "0 2 * * *"
+enabled: true
+source_group_ids:
+  - <gid for "data">
+  - <gid for "system">
+```
+
+Slim by design: a schedule says **when** and **which groups**.
+Everything else (paths, retention, hooks) lives on the groups.
+
+The agent's local cron fires the schedule. If the WebSocket is
+down at fire time, the server queues the firing into
+`pending_runs` and drains it on the next agent reconnect — a
+short network blip won't lose the backup.
+
+### Last / next run
+
+The schedules tab shows "next" (computed by parsing the cron
+expression with `robfig/cron/v3`) and "last" (the latest
+`actor_kind=schedule` job in the `jobs` table) for every
+schedule. The dashboard host row also surfaces `next 12h ago/from
+now` when a single covering schedule is the run-now candidate.
+
+## Bandwidth limits
+
+Two places set restic's `--limit-upload` / `--limit-download`:
+
+1. **Host-wide caps** on the host row (`bandwidth_up_kbps`,
+   `bandwidth_down_kbps`). Pushed to the agent on hello and
+   after `PUT /api/hosts/{id}/bandwidth`. Apply to every restic
+   invocation on the host.
+2. **Per-job overrides** on the per-source-group Run-now form.
+   Win over host caps for the lifetime of that one job.
+
+If neither is set, restic runs unthrottled.
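The precedence rule above can be sketched as a small resolver, added as an annotation. `--limit-upload` and `--limit-download` are real restic flags that take a rate in KiB/s. The `Limits` type and both helpers are hypothetical, and two points are assumptions the source does not settle: that an override applies per direction, and that the stored `*_kbps` values are passed to restic unchanged.

```go
package main

import (
	"fmt"
	"strconv"
)

// Limits holds upload and download caps. Zero means "not set".
type Limits struct {
	Up, Down int
}

// effective applies per-job overrides on top of host-wide caps,
// independently for each direction (an assumption).
func effective(host, job Limits) Limits {
	out := host
	if job.Up > 0 {
		out.Up = job.Up
	}
	if job.Down > 0 {
		out.Down = job.Down
	}
	return out
}

// limitArgs renders restic's throttling flags. Unset limits produce
// no flags, so restic runs unthrottled.
func limitArgs(l Limits) []string {
	var args []string
	if l.Up > 0 {
		args = append(args, "--limit-upload", strconv.Itoa(l.Up))
	}
	if l.Down > 0 {
		args = append(args, "--limit-download", strconv.Itoa(l.Down))
	}
	return args
}

func main() {
	host := Limits{Up: 1000, Down: 5000}
	job := Limits{Up: 200} // Run-now form overrides upload only
	fmt.Println(limitArgs(effective(host, job)))
	fmt.Println(len(limitArgs(effective(Limits{}, Limits{}))), "flags when nothing is set")
}
```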