P5: OSS readiness — docs site, contributor onboarding, e2e harness
P5-01 — Documentation site under docs/book/ rendered with mdBook
(downloaded via Makefile, same static-binary pattern as Tailwind).
Structured chapters: getting started, concepts, operations,
security, reference. `make docs` / `make docs-watch`. Generated
output gitignored.
P5-02 — CONTRIBUTING.md rewritten from placeholder to a full
guide. CODE_OF_CONDUCT.md adapted from Contributor Covenant for a
single-maintainer project. .gitea/issue_template/{bug,feature}.md
and PULL_REQUEST_TEMPLATE.md.
P5-04 — Six README screenshots captured live from a fresh server
bootstrap (login, empty dashboard, add-host, alerts, settings,
audit log). README rewritten to centre the screenshot grid and
link out to the docs site.
P5-05 — SECURITY.md with disclosure policy (3-day ack, 30-day
default window), scope in/out, threat-model summary, operator
hardening checklist. Mirrored as a docs-site chapter.
P5-06 — End-to-end test harness. e2e/compose.e2e.yml brings up
server + sibling Linux agent (alpine + restic) + restic/rest-server.
Agent uses announce-and-approve so Playwright can drive the full
operator flow: bootstrap → login → accept pending → backup →
verify terminal status. Second spec scrapes /metrics to assert
the P6-04 endpoint surface. .gitea/workflows/e2e.yml runs on every
PR; local how-to in docs/e2e.md.
@@ -0,0 +1,121 @@
# Architecture

## Components

```
┌────────────────────────────────────────────────────────────┐
│ Server (control plane, single process)                     │
│  * chi-based HTTP API + HTMX server-rendered UI            │
│  * WebSocket hub for agent fan-out + browser fan-out       │
│  * SQLite store (modernc.org/sqlite, pure Go)              │
│  * AEAD encryption helpers                                 │
│  * Alert engine + notification hub                         │
└────────────┬───────────────────────────────────┬───────────┘
             │ outbound WS only                  │ HTTP(S)
             │                                   │
┌────────────▼─────────────┐        ┌────────────▼─────────────┐
│ Agent (per host)         │        │ Browser (operator)       │
│  * coder/websocket       │        │  * htmx + a tiny bit     │
│  * cron for schedules    │        │    of vanilla JS for     │
│  * restic wrapper        │        │    live job updates      │
│  * sysinfo collector     │        └──────────────────────────┘
└────────────┬─────────────┘
             │ subprocess: restic ...
             │
┌────────────▼─────────────────────────────────────────────────┐
│ restic repository (rest-server, S3, B2, SFTP, local …)       │
│ Backup data flows directly here. Server never touches it.    │
└──────────────────────────────────────────────────────────────┘
```

## Why outbound-only WebSockets?

The agent dials the server on `/ws/agent` with a bearer token. The
server never initiates connections to the agent. Three reasons:

1. **Firewall friendliness.** Nothing on the endpoint needs an
   inbound port; this works behind the typical "branch office NAT"
   without router config.
2. **Single auth point.** The bearer token is the only credential
   that crosses the boundary; the agent never accepts an
   incoming socket.
3. **Reconnect semantics are simpler.** When the connection drops
   (NAT timeout, server restart, transient network glitch) the
   agent backs off and re-dials; the server marks the host
   offline after 90s and lets the alert engine raise a stale-host
   alert.

## Why SQLite?

High availability is an explicit non-goal of the project, so SQLite
is enough. A small control plane managing twelve endpoints does not
need replication or a separate database tier. SQLite gives us:

- A single file to back up (plus the secret key).
- Hand-rolled migrations under `internal/store/migrations/`, with
  no migration framework lock-in.
- `WAL` mode plus per-connection foreign-key enforcement.

The migrations define the entire schema; there's no ORM or
query-builder layer between Go code and SQL.

## Why the agent runs `restic` itself, not via the server

The control plane never holds backup bytes in flight. That's
deliberate:

- A compromised control plane cannot exfiltrate snapshot
  contents in-band. At worst it can dispatch new backup or
  forget jobs (which are audit-logged); the data path stays
  between the agent and the repository.
- The same agent process can target whichever transport restic
  natively supports (rest-server, S3, B2, SFTP, local), with no
  separate mux on the server side.

## Job lifecycle

```
           ┌──────────────────────┐
operator → │ POST /hosts/{id}/    │
           │ run-backup           │
           └──────────┬───────────┘
                      │ 1. INSERT INTO jobs (status='queued')
                      │ 2. dispatch command.run over WS
                      ▼
           ┌──────────────────────┐
           │ Agent dispatches     │
           │ restic subprocess    │
           └──────────┬───────────┘
                      │
                      │ 3. job.started  ───▶ store.MarkJobStarted
                      │ 4. job.progress ───▶ JobHub broadcast (live UI)
                      │ 5. log.stream   ───▶ append to job_logs
                      │ 6. job.finished ───▶ store.MarkJobFinished
                      │                      + alert engine eval
                      │                      + (P6) metrics histogram
                      ▼
           terminal: succeeded | failed | cancelled
```

Operators see live updates because the browser subscribes to
`/api/jobs/{id}/stream`, and the WS handler broadcasts each
agent-emitted envelope to all live subscribers in addition to
persisting it.
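
The lifecycle above behaves like a small state machine. The following sketch shows one possible encoding. The `apply` function and the intermediate `running` status are assumptions; the document names only `queued` and the three terminal statuses. This version also rejects a `job.finished` for a job that never started, which the real server may handle differently.

```go
package main

import "fmt"

type Status string

const (
	Queued    Status = "queued"
	Running   Status = "running" // assumed name for the in-progress state
	Succeeded Status = "succeeded"
	Failed    Status = "failed"
	Cancelled Status = "cancelled"
)

// apply advances a job's status for one agent message. final is the
// outcome carried by job.finished and is ignored for other messages.
func apply(cur Status, msgType string, final Status) (Status, error) {
	switch msgType {
	case "job.started":
		if cur == Queued {
			return Running, nil
		}
	case "job.progress", "log.stream":
		if cur == Running {
			return Running, nil
		}
	case "job.finished":
		if cur == Running {
			switch final {
			case Succeeded, Failed, Cancelled:
				return final, nil
			}
			return cur, fmt.Errorf("invalid terminal status %q", final)
		}
	default:
		return cur, fmt.Errorf("unknown message type %q", msgType)
	}
	return cur, fmt.Errorf("%s not allowed in status %q", msgType, cur)
}

func main() {
	status := Queued
	for _, msg := range []string{"job.started", "job.progress", "log.stream", "job.finished"} {
		next, err := apply(status, msg, Succeeded)
		if err != nil {
			fmt.Println("rejected:", err)
			return
		}
		fmt.Printf("%-12s %s -> %s\n", msg, status, next)
		status = next
	}
}
```

Rejecting messages that arrive in a terminal state keeps a late or duplicated envelope from reopening a finished job.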

## What scheduling looks like

- The agent runs a local `robfig/cron/v3` instance.
- The server pushes the desired schedule set to the agent on
  hello and after every CRUD change.
- When the agent's cron fires, it sends `schedule.fire` to the
  server. The server creates a job row, sends `command.run` back,
  and the agent dispatches a normal backup.
- If the WS drops between fire and run, the server queues the
  schedule firing into `pending_runs` and drains it on agent
  reconnect, so a network blip does not cause a missed scheduled
  backup.

For everything that isn't a backup (forget, prune, check), the
server runs a 60-second maintenance ticker against
`host_repo_maintenance` rows and dispatches the relevant command
when a cadence is due. The agent's local cron only handles
backups.