P5-01 — Documentation site under docs/book/ rendered with mdBook
(downloaded via Makefile, same static-binary pattern as Tailwind).
Structured chapters: getting started, concepts, operations,
security, reference. `make docs` / `make docs-watch`. Generated
output gitignored.
P5-02 — CONTRIBUTING.md rewritten from placeholder to a full
guide. CODE_OF_CONDUCT.md adapted from Contributor Covenant for a
single-maintainer project. .gitea/issue_template/{bug,feature}.md
and PULL_REQUEST_TEMPLATE.md.
P5-04 — Six README screenshots captured live from a fresh server
bootstrap (login, empty dashboard, add-host, alerts, settings,
audit log). README rewritten to centre the screenshot grid and
link out to the docs site.
P5-05 — SECURITY.md with disclosure policy (3-day ack, 30-day
default window), scope in/out, threat-model summary, operator
hardening checklist. Mirrored as a docs-site chapter.
P5-06 — End-to-end test harness. e2e/compose.e2e.yml brings up
server + sibling Linux agent (alpine + restic) + restic/rest-server.
Agent uses announce-and-approve so Playwright can drive the full
operator flow: bootstrap → login → accept pending → backup →
verify terminal status. Second spec scrapes /metrics to assert
the P6-04 endpoint surface. .gitea/workflows/e2e.yml runs on every
PR; local how-to in docs/e2e.md.
Architecture
Components
┌────────────────────────────────────────────────────────────┐
│ Server (control plane, single process)                     │
│ * chi-based HTTP API + HTMX server-rendered UI             │
│ * WebSocket hub for agent fan-out + browser fan-out        │
│ * SQLite store (modernc.org/sqlite, pure Go)               │
│ * AEAD encryption helpers                                  │
│ * Alert engine + notification hub                          │
└────────────┬───────────────────────────────────┬───────────┘
             │ outbound WS only                  │ HTTP(S)
             │                                   │
┌────────────▼─────────────┐        ┌────────────▼─────────────┐
│ Agent (per host)         │        │ Browser (operator)       │
│ * coder/websocket        │        │ * htmx + a tiny bit      │
│ * cron for schedules     │        │   of vanilla JS for      │
│ * restic wrapper         │        │   live job updates       │
│ * sysinfo collector      │        └──────────────────────────┘
└────────────┬─────────────┘
             │ subprocess: restic ...
             │
┌────────────▼─────────────────────────────────────────────────┐
│ restic repository (rest-server, S3, B2, SFTP, local …)       │
│ Backup data flows directly here. Server never touches it.    │
└──────────────────────────────────────────────────────────────┘
Why outbound-only WebSockets?
The agent dials the server on /ws/agent with a bearer token. The
server doesn't initiate connections to the agent. Three reasons:
- Firewall friendliness. Nothing on the endpoint needs an inbound port; this works behind the typical "branch office NAT" without router config.
- Single auth point. The bearer token is the only credential that crosses the boundary; the agent never accepts an incoming socket.
- Reconnect semantics are simpler. When the connection drops (NAT timeout, server restart, transient network glitch) the agent backs off and re-dials; the server marks the host offline after 90s and lets the alert engine raise a stale-host alert.
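The reconnect behaviour described above can be sketched in a few lines of Go. This is a minimal illustration, not the project's code: `nextBackoff`, `isOffline`, `baseDelay`, and `maxDelay` are hypothetical names and values, and only the 90s offline threshold comes from the text. A production agent would usually add random jitter to the delay as well.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay    = 1 * time.Second  // assumed first retry delay
	maxDelay     = 30 * time.Second // assumed ceiling for retries
	offlineAfter = 90 * time.Second // threshold stated in the text
)

// nextBackoff doubles the previous delay and clamps it to [base, max].
func nextBackoff(prev, base, max time.Duration) time.Duration {
	if prev < base {
		return base
	}
	if prev >= max/2 {
		return max
	}
	return prev * 2
}

// isOffline reports whether a host that was last seen sinceLastSeen ago
// should be marked offline by the server.
func isOffline(sinceLastSeen time.Duration) bool {
	return sinceLastSeen >= offlineAfter
}

func main() {
	// Agent side: delays between successive re-dial attempts.
	delay := time.Duration(0)
	for attempt := 1; attempt <= 6; attempt++ {
		delay = nextBackoff(delay, baseDelay, maxDelay)
		fmt.Printf("attempt %d: wait %s before re-dialling\n", attempt, delay)
	}

	// Server side: decide whether a silent host is offline.
	fmt.Println("offline after 2m of silence:", isOffline(2*time.Minute))
}
```

The agent resets the delay to zero after a successful dial, so a single blip costs one short wait rather than the full ceiling.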
Why SQLite?
High availability is an explicit non-goal, and SQLite fits that choice. A small control plane managing twelve endpoints does not need replication or a separate database tier. SQLite gives us:
- A single file to back up (plus the secret key).
- Hand-rolled migrations under internal/store/migrations/, with no migration framework lock-in.
- WAL mode plus per-connection foreign-key enforcement.
The migration files define the entire schema; there's no ORM or query-builder layer between Go code and SQL.
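One way a hand-rolled migration runner can decide what to apply is sketched below. It is an assumption about the general technique, not a description of the project's runner: `pendingMigrations` and the file names are hypothetical, and the SQL execution step is left as a comment so the example needs no database driver.

```go
package main

import (
	"fmt"
	"sort"
)

// pendingMigrations returns the migration file names that have not been
// applied yet, sorted lexically so that 0001_... runs before 0002_....
func pendingMigrations(files []string, applied map[string]bool) []string {
	var pending []string
	for _, name := range files {
		if !applied[name] {
			pending = append(pending, name)
		}
	}
	sort.Strings(pending)
	return pending
}

func main() {
	// Hypothetical file names; the real directory contents are not shown here.
	files := []string{"0002_jobs.sql", "0001_init.sql", "0003_alerts.sql"}
	applied := map[string]bool{"0001_init.sql": true}

	for _, name := range pendingMigrations(files, applied) {
		// A real runner would execute the file inside a transaction and
		// record the name in a bookkeeping table at this point.
		fmt.Println("apply", name)
	}
}
```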
Why the agent runs restic itself, not via the server
The control plane never holds backup bytes in flight. That's deliberate:
- A compromised control plane cannot exfiltrate snapshot contents in-band — at worst it can dispatch new backup or forget jobs (audit-logged) but the data path is between the agent and the repository.
- The same agent process can target whichever transport restic natively supports (rest-server, S3, B2, SFTP, local), with no separate mux on the server side.
Job lifecycle
           ┌──────────────────────┐
operator → │ POST /hosts/{id}/    │
           │ run-backup           │
           └──────────┬───────────┘
                      │ 1. INSERT INTO jobs (status='queued')
                      │ 2. dispatch command.run over WS
                      ▼
           ┌──────────────────────┐
           │ Agent dispatches     │
           │ restic subprocess    │
           └──────────┬───────────┘
                      │
                      │ 3. job.started  ───▶ store.MarkJobStarted
                      │ 4. job.progress ───▶ JobHub broadcast (live UI)
                      │ 5. log.stream   ───▶ append to job_logs
                      │ 6. job.finished ───▶ store.MarkJobFinished
                      │                      + alert engine eval
                      │                      + (P6) metrics histogram
                      ▼
  terminal: succeeded | failed | cancelled
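The lifecycle in the diagram amounts to a small state machine. The sketch below is illustrative only: the diagram names `queued` and the three terminal states, while the intermediate `running` status, the `Status` type, and `canTransition` are assumptions, as is allowing a queued job to fail or be cancelled before it starts.

```go
package main

import "fmt"

// Status is a job status as stored in the jobs table.
type Status string

const (
	Queued    Status = "queued"
	Running   Status = "running" // assumed name for the state after job.started
	Succeeded Status = "succeeded"
	Failed    Status = "failed"
	Cancelled Status = "cancelled"
)

// Terminal reports whether no further transitions are expected.
func (s Status) Terminal() bool {
	return s == Succeeded || s == Failed || s == Cancelled
}

// canTransition encodes the allowed moves: queued jobs start (or end early),
// running jobs finish, and terminal jobs never change again.
func canTransition(from, to Status) bool {
	switch from {
	case Queued:
		return to == Running || to == Failed || to == Cancelled
	case Running:
		return to.Terminal()
	default:
		return false
	}
}

func main() {
	fmt.Println("queued -> running:", canTransition(Queued, Running))
	fmt.Println("running -> succeeded:", canTransition(Running, Succeeded))
	fmt.Println("succeeded -> running:", canTransition(Succeeded, Running))
}
```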
Operators see live updates because the browser subscribes to
/api/jobs/{id}/stream, and the WS handler broadcasts each
agent-emitted envelope to all live subscribers in addition to
persisting it.
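The fan-out half of that behaviour can be sketched with channels. The diagram mentions a JobHub, but the `Envelope` fields and the `Subscribe` and `Broadcast` methods below are hypothetical, and persistence is omitted. The sketch also assumes that a slow subscriber should drop an update rather than stall the WS handler.

```go
package main

import (
	"fmt"
	"sync"
)

// Envelope is a stand-in for an agent-emitted message.
type Envelope struct {
	JobID string
	Type  string // e.g. "job.progress"
	Body  string
}

// JobHub fans envelopes out to every live subscriber of a job.
type JobHub struct {
	mu   sync.Mutex
	subs map[string]map[chan Envelope]struct{}
}

func NewJobHub() *JobHub {
	return &JobHub{subs: make(map[string]map[chan Envelope]struct{})}
}

// Subscribe registers a buffered channel for jobID and returns it together
// with a cancel function that unregisters and closes it.
func (h *JobHub) Subscribe(jobID string) (<-chan Envelope, func()) {
	ch := make(chan Envelope, 16)
	h.mu.Lock()
	if h.subs[jobID] == nil {
		h.subs[jobID] = make(map[chan Envelope]struct{})
	}
	h.subs[jobID][ch] = struct{}{}
	h.mu.Unlock()

	cancel := func() {
		h.mu.Lock()
		defer h.mu.Unlock()
		if _, ok := h.subs[jobID][ch]; ok {
			delete(h.subs[jobID], ch)
			close(ch)
		}
	}
	return ch, cancel
}

// Broadcast sends e to each subscriber of e.JobID without blocking and
// returns how many subscribers received it.
func (h *JobHub) Broadcast(e Envelope) int {
	h.mu.Lock()
	defer h.mu.Unlock()
	delivered := 0
	for ch := range h.subs[e.JobID] {
		select {
		case ch <- e:
			delivered++
		default: // subscriber buffer is full; drop this update
		}
	}
	return delivered
}

func main() {
	hub := NewJobHub()
	a, cancelA := hub.Subscribe("job-1")
	b, cancelB := hub.Subscribe("job-1")
	defer cancelA()
	defer cancelB()

	hub.Broadcast(Envelope{JobID: "job-1", Type: "job.progress", Body: "42%"})
	fmt.Println((<-a).Body, (<-b).Body)
}
```

Sends and channel closes both happen under the same mutex, so a subscriber that disconnects mid-broadcast cannot cause a send on a closed channel.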
What scheduling looks like
- The agent runs a local robfig/cron/v3 instance.
- The server pushes the desired schedule set to the agent on hello and after every CRUD change.
- When the agent's cron fires, it sends schedule.fire to the server. The server creates a job row, sends command.run back, and the agent dispatches a normal backup.
- If the WS drops between fire and run, the server queues the schedule firing into pending_runs and drains it on agent reconnect, so network blips do not cause missed scheduled backups.
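The queue-and-drain step in the last bullet can be sketched as follows. This is a simplified in-memory model under stated assumptions: `Dispatcher`, `Fire`, and `OnReconnect` are hypothetical names, the `pending` slice stands in for the pending_runs table, and `send` stands in for writing command.run to the agent's WebSocket.

```go
package main

import "fmt"

// Dispatcher models the server side of one agent connection.
type Dispatcher struct {
	connected bool
	pending   []string                // schedule IDs waiting for the agent
	send      func(scheduleID string) // stand-in for sending command.run
}

// Fire handles a schedule firing: dispatch immediately when the agent is
// connected, otherwise queue the firing for later.
func (d *Dispatcher) Fire(scheduleID string) {
	if !d.connected {
		d.pending = append(d.pending, scheduleID)
		return
	}
	d.send(scheduleID)
}

// OnReconnect marks the agent as connected and drains queued firings in
// the order they arrived.
func (d *Dispatcher) OnReconnect() {
	d.connected = true
	queued := d.pending
	d.pending = nil
	for _, id := range queued {
		d.send(id)
	}
}

func main() {
	d := &Dispatcher{send: func(id string) { fmt.Println("command.run for", id) }}

	d.Fire("nightly-home") // WS is down: queued, nothing is sent yet
	fmt.Println("queued while offline:", len(d.pending))

	d.OnReconnect() // queued firing is dispatched now
	d.Fire("hourly-db") // WS is up: dispatched immediately
}
```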
For everything that isn't a backup (forget, prune, check), the
server runs a 60-second maintenance ticker against
host_repo_maintenance rows and dispatches the relevant command
when a cadence is due. The agent's local cron only handles
backups.
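The due check that such a ticker performs on each pass can be sketched like this. The `MaintenanceRow` fields and `dueRows` are assumptions for illustration; the actual columns of host_repo_maintenance are not shown in this document.

```go
package main

import (
	"fmt"
	"time"
)

// MaintenanceRow is a hypothetical shape for one host_repo_maintenance row.
type MaintenanceRow struct {
	HostID  string
	Command string        // "forget", "prune", or "check"
	Every   time.Duration // cadence between runs
	LastRun time.Time     // zero value means the command has never run
}

// dueRows returns the rows whose cadence has elapsed as of now.
func dueRows(rows []MaintenanceRow, now time.Time) []MaintenanceRow {
	var due []MaintenanceRow
	for _, r := range rows {
		if r.LastRun.IsZero() || now.Sub(r.LastRun) >= r.Every {
			due = append(due, r)
		}
	}
	return due
}

func main() {
	now := time.Now()
	rows := []MaintenanceRow{
		{HostID: "web-1", Command: "prune", Every: 24 * time.Hour, LastRun: now.Add(-25 * time.Hour)},
		{HostID: "web-1", Command: "check", Every: 7 * 24 * time.Hour, LastRun: now.Add(-time.Hour)},
	}

	// In the server this would run on each tick of a 60-second ticker,
	// for example one created with time.NewTicker(60 * time.Second).
	for _, r := range dueRows(rows, now) {
		fmt.Println("dispatch", r.Command, "to", r.HostID)
	}
}
```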