# Architecture

## Components

```
┌────────────────────────────────────────────────────────────┐
│  Server (control plane, single process)                    │
│   * chi-based HTTP API + HTMX server-rendered UI           │
│   * WebSocket hub for agent fan-out + browser fan-out      │
│   * SQLite store (modernc.org/sqlite, pure Go)             │
│   * AEAD encryption helpers                                │
│   * Alert engine + notification hub                        │
└────────────┬───────────────────────────────────┬───────────┘
             │ outbound WS only                  │ HTTP(S)
             │                                   │
┌────────────▼─────────────┐        ┌────────────▼─────────────┐
│  Agent (per host)        │        │  Browser (operator)      │
│   * coder/websocket      │        │   * htmx + a tiny bit    │
│   * cron for schedules   │        │     of vanilla JS for    │
│   * restic wrapper       │        │     live job updates     │
│   * sysinfo collector    │        └──────────────────────────┘
└────────────┬─────────────┘
             │ subprocess: restic ...
             │
┌────────────▼─────────────────────────────────────────────────┐
│  restic repository (rest-server, S3, B2, SFTP, local …)      │
│  Backup data flows directly here. Server never touches it.   │
└──────────────────────────────────────────────────────────────┘
```

## Why outbound-only WebSockets?

The agent dials the server on `/ws/agent` with a bearer token. The server never initiates connections to the agent. Three reasons:

1. **Firewall friendliness.** Nothing on the endpoint needs an inbound port; this works behind the typical "branch office NAT" without router config.
2. **Single auth point.** The bearer token is the only credential that crosses the boundary; the agent never accepts an incoming socket.
3. **Reconnect semantics are simpler.** When the connection drops (NAT timeout, server restart, transient network glitch), the agent backs off and re-dials; the server marks the host offline after 90 seconds and lets the alert engine raise a stale-host alert.

## Why SQLite?

High availability is an explicit non-goal of this project, and SQLite fits that. A small control plane managing twelve endpoints does not need replication or a separate database tier.

SQLite gives us:

- A single file to back up (plus the secret key).
- Hand-rolled migrations under `internal/store/migrations/`, with no migration framework lock-in.
- `WAL` mode plus per-connection foreign-key enforcement.

The migration files are the entire schema; there's no ORM or query-builder layer between Go code and SQL.

## Why the agent runs `restic` itself, not via the server

The control plane never holds backup bytes in flight. That's deliberate:

- A compromised control plane cannot exfiltrate snapshot contents in-band. At worst it can dispatch new backup or forget jobs (which are audit-logged); the data path stays between the agent and the repository.
- The same agent process can target whichever transport restic natively supports (rest-server, S3, B2, SFTP, local), with no separate mux on the server side.

## Job lifecycle

```
              ┌──────────────────────┐
 operator →   │ POST /hosts/{id}/    │
              │       run-backup     │
              └──────────┬───────────┘
                         │ 1. INSERT INTO jobs (status='queued')
                         │ 2. dispatch command.run over WS
                         ▼
              ┌──────────────────────┐
              │ Agent dispatches     │
              │ restic subprocess    │
              └──────────┬───────────┘
                         │
                         │ 3. job.started   ───▶ store.MarkJobStarted
                         │ 4. job.progress  ───▶ JobHub broadcast (live UI)
                         │ 5. log.stream    ───▶ append to job_logs
                         │ 6. job.finished  ───▶ store.MarkJobFinished
                         │                       + alert engine eval
                         │                       + (P6) metrics histogram
                         ▼
              terminal: succeeded | failed | cancelled
```

Operators see live updates because the browser subscribes to `/api/jobs/{id}/stream`, and the WS handler broadcasts each agent-emitted envelope to all live subscribers in addition to persisting it.

## What scheduling looks like

- The agent runs a local `robfig/cron/v3` instance.
- The server pushes the desired schedule set to the agent on hello and after every CRUD change.
- When the agent's cron fires, it sends `schedule.fire` to the server. The server creates a job row, sends `command.run` back, and the agent dispatches a normal backup.
- If the WS drops between fire and run, the server queues the schedule firing into `pending_runs` and drains on agent reconnect, so network blips do not cause missed scheduled backups.

For everything that isn't a backup (forget, prune, check), the server runs a 60-second maintenance ticker against `host_repo_maintenance` rows and dispatches the relevant command when a cadence is due. The agent's local cron only handles backups.