# Architecture

## Components

```
┌────────────────────────────────────────────────────────────┐
│  Server (control plane, single process)                    │
│   * chi-based HTTP API + HTMX server-rendered UI           │
│   * WebSocket hub for agent fan-out + browser fan-out      │
│   * SQLite store (modernc.org/sqlite, pure Go)             │
│   * AEAD encryption helpers                                │
│   * Alert engine + notification hub                        │
└────────────┬───────────────────────────────────┬───────────┘
             │ outbound WS only                   │ HTTP(S)
             │                                    │
┌────────────▼─────────────┐         ┌────────────▼─────────────┐
│  Agent (per host)        │         │  Browser (operator)      │
│   * coder/websocket      │         │   * htmx + a tiny bit    │
│   * cron for schedules   │         │     of vanilla JS for    │
│   * restic wrapper       │         │     live job updates     │
│   * sysinfo collector    │         └──────────────────────────┘
└────────────┬─────────────┘
             │ subprocess: restic ...
             │
┌────────────▼─────────────────────────────────────────────────┐
│  restic repository (rest-server, S3, B2, SFTP, local …)      │
│  Backup data flows directly here. Server never touches it.   │
└──────────────────────────────────────────────────────────────┘
```

## Why outbound-only WebSockets?

The agent dials the server on `/ws/agent` with a bearer token. The
server doesn't initiate connections to the agent. Three reasons:

1. **Firewall friendliness.** Nothing on the endpoint needs an
   inbound port; this works behind the typical "branch office NAT"
   without router config.
2. **Single auth point.** The bearer token is the only credential
   that crosses the boundary; the agent never accepts an
   incoming socket.
3. **Reconnect semantics are simpler.** When the connection drops
   (NAT timeout, server restart, transient network glitch) the
   agent backs off and re-dials; the server marks the host
   offline after 90s and lets the alert engine raise a stale-host
   alert.
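
The reconnect behaviour can be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: `nextBackoff`, `isStale`, `baseDelay`, and `maxDelay` are hypothetical names, and the doubling-with-cap policy is an assumption. Only the 90s offline threshold comes from the text above.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay  = time.Second      // assumed first retry delay
	maxDelay   = 30 * time.Second // assumed upper bound on the delay
	staleAfter = 90 * time.Second // offline threshold described above
)

// nextBackoff returns the wait before reconnect attempt n (0-based):
// base doubled n times, never exceeding limit.
func nextBackoff(attempt int, base, limit time.Duration) time.Duration {
	d := base
	if d > limit {
		return limit
	}
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= limit {
			return limit
		}
	}
	return d
}

// isStale reports whether a host that was last heard from
// sinceLastSeen ago should be marked offline by the server.
func isStale(sinceLastSeen time.Duration) bool {
	return sinceLastSeen >= staleAfter
}

func main() {
	// Agent side: delays for the first few failed dials.
	for attempt := 0; attempt < 7; attempt++ {
		fmt.Printf("attempt %d: wait %s\n", attempt, nextBackoff(attempt, baseDelay, maxDelay))
	}
	// Server side: staleness check against the last-seen age.
	fmt.Println("stale after 2m of silence:", isStale(2*time.Minute))
}
```

A real agent would likely add jitter so that many agents do not re-dial in lockstep after a server restart.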

## Why SQLite?

High availability is an explicit non-goal, so SQLite is enough. A
small control plane managing twelve endpoints does not need
replication or a separate database tier. SQLite gives us:

- A single file to back up (plus the secret key).
- Hand-rolled migrations under `internal/store/migrations/` —
  no migration framework lock-in.
- `WAL` mode plus per-connection foreign-key enforcement.

The migration files define the entire schema; there's no ORM or
query-builder layer between Go code and SQL.
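
One way to get `WAL` mode and per-connection foreign-key enforcement is to put the PRAGMAs in the DSN, so the driver applies them to every connection it opens. The sketch below only builds such a DSN with the standard library. It assumes the `_pragma` query parameter documented by `modernc.org/sqlite`; `buildDSN` and the file name are hypothetical, and the project may set its PRAGMAs differently.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildDSN returns a SQLite DSN that asks the driver to run the
// listed PRAGMAs on each new connection (assumed `_pragma` syntax).
func buildDSN(path string) string {
	q := url.Values{}
	q.Add("_pragma", "journal_mode(WAL)")
	q.Add("_pragma", "foreign_keys(1)")
	// Encode percent-escapes the parentheses; the driver is expected
	// to decode the query string before reading the PRAGMA values.
	return "file:" + path + "?" + q.Encode()
}

func main() {
	dsn := buildDSN("data.db")
	fmt.Println(dsn)
	// With the driver imported, this would be passed to
	// sql.Open("sqlite", dsn).
}
```

Setting foreign keys through the DSN matters because the setting is per connection: a PRAGMA executed once on a pooled `*sql.DB` only affects whichever connection happened to run it.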

## Why the agent runs `restic` itself, not via the server

The control plane never holds backup bytes in flight. That's
deliberate:

- A compromised control plane cannot exfiltrate snapshot
  contents in-band. At worst it can dispatch new backup or
  forget jobs (audit-logged); the data path stays between the
  agent and the repository.
- The same agent process can target whichever transport restic
  natively supports (rest-server, S3, B2, SFTP, local), with no
  separate mux on the server side.
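
As a rough illustration of the agent side, the sketch below builds a `restic backup` invocation with `os/exec`. `backupArgs` and `newBackupCmd` are hypothetical helpers and the repository URL is a placeholder; the real wrapper may pass different flags. The `-r` and `--json` flags and the `backup` command are standard restic CLI.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// backupArgs builds the argument list for a restic backup run.
// --json asks restic for machine-readable status lines on stdout.
func backupArgs(repo string, paths []string) []string {
	args := []string{"-r", repo, "backup", "--json"}
	return append(args, paths...)
}

// newBackupCmd prepares (but does not start) the subprocess.
// Credentials such as RESTIC_PASSWORD are assumed to be present
// in the agent's environment, which the child inherits.
func newBackupCmd(repo string, paths []string) *exec.Cmd {
	cmd := exec.Command("restic", backupArgs(repo, paths)...)
	cmd.Env = os.Environ()
	return cmd
}

func main() {
	cmd := newBackupCmd("rest:https://backup.example.com/host1", []string{"/etc", "/home"})
	// Print the command line instead of running it; an agent would
	// call cmd.StdoutPipe() and cmd.Start() and parse the JSON lines.
	fmt.Println(cmd.Args)
}
```

Because the repository URL goes straight to restic, switching from rest-server to S3 or SFTP changes only the `repo` string, not the agent code.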

## Job lifecycle

```
            ┌──────────────────────┐
operator →  │ POST /hosts/{id}/    │
            │       run-backup     │
            └──────────┬───────────┘
                       │   1. INSERT INTO jobs (status='queued')
                       │   2. dispatch command.run over WS
                       ▼
            ┌──────────────────────┐
            │ Agent dispatches     │
            │ restic subprocess    │
            └──────────┬───────────┘
                       │
                       │   3. job.started   ───▶ store.MarkJobStarted
                       │   4. job.progress  ───▶ JobHub broadcast (live UI)
                       │   5. log.stream    ───▶ append to job_logs
                       │   6. job.finished  ───▶ store.MarkJobFinished
                       │                          + alert engine eval
                       │                          + (P6) metrics histogram
                       ▼
                  terminal: succeeded | failed | cancelled
```
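
The status flow in the diagram can be written down as a small transition table. This is a sketch under assumptions: the diagram names only `queued` and the three terminal states, so the intermediate `running` status, the `Status` type, and the rule that a queued job may fail or be cancelled before it starts are all hypothetical.

```go
package main

import "fmt"

type Status string

const (
	Queued    Status = "queued"
	Running   Status = "running" // assumed name for "job.started received"
	Succeeded Status = "succeeded"
	Failed    Status = "failed"
	Cancelled Status = "cancelled"
)

// allowed lists the transitions this sketch permits.
// Terminal states have no outgoing edges.
var allowed = map[Status][]Status{
	Queued:  {Running, Failed, Cancelled},
	Running: {Succeeded, Failed, Cancelled},
}

// canTransition reports whether a job may move from one status to another.
func canTransition(from, to Status) bool {
	for _, s := range allowed[from] {
		if s == to {
			return true
		}
	}
	return false
}

// isTerminal reports whether s is one of the three end states.
func isTerminal(s Status) bool {
	switch s {
	case Succeeded, Failed, Cancelled:
		return true
	}
	return false
}

func main() {
	fmt.Println(canTransition(Queued, Running))    // allowed
	fmt.Println(canTransition(Succeeded, Running)) // rejected: terminal state
	fmt.Println(isTerminal(Cancelled))
}
```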

Operators see live updates because the browser subscribes to
`/api/jobs/{id}/stream`, and the WS handler broadcasts each
agent-emitted envelope to all live subscribers in addition to
persisting it.
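
The fan-out half of that handler could look something like the hub below. It is a minimal sketch, not the project's `JobHub`: the `Envelope` fields, the buffer size, and the choice to drop messages for slow subscribers are assumptions, and persistence is left out entirely.

```go
package main

import (
	"fmt"
	"sync"
)

// Envelope is a stand-in for an agent-emitted message.
type Envelope struct {
	JobID string
	Type  string // e.g. "job.progress"
	Data  string
}

// JobHub tracks live subscribers per job ID.
type JobHub struct {
	mu   sync.Mutex
	subs map[string]map[chan Envelope]struct{}
}

func NewJobHub() *JobHub {
	return &JobHub{subs: make(map[string]map[chan Envelope]struct{})}
}

// Subscribe registers a listener for one job and returns its channel
// plus a cancel function that unregisters and closes it.
func (h *JobHub) Subscribe(jobID string) (<-chan Envelope, func()) {
	ch := make(chan Envelope, 16)
	h.mu.Lock()
	if h.subs[jobID] == nil {
		h.subs[jobID] = make(map[chan Envelope]struct{})
	}
	h.subs[jobID][ch] = struct{}{}
	h.mu.Unlock()

	cancel := func() {
		h.mu.Lock()
		defer h.mu.Unlock()
		if _, ok := h.subs[jobID][ch]; ok {
			delete(h.subs[jobID], ch)
			close(ch)
		}
	}
	return ch, cancel
}

// Broadcast delivers e to every subscriber of e.JobID and returns how
// many received it. A full buffer means the message is dropped for
// that subscriber rather than blocking the WS handler.
func (h *JobHub) Broadcast(e Envelope) int {
	h.mu.Lock()
	defer h.mu.Unlock()
	delivered := 0
	for ch := range h.subs[e.JobID] {
		select {
		case ch <- e:
			delivered++
		default:
		}
	}
	return delivered
}

func main() {
	hub := NewJobHub()
	updates, cancel := hub.Subscribe("job-1")
	defer cancel()

	hub.Broadcast(Envelope{JobID: "job-1", Type: "job.progress", Data: "42%"})
	fmt.Println(<-updates)
}
```

Dropping on a full buffer keeps one stalled browser tab from back-pressuring the agent connection; the persisted rows remain the source of truth for anything a subscriber missed.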

## What scheduling looks like

- The agent runs a local `robfig/cron/v3` instance.
- The server pushes the desired schedule set to the agent on
  hello and after every CRUD change.
- When the agent's cron fires, it sends `schedule.fire` to the
  server. The server creates a job row, sends `command.run` back,
  and the agent dispatches a normal backup.
- If the WS drops between fire and run, the server records the
  schedule firing in `pending_runs` and drains the queue when the
  agent reconnects, so a network blip does not cause a missed
  scheduled backup.
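
The queue-and-drain behaviour from the last bullet can be modelled in memory as below. This is a sketch only: `Dispatcher` and its methods are hypothetical, a map stands in for the `pending_runs` table, and `send` stands in for writing `command.run` to the agent's WebSocket. It is not safe for concurrent use.

```go
package main

import "fmt"

// Dispatcher models the server's view of schedule firings.
type Dispatcher struct {
	connected map[string]bool     // hostID -> agent currently connected
	pending   map[string][]string // hostID -> queued schedule IDs (stand-in for pending_runs)
	sent      []string            // record of dispatched command.run messages
}

func NewDispatcher() *Dispatcher {
	return &Dispatcher{
		connected: make(map[string]bool),
		pending:   make(map[string][]string),
	}
}

// Fire handles a schedule.fire: dispatch immediately if the agent is
// reachable, otherwise queue the run for later.
func (d *Dispatcher) Fire(hostID, scheduleID string) {
	if d.connected[hostID] {
		d.send(hostID, scheduleID)
		return
	}
	d.pending[hostID] = append(d.pending[hostID], scheduleID)
}

// Reconnect marks the agent online and drains its queue in order.
// It returns the number of runs that were drained.
func (d *Dispatcher) Reconnect(hostID string) int {
	d.connected[hostID] = true
	queued := d.pending[hostID]
	delete(d.pending, hostID)
	for _, id := range queued {
		d.send(hostID, id)
	}
	return len(queued)
}

// Disconnect marks the agent offline.
func (d *Dispatcher) Disconnect(hostID string) {
	d.connected[hostID] = false
}

func (d *Dispatcher) send(hostID, scheduleID string) {
	d.sent = append(d.sent, hostID+"/"+scheduleID)
}

func main() {
	d := NewDispatcher()
	d.Reconnect("host-1")
	d.Disconnect("host-1")
	d.Fire("host-1", "nightly") // WS is down: queued, not lost
	fmt.Println("drained:", d.Reconnect("host-1"))
	fmt.Println("sent:", d.sent)
}
```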

For everything that isn't a backup (forget, prune, check), the
server runs a 60-second maintenance ticker against
`host_repo_maintenance` rows and dispatches the relevant command
when a cadence is due. The agent's local cron only handles
backups.
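
The "is a cadence due?" check on each tick might look like the sketch below. The `Maintenance` struct and its fields are hypothetical and do not reflect the actual `host_repo_maintenance` columns; treating a never-run row as due is also an assumption.

```go
package main

import (
	"fmt"
	"time"
)

const day = 24 * time.Hour

// epoch is an arbitrary fixed timestamp used for the example rows.
var epoch = time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)

// Maintenance is a stand-in for one host_repo_maintenance row.
type Maintenance struct {
	HostID  string
	Command string        // "forget", "prune" or "check"
	Every   time.Duration // cadence
	LastRun time.Time     // zero value means "never run"
}

// dueRows returns the rows whose cadence has elapsed at time now.
func dueRows(rows []Maintenance, now time.Time) []Maintenance {
	var due []Maintenance
	for _, r := range rows {
		if r.LastRun.IsZero() || now.Sub(r.LastRun) >= r.Every {
			due = append(due, r)
		}
	}
	return due
}

func main() {
	rows := []Maintenance{
		{HostID: "host-1", Command: "prune", Every: 7 * day, LastRun: epoch},
		{HostID: "host-1", Command: "check", Every: day, LastRun: epoch},
		{HostID: "host-2", Command: "forget", Every: day}, // never run
	}
	// In the server, a time.NewTicker(60 * time.Second) loop would
	// call dueRows on every tick and dispatch one command per row.
	for _, r := range dueRows(rows, epoch.Add(2*day)) {
		fmt.Println("dispatch", r.Command, "to", r.HostID)
	}
}
```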