Architecture

Components

┌────────────────────────────────────────────────────────────┐
│  Server (control plane, single process)                    │
│   * chi-based HTTP API + HTMX server-rendered UI           │
│   * WebSocket hub for agent fan-out + browser fan-out      │
│   * SQLite store (modernc.org/sqlite, pure Go)             │
│   * AEAD encryption helpers                                │
│   * Alert engine + notification hub                        │
└────────────┬───────────────────────────────────┬───────────┘
             │ outbound WS only                   │ HTTP(S)
             │                                    │
┌────────────▼─────────────┐         ┌────────────▼─────────────┐
│  Agent (per host)        │         │  Browser (operator)      │
│   * coder/websocket      │         │   * htmx + a tiny bit    │
│   * cron for schedules   │         │     of vanilla JS for    │
│   * restic wrapper       │         │     live job updates     │
│   * sysinfo collector    │         └──────────────────────────┘
└────────────┬─────────────┘
             │ subprocess: restic ...
             │
┌────────────▼─────────────────────────────────────────────────┐
│  restic repository (rest-server, S3, B2, SFTP, local …)      │
│  Backup data flows directly here. Server never touches it.   │
└──────────────────────────────────────────────────────────────┘

Why outbound-only WebSockets?

The agent dials the server on /ws/agent with a bearer token. The server doesn't initiate connections to the agent. Three reasons:

  1. Firewall friendliness. Nothing on the endpoint needs an inbound port; this works behind the typical "branch office NAT" without router config.
  2. Single auth point. The bearer token is the only credential that crosses the boundary; the agent never accepts an incoming socket.
  3. Reconnect semantics are simpler. When the connection drops (NAT timeout, server restart, transient network glitch) the agent backs off and re-dials; the server marks the host offline after 90s and lets the alert engine raise a stale-host alert.
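The reconnect behaviour in point 3 can be sketched as two small pure functions. This is a minimal illustration, not the project's code: the doubling policy, the `baseDelay` and `maxDelay` values, and the function names are assumptions. Only the 90s offline threshold comes from the text above.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay    = 1 * time.Second  // assumed first retry delay
	maxDelay     = 30 * time.Second // assumed cap on the retry delay
	offlineAfter = 90 * time.Second // threshold stated in the text
)

// nextBackoff doubles the delay for each failed attempt and caps it at max.
func nextBackoff(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	return d
}

// isOffline reports whether a host that was last heard from
// sinceLastSeen ago should be marked offline by the server.
func isOffline(sinceLastSeen time.Duration) bool {
	return sinceLastSeen >= offlineAfter
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Println("attempt", attempt, "wait", nextBackoff(attempt, baseDelay, maxDelay))
	}
	fmt.Println("offline after 2m of silence:", isOffline(2*time.Minute))
}
```

A real agent would typically add jitter to the delay so that many agents do not re-dial in lockstep after a server restart.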

Why SQLite?

SQLite fits because high availability is an explicit non-goal of the project. A small control plane managing twelve endpoints does not need replication or a separate database tier. SQLite gives us:

  • A single file to back up (plus the secret key).
  • Hand-rolled migrations under internal/store/migrations/ — no migration framework lock-in.
  • WAL mode plus per-connection foreign-key enforcement.

The migration files define the entire schema; there's no ORM or query-builder layer between Go code and SQL.
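To illustrate what "hand-rolled" can mean here, the sketch below picks the not-yet-applied migration files in order. It is an assumption-laden example: the file names, the zero-padded numeric prefix, and the `pendingMigrations` helper are hypothetical, and the actual runner under internal/store/migrations/ may work differently.

```go
package main

import (
	"fmt"
	"sort"
)

// pendingMigrations returns the migration file names that have not been
// applied yet, sorted lexically. With zero-padded numeric prefixes
// (an assumed naming scheme) lexical order equals apply order.
func pendingMigrations(files []string, applied map[string]bool) []string {
	var pending []string
	for _, name := range files {
		if !applied[name] {
			pending = append(pending, name)
		}
	}
	sort.Strings(pending)
	return pending
}

func main() {
	// Hypothetical file names, listed out of order on purpose.
	files := []string{"0003_alerts.sql", "0001_init.sql", "0002_jobs.sql"}
	applied := map[string]bool{"0001_init.sql": true}

	for _, name := range pendingMigrations(files, applied) {
		fmt.Println("would apply:", name)
	}
}
```

In a real runner each pending file would be executed inside a transaction and then recorded as applied, so a crash mid-way leaves the schema at a known version.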

Why the agent runs restic itself, not via the server

The control plane never holds backup bytes in flight. That's deliberate:

  • A compromised control plane cannot exfiltrate snapshot contents in-band. At worst it can dispatch new backup or forget jobs (both audit-logged); the data path itself runs only between the agent and the repository.
  • The same agent process can target whichever transport restic natively supports (rest-server, S3, B2, SFTP, local), with no separate mux needed on the server side.
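As a rough sketch of the agent-side wrapper, the example below builds the argument list for a restic backup subprocess without running it. The `backupArgs` helper and the repository URL are hypothetical; `--repo`, `--json`, and the `backup` subcommand are standard restic CLI usage. How the real agent passes credentials is not described here and is left out.

```go
package main

import (
	"fmt"
	"os/exec"
)

// backupArgs builds the arguments for `restic backup`.
// The repository string decides the transport (rest:, s3:, b2:, sftp:,
// or a local path), so the agent needs no transport-specific code.
func backupArgs(repo string, paths []string) []string {
	args := []string{"--repo", repo, "backup", "--json"}
	return append(args, paths...)
}

func main() {
	// Hypothetical repository URL and paths, for illustration only.
	args := backupArgs("rest:https://backup.example.com/host-a", []string{"/etc", "/home"})

	// The command is only constructed here, not started.
	cmd := exec.Command("restic", args...)
	fmt.Println(cmd.Args)
}
```

Because the server only ever sends the command description, the bytes read from `/etc` and `/home` travel from this subprocess straight to the repository.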

Job lifecycle

            ┌──────────────────────┐
operator →  │ POST /hosts/{id}/    │
            │       run-backup     │
            └──────────┬───────────┘
                       │   1. INSERT INTO jobs (status='queued')
                       │   2. dispatch command.run over WS
                       ▼
            ┌──────────────────────┐
            │ Agent dispatches     │
            │ restic subprocess    │
            └──────────┬───────────┘
                       │
                       │   3. job.started   ───▶ store.MarkJobStarted
                       │   4. job.progress  ───▶ JobHub broadcast (live UI)
                       │   5. log.stream    ───▶ append to job_logs
                       │   6. job.finished  ───▶ store.MarkJobFinished
                       │                          + alert engine eval
                       │                          + (P6) metrics histogram
                       ▼
                  terminal: succeeded | failed | cancelled
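The lifecycle above can be read as a small state machine driven by envelope types. The sketch below is one possible encoding, not the server's implementation: the `running` status name, the `Status` type, and the `apply` function are assumptions. The envelope names and the `queued` and terminal statuses come from the diagram.

```go
package main

import "fmt"

type Status string

const (
	Queued    Status = "queued"
	Running   Status = "running" // assumed name for the state after job.started
	Succeeded Status = "succeeded"
	Failed    Status = "failed"
	Cancelled Status = "cancelled"
)

// Terminal reports whether no further transitions are allowed.
func (s Status) Terminal() bool {
	return s == Succeeded || s == Failed || s == Cancelled
}

// apply returns the job status after an agent envelope of the given type.
// outcome is only consulted for job.finished.
func apply(cur Status, envelopeType string, outcome Status) (Status, error) {
	if cur.Terminal() {
		return cur, fmt.Errorf("job already %s", cur)
	}
	switch envelopeType {
	case "job.started":
		if cur != Queued {
			return cur, fmt.Errorf("job.started in state %s", cur)
		}
		return Running, nil
	case "job.progress", "log.stream":
		return cur, nil // informational only, no status change
	case "job.finished":
		if !outcome.Terminal() {
			return cur, fmt.Errorf("invalid outcome %q", outcome)
		}
		return outcome, nil
	}
	return cur, fmt.Errorf("unknown envelope type %q", envelopeType)
}

func main() {
	s := Queued
	for _, env := range []string{"job.started", "job.progress", "log.stream", "job.finished"} {
		next, err := apply(s, env, Succeeded)
		fmt.Println(env, "->", next, err)
		s = next
	}
}
```

Rejecting envelopes for already-terminal jobs keeps a late or duplicated message from reopening a finished job.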

Operators see live updates because the browser subscribes to /api/jobs/{id}/stream, and the WS handler broadcasts each agent-emitted envelope to all live subscribers in addition to persisting it.
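The fan-out described here might look roughly like the following in-memory hub. This is a simplified sketch under assumptions: the `Hub` type, its method names, the buffer size, and the drop-on-full policy are illustrative and not taken from the project's JobHub. Persistence to job_logs is omitted.

```go
package main

import (
	"fmt"
	"sync"
)

// Hub fans out messages to the subscribers of each job.
type Hub struct {
	mu   sync.Mutex
	subs map[string]map[chan string]struct{} // job ID -> subscriber channels
}

func NewHub() *Hub {
	return &Hub{subs: make(map[string]map[chan string]struct{})}
}

// Subscribe registers a buffered channel for jobID and returns it
// together with a cancel function that removes the subscription.
func (h *Hub) Subscribe(jobID string) (chan string, func()) {
	ch := make(chan string, 16)
	h.mu.Lock()
	if h.subs[jobID] == nil {
		h.subs[jobID] = make(map[chan string]struct{})
	}
	h.subs[jobID][ch] = struct{}{}
	h.mu.Unlock()

	return ch, func() {
		h.mu.Lock()
		delete(h.subs[jobID], ch)
		h.mu.Unlock()
	}
}

// Broadcast sends msg to every subscriber of jobID without blocking and
// returns how many received it. A subscriber whose buffer is full misses
// the message (an assumed policy; a real hub might disconnect it instead).
func (h *Hub) Broadcast(jobID, msg string) int {
	h.mu.Lock()
	defer h.mu.Unlock()
	delivered := 0
	for ch := range h.subs[jobID] {
		select {
		case ch <- msg:
			delivered++
		default:
		}
	}
	return delivered
}

func main() {
	hub := NewHub()
	ch, cancel := hub.Subscribe("job-42")
	defer cancel()

	delivered := hub.Broadcast("job-42", `{"type":"job.progress"}`)
	fmt.Println(delivered, <-ch)
}
```

The non-blocking send matters: one stalled browser tab should not hold up delivery to the others or back-pressure the agent connection.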

What scheduling looks like

  • The agent runs a local robfig/cron/v3 instance.
  • The server pushes the desired schedule set to the agent on hello + after every CRUD change.
  • When the agent's cron fires, it sends schedule.fire to the server. The server creates a job row, sends command.run back, and the agent dispatches a normal backup.
  • If the WS drops between fire and run, the server queues the schedule firing into pending_runs and drains the queue when the agent reconnects, so a network blip does not cause a missed scheduled backup.
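The queue-and-drain behaviour in the last bullet can be sketched with an in-memory stand-in for pending_runs. The real store is a database table; the `PendingRuns` type, its methods, and the IDs below are hypothetical and only show the ordering guarantee.

```go
package main

import "fmt"

// PendingRuns is an in-memory stand-in for the pending_runs table.
type PendingRuns struct {
	byHost map[string][]string // host ID -> schedule IDs, oldest first
}

func NewPendingRuns() *PendingRuns {
	return &PendingRuns{byHost: make(map[string][]string)}
}

// Enqueue records a schedule firing that could not be dispatched
// because the agent's WebSocket was down.
func (p *PendingRuns) Enqueue(hostID, scheduleID string) {
	p.byHost[hostID] = append(p.byHost[hostID], scheduleID)
}

// Drain returns the queued firings for hostID in arrival order and
// clears them. It is meant to be called when the agent reconnects.
func (p *PendingRuns) Drain(hostID string) []string {
	runs := p.byHost[hostID]
	delete(p.byHost, hostID)
	return runs
}

func main() {
	pending := NewPendingRuns()

	// Two schedules fire while host-a is disconnected.
	pending.Enqueue("host-a", "nightly-etc")
	pending.Enqueue("host-a", "nightly-home")

	// On reconnect the server dispatches command.run for each one.
	for _, scheduleID := range pending.Drain("host-a") {
		fmt.Println("dispatch command.run for", scheduleID)
	}
	fmt.Println("left after drain:", len(pending.Drain("host-a")))
}
```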

For everything that isn't a backup (forget, prune, check), the server runs a 60-second maintenance ticker against host_repo_maintenance rows and dispatches the relevant command when a cadence is due. The agent's local cron only handles backups.
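A minimal sketch of the "is this cadence due?" check that such a ticker could run on each pass. The `Maintenance` struct and its field names are assumptions, not the actual host_repo_maintenance columns. Only the 60-second interval and the command names come from the text.

```go
package main

import (
	"fmt"
	"time"
)

// tickInterval is the maintenance ticker period stated in the text.
const tickInterval = 60 * time.Second

// Maintenance is a hypothetical shape for one host_repo_maintenance row.
type Maintenance struct {
	HostID      string
	Command     string // "forget", "prune", or "check"
	CadenceSec  int64  // how often the command should run, in seconds
	LastRunUnix int64  // Unix time of the last run; 0 means never run
}

// isDue reports whether the command should be dispatched at nowUnix.
func isDue(m Maintenance, nowUnix int64) bool {
	return m.LastRunUnix == 0 || nowUnix-m.LastRunUnix >= m.CadenceSec
}

func main() {
	now := time.Now().Unix()
	rows := []Maintenance{
		{HostID: "host-a", Command: "prune", CadenceSec: 7 * 24 * 3600, LastRunUnix: now - 8*24*3600},
		{HostID: "host-a", Command: "check", CadenceSec: 24 * 3600, LastRunUnix: now - 3600},
		{HostID: "host-b", Command: "forget", CadenceSec: 24 * 3600},
	}

	// A real server would run this loop on every tick of
	// time.NewTicker(tickInterval); here it runs once.
	fmt.Println("tick interval:", tickInterval)
	for _, m := range rows {
		fmt.Println(m.HostID, m.Command, "due:", isDue(m, now))
	}
}
```

Since the check runs once a minute, a command is dispatched at most about one tick interval after its cadence elapses.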