phase 1 foundations: api types, store, crypto, auth

Lands the bottom four layers of Phase 1:

P1-08 internal/api: protocol_version + envelope + every WS message
  shape from spec.md §6.2 (Hello, Heartbeat, Job*, Schedule*, etc).
  Wire-format tests pin the JSON shape so a rename here breaks
  tests instead of silently breaking the agent.

P1-02 + P1-03 internal/store: SQLite via modernc.org/sqlite,
  embed.FS + a tiny version table for hand-rolled migrations.
  0001_initial.sql covers every table from spec.md §5 plus
  enrollment_tokens and host_schedule_version. Typed accessors
  for users / sessions / enrollment / audit. WAL + foreign_keys
  + busy_timeout on by default.

P1-06 internal/crypto: XChaCha20-Poly1305 AEAD wrapper with
  per-message random nonce. Key file lifecycle (generate +
  refuse-to-overwrite, load with size validation). Optional
  additionalData binds ciphertext to the row that owns it.
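
The seal/open pattern can be sketched against Go's cipher.AEAD
interface. To stay dependency-free this example swaps in AES-256-GCM
from the standard library; the commit itself uses XChaCha20-Poly1305,
which satisfies the same interface and whose 24-byte nonce is what
makes purely random nonces comfortable. The function names and the
"credentials:42" row label are hypothetical.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newAEAD builds the stand-in AEAD (AES-GCM); key must be 16, 24 or 32 bytes.
func newAEAD(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext under a fresh random nonce and returns
// nonce || ciphertext. additionalData is authenticated but not stored.
func seal(aead cipher.AEAD, plaintext, additionalData []byte) ([]byte, error) {
	nonce := make([]byte, aead.NonceSize(), aead.NonceSize()+len(plaintext)+aead.Overhead())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Seal appends to its first argument, so the nonce ends up as a prefix.
	return aead.Seal(nonce, nonce, plaintext, additionalData), nil
}

// open splits the nonce off the front and authenticates and decrypts the rest.
func open(aead cipher.AEAD, sealed, additionalData []byte) ([]byte, error) {
	if len(sealed) < aead.NonceSize() {
		return nil, errors.New("sealed message shorter than nonce")
	}
	nonce, ciphertext := sealed[:aead.NonceSize()], sealed[aead.NonceSize():]
	return aead.Open(nil, nonce, ciphertext, additionalData)
}

func main() {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	aead, err := newAEAD(key)
	if err != nil {
		panic(err)
	}
	sealed, err := seal(aead, []byte("repo-password"), []byte("credentials:42"))
	if err != nil {
		panic(err)
	}
	_, errSame := open(aead, sealed, []byte("credentials:42"))
	_, errOther := open(aead, sealed, []byte("credentials:43"))
	// Same row label decrypts; a different one fails authentication.
	fmt.Println(errSame == nil, errOther != nil)
}
```

Binding the row identity this way means a ciphertext copied onto
another row fails to decrypt instead of silently yielding someone
else's secret.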

P1-04 internal/auth (partial — passwords + tokens; sessions
  middleware lands with the HTTP handlers): argon2id following
  RFC 9106 (64 MiB / t=3 / p=4 / 32B), constant-time verify.
  HashToken stores SHA-256 of session/agent/enrollment tokens
  so a stolen DB doesn't hand over credentials.
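
A rough sketch of the token half, standard library only. The function
names are hypothetical, not the internal/auth API. A single unsalted
SHA-256 is reasonable here only because the tokens are long random
values; passwords are low-entropy, which is why they go through
argon2id instead (not shown, since it needs golang.org/x/crypto).

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// newToken returns a random 32-byte token, hex-encoded, to hand to the client.
func newToken() (string, error) {
	b := make([]byte, 32)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// hashToken returns the hex SHA-256 of token. Only this value is stored.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// verifyToken hashes the presented token and compares it to the stored
// hash in constant time.
func verifyToken(presented, storedHash string) bool {
	return subtle.ConstantTimeCompare([]byte(hashToken(presented)), []byte(storedHash)) == 1
}

func main() {
	token, err := newToken()
	if err != nil {
		panic(err)
	}
	stored := hashToken(token) // what would land in the database row
	fmt.Println(verifyToken(token, stored), verifyToken("guess", stored))
}
```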

Build floor moves to Go 1.25 (modernc.org/sqlite v1.50+ requires
it); CI + Dockerfile + README updated. Markdown lint diagnostics
on tasks.md cleared.

All packages tested. ~70 new tests pass in <1s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:24:40 +01:00
parent 595546afb9
commit c275f4ff4c
28 changed files with 1952 additions and 13 deletions
@@ -20,6 +20,7 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
## Phase 1 — MVP: enrollment, visibility, on-demand backup
### Server foundations
- [ ] **P1-01** (M) HTTP server scaffolding (`chi`, structured logging via `slog`, graceful shutdown)
- [ ] **P1-02** (M) SQLite store layer (`modernc.org/sqlite`) + migrations (`golang-migrate` or hand-rolled)
- [ ] **P1-03** (M) Schema for `users`, `sessions`, `hosts`, `repos`, `credentials`, `jobs`, `job_logs`, `snapshots`, `audit_log`
@@ -29,6 +30,7 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
- [ ] **P1-07** (M) Audit log writer + middleware
### Agent ↔ server protocol
- [ ] **P1-08** (M) Define shared API types in `internal/api` (Go structs, JSON tags)
- [ ] **P1-09** (L) WebSocket transport (`nhooyr.io/websocket`), framed JSON envelopes, request/response correlation, ping/pong, reconnect with backoff
- [ ] **P1-10** (M) Enrollment flow: `POST /api/agents/enroll` with one-time token → returns persistent bearer + cert pin
@@ -36,6 +38,7 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
- [ ] **P1-12** (S) Heartbeat handler (mark host offline after 90s without heartbeat)
### Agent foundations
- [ ] **P1-13** (M) Agent config file (`/etc/restic-manager/agent.yaml`); Windows path deferred to Phase 2
- [ ] **P1-14** (M) Service integration: systemd unit (Linux only in Phase 1; Windows service entrypoint deferred to Phase 2 — see P2-16)
- [ ] **P1-15** (M) Outbound WS client (`github.com/coder/websocket`) with reconnect, server cert pinning, `protocol_version` advertisement in `hello`
@@ -43,6 +46,7 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
- [ ] **P1-17** (S) Host metadata collection (OS, arch, hostname, restic version, agent version, protocol_version)
### Run-now backup
- [ ] **P1-18** (L) Job lifecycle: queued → running → succeeded/failed/cancelled, persisted with logs
- [ ] **P1-19** (M) Server endpoint `POST /api/hosts/:id/jobs` to dispatch a `backup` command
- [ ] **P1-20** (M) Agent executes `restic backup`, streams stdout/stderr + parsed JSON events back as `job.progress` / `log.stream`
@@ -50,6 +54,7 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
- [ ] **P1-22** (S) Snapshot listing: `restic snapshots --json`, cached projection table, refresh after each backup
### UI (HTMX + Tailwind)
- [ ] **P1-23** (M) Base layout, login page, session-aware nav
- [ ] **P1-24** (M) Dashboard: host cards (status dot, last backup, repo size)
- [ ] **P1-25** (M) Host detail page: snapshots tab + run-now button
@@ -58,10 +63,12 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
- [ ] **P1-28** (S) Tailwind build via `tailwindcss` standalone binary (no Node)
### Install scripts
- [ ] **P1-29** (M) `install.sh` (Linux): detects arch, downloads agent, installs systemd unit, enrolls. Also detects existing restic timers/cron (`systemctl list-timers --all | grep -i restic`, `crontab -l`, `/etc/cron.d/`, `/etc/cron.daily/`) and prints them with the disable commands — does **not** auto-disable, since heuristic matches could be unrelated tooling
- [ ] **P1-31** (S) Server endpoint to serve agent binaries + install scripts (signed)
### Phase 1 acceptance
- One Linux host can enroll, appear in the dashboard, and a backup can be triggered from the UI with live log streaming. Snapshots list updates after success.
- Windows binary builds cleanly in CI (`.gitea/workflows/ci.yml`) but is not service-tested or installer-shipped in Phase 1 — that lands in Phase 2 (P2-16, P2-17).
- Agent ↔ server `protocol_version` handshake rejects mismatched versions with a clear error rather than failing on JSON parse.
@@ -89,6 +96,7 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
- [ ] **P2-17** (M) `install.ps1` (Windows): downloads agent, installs as service, enrolls; detects existing scheduled tasks named `*restic*` and prints them for manual review
### Phase 2 acceptance
- Schedules created in UI run on agents on time; retention is applied; admin can prune from UI; repo health visible per host. Pre/post hooks fire correctly (verified with a Docker stop/start example and a `mysqldump` example) and are rejected on non-backup schedule kinds. Bandwidth limits honoured.
- A Windows host can enroll, appear in the dashboard, and run a backup with live log streaming — closing the cross-platform gap left by Phase 1.
@@ -107,6 +115,7 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
- [ ] **P3-09** (S) `diff` between two snapshots in UI
### Phase 3 acceptance
- A file deleted on a host can be restored from the UI in under 2 minutes. A failed backup raises an alert via the configured channel within 60s.
---
@@ -124,6 +133,7 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
- [ ] **P4-09** (S) Document Prometheus integration + sample Grafana dashboard JSON
### Phase 4 acceptance
- Non-admin users see an appropriately limited UI. Agents upgrade via apt/choco with one admin-triggered action. OIDC login works against at least one provider (Authelia or Authentik). Prometheus can scrape `/metrics` and the sample Grafana dashboard renders with live data.
---
@@ -139,6 +149,7 @@ Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
- [ ] **P5-07** (S) Sample `docker-compose.yml` with TLS via Caddy sidecar (also demonstrates `RM_TRUSTED_PROXY`)
### Phase 5 acceptance
- A stranger can read the docs and stand up a working install in under 30 minutes.
---