kb/openspec/changes/kb-v2-client-server/design.md

## Context

kb v1 is a monolithic Python CLI that loads an embedding model, opens SQLite, and performs search/ingestion in a single process. Every invocation pays ~3-5s model load cost. It was developed on a CPU-only server, then moved to a GPU-equipped WSL2 instance where native CUDA caused WSL crashes. Docker with NVIDIA Container Toolkit proved stable for GPU access.

The target deployment has two phases: a heavy initial ingest on a local WSL2 box with an RTX 5090, after which the data directory moves to a production server that may have either an NVIDIA or an AMD GPU. Multiple clients (terminal, Claude Code, future web UI) need to query the same knowledge base.

### Constraints

- Engine must run in Docker (GPU stability, portability, isolation)
- Client must be a single static binary with zero runtime dependencies
- Data directory must be portable between hosts via simple file copy
- API must work behind a reverse proxy with TLS termination
- Must support both NVIDIA (CUDA) and AMD (ROCm) GPUs without code changes

## Goals / Non-Goals

**Goals:**
- Sub-100ms query latency (model already warm)
- Single Go binary CLI that works on Linux, macOS, Windows
- Engine handles all GPU/ML/DB operations — client is pure HTTP
- GPU vendor abstracted at Docker build level, not application level
- Async ingestion — client uploads and exits immediately, engine processes in background
- JSON-first API suitable for both CLI and programmatic consumers
- Bind-mount data directory for cross-host portability

**Non-Goals:**
- Multi-user / multi-tenant (this is a personal knowledge base)
- Authentication beyond optional API key (trust the network boundary)
- Streaming / WebSocket APIs (request-response is sufficient)
- Web UI (the API enables it later but we're not building one now)
- Clustering / horizontal scaling (single engine instance)
- Backward compatibility with v1 CLI flags or config format

## Decisions

### 1. Single process, single container for the engine

The engine runs as one FastAPI process serving both search and ingestion endpoints. The embedding model is loaded once at startup and shared across all request handlers.

**Why not separate ingestion/query services?** They share the same model in GPU memory and the same SQLite database. Splitting them means either duplicating the model (wasting VRAM) or adding IPC complexity. At personal-KB scale there's no reason to scale them independently.

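**Why not multiple workers/processes?** SQLite allows only one writer at a time, so extra writer processes add lock contention rather than throughput, and each uvicorn worker process would load its own copy of the model. A single uvicorn process with async handlers is sufficient. Ingestion is CPU/GPU-bound anyway, so parallelism happens within the batch, not across processes.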

### 2. Async ingestion with staging queue

Ingestion is fully asynchronous. The client uploads file bytes (or note text) to the engine, which writes them to a staging area and returns immediately with a 202. The engine processes the queue in the background.

**The flow:**
1. Client uploads file → engine writes to staging directory + creates a job record in SQLite → returns 202
2. Engine background worker picks queued jobs sequentially, processes them (Docling, chunking, embedding), and updates job status
3. Client can check progress via `kb jobs` if desired

**All content types use the same path.** Text notes are written to staging as files, same as PDFs. One ingestion pipeline, not two. This means notes get the same queue semantics, status tracking, and retry behaviour as documents.

**Client UX:**
- `kb add report.pdf` → "Queued: report.pdf" → exits immediately
- `kb add ~/documents/ --recursive` → "Queued: 47 files" → exits immediately
- `kb jobs` → shows queue with statuses (queued, processing, done, failed)
- `kb jobs <id>` → details for a specific job
- `--format json` on `kb add` includes job IDs in the response (for scripts / Claude Code)
- Upload failures (network error, file not found) error immediately at the client — distinct from processing failures which appear in `kb jobs`

**Why not synchronous?** A large PDF can take minutes to process. Holding an HTTP connection open for that long is fragile (timeouts, proxy limits, client disconnects losing progress). More importantly, the client's job is "get bytes to the engine" — it shouldn't wait for Docling to finish.

**Why not share a filesystem?** The client may not be on the same machine as the engine. Even when it is, mounting host paths into Docker creates permission headaches and breaks the clean separation. Multipart upload works identically whether the client is local or remote.

**Why not an external task queue (Celery, Redis)?** Overkill for single-user. A SQLite `jobs` table + an asyncio background task is sufficient and adds zero dependencies.
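The queue described in this decision can be sketched with only the standard library. This is a minimal illustration under stated assumptions, not the engine's actual code: the `jobs` column names, the `enqueue`/`claim_next`/`worker` helpers, and the `process` callback (standing in for Docling, chunking, and embedding) are all hypothetical.

```python
import asyncio
import sqlite3

# Hypothetical schema: one row per staged upload.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    staged_path TEXT NOT NULL,
    status      TEXT NOT NULL DEFAULT 'queued',  -- queued | processing | done | failed
    error       TEXT
);
"""


def enqueue(conn: sqlite3.Connection, staged_path: str) -> int:
    """Record a staged upload and return the job ID a 202 response would carry."""
    cur = conn.execute("INSERT INTO jobs (staged_path) VALUES (?)", (staged_path,))
    conn.commit()
    return cur.lastrowid


def claim_next(conn: sqlite3.Connection):
    """Mark the oldest queued job as processing; return (id, path) or None."""
    row = conn.execute(
        "SELECT id, staged_path FROM jobs WHERE status = 'queued' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE jobs SET status = 'processing' WHERE id = ?", (row[0],))
    conn.commit()
    return row


async def worker(conn, process, idle_sleep=0.5, stop_when_empty=False):
    """Process queued jobs sequentially, recording success or failure per job."""
    while True:
        job = claim_next(conn)
        if job is None:
            if stop_when_empty:
                return
            await asyncio.sleep(idle_sleep)
            continue
        job_id, path = job
        try:
            # Blocking work runs in a thread so the event loop keeps serving requests.
            await asyncio.to_thread(process, path)
            conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
        except Exception as exc:
            conn.execute(
                "UPDATE jobs SET status = 'failed', error = ? WHERE id = ?",
                (str(exc), job_id),
            )
        conn.commit()
```

In a real engine the worker would presumably be started once as a background task at startup (for example with `asyncio.create_task`), and `stop_when_empty` exists here only so the loop can be exercised in isolation.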

### 3. Go client with Cobra + minimal dependencies

The client is a Go binary using Cobra for CLI structure and `net/http` from the standard library. Configuration is a simple YAML file (`~/.kb/client.yaml`) storing the engine URL and optional API key.

**Why Go over Rust?** Faster to develop for a CLI tool, excellent cross-compilation, single binary output. Rust's safety guarantees aren't needed for an HTTP client.

**Why not a shell script / curl wrapper?** Structured output formatting, progress bars for uploads, proper error handling, tab completion. These are painful in shell.

### 4. API design: REST with JSON, versioned under /api/v1

```
# Search
POST   /api/v1/search              Search the knowledge base

# Ingestion (async)
POST   /api/v1/jobs                Upload file or note for ingestion (returns 202 + job ID)
GET    /api/v1/jobs                List ingestion jobs (with status filters)
GET    /api/v1/jobs/{id}           Get job details and progress

# Documents (already-ingested content)
GET    /api/v1/documents           List documents (with filters)
GET    /api/v1/documents/{id}      Get document details
DELETE /api/v1/documents/{id}      Remove a document
PUT    /api/v1/documents/{id}/tags Manage document tags

# Metadata
GET    /api/v1/tags                List tags
GET    /api/v1/status              Engine status (model, DB stats, GPU info)
POST   /api/v1/reindex             Re-embed all chunks
GET    /api/v1/health              Health check (for load balancers / monitoring)
```

Search is POST (not GET) because the query body can include structured filters and is more natural for JSON payloads. Ingestion uses multipart/form-data (file upload with metadata fields for tags, doc_type, etc.). Notes are submitted as a text field in the same multipart form — same endpoint, same queue.

**Why /api/v1 prefix?** Allows a reverse proxy to route cleanly and enables future API versioning without breaking clients.
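For programmatic consumers, a search call might be built as below. This is a sketch only: the body fields (`query`, `limit`, `tags`) and the `build_search_request` helper are assumptions for illustration, since the request schema is not specified here. The bearer header follows the API key approach noted under Open Questions.

```python
import json
import urllib.request


def build_search_request(engine_url, query, api_key="", limit=10, tags=None):
    """Build (but do not send) a POST /api/v1/search request with a JSON body."""
    body = {"query": query, "limit": limit}  # hypothetical field names
    if tags:
        body["tags"] = list(tags)
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=engine_url.rstrip("/") + "/api/v1/search",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it. Because the filters travel in the JSON body, nothing needs URL-encoding, which is the main practical benefit of POST here.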

### 5. SQLite with WAL mode

SQLite remains the storage engine. WAL (Write-Ahead Logging) mode is enabled at startup, which allows concurrent reads while a write is in progress. This is important because search queries shouldn't block during ingestion.
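Enabling WAL at startup could look like the following sketch. The `open_db` helper name and the 5000 ms busy timeout are assumptions, not values taken from the engine.

```python
import sqlite3


def open_db(path: str, busy_timeout_ms: int = 5000) -> sqlite3.Connection:
    """Open the database with WAL journaling and a busy timeout."""
    conn = sqlite3.connect(path)
    # The WAL setting is persistent: it is stored in the database file itself.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        raise RuntimeError(f"could not enable WAL, got {mode!r}")
    # Wait up to busy_timeout_ms for a lock instead of failing immediately.
    conn.execute(f"PRAGMA busy_timeout = {int(busy_timeout_ms)}")
    return conn
```

With two connections opened this way, a reader can run `SELECT` while the other connection holds an uncommitted write; the reader sees the last committed snapshot rather than blocking.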

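**Why not Postgres?** SQLite with sqlite-vec provides vector search and FTS5 in a single portable database file, with no extra service to deploy. The data directory is just files, so `rsync` is enough to move it. In WAL mode the database can have `-wal` and `-shm` sidecar files, so stop the engine (or checkpoint) before copying to make sure recent writes travel with it. At personal-KB scale, SQLite's limits are irrelevant.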

### 6. GPU abstraction at Docker layer only

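The Python engine code uses `torch.cuda.is_available()` and `device="auto"` and never branches on GPU vendor. PyTorch's ROCm builds expose AMD GPUs through the same `torch.cuda` API and the same `"cuda"` device string, so the check works unchanged on both. GPU vendor is determined entirely by: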
- Which base Docker image is used (nvidia/cuda vs rocm/pytorch)
- Which ORT package is installed (onnxruntime-gpu vs onnxruntime-rocm)

Two Dockerfiles: `Dockerfile.nvidia` and `Dockerfile.rocm`. The compose file selects one. Application code is identical.
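A vendor-neutral device resolver might look like this. It is a sketch under assumptions: `resolve_device` is a hypothetical helper, and failing loudly when `cuda` is requested but unavailable is one possible policy (falling back to CPU is another).

```python
def resolve_device(setting: str, gpu_available: bool) -> str:
    """Map a KB_DEVICE-style setting to a PyTorch device string.

    gpu_available is expected to come from torch.cuda.is_available(),
    which is also how ROCm builds of PyTorch report an AMD GPU.
    """
    setting = setting.strip().lower()
    if setting == "auto":
        return "cuda" if gpu_available else "cpu"
    if setting == "cpu":
        return "cpu"
    if setting == "cuda":
        if not gpu_available:
            raise RuntimeError("device 'cuda' requested but no GPU is visible to PyTorch")
        return "cuda"
    raise ValueError(f"unknown device setting: {setting!r}")
```

In the engine this would be called as something like `resolve_device(os.environ.get("KB_DEVICE", "auto"), torch.cuda.is_available())`. Nothing in it names a vendor, which is the point of this decision.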

### 7. Configuration hierarchy

**Engine** (configured via environment variables in compose.yaml):
- `KB_DATA_DIR`: bind mount path for SQLite + model cache
- `KB_MODEL`: embedding model name
- `KB_DEVICE`: embedding device (auto/cpu/cuda; on ROCm builds of PyTorch, `cuda` selects the AMD GPU)
- `KB_INGEST_DEVICE`: Docling device (auto/cpu/cuda)
- `KB_API_KEY`: optional API key for request authentication
- `KB_WORKERS`: uvicorn worker count (default 1; leave at 1, since Decision 1 assumes a single process)
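The engine variables listed above could be read into a typed object at startup, roughly as follows. Treating `KB_DATA_DIR` and `KB_MODEL` as required, and defaulting both devices to `auto`, are assumptions of this sketch; `EngineConfig` and `load_config` are hypothetical names.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineConfig:
    data_dir: str
    model: str
    device: str = "auto"
    ingest_device: str = "auto"
    api_key: str = ""
    workers: int = 1


def load_config(env=None) -> EngineConfig:
    """Build the engine config from environment variables, failing fast on gaps."""
    env = os.environ if env is None else env
    missing = [name for name in ("KB_DATA_DIR", "KB_MODEL") if not env.get(name)]
    if missing:
        raise RuntimeError("missing required settings: " + ", ".join(missing))
    return EngineConfig(
        data_dir=env["KB_DATA_DIR"],
        model=env["KB_MODEL"],
        device=env.get("KB_DEVICE", "auto"),
        ingest_device=env.get("KB_INGEST_DEVICE", "auto"),
        api_key=env.get("KB_API_KEY", ""),
        workers=int(env.get("KB_WORKERS", "1")),
    )
```

Accepting an explicit mapping keeps the loader testable without touching the real process environment.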

**Client** (configured via `~/.kb/client.yaml`):
```yaml
engine_url: http://localhost:8000
api_key: ""          # optional
default_format: human  # human or json
```

Overridable via env vars (`KB_ENGINE_URL`, `KB_API_KEY`) and CLI flags (`--engine`, `--format`).

### 8. Model loading strategy

The embedding model loads eagerly at engine startup, before the server accepts search or ingestion requests. The `/api/v1/health` endpoint reports unhealthy until the model is ready. This keeps the first query fast and surfaces load failures at startup rather than on the first request.

Docling's layout models load lazily on first PDF ingestion (they're large and not needed for search or non-PDF documents).
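The eager/lazy split can be sketched as a small registry. The loader callables, the `ModelRegistry` name, and the 200/503 status codes are assumptions for illustration; the real engine would wire this into its FastAPI startup and health handler.

```python
import threading


class ModelRegistry:
    """Holds the eagerly loaded embedder and the lazily loaded layout models."""

    def __init__(self, load_embedder, load_layout_models):
        self._load_embedder = load_embedder
        self._load_layout_models = load_layout_models
        self._embedder = None
        self._layout = None
        self._layout_lock = threading.Lock()

    def startup(self):
        """Eager path: call once at startup; an exception here fails the boot visibly."""
        self._embedder = self._load_embedder()

    @property
    def ready(self) -> bool:
        return self._embedder is not None

    def health(self):
        """Return (status_code, body) for the health endpoint."""
        if self.ready:
            return 200, {"status": "ok"}
        return 503, {"status": "loading"}

    def layout_models(self):
        """Lazy path: load on the first PDF, then reuse the same instance."""
        with self._layout_lock:
            if self._layout is None:
                self._layout = self._load_layout_models()
            return self._layout
```

The lock matters only if PDF processing can run on more than one thread; with a single sequential ingestion worker it is a cheap safeguard rather than a requirement.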

## Risks / Trade-offs

**[Engine unavailable = total outage]** → No offline fallback by design. Mitigation: the engine starts automatically via Docker restart policy (`restart: unless-stopped`). For truly offline use, the user can run the engine locally.

**[SQLite concurrent write contention during large ingestion]** → WAL mode handles read/write concurrency. For write/write (rare — single user), SQLite's built-in busy timeout is sufficient. Mitigation: engine serializes write operations internally.

**[AMD ROCm support is less tested than CUDA]** → RDNA 4 (9070 XT) is very new. Mitigation: build and test the ROCm image early. PyTorch ROCm support is mature for RDNA 3; RDNA 4 should work with ROCm 6.4+ but verify before committing to hardware purchase.

**[File upload size limits for large PDFs]** → Mitigation: enforce a maximum upload size in the engine (default 100MB, configurable) and set the reverse proxy limit to match (for nginx, `client_max_body_size`); uvicorn itself has no request body size setting. The Go client shows upload progress. Processing time is no longer a concern since the client disconnects after upload.
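One way the engine could enforce the upload size limit while writing to staging is to count bytes as they stream in. This is a sketch: `stage_upload`, `UploadTooLarge`, and the `.part`/`.upload` suffixes are hypothetical, and `stream` stands for whatever file-like object the web framework exposes for the uploaded part.

```python
import io
import os
import tempfile


class UploadTooLarge(Exception):
    """Raised when an upload exceeds the configured limit."""


def stage_upload(stream, staging_dir, max_bytes=100 * 1024 * 1024, chunk_size=1024 * 1024):
    """Copy an upload into staging in chunks, aborting once max_bytes is exceeded."""
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir, suffix=".part")
    written = 0
    try:
        with os.fdopen(fd, "wb") as out:
            while True:
                chunk = stream.read(chunk_size)
                if not chunk:
                    break
                written += len(chunk)
                if written > max_bytes:
                    raise UploadTooLarge(f"upload exceeds {max_bytes} bytes")
                out.write(chunk)
    except BaseException:
        # Never leave a partial file behind for the worker to pick up.
        os.unlink(tmp_path)
        raise
    final_path = tmp_path[: -len(".part")] + ".upload"
    os.replace(tmp_path, final_path)  # rename only after the file is complete
    return final_path
```

Renaming only after the last byte is written means a job record never points at a half-written file, even if the client disconnects mid-upload.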

**[Model warmup time on cold start]** → First startup takes ~5-10s to load the model. Mitigation: health endpoint + Docker restart policy means this only happens on container restart, not per-query. Log clearly during startup.

## Migration Plan

This is a clean-sheet build, not a migration from v1. However, the existing data format (v1 tables, embedding dimensions) is unchanged; the only schema addition is the `jobs` table from Decision 2, which is additive. An existing v1 `~/.kb/` directory can therefore be bind-mounted directly into the v2 engine container with no conversion needed.

**Rollout steps:**
1. Build engine + Go client
2. Test with existing v1 data directory
3. Deploy engine on target host
4. Distribute Go binary to client machines
5. Retire v1 Python CLI

**Rollback:** v1 CLI still works against the same SQLite database. No destructive changes to the data format.

## Open Questions

- **API key auth**: Simple bearer token is sufficient for a personal tool behind a reverse proxy. Decided.
- **Ingestion**: Async with job queue. Client uploads and exits. `kb jobs` for status. Decided.
- **Model hot-swap**: Should `POST /api/v1/reindex` with a new model name hot-swap the loaded model, or require an engine restart? Hot-swap is nicer UX but adds complexity. **Deferred** — start with restart-required, revisit if it becomes painful.
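For the API key item above, the bearer token check itself is small. A minimal sketch, assuming an empty `KB_API_KEY` disables authentication and that clients send `Authorization: Bearer <key>`; the `is_authorized` helper is hypothetical.

```python
import hmac


def is_authorized(authorization_header, expected_key: str) -> bool:
    """Check an Authorization header value against the configured API key."""
    if not expected_key:
        return True  # no key configured: trust the network boundary
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Constant-time comparison avoids leaking key contents through timing.
    return hmac.compare_digest(token.strip().encode("utf-8"), expected_key.encode("utf-8"))
```

A request handler would return 401 when this yields `False`.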