kb/openspec/changes/archive/2026-03-25-kb-v2-client-server/proposal.md

## Why

Every `kb` CLI invocation loads the embedding model from scratch (roughly 3-5 seconds) before executing even a simple query. This makes interactive use slow, and every load claims GPU memory again for a model that could have stayed resident. The monolithic architecture also ties the CLI to heavy Python ML dependencies, prevents multi-client access, and couples GPU vendor choice (NVIDIA vs AMD) to every installation.

## What Changes

- Clean-sheet v2 architecture: a new codebase designed for client-server from day one, not a refactor of the v1 CLI
  - **Engine**: FastAPI server running in Docker, keeping the embedding model warm in GPU memory. Handles both ingestion and search via an HTTP API
  - **Client**: Lightweight Go binary that talks to the engine over HTTP. No Python, no ML dependencies, instant startup
- The `kb` CLI is the Go client only — all operations go through the engine API
- GPU-vendor-agnostic Docker builds (NVIDIA CUDA and AMD ROCm targets)
- Engine exposes a REST API suitable for reverse proxy / HTTPS termination
- Data directory uses bind mounts for portability between hosts (e.g., WSL ingest → production server)
- v1 Python CLI is retired — no dual-CLI maintenance burden
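
The engine/client split above can be sketched in a few lines. This is a minimal illustration, not the proposed implementation: it uses Python's standard-library HTTP server in place of FastAPI, a `FakeEmbeddingModel` in place of the real model, and a hypothetical `/search` route and JSON shape. The real client would be the Go binary. The point it shows is that the expensive load happens once per engine process, while each client call is a single JSON request.

```python
import json
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class FakeEmbeddingModel:
    """Stand-in for the real embedding model (assumption for this sketch)."""

    load_count = 0  # how many times the expensive load ran in this process

    def __init__(self, load_seconds: float = 0.05):
        time.sleep(load_seconds)  # placeholder for the multi-second model load
        FakeEmbeddingModel.load_count += 1

    def embed(self, text: str) -> list[float]:
        # Toy "embedding": text length and vowel count.
        return [float(len(text)), float(sum(c in "aeiou" for c in text.lower()))]


def start_engine(port: int = 0) -> ThreadingHTTPServer:
    """Load the model once, then serve requests from a background thread."""
    model = FakeEmbeddingModel()  # stays resident for the life of the server

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/search":  # hypothetical route name
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length) or b"{}")
            query = payload.get("query", "")
            body = json.dumps({"query": query, "vector": model.embed(query)}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):  # keep the sketch quiet
            pass

    server = ThreadingHTTPServer(("127.0.0.1", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def search(base_url: str, query: str) -> dict:
    """Thin client: no model and no ML imports, just one JSON POST."""
    request = urllib.request.Request(
        f"{base_url}/search",
        data=json.dumps({"query": query}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)
```

Calling `start_engine()` and then `search(...)` several times against the returned server's port leaves `FakeEmbeddingModel.load_count` at 1, which is the behavior the v2 engine is meant to provide for the real model.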

## Capabilities

### New Capabilities
- `engine-api`: REST API server (FastAPI) exposing search, ingestion, document management, and status endpoints. Keeps the embedding model resident in GPU memory. Handles all DB and GPU operations
- `go-client`: Lightweight Go CLI that communicates with the engine API over HTTP. Provides the same user-facing commands as v1 (init, add, search, list, info, remove, tags, status, config) without any ML dependencies
- `docker-deployment`: GPU-vendor-agnostic Docker packaging with separate NVIDIA (CUDA) and AMD (ROCm) build targets. Bind-mount data volumes for host portability. Compose files for single-command deployment

### Modified Capabilities
<!-- No existing specs to modify — greenfield OpenSpec setup -->

## Impact

- **Code**: v2 is a new codebase. Python engine built fresh around FastAPI (reusing v1's proven core logic for search, embeddings, database, and ingestion where appropriate). Go client is entirely new. v1 `cli.py` is not carried forward
- **APIs**: New HTTP REST API (JSON). This is the primary integration surface going forward (replaces direct Python imports for Claude Code skills etc.)
- **Dependencies**: Go toolchain added for client build. Python side adds `fastapi` + `uvicorn`. Heavy ML deps (torch, sentence-transformers, docling) contained entirely within the Docker image
- **Systems**: Docker + NVIDIA Container Toolkit (or ROCm equivalent) required on engine host. Client machines need only the Go binary and network access to the engine
- **Data**: SQLite database and HF model cache unchanged in format. Bind-mount directory structure must be documented for cross-host migration
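
Because the SQLite file moves between hosts with the bind-mounted directory, a quick integrity check after copying can catch a truncated or corrupted transfer before the engine starts. The following is a minimal sketch using only the standard library. The function name `check_sqlite` is hypothetical, and the database path is whatever the documented directory layout ends up specifying.

```python
import sqlite3
from pathlib import Path


def check_sqlite(db_path) -> bool:
    """Return True if SQLite reports the database file as intact."""
    path = Path(db_path)
    if not path.is_file():
        return False
    # Open read-only so the check never modifies the migrated database.
    uri = path.resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError:
        # Raised, for example, when the file is not a SQLite database at all.
        return False
    finally:
        conn.close()
    # A healthy database yields exactly one row containing "ok".
    return rows == [("ok",)]
```

Running this on the target host after the copy, and before starting the engine container, keeps the check independent of the engine itself.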