kb/openspec/changes/kb-v2-client-server/proposal.md
steve 2030976b85 Add GPU device control, Docker support, and v2 client-server design
- Add configurable device selection for embeddings (embedding.device) and
  Docling ingestion (ingestion.device) with env var overrides (KB_DEVICE,
  KB_INGEST_DEVICE) to control GPU/CPU usage per component
- Add `kb doctor` command for safe GPU diagnostics
- Add Dockerfile (NVIDIA CUDA) and compose.yaml for containerised GPU usage
- Add OpenSpec v2 change (kb-v2-client-server): proposal, design, specs, and
  tasks for client-server architecture with Go CLI, FastAPI engine, async
  ingestion queue, and GPU-vendor-agnostic Docker deployment
- Add uv.lock for reproducible installs
- Gitignore examples/ directory (test data only)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 20:17:31 +00:00


Why

Every kb CLI invocation loads the embedding model from scratch (~3-5 seconds) before executing even a simple query. This makes interactive use painfully slow and wastes GPU memory with redundant loads. The monolithic architecture also ties the CLI to heavy Python ML dependencies, prevents multi-client access, and couples GPU vendor choice (NVIDIA vs AMD) to every installation.

What Changes

  • Clean-sheet v2 architecture: designed as client-server from day one rather than refactored from v1
    • Engine: FastAPI server running in Docker, keeping the embedding model warm in GPU memory. Handles both ingestion and search via HTTP API
    • Client: Lightweight Go binary that talks to the engine over HTTP. No Python, no ML dependencies, instant startup
  • The kb CLI is the Go client only — all operations go through the engine API
  • GPU-vendor-agnostic Docker builds (NVIDIA CUDA and AMD ROCm targets)
  • Engine exposes a REST API suitable for reverse proxy / HTTPS termination
  • Data directory uses bind mounts for portability between hosts (e.g., WSL ingest → production server)
  • v1 Python CLI is retired — no dual-CLI maintenance burden
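
The "warm model" idea above can be sketched in a few lines. This is a minimal illustration, not the engine's actual code: `FakeEmbeddingModel` and `WarmEngine` are hypothetical names, the model is a toy stand-in, and the real engine would presumably load its model once at server startup rather than lazily.

```python
import time


class FakeEmbeddingModel:
    """Hypothetical stand-in for the real embedding model."""

    def __init__(self) -> None:
        # Simulate a slow load; the proposal cites roughly 3-5 s for the real one.
        time.sleep(0.01)

    def embed(self, text: str) -> list[float]:
        # Toy "embedding": character count and word count.
        return [float(len(text)), float(len(text.split()))]


class WarmEngine:
    """Loads the model on first use, then reuses it for every request."""

    def __init__(self, loader=FakeEmbeddingModel) -> None:
        self._loader = loader
        self._model = None
        self.load_count = 0  # how many times the model was actually loaded

    def _get_model(self):
        if self._model is None:
            self._model = self._loader()
            self.load_count += 1
        return self._model

    def search(self, query: str) -> list[float]:
        return self._get_model().embed(query)


if __name__ == "__main__":
    engine = WarmEngine()
    for q in ("gpu setup", "docker compose", "rocm"):
        engine.search(q)
    print(f"requests=3 model_loads={engine.load_count}")
```

In v1 each CLI invocation pays the load cost. Here the cost is paid once per engine process, however many requests follow. The sketch is not thread-safe; a server handling concurrent requests would need to guard the first load or perform it before accepting traffic.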

Capabilities

New Capabilities

  • engine-api: REST API server (FastAPI) exposing search, ingestion, document management, and status endpoints. Keeps embedding model resident in memory. Handles all DB and GPU operations
  • go-client: Lightweight Go CLI that communicates with the engine API over HTTP. Provides the same user-facing commands as v1 (init, add, search, list, info, remove, tags, status, config) without any ML dependencies
  • docker-deployment: GPU-vendor-agnostic Docker packaging with separate NVIDIA (CUDA) and AMD (ROCm) build targets. Bind-mount data volumes for host portability. Compose files for single-command deployment
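
To make the client/engine split concrete, the sketch below runs a stub engine and a thin client in one process. Everything specific in it is an assumption for illustration: the `/status` and `/search` paths, the `q` parameter, and the JSON shapes are hypothetical and not taken from the engine-api spec. It uses only the Python standard library to stay self-contained; the real engine is FastAPI and the real client is the Go binary.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlencode, urlparse
from urllib.request import ProxyHandler, build_opener

# Bypass any system proxy so requests to 127.0.0.1 go direct.
_opener = build_opener(ProxyHandler({}))


class StubEngineHandler(BaseHTTPRequestHandler):
    """Stub standing in for the engine; endpoints are hypothetical."""

    def do_GET(self):
        url = urlparse(self.path)
        if url.path == "/status":
            self._send_json(200, {"status": "ok"})
        elif url.path == "/search":
            query = parse_qs(url.query).get("q", [""])[0]
            self._send_json(200, {"query": query, "results": []})
        else:
            self._send_json(404, {"error": "not found"})

    def _send_json(self, code, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def start_stub_engine() -> HTTPServer:
    # Port 0 asks the OS for any free port.
    server = HTTPServer(("127.0.0.1", 0), StubEngineHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def engine_get(base_url: str, path: str) -> dict:
    """Thin client: one HTTP GET, JSON in the response, no ML imports."""
    with _opener.open(base_url + path, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    server = start_stub_engine()
    base = f"http://127.0.0.1:{server.server_address[1]}"
    print(engine_get(base, "/status"))
    print(engine_get(base, "/search?" + urlencode({"q": "gpu setup"})))
    server.shutdown()
    server.server_close()
```

The client side needs nothing beyond an HTTP library and a JSON parser, which is what lets the Go binary start instantly and run on machines with no GPU stack installed.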

Modified Capabilities

Impact

  • Code: v2 is a new codebase. Python engine built fresh around FastAPI (reusing v1's proven core logic for search, embeddings, database, and ingestion where appropriate). Go client is entirely new. v1 cli.py is not carried forward
  • APIs: New HTTP REST API (JSON). This becomes the primary integration surface, replacing direct Python imports for consumers such as Claude Code skills
  • Dependencies: Go toolchain added for client build. Python side adds fastapi + uvicorn. Heavy ML deps (torch, sentence-transformers, docling) contained entirely within the Docker image
  • Systems: Docker + NVIDIA Container Toolkit (or ROCm equivalent) required on engine host. Client machines need only the Go binary and network access to the engine
  • Data: The SQLite database and Hugging Face (HF) model cache keep their existing formats. The bind-mount directory structure must be documented for cross-host migration