## Why
Every `kb` CLI invocation loads the embedding model from scratch (roughly 3-5 seconds) before executing even a simple query. This makes interactive use slow, and every invocation repeats the same load into GPU memory. The monolithic architecture also ties the CLI to heavy Python ML dependencies, prevents multi-client access, and forces every installation to commit to a GPU vendor (NVIDIA or AMD).
## What Changes
- Clean-sheet v2 architecture: a new codebase designed for client-server from day one, not a refactor of v1
- **Engine**: FastAPI server running in Docker, keeping the embedding model warm in GPU memory. Handles both ingestion and search via HTTP API
- **Client**: Lightweight Go binary that talks to the engine over HTTP. No Python, no ML dependencies, instant startup
- The `kb` CLI is the Go client only — all operations go through the engine API
- GPU-vendor-agnostic Docker builds (NVIDIA CUDA and AMD ROCm targets)
- Engine exposes a REST API suitable for reverse proxy / HTTPS termination
- Data directory uses bind mounts for portability between hosts (e.g., WSL ingest → production server)
- v1 Python CLI is retired — no dual-CLI maintenance burden
## Capabilities
### New Capabilities
- `engine-api`: REST API server (FastAPI) exposing search, ingestion, document management, and status endpoints. Keeps embedding model resident in memory. Handles all DB and GPU operations
- `go-client`: Lightweight Go CLI that communicates with the engine API over HTTP. Provides the same user-facing commands as v1 (init, add, search, list, info, remove, tags, status, config) without any ML dependencies
- `docker-deployment`: GPU-vendor-agnostic Docker packaging with separate NVIDIA (CUDA) and AMD (ROCm) build targets. Bind-mount data volumes for host portability. Compose files for single-command deployment
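
One way to picture the relationship between `go-client` and `engine-api` is a table that maps each user-facing command to a single HTTP call. The sketch below is in Python for brevity (the client itself is proposed in Go), and every method and path in it is a hypothetical placeholder: the proposal names the commands but does not define routes.

```python
# Hypothetical command-to-endpoint table. Only the command names come from
# the proposal; the HTTP methods and paths are illustrative assumptions.
ROUTES = {
    "init": ("POST", "/init"),
    "add": ("POST", "/documents"),
    "search": ("POST", "/search"),
    "list": ("GET", "/documents"),
    "info": ("GET", "/documents/{id}"),
    "remove": ("DELETE", "/documents/{id}"),
    "tags": ("GET", "/tags"),
    "status": ("GET", "/status"),
    "config": ("GET", "/config"),
}


def resolve(command, **params):
    """Return the (method, path) pair a thin client would send for a command."""
    method, template = ROUTES[command]
    return method, template.format(**params)
```

For example, `resolve("remove", id=42)` yields `("DELETE", "/documents/42")`. Keeping the client to a lookup plus an HTTP call is what allows it to ship without any ML dependencies.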
### Modified Capabilities
<!-- No existing specs to modify — greenfield OpenSpec setup -->
## Impact
- **Code**: v2 is a new codebase. Python engine built fresh around FastAPI (reusing v1's proven core logic for search, embeddings, database, and ingestion where appropriate). Go client is entirely new. v1 `cli.py` is not carried forward
- **APIs**: New HTTP REST API (JSON). This is the primary integration surface going forward (replaces direct Python imports for Claude Code skills etc.)
- **Dependencies**: Go toolchain added for client build. Python side adds `fastapi` + `uvicorn`. Heavy ML deps (torch, sentence-transformers, docling) contained entirely within the Docker image
- **Systems**: Docker + NVIDIA Container Toolkit (or ROCm equivalent) required on engine host. Client machines need only the Go binary and network access to the engine
- **Data**: SQLite database and HF model cache unchanged in format. Bind-mount directory structure must be documented for cross-host migration
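
Because the data directory is meant to move between hosts, a pre-flight check before or after copying it may be useful. The following is a minimal sketch, assuming a layout with a SQLite file named `kb.db` and a model cache directory named `hf-cache`. Both names are hypothetical, since the bind-mount layout is still to be documented.

```python
import sqlite3
from pathlib import Path


def check_data_dir(data_dir, db_name="kb.db", cache_name="hf-cache"):
    """Return a list of problems found in a bind-mounted data directory.

    db_name and cache_name are assumed names, not taken from the proposal.
    """
    root = Path(data_dir)
    problems = []

    db_path = root / db_name
    if not db_path.is_file():
        problems.append(f"missing database: {db_path}")
    else:
        # Open read-only so the check never modifies the database.
        uri = db_path.resolve().as_uri() + "?mode=ro"
        conn = sqlite3.connect(uri, uri=True)
        try:
            result = conn.execute("PRAGMA integrity_check").fetchone()[0]
            if result != "ok":
                problems.append(f"integrity check failed: {result}")
        except sqlite3.DatabaseError as exc:
            problems.append(f"unreadable database: {exc}")
        finally:
            conn.close()

    if not (root / cache_name).is_dir():
        problems.append(f"missing model cache: {root / cache_name}")

    return problems
```

An empty list means the directory passed these basic checks. It does not prove the model cache contents are complete.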