Add GPU device control, Docker support, and v2 client-server design

- Add configurable device selection for embeddings (embedding.device) and Docling ingestion (ingestion.device), with env var overrides (KB_DEVICE, KB_INGEST_DEVICE), to control GPU/CPU usage per component
- Add `kb doctor` command for safe GPU diagnostics
- Add Dockerfile (NVIDIA CUDA) and compose.yaml for containerised GPU usage
- Add OpenSpec v2 change (kb-v2-client-server): proposal, design, specs, and tasks for client-server architecture with Go CLI, FastAPI engine, async ingestion queue, and GPU-vendor-agnostic Docker deployment
- Add uv.lock for reproducible installs
- Gitignore examples/ directory (test data only)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 20:17:31 +00:00
parent f245c24928
commit 2030976b85
20 changed files with 4321 additions and 29 deletions
@@ -0,0 +1,94 @@
+## ADDED Requirements
+
+### Requirement: NVIDIA CUDA Docker image
+
+The project SHALL provide a `Dockerfile.nvidia` that builds the engine on an NVIDIA CUDA runtime base image with GPU support for PyTorch and ONNX Runtime.
+
+#### Scenario: Build NVIDIA image
+- **WHEN** an admin runs `docker compose -f compose.nvidia.yaml build`
+- **THEN** the build SHALL produce a working image with CUDA runtime, PyTorch with CUDA support, onnxruntime-gpu, and all engine dependencies
+
+#### Scenario: GPU access in NVIDIA container
+- **WHEN** the NVIDIA container starts with `--gpus all` or the NVIDIA runtime
+- **THEN** `torch.cuda.is_available()` SHALL return `True` and the engine SHALL load the embedding model on the GPU
+
+---
+
+### Requirement: AMD ROCm Docker image
+
+The project SHALL provide a `Dockerfile.rocm` that builds the engine on an AMD ROCm base image with GPU support for PyTorch and ONNX Runtime.
+
+#### Scenario: Build ROCm image
+- **WHEN** an admin runs `docker compose -f compose.rocm.yaml build`
+- **THEN** the build SHALL produce a working image with ROCm runtime, PyTorch with ROCm support, onnxruntime-rocm, and all engine dependencies
+
+#### Scenario: GPU access in ROCm container
+- **WHEN** the ROCm container starts with `--device=/dev/kfd --device=/dev/dri`
+- **THEN** `torch.cuda.is_available()` SHALL return `True` (via HIP) and the engine SHALL load the embedding model on the GPU
+
+---
+
+### Requirement: Application code is GPU-vendor-agnostic
+
+The Python engine code SHALL NOT reference CUDA or ROCm directly. GPU vendor abstraction SHALL be handled entirely at the Docker image level (base image selection and pip package choice). The same application code SHALL run on both NVIDIA and AMD images without modification.
+
+#### Scenario: Same engine code on both platforms
+- **WHEN** the engine starts on an NVIDIA image and an AMD image with identical configuration
+- **THEN** both SHALL load the model, accept requests, and return identical search results for the same query and data
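A minimal sketch of what vendor-agnostic device detection could look like, assuming the engine relies on PyTorch's `torch.cuda` API on both images. ROCm builds of PyTorch report HIP devices through the same `torch.cuda` namespace, so a single check can cover both vendors. The function name `detect_device` is hypothetical and not taken from the engine code.

```python
def detect_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu".

    No vendor branch is needed: ROCm builds of PyTorch expose HIP devices
    through torch.cuda, and the device string stays "cuda" on AMD hardware.
    """
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(detect_device())
```

Under this assumption, the choice between CUDA and ROCm lives only in the image's installed PyTorch build, which is what the requirement asks for.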
+
+---
+
+### Requirement: Bind-mount data directory
+
+The engine SHALL store all persistent state (SQLite database, HF model cache, staging directory) under a single configurable data directory. This directory SHALL be mounted from the host via bind mount.
+
+#### Scenario: Data directory structure
+- **WHEN** the engine starts for the first time
+- **THEN** it SHALL create the following structure under the data directory:
+  - `kb.db` — SQLite database
+  - `hf_cache/` — HuggingFace model cache
+  - `staging/` — temporary files for queued ingestion jobs
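The first-start layout above could be created along these lines. This is a sketch only: the helper name `init_data_dir` is hypothetical, and the database schema is left out because the spec does not describe it.

```python
import sqlite3
from pathlib import Path


def init_data_dir(data_dir: Path) -> Path:
    """Create the data directory layout if it is missing; return the DB path."""
    data_dir.mkdir(parents=True, exist_ok=True)
    (data_dir / "hf_cache").mkdir(exist_ok=True)  # HuggingFace model cache
    (data_dir / "staging").mkdir(exist_ok=True)   # files for queued ingestion jobs
    db_path = data_dir / "kb.db"
    # Opening a connection creates an empty SQLite file when none exists yet.
    sqlite3.connect(db_path).close()
    return db_path
```

Calling the helper a second time is harmless, so it can run on every start rather than only on the first one.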
+
+#### Scenario: Portable data across hosts
+- **WHEN** an admin copies the data directory from Host A to Host B and starts the engine with the same bind mount path
+- **THEN** the engine SHALL start successfully and serve all previously ingested documents without reprocessing
+
+#### Scenario: Portable data across GPU vendors
+- **WHEN** an admin moves the data directory from an NVIDIA host to an AMD host (same model name)
+- **THEN** the engine SHALL start successfully, and the embeddings already in the database remain valid because they depend on the model, not on the GPU vendor
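Because portability hinges on the model name staying the same, one way to guard it is to record the model name next to the data and compare it on start. The sketch below is purely illustrative: the `meta` table, the `embedding_model` key, and the function `model_matches` are assumptions, not part of the specified schema.

```python
import sqlite3


def model_matches(conn: sqlite3.Connection, configured_model: str) -> bool:
    """Record the model name on first use; afterwards report whether it matches."""
    # Hypothetical key/value table; the real schema is not described in the spec.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS meta (key TEXT PRIMARY KEY, value TEXT)"
    )
    row = conn.execute(
        "SELECT value FROM meta WHERE key = 'embedding_model'"
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO meta (key, value) VALUES ('embedding_model', ?)",
            (configured_model,),
        )
        conn.commit()
        return True
    return row[0] == configured_model
```

A mismatch would indicate that stored embeddings came from a different model, which is the only case where a move between hosts should require reprocessing.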
+
+---
+
+### Requirement: Compose files for deployment
+
+The project SHALL provide Docker Compose files for single-command deployment.
+
+#### Scenario: Start NVIDIA deployment
+- **WHEN** an admin runs `docker compose -f compose.nvidia.yaml up -d`
+- **THEN** the engine SHALL start with GPU access, bind-mount the data directory, and be reachable on the configured port
+
+#### Scenario: Start ROCm deployment
+- **WHEN** an admin runs `docker compose -f compose.rocm.yaml up -d`
+- **THEN** the engine SHALL start with GPU access via ROCm device passthrough, bind-mount the data directory, and be reachable on the configured port
+
+#### Scenario: Automatic restart
+- **WHEN** the engine process crashes or the host reboots
+- **THEN** Docker SHALL automatically restart the container (restart policy `unless-stopped`)
+
+#### Scenario: Configure via environment
+- **WHEN** an admin sets environment variables in the compose file (KB_MODEL, KB_API_KEY, KB_DEVICE, etc.)
+- **THEN** the engine SHALL use those values
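Reading these variables could be as simple as the sketch below. The `EngineSettings` class, the `load_settings` function, and the `"auto"` default are assumptions for illustration; only the variable names come from this change.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EngineSettings:
    model: Optional[str]
    api_key: Optional[str]
    device: str
    ingest_device: str


def load_settings(env: Mapping[str, str] = os.environ) -> EngineSettings:
    """Build settings from environment variables, with assumed defaults."""
    # "auto" is an assumed default meaning "use a GPU if one is present".
    return EngineSettings(
        model=env.get("KB_MODEL"),
        api_key=env.get("KB_API_KEY"),
        device=env.get("KB_DEVICE", "auto").lower(),
        ingest_device=env.get("KB_INGEST_DEVICE", "auto").lower(),
    )
```

Passing the mapping as a parameter keeps the function easy to test without touching the real process environment.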
+
+---
+
+### Requirement: CPU-only fallback
+
+The Dockerfiles SHALL produce images that work without GPU access. If no GPU is available, the engine SHALL fall back to CPU for all operations.
+
+#### Scenario: No GPU available
+- **WHEN** the container starts without GPU passthrough (no `--gpus`, no `/dev/kfd`)
+- **THEN** the engine SHALL detect no GPU, load the model on CPU, and log a warning that GPU acceleration is unavailable
+
+#### Scenario: Explicit CPU mode
+- **WHEN** `KB_DEVICE=cpu` and `KB_INGEST_DEVICE=cpu` are set in the environment
+- **THEN** the engine SHALL use CPU regardless of GPU availability
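The two fallback scenarios can be sketched as a single resolution step. The function `resolve_device`, the logger name, and the `"auto"` value are hypothetical; the spec only fixes the behaviour for `cpu` and for a missing GPU.

```python
import logging

logger = logging.getLogger("kb.engine")  # hypothetical logger name


def resolve_device(requested: str, gpu_available: bool) -> str:
    """Map a requested device ("cpu", "cuda", or "auto") to the one to use."""
    if requested.lower() == "cpu":
        # Explicit CPU mode wins regardless of GPU availability.
        return "cpu"
    if gpu_available:
        return "cuda"
    logger.warning("GPU acceleration is unavailable; falling back to CPU")
    return "cpu"
```

For example, `resolve_device("cpu", True)` yields `"cpu"`, while `resolve_device("auto", False)` logs the warning and also yields `"cpu"`. The same function could be applied separately to the KB_DEVICE and KB_INGEST_DEVICE values.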