Add GPU device control, Docker support, and v2 client-server design

- Add configurable device selection for embeddings (embedding.device) and Docling ingestion (ingestion.device) with env var overrides (KB_DEVICE, KB_INGEST_DEVICE) to control GPU/CPU usage per component
- Add `kb doctor` command for safe GPU diagnostics
- Add Dockerfile (NVIDIA CUDA) and compose.yaml for containerised GPU usage
- Add OpenSpec v2 change (kb-v2-client-server): proposal, design, specs, and tasks for client-server architecture with Go CLI, FastAPI engine, async ingestion queue, and GPU-vendor-agnostic Docker deployment
- Add uv.lock for reproducible installs
- Gitignore examples/ directory (test data only)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 20:17:31 +00:00
parent f245c24928
commit 2030976b85
20 changed files with 4321 additions and 29 deletions
@@ -0,0 +1,106 @@
+---
+name: "OPSX: Propose"
+description: Propose a new change - create it and generate all artifacts in one step
+category: Workflow
+tags: [workflow, artifacts, experimental]
+---
+
+Propose a new change - create the change and generate all artifacts in one step.
+
+I'll create a change with artifacts:
+- proposal.md (what & why)
+- design.md (how)
+- tasks.md (implementation steps)
+
+When ready to implement, run /opsx:apply
+
+---
+
+**Input**: The argument after `/opsx:propose` is the change name (kebab-case), OR a description of what the user wants to build.
+
+**Steps**
+
+1. **If no input provided, ask what they want to build**
+
+   Use the **AskUserQuestion tool** (open-ended, no preset options) to ask:
+   > "What change do you want to work on? Describe what you want to build or fix."
+
+   From their description, derive a kebab-case name (e.g., "add user authentication" → `add-user-auth`).
+
+   **IMPORTANT**: Do NOT proceed without understanding what the user wants to build.
+
+2. **Create the change directory**
+   ```bash
+   openspec new change "<name>"
+   ```
+   This creates a scaffolded change at `openspec/changes/<name>/` with `.openspec.yaml`.
+
+3. **Get the artifact build order**
+   ```bash
+   openspec status --change "<name>" --json
+   ```
+   Parse the JSON to get:
+   - `applyRequires`: array of artifact IDs needed before implementation (e.g., `["tasks"]`)
+   - `artifacts`: list of all artifacts with their status and dependencies
+
+4. **Create artifacts in sequence until apply-ready**
+
+   Use the **TodoWrite tool** to track progress through the artifacts.
+
+   Loop through artifacts in dependency order (artifacts with no pending dependencies first):
+
+   a. **For each artifact that is `ready` (dependencies satisfied)**:
+      - Get instructions:
+        ```bash
+        openspec instructions <artifact-id> --change "<name>" --json
+        ```
+      - The instructions JSON includes:
+        - `context`: Project background (constraints for you - do NOT include in output)
+        - `rules`: Artifact-specific rules (constraints for you - do NOT include in output)
+        - `template`: The structure to use for your output file
+        - `instruction`: Schema-specific guidance for this artifact type
+        - `outputPath`: Where to write the artifact
+        - `dependencies`: Completed artifacts to read for context
+      - Read any completed dependency files for context
+      - Create the artifact file using `template` as the structure
+      - Apply `context` and `rules` as constraints - but do NOT copy them into the file
+      - Show brief progress: "Created <artifact-id>"
+
+   b. **Continue until all `applyRequires` artifacts are complete**
+      - After creating each artifact, re-run `openspec status --change "<name>" --json`
+      - Check if every artifact ID in `applyRequires` has `status: "done"` in the artifacts array
+      - Stop when all `applyRequires` artifacts are done
+
+   c. **If an artifact requires user input** (unclear context):
+      - Use **AskUserQuestion tool** to clarify
+      - Then continue with creation
+
+5. **Show final status**
+   ```bash
+   openspec status --change "<name>"
+   ```
+
+**Output**
+
+After completing all artifacts, summarize:
+- Change name and location
+- List of artifacts created with brief descriptions
+- What's ready: "All artifacts created! Ready for implementation."
+- Prompt: "Run `/opsx:apply` to start implementing."
+
+**Artifact Creation Guidelines**
+
+- Follow the `instruction` field from `openspec instructions` for each artifact type
+- The schema defines what each artifact should contain - follow it
+- Read dependency artifacts for context before creating new ones
+- Use `template` as the structure for your output file - fill in its sections
+- **IMPORTANT**: `context` and `rules` are constraints for YOU, not content for the file
+  - Do NOT copy `<context>`, `<rules>`, `<project_context>` blocks into the artifact
+  - These guide what you write, but should never appear in the output
+
+**Guardrails**
+- Create ALL artifacts needed for implementation (as defined by schema's `apply.requires`)
+- Always read dependency artifacts before creating a new one
+- If context is critically unclear, ask the user - but prefer making reasonable decisions to keep momentum
+- If a change with that name already exists, ask if user wants to continue it or create a new one
+- Verify each artifact file exists after writing before proceeding to next
@@ -0,0 +1,110 @@
+---
+name: openspec-propose
+description: Propose a new change with all artifacts generated in one step. Use when the user wants to quickly describe what they want to build and get a complete proposal with design, specs, and tasks ready for implementation.
+license: MIT
+compatibility: Requires openspec CLI.
+metadata:
+  author: openspec
+  version: "1.0"
+  generatedBy: "1.2.0"
+---
+
+Propose a new change - create the change and generate all artifacts in one step.
+
+I'll create a change with artifacts:
+- proposal.md (what & why)
+- design.md (how)
+- tasks.md (implementation steps)
+
+When ready to implement, run /opsx:apply
+
+---
+
+**Input**: The user's request should include a change name (kebab-case) OR a description of what they want to build.
+
+**Steps**
+
+1. **If no clear input provided, ask what they want to build**
+
+   Use the **AskUserQuestion tool** (open-ended, no preset options) to ask:
+   > "What change do you want to work on? Describe what you want to build or fix."
+
+   From their description, derive a kebab-case name (e.g., "add user authentication" → `add-user-auth`).
+
+   **IMPORTANT**: Do NOT proceed without understanding what the user wants to build.
+
+2. **Create the change directory**
+   ```bash
+   openspec new change "<name>"
+   ```
+   This creates a scaffolded change at `openspec/changes/<name>/` with `.openspec.yaml`.
+
+3. **Get the artifact build order**
+   ```bash
+   openspec status --change "<name>" --json
+   ```
+   Parse the JSON to get:
+   - `applyRequires`: array of artifact IDs needed before implementation (e.g., `["tasks"]`)
+   - `artifacts`: list of all artifacts with their status and dependencies
+
+4. **Create artifacts in sequence until apply-ready**
+
+   Use the **TodoWrite tool** to track progress through the artifacts.
+
+   Loop through artifacts in dependency order (artifacts with no pending dependencies first):
+
+   a. **For each artifact that is `ready` (dependencies satisfied)**:
+      - Get instructions:
+        ```bash
+        openspec instructions <artifact-id> --change "<name>" --json
+        ```
+      - The instructions JSON includes:
+        - `context`: Project background (constraints for you - do NOT include in output)
+        - `rules`: Artifact-specific rules (constraints for you - do NOT include in output)
+        - `template`: The structure to use for your output file
+        - `instruction`: Schema-specific guidance for this artifact type
+        - `outputPath`: Where to write the artifact
+        - `dependencies`: Completed artifacts to read for context
+      - Read any completed dependency files for context
+      - Create the artifact file using `template` as the structure
+      - Apply `context` and `rules` as constraints - but do NOT copy them into the file
+      - Show brief progress: "Created <artifact-id>"
+
+   b. **Continue until all `applyRequires` artifacts are complete**
+      - After creating each artifact, re-run `openspec status --change "<name>" --json`
+      - Check if every artifact ID in `applyRequires` has `status: "done"` in the artifacts array
+      - Stop when all `applyRequires` artifacts are done
+
+   c. **If an artifact requires user input** (unclear context):
+      - Use **AskUserQuestion tool** to clarify
+      - Then continue with creation
+
+5. **Show final status**
+   ```bash
+   openspec status --change "<name>"
+   ```
+
+**Output**
+
+After completing all artifacts, summarize:
+- Change name and location
+- List of artifacts created with brief descriptions
+- What's ready: "All artifacts created! Ready for implementation."
+- Prompt: "Run `/opsx:apply` or ask me to implement to start working on the tasks."
+
+**Artifact Creation Guidelines**
+
+- Follow the `instruction` field from `openspec instructions` for each artifact type
+- The schema defines what each artifact should contain - follow it
+- Read dependency artifacts for context before creating new ones
+- Use `template` as the structure for your output file - fill in its sections
+- **IMPORTANT**: `context` and `rules` are constraints for YOU, not content for the file
+  - Do NOT copy `<context>`, `<rules>`, `<project_context>` blocks into the artifact
+  - These guide what you write, but should never appear in the output
+
+**Guardrails**
+- Create ALL artifacts needed for implementation (as defined by schema's `apply.requires`)
+- Always read dependency artifacts before creating a new one
+- If context is critically unclear, ask the user - but prefer making reasonable decisions to keep momentum
+- If a change with that name already exists, ask if user wants to continue it or create a new one
+- Verify each artifact file exists after writing before proceeding to next
@@ -0,0 +1,7 @@
+.venv/
+.git/
+__pycache__/
+*.pyc
+.pytest_cache/
+*.egg-info/
+openspec/
@@ -5,3 +5,4 @@ __pycache__/
 dist/
 build/
 .eggs/
+examples/
@@ -0,0 +1,44 @@
+FROM nvidia/cuda:13.0.1-runtime-ubuntu24.04
+
+ENV DEBIAN_FRONTEND=noninteractive
+
+# System deps for docling (poppler for PDF, build tools for native wheels)
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    python3.12 python3.12-venv python3.12-dev python3-pip \
+    libpoppler-cpp-dev poppler-utils \
+    libgl1 libglib2.0-0 \
+    build-essential \
+    curl \
+    && rm -rf /var/lib/apt/lists/*
+
+# Install uv for fast dependency resolution
+COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
+
+WORKDIR /app
+
+# Copy project files
+COPY pyproject.toml uv.lock ./
+COPY src/ src/
+
+# Create venv, install deps, overlay onnxruntime-gpu
+RUN uv venv .venv && \
+    . .venv/bin/activate && \
+    uv sync && \
+    uv pip install --no-deps onnxruntime-gpu
+
+# Put venv on PATH
+ENV PATH="/app/.venv/bin:$PATH"
+ENV VIRTUAL_ENV="/app/.venv"
+
+# GPU enabled by default in the container
+ENV KB_DEVICE=auto
+ENV KB_INGEST_DEVICE=auto
+
+# Model cache persisted via volume
+ENV HF_HOME=/data/hf_cache
+ENV KB_DATA_DIR=/data/kb
+
+VOLUME ["/data"]
+
+ENTRYPOINT ["kb"]
+CMD ["--help"]
@@ -0,0 +1,20 @@
+services:
+  kb:
+    build: .
+    runtime: nvidia
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: 1
+              capabilities: [gpu]
+    volumes:
+      - kb-data:/data
+      - ./examples:/examples:ro
+    # Pass a subcommand to replace the default CMD (--help); the `kb` entrypoint stays:
+    #   docker compose run kb search "query"
+    #   docker compose run kb add /examples/car.pdf
+
+volumes:
+  kb-data:
@@ -0,0 +1,2 @@
+schema: spec-driven
+created: 2026-03-25
@@ -0,0 +1,173 @@
+## Context
+
+kb v1 is a monolithic Python CLI that loads an embedding model, opens SQLite, and performs search/ingestion in a single process. Every invocation pays ~3-5s model load cost. It was developed on a CPU-only server, then moved to a GPU-equipped WSL2 instance where native CUDA caused WSL crashes. Docker with NVIDIA Container Toolkit proved stable for GPU access.
+
+The target deployment is: heavy initial ingest on a local WSL2 box with an RTX 5090, then move the data directory to a production server which may have either an NVIDIA or AMD GPU. Multiple clients (terminal, Claude Code, future web UI) need to query the same knowledge base.
+
+### Constraints
+
+- Engine must run in Docker (GPU stability, portability, isolation)
+- Client must be a single static binary with zero runtime dependencies
+- Data directory must be portable between hosts via simple file copy
+- API must work behind a reverse proxy with TLS termination
+- Must support both NVIDIA (CUDA) and AMD (ROCm) GPUs without code changes
+
+## Goals / Non-Goals
+
+**Goals:**
+- Sub-100ms query latency (model already warm)
+- Single Go binary CLI that works on Linux, macOS, Windows
+- Engine handles all GPU/ML/DB operations — client is pure HTTP
+- GPU vendor abstracted at Docker build level, not application level
+- Async ingestion — client uploads and exits immediately, engine processes in background
+- JSON-first API suitable for both CLI and programmatic consumers
+- Bind-mount data directory for cross-host portability
+
+**Non-Goals:**
+- Multi-user / multi-tenant (this is a personal knowledge base)
+- Authentication beyond optional API key (trust the network boundary)
+- Streaming / WebSocket APIs (request-response is sufficient)
+- Web UI (the API enables it later but we're not building one now)
+- Clustering / horizontal scaling (single engine instance)
+- Backward compatibility with v1 CLI flags or config format
+
+## Decisions
+
+### 1. Single process, single container for the engine
+
+The engine runs as one FastAPI process serving both search and ingestion endpoints. The embedding model is loaded once at startup and shared across all request handlers.
+
+**Why not separate ingestion/query services?** They share the same model in GPU memory and the same SQLite database. Splitting them means either duplicating the model (wasting VRAM) or adding IPC complexity. At personal-KB scale there's no reason to scale them independently.
+
+**Why not multiple workers/processes?** SQLite doesn't benefit from multiple writer processes. A single uvicorn process with async handlers is sufficient. Ingestion is CPU/GPU-bound anyway — parallelism happens within the batch, not across processes.
+
+### 2. Async ingestion with staging queue
+
+Ingestion is fully asynchronous. The client uploads file bytes (or note text) to the engine, which writes them to a staging area and returns immediately with a 202. The engine processes the queue in the background.
+
+**The flow:**
+1. Client uploads file → engine writes to staging directory + creates a job record in SQLite → returns 202
+2. Engine background worker picks queued jobs sequentially, processes them (Docling, chunking, embedding), and updates job status
+3. Client can check progress via `kb jobs` if desired
+
+**All content types use the same path.** Text notes are written to staging as files, same as PDFs. One ingestion pipeline, not two. This means notes get the same queue semantics, status tracking, and retry behaviour as documents.
+
+**Client UX:**
+- `kb add report.pdf` → "Queued: report.pdf" → exits immediately
+- `kb add ~/documents/ --recursive` → "Queued: 47 files" → exits immediately
+- `kb jobs` → shows queue with statuses (queued, processing, done, failed)
+- `kb jobs <id>` → details for a specific job
+- `--format json` on `kb add` includes job IDs in the response (for scripts / Claude Code)
+- Upload failures (network error, file not found) error immediately at the client — distinct from processing failures which appear in `kb jobs`
+
+**Why not synchronous?** A large PDF can take minutes to process. Holding an HTTP connection open for that long is fragile (timeouts, proxy limits, client disconnects losing progress). More importantly, the client's job is "get bytes to the engine" — it shouldn't wait for Docling to finish.
+
+**Why not share a filesystem?** The client may not be on the same machine as the engine. Even when it is, mounting host paths into Docker creates permission headaches and breaks the clean separation. Multipart upload works identically whether the client is local or remote.
+
+**Why not an external task queue (Celery, Redis)?** Overkill for single-user. A SQLite `jobs` table + an asyncio background task is sufficient and adds zero dependencies.
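
The queue described in this decision can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the engine's implementation: the `jobs` columns, the `enqueue`/`worker` names, and the `process` placeholder (standing in for Docling, chunking, and embedding) are all hypothetical; the design only specifies a SQLite `jobs` table, sequential processing, and the four statuses.

```python
import asyncio
import sqlite3
import uuid

# Hypothetical schema: the design only says "a SQLite `jobs` table".
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id       TEXT PRIMARY KEY,
    filename TEXT NOT NULL,
    status   TEXT NOT NULL DEFAULT 'queued',  -- queued | processing | done | failed
    error    TEXT
)
"""


def enqueue(db: sqlite3.Connection, filename: str) -> str:
    """Record a staged upload and return the job ID a 202 response would carry."""
    job_id = uuid.uuid4().hex
    db.execute("INSERT INTO jobs (id, filename) VALUES (?, ?)", (job_id, filename))
    db.commit()
    return job_id


def process(filename: str) -> None:
    """Placeholder for the real pipeline (parse, chunk, embed)."""
    if filename.endswith(".bad"):
        raise ValueError("unsupported file")


async def worker(db: sqlite3.Connection, idle_sleep: float = 0.5,
                 stop_when_empty: bool = False) -> None:
    """Process queued jobs one at a time, oldest first."""
    while True:
        row = db.execute(
            "SELECT id, filename FROM jobs WHERE status = 'queued' "
            "ORDER BY rowid LIMIT 1"
        ).fetchone()
        if row is None:
            if stop_when_empty:
                return
            await asyncio.sleep(idle_sleep)
            continue
        job_id, filename = row
        db.execute("UPDATE jobs SET status = 'processing' WHERE id = ?", (job_id,))
        db.commit()
        try:
            # Run the blocking pipeline in a thread so the event loop stays free.
            await asyncio.to_thread(process, filename)
            db.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
        except Exception as exc:
            db.execute(
                "UPDATE jobs SET status = 'failed', error = ? WHERE id = ?",
                (str(exc), job_id),
            )
        db.commit()
```

Processing failures end up as `failed` rows rather than exceptions, which matches the distinction drawn above between upload errors (reported at the client) and processing errors (visible via `kb jobs`).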
+
+### 3. Go client with Cobra + minimal dependencies
+
+The client is a Go binary using Cobra for CLI structure and `net/http` from the standard library. Configuration is a simple YAML file (`~/.kb/client.yaml`) storing the engine URL and optional API key.
+
+**Why Go over Rust?** Faster to develop for a CLI tool, excellent cross-compilation, single binary output. Rust's safety guarantees aren't needed for an HTTP client.
+
+**Why not a shell script / curl wrapper?** Structured output formatting, progress bars for uploads, proper error handling, tab completion. These are painful in shell.
+
+### 4. API design: REST with JSON, versioned under /api/v1
+
+```
+# Search
+POST   /api/v1/search              Search the knowledge base
+
+# Ingestion (async)
+POST   /api/v1/jobs                Upload file or note for ingestion (returns 202 + job ID)
+GET    /api/v1/jobs                List ingestion jobs (with status filters)
+GET    /api/v1/jobs/{id}           Get job details and progress
+
+# Documents (already-ingested content)
+GET    /api/v1/documents           List documents (with filters)
+GET    /api/v1/documents/{id}      Get document details
+DELETE /api/v1/documents/{id}      Remove a document
+PUT    /api/v1/documents/{id}/tags Manage document tags
+
+# Metadata
+GET    /api/v1/tags                List tags
+GET    /api/v1/status              Engine status (model, DB stats, GPU info)
+POST   /api/v1/reindex             Re-embed all chunks
+GET    /api/v1/health              Health check (for load balancers / monitoring)
+```
+
+Search is POST (not GET) because the request carries a JSON body with the query and structured filters, which would be awkward to encode in a query string. Ingestion uses multipart/form-data (file upload with metadata fields for tags, doc_type, etc.). Notes are submitted as a text field in the same multipart form — same endpoint, same queue.
+
+**Why /api/v1 prefix?** Allows a reverse proxy to route cleanly and enables future API versioning without breaking clients.
+
+### 5. SQLite with WAL mode
+
+SQLite remains the storage engine. WAL (Write-Ahead Logging) mode is enabled at startup, which allows concurrent reads while a write is in progress. This is important because search queries shouldn't block during ingestion.
+
+**Why not Postgres?** SQLite with sqlite-vec provides vector search and FTS5 in a single portable file. No extra service to deploy. The data directory is just files — `rsync` to move it. At personal-KB scale, SQLite's limits are irrelevant.
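
A minimal sketch of the WAL behaviour this decision relies on, using only Python's standard `sqlite3` module. The `open_db` helper name and the 5-second busy timeout are assumptions for illustration; the design only states that WAL mode is enabled at startup.

```python
import sqlite3
from pathlib import Path


def open_db(path: Path) -> sqlite3.Connection:
    """Open the database and switch it to WAL mode (hypothetical helper)."""
    # `timeout` is the busy timeout in seconds for lock contention.
    conn = sqlite3.connect(path, timeout=5.0)
    # The pragma returns the journal mode actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode != "wal":
        conn.close()
        raise RuntimeError(f"could not enable WAL mode, got {mode!r}")
    return conn
```

With WAL enabled, a second connection can keep reading the last committed snapshot while another connection holds an uncommitted write, which is the property that lets search proceed during ingestion.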
+
+### 6. GPU abstraction at Docker layer only
+
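+The Python engine code uses `torch.cuda.is_available()` and `device="auto"` and never branches on GPU vendor. PyTorch's ROCm builds expose AMD GPUs through the same `torch.cuda` API (via HIP), so the check works unchanged on both. GPU vendor is determined entirely by: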
+- Which base Docker image is used (nvidia/cuda vs rocm/pytorch)
+- Which ORT package is installed (onnxruntime-gpu vs onnxruntime-rocm)
+
+Two Dockerfiles: `Dockerfile.nvidia` and `Dockerfile.rocm`. The compose file selects one. Application code is identical.
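
One way the vendor-neutral device choice could look in the engine, as a sketch only. The `resolve_device` and `gpu_available` names are hypothetical, and treating an explicit `cuda` request on a GPU-less host as a CPU fallback is an assumption drawn from the CPU-only fallback requirement rather than something this design states.

```python
def gpu_available() -> bool:
    """Report whether PyTorch sees a GPU; False if torch is not installed."""
    try:
        import torch
    except ImportError:
        return False
    # ROCm builds of PyTorch answer through the same torch.cuda namespace.
    return torch.cuda.is_available()


def resolve_device(setting: str, has_gpu: bool) -> str:
    """Map a KB_DEVICE-style setting (auto/cpu/cuda) to a torch device string."""
    setting = setting.strip().lower()
    if setting == "cpu":
        return "cpu"
    if setting in ("auto", "cuda"):
        # Assumption: fall back to CPU instead of failing when no GPU is present.
        return "cuda" if has_gpu else "cpu"
    raise ValueError(f"unknown device setting: {setting!r}")
```

Typical use would be `resolve_device(os.environ.get("KB_DEVICE", "auto"), gpu_available())`; nothing in the function depends on whether the image was built from the NVIDIA or the ROCm base.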
+
+### 7. Configuration hierarchy
+
+**Engine** (configured via environment variables in compose.yaml):
+- `KB_DATA_DIR` — bind mount path for SQLite + model cache
+- `KB_MODEL` — embedding model name
+- `KB_DEVICE` — embedding device (auto/cpu/cuda)
+- `KB_INGEST_DEVICE` — Docling device (auto/cpu/cuda)
+- `KB_API_KEY` — optional API key for request authentication
+- `KB_WORKERS` — uvicorn worker count (default 1)
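
The engine variables above could be read into a typed settings object roughly as follows. This is a sketch under assumptions: `EngineSettings` and `load_settings` are hypothetical names, `/data/kb` is taken from the Dockerfile in this commit, `auto` as the device default is assumed, and no default model name is assumed (it stays `None` when `KB_MODEL` is unset).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

VALID_DEVICES = ("auto", "cpu", "cuda")


@dataclass(frozen=True)
class EngineSettings:
    data_dir: str
    model: Optional[str]
    device: str
    ingest_device: str
    api_key: Optional[str]
    workers: int


def _device(env: Mapping[str, str], name: str) -> str:
    """Read and validate a device variable; 'auto' when unset (assumed default)."""
    value = env.get(name, "auto").strip().lower()
    if value not in VALID_DEVICES:
        raise ValueError(f"{name} must be one of {VALID_DEVICES}, got {value!r}")
    return value


def load_settings(env: Optional[Mapping[str, str]] = None) -> EngineSettings:
    """Build settings from KB_* variables, defaulting to the process environment."""
    env = os.environ if env is None else env
    return EngineSettings(
        data_dir=env.get("KB_DATA_DIR", "/data/kb"),
        model=env.get("KB_MODEL") or None,
        device=_device(env, "KB_DEVICE"),
        ingest_device=_device(env, "KB_INGEST_DEVICE"),
        api_key=env.get("KB_API_KEY") or None,  # empty string means "no key"
        workers=int(env.get("KB_WORKERS", "1")),
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to exercise in tests without touching the real environment.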
+
+**Client** (configured via `~/.kb/client.yaml`):
+```yaml
+engine_url: http://localhost:8000
+api_key: ""          # optional
+default_format: human  # human or json
+```
+
+Overridable via env vars (`KB_ENGINE_URL`, `KB_API_KEY`) and CLI flags (`--engine`, `--format`).
+
+### 8. Model loading strategy
+
+The embedding model loads eagerly at engine startup, before the engine serves search or ingestion requests. Until the model is ready, the `/api/v1/health` endpoint returns HTTP 503. This ensures the first query is fast and failures are visible immediately (not on first request).
+
+Docling's layout models load lazily on first PDF ingestion (they're large and not needed for search or non-PDF documents).
+
+## Risks / Trade-offs
+
+**[Engine unavailable = total outage]** → No offline fallback by design. Mitigation: engine starts automatically via Docker restart policy (`restart: unless-stopped`). For truly offline use, user can run the engine locally.
+
+**[SQLite concurrent write contention during large ingestion]** → WAL mode handles read/write concurrency. For write/write (rare — single user), SQLite's built-in busy timeout is sufficient. Mitigation: engine serializes write operations internally.
+
+**[AMD ROCm support is less tested than CUDA]** → RDNA 4 (9070 XT) is very new. Mitigation: build and test the ROCm image early. PyTorch ROCm support is mature for RDNA 3; RDNA 4 should work with ROCm 6.4+ but verify before committing to hardware purchase.
+
+**[File upload size limits for large PDFs]** → Mitigation: set the maximum request body size in the engine and in any reverse proxy such as nginx (default 100MB, configurable). Upload progress feedback in the Go client. Processing time is no longer a concern since the client disconnects after upload.
+
+**[Model warmup time on cold start]** → First startup takes ~5-10s to load the model. Mitigation: health endpoint + Docker restart policy means this only happens on container restart, not per-query. Log clearly during startup.
+
+## Migration Plan
+
+This is a clean-sheet build, not a migration from v1. However, the data format (SQLite schema, embedding dimensions) is unchanged, so an existing v1 `~/.kb/` directory can be bind-mounted directly into the v2 engine container with no conversion needed.
+
+**Rollout steps:**
+1. Build engine + Go client
+2. Test with existing v1 data directory
+3. Deploy engine on target host
+4. Distribute Go binary to client machines
+5. Retire v1 Python CLI
+
+**Rollback:** v1 CLI still works against the same SQLite database. No destructive changes to the data format.
+
+## Open Questions
+
+- **API key auth**: Simple bearer token is sufficient for a personal tool behind a reverse proxy. Decided.
+- **Ingestion**: Async with job queue. Client uploads and exits. `kb jobs` for status. Decided.
+- **Model hot-swap**: Should `POST /api/v1/reindex` with a new model name hot-swap the loaded model, or require an engine restart? Hot-swap is nicer UX but adds complexity. **Deferred** — start with restart-required, revisit if it becomes painful.
@@ -0,0 +1,32 @@
+## Why
+
+Every `kb` CLI invocation loads the embedding model from scratch (~3-5 seconds) before executing even a simple query. This makes interactive use painfully slow and wastes GPU memory with redundant loads. The monolithic architecture also ties the CLI to heavy Python ML dependencies, prevents multi-client access, and couples GPU vendor choice (NVIDIA vs AMD) to every installation.
+
+## What Changes
+
+- Clean-sheet v2 architecture — not a refactor of v1, built from scratch for client-server from day one
+  - **Engine**: FastAPI server running in Docker, keeping the embedding model warm in GPU memory. Handles both ingestion and search via HTTP API
+  - **Client**: Lightweight Go binary that talks to the engine over HTTP. No Python, no ML dependencies, instant startup
+- The `kb` CLI is the Go client only — all operations go through the engine API
+- GPU-vendor-agnostic Docker builds (NVIDIA CUDA and AMD ROCm targets)
+- Engine exposes a REST API suitable for reverse proxy / HTTPS termination
+- Data directory uses bind mounts for portability between hosts (e.g., WSL ingest → production server)
+- v1 Python CLI is retired — no dual-CLI maintenance burden
+
+## Capabilities
+
+### New Capabilities
+- `engine-api`: REST API server (FastAPI) exposing search, ingestion, document management, and status endpoints. Keeps embedding model resident in memory. Handles all DB and GPU operations
+- `go-client`: Lightweight Go CLI that communicates with the engine API over HTTP. Provides the same user-facing commands as v1 (init, add, search, list, info, remove, tags, status, config) without any ML dependencies
+- `docker-deployment`: GPU-vendor-agnostic Docker packaging with separate NVIDIA (CUDA) and AMD (ROCm) build targets. Bind-mount data volumes for host portability. Compose files for single-command deployment
+
+### Modified Capabilities
+<!-- No existing specs to modify — greenfield OpenSpec setup -->
+
+## Impact
+
+- **Code**: v2 is a new codebase. Python engine built fresh around FastAPI (reusing v1's proven core logic for search, embeddings, database, and ingestion where appropriate). Go client is entirely new. v1 `cli.py` is not carried forward
+- **APIs**: New HTTP REST API (JSON). This is the primary integration surface going forward (replaces direct Python imports for Claude Code skills etc.)
+- **Dependencies**: Go toolchain added for client build. Python side adds `fastapi` + `uvicorn`. Heavy ML deps (torch, sentence-transformers, docling) contained entirely within the Docker image
+- **Systems**: Docker + NVIDIA Container Toolkit (or ROCm equivalent) required on engine host. Client machines need only the Go binary and network access to the engine
+- **Data**: SQLite database and HF model cache unchanged in format. Bind-mount directory structure must be documented for cross-host migration
@@ -0,0 +1,94 @@
+## ADDED Requirements
+
+### Requirement: NVIDIA CUDA Docker image
+
+The project SHALL provide a `Dockerfile.nvidia` that builds the engine on an NVIDIA CUDA runtime base image with GPU support for PyTorch and ONNX Runtime.
+
+#### Scenario: Build NVIDIA image
+- **WHEN** an admin runs `docker compose -f compose.nvidia.yaml build`
+- **THEN** the build SHALL produce a working image with CUDA runtime, PyTorch with CUDA support, onnxruntime-gpu, and all engine dependencies
+
+#### Scenario: GPU access in NVIDIA container
+- **WHEN** the NVIDIA container starts with `--gpus all` or the NVIDIA runtime
+- **THEN** `torch.cuda.is_available()` SHALL return True and the engine SHALL load the embedding model on GPU
+
+---
+
+### Requirement: AMD ROCm Docker image
+
+The project SHALL provide a `Dockerfile.rocm` that builds the engine on an AMD ROCm base image with GPU support for PyTorch and ONNX Runtime.
+
+#### Scenario: Build ROCm image
+- **WHEN** an admin runs `docker compose -f compose.rocm.yaml build`
+- **THEN** the build SHALL produce a working image with ROCm runtime, PyTorch with ROCm support, onnxruntime-rocm, and all engine dependencies
+
+#### Scenario: GPU access in ROCm container
+- **WHEN** the ROCm container starts with `--device=/dev/kfd --device=/dev/dri`
+- **THEN** `torch.cuda.is_available()` SHALL return True (via HIP) and the engine SHALL load the embedding model on GPU
+
+---
+
+### Requirement: Application code is GPU-vendor-agnostic
+
+The Python engine code SHALL NOT reference CUDA or ROCm directly. GPU vendor abstraction SHALL be handled entirely at the Docker image level (base image selection and pip package choice). The same application code SHALL run on both NVIDIA and AMD images without modification.
+
+#### Scenario: Same engine code on both platforms
+- **WHEN** the engine starts on an NVIDIA image and an AMD image with identical configuration
+- **THEN** both SHALL load the model, accept requests, and return identical search results for the same query and data
+
+---
+
+### Requirement: Bind-mount data directory
+
+The engine SHALL store all persistent state (SQLite database, HF model cache, staging directory) under a single configurable data directory. This directory SHALL be mounted from the host via bind mount.
+
+#### Scenario: Data directory structure
+- **WHEN** the engine starts for the first time
+- **THEN** it SHALL create the following structure under the data directory:
+  - `kb.db` — SQLite database
+  - `hf_cache/` — HuggingFace model cache
+  - `staging/` — temporary files for queued ingestion jobs
+
+#### Scenario: Portable data across hosts
+- **WHEN** an admin copies the data directory from Host A to Host B and starts the engine with the same bind mount path
+- **THEN** the engine SHALL start successfully and serve all previously ingested documents without reprocessing
+
+#### Scenario: Portable data across GPU vendors
+- **WHEN** an admin moves the data directory from an NVIDIA host to an AMD host (same model name)
+- **THEN** the engine SHALL start successfully. Embeddings in the database remain valid (they are model-specific, not GPU-vendor-specific)
+
+---
+
+### Requirement: Compose files for deployment
+
+The project SHALL provide Docker Compose files for single-command deployment.
+
+#### Scenario: Start NVIDIA deployment
+- **WHEN** an admin runs `docker compose -f compose.nvidia.yaml up -d`
+- **THEN** the engine SHALL start with GPU access, bind-mount the data directory, and be reachable on the configured port
+
+#### Scenario: Start ROCm deployment
+- **WHEN** an admin runs `docker compose -f compose.rocm.yaml up -d`
+- **THEN** the engine SHALL start with GPU access via ROCm device passthrough, bind-mount the data directory, and be reachable on the configured port
+
+#### Scenario: Automatic restart
+- **WHEN** the engine process crashes or the host reboots
+- **THEN** Docker SHALL automatically restart the container (restart policy `unless-stopped`)
+
+#### Scenario: Configure via environment
+- **WHEN** an admin sets environment variables in the compose file (KB_MODEL, KB_API_KEY, KB_DEVICE, etc.)
+- **THEN** the engine SHALL use those values
+
+---
+
+### Requirement: CPU-only fallback
+
+The Dockerfiles SHALL produce images that work without GPU access. If no GPU is available, the engine SHALL fall back to CPU for all operations.
+
+#### Scenario: No GPU available
+- **WHEN** the container starts without GPU passthrough (no `--gpus`, no `/dev/kfd`)
+- **THEN** the engine SHALL detect no GPU, load the model on CPU, and log a warning that GPU acceleration is unavailable
+
+#### Scenario: Explicit CPU mode
+- **WHEN** `KB_DEVICE=cpu` and `KB_INGEST_DEVICE=cpu` are set in the environment
+- **THEN** the engine SHALL use CPU regardless of GPU availability
@@ -0,0 +1,199 @@
+## ADDED Requirements
+
+### Requirement: Engine startup and model loading
+
+The engine SHALL load the embedding model eagerly at startup before accepting HTTP requests. The engine SHALL expose a health endpoint that returns unhealthy until the model is fully loaded and the database is initialised.
+
+#### Scenario: Cold start with model download
+- **WHEN** the engine starts for the first time with no cached model
+- **THEN** it SHALL download the configured embedding model, load it into memory (GPU if available, CPU otherwise), enable WAL mode on the SQLite database, and begin accepting requests only after all initialisation completes
+
+#### Scenario: Health check during startup
+- **WHEN** a client sends `GET /api/v1/health` before the model is loaded
+- **THEN** the engine SHALL respond with HTTP 503 and `{"status": "starting"}`
+
+#### Scenario: Health check after startup
+- **WHEN** a client sends `GET /api/v1/health` after initialisation completes
+- **THEN** the engine SHALL respond with HTTP 200 and `{"status": "healthy"}`
+
+---
+
+### Requirement: Hybrid search
+
+The engine SHALL provide hybrid search combining BM25 full-text search (via FTS5) and vector similarity search (via sqlite-vec), merged using Reciprocal Rank Fusion. Search SHALL complete in under 100ms when the model is warm.
+
+#### Scenario: Hybrid search with results
+- **WHEN** a client sends `POST /api/v1/search` with body `{"query": "how to change oil", "top": 5}`
+- **THEN** the engine SHALL embed the query using the resident model, run both FTS5 and vector searches, merge results via RRF, and return a JSON response with matched chunks including scores, document metadata, and tags
+
+#### Scenario: Search with filters
+- **WHEN** a client sends `POST /api/v1/search` with body `{"query": "brakes", "tags": ["maintenance"], "doc_type": "pdf", "top": 3}`
+- **THEN** the engine SHALL apply tag and document type filters to both FTS5 and vector results before merging
+
+#### Scenario: Search with mode override
+- **WHEN** a client sends `POST /api/v1/search` with body `{"query": "error log", "fts_only": true}`
+- **THEN** the engine SHALL return only FTS5 results without running vector search
+
+#### Scenario: Empty knowledge base
+- **WHEN** a client searches against an empty database
+- **THEN** the engine SHALL return HTTP 200 with `{"query": "...", "results": [], "total_matches": 0}`
+
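The Reciprocal Rank Fusion step can be sketched as follows. This is an illustration under stated assumptions: `rrf_merge` is a hypothetical name, the inputs are chunk IDs already ranked by each search, and `k=60` is the constant commonly used with RRF, which the spec does not fix.

```python
def rrf_merge(fts_ids, vec_ids, k=60, top=5):
    """Merge two ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per chunk, with rank starting at 1.
    k=60 is an assumed default, not a value taken from the spec.
    """
    scores = {}
    for ranking in (fts_ids, vec_ids):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top]
```

With `fts_only`, the same function could be called with an empty `vec_ids` list, which leaves the FTS5 order unchanged.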
+---
+
+### Requirement: Async ingestion via job queue
+
+The engine SHALL accept file uploads and text notes for ingestion asynchronously. Uploaded content SHALL be written to a staging area and a job record created in the database. The engine SHALL return HTTP 202 immediately. A background worker SHALL process queued jobs sequentially.
+
+#### Scenario: Upload a PDF file
+- **WHEN** a client sends `POST /api/v1/jobs` with a multipart form containing a PDF file and optional fields (tags, doc_type)
+- **THEN** the engine SHALL write the file to the staging directory, create a job record with status `queued`, and return HTTP 202 with `{"job_id": "<id>", "status": "queued", "filename": "report.pdf"}`
+
+#### Scenario: Upload a text note
+- **WHEN** a client sends `POST /api/v1/jobs` with a multipart form containing a `note` text field and optional `title` field
+- **THEN** the engine SHALL write the note content to a staging file, create a job record with status `queued`, and return HTTP 202 with the job ID
+
+#### Scenario: Upload multiple files in sequence
+- **WHEN** a client sends multiple `POST /api/v1/jobs` requests in quick succession
+- **THEN** the engine SHALL queue each job independently and the background worker SHALL process them in FIFO order
+
+#### Scenario: Duplicate content detection
+- **WHEN** a client uploads a file whose content hash matches an already-ingested document
+- **THEN** the engine SHALL return HTTP 202, and the background worker SHALL subsequently mark the job as `skipped` with reason `duplicate`
+
+#### Scenario: Upload failure due to unsupported file type
+- **WHEN** a client uploads a file with an unsupported extension
+- **THEN** the engine SHALL return HTTP 422 with an error message listing supported types
+
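The upload checks above can be sketched as below. The spec fixes neither the hash algorithm nor the list of supported extensions, so SHA-256 and `SUPPORTED_EXTENSIONS` are assumptions, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path

# Assumed set of supported extensions; the real list is not given in the spec.
SUPPORTED_EXTENSIONS = {".pdf", ".md", ".py", ".txt"}


def content_hash(data: bytes) -> str:
    """Hash used for duplicate detection (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def validate_upload(filename: str):
    """Return (HTTP status, JSON body) for an upload, before any job is created."""
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        return 422, {
            "error": f"unsupported file type '{ext}'",
            "supported": sorted(SUPPORTED_EXTENSIONS),
        }
    return 202, {"status": "queued", "filename": filename}


def is_duplicate(data: bytes, known_hashes: set) -> bool:
    """Worker-side check: True when identical content was already ingested."""
    return content_hash(data) in known_hashes
```

The extension check runs synchronously in the request, while the duplicate check is deferred to the worker, which matches the 202-then-`skipped` behaviour in the scenario.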
+---
+
+### Requirement: Job status tracking
+
+The engine SHALL maintain job records in SQLite with status tracking. Jobs SHALL transition through states: `queued` → `processing` → `done` | `failed` | `skipped`.
+
+#### Scenario: List all jobs
+- **WHEN** a client sends `GET /api/v1/jobs`
+- **THEN** the engine SHALL return a JSON array of job records ordered by creation time (newest first), each including job_id, filename, status, created_at, and completed_at
+
+#### Scenario: Filter jobs by status
+- **WHEN** a client sends `GET /api/v1/jobs?status=failed`
+- **THEN** the engine SHALL return only jobs with the specified status
+
+#### Scenario: Get job details
+- **WHEN** a client sends `GET /api/v1/jobs/{id}`
+- **THEN** the engine SHALL return the full job record including status, filename, error message (if failed), document_id (if done), chunk count, and timing information
+
+#### Scenario: Job not found
+- **WHEN** a client sends `GET /api/v1/jobs/{id}` with a non-existent ID
+- **THEN** the engine SHALL return HTTP 404
+
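The job lifecycle can be sketched as a small transition table. `TRANSITIONS` and `advance` are hypothetical names; the states and their order are taken from the requirement text.

```python
# Allowed transitions: queued -> processing -> done | failed | skipped.
TRANSITIONS = {
    "queued": {"processing"},
    "processing": {"done", "failed", "skipped"},
    "done": set(),
    "failed": set(),
    "skipped": set(),
}


def advance(current: str, new: str) -> str:
    """Return the new status, or raise ValueError if the transition is not allowed."""
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal job transition: {current} -> {new}")
    return new
```

Routing every status update through one function like this keeps terminal states (`done`, `failed`, `skipped`) from being overwritten by accident.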
+---
+
+### Requirement: Background ingestion worker
+
+The engine SHALL run a background worker that processes queued jobs. The worker SHALL process one job at a time. For each job, it SHALL: detect document type, run the appropriate chunking pipeline (Docling for PDFs, header-based for Markdown, AST-based for code, whole-text for notes), generate embeddings using the resident model, and insert chunks and vectors into the database.
+
+#### Scenario: Successful PDF ingestion
+- **WHEN** the background worker picks up a queued PDF job
+- **THEN** it SHALL update the job status to `processing`, run Docling conversion and chunking, embed all chunks, insert document and chunks into the database, update the job status to `done` with the resulting document_id and chunk count, and delete the staged file
+
+#### Scenario: Ingestion failure
+- **WHEN** the background worker encounters an error during processing (e.g., corrupt PDF)
+- **THEN** it SHALL update the job status to `failed` with the error message, delete the staged file, and continue processing the next queued job
+
+#### Scenario: Search during active ingestion
+- **WHEN** a search request arrives while the background worker is processing a job
+- **THEN** the search SHALL execute without blocking (SQLite WAL mode) and return results from already-ingested documents
+
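The sequential worker behaviour can be sketched with the standard `sqlite3` module. This is a simplified illustration: the table is a reduced, hypothetical subset of the job columns, `ingest` stands in for the chunk-and-embed pipeline, and staging cleanup and duplicate handling are omitted.

```python
import sqlite3


def create_jobs_table(conn):
    # Reduced, hypothetical subset of the job columns named in the spec.
    conn.execute(
        "CREATE TABLE jobs ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "filename TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'queued', "
        "error TEXT, "
        "chunk_count INTEGER)"
    )


def drain_queue(conn, ingest):
    """Process queued jobs one at a time, oldest first, until none remain."""
    while True:
        row = conn.execute(
            "SELECT id, filename FROM jobs "
            "WHERE status = 'queued' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return
        job_id, filename = row
        conn.execute("UPDATE jobs SET status = 'processing' WHERE id = ?", (job_id,))
        conn.commit()
        try:
            chunk_count = ingest(filename)
        except Exception as exc:
            # A failed job records its error and does not stop the queue.
            conn.execute(
                "UPDATE jobs SET status = 'failed', error = ? WHERE id = ?",
                (str(exc), job_id),
            )
        else:
            conn.execute(
                "UPDATE jobs SET status = 'done', chunk_count = ? WHERE id = ?",
                (chunk_count, job_id),
            )
        conn.commit()
```

In the engine this loop would run as a background task and sleep when the queue is empty rather than return.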
+---
+
+### Requirement: Document management
+
+The engine SHALL provide endpoints to list, inspect, and remove ingested documents.
+
+#### Scenario: List documents
+- **WHEN** a client sends `GET /api/v1/documents`
+- **THEN** the engine SHALL return a JSON array of documents with id, title, doc_type, tags, chunk_count, and created_at
+
+#### Scenario: List documents with filters
+- **WHEN** a client sends `GET /api/v1/documents?type=pdf&tags=manual`
+- **THEN** the engine SHALL return only documents matching all specified filters
+
+#### Scenario: Get document details
+- **WHEN** a client sends `GET /api/v1/documents/{id}`
+- **THEN** the engine SHALL return the full document record including all chunks and their text content
+
+#### Scenario: Remove a document
+- **WHEN** a client sends `DELETE /api/v1/documents/{id}`
+- **THEN** the engine SHALL delete the document, all its chunks, associated embeddings, and tag associations, and return HTTP 200 with a confirmation
+
+#### Scenario: Remove non-existent document
+- **WHEN** a client sends `DELETE /api/v1/documents/{id}` with a non-existent ID
+- **THEN** the engine SHALL return HTTP 404
+
+---
+
+### Requirement: Tag management
+
+The engine SHALL provide endpoints to list all tags and manage tags on documents.
+
+#### Scenario: List all tags
+- **WHEN** a client sends `GET /api/v1/tags`
+- **THEN** the engine SHALL return a JSON array of tags with name and document count
+
+#### Scenario: Add tags to a document
+- **WHEN** a client sends `PUT /api/v1/documents/{id}/tags` with body `{"add": ["manual", "v2"]}`
+- **THEN** the engine SHALL add the specified tags to the document and return the updated tag list
+
+#### Scenario: Remove tags from a document
+- **WHEN** a client sends `PUT /api/v1/documents/{id}/tags` with body `{"remove": ["draft"]}`
+- **THEN** the engine SHALL remove the specified tags from the document and return the updated tag list
+
+---
+
+### Requirement: Engine status and reindex
+
+The engine SHALL provide status information and support re-embedding all chunks.
+
+#### Scenario: Get engine status
+- **WHEN** a client sends `GET /api/v1/status`
+- **THEN** the engine SHALL return JSON with model_name, embedding_dim, GPU device info, database stats (document count by type, total chunks, DB size), and queue stats (queued/processing job count)
+
+#### Scenario: Trigger reindex
+- **WHEN** a client sends `POST /api/v1/reindex`
+- **THEN** the engine SHALL re-embed all existing chunks using the currently loaded model and return progress information. This operation SHALL NOT block search queries.
+
+---
+
+### Requirement: API authentication
+
+The engine SHALL support optional API key authentication via Bearer token. When `KB_API_KEY` is set, all requests MUST include a matching `Authorization: Bearer <key>` header. When `KB_API_KEY` is not set, authentication SHALL be disabled.
+
+#### Scenario: Valid API key
+- **WHEN** `KB_API_KEY` is set and a request includes a matching Bearer token
+- **THEN** the engine SHALL process the request normally
+
+#### Scenario: Missing API key when required
+- **WHEN** `KB_API_KEY` is set and a request has no Authorization header
+- **THEN** the engine SHALL return HTTP 401 `{"error": "authentication required"}`
+
+#### Scenario: Invalid API key
+- **WHEN** `KB_API_KEY` is set and a request includes a non-matching Bearer token
+- **THEN** the engine SHALL return HTTP 401 `{"error": "invalid api key"}`
+
+#### Scenario: Auth disabled
+- **WHEN** `KB_API_KEY` is not set
+- **THEN** the engine SHALL process all requests without requiring authentication
+
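The four authentication scenarios can be sketched as one function. `check_auth` is a hypothetical name, and the constant-time comparison via `hmac.compare_digest` is a suggested hardening step that the spec does not require.

```python
import hmac


def check_auth(configured_key, authorization_header):
    """Return (HTTP status, error body or None) for one request.

    configured_key is the value of KB_API_KEY, or None/"" when unset.
    """
    if not configured_key:
        return 200, None  # auth disabled
    if not authorization_header:
        return 401, {"error": "authentication required"}
    scheme, _, token = authorization_header.partition(" ")
    # Compare as bytes so non-ASCII tokens do not raise TypeError.
    matches = hmac.compare_digest(token.encode(), configured_key.encode())
    if scheme.lower() != "bearer" or not matches:
        return 401, {"error": "invalid api key"}
    return 200, None
```

In FastAPI this check would typically live in a middleware or dependency that reads the `Authorization` header from the request.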
+---
+
+### Requirement: Engine configuration via environment variables
+
+The engine SHALL be configured via environment variables. No config file is read by the engine; all configuration comes from the environment (set via `compose.yaml` or `docker run`).
+
+#### Scenario: Default configuration
+- **WHEN** the engine starts with no environment variables set
+- **THEN** it SHALL use defaults: data directory `/data`, model `all-MiniLM-L6-v2`, device `auto`, no API key required
+
+#### Scenario: Custom model
+- **WHEN** `KB_MODEL` is set to `BAAI/bge-small-en-v1.5`
+- **THEN** the engine SHALL download and load that model instead of the default
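Environment-only configuration can be sketched as below. The variable names appear elsewhere in this change, and the defaults for the data directory, model, device, and API key come from the scenario above; the `auto` default for `KB_INGEST_DEVICE` is an assumption, and `load_engine_config` is a hypothetical name.

```python
import os

ENGINE_DEFAULTS = {
    "KB_DATA_DIR": "/data",
    "KB_MODEL": "all-MiniLM-L6-v2",
    "KB_DEVICE": "auto",
    "KB_INGEST_DEVICE": "auto",  # assumed; the spec does not state this default
    "KB_API_KEY": None,          # None means authentication is disabled
}


def load_engine_config(environ=None):
    """Build the engine config from environment variables plus defaults."""
    env = os.environ if environ is None else environ
    # An empty string is treated the same as an unset variable.
    return {name: env.get(name) or default for name, default in ENGINE_DEFAULTS.items()}
```

Passing a plain dict as `environ` makes the function easy to exercise in tests without touching the real process environment.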
@@ -0,0 +1,177 @@
+## ADDED Requirements
+
+### Requirement: Single static binary with zero runtime dependencies
+
+The Go client SHALL compile to a single static binary with no runtime dependencies. It SHALL support cross-compilation for Linux (amd64, arm64), macOS (amd64, arm64), and Windows (amd64).
+
+#### Scenario: Install on a clean machine
+- **WHEN** a user downloads the `kb` binary for their platform
+- **THEN** they SHALL be able to run it immediately with no additional installs (no Python, no Docker, no shared libraries)
+
+---
+
+### Requirement: Client configuration
+
+The client SHALL read configuration from `~/.kb/client.yaml`. Configuration values SHALL be overridable via environment variables and CLI flags. Precedence: CLI flags > environment variables > config file > defaults.
+
+#### Scenario: Default configuration
+- **WHEN** no config file exists and no env vars or flags are set
+- **THEN** the client SHALL use defaults: engine URL `http://localhost:8000`, no API key, format `human`
+
+#### Scenario: Config file
+- **WHEN** `~/.kb/client.yaml` contains `engine_url: https://kb.example.com`
+- **THEN** the client SHALL use that URL for all API requests
+
+#### Scenario: Environment variable override
+- **WHEN** `KB_ENGINE_URL` is set
+- **THEN** it SHALL override the config file value
+
+#### Scenario: CLI flag override
+- **WHEN** the user passes `--engine https://other.host:8000`
+- **THEN** it SHALL override both the config file and environment variable
+
+#### Scenario: Engine unreachable
+- **WHEN** the client cannot connect to the engine URL
+- **THEN** it SHALL print a clear error message (e.g., "Cannot reach engine at http://localhost:8000 — is it running?") and exit with a non-zero code
+
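The precedence rule (CLI flags > environment variables > config file > defaults) can be sketched as follows. The client itself is specified as Go; Python is used here only as a language-neutral illustration, and `resolve_setting` is a hypothetical name.

```python
def resolve_setting(flag_value, env_value, file_value, default):
    """Return the first value that is set, checking the highest-precedence source first."""
    for candidate in (flag_value, env_value, file_value):
        if candidate is not None:
            return candidate
    return default


def resolve_engine_url(flag_value=None, env_value=None, file_value=None):
    # The default comes from the "Default configuration" scenario.
    return resolve_setting(flag_value, env_value, file_value, "http://localhost:8000")
```

Here `None` stands for "not provided"; a Go implementation would need an equivalent way to distinguish an unset flag from an empty one.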
+---
+
+### Requirement: Search command
+
+The client SHALL provide a `kb search <query>` command that sends the query to the engine and displays results.
+
+#### Scenario: Human-readable search output
+- **WHEN** the user runs `kb search "how to change oil"`
+- **THEN** the client SHALL POST the query to `/api/v1/search` and display results in a human-readable format showing rank, score, document title, page/section, doc type, tags, and a text snippet
+
+#### Scenario: JSON search output
+- **WHEN** the user runs `kb search "query" --format json`
+- **THEN** the client SHALL output the raw JSON response from the engine
+
+#### Scenario: Search with filters
+- **WHEN** the user runs `kb search "brakes" --tags maintenance --type pdf --top 3`
+- **THEN** the client SHALL include the filters in the API request body
+
+#### Scenario: Search mode flags
+- **WHEN** the user runs `kb search "error" --fts-only`
+- **THEN** the client SHALL set `fts_only: true` in the request body
+
+---
+
+### Requirement: Add command (file and note ingestion)
+
+The client SHALL provide a `kb add` command that uploads files or notes to the engine for async ingestion. The client SHALL exit immediately after a successful upload.
+
+#### Scenario: Add a single file
+- **WHEN** the user runs `kb add report.pdf`
+- **THEN** the client SHALL upload the file via `POST /api/v1/jobs` (multipart), print "Queued: report.pdf", and exit
+
+#### Scenario: Add a file with tags
+- **WHEN** the user runs `kb add manual.pdf --tags car,maintenance`
+- **THEN** the client SHALL include the tags in the multipart upload metadata
+
+#### Scenario: Add a directory recursively
+- **WHEN** the user runs `kb add ~/documents/ --recursive`
+- **THEN** the client SHALL discover all supported files in the directory tree, upload each one sequentially, and print "Queued: N files"
+
+#### Scenario: Add a text note
+- **WHEN** the user runs `kb add --note "The server room is in building 3, floor 2"`
+- **THEN** the client SHALL submit the note text via `POST /api/v1/jobs` (multipart with note field), print "Queued: note", and exit
+
+#### Scenario: Add with JSON output
+- **WHEN** the user runs `kb add report.pdf --format json`
+- **THEN** the client SHALL output the JSON response from the engine including the job_id
+
+#### Scenario: File not found
+- **WHEN** the user runs `kb add nonexistent.pdf`
+- **THEN** the client SHALL print an error and exit with a non-zero code without making any API call
+
+#### Scenario: Upload failure
+- **WHEN** the upload fails (network error, engine returns 4xx/5xx)
+- **THEN** the client SHALL print the error and exit with a non-zero code
+
+---
+
+### Requirement: Jobs command
+
+The client SHALL provide a `kb jobs` command to view the ingestion queue.
+
+#### Scenario: List all jobs
+- **WHEN** the user runs `kb jobs`
+- **THEN** the client SHALL fetch `GET /api/v1/jobs` and display a table of recent jobs showing ID, filename, status, and timestamp
+
+#### Scenario: Filter jobs by status
+- **WHEN** the user runs `kb jobs --status failed`
+- **THEN** the client SHALL pass the status filter and display only matching jobs
+
+#### Scenario: Job details
+- **WHEN** the user runs `kb jobs <id>`
+- **THEN** the client SHALL fetch `GET /api/v1/jobs/{id}` and display full job details including error message (if failed), document_id (if done), and chunk count
+
+---
+
+### Requirement: Document management commands
+
+The client SHALL provide commands to list, inspect, and remove documents.
+
+#### Scenario: List documents
+- **WHEN** the user runs `kb list`
+- **THEN** the client SHALL fetch `GET /api/v1/documents` and display a table of documents with ID, title, type, tags, chunk count, and date
+
+#### Scenario: List with filters
+- **WHEN** the user runs `kb list --type pdf --tags manual`
+- **THEN** the client SHALL pass filters as query parameters
+
+#### Scenario: Document info
+- **WHEN** the user runs `kb info <id>`
+- **THEN** the client SHALL fetch `GET /api/v1/documents/{id}` and display full document details
+
+#### Scenario: Remove a document
+- **WHEN** the user runs `kb remove <id>`
+- **THEN** the client SHALL prompt for confirmation, then send `DELETE /api/v1/documents/{id}` and display the result
+
+#### Scenario: Remove with skip confirmation
+- **WHEN** the user runs `kb remove <id> --yes`
+- **THEN** the client SHALL skip the confirmation prompt
+
+---
+
+### Requirement: Tag management commands
+
+The client SHALL provide commands to list and manage tags.
+
+#### Scenario: List tags
+- **WHEN** the user runs `kb tags`
+- **THEN** the client SHALL fetch `GET /api/v1/tags` and display tags with document counts
+
+#### Scenario: Add tags to a document
+- **WHEN** the user runs `kb tag <id> --add manual,v2`
+- **THEN** the client SHALL send `PUT /api/v1/documents/{id}/tags` with the add payload
+
+#### Scenario: Remove tags from a document
+- **WHEN** the user runs `kb tag <id> --remove draft`
+- **THEN** the client SHALL send `PUT /api/v1/documents/{id}/tags` with the remove payload
+
+---
+
+### Requirement: Status command
+
+The client SHALL provide a `kb status` command to display engine status.
+
+#### Scenario: Display engine status
+- **WHEN** the user runs `kb status`
+- **THEN** the client SHALL fetch `GET /api/v1/status` and display model name, embedding dimensions, GPU info, document counts by type, total chunks, database size, and queue status
+
+---
+
+### Requirement: Global output format flag
+
+All commands SHALL support a `--format` flag accepting `human` (default) or `json`. The default MAY be changed via the `default_format` config value.
+
+#### Scenario: JSON output on any command
+- **WHEN** the user passes `--format json` to any command
+- **THEN** the client SHALL output the raw JSON response from the engine without human formatting
+
+#### Scenario: Human output (default)
+- **WHEN** the user runs any command without `--format`
+- **THEN** the client SHALL format the response in a human-readable table or structured text output
@@ -0,0 +1,91 @@
+## 1. Project scaffolding
+
+- [ ] 1.1 Create v2 project structure: `engine/` (Python/FastAPI) and `client/` (Go) directories at repo root
+- [ ] 1.2 Set up `engine/pyproject.toml` with dependencies: fastapi, uvicorn, sentence-transformers, sqlite-vec, docling, pyyaml
+- [ ] 1.3 Set up `client/go.mod` with dependencies: cobra, gopkg.in/yaml.v3
+- [ ] 1.4 Create engine entry point (`engine/main.py`) with uvicorn startup, eager model loading, and readiness gating
+
+## 2. Database layer
+
+- [ ] 2.1 Implement database module (`engine/kb/database.py`): connection factory with WAL mode, schema initialisation (documents, chunks, chunks_fts, chunks_vec, tags, document_tags, config tables)
+- [ ] 2.2 Add `jobs` table to schema: id, filename, status (queued/processing/done/failed/skipped), doc_type, tags_json, error, document_id, chunk_count, created_at, completed_at, staging_path
+- [ ] 2.3 Implement job CRUD functions: create_job, get_job, list_jobs, update_job_status
+
+## 3. Embeddings and search
+
+- [ ] 3.1 Implement embeddings module (`engine/kb/embeddings.py`): model loading with device resolution (auto/cpu/cuda), embed_texts, get_model_dim — model loaded once and cached in-process
+- [ ] 3.2 Implement search module (`engine/kb/search.py`): FTS5 search, vector search via sqlite-vec, RRF merge, filter support (tags, doc_type, fts_only, vec_only, threshold)
+
+## 4. Ingestion pipelines
+
+- [ ] 4.1 Implement file type detection (`engine/kb/ingest/detector.py`): extension-based detection for pdf, markdown, code, note
+- [ ] 4.2 Implement Docling pipeline (`engine/kb/ingest/docling.py`): PDF/DOCX conversion with AcceleratorOptions device control, hierarchy and fixed chunking
+- [ ] 4.3 Implement Markdown pipeline (`engine/kb/ingest/markdown.py`): header-based splitting with min/max token bounds
+- [ ] 4.4 Implement code pipeline (`engine/kb/ingest/code.py`): AST-based chunking for Python, regex for Bash/Go, fallback fixed-size
+- [ ] 4.5 Implement note pipeline (`engine/kb/ingest/note.py`): whole-text chunking with auto-title
+
+## 5. Async job queue and background worker
+
+- [ ] 5.1 Implement staging manager (`engine/kb/staging.py`): write uploaded file/note to staging directory, generate staging path, cleanup after processing
+- [ ] 5.2 Implement background worker (`engine/kb/worker.py`): asyncio background task that polls for queued jobs, processes sequentially (detect type → chunk → embed → insert), updates job status on success/failure/skip (duplicate detection)
+- [ ] 5.3 Wire worker into FastAPI lifespan: start worker on app startup, graceful shutdown on app stop
+
+## 6. API routes
+
+- [ ] 6.1 Implement health endpoint: `GET /api/v1/health` — returns 503 during startup, 200 when ready
+- [ ] 6.2 Implement search endpoint: `POST /api/v1/search` — accepts query, top, tags, doc_type, fts_only, vec_only, threshold in JSON body
+- [ ] 6.3 Implement ingestion endpoint: `POST /api/v1/jobs` — accepts multipart file upload or note text field with optional tags/doc_type/title metadata, writes to staging, creates job, returns 202
+- [ ] 6.4 Implement job status endpoints: `GET /api/v1/jobs` (list with status filter), `GET /api/v1/jobs/{id}` (details)
+- [ ] 6.5 Implement document endpoints: `GET /api/v1/documents` (list with filters), `GET /api/v1/documents/{id}` (details), `DELETE /api/v1/documents/{id}` (remove)
+- [ ] 6.6 Implement tag endpoints: `GET /api/v1/tags` (list), `PUT /api/v1/documents/{id}/tags` (add/remove)
+- [ ] 6.7 Implement status endpoint: `GET /api/v1/status` — model info, GPU info, DB stats, queue stats
+- [ ] 6.8 Implement reindex endpoint: `POST /api/v1/reindex` — re-embed all chunks with current model
+- [ ] 6.9 Implement API key authentication middleware: check `KB_API_KEY` env, validate Bearer token, skip when unset
+
+## 7. Engine configuration
+
+- [ ] 7.1 Implement config module (`engine/kb/config.py`): read all settings from environment variables (KB_DATA_DIR, KB_MODEL, KB_DEVICE, KB_INGEST_DEVICE, KB_API_KEY), apply defaults
+
+## 8. Docker images
+
+- [ ] 8.1 Create `Dockerfile.nvidia`: CUDA runtime base, system deps (libgl1, libglib2.0, poppler), uv install, onnxruntime-gpu overlay, engine entrypoint
+- [ ] 8.2 Create `Dockerfile.rocm`: ROCm/PyTorch base, system deps, uv install, onnxruntime-rocm, engine entrypoint
+- [ ] 8.3 Create `compose.nvidia.yaml`: NVIDIA runtime, GPU reservation, bind mount for /data, environment variables, restart policy, port mapping
+- [ ] 8.4 Create `compose.rocm.yaml`: ROCm device passthrough (/dev/kfd, /dev/dri), bind mount, environment variables, restart policy, port mapping
+- [ ] 8.5 Create `.dockerignore` for engine context
+
+## 9. Go client — project setup and config
+
+- [ ] 9.1 Initialise Cobra CLI structure: root command with `--engine`, `--format`, `--api-key` persistent flags
+- [ ] 9.2 Implement client config loading: read `~/.kb/client.yaml`, merge with env vars (KB_ENGINE_URL, KB_API_KEY), merge with CLI flags
+- [ ] 9.3 Implement HTTP client helper: base URL handling, Bearer token injection, JSON request/response helpers, error formatting for connection failures and HTTP errors
+
+## 10. Go client — commands
+
+- [ ] 10.1 Implement `kb search <query>` command: POST to /api/v1/search, human and JSON output formatting
+- [ ] 10.2 Implement `kb add <path>` command: file discovery (single file, directory with --recursive), multipart upload to /api/v1/jobs, human summary output ("Queued: N files"), JSON output with job IDs
+- [ ] 10.3 Implement `kb add --note <text>` command: submit note via multipart to /api/v1/jobs
+- [ ] 10.4 Implement `kb jobs` command: list jobs (with --status filter), single job detail via `kb jobs <id>`
+- [ ] 10.5 Implement `kb list` command: GET /api/v1/documents with --type and --tags filters
+- [ ] 10.6 Implement `kb info <id>` command: GET /api/v1/documents/{id}
+- [ ] 10.7 Implement `kb remove <id>` command: confirmation prompt (skip with --yes), DELETE /api/v1/documents/{id}
+- [ ] 10.8 Implement `kb tags` command: GET /api/v1/tags
+- [ ] 10.9 Implement `kb tag <id>` command: --add and --remove flags, PUT /api/v1/documents/{id}/tags
+- [ ] 10.10 Implement `kb status` command: GET /api/v1/status with human formatting
+
+## 11. Go client — build and distribution
+
+- [ ] 11.1 Create Makefile or build script: cross-compile for linux/amd64, linux/arm64, darwin/amd64, darwin/arm64, windows/amd64
+- [ ] 11.2 Add version injection via `-ldflags` at build time
+
+## 12. Integration testing
+
+- [ ] 12.1 Test engine startup: health endpoint transitions from 503 → 200 after model load
+- [ ] 12.2 Test full ingestion flow: upload PDF via API → job queued → job completes → document appears in list → chunks searchable
+- [ ] 12.3 Test note ingestion: submit note via API → job completes → note searchable
+- [ ] 12.4 Test search: hybrid search returns ranked results, filters work, fts_only/vec_only modes work
+- [ ] 12.5 Test document management: list, info, remove, tag operations via API
+- [ ] 12.6 Test job queue: multiple uploads queue correctly, failures don't block queue, duplicates are skipped
+- [ ] 12.7 Test API authentication: requests rejected without key when KB_API_KEY set, accepted with valid key, all requests pass when unset
+- [ ] 12.8 Test Docker GPU: `kb doctor`-style verification that GPU is accessible inside container (NVIDIA build)
+- [ ] 12.9 Test data portability: copy data directory, start engine on new container, verify all documents and search work
@@ -40,8 +40,9 @@ def init(model, status):
    data_dir.mkdir(parents=True, exist_ok=True)

    # Download model and get dimension
-    download_model(model_name)
-    dim = get_model_dim(model_name)
+    embed_device = cfg["embedding"]["device"]
+    download_model(model_name, device=embed_device)
+    dim = get_model_dim(model_name, device=embed_device)

    # Initialise database
    conn = get_connection(db_path)
@@ -86,6 +87,7 @@ def add(path, note, title, tags, doc_type, language, recursive, workers):
    conn = get_connection(db_path)
    check_model_binding(conn, cfg)
    model_name = cfg["embedding"]["model"]
+    embed_device = cfg["embedding"]["device"]
    tag_list = [t.strip() for t in tags.split(",")] if tags else []

    if note:
@@ -100,7 +102,8 @@ def add(path, note, title, tags, doc_type, language, recursive, workers):
        doc_id = insert_document(conn, note_title, None, content_hash, "note")
        for c in chunks:
            chunk_id = insert_chunk(conn, doc_id, c["chunk_index"], c["text"], metadata=c["metadata"])
-            emb = embed_texts(model_name, [c["text"]], prefix=cfg["embedding"].get("passage_prefix", ""))
+            emb = embed_texts(model_name, [c["text"]], prefix=cfg["embedding"].get("passage_prefix", ""),
+                              device=embed_device)
            insert_embedding(conn, chunk_id, emb[0])
        if tag_list:
            tag_document(conn, doc_id, tag_list)
@@ -153,7 +156,8 @@ def _add_single_file(conn, file_path, cfg, model_name, tag_list, force_type, for
    # Embed all chunks in one batch
    texts = [c["text"] for c in chunks]
    prefix = cfg["embedding"].get("passage_prefix", "")
-    embeddings = embed_texts(model_name, texts, prefix=prefix)
+    embed_device = cfg["embedding"]["device"]
+    embeddings = embed_texts(model_name, texts, prefix=prefix, device=embed_device)

    for c, emb in zip(chunks, embeddings):
        chunk_id = insert_chunk(conn, doc_id, c["chunk_index"], c["text"],
@@ -544,12 +548,13 @@ def reindex(model):

    conn = get_connection(db_path)
    model_name = model or get_db_config(conn, "model_name") or cfg["embedding"]["model"]
+    embed_device = cfg["embedding"]["device"]

    # Download model if switching
    if model:
-        download_model(model_name)
+        download_model(model_name, device=embed_device)

-    dim = get_model_dim(model_name)
+    dim = get_model_dim(model_name, device=embed_device)

    # Get all chunks
    rows = conn.execute("SELECT id, text FROM chunks ORDER BY id").fetchall()
@@ -570,7 +575,7 @@ def reindex(model):
    with click.progressbar(range(0, len(all_texts), batch_size), label="Embedding") as bar:
        for i in bar:
            batch = all_texts[i:i + batch_size]
-            batch_embs = embed_texts(model_name, batch, prefix=prefix)
+            batch_embs = embed_texts(model_name, batch, prefix=prefix, device=embed_device)
            all_embeddings.extend(batch_embs)

    # Atomically replace vectors
@@ -614,3 +619,86 @@ def config_set(key, value):


 main.add_command(config)
+
+
+@main.command()
+def doctor():
+    """Check GPU availability and run diagnostics."""
+    import subprocess
+    import sys
+
+    click.echo("=== GPU Diagnostics ===\n")
+
+    # Step 1: nvidia-smi
+    click.echo("[1/5] nvidia-smi...")
+    try:
+        out = subprocess.run(["nvidia-smi", "--query-gpu=name,driver_version,memory.total",
+                              "--format=csv,noheader"], capture_output=True, text=True, timeout=10)
+        if out.returncode == 0:
+            click.echo(f"  OK: {out.stdout.strip()}")
+        else:
+            click.echo(f"  FAIL: {out.stderr.strip()}")
+            click.echo("\nNo GPU detected. Use CPU mode (the default).")
+            return
+    except FileNotFoundError:
+        click.echo("  SKIP: nvidia-smi not found")
+        click.echo("\nNo NVIDIA driver. Use CPU mode (the default).")
+        return
+
+    # Step 2: torch CUDA detection
+    click.echo("[2/5] torch CUDA detection...")
+    try:
+        import torch
+        if torch.cuda.is_available():
+            click.echo(f"  OK: torch {torch.__version__}, CUDA {torch.version.cuda}, "
+                        f"device={torch.cuda.get_device_name(0)}")
+        else:
+            click.echo(f"  FAIL: torch {torch.__version__} sees no CUDA")
+            click.echo("\nGPU detected but torch can't use it. Use CPU mode.")
+            return
+    except Exception as e:
+        click.echo(f"  FAIL: {e}")
+        return
+
+    # Step 3: small tensor op
+    click.echo("[3/5] Small GPU tensor operation...")
+    try:
+        x = torch.randn(64, 64, device="cuda")
+        y = x @ x.T
+        torch.cuda.synchronize()
+        del x, y
+        torch.cuda.empty_cache()
+        click.echo("  OK")
+    except Exception as e:
+        click.echo(f"  FAIL: {e}")
+        click.echo("\nBasic CUDA ops failed. Use CPU mode.")
+        return
+
+    # Step 4: load embedding model on GPU
+    click.echo("[4/5] Loading embedding model on GPU...")
+    try:
+        from kb_search.config import load_config
+        cfg = load_config()
+        model_name = cfg["embedding"]["model"]
+        from sentence_transformers import SentenceTransformer
+        model = SentenceTransformer(model_name, device="cuda")
+        emb = model.encode(["GPU diagnostic test"])
+        click.echo(f"  OK: {model_name}, embedding dim={emb.shape[1]}")
+        del model
+        torch.cuda.empty_cache()
+    except Exception as e:
+        click.echo(f"  FAIL: {e}")
+        click.echo("\nModel loading failed on GPU. Use CPU mode.")
+        return
+
+    # Step 5: memory check
+    click.echo("[5/5] GPU memory...")
+    mem_alloc = torch.cuda.memory_allocated() / 1024**2
+    mem_reserved = torch.cuda.memory_reserved() / 1024**2
+    mem_total = torch.cuda.get_device_properties(0).total_memory / 1024**3
+    click.echo(f"  Allocated: {mem_alloc:.0f} MB, Reserved: {mem_reserved:.0f} MB, "
+                f"Total: {mem_total:.1f} GB")
+
+    click.echo("\n=== All checks passed ===")
+    click.echo("To enable GPU embeddings:")
+    click.echo("  kb config set embedding.device cuda")
@@ -10,6 +10,7 @@ DEFAULTS = {
    "data_dir": "~/.kb",
    "embedding": {
        "model": "all-MiniLM-L6-v2",
+        "device": "cpu",
        "query_prefix": "",
        "passage_prefix": "",
    },
@@ -45,6 +46,7 @@ DEFAULTS = {
        "workers": 4,
        "batch_size": 50,
        "enable_ocr": "auto",
+        "device": "cpu",
    },
 }

@@ -52,6 +54,8 @@ DEFAULTS = {
 ENV_MAP = {
    "KB_DATA_DIR": "data_dir",
    "KB_MODEL": "embedding.model",
+    "KB_DEVICE": "embedding.device",
+    "KB_INGEST_DEVICE": "ingestion.device",
    "KB_DEFAULT_TOP": "search.default_top",
    "KB_DEFAULT_FORMAT": "search.default_format",
 }
@@ -1,52 +1,71 @@
-"""Embedding model management — download, load, and inference via ONNX."""
+"""Embedding model management — download, load, and inference."""

 import click
 from pathlib import Path

 _model_instance = None
 _model_name = None
+_model_device = None


-def load_model(model_name: str):
-    """Load a sentence-transformers model with ONNX backend. Caches in-process."""
-    global _model_instance, _model_name
-    if _model_instance is not None and _model_name == model_name:
+def _resolve_device(device: str) -> str:
+    """Resolve 'auto' to a concrete device string."""
+    if device in ("cpu", "cuda"):
+        return device
+    # auto (or any unrecognised value): use GPU if available
+    try:
+        import torch
+        return "cuda" if torch.cuda.is_available() else "cpu"
+    except ImportError:
+        return "cpu"
+
+
+def load_model(model_name: str, device: str = "cpu"):
+    """Load a sentence-transformers model. Uses torch backend for CUDA, ONNX for CPU."""
+    global _model_instance, _model_name, _model_device
+    if _model_instance is not None and _model_name == model_name and _model_device == _resolve_device(device):
        return _model_instance

    from sentence_transformers import SentenceTransformer

-    click.echo(f"Loading model '{model_name}'...")
-    try:
-        _model_instance = SentenceTransformer(model_name, backend="onnx")
-    except Exception:
-        # Fallback: some models may not have pre-exported ONNX. Let sentence-transformers export.
-        click.echo("Optimising model for ONNX inference (one-time)...")
-        _model_instance = SentenceTransformer(model_name, backend="onnx")
+    resolved = _resolve_device(device)
+    if resolved == "cuda":
+        click.echo(f"Loading model '{model_name}' on GPU...")
+        _model_instance = SentenceTransformer(model_name, device="cuda")
+    else:
+        click.echo(f"Loading model '{model_name}' (ONNX/CPU)...")
+        try:
+            _model_instance = SentenceTransformer(model_name, backend="onnx")
+        except Exception:
+            click.echo("Optimising model for ONNX inference (one-time)...")
+            _model_instance = SentenceTransformer(model_name, backend="onnx")

    _model_name = model_name
+    _model_device = resolved
    return _model_instance


-def get_model_dim(model_name: str) -> int:
+def get_model_dim(model_name: str, device: str = "cpu") -> int:
    """Get the embedding dimension for a model."""
-    model = load_model(model_name)
+    model = load_model(model_name, device)
    return model.get_sentence_embedding_dimension()


 def embed_texts(model_name: str, texts: list[str],
-                prefix: str = "", show_progress: bool = False) -> list[list[float]]:
+                prefix: str = "", show_progress: bool = False,
+                device: str = "cpu") -> list[list[float]]:
    """Embed a list of texts, returning float vectors."""
-    model = load_model(model_name)
+    model = load_model(model_name, device)
    if prefix:
        texts = [prefix + t for t in texts]
    embeddings = model.encode(texts, show_progress_bar=show_progress, convert_to_numpy=True)
    return [e.tolist() for e in embeddings]


-def download_model(model_name: str):
+def download_model(model_name: str, device: str = "cpu"):
    """Pre-download a model (for kb init)."""
    click.echo(f"Downloading embedding model '{model_name}'...")
-    load_model(model_name)
+    load_model(model_name, device)
    click.echo("Embedding model ready.")
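
One way to key an in-process model cache on the resolved device, so that `"auto"` and the concrete device it resolves to share a single entry, is sketched below. This is an illustration under stated assumptions, not the module's code: `resolve_device`, `load_cached`, and `fake_loader` are hypothetical names, the CUDA probe is injected so the example runs without torch, and the cache is a dict rather than the single-instance globals used in the diff.

```python
from typing import Callable

# Hypothetical cache: one entry per (model name, resolved device) pair.
_cache: dict = {}


def resolve_device(device: str, cuda_available: Callable[[], bool]) -> str:
    """Map 'auto' (or any unrecognised value) to 'cuda' or 'cpu'."""
    if device in ("cpu", "cuda"):
        return device
    return "cuda" if cuda_available() else "cpu"


def load_cached(model_name: str, device: str,
                cuda_available: Callable[[], bool],
                loader: Callable[[str, str], object]) -> object:
    """Load once per (model, resolved device); later calls reuse the entry."""
    key = (model_name, resolve_device(device, cuda_available))
    if key not in _cache:
        _cache[key] = loader(*key)
    return _cache[key]


calls = []


def fake_loader(name: str, dev: str) -> object:
    """Stand-in for SentenceTransformer(...); records each real load."""
    calls.append((name, dev))
    return object()


# On a machine without CUDA, "auto" and "cpu" resolve to the same key.
first = load_cached("all-MiniLM-L6-v2", "auto", lambda: False, fake_loader)
second = load_cached("all-MiniLM-L6-v2", "cpu", lambda: False, fake_loader)
```

Because the key uses the resolved device, the second call is a cache hit and the loader runs only once.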


@@ -13,11 +13,15 @@ def chunk_document(file_path: Path, cfg: dict) -> list[dict]:
    """Ingest a document using Docling and return chunks."""
    from docling.document_converter import DocumentConverter, PdfFormatOption
    from docling.datamodel.base_models import InputFormat
-    from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions
+    from docling.datamodel.pipeline_options import AcceleratorOptions, PdfPipelineOptions, RapidOcrOptions

    # Configure PDF pipeline
-    ocr_setting = cfg.get("ingestion", {}).get("enable_ocr", "auto")
-    pdf_opts = PdfPipelineOptions()
+    ingestion_cfg = cfg.get("ingestion", {})
+    ocr_setting = ingestion_cfg.get("enable_ocr", "auto")
+    ingest_device = ingestion_cfg.get("device", "cpu")
+    pdf_opts = PdfPipelineOptions(
+        accelerator_options=AcceleratorOptions(device=ingest_device),
+    )

    if ocr_setting == "never":
        pdf_opts.do_ocr = False
@@ -150,7 +150,8 @@ def _vector_search(conn: sqlite3.Connection, query: str, model_name: str,
    from kb_search.embeddings import embed_texts

    prefix = cfg.get("embedding", {}).get("query_prefix", "")
-    query_emb = embed_texts(model_name, [query], prefix=prefix)[0]
+    embed_device = cfg.get("embedding", {}).get("device", "cpu")
+    query_emb = embed_texts(model_name, [query], prefix=prefix, device=embed_device)[0]
    blob = struct.pack(f"{len(query_emb)}f", *query_emb)

    # sqlite-vec returns results ordered by distance (lower = more similar)