# kb

Personal knowledge base with hybrid search (full-text + semantic vector search).

v2 uses a client-server architecture: a **FastAPI engine** running in Docker (with GPU acceleration) and a lightweight **Go CLI client** that talks to it over HTTP.

## Architecture

```
Go CLI (kb) ──HTTP──▶ FastAPI Engine (Docker) ──▶ SQLite + GPU
```

- **Engine**: Keeps the embedding model warm in GPU memory. Handles search, ingestion, and document management via REST API. Runs in Docker with NVIDIA or AMD GPU support.
- **Client**: Single static Go binary. No Python, no ML dependencies, instant startup. Talks to the engine over HTTP.
- **Storage**: Single SQLite database with FTS5 (keyword search) and sqlite-vec (vector search). Portable via bind mount — just copy the data directory between hosts.

## Quick start

### 1. Start the engine

**From pre-built images** (recommended):

```bash
# NVIDIA GPU
docker run -d --name kb-engine \
  --gpus all \
  -p 8000:8000 \
  -v ~/kb-data:/data \
  -e KB_MODEL=all-MiniLM-L6-v2 \
  -e KB_DEVICE=auto \
  -e KB_API_KEY=your-secret-key \
  --restart unless-stopped \
  docker.dcglab.co.uk/dcg/kb/engine:latest-nvidia

# AMD GPU (ROCm)
docker run -d --name kb-engine \
  --device /dev/kfd --device /dev/dri \
  --group-add video \
  -p 8000:8000 \
  -v ~/kb-data:/data \
  -e KB_MODEL=all-MiniLM-L6-v2 \
  -e KB_DEVICE=auto \
  -e KB_API_KEY=your-secret-key \
  --restart unless-stopped \
  docker.dcglab.co.uk/dcg/kb/engine:latest-rocm
```

Or use a compose file — create `compose.yaml`:

```yaml
services:
  kb-engine:
    image: docker.dcglab.co.uk/dcg/kb/engine:latest-nvidia  # or latest-rocm
    runtime: nvidia  # remove for ROCm
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    # For ROCm, replace the above runtime/deploy block with:
    # devices:
    #   - "/dev/kfd"
    #   - "/dev/dri"
    # group_add:
    #   - "video"
    ports:
      - "${KB_PORT:-8000}:8000"
    volumes:
      - ${KB_DATA_PATH:-./data}:/data
    environment:
      - KB_MODEL=${KB_MODEL:-all-MiniLM-L6-v2}
      - KB_DEVICE=${KB_DEVICE:-auto}
      - KB_INGEST_DEVICE=${KB_INGEST_DEVICE:-auto}
      - KB_API_KEY=${KB_API_KEY:-}
      - KB_SEARCH_THRESHOLD=${KB_SEARCH_THRESHOLD:-0.01}
      - HF_HUB_OFFLINE=${HF_HUB_OFFLINE:-}
    restart: unless-stopped
```

```bash
KB_DATA_PATH=~/kb-data docker compose up -d
```

**From source** (for development):

```bash
cd engine

# NVIDIA GPU
KB_DATA_PATH=~/kb-data docker compose -f compose.nvidia.yaml up -d

# AMD GPU (ROCm)
KB_DATA_PATH=~/kb-data docker compose -f compose.rocm.yaml up -d
```

On first start, the engine downloads the embedding model (~90 MB) and loads it onto the GPU. Check readiness:

```bash
curl http://localhost:8000/api/v1/health
# {"status": "healthy"}
```
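
For scripted setups, the readiness check can be polled until the engine reports healthy. The following is a minimal Python sketch, assuming the `/api/v1/health` response shown above. The `wait_until_healthy` helper, its `fetch` parameter, and the retry values are illustrative and not part of kb:

```python
import json
import time
import urllib.error
import urllib.request


def fetch_health(url):
    """Return the parsed JSON body of the health endpoint, or None on failure."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return None


def wait_until_healthy(url="http://localhost:8000/api/v1/health",
                       attempts=30, delay=2.0, fetch=fetch_health):
    """Poll the health endpoint until it reports "healthy" or attempts run out."""
    for attempt in range(attempts):
        body = fetch(url)
        if isinstance(body, dict) and body.get("status") == "healthy":
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling `wait_until_healthy()` before the first `kb` command avoids racing the model download on a fresh host.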

### 2. Install the client

**From a release** (recommended):

Check [releases](https://gitea.dcglab.co.uk/steve/kb/releases) for the latest client tag, then:

```bash
# Set the version tag
TAG=client-v2.1.0

# Download the binary for your platform (run only ONE of the curl commands).
# -f makes curl fail on HTTP errors instead of saving an error page as ./kb.

# Linux (amd64)
curl -fL -o kb "https://gitea.dcglab.co.uk/steve/kb/releases/download/${TAG}/kb-linux-amd64"

# Linux (arm64)
curl -fL -o kb "https://gitea.dcglab.co.uk/steve/kb/releases/download/${TAG}/kb-linux-arm64"

# macOS (Apple Silicon)
curl -fL -o kb "https://gitea.dcglab.co.uk/steve/kb/releases/download/${TAG}/kb-darwin-arm64"

# macOS (Intel)
curl -fL -o kb "https://gitea.dcglab.co.uk/steve/kb/releases/download/${TAG}/kb-darwin-amd64"

# Then install
chmod +x kb
sudo mv kb /usr/local/bin/
```

**From source** (for development):

```bash
cd client
make build    # produces ./kb binary
make all      # or cross-compile: dist/kb-{os}-{arch}
```

### 3. Configure the client

The client works with zero configuration if the engine is on `localhost:8000`. To customise, create `~/.kb/client.yaml`:

```yaml
engine_url: http://localhost:8000
api_key: ""
default_format: human
```

Override these settings with environment variables (`KB_ENGINE_URL`, `KB_API_KEY`) or CLI flags (`--engine`, `--api-key`, `--format`).

### 4. Use it

```bash
# Quick notes (shorthand — no subcommand needed)
kb "Always restart nginx after config changes"
kb "Server room is building 3, floor 2" --tags ops

# Add files (async — uploads and exits immediately)
kb addfile ~/docs/manual.pdf --tags admin
kb addfile ~/notes/ --recursive

# Check ingestion progress
kb jobs

# Search
kb search "how to install git"
kb search "deploy process" --tags ops --type pdf

# Manage
kb list
kb info 1
kb tags
kb tag 1 --add important
kb export 1 -o manual.pdf    # download original file
kb remove 3 --yes
kb status
```

## How it works

- **Ingestion**: Files are uploaded to the engine and queued for async processing. The engine chunks documents (PDFs via Docling, markdown by headers, code by AST/functions, notes as whole text), generates embeddings on GPU, and stores everything in SQLite.
- **Search**: Hybrid retrieval combining BM25 keyword scoring (FTS5) and vector similarity (sqlite-vec), merged via Reciprocal Rank Fusion. Queries typically complete in under 100 ms once the model is warm.
- **Output**: JSON (for scripts/LLM tool use) or human-readable terminal format. Use `--format json` on any command.
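
To make the merge step concrete, here is a minimal Python sketch of Reciprocal Rank Fusion over two ranked lists. It is not the engine's code: the constant `k=60` is the value commonly used in the RRF literature and is an assumption here, and the chunk IDs are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list, best first.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by ID for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_hits = [12, 7, 3]      # hypothetical chunk IDs from the FTS5 query
vector_hits = [7, 42, 12]   # hypothetical chunk IDs from the sqlite-vec query
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print([chunk_id for chunk_id, _ in fused])  # → [7, 12, 42, 3]
```

Chunks that appear in both lists (7 and 12) outrank chunks found by only one retriever, which is the property that makes RRF useful for combining keyword and semantic results without normalising their raw scores.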

## Engine configuration

The engine is configured via environment variables, set in the compose file, exported in the shell that runs `docker compose`, or passed with `-e` to `docker run`:

| Variable | Default | Description |
|---|---|---|
| `KB_DATA_DIR` | `/data` | Data directory inside the container (bind-mounted) |
| `KB_MODEL` | `all-MiniLM-L6-v2` | HuggingFace embedding model name |
| `KB_DEVICE` | `auto` | Embedding/search device: `auto`, `cpu`, or `cuda` |
| `KB_INGEST_DEVICE` | `auto` | Docling layout detection device: `auto`, `cpu`, or `cuda` |
| `KB_API_KEY` | (none) | Optional Bearer token for API authentication |
| `KB_SEARCH_THRESHOLD` | `0.01` | Minimum score for search results (filters noise) |
| `KB_PORT` | `8000` | Port to expose |
| `KB_HOST` | `0.0.0.0` | Host to bind to |
| `HF_HUB_OFFLINE` | (none) | Set to `1` to prevent model downloads (use cached only) |
| `KB_DATA_PATH` | `./data` | Host path for bind mount (compose variable, not used by engine) |
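
The table above can be read as a set of defaults applied when a variable is unset. The sketch below shows one way to load it in Python. The `EngineSettings` class and `load_settings` function are hypothetical names, and treating an empty string as "unset" is an assumption based on the compose file passing `KB_API_KEY=${KB_API_KEY:-}`:

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EngineSettings:
    data_dir: str
    model: str
    device: str
    ingest_device: str
    api_key: Optional[str]
    search_threshold: float
    port: int
    host: str
    hf_hub_offline: bool


def load_settings(env=None):
    """Build settings from a mapping of environment variables (default: os.environ)."""
    env = os.environ if env is None else env
    return EngineSettings(
        data_dir=env.get("KB_DATA_DIR", "/data"),
        model=env.get("KB_MODEL", "all-MiniLM-L6-v2"),
        device=env.get("KB_DEVICE", "auto"),
        ingest_device=env.get("KB_INGEST_DEVICE", "auto"),
        # Empty string counts as "no key" (assumption, see lead-in).
        api_key=env.get("KB_API_KEY") or None,
        search_threshold=float(env.get("KB_SEARCH_THRESHOLD", "0.01")),
        port=int(env.get("KB_PORT", "8000")),
        host=env.get("KB_HOST", "0.0.0.0"),
        hf_hub_offline=env.get("HF_HUB_OFFLINE", "") == "1",
    )
```

`KB_DATA_PATH` is deliberately absent: it only controls the host side of the bind mount and never reaches the engine.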

## Data portability

The data directory contains everything: SQLite database, model cache, and staging files. To migrate between hosts:

```bash
# On source host
rsync -a ~/kb-data/ user@target:/home/user/kb-data/

# On target host
KB_DATA_PATH=~/kb-data docker compose -f compose.nvidia.yaml up -d
```

Data is GPU-vendor-agnostic — you can ingest on NVIDIA and serve from AMD (or vice versa) with the same data directory.

## API reference

All endpoints are under `/api/v1/`. When `KB_API_KEY` is set, every request except `/health` must include an `Authorization: Bearer <key>` header.

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/health` | Health check (bypasses auth) |
| `POST` | `/search` | Hybrid search (JSON body) |
| `POST` | `/jobs` | Upload file/note for ingestion (multipart; returns 202 when accepted, 409 if duplicate) |
| `GET` | `/jobs` | List ingestion jobs |
| `GET` | `/jobs/{id}` | Job details |
| `GET` | `/documents` | List documents |
| `GET` | `/documents/{id}` | Document details with chunks |
| `GET` | `/documents/{id}/file` | Download original file |
| `DELETE` | `/documents/{id}` | Remove a document (and stored file) |
| `PUT` | `/documents/{id}/tags` | Add/remove tags |
| `GET` | `/tags` | List all tags |
| `GET` | `/status` | Engine status, GPU info, DB stats |
| `POST` | `/reindex` | Re-embed all chunks |
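
For callers that bypass the Go client, a request to the search endpoint might be assembled as in the Python sketch below. The endpoint path and Bearer header come from this section; the `"query"` field name is an assumption, since the JSON body schema is not documented here, and `build_search_request` is a hypothetical helper:

```python
import json
import urllib.request


def build_search_request(query, base_url="http://localhost:8000", api_key=None):
    """Build (but do not send) a POST /api/v1/search request."""
    # The "query" field name is an assumption; check the engine for the real schema.
    body = json.dumps({"query": query}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed when the engine was started with KB_API_KEY set.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v1/search",
        data=body,
        headers=headers,
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request. Keeping construction separate from sending makes the header logic easy to inspect without a running engine.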

## Building and releasing

Client and engine are versioned independently via `client/VERSION` and `engine/VERSION`. Each has its own release script and git tag prefix.

### Release client

```bash
./release-client.sh --gitea              # patch bump, release via Gitea
./release-client.sh --github --minor     # minor bump, release via GitHub
./release-client.sh --gitea --no-increment  # release current version as-is
./release-client.sh --gitea --dry-run    # preview without doing anything
```

Creates tag `client-vX.Y.Z`, builds Go binaries for all platforms, and creates a Gitea/GitHub release with binaries attached.

The client embeds a `MinEngineVersion` (from `client/MIN_ENGINE_VERSION`) and will hard-fail if the connected engine is too old.
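
The compatibility check amounts to a numeric comparison of version components. The client itself is written in Go, so the Python sketch below only illustrates the logic. It assumes plain `X.Y.Z` version strings (optionally prefixed with `v`), and both function names are hypothetical:

```python
def parse_version(text):
    """Turn "2.0.6" or "v2.0.6" into a comparable tuple such as (2, 0, 6)."""
    core = text.strip().lstrip("v")
    return tuple(int(part) for part in core.split("."))


def engine_is_supported(engine_version, min_engine_version):
    """Return True if the engine is at least the minimum required version."""
    return parse_version(engine_version) >= parse_version(min_engine_version)
```

Comparing tuples of integers rather than raw strings matters once a component reaches two digits: as strings, `"2.10.0"` sorts before `"2.9.3"`, while as tuples `(2, 10, 0)` is correctly the newer version.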

### Release engine

```bash
./release-engine.sh --gitea              # patch bump, release via Gitea
./release-engine.sh --github --minor     # minor bump, release via GitHub
./release-engine.sh --gitea --no-increment  # release current version as-is
./release-engine.sh --gitea --dry-run    # preview without doing anything
```

Creates tag `engine-vX.Y.Z`, builds NVIDIA and ROCm Docker images, creates a Gitea/GitHub release, and pushes images to the registry.

### Checking versions

```bash
# Client
kb --version

# Engine
# Add -H "Authorization: Bearer $KB_API_KEY" if the engine has KB_API_KEY set
curl -s http://localhost:8000/api/v1/status | jq .version
```

### Docker images

Images are pushed to `docker.dcglab.co.uk/dcg/kb/engine` with tags:

- `engine-v2.0.6-nvidia` / `engine-v2.0.6-rocm` — versioned
- `latest-nvidia` / `latest-rocm` — latest release

Override the registry and org via environment variables:

```bash
REGISTRY=ghcr.io IMAGE_ORG=myorg ./release-engine.sh --github
```

## Future: ROCm runtime migration

The ROCm execution provider (shipped in the `onnxruntime-rocm` package) was removed from ONNX Runtime as of v1.23. AMD is pushing toward the **MIGraphX execution provider** as the replacement for ROCm GPU inference. When upgrading ONNX Runtime beyond v1.22, the ROCm Dockerfile will need to switch from `onnxruntime-rocm` to `onnxruntime` with the MIGraphX EP, and install the `migraphx` runtime libraries instead.

## Claude Code skill

This tool is designed to be wrapped as a Claude Code skill. See `SKILL.md` for the skill definition.