kb/openspec/changes/kb-search/design.md

## Context

This is a greenfield Python CLI project: there is no existing codebase and there are no migration concerns. The tool is installed via `pipx install kb-search` and keeps its configuration and data under `~/.kb/` on the user's machine. It must work entirely offline after the initial model download.

Primary consumer is Claude Code (or similar LLM tools) via a skill wrapper that calls `kb search` and feeds JSON results to the LLM for synthesis. Secondary consumer is the user directly in a terminal. This dual-consumer constraint means output must be machine-parseable first, human-readable second.

The document corpus is ~3,000 items (2,000 PDFs of varying complexity, 500 markdown/text notes, 500 code snippets) producing ~22,000 chunks. This is small enough that brute-force vector search is viable and SQLite is more than sufficient.

## Goals / Non-Goals

**Goals:**
- Single-command install (`pipx install kb-search`) with `kb init` for model setup
- Ingest heterogeneous documents with format-appropriate chunking
- Hybrid search (keyword + semantic) with a single command
- JSON output contract stable enough for skill integration
- Configurable but works with zero configuration
- All state in one SQLite file for easy backup/portability

**Non-Goals:**
- LLM-based answer synthesis (the calling skill handles this)
- Multi-user or networked access
- Real-time / streaming ingestion
- Web UI or TUI dashboard
- Support for every possible document format (start with PDF, markdown, code, notes)
- Clustering, deduplication, or automatic organisation of documents

## Decisions

### 1. Package Structure

```
kb-search/
├── pyproject.toml
├── src/
│   └── kb_search/
│       ├── __init__.py
│       ├── cli.py              # Click CLI entry point
│       ├── config.py           # YAML config loading + ENV overrides
│       ├── database.py         # SQLite schema, migrations, connection
│       ├── embeddings.py       # Model download, loading, inference
│       ├── search.py           # Hybrid search + RRF merging
│       ├── ingest/
│       │   ├── __init__.py
│       │   ├── detector.py     # File type detection + routing
│       │   ├── docling.py      # Docling pipeline (PDF, DOCX, HTML, images)
│       │   ├── markdown.py     # Header-based markdown splitting
│       │   ├── code.py         # AST/regex code splitting
│       │   └── note.py         # Whole-document note handler
│       └── output.py           # JSON + human-readable formatters
├── tests/
└── SKILL.md                    # Claude Code skill definition
```

**Why this structure:** Flat enough to navigate easily, but the `ingest/` subpackage isolates format-specific logic. Each ingestion module exports the same interface (`ingest(path, config) -> list[Chunk]`), making it easy to add formats later. The `src/` layout follows Python packaging best practice.
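A minimal sketch of what the shared ingestion interface could look like. The `Chunk` fields, the `route` helper, and the suffix table are assumptions made for illustration; only the `ingest(path, config) -> list[Chunk]` shape comes from the design.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Callable


@dataclass
class Chunk:
    """One indexable unit of text (field names are assumptions)."""
    text: str
    chunk_index: int
    metadata: dict[str, Any] = field(default_factory=dict)


# Every ingest module exposes a callable with this shape.
Ingestor = Callable[[Path, dict[str, Any]], list[Chunk]]


def ingest_note(path: Path, config: dict[str, Any]) -> list[Chunk]:
    """Whole-document handler: the entire file becomes a single chunk."""
    return [Chunk(text=path.read_text(encoding="utf-8"), chunk_index=0)]


# Hypothetical suffix-to-handler table; detector.py would own the real one.
HANDLERS: dict[str, Ingestor] = {".txt": ingest_note}


def route(path: Path) -> Ingestor:
    """Pick the handler for a file, failing loudly on unknown types."""
    try:
        return HANDLERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"unsupported file type: {path.suffix!r}") from None
```

Adding a format then means writing one more function with the same signature and registering it in the table.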

### 2. SQLite as Sole Storage Backend

All data lives in `~/.kb/kb.db`:

```sql
-- Documents
CREATE TABLE documents (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    source_path TEXT,
    content_hash TEXT NOT NULL,          -- SHA-256 for dedup/change detection
    doc_type TEXT NOT NULL CHECK(doc_type IN ('pdf','markdown','code','note')),
    language TEXT,                        -- for code: 'python','bash','go'
    created_at TEXT DEFAULT (datetime('now')),
    metadata TEXT DEFAULT '{}'           -- JSON: page_count, author, etc.
);

-- Chunks
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_index INTEGER NOT NULL,
    text TEXT NOT NULL,
    token_count INTEGER,
    metadata TEXT DEFAULT '{}',          -- JSON: page, section_header, symbol_name
    created_at TEXT DEFAULT (datetime('now'))
);

-- FTS5 index (content-sync with chunks table)
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    text,
    content='chunks',
    content_rowid='id',
    tokenize='porter unicode61'
);

-- Triggers to keep FTS in sync
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;

-- Vector storage (sqlite-vec)
CREATE VIRTUAL TABLE chunks_vec USING vec0(
    chunk_id INTEGER PRIMARY KEY,
    embedding FLOAT[384]                 -- dimension matches model
);

-- Tags
CREATE TABLE tags (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE NOT NULL
);

CREATE TABLE document_tags (
    document_id INTEGER REFERENCES documents(id) ON DELETE CASCADE,
    tag_id INTEGER REFERENCES tags(id) ON DELETE CASCADE,
    PRIMARY KEY (document_id, tag_id)
);

-- Config stored in DB (model binding)
CREATE TABLE config (
    key TEXT PRIMARY KEY,
    value TEXT NOT NULL
);
-- Keys: schema_version, model_name, embedding_dim, model_max_tokens
```

**Why SQLite for everything:** At ~22,000 chunks, SQLite handles FTS, vector search, and relational data without breaking a sweat. One file = trivial backup (`cp kb.db kb.db.bak`), no server process, no port conflicts. FTS5 is built into SQLite. sqlite-vec is a single loadable extension.
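One detail the schema depends on: SQLite only enforces foreign keys, and therefore `ON DELETE CASCADE`, when `PRAGMA foreign_keys = ON` has been issued on the connection. A minimal sketch of a connection helper follows; the function name `connect` is an assumption, and loading sqlite-vec is left out so the example runs with the standard library alone.

```python
import sqlite3


def connect(db_path: str) -> sqlite3.Connection:
    """Open the knowledge-base DB with foreign-key enforcement enabled."""
    conn = sqlite3.connect(db_path)
    # Without this pragma, deleting a document would leave its chunks behind.
    conn.execute("PRAGMA foreign_keys = ON")
    # The sqlite-vec extension would also be loaded here (omitted in this sketch).
    return conn
```

With the pragma set, deleting a row from `documents` removes its `chunks` rows, and the `chunks_ad` trigger in turn keeps the FTS index in step.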

**Why store config in DB _and_ YAML:** The YAML file holds user preferences (chunking params, model choice). The DB `config` table records what the DB was _actually built with_ (model name, dimension). This separation lets us detect mismatches: "config says use nomic-embed-text but DB was built with all-MiniLM-L6-v2."
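The mismatch check could be sketched as below. The exception class and function name are hypothetical; the `config` table and its `model_name` and `embedding_dim` keys are the ones defined in the schema above.

```python
import sqlite3


class ModelMismatchError(RuntimeError):
    """Configured model differs from the one the DB was built with."""


def check_model_binding(conn: sqlite3.Connection, model_name: str,
                        embedding_dim: int) -> None:
    """Raise if the YAML-selected model does not match the DB binding."""
    rows = dict(conn.execute(
        "SELECT key, value FROM config "
        "WHERE key IN ('model_name', 'embedding_dim')"
    ))
    db_model = rows.get("model_name")
    db_dim = rows.get("embedding_dim")
    if db_model is None or db_dim is None:
        raise ModelMismatchError("database has no model binding; run `kb init` first")
    # Values are stored as TEXT, so the dimension is converted before comparing.
    if db_model != model_name or int(db_dim) != embedding_dim:
        raise ModelMismatchError(
            f"config says use {model_name} ({embedding_dim}d) but DB was built "
            f"with {db_model} ({db_dim}d); run `kb reindex` to switch models"
        )
```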

**Alternatives considered:**
- ChromaDB/Qdrant: External services, overkill for this scale, breaks single-file story
- DuckDB: Good at analytics, but FTS support is weaker than SQLite FTS5
- LanceDB: Interesting but less mature, no FTS built in

### 3. Docling for Complex Document Ingestion

Docling handles PDF, DOCX, HTML, and image files through a unified pipeline with ML-based layout detection and table reconstruction.

**Why Docling over simpler extractors:** The 2,000 PDFs are "many and varied" — simple text extraction (pymupdf, pdfplumber) works for clean PDFs but silently produces garbage for complex layouts, tables, or multi-column documents. Docling's layout model correctly identifies structural elements, and its table reconstruction preserves data that would otherwise be lost. The quality difference matters because bad chunks → bad search results → useless tool.

**Docling configuration for this project:**
- Use the `pypdfium2` backend (fast for text-based PDFs; confirm against the installed Docling version whether it is the default)
- Enable OCR only when needed (detect pages with no extractable text)
- Use hierarchy-aware chunking (respects section/paragraph boundaries)
- Disable image extraction (we're indexing text, not images)
- Run with multiple workers for batch ingestion

**Model download:** Docling models (~1.5 GB) download on first use or via `kb init`. They are stored in HuggingFace's default cache (`~/.cache/huggingface/`), consistent with the embedding models.

**Alternatives considered:**
- pymupdf4llm: Fast, lightweight, but poor table/layout handling
- Unstructured: Heavier than Docling, commercial focus, less predictable output
- LlamaParse: Cloud-only, violates local-first constraint

### 4. Per-Type Chunking Strategy

Each document type gets a purpose-built chunker with configurable parameters:

**PDF (Docling):** Hierarchy-aware chunking. Docling's `HierarchicalChunker` splits at section/paragraph boundaries respecting the document's logical structure. Falls back to fixed-size if hierarchy detection fails.

**Markdown:** Header-based splitting. Split at `##` and `###` boundaries. Preserve parent header chain as context (so a chunk under "## Config > ### Advanced" carries that path). Merge small sections (< `min_tokens`) with their neighbor. Configurable: `min_tokens`, `max_tokens`.
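A minimal sketch of the header-based splitter, under two simplifications that are assumptions of this example: the header path is built from titles only (`Config > Advanced`), and the `min_tokens` merge and `max_tokens` cap are left out.

```python
import re

# Matches only level-2 and level-3 ATX headers ("## x" / "### x").
HEADER_RE = re.compile(r"^(#{2,3})\s+(.*\S)\s*$")


def split_markdown(text: str) -> list[dict]:
    """Split at ## and ### headers, recording the parent header chain."""
    chunks: list[dict] = []
    chain: dict[int, str] = {}       # header level -> current title
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            path = " > ".join(chain[level] for level in sorted(chain))
            chunks.append({"header_path": path, "text": content})
        body.clear()

    for line in text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence   # '#' lines inside code fences are not headers
        match = None if in_fence else HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            chain[level] = match.group(2)
            # A new ## header invalidates the ### header of the previous section.
            for deeper in [lvl for lvl in chain if lvl > level]:
                del chain[deeper]
        else:
            body.append(line)
    flush()
    return chunks
```

Storing `header_path` in the chunk's `metadata` JSON would let search results show where in the document a match came from.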

**Code (Python):** Use stdlib `ast` module. Each function and class becomes a chunk. Class methods include the class docstring for context. Top-level code between definitions becomes its own chunk.
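One way to read this with the stdlib `ast` module is sketched below. It is one interpretation, not the settled design: classes are emitted whole and their methods are additionally emitted as separate chunks prefixed with the first line of the class docstring. Decorators are not included in the extracted text, and the chunk dictionary keys are assumptions.

```python
import ast

FUNCTION_NODES = (ast.FunctionDef, ast.AsyncFunctionDef)


def chunk_python(source: str) -> list[dict]:
    """Split Python source into function, class, method, and loose-code chunks."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: list[dict] = []
    loose: list[ast.stmt] = []   # run of top-level statements between definitions

    def segment(first: ast.AST, last: ast.AST) -> str:
        return "\n".join(lines[first.lineno - 1 : last.end_lineno])

    def flush_loose() -> None:
        if loose:
            chunks.append({"symbol": None, "text": segment(loose[0], loose[-1])})
            loose.clear()

    for node in tree.body:
        if isinstance(node, FUNCTION_NODES):
            flush_loose()
            chunks.append({"symbol": node.name, "text": segment(node, node)})
        elif isinstance(node, ast.ClassDef):
            flush_loose()
            chunks.append({"symbol": node.name, "text": segment(node, node)})
            doc = ast.get_docstring(node)
            for item in node.body:
                if isinstance(item, FUNCTION_NODES):
                    text = segment(item, item)
                    if doc:
                        # Prefix the first docstring line as class context.
                        text = f"# class {node.name}: {doc.splitlines()[0]}\n{text}"
                    chunks.append({"symbol": f"{node.name}.{item.name}", "text": text})
        else:
            loose.append(node)
    flush_loose()
    return chunks
```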

**Code (Bash):** Regex-based. Split on `function name() {` and `name() {` patterns with brace-depth counting. Comment blocks preceding a function attach to that function's chunk. Fall back to fixed-size windowed chunks if no functions detected.
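A rough sketch of the regex plus brace-depth approach. It is deliberately naive: braces are counted per line without regard to quoting or here-documents, and a shebang line directly above a function would be treated as a comment. The function name and chunk keys are assumptions, and the fixed-size fallback is not shown.

```python
import re

# Matches "function name() {" and "name() {" at the start of a line.
FUNC_RE = re.compile(r"^\s*(?:function\s+)?([A-Za-z_][\w-]*)\s*\(\)\s*\{")


def split_bash_functions(source: str) -> list[dict]:
    """Return one chunk per shell function, including its leading comments."""
    lines = source.splitlines()
    chunks: list[dict] = []
    i = 0
    while i < len(lines):
        match = FUNC_RE.match(lines[i])
        if not match:
            i += 1
            continue
        # Walk backwards to attach the comment block preceding the function.
        start = i
        while start > 0 and lines[start - 1].lstrip().startswith("#"):
            start -= 1
        # Walk forwards until the opening brace is balanced again.
        depth = 0
        end = len(lines) - 1
        for j in range(i, len(lines)):
            depth += lines[j].count("{") - lines[j].count("}")
            if depth <= 0:
                end = j
                break
        chunks.append({"symbol": match.group(1),
                       "text": "\n".join(lines[start:end + 1])})
        i = end + 1
    return chunks
```

An empty result is the signal to fall back to fixed-size windowed chunks.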

**Code (Go):** Regex-based. Split on `func ` declarations. Type definitions with methods are grouped. Fall back to fixed-size if no recognisable boundaries.

**Notes:** Whole document = one chunk. Notes are small by definition.

**Configurable defaults (in `~/.kb/config.yaml`):**
```yaml
chunking:
  defaults:
    max_tokens: 512
    overlap_tokens: 50
  pdf:
    strategy: hierarchy    # hierarchy | fixed
    max_tokens: 1024       # for fixed strategy fallback
  markdown:
    strategy: header       # header | fixed
    min_tokens: 50         # merge sections smaller than this
    max_tokens: 1024
  code:
    strategy: ast          # ast | fixed
    include_context: true  # include class/module docstring with methods
    max_tokens: 1024
  note:
    strategy: whole
```

### 5. Embedding Model Management

**Default model:** `all-MiniLM-L6-v2` (384 dimensions, 90 MB, good quality/speed tradeoff for CPU).

**Model loading:** Use `sentence-transformers` library which provides a unified API across models. Models stored in HuggingFace's default cache (`~/.cache/huggingface/`), shared with other tools that use HF models. No custom cache directory override.

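**Model binding:** On `kb init`, the chosen model's name and dimension are written to the DB `config` table. Every subsequent `kb add` and `kb search` checks that the loaded model matches the DB, since query and chunk embeddings are only comparable when they come from the same model. A mismatch is a hard error with a clear message.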

**Model switching (`kb reindex`):**
1. Download new model
2. Read all chunks from DB
3. Recreate the `chunks_vec` table if the dimension changed
4. Re-embed in batches (with progress bar)
5. Replace all vectors in `chunks_vec`
6. Update DB config (model_name, embedding_dim)
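The reindex flow might be sketched as follows. Several things here are assumptions: the `embed` callable stands in for the real model, a plain table with a BLOB column stands in for the `vec0` virtual table so the example runs without sqlite-vec, and the vector table is always dropped and recreated before any vectors are written, which covers the changed-dimension case without a separate branch.

```python
import sqlite3
import struct
from typing import Callable, Sequence

Embedder = Callable[[Sequence[str]], list[list[float]]]


def reindex(conn: sqlite3.Connection, embed: Embedder, model_name: str,
            dim: int, batch_size: int = 64) -> int:
    """Re-embed every chunk with a new model and rebind the DB to it."""
    # Recreate the vector table first, so the new dimension always fits.
    # Stand-in for: CREATE VIRTUAL TABLE chunks_vec USING vec0(...)
    conn.execute("DROP TABLE IF EXISTS chunks_vec")
    conn.execute(
        "CREATE TABLE chunks_vec (chunk_id INTEGER PRIMARY KEY, embedding BLOB)"
    )

    rows = conn.execute("SELECT id, text FROM chunks ORDER BY id").fetchall()
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        vectors = embed([text for _, text in batch])
        for (chunk_id, _), vector in zip(batch, vectors):
            if len(vector) != dim:
                raise ValueError(f"expected {dim} dimensions, got {len(vector)}")
            blob = struct.pack(f"{dim}f", *vector)   # float32, native byte order
            conn.execute("INSERT INTO chunks_vec VALUES (?, ?)", (chunk_id, blob))

    # Record the new binding last, so the config table never runs ahead of the data.
    conn.executemany(
        "INSERT OR REPLACE INTO config (key, value) VALUES (?, ?)",
        [("model_name", model_name), ("embedding_dim", str(dim))],
    )
    conn.commit()
    return len(rows)
```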

**ONNX Runtime for inference:** Use `sentence-transformers` with the ONNX backend (`model = SentenceTransformer(model_name, backend="onnx")`, which requires the `sentence-transformers[onnx]` extra). This keeps sentence-transformers' tokenization, pooling, and normalization while running inference on ONNX Runtime (~30 MB) instead of PyTorch (~200 MB). Models that do not ship an ONNX file are exported to ONNX automatically on first load. One caveat to verify before release: `sentence-transformers` itself declares PyTorch as an install dependency, and Docling also depends on it, so the saving applies to the inference path and may not carry over to the installed footprint.

**Model compatibility:** All models on HuggingFace that work with `sentence-transformers` are supported. The only per-model differences handled in code:
- Dimension (read from model config)
- Max sequence length (read from model config, used to cap chunk size)
- Query/passage prefixes (configurable in YAML, empty by default)

```yaml
embedding:
  model: all-MiniLM-L6-v2
  query_prefix: ""          # some models need "search_query: "
  passage_prefix: ""        # some models need "search_document: "
```

### 6. Hybrid Search with Reciprocal Rank Fusion

**Search flow:**

```
Query: "how to install git"
         │
         ├──▶ FTS5 query ──▶ BM25-ranked results (chunk_id, fts_score)
         │
         └──▶ Embed query ──▶ vec similarity search ──▶ (chunk_id, vec_score)
                              (cosine distance, top-K)
         │
         ▼
    Reciprocal Rank Fusion (RRF)
    score(d) = Σ 1/(k + rank_in_list)  where k=60 (standard)
         │
         ▼
    Merged results, sorted by RRF score
         │
         ▼
    Apply filters (tags, doc_type) ──▶ Top-N results
```

**Why RRF over learned re-ranking:** RRF is simple, needs no training, and has a single constant (k, conventionally 60) that rarely needs tuning, yet it performs well in practice. A learned re-ranker (e.g., a cross-encoder) would add another model download and slow down queries, and the marginal quality improvement isn't worth it at this scale. RRF can be swapped out later if needed.
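The fusion step itself is only a few lines. The sketch below assumes each retriever returns chunk IDs ordered best-first; the function name and the tie-break on chunk ID are illustrative choices.

```python
from collections import defaultdict
from typing import Iterable, Sequence


def rrf_merge(ranked_lists: Iterable[Sequence[int]], k: int = 60,
              top_n: int = 10) -> list[tuple[int, float]]:
    """Merge best-first rankings with Reciprocal Rank Fusion."""
    scores: dict[int, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            # score(d) = sum over lists of 1 / (k + rank of d in that list)
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by chunk ID for stable output.
    merged = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return merged[:top_n]
```

For example, `rrf_merge([[1, 2, 3], [3, 1, 4]])` ranks chunk 1 first because it sits at positions 1 and 2, ahead of chunk 3 at positions 3 and 1. With two lists and k=60, a raw RRF score cannot exceed 2/61 ≈ 0.033, so a 0-1 `score` like the one in the JSON example of section 7 would presumably need a normalisation step, which this sketch leaves out.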

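**FTS5 query construction:** FTS5's porter stemmer handles basic normalisation, so the user's query needs no stemming of its own. It cannot be passed through completely raw, however: FTS5 has its own query syntax, so characters such as `"`, `-`, `*`, and `:` must be escaped, for example by wrapping each term in double quotes. There is no query expansion or synonym handling; keep it simple.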
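A small sketch of one escaping strategy: quote every whitespace-separated term as an FTS5 string, doubling any embedded double quotes. The function name is hypothetical, and treating the query as an implicit AND of terms is an assumption about the desired behaviour.

```python
def to_fts5_query(raw: str) -> str:
    """Turn free text into a safe FTS5 MATCH expression."""
    terms = raw.split()
    # Inside an FTS5 string, a double quote is escaped by doubling it.
    return " ".join('"' + term.replace('"', '""') + '"' for term in terms)
```

Quoted terms still pass through the tokenizer, so stemming continues to apply. An empty query yields an empty string, which the caller should reject before issuing a `MATCH`.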

**Vector search:** Embed the query with the same model used for the chunks, then retrieve the top K candidates (K = 3× the requested result count, so RRF has enough to merge). sqlite-vec performs a brute-force similarity scan over all vectors; at 22K vectors this is expected to take roughly 2-5 ms.

**Filter application:** Filters are pushed into the SQL queries where possible (for FTS5, via a JOIN against `documents` and `document_tags`); anything that cannot be pushed down is applied as a post-filter on the merged results, as in the flow above. The choice is made per filter type:
- Type filter: Applied in the SQL query (efficient)
- Tag filter: Applied in the SQL query via JOIN (efficient)
- Score threshold: Applied post-RRF as a cutoff
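For the FTS5 side, pushing the type and tag filters into the query could look like the sketch below. The function name is hypothetical, and matching documents that carry any of the requested tags (rather than all of them) is an assumption. The table and column names follow the schema in section 2.

```python
from typing import Optional, Sequence


def build_fts_query(match: str, top_k: int, doc_type: Optional[str] = None,
                    tags: Sequence[str] = ()) -> tuple[str, list]:
    """Build a parameterised FTS5 query with filters applied in SQL."""
    sql = [
        "SELECT c.id AS chunk_id, bm25(chunks_fts) AS fts_score",
        "FROM chunks_fts",
        "JOIN chunks c ON c.id = chunks_fts.rowid",
        "JOIN documents d ON d.id = c.document_id",
        "WHERE chunks_fts MATCH ?",
    ]
    params: list = [match]
    if doc_type is not None:
        sql.append("AND d.doc_type = ?")
        params.append(doc_type)
    if tags:
        placeholders = ", ".join("?" for _ in tags)
        sql.append(
            "AND EXISTS (SELECT 1 FROM document_tags dt "
            "JOIN tags t ON t.id = dt.tag_id "
            f"WHERE dt.document_id = d.id AND t.name IN ({placeholders}))"
        )
        params.extend(tags)
    # bm25() returns lower values for better matches, so ascending order is best-first.
    sql.append("ORDER BY fts_score LIMIT ?")
    params.append(top_k)
    return "\n".join(sql), params
```

Only placeholders are interpolated into the SQL text; every user-supplied value travels as a bound parameter.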

### 7. Output Format (Skill Contract)

**JSON output (`--format json`, default):**

```json
{
  "query": "how to install git",
  "results": [
    {
      "chunk_id": 1423,
      "score": 0.87,
      "score_breakdown": {"fts": 0.72, "vector": 0.94},
      "text": "To install the latest version of git from source...",
      "source": {
        "document_id": 42,
        "title": "Git Admin Guide",
        "path": "/home/user/docs/git-admin.pdf",
        "type": "pdf",
        "page": 12,
        "chunk_index": 3,
        "total_chunks": 28,
        "tags": ["git", "admin"]
      }
    }
  ],
  "total_matches": 47,
  "returned": 10
}
```

**Human output (`--format human`):**

```
Search: "how to install git" (47 matches, showing top 10)

 1. [0.87] Git Admin Guide (p.12)                    [pdf] [git, admin]
    To install the latest version of git from source...

 2. [0.65] setup-notes.md §Installation               [markdown] [git]
    First, add the PPA repository for the latest git...
```

**Stability commitment:** The JSON schema is the contract with the skill. Fields may be _added_ but not removed or renamed once the skill is built.

### 8. Configuration Architecture

```
Precedence (highest to lowest):
  1. CLI flags (--top, --tags, --format)
  2. Environment variables (KB_MODEL, KB_DATA_DIR, KB_DEFAULT_TOP)
  3. ~/.kb/config.yaml
  4. Built-in defaults

ENV variable naming: KB_ prefix + UPPER_SNAKE_CASE
  KB_DATA_DIR     → ~/.kb/
  KB_MODEL        → all-MiniLM-L6-v2
  KB_DEFAULT_TOP  → 10
```
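The precedence chain can be sketched as a series of dictionary merges. This example flattens the settings to three keys for brevity; the key names, the `resolve_settings` function, and the flattening are assumptions, while the environment variable names are the ones listed above.

```python
import os
from typing import Any, Mapping

DEFAULTS: dict[str, Any] = {
    "data_dir": "~/.kb",
    "model": "all-MiniLM-L6-v2",
    "default_top": 10,
}
ENV_VARS = {
    "data_dir": "KB_DATA_DIR",
    "model": "KB_MODEL",
    "default_top": "KB_DEFAULT_TOP",
}


def resolve_settings(file_config: Mapping[str, Any],
                     cli_flags: Mapping[str, Any],
                     environ: Mapping[str, str] = os.environ) -> dict[str, Any]:
    """Apply defaults < config.yaml < environment < CLI flags."""
    resolved = dict(DEFAULTS)
    resolved.update({k: v for k, v in file_config.items() if k in DEFAULTS})
    for key, var in ENV_VARS.items():
        if var in environ:
            # Environment values are strings; coerce to the default's type.
            resolved[key] = type(DEFAULTS[key])(environ[var])
    # A flag left unset on the command line arrives as None and is ignored.
    resolved.update({k: v for k, v in cli_flags.items() if v is not None})
    return resolved
```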

**Full default config.yaml:**

```yaml
# ~/.kb/config.yaml

data_dir: ~/.kb

embedding:
  model: all-MiniLM-L6-v2
  query_prefix: ""
  passage_prefix: ""

search:
  default_top: 10
  default_format: json
  rrf_k: 60

chunking:
  defaults:
    max_tokens: 512
    overlap_tokens: 50
  pdf:
    strategy: hierarchy
    max_tokens: 1024
  markdown:
    strategy: header
    min_tokens: 50
    max_tokens: 1024
  code:
    strategy: ast
    include_context: true
    max_tokens: 1024
  note:
    strategy: whole

ingestion:
  workers: 4                 # parallel Docling workers
  batch_size: 50             # commit to DB every N documents
  enable_ocr: auto           # auto | always | never
```

### 9. CLI Framework: Click

**Why Click:** Mature, well-documented, supports nested command groups, automatic `--help` generation, parameter validation, and progress bars (via `click.progressbar`). Typer, which is itself built on Click, adds type-hint-driven declarations but gives less direct control. argparse is too verbose for this many commands.

### 10. Error Handling and Resumability

**Batch ingestion must be resumable.** When adding a directory of 2,000 PDFs:
- Each document is processed independently
- On success: document + chunks inserted in a single transaction
- On failure: error logged, document skipped, processing continues
- `content_hash` (SHA-256 of file contents) enables skip-if-already-indexed
- Progress shown via `click.progressbar` or `rich.progress`
- Summary at end: "Added 1,847 documents. 12 failed. 141 skipped (already indexed)."

Failed documents are logged to `~/.kb/ingest-errors.log` with the file path and error for later investigation.
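A sketch of the resumable loop, assuming a caller-supplied `ingest_one` callable that inserts the document and its chunks. The function names and the tab-separated log format are assumptions; the `content_hash` lookup and the per-document transaction follow the bullets above.

```python
import hashlib
import sqlite3
from pathlib import Path
from typing import Callable, Iterable

IngestOne = Callable[[sqlite3.Connection, Path, str], None]


def file_hash(path: Path) -> str:
    """SHA-256 of the file contents, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def ingest_all(conn: sqlite3.Connection, paths: Iterable[Path],
               ingest_one: IngestOne, error_log: Path) -> dict[str, int]:
    """Process each file independently; skip known hashes, log failures."""
    counts = {"added": 0, "failed": 0, "skipped": 0}
    for path in paths:
        try:
            digest = file_hash(path)
            seen = conn.execute(
                "SELECT 1 FROM documents WHERE content_hash = ?", (digest,)
            ).fetchone()
            if seen:
                counts["skipped"] += 1
                continue
            with conn:   # commits on success, rolls back if ingest_one raises
                ingest_one(conn, path, digest)
            counts["added"] += 1
        except Exception as exc:
            counts["failed"] += 1
            with error_log.open("a", encoding="utf-8") as log:
                log.write(f"{path}\t{exc!r}\n")
    return counts
```

Because a failed document is rolled back rather than half-written, re-running the same command retries it while skipping everything that already succeeded.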

## Risks / Trade-offs

**[Docling model size] → Mitigation:** ~1.5 GB download on first init. Clear progress indication during download. Models are cached in the HuggingFace default cache (`~/.cache/huggingface/`), so the download happens once. Document this in `kb init` output and SKILL.md.

**[Docling ingestion speed on CPU] → Mitigation:** An estimated ~17 hours for 2,000 PDFs on CPU (about 30 seconds per document). Support parallel workers (configurable). Show per-document progress. Resumable by design (skip already-indexed). Suggest GPU if available. This is a one-time cost.

**[ONNX model export on first load] → Mitigation:** First time a model is loaded, sentence-transformers exports it to ONNX format. This takes 10-30 seconds and is cached for subsequent runs. Users see a one-time delay on first `kb add` or `kb search` after init. Show a clear message: "Optimising model for ONNX inference (one-time)..."

**[sqlite-vec maturity] → Mitigation:** sqlite-vec is relatively new. At 22K vectors a brute-force scan is fast enough that we do not depend on any approximate-nearest-neighbour index. If sqlite-vec has issues, swapping to numpy cosine similarity over a stored blob column is straightforward: same DB, different query path.

**[FTS5 trigger sync] → Mitigation:** FTS5 content-sync triggers add write overhead. At our scale (inserts during ingestion, not real-time) this is negligible. If it becomes an issue, switch to manual sync with `INSERT INTO chunks_fts(chunks_fts) VALUES('rebuild')` after batch operations.

**[Model lock-in] → Mitigation:** Changing embedding models requires full reindex (~22K embeddings, ~10-30 minutes on CPU). `kb reindex` with progress bar makes this manageable. Model name stored in DB prevents silent mixing.

## Resolved Questions

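1. **ONNX for inference from day one.** Use sentence-transformers with the ONNX backend. The inference runtime is smaller (~30 MB for ONNX Runtime vs ~200 MB for PyTorch) and faster on CPU. Whether PyTorch can be dropped from the install entirely depends on the dependencies declared by `sentence-transformers` and Docling, both of which currently pull it in.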

2. **HuggingFace default cache for models.** Both embedding and Docling models use `~/.cache/huggingface/`. Shared with other HF tools — no duplicate downloads if the user already has models cached.

3. **Manual schema migrations.** Version number in `config` table. `database.py` checks version on open and runs ALTER TABLE scripts sequentially. Simple enough for this project's schema complexity.
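A minimal sketch of the migration runner, assuming migrations are kept as a mapping from target version to a list of SQL statements. The `summary` column in the usage example is purely hypothetical.

```python
import sqlite3
from typing import Mapping, Sequence


def migrate(conn: sqlite3.Connection,
            migrations: Mapping[int, Sequence[str]]) -> int:
    """Apply pending migrations in version order; return the final version."""
    row = conn.execute(
        "SELECT value FROM config WHERE key = 'schema_version'"
    ).fetchone()
    current = int(row[0]) if row else 0
    for version in sorted(v for v in migrations if v > current):
        for statement in migrations[version]:
            conn.execute(statement)
        # Record the version after each step so an interrupted run can resume.
        conn.execute(
            "INSERT OR REPLACE INTO config (key, value) VALUES ('schema_version', ?)",
            (str(version),),
        )
        conn.commit()
        current = version
    return current
```

Usage might look like `migrate(conn, {2: ["ALTER TABLE documents ADD COLUMN summary TEXT"]})`, called once from `database.py` when the connection is opened.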