kb/openspec/changes/kb-search/design.md at f245c24928c17d3fdc759e9512587185b0f890c9

steve f245c24928 Initial MVP

2026-03-23 20:38:42 +00:00

Context

This is a greenfield Python CLI project. No existing codebase, no migration concerns. The tool is installed via pipx install kb-search and keeps its data under ~/.kb/ on the user's machine. It must work entirely offline after the initial model download.

Primary consumer is Claude Code (or similar LLM tools) via a skill wrapper that calls kb search and feeds JSON results to the LLM for synthesis. Secondary consumer is the user directly in a terminal. This dual-consumer constraint means output must be machine-parseable first, human-readable second.

The document corpus is ~3,000 items (2,000 PDFs of varying complexity, 500 markdown/text notes, 500 code snippets) producing ~22,000 chunks. This is small enough that brute-force vector search is viable and SQLite is more than sufficient.

Goals / Non-Goals

Goals:

Single-command install (pipx install kb-search) with kb init for model setup
Ingest heterogeneous documents with format-appropriate chunking
Hybrid search (keyword + semantic) with a single command
JSON output contract stable enough for skill integration
Configurable but works with zero configuration
All state in one SQLite file for easy backup/portability

Non-Goals:

LLM-based answer synthesis (the calling skill handles this)
Multi-user or networked access
Real-time / streaming ingestion
Web UI or TUI dashboard
Support for every possible document format (start with PDF, markdown, code, notes)
Clustering, deduplication, or automatic organisation of documents

Decisions

1. Package Structure

kb-search/
├── pyproject.toml
├── src/
│   └── kb_search/
│       ├── __init__.py
│       ├── cli.py              # Click CLI entry point
│       ├── config.py           # YAML config loading + ENV overrides
│       ├── database.py         # SQLite schema, migrations, connection
│       ├── embeddings.py       # Model download, loading, inference
│       ├── search.py           # Hybrid search + RRF merging
│       ├── ingest/
│       │   ├── __init__.py
│       │   ├── detector.py     # File type detection + routing
│       │   ├── docling.py      # Docling pipeline (PDF, DOCX, HTML, images)
│       │   ├── markdown.py     # Header-based markdown splitting
│       │   ├── code.py         # AST/regex code splitting
│       │   └── note.py         # Whole-document note handler
│       └── output.py           # JSON + human-readable formatters
├── tests/
└── SKILL.md                    # Claude Code skill definition

Why this structure: Flat enough to navigate easily, but the ingest/ subpackage isolates format-specific logic. Each ingestion module exports the same interface (ingest(path, config) -> list[Chunk]), making it easy to add formats later. Using src/ layout per Python packaging best practices.
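A minimal sketch of that shared interface, assuming a Chunk dataclass and a suffix-based routing table. The field names, the HANDLERS mapping and ingest_note are illustrative, not part of the design:

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Callable


@dataclass
class Chunk:
    """One retrievable unit of text (field names are assumptions)."""
    text: str
    chunk_index: int
    metadata: dict[str, Any] = field(default_factory=dict)


# Every ingest module is assumed to expose a callable of this shape.
Ingestor = Callable[[Path, dict[str, Any]], list[Chunk]]


def ingest_note(path: Path, config: dict[str, Any]) -> list[Chunk]:
    """Whole-document handler: the entire file becomes a single chunk."""
    return [Chunk(text=path.read_text(encoding="utf-8"), chunk_index=0)]


# Hypothetical routing table of the kind detector.py could own.
HANDLERS: dict[str, Ingestor] = {".txt": ingest_note}


def ingest(path: Path, config: dict[str, Any]) -> list[Chunk]:
    """Route a file to its handler based on the file suffix."""
    handler = HANDLERS.get(path.suffix.lower())
    if handler is None:
        raise ValueError(f"unsupported file type: {path.suffix!r}")
    return handler(path, config)
```

Adding a format then amounts to writing one function with the same signature and registering it.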

2. SQLite as Sole Storage Backend

All data lives in ~/.kb/kb.db:

-- Documents
CREATE TABLE documents (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    source_path TEXT,
    content_hash TEXT NOT NULL,          -- SHA-256 for dedup/change detection
    doc_type TEXT NOT NULL CHECK(doc_type IN ('pdf','markdown','code','note')),
    language TEXT,                        -- for code: 'python','bash','go'
    created_at TEXT DEFAULT (datetime('now')),
    metadata TEXT DEFAULT '{}'           -- JSON: page_count, author, etc.
);

-- Chunks
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_index INTEGER NOT NULL,
    text TEXT NOT NULL,
    token_count INTEGER,
    metadata TEXT DEFAULT '{}',          -- JSON: page, section_header, symbol_name
    created_at TEXT DEFAULT (datetime('now'))
);

-- FTS5 index (content-sync with chunks table)
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    text,
    content='chunks',
    content_rowid='id',
    tokenize='porter unicode61'
);

-- Triggers to keep FTS in sync
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;

-- Vector storage (sqlite-vec)
CREATE VIRTUAL TABLE chunks_vec USING vec0(
    chunk_id INTEGER PRIMARY KEY,
    embedding FLOAT[384]                 -- dimension matches model
);

-- Tags
CREATE TABLE tags (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE NOT NULL
);

CREATE TABLE document_tags (
    document_id INTEGER REFERENCES documents(id) ON DELETE CASCADE,
    tag_id INTEGER REFERENCES tags(id) ON DELETE CASCADE,
    PRIMARY KEY (document_id, tag_id)
);

-- Config stored in DB (model binding)
CREATE TABLE config (
    key TEXT PRIMARY KEY,
    value TEXT NOT NULL
);
-- Keys: schema_version, model_name, embedding_dim, model_max_tokens

Why SQLite for everything: At ~22,000 chunks, SQLite handles FTS, vector search, and relational data without breaking a sweat. One file = trivial backup (cp kb.db kb.db.bak), no server process, no port conflicts. FTS5 is built into SQLite. sqlite-vec is a single loadable extension.
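One detail the schema depends on: SQLite only enforces foreign keys (and therefore ON DELETE CASCADE) when PRAGMA foreign_keys is switched on for each connection. A sketch of a connection helper that database.py could provide, with connect as an assumed name and a cut-down schema for the demonstration:

```python
import sqlite3


def connect(db_path: str) -> sqlite3.Connection:
    """Open the knowledge base and enable foreign-key enforcement.

    SQLite leaves foreign keys off by default, per connection, so without
    this pragma the ON DELETE CASCADE clauses are silently ignored.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA foreign_keys = ON")
    return conn


# Demonstration on a simplified version of the documents/chunks tables.
conn = connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE chunks (
        id INTEGER PRIMARY KEY,
        document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
        text TEXT NOT NULL
    );
    INSERT INTO documents (id, title) VALUES (1, 'Git Admin Guide');
    INSERT INTO chunks (document_id, text) VALUES (1, 'first'), (1, 'second');
    """
)
conn.execute("DELETE FROM documents WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
```

With the pragma enabled, deleting the document also removes its chunks, so remaining is 0.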

Why store config in DB and YAML: The YAML file holds user preferences (chunking params, model choice). The DB config table records what the DB was actually built with (model name, dimension). This separation lets us detect mismatches: "config says use nomic-embed-text but DB was built with all-MiniLM-L6-v2."
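The mismatch check could look roughly like this. ModelMismatchError and check_model_binding are assumed names, and the error wording is only a suggestion:

```python
import sqlite3


class ModelMismatchError(RuntimeError):
    """Raised when the configured model differs from the one bound to the DB."""


def check_model_binding(conn: sqlite3.Connection, configured_model: str) -> None:
    """Compare the YAML/ENV model choice with the model recorded in the DB."""
    row = conn.execute(
        "SELECT value FROM config WHERE key = 'model_name'"
    ).fetchone()
    if row is None:
        return  # nothing bound yet, e.g. before kb init has run
    if row[0] != configured_model:
        raise ModelMismatchError(
            f"config says use {configured_model} but DB was built with {row[0]}; "
            "run kb reindex to switch models"
        )
```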

Alternatives considered:

ChromaDB/Qdrant: External services, overkill for this scale, breaks single-file story
DuckDB: Good at analytics, but FTS support is weaker than SQLite FTS5
LanceDB: Interesting but less mature, no FTS built in

3. Docling for Complex Document Ingestion

Docling handles PDF, DOCX, HTML, and image files through a unified pipeline with ML-based layout detection and table reconstruction.

Why Docling over simpler extractors: The 2,000 PDFs are "many and varied" — simple text extraction (pymupdf, pdfplumber) works for clean PDFs but silently produces garbage for complex layouts, tables, or multi-column documents. Docling's layout model correctly identifies structural elements, and its table reconstruction preserves data that would otherwise be lost. The quality difference matters because bad chunks → bad search results → useless tool.

Docling configuration for this project:

Use the pypdfium2 backend (fast for text-based PDFs)
Enable OCR only when needed (detect pages with no extractable text)
Use hierarchy-aware chunking (respects section/paragraph boundaries)
Disable image extraction (we're indexing text, not images)
Run with multiple workers for batch ingestion

Model download: Docling models (~1.5 GB) download on first use or via kb init. They are stored in HuggingFace's default cache (~/.cache/huggingface/), the same location used for the embedding model.

Alternatives considered:

pymupdf4llm: Fast, lightweight, but poor table/layout handling
Unstructured: Heavier than Docling, commercial focus, less predictable output
LlamaParse: Cloud-only, violates local-first constraint

4. Per-Type Chunking Strategy

Each document type gets a purpose-built chunker with configurable parameters:

PDF (Docling): Hierarchy-aware chunking. Docling's HierarchicalChunker splits at section/paragraph boundaries respecting the document's logical structure. Falls back to fixed-size if hierarchy detection fails.

Markdown: Header-based splitting. Split at ## and ### boundaries. Preserve parent header chain as context (so a chunk under "## Config > ### Advanced" carries that path). Merge small sections (< min_tokens) with their neighbor. Configurable: min_tokens, max_tokens.
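A rough sketch of the header-based split, assuming a function named split_markdown that returns (header path, body) pairs. It leaves out the min_tokens merge step and does not special-case headers that appear inside fenced code blocks:

```python
import re

# Matches only level-2 and level-3 ATX headers ("## " and "### ").
HEADER_RE = re.compile(r"^(#{2,3})\s+(.*)$")


def split_markdown(text: str) -> list[tuple[str, str]]:
    """Split at ## / ### boundaries, keeping the parent header chain."""
    sections: list[tuple[str, str]] = []
    path: list[str] = []   # current header chain, e.g. ["Config", "Advanced"]
    body: list[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            sections.append((" > ".join(path), content))
        body.clear()

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1)) - 2   # 0 for ##, 1 for ###
            del path[level:]                  # drop headers at this depth or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return sections
```

Text before the first header is returned with an empty header path.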

Code (Python): Use stdlib ast module. Each function and class becomes a chunk. Class methods include the class docstring for context. Top-level code between definitions becomes its own chunk.
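As a starting point, top-level definitions can be pulled out with the stdlib ast module as below. This sketch (chunk_python is an assumed name) only covers the "each function and class becomes a chunk" part. Per-method chunks with the class docstring, and chunks for code between definitions, would build on the same line-range approach:

```python
import ast


def chunk_python(source: str) -> list[dict[str, str]]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start = node.lineno
            if node.decorator_list:
                # Include decorators, which sit above the def/class line.
                start = min(dec.lineno for dec in node.decorator_list)
            text = "\n".join(lines[start - 1 : node.end_lineno])
            chunks.append({"symbol_name": node.name, "text": text})
    return chunks
```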

Code (Bash): Regex-based. Split on function name() { and name() { patterns with brace-depth counting. Comment blocks preceding a function attach to that function's chunk. Fall back to fixed-size windowed chunks if no functions detected.
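A simplified sketch of the Bash splitter, with split_bash_functions as an assumed name. It counts braces per line, so braces inside strings or heredocs can mislead it, and the fixed-size fallback is left to the caller (an empty result means no functions were found):

```python
import re

# Matches "name() {" and "function name() {" at the start of a line.
FUNC_RE = re.compile(r"^\s*(?:function\s+)?([A-Za-z_][A-Za-z0-9_]*)\s*\(\)\s*\{")


def split_bash_functions(source: str) -> list[tuple[str, str]]:
    """Return (function_name, text) pairs, with leading comments attached."""
    lines = source.splitlines()
    chunks = []
    i = 0
    while i < len(lines):
        match = FUNC_RE.match(lines[i])
        if not match:
            i += 1
            continue
        # Walk back over the comment block directly above the function.
        start = i
        while start > 0 and lines[start - 1].lstrip().startswith("#"):
            start -= 1
        # Walk forward until the opening brace is balanced again.
        depth = 0
        end = i
        for j in range(i, len(lines)):
            depth += lines[j].count("{") - lines[j].count("}")
            end = j
            if depth <= 0:
                break
        chunks.append((match.group(1), "\n".join(lines[start : end + 1])))
        i = end + 1
    return chunks
```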

Code (Go): Regex-based. Split on func declarations. Type definitions with methods are grouped. Fall back to fixed-size if no recognisable boundaries.

Notes: Whole document = one chunk. Notes are small by definition.

Configurable defaults (in ~/.kb/config.yaml):

chunking:
  defaults:
    max_tokens: 512
    overlap_tokens: 50
  pdf:
    strategy: hierarchy    # hierarchy | fixed
    max_tokens: 1024       # for fixed strategy fallback
  markdown:
    strategy: header       # header | fixed
    min_tokens: 50         # merge sections smaller than this
    max_tokens: 1024
  code:
    strategy: ast          # ast | fixed
    include_context: true  # include class/module docstring with methods
    max_tokens: 1024
  note:
    strategy: whole

5. Embedding Model Management

Default model: all-MiniLM-L6-v2 (384 dimensions, 90 MB, good quality/speed tradeoff for CPU).

Model loading: Use sentence-transformers library which provides a unified API across models. Models stored in HuggingFace's default cache (~/.cache/huggingface/), shared with other tools that use HF models. No custom cache directory override.

Model binding: On kb init, the chosen model's name and dimension are written to the DB config table. Every subsequent kb add checks the loaded model matches the DB. Mismatch = hard error with clear message.

Model switching (kb reindex):

Download new model
Read all chunks from DB
Re-embed in batches (with progress bar)
Recreate chunks_vec table if dimension changed
Replace all vectors in chunks_vec
Update DB config (model_name, embedding_dim)

ONNX Runtime for inference: Use sentence-transformers with ONNX backend (model = SentenceTransformer(model_name, backend="onnx")). This gives us sentence-transformers' correct tokenization/pooling/normalization while using ONNX Runtime (~30 MB) instead of PyTorch (~200 MB) for inference. Models are automatically exported to ONNX format on first load. This keeps the install lightweight without sacrificing the convenience of the sentence-transformers API.

Model compatibility: All models on HuggingFace that work with sentence-transformers are supported. The only per-model differences handled in code:

Dimension (read from model config)
Max sequence length (read from model config, used to cap chunk size)
Query/passage prefixes (configurable in YAML, empty by default)

embedding:
  model: all-MiniLM-L6-v2
  query_prefix: ""          # some models need "search_query: "
  passage_prefix: ""        # some models need "search_document: "

6. Hybrid Search with Reciprocal Rank Fusion

Search flow:

Query: "how to install git"
         │
         ├──▶ FTS5 query ──▶ BM25-ranked results (chunk_id, fts_score)
         │
         └──▶ Embed query ──▶ vec similarity search ──▶ (chunk_id, vec_score)
                              (cosine distance, top-K)
         │
         ▼
    Reciprocal Rank Fusion (RRF)
    score(d) = Σ 1/(k + rank_in_list)  where k=60 (standard)
         │
         ▼
    Merged results, sorted by RRF score
         │
         ▼
    Apply filters (tags, doc_type) ──▶ Top-N results

Why RRF over learned re-ranking: RRF is simple, has a single constant (k, left at the standard value of 60 and exposed as search.rrf_k), and performs surprisingly well. A learned re-ranker (e.g., a cross-encoder) would add another model download and slow down queries, and the marginal quality improvement isn't worth it at this scale. RRF can be swapped out later if needed.
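The fusion step itself is only a few lines. A sketch, assuming each retriever hands back chunk ids ordered best-first and using rrf_merge as an illustrative name:

```python
from collections import defaultdict


def rrf_merge(ranked_lists: list[list[int]], k: int = 60) -> list[tuple[int, float]]:
    """Fuse ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) for every id it contains,
    with ranks starting at 1. Ids missing from a list contribute nothing.
    """
    scores: dict[int, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by chunk id for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


fts_ranking = [1, 2, 3]   # chunk ids from the FTS5 query, best first
vec_ranking = [3, 1, 4]   # chunk ids from the vector search, best first
merged = rrf_merge([fts_ranking, vec_ranking])
```

Here chunk 1 (ranks 1 and 2) edges out chunk 3 (ranks 3 and 1), and chunks that appear in only one list follow. With two lists and k=60 a raw RRF score cannot exceed 2/61, so any score threshold has to be chosen on that scale or applied after normalisation.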

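FTS5 query construction: Split the user's query on whitespace and pass each term to FTS5 as a double-quoted string rather than passing the raw text through. Characters such as ", -, : and * are part of the FTS5 query syntax and can otherwise cause a syntax error. The porter stemmer handles basic normalisation. No query expansion or synonym handling; keep it simple.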
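One way to do the escaping, sketched with an assumed helper name to_fts5_query. FTS5 treats a double-quoted string as literal text, with embedded double quotes escaped by doubling them, and terms separated by spaces are combined with an implicit AND:

```python
def to_fts5_query(raw: str) -> str:
    """Quote every whitespace-separated term so FTS5 treats it literally."""
    terms = raw.split()
    return " ".join('"' + term.replace('"', '""') + '"' for term in terms)
```

An empty or whitespace-only query yields an empty string, which the caller should handle before running MATCH.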

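Vector search: Embed the query with the same model used for chunks. Retrieve top-K (K = 3× requested results, to give RRF enough candidates). sqlite-vec scans all vectors by brute force; at 22K vectors of 384 dimensions this is expected to take ~2-5 ms. vec0 columns use L2 distance unless declared with distance_metric=cosine; for normalised embeddings the two produce the same ranking.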

Filter application: Tag and type filters are applied as SQL WHERE clauses before search where possible (for FTS5 via JOIN), or as post-filters on the merged results. This is a design choice per filter type:

Type filter: Applied in the SQL query (efficient)
Tag filter: Applied in the SQL query via JOIN (efficient)
Score threshold: Applied post-RRF as a cutoff

7. Output Format (Skill Contract)

JSON output (--format json, default):

{
  "query": "how to install git",
  "results": [
    {
      "chunk_id": 1423,
      "score": 0.87,
      "score_breakdown": {"fts": 0.72, "vector": 0.94},
      "text": "To install the latest version of git from source...",
      "source": {
        "document_id": 42,
        "title": "Git Admin Guide",
        "path": "/home/user/docs/git-admin.pdf",
        "type": "pdf",
        "page": 12,
        "chunk_index": 3,
        "total_chunks": 28,
        "tags": ["git", "admin"]
      }
    }
  ],
  "total_matches": 47,
  "returned": 10
}

Human output (--format human):

Search: "how to install git" (47 matches, showing top 10)

 1. [0.87] Git Admin Guide (p.12)                    [pdf] [git, admin]
    To install the latest version of git from source...

 2. [0.65] setup-notes.md §Installation               [markdown] [git]
    First, add the PPA repository for the latest git...

Stability commitment: The JSON schema is the contract with the skill. Fields may be added but not removed or renamed once the skill is built.

8. Configuration Architecture

Precedence (highest to lowest):
  1. CLI flags (--top, --tags, --format)
  2. Environment variables (KB_MODEL, KB_DATA_DIR, KB_DEFAULT_TOP)
  3. ~/.kb/config.yaml
  4. Built-in defaults

ENV variable naming: KB_ prefix + UPPER_SNAKE_CASE
  KB_DATA_DIR     → ~/.kb/
  KB_MODEL        → all-MiniLM-L6-v2
  KB_DEFAULT_TOP  → 10
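The precedence chain can be sketched as a single lookup function. The flat key names, the DEFAULTS and ENV_VARS tables and the resolve signature are simplifications for illustration (the real YAML is nested):

```python
import os

# Built-in defaults; environment values are coerced to the default's type.
DEFAULTS = {"data_dir": "~/.kb", "model": "all-MiniLM-L6-v2", "default_top": 10}
ENV_VARS = {"data_dir": "KB_DATA_DIR", "model": "KB_MODEL", "default_top": "KB_DEFAULT_TOP"}


def resolve(key, cli_value=None, yaml_config=None, environ=None):
    """Return the effective value: CLI flag > ENV > config.yaml > default."""
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    env_name = ENV_VARS[key]
    if env_name in environ:
        return type(DEFAULTS[key])(environ[env_name])
    if yaml_config and key in yaml_config:
        return yaml_config[key]
    return DEFAULTS[key]
```

For example, resolve("default_top", yaml_config={"default_top": 20}, environ={"KB_DEFAULT_TOP": "25"}) returns 25 because the environment variable outranks the YAML file.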

Full default config.yaml:

# ~/.kb/config.yaml

data_dir: ~/.kb

embedding:
  model: all-MiniLM-L6-v2
  query_prefix: ""
  passage_prefix: ""

search:
  default_top: 10
  default_format: json
  rrf_k: 60

chunking:
  defaults:
    max_tokens: 512
    overlap_tokens: 50
  pdf:
    strategy: hierarchy
    max_tokens: 1024
  markdown:
    strategy: header
    min_tokens: 50
    max_tokens: 1024
  code:
    strategy: ast
    include_context: true
    max_tokens: 1024
  note:
    strategy: whole

ingestion:
  workers: 4                 # parallel Docling workers
  batch_size: 50             # commit to DB every N documents
  enable_ocr: auto           # auto | always | never

9. CLI Framework: Click

Why Click: Mature, well-documented, supports nested command groups, automatic --help generation, parameter validation, and progress bars (via click.progressbar). The alternative (Typer) adds type-hint magic but less control. argparse is too verbose for this many commands.

10. Error Handling and Resumability

Batch ingestion must be resumable. When adding a directory of 2,000 PDFs:

Each document is processed independently
On success: document + chunks inserted in a single transaction
On failure: error logged, document skipped, processing continues
content_hash (SHA-256 of file contents) enables skip-if-already-indexed
Progress shown via click.progressbar or rich.progress
Summary at end: "Added 1,847 documents. 12 failed. 141 skipped (already indexed)."

Failed documents are logged to ~/.kb/ingest-errors.log with the file path and error for later investigation.
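The loop described above might be structured as follows. This is a sketch under simplifying assumptions: add_files and file_sha256 are illustrative names, the ingest callable returns plain strings, doc_type is hard-coded, and errors go to the logging module rather than to ingest-errors.log:

```python
import hashlib
import logging
import sqlite3
from pathlib import Path

log = logging.getLogger("kb.ingest")


def file_sha256(path) -> str:
    """Hash the file contents in 1 MiB blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def add_files(conn: sqlite3.Connection, paths, ingest) -> dict:
    """Ingest each file independently; skip known hashes, log failures."""
    counts = {"added": 0, "failed": 0, "skipped": 0}
    for path in paths:
        try:
            content_hash = file_sha256(path)
            known = conn.execute(
                "SELECT 1 FROM documents WHERE content_hash = ?", (content_hash,)
            ).fetchone()
            if known:
                counts["skipped"] += 1
                continue
            texts = ingest(path)
            with conn:  # commits on success, rolls back if an insert fails
                cur = conn.execute(
                    "INSERT INTO documents (title, source_path, content_hash, doc_type)"
                    " VALUES (?, ?, ?, ?)",
                    (Path(path).name, str(path), content_hash, "note"),
                )
                conn.executemany(
                    "INSERT INTO chunks (document_id, chunk_index, text) VALUES (?, ?, ?)",
                    [(cur.lastrowid, i, text) for i, text in enumerate(texts)],
                )
            counts["added"] += 1
        except Exception as exc:
            counts["failed"] += 1
            log.error("failed to ingest %s: %s", path, exc)
    return counts
```

Re-running the same command after an interruption then only pays for hashing the files that are already indexed.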

Risks / Trade-offs

[Docling model size] → Mitigation: ~1.5 GB download on first init. Clear progress indication during download. Models are kept in HuggingFace's default cache (~/.cache/huggingface/), so the download happens only once. Document this in kb init output and SKILL.md.

[Docling ingestion speed on CPU] → Mitigation: ~17 hours for 2,000 PDFs on CPU (roughly 30 seconds per document). Support parallel workers (configurable). Show per-document progress. Resumable by design (skip already-indexed). Suggest GPU if available. This is a one-time cost.

[ONNX model export on first load] → Mitigation: First time a model is loaded, sentence-transformers exports it to ONNX format. This takes 10-30 seconds and is cached for subsequent runs. Users see a one-time delay on first kb add or kb search after init. Show a clear message: "Optimising model for ONNX inference (one-time)..."

[sqlite-vec maturity] → Mitigation: sqlite-vec is relatively new. At 22K vectors, brute-force search means we're not relying on its ANN indexing. If sqlite-vec has issues, swapping to numpy cosine similarity over a stored blob column is straightforward — same DB, different query path.

[FTS5 trigger sync] → Mitigation: FTS5 content-sync triggers add write overhead. At our scale (inserts during ingestion, not real-time) this is negligible. If it becomes an issue, switch to manual sync with INSERT INTO chunks_fts(chunks_fts) VALUES('rebuild') after batch operations.

[Model lock-in] → Mitigation: Changing embedding models requires full reindex (~22K embeddings, ~10-30 minutes on CPU). kb reindex with progress bar makes this manageable. Model name stored in DB prevents silent mixing.
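For the sqlite-vec fallback mentioned above, the stored-blob approach can be illustrated without any extension. This sketch uses pure Python instead of numpy so that it runs anywhere (numpy would vectorise the same arithmetic), and pack_vector, unpack_vector and top_k are assumed names:

```python
import math
import struct


def pack_vector(vec) -> bytes:
    """Serialise a vector as little-endian float32 for a BLOB column."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_vector(blob: bytes) -> tuple:
    """Inverse of pack_vector: 4 bytes per float32 component."""
    return struct.unpack(f"<{len(blob) // 4}f", blob)


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, rows, k):
    """Brute-force search over (chunk_id, blob) rows, best match first."""
    scored = [(chunk_id, cosine(query, unpack_vector(blob))) for chunk_id, blob in rows]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

The rows argument would come from a plain SELECT of chunk_id and the embedding blob, so the database file and schema stay the same apart from the vector table.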

Resolved Questions

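ONNX for inference from day one. Use sentence-transformers with ONNX backend. ONNX Runtime is small (~30 MB vs ~200 MB for PyTorch) and gives faster CPU inference. PyTorch is not used at inference time; whether it can be dropped from the install entirely still needs checking, since sentence-transformers lists torch among its dependencies.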
HuggingFace default cache for models. Both embedding and Docling models use ~/.cache/huggingface/. Shared with other HF tools — no duplicate downloads if the user already has models cached.
Manual schema migrations. Version number in config table. database.py checks version on open and runs ALTER TABLE scripts sequentially. Simple enough for this project's schema complexity.
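The migration check in database.py could be as small as the following sketch. The MIGRATIONS table, the function names and the version-2 step (a summary column) are hypothetical examples rather than planned schema changes:

```python
import sqlite3

# Ordered migration steps keyed by the schema version they produce.
MIGRATIONS = {
    1: [
        "CREATE TABLE config (key TEXT PRIMARY KEY, value TEXT NOT NULL)",
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL)",
    ],
    2: [
        "ALTER TABLE documents ADD COLUMN summary TEXT",  # hypothetical change
    ],
}


def current_version(conn: sqlite3.Connection) -> int:
    """Read schema_version from the config table; 0 for a fresh database."""
    has_config = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = 'config'"
    ).fetchone()
    if not has_config:
        return 0
    row = conn.execute(
        "SELECT value FROM config WHERE key = 'schema_version'"
    ).fetchone()
    return int(row[0]) if row else 0


def migrate(conn: sqlite3.Connection) -> int:
    """Apply every migration newer than the stored version, in order."""
    version = current_version(conn)
    for target in sorted(v for v in MIGRATIONS if v > version):
        for statement in MIGRATIONS[target]:
            conn.execute(statement)
        with conn:  # record the new version once its statements have run
            conn.execute(
                "INSERT OR REPLACE INTO config (key, value) VALUES ('schema_version', ?)",
                (str(target),),
            )
    return current_version(conn)
```

Calling migrate on every open is safe because already-applied versions are skipped.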
