## Context

This is a greenfield Python CLI project. No existing codebase, no migration concerns. The tool will live at `~/.kb/` on the user's machine and be installed via `pipx install kb-search`. It must work entirely offline after initial model download.

Primary consumer is Claude Code (or similar LLM tools) via a skill wrapper that calls `kb search` and feeds JSON results to the LLM for synthesis. Secondary consumer is the user directly in a terminal. This dual-consumer constraint means output must be machine-parseable first, human-readable second.

The document corpus is ~3,000 items (2,000 PDFs of varying complexity, 500 markdown/text notes, 500 code snippets) producing ~22,000 chunks. This is small enough that brute-force vector search is viable and SQLite is more than sufficient.

## Goals / Non-Goals

**Goals:**

- Single-command install (`pipx install kb-search`) with `kb init` for model setup
- Ingest heterogeneous documents with format-appropriate chunking
- Hybrid search (keyword + semantic) with a single command
- JSON output contract stable enough for skill integration
- Configurable but works with zero configuration
- All state in one SQLite file for easy backup/portability

**Non-Goals:**

- LLM-based answer synthesis (the calling skill handles this)
- Multi-user or networked access
- Real-time / streaming ingestion
- Web UI or TUI dashboard
- Support for every possible document format (start with PDF, markdown, code, notes)
- Clustering, deduplication, or automatic organisation of documents

## Decisions

### 1. Package Structure

```
kb-search/
├── pyproject.toml
├── src/
│   └── kb_search/
│       ├── __init__.py
│       ├── cli.py              # Click CLI entry point
│       ├── config.py           # YAML config loading + ENV overrides
│       ├── database.py         # SQLite schema, migrations, connection
│       ├── embeddings.py       # Model download, loading, inference
│       ├── search.py           # Hybrid search + RRF merging
│       ├── ingest/
│       │   ├── __init__.py
│       │   ├── detector.py     # File type detection + routing
│       │   ├── docling.py      # Docling pipeline (PDF, DOCX, HTML, images)
│       │   ├── markdown.py     # Header-based markdown splitting
│       │   ├── code.py         # AST/regex code splitting
│       │   └── note.py         # Whole-document note handler
│       └── output.py           # JSON + human-readable formatters
├── tests/
└── SKILL.md                    # Claude Code skill definition
```

**Why this structure:** Flat enough to navigate easily, but the `ingest/` subpackage isolates format-specific logic. Each ingestion module exports the same interface (`ingest(path, config) -> list[Chunk]`), making it easy to add formats later. Using `src/` layout per Python packaging best practices.

### 2. SQLite as Sole Storage Backend

All data lives in `~/.kb/kb.db`:

```sql
-- Documents
CREATE TABLE documents (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    source_path TEXT,
    content_hash TEXT NOT NULL,          -- SHA-256 for dedup/change detection
    doc_type TEXT NOT NULL CHECK(doc_type IN ('pdf','markdown','code','note')),
    language TEXT,                       -- for code: 'python','bash','go'
    created_at TEXT DEFAULT (datetime('now')),
    metadata TEXT DEFAULT '{}'           -- JSON: page_count, author, etc.
);

-- Chunks
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_index INTEGER NOT NULL,
    text TEXT NOT NULL,
    token_count INTEGER,
    metadata TEXT DEFAULT '{}',          -- JSON: page, section_header, symbol_name
    created_at TEXT DEFAULT (datetime('now'))
);

-- FTS5 index (content-sync with chunks table)
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    text,
    content='chunks',
    content_rowid='id',
    tokenize='porter unicode61'
);

-- Triggers to keep FTS in sync
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;

CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
END;

CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;

-- Vector storage (sqlite-vec)
CREATE VIRTUAL TABLE chunks_vec USING vec0(
    chunk_id INTEGER PRIMARY KEY,
    embedding FLOAT[384]                 -- dimension matches model
);

-- Tags
CREATE TABLE tags (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE NOT NULL
);

CREATE TABLE document_tags (
    document_id INTEGER REFERENCES documents(id) ON DELETE CASCADE,
    tag_id INTEGER REFERENCES tags(id) ON DELETE CASCADE,
    PRIMARY KEY (document_id, tag_id)
);

-- Config stored in DB (model binding)
CREATE TABLE config (
    key TEXT PRIMARY KEY,
    value TEXT NOT NULL
);
-- Keys: schema_version, model_name, embedding_dim, model_max_tokens
```

**Why SQLite for everything:** At ~22,000 chunks, SQLite handles FTS, vector search, and relational data without breaking a sweat. One file = trivial backup (`cp kb.db kb.db.bak`), no server process, no port conflicts. FTS5 is built into SQLite. sqlite-vec is a single loadable extension.
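As a rough illustration of how the content-sync triggers in the schema keep `chunks_fts` aligned with `chunks`, the sketch below uses Python's stdlib `sqlite3` with a trimmed-down `chunks` table (only `id` and `text`). It assumes the bundled SQLite was compiled with FTS5, which is common but not guaranteed; the `fts_search` helper and the sample rows are hypothetical and exist only for this example.

```python
import sqlite3

# Trimmed-down version of the schema: only the columns the FTS index needs.
SCHEMA = """
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    text TEXT NOT NULL
);
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    text, content='chunks', content_rowid='id', tokenize='porter unicode61'
);
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, text)
    VALUES ('delete', old.id, old.text);
END;
"""


def fts_search(conn: sqlite3.Connection, query: str) -> list[int]:
    """Hypothetical helper: return matching chunk ids, best BM25 rank first."""
    rows = conn.execute(
        "SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Plain inserts into chunks; the AFTER INSERT trigger fills the FTS index.
conn.execute(
    "INSERT INTO chunks (text) VALUES (?)",
    ("To install git from source, run make.",),
)
conn.execute(
    "INSERT INTO chunks (text) VALUES (?)",
    ("Configure the YAML file before indexing.",),
)
hits_before = fts_search(conn, "install")

# Deleting the row fires the AFTER DELETE trigger, which removes it from FTS.
conn.execute("DELETE FROM chunks WHERE id = ?", (hits_before[0],))
hits_after = fts_search(conn, "install")
```

The application code only ever writes to `chunks`; the index follows through the triggers, which is the property the single-file design relies on.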
**Why store config in DB _and_ YAML:** The YAML file holds user preferences (chunking params, model choice). The DB `config` table records what the DB was _actually built with_ (model name, dimension). This separation lets us detect mismatches: "config says use nomic-embed-text but DB was built with all-MiniLM-L6-v2."

**Alternatives considered:**

- ChromaDB/Qdrant: External services, overkill for this scale, breaks single-file story
- DuckDB: Good at analytics, but FTS support is weaker than SQLite FTS5
- LanceDB: Interesting but less mature, no FTS built in

### 3. Docling for Complex Document Ingestion

Docling handles PDF, DOCX, HTML, and image files through a unified pipeline with ML-based layout detection and table reconstruction.

**Why Docling over simpler extractors:** The 2,000 PDFs are "many and varied". Simple text extraction (pymupdf, pdfplumber) works for clean PDFs but silently produces garbage for complex layouts, tables, or multi-column documents. Docling's layout model correctly identifies structural elements, and its table reconstruction preserves data that would otherwise be lost. The quality difference matters because bad chunks → bad search results → useless tool.

**Docling configuration for this project:**

- Use `pypdfium2` backend (default, fast for text-based PDFs)
- Enable OCR only when needed (detect pages with no extractable text)
- Use hierarchy-aware chunking (respects section/paragraph boundaries)
- Disable image extraction (we're indexing text, not images)
- Run with multiple workers for batch ingestion

**Model download:** Docling models (~1.5 GB) download on first use or via `kb init`. They are stored in HuggingFace's default cache (`~/.cache/huggingface/`), as settled under Resolved Questions.

**Alternatives considered:**

- pymupdf4llm: Fast, lightweight, but poor table/layout handling
- Unstructured: Heavier than Docling, commercial focus, less predictable output
- LlamaParse: Cloud-only, violates local-first constraint

### 4. Per-Type Chunking Strategy

Each document type gets a purpose-built chunker with configurable parameters:

**PDF (Docling):** Hierarchy-aware chunking. Docling's `HierarchicalChunker` splits at section/paragraph boundaries, respecting the document's logical structure. Falls back to fixed-size chunking if hierarchy detection fails.

**Markdown:** Header-based splitting. Split at `##` and `###` boundaries. Preserve the parent header chain as context (so a chunk under "## Config > ### Advanced" carries that path). Merge small sections (< `min_tokens`) with their neighbor. Configurable: `min_tokens`, `max_tokens`.

**Code (Python):** Use the stdlib `ast` module. Each function and class becomes a chunk. Class methods include the class docstring for context. Top-level code between definitions becomes its own chunk.

**Code (Bash):** Regex-based. Split on `function name() {` and `name() {` patterns with brace-depth counting. Comment blocks preceding a function attach to that function's chunk. Fall back to fixed-size windowed chunks if no functions are detected.

**Code (Go):** Regex-based. Split on `func ` declarations. Type definitions with methods are grouped. Fall back to fixed-size chunks if there are no recognisable boundaries.

**Notes:** Whole document = one chunk. Notes are small by definition.

**Configurable defaults (in `~/.kb/config.yaml`):**

```yaml
chunking:
  defaults:
    max_tokens: 512
    overlap_tokens: 50
  pdf:
    strategy: hierarchy      # hierarchy | fixed
    max_tokens: 1024         # for fixed strategy fallback
  markdown:
    strategy: header         # header | fixed
    min_tokens: 50           # merge sections smaller than this
    max_tokens: 1024
  code:
    strategy: ast            # ast | fixed
    include_context: true    # include class/module docstring with methods
    max_tokens: 1024
  note:
    strategy: whole
```

### 5. Embedding Model Management

**Default model:** `all-MiniLM-L6-v2` (384 dimensions, 90 MB, good quality/speed tradeoff for CPU).

**Model loading:** Use the `sentence-transformers` library, which provides a unified API across models.
Models are stored in HuggingFace's default cache (`~/.cache/huggingface/`), shared with other tools that use HF models. No custom cache directory override.

**Model binding:** On `kb init`, the chosen model's name and dimension are written to the DB `config` table. Every subsequent `kb add` checks that the loaded model matches the DB. Mismatch = hard error with a clear message.

**Model switching (`kb reindex`):**

1. Download new model
2. Read all chunks from DB
3. Re-embed in batches (with progress bar)
4. Recreate `chunks_vec` table if dimension changed
5. Replace all vectors in `chunks_vec`
6. Update DB config (model_name, embedding_dim)

**ONNX Runtime for inference:** Use `sentence-transformers` with the ONNX backend (`model = SentenceTransformer(model_name, backend="onnx")`). This gives us sentence-transformers' correct tokenization/pooling/normalization while using ONNX Runtime (~30 MB) instead of PyTorch (~200 MB) for inference. Models are automatically exported to ONNX format on first load. This keeps the install lightweight without sacrificing the convenience of the sentence-transformers API.

**Model compatibility:** All models on HuggingFace that work with `sentence-transformers` are supported. The only per-model differences handled in code:

- Dimension (read from model config)
- Max sequence length (read from model config, used to cap chunk size)
- Query/passage prefixes (configurable in YAML, empty by default)

```yaml
embedding:
  model: all-MiniLM-L6-v2
  query_prefix: ""       # some models need "search_query: "
  passage_prefix: ""     # some models need "search_document: "
```

### 6. Hybrid Search with Reciprocal Rank Fusion

**Search flow:**

```
Query: "how to install git"
         │
         ├──▶ FTS5 query ──▶ BM25-ranked results (chunk_id, fts_score)
         │
         └──▶ Embed query ──▶ vec similarity search ──▶ (chunk_id, vec_score)
                              (cosine distance, top-K)
                   │
                   ▼
         Reciprocal Rank Fusion (RRF)
         score(d) = Σ 1/(k + rank_in_list)
         where k=60 (standard)
                   │
                   ▼
         Merged results, sorted by RRF score
                   │
                   ▼
         Apply filters (tags, doc_type) ──▶ Top-N results
```

**Why RRF over learned re-ranking:** RRF is simple, needs no training or tuning (k=60 is the standard constant), and performs surprisingly well. A learned re-ranker (e.g., cross-encoder) would add another model download and slow down queries, and the marginal quality improvement isn't worth it at this scale. RRF can be swapped out later if needed.

**FTS5 query construction:** Pass the raw query string to FTS5. FTS5's porter stemmer handles basic normalisation. For queries with special characters, escape them. No query expansion or synonym handling; keep it simple.

**Vector search:** Embed the query with the same model used for chunks. Retrieve top-K (K = 3× requested results, to give RRF enough candidates). sqlite-vec does brute-force cosine similarity over all vectors; at 22K vectors this takes ~2-5 ms.

**Filter application:** Tag and type filters are applied as SQL WHERE clauses _before_ search where possible (for FTS5 via JOIN), or as post-filters on the merged results. This is a design choice per filter type:

- Type filter: Applied in the SQL query (efficient)
- Tag filter: Applied in the SQL query via JOIN (efficient)
- Score threshold: Applied post-RRF as a cutoff

### 7. Output Format (Skill Contract)

**JSON output (`--format json`, default):**

```json
{
  "query": "how to install git",
  "results": [
    {
      "chunk_id": 1423,
      "score": 0.87,
      "score_breakdown": {"fts": 0.72, "vector": 0.94},
      "text": "To install the latest version of git from source...",
      "source": {
        "document_id": 42,
        "title": "Git Admin Guide",
        "path": "/home/user/docs/git-admin.pdf",
        "type": "pdf",
        "page": 12,
        "chunk_index": 3,
        "total_chunks": 28,
        "tags": ["git", "admin"]
      }
    }
  ],
  "total_matches": 47,
  "returned": 10
}
```

**Human output (`--format human`):**

```
Search: "how to install git" (47 matches, showing top 10)

 1. [0.87] Git Admin Guide (p.12) [pdf] [git, admin]
    To install the latest version of git from source...

 2. [0.65] setup-notes.md §Installation [markdown] [git]
    First, add the PPA repository for the latest git...
```

**Stability commitment:** The JSON schema is the contract with the skill. Fields may be _added_ but not removed or renamed once the skill is built.

### 8. Configuration Architecture

```
Precedence (highest to lowest):
  1. CLI flags              (--top, --tags, --format)
  2. Environment variables  (KB_MODEL, KB_DATA_DIR, KB_DEFAULT_TOP)
  3. ~/.kb/config.yaml
  4. Built-in defaults

ENV variable naming: KB_ prefix + UPPER_SNAKE_CASE
  KB_DATA_DIR     → ~/.kb/
  KB_MODEL        → all-MiniLM-L6-v2
  KB_DEFAULT_TOP  → 10
```

**Full default config.yaml:**

```yaml
# ~/.kb/config.yaml
data_dir: ~/.kb

embedding:
  model: all-MiniLM-L6-v2
  query_prefix: ""
  passage_prefix: ""

search:
  default_top: 10
  default_format: json
  rrf_k: 60

chunking:
  defaults:
    max_tokens: 512
    overlap_tokens: 50
  pdf:
    strategy: hierarchy
    max_tokens: 1024
  markdown:
    strategy: header
    min_tokens: 50
    max_tokens: 1024
  code:
    strategy: ast
    include_context: true
    max_tokens: 1024
  note:
    strategy: whole

ingestion:
  workers: 4          # parallel Docling workers
  batch_size: 50      # commit to DB every N documents
  enable_ocr: auto    # auto | always | never
```

### 9. CLI Framework: Click

**Why Click:** Mature, well-documented, supports nested command groups, automatic `--help` generation, parameter validation, and progress bars (via `click.progressbar`). The alternative (Typer) adds type-hint magic but offers less control. argparse is too verbose for this many commands.

### 10. Error Handling and Resumability

**Batch ingestion must be resumable.** When adding a directory of 2,000 PDFs:

- Each document is processed independently
- On success: document + chunks inserted in a single transaction
- On failure: error logged, document skipped, processing continues
- `content_hash` (SHA-256 of file contents) enables skip-if-already-indexed
- Progress shown via `click.progressbar` or `rich.progress`
- Summary at end: "Added 1,847 documents. 12 failed. 141 skipped (already indexed)."

Failed documents are logged to `~/.kb/ingest-errors.log` with the file path and error for later investigation.

## Risks / Trade-offs

**[Docling model size] → Mitigation:** ~1.5 GB download on first init. Clear progress indication during download. Models are cached after the first download (HuggingFace default cache, per Resolved Questions). Document this in `kb init` output and SKILL.md.

**[Docling ingestion speed on CPU] → Mitigation:** ~17 hours for 2,000 PDFs on CPU. Support parallel workers (configurable). Show per-document progress. Resumable by design (skip already-indexed). Suggest GPU if available. This is a one-time cost.

**[ONNX model export on first load] → Mitigation:** The first time a model is loaded, sentence-transformers exports it to ONNX format. This takes 10-30 seconds and is cached for subsequent runs. Users see a one-time delay on first `kb add` or `kb search` after init. Show a clear message: "Optimising model for ONNX inference (one-time)..."

**[sqlite-vec maturity] → Mitigation:** sqlite-vec is relatively new. At 22K vectors, brute-force search means we're not relying on its ANN indexing.
If sqlite-vec has issues, swapping to numpy cosine similarity over a stored blob column is straightforward: same DB, different query path.

**[FTS5 trigger sync] → Mitigation:** FTS5 content-sync triggers add write overhead. At our scale (inserts during ingestion, not real-time) this is negligible. If it becomes an issue, switch to manual sync with `INSERT INTO chunks_fts(chunks_fts) VALUES('rebuild')` after batch operations.

**[Model lock-in] → Mitigation:** Changing embedding models requires a full reindex (~22K embeddings, ~10-30 minutes on CPU). `kb reindex` with a progress bar makes this manageable. The model name stored in the DB prevents silent mixing.

## Resolved Questions

1. **ONNX for inference from day one.** Use sentence-transformers with ONNX backend. Smaller install (~30 MB vs ~200 MB for PyTorch), faster CPU inference. No PyTorch dependency.

2. **HuggingFace default cache for models.** Both embedding and Docling models use `~/.cache/huggingface/`. Shared with other HF tools, so there are no duplicate downloads if the user already has models cached.

3. **Manual schema migrations.** Version number in `config` table. `database.py` checks version on open and runs ALTER TABLE scripts sequentially. Simple enough for this project's schema complexity.
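The manual migration approach in item 3 could look roughly like the sketch below. It is a minimal illustration, not the project's `database.py`: the `MIGRATIONS` table, the two `ALTER TABLE` statements, and the `get_version` / `migrate` helpers are all hypothetical, and the connection is assumed to be opened with `isolation_level=None` so that `BEGIN` / `COMMIT` can be issued explicitly.

```python
import sqlite3

# Hypothetical migration scripts, keyed by the schema version they upgrade TO.
MIGRATIONS = {
    2: "ALTER TABLE documents ADD COLUMN summary TEXT",
    3: "ALTER TABLE chunks ADD COLUMN lang TEXT",
}


def get_version(conn: sqlite3.Connection) -> int:
    """Read schema_version from the config table (0 if the key is missing)."""
    row = conn.execute(
        "SELECT value FROM config WHERE key = 'schema_version'"
    ).fetchone()
    return int(row[0]) if row else 0


def migrate(conn: sqlite3.Connection) -> int:
    """Apply pending migrations in order, one explicit transaction each."""
    current = get_version(conn)
    for target in sorted(v for v in MIGRATIONS if v > current):
        conn.execute("BEGIN")
        try:
            conn.execute(MIGRATIONS[target])
            conn.execute(
                "INSERT OR REPLACE INTO config (key, value) "
                "VALUES ('schema_version', ?)",
                (str(target),),
            )
            conn.execute("COMMIT")
        except sqlite3.Error:
            # SQLite DDL is transactional, so a failed step leaves no trace.
            conn.execute("ROLLBACK")
            raise
    return get_version(conn)


# Demo on an in-memory DB that starts at schema version 1.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT NOT NULL);
CREATE TABLE config (key TEXT PRIMARY KEY, value TEXT NOT NULL);
INSERT INTO config (key, value) VALUES ('schema_version', '1');
""")
final_version = migrate(conn)
columns = [row[1] for row in conn.execute("PRAGMA table_info(documents)")]
```

Bumping the version inside the same transaction as the schema change means an interrupted upgrade resumes from the last completed step, and calling `migrate` again on an up-to-date database is a no-op.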