Initial MVP

2026-03-23 20:38:42 +00:00
commit f245c24928
57 changed files with 6812 additions and 0 deletions
@@ -0,0 +1,396 @@
+## Context
+
+This is a greenfield Python CLI project. No existing codebase, no migration concerns. The tool is installed via `pipx install kb-search`, and its data (config, database, logs) will live at `~/.kb/` on the user's machine. It must work entirely offline after initial model download.
+
+Primary consumer is Claude Code (or similar LLM tools) via a skill wrapper that calls `kb search` and feeds JSON results to the LLM for synthesis. Secondary consumer is the user directly in a terminal. This dual-consumer constraint means output must be machine-parseable first, human-readable second.
+
+The document corpus is ~3,000 items (2,000 PDFs of varying complexity, 500 markdown/text notes, 500 code snippets) producing ~22,000 chunks. This is small enough that brute-force vector search is viable and SQLite is more than sufficient.
+
+## Goals / Non-Goals
+
+**Goals:**
+- Single-command install (`pipx install kb-search`) with `kb init` for model setup
+- Ingest heterogeneous documents with format-appropriate chunking
+- Hybrid search (keyword + semantic) with a single command
+- JSON output contract stable enough for skill integration
+- Configurable but works with zero configuration
+- All state in one SQLite file for easy backup/portability
+
+**Non-Goals:**
+- LLM-based answer synthesis (the calling skill handles this)
+- Multi-user or networked access
+- Real-time / streaming ingestion
+- Web UI or TUI dashboard
+- Support for every possible document format (start with PDF, markdown, code, notes)
+- Clustering, deduplication, or automatic organisation of documents
+
+## Decisions
+
+### 1. Package Structure
+
+```
+kb-search/
+├── pyproject.toml
+├── src/
+│   └── kb_search/
+│       ├── __init__.py
+│       ├── cli.py              # Click CLI entry point
+│       ├── config.py           # YAML config loading + ENV overrides
+│       ├── database.py         # SQLite schema, migrations, connection
+│       ├── embeddings.py       # Model download, loading, inference
+│       ├── search.py           # Hybrid search + RRF merging
+│       ├── ingest/
+│       │   ├── __init__.py
+│       │   ├── detector.py     # File type detection + routing
+│       │   ├── docling.py      # Docling pipeline (PDF, DOCX, HTML, images)
+│       │   ├── markdown.py     # Header-based markdown splitting
+│       │   ├── code.py         # AST/regex code splitting
+│       │   └── note.py         # Whole-document note handler
+│       └── output.py           # JSON + human-readable formatters
+├── tests/
+└── SKILL.md                    # Claude Code skill definition
+```
+
+**Why this structure:** Flat enough to navigate easily, but the `ingest/` subpackage isolates format-specific logic. Each ingestion module exports the same interface (`ingest(path, config) -> list[Chunk]`), making it easy to add formats later. The project uses the `src/` layout, per Python packaging best practices.
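
A minimal sketch of what the shared `ingest(path, config) -> list[Chunk]` interface could look like. The `Chunk` fields, the `INGESTORS` routing table, and `ingest_note` are assumptions made for illustration; the design above only fixes the function signature.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable


@dataclass
class Chunk:
    """One indexable unit of text (assumed fields, loosely mirroring the chunks table)."""
    text: str
    chunk_index: int
    metadata: dict = field(default_factory=dict)


# Every ingest module exposes the same callable shape.
Ingestor = Callable[[Path, dict], list[Chunk]]


def ingest_note(path: Path, config: dict) -> list[Chunk]:
    """Whole-document handler: the entire file becomes a single chunk."""
    return [Chunk(text=path.read_text(encoding="utf-8"), chunk_index=0)]


# Hypothetical routing table keyed by file extension.
INGESTORS: dict[str, Ingestor] = {".txt": ingest_note}


def ingest(path: Path, config: dict) -> list[Chunk]:
    """Route a file to the handler registered for its extension."""
    try:
        handler = INGESTORS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"unsupported file type: {path.suffix!r}") from None
    return handler(path, config)
```

Adding a format then means writing one function with this signature and registering it, without touching the callers.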
+
+### 2. SQLite as Sole Storage Backend
+
+All data lives in `~/.kb/kb.db`:
+
+```sql
+-- Documents
+CREATE TABLE documents (
+    id INTEGER PRIMARY KEY AUTOINCREMENT,
+    title TEXT NOT NULL,
+    source_path TEXT,
+    content_hash TEXT NOT NULL,          -- SHA-256 for dedup/change detection
+    doc_type TEXT NOT NULL CHECK(doc_type IN ('pdf','markdown','code','note')),
+    language TEXT,                        -- for code: 'python','bash','go'
+    created_at TEXT DEFAULT (datetime('now')),
+    metadata TEXT DEFAULT '{}'           -- JSON: page_count, author, etc.
+);
+
+-- Chunks
+CREATE TABLE chunks (
+    id INTEGER PRIMARY KEY AUTOINCREMENT,
+    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
+    chunk_index INTEGER NOT NULL,
+    text TEXT NOT NULL,
+    token_count INTEGER,
+    metadata TEXT DEFAULT '{}',          -- JSON: page, section_header, symbol_name
+    created_at TEXT DEFAULT (datetime('now'))
+);
+
+-- FTS5 index (content-sync with chunks table)
+CREATE VIRTUAL TABLE chunks_fts USING fts5(
+    text,
+    content='chunks',
+    content_rowid='id',
+    tokenize='porter unicode61'
+);
+
+-- Triggers to keep FTS in sync
+CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
+    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
+END;
+CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
+    INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
+END;
+CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
+    INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
+    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
+END;
+
+-- Vector storage (sqlite-vec)
+CREATE VIRTUAL TABLE chunks_vec USING vec0(
+    chunk_id INTEGER PRIMARY KEY,
+    embedding FLOAT[384]                 -- dimension matches model
+);
+
+-- Tags
+CREATE TABLE tags (
+    id INTEGER PRIMARY KEY AUTOINCREMENT,
+    name TEXT UNIQUE NOT NULL
+);
+
+CREATE TABLE document_tags (
+    document_id INTEGER REFERENCES documents(id) ON DELETE CASCADE,
+    tag_id INTEGER REFERENCES tags(id) ON DELETE CASCADE,
+    PRIMARY KEY (document_id, tag_id)
+);
+
+-- Config stored in DB (model binding)
+CREATE TABLE config (
+    key TEXT PRIMARY KEY,
+    value TEXT NOT NULL
+);
+-- Keys: schema_version, model_name, embedding_dim, model_max_tokens
+```
+
+**Why SQLite for everything:** At ~22,000 chunks, SQLite handles FTS, vector search, and relational data without breaking a sweat. One file = trivial backup (`cp kb.db kb.db.bak`), no server process, no port conflicts. FTS5 is built into SQLite. sqlite-vec is a single loadable extension.
+
+**Why store config in DB _and_ YAML:** The YAML file holds user preferences (chunking params, model choice). The DB `config` table records what the DB was _actually built with_ (model name, dimension). This separation lets us detect mismatches: "config says use nomic-embed-text but DB was built with all-MiniLM-L6-v2."
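
The mismatch check described above could be sketched as follows. `ModelMismatchError` and `check_model_binding` are hypothetical names; only the `config` table and its `model_name` key come from the schema.

```python
import sqlite3


class ModelMismatchError(RuntimeError):
    """Raised when the configured model differs from the one the DB was built with."""


def check_model_binding(conn: sqlite3.Connection, configured_model: str) -> None:
    """Compare the YAML-configured model against the model recorded in the DB."""
    row = conn.execute(
        "SELECT value FROM config WHERE key = 'model_name'"
    ).fetchone()
    if row is None:
        return  # fresh DB: no model has been bound yet
    if row[0] != configured_model:
        raise ModelMismatchError(
            f"config says use {configured_model} but DB was built with {row[0]}; "
            "run `kb reindex` to switch models"
        )
```

Because the check reads only the DB and the resolved config, it can run before any model is loaded, which keeps the failure fast.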
+
+**Alternatives considered:**
+- ChromaDB/Qdrant: External services, overkill for this scale, breaks single-file story
+- DuckDB: Good at analytics, but FTS support is weaker than SQLite FTS5
+- LanceDB: Interesting but less mature, no FTS built in
+
+### 3. Docling for Complex Document Ingestion
+
+Docling handles PDF, DOCX, HTML, and image files through a unified pipeline with ML-based layout detection and table reconstruction.
+
+**Why Docling over simpler extractors:** The 2,000 PDFs are "many and varied" — simple text extraction (pymupdf, pdfplumber) works for clean PDFs but silently produces garbage for complex layouts, tables, or multi-column documents. Docling's layout model correctly identifies structural elements, and its table reconstruction preserves data that would otherwise be lost. The quality difference matters because bad chunks → bad search results → useless tool.
+
+**Docling configuration for this project:**
+- Use `pypdfium2` backend (default, fast for text-based PDFs)
+- Enable OCR only when needed (detect pages with no extractable text)
+- Use hierarchy-aware chunking (respects section/paragraph boundaries)
+- Disable image extraction (we're indexing text, not images)
+- Run with multiple workers for batch ingestion
+
+**Model download:** Docling models (~1.5 GB) download on first use or via `kb init`. They are stored in HuggingFace's default cache (`~/.cache/huggingface/`), the same location used for the embedding model.
+
+**Alternatives considered:**
+- pymupdf4llm: Fast, lightweight, but poor table/layout handling
+- Unstructured: Heavier than Docling, commercial focus, less predictable output
+- LlamaParse: Cloud-only, violates local-first constraint
+
+### 4. Per-Type Chunking Strategy
+
+Each document type gets a purpose-built chunker with configurable parameters:
+
+**PDF (Docling):** Hierarchy-aware chunking. Docling's `HierarchicalChunker` splits at section/paragraph boundaries respecting the document's logical structure. Falls back to fixed-size if hierarchy detection fails.
+
+**Markdown:** Header-based splitting. Split at `##` and `###` boundaries. Preserve parent header chain as context (so a chunk under "## Config > ### Advanced" carries that path). Merge small sections (< `min_tokens`) with their neighbor. Configurable: `min_tokens`, `max_tokens`.
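
A rough sketch of the header-based splitter, under simplifying assumptions: whitespace-separated words stand in for tokens, `max_tokens` is not enforced, fenced code blocks containing `##` lines are not special-cased, and small sections merge into the preceding neighbour. The function name and the dict shape are illustrative only.

```python
import re

HEADER_RE = re.compile(r"^(#{2,3})\s+(.*\S)\s*$")


def split_markdown(text: str, min_tokens: int = 50) -> list[dict]:
    """Split at ## / ### headers, keeping the parent header chain per section."""
    sections: list[dict] = []
    chain: dict[int, str] = {}          # header level -> current title
    current = {"path": "", "lines": []}

    def flush() -> None:
        body = "\n".join(current["lines"]).strip()
        if body:
            sections.append({"path": current["path"], "text": body})

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match is None:
            current["lines"].append(line)
            continue
        flush()
        level = len(match.group(1))
        chain[level] = match.group(2)
        if level == 2:
            chain.pop(3, None)          # a new ## resets the ### below it
        path = " > ".join(f"{'#' * lv} {chain[lv]}" for lv in sorted(chain))
        current = {"path": path, "lines": []}
    flush()

    # Merge sections below min_tokens into the preceding neighbour.
    merged: list[dict] = []
    for section in sections:
        small = len(section["text"].split()) < min_tokens
        if merged and small:
            merged[-1]["text"] += "\n\n" + section["text"]
        else:
            merged.append(section)
    return merged
```

Each returned section carries its header path (for example `## Config > ### Advanced`), which can go into the chunk's `metadata` as `section_header`.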
+
+**Code (Python):** Use stdlib `ast` module. Each function and class becomes a chunk. Class methods include the class docstring for context. Top-level code between definitions becomes its own chunk.
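
A minimal sketch of the `ast`-based approach, covering only top-level definitions. Per-method splitting with the class docstring attached is left out, and comments that sit outside any statement are not captured because they do not appear in the syntax tree. `chunk_python` and its return shape are assumptions.

```python
import ast


def chunk_python(source: str) -> list[dict]:
    """Return one chunk per top-level function/class, plus loose top-level code."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: list[dict] = []
    loose: list[str] = []               # top-level statements between definitions

    def flush_loose() -> None:
        text = "\n".join(loose).strip()
        if text:
            chunks.append({"symbol": None, "text": text})
        loose.clear()

    for node in tree.body:
        # Start at the first decorator, if any, so it stays with its definition.
        decorators = getattr(node, "decorator_list", [])
        start = min([node.lineno] + [d.lineno for d in decorators])
        segment = "\n".join(lines[start - 1 : node.end_lineno])
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            flush_loose()
            chunks.append({"symbol": node.name, "text": segment})
        else:
            loose.append(segment)
    flush_loose()
    return chunks
```

The `symbol` value maps naturally onto the `symbol_name` key mentioned for chunk metadata.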
+
+**Code (Bash):** Regex-based. Split on `function name() {` and `name() {` patterns with brace-depth counting. Comment blocks preceding a function attach to that function's chunk. Fall back to fixed-size windowed chunks if no functions detected.
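
The brace-depth idea can be sketched like this. It is deliberately naive: braces inside strings, comments, or heredocs are counted too, and the fixed-size fallback is omitted. The regex and the `chunk_bash` name are assumptions.

```python
import re

# Matches both `function name() {` and `name() {`.
FUNC_RE = re.compile(r"^\s*(?:function\s+)?([A-Za-z_][\w-]*)\s*\(\)\s*\{")


def chunk_bash(source: str) -> list[dict]:
    """Split a script into one chunk per function using brace-depth counting."""
    chunks: list[dict] = []
    pending_comments: list[str] = []    # comment block directly above a function
    current: list[str] = []
    name = None
    depth = 0

    for line in source.splitlines():
        if name is None:
            match = FUNC_RE.match(line)
            if match is None:
                # Keep only a contiguous comment block; anything else resets it.
                if line.lstrip().startswith("#"):
                    pending_comments.append(line)
                else:
                    pending_comments.clear()
                continue
            name = match.group(1)
            current = pending_comments + [line]
            pending_comments = []
            depth = 0
        else:
            current.append(line)
        # Naive count: every brace on the line moves the depth.
        depth += line.count("{") - line.count("}")
        if depth <= 0:
            chunks.append({"symbol": name, "text": "\n".join(current)})
            name, current = None, []
    return chunks
```

Balanced expansions such as `${USER}` cancel out within a line, so they do not disturb the depth count.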
+
+**Code (Go):** Regex-based. Split on `func ` declarations. Type definitions with methods are grouped. Fall back to fixed-size if no recognisable boundaries.
+
+**Notes:** Whole document = one chunk. Notes are small by definition.
+
+**Configurable defaults (in `~/.kb/config.yaml`):**
+```yaml
+chunking:
+  defaults:
+    max_tokens: 512
+    overlap_tokens: 50
+  pdf:
+    strategy: hierarchy    # hierarchy | fixed
+    max_tokens: 1024       # for fixed strategy fallback
+  markdown:
+    strategy: header       # header | fixed
+    min_tokens: 50         # merge sections smaller than this
+    max_tokens: 1024
+  code:
+    strategy: ast          # ast | fixed
+    include_context: true  # include class/module docstring with methods
+    max_tokens: 1024
+  note:
+    strategy: whole
+```
+
+### 5. Embedding Model Management
+
+**Default model:** `all-MiniLM-L6-v2` (384 dimensions, 90 MB, good quality/speed tradeoff for CPU).
+
+**Model loading:** Use `sentence-transformers` library which provides a unified API across models. Models stored in HuggingFace's default cache (`~/.cache/huggingface/`), shared with other tools that use HF models. No custom cache directory override.
+
+**Model binding:** On `kb init`, the chosen model's name and dimension are written to the DB `config` table. Every subsequent `kb add` checks the loaded model matches the DB. Mismatch = hard error with clear message.
+
+**Model switching (`kb reindex`):**
+1. Download new model
+2. Read all chunks from DB
+3. Re-embed in batches (with progress bar)
+4. Recreate `chunks_vec` table if dimension changed
+5. Replace all vectors in `chunks_vec`
+6. Update DB config (model_name, embedding_dim, model_max_tokens)
+
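+**ONNX Runtime for inference:** Use `sentence-transformers` with ONNX backend (`model = SentenceTransformer(model_name, backend="onnx")`). This gives us sentence-transformers' correct tokenization/pooling/normalization while using ONNX Runtime (~30 MB) instead of PyTorch (~200 MB) for inference. Models are automatically exported to ONNX format on first load. This keeps inference lightweight without sacrificing the convenience of the sentence-transformers API. One point to verify before release: `sentence-transformers` declares `torch` as an install dependency and the export step relies on it, so the install-size saving only materialises if PyTorch can actually be left out of the environment.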
+
+**Model compatibility:** All models on HuggingFace that work with `sentence-transformers` are supported. The only per-model differences handled in code:
+- Dimension (read from model config)
+- Max sequence length (read from model config, used to cap chunk size)
+- Query/passage prefixes (configurable in YAML, empty by default)
+
+```yaml
+embedding:
+  model: all-MiniLM-L6-v2
+  query_prefix: ""          # some models need "search_query: "
+  passage_prefix: ""        # some models need "search_document: "
+```
+
+### 6. Hybrid Search with Reciprocal Rank Fusion
+
+**Search flow:**
+
+```
+Query: "how to install git"
+         │
+         ├──▶ FTS5 query ──▶ BM25-ranked results (chunk_id, fts_score)
+         │
+         └──▶ Embed query ──▶ vec similarity search ──▶ (chunk_id, vec_score)
+                              (cosine distance, top-K)
+         │
+         ▼
+    Reciprocal Rank Fusion (RRF)
+    score(d) = Σ 1/(k + rank_in_list)  where k=60 (standard)
+         │
+         ▼
+    Merged results, sorted by RRF score
+         │
+         ▼
+    Apply filters (tags, doc_type) ──▶ Top-N results
+```
+
+**Why RRF over learned re-ranking:** RRF is simple, has a single constant to set (k=60 is the standard choice, exposed as `rrf_k` in the config), and performs well without any training. A learned re-ranker (e.g., a cross-encoder) would add another model download and slow down queries, and the marginal quality improvement isn't worth it at this scale. RRF can be swapped out later if needed.
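
The fusion step from the search flow can be sketched in a few lines. `rrf_merge` and the sample chunk ids are illustrative. Note that with two lists and k=60 a raw RRF score cannot exceed 2/61 (about 0.033), so values such as `0.87` in the output examples imply a normalisation step that this document does not specify.

```python
def rrf_merge(rankings: list[list[int]], k: int = 60) -> list[tuple[int, float]]:
    """Fuse ranked lists of chunk ids; rank 1 is the best hit in each list."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by chunk id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


fts_hits = [7, 3, 9]        # chunk ids ordered by BM25
vec_hits = [3, 8, 7]        # chunk ids ordered by vector distance
merged = rrf_merge([fts_hits, vec_hits])
```

Chunk 3 ranks first here because it sits near the top of both lists, which is exactly the behaviour RRF is chosen for: agreement between the two legs outweighs a single high rank.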
+
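+**FTS5 query construction:** Pass the user's query terms to FTS5 without expansion. FTS5's porter stemmer handles basic normalisation. Because FTS5 has its own query syntax, terms containing special characters (quotes, `-`, `*`, `:`, `+`) are wrapped in double quotes so they are matched literally instead of raising a syntax error. No query expansion or synonym handling; keep it simple.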
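
One way to handle special characters is to quote every term before it reaches `MATCH`. This sketch assumes the Python build's SQLite has FTS5 compiled in; `to_fts5_query` and the throwaway `docs` table are hypothetical, and an empty query would still need to be rejected before calling it.

```python
import sqlite3


def to_fts5_query(raw: str) -> str:
    """Quote each whitespace-separated term so FTS5 treats it as a literal string."""
    terms = raw.split()
    # Inside an FTS5 string, a double quote is escaped by doubling it.
    return " ".join('"' + term.replace('"', '""') + '"' for term in terms)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(text, tokenize='porter unicode61')")
conn.executemany(
    "INSERT INTO docs(text) VALUES (?)",
    [("Installing git from source",), ("C++ build notes",)],
)
# bm25() returns lower values for better matches, so ascending order is best-first.
rows = conn.execute(
    "SELECT rowid FROM docs WHERE docs MATCH ? ORDER BY bm25(docs)",
    (to_fts5_query("install git"),),
).fetchall()
```

Quoted terms are still tokenised and stemmed, so `install` continues to match `Installing`, while a bare `C++` no longer triggers a syntax error.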
+
+**Vector search:** Embed the query with the same model used for chunks. Retrieve top-K (K = 3× requested results, to give RRF enough candidates). sqlite-vec does brute-force cosine similarity over all vectors — at 22K vectors this is ~2-5ms.
+
+**Filter application:** Tag and type filters are applied as SQL WHERE clauses _before_ fusion where possible (for FTS5 via JOIN). Where a search leg cannot filter in SQL, they are applied as post-filters on the merged results, which is the step shown at the end of the search flow. The choice per filter type:
+- Type filter: Applied in the SQL query (efficient)
+- Tag filter: Applied in the SQL query via JOIN (efficient)
+- Score threshold: Applied post-RRF as a cutoff
+
+### 7. Output Format (Skill Contract)
+
+**JSON output (`--format json`, default):**
+
+```json
+{
+  "query": "how to install git",
+  "results": [
+    {
+      "chunk_id": 1423,
+      "score": 0.87,
+      "score_breakdown": {"fts": 0.72, "vector": 0.94},
+      "text": "To install the latest version of git from source...",
+      "source": {
+        "document_id": 42,
+        "title": "Git Admin Guide",
+        "path": "/home/user/docs/git-admin.pdf",
+        "type": "pdf",
+        "page": 12,
+        "chunk_index": 3,
+        "total_chunks": 28,
+        "tags": ["git", "admin"]
+      }
+    }
+  ],
+  "total_matches": 47,
+  "returned": 10
+}
+```
+
+**Human output (`--format human`):**
+
+```
+Search: "how to install git" (47 matches, showing top 10)
+
+ 1. [0.87] Git Admin Guide (p.12)                    [pdf] [git, admin]
+    To install the latest version of git from source...
+
+ 2. [0.65] setup-notes.md §Installation               [markdown] [git]
+    First, add the PPA repository for the latest git...
+```
+
+**Stability commitment:** The JSON schema is the contract with the skill. Fields may be _added_ but not removed or renamed once the skill is built.
+
+### 8. Configuration Architecture
+
+```
+Precedence (highest to lowest):
+  1. CLI flags (--top, --tags, --format)
+  2. Environment variables (KB_MODEL, KB_DATA_DIR, KB_DEFAULT_TOP)
+  3. ~/.kb/config.yaml
+  4. Built-in defaults
+
+ENV variable naming: KB_ prefix + UPPER_SNAKE_CASE
+  KB_DATA_DIR     → ~/.kb/
+  KB_MODEL        → all-MiniLM-L6-v2
+  KB_DEFAULT_TOP  → 10
+```
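
The precedence order can be sketched as a layered merge. For brevity the settings are flattened to three keys and the YAML file is assumed to be parsed into a dict already; `resolve_config`, `DEFAULTS`, and `ENV_KEYS` are illustrative names.

```python
import os

DEFAULTS = {"data_dir": "~/.kb", "model": "all-MiniLM-L6-v2", "default_top": 10}
ENV_KEYS = {"KB_DATA_DIR": "data_dir", "KB_MODEL": "model", "KB_DEFAULT_TOP": "default_top"}


def resolve_config(file_cfg: dict, cli_flags: dict, env=None) -> dict:
    """Layer settings: defaults < config.yaml < environment < CLI flags."""
    env = os.environ if env is None else env
    merged = dict(DEFAULTS)
    merged.update(file_cfg)
    for env_name, key in ENV_KEYS.items():
        if env_name in env:
            # Environment values are strings; coerce to the default's type.
            merged[key] = type(DEFAULTS[key])(env[env_name])
    # A flag left unset on the command line arrives as None and must not override.
    merged.update({k: v for k, v in cli_flags.items() if v is not None})
    return merged
```

Passing `env` explicitly keeps the function testable without mutating the process environment.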
+
+**Full default config.yaml:**
+
+```yaml
+# ~/.kb/config.yaml
+
+data_dir: ~/.kb
+
+embedding:
+  model: all-MiniLM-L6-v2
+  query_prefix: ""
+  passage_prefix: ""
+
+search:
+  default_top: 10
+  default_format: json
+  rrf_k: 60
+
+chunking:
+  defaults:
+    max_tokens: 512
+    overlap_tokens: 50
+  pdf:
+    strategy: hierarchy
+    max_tokens: 1024
+  markdown:
+    strategy: header
+    min_tokens: 50
+    max_tokens: 1024
+  code:
+    strategy: ast
+    include_context: true
+    max_tokens: 1024
+  note:
+    strategy: whole
+
+ingestion:
+  workers: 4                 # parallel Docling workers
+  batch_size: 50             # commit to DB every N documents
+  enable_ocr: auto           # auto | always | never
+```
+
+### 9. CLI Framework: Click
+
+**Why Click:** Mature, well-documented, supports nested command groups, automatic `--help` generation, parameter validation, and progress bars (via `click.progressbar`). The alternative (Typer) adds type-hint magic but offers less control. argparse is too verbose for this many commands.
+
+### 10. Error Handling and Resumability
+
+**Batch ingestion must be resumable.** When adding a directory of 2,000 PDFs:
+- Each document is processed independently
+- On success: document + chunks inserted in a single transaction
+- On failure: error logged, document skipped, processing continues
+- `content_hash` (SHA-256 of file contents) enables skip-if-already-indexed
+- Progress shown via `click.progressbar` or `rich.progress`
+- Summary at end: "Added 1,847 documents. 12 failed. 141 skipped (already indexed)."
+
+Failed documents are logged to `~/.kb/ingest-errors.log` with the file path and error for later investigation.
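
A sketch of the resumable loop under stated assumptions: `process` is a stand-in for the per-format ingest call and returns chunk texts, `doc_type` is hard-coded to keep the example short, embeddings are omitted, and failures are printed instead of written to `~/.kb/ingest-errors.log`.

```python
import hashlib
import sqlite3
from pathlib import Path


def file_hash(path: Path) -> str:
    """SHA-256 of the file contents, read in blocks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def add_documents(conn: sqlite3.Connection, paths: list[Path], process) -> dict:
    """Ingest each file independently; `process(path)` returns a list of chunk texts."""
    summary = {"added": 0, "failed": 0, "skipped": 0}
    for path in paths:
        try:
            content_hash = file_hash(path)
            seen = conn.execute(
                "SELECT 1 FROM documents WHERE content_hash = ?", (content_hash,)
            ).fetchone()
            if seen:
                summary["skipped"] += 1
                continue
            chunks = process(path)
            with conn:  # document and chunks commit together or roll back together
                cur = conn.execute(
                    "INSERT INTO documents (title, source_path, content_hash, doc_type)"
                    " VALUES (?, ?, ?, 'note')",
                    (path.name, str(path), content_hash),
                )
                conn.executemany(
                    "INSERT INTO chunks (document_id, chunk_index, text) VALUES (?, ?, ?)",
                    [(cur.lastrowid, i, text) for i, text in enumerate(chunks)],
                )
            summary["added"] += 1
        except Exception as exc:  # one bad file must not abort the batch
            summary["failed"] += 1
            print(f"failed: {path}: {exc}")
    return summary
```

Re-running the same command after an interruption skips everything already committed and retries only the failures, which is what makes a multi-hour ingestion safe to restart.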
+
+## Risks / Trade-offs
+
+**[Docling model size] → Mitigation:** ~1.5 GB download on first init. Clear progress indication during download. Models are kept in HuggingFace's default cache (`~/.cache/huggingface/`), so the download happens once. Document this in `kb init` output and SKILL.md.
+
+**[Docling ingestion speed on CPU] → Mitigation:** ~17 hours for 2,000 PDFs on CPU (roughly 30 seconds per document). Support parallel workers (configurable). Show per-document progress. Resumable by design (skip already-indexed). Suggest GPU if available. This is a one-time cost.
+
+**[ONNX model export on first load] → Mitigation:** First time a model is loaded, sentence-transformers exports it to ONNX format. This takes 10-30 seconds and is cached for subsequent runs. Users see a one-time delay on first `kb add` or `kb search` after init. Show a clear message: "Optimising model for ONNX inference (one-time)..."
+
+**[sqlite-vec maturity] → Mitigation:** sqlite-vec is relatively new. At 22K vectors, brute-force search means we're not relying on its ANN indexing. If sqlite-vec has issues, swapping to numpy cosine similarity over a stored blob column is straightforward — same DB, different query path.
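
The fallback path could look like the sketch below. To stay dependency-free it uses the standard library's `array` module where the text proposes numpy (numpy would vectorise the scoring loop), and the `chunk_blobs` table name is hypothetical.

```python
import math
import sqlite3
from array import array


def to_blob(vector: list[float]) -> bytes:
    """Pack a vector as float32 bytes for storage in a BLOB column."""
    return array("f", vector).tobytes()


def cosine(a, b) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(conn: sqlite3.Connection, query: list[float], k: int) -> list[tuple[int, float]]:
    """Score every stored vector against the query and keep the k most similar."""
    scored = []
    for chunk_id, blob in conn.execute("SELECT chunk_id, embedding FROM chunk_blobs"):
        stored = array("f")
        stored.frombytes(blob)
        scored.append((chunk_id, cosine(query, stored)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The rest of the pipeline only needs ranked chunk ids, so swapping this in changes the query path but not the fusion or output code.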
+
+**[FTS5 trigger sync] → Mitigation:** FTS5 content-sync triggers add write overhead. At our scale (inserts during ingestion, not real-time) this is negligible. If it becomes an issue, switch to manual sync with `INSERT INTO chunks_fts(chunks_fts) VALUES('rebuild')` after batch operations.
+
+**[Model lock-in] → Mitigation:** Changing embedding models requires full reindex (~22K embeddings, ~10-30 minutes on CPU). `kb reindex` with progress bar makes this manageable. Model name stored in DB prevents silent mixing.
+
+## Resolved Questions
+
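+1. **ONNX for inference from day one.** Use sentence-transformers with ONNX backend. Smaller inference runtime (~30 MB vs ~200 MB for PyTorch), faster CPU inference. No PyTorch at inference time; whether PyTorch can also be dropped from the install needs checking, since `sentence-transformers` lists `torch` as a dependency.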
+
+2. **HuggingFace default cache for models.** Both embedding and Docling models use `~/.cache/huggingface/`. Shared with other HF tools — no duplicate downloads if the user already has models cached.
+
+3. **Manual schema migrations.** Version number in `config` table. `database.py` checks version on open and runs ALTER TABLE scripts sequentially. Simple enough for this project's schema complexity.
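
A minimal sketch of the sequential migration runner. The two statements in `MIGRATIONS` are placeholders rather than the project's real schema history, and `migrate` / `schema_version` are assumed names; only the `config` table and its `schema_version` key come from the design.

```python
import sqlite3

# Placeholder list: entry i upgrades the schema from version i to version i + 1.
MIGRATIONS = [
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL)",
    "ALTER TABLE documents ADD COLUMN source_path TEXT",
]


def schema_version(conn: sqlite3.Connection) -> int:
    """Read the recorded version; a DB without the key counts as version 0."""
    row = conn.execute(
        "SELECT value FROM config WHERE key = 'schema_version'"
    ).fetchone()
    return int(row[0]) if row else 0


def migrate(conn: sqlite3.Connection) -> int:
    """Apply pending migrations in order, recording the version after each one."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS config (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    version = schema_version(conn)
    for target, statement in enumerate(MIGRATIONS[version:], start=version + 1):
        conn.execute("BEGIN")  # explicit transaction so the DDL and version bump pair up
        try:
            conn.execute(statement)
            conn.execute(
                "INSERT OR REPLACE INTO config (key, value) VALUES ('schema_version', ?)",
                (str(target),),
            )
            conn.commit()
        except Exception:
            conn.rollback()
            raise
    return schema_version(conn)
```

Calling `migrate` on every open is cheap: once the recorded version equals `len(MIGRATIONS)`, the loop body never runs.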