Initial MVP

2026-03-23 20:38:42 +00:00
commit f245c24928
57 changed files with 6812 additions and 0 deletions
@@ -0,0 +1,396 @@
## Context
This is a greenfield Python CLI project: no existing codebase, no migration concerns. The tool is installed via `pipx install kb-search` and keeps its data and configuration in `~/.kb/` on the user's machine. It must work entirely offline after the initial model download.
Primary consumer is Claude Code (or similar LLM tools) via a skill wrapper that calls `kb search` and feeds JSON results to the LLM for synthesis. Secondary consumer is the user directly in a terminal. This dual-consumer constraint means output must be machine-parseable first, human-readable second.
The document corpus is ~3,000 items (2,000 PDFs of varying complexity, 500 markdown/text notes, 500 code snippets) producing ~22,000 chunks. This is small enough that brute-force vector search is viable and SQLite is more than sufficient.
## Goals / Non-Goals
**Goals:**
- Single-command install (`pipx install kb-search`) with `kb init` for model setup
- Ingest heterogeneous documents with format-appropriate chunking
- Hybrid search (keyword + semantic) with a single command
- JSON output contract stable enough for skill integration
- Configurable but works with zero configuration
- All state in one SQLite file for easy backup/portability
**Non-Goals:**
- LLM-based answer synthesis (the calling skill handles this)
- Multi-user or networked access
- Real-time / streaming ingestion
- Web UI or TUI dashboard
- Support for every possible document format (start with PDF, markdown, code, notes)
- Clustering, deduplication, or automatic organisation of documents
## Decisions
### 1. Package Structure
```
kb-search/
├── pyproject.toml
├── src/
│   └── kb_search/
│       ├── __init__.py
│       ├── cli.py              # Click CLI entry point
│       ├── config.py           # YAML config loading + ENV overrides
│       ├── database.py         # SQLite schema, migrations, connection
│       ├── embeddings.py       # Model download, loading, inference
│       ├── search.py           # Hybrid search + RRF merging
│       ├── ingest/
│       │   ├── __init__.py
│       │   ├── detector.py     # File type detection + routing
│       │   ├── docling.py      # Docling pipeline (PDF, DOCX, HTML, images)
│       │   ├── markdown.py     # Header-based markdown splitting
│       │   ├── code.py         # AST/regex code splitting
│       │   └── note.py         # Whole-document note handler
│       └── output.py           # JSON + human-readable formatters
├── tests/
└── SKILL.md                    # Claude Code skill definition
```
**Why this structure:** Flat enough to navigate easily, but the `ingest/` subpackage isolates format-specific logic. Each ingestion module exports the same interface (`ingest(path, config) -> list[Chunk]`), making it easy to add formats later. Using `src/` layout per Python packaging best practices.
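The shared ingestion interface can be sketched as follows. This is a minimal illustration, not the project's actual code: the `Chunk` fields, the `HANDLERS` table, and the `route` function are hypothetical names chosen for the example; only the `ingest(path, config) -> list[Chunk]` signature comes from the design.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable


@dataclass
class Chunk:
    """One retrievable unit of text (hypothetical shape)."""
    text: str
    chunk_index: int
    metadata: dict = field(default_factory=dict)


# Every format handler follows the same signature: ingest(path, config) -> list[Chunk]
Ingestor = Callable[[Path, dict], list[Chunk]]


def ingest_note(path: Path, config: dict) -> list[Chunk]:
    """Whole-document handler: the entire file becomes a single chunk."""
    return [Chunk(text=path.read_text(encoding="utf-8"), chunk_index=0)]


# Hypothetical routing table keyed by file extension; adding a format means
# adding one entry here plus one handler module.
HANDLERS: dict[str, Ingestor] = {".txt": ingest_note}


def route(path: Path, config: dict) -> list[Chunk]:
    """Pick a handler by extension and delegate to it."""
    handler = HANDLERS.get(path.suffix.lower())
    if handler is None:
        raise ValueError(f"unsupported file type: {path.suffix}")
    return handler(path, config)
```

Because callers only see `route`, a new format never touches the CLI or database code.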
### 2. SQLite as Sole Storage Backend
All data lives in `~/.kb/kb.db`:
```sql
-- Documents
CREATE TABLE documents (
id INTEGER PRIMARY KEY AUTOINCREMENT,
title TEXT NOT NULL,
source_path TEXT,
content_hash TEXT NOT NULL, -- SHA-256 for dedup/change detection
doc_type TEXT NOT NULL CHECK(doc_type IN ('pdf','markdown','code','note')),
language TEXT, -- for code: 'python','bash','go'
created_at TEXT DEFAULT (datetime('now')),
metadata TEXT DEFAULT '{}' -- JSON: page_count, author, etc.
);
-- Chunks
CREATE TABLE chunks (
id INTEGER PRIMARY KEY AUTOINCREMENT,
document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
chunk_index INTEGER NOT NULL,
text TEXT NOT NULL,
token_count INTEGER,
metadata TEXT DEFAULT '{}', -- JSON: page, section_header, symbol_name
created_at TEXT DEFAULT (datetime('now'))
);
-- FTS5 index (content-sync with chunks table)
CREATE VIRTUAL TABLE chunks_fts USING fts5(
text,
content='chunks',
content_rowid='id',
tokenize='porter unicode61'
);
-- Triggers to keep FTS in sync
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;
-- Vector storage (sqlite-vec)
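-- vec0 defaults to L2 distance; distance_metric=cosine matches the cosine search in Decision 6
CREATE VIRTUAL TABLE chunks_vec USING vec0(
chunk_id INTEGER PRIMARY KEY,
embedding FLOAT[384] distance_metric=cosine -- dimension matches model
);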
-- Tags
CREATE TABLE tags (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT UNIQUE NOT NULL
);
CREATE TABLE document_tags (
document_id INTEGER REFERENCES documents(id) ON DELETE CASCADE,
tag_id INTEGER REFERENCES tags(id) ON DELETE CASCADE,
PRIMARY KEY (document_id, tag_id)
);
-- Config stored in DB (model binding)
CREATE TABLE config (
key TEXT PRIMARY KEY,
value TEXT NOT NULL
);
-- Keys: schema_version, model_name, embedding_dim, model_max_tokens
```
**Why SQLite for everything:** At ~22,000 chunks, SQLite handles FTS, vector search, and relational data without breaking a sweat. One file = trivial backup (`cp kb.db kb.db.bak`), no server process, no port conflicts. FTS5 is built into SQLite. sqlite-vec is a single loadable extension.
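To illustrate how the content-sync triggers keep the keyword index aligned with the `chunks` table, here is a small in-memory sketch using Python's stdlib `sqlite3`. It uses a cut-down `chunks` table (only `id` and `text`) and assumes the local SQLite build includes FTS5, which is common but not guaranteed. The `keyword_search` helper is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT NOT NULL);
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    text, content='chunks', content_rowid='id', tokenize='porter unicode61'
);
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES ('delete', old.id, old.text);
END;
""")
conn.execute("INSERT INTO chunks (text) VALUES (?)", ("Installing git from source",))
conn.execute("INSERT INTO chunks (text) VALUES (?)", ("Configuring SSH keys",))


def keyword_search(query: str) -> list[tuple[int, float]]:
    """Return (chunk_id, bm25) pairs; bm25() is lower for better matches."""
    return conn.execute(
        "SELECT rowid, bm25(chunks_fts) FROM chunks_fts "
        "WHERE chunks_fts MATCH ? ORDER BY bm25(chunks_fts)",
        (query,),
    ).fetchall()


hits = keyword_search("install")          # porter stemming also matches "Installing"
conn.execute("DELETE FROM chunks WHERE id = 1")
after_delete = keyword_search("install")  # the delete trigger removed the index entry
```

Note that `bm25()` returns smaller values for better matches, so ascending order puts the best hit first.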
**Why store config in DB _and_ YAML:** The YAML file holds user preferences (chunking params, model choice). The DB `config` table records what the DB was _actually built with_ (model name, dimension). This separation lets us detect mismatches: "config says use nomic-embed-text but DB was built with all-MiniLM-L6-v2."
**Alternatives considered:**
- ChromaDB/Qdrant: External services, overkill for this scale, breaks single-file story
- DuckDB: Good at analytics, but FTS support is weaker than SQLite FTS5
- LanceDB: Interesting but less mature, no FTS built in
### 3. Docling for Complex Document Ingestion
Docling handles PDF, DOCX, HTML, and image files through a unified pipeline with ML-based layout detection and table reconstruction.
**Why Docling over simpler extractors:** The 2,000 PDFs are "many and varied" — simple text extraction (pymupdf, pdfplumber) works for clean PDFs but silently produces garbage for complex layouts, tables, or multi-column documents. Docling's layout model correctly identifies structural elements, and its table reconstruction preserves data that would otherwise be lost. The quality difference matters because bad chunks → bad search results → useless tool.
**Docling configuration for this project:**
- Use the `pypdfium2` backend (fast for text-based PDFs)
- Enable OCR only when needed (detect pages with no extractable text)
- Use hierarchy-aware chunking (respects section/paragraph boundaries)
- Disable image extraction (we're indexing text, not images)
- Run with multiple workers for batch ingestion
**Model download:** Docling models (~1.5 GB) download on first use or via `kb init`. They are stored in HuggingFace's default cache (`~/.cache/huggingface/`), alongside the embedding model.
**Alternatives considered:**
- pymupdf4llm: Fast, lightweight, but poor table/layout handling
- Unstructured: Heavier than Docling, commercial focus, less predictable output
- LlamaParse: Cloud-only, violates local-first constraint
### 4. Per-Type Chunking Strategy
Each document type gets a purpose-built chunker with configurable parameters:
**PDF (Docling):** Hierarchy-aware chunking. Docling's `HierarchicalChunker` splits at section/paragraph boundaries respecting the document's logical structure. Falls back to fixed-size if hierarchy detection fails.
**Markdown:** Header-based splitting. Split at `##` and `###` boundaries. Preserve the parent header chain as context (so a chunk under "## Config > ### Advanced" carries that path). Merge small sections (< `min_tokens`) with their neighbour. Configurable: `min_tokens`, `max_tokens`.
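A minimal sketch of header-based splitting with a preserved header chain is shown here. The function name and the returned dictionary keys are assumptions for illustration. The sketch omits the `min_tokens` merge step and does not guard against `#` lines inside fenced code blocks, both of which a real implementation would need.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown(text: str, split_levels=(2, 3)) -> list[dict]:
    """Split at ## / ### headers, recording the parent header chain per chunk."""
    chunks: list[dict] = []
    chain: dict[int, str] = {}  # header level -> current title at that level
    body: list[str] = []
    path = ""

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            chunks.append({"header_path": path, "text": content})
        body.clear()

    for line in text.splitlines():
        m = HEADER_RE.match(line)
        if m and len(m.group(1)) in split_levels:
            flush()  # close the section that just ended
            level = len(m.group(1))
            # A new header invalidates any header at the same or deeper level.
            for stale in [lvl for lvl in chain if lvl >= level]:
                del chain[stale]
            chain[level] = m.group(2).strip()
            path = " > ".join(f"{'#' * lvl} {chain[lvl]}" for lvl in sorted(chain))
        else:
            body.append(line)
    flush()
    return chunks
```

Storing the path alongside the text lets the embedder or the output formatter prepend it as context without re-parsing the document.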
**Code (Python):** Use stdlib `ast` module. Each function and class becomes a chunk. Class methods include the class docstring for context. Top-level code between definitions becomes its own chunk.
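One possible reading of this strategy, sketched with the stdlib `ast` module: top-level functions and class methods each become a chunk, methods carry the class docstring as context, and runs of other top-level statements are grouped. The function name and dictionary keys are hypothetical, and whether a class should additionally be emitted as a whole chunk is left open here.

```python
import ast

FUNC_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef)


def chunk_python(source: str) -> list[dict]:
    """Split Python source into function, method, and loose-code chunks."""
    lines = source.splitlines()
    chunks: list[dict] = []
    loose: list[ast.stmt] = []  # consecutive top-level statements that are not definitions

    def flush_loose() -> None:
        if loose:
            start, end = loose[0].lineno, loose[-1].end_lineno
            chunks.append({"symbol": None, "context": "",
                           "text": "\n".join(lines[start - 1:end])})
            loose.clear()

    for node in ast.parse(source).body:
        if isinstance(node, FUNC_TYPES):
            flush_loose()
            chunks.append({"symbol": node.name, "context": "",
                           "text": ast.get_source_segment(source, node)})
        elif isinstance(node, ast.ClassDef):
            flush_loose()
            class_doc = ast.get_docstring(node) or ""
            for item in node.body:
                if isinstance(item, FUNC_TYPES):
                    # Each method carries the class docstring as context.
                    chunks.append({"symbol": f"{node.name}.{item.name}",
                                   "context": class_doc,
                                   "text": ast.get_source_segment(source, item)})
        else:
            loose.append(node)
    flush_loose()
    return chunks
```

`ast.parse` raises `SyntaxError` on invalid source, which is a natural trigger for a fixed-size fallback.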
**Code (Bash):** Regex-based. Split on `function name() {` and `name() {` patterns with brace-depth counting. Comment blocks preceding a function attach to that function's chunk. Fall back to fixed-size windowed chunks if no functions detected.
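The brace-depth approach for Bash might look like the sketch below. The regex and function name are illustrative assumptions. Brace counting here is naive: it counts every `{` and `}` on a line, so unbalanced braces inside strings or comments would confuse it. The fixed-size fallback is not shown.

```python
import re

# Matches "function name() {" and "name() {" (hypothetical pattern for this sketch).
FUNC_START = re.compile(r"^\s*(?:function\s+)?([A-Za-z_][\w-]*)\s*\(\s*\)\s*\{")


def split_bash(source: str) -> list[dict]:
    """Return one chunk per function, with any directly preceding comment block attached."""
    chunks: list[dict] = []
    pending_comments: list[str] = []  # comment lines immediately above the next function
    current = None                    # {"name": ..., "lines": [...]} while inside a function
    depth = 0
    for line in source.splitlines():
        if current is None:
            m = FUNC_START.match(line)
            if m:
                current = {"name": m.group(1), "lines": pending_comments + [line]}
                pending_comments = []
                depth = line.count("{") - line.count("}")
            elif line.lstrip().startswith("#"):
                pending_comments.append(line)
            else:
                pending_comments = []  # a non-comment line breaks the comment block
        else:
            current["lines"].append(line)
            depth += line.count("{") - line.count("}")
        if current is not None and depth <= 0:
            # Braces are balanced again, so the function body has ended.
            chunks.append({"name": current["name"], "text": "\n".join(current["lines"])})
            current = None
    return chunks
```

Parameter expansions such as `${USER}` contribute one opening and one closing brace on the same line, so they leave the depth unchanged.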
**Code (Go):** Regex-based. Split on `func ` declarations. Type definitions with methods are grouped. Fall back to fixed-size if no recognisable boundaries.
**Notes:** Whole document = one chunk. Notes are small by definition.
**Configurable defaults (in `~/.kb/config.yaml`):**
```yaml
chunking:
  defaults:
    max_tokens: 512
    overlap_tokens: 50
  pdf:
    strategy: hierarchy       # hierarchy | fixed
    max_tokens: 1024          # for fixed strategy fallback
  markdown:
    strategy: header          # header | fixed
    min_tokens: 50            # merge sections smaller than this
    max_tokens: 1024
  code:
    strategy: ast             # ast | fixed
    include_context: true     # include class/module docstring with methods
    max_tokens: 1024
  note:
    strategy: whole
```
### 5. Embedding Model Management
**Default model:** `all-MiniLM-L6-v2` (384 dimensions, 90 MB, good quality/speed tradeoff for CPU).
**Model loading:** Use `sentence-transformers` library which provides a unified API across models. Models stored in HuggingFace's default cache (`~/.cache/huggingface/`), shared with other tools that use HF models. No custom cache directory override.
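**Model binding:** On `kb init`, the chosen model's name and dimension are written to the DB `config` table. Every subsequent `kb add` and `kb search` checks that the loaded model matches the DB, since query vectors are only comparable to chunk vectors produced by the same model. A mismatch is a hard error with a clear message.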
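A sketch of the binding check against the `config` table follows. The function names and the `ModelMismatchError` class are hypothetical; only the table shape and the `model_name` / `embedding_dim` keys come from the schema.

```python
import sqlite3


class ModelMismatchError(RuntimeError):
    """Raised when the configured model differs from the one the DB was built with."""


def bind_model(conn: sqlite3.Connection, model_name: str, dim: int) -> None:
    """Record the model the database is built with (what `kb init` would do)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS config (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO config (key, value) VALUES (?, ?)",
        [("model_name", model_name), ("embedding_dim", str(dim))],
    )
    conn.commit()


def check_model(conn: sqlite3.Connection, model_name: str, dim: int) -> None:
    """Fail loudly if the loaded model does not match the stored binding."""
    stored = dict(conn.execute(
        "SELECT key, value FROM config WHERE key IN ('model_name', 'embedding_dim')"
    ))
    if stored.get("model_name") != model_name or stored.get("embedding_dim") != str(dim):
        raise ModelMismatchError(
            f"config requests {model_name} ({dim}d) but the DB was built with "
            f"{stored.get('model_name')} ({stored.get('embedding_dim')}d); run `kb reindex`"
        )
```

Comparing the dimension as well as the name catches the case where a model is republished under the same name with a different output size.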
**Model switching (`kb reindex`):**
1. Download new model
2. Read all chunks from DB
3. Re-embed in batches (with progress bar)
4. Recreate `chunks_vec` table if dimension changed
5. Replace all vectors in `chunks_vec`
6. Update DB config (model_name, embedding_dim)
**ONNX Runtime for inference:** Use `sentence-transformers` with ONNX backend (`model = SentenceTransformer(model_name, backend="onnx")`). This gives us sentence-transformers' correct tokenization/pooling/normalization while using ONNX Runtime (~30 MB) instead of PyTorch (~200 MB) for inference. Models are automatically exported to ONNX format on first load. This keeps the install lightweight without sacrificing the convenience of the sentence-transformers API.
**Model compatibility:** All models on HuggingFace that work with `sentence-transformers` are supported. The only per-model differences handled in code:
- Dimension (read from model config)
- Max sequence length (read from model config, used to cap chunk size)
- Query/passage prefixes (configurable in YAML, empty by default)
```yaml
embedding:
  model: all-MiniLM-L6-v2
  query_prefix: ""      # some models need "search_query: "
  passage_prefix: ""    # some models need "search_document: "
```
### 6. Hybrid Search with Reciprocal Rank Fusion
**Search flow:**
```
Query: "how to install git"
    │
    ├──▶ FTS5 query ──▶ BM25-ranked results (chunk_id, fts_score)
    │
    └──▶ Embed query ──▶ vec similarity search ──▶ (chunk_id, vec_score)
                         (cosine distance, top-K)
    │
    ▼
Reciprocal Rank Fusion (RRF)
    score(d) = Σ 1/(k + rank_in_list)   where k=60 (standard)
    │
    ▼
Merged results, sorted by RRF score
    │
    ▼
Apply filters (tags, doc_type) ──▶ Top-N results
```
**Why RRF over learned re-ranking:** RRF is simple, has a single constant to set (k, conventionally 60, exposed as `rrf_k` in the config), and performs surprisingly well. A learned re-ranker (e.g., a cross-encoder) would add another model download and slow down queries, and the marginal quality improvement isn't worth it at this scale. RRF can be swapped out later if needed.
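The fusion step itself is only a few lines. The sketch below assumes each retriever returns a list of chunk IDs ordered best-first; the function name `rrf_merge` is hypothetical.

```python
from collections import defaultdict


def rrf_merge(rankings: list[list[int]], k: int = 60) -> list[tuple[int, float]]:
    """Fuse ranked lists of chunk IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[int, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


fts_ranking = [1, 2, 3]  # chunk IDs from FTS5, best first
vec_ranking = [3, 1, 4]  # chunk IDs from vector search, best first
merged = rrf_merge([fts_ranking, vec_ranking])
```

Only ranks matter, so BM25 and distance values never need to be put on a common scale. With two lists, a raw fused score cannot exceed 2/(k+1), about 0.033 at k=60, so reporting scores on a 0-1 scale would require a normalisation step that is not specified here.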
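**FTS5 query construction:** Pass the user's query to FTS5 with minimal processing. The porter tokenizer handles basic normalisation through stemming. Characters that FTS5 would otherwise interpret as query syntax (such as quotes, `*`, `:`, parentheses, and `-`) are escaped so they match literally. No query expansion or synonym handling; keep it simple.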
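One simple way to escape a query, offered as an assumption rather than the chosen implementation, is to wrap every whitespace-separated term in double quotes and double any embedded quote, which is how FTS5 string literals are written. The function name is hypothetical.

```python
def to_fts5_query(raw: str) -> str:
    """Quote each term so FTS5 treats punctuation as literal text, not query syntax."""
    terms = raw.split()
    # Inside an FTS5 string, a literal double quote is written as two double quotes.
    return " ".join('"' + term.replace('"', '""') + '"' for term in terms)
```

The quoted terms are still tokenized and stemmed, and FTS5 combines them with implicit AND. The trade-off is that users can no longer use FTS5 operators such as `OR` or prefix `*` in their queries.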
**Vector search:** Embed the query with the same model used for chunks. Retrieve top-K (K = 3× requested results, to give RRF enough candidates). sqlite-vec does brute-force cosine similarity over all vectors — at 22K vectors this is ~2-5ms.
**Filter application:** Tag and type filters are applied as SQL WHERE clauses _before_ search where possible (for FTS5 via JOIN), or as post-filters on the merged results. This is a design choice per filter type:
- Type filter: Applied in the SQL query (efficient)
- Tag filter: Applied in the SQL query via JOIN (efficient)
- Score threshold: Applied post-RRF as a cutoff
### 7. Output Format (Skill Contract)
**JSON output (`--format json`, default):**
```json
{
  "query": "how to install git",
  "results": [
    {
      "chunk_id": 1423,
      "score": 0.87,
      "score_breakdown": {"fts": 0.72, "vector": 0.94},
      "text": "To install the latest version of git from source...",
      "source": {
        "document_id": 42,
        "title": "Git Admin Guide",
        "path": "/home/user/docs/git-admin.pdf",
        "type": "pdf",
        "page": 12,
        "chunk_index": 3,
        "total_chunks": 28,
        "tags": ["git", "admin"]
      }
    }
  ],
  "total_matches": 47,
  "returned": 10
}
```
**Human output (`--format human`):**
```
Search: "how to install git" (47 matches, showing top 10)
1. [0.87] Git Admin Guide (p.12) [pdf] [git, admin]
To install the latest version of git from source...
2. [0.65] setup-notes.md §Installation [markdown] [git]
First, add the PPA repository for the latest git...
```
**Stability commitment:** The JSON schema is the contract with the skill. Fields may be _added_ but not removed or renamed once the skill is built.
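Since the JSON payload is the contract, the human format can be derived from it rather than built separately. The sketch below renders PDF-style results from the JSON shape above; the function name is hypothetical, and the section marker used for markdown results (`§Installation`) is not covered because the JSON example does not show which field carries it.

```python
def format_human(payload: dict) -> str:
    """Render the JSON search payload as terminal-friendly text."""
    lines = [
        f'Search: "{payload["query"]}" '
        f'({payload["total_matches"]} matches, showing top {payload["returned"]})'
    ]
    for position, result in enumerate(payload["results"], start=1):
        source = result["source"]
        # Page numbers only exist for paginated formats such as PDF.
        page = f' (p.{source["page"]})' if source.get("page") is not None else ""
        tags = ", ".join(source.get("tags", []))
        lines.append(
            f'{position}. [{result["score"]:.2f}] {source["title"]}{page} '
            f'[{source["type"]}] [{tags}]'
        )
        lines.append(result["text"])
    return "\n".join(lines)
```

Deriving one format from the other means a field added to the JSON contract cannot silently diverge from what the terminal shows.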
### 8. Configuration Architecture
```
Precedence (highest to lowest):
  1. CLI flags (--top, --tags, --format)
  2. Environment variables (KB_MODEL, KB_DATA_DIR, KB_DEFAULT_TOP)
  3. ~/.kb/config.yaml
  4. Built-in defaults

ENV variable naming: KB_ prefix + UPPER_SNAKE_CASE
  KB_DATA_DIR    → ~/.kb/
  KB_MODEL       → all-MiniLM-L6-v2
  KB_DEFAULT_TOP → 10
```
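The precedence chain can be sketched as a single lookup per setting. For brevity this example flattens the settings into one level (the real YAML is nested), and the names `DEFAULTS`, `ENV_MAP`, and `resolve` are hypothetical.

```python
import os

# Built-in defaults, flattened for illustration.
DEFAULTS = {"model": "all-MiniLM-L6-v2", "data_dir": "~/.kb", "default_top": 10}
ENV_MAP = {"model": "KB_MODEL", "data_dir": "KB_DATA_DIR", "default_top": "KB_DEFAULT_TOP"}


def resolve(cli: dict, file_cfg: dict, env=os.environ) -> dict:
    """Resolve each setting: CLI flag > environment variable > config file > default."""
    resolved = {}
    for key, default in DEFAULTS.items():
        if cli.get(key) is not None:      # flag explicitly passed on the command line
            value = cli[key]
        elif ENV_MAP[key] in env:
            value = env[ENV_MAP[key]]
        elif key in file_cfg:
            value = file_cfg[key]
        else:
            value = default
        # Environment values arrive as strings, so coerce to the default's type.
        resolved[key] = type(default)(value)
    return resolved
```

Treating an unset CLI flag as `None` keeps "flag not given" distinct from "flag given with a falsy value".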
**Full default config.yaml:**
```yaml
# ~/.kb/config.yaml
data_dir: ~/.kb

embedding:
  model: all-MiniLM-L6-v2
  query_prefix: ""
  passage_prefix: ""

search:
  default_top: 10
  default_format: json
  rrf_k: 60

chunking:
  defaults:
    max_tokens: 512
    overlap_tokens: 50
  pdf:
    strategy: hierarchy
    max_tokens: 1024
  markdown:
    strategy: header
    min_tokens: 50
    max_tokens: 1024
  code:
    strategy: ast
    include_context: true
    max_tokens: 1024
  note:
    strategy: whole

ingestion:
  workers: 4          # parallel Docling workers
  batch_size: 50      # commit to DB every N documents
  enable_ocr: auto    # auto | always | never
```
### 9. CLI Framework: Click
**Why Click:** Mature, well-documented, supports nested command groups, automatic `--help` generation, parameter validation, and progress bars (via `click.progressbar`). The alternative (Typer) adds type-hint magic but less control. argparse is too verbose for this many commands.
### 10. Error Handling and Resumability
**Batch ingestion must be resumable.** When adding a directory of 2,000 PDFs:
- Each document is processed independently
- On success: document + chunks inserted in a single transaction
- On failure: error logged, document skipped, processing continues
- `content_hash` (SHA-256 of file contents) enables skip-if-already-indexed
- Progress shown via `click.progressbar` or `rich.progress`
- Summary at end: "Added 1,847 documents. 12 failed. 141 skipped (already indexed)."
Failed documents are logged to `~/.kb/ingest-errors.log` with the file path and error for later investigation.
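The resumable loop described above might be sketched as follows. It uses cut-down `documents` and `chunks` tables, a hypothetical `process` callback standing in for the format-specific ingestors, and `print` in place of the error log.

```python
import hashlib
import sqlite3
from pathlib import Path


def file_hash(path: Path) -> str:
    """SHA-256 of the file contents, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def ingest_batch(conn: sqlite3.Connection, paths, process) -> dict:
    """process(path) -> list of chunk texts (hypothetical callback)."""
    counts = {"added": 0, "failed": 0, "skipped": 0}
    for path in paths:
        try:
            digest = file_hash(path)
            known = conn.execute(
                "SELECT 1 FROM documents WHERE content_hash = ?", (digest,)
            ).fetchone()
            if known:
                counts["skipped"] += 1  # already indexed on an earlier run
                continue
            texts = process(path)
            with conn:  # document + chunks commit together, or roll back together
                cur = conn.execute(
                    "INSERT INTO documents (title, source_path, content_hash) VALUES (?, ?, ?)",
                    (path.name, str(path), digest),
                )
                conn.executemany(
                    "INSERT INTO chunks (document_id, chunk_index, text) VALUES (?, ?, ?)",
                    [(cur.lastrowid, i, text) for i, text in enumerate(texts)],
                )
            counts["added"] += 1
        except Exception as exc:
            counts["failed"] += 1  # skip this document and keep going
            print(f"{path}: {exc}")
    return counts
```

Running the same batch twice is safe: documents that succeeded are skipped by hash, and documents that failed are retried.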
## Risks / Trade-offs
**[Docling model size] → Mitigation:** ~1.5 GB download on first init. Clear progress indication during download. Models are cached permanently in HuggingFace's default cache (`~/.cache/huggingface/`). Document this in `kb init` output and SKILL.md.
**[Docling ingestion speed on CPU] → Mitigation:** ~17 hours for 2,000 PDFs on CPU. Support parallel workers (configurable). Show per-document progress. Resumable by design (skip already-indexed). Suggest GPU if available. This is a one-time cost.
**[ONNX model export on first load] → Mitigation:** First time a model is loaded, sentence-transformers exports it to ONNX format. This takes 10-30 seconds and is cached for subsequent runs. Users see a one-time delay on first `kb add` or `kb search` after init. Show a clear message: "Optimising model for ONNX inference (one-time)..."
**[sqlite-vec maturity] → Mitigation:** sqlite-vec is relatively new. At 22K vectors, brute-force search means we're not relying on its ANN indexing. If sqlite-vec has issues, swapping to numpy cosine similarity over a stored blob column is straightforward — same DB, different query path.
**[FTS5 trigger sync] → Mitigation:** FTS5 content-sync triggers add write overhead. At our scale (inserts during ingestion, not real-time) this is negligible. If it becomes an issue, switch to manual sync with `INSERT INTO chunks_fts(chunks_fts) VALUES('rebuild')` after batch operations.
**[Model lock-in] → Mitigation:** Changing embedding models requires full reindex (~22K embeddings, ~10-30 minutes on CPU). `kb reindex` with progress bar makes this manageable. Model name stored in DB prevents silent mixing.
## Resolved Questions
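1. **ONNX for inference from day one.** Use sentence-transformers with the ONNX backend for faster CPU inference. The install-size saving (~30 MB for ONNX Runtime vs ~200 MB for PyTorch) holds only if PyTorch can be left out of the install; `sentence-transformers` declares `torch` as a required dependency, so this needs to be verified before relying on it.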
2. **HuggingFace default cache for models.** Both embedding and Docling models use `~/.cache/huggingface/`. Shared with other HF tools — no duplicate downloads if the user already has models cached.
3. **Manual schema migrations.** Version number in `config` table. `database.py` checks version on open and runs ALTER TABLE scripts sequentially. Simple enough for this project's schema complexity.
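The version-check-and-migrate step could be sketched like this. The `MIGRATIONS` contents are placeholder statements, not the project's real schema history, and `migrate` is a hypothetical name.

```python
import sqlite3

# Hypothetical migration scripts keyed by the schema version they produce.
MIGRATIONS = {
    1: "CREATE TABLE IF NOT EXISTS documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL)",
    2: "ALTER TABLE documents ADD COLUMN language TEXT",
}


def migrate(conn: sqlite3.Connection) -> int:
    """Apply pending migrations in order and return the resulting schema version."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS config (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    row = conn.execute("SELECT value FROM config WHERE key = 'schema_version'").fetchone()
    current = int(row[0]) if row else 0  # a fresh database starts at version 0
    for version in sorted(v for v in MIGRATIONS if v > current):
        conn.execute(MIGRATIONS[version])
        conn.execute(
            "INSERT OR REPLACE INTO config (key, value) VALUES ('schema_version', ?)",
            (str(version),),
        )
        conn.commit()  # record each step so an interrupted run resumes where it stopped
        current = version
    return current
```

Calling `migrate` on every open is cheap: when the stored version is already current, the loop body never runs.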