## Context

This is a greenfield Python CLI project: no existing codebase, no migration concerns. The tool is installed via `pipx install kb-search` and keeps its data in `~/.kb/` on the user's machine. It must work entirely offline after the initial model download.

The primary consumer is Claude Code (or a similar LLM tool) via a skill wrapper that calls `kb search` and feeds the JSON results to the LLM for synthesis. The secondary consumer is the user directly in a terminal. This dual-consumer constraint means output must be machine-parseable first, human-readable second.

The document corpus is ~3,000 items (2,000 PDFs of varying complexity, 500 markdown/text notes, 500 code snippets) producing ~22,000 chunks. This is small enough that brute-force vector search is viable and SQLite is more than sufficient.

## Goals / Non-Goals

**Goals:**
- Single-command install (`pipx install kb-search`) with `kb init` for model setup
- Ingest heterogeneous documents with format-appropriate chunking
- Hybrid search (keyword + semantic) with a single command
- JSON output contract stable enough for skill integration
- Configurable but works with zero configuration
- All state in one SQLite file for easy backup/portability

**Non-Goals:**
- LLM-based answer synthesis (the calling skill handles this)
- Multi-user or networked access
- Real-time / streaming ingestion
- Web UI or TUI dashboard
- Support for every possible document format (start with PDF, markdown, code, notes)
- Clustering, deduplication, or automatic organisation of documents

## Decisions

### 1. Package Structure

```
kb-search/
├── pyproject.toml
├── src/
│   └── kb_search/
│       ├── __init__.py
│       ├── cli.py              # Click CLI entry point
│       ├── config.py           # YAML config loading + ENV overrides
│       ├── database.py         # SQLite schema, migrations, connection
│       ├── embeddings.py       # Model download, loading, inference
│       ├── search.py           # Hybrid search + RRF merging
│       ├── ingest/
│       │   ├── __init__.py
│       │   ├── detector.py     # File type detection + routing
│       │   ├── docling.py      # Docling pipeline (PDF, DOCX, HTML, images)
│       │   ├── markdown.py     # Header-based markdown splitting
│       │   ├── code.py         # AST/regex code splitting
│       │   └── note.py         # Whole-document note handler
│       └── output.py           # JSON + human-readable formatters
├── tests/
└── SKILL.md                    # Claude Code skill definition
```

**Why this structure:** Flat enough to navigate easily, while the `ingest/` subpackage isolates format-specific logic. Each ingestion module exports the same interface (`ingest(path, config) -> list[Chunk]`), which makes it easy to add formats later. The `src/` layout follows Python packaging best practice.

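The shared ingestion interface could look roughly like the sketch below. Only the `ingest(path, config) -> list[Chunk]` signature comes from this document; the `Chunk` fields, the `HANDLERS` registry, and the `ingest_file` dispatcher are hypothetical names chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Callable


@dataclass
class Chunk:
    """One indexable unit of text (fields are an assumption, loosely mirroring the chunks table)."""
    text: str
    chunk_index: int
    metadata: dict[str, Any] = field(default_factory=dict)


# The signature every ingestion module is assumed to export.
IngestFn = Callable[[Path, dict[str, Any]], list[Chunk]]


def ingest_note(path: Path, config: dict[str, Any]) -> list[Chunk]:
    """Whole-document handler: the entire file becomes a single chunk."""
    return [Chunk(text=path.read_text(encoding="utf-8"), chunk_index=0)]


# Hypothetical registry; detector.py would choose the key from the file type.
HANDLERS: dict[str, IngestFn] = {"note": ingest_note}


def ingest_file(path: Path, doc_type: str, config: dict[str, Any]) -> list[Chunk]:
    """Route a file to the handler registered for its document type."""
    try:
        handler = HANDLERS[doc_type]
    except KeyError:
        raise ValueError(f"unsupported document type: {doc_type}") from None
    return handler(path, config)
```

Adding a format then means writing one function with this signature and registering it, without touching the callers.
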
### 2. SQLite as Sole Storage Backend

All data lives in `~/.kb/kb.db`:

```sql
-- Documents
CREATE TABLE documents (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    source_path TEXT,
    content_hash TEXT NOT NULL,          -- SHA-256 for dedup/change detection
    doc_type TEXT NOT NULL CHECK(doc_type IN ('pdf','markdown','code','note')),
    language TEXT,                       -- for code: 'python','bash','go'
    created_at TEXT DEFAULT (datetime('now')),
    metadata TEXT DEFAULT '{}'           -- JSON: page_count, author, etc.
);

-- Chunks
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_index INTEGER NOT NULL,
    text TEXT NOT NULL,
    token_count INTEGER,
    metadata TEXT DEFAULT '{}',          -- JSON: page, section_header, symbol_name
    created_at TEXT DEFAULT (datetime('now'))
);

-- FTS5 index (content-sync with chunks table)
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    text,
    content='chunks',
    content_rowid='id',
    tokenize='porter unicode61'
);

-- Triggers to keep FTS in sync
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;

-- Vector storage (sqlite-vec)
CREATE VIRTUAL TABLE chunks_vec USING vec0(
    chunk_id INTEGER PRIMARY KEY,
    embedding FLOAT[384]                 -- dimension matches model
);

-- Tags
CREATE TABLE tags (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE NOT NULL
);

CREATE TABLE document_tags (
    document_id INTEGER REFERENCES documents(id) ON DELETE CASCADE,
    tag_id INTEGER REFERENCES tags(id) ON DELETE CASCADE,
    PRIMARY KEY (document_id, tag_id)
);

-- Config stored in DB (model binding)
CREATE TABLE config (
    key TEXT PRIMARY KEY,
    value TEXT NOT NULL
);
-- Keys: schema_version, model_name, embedding_dim, model_max_tokens
```

**Why SQLite for everything:** At ~22,000 chunks, SQLite handles FTS, vector search, and relational data without breaking a sweat. One file = trivial backup (`cp kb.db kb.db.bak`), no server process, no port conflicts. FTS5 ships with SQLite and is compiled into most builds. sqlite-vec is a single loadable extension.

**Why store config in DB _and_ YAML:** The YAML file holds user preferences (chunking params, model choice). The DB `config` table records what the DB was _actually built with_ (model name, dimension). This separation lets us detect mismatches such as "config says use nomic-embed-text but DB was built with all-MiniLM-L6-v2."

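A minimal sketch of that mismatch check, assuming the `config` table and key names from the schema above. The function name `check_model_binding` and the `ModelMismatchError` exception are hypothetical, not part of the design.

```python
import sqlite3


class ModelMismatchError(RuntimeError):
    """Raised when the configured model differs from the one the DB was built with."""


def check_model_binding(conn: sqlite3.Connection, model_name: str, embedding_dim: int) -> None:
    """Compare the requested model against the values recorded in the config table."""
    bound = dict(
        conn.execute(
            "SELECT key, value FROM config WHERE key IN ('model_name', 'embedding_dim')"
        )
    )
    bound_name = bound.get("model_name")
    bound_dim = bound.get("embedding_dim")
    if bound_name is None:
        return  # fresh database: nothing has been bound yet
    dim_differs = bound_dim is not None and int(bound_dim) != embedding_dim
    if bound_name != model_name or dim_differs:
        raise ModelMismatchError(
            f"config says use {model_name} ({embedding_dim} dims) "
            f"but DB was built with {bound_name} ({bound_dim} dims)"
        )
```

Values are stored as text in the `config` table, so the dimension is converted back to an integer before comparing.
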
**Alternatives considered:**
- ChromaDB/Qdrant: External services, overkill for this scale, breaks single-file story
- DuckDB: Good at analytics, but FTS support is weaker than SQLite FTS5
- LanceDB: Interesting but less mature, no FTS built in

### 3. Docling for Complex Document Ingestion

Docling handles PDF, DOCX, HTML, and image files through a unified pipeline with ML-based layout detection and table reconstruction.

**Why Docling over simpler extractors:** The 2,000 PDFs are "many and varied". Simple text extraction (pymupdf, pdfplumber) works for clean PDFs but silently produces garbage for complex layouts, tables, or multi-column documents. Docling's layout model identifies structural elements, and its table reconstruction preserves data that would otherwise be lost. The quality difference matters because bad chunks → bad search results → useless tool.

**Docling configuration for this project:**
- Use the `pypdfium2` backend (fast for text-based PDFs)
- Enable OCR only when needed (detect pages with no extractable text)
- Use hierarchy-aware chunking (respects section/paragraph boundaries)
- Disable image extraction (we're indexing text, not images)
- Run with multiple workers for batch ingestion

**Model download:** Docling models (~1.5 GB) download on first use or via `kb init`. They are stored in HuggingFace's default cache (`~/.cache/huggingface/`), as settled under Resolved Questions.

**Alternatives considered:**
- pymupdf4llm: Fast, lightweight, but poor table/layout handling
- Unstructured: Heavier than Docling, commercial focus, less predictable output
- LlamaParse: Cloud-only, violates local-first constraint

### 4. Per-Type Chunking Strategy

Each document type gets a purpose-built chunker with configurable parameters:

**PDF (Docling):** Hierarchy-aware chunking. Docling's `HierarchicalChunker` splits at section/paragraph boundaries, respecting the document's logical structure. Falls back to fixed-size if hierarchy detection fails.

**Markdown:** Header-based splitting. Split at `##` and `###` boundaries. Preserve parent header chain as context (so a chunk under "## Config > ### Advanced" carries that path). Merge small sections (< `min_tokens`) with their neighbor. Configurable: `min_tokens`, `max_tokens`.

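The header-based split can be sketched as follows. This is a simplified illustration rather than the planned implementation: it tracks the `##`/`###` header chain and skips headers inside fenced code, but omits the `min_tokens` merge and the `max_tokens` cap. The function name and the returned dict keys are assumptions.

```python
import re

# Matches level-2 and level-3 ATX headers only ("## Title", "### Title").
HEADER_RE = re.compile(r"^(#{2,3})\s+(.*\S)\s*$")
FENCE = "`" * 3  # code-fence marker, built this way to avoid a literal fence here


def split_markdown(text: str) -> list[dict]:
    """Split markdown at ## / ### boundaries, recording the parent header chain."""
    chunks: list[dict] = []
    path = {2: None, 3: None}  # current header at each tracked level
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            header_path = " > ".join(h for h in (path[2], path[3]) if h)
            chunks.append({"header_path": header_path, "text": content})
        body.clear()

    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADER_RE.match(line)
        if match:
            flush()  # close the section that ended at this header
            level = len(match.group(1))
            path[level] = match.group(2)
            if level == 2:
                path[3] = None  # a new ## section resets the ### context
        else:
            body.append(line)
    flush()
    return chunks
```

Text that appears before the first header is kept as a chunk with an empty `header_path`.
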
**Code (Python):** Use stdlib `ast` module. Each function and class becomes a chunk. Class methods include the class docstring for context. Top-level code between definitions becomes its own chunk.

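As a rough sketch of the `ast` approach, the function below turns each top-level function and class into one chunk. It is deliberately narrower than the strategy described: it does not split classes into per-method chunks or collect the top-level code between definitions. The name `chunk_python` and the dict keys are assumptions.

```python
import ast


def chunk_python(source: str) -> list[dict]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: list[dict] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Decorators sit above the def/class line, so start from the earliest one.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            text = "\n".join(lines[start - 1 : node.end_lineno])
            chunks.append(
                {"symbol_name": node.name, "kind": type(node).__name__, "text": text}
            )
    return chunks
```

`ast.parse` only parses the source and never executes it, so this is safe to run on untrusted snippets, though files with syntax errors raise `SyntaxError` and would need the fixed-size fallback.
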
**Code (Bash):** Regex-based. Split on `function name() {` and `name() {` patterns with brace-depth counting. Comment blocks preceding a function attach to that function's chunk. Fall back to fixed-size windowed chunks if no functions detected.

**Code (Go):** Regex-based. Split on `func ` declarations. Type definitions with methods are grouped. Fall back to fixed-size if no recognisable boundaries.

**Notes:** Whole document = one chunk. Notes are small by definition.

**Configurable defaults (in `~/.kb/config.yaml`):**
```yaml
chunking:
  defaults:
    max_tokens: 512
    overlap_tokens: 50
  pdf:
    strategy: hierarchy      # hierarchy | fixed
    max_tokens: 1024         # for fixed strategy fallback
  markdown:
    strategy: header         # header | fixed
    min_tokens: 50           # merge sections smaller than this
    max_tokens: 1024
  code:
    strategy: ast            # ast | fixed
    include_context: true    # include class/module docstring with methods
    max_tokens: 1024
  note:
    strategy: whole
```

### 5. Embedding Model Management

**Default model:** `all-MiniLM-L6-v2` (384 dimensions, 90 MB, good quality/speed tradeoff for CPU).

**Model loading:** Use the `sentence-transformers` library, which provides a unified API across models. Models are stored in HuggingFace's default cache (`~/.cache/huggingface/`), shared with other tools that use HF models. No custom cache directory override.

**Model binding:** On `kb init`, the chosen model's name and dimension are written to the DB `config` table. Every subsequent `kb add` and `kb search` checks that the loaded model matches the DB. Mismatch = hard error with clear message.

**Model switching (`kb reindex`):**
1. Download new model
2. Read all chunks from DB
3. Re-embed in batches (with progress bar)
4. Recreate `chunks_vec` table if dimension changed
5. Replace all vectors in `chunks_vec`
6. Update DB config (model_name, embedding_dim)

**ONNX Runtime for inference:** Use `sentence-transformers` with the ONNX backend (`model = SentenceTransformer(model_name, backend="onnx")`), which requires the `onnx` extra (`pip install "sentence-transformers[onnx]"`). This gives us sentence-transformers' correct tokenization/pooling/normalization while running inference on ONNX Runtime (~30 MB) rather than PyTorch (~200 MB). Models whose repositories ship no ONNX weights are exported to ONNX format on first load. The goal is a lightweight install that keeps the convenience of the sentence-transformers API.

**Model compatibility:** All models on HuggingFace that work with `sentence-transformers` are supported. The only per-model differences handled in code:
- Dimension (read from model config)
- Max sequence length (read from model config, used to cap chunk size)
- Query/passage prefixes (configurable in YAML, empty by default)

```yaml
embedding:
  model: all-MiniLM-L6-v2
  query_prefix: ""          # some models need "search_query: "
  passage_prefix: ""        # some models need "search_document: "
```

### 6. Hybrid Search with Reciprocal Rank Fusion

**Search flow:**

```
Query: "how to install git"
    │
    ├──▶ FTS5 query ──▶ BM25-ranked results (chunk_id, fts_score)
    │
    └──▶ Embed query ──▶ vec similarity search ──▶ (chunk_id, vec_score)
                         (cosine distance, top-K)
    │
    ▼
Reciprocal Rank Fusion (RRF)
    score(d) = Σ 1/(k + rank_in_list)   where k=60 (standard)
    │
    ▼
Merged results, sorted by RRF score
    │
    ▼
Apply filters (tags, doc_type) ──▶ Top-N results
```

**Why RRF over learned re-ranking:** RRF is simple, needs no training, and has a single constant (k=60 is the conventional default, exposed as `rrf_k` in the config). It performs surprisingly well. A learned re-ranker (e.g., cross-encoder) would add another model download and slow down queries, and the marginal quality improvement isn't worth it at this scale. RRF can be swapped out later if needed.

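The fusion step itself is small. The sketch below applies the formula from the search flow, score(d) = Σ 1/(k + rank), to any number of ranked lists of chunk IDs. The function name `rrf_merge` and the tie-break on chunk ID are assumptions added for determinism.

```python
def rrf_merge(rankings: list[list[int]], k: int = 60) -> list[tuple[int, float]]:
    """Fuse ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each inner list is ordered best-first. A chunk missing from a list
    simply contributes nothing for that list.
    """
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


fts_ranked = [1, 2, 3]   # chunk IDs from the FTS5 branch, best first
vec_ranked = [3, 1, 4]   # chunk IDs from the vector branch, best first
merged = rrf_merge([fts_ranked, vec_ranked])
print([chunk_id for chunk_id, _ in merged])  # [1, 3, 2, 4]
```

Only ranks enter the formula, so the BM25 and distance values from the two branches never need to be put on a common scale.
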
**FTS5 query construction:** Pass the user's query to FTS5 after escaping anything FTS5 would otherwise interpret as query syntax (quotes, `*`, `-`, `:`, parentheses, and the bare keywords `AND`/`OR`/`NOT`). The porter stemmer handles basic normalisation. No query expansion or synonym handling. Keep it simple.

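One simple way to do the escaping is to wrap every whitespace-separated token in double quotes, doubling any embedded quote, so that FTS5 treats each token as a string literal. This is a sketch under that assumption; the helper name `to_fts5_query` is hypothetical.

```python
def to_fts5_query(raw: str) -> str:
    """Quote each token so FTS5 treats it as a literal, not as query syntax.

    Space-separated quoted strings are combined with FTS5's implicit AND.
    An empty or whitespace-only input yields an empty string, which the
    caller should handle before running MATCH.
    """
    tokens = raw.split()
    return " ".join('"' + token.replace('"', '""') + '"' for token in tokens)


print(to_fts5_query("how to install git"))  # "how" "to" "install" "git"
print(to_fts5_query("NOT c++ (draft)"))     # "NOT" "c++" "(draft)"
```

Quoting disables prefix (`*`) and boolean operators for end users, which matches the keep-it-simple stance above.
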
**Vector search:** Embed the query with the same model used for chunks. Retrieve top-K (K = 3× requested results, to give RRF enough candidates). sqlite-vec does brute-force cosine similarity over all vectors; at 22K vectors this takes roughly 2-5 ms.

**Filter application:** Filters are applied inside the SQL query where possible (for FTS5, via a JOIN) and otherwise as post-filters on the merged results, which is the final step of the search flow above. The choice is made per filter type:
- Type filter: Applied in the SQL query (efficient)
- Tag filter: Applied in the SQL query via JOIN (efficient)
- Score threshold: Applied post-RRF as a cutoff

### 7. Output Format (Skill Contract)

**JSON output (`--format json`, default):**

```json
{
  "query": "how to install git",
  "results": [
    {
      "chunk_id": 1423,
      "score": 0.87,
      "score_breakdown": {"fts": 0.72, "vector": 0.94},
      "text": "To install the latest version of git from source...",
      "source": {
        "document_id": 42,
        "title": "Git Admin Guide",
        "path": "/home/user/docs/git-admin.pdf",
        "type": "pdf",
        "page": 12,
        "chunk_index": 3,
        "total_chunks": 28,
        "tags": ["git", "admin"]
      }
    }
  ],
  "total_matches": 47,
  "returned": 10
}
```

**Human output (`--format human`):**

```
Search: "how to install git" (47 matches, showing top 10)

1. [0.87] Git Admin Guide (p.12) [pdf] [git, admin]
   To install the latest version of git from source...

2. [0.65] setup-notes.md §Installation [markdown] [git]
   First, add the PPA repository for the latest git...
```

**Stability commitment:** The JSON schema is the contract with the skill. Fields may be _added_ but not removed or renamed once the skill is built.

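Since the human format is derived from the same data as the JSON contract, a formatter in `output.py` might be sketched like this. The function name `format_result` is an assumption, and the sketch covers only the page-numbered form shown for PDFs, not the `§Section` variant used for markdown.

```python
def format_result(rank: int, result: dict) -> str:
    """Render one JSON result entry as a two-line human-readable block."""
    source = result["source"]
    # Page numbers only exist for paginated sources such as PDFs.
    page = f" (p.{source['page']})" if source.get("page") is not None else ""
    tags = ", ".join(source.get("tags", []))
    header = (
        f"{rank}. [{result['score']:.2f}] {source['title']}{page} "
        f"[{source['type']}] [{tags}]"
    )
    return f"{header}\n   {result['text']}"


example = {
    "score": 0.87,
    "text": "To install the latest version of git from source...",
    "source": {"title": "Git Admin Guide", "type": "pdf", "page": 12, "tags": ["git", "admin"]},
}
print(format_result(1, example))
```

Keeping the human view a pure function of the JSON payload means the two formats cannot drift apart.
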
### 8. Configuration Architecture

```
Precedence (highest to lowest):
  1. CLI flags (--top, --tags, --format)
  2. Environment variables (KB_MODEL, KB_DATA_DIR, KB_DEFAULT_TOP)
  3. ~/.kb/config.yaml
  4. Built-in defaults

ENV variable naming: KB_ prefix + UPPER_SNAKE_CASE
  KB_DATA_DIR    → ~/.kb/
  KB_MODEL       → all-MiniLM-L6-v2
  KB_DEFAULT_TOP → 10
```

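The precedence chain can be illustrated with a small resolver. This is a simplified sketch: it uses flat keys rather than the nested YAML layout, and the `DEFAULTS`/`ENV_VARS` tables and the `resolve_setting` name are assumptions. Only the three environment variable names come from this document.

```python
import os

# Built-in defaults (lowest precedence), flattened for the sake of the example.
DEFAULTS = {"model": "all-MiniLM-L6-v2", "data_dir": "~/.kb", "default_top": 10}
ENV_VARS = {"model": "KB_MODEL", "data_dir": "KB_DATA_DIR", "default_top": "KB_DEFAULT_TOP"}


def resolve_setting(key, cli_value=None, yaml_config=None, environ=None):
    """Resolve one setting: CLI flag > environment variable > config.yaml > default."""
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    env_value = environ.get(ENV_VARS[key])
    if env_value is not None:
        # Environment values are strings; coerce to the type of the default.
        return type(DEFAULTS[key])(env_value)
    if yaml_config and key in yaml_config:
        return yaml_config[key]
    return DEFAULTS[key]


print(resolve_setting("default_top", yaml_config={"default_top": 5},
                      environ={"KB_DEFAULT_TOP": "20"}))  # 20
```

Passing `environ` explicitly keeps the resolver easy to test without mutating the real process environment.
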
**Full default config.yaml:**

```yaml
# ~/.kb/config.yaml

data_dir: ~/.kb

embedding:
  model: all-MiniLM-L6-v2
  query_prefix: ""
  passage_prefix: ""

search:
  default_top: 10
  default_format: json
  rrf_k: 60

chunking:
  defaults:
    max_tokens: 512
    overlap_tokens: 50
  pdf:
    strategy: hierarchy
    max_tokens: 1024
  markdown:
    strategy: header
    min_tokens: 50
    max_tokens: 1024
  code:
    strategy: ast
    include_context: true
    max_tokens: 1024
  note:
    strategy: whole

ingestion:
  workers: 4                # parallel Docling workers
  batch_size: 50            # commit to DB every N documents
  enable_ocr: auto          # auto | always | never
```

### 9. CLI Framework: Click

**Why Click:** Mature, well-documented, supports nested command groups, automatic `--help` generation, parameter validation, and progress bars (via `click.progressbar`). The alternative (Typer) adds type-hint magic but less control. argparse is too verbose for this many commands.

### 10. Error Handling and Resumability

**Batch ingestion must be resumable.** When adding a directory of 2,000 PDFs:
- Each document is processed independently
- On success: document + chunks inserted in a single transaction
- On failure: error logged, document skipped, processing continues
- `content_hash` (SHA-256 of file contents) enables skip-if-already-indexed
- Progress shown via `click.progressbar` or `rich.progress`
- Summary at end: "Added 1,847 documents. 12 failed. 141 skipped (already indexed)."

Failed documents are logged to `~/.kb/ingest-errors.log` with the file path and error for later investigation.

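The resumable loop described above can be sketched as follows, assuming a `documents` table with a `content_hash` column as in the schema. The `ingest_batch` name, the `ingest_one` callback, and the logger name are hypothetical, and progress reporting is left out.

```python
import hashlib
import logging
import sqlite3
from pathlib import Path

log = logging.getLogger("kb.ingest")  # hypothetical logger name


def file_sha256(path: Path) -> str:
    """Hash file contents in 1 MiB blocks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def ingest_batch(conn: sqlite3.Connection, paths, ingest_one) -> dict:
    """Process each file independently; skip known hashes, log and continue on failure."""
    summary = {"added": 0, "failed": 0, "skipped": 0}
    for path in paths:
        try:
            content_hash = file_sha256(path)
            known = conn.execute(
                "SELECT 1 FROM documents WHERE content_hash = ?", (content_hash,)
            ).fetchone()
            if known:
                summary["skipped"] += 1
                continue
            # `with conn` commits on success and rolls back if ingest_one raises,
            # so a failed document leaves no partial rows behind.
            with conn:
                ingest_one(conn, path, content_hash)
            summary["added"] += 1
        except Exception as exc:
            summary["failed"] += 1
            log.error("%s: %s", path, exc)
    return summary
```

Re-running the same batch after an interruption skips everything already committed and retries only the documents that failed or were never reached.
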
## Risks / Trade-offs

**[Docling model size] → Mitigation:** ~1.5 GB download on first init. Clear progress indication during download. Models are cached in HuggingFace's default cache (`~/.cache/huggingface/`), so the download happens once. Document this in `kb init` output and SKILL.md.

**[Docling ingestion speed on CPU] → Mitigation:** ~17 hours for 2,000 PDFs on CPU. Support parallel workers (configurable). Show per-document progress. Resumable by design (skip already-indexed). Suggest GPU if available. This is a one-time cost.

**[ONNX model export on first load] → Mitigation:** The first time a model is loaded, sentence-transformers exports it to ONNX format. This takes 10-30 seconds and is cached for subsequent runs. Users see a one-time delay on the first `kb add` or `kb search` after init. Show a clear message: "Optimising model for ONNX inference (one-time)..."

**[sqlite-vec maturity] → Mitigation:** sqlite-vec is relatively new. At 22K vectors, brute-force search means we're not relying on its ANN indexing. If sqlite-vec has issues, swapping to numpy cosine similarity over a stored blob column is straightforward: same DB, different query path.

**[FTS5 trigger sync] → Mitigation:** FTS5 content-sync triggers add write overhead. At our scale (inserts during ingestion, not real-time) this is negligible. If it becomes an issue, switch to manual sync with `INSERT INTO chunks_fts(chunks_fts) VALUES('rebuild')` after batch operations.

**[Model lock-in] → Mitigation:** Changing embedding models requires a full reindex (~22K embeddings, ~10-30 minutes on CPU). `kb reindex` with a progress bar makes this manageable. The model name stored in the DB prevents silent mixing.

## Resolved Questions

1. **ONNX for inference from day one.** Use sentence-transformers with the ONNX backend. ONNX Runtime (~30 MB) handles inference in place of PyTorch (~200 MB) and is faster on CPU. Because `sentence-transformers` itself declares `torch` as a dependency, whether PyTorch can be left out of the install entirely still needs to be confirmed.

2. **HuggingFace default cache for models.** Both embedding and Docling models use `~/.cache/huggingface/`. The cache is shared with other HF tools, so there are no duplicate downloads if the user already has models cached.

3. **Manual schema migrations.** Version number in `config` table. `database.py` checks the version on open and runs ALTER TABLE scripts sequentially. Simple enough for this project's schema complexity.

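The manual migration scheme in item 3 might be sketched like this, assuming the `config` table and `schema_version` key from the schema. The `MIGRATIONS` table, its example `ALTER TABLE` statement, and the function names are hypothetical.

```python
import sqlite3

# Hypothetical migration scripts, keyed by the schema version they produce.
MIGRATIONS = {
    2: ["ALTER TABLE documents ADD COLUMN updated_at TEXT"],
}


def get_schema_version(conn: sqlite3.Connection) -> int:
    """Read schema_version from the config table; 0 means it was never recorded."""
    row = conn.execute(
        "SELECT value FROM config WHERE key = 'schema_version'"
    ).fetchone()
    return int(row[0]) if row else 0


def migrate(conn: sqlite3.Connection, target_version: int) -> int:
    """Apply pending migrations in order, recording the version after each step."""
    current = get_schema_version(conn)
    for version in range(current + 1, target_version + 1):
        with conn:  # commit once this version's statements and bookkeeping are done
            for statement in MIGRATIONS.get(version, []):
                conn.execute(statement)
            conn.execute(
                "INSERT OR REPLACE INTO config (key, value) VALUES ('schema_version', ?)",
                (str(version),),
            )
    return get_schema_version(conn)
```

Recording the version after every step means an interrupted upgrade resumes from the last completed migration rather than starting over.
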