Initial MVP
## Context

This is a greenfield Python CLI project: there is no existing codebase and there are no migration concerns. The tool will live at `~/.kb/` on the user's machine and be installed via `pipx install kb-search`. It must work entirely offline after the initial model download.

The primary consumer is Claude Code (or similar LLM tools) via a skill wrapper that calls `kb search` and feeds JSON results to the LLM for synthesis. The secondary consumer is the user directly in a terminal. This dual-consumer constraint means output must be machine-parseable first, human-readable second.

The document corpus is ~3,000 items (2,000 PDFs of varying complexity, 500 markdown/text notes, 500 code snippets) producing ~22,000 chunks. This is small enough that brute-force vector search is viable and SQLite is more than sufficient.

## Goals / Non-Goals

**Goals:**
- Single-command install (`pipx install kb-search`) with `kb init` for model setup
- Ingest heterogeneous documents with format-appropriate chunking
- Hybrid search (keyword + semantic) with a single command
- JSON output contract stable enough for skill integration
- Configurable but works with zero configuration
- All state in one SQLite file for easy backup/portability

**Non-Goals:**
- LLM-based answer synthesis (the calling skill handles this)
- Multi-user or networked access
- Real-time / streaming ingestion
- Web UI or TUI dashboard
- Support for every possible document format (start with PDF, markdown, code, notes)
- Clustering, deduplication, or automatic organisation of documents

## Decisions

### 1. Package Structure

```
kb-search/
├── pyproject.toml
├── src/
│   └── kb_search/
│       ├── __init__.py
│       ├── cli.py              # Click CLI entry point
│       ├── config.py           # YAML config loading + ENV overrides
│       ├── database.py         # SQLite schema, migrations, connection
│       ├── embeddings.py       # Model download, loading, inference
│       ├── search.py           # Hybrid search + RRF merging
│       ├── ingest/
│       │   ├── __init__.py
│       │   ├── detector.py     # File type detection + routing
│       │   ├── docling.py      # Docling pipeline (PDF, DOCX, HTML, images)
│       │   ├── markdown.py     # Header-based markdown splitting
│       │   ├── code.py         # AST/regex code splitting
│       │   └── note.py         # Whole-document note handler
│       └── output.py           # JSON + human-readable formatters
├── tests/
└── SKILL.md                    # Claude Code skill definition
```

**Why this structure:** Flat enough to navigate easily, but the `ingest/` subpackage isolates format-specific logic. Each ingestion module exports the same interface (`ingest(path, config) -> list[Chunk]`), making it easy to add formats later. The `src/` layout follows Python packaging best practice.
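
The shared interface could look roughly like the sketch below. Only the `ingest(path, config) -> list[Chunk]` signature comes from this document; the `Chunk` fields, the `HANDLERS` table, and routing by file extension are illustrative assumptions.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable


@dataclass
class Chunk:
    """Hypothetical chunk record; the field names are assumptions."""
    index: int
    text: str
    metadata: dict = field(default_factory=dict)


Ingestor = Callable[[Path, dict], list[Chunk]]


def ingest_note(path: Path, config: dict) -> list[Chunk]:
    """Whole-document handler: one note becomes exactly one chunk."""
    return [Chunk(index=0, text=path.read_text(encoding="utf-8"))]


# Hypothetical routing table keyed by extension; a real detector
# might also inspect file contents.
HANDLERS: dict[str, Ingestor] = {".txt": ingest_note}


def ingest(path: Path, config: dict) -> list[Chunk]:
    """Dispatch to the handler registered for this file type."""
    handler = HANDLERS.get(path.suffix.lower())
    if handler is None:
        raise ValueError(f"unsupported file type: {path.suffix!r}")
    return handler(path, config)
```

Under this arrangement, adding a format means writing one function with the same signature and registering it in the table.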

### 2. SQLite as Sole Storage Backend

All data lives in `~/.kb/kb.db`:

```sql
-- Documents
CREATE TABLE documents (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    source_path TEXT,
    content_hash TEXT NOT NULL,           -- SHA-256 for dedup/change detection
    doc_type TEXT NOT NULL CHECK(doc_type IN ('pdf','markdown','code','note')),
    language TEXT,                        -- for code: 'python','bash','go'
    created_at TEXT DEFAULT (datetime('now')),
    metadata TEXT DEFAULT '{}'            -- JSON: page_count, author, etc.
);

-- Chunks
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_index INTEGER NOT NULL,
    text TEXT NOT NULL,
    token_count INTEGER,
    metadata TEXT DEFAULT '{}',           -- JSON: page, section_header, symbol_name
    created_at TEXT DEFAULT (datetime('now'))
);

-- FTS5 index (content-sync with chunks table)
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    text,
    content='chunks',
    content_rowid='id',
    tokenize='porter unicode61'
);

-- Triggers to keep FTS in sync
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;

-- Vector storage (sqlite-vec)
CREATE VIRTUAL TABLE chunks_vec USING vec0(
    chunk_id INTEGER PRIMARY KEY,
    embedding FLOAT[384]                  -- dimension matches model
);

-- Tags
CREATE TABLE tags (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE NOT NULL
);

CREATE TABLE document_tags (
    document_id INTEGER REFERENCES documents(id) ON DELETE CASCADE,
    tag_id INTEGER REFERENCES tags(id) ON DELETE CASCADE,
    PRIMARY KEY (document_id, tag_id)
);

-- Config stored in DB (model binding)
CREATE TABLE config (
    key TEXT PRIMARY KEY,
    value TEXT NOT NULL
);
-- Keys: schema_version, model_name, embedding_dim, model_max_tokens
```

**Why SQLite for everything:** At ~22,000 chunks, SQLite handles FTS, vector search, and relational data without breaking a sweat. One file = trivial backup (`cp kb.db kb.db.bak`), no server process, no port conflicts. FTS5 is built into SQLite. sqlite-vec is a single loadable extension.

**Why store config in DB _and_ YAML:** The YAML file holds user preferences (chunking params, model choice). The DB `config` table records what the DB was _actually built with_ (model name, dimension). This separation lets us detect mismatches: "config says use nomic-embed-text but DB was built with all-MiniLM-L6-v2."

**Alternatives considered:**
- ChromaDB/Qdrant: External services, overkill for this scale, breaks single-file story
- DuckDB: Good at analytics, but FTS support is weaker than SQLite FTS5
- LanceDB: Interesting but less mature, no FTS built in

### 3. Docling for Complex Document Ingestion

Docling handles PDF, DOCX, HTML, and image files through a unified pipeline with ML-based layout detection and table reconstruction.

**Why Docling over simpler extractors:** The 2,000 PDFs are "many and varied": simple text extraction (pymupdf, pdfplumber) works for clean PDFs but silently produces garbage for complex layouts, tables, or multi-column documents. Docling's layout model correctly identifies structural elements, and its table reconstruction preserves data that would otherwise be lost. The quality difference matters because bad chunks → bad search results → useless tool.

**Docling configuration for this project:**
- Use `pypdfium2` backend (default, fast for text-based PDFs)
- Enable OCR only when needed (detect pages with no extractable text)
- Use hierarchy-aware chunking (respects section/paragraph boundaries)
- Disable image extraction (we're indexing text, not images)
- Run with multiple workers for batch ingestion

**Model download:** Docling models (~1.5 GB) download on first use or via `kb init`. They are stored in HuggingFace's default cache (`~/.cache/huggingface/`), as settled under Resolved Questions.

**Alternatives considered:**
- pymupdf4llm: Fast, lightweight, but poor table/layout handling
- Unstructured: Heavier than Docling, commercial focus, less predictable output
- LlamaParse: Cloud-only, violates local-first constraint

### 4. Per-Type Chunking Strategy

Each document type gets a purpose-built chunker with configurable parameters:

**PDF (Docling):** Hierarchy-aware chunking. Docling's `HierarchicalChunker` splits at section/paragraph boundaries respecting the document's logical structure. Falls back to fixed-size if hierarchy detection fails.

**Markdown:** Header-based splitting. Split at `##` and `###` boundaries. Preserve parent header chain as context (so a chunk under "## Config > ### Advanced" carries that path). Merge small sections (< `min_tokens`) with their neighbor. Configurable: `min_tokens`, `max_tokens`.
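
A minimal sketch of the header-based splitter follows. It makes several assumptions not fixed by this document: tokens are approximated by whitespace-separated words, a small section is merged into the *preceding* neighbour (losing its own header path), fenced code blocks are skipped when looking for headers, and `max_tokens` enforcement is left out. The function name `split_markdown` is hypothetical.

```python
import re

# Matches only level-2 and level-3 ATX headers ("## x" / "### x").
HEADER_RE = re.compile(r"^(#{2,3})\s+(.*)$")


def split_markdown(text: str, min_tokens: int = 50) -> list[dict]:
    """Split at ## / ### headers, recording the header chain per chunk."""
    sections: list[dict] = []
    chain: dict[int, str] = {}      # header level -> current title
    path, body = "", []
    in_fence = False

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            sections.append({"path": path, "text": content})

    for line in text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence     # ignore '#' lines inside code fences
        match = None if in_fence else HEADER_RE.match(line)
        if match is None:
            body.append(line)
            continue
        flush()
        level = len(match.group(1))
        chain[level] = match.group(2).strip()
        # A new ## header invalidates any ### header beneath the old one.
        for deeper in [lvl for lvl in chain if lvl > level]:
            del chain[deeper]
        path = " > ".join(f"{'#' * lvl} {chain[lvl]}" for lvl in sorted(chain))
        body = []
    flush()

    # Merge sections below min_tokens into the preceding section.
    merged: list[dict] = []
    for section in sections:
        if merged and len(section["text"].split()) < min_tokens:
            merged[-1]["text"] += "\n\n" + section["text"]
        else:
            merged.append(section)
    return merged
```

A chunk under `### Advanced` inside `## Config` then carries the path `## Config > ### Advanced`, which can be stored in the chunk's `metadata` JSON.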

**Code (Python):** Use stdlib `ast` module. Each function and class becomes a chunk. Class methods include the class docstring for context. Top-level code between definitions becomes its own chunk.
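
The Python strategy might be sketched as below, using only the stdlib `ast` module. The chunk dictionary shape, the `<module>` label for top-level code, and the helper names are assumptions. Comments that sit between definitions are not AST nodes, so this sketch drops them unless they fall inside a chunk's line range.

```python
import ast

FUNC_NODES = (ast.FunctionDef, ast.AsyncFunctionDef)


def _source(lines: list[str], first: ast.AST, last: ast.AST) -> str:
    """Source text from first node (including decorators) to last node."""
    decorators = getattr(first, "decorator_list", [])
    start = min([first.lineno] + [d.lineno for d in decorators])
    return "\n".join(lines[start - 1 : last.end_lineno])


def chunk_python(source: str) -> list[dict]:
    """One chunk per function, class, method, and run of top-level code."""
    lines = source.splitlines()
    chunks: list[dict] = []
    pending: list[ast.AST] = []     # top-level statements between definitions

    def flush_pending() -> None:
        if pending:
            text = _source(lines, pending[0], pending[-1])
            chunks.append({"symbol_name": "<module>", "text": text})
            pending.clear()

    for node in ast.parse(source).body:
        if not isinstance(node, FUNC_NODES + (ast.ClassDef,)):
            pending.append(node)
            continue
        flush_pending()
        chunks.append({"symbol_name": node.name, "text": _source(lines, node, node)})
        if isinstance(node, ast.ClassDef):
            class_doc = ast.get_docstring(node)
            for item in node.body:
                if isinstance(item, FUNC_NODES):
                    chunk = {
                        "symbol_name": f"{node.name}.{item.name}",
                        "text": _source(lines, item, item),
                    }
                    if class_doc:
                        chunk["context"] = class_doc   # class docstring as context
                    chunks.append(chunk)
    flush_pending()
    return chunks
```

Note that this sketch emits both the whole class and each method as separate chunks; whether the class chunk should instead hold only the class header is a design choice left open here.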

**Code (Bash):** Regex-based. Split on `function name() {` and `name() {` patterns with brace-depth counting. Comment blocks preceding a function attach to that function's chunk. Fall back to fixed-size windowed chunks if no functions detected.

**Code (Go):** Regex-based. Split on `func ` declarations. Type definitions with methods are grouped. Fall back to fixed-size if no recognisable boundaries.

**Notes:** Whole document = one chunk. Notes are small by definition.

**Configurable defaults (in `~/.kb/config.yaml`):**
```yaml
chunking:
  defaults:
    max_tokens: 512
    overlap_tokens: 50
  pdf:
    strategy: hierarchy       # hierarchy | fixed
    max_tokens: 1024          # for fixed strategy fallback
  markdown:
    strategy: header          # header | fixed
    min_tokens: 50            # merge sections smaller than this
    max_tokens: 1024
  code:
    strategy: ast             # ast | fixed
    include_context: true     # include class/module docstring with methods
    max_tokens: 1024
  note:
    strategy: whole
```

### 5. Embedding Model Management

**Default model:** `all-MiniLM-L6-v2` (384 dimensions, 90 MB, good quality/speed tradeoff for CPU).

**Model loading:** Use `sentence-transformers` library which provides a unified API across models. Models stored in HuggingFace's default cache (`~/.cache/huggingface/`), shared with other tools that use HF models. No custom cache directory override.

**Model binding:** On `kb init`, the chosen model's name and dimension are written to the DB `config` table. Every subsequent `kb add` checks the loaded model matches the DB. Mismatch = hard error with clear message.
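
The binding check could be as small as the sketch below. It reads the `model_name` and `embedding_dim` keys from the `config` table described in the schema; the function name, the exception class, and the error wording are hypothetical.

```python
import sqlite3


class ModelMismatchError(RuntimeError):
    """Raised when the loaded model differs from the one the DB was built with."""


def check_model_binding(db: sqlite3.Connection, model_name: str, embedding_dim: int) -> None:
    """Compare the loaded model against the values recorded at `kb init`."""
    bound = dict(
        db.execute(
            "SELECT key, value FROM config WHERE key IN ('model_name', 'embedding_dim')"
        )
    )
    if "model_name" not in bound or "embedding_dim" not in bound:
        raise ModelMismatchError("database has no model binding; run `kb init` first")
    if bound["model_name"] != model_name or int(bound["embedding_dim"]) != embedding_dim:
        raise ModelMismatchError(
            f"config asks for {model_name} ({embedding_dim} dims) but the DB was built "
            f"with {bound['model_name']} ({bound['embedding_dim']} dims); run `kb reindex`"
        )
```

Calling this once when the connection is opened keeps vectors from two different models from being mixed in `chunks_vec`.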

**Model switching (`kb reindex`):**
1. Download new model
2. Read all chunks from DB
3. Re-embed in batches (with progress bar)
4. Recreate `chunks_vec` table if dimension changed
5. Replace all vectors in `chunks_vec`
6. Update DB config (model_name, embedding_dim)

**ONNX Runtime for inference:** Use `sentence-transformers` with the ONNX backend (`model = SentenceTransformer(model_name, backend="onnx")`). This keeps sentence-transformers' tokenization, pooling, and normalization while running inference on ONNX Runtime (~30 MB) rather than PyTorch (~200 MB). Models that do not already ship ONNX weights are exported to ONNX format on first load. The intent is a lightweight install without giving up the convenience of the sentence-transformers API; whether PyTorch can be left out of the install entirely still needs verifying, because `sentence-transformers` declares `torch` as a dependency.

**Model compatibility:** All models on HuggingFace that work with `sentence-transformers` are supported. The only per-model differences handled in code:
- Dimension (read from model config)
- Max sequence length (read from model config, used to cap chunk size)
- Query/passage prefixes (configurable in YAML, empty by default)

```yaml
embedding:
  model: all-MiniLM-L6-v2
  query_prefix: ""           # some models need "search_query: "
  passage_prefix: ""         # some models need "search_document: "
```

### 6. Hybrid Search with Reciprocal Rank Fusion

**Search flow:**

```
Query: "how to install git"
        │
        ├──▶ FTS5 query ──▶ BM25-ranked results (chunk_id, fts_score)
        │
        └──▶ Embed query ──▶ vec similarity search ──▶ (chunk_id, vec_score)
                              (cosine distance, top-K)
        │
        ▼
  Reciprocal Rank Fusion (RRF)
  score(d) = Σ 1/(k + rank_in_list)    where k=60 (standard)
        │
        ▼
  Merged results, sorted by RRF score
        │
        ▼
  Apply filters (tags, doc_type) ──▶ Top-N results
```

**Why RRF over learned re-ranking:** RRF is simple, has a single parameter (k, conventionally 60, exposed as `search.rrf_k`), and performs well without tuning. A learned re-ranker (e.g., a cross-encoder) would add another model download and slow down queries, and the marginal quality improvement isn't worth it at this scale. RRF can be swapped out later if needed.
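
The fusion step itself is only a few lines. The sketch below assumes each retriever returns chunk ids ordered best-first; the function name `rrf_merge` and the tie-break on chunk id are illustrative choices, not part of the design.

```python
from collections import defaultdict


def rrf_merge(rankings: list[list[int]], k: int = 60) -> list[tuple[int, float]]:
    """Fuse ranked lists of chunk ids: score(d) = sum of 1 / (k + rank)."""
    scores: dict[int, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):   # ranks are 1-based
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by chunk id for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


fts_ids = [1, 2, 3]     # BM25 order
vec_ids = [3, 1, 4]     # vector-distance order
merged = rrf_merge([fts_ids, vec_ids])
print([chunk_id for chunk_id, _ in merged])   # [1, 3, 2, 4]
```

Chunks that appear in both lists (1 and 3) rise above those found by only one retriever, which is the behaviour hybrid search relies on. Only ranks are used, so the incomparable BM25 and distance scales never need normalising.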

**FTS5 query construction:** Pass the user's query string to FTS5 after escaping any special characters; otherwise it goes through unchanged. FTS5's porter stemmer handles basic normalisation. There is no query expansion or synonym handling, to keep things simple.
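
One conservative way to do the escaping is to wrap every whitespace-separated term in double quotes, doubling any embedded quote, so that FTS5 treats operators such as `AND`, `NEAR`, `*`, or `-` as plain text. This is a sketch of one possible policy, and `to_fts5_query` is a hypothetical helper name.

```python
def to_fts5_query(raw: str) -> str:
    """Quote each term so FTS5 parses it as a string, not as query syntax."""
    terms = raw.split()
    # Inside an FTS5 string, a literal double quote is written as two quotes.
    return " ".join('"' + term.replace('"', '""') + '"' for term in terms)


print(to_fts5_query("how to install git-lfs"))
# "how" "to" "install" "git-lfs"
```

Space-separated quoted terms are combined with an implicit AND. An empty or whitespace-only query yields an empty string, which FTS5 rejects, so the caller would need to skip the keyword leg in that case.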

**Vector search:** Embed the query with the same model used for chunks. Retrieve top-K (K = 3× requested results, to give RRF enough candidates). sqlite-vec does brute-force cosine similarity over all vectors; at 22K vectors this takes roughly 2-5 ms.

**Filter application:** Tag and type filters are applied as SQL WHERE clauses _before_ search where possible (for FTS5 via JOIN), or as post-filters on the merged results. This is a design choice per filter type:
- Type filter: Applied in the SQL query (efficient)
- Tag filter: Applied in the SQL query via JOIN (efficient)
- Score threshold: Applied post-RRF as a cutoff

### 7. Output Format (Skill Contract)

**JSON output (`--format json`, default):**

```json
{
  "query": "how to install git",
  "results": [
    {
      "chunk_id": 1423,
      "score": 0.87,
      "score_breakdown": {"fts": 0.72, "vector": 0.94},
      "text": "To install the latest version of git from source...",
      "source": {
        "document_id": 42,
        "title": "Git Admin Guide",
        "path": "/home/user/docs/git-admin.pdf",
        "type": "pdf",
        "page": 12,
        "chunk_index": 3,
        "total_chunks": 28,
        "tags": ["git", "admin"]
      }
    }
  ],
  "total_matches": 47,
  "returned": 10
}
```

**Human output (`--format human`):**

```
Search: "how to install git" (47 matches, showing top 10)

1. [0.87] Git Admin Guide (p.12) [pdf] [git, admin]
   To install the latest version of git from source...

2. [0.65] setup-notes.md §Installation [markdown] [git]
   First, add the PPA repository for the latest git...
```

**Stability commitment:** The JSON schema is the contract with the skill. Fields may be _added_ but not removed or renamed once the skill is built.

### 8. Configuration Architecture

```
Precedence (highest to lowest):
  1. CLI flags (--top, --tags, --format)
  2. Environment variables (KB_MODEL, KB_DATA_DIR, KB_DEFAULT_TOP)
  3. ~/.kb/config.yaml
  4. Built-in defaults

ENV variable naming: KB_ prefix + UPPER_SNAKE_CASE
  KB_DATA_DIR     → ~/.kb/
  KB_MODEL        → all-MiniLM-L6-v2
  KB_DEFAULT_TOP  → 10
```
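
The precedence chain can be sketched as a layered merge. For brevity this example flattens the nested YAML keys (`embedding.model`, `search.default_top`) into single-level names, which is an assumption; the `resolve` function and both lookup tables are hypothetical.

```python
import os

# Built-in defaults (lowest precedence), flattened for this sketch.
DEFAULTS = {"data_dir": "~/.kb", "model": "all-MiniLM-L6-v2", "default_top": 10}

# Setting name -> environment variable, following the KB_ naming rule.
ENV_VARS = {"data_dir": "KB_DATA_DIR", "model": "KB_MODEL", "default_top": "KB_DEFAULT_TOP"}


def resolve(cli: dict, file_cfg: dict, env=None) -> dict:
    """Merge settings: CLI flags > environment > config.yaml > defaults."""
    env = os.environ if env is None else env
    resolved = dict(DEFAULTS)
    resolved.update({k: v for k, v in file_cfg.items() if k in DEFAULTS})
    for key, var in ENV_VARS.items():
        if var in env:
            # Environment values are strings; coerce to the default's type.
            resolved[key] = type(DEFAULTS[key])(env[var])
    # A CLI flag left unset is represented as None and does not override.
    resolved.update({k: v for k, v in cli.items() if v is not None})
    return resolved
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.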

**Full default config.yaml:**

```yaml
# ~/.kb/config.yaml

data_dir: ~/.kb

embedding:
  model: all-MiniLM-L6-v2
  query_prefix: ""
  passage_prefix: ""

search:
  default_top: 10
  default_format: json
  rrf_k: 60

chunking:
  defaults:
    max_tokens: 512
    overlap_tokens: 50
  pdf:
    strategy: hierarchy
    max_tokens: 1024
  markdown:
    strategy: header
    min_tokens: 50
    max_tokens: 1024
  code:
    strategy: ast
    include_context: true
    max_tokens: 1024
  note:
    strategy: whole

ingestion:
  workers: 4                # parallel Docling workers
  batch_size: 50            # commit to DB every N documents
  enable_ocr: auto          # auto | always | never
```

### 9. CLI Framework: Click

**Why Click:** Mature, well-documented, supports nested command groups, automatic `--help` generation, parameter validation, and progress bars (via `click.progressbar`). Typer, the main alternative, adds type-hint convenience but offers less control. argparse is too verbose for this many commands.

### 10. Error Handling and Resumability

**Batch ingestion must be resumable.** When adding a directory of 2,000 PDFs:
- Each document is processed independently
- On success: document + chunks inserted in a single transaction
- On failure: error logged, document skipped, processing continues
- `content_hash` (SHA-256 of file contents) enables skip-if-already-indexed
- Progress shown via `click.progressbar` or `rich.progress`
- Summary at end: "Added 1,847 documents. 12 failed. 141 skipped (already indexed)."

Failed documents are logged to `~/.kb/ingest-errors.log` with the file path and error for later investigation.
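
The resumable loop described above might look like the following sketch. It assumes a `documents` table with a `content_hash` column as in the schema; `ingest_one` is a hypothetical callback that inserts the document and its chunks, and errors are collected in a list here rather than written to the log file.

```python
import hashlib
import sqlite3
from pathlib import Path


def file_sha256(path: Path, block_size: int = 1 << 20) -> str:
    """SHA-256 of the file contents, read in blocks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def ingest_batch(db: sqlite3.Connection, paths, ingest_one):
    """Process each file independently; skip known hashes, survive failures."""
    counts = {"added": 0, "failed": 0, "skipped": 0}
    errors: list[tuple[str, str]] = []
    for path in paths:
        try:
            digest = file_sha256(path)
            known = db.execute(
                "SELECT 1 FROM documents WHERE content_hash = ?", (digest,)
            ).fetchone()
            if known:
                counts["skipped"] += 1
                continue
            # `with db` commits on success and rolls back on an exception,
            # so a document and its chunks land together or not at all.
            with db:
                ingest_one(db, path, digest)
            counts["added"] += 1
        except Exception as exc:   # one bad file must not stop the batch
            counts["failed"] += 1
            errors.append((str(path), repr(exc)))
    return counts, errors
```

Because a failed document is rolled back and never recorded, rerunning the same command retries exactly the failures and skips everything that already succeeded.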

## Risks / Trade-offs

**[Docling model size] → Mitigation:** ~1.5 GB download on first init. Clear progress indication during download. Models are cached in HuggingFace's default cache (`~/.cache/huggingface/`), so the download happens once. Document this in `kb init` output and SKILL.md.

**[Docling ingestion speed on CPU] → Mitigation:** ~17 hours for 2,000 PDFs on CPU (about 30 seconds per document). Support parallel workers (configurable). Show per-document progress. Resumable by design (skip already-indexed). Suggest GPU if available. This is a one-time cost.

**[ONNX model export on first load] → Mitigation:** First time a model is loaded, sentence-transformers exports it to ONNX format. This takes 10-30 seconds and is cached for subsequent runs. Users see a one-time delay on first `kb add` or `kb search` after init. Show a clear message: "Optimising model for ONNX inference (one-time)..."

**[sqlite-vec maturity] → Mitigation:** sqlite-vec is relatively new. At 22K vectors, brute-force search means we're not relying on its ANN indexing. If sqlite-vec has issues, swapping to numpy cosine similarity over a stored blob column is straightforward: same DB, different query path.

**[FTS5 trigger sync] → Mitigation:** FTS5 content-sync triggers add write overhead. At our scale (inserts during ingestion, not real-time) this is negligible. If it becomes an issue, switch to manual sync with `INSERT INTO chunks_fts(chunks_fts) VALUES('rebuild')` after batch operations.

**[Model lock-in] → Mitigation:** Changing embedding models requires full reindex (~22K embeddings, ~10-30 minutes on CPU). `kb reindex` with progress bar makes this manageable. Model name stored in DB prevents silent mixing.

## Resolved Questions

1. **ONNX for inference from day one.** Use sentence-transformers with ONNX backend. ONNX Runtime is smaller (~30 MB vs ~200 MB for PyTorch) and gives faster CPU inference. The goal is to avoid a PyTorch dependency, subject to confirming that `sentence-transformers` can be installed without it.

2. **HuggingFace default cache for models.** Both embedding and Docling models use `~/.cache/huggingface/`. The cache is shared with other HF tools, so there are no duplicate downloads if the user already has models cached.

3. **Manual schema migrations.** Version number in `config` table. `database.py` checks version on open and runs ALTER TABLE scripts sequentially. Simple enough for this project's schema complexity.
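
A manual migration runner along these lines could be sketched as follows. The `schema_version` key in the `config` table comes from this document; the `MIGRATIONS` contents (including the `summary` column) are invented placeholders, and the explicit `BEGIN` is there because Python's `sqlite3` module does not open a transaction implicitly for DDL statements.

```python
import sqlite3

# Hypothetical migration scripts keyed by the version they produce.
MIGRATIONS: dict[int, list[str]] = {
    1: ["CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL)"],
    2: ["ALTER TABLE documents ADD COLUMN summary TEXT"],
}


def schema_version(db: sqlite3.Connection) -> int:
    """Read the stored version; a fresh database counts as version 0."""
    db.execute("CREATE TABLE IF NOT EXISTS config (key TEXT PRIMARY KEY, value TEXT NOT NULL)")
    row = db.execute("SELECT value FROM config WHERE key = 'schema_version'").fetchone()
    return int(row[0]) if row else 0


def migrate(db: sqlite3.Connection) -> int:
    """Apply every pending migration in order and return the final version."""
    current = schema_version(db)
    for version in sorted(v for v in MIGRATIONS if v > current):
        with db:                    # commit on success, roll back on error
            db.execute("BEGIN")     # make DDL and the version bump atomic
            for statement in MIGRATIONS[version]:
                db.execute(statement)
            db.execute(
                "INSERT OR REPLACE INTO config (key, value) VALUES ('schema_version', ?)",
                (str(version),),
            )
        current = version
    return current
```

Running `migrate` on every open is cheap once the database is current, since the loop body never executes.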