Initial MVP

2026-03-23 20:38:42 +00:00
commit f245c24928
57 changed files with 6812 additions and 0 deletions
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-03-22
@@ -0,0 +1,396 @@
## Context
This is a greenfield Python CLI project. No existing codebase, no migration concerns. The tool will live at `~/.kb/` on the user's machine and be installed via `pipx install kb-search`. It must work entirely offline after initial model download.
Primary consumer is Claude Code (or similar LLM tools) via a skill wrapper that calls `kb search` and feeds JSON results to the LLM for synthesis. Secondary consumer is the user directly in a terminal. This dual-consumer constraint means output must be machine-parseable first, human-readable second.
The document corpus is ~3,000 items (2,000 PDFs of varying complexity, 500 markdown/text notes, 500 code snippets) producing ~22,000 chunks. This is small enough that brute-force vector search is viable and SQLite is more than sufficient.
## Goals / Non-Goals
**Goals:**
- Single-command install (`pipx install kb-search`) with `kb init` for model setup
- Ingest heterogeneous documents with format-appropriate chunking
- Hybrid search (keyword + semantic) with a single command
- JSON output contract stable enough for skill integration
- Configurable but works with zero configuration
- All state in one SQLite file for easy backup/portability
**Non-Goals:**
- LLM-based answer synthesis (the calling skill handles this)
- Multi-user or networked access
- Real-time / streaming ingestion
- Web UI or TUI dashboard
- Support for every possible document format (start with PDF, markdown, code, notes)
- Clustering, deduplication, or automatic organisation of documents
## Decisions
### 1. Package Structure
```
kb-search/
├── pyproject.toml
├── src/
│   └── kb_search/
│       ├── __init__.py
│       ├── cli.py              # Click CLI entry point
│       ├── config.py           # YAML config loading + ENV overrides
│       ├── database.py         # SQLite schema, migrations, connection
│       ├── embeddings.py       # Model download, loading, inference
│       ├── search.py           # Hybrid search + RRF merging
│       ├── ingest/
│       │   ├── __init__.py
│       │   ├── detector.py     # File type detection + routing
│       │   ├── docling.py      # Docling pipeline (PDF, DOCX, HTML, images)
│       │   ├── markdown.py     # Header-based markdown splitting
│       │   ├── code.py         # AST/regex code splitting
│       │   └── note.py         # Whole-document note handler
│       └── output.py           # JSON + human-readable formatters
├── tests/
└── SKILL.md                    # Claude Code skill definition
```
**Why this structure:** Flat enough to navigate easily, but the `ingest/` subpackage isolates format-specific logic. Each ingestion module exports the same interface (`ingest(path, config) -> list[Chunk]`), making it easy to add formats later. Using `src/` layout per Python packaging best practices.
### 2. SQLite as Sole Storage Backend
All data lives in `~/.kb/kb.db`:
```sql
-- Documents
CREATE TABLE documents (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    title        TEXT NOT NULL,
    source_path  TEXT,
    content_hash TEXT NOT NULL,              -- SHA-256 for dedup/change detection
    doc_type     TEXT NOT NULL CHECK(doc_type IN ('pdf','markdown','code','note')),
    language     TEXT,                       -- for code: 'python','bash','go'
    created_at   TEXT DEFAULT (datetime('now')),
    metadata     TEXT DEFAULT '{}'           -- JSON: page_count, author, etc.
);

-- Chunks
CREATE TABLE chunks (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_index INTEGER NOT NULL,
    text        TEXT NOT NULL,
    token_count INTEGER,
    metadata    TEXT DEFAULT '{}',           -- JSON: page, section_header, symbol_name
    created_at  TEXT DEFAULT (datetime('now'))
);

-- FTS5 index (content-sync with chunks table)
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    text,
    content='chunks',
    content_rowid='id',
    tokenize='porter unicode61'
);

-- Triggers to keep FTS in sync
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;

-- Vector storage (sqlite-vec)
CREATE VIRTUAL TABLE chunks_vec USING vec0(
    chunk_id  INTEGER PRIMARY KEY,
    embedding FLOAT[384]                     -- dimension matches model
);

-- Tags
CREATE TABLE tags (
    id   INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE document_tags (
    document_id INTEGER REFERENCES documents(id) ON DELETE CASCADE,
    tag_id      INTEGER REFERENCES tags(id) ON DELETE CASCADE,
    PRIMARY KEY (document_id, tag_id)
);

-- Config stored in DB (model binding)
CREATE TABLE config (
    key   TEXT PRIMARY KEY,
    value TEXT NOT NULL
);
-- Keys: schema_version, model_name, embedding_dim, model_max_tokens
```
**Why SQLite for everything:** At ~22,000 chunks, SQLite handles FTS, vector search, and relational data without breaking a sweat. One file = trivial backup (`cp kb.db kb.db.bak`), no server process, no port conflicts. FTS5 is built into SQLite. sqlite-vec is a single loadable extension.
**Why store config in DB _and_ YAML:** The YAML file holds user preferences (chunking params, model choice). The DB `config` table records what the DB was _actually built with_ (model name, dimension). This separation lets us detect mismatches: "config says use nomic-embed-text but DB was built with all-MiniLM-L6-v2."
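A minimal sketch of the mismatch check described above, assuming the `config` table from the schema and a `model_name` key. The names `check_model_binding` and `ModelMismatchError` are hypothetical and chosen for illustration only.

```python
import sqlite3


class ModelMismatchError(RuntimeError):
    """Raised when the configured model differs from the one the DB was built with."""


def check_model_binding(conn: sqlite3.Connection, configured_model: str) -> None:
    """Compare the YAML/ENV model choice against the model recorded in the DB."""
    row = conn.execute(
        "SELECT value FROM config WHERE key = 'model_name'"
    ).fetchone()
    if row is None:
        return  # fresh database: no model has been bound yet
    bound_model = row[0]
    if bound_model != configured_model:
        raise ModelMismatchError(
            f"config says use {configured_model} but DB was built with "
            f"{bound_model}; run `kb reindex` to switch models"
        )
```

Calling this once per command, before any embedding work, would keep vectors from two different models out of the same table.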
**Alternatives considered:**
- ChromaDB/Qdrant: External services, overkill for this scale, breaks single-file story
- DuckDB: Good at analytics, but FTS support is weaker than SQLite FTS5
- LanceDB: Interesting but less mature, no FTS built in
### 3. Docling for Complex Document Ingestion
Docling handles PDF, DOCX, HTML, and image files through a unified pipeline with ML-based layout detection and table reconstruction.
**Why Docling over simpler extractors:** The 2,000 PDFs are "many and varied" — simple text extraction (pymupdf, pdfplumber) works for clean PDFs but silently produces garbage for complex layouts, tables, or multi-column documents. Docling's layout model correctly identifies structural elements, and its table reconstruction preserves data that would otherwise be lost. The quality difference matters because bad chunks → bad search results → useless tool.
**Docling configuration for this project:**
- Use `pypdfium2` backend (default, fast for text-based PDFs)
- Enable OCR only when needed (detect pages with no extractable text)
- Use hierarchy-aware chunking (respects section/paragraph boundaries)
- Disable image extraction (we're indexing text, not images)
- Run with multiple workers for batch ingestion
**Model download:** Docling models (~1.5 GB) download on first use or via `kb init`. They are stored in HuggingFace's default cache (`~/.cache/huggingface/`), the same location used for the embedding model.
**Alternatives considered:**
- pymupdf4llm: Fast, lightweight, but poor table/layout handling
- Unstructured: Heavier than Docling, commercial focus, less predictable output
- LlamaParse: Cloud-only, violates local-first constraint
### 4. Per-Type Chunking Strategy
Each document type gets a purpose-built chunker with configurable parameters:
**PDF (Docling):** Hierarchy-aware chunking. Docling's `HierarchicalChunker` splits at section/paragraph boundaries respecting the document's logical structure. Falls back to fixed-size if hierarchy detection fails.
**Markdown:** Header-based splitting. Split at `##` and `###` boundaries. Preserve the parent header chain as context (so a chunk under "## Config > ### Advanced" carries that path). Merge small sections (< `min_tokens`) with the following section. Configurable: `min_tokens`, `max_tokens`.
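The header-chain idea can be sketched as follows. This is an illustration under simplifying assumptions, not the project's implementation: the function name `split_markdown` and the output dict shape are hypothetical, `min_tokens` merging and `max_tokens` splitting are left out, and `##` lines inside fenced code blocks are not treated specially.

```python
import re

# Matches level-2 and level-3 ATX headers only ("## Title", "### Title").
HEADER_RE = re.compile(r"^(#{2,3})\s+(.*\S)\s*$")


def split_markdown(text: str) -> list[dict]:
    """Split markdown at ##/### headers, tagging each chunk with its header path."""
    chunks: list[dict] = []
    chain: dict[int, str] = {}  # header level -> title currently in effect
    body: list[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            path = " > ".join(chain[level] for level in sorted(chain))
            chunks.append({"header_path": path, "text": content})
        body.clear()

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match is None:
            body.append(line)
            continue
        flush()
        level = len(match.group(1))
        # A new header invalidates headers at the same or a deeper level.
        for stale in [lvl for lvl in chain if lvl >= level]:
            del chain[stale]
        chain[level] = match.group(2)
    flush()
    return chunks
```

For a file with `## Config` followed by `### Advanced`, the second chunk's `header_path` would be `Config > Advanced`.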
**Code (Python):** Use stdlib `ast` module. Each function and class becomes a chunk. Class methods include the class docstring for context. Top-level code between definitions becomes its own chunk.
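A rough sketch of the `ast`-based approach, assuming one chunk per top-level function and one per method, with the class docstring carried as context. The function name `chunk_python` and the dict keys are hypothetical; top-level code between definitions, decorators, and nested classes are not handled here.

```python
import ast


def chunk_python(source: str) -> list[dict]:
    """Return one chunk per top-level function and per class method."""
    tree = ast.parse(source)
    function_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    chunks: list[dict] = []
    for node in tree.body:
        if isinstance(node, function_types):
            chunks.append({
                "symbol": node.name,
                "context": "",
                "text": ast.get_source_segment(source, node),
            })
        elif isinstance(node, ast.ClassDef):
            class_doc = ast.get_docstring(node) or ""
            for item in node.body:
                if isinstance(item, function_types):
                    chunks.append({
                        "symbol": f"{node.name}.{item.name}",
                        "context": class_doc,  # class docstring travels with each method
                        "text": ast.get_source_segment(source, item),
                    })
    return chunks
```

`ast.parse` raises `SyntaxError` on invalid source, so a caller would likely catch that and fall back to fixed-size chunking.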
**Code (Bash):** Regex-based. Split on `function name() {` and `name() {` patterns with brace-depth counting. Comment blocks preceding a function attach to that function's chunk. Fall back to fixed-size windowed chunks if no functions detected.
**Code (Go):** Regex-based. Split on `func ` declarations. Type definitions with methods are grouped. Fall back to fixed-size if no recognisable boundaries.
**Notes:** Whole document = one chunk. Notes are small by definition.
**Configurable defaults (in `~/.kb/config.yaml`):**
```yaml
chunking:
  defaults:
    max_tokens: 512
    overlap_tokens: 50
  pdf:
    strategy: hierarchy      # hierarchy | fixed
    max_tokens: 1024         # for fixed strategy fallback
  markdown:
    strategy: header         # header | fixed
    min_tokens: 50           # merge sections smaller than this
    max_tokens: 1024
  code:
    strategy: ast            # ast | fixed
    include_context: true    # include class/module docstring with methods
    max_tokens: 1024
  note:
    strategy: whole
```
### 5. Embedding Model Management
**Default model:** `all-MiniLM-L6-v2` (384 dimensions, 90 MB, good quality/speed tradeoff for CPU).
**Model loading:** Use `sentence-transformers` library which provides a unified API across models. Models stored in HuggingFace's default cache (`~/.cache/huggingface/`), shared with other tools that use HF models. No custom cache directory override.
**Model binding:** On `kb init`, the chosen model's name and dimension are written to the DB `config` table. Every subsequent `kb add` and `kb search` checks that the loaded model matches the DB. A mismatch is a hard error with a clear message.
**Model switching (`kb reindex`):**
1. Download the new model
2. Read all chunks from the DB
3. Recreate the `chunks_vec` table if the embedding dimension changed
4. Re-embed in batches (with progress bar)
5. Replace all vectors in `chunks_vec`
6. Update DB config (model_name, embedding_dim)
**ONNX Runtime for inference:** Use `sentence-transformers` with ONNX backend (`model = SentenceTransformer(model_name, backend="onnx")`). This gives us sentence-transformers' correct tokenization/pooling/normalization while using ONNX Runtime (~30 MB) instead of PyTorch (~200 MB) for inference. Models are automatically exported to ONNX format on first load. This keeps the install lightweight without sacrificing the convenience of the sentence-transformers API.
**Model compatibility:** All models on HuggingFace that work with `sentence-transformers` are supported. The only per-model differences handled in code:
- Dimension (read from model config)
- Max sequence length (read from model config, used to cap chunk size)
- Query/passage prefixes (configurable in YAML, empty by default)
```yaml
embedding:
  model: all-MiniLM-L6-v2
  query_prefix: ""      # some models need "search_query: "
  passage_prefix: ""    # some models need "search_document: "
```
### 6. Hybrid Search with Reciprocal Rank Fusion
**Search flow:**
```
Query: "how to install git"
  │
  ├──▶ FTS5 query ──▶ BM25-ranked results (chunk_id, fts_score)
  │
  └──▶ Embed query ──▶ vec similarity search ──▶ (chunk_id, vec_score)
                        (cosine distance, top-K)
  │
  ▼
Reciprocal Rank Fusion (RRF)
  score(d) = Σ 1/(k + rank_in_list)   where k=60 (standard)
  │
  ▼
Merged results, sorted by RRF score
  │
  ▼
Apply filters (tags, doc_type) ──▶ Top-N results
```
**Why RRF over learned re-ranking:** RRF is simple, needs no training, and has a single constant (k=60 is the conventional default) that rarely needs tuning. A learned re-ranker (e.g., a cross-encoder) would add another model download and slow down queries, and the marginal quality improvement isn't worth it at this scale. RRF can be swapped out later if needed.
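The fusion step itself is small. A minimal sketch, assuming each input list holds chunk IDs ordered best-first; the function name `rrf_merge` and the tie-break on chunk ID are illustrative choices, not part of the design above.

```python
from collections import defaultdict


def rrf_merge(rankings: list[list[int]], k: int = 60) -> list[tuple[int, float]]:
    """Fuse ranked lists of chunk IDs with Reciprocal Rank Fusion.

    score(d) = sum over lists of 1 / (k + rank of d in that list), rank starting at 1.
    """
    scores: dict[int, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by chunk ID so output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

With `rrf_merge([fts_ids, vec_ids])`, a chunk appearing in both lists outranks one that is first in only a single list. Raw RRF scores are small (at most 2/61 with two lists and k=60), so any 0-to-1 style score shown to users would need a separate normalisation step that this sketch does not include.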
**FTS5 query construction:** Pass the raw query string to FTS5. FTS5's porter stemmer handles basic normalisation. For queries with special characters, escape them. No query expansion or synonym handling — keep it simple.
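One possible way to handle the escaping is to wrap every whitespace-separated term in double quotes, doubling any embedded quote, which is how FTS5 string literals are written. The helper name `to_fts5_query` is hypothetical, and this sketch deliberately turns operators such as `AND`, `OR`, and `*` into plain search terms.

```python
def to_fts5_query(raw: str) -> str:
    """Quote each term so FTS5 treats punctuation and keywords as literal text."""
    terms = raw.split()
    # Inside an FTS5 string, a double quote is escaped by doubling it.
    return " ".join('"' + term.replace('"', '""') + '"' for term in terms)
```

The result is a space-separated list of quoted terms, which FTS5 combines with an implicit AND. An empty or whitespace-only query yields an empty string, which the caller would need to reject before running `MATCH`.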
**Vector search:** Embed the query with the same model used for chunks. Retrieve top-K (K = 3× requested results, to give RRF enough candidates). sqlite-vec does brute-force cosine similarity over all vectors — at 22K vectors this is ~2-5ms.
**Filter application:** Tag and type filters are applied as SQL WHERE clauses _before_ search where possible (for FTS5 via JOIN), or as post-filters on the merged results. This is a design choice per filter type:
- Type filter: Applied in the SQL query (efficient)
- Tag filter: Applied in the SQL query via JOIN (efficient)
- Score threshold: Applied post-RRF as a cutoff
### 7. Output Format (Skill Contract)
**JSON output (`--format json`, default):**
```json
{
  "query": "how to install git",
  "results": [
    {
      "chunk_id": 1423,
      "score": 0.87,
      "score_breakdown": {"fts": 0.72, "vector": 0.94},
      "text": "To install the latest version of git from source...",
      "source": {
        "document_id": 42,
        "title": "Git Admin Guide",
        "path": "/home/user/docs/git-admin.pdf",
        "type": "pdf",
        "page": 12,
        "chunk_index": 3,
        "total_chunks": 28,
        "tags": ["git", "admin"]
      }
    }
  ],
  "total_matches": 47,
  "returned": 10
}
```
**Human output (`--format human`):**
```
Search: "how to install git" (47 matches, showing top 10)
1. [0.87] Git Admin Guide (p.12) [pdf] [git, admin]
   To install the latest version of git from source...
2. [0.65] setup-notes.md §Installation [markdown] [git]
   First, add the PPA repository for the latest git...
```
**Stability commitment:** The JSON schema is the contract with the skill. Fields may be _added_ but not removed or renamed once the skill is built.
### 8. Configuration Architecture
```
Precedence (highest to lowest):
  1. CLI flags (--top, --tags, --format)
  2. Environment variables (KB_MODEL, KB_DATA_DIR, KB_DEFAULT_TOP)
  3. ~/.kb/config.yaml
  4. Built-in defaults

ENV variable naming: KB_ prefix + UPPER_SNAKE_CASE
  KB_DATA_DIR    → ~/.kb/
  KB_MODEL       → all-MiniLM-L6-v2
  KB_DEFAULT_TOP → 10
```
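The precedence chain can be sketched as a single lookup that also reports where a value came from. This is an illustration only: `resolve`, `DEFAULTS`, and `ENV_KEYS` are hypothetical names, the flat key layout is a simplification of the nested YAML, and only two settings are shown.

```python
import os
from typing import Any, Mapping

DEFAULTS: dict[str, Any] = {"model": "all-MiniLM-L6-v2", "default_top": 10}
ENV_KEYS: dict[str, str] = {"model": "KB_MODEL", "default_top": "KB_DEFAULT_TOP"}


def resolve(
    key: str,
    cli: Mapping[str, Any],
    yaml_cfg: Mapping[str, Any],
    env: Mapping[str, str] = os.environ,
) -> tuple[Any, str]:
    """Return (value, source) using CLI > ENV > YAML > built-in default."""
    if cli.get(key) is not None:
        return cli[key], "cli"
    env_name = ENV_KEYS.get(key)
    if env_name is not None and env_name in env:
        # ENV values are strings; coerce to the type of the built-in default.
        return type(DEFAULTS[key])(env[env_name]), "env"
    if key in yaml_cfg:
        return yaml_cfg[key], "yaml"
    return DEFAULTS[key], "default"
```

Returning the source alongside the value makes it straightforward for a config viewer to show which layer supplied each setting.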
**Full default config.yaml:**
```yaml
# ~/.kb/config.yaml
data_dir: ~/.kb

embedding:
  model: all-MiniLM-L6-v2
  query_prefix: ""
  passage_prefix: ""

search:
  default_top: 10
  default_format: json
  rrf_k: 60

chunking:
  defaults:
    max_tokens: 512
    overlap_tokens: 50
  pdf:
    strategy: hierarchy
    max_tokens: 1024
  markdown:
    strategy: header
    min_tokens: 50
    max_tokens: 1024
  code:
    strategy: ast
    include_context: true
    max_tokens: 1024
  note:
    strategy: whole

ingestion:
  workers: 4          # parallel Docling workers
  batch_size: 50      # commit to DB every N documents
  enable_ocr: auto    # auto | always | never
```
### 9. CLI Framework: Click
**Why Click:** Mature, well-documented, supports nested command groups, automatic `--help` generation, parameter validation, and progress bars (via `click.progressbar`). The alternative (Typer) adds type-hint magic but less control. argparse is too verbose for this many commands.
### 10. Error Handling and Resumability
**Batch ingestion must be resumable.** When adding a directory of 2,000 PDFs:
- Each document is processed independently
- On success: document + chunks inserted in a single transaction
- On failure: error logged, document skipped, processing continues
- `content_hash` (SHA-256 of file contents) enables skip-if-already-indexed
- Progress shown via `click.progressbar` or `rich.progress`
- Summary at end: "Added 1,847 documents. 12 failed. 141 skipped (already indexed)."
Failed documents are logged to `~/.kb/ingest-errors.log` with the file path and error for later investigation.
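The resumable loop described above might look roughly like this. It is a sketch under stated assumptions: `ingest_batch` and the `process` callable are hypothetical, documents are passed as in-memory bytes rather than read from disk, every document is stored with type `note` for brevity, and error logging and embedding are omitted.

```python
import hashlib
import sqlite3
from typing import Callable, Iterable


def ingest_batch(
    conn: sqlite3.Connection,
    items: Iterable[tuple[str, bytes]],
    process: Callable[[bytes], list[str]],
) -> dict[str, int]:
    """Ingest (path, contents) pairs, skipping known hashes and surviving failures."""
    summary = {"added": 0, "failed": 0, "skipped": 0}
    for path, data in items:
        digest = hashlib.sha256(data).hexdigest()
        known = conn.execute(
            "SELECT 1 FROM documents WHERE content_hash = ? LIMIT 1", (digest,)
        ).fetchone()
        if known:
            summary["skipped"] += 1
            continue
        try:
            chunks = process(data)
            with conn:  # commits on success, rolls back if an insert raises
                cur = conn.execute(
                    "INSERT INTO documents (title, source_path, content_hash, doc_type) "
                    "VALUES (?, ?, ?, 'note')",
                    (path, path, digest),
                )
                conn.executemany(
                    "INSERT INTO chunks (document_id, chunk_index, text) VALUES (?, ?, ?)",
                    [(cur.lastrowid, i, text) for i, text in enumerate(chunks)],
                )
            summary["added"] += 1
        except Exception:
            summary["failed"] += 1  # a real tool would also log path and error
    return summary
```

Because a failed document never reaches the `documents` table, rerunning the same batch retries it, while successfully indexed files are skipped by hash.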
## Risks / Trade-offs
**[Docling model size] → Mitigation:** ~1.5 GB download on first init. Clear progress indication during download. Models are cached in HuggingFace's default cache (`~/.cache/huggingface/`) and reused across runs. Document this in `kb init` output and SKILL.md.
**[Docling ingestion speed on CPU] → Mitigation:** ~17 hours for 2,000 PDFs on CPU. Support parallel workers (configurable). Show per-document progress. Resumable by design (skip already-indexed). Suggest GPU if available. This is a one-time cost.
**[ONNX model export on first load] → Mitigation:** First time a model is loaded, sentence-transformers exports it to ONNX format. This takes 10-30 seconds and is cached for subsequent runs. Users see a one-time delay on first `kb add` or `kb search` after init. Show a clear message: "Optimising model for ONNX inference (one-time)..."
**[sqlite-vec maturity] → Mitigation:** sqlite-vec is relatively new. At 22K vectors, brute-force search means we're not relying on its ANN indexing. If sqlite-vec has issues, swapping to numpy cosine similarity over a stored blob column is straightforward — same DB, different query path.
**[FTS5 trigger sync] → Mitigation:** FTS5 content-sync triggers add write overhead. At our scale (inserts during ingestion, not real-time) this is negligible. If it becomes an issue, switch to manual sync with `INSERT INTO chunks_fts(chunks_fts) VALUES('rebuild')` after batch operations.
**[Model lock-in] → Mitigation:** Changing embedding models requires full reindex (~22K embeddings, ~10-30 minutes on CPU). `kb reindex` with progress bar makes this manageable. Model name stored in DB prevents silent mixing.
## Resolved Questions
1. **ONNX for inference from day one.** Use sentence-transformers with ONNX backend. Smaller install (~30 MB vs ~200 MB for PyTorch), faster CPU inference. No PyTorch dependency.
2. **HuggingFace default cache for models.** Both embedding and Docling models use `~/.cache/huggingface/`. Shared with other HF tools — no duplicate downloads if the user already has models cached.
3. **Manual schema migrations.** Version number in `config` table. `database.py` checks version on open and runs ALTER TABLE scripts sequentially. Simple enough for this project's schema complexity.
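For the manual migration approach in item 3, a sketch of what the version check could look like. The `MIGRATIONS` contents are placeholder statements invented for illustration, and `get_version` and `migrate` are hypothetical names; only the `schema_version` key in the `config` table comes from the design above.

```python
import sqlite3

# Placeholder migrations: version -> statements that bring the schema to that version.
MIGRATIONS: dict[int, list[str]] = {
    1: ["CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL)"],
    2: ["ALTER TABLE documents ADD COLUMN language TEXT"],
}


def get_version(conn: sqlite3.Connection) -> int:
    """Read schema_version from the config table; 0 means a brand-new database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS config (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    row = conn.execute(
        "SELECT value FROM config WHERE key = 'schema_version'"
    ).fetchone()
    return int(row[0]) if row else 0


def migrate(conn: sqlite3.Connection) -> int:
    """Apply pending migrations in order and return the resulting version."""
    current = get_version(conn)
    for version in sorted(v for v in MIGRATIONS if v > current):
        with conn:  # commit after each version, roll back if a statement fails
            for statement in MIGRATIONS[version]:
                conn.execute(statement)
            conn.execute(
                "INSERT OR REPLACE INTO config (key, value) VALUES ('schema_version', ?)",
                (str(version),),
            )
    return get_version(conn)
```

Running `migrate` on every open is cheap once the database is current, since the loop has nothing to apply.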
@@ -0,0 +1,39 @@
## Why
There is no simple, local-first CLI tool for building a personal knowledge base across heterogeneous document types (PDFs, markdown, code snippets, text notes) with hybrid search that combines keyword matching and semantic understanding. Existing tools either require cloud services, lack semantic search, or can't handle the variety of document formats. This tool fills the gap — a retrieval engine that can be used standalone from the terminal or wrapped as an AI skill (e.g. Claude Code) where the LLM layer provides natural language synthesis over retrieved results.
## What Changes
- New Python CLI tool (`kb`) distributed via pipx (PyPI package: `kb-search`)
- Ingestion pipeline with per-format handling:
- **PDFs/DOCX/HTML/images**: Docling (layout-aware, table reconstruction, optional OCR)
- **Markdown/text**: Header-based semantic splitting
- **Code (Python, Bash, Go)**: AST/regex-based splitting at function/class boundaries
- **Notes**: Inline text stored as whole-document chunks
- Hybrid search combining SQLite FTS5 (BM25 keyword scoring) and sqlite-vec (vector similarity), merged via Reciprocal Rank Fusion
- Local embedding models downloaded from HuggingFace on first run (`kb init`), with multi-model support and full reindex capability when switching models
- Document tagging system for manual categorisation and filtered search
- Structured JSON output designed for LLM skill consumption, plus human-readable terminal output
- Configurable chunking parameters per document type with sensible defaults
- All state in a single SQLite database (`~/.kb/kb.db`)
- Configuration via YAML (`~/.kb/config.yaml`) with ENV variable overrides
## Capabilities
### New Capabilities
- `document-ingestion`: Ingest PDFs, markdown, code, and text notes into chunked, embedded, searchable storage. Handles format detection, per-type chunking strategies, Docling pipeline for complex documents, and resumable batch imports.
- `hybrid-search`: Hybrid retrieval combining FTS5 full-text search and sqlite-vec vector similarity via Reciprocal Rank Fusion. Supports tag/type filtering, configurable result counts, score thresholds, and JSON/human output formats.
- `embedding-management`: Local embedding model lifecycle — download on init, bind model to database, detect mismatches, and full re-embedding via reindex when switching models.
- `document-management`: CRUD operations on the document store — list, inspect, remove documents. Tag management (add/remove tags, filter by tags, list tags with counts).
- `configuration`: YAML-based configuration with per-document-type chunking parameters, model selection, and ENV variable overrides. Sensible defaults that work without any config file.
- `skill-interface`: Structured JSON output contract designed for LLM skill consumption — chunks with scores, source metadata, and provenance for citation.
### Modified Capabilities
_(none — greenfield project)_
## Impact
- **Dependencies**: Docling (~1.5 GB models), sentence-transformers with ONNX Runtime backend, sqlite-vec, Click
- **Storage**: ~/.kb/ directory containing SQLite database, config file, and downloaded models (~1.6 GB on init, database grows with content)
- **First-run experience**: `kb init` required before use to download models. Batch ingestion of 2,000 PDFs estimated at ~17 hours CPU / ~3 hours GPU (one-time cost, resumable)
- **External integration**: Designed to be wrapped as a Claude Code skill — the skill definition (SKILL.md) is a deliverable alongside the code
@@ -0,0 +1,72 @@
## ADDED Requirements
### Requirement: YAML configuration file
The system SHALL read configuration from `~/.kb/config.yaml`. If the file does not exist, the system SHALL use built-in defaults. The configuration file SHALL be optional — the tool MUST work with zero configuration.
#### Scenario: No config file
- **WHEN** `~/.kb/config.yaml` does not exist
- **THEN** the system uses built-in defaults for all settings and operates normally
#### Scenario: Partial config file
- **WHEN** `~/.kb/config.yaml` exists but only specifies `chunking.pdf.max_tokens: 2048`
- **THEN** the system uses built-in defaults for all other settings, overriding only `chunking.pdf.max_tokens`
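The partial-config behaviour amounts to a recursive merge of the user's YAML over the built-in defaults. A minimal sketch, with `deep_merge` as a hypothetical helper name and plain dicts standing in for the parsed YAML:

```python
from copy import deepcopy


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Return defaults with overrides applied key by key, recursing into nested dicts."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged
```

A user file containing only `chunking.pdf.max_tokens: 2048` would change that one value and leave sibling keys such as `chunking.pdf.strategy` at their defaults.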
#### Scenario: Invalid config file
- **WHEN** `~/.kb/config.yaml` contains invalid YAML
- **THEN** the system prints a clear error message identifying the YAML syntax issue and exits with non-zero status
### Requirement: Environment variable overrides
The system SHALL support environment variable overrides with the prefix `KB_`. ENV variables SHALL take precedence over the YAML config file. Supported variables: `KB_DATA_DIR`, `KB_MODEL`, `KB_DEFAULT_TOP`, `KB_DEFAULT_FORMAT`.
#### Scenario: Override data directory
- **WHEN** `KB_DATA_DIR=/tmp/test-kb` is set
- **THEN** the system uses `/tmp/test-kb/` instead of `~/.kb/` for the database and config
#### Scenario: Override model
- **WHEN** `KB_MODEL=nomic-embed-text` is set
- **THEN** the system uses `nomic-embed-text` as the embedding model, overriding the YAML config
#### Scenario: ENV overrides YAML
- **WHEN** YAML config has `search.default_top: 10` and `KB_DEFAULT_TOP=20` is set
- **THEN** the default top value is 20
### Requirement: Configuration precedence
The system SHALL apply configuration in this order (highest to lowest precedence): CLI flags, environment variables, YAML config file, built-in defaults.
#### Scenario: CLI flag overrides everything
- **WHEN** YAML config has `search.default_top: 10`, ENV has `KB_DEFAULT_TOP=20`, and user runs `kb search "test" --top 5`
- **THEN** 5 results are returned
### Requirement: View and set configuration
The system SHALL support viewing the current effective configuration via `kb config` and setting individual values via `kb config set <key> <value>`.
#### Scenario: View configuration
- **WHEN** user runs `kb config`
- **THEN** the system displays the fully resolved configuration (defaults merged with YAML merged with ENV), indicating the source of each value
#### Scenario: Set a config value
- **WHEN** user runs `kb config set chunking.pdf.max_tokens 2048`
- **THEN** the value is written to `~/.kb/config.yaml`, creating the file if necessary
### Requirement: Configurable chunking parameters
The system SHALL support per-document-type chunking configuration with sensible defaults.
#### Scenario: Default chunking for PDF
- **WHEN** no chunking config is specified for PDF
- **THEN** the system uses `strategy: hierarchy, max_tokens: 1024`
#### Scenario: Default chunking for markdown
- **WHEN** no chunking config is specified for markdown
- **THEN** the system uses `strategy: header, min_tokens: 50, max_tokens: 1024`
#### Scenario: Default chunking for code
- **WHEN** no chunking config is specified for code
- **THEN** the system uses `strategy: ast, include_context: true, max_tokens: 1024`
#### Scenario: Default chunking for notes
- **WHEN** no chunking config is specified for notes
- **THEN** the system uses `strategy: whole`
#### Scenario: Custom chunking overrides
- **WHEN** YAML config specifies `chunking.pdf.strategy: fixed` and `chunking.pdf.max_tokens: 512`
- **THEN** PDFs are chunked with fixed-size windows of 512 tokens instead of hierarchy-aware chunking
@@ -0,0 +1,125 @@
## ADDED Requirements
### Requirement: File type detection and routing
The system SHALL detect the type of a file being ingested and route it to the appropriate ingestion pipeline. Detection SHALL be based on file extension. Supported types: PDF (`.pdf`), DOCX (`.docx`), HTML (`.html`, `.htm`), Markdown (`.md`, `.markdown`, `.txt`), Code (`.py`, `.sh`, `.bash`, `.go`), and image files (`.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`, `.webp`). The user MAY override detection with `--type` and `--language` flags.
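Extension-based routing could be sketched like this. The pipeline labels (`docling`, `markdown`, `code`), the function name `detect`, and the exception class are assumptions made for illustration; the extension lists are taken from the requirement text.

```python
from pathlib import Path
from typing import Optional

DOCLING_EXTS = {".pdf", ".docx", ".html", ".htm",
                ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp"}
MARKDOWN_EXTS = {".md", ".markdown", ".txt"}
CODE_LANGS = {".py": "python", ".sh": "bash", ".bash": "bash", ".go": "go"}


class UnsupportedFileType(ValueError):
    """Raised for extensions that no ingestion pipeline handles."""


def detect(
    path: str,
    type_override: Optional[str] = None,
    language_override: Optional[str] = None,
) -> tuple[str, Optional[str]]:
    """Return (pipeline, language) for a file, honouring explicit overrides."""
    if type_override is not None:
        return type_override, language_override
    ext = Path(path).suffix.lower()
    if ext in DOCLING_EXTS:
        return "docling", None
    if ext in MARKDOWN_EXTS:
        return "markdown", None
    if ext in CODE_LANGS:
        return "code", language_override or CODE_LANGS[ext]
    supported = sorted(DOCLING_EXTS | MARKDOWN_EXTS | set(CODE_LANGS))
    raise UnsupportedFileType(
        f"unsupported file type {ext!r}; supported: {', '.join(supported)}"
    )
```

Lower-casing the suffix means `REPORT.PDF` routes the same way as `report.pdf`.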
#### Scenario: Auto-detect PDF file
- **WHEN** user runs `kb add report.pdf`
- **THEN** the file is routed to the Docling ingestion pipeline
#### Scenario: Auto-detect Python code
- **WHEN** user runs `kb add script.py`
- **THEN** the file is routed to the code ingestion pipeline with language set to `python`
#### Scenario: Override type detection
- **WHEN** user runs `kb add data.txt --type code --language bash`
- **THEN** the file is routed to the code pipeline as Bash, regardless of the `.txt` extension
#### Scenario: Unsupported file type
- **WHEN** user runs `kb add archive.zip`
- **THEN** the system SHALL print an error message listing supported formats and exit with non-zero status
### Requirement: Docling pipeline for complex documents
The system SHALL use Docling to ingest PDF, DOCX, HTML, and image files. The pipeline SHALL use the `pypdfium2` backend for PDFs, enable layout model for structural detection, and enable table reconstruction. OCR SHALL be configurable: `auto` (detect pages with no extractable text and OCR those), `always`, or `never`.
#### Scenario: Ingest a text-based PDF
- **WHEN** user runs `kb add manual.pdf`
- **THEN** the system extracts text using Docling with layout detection, produces hierarchy-aware chunks preserving section structure, embeds each chunk, and stores the document with all chunks in the database
#### Scenario: Ingest a PDF with tables
- **WHEN** user ingests a PDF containing data tables
- **THEN** Docling's table reconstruction SHALL produce chunks where table content is represented as structured text (markdown table format) rather than garbled column fragments
#### Scenario: Ingest a scanned PDF with OCR auto mode
- **WHEN** user ingests a PDF where some pages contain only images (no extractable text) and OCR is set to `auto`
- **THEN** the system SHALL detect the image-only pages and apply OCR to those pages only, leaving text-extractable pages processed normally
#### Scenario: Ingest an image file
- **WHEN** user runs `kb add diagram.png`
- **THEN** the system SHALL process it through Docling with OCR enabled, extracting any text content from the image
### Requirement: Markdown ingestion with header-based splitting
The system SHALL split markdown and text files at header boundaries (`##`, `###`). Each chunk SHALL include its parent header chain as context. Sections smaller than `min_tokens` SHALL be merged with the following section. Sections larger than `max_tokens` SHALL be split at paragraph boundaries with configurable overlap.
#### Scenario: Split markdown at headers
- **WHEN** user runs `kb add guide.md` and the file contains multiple `##` sections
- **THEN** each section becomes a separate chunk, with the header text included in the chunk
#### Scenario: Preserve header hierarchy
- **WHEN** a markdown file has nested headers like `## Config` > `### Advanced Options`
- **THEN** the chunk for "Advanced Options" SHALL include context indicating it falls under "Config > Advanced Options"
#### Scenario: Merge small sections
- **WHEN** a markdown section contains fewer tokens than `min_tokens` (default: 50)
- **THEN** it SHALL be merged with the next section into a single chunk
#### Scenario: Plain text file without headers
- **WHEN** user runs `kb add notes.txt` and the file has no markdown headers
- **THEN** the system SHALL fall back to fixed-size chunking with configurable `max_tokens` and `overlap_tokens`
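The fixed-size fallback is a sliding window over the token sequence. A minimal sketch, assuming the text has already been tokenised into a list (a real implementation would presumably use the embedding model's tokenizer); `fixed_chunks` is a hypothetical name.

```python
def fixed_chunks(
    tokens: list[str], max_tokens: int = 512, overlap_tokens: int = 50
) -> list[list[str]]:
    """Split tokens into windows of max_tokens that overlap by overlap_tokens."""
    if max_tokens <= 0 or not 0 <= overlap_tokens < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap_tokens < max_tokens")
    step = max_tokens - overlap_tokens
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

The early `break` avoids emitting a trailing chunk that would contain only tokens already covered by the previous window.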
### Requirement: Code ingestion with AST/regex splitting
The system SHALL split code files at function and class boundaries. Python files SHALL use the `ast` module. Bash and Go files SHALL use regex-based splitting. When `include_context` is enabled (default), class methods SHALL include the class docstring/signature for context. Files with no recognisable function/class boundaries SHALL fall back to fixed-size chunking.
#### Scenario: Python file with functions and classes
- **WHEN** user runs `kb add auth.py` and the file contains a class with methods
- **THEN** each method becomes a chunk, and each chunk includes the class name and docstring as context
#### Scenario: Bash script with functions
- **WHEN** user runs `kb add deploy.sh` and the file contains `function deploy() {` blocks
- **THEN** each function becomes a separate chunk, including any preceding comment block
#### Scenario: Go file with functions
- **WHEN** user runs `kb add main.go` and the file contains `func` declarations
- **THEN** each function becomes a separate chunk
#### Scenario: Code file with no functions
- **WHEN** user runs `kb add script.sh` and the file has no function declarations
- **THEN** the system SHALL fall back to fixed-size chunking with `max_tokens` and `overlap_tokens`
### Requirement: Inline note ingestion
The system SHALL support adding text notes directly from the command line via `kb add --note "text"`. Notes SHALL be stored as a single chunk (no splitting). Notes MAY have an optional `--title` for display purposes.
#### Scenario: Add an inline note
- **WHEN** user runs `kb add --note "Always restart nginx after config changes" --title "nginx reminder"`
- **THEN** a document of type `note` is created with the title "nginx reminder", and the full text becomes a single chunk
#### Scenario: Add a note without title
- **WHEN** user runs `kb add --note "some text"`
- **THEN** the system SHALL use the first 80 characters of the text (truncated at a word boundary) as the title
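Deriving the fallback title might be done as below. The helper name `default_title` is hypothetical, and collapsing runs of whitespace before truncating is an extra assumption not stated in the requirement.

```python
def default_title(text: str, limit: int = 80) -> str:
    """Use the first `limit` characters of a note, cut back to a word boundary."""
    collapsed = " ".join(text.split())
    if len(collapsed) <= limit:
        return collapsed
    # Look one character past the limit so a word ending exactly at the limit is kept.
    head, space, _ = collapsed[:limit + 1].rpartition(" ")
    return head if space else collapsed[:limit]
```

When the first 80 characters contain no space at all, the sketch falls back to a hard cut at the limit.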
### Requirement: Deduplication via content hash
The system SHALL compute a SHA-256 hash of each file's contents before ingestion. If a document with the same `content_hash` already exists in the database, the file SHALL be skipped with a message indicating it is already indexed.
#### Scenario: Add a file that is already indexed
- **WHEN** user runs `kb add report.pdf` and the file's SHA-256 matches an existing document
- **THEN** the system SHALL print "Skipped: report.pdf (already indexed)" and not create a duplicate
#### Scenario: Add a modified version of an existing file
- **WHEN** user runs `kb add report.pdf` and the file has changed since last indexed (different hash)
- **THEN** the system SHALL ingest it as a new document (the old version remains unless manually removed)
### Requirement: Batch ingestion with progress and resumability
The system SHALL support ingesting entire directories via `kb add <dir> --recursive`. Processing SHALL be resumable — files already indexed (by content hash) are skipped. Failed files SHALL be logged and skipped without aborting the batch. A summary SHALL be displayed at completion.
#### Scenario: Ingest a directory
- **WHEN** user runs `kb add ~/docs/ --recursive`
- **THEN** the system recursively finds all supported files, processes each one, skips duplicates, logs failures, and displays a summary: "Added X documents. Y failed. Z skipped (already indexed)."
#### Scenario: Resume after interruption
- **WHEN** a batch ingestion is interrupted (Ctrl+C, crash) and user reruns the same command
- **THEN** already-indexed files are skipped via content hash, and processing continues with remaining files
#### Scenario: Failed file during batch
- **WHEN** a single file fails to process (corrupt PDF, encoding error)
- **THEN** the error is logged to `~/.kb/ingest-errors.log` with the file path and error message, and processing continues with the next file
### Requirement: Parallel ingestion workers
The system SHALL support parallel document processing via configurable worker count (default: 4). Docling's `DocumentConverter` SHALL be used with multiple workers for PDF/DOCX/HTML ingestion. Database writes SHALL be serialised to avoid SQLite locking issues.
#### Scenario: Parallel PDF ingestion
- **WHEN** user runs `kb add ~/pdfs/ --recursive` with `workers: 4` in config
- **THEN** up to 4 documents are processed concurrently through Docling, with chunks written to the database sequentially
#### Scenario: Override worker count
- **WHEN** user runs `kb add ~/pdfs/ --recursive --workers 1`
- **THEN** documents are processed sequentially with a single worker
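One way to combine concurrent conversion with serialised database writes is sketched below. It assumes a thread pool; the spec does not say whether threads or processes are used. `convert` and `store` are hypothetical callables standing in for the Docling conversion and the SQLite write.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def ingest_batch(paths, convert, store, workers=4):
    """Run convert(path) in a worker pool; call store(path, chunks) only here.

    Keeping every store() call in the calling thread is what serialises the
    database writes. Failures are collected instead of aborting the batch.
    """
    added, failed = 0, []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(convert, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                chunks = future.result()
            except Exception as exc:  # e.g. corrupt PDF, encoding error
                failed.append((path, str(exc)))
                continue
            store(path, chunks)
            added += 1
    return added, failed
```

With `workers=1` the same code degrades to sequential processing, which matches the `--workers 1` scenario.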
@@ -0,0 +1,80 @@
## ADDED Requirements
### Requirement: List documents
The system SHALL list all indexed documents via `kb list`. Results SHALL include document ID, title, type, tags, chunk count, and creation date. Output SHALL support `--format json` and `--format human`.
#### Scenario: List all documents
- **WHEN** user runs `kb list`
- **THEN** all documents are listed with their ID, title, type, tags, chunk count, and creation date
#### Scenario: Filter by type
- **WHEN** user runs `kb list --type pdf`
- **THEN** only PDF documents are listed
#### Scenario: Filter by tags
- **WHEN** user runs `kb list --tags admin,ops`
- **THEN** only documents tagged with BOTH "admin" AND "ops" are listed
#### Scenario: Empty database
- **WHEN** user runs `kb list` with no documents indexed
- **THEN** the system prints "No documents indexed. Run `kb add` to get started." and exits with zero status
### Requirement: Document info
The system SHALL display detailed information about a single document via `kb info <doc_id>`, including all metadata, tags, chunk count, and chunk previews (first 100 characters of each chunk).
#### Scenario: View document info
- **WHEN** user runs `kb info 42`
- **THEN** the system displays: title, source path, type, language (if code), content hash, creation date, tags, total chunks, and a preview of each chunk
#### Scenario: Invalid document ID
- **WHEN** user runs `kb info 9999` and no document with ID 9999 exists
- **THEN** the system prints "Document not found: 9999" and exits with non-zero status
### Requirement: Remove document
The system SHALL remove a document and all its associated chunks, embeddings, and tag associations via `kb remove <doc_id>`. The system SHALL ask for confirmation before deletion unless `--yes` is passed.
#### Scenario: Remove with confirmation
- **WHEN** user runs `kb remove 42`
- **THEN** the system displays the document title and asks "Remove 'Git Admin Guide' and its 28 chunks? [y/N]". On confirmation, the document, its chunks, FTS entries, vector embeddings, and tag associations are deleted.
#### Scenario: Remove with --yes flag
- **WHEN** user runs `kb remove 42 --yes`
- **THEN** the document is removed without confirmation prompt
#### Scenario: Cascading delete
- **WHEN** a document is removed
- **THEN** all rows in `chunks`, `chunks_fts`, `chunks_vec`, and `document_tags` referencing that document SHALL be deleted
### Requirement: Tag management
The system SHALL support adding and removing tags on documents via `kb tag <doc_id> --add tag1,tag2` and `kb tag <doc_id> --remove tag1`. Tags are case-insensitive and stored lowercase. The system SHALL list all tags with document counts via `kb tags`.
#### Scenario: Add tags to a document
- **WHEN** user runs `kb tag 42 --add git,admin`
- **THEN** the tags "git" and "admin" are associated with document 42. Tags are created if they don't exist.
#### Scenario: Remove a tag from a document
- **WHEN** user runs `kb tag 42 --remove admin`
- **THEN** the "admin" tag association is removed from document 42. The tag itself remains in the tags table if other documents use it.
#### Scenario: List all tags
- **WHEN** user runs `kb tags`
- **THEN** the system lists all tags with the count of documents using each tag, sorted by count descending
#### Scenario: Tag on ingestion
- **WHEN** user runs `kb add report.pdf --tags compliance,q1`
- **THEN** the document is ingested and immediately tagged with "compliance" and "q1"
#### Scenario: Tags in JSON format
- **WHEN** user runs `kb tags --format json`
- **THEN** output is a JSON array of objects: `[{"name": "git", "count": 15}, ...]`
### Requirement: Database status
The system SHALL report database statistics via `kb status`, including: total documents (by type), total chunks, database file size, active model name and dimension, and schema version.
#### Scenario: Show status
- **WHEN** user runs `kb status`
- **THEN** the system displays: document counts by type, total chunks, DB file size, model name, embedding dimension, and schema version
#### Scenario: Status before init
- **WHEN** user runs `kb status` before `kb init`
- **THEN** the system prints "Knowledge base not initialised. Run `kb init` first." and exits with non-zero status
@@ -0,0 +1,57 @@
## ADDED Requirements
### Requirement: Model initialisation
The system SHALL download the embedding model on `kb init`. The default model SHALL be `all-MiniLM-L6-v2`. The user MAY specify a different model via `kb init --model <name>`. The model SHALL be downloaded via sentence-transformers to the HuggingFace default cache (`~/.cache/huggingface/`). On first load, the model SHALL be exported to ONNX format for inference.
#### Scenario: Default init
- **WHEN** user runs `kb init`
- **THEN** the system downloads `all-MiniLM-L6-v2`, creates `~/.kb/kb.db` with the schema, and records `model_name=all-MiniLM-L6-v2` and `embedding_dim=384` in the DB config table
#### Scenario: Init with custom model
- **WHEN** user runs `kb init --model nomic-embed-text`
- **THEN** the system downloads `nomic-embed-text`, creates the database, and records the model name and its dimension in the DB config table
#### Scenario: Init status check
- **WHEN** user runs `kb init --status`
- **THEN** the system reports: whether `~/.kb/` exists, whether the DB is initialised, which model is configured, whether the model is downloaded, and Docling model status
#### Scenario: ONNX export on first load
- **WHEN** the embedding model is loaded for the first time after download
- **THEN** the system SHALL display "Optimising model for ONNX inference (one-time)..." and export the model to ONNX format. Subsequent loads SHALL use the cached ONNX export.
### Requirement: Model-database binding
The system SHALL store the active model name and embedding dimension in the database `config` table. Every operation that uses the embedding model (add, search, reindex) SHALL verify that the loaded model matches the DB record. A mismatch SHALL be a hard error.
#### Scenario: Model mismatch on add
- **WHEN** user runs `kb add doc.pdf` but the config YAML specifies a different model than what the DB was initialised with
- **THEN** the system SHALL print an error: "Model mismatch: DB uses 'all-MiniLM-L6-v2' (384 dim) but config specifies 'nomic-embed-text'. Run `kb reindex --model nomic-embed-text` to switch models." and exit with non-zero status
#### Scenario: Model match on add
- **WHEN** user runs `kb add doc.pdf` and the config model matches the DB model
- **THEN** ingestion proceeds normally
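The binding check might look like the following sketch. It assumes the `config` table is a simple key/value store (columns `key` and `value`), which the spec implies but does not spell out; the exception class and function names are hypothetical.

```python
import sqlite3


class ModelMismatchError(RuntimeError):
    """Raised when the configured model differs from the one bound to the DB."""


def get_config(conn: sqlite3.Connection, key: str):
    row = conn.execute("SELECT value FROM config WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None


def verify_model_binding(conn: sqlite3.Connection, configured_model: str) -> None:
    """Hard-fail before any embedding work if config and DB disagree."""
    db_model = get_config(conn, "model_name")
    db_dim = get_config(conn, "embedding_dim")
    if db_model is None:
        raise ModelMismatchError("Knowledge base not initialised. Run `kb init` first.")
    if db_model != configured_model:
        raise ModelMismatchError(
            f"Model mismatch: DB uses '{db_model}' ({db_dim} dim) but config "
            f"specifies '{configured_model}'. Run `kb reindex --model "
            f"{configured_model}` to switch models."
        )
```

Calling this at the top of `add`, `search`, and `reindex` keeps vectors from two different models out of the same table.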
### Requirement: Full reindex with model switching
The system SHALL support re-embedding all chunks via `kb reindex`. If `--model` is specified, the system SHALL download the new model, re-embed all chunks, replace all vectors, and update the DB config. A progress bar SHALL be displayed. The operation SHALL be atomic — if interrupted, the old embeddings remain intact.
#### Scenario: Reindex with same model
- **WHEN** user runs `kb reindex`
- **THEN** all chunks are re-embedded with the current model and vectors are replaced. This is useful if the model's ONNX export was corrupted or chunks were modified.
#### Scenario: Reindex with new model
- **WHEN** user runs `kb reindex --model bge-small-en-v1.5`
- **THEN** the system downloads the new model, re-embeds all chunks (showing progress), replaces all vectors in `chunks_vec` (recreating the table if dimension changed), and updates `model_name` and `embedding_dim` in the DB config table
#### Scenario: Interrupted reindex
- **WHEN** a reindex is interrupted partway through
- **THEN** the old embeddings remain intact (the vector table is only replaced on successful completion of all embeddings). The user can rerun `kb reindex` to retry.
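The "replace only on success" behaviour can be sketched in two phases: compute every new vector first, then swap them in inside one transaction. This is an illustration under assumptions: a plain table named `chunk_vectors` stands in for the sqlite-vec `chunks_vec` virtual table, vectors are stored as JSON text, and `embed` is a hypothetical callable.

```python
import json
import sqlite3


def reindex(conn: sqlite3.Connection, embed) -> int:
    """Re-embed all chunks; the stored vectors change only if every embed succeeds."""
    rows = conn.execute("SELECT id, text FROM chunks ORDER BY id").fetchall()

    # Phase 1: pure computation. An interruption here leaves the DB untouched.
    new_vectors = [(chunk_id, json.dumps(embed(text))) for chunk_id, text in rows]

    # Phase 2: one transaction. `with conn` commits on success and rolls back
    # on any exception, so readers never observe a half-replaced table.
    with conn:
        conn.execute("DELETE FROM chunk_vectors")
        conn.executemany(
            "INSERT INTO chunk_vectors (chunk_id, embedding) VALUES (?, ?)",
            new_vectors,
        )
    return len(new_vectors)
```

Holding all vectors in memory is reasonable at the stated corpus size (roughly 22,000 chunks); a larger corpus would more likely stage them in a temporary table.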
### Requirement: Embedding model inference via ONNX
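The system SHALL use `sentence-transformers` with the ONNX backend for all embedding inference. Transformer inference then runs through ONNX Runtime rather than PyTorch; note that `sentence-transformers` itself still installs `torch` as a package dependency. The ONNX Runtime (`onnxruntime`) SHALL be the inference engine.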
#### Scenario: Embed a chunk
- **WHEN** a chunk of text needs to be embedded during ingestion
- **THEN** the system uses the sentence-transformers ONNX backend to produce a float vector of the correct dimension for the active model
#### Scenario: Embed a query
- **WHEN** a search query needs to be embedded
- **THEN** the system applies the configured `query_prefix` (if any) to the query text before embedding, and uses the same ONNX model used for chunk embeddings
@@ -0,0 +1,70 @@
## ADDED Requirements
### Requirement: Full-text search via FTS5
The system SHALL maintain an FTS5 virtual table synchronised with the chunks table via triggers. FTS5 SHALL use the `porter unicode61` tokenizer for stemming and unicode support. Queries SHALL be passed to FTS5 with special characters escaped.
#### Scenario: Keyword search
- **WHEN** user runs `kb search "install git"`
- **THEN** FTS5 returns chunks containing "install" and/or "git" (including stemmed variants like "installation"), ranked by BM25 score
#### Scenario: FTS-only mode
- **WHEN** user runs `kb search "install git" --fts-only`
- **THEN** only FTS5 results are returned, no vector search is performed
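A possible escaping strategy for the FTS5 requirement is to wrap each whitespace-separated term in double quotes and double any embedded quote, which is how FTS5 string literals are written. The function name is hypothetical, and the spec does not say which escaping approach the project takes.

```python
def escape_fts5_query(query: str) -> str:
    """Quote each term so FTS5 treats operators and punctuation as literal text."""
    terms = query.split()
    return " ".join('"' + term.replace('"', '""') + '"' for term in terms)
```

Space-separated quoted terms are an implicit AND in FTS5; joining them with `" OR "` instead would widen the match to chunks containing any one term.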
### Requirement: Vector similarity search via sqlite-vec
The system SHALL embed the query using the same model that was used to embed stored chunks. The embedded query SHALL be compared against all chunk embeddings in `chunks_vec` using cosine similarity. The system SHALL retrieve 3× the requested result count as candidates for RRF merging.
#### Scenario: Semantic search
- **WHEN** user runs `kb search "how to set up version control"`
- **THEN** the query is embedded and compared against stored vectors, returning semantically similar chunks even if they don't contain the exact words "version control"
#### Scenario: Vector-only mode
- **WHEN** user runs `kb search "how to set up version control" --vec-only`
- **THEN** only vector similarity results are returned, no FTS search is performed
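For intuition, the vector side can be written as a brute-force scan in plain Python. In the real tool sqlite-vec performs this inside SQLite; the function names, the `stored` mapping, and the `oversample` parameter below are illustrative assumptions, with the factor of 3 taken from the requirement.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_candidates(query_vec, stored, top=10, oversample=3):
    """Return (chunk_id, similarity) pairs, best first, for top * oversample chunks.

    `stored` maps chunk_id -> embedding vector.
    """
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[: top * oversample]
```

Only the order of this list feeds into RRF merging, since RRF works on ranks rather than raw similarity values.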
### Requirement: Reciprocal Rank Fusion merging
The system SHALL merge FTS5 and vector search results using Reciprocal Rank Fusion (RRF). The RRF formula SHALL be: `score(d) = Σ 1/(k + rank)` where `k` is configurable (default: 60). Results SHALL be sorted by descending RRF score.
#### Scenario: Hybrid search combines both signals
- **WHEN** user runs `kb search "install git"` (default hybrid mode)
- **THEN** the system runs both FTS5 and vector searches, merges results via RRF, and returns results sorted by combined score
#### Scenario: Document appears in both result sets
- **WHEN** a chunk ranks #2 in FTS5 and #5 in vector search
- **THEN** its RRF score SHALL be `1/(60+2) + 1/(60+5) = 0.0161 + 0.0154 = 0.0315`, higher than a chunk appearing in only one result set
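The RRF formula translates almost directly into code. The sketch below assumes each input is a list of chunk IDs ordered best first, with ranks starting at 1; the function name is hypothetical.

```python
def rrf_merge(fts_ids, vec_ids, k=60):
    """Merge two ranked lists with score(d) = sum of 1 / (k + rank)."""
    scores = {}
    for ranking in (fts_ids, vec_ids):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A chunk missing from one list simply contributes nothing for that list, so the same function also covers the `--fts-only` and `--vec-only` cases when the other list is empty.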
### Requirement: Tag-based filtering
The system SHALL support filtering search results by one or more tags. When multiple tags are specified, the filter SHALL use AND logic (document must have ALL specified tags). Tag filtering SHALL be applied in the SQL query via JOIN for efficiency.
#### Scenario: Filter by single tag
- **WHEN** user runs `kb search "deploy" --tags ops`
- **THEN** only chunks from documents tagged with "ops" are included in results
#### Scenario: Filter by multiple tags
- **WHEN** user runs `kb search "deploy" --tags ops,production`
- **THEN** only chunks from documents tagged with BOTH "ops" AND "production" are included
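AND semantics over tags are commonly expressed with a JOIN plus `GROUP BY ... HAVING COUNT`. The sketch below assumes `tags(id, name)` and `document_tags(document_id, tag_id)` columns; those names are guesses at the schema, which is defined elsewhere in the design.

```python
import sqlite3


def documents_with_all_tags(conn: sqlite3.Connection, tags):
    """Return IDs of documents that carry every one of the requested tags."""
    wanted = sorted({t.strip().lower() for t in tags if t.strip()})
    if not wanted:
        return []
    placeholders = ",".join("?" * len(wanted))
    sql = f"""
        SELECT dt.document_id
        FROM document_tags AS dt
        JOIN tags AS t ON t.id = dt.tag_id
        WHERE t.name IN ({placeholders})
        GROUP BY dt.document_id
        HAVING COUNT(DISTINCT t.name) = ?
        ORDER BY dt.document_id
    """
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]
```

Only the placeholder list is interpolated into the SQL; the tag values themselves are bound as parameters.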
### Requirement: Type-based filtering
The system SHALL support filtering search results by document type. Valid types: `pdf`, `markdown`, `code`, `note`.
#### Scenario: Filter by type
- **WHEN** user runs `kb search "deploy" --type code`
- **THEN** only chunks from code documents are included in results
### Requirement: Score threshold
The system SHALL support a minimum score cutoff. Results with an RRF score below the threshold SHALL be excluded from output.
#### Scenario: Apply score threshold
- **WHEN** user runs `kb search "deploy" --threshold 0.02`
- **THEN** only results with RRF score >= 0.02 are returned
### Requirement: Result count control
The system SHALL return a configurable number of results (default: 10, configurable via `--top` flag or `search.default_top` in config).
#### Scenario: Request specific number of results
- **WHEN** user runs `kb search "deploy" --top 5`
- **THEN** at most 5 results are returned
#### Scenario: Fewer matches than requested
- **WHEN** user searches and only 3 chunks match
- **THEN** the system returns 3 results without error, with `returned: 3` in the output
@@ -0,0 +1,101 @@
## ADDED Requirements
### Requirement: JSON output format for search
The system SHALL output search results as JSON when `--format json` is used (this is the default). The JSON schema SHALL include: `query`, `results` array, `total_matches`, and `returned` count. Each result SHALL include: `chunk_id`, `score`, `score_breakdown` (with `fts` and `vector` sub-scores), `text`, and `source` object.
#### Scenario: JSON search output
- **WHEN** user runs `kb search "install git" --format json`
- **THEN** the output is valid JSON matching this structure:
```json
{
"query": "install git",
"results": [
{
"chunk_id": 1423,
"score": 0.031,
"score_breakdown": {"fts": 0.016, "vector": 0.015},
"text": "To install the latest version...",
"source": {
"document_id": 42,
"title": "Git Admin Guide",
"path": "/home/user/docs/git-admin.pdf",
"type": "pdf",
"page": 12,
"chunk_index": 3,
"total_chunks": 28,
"tags": ["git", "admin"]
}
}
],
"total_matches": 47,
"returned": 10
}
```
#### Scenario: Score breakdown in FTS-only mode
- **WHEN** user runs `kb search "test" --fts-only --format json`
- **THEN** `score_breakdown` contains `{"fts": <score>, "vector": null}`
#### Scenario: Score breakdown in vector-only mode
- **WHEN** user runs `kb search "test" --vec-only --format json`
- **THEN** `score_breakdown` contains `{"fts": null, "vector": <score>}`
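The three score-breakdown cases can be produced by one builder that leaves the unused sub-score as `None`, which `json.dumps` serialises as `null`. The function name is hypothetical, and treating `score` as the rounded sum of the sub-scores is an assumption based on the sample payload.

```python
import json


def build_result(chunk_id, text, source, fts=None, vector=None):
    """Build one search result; a sub-score is None when that search did not run."""
    parts = [s for s in (fts, vector) if s is not None]
    return {
        "chunk_id": chunk_id,
        "score": round(sum(parts), 3),  # assumption: combined score is the sum
        "score_breakdown": {"fts": fts, "vector": vector},
        "text": text,
        "source": source,
    }


if __name__ == "__main__":
    print(json.dumps(build_result(1423, "To install...", {"document_id": 42}, fts=0.016)))
```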
### Requirement: Human-readable output format
The system SHALL support human-readable output via `--format human`. This format SHALL show: query, match count, and for each result: rank, score, title, page/section (if applicable), type, tags, and a text preview.
#### Scenario: Human-readable search output
- **WHEN** user runs `kb search "install git" --format human`
- **THEN** output is formatted for terminal reading:
```
Search: "install git" (47 matches, showing top 10)
1. [0.031] Git Admin Guide (p.12) [pdf] [git, admin]
To install the latest version of git from source...
2. [0.025] setup-notes.md §Installation [markdown] [git]
First, add the PPA repository for the latest git...
```
### Requirement: JSON output for list, tags, info, and status commands
The system SHALL support `--format json` for `kb list`, `kb tags`, `kb info`, and `kb status` commands. JSON output SHALL be valid and parseable by the skill wrapper.
#### Scenario: List documents as JSON
- **WHEN** user runs `kb list --format json`
- **THEN** output is a JSON array of document objects with `id`, `title`, `type`, `tags`, `chunk_count`, `created_at`
#### Scenario: Tags as JSON
- **WHEN** user runs `kb tags --format json`
- **THEN** output is a JSON array: `[{"name": "git", "count": 15}, ...]`
#### Scenario: Status as JSON
- **WHEN** user runs `kb status --format json`
- **THEN** output is a JSON object with `documents` (counts by type), `total_chunks`, `db_size_bytes`, `model_name`, `embedding_dim`, `schema_version`
### Requirement: JSON schema stability
The JSON output schema SHALL be treated as a public contract. Fields MAY be added to JSON objects in future versions. Fields SHALL NOT be removed or renamed. The skill wrapper MUST be able to rely on the presence and type of all documented fields.
#### Scenario: Forward compatibility
- **WHEN** a future version adds a `language` field to search results
- **THEN** all existing fields remain present and unchanged, the new field is additive only
### Requirement: Exit codes
The system SHALL use consistent exit codes: 0 for success, 1 for user errors (bad arguments, missing files), 2 for system errors (database corruption, model failure). JSON error output SHALL include an `error` field with a human-readable message.
#### Scenario: Successful operation
- **WHEN** any command completes successfully
- **THEN** exit code is 0
#### Scenario: User error with JSON output
- **WHEN** user runs `kb search` with no query argument
- **THEN** exit code is 1 and stderr contains a clear error message
#### Scenario: System error
- **WHEN** the SQLite database is corrupted
- **THEN** exit code is 2 and stderr contains the error details
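One way to keep exit codes consistent is to route every command through a single wrapper that maps exception classes to codes. Everything named here (`UserError`, `SystemFailure`, `run`) is hypothetical, and writing the JSON `error` object to stderr is an assumption; the spec only requires the `error` field and a message on stderr.

```python
import json
import sys


class UserError(Exception):
    """Bad arguments, missing files."""
    exit_code = 1


class SystemFailure(Exception):
    """Database corruption, model failure."""
    exit_code = 2


def run(command, *args) -> int:
    """Execute a command callable and translate failures into exit codes."""
    try:
        command(*args)
    except (UserError, SystemFailure) as exc:
        print(json.dumps({"error": str(exc)}), file=sys.stderr)
        return exc.exit_code
    return 0
```

The process entry point would then end with `sys.exit(run(...))`, so no command chooses its own status code.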
### Requirement: Skill definition file
The project SHALL include a `SKILL.md` file that defines how an LLM tool (e.g. Claude Code) should invoke and interpret `kb` commands. The skill file SHALL document: when to use the tool, available commands, output format, how to cite sources, and how to handle low-confidence results.
#### Scenario: Skill file exists
- **WHEN** the project is built
- **THEN** a `SKILL.md` file exists at the project root describing the skill interface for LLM consumption
@@ -0,0 +1,115 @@
## 1. Project Scaffolding
- [x] 1.1 Create Python virtual environment (`python3 -m venv .venv`) and add `.venv/` to `.gitignore`. All development and testing MUST run inside this venv.
- [x] 1.2 Create `pyproject.toml` with project metadata, dependencies (`click`, `sqlite-vec`, `pyyaml`, `sentence-transformers`, `onnxruntime`, `docling`), dev dependencies (`pytest`, `pytest-cov`), and `[project.scripts] kb = "kb_search.cli:main"` entry point
- [x] 1.3 Install the project in editable mode inside the venv: `.venv/bin/pip install -e ".[dev]"`
- [x] 1.4 Create `src/kb_search/` package directory with `__init__.py`
- [x] 1.5 Create `src/kb_search/cli.py` with Click group and stub subcommands (`init`, `add`, `search`, `list`, `info`, `remove`, `tags`, `tag`, `status`, `reindex`, `config`)
- [x] 1.6 Verify `.venv/bin/kb --help` shows all commands
## 2. Configuration
- [x] 2.1 Create `src/kb_search/config.py` — load YAML from `~/.kb/config.yaml` with deep-merge against built-in defaults. Handle missing file gracefully.
- [x] 2.2 Implement ENV variable overrides (`KB_DATA_DIR`, `KB_MODEL`, `KB_DEFAULT_TOP`, `KB_DEFAULT_FORMAT`) with precedence: CLI flags > ENV > YAML > defaults
- [x] 2.3 Implement `kb config` command — display fully resolved config with source indicators
- [x] 2.4 Implement `kb config set <key> <value>` — write to `~/.kb/config.yaml`, creating file if needed
- [x] 2.5 Write tests for config loading, merging, ENV overrides, and precedence
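The precedence chain from task 2.2 (CLI flags > ENV > YAML > defaults) can be sketched as successive deep merges. The key layout is an assumption: only `search.default_top` appears in the specs, while `model.name` and `search.default_format` are illustrative, and only a subset of the ENV variables is mapped.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where override wins, recursing into nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


DEFAULTS = {
    "model": {"name": "all-MiniLM-L6-v2"},
    "search": {"default_top": 10, "default_format": "json"},
}

# ENV variable -> (section, key, type converter); hypothetical mapping.
ENV_MAP = {
    "KB_MODEL": ("model", "name", str),
    "KB_DEFAULT_TOP": ("search", "default_top", int),
    "KB_DEFAULT_FORMAT": ("search", "default_format", str),
}


def resolve_config(yaml_cfg: dict, env: dict, cli: dict) -> dict:
    """Apply layers from lowest to highest precedence."""
    cfg = deep_merge(DEFAULTS, yaml_cfg)
    for var, (section, key, convert) in ENV_MAP.items():
        if var in env:
            cfg = deep_merge(cfg, {section: {key: convert(env[var])}})
    return deep_merge(cfg, cli)
```

Passing `os.environ` as `env` and the parsed YAML file (or `{}` when it is missing) as `yaml_cfg` gives the fully resolved configuration.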
## 3. Database Layer
- [x] 3.1 Create `src/kb_search/database.py` — SQLite connection management with sqlite-vec extension loading
- [x] 3.2 Implement schema creation: `documents`, `chunks`, `tags`, `document_tags`, `config` tables per design.md
- [x] 3.3 Implement FTS5 virtual table (`chunks_fts`) with `porter unicode61` tokenizer and sync triggers (INSERT, UPDATE, DELETE)
- [x] 3.4 Implement `chunks_vec` virtual table via sqlite-vec
- [x] 3.5 Implement schema versioning: store `schema_version` in `config` table, check on open, run migrations sequentially
- [x] 3.6 Implement DB config helpers: `get_config(key)`, `set_config(key, value)` for model binding
- [x] 3.7 Write tests for schema creation, migrations, FTS sync triggers, and config helpers
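Task 3.5 (store `schema_version`, check on open, run migrations in order) might look like this sketch. The migration statements are placeholders rather than the project's real schema, and the key/value layout of `config` is assumed.

```python
import sqlite3

# Hypothetical migrations keyed by the version they produce.
MIGRATIONS = {
    1: "CREATE TABLE IF NOT EXISTS documents "
       "(id INTEGER PRIMARY KEY, title TEXT, content_hash TEXT)",
    2: "CREATE INDEX IF NOT EXISTS idx_documents_hash ON documents (content_hash)",
}


def migrate(conn: sqlite3.Connection) -> int:
    """Bring the database up to the latest schema version and return it."""
    conn.execute("CREATE TABLE IF NOT EXISTS config (key TEXT PRIMARY KEY, value TEXT)")
    row = conn.execute("SELECT value FROM config WHERE key = 'schema_version'").fetchone()
    current = int(row[0]) if row else 0
    for version in sorted(v for v in MIGRATIONS if v > current):
        conn.execute(MIGRATIONS[version])
        conn.execute(
            "INSERT OR REPLACE INTO config (key, value) VALUES ('schema_version', ?)",
            (str(version),),
        )
        conn.commit()  # record progress after each step
        current = version
    return current
```

Running `migrate` on an up-to-date database is a no-op, so it is safe to call on every connection open.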
## 4. Embedding Management
- [x] 4.1 Create `src/kb_search/embeddings.py` — model download, ONNX export, and loading via `SentenceTransformer(model_name, backend="onnx")`
- [x] 4.2 Implement model-database binding: on init, write model_name + embedding_dim to DB config; on load, verify match and hard-error on mismatch
- [x] 4.3 Implement `embed_texts(texts: list[str]) -> list[list[float]]` with configurable query/passage prefix support
- [x] 4.4 Implement `kb init` command — create `~/.kb/`, init DB schema, download model, record binding. Support `--model` flag and `--status` check.
- [x] 4.5 Implement `kb reindex` command — download new model if `--model` specified, re-embed all chunks with progress bar, replace vectors atomically, update DB config
- [x] 4.6 Write tests for embedding, model binding verification, and mismatch detection
## 5. Document Ingestion — Core
- [x] 5.1 Create `src/kb_search/ingest/__init__.py` and `src/kb_search/ingest/detector.py` — file type detection by extension, routing to correct pipeline, `--type`/`--language` override support
- [x] 5.2 Implement deduplication: SHA-256 content hash, skip-if-exists check against `documents.content_hash`
- [x] 5.3 Implement `kb add <file>` command — detect type, route to pipeline, store document + chunks + embeddings + tags in a single transaction
- [x] 5.4 Implement `kb add --note "text"` — create note document with whole-text chunk, optional `--title`, auto-title from first 80 chars
- [x] 5.5 Implement `kb add <dir> --recursive` — walk directory, filter supported extensions, process each file, skip dupes, log failures to `~/.kb/ingest-errors.log`, display summary
- [x] 5.6 Implement parallel ingestion with configurable `--workers` (default: 4), serialised DB writes
- [x] 5.7 Write tests for type detection, dedup, note creation, and batch processing
## 6. Document Ingestion — Docling Pipeline
- [x] 6.1 Create `src/kb_search/ingest/docling.py` — Docling `DocumentConverter` setup with `pypdfium2` backend, layout model enabled, table reconstruction enabled
- [x] 6.2 Implement OCR configuration (`auto`/`always`/`never`) per config.yaml `ingestion.enable_ocr`
- [x] 6.3 Implement hierarchy-aware chunking via Docling's `HierarchicalChunker`, with fallback to fixed-size chunking when hierarchy detection fails
- [x] 6.4 Extract and preserve chunk metadata: page number, section headers, table markers
- [x] 6.5 Wire Docling models to download on `kb init` (using HuggingFace default cache)
- [x] 6.6 Write tests with sample PDFs (text-based, table-heavy, mixed layout)
## 7. Document Ingestion — Markdown Pipeline
- [x] 7.1 Create `src/kb_search/ingest/markdown.py` — split at `##`/`###` header boundaries
- [x] 7.2 Implement parent header chain context (e.g. "Config > Advanced Options" prefix on nested chunks)
- [x] 7.3 Implement small section merging (sections below `min_tokens` merged with next section)
- [x] 7.4 Implement large section splitting at paragraph boundaries with overlap
- [x] 7.5 Implement fallback to fixed-size chunking for plain text files without headers
- [x] 7.6 Write tests for header splitting, merging, hierarchy context, and plain text fallback
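Tasks 7.1 and 7.2 (split at `##`/`###` boundaries and carry the parent header chain) can be illustrated as follows. This is a simplified sketch: it ignores section merging and splitting by token count, and it does not skip `#` lines inside fenced code blocks. The function name and the chunk dict shape are assumptions.

```python
import re

HEADER = re.compile(r"^(#{2,3})\s+(.*\S)\s*$")


def split_markdown(text: str):
    """Split at ## / ### headers; each chunk records its parent header chain."""
    chunks = []
    chain = {}  # header level -> current title at that level
    context, body = "", []

    def flush():
        content = "\n".join(body).strip()
        if content:
            chunks.append({"context": context, "text": content})

    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            flush()
            level = len(match.group(1))
            chain[level] = match.group(2)
            # A new header invalidates any deeper headers seen before it.
            for deeper in [lvl for lvl in chain if lvl > level]:
                del chain[deeper]
            context = " > ".join(chain[lvl] for lvl in sorted(chain))
            body = []
        else:
            body.append(line)
    flush()
    return chunks
```

Text before the first header is emitted with an empty context, which also gives a single-chunk result for plain text files without headers.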
## 8. Document Ingestion — Code Pipeline
- [x] 8.1 Create `src/kb_search/ingest/code.py` — language detection from extension (`.py`, `.sh`, `.bash`, `.go`)
- [x] 8.2 Implement Python AST splitting using stdlib `ast` module — function and class boundaries, class docstring context on methods
- [x] 8.3 Implement Bash regex splitting — `function name()` and `name()` patterns with preceding comment blocks
- [x] 8.4 Implement Go regex splitting — `func` declarations with type grouping
- [x] 8.5 Implement fallback to fixed-size chunking when no function/class boundaries detected
- [x] 8.6 Write tests for each language parser and fallback behaviour
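Task 8.2 can be sketched with the standard-library `ast` module. The chunk dict shape and function name are assumptions; the sketch handles only top-level functions and classes, and `ast.get_source_segment` does not include decorators.

```python
import ast

FUNC_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef)


def split_python(source: str):
    """Split Python source into one chunk per function or method.

    Methods carry their class docstring as context. Classes without
    methods are kept as a single chunk.
    """
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, FUNC_TYPES):
            chunks.append({
                "name": node.name,
                "context": "",
                "text": ast.get_source_segment(source, node),
            })
        elif isinstance(node, ast.ClassDef):
            doc = ast.get_docstring(node) or ""
            methods = [n for n in node.body if isinstance(n, FUNC_TYPES)]
            if not methods:
                chunks.append({
                    "name": node.name,
                    "context": doc,
                    "text": ast.get_source_segment(source, node),
                })
            for method in methods:
                chunks.append({
                    "name": f"{node.name}.{method.name}",
                    "context": doc,
                    "text": ast.get_source_segment(source, method),
                })
    return chunks
```

An empty result (a script with no functions or classes) is the signal for the fixed-size fallback in task 8.5; a `SyntaxError` from `ast.parse` could reasonably be treated the same way.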
## 9. Hybrid Search
- [x] 9.1 Create `src/kb_search/search.py` — FTS5 query execution with BM25 scoring, special character escaping
- [x] 9.2 Implement vector similarity search: embed query, query `chunks_vec` for top-K (3× requested), cosine similarity
- [x] 9.3 Implement Reciprocal Rank Fusion: merge FTS and vector results with `score(d) = Σ 1/(k + rank)`, configurable `k` (default: 60)
- [x] 9.4 Implement `--fts-only` and `--vec-only` modes
- [x] 9.5 Implement tag filtering via SQL JOIN and type filtering via WHERE clause
- [x] 9.6 Implement `--threshold` score cutoff (post-RRF)
- [x] 9.7 Implement `--top` result count control (default from config)
- [x] 9.8 Wire up `kb search` command with all flags: `--top`, `--tags`, `--type`, `--format`, `--fts-only`, `--vec-only`, `--threshold`
- [x] 9.9 Write tests for FTS, vector search, RRF merging, filtering, and edge cases (empty results, fewer matches than requested)
## 10. Output Formatting
- [x] 10.1 Create `src/kb_search/output.py` — JSON formatter for search results matching the schema in skill-interface spec
- [x] 10.2 Implement human-readable formatter for search results (rank, score, title, page/section, type, tags, text preview)
- [x] 10.3 Implement JSON formatters for `list`, `tags`, `info`, and `status` commands
- [x] 10.4 Implement human-readable formatters for `list`, `tags`, `info`, and `status` commands
- [x] 10.5 Implement consistent exit codes: 0 success, 1 user error, 2 system error
- [x] 10.6 Write tests for JSON output schema validation and exit codes
## 11. Document Management Commands
- [x] 11.1 Implement `kb list` — query documents with optional `--type` and `--tags` filters, `--format` output
- [x] 11.2 Implement `kb info <doc_id>` — document details with chunk previews
- [x] 11.3 Implement `kb remove <doc_id>` — cascading delete with confirmation prompt, `--yes` flag
- [x] 11.4 Implement `kb tags` — list all tags with document counts, `--format` support
- [x] 11.5 Implement `kb tag <doc_id> --add/--remove` — tag management, case-insensitive storage
- [x] 11.6 Implement `kb status` — DB stats, model info, storage size, schema version
- [x] 11.7 Write tests for each management command
## 12. Skill Definition
- [x] 12.1 Write `SKILL.md` — when to use, available commands, output format, how to cite sources, handling low-confidence results, multi-query guidance
- [x] 12.2 Test the skill end-to-end: ingest sample documents, run searches via the skill prompt, verify Claude Code can parse and cite results
## 13. Packaging and Distribution
- [x] 13.1 Verify `pipx install kb-search` works from a clean environment
- [x] 13.2 Verify `kb init` downloads both embedding model and Docling models successfully
- [x] 13.3 Add a README with quickstart: install, init, add, search
- [x] 13.4 Add `py.typed` marker and basic type annotations on public interfaces
@@ -0,0 +1,10 @@
schema: spec-driven
context: |
Tech stack: Python 3.11+, Click (CLI), SQLite (FTS5 + sqlite-vec), Docling, sentence-transformers
Distribution: pipx (PyPI package name: kb-search, CLI entry point: kb)
Storage: Single SQLite database at ~/.kb/kb.db
  Config: YAML at ~/.kb/config.yaml with ENV overrides
Domain: CLI knowledge base / retrieval engine for personal document search
Primary consumer: Claude Code skills (JSON output), secondary: human terminal use
Local-first: no cloud dependencies, embedding models downloaded from HuggingFace on init