1. Project Scaffolding
- 1.1 Create Python virtual environment (`python3 -m venv .venv`) and add `.venv/` to `.gitignore`. All development and testing MUST run inside this venv.
- 1.2 Create `pyproject.toml` with project metadata, dependencies (`click`, `sqlite-vec`, `pyyaml`, `sentence-transformers`, `onnxruntime`, `docling`), dev dependencies (`pytest`, `pytest-cov`), and `[project.scripts] kb = "kb_search.cli:main"` entry point
- 1.3 Install the project in editable mode inside the venv: `.venv/bin/pip install -e ".[dev]"`
- 1.4 Create `src/kb_search/` package directory with `__init__.py`
- 1.5 Create `src/kb_search/cli.py` with Click group and stub subcommands (`init`, `add`, `search`, `list`, `info`, `remove`, `tags`, `tag`, `status`, `reindex`, `config`)
- 1.6 Verify `.venv/bin/kb --help` shows all commands
2. Configuration
- 2.1 Create `src/kb_search/config.py`: load YAML from `~/.kb/config.yaml` with deep-merge against built-in defaults. Handle missing file gracefully.
- 2.2 Implement ENV variable overrides (`KB_DATA_DIR`, `KB_MODEL`, `KB_DEFAULT_TOP`, `KB_DEFAULT_FORMAT`) with precedence: CLI flags > ENV > YAML > defaults
- 2.3 Implement `kb config` command: display fully resolved config with source indicators
- 2.4 Implement `kb config set <key> <value>`: write to `~/.kb/config.yaml`, creating file if needed
- 2.5 Write tests for config loading, merging, ENV overrides, and precedence
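
The precedence chain in 2.1 and 2.2 could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the key names in `DEFAULTS`, the paths in `ENV_MAP`, and the helper names are all assumptions, and YAML loading is left out so the sketch stays stdlib-only.

```python
from typing import Any, Mapping

# Hypothetical defaults; the real key names come from the project's design.
DEFAULTS: dict[str, Any] = {
    "data_dir": "~/.kb",
    "model": "placeholder-model",
    "defaults": {"top": 5, "format": "human"},
}

# ENV variable -> path into the config tree (the paths are assumptions).
ENV_MAP = {
    "KB_DATA_DIR": ("data_dir",),
    "KB_MODEL": ("model",),
    "KB_DEFAULT_TOP": ("defaults", "top"),
    "KB_DEFAULT_FORMAT": ("defaults", "format"),
}


def deep_merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> dict[str, Any]:
    """Return a new dict where nested mappings are merged key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def _coerce(path: tuple[str, ...], raw: str) -> Any:
    """Cast an ENV string to the type of the matching built-in default."""
    default: Any = DEFAULTS
    for key in path:
        default = default[key]
    return type(default)(raw)


def _nest(path: tuple[str, ...], value: Any) -> dict[str, Any]:
    """Turn ('a', 'b'), v into {'a': {'b': v}}."""
    for key in reversed(path):
        value = {key: value}
    return value


def resolve_config(
    yaml_cfg: Mapping[str, Any],
    env: Mapping[str, str],
    cli_flags: Mapping[str, Any],
) -> dict[str, Any]:
    """Apply layers lowest to highest: defaults < YAML < ENV < CLI flags."""
    cfg = deep_merge(DEFAULTS, yaml_cfg)
    for name, path in ENV_MAP.items():
        if name in env:
            cfg = deep_merge(cfg, _nest(path, _coerce(path, env[name])))
    return deep_merge(cfg, cli_flags)
```

Passing `os.environ` as `env` and an empty dict for a missing YAML file covers the "handle missing file gracefully" case. Tracking which layer supplied each value (for the source indicators in 2.3) would need an extra structure alongside the merged dict.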
3. Database Layer
- 3.1 Create `src/kb_search/database.py`: SQLite connection management with sqlite-vec extension loading
- 3.2 Implement schema creation: `documents`, `chunks`, `tags`, `document_tags`, `config` tables per design.md
- 3.3 Implement FTS5 virtual table (`chunks_fts`) with `porter unicode61` tokenizer and sync triggers (INSERT, UPDATE, DELETE)
- 3.4 Implement `chunks_vec` virtual table via sqlite-vec
- 3.5 Implement schema versioning: store `schema_version` in `config` table, check on open, run migrations sequentially
- 3.6 Implement DB config helpers: `get_config(key)`, `set_config(key, value)` for model binding
- 3.7 Write tests for schema creation, migrations, FTS sync triggers, and config helpers
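
For 3.3, one common way to keep an FTS5 index in sync is an external-content table plus three triggers, following the pattern in the SQLite FTS5 documentation. The sketch below assumes a simplified `chunks` table with only `id` and `text` columns (the real columns are defined in design.md, which is not reproduced here) and assumes the SQLite build includes FTS5; the trigger names and `search` helper are illustrative.

```python
import sqlite3

# Simplified schema: the real `chunks` columns are an assumption here.
SCHEMA = """
CREATE TABLE chunks (
    id   INTEGER PRIMARY KEY,
    text TEXT NOT NULL
);

CREATE VIRTUAL TABLE chunks_fts USING fts5(
    text,
    content='chunks',
    content_rowid='id',
    tokenize='porter unicode61'
);

-- Mirror inserts into the index.
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;

-- External-content tables are cleaned up with the special 'delete' command.
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, text)
    VALUES ('delete', old.id, old.text);
END;

-- An update is a delete of the old row followed by an insert of the new one.
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, text)
    VALUES ('delete', old.id, old.text);
    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a connection and create the sketch schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def search(conn: sqlite3.Connection, query: str) -> list[int]:
    """Return matching chunk ids, best BM25 match first."""
    rows = conn.execute(
        "SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]
```

With the `porter` tokenizer, a query for `run` also matches a chunk containing `running`, which is a convenient property to assert in the trigger tests from 3.7.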
4. Embedding Management
- 4.1 Create `src/kb_search/embeddings.py`: model download, ONNX export, and loading via `SentenceTransformer(model_name, backend="onnx")`
- 4.2 Implement model-database binding: on init, write model_name + embedding_dim to DB config; on load, verify match and hard-error on mismatch
- 4.3 Implement `embed_texts(texts: list[str]) -> list[list[float]]` with configurable query/passage prefix support
- 4.4 Implement `kb init` command: create `~/.kb/`, init DB schema, download model, record binding. Support `--model` flag and `--status` check.
- 4.5 Implement `kb reindex` command: download new model if `--model` specified, re-embed all chunks with progress bar, replace vectors atomically, update DB config
- 4.6 Write tests for embedding, model binding verification, and mismatch detection
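
The model-database binding in 4.2, built on the config helpers from 3.6, might look something like this. It is a sketch under assumptions: the `config` table is taken to be a plain key/value table, the helpers take an explicit connection argument, and `ModelMismatchError`, `bind_model`, and `verify_model` are hypothetical names. No embedding model is loaded here.

```python
import sqlite3
from typing import Optional


class ModelMismatchError(RuntimeError):
    """Raised when the requested model differs from the one bound to the DB."""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # Assumed shape of the `config` table: one row per key.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS config (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    return conn


def get_config(conn: sqlite3.Connection, key: str) -> Optional[str]:
    row = conn.execute("SELECT value FROM config WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None


def set_config(conn: sqlite3.Connection, key: str, value: str) -> None:
    with conn:  # commit on success, roll back on error
        conn.execute(
            "INSERT OR REPLACE INTO config(key, value) VALUES (?, ?)", (key, value)
        )


def bind_model(conn: sqlite3.Connection, model_name: str, embedding_dim: int) -> None:
    """Record the binding once, at init (or after a reindex)."""
    set_config(conn, "model_name", model_name)
    set_config(conn, "embedding_dim", str(embedding_dim))


def verify_model(conn: sqlite3.Connection, model_name: str, embedding_dim: int) -> None:
    """Hard-error if the loaded model does not match the stored binding."""
    bound_name = get_config(conn, "model_name")
    bound_dim = get_config(conn, "embedding_dim")
    if bound_name is None or bound_dim is None:
        raise ModelMismatchError("database has no model binding")
    if (bound_name, int(bound_dim)) != (model_name, embedding_dim):
        raise ModelMismatchError(
            f"database is bound to {bound_name} ({bound_dim} dims), "
            f"got {model_name} ({embedding_dim} dims)"
        )
```

Checking the dimension as well as the name guards against a model that keeps its name but changes its output size, since vectors of a different length could not be compared against the stored ones.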
5. Document Ingestion — Core
- 5.1 Create `src/kb_search/ingest/__init__.py` and `src/kb_search/ingest/detector.py`: file type detection by extension, routing to correct pipeline, `--type`/`--language` override support
- 5.2 Implement deduplication: SHA-256 content hash, skip-if-exists check against `documents.content_hash`
- 5.3 Implement `kb add <file>` command: detect type, route to pipeline, store document + chunks + embeddings + tags in a single transaction
- 5.4 Implement `kb add --note "text"`: create note document with whole-text chunk, optional `--title`, auto-title from first 80 chars
- 5.5 Implement `kb add <dir> --recursive`: walk directory, filter supported extensions, process each file, skip dupes, log failures to `~/.kb/ingest-errors.log`, display summary
- 5.6 Implement parallel ingestion with configurable `--workers` (default: 4), serialised DB writes
- 5.7 Write tests for type detection, dedup, note creation, and batch processing
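
The deduplication check in 5.2 can be illustrated with a small sketch. The `documents` table here is reduced to the two columns the example needs (only `content_hash` is named in the plan; `title` and the `UNIQUE` constraint are assumptions), and `add_if_new` is a hypothetical helper rather than the real `kb add` code path.

```python
import hashlib
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # Reduced schema for illustration; the full table is defined in design.md.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS documents (
            id           INTEGER PRIMARY KEY,
            title        TEXT NOT NULL,
            content_hash TEXT NOT NULL UNIQUE
        )
        """
    )
    return conn


def content_hash(data: bytes) -> str:
    """SHA-256 of the raw file bytes, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def add_if_new(conn: sqlite3.Connection, title: str, data: bytes) -> bool:
    """Insert the document unless identical content is already stored.

    Returns True when a row was inserted, False when it was skipped.
    """
    digest = content_hash(data)
    exists = conn.execute(
        "SELECT 1 FROM documents WHERE content_hash = ?", (digest,)
    ).fetchone()
    if exists:
        return False
    with conn:  # single transaction: commit on success, roll back on error
        conn.execute(
            "INSERT INTO documents(title, content_hash) VALUES (?, ?)",
            (title, digest),
        )
    return True
```

Hashing the bytes rather than the path means a renamed or copied file is still detected as a duplicate. With parallel workers (5.6), the `UNIQUE` constraint acts as a backstop if two workers hash the same content before either write lands.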
6. Document Ingestion — Docling Pipeline
- 6.1 Create `src/kb_search/ingest/docling.py`: Docling `DocumentConverter` setup with `pypdfium2` backend, layout model enabled, table reconstruction enabled
- 6.2 Implement OCR configuration (`auto`/`always`/`never`) per config.yaml `ingestion.enable_ocr`
- 6.3 Implement hierarchy-aware chunking via Docling's `HierarchicalChunker`, with fallback to fixed-size chunking when hierarchy detection fails
- 6.4 Extract and preserve chunk metadata: page number, section headers, table markers
- 6.5 Wire Docling models to download on `kb init` (using HuggingFace default cache)
- 6.6 Write tests with sample PDFs (text-based, table-heavy, mixed layout)
7. Document Ingestion — Markdown Pipeline
- 7.1 Create `src/kb_search/ingest/markdown.py`: split at `##`/`###` header boundaries
- 7.2 Implement parent header chain context (e.g. "Config > Advanced Options" prefix on nested chunks)
- 7.3 Implement small section merging (sections below `min_tokens` merged with next section)
- 7.4 Implement large section splitting at paragraph boundaries with overlap
- 7.5 Implement fallback to fixed-size chunking for plain text files without headers
- 7.6 Write tests for header splitting, merging, hierarchy context, and plain text fallback
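
Header splitting with the parent chain from 7.1 and 7.2 could be sketched as below. The function name and the `(header_path, body)` return shape are assumptions, and the sketch deliberately leaves out merging (7.3), large-section splitting (7.4), and any handling of `##` lines inside fenced code blocks.

```python
import re

# Matches only level-2 and level-3 ATX headers ("## Title", "### Title").
HEADER_RE = re.compile(r"^(#{2,3})\s+(.*\S)\s*$")


def split_markdown(text: str) -> list[tuple[str, str]]:
    """Split Markdown into (header_path, body) pairs at ##/### boundaries.

    header_path joins the active headers with " > ", for example
    "Config > Advanced Options". Text before the first header gets an
    empty path. Sections with an empty body are dropped.
    """
    sections: list[tuple[str, list[str]]] = [("", [])]
    chain: dict[int, str] = {}  # header level -> current title at that level

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            level = len(match.group(1))
            # A new header closes every header at the same or a deeper level.
            chain = {lvl: title for lvl, title in chain.items() if lvl < level}
            chain[level] = match.group(2)
            path = " > ".join(chain[lvl] for lvl in sorted(chain))
            sections.append((path, []))
        else:
            sections[-1][1].append(line)

    chunks = []
    for path, lines in sections:
        body = "\n".join(lines).strip()
        if body:
            chunks.append((path, body))
    return chunks
```

A file with no `##`/`###` headers comes back as a single chunk with an empty path, which is the signal a caller could use to switch to the fixed-size fallback in 7.5.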
8. Document Ingestion — Code Pipeline
- 8.1 Create `src/kb_search/ingest/code.py`: language detection from extension (`.py`, `.sh`, `.bash`, `.go`)
- 8.2 Implement Python AST splitting using stdlib `ast` module: function and class boundaries, class docstring context on methods
- 8.3 Implement Bash regex splitting: `function name()` and `name()` patterns with preceding comment blocks
- 8.4 Implement Go regex splitting: `func` declarations with type grouping
- 8.5 Implement fallback to fixed-size chunking when no function/class boundaries detected
- 8.6 Write tests for each language parser and fallback behaviour
9. Hybrid Search
- 9.1 Create `src/kb_search/search.py`: FTS5 query execution with BM25 scoring, special character escaping
- 9.2 Implement vector similarity search: embed query, query `chunks_vec` for top-K (3× requested), cosine similarity
- 9.3 Implement Reciprocal Rank Fusion: merge FTS and vector results with `score(d) = Σ 1/(k + rank)`, configurable `k` (default: 60)
- 9.4 Implement `--fts-only` and `--vec-only` modes
- 9.5 Implement tag filtering via SQL JOIN and type filtering via WHERE clause
- 9.6 Implement `--threshold` score cutoff (post-RRF)
- 9.7 Implement `--top` result count control (default from config)
- 9.8 Wire up `kb search` command with all flags: `--top`, `--tags`, `--type`, `--format`, `--fts-only`, `--vec-only`, `--threshold`
- 9.9 Write tests for FTS, vector search, RRF merging, filtering, and edge cases (empty results, fewer matches than requested)
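
The fusion step in 9.3, together with the post-RRF cutoff from 9.6 and the result limit from 9.7, can be sketched in a few lines. The function name, the parameter names, and the tie-breaking rule (by id) are assumptions; ranks are taken to be 1-based.

```python
from typing import Optional, Sequence


def rrf_merge(
    rankings: Sequence[Sequence[str]],
    k: int = 60,
    threshold: float = 0.0,
    top: Optional[int] = None,
) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion over several ranked id lists.

    score(d) = sum over lists containing d of 1 / (k + rank), rank from 1.
    Results are sorted by score descending, ties broken by id, then
    filtered by `threshold` and truncated to `top`.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)

    merged = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    merged = [(doc_id, score) for doc_id, score in merged if score >= threshold]
    return merged if top is None else merged[:top]


# Usage: fuse an FTS ranking with a vector ranking (ids are made up).
fts_hits = ["a", "b", "c"]
vec_hits = ["b", "d", "a"]
fused = rrf_merge([fts_hits, vec_hits], top=3)
```

Because only ranks enter the formula, BM25 scores and cosine similarities never need to be normalised against each other. Passing a single list gives the `--fts-only` or `--vec-only` behaviour with the same scoring scale.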
10. Output Formatting
- 10.1 Create `src/kb_search/output.py`: JSON formatter for search results matching the schema in skill-interface spec
- 10.2 Implement human-readable formatter for search results (rank, score, title, page/section, type, tags, text preview)
- 10.3 Implement JSON formatters for `list`, `tags`, `info`, and `status` commands
- 10.4 Implement human-readable formatters for `list`, `tags`, `info`, and `status` commands
- 10.5 Implement consistent exit codes: 0 success, 1 user error, 2 system error
- 10.6 Write tests for JSON output schema validation and exit codes
11. Document Management Commands
- 11.1 Implement `kb list`: query documents with optional `--type` and `--tags` filters, `--format` output
- 11.2 Implement `kb info <doc_id>`: document details with chunk previews
- 11.3 Implement `kb remove <doc_id>`: cascading delete with confirmation prompt, `--yes` flag
- 11.4 Implement `kb tags`: list all tags with document counts, `--format` support
- 11.5 Implement `kb tag <doc_id> --add/--remove`: tag management, case-insensitive storage
- 11.6 Implement `kb status`: DB stats, model info, storage size, schema version
- 11.7 Write tests for each management command
12. Skill Definition
- 12.1 Write `SKILL.md`: when to use, available commands, output format, how to cite sources, handling low-confidence results, multi-query guidance
- 12.2 Test the skill end-to-end: ingest sample documents, run searches via the skill prompt, verify Claude Code can parse and cite results
13. Packaging and Distribution
- 13.1 Verify `pipx install kb-search` works from a clean environment
- 13.2 Verify `kb init` downloads both embedding model and Docling models successfully
- 13.3 Add a README with quickstart: install, init, add, search
- 13.4 Add `py.typed` marker and basic type annotations on public interfaces