kb/openspec/changes/kb-search/proposal.md
2026-03-23 20:38:42 +00:00


Why

There is no simple, local-first CLI tool for building a personal knowledge base across heterogeneous document types (PDFs, Markdown, code snippets, text notes) with hybrid search that combines keyword matching and semantic understanding. Existing tools require cloud services, lack semantic search, or cannot handle the variety of document formats. This tool fills that gap: a retrieval engine that can be used standalone from the terminal or wrapped as an AI skill (e.g. Claude Code), where the LLM layer provides natural-language synthesis over the retrieved results.

What Changes

  • New Python CLI tool (kb), published on PyPI as kb-search and installed via pipx
  • Ingestion pipeline with per-format handling:
    • PDFs/DOCX/HTML/images: Docling (layout-aware, table reconstruction, optional OCR)
    • Markdown/text: Header-based semantic splitting
    • Code (Python, Bash, Go): AST/regex-based splitting at function/class boundaries
    • Notes: Inline text stored as whole-document chunks
  • Hybrid search combining SQLite FTS5 (BM25 keyword scoring) and sqlite-vec (vector similarity), merged via Reciprocal Rank Fusion
  • Local embedding models downloaded from HuggingFace on first run (kb init), with multi-model support and full reindex capability when switching models
  • Document tagging system for manual categorisation and filtered search
  • Structured JSON output designed for LLM skill consumption, plus human-readable terminal output
  • Configurable chunking parameters per document type with sensible defaults
  • All state in a single SQLite database (~/.kb/kb.db)
  • Configuration via YAML (~/.kb/config.yaml) with ENV variable overrides
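
To make the header-based splitting listed for Markdown more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the proposed implementation: the function name `split_markdown_by_headers` and the chunk fields `headings` and `text` are hypothetical, only ATX-style `#` headers are recognised, and chunk-size limits are ignored.

```python
import re

# Matches ATX headers such as "## Install". Setext headers (underlined with
# === or ---) and "#" lines inside fenced code blocks are not handled here.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_headers(text):
    """Split Markdown into chunks, one per section body.

    Each chunk records the chain of headings above it, so a retrieved chunk
    can be traced back to its place in the document. Field names are
    hypothetical.
    """
    chunks = []
    heading_path = []  # stack of (level, title) for the current section
    body = []          # lines collected since the last header

    def flush():
        content = "\n".join(body).strip()
        if content:
            chunks.append({
                "headings": [title for _, title in heading_path],
                "text": content,
            })
        body.clear()

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            # Close the previous section before changing the heading path.
            flush()
            level = len(match.group(1))
            # Drop headings at the same or deeper level than the new one.
            while heading_path and heading_path[-1][0] >= level:
                heading_path.pop()
            heading_path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


sample = "# Guide\nintro\n## Install\npipx install\n## Usage\nrun kb\n# FAQ\nq"
for chunk in split_markdown_by_headers(sample):
    print(" > ".join(chunk["headings"]), "|", chunk["text"])
```

Keeping the heading chain with each chunk is one way to preserve context for later citation; a real splitter would also need to respect the configurable chunking parameters mentioned above.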

Capabilities

New Capabilities

  • document-ingestion: Ingest PDFs, markdown, code, and text notes into chunked, embedded, searchable storage. Handles format detection, per-type chunking strategies, Docling pipeline for complex documents, and resumable batch imports.
  • hybrid-search: Hybrid retrieval combining FTS5 full-text search and sqlite-vec vector similarity via Reciprocal Rank Fusion. Supports tag/type filtering, configurable result counts, score thresholds, and JSON/human output formats.
  • embedding-management: Local embedding model lifecycle — download on init, bind model to database, detect mismatches, and full re-embedding via reindex when switching models.
  • document-management: CRUD operations on the document store — list, inspect, remove documents. Tag management (add/remove tags, filter by tags, list tags with counts).
  • configuration: YAML-based configuration with per-document-type chunking parameters, model selection, and ENV variable overrides. Sensible defaults that work without any config file.
  • skill-interface: Structured JSON output contract designed for LLM skill consumption — chunks with scores, source metadata, and provenance for citation.
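
The hybrid-search capability merges two ranked lists, one from keyword search and one from vector search, using Reciprocal Rank Fusion. A minimal sketch of that fusion step follows. The function name, the chunk IDs, and the constant `k = 60` (a commonly used default) are assumptions for illustration; the proposal does not specify them, and the actual FTS5 and sqlite-vec queries are left out.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one scored list.

    Each list contributes 1 / (k + rank) for every ID it contains, with
    ranks starting at 1. IDs that rank well in several lists accumulate
    the highest scores. Only ranks are used, so BM25 scores and vector
    distances never need to be put on a common scale.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists, best match first.
keyword_hits = ["c3", "c1", "c7"]   # e.g. from the FTS5 / BM25 query
vector_hits = ["c1", "c9", "c3"]    # e.g. from the sqlite-vec query

for chunk_id, score in reciprocal_rank_fusion([keyword_hits, vector_hits]):
    print(f"{chunk_id}  {score:.5f}")
```

In this example `c1` and `c3` appear in both lists and therefore outrank `c9` and `c7`, which each appear in only one. Tag or type filters and score thresholds would be applied around this step, either in the underlying queries or on the fused list.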

Modified Capabilities

(none — greenfield project)

Impact

  • Dependencies: Docling (~1.5 GB models), sentence-transformers with ONNX Runtime backend, sqlite-vec, Click
  • Storage: ~/.kb/ directory containing SQLite database, config file, and downloaded models (~1.6 GB on init, database grows with content)
  • First-run experience: kb init required before use to download models. Batch ingestion of 2,000 PDFs estimated at ~17 hours CPU / ~3 hours GPU (one-time cost, resumable)
  • External integration: Designed to be wrapped as a Claude Code skill — the skill definition (SKILL.md) is a deliverable alongside the code
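
The batch-ingestion estimate above can be turned into a per-document cost for sizing smaller imports. The sketch below uses only the figures stated in this section (2,000 PDFs, ~17 hours on CPU, ~3 hours on GPU) and assumes that cost scales linearly with document count, which is a simplification; the 500-PDF import is a hypothetical example.

```python
def seconds_per_document(total_hours, documents):
    """Average processing time per document, in seconds."""
    return total_hours * 3600 / documents


def estimated_hours(documents, secs_per_doc):
    """Projected wall-clock hours, assuming linear scaling."""
    return documents * secs_per_doc / 3600


cpu_secs = seconds_per_document(17, 2000)  # about 30.6 s per PDF
gpu_secs = seconds_per_document(3, 2000)   # about 5.4 s per PDF

print(f"CPU: {cpu_secs:.1f} s/PDF, GPU: {gpu_secs:.1f} s/PDF")
print(f"500 PDFs on CPU: {estimated_hours(500, cpu_secs):.2f} h")
```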