kb/openspec/changes/kb-search/tasks.md
2026-03-23 20:38:42 +00:00

1. Project Scaffolding

  • 1.1 Create Python virtual environment (python3 -m venv .venv) and add .venv/ to .gitignore. All development and testing MUST run inside this venv.
  • 1.2 Create pyproject.toml with project metadata, dependencies (click, sqlite-vec, pyyaml, sentence-transformers, onnxruntime, docling), dev dependencies (pytest, pytest-cov), and [project.scripts] kb = "kb_search.cli:main" entry point
  • 1.3 Install the project in editable mode inside the venv: .venv/bin/pip install -e ".[dev]"
  • 1.4 Create src/kb_search/ package directory with __init__.py
  • 1.5 Create src/kb_search/cli.py with Click group and stub subcommands (init, add, search, list, info, remove, tags, tag, status, reindex, config)
  • 1.6 Verify .venv/bin/kb --help shows all commands

2. Configuration

  • 2.1 Create src/kb_search/config.py — load YAML from ~/.kb/config.yaml with deep-merge against built-in defaults. Handle missing file gracefully.
  • 2.2 Implement ENV variable overrides (KB_DATA_DIR, KB_MODEL, KB_DEFAULT_TOP, KB_DEFAULT_FORMAT) with precedence: CLI flags > ENV > YAML > defaults
  • 2.3 Implement kb config command — display fully resolved config with source indicators
  • 2.4 Implement kb config set <key> <value> — write to ~/.kb/config.yaml, creating file if needed
  • 2.5 Write tests for config loading, merging, ENV overrides, and precedence
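
The precedence rule in 2.1 and 2.2 can be sketched as follows. This is a minimal illustration, not the project's implementation: the `DEFAULTS` keys, the nesting used in `ENV_MAP`, and the function names are all assumptions, and YAML parsing is left out so that the merge logic stands alone.

```python
import os
from copy import deepcopy

# Hypothetical built-in defaults; the real key names are defined by the project.
DEFAULTS = {
    "data_dir": "~/.kb",
    "model": "example-model",
    "search": {"default_top": 5, "default_format": "human"},
}

# ENV variable -> path into the config dict (the nesting is an assumption).
ENV_MAP = {
    "KB_DATA_DIR": ("data_dir",),
    "KB_MODEL": ("model",),
    "KB_DEFAULT_TOP": ("search", "default_top"),
    "KB_DEFAULT_FORMAT": ("search", "default_format"),
}


def deep_merge(base, override):
    """Return a new dict in which nested dicts from override are merged into base."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def set_path(config, path, value):
    """Set a nested key, creating intermediate dicts as needed."""
    node = config
    for part in path[:-1]:
        node = node.setdefault(part, {})
    node[path[-1]] = value


def resolve_config(yaml_data=None, env=None, cli_flags=None):
    """Apply precedence: CLI flags > ENV > YAML > defaults."""
    env = os.environ if env is None else env
    config = deep_merge(DEFAULTS, yaml_data or {})  # YAML over defaults
    for name, path in ENV_MAP.items():              # ENV over YAML
        if name in env:
            raw = env[name]
            # ENV values arrive as strings; only the result count needs an int.
            set_path(config, path, int(raw) if name == "KB_DEFAULT_TOP" else raw)
    return deep_merge(config, cli_flags or {})      # CLI flags win
```

A missing `~/.kb/config.yaml` maps naturally onto `yaml_data=None`, which leaves the defaults untouched.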

3. Database Layer

  • 3.1 Create src/kb_search/database.py — SQLite connection management with sqlite-vec extension loading
  • 3.2 Implement schema creation: documents, chunks, tags, document_tags, config tables per design.md
  • 3.3 Implement FTS5 virtual table (chunks_fts) with porter unicode61 tokenizer and sync triggers (INSERT, UPDATE, DELETE)
  • 3.4 Implement chunks_vec virtual table via sqlite-vec
  • 3.5 Implement schema versioning: store schema_version in config table, check on open, run migrations sequentially
  • 3.6 Implement DB config helpers: get_config(key), set_config(key, value) for model binding
  • 3.7 Write tests for schema creation, migrations, FTS sync triggers, and config helpers
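
For 3.3, one common way to keep an FTS5 index in sync is an external-content table plus three triggers, following the pattern in the SQLite FTS5 documentation. The sketch below assumes a simplified `chunks` table (the real columns come from design.md) and requires an SQLite build with FTS5 enabled; `open_db` and `fts_search` are illustrative names.

```python
import sqlite3

# Simplified, hypothetical schema: only the columns needed to show the sync triggers.
SCHEMA = """
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL,
    text TEXT NOT NULL
);
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    text,
    content='chunks',
    content_rowid='id',
    tokenize='porter unicode61'
);
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, text)
    VALUES ('delete', old.id, old.text);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, text)
    VALUES ('delete', old.id, old.text);
    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;
"""


def open_db(path=":memory:"):
    """Open a connection and create the tables, index, and triggers."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def fts_search(conn, query, limit=10):
    """Return chunk ids matching the query, best BM25 match first."""
    rows = conn.execute(
        "SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH ? "
        "ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [row[0] for row in rows]
```

With the porter tokenizer, a query for `run` also matches a chunk containing `running`, which is a convenient property to assert in the trigger tests.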

4. Embedding Management

  • 4.1 Create src/kb_search/embeddings.py — model download, ONNX export, and loading via SentenceTransformer(model_name, backend="onnx")
  • 4.2 Implement model-database binding: on init, write model_name + embedding_dim to DB config; on load, verify match and hard-error on mismatch
  • 4.3 Implement embed_texts(texts: list[str]) -> list[list[float]] with configurable query/passage prefix support
  • 4.4 Implement kb init command — create ~/.kb/, init DB schema, download model, record binding. Support --model flag and --status check.
  • 4.5 Implement kb reindex command — download new model if --model specified, re-embed all chunks with progress bar, replace vectors atomically, update DB config
  • 4.6 Write tests for embedding, model binding verification, and mismatch detection
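
The model-database binding in 4.2 can be illustrated with the config helpers from 3.6. This sketch assumes a plain key/value `config` table and a custom `ModelMismatchError`; neither the table layout nor the exception name comes from the design, and model loading itself is omitted.

```python
import sqlite3


class ModelMismatchError(RuntimeError):
    """Raised when the configured model differs from the one bound to the DB."""


def init_config_table(conn):
    # Hypothetical key/value layout for the config table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS config (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )


def get_config(conn, key):
    """Return the stored value for key, or None if it was never set."""
    row = conn.execute("SELECT value FROM config WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None


def set_config(conn, key, value):
    """Insert or overwrite a config value (stored as text)."""
    conn.execute(
        "INSERT OR REPLACE INTO config (key, value) VALUES (?, ?)", (key, str(value))
    )


def bind_model(conn, model_name, embedding_dim):
    """Record the binding at init time, in one transaction."""
    with conn:
        set_config(conn, "model_name", model_name)
        set_config(conn, "embedding_dim", embedding_dim)


def verify_binding(conn, model_name, embedding_dim):
    """Hard-error if the loaded model does not match the stored binding."""
    stored = (get_config(conn, "model_name"), get_config(conn, "embedding_dim"))
    if stored != (model_name, str(embedding_dim)):
        raise ModelMismatchError(
            f"database is bound to {stored}, got {(model_name, embedding_dim)}; "
            "run kb reindex to switch models"
        )
```

Checking the dimension as well as the name catches the case where a model is republished under the same name with a different output size.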

5. Document Ingestion — Core

  • 5.1 Create src/kb_search/ingest/__init__.py and src/kb_search/ingest/detector.py — file type detection by extension, routing to correct pipeline, --type/--language override support
  • 5.2 Implement deduplication: SHA-256 content hash, skip-if-exists check against documents.content_hash
  • 5.3 Implement kb add <file> command — detect type, route to pipeline, store document + chunks + embeddings + tags in a single transaction
  • 5.4 Implement kb add --note "text" — create note document with whole-text chunk, optional --title, auto-title from first 80 chars
  • 5.5 Implement kb add <dir> --recursive — walk directory, filter supported extensions, process each file, skip dupes, log failures to ~/.kb/ingest-errors.log, display summary
  • 5.6 Implement parallel ingestion with configurable --workers (default: 4), serialised DB writes
  • 5.7 Write tests for type detection, dedup, note creation, and batch processing
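
Tasks 5.2 to 5.4 combine into a short flow for notes: hash the content, skip if the hash is already known, otherwise store the document and its single chunk in one transaction. The sketch below assumes a reduced two-table schema and a hypothetical `add_note` helper; embeddings and tags are left out, and collapsing whitespace in the auto-title is an assumption beyond "first 80 chars".

```python
import hashlib
import sqlite3

# Reduced, hypothetical schema; the real tables are specified in design.md.
SCHEMA = """
CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    doc_type TEXT NOT NULL,
    content_hash TEXT NOT NULL UNIQUE
);
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id),
    text TEXT NOT NULL
);
"""


def content_hash(data: bytes) -> str:
    """SHA-256 hex digest used as the deduplication key."""
    return hashlib.sha256(data).hexdigest()


def auto_title(text: str, limit: int = 80) -> str:
    """First `limit` characters, with runs of whitespace collapsed."""
    return " ".join(text.split())[:limit]


def add_note(conn, text, title=None):
    """Store a note as one whole-text chunk; return its id, or None if duplicate."""
    digest = content_hash(text.encode("utf-8"))
    exists = conn.execute(
        "SELECT 1 FROM documents WHERE content_hash = ?", (digest,)
    ).fetchone()
    if exists:
        return None
    with conn:  # commits on success, rolls back if either insert fails
        cur = conn.execute(
            "INSERT INTO documents (title, doc_type, content_hash) VALUES (?, 'note', ?)",
            (title or auto_title(text), digest),
        )
        doc_id = cur.lastrowid
        conn.execute(
            "INSERT INTO chunks (document_id, text) VALUES (?, ?)", (doc_id, text)
        )
    return doc_id
```

The `UNIQUE` constraint on `content_hash` acts as a second line of defence if two workers race past the existence check.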

6. Document Ingestion — Docling Pipeline

  • 6.1 Create src/kb_search/ingest/docling.py — Docling DocumentConverter setup with pypdfium2 backend, layout model enabled, table reconstruction enabled
  • 6.2 Implement OCR configuration (auto/always/never) per config.yaml ingestion.enable_ocr
  • 6.3 Implement hierarchy-aware chunking via Docling's HierarchicalChunker, with fallback to fixed-size chunking when hierarchy detection fails
  • 6.4 Extract and preserve chunk metadata: page number, section headers, table markers
  • 6.5 Wire Docling models to download on kb init (using HuggingFace default cache)
  • 6.6 Write tests with sample PDFs (text-based, table-heavy, mixed layout)
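
The fixed-size fallback named in 6.3 (and reused in 7.5 and 8.5) could look like the sketch below. It counts whitespace-separated words as a stand-in for model tokens, and the `max_tokens` and `overlap` parameters and their defaults are assumptions; the tasks do not specify sizes or whether the fallback overlaps.

```python
def fixed_size_chunks(text, max_tokens=200, overlap=20):
    """Split text into windows of at most max_tokens words, sharing `overlap` words."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    tokens = text.split()          # crude tokenisation: whitespace-separated words
    step = max_tokens - overlap    # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break                  # the last window already reached the end
    return chunks
```

Stopping as soon as a window reaches the end avoids emitting a trailing chunk that contains only overlap.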

7. Document Ingestion — Markdown Pipeline

  • 7.1 Create src/kb_search/ingest/markdown.py — split at ##/### header boundaries
  • 7.2 Implement parent header chain context (e.g. "Config > Advanced Options" prefix on nested chunks)
  • 7.3 Implement small section merging (sections below min_tokens merged with next section)
  • 7.4 Implement large section splitting at paragraph boundaries with overlap
  • 7.5 Implement fallback to fixed-size chunking for plain text files without headers
  • 7.6 Write tests for header splitting, merging, hierarchy context, and plain text fallback
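
Header splitting with a parent header chain (7.1 and 7.2) can be sketched as below. Two choices here are assumptions rather than part of the task list: a top-level `#` header is also treated as a boundary so that the document title enters the chain, and header-like lines inside fenced code blocks are ignored. Merging small sections and splitting large ones (7.3, 7.4) are not shown.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
FENCES = ("`" * 3, "~~~")  # markers that open or close a fenced code block


def _flush(chunks, context, body):
    """Emit the accumulated body as a chunk if it has any content."""
    content = "\n".join(body).strip()
    if content:
        chunks.append({"context": context, "text": content})
    body.clear()


def split_markdown(text, max_level=3):
    """Split at headers up to max_level; each chunk carries its header chain."""
    chunks = []
    chain = {}        # header level -> title currently in effect
    context = ""      # e.g. "Config > Advanced Options"
    body = []
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCES):
            in_fence = not in_fence
        match = None if in_fence else HEADER_RE.match(line)
        if match and len(match.group(1)) <= max_level:
            _flush(chunks, context, body)
            level = len(match.group(1))
            # A new header replaces every header at the same or deeper level.
            chain = {lvl: title for lvl, title in chain.items() if lvl < level}
            chain[level] = match.group(2)
            context = " > ".join(chain[lvl] for lvl in sorted(chain))
        else:
            body.append(line)  # deeper headers stay inside the chunk text
    _flush(chunks, context, body)
    return chunks
```

A file without any headers comes back as a single chunk with an empty context, which is the point where the plain-text fallback in 7.5 would take over.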

8. Document Ingestion — Code Pipeline

  • 8.1 Create src/kb_search/ingest/code.py — language detection from extension (.py, .sh, .bash, .go)
  • 8.2 Implement Python AST splitting using stdlib ast module — function and class boundaries, class docstring context on methods
  • 8.3 Implement Bash regex splitting — function name() and name() patterns with preceding comment blocks
  • 8.4 Implement Go regex splitting — func declarations with type grouping
  • 8.5 Implement fallback to fixed-size chunking when no function/class boundaries detected
  • 8.6 Write tests for each language parser and fallback behaviour
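
For 8.2, the stdlib `ast` module is enough to find function and class boundaries. The sketch below is one possible shape, with assumed chunk field names (`name`, `context`, `text`): top-level functions become chunks, methods become chunks prefixed with the first line of their class docstring, and an empty result signals the caller to use the fixed-size fallback from 8.5. Classes without methods and nested functions are not handled, and `ast.get_source_segment` (Python 3.8+) returns the `def` without its decorators.

```python
import ast

FUNC_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef)


def split_python(source):
    """Return one chunk per top-level function and per method."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return []  # unparsable file: let the caller fall back to fixed-size chunks
    chunks = []
    for node in tree.body:
        if isinstance(node, FUNC_TYPES):
            chunks.append({
                "name": node.name,
                "context": "",
                "text": ast.get_source_segment(source, node),
            })
        elif isinstance(node, ast.ClassDef):
            doc = ast.get_docstring(node) or ""
            summary = doc.splitlines()[0] if doc else ""
            context = f"class {node.name}" + (f": {summary}" if summary else "")
            for item in node.body:
                if isinstance(item, FUNC_TYPES):
                    chunks.append({
                        "name": f"{node.name}.{item.name}",
                        "context": context,  # class docstring context on methods
                        "text": ast.get_source_segment(source, item),
                    })
    return chunks
```

Keeping the class summary in a separate `context` field, rather than pasting it into the text, leaves the decision about what gets embedded to the caller.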

9. Search

  • 9.1 Create src/kb_search/search.py — FTS5 query execution with BM25 scoring, special character escaping
  • 9.2 Implement vector similarity search: embed query, query chunks_vec for top-K (3× requested), cosine similarity
  • 9.3 Implement Reciprocal Rank Fusion: merge FTS and vector results with score(d) = Σ 1/(k + rank), configurable k (default: 60)
  • 9.4 Implement --fts-only and --vec-only modes
  • 9.5 Implement tag filtering via SQL JOIN and type filtering via WHERE clause
  • 9.6 Implement --threshold score cutoff (post-RRF)
  • 9.7 Implement --top result count control (default from config)
  • 9.8 Wire up kb search command with all flags: --top, --tags, --type, --format, --fts-only, --vec-only, --threshold
  • 9.9 Write tests for FTS, vector search, RRF merging, filtering, and edge cases (empty results, fewer matches than requested)
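
The fusion step in 9.3, together with the cutoffs in 9.6 and 9.7, fits in a few lines. This sketch assumes ranks start at 1 (the formula in 9.3 does not say) and that each input is a list of chunk ids ordered best first; `hybrid_results` is a hypothetical wrapper name.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists with score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # 1-based ranks (assumed)
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; the stable sort keeps first-seen order on ties.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def hybrid_results(fts_ids, vec_ids, top=5, threshold=0.0, k=60):
    """Apply the post-RRF score cutoff, then truncate to the requested count."""
    fused = reciprocal_rank_fusion([fts_ids, vec_ids], k=k)
    return [(doc_id, score) for doc_id, score in fused if score >= threshold][:top]
```

With k = 60, an id ranked first in only one list scores 1/61 ≈ 0.0164, while an id ranked first in both scores about 0.0328, so useful `--threshold` values sit in a narrow band well below 0.04.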

10. Output Formatting

  • 10.1 Create src/kb_search/output.py — JSON formatter for search results matching the schema in skill-interface spec
  • 10.2 Implement human-readable formatter for search results (rank, score, title, page/section, type, tags, text preview)
  • 10.3 Implement JSON formatters for list, tags, info, and status commands
  • 10.4 Implement human-readable formatters for list, tags, info, and status commands
  • 10.5 Implement consistent exit codes: 0 success, 1 user error, 2 system error
  • 10.6 Write tests for JSON output schema validation and exit codes

11. Document Management Commands

  • 11.1 Implement kb list — query documents with optional --type and --tags filters, --format output
  • 11.2 Implement kb info <doc_id> — document details with chunk previews
  • 11.3 Implement kb remove <doc_id> — cascading delete with confirmation prompt, --yes flag
  • 11.4 Implement kb tags — list all tags with document counts, --format support
  • 11.5 Implement kb tag <doc_id> --add/--remove — tag management, case-insensitive storage
  • 11.6 Implement kb status — DB stats, model info, storage size, schema version
  • 11.7 Write tests for each management command

12. Skill Definition

  • 12.1 Write SKILL.md — when to use, available commands, output format, how to cite sources, handling low-confidence results, multi-query guidance
  • 12.2 Test the skill end-to-end: ingest sample documents, run searches via the skill prompt, verify Claude Code can parse and cite results

13. Packaging and Distribution

  • 13.1 Verify pipx install kb-search works from a clean environment
  • 13.2 Verify kb init downloads both embedding model and Docling models successfully
  • 13.3 Add a README with quickstart: install, init, add, search
  • 13.4 Add py.typed marker and basic type annotations on public interfaces