kb/openspec/changes/kb-search/tasks.md at f245c24928c17d3fdc759e9512587185b0f890c9

steve f245c24928 Initial MVP

2026-03-23 20:38:42 +00:00

1. Project Scaffolding

1.1 Create Python virtual environment (python3 -m venv .venv) and add .venv/ to .gitignore. All development and testing MUST run inside this venv.
1.2 Create pyproject.toml with project metadata, dependencies (click, sqlite-vec, pyyaml, sentence-transformers, onnxruntime, docling), dev dependencies (pytest, pytest-cov), and [project.scripts] kb = "kb_search.cli:main" entry point
1.3 Install the project in editable mode inside the venv: .venv/bin/pip install -e ".[dev]"
1.4 Create src/kb_search/ package directory with __init__.py
1.5 Create src/kb_search/cli.py with Click group and stub subcommands (init, add, search, list, info, remove, tags, tag, status, reindex, config)
1.6 Verify .venv/bin/kb --help shows all commands

2. Configuration

2.1 Create src/kb_search/config.py — load YAML from ~/.kb/config.yaml with deep-merge against built-in defaults. Handle missing file gracefully.
2.2 Implement ENV variable overrides (KB_DATA_DIR, KB_MODEL, KB_DEFAULT_TOP, KB_DEFAULT_FORMAT) with precedence: CLI flags > ENV > YAML > defaults
2.3 Implement kb config command — display fully resolved config with source indicators
2.4 Implement kb config set <key> <value> — write to ~/.kb/config.yaml, creating file if needed
2.5 Write tests for config loading, merging, ENV overrides, and precedence
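The merge and precedence rules in 2.1 and 2.2 could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `DEFAULTS` keys, the `ENV_MAP` mapping from variables to keys, and the function names are assumptions, and the YAML file is represented by an already-parsed dict.

```python
from copy import deepcopy

# Hypothetical built-in defaults; the real key names are not given in this task list.
DEFAULTS = {
    "data_dir": "~/.kb",
    "model": "example-model",
    "default_top": 5,
    "default_format": "human",
    "ingestion": {"enable_ocr": "auto", "workers": 4},
}

# Assumed mapping from ENV variables to config keys, with a type cast for each.
ENV_MAP = {
    "KB_DATA_DIR": ("data_dir", str),
    "KB_MODEL": ("model", str),
    "KB_DEFAULT_TOP": ("default_top", int),
    "KB_DEFAULT_FORMAT": ("default_format", str),
}


def deep_merge(base, override):
    """Return a copy of base with override applied; nested dicts merge key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def resolve_config(yaml_values, env, cli_flags):
    """Apply precedence: CLI flags > ENV > YAML > defaults."""
    config = deep_merge(DEFAULTS, yaml_values or {})  # a missing file yields {} or None
    for var, (key, cast) in ENV_MAP.items():
        if var in env:
            config[key] = cast(env[var])
    for key, value in cli_flags.items():
        if value is not None:  # None means the flag was not passed
            config[key] = value
    return config
```

Passing `os.environ` as `env` gives the real behaviour; taking it as a parameter keeps the function easy to test.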

3. Database Layer

3.1 Create src/kb_search/database.py — SQLite connection management with sqlite-vec extension loading
3.2 Implement schema creation: documents, chunks, tags, document_tags, config tables per design.md
3.3 Implement FTS5 virtual table (chunks_fts) with porter unicode61 tokenizer and sync triggers (INSERT, UPDATE, DELETE)
3.4 Implement chunks_vec virtual table via sqlite-vec
3.5 Implement schema versioning: store schema_version in config table, check on open, run migrations sequentially
3.6 Implement DB config helpers: get_config(key), set_config(key, value) for model binding
3.7 Write tests for schema creation, migrations, FTS sync triggers, and config helpers
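Tasks 3.5 and 3.6 might look something like the sketch below, which uses only the stdlib `sqlite3` module and leaves out sqlite-vec loading. The `MIGRATIONS` list and its table definitions are placeholders (the real schema is in design.md), and `migrate` is a hypothetical name.

```python
import sqlite3

# Placeholder migrations; index 0 upgrades an empty DB to schema version 1, and so on.
MIGRATIONS = [
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, content_hash TEXT UNIQUE)",
    "CREATE TABLE chunks (id INTEGER PRIMARY KEY, document_id INTEGER, text TEXT)",
]


def get_config(conn, key, default=None):
    """Read one value from the config table, or return default if the key is absent."""
    row = conn.execute("SELECT value FROM config WHERE key = ?", (key,)).fetchone()
    return row[0] if row else default


def set_config(conn, key, value):
    """Insert or overwrite one key in the config table (values are stored as text)."""
    conn.execute(
        "INSERT OR REPLACE INTO config (key, value) VALUES (?, ?)", (key, str(value))
    )


def migrate(conn):
    """Check schema_version on open and run any pending migrations in order."""
    conn.execute("CREATE TABLE IF NOT EXISTS config (key TEXT PRIMARY KEY, value TEXT)")
    current = int(get_config(conn, "schema_version", 0))
    if current > len(MIGRATIONS):
        raise RuntimeError(f"database schema {current} is newer than this build supports")
    for version in range(current + 1, len(MIGRATIONS) + 1):
        with conn:  # commit after each step so a failure leaves a known version
            conn.execute(MIGRATIONS[version - 1])
            set_config(conn, "schema_version", version)
    return int(get_config(conn, "schema_version", 0))
```

Calling `migrate` on an already current database is a no-op, so it can run on every open.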

4. Embedding Management

4.1 Create src/kb_search/embeddings.py — model download, ONNX export, and loading via SentenceTransformer(model_name, backend="onnx")
4.2 Implement model-database binding: on init, write model_name + embedding_dim to DB config; on load, verify match and hard-error on mismatch
4.3 Implement embed_texts(texts: list[str]) -> list[list[float]] with configurable query/passage prefix support
4.4 Implement kb init command — create ~/.kb/, init DB schema, download model, record binding. Support --model flag and --status check.
4.5 Implement kb reindex command — download new model if --model specified, re-embed all chunks with progress bar, replace vectors atomically, update DB config
4.6 Write tests for embedding, model binding verification, and mismatch detection
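The binding check in 4.2 could be sketched as below. The `model_name` and `embedding_dim` keys come from the task text; the function names, the `ModelMismatchError` class, and the shape of the config table are assumptions for illustration.

```python
import sqlite3


class ModelMismatchError(RuntimeError):
    """Raised when the requested model differs from the one bound to the database."""


def record_binding(conn, model_name, embedding_dim):
    """On init: store the model name and embedding dimension in the DB config table."""
    with conn:
        conn.execute("CREATE TABLE IF NOT EXISTS config (key TEXT PRIMARY KEY, value TEXT)")
        conn.executemany(
            "INSERT OR REPLACE INTO config (key, value) VALUES (?, ?)",
            [("model_name", model_name), ("embedding_dim", str(embedding_dim))],
        )


def verify_binding(conn, model_name, embedding_dim):
    """On load: hard-error if the loaded model does not match the stored binding."""
    rows = dict(
        conn.execute(
            "SELECT key, value FROM config WHERE key IN ('model_name', 'embedding_dim')"
        )
    )
    bound = (rows.get("model_name"), rows.get("embedding_dim"))
    if bound != (model_name, str(embedding_dim)):
        raise ModelMismatchError(
            f"database is bound to {bound[0]} (dim {bound[1]}), "
            f"but {model_name} (dim {embedding_dim}) was requested; run kb reindex"
        )
```

Failing hard here is what keeps vectors from two different models out of the same chunks_vec table.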

5. Document Ingestion — Core

5.1 Create src/kb_search/ingest/__init__.py and src/kb_search/ingest/detector.py — file type detection by extension, routing to correct pipeline, --type/--language override support
5.2 Implement deduplication: SHA-256 content hash, skip-if-exists check against documents.content_hash
5.3 Implement kb add <file> command — detect type, route to pipeline, store document + chunks + embeddings + tags in a single transaction
5.4 Implement kb add --note "text" — create note document with whole-text chunk, optional --title, auto-title from first 80 chars
5.5 Implement kb add <dir> --recursive — walk directory, filter supported extensions, process each file, skip dupes, log failures to ~/.kb/ingest-errors.log, display summary
5.6 Implement parallel ingestion with configurable --workers (default: 4), serialised DB writes
5.7 Write tests for type detection, dedup, note creation, and batch processing
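For 5.2 and 5.4, the hash-based dedup and the note auto-title could be sketched as follows. The in-memory `known_hashes` set stands in for a lookup against documents.content_hash, and collapsing whitespace before truncating the title is an assumption; the task only says "first 80 chars".

```python
import hashlib


def content_hash(data: bytes) -> str:
    """SHA-256 hex digest of the raw file or note content."""
    return hashlib.sha256(data).hexdigest()


def should_ingest(data: bytes, known_hashes: set) -> bool:
    """Return False for content already seen; otherwise remember the hash and return True."""
    digest = content_hash(data)
    if digest in known_hashes:
        return False
    known_hashes.add(digest)
    return True


def auto_title(note_text: str, limit: int = 80) -> str:
    """Derive a note title from the first `limit` characters (whitespace collapsed)."""
    return " ".join(note_text.split())[:limit]
```

Hashing the content rather than the path means a renamed or copied file is still detected as a duplicate.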

6. Document Ingestion — Docling Pipeline

6.1 Create src/kb_search/ingest/docling.py — Docling DocumentConverter setup with pypdfium2 backend, layout model enabled, table reconstruction enabled
6.2 Implement OCR configuration (auto/always/never) per config.yaml ingestion.enable_ocr
6.3 Implement hierarchy-aware chunking via Docling's HierarchicalChunker, with fallback to fixed-size chunking when hierarchy detection fails
6.4 Extract and preserve chunk metadata: page number, section headers, table markers
6.5 Wire Docling models to download on kb init (using HuggingFace default cache)
6.6 Write tests with sample PDFs (text-based, table-heavy, mixed layout)

7. Document Ingestion — Markdown Pipeline

7.1 Create src/kb_search/ingest/markdown.py — split at ##/### header boundaries
7.2 Implement parent header chain context (e.g. "Config > Advanced Options" prefix on nested chunks)
7.3 Implement small section merging (sections below min_tokens merged with next section)
7.4 Implement large section splitting at paragraph boundaries with overlap
7.5 Implement fallback to fixed-size chunking for plain text files without headers
7.6 Write tests for header splitting, merging, hierarchy context, and plain text fallback
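Tasks 7.1 and 7.2 might be approached along these lines. This is a simplified sketch: the chunk dict layout and the `split_markdown` name are assumptions, merging and splitting by token count (7.3, 7.4) are omitted, and `##` lines inside fenced code blocks are not handled specially.

```python
import re

# Matches only level-2 and level-3 headers, per the ##/### boundary rule.
HEADER_RE = re.compile(r"^(#{2,3})\s+(.*\S)\s*$")


def split_markdown(text):
    """Split at ##/### headers, tagging each chunk with its parent header chain."""
    chunks = []
    chain = {}      # header level -> title currently in effect
    context = ""    # e.g. "Config > Advanced Options"
    body = []

    for line in text.splitlines() + [None]:  # None is a sentinel that flushes the last chunk
        match = HEADER_RE.match(line) if line is not None else None
        if line is None or match:
            section = "\n".join(body).strip()
            if section:
                chunks.append({"context": context, "text": section})
            body = []
            if match:
                level = len(match.group(1))
                # Drop headers at the same or deeper level, then record the new one.
                chain = {lvl: title for lvl, title in chain.items() if lvl < level}
                chain[level] = match.group(2)
                context = " > ".join(chain[lvl] for lvl in sorted(chain))
        else:
            body.append(line)
    return chunks
```

A file with no `##` or `###` headers comes back as a single chunk with an empty context, which is where the fixed-size fallback in 7.5 would take over.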

8. Document Ingestion — Code Pipeline

8.1 Create src/kb_search/ingest/code.py — language detection from extension (.py, .sh, .bash, .go)
8.2 Implement Python AST splitting using stdlib ast module — function and class boundaries, class docstring context on methods
8.3 Implement Bash regex splitting — function name() and name() patterns with preceding comment blocks
8.4 Implement Go regex splitting — func declarations with type grouping
8.5 Implement fallback to fixed-size chunking when no function/class boundaries detected
8.6 Write tests for each language parser and fallback behaviour
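The Python splitter in 8.2 could be sketched with the stdlib `ast` module as shown here. The chunk dict layout and the `split_python` name are assumptions, module-level code outside functions and classes is ignored, and decorator lines are not included in the extracted text.

```python
import ast

FUNC_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef)


def split_python(source):
    """Split source at function and class boundaries; methods carry the class docstring."""
    tree = ast.parse(source)
    lines = source.splitlines()

    def segment(node):
        # lineno/end_lineno are 1-based and inclusive (end_lineno needs Python 3.8+).
        return "\n".join(lines[node.lineno - 1 : node.end_lineno])

    chunks = []
    for node in tree.body:
        if isinstance(node, FUNC_TYPES):
            chunks.append({"name": node.name, "context": "", "text": segment(node)})
        elif isinstance(node, ast.ClassDef):
            doc = ast.get_docstring(node) or ""
            methods = [child for child in node.body if isinstance(child, FUNC_TYPES)]
            if not methods:
                # A class with no methods is kept as one chunk.
                chunks.append({"name": node.name, "context": doc, "text": segment(node)})
            for method in methods:
                chunks.append(
                    {"name": f"{node.name}.{method.name}", "context": doc, "text": segment(method)}
                )
    return chunks  # an empty list signals the caller to use fixed-size chunking
```

`ast.parse` raises `SyntaxError` on invalid input, so a caller would probably want to catch that and fall back to fixed-size chunking as well.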

9. Hybrid Search

9.1 Create src/kb_search/search.py — FTS5 query execution with BM25 scoring, special character escaping
9.2 Implement vector similarity search: embed the query, then query chunks_vec for the top-K candidates (K = 3× the requested result count) ranked by cosine similarity
9.3 Implement Reciprocal Rank Fusion: merge FTS and vector results with score(d) = Σ 1/(k + rank), summed over each result list in which d appears, with configurable k (default: 60)
9.4 Implement --fts-only and --vec-only modes
9.5 Implement tag filtering via SQL JOIN and type filtering via WHERE clause
9.6 Implement --threshold score cutoff (post-RRF)
9.7 Implement --top result count control (default from config)
9.8 Wire up kb search command with all flags: --top, --tags, --type, --format, --fts-only, --vec-only, --threshold
9.9 Write tests for FTS, vector search, RRF merging, filtering, and edge cases (empty results, fewer matches than requested)
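The fusion step in 9.3, together with the post-RRF cutoff (9.6) and result count (9.7), can be sketched as below. Ranks are assumed to start at 1, and the function name, the tie-breaking rule, and the argument layout are illustrative choices rather than anything fixed by the task list.

```python
def reciprocal_rank_fusion(rankings, k=60, top=None, threshold=None):
    """Merge ranked lists of chunk IDs using score(d) = sum of 1 / (k + rank).

    rankings: iterable of lists, each ordered best-first (e.g. FTS hits, vector hits).
    Returns (chunk_id, score) pairs sorted by descending score.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):  # assumption: rank is 1-based
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)

    # Sort by score; ties are broken by ID so the output is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], str(item[0])))
    if threshold is not None:
        fused = [(cid, s) for cid, s in fused if s >= threshold]  # cutoff applied post-RRF
    return fused[:top] if top is not None else fused


# Usage: a chunk found by both searches outranks chunks found by only one.
fts_hits = ["a", "b", "c"]
vec_hits = ["b", "d"]
results = reciprocal_rank_fusion([fts_hits, vec_hits], k=60, top=3)
```

With k = 60 a chunk's score is at most 1/61 per list, so any `--threshold` value has to be chosen on that scale rather than on a 0 to 1 similarity scale.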

10. Output Formatting

10.1 Create src/kb_search/output.py — JSON formatter for search results matching the schema in skill-interface spec
10.2 Implement human-readable formatter for search results (rank, score, title, page/section, type, tags, text preview)
10.3 Implement JSON formatters for list, tags, info, and status commands
10.4 Implement human-readable formatters for list, tags, info, and status commands
10.5 Implement consistent exit codes: 0 success, 1 user error, 2 system error
10.6 Write tests for JSON output schema validation and exit codes

11. Document Management Commands

11.1 Implement kb list — query documents with optional --type and --tags filters, --format output
11.2 Implement kb info <doc_id> — document details with chunk previews
11.3 Implement kb remove <doc_id> — cascading delete with confirmation prompt, --yes flag
11.4 Implement kb tags — list all tags with document counts, --format support
11.5 Implement kb tag <doc_id> --add/--remove — tag management, case-insensitive storage
11.6 Implement kb status — DB stats, model info, storage size, schema version
11.7 Write tests for each management command

12. Skill Definition

12.1 Write SKILL.md — when to use, available commands, output format, how to cite sources, handling low-confidence results, multi-query guidance
12.2 Test the skill end-to-end: ingest sample documents, run searches via the skill prompt, verify Claude Code can parse and cite results

13. Packaging and Distribution

13.1 Verify pipx install kb-search works from a clean environment
13.2 Verify kb init downloads both embedding model and Docling models successfully
13.3 Add a README with quickstart: install, init, add, search
13.4 Add py.typed marker and basic type annotations on public interfaces
