kb/openspec/changes/kb-search/tasks.md
2026-03-23 20:38:42 +00:00

## 1. Project Scaffolding
- [x] 1.1 Create Python virtual environment (`python3 -m venv .venv`) and add `.venv/` to `.gitignore`. All development and testing MUST run inside this venv.
- [x] 1.2 Create `pyproject.toml` with project metadata, dependencies (`click`, `sqlite-vec`, `pyyaml`, `sentence-transformers`, `onnxruntime`, `docling`), dev dependencies (`pytest`, `pytest-cov`), and `[project.scripts] kb = "kb_search.cli:main"` entry point
- [x] 1.3 Install the project in editable mode inside the venv: `.venv/bin/pip install -e ".[dev]"`
- [x] 1.4 Create `src/kb_search/` package directory with `__init__.py`
- [x] 1.5 Create `src/kb_search/cli.py` with Click group and stub subcommands (`init`, `add`, `search`, `list`, `info`, `remove`, `tags`, `tag`, `status`, `reindex`, `config`)
- [x] 1.6 Verify `.venv/bin/kb --help` shows all commands
## 2. Configuration
- [x] 2.1 Create `src/kb_search/config.py` — load YAML from `~/.kb/config.yaml` with deep-merge against built-in defaults. Handle missing file gracefully.
- [x] 2.2 Implement ENV variable overrides (`KB_DATA_DIR`, `KB_MODEL`, `KB_DEFAULT_TOP`, `KB_DEFAULT_FORMAT`) with precedence: CLI flags > ENV > YAML > defaults
- [x] 2.3 Implement `kb config` command — display fully resolved config with source indicators
- [x] 2.4 Implement `kb config set <key> <value>` — write to `~/.kb/config.yaml`, creating file if needed
- [x] 2.5 Write tests for config loading, merging, ENV overrides, and precedence
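
The precedence chain in 2.1 and 2.2 can be sketched as below. This is a minimal illustration, not the contents of `config.py`: the default values, the nested key layout, and the helper names (`deep_merge`, `set_path`, `resolve_config`) are assumptions; only the ENV variable names and the ordering come from the tasks above.

```python
from copy import deepcopy

# Hypothetical built-in defaults; the real keys are defined by the project.
DEFAULTS = {
    "data_dir": "~/.kb",
    "model": "example-model",
    "search": {"default_top": 5, "default_format": "human"},
}

# ENV variable -> (path into the config dict, cast). The paths are assumptions.
ENV_MAP = {
    "KB_DATA_DIR": (("data_dir",), str),
    "KB_MODEL": (("model",), str),
    "KB_DEFAULT_TOP": (("search", "default_top"), int),
    "KB_DEFAULT_FORMAT": (("search", "default_format"), str),
}


def deep_merge(base, override):
    """Return a new dict; nested dicts in override are merged, not replaced."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def set_path(config, path, value):
    """Assign value at a nested path, creating intermediate dicts as needed."""
    node = config
    for part in path[:-1]:
        node = node.setdefault(part, {})
    node[path[-1]] = value


def resolve_config(yaml_values, env, cli_flags):
    """Apply layers from lowest to highest: defaults, YAML, ENV, CLI flags."""
    config = deep_merge(DEFAULTS, yaml_values or {})  # a missing file yields {}
    for name, (path, cast) in ENV_MAP.items():
        if name in env:
            set_path(config, path, cast(env[name]))  # ENV values are strings
    return deep_merge(config, cli_flags or {})
```

Calling `resolve_config({"search": {"default_top": 10}}, os.environ, {})` keeps `default_format` from the defaults while the YAML value replaces `default_top`, unless `KB_DEFAULT_TOP` is set.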
## 3. Database Layer
- [x] 3.1 Create `src/kb_search/database.py` — SQLite connection management with sqlite-vec extension loading
- [x] 3.2 Implement schema creation: `documents`, `chunks`, `tags`, `document_tags`, `config` tables per design.md
- [x] 3.3 Implement FTS5 virtual table (`chunks_fts`) with `porter unicode61` tokenizer and sync triggers (INSERT, UPDATE, DELETE)
- [x] 3.4 Implement `chunks_vec` virtual table via sqlite-vec
- [x] 3.5 Implement schema versioning: store `schema_version` in `config` table, check on open, run migrations sequentially
- [x] 3.6 Implement DB config helpers: `get_config(key)`, `set_config(key, value)` for model binding
- [x] 3.7 Write tests for schema creation, migrations, FTS sync triggers, and config helpers
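
Tasks 3.5 and 3.6 can be sketched with the stdlib `sqlite3` module. The migration statements and the `migrate` function are hypothetical placeholders (the real schema is in design.md); the sketch only shows the pattern of storing `schema_version` in the `config` table and applying pending migrations in order.

```python
import sqlite3

# Hypothetical migrations: entry i upgrades the schema from version i to i + 1.
MIGRATIONS = [
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, content_hash TEXT UNIQUE)",
    "ALTER TABLE documents ADD COLUMN doc_type TEXT",
]


def get_config(conn, key, default=None):
    """Read one value from the key/value config table."""
    row = conn.execute("SELECT value FROM config WHERE key = ?", (key,)).fetchone()
    return row[0] if row else default


def set_config(conn, key, value):
    """Insert or overwrite one value in the config table (stored as text)."""
    conn.execute(
        "INSERT OR REPLACE INTO config (key, value) VALUES (?, ?)",
        (key, str(value)),
    )


def migrate(conn):
    """Check the stored schema version and run pending migrations sequentially."""
    conn.execute("CREATE TABLE IF NOT EXISTS config (key TEXT PRIMARY KEY, value TEXT)")
    version = int(get_config(conn, "schema_version", 0))
    if version > len(MIGRATIONS):
        raise RuntimeError(f"database schema v{version} is newer than this build")
    for target, statement in enumerate(MIGRATIONS[version:], start=version + 1):
        with conn:  # commit after each step so the recorded version stays current
            conn.execute(statement)
            set_config(conn, "schema_version", target)
    return len(MIGRATIONS)
```

Running `migrate` a second time on the same connection is a no-op, because the stored version already equals the number of migrations.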
## 4. Embedding Management
- [x] 4.1 Create `src/kb_search/embeddings.py` — model download, ONNX export, and loading via `SentenceTransformer(model_name, backend="onnx")`
- [x] 4.2 Implement model-database binding: on init, write model_name + embedding_dim to DB config; on load, verify match and hard-error on mismatch
- [x] 4.3 Implement `embed_texts(texts: list[str]) -> list[list[float]]` with configurable query/passage prefix support
- [x] 4.4 Implement `kb init` command — create `~/.kb/`, init DB schema, download model, record binding. Support `--model` flag and `--status` check.
- [x] 4.5 Implement `kb reindex` command — download new model if `--model` specified, re-embed all chunks with progress bar, replace vectors atomically, update DB config
- [x] 4.6 Write tests for embedding, model binding verification, and mismatch detection
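
The model-database binding in 4.2 might look roughly like this. `ModelMismatchError`, `bind_model`, and `verify_model` are illustrative names, and the `config` table layout is assumed to be a simple key/value table; the `model_name` and `embedding_dim` keys are the ones named in the task.

```python
import sqlite3


class ModelMismatchError(RuntimeError):
    """Raised when the loaded model differs from the one bound to the database."""


def bind_model(conn, model_name, embedding_dim):
    """Record the model binding at init time."""
    conn.execute("CREATE TABLE IF NOT EXISTS config (key TEXT PRIMARY KEY, value TEXT)")
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO config (key, value) VALUES (?, ?)",
            [("model_name", model_name), ("embedding_dim", str(embedding_dim))],
        )


def verify_model(conn, model_name, embedding_dim):
    """Hard-error if the model about to be used does not match the binding."""
    stored = dict(
        conn.execute(
            "SELECT key, value FROM config WHERE key IN ('model_name', 'embedding_dim')"
        )
    )
    expected = {"model_name": model_name, "embedding_dim": str(embedding_dim)}
    if stored != expected:
        raise ModelMismatchError(
            f"database is bound to {stored}, but {expected} was requested; "
            "run `kb reindex --model ...` to switch models"
        )
```

Failing early here matters because vectors from two different models are not comparable, even when their dimensions happen to match.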
## 5. Document Ingestion — Core
- [x] 5.1 Create `src/kb_search/ingest/__init__.py` and `src/kb_search/ingest/detector.py` — file type detection by extension, routing to correct pipeline, `--type`/`--language` override support
- [x] 5.2 Implement deduplication: SHA-256 content hash, skip-if-exists check against `documents.content_hash`
- [x] 5.3 Implement `kb add <file>` command — detect type, route to pipeline, store document + chunks + embeddings + tags in a single transaction
- [x] 5.4 Implement `kb add --note "text"` — create note document with whole-text chunk, optional `--title`, auto-title from first 80 chars
- [x] 5.5 Implement `kb add <dir> --recursive` — walk directory, filter supported extensions, process each file, skip dupes, log failures to `~/.kb/ingest-errors.log`, display summary
- [x] 5.6 Implement parallel ingestion with configurable `--workers` (default: 4), serialised DB writes
- [x] 5.7 Write tests for type detection, dedup, note creation, and batch processing
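
The skip-if-exists check in 5.2 can be sketched as follows. The two-column `documents` table and the `add_if_new` helper are simplifications invented for the example; only the SHA-256 hash and the `documents.content_hash` lookup come from the task.

```python
import hashlib
import sqlite3

# Simplified stand-in for the real documents table.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY,
    title TEXT,
    content_hash TEXT UNIQUE
)
"""


def content_hash(data: bytes) -> str:
    """Hex SHA-256 digest of the raw file content."""
    return hashlib.sha256(data).hexdigest()


def add_if_new(conn, title, data):
    """Insert a document unless identical content exists. Returns (doc_id, created)."""
    digest = content_hash(data)
    row = conn.execute(
        "SELECT id FROM documents WHERE content_hash = ?", (digest,)
    ).fetchone()
    if row:
        return row[0], False  # duplicate content: skip ingestion
    with conn:
        cursor = conn.execute(
            "INSERT INTO documents (title, content_hash) VALUES (?, ?)",
            (title, digest),
        )
    return cursor.lastrowid, True
```

Hashing content instead of the path means a renamed or copied file is still detected as a duplicate. With parallel workers (5.6), the `UNIQUE` constraint is a second line of defence if two workers hash the same content before either write lands.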
## 6. Document Ingestion — Docling Pipeline
- [x] 6.1 Create `src/kb_search/ingest/docling.py` — Docling `DocumentConverter` setup with `pypdfium2` backend, layout model enabled, table reconstruction enabled
- [x] 6.2 Implement OCR configuration (`auto`/`always`/`never`) per config.yaml `ingestion.enable_ocr`
- [x] 6.3 Implement hierarchy-aware chunking via Docling's `HierarchicalChunker`, with fallback to fixed-size chunking when hierarchy detection fails
- [x] 6.4 Extract and preserve chunk metadata: page number, section headers, table markers
- [x] 6.5 Wire Docling models to download on `kb init` (using HuggingFace default cache)
- [x] 6.6 Write tests with sample PDFs (text-based, table-heavy, mixed layout)
## 7. Document Ingestion — Markdown Pipeline
- [x] 7.1 Create `src/kb_search/ingest/markdown.py` — split at `##`/`###` header boundaries
- [x] 7.2 Implement parent header chain context (e.g. "Config > Advanced Options" prefix on nested chunks)
- [x] 7.3 Implement small section merging (sections below `min_tokens` merged with next section)
- [x] 7.4 Implement large section splitting at paragraph boundaries with overlap
- [x] 7.5 Implement fallback to fixed-size chunking for plain text files without headers
- [x] 7.6 Write tests for header splitting, merging, hierarchy context, and plain text fallback
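
A rough sketch of 7.1 and 7.2: split at `##`/`###` headers and attach the parent header chain to each chunk. The function name and the chunk dictionary shape are assumptions. Merging small sections (7.3) and splitting large ones (7.4) are left out, and headers inside fenced code blocks are not handled.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_markdown(text, split_levels=(2, 3)):
    """Split at the given header levels; each chunk carries its header chain."""
    chunks = []
    chain = {}  # header level -> title currently in effect
    body = []

    def flush():
        content = "\n".join(body).strip()
        if content:
            context = " > ".join(chain[level] for level in sorted(chain))
            chunks.append({"context": context, "text": content})
        body.clear()

    for line in text.splitlines():
        match = HEADER.match(line)
        level = len(match.group(1)) if match else None
        if level in split_levels:
            flush()
            # A new header closes every header at the same or a deeper level.
            for stale in [lvl for lvl in chain if lvl >= level]:
                del chain[stale]
            chain[level] = match.group(2)
        else:
            body.append(line)
    flush()
    return chunks
```

For a document with `## Config` followed by `### Advanced Options`, the nested chunk gets the context `Config > Advanced Options`, matching the example in 7.2. Text with no matching headers comes back as a single chunk with an empty context, which is where the fixed-size fallback in 7.5 would take over.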
## 8. Document Ingestion — Code Pipeline
- [x] 8.1 Create `src/kb_search/ingest/code.py` — language detection from extension (`.py`, `.sh`, `.bash`, `.go`)
- [x] 8.2 Implement Python AST splitting using stdlib `ast` module — function and class boundaries, class docstring context on methods
- [x] 8.3 Implement Bash regex splitting — `function name()` and `name()` patterns with preceding comment blocks
- [x] 8.4 Implement Go regex splitting — `func` declarations with type grouping
- [x] 8.5 Implement fallback to fixed-size chunking when no function/class boundaries detected
- [x] 8.6 Write tests for each language parser and fallback behaviour
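
The Python splitter in 8.2 could be sketched with the stdlib `ast` module as below. The chunk dictionary shape and the `split_python` name are assumptions. Note that `ast.get_source_segment` starts at the `def` line, so decorators would need separate handling.

```python
import ast

FUNCTION_NODES = (ast.FunctionDef, ast.AsyncFunctionDef)


def split_python(source):
    """Return one chunk per top-level function and per method.

    Methods carry their class docstring as context. An empty result means no
    function or class boundaries were found, so the caller can fall back to
    fixed-size chunking.
    """
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, FUNCTION_NODES):
            chunks.append({
                "name": node.name,
                "context": "",
                "text": ast.get_source_segment(source, node),
            })
        elif isinstance(node, ast.ClassDef):
            docstring = ast.get_docstring(node) or ""
            methods = [child for child in node.body if isinstance(child, FUNCTION_NODES)]
            if not methods:
                # A class without methods is kept whole.
                chunks.append({
                    "name": node.name,
                    "context": docstring,
                    "text": ast.get_source_segment(source, node),
                })
            for method in methods:
                chunks.append({
                    "name": f"{node.name}.{method.name}",
                    "context": docstring,
                    "text": ast.get_source_segment(source, method),
                })
    return chunks
```

`ast.parse` raises `SyntaxError` on invalid source, so a real pipeline would probably catch that and use the fixed-size fallback from 8.5 as well.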
## 9. Hybrid Search
- [x] 9.1 Create `src/kb_search/search.py` — FTS5 query execution with BM25 scoring, special character escaping
- [x] 9.2 Implement vector similarity search: embed the query, then fetch the top-K nearest chunks from `chunks_vec` by cosine similarity, where K is 3× the requested result count
- [x] 9.3 Implement Reciprocal Rank Fusion: merge FTS and vector results with `score(d) = Σ 1/(k + rank(d))`, summed over each result list that contains `d`, with configurable `k` (default: 60)
- [x] 9.4 Implement `--fts-only` and `--vec-only` modes
- [x] 9.5 Implement tag filtering via SQL JOIN and type filtering via WHERE clause
- [x] 9.6 Implement `--threshold` score cutoff (post-RRF)
- [x] 9.7 Implement `--top` result count control (default from config)
- [x] 9.8 Wire up `kb search` command with all flags: `--top`, `--tags`, `--type`, `--format`, `--fts-only`, `--vec-only`, `--threshold`
- [x] 9.9 Write tests for FTS, vector search, RRF merging, filtering, and edge cases (empty results, fewer matches than requested)
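
Reciprocal Rank Fusion (9.3), with the `--top` and `--threshold` controls from 9.6 and 9.7, can be sketched in a few lines. The function signature is an assumption, and ranks are taken to be 1-based, which the formula above does not specify.

```python
def reciprocal_rank_fusion(rankings, k=60, top=None, threshold=None):
    """Fuse ranked lists of chunk IDs into one list of (chunk_id, score).

    Each list contributes 1 / (k + rank) for every ID it contains, with ranks
    starting at 1. IDs missing from a list contribute nothing for that list.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    if threshold is not None:
        fused = [item for item in fused if item[1] >= threshold]  # post-RRF cutoff
    return fused[:top] if top is not None else fused
```

With `fts = [1, 2, 3]` and `vec = [3, 1, 4]`, chunk 1 scores `1/61 + 1/62` and chunk 3 scores `1/63 + 1/61`, so the fused order is 1, 3, 2, 4. Chunks found by both searches rise above chunks found by only one. Because RRF uses ranks alone, the BM25 and cosine scores never need to be put on a common scale.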
## 10. Output Formatting
- [x] 10.1 Create `src/kb_search/output.py` — JSON formatter for search results matching the schema in skill-interface spec
- [x] 10.2 Implement human-readable formatter for search results (rank, score, title, page/section, type, tags, text preview)
- [x] 10.3 Implement JSON formatters for `list`, `tags`, `info`, and `status` commands
- [x] 10.4 Implement human-readable formatters for `list`, `tags`, `info`, and `status` commands
- [x] 10.5 Implement consistent exit codes: 0 success, 1 user error, 2 system error
- [x] 10.6 Write tests for JSON output schema validation and exit codes
## 11. Document Management Commands
- [x] 11.1 Implement `kb list` — query documents with optional `--type` and `--tags` filters, `--format` output
- [x] 11.2 Implement `kb info <doc_id>` — document details with chunk previews
- [x] 11.3 Implement `kb remove <doc_id>` — cascading delete with confirmation prompt, `--yes` flag
- [x] 11.4 Implement `kb tags` — list all tags with document counts, `--format` support
- [x] 11.5 Implement `kb tag <doc_id> --add/--remove` — tag management, case-insensitive storage
- [x] 11.6 Implement `kb status` — DB stats, model info, storage size, schema version
- [x] 11.7 Write tests for each management command
## 12. Skill Definition
- [x] 12.1 Write `SKILL.md` — when to use, available commands, output format, how to cite sources, handling low-confidence results, multi-query guidance
- [x] 12.2 Test the skill end-to-end: ingest sample documents, run searches via the skill prompt, verify Claude Code can parse and cite results
## 13. Packaging and Distribution
- [x] 13.1 Verify `pipx install kb-search` works from a clean environment
- [x] 13.2 Verify `kb init` downloads both embedding model and Docling models successfully
- [x] 13.3 Add a README with quickstart: install, init, add, search
- [x] 13.4 Add `py.typed` marker and basic type annotations on public interfaces