kb/openspec/changes/kb-search/tasks.md

## 1. Project Scaffolding

- [x] 1.1 Create Python virtual environment (`python3 -m venv .venv`) and add `.venv/` to `.gitignore`. All development and testing MUST run inside this venv.
- [x] 1.2 Create `pyproject.toml` with project metadata, dependencies (`click`, `sqlite-vec`, `pyyaml`, `sentence-transformers`, `onnxruntime`, `docling`), dev dependencies (`pytest`, `pytest-cov`), and `[project.scripts] kb = "kb_search.cli:main"` entry point
- [x] 1.3 Install the project in editable mode inside the venv: `.venv/bin/pip install -e ".[dev]"`
- [x] 1.4 Create `src/kb_search/` package directory with `__init__.py`
- [x] 1.5 Create `src/kb_search/cli.py` with Click group and stub subcommands (`init`, `add`, `search`, `list`, `info`, `remove`, `tags`, `tag`, `status`, `reindex`, `config`)
- [x] 1.6 Verify `.venv/bin/kb --help` shows all commands

## 2. Configuration

- [x] 2.1 Create `src/kb_search/config.py` — load YAML from `~/.kb/config.yaml` with deep-merge against built-in defaults. Handle missing file gracefully.
- [x] 2.2 Implement ENV variable overrides (`KB_DATA_DIR`, `KB_MODEL`, `KB_DEFAULT_TOP`, `KB_DEFAULT_FORMAT`) with precedence: CLI flags > ENV > YAML > defaults
- [x] 2.3 Implement `kb config` command — display fully resolved config with source indicators
- [x] 2.4 Implement `kb config set <key> <value>` — write to `~/.kb/config.yaml`, creating file if needed
- [x] 2.5 Write tests for config loading, merging, ENV overrides, and precedence
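
The deep-merge and precedence rules in 2.1 and 2.2 can be sketched roughly as follows. This is a minimal illustration, not the project's actual `config.py`: the `DEFAULTS` layout, the `ENV_KEYS` mapping of each variable to a config path, and the `resolve_config` name are all assumptions made for the example.

```python
import copy

# Hypothetical built-in defaults; the real key layout is not specified in this file.
DEFAULTS = {
    "data_dir": "~/.kb",
    "model": "example-model",
    "search": {"top": 5, "format": "human"},
}

# Assumed mapping from ENV variable to (config path, type cast).
ENV_KEYS = {
    "KB_DATA_DIR": (("data_dir",), str),
    "KB_MODEL": (("model",), str),
    "KB_DEFAULT_TOP": (("search", "top"), int),
    "KB_DEFAULT_FORMAT": (("search", "format"), str),
}


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where nested dicts are merged key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_config(yaml_cfg: dict, env: dict, cli_flags: dict) -> dict:
    """Apply layers lowest to highest: defaults < YAML < ENV < CLI flags."""
    cfg = deep_merge(DEFAULTS, yaml_cfg)
    env_cfg: dict = {}
    for name, (path, cast) in ENV_KEYS.items():
        if name in env:
            node = env_cfg
            for part in path[:-1]:
                node = node.setdefault(part, {})
            node[path[-1]] = cast(env[name])
    cfg = deep_merge(cfg, env_cfg)
    return deep_merge(cfg, cli_flags)
```

Passing `os.environ` as `env` would give the production behaviour; taking it as a parameter keeps the precedence logic easy to test.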

## 3. Database Layer

- [x] 3.1 Create `src/kb_search/database.py` — SQLite connection management with sqlite-vec extension loading
- [x] 3.2 Implement schema creation: `documents`, `chunks`, `tags`, `document_tags`, `config` tables per design.md
- [x] 3.3 Implement FTS5 virtual table (`chunks_fts`) with `porter unicode61` tokenizer and sync triggers (INSERT, UPDATE, DELETE)
- [x] 3.4 Implement `chunks_vec` virtual table via sqlite-vec
- [x] 3.5 Implement schema versioning: store `schema_version` in `config` table, check on open, run migrations sequentially
- [x] 3.6 Implement DB config helpers: `get_config(key)`, `set_config(key, value)` for model binding
- [x] 3.7 Write tests for schema creation, migrations, FTS sync triggers, and config helpers
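
As a rough illustration of 3.3, the sketch below wires an FTS5 table to a content table with INSERT, UPDATE, and DELETE triggers, following the external-content pattern from the SQLite FTS5 documentation. The two-column `chunks` table, the trigger names, and the `make_db`/`search` helpers are assumptions for the example; the real schema lives in design.md. It also assumes the local SQLite build includes FTS5.

```python
import sqlite3

SCHEMA = """
CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT NOT NULL);

-- External-content FTS5 index over chunks.text with stemming.
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    text, content='chunks', content_rowid='id', tokenize='porter unicode61'
);

CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;

CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES ('delete', old.id, old.text);
END;

CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES ('delete', old.id, old.text);
    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;
"""


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with the simplified schema above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def search(conn: sqlite3.Connection, query: str) -> list[int]:
    """Return matching chunk ids, best BM25 match first."""
    rows = conn.execute(
        "SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH ? ORDER BY bm25(chunks_fts)",
        (query,),
    )
    return [row[0] for row in rows]
```

A test for the triggers can insert, update, and delete a row in `chunks` and check that `search` reflects each change without touching `chunks_fts` directly.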

## 4. Embedding Management

- [x] 4.1 Create `src/kb_search/embeddings.py` — model download, ONNX export, and loading via `SentenceTransformer(model_name, backend="onnx")`
- [x] 4.2 Implement model-database binding: on init, write model_name + embedding_dim to DB config; on load, verify match and hard-error on mismatch
- [x] 4.3 Implement `embed_texts(texts: list[str]) -> list[list[float]]` with configurable query/passage prefix support
- [x] 4.4 Implement `kb init` command — create `~/.kb/`, init DB schema, download model, record binding. Support `--model` flag and `--status` check.
- [x] 4.5 Implement `kb reindex` command — download new model if `--model` specified, re-embed all chunks with progress bar, replace vectors atomically, update DB config
- [x] 4.6 Write tests for embedding, model binding verification, and mismatch detection
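
The model-database binding from 4.2 could look something like the sketch below, built on `get_config`/`set_config` helpers in the spirit of 3.6. The key names `model_name` and `embedding_dim` come from the task text; the `ModelMismatchError` class, the function names, and the two-column `config` table are assumptions for illustration.

```python
import sqlite3
from typing import Optional


class ModelMismatchError(RuntimeError):
    """Raised when the requested model differs from the one bound to the DB."""


def set_config(conn: sqlite3.Connection, key: str, value: str) -> None:
    conn.execute("INSERT OR REPLACE INTO config(key, value) VALUES (?, ?)", (key, value))


def get_config(conn: sqlite3.Connection, key: str) -> Optional[str]:
    row = conn.execute("SELECT value FROM config WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None


def bind_model(conn: sqlite3.Connection, model_name: str, embedding_dim: int) -> None:
    """Record the model binding at init time."""
    set_config(conn, "model_name", model_name)
    set_config(conn, "embedding_dim", str(embedding_dim))


def verify_model(conn: sqlite3.Connection, model_name: str, embedding_dim: int) -> None:
    """Hard-error if the loaded model does not match the stored binding."""
    stored = (get_config(conn, "model_name"), get_config(conn, "embedding_dim"))
    if stored != (model_name, str(embedding_dim)):
        raise ModelMismatchError(
            f"database is bound to {stored}, but got ({model_name!r}, {embedding_dim})"
        )
```

Failing hard here is deliberate: vectors produced by different models are not comparable, so a mismatch should point the user at `kb reindex` instead of returning degraded results.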

## 5. Document Ingestion — Core

- [x] 5.1 Create `src/kb_search/ingest/__init__.py` and `src/kb_search/ingest/detector.py` — file type detection by extension, routing to correct pipeline, `--type`/`--language` override support
- [x] 5.2 Implement deduplication: SHA-256 content hash, skip-if-exists check against `documents.content_hash`
- [x] 5.3 Implement `kb add <file>` command — detect type, route to pipeline, store document + chunks + embeddings + tags in a single transaction
- [x] 5.4 Implement `kb add --note "text"` — create note document with whole-text chunk, optional `--title`, auto-title from first 80 chars
- [x] 5.5 Implement `kb add <dir> --recursive` — walk directory, filter supported extensions, process each file, skip dupes, log failures to `~/.kb/ingest-errors.log`, display summary
- [x] 5.6 Implement parallel ingestion with configurable `--workers` (default: 4), serialised DB writes
- [x] 5.7 Write tests for type detection, dedup, note creation, and batch processing
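
The deduplication check in 5.2 amounts to hashing the file content and looking the digest up before inserting. A minimal sketch, assuming a cut-down `documents` table with only `title` and `content_hash` columns and a hypothetical `add_if_new` helper:

```python
import hashlib
import sqlite3


def content_hash(data: bytes) -> str:
    """SHA-256 of the raw content, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def add_if_new(conn: sqlite3.Connection, title: str, data: bytes) -> bool:
    """Insert the document unless identical content is already stored.

    Returns True if a row was inserted, False if it was skipped as a duplicate.
    """
    digest = content_hash(data)
    exists = conn.execute(
        "SELECT 1 FROM documents WHERE content_hash = ?", (digest,)
    ).fetchone()
    if exists:
        return False
    conn.execute(
        "INSERT INTO documents(title, content_hash) VALUES (?, ?)", (title, digest)
    )
    return True
```

Because the hash covers content only, the same file added under a different name or path is still skipped.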

## 6. Document Ingestion — Docling Pipeline

- [x] 6.1 Create `src/kb_search/ingest/docling.py` — Docling `DocumentConverter` setup with `pypdfium2` backend, layout model enabled, table reconstruction enabled
- [x] 6.2 Implement OCR configuration (`auto`/`always`/`never`) per config.yaml `ingestion.enable_ocr`
- [x] 6.3 Implement hierarchy-aware chunking via Docling's `HierarchicalChunker`, with fallback to fixed-size chunking when hierarchy detection fails
- [x] 6.4 Extract and preserve chunk metadata: page number, section headers, table markers
- [x] 6.5 Wire Docling models to download on `kb init` (using HuggingFace default cache)
- [x] 6.6 Write tests with sample PDFs (text-based, table-heavy, mixed layout)

## 7. Document Ingestion — Markdown Pipeline

- [x] 7.1 Create `src/kb_search/ingest/markdown.py` — split at `##`/`###` header boundaries
- [x] 7.2 Implement parent header chain context (e.g. "Config > Advanced Options" prefix on nested chunks)
- [x] 7.3 Implement small section merging (sections below `min_tokens` merged with next section)
- [x] 7.4 Implement large section splitting at paragraph boundaries with overlap
- [x] 7.5 Implement fallback to fixed-size chunking for plain text files without headers
- [x] 7.6 Write tests for header splitting, merging, hierarchy context, and plain text fallback
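
Tasks 7.1 and 7.2 can be illustrated with a small splitter that cuts at `##`/`###` headers and carries the parent header chain along with each chunk. This is a simplified sketch: the `split_markdown` name and the `(context, body)` tuple shape are assumptions, it does not skip headers inside fenced code blocks, and it leaves out the merging and splitting by token count from 7.3 and 7.4.

```python
import re

# Matches only level-2 and level-3 ATX headers; "#" and "####" lines stay in the body.
HEADER_RE = re.compile(r"^(#{2,3})\s+(.*\S)\s*$")


def split_markdown(text: str) -> list[tuple[str, str]]:
    """Split text into (header_chain, body) chunks at ## and ### boundaries."""
    chunks: list[tuple[str, str]] = []
    chain: dict[int, str] = {}  # header level -> current title at that level
    context = ""
    buf: list[str] = []

    def flush() -> None:
        body = "\n".join(buf).strip()
        if body:
            chunks.append((context, body))
        buf.clear()

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match is None:
            buf.append(line)
            continue
        flush()  # close the previous section under its own context
        level = len(match.group(1))
        chain[level] = match.group(2)
        for deeper in [lvl for lvl in chain if lvl > level]:
            del chain[deeper]  # a new ## resets any open ### title
        context = " > ".join(chain[lvl] for lvl in sorted(chain))
    flush()
    return chunks
```

For a `### Advanced Options` section nested under `## Config`, the chunk's context comes out as `Config > Advanced Options`, which can then be prefixed to the chunk text before embedding.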

## 8. Document Ingestion — Code Pipeline

- [x] 8.1 Create `src/kb_search/ingest/code.py` — language detection from extension (`.py`, `.sh`, `.bash`, `.go`)
- [x] 8.2 Implement Python AST splitting using stdlib `ast` module — function and class boundaries, class docstring context on methods
- [x] 8.3 Implement Bash regex splitting — `function name()` and `name()` patterns with preceding comment blocks
- [x] 8.4 Implement Go regex splitting — `func` declarations with type grouping
- [x] 8.5 Implement fallback to fixed-size chunking when no function/class boundaries detected
- [x] 8.6 Write tests for each language parser and fallback behaviour
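
For 8.2, the stdlib `ast` module exposes enough position information to cut a file at function and class boundaries. The sketch below is one possible shape, with an assumed `split_python` name and dict layout; it handles only top-level functions and methods one level inside a class, and `ast.get_source_segment` does not include decorators.

```python
import ast


def split_python(source: str) -> list[dict]:
    """Return one chunk per top-level function and per method.

    Method chunks carry the class name and class docstring as context.
    """
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    chunks: list[dict] = []
    for node in ast.parse(source).body:
        if isinstance(node, func_types):
            chunks.append({
                "name": node.name,
                "context": "",
                "text": ast.get_source_segment(source, node),
            })
        elif isinstance(node, ast.ClassDef):
            doc = ast.get_docstring(node)
            context = f"class {node.name}" + (f": {doc}" if doc else "")
            for item in node.body:
                if isinstance(item, func_types):
                    chunks.append({
                        "name": f"{node.name}.{item.name}",
                        "context": context,
                        "text": ast.get_source_segment(source, item),
                    })
    return chunks
```

An empty result is a natural signal for the fallback in 8.5: if no function or class boundaries are found, the file can go through fixed-size chunking instead.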

## 9. Hybrid Search

- [x] 9.1 Create `src/kb_search/search.py` — FTS5 query execution with BM25 scoring, special character escaping
- [x] 9.2 Implement vector similarity search: embed the query, then query `chunks_vec` by cosine similarity for the top-K candidates (3× the requested count)
- [x] 9.3 Implement Reciprocal Rank Fusion: merge FTS and vector results with `score(d) = Σ 1/(k + rank)`, summed over each result list that contains `d`, with configurable `k` (default: 60)
- [x] 9.4 Implement `--fts-only` and `--vec-only` modes
- [x] 9.5 Implement tag filtering via SQL JOIN and type filtering via WHERE clause
- [x] 9.6 Implement `--threshold` score cutoff (post-RRF)
- [x] 9.7 Implement `--top` result count control (default from config)
- [x] 9.8 Wire up `kb search` command with all flags: `--top`, `--tags`, `--type`, `--format`, `--fts-only`, `--vec-only`, `--threshold`
- [x] 9.9 Write tests for FTS, vector search, RRF merging, filtering, and edge cases (empty results, fewer matches than requested)
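
The fusion step in 9.3 is small enough to show in full. A minimal sketch, assuming each search backend returns an ordered list of chunk ids (best first); the `rrf` function name and the tie-break on id are choices made for the example.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank of d).

    Ranks start at 1; an id missing from a list contributes nothing for that list.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by id so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

With `fts = ["a", "b", "c"]` and `vec = ["b", "d", "a"]`, `rrf([fts, vec])` ranks `b` first, since it sits near the top of both lists, followed by `a`, `d`, and `c`. A `--threshold` cutoff as in 9.6 would then filter on the fused score, and `--top` would truncate the list.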

## 10. Output Formatting

- [x] 10.1 Create `src/kb_search/output.py` — JSON formatter for search results matching the schema in skill-interface spec
- [x] 10.2 Implement human-readable formatter for search results (rank, score, title, page/section, type, tags, text preview)
- [x] 10.3 Implement JSON formatters for `list`, `tags`, `info`, and `status` commands
- [x] 10.4 Implement human-readable formatters for `list`, `tags`, `info`, and `status` commands
- [x] 10.5 Implement consistent exit codes: 0 success, 1 user error, 2 system error
- [x] 10.6 Write tests for JSON output schema validation and exit codes

## 11. Document Management Commands

- [x] 11.1 Implement `kb list` — query documents with optional `--type` and `--tags` filters, `--format` output
- [x] 11.2 Implement `kb info <doc_id>` — document details with chunk previews
- [x] 11.3 Implement `kb remove <doc_id>` — cascading delete with confirmation prompt, `--yes` flag
- [x] 11.4 Implement `kb tags` — list all tags with document counts, `--format` support
- [x] 11.5 Implement `kb tag <doc_id> --add/--remove` — tag management, case-insensitive storage
- [x] 11.6 Implement `kb status` — DB stats, model info, storage size, schema version
- [x] 11.7 Write tests for each management command

## 12. Skill Definition

- [x] 12.1 Write `SKILL.md` — when to use, available commands, output format, how to cite sources, handling low-confidence results, multi-query guidance
- [x] 12.2 Test the skill end-to-end: ingest sample documents, run searches via the skill prompt, verify Claude Code can parse and cite results

## 13. Packaging and Distribution

- [x] 13.1 Verify `pipx install kb-search` works from a clean environment
- [x] 13.2 Verify `kb init` downloads both embedding model and Docling models successfully
- [x] 13.3 Add a README with quickstart: install, init, add, search
- [x] 13.4 Add `py.typed` marker and basic type annotations on public interfaces