Initial MVP

2026-03-23 20:38:42 +00:00
commit f245c24928
57 changed files with 6812 additions and 0 deletions
@@ -0,0 +1,72 @@
+## ADDED Requirements
+
+### Requirement: YAML configuration file
+The system SHALL read configuration from `~/.kb/config.yaml`. If the file does not exist, the system SHALL use built-in defaults. The configuration file SHALL be optional — the tool MUST work with zero configuration.
+
+#### Scenario: No config file
+- **WHEN** `~/.kb/config.yaml` does not exist
+- **THEN** the system uses built-in defaults for all settings and operates normally
+
+#### Scenario: Partial config file
+- **WHEN** `~/.kb/config.yaml` exists but only specifies `chunking.pdf.max_tokens: 2048`
+- **THEN** the system uses built-in defaults for all other settings, overriding only `chunking.pdf.max_tokens`
+
+#### Scenario: Invalid config file
+- **WHEN** `~/.kb/config.yaml` contains invalid YAML
+- **THEN** the system prints a clear error message identifying the YAML syntax issue and exits with non-zero status
+
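+The partial-config scenario above implies a recursive merge of the user file over the built-in defaults. A minimal sketch, assuming the YAML has already been parsed into nested dicts; `DEFAULTS` and `deep_merge` are hypothetical names, and the default values are copied from scenarios elsewhere in this spec:
+
```python
from copy import deepcopy

# Hypothetical built-in defaults, taken from the default scenarios in this spec.
DEFAULTS = {
    "chunking": {
        "pdf": {"strategy": "hierarchy", "max_tokens": 1024},
        "markdown": {"strategy": "header", "min_tokens": 50, "max_tokens": 1024},
    },
    "search": {"default_top": 10},
}


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict in which nested keys from `override` replace those in `base`."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge one level deeper.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# A partial user config, as it might look after parsing ~/.kb/config.yaml.
user_config = {"chunking": {"pdf": {"max_tokens": 2048}}}
effective = deep_merge(DEFAULTS, user_config)
```
+
+Only `chunking.pdf.max_tokens` changes; every sibling key keeps its default, and `DEFAULTS` itself is left unmodified.
+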
+### Requirement: Environment variable overrides
+The system SHALL support environment variable overrides with the prefix `KB_`. ENV variables SHALL take precedence over the YAML config file. Supported variables: `KB_DATA_DIR`, `KB_MODEL`, `KB_DEFAULT_TOP`, `KB_DEFAULT_FORMAT`.
+
+#### Scenario: Override data directory
+- **WHEN** `KB_DATA_DIR=/tmp/test-kb` is set
+- **THEN** the system uses `/tmp/test-kb/` instead of `~/.kb/` for the database and config
+
+#### Scenario: Override model
+- **WHEN** `KB_MODEL=nomic-embed-text` is set
+- **THEN** the system uses `nomic-embed-text` as the embedding model, overriding the YAML config
+
+#### Scenario: ENV overrides YAML
+- **WHEN** YAML config has `search.default_top: 10` and `KB_DEFAULT_TOP=20` is set
+- **THEN** the default top value is 20
+
+### Requirement: Configuration precedence
+The system SHALL apply configuration in this order (highest to lowest precedence): CLI flags, environment variables, YAML config file, built-in defaults.
+
+#### Scenario: CLI flag overrides everything
+- **WHEN** YAML config has `search.default_top: 10`, ENV has `KB_DEFAULT_TOP=20`, and user runs `kb search "test" --top 5`
+- **THEN** 5 results are returned
+
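+The precedence order can be sketched as a lookup through layers, highest first, that also reports where a value came from (which `kb config` needs for its source column). This is an illustration under assumptions: `ENV_KEYS`, `env_layer`, and `resolve` are hypothetical names, keys are flattened to dotted strings, and only the `KB_DEFAULT_TOP` mapping is shown.
+
```python
# Hypothetical mapping from KB_* variables to (dotted config key, type cast).
ENV_KEYS = {"KB_DEFAULT_TOP": ("search.default_top", int)}


def env_layer(environ):
    """Translate known KB_* variables into typed, dotted config keys."""
    layer = {}
    for name, (key, cast) in ENV_KEYS.items():
        if name in environ:
            layer[key] = cast(environ[name])
    return layer


def resolve(key, cli, env, yaml_cfg, defaults):
    """Return (value, source) from the highest-precedence layer that defines `key`."""
    layers = (("cli", cli), ("env", env), ("yaml", yaml_cfg), ("default", defaults))
    for source, layer in layers:
        if key in layer:
            return layer[key], source
    raise KeyError(key)


defaults = {"search.default_top": 10}
yaml_cfg = {"search.default_top": 10}
env = env_layer({"KB_DEFAULT_TOP": "20"})

# No CLI flag given: the environment variable wins over YAML.
value, source = resolve("search.default_top", {}, env, yaml_cfg, defaults)
```
+
+Passing `{"search.default_top": 5}` as the `cli` layer would return `(5, "cli")`, matching the scenario above.
+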
+### Requirement: View and set configuration
+The system SHALL support viewing the current effective configuration via `kb config` and setting individual values via `kb config set <key> <value>`.
+
+#### Scenario: View configuration
+- **WHEN** user runs `kb config`
+- **THEN** the system displays the fully resolved configuration (defaults merged with YAML merged with ENV), indicating the source of each value
+
+#### Scenario: Set a config value
+- **WHEN** user runs `kb config set chunking.pdf.max_tokens 2048`
+- **THEN** the value is written to `~/.kb/config.yaml`, creating the file if necessary
+
+### Requirement: Configurable chunking parameters
+The system SHALL support per-document-type chunking configuration with sensible defaults.
+
+#### Scenario: Default chunking for PDF
+- **WHEN** no chunking config is specified for PDF
+- **THEN** the system uses `strategy: hierarchy, max_tokens: 1024`
+
+#### Scenario: Default chunking for markdown
+- **WHEN** no chunking config is specified for markdown
+- **THEN** the system uses `strategy: header, min_tokens: 50, max_tokens: 1024`
+
+#### Scenario: Default chunking for code
+- **WHEN** no chunking config is specified for code
+- **THEN** the system uses `strategy: ast, include_context: true, max_tokens: 1024`
+
+#### Scenario: Default chunking for notes
+- **WHEN** no chunking config is specified for notes
+- **THEN** the system uses `strategy: whole`
+
+#### Scenario: Custom chunking overrides
+- **WHEN** YAML config specifies `chunking.pdf.strategy: fixed` and `chunking.pdf.max_tokens: 512`
+- **THEN** PDFs are chunked with fixed-size windows of 512 tokens instead of hierarchy-aware chunking
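+
+The `fixed` strategy is not defined further in this spec. One plausible reading is a sliding window over a token list, sketched below. `fixed_size_chunks` is a hypothetical name, and splitting on whitespace stands in for whatever tokenizer the tool actually uses.
+
```python
def fixed_size_chunks(tokens, max_tokens, overlap_tokens=0):
    """Split `tokens` into windows of at most `max_tokens` items.

    Consecutive windows share `overlap_tokens` items.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be in [0, max_tokens)")
    step = max_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


# Whitespace tokens are an assumption made only for this example.
words = "the quick brown fox jumps over the lazy dog again".split()
windows = fixed_size_chunks(words, max_tokens=4, overlap_tokens=1)
```
+
+With ten words, a window of 4, and an overlap of 1, this yields three windows, each starting on the last word of the previous one.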
@@ -0,0 +1,125 @@
+## ADDED Requirements
+
+### Requirement: File type detection and routing
+The system SHALL detect the type of a file being ingested and route it to the appropriate ingestion pipeline. Detection SHALL be based on file extension. Supported types: PDF (`.pdf`), DOCX (`.docx`), HTML (`.html`, `.htm`), Markdown (`.md`, `.markdown`, `.txt`), Code (`.py`, `.sh`, `.bash`, `.go`), and image files (`.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`, `.webp`). The user MAY override detection with `--type` and `--language` flags.
+
+#### Scenario: Auto-detect PDF file
+- **WHEN** user runs `kb add report.pdf`
+- **THEN** the file is routed to the Docling ingestion pipeline
+
+#### Scenario: Auto-detect Python code
+- **WHEN** user runs `kb add script.py`
+- **THEN** the file is routed to the code ingestion pipeline with language set to `python`
+
+#### Scenario: Override type detection
+- **WHEN** user runs `kb add data.txt --type code --language bash`
+- **THEN** the file is routed to the code pipeline as Bash, regardless of the `.txt` extension
+
+#### Scenario: Unsupported file type
+- **WHEN** user runs `kb add archive.zip`
+- **THEN** the system SHALL print an error message listing supported formats and exit with non-zero status
+
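+Extension-based routing with optional overrides might look like the sketch below. The extension lists come from the requirement above; the type labels (`"docx"`, `"html"`, `"image"`), the `detect` function, and the `UnsupportedFileType` exception are assumptions made for illustration.
+
```python
from pathlib import Path

# Extension -> document type. Type labels are illustrative, not confirmed by the spec.
EXTENSION_TYPES = {
    ".pdf": "pdf", ".docx": "docx", ".html": "html", ".htm": "html",
    ".md": "markdown", ".markdown": "markdown", ".txt": "markdown",
    ".py": "code", ".sh": "code", ".bash": "code", ".go": "code",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".tiff": "image", ".bmp": "image", ".webp": "image",
}
CODE_LANGUAGES = {".py": "python", ".sh": "bash", ".bash": "bash", ".go": "go"}


class UnsupportedFileType(ValueError):
    """Raised when neither the extension nor an override identifies the file."""


def detect(path, type_override=None, language_override=None):
    """Return (doc_type, language); explicit overrides win over the extension."""
    suffix = Path(path).suffix.lower()
    doc_type = type_override or EXTENSION_TYPES.get(suffix)
    if doc_type is None:
        supported = ", ".join(sorted(EXTENSION_TYPES))
        raise UnsupportedFileType(
            f"Unsupported file type '{suffix}'. Supported: {supported}"
        )
    language = None
    if doc_type == "code":
        language = language_override or CODE_LANGUAGES.get(suffix)
    return doc_type, language
```
+
+For example, `detect("data.txt", "code", "bash")` returns `("code", "bash")`, while `detect("archive.zip")` raises with the list of supported extensions.
+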
+### Requirement: Docling pipeline for complex documents
+The system SHALL use Docling to ingest PDF, DOCX, HTML, and image files. The pipeline SHALL use the `pypdfium2` backend for PDFs, enable layout model for structural detection, and enable table reconstruction. OCR SHALL be configurable: `auto` (detect pages with no extractable text and OCR those), `always`, or `never`.
+
+#### Scenario: Ingest a text-based PDF
+- **WHEN** user runs `kb add manual.pdf`
+- **THEN** the system extracts text using Docling with layout detection, produces hierarchy-aware chunks preserving section structure, embeds each chunk, and stores the document with all chunks in the database
+
+#### Scenario: Ingest a PDF with tables
+- **WHEN** user ingests a PDF containing data tables
+- **THEN** Docling's table reconstruction SHALL produce chunks where table content is represented as structured text (markdown table format) rather than garbled column fragments
+
+#### Scenario: Ingest a scanned PDF with OCR auto mode
+- **WHEN** user ingests a PDF where some pages contain only images (no extractable text) and OCR is set to `auto`
+- **THEN** the system SHALL detect the image-only pages and apply OCR to those pages only, leaving text-extractable pages processed normally
+
+#### Scenario: Ingest an image file
+- **WHEN** user runs `kb add diagram.png`
+- **THEN** the system SHALL process it through Docling with OCR enabled, extracting any text content from the image
+
+### Requirement: Markdown ingestion with header-based splitting
+The system SHALL split markdown and text files at header boundaries (`##`, `###`). Each chunk SHALL include its parent header chain as context. Sections smaller than `min_tokens` SHALL be merged with the following section. Sections larger than `max_tokens` SHALL be split at paragraph boundaries with configurable overlap.
+
+#### Scenario: Split markdown at headers
+- **WHEN** user runs `kb add guide.md` and the file contains multiple `##` sections
+- **THEN** each section becomes a separate chunk, with the header text included in the chunk
+
+#### Scenario: Preserve header hierarchy
+- **WHEN** a markdown file has nested headers like `## Config` > `### Advanced Options`
+- **THEN** the chunk for "Advanced Options" SHALL include context indicating it falls under "Config > Advanced Options"
+
+#### Scenario: Merge small sections
+- **WHEN** a markdown section contains fewer tokens than `min_tokens` (default: 50)
+- **THEN** it SHALL be merged with the next section into a single chunk
+
+#### Scenario: Plain text file without headers
+- **WHEN** user runs `kb add notes.txt` and the file has no markdown headers
+- **THEN** the system SHALL fall back to fixed-size chunking with configurable `max_tokens` and `overlap_tokens`
+
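+Header-based splitting with a preserved header chain can be sketched with a stack of open headers. This is a simplified illustration: `split_by_headers` is a hypothetical name, it splits at every ATX header level rather than only `##` and `###`, it does not skip `#` lines inside fenced code blocks, and it leaves out the `min_tokens` merge and `max_tokens` split steps.
+
```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headers(markdown):
    """Return [{"headers": [...], "text": str}], one entry per header section."""
    sections = []
    stack = []      # (level, title) pairs forming the current header chain
    current = None  # section currently being filled
    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2)
            # Close headers at the same or a deeper level before opening this one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, title))
            current = {"headers": [t for _, t in stack], "lines": [line]}
            sections.append(current)
        else:
            if current is None:
                # Text that appears before the first header.
                current = {"headers": [], "lines": []}
                sections.append(current)
            current["lines"].append(line)
    result = []
    for section in sections:
        text = "\n".join(section["lines"]).strip()
        if text:
            result.append({"headers": section["headers"], "text": text})
    return result


doc = "## Config\nBase settings.\n\n### Advanced Options\nTuning knobs.\n\n## Usage\nRun it.\n"
sections = split_by_headers(doc)
```
+
+Joining `sections[1]["headers"]` with `" > "` gives the "Config > Advanced Options" context described in the scenario above.
+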
+### Requirement: Code ingestion with AST/regex splitting
+The system SHALL split code files at function and class boundaries. Python files SHALL use the `ast` module. Bash and Go files SHALL use regex-based splitting. When `include_context` is enabled (default), class methods SHALL include the class docstring/signature for context. Files with no recognisable function/class boundaries SHALL fall back to fixed-size chunking.
+
+#### Scenario: Python file with functions and classes
+- **WHEN** user runs `kb add auth.py` and the file contains a class with methods
+- **THEN** each method becomes a chunk, and each chunk includes the class name and docstring as context
+
+#### Scenario: Bash script with functions
+- **WHEN** user runs `kb add deploy.sh` and the file contains `function deploy() {` blocks
+- **THEN** each function becomes a separate chunk, including any preceding comment block
+
+#### Scenario: Go file with functions
+- **WHEN** user runs `kb add main.go` and the file contains `func` declarations
+- **THEN** each function becomes a separate chunk
+
+#### Scenario: Code file with no functions
+- **WHEN** user runs `kb add script.sh` and the file has no function declarations
+- **THEN** the system SHALL fall back to fixed-size chunking with `max_tokens` and `overlap_tokens`
+
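+For Python files, the standard-library `ast` module exposes enough to split at function and method boundaries and to attach class context. A minimal sketch, assuming Python 3.8+ for `ast.get_source_segment`; `split_python` and the chunk dict layout are hypothetical, and decorators, nested functions, and the `max_tokens` limit are ignored:
+
```python
import ast


def split_python(source):
    """Return one chunk per top-level function and per class method."""
    chunks = []
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.parse(source).body:
        if isinstance(node, func_types):
            chunks.append({
                "name": node.name,
                "context": None,
                "code": ast.get_source_segment(source, node),
            })
        elif isinstance(node, ast.ClassDef):
            # Class name plus docstring (if any) becomes the shared context.
            doc = ast.get_docstring(node)
            context = f"class {node.name}" + (f": {doc}" if doc else "")
            for item in node.body:
                if isinstance(item, func_types):
                    chunks.append({
                        "name": f"{node.name}.{item.name}",
                        "context": context,
                        "code": ast.get_source_segment(source, item),
                    })
    return chunks


SAMPLE = '''
class Auth:
    """Handles login."""

    def login(self, user):
        return True

    def logout(self):
        pass


def helper():
    return 1
'''

chunks = split_python(SAMPLE)
```
+
+A file in which this returns an empty list has no recognisable boundaries and would take the fixed-size fallback path.
+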
+### Requirement: Inline note ingestion
+The system SHALL support adding text notes directly from the command line via `kb add --note "text"`. Notes SHALL be stored as a single chunk (no splitting). Notes MAY have an optional `--title` for display purposes.
+
+#### Scenario: Add an inline note
+- **WHEN** user runs `kb add --note "Always restart nginx after config changes" --title "nginx reminder"`
+- **THEN** a document of type `note` is created with the title "nginx reminder", and the full text becomes a single chunk
+
+#### Scenario: Add a note without title
+- **WHEN** user runs `kb add --note "some text"`
+- **THEN** the system SHALL use the first 80 characters of the text (truncated at a word boundary) as the title
+
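+Deriving a title from the note text could work as follows. `default_title` is a hypothetical helper; collapsing runs of whitespace first and hard-cutting a single over-long word are assumptions the spec does not state.
+
```python
def default_title(text, limit=80):
    """Return at most `limit` characters of `text`, cut back to a whole word."""
    collapsed = " ".join(text.split())
    if len(collapsed) <= limit:
        return collapsed
    # Look one character past the limit so a word ending exactly at the
    # limit is kept whole.
    head = collapsed[:limit + 1]
    last_space = head.rfind(" ")
    if last_space > 0:
        return head[:last_space]
    # A single word longer than the limit: fall back to a hard cut.
    return collapsed[:limit]


short_title = default_title("some text")
long_title = default_title("word " * 30)
```
+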
+### Requirement: Deduplication via content hash
+The system SHALL compute a SHA-256 hash of each file's contents before ingestion. If a document with the same `content_hash` already exists in the database, the file SHALL be skipped with a message indicating it is already indexed.
+
+#### Scenario: Add a file that is already indexed
+- **WHEN** user runs `kb add report.pdf` and the file's SHA-256 matches an existing document
+- **THEN** the system SHALL print "Skipped: report.pdf (already indexed)" and not create a duplicate
+
+#### Scenario: Add a modified version of an existing file
+- **WHEN** user runs `kb add report.pdf` and the file has changed since last indexed (different hash)
+- **THEN** the system SHALL ingest it as a new document (the old version remains unless manually removed)
+
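+Hashing in fixed-size blocks keeps memory use flat for large PDFs. A minimal sketch using the standard library; `content_hash` and the 1 MiB block size are illustrative choices, not taken from the source:
+
```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(path, block_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Same path, different contents: the hash changes, so the modified file
# would be ingested as a new document.
with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "report.pdf"
    report.write_bytes(b"version 1")
    first = content_hash(report)
    report.write_bytes(b"version 2")
    second = content_hash(report)
```
+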
+### Requirement: Batch ingestion with progress and resumability
+The system SHALL support ingesting entire directories via `kb add <dir> --recursive`. Processing SHALL be resumable — files already indexed (by content hash) are skipped. Failed files SHALL be logged and skipped without aborting the batch. A summary SHALL be displayed at completion.
+
+#### Scenario: Ingest a directory
+- **WHEN** user runs `kb add ~/docs/ --recursive`
+- **THEN** the system recursively finds all supported files, processes each one, skips duplicates, logs failures, and displays a summary: "Added X documents. Y failed. Z skipped (already indexed)."
+
+#### Scenario: Resume after interruption
+- **WHEN** a batch ingestion is interrupted (Ctrl+C, crash) and user reruns the same command
+- **THEN** already-indexed files are skipped via content hash, and processing continues with remaining files
+
+#### Scenario: Failed file during batch
+- **WHEN** a single file fails to process (corrupt PDF, encoding error)
+- **THEN** the error is logged to `~/.kb/ingest-errors.log` with the file path and error message, and processing continues with the next file
+
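+The skip, fail, and continue behaviour of a batch run can be sketched as a single loop. `ingest_batch`, its arguments, and the in-memory `already_indexed` set are hypothetical stand-ins for the real pipeline and database lookup; writing failures to `~/.kb/ingest-errors.log` is left out.
+
```python
def ingest_batch(files, process, already_indexed):
    """Process (path, content_hash) pairs; return (summary, failures)."""
    added, skipped, failed = 0, 0, []
    for path, digest in files:
        if digest in already_indexed:
            skipped += 1  # resumability: indexed content is never reprocessed
            continue
        try:
            process(path)
        except Exception as exc:
            # One bad file must not abort the batch.
            failed.append((path, str(exc)))
            continue
        already_indexed.add(digest)
        added += 1
    summary = (
        f"Added {added} documents. {len(failed)} failed. "
        f"{skipped} skipped (already indexed)."
    )
    return summary, failed


def fake_process(path):
    """Stand-in pipeline that fails on one file."""
    if path == "bad.pdf":
        raise ValueError("corrupt PDF")


indexed = {"h1"}
batch = [("a.pdf", "h1"), ("b.pdf", "h2"), ("bad.pdf", "h3")]
summary, failed = ingest_batch(batch, fake_process, indexed)
```
+
+Rerunning the same batch after an interruption skips `a.pdf` and `b.pdf` and retries only the file that failed.
+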
+### Requirement: Parallel ingestion workers
+The system SHALL support parallel document processing via configurable worker count (default: 4). Docling's `DocumentConverter` SHALL be used with multiple workers for PDF/DOCX/HTML ingestion. Database writes SHALL be serialised to avoid SQLite locking issues.
+
+#### Scenario: Parallel PDF ingestion
+- **WHEN** user runs `kb add ~/pdfs/ --recursive` with `workers: 4` in config
+- **THEN** up to 4 documents are processed concurrently through Docling, with chunks written to the database sequentially
+
+#### Scenario: Override worker count
+- **WHEN** user runs `kb add ~/pdfs/ --recursive --workers 1`
+- **THEN** documents are processed sequentially with a single worker
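+
+One way to get concurrent conversion with serialised writes is to let workers return their results and have a single thread own the database connection. The sketch below uses a thread pool and an in-memory SQLite database; `convert` is a stand-in for the Docling step, and the `chunks` table layout here is hypothetical. It does not show how Docling's own converter handles workers.
+
```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor, as_completed


def convert(path):
    """Stand-in for the expensive conversion and chunking step."""
    return path, [f"{path} chunk {i}" for i in range(2)]


def ingest_parallel(paths, conn, workers=4):
    """Convert documents concurrently; write results from this thread only."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(convert, path) for path in paths]
        for future in as_completed(futures):
            path, chunks = future.result()
            # Only the calling thread touches `conn`, so writes never overlap.
            with conn:
                conn.executemany(
                    "INSERT INTO chunks (path, text) VALUES (?, ?)",
                    [(path, chunk) for chunk in chunks],
                )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (path TEXT, text TEXT)")
ingest_parallel(["a.pdf", "b.pdf", "c.pdf"], conn, workers=4)
count = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
```
+
+Setting `workers=1` degrades to sequential processing without changing the write path.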
@@ -0,0 +1,80 @@
+## ADDED Requirements
+
+### Requirement: List documents
+The system SHALL list all indexed documents via `kb list`. Results SHALL include document ID, title, type, tags, chunk count, and creation date. Output SHALL support `--format json` and `--format human`.
+
+#### Scenario: List all documents
+- **WHEN** user runs `kb list`
+- **THEN** all documents are listed with their ID, title, type, tags, chunk count, and creation date
+
+#### Scenario: Filter by type
+- **WHEN** user runs `kb list --type pdf`
+- **THEN** only PDF documents are listed
+
+#### Scenario: Filter by tags
+- **WHEN** user runs `kb list --tags admin,ops`
+- **THEN** only documents tagged with BOTH "admin" AND "ops" are listed
+
+#### Scenario: Empty database
+- **WHEN** user runs `kb list` with no documents indexed
+- **THEN** the system prints "No documents indexed. Run `kb add` to get started." and exits with zero status
+
+### Requirement: Document info
+The system SHALL display detailed information about a single document via `kb info <doc_id>`, including all metadata, tags, chunk count, and chunk previews (first 100 characters of each chunk).
+
+#### Scenario: View document info
+- **WHEN** user runs `kb info 42`
+- **THEN** the system displays: title, source path, type, language (if code), content hash, creation date, tags, total chunks, and a preview of each chunk
+
+#### Scenario: Invalid document ID
+- **WHEN** user runs `kb info 9999` and no document with ID 9999 exists
+- **THEN** the system prints "Document not found: 9999" and exits with non-zero status
+
+### Requirement: Remove document
+The system SHALL remove a document and all its associated chunks, embeddings, and tag associations via `kb remove <doc_id>`. The system SHALL ask for confirmation before deletion unless `--yes` is passed.
+
+#### Scenario: Remove with confirmation
+- **WHEN** user runs `kb remove 42`
+- **THEN** the system displays the document title and asks "Remove 'Git Admin Guide' and its 28 chunks? [y/N]". On confirmation, the document, its chunks, FTS entries, vector embeddings, and tag associations are deleted.
+
+#### Scenario: Remove with --yes flag
+- **WHEN** user runs `kb remove 42 --yes`
+- **THEN** the document is removed without confirmation prompt
+
+#### Scenario: Cascading delete
+- **WHEN** a document is removed
+- **THEN** all rows in `chunks`, `chunks_fts`, `chunks_vec`, and `document_tags` referencing that document SHALL be deleted
+
+### Requirement: Tag management
+The system SHALL support adding and removing tags on documents via `kb tag <doc_id> --add tag1,tag2` and `kb tag <doc_id> --remove tag1`. Tags are case-insensitive and stored lowercase. The system SHALL list all tags with document counts via `kb tags`.
+
+#### Scenario: Add tags to a document
+- **WHEN** user runs `kb tag 42 --add git,admin`
+- **THEN** the tags "git" and "admin" are associated with document 42. Tags are created if they don't exist.
+
+#### Scenario: Remove a tag from a document
+- **WHEN** user runs `kb tag 42 --remove admin`
+- **THEN** the "admin" tag association is removed from document 42. The tag itself remains in the tags table if other documents use it.
+
+#### Scenario: List all tags
+- **WHEN** user runs `kb tags`
+- **THEN** the system lists all tags with the count of documents using each tag, sorted by count descending
+
+#### Scenario: Tag on ingestion
+- **WHEN** user runs `kb add report.pdf --tags compliance,q1`
+- **THEN** the document is ingested and immediately tagged with "compliance" and "q1"
+
+#### Scenario: Tags in JSON format
+- **WHEN** user runs `kb tags --format json`
+- **THEN** output is a JSON array of objects: `[{"name": "git", "count": 15}, ...]`
+
+### Requirement: Database status
+The system SHALL report database statistics via `kb status`, including: total documents (by type), total chunks, database file size, active model name and dimension, and schema version.
+
+#### Scenario: Show status
+- **WHEN** user runs `kb status`
+- **THEN** the system displays: document counts by type, total chunks, DB file size, model name, embedding dimension, and schema version
+
+#### Scenario: Status before init
+- **WHEN** user runs `kb status` before `kb init`
+- **THEN** the system prints "Knowledge base not initialised. Run `kb init` first." and exits with non-zero status
@@ -0,0 +1,57 @@
+## ADDED Requirements
+
+### Requirement: Model initialisation
+The system SHALL download the embedding model on `kb init`. The default model SHALL be `all-MiniLM-L6-v2`. The user MAY specify a different model via `kb init --model <name>`. The model SHALL be downloaded via sentence-transformers to the HuggingFace default cache (`~/.cache/huggingface/`). On first load, the model SHALL be exported to ONNX format for inference.
+
+#### Scenario: Default init
+- **WHEN** user runs `kb init`
+- **THEN** the system downloads `all-MiniLM-L6-v2`, creates `~/.kb/kb.db` with the schema, and records `model_name=all-MiniLM-L6-v2` and `embedding_dim=384` in the DB config table
+
+#### Scenario: Init with custom model
+- **WHEN** user runs `kb init --model nomic-embed-text`
+- **THEN** the system downloads `nomic-embed-text`, creates the database, and records the model name and its dimension in the DB config table
+
+#### Scenario: Init status check
+- **WHEN** user runs `kb init --status`
+- **THEN** the system reports: whether `~/.kb/` exists, whether the DB is initialised, which model is configured, whether the model is downloaded, and Docling model status
+
+#### Scenario: ONNX export on first load
+- **WHEN** the embedding model is loaded for the first time after download
+- **THEN** the system SHALL display "Optimising model for ONNX inference (one-time)..." and export the model to ONNX format. Subsequent loads SHALL use the cached ONNX export.
+
+### Requirement: Model-database binding
+The system SHALL store the active model name and embedding dimension in the database `config` table. Every operation that uses the embedding model (add, search, reindex) SHALL verify that the loaded model matches the DB record. A mismatch SHALL be a hard error.
+
+#### Scenario: Model mismatch on add
+- **WHEN** user runs `kb add doc.pdf` but the config YAML specifies a different model than what the DB was initialised with
+- **THEN** the system SHALL print an error: "Model mismatch: DB uses 'all-MiniLM-L6-v2' (384 dim) but config specifies 'nomic-embed-text'. Run `kb reindex --model nomic-embed-text` to switch models." and exit with non-zero status
+
+#### Scenario: Model match on add
+- **WHEN** user runs `kb add doc.pdf` and the config model matches the DB model
+- **THEN** ingestion proceeds normally
+
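+The binding check reduces to a comparison against the values stored in the DB `config` table. A sketch, assuming those values are already loaded into a dict; `ModelMismatchError` and `check_model` are hypothetical names, and the message text follows the scenario above:
+
```python
class ModelMismatchError(RuntimeError):
    """The configured model differs from the one the DB was built with."""


def check_model(db_config, configured_model):
    """Raise if `configured_model` does not match the model recorded in the DB."""
    db_model = db_config["model_name"]
    if configured_model != db_model:
        raise ModelMismatchError(
            f"Model mismatch: DB uses '{db_model}' "
            f"({db_config['embedding_dim']} dim) but config specifies "
            f"'{configured_model}'. Run `kb reindex --model {configured_model}` "
            "to switch models."
        )


db_config = {"model_name": "all-MiniLM-L6-v2", "embedding_dim": 384}
check_model(db_config, "all-MiniLM-L6-v2")  # matches: returns silently

try:
    check_model(db_config, "nomic-embed-text")
    message = None
except ModelMismatchError as exc:
    message = str(exc)
```
+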
+### Requirement: Full reindex with model switching
+The system SHALL support re-embedding all chunks via `kb reindex`. If `--model` is specified, the system SHALL download the new model, re-embed all chunks, replace all vectors, and update the DB config. A progress bar SHALL be displayed. The operation SHALL be atomic — if interrupted, the old embeddings remain intact.
+
+#### Scenario: Reindex with same model
+- **WHEN** user runs `kb reindex`
+- **THEN** all chunks are re-embedded with the current model and vectors are replaced. This is useful if the model's ONNX export was corrupted or chunks were modified.
+
+#### Scenario: Reindex with new model
+- **WHEN** user runs `kb reindex --model bge-small-en-v1.5`
+- **THEN** the system downloads the new model, re-embeds all chunks (showing progress), replaces all vectors in `chunks_vec` (recreating the table if dimension changed), and updates `model_name` and `embedding_dim` in the DB config table
+
+#### Scenario: Interrupted reindex
+- **WHEN** a reindex is interrupted partway through
+- **THEN** the old embeddings remain intact (the vector table is only replaced on successful completion of all embeddings). The user can rerun `kb reindex` to retry.
+
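+The atomicity requirement can be met by computing every embedding first and only then replacing vectors and metadata inside one transaction. The sketch below uses plain SQLite tables and JSON-encoded vectors as stand-ins; in the real tool `chunks_vec` is a sqlite-vec table, and recreating it when the dimension changes is not shown. `reindex`, the `embed` callable, and the key/value layout of `config` are assumptions.
+
```python
import json
import sqlite3


def reindex(conn, embed, model_name, dim):
    """Re-embed every chunk, then swap vectors and metadata in one transaction."""
    rows = conn.execute("SELECT id, text FROM chunks").fetchall()
    # Embed everything up front; if this raises, nothing has been written yet.
    vectors = [(chunk_id, json.dumps(embed(text))) for chunk_id, text in rows]
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute("DELETE FROM chunks_vec")
        conn.executemany(
            "INSERT INTO chunks_vec (chunk_id, embedding) VALUES (?, ?)", vectors
        )
        conn.execute(
            "UPDATE config SET value = ? WHERE key = 'model_name'", (model_name,)
        )
        conn.execute(
            "UPDATE config SET value = ? WHERE key = 'embedding_dim'", (str(dim),)
        )


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT);
    CREATE TABLE chunks_vec (chunk_id INTEGER, embedding TEXT);
    CREATE TABLE config (key TEXT PRIMARY KEY, value TEXT);
    INSERT INTO chunks VALUES (1, 'alpha'), (2, 'beta');
    INSERT INTO chunks_vec VALUES (1, '[0.0]'), (2, '[0.0]');
    INSERT INTO config VALUES ('model_name', 'old-model'), ('embedding_dim', '1');
    """
)


def failing_embed(text):
    """Simulates an interruption partway through embedding."""
    raise RuntimeError("interrupted")


try:
    reindex(conn, failing_embed, "new-model", 2)
except RuntimeError:
    pass
query = "SELECT embedding FROM chunks_vec ORDER BY chunk_id"
after_failure = conn.execute(query).fetchall()

reindex(conn, lambda text: [float(len(text)), 1.0], "new-model", 2)
after_success = conn.execute(query).fetchall()
```
+
+After the failed run the old vectors and model name are untouched; only the successful run replaces them.
+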
+### Requirement: Embedding model inference via ONNX
+The system SHALL use `sentence-transformers` with the ONNX backend for all embedding inference. This keeps PyTorch out of the inference path, although `sentence-transformers` itself may still install it as a dependency. The ONNX Runtime (`onnxruntime`) SHALL be the inference engine.
+
+#### Scenario: Embed a chunk
+- **WHEN** a chunk of text needs to be embedded during ingestion
+- **THEN** the system uses the sentence-transformers ONNX backend to produce a float vector of the correct dimension for the active model
+
+#### Scenario: Embed a query
+- **WHEN** a search query needs to be embedded
+- **THEN** the system applies the configured `query_prefix` (if any) to the query text before embedding, and uses the same ONNX model used for chunk embeddings
@@ -0,0 +1,70 @@
+## ADDED Requirements
+
+### Requirement: Full-text search via FTS5
+The system SHALL maintain an FTS5 virtual table synchronised with the chunks table via triggers. FTS5 SHALL use the `porter unicode61` tokenizer for stemming and unicode support. Queries SHALL be passed to FTS5 with special characters escaped.
+
+#### Scenario: Keyword search
+- **WHEN** user runs `kb search "install git"`
+- **THEN** FTS5 returns chunks containing "install" and/or "git" (including stemmed variants like "installation"), ranked by BM25 score
+
+#### Scenario: FTS-only mode
+- **WHEN** user runs `kb search "install git" --fts-only`
+- **THEN** only FTS5 results are returned, no vector search is performed
+
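+One common way to neutralise FTS5 operators in user input is to wrap each term in double quotes and double any embedded quote characters, which is how FTS5 string literals are written. The helper below is a sketch with a hypothetical name. Joining terms with spaces gives FTS5's implicit AND; joining with ` OR ` instead would give the broader "and/or" matching.
+
```python
def escape_fts5(query):
    """Quote each whitespace-separated term so FTS5 reads it as a literal string."""
    quoted = []
    for term in query.split():
        # Inside an FTS5 string, a double quote is escaped by doubling it.
        quoted.append('"' + term.replace('"', '""') + '"')
    return " ".join(quoted)


plain = escape_fts5("install git")
tricky = escape_fts5('say "hi" AND-x*')
```
+
+The result is meant to be bound as the parameter of a `MATCH ?` clause, not interpolated into the SQL text.
+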
+### Requirement: Vector similarity search via sqlite-vec
+The system SHALL embed the query using the same model that was used to embed stored chunks. The embedded query SHALL be compared against all chunk embeddings in `chunks_vec` using cosine similarity. The system SHALL retrieve 3× the requested result count as candidates for RRF merging.
+
+#### Scenario: Semantic search
+- **WHEN** user runs `kb search "how to set up version control"`
+- **THEN** the query is embedded and compared against stored vectors, returning semantically similar chunks even if they don't contain the exact words "version control"
+
+#### Scenario: Vector-only mode
+- **WHEN** user runs `kb search "how to set up version control" --vec-only`
+- **THEN** only vector similarity results are returned, no FTS search is performed
+
+### Requirement: Reciprocal Rank Fusion merging
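+The system SHALL merge FTS5 and vector search results using Reciprocal Rank Fusion (RRF). The RRF formula SHALL be: `score(d) = Σ 1/(k + rank_i(d))`, summed over each result list `i` in which `d` appears, where `rank_i(d)` is the 1-based rank of `d` in that list and `k` is configurable (default: 60). Results SHALL be sorted by descending RRF score.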
+
+#### Scenario: Hybrid search combines both signals
+- **WHEN** user runs `kb search "install git"` (default hybrid mode)
+- **THEN** the system runs both FTS5 and vector searches, merges results via RRF, and returns results sorted by combined score
+
+#### Scenario: Document appears in both result sets
+- **WHEN** a chunk ranks #2 in FTS5 and #5 in vector search
+- **THEN** its RRF score SHALL be `1/(60+2) + 1/(60+5) = 0.0161 + 0.0154 = 0.0315`, higher than a chunk appearing in only one result set
+
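+The formula translates directly into a few lines of code. A sketch with a hypothetical `rrf_merge` helper, taking each result list as chunk IDs ordered best first and using 1-based ranks as in the worked example above:
+
```python
def rrf_merge(rankings, k=60):
    """Merge ranked ID lists with Reciprocal Rank Fusion.

    Returns (id, score) pairs sorted by descending score.
    """
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


fts_hits = ["a", "x", "c"]            # "x" is ranked #2 by FTS5
vec_hits = ["b", "d", "e", "f", "x"]  # "x" is ranked #5 by vector search
merged = rrf_merge([fts_hits, vec_hits])
```
+
+Here `x` scores `1/62 + 1/65` and moves ahead of `a` and `b`, which each appear at rank 1 in only one list and score `1/61`.
+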
+### Requirement: Tag-based filtering
+The system SHALL support filtering search results by one or more tags. When multiple tags are specified, the filter SHALL use AND logic (document must have ALL specified tags). Tag filtering SHALL be applied in the SQL query via JOIN for efficiency.
+
+#### Scenario: Filter by single tag
+- **WHEN** user runs `kb search "deploy" --tags ops`
+- **THEN** only chunks from documents tagged with "ops" are included in results
+
+#### Scenario: Filter by multiple tags
+- **WHEN** user runs `kb search "deploy" --tags ops,production`
+- **THEN** only chunks from documents tagged with BOTH "ops" AND "production" are included
+
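+AND semantics over several tags are commonly expressed in SQL as a `GROUP BY` with a `HAVING COUNT` equal to the number of requested tags. The sketch below shows only the tag filter, not the full search query. `document_tags` is named in this spec; the `tags` table columns, `tag_id`, and the function name are assumptions.
+
```python
import sqlite3


def documents_with_all_tags(conn, tags):
    """Return IDs of documents that carry every tag in `tags`."""
    wanted = sorted({tag.lower() for tag in tags})
    if not wanted:
        return []
    placeholders = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT dt.document_id
        FROM document_tags AS dt
        JOIN tags AS t ON t.id = dt.tag_id
        WHERE t.name IN ({placeholders})
        GROUP BY dt.document_id
        HAVING COUNT(DISTINCT t.name) = ?
        ORDER BY dt.document_id
    """
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE document_tags (document_id INTEGER, tag_id INTEGER);
    INSERT INTO tags VALUES (1, 'ops'), (2, 'production');
    INSERT INTO document_tags VALUES (1, 1), (1, 2), (2, 1);
    """
)
both = documents_with_all_tags(conn, ["ops", "production"])
ops_only = documents_with_all_tags(conn, ["OPS"])
```
+
+Document 2 has only `ops`, so it drops out as soon as `production` is also required.
+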
+### Requirement: Type-based filtering
+The system SHALL support filtering search results by document type. Valid types: `pdf`, `markdown`, `code`, `note`.
+
+#### Scenario: Filter by type
+- **WHEN** user runs `kb search "deploy" --type code`
+- **THEN** only chunks from code documents are included in results
+
+### Requirement: Score threshold
+The system SHALL support a minimum score cutoff. Results with an RRF score below the threshold SHALL be excluded from output.
+
+#### Scenario: Apply score threshold
+- **WHEN** user runs `kb search "deploy" --threshold 0.02`
+- **THEN** only results with RRF score >= 0.02 are returned
+
+### Requirement: Result count control
+The system SHALL return a configurable number of results (default: 10, configurable via `--top` flag or `search.default_top` in config).
+
+#### Scenario: Request specific number of results
+- **WHEN** user runs `kb search "deploy" --top 5`
+- **THEN** at most 5 results are returned
+
+#### Scenario: Fewer matches than requested
+- **WHEN** user searches and only 3 chunks match
+- **THEN** the system returns 3 results without error, with `returned: 3` in the output
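+
+Threshold and result-count handling can be sketched as a filter followed by a slice. `select_results` is a hypothetical helper, and counting `total_matches` after the threshold is applied is an assumption, since the spec does not say which count it reports.
+
```python
def select_results(scored, top=10, threshold=None):
    """Apply the score cutoff, sort by descending score, keep at most `top`."""
    kept = [r for r in scored if threshold is None or r["score"] >= threshold]
    kept.sort(key=lambda r: r["score"], reverse=True)
    page = kept[:top]
    return {"results": page, "total_matches": len(kept), "returned": len(page)}


scored = [
    {"chunk_id": 1, "score": 0.031},
    {"chunk_id": 2, "score": 0.016},
    {"chunk_id": 3, "score": 0.025},
]
output = select_results(scored, top=5, threshold=0.02)
```
+
+With the default `k` of 60, a chunk found by only one search can score at most `1/61` (about 0.0164), so a threshold of 0.02 keeps only chunks that appear in both result sets.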
@@ -0,0 +1,101 @@
+## ADDED Requirements
+
+### Requirement: JSON output format for search
+The system SHALL output search results as JSON when `--format json` is used (this is the default). The JSON schema SHALL include: `query`, `results` array, `total_matches`, and `returned` count. Each result SHALL include: `chunk_id`, `score`, `score_breakdown` (with `fts` and `vector` sub-scores), `text`, and `source` object.
+
+#### Scenario: JSON search output
+- **WHEN** user runs `kb search "install git" --format json`
+- **THEN** the output is valid JSON matching this structure:
+  ```json
+  {
+    "query": "install git",
+    "results": [
+      {
+        "chunk_id": 1423,
+        "score": 0.031,
+        "score_breakdown": {"fts": 0.016, "vector": 0.015},
+        "text": "To install the latest version...",
+        "source": {
+          "document_id": 42,
+          "title": "Git Admin Guide",
+          "path": "/home/user/docs/git-admin.pdf",
+          "type": "pdf",
+          "page": 12,
+          "chunk_index": 3,
+          "total_chunks": 28,
+          "tags": ["git", "admin"]
+        }
+      }
+    ],
+    "total_matches": 47,
+    "returned": 10
+  }
+  ```
+
+#### Scenario: Score breakdown in FTS-only mode
+- **WHEN** user runs `kb search "test" --fts-only --format json`
+- **THEN** `score_breakdown` contains `{"fts": <score>, "vector": null}`
+
+#### Scenario: Score breakdown in vector-only mode
+- **WHEN** user runs `kb search "test" --vec-only --format json`
+- **THEN** `score_breakdown` contains `{"fts": null, "vector": <score>}`
+
+### Requirement: Human-readable output format
+The system SHALL support human-readable output via `--format human`. This format SHALL show: query, match count, and for each result: rank, score, title, page/section (if applicable), type, tags, and a text preview.
+
+#### Scenario: Human-readable search output
+- **WHEN** user runs `kb search "install git" --format human`
+- **THEN** output is formatted for terminal reading:
+  ```
+  Search: "install git" (47 matches, showing top 10)
+
+   1. [0.031] Git Admin Guide (p.12)               [pdf] [git, admin]
+      To install the latest version of git from source...
+
+   2. [0.025] setup-notes.md §Installation          [markdown] [git]
+      First, add the PPA repository for the latest git...
+  ```
+
+### Requirement: JSON output for list, tags, info, and status commands
+The system SHALL support `--format json` for `kb list`, `kb tags`, `kb info`, and `kb status` commands. JSON output SHALL be valid and parseable by the skill wrapper.
+
+#### Scenario: List documents as JSON
+- **WHEN** user runs `kb list --format json`
+- **THEN** output is a JSON array of document objects with `id`, `title`, `type`, `tags`, `chunk_count`, `created_at`
+
+#### Scenario: Tags as JSON
+- **WHEN** user runs `kb tags --format json`
+- **THEN** output is a JSON array: `[{"name": "git", "count": 15}, ...]`
+
+#### Scenario: Status as JSON
+- **WHEN** user runs `kb status --format json`
+- **THEN** output is a JSON object with `documents` (counts by type), `total_chunks`, `db_size_bytes`, `model_name`, `embedding_dim`, `schema_version`
+
+### Requirement: JSON schema stability
+The JSON output schema SHALL be treated as a public contract. Fields MAY be added to JSON objects in future versions. Fields SHALL NOT be removed or renamed. The skill wrapper MUST be able to rely on the presence and type of all documented fields.
+
+#### Scenario: Forward compatibility
+- **WHEN** a future version adds a `language` field to search results
+- **THEN** all existing fields remain present and unchanged, the new field is additive only
+
+### Requirement: Exit codes
+The system SHALL use consistent exit codes: 0 for success, 1 for user errors (bad arguments, missing files), 2 for system errors (database corruption, model failure). JSON error output SHALL include an `error` field with a human-readable message.
+
+#### Scenario: Successful operation
+- **WHEN** any command completes successfully
+- **THEN** exit code is 0
+
+#### Scenario: User error with JSON output
+- **WHEN** user runs `kb search` with no query argument
+- **THEN** exit code is 1 and stderr contains a clear error message
+
+#### Scenario: System error
+- **WHEN** the SQLite database is corrupted
+- **THEN** exit code is 2 and stderr contains the error details
+
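+A thin wrapper can map failure categories to the three exit codes and emit the `error` field. This is a sketch under assumptions: `ExitCode`, `UserError`, `SystemFailure`, and `run` are hypothetical names, and writing the JSON error object to stderr is one possible reading of the requirement.
+
```python
import io
import json
from enum import IntEnum


class ExitCode(IntEnum):
    SUCCESS = 0
    USER_ERROR = 1
    SYSTEM_ERROR = 2


class UserError(Exception):
    """Bad arguments, missing files."""


class SystemFailure(Exception):
    """Database corruption, model failure."""


def run(command, stderr):
    """Run `command`; report failures as JSON on `stderr` and return an exit code."""
    try:
        command()
    except UserError as exc:
        code, message = ExitCode.USER_ERROR, str(exc)
    except SystemFailure as exc:
        code, message = ExitCode.SYSTEM_ERROR, str(exc)
    else:
        return ExitCode.SUCCESS
    print(json.dumps({"error": message}), file=stderr)
    return code


def missing_query():
    """Simulates `kb search` invoked without a query argument."""
    raise UserError("search requires a query argument")


stderr = io.StringIO()
code = run(missing_query, stderr)
```
+
+A real entry point would pass `sys.stderr` and hand the returned code to `sys.exit`.
+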
+### Requirement: Skill definition file
+The project SHALL include a `SKILL.md` file that defines how an LLM tool (e.g. Claude Code) should invoke and interpret `kb` commands. The skill file SHALL document: when to use the tool, available commands, output format, how to cite sources, and how to handle low-confidence results.
+
+#### Scenario: Skill file exists
+- **WHEN** the project is built
+- **THEN** a `SKILL.md` file exists at the project root describing the skill interface for LLM consumption