kb/openspec/changes/kb-search/specs/document-ingestion/spec.md

## ADDED Requirements

### Requirement: File type detection and routing
The system SHALL detect the type of a file being ingested and route it to the appropriate ingestion pipeline. Detection SHALL be based on file extension. Supported types: PDF (`.pdf`), DOCX (`.docx`), HTML (`.html`, `.htm`), Markdown and plain text (`.md`, `.markdown`, `.txt`), Code (`.py`, `.sh`, `.bash`, `.go`), and image files (`.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`, `.webp`). The user MAY override detection with the `--type` and `--language` flags.

#### Scenario: Auto-detect PDF file
- **WHEN** user runs `kb add report.pdf`
- **THEN** the file is routed to the Docling ingestion pipeline

#### Scenario: Auto-detect Python code
- **WHEN** user runs `kb add script.py`
- **THEN** the file is routed to the code ingestion pipeline with language set to `python`

#### Scenario: Override type detection
- **WHEN** user runs `kb add data.txt --type code --language bash`
- **THEN** the file is routed to the code pipeline as Bash, regardless of the `.txt` extension

#### Scenario: Unsupported file type
- **WHEN** user runs `kb add archive.zip`
- **THEN** the system SHALL print an error message listing supported formats and exit with non-zero status
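
The routing scenarios above can be sketched as a lookup table keyed by extension. This is a minimal illustration, not the actual implementation: `Route`, `EXTENSION_ROUTES`, `UnsupportedFileType`, and `detect_route` are hypothetical names, and the sketch assumes `--type` values name a pipeline directly.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class Route(NamedTuple):
    pipeline: str            # "docling", "markdown", or "code" (illustrative names)
    language: Optional[str]  # only set for the code pipeline


# Extension table taken from the requirement text.
EXTENSION_ROUTES = {
    ".pdf": Route("docling", None),
    ".docx": Route("docling", None),
    ".html": Route("docling", None),
    ".htm": Route("docling", None),
    ".png": Route("docling", None),
    ".jpg": Route("docling", None),
    ".jpeg": Route("docling", None),
    ".tiff": Route("docling", None),
    ".bmp": Route("docling", None),
    ".webp": Route("docling", None),
    ".md": Route("markdown", None),
    ".markdown": Route("markdown", None),
    ".txt": Route("markdown", None),
    ".py": Route("code", "python"),
    ".sh": Route("code", "bash"),
    ".bash": Route("code", "bash"),
    ".go": Route("code", "go"),
}


class UnsupportedFileType(ValueError):
    """Raised when the extension is unknown and no override is given."""


def detect_route(path: str,
                 type_override: Optional[str] = None,
                 language_override: Optional[str] = None) -> Route:
    # Assumption: an explicit --type wins outright and names the pipeline.
    if type_override is not None:
        return Route(type_override, language_override)

    ext = Path(path).suffix.lower()
    if ext not in EXTENSION_ROUTES:
        supported = ", ".join(sorted(EXTENSION_ROUTES))
        raise UnsupportedFileType(
            f"Unsupported file type {ext!r}. Supported: {supported}"
        )

    route = EXTENSION_ROUTES[ext]
    # --language alone only refines a file already routed to the code pipeline.
    if language_override is not None and route.pipeline == "code":
        route = route._replace(language=language_override)
    return route
```

A CLI wrapper would catch `UnsupportedFileType`, print its message, and exit with a non-zero status.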

### Requirement: Docling pipeline for complex documents
The system SHALL use Docling to ingest PDF, DOCX, HTML, and image files. The pipeline SHALL use the `pypdfium2` backend for PDFs, enable the layout model for structural detection, and enable table reconstruction. OCR SHALL be configurable: `auto` (detect pages with no extractable text and OCR only those), `always`, or `never`.

#### Scenario: Ingest a text-based PDF
- **WHEN** user runs `kb add manual.pdf`
- **THEN** the system extracts text using Docling with layout detection, produces hierarchy-aware chunks preserving section structure, embeds each chunk, and stores the document with all chunks in the database

#### Scenario: Ingest a PDF with tables
- **WHEN** user ingests a PDF containing data tables
- **THEN** Docling's table reconstruction SHALL produce chunks where table content is represented as structured text (markdown table format) rather than garbled column fragments

#### Scenario: Ingest a scanned PDF with OCR auto mode
- **WHEN** user ingests a PDF where some pages contain only images (no extractable text) and OCR is set to `auto`
- **THEN** the system SHALL detect the image-only pages and apply OCR to those pages only, leaving text-extractable pages processed normally

#### Scenario: Ingest an image file
- **WHEN** user runs `kb add diagram.png`
- **THEN** the system SHALL process it through Docling with OCR enabled, extracting any text content from the image
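
The three OCR modes reduce to a per-page decision, sketched below. The sketch assumes the text layer of each page has already been extracted into a list of strings. `pages_needing_ocr` is a hypothetical helper and not part of Docling's API.

```python
from typing import List, Sequence


def pages_needing_ocr(page_texts: Sequence[str], mode: str = "auto") -> List[int]:
    """Return zero-based indices of the pages that should be sent to OCR.

    page_texts holds the extractable text of each page ("" for image-only pages).
    """
    if mode == "always":
        return list(range(len(page_texts)))
    if mode == "never":
        return []
    if mode == "auto":
        # A page counts as image-only when its text layer is empty or whitespace.
        return [i for i, text in enumerate(page_texts) if not text.strip()]
    raise ValueError(f"unknown OCR mode: {mode!r}")
```

Under this reading, a standalone image such as `diagram.png` behaves like a single page with no text layer, so `auto` and `always` give the same result for it.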

### Requirement: Markdown ingestion with header-based splitting
The system SHALL split markdown and text files at header boundaries (`##`, `###`). Each chunk SHALL include its parent header chain as context. Sections smaller than `min_tokens` SHALL be merged with the following section. Sections larger than `max_tokens` SHALL be split at paragraph boundaries with configurable overlap.

#### Scenario: Split markdown at headers
- **WHEN** user runs `kb add guide.md` and the file contains multiple `##` sections
- **THEN** each section becomes a separate chunk, with the header text included in the chunk

#### Scenario: Preserve header hierarchy
- **WHEN** a markdown file has nested headers like `## Config` > `### Advanced Options`
- **THEN** the chunk for "Advanced Options" SHALL include context indicating it falls under "Config > Advanced Options"

#### Scenario: Merge small sections
- **WHEN** a markdown section contains fewer tokens than `min_tokens` (default: 50)
- **THEN** it SHALL be merged with the next section into a single chunk

#### Scenario: Plain text file without headers
- **WHEN** user runs `kb add notes.txt` and the file has no markdown headers
- **THEN** the system SHALL fall back to fixed-size chunking with configurable `max_tokens` and `overlap_tokens`
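
A rough sketch of header-based splitting with parent-chain context and small-section merging follows. It makes several assumptions: tokens are approximated by whitespace-separated words, a merged chunk keeps the header chain of its first section, and headers inside fenced code blocks are not treated specially. Splitting of sections above `max_tokens` is left out. `split_markdown` and `count_tokens` are hypothetical names.

```python
import re
from typing import List, Tuple

# Only `##` and `###` open a new section, as in the requirement.
HEADER_RE = re.compile(r"^(#{2,3})\s+(.+?)\s*$")


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for real tokens.
    return len(text.split())


def split_markdown(text: str, min_tokens: int = 50) -> List[Tuple[str, str]]:
    """Return (header_chain, chunk_text) pairs."""
    sections: List[Tuple[str, List[str]]] = []
    stack: List[Tuple[int, str]] = []  # currently open headers as (level, title)
    chain, body = "", []

    def flush() -> None:
        if chain or any(ln.strip() for ln in body):
            sections.append((chain, body))

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match is None:
            body.append(line)
            continue
        flush()
        level, title = len(match.group(1)), match.group(2)
        # Close headers at the same or a deeper level, then open the new one.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        chain, body = " > ".join(t for _, t in stack), [line]
    flush()

    chunks: List[Tuple[str, str]] = []
    pending = None  # an undersized section waiting for the next one
    for sec_chain, lines in sections:
        block = "\n".join(lines).strip()
        if pending is not None:
            sec_chain, block = pending[0], pending[1] + "\n\n" + block
        if count_tokens(block) < min_tokens:
            pending = (sec_chain, block)
        else:
            chunks.append((sec_chain, block))
            pending = None
    if pending is not None:
        chunks.append(pending)  # the last section has nothing to merge into
    return chunks
```

A file with no `##` or `###` headers comes back as a single section, which is the point at which a caller would switch to fixed-size chunking.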

### Requirement: Code ingestion with AST/regex splitting
The system SHALL split code files at function and class boundaries. Python files SHALL use the `ast` module. Bash and Go files SHALL use regex-based splitting. When `include_context` is enabled (default), class methods SHALL include the class docstring/signature for context. Files with no recognisable function/class boundaries SHALL fall back to fixed-size chunking.

#### Scenario: Python file with functions and classes
- **WHEN** user runs `kb add auth.py` and the file contains a class with methods
- **THEN** each method becomes a chunk, and each chunk includes the class name and docstring as context

#### Scenario: Bash script with functions
- **WHEN** user runs `kb add deploy.sh` and the file contains `function deploy() {` blocks
- **THEN** each function becomes a separate chunk, including any preceding comment block

#### Scenario: Go file with functions
- **WHEN** user runs `kb add main.go` and the file contains `func` declarations
- **THEN** each function becomes a separate chunk

#### Scenario: Code file with no functions
- **WHEN** user runs `kb add script.sh` and the file has no function declarations
- **THEN** the system SHALL fall back to fixed-size chunking with `max_tokens` and `overlap_tokens`

### Requirement: Inline note ingestion
The system SHALL support adding text notes directly from the command line via `kb add --note "text"`. Notes SHALL be stored as a single chunk (no splitting). Notes MAY have an optional `--title` for display purposes.

#### Scenario: Add an inline note
- **WHEN** user runs `kb add --note "Always restart nginx after config changes" --title "nginx reminder"`
- **THEN** a document of type `note` is created with the title "nginx reminder", and the full text becomes a single chunk

#### Scenario: Add a note without title
- **WHEN** user runs `kb add --note "some text"`
- **THEN** the system SHALL use the first 80 characters of the text (truncated at a word boundary) as the title
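
One way to derive the fallback title is shown below. `default_note_title` is a hypothetical helper, and collapsing runs of whitespace before truncating is an assumption not stated in the requirement.

```python
def default_note_title(text: str, limit: int = 80) -> str:
    """Build a title from the first `limit` characters, cut at a word boundary."""
    collapsed = " ".join(text.split())  # assumption: newlines and tabs become spaces
    if len(collapsed) <= limit:
        return collapsed
    head = collapsed[:limit]
    # If the cut lands inside a word, back up to the previous space.
    # A single word longer than `limit` is cut hard.
    if collapsed[limit] != " " and " " in head:
        head = head[:head.rfind(" ")]
    return head.rstrip()
```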

### Requirement: Deduplication via content hash
The system SHALL compute a SHA-256 hash of each file's contents before ingestion. If a document with the same `content_hash` already exists in the database, the file SHALL be skipped with a message indicating it is already indexed.

#### Scenario: Add a file that is already indexed
- **WHEN** user runs `kb add report.pdf` and the file's SHA-256 matches an existing document
- **THEN** the system SHALL print "Skipped: report.pdf (already indexed)" and not create a duplicate

#### Scenario: Add a modified version of an existing file
- **WHEN** user runs `kb add report.pdf` and the file has changed since last indexed (different hash)
- **THEN** the system SHALL ingest it as a new document (the old version remains unless manually removed)
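
The hash check can be sketched with `hashlib` alone. In this illustration an in-memory set stands in for a lookup on the `content_hash` column, and `content_hash` and `should_ingest` are hypothetical function names.

```python
import hashlib
from pathlib import Path
from typing import Set


def content_hash(path: str, block_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in blocks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def should_ingest(path: str, known_hashes: Set[str]) -> bool:
    """Return False and report a skip when the file's hash is already known."""
    file_hash = content_hash(path)
    if file_hash in known_hashes:
        print(f"Skipped: {Path(path).name} (already indexed)")
        return False
    # A real implementation would record the hash when the document row is written.
    known_hashes.add(file_hash)
    return True
```

Because the hash covers contents only, a renamed copy of an indexed file is skipped, while an edited file under the same name is treated as new.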

### Requirement: Batch ingestion with progress and resumability
The system SHALL support ingesting entire directories via `kb add <dir> --recursive`. Processing SHALL be resumable: files already indexed (by content hash) are skipped. Failed files SHALL be logged and skipped without aborting the batch. A summary SHALL be displayed at completion.

#### Scenario: Ingest a directory
- **WHEN** user runs `kb add ~/docs/ --recursive`
- **THEN** the system recursively finds all supported files, processes each one, skips duplicates, logs failures, and displays a summary: "Added X documents. Y failed. Z skipped (already indexed)."

#### Scenario: Resume after interruption
- **WHEN** a batch ingestion is interrupted (Ctrl+C, crash) and user reruns the same command
- **THEN** already-indexed files are skipped via content hash, and processing continues with remaining files

#### Scenario: Failed file during batch
- **WHEN** a single file fails to process (corrupt PDF, encoding error)
- **THEN** the error is logged to `~/.kb/ingest-errors.log` with the file path and error message, and processing continues with the next file
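
The batch behaviour described above (skip indexed files, isolate failures, print a summary) fits in one loop. The sketch below is illustrative only: `ingest_batch` and its callback parameters are hypothetical, and a plain list stands in for `~/.kb/ingest-errors.log`.

```python
from typing import Callable, Iterable, List


def ingest_batch(paths: Iterable[str],
                 ingest_one: Callable[[str], None],
                 already_indexed: Callable[[str], bool],
                 error_log: List[str]) -> str:
    """Process each path, isolating failures, and return the summary line."""
    added = failed = skipped = 0
    for path in paths:
        if already_indexed(path):
            skipped += 1
            continue
        try:
            ingest_one(path)
        except Exception as exc:
            # One bad file must not abort the batch. KeyboardInterrupt is not
            # an Exception subclass, so Ctrl+C still stops the run.
            error_log.append(f"{path}: {exc}")
            failed += 1
            continue
        added += 1
    return f"Added {added} documents. {failed} failed. {skipped} skipped (already indexed)."
```

Resumability needs no extra state in this shape: rerunning the same command walks the same paths, and everything ingested before the interruption is skipped by the `already_indexed` check.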

### Requirement: Parallel ingestion workers
The system SHALL support parallel document processing via a configurable worker count (default: 4). Docling's `DocumentConverter` SHALL be used with multiple workers for the formats the Docling pipeline handles (PDF, DOCX, HTML, and image files). Database writes SHALL be serialised to avoid SQLite locking issues.

#### Scenario: Parallel PDF ingestion
- **WHEN** user runs `kb add ~/pdfs/ --recursive` with `workers: 4` in config
- **THEN** up to 4 documents are processed concurrently through Docling, with chunks written to the database sequentially

#### Scenario: Override worker count
- **WHEN** user runs `kb add ~/pdfs/ --recursive --workers 1`
- **THEN** documents are processed sequentially with a single worker
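
The split between concurrent conversion and serialised writes might look like the sketch below. It is a stand-in built on assumptions: a thread pool replaces whatever worker mechanism the real Docling integration uses, `convert` is a hypothetical callable returning a document's chunks, and the `chunks` table schema is invented for the example.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, List


def ingest_parallel(paths: Iterable[str],
                    convert: Callable[[str], List[str]],
                    conn: sqlite3.Connection,
                    workers: int = 4) -> int:
    """Convert documents concurrently; write chunks from the calling thread only."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks (path TEXT, seq INTEGER, text TEXT)"
    )
    written = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(convert, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            chunks = future.result()
            # Only this thread touches the connection, so writes never contend.
            with conn:  # one transaction per document, committed on success
                conn.executemany(
                    "INSERT INTO chunks VALUES (?, ?, ?)",
                    [(path, seq, text) for seq, text in enumerate(chunks)],
                )
            written += len(chunks)
    return written
```

With `workers=1` the pool degenerates to sequential processing, which matches the `--workers 1` scenario without a separate code path.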