## ADDED Requirements

### Requirement: File type detection and routing

The system SHALL detect the type of a file being ingested and route it to the appropriate ingestion pipeline. Detection SHALL be based on file extension. Supported types: PDF (`.pdf`), DOCX (`.docx`), HTML (`.html`, `.htm`), Markdown and plain text (`.md`, `.markdown`, `.txt`), Code (`.py`, `.sh`, `.bash`, `.go`), and image files (`.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`, `.webp`). The user MAY override detection with the `--type` and `--language` flags.

#### Scenario: Auto-detect PDF file
- **WHEN** user runs `kb add report.pdf`
- **THEN** the file is routed to the Docling ingestion pipeline

#### Scenario: Auto-detect Python code
- **WHEN** user runs `kb add script.py`
- **THEN** the file is routed to the code ingestion pipeline with language set to `python`

#### Scenario: Override type detection
- **WHEN** user runs `kb add data.txt --type code --language bash`
- **THEN** the file is routed to the code pipeline as Bash, regardless of the `.txt` extension

#### Scenario: Unsupported file type
- **WHEN** user runs `kb add archive.zip`
- **THEN** the system SHALL print an error message listing supported formats and exit with non-zero status

### Requirement: Docling pipeline for complex documents

The system SHALL use Docling to ingest PDF, DOCX, HTML, and image files. The pipeline SHALL use the `pypdfium2` backend for PDFs, enable the layout model for structural detection, and enable table reconstruction. OCR SHALL be configurable: `auto` (detect pages with no extractable text and OCR only those), `always`, or `never`.

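
The three OCR modes can be read as a page-selection rule. The following is a minimal sketch of that rule only, not of the Docling integration: `pages_needing_ocr` and its `page_texts` input (the text already extracted from each page) are hypothetical names introduced for illustration, and "no extractable text" is assumed to mean empty or whitespace-only.

```python
from typing import List

OCR_MODES = ("auto", "always", "never")


def pages_needing_ocr(page_texts: List[str], mode: str = "auto") -> List[int]:
    """Return zero-based indices of the pages that should be sent to OCR.

    page_texts holds the text extracted from each page without OCR
    (hypothetical input; the real pipeline would obtain it from the PDF backend).
    """
    if mode not in OCR_MODES:
        raise ValueError(f"unknown OCR mode: {mode!r}")
    if mode == "never":
        return []
    if mode == "always":
        return list(range(len(page_texts)))
    # auto: assume a page needs OCR when it yields no text at all
    return [i for i, text in enumerate(page_texts) if not text.strip()]
```

For a three-page document whose second and third pages yield no text, `pages_needing_ocr(["Intro", "", "  \n"], "auto")` selects pages 1 and 2 and leaves page 0 to normal text extraction.
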
#### Scenario: Ingest a text-based PDF
- **WHEN** user runs `kb add manual.pdf`
- **THEN** the system extracts text using Docling with layout detection, produces hierarchy-aware chunks preserving section structure, embeds each chunk, and stores the document with all chunks in the database

#### Scenario: Ingest a PDF with tables
- **WHEN** user ingests a PDF containing data tables
- **THEN** Docling's table reconstruction SHALL produce chunks where table content is represented as structured text (markdown table format) rather than garbled column fragments

#### Scenario: Ingest a scanned PDF with OCR auto mode
- **WHEN** user ingests a PDF where some pages contain only images (no extractable text) and OCR is set to `auto`
- **THEN** the system SHALL detect the image-only pages and apply OCR to those pages only, leaving text-extractable pages processed normally

#### Scenario: Ingest an image file
- **WHEN** user runs `kb add diagram.png`
- **THEN** the system SHALL process it through Docling with OCR enabled, extracting any text content from the image

### Requirement: Markdown ingestion with header-based splitting

The system SHALL split markdown and text files at header boundaries (`##`, `###`). Each chunk SHALL include its parent header chain as context. Sections smaller than `min_tokens` SHALL be merged with the following section. Sections larger than `max_tokens` SHALL be split at paragraph boundaries with configurable overlap.

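
A minimal sketch of the header-based split with a parent header chain, assuming only `##` and `###` count as boundaries. The function name `split_markdown` and the `" > "` separator are illustrative choices, and the sketch leaves out `min_tokens` merging, `max_tokens` splitting, and headers that appear inside fenced code blocks.

```python
import re
from typing import Dict, List, Tuple

# Matches only level-2 and level-3 ATX headers; other levels stay in the body.
HEADER_RE = re.compile(r"^(#{2,3})\s+(.*\S)\s*$")


def split_markdown(text: str) -> List[Tuple[str, str]]:
    """Split text at ## / ### headers into (header_chain, body) pairs."""
    chunks: List[Tuple[str, str]] = []
    chain: Dict[int, str] = {}  # header level -> current title at that level
    context = ""
    body: List[str] = []

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match is None:
            body.append(line)
            continue
        # A new header closes the section collected so far.
        if context or any(part.strip() for part in body):
            chunks.append((context, "\n".join(body).strip()))
        level = len(match.group(1))
        # Drop headers at the same or a deeper level, then record this one.
        chain = {lvl: title for lvl, title in chain.items() if lvl < level}
        chain[level] = match.group(2)
        context = " > ".join(chain[lvl] for lvl in sorted(chain))
        body = []

    if context or any(part.strip() for part in body):
        chunks.append((context, "\n".join(body).strip()))
    return chunks
```

Storing the chain separately from the body lets the caller decide how to prepend the context before embedding.
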
#### Scenario: Split markdown at headers
- **WHEN** user runs `kb add guide.md` and the file contains multiple `##` sections
- **THEN** each section becomes a separate chunk, with the header text included in the chunk

#### Scenario: Preserve header hierarchy
- **WHEN** a markdown file has nested headers like `## Config` > `### Advanced Options`
- **THEN** the chunk for "Advanced Options" SHALL include context indicating it falls under "Config > Advanced Options"

#### Scenario: Merge small sections
- **WHEN** a markdown section contains fewer tokens than `min_tokens` (default: 50)
- **THEN** it SHALL be merged with the next section into a single chunk

#### Scenario: Plain text file without headers
- **WHEN** user runs `kb add notes.txt` and the file has no markdown headers
- **THEN** the system SHALL fall back to fixed-size chunking with configurable `max_tokens` and `overlap_tokens`

### Requirement: Code ingestion with AST/regex splitting

The system SHALL split code files at function and class boundaries. Python files SHALL use the `ast` module. Bash and Go files SHALL use regex-based splitting. When `include_context` is enabled (default), class methods SHALL include the class docstring/signature for context. Files with no recognisable function/class boundaries SHALL fall back to fixed-size chunking.

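
For the Python case, the split can be sketched with the standard `ast` module. This is an illustration under assumptions rather than the specified implementation: `split_python` is a hypothetical name, the class context is rendered as a one-line comment built from the class name and the first docstring line, and nested classes and decorators are ignored.

```python
import ast
from typing import List

FUNCTION_NODES = (ast.FunctionDef, ast.AsyncFunctionDef)


def split_python(source: str, include_context: bool = True) -> List[str]:
    """Return one chunk per top-level function and per class method.

    An empty result means no boundaries were found, in which case the
    caller would fall back to fixed-size chunking.
    """
    tree = ast.parse(source)
    chunks: List[str] = []
    for node in tree.body:
        if isinstance(node, FUNCTION_NODES):
            chunks.append(ast.get_source_segment(source, node) or "")
        elif isinstance(node, ast.ClassDef):
            # Assumed context format: class name plus first docstring line.
            header = f"# class {node.name}"
            doc = ast.get_docstring(node)
            if doc:
                header += f": {doc.splitlines()[0]}"
            for item in node.body:
                if isinstance(item, FUNCTION_NODES):
                    code = ast.get_source_segment(source, item) or ""
                    chunks.append(f"{header}\n{code}" if include_context else code)
    return chunks
```

Note that `ast.get_source_segment` returns the method text starting at the `def` keyword, so the first line of a method chunk loses its indentation while the following lines keep theirs.
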
#### Scenario: Python file with functions and classes
- **WHEN** user runs `kb add auth.py` and the file contains a class with methods
- **THEN** each method becomes a chunk, and each chunk includes the class name and docstring as context

#### Scenario: Bash script with functions
- **WHEN** user runs `kb add deploy.sh` and the file contains `function deploy() {` blocks
- **THEN** each function becomes a separate chunk, including any preceding comment block

#### Scenario: Go file with functions
- **WHEN** user runs `kb add main.go` and the file contains `func` declarations
- **THEN** each function becomes a separate chunk

#### Scenario: Code file with no functions
- **WHEN** user runs `kb add script.sh` and the file has no function declarations
- **THEN** the system SHALL fall back to fixed-size chunking with `max_tokens` and `overlap_tokens`

### Requirement: Inline note ingestion

The system SHALL support adding text notes directly from the command line via `kb add --note "text"`. Notes SHALL be stored as a single chunk (no splitting). Notes MAY have an optional `--title` for display purposes.

#### Scenario: Add an inline note
- **WHEN** user runs `kb add --note "Always restart nginx after config changes" --title "nginx reminder"`
- **THEN** a document of type `note` is created with the title "nginx reminder", and the full text becomes a single chunk

#### Scenario: Add a note without title
- **WHEN** user runs `kb add --note "some text"`
- **THEN** the system SHALL use the first 80 characters of the text (truncated at a word boundary) as the title

### Requirement: Deduplication via content hash

The system SHALL compute a SHA-256 hash of each file's contents before ingestion. If a document with the same `content_hash` already exists in the database, the file SHALL be skipped with a message indicating it is already indexed.

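
The hash check can be sketched with `hashlib` from the standard library. In this sketch an in-memory set named `known_hashes` stands in for the database lookup on `content_hash`, and `should_ingest` is a hypothetical helper; both are assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Set


def content_hash(path: Path, block_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in blocks to bound memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def should_ingest(path: Path, known_hashes: Set[str]) -> bool:
    """Return False (and report a skip) if the file's hash is already known.

    known_hashes is a stand-in for a query against stored content_hash values.
    """
    file_hash = content_hash(path)
    if file_hash in known_hashes:
        print(f"Skipped: {path.name} (already indexed)")
        return False
    known_hashes.add(file_hash)
    return True
```

Because the hash covers file contents only, a renamed or moved copy of an indexed file is also skipped, while any edit to the contents produces a new hash and a new document.
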
#### Scenario: Add a file that is already indexed
- **WHEN** user runs `kb add report.pdf` and the file's SHA-256 matches an existing document
- **THEN** the system SHALL print "Skipped: report.pdf (already indexed)" and not create a duplicate

#### Scenario: Add a modified version of an existing file
- **WHEN** user runs `kb add report.pdf` and the file has changed since last indexed (different hash)
- **THEN** the system SHALL ingest it as a new document (the old version remains unless manually removed)

### Requirement: Batch ingestion with progress and resumability

The system SHALL support ingesting entire directories via `kb add --recursive`. Processing SHALL be resumable: files already indexed (by content hash) are skipped. Failed files SHALL be logged and skipped without aborting the batch. A summary SHALL be displayed at completion.

#### Scenario: Ingest a directory
- **WHEN** user runs `kb add ~/docs/ --recursive`
- **THEN** the system recursively finds all supported files, processes each one, skips duplicates, logs failures, and displays a summary: "Added X documents. Y failed. Z skipped (already indexed)."

#### Scenario: Resume after interruption
- **WHEN** a batch ingestion is interrupted (Ctrl+C, crash) and user reruns the same command
- **THEN** already-indexed files are skipped via content hash, and processing continues with remaining files

#### Scenario: Failed file during batch
- **WHEN** a single file fails to process (corrupt PDF, encoding error)
- **THEN** the error is logged to `~/.kb/ingest-errors.log` with the file path and error message, and processing continues with the next file

### Requirement: Parallel ingestion workers

The system SHALL support parallel document processing via configurable worker count (default: 4). Docling's `DocumentConverter` SHALL be used with multiple workers for PDF/DOCX/HTML ingestion. Database writes SHALL be serialised to avoid SQLite locking issues.

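
One way to combine concurrent conversion with serialised writes is to let workers convert documents and have only the calling thread touch SQLite. The sketch below assumes a thread pool, a caller-supplied `convert` callable standing in for the Docling conversion step, and a hypothetical `chunks (path, text)` table; none of these are fixed by the requirement, and per-file failure handling is omitted for brevity.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, List


def ingest_parallel(
    paths: Iterable[str],
    convert: Callable[[str], List[str]],
    db: sqlite3.Connection,
    workers: int = 4,
) -> int:
    """Convert documents concurrently and write their chunks one at a time.

    Returns the number of chunks written. All database access happens in
    the calling thread, so the connection is never shared across workers.
    """
    written = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(convert, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            chunks = future.result()
            with db:  # one transaction per document, committed on exit
                db.executemany(
                    "INSERT INTO chunks (path, text) VALUES (?, ?)",
                    [(path, chunk) for chunk in chunks],
                )
            written += len(chunks)
    return written
```

With `workers=1` the same code path processes documents sequentially, which matches the worker-count override.
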
#### Scenario: Parallel PDF ingestion
- **WHEN** user runs `kb add ~/pdfs/ --recursive` with `workers: 4` in config
- **THEN** up to 4 documents are processed concurrently through Docling, with chunks written to the database sequentially

#### Scenario: Override worker count
- **WHEN** user runs `kb add ~/pdfs/ --recursive --workers 1`
- **THEN** documents are processed sequentially with a single worker