ADDED Requirements
Requirement: File type detection and routing
The system SHALL detect the type of a file being ingested and route it to the appropriate ingestion pipeline. Detection SHALL be based on file extension. Supported types: PDF (.pdf), DOCX (.docx), HTML (.html, .htm), Markdown (.md, .markdown, .txt), Code (.py, .sh, .bash, .go), and image files (.png, .jpg, .jpeg, .tiff, .bmp, .webp). The user MAY override detection with --type and --language flags.
Scenario: Auto-detect PDF file
- WHEN user runs kb add report.pdf
- THEN the file is routed to the Docling ingestion pipeline
Scenario: Auto-detect Python code
- WHEN user runs kb add script.py
- THEN the file is routed to the code ingestion pipeline with language set to python
Scenario: Override type detection
- WHEN user runs kb add data.txt --type code --language bash
- THEN the file is routed to the code pipeline as Bash, regardless of the .txt extension
Scenario: Unsupported file type
- WHEN user runs kb add archive.zip
- THEN the system SHALL print an error message listing supported formats and exit with non-zero status
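The routing rules above can be sketched as a lookup table keyed by extension. This is a minimal illustration, not the actual implementation: the pipeline names ("docling", "markdown", "code"), the function `route`, and the exception `UnsupportedFileType` are assumptions introduced here; only the extension list and the override behaviour come from the requirement.

```python
from pathlib import Path

# Extensions are taken from the requirement; pipeline names are hypothetical.
EXTENSION_ROUTES = {
    ".pdf": "docling", ".docx": "docling", ".html": "docling", ".htm": "docling",
    ".png": "docling", ".jpg": "docling", ".jpeg": "docling",
    ".tiff": "docling", ".bmp": "docling", ".webp": "docling",
    ".md": "markdown", ".markdown": "markdown", ".txt": "markdown",
    ".py": "code", ".sh": "code", ".bash": "code", ".go": "code",
}
CODE_LANGUAGES = {".py": "python", ".sh": "bash", ".bash": "bash", ".go": "go"}


class UnsupportedFileType(ValueError):
    """Raised when neither the extension nor an override selects a pipeline."""


def route(path, type_override=None, language_override=None):
    """Return (pipeline, language) for a file; overrides win over detection."""
    ext = Path(path).suffix.lower()
    pipeline = type_override or EXTENSION_ROUTES.get(ext)
    if pipeline is None:
        supported = ", ".join(sorted(EXTENSION_ROUTES))
        raise UnsupportedFileType(
            f"Unsupported file type '{ext}'. Supported formats: {supported}"
        )
    language = None
    if pipeline == "code":
        # Language only applies to the code pipeline.
        language = language_override or CODE_LANGUAGES.get(ext)
    return pipeline, language
```

A CLI wrapper would catch `UnsupportedFileType`, print its message, and exit with a non-zero status.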
Requirement: Docling pipeline for complex documents
The system SHALL use Docling to ingest PDF, DOCX, HTML, and image files. The pipeline SHALL use the pypdfium2 backend for PDFs, enable the layout model for structural detection, and enable table reconstruction. OCR SHALL be configurable with three modes: auto (detect pages with no extractable text and OCR only those), always, or never.
Scenario: Ingest a text-based PDF
- WHEN user runs kb add manual.pdf
- THEN the system extracts text using Docling with layout detection, produces hierarchy-aware chunks preserving section structure, embeds each chunk, and stores the document with all chunks in the database
Scenario: Ingest a PDF with tables
- WHEN user ingests a PDF containing data tables
- THEN Docling's table reconstruction SHALL produce chunks where table content is represented as structured text (markdown table format) rather than garbled column fragments
Scenario: Ingest a scanned PDF with OCR auto mode
- WHEN user ingests a PDF where some pages contain only images (no extractable text) and OCR is set to auto
- THEN the system SHALL detect the image-only pages and apply OCR to those pages only, leaving text-extractable pages processed normally
Scenario: Ingest an image file
- WHEN user runs kb add diagram.png
- THEN the system SHALL process it through Docling with OCR enabled, extracting any text content from the image
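The three OCR modes can be illustrated independently of Docling. The sketch below assumes the text already extracted from each page is available as a list of strings; the function name `pages_needing_ocr` is hypothetical, and how the result is passed to Docling is not shown because the requirement does not specify it.

```python
def pages_needing_ocr(page_texts, mode="auto"):
    """Return the zero-based indices of pages that should be sent to OCR.

    page_texts: one string per page, holding whatever text was extractable.
    mode: "auto", "always", or "never", as named in the requirement.
    """
    if mode == "always":
        return list(range(len(page_texts)))
    if mode == "never":
        return []
    if mode == "auto":
        # A page with no extractable text is treated as image-only.
        return [i for i, text in enumerate(page_texts) if not text.strip()]
    raise ValueError(f"Unknown OCR mode: {mode!r}")
```

For a three-page PDF whose second page is a scan, auto mode selects only that page.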
Requirement: Markdown ingestion with header-based splitting
The system SHALL split markdown and text files at header boundaries (##, ###). Each chunk SHALL include its parent header chain as context. Sections smaller than min_tokens SHALL be merged with the following section. Sections larger than max_tokens SHALL be split at paragraph boundaries with configurable overlap.
Scenario: Split markdown at headers
- WHEN user runs kb add guide.md and the file contains multiple ## sections
- THEN each section becomes a separate chunk, with the header text included in the chunk
Scenario: Preserve header hierarchy
- WHEN a markdown file has nested headers like ## Config > ### Advanced Options
- THEN the chunk for "Advanced Options" SHALL include context indicating it falls under "Config > Advanced Options"
Scenario: Merge small sections
- WHEN a markdown section contains fewer tokens than min_tokens (default: 50)
- THEN it SHALL be merged with the next section into a single chunk
Scenario: Plain text file without headers
- WHEN user runs kb add notes.txt and the file has no markdown headers
- THEN the system SHALL fall back to fixed-size chunking with configurable max_tokens and overlap_tokens
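Header-based splitting with a parent header chain could look roughly like the following. It is a sketch under stated assumptions: the function name `split_markdown` and the chunk dictionary shape are invented for illustration, and the min_tokens merge, max_tokens split, and fixed-size fallback are deliberately left out. It also does not special-case header-like lines inside fenced code blocks.

```python
import re

# Matches only level-2 and level-3 headers, as in the requirement.
HEADER_RE = re.compile(r"^(#{2,3})\s+(.*\S)\s*$")


def split_markdown(text):
    """Split text at ## and ### headers, attaching the parent header chain."""
    chunks = []
    chain = {}  # header level -> title currently in effect
    current = {"context": "", "lines": []}

    def flush():
        body = "\n".join(current["lines"]).strip()
        if body:
            chunks.append({"context": current["context"], "text": body})

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            chain[level] = match.group(2)
            # A new header ends any deeper headers that were open.
            for deeper in [k for k in chain if k > level]:
                del chain[deeper]
            context = " > ".join(chain[k] for k in sorted(chain))
            # The header line itself stays in the chunk text.
            current = {"context": context, "lines": [line]}
        else:
            current["lines"].append(line)
    flush()
    return chunks
```

A file with no matching headers yields a single chunk with an empty context, which is the point at which a caller would switch to fixed-size chunking.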
Requirement: Code ingestion with AST/regex splitting
The system SHALL split code files at function and class boundaries. Python files SHALL use the ast module. Bash and Go files SHALL use regex-based splitting. When include_context is enabled (default), class methods SHALL include the class docstring/signature for context. Files with no recognisable function/class boundaries SHALL fall back to fixed-size chunking.
Scenario: Python file with functions and classes
- WHEN user runs kb add auth.py and the file contains a class with methods
- THEN each method becomes a chunk, and each chunk includes the class name and docstring as context
Scenario: Bash script with functions
- WHEN user runs kb add deploy.sh and the file contains function deploy() { blocks
- THEN each function becomes a separate chunk, including any preceding comment block
Scenario: Go file with functions
- WHEN user runs kb add main.go and the file contains func declarations
- THEN each function becomes a separate chunk
Scenario: Code file with no functions
- WHEN user runs kb add script.sh and the file has no function declarations
- THEN the system SHALL fall back to fixed-size chunking with max_tokens and overlap_tokens
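For the Python case, splitting at function and class boundaries with the standard ast module might be sketched as follows. The function name `split_python` and the chunk fields are hypothetical; the sketch only handles top-level functions and methods directly inside top-level classes, and `ast.get_source_segment` (Python 3.8+) returns the source from the `def` line onward, so decorators are not included.

```python
import ast

FUNCTION_NODES = (ast.FunctionDef, ast.AsyncFunctionDef)


def split_python(source):
    """Return one chunk per top-level function and per class method."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, FUNCTION_NODES):
            chunks.append({
                "name": node.name,
                "context": "",
                "text": ast.get_source_segment(source, node),
            })
        elif isinstance(node, ast.ClassDef):
            docstring = ast.get_docstring(node) or ""
            # Class name plus docstring is carried along as context.
            context = f"class {node.name}"
            if docstring:
                context += f": {docstring}"
            for item in node.body:
                if isinstance(item, FUNCTION_NODES):
                    chunks.append({
                        "name": f"{node.name}.{item.name}",
                        "context": context,
                        "text": ast.get_source_segment(source, item),
                    })
    return chunks
```

An empty result signals that no function or class boundaries were found, so the caller can fall back to fixed-size chunking.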
Requirement: Inline note ingestion
The system SHALL support adding text notes directly from the command line via kb add --note "text". Notes SHALL be stored as a single chunk (no splitting). Notes MAY have an optional --title for display purposes.
Scenario: Add an inline note
- WHEN user runs kb add --note "Always restart nginx after config changes" --title "nginx reminder"
- THEN a document of type note is created with the title "nginx reminder", and the full text becomes a single chunk
Scenario: Add a note without title
- WHEN user runs kb add --note "some text"
- THEN the system SHALL use the first 80 characters of the text (truncated at a word boundary) as the title
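One way to derive the default title is sketched below. The function name `default_title` is hypothetical, and two behaviours are assumptions not stated in the requirement: whitespace runs are collapsed to single spaces before truncation, and a single word longer than the limit is cut hard at the limit.

```python
def default_title(text, limit=80):
    """Return at most `limit` characters of text, cut at a word boundary."""
    text = " ".join(text.split())  # assumption: collapse whitespace runs
    if len(text) <= limit:
        return text
    head = text[:limit]
    # If the cut lands inside a word, back up to the previous space.
    if text[limit] != " " and " " in head:
        head = head[:head.rfind(" ")]
    return head.rstrip()
```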
Requirement: Deduplication via content hash
The system SHALL compute a SHA-256 hash of each file's contents before ingestion. If a document with the same content_hash already exists in the database, the file SHALL be skipped with a message indicating it is already indexed.
Scenario: Add a file that is already indexed
- WHEN user runs kb add report.pdf and the file's SHA-256 matches an existing document
- THEN the system SHALL print "Skipped: report.pdf (already indexed)" and not create a duplicate
Scenario: Add a modified version of an existing file
- WHEN user runs kb add report.pdf and the file has changed since last indexed (different hash)
- THEN the system SHALL ingest it as a new document (the old version remains unless manually removed)
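The hash check can be sketched with the standard hashlib module. The helper names `content_hash` and `should_skip` are hypothetical, and the set of known hashes stands in for a database lookup on content_hash, whose schema the requirement does not describe.

```python
import hashlib
import os


def content_hash(path, block_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def should_skip(path, known_hashes):
    """Return True and report the skip if the file's hash is already known."""
    if content_hash(path) in known_hashes:
        print(f"Skipped: {os.path.basename(path)} (already indexed)")
        return True
    return False
```

Because the hash covers contents only, a renamed copy of an indexed file is skipped, while an edited file under the same name is ingested as a new document.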
Requirement: Batch ingestion with progress and resumability
The system SHALL support ingesting entire directories via kb add <dir> --recursive. Processing SHALL be resumable: files already indexed (by content hash) are skipped. Failed files SHALL be logged and skipped without aborting the batch. A summary SHALL be displayed at completion.
Scenario: Ingest a directory
- WHEN user runs kb add ~/docs/ --recursive
- THEN the system recursively finds all supported files, processes each one, skips duplicates, logs failures, and displays a summary: "Added X documents. Y failed. Z skipped (already indexed)."
Scenario: Resume after interruption
- WHEN a batch ingestion is interrupted (Ctrl+C, crash) and user reruns the same command
- THEN already-indexed files are skipped via content hash, and processing continues with remaining files
Scenario: Failed file during batch
- WHEN a single file fails to process (corrupt PDF, encoding error)
- THEN the error is logged to ~/.kb/ingest-errors.log with the file path and error message, and processing continues with the next file
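The batch behaviour described in these scenarios can be sketched as a loop that counts outcomes and never lets one file abort the run. Everything named here is illustrative: `ingest_directory`, the `ingest_one` callback (assumed to return True when a document was added and False when it was skipped as a duplicate), and the `is_supported` predicate. The log path is a parameter; the requirement places it at ~/.kb/ingest-errors.log.

```python
from pathlib import Path


def ingest_directory(root, ingest_one, is_supported, error_log):
    """Ingest every supported file under root and return the summary line."""
    added = failed = skipped = 0
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or not is_supported(path):
            continue
        try:
            if ingest_one(path):
                added += 1
            else:
                skipped += 1
        except Exception as exc:
            # Log the failure and move on to the next file.
            failed += 1
            with open(error_log, "a", encoding="utf-8") as log:
                log.write(f"{path}\t{exc}\n")
    summary = (
        f"Added {added} documents. {failed} failed. "
        f"{skipped} skipped (already indexed)."
    )
    print(summary)
    return summary
```

Catching `Exception` rather than `BaseException` means Ctrl+C still stops the run; rerunning the same command then skips whatever was already indexed.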
Requirement: Parallel ingestion workers
The system SHALL support parallel document processing via configurable worker count (default: 4). Docling's DocumentConverter SHALL be used with multiple workers for PDF/DOCX/HTML ingestion. Database writes SHALL be serialised to avoid SQLite locking issues.
Scenario: Parallel PDF ingestion
- WHEN user runs kb add ~/pdfs/ --recursive with workers: 4 in config
- THEN up to 4 documents are processed concurrently through Docling, with chunks written to the database sequentially
Scenario: Override worker count
- WHEN user runs kb add ~/pdfs/ --recursive --workers 1
- THEN documents are processed sequentially with a single worker
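One possible shape for concurrent conversion with serialised writes is shown below. It is a sketch, not the Docling integration: `convert` is a hypothetical callable standing in for the Docling conversion step and is assumed to return a list of chunk strings, the `chunks` table is an invented schema, and a thread pool is used only for illustration. All database writes happen on the calling thread, which is what keeps them sequential.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor, as_completed


def ingest_parallel(paths, convert, conn, workers=4):
    """Convert documents concurrently; write chunks from one thread only."""
    conn.execute("CREATE TABLE IF NOT EXISTS chunks (path TEXT, text TEXT)")
    written = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(convert, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            chunks = future.result()
            # One transaction per document, executed on the calling thread.
            with conn:
                conn.executemany(
                    "INSERT INTO chunks (path, text) VALUES (?, ?)",
                    [(str(path), chunk) for chunk in chunks],
                )
            written += len(chunks)
    return written
```

Setting `workers=1` reduces this to sequential processing, matching the override scenario.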