## ADDED Requirements
### Requirement: File type detection and routing

The system SHALL detect the type of a file being ingested and route it to the appropriate ingestion pipeline. Detection SHALL be based on file extension. Supported types: PDF (`.pdf`), DOCX (`.docx`), HTML (`.html`, `.htm`), Markdown and plain text (`.md`, `.markdown`, `.txt`), code (`.py`, `.sh`, `.bash`, `.go`), and image files (`.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`, `.webp`). The user MAY override detection with the `--type` and `--language` flags.

#### Scenario: Auto-detect PDF file

- **WHEN** user runs `kb add report.pdf`
- **THEN** the file is routed to the Docling ingestion pipeline

#### Scenario: Auto-detect Python code

- **WHEN** user runs `kb add script.py`
- **THEN** the file is routed to the code ingestion pipeline with language set to `python`

#### Scenario: Override type detection

- **WHEN** user runs `kb add data.txt --type code --language bash`
- **THEN** the file is routed to the code pipeline as Bash, regardless of the `.txt` extension

#### Scenario: Unsupported file type

- **WHEN** user runs `kb add archive.zip`
- **THEN** the system SHALL print an error message listing supported formats and exit with non-zero status

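
The routing rules in this requirement can be sketched as a lookup table plus an override check. This is a minimal illustration, not the actual implementation: the pipeline labels (`docling`, `markdown`, `code`), the `route` function, and the `UnsupportedFileType` exception are hypothetical names chosen for the example; only the extensions and the `code`/`bash` override values come from the requirement text.

```python
from pathlib import Path

# Extensions are taken from the requirement; pipeline labels are hypothetical.
EXTENSION_ROUTES = {
    ".pdf": "docling", ".docx": "docling", ".html": "docling", ".htm": "docling",
    ".png": "docling", ".jpg": "docling", ".jpeg": "docling",
    ".tiff": "docling", ".bmp": "docling", ".webp": "docling",
    ".md": "markdown", ".markdown": "markdown", ".txt": "markdown",
    ".py": "code", ".sh": "code", ".bash": "code", ".go": "code",
}
CODE_LANGUAGES = {".py": "python", ".sh": "bash", ".bash": "bash", ".go": "go"}


class UnsupportedFileType(ValueError):
    """Raised when neither the extension nor an override selects a pipeline."""


def route(path, type_override=None, language_override=None):
    """Return (pipeline, language); explicit overrides win over the extension."""
    ext = Path(path).suffix.lower()
    pipeline = type_override or EXTENSION_ROUTES.get(ext)
    if pipeline is None:
        supported = ", ".join(sorted(EXTENSION_ROUTES))
        raise UnsupportedFileType(
            f"Unsupported file type '{ext}'. Supported: {supported}"
        )
    language = None
    if pipeline == "code":
        # Language only applies to the code pipeline.
        language = language_override or CODE_LANGUAGES.get(ext)
    return pipeline, language
```

A CLI wrapper would catch `UnsupportedFileType`, print the message, and exit with a non-zero status.
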
### Requirement: Docling pipeline for complex documents

The system SHALL use Docling to ingest PDF, DOCX, HTML, and image files. The pipeline SHALL use the `pypdfium2` backend for PDFs, enable the layout model for structural detection, and enable table reconstruction. OCR SHALL be configurable: `auto` (detect pages with no extractable text and OCR only those), `always`, or `never`.

#### Scenario: Ingest a text-based PDF

- **WHEN** user runs `kb add manual.pdf`
- **THEN** the system extracts text using Docling with layout detection, produces hierarchy-aware chunks preserving section structure, embeds each chunk, and stores the document with all chunks in the database

#### Scenario: Ingest a PDF with tables

- **WHEN** user ingests a PDF containing data tables
- **THEN** Docling's table reconstruction SHALL produce chunks where table content is represented as structured text (Markdown table format) rather than garbled column fragments

#### Scenario: Ingest a scanned PDF with OCR auto mode

- **WHEN** user ingests a PDF where some pages contain only images (no extractable text) and OCR is set to `auto`
- **THEN** the system SHALL detect the image-only pages and apply OCR to those pages only, leaving text-extractable pages processed normally

#### Scenario: Ingest an image file

- **WHEN** user runs `kb add diagram.png`
- **THEN** the system SHALL process it through Docling with OCR enabled, extracting any text content from the image

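
The three OCR modes reduce to a per-page decision. The sketch below shows only that decision and deliberately does not call Docling. It assumes the text extracted without OCR is already available per page, and `pages_needing_ocr` is a hypothetical helper name.

```python
def pages_needing_ocr(page_texts, mode="auto"):
    """Return zero-based indices of the pages that should be sent to OCR.

    page_texts holds the text extracted from each page *without* OCR.
    """
    if mode not in ("auto", "always", "never"):
        raise ValueError(f"unknown OCR mode: {mode!r}")
    if mode == "never":
        return []
    if mode == "always":
        return list(range(len(page_texts)))
    # auto: OCR only pages whose extracted text is empty or whitespace.
    return [i for i, text in enumerate(page_texts) if not text.strip()]
```

Treating whitespace-only pages as textless is an assumption; a real implementation might also apply a minimum character threshold.
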
### Requirement: Markdown ingestion with header-based splitting

The system SHALL split Markdown and plain text files at header boundaries (`##`, `###`). Each chunk SHALL include its parent header chain as context. Sections smaller than `min_tokens` SHALL be merged with the following section. Sections larger than `max_tokens` SHALL be split at paragraph boundaries with configurable overlap.

#### Scenario: Split markdown at headers

- **WHEN** user runs `kb add guide.md` and the file contains multiple `##` sections
- **THEN** each section becomes a separate chunk, with the header text included in the chunk

#### Scenario: Preserve header hierarchy

- **WHEN** a Markdown file has nested headers like `## Config` > `### Advanced Options`
- **THEN** the chunk for "Advanced Options" SHALL include its header chain, "Config > Advanced Options", as context

#### Scenario: Merge small sections

- **WHEN** a markdown section contains fewer tokens than `min_tokens` (default: 50)
- **THEN** it SHALL be merged with the next section into a single chunk

#### Scenario: Plain text file without headers

- **WHEN** user runs `kb add notes.txt` and the file has no markdown headers
- **THEN** the system SHALL fall back to fixed-size chunking with configurable `max_tokens` and `overlap_tokens`

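
One way the header splitting, header-chain context, and forward merge could fit together is sketched below. Several choices here are assumptions rather than part of the requirement: tokens are approximated by whitespace-separated words, a merged chunk keeps the header chain of its first section, and oversized-section splitting and the fixed-size fallback are left out for brevity. `split_markdown` is a hypothetical name.

```python
import re

HEADER_RE = re.compile(r"^(#{2,3})\s+(.+)$")  # only ## and ### start a section


def _tokens(lines):
    # Whitespace-separated words stand in for a real tokenizer (assumption).
    return len(" ".join(lines).split())


def split_markdown(text, min_tokens=50):
    """Split at ## / ### headers, then merge undersized sections forward."""
    sections, chain = [], {}
    current = {"path": [], "lines": []}
    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            if current["lines"]:
                sections.append(current)
            level = len(match.group(1))
            # Drop headers at the same or deeper level, then record this one.
            chain = {k: v for k, v in chain.items() if k < level}
            chain[level] = match.group(2).strip()
            current = {"path": [chain[k] for k in sorted(chain)], "lines": [line]}
        else:
            current["lines"].append(line)
    if current["lines"]:
        sections.append(current)

    chunks, pending = [], None
    for section in sections:
        if pending is not None:
            # Prepend the undersized predecessor; keep its header chain.
            section = {"path": pending["path"],
                       "lines": pending["lines"] + section["lines"]}
        if _tokens(section["lines"]) < min_tokens:
            pending = section
        else:
            chunks.append(section)
            pending = None
    if pending is not None:
        chunks.append(pending)  # nothing left to merge into
    return [{"context": " > ".join(s["path"]),
             "text": "\n".join(s["lines"]).strip()} for s in chunks]
```

This sketch does not skip fenced code blocks, so a `## ` line inside a code fence would be treated as a header.
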
### Requirement: Code ingestion with AST/regex splitting

The system SHALL split code files at function and class boundaries. Python files SHALL use the `ast` module. Bash and Go files SHALL use regex-based splitting. When `include_context` is enabled (default), chunks for class methods SHALL include the class signature and docstring as context. Files with no recognisable function/class boundaries SHALL fall back to fixed-size chunking.

#### Scenario: Python file with functions and classes

- **WHEN** user runs `kb add auth.py` and the file contains a class with methods
- **THEN** each method becomes a chunk, and each chunk includes the class name and docstring as context

#### Scenario: Bash script with functions

- **WHEN** user runs `kb add deploy.sh` and the file contains `function deploy() {` blocks
- **THEN** each function becomes a separate chunk, including any preceding comment block

#### Scenario: Go file with functions

- **WHEN** user runs `kb add main.go` and the file contains `func` declarations
- **THEN** each function becomes a separate chunk

#### Scenario: Code file with no functions

- **WHEN** user runs `kb add script.sh` and the file has no function declarations
- **THEN** the system SHALL fall back to fixed-size chunking with `max_tokens` and `overlap_tokens`

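
For the Python case, the `ast`-based split might look like the following sketch. It covers only top-level functions and the methods of top-level classes. The chunk dictionary shape and the `split_python` name are assumptions made for illustration, and an empty result is meant to signal that the caller should fall back to fixed-size chunking.

```python
import ast
import textwrap


def split_python(source, include_context=True):
    """Return one chunk per top-level function and per class method."""
    funcs = (ast.FunctionDef, ast.AsyncFunctionDef)
    chunks = []

    def add(node, name, context=""):
        # padded=True keeps the first line's indentation so dedent works.
        code = ast.get_source_segment(source, node, padded=True)
        chunks.append({"name": name, "context": context,
                       "code": textwrap.dedent(code)})

    for node in ast.parse(source).body:
        if isinstance(node, funcs):
            add(node, node.name)
        elif isinstance(node, ast.ClassDef):
            context = ""
            if include_context:
                doc = ast.get_docstring(node)
                context = f"class {node.name}" + (f"\n{doc}" if doc else "")
            for item in node.body:
                if isinstance(item, funcs):
                    add(item, f"{node.name}.{item.name}", context)
    return chunks  # empty list: no function or class boundaries found
```

Decorators and nested functions are not handled separately here; a fuller version would need to decide how to attach them.
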
### Requirement: Inline note ingestion

The system SHALL support adding text notes directly from the command line via `kb add --note "text"`. Notes SHALL be stored as a single chunk (no splitting). Notes MAY be given a `--title` for display purposes.

#### Scenario: Add an inline note

- **WHEN** user runs `kb add --note "Always restart nginx after config changes" --title "nginx reminder"`
- **THEN** a document of type `note` is created with the title "nginx reminder", and the full text becomes a single chunk

#### Scenario: Add a note without title

- **WHEN** user runs `kb add --note "some text"`
- **THEN** the system SHALL use the first 80 characters of the text (truncated at a word boundary) as the title

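
The default-title rule has a few edge cases (text shorter than the limit, a single word longer than the limit), so a small sketch may help. `default_title` is a hypothetical helper, and collapsing runs of whitespace before truncating is an assumption not stated in the requirement.

```python
def default_title(text, limit=80):
    """Derive a title from note text, cutting at a word boundary."""
    flat = " ".join(text.split())  # collapse newlines and repeated spaces
    if len(flat) <= limit:
        return flat
    # Look one character past the limit to see whether the cut falls
    # exactly between two words.
    window = flat[:limit + 1]
    if " " not in window:
        return flat[:limit]  # one very long word: hard cut
    return window.rsplit(" ", 1)[0]
```
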
### Requirement: Deduplication via content hash

The system SHALL compute a SHA-256 hash of each file's contents before ingestion. If a document with the same `content_hash` already exists in the database, the file SHALL be skipped with a message indicating it is already indexed.

#### Scenario: Add a file that is already indexed

- **WHEN** user runs `kb add report.pdf` and the file's SHA-256 matches an existing document
- **THEN** the system SHALL print "Skipped: report.pdf (already indexed)" and not create a duplicate

#### Scenario: Add a modified version of an existing file

- **WHEN** user runs `kb add report.pdf` and the file has changed since it was last indexed (different hash)
- **THEN** the system SHALL ingest it as a new document (the old version remains unless manually removed)

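
The hash check can be sketched with the standard library. The example assumes the set of known hashes has already been loaded from the database; `content_hash` and `skip_message` are hypothetical helper names, and showing only the file name in the message is a guess based on the scenario's wording.

```python
import hashlib
from pathlib import Path


def content_hash(path, block_size=1 << 20):
    """SHA-256 of the file's bytes, read in blocks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def skip_message(path, known_hashes):
    """Return the skip message if the file is already indexed, else None."""
    if content_hash(path) in known_hashes:
        return f"Skipped: {Path(path).name} (already indexed)"
    return None
```

Because the hash covers content only, a renamed but unchanged file is also treated as already indexed.
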
### Requirement: Batch ingestion with progress and resumability

The system SHALL support ingesting entire directories via `kb add <dir> --recursive`. Processing SHALL be resumable: files already indexed (identified by content hash) SHALL be skipped. Failed files SHALL be logged and skipped without aborting the batch. A summary SHALL be displayed at completion.

#### Scenario: Ingest a directory

- **WHEN** user runs `kb add ~/docs/ --recursive`
- **THEN** the system recursively finds all supported files, processes each one, skips duplicates, logs failures, and displays a summary: "Added X documents. Y failed. Z skipped (already indexed)."

#### Scenario: Resume after interruption

- **WHEN** a batch ingestion is interrupted (Ctrl+C, crash) and user reruns the same command
- **THEN** already-indexed files are skipped via content hash, and processing continues with remaining files

#### Scenario: Failed file during batch

- **WHEN** a single file fails to process (corrupt PDF, encoding error)
- **THEN** the error is logged to `~/.kb/ingest-errors.log` with the file path and error message, and processing continues with the next file

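
The batch loop described above could be structured roughly as follows. The callbacks `ingest_one` and `is_indexed`, and the function name `ingest_directory`, are hypothetical; they stand in for the per-file pipeline and the content-hash lookup. The summary string follows the wording of the directory scenario.

```python
from pathlib import Path


def ingest_directory(root, ingest_one, is_indexed, supported_exts, log_path):
    """Walk root recursively; skip indexed files, log failures, keep going."""
    added = failed = skipped = 0
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in supported_exts:
            continue
        try:
            if is_indexed(path):
                skipped += 1
                continue
            ingest_one(path)
            added += 1
        except Exception as exc:
            # KeyboardInterrupt is not an Exception subclass, so Ctrl+C
            # still stops the batch; a rerun skips what was already indexed.
            failed += 1
            with open(log_path, "a", encoding="utf-8") as log:
                log.write(f"{path}\t{exc}\n")
    return (f"Added {added} documents. {failed} failed. "
            f"{skipped} skipped (already indexed).")
```

Resumability falls out of the skip check: no separate checkpoint file is needed as long as each successfully ingested file is recorded before the next one starts.
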
### Requirement: Parallel ingestion workers

The system SHALL support parallel document processing via a configurable worker count (default: 4). Docling's `DocumentConverter` SHALL be used with multiple workers for PDF/DOCX/HTML ingestion. Database writes SHALL be serialised to avoid SQLite locking issues.

#### Scenario: Parallel PDF ingestion

- **WHEN** user runs `kb add ~/pdfs/ --recursive` with `workers: 4` in config
- **THEN** up to 4 documents are processed concurrently through Docling, with chunks written to the database sequentially

#### Scenario: Override worker count

- **WHEN** user runs `kb add ~/pdfs/ --recursive --workers 1`
- **THEN** documents are processed sequentially with a single worker

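
One pattern that satisfies both halves of this requirement is to convert documents in a worker pool and perform every database write from the single calling thread. The sketch below uses a thread pool as a stand-in for whatever worker mechanism the implementation chooses; the `convert` callback, the `chunks` table, and `ingest_parallel` are hypothetical.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor, as_completed


def ingest_parallel(paths, convert, conn, workers=4):
    """Run convert(path) -> list[str] concurrently; write rows serially."""
    written = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(convert, p): p for p in paths}
        for future in as_completed(futures):
            chunks = future.result()
            # Only this thread touches the connection, so SQLite never
            # sees concurrent writers.
            conn.executemany(
                "INSERT INTO chunks (path, text) VALUES (?, ?)",
                [(str(futures[future]), chunk) for chunk in chunks],
            )
            conn.commit()
            written += len(chunks)
    return written
```

With `workers=1` the same code degrades to sequential processing, which matches the override scenario. For CPU-bound conversion a process pool may be a better fit than threads.
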