kb/openspec/changes/kb-search/specs/document-ingestion/spec.md at 2030976b8546a0bafb0402a94e5a85457caf28e7

steve f245c24928 Initial MVP

2026-03-23 20:38:42 +00:00

ADDED Requirements

Requirement: File type detection and routing

The system SHALL detect the type of a file being ingested and route it to the appropriate ingestion pipeline. Detection SHALL be based on the file extension, compared case-insensitively. Supported types: PDF (.pdf), DOCX (.docx), HTML (.html, .htm), Markdown and plain text (.md, .markdown, .txt), Code (.py, .sh, .bash, .go), and image files (.png, .jpg, .jpeg, .tiff, .bmp, .webp). The user MAY override detection with the --type and --language flags.

Scenario: Auto-detect PDF file

WHEN user runs kb add report.pdf
THEN the file is routed to the Docling ingestion pipeline

Scenario: Auto-detect Python code

WHEN user runs kb add script.py
THEN the file is routed to the code ingestion pipeline with language set to python

Scenario: Override type detection

WHEN user runs kb add data.txt --type code --language bash
THEN the file is routed to the code pipeline as Bash, regardless of the .txt extension

Scenario: Unsupported file type

WHEN user runs kb add archive.zip
THEN the system SHALL print an error message listing supported formats and exit with non-zero status
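
Non-normative sketch of the routing rule above, assuming a simple lookup table keyed by extension. The extension list comes from the requirement; the pipeline labels ("docling", "markdown", "code"), the function name route, and the exception class are hypothetical names chosen for illustration.

```python
from pathlib import Path

# Extension table from the requirement. Pipeline labels are hypothetical.
EXTENSION_ROUTES = {
    ".pdf": ("docling", None),
    ".docx": ("docling", None),
    ".html": ("docling", None),
    ".htm": ("docling", None),
    ".png": ("docling", None),
    ".jpg": ("docling", None),
    ".jpeg": ("docling", None),
    ".tiff": ("docling", None),
    ".bmp": ("docling", None),
    ".webp": ("docling", None),
    ".md": ("markdown", None),
    ".markdown": ("markdown", None),
    ".txt": ("markdown", None),
    ".py": ("code", "python"),
    ".sh": ("code", "bash"),
    ".bash": ("code", "bash"),
    ".go": ("code", "go"),
}


class UnsupportedFileType(ValueError):
    """Raised when the extension is not in the routing table."""


def route(path, type_override=None, language_override=None):
    """Return (pipeline, language) for a path, honouring --type/--language."""
    if type_override is not None:
        # An explicit --type wins regardless of the extension.
        return type_override, language_override
    ext = Path(path).suffix.lower()
    if ext not in EXTENSION_ROUTES:
        supported = ", ".join(sorted(EXTENSION_ROUTES))
        raise UnsupportedFileType(
            f"Unsupported file type {ext!r}. Supported formats: {supported}"
        )
    pipeline, language = EXTENSION_ROUTES[ext]
    return pipeline, language_override or language
```

A CLI wrapper would catch UnsupportedFileType, print its message, and exit with a non-zero status.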

Requirement: Docling pipeline for complex documents

The system SHALL use Docling to ingest PDF, DOCX, HTML, and image files. The pipeline SHALL use the pypdfium2 backend for PDFs, enable the layout model for structural detection, and enable table reconstruction. OCR SHALL be configurable with one of three modes: auto (detect pages with no extractable text and OCR only those), always, or never.

Scenario: Ingest a text-based PDF

WHEN user runs kb add manual.pdf
THEN the system extracts text using Docling with layout detection, produces hierarchy-aware chunks preserving section structure, embeds each chunk, and stores the document with all chunks in the database

Scenario: Ingest a PDF with tables

WHEN user ingests a PDF containing data tables
THEN Docling's table reconstruction SHALL produce chunks where table content is represented as structured text (markdown table format) rather than garbled column fragments

Scenario: Ingest a scanned PDF with OCR auto mode

WHEN user ingests a PDF where some pages contain only images (no extractable text) and OCR is set to auto
THEN the system SHALL detect the image-only pages and apply OCR to those pages only, leaving text-extractable pages processed normally

Scenario: Ingest an image file

WHEN user runs kb add diagram.png
THEN the system SHALL process it through Docling with OCR enabled, extracting any text content from the image
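
Non-normative sketch of the three OCR modes, reduced to the page-selection decision only. It assumes the text layer of each page has already been extracted into a list of strings; how that extraction and the OCR itself are wired into Docling is not shown here, and the function name pages_needing_ocr is hypothetical.

```python
def pages_needing_ocr(page_texts, mode="auto"):
    """Return the 0-based indices of pages that should be sent to OCR.

    page_texts: one string per page, holding whatever text was extractable.
    mode: "auto", "always", or "never", as named in the requirement.
    """
    if mode not in ("auto", "always", "never"):
        raise ValueError(f"unknown OCR mode: {mode!r}")
    if mode == "never":
        return []
    if mode == "always":
        return list(range(len(page_texts)))
    # auto: only pages whose text layer is empty or whitespace-only.
    return [i for i, text in enumerate(page_texts) if not text.strip()]
```

An image file such as diagram.png has no text layer at all, so under this sketch it behaves like a one-page document whose only page is selected for OCR.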

Requirement: Markdown ingestion with header-based splitting

The system SHALL split markdown and text files at header boundaries (##, ###). Each chunk SHALL include its parent header chain as context. Sections smaller than min_tokens SHALL be merged with the following section. Sections larger than max_tokens SHALL be split at paragraph boundaries with configurable overlap.

Scenario: Split markdown at headers

WHEN user runs kb add guide.md and the file contains multiple ## sections
THEN each section becomes a separate chunk, with the header text included in the chunk

Scenario: Preserve header hierarchy

WHEN a markdown file has nested headers like ## Config > ### Advanced Options
THEN the chunk for "Advanced Options" SHALL include context indicating it falls under "Config > Advanced Options"
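
Non-normative sketch of header-based splitting with a parent header chain. Assumptions that go beyond the requirement: it splits at every header level (not only ## and ###) so that the chain can be tracked, it does not special-case # lines inside fenced code blocks, and it omits min_tokens merging and max_tokens splitting. The function name and the chunk dictionary shape are hypothetical.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def split_markdown(text):
    """Split markdown at header lines; each chunk carries its header chain."""
    chunks = []
    chain = []  # stack of (level, title) for the currently open headers
    body = []   # lines collected since the last header

    def flush():
        content = "\n".join(body).strip()
        if content:
            context = " > ".join(title for _, title in chain)
            chunks.append({"context": context, "text": content})
        body.clear()

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match is None:
            body.append(line)
            continue
        flush()  # close the previous section under the old chain
        level = len(match.group(1))
        # Drop headers at the same or a deeper level before pushing this one.
        while chain and chain[-1][0] >= level:
            chain.pop()
        chain.append((level, match.group(2)))
    flush()
    return chunks
```

For a file with "## Config" followed by "### Advanced Options", the second chunk's context comes out as "Config > Advanced Options".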

Scenario: Merge small sections

WHEN a markdown section contains fewer tokens than min_tokens (default: 50)
THEN it SHALL be merged with the next section into a single chunk

Scenario: Plain text file without headers

WHEN user runs kb add notes.txt and the file has no markdown headers
THEN the system SHALL fall back to fixed-size chunking with configurable max_tokens and overlap_tokens
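
Non-normative sketch of the fixed-size fallback. It approximates tokens by whitespace-separated words, which is an assumption; the real tokenizer is not specified in this section. The default values 200 and 20 are placeholders, not values from the spec, and the function name is hypothetical.

```python
def chunk_fixed(text, max_tokens=200, overlap_tokens=20):
    """Split text into windows of max_tokens words, each sharing
    overlap_tokens words with the previous window."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be in [0, max_tokens)")
    tokens = text.split()  # word-level stand-in for real tokenization
    step = max_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the text
    return chunks
```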

Requirement: Code ingestion with AST/regex splitting

The system SHALL split code files at function and class boundaries. Python files SHALL use the ast module. Bash and Go files SHALL use regex-based splitting. When include_context is enabled (default), class methods SHALL include the class name and docstring as context. Files with no recognisable function/class boundaries SHALL fall back to fixed-size chunking.

Scenario: Python file with functions and classes

WHEN user runs kb add auth.py and the file contains a class with methods
THEN each method becomes a chunk, and each chunk includes the class name and docstring as context
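
Non-normative sketch of ast-based splitting for Python files. It uses only the standard library (ast.parse, ast.get_docstring, ast.get_source_segment). The function name, the chunk dictionary shape, and the exact format of the context string are hypothetical; nested functions and decorator lines are not handled.

```python
import ast

FUNC_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef)


def split_python(source):
    """Return one chunk per top-level function and per class method."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, FUNC_TYPES):
            chunks.append({
                "name": node.name,
                "context": "",
                "code": ast.get_source_segment(source, node),
            })
        elif isinstance(node, ast.ClassDef):
            doc = ast.get_docstring(node) or ""
            # Context carried by every method chunk: class name + docstring.
            context = f"class {node.name}" + (f": {doc}" if doc else "")
            for item in node.body:
                if isinstance(item, FUNC_TYPES):
                    chunks.append({
                        "name": f"{node.name}.{item.name}",
                        "context": context,
                        "code": ast.get_source_segment(source, item),
                    })
    return chunks
```

An empty result signals that no function or class boundaries were found, which is the case where the fixed-size fallback would apply.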

Scenario: Bash script with functions

WHEN user runs kb add deploy.sh and the file contains function definitions such as function deploy() { ... }
THEN each function becomes a separate chunk, including any preceding comment block

Scenario: Go file with functions

WHEN user runs kb add main.go and the file contains func declarations
THEN each function becomes a separate chunk
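
Non-normative sketch of regex-based splitting for Bash and Go. The patterns are illustrative assumptions rather than the spec's own: they recognise "function name", "name()" and "func Name(" (with an optional Go receiver) at the start of a line. Each chunk runs from its leading comment block to the start of the next function, so trailing top-level code ends up attached to the last function.

```python
import re

# Illustrative patterns; real-world scripts may need more cases.
FUNC_START = {
    "bash": re.compile(r"^\s*(?:function\s+([\w-]+)\s*(?:\(\s*\))?|([\w-]+)\s*\(\s*\))"),
    "go": re.compile(r"^func\s+(?:\([^)]*\)\s*)?(\w+)\s*\("),
}
COMMENT_PREFIX = {"bash": "#", "go": "//"}


def split_functions(source, language):
    """Split Bash or Go source into one chunk per function definition."""
    pattern = FUNC_START[language]
    comment = COMMENT_PREFIX[language]
    lines = source.splitlines()
    starts = []  # (index of first line of the chunk, function name)
    for i, line in enumerate(lines):
        match = pattern.match(line)
        if match is None:
            continue
        name = next(g for g in match.groups() if g)
        begin = i
        # Pull in the contiguous comment block directly above the function.
        while begin > 0 and lines[begin - 1].lstrip().startswith(comment):
            begin -= 1
        starts.append((begin, name))
    chunks = []
    for idx, (begin, name) in enumerate(starts):
        end = starts[idx + 1][0] if idx + 1 < len(starts) else len(lines)
        chunks.append({"name": name, "code": "\n".join(lines[begin:end]).strip()})
    return chunks
```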

Scenario: Code file with no functions

WHEN user runs kb add script.sh and the file has no function declarations
THEN the system SHALL fall back to fixed-size chunking with max_tokens and overlap_tokens

Requirement: Inline note ingestion

The system SHALL support adding text notes directly from the command line via kb add --note "text". Notes SHALL be stored as a single chunk (no splitting). Notes MAY have an optional --title for display purposes.

Scenario: Add an inline note

WHEN user runs kb add --note "Always restart nginx after config changes" --title "nginx reminder"
THEN a document of type note is created with the title "nginx reminder", and the full text becomes a single chunk

Scenario: Add a note without title

WHEN user runs kb add --note "some text"
THEN the system SHALL use the first 80 characters of the text (truncated at a word boundary) as the title
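
Non-normative sketch of the default-title rule. Two details are assumptions not stated above: whitespace runs (including newlines) are collapsed to single spaces first, and a note with no space in its first 80 characters is hard-cut at the limit. The function name is hypothetical.

```python
def default_title(text, limit=80):
    """Derive a title from note text: at most `limit` characters,
    cut at a word boundary where one exists."""
    collapsed = " ".join(text.split())
    if len(collapsed) <= limit:
        return collapsed
    # Look one character past the limit so a word that ends exactly
    # at the limit is kept whole.
    window = collapsed[:limit + 1]
    head, sep, _ = window.rpartition(" ")
    return head if sep else collapsed[:limit]
```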

Requirement: Deduplication via content hash

The system SHALL compute a SHA-256 hash of each file's contents before ingestion. If a document with the same content_hash already exists in the database, the file SHALL be skipped with a message indicating it is already indexed.

Scenario: Add a file that is already indexed

WHEN user runs kb add report.pdf and the file's SHA-256 matches an existing document
THEN the system SHALL print "Skipped: report.pdf (already indexed)" and not create a duplicate

Scenario: Add a modified version of an existing file

WHEN user runs kb add report.pdf and the file has changed since last indexed (different hash)
THEN the system SHALL ingest it as a new document (the old version remains unless manually removed)
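
Non-normative sketch of the content-hash check using hashlib from the standard library. The set known_hashes stands in for a lookup against the content_hash column in the database; that stand-in and both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def content_hash(path, block_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in blocks so that
    large files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def should_ingest(path, known_hashes):
    """Return (ingest?, digest). Prints the skip message for duplicates."""
    digest = content_hash(path)
    if digest in known_hashes:
        print(f"Skipped: {Path(path).name} (already indexed)")
        return False, digest
    return True, digest
```

Because the key is the content hash rather than the path, a modified report.pdf produces a new digest and is ingested as a new document, while a renamed but unchanged copy is skipped.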

Requirement: Batch ingestion with progress and resumability

The system SHALL support ingesting entire directories via kb add <dir> --recursive. Processing SHALL be resumable — files already indexed (by content hash) are skipped. Failed files SHALL be logged and skipped without aborting the batch. A summary SHALL be displayed at completion.

Scenario: Ingest a directory

WHEN user runs kb add ~/docs/ --recursive
THEN the system recursively finds all supported files, processes each one, skips duplicates, logs failures, and displays a summary: "Added X documents. Y failed. Z skipped (already indexed)."

Scenario: Resume after interruption

WHEN a batch ingestion is interrupted (Ctrl+C, crash) and user reruns the same command
THEN already-indexed files are skipped via content hash, and processing continues with remaining files

Scenario: Failed file during batch

WHEN a single file fails to process (corrupt PDF, encoding error)
THEN the error is logged to ~/.kb/ingest-errors.log with the file path and error message, and processing continues with the next file
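
Non-normative sketch of the batch loop: recursive discovery, skip by content hash, log-and-continue on failure, and the summary line. The callback ingest_one, the set known_hashes (a stand-in for the database lookup), and the tab-separated log format are hypothetical; the error log path is passed in rather than hard-coded.

```python
import hashlib
from pathlib import Path


def ingest_directory(root, ingest_one, known_hashes, error_log, extensions):
    """Ingest every supported file under root and return the summary line."""
    added = failed = skipped = 0
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in extensions:
            continue
        try:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            if digest in known_hashes:
                skipped += 1
                continue
            ingest_one(path)
            # Record the hash only after success, so a failed or
            # interrupted file is retried on the next run.
            known_hashes.add(digest)
            added += 1
        except Exception as exc:
            failed += 1
            with open(error_log, "a", encoding="utf-8") as log:
                log.write(f"{path}\t{exc}\n")
    return (
        f"Added {added} documents. {failed} failed. "
        f"{skipped} skipped (already indexed)."
    )
```

Ctrl+C raises KeyboardInterrupt, which is not a subclass of Exception, so it stops the batch instead of being logged as a file failure; rerunning the same command then skips everything already recorded.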

Requirement: Parallel ingestion workers

The system SHALL support parallel document processing via configurable worker count (default: 4). Docling's DocumentConverter SHALL be used with multiple workers for PDF/DOCX/HTML ingestion. Database writes SHALL be serialised to avoid SQLite locking issues.

Scenario: Parallel PDF ingestion

WHEN user runs kb add ~/pdfs/ --recursive with workers: 4 in config
THEN up to 4 documents are processed concurrently through Docling, with chunks written to the database sequentially

Scenario: Override worker count

WHEN user runs kb add ~/pdfs/ --recursive --workers 1
THEN documents are processed sequentially with a single worker
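
Non-normative sketch of concurrent conversion with serialised database writes. It assumes a thread pool and a caller-supplied convert callback; whether the real implementation uses threads, processes, or Docling's own batching is not stated in this section. The chunks table and its columns are hypothetical.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor, as_completed


def ingest_parallel(paths, convert, conn, workers=4):
    """Convert documents in up to `workers` threads and write the results
    from the calling thread only.

    convert(path) must return a list of chunk strings and must not
    touch the database.
    """
    written = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(convert, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            chunks = future.result()
            # Single writer: one transaction per document, committed
            # before the next result is handled.
            with conn:
                conn.executemany(
                    "INSERT INTO chunks (path, text) VALUES (?, ?)",
                    [(str(path), chunk) for chunk in chunks],
                )
            written += len(chunks)
    return written
```

Keeping every write on the thread that owns the connection avoids both SQLite lock contention and the sqlite3 module's default same-thread check. Passing workers=1 degrades to sequential processing with the same code path.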
