kb/openspec/changes/kb-search/specs/document-ingestion/spec.md
2026-03-23 20:38:42 +00:00

ADDED Requirements

Requirement: File type detection and routing

The system SHALL detect the type of a file being ingested and route it to the appropriate ingestion pipeline. Detection SHALL be based on file extension. Supported types: PDF (.pdf), DOCX (.docx), HTML (.html, .htm), Markdown and plain text (.md, .markdown, .txt), Code (.py, .sh, .bash, .go), and image files (.png, .jpg, .jpeg, .tiff, .bmp, .webp). The user MAY override detection with the --type and --language flags.

Scenario: Auto-detect PDF file

  • WHEN user runs kb add report.pdf
  • THEN the file is routed to the Docling ingestion pipeline

Scenario: Auto-detect Python code

  • WHEN user runs kb add script.py
  • THEN the file is routed to the code ingestion pipeline with language set to python

Scenario: Override type detection

  • WHEN user runs kb add data.txt --type code --language bash
  • THEN the file is routed to the code pipeline as Bash, regardless of the .txt extension

Scenario: Unsupported file type

  • WHEN user runs kb add archive.zip
  • THEN the system SHALL print an error message listing supported formats and exit with non-zero status
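
Non-normative sketch: the routing rule above could be implemented as a lookup table keyed by extension. The pipeline names ("docling", "markdown", "code"), the function name route, and the exception class are illustrative assumptions, not part of this spec. The sketch also assumes that --type names the target pipeline directly and that matching is case-insensitive.

```python
from pathlib import Path

# Extension table copied from the requirement. Pipeline names are assumed.
EXTENSION_ROUTES = {
    ".pdf": ("docling", None),
    ".docx": ("docling", None),
    ".html": ("docling", None),
    ".htm": ("docling", None),
    ".md": ("markdown", None),
    ".markdown": ("markdown", None),
    ".txt": ("markdown", None),
    ".py": ("code", "python"),
    ".sh": ("code", "bash"),
    ".bash": ("code", "bash"),
    ".go": ("code", "go"),
}
for _ext in (".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp"):
    EXTENSION_ROUTES[_ext] = ("docling", None)


class UnsupportedFileType(ValueError):
    """Raised when no pipeline is registered for a file extension."""


def route(path, type_override=None, language_override=None):
    """Return (pipeline, language) for a file, honouring user overrides."""
    if type_override is not None:
        # An explicit --type wins regardless of the extension.
        return type_override, language_override
    ext = Path(path).suffix.lower()
    if ext not in EXTENSION_ROUTES:
        supported = ", ".join(sorted(EXTENSION_ROUTES))
        raise UnsupportedFileType(
            f"Unsupported file type {ext!r}. Supported formats: {supported}"
        )
    pipeline, language = EXTENSION_ROUTES[ext]
    return pipeline, language_override or language
```

A CLI wrapper would catch UnsupportedFileType, print the message, and exit with a non-zero status.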

Requirement: Docling pipeline for complex documents

The system SHALL use Docling to ingest PDF, DOCX, HTML, and image files. The pipeline SHALL use the pypdfium2 backend for PDFs, enable layout model for structural detection, and enable table reconstruction. OCR SHALL be configurable: auto (detect pages with no extractable text and OCR those), always, or never.

Scenario: Ingest a text-based PDF

  • WHEN user runs kb add manual.pdf
  • THEN the system extracts text using Docling with layout detection, produces hierarchy-aware chunks preserving section structure, embeds each chunk, and stores the document with all chunks in the database

Scenario: Ingest a PDF with tables

  • WHEN user ingests a PDF containing data tables
  • THEN Docling's table reconstruction SHALL produce chunks where table content is represented as structured text (markdown table format) rather than garbled column fragments

Scenario: Ingest a scanned PDF with OCR auto mode

  • WHEN user ingests a PDF where some pages contain only images (no extractable text) and OCR is set to auto
  • THEN the system SHALL detect the image-only pages and apply OCR to those pages only, while pages with extractable text are processed normally

Scenario: Ingest an image file

  • WHEN user runs kb add diagram.png
  • THEN the system SHALL process it through Docling with OCR enabled, extracting any text content from the image
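
Non-normative sketch: the three OCR modes reduce to a per-page decision. The helper below is hypothetical and does not use Docling's API; it only illustrates the selection rule, assuming the caller already has the text that was extracted from each page without OCR.

```python
def pages_needing_ocr(page_texts, mode="auto"):
    """Return zero-based indices of the pages that should be OCRed.

    page_texts holds the text extracted from each page without OCR.
    """
    if mode not in ("auto", "always", "never"):
        raise ValueError(f"unknown OCR mode: {mode!r}")
    if mode == "never":
        return []
    if mode == "always":
        return list(range(len(page_texts)))
    # auto: OCR only the pages that yielded no extractable text.
    return [i for i, text in enumerate(page_texts) if not text.strip()]
```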

Requirement: Markdown ingestion with header-based splitting

The system SHALL split markdown and text files at header boundaries (##, ###). Each chunk SHALL include its parent header chain as context. Sections smaller than min_tokens SHALL be merged with the following section. Sections larger than max_tokens SHALL be split at paragraph boundaries with configurable overlap.

Scenario: Split markdown at headers

  • WHEN user runs kb add guide.md and the file contains multiple ## sections
  • THEN each section becomes a separate chunk, with the header text included in the chunk

Scenario: Preserve header hierarchy

  • WHEN a markdown file has nested headers like ## Config > ### Advanced Options
  • THEN the chunk for "Advanced Options" SHALL include context indicating it falls under "Config > Advanced Options"
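
Non-normative sketch: header-based splitting with a parent header chain can be done with a small stack of open headers. The function name and the (chain, body) return shape are assumptions made for illustration. Merging of small sections and splitting of oversized ones are left out, and header-like lines inside fenced code blocks are not special-cased.

```python
import re

# Only ## and ### are split points, as stated in the requirement.
HEADER_RE = re.compile(r"^(#{2,3})\s+(.+?)\s*$")


def split_markdown(text):
    """Split at ## / ### headers and return (header_chain, body) tuples."""
    chunks = []
    chain = []  # (level, title) for each header that is currently open
    body = []   # lines collected for the current section

    def flush():
        content = "\n".join(body).strip()
        if chain or content:
            chunks.append((" > ".join(title for _, title in chain), content))

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match is None:
            body.append(line)
            continue
        flush()
        body.clear()
        level = len(match.group(1))
        # Close headers at the same or a deeper level before opening this one.
        while chain and chain[-1][0] >= level:
            chain.pop()
        chain.append((level, match.group(2)))
    flush()
    return chunks
```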

Scenario: Merge small sections

  • WHEN a markdown section contains fewer tokens than min_tokens (default: 50)
  • THEN it SHALL be merged with the next section into a single chunk

Scenario: Plain text file without headers

  • WHEN user runs kb add notes.txt and the file has no markdown headers
  • THEN the system SHALL fall back to fixed-size chunking with configurable max_tokens and overlap_tokens

Requirement: Code ingestion with AST/regex splitting

The system SHALL split code files at function and class boundaries. Python files SHALL be parsed with the ast module. Bash and Go files SHALL use regex-based splitting. When include_context is enabled (the default), each class-method chunk SHALL include the class signature and docstring for context. Files with no recognisable function or class boundaries SHALL fall back to fixed-size chunking.

Scenario: Python file with functions and classes

  • WHEN user runs kb add auth.py and the file contains a class with methods
  • THEN each method becomes a chunk, and each chunk includes the class name and docstring as context

Scenario: Bash script with functions

  • WHEN user runs kb add deploy.sh and the file contains function deploy() { blocks
  • THEN each function becomes a separate chunk, including any preceding comment block

Scenario: Go file with functions

  • WHEN user runs kb add main.go and the file contains func declarations
  • THEN each function becomes a separate chunk

Scenario: Code file with no functions

  • WHEN user runs kb add script.sh and the file has no function declarations
  • THEN the system SHALL fall back to fixed-size chunking with max_tokens and overlap_tokens
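
Non-normative sketch: fixed-size chunking with overlap is a sliding window over the token sequence. Whitespace-separated words stand in for real tokens here, which is an assumption; the actual tokenizer is not specified in this section.

```python
def chunk_fixed(text, max_tokens, overlap_tokens):
    """Split text into windows of max_tokens that overlap by overlap_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be in [0, max_tokens)")
    tokens = text.split()  # stand-in for a real tokenizer
    step = max_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start : start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the window already reached the end of the text
    return chunks
```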

Requirement: Inline note ingestion

The system SHALL support adding text notes directly from the command line via kb add --note "text". Notes SHALL be stored as a single chunk (no splitting). Notes MAY have an optional --title for display purposes.

Scenario: Add an inline note

  • WHEN user runs kb add --note "Always restart nginx after config changes" --title "nginx reminder"
  • THEN a document of type note is created with the title "nginx reminder", and the full text becomes a single chunk

Scenario: Add a note without title

  • WHEN user runs kb add --note "some text"
  • THEN the system SHALL use the first 80 characters of the text (truncated at a word boundary) as the title
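
Non-normative sketch: deriving the default title. The function name is hypothetical. The sketch assumes that whitespace runs are collapsed first and that a single word longer than the limit is cut hard at the limit; neither detail is stated in this spec.

```python
def default_title(text, limit=80):
    """Return at most `limit` characters of text, cut at a word boundary."""
    collapsed = " ".join(text.split())
    if len(collapsed) <= limit:
        return collapsed
    head = collapsed[:limit]
    if collapsed[limit] == " ":
        return head  # the cut already falls between two words
    boundary = head.rfind(" ")
    # With no space to cut at, fall back to a hard cut at the limit.
    return head[:boundary] if boundary > 0 else head
```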

Requirement: Deduplication via content hash

The system SHALL compute a SHA-256 hash of each file's contents before ingestion. If a document with the same content_hash already exists in the database, the file SHALL be skipped with a message indicating it is already indexed.

Scenario: Add a file that is already indexed

  • WHEN user runs kb add report.pdf and the file's SHA-256 matches an existing document
  • THEN the system SHALL print "Skipped: report.pdf (already indexed)" and not create a duplicate

Scenario: Add a modified version of an existing file

  • WHEN user runs kb add report.pdf and the file has changed since last indexed (different hash)
  • THEN the system SHALL ingest it as a new document (the old version remains unless manually removed)
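
Non-normative sketch: the duplicate check hashes the file in blocks so that large files are not loaded into memory at once. The known_hashes set stands in for a lookup on content_hash in the database, and both function names are assumptions.

```python
import hashlib
from pathlib import Path


def content_hash(path, block_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def should_ingest(path, known_hashes):
    """Return (ingest?, digest); print the skip message for duplicates."""
    digest = content_hash(path)
    if digest in known_hashes:
        print(f"Skipped: {Path(path).name} (already indexed)")
        return False, digest
    return True, digest
```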

Requirement: Batch ingestion with progress and resumability

The system SHALL support ingesting entire directories via kb add <dir> --recursive. Processing SHALL be resumable: files already indexed (by content hash) are skipped. Failed files SHALL be logged and skipped without aborting the batch. A summary SHALL be displayed at completion.

Scenario: Ingest a directory

  • WHEN user runs kb add ~/docs/ --recursive
  • THEN the system recursively finds all supported files, processes each one, skips duplicates, logs failures, and displays a summary: "Added X documents. Y failed. Z skipped (already indexed)."

Scenario: Resume after interruption

  • WHEN a batch ingestion is interrupted (Ctrl+C, crash) and user reruns the same command
  • THEN already-indexed files are skipped via content hash, and processing continues with remaining files

Scenario: Failed file during batch

  • WHEN a single file fails to process (corrupt PDF, encoding error)
  • THEN the error is logged to ~/.kb/ingest-errors.log with the file path and error message, and processing continues with the next file
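
Non-normative sketch: a batch loop that skips duplicates, logs failures, and prints the summary. The process callback, the in-memory known_hashes set, and the tab-separated log format are assumptions for illustration. Resumability follows from the hash check, because a rerun finds the hashes recorded by the earlier run.

```python
import hashlib
from pathlib import Path

DEFAULT_ERROR_LOG = Path.home() / ".kb" / "ingest-errors.log"


def ingest_directory(root, supported, known_hashes, process,
                     error_log=DEFAULT_ERROR_LOG):
    """Ingest every supported file under root; return (added, failed, skipped)."""
    added = failed = skipped = 0
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in supported:
            continue
        try:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            if digest in known_hashes:
                skipped += 1
                continue
            process(path)  # hypothetical per-file pipeline entry point
            known_hashes.add(digest)
            added += 1
        except Exception as exc:
            # One bad file must not abort the batch: log it and move on.
            failed += 1
            log_path = Path(error_log)
            log_path.parent.mkdir(parents=True, exist_ok=True)
            with open(log_path, "a", encoding="utf-8") as log:
                log.write(f"{path}\t{exc}\n")
    print(f"Added {added} documents. {failed} failed. "
          f"{skipped} skipped (already indexed).")
    return added, failed, skipped
```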

Requirement: Parallel ingestion workers

The system SHALL support parallel document processing via configurable worker count (default: 4). Docling's DocumentConverter SHALL be used with multiple workers for PDF/DOCX/HTML ingestion. Database writes SHALL be serialised to avoid SQLite locking issues.

Scenario: Parallel PDF ingestion

  • WHEN user runs kb add ~/pdfs/ --recursive with workers: 4 in config
  • THEN up to 4 documents are processed concurrently through Docling, with chunks written to the database sequentially

Scenario: Override worker count

  • WHEN user runs kb add ~/pdfs/ --recursive --workers 1
  • THEN documents are processed sequentially with a single worker
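
Non-normative sketch: one way to combine concurrent conversion with serialised writes is to convert in a worker pool and perform every write in the calling thread as results arrive. The thread pool, the convert and write callbacks, and the function name are assumptions; they stand in for Docling conversion and the SQLite writer and do not describe Docling's own worker mechanism.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def ingest_parallel(paths, convert, write, workers=4):
    """Convert documents concurrently; write results one at a time.

    Returns a list of (path, exception) for documents that failed.
    """
    failures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(convert, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                chunks = future.result()
            except Exception as exc:
                failures.append((path, exc))
                continue
            # Only the calling thread reaches this line, so database
            # writes never overlap.
            write(path, chunks)
    return failures
```

With workers=1 the pool holds a single thread, so documents are converted one after another.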