kb/openspec/changes/archive/2026-03-28-store-original-documents/specs/engine-api/spec.md at 528a09ca909af0b6155c3a85eecaa97208d20eb3

steve/kb

steve b04823e67b Store original documents for download after ingestion

Persist uploaded files to {data_dir}/documents/{content_hash}{ext} after
successful ingestion. Add GET /documents/{id}/file endpoint for retrieval,
delete stored files on document deletion, and add `kb export` client command.
Includes schema migration, tests, and spec updates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-28 15:16:27 +00:00

3.8 KiB

MODIFIED Requirements

Requirement: Background ingestion worker

The engine SHALL run a background worker that processes queued jobs. The worker SHALL process one job at a time. For each job, it SHALL: detect document type, run the appropriate chunking pipeline (Docling for PDFs, header-based for Markdown, AST-based for code, whole-text for notes), generate embeddings using the resident model, insert chunks and vectors into the database, and move the original file to persistent storage.

Scenario: Successful PDF ingestion

WHEN the background worker picks up a queued PDF job
THEN it SHALL update the job status to processing, run Docling conversion and chunking, embed all chunks, insert document and chunks into the database, move the staged file to {data_dir}/documents/{content_hash}.pdf, update documents.stored_path with the permanent path, store the original filename in documents.original_filename, update the job status to done with the resulting document_id and chunk count, and clean up the staging entry
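
As an illustration of the "move the staged file" step, here is a minimal sketch. It assumes the content hash is SHA-256 (the spec does not name the algorithm) and that the extension is taken from the staged filename. The helper names `content_hash` and `persist_staged_file` are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks.

    SHA-256 is an assumption; the spec only says "content_hash".
    """
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def persist_staged_file(staged: Path, data_dir: Path) -> Path:
    """Move a staged upload to {data_dir}/documents/{content_hash}{ext}."""
    documents_dir = data_dir / "documents"
    documents_dir.mkdir(parents=True, exist_ok=True)
    # Hash before moving so the target name reflects the file's bytes.
    target = documents_dir / f"{content_hash(staged)}{staged.suffix.lower()}"
    shutil.move(str(staged), str(target))
    return target
```

The returned path is what the worker would then write to documents.stored_path. Naming by content hash means two uploads with identical bytes map to the same stored file.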

Scenario: Ingestion failure

WHEN the background worker encounters an error during processing (e.g., corrupt PDF)
THEN it SHALL update the job status to failed with the error message, delete the staged file, and continue processing the next queued job
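
The one-job-at-a-time behaviour and the failure path can be sketched as a plain loop. This is an in-memory illustration under stated assumptions, not the engine's code: the `Job` dataclass, `run_jobs`, and the `ingest` callable are hypothetical, and a real worker would persist status changes to the database.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable, Optional


@dataclass
class Job:
    """Hypothetical in-memory stand-in for a queued ingestion job."""
    staged_path: Path
    status: str = "queued"
    error: Optional[str] = None
    document_id: Optional[int] = None


def run_jobs(jobs: Iterable[Job], ingest: Callable[[Job], int]) -> None:
    """Process jobs sequentially; a failed job never stops the loop."""
    for job in jobs:
        job.status = "processing"
        try:
            job.document_id = ingest(job)
        except Exception as exc:
            # Failure path: record the message, drop the staged file,
            # then move on to the next queued job.
            job.status = "failed"
            job.error = str(exc)
            job.staged_path.unlink(missing_ok=True)
            continue
        job.status = "done"
```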

Scenario: Search during active ingestion

WHEN a search request arrives while the background worker is processing a job
THEN the search SHALL execute without blocking on the worker's writes (SQLite WAL mode allows readers to run concurrently with a writer) and return results from already-ingested documents
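
For reference, WAL mode is enabled per database file with a pragma. The sketch below uses Python's standard `sqlite3` module; the helper name `open_connection` is hypothetical, and the spec does not say where the engine issues the pragma.

```python
import sqlite3


def open_connection(db_path: str) -> sqlite3.Connection:
    """Open a SQLite connection and switch the database to WAL mode."""
    conn = sqlite3.connect(db_path)
    # The pragma returns the journal mode actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        # In-memory databases, for example, cannot use WAL.
        raise RuntimeError(f"could not enable WAL mode (got {mode!r})")
    return conn
```

With WAL enabled, a reader on a second connection sees the last committed state while the worker holds an open write transaction, which is what lets search proceed during ingestion.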

Requirement: Document management

The engine SHALL provide endpoints to list, inspect, remove, and download original files for ingested documents.

Scenario: List documents

WHEN a client sends GET /api/v1/documents
THEN the engine SHALL return a JSON array of documents with id, title, doc_type, tags, chunk_count, and created_at

Scenario: List documents with filters

WHEN a client sends GET /api/v1/documents?type=pdf&tags=manual
THEN the engine SHALL return only documents matching all specified filters
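
"Matching all specified filters" means the filters combine with AND. A minimal in-memory sketch of that rule follows. The function `filter_documents` is hypothetical, and requiring every requested tag when several are given is an assumption, since the spec only shows a single tag.

```python
from typing import Any, Iterable, List, Mapping, Optional


def filter_documents(
    docs: Iterable[Mapping[str, Any]],
    doc_type: Optional[str] = None,
    tags: Optional[Iterable[str]] = None,
) -> List[Mapping[str, Any]]:
    """Keep documents that satisfy every filter that was supplied."""
    wanted = set(tags or [])
    return [
        doc
        for doc in docs
        # An omitted filter matches everything.
        if (doc_type is None or doc["doc_type"] == doc_type)
        # Assumption: a document must carry all requested tags.
        and wanted.issubset(doc["tags"])
    ]
```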

Scenario: Get document details

WHEN a client sends GET /api/v1/documents/{id}
THEN the engine SHALL return the full document record including all chunks, their text content, and whether the original file is available (has_file: true/false)

Scenario: Download original file

WHEN a client sends GET /api/v1/documents/{id}/file
THEN the engine SHALL return the original file with appropriate Content-Type and Content-Disposition: attachment; filename="{original_filename}" headers, or HTTP 404 if the file is not available
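
The response headers for the download could be derived from the stored original filename roughly as follows. This is a sketch using only the standard library. The function name `download_headers`, the fallback to application/octet-stream, and the quote escaping are assumptions rather than behaviour stated in the spec.

```python
import mimetypes
from typing import Dict


def download_headers(original_filename: str) -> Dict[str, str]:
    """Build Content-Type and Content-Disposition for a file download."""
    content_type, _encoding = mimetypes.guess_type(original_filename)
    # Escape backslashes and double quotes so the quoted filename stays valid.
    safe_name = original_filename.replace("\\", "\\\\").replace('"', '\\"')
    return {
        # Assumption: unknown types fall back to a generic binary type.
        "Content-Type": content_type or "application/octet-stream",
        "Content-Disposition": f'attachment; filename="{safe_name}"',
    }
```

Filenames containing non-ASCII characters would additionally need the `filename*` form from RFC 6266, which this sketch leaves out.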

Scenario: Remove a document

WHEN a client sends DELETE /api/v1/documents/{id}
THEN the engine SHALL delete the document, all its chunks, associated embeddings, tag associations, and the stored original file from disk, and return HTTP 200 with a confirmation

Scenario: Remove non-existent document

WHEN a client sends DELETE /api/v1/documents/{id} with a non-existent ID
THEN the engine SHALL return HTTP 404
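
The two removal scenarios can be sketched together against a deliberately simplified schema. The `documents` and `chunks` tables and the `delete_document` helper are hypothetical; only the stored_path column name comes from this spec, and embeddings and tag associations are omitted for brevity.

```python
import sqlite3
from pathlib import Path


def delete_document(conn: sqlite3.Connection, document_id: int) -> bool:
    """Delete a document's rows and stored file.

    Returns False when the ID is unknown, which the caller would map to 404.
    """
    row = conn.execute(
        "SELECT stored_path FROM documents WHERE id = ?", (document_id,)
    ).fetchone()
    if row is None:
        return False
    # Remove database rows atomically; "with conn" commits or rolls back.
    with conn:
        conn.execute("DELETE FROM chunks WHERE document_id = ?", (document_id,))
        conn.execute("DELETE FROM documents WHERE id = ?", (document_id,))
    # Remove the file last, and tolerate it already being gone.
    if row[0]:
        Path(row[0]).unlink(missing_ok=True)
    return True
```

Deleting the rows before the file means a crash in between leaves an orphaned file, not a database record that points at a missing file.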

Requirement: Engine configuration via environment variables

The engine SHALL be configured via environment variables. The engine reads no config file; all configuration comes from the environment (set via compose.yaml or `docker run`).

Scenario: Default configuration

WHEN the engine starts with no environment variables set
THEN it SHALL use defaults: data directory /data, model all-MiniLM-L6-v2, device auto, no API key required. It SHALL create staging/ and documents/ subdirectories under the data directory.

Scenario: Custom model

WHEN KB_MODEL is set to BAAI/bge-small-en-v1.5
THEN the engine SHALL download and load that model instead of the default
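
The defaults above could be loaded along these lines. Only KB_MODEL is named in this spec; the variable names KB_DATA_DIR, KB_DEVICE, and KB_API_KEY, as well as the `Settings` class and both helper functions, are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    data_dir: Path
    model: str
    device: str
    api_key: Optional[str]  # None means no API key is required


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from environment variables, applying spec defaults."""
    return Settings(
        data_dir=Path(env.get("KB_DATA_DIR", "/data")),  # assumed variable name
        model=env.get("KB_MODEL", "all-MiniLM-L6-v2"),
        device=env.get("KB_DEVICE", "auto"),  # assumed variable name
        api_key=env.get("KB_API_KEY") or None,  # assumed variable name
    )


def ensure_directories(settings: Settings) -> None:
    """Create the staging/ and documents/ subdirectories under data_dir."""
    for name in ("staging", "documents"):
        (settings.data_dir / name).mkdir(parents=True, exist_ok=True)
```

Passing the environment in as a mapping keeps the loader easy to test without mutating the process environment.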
