kb/openspec/changes/archive/2026-03-28-store-original-documents/specs/engine-api/spec.md
steve b04823e67b Store original documents for download after ingestion
Persist uploaded files to {data_dir}/documents/{content_hash}{ext} after
successful ingestion. Add GET /documents/{id}/file endpoint for retrieval,
delete stored files on document deletion, and add `kb export` client command.
Includes schema migration, tests, and spec updates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 15:16:27 +00:00

MODIFIED Requirements

Requirement: Background ingestion worker

The engine SHALL run a background worker that processes queued jobs. The worker SHALL process one job at a time. For each job, it SHALL: detect document type, run the appropriate chunking pipeline (Docling for PDFs, header-based for Markdown, AST-based for code, whole-text for notes), generate embeddings using the resident model, insert chunks and vectors into the database, and move the original file to persistent storage.
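The type detection and pipeline dispatch described above could be sketched as a lookup from file extension to pipeline name. The extension table and the pipeline names are illustrative assumptions; the spec names the four pipeline kinds but not how document types are detected.

```python
from pathlib import Path

# Hypothetical extension-to-pipeline table; the spec does not define detection rules.
PIPELINES = {
    ".pdf": "docling",
    ".md": "markdown_headers",
    ".py": "code_ast",
    ".txt": "whole_text",
}


def select_pipeline(filename: str) -> str:
    """Return the chunking pipeline name for a file, based on its extension."""
    suffix = Path(filename).suffix.lower()
    try:
        return PIPELINES[suffix]
    except KeyError:
        # Unknown types fail loudly so the worker can mark the job as failed.
        raise ValueError(f"unsupported document type: {suffix or filename!r}") from None
```

Raising on unknown types keeps the decision in one place: the worker can treat the exception like any other ingestion failure.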

Scenario: Successful PDF ingestion

  • WHEN the background worker picks up a queued PDF job
  • THEN it SHALL update the job status to processing, run Docling conversion and chunking, embed all chunks, insert document and chunks into the database, move the staged file to {data_dir}/documents/{content_hash}.pdf, update documents.stored_path with the permanent path, store the original filename in documents.original_filename, update the job status to done with the resulting document_id and chunk count, and clean up the staging entry
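The file-move step in this scenario might look like the sketch below. It assumes content_hash is the SHA-256 hex digest of the file bytes and that the extension is lowercased; neither detail is stated in the spec, and persist_original is a hypothetical helper name.

```python
import hashlib
import shutil
from pathlib import Path


def persist_original(staged: Path, data_dir: Path) -> Path:
    """Move a staged upload to {data_dir}/documents/{content_hash}{ext}."""
    # Assumption: content_hash is the SHA-256 hex digest of the file contents.
    digest = hashlib.sha256(staged.read_bytes()).hexdigest()
    documents_dir = data_dir / "documents"
    documents_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: the extension is normalised to lowercase (e.g. ".PDF" -> ".pdf").
    target = documents_dir / f"{digest}{staged.suffix.lower()}"
    # Moving (not copying) also removes the file from the staging area.
    shutil.move(str(staged), str(target))
    return target
```

The returned path is what would be written to documents.stored_path, while the name the client uploaded goes to documents.original_filename.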

Scenario: Ingestion failure

  • WHEN the background worker encounters an error during processing (e.g., corrupt PDF)
  • THEN it SHALL update the job status to failed with the error message, delete the staged file, and continue processing the next queued job
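A minimal sketch of the failure handling above, assuming an in-memory Job record and an ingest callable that raises on error. Both are hypothetical stand-ins for the engine's real job table and pipeline.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable, Optional


@dataclass
class Job:
    """Hypothetical job record; the real engine stores jobs in the database."""
    staged_path: Path
    status: str = "queued"
    error: Optional[str] = None


def run_worker(jobs: Iterable[Job], ingest: Callable[[Job], None]) -> None:
    """Process jobs one at a time; a failed job never stops the queue."""
    for job in jobs:
        job.status = "processing"
        try:
            ingest(job)
        except Exception as exc:
            # Record the error, drop the staged upload, and move on to the next job.
            job.status = "failed"
            job.error = str(exc)
            job.staged_path.unlink(missing_ok=True)
            continue
        job.status = "done"
```

Catching the exception inside the loop body is what lets the worker continue with the next queued job after a corrupt upload.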

Scenario: Search during active ingestion

  • WHEN a search request arrives while the background worker is processing a job
  • THEN the search SHALL execute without blocking (the database runs in SQLite WAL mode, so readers do not wait on the writer) and return results from already-ingested documents
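The non-blocking behaviour can be illustrated with Python's sqlite3 module. This sketch uses a hypothetical chunks table and holds a write transaction open on one connection while a second connection reads.

```python
import sqlite3
from pathlib import Path
from typing import List


def read_during_write(db_path: Path) -> List[str]:
    """Read committed rows while another connection holds an open write transaction."""
    writer = sqlite3.connect(db_path, timeout=1.0)
    # WAL mode is persistent: once set, it applies to later connections to this file.
    writer.execute("PRAGMA journal_mode=WAL")
    writer.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT)")
    writer.execute("INSERT INTO chunks (text) VALUES ('already ingested')")
    writer.commit()

    # Simulate the worker mid-job: an INSERT that has not been committed yet.
    writer.execute("INSERT INTO chunks (text) VALUES ('still ingesting')")

    reader = sqlite3.connect(db_path, timeout=1.0)
    try:
        # The reader is not blocked and sees only committed data.
        rows = reader.execute("SELECT text FROM chunks ORDER BY id").fetchall()
    finally:
        reader.close()
        writer.commit()
        writer.close()
    return [text for (text,) in rows]
```

Called on a fresh database file, the function returns only the committed row, which matches the requirement that search results come from already-ingested documents.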

Requirement: Document management

The engine SHALL provide endpoints to list, inspect, and remove ingested documents, and to download their original files.

Scenario: List documents

  • WHEN a client sends GET /api/v1/documents
  • THEN the engine SHALL return a JSON array of documents with id, title, doc_type, tags, chunk_count, and created_at

Scenario: List documents with filters

  • WHEN a client sends GET /api/v1/documents?type=pdf&tags=manual
  • THEN the engine SHALL return only documents matching all specified filters
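The "matching all specified filters" rule amounts to AND semantics across filters. The sketch below applies it to in-memory records; treating a multi-tag filter as "document must carry every requested tag" is an assumption, since the spec shows only a single tag.

```python
from typing import Any, Dict, Iterable, List, Optional


def filter_documents(
    documents: Iterable[Dict[str, Any]],
    doc_type: Optional[str] = None,
    tags: Optional[Iterable[str]] = None,
) -> List[Dict[str, Any]]:
    """Keep documents that satisfy every filter that was supplied."""
    wanted = set(tags or [])
    return [
        doc
        for doc in documents
        # An omitted filter matches everything; supplied filters are ANDed.
        if (doc_type is None or doc["doc_type"] == doc_type)
        and wanted.issubset(doc["tags"])
    ]
```

In a real engine the same logic would more likely be expressed as SQL WHERE clauses; the list comprehension only shows the intended semantics.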

Scenario: Get document details

  • WHEN a client sends GET /api/v1/documents/{id}
  • THEN the engine SHALL return the full document record including all chunks, their text content, and whether the original file is available (has_file: true/false)

Scenario: Download original file

  • WHEN a client sends GET /api/v1/documents/{id}/file
  • THEN the engine SHALL return the original file with appropriate Content-Type and Content-Disposition: attachment; filename="{original_filename}" headers, or HTTP 404 if the file is not available
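The response headers for the download could be built as in the sketch below. Guessing the Content-Type from the filename with the standard mimetypes module and falling back to application/octet-stream are assumptions; the spec says only that the type must be "appropriate".

```python
import mimetypes
from typing import Dict


def download_headers(original_filename: str) -> Dict[str, str]:
    """Build Content-Type and Content-Disposition headers for a stored file."""
    content_type, _encoding = mimetypes.guess_type(original_filename)
    # Escape backslashes and double quotes so the quoted filename stays well-formed.
    safe_name = original_filename.replace("\\", "\\\\").replace('"', '\\"')
    return {
        "Content-Type": content_type or "application/octet-stream",
        "Content-Disposition": f'attachment; filename="{safe_name}"',
    }
```

Filenames containing non-ASCII characters would additionally need the filename* parameter, which this sketch leaves out.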

Scenario: Remove a document

  • WHEN a client sends DELETE /api/v1/documents/{id}
  • THEN the engine SHALL delete the document, all its chunks, associated embeddings, tag associations, and the stored original file from disk, and return HTTP 200 with a confirmation

Scenario: Remove non-existent document

  • WHEN a client sends DELETE /api/v1/documents/{id} with a non-existent ID
  • THEN the engine SHALL return HTTP 404
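The two removal scenarios can be sketched against SQLite as follows. The documents and chunks table layouts are hypothetical apart from the stored_path column, and embeddings and tag associations are omitted for brevity; they would be deleted in the same transaction.

```python
import sqlite3
from pathlib import Path


def delete_document(conn: sqlite3.Connection, doc_id: int) -> int:
    """Delete a document and its stored file; return the HTTP status to send."""
    row = conn.execute(
        "SELECT stored_path FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    if row is None:
        return 404
    # One transaction so the document and its chunks disappear together.
    with conn:
        conn.execute("DELETE FROM chunks WHERE document_id = ?", (doc_id,))
        conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
    stored_path = row[0]
    if stored_path:
        # Documents ingested before file storage existed may have no stored file.
        Path(stored_path).unlink(missing_ok=True)
    return 200
```

Removing the file only after the database transaction commits means a failed delete cannot leave a document row pointing at a missing file.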

Requirement: Engine configuration via environment variables

The engine SHALL be configured via environment variables. The engine reads no config file; all configuration comes from the environment (set via compose.yaml or docker run).

Scenario: Default configuration

  • WHEN the engine starts with no environment variables set
  • THEN it SHALL use defaults: data directory /data, model all-MiniLM-L6-v2, device auto, no API key required. It SHALL create staging/ and documents/ subdirectories under the data directory.

Scenario: Custom model

  • WHEN KB_MODEL is set to BAAI/bge-small-en-v1.5
  • THEN the engine SHALL download and load that model instead of the default
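The default and custom-model scenarios above could be implemented roughly as follows. Only KB_MODEL appears in the spec; the variable names KB_DATA_DIR, KB_DEVICE, and KB_API_KEY are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class EngineConfig:
    data_dir: Path
    model: str
    device: str
    api_key: Optional[str]


def load_config(env: Optional[Mapping[str, str]] = None) -> EngineConfig:
    """Read configuration from the environment, falling back to the spec's defaults."""
    env = os.environ if env is None else env
    return EngineConfig(
        data_dir=Path(env.get("KB_DATA_DIR", "/data")),  # assumed variable name
        model=env.get("KB_MODEL", "all-MiniLM-L6-v2"),
        device=env.get("KB_DEVICE", "auto"),  # assumed variable name
        api_key=env.get("KB_API_KEY") or None,  # assumed name; unset means no key required
    )


def ensure_data_dirs(config: EngineConfig) -> None:
    """Create the staging/ and documents/ subdirectories under the data directory."""
    for name in ("staging", "documents"):
        (config.data_dir / name).mkdir(parents=True, exist_ok=True)
```

Passing the environment as a mapping keeps the loader easy to test without touching the real process environment.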