b04823e67b
Persist uploaded files to {data_dir}/documents/{content_hash}{ext} after
successful ingestion. Add GET /documents/{id}/file endpoint for retrieval,
delete stored files on document deletion, and add `kb export` client command.
Includes schema migration, tests, and spec updates.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MODIFIED Requirements
Requirement: Background ingestion worker
The engine SHALL run a background worker that processes queued jobs. The worker SHALL process one job at a time. For each job, it SHALL: detect document type, run the appropriate chunking pipeline (Docling for PDFs, header-based for Markdown, AST-based for code, whole-text for notes), generate embeddings using the resident model, insert chunks and vectors into the database, and move the original file to persistent storage.
Scenario: Successful PDF ingestion
- WHEN the background worker picks up a queued PDF job
- THEN it SHALL update the job status to `processing`, run Docling conversion and chunking, embed all chunks, insert document and chunks into the database, move the staged file to `{data_dir}/documents/{content_hash}.pdf`, update `documents.stored_path` with the permanent path, store the original filename in `documents.original_filename`, update the job status to `done` with the resulting document_id and chunk count, and clean up the staging entry
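The file-move step in this scenario can be sketched as follows. This is a minimal illustration, not the engine's actual code: the helper names `content_hash` and `persist_staged_file` are hypothetical, and SHA-256 is an assumption, since the spec does not say which hash produces `{content_hash}`.

```python
import hashlib
import shutil
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return a hex digest of the file contents (SHA-256 is an assumption)."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in blocks so large uploads are not loaded into memory at once.
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def persist_staged_file(staged: Path, data_dir: Path) -> Path:
    """Move a staged upload to {data_dir}/documents/{content_hash}{ext}."""
    documents_dir = data_dir / "documents"
    documents_dir.mkdir(parents=True, exist_ok=True)
    target = documents_dir / f"{content_hash(staged)}{staged.suffix}"
    shutil.move(str(staged), str(target))
    return target  # the caller would record this in documents.stored_path
```

Because the target name is derived from the content, re-uploading identical bytes resolves to the same path; how the engine handles that collision is not stated in the spec.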
Scenario: Ingestion failure
- WHEN the background worker encounters an error during processing (e.g., corrupt PDF)
- THEN it SHALL update the job status to `failed` with the error message, delete the staged file, and continue processing the next queued job
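The failure path might look like the sketch below. It assumes a hypothetical in-memory `Job` record and a `process` callable standing in for the real pipeline; the actual engine presumably keeps job state in the database.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable, Optional


@dataclass
class Job:
    """Hypothetical job record; field names are illustrative only."""
    staged_path: Path
    status: str = "queued"
    error: Optional[str] = None


def run_jobs(jobs: Iterable[Job], process: Callable[[Job], None]) -> None:
    """Process jobs one at a time; a failed job never stops the loop."""
    for job in jobs:
        job.status = "processing"
        try:
            process(job)
        except Exception as exc:
            # Record the error, discard the staged upload, move to the next job.
            job.status = "failed"
            job.error = str(exc)
            job.staged_path.unlink(missing_ok=True)
            continue
        job.status = "done"
```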
Scenario: Search during active ingestion
- WHEN a search request arrives while the background worker is processing a job
- THEN the search SHALL execute without blocking (SQLite WAL mode) and return results from already-ingested documents
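As a rough illustration of why WAL mode makes this possible, the sketch below enables WAL through Python's standard `sqlite3` module. The `open_connection` helper is hypothetical; the engine's real connection setup is not shown in this spec.

```python
import sqlite3


def open_connection(db_path: str) -> sqlite3.Connection:
    """Open a SQLite connection and switch the database to WAL journaling."""
    conn = sqlite3.connect(db_path)
    # The pragma returns the journal mode actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        # In-memory databases, for example, cannot use WAL.
        raise RuntimeError(f"WAL mode not available, got {mode!r}")
    return conn
```

With WAL enabled, a reader on a second connection sees the last committed snapshot while the worker's write transaction is still open, so searches return already-ingested documents without waiting for the job to finish.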
Requirement: Document management
The engine SHALL provide endpoints to list, inspect, remove, and download original files for ingested documents.
Scenario: List documents
- WHEN a client sends `GET /api/v1/documents`
- THEN the engine SHALL return a JSON array of documents with id, title, doc_type, tags, chunk_count, and created_at
Scenario: List documents with filters
- WHEN a client sends `GET /api/v1/documents?type=pdf&tags=manual`
- THEN the engine SHALL return only documents matching all specified filters
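The "matching all specified filters" rule can be sketched as a pure function. This is an assumption-laden illustration: `filter_documents` is a hypothetical helper, documents are plain dicts, and requiring every requested tag when several are given is a guess, since the spec only shows a single tag.

```python
from typing import Any, Dict, Iterable, List, Optional

Document = Dict[str, Any]


def filter_documents(
    documents: Iterable[Document],
    doc_type: Optional[str] = None,
    tags: Optional[Iterable[str]] = None,
) -> List[Document]:
    """Keep documents that satisfy every filter that was supplied."""
    required = set(tags or ())
    return [
        doc
        for doc in documents
        # An omitted filter matches everything; supplied filters are ANDed.
        if (doc_type is None or doc["doc_type"] == doc_type)
        and required.issubset(doc["tags"])
    ]
```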
Scenario: Get document details
- WHEN a client sends `GET /api/v1/documents/{id}`
- THEN the engine SHALL return the full document record including all chunks, their text content, and whether the original file is available (`has_file: true/false`)
Scenario: Download original file
- WHEN a client sends `GET /api/v1/documents/{id}/file`
- THEN the engine SHALL return the original file with appropriate Content-Type and `Content-Disposition: attachment; filename="{original_filename}"` headers, or HTTP 404 if the file is not available
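One way to build the download headers is sketched below. The `download_headers` helper is hypothetical; guessing the type from the filename with `mimetypes` and falling back to `application/octet-stream` are assumptions about what "appropriate Content-Type" means here.

```python
import mimetypes
from typing import Dict


def download_headers(original_filename: str) -> Dict[str, str]:
    """Build response headers for returning a stored original file."""
    content_type, _ = mimetypes.guess_type(original_filename)
    # Escape backslashes and quotes so the quoted filename stays well-formed.
    safe_name = original_filename.replace("\\", "\\\\").replace('"', '\\"')
    return {
        "Content-Type": content_type or "application/octet-stream",
        "Content-Disposition": f'attachment; filename="{safe_name}"',
    }
```

Filenames with non-ASCII characters would additionally need the `filename*` form, which this sketch leaves out.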
Scenario: Remove a document
- WHEN a client sends `DELETE /api/v1/documents/{id}`
- THEN the engine SHALL delete the document, all its chunks, associated embeddings, tag associations, and the stored original file from disk, and return HTTP 200 with a confirmation
Scenario: Remove non-existent document
- WHEN a client sends `DELETE /api/v1/documents/{id}` with a non-existent ID
- THEN the engine SHALL return HTTP 404
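The two removal scenarios could be handled along these lines. This is a simplified sketch: the `chunks` table and its `document_id` column are assumed names, embeddings and tag associations are left out for brevity, and only `documents.stored_path` comes from the spec.

```python
import sqlite3
from pathlib import Path


def delete_document(conn: sqlite3.Connection, doc_id: int) -> bool:
    """Delete a document's rows and stored file; False means "not found"."""
    row = conn.execute(
        "SELECT stored_path FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    if row is None:
        return False  # the endpoint would map this to HTTP 404
    with conn:  # commit both deletes together, or roll back on error
        conn.execute("DELETE FROM chunks WHERE document_id = ?", (doc_id,))
        conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
    if row[0]:
        # Remove the file only after the database rows are gone.
        Path(row[0]).unlink(missing_ok=True)
    return True  # the endpoint would map this to HTTP 200
```

Deleting the file after the transaction commits means a failed delete leaves at worst an orphaned file on disk, never a database row pointing at a missing file.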
Requirement: Engine configuration via environment variables
The engine SHALL be configured via environment variables. The engine reads no config file; all configuration comes from the environment (set via compose.yaml or `docker run`).
Scenario: Default configuration
- WHEN the engine starts with no environment variables set
- THEN it SHALL use defaults: data directory `/data`, model `all-MiniLM-L6-v2`, device `auto`, no API key required. It SHALL create `staging/` and `documents/` subdirectories under the data directory.
Scenario: Custom model
- WHEN `KB_MODEL` is set to `BAAI/bge-small-en-v1.5`
- THEN the engine SHALL download and load that model instead of the default
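The configuration scenarios above can be sketched as follows. Only `KB_MODEL` and the default values come from the spec; the variable names `KB_DATA_DIR`, `KB_DEVICE`, and `KB_API_KEY`, and the `Settings`, `load_settings`, and `ensure_directories` names, are hypothetical placeholders.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    data_dir: Path
    model: str
    device: str
    api_key: Optional[str]  # None means no API key is required


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from the environment, falling back to defaults."""
    return Settings(
        data_dir=Path(env.get("KB_DATA_DIR", "/data")),  # assumed name
        model=env.get("KB_MODEL", "all-MiniLM-L6-v2"),
        device=env.get("KB_DEVICE", "auto"),  # assumed name
        api_key=env.get("KB_API_KEY") or None,  # assumed name
    )


def ensure_directories(settings: Settings) -> None:
    """Create the staging/ and documents/ subdirectories if missing."""
    for name in ("staging", "documents"):
        (settings.data_dir / name).mkdir(parents=True, exist_ok=True)
```

Passing the environment as a mapping keeps the function easy to test, for example `load_settings({"KB_MODEL": "BAAI/bge-small-en-v1.5"})`.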