## MODIFIED Requirements

### Requirement: Background ingestion worker

The engine SHALL run a background worker that processes queued jobs. The worker SHALL process one job at a time. For each job, it SHALL: detect document type, run the appropriate chunking pipeline (Docling for PDFs, header-based for Markdown, AST-based for code, whole-text for notes), generate embeddings using the resident model, insert chunks and vectors into the database, and move the original file to persistent storage.

#### Scenario: Successful PDF ingestion

- **WHEN** the background worker picks up a queued PDF job
- **THEN** it SHALL update the job status to `processing`, run Docling conversion and chunking, embed all chunks, insert document and chunks into the database, move the staged file to `{data_dir}/documents/{content_hash}.pdf`, update `documents.stored_path` with the permanent path, store the original filename in `documents.original_filename`, update the job status to `done` with the resulting document_id and chunk count, and clean up the staging entry

#### Scenario: Ingestion failure

- **WHEN** the background worker encounters an error during processing (e.g., corrupt PDF)
- **THEN** it SHALL update the job status to `failed` with the error message, delete the staged file, and continue processing the next queued job

#### Scenario: Search during active ingestion

- **WHEN** a search request arrives while the background worker is processing a job
- **THEN** the search SHALL execute without blocking (SQLite WAL mode) and return results from already-ingested documents

---

### Requirement: Document management

The engine SHALL provide endpoints to list, inspect, remove, and download original files for ingested documents.

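
The original-file handling that links ingestion to document management can be pictured with a minimal sketch. It assumes SHA-256 as the content hash and a generalized file suffix (the spec only names `{content_hash}.pdf`); `content_hash`, `store_original`, and `has_file` are hypothetical helper names, not part of the engine's documented API.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Optional


def content_hash(path: Path) -> str:
    """Hex digest of the file contents. SHA-256 is an assumption;
    the spec does not name the hash algorithm."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def store_original(staged: Path, data_dir: Path) -> Path:
    """Move a staged file to {data_dir}/documents/{content_hash}{suffix}
    and return the permanent path (the value for documents.stored_path)."""
    target_dir = data_dir / "documents"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{content_hash(staged)}{staged.suffix.lower()}"
    shutil.move(str(staged), str(target))
    return target


def has_file(stored_path: Optional[str]) -> bool:
    """True when the stored original still exists on disk; a possible
    source for the has_file flag and for the 404 on the download route."""
    return stored_path is not None and Path(stored_path).is_file()
```

Because the permanent name is derived from the content rather than the upload name, the original filename has to be kept separately, which is the role `documents.original_filename` plays in the spec.
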
#### Scenario: List documents

- **WHEN** a client sends `GET /api/v1/documents`
- **THEN** the engine SHALL return a JSON array of documents with id, title, doc_type, tags, chunk_count, and created_at

#### Scenario: List documents with filters

- **WHEN** a client sends `GET /api/v1/documents?type=pdf&tags=manual`
- **THEN** the engine SHALL return only documents matching all specified filters

#### Scenario: Get document details

- **WHEN** a client sends `GET /api/v1/documents/{id}`
- **THEN** the engine SHALL return the full document record including all chunks, their text content, and whether the original file is available (`has_file: true/false`)

#### Scenario: Download original file

- **WHEN** a client sends `GET /api/v1/documents/{id}/file`
- **THEN** the engine SHALL return the original file with appropriate Content-Type and `Content-Disposition: attachment; filename="{original_filename}"` headers, or HTTP 404 if the file is not available

#### Scenario: Remove a document

- **WHEN** a client sends `DELETE /api/v1/documents/{id}`
- **THEN** the engine SHALL delete the document, all its chunks, associated embeddings, tag associations, and the stored original file from disk, and return HTTP 200 with a confirmation

#### Scenario: Remove non-existent document

- **WHEN** a client sends `DELETE /api/v1/documents/{id}` with a non-existent ID
- **THEN** the engine SHALL return HTTP 404

---

### Requirement: Engine configuration via environment variables

The engine SHALL be configured via environment variables. No config file is read by the engine; all configuration comes from the environment (set via `compose.yaml` or `docker run`).

#### Scenario: Default configuration

- **WHEN** the engine starts with no environment variables set
- **THEN** it SHALL use defaults: data directory `/data`, model `all-MiniLM-L6-v2`, device `auto`, no API key required. It SHALL create `staging/` and `documents/` subdirectories under the data directory.

#### Scenario: Custom model

- **WHEN** `KB_MODEL` is set to `BAAI/bge-small-en-v1.5`
- **THEN** the engine SHALL download and load that model instead of the default
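
The configuration scenarios above could be resolved along the following lines. This is a sketch under stated assumptions: only `KB_MODEL` is named in the spec, so `KB_DATA_DIR`, `KB_DEVICE`, and `KB_API_KEY` are hypothetical variable names, as are `EngineConfig`, `load_config`, and `ensure_dirs`.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class EngineConfig:
    data_dir: Path
    model: str
    device: str
    api_key: Optional[str]  # None means no API key is required


def load_config(env: Mapping[str, str] = os.environ) -> EngineConfig:
    """Build the configuration from environment variables only,
    falling back to the defaults listed in the spec."""
    return EngineConfig(
        data_dir=Path(env.get("KB_DATA_DIR", "/data")),  # hypothetical name
        model=env.get("KB_MODEL", "all-MiniLM-L6-v2"),   # named in the spec
        device=env.get("KB_DEVICE", "auto"),             # hypothetical name
        api_key=env.get("KB_API_KEY") or None,           # hypothetical name
    )


def ensure_dirs(cfg: EngineConfig) -> None:
    """Create the staging/ and documents/ subdirectories under the data directory."""
    for name in ("staging", "documents"):
        (cfg.data_dir / name).mkdir(parents=True, exist_ok=True)
```

Passing the environment in as a mapping keeps the loader easy to exercise in tests, for example `load_config({"KB_MODEL": "BAAI/bge-small-en-v1.5"})`, without touching the real process environment.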