Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MODIFIED Requirements
Requirement: Background ingestion worker
The engine SHALL run a background worker that processes queued jobs. The worker SHALL process one job at a time. For each job, it SHALL: detect document type, run the appropriate chunking pipeline (Docling for PDFs, header-based for Markdown, AST-based for code, whole-text for notes), build enriched text by prepending the document title (and section header when present) to each chunk's text, generate embeddings using the enriched text and the resident model, insert chunks (with both raw text and enriched text) and vectors into the database, and move the original file to persistent storage.
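The enrichment step described above can be sketched as a small helper. This is only an illustration: the function name `build_enriched_text` and the blank-line separator between title, section header, and chunk text are assumptions, since the requirement does not specify how the pieces are joined.

```python
from typing import Optional


def build_enriched_text(
    title: str,
    chunk_text: str,
    section_header: Optional[str] = None,
) -> str:
    """Prepend the document title (and section header, if any) to a chunk.

    The "\n\n" separator is an assumption; the spec only requires that the
    title and optional header come before the chunk's raw text.
    """
    parts = [title]
    if section_header:
        parts.append(section_header)
    parts.append(chunk_text)
    return "\n\n".join(parts)
```

Both the raw `chunk_text` and the returned enriched string would then be stored, with only the enriched string passed to the embedding model.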
Scenario: Successful PDF ingestion
- WHEN the background worker picks up a queued PDF job
- THEN it SHALL update the job status to `processing`, run Docling conversion and chunking, build enriched text for each chunk by prepending the document title, embed all chunks using enriched text, insert document and chunks into the database, move the staged file to `{data_dir}/documents/{content_hash}.pdf`, update `documents.stored_path` with the permanent path, store the original filename in `documents.original_filename`, update the job status to `done` with the resulting document_id and chunk count, and clean up the staging entry
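The move to permanent storage in this scenario might look like the sketch below. The helper names are hypothetical, and SHA-256 is an assumption: the requirement names a `content_hash` but not the algorithm that produces it.

```python
import hashlib
import shutil
from pathlib import Path


def content_hash(path: Path) -> str:
    """Hash the file contents in blocks (SHA-256 is an assumed choice)."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def move_to_permanent(staged: Path, data_dir: Path) -> Path:
    """Move a staged PDF to {data_dir}/documents/{content_hash}.pdf."""
    dest_dir = data_dir / "documents"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{content_hash(staged)}.pdf"
    shutil.move(str(staged), str(dest))
    return dest
```

The returned path is what would be written to `documents.stored_path`, while the upload's original name goes to `documents.original_filename`.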
Scenario: Ingestion failure
- WHEN the background worker encounters an error during processing (e.g., corrupt PDF)
- THEN it SHALL update the job status to `failed` with the error message, delete the staged file, and continue processing the next queued job
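A minimal sketch of the failure path, assuming an in-memory `Job` record and a caller-supplied `ingest` callable (both hypothetical; the real engine would persist job status in its database):

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable, Optional


@dataclass
class Job:
    staged_path: Path
    status: str = "queued"
    error: Optional[str] = None


def run_worker(jobs: Iterable[Job], ingest: Callable[[Job], None]) -> None:
    """Process jobs one at a time; a failure never stops the loop."""
    for job in jobs:
        job.status = "processing"
        try:
            ingest(job)
        except Exception as exc:
            # Record the error, drop the staged file, move on to the next job.
            job.status = "failed"
            job.error = str(exc)
            job.staged_path.unlink(missing_ok=True)
            continue
        job.status = "done"
```

Catching the exception per job, rather than around the whole loop, is what lets one corrupt file fail without affecting the jobs queued behind it.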
Scenario: Search during active ingestion
- WHEN a search request arrives while the background worker is processing a job
- THEN the search SHALL execute without blocking (SQLite WAL mode) and return results from already-ingested documents
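The non-blocking behaviour relies on SQLite's write-ahead log. The sketch below uses Python's standard `sqlite3` module with a hypothetical single-column `chunks` table (not the engine's real schema) to show a reader seeing committed rows while a writer holds an open transaction.

```python
import sqlite3
import tempfile
from pathlib import Path

db_path = Path(tempfile.mkdtemp()) / "engine.db"

# Writer connection, standing in for the ingestion worker.
writer = sqlite3.connect(db_path)
mode = writer.execute("PRAGMA journal_mode=WAL").fetchone()[0]
writer.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT)")
writer.execute("INSERT INTO chunks (text) VALUES ('already ingested')")
writer.commit()

# The worker starts another write and has not committed yet.
writer.execute("INSERT INTO chunks (text) VALUES ('in progress')")

# Reader connection, standing in for a search request. In WAL mode it is
# not blocked by the open write transaction and sees only committed rows.
reader = sqlite3.connect(db_path)
rows = reader.execute("SELECT text FROM chunks").fetchall()

writer.commit()
reader.close()
writer.close()
```

Here `rows` contains only the committed `'already ingested'` row, matching the scenario's requirement that search returns results from already-ingested documents.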
Requirement: Engine status and reindex
The engine SHALL provide status information and support re-embedding all chunks. The version field in the status response SHALL always be present and SHALL reflect the engine's release version as read from the VERSION file. This field is the contract used by clients for compatibility checking.
Scenario: Get engine status
- WHEN a client sends `GET /api/v1/status`
- THEN the engine SHALL return JSON with `version` (string, from VERSION file), model_name, embedding_dim, GPU device info, database stats (document count by type, total chunks, DB size), and queue stats (queued/processing job count)
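As a rough sketch of how the `version` guarantee could be enforced, the hypothetical helper below reads the VERSION file and refuses to build a response without it. Only the three field names taken directly from the scenario are included; GPU, database, and queue stats are omitted because the spec does not name their keys.

```python
from pathlib import Path


def build_status(version_file: Path, model_name: str, embedding_dim: int) -> dict:
    """Build the core of the status payload (partial, illustrative only)."""
    version = version_file.read_text(encoding="utf-8").strip()
    if not version:
        # The version field must always be present, so fail loudly
        # instead of returning a response without it.
        raise RuntimeError("VERSION file is empty")
    return {
        "version": version,
        "model_name": model_name,
        "embedding_dim": embedding_dim,
    }
```

Reading the file on each call is a simplification; an engine could equally read it once at startup and cache the value.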
Scenario: Trigger reindex
- WHEN a client sends `POST /api/v1/reindex`
- THEN the engine SHALL re-embed all existing chunks using the `enriched_text` column and the currently loaded model, and return progress information. This operation SHALL NOT block search queries.
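One way to picture the reindex pass is a batched loop that reads `enriched_text`, re-embeds it, and reports progress after each batch. Everything beyond the `enriched_text` column is an assumption here: the `embedding` column stored as JSON text, the `embed` callable standing in for the loaded model, and the batch size.

```python
import json
import sqlite3
from typing import Callable, Iterator, List, Tuple


def reindex(
    conn: sqlite3.Connection,
    embed: Callable[[List[str]], List[List[float]]],
    batch_size: int = 64,
) -> Iterator[Tuple[int, int]]:
    """Re-embed every chunk from enriched_text, yielding (done, total)."""
    total = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
    done = 0
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, enriched_text FROM chunks WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        vectors = embed([text for _, text in rows])
        conn.executemany(
            "UPDATE chunks SET embedding = ? WHERE id = ?",
            [(json.dumps(vec), chunk_id) for (chunk_id, _), vec in zip(rows, vectors)],
        )
        # Commit per batch so each write transaction stays short.
        conn.commit()
        last_id = rows[-1][0]
        done += len(rows)
        yield done, total
```

Committing after each batch keeps write transactions brief, which fits the requirement that reindexing not block search, and the yielded `(done, total)` pairs are one possible shape for the progress information.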