Store original documents for download after ingestion

Persist uploaded files to {data_dir}/documents/{content_hash}{ext} after
successful ingestion. Add GET /documents/{id}/file endpoint for retrieval,
delete stored files on document deletion, and add `kb export` client command.
Includes schema migration, tests, and spec updates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 15:16:27 +00:00
parent 6a4bce4659
commit b04823e67b
19 changed files with 802 additions and 10 deletions
+10 -6
@@ -128,11 +128,11 @@ The engine SHALL maintain job records in SQLite with status tracking. Jobs SHALL
### Requirement: Background ingestion worker
-The engine SHALL run a background worker that processes queued jobs. The worker SHALL process one job at a time. For each job, it SHALL: detect document type, run the appropriate chunking pipeline (Docling for PDFs, header-based for Markdown, AST-based for code, whole-text for notes), generate embeddings using the resident model, and insert chunks and vectors into the database.
+The engine SHALL run a background worker that processes queued jobs. The worker SHALL process one job at a time. For each job, it SHALL: detect document type, run the appropriate chunking pipeline (Docling for PDFs, header-based for Markdown, AST-based for code, whole-text for notes), generate embeddings using the resident model, insert chunks and vectors into the database, and move the original file to persistent storage.
#### Scenario: Successful PDF ingestion
- **WHEN** the background worker picks up a queued PDF job
-- **THEN** it SHALL update the job status to `processing`, run Docling conversion and chunking, embed all chunks, insert document and chunks into the database, update the job status to `done` with the resulting document_id and chunk count, and delete the staged file
+- **THEN** it SHALL update the job status to `processing`, run Docling conversion and chunking, embed all chunks, insert document and chunks into the database, move the staged file to `{data_dir}/documents/{content_hash}.pdf`, update `documents.stored_path` with the permanent path, store the original filename in `documents.original_filename`, update the job status to `done` with the resulting document_id and chunk count, and clean up the staging entry
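
The move step in the scenario above could look roughly like the following. This is a minimal sketch, not the commit's actual code: the function name `persist_original` is hypothetical, and SHA-256 is an assumed choice, since the spec only says `content_hash` without naming an algorithm.

```python
import hashlib
import shutil
from pathlib import Path


def persist_original(staged: Path, data_dir: Path) -> Path:
    """Move a staged upload to {data_dir}/documents/{content_hash}{ext}.

    Hypothetical helper; SHA-256 is an assumption, the spec does not
    name the hash algorithm.
    """
    digest = hashlib.sha256()
    with staged.open("rb") as fh:
        # Hash in 1 MiB blocks so large files are never fully loaded.
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)

    dest_dir = data_dir / "documents"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{digest.hexdigest()}{staged.suffix}"

    if dest.exists():
        # Identical content is already stored; just drop the staged copy.
        staged.unlink()
    else:
        shutil.move(str(staged), str(dest))
    return dest
```

The returned path is what would be written to `documents.stored_path`. The original filename has to be recorded separately, because the stored name keeps only the extension.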
#### Scenario: Ingestion failure
- **WHEN** the background worker encounters an error during processing (e.g., corrupt PDF)
@@ -146,7 +146,7 @@ The engine SHALL run a background worker that processes queued jobs. The worker
### Requirement: Document management
-The engine SHALL provide endpoints to list, inspect, and remove ingested documents.
+The engine SHALL provide endpoints to list, inspect, remove, and download original files for ingested documents.
#### Scenario: List documents
- **WHEN** a client sends `GET /api/v1/documents`
@@ -158,11 +158,15 @@ The engine SHALL provide endpoints to list, inspect, and remove ingested documen
#### Scenario: Get document details
- **WHEN** a client sends `GET /api/v1/documents/{id}`
-- **THEN** the engine SHALL return the full document record including all chunks and their text content
+- **THEN** the engine SHALL return the full document record including all chunks, their text content, and whether the original file is available (`has_file: true/false`)
+#### Scenario: Download original file
+- **WHEN** a client sends `GET /api/v1/documents/{id}/file`
+- **THEN** the engine SHALL return the original file with appropriate Content-Type and `Content-Disposition: attachment; filename="{original_filename}"` headers, or HTTP 404 if the file is not available
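
As an illustration of the download scenario, the header logic might be sketched as below. The function `download_response` is hypothetical and framework-agnostic, and guessing the Content-Type from the original filename is one possible reading of "appropriate", not something the spec prescribes.

```python
import mimetypes
from pathlib import Path
from typing import Dict, Optional, Tuple


def download_response(
    stored_path: Optional[str], original_filename: str
) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for GET /api/v1/documents/{id}/file.

    Hypothetical helper: a real handler would also stream the file body.
    """
    # 404 when no file was ever stored or it has vanished from disk.
    if not stored_path or not Path(stored_path).is_file():
        return 404, {}

    content_type, _ = mimetypes.guess_type(original_filename)
    # Escape backslashes and quotes so the quoted filename stays valid.
    safe_name = original_filename.replace("\\", "\\\\").replace('"', '\\"')
    return 200, {
        "Content-Type": content_type or "application/octet-stream",
        "Content-Disposition": f'attachment; filename="{safe_name}"',
    }
```

The same existence check could back the `has_file` flag in the document details response. Filenames with non-ASCII characters would likely also need the `filename*` parameter, which this sketch omits.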
#### Scenario: Remove a document
- **WHEN** a client sends `DELETE /api/v1/documents/{id}`
-- **THEN** the engine SHALL delete the document, all its chunks, associated embeddings, and tag associations, and return HTTP 200 with a confirmation
+- **THEN** the engine SHALL delete the document, all its chunks, associated embeddings, tag associations, and the stored original file from disk, and return HTTP 200 with a confirmation
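
One plausible ordering for the removal scenario is to commit the database deletes first and unlink the file afterwards, so a failed transaction never leaves a record pointing at a missing file. The sketch below assumes a simplified, hypothetical schema (`documents(id, stored_path)` and `chunks(document_id)`). Only `documents.stored_path` appears in the spec, and embeddings and tag associations are left out for brevity.

```python
import sqlite3
from pathlib import Path


def delete_document(conn: sqlite3.Connection, doc_id: int) -> bool:
    """Delete a document's rows, then its stored file.

    Returns False when the id is unknown. Table and column names other
    than documents.stored_path are assumptions for illustration.
    """
    row = conn.execute(
        "SELECT stored_path FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    if row is None:
        return False

    # `with conn` commits on success and rolls back on an exception.
    with conn:
        conn.execute("DELETE FROM chunks WHERE document_id = ?", (doc_id,))
        conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))

    # Remove the file only after the rows are gone; tolerate a missing file.
    if row[0]:
        Path(row[0]).unlink(missing_ok=True)
    return True
```

Because stored files are named by content hash, this sketch assumes no two document records share the same `stored_path`. If duplicates were allowed, the unlink would need a check for remaining references.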
#### Scenario: Remove non-existent document
- **WHEN** a client sends `DELETE /api/v1/documents/{id}` with a non-existent ID
@@ -230,7 +234,7 @@ The engine SHALL be configured via environment variables. No config file is read
#### Scenario: Default configuration
- **WHEN** the engine starts with no environment variables set
-- **THEN** it SHALL use defaults: data directory `/data`, model `all-MiniLM-L6-v2`, device `auto`, no API key required
+- **THEN** it SHALL use defaults: data directory `/data`, model `all-MiniLM-L6-v2`, device `auto`, no API key required. It SHALL create `staging/` and `documents/` subdirectories under the data directory.
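
The directory creation at startup can be sketched in a few lines. The helper name `ensure_data_dirs` is hypothetical, and how the engine resolves the data directory from its environment is not shown here.

```python
from pathlib import Path
from typing import Tuple


def ensure_data_dirs(data_dir: Path) -> Tuple[Path, Path]:
    """Create staging/ and documents/ under data_dir if they are missing.

    Hypothetical startup helper; safe to call on every start.
    """
    staging = data_dir / "staging"
    documents = data_dir / "documents"
    for directory in (staging, documents):
        # exist_ok makes the call idempotent across restarts.
        directory.mkdir(parents=True, exist_ok=True)
    return staging, documents
```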
#### Scenario: Custom model
- **WHEN** `KB_MODEL` is set to `BAAI/bge-small-en-v1.5`