Store original documents for download after ingestion

Persist uploaded files to {data_dir}/documents/{content_hash}{ext} after successful ingestion. Add GET /documents/{id}/file endpoint for retrieval, delete stored files on document deletion, and add `kb export` client command. Includes schema migration, tests, and spec updates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 15:16:27 +00:00
parent 6a4bce4659
commit b04823e67b
19 changed files with 802 additions and 10 deletions
@@ -0,0 +1,89 @@
+# Document Storage
+
+## Purpose
+
+Persistent storage, retrieval, and lifecycle management of original uploaded document files.
+
+## Requirements
+
+### Requirement: Persistent original file storage
+
+The engine SHALL persistently store the original uploaded file on disk after successful ingestion. Files SHALL be stored at `{data_dir}/documents/{content_hash}{extension}` where `content_hash` is the SHA-256 hex digest already computed for dedup and `extension` is preserved from the original filename. The `documents` table SHALL record the stored file path in a `stored_path` column and the original upload filename in an `original_filename` column.
+
+#### Scenario: File stored after successful ingestion
+- **WHEN** the background worker successfully processes an ingestion job for a PDF file
+- **THEN** the worker SHALL move the staged file to `{data_dir}/documents/{content_hash}.pdf`, store the permanent path in `documents.stored_path`, store the original filename in `documents.original_filename`, and delete the staging entry
+
+#### Scenario: Note stored after successful ingestion
+- **WHEN** the background worker successfully processes an ingestion job for a text note
+- **THEN** the worker SHALL move the staged `.note` file to `{data_dir}/documents/{content_hash}.note` and store the permanent path in `documents.stored_path`
+
+#### Scenario: Markdown file stored after successful ingestion
+- **WHEN** the background worker successfully processes an ingestion job for a markdown file
+- **THEN** the worker SHALL move the staged file to `{data_dir}/documents/{content_hash}.md` and store the permanent path in `documents.stored_path`
+
+#### Scenario: Code file stored after successful ingestion
+- **WHEN** the background worker successfully processes an ingestion job for a code file (e.g. `.py`, `.go`)
+- **THEN** the worker SHALL move the staged file to `{data_dir}/documents/{content_hash}{original_extension}` and store the permanent path in `documents.stored_path`
+
+#### Scenario: Documents directory created at startup
+- **WHEN** the engine starts up and calls `ensure_dirs()`
+- **THEN** the `{data_dir}/documents/` directory SHALL be created if it does not exist
+
+#### Scenario: Ingestion failure does not store file
+- **WHEN** the background worker fails to process an ingestion job
+- **THEN** the staged file SHALL be cleaned up as before and no file SHALL be written to the documents directory
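Reviewer sketch (not part of the commit): the storage scenarios above could be implemented roughly as follows. This is a minimal sketch assuming a staged file on local disk and a `content_hash` already computed by the dedup step; the function names `permanent_path` and `store_original` are hypothetical and do not come from the codebase.

```python
import shutil
from pathlib import Path


def permanent_path(data_dir: Path, content_hash: str, original_filename: str) -> Path:
    """Build {data_dir}/documents/{content_hash}{extension}."""
    # The extension is taken from the original upload name, e.g. ".pdf" or ".note".
    extension = Path(original_filename).suffix
    return data_dir / "documents" / f"{content_hash}{extension}"


def store_original(
    data_dir: Path, staged: Path, content_hash: str, original_filename: str
) -> Path:
    """Move a staged file into permanent storage and return the new path.

    Assumption: this is called only after ingestion has succeeded, so a
    failed job never writes anything into the documents directory.
    """
    target = permanent_path(data_dir, content_hash, original_filename)
    # The spec creates this directory at startup; creating it here as well
    # keeps the sketch self-contained.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(staged), str(target))
    return target
```

The returned path is what would be written to `documents.stored_path`. Because the name is derived from the content hash, two uploads with identical bytes and the same extension map to the same stored file.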
+
+---
+
+### Requirement: File retrieval via API
+
+The engine SHALL serve the original stored file for any document that has a stored file on disk.
+
+#### Scenario: Download original file
+- **WHEN** a client sends `GET /api/v1/documents/{id}/file` for a document with a stored file
+- **THEN** the engine SHALL return the file with a `Content-Type` derived from the file extension and a `Content-Disposition: attachment; filename="{original_filename}"` header. If `original_filename` is NULL, the filename SHALL fall back to `{title}{ext}`.
+
+#### Scenario: Download file for pre-existing document
+- **WHEN** a client sends `GET /api/v1/documents/{id}/file` for a document ingested before this feature was added (`stored_path` is NULL)
+- **THEN** the engine SHALL return HTTP 404 with `{"error": "Original file not available - ingested before document storage was enabled"}`
+
+#### Scenario: Download file when file missing from disk
+- **WHEN** a client sends `GET /api/v1/documents/{id}/file` for a document whose `stored_path` is set but the file no longer exists on disk
+- **THEN** the engine SHALL return HTTP 404 with `{"error": "Stored file not found on disk"}`
+
+#### Scenario: Download file for non-existent document
+- **WHEN** a client sends `GET /api/v1/documents/{id}/file` with a non-existent document ID
+- **THEN** the engine SHALL return HTTP 404 with `{"error": "Document not found"}`
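Reviewer sketch (not part of the commit): the four retrieval scenarios above reduce to a short decision chain. The example below is framework-agnostic and assumes the document row is available as a plain dict with `title`, `stored_path`, and `original_filename` keys; `resolve_download` is a hypothetical name, and the real handler presumably streams the file through the web framework instead of returning a tuple.

```python
import mimetypes
from pathlib import Path
from typing import Optional, Tuple


def resolve_download(doc: Optional[dict]) -> Tuple[int, dict]:
    """Return (status, payload) for GET /api/v1/documents/{id}/file."""
    if doc is None:
        return 404, {"error": "Document not found"}

    stored_path = doc.get("stored_path")
    if stored_path is None:
        # Row predates the feature, so no file was ever kept.
        return 404, {
            "error": "Original file not available - ingested before "
            "document storage was enabled"
        }

    path = Path(stored_path)
    if not path.is_file():
        return 404, {"error": "Stored file not found on disk"}

    # Fall back to {title}{ext} only when original_filename is NULL.
    filename = doc.get("original_filename")
    if filename is None:
        filename = f"{doc['title']}{path.suffix}"

    # Assumption: unknown extensions are served as generic binary data.
    content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    headers = {
        "Content-Type": content_type,
        "Content-Disposition": f'attachment; filename="{filename}"',
    }
    return 200, {"path": str(path), "headers": headers}
```

The sketch does not escape quotes or non-ASCII characters in `filename`; a production handler would need to handle those before building the header.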
+
+---
+
+### Requirement: File cleanup on document deletion
+
+The engine SHALL remove the stored original file from disk when a document is deleted.
+
+#### Scenario: Delete document with stored file
+- **WHEN** a client sends `DELETE /api/v1/documents/{id}` for a document with a stored file
+- **THEN** the engine SHALL delete the document from the database (cascading to chunks, embeddings, tags) AND delete the stored file from disk
+
+#### Scenario: Delete document when stored file already missing
+- **WHEN** a client sends `DELETE /api/v1/documents/{id}` for a document whose stored file has been manually removed from disk
+- **THEN** the engine SHALL delete the document from the database successfully and log a warning about the missing file
+
+#### Scenario: Delete document without stored file (pre-existing)
+- **WHEN** a client sends `DELETE /api/v1/documents/{id}` for a document with `stored_path` NULL
+- **THEN** the engine SHALL delete the document from the database without attempting file removal
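Reviewer sketch (not part of the commit): the three deletion scenarios above differ only in how the file step behaves, so they fit in one helper. This is a minimal sketch assuming the caller has already read `stored_path` from the row it is about to delete; `remove_stored_file` and the logger name are hypothetical.

```python
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger("kb.documents")  # hypothetical logger name


def remove_stored_file(stored_path: Optional[str]) -> bool:
    """Delete the stored original, returning True only if a file was removed."""
    if stored_path is None:
        # Pre-existing document: nothing was stored, so skip file removal.
        return False
    try:
        Path(stored_path).unlink()
    except FileNotFoundError:
        # The file was removed out of band; the database delete still succeeds.
        logger.warning("Stored file already missing: %s", stored_path)
        return False
    return True
```

Catching `FileNotFoundError` instead of checking for existence first avoids a race between the check and the unlink.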
+
+---
+
+### Requirement: Database schema migration for stored_path and original_filename
+
+The engine SHALL add `stored_path` and `original_filename` columns to the `documents` table for tracking permanent file locations and original upload filenames.
+
+#### Scenario: Fresh database initialization
+- **WHEN** the engine initializes a new database
+- **THEN** the `documents` table SHALL include `stored_path TEXT` and `original_filename TEXT` columns in its schema
+
+#### Scenario: Existing database migration
+- **WHEN** the engine starts with a database created before this feature
+- **THEN** the engine SHALL add `stored_path TEXT` and `original_filename TEXT` to the `documents` table via `ALTER TABLE` if the columns do not exist
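Reviewer sketch (not part of the commit): one way to satisfy the existing-database scenario above is to inspect the table with `PRAGMA table_info` and add only the missing columns. The example uses the standard-library `sqlite3` module; `migrate_documents_table` is a hypothetical name, and the engine's actual migration code may be structured differently.

```python
import sqlite3

# Columns introduced by this feature, with their declared SQLite types.
NEW_COLUMNS = {"stored_path": "TEXT", "original_filename": "TEXT"}


def migrate_documents_table(conn: sqlite3.Connection) -> None:
    """Add stored_path and original_filename to documents if they are absent."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(documents)")}
    for name, col_type in NEW_COLUMNS.items():
        if name not in existing:
            # Names come from the constant above, never from user input.
            conn.execute(f"ALTER TABLE documents ADD COLUMN {name} {col_type}")
    conn.commit()
```

Running the function twice is harmless, since the second pass finds both columns and issues no `ALTER TABLE`. Rows that existed before the migration get NULL in both columns, which matches the pre-existing-document scenarios.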
@@ -128,11 +128,11 @@ The engine SHALL maintain job records in SQLite with status tracking. Jobs SHALL

 ### Requirement: Background ingestion worker

-The engine SHALL run a background worker that processes queued jobs. The worker SHALL process one job at a time. For each job, it SHALL: detect document type, run the appropriate chunking pipeline (Docling for PDFs, header-based for Markdown, AST-based for code, whole-text for notes), generate embeddings using the resident model, and insert chunks and vectors into the database.
+The engine SHALL run a background worker that processes queued jobs. The worker SHALL process one job at a time. For each job, it SHALL: detect document type, run the appropriate chunking pipeline (Docling for PDFs, header-based for Markdown, AST-based for code, whole-text for notes), generate embeddings using the resident model, insert chunks and vectors into the database, and move the original file to persistent storage.

 #### Scenario: Successful PDF ingestion
 - **WHEN** the background worker picks up a queued PDF job
- **THEN** it SHALL update the job status to `processing`, run Docling conversion and chunking, embed all chunks, insert document and chunks into the database, update the job status to `done` with the resulting document_id and chunk count, and delete the staged file
+- **THEN** it SHALL update the job status to `processing`, run Docling conversion and chunking, embed all chunks, insert document and chunks into the database, move the staged file to `{data_dir}/documents/{content_hash}.pdf`, update `documents.stored_path` with the permanent path, store the original filename in `documents.original_filename`, update the job status to `done` with the resulting document_id and chunk count, and clean up the staging entry

 #### Scenario: Ingestion failure
 - **WHEN** the background worker encounters an error during processing (e.g., corrupt PDF)
@@ -146,7 +146,7 @@ The engine SHALL run a background worker that processes queued jobs. The worker

 ### Requirement: Document management

-The engine SHALL provide endpoints to list, inspect, and remove ingested documents.
+The engine SHALL provide endpoints to list, inspect, remove, and download original files for ingested documents.

 #### Scenario: List documents
 - **WHEN** a client sends `GET /api/v1/documents`
@@ -158,11 +158,15 @@ The engine SHALL provide endpoints to list, inspect, and remove ingested documen

 #### Scenario: Get document details
 - **WHEN** a client sends `GET /api/v1/documents/{id}`
- **THEN** the engine SHALL return the full document record including all chunks and their text content
+- **THEN** the engine SHALL return the full document record including all chunks, their text content, and whether the original file is available (`has_file: true/false`)
+
+#### Scenario: Download original file
+- **WHEN** a client sends `GET /api/v1/documents/{id}/file`
+- **THEN** the engine SHALL return the original file with an appropriate `Content-Type` header and a `Content-Disposition: attachment; filename="{original_filename}"` header, or HTTP 404 if the file is not available

 #### Scenario: Remove a document
 - **WHEN** a client sends `DELETE /api/v1/documents/{id}`
- **THEN** the engine SHALL delete the document, all its chunks, associated embeddings, and tag associations, and return HTTP 200 with a confirmation
+- **THEN** the engine SHALL delete the document, all its chunks, associated embeddings, tag associations, and the stored original file from disk, and return HTTP 200 with a confirmation

 #### Scenario: Remove non-existent document
 - **WHEN** a client sends `DELETE /api/v1/documents/{id}` with a non-existent ID
@@ -230,7 +234,7 @@ The engine SHALL be configured via environment variables. No config file is read

 #### Scenario: Default configuration
 - **WHEN** the engine starts with no environment variables set
- **THEN** it SHALL use defaults: data directory `/data`, model `all-MiniLM-L6-v2`, device `auto`, no API key required
+- **THEN** it SHALL use defaults: data directory `/data`, model `all-MiniLM-L6-v2`, device `auto`, no API key required. It SHALL create `staging/` and `documents/` subdirectories under the data directory.

 #### Scenario: Custom model
 - **WHEN** `KB_MODEL` is set to `BAAI/bge-small-en-v1.5`