Persist uploaded files to {data_dir}/documents/{content_hash}{ext} after
successful ingestion. Add GET /documents/{id}/file endpoint for retrieval,
delete stored files on document deletion, and add `kb export` client command.
Includes schema migration, tests, and spec updates.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document Storage
Purpose
Persistent storage, retrieval, and lifecycle management of original uploaded document files.
Requirements
Requirement: Persistent original file storage
The engine SHALL persistently store the original uploaded file on disk after successful ingestion. Files SHALL be stored at {data_dir}/documents/{content_hash}{extension} where content_hash is the SHA-256 hex digest already computed for dedup and extension is preserved from the original filename. The documents table SHALL record the stored file path in a stored_path column and the original upload filename in an original_filename column.
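The path rule above can be illustrated with a minimal sketch. The helper name `stored_path_for` and its parameters are hypothetical and are not taken from the engine's code; the sketch only assumes that the hash is the SHA-256 hex digest of the file bytes and that the extension is the final suffix of the original filename.

```python
import hashlib
from pathlib import Path


def stored_path_for(data_dir: Path, content: bytes, original_filename: str) -> Path:
    """Return the permanent path for an uploaded file (hypothetical helper)."""
    # SHA-256 hex digest of the file bytes, as used for dedup in the spec.
    content_hash = hashlib.sha256(content).hexdigest()
    # Path.suffix keeps only the last extension; it is "" when there is none.
    extension = Path(original_filename).suffix
    return data_dir / "documents" / f"{content_hash}{extension}"
```

Note that `Path.suffix` returns only the last component, so under this assumption `archive.tar.gz` would be stored with a `.gz` extension.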
Scenario: File stored after successful ingestion
- WHEN the background worker successfully processes an ingestion job for a PDF file
- THEN the worker SHALL move the staged file to `{data_dir}/documents/{content_hash}.pdf`, store the permanent path in `documents.stored_path`, store the original filename in `documents.original_filename`, and delete the staging entry
Scenario: Note stored after successful ingestion
- WHEN the background worker successfully processes an ingestion job for a text note
- THEN the worker SHALL move the staged `.note` file to `{data_dir}/documents/{content_hash}.note` and store the permanent path in `documents.stored_path`
Scenario: Markdown file stored after successful ingestion
- WHEN the background worker successfully processes an ingestion job for a markdown file
- THEN the worker SHALL move the staged file to `{data_dir}/documents/{content_hash}.md` and store the permanent path in `documents.stored_path`
Scenario: Code file stored after successful ingestion
- WHEN the background worker successfully processes an ingestion job for a code file (e.g. `.py`, `.go`)
- THEN the worker SHALL move the staged file to `{data_dir}/documents/{content_hash}{original_extension}` and store the permanent path in `documents.stored_path`
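The move step shared by the scenarios above might look roughly like the following. This is a sketch, not the worker's actual code: `persist_staged_file` is a hypothetical name, and it assumes the staged file still carries the original extension. Recording the returned path in `documents.stored_path` is left to the caller.

```python
import shutil
from pathlib import Path


def persist_staged_file(staged: Path, data_dir: Path, content_hash: str) -> Path:
    """Move a staged upload into the documents directory and return the new path."""
    # Assumption: the staged file name keeps the original extension.
    target = data_dir / "documents" / f"{content_hash}{staged.suffix}"
    target.parent.mkdir(parents=True, exist_ok=True)
    # shutil.move removes the source, so the staging entry disappears with the move.
    shutil.move(str(staged), str(target))
    return target
```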
Scenario: Documents directory created at startup
- WHEN the engine starts up and calls `ensure_dirs()`
- THEN the `{data_dir}/documents/` directory SHALL be created if it does not exist
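A minimal sketch of the directory step, assuming `data_dir` is passed in explicitly. The spec only names `ensure_dirs()`, so the signature below is an assumption, and any other directories the real function creates are omitted.

```python
from pathlib import Path


def ensure_dirs(data_dir: Path) -> None:
    """Create the documents directory if it is missing (safe to call repeatedly)."""
    # Hypothetical signature: the real ensure_dirs() may take no arguments.
    (data_dir / "documents").mkdir(parents=True, exist_ok=True)
```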
Scenario: Ingestion failure does not store file
- WHEN the background worker fails to process an ingestion job
- THEN the staged file SHALL be cleaned up as before and no file SHALL be written to the documents directory
Requirement: File retrieval via API
The engine SHALL serve the original stored file for any document that has a stored file on disk.
Scenario: Download original file
- WHEN a client sends `GET /api/v1/documents/{id}/file` for a document with a stored file
- THEN the engine SHALL return the file with an appropriate `Content-Type` based on file extension and a `Content-Disposition: attachment; filename="{original_filename}"` header, falling back to `{title}{ext}` if `original_filename` is NULL
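The header rules in this scenario can be sketched with the standard library. `download_headers` is a hypothetical helper, and the `application/octet-stream` fallback for unknown extensions is an assumption the spec does not state.

```python
import mimetypes
from pathlib import Path
from typing import Optional


def download_headers(stored_path: str, original_filename: Optional[str], title: str) -> dict:
    """Build response headers for a stored file (hypothetical helper)."""
    ext = Path(stored_path).suffix
    # Fall back to {title}{ext} when original_filename is NULL.
    filename = original_filename if original_filename is not None else f"{title}{ext}"
    # Guess the MIME type from the extension alone.
    content_type, _ = mimetypes.guess_type(f"file{ext}")
    return {
        # Assumption: unknown extensions are served as generic binary data.
        "Content-Type": content_type or "application/octet-stream",
        "Content-Disposition": f'attachment; filename="{filename}"',
    }
```

The sketch does not escape quotes or non-ASCII characters in the filename; a real handler would need to deal with those.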
Scenario: Download file for pre-existing document
- WHEN a client sends `GET /api/v1/documents/{id}/file` for a document ingested before this feature was added (`stored_path` is NULL)
- THEN the engine SHALL return HTTP 404 with `{"error": "Original file not available - ingested before document storage was enabled"}`
Scenario: Download file when file missing from disk
- WHEN a client sends `GET /api/v1/documents/{id}/file` for a document whose `stored_path` is set but the file no longer exists on disk
- THEN the engine SHALL return HTTP 404 with `{"error": "Stored file not found on disk"}`
Scenario: Download file for non-existent document
- WHEN a client sends `GET /api/v1/documents/{id}/file` with a non-existent document ID
- THEN the engine SHALL return HTTP 404 with `{"error": "Document not found"}`
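The retrieval scenarios amount to a short decision sequence, sketched below independently of any web framework. `resolve_download` and the dict-shaped `doc` record are assumptions made for illustration; the error bodies are copied from the scenarios.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_download(doc: Optional[dict]) -> Tuple[int, object]:
    """Return (status, payload): a Path on success, an error body otherwise."""
    if doc is None:
        return 404, {"error": "Document not found"}
    stored_path = doc.get("stored_path")
    if stored_path is None:
        # Document was ingested before file storage existed.
        return 404, {"error": "Original file not available - ingested before document storage was enabled"}
    path = Path(stored_path)
    if not path.is_file():
        # The database points at a file that has since been removed.
        return 404, {"error": "Stored file not found on disk"}
    return 200, path
```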
Requirement: File cleanup on document deletion
The engine SHALL remove the stored original file from disk when a document is deleted.
Scenario: Delete document with stored file
- WHEN a client sends `DELETE /api/v1/documents/{id}` for a document with a stored file
- THEN the engine SHALL delete the document from the database (cascading to chunks, embeddings, tags) AND delete the stored file from disk
Scenario: Delete document when stored file already missing
- WHEN a client sends `DELETE /api/v1/documents/{id}` for a document whose stored file has been manually removed from disk
- THEN the engine SHALL delete the document from the database successfully and log a warning about the missing file
Scenario: Delete document without stored file (pre-existing)
- WHEN a client sends `DELETE /api/v1/documents/{id}` for a document with `stored_path` NULL
- THEN the engine SHALL delete the document from the database without attempting file removal
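The file-side behaviour of the three deletion scenarios could be sketched as follows. `remove_stored_file` is a hypothetical helper; the database delete and its cascade are assumed to happen separately and are not shown.

```python
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def remove_stored_file(stored_path: Optional[str]) -> bool:
    """Delete the stored file; return True only if a file was actually removed."""
    if stored_path is None:
        # Pre-existing document: no file removal is attempted.
        return False
    try:
        Path(stored_path).unlink()
        return True
    except FileNotFoundError:
        # A missing file must not fail the deletion, only produce a warning.
        logger.warning("Stored file already missing: %s", stored_path)
        return False
```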
Requirement: Database schema migration for stored_path and original_filename
The engine SHALL add stored_path and original_filename columns to the documents table for tracking permanent file locations and original upload filenames.
Scenario: Fresh database initialization
- WHEN the engine initializes a new database
- THEN the `documents` table SHALL include `stored_path TEXT` and `original_filename TEXT` columns in its schema
Scenario: Existing database migration
- WHEN the engine starts with a database created before this feature
- THEN the engine SHALL add `stored_path TEXT` and `original_filename TEXT` to the `documents` table via `ALTER TABLE` if the columns do not exist
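An idempotent version of this migration might look like the sketch below. It assumes the database is SQLite, which the spec does not state, and `migrate_documents_table` is a hypothetical name. Existing columns are detected with `PRAGMA table_info` before any `ALTER TABLE` is issued.

```python
import sqlite3


def migrate_documents_table(conn: sqlite3.Connection) -> None:
    """Add stored_path and original_filename to documents if they are missing."""
    # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute("PRAGMA table_info(documents)")}
    for column in ("stored_path", "original_filename"):
        if column not in existing:
            # Column names come from the fixed tuple above, not from user input.
            conn.execute(f"ALTER TABLE documents ADD COLUMN {column} TEXT")
    conn.commit()
```

Running it twice leaves the schema unchanged the second time. Rows that existed before the migration keep NULL in both columns, which matches the pre-existing document scenarios.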