# Document Storage

## Purpose

Persistent storage, retrieval, and lifecycle management of original uploaded document files.

## Requirements

### Requirement: Persistent original file storage

The engine SHALL persistently store the original uploaded file on disk after successful ingestion. Files SHALL be stored at `{data_dir}/documents/{content_hash}{extension}` where `content_hash` is the SHA-256 hex digest already computed for dedup and `extension` is preserved from the original filename. The `documents` table SHALL record the stored file path in a `stored_path` column and the original upload filename in an `original_filename` column.

#### Scenario: File stored after successful ingestion

- **WHEN** the background worker successfully processes an ingestion job for a PDF file
- **THEN** the worker SHALL move the staged file to `{data_dir}/documents/{content_hash}.pdf`, store the permanent path in `documents.stored_path`, store the original filename in `documents.original_filename`, and delete the staging entry

#### Scenario: Note stored after successful ingestion

- **WHEN** the background worker successfully processes an ingestion job for a text note
- **THEN** the worker SHALL move the staged `.note` file to `{data_dir}/documents/{content_hash}.note` and store the permanent path in `documents.stored_path`

#### Scenario: Markdown file stored after successful ingestion

- **WHEN** the background worker successfully processes an ingestion job for a markdown file
- **THEN** the worker SHALL move the staged file to `{data_dir}/documents/{content_hash}.md` and store the permanent path in `documents.stored_path`

#### Scenario: Code file stored after successful ingestion

- **WHEN** the background worker successfully processes an ingestion job for a code file (e.g. `.py`, `.go`)
- **THEN** the worker SHALL move the staged file to `{data_dir}/documents/{content_hash}{original_extension}` and store the permanent path in `documents.stored_path`

#### Scenario: Documents directory created at startup

- **WHEN** the engine starts up and calls `ensure_dirs()`
- **THEN** the `{data_dir}/documents/` directory SHALL be created if it does not exist

#### Scenario: Ingestion failure does not store file

- **WHEN** the background worker fails to process an ingestion job
- **THEN** the staged file SHALL be cleaned up as before and no file SHALL be written to the documents directory

---

### Requirement: File retrieval via API

The engine SHALL serve the original stored file for any document that has a stored file on disk.

#### Scenario: Download original file

- **WHEN** a client sends `GET /api/v1/documents/{id}/file` for a document with a stored file
- **THEN** the engine SHALL return the file with appropriate `Content-Type` based on file extension and `Content-Disposition: attachment; filename="{original_filename}"` header, falling back to `{title}{ext}` if `original_filename` is NULL

#### Scenario: Download file for pre-existing document

- **WHEN** a client sends `GET /api/v1/documents/{id}/file` for a document ingested before this feature was added (`stored_path` is NULL)
- **THEN** the engine SHALL return HTTP 404 with `{"error": "Original file not available - ingested before document storage was enabled"}`

#### Scenario: Download file when file missing from disk

- **WHEN** a client sends `GET /api/v1/documents/{id}/file` for a document whose `stored_path` is set but the file no longer exists on disk
- **THEN** the engine SHALL return HTTP 404 with `{"error": "Stored file not found on disk"}`

#### Scenario: Download file for non-existent document

- **WHEN** a client sends `GET /api/v1/documents/{id}/file` with a non-existent document ID
- **THEN** the engine SHALL return HTTP 404 with `{"error": "Document not found"}`

---

### Requirement: File cleanup on document deletion

The engine SHALL remove the stored original file from disk when a document is deleted.

#### Scenario: Delete document with stored file

- **WHEN** a client sends `DELETE /api/v1/documents/{id}` for a document with a stored file
- **THEN** the engine SHALL delete the document from the database (cascading to chunks, embeddings, tags) AND delete the stored file from disk

#### Scenario: Delete document when stored file already missing

- **WHEN** a client sends `DELETE /api/v1/documents/{id}` for a document whose stored file has been manually removed from disk
- **THEN** the engine SHALL delete the document from the database successfully and log a warning about the missing file

#### Scenario: Delete document without stored file (pre-existing)

- **WHEN** a client sends `DELETE /api/v1/documents/{id}` for a document with `stored_path` NULL
- **THEN** the engine SHALL delete the document from the database without attempting file removal

---

### Requirement: Database schema migration for stored_path and original_filename

The engine SHALL add `stored_path` and `original_filename` columns to the `documents` table for tracking permanent file locations and original upload filenames.

#### Scenario: Fresh database initialization

- **WHEN** the engine initializes a new database
- **THEN** the `documents` table SHALL include `stored_path TEXT` and `original_filename TEXT` columns in its schema

#### Scenario: Existing database migration

- **WHEN** the engine starts with a database created before this feature
- **THEN** the engine SHALL add `stored_path TEXT` and `original_filename TEXT` to the `documents` table via `ALTER TABLE` if the columns do not exist
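
The existing-database migration scenario could be sketched as follows. This is an illustration only: it assumes the engine uses SQLite (the spec does not name the database), and `migrate_documents_table` is a hypothetical helper name, not one taken from the engine.

```python
import sqlite3


def migrate_documents_table(conn: sqlite3.Connection) -> list:
    """Add stored_path and original_filename if missing; return the columns added.

    Hypothetical helper. Assumes a SQLite database with a `documents` table.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(documents)")}
    added = []
    for column in ("stored_path", "original_filename"):
        if column not in existing:
            # Column names come from the fixed tuple above, never from user input.
            conn.execute(f"ALTER TABLE documents ADD COLUMN {column} TEXT")
            added.append(column)
    conn.commit()
    return added
```

Checking the column list first keeps the migration idempotent, so it is safe to run on every startup. Rows that existed before the migration end up with NULL in both columns, which is the state the pre-existing-document scenarios rely on.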
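
For the storage requirement, a minimal sketch of how the permanent path `{data_dir}/documents/{content_hash}{extension}` might be derived and the staged file moved into place. The function names are hypothetical, and the hash is recomputed here only to keep the example self-contained; the spec says the engine reuses the digest already computed for dedup.

```python
import hashlib
import shutil
from pathlib import Path


def stored_path_for(data_dir: Path, content_hash: str, filename: str) -> Path:
    """Build {data_dir}/documents/{content_hash}{extension} (hypothetical helper)."""
    extension = Path(filename).suffix  # keeps the leading dot, e.g. ".pdf"
    return data_dir / "documents" / f"{content_hash}{extension}"


def promote_staged_file(staged: Path, data_dir: Path, filename: str) -> Path:
    """Move a staged upload to its permanent location and return that path."""
    # Assumption: recomputing the SHA-256 here stands in for the dedup digest.
    content_hash = hashlib.sha256(staged.read_bytes()).hexdigest()
    target = stored_path_for(data_dir, content_hash, filename)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(staged), str(target))
    return target
```

The caller would then write the returned path to `documents.stored_path` and the upload name to `documents.original_filename`. On the failure path this function is simply never called, so nothing is written to the documents directory.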
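
The four retrieval scenarios amount to a short decision sequence, sketched below without any web framework. `resolve_download` and the dict-shaped `doc` row are assumptions made for illustration; only the status codes, error messages, and header format come from the scenarios.

```python
import mimetypes
from pathlib import Path
from typing import Optional, Tuple


def resolve_download(doc: Optional[dict]) -> Tuple[int, dict]:
    """Return (status, payload) for GET /api/v1/documents/{id}/file.

    `doc` is a hypothetical row dict with keys stored_path,
    original_filename, and title, or None if the ID was not found.
    """
    if doc is None:
        return 404, {"error": "Document not found"}
    if doc.get("stored_path") is None:
        return 404, {"error": "Original file not available - ingested before document storage was enabled"}
    path = Path(doc["stored_path"])
    if not path.is_file():
        return 404, {"error": "Stored file not found on disk"}
    filename = doc.get("original_filename")
    if filename is None:
        # Fallback from the spec: {title}{ext}
        filename = f"{doc['title']}{path.suffix}"
    # Assumption: unknown extensions fall back to a generic binary type.
    content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    headers = {
        "Content-Type": content_type,
        "Content-Disposition": f'attachment; filename="{filename}"',
    }
    return 200, {"path": str(path), "headers": headers}
```

A real handler would also need to escape or encode filenames that contain quotes or non-ASCII characters before placing them in `Content-Disposition`; the sketch leaves that out.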