kb/openspec/specs/document-storage/spec.md

# Document Storage

## Purpose

Persistent storage, retrieval, and lifecycle management of original uploaded document files.

## Requirements

### Requirement: Persistent original file storage

The engine SHALL persistently store the original uploaded file on disk after successful ingestion. Files SHALL be stored at `{data_dir}/documents/{content_hash}{extension}`, where `content_hash` is the SHA-256 hex digest already computed for deduplication and `extension` is the extension of the original filename, including its leading dot (e.g. `.pdf`). The `documents` table SHALL record the stored file path in a `stored_path` column and the original upload filename in an `original_filename` column.
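
The path layout above can be sketched as follows. This is a minimal illustration, not the engine's actual code: the function name `permanent_path` and its parameters are hypothetical, and it assumes the extension is taken from the last suffix of the original filename without any case normalization.

```python
import hashlib
from pathlib import Path


def permanent_path(data_dir: Path, content: bytes, original_filename: str) -> Path:
    """Build {data_dir}/documents/{content_hash}{extension} (hypothetical helper)."""
    # SHA-256 hex digest of the file content; the spec reuses the dedup hash.
    content_hash = hashlib.sha256(content).hexdigest()
    # Path.suffix keeps the leading dot and returns only the last suffix,
    # so "archive.tar.gz" yields ".gz" (an assumption of this sketch).
    extension = Path(original_filename).suffix
    return data_dir / "documents" / f"{content_hash}{extension}"
```

Because the file name is derived from the content hash, two uploads with identical bytes and the same extension resolve to the same path.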

#### Scenario: File stored after successful ingestion
- **WHEN** the background worker successfully processes an ingestion job for a PDF file
- **THEN** the worker SHALL move the staged file to `{data_dir}/documents/{content_hash}.pdf`, store the permanent path in `documents.stored_path`, store the original filename in `documents.original_filename`, and delete the staging entry

#### Scenario: Note stored after successful ingestion
- **WHEN** the background worker successfully processes an ingestion job for a text note
- **THEN** the worker SHALL move the staged `.note` file to `{data_dir}/documents/{content_hash}.note` and store the permanent path in `documents.stored_path`

#### Scenario: Markdown file stored after successful ingestion
- **WHEN** the background worker successfully processes an ingestion job for a markdown file
- **THEN** the worker SHALL move the staged file to `{data_dir}/documents/{content_hash}.md` and store the permanent path in `documents.stored_path`

#### Scenario: Code file stored after successful ingestion
- **WHEN** the background worker successfully processes an ingestion job for a code file (e.g. `.py`, `.go`)
- **THEN** the worker SHALL move the staged file to `{data_dir}/documents/{content_hash}{original_extension}` and store the permanent path in `documents.stored_path`

#### Scenario: Documents directory created at startup
- **WHEN** the engine starts up and calls `ensure_dirs()`
- **THEN** the `{data_dir}/documents/` directory SHALL be created if it does not exist

#### Scenario: Ingestion failure does not store file
- **WHEN** the background worker fails to process an ingestion job
- **THEN** the staged file SHALL be cleaned up and no file SHALL be written to the documents directory
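
The success and failure scenarios above could be combined roughly as in the sketch below. Only `ensure_dirs` is named in this spec; its signature here, the helper `finalize_staged_file`, and the `succeeded` flag are assumptions made for illustration. The sketch also assumes the staged file still carries the original extension.

```python
import shutil
from pathlib import Path
from typing import Optional


def ensure_dirs(data_dir: Path) -> Path:
    """Create {data_dir}/documents/ if it does not exist (assumed signature)."""
    documents_dir = data_dir / "documents"
    documents_dir.mkdir(parents=True, exist_ok=True)
    return documents_dir


def finalize_staged_file(
    staged: Path, data_dir: Path, content_hash: str, succeeded: bool
) -> Optional[Path]:
    """Move the staged file into permanent storage, or clean it up on failure."""
    if not succeeded:
        # Failed ingestion: remove the staged file, write nothing permanent.
        staged.unlink(missing_ok=True)
        return None
    target = ensure_dirs(data_dir) / f"{content_hash}{staged.suffix}"
    # shutil.move also works across filesystems, unlike a plain rename.
    shutil.move(str(staged), str(target))
    return target
```

The returned path is what a worker would then write to `documents.stored_path`; `None` signals that nothing was stored.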

---

### Requirement: File retrieval via API

The engine SHALL serve the original stored file for any document that has a stored file on disk.

#### Scenario: Download original file
- **WHEN** a client sends `GET /api/v1/documents/{id}/file` for a document with a stored file
- **THEN** the engine SHALL return the file with a `Content-Type` header derived from the file extension and a `Content-Disposition: attachment; filename="{original_filename}"` header; if `original_filename` is NULL, the filename SHALL fall back to `{title}{ext}`
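
One way to derive the two response headers is sketched below. The function name `download_headers` is hypothetical, and the `application/octet-stream` default for unknown extensions is an assumption; the spec does not say what to send in that case.

```python
import mimetypes
from pathlib import Path
from typing import Optional


def download_headers(
    stored_path: str, original_filename: Optional[str], title: str
) -> dict:
    """Compute Content-Type and Content-Disposition for a stored file."""
    ext = Path(stored_path).suffix
    # Fall back to {title}{ext} when original_filename is NULL.
    filename = original_filename if original_filename is not None else f"{title}{ext}"
    content_type, _encoding = mimetypes.guess_type(stored_path)
    return {
        # Assumed default for extensions the mimetypes module does not know.
        "Content-Type": content_type or "application/octet-stream",
        "Content-Disposition": f'attachment; filename="{filename}"',
    }
```

A production handler would also need to escape quotes and non-ASCII characters in the filename, which this sketch leaves out.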

#### Scenario: Download file for pre-existing document
- **WHEN** a client sends `GET /api/v1/documents/{id}/file` for a document ingested before this feature was added (`stored_path` is NULL)
- **THEN** the engine SHALL return HTTP 404 with `{"error": "Original file not available - ingested before document storage was enabled"}`

#### Scenario: Download file when file missing from disk
- **WHEN** a client sends `GET /api/v1/documents/{id}/file` for a document whose `stored_path` is set but the file no longer exists on disk
- **THEN** the engine SHALL return HTTP 404 with `{"error": "Stored file not found on disk"}`

#### Scenario: Download file for non-existent document
- **WHEN** a client sends `GET /api/v1/documents/{id}/file` with a non-existent document ID
- **THEN** the engine SHALL return HTTP 404 with `{"error": "Document not found"}`
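
The three 404 cases above differ only in their error message, so the check order matters. A minimal sketch of that decision, assuming a hypothetical `resolve_download` helper that receives the document row as a dict (or `None` when the ID is unknown):

```python
import os
from typing import Optional, Tuple


def resolve_download(document: Optional[dict]) -> Tuple[int, dict]:
    """Return (status, body) for GET /api/v1/documents/{id}/file."""
    if document is None:
        return 404, {"error": "Document not found"}
    stored_path = document.get("stored_path")
    if stored_path is None:
        # Document predates the storage feature.
        return 404, {
            "error": "Original file not available - ingested before "
            "document storage was enabled"
        }
    if not os.path.isfile(stored_path):
        # Row points at a file that has since disappeared.
        return 404, {"error": "Stored file not found on disk"}
    # The "path" key is a placeholder; a real handler would stream the file.
    return 200, {"path": stored_path}
```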

---

### Requirement: File cleanup on document deletion

The engine SHALL remove the stored original file from disk when a document is deleted.

#### Scenario: Delete document with stored file
- **WHEN** a client sends `DELETE /api/v1/documents/{id}` for a document with a stored file
- **THEN** the engine SHALL delete the document from the database (cascading to chunks, embeddings, tags) AND delete the stored file from disk

#### Scenario: Delete document when stored file already missing
- **WHEN** a client sends `DELETE /api/v1/documents/{id}` for a document whose stored file has been manually removed from disk
- **THEN** the engine SHALL delete the document from the database successfully and log a warning about the missing file

#### Scenario: Delete document without stored file (pre-existing)
- **WHEN** a client sends `DELETE /api/v1/documents/{id}` for a document with `stored_path` NULL
- **THEN** the engine SHALL delete the document from the database without attempting file removal
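
The file-removal half of these scenarios could look like the sketch below. The helper name `remove_stored_file`, the logger name, and the boolean return value are assumptions; the database delete itself is out of scope here.

```python
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger("document_storage")  # hypothetical logger name


def remove_stored_file(stored_path: Optional[str]) -> bool:
    """Delete the stored file for a document; return True if a file was removed."""
    if stored_path is None:
        # Pre-existing document: no file was ever stored, nothing to attempt.
        return False
    try:
        Path(stored_path).unlink()
    except FileNotFoundError:
        # File was removed manually; deletion of the row should still succeed.
        logger.warning("Stored file already missing: %s", stored_path)
        return False
    return True
```

Catching `FileNotFoundError` instead of checking for existence first avoids a race between the check and the removal.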

---

### Requirement: Database schema migration for stored_path and original_filename

The engine SHALL add `stored_path` and `original_filename` columns to the `documents` table for tracking permanent file locations and original upload filenames.

#### Scenario: Fresh database initialization
- **WHEN** the engine initializes a new database
- **THEN** the `documents` table SHALL include `stored_path TEXT` and `original_filename TEXT` columns in its schema

#### Scenario: Existing database migration
- **WHEN** the engine starts with a database created before this feature was added
- **THEN** the engine SHALL add `stored_path TEXT` and `original_filename TEXT` to the `documents` table via `ALTER TABLE`, adding each column only if it does not already exist
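
An idempotent version of this migration is sketched below. It assumes the database is SQLite, which this spec does not state, and the function name `migrate_documents_table` is hypothetical.

```python
import sqlite3


def migrate_documents_table(conn: sqlite3.Connection) -> None:
    """Add stored_path and original_filename to documents if they are missing."""
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(documents)")}
    for column in ("stored_path", "original_filename"):
        if column not in existing:
            # Column names come from the fixed tuple above, never user input.
            conn.execute(f"ALTER TABLE documents ADD COLUMN {column} TEXT")
    conn.commit()
```

Rows that existed before the migration get NULL in both columns, which is the state the pre-existing-document scenarios rely on.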