## Context

Currently, uploaded files pass through a staging directory and are deleted after the worker extracts chunks and embeddings. The `documents.source_path` column stores the (now-stale) staging path. Users who want the original file must re-source it externally.

The data directory structure today is:

```
/data/
  kb.db
  hf_cache/
  staging/          # temporary, cleaned after processing
```

## Goals / Non-Goals

**Goals:**

- Persist every successfully-ingested original file for the lifetime of the document
- Serve the original file via API (`GET /api/v1/documents/{id}/file`)
- Clean up stored files when a document is deleted
- Work transparently with the existing Docker volume mount (`/data`)

**Non-Goals:**

- Serving transformed/converted versions of documents (e.g. PDF→HTML)
- De-duplicating file storage (same content hash = same row, so 1:1 is fine)
- Compression or archival of stored files
- Retroactive storage of files ingested before this change (they're already gone)

## Decisions

### 1. Storage layout: content-hash-based flat directory

Store files at `{data_dir}/documents/{content_hash}{ext}` (e.g. `documents/a1b2c3...d4.pdf`).

**Why over document-ID naming:** Content hash is available at staging time before the DB row exists, avoids race conditions, and makes dedup trivially safe (same hash = same file, overwrite is harmless). The hash is already computed for dedup checks.

**Why flat over nested:** The KB is a personal tool; expected scale is hundreds to low-thousands of documents. A flat directory is simpler and sufficient. If needed later, a `ab/cd/` prefix scheme is easy to add.

**Alternatives considered:**

- *Store in SQLite as BLOBs*: Bloats the DB, complicates backups, and degrades WAL performance for large files. Rejected.
- *Keep the staging path as-is*: Staging uses UUID prefixes which are meaningless; content-hash naming is deterministic and self-deduplicating.

### 2. Move file from staging to documents dir (not copy)

Use `shutil.move()` from staging to documents dir after successful ingestion, before `staging.cleanup()`. This avoids doubling disk usage during processing.

**Why not copy-then-delete:** Move is atomic on the same filesystem (which `/data/staging` and `/data/documents` share). Faster, no temporary disk spike.

### 3. New columns `stored_path` and `original_filename` on `documents` table

Add two nullable columns:

- `stored_path TEXT`: permanent file location on disk
- `original_filename TEXT`: the exact filename from the upload (e.g. `report.pdf`)

Both are nullable because existing documents (ingested before this change) won't have values.

**Why `original_filename` separate from `title`:** The `title` field can be user-overridden (e.g. "Engine Manual" instead of `report.pdf`). When serving the file for download, the `Content-Disposition` header should use the original filename so the downloaded file has the correct name and extension. The `original_filename` is sourced from `jobs.filename`, which is already captured at upload time.

Keep `source_path` as-is for backward compatibility (it records what the staging path was). `stored_path` is the permanent location.

**Migration:** Two `ALTER TABLE` statements. These are safe additive migrations; no data rewrite is needed.

### 4. File download endpoint returns the file directly

`GET /api/v1/documents/{id}/file` uses FastAPI's `FileResponse` with:

- `media_type` derived from the file extension
- `Content-Disposition: attachment; filename="{original_filename}"` (falls back to `{title}{ext}` if `original_filename` is NULL)
- Returns 404 if `stored_path` is NULL or file is missing from disk

### 5. Delete cascades to file removal

When `DELETE /api/v1/documents/{id}` is called, delete the stored file from disk after the DB delete succeeds. If file removal fails (already gone, permissions), log a warning but don't fail the API call; the DB is the source of truth.
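The file-handling side of Decisions 1, 2, and 5 could look roughly like the sketch below. It is illustrative only: the function names (`stored_path_for`, `persist_original`, `remove_stored_file`) and their signatures are hypothetical, and it assumes the content hash has already been computed by the existing dedup check and that the extension is taken from the uploaded filename.

```python
import logging
import shutil
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def stored_path_for(data_dir: Path, content_hash: str, original_filename: str) -> Path:
    """Build {data_dir}/documents/{content_hash}{ext} (hypothetical helper)."""
    ext = Path(original_filename).suffix  # e.g. ".pdf"; empty if there is no extension
    return data_dir / "documents" / f"{content_hash}{ext}"


def persist_original(staged_file: Path, data_dir: Path,
                     content_hash: str, original_filename: str) -> Path:
    """Move the staged upload to its permanent location and return that path."""
    target = stored_path_for(data_dir, content_hash, original_filename)
    target.parent.mkdir(parents=True, exist_ok=True)
    # When source and destination share a filesystem, shutil.move() renames
    # the file, so no second copy of the data is written.
    shutil.move(str(staged_file), str(target))
    return target


def remove_stored_file(stored_path: Optional[str]) -> None:
    """Best-effort removal after the DB delete; never raises on failure."""
    if not stored_path:
        return  # document was ingested before file storage existed
    try:
        Path(stored_path).unlink()
    except OSError as exc:
        # Already gone, permission denied, etc.: log it and carry on.
        logger.warning("could not remove stored file %s: %s", stored_path, exc)
```

In this sketch the worker would call `persist_original()` after successful ingestion and before `staging.cleanup()`, then write the returned path to `stored_path`. Overwriting an existing target relies on POSIX rename semantics, which is assumed to hold inside the Docker volume.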
## Risks / Trade-offs

- **Disk usage increases**: every ingested file persists. For the personal-use scale this is expected and acceptable. Users manage this via document deletion.
  → Mitigation: Document the storage behavior; `GET /api/v1/status` already shows DB size, could add documents-dir size later.
- **Pre-existing documents have no stored file**: `stored_path` will be NULL for documents ingested before this change.
  → Mitigation: The download endpoint returns 404 with a clear message ("original file not available — ingested before document storage was enabled"). No attempt to backfill.
- **File-DB consistency**: a crash between DB commit and file move could leave orphan staged files or missing stored files.
  → Mitigation: Move file first, then commit DB. If DB commit fails, the file in documents dir is harmless (orphan cleanup can be added later). If move fails, the job fails and staged file remains for retry.

## Open Questions

None. The scope is straightforward enough to proceed.