kb/openspec/changes/archive/2026-03-28-store-original-documents/design.md at main

steve b04823e67b Store original documents for download after ingestion

Persist uploaded files to {data_dir}/documents/{content_hash}{ext} after
successful ingestion. Add GET /documents/{id}/file endpoint for retrieval,
delete stored files on document deletion, and add `kb export` client command.
Includes schema migration, tests, and spec updates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-28 15:16:27 +00:00


Context

Currently, uploaded files pass through a staging directory and are deleted after the worker extracts chunks and embeddings. The documents.source_path column stores the (now-stale) staging path. Users who want the original file must re-source it externally. The data directory structure today is:

/data/
  kb.db
  hf_cache/
  staging/      # temporary, cleaned after processing

Goals / Non-Goals

Goals:

Persist every successfully-ingested original file for the lifetime of the document
Serve the original file via API (GET /api/v1/documents/{id}/file)
Clean up stored files when a document is deleted
Work transparently with the existing Docker volume mount (/data)

Non-Goals:

Serving transformed/converted versions of documents (e.g. PDF→HTML)
De-duplicating file storage (same content hash = same row, so 1:1 is fine)
Compression or archival of stored files
Retroactive storage of files ingested before this change (they're already gone)

Decisions

1. Storage layout: content-hash-based flat directory

Store files at {data_dir}/documents/{content_hash}{ext} (e.g. documents/a1b2c3...d4.pdf).

Why over document-ID naming: The content hash is available at staging time, before the DB row exists, so the file name never depends on a row ID that has not been assigned yet. It also makes dedup trivially safe (same hash = same file, so an overwrite is harmless). The hash is already computed for dedup checks.

Why flat over nested: The KB is a personal tool; expected scale is hundreds to low-thousands of documents. A flat directory is simpler and sufficient. If needed later, an ab/cd/ prefix scheme is easy to add.
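A minimal sketch of how the path could be derived. The helper names `content_hash` and `stored_path`, the `nested` flag, and the choice of SHA-256 are assumptions for illustration; the design does not name the hash function or any helper.

```python
import hashlib
from pathlib import Path


def content_hash(data: bytes) -> str:
    # Assumption: SHA-256 hex digest. The design only says "content hash".
    return hashlib.sha256(data).hexdigest()


def stored_path(data_dir: Path, digest: str, original_filename: str,
                nested: bool = False) -> Path:
    # Keep the upload's extension so the stored file stays recognisable.
    ext = Path(original_filename).suffix  # "" when the name has no extension
    base = data_dir / "documents"
    if nested:
        # Possible later extension: ab/cd/ prefix directories.
        base = base / digest[:2] / digest[2:4]
    return base / f"{digest}{ext}"
```

Because the path depends only on the file's bytes and extension, two uploads of the same content resolve to the same location.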

Alternatives considered:

Store in SQLite as BLOBs: Bloats the DB, complicates backups, and degrades WAL performance for large files. Rejected.
Keep the staging path as-is: Staging uses UUID prefixes which are meaningless; content-hash naming is deterministic and self-deduplicating.

2. Move file from staging to documents dir (not copy)

Use shutil.move() from staging to documents dir after successful ingestion, before staging.cleanup(). This avoids doubling disk usage during processing.

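Why not copy-then-delete: On the same filesystem (which /data/staging and /data/documents share), shutil.move() performs a rename, which is atomic on POSIX systems. It is also faster and causes no temporary disk spike. If the two directories were ever placed on different filesystems, shutil.move() would fall back to copy-then-delete and these properties would no longer hold.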
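The move step might look like the sketch below. `persist_original` is a hypothetical name, and it assumes the staged file name still carries the upload's extension.

```python
import shutil
from pathlib import Path


def persist_original(staged: Path, documents_dir: Path, digest: str) -> Path:
    # Create the documents directory on first use.
    documents_dir.mkdir(parents=True, exist_ok=True)
    dest = documents_dir / f"{digest}{staged.suffix}"
    # On the same filesystem this is a rename; an existing file with the
    # same hash is replaced, which is harmless because the bytes match.
    shutil.move(str(staged), str(dest))
    return dest
```

After the call the staged file no longer exists, so the later staging cleanup has nothing left to remove for this upload.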

3. New columns `stored_path` and `original_filename` on `documents` table

Add two nullable columns:

stored_path TEXT — permanent file location on disk
original_filename TEXT — the exact filename from the upload (e.g. report.pdf)

Both are nullable because existing documents (ingested before this change) won't have values.

Why original_filename separate from title: The title field can be user-overridden (e.g. "Engine Manual" instead of report.pdf). When serving the file for download, the Content-Disposition header should use the original filename so the downloaded file has the correct name and extension. The original_filename is sourced from jobs.filename which is already captured at upload time.

Keep source_path as-is for backward compatibility (it records what the staging path was). stored_path is the permanent location.

Migration: Two ALTER TABLE statements — safe additive migrations, no data rewrite needed.
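As a sketch, the migration could be applied with the standard `sqlite3` module as follows. The `migrate` function and the `PRAGMA table_info` guard that makes it safe to re-run are assumptions; the design only specifies the two ALTER TABLE statements.

```python
import sqlite3


def migrate(conn: sqlite3.Connection) -> None:
    # Column names already present on the documents table.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(documents)")}
    for column in ("stored_path", "original_filename"):
        if column not in existing:
            # Additive and nullable: existing rows simply get NULL.
            conn.execute(f"ALTER TABLE documents ADD COLUMN {column} TEXT")
    conn.commit()
```

Rows ingested before the change keep their data and read back NULL for both new columns.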

4. File download endpoint returns the file directly

GET /api/v1/documents/{id}/file uses FastAPI's FileResponse with:

media_type derived from the file extension
Content-Disposition: attachment; filename="{original_filename}" (falls back to {title}{ext} if original_filename is NULL)
Returns 404 if stored_path is NULL or file is missing from disk
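The header logic above can be sketched without the web framework. `download_metadata` is a hypothetical helper, and the `application/octet-stream` default for unknown extensions is an assumption; the values it returns are the kind that would be passed to FileResponse as `media_type` and `filename`.

```python
import mimetypes
from pathlib import Path
from typing import Optional, Tuple


def download_metadata(stored_path: str, original_filename: Optional[str],
                      title: str) -> Tuple[str, str]:
    ext = Path(stored_path).suffix
    # Prefer the exact upload name; fall back to {title}{ext} when it is NULL.
    filename = original_filename or f"{title}{ext}"
    # Derive the media type from the stored file's extension.
    media_type, _ = mimetypes.guess_type(stored_path)
    # Assumption: unknown extensions are served as generic binary data.
    return media_type or "application/octet-stream", filename
```

The 404 cases (NULL stored_path, or a file missing from disk) would be checked by the endpoint before this helper is called.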

5. Delete cascades to file removal

When DELETE /api/v1/documents/{id} is called, delete the stored file from disk after the DB delete succeeds. If file removal fails (already gone, permissions), log a warning but don't fail the API call — the DB is the source of truth.
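A sketch of that ordering, assuming a plain `sqlite3` connection. The function name, logger name, and boolean return value are hypothetical.

```python
import logging
import sqlite3
from pathlib import Path

logger = logging.getLogger("kb.documents")  # hypothetical logger name


def delete_document(conn: sqlite3.Connection, doc_id: int) -> bool:
    row = conn.execute(
        "SELECT stored_path FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    if row is None:
        return False  # the endpoint would turn this into a 404

    # The DB delete comes first: the DB is the source of truth.
    conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
    conn.commit()

    stored = row[0]
    if stored:
        try:
            Path(stored).unlink()
        except OSError as exc:
            # Already gone or not permitted: warn, but do not fail the call.
            logger.warning("could not remove stored file %s: %s", stored, exc)
    return True
```

Catching `OSError` covers both the missing-file and the permissions case, since `FileNotFoundError` and `PermissionError` are subclasses of it.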

Risks / Trade-offs

Disk usage increases — every ingested file persists. For the personal-use scale this is expected and acceptable. Users manage this via document deletion. → Mitigation: Document the storage behavior; GET /api/v1/status already shows DB size, could add documents-dir size later.
Pre-existing documents have no stored file — stored_path will be NULL for documents ingested before this change. → Mitigation: The download endpoint returns 404 with a clear message ("original file not available — ingested before document storage was enabled"). No attempt to backfill.
File-DB consistency — crash between DB commit and file move could leave orphan staged files or missing stored files. → Mitigation: Move file first, then commit DB. If DB commit fails, the file in documents dir is harmless (orphan cleanup can be added later). If move fails, the job fails and staged file remains for retry.
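The move-first, commit-second ordering from the last mitigation could be sketched as below. `finalize_ingestion` is a hypothetical name, and it assumes the documents row already exists and only needs its stored_path set; the actual worker may insert the row at this point instead.

```python
import shutil
import sqlite3
from pathlib import Path


def finalize_ingestion(conn: sqlite3.Connection, doc_id: int,
                       staged: Path, dest: Path) -> None:
    # Step 1: move the file. If this raises, nothing has been committed
    # and the staged file is still in place for a retry.
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(staged), str(dest))

    # Step 2: record the permanent path. If this fails, the file at dest
    # is a harmless orphan that a later cleanup pass could remove.
    try:
        conn.execute(
            "UPDATE documents SET stored_path = ? WHERE id = ?",
            (str(dest), doc_id),
        )
        conn.commit()
    except sqlite3.Error:
        conn.rollback()
        raise
```

With this order, the database never points at a file that was not moved successfully.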

Open Questions

None — the scope is straightforward enough to proceed.
