kb/openspec/changes/archive/2026-03-28-store-original-documents/proposal.md
steve b04823e67b Store original documents for download after ingestion
Persist uploaded files to {data_dir}/documents/{content_hash}{ext} after
successful ingestion. Add GET /documents/{id}/file endpoint for retrieval,
delete stored files on document deletion, and add `kb export` client command.
Includes schema migration, tests, and spec updates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 15:16:27 +00:00

Why

The knowledge base currently discards original files after chunking and embedding. Once a document is ingested, only the extracted text chunks and vectors remain — the original PDF, markdown, or code file is deleted from staging. Users cannot retrieve the source document from the KB, which limits its usefulness as a document store and prevents use cases like re-processing with a different model or serving the original file to downstream tools.

What Changes

  • Add a persistent document storage directory ({data_dir}/documents/) alongside the SQLite database
  • After successful ingestion, copy the original file from staging to permanent storage instead of deleting it
  • Store the permanent file path in the documents table (stored_path column) and the original upload filename (original_filename column) so downloads use the correct name
  • Add an API endpoint to download the original file by document ID
  • Add a CLI command to export/retrieve the original document
  • BREAKING: Delete document now also removes the stored file from disk
  • Notes (text-only) are stored as .note files in the same directory for consistency
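The store-on-ingest step can be sketched roughly as follows. This is a minimal illustration, not the engine's actual worker code: the function name `store_original` is hypothetical, and the use of SHA-256 for `content_hash` is an assumption, since the proposal does not name a hash algorithm. The target layout `{data_dir}/documents/{content_hash}{ext}` follows the commit message.

```python
import hashlib
import shutil
from pathlib import Path


def store_original(staged: Path, data_dir: Path) -> Path:
    """Copy a staged upload to {data_dir}/documents/{content_hash}{ext}.

    Hypothetical helper; returns the permanent path that would be
    written to the stored_path column.
    """
    documents_dir = data_dir / "documents"
    documents_dir.mkdir(parents=True, exist_ok=True)

    # Assumption: content_hash is the SHA-256 hex digest of the file bytes.
    # Reading the whole file at once is fine for a sketch; a real worker
    # would probably hash in chunks.
    digest = hashlib.sha256(staged.read_bytes()).hexdigest()
    target = documents_dir / f"{digest}{staged.suffix}"

    # Copy instead of move so the staging cleanup logic stays unchanged.
    # Identical content maps to the same name, so an existing file is reused.
    if not target.exists():
        shutil.copy2(staged, target)
    return target
```

Because the stored name is derived from the content rather than the upload name, the original upload filename has to be kept separately, which is what the `original_filename` column is for.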

Capabilities

New Capabilities

  • document-storage: Persistent storage of original uploaded files on disk, lifecycle management (store on ingest, delete on document removal), and retrieval via API

Modified Capabilities

  • engine-api: New endpoint GET /api/v1/documents/{id}/file to download the original file; delete endpoint must also clean up stored files; ingestion worker stores files instead of discarding them
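The delete-side cleanup could look something like the sketch below. It assumes a `documents` table with an `id` primary key and the `stored_path` column from this proposal; the function name `delete_document` and the integer ID type are illustrative assumptions, not the engine's real handler.

```python
import sqlite3
from pathlib import Path


def delete_document(conn: sqlite3.Connection, doc_id: int) -> bool:
    """Delete a document row and its stored file (hypothetical helper).

    Returns False when no row with this ID exists.
    """
    row = conn.execute(
        "SELECT stored_path FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    if row is None:
        return False

    conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
    conn.commit()

    # Remove the file only after the row is gone. Documents ingested
    # before the migration have no stored_path, and a file that is
    # already missing is not treated as an error.
    stored_path = row[0]
    if stored_path:
        Path(stored_path).unlink(missing_ok=True)
    return True
```

If several rows could point at the same content-hash file, a real handler would need to check for remaining references before unlinking; this sketch does not do that.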

Impact

  • Engine config: New documents_dir property on Config, new directory created at startup via ensure_dirs()
  • Worker: After successful chunking, move/copy file from staging to documents dir; update stored_path with permanent location
  • Database schema: Add stored_path and original_filename columns to documents table (migration for existing DBs)
  • Routes: New file-download endpoint; update delete handler to remove stored file
  • Go client: New export / get-file subcommand to download original documents
  • Docker: documents/ directory lives inside the existing /data volume — no new mounts needed
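The schema migration for existing databases might be sketched as below, using only the standard `sqlite3` module. The column names come from this proposal; the `TEXT` column type and the function name `migrate_documents_table` are assumptions made for illustration.

```python
import sqlite3

# Columns named in the proposal; the TEXT type is an assumption.
NEW_COLUMNS = {"stored_path": "TEXT", "original_filename": "TEXT"}


def migrate_documents_table(conn: sqlite3.Connection) -> list[str]:
    """Add missing columns to documents and return the names that were added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(documents)")}

    added = []
    for name, sql_type in NEW_COLUMNS.items():
        if name not in existing:
            # Names and types are fixed constants above, never user input.
            conn.execute(f"ALTER TABLE documents ADD COLUMN {name} {sql_type}")
            added.append(name)
    conn.commit()
    return added
```

Checking the existing columns first makes the migration safe to run on every startup: a database that already has both columns is left untouched, and rows ingested before the change simply carry NULL in the new columns.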