kb/openspec/changes/archive/2026-03-28-store-original-documents/proposal.md
steve b04823e67b Store original documents for download after ingestion
Persist uploaded files to {data_dir}/documents/{content_hash}{ext} after
successful ingestion. Add GET /documents/{id}/file endpoint for retrieval,
delete stored files on document deletion, and add `kb export` client command.
Includes schema migration, tests, and spec updates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 15:16:27 +00:00


## Why
The knowledge base currently discards original files after chunking and embedding. Once a document is ingested, only the extracted text chunks and vectors remain — the original PDF, markdown, or code file is deleted from staging. Users cannot retrieve the source document from the KB, which limits its usefulness as a document store and prevents use cases like re-processing with a different model or serving the original file to downstream tools.
## What Changes
- Add a persistent document storage directory (`{data_dir}/documents/`) alongside the SQLite database
- After successful ingestion, copy the original file from staging to permanent storage instead of deleting it
- Store the permanent file path in the `documents` table (`stored_path` column) and the original upload filename (`original_filename` column) so downloads use the correct name
- Add an API endpoint to download the original file by document ID
- Add a CLI command to export/retrieve the original document
- **BREAKING**: Deleting a document now also removes its stored file from disk
- Notes (text-only) are stored as `.note` files in the same directory for consistency
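
The storage step above can be sketched as follows. This is a minimal illustration, not the engine's actual code: the function name `store_original`, the use of SHA-256 for `content_hash`, and the lowercasing of the extension are assumptions; only the `{data_dir}/documents/{content_hash}{ext}` layout comes from the change description.

```python
import hashlib
import shutil
from pathlib import Path


def store_original(staged_file: Path, data_dir: Path, original_filename: str) -> Path:
    """Copy a staged upload to {data_dir}/documents/{content_hash}{ext}.

    Hypothetical helper: the name and signature are illustrative only.
    """
    documents_dir = data_dir / "documents"
    documents_dir.mkdir(parents=True, exist_ok=True)

    # Assumption: content_hash is the SHA-256 hex digest of the file bytes.
    digest = hashlib.sha256()
    with staged_file.open("rb") as fh:
        for block in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(block)

    # Assumption: the extension is taken from the upload name and lowercased.
    ext = Path(original_filename).suffix.lower()
    stored_path = documents_dir / f"{digest.hexdigest()}{ext}"

    # Identical content maps to the same file name, so an existing copy is reused.
    if not stored_path.exists():
        shutil.copy2(staged_file, stored_path)
    return stored_path
```

The returned path is what would be written to `stored_path`, with the upload name kept separately in `original_filename`. Copying rather than moving leaves the staged file in place, so staging cleanup can stay a separate step.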
## Capabilities
### New Capabilities
- `document-storage`: Persistent storage of original uploaded files on disk, lifecycle management (store on ingest, delete on document removal), and retrieval via API
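
The delete half of that lifecycle might look like the sketch below. It assumes a `documents` table with an `id` primary key alongside the `stored_path` column; the function name `delete_document` is hypothetical, and the check for other rows sharing the same file is a precaution added here, not something the proposal specifies.

```python
import sqlite3
from pathlib import Path


def delete_document(conn: sqlite3.Connection, doc_id: int) -> bool:
    """Delete a document row and its stored file. Returns False if the ID is unknown."""
    row = conn.execute(
        "SELECT stored_path FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    if row is None:
        return False
    stored_path = row[0]

    # Remove the row first; the context manager commits or rolls back.
    with conn:
        conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))

    if stored_path:
        # Precaution (assumption): content-addressed names could be shared,
        # so only unlink when no remaining row points at the same file.
        still_used = conn.execute(
            "SELECT 1 FROM documents WHERE stored_path = ? LIMIT 1", (stored_path,)
        ).fetchone()
        if still_used is None:
            Path(stored_path).unlink(missing_ok=True)
    return True
```

Deleting the row before the file means a crash in between leaves an orphaned file rather than a row pointing at a missing file, which is the easier state to clean up later.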
### Modified Capabilities
- `engine-api`: New endpoint `GET /api/v1/documents/{id}/file` to download the original file; delete endpoint must also clean up stored files; ingestion worker stores files instead of discarding them
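
A framework-neutral sketch of what the download endpoint has to resolve is shown below. The proposal does not name the web framework, so this only covers the lookup and response headers; `resolve_download`, the `id` column, and the choice of the `filename*` form of `Content-Disposition` are assumptions.

```python
import mimetypes
import sqlite3
from pathlib import Path
from typing import Dict, Optional, Tuple
from urllib.parse import quote


def resolve_download(
    conn: sqlite3.Connection, doc_id: int
) -> Optional[Tuple[Path, Dict[str, str]]]:
    """Map a document ID to (file path, response headers), or None for a 404."""
    row = conn.execute(
        "SELECT stored_path, original_filename FROM documents WHERE id = ?",
        (doc_id,),
    ).fetchone()
    if row is None or not row[0]:
        return None  # unknown ID, or a document ingested before files were stored

    path = Path(row[0])
    if not path.is_file():
        return None  # row exists but the file is missing on disk

    # Serve under the original upload name, falling back to the hashed name.
    download_name = row[1] or path.name
    content_type = mimetypes.guess_type(download_name)[0] or "application/octet-stream"
    headers = {
        "Content-Type": content_type,
        # Percent-encoded form, so non-ASCII and spaces survive in the header.
        "Content-Disposition": f"attachment; filename*=UTF-8''{quote(download_name, safe='')}",
    }
    return path, headers
```

A route handler for `GET /api/v1/documents/{id}/file` would call this, return 404 on `None`, and otherwise stream the file with the given headers.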
## Impact
- **Engine config**: New `documents_dir` property on Config, new directory created at startup via `ensure_dirs()`
- **Worker**: After successful chunking, copy the file from staging to the documents dir and record the permanent location in `stored_path`
- **Database schema**: Add `stored_path` and `original_filename` columns to `documents` table (migration for existing DBs)
- **Routes**: New file-download endpoint; update delete handler to remove stored file
- **Go client**: New `export` / `get-file` subcommand to download original documents
- **Docker**: `documents/` directory lives inside the existing `/data` volume — no new mounts needed
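
The schema migration for existing databases could be sketched as below. It assumes the columns are nullable `TEXT` and that the migration should be safe to run on every startup; the function name `migrate_documents_table` is hypothetical, and the project's real migration mechanism may differ.

```python
import sqlite3
from typing import List

# Column types are an assumption; the proposal only names the columns.
NEW_COLUMNS = {"stored_path": "TEXT", "original_filename": "TEXT"}


def migrate_documents_table(conn: sqlite3.Connection) -> List[str]:
    """Add the new columns to `documents` if missing. Returns the columns added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(documents)")}
    added = []
    with conn:
        for name, sql_type in NEW_COLUMNS.items():
            if name not in existing:
                conn.execute(f"ALTER TABLE documents ADD COLUMN {name} {sql_type}")
                added.append(name)
    return added
```

Rows ingested before the migration keep `NULL` in both columns, so the download endpoint needs to treat a missing `stored_path` as "no original available".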