b04823e67b
Persist uploaded files to {data_dir}/documents/{content_hash}{ext} after
successful ingestion. Add GET /documents/{id}/file endpoint for retrieval,
delete stored files on document deletion, and add `kb export` client command.
Includes schema migration, tests, and spec updates.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Why
The knowledge base currently discards original files after chunking and embedding. Once a document is ingested, only the extracted text chunks and vectors remain — the original PDF, markdown, or code file is deleted from staging. Users cannot retrieve the source document from the KB, which limits its usefulness as a document store and prevents use cases like re-processing with a different model or serving the original file to downstream tools.
## What Changes
- Add a persistent document storage directory (`{data_dir}/documents/`) alongside the SQLite database
- After successful ingestion, copy the original file from staging to permanent storage instead of deleting it
- Store the permanent file path in the `documents` table (`stored_path` column) and the original upload filename (`original_filename` column) so downloads use the correct name
- Add an API endpoint to download the original file by document ID
- Add a CLI command to export/retrieve the original document
- **BREAKING**: Deleting a document now also removes its stored file from disk
- Notes (text-only) are stored as `.note` files in the same directory for consistency
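
The store-on-ingest step above can be sketched as follows. This is a minimal illustration, not the engine's actual code: `content_hash` and `store_original` are hypothetical names, and SHA-256 is an assumption, since the proposal does not say which hash produces `{content_hash}`.

```python
import hashlib
import shutil
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks (assumed algorithm)."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def store_original(staged: Path, documents_dir: Path) -> Path:
    """Copy a staged upload to {documents_dir}/{content_hash}{ext} and return the new path."""
    documents_dir.mkdir(parents=True, exist_ok=True)
    target = documents_dir / f"{content_hash(staged)}{staged.suffix}"
    if not target.exists():  # identical content is already stored under this name
        shutil.copy2(staged, target)
    return target
```

The returned path is what would be written to `stored_path`; the upload's own name goes to `original_filename`, because the on-disk name no longer carries it.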
## Capabilities
### New Capabilities
- `document-storage`: Persistent storage of original uploaded files on disk, lifecycle management (store on ingest, delete on document removal), and retrieval via API
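
The delete half of that lifecycle might look like the sketch below. `remove_stored_file` is a hypothetical helper, and the guard against paths outside the documents directory is an added assumption rather than something the proposal specifies.

```python
from pathlib import Path
from typing import Optional


def remove_stored_file(stored_path: Optional[str], documents_dir: Path) -> bool:
    """Delete a document's stored file; return True if a file was actually removed."""
    if not stored_path:  # e.g. rows ingested before the migration have no stored file
        return False
    target = Path(stored_path).resolve()
    # Refuse to delete anything that does not live under the documents directory.
    if documents_dir.resolve() not in target.parents:
        raise ValueError(f"{target} is outside {documents_dir}")
    try:
        target.unlink()
    except FileNotFoundError:  # already gone; treat as a no-op
        return False
    return True
```

Because files are named by content hash, two document rows with identical content could point at the same file. Whether that can happen is not stated here; if it can, the delete handler would need to check for remaining references before unlinking.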
### Modified Capabilities
- `engine-api`: New endpoint `GET /api/v1/documents/{id}/file` to download the original file; delete endpoint must also clean up stored files; ingestion worker stores files instead of discarding them
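
For the download endpoint, the response should carry the original upload name rather than the hash-based name on disk. The proposal does not name the web framework, so this sketch builds only the headers with the standard library; `download_headers` is a hypothetical helper.

```python
import mimetypes
from urllib.parse import quote


def download_headers(original_filename: str) -> dict[str, str]:
    """Build Content-Type and Content-Disposition headers for a file download."""
    content_type, _ = mimetypes.guess_type(original_filename)
    # Plain-ASCII fallback for clients that ignore the extended filename* parameter.
    fallback = original_filename.encode("ascii", "replace").decode("ascii")
    fallback = fallback.replace("\\", "_").replace('"', "_")
    return {
        "Content-Type": content_type or "application/octet-stream",
        "Content-Disposition": (
            f'attachment; filename="{fallback}"; '
            f"filename*=UTF-8''{quote(original_filename, safe='')}"
        ),
    }
```

A handler for `GET /api/v1/documents/{id}/file` would look up `stored_path` and `original_filename` for the ID, return 404 if the row or file is missing, and stream the file with these headers.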
## Impact
- **Engine config**: New `documents_dir` property on Config, new directory created at startup via `ensure_dirs()`
- **Worker**: After successful chunking, move or copy the file from staging to the documents directory, then record the permanent location in `stored_path` in place of the staging `source_path`
- **Database schema**: Add `stored_path` and `original_filename` columns to `documents` table (migration for existing DBs)
- **Routes**: New file-download endpoint; update delete handler to remove stored file
- **Go client**: New `export` / `get-file` subcommand to download original documents
- **Docker**: `documents/` directory lives inside the existing `/data` volume — no new mounts needed
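
The schema migration for existing databases could be as small as the sketch below. It assumes both columns are nullable `TEXT`, which the proposal does not specify, and `migrate_documents_table` is a hypothetical name.

```python
import sqlite3

# Assumed column types; the proposal only names the columns.
NEW_COLUMNS = {"stored_path": "TEXT", "original_filename": "TEXT"}


def migrate_documents_table(conn: sqlite3.Connection) -> list[str]:
    """Add any missing storage columns to `documents`; return the names added."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(documents)")}
    added = []
    for name, sql_type in NEW_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE documents ADD COLUMN {name} {sql_type}")
            added.append(name)
    conn.commit()
    return added
```

Checking `PRAGMA table_info` first makes the migration safe to run on every startup. Rows ingested before the change keep `NULL` in both columns, so the download and delete handlers need to tolerate a missing stored file.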