steve/kb

steve b04823e67b Store original documents for download after ingestion

Persist uploaded files to {data_dir}/documents/{content_hash}{ext} after
successful ingestion. Add GET /documents/{id}/file endpoint for retrieval,
delete stored files on document deletion, and add `kb export` client command.
Includes schema migration, tests, and spec updates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-28 15:16:27 +00:00

Document Storage

Purpose

Persistent storage, retrieval, and lifecycle management of original uploaded document files.

Requirements

Requirement: Persistent original file storage

The engine SHALL persistently store the original uploaded file on disk after successful ingestion. Files SHALL be stored at {data_dir}/documents/{content_hash}{extension}, where content_hash is the SHA-256 hex digest already computed for deduplication and extension is taken unchanged from the original filename. The documents table SHALL record the stored file path in a stored_path column and the original upload filename in an original_filename column.
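
As an illustration of the path layout above, the following is a minimal sketch rather than the engine's actual code. The function name stored_path_for and its parameters are hypothetical; only the {data_dir}/documents/{content_hash}{extension} layout and the use of SHA-256 come from this spec.

```python
import hashlib
from pathlib import Path


def stored_path_for(data_dir: Path, content_hash: str, original_filename: str) -> Path:
    """Build {data_dir}/documents/{content_hash}{extension} (hypothetical helper)."""
    # Path.suffix keeps only the last extension, e.g. ".pdf"; it is "" when there is none.
    extension = Path(original_filename).suffix
    return data_dir / "documents" / f"{content_hash}{extension}"


# Usage: the hash is assumed to be the SHA-256 hex digest already computed for dedup.
content_hash = hashlib.sha256(b"example upload bytes").hexdigest()
path = stored_path_for(Path("/var/lib/kb"), content_hash, "Quarterly Report.pdf")
```

Because the file name is derived from the content hash, two uploads with identical bytes and the same extension would map to the same path under this scheme.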

Scenario: File stored after successful ingestion

WHEN the background worker successfully processes an ingestion job for a PDF file
THEN the worker SHALL move the staged file to {data_dir}/documents/{content_hash}.pdf, store the permanent path in documents.stored_path, store the original filename in documents.original_filename, and delete the staging entry

Scenario: Note stored after successful ingestion

WHEN the background worker successfully processes an ingestion job for a text note
THEN the worker SHALL move the staged .note file to {data_dir}/documents/{content_hash}.note and store the permanent path in documents.stored_path

Scenario: Markdown file stored after successful ingestion

WHEN the background worker successfully processes an ingestion job for a markdown file
THEN the worker SHALL move the staged file to {data_dir}/documents/{content_hash}.md and store the permanent path in documents.stored_path

Scenario: Code file stored after successful ingestion

WHEN the background worker successfully processes an ingestion job for a code file (e.g. .py, .go)
THEN the worker SHALL move the staged file to {data_dir}/documents/{content_hash}{original_extension} and store the permanent path in documents.stored_path
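
The four storage scenarios above share one step: moving the staged file to its permanent, hash-named location. A rough sketch of that step follows. The name promote_staged_file and its signature are assumptions made for illustration, and the database update of documents.stored_path is left to the caller.

```python
import shutil
from pathlib import Path


def promote_staged_file(staged: Path, data_dir: Path, content_hash: str, extension: str) -> Path:
    """Move a staged upload to {data_dir}/documents/{content_hash}{extension}.

    Hypothetical helper: returns the permanent path so the caller can
    record it in documents.stored_path.
    """
    target = data_dir / "documents" / f"{content_hash}{extension}"
    # Defensive: the spec creates this directory at startup, but creating it again is harmless.
    target.parent.mkdir(parents=True, exist_ok=True)
    # shutil.move falls back to copy-and-delete when staging is on another filesystem.
    shutil.move(str(staged), str(target))
    return target
```

The extension argument would be ".pdf", ".note", ".md", or the original code-file extension, matching the scenarios above.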

Scenario: Documents directory created at startup

WHEN the engine starts up and calls ensure_dirs()
THEN the {data_dir}/documents/ directory SHALL be created if it does not exist
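
A minimal sketch of the directory creation, assuming ensure_dirs() receives the data directory as a pathlib.Path (the real signature is not given in this spec):

```python
from pathlib import Path


def ensure_dirs(data_dir: Path) -> None:
    """Create {data_dir}/documents/ if it does not exist (assumed signature)."""
    # exist_ok=True makes repeated startups a no-op instead of an error.
    (data_dir / "documents").mkdir(parents=True, exist_ok=True)
```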

Scenario: Ingestion failure does not store file

WHEN the background worker fails to process an ingestion job
THEN the staged file SHALL be cleaned up, as it was before this feature was added, and no file SHALL be written to the documents directory

Requirement: File retrieval via API

The engine SHALL serve the original stored file for any document that has a stored file on disk.

Scenario: Download original file

WHEN a client sends GET /api/v1/documents/{id}/file for a document with a stored file
THEN the engine SHALL return the file with appropriate Content-Type based on file extension and Content-Disposition: attachment; filename="{original_filename}" header, falling back to {title}{ext} if original_filename is NULL
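
The header rules in this scenario could be sketched as below. This is an assumption-laden illustration: download_headers is a hypothetical helper, the standard-library mimetypes module stands in for whatever extension-to-type mapping the engine uses, and the application/octet-stream fallback for unknown extensions is not stated in this spec.

```python
import mimetypes
from pathlib import Path
from typing import Dict, Optional


def download_headers(stored_path: str, original_filename: Optional[str], title: str) -> Dict[str, str]:
    """Build response headers for GET /api/v1/documents/{id}/file (hypothetical helper)."""
    ext = Path(stored_path).suffix
    # Fall back to {title}{ext} when original_filename is NULL (None here).
    filename = original_filename if original_filename is not None else f"{title}{ext}"
    # guess_type returns (type, encoding); type is None for unknown extensions.
    content_type, _ = mimetypes.guess_type(stored_path)
    return {
        # Assumption: unknown extensions are served as generic binary data.
        "Content-Type": content_type or "application/octet-stream",
        "Content-Disposition": f'attachment; filename="{filename}"',
    }
```

A production version would also need to escape or encode filenames containing double quotes or non-ASCII characters, which this sketch does not attempt.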

Scenario: Download file for pre-existing document

WHEN a client sends GET /api/v1/documents/{id}/file for a document ingested before this feature was added (stored_path is NULL)
THEN the engine SHALL return HTTP 404 with {"error": "Original file not available - ingested before document storage was enabled"}

Scenario: Download file when file missing from disk

WHEN a client sends GET /api/v1/documents/{id}/file for a document whose stored_path is set but the file no longer exists on disk
THEN the engine SHALL return HTTP 404 with {"error": "Stored file not found on disk"}

Scenario: Download file for non-existent document

WHEN a client sends GET /api/v1/documents/{id}/file with a non-existent document ID
THEN the engine SHALL return HTTP 404 with {"error": "Document not found"}
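
The three 404 scenarios above can be read as an ordered series of checks. The sketch below shows one possible ordering; resolve_download and the dict-shaped document row are hypothetical, while the error messages are copied from the scenarios.

```python
import os
from typing import Optional, Tuple


def resolve_download(doc: Optional[dict]) -> Tuple[int, dict]:
    """Map a documents row (or None) to an HTTP status and JSON body (hypothetical helper)."""
    if doc is None:
        return 404, {"error": "Document not found"}
    stored_path = doc.get("stored_path")
    if stored_path is None:
        # Row predates the feature: no file was ever stored.
        return 404, {
            "error": "Original file not available - ingested before document storage was enabled"
        }
    if not os.path.isfile(stored_path):
        # Path was recorded, but the file has since disappeared from disk.
        return 404, {"error": "Stored file not found on disk"}
    # Success: the caller would stream the file at this path.
    return 200, {"path": stored_path}
```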

Requirement: File cleanup on document deletion

The engine SHALL remove the stored original file from disk when a document is deleted.

Scenario: Delete document with stored file

WHEN a client sends DELETE /api/v1/documents/{id} for a document with a stored file
THEN the engine SHALL delete the document from the database (cascading to chunks, embeddings, tags) AND delete the stored file from disk

Scenario: Delete document when stored file already missing

WHEN a client sends DELETE /api/v1/documents/{id} for a document whose stored file has been manually removed from disk
THEN the engine SHALL delete the document from the database successfully and log a warning about the missing file

Scenario: Delete document without stored file (pre-existing)

WHEN a client sends DELETE /api/v1/documents/{id} for a document with stored_path NULL
THEN the engine SHALL delete the document from the database without attempting file removal
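
The file-side behavior of the three deletion scenarios might look like the following sketch. The helper name remove_stored_file is an assumption, and the database deletion with its cascade is assumed to happen separately in the caller.

```python
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def remove_stored_file(stored_path: Optional[str]) -> bool:
    """Delete the stored original, if any; return True only when a file was removed."""
    if stored_path is None:
        # Pre-existing document: nothing was stored, so do not touch the disk.
        return False
    try:
        Path(stored_path).unlink()
        return True
    except FileNotFoundError:
        # File was removed manually; deletion of the document still succeeds.
        logger.warning("Stored file already missing: %s", stored_path)
        return False
```

Catching FileNotFoundError instead of checking for existence first avoids a race between the check and the unlink.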

Requirement: Database schema migration for stored_path and original_filename

The engine SHALL add stored_path and original_filename columns to the documents table for tracking permanent file locations and original upload filenames.

Scenario: Fresh database initialization

WHEN the engine initializes a new database
THEN the documents table SHALL include stored_path TEXT and original_filename TEXT columns in its schema

Scenario: Existing database migration

WHEN the engine starts with a database created before this feature
THEN the engine SHALL add stored_path TEXT and original_filename TEXT to the documents table via ALTER TABLE if the columns do not exist
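
A sketch of an idempotent migration follows, assuming the database is SQLite accessed through Python's sqlite3 module. This spec does not name the database engine, so treat that as an assumption; migrate_documents_table is likewise a hypothetical name.

```python
import sqlite3


def migrate_documents_table(conn: sqlite3.Connection) -> None:
    """Add stored_path and original_filename to documents if they are missing."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(documents)")}
    for column in ("stored_path", "original_filename"):
        if column not in existing:
            # Column names come from the fixed tuple above, never from user input.
            conn.execute(f"ALTER TABLE documents ADD COLUMN {column} TEXT")
    conn.commit()
```

Running the function on every startup is safe under these assumptions: on a fresh or already-migrated database both columns exist and no ALTER TABLE is issued. Existing rows get NULL in the new columns, which matches the pre-existing document scenarios above.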
