steve b04823e67b Store original documents for download after ingestion
Persist uploaded files to {data_dir}/documents/{content_hash}{ext} after
successful ingestion. Add GET /documents/{id}/file endpoint for retrieval,
delete stored files on document deletion, and add `kb export` client command.
Includes schema migration, tests, and spec updates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 15:16:27 +00:00


Document Storage

Purpose

Persistent storage, retrieval, and lifecycle management of original uploaded document files.

Requirements

Requirement: Persistent original file storage

The engine SHALL persistently store the original uploaded file on disk after successful ingestion. Files SHALL be stored at {data_dir}/documents/{content_hash}{extension}, where content_hash is the SHA-256 hex digest already computed for deduplication and extension is the file extension preserved from the original filename. The documents table SHALL record the stored file path in a stored_path column and the original upload filename in an original_filename column.
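
The path layout in this requirement can be sketched as follows. This is a minimal illustration, not the engine's actual code: the helper names content_hash and stored_path_for are hypothetical, and taking the extension from Path.suffix is an assumption about how "preserved from the original filename" is interpreted.

```python
import hashlib
from pathlib import Path


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw file bytes."""
    return hashlib.sha256(data).hexdigest()


def stored_path_for(data_dir: Path, digest: str, original_filename: str) -> Path:
    """Build {data_dir}/documents/{content_hash}{extension} (hypothetical helper)."""
    # Path.suffix keeps the leading dot (".pdf") and is "" when there is no extension.
    extension = Path(original_filename).suffix
    return data_dir / "documents" / f"{digest}{extension}"
```

Note that Path.suffix returns only the last component of a multi-part extension, so "archive.tar.gz" would be stored with ".gz" under this sketch.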

Scenario: File stored after successful ingestion

  • WHEN the background worker successfully processes an ingestion job for a PDF file
  • THEN the worker SHALL move the staged file to {data_dir}/documents/{content_hash}.pdf, store the permanent path in documents.stored_path, store the original filename in documents.original_filename, and delete the staging entry

Scenario: Note stored after successful ingestion

  • WHEN the background worker successfully processes an ingestion job for a text note
  • THEN the worker SHALL move the staged .note file to {data_dir}/documents/{content_hash}.note and store the permanent path in documents.stored_path

Scenario: Markdown file stored after successful ingestion

  • WHEN the background worker successfully processes an ingestion job for a markdown file
  • THEN the worker SHALL move the staged file to {data_dir}/documents/{content_hash}.md and store the permanent path in documents.stored_path

Scenario: Code file stored after successful ingestion

  • WHEN the background worker successfully processes an ingestion job for a code file (e.g. .py, .go)
  • THEN the worker SHALL move the staged file to {data_dir}/documents/{content_hash}{original_extension} and store the permanent path in documents.stored_path
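
The move step shared by the scenarios above might look like the sketch below. It assumes the staged file already carries the extension to preserve (for example .pdf, .note, .md, .py), which the spec implies but does not state outright. promote_staged_file is a hypothetical name, and recording stored_path in the database is left to the caller.

```python
import shutil
from pathlib import Path


def promote_staged_file(staged: Path, data_dir: Path, digest: str) -> Path:
    """Move a staged upload to its permanent location and return that path."""
    # Assumption: the staged file's suffix is the extension to preserve.
    dest = data_dir / "documents" / f"{digest}{staged.suffix}"
    dest.parent.mkdir(parents=True, exist_ok=True)
    # shutil.move also works across filesystems, where a plain rename would fail.
    shutil.move(str(staged), str(dest))
    return dest
```

The returned path is the value the worker would write to documents.stored_path.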

Scenario: Documents directory created at startup

  • WHEN the engine starts up and calls ensure_dirs()
  • THEN the {data_dir}/documents/ directory SHALL be created if it does not exist
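
A sketch of the directory creation that ensure_dirs() would need to perform. The real signature of ensure_dirs() is not given in this spec, so the function below uses a different, hypothetical name and takes data_dir as an explicit argument.

```python
from pathlib import Path


def ensure_documents_dir(data_dir: Path) -> Path:
    """Create {data_dir}/documents/ if it does not exist and return it."""
    documents_dir = data_dir / "documents"
    # exist_ok=True makes repeated startups a no-op instead of an error.
    documents_dir.mkdir(parents=True, exist_ok=True)
    return documents_dir
```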

Scenario: Ingestion failure does not store file

  • WHEN the background worker fails to process an ingestion job
  • THEN the staged file SHALL be cleaned up as it was before this feature, and no file SHALL be written to the documents directory
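
One way to keep the failure path from touching the documents directory is to run processing first and promote the file only when it succeeds. The sketch below is illustrative: process_staged_job and the process callback are hypothetical stand-ins for the worker's real ingestion step.

```python
from pathlib import Path
from typing import Callable


def process_staged_job(staged: Path, process: Callable[[Path], None]) -> bool:
    """Run ingestion on a staged file; report whether it may be promoted."""
    try:
        process(staged)
    except Exception:
        # Failure: remove the staged file and signal that nothing should be stored.
        staged.unlink(missing_ok=True)
        return False
    # Success: the staged file is left in place for the caller to move.
    return True
```

A caller would move the file into {data_dir}/documents/ only when this returns True.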

Requirement: File retrieval via API

The engine SHALL serve the original stored file for any document that has a stored file on disk.

Scenario: Download original file

  • WHEN a client sends GET /api/v1/documents/{id}/file for a document with a stored file
  • THEN the engine SHALL return the file with a Content-Type derived from the file extension and a Content-Disposition: attachment; filename="{original_filename}" header, using {title}{ext} as the filename when original_filename is NULL
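
The header selection in this scenario could be sketched as below. Using the standard-library mimetypes module, defaulting to application/octet-stream for unknown extensions, and escaping quotes in the filename are all assumptions; the spec only requires an "appropriate" Content-Type. download_headers is a hypothetical helper.

```python
import mimetypes
from pathlib import Path
from typing import Dict, Optional


def download_headers(stored_path: str, original_filename: Optional[str], title: str) -> Dict[str, str]:
    """Build response headers for GET /documents/{id}/file (illustrative only)."""
    ext = Path(stored_path).suffix
    # Fall back to {title}{ext} when no original filename was recorded.
    filename = original_filename if original_filename is not None else f"{title}{ext}"
    content_type, _ = mimetypes.guess_type(f"file{ext}")
    # Escape backslashes and quotes so the quoted filename stays well-formed.
    safe = filename.replace("\\", "\\\\").replace('"', '\\"')
    return {
        "Content-Type": content_type or "application/octet-stream",
        "Content-Disposition": f'attachment; filename="{safe}"',
    }
```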

Scenario: Download file for pre-existing document

  • WHEN a client sends GET /api/v1/documents/{id}/file for a document ingested before this feature was added (stored_path is NULL)
  • THEN the engine SHALL return HTTP 404 with {"error": "Original file not available - ingested before document storage was enabled"}

Scenario: Download file when file missing from disk

  • WHEN a client sends GET /api/v1/documents/{id}/file for a document whose stored_path is set but the file no longer exists on disk
  • THEN the engine SHALL return HTTP 404 with {"error": "Stored file not found on disk"}

Scenario: Download file for non-existent document

  • WHEN a client sends GET /api/v1/documents/{id}/file with a non-existent document ID
  • THEN the engine SHALL return HTTP 404 with {"error": "Document not found"}
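
The three 404 cases above differ only in which check fails first. A minimal sketch of that ordering follows, assuming the document row is available as a dict (or None when the ID does not exist). resolve_download is a hypothetical name, and the return shape is chosen for illustration rather than taken from the engine.

```python
from pathlib import Path
from typing import Optional, Tuple, Union


def resolve_download(doc: Optional[dict]) -> Tuple[int, Union[dict, Path]]:
    """Return (status, payload): a file path on success, an error body otherwise."""
    if doc is None:
        return 404, {"error": "Document not found"}
    stored_path = doc.get("stored_path")
    if stored_path is None:
        # Document predates the feature, so no original file was kept.
        return 404, {
            "error": "Original file not available - ingested before document storage was enabled"
        }
    path = Path(stored_path)
    if not path.is_file():
        return 404, {"error": "Stored file not found on disk"}
    return 200, path
```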

Requirement: File cleanup on document deletion

The engine SHALL remove the stored original file from disk when a document is deleted.

Scenario: Delete document with stored file

  • WHEN a client sends DELETE /api/v1/documents/{id} for a document with a stored file
  • THEN the engine SHALL delete the document from the database (cascading to chunks, embeddings, tags) AND delete the stored file from disk

Scenario: Delete document when stored file already missing

  • WHEN a client sends DELETE /api/v1/documents/{id} for a document whose stored file has been manually removed from disk
  • THEN the engine SHALL delete the document from the database successfully and log a warning about the missing file

Scenario: Delete document without stored file (pre-existing)

  • WHEN a client sends DELETE /api/v1/documents/{id} for a document with stored_path NULL
  • THEN the engine SHALL delete the document from the database without attempting file removal
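
The three deletion scenarios can be handled by a single code path, sketched below. It assumes a SQLite database accessed through the standard sqlite3 module and an id primary key on documents; neither is confirmed by this spec. delete_document and the logger name are hypothetical, and the cascade to chunks, embeddings, and tags is assumed to be handled by the schema.

```python
import logging
import sqlite3
from pathlib import Path

logger = logging.getLogger("document_storage")  # hypothetical logger name


def delete_document(conn: sqlite3.Connection, doc_id: int) -> bool:
    """Delete a document row, then its stored file if one was recorded."""
    row = conn.execute(
        "SELECT stored_path FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    if row is None:
        return False
    conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
    conn.commit()
    stored_path = row[0]
    if stored_path is not None:  # NULL means the document predates file storage
        try:
            Path(stored_path).unlink()
        except FileNotFoundError:
            # The row is already gone; a missing file only warrants a warning.
            logger.warning("Stored file missing for document %s: %s", doc_id, stored_path)
    return True
```

Deleting the row before the file means a crash in between leaves an orphaned file rather than a row that points at nothing.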

Requirement: Database schema migration for stored_path and original_filename

The engine SHALL add stored_path and original_filename columns to the documents table for tracking permanent file locations and original upload filenames.

Scenario: Fresh database initialization

  • WHEN the engine initializes a new database
  • THEN the documents table SHALL include stored_path TEXT and original_filename TEXT columns in its schema

Scenario: Existing database migration

  • WHEN the engine starts with a database created before this feature
  • THEN the engine SHALL add stored_path TEXT and original_filename TEXT to the documents table via ALTER TABLE if the columns do not exist
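
An idempotent version of this migration might look like the following sketch. It assumes SQLite, where PRAGMA table_info lists a table's existing columns; the spec does not name the database engine. migrate_documents_table is a hypothetical function name.

```python
import sqlite3
from typing import List


def migrate_documents_table(conn: sqlite3.Connection) -> List[str]:
    """Add stored_path and original_filename if missing; return the columns added."""
    # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute("PRAGMA table_info(documents)")}
    added = []
    for column in ("stored_path", "original_filename"):
        if column not in existing:
            # Column names come from the fixed tuple above, not from user input.
            conn.execute(f"ALTER TABLE documents ADD COLUMN {column} TEXT")
            added.append(column)
    conn.commit()
    return added
```

Running it against a fresh database that already defines both columns returns an empty list, so the same call is safe at every startup.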