Chunk enrichment: prepend document title to embeddings

Adds an enriched_text column to the chunks table that stores chunk text
with the document title (and section header when present) prepended.
Embeddings and FTS now use the enriched text for better search relevance.
Includes a schema migration with backfill for existing data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 21:03:48 +01:00
parent 5f9946efc9
commit b2176c36ea
10 changed files with 278 additions and 21 deletions
+81
@@ -0,0 +1,81 @@
# Chunk Enrichment
## Purpose
Chunk enrichment prepends document titles and section headers to chunk text before indexing and embedding, ensuring that document-level context participates in both full-text and semantic search.
## Requirements
### Requirement: Chunk text enrichment with document title
The engine SHALL prepend the document title to each chunk's text before FTS indexing and vector embedding. The enriched text SHALL be stored in a dedicated `enriched_text` column on the `chunks` table. The original chunk text SHALL remain in the `text` column for display purposes.
The enrichment format SHALL be:
- Without section header: `"{title}\n\n{chunk_text}"`
- With section header: `"{title} > {section_header}\n\n{chunk_text}"`
Where `section_header` is the value from the chunk's metadata `section_header` field, when present.
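
The two formats above can be sketched as a single helper. This is a minimal illustration, not the engine's actual code: the function name `build_enriched_text` is hypothetical, and treating an empty `section_header` the same as a missing one is an assumption.

```python
from typing import Optional


def build_enriched_text(
    title: str, chunk_text: str, section_header: Optional[str] = None
) -> str:
    """Prepend the document title (and section header, when present) to chunk text."""
    # Assumption: an empty section header is treated like a missing one.
    prefix = f"{title} > {section_header}" if section_header else title
    return f"{prefix}\n\n{chunk_text}"
```

Called with the title "Suitcase Locks" and text "Steve = 363", the helper yields the title-only form; passing a `section_header` yields the `{title} > {section_header}` form.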
#### Scenario: Note ingestion with title enrichment
- **WHEN** a note titled "Suitcase Locks" with content "Steve = 363" is ingested
- **THEN** the `chunks.text` column SHALL contain "Steve = 363" and the `chunks.enriched_text` column SHALL contain "Suitcase Locks\n\nSteve = 363"
#### Scenario: Markdown chunk with section header enrichment
- **WHEN** a markdown document titled "DCG Lab Hardware" produces a chunk with section_header "GRIMDAWN > motherboard" and text "MSI X870 Tomahawk"
- **THEN** the `chunks.enriched_text` SHALL contain "DCG Lab Hardware > GRIMDAWN > motherboard\n\nMSI X870 Tomahawk"
#### Scenario: Chunk without section header
- **WHEN** a document titled "Docker Tips" produces a chunk with no section_header in metadata and text "dbash() { docker exec -it $1 bash; }"
- **THEN** the `chunks.enriched_text` SHALL contain "Docker Tips\n\ndbash() { docker exec -it $1 bash; }"
---
### Requirement: FTS5 indexes enriched text
The FTS5 virtual table `chunks_fts` SHALL index the `enriched_text` column instead of the `text` column. All FTS sync triggers (insert, update, delete) SHALL operate on `enriched_text`.
#### Scenario: FTS search matches document title
- **WHEN** a user searches for "suitcase locks" and a document titled "Suitcase Locks" exists with chunk text "Steve = 363"
- **THEN** the FTS5 search SHALL return that chunk as a match
#### Scenario: FTS search still matches chunk content
- **WHEN** a user searches for "MSI X870" and a chunk contains that text in its body
- **THEN** the FTS5 search SHALL return that chunk as a match (enrichment does not break content matching)
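
One way to satisfy this requirement is an FTS5 external-content table kept in sync by triggers. The sketch below assumes a simplified `chunks` table with an integer `id` key and hypothetical trigger names (`chunks_ai`, `chunks_ad`, `chunks_au`); the engine's real schema and helper names may differ.

```python
import sqlite3

SCHEMA = """
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    text TEXT NOT NULL,
    enriched_text TEXT NOT NULL
);
-- External-content FTS5 table: indexes enriched_text, stores no copy of it.
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    enriched_text, content='chunks', content_rowid='id'
);
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
"""


def open_index() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_chunk(conn: sqlite3.Connection, text: str, enriched_text: str) -> int:
    """Insert a chunk; the insert trigger updates the FTS index."""
    cur = conn.execute(
        "INSERT INTO chunks (text, enriched_text) VALUES (?, ?)",
        (text, enriched_text),
    )
    return cur.lastrowid


def fts_search(conn: sqlite3.Connection, query: str) -> list:
    """Return matching chunk ids, best match first."""
    rows = conn.execute(
        "SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]
```

Because the index covers `enriched_text`, a query for the title and a query for body terms both reach the same chunk. User-supplied queries containing FTS5 syntax characters would need escaping, which this sketch leaves out.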
---
### Requirement: Vector embeddings use enriched text
The embedding model SHALL receive `enriched_text` (not raw `text`) when generating vectors during both initial ingestion and reindex operations.
#### Scenario: Vector search matches document title
- **WHEN** a user searches semantically for "luggage combination codes" and a document titled "Suitcase Locks" exists
- **THEN** the vector search SHALL return that chunk with higher similarity than it would without title enrichment
#### Scenario: Reindex uses enriched text
- **WHEN** `POST /api/v1/reindex` is called
- **THEN** the engine SHALL read `enriched_text` from the chunks table and embed that (not `text`)
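
The reindex path could look roughly like the following. The `embed` callable stands in for the resident embedding model and is an assumption, as are the function name `reembed_chunks` and the batch size; the only detail taken from the spec is that `enriched_text`, not `text`, is what gets embedded.

```python
import sqlite3
from typing import Callable, List, Sequence, Tuple

# Hypothetical embedder signature: a batch of strings in, one vector per string out.
Embedder = Callable[[Sequence[str]], List[List[float]]]


def reembed_chunks(
    conn: sqlite3.Connection, embed: Embedder, batch_size: int = 64
) -> List[Tuple[int, List[float]]]:
    """Re-embed every chunk from enriched_text; return (chunk_id, vector) pairs."""
    results: List[Tuple[int, List[float]]] = []
    cursor = conn.execute("SELECT id, enriched_text FROM chunks ORDER BY id")
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        ids = [row[0] for row in batch]
        texts = [row[1] for row in batch]  # enriched text, never raw text
        vectors = embed(texts)
        results.extend(zip(ids, vectors))
    return results
```

Writing the vectors back to the vector store is omitted here, since the spec excerpt does not describe that table.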
---
### Requirement: Schema migration adds enriched_text column
On startup, `init_schema` SHALL add the `enriched_text` column to the `chunks` table if it does not exist. It SHALL then backfill `enriched_text` for all existing chunks by joining `chunks` with `documents` to obtain each title and parsing chunk metadata for section headers. It SHALL rebuild the FTS5 table and triggers to index `enriched_text`.
#### Scenario: First startup after upgrade
- **WHEN** the engine starts and `chunks.enriched_text` column does not exist
- **THEN** the engine SHALL add the column, backfill all rows, drop and recreate `chunks_fts` to index `enriched_text`, and recreate the FTS sync triggers
#### Scenario: Subsequent startup
- **WHEN** the engine starts and `chunks.enriched_text` column already exists
- **THEN** the engine SHALL NOT perform any migration and SHALL start normally
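
A migration along these lines might be sketched as follows. Several details are assumptions rather than facts from the spec: the `chunks.document_id` foreign key, a JSON-encoded `chunks.metadata` column, the trigger names, and the function name `migrate_enriched_text`. The column check via `PRAGMA table_info` is what makes a second startup a no-op.

```python
import json
import sqlite3

DROP_FTS = """
DROP TRIGGER IF EXISTS chunks_ai;
DROP TRIGGER IF EXISTS chunks_ad;
DROP TRIGGER IF EXISTS chunks_au;
DROP TABLE IF EXISTS chunks_fts;
"""

CREATE_FTS = """
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    enriched_text, content='chunks', content_rowid='id'
);
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
-- Populate the new index from the backfilled column.
INSERT INTO chunks_fts(chunks_fts) VALUES ('rebuild');
"""


def column_exists(conn: sqlite3.Connection, table: str, column: str) -> bool:
    """Check for a column by name; row[1] of table_info is the column name."""
    return any(row[1] == column for row in conn.execute(f"PRAGMA table_info({table})"))


def migrate_enriched_text(conn: sqlite3.Connection) -> bool:
    """Add and backfill chunks.enriched_text; return True if a migration ran."""
    if column_exists(conn, "chunks", "enriched_text"):
        return False
    # Drop the old index and triggers first so the backfill does not fire them.
    conn.executescript(DROP_FTS)
    conn.execute("ALTER TABLE chunks ADD COLUMN enriched_text TEXT")
    rows = conn.execute(
        "SELECT c.id, c.text, c.metadata, d.title "
        "FROM chunks c JOIN documents d ON d.id = c.document_id"
    ).fetchall()
    for chunk_id, text, metadata, title in rows:
        header = json.loads(metadata or "{}").get("section_header")
        prefix = f"{title} > {header}" if header else title
        conn.execute(
            "UPDATE chunks SET enriched_text = ? WHERE id = ?",
            (f"{prefix}\n\n{text}", chunk_id),
        )
    conn.commit()
    conn.executescript(CREATE_FTS)
    return True
```

The sketch is not atomic: a crash after the column is added but before the FTS rebuild would leave later startups skipping the migration, so a production version would likely need a single transaction or an explicit schema-version marker.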
---
### Requirement: Search results return raw text
Search results SHALL continue to return the original chunk text (from `chunks.text`) in the `text` field, not the enriched text. The document title is already returned as a separate `title` field.
#### Scenario: Search result text field
- **WHEN** a search returns a chunk from document "Suitcase Locks" with raw text "Steve = 363"
- **THEN** the result `text` field SHALL be "Steve = 363" (not "Suitcase Locks\n\nSteve = 363")
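
To illustrate the split between what is indexed and what is returned, the sketch below builds result payloads from `chunks.text` and `documents.title` only. The function name `fetch_results`, the `chunks.document_id` column, and the two-field payload are assumptions for illustration; real results presumably carry more fields.

```python
import sqlite3
from typing import Dict, List


def fetch_results(conn: sqlite3.Connection, chunk_ids: List[int]) -> List[Dict[str, str]]:
    """Build result payloads from raw chunk text and the document title."""
    if not chunk_ids:
        return []
    placeholders = ",".join("?" for _ in chunk_ids)
    rows = conn.execute(
        "SELECT c.id, d.title, c.text "
        "FROM chunks c JOIN documents d ON d.id = c.document_id "
        f"WHERE c.id IN ({placeholders})",
        chunk_ids,
    ).fetchall()
    # enriched_text is never selected, so it cannot leak into the payload.
    by_id = {row[0]: {"title": row[1], "text": row[2]} for row in rows}
    # Preserve the ranking order of the incoming ids.
    return [by_id[i] for i in chunk_ids if i in by_id]
```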
+3 -3
@@ -128,11 +128,11 @@ The engine SHALL maintain job records in SQLite with status tracking. Jobs SHALL
### Requirement: Background ingestion worker
-The engine SHALL run a background worker that processes queued jobs. The worker SHALL process one job at a time. For each job, it SHALL: detect document type, run the appropriate chunking pipeline (Docling for PDFs, header-based for Markdown, AST-based for code, whole-text for notes), generate embeddings using the resident model, insert chunks and vectors into the database, and move the original file to persistent storage.
+The engine SHALL run a background worker that processes queued jobs. The worker SHALL process one job at a time. For each job, it SHALL: detect document type, run the appropriate chunking pipeline (Docling for PDFs, header-based for Markdown, AST-based for code, whole-text for notes), build enriched text by prepending the document title (and section header when present) to each chunk's text, generate embeddings using the enriched text and the resident model, insert chunks (with both raw text and enriched text) and vectors into the database, and move the original file to persistent storage.
#### Scenario: Successful PDF ingestion
- **WHEN** the background worker picks up a queued PDF job
-- **THEN** it SHALL update the job status to `processing`, run Docling conversion and chunking, embed all chunks, insert document and chunks into the database, move the staged file to `{data_dir}/documents/{content_hash}.pdf`, update `documents.stored_path` with the permanent path, store the original filename in `documents.original_filename`, update the job status to `done` with the resulting document_id and chunk count, and clean up the staging entry
+- **THEN** it SHALL update the job status to `processing`, run Docling conversion and chunking, build enriched text for each chunk by prepending the document title, embed all chunks using enriched text, insert document and chunks into the database, move the staged file to `{data_dir}/documents/{content_hash}.pdf`, update `documents.stored_path` with the permanent path, store the original filename in `documents.original_filename`, update the job status to `done` with the resulting document_id and chunk count, and clean up the staging entry
#### Scenario: Ingestion failure
- **WHEN** the background worker encounters an error during processing (e.g., corrupt PDF)
@@ -202,7 +202,7 @@ The engine SHALL provide status information and support re-embedding all chunks.
#### Scenario: Trigger reindex
- **WHEN** a client sends `POST /api/v1/reindex`
-- **THEN** the engine SHALL re-embed all existing chunks using the currently loaded model and return progress information. This operation SHALL NOT block search queries.
+- **THEN** the engine SHALL re-embed all existing chunks using the `enriched_text` column and the currently loaded model, and return progress information. This operation SHALL NOT block search queries.
---