Chunk Enrichment
Purpose
Chunk enrichment prepends document titles and section headers to chunk text before indexing and embedding, ensuring that document-level context participates in both full-text and semantic search.
Requirements
Requirement: Chunk text enrichment with document title
The engine SHALL prepend the document title to each chunk's text before FTS indexing and vector embedding. The enriched text SHALL be stored in a dedicated enriched_text column on the chunks table. The original chunk text SHALL remain in the text column for display purposes.
The enrichment format SHALL be:
- Without section header: "{title}\n\n{chunk_text}"
- With section header: "{title} > {section_header}\n\n{chunk_text}"
Where section_header is the value from the chunk's metadata section_header field, when present.
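The two formats above can be sketched as a single helper. This is a minimal illustration, not the engine's actual code: the function name enrich_chunk_text is hypothetical, and treating an empty section header the same as a missing one is an assumption the requirement does not spell out.

```python
from typing import Optional


def enrich_chunk_text(
    title: str,
    chunk_text: str,
    section_header: Optional[str] = None,
) -> str:
    """Build the enriched text for one chunk (hypothetical helper name)."""
    # Assumption: an empty or missing section header falls back to the
    # title-only form.
    prefix = f"{title} > {section_header}" if section_header else title
    # A blank line separates the context prefix from the raw chunk text.
    return f"{prefix}\n\n{chunk_text}"


note = enrich_chunk_text("Suitcase Locks", "Steve = 363")
markdown = enrich_chunk_text(
    "DCG Lab Hardware", "MSI X870 Tomahawk", "GRIMDAWN > motherboard"
)
```

Because the section header is appended to the title with " > ", a header that already contains " > " (as in the markdown case) simply extends the breadcrumb.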
Scenario: Note ingestion with title enrichment
- WHEN a note titled "Suitcase Locks" with content "Steve = 363" is ingested
- THEN the chunks.text column SHALL contain "Steve = 363" and the chunks.enriched_text column SHALL contain "Suitcase Locks\n\nSteve = 363"
Scenario: Markdown chunk with section header enrichment
- WHEN a markdown document titled "DCG Lab Hardware" produces a chunk with section_header "GRIMDAWN > motherboard" and text "MSI X870 Tomahawk"
- THEN the chunks.enriched_text column SHALL contain "DCG Lab Hardware > GRIMDAWN > motherboard\n\nMSI X870 Tomahawk"
Scenario: Chunk without section header
- WHEN a document titled "Docker Tips" produces a chunk with no section_header in metadata and text "dbash() { docker exec -it $1 bash; }"
- THEN the chunks.enriched_text column SHALL contain "Docker Tips\n\ndbash() { docker exec -it $1 bash; }"
Requirement: FTS5 indexes enriched text
The FTS5 virtual table chunks_fts SHALL index the enriched_text column instead of the text column. All FTS sync triggers (insert, update, delete) SHALL operate on enriched_text.
Scenario: FTS search matches document title
- WHEN a user searches for "suitcase locks" and a document titled "Suitcase Locks" exists with chunk text "Steve = 363"
- THEN the FTS5 search SHALL return that chunk as a match
Scenario: FTS search still matches chunk content
- WHEN a user searches for "MSI X870" and a chunk contains that text in its body
- THEN the FTS5 search SHALL return that chunk as a match (enrichment does not break content matching)
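One way to wire this up is an external-content FTS5 table kept in sync by triggers, following the pattern in the SQLite FTS5 documentation. The sketch below assumes a SQLite build with FTS5 enabled; the reduced chunks schema, the trigger names (chunks_ai, chunks_ad, chunks_au), and the fts_search helper are illustrative assumptions rather than the engine's real definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    text TEXT NOT NULL,
    enriched_text TEXT
);

-- External-content FTS5 table: only enriched_text is indexed.
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    enriched_text,
    content='chunks',
    content_rowid='id'
);

-- Sync triggers (hypothetical names) operate on enriched_text.
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
""")

conn.execute(
    "INSERT INTO chunks (text, enriched_text) VALUES (?, ?)",
    ("Steve = 363", "Suitcase Locks\n\nSteve = 363"),
)
conn.execute(
    "INSERT INTO chunks (text, enriched_text) VALUES (?, ?)",
    (
        "MSI X870 Tomahawk",
        "DCG Lab Hardware > GRIMDAWN > motherboard\n\nMSI X870 Tomahawk",
    ),
)


def fts_search(query: str) -> list:
    """Return the raw text of chunks whose enriched text matches the query."""
    rows = conn.execute(
        "SELECT c.text FROM chunks_fts "
        "JOIN chunks c ON c.id = chunks_fts.rowid "
        "WHERE chunks_fts MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
    return [row[0] for row in rows]


title_hits = fts_search("suitcase locks")
body_hits = fts_search("MSI X870")
```

The title query matches only because the title is part of the indexed column; the body query still matches because the raw chunk text is a suffix of the enriched text.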
Requirement: Vector embeddings use enriched text
The embedding model SHALL receive enriched_text (not raw text) when generating vectors during both initial ingestion and reindex operations.
Scenario: Vector search matches document title
- WHEN a user searches semantically for "luggage combination codes" and a document titled "Suitcase Locks" exists
- THEN the vector search SHALL return that chunk with higher similarity than it would without title enrichment
Scenario: Reindex uses enriched text
- WHEN POST /api/v1/reindex is called
- THEN the engine SHALL read enriched_text from the chunks table and embed that (not text)
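A reindex pass along these lines might look like the sketch below. The function name reindex_embeddings, the embed callable's signature, and the returned mapping are all assumptions made for illustration; the stand-in embedder only records what it was given so the input source is visible.

```python
import sqlite3
from typing import Callable, Dict, List, Sequence


def reindex_embeddings(
    conn: sqlite3.Connection,
    embed: Callable[[Sequence[str]], List[List[float]]],
) -> Dict[int, List[float]]:
    """Re-embed every chunk from enriched_text; returns {chunk_id: vector}."""
    rows = conn.execute(
        "SELECT id, enriched_text FROM chunks ORDER BY id"
    ).fetchall()
    ids = [row[0] for row in rows]
    texts = [row[1] for row in rows]
    # The model only ever sees enriched_text, never the raw text column.
    vectors = embed(texts) if texts else []
    return dict(zip(ids, vectors))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT, enriched_text TEXT)"
)
conn.execute(
    "INSERT INTO chunks (text, enriched_text) VALUES (?, ?)",
    ("Steve = 363", "Suitcase Locks\n\nSteve = 363"),
)

seen: List[str] = []


def fake_embed(texts: Sequence[str]) -> List[List[float]]:
    """Stand-in for a real embedding model (hypothetical)."""
    seen.extend(texts)
    return [[float(len(t))] for t in texts]


vectors = reindex_embeddings(conn, fake_embed)
```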
Requirement: Schema migration adds enriched_text column
On startup, init_schema SHALL add the enriched_text column to the chunks table if it does not exist. It SHALL then backfill enriched_text for all existing chunks by joining with documents.title and parsing chunk metadata for section headers. It SHALL rebuild the FTS5 table and triggers to index enriched_text.
Scenario: First startup after upgrade
- WHEN the engine starts and the chunks.enriched_text column does not exist
- THEN the engine SHALL add the column, backfill all rows, drop and recreate chunks_fts to index enriched_text, and recreate the FTS sync triggers
Scenario: Subsequent startup
- WHEN the engine starts and the chunks.enriched_text column already exists
- THEN the engine SHALL NOT perform any migration and SHALL start normally
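The column check and backfill could be sketched as follows. This is an illustration under stated assumptions, not the actual init_schema: the function name migrate_enriched_text, the chunks.document_id and documents.id join columns, and chunk metadata stored as a JSON string in chunks.metadata are all hypothetical. The FTS5 drop-and-recreate step is omitted to keep the sketch focused on the backfill.

```python
import json
import sqlite3


def migrate_enriched_text(conn: sqlite3.Connection) -> bool:
    """Add and backfill chunks.enriched_text; returns True if a migration ran."""
    columns = {row[1] for row in conn.execute("PRAGMA table_info(chunks)")}
    if "enriched_text" in columns:
        return False  # Subsequent startup: nothing to migrate.

    conn.execute("ALTER TABLE chunks ADD COLUMN enriched_text TEXT")
    rows = conn.execute(
        "SELECT c.id, c.text, c.metadata, d.title "
        "FROM chunks c JOIN documents d ON d.id = c.document_id"
    ).fetchall()
    for chunk_id, text, metadata, title in rows:
        try:
            header = (json.loads(metadata) if metadata else {}).get("section_header")
        except (ValueError, AttributeError):
            header = None  # Assumption: unparseable metadata means no header.
        prefix = f"{title} > {header}" if header else title
        conn.execute(
            "UPDATE chunks SET enriched_text = ? WHERE id = ?",
            (f"{prefix}\n\n{text}", chunk_id),
        )
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL,
    text TEXT NOT NULL,
    metadata TEXT
);
""")
conn.execute("INSERT INTO documents VALUES (1, 'DCG Lab Hardware')")
conn.execute("INSERT INTO documents VALUES (2, 'Docker Tips')")
conn.execute(
    "INSERT INTO chunks VALUES (1, 1, ?, ?)",
    ("MSI X870 Tomahawk", json.dumps({"section_header": "GRIMDAWN > motherboard"})),
)
conn.execute(
    "INSERT INTO chunks VALUES (2, 2, ?, NULL)",
    ("dbash() { docker exec -it $1 bash; }",),
)

first_run = migrate_enriched_text(conn)   # column missing: migrate
second_run = migrate_enriched_text(conn)  # column present: no-op
backfilled = dict(conn.execute("SELECT id, enriched_text FROM chunks"))
```

Checking PRAGMA table_info before altering the table is what makes the second call a no-op, which mirrors the "Subsequent startup" scenario.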
Requirement: Search results return raw text
Search results SHALL continue to return the original chunk text (from chunks.text) in the text field, not the enriched text. The document title is already returned as a separate title field.
Scenario: Search result text field
- WHEN a search returns a chunk from document "Suitcase Locks" with raw text "Steve = 363"
- THEN the result text field SHALL be "Steve = 363" (not "Suitcase Locks\n\nSteve = 363")
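To make the split between indexed and displayed text concrete, here is a small sketch of result hydration. The helper name fetch_results, the result dictionary shape, and the chunks.document_id join column are assumptions; the point is only that the query selects chunks.text and documents.title and never enriched_text.

```python
import sqlite3
from typing import Dict, List


def fetch_results(conn: sqlite3.Connection, chunk_ids: List[int]) -> List[Dict[str, str]]:
    """Turn matched chunk ids into results using raw text plus a separate title."""
    results = []
    for chunk_id in chunk_ids:
        row = conn.execute(
            "SELECT c.text, d.title "
            "FROM chunks c JOIN documents d ON d.id = c.document_id "
            "WHERE c.id = ?",
            (chunk_id,),
        ).fetchone()
        if row is not None:
            results.append({"text": row[0], "title": row[1]})
    return results


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL,
    text TEXT NOT NULL,
    enriched_text TEXT
);
""")
conn.execute("INSERT INTO documents VALUES (1, 'Suitcase Locks')")
conn.execute(
    "INSERT INTO chunks VALUES (1, 1, ?, ?)",
    ("Steve = 363", "Suitcase Locks\n\nSteve = 363"),
)

results = fetch_results(conn, [1])
```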