# Chunk Enrichment

## Purpose

Chunk enrichment prepends document titles and section headers to chunk text before indexing and embedding, ensuring that document-level context participates in both full-text and semantic search.

## Requirements

### Requirement: Chunk text enrichment with document title

The engine SHALL prepend the document title to each chunk's text before FTS indexing and vector embedding. The enriched text SHALL be stored in a dedicated `enriched_text` column on the `chunks` table. The original chunk text SHALL remain in the `text` column for display purposes.

The enrichment format SHALL be:

- Without section header: `"{title}\n\n{chunk_text}"`
- With section header: `"{title} > {section_header}\n\n{chunk_text}"`

Where `section_header` is the value from the chunk's metadata `section_header` field, when present.

#### Scenario: Note ingestion with title enrichment

- **WHEN** a note titled "Suitcase Locks" with content "Steve = 363" is ingested
- **THEN** the `chunks.text` column SHALL contain "Steve = 363" and the `chunks.enriched_text` column SHALL contain "Suitcase Locks\n\nSteve = 363"

#### Scenario: Markdown chunk with section header enrichment

- **WHEN** a markdown document titled "DCG Lab Hardware" produces a chunk with section_header "GRIMDAWN > motherboard" and text "MSI X870 Tomahawk"
- **THEN** the `chunks.enriched_text` SHALL contain "DCG Lab Hardware > GRIMDAWN > motherboard\n\nMSI X870 Tomahawk"

#### Scenario: Chunk without section header

- **WHEN** a document titled "Docker Tips" produces a chunk with no section_header in metadata and text "dbash() { docker exec -it $1 bash; }"
- **THEN** the `chunks.enriched_text` SHALL contain "Docker Tips\n\ndbash() { docker exec -it $1 bash; }"

---

### Requirement: FTS5 indexes enriched text

The FTS5 virtual table `chunks_fts` SHALL index the `enriched_text` column instead of the `text` column. All FTS sync triggers (insert, update, delete) SHALL operate on `enriched_text`.
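
A minimal sketch of this requirement, using Python's built-in `sqlite3` module and assuming the underlying SQLite build includes FTS5. It builds enriched text in the format defined above and indexes it through an external-content FTS5 table kept in sync by insert, update, and delete triggers. The two-column `chunks` table, the trigger names (`chunks_ai`, `chunks_ad`, `chunks_au`), and the helpers `enrich_chunk_text`, `build_demo_db`, and `fts_search` are illustrative assumptions, not the engine's actual schema or API.

```python
import sqlite3
from typing import List, Optional


def enrich_chunk_text(title: str, chunk_text: str,
                      section_header: Optional[str] = None) -> str:
    """Build enriched text: '{title}[ > {section_header}]\\n\\n{chunk_text}'."""
    prefix = f"{title} > {section_header}" if section_header else title
    return f"{prefix}\n\n{chunk_text}"


# Hypothetical minimal schema: only `text` and `enriched_text` come from the spec.
SCHEMA = """
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    text TEXT NOT NULL,
    enriched_text TEXT NOT NULL
);

-- External-content FTS5 table that indexes enriched_text only.
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    enriched_text,
    content='chunks',
    content_rowid='id'
);

CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;

CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
END;

CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding the 'Suitcase Locks' note."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    raw = "Steve = 363"
    conn.execute(
        "INSERT INTO chunks (text, enriched_text) VALUES (?, ?)",
        (raw, enrich_chunk_text("Suitcase Locks", raw)),
    )
    conn.commit()
    return conn


def fts_search(conn: sqlite3.Connection, query: str) -> List[str]:
    """Match against enriched_text but return the raw chunk text."""
    rows = conn.execute(
        "SELECT c.text FROM chunks_fts "
        "JOIN chunks c ON c.id = chunks_fts.rowid "
        "WHERE chunks_fts MATCH ?",
        (query,),
    ).fetchall()
    return [row[0] for row in rows]
```

With this setup, `fts_search(build_demo_db(), "suitcase locks")` returns the raw chunk text even though those words appear only in the title. Joining back to `chunks` for the result keeps the display text free of the enrichment prefix.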

#### Scenario: FTS search matches document title

- **WHEN** a user searches for "suitcase locks" and a document titled "Suitcase Locks" exists with chunk text "Steve = 363"
- **THEN** the FTS5 search SHALL return that chunk as a match

#### Scenario: FTS search still matches chunk content

- **WHEN** a user searches for "MSI X870" and a chunk contains that text in its body
- **THEN** the FTS5 search SHALL return that chunk as a match (enrichment does not break content matching)

---

### Requirement: Vector embeddings use enriched text

The embedding model SHALL receive `enriched_text` (not raw `text`) when generating vectors during both initial ingestion and reindex operations.

#### Scenario: Vector search matches document title

- **WHEN** a user searches semantically for "luggage combination codes" and a document titled "Suitcase Locks" exists
- **THEN** the vector search SHALL return that chunk with higher similarity than it would without title enrichment

#### Scenario: Reindex uses enriched text

- **WHEN** `POST /api/v1/reindex` is called
- **THEN** the engine SHALL read `enriched_text` from the chunks table and embed that (not `text`)

---

### Requirement: Schema migration adds enriched_text column

On startup, `init_schema` SHALL add the `enriched_text` column to the `chunks` table if it does not exist. It SHALL then backfill `enriched_text` for all existing chunks by joining with `documents.title` and parsing chunk metadata for section headers. It SHALL rebuild the FTS5 table and triggers to index `enriched_text`.
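
The migration steps above might look roughly like the following sketch, written with Python's built-in `sqlite3` module and assuming an SQLite build with FTS5. Several details are assumptions made for illustration: that `chunks` has `document_id` and a JSON-encoded `metadata` column, that `documents` has an `id` primary key, the trigger names, and the helpers `column_exists` and `make_legacy_db`. Error handling and wrapping the whole migration in a single transaction are left out.

```python
import json
import sqlite3

# Hypothetical FTS rebuild script; trigger names are illustrative.
REBUILD_FTS = """
DROP TRIGGER IF EXISTS chunks_ai;
DROP TRIGGER IF EXISTS chunks_ad;
DROP TRIGGER IF EXISTS chunks_au;
DROP TABLE IF EXISTS chunks_fts;
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    enriched_text, content='chunks', content_rowid='id'
);
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
-- Repopulate the index from the content table.
INSERT INTO chunks_fts(chunks_fts) VALUES ('rebuild');
"""


def column_exists(conn: sqlite3.Connection, table: str, column: str) -> bool:
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    return any(row[1] == column
               for row in conn.execute(f"PRAGMA table_info({table})"))


def init_schema(conn: sqlite3.Connection) -> bool:
    """Run the enriched_text migration once. Returns True if it ran."""
    if column_exists(conn, "chunks", "enriched_text"):
        return False  # Subsequent startup: nothing to do.

    conn.execute("ALTER TABLE chunks ADD COLUMN enriched_text TEXT")
    rows = conn.execute(
        "SELECT c.id, c.text, c.metadata, d.title "
        "FROM chunks c JOIN documents d ON d.id = c.document_id"
    ).fetchall()
    for chunk_id, text, metadata, title in rows:
        # Assumes metadata is a JSON object or NULL.
        header = json.loads(metadata or "{}").get("section_header")
        prefix = f"{title} > {header}" if header else title
        conn.execute(
            "UPDATE chunks SET enriched_text = ? WHERE id = ?",
            (f"{prefix}\n\n{text}", chunk_id),
        )
    conn.commit()
    conn.executescript(REBUILD_FTS)
    return True


def make_legacy_db() -> sqlite3.Connection:
    """Build a pre-upgrade database (no enriched_text column) for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
        CREATE TABLE chunks (
            id INTEGER PRIMARY KEY,
            document_id INTEGER NOT NULL REFERENCES documents(id),
            text TEXT NOT NULL,
            metadata TEXT
        );
        INSERT INTO documents VALUES (1, 'Suitcase Locks'), (2, 'DCG Lab Hardware');
        INSERT INTO chunks VALUES
            (1, 1, 'Steve = 363', NULL),
            (2, 2, 'MSI X870 Tomahawk',
             '{"section_header": "GRIMDAWN > motherboard"}');
    """)
    return conn
```

Calling `init_schema(make_legacy_db())` performs the migration; calling it again on the same connection is a no-op because the column check short-circuits.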

#### Scenario: First startup after upgrade

- **WHEN** the engine starts and the `chunks.enriched_text` column does not exist
- **THEN** the engine SHALL add the column, backfill all rows, drop and recreate `chunks_fts` to index `enriched_text`, and recreate the FTS sync triggers

#### Scenario: Subsequent startup

- **WHEN** the engine starts and the `chunks.enriched_text` column already exists
- **THEN** the engine SHALL NOT perform any migration and SHALL start normally

---

### Requirement: Search results return raw text

Search results SHALL continue to return the original chunk text (from `chunks.text`) in the `text` field, not the enriched text. The document title is already returned as a separate `title` field.

#### Scenario: Search result text field

- **WHEN** a search returns a chunk from document "Suitcase Locks" with raw text "Steve = 363"
- **THEN** the result `text` field SHALL be "Steve = 363" (not "Suitcase Locks\n\nSteve = 363")