Files
steve b2176c36ea Chunk enrichment: prepend document title to embeddings
Adds enriched_text column to chunks table that prepends document title
(and section header when present) to chunk text. Embeddings and FTS now
use enriched text for better search relevance. Includes schema migration
with backfill for existing data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 21:03:48 +01:00

82 lines
4.4 KiB
Markdown

# Chunk Enrichment
## Purpose
Chunk enrichment prepends document titles and section headers to chunk text before indexing and embedding, ensuring that document-level context participates in both full-text and semantic search.
## Requirements
### Requirement: Chunk text enrichment with document title
The engine SHALL prepend the document title to each chunk's text before FTS indexing and vector embedding. The enriched text SHALL be stored in a dedicated `enriched_text` column on the `chunks` table. The original chunk text SHALL remain in the `text` column for display purposes.
The enrichment format SHALL be:
- Without section header: `"{title}\n\n{chunk_text}"`
- With section header: `"{title} > {section_header}\n\n{chunk_text}"`
Where `section_header` is the value from the chunk's metadata `section_header` field, when present.
#### Scenario: Note ingestion with title enrichment
- **WHEN** a note titled "Suitcase Locks" with content "Steve = 363" is ingested
- **THEN** the `chunks.text` column SHALL contain "Steve = 363" and the `chunks.enriched_text` column SHALL contain "Suitcase Locks\n\nSteve = 363"
#### Scenario: Markdown chunk with section header enrichment
- **WHEN** a markdown document titled "DCG Lab Hardware" produces a chunk with section_header "GRIMDAWN > motherboard" and text "MSI X870 Tomahawk"
- **THEN** the `chunks.enriched_text` SHALL contain "DCG Lab Hardware > GRIMDAWN > motherboard\n\nMSI X870 Tomahawk"
#### Scenario: Chunk without section header
- **WHEN** a document titled "Docker Tips" produces a chunk with no section_header in metadata and text "dbash() { docker exec -it $1 bash; }"
- **THEN** the `chunks.enriched_text` SHALL contain "Docker Tips\n\ndbash() { docker exec -it $1 bash; }"
---
### Requirement: FTS5 indexes enriched text
The FTS5 virtual table `chunks_fts` SHALL index the `enriched_text` column instead of the `text` column. All FTS sync triggers (insert, update, delete) SHALL operate on `enriched_text`.
#### Scenario: FTS search matches document title
- **WHEN** a user searches for "suitcase locks" and a document titled "Suitcase Locks" exists with chunk text "Steve = 363"
- **THEN** the FTS5 search SHALL return that chunk as a match
#### Scenario: FTS search still matches chunk content
- **WHEN** a user searches for "MSI X870" and a chunk contains that text in its body
- **THEN** the FTS5 search SHALL return that chunk as a match (enrichment does not break content matching)
---
### Requirement: Vector embeddings use enriched text
The embedding model SHALL receive `enriched_text` (not raw `text`) when generating vectors during both initial ingestion and reindex operations.
#### Scenario: Vector search matches document title
- **WHEN** a user searches semantically for "luggage combination codes" and a document titled "Suitcase Locks" exists
- **THEN** the vector search SHALL return that chunk with higher similarity than it would without title enrichment
#### Scenario: Reindex uses enriched text
- **WHEN** `POST /api/v1/reindex` is called
- **THEN** the engine SHALL read `enriched_text` from the chunks table and embed that (not `text`)
---
### Requirement: Schema migration adds enriched_text column
On startup, `init_schema` SHALL add the `enriched_text` column to the `chunks` table if it does not exist. It SHALL then backfill `enriched_text` for all existing chunks by joining with `documents.title` and parsing chunk metadata for section headers. It SHALL rebuild the FTS5 table and triggers to index `enriched_text`.
#### Scenario: First startup after upgrade
- **WHEN** the engine starts and `chunks.enriched_text` column does not exist
- **THEN** the engine SHALL add the column, backfill all rows, drop and recreate `chunks_fts` to index `enriched_text`, and recreate the FTS sync triggers
#### Scenario: Subsequent startup
- **WHEN** the engine starts and `chunks.enriched_text` column already exists
- **THEN** the engine SHALL not perform any migration and start normally
---
### Requirement: Search results return raw text
Search results SHALL continue to return the original chunk text (from `chunks.text`) in the `text` field, not the enriched text. The document title is already returned as a separate `title` field.
#### Scenario: Search result text field
- **WHEN** a search returns a chunk from document "Suitcase Locks" with raw text "Steve = 363"
- **THEN** the result `text` field SHALL be "Steve = 363" (not "Suitcase Locks\n\nSteve = 363")