kb/openspec/specs/chunk-enrichment/spec.md
steve b2176c36ea Chunk enrichment: prepend document title to embeddings
Adds enriched_text column to chunks table that prepends document title
(and section header when present) to chunk text. Embeddings and FTS now
use enriched text for better search relevance. Includes schema migration
with backfill for existing data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 21:03:48 +01:00

Chunk Enrichment

Purpose

Chunk enrichment prepends document titles and section headers to chunk text before indexing and embedding, ensuring that document-level context participates in both full-text and semantic search.

Requirements

Requirement: Chunk text enrichment with document title

The engine SHALL prepend the document title to each chunk's text before FTS indexing and vector embedding. The enriched text SHALL be stored in a dedicated enriched_text column on the chunks table. The original chunk text SHALL remain in the text column for display purposes.

The enrichment format SHALL be:

  • Without section header: "{title}\n\n{chunk_text}"
  • With section header: "{title} > {section_header}\n\n{chunk_text}"

Here, section_header is the value of the section_header field in the chunk's metadata, when that field is present.
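
The two formats above can be sketched as a single helper. This is a minimal illustration, not the engine's actual code: the function name build_enriched_text and its signature are hypothetical, and it assumes that an empty or missing section header is treated as "not present".

```python
from typing import Optional


def build_enriched_text(
    title: str, chunk_text: str, section_header: Optional[str] = None
) -> str:
    """Prepend the document title (and section header, if any) to the chunk text.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    # An empty or missing header falls back to the title-only format.
    prefix = f"{title} > {section_header}" if section_header else title
    return f"{prefix}\n\n{chunk_text}"


note = build_enriched_text("Suitcase Locks", "Steve = 363")
markdown_chunk = build_enriched_text(
    "DCG Lab Hardware", "MSI X870 Tomahawk", "GRIMDAWN > motherboard"
)
```

Because a section header may itself contain " > " (as in "GRIMDAWN > motherboard"), the prefix reads as a breadcrumb path from the document title down to the section.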

Scenario: Note ingestion with title enrichment

  • WHEN a note titled "Suitcase Locks" with content "Steve = 363" is ingested
  • THEN the chunks.text column SHALL contain "Steve = 363" and the chunks.enriched_text column SHALL contain "Suitcase Locks\n\nSteve = 363"

Scenario: Markdown chunk with section header enrichment

  • WHEN a markdown document titled "DCG Lab Hardware" produces a chunk with section_header "GRIMDAWN > motherboard" and text "MSI X870 Tomahawk"
  • THEN the chunks.enriched_text SHALL contain "DCG Lab Hardware > GRIMDAWN > motherboard\n\nMSI X870 Tomahawk"

Scenario: Chunk without section header

  • WHEN a document titled "Docker Tips" produces a chunk with no section_header in metadata and text "dbash() { docker exec -it $1 bash; }"
  • THEN the chunks.enriched_text SHALL contain "Docker Tips\n\ndbash() { docker exec -it $1 bash; }"

Requirement: FTS5 indexes enriched text

The FTS5 virtual table chunks_fts SHALL index the enriched_text column instead of the text column. All FTS sync triggers (insert, update, delete) SHALL operate on enriched_text.

Scenario: FTS search matches document title

  • WHEN a user searches for "suitcase locks" and a document titled "Suitcase Locks" exists with chunk text "Steve = 363"
  • THEN the FTS5 search SHALL return that chunk as a match

Scenario: FTS search still matches chunk content

  • WHEN a user searches for "MSI X870" and a chunk contains that text in its body
  • THEN the FTS5 search SHALL return that chunk as a match (enrichment does not break content matching)
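
One way to satisfy this requirement is an FTS5 external-content table kept in sync by triggers. The sketch below assumes that design, a simplified chunks table with an integer id primary key, and hypothetical trigger names (chunks_ai, chunks_ad, chunks_au); the spec fixes only the chunks_fts name and the enriched_text column. It also requires an SQLite build with FTS5 enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    text TEXT NOT NULL,           -- raw text, kept for display
    enriched_text TEXT NOT NULL   -- title-prefixed text, indexed by FTS5
);

-- External-content FTS5 table: the index covers enriched_text only.
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    enriched_text, content='chunks', content_rowid='id'
);

-- Sync triggers (names are assumptions), all operating on enriched_text.
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
""")


def fts_search(conn: sqlite3.Connection, query: str) -> list:
    """Return matching chunk ids, best match first."""
    rows = conn.execute(
        "SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]


conn.execute(
    "INSERT INTO chunks (text, enriched_text) VALUES (?, ?)",
    ("Steve = 363", "Suitcase Locks\n\nSteve = 363"),
)
conn.execute(
    "INSERT INTO chunks (text, enriched_text) VALUES (?, ?)",
    ("MSI X870 Tomahawk",
     "DCG Lab Hardware > GRIMDAWN > motherboard\n\nMSI X870 Tomahawk"),
)

title_hits = fts_search(conn, "suitcase locks")  # matches via the title only
body_hits = fts_search(conn, "MSI X870")         # matches via the chunk body
```

The first query matches a chunk whose raw text never mentions "suitcase" or "locks", while the second shows that body terms remain searchable after enrichment.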

Requirement: Vector embeddings use enriched text

The embedding model SHALL receive enriched_text (not raw text) when generating vectors during both initial ingestion and reindex operations.

Scenario: Vector search matches document title

  • WHEN a user searches semantically for "luggage combination codes" and a document titled "Suitcase Locks" exists
  • THEN the vector search SHALL return that document's chunk with higher similarity than it would without title enrichment

Scenario: Reindex uses enriched text

  • WHEN POST /api/v1/reindex is called
  • THEN the engine SHALL read enriched_text from the chunks table and embed that value, not the raw text column
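
The reindex scenario can be sketched as a loop that reads enriched_text and hands it to the embedding model. Everything here beyond the chunks table and its two text columns is an assumption: reindex_chunks is a hypothetical name, and fake_embed is a stand-in that records its input instead of calling a real model.

```python
import sqlite3
from typing import Callable, List, Sequence, Tuple

# Hypothetical embedder type: any callable mapping a string to a vector.
Embedder = Callable[[str], Sequence[float]]


def reindex_chunks(
    conn: sqlite3.Connection, embed: Embedder
) -> List[Tuple[int, Sequence[float]]]:
    """Embed every chunk from enriched_text, never from the raw text column."""
    rows = conn.execute(
        "SELECT id, enriched_text FROM chunks ORDER BY id"
    ).fetchall()
    return [(chunk_id, embed(enriched)) for chunk_id, enriched in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT, enriched_text TEXT)"
)
conn.execute(
    "INSERT INTO chunks (text, enriched_text) VALUES (?, ?)",
    ("Steve = 363", "Suitcase Locks\n\nSteve = 363"),
)

seen: List[str] = []


def fake_embed(text: str) -> List[float]:
    """Stand-in for a real embedding model; records what it was given."""
    seen.append(text)
    return [float(len(text))]


vectors = reindex_chunks(conn, fake_embed)
```

Recording the embedder's input makes the requirement directly checkable: the model should only ever see the title-prefixed string.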

Requirement: Schema migration adds enriched_text column

On startup, init_schema SHALL add the enriched_text column to the chunks table if it does not exist. It SHALL then backfill enriched_text for all existing chunks by joining with documents.title and parsing chunk metadata for section headers. It SHALL rebuild the FTS5 table and triggers to index enriched_text.

Scenario: First startup after upgrade

  • WHEN the engine starts and chunks.enriched_text column does not exist
  • THEN the engine SHALL add the column, backfill all rows, drop and recreate chunks_fts to index enriched_text, and recreate the FTS sync triggers

Scenario: Subsequent startup

  • WHEN the engine starts and chunks.enriched_text column already exists
  • THEN the engine SHALL NOT perform any migration and SHALL start normally
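
The migration steps can be sketched as an idempotent function. This is an illustration under several assumptions that the spec does not state: chunks has a document_id foreign key and a metadata column holding a JSON object, the FTS table is an external-content table, and the trigger names are hypothetical. The function name migrate_enriched_text is also invented here; the spec only says the work happens inside init_schema.

```python
import json
import sqlite3

# Assumed FTS layout and trigger names; only chunks_fts is named by the spec.
FTS_REBUILD_SQL = """
DROP TRIGGER IF EXISTS chunks_ai;
DROP TRIGGER IF EXISTS chunks_ad;
DROP TRIGGER IF EXISTS chunks_au;
DROP TABLE IF EXISTS chunks_fts;
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    enriched_text, content='chunks', content_rowid='id'
);
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
INSERT INTO chunks_fts(chunks_fts) VALUES ('rebuild');
"""


def migrate_enriched_text(conn: sqlite3.Connection) -> bool:
    """Add and backfill enriched_text once; return True if a migration ran."""
    columns = [row[1] for row in conn.execute("PRAGMA table_info(chunks)")]
    if "enriched_text" in columns:
        return False  # subsequent startup: nothing to do

    conn.execute("ALTER TABLE chunks ADD COLUMN enriched_text TEXT")
    rows = conn.execute(
        "SELECT c.id, c.text, c.metadata, d.title "
        "FROM chunks c JOIN documents d ON d.id = c.document_id"
    ).fetchall()
    for chunk_id, text, metadata, title in rows:
        header = (json.loads(metadata) if metadata else {}).get("section_header")
        prefix = f"{title} > {header}" if header else title
        conn.execute(
            "UPDATE chunks SET enriched_text = ? WHERE id = ?",
            (f"{prefix}\n\n{text}", chunk_id),
        )
    conn.executescript(FTS_REBUILD_SQL)  # recreate index and triggers
    return True


# Demo against a pre-upgrade schema (no enriched_text column yet).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id),
    text TEXT NOT NULL,
    metadata TEXT
);
INSERT INTO documents (title) VALUES ('DCG Lab Hardware'), ('Docker Tips');
""")
conn.execute(
    "INSERT INTO chunks (document_id, text, metadata) VALUES (1, ?, ?)",
    ("MSI X870 Tomahawk", json.dumps({"section_header": "GRIMDAWN > motherboard"})),
)
conn.execute(
    "INSERT INTO chunks (document_id, text) VALUES (2, ?)",
    ("dbash() { docker exec -it $1 bash; }",),
)

first_run = migrate_enriched_text(conn)
second_run = migrate_enriched_text(conn)
backfilled = [r[0] for r in conn.execute("SELECT enriched_text FROM chunks ORDER BY id")]
```

Checking for the column through PRAGMA table_info is what makes the second call a no-op, which mirrors the two startup scenarios above.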

Requirement: Search results return raw text

Search results SHALL continue to return the original chunk text (from chunks.text) in the text field, not the enriched text. The document title is already returned as a separate title field.

Scenario: Search result text field

  • WHEN a search returns a chunk from document "Suitcase Locks" with raw text "Steve = 363"
  • THEN the result text field SHALL be "Steve = 363" (not "Suitcase Locks\n\nSteve = 363")
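
A search query that matches on enriched_text but returns the raw text might look like the sketch below. The search function, the document_id column, and the external-content FTS layout are assumptions made for illustration; only the title and text result fields come from this spec.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id),
    text TEXT NOT NULL,
    enriched_text TEXT NOT NULL
);
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    enriched_text, content='chunks', content_rowid='id'
);
""")
conn.execute("INSERT INTO documents (title) VALUES (?)", ("Suitcase Locks",))
conn.execute(
    "INSERT INTO chunks (document_id, text, enriched_text) VALUES (1, ?, ?)",
    ("Steve = 363", "Suitcase Locks\n\nSteve = 363"),
)
# Populate the index from the content table (no sync triggers in this demo).
conn.execute("INSERT INTO chunks_fts(chunks_fts) VALUES ('rebuild')")


def search(conn: sqlite3.Connection, query: str) -> list:
    """Match against enriched_text, but return title and raw text separately."""
    rows = conn.execute(
        """
        SELECT d.title, c.text
        FROM chunks_fts
        JOIN chunks c ON c.id = chunks_fts.rowid
        JOIN documents d ON d.id = c.document_id
        WHERE chunks_fts MATCH ?
        """,
        (query,),
    )
    return [{"title": title, "text": text} for title, text in rows]


results = search(conn, "suitcase locks")
```

Selecting c.text rather than the indexed column keeps the title out of the text field, so clients that already display the title field do not show it twice.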