kb/openspec/specs/chunk-enrichment/spec.md

# Chunk Enrichment

## Purpose

Chunk enrichment prepends document titles and section headers to chunk text before indexing and embedding, ensuring that document-level context participates in both full-text and semantic search.

## Requirements

### Requirement: Chunk text enrichment with document title

The engine SHALL prepend the document title to each chunk's text before FTS indexing and vector embedding. The enriched text SHALL be stored in a dedicated `enriched_text` column on the `chunks` table. The original chunk text SHALL remain in the `text` column for display purposes.

The enrichment format SHALL be:
- Without section header: `"{title}\n\n{chunk_text}"`
- With section header: `"{title} > {section_header}\n\n{chunk_text}"`

Here, `section_header` is the value of the `section_header` field in the chunk's metadata. When that field is absent, the format without a section header applies.
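The two formats above can be sketched as a single helper. This is a minimal illustration, not the engine's actual code: the function name `enrich_chunk_text` is hypothetical, and treating an empty-string header the same as a missing one is an assumption the spec does not state.

```python
from typing import Optional


def enrich_chunk_text(
    title: str, chunk_text: str, section_header: Optional[str] = None
) -> str:
    """Build the enriched text for one chunk (hypothetical helper name)."""
    # Assumption: an empty section header is treated like a missing one.
    if section_header:
        prefix = f"{title} > {section_header}"
    else:
        prefix = title
    # Title line(s), a blank line, then the unmodified chunk text.
    return f"{prefix}\n\n{chunk_text}"


print(repr(enrich_chunk_text("Suitcase Locks", "Steve = 363")))
print(
    repr(
        enrich_chunk_text(
            "DCG Lab Hardware", "MSI X870 Tomahawk", "GRIMDAWN > motherboard"
        )
    )
)
```

The raw chunk text is passed through untouched, so the value stored in `text` and the tail of `enriched_text` stay identical.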

#### Scenario: Note ingestion with title enrichment
- **WHEN** a note titled "Suitcase Locks" with content "Steve = 363" is ingested
- **THEN** the `chunks.text` column SHALL contain "Steve = 363" and the `chunks.enriched_text` column SHALL contain "Suitcase Locks\n\nSteve = 363"

#### Scenario: Markdown chunk with section header enrichment
- **WHEN** a markdown document titled "DCG Lab Hardware" produces a chunk with `section_header` "GRIMDAWN > motherboard" and text "MSI X870 Tomahawk"
- **THEN** the `chunks.enriched_text` SHALL contain "DCG Lab Hardware > GRIMDAWN > motherboard\n\nMSI X870 Tomahawk"

#### Scenario: Chunk without section header
- **WHEN** a document titled "Docker Tips" produces a chunk with no `section_header` in metadata and text "dbash() { docker exec -it $1 bash; }"
- **THEN** the `chunks.enriched_text` SHALL contain "Docker Tips\n\ndbash() { docker exec -it $1 bash; }"

---

### Requirement: FTS5 indexes enriched text

The FTS5 virtual table `chunks_fts` SHALL index the `enriched_text` column instead of the `text` column. All FTS sync triggers (insert, update, delete) SHALL operate on `enriched_text`.
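One way this could look in SQLite is an external-content FTS5 table kept in sync by three triggers. The sketch below assumes the Python `sqlite3` module is built with FTS5 enabled. The trigger names (`chunks_ai`, `chunks_ad`, `chunks_au`), the `id` primary key, and the helper `fts_match_ids` are illustrative assumptions; the spec only fixes `chunks`, `chunks_fts`, `text`, and `enriched_text`.

```python
import sqlite3

# Hypothetical DDL: only the table and column names come from the spec.
FTS_SCHEMA = """
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    text TEXT NOT NULL,
    enriched_text TEXT NOT NULL
);
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    enriched_text, content='chunks', content_rowid='id'
);
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
"""


def fts_match_ids(conn: sqlite3.Connection, query: str) -> list:
    """Return ids of chunks whose enriched_text matches the FTS5 query."""
    rows = conn.execute(
        "SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH ? ORDER BY rowid",
        (query,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(FTS_SCHEMA)
conn.executemany(
    "INSERT INTO chunks (text, enriched_text) VALUES (?, ?)",
    [
        ("Steve = 363", "Suitcase Locks\n\nSteve = 363"),
        (
            "MSI X870 Tomahawk",
            "DCG Lab Hardware > GRIMDAWN > motherboard\n\nMSI X870 Tomahawk",
        ),
    ],
)

print(fts_match_ids(conn, "suitcase locks"))  # title-only match
print(fts_match_ids(conn, "MSI X870"))        # body match still works
```

With an external-content table, the delete and update triggers must pass the old `enriched_text` value back to FTS5 so the stale tokens are removed from the index.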

#### Scenario: FTS search matches document title
- **WHEN** a user searches for "suitcase locks" and a document titled "Suitcase Locks" exists with chunk text "Steve = 363"
- **THEN** the FTS5 search SHALL return that chunk as a match

#### Scenario: FTS search still matches chunk content
- **WHEN** a user searches for "MSI X870" and a chunk contains that text in its body
- **THEN** the FTS5 search SHALL return that chunk as a match (enrichment does not break content matching)

---

### Requirement: Vector embeddings use enriched text

The embedding model SHALL receive `enriched_text` (not raw `text`) when generating vectors during both initial ingestion and reindex operations.
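As a rough sketch of this requirement, the embedding step can select `enriched_text` directly, so the same code path serves ingestion and reindex. Everything here beyond the table and column names is assumed for illustration: the `Embedder` callable signature, the function name `embed_chunks`, and the dummy `recording_embedder` that stands in for a real model.

```python
import sqlite3
from typing import Callable, List, Sequence, Tuple

# Hypothetical model interface: a batch of strings in, one vector per string out.
Embedder = Callable[[Sequence[str]], List[List[float]]]


def embed_chunks(
    conn: sqlite3.Connection, embed: Embedder
) -> List[Tuple[int, List[float]]]:
    """Embed every chunk from enriched_text, never from raw text."""
    rows = conn.execute(
        "SELECT id, enriched_text FROM chunks ORDER BY id"
    ).fetchall()
    if not rows:
        return []
    vectors = embed([enriched for _, enriched in rows])
    return [(chunk_id, vec) for (chunk_id, _), vec in zip(rows, vectors)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT, enriched_text TEXT)"
)
conn.execute(
    "INSERT INTO chunks VALUES (1, ?, ?)",
    ("Steve = 363", "Suitcase Locks\n\nSteve = 363"),
)

seen_inputs: List[str] = []


def recording_embedder(texts: Sequence[str]) -> List[List[float]]:
    # Stand-in for a real model: records its inputs, returns a dummy 1-d vector.
    seen_inputs.extend(texts)
    return [[float(len(t))] for t in texts]


result = embed_chunks(conn, recording_embedder)
print(seen_inputs)
```

Recording the embedder's inputs, as the stand-in does, is also a convenient way to assert in tests that the model never sees the raw `text` column.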

#### Scenario: Vector search matches document title
- **WHEN** a user searches semantically for "luggage combination codes" and a document titled "Suitcase Locks" exists
- **THEN** the vector search SHALL return that document's chunk with a higher similarity score than the same chunk would receive without title enrichment

#### Scenario: Reindex uses enriched text
- **WHEN** `POST /api/v1/reindex` is called
- **THEN** the engine SHALL read `enriched_text` from the `chunks` table and embed it instead of `text`

---

### Requirement: Schema migration adds enriched_text column

On startup, `init_schema` SHALL add the `enriched_text` column to the `chunks` table if it does not exist. It SHALL then backfill `enriched_text` for all existing chunks by joining with `documents.title` and parsing chunk metadata for section headers. It SHALL rebuild the FTS5 table and triggers to index `enriched_text`.
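The migration steps could be sketched as follows. This is an illustration under several assumptions that the spec does not confirm: chunk metadata is stored as JSON text in a `metadata` column, chunks reference documents through a `document_id` column, the trigger names are `chunks_ai`/`chunks_ad`/`chunks_au`, and the function name `migrate_enriched_text` is hypothetical. It also does not wrap the steps in a single transaction, which a real implementation might want.

```python
import json
import sqlite3

# Hypothetical FTS DDL; drops any old index, then recreates it on enriched_text.
FTS_DDL = """
DROP TRIGGER IF EXISTS chunks_ai;
DROP TRIGGER IF EXISTS chunks_ad;
DROP TRIGGER IF EXISTS chunks_au;
DROP TABLE IF EXISTS chunks_fts;
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    enriched_text, content='chunks', content_rowid='id'
);
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
INSERT INTO chunks_fts(chunks_fts) VALUES ('rebuild');
"""


def enrich(title, text, metadata_json):
    """Apply the enrichment format; metadata is assumed to be JSON text."""
    try:
        meta = json.loads(metadata_json) if metadata_json else {}
    except json.JSONDecodeError:
        meta = {}
    header = meta.get("section_header") if isinstance(meta, dict) else None
    prefix = f"{title} > {header}" if header else title
    return f"{prefix}\n\n{text}"


def migrate_enriched_text(conn):
    """Return True if the migration ran, False if the column already exists."""
    columns = {row[1] for row in conn.execute("PRAGMA table_info(chunks)")}
    if "enriched_text" in columns:
        return False  # subsequent startup: nothing to do
    conn.execute("ALTER TABLE chunks ADD COLUMN enriched_text TEXT")
    rows = conn.execute(
        "SELECT c.id, d.title, c.text, c.metadata "
        "FROM chunks c JOIN documents d ON d.id = c.document_id"
    ).fetchall()
    conn.executemany(
        "UPDATE chunks SET enriched_text = ? WHERE id = ?",
        [(enrich(title, text, meta), cid) for cid, title, text, meta in rows],
    )
    conn.executescript(FTS_DDL)  # recreate index and triggers, then rebuild
    conn.commit()
    return True


# Pre-upgrade database with no enriched_text column (assumed layout).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id),
    text TEXT NOT NULL,
    metadata TEXT
);
INSERT INTO documents VALUES (1, 'Suitcase Locks'), (2, 'DCG Lab Hardware');
INSERT INTO chunks VALUES
    (1, 1, 'Steve = 363', NULL),
    (2, 2, 'MSI X870 Tomahawk', '{"section_header": "GRIMDAWN > motherboard"}');
""")
first_run = migrate_enriched_text(conn)
second_run = migrate_enriched_text(conn)
print(first_run, second_run)
```

Checking `PRAGMA table_info(chunks)` for the column is what makes the migration safe to call on every startup.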

#### Scenario: First startup after upgrade
- **WHEN** the engine starts and `chunks.enriched_text` column does not exist
- **THEN** the engine SHALL add the column, backfill all rows, drop and recreate `chunks_fts` to index `enriched_text`, and recreate the FTS sync triggers

#### Scenario: Subsequent startup
- **WHEN** the engine starts and `chunks.enriched_text` column already exists
- **THEN** the engine SHALL NOT perform any migration and SHALL start normally

---

### Requirement: Search results return raw text

Search results SHALL continue to return the original chunk text (from `chunks.text`) in the `text` field, not the enriched text. The document title is already returned as a separate `title` field.
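A query shaped roughly like the one below would satisfy this: match against `chunks_fts`, but project `chunks.text` and `documents.title`. It is a sketch only. The `document_id` join column, the `chunk_id` result key, and the function name `search` are assumptions, and it requires an FTS5-enabled SQLite build.

```python
import sqlite3


def search(conn: sqlite3.Connection, query: str) -> list:
    """Match on enriched_text via FTS5, but return the raw chunk text."""
    rows = conn.execute(
        """
        SELECT c.id, d.title, c.text
        FROM chunks_fts
        JOIN chunks c ON c.id = chunks_fts.rowid
        JOIN documents d ON d.id = c.document_id
        WHERE chunks_fts MATCH ?
        ORDER BY chunks_fts.rank
        """,
        (query,),
    )
    return [{"chunk_id": r[0], "title": r[1], "text": r[2]} for r in rows]


# Minimal demo database (layout beyond the spec's column names is assumed).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id),
    text TEXT NOT NULL,
    enriched_text TEXT NOT NULL
);
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    enriched_text, content='chunks', content_rowid='id'
);
""")
conn.execute("INSERT INTO documents VALUES (1, 'Suitcase Locks')")
conn.execute(
    "INSERT INTO chunks VALUES (1, 1, ?, ?)",
    ("Steve = 363", "Suitcase Locks\n\nSteve = 363"),
)
# Populate the external-content index from the chunks table.
conn.execute("INSERT INTO chunks_fts(chunks_fts) VALUES ('rebuild')")

results = search(conn, "suitcase locks")
print(results)
```

The query hits only because the title is in `enriched_text`, yet the caller sees just the raw body plus the separate `title` field.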

#### Scenario: Search result text field
- **WHEN** a search returns a chunk from document "Suitcase Locks" with raw text "Steve = 363"
- **THEN** the result `text` field SHALL be "Steve = 363" (not "Suitcase Locks\n\nSteve = 363")