## ADDED Requirements

### Requirement: Chunk text enrichment with document title

The engine SHALL prepend the document title to each chunk's text before FTS indexing and vector embedding. The enriched text SHALL be stored in a dedicated `enriched_text` column on the `chunks` table. The original chunk text SHALL remain in the `text` column for display purposes.

The enrichment format SHALL be:
- Without section header: `"{title}\n\n{chunk_text}"`
- With section header: `"{title} > {section_header}\n\n{chunk_text}"`

Where `section_header` is the value from the chunk's metadata `section_header` field, when present.
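The two formats above can be sketched as a single helper. This is a minimal illustration, not the engine's actual code: the function name `build_enriched_text` is hypothetical, and treating an empty `section_header` the same as a missing one is an assumption the requirement does not state.

```python
from typing import Optional


def build_enriched_text(title: str, chunk_text: str,
                        section_header: Optional[str] = None) -> str:
    """Prepend the document title (and section header, if any) to a chunk."""
    # Assumption: an empty header is treated like a missing one.
    prefix = f"{title} > {section_header}" if section_header else title
    # A blank line separates the prefix from the original chunk text.
    return f"{prefix}\n\n{chunk_text}"


print(repr(build_enriched_text("Suitcase Locks", "Steve = 363")))
# prints 'Suitcase Locks\n\nSteve = 363'
```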
#### Scenario: Note ingestion with title enrichment
- **WHEN** a note titled "Suitcase Locks" with content "Steve = 363" is ingested
- **THEN** the `chunks.text` column SHALL contain "Steve = 363" and the `chunks.enriched_text` column SHALL contain "Suitcase Locks\n\nSteve = 363"

#### Scenario: Markdown chunk with section header enrichment
- **WHEN** a markdown document titled "DCG Lab Hardware" produces a chunk with `section_header` "GRIMDAWN > motherboard" and text "MSI X870 Tomahawk"
- **THEN** the `chunks.enriched_text` column SHALL contain "DCG Lab Hardware > GRIMDAWN > motherboard\n\nMSI X870 Tomahawk"

#### Scenario: Chunk without section header
- **WHEN** a document titled "Docker Tips" produces a chunk with no `section_header` in metadata and text "dbash() { docker exec -it $1 bash; }"
- **THEN** the `chunks.enriched_text` column SHALL contain "Docker Tips\n\ndbash() { docker exec -it $1 bash; }"

---
### Requirement: FTS5 indexes enriched text

The FTS5 virtual table `chunks_fts` SHALL index the `enriched_text` column instead of the `text` column. All FTS sync triggers (insert, update, delete) SHALL operate on `enriched_text`.

#### Scenario: FTS search matches document title
- **WHEN** a user searches for "suitcase locks" and a document titled "Suitcase Locks" exists with chunk text "Steve = 363"
- **THEN** the FTS5 search SHALL return that chunk as a match

#### Scenario: FTS search still matches chunk content
- **WHEN** a user searches for "MSI X870" and a chunk contains that text in its body
- **THEN** the FTS5 search SHALL return that chunk as a match (enrichment does not break content matching)
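One way to satisfy this requirement is an external-content FTS5 table kept in sync by triggers, following the pattern in the SQLite FTS5 documentation. The sketch below is illustrative only: the reduced `chunks` schema, the trigger names (`chunks_ai`, `chunks_ad`, `chunks_au`), and the `fts_search` helper are assumptions, and it requires a Python build whose SQLite includes FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed, simplified schema: the real chunks table has more columns.
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    text TEXT NOT NULL,
    enriched_text TEXT NOT NULL
);
-- External-content FTS5 table that indexes enriched_text only.
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    enriched_text, content='chunks', content_rowid='id'
);
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
""")

conn.executemany(
    "INSERT INTO chunks (text, enriched_text) VALUES (?, ?)",
    [
        ("Steve = 363", "Suitcase Locks\n\nSteve = 363"),
        ("MSI X870 Tomahawk",
         "DCG Lab Hardware > GRIMDAWN > motherboard\n\nMSI X870 Tomahawk"),
    ],
)


def fts_search(query: str) -> list:
    """Match against enriched_text, but return the raw chunk text."""
    rows = conn.execute(
        "SELECT c.text FROM chunks_fts "
        "JOIN chunks c ON c.id = chunks_fts.rowid "
        "WHERE chunks_fts MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
    return [row[0] for row in rows]


print(fts_search("suitcase locks"))  # title-only match
print(fts_search("MSI X870"))        # body match still works
```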
---

### Requirement: Vector embeddings use enriched text

The embedding model SHALL receive `enriched_text` (not raw `text`) when generating vectors during both initial ingestion and reindex operations.

#### Scenario: Vector search matches document title
- **WHEN** a user searches semantically for "luggage combination codes" and a document titled "Suitcase Locks" exists
- **THEN** the vector search SHALL return that chunk with higher similarity than it would without title enrichment

#### Scenario: Reindex uses enriched text
- **WHEN** `POST /api/v1/reindex` is called
- **THEN** the engine SHALL read `enriched_text` from the chunks table and embed that (not `text`)
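The reindex scenario could look roughly like the sketch below. The function name `reindex_embeddings`, the `embed` callable (a stand-in for whatever embedding model the engine uses), and returning vectors as a dict instead of writing them to a vector store are all assumptions made for illustration.

```python
import sqlite3
from typing import Callable, Dict, List

# Hypothetical embedder signature: a batch of strings in, one vector per string out.
Embedder = Callable[[List[str]], List[List[float]]]


def reindex_embeddings(conn: sqlite3.Connection,
                       embed: Embedder) -> Dict[int, List[float]]:
    """Re-embed every chunk from enriched_text, never from raw text."""
    rows = conn.execute(
        "SELECT id, enriched_text FROM chunks ORDER BY id"
    ).fetchall()
    if not rows:
        return {}
    ids = [row[0] for row in rows]
    # The model only ever sees the title-enriched form of each chunk.
    vectors = embed([row[1] for row in rows])
    return dict(zip(ids, vectors))
```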
---

### Requirement: Schema migration adds enriched_text column

On startup, `init_schema` SHALL add the `enriched_text` column to the `chunks` table if it does not exist. It SHALL then backfill `enriched_text` for all existing chunks by joining with `documents.title` and parsing chunk metadata for section headers. It SHALL rebuild the FTS5 table and triggers to index `enriched_text`.

#### Scenario: First startup after upgrade
- **WHEN** the engine starts and the `chunks.enriched_text` column does not exist
- **THEN** the engine SHALL add the column, backfill all rows, drop and recreate `chunks_fts` to index `enriched_text`, and recreate the FTS sync triggers

#### Scenario: Subsequent startup
- **WHEN** the engine starts and the `chunks.enriched_text` column already exists
- **THEN** the engine SHALL NOT perform any migration and SHALL start normally
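The column check and backfill might be sketched as follows. Several details are assumptions not confirmed by this spec: the helper name `migrate_enriched_text`, a `chunks.document_id` foreign key to `documents.id`, and `chunks.metadata` being a JSON object stored as text. Dropping and recreating `chunks_fts` and its triggers is omitted here for brevity.

```python
import json
import sqlite3


def migrate_enriched_text(conn: sqlite3.Connection) -> bool:
    """Add and backfill chunks.enriched_text; return False if already present."""
    columns = {row[1] for row in conn.execute("PRAGMA table_info(chunks)")}
    if "enriched_text" in columns:
        return False  # subsequent startup: nothing to migrate

    conn.execute("ALTER TABLE chunks ADD COLUMN enriched_text TEXT")

    # Assumed join key and metadata column; adjust to the real schema.
    rows = conn.execute(
        "SELECT c.id, c.text, c.metadata, d.title "
        "FROM chunks c JOIN documents d ON d.id = c.document_id"
    ).fetchall()
    for chunk_id, text, metadata, title in rows:
        header = json.loads(metadata).get("section_header") if metadata else None
        prefix = f"{title} > {header}" if header else title
        conn.execute(
            "UPDATE chunks SET enriched_text = ? WHERE id = ?",
            (f"{prefix}\n\n{text}", chunk_id),
        )
    conn.commit()
    return True
```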
---

### Requirement: Search results return raw text

Search results SHALL continue to return the original chunk text (from `chunks.text`) in the `text` field, not the enriched text. The document title is already returned as a separate `title` field.

#### Scenario: Search result text field
- **WHEN** a search returns a chunk from document "Suitcase Locks" with raw text "Steve = 363"
- **THEN** the result `text` field SHALL be "Steve = 363" (not "Suitcase Locks\n\nSteve = 363")
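A small sketch of how a result payload might be assembled so that the enriched form never leaks into the response. The `SearchResult` shape and the `to_search_result` helper are hypothetical; only the `title` and `text` fields come from this spec, while `chunk_id` and `score` are illustrative extras.

```python
from dataclasses import asdict, dataclass
from typing import Any, Mapping


@dataclass
class SearchResult:
    chunk_id: int  # hypothetical field
    title: str
    text: str
    score: float   # hypothetical field


def to_search_result(row: Mapping[str, Any], score: float) -> SearchResult:
    """Build a result from a joined chunks/documents row."""
    # Read row["text"] on purpose; row["enriched_text"] is for indexing only.
    return SearchResult(row["id"], row["title"], row["text"], score)


row = {
    "id": 1,
    "title": "Suitcase Locks",
    "text": "Steve = 363",
    "enriched_text": "Suitcase Locks\n\nSteve = 363",
}
print(asdict(to_search_result(row, 0.87)))
```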