Add dev-up script and archive kb-title-in-chunks change
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,75 @@
## ADDED Requirements
### Requirement: Chunk text enrichment with document title
The engine SHALL prepend the document title to each chunk's text before FTS indexing and vector embedding. The enriched text SHALL be stored in a dedicated `enriched_text` column on the `chunks` table. The original chunk text SHALL remain in the `text` column for display purposes.
The enrichment format SHALL be:
- Without section header: `"{title}\n\n{chunk_text}"`
- With section header: `"{title} > {section_header}\n\n{chunk_text}"`

Here, `section_header` is the value of the `section_header` field in the chunk's metadata. When that field is absent, the first form applies.
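The two enrichment formats can be sketched as a single helper. This is a minimal illustration, not the engine's actual code: the function name `enrich_chunk_text` and its signature are hypothetical, and it assumes an empty `section_header` is treated the same as a missing one.

```python
from typing import Optional


def enrich_chunk_text(title: str, chunk_text: str,
                      section_header: Optional[str] = None) -> str:
    """Return the text to index and embed for one chunk (hypothetical helper)."""
    # Use "title > section_header" only when the metadata carries a header;
    # an empty string is treated as "no header" (an assumption of this sketch).
    heading = f"{title} > {section_header}" if section_header else title
    # A blank line separates the heading from the original chunk text.
    return f"{heading}\n\n{chunk_text}"
```

The original chunk text is passed through unchanged, so it can still be stored separately in `chunks.text` for display.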
#### Scenario: Note ingestion with title enrichment
- **WHEN** a note titled "Suitcase Locks" with content "Steve = 363" is ingested
- **THEN** the `chunks.text` column SHALL contain "Steve = 363" and the `chunks.enriched_text` column SHALL contain "Suitcase Locks\n\nSteve = 363"
#### Scenario: Markdown chunk with section header enrichment
- **WHEN** a markdown document titled "DCG Lab Hardware" produces a chunk with `section_header` "GRIMDAWN > motherboard" and text "MSI X870 Tomahawk"
- **THEN** the `chunks.enriched_text` SHALL contain "DCG Lab Hardware > GRIMDAWN > motherboard\n\nMSI X870 Tomahawk"
#### Scenario: Chunk without section header
- **WHEN** a document titled "Docker Tips" produces a chunk with no `section_header` in metadata and text "dbash() { docker exec -it $1 bash; }"
- **THEN** the `chunks.enriched_text` SHALL contain "Docker Tips\n\ndbash() { docker exec -it $1 bash; }"
---
### Requirement: FTS5 indexes enriched text
The FTS5 virtual table `chunks_fts` SHALL index the `enriched_text` column instead of the `text` column. All FTS sync triggers (insert, update, delete) SHALL operate on `enriched_text`.
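One way to satisfy this requirement is an external-content FTS5 table kept in sync by three triggers. The sketch below assumes an integer primary key `chunks.id` and the trigger names `chunks_ai`, `chunks_ad` and `chunks_au`. None of these are confirmed by this spec, and the real `chunks` table has more columns than shown. The helper `fts_match` is likewise hypothetical.

```python
import sqlite3

FTS_SCHEMA = """
CREATE TABLE chunks (
    id            INTEGER PRIMARY KEY,
    text          TEXT NOT NULL,   -- raw chunk text, kept for display
    enriched_text TEXT NOT NULL    -- title-prefixed text, used for search
);

-- External-content FTS5 table that indexes only enriched_text.
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    enriched_text, content='chunks', content_rowid='id'
);

CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;

CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
END;

CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
"""


def fts_match(conn: sqlite3.Connection, query: str) -> list[int]:
    """Return ids of chunks whose enriched_text matches the FTS5 query."""
    rows = conn.execute(
        "SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH ? ORDER BY rowid",
        (query,),
    ).fetchall()
    return [row[0] for row in rows]
```

Because the title is part of `enriched_text`, a query for words in the title and a query for words in the body both hit the same index, which is what the scenarios in this requirement exercise.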
#### Scenario: FTS search matches document title
- **WHEN** a user searches for "suitcase locks" and a document titled "Suitcase Locks" exists with chunk text "Steve = 363"
- **THEN** the FTS5 search SHALL return that chunk as a match
#### Scenario: FTS search still matches chunk content
- **WHEN** a user searches for "MSI X870" and a chunk contains that text in its body
- **THEN** the FTS5 search SHALL return that chunk as a match (enrichment does not break content matching)
---
### Requirement: Vector embeddings use enriched text
The embedding model SHALL receive `enriched_text` (not raw `text`) when generating vectors during both initial ingestion and reindex operations.
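A reindex pass that honors this requirement might look like the sketch below. The `Embedder` type and the `reindex_embeddings` function are hypothetical stand-ins, since the spec does not name the embedding model interface or the vector store. The sketch returns the vectors keyed by chunk id instead of persisting them.

```python
import sqlite3
from typing import Callable, Sequence

# Hypothetical embedder type: maps a batch of strings to one vector per string.
Embedder = Callable[[Sequence[str]], list[list[float]]]


def reindex_embeddings(conn: sqlite3.Connection,
                       embed: Embedder) -> dict[int, list[float]]:
    """Re-embed every chunk from enriched_text; the raw text column is not read."""
    rows = conn.execute(
        "SELECT id, enriched_text FROM chunks ORDER BY id"
    ).fetchall()
    ids = [row[0] for row in rows]
    # Skip the model call entirely when there is nothing to embed.
    vectors = embed([row[1] for row in rows]) if rows else []
    return dict(zip(ids, vectors))
```

Initial ingestion would pass the same enriched string to the embedder, so a chunk gets an identical vector whether it was embedded at ingest time or during a later reindex.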
#### Scenario: Vector search matches document title
- **WHEN** a user searches semantically for "luggage combination codes" and a document titled "Suitcase Locks" exists
- **THEN** the vector search SHALL return that chunk with higher similarity than it would without title enrichment
#### Scenario: Reindex uses enriched text
- **WHEN** `POST /api/v1/reindex` is called
- **THEN** the engine SHALL read `enriched_text` from the `chunks` table and embed it (not `text`)
---
### Requirement: Schema migration adds enriched_text column
On startup, `init_schema` SHALL add the `enriched_text` column to the `chunks` table if it does not exist. It SHALL then backfill `enriched_text` for all existing chunks by joining with `documents.title` and parsing chunk metadata for section headers. It SHALL rebuild the FTS5 table and triggers to index `enriched_text`.
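The migration steps could be sketched as follows. This is an illustration under several assumptions that the spec does not confirm: chunk metadata is stored as a JSON object in a `chunks.metadata` column, chunks reference their document through `chunks.document_id`, both tables have an integer `id` primary key, and the FTS triggers are named `chunks_ai`, `chunks_ad` and `chunks_au`. The function name `migrate_enriched_text` is also hypothetical. In the spec this logic lives inside `init_schema`.

```python
import json
import sqlite3

# Assumed names: chunks.metadata, chunks.document_id, chunks_ai/ad/au.
FTS_REBUILD_SQL = """
DROP TRIGGER IF EXISTS chunks_ai;
DROP TRIGGER IF EXISTS chunks_ad;
DROP TRIGGER IF EXISTS chunks_au;
DROP TABLE IF EXISTS chunks_fts;
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    enriched_text, content='chunks', content_rowid='id'
);
INSERT INTO chunks_fts(chunks_fts) VALUES ('rebuild');
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
"""


def _enrich(title, chunk_text, metadata_json):
    """Build enriched text; unparseable or non-object metadata means no header."""
    try:
        meta = json.loads(metadata_json) if metadata_json else {}
    except json.JSONDecodeError:
        meta = {}
    header = meta.get("section_header") if isinstance(meta, dict) else None
    heading = f"{title} > {header}" if header else title
    return f"{heading}\n\n{chunk_text}"


def migrate_enriched_text(conn: sqlite3.Connection) -> bool:
    """Add and backfill chunks.enriched_text once; return True if it ran."""
    columns = {row[1] for row in conn.execute("PRAGMA table_info(chunks)")}
    if "enriched_text" in columns:
        return False  # subsequent startup: nothing to migrate

    conn.execute("ALTER TABLE chunks ADD COLUMN enriched_text TEXT")
    rows = conn.execute(
        "SELECT c.id, d.title, c.text, c.metadata "
        "FROM chunks c JOIN documents d ON d.id = c.document_id"
    ).fetchall()
    conn.executemany(
        "UPDATE chunks SET enriched_text = ? WHERE id = ?",
        [(_enrich(title, text, meta), cid) for cid, title, text, meta in rows],
    )
    # Rebuild the FTS5 index and its sync triggers on the new column.
    conn.executescript(FTS_REBUILD_SQL)
    conn.commit()
    return True
```

Checking `PRAGMA table_info(chunks)` for the column is what makes the second and later startups a no-op, matching the two scenarios in this requirement.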
#### Scenario: First startup after upgrade
- **WHEN** the engine starts and `chunks.enriched_text` column does not exist
- **THEN** the engine SHALL add the column, backfill all rows, drop and recreate `chunks_fts` to index `enriched_text`, and recreate the FTS sync triggers
#### Scenario: Subsequent startup
- **WHEN** the engine starts and `chunks.enriched_text` column already exists
- **THEN** the engine SHALL NOT perform any migration and SHALL start normally
---
### Requirement: Search results return raw text
Search results SHALL continue to return the original chunk text (from `chunks.text`) in the `text` field, not the enriched text. The document title is already returned as a separate `title` field.
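The split between what is matched and what is returned can be sketched as a query that searches `chunks_fts` but selects `chunks.text`. The function name `search_chunks`, the `chunk_id` result key, and the `chunks.document_id` join column are assumptions for illustration. Only the `text` and `title` result fields come from this spec.

```python
import sqlite3


def search_chunks(conn: sqlite3.Connection, query: str) -> list[dict]:
    """Match on the enriched FTS index, but return raw text plus the title."""
    rows = conn.execute(
        """
        SELECT c.id, d.title, c.text            -- c.text, not c.enriched_text
        FROM chunks_fts
        JOIN chunks c ON c.id = chunks_fts.rowid
        JOIN documents d ON d.id = c.document_id
        WHERE chunks_fts MATCH ?
        ORDER BY rank
        """,
        (query,),
    ).fetchall()
    return [{"chunk_id": r[0], "title": r[1], "text": r[2]} for r in rows]
```

With this shape, a title-only query such as "suitcase locks" still finds the chunk, while the caller sees the title once, in its own field, and not duplicated at the start of `text`.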
#### Scenario: Search result text field
- **WHEN** a search returns a chunk from document "Suitcase Locks" with raw text "Steve = 363"
- **THEN** the result `text` field SHALL be "Steve = 363" (not "Suitcase Locks\n\nSteve = 363")