Files
2026-03-30 07:25:22 +01:00

29 lines
3.2 KiB
Markdown

## Why
Short documents and notes are unsearchable when the user's query matches the document title but not the chunk content. For example, a document titled "Suitcase Locks" containing only "Steve = 1234 / Theresa = 4567" is invisible to both FTS and vector search for the query "suitcase locks". This is because chunk text — the only thing indexed and embedded — does not include the document title. This is a standard RAG deficiency that most pipelines solve by prepending title context to each chunk.
## What Changes
- **Prepend document title to chunk text at ingestion time**: Before embedding and FTS indexing, each chunk's text will be prefixed with the document title (e.g., `"Suitcase Locks\n\n Steve = 363..."`). This ensures the title participates in both full-text and semantic search.
- **Include section header context in chunk text**: For chunks that have a `section_header` in their metadata, prepend the header breadcrumb too (e.g., `"DCG Lab Hardware > GRIMDAWN > motherboard\n\nMSI X870 Tomahawk..."`). This improves search for queries that reference section names.
- **Store the raw chunk text separately from the enriched text**: The original chunk text (without title prefix) must remain accessible so that search results don't display the prepended title redundantly — the title is already returned as a separate field.
- **Reindex command must apply the same enrichment**: When `kb reindex` re-embeds all chunks, it must reconstruct the enriched text (title + section header + chunk text) from stored metadata.
## Capabilities
### New Capabilities
- `chunk-enrichment`: Prepending document title and section context to chunk text before indexing and embedding, while preserving the original text for display.
### Modified Capabilities
- `engine-api`: The search endpoint's returned `text` field must continue to show the original chunk text (without the prepended title), so no visible API change, but the internal indexing behaviour changes. The reindex endpoint must apply enrichment consistently.
## Impact
- **Engine ingestion pipeline** (`worker.py`): The `_process_job` function must build enriched text from title + section headers + chunk text before passing to `embed_texts()` and `insert_chunk()`.
- **Database schema** (`database.py`): Need to store both raw `text` (for display) and enriched `text` (for FTS/embedding), or reconstruct enriched text at index time. Simplest approach: store raw text in `chunks.text`, use enriched text only for FTS content and embedding vectors.
- **FTS triggers** (`database.py`): The FTS5 external content table currently mirrors `chunks.text`. If we add an `enriched_text` column, the FTS index should be built from that instead.
- **Reindex flow** (`worker.py` / `database.py`): Must reconstruct enriched text by joining chunk metadata with document title.
- **Search result enrichment** (`routes/search.py`): No change needed — results already return `chunks.text` (raw) and `documents.title` separately.
- **All four ingest modules** (`note.py`, `markdown.py`, `code.py`, `docling_pipeline.py`): No changes needed — enrichment happens after chunking, in the worker.
- **Existing documents**: Require a `reindex` to benefit from the new enrichment. No data migration needed since the original text is preserved.