bbe6a5e909
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Why
Short documents and notes cannot be found when the user's query matches the document title but not the chunk content. For example, a document titled "Suitcase Locks" containing only "Steve = 1234 / Theresa = 4567" is invisible to both FTS and vector search for the query "suitcase locks". The cause is that chunk text, the only thing indexed and embedded, does not include the document title. This is a common RAG deficiency, typically solved by prepending title context to each chunk.
What Changes
- Prepend document title to chunk text at ingestion time: Before embedding and FTS indexing, each chunk's text will be prefixed with the document title (e.g., "Suitcase Locks\n\nSteve = 1234..."). This ensures the title participates in both full-text and semantic search.
- Include section header context in chunk text: For chunks that have a section_header in their metadata, prepend the header breadcrumb too (e.g., "DCG Lab Hardware > GRIMDAWN > motherboard\n\nMSI X870 Tomahawk..."). This improves search for queries that reference section names.
- Store the raw chunk text separately from the enriched text: The original chunk text (without title prefix) must remain accessible so that search results don't display the prepended title redundantly; the title is already returned as a separate field.
- Reindex command must apply the same enrichment: When kb reindex re-embeds all chunks, it must reconstruct the enriched text (title + section header + chunk text) from stored metadata.
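The enrichment described above can be sketched as a small pure function. This is a minimal illustration, not the project's implementation: the name build_enriched_text is hypothetical, and it assumes section_header is stored as a ready-made breadcrumb string and that parts are joined with a blank line, as in the examples above.

```python
from typing import Optional


def build_enriched_text(
    title: str,
    raw_text: str,
    section_header: Optional[str] = None,
) -> str:
    """Return the text to embed and index; raw_text itself is not modified.

    Hypothetical helper: assumes section_header is already a breadcrumb
    string such as "A > B > C", and that parts are separated by "\n\n".
    """
    parts = []
    if title and title.strip():
        parts.append(title.strip())
    if section_header and section_header.strip():
        parts.append(section_header.strip())
    # The raw chunk text always comes last and is kept verbatim.
    parts.append(raw_text)
    return "\n\n".join(parts)


raw = "Steve = 1234 / Theresa = 4567"
enriched = build_enriched_text("Suitcase Locks", raw)
```

Keeping the function free of I/O would let ingestion and reindexing share it, so both paths produce identical enriched text for the same stored title, header, and raw chunk.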
Capabilities
New Capabilities
chunk-enrichment: Prepending document title and section context to chunk text before indexing and embedding, while preserving the original text for display.
Modified Capabilities
engine-api: The search endpoint's returned text field must continue to show the original chunk text (without the prepended title), so there is no visible API change; only the internal indexing behaviour changes. The reindex endpoint must apply enrichment consistently.
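One way the reindex path could rebuild enriched text consistently is to join each chunk with its document title and read the section header from stored metadata. The sketch below uses an in-memory SQLite database; the table layout, the JSON metadata column, the function name enriched_texts_for_reindex, and the document title "Hardware Notes" are all assumptions made for illustration, not the project's actual schema or data.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE chunks (
        id INTEGER PRIMARY KEY,
        document_id INTEGER NOT NULL REFERENCES documents(id),
        text TEXT NOT NULL,       -- raw chunk text, returned to the user
        metadata TEXT NOT NULL    -- JSON; may carry "section_header"
    );
""")
conn.execute("INSERT INTO documents VALUES (1, 'Suitcase Locks')")
# "Hardware Notes" is a hypothetical title used only for this example.
conn.execute("INSERT INTO documents VALUES (2, 'Hardware Notes')")
conn.execute(
    "INSERT INTO chunks VALUES (1, 1, 'Steve = 1234 / Theresa = 4567', '{}')"
)
conn.execute(
    "INSERT INTO chunks VALUES (2, 2, 'MSI X870 Tomahawk', ?)",
    (json.dumps({"section_header": "DCG Lab Hardware > GRIMDAWN > motherboard"}),),
)


def enriched_texts_for_reindex(conn):
    """Yield (chunk_id, enriched_text) rebuilt from stored title and metadata."""
    rows = conn.execute(
        "SELECT c.id, d.title, c.metadata, c.text "
        "FROM chunks c JOIN documents d ON d.id = c.document_id "
        "ORDER BY c.id"
    )
    for chunk_id, title, metadata, text in rows:
        header = json.loads(metadata or "{}").get("section_header")
        # Title first, then the optional header, then the raw text.
        parts = [p for p in (title, header) if p]
        parts.append(text)
        yield chunk_id, "\n\n".join(parts)


rebuilt = dict(enriched_texts_for_reindex(conn))
```

Because the enriched text is derived on the fly, chunks.text stays untouched and the search endpoint can keep returning it as-is.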
Impact
- Engine ingestion pipeline (worker.py): The _process_job function must build enriched text from title + section headers + chunk text before passing it to embed_texts() and insert_chunk().
- Database schema (database.py): Need to store both raw text (for display) and enriched text (for FTS/embedding), or reconstruct enriched text at index time. Simplest approach: store raw text in chunks.text, use enriched text only for FTS content and embedding vectors.
- FTS triggers (database.py): The FTS5 external content table currently mirrors chunks.text. If we add an enriched_text column, the FTS index should be built from that instead.
- Reindex flow (worker.py / database.py): Must reconstruct enriched text by joining chunk metadata with document title.
- Search result enrichment (routes/search.py): No change needed; results already return chunks.text (raw) and documents.title separately.
- All four ingest modules (note.py, markdown.py, code.py, docling_pipeline.py): No changes needed; enrichment happens after chunking, in the worker.
- Existing documents: Require a reindex to benefit from the new enrichment. No data migration needed since the original text is preserved.
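To make the FTS option concrete, the sketch below shows an FTS5 external content table built from an enriched_text column while results still return the raw chunks.text. It is an illustration under assumptions: the table, trigger, and function names are hypothetical, the real schema in database.py may differ, and it requires a Python sqlite3 build with FTS5 enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE chunks (
        id INTEGER PRIMARY KEY,
        text TEXT NOT NULL,            -- raw text, shown in results
        enriched_text TEXT NOT NULL    -- title + headers + raw text
    );
    -- External content FTS5 table that indexes enriched_text, not text.
    CREATE VIRTUAL TABLE chunks_fts USING fts5(
        enriched_text, content='chunks', content_rowid='id'
    );
    -- Triggers keep the FTS index in sync with the chunks table.
    CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
        INSERT INTO chunks_fts(rowid, enriched_text)
        VALUES (new.id, new.enriched_text);
    END;
    CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
        INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
        VALUES ('delete', old.id, old.enriched_text);
    END;
    CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
        INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
        VALUES ('delete', old.id, old.enriched_text);
        INSERT INTO chunks_fts(rowid, enriched_text)
        VALUES (new.id, new.enriched_text);
    END;
""")

raw = "Steve = 1234 / Theresa = 4567"
conn.execute(
    "INSERT INTO chunks (text, enriched_text) VALUES (?, ?)",
    (raw, "Suitcase Locks\n\n" + raw),
)


def search(conn, query):
    """Match against enriched text but return only the raw chunk text."""
    rows = conn.execute(
        "SELECT chunks.text FROM chunks_fts "
        "JOIN chunks ON chunks.id = chunks_fts.rowid "
        "WHERE chunks_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]


hits = search(conn, "suitcase locks")
```

Here the title-only query matches because the title is part of the indexed column, while the returned value is the unmodified chunk text. The trade-off against reconstructing enriched text at index time is extra storage per chunk in exchange for simpler triggers.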