kb/openspec/changes/archive/2026-03-29-kb-title-in-chunks/proposal.md
2026-03-30 07:25:22 +01:00

Why

Short documents and notes cannot be found when the user's query matches the document title but not the chunk content. For example, a document titled "Suitcase Locks" containing only "Steve = 1234 / Theresa = 4567" is invisible to both FTS and vector search for the query "suitcase locks". The cause is that chunk text, the only thing indexed and embedded, does not include the document title. This is a common RAG deficiency, and the usual fix is to prepend title context to each chunk.

What Changes

  • Prepend document title to chunk text at ingestion time: Before embedding and FTS indexing, each chunk's text will be prefixed with the document title (e.g., "Suitcase Locks\n\nSteve = 1234..."). This ensures the title participates in both full-text and semantic search.
  • Include section header context in chunk text: For chunks that have a section_header in their metadata, prepend the header breadcrumb too (e.g., "DCG Lab Hardware > GRIMDAWN > motherboard\n\nMSI X870 Tomahawk..."). This improves search for queries that reference section names.
  • Store the raw chunk text separately from the enriched text: The original chunk text (without title prefix) must remain accessible so that search results don't display the prepended title redundantly — the title is already returned as a separate field.
  • Reindex command must apply the same enrichment: When kb reindex re-embeds all chunks, it must reconstruct the enriched text (title + section header + chunk text) from stored metadata.
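
The enrichment step described in the list above could look roughly like the following. This is a minimal sketch, not the project's implementation: the helper name `build_enriched_text`, its signature, and the blank-line separator between parts are assumptions for illustration, and it assumes the section header arrives as an already-joined breadcrumb string that does not repeat the document title.

```python
from typing import Optional


def build_enriched_text(title: str, chunk_text: str,
                        section_header: Optional[str] = None) -> str:
    """Return title + optional section breadcrumb + raw chunk text.

    Hypothetical helper: the name, signature, and "\n\n" separator are
    assumptions, not taken from worker.py.
    """
    parts = []
    if title and title.strip():
        parts.append(title.strip())
    if section_header and section_header.strip():
        parts.append(section_header.strip())
    # The raw chunk text always comes last and is left unmodified.
    parts.append(chunk_text)
    return "\n\n".join(parts)


raw = "Steve = 1234 / Theresa = 4567"
enriched = build_enriched_text("Suitcase Locks", raw)
# `raw` is what search results would display; `enriched` is what would be
# passed to embedding and FTS indexing.
```

Keeping the function pure (no database access) would let both ingestion and kb reindex call it with the same inputs, which is one way to guarantee that the two paths produce identical enriched text.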

Capabilities

New Capabilities

  • chunk-enrichment: Prepending document title and section context to chunk text before indexing and embedding, while preserving the original text for display.

Modified Capabilities

  • engine-api: The search endpoint's returned text field must continue to show the original chunk text (without the prepended title). There is no visible API change; only the internal indexing behaviour changes. The reindex endpoint must apply enrichment consistently.

Impact

  • Engine ingestion pipeline (worker.py): The _process_job function must build enriched text from title + section headers + chunk text before passing to embed_texts() and insert_chunk().
  • Database schema (database.py): We need to either store both raw text (for display) and enriched text (for FTS/embedding), or reconstruct enriched text at index time. The simplest approach is to keep raw text in chunks.text and use enriched text only for FTS content and embedding vectors.
  • FTS triggers (database.py): The FTS5 external content table currently mirrors chunks.text. If we add an enriched_text column, the FTS index should be built from that instead.
  • Reindex flow (worker.py / database.py): Must reconstruct enriched text by joining chunk metadata with document title.
  • Search result enrichment (routes/search.py): No change needed — results already return chunks.text (raw) and documents.title separately.
  • All four ingest modules (note.py, markdown.py, code.py, docling_pipeline.py): No changes needed — enrichment happens after chunking, in the worker.
  • Existing documents: Require a reindex to benefit from the new enrichment. No data migration needed since the original text is preserved.
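
To make the schema and reindex points above concrete, here is a small sketch using an in-memory SQLite database. It takes the enriched_text column variant mentioned in the FTS triggers item. The table layout, trigger name, and the `search` and `reindex` helpers are all hypothetical, section headers are left out for brevity, and it assumes a SQLite build with FTS5 enabled.

```python
import sqlite3

# Hypothetical schema: raw text for display, enriched text for indexing only.
SCHEMA = """
CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id),
    text TEXT NOT NULL,
    enriched_text TEXT NOT NULL
);
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    enriched_text, content='chunks', content_rowid='id'
);
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
"""


def search(conn, query):
    """Match against enriched text, but return raw text and title."""
    rows = conn.execute(
        "SELECT c.text, d.title FROM chunks_fts "
        "JOIN chunks c ON c.id = chunks_fts.rowid "
        "JOIN documents d ON d.id = c.document_id "
        "WHERE chunks_fts MATCH ?",
        (query,),
    )
    return rows.fetchall()


def reindex(conn):
    """Rebuild enriched text from the current title, then rebuild the index."""
    conn.execute(
        "UPDATE chunks SET enriched_text = "
        "(SELECT title FROM documents WHERE id = chunks.document_id) "
        "|| char(10) || char(10) || text"
    )
    # FTS5 'rebuild' re-reads every row from the external content table.
    conn.execute("INSERT INTO chunks_fts(chunks_fts) VALUES ('rebuild')")


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO documents (id, title) VALUES (1, 'Suitcase Locks')")
raw = "Steve = 1234 / Theresa = 4567"
conn.execute(
    "INSERT INTO chunks (document_id, text, enriched_text) VALUES (1, ?, ?)",
    (raw, "Suitcase Locks\n\n" + raw),
)
hits = search(conn, "suitcase locks")
```

In this sketch the title query matches because FTS indexes enriched_text, while the result row still carries only the raw chunk text and the title as separate fields. Embedding vectors are not covered here; they would be recomputed from the same enriched string during a reindex.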