kb/openspec/changes/archive/2026-03-29-kb-title-in-chunks/design.md
2026-03-30 07:25:22 +01:00

Context

When a document is ingested, the worker chunks its content and stores each chunk's text in the chunks table. FTS5 triggers index that text, and the embedding model embeds it. The document title is stored only in documents.title and never participates in search. As a result, short documents (or documents whose content lacks the title keywords) are invisible to queries that match only the title.

The reindex endpoint (POST /api/v1/reindex) currently reads chunks.text and re-embeds it. Any fix must apply consistently at both ingestion and reindex time.

Goals / Non-Goals

Goals:

  • Document titles are searchable via both FTS5 and vector search
  • Section header breadcrumbs (when present in chunk metadata) are also searchable
  • Search results continue to return the original chunk text (no title prefix in the text field returned to clients)
  • Existing documents become searchable by title after a kb reindex
  • No schema-breaking migration — additive column only

Non-Goals:

  • Changing the chunking strategies themselves (note, markdown, code, docling)
  • Adding a separate title-search endpoint or client-side title filtering
  • Changing the search result JSON structure

Decisions

1. Add an enriched_text column to the chunks table

Store the title-prefixed text in a new chunks.enriched_text column alongside the existing chunks.text. The text column remains the raw chunk content (used for display in search results). The enriched_text column holds the document title, plus the section header when one is present in chunk metadata, prepended to the chunk text. The exact format is defined in Decision 4.

Why not just modify chunks.text? The title would then appear in every search result's text field, which is redundant (title is already a separate field) and would confuse consumers that display results.

Why not reconstruct enriched text on-the-fly at search time? FTS5 uses an external content table and triggers — it needs a real column to index. Reconstructing via JOIN at FTS query time would defeat the purpose of the FTS index.

2. Point FTS5 at enriched_text instead of text

Update the FTS5 virtual table definition and its sync triggers to index enriched_text rather than text. This is the core change that makes titles searchable via keyword search.

Because the column definition of an FTS5 virtual table cannot be changed with ALTER TABLE, existing databases require a rebuild: drop and recreate chunks_fts and its triggers, then repopulate the index from chunks. This is handled as a schema migration in init_schema.
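As a rough illustration of this decision, the sketch below builds an external-content FTS5 table over enriched_text and shows a title-only query returning the raw chunk text. It assumes a Python build whose sqlite3 module includes FTS5. The trigger names, the reduced chunks schema, the sample data, and the search() helper are hypothetical and not taken from the codebase.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    text TEXT NOT NULL,          -- raw chunk content, returned to clients
    enriched_text TEXT NOT NULL  -- title-prefixed content, indexed by FTS5
);

-- External-content FTS5 table that indexes enriched_text only.
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    enriched_text, content='chunks', content_rowid='id'
);

-- Sync triggers (names are illustrative).
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
""")

# The chunk body never mentions "backup"; only the title prefix does.
conn.execute(
    "INSERT INTO chunks (text, enriched_text) VALUES (?, ?)",
    ("Run the job nightly.", "Backup Guide\n\nRun the job nightly."),
)


def search(db, query):
    """Match against enriched_text, but return the raw chunk text."""
    rows = db.execute(
        "SELECT c.text FROM chunks_fts "
        "JOIN chunks c ON c.id = chunks_fts.rowid "
        "WHERE chunks_fts MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
    return [row[0] for row in rows]


print(search(conn, "backup"))
```

The query matches on the indexed title but selects chunks.text, which is how the "no title prefix in returned text" goal can coexist with title search.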

3. Embed enriched_text instead of text

At ingestion time, pass enriched_text values to embed_texts() instead of raw chunk text. At reindex time, read enriched_text from the database. This makes titles searchable via vector similarity too.
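The effect on vector search can be illustrated with a toy example. The sketch below does not use the real embedding model: toy_embed_texts is a hypothetical bag-of-words stand-in for embed_texts(), and reindex_vectors is an assumed shape for the reindex read path, including a fallback to text for rows that have no enriched_text yet.

```python
import math
import re
import sqlite3

VOCAB = ["backup", "guide", "run", "job", "nightly"]


def toy_embed_texts(texts):
    """Toy stand-in for embed_texts(): word counts over a fixed vocabulary."""
    vectors = []
    for t in texts:
        tokens = re.findall(r"[a-z]+", t.lower())
        vectors.append([float(tokens.count(word)) for word in VOCAB])
    return vectors


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def reindex_vectors(conn, embed_texts=toy_embed_texts):
    """Re-embed every chunk from enriched_text (assumed reindex read path)."""
    # COALESCE is an assumption: fall back to raw text if enriched_text is NULL.
    rows = conn.execute(
        "SELECT id, COALESCE(enriched_text, text) FROM chunks ORDER BY id"
    ).fetchall()
    vectors = embed_texts([row[1] for row in rows])
    return {row[0]: vec for row, vec in zip(rows, vectors)}


raw = "Run the job nightly."
enriched = "Backup Guide\n\n" + raw
query_vec, raw_vec, enriched_vec = toy_embed_texts(["backup guide", raw, enriched])

# A title-only query shares no terms with the raw chunk, but does with the
# enriched chunk, so only the enriched embedding scores above zero.
raw_score = cosine(query_vec, raw_vec)
enriched_score = cosine(query_vec, enriched_vec)
print(raw_score, enriched_score)
```

A real embedding model will not behave as crisply as word counts, so this only shows the direction of the effect, not its size.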

4. Build enriched text in the worker, not in the ingest modules

The enrichment format is: "{title}\n\n{chunk_text}" or "{title} > {section_header}\n\n{chunk_text}" when a section header exists in chunk metadata.

This happens in worker._process_job() after chunking and before embedding/insertion. The ingest modules remain unchanged — they continue to return raw chunk text and metadata.
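The format above is small enough to capture in one helper. This is a minimal sketch, not the worker's actual code: the function name build_enriched_text and its signature are hypothetical, and it assumes an empty or missing section header should be treated as absent.

```python
from typing import Optional


def build_enriched_text(
    title: str, chunk_text: str, section_header: Optional[str] = None
) -> str:
    """Prefix chunk text with the document title and optional section header."""
    # Use "title > section_header" only when a non-empty header exists.
    prefix = f"{title} > {section_header}" if section_header else title
    return f"{prefix}\n\n{chunk_text}"


print(repr(build_enriched_text("Backup Guide", "Run the job nightly.")))
print(repr(build_enriched_text("Backup Guide", "Run the job nightly.", "Schedule")))
```

Keeping this in one function would let ingestion and any backfill produce identical strings, which matters because both FTS and embeddings are derived from the result.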

5. Schema migration adds enriched_text and rebuilds FTS

The init_schema function will:

  1. Add enriched_text TEXT column to chunks if missing
  2. Backfill enriched_text from existing data (join with documents.title and chunk metadata)
  3. Drop and recreate chunks_fts to index enriched_text instead of text
  4. Recreate the FTS sync triggers

This is safe because the migration only runs when the column is missing (first startup after upgrade). The backfill uses a single UPDATE...FROM query, which requires SQLite 3.33.0 or later.
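The four steps can be sketched as one idempotent function. This is an illustration under several assumptions, not the real init_schema: the function name, the chunks.document_id and chunks.metadata columns, the "section_header" JSON key, and the trigger names are all hypothetical. For portability it also swaps the UPDATE...FROM backfill for a plain Python loop, so it does not depend on the SQLite version or the JSON functions.

```python
import json
import sqlite3


def migrate_enriched_text(conn):
    """Run the migration once; return False if it was already applied."""
    columns = {row[1] for row in conn.execute("PRAGMA table_info(chunks)")}
    if "enriched_text" in columns:
        return False

    # Step 1: additive column only.
    conn.execute("ALTER TABLE chunks ADD COLUMN enriched_text TEXT")

    # Step 2: backfill from documents.title and chunk metadata.
    rows = conn.execute(
        "SELECT c.id, c.text, c.metadata, d.title "
        "FROM chunks c JOIN documents d ON d.id = c.document_id"
    ).fetchall()
    for chunk_id, text, metadata, title in rows:
        header = json.loads(metadata or "{}").get("section_header")
        prefix = f"{title} > {header}" if header else title
        conn.execute(
            "UPDATE chunks SET enriched_text = ? WHERE id = ?",
            (f"{prefix}\n\n{text}", chunk_id),
        )

    # Steps 3 and 4: recreate the FTS table and triggers, then repopulate.
    # executescript() commits the pending backfill before running the script.
    conn.executescript("""
    DROP TRIGGER IF EXISTS chunks_ai;
    DROP TRIGGER IF EXISTS chunks_ad;
    DROP TRIGGER IF EXISTS chunks_au;
    DROP TABLE IF EXISTS chunks_fts;
    CREATE VIRTUAL TABLE chunks_fts USING fts5(
        enriched_text, content='chunks', content_rowid='id'
    );
    CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
        INSERT INTO chunks_fts(rowid, enriched_text)
        VALUES (new.id, new.enriched_text);
    END;
    CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
        INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
        VALUES ('delete', old.id, old.enriched_text);
    END;
    CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
        INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
        VALUES ('delete', old.id, old.enriched_text);
        INSERT INTO chunks_fts(rowid, enriched_text)
        VALUES (new.id, new.enriched_text);
    END;
    INSERT INTO chunks_fts(chunks_fts) VALUES ('rebuild');
    """)
    conn.commit()
    return True
```

The 'rebuild' command repopulates the index from the content table, so existing chunks become title-searchable without touching their embeddings. Note that this sketch is not atomic across its steps; a production migration would likely want stricter transaction handling.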

Risks / Trade-offs

Slightly larger database: The title is stored once in documents.title and again in the enriched_text of every chunk of that document. For a typical KB with short titles this is negligible (< 1% size increase). → Acceptable for the search quality improvement.

FTS rebuild on upgrade: The first startup after upgrade rebuilds the FTS index, which takes a few seconds for large KBs. → This is a one-time cost and happens automatically.

Embedding drift: Existing vector embeddings won't include title context until kb reindex is run. The FTS backfill happens automatically, but vectors require an explicit reindex. → Document this in release notes. The FTS improvement alone is a significant win even without reindexing vectors.

Title changes not propagated: If a document's title were ever updated, enriched_text would be stale. Currently the engine has no title-update endpoint, so this is not a concern. → No mitigation needed now. If title editing is added later, it should update enriched_text and re-embed the affected chunks.