kb/openspec/changes/archive/2026-03-29-kb-title-in-chunks/design.md at 75e4a0cf730969bdae711a42e54a4b9a533b8126

steve bbe6a5e909 Add dev-up script and archive kb-title-in-chunks change

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-30 07:25:22 +01:00

4.6 KiB

Context

When a document is ingested, the worker chunks its content and stores each chunk's text in the chunks table. FTS5 triggers index that text, and the embedding model embeds it. The document title is stored only in documents.title and never participates in search. As a result, short documents (or documents whose content lacks the title keywords) are invisible to queries whose terms appear only in the title.

The reindex endpoint (POST /api/v1/reindex) currently reads chunks.text and re-embeds it. Any fix must apply consistently at both ingestion and reindex time.

Goals / Non-Goals

Goals:

Document titles are searchable via both FTS5 and vector search
Section header breadcrumbs (when present in chunk metadata) are also searchable
Search results continue to return the original chunk text (no title prefix in the text field returned to clients)
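Existing documents become searchable by title after upgrading: FTS5 picks up titles through the automatic schema migration, and vector search picks them up after a kb reindex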
No schema-breaking migration — additive column only

Non-Goals:

Changing the chunking strategies themselves (note, markdown, code, docling)
Adding a separate title-search endpoint or client-side title filtering
Changing the search result JSON structure

Decisions

1. Add an `enriched_text` column to the `chunks` table

Store the title-prefixed text in a new chunks.enriched_text column alongside the existing chunks.text. The text column remains the raw chunk content (used for display in search results). The enriched_text column holds "{title}\n\n{text}", or "{title} > {section_header}\n\n{text}" when the chunk metadata carries a section header (the same format the worker builds in Decision 4).

Why not just modify chunks.text? The title would then appear in every search result's text field, which is redundant (title is already a separate field) and would confuse consumers that display results.

Why not reconstruct enriched text on-the-fly at search time? FTS5 uses an external content table and triggers — it needs a real column to index. Reconstructing via JOIN at FTS query time would defeat the purpose of the FTS index.

2. Point FTS5 at `enriched_text` instead of `text`

Update the FTS5 virtual table definition and its sync triggers to index enriched_text rather than text. This is the core change that makes titles searchable via keyword search.

Since FTS5 external content tables cannot be ALTERed, existing databases require a rebuild: drop and recreate chunks_fts and its triggers, then repopulate. This is handled as a schema migration in init_schema.
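As a rough illustration of Decisions 1 and 2 together, the sketch below defines an external-content FTS5 table over enriched_text and shows a title-only query returning the raw chunk text. The column list, the trigger names (chunks_ai, chunks_ad, chunks_au), and the search() helper are assumptions made for the example, not the project's actual schema. It also assumes the local SQLite build includes FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed minimal shape of the chunks table; the real one has more columns.
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL,
    text TEXT NOT NULL,
    enriched_text TEXT
);

-- External-content FTS5 table: it indexes enriched_text and reads it from chunks by rowid.
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    enriched_text,
    content='chunks',
    content_rowid='id'
);

-- Sync triggers (names are hypothetical) keep the index in step with chunks.
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, enriched_text) VALUES (new.id, new.enriched_text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
    INSERT INTO chunks_fts(rowid, enriched_text) VALUES (new.id, new.enriched_text);
END;
""")

raw_text = "Rotate the keys every 90 days."
conn.execute(
    "INSERT INTO chunks (id, document_id, text, enriched_text) VALUES (1, 1, ?, ?)",
    (raw_text, "Credential Policy\n\n" + raw_text),
)


def search(conn, query):
    """Match against the enriched index, but return the raw chunk text for display."""
    rows = conn.execute(
        "SELECT c.text FROM chunks_fts "
        "JOIN chunks AS c ON c.id = chunks_fts.rowid "
        "WHERE chunks_fts MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
    return [row[0] for row in rows]


title_hits = search(conn, "credential")  # term appears only in the title
body_hits = search(conn, "keys")         # term appears in the chunk body
```

Both queries find the same chunk, and in both cases the returned string is the unprefixed chunks.text, which is the behaviour the goals ask for.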

3. Embed `enriched_text` instead of `text`

At ingestion time, pass enriched_text values to embed_texts() instead of raw chunk text. At reindex time, read enriched_text from the database. This makes titles searchable via vector similarity too.
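The reindex read path could look roughly like the sketch below. The embed_texts() here is a toy stand-in (hashed bag-of-words) written only so the example runs. The real function's signature and the model behind it are not described in this document, so the list-of-strings-in, list-of-vectors-out shape is an assumption, as is the reindex() helper.

```python
import math
import sqlite3
import zlib
from typing import Dict, List

DIM = 256  # arbitrary size for the toy vectors


def embed_texts(texts: List[str]) -> List[List[float]]:
    """Toy stand-in for the real model: hashed bag-of-words, L2-normalised."""
    vectors = []
    for text in texts:
        vec = [0.0] * DIM
        for token in text.lower().split():
            vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    # Inputs are already unit length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


def reindex(conn: sqlite3.Connection) -> Dict[int, List[float]]:
    """Re-embed every chunk from enriched_text rather than from text."""
    rows = conn.execute("SELECT id, enriched_text FROM chunks ORDER BY id").fetchall()
    vectors = embed_texts([enriched for _, enriched in rows])
    return {chunk_id: vec for (chunk_id, _), vec in zip(rows, vectors)}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT, enriched_text TEXT)")
raw = "Rotate the keys every 90 days."
conn.execute("INSERT INTO chunks VALUES (1, ?, ?)", (raw, "Credential Policy\n\n" + raw))

query_vec = embed_texts(["credential policy"])[0]
sim_enriched = cosine(query_vec, reindex(conn)[1])   # vector built from enriched_text
sim_raw = cosine(query_vec, embed_texts([raw])[0])   # vector built from raw text only
```

With this toy embedder the title-only query scores higher against the enriched vector than against the raw one. A real embedding model will behave differently in detail, but the data flow (read enriched_text, embed it) is the point of the sketch.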

4. Build enriched text in the worker, not in the ingest modules

The enrichment format is: "{title}\n\n{chunk_text}" or "{title} > {section_header}\n\n{chunk_text}" when a section header exists in chunk metadata.

This happens in worker._process_job() after chunking and before embedding/insertion. The ingest modules remain unchanged — they continue to return raw chunk text and metadata.
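A minimal sketch of the enrichment step, assuming the section header lives under a "section_header" key in each chunk's metadata dict. The function name build_enriched_text and the chunk record shape are hypothetical, chosen for illustration.

```python
from typing import Optional


def build_enriched_text(title: str, chunk_text: str, section_header: Optional[str] = None) -> str:
    """Prefix raw chunk text with the document title and, if present, the section header."""
    # Treat a missing or whitespace-only header as absent.
    header = (section_header or "").strip()
    prefix = f"{title} > {header}" if header else title
    return f"{prefix}\n\n{chunk_text}"


# Hypothetical chunk records, shaped the way an ingest module might return them.
chunks = [
    {"text": "Run the installer.", "metadata": {"section_header": "Setup"}},
    {"text": "No header here.", "metadata": {}},
]

enriched = [
    build_enriched_text("Quickstart Guide", c["text"], c["metadata"].get("section_header"))
    for c in chunks
]
```

Keeping this in one small function means the worker and any later backfill or title-update path can share the same format.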

5. Schema migration adds `enriched_text` and rebuilds FTS

The init_schema function will:

Add enriched_text TEXT column to chunks if missing
Backfill enriched_text from existing data (join with documents.title and chunk metadata)
Drop and recreate chunks_fts to index enriched_text instead of text
Recreate the FTS sync triggers

This is safe because the migration only runs when the column is missing (first startup after upgrade). The backfill uses a single UPDATE...FROM query.
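The four migration steps might be sketched as follows. This is not the actual init_schema code: the migrate() name, the trigger names, the documents/chunks column layout, and the assumption that chunk metadata is a JSON text column with a "section_header" key are all illustrative. UPDATE ... FROM needs SQLite 3.33.0 or newer, and json_extract needs a build with the JSON functions enabled.

```python
import sqlite3

# New FTS table and sync triggers, all pointing at enriched_text (names are hypothetical).
FTS_DDL = [
    "CREATE VIRTUAL TABLE chunks_fts USING fts5("
    "enriched_text, content='chunks', content_rowid='id')",
    "CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN "
    "INSERT INTO chunks_fts(rowid, enriched_text) VALUES (new.id, new.enriched_text); END",
    "CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN "
    "INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text) "
    "VALUES ('delete', old.id, old.enriched_text); END",
    "CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN "
    "INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text) "
    "VALUES ('delete', old.id, old.enriched_text); "
    "INSERT INTO chunks_fts(rowid, enriched_text) VALUES (new.id, new.enriched_text); END",
]

# Single UPDATE ... FROM backfill: title, optional " > header", blank line, raw text.
BACKFILL_SQL = """
UPDATE chunks
SET enriched_text =
    d.title
    || CASE
         WHEN COALESCE(json_extract(chunks.metadata, '$.section_header'), '') <> ''
         THEN ' > ' || json_extract(chunks.metadata, '$.section_header')
         ELSE ''
       END
    || char(10) || char(10) || chunks.text
FROM documents AS d
WHERE d.id = chunks.document_id
"""


def migrate(conn: sqlite3.Connection) -> bool:
    """Add and backfill enriched_text, then rebuild FTS. Returns False if already migrated."""
    columns = {row[1] for row in conn.execute("PRAGMA table_info(chunks)")}
    if "enriched_text" in columns:
        return False
    conn.execute("ALTER TABLE chunks ADD COLUMN enriched_text TEXT")
    conn.execute(BACKFILL_SQL)
    for trigger in ("chunks_ai", "chunks_ad", "chunks_au"):
        conn.execute(f"DROP TRIGGER IF EXISTS {trigger}")
    conn.execute("DROP TABLE IF EXISTS chunks_fts")
    for ddl in FTS_DDL:
        conn.execute(ddl)
    # Repopulate the new index from the content table.
    conn.execute("INSERT INTO chunks_fts(chunks_fts) VALUES ('rebuild')")
    conn.commit()
    return True


# Pre-upgrade database (assumed shape) with one document and two chunks.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id),
    text TEXT NOT NULL,
    metadata TEXT
);
CREATE VIRTUAL TABLE chunks_fts USING fts5(text, content='chunks', content_rowid='id');
INSERT INTO documents VALUES (1, 'Credential Policy');
INSERT INTO chunks VALUES (1, 1, 'Rotate the keys every 90 days.', '{"section_header": "Rotation"}');
INSERT INTO chunks VALUES (2, 1, 'Report lost tokens at once.', NULL);
""")

first_run = migrate(conn)
second_run = migrate(conn)  # no-op: the column already exists
enriched = [r[0] for r in conn.execute("SELECT enriched_text FROM chunks ORDER BY id")]
title_hits = [
    r[0]
    for r in conn.execute(
        "SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH 'credential' ORDER BY rowid"
    )
]
```

The column-existence check is what makes a second call harmless, matching the "only runs when the column is missing" guarantee described above.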

Risks / Trade-offs

Slightly larger database: every chunk's enriched_text repeats the document title, which is already stored once in documents.title. For a typical KB with short titles this is negligible (< 1% size increase). → Acceptable for the search quality improvement.

FTS rebuild on upgrade — First startup after upgrade will rebuild the FTS index, which takes a few seconds for large KBs. → This is a one-time cost and happens automatically.

Embedding drift — Existing vector embeddings won't include title context until kb reindex is run. The FTS backfill happens automatically, but vectors require an explicit reindex. → Document this in release notes. The FTS improvement alone is a significant win even without reindexing vectors.

Title changes not propagated — If a document's title were ever updated, enriched_text would be stale. Currently the engine has no title-update endpoint, so this is not a concern. → No mitigation needed now. If title editing is added later, it should update enriched_text.
