kb/openspec/changes/archive/2026-03-29-kb-title-in-chunks/design.md
2026-03-30 07:25:22 +01:00


## Context
When a document is ingested, the worker chunks its content and stores each chunk's text in the `chunks` table. FTS5 triggers index that text, and the embedding model embeds it. The document title is stored only in `documents.title` and never participates in search. As a result, short documents (or documents whose content lacks the title keywords) are invisible to queries that match only the title.
The reindex endpoint (`POST /api/v1/reindex`) currently reads `chunks.text` and re-embeds it. Any fix must apply consistently at both ingestion and reindex time.
## Goals / Non-Goals
**Goals:**
- Document titles are searchable via both FTS5 and vector search
- Section header breadcrumbs (when present in chunk metadata) are also searchable
- Search results continue to return the original chunk text (no title prefix in the `text` field returned to clients)
- Existing documents become searchable by title after a `kb reindex`
- No schema-breaking migration: additive column only

**Non-Goals:**
- Changing the chunking strategies themselves (note, markdown, code, docling)
- Adding a separate title-search endpoint or client-side title filtering
- Changing the search result JSON structure
## Decisions
### 1. Add an `enriched_text` column to the `chunks` table
Store the title-prefixed text in a new `chunks.enriched_text` column alongside the existing `chunks.text`. The `text` column remains the raw chunk content (used for display in search results). The `enriched_text` column holds `"{title}\n\n{text}"`, or `"{title} > {section_header}\n\n{text}"` when the chunk metadata carries a section header.
**Why not just modify `chunks.text`?** The title would then appear in every search result's text field, which is redundant (title is already a separate field) and would confuse consumers that display results.
**Why not reconstruct enriched text on-the-fly at search time?** FTS5 uses an external content table and triggers — it needs a real column to index. Reconstructing via JOIN at FTS query time would defeat the purpose of the FTS index.
### 2. Point FTS5 at `enriched_text` instead of `text`
Update the FTS5 virtual table definition and its sync triggers to index `enriched_text` rather than `text`. This is the core change that makes titles searchable via keyword search.
Since FTS5 external content tables cannot be ALTERed, existing databases require a rebuild: drop and recreate `chunks_fts` and its triggers, then repopulate. This is handled as a schema migration in `init_schema`.
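A minimal sketch of the rebuild, assuming `chunks` has an integer primary key `id`. The trigger names (`chunks_ai`, `chunks_ad`, `chunks_au`) and the `rebuild_fts` and `search` helpers are illustrative and may not match the actual schema code.

```python
import sqlite3

# Hypothetical names: the real trigger names and key column may differ.
FTS_SCHEMA = """
DROP TRIGGER IF EXISTS chunks_ai;
DROP TRIGGER IF EXISTS chunks_ad;
DROP TRIGGER IF EXISTS chunks_au;
DROP TABLE IF EXISTS chunks_fts;

-- External content table: the index reads enriched_text from chunks.
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    enriched_text,
    content='chunks',
    content_rowid='id'
);

CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
"""


def rebuild_fts(conn: sqlite3.Connection) -> None:
    """Drop and recreate chunks_fts and its triggers, then repopulate the index."""
    conn.executescript(FTS_SCHEMA)
    # 'rebuild' re-reads every row of the content table into the index.
    conn.execute("INSERT INTO chunks_fts(chunks_fts) VALUES ('rebuild')")
    conn.commit()


def search(conn: sqlite3.Connection, query: str) -> list:
    """Match against enriched_text, but return the raw chunk text."""
    rows = conn.execute(
        "SELECT c.text FROM chunks_fts "
        "JOIN chunks AS c ON c.id = chunks_fts.rowid "
        "WHERE chunks_fts MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
    return [row[0] for row in rows]
```

With this layout a query for a title keyword matches the chunk, while the value returned to the caller is still the unprefixed `text`.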
### 3. Embed `enriched_text` instead of `text`
At ingestion time, pass `enriched_text` values to `embed_texts()` instead of raw chunk text. At reindex time, read `enriched_text` from the database. This makes titles searchable via vector similarity too.
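As a rough illustration of the reindex side, the sketch below reads `enriched_text` and hands it to an embedder. The `reindex_vectors` helper, the `embed_texts` callable signature (list of strings in, list of vectors out), and the fallback to `text` for rows that have not been backfilled are all assumptions, not the engine's actual API.

```python
import sqlite3
from typing import Callable, Dict, List, Sequence

# Assumed shape of the embedder: a batch of strings in, one vector per string out.
Embedder = Callable[[Sequence[str]], List[List[float]]]


def reindex_vectors(conn: sqlite3.Connection, embed_texts: Embedder) -> Dict[int, List[float]]:
    """Re-embed every chunk from its enriched text instead of its raw text."""
    rows = conn.execute(
        # Assumption: fall back to raw text if enriched_text is still NULL.
        "SELECT id, COALESCE(enriched_text, text) FROM chunks ORDER BY id"
    ).fetchall()
    vectors = embed_texts([enriched for _, enriched in rows])
    return {chunk_id: vector for (chunk_id, _), vector in zip(rows, vectors)}
```

Ingestion would follow the same pattern: the enriched strings go to the embedder, and the raw strings go into `chunks.text`.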
### 4. Build enriched text in the worker, not in the ingest modules
The enrichment format is: `"{title}\n\n{chunk_text}"` or `"{title} > {section_header}\n\n{chunk_text}"` when a section header exists in chunk metadata.
This happens in `worker._process_job()` after chunking and before embedding/insertion. The ingest modules remain unchanged — they continue to return raw chunk text and metadata.
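The format above can be sketched as a small pure function. The name `build_enriched_text` and its signature are illustrative; the worker may structure this differently.

```python
from typing import Optional


def build_enriched_text(
    title: str, chunk_text: str, section_header: Optional[str] = None
) -> str:
    """Prefix a chunk with its document title and, if present, its section header."""
    # "{title} > {section_header}" when a header exists, otherwise just the title.
    prefix = f"{title} > {section_header}" if section_header else title
    return f"{prefix}\n\n{chunk_text}"
```

Keeping this as a single function means ingestion and any later backfill logic can share one definition of the format.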
### 5. Schema migration adds `enriched_text` and rebuilds FTS
The `init_schema` function will:
1. Add `enriched_text TEXT` column to `chunks` if missing
2. Backfill `enriched_text` from existing data (join with `documents.title` and chunk metadata)
3. Drop and recreate `chunks_fts` to index `enriched_text` instead of `text`
4. Recreate the FTS sync triggers

This is safe because the migration runs only when the column is missing (first startup after upgrade). The backfill uses a single `UPDATE ... FROM` query, which SQLite supports from version 3.33.0.
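Steps 1 and 2 might look roughly like the following. This is a sketch under several assumptions: `chunks.document_id` references `documents.id`, chunk metadata lives in a JSON `metadata` column with a `section_header` key, and the helper names are hypothetical. It also relies on SQLite's JSON functions being available.

```python
import sqlite3

# Assumed column names: chunks.document_id, chunks.metadata (JSON or NULL).
BACKFILL_SQL = """
UPDATE chunks
SET enriched_text = CASE
    WHEN json_extract(chunks.metadata, '$.section_header') IS NOT NULL
    THEN d.title || ' > ' || json_extract(chunks.metadata, '$.section_header')
         || char(10) || char(10) || chunks.text
    ELSE d.title || char(10) || char(10) || chunks.text
END
FROM documents AS d
WHERE d.id = chunks.document_id
"""


def column_exists(conn: sqlite3.Connection, table: str, column: str) -> bool:
    """Check PRAGMA table_info for a column; row[1] is the column name."""
    return any(row[1] == column for row in conn.execute(f"PRAGMA table_info({table})"))


def migrate_enriched_text(conn: sqlite3.Connection) -> bool:
    """Add and backfill chunks.enriched_text. Returns True if the migration ran."""
    if column_exists(conn, "chunks", "enriched_text"):
        return False  # already migrated, nothing to do
    conn.execute("ALTER TABLE chunks ADD COLUMN enriched_text TEXT")
    conn.execute(BACKFILL_SQL)
    conn.commit()
    # Steps 3 and 4 (recreating chunks_fts and its triggers) would follow here.
    return True
```

Guarding on the column's presence is what makes the migration idempotent: a second startup finds the column and skips the backfill.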
## Risks / Trade-offs
**Slightly larger database**: Every chunk repeats the document title inside `enriched_text`, in addition to the single copy in `documents.title` that the chunk already reaches through its document FK. For a typical KB with short titles this is negligible (< 1% size increase).
→ Acceptable for the search quality improvement.
**FTS rebuild on upgrade** — First startup after upgrade will rebuild the FTS index, which takes a few seconds for large KBs.
→ This is a one-time cost and happens automatically.
**Embedding drift** — Existing vector embeddings won't include title context until `kb reindex` is run. The FTS backfill happens automatically, but vectors require an explicit reindex.
→ Document this in release notes. The FTS improvement alone is a significant win even without reindexing vectors.
**Title changes not propagated** — If a document's title were ever updated, `enriched_text` would be stale. Currently the engine has no title-update endpoint, so this is not a concern.
→ No mitigation needed now. If title editing is added later, it should regenerate `enriched_text` for the document's chunks and re-embed them.