bbe6a5e909
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context

When a document is ingested, the worker chunks its content and stores each chunk's text in the `chunks` table. FTS5 triggers index that text, and the embedding model embeds it. The document title is stored only in `documents.title` and never participates in search. As a result, short documents, or documents whose content lacks the title's keywords, are invisible to queries that match only the title.

The reindex endpoint (`POST /api/v1/reindex`) currently reads `chunks.text` and re-embeds it. Any fix must apply consistently at both ingestion and reindex time.
## Goals / Non-Goals

**Goals:**
- Document titles are searchable via both FTS5 and vector search
- Section header breadcrumbs (when present in chunk metadata) are also searchable
- Search results continue to return the original chunk text (no title prefix in the `text` field returned to clients)
- Existing documents become searchable by title after a `kb reindex`
- No schema-breaking migration: additive column only

**Non-Goals:**
- Changing the chunking strategies themselves (note, markdown, code, docling)
- Adding a separate title-search endpoint or client-side title filtering
- Changing the search result JSON structure
## Decisions

### 1. Add an `enriched_text` column to the `chunks` table

Store the title-prefixed text in a new `chunks.enriched_text` column alongside the existing `chunks.text`. The `text` column remains the raw chunk content and is what search results display. The `enriched_text` column holds `"{title}\n\n{text}"`, or `"{title} > {section_header}\n\n{text}"` when the chunk metadata includes a section header (the format defined in Decision 4).

**Why not just modify `chunks.text`?** The title would then appear in every search result's `text` field, which is redundant (the title is already a separate field) and would confuse consumers that display results.

**Why not reconstruct enriched text on-the-fly at search time?** FTS5 uses an external content table kept in sync by triggers, so it needs a real column to index. Reconstructing the text via a JOIN at query time would defeat the purpose of the FTS index.
### 2. Point FTS5 at `enriched_text` instead of `text`

Update the FTS5 virtual table definition and its sync triggers to index `enriched_text` rather than `text`. This is the core change that makes titles searchable via keyword search.

Because the columns of an FTS5 virtual table cannot be changed with `ALTER TABLE`, existing databases require a rebuild: drop and recreate `chunks_fts` and its triggers, then repopulate the index. This is handled as a schema migration in `init_schema`.
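
The rebuild described above can be sketched roughly as follows, using Python's standard `sqlite3` module. This is an illustration rather than the project's actual schema: the trigger names (`chunks_ai`, `chunks_ad`, `chunks_au`), the `id` primary key, and the helper names `rebuild_chunks_fts` and `search_text` are assumptions. The trigger bodies follow the external-content pattern from the SQLite FTS5 documentation.

```python
import sqlite3


def rebuild_chunks_fts(conn: sqlite3.Connection) -> None:
    """Drop and recreate the FTS index so that it covers chunks.enriched_text.

    Trigger names and the `id` rowid column are assumptions for this sketch.
    """
    conn.executescript(
        """
        DROP TRIGGER IF EXISTS chunks_ai;
        DROP TRIGGER IF EXISTS chunks_ad;
        DROP TRIGGER IF EXISTS chunks_au;
        DROP TABLE IF EXISTS chunks_fts;

        -- External-content table: the index reads its text from chunks.
        CREATE VIRTUAL TABLE chunks_fts USING fts5(
            enriched_text,
            content='chunks',
            content_rowid='id'
        );

        -- Keep the index in sync with inserts, deletes, and updates.
        CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
            INSERT INTO chunks_fts(rowid, enriched_text)
            VALUES (new.id, new.enriched_text);
        END;
        CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
            INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
            VALUES ('delete', old.id, old.enriched_text);
        END;
        CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
            INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
            VALUES ('delete', old.id, old.enriched_text);
            INSERT INTO chunks_fts(rowid, enriched_text)
            VALUES (new.id, new.enriched_text);
        END;

        -- Repopulate the index from rows already present in chunks.
        INSERT INTO chunks_fts(chunks_fts) VALUES ('rebuild');
        """
    )


def search_text(conn: sqlite3.Connection, query: str) -> list:
    """Match against enriched_text but return the raw chunk text."""
    rows = conn.execute(
        "SELECT text FROM chunks WHERE id IN "
        "(SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH ?) "
        "ORDER BY id",
        (query,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT, enriched_text TEXT)"
    )
    conn.execute(
        "INSERT INTO chunks (text, enriched_text) VALUES (?, ?)",
        ("Run the installer.", "Deployment Guide\n\nRun the installer."),
    )
    rebuild_chunks_fts(conn)
    print(search_text(conn, "deployment"))
```

Note that `search_text` matches on the indexed `enriched_text` while selecting `chunks.text`, which is how a title-only query can hit a chunk without the title prefix leaking into the result.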
### 3. Embed `enriched_text` instead of `text`

At ingestion time, pass `enriched_text` values to `embed_texts()` instead of raw chunk text. At reindex time, read `enriched_text` from the database. This makes titles searchable via vector similarity too.
### 4. Build enriched text in the worker, not in the ingest modules

The enrichment format is `"{title}\n\n{chunk_text}"`, or `"{title} > {section_header}\n\n{chunk_text}"` when a section header exists in chunk metadata.

Enrichment happens in `worker._process_job()`, after chunking and before embedding and insertion. The ingest modules remain unchanged: they continue to return raw chunk text and metadata.
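
A minimal sketch of the enrichment step, assuming chunks arrive as dictionaries with a `text` field and a `metadata` mapping that may carry a `section_header` key. The helper name `build_enriched_text` and the chunk shape are hypothetical; only the output format comes from this document.

```python
from typing import Optional


def build_enriched_text(
    title: str, chunk_text: str, section_header: Optional[str] = None
) -> str:
    """Prefix a chunk with its document title and optional section header.

    Hypothetical helper: the name and signature are assumptions for this sketch.
    """
    # Keep only the non-empty breadcrumb parts, then join them with " > ".
    parts = [p.strip() for p in (title, section_header or "") if p and p.strip()]
    if not parts:
        # Nothing to prepend, so the enriched text equals the raw text.
        return chunk_text
    return " > ".join(parts) + "\n\n" + chunk_text


if __name__ == "__main__":
    # Assumed chunk shape: raw text plus a metadata dict from the ingest module.
    chunks = [
        {"text": "Run the installer.", "metadata": {"section_header": "Setup"}},
        {"text": "Contact support.", "metadata": {}},
    ]
    for chunk in chunks:
        enriched = build_enriched_text(
            "Deployment Guide",
            chunk["text"],
            chunk["metadata"].get("section_header"),
        )
        print(repr(enriched))
```

Because the helper takes the raw text and returns a new string, the worker can store both values side by side: `text` for display and `enriched_text` for indexing and embedding.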
### 5. Schema migration adds `enriched_text` and rebuilds FTS

The `init_schema` function will:
1. Add an `enriched_text TEXT` column to `chunks` if it is missing
2. Backfill `enriched_text` from existing data (join with `documents.title` and chunk metadata)
3. Drop and recreate `chunks_fts` to index `enriched_text` instead of `text`
4. Recreate the FTS sync triggers

The migration is safe to leave in the startup path because it runs only when the column is missing, that is, on the first startup after upgrade. The backfill uses a single `UPDATE ... FROM` query.
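
Steps 1 and 2 above might look roughly like the sketch below. Several things here are assumptions: the function name `migrate_enriched_text`, the `chunks.document_id` foreign key, and the `documents.id` primary key. The sketch also swaps the `UPDATE ... FROM` form for an equivalent correlated subquery, since `UPDATE ... FROM` requires SQLite 3.33.0 or newer. For brevity it backfills only the title-only format and omits the section-header case.

```python
import sqlite3


def column_exists(conn: sqlite3.Connection, table: str, column: str) -> bool:
    """Return True if `table` already has a column named `column`."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    return any(row[1] == column for row in conn.execute(f"PRAGMA table_info({table})"))


def migrate_enriched_text(conn: sqlite3.Connection) -> bool:
    """Add and backfill chunks.enriched_text; return True if the migration ran.

    Assumes chunks.document_id references documents.id (not confirmed here).
    """
    if column_exists(conn, "chunks", "enriched_text"):
        return False  # Already migrated, so later startups do nothing.

    conn.execute("ALTER TABLE chunks ADD COLUMN enriched_text TEXT")
    # Correlated subquery instead of UPDATE ... FROM; char(10) is a newline.
    # COALESCE falls back to the raw text when a document has no title.
    conn.execute(
        """
        UPDATE chunks
        SET enriched_text = COALESCE(
            (SELECT d.title || char(10) || char(10)
             FROM documents d
             WHERE d.id = chunks.document_id),
            ''
        ) || text
        """
    )
    conn.commit()
    return True
```

Guarding on the column's presence is what makes repeated calls harmless: the second call finds `enriched_text` and returns without touching the data.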
## Risks / Trade-offs

**Slightly larger database:** The title is repeated in every chunk's `enriched_text`, in addition to being stored once in `documents.title`. For a typical KB with short titles this is negligible (< 1% size increase).
→ Acceptable for the search quality improvement.

**FTS rebuild on upgrade:** The first startup after upgrade rebuilds the FTS index, which takes a few seconds for large KBs.
→ This is a one-time cost and happens automatically.

**Embedding drift:** Existing vector embeddings won't include title context until `kb reindex` is run. The FTS backfill happens automatically, but vectors require an explicit reindex.
→ Document this in release notes. The FTS improvement alone is a significant win even without reindexing vectors.

**Title changes not propagated:** If a document's title were ever updated, `enriched_text` would become stale. The engine currently has no title-update endpoint, so this is not a concern today.
→ No mitigation needed now. If title editing is added later, it should also update `enriched_text`.