From bbe6a5e909be990d201e22e70e7589b3868b4ee1 Mon Sep 17 00:00:00 2001
From: Steve Cliff
Date: Mon, 30 Mar 2026 07:25:22 +0100
Subject: [PATCH] Add dev-up script and archive kb-title-in-chunks change

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 engine/dev-up                                 |  4 +
 .../.openspec.yaml                            |  2 +
 .../2026-03-29-kb-title-in-chunks/design.md   | 69 +++++++++++++++++
 .../2026-03-29-kb-title-in-chunks/proposal.md | 28 +++++++
 .../specs/chunk-enrichment/spec.md            | 75 +++++++++++++++++++
 .../specs/engine-api/spec.md                  | 31 ++++++++
 .../2026-03-29-kb-title-in-chunks/tasks.md    | 33 ++++++++
 7 files changed, 242 insertions(+)
 create mode 100755 engine/dev-up
 create mode 100644 openspec/changes/archive/2026-03-29-kb-title-in-chunks/.openspec.yaml
 create mode 100644 openspec/changes/archive/2026-03-29-kb-title-in-chunks/design.md
 create mode 100644 openspec/changes/archive/2026-03-29-kb-title-in-chunks/proposal.md
 create mode 100644 openspec/changes/archive/2026-03-29-kb-title-in-chunks/specs/chunk-enrichment/spec.md
 create mode 100644 openspec/changes/archive/2026-03-29-kb-title-in-chunks/specs/engine-api/spec.md
 create mode 100644 openspec/changes/archive/2026-03-29-kb-title-in-chunks/tasks.md

diff --git a/engine/dev-up b/engine/dev-up
new file mode 100755
index 0000000..033a4dc
--- /dev/null
+++ b/engine/dev-up
@@ -0,0 +1,4 @@
+#!/bin/bash
+
+docker stop engine-kb-engine-1
+KB_MODEL=BAAI/bge-base-en-v1.5 KB_DATA_PATH=~/kb-data docker compose -f compose.nvidia.yaml up -d --build
diff --git a/openspec/changes/archive/2026-03-29-kb-title-in-chunks/.openspec.yaml b/openspec/changes/archive/2026-03-29-kb-title-in-chunks/.openspec.yaml
new file mode 100644
index 0000000..5e98b74
--- /dev/null
+++ b/openspec/changes/archive/2026-03-29-kb-title-in-chunks/.openspec.yaml
@@ -0,0 +1,2 @@
+schema: spec-driven
+created: 2026-03-29
diff --git a/openspec/changes/archive/2026-03-29-kb-title-in-chunks/design.md b/openspec/changes/archive/2026-03-29-kb-title-in-chunks/design.md
new file mode 100644
index 0000000..38c5972
--- /dev/null
+++ b/openspec/changes/archive/2026-03-29-kb-title-in-chunks/design.md
@@ -0,0 +1,69 @@
+## Context
+
+When a document is ingested, the worker chunks its content and stores each chunk's text in the `chunks` table. FTS5 triggers index that text, and the embedding model embeds it. The document title is stored only in `documents.title` — it never participates in search. This means short documents (or documents whose content lacks the title keywords) are invisible to queries that match the title.
+
+The reindex endpoint (`POST /api/v1/reindex`) currently reads `chunks.text` and re-embeds it. Any fix must apply consistently at both ingestion and reindex time.
+
+## Goals / Non-Goals
+
+**Goals:**
+- Document titles are searchable via both FTS5 and vector search
+- Section header breadcrumbs (when present in chunk metadata) are also searchable
+- Search results continue to return the original chunk text (no title prefix in the `text` field returned to clients)
+- Existing documents become searchable by title after a `kb reindex`
+- No schema-breaking migration — additive column only
+
+**Non-Goals:**
+- Changing the chunking strategies themselves (note, markdown, code, docling)
+- Adding a separate title-search endpoint or client-side title filtering
+- Changing the search result JSON structure
+
+## Decisions
+
+### 1. Add an `enriched_text` column to the `chunks` table
+
+Store the title-prefixed text in a new `chunks.enriched_text` column alongside the existing `chunks.text`. The `text` column remains the raw chunk content (used for display in search results). The `enriched_text` column holds `"{title} > {section_header}\n\n{text}"` (with ` > {section_header}` omitted when absent).
+
+**Why not just modify `chunks.text`?** The title would then appear in every search result's text field, which is redundant (title is already a separate field) and would confuse consumers that display results.
+
+**Why not reconstruct enriched text on-the-fly at search time?** FTS5 uses an external content table and triggers — it needs a real column to index. Reconstructing via JOIN at FTS query time would defeat the purpose of the FTS index.
+
+### 2. Point FTS5 at `enriched_text` instead of `text`
+
+Update the FTS5 virtual table definition and its sync triggers to index `enriched_text` rather than `text`. This is the core change that makes titles searchable via keyword search.
+
+Since FTS5 external content tables cannot be ALTERed, existing databases require a rebuild: drop and recreate `chunks_fts` and its triggers, then repopulate. This is handled as a schema migration in `init_schema`.
+
+### 3. Embed `enriched_text` instead of `text`
+
+At ingestion time, pass `enriched_text` values to `embed_texts()` instead of raw chunk text. At reindex time, read `enriched_text` from the database. This makes titles searchable via vector similarity too.
+
+### 4. Build enriched text in the worker, not in the ingest modules
+
+The enrichment format is: `"{title}\n\n{chunk_text}"` or `"{title} > {section_header}\n\n{chunk_text}"` when a section header exists in chunk metadata.
+
+This happens in `worker._process_job()` after chunking and before embedding/insertion. The ingest modules remain unchanged — they continue to return raw chunk text and metadata.
+
+### 5. Schema migration adds `enriched_text` and rebuilds FTS
+
+The `init_schema` function will:
+1. Add `enriched_text TEXT` column to `chunks` if missing
+2. Backfill `enriched_text` from existing data (join with `documents.title` and chunk metadata)
+3. Drop and recreate `chunks_fts` to index `enriched_text` instead of `text`
+4. Recreate the FTS sync triggers
+
+This is safe because the migration only runs when the column is missing (first startup after upgrade). The backfill uses a single UPDATE...FROM query.
+
+## Risks / Trade-offs
+
+**Slightly larger database** — Each chunk stores the title string twice (once in `enriched_text`, once via the document FK). For a typical KB with short titles this is negligible (< 1% size increase).
+→ Acceptable for the search quality improvement.
+
+**FTS rebuild on upgrade** — First startup after upgrade will rebuild the FTS index, which takes a few seconds for large KBs.
+→ This is a one-time cost and happens automatically.
+
+**Embedding drift** — Existing vector embeddings won't include title context until `kb reindex` is run. The FTS backfill happens automatically, but vectors require an explicit reindex.
+→ Document this in release notes. The FTS improvement alone is a significant win even without reindexing vectors.
+
+**Title changes not propagated** — If a document's title were ever updated, `enriched_text` would be stale. Currently the engine has no title-update endpoint, so this is not a concern.
+→ No mitigation needed now. If title editing is added later, it should update enriched_text.
diff --git a/openspec/changes/archive/2026-03-29-kb-title-in-chunks/proposal.md b/openspec/changes/archive/2026-03-29-kb-title-in-chunks/proposal.md
new file mode 100644
index 0000000..a77ea12
--- /dev/null
+++ b/openspec/changes/archive/2026-03-29-kb-title-in-chunks/proposal.md
@@ -0,0 +1,28 @@
+## Why
+
+Short documents and notes are unsearchable when the user's query matches the document title but not the chunk content. For example, a document titled "Suitcase Locks" containing only "Steve = 1234 / Theresa = 4567" is invisible to both FTS and vector search for the query "suitcase locks". This is because chunk text — the only thing indexed and embedded — does not include the document title. This is a standard RAG deficiency that most pipelines solve by prepending title context to each chunk.
+
+## What Changes
+
+- **Prepend document title to chunk text at ingestion time**: Before embedding and FTS indexing, each chunk's text will be prefixed with the document title (e.g., `"Suitcase Locks\n\nSteve = 363..."`). This ensures the title participates in both full-text and semantic search.
+- **Include section header context in chunk text**: For chunks that have a `section_header` in their metadata, prepend the header breadcrumb too (e.g., `"DCG Lab Hardware > GRIMDAWN > motherboard\n\nMSI X870 Tomahawk..."`). This improves search for queries that reference section names.
+- **Store the raw chunk text separately from the enriched text**: The original chunk text (without title prefix) must remain accessible so that search results don't display the prepended title redundantly — the title is already returned as a separate field.
+- **Reindex command must apply the same enrichment**: When `kb reindex` re-embeds all chunks, it must reconstruct the enriched text (title + section header + chunk text) from stored metadata.
+
+## Capabilities
+
+### New Capabilities
+- `chunk-enrichment`: Prepending document title and section context to chunk text before indexing and embedding, while preserving the original text for display.
+
+### Modified Capabilities
+- `engine-api`: The search endpoint's returned `text` field must continue to show the original chunk text (without the prepended title), so no visible API change, but the internal indexing behaviour changes. The reindex endpoint must apply enrichment consistently.
+
+## Impact
+
+- **Engine ingestion pipeline** (`worker.py`): The `_process_job` function must build enriched text from title + section headers + chunk text before passing to `embed_texts()` and `insert_chunk()`.
+- **Database schema** (`database.py`): Need to store both raw `text` (for display) and enriched `text` (for FTS/embedding), or reconstruct enriched text at index time. Simplest approach: store raw text in `chunks.text`, use enriched text only for FTS content and embedding vectors.
+- **FTS triggers** (`database.py`): The FTS5 external content table currently mirrors `chunks.text`. If we add an `enriched_text` column, the FTS index should be built from that instead.
+- **Reindex flow** (`worker.py` / `database.py`): Must reconstruct enriched text by joining chunk metadata with document title.
+- **Search result enrichment** (`routes/search.py`): No change needed — results already return `chunks.text` (raw) and `documents.title` separately.
+- **All four ingest modules** (`note.py`, `markdown.py`, `code.py`, `docling_pipeline.py`): No changes needed — enrichment happens after chunking, in the worker.
+- **Existing documents**: Require a `reindex` to benefit from the new enrichment. No data migration needed since the original text is preserved.
diff --git a/openspec/changes/archive/2026-03-29-kb-title-in-chunks/specs/chunk-enrichment/spec.md b/openspec/changes/archive/2026-03-29-kb-title-in-chunks/specs/chunk-enrichment/spec.md
new file mode 100644
index 0000000..de5fea7
--- /dev/null
+++ b/openspec/changes/archive/2026-03-29-kb-title-in-chunks/specs/chunk-enrichment/spec.md
@@ -0,0 +1,75 @@
+## ADDED Requirements
+
+### Requirement: Chunk text enrichment with document title
+
+The engine SHALL prepend the document title to each chunk's text before FTS indexing and vector embedding. The enriched text SHALL be stored in a dedicated `enriched_text` column on the `chunks` table. The original chunk text SHALL remain in the `text` column for display purposes.
+
+The enrichment format SHALL be:
+- Without section header: `"{title}\n\n{chunk_text}"`
+- With section header: `"{title} > {section_header}\n\n{chunk_text}"`
+
+Where `section_header` is the value from the chunk's metadata `section_header` field, when present.
+
+#### Scenario: Note ingestion with title enrichment
+- **WHEN** a note titled "Suitcase Locks" with content "Steve = 363" is ingested
+- **THEN** the `chunks.text` column SHALL contain "Steve = 363" and the `chunks.enriched_text` column SHALL contain "Suitcase Locks\n\nSteve = 363"
+
+#### Scenario: Markdown chunk with section header enrichment
+- **WHEN** a markdown document titled "DCG Lab Hardware" produces a chunk with section_header "GRIMDAWN > motherboard" and text "MSI X870 Tomahawk"
+- **THEN** the `chunks.enriched_text` SHALL contain "DCG Lab Hardware > GRIMDAWN > motherboard\n\nMSI X870 Tomahawk"
+
+#### Scenario: Chunk without section header
+- **WHEN** a document titled "Docker Tips" produces a chunk with no section_header in metadata and text "dbash() { docker exec -it $1 bash; }"
+- **THEN** the `chunks.enriched_text` SHALL contain "Docker Tips\n\ndbash() { docker exec -it $1 bash; }"
+
+---
+
+### Requirement: FTS5 indexes enriched text
+
+The FTS5 virtual table `chunks_fts` SHALL index the `enriched_text` column instead of the `text` column. All FTS sync triggers (insert, update, delete) SHALL operate on `enriched_text`.
+
+#### Scenario: FTS search matches document title
+- **WHEN** a user searches for "suitcase locks" and a document titled "Suitcase Locks" exists with chunk text "Steve = 363"
+- **THEN** the FTS5 search SHALL return that chunk as a match
+
+#### Scenario: FTS search still matches chunk content
+- **WHEN** a user searches for "MSI X870" and a chunk contains that text in its body
+- **THEN** the FTS5 search SHALL return that chunk as a match (enrichment does not break content matching)
+
+---
+
+### Requirement: Vector embeddings use enriched text
+
+The embedding model SHALL receive `enriched_text` (not raw `text`) when generating vectors during both initial ingestion and reindex operations.
+
+#### Scenario: Vector search matches document title
+- **WHEN** a user searches semantically for "luggage combination codes" and a document titled "Suitcase Locks" exists
+- **THEN** the vector search SHALL return that chunk with higher similarity than it would without title enrichment
+
+#### Scenario: Reindex uses enriched text
+- **WHEN** `POST /api/v1/reindex` is called
+- **THEN** the engine SHALL read `enriched_text` from the chunks table and embed that (not `text`)
+
+---
+
+### Requirement: Schema migration adds enriched_text column
+
+On startup, `init_schema` SHALL add the `enriched_text` column to the `chunks` table if it does not exist. It SHALL then backfill `enriched_text` for all existing chunks by joining with `documents.title` and parsing chunk metadata for section headers. It SHALL rebuild the FTS5 table and triggers to index `enriched_text`.
+
+#### Scenario: First startup after upgrade
+- **WHEN** the engine starts and `chunks.enriched_text` column does not exist
+- **THEN** the engine SHALL add the column, backfill all rows, drop and recreate `chunks_fts` to index `enriched_text`, and recreate the FTS sync triggers
+
+#### Scenario: Subsequent startup
+- **WHEN** the engine starts and `chunks.enriched_text` column already exists
+- **THEN** the engine SHALL not perform any migration and start normally
+
+---
+
+### Requirement: Search results return raw text
+
+Search results SHALL continue to return the original chunk text (from `chunks.text`) in the `text` field, not the enriched text. The document title is already returned as a separate `title` field.
+
+#### Scenario: Search result text field
+- **WHEN** a search returns a chunk from document "Suitcase Locks" with raw text "Steve = 363"
+- **THEN** the result `text` field SHALL be "Steve = 363" (not "Suitcase Locks\n\nSteve = 363")
diff --git a/openspec/changes/archive/2026-03-29-kb-title-in-chunks/specs/engine-api/spec.md b/openspec/changes/archive/2026-03-29-kb-title-in-chunks/specs/engine-api/spec.md
new file mode 100644
index 0000000..5b86213
--- /dev/null
+++ b/openspec/changes/archive/2026-03-29-kb-title-in-chunks/specs/engine-api/spec.md
@@ -0,0 +1,31 @@
+## MODIFIED Requirements
+
+### Requirement: Background ingestion worker
+
+The engine SHALL run a background worker that processes queued jobs. The worker SHALL process one job at a time. For each job, it SHALL: detect document type, run the appropriate chunking pipeline (Docling for PDFs, header-based for Markdown, AST-based for code, whole-text for notes), build enriched text by prepending the document title (and section header when present) to each chunk's text, generate embeddings using the enriched text and the resident model, insert chunks (with both raw text and enriched text) and vectors into the database, and move the original file to persistent storage.
+
+#### Scenario: Successful PDF ingestion
+- **WHEN** the background worker picks up a queued PDF job
+- **THEN** it SHALL update the job status to `processing`, run Docling conversion and chunking, build enriched text for each chunk by prepending the document title, embed all chunks using enriched text, insert document and chunks into the database, move the staged file to `{data_dir}/documents/{content_hash}.pdf`, update `documents.stored_path` with the permanent path, store the original filename in `documents.original_filename`, update the job status to `done` with the resulting document_id and chunk count, and clean up the staging entry
+
+#### Scenario: Ingestion failure
+- **WHEN** the background worker encounters an error during processing (e.g., corrupt PDF)
+- **THEN** it SHALL update the job status to `failed` with the error message, delete the staged file, and continue processing the next queued job
+
+#### Scenario: Search during active ingestion
+- **WHEN** a search request arrives while the background worker is processing a job
+- **THEN** the search SHALL execute without blocking (SQLite WAL mode) and return results from already-ingested documents
+
+---
+
+### Requirement: Engine status and reindex
+
+The engine SHALL provide status information and support re-embedding all chunks. The `version` field in the status response SHALL always be present and SHALL reflect the engine's release version as read from the `VERSION` file. This field is the contract used by clients for compatibility checking.
+
+#### Scenario: Get engine status
+- **WHEN** a client sends `GET /api/v1/status`
+- **THEN** the engine SHALL return JSON with `version` (string, from VERSION file), model_name, embedding_dim, GPU device info, database stats (document count by type, total chunks, DB size), and queue stats (queued/processing job count)
+
+#### Scenario: Trigger reindex
+- **WHEN** a client sends `POST /api/v1/reindex`
+- **THEN** the engine SHALL re-embed all existing chunks using the `enriched_text` column and the currently loaded model, and return progress information. This operation SHALL NOT block search queries.
diff --git a/openspec/changes/archive/2026-03-29-kb-title-in-chunks/tasks.md b/openspec/changes/archive/2026-03-29-kb-title-in-chunks/tasks.md
new file mode 100644
index 0000000..163d5b5
--- /dev/null
+++ b/openspec/changes/archive/2026-03-29-kb-title-in-chunks/tasks.md
@@ -0,0 +1,33 @@
+## 1. Schema Migration
+
+- [x] 1.1 Add `enriched_text TEXT` column to `chunks` table in `database.py:init_schema` (with migration check for existing DBs)
+- [x] 1.2 Write backfill query: `UPDATE chunks SET enriched_text = ... FROM documents` joining title and parsing chunk metadata for section_header
+- [x] 1.3 Drop and recreate `chunks_fts` virtual table to index `enriched_text` instead of `text`
+- [x] 1.4 Update FTS sync triggers (`chunks_ai`, `chunks_ad`, `chunks_au`) to use `enriched_text`
+
+## 2. Enrichment Helper
+
+- [x] 2.1 Create `build_enriched_text(title: str, chunk_text: str, metadata: dict | None) -> str` helper function in `worker.py` (or a shared util) that formats `"{title} > {section_header}\n\n{chunk_text}"` or `"{title}\n\n{chunk_text}"`
+
+## 3. Ingestion Pipeline
+
+- [x] 3.1 Update `worker._process_job()` to build enriched text for each chunk after chunking
+- [x] 3.2 Pass enriched text to `embed_texts()` instead of raw chunk text
+- [x] 3.3 Pass enriched text to `database.insert_chunk()` as the new `enriched_text` parameter
+- [x] 3.4 Update `database.insert_chunk()` to accept and store `enriched_text`
+
+## 4. Reindex
+
+- [x] 4.1 Update `routes/reindex.py` to read `enriched_text` from chunks table and embed that instead of `text`
+
+## 5. Search Results
+
+- [x] 5.1 Verify `search.py:_enrich()` returns `chunks.text` (raw) not `enriched_text` — no change expected, but confirm
+
+## 6. Testing
+
+- [x] 6.1 Test: ingest a short note with a descriptive title, search by title keywords, confirm it is found
+- [x] 6.2 Test: ingest a markdown doc, search by section header, confirm chunks are found
+- [x] 6.3 Test: verify search result `text` field does not contain the prepended title
+- [x] 6.4 Test: run `reindex`, verify enriched text is used for new embeddings
+- [x] 6.5 Test: verify schema migration backfills enriched_text for pre-existing chunks on startup
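
The archived change describes its implementation only in prose, so the following is a minimal sketch of how the pieces might fit together, not the engine's actual code. It covers the enrichment helper from task 2.1, an FTS5 external-content table pointed at `enriched_text` with the three sync triggers named in task 1.4, and a search that returns the raw `text` as in tasks 6.1 to 6.3. The two-table schema is a simplified assumption, `ingest` and `search` are hypothetical helper names, and the sketch needs a SQLite build with FTS5 enabled. Embeddings and the migration backfill are left out.

```python
from __future__ import annotations

import sqlite3


def build_enriched_text(title: str, chunk_text: str, metadata: dict | None) -> str:
    """Prefix a chunk with its title and, when present, its section header."""
    section_header = (metadata or {}).get("section_header")
    prefix = f"{title} > {section_header}" if section_header else title
    return f"{prefix}\n\n{chunk_text}"


# Assumed, simplified schema: the real tables carry more columns.
SCHEMA = """
CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id),
    text TEXT NOT NULL,           -- raw chunk text, returned to clients
    enriched_text TEXT NOT NULL   -- title-prefixed text, indexed by FTS5
);
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    enriched_text, content='chunks', content_rowid='id'
);
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, enriched_text)
    VALUES ('delete', old.id, old.enriched_text);
    INSERT INTO chunks_fts(rowid, enriched_text)
    VALUES (new.id, new.enriched_text);
END;
"""


def ingest(conn: sqlite3.Connection, title: str, chunks: list) -> int:
    """Hypothetical stand-in for the worker: store raw and enriched text."""
    doc_id = conn.execute(
        "INSERT INTO documents (title) VALUES (?)", (title,)
    ).lastrowid
    for chunk_text, metadata in chunks:
        conn.execute(
            "INSERT INTO chunks (document_id, text, enriched_text) VALUES (?, ?, ?)",
            (doc_id, chunk_text, build_enriched_text(title, chunk_text, metadata)),
        )
    return doc_id


def search(conn: sqlite3.Connection, query: str) -> list:
    """Match against enriched text, but return the title and raw text."""
    return conn.execute(
        "SELECT d.title, c.text FROM chunks_fts "
        "JOIN chunks c ON c.id = chunks_fts.rowid "
        "JOIN documents d ON d.id = c.document_id "
        "WHERE chunks_fts MATCH ? ORDER BY c.id",
        (query,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
ingest(conn, "Suitcase Locks", [("Steve = 363", None)])
ingest(
    conn,
    "DCG Lab Hardware",
    [("MSI X870 Tomahawk", {"section_header": "GRIMDAWN > motherboard"})],
)

print(search(conn, "suitcase locks"))  # [('Suitcase Locks', 'Steve = 363')]
```

Keeping `text` and `enriched_text` as separate columns is what lets the query match on the title while the returned row still shows only the original chunk content.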