Bump engine version to 3.2.3

Clarify hybrid semantic + full-text search in MCP descriptions
Agents were misreading kb_search as keyword-only because the vector/semantic component was only mentioned in the negative ("fts_only: no vector similarity"). Lead with hybrid semantic + BM25 + RRF in the server instructions, kb_search docstring, and MCP.md so agents recognise it as a vector search tool.
2026-05-15 18:22:08 +01:00 · 2026-05-15 18:19:42 +01:00 · 2026-05-13 20:34:55 +01:00 · 2026-04-14 21:48:55 +01:00
4 changed files with 85 additions and 13 deletions
@@ -1,6 +1,6 @@
 # MCP Server (Agent Integration)

-The MCP server exposes kb operations as native MCP tools, so agents can search, add notes, upload files, and manage documents without shelling out to the CLI.
+The MCP server exposes kb operations as native MCP tools, so agents can search, add notes, upload files, and manage documents without shelling out to the CLI. `kb_search` is hybrid: dense vector embeddings (semantic similarity) fused with BM25 full-text ranking via Reciprocal Rank Fusion, so agents can ask natural-language questions and find conceptually related content even when the exact words don't match.

 ## Start the MCP server

@@ -27,7 +27,7 @@ docker run -d --name kb-mcp \

 | Tool | Description |
 |---|---|
-| `kb_search` | Hybrid search with optional tag/type filters |
+| `kb_search` | Hybrid semantic (vector) + full-text search with tag/type filters |
 | `kb_addnote` | Add a text note (queued for async ingestion) |
 | `kb_update_note` | Update an existing note in place |
 | `kb_get` | Get document details by ID or source path |
@@ -1 +1 @@
-3.2.1
+3.2.3
@@ -44,11 +44,16 @@ _transport_security = TransportSecuritySettings(
 mcp = FastMCP(
    "kb",
    instructions=(
-        "Knowledge base MCP server. Provides tools for searching, adding, and "
-        "managing documents and notes. Use tags to organise and filter documents "
-        "(e.g. tag notes with 'agent:mybot' and filter searches by that tag). "
-        "This server requires Bearer token authentication — all requests are "
-        "authenticated via the Authorization header at the HTTP transport layer."
+        "Knowledge base MCP server with hybrid semantic + full-text search. "
+        "kb_search uses dense vector embeddings (semantic similarity) fused with "
+        "BM25 full-text ranking, so it finds conceptually related content even "
+        "when the exact words don't match — agents can ask natural-language "
+        "questions rather than guessing keywords. Also provides tools for adding "
+        "notes, uploading files, and managing documents and tags. Use tags to "
+        "organise and filter documents (e.g. tag notes with 'agent:mybot' and "
+        "filter searches by that tag). This server requires Bearer token "
+        "authentication — all requests are authenticated via the Authorization "
+        "header at the HTTP transport layer."
    ),
    transport_security=_transport_security,
 )
@@ -62,17 +67,25 @@ async def kb_search(
    doc_type: str | None = None,
    fts_only: bool = False,
 ) -> str:
-    """Search the knowledge base for relevant documents and notes.
+    """Hybrid semantic (vector) + full-text search over the knowledge base.

-    Returns ranked chunks matching the query, with text content, relevance scores,
-    and document metadata.
+    Combines dense vector embeddings (semantic similarity — finds conceptually
+    related content even when the wording differs) with BM25 keyword ranking,
+    fused via reciprocal rank fusion. Because the search is semantic, you can
+    ask natural-language questions ("what did we decide about X?") rather than
+    guessing the exact keywords used in the source documents.
+
+    Returns ranked chunks matching the query, with text content, relevance
+    scores, and document metadata.

    Args:
-        query: The search query. Can be a natural language question or keywords.
+        query: The search query — a natural language question or keywords.
        top: Maximum number of results to return (default 10).
        tags: Filter results to documents with ALL of these tags.
        doc_type: Filter by document type (e.g. "note", "pdf", "markdown", "code").
-        fts_only: If true, use only full-text search (no vector similarity).
+        fts_only: Disable the vector/semantic component and use only BM25
+            keyword matching. Default false (hybrid mode). Set true only when
+            you need exact-term matching (e.g. an error code or identifier).

    Tips for complex queries:
    - Consider expanding into 2-3 variant phrasings and calling this tool multiple
@@ -80,6 +93,7 @@ async def kb_search(
      "pension revaluation rules" and "how are pensions revalued" to cast a wider net.
    - For precision, rerank the returned results using your own judgement based on
      relevance to the original question.
+    - Call kb_status to see which embedding model is in use.
    """
    result = engine.search(
        query=query,
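
For readers unfamiliar with the fusion step named in the docstring above, here is a minimal Python sketch of reciprocal rank fusion. It is illustrative only and is not taken from the engine: the function name, the chunk IDs, and the constant `k = 60` (a conventional default for RRF) are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first.

    Each list contributes 1 / (k + rank) for every ID it contains,
    where rank starts at 1. IDs missing from a list get nothing from it.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda item_id: (-scores[item_id], str(item_id)))


# Hypothetical chunk IDs: one ranking from vector similarity, one from BM25.
vector_hits = ["c7", "c2", "c9"]
bm25_hits = ["c2", "c4", "c7"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # ['c2', 'c7', 'c4', 'c9']
```

Chunks that appear in both rankings (`c2`, `c7`) accumulate two contributions and rise above chunks found by only one retriever, which is why a hybrid result can surface content that neither ranking alone would put first.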
@@ -0,0 +1,58 @@
+# kb — Next Steps
+
+UX improvements to make documents easier to find and inspect, prompted by a session where searching for an uploaded PDF (`M38T_PHEV_RHD_OM_EN_UK_20251209.pdf`, doc id 2077, 1801 chunks) surfaced lots of chunk hits but no obvious path back to the original document.
+
+## Problems observed
+
+### 1. `kb list` silently ignores positional arguments
+
+```
+kb list --type pdf "M38T_PHEV_RHD_OM_EN_UK_20251209"
+```
+
+The quoted term is dropped without warning; the user gets the default newest-first listing and assumes the document is missing. `kb list` currently only supports `--tags` and `--type` filters.
+
+### 2. `kb search` returns chunks with no `document_id`
+
+Result objects expose `chunk_id`, `title`, `source_path`, `tags` — but not `document_id`. To get from a search hit back to the owning document you have to title-match against `kb list` output or call an undocumented endpoint. The skill docs even claim a `source.document_id` field that isn't actually present in the CLI output.
+
+### 3. `kb info` dumps every chunk with no summary mode
+
+`kb info 2077` returns ~1801 chunk objects. The document-level metadata (`id`, `title`, `original_filename`, `source_path`, `stored_path`, `doc_type`, `language`, `content_hash`, `has_file`, `tags`, `created_at`, `updated_at`) **is** present at the top level of the JSON, but in practice it's invisible — human format presumably dumps the chunk list and the user sees only chunks.
+
+There's no way to ask for "just tell me about this document."
+
+### 4. Search hits can look like noise on image-heavy PDFs
+
+Top chunks for the M38T search were single characters (`"1"`, `"B"`, `"\""`). Almost certainly an FTS artefact on short tokens from a scan/image-heavy PDF — but it makes the result set look broken. Worth considering a minimum-text-length filter on indexed chunks, or down-weighting very short chunks in ranking.
+
+## Proposed changes
+
+### Small / high-value
+
+- **`kb info --no-chunks`** (or make `--chunks` opt-in): default to metadata + chunk count, only include chunks when asked. Human format should always lead with the metadata block.
+- **`kb list --title <substring>`** (or accept a positional query) for filename / title search. At minimum, error or warn when positional args are passed and ignored.
+- **Include `document_id` in `kb search` result objects.** Either at the top of each result or under `source.document_id` (matching the skill docs).
+
+### Medium
+
+- **`kb find <query>`** as a doc-level search that aggregates chunk hits per document and returns ranked *documents* (with hit count, top chunk preview). This is what users usually want when they say "find my PDF about X."
+- **Update the `kb` skill docs** to match actual CLI output shape, and to steer users toward `kb list | jq` for filename lookups until proper filtering lands.
+
+### Larger
+
+- **Quality filter for short chunks** during ingestion (e.g. drop chunks with < N alphanumeric chars, or fold them into neighbours). Stops scanned/image-heavy PDFs from polluting search.
+- **OCR path for scan-heavy PDFs.** The M38T manual extracted enough real text to be useful, but other "scan" docs likely don't. Detect low text density per page and route through OCR.
+
+## Quick reference (current workarounds)
+
+```bash
+# Find a doc by filename
+kb list --type pdf --format json | jq '.[] | select(.title | contains("M38T"))'
+
+# Get just metadata for a doc
+kb info 2077 --format json | jq 'del(.chunks)'
+
+# Download the original
+kb export 2077 -o manual.pdf
+```
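
The doc-level search proposed above as `kb find` could be prototyped client-side by grouping chunk hits per document. The sketch below is an assumption-laden illustration, not existing kb code: it groups on `source_path` (which the notes say is present in results today, unlike `document_id`), and the `score` and `text` field names, the sample hits, and the choice to rank documents by their best chunk score are all hypothetical.

```python
from collections import defaultdict


def aggregate_hits_by_document(hits, preview_chars=80):
    """Collapse chunk-level hits into one ranked entry per document."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit["source_path"]].append(hit)

    documents = []
    for source_path, doc_hits in groups.items():
        # The highest-scoring chunk represents the document.
        best = max(doc_hits, key=lambda h: h["score"])
        documents.append({
            "source_path": source_path,
            "title": best["title"],
            "hit_count": len(doc_hits),
            "best_score": best["score"],
            "preview": best["text"][:preview_chars],
        })

    # Rank by best chunk score, then by number of hits.
    documents.sort(key=lambda d: (-d["best_score"], -d["hit_count"]))
    return documents


# Hypothetical search results (field names assumed, see the note above).
hits = [
    {"chunk_id": 1, "title": "Owner manual", "source_path": "manual.pdf",
     "score": 0.41, "text": "Charging the high-voltage battery"},
    {"chunk_id": 2, "title": "Owner manual", "source_path": "manual.pdf",
     "score": 0.35, "text": "B"},
    {"chunk_id": 3, "title": "Service note", "source_path": "note.md",
     "score": 0.38, "text": "Battery service interval"},
]

ranked = aggregate_hits_by_document(hits)
for doc in ranked:
    print(doc["source_path"], doc["hit_count"], doc["preview"])
```

Ranking by the best chunk keeps a document with one strong match ahead of a document with many weak single-character hits; summing scores instead would favour long, noisy documents such as the scan-heavy PDF described earlier.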
Author	SHA1	Message	Date
steve	45e2c5ce91	Bump engine version to 3.2.3	2026-05-15 18:22:08 +01:00
steve	e6e91f1d5c	Clarify hybrid semantic + full-text search in MCP descriptions Agents were misreading kb_search as keyword-only because the vector/semantic component was only mentioned in the negative ("fts_only: no vector similarity"). Lead with hybrid semantic + BM25 + RRF in the server instructions, kb_search docstring, and MCP.md so agents recognise it as a vector search tool.	2026-05-15 18:19:42 +01:00
steve	9eccc527ae	Add next-steps.md with UX improvement ideas for kb CLI Captures pain points found while trying to locate an uploaded PDF: kb list silently ignores positional args, kb search results lack document_id, kb info dumps all chunks with no summary mode, and scan-heavy PDFs produce noisy single-char chunk hits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 20:34:55 +01:00
steve	d44d11e4fe	Bump engine version to 3.2.2	2026-04-14 21:48:55 +01:00