Bump engine version to 3.2.3

Clarify hybrid semantic + full-text search in MCP descriptions
Agents were misreading kb_search as keyword-only because the vector/semantic component was only mentioned in the negative ("fts_only: no vector similarity"). Lead with hybrid semantic + BM25 + RRF in the server instructions, kb_search docstring, and MCP.md so agents recognise it as a vector search tool.
2026-05-15 18:22:08 +01:00 · 2026-05-15 18:19:42 +01:00 · 2026-05-13 20:34:55 +01:00
4 changed files with 85 additions and 13 deletions
@@ -1,6 +1,6 @@
 # MCP Server (Agent Integration)
-The MCP server exposes kb operations as native MCP tools, so agents can search, add notes, upload files, and manage documents without shelling out to the CLI.
+The MCP server exposes kb operations as native MCP tools, so agents can search, add notes, upload files, and manage documents without shelling out to the CLI. `kb_search` is hybrid: dense vector embeddings (semantic similarity) fused with BM25 full-text ranking via Reciprocal Rank Fusion, so agents can ask natural-language questions and find conceptually related content even when the exact words don't match.
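As a rough illustration of the fusion step described above, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of chunk IDs. The function name `rrf_fuse`, the constant `k = 60` (a commonly used default), and the sample IDs are assumptions for illustration; this is not the engine's actual implementation.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["c3", "c1", "c7"]  # hypothetical semantic (vector) ranking
bm25_hits = ["c1", "c9", "c3"]    # hypothetical BM25 ranking
fused = rrf_fuse([vector_hits, bm25_hits])
```

Items that rank well in both lists (`c1`, `c3`) end up ahead of items that appear in only one, which is why a conceptually related chunk can surface even when it has no exact keyword match.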
 ## Start the MCP server
@@ -27,7 +27,7 @@ docker run -d --name kb-mcp \
 | Tool | Description |
 |---|---|
-| `kb_search` | Hybrid search with optional tag/type filters |
+| `kb_search` | Hybrid semantic (vector) + full-text search with tag/type filters |
 | `kb_addnote` | Add a text note (queued for async ingestion) |
 | `kb_update_note` | Update an existing note in place |
 | `kb_get` | Get document details by ID or source path |
@@ -1 +1 @@
-3.2.2
+3.2.3
@@ -44,11 +44,16 @@ _transport_security = TransportSecuritySettings(
 mcp = FastMCP(
    "kb",
    instructions=(
-        "Knowledge base MCP server. Provides tools for searching, adding, and "
+        "Knowledge base MCP server with hybrid semantic + full-text search. "
-        "managing documents and notes. Use tags to organise and filter documents "
+        "kb_search uses dense vector embeddings (semantic similarity) fused with "
-        "(e.g. tag notes with 'agent:mybot' and filter searches by that tag). "
+        "BM25 full-text ranking, so it finds conceptually related content even "
-        "This server requires Bearer token authentication — all requests are "
+        "when the exact words don't match — agents can ask natural-language "
-        "authenticated via the Authorization header at the HTTP transport layer."
+        "questions rather than guessing keywords. Also provides tools for adding "
+        "notes, uploading files, and managing documents and tags. Use tags to "
+        "organise and filter documents (e.g. tag notes with 'agent:mybot' and "
+        "filter searches by that tag). This server requires Bearer token "
+        "authentication — all requests are authenticated via the Authorization "
+        "header at the HTTP transport layer."
    ),
    transport_security=_transport_security,
 )
@@ -62,17 +67,25 @@ async def kb_search(
    doc_type: str | None = None,
    fts_only: bool = False,
 ) -> str:
-    """Search the knowledge base for relevant documents and notes.
+    """Hybrid semantic (vector) + full-text search over the knowledge base.
-    Returns ranked chunks matching the query, with text content, relevance scores,
+    Combines dense vector embeddings (semantic similarity — finds conceptually
-    and document metadata.
+    related content even when the wording differs) with BM25 keyword ranking,
+    fused via reciprocal rank fusion. Because the search is semantic, you can
+    ask natural-language questions ("what did we decide about X?") rather than
+    guessing the exact keywords used in the source documents.
+    Returns ranked chunks matching the query, with text content, relevance
+    scores, and document metadata.
    Args:
-        query: The search query. Can be a natural language question or keywords.
+        query: The search query — a natural language question or keywords.
        top: Maximum number of results to return (default 10).
        tags: Filter results to documents with ALL of these tags.
        doc_type: Filter by document type (e.g. "note", "pdf", "markdown", "code").
-        fts_only: If true, use only full-text search (no vector similarity).
+        fts_only: Disable the vector/semantic component and use only BM25
+            keyword matching. Default false (hybrid mode). Set true only when
+            you need exact-string matching (e.g. an error code, identifier).
    Tips for complex queries:
    - Consider expanding into 2-3 variant phrasings and calling this tool multiple
@@ -80,6 +93,7 @@ async def kb_search(
      "pension revaluation rules" and "how are pensions revalued" to cast a wider net.
    - For precision, rerank the returned results using your own judgement based on
      relevance to the original question.
+    - Call kb_status to see which embedding model is in use.
    """
    result = engine.search(
        query=query,
@@ -0,0 +1,58 @@
 # kb — Next Steps
 UX improvements to make documents easier to find and inspect, prompted by a session where searching for an uploaded PDF (`M38T_PHEV_RHD_OM_EN_UK_20251209.pdf`, doc id 2077, 1801 chunks) surfaced lots of chunk hits but no obvious path back to the original document.
 ## Problems observed
 ### 1. `kb list` silently ignores positional arguments
 ```
 kb list --type pdf "M38T_PHEV_RHD_OM_EN_UK_20251209"
 ```
 The quoted term is dropped without warning; the user gets the default newest-first listing and assumes the document is missing. `kb list` currently supports only the `--tags` and `--type` filters.
 ### 2. `kb search` returns chunks with no `document_id`
 Result objects expose `chunk_id`, `title`, `source_path`, `tags` — but not `document_id`. To get from a search hit back to the owning document you have to title-match against `kb list` output or call an undocumented endpoint. The skill docs even claim a `source.document_id` field that isn't actually present in the CLI output.
 ### 3. `kb info` dumps every chunk with no summary mode
 `kb info 2077` returns ~1801 chunk objects. The document-level metadata (`id`, `title`, `original_filename`, `source_path`, `stored_path`, `doc_type`, `language`, `content_hash`, `has_file`, `tags`, `created_at`, `updated_at`) **is** present at the top level of the JSON, but in practice it is invisible: the human format presumably dumps the full chunk list, so the user sees only chunks.
 There's no way to ask for "just tell me about this document."
 ### 4. Search hits can look like noise on image-heavy PDFs
 Top chunks for the M38T search were single characters (`"1"`, `"B"`, `"\""`). Almost certainly an FTS artefact on short tokens from a scan/image-heavy PDF — but it makes the result set look broken. Worth considering a minimum-text-length filter on indexed chunks, or down-weighting very short chunks in ranking.
 ## Proposed changes
 ### Small / high-value
 - **`kb info --no-chunks`** (or make `--chunks` opt-in): default to metadata + chunk count, only include chunks when asked. Human format should always lead with the metadata block.
 - **`kb list --title <substring>`** (or accept a positional query) for filename / title search. At minimum, error or warn when positional args are passed and ignored.
 - **Include `document_id` in `kb search` result objects.** Either at the top of each result or under `source.document_id` (matching the skill docs).
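A minimal sketch of what the first two items could do once the JSON is in hand, assuming the `kb info` payload is a dict with a top-level `chunks` list (as the `jq 'del(.chunks)'` workaround suggests) and that `kb list` yields dicts with a `title` key. The function names, the `chunk_count` field, and the sample data are hypothetical.

```python
def summarise_document(info: dict) -> dict:
    """Return document-level metadata plus a chunk count, without the chunk bodies."""
    summary = {key: value for key, value in info.items() if key != "chunks"}
    summary["chunk_count"] = len(info.get("chunks", []))
    return summary


def filter_by_title(documents: list[dict], needle: str) -> list[dict]:
    """Keep documents whose title contains the substring (case-insensitive by choice)."""
    lowered = needle.lower()
    return [doc for doc in documents if lowered in doc.get("title", "").lower()]


# Hypothetical payloads, trimmed to a few fields.
info = {"id": 2077, "doc_type": "pdf", "chunks": [{"chunk_id": 1}, {"chunk_id": 2}]}
summary = summarise_document(info)

documents = [
    {"id": 2077, "title": "M38T_PHEV_RHD_OM_EN_UK_20251209"},
    {"id": 12, "title": "pension notes"},
]
matches = filter_by_title(documents, "m38t")
```

Note that the case-insensitive match is a design choice in this sketch; the `jq` `contains` workaround is case-sensitive.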
 ### Medium
 - **`kb find <query>`** as a doc-level search that aggregates chunk hits per document and returns ranked *documents* (with hit count, top chunk preview). This is what users usually want when they say "find my PDF about X."
 - **Update the `kb` skill docs** to match actual CLI output shape, and to steer users toward `kb list | jq` for filename lookups until proper filtering lands.
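The aggregation behind a doc-level search might look roughly like the sketch below. It assumes each chunk hit already carries `document_id` (the change proposed above) along with `title`, `score`, and `text`; those field names, the ranking rule (best chunk score, then hit count), and the sample hits are assumptions, not the current result shape.

```python
from collections import defaultdict


def aggregate_by_document(hits: list[dict], preview_chars: int = 80) -> list[dict]:
    """Group chunk hits by owning document and rank the documents."""
    grouped: dict[int, list[dict]] = defaultdict(list)
    for hit in hits:
        grouped[hit["document_id"]].append(hit)

    documents = []
    for document_id, doc_hits in grouped.items():
        best = max(doc_hits, key=lambda h: h["score"])
        documents.append({
            "document_id": document_id,
            "title": best["title"],
            "hit_count": len(doc_hits),
            "top_score": best["score"],
            "preview": best["text"][:preview_chars],
        })
    # Rank by the best chunk score, then by how many chunks matched.
    documents.sort(key=lambda d: (d["top_score"], d["hit_count"]), reverse=True)
    return documents


# Hypothetical chunk hits.
hits = [
    {"document_id": 2077, "title": "M38T manual", "score": 0.4, "text": "Charging the battery"},
    {"document_id": 12, "title": "pension notes", "score": 0.7, "text": "Revaluation rules"},
    {"document_id": 2077, "title": "M38T manual", "score": 0.9, "text": "Tyre pressures"},
]
ranked = aggregate_by_document(hits)
```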
 ### Larger
 - **Quality filter for short chunks** during ingestion (e.g. drop chunks with < N alphanumeric chars, or fold them into neighbours). Stops scanned/image-heavy PDFs from polluting search.
 - **OCR path for scan-heavy PDFs.** The M38T manual extracted enough real text to be useful, but other "scan" docs likely don't. Detect low text density per page and route through OCR.
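One way the "fold them into neighbours" option could work is sketched below, assuming chunks are plain strings at ingestion time. The threshold `min_alnum = 20`, the function names, and the sample text are hypothetical; the real chunker may carry offsets or metadata that would need merging too.

```python
def alnum_len(text: str) -> int:
    """Count alphanumeric characters, ignoring whitespace and punctuation."""
    return sum(ch.isalnum() for ch in text)


def fold_short_chunks(chunks: list[str], min_alnum: int = 20) -> list[str]:
    """Merge chunks below the threshold into a neighbour instead of indexing them alone."""
    folded: list[str] = []
    pending: list[str] = []  # short chunks seen before the first long chunk
    for chunk in chunks:
        if alnum_len(chunk) >= min_alnum:
            # Long enough to stand alone; prepend any leading short chunks.
            folded.append(" ".join(pending + [chunk]))
            pending = []
        elif folded:
            # Short chunk: append it to the previous kept chunk.
            folded[-1] = f"{folded[-1]} {chunk}"
        else:
            pending.append(chunk)
    if pending:
        # Every chunk was short: keep the text together as a single chunk.
        folded.append(" ".join(pending))
    return folded


# Hypothetical extraction output from an image-heavy page.
raw = ["1", "Charging the high-voltage battery at home", "B", "\"",
       "Tyre pressures are listed on the door label"]
cleaned = fold_short_chunks(raw)
```

Folding keeps the stray characters searchable in context, whereas dropping them outright would be simpler but loses text.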
 ## Quick reference (current workarounds)
 ```bash
 # Find a doc by filename
 kb list --type pdf --format json | jq '.[] | select(.title | contains("M38T"))'
 # Get just metadata for a doc
 kb info 2077 --format json | jq 'del(.chunks)'
 # Download the original
 kb export 2077 -o manual.pdf
 ```