3 Commits

Author SHA1 Message Date
steve 45e2c5ce91 Bump engine version to 3.2.3 2026-05-15 18:22:08 +01:00
steve e6e91f1d5c Clarify hybrid semantic + full-text search in MCP descriptions
Agents were misreading kb_search as keyword-only because the vector/semantic
component was only mentioned in the negative ("fts_only: no vector similarity").
Lead with hybrid semantic + BM25 + RRF in the server instructions, kb_search
docstring, and MCP.md so agents recognise it as a vector search tool.
2026-05-15 18:19:42 +01:00
steve 9eccc527ae Add next-steps.md with UX improvement ideas for kb CLI
Captures pain points found while trying to locate an uploaded PDF: kb
list silently ignores positional args, kb search results lack
document_id, kb info dumps all chunks with no summary mode, and
scan-heavy PDFs produce noisy single-char chunk hits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 20:34:55 +01:00
4 changed files with 85 additions and 13 deletions
+2 -2
@@ -1,6 +1,6 @@
 # MCP Server (Agent Integration)
 
-The MCP server exposes kb operations as native MCP tools, so agents can search, add notes, upload files, and manage documents without shelling out to the CLI.
+The MCP server exposes kb operations as native MCP tools, so agents can search, add notes, upload files, and manage documents without shelling out to the CLI. `kb_search` is hybrid: dense vector embeddings (semantic similarity) fused with BM25 full-text ranking via Reciprocal Rank Fusion, so agents can ask natural-language questions and find conceptually related content even when the exact words don't match.
 
 ## Start the MCP server
@@ -27,7 +27,7 @@ docker run -d --name kb-mcp \
 | Tool | Description |
 |---|---|
-| `kb_search` | Hybrid search with optional tag/type filters |
+| `kb_search` | Hybrid semantic (vector) + full-text search with tag/type filters |
 | `kb_addnote` | Add a text note (queued for async ingestion) |
 | `kb_update_note` | Update an existing note in place |
 | `kb_get` | Get document details by ID or source path |
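For readers unfamiliar with the fusion step this change names, here is a minimal sketch of Reciprocal Rank Fusion. It is not the engine's implementation: the function name, the sample chunk IDs, and the constant `k = 60` (a commonly used default) are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first ranked lists of chunk IDs into one ranking.

    A chunk earns 1 / (k + rank) from every list it appears in (rank starts
    at 1), and the contributions are summed across lists.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical result lists, not real engine output.
vector_hits = ["c7", "c2", "c9"]  # order from vector similarity
bm25_hits = ["c2", "c4", "c7"]    # order from BM25
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Chunks that rank well in both lists (`c2`, `c7`) rise above chunks found by only one retriever, which is why a hybrid query can surface content that shares meaning but not wording with the question.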
+1 -1
@@ -1 +1 @@
-3.2.2
+3.2.3
+24 -10
@@ -44,11 +44,16 @@ _transport_security = TransportSecuritySettings(
 mcp = FastMCP(
     "kb",
     instructions=(
-        "Knowledge base MCP server. Provides tools for searching, adding, and "
-        "managing documents and notes. Use tags to organise and filter documents "
-        "(e.g. tag notes with 'agent:mybot' and filter searches by that tag). "
-        "This server requires Bearer token authentication — all requests are "
-        "authenticated via the Authorization header at the HTTP transport layer."
+        "Knowledge base MCP server with hybrid semantic + full-text search. "
+        "kb_search uses dense vector embeddings (semantic similarity) fused with "
+        "BM25 full-text ranking, so it finds conceptually related content even "
+        "when the exact words don't match — agents can ask natural-language "
+        "questions rather than guessing keywords. Also provides tools for adding "
+        "notes, uploading files, and managing documents and tags. Use tags to "
+        "organise and filter documents (e.g. tag notes with 'agent:mybot' and "
+        "filter searches by that tag). This server requires Bearer token "
+        "authentication — all requests are authenticated via the Authorization "
+        "header at the HTTP transport layer."
     ),
     transport_security=_transport_security,
 )
@@ -62,17 +67,25 @@ async def kb_search(
     doc_type: str | None = None,
     fts_only: bool = False,
 ) -> str:
-    """Search the knowledge base for relevant documents and notes.
+    """Hybrid semantic (vector) + full-text search over the knowledge base.
 
-    Returns ranked chunks matching the query, with text content, relevance scores,
-    and document metadata.
+    Combines dense vector embeddings (semantic similarity — finds conceptually
+    related content even when the wording differs) with BM25 keyword ranking,
+    fused via reciprocal rank fusion. Because the search is semantic, you can
+    ask natural-language questions ("what did we decide about X?") rather than
+    guessing the exact keywords used in the source documents.
+
+    Returns ranked chunks matching the query, with text content, relevance
+    scores, and document metadata.
 
     Args:
-        query: The search query. Can be a natural language question or keywords.
+        query: The search query a natural language question or keywords.
         top: Maximum number of results to return (default 10).
         tags: Filter results to documents with ALL of these tags.
         doc_type: Filter by document type (e.g. "note", "pdf", "markdown", "code").
-        fts_only: If true, use only full-text search (no vector similarity).
+        fts_only: Disable the vector/semantic component and use only BM25
+            keyword matching. Default false (hybrid mode). Set true only when
+            you need exact-string matching (e.g. an error code, identifier).
 
     Tips for complex queries:
     - Consider expanding into 2-3 variant phrasings and calling this tool multiple
@@ -80,6 +93,7 @@ async def kb_search(
       "pension revaluation rules" and "how are pensions revalued" to cast a wider net.
     - For precision, rerank the returned results using your own judgement based on
       relevance to the original question.
+    - Call kb_status to see which embedding model is in use.
     """
     result = engine.search(
         query=query,
+58
@@ -0,0 +1,58 @@
# kb — Next Steps
UX improvements to make documents easier to find and inspect, prompted by a session where searching for an uploaded PDF (`M38T_PHEV_RHD_OM_EN_UK_20251209.pdf`, doc id 2077, 1801 chunks) surfaced lots of chunk hits but no obvious path back to the original document.
## Problems observed
### 1. `kb list` silently ignores positional arguments
```
kb list --type pdf "M38T_PHEV_RHD_OM_EN_UK_20251209"
```
The quoted term is dropped without warning; user gets the default newest-first listing and assumes the document is missing. `kb list` currently only supports `--tags` and `--type` filters.
### 2. `kb search` returns chunks with no `document_id`
Result objects expose `chunk_id`, `title`, `source_path`, `tags` — but not `document_id`. To get from a search hit back to the owning document you have to title-match against `kb list` output or call an undocumented endpoint. The skill docs even claim a `source.document_id` field that isn't actually present in the CLI output.
### 3. `kb info` dumps every chunk with no summary mode
`kb info 2077` returns ~1801 chunk objects. The document-level metadata (`id`, `title`, `original_filename`, `source_path`, `stored_path`, `doc_type`, `language`, `content_hash`, `has_file`, `tags`, `created_at`, `updated_at`) **is** present at the top level of the JSON, but in practice it's invisible — human format presumably dumps the chunk list and the user sees only chunks.
There's no way to ask for "just tell me about this document."
### 4. Search hits can look like noise on image-heavy PDFs
Top chunks for the M38T search were single characters (`"1"`, `"B"`, `"\""`). Almost certainly an FTS artefact on short tokens from a scan/image-heavy PDF — but it makes the result set look broken. Worth considering a minimum-text-length filter on indexed chunks, or down-weighting very short chunks in ranking.
## Proposed changes
### Small / high-value
- **`kb info --no-chunks`** (or make `--chunks` opt-in): default to metadata + chunk count, only include chunks when asked. Human format should always lead with the metadata block.
- **`kb list --title <substring>`** (or accept a positional query) for filename / title search. At minimum, error or warn when positional args are passed and ignored.
- **Include `document_id` in `kb search` result objects.** Either at the top of each result or under `source.document_id` (matching the skill docs).
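The "error or warn" idea for ignored positional arguments could look roughly like the sketch below. It uses Python's standard `argparse` only as an illustration; the framework the real `kb` CLI is built on is not stated here, and `parse_list_args`, the comma-separated `--tags` form, and the warning text are all hypothetical.

```python
import argparse


def parse_list_args(argv):
    """Parse `kb list`-style arguments and report any that would be ignored.

    Returns (args, warnings). Hypothetical helper, not the real kb CLI code.
    """
    parser = argparse.ArgumentParser(prog="kb list")
    parser.add_argument("--type", dest="doc_type", default=None)
    parser.add_argument("--tags", default=None)  # assumed comma-separated
    # parse_known_args returns unrecognised arguments instead of exiting,
    # so the caller decides whether to warn or fail.
    args, extras = parser.parse_known_args(argv)
    warnings = []
    if extras:
        warnings.append(
            "kb list does not filter by positional arguments; ignored: "
            + " ".join(extras)
        )
    return args, warnings


args, warnings = parse_list_args(
    ["--type", "pdf", "M38T_PHEV_RHD_OM_EN_UK_20251209"]
)
```

Printing the warning to stderr (or turning it into a hard error) would tell the user that the default listing is not a filtered result.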
### Medium
- **`kb find <query>`** as a doc-level search that aggregates chunk hits per document and returns ranked *documents* (with hit count, top chunk preview). This is what users usually want when they say "find my PDF about X."
- **Update the `kb` skill docs** to match actual CLI output shape, and to steer users toward `kb list | jq` for filename lookups until proper filtering lands.
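A doc-level search of the kind `kb find` describes might aggregate chunk hits as in this sketch. It assumes each hit already carries a `document_id` (which the search output currently lacks, per problem 2) along with `score`, `title`, and `text` fields; the function name, the ranking rule (best chunk score, then hit count), and the sample hits are invented for illustration.

```python
from collections import defaultdict


def aggregate_hits_by_document(hits):
    """Group chunk hits per document and rank documents by their best chunk."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit["document_id"]].append(hit)

    documents = []
    for doc_id, doc_hits in groups.items():
        best = max(doc_hits, key=lambda h: h["score"])
        documents.append({
            "document_id": doc_id,
            "title": best["title"],
            "hit_count": len(doc_hits),
            "top_score": best["score"],
            "preview": best["text"][:80],  # short preview of the top chunk
        })
    # Best chunk score first; more hits wins a tie.
    documents.sort(key=lambda d: (-d["top_score"], -d["hit_count"]))
    return documents


# Made-up sample hits, not real search output.
hits = [
    {"document_id": 2077, "title": "Owner manual", "score": 0.91,
     "text": "Charging the high-voltage battery"},
    {"document_id": 12, "title": "Service note", "score": 0.95,
     "text": "Battery recall notice"},
    {"document_id": 2077, "title": "Owner manual", "score": 0.40, "text": "1"},
]
ranked = aggregate_hits_by_document(hits)
```

Other ranking rules (summed scores, or a hit-count-weighted score) are equally plausible; the point is only that the result is one row per document rather than one row per chunk.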
### Larger
- **Quality filter for short chunks** during ingestion (e.g. drop chunks with < N alphanumeric chars, or fold them into neighbours). Stops scanned/image-heavy PDFs from polluting search.
- **OCR path for scan-heavy PDFs.** The M38T manual extracted enough real text to be useful, but other "scan" docs likely don't. Detect low text density per page and route through OCR.
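The "fold them into neighbours" variant of the short-chunk filter can be sketched as follows. The threshold of 20 alphanumeric characters, the function names, and the sample chunks are assumptions; nothing here reflects how the engine chunks documents today.

```python
def alnum_count(text):
    """Number of alphanumeric characters in text."""
    return sum(ch.isalnum() for ch in text)


def fold_short_chunks(chunks, min_alnum=20):
    """Merge chunks with fewer than min_alnum alphanumeric chars into a neighbour.

    Short chunks are carried forward into the next chunk; a short tail is
    appended to the previous one. If everything is short, one chunk remains.
    """
    merged = []
    pending = ""  # short text waiting for a following chunk
    for chunk in chunks:
        text = f"{pending} {chunk}" if pending else chunk
        if alnum_count(text) < min_alnum:
            pending = text
        else:
            merged.append(text)
            pending = ""
    if pending:
        if merged:
            merged[-1] = f"{merged[-1]} {pending}"
        else:
            merged.append(pending)
    return merged


# Made-up chunks resembling the single-character hits described above.
chunks = ["1", "B", "Charging the high-voltage battery takes several hours.", '"']
folded = fold_short_chunks(chunks)
```

Folding keeps the stray characters in the index but stops them from appearing as standalone search hits; dropping them outright is the simpler alternative if the fragments carry no meaning.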
## Quick reference (current workarounds)
```bash
# Find a doc by filename
kb list --type pdf --format json | jq '.[] | select(.title | contains("M38T"))'
# Get just metadata for a doc
kb info 2077 --format json | jq 'del(.chunks)'
# Download the original
kb export 2077 -o manual.pdf
```