New MCP server (mcp/) exposes kb operations as native MCP tools over
Streamable HTTP with Bearer token auth. Supports collections via tag
conventions, chunked file uploads, and agent-side search patterns.
Engine gains PATCH /api/v1/notes/{id} for in-place note updates with
transactional re-chunk/re-embed, and updated_at column on documents.
Go client adds updatenote command and Patch HTTP method.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Context
kb v2 is a client-server knowledge base: a Python FastAPI engine (SQLite + FTS5 + sqlite-vec, sentence-transformers embeddings) serving a Go CLI client over HTTP. Agent integration currently works via a Claude Code skill that shells out to the Go binary and parses JSON output.
The engine runs in Docker (NVIDIA/ROCm/CPU variants), keeps the embedding model warm in memory, and handles async ingestion via a background worker. The data model has documents, chunks, embeddings, tags, and jobs — but no concept of collections or note mutation.
This design covers three changes: adding an MCP server as a new integration surface, adding collection-scoped search via tag conventions, and adding in-place note updates.
Goals / Non-Goals
Goals:
- Expose kb as native MCP tools so agents interact with it directly, not via shell subprocess
- Separate agent memory from user documents via collection tags
- Allow notes to be updated in place, preserving document identity
- Support file upload from remote agents via the MCP server
- Keep the engine fully local — no cloud API dependencies
- Maintain backward compatibility: existing CLI, API, and data all continue to work
Non-Goals:
- Query expansion or LLM reranking inside the engine (agent-side responsibility)
- File-watching / inotify for auto-reindexing (useful but separate concern)
- Collection-level access control or permissions
- New schema columns for collections (use existing tags)
- Stdio MCP transport (Streamable HTTP only)
Decisions
D1 — MCP server as a separate container, Streamable HTTP transport, with its own auth
The MCP server runs as its own Docker container alongside the engine, exposed via Streamable HTTP. It is not embedded into the FastAPI engine app. It requires its own Bearer token (KB_MCP_API_KEY) from calling agents.
Why: The engine and MCP server have different concerns — the engine manages embeddings, search, and ingestion; the MCP server translates MCP protocol to engine API calls. Keeping them separate means either can be updated independently. Both run as long-lived containers in a Docker Compose stack.
Streamable HTTP (not stdio) because the MCP server is a network service that remote agents connect to, not a subprocess spawned by a local agent. This matches the deployment model: engine + MCP server run on an infrastructure host, agents connect over the network.
The MCP server must have its own authentication because it is HTTP-exposed and holds the engine credential (KB_API_KEY). Without its own check, anyone who discovers the endpoint could reach the engine through the MCP server without presenting any credential. The MCP server therefore validates the agent's Bearer token (KB_MCP_API_KEY) before proxying requests to the engine.
Alternative considered: Embedding MCP into the FastAPI app as additional routes. Rejected — it couples the MCP SDK lifecycle to the engine, and the engine shouldn't need to know about MCP protocol details. Also considered stdio transport, rejected because it requires the agent and MCP server to share a host. Also considered relying solely on the engine's KB_API_KEY for auth. Rejected — the MCP server is a separate network surface and must authenticate its own callers.
Implementation: Separate Python package/directory (mcp/ at repo root). Uses the mcp Python SDK with Streamable HTTP transport. Reads engine URL and engine API key from environment variables (KB_ENGINE_URL, KB_API_KEY). Reads its own auth token from KB_MCP_API_KEY. Makes HTTP calls to the engine using httpx. Docker Compose file adds the MCP server as a service alongside the engine.
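The configuration and token check described above can be sketched without any framework code. This is a minimal illustration, not the MCP server's actual implementation: the function names load_config and is_authorized are hypothetical, and how the check is wired into the mcp SDK's HTTP layer is deliberately left out. Only the three environment variable names come from this design.

```python
import hmac
from typing import Mapping, Optional

# Environment variables named in this design.
REQUIRED_ENV = ("KB_ENGINE_URL", "KB_API_KEY", "KB_MCP_API_KEY")


def load_config(env: Mapping[str, str]) -> dict:
    """Read the required variables and fail fast if any is unset or empty."""
    missing = [name for name in REQUIRED_ENV if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_ENV}


def is_authorized(authorization_header: Optional[str], expected_token: str) -> bool:
    """Return True only for a 'Bearer <token>' header whose token matches."""
    if not authorization_header or not expected_token:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Constant-time comparison avoids leaking how much of the token matched.
    return hmac.compare_digest(
        presented.strip().encode("utf-8"), expected_token.encode("utf-8")
    )
```

In a real deployment load_config would be called once at startup with os.environ, and is_authorized would run on every incoming request before any call is proxied to the engine.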
D2 — Collections via tag conventions, with MCP-enforced exclusivity
Collections are implemented using the existing tag system with a naming convention: collection:documents, collection:memory, collection:workspace.
Why: Tags already exist, already filter search, and are already mutable via the API. A dedicated collection column would add a schema migration, new API parameters, and new CLI flags — all duplicating what tags can do.
Exclusive membership: The MCP server enforces one collection per document. When adding a document to a collection, the MCP server first removes any existing collection:* tags via the engine's tag API, then applies the new one. This prevents a document from appearing in multiple collections and keeps search results clean.
Tag stripping in MCP responses: The MCP server strips collection:* tags from the tags array in search results and presents the collection as a separate collection field. Agents see a clean interface: {"collection": "memory", "tags": ["feedback", "email"]} rather than raw collection:memory mixed in with user tags.
Implementation: The MCP tools accept a collection parameter (e.g. "memory"). The MCP server translates this to tag operations:
- On search: adds collection:<name> to the tag filter
- On addnote/addfile: removes any existing collection:* tags, then applies collection:<name>
- On results: strips collection:* from tags, adds a collection field
The engine is unchanged. The Go CLI can use the same convention manually via --tags collection:memory.
Convention: collection:documents is the default. Standard names: documents, memory, workspace. The MCP tool descriptions document these.
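The tag translation can be expressed as three small pure functions. This is a sketch under assumptions: the function names are hypothetical, the result shape (a dict with a tags list) is assumed rather than taken from the engine API, and reporting an untagged document as belonging to the default collection is one possible reading of "collection:documents is the default".

```python
from typing import List, Tuple

COLLECTION_PREFIX = "collection:"
KNOWN_COLLECTIONS = ("documents", "memory", "workspace")
DEFAULT_COLLECTION = "documents"


def collection_tag(name: str) -> str:
    """Validate a collection name against the known list and return its tag."""
    if name not in KNOWN_COLLECTIONS:
        raise ValueError(f"unknown collection: {name!r}")
    return COLLECTION_PREFIX + name


def search_tags(user_tags: List[str], collection: str) -> List[str]:
    """Tag filter for a search: user tags plus exactly one collection tag."""
    plain = [t for t in user_tags if not t.startswith(COLLECTION_PREFIX)]
    return plain + [collection_tag(collection)]


def retag_plan(existing_tags: List[str], collection: str) -> Tuple[List[str], str]:
    """Return (tags to remove, tag to add) so the document ends in one collection."""
    to_remove = [t for t in existing_tags if t.startswith(COLLECTION_PREFIX)]
    return to_remove, collection_tag(collection)


def present(result: dict) -> dict:
    """Strip collection:* tags from a result and expose a collection field."""
    tags = result.get("tags", [])
    found = [t[len(COLLECTION_PREFIX):] for t in tags if t.startswith(COLLECTION_PREFIX)]
    clean = [t for t in tags if not t.startswith(COLLECTION_PREFIX)]
    # Assumption: a document with no collection tag is shown as the default.
    return {**result, "collection": found[0] if found else DEFAULT_COLLECTION, "tags": clean}
```

Validating the name in one place (collection_tag) means a typo such as "memeory" is rejected before any tag operation reaches the engine.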
D3 — Note mutation via dedicated PATCH endpoint, with full chunking support
Note updates go through a new synchronous PATCH /api/v1/notes/{id} endpoint, not through the async job queue. The endpoint uses the same chunking logic as the ingestion pipeline, not a hardcoded single-chunk assumption.
Why: Most notes are short and produce a single chunk. But if an agent updates a note with text that exceeds the embedding model's token window (~256 tokens for MiniLM), a single-chunk approach would silently embed only a portion of the text. Using the standard note chunking pipeline (which today produces one chunk for typical notes) means the endpoint naturally handles longer notes without silent data loss.
Alternative considered: Truncating long notes and returning a warning. Rejected — silent data loss or warnings that the agent might ignore are worse than just doing the right thing. Also considered reusing the job queue for consistency. Rejected — the queue's value is async processing of heavy workloads. Notes don't need it.
Implementation: The PATCH endpoint:
- Validates the document exists and is doc_type = 'note'
- Deletes existing chunks, FTS entries, and vector embeddings for that document
- Runs the new text through the note chunking pipeline (same as ingestion)
- Embeds each chunk and inserts into chunks_vec
- Updates the document's content_hash and updated_at
- Returns the updated document
All of these steps run within a single transaction. FTS5 triggers keep the full-text index in sync automatically (existing chunks_au and chunks_ad triggers handle this). If embedding fails, the transaction rolls back and the old note is preserved.
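The rollback behaviour is the part most worth seeing in code. The sketch below is illustrative only: DEMO_SCHEMA is a simplified, hypothetical stand-in for the engine's tables (chunks_vec is a plain table here, not a sqlite-vec virtual table, and the FTS triggers are omitted), and the chunk and embed callables stand in for the real chunking pipeline and embedding model.

```python
import hashlib
import json
import sqlite3
from typing import Callable, List, Sequence

# Hypothetical, simplified schema for demonstration only.
DEMO_SCHEMA = """
CREATE TABLE documents (
    id INTEGER PRIMARY KEY, doc_type TEXT, content_hash TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP, updated_at TEXT);
CREATE TABLE chunks (id INTEGER PRIMARY KEY, document_id INTEGER, text TEXT);
CREATE TABLE chunks_vec (chunk_id INTEGER, embedding TEXT);
"""


def update_note(
    conn: sqlite3.Connection,
    doc_id: int,
    text: str,
    chunk: Callable[[str], List[str]],
    embed: Callable[[str], Sequence[float]],
) -> None:
    """Replace a note's chunks and embeddings atomically."""
    row = conn.execute(
        "SELECT doc_type FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    if row is None:
        raise LookupError(f"document {doc_id} not found")
    if row[0] != "note":
        raise ValueError(f"document {doc_id} is not a note")

    # 'with conn' commits on success and rolls back if any step raises,
    # so a failed embedding leaves the old chunks in place.
    with conn:
        conn.execute(
            "DELETE FROM chunks_vec WHERE chunk_id IN "
            "(SELECT id FROM chunks WHERE document_id = ?)",
            (doc_id,),
        )
        conn.execute("DELETE FROM chunks WHERE document_id = ?", (doc_id,))
        for piece in chunk(text):
            cur = conn.execute(
                "INSERT INTO chunks (document_id, text) VALUES (?, ?)",
                (doc_id, piece),
            )
            vector = embed(piece)  # may raise; triggers the rollback
            conn.execute(
                "INSERT INTO chunks_vec (chunk_id, embedding) VALUES (?, ?)",
                (cur.lastrowid, json.dumps(list(vector))),
            )
        conn.execute(
            "UPDATE documents SET content_hash = ?, updated_at = CURRENT_TIMESTAMP "
            "WHERE id = ?",
            (hashlib.sha256(text.encode("utf-8")).hexdigest(), doc_id),
        )
```

Passing the chunker and embedder as parameters keeps the sketch independent of the model; in the engine these would be the same functions the ingestion worker uses.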
D4 — updated_at column on documents, set only on mutation
A new updated_at TEXT column on documents, initially NULL for all existing documents. Set to current_timestamp only when a document is modified (note update, tag change).
Why: Distinguishes "created" from "last modified". The agent memory use case needs to know when a memory was last updated, not just when it was first created. NULL means "never updated" — cleaner than duplicating created_at.
Date sorting: Any query that sorts or filters by "most recent" must use COALESCE(updated_at, created_at) to ensure un-mutated documents don't disappear from recent lists. This applies to the documents list endpoint and any future "recent" views.
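A short sketch of the sorting rule, using an in-memory SQLite table with only the columns that matter here. The helper name recent_documents and the last_touched alias are illustrative; the real documents list endpoint will select more columns.

```python
import sqlite3

# Sort by last modification, falling back to creation time for
# documents that have never been updated (updated_at IS NULL).
RECENT_SQL = """
SELECT id, COALESCE(updated_at, created_at) AS last_touched
FROM documents
ORDER BY last_touched DESC, id DESC
LIMIT ?
"""


def recent_documents(conn: sqlite3.Connection, limit: int = 10) -> list:
    """Return (id, last_touched) rows, most recently touched first."""
    return conn.execute(RECENT_SQL, (limit,)).fetchall()
```

Sorting on updated_at alone would push every never-updated document below all updated ones, because SQLite orders NULL before any other value and therefore last under DESC.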
D5 — File upload via chunked base64, proxied to engine's existing upload API
The MCP server supports file uploads from remote agents using a three-step chunked upload pattern:
- kb_upload_start(filename, total_size, tags, collection): creates a temporary staging entry on the MCP server, returns a server-generated UUID upload_id
- kb_upload_chunk(upload_id, data, chunk_index): appends a base64-encoded chunk to the staging entry. Called N times.
- kb_upload_finish(upload_id): reassembles chunks, decodes from base64, and forwards the complete file as a multipart upload to the engine's existing POST /api/v1/jobs endpoint. Returns the job ID.
Why: The MCP server is remote from the calling agent, so file paths are meaningless. The agent reads the file locally, splits it into chunks, base64-encodes each chunk, and sends them as individual tool calls. No single MCP message needs to carry the entire file, avoiding message size limits regardless of file size.
The engine's existing upload pipeline handles everything from there: staging, type detection, chunking, embedding. No new engine code needed for file transfer.
Alternative considered: Single-message base64 upload (kb_addfile with full file content). Rejected — works for small files but hits practical MCP message size limits on larger PDFs. Also considered a separate file transfer service (SFTP container). Rejected — adds operational complexity for no benefit over the chunked approach. Also considered a plain HTTP upload endpoint on the MCP server. Rejected — adds a second protocol surface the agent needs to interact with. Also considered a single-call shortcut for small files. Rejected — one path for all files is simpler for agents to learn, and the overhead of 3 calls vs 1 is negligible for an LLM.
Upload ID: Server-generated UUID, returned by kb_upload_start. Prevents collision and is unpredictable (important since the MCP server is network-exposed).
Chunk size: Recommended 1MB raw per chunk (~1.33MB after base64 encoding). A 10MB PDF takes ~10 kb_upload_chunk calls, plus one call each to kb_upload_start and kb_upload_finish. The MCP server holds chunks in a temporary directory and cleans up on finish or after a timeout (e.g. 10 minutes for abandoned uploads).
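On the agent side, splitting and encoding might look like the following. The helper names are hypothetical and the sketch reads the whole payload into memory, which is acceptable for illustration but not required by the design. It takes 1MB to mean 1 MiB (1,048,576 bytes), which is an assumption.

```python
import base64
from typing import Iterator, Tuple

CHUNK_SIZE = 1024 * 1024  # assumed: 1 MiB of raw bytes per chunk


def iter_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, str]]:
    """Yield (chunk_index, base64 text) pairs covering data in order."""
    for index, offset in enumerate(range(0, len(data), chunk_size)):
        piece = data[offset:offset + chunk_size]
        yield index, base64.b64encode(piece).decode("ascii")


def encoded_size(raw_bytes: int) -> int:
    """Length of padded base64 output: 4 characters per 3 input bytes, rounded up."""
    return 4 * ((raw_bytes + 2) // 3)
```

A 1 MiB raw chunk encodes to 1,398,104 base64 characters, which is where the ~1.33x figure comes from. Each yielded pair maps onto one kb_upload_chunk call.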
Staging cleanup: The MCP server tracks active uploads in memory. Chunks are written to a temporary directory. On kb_upload_finish, chunks are assembled and forwarded. On timeout or error, the temporary files are cleaned up. No persistent state needed — abandoned uploads are simply garbage collected. The temp directory does not need to survive container restarts; if the MCP server restarts mid-upload, the agent retries from kb_upload_start.
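The staging lifecycle can be sketched as a small class. Everything here is an assumption layered on the description above: the class and method names are hypothetical, each chunk is written to its own file named by chunk_index (so out-of-order arrival is tolerated), and finish returns the assembled bytes instead of forwarding them to the engine.

```python
import base64
import os
import shutil
import tempfile
import time
import uuid
from typing import Dict, Optional


class UploadStaging:
    """In-memory registry of uploads whose chunks live in a temp directory."""

    def __init__(self, ttl_seconds: float = 600.0) -> None:
        self.ttl = ttl_seconds
        self.root = tempfile.mkdtemp(prefix="kb-upload-")
        self.uploads: Dict[str, dict] = {}

    def start(self, filename: str, total_size: int, now: Optional[float] = None) -> str:
        upload_id = str(uuid.uuid4())  # server-generated, unpredictable
        path = os.path.join(self.root, upload_id)
        os.mkdir(path)
        self.uploads[upload_id] = {
            "filename": filename,
            "total_size": total_size,
            "dir": path,
            "touched": time.monotonic() if now is None else now,
        }
        return upload_id

    def chunk(self, upload_id: str, data_b64: str, chunk_index: int,
              now: Optional[float] = None) -> None:
        entry = self.uploads[upload_id]  # KeyError for unknown or expired ids
        raw = base64.b64decode(data_b64, validate=True)
        with open(os.path.join(entry["dir"], f"{chunk_index:08d}"), "wb") as fh:
            fh.write(raw)
        entry["touched"] = time.monotonic() if now is None else now

    def finish(self, upload_id: str) -> bytes:
        entry = self.uploads.pop(upload_id)
        try:
            names = sorted(os.listdir(entry["dir"]))
            if names != [f"{i:08d}" for i in range(len(names))]:
                raise ValueError("chunk indexes are not contiguous from 0")
            parts = []
            for name in names:
                with open(os.path.join(entry["dir"], name), "rb") as fh:
                    parts.append(fh.read())
            data = b"".join(parts)
            if len(data) != entry["total_size"]:
                raise ValueError("assembled size does not match total_size")
            return data
        finally:
            # Temp files are removed whether assembly succeeded or failed.
            shutil.rmtree(entry["dir"], ignore_errors=True)

    def sweep(self, now: Optional[float] = None) -> int:
        """Drop uploads idle longer than the TTL; return how many were removed."""
        current = time.monotonic() if now is None else now
        stale = [u for u, e in self.uploads.items() if current - e["touched"] > self.ttl]
        for upload_id in stale:
            shutil.rmtree(self.uploads.pop(upload_id)["dir"], ignore_errors=True)
        return len(stale)
```

Calling sweep periodically (for example from a background task) gives the garbage collection described above. The optional now argument exists only so the timeout can be exercised without waiting.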
D6 — MCP tool descriptions include agent-side search patterns
The MCP tool descriptions for kb_search include guidance on query expansion and reranking as documented patterns, not as engine parameters.
Why: The calling agent has an LLM. Expanding queries (call search N times with variant phrasings, merge results) and reranking (read top results, reorder by relevance) are better done in the agent's context. This keeps the engine deterministic and local.
Implementation: The kb_search tool description includes a note like: "For complex queries, consider expanding into 2-3 variant phrasings and calling this tool multiple times, then deduplicating results by chunk_id. For precision, rerank the returned results using your own judgement."
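The merge step an agent would perform after several kb_search calls can be sketched as follows. Only chunk_id comes from the tool description; the score field and the assumption that a higher score means a better match are hypothetical and should be adapted to the actual result shape.

```python
from typing import Dict, Iterable, List


def merge_results(result_lists: Iterable[List[dict]]) -> List[dict]:
    """Deduplicate hits by chunk_id, keeping the best-scoring copy of each."""
    best: Dict[object, dict] = {}
    for results in result_lists:
        for hit in results:
            chunk_id = hit["chunk_id"]
            # Assumed: each hit carries a numeric 'score', higher is better.
            if chunk_id not in best or hit.get("score", 0.0) > best[chunk_id].get("score", 0.0):
                best[chunk_id] = hit
    return sorted(best.values(), key=lambda h: h.get("score", 0.0), reverse=True)
```

The merged list is then a natural input for the reranking pass, where the agent reads the top results and reorders them by its own judgement.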
D7 — Version bump to 3.0.0 for both engine and client
Engine and client both bump to v3.0.0. MIN_ENGINE_VERSION updates to v3.0.0.
Why: The updated_at column is a schema addition and the new PATCH /api/v1/notes/{id} endpoint is a new API surface. The new client command (updatenote) requires the new engine. A major version bump signals this clearly. The clean break is worth it given the MCP server is a new integration paradigm.
Risks / Trade-offs
MCP SDK maturity — The mcp Python SDK is relatively new. Breaking changes in the SDK could require MCP server updates. Mitigation: the MCP server is a thin adapter, so updating it is low cost. Pin the SDK version.
Tag convention enforcement — Collection tags are a convention, not a constraint at the engine level. Typos create new collections silently (e.g. collection:memeory). Mitigation: the MCP server enforces exclusivity (removes old collection:* tags before applying new) and validates collection names against a known list. The Go CLI does not enforce this — it's a convention for manual users. Direct engine API users can still create arbitrary tags.
Note mutation with long text — The PATCH endpoint uses the standard note chunking pipeline, so long notes are chunked correctly. However, a note that grows very large (thousands of tokens) will produce many chunks and embeddings, making the synchronous PATCH slower. Mitigation: for the agent memory use case, notes are typically short. If a note grows large enough for this to matter, the agent should consider splitting it into multiple notes.
Chunked upload complexity — The three-step upload pattern (start/chunk/finish) is more complex than a single tool call. An agent must make N+2 calls to upload a file. Mitigation: the pattern is deterministic and easily scripted by agents. The MCP tool descriptions will include a clear usage example. Abandoned uploads (agent crashes mid-upload) are cleaned up by a timeout on the MCP server — no permanent state leaks.
MCP server as HTTP client — The MCP server calls the engine over HTTP, adding a network hop. For a compose deployment (both containers on the same Docker network) this adds sub-millisecond latency per call. Acceptable.
Migration Plan
- Engine schema migration: runs automatically on startup, following the same pattern as existing migrations in init_schema. Statement: ALTER TABLE documents ADD COLUMN updated_at TEXT
- New engine endpoint: PATCH /api/v1/notes/{id} for note mutation
- Engine version bump: update engine/VERSION to 3.0.0
- Client updates: new updatenote command, version bump to 3.0.0, MIN_ENGINE_VERSION to 3.0.0
- MCP server: new mcp/ directory, Dockerfile, added to Docker Compose
- Rollback: the schema change is additive (one new column). Rolling back to v2 engine code works fine because v2 ignores updated_at. Rolling back the client is a binary swap. Removing the MCP server container has no effect on engine or CLI.
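Because the migration runs on every startup, it has to be safe to repeat. One way to do that is to check the column list first, as in this sketch. The function name ensure_updated_at is hypothetical and the existing init_schema pattern may differ.

```python
import sqlite3


def ensure_updated_at(conn: sqlite3.Connection) -> bool:
    """Add documents.updated_at if absent; return True when the column was added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(documents)")}
    if "updated_at" in columns:
        return False
    conn.execute("ALTER TABLE documents ADD COLUMN updated_at TEXT")
    conn.commit()
    return True
```

Existing rows receive NULL in the new column, which matches the "never updated" meaning defined in D4.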