## Context

kb v2 is a client-server knowledge base: a Python FastAPI engine (SQLite + FTS5 + sqlite-vec, sentence-transformers embeddings) serving a Go CLI client over HTTP. Agent integration currently works via a Claude Code skill that shells out to the Go binary and parses JSON output.

The engine runs in Docker (NVIDIA/ROCm/CPU variants), keeps the embedding model warm in memory, and handles async ingestion via a background worker. The data model has documents, chunks, embeddings, tags, and jobs — but no concept of collections or note mutation.

This design covers three changes: adding an MCP server as a new integration surface, adding collection-scoped search via tag conventions, and adding in-place note updates.

## Goals / Non-Goals

**Goals:**

- Expose kb as native MCP tools so agents interact with it directly, not via shell subprocess
- Separate agent memory from user documents via collection tags
- Allow notes to be updated in place, preserving document identity
- Support file upload from remote agents via the MCP server
- Keep the engine fully local — no cloud API dependencies
- Maintain backward compatibility: existing CLI, API, and data all continue to work

**Non-Goals:**

- Query expansion or LLM reranking inside the engine (agent-side responsibility)
- File-watching / inotify for auto-reindexing (useful but separate concern)
- Collection-level access control or permissions
- New schema columns for collections (use existing tags)
- Stdio MCP transport (Streamable HTTP only)

## Decisions

### D1 — MCP server as a separate container, Streamable HTTP transport, with its own auth

The MCP server runs as its own Docker container alongside the engine, exposed via Streamable HTTP. It is not embedded into the FastAPI engine app. It requires its own Bearer token (`KB_MCP_API_KEY`) from calling agents.

**Why:** The engine and MCP server have different concerns — the engine manages embeddings, search, and ingestion; the MCP server translates MCP protocol to engine API calls. Keeping them separate means either can be updated independently. Both run as long-lived containers in a Docker Compose stack.

Streamable HTTP (not stdio) because the MCP server is a network service that remote agents connect to, not a subprocess spawned by a local agent. This matches the deployment model: engine + MCP server run on an infrastructure host, agents connect over the network.

The MCP server must have its own authentication because it is HTTP-exposed. Without it, anyone who discovers the endpoint could reach the engine through the MCP server, which proxies every request using its own `KB_API_KEY`. The MCP server validates the agent's Bearer token (`KB_MCP_API_KEY`) before proxying requests to the engine.

**Alternative considered:** Embedding MCP into the FastAPI app as additional routes. Rejected — it couples the MCP SDK lifecycle to the engine, and the engine shouldn't need to know about MCP protocol details. Also considered stdio transport, rejected because it requires the agent and MCP server to share a host. Also considered relying solely on the engine's `KB_API_KEY` for auth. Rejected — the MCP server is a separate network surface and must authenticate its own callers.

**Implementation:** Separate Python package/directory (`mcp/` at repo root). Uses the `mcp` Python SDK with Streamable HTTP transport. Reads engine URL and engine API key from environment variables (`KB_ENGINE_URL`, `KB_API_KEY`). Reads its own auth token from `KB_MCP_API_KEY`. Makes HTTP calls to the engine using `httpx`. Docker Compose file adds the MCP server as a service alongside the engine.
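
The token check itself can be small. The following is a minimal sketch of how the MCP server might validate an incoming `Authorization` header against `KB_MCP_API_KEY`. The function name `is_authorized` is hypothetical, and how the header is obtained from the `mcp` SDK's HTTP layer is not shown here.

```python
import hmac
import os
from typing import Optional


def is_authorized(authorization_header: Optional[str], expected_token: str) -> bool:
    """Return True only for a well-formed 'Bearer <token>' header that matches.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Reject missing headers, and refuse to run "open" if no token is configured.
    if not authorization_header or not expected_token:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(presented.strip().encode(), expected_token.encode())


# The expected token would come from the environment variable named in this design.
EXPECTED_TOKEN = os.environ.get("KB_MCP_API_KEY", "")
```

Treating an empty `KB_MCP_API_KEY` as "deny everything" is an assumption of this sketch; it keeps a misconfigured container from silently accepting unauthenticated callers.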

### D2 — Collections via tag conventions, with MCP-enforced exclusivity

Collections are implemented using the existing tag system with a naming convention: `collection:documents`, `collection:memory`, `collection:workspace`.

**Why:** Tags already exist, already filter search, and are already mutable via the API. A dedicated `collection` column would add a schema migration, new API parameters, and new CLI flags — all duplicating what tags can do.

**Exclusive membership:** The MCP server enforces one collection per document. When adding a document to a collection, the MCP server first removes any existing `collection:*` tags via the engine's tag API, then applies the new one. This prevents a document from appearing in multiple collections and keeps search results clean.

**Tag stripping in MCP responses:** The MCP server strips `collection:*` tags from the `tags` array in search results and presents the collection as a separate `collection` field. Agents see a clean interface: `{"collection": "memory", "tags": ["feedback", "email"]}` rather than raw `collection:memory` mixed in with user tags.

**Implementation:** The MCP tools accept a `collection` parameter (e.g. `"memory"`). The MCP server translates this to tag operations:

- On search: adds `collection:<name>` to the tag filter
- On addnote/addfile: removes any existing `collection:*` tags, then applies `collection:<name>`
- On results: strips `collection:*` from tags, adds a `collection` field

Collections require no engine changes. The Go CLI can use the same convention manually via `--tags collection:memory`.

**Convention:** `collection:documents` is the default. Standard names: `documents`, `memory`, `workspace`. The MCP tool descriptions document these.
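
The translation between the `collection` parameter and `collection:*` tags could look roughly like the sketch below. All function names are hypothetical. Returning `None` for a result that carries no collection tag is an assumption, since this design does not specify that case.

```python
COLLECTION_PREFIX = "collection:"
KNOWN_COLLECTIONS = ("documents", "memory", "workspace")
DEFAULT_COLLECTION = "documents"


def collection_tag(name: str = DEFAULT_COLLECTION) -> str:
    """Map a collection name to its tag, rejecting names outside the known list."""
    if name not in KNOWN_COLLECTIONS:
        raise ValueError(f"unknown collection: {name!r}")
    return COLLECTION_PREFIX + name


def search_tags(user_tags, collection):
    """On search: add collection:<name> to the caller's tag filter."""
    return list(user_tags) + [collection_tag(collection)]


def tags_to_remove(existing_tags):
    """On add: every existing collection:* tag is removed before the new one is applied."""
    return [t for t in existing_tags if t.startswith(COLLECTION_PREFIX)]


def present(result: dict) -> dict:
    """On results: strip collection:* from tags and expose a separate field."""
    tags = result.get("tags", [])
    names = [t[len(COLLECTION_PREFIX):] for t in tags if t.startswith(COLLECTION_PREFIX)]
    shaped = dict(result)
    shaped["tags"] = [t for t in tags if not t.startswith(COLLECTION_PREFIX)]
    # Assumption: a result without any collection tag reports None.
    shaped["collection"] = names[0] if names else None
    return shaped
```

For example, `present({"tags": ["collection:memory", "feedback", "email"]})` yields the shape shown earlier, with `collection` set to `"memory"` and only the user tags left in `tags`.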

### D3 — Note mutation via dedicated PATCH endpoint, with full chunking support

Note updates go through a new synchronous `PATCH /api/v1/notes/{id}` endpoint, not through the async job queue. The endpoint uses the same chunking logic as the ingestion pipeline, not a hardcoded single-chunk assumption.

**Why:** Most notes are short and produce a single chunk. But if an agent updates a note with text that exceeds the embedding model's token window (~256 tokens for MiniLM), a single-chunk approach would silently embed only a portion of the text. Using the standard note chunking pipeline (which today produces one chunk for typical notes) means the endpoint naturally handles longer notes without silent data loss.

**Alternative considered:** Truncating long notes and returning a warning. Rejected — silent data loss or warnings that the agent might ignore are worse than just doing the right thing. Also considered reusing the job queue for consistency. Rejected — the queue's value is async processing of heavy workloads. Notes don't need it.

**Implementation:** The PATCH endpoint:

1. Validates the document exists and is `doc_type = 'note'`
2. Deletes existing chunks, FTS entries, and vector embeddings for that document
3. Runs the new text through the note chunking pipeline (same as ingestion) and inserts the resulting chunks
4. Embeds each chunk and inserts the vectors into `chunks_vec`
5. Updates the document's `content_hash` and `updated_at`
6. Returns the updated document

All of this happens within a single transaction. FTS5 triggers keep the full-text index in sync automatically (existing `chunks_au` and `chunks_ad` triggers handle this). If embedding fails, the transaction rolls back and the old note is preserved.
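
The transactional shape of the steps above can be sketched with the standard `sqlite3` module. This is an illustration under several assumptions, not the engine's code: the column names (`document_id`, `chunk_index`, `text`), the plain `chunk_vectors` table standing in for the sqlite-vec `chunks_vec` table, the SHA-256 content hash, and the `chunk_fn` / `embed_fn` callables are all hypothetical. FTS triggers are omitted.

```python
import hashlib
import sqlite3


def update_note(conn, doc_id, text, chunk_fn, embed_fn):
    """Replace a note's chunks and vectors atomically; roll back on any failure."""
    with conn:  # commits on success, rolls back if an exception escapes
        row = conn.execute(
            "SELECT doc_type FROM documents WHERE id = ?", (doc_id,)
        ).fetchone()
        if row is None or row[0] != "note":
            raise LookupError(f"document {doc_id} is not an existing note")
        # Remove the old vectors first, then the chunks they belong to.
        conn.execute(
            "DELETE FROM chunk_vectors WHERE chunk_id IN "
            "(SELECT id FROM chunks WHERE document_id = ?)",
            (doc_id,),
        )
        conn.execute("DELETE FROM chunks WHERE document_id = ?", (doc_id,))
        # Re-chunk and re-embed; an exception from embed_fn aborts everything.
        for index, chunk in enumerate(chunk_fn(text)):
            cur = conn.execute(
                "INSERT INTO chunks (document_id, chunk_index, text) VALUES (?, ?, ?)",
                (doc_id, index, chunk),
            )
            conn.execute(
                "INSERT INTO chunk_vectors (chunk_id, embedding) VALUES (?, ?)",
                (cur.lastrowid, embed_fn(chunk)),
            )
        conn.execute(
            "UPDATE documents SET content_hash = ?, updated_at = CURRENT_TIMESTAMP "
            "WHERE id = ?",
            (hashlib.sha256(text.encode()).hexdigest(), doc_id),
        )


def make_demo_db():
    """Build a tiny in-memory database with one note (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE documents (id INTEGER PRIMARY KEY, doc_type TEXT,
                                content_hash TEXT, updated_at TEXT);
        CREATE TABLE chunks (id INTEGER PRIMARY KEY, document_id INTEGER,
                             chunk_index INTEGER, text TEXT);
        CREATE TABLE chunk_vectors (chunk_id INTEGER PRIMARY KEY, embedding BLOB);
        INSERT INTO documents (id, doc_type) VALUES (1, 'note');
        INSERT INTO chunks (document_id, chunk_index, text) VALUES (1, 0, 'old text');
        INSERT INTO chunk_vectors (chunk_id, embedding) VALUES (1, x'00');
        """
    )
    return conn
```

Because every write happens inside the `with conn:` block, an embedding failure partway through leaves the old chunk and vector rows in place.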

### D4 — `updated_at` column on documents, set only on mutation

A new `updated_at TEXT` column on `documents`, initially NULL for all existing documents. Set to `current_timestamp` only when a document is modified (note update, tag change).

**Why:** Distinguishes "created" from "last modified". The agent memory use case needs to know when a memory was last updated, not just when it was first created. NULL means "never updated" — cleaner than duplicating `created_at`.

**Date sorting:** Any query that sorts or filters by "most recent" must use `COALESCE(updated_at, created_at)` to ensure un-mutated documents don't disappear from recent lists. This applies to the documents list endpoint and any future "recent" views.
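
A "most recent" query following this rule might look like the sketch below. The `last_touched` alias and the `recent_documents` helper are hypothetical; only the `documents` table and its `created_at` and `updated_at` columns come from this design.

```python
import sqlite3

# COALESCE falls back to created_at for documents that were never mutated,
# so rows with a NULL updated_at still sort by a real timestamp.
RECENT_DOCUMENTS_SQL = """
    SELECT id, COALESCE(updated_at, created_at) AS last_touched
    FROM documents
    ORDER BY last_touched DESC
    LIMIT ?
"""


def recent_documents(conn, limit=10):
    """Return (id, last_touched) rows, newest first."""
    return conn.execute(RECENT_DOCUMENTS_SQL, (limit,)).fetchall()
```

Sorting on `updated_at` alone would instead group every never-updated document together at one end of the list, regardless of when it was created.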

### D5 — File upload via chunked base64, proxied to engine's existing upload API

The MCP server supports file uploads from remote agents using a three-step chunked upload pattern:

1. `kb_upload_start(filename, total_size, tags, collection)` — creates a temporary staging entry on the MCP server, returns a server-generated UUID `upload_id`
2. `kb_upload_chunk(upload_id, data, chunk_index)` — appends a base64-encoded chunk to the staging entry. Called N times.
3. `kb_upload_finish(upload_id)` — reassembles chunks, decodes from base64, and forwards the complete file as a multipart upload to the engine's existing `POST /api/v1/jobs` endpoint. Returns the job ID.

**Why:** The MCP server is remote from the calling agent, so file paths are meaningless. The agent reads the file locally, splits it into chunks, base64-encodes each chunk, and sends them as individual tool calls. No single MCP message needs to carry the entire file, avoiding message size limits regardless of file size.
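
On the agent side, the splitting step can be sketched as follows. The generator name `encode_chunks` is hypothetical, and the default chunk size uses the 1MB recommendation from this design, interpreted here as 1 MiB.

```python
import base64

CHUNK_SIZE = 1024 * 1024  # 1 MiB of raw bytes per chunk (assumed reading of "1MB")


def encode_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield (chunk_index, base64_text) pairs that cover data in order."""
    for index, start in enumerate(range(0, len(data), chunk_size)):
        raw = data[start:start + chunk_size]
        # Each chunk is encoded on its own, so it can be decoded on its own.
        yield index, base64.b64encode(raw).decode("ascii")
```

Each yielded pair maps onto one `kb_upload_chunk(upload_id, data, chunk_index)` call. Encoding chunks independently means the receiver can decode each one as it arrives rather than buffering the whole base64 string.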

The engine's existing upload pipeline handles everything from there: staging, type detection, chunking, and embedding. No new engine code is needed for file transfer.

**Alternative considered:** Single-message base64 upload (`kb_addfile` with full file content). Rejected — works for small files but hits practical MCP message size limits on larger PDFs. Also considered a separate file transfer service (SFTP container). Rejected — adds operational complexity for no benefit over the chunked approach. Also considered a plain HTTP upload endpoint on the MCP server. Rejected — adds a second protocol surface the agent needs to interact with. Also considered a single-call shortcut for small files. Rejected — one path for all files is simpler for agents to learn, and the overhead of 3 calls vs 1 is negligible for an LLM.

**Upload ID:** Server-generated UUID, returned by `kb_upload_start`. Prevents collision and is unpredictable (important since the MCP server is network-exposed).

**Chunk size:** Recommended 1MB raw (before base64 encoding, ~1.33MB encoded) per chunk. A 10MB PDF needs about 10 `kb_upload_chunk` calls, plus one call each to `kb_upload_start` and `kb_upload_finish`. The MCP server holds chunks in a temporary directory and cleans up on finish or after a timeout (e.g. 10 minutes for abandoned uploads).

**Staging cleanup:** The MCP server tracks active uploads in memory. Chunks are written to a temporary directory. On `kb_upload_finish`, chunks are assembled and forwarded. On timeout or error, the temporary files are cleaned up. No persistent state needed — abandoned uploads are simply garbage collected. The temp directory does not need to survive container restarts; if the MCP server restarts mid-upload, the agent retries from `kb_upload_start`.
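
A possible shape for the staging logic is sketched below. The class `UploadStaging` and its methods are hypothetical. Measuring the timeout from the last chunk received (rather than from `kb_upload_start`) and rejecting uploads with gaps in their chunk indexes are assumptions of this sketch. Forwarding the assembled bytes to `POST /api/v1/jobs` is left out.

```python
import base64
import shutil
import tempfile
import time
import uuid
from pathlib import Path


class UploadStaging:
    """In-memory index of active uploads; chunk bytes live in a temp directory."""

    def __init__(self, ttl_seconds=600.0, clock=time.monotonic):
        self._root = Path(tempfile.mkdtemp(prefix="kb-upload-"))
        self._ttl = ttl_seconds
        self._clock = clock
        self._uploads = {}  # upload_id -> {"filename", "dir", "touched"}

    def start(self, filename):
        upload_id = str(uuid.uuid4())  # random, so hard to guess or collide
        directory = self._root / upload_id
        directory.mkdir()
        self._uploads[upload_id] = {
            "filename": Path(filename).name,  # drop client-supplied directories
            "dir": directory,
            "touched": self._clock(),
        }
        return upload_id

    def add_chunk(self, upload_id, chunk_index, data_b64):
        entry = self._uploads[upload_id]  # KeyError for unknown or expired ids
        if chunk_index < 0:
            raise ValueError("chunk_index must be non-negative")
        raw = base64.b64decode(data_b64, validate=True)
        # Zero-padded names make lexical order equal numeric order.
        (entry["dir"] / f"{chunk_index:08d}.part").write_bytes(raw)
        entry["touched"] = self._clock()

    def finish(self, upload_id):
        """Return (filename, file_bytes) and remove the staged chunks."""
        entry = self._uploads.pop(upload_id)
        try:
            parts = sorted(entry["dir"].glob("*.part"))
            expected = [f"{i:08d}.part" for i in range(len(parts))]
            if [p.name for p in parts] != expected:
                raise ValueError("upload has missing chunk indexes")
            return entry["filename"], b"".join(p.read_bytes() for p in parts)
        finally:
            shutil.rmtree(entry["dir"], ignore_errors=True)

    def sweep(self):
        """Delete uploads idle for longer than the TTL; return how many were removed."""
        now = self._clock()
        expired = [uid for uid, e in self._uploads.items()
                   if now - e["touched"] > self._ttl]
        for uid in expired:
            shutil.rmtree(self._uploads.pop(uid)["dir"], ignore_errors=True)
        return len(expired)
```

Writing each chunk to a file named after its index lets chunks arrive out of order and makes a retried chunk overwrite its earlier copy. `sweep` would be called periodically, for example from a background task.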

### D6 — MCP tool descriptions include agent-side search patterns

The MCP tool descriptions for `kb_search` include guidance on query expansion and reranking as documented patterns, not as engine parameters.

**Why:** The calling agent has an LLM. Expanding queries (call search N times with variant phrasings, merge results) and reranking (read top results, reorder by relevance) are better done in the agent's context. This keeps the engine deterministic and local.

**Implementation:** The `kb_search` tool description includes a note like: *"For complex queries, consider expanding into 2-3 variant phrasings and calling this tool multiple times, then deduplicating results by chunk_id. For precision, rerank the returned results using your own judgement."*
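
On the agent side, merging the results of several variant queries might look like this sketch. It assumes each hit is a dict with a `chunk_id` and a numeric `score` where higher is better; the `score` field and the `merge_results` helper are hypothetical.

```python
def merge_results(result_lists):
    """Merge several result lists, keeping the best-scoring hit per chunk_id."""
    best = {}
    for results in result_lists:
        for hit in results:
            current = best.get(hit["chunk_id"])
            # Keep whichever variant query scored this chunk highest.
            if current is None or hit["score"] > current["score"]:
                best[hit["chunk_id"]] = hit
    # Highest score first; the agent can rerank this list afterwards.
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)
```

If scores from different phrasings are not comparable, the agent could deduplicate by `chunk_id` only and leave the ordering to its own reranking step.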

### D7 — Version bump to 3.0.0 for both engine and client

Engine and client both bump to `3.0.0`. `MIN_ENGINE_VERSION` updates to `3.0.0`.

**Why:** The `updated_at` column is a schema addition and the new `PATCH /api/v1/notes/{id}` endpoint is a new API surface. The new client command (`updatenote`) requires the new engine. A major version bump signals this clearly. The clean break is worth it given the MCP server is a new integration paradigm.
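
The compatibility gate implied by `MIN_ENGINE_VERSION` amounts to a numeric comparison of version components. The real client is written in Go; the sketch below uses Python only to illustrate the comparison, and both function names are hypothetical.

```python
def parse_version(text):
    """Turn '3.0.0' or 'v3.0.0' into a tuple of ints such as (3, 0, 0)."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def engine_is_compatible(engine_version, min_engine_version="3.0.0"):
    """True when the engine is at or above the client's minimum version."""
    # Tuples compare component by component, so 2.10.0 sorts after 2.9.0.
    return parse_version(engine_version) >= parse_version(min_engine_version)
```

Comparing parsed integers rather than raw strings matters once any component reaches two digits.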

## Risks / Trade-offs

**MCP SDK maturity** — The `mcp` Python SDK is relatively new. Breaking changes in the SDK could require MCP server updates. Mitigation: the MCP server is a thin adapter, so updating it is low cost. Pin the SDK version.

**Tag convention enforcement** — Collection tags are a convention, not a constraint at the engine level. Typos create new collections silently (e.g. `collection:memeory`). Mitigation: the MCP server enforces exclusivity (removes old `collection:*` tags before applying new) and validates collection names against a known list. The Go CLI does not enforce this — it's a convention for manual users. Direct engine API users can still create arbitrary tags.

**Note mutation with long text** — The PATCH endpoint uses the standard note chunking pipeline, so long notes are chunked correctly. However, a note that grows very large (thousands of tokens) will produce many chunks and embeddings, making the synchronous PATCH slower. Mitigation: for the agent memory use case, notes are typically short. If a note grows large enough for this to matter, the agent should consider splitting it into multiple notes.

**Chunked upload complexity** — The three-step upload pattern (start/chunk/finish) is more complex than a single tool call. An agent must make N+2 calls to upload a file. Mitigation: the pattern is deterministic and easily scripted by agents. The MCP tool descriptions will include a clear usage example. Abandoned uploads (agent crashes mid-upload) are cleaned up by a timeout on the MCP server — no permanent state leaks.

**MCP server as HTTP client** — The MCP server calls the engine over HTTP, adding a network hop. For a compose deployment (both containers on the same Docker network) this adds sub-millisecond latency per call. Acceptable.

## Migration Plan

1. **Engine schema migration** — runs automatically on startup (same pattern as existing migrations in `init_schema`):
   - `ALTER TABLE documents ADD COLUMN updated_at TEXT`
2. **New engine endpoint** — `PATCH /api/v1/notes/{id}` for note mutation
3. **Engine version bump** — update `engine/VERSION` to `3.0.0`
4. **Client updates** — new `updatenote` command, version bump to `3.0.0`, `MIN_ENGINE_VERSION` to `3.0.0`
5. **MCP server** — new `mcp/` directory, Dockerfile, added to Docker Compose
6. **Rollback** — the schema change is additive (one new column). Rolling back to v2 engine code works fine — v2 ignores `updated_at`. Rolling back the client is a binary swap. Removing the MCP server container has no effect on engine or CLI.
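
Since the schema migration runs on every startup, it needs to be safe to repeat. One way to do that is sketched below; the helper name is hypothetical and the existing migrations in `init_schema` may use a different idempotency check.

```python
import sqlite3


def ensure_updated_at_column(conn):
    """Add documents.updated_at if it is missing; return True if it was added."""
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(documents)")}
    if "updated_at" in columns:
        return False
    conn.execute("ALTER TABLE documents ADD COLUMN updated_at TEXT")
    conn.commit()
    return True
```

Existing rows read back `NULL` for the new column, which matches the "never updated" meaning described in D4.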