Add bulk operations and remove collections abstraction

- Add bulk delete, bulk tags, and bulk set-tags engine endpoints
  (POST /api/v1/bulk/delete, /bulk/tags, /bulk/set-tags)
- Filter-based selection: by tags, doc_type, ID list, ID range
- Safety threshold (KB_BULK_SAFETY_PERCENT, default 70%) prevents
  accidental mass operations unless force=true
- Synchronous execution with audit trail via jobs table
- Add kb_bulk_delete, kb_bulk_tags, kb_bulk_set_tags MCP tools
- Add kb bulk-remove, bulk-tag, bulk-set-tags CLI commands
- Remove collection abstraction from MCP server (use tags instead)
- Remove kb_set_collection MCP tool
- Update SKILL.md, MCP.md, README.md documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 22:34:47 +01:00
parent 0c124c4ab7
commit b5a203d2aa
21 changed files with 1619 additions and 112 deletions
+8 -8
@@ -27,25 +27,25 @@ docker run -d --name kb-mcp \
| Tool | Description |
|---|---|
| `kb_search` | Hybrid search with optional collection/tag/type filters |
| `kb_search` | Hybrid search with optional tag/type filters |
| `kb_addnote` | Add a text note (queued for async ingestion) |
| `kb_update_note` | Update an existing note in place |
| `kb_get` | Get document details by ID or source path |
| `kb_delete` | Permanently delete a document by ID |
| `kb_status` | Engine health and statistics |
| `kb_jobs` | Ingestion queue status |
| `kb_upload_start` | Start a chunked file upload |
| `kb_upload_chunk` | Upload a base64-encoded file chunk |
| `kb_upload_finish` | Finish upload and submit for ingestion |
| `kb_bulk_delete` | Delete multiple documents matching a filter |
| `kb_bulk_tags` | Add/remove tags on multiple documents |
| `kb_bulk_set_tags` | Replace all tags on multiple documents |
## Collections
## Organising with tags
The MCP server supports **collections** — scoped document namespaces implemented via tag conventions. Use these to separate agent memory from user documents:
Use tags to separate agent data from user documents. For example, an agent can tag all its notes with `agent:mybot` and filter by that tag when searching. This is a naming convention — configure it in your agent's system prompt. No special server-side enforcement is needed.
- `documents` (default) — user-facing documents
- `memory` — agent memory and preferences
- `workspace` — working context
Tools accept a `collection` parameter. The MCP server translates this to `collection:<name>` tags on the engine, and strips them from responses so agents see a clean `"collection": "memory"` field.
Bulk tools accept filter-based selection (by tags, doc_type, ID list, or ID range) so agents can manage thousands of documents in a single call instead of looping. A safety threshold (default 70%) prevents accidental mass operations unless `force: true` is set.
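As a rough illustration of the filter-based selection described above, the sketch below assembles a bulk request body. The field names (`document_ids`, `tags`, `doc_type`, `from_id`, `to_id`, `force`) are the ones the engine's bulk request model accepts in this commit; the helper name `build_bulk_payload` is hypothetical and is not part of the MCP server.

```python
def build_bulk_payload(document_ids=None, tags=None, doc_type=None,
                       from_id=None, to_id=None, force=False):
    """Hypothetical helper: build a bulk request body, omitting unset filters."""
    payload = {}
    if document_ids:
        payload["document_ids"] = list(document_ids)
    if tags:
        payload["tags"] = list(tags)  # documents must carry ALL of these tags
    if doc_type:
        payload["doc_type"] = doc_type
    if from_id is not None:
        payload["from_id"] = from_id
    if to_id is not None:
        payload["to_id"] = to_id
    if not payload:
        # Mirrors the engine's rule that at least one filter is required.
        raise ValueError("at least one selection filter is required")
    if force:
        payload["force"] = True  # bypass the safety threshold
    return payload


# Select every note tagged with both "draft" and "old".
example = build_bulk_payload(tags=["draft", "old"], doc_type="note")
```

Filters combine with AND logic, so adding more filters can only narrow the selection.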
## MCP server configuration
+7 -2
@@ -14,7 +14,7 @@ MCP Agents ──MCP/HTTP──▶ MCP Server (Docker) ──┘
- **Engine**: Keeps the embedding model warm in memory. Handles search, ingestion, document management, and note mutation via REST API. Runs in Docker with NVIDIA GPU, AMD GPU (ROCm), or CPU-only support.
- **Client**: Single static Go binary. No Python, no ML dependencies, instant startup. Talks to the engine over HTTP.
- **MCP Server**: Exposes kb operations as native MCP tools over Streamable HTTP. Runs as a separate Docker container alongside the engine. Supports collections for scoping agent memory vs user documents.
- **MCP Server**: Exposes kb operations as native MCP tools over Streamable HTTP. Runs as a separate Docker container alongside the engine. Use tags to separate agent data from user documents.
- **Storage**: Single SQLite database with FTS5 (keyword search) and sqlite-vec (vector search). Portable via bind mount — just copy the data directory between hosts.
## Quick start
@@ -149,6 +149,11 @@ kb tag 1 --add important
kb export 1 -o manual.pdf # download original file
kb remove 3 --yes
kb status
# Bulk operations
kb bulk-remove --tags "draft,old" --type note --yes
kb bulk-tag --type note --add "archived" --yes
kb bulk-set-tags --tags "old-scheme" --set "new-scheme" --yes
```
## How it works
@@ -192,7 +197,7 @@ Data is device-agnostic — you can ingest on NVIDIA and serve from AMD or CPU (
The MCP server exposes kb operations as native MCP tools over Streamable HTTP, so agents can search, add notes, upload files, and manage documents without shelling out to the CLI. Includes setup guides for Claude Code, VS Code, Cursor, Windsurf, and JetBrains IDEs.
See **[MCP.md](MCP.md)** for full details — server setup, available tools, collections, configuration, and client examples.
See **[MCP.md](MCP.md)** for full details — server setup, available tools, tag-based organisation, configuration, and client examples.
## Agent skill
+36 -5
@@ -71,6 +71,36 @@ kb tag <doc_id> --add important,ops # add tags to a document
kb tag <doc_id> --remove draft # remove tags from a document
```
## Bulk operations
Operate on multiple documents at once using filter-based selection. Filters combine with AND logic.
**Filter flags (shared across all bulk commands):**
- `--tags tag1,tag2` — match documents with ALL specified tags
- `--type pdf|note|...` — match by document type
- `--ids 1,5,12` — match specific document IDs
- `--from-id N` — match documents with id >= N
- `--to-id N` — match documents with id <= N
- `--force` / `-f` — override safety threshold (blocks operations affecting >70% of all documents)
- `--yes` / `-y` — skip confirmation prompt
```bash
# Bulk delete
kb bulk-remove --tags "draft,old" --type note --yes # delete matching docs
kb bulk-remove --from-id 10 --to-id 50 --yes # delete by ID range
kb bulk-remove --ids "3,7,12" --yes # delete specific IDs
# Bulk tag add/remove
kb bulk-tag --tags "agent:mybot" --add "reviewed" --remove "pending" --yes
kb bulk-tag --type note --add "archived" --yes # tag all notes
# Bulk replace tags
kb bulk-set-tags --tags "old-scheme" --set "new-scheme,migrated" --yes
```
All bulk commands return a summary: matched count, succeeded count, failed count, and errors.
A safety threshold (default 70%, configurable on the engine via `KB_BULK_SAFETY_PERCENT`) blocks any bulk operation that would affect more than that share of all documents unless `--force` is used.
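The threshold check can be pictured as a small pure function. This is a minimal sketch that follows the rule described above (strictly greater than the threshold, skipped when forced); the function name `exceeds_safety_threshold` is an assumption for illustration, not an engine API.

```python
def exceeds_safety_threshold(matched, total, threshold_percent=70, force=False):
    """Return True when a bulk operation should be blocked.

    A threshold of 0 or less disables the check, as does force=True
    or an empty database.
    """
    if threshold_percent <= 0 or force or total == 0:
        return False
    return (matched / total) * 100 > threshold_percent


blocked = exceeds_safety_threshold(matched=8, total=10)
allowed = exceeds_safety_threshold(matched=5, total=10)
forced = exceeds_safety_threshold(matched=10, total=10, force=True)
```

Because the comparison is strict, an operation that matches exactly the threshold share of documents is still allowed.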
## Jobs (ingestion queue)
```bash
@@ -172,12 +202,13 @@ For agent-to-agent integration, kb provides an MCP server alongside the CLI. The
exposes the same operations as native MCP tools over Streamable HTTP transport, which agents
can connect to directly without subprocess overhead.
**MCP tools:** `kb_search`, `kb_addnote`, `kb_update_note`, `kb_get`, `kb_status`, `kb_jobs`,
`kb_upload_start`, `kb_upload_chunk`, `kb_upload_finish`.
**MCP tools:** `kb_search`, `kb_addnote`, `kb_update_note`, `kb_get`, `kb_delete`, `kb_status`,
`kb_jobs`, `kb_upload_start`, `kb_upload_chunk`, `kb_upload_finish`, `kb_bulk_delete`,
`kb_bulk_tags`, `kb_bulk_set_tags`.
The MCP server supports **collections** — scoped document namespaces (e.g. `memory`, `documents`,
`workspace`) implemented via tag conventions. This is the recommended way for agents to separate
their memory from user documents.
Use tags to separate agent data from user documents (e.g. tag all agent notes with
`agent:mybot` and filter by that tag when searching). This convention is communicated
via system prompt — no special server-side enforcement needed.
If the kb engine is already running via Docker Compose, add the MCP server by deploying the
`kb-mcp` service from the same compose file. Agents connect to it on port 3000 (default).
+1 -1
@@ -1 +1 @@
3.0.0
3.2.0
+1 -1
@@ -1 +1 @@
3.0.0
3.2.0
+186
@@ -0,0 +1,186 @@
package cmd
import (
"bufio"
"fmt"
"os"
"strconv"
"strings"
"github.com/kb-search/kb/internal/api"
"github.com/kb-search/kb/internal/output"
"github.com/spf13/cobra"
)
var bulkRemoveCmd = &cobra.Command{
Use: "bulk-remove",
Short: "Delete multiple documents matching a filter",
RunE: runBulkRemove,
}
func init() {
addBulkFilterFlags(bulkRemoveCmd)
rootCmd.AddCommand(bulkRemoveCmd)
}
func runBulkRemove(cmd *cobra.Command, args []string) error {
body, err := buildBulkBody(cmd)
if err != nil {
return err
}
yes, _ := cmd.Flags().GetBool("yes")
if !yes {
desc := describeBulkFilter(cmd)
fmt.Printf("This will delete documents matching: %s\nProceed? [y/N] ", desc)
reader := bufio.NewReader(os.Stdin)
answer, _ := reader.ReadString('\n')
answer = strings.TrimSpace(strings.ToLower(answer))
if answer != "y" && answer != "yes" {
fmt.Println("Cancelled.")
return nil
}
}
client := api.NewClient()
resp, err := client.Post("/api/v1/bulk/delete", body)
if err != nil {
fmt.Fprintln(os.Stderr, err)
os.Exit(1)
}
if err := api.CheckError(resp); err != nil {
fmt.Fprintln(os.Stderr, err)
os.Exit(1)
}
var result map[string]interface{}
if err := api.DecodeJSON(resp, &result); err != nil {
return fmt.Errorf("failed to decode response: %w", err)
}
if output.IsJSON() {
output.PrintJSON(result)
} else {
printBulkResult("Deleted", result)
}
return nil
}
// ---------------------------------------------------------------------------
// Shared helpers for all bulk commands
// ---------------------------------------------------------------------------
func addBulkFilterFlags(cmd *cobra.Command) {
cmd.Flags().String("tags", "", "filter by tags (comma-separated)")
cmd.Flags().String("type", "", "filter by document type")
cmd.Flags().String("ids", "", "filter by document IDs (comma-separated)")
cmd.Flags().Int("from-id", 0, "filter by id >= value")
cmd.Flags().Int("to-id", 0, "filter by id <= value")
cmd.Flags().BoolP("force", "f", false, "override safety threshold")
cmd.Flags().BoolP("yes", "y", false, "skip confirmation prompt")
}
func buildBulkBody(cmd *cobra.Command) (map[string]interface{}, error) {
body := map[string]interface{}{}
tagsStr, _ := cmd.Flags().GetString("tags")
if tagsStr != "" {
body["tags"] = splitTags(tagsStr)
}
docType, _ := cmd.Flags().GetString("type")
if docType != "" {
body["doc_type"] = docType
}
idsStr, _ := cmd.Flags().GetString("ids")
if idsStr != "" {
ids, err := parseIntList(idsStr)
if err != nil {
return nil, fmt.Errorf("invalid --ids: %w", err)
}
body["document_ids"] = ids
}
fromID, _ := cmd.Flags().GetInt("from-id")
if fromID > 0 {
body["from_id"] = fromID
}
toID, _ := cmd.Flags().GetInt("to-id")
if toID > 0 {
body["to_id"] = toID
}
force, _ := cmd.Flags().GetBool("force")
if force {
body["force"] = true
}
// Ensure at least one filter
hasFilter := tagsStr != "" || docType != "" || idsStr != "" || fromID > 0 || toID > 0
if !hasFilter {
return nil, fmt.Errorf("at least one filter is required (--tags, --type, --ids, --from-id, --to-id)")
}
return body, nil
}
func describeBulkFilter(cmd *cobra.Command) string {
var parts []string
tagsStr, _ := cmd.Flags().GetString("tags")
if tagsStr != "" {
parts = append(parts, fmt.Sprintf("tags=[%s]", tagsStr))
}
docType, _ := cmd.Flags().GetString("type")
if docType != "" {
parts = append(parts, fmt.Sprintf("type=%s", docType))
}
idsStr, _ := cmd.Flags().GetString("ids")
if idsStr != "" {
parts = append(parts, fmt.Sprintf("ids=[%s]", idsStr))
}
fromID, _ := cmd.Flags().GetInt("from-id")
if fromID > 0 {
parts = append(parts, fmt.Sprintf("from_id=%d", fromID))
}
toID, _ := cmd.Flags().GetInt("to-id")
if toID > 0 {
parts = append(parts, fmt.Sprintf("to_id=%d", toID))
}
return strings.Join(parts, " ")
}
func printBulkResult(action string, result map[string]interface{}) {
	// Checked assertions: a missing or non-numeric field reads as 0 instead of panicking.
	count := func(key string) int {
		v, _ := result[key].(float64)
		return int(v)
	}
	matched, succeeded, failed := count("matched"), count("succeeded"), count("failed")
	fmt.Printf("%s %d of %d documents", action, succeeded, matched)
	if failed > 0 {
		fmt.Printf(" (%d failed)", failed)
	}
	fmt.Println()
}
func parseIntList(s string) ([]int, error) {
var ids []int
for _, part := range strings.Split(s, ",") {
part = strings.TrimSpace(part)
if part == "" {
continue
}
id, err := strconv.Atoi(part)
if err != nil {
return nil, fmt.Errorf("invalid ID %q: %w", part, err)
}
ids = append(ids, id)
}
return ids, nil
}
+73
@@ -0,0 +1,73 @@
package cmd
import (
"bufio"
"fmt"
"os"
"strings"
"github.com/kb-search/kb/internal/api"
"github.com/kb-search/kb/internal/output"
"github.com/spf13/cobra"
)
var bulkSetTagsCmd = &cobra.Command{
Use: "bulk-set-tags",
Short: "Replace all tags on multiple documents matching a filter",
RunE: runBulkSetTags,
}
func init() {
addBulkFilterFlags(bulkSetTagsCmd)
bulkSetTagsCmd.Flags().String("set", "", "replacement tags (comma-separated)")
rootCmd.AddCommand(bulkSetTagsCmd)
}
func runBulkSetTags(cmd *cobra.Command, args []string) error {
body, err := buildBulkBody(cmd)
if err != nil {
return err
}
setStr, _ := cmd.Flags().GetString("set")
if setStr == "" {
return fmt.Errorf("--set is required (comma-separated list of replacement tags)")
}
body["new_tags"] = splitTags(setStr)
yes, _ := cmd.Flags().GetBool("yes")
if !yes {
desc := describeBulkFilter(cmd)
fmt.Printf("This will replace all tags with [%s] on documents matching: %s\nProceed? [y/N] ", setStr, desc)
reader := bufio.NewReader(os.Stdin)
answer, _ := reader.ReadString('\n')
answer = strings.TrimSpace(strings.ToLower(answer))
if answer != "y" && answer != "yes" {
fmt.Println("Cancelled.")
return nil
}
}
client := api.NewClient()
resp, err := client.Post("/api/v1/bulk/set-tags", body)
if err != nil {
fmt.Fprintln(os.Stderr, err)
os.Exit(1)
}
if err := api.CheckError(resp); err != nil {
fmt.Fprintln(os.Stderr, err)
os.Exit(1)
}
var result map[string]interface{}
if err := api.DecodeJSON(resp, &result); err != nil {
return fmt.Errorf("failed to decode response: %w", err)
}
if output.IsJSON() {
output.PrintJSON(result)
} else {
printBulkResult("Set tags on", result)
}
return nil
}
+92
@@ -0,0 +1,92 @@
package cmd
import (
"bufio"
"fmt"
"os"
"strings"
"github.com/kb-search/kb/internal/api"
"github.com/kb-search/kb/internal/output"
"github.com/spf13/cobra"
)
var bulkTagCmd = &cobra.Command{
Use: "bulk-tag",
Short: "Add or remove tags on multiple documents matching a filter",
RunE: runBulkTag,
}
func init() {
addBulkFilterFlags(bulkTagCmd)
bulkTagCmd.Flags().String("add", "", "tags to add (comma-separated)")
bulkTagCmd.Flags().String("remove", "", "tags to remove (comma-separated)")
rootCmd.AddCommand(bulkTagCmd)
}
func runBulkTag(cmd *cobra.Command, args []string) error {
body, err := buildBulkBody(cmd)
if err != nil {
return err
}
addStr, _ := cmd.Flags().GetString("add")
removeStr, _ := cmd.Flags().GetString("remove")
if addStr == "" && removeStr == "" {
return fmt.Errorf("specify --add and/or --remove")
}
if addStr != "" {
body["add"] = splitTags(addStr)
}
if removeStr != "" {
body["remove"] = splitTags(removeStr)
}
yes, _ := cmd.Flags().GetBool("yes")
if !yes {
desc := describeBulkFilter(cmd)
action := ""
if addStr != "" {
action += fmt.Sprintf("add=[%s]", addStr)
}
if removeStr != "" {
if action != "" {
action += " "
}
action += fmt.Sprintf("remove=[%s]", removeStr)
}
fmt.Printf("This will update tags (%s) on documents matching: %s\nProceed? [y/N] ", action, desc)
reader := bufio.NewReader(os.Stdin)
answer, _ := reader.ReadString('\n')
answer = strings.TrimSpace(strings.ToLower(answer))
if answer != "y" && answer != "yes" {
fmt.Println("Cancelled.")
return nil
}
}
client := api.NewClient()
resp, err := client.Post("/api/v1/bulk/tags", body)
if err != nil {
fmt.Fprintln(os.Stderr, err)
os.Exit(1)
}
if err := api.CheckError(resp); err != nil {
fmt.Fprintln(os.Stderr, err)
os.Exit(1)
}
var result map[string]interface{}
if err := api.DecodeJSON(resp, &result); err != nil {
return fmt.Errorf("failed to decode response: %w", err)
}
if output.IsJSON() {
output.PrintJSON(result)
} else {
printBulkResult("Tagged", result)
}
return nil
}
+1 -1
@@ -1 +1 @@
3.0.1
3.2.0
+1
@@ -20,6 +20,7 @@ class Config:
self.ingest_device = os.environ.get("KB_INGEST_DEVICE", "auto")
self.api_key = os.environ.get("KB_API_KEY") or None
self.search_threshold = float(os.environ.get("KB_SEARCH_THRESHOLD", "0.01"))
self.bulk_safety_percent = int(os.environ.get("KB_BULK_SAFETY_PERCENT", "70"))
self.host = os.environ.get("KB_HOST", "0.0.0.0")
self.port = int(os.environ.get("KB_PORT", "8000"))
+91
@@ -189,6 +189,11 @@ def init_schema(conn: sqlite3.Connection, embedding_dim: int) -> None:
if "updated_at" not in doc_cols:
conn.execute("ALTER TABLE documents ADD COLUMN updated_at TEXT")
# Migrate: add job_type to jobs if missing (bulk operations)
job_cols = {row[1] for row in conn.execute("PRAGMA table_info(jobs)").fetchall()}
if "job_type" not in job_cols:
conn.execute("ALTER TABLE jobs ADD COLUMN job_type TEXT DEFAULT 'ingest'")
conn.commit()
@@ -329,6 +334,92 @@ def untag_document(conn: sqlite3.Connection, document_id: int, tag_names: list[s
conn.commit()
# ---------------------------------------------------------------------------
# Bulk operation helpers
# ---------------------------------------------------------------------------
def resolve_bulk_selection(
    conn: sqlite3.Connection,
    document_ids: list[int] | None = None,
    tags: list[str] | None = None,
    doc_type: str | None = None,
    from_id: int | None = None,
    to_id: int | None = None,
) -> list[int]:
    """Return document IDs matching the bulk selection filter.

    Filters combine with AND logic. At least one filter must be provided.
    """
    sql = "SELECT DISTINCT d.id FROM documents d"
    joins: list[str] = []
    where: list[str] = []
    params: list = []

    if tags:
        for i, tag in enumerate(tags):
            joins.append(f"JOIN document_tags dt{i} ON d.id = dt{i}.document_id")
            joins.append(f"JOIN tags t{i} ON dt{i}.tag_id = t{i}.id")
            where.append(f"t{i}.name = ?")
            params.append(tag)

    if doc_type:
        where.append("d.doc_type = ?")
        params.append(doc_type)

    if document_ids:
        placeholders = ",".join("?" for _ in document_ids)
        where.append(f"d.id IN ({placeholders})")
        params.extend(document_ids)

    if from_id is not None:
        where.append("d.id >= ?")
        params.append(from_id)

    if to_id is not None:
        where.append("d.id <= ?")
        params.append(to_id)

    if joins:
        sql += " " + " ".join(joins)
    if where:
        sql += " WHERE " + " AND ".join(where)

    rows = conn.execute(sql, params).fetchall()
    return [row["id"] for row in rows]
def create_bulk_job(
conn: sqlite3.Connection,
job_type: str,
filters_json: str,
matched: int,
succeeded: int,
failed: int,
errors_json: str = "[]",
) -> int:
"""Create an audit log entry for a bulk operation and return its id."""
cur = conn.execute(
"""INSERT INTO jobs(filename, status, job_type, document_id, chunk_count, error, completed_at)
VALUES (?, ?, ?, ?, ?, ?, current_timestamp)""",
(
filters_json,
"done" if failed == 0 else "partial_failure",
job_type,
matched,
succeeded,
errors_json if failed > 0 else None,
),
)
conn.commit()
return cur.lastrowid
def count_documents(conn: sqlite3.Connection) -> int:
"""Return total number of documents in the database."""
row = conn.execute("SELECT COUNT(*) AS cnt FROM documents").fetchone()
return row["cnt"]
# ---------------------------------------------------------------------------
# Vec table management
# ---------------------------------------------------------------------------
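To see why one pair of joins per tag yields "ALL tags" semantics, here is a small self-contained sketch against an in-memory SQLite database. The three-table schema (`documents`, `tags`, `document_tags`) is inferred from the queries in this diff and simplified; the function name `select_ids_with_all_tags` and the sample rows are made up for illustration.

```python
import sqlite3


def select_ids_with_all_tags(conn, tags):
    """Build one JOIN pair per tag so a document must carry every tag."""
    sql = "SELECT DISTINCT d.id FROM documents d"
    where, params = [], []
    for i, tag in enumerate(tags):
        sql += f" JOIN document_tags dt{i} ON d.id = dt{i}.document_id"
        sql += f" JOIN tags t{i} ON dt{i}.tag_id = t{i}.id"
        where.append(f"t{i}.name = ?")
        params.append(tag)  # tag values are bound, never interpolated
    if where:
        sql += " WHERE " + " AND ".join(where)
    sql += " ORDER BY d.id"
    return [row[0] for row in conn.execute(sql, params)]


# Assumed, simplified schema with three sample documents.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, doc_type TEXT);
CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE document_tags (document_id INTEGER, tag_id INTEGER);
INSERT INTO documents VALUES (1, 'note'), (2, 'note'), (3, 'pdf');
INSERT INTO tags VALUES (1, 'draft'), (2, 'old');
INSERT INTO document_tags VALUES (1, 1), (1, 2), (2, 1), (3, 2);
""")

both = select_ids_with_all_tags(conn, ["draft", "old"])
draft_only = select_ids_with_all_tags(conn, ["draft"])
```

Only document 1 carries both tags, so it is the only match for the two-tag filter, while the single-tag filter also picks up document 2.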
+281
@@ -0,0 +1,281 @@
"""Bulk operation endpoints — delete, tag, and set-tags on multiple documents."""
import json
import logging
from pathlib import Path
from typing import Optional
from fastapi import HTTPException
from pydantic import BaseModel, model_validator
from main import app
from kb.config import cfg
from kb.database import (
get_connection,
resolve_bulk_selection,
count_documents,
create_bulk_job,
tag_document,
untag_document,
)
logger = logging.getLogger("kb.routes.bulk")
# ---------------------------------------------------------------------------
# Request models
# ---------------------------------------------------------------------------
class BulkSelectionRequest(BaseModel):
document_ids: Optional[list[int]] = None
tags: Optional[list[str]] = None
doc_type: Optional[str] = None
from_id: Optional[int] = None
to_id: Optional[int] = None
force: bool = False
@model_validator(mode="after")
def require_at_least_one_filter(self):
if not any([self.document_ids, self.tags, self.doc_type,
self.from_id is not None, self.to_id is not None]):
raise ValueError("At least one selection filter is required")
return self
class BulkDeleteRequest(BulkSelectionRequest):
pass
class BulkTagsRequest(BulkSelectionRequest):
add: Optional[list[str]] = None
remove: Optional[list[str]] = None
@model_validator(mode="after")
def require_add_or_remove(self):
if not self.add and not self.remove:
raise ValueError("At least one of 'add' or 'remove' is required")
return self
class BulkSetTagsRequest(BulkSelectionRequest):
new_tags: list[str]
# ---------------------------------------------------------------------------
# Shared helpers
# ---------------------------------------------------------------------------
def _check_safety_threshold(matched: int, total: int, force: bool) -> None:
    """Raise 409 if the operation would affect too many documents."""
    threshold = cfg.bulk_safety_percent
    if threshold <= 0 or force or total == 0:
        return
    percent = (matched / total) * 100
    if percent > threshold:
        raise HTTPException(
            status_code=409,
            detail={
                "error": "safety_threshold_exceeded",
                "message": (
                    f"Operation would affect {matched} of {total} documents "
                    f"({percent:.1f}%). Exceeds safety threshold of {threshold}%. "
                    f"Use force: true to proceed."
                ),
                "matched": matched,
                "total": total,
                "percent": round(percent, 1),
                "threshold": threshold,
            },
        )
def _filters_dict(req: BulkSelectionRequest) -> str:
"""Build a JSON string of the selection filter for audit logging."""
d = {}
if req.document_ids:
d["document_ids"] = req.document_ids
if req.tags:
d["tags"] = req.tags
if req.doc_type:
d["doc_type"] = req.doc_type
if req.from_id is not None:
d["from_id"] = req.from_id
if req.to_id is not None:
d["to_id"] = req.to_id
return json.dumps(d)
# ---------------------------------------------------------------------------
# Endpoints
# ---------------------------------------------------------------------------
@app.post("/api/v1/bulk/delete")
async def bulk_delete(req: BulkDeleteRequest):
    conn = get_connection(cfg.db_path)
    try:
        doc_ids = resolve_bulk_selection(
            conn, req.document_ids, req.tags, req.doc_type, req.from_id, req.to_id,
        )
        total = count_documents(conn)
        _check_safety_threshold(len(doc_ids), total, req.force)

        succeeded = 0
        failed = 0
        errors = []
        stored_files: list[str] = []

        for doc_id in doc_ids:
            try:
                doc = conn.execute(
                    "SELECT id, stored_path FROM documents WHERE id = ?", (doc_id,)
                ).fetchone()
                if not doc:
                    failed += 1
                    errors.append({"document_id": doc_id, "error": "not found"})
                    continue

                # Delete embeddings
                chunk_ids = conn.execute(
                    "SELECT id FROM chunks WHERE document_id = ?", (doc_id,)
                ).fetchall()
                for row in chunk_ids:
                    conn.execute("DELETE FROM chunks_vec WHERE chunk_id = ?", (row["id"],))

                # Delete document (cascades to chunks, document_tags)
                conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))

                # Queue the stored file for removal only after the row delete
                # succeeded, so a failed delete never loses its file.
                if doc["stored_path"]:
                    stored_files.append(doc["stored_path"])
                succeeded += 1
            except Exception as exc:
                failed += 1
                errors.append({"document_id": doc_id, "error": str(exc)})

        conn.commit()

        # Best-effort file cleanup after commit
        for path in stored_files:
            try:
                f = Path(path)
                if f.exists():
                    f.unlink()
            except OSError as exc:
                logger.warning("Failed to delete stored file %s: %s", path, exc)

        errors_json = json.dumps(errors) if errors else "[]"
        job_id = create_bulk_job(
            conn, "bulk_delete", _filters_dict(req),
            len(doc_ids), succeeded, failed, errors_json,
        )

        return {
            "job_id": job_id,
            "status": "done" if failed == 0 else "partial_failure",
            "matched": len(doc_ids),
            "succeeded": succeeded,
            "failed": failed,
            "errors": errors,
        }
    finally:
        conn.close()
@app.post("/api/v1/bulk/tags")
async def bulk_tags(req: BulkTagsRequest):
conn = get_connection(cfg.db_path)
try:
doc_ids = resolve_bulk_selection(
conn, req.document_ids, req.tags, req.doc_type, req.from_id, req.to_id,
)
total = count_documents(conn)
_check_safety_threshold(len(doc_ids), total, req.force)
succeeded = 0
failed = 0
errors = []
for doc_id in doc_ids:
try:
if req.add:
tag_document(conn, doc_id, req.add)
if req.remove:
untag_document(conn, doc_id, req.remove)
conn.execute(
"UPDATE documents SET updated_at = current_timestamp WHERE id = ?",
(doc_id,),
)
succeeded += 1
except Exception as exc:
failed += 1
errors.append({"document_id": doc_id, "error": str(exc)})
conn.commit()
errors_json = json.dumps(errors) if errors else "[]"
job_id = create_bulk_job(
conn, "bulk_tags", _filters_dict(req),
len(doc_ids), succeeded, failed, errors_json,
)
return {
"job_id": job_id,
"status": "done" if failed == 0 else "partial_failure",
"matched": len(doc_ids),
"succeeded": succeeded,
"failed": failed,
"errors": errors,
}
finally:
conn.close()
@app.post("/api/v1/bulk/set-tags")
async def bulk_set_tags(req: BulkSetTagsRequest):
conn = get_connection(cfg.db_path)
try:
doc_ids = resolve_bulk_selection(
conn, req.document_ids, req.tags, req.doc_type, req.from_id, req.to_id,
)
total = count_documents(conn)
_check_safety_threshold(len(doc_ids), total, req.force)
succeeded = 0
failed = 0
errors = []
for doc_id in doc_ids:
try:
# Remove all existing tags
conn.execute(
"DELETE FROM document_tags WHERE document_id = ?", (doc_id,)
)
# Apply new tag set
if req.new_tags:
tag_document(conn, doc_id, req.new_tags)
conn.execute(
"UPDATE documents SET updated_at = current_timestamp WHERE id = ?",
(doc_id,),
)
succeeded += 1
except Exception as exc:
failed += 1
errors.append({"document_id": doc_id, "error": str(exc)})
conn.commit()
errors_json = json.dumps(errors) if errors else "[]"
job_id = create_bulk_job(
conn, "bulk_set_tags", _filters_dict(req),
len(doc_ids), succeeded, failed, errors_json,
)
return {
"job_id": job_id,
"status": "done" if failed == 0 else "partial_failure",
"matched": len(doc_ids),
"succeeded": succeeded,
"failed": failed,
"errors": errors,
}
finally:
conn.close()
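All three endpoints share the same per-document accounting: each document is attempted independently, failures are collected rather than aborting the batch, and the summary reports `done` or `partial_failure`. The sketch below isolates that pattern without any database; `run_bulk` and `flaky` are hypothetical names used only for illustration.

```python
def run_bulk(doc_ids, operation):
    """Apply operation to each ID, collecting failures instead of aborting."""
    succeeded, failed, errors = 0, 0, []
    for doc_id in doc_ids:
        try:
            operation(doc_id)
            succeeded += 1
        except Exception as exc:
            failed += 1
            errors.append({"document_id": doc_id, "error": str(exc)})
    return {
        "status": "done" if failed == 0 else "partial_failure",
        "matched": len(doc_ids),
        "succeeded": succeeded,
        "failed": failed,
        "errors": errors,
    }


def flaky(doc_id):
    """Stand-in operation that fails for one document."""
    if doc_id == 2:
        raise ValueError("locked")


summary = run_bulk([1, 2, 3], flaky)
```

A caller can treat `partial_failure` as a signal to inspect `errors` and retry only the listed document IDs.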
+1 -1
@@ -62,7 +62,7 @@ async def lifespan(app: FastAPI):
app = FastAPI(title="kb-engine", version=__version__, lifespan=lifespan)
# Import routes after app is created
from kb.routes import health, search, jobs, documents, tags, status, reindex, auth, notes # noqa: E402, F401
from kb.routes import health, search, jobs, documents, tags, status, reindex, auth, notes, bulk # noqa: E402, F401
if __name__ == "__main__":
import uvicorn
+87
@@ -106,6 +106,93 @@ def update_tags(doc_id: int, add: list[str] | None = None,
return r.json()
def delete_document(doc_id: int) -> dict:
with _client() as c:
r = c.delete(f"/api/v1/documents/{doc_id}")
r.raise_for_status()
return r.json()
def _bulk_body(
document_ids: list[int] | None = None,
tags: list[str] | None = None,
doc_type: str | None = None,
from_id: int | None = None,
to_id: int | None = None,
force: bool = False,
**extra,
) -> dict:
body: dict = {}
if document_ids:
body["document_ids"] = document_ids
if tags:
body["tags"] = tags
if doc_type:
body["doc_type"] = doc_type
if from_id is not None:
body["from_id"] = from_id
if to_id is not None:
body["to_id"] = to_id
if force:
body["force"] = True
body.update(extra)
return body
def bulk_delete(
document_ids: list[int] | None = None,
tags: list[str] | None = None,
doc_type: str | None = None,
from_id: int | None = None,
to_id: int | None = None,
force: bool = False,
) -> dict:
body = _bulk_body(document_ids, tags, doc_type, from_id, to_id, force)
with _client() as c:
r = c.post("/api/v1/bulk/delete", json=body)
r.raise_for_status()
return r.json()
def bulk_tags(
document_ids: list[int] | None = None,
tags: list[str] | None = None,
doc_type: str | None = None,
from_id: int | None = None,
to_id: int | None = None,
add: list[str] | None = None,
remove: list[str] | None = None,
force: bool = False,
) -> dict:
extra = {}
if add:
extra["add"] = add
if remove:
extra["remove"] = remove
body = _bulk_body(document_ids, tags, doc_type, from_id, to_id, force, **extra)
with _client() as c:
r = c.post("/api/v1/bulk/tags", json=body)
r.raise_for_status()
return r.json()
def bulk_set_tags(
document_ids: list[int] | None = None,
tags: list[str] | None = None,
doc_type: str | None = None,
from_id: int | None = None,
to_id: int | None = None,
new_tags: list[str] | None = None,
force: bool = False,
) -> dict:
extra = {"new_tags": new_tags or []}
body = _bulk_body(document_ids, tags, doc_type, from_id, to_id, force, **extra)
with _client() as c:
r = c.post("/api/v1/bulk/set-tags", json=body)
r.raise_for_status()
return r.json()
def upload_file(filename: str, file_bytes: bytes,
tags: list[str] | None = None) -> dict:
fields: dict = {}
+136 -93
@@ -20,68 +20,6 @@ import uploads
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("kb.mcp")
# ---------------------------------------------------------------------------
# Collection helpers
# ---------------------------------------------------------------------------
COLLECTION_TAG_PREFIX = "collection:"
DEFAULT_COLLECTION = "documents"
def _collection_tag(collection: str | None) -> str:
return f"{COLLECTION_TAG_PREFIX}{collection or DEFAULT_COLLECTION}"
def _strip_collection_tags(tags: list[str]) -> tuple[str | None, list[str]]:
"""Split tags into (collection, remaining_tags)."""
collection = None
remaining = []
for t in tags:
if t.startswith(COLLECTION_TAG_PREFIX):
collection = t[len(COLLECTION_TAG_PREFIX):]
else:
remaining.append(t)
return collection, remaining
def _process_document(doc: dict) -> dict:
"""Strip collection tags from a document dict and add collection field."""
tags = doc.get("tags", [])
collection, clean_tags = _strip_collection_tags(tags)
doc["tags"] = clean_tags
doc["collection"] = collection
return doc
def _process_search_results(results: list[dict]) -> list[dict]:
"""Strip collection tags from search result dicts."""
for r in results:
if "tags" in r:
collection, clean_tags = _strip_collection_tags(r["tags"])
r["tags"] = clean_tags
r["collection"] = collection
if "document" in r and "tags" in r["document"]:
collection, clean_tags = _strip_collection_tags(r["document"]["tags"])
r["document"]["tags"] = clean_tags
r["document"]["collection"] = collection
return results
async def _ensure_exclusive_collection(doc_id: int, collection: str) -> None:
"""Remove existing collection tags and apply the new one."""
doc = engine.get_document(doc_id)
existing_collection_tags = [
t for t in doc.get("tags", [])
if t.startswith(COLLECTION_TAG_PREFIX)
]
new_tag = _collection_tag(collection)
if existing_collection_tags == [new_tag]:
return
if existing_collection_tags:
engine.update_tags(doc_id, remove=existing_collection_tags)
engine.update_tags(doc_id, add=[new_tag])
# ---------------------------------------------------------------------------
# Transport security — DNS rebinding protection with configurable allowed hosts
# ---------------------------------------------------------------------------
@@ -107,9 +45,10 @@ mcp = FastMCP(
"kb",
instructions=(
"Knowledge base MCP server. Provides tools for searching, adding, and "
"managing documents and notes. This server requires Bearer token "
"authentication — all requests are authenticated via the Authorization "
"header at the HTTP transport layer."
"managing documents and notes. Use tags to organise and filter documents "
"(e.g. tag notes with 'agent:mybot' and filter searches by that tag). "
"This server requires Bearer token authentication — all requests are "
"authenticated via the Authorization header at the HTTP transport layer."
),
transport_security=_transport_security,
)
@@ -121,7 +60,6 @@ async def kb_search(
top: int = 10,
tags: list[str] | None = None,
doc_type: str | None = None,
collection: str | None = None,
fts_only: bool = False,
) -> str:
"""Search the knowledge base for relevant documents and notes.
@@ -134,7 +72,6 @@ async def kb_search(
top: Maximum number of results to return (default 10).
tags: Filter results to documents with ALL of these tags.
doc_type: Filter by document type (e.g. "note", "pdf", "markdown", "code").
collection: Filter by collection name (e.g. "documents", "memory", "workspace").
fts_only: If true, use only full-text search (no vector similarity).
Tips for complex queries:
@@ -144,27 +81,21 @@ async def kb_search(
- For precision, rerank the returned results using your own judgement based on
relevance to the original question.
"""
search_tags = list(tags) if tags else []
if collection:
search_tags.append(_collection_tag(collection))
result = engine.search(
query=query,
top=top,
tags=search_tags or None,
tags=tags or None,
doc_type=doc_type,
fts_only=fts_only,
)
results_list = result if isinstance(result, list) else result.get("results", [])
processed = _process_search_results(results_list)
return json.dumps(processed, indent=2)
return json.dumps(results_list, indent=2)
@mcp.tool()
async def kb_addnote(
text: str,
collection: str | None = None,
tags: list[str] | None = None,
title: str | None = None,
) -> str:
@@ -175,15 +106,10 @@ async def kb_addnote(
Args:
text: The note text content.
collection: Collection to add the note to (default "documents").
Standard collections: "documents", "memory", "workspace".
tags: Additional tags to apply to the note.
tags: Tags to apply to the note.
title: Optional title (auto-derived from first line if omitted).
"""
all_tags = list(tags) if tags else []
all_tags.append(_collection_tag(collection))
result = engine.add_note(text=text, tags=all_tags, title=title)
result = engine.add_note(text=text, tags=tags or None, title=title)
return json.dumps(result, indent=2)
@@ -203,7 +129,7 @@ async def kb_update_note(
text: The new text content for the note.
"""
result = engine.update_note(document_id, text)
return json.dumps(_process_document(result), indent=2)
return json.dumps(result, indent=2)
@mcp.tool()
@@ -222,14 +148,14 @@ async def kb_get(
"""
if document_id is not None:
result = engine.get_document(document_id)
return json.dumps(_process_document(result), indent=2)
return json.dumps(result, indent=2)
elif source_path is not None:
docs = engine.list_documents()
matches = [d for d in docs if d.get("source_path") == source_path]
if not matches:
return json.dumps({"error": "No document found with that source_path"})
doc = engine.get_document(matches[0]["id"])
return json.dumps(_process_document(doc), indent=2)
return json.dumps(doc, indent=2)
else:
return json.dumps({"error": "Provide either document_id or source_path"})
@@ -262,12 +188,27 @@ async def kb_jobs(
return json.dumps(result, indent=2)
@mcp.tool()
async def kb_delete(
    document_id: int,
) -> str:
    """Permanently delete a document from the knowledge base.

    Removes the document and all associated data (chunks, embeddings, tags,
    stored files). This action cannot be undone.

    Args:
        document_id: The ID of the document to delete.
    """
    result = engine.delete_document(document_id)
    return json.dumps(result, indent=2)
@mcp.tool()
async def kb_upload_start(
filename: str,
total_size: int,
tags: list[str] | None = None,
collection: str | None = None,
) -> str:
"""Start a chunked file upload to the knowledge base.
@@ -277,7 +218,7 @@ async def kb_upload_start(
3. Call kb_upload_finish to submit the file for ingestion
Example for a 3MB file:
upload = kb_upload_start(filename="report.pdf", total_size=3145728, collection="documents")
upload = kb_upload_start(filename="report.pdf", total_size=3145728, tags=["project:x"])
kb_upload_chunk(upload_id=upload["upload_id"], data="<base64 chunk 0>", chunk_index=0)
kb_upload_chunk(upload_id=upload["upload_id"], data="<base64 chunk 1>", chunk_index=1)
kb_upload_chunk(upload_id=upload["upload_id"], data="<base64 chunk 2>", chunk_index=2)
@@ -286,13 +227,9 @@ async def kb_upload_start(
Args:
filename: Original filename (used for type detection).
total_size: Total file size in bytes.
tags: Additional tags to apply.
collection: Collection name (default "documents").
tags: Tags to apply to the uploaded document.
"""
all_tags = list(tags) if tags else []
all_tags.append(_collection_tag(collection))
upload_id = uploads.start_upload(filename, total_size, all_tags)
upload_id = uploads.start_upload(filename, total_size, tags or [])
return json.dumps({"upload_id": upload_id})
@@ -338,6 +275,112 @@ async def kb_upload_finish(
return json.dumps({"error": str(e)})
# ---------------------------------------------------------------------------
# Bulk operation tools
# ---------------------------------------------------------------------------
@mcp.tool()
async def kb_bulk_delete(
    document_ids: list[int] | None = None,
    tags: list[str] | None = None,
    doc_type: str | None = None,
    from_id: int | None = None,
    to_id: int | None = None,
    force: bool = False,
) -> str:
    """Permanently delete multiple documents matching a filter.

    Removes matched documents and all associated data (chunks, embeddings, tags,
    stored files). This action cannot be undone.

    Selection filters combine with AND logic; at least one is required.

    A safety threshold applies: if the operation would affect more than the
    configured percentage of all documents (KB_BULK_SAFETY_PERCENT, default 70),
    it is rejected unless force=true.

    Args:
        document_ids: Delete documents with these specific IDs.
        tags: Delete documents that have ALL of these tags (selection filter).
        doc_type: Delete documents of this type (e.g. "note", "pdf").
        from_id: Delete documents with id >= this value.
        to_id: Delete documents with id <= this value.
        force: Override the safety threshold if it would block the operation.
    """
    result = engine.bulk_delete(
        document_ids=document_ids, tags=tags, doc_type=doc_type,
        from_id=from_id, to_id=to_id, force=force,
    )
    return json.dumps(result, indent=2)
@mcp.tool()
async def kb_bulk_tags(
    document_ids: list[int] | None = None,
    tags: list[str] | None = None,
    doc_type: str | None = None,
    from_id: int | None = None,
    to_id: int | None = None,
    add: list[str] | None = None,
    remove: list[str] | None = None,
    force: bool = False,
) -> str:
    """Add and/or remove tags on multiple documents matching a filter.

    Selection filters combine with AND logic; at least one is required.

    Note: the 'tags' parameter is a SELECTION FILTER (which documents to target),
    while 'add' and 'remove' specify the TAG CHANGES to apply to those documents.

    Args:
        document_ids: Target documents with these specific IDs.
        tags: Target documents that have ALL of these tags (selection filter).
        doc_type: Target documents of this type.
        from_id: Target documents with id >= this value.
        to_id: Target documents with id <= this value.
        add: Tags to add to matched documents.
        remove: Tags to remove from matched documents.
        force: Override the safety threshold if it would block the operation.
    """
    result = engine.bulk_tags(
        document_ids=document_ids, tags=tags, doc_type=doc_type,
        from_id=from_id, to_id=to_id, add=add, remove=remove, force=force,
    )
    return json.dumps(result, indent=2)
@mcp.tool()
async def kb_bulk_set_tags(
    document_ids: list[int] | None = None,
    tags: list[str] | None = None,
    doc_type: str | None = None,
    from_id: int | None = None,
    to_id: int | None = None,
    new_tags: list[str] | None = None,
    force: bool = False,
) -> str:
    """Replace all tags on multiple documents with a new set.

    Removes ALL existing tags from matched documents, then applies the new tag set.
    Selection filters combine with AND logic; at least one is required.

    Note: the 'tags' parameter is a SELECTION FILTER (which documents to target),
    while 'new_tags' is the REPLACEMENT tag set to apply.

    Args:
        document_ids: Target documents with these specific IDs.
        tags: Target documents that have ALL of these tags (selection filter).
        doc_type: Target documents of this type.
        from_id: Target documents with id >= this value.
        to_id: Target documents with id <= this value.
        new_tags: The replacement tag set to apply to all matched documents.
        force: Override the safety threshold if it would block the operation.
    """
    result = engine.bulk_set_tags(
        document_ids=document_ids, tags=tags, doc_type=doc_type,
        from_id=from_id, to_id=to_id, new_tags=new_tags, force=force,
    )
    return json.dumps(result, indent=2)
# ---------------------------------------------------------------------------
# Auth middleware
# ---------------------------------------------------------------------------
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-04-04
@@ -0,0 +1,194 @@
## Context
The engine API (`engine/kb/routes/`) provides single-document operations for delete (`DELETE /api/v1/documents/{id}`) and tag management (`PUT /api/v1/documents/{id}/tags`). The MCP server (`mcp/server.py`) wraps these and adds a "collection" abstraction via `collection:`-prefixed tags — ~70 lines of helpers and translation logic that only the MCP layer understands.
The database is SQLite with WAL mode, FTS5 for full-text search, and sqlite-vec for embeddings. Foreign keys with `ON DELETE CASCADE` handle chunk cleanup when documents are deleted. Stored files on disk must be cleaned up separately.
## Goals / Non-Goals
**Goals:**
- Bulk delete, bulk tag add/remove, and bulk set-tags (replace) via engine API, MCP tools, and CLI
- Filter-based selection: by tag, doc_type, ID list, and ID range
- Safety threshold to prevent accidental mass operations
- Audit trail via jobs table
- Remove collection abstraction from MCP server
**Non-Goals:**
- Async/queued bulk operations (SQLite handles thousands of rows synchronously in <1s)
- Bulk document retrieval or bulk note creation
- Undo/recycle bin for bulk deletes
- Adding collection concept to engine or CLI (collections are being removed, not moved)
## Decisions
### 1. Common selection filter for all bulk endpoints
All three bulk endpoints accept the same selection body:
```json
{
"document_ids": [1, 5, 12],
"tags": ["agent:mybot", "draft"],
"doc_type": "note",
"from_id": 10,
"to_id": 50
}
```
Filters combine with AND logic. At least one filter is required — the engine rejects requests with no selection criteria (400).
**Selection SQL generation**: A shared helper in `database.py` builds the WHERE clause from the filter. The `tags` filter uses the same JOIN pattern as `list_documents` (all specified tags must match). The `document_ids` filter uses `IN (?)`. The `from_id`/`to_id` filter uses `id >= ? AND id <= ?`.
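A minimal sketch of how such a helper might assemble the query with `sqlite3` placeholders. The name `build_selection_query` and the table layout (`documents`, `tags` with a `name` column, `document_tags`) are assumptions for illustration. The tag clause uses a `GROUP BY ... HAVING` subquery as a stand-in for the `list_documents` JOIN pattern, which is not shown here.

```python
import sqlite3


def build_selection_query(document_ids=None, tags=None, doc_type=None,
                          from_id=None, to_id=None):
    """Return (sql, params) selecting the IDs of documents matching all filters."""
    clauses, params = [], []
    # Assumption: an empty list is treated the same as "filter not provided".
    if document_ids:
        placeholders = ", ".join("?" for _ in document_ids)
        clauses.append(f"d.id IN ({placeholders})")
        params.extend(document_ids)
    if doc_type is not None:
        clauses.append("d.doc_type = ?")
        params.append(doc_type)
    if from_id is not None:
        clauses.append("d.id >= ?")
        params.append(from_id)
    if to_id is not None:
        clauses.append("d.id <= ?")
        params.append(to_id)
    if tags:
        placeholders = ", ".join("?" for _ in tags)
        # A document matches only if it carries every requested tag.
        clauses.append(
            "d.id IN (SELECT dt.document_id FROM document_tags dt "
            "JOIN tags t ON t.id = dt.tag_id "
            f"WHERE t.name IN ({placeholders}) "
            "GROUP BY dt.document_id "
            "HAVING COUNT(DISTINCT t.name) = ?)"
        )
        params.extend(tags)
        params.append(len(set(tags)))
    if not clauses:
        # Mirrors the 400 response: no selection criteria, no query.
        raise ValueError("at least one selection filter is required")
    sql = "SELECT d.id FROM documents d WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY d.id", params


def select_ids(conn: sqlite3.Connection, **filters):
    """Run the generated query and return the matching document IDs."""
    sql, params = build_selection_query(**filters)
    return [row[0] for row in conn.execute(sql, params)]
```

Because every clause is joined with `AND`, adding a filter can only narrow the match set, which is the behaviour the design relies on.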
**Alternative considered**: Separate endpoints per filter type. Rejected — combinable filters are more powerful and the SQL generation is straightforward.
### 2. Safety threshold with configurable percentage
Before executing, the engine counts matched documents and total documents. If `matched / total * 100 > threshold`, the request is rejected:
```
HTTP 409 Conflict
{
"error": "safety_threshold_exceeded",
"message": "Operation would affect 750 of 1000 documents (75.0%). Exceeds safety threshold of 70%. Use force: true to proceed.",
"matched": 750,
"total": 1000,
"percent": 75.0,
"threshold": 70
}
```
- Default threshold: 70% (env var `KB_BULK_SAFETY_PERCENT`, integer 0-100)
- Override per-request: `"force": true` in the request body
- Threshold of 0 effectively disables the safety check
- CLI maps this to `--force` / `-f` flag
The check is a SELECT COUNT before the operation — minimal overhead.
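The check itself could look roughly like the sketch below. The function name and return convention are hypothetical. Skipping the check when the threshold is 0 follows the bullet above, and skipping it for an empty database is an added assumption to avoid dividing by zero.

```python
def check_safety_threshold(matched, total, threshold_percent=70, force=False):
    """Return None if the operation may proceed, else a 409-style error body."""
    # Assumption: force, a threshold of 0, and an empty database all skip the check.
    if force or threshold_percent == 0 or total == 0:
        return None
    raw_percent = matched / total * 100
    if raw_percent <= threshold_percent:
        return None
    percent = round(raw_percent, 1)  # rounded only for display
    return {
        "error": "safety_threshold_exceeded",
        "message": (
            f"Operation would affect {matched} of {total} documents ({percent}%). "
            f"Exceeds safety threshold of {threshold_percent}%. "
            "Use force: true to proceed."
        ),
        "matched": matched,
        "total": total,
        "percent": percent,
        "threshold": threshold_percent,
    }
```

Comparing the unrounded percentage keeps a match of, say, 70.04% from slipping under a threshold of 70 because of display rounding.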
**Alternative considered**: Dry-run mode (preview what would be affected, then confirm). Rejected — adds a two-step flow that doesn't help LLM callers (they'd just always confirm) and the safety threshold covers the dangerous case.
### 3. Synchronous execution with audit logging
Bulk operations execute synchronously and return a summary response:
```json
{
"job_id": 42,
"status": "done",
"matched": 750,
"succeeded": 748,
"failed": 2,
"errors": [
{"document_id": 42, "error": "file locked"},
{"document_id": 99, "error": "not found"}
]
}
```
A job record is created in the `jobs` table, tagged with a new job type (`bulk_delete`, `bulk_tags`, or `bulk_set_tags`). This requires extending the jobs table:
- Add `job_type` column: `"ingest"` (default, for existing jobs) or `"bulk_delete"` / `"bulk_tags"` / `"bulk_set_tags"`
- The job's `filename` field stores a JSON summary of the selection filter for auditability
- `document_id` field stores the count of affected documents
- `error` field stores JSON array of individual errors if any
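As an illustration of the field mapping above, an audit row might be written like this. The helper name `log_bulk_job`, its parameters, and the reduced `jobs` column set are assumptions. The real table has more columns than this sketch touches.

```python
import json
import sqlite3


def log_bulk_job(conn: sqlite3.Connection, job_type, filters, affected,
                 status, errors):
    """Insert one audit row for a bulk operation and return its job ID."""
    cur = conn.execute(
        "INSERT INTO jobs (job_type, filename, document_id, status, error) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            job_type,
            json.dumps(filters, sort_keys=True),     # selection filter, for audit
            affected,                                # count of affected documents
            status,
            json.dumps(errors) if errors else None,  # per-document errors, if any
        ),
    )
    conn.commit()
    return cur.lastrowid
```

Serialising the filter with `sort_keys=True` makes two identical selections produce identical audit strings, which helps when searching the log.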
**Alternative considered**: Full async with job polling. Rejected — SQLite bulk operations are fast enough synchronously and async would require extra polling calls (defeating the purpose of reducing token usage).
### 4. Bulk delete implementation
For each matched document:
1. Collect chunk IDs
2. Delete embeddings from `chunks_vec`
3. Delete the document row (cascades to chunks, document_tags)
4. Delete stored file from disk
This follows the same logic as the existing `delete_document` endpoint, but all database changes are batched in a single SQLite transaction so they apply atomically. File deletions happen after the commit and are best-effort: if one fails, the document is still counted as succeeded (the DB record is gone) and a warning is logged.
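The four steps can be sketched as follows. This is an illustration under assumptions, not the engine's code: `chunks_vec` is modelled as a plain table keyed by `chunk_id` rather than a sqlite-vec virtual table, the `stored_path` column name is invented, and the connection is expected to have `PRAGMA foreign_keys = ON` so the cascade fires.

```python
import logging
import os
import sqlite3

logger = logging.getLogger(__name__)


def bulk_delete(conn: sqlite3.Connection, doc_ids: list[int]):
    """Delete documents in one transaction, then remove stored files best-effort."""
    deleted, errors, stored_paths = [], [], []
    with conn:  # one transaction: commits on success, rolls back on an exception
        for doc_id in doc_ids:
            row = conn.execute(
                "SELECT stored_path FROM documents WHERE id = ?", (doc_id,)
            ).fetchone()
            if row is None:
                errors.append({"document_id": doc_id, "error": "not found"})
                continue
            # 1. Collect chunk IDs before the cascade removes them.
            chunk_ids = [r[0] for r in conn.execute(
                "SELECT id FROM chunks WHERE document_id = ?", (doc_id,))]
            # 2. Delete embeddings, which the foreign-key cascade does not reach.
            conn.executemany(
                "DELETE FROM chunks_vec WHERE chunk_id = ?",
                [(c,) for c in chunk_ids])
            # 3. Delete the document row; ON DELETE CASCADE removes its chunks.
            conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
            deleted.append(doc_id)
            if row[0]:
                stored_paths.append(row[0])
    # 4. File cleanup runs after the commit; a failure is logged, not raised.
    for path in stored_paths:
        try:
            os.remove(path)
        except OSError as exc:
            logger.warning("could not remove stored file %s: %s", path, exc)
    return deleted, errors
```

Deferring `os.remove` until after the `with` block means a rolled-back transaction never leaves database rows pointing at files that have already been deleted.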
### 5. Bulk tags implementation
Two distinct operations:
**`POST /api/v1/bulk/tags`** — Add and/or remove tags:
```json
{
"add": ["reviewed", "approved"],
"remove": ["draft"],
...selection filters...
}
```
**`POST /api/v1/bulk/set-tags`** — Replace all tags:
```json
{
"new_tags": ["final", "approved"],
...selection filters...
}
```
The `set-tags` operation removes all existing tags from matched documents, then applies the new set. This is useful for cleaning up tag clutter or migrating tagging schemes.
Both update `updated_at` on affected documents.
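To make the difference between the two operations concrete, here is a small pure-Python model of their effect on one document's tag list. The function names are hypothetical, and applying removals before additions is an assumption, since the ordering is not specified above.

```python
def apply_bulk_tags(current, add=None, remove=None):
    """Model of bulk/tags: drop the 'remove' tags, then append missing 'add' tags."""
    to_remove = set(remove or [])
    result = [t for t in current if t not in to_remove]
    for tag in add or []:
        if tag not in result:
            result.append(tag)
    return result


def apply_set_tags(current, new_tags):
    """Model of bulk/set-tags: the existing tags are discarded entirely."""
    # dict.fromkeys de-duplicates while preserving order.
    return list(dict.fromkeys(new_tags))
```

With `current = ["agent:mybot", "pending"]`, `apply_bulk_tags(current, add=["reviewed"], remove=["pending"])` keeps `agent:mybot`, whereas `apply_set_tags(current, ["clean", "final"])` drops it along with everything else.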
### 6. Remove collection abstraction from MCP
Remove from `mcp/server.py`:
- Constants: `COLLECTION_TAG_PREFIX`, `DEFAULT_COLLECTION`
- Functions: `_collection_tag`, `_strip_collection_tags`, `_process_document`, `_process_search_results`, `_ensure_exclusive_collection`
- Tool: `kb_set_collection` (entire tool removed)
- Parameters: `collection` from `kb_search`, `kb_addnote`, `kb_upload_start`
The `_process_document` and `_process_search_results` calls in remaining tools are removed — documents are returned as-is from the engine, with all tags visible.
Users/agents that need namespace isolation use a tag convention (e.g. `agent:claude-code`) communicated via system prompt or tool instructions.
### 7. Engine bulk route module
New file: `engine/kb/routes/bulk.py`
Three endpoints sharing common infrastructure:
- `_resolve_selection(conn, filters)` → list of document IDs + count
- `_check_safety_threshold(matched, total, force)` → raises HTTPException if exceeded
- `_log_bulk_job(conn, job_type, filters, matched, succeeded, failed, errors)` → job_id
### 8. MCP bulk tools
Three new tools in `mcp/server.py`, thin wrappers calling new `engine.py` methods:
- `kb_bulk_delete(document_ids?, tags?, doc_type?, from_id?, to_id?, force?)` → str (JSON)
- `kb_bulk_tags(document_ids?, tags?, doc_type?, from_id?, to_id?, add?, remove?, force?)` → str (JSON)
- `kb_bulk_set_tags(document_ids?, tags?, doc_type?, from_id?, to_id?, new_tags?, force?)` → str (JSON)
Note: The `tags` parameter on bulk tools serves as a **selection filter** (which documents to target), while `add`/`remove` (on bulk_tags) and `new_tags` (on bulk_set_tags) are the **operation** (what to do to the tags). Tool descriptions must make this distinction clear.
### 9. CLI bulk commands
Three new commands under `client/cmd/`:
```
kb bulk-remove --tags "draft,old" --type note --force --yes
kb bulk-tag --tags "agent:mybot" --add "reviewed" --remove "pending" --yes
kb bulk-set-tags --ids "1,5,12" --set "clean,final" --yes
```
Filter flags (shared): `--tags`, `--type`, `--ids` (comma-separated), `--from-id`, `--to-id`, `--force`
Confirmation: `--yes` / `-y` to skip interactive prompt.
Without `--yes`, the CLI first shows the match count and asks for confirmation:
```
This will delete 47 documents matching: tags=[draft,old] type=note
Proceed? [y/N]
```
### 10. Engine config for safety threshold
New env var: `KB_BULK_SAFETY_PERCENT` (integer, default 70). Added to `engine/kb/config.py`.
## Risks / Trade-offs
- **[Bulk delete is irreversible]** → Safety threshold mitigates accidental mass deletion. CLI requires interactive confirmation. No undo mechanism — this is deliberate to keep the system simple.
- **[Naming collision: `tags` as filter vs operation]** → The `tags` parameter in bulk_tags selects documents, while `add`/`remove` specifies the tag changes. Clear naming and tool descriptions mitigate confusion. Engine request model uses the same field name as the existing list/search filter.
- **[SQLite lock during large bulk ops]** → A single transaction deleting 5000 documents will hold a write lock. With WAL mode, readers are not blocked. The lock duration should be under a few seconds for typical workloads.
- **[Breaking change: collection removal]** → Any MCP client relying on `collection` parameters will break. Since collections were only recently added and are not widely deployed, this is acceptable. Existing `collection:*` tags in the database remain as regular tags — they still work as filters, just without special treatment.
- **[Jobs table overload]** → Bulk operations add a new job type to a table designed for ingestion jobs. The schema change is minimal (one new column) and the audit trail value outweighs the mixing of concerns.
@@ -0,0 +1,91 @@
## Why
Bulk operations on documents (delete, tag, retag) currently require one API/MCP call per document. When an LLM manages hundreds or thousands of documents, this means hundreds of tool calls — burning tokens, adding latency, and creating fragile multi-step flows that can fail partway through.
Additionally, the "collection" abstraction in the MCP server adds complexity without real benefit. Collections are implemented as `collection:`-prefixed tags, but this convention is only enforced in the MCP layer — the CLI and engine don't know about it. This creates inconsistency and extra code. Tags alone, with a naming convention communicated via system prompt or configuration, achieve the same namespace isolation more simply and uniformly.
## What Changes
### 1. Remove collections from MCP server
Strip all collection logic from `mcp/server.py`:
- Remove `COLLECTION_TAG_PREFIX`, `DEFAULT_COLLECTION`, and all collection helper functions
- Remove `collection` parameter from `kb_search`, `kb_addnote`, `kb_upload_start`
- Remove `kb_set_collection` tool entirely
- Remove `_process_document` / `_process_search_results` collection-tag stripping
- Update MCP server instructions to explain tag-based namespace convention
### 2. Add bulk engine endpoints
Three new endpoints in the engine API:
- **POST /api/v1/bulk/delete** — Delete multiple documents matching a filter
- **POST /api/v1/bulk/tags** — Add/remove tags on multiple documents matching a filter
- **POST /api/v1/bulk/set-tags** — Replace all tags on multiple documents matching a filter
All accept a common **selection filter** (combinable with AND logic):
- `document_ids` — explicit list of IDs
- `tags` — documents matching ALL specified tags
- `doc_type` — documents of this type
- `from_id` / `to_id` — ID range (inclusive)
At least one selection criterion is required.
**Safety threshold**: If the operation would affect more than N% of all documents (default 70%, configurable via `KB_BULK_SAFETY_PERCENT` env var), the request is rejected with a 409 response showing what would be affected. The caller must re-send with `force: true` to proceed.
**Response model**: Synchronous execution with summary response. The operation is logged to the jobs table for audit trail:
```json
{
"job_id": 42,
"status": "done",
"matched": 750,
"succeeded": 748,
"failed": 2,
"errors": [
{"document_id": 42, "error": "file locked"},
{"document_id": 99, "error": "not found"}
]
}
```
### 3. Add bulk MCP tools
Expose the bulk engine endpoints as MCP tools:
- `kb_bulk_delete` — bulk delete with filter selection
- `kb_bulk_tags` — bulk add/remove tags with filter selection
- `kb_bulk_set_tags` — bulk replace tags with filter selection
These are thin wrappers around the engine bulk endpoints — no collection translation, no special logic.
### 4. Add bulk CLI commands
- `kb bulk-remove` — bulk delete with `--tags`, `--type`, `--ids`, `--from-id`, `--to-id`, `--force` flags
- `kb bulk-tag` — bulk tag/untag with `--add`, `--remove`, and the same filter flags
- `kb bulk-set-tags` — bulk replace tags with `--set` (replacement tags) and the same filter flags
All show a confirmation prompt with match count before executing (unless `--yes`).
## Capabilities
### New Capabilities
- `bulk-operations`: Engine endpoints, MCP tools, and CLI commands for bulk delete, tag, and set-tags operations with filter-based selection and safety threshold.
### Modified Capabilities
- `mcp-document-management`: Remove `kb_set_collection` tool. Remove `collection` parameter from all tools.
### Removed Capabilities
- `mcp-collections`: The collection abstraction (collection helpers, collection parameters, collection tag stripping) is removed from the MCP server entirely.
## Impact
- **Engine API** (`engine/kb/routes/`): New `bulk.py` route module with 3 endpoints. New `job_type` column in the jobs table, with `bulk_delete`, `bulk_tags`, and `bulk_set_tags` values alongside the default `ingest`.
- **Engine database** (`engine/kb/database.py`): Helper functions for bulk selection queries and bulk delete/tag operations.
- **MCP server** (`mcp/server.py`): Remove ~70 lines of collection logic. Add 3 bulk tool definitions. Remove `collection` param from `kb_search`, `kb_addnote`, `kb_upload_start`. Remove `kb_set_collection`.
- **MCP engine client** (`mcp/engine.py`): Add bulk operation methods. Remove code that is no longer needed.
- **CLI** (`client/cmd/`): New `bulk_remove.go`, `bulk_tag.go`, `bulk_set_tags.go` command files.
- **CLI API client** (`client/internal/api/`): Add `Post` with JSON body support if not present.
- **Breaking changes**: `kb_set_collection` MCP tool removed. `collection` parameter removed from `kb_search`, `kb_addnote`, `kb_upload_start` MCP tools. Any MCP clients using collections will need to switch to tags.
@@ -0,0 +1,230 @@
## ADDED Requirements
### Requirement: Common selection filter
All bulk engine endpoints SHALL accept a JSON body with the following optional selection fields, combined with AND logic:
- `document_ids` (list of int) — match documents with these specific IDs
- `tags` (list of str) — match documents that have ALL specified tags
- `doc_type` (str) — match documents with this document type
- `from_id` (int) — match documents with id >= this value
- `to_id` (int) — match documents with id <= this value
At least one selection field MUST be present. If no selection fields are provided, the endpoint SHALL return 400 Bad Request.
#### Scenario: Filter by tags and doc_type
- **WHEN** a bulk endpoint receives `{"tags": ["draft"], "doc_type": "note"}`
- **THEN** it SHALL match only documents that have the tag "draft" AND have doc_type "note"
#### Scenario: Filter by ID range
- **WHEN** a bulk endpoint receives `{"from_id": 10, "to_id": 50}`
- **THEN** it SHALL match documents with id >= 10 AND id <= 50
#### Scenario: Filter by explicit IDs
- **WHEN** a bulk endpoint receives `{"document_ids": [1, 5, 12]}`
- **THEN** it SHALL match only documents with those specific IDs
#### Scenario: Combined filters
- **WHEN** a bulk endpoint receives `{"tags": ["agent:mybot"], "doc_type": "note", "from_id": 100}`
- **THEN** it SHALL match documents satisfying ALL three criteria
#### Scenario: No selection fields provided
- **WHEN** a bulk endpoint receives `{}` or `{"force": true}` with no selection fields
- **THEN** it SHALL return 400 Bad Request
### Requirement: Safety threshold
All bulk endpoints SHALL enforce a safety threshold. Before executing, the engine SHALL count the matched documents and the total documents in the database. If `matched / total * 100` exceeds the configured threshold, the request SHALL be rejected with 409 Conflict.
The response SHALL include: `error` ("safety_threshold_exceeded"), `message` (human-readable), `matched` (int), `total` (int), `percent` (float), and `threshold` (int).
The threshold SHALL default to 70 and be configurable via the `KB_BULK_SAFETY_PERCENT` environment variable (integer 0-100). A value of 0 disables the check.
The caller MAY override the threshold by including `"force": true` in the request body.
#### Scenario: Threshold exceeded
- **GIVEN** 1000 total documents and `KB_BULK_SAFETY_PERCENT` is 70
- **WHEN** a bulk endpoint matches 750 documents (75%) without `force: true`
- **THEN** it SHALL return 409 with `matched: 750`, `total: 1000`, `percent: 75.0`, `threshold: 70`
#### Scenario: Threshold not exceeded
- **GIVEN** 1000 total documents and `KB_BULK_SAFETY_PERCENT` is 70
- **WHEN** a bulk endpoint matches 500 documents (50%) without `force: true`
- **THEN** the operation SHALL proceed normally
#### Scenario: Force override
- **GIVEN** 1000 total documents and a match of 900 (90%)
- **WHEN** the request includes `"force": true`
- **THEN** the operation SHALL proceed regardless of threshold
#### Scenario: Zero threshold
- **GIVEN** `KB_BULK_SAFETY_PERCENT` is 0
- **THEN** the safety check SHALL be effectively disabled for all operations
### Requirement: Synchronous response with audit log
All bulk endpoints SHALL execute synchronously and return a JSON response with:
- `job_id` (int) — ID of the audit log entry in the jobs table
- `status` (str) — "done" or "partial_failure"
- `matched` (int) — number of documents that matched the selection
- `succeeded` (int) — number of documents successfully processed
- `failed` (int) — number of documents that failed
- `errors` (list) — array of `{"document_id": int, "error": str}` for each failure (empty on full success)
A job record SHALL be created in the jobs table with `job_type` set to the operation type. The `filename` field SHALL store a JSON representation of the selection filter. The `error` field SHALL store a JSON array of individual errors if any occurred.
#### Scenario: Full success
- **WHEN** a bulk operation matches 50 documents and all succeed
- **THEN** the response SHALL have `status: "done"`, `matched: 50`, `succeeded: 50`, `failed: 0`, `errors: []`
#### Scenario: Partial failure
- **WHEN** a bulk operation matches 50 documents but 2 fail
- **THEN** the response SHALL have `status: "partial_failure"`, `matched: 50`, `succeeded: 48`, `failed: 2`, and `errors` listing the 2 failures
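A sketch of how the response fields relate to each other. The function name `build_bulk_summary` is hypothetical, and deriving `succeeded` as `matched - failed` is an assumption that holds for both scenarios above.

```python
def build_bulk_summary(job_id, matched, errors):
    """Assemble the synchronous response from the match count and error list."""
    failed = len(errors)
    return {
        "job_id": job_id,
        "status": "partial_failure" if failed else "done",
        "matched": matched,
        "succeeded": matched - failed,
        "failed": failed,
        "errors": errors,
    }
```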
### Requirement: Bulk delete endpoint
The engine SHALL expose `POST /api/v1/bulk/delete` which permanently deletes all documents matching the selection filter. For each matched document, it SHALL delete embeddings from `chunks_vec`, delete the document row (cascading to chunks and document_tags), and delete any stored file from disk.
Database deletions SHALL be performed within a single transaction. File deletions SHALL occur after the transaction commits and SHALL be best-effort (failures logged but not counted as document failures).
#### Scenario: Bulk delete by tag
- **WHEN** `POST /api/v1/bulk/delete` receives `{"tags": ["old", "draft"]}`
- **THEN** all documents with both tags "old" and "draft" SHALL be deleted
- **AND** their chunks, embeddings, tag associations, and stored files SHALL be removed
#### Scenario: Bulk delete with no matches
- **WHEN** `POST /api/v1/bulk/delete` receives a filter that matches 0 documents
- **THEN** the response SHALL have `matched: 0`, `succeeded: 0`, `failed: 0`
### Requirement: Bulk tags endpoint
The engine SHALL expose `POST /api/v1/bulk/tags` which adds and/or removes tags on all documents matching the selection filter. The request body SHALL include the selection filter plus:
- `add` (list of str, optional) — tags to add
- `remove` (list of str, optional) — tags to remove
At least one of `add` or `remove` MUST be present. The endpoint SHALL return 400 if neither is provided.
The endpoint SHALL update `updated_at` on all affected documents.
#### Scenario: Add and remove tags in one call
- **WHEN** `POST /api/v1/bulk/tags` receives `{"tags": ["agent:mybot"], "add": ["reviewed"], "remove": ["pending"]}`
- **THEN** all documents tagged "agent:mybot" SHALL have "reviewed" added and "pending" removed
### Requirement: Bulk set-tags endpoint
The engine SHALL expose `POST /api/v1/bulk/set-tags` which replaces all tags on matched documents with a new set. The request body SHALL include the selection filter plus:
- `new_tags` (list of str) — the replacement tag set
The endpoint SHALL remove all existing tag associations from matched documents, then apply the new set. It SHALL update `updated_at` on all affected documents.
#### Scenario: Replace all tags
- **WHEN** `POST /api/v1/bulk/set-tags` receives `{"doc_type": "note", "new_tags": ["clean", "final"]}`
- **THEN** all notes SHALL have their existing tags removed and replaced with "clean" and "final"
### Requirement: Jobs table extension
The jobs table SHALL be extended with a `job_type` column (TEXT, default "ingest") to distinguish ingestion jobs from bulk operation audit entries. Valid values: "ingest", "bulk_delete", "bulk_tags", "bulk_set_tags".
Existing jobs SHALL default to `job_type = "ingest"`. The existing jobs list endpoint and CLI `kb jobs` command SHALL continue to work unchanged.
#### Scenario: Migration adds column
- **GIVEN** an existing database without the `job_type` column
- **WHEN** the engine starts
- **THEN** the column SHALL be added with default value "ingest"
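One way such a startup migration could be written with plain `sqlite3`. The function name is hypothetical, and how the engine actually runs migrations is not shown here.

```python
import sqlite3


def ensure_job_type_column(conn: sqlite3.Connection) -> bool:
    """Add jobs.job_type if it is missing; return True when a migration ran."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(jobs)")}
    if "job_type" in columns:
        return False
    # Existing rows report the column default, so old jobs read as 'ingest'.
    conn.execute("ALTER TABLE jobs ADD COLUMN job_type TEXT DEFAULT 'ingest'")
    conn.commit()
    return True
```

The column check makes the function safe to call on every engine start, since the second and later calls are no-ops.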
### Requirement: Engine config for safety threshold
The engine `Config` class SHALL read `KB_BULK_SAFETY_PERCENT` from the environment as an integer (default 70, range 0-100). This value SHALL be used as the default safety threshold for all bulk endpoints.
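A minimal sketch of the parsing this requirement implies. The function name is hypothetical, and raising on an out-of-range or non-integer value (rather than clamping or falling back to the default) is an assumption the spec does not settle.

```python
import os


def read_bulk_safety_percent(environ=os.environ) -> int:
    """Read KB_BULK_SAFETY_PERCENT as an integer in 0-100 (default 70)."""
    raw = environ.get("KB_BULK_SAFETY_PERCENT", "70")
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(
            f"KB_BULK_SAFETY_PERCENT must be an integer, got {raw!r}"
        ) from None
    if not 0 <= value <= 100:
        raise ValueError(
            f"KB_BULK_SAFETY_PERCENT must be between 0 and 100, got {value}"
        )
    return value
```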
### Requirement: MCP bulk delete tool
The MCP server SHALL expose a `kb_bulk_delete` tool with parameters: `document_ids` (optional list of int), `tags` (optional list of str), `doc_type` (optional str), `from_id` (optional int), `to_id` (optional int), `force` (optional bool).
The tool SHALL call `POST /api/v1/bulk/delete` on the engine via the engine client and return the JSON response.
The tool description SHALL clearly state that `tags` is a selection filter (which documents to delete), not tags to delete.
#### Scenario: MCP bulk delete by tag
- **WHEN** `kb_bulk_delete(tags=["old"])` is called
- **THEN** the engine client SHALL send `POST /api/v1/bulk/delete` with `{"tags": ["old"]}`
- **AND** the tool SHALL return the engine's JSON response
### Requirement: MCP bulk tags tool
The MCP server SHALL expose a `kb_bulk_tags` tool with parameters: `document_ids`, `tags`, `doc_type`, `from_id`, `to_id` (selection filters), plus `add` (optional list of str), `remove` (optional list of str), and `force` (optional bool).
The tool description SHALL clearly distinguish `tags` (selection filter) from `add`/`remove` (tag changes to apply).
#### Scenario: MCP bulk tag update
- **WHEN** `kb_bulk_tags(tags=["agent:mybot"], add=["reviewed"], remove=["draft"])` is called
- **THEN** the engine client SHALL send the appropriate `POST /api/v1/bulk/tags` request
### Requirement: MCP bulk set-tags tool
The MCP server SHALL expose a `kb_bulk_set_tags` tool with parameters: `document_ids`, `tags`, `doc_type`, `from_id`, `to_id` (selection filters), plus `new_tags` (list of str) and `force` (optional bool).
#### Scenario: MCP bulk set tags
- **WHEN** `kb_bulk_set_tags(doc_type="note", new_tags=["clean"])` is called
- **THEN** the engine client SHALL send `POST /api/v1/bulk/set-tags` with `{"doc_type": "note", "new_tags": ["clean"]}`
### Requirement: MCP engine client bulk methods
The MCP engine client (`mcp/engine.py`) SHALL provide three new methods:
- `bulk_delete(document_ids?, tags?, doc_type?, from_id?, to_id?, force?)` → dict
- `bulk_tags(document_ids?, tags?, doc_type?, from_id?, to_id?, add?, remove?, force?)` → dict
- `bulk_set_tags(document_ids?, tags?, doc_type?, from_id?, to_id?, new_tags?, force?)` → dict
Each SHALL send a POST request to the corresponding `/api/v1/bulk/*` endpoint with the parameters as a JSON body. Each SHALL raise on non-2xx status codes, consistent with existing methods.
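The JSON body for these methods might be assembled as below. The helper `build_bulk_payload` is hypothetical. Omitting unset (`None`) parameters and sending `force` only when it is true are assumptions, chosen so that the scenario bodies in this spec (for example `{"tags": ["old"]}`) come out exactly. The HTTP call itself is left out because the existing client's transport is not shown.

```python
def build_bulk_payload(force=False, **fields):
    """Build the JSON body for a /api/v1/bulk/* request."""
    # Drop parameters the caller did not set so the engine sees only real filters.
    payload = {key: value for key, value in fields.items() if value is not None}
    if force:
        payload["force"] = True
    return payload
```

For instance, `build_bulk_payload(tags=["agent:mybot"], add=["reviewed"], remove=["draft"], doc_type=None)` produces a body with just `tags`, `add`, and `remove`.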
### Requirement: CLI bulk-remove command
The CLI SHALL expose a `kb bulk-remove` command with flags: `--tags` (comma-separated), `--type`, `--ids` (comma-separated), `--from-id`, `--to-id`, `--force`/`-f`, `--yes`/`-y`.
Without `--yes`, the CLI SHALL first display the match count and ask for interactive confirmation before proceeding.
The command SHALL call `POST /api/v1/bulk/delete` with the constructed filter.
#### Scenario: CLI bulk remove with confirmation
- **WHEN** `kb bulk-remove --tags "draft,old" --type note` is run without `--yes`
- **THEN** the CLI SHALL display "This will delete N documents matching: tags=[draft,old] type=note" and prompt "Proceed? [y/N]"
#### Scenario: CLI bulk remove with --yes
- **WHEN** `kb bulk-remove --tags "draft" --yes` is run
- **THEN** the CLI SHALL proceed without prompting
### Requirement: CLI bulk-tag command
The CLI SHALL expose a `kb bulk-tag` command with the same filter flags as `bulk-remove`, plus `--add` and `--remove` (comma-separated tag lists).
The command SHALL call `POST /api/v1/bulk/tags` with the constructed filter and tag changes.
### Requirement: CLI bulk-set-tags command
The CLI SHALL expose a `kb bulk-set-tags` command with the filter flags, plus `--set` (comma-separated list of replacement tags).
The command SHALL call `POST /api/v1/bulk/set-tags` with the constructed filter and `new_tags`.
@@ -0,0 +1,55 @@
## REMOVED Requirements
### Requirement: Collection abstraction in MCP server
The MCP server SHALL NOT maintain any collection abstraction. The following SHALL be removed:
- Constants: `COLLECTION_TAG_PREFIX`, `DEFAULT_COLLECTION`
- Functions: `_collection_tag`, `_strip_collection_tags`, `_process_document`, `_process_search_results`, `_ensure_exclusive_collection`
- Tool: `kb_set_collection` (entire tool)
- Parameters: `collection` from `kb_search`, `kb_addnote`, `kb_upload_start`
Documents SHALL be returned as-is from the engine with all tags visible. No tag stripping or collection field injection SHALL occur.
#### Scenario: Search results show all tags
- **WHEN** `kb_search` is called and a result has tags `["agent:mybot", "collection:documents", "draft"]`
- **THEN** all three tags SHALL be returned as-is — no stripping of `collection:*` tags
#### Scenario: kb_set_collection no longer exists
- **WHEN** an MCP client attempts to call `kb_set_collection`
- **THEN** the tool SHALL NOT be found, because it has been removed
## MODIFIED Requirements
### Requirement: kb_search without collection parameter
The `kb_search` MCP tool SHALL accept `tags` (optional list of str) for filtering but SHALL NOT accept a `collection` parameter. Callers that previously used `collection="memory"` SHALL instead use `tags=["collection:memory"]` or any other tag convention they prefer.
#### Scenario: Filter by tag instead of collection
- **WHEN** `kb_search(query="test", tags=["agent:mybot"])` is called
- **THEN** results SHALL be filtered to documents tagged "agent:mybot"
- **AND** no collection field SHALL be present in the response
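
For callers migrating away from the `collection` parameter, a caller-side shim along these lines could translate old arguments into the tag form. The helper `collection_to_tags` is hypothetical and not part of the MCP server; it assumes the caller wants to keep the `collection:<name>` convention mentioned above.

```python
def collection_to_tags(collection=None, tags=None):
    """Fold a legacy collection name into an explicit tag list.

    Hypothetical caller-side helper; the server itself no longer knows
    about collections.
    """
    merged = list(tags or [])
    if collection is not None:
        tag = f"collection:{collection}"
        if tag not in merged:
            merged.append(tag)
    return merged


# Old call: kb_search(query="test", collection="memory")
# New call: kb_search(query="test", tags=collection_to_tags("memory"))
print(collection_to_tags("memory", tags=["agent:mybot"]))
```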
### Requirement: kb_addnote without collection parameter
The `kb_addnote` MCP tool SHALL accept `tags` (optional list of str) but SHALL NOT accept a `collection` parameter. The tool SHALL NOT automatically apply any default collection tag — only explicitly provided tags are applied.
#### Scenario: Add note with explicit tags
- **WHEN** `kb_addnote(text="hello", tags=["agent:mybot", "memory"])` is called
- **THEN** the note SHALL be created with exactly those two tags — no `collection:documents` tag added
### Requirement: kb_upload_start without collection parameter
The `kb_upload_start` MCP tool SHALL accept `tags` (optional list of str) but SHALL NOT accept a `collection` parameter. The tool SHALL NOT automatically apply any default collection tag.
### Requirement: kb_update_note without collection processing
The `kb_update_note` MCP tool SHALL return the document as-is from the engine without passing it through `_process_document`. All tags SHALL be visible in the response.
### Requirement: kb_get without collection processing
The `kb_get` MCP tool SHALL return documents as-is from the engine without passing through `_process_document`. All tags SHALL be visible in the response. No `collection` field SHALL be injected.
@@ -0,0 +1,45 @@
## 1. Remove collections from MCP server
- [x] 1.1 Remove collection constants and helper functions from `mcp/server.py` (`COLLECTION_TAG_PREFIX`, `DEFAULT_COLLECTION`, `_collection_tag`, `_strip_collection_tags`, `_process_document`, `_process_search_results`, `_ensure_exclusive_collection`)
- [x] 1.2 Remove `collection` parameter from `kb_search`, `kb_addnote`, `kb_upload_start` tools
- [x] 1.3 Remove `kb_set_collection` tool entirely
- [x] 1.4 Remove `_process_document` / `_process_search_results` calls from `kb_get`, `kb_update_note`, `kb_search`
- [x] 1.5 Update MCP server instructions text to reflect tags-only approach
## 2. Engine bulk infrastructure
- [x] 2.1 Add `bulk_safety_percent` to `Config` class in `engine/kb/config.py` (env var `KB_BULK_SAFETY_PERCENT`, default 70)
- [x] 2.2 Add `job_type` column migration to `database.py` `init_schema` (TEXT, default "ingest")
- [x] 2.3 Add `resolve_bulk_selection(conn, document_ids, tags, doc_type, from_id, to_id)` helper to `database.py` — returns list of matching document IDs
- [x] 2.4 Add `create_bulk_job(conn, job_type, filters_json, matched, succeeded, failed, errors_json)` helper to `database.py`
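
One way the selection helper from 2.3 could be written, shown against an in-memory SQLite database. The schema (`documents`, `document_tags`), integer IDs, combining filters with AND, requiring every listed tag, and returning nothing when no filter is given are all assumptions made for this sketch, not details confirmed by the change.

```python
import sqlite3


def resolve_bulk_selection(conn, document_ids=None, tags=None, doc_type=None,
                           from_id=None, to_id=None):
    """Return sorted IDs of documents matching all supplied filters."""
    clauses, params = [], []
    if document_ids:
        marks = ",".join("?" for _ in document_ids)
        clauses.append(f"d.id IN ({marks})")
        params.extend(document_ids)
    if doc_type is not None:
        clauses.append("d.doc_type = ?")
        params.append(doc_type)
    if from_id is not None:
        clauses.append("d.id >= ?")
        params.append(from_id)
    if to_id is not None:
        clauses.append("d.id <= ?")
        params.append(to_id)
    for tag in tags or []:
        # Assumption: a document must carry every listed tag.
        clauses.append(
            "EXISTS (SELECT 1 FROM document_tags t "
            "WHERE t.document_id = d.id AND t.tag = ?)"
        )
        params.append(tag)
    if not clauses:
        return []  # Assumption: an empty filter selects nothing.
    sql = ("SELECT d.id FROM documents d WHERE "
           + " AND ".join(clauses) + " ORDER BY d.id")
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (id INTEGER PRIMARY KEY, doc_type TEXT);
    CREATE TABLE document_tags (document_id INTEGER, tag TEXT);
    INSERT INTO documents VALUES (1, 'note'), (2, 'note'), (3, 'pdf');
    INSERT INTO document_tags VALUES (1, 'draft'), (2, 'draft'), (2, 'old'), (3, 'old');
""")
print(resolve_bulk_selection(conn, tags=["draft"], doc_type="note"))
```

All values are bound as query parameters, so tag names and IDs supplied by a caller are never interpolated into the SQL text.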
## 3. Engine bulk endpoints
- [x] 3.1 Create `engine/kb/routes/bulk.py` with shared Pydantic request model (`BulkSelectionRequest` with selection fields + `force` bool)
- [x] 3.2 Add `_check_safety_threshold` helper that returns 409 if threshold exceeded
- [x] 3.3 Implement `POST /api/v1/bulk/delete` — resolve selection, check threshold, delete documents in transaction, clean up files, log job, return summary
- [x] 3.4 Implement `POST /api/v1/bulk/tags` — resolve selection, check threshold, add/remove tags on matched docs, log job, return summary
- [x] 3.5 Implement `POST /api/v1/bulk/set-tags` — resolve selection, check threshold, clear and replace tags on matched docs, log job, return summary
- [x] 3.6 Import bulk routes in engine app startup (add to `engine/kb/routes/__init__.py` or `main.py`)
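
The threshold check in 3.2 can be sketched as a pure function. This is not the code in `engine/kb/routes/bulk.py`: the function name, the return shape, measuring matches against the total document count, and treating only values strictly above the threshold as a violation are assumptions for illustration.

```python
def check_safety_threshold(matched, total, safety_percent=70, force=False):
    """Return None if the operation may proceed, else (409, detail).

    Assumptions: the percentage is matched / total documents, and only
    values strictly above safety_percent are blocked.
    """
    if force or total == 0:
        return None
    percent = matched * 100 / total
    if percent > safety_percent:
        detail = {
            "matched": matched,
            "total": total,
            "threshold_percent": safety_percent,
            "hint": "resend with force=true to proceed",
        }
        return 409, detail
    return None


print(check_safety_threshold(80, 100))
print(check_safety_threshold(80, 100, force=True))
```

Keeping the check free of HTTP and database code would let each of the three endpoints call it after resolving the selection and before making any change.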
## 4. MCP bulk tools
- [x] 4.1 Add `bulk_delete`, `bulk_tags`, `bulk_set_tags` methods to `mcp/engine.py`
- [x] 4.2 Add `kb_bulk_delete` tool to `mcp/server.py`
- [x] 4.3 Add `kb_bulk_tags` tool to `mcp/server.py`
- [x] 4.4 Add `kb_bulk_set_tags` tool to `mcp/server.py`
## 5. CLI bulk commands
- [x] 5.1 Create `client/cmd/bulk_remove.go`: `kb bulk-remove` with filter flags, confirmation prompt, JSON output support
- [x] 5.2 Create `client/cmd/bulk_tag.go`: `kb bulk-tag` with filter flags + `--add`/`--remove`, confirmation prompt
- [x] 5.3 Create `client/cmd/bulk_set_tags.go`: `kb bulk-set-tags` with filter flags + `--set`, confirmation prompt
## 6. Verification
- [x] 6.1 Test collection removal: verify `kb_search`, `kb_addnote`, `kb_get`, `kb_update_note`, `kb_upload_start` work without collection params
- [x] 6.2 Test bulk delete via engine API: filter by tags, by IDs, by range, safety threshold trigger and force override
- [x] 6.3 Test bulk tags and bulk set-tags via engine API
- [x] 6.4 Test MCP bulk tools against running engine
- [x] 6.5 Test CLI bulk commands against running engine
- [x] 6.6 Test audit trail: verify bulk jobs appear in `kb jobs` output