kb/openspec/changes/archive/2026-04-04-bulk-ops-remove-collections/design.md at main

steve/kb

steve 223ff2cf5d Latest changes all archived

2026-04-04 22:50:19 +01:00

8.9 KiB


Context

The engine API (engine/kb/routes/) provides single-document operations for delete (DELETE /api/v1/documents/{id}) and tag management (PUT /api/v1/documents/{id}/tags). The MCP server (mcp/server.py) wraps these and adds a "collection" abstraction via collection:-prefixed tags — ~70 lines of helpers and translation logic that only the MCP layer understands.

The database is SQLite with WAL mode, FTS5 for full-text search, and sqlite-vec for embeddings. Foreign keys with ON DELETE CASCADE handle chunk cleanup when documents are deleted. Stored files on disk must be cleaned up separately.

Goals / Non-Goals

Goals:

Bulk delete, bulk tag add/remove, and bulk set-tags (replace) via engine API, MCP tools, and CLI
Filter-based selection: by tag, doc_type, ID list, and ID range
Safety threshold to prevent accidental mass operations
Audit trail via jobs table
Remove collection abstraction from MCP server

Non-Goals:

Async/queued bulk operations (SQLite handles thousands of rows synchronously in <1s)
Bulk document retrieval or bulk note creation
Undo/recycle bin for bulk deletes
Adding collection concept to engine or CLI (collections are being removed, not moved)

Decisions

1. Common selection filter for all bulk endpoints

All three bulk endpoints accept the same selection body:

{
  "document_ids": [1, 5, 12],
  "tags": ["agent:mybot", "draft"],
  "doc_type": "note",
  "from_id": 10,
  "to_id": 50
}

Filters combine with AND logic. At least one filter is required — the engine rejects requests with no selection criteria (400).

Selection SQL generation: A shared helper in database.py builds the WHERE clause from the filter. The tags filter uses the same JOIN pattern as list_documents (all specified tags must match). The document_ids filter uses IN (?, ?, ...) with one placeholder per ID. The from_id/to_id filter uses id >= ? AND id <= ?.
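The WHERE-clause construction described above can be sketched as follows. This is a minimal illustration, not the helper in database.py: the function name build_selection_where, the alias d for documents, and the tags(id, name) / document_tags(document_id, tag_id) layout are assumptions, and the "all tags must match" rule is expressed here with a GROUP BY / HAVING subquery, which may differ from the JOIN pattern list_documents actually uses.

```python
import sqlite3
from typing import Any, Optional, Sequence


def build_selection_where(
    document_ids: Optional[Sequence[int]] = None,
    tags: Optional[Sequence[str]] = None,
    doc_type: Optional[str] = None,
    from_id: Optional[int] = None,
    to_id: Optional[int] = None,
) -> tuple[str, list[Any]]:
    """Return (where_sql, params); every supplied filter is ANDed together."""
    clauses: list[str] = []
    params: list[Any] = []

    if document_ids:
        # One placeholder per ID keeps the query fully parameterized.
        marks = ", ".join("?" for _ in document_ids)
        clauses.append(f"d.id IN ({marks})")
        params.extend(document_ids)
    if tags:
        # Assumed schema: a document matches only if it carries every tag.
        marks = ", ".join("?" for _ in tags)
        clauses.append(
            "d.id IN (SELECT dt.document_id FROM document_tags dt "
            "JOIN tags t ON t.id = dt.tag_id "
            f"WHERE t.name IN ({marks}) "
            "GROUP BY dt.document_id HAVING COUNT(DISTINCT t.name) = ?)"
        )
        params.extend(tags)
        params.append(len(set(tags)))
    if doc_type is not None:
        clauses.append("d.doc_type = ?")
        params.append(doc_type)
    if from_id is not None:
        clauses.append("d.id >= ?")
        params.append(from_id)
    if to_id is not None:
        clauses.append("d.id <= ?")
        params.append(to_id)

    if not clauses:
        # Mirrors the engine rule: a request with no criteria is rejected.
        raise ValueError("at least one selection filter is required")
    return " AND ".join(clauses), params
```

A caller would interpolate the returned clause into a query such as SELECT d.id FROM documents d WHERE ... and pass the parameter list unchanged.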

Alternative considered: Separate endpoints per filter type. Rejected — combinable filters are more powerful and the SQL generation is straightforward.

2. Safety threshold with configurable percentage

Before executing, the engine counts matched documents and total documents. If the matched share (matched / total * 100) is greater than the threshold percentage, the request is rejected:

HTTP 409 Conflict
{
  "error": "safety_threshold_exceeded",
  "message": "Operation would affect 750 of 1000 documents (75.0%). Exceeds safety threshold of 70%. Use force: true to proceed.",
  "matched": 750,
  "total": 1000,
  "percent": 75.0,
  "threshold": 70
}

Default threshold: 70% (env var KB_BULK_SAFETY_PERCENT, integer 0-100)
Override per-request: "force": true in the request body
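Threshold of 100 effectively disables the safety check, since the matched percentage can never exceed 100. Under the strict greater-than comparison, a threshold of 0 rejects every non-empty match unless force is set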
CLI maps this to --force / -f flag

The check is a SELECT COUNT before the operation — minimal overhead.
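The check can be sketched as below, under stated assumptions: the exception class SafetyThresholdExceeded and the function signature are hypothetical (the route helper described later raises HTTPException instead), and the payload simply mirrors the 409 body shown above. The sketch applies the strict greater-than comparison literally, with no special-cased threshold values.

```python
class SafetyThresholdExceeded(Exception):
    """Hypothetical error carrying a payload shaped like the 409 body."""

    def __init__(self, payload: dict) -> None:
        super().__init__(payload["message"])
        self.payload = payload


def check_safety_threshold(
    matched: int, total: int, threshold: int = 70, force: bool = False
) -> float:
    """Return the matched percentage, or raise if it exceeds the threshold."""
    percent = round(matched / total * 100, 1) if total else 0.0
    # Compare in integers so float rounding cannot flip a borderline case.
    exceeded = matched * 100 > threshold * total
    if exceeded and not force:
        raise SafetyThresholdExceeded({
            "error": "safety_threshold_exceeded",
            "message": (
                f"Operation would affect {matched} of {total} documents "
                f"({percent}%). Exceeds safety threshold of {threshold}%. "
                "Use force: true to proceed."
            ),
            "matched": matched,
            "total": total,
            "percent": percent,
            "threshold": threshold,
        })
    return percent
```

Exactly hitting the threshold (for example 700 of 1000 at 70%) passes, because only a strictly greater share is rejected.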

Alternative considered: Dry-run mode (preview what would be affected, then confirm). Rejected — adds a two-step flow that doesn't help LLM callers (they'd just always confirm) and the safety threshold covers the dangerous case.

3. Synchronous execution with audit logging

Bulk operations execute synchronously and return a summary response:

{
  "job_id": 42,
  "status": "done",
  "matched": 750,
  "succeeded": 748,
  "failed": 2,
  "errors": [
    {"document_id": 42, "error": "file locked"},
    {"document_id": 99, "error": "not found"}
  ]
}

A job record is created in the jobs table with a new job type: bulk_delete, bulk_tags, or bulk_set_tags. This requires extending the jobs table:

Add job_type column: "ingest" (default, for existing jobs) or "bulk_delete" / "bulk_tags" / "bulk_set_tags"
The job's filename field stores a JSON summary of the selection filter for auditability
document_id field stores the count of affected documents
error field stores JSON array of individual errors if any
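The jobs-table extension and the audit write could look roughly like this. It is a sketch only: the existing jobs columns (id, filename, status, document_id, error) are assumed from the field names mentioned above, and add_job_type_column and log_bulk_job are hypothetical names with a simplified signature.

```python
import json
import sqlite3

BULK_JOB_TYPES = {"bulk_delete", "bulk_tags", "bulk_set_tags"}


def add_job_type_column(conn: sqlite3.Connection) -> None:
    """Add jobs.job_type once; existing rows pick up the 'ingest' default."""
    columns = {row[1] for row in conn.execute("PRAGMA table_info(jobs)")}
    if "job_type" not in columns:
        conn.execute(
            "ALTER TABLE jobs ADD COLUMN job_type TEXT NOT NULL DEFAULT 'ingest'"
        )


def log_bulk_job(
    conn: sqlite3.Connection,
    job_type: str,
    filters: dict,
    affected: int,
    errors: list,
) -> int:
    """Insert one audit row for a bulk operation and return its job id."""
    if job_type not in BULK_JOB_TYPES:
        raise ValueError(f"unknown bulk job type: {job_type}")
    cur = conn.execute(
        "INSERT INTO jobs (job_type, filename, status, document_id, error) "
        "VALUES (?, ?, 'done', ?, ?)",
        (
            job_type,
            json.dumps(filters, sort_keys=True),  # selection filter summary
            affected,                             # count of affected documents
            json.dumps(errors) if errors else None,
        ),
    )
    return cur.lastrowid
```

Checking PRAGMA table_info first keeps the migration safe to run on every startup.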

Alternative considered: Full async with job polling. Rejected — SQLite bulk operations are fast enough synchronously and async would require extra polling calls (defeating the purpose of reducing token usage).

4. Bulk delete implementation

For each matched document:

Collect chunk IDs
Delete embeddings from chunks_vec
Delete the document row (cascades to chunks, document_tags)
Delete stored file from disk

This follows the same logic as the existing delete_document endpoint, but the database changes for all matched documents run in a single SQLite transaction, so they commit or roll back together. File deletions happen after commit and are best-effort: if one fails, the document is still counted as succeeded (the DB record is gone) and a warning is logged.
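The ordering described above (database changes in one transaction, files afterwards) can be sketched as follows. The column names file_path and chunk_id are assumptions, and chunks_vec is treated as an ordinary table here, whereas in the real engine it is a sqlite-vec virtual table. The cascade to chunks relies on PRAGMA foreign_keys = ON having been set on the connection.

```python
import logging
import os
import sqlite3
from typing import Sequence

log = logging.getLogger("kb.bulk")


def bulk_delete(conn: sqlite3.Connection, doc_ids: Sequence[int]) -> int:
    """Delete documents in one transaction, then remove their files."""
    if not doc_ids:
        return 0
    ids = list(doc_ids)
    marks = ", ".join("?" for _ in ids)

    with conn:  # commits on success, rolls back if any statement raises
        paths = [
            row[0]
            for row in conn.execute(
                f"SELECT file_path FROM documents WHERE id IN ({marks})", ids
            )
            if row[0]
        ]
        # Embeddings are not covered by the cascade, so remove them first.
        conn.execute(
            "DELETE FROM chunks_vec WHERE chunk_id IN "
            f"(SELECT id FROM chunks WHERE document_id IN ({marks}))",
            ids,
        )
        # Deleting the document rows cascades to chunks.
        deleted = conn.execute(
            f"DELETE FROM documents WHERE id IN ({marks})", ids
        ).rowcount

    # Post-commit, best-effort cleanup: a failure is logged, not raised.
    for path in paths:
        try:
            os.remove(path)
        except OSError as exc:
            log.warning("could not remove %s: %s", path, exc)
    return deleted
```

Because files are removed only after the commit, a rolled-back transaction never leaves a database row pointing at a missing file.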

5. Bulk tags implementation

Two distinct operations:

POST /api/v1/bulk/tags — Add and/or remove tags:

{
  "add": ["reviewed", "approved"],
  "remove": ["draft"],
  ...selection filters...
}

POST /api/v1/bulk/set-tags — Replace all tags:

{
  "tags": ["final", "approved"],
  ...selection filters...
}

The set-tags operation removes all existing tags from matched documents, then applies the new set. This is useful for cleaning up tag clutter or migrating tagging schemes.

Both update updated_at on affected documents.
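The difference between the two operations can be sketched at the SQL level. This is an illustration under assumptions: the tags(id, name) and document_tags(document_id, tag_id) layout, the helper _attach, and the function names are hypothetical, and the already-resolved list of document IDs stands in for the selection filters.

```python
import sqlite3
from typing import Iterable, Sequence


def _attach(conn: sqlite3.Connection, doc_id: int, name: str) -> None:
    """Create the tag if needed and link it to the document (idempotent)."""
    conn.execute("INSERT OR IGNORE INTO tags (name) VALUES (?)", (name,))
    conn.execute(
        "INSERT OR IGNORE INTO document_tags (document_id, tag_id) "
        "SELECT ?, id FROM tags WHERE name = ?",
        (doc_id, name),
    )


def _touch(conn: sqlite3.Connection, doc_id: int) -> None:
    conn.execute(
        "UPDATE documents SET updated_at = CURRENT_TIMESTAMP WHERE id = ?",
        (doc_id,),
    )


def bulk_tags(
    conn: sqlite3.Connection,
    doc_ids: Sequence[int],
    add: Iterable[str] = (),
    remove: Iterable[str] = (),
) -> None:
    """Add and/or remove individual tags; other tags stay untouched."""
    add, remove = list(add), list(remove)
    with conn:
        for doc_id in doc_ids:
            for name in remove:
                conn.execute(
                    "DELETE FROM document_tags WHERE document_id = ? AND "
                    "tag_id IN (SELECT id FROM tags WHERE name = ?)",
                    (doc_id, name),
                )
            for name in add:
                _attach(conn, doc_id, name)
            _touch(conn, doc_id)


def bulk_set_tags(
    conn: sqlite3.Connection, doc_ids: Sequence[int], tags: Iterable[str]
) -> None:
    """Replace the whole tag set: clear existing links, then apply new ones."""
    tags = list(tags)
    with conn:
        for doc_id in doc_ids:
            conn.execute(
                "DELETE FROM document_tags WHERE document_id = ?", (doc_id,)
            )
            for name in tags:
                _attach(conn, doc_id, name)
            _touch(conn, doc_id)
```

The sketch assumes a UNIQUE constraint on tags.name and a composite primary key on document_tags, which is what makes INSERT OR IGNORE idempotent.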

6. Remove collection abstraction from MCP

Remove from mcp/server.py:

Constants: COLLECTION_TAG_PREFIX, DEFAULT_COLLECTION
Functions: _collection_tag, _strip_collection_tags, _process_document, _process_search_results, _ensure_exclusive_collection
Tool: kb_set_collection (entire tool removed)
Parameters: collection from kb_search, kb_addnote, kb_upload_start

The _process_document and _process_search_results calls in remaining tools are removed — documents are returned as-is from the engine, with all tags visible.

Users/agents that need namespace isolation use a tag convention (e.g. agent:claude-code) communicated via system prompt or tool instructions.

7. Engine bulk route module

New file: engine/kb/routes/bulk.py

Three endpoints sharing common infrastructure:

_resolve_selection(conn, filters) → list of document IDs + count
_check_safety_threshold(matched, total, force) → raises HTTPException if exceeded
_log_bulk_job(conn, job_type, filters, matched, succeeded, failed, errors) → job_id

8. MCP bulk tools

Three new tools in mcp/server.py, thin wrappers calling new engine.py methods:

kb_bulk_delete(document_ids?, tags?, doc_type?, from_id?, to_id?, force?) → str (JSON)
kb_bulk_tags(document_ids?, tags?, doc_type?, from_id?, to_id?, add?, remove?, force?) → str (JSON)
kb_bulk_set_tags(document_ids?, tags?, doc_type?, from_id?, to_id?, new_tags?, force?) → str (JSON)

Note: The tags parameter on bulk tools serves as a selection filter (which documents to target), while add/remove (on bulk_tags) and new_tags (on bulk_set_tags) are the operation (what to do to the tags). Tool descriptions must make this distinction clear.
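To illustrate the filter-versus-operation split for kb_bulk_tags, the request body a thin wrapper might send to POST /api/v1/bulk/tags can be sketched like this. The function name build_bulk_tags_body is hypothetical, and the validation rules are assumptions layered on the requirement that at least one selection filter be present.

```python
from typing import Any, Optional, Sequence


def build_bulk_tags_body(
    *,
    document_ids: Optional[Sequence[int]] = None,
    tags: Optional[Sequence[str]] = None,      # selection: which documents
    doc_type: Optional[str] = None,
    from_id: Optional[int] = None,
    to_id: Optional[int] = None,
    add: Optional[Sequence[str]] = None,       # operation: tags to attach
    remove: Optional[Sequence[str]] = None,    # operation: tags to detach
    force: bool = False,
) -> dict[str, Any]:
    """Assemble the JSON body, omitting every parameter left unset."""
    selection = {
        "document_ids": document_ids,
        "tags": tags,
        "doc_type": doc_type,
        "from_id": from_id,
        "to_id": to_id,
    }
    body: dict[str, Any] = {k: v for k, v in selection.items() if v is not None}
    if not body:
        raise ValueError("at least one selection filter is required")
    if not add and not remove:
        raise ValueError("nothing to do: pass add and/or remove")
    if add:
        body["add"] = list(add)
    if remove:
        body["remove"] = list(remove)
    if force:
        body["force"] = True
    return body
```

Keeping tags strictly on the selection side and add/remove on the operation side is what the tool description would need to spell out for callers.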

9. CLI bulk commands

Three new commands under client/cmd/:

kb bulk-remove --tags "draft,old" --type note --force --yes
kb bulk-tag --tags "agent:mybot" --add "reviewed" --remove "pending" --yes
kb bulk-set-tags --ids "1,5,12" --tags "clean,final" --yes

Filter flags (shared): --tags, --type, --ids (comma-separated), --from-id, --to-id, --force
Confirmation: --yes / -y to skip interactive prompt.

Without --yes, the CLI first shows the match count and asks for confirmation:

This will delete 47 documents matching: tags=[draft,old] type=note
Proceed? [y/N]

10. Engine config for safety threshold

New env var: KB_BULK_SAFETY_PERCENT (integer, default 70). Added to engine/kb/config.py.
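Reading and validating the variable might look like the sketch below. The function name load_bulk_safety_percent and the choice to raise on invalid values (rather than fall back to the default) are assumptions, not the contents of engine/kb/config.py.

```python
import os
from typing import Mapping, Optional

DEFAULT_BULK_SAFETY_PERCENT = 70


def load_bulk_safety_percent(env: Optional[Mapping[str, str]] = None) -> int:
    """Parse KB_BULK_SAFETY_PERCENT as an integer in the range 0-100."""
    env = os.environ if env is None else env
    raw = env.get("KB_BULK_SAFETY_PERCENT")
    if raw is None or not raw.strip():
        return DEFAULT_BULK_SAFETY_PERCENT
    value = int(raw)  # raises ValueError for non-integer input
    if not 0 <= value <= 100:
        raise ValueError(
            f"KB_BULK_SAFETY_PERCENT must be between 0 and 100, got {value}"
        )
    return value
```

Failing loudly at startup on a bad value seems safer than silently reverting to 70, since a mistyped threshold would otherwise go unnoticed.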

Risks / Trade-offs

[Bulk delete is irreversible] → Safety threshold mitigates accidental mass deletion. CLI requires interactive confirmation. No undo mechanism — this is deliberate to keep the system simple.
[Naming collision: tags as filter vs operation] → The tags parameter in bulk_tags selects documents, while add/remove specifies the tag changes. Clear naming and tool descriptions mitigate confusion. Engine request model uses the same field name as the existing list/search filter.
[SQLite lock during large bulk ops] → A single transaction deleting 5000 documents will hold a write lock. With WAL mode, readers are not blocked, though other writers wait until the transaction commits. The lock duration should be under a few seconds for typical workloads.
[Breaking change: collection removal] → Any MCP client relying on collection parameters will break. Since collections were only recently added and are not widely deployed, this is acceptable. Existing collection:* tags in the database remain as regular tags — they still work as filters, just without special treatment.
[Jobs table overload] → Bulk operations add a new job type to a table designed for ingestion jobs. The schema change is minimal (one new column) and the audit trail value outweighs the mixing of concerns.
