b5a203d2aa
- Add bulk delete, bulk tags, and bulk set-tags engine endpoints (POST /api/v1/bulk/delete, /bulk/tags, /bulk/set-tags) - Filter-based selection: by tags, doc_type, ID list, ID range - Safety threshold (KB_BULK_SAFETY_PERCENT, default 70%) prevents accidental mass operations unless force=true - Synchronous execution with audit trail via jobs table - Add kb_bulk_delete, kb_bulk_tags, kb_bulk_set_tags MCP tools - Add kb bulk-remove, bulk-tag, bulk-set-tags CLI commands - Remove collection abstraction from MCP server (use tags instead) - Remove kb_set_collection MCP tool - Update SKILL.md, MCP.md, README.md documentation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
195 lines
8.9 KiB
Markdown
## Context

The engine API (`engine/kb/routes/`) provides single-document operations for delete (`DELETE /api/v1/documents/{id}`) and tag management (`PUT /api/v1/documents/{id}/tags`). The MCP server (`mcp/server.py`) wraps these and adds a "collection" abstraction via `collection:`-prefixed tags: roughly 70 lines of helpers and translation logic that only the MCP layer understands.

The database is SQLite with WAL mode, FTS5 for full-text search, and sqlite-vec for embeddings. Foreign keys with `ON DELETE CASCADE` handle chunk cleanup when documents are deleted. Stored files on disk must be cleaned up separately.

## Goals / Non-Goals

**Goals:**
- Bulk delete, bulk tag add/remove, and bulk set-tags (replace) via engine API, MCP tools, and CLI
- Filter-based selection: by tag, doc_type, ID list, and ID range
- Safety threshold to prevent accidental mass operations
- Audit trail via jobs table
- Remove collection abstraction from MCP server

**Non-Goals:**
- Async/queued bulk operations (SQLite handles thousands of rows synchronously in <1s)
- Bulk document retrieval or bulk note creation
- Undo/recycle bin for bulk deletes
- Adding collection concept to engine or CLI (collections are being removed, not moved)

## Decisions

### 1. Common selection filter for all bulk endpoints

All three bulk endpoints accept the same selection body:

```json
{
  "document_ids": [1, 5, 12],
  "tags": ["agent:mybot", "draft"],
  "doc_type": "note",
  "from_id": 10,
  "to_id": 50
}
```

Filters combine with AND logic. At least one filter is required; the engine rejects requests with no selection criteria (400).

**Selection SQL generation**: A shared helper in `database.py` builds the WHERE clause from the filter. The `tags` filter uses the same JOIN pattern as `list_documents` (all specified tags must match). The `document_ids` filter uses an `IN (...)` clause with one `?` placeholder per ID. The `from_id`/`to_id` filter uses `id >= ? AND id <= ?`.

**Alternative considered**: Separate endpoints per filter type. Rejected: combinable filters are more powerful and the SQL generation is straightforward.

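
The selection helper described above could be sketched roughly as follows. This is a minimal illustration, not the code in `database.py`: the function name `build_selection_where`, the `d` alias, and the two-table schema are assumptions, and the tag match uses a grouped subquery as a stand-in for whatever JOIN pattern `list_documents` actually uses. It also treats `from_id` and `to_id` as independently optional, which the source does not state.

```python
import sqlite3
from typing import Any, Optional


def build_selection_where(
    document_ids: Optional[list[int]] = None,
    tags: Optional[list[str]] = None,
    doc_type: Optional[str] = None,
    from_id: Optional[int] = None,
    to_id: Optional[int] = None,
) -> tuple[str, list[Any]]:
    """Build an AND-combined WHERE clause and its bound parameters."""
    clauses: list[str] = []
    params: list[Any] = []
    if document_ids:
        marks = ", ".join("?" for _ in document_ids)  # one placeholder per ID
        clauses.append(f"d.id IN ({marks})")
        params.extend(document_ids)
    if tags:
        wanted = sorted(set(tags))
        marks = ", ".join("?" for _ in wanted)
        # A document qualifies only if it carries every requested tag.
        clauses.append(
            "d.id IN (SELECT document_id FROM document_tags "
            f"WHERE tag IN ({marks}) "
            "GROUP BY document_id HAVING COUNT(DISTINCT tag) = ?)"
        )
        params.extend(wanted)
        params.append(len(wanted))
    if doc_type is not None:
        clauses.append("d.doc_type = ?")
        params.append(doc_type)
    if from_id is not None:
        clauses.append("d.id >= ?")
        params.append(from_id)
    if to_id is not None:
        clauses.append("d.id <= ?")
        params.append(to_id)
    if not clauses:
        # Mirrors the engine's 400 response for an empty selection.
        raise ValueError("at least one selection filter is required")
    return " AND ".join(clauses), params


# Demo against a throwaway in-memory database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (id INTEGER PRIMARY KEY, doc_type TEXT);
    CREATE TABLE document_tags (document_id INTEGER, tag TEXT);
    INSERT INTO documents VALUES (1, 'note'), (2, 'note'), (3, 'pdf');
    INSERT INTO document_tags VALUES
        (1, 'draft'), (1, 'old'), (2, 'draft'), (3, 'draft'), (3, 'old');
    """
)
where, params = build_selection_where(tags=["draft", "old"], doc_type="note")
matched_ids = [
    row[0]
    for row in conn.execute(
        f"SELECT d.id FROM documents d WHERE {where} ORDER BY d.id", params
    )
]
```

Only values are bound as parameters; the clause text itself is assembled from fixed fragments, so user input never reaches the SQL string.
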
### 2. Safety threshold with configurable percentage

Before executing, the engine counts matched documents and total documents. If the matched share (`matched / total`, expressed as a percentage) exceeds the threshold, the request is rejected:

```
HTTP 409 Conflict
{
  "error": "safety_threshold_exceeded",
  "message": "Operation would affect 750 of 1000 documents (75.0%). Exceeds safety threshold of 70%. Use force: true to proceed.",
  "matched": 750,
  "total": 1000,
  "percent": 75.0,
  "threshold": 70
}
```

- Default threshold: 70% (env var `KB_BULK_SAFETY_PERCENT`, integer 0-100)
- Override per-request: `"force": true` in the request body
- A threshold of 0 disables the safety check. This needs an explicit special case, because the plain comparison above would otherwise reject every non-empty match at 0
- CLI maps the override to the `--force` / `-f` flag

The check is a SELECT COUNT before the operation, so overhead is minimal.

**Alternative considered**: Dry-run mode (preview what would be affected, then confirm). Rejected: it adds a two-step flow that doesn't help LLM callers (they'd just always confirm), and the safety threshold covers the dangerous case.

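
The threshold check above might look something like this. It is a sketch under assumptions: the function name and the choice to return a payload (rather than raise an `HTTPException`, as the engine helper is described as doing) are illustrative, and treating 0 as "disabled" follows the bullet list above rather than any confirmed implementation.

```python
from typing import Optional


def check_safety_threshold(
    matched: int, total: int, threshold_percent: int = 70, force: bool = False
) -> Optional[dict]:
    """Return a 409-style error payload if the operation is blocked, else None."""
    # force bypasses the check, 0 is treated as "disabled", and an empty
    # table has nothing to protect.
    if force or threshold_percent == 0 or total == 0:
        return None
    # Integer comparison avoids float rounding right at the boundary.
    if matched * 100 <= threshold_percent * total:
        return None
    percent = round(matched / total * 100, 1)
    return {
        "error": "safety_threshold_exceeded",
        "message": (
            f"Operation would affect {matched} of {total} documents ({percent}%). "
            f"Exceeds safety threshold of {threshold_percent}%. "
            "Use force: true to proceed."
        ),
        "matched": matched,
        "total": total,
        "percent": percent,
        "threshold": threshold_percent,
    }


blocked = check_safety_threshold(750, 1000)   # 75% of documents: blocked
allowed = check_safety_threshold(700, 1000)   # exactly 70%: not above the threshold
```

Note that the comparison is strict: a match of exactly 70% passes, consistent with the `matched / total > threshold` wording.
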
### 3. Synchronous execution with audit logging

Bulk operations execute synchronously and return a summary response:

```json
{
  "job_id": 42,
  "status": "done",
  "matched": 750,
  "succeeded": 748,
  "failed": 2,
  "errors": [
    {"document_id": 42, "error": "file locked"},
    {"document_id": 99, "error": "not found"}
  ]
}
```

A job record is created in the `jobs` table with a new `bulk_delete` / `bulk_tags` / `bulk_set_tags` job type. This requires extending the jobs table:

- Add `job_type` column: `"ingest"` (default, for existing jobs) or `"bulk_delete"` / `"bulk_tags"` / `"bulk_set_tags"`
- The job's `filename` field stores a JSON summary of the selection filter for auditability
- `document_id` field stores the count of affected documents
- `error` field stores JSON array of individual errors if any

**Alternative considered**: Full async with job polling. Rejected: SQLite bulk operations are fast enough synchronously, and async would require extra polling calls (defeating the purpose of reducing token usage).

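
To make the jobs-table extension concrete, here is a rough sketch of the migration and of writing one audit row. The existing `jobs` columns shown here are a guess for illustration only (the real table surely has more), and `log_bulk_job` is a simplified, hypothetical version of the `_log_bulk_job` helper named later in this document.

```python
import json
import sqlite3
from typing import Any


def log_bulk_job(
    conn: sqlite3.Connection,
    job_type: str,
    filters: dict[str, Any],
    affected: int,
    errors: list[dict[str, Any]],
) -> int:
    """Insert one audit row for a finished bulk operation and return its id."""
    cur = conn.execute(
        "INSERT INTO jobs (job_type, filename, status, document_id, error) "
        "VALUES (?, ?, 'done', ?, ?)",
        (
            job_type,
            json.dumps(filters, sort_keys=True),    # selection filter, for auditing
            affected,                               # count, not a real document id
            json.dumps(errors) if errors else None,
        ),
    )
    conn.commit()
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Assumed pre-existing shape of the table, with one ingestion job in it.
    CREATE TABLE jobs (
        id INTEGER PRIMARY KEY, filename TEXT, status TEXT,
        document_id INTEGER, error TEXT
    );
    INSERT INTO jobs (filename, status, document_id) VALUES ('report.pdf', 'done', 7);
    -- The migration: existing rows pick up the 'ingest' default.
    ALTER TABLE jobs ADD COLUMN job_type TEXT NOT NULL DEFAULT 'ingest';
    """
)
job_id = log_bulk_job(
    conn,
    "bulk_delete",
    {"tags": ["draft"], "doc_type": "note"},
    748,
    [{"document_id": 42, "error": "file locked"}],
)
rows = conn.execute(
    "SELECT id, job_type, filename, document_id FROM jobs ORDER BY id"
).fetchall()
```

SQLite accepts `ADD COLUMN ... NOT NULL` only when a non-NULL default is supplied, which is why the default doubles as the backfill for existing ingestion jobs.
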
### 4. Bulk delete implementation

For each matched document:
1. Collect chunk IDs
2. Delete embeddings from `chunks_vec`
3. Delete the document row (cascades to chunks, document_tags)
4. Delete stored file from disk

This follows the same logic as the existing `delete_document` endpoint, but batched: all DB changes for the matched documents run in a single SQLite transaction, so they are atomic. File deletions happen after the commit and are best-effort. If a file deletion fails, the document is still counted as succeeded (the DB record is gone) and a warning is logged.

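
The four steps above can be sketched as below. Everything about the schema is assumed for illustration: `stored_path` is a hypothetical column name, and `chunks_vec` is modelled as an ordinary table keyed by `chunk_id`, whereas the real one is a sqlite-vec virtual table. The sketch also uses set-based statements instead of a per-document loop.

```python
import logging
import os
import sqlite3
import tempfile

logger = logging.getLogger(__name__)


def bulk_delete(conn: sqlite3.Connection, doc_ids: list[int]) -> int:
    """Delete documents in one transaction; remove their files afterwards."""
    if not doc_ids:
        return 0
    marks = ", ".join("?" for _ in doc_ids)
    paths = [
        row[0]
        for row in conn.execute(
            f"SELECT stored_path FROM documents WHERE id IN ({marks})", doc_ids
        )
        if row[0]
    ]
    with conn:  # commits on success, rolls back if anything raises
        # Steps 1-2: embeddings sit outside the cascade, so remove them by chunk ID.
        conn.execute(
            "DELETE FROM chunks_vec WHERE chunk_id IN "
            f"(SELECT id FROM chunks WHERE document_id IN ({marks}))",
            doc_ids,
        )
        # Step 3: ON DELETE CASCADE removes the dependent chunk rows.
        deleted = conn.execute(
            f"DELETE FROM documents WHERE id IN ({marks})", doc_ids
        ).rowcount
    # Step 4: best-effort file cleanup, only after the commit.
    for path in paths:
        try:
            os.remove(path)
        except OSError as exc:
            logger.warning("could not remove %s: %s", path, exc)
    return deleted


# Demo: document 1 has a real file, document 2 points at a file that is missing.
workdir = tempfile.mkdtemp()
doc1_file = os.path.join(workdir, "doc1.txt")
with open(doc1_file, "w") as fh:
    fh.write("stored content")
missing_file = os.path.join(workdir, "already-gone.txt")

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # cascades are off by default in SQLite
conn.executescript(
    """
    CREATE TABLE documents (id INTEGER PRIMARY KEY, stored_path TEXT);
    CREATE TABLE chunks (
        id INTEGER PRIMARY KEY,
        document_id INTEGER REFERENCES documents(id) ON DELETE CASCADE
    );
    CREATE TABLE chunks_vec (chunk_id INTEGER PRIMARY KEY);
    """
)
conn.execute(
    "INSERT INTO documents VALUES (1, ?), (2, ?), (3, NULL)",
    (doc1_file, missing_file),
)
conn.executescript(
    """
    INSERT INTO chunks VALUES (10, 1), (20, 2), (30, 3);
    INSERT INTO chunks_vec VALUES (10), (20), (30);
    """
)
deleted = bulk_delete(conn, [1, 2])
```

Document 2 still counts as deleted even though its file could not be removed, which matches the "succeeded with a logged warning" rule. For very large ID lists, a real implementation would also need to stay within SQLite's bound-parameter limit, for example by chunking the list.
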
### 5. Bulk tags implementation

Two distinct operations:

**`POST /api/v1/bulk/tags`** adds and/or removes tags:
```json
{
  "add": ["reviewed", "approved"],
  "remove": ["draft"],
  ...selection filters...
}
```

**`POST /api/v1/bulk/set-tags`** replaces all tags:
```json
{
  "tags": ["final", "approved"],
  ...selection filters...
}
```

The `set-tags` operation removes all existing tags from matched documents, then applies the new set. This is useful for cleaning up tag clutter or migrating tagging schemes.

Both update `updated_at` on affected documents.

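
As an illustration of the difference between the two operations, the sketch below applies each to an already-resolved list of document IDs. The schema is an assumption (a flat `document_tags(document_id, tag)` table; the engine may well normalise tags into their own table), the function names are hypothetical, and the remove-before-add ordering is a choice made here, not something the design specifies.

```python
import sqlite3
from typing import Iterable, Sequence


def _touch(conn: sqlite3.Connection, doc_ids: Sequence[int]) -> None:
    """Bump updated_at on every affected document."""
    conn.executemany(
        "UPDATE documents SET updated_at = CURRENT_TIMESTAMP WHERE id = ?",
        [(doc_id,) for doc_id in doc_ids],
    )


def bulk_tags(
    conn: sqlite3.Connection,
    doc_ids: Sequence[int],
    add: Iterable[str] = (),
    remove: Iterable[str] = (),
) -> None:
    """Add and/or remove individual tags, leaving other tags untouched."""
    add, remove = list(add), list(remove)
    with conn:
        # Removal runs first, so a tag listed in both ends up present.
        conn.executemany(
            "DELETE FROM document_tags WHERE document_id = ? AND tag = ?",
            [(d, t) for d in doc_ids for t in remove],
        )
        conn.executemany(
            "INSERT OR IGNORE INTO document_tags (document_id, tag) VALUES (?, ?)",
            [(d, t) for d in doc_ids for t in add],
        )
        _touch(conn, doc_ids)


def bulk_set_tags(
    conn: sqlite3.Connection, doc_ids: Sequence[int], new_tags: Iterable[str]
) -> None:
    """Replace the whole tag set of each document."""
    new_tags = list(new_tags)
    with conn:
        conn.executemany(
            "DELETE FROM document_tags WHERE document_id = ?",
            [(d,) for d in doc_ids],
        )
        conn.executemany(
            "INSERT OR IGNORE INTO document_tags (document_id, tag) VALUES (?, ?)",
            [(d, t) for d in doc_ids for t in new_tags],
        )
        _touch(conn, doc_ids)


def tags_of(conn: sqlite3.Connection, doc_id: int) -> list[str]:
    rows = conn.execute(
        "SELECT tag FROM document_tags WHERE document_id = ? ORDER BY tag", (doc_id,)
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (id INTEGER PRIMARY KEY, updated_at TEXT);
    CREATE TABLE document_tags (
        document_id INTEGER, tag TEXT, PRIMARY KEY (document_id, tag)
    );
    INSERT INTO documents VALUES (1, '2000-01-01 00:00:00'), (2, '2000-01-01 00:00:00');
    INSERT INTO document_tags VALUES (1, 'draft'), (2, 'draft'), (2, 'pending');
    """
)
bulk_tags(conn, [1, 2], add=["reviewed", "approved"], remove=["draft"])
after_bulk_tags = {d: tags_of(conn, d) for d in (1, 2)}
bulk_set_tags(conn, [2], ["final", "approved"])
after_set_tags = tags_of(conn, 2)
```

Document 2 keeps its unrelated `pending` tag through `bulk_tags` but loses it under `bulk_set_tags`, which is the practical distinction between the two endpoints.
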
### 6. Remove collection abstraction from MCP

Remove from `mcp/server.py`:
- Constants: `COLLECTION_TAG_PREFIX`, `DEFAULT_COLLECTION`
- Functions: `_collection_tag`, `_strip_collection_tags`, `_process_document`, `_process_search_results`, `_ensure_exclusive_collection`
- Tool: `kb_set_collection` (entire tool removed)
- Parameters: `collection` from `kb_search`, `kb_addnote`, `kb_upload_start`

The `_process_document` and `_process_search_results` calls in remaining tools are removed; documents are returned as-is from the engine, with all tags visible.

Users/agents that need namespace isolation use a tag convention (e.g. `agent:claude-code`) communicated via system prompt or tool instructions.

### 7. Engine bulk route module

New file: `engine/kb/routes/bulk.py`

Three endpoints sharing common infrastructure:
- `_resolve_selection(conn, filters)` → list of document IDs + count
- `_check_safety_threshold(matched, total, force)` → raises HTTPException if exceeded
- `_log_bulk_job(conn, job_type, filters, matched, succeeded, failed, errors)` → job_id

### 8. MCP bulk tools

Three new tools in `mcp/server.py`, thin wrappers calling new `engine.py` methods:

- `kb_bulk_delete(document_ids?, tags?, doc_type?, from_id?, to_id?, force?)` → str (JSON)
- `kb_bulk_tags(document_ids?, tags?, doc_type?, from_id?, to_id?, add?, remove?, force?)` → str (JSON)
- `kb_bulk_set_tags(document_ids?, tags?, doc_type?, from_id?, to_id?, new_tags?, force?)` → str (JSON)

Note: The `tags` parameter on bulk tools serves as a **selection filter** (which documents to target), while `add`/`remove` (on bulk_tags) and `new_tags` (on bulk_set_tags) are the **operation** (what to do to the tags). Tool descriptions must make this distinction clear.

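
One way a thin wrapper could keep the filter/operation distinction visible is to assemble the request body in two clearly separated steps. The helper below is purely hypothetical (neither its name nor its validation rules come from the source); it only reuses the field names shown in the `POST /api/v1/bulk/tags` body and the `kb_bulk_tags` parameter list.

```python
from typing import Any, Optional


def build_bulk_tags_payload(
    document_ids: Optional[list[int]] = None,
    tags: Optional[list[str]] = None,
    doc_type: Optional[str] = None,
    from_id: Optional[int] = None,
    to_id: Optional[int] = None,
    add: Optional[list[str]] = None,
    remove: Optional[list[str]] = None,
    force: bool = False,
) -> dict[str, Any]:
    """Assemble the JSON body for a bulk tag add/remove request."""
    # Selection: which documents to target. Unset filters are omitted.
    selection = {
        "document_ids": document_ids,
        "tags": tags,
        "doc_type": doc_type,
        "from_id": from_id,
        "to_id": to_id,
    }
    payload: dict[str, Any] = {k: v for k, v in selection.items() if v is not None}
    if not payload:
        raise ValueError("at least one selection filter is required")
    # Operation: what to do to the tags of the selected documents.
    if not add and not remove:
        raise ValueError("nothing to do: pass add and/or remove")
    if add:
        payload["add"] = list(add)
    if remove:
        payload["remove"] = list(remove)
    if force:
        payload["force"] = True
    return payload


payload = build_bulk_tags_payload(
    tags=["agent:mybot"], add=["reviewed"], remove=["pending"]
)
```

Failing fast in the wrapper is optional, since the engine already rejects empty selections, but it saves an LLM caller a round trip.
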
### 9. CLI bulk commands

Three new commands under `client/cmd/`:

```
kb bulk-remove --tags "draft,old" --type note --force --yes
kb bulk-tag --tags "agent:mybot" --add "reviewed" --remove "pending" --yes
kb bulk-set-tags --ids "1,5,12" --tags "clean,final" --yes
```

Filter flags (shared): `--tags`, `--type`, `--ids` (comma-separated), `--from-id`, `--to-id`, `--force`

Confirmation: `--yes` / `-y` to skip interactive prompt.

Without `--yes`, the CLI first shows the match count and asks for confirmation:

```
This will delete 47 documents matching: tags=[draft,old] type=note
Proceed? [y/N]
```

### 10. Engine config for safety threshold

New env var: `KB_BULK_SAFETY_PERCENT` (integer, default 70). Added to `engine/kb/config.py`.

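
Reading and validating the variable could look roughly like this. How `engine/kb/config.py` actually loads settings is not shown in this document, so the function name and the decision to fail loudly on bad values are assumptions.

```python
import os
from typing import Mapping, Optional


def load_bulk_safety_percent(environ: Optional[Mapping[str, str]] = None) -> int:
    """Read KB_BULK_SAFETY_PERCENT, defaulting to 70 and enforcing 0-100."""
    env = os.environ if environ is None else environ
    raw = env.get("KB_BULK_SAFETY_PERCENT", "70")
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(
            f"KB_BULK_SAFETY_PERCENT must be an integer, got {raw!r}"
        ) from None
    if not 0 <= value <= 100:
        raise ValueError(
            f"KB_BULK_SAFETY_PERCENT must be between 0 and 100, got {value}"
        )
    return value


default_percent = load_bulk_safety_percent({})
custom_percent = load_bulk_safety_percent({"KB_BULK_SAFETY_PERCENT": "85"})
```

Rejecting out-of-range values at startup seems safer than silently clamping them, since a typo here would change how destructive operations are guarded.
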
## Risks / Trade-offs

- **[Bulk delete is irreversible]** → Safety threshold mitigates accidental mass deletion. CLI requires interactive confirmation. There is no undo mechanism; this is deliberate to keep the system simple.
- **[Naming collision: `tags` as filter vs operation]** → The `tags` parameter in bulk_tags selects documents, while `add`/`remove` specifies the tag changes. Clear naming and tool descriptions mitigate confusion. Engine request model uses the same field name as the existing list/search filter.
- **[SQLite lock during large bulk ops]** → A single transaction deleting 5000 documents will hold a write lock. With WAL mode, readers are not blocked. The lock duration should be under a few seconds for typical workloads.
- **[Breaking change: collection removal]** → Any MCP client relying on `collection` parameters will break. Since collections were only recently added and are not widely deployed, this is acceptable. Existing `collection:*` tags in the database remain as regular tags; they still work as filters, just without special treatment.
- **[Jobs table overload]** → Bulk operations add a new job type to a table designed for ingestion jobs. The schema change is minimal (one new column) and the audit trail value outweighs the mixing of concerns.