Add bulk operations and remove collections abstraction
- Add bulk delete, bulk tags, and bulk set-tags engine endpoints (POST /api/v1/bulk/delete, /bulk/tags, /bulk/set-tags)
- Filter-based selection: by tags, doc_type, ID list, ID range
- Safety threshold (KB_BULK_SAFETY_PERCENT, default 70%) prevents accidental mass operations unless force=true
- Synchronous execution with audit trail via jobs table
- Add kb_bulk_delete, kb_bulk_tags, kb_bulk_set_tags MCP tools
- Add kb bulk-remove, bulk-tag, bulk-set-tags CLI commands
- Remove collection abstraction from MCP server (use tags instead)
- Remove kb_set_collection MCP tool
- Update SKILL.md, MCP.md, README.md documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-04-04
@@ -0,0 +1,194 @@
|
||||
## Context
|
||||
|
||||
The engine API (`engine/kb/routes/`) provides single-document operations for delete (`DELETE /api/v1/documents/{id}`) and tag management (`PUT /api/v1/documents/{id}/tags`). The MCP server (`mcp/server.py`) wraps these and adds a "collection" abstraction via `collection:`-prefixed tags — ~70 lines of helpers and translation logic that only the MCP layer understands.
|
||||
|
||||
The database is SQLite with WAL mode, FTS5 for full-text search, and sqlite-vec for embeddings. Foreign keys with `ON DELETE CASCADE` handle chunk cleanup when documents are deleted. Stored files on disk must be cleaned up separately.
|
||||
|
||||
## Goals / Non-Goals
|
||||
|
||||
**Goals:**
|
||||
- Bulk delete, bulk tag add/remove, and bulk set-tags (replace) via engine API, MCP tools, and CLI
|
||||
- Filter-based selection: by tag, doc_type, ID list, and ID range
|
||||
- Safety threshold to prevent accidental mass operations
|
||||
- Audit trail via jobs table
|
||||
- Remove collection abstraction from MCP server
|
||||
|
||||
**Non-Goals:**
|
||||
- Async/queued bulk operations (SQLite handles thousands of rows synchronously in <1s)
|
||||
- Bulk document retrieval or bulk note creation
|
||||
- Undo/recycle bin for bulk deletes
|
||||
- Adding collection concept to engine or CLI (collections are being removed, not moved)
|
||||
|
||||
## Decisions
|
||||
|
||||
### 1. Common selection filter for all bulk endpoints
|
||||
|
||||
All three bulk endpoints accept the same selection body:
|
||||
|
||||
```json
{
  "document_ids": [1, 5, 12],
  "tags": ["agent:mybot", "draft"],
  "doc_type": "note",
  "from_id": 10,
  "to_id": 50
}
```
|
||||
|
||||
Filters combine with AND logic. At least one filter is required — the engine rejects requests with no selection criteria (400).
|
||||
|
||||
**Selection SQL generation**: A shared helper in `database.py` builds the WHERE clause from the filter. The `tags` filter uses the same JOIN pattern as `list_documents` (all specified tags must match). The `document_ids` filter uses an `IN (...)` list with one `?` placeholder per ID. The `from_id`/`to_id` filter uses `id >= ?` and `id <= ?`, each applied only when the corresponding bound is given.
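As a rough illustration of how such a helper might assemble the query, here is a minimal sketch. The function name `build_selection_query`, the `documents` columns, and the shape of `document_tags` (`document_id`, `tag`) are assumptions made for the example. The tag match uses a `GROUP BY ... HAVING` subquery as a stand-in, because the actual JOIN pattern of `list_documents` is not shown here.

```python
import sqlite3
from typing import List, Optional, Sequence, Tuple


def build_selection_query(
    document_ids: Optional[Sequence[int]] = None,
    tags: Optional[Sequence[str]] = None,
    doc_type: Optional[str] = None,
    from_id: Optional[int] = None,
    to_id: Optional[int] = None,
) -> Tuple[str, List]:
    """Return (sql, params) selecting IDs of documents that match ALL filters.

    Hypothetical schema: documents(id, doc_type), document_tags(document_id, tag).
    """
    clauses: List[str] = []
    params: List = []

    if document_ids is not None:
        if document_ids:
            # One placeholder per ID.
            marks = ", ".join("?" for _ in document_ids)
            clauses.append(f"d.id IN ({marks})")
            params.extend(document_ids)
        else:
            # An explicit empty list selects nothing rather than everything.
            clauses.append("1 = 0")
    if tags:
        unique_tags = sorted(set(tags))
        marks = ", ".join("?" for _ in unique_tags)
        # A document qualifies only if it carries every requested tag.
        clauses.append(
            "d.id IN (SELECT document_id FROM document_tags "
            f"WHERE tag IN ({marks}) "
            "GROUP BY document_id HAVING COUNT(DISTINCT tag) = ?)"
        )
        params.extend(unique_tags)
        params.append(len(unique_tags))
    if doc_type is not None:
        clauses.append("d.doc_type = ?")
        params.append(doc_type)
    if from_id is not None:
        clauses.append("d.id >= ?")
        params.append(from_id)
    if to_id is not None:
        clauses.append("d.id <= ?")
        params.append(to_id)

    if not clauses:
        # The engine would map this to a 400 response.
        raise ValueError("at least one selection filter is required")

    sql = "SELECT d.id FROM documents d WHERE " + " AND ".join(clauses) + " ORDER BY d.id"
    return sql, params
```

Every value is passed as a bound parameter, so filter contents never end up in the SQL text itself. Very large ID lists may need batching to stay under SQLite's bound-parameter limit.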
|
||||
|
||||
**Alternative considered**: Separate endpoints per filter type. Rejected — combinable filters are more powerful and the SQL generation is straightforward.
|
||||
|
||||
### 2. Safety threshold with configurable percentage
|
||||
|
||||
Before executing, the engine counts matched documents and total documents. If `matched / total * 100 > threshold`, the request is rejected:
|
||||
|
||||
```
HTTP 409 Conflict
{
  "error": "safety_threshold_exceeded",
  "message": "Operation would affect 750 of 1000 documents (75.0%). Exceeds safety threshold of 70%. Use force: true to proceed.",
  "matched": 750,
  "total": 1000,
  "percent": 75.0,
  "threshold": 70
}
```
|
||||
|
||||
- Default threshold: 70% (env var `KB_BULK_SAFETY_PERCENT`, integer 0-100)
|
||||
- Override per-request: `"force": true` in the request body
|
||||
- Threshold of 0 disables the safety check (a special case: applied literally, a 0% limit would reject every non-empty match)
|
||||
- CLI maps this to `--force` / `-f` flag
|
||||
|
||||
The check is a SELECT COUNT before the operation — minimal overhead.
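The decision logic can be sketched as a small pure function. This is an illustration under stated assumptions, not the engine's code: the name `check_safety_threshold` and the choice to return a payload (rather than raise an HTTP exception) are hypothetical, and an empty database is assumed to pass the check.

```python
from typing import Optional


def check_safety_threshold(
    matched: int, total: int, threshold: int, force: bool = False
) -> Optional[dict]:
    """Return a 409-style error payload if the operation is blocked, else None."""
    # force=true and a threshold of 0 both skip the check; an empty
    # database (total == 0) is assumed to have nothing to protect.
    if force or threshold == 0 or total == 0:
        return None
    # Integer comparison avoids floating-point edge cases at the boundary.
    if matched * 100 <= threshold * total:
        return None
    percent = round(matched * 100 / total, 1)
    return {
        "error": "safety_threshold_exceeded",
        "message": (
            f"Operation would affect {matched} of {total} documents "
            f"({percent:.1f}%). Exceeds safety threshold of {threshold}%. "
            "Use force: true to proceed."
        ),
        "matched": matched,
        "total": total,
        "percent": percent,
        "threshold": threshold,
    }
```

A match of exactly 70% passes with the default threshold, since only values strictly above the threshold are rejected.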
|
||||
|
||||
**Alternative considered**: Dry-run mode (preview what would be affected, then confirm). Rejected — adds a two-step flow that doesn't help LLM callers (they'd just always confirm) and the safety threshold covers the dangerous case.
|
||||
|
||||
### 3. Synchronous execution with audit logging
|
||||
|
||||
Bulk operations execute synchronously and return a summary response:
|
||||
|
||||
```json
{
  "job_id": 42,
  "status": "partial_failure",
  "matched": 750,
  "succeeded": 748,
  "failed": 2,
  "errors": [
    {"document_id": 42, "error": "file locked"},
    {"document_id": 99, "error": "not found"}
  ]
}
```
|
||||
|
||||
A job record is created in the `jobs` table with a new job type (`bulk_delete`, `bulk_tags`, or `bulk_set_tags`). This requires extending the jobs table:
|
||||
|
||||
- Add `job_type` column: `"ingest"` (default, for existing jobs) or `"bulk_delete"` / `"bulk_tags"` / `"bulk_set_tags"`
- The job's `filename` field stores a JSON summary of the selection filter for auditability
- `document_id` field stores the count of affected documents
- `error` field stores JSON array of individual errors if any
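The column addition could be done as an idempotent startup migration. The sketch below assumes only that a `jobs` table already exists; the helper name `ensure_job_type_column` is hypothetical, and the real `init_schema` may structure this differently.

```python
import sqlite3


def ensure_job_type_column(conn: sqlite3.Connection) -> bool:
    """Add jobs.job_type if it is missing. Returns True when the column was added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(jobs)")}
    if "job_type" in columns:
        return False
    # Existing rows read back the default, so old ingestion jobs become 'ingest'.
    conn.execute("ALTER TABLE jobs ADD COLUMN job_type TEXT DEFAULT 'ingest'")
    conn.commit()
    return True
```

Calling it on every start is safe because the second call finds the column and does nothing.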
|
||||
|
||||
**Alternative considered**: Full async with job polling. Rejected — SQLite bulk operations are fast enough synchronously and async would require extra polling calls (defeating the purpose of reducing token usage).
|
||||
|
||||
### 4. Bulk delete implementation
|
||||
|
||||
For each matched document:
1. Collect chunk IDs
2. Delete embeddings from `chunks_vec`
3. Delete the document row (cascades to chunks, document_tags)
4. Delete stored file from disk
|
||||
|
||||
This follows the same logic as the existing `delete_document` endpoint but batched in a single transaction (except file deletion, which happens after commit). If a file deletion fails, the document is still counted as succeeded (the DB record is gone) but a warning is logged.
|
||||
|
||||
The operation processes documents within a single SQLite transaction for atomicity of the DB changes. File deletions happen post-commit and are best-effort.
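The ordering described above (database changes in one transaction, file removal afterwards) might look roughly like this. It is a sketch under assumptions: the `stored_path` column, the `chunk_id` key on `chunks_vec`, and the function name are hypothetical, and `chunks_vec` is treated as an ordinary table rather than a sqlite-vec virtual table.

```python
import os
import sqlite3
from typing import Iterable, List, Tuple


def bulk_delete_documents(
    conn: sqlite3.Connection, doc_ids: Iterable[int]
) -> Tuple[int, List[str]]:
    """Delete documents in one transaction, then remove their files best-effort.

    Returns (number of document rows deleted, file-deletion warnings).
    Requires PRAGMA foreign_keys = ON for the cascade to chunks.
    """
    ids = list(doc_ids)
    if not ids:
        return 0, []
    marks = ", ".join("?" for _ in ids)

    with conn:  # commits on success, rolls back if any statement raises
        # Remember file paths before the rows disappear.
        paths = [
            row[0]
            for row in conn.execute(
                f"SELECT stored_path FROM documents "
                f"WHERE id IN ({marks}) AND stored_path IS NOT NULL",
                ids,
            )
        ]
        # Embeddings are not covered by the cascade, so remove them first.
        conn.execute(
            f"DELETE FROM chunks_vec WHERE chunk_id IN "
            f"(SELECT id FROM chunks WHERE document_id IN ({marks}))",
            ids,
        )
        deleted = conn.execute(
            f"DELETE FROM documents WHERE id IN ({marks})", ids
        ).rowcount

    # Post-commit: a failed file removal is logged, not counted as a failure.
    warnings: List[str] = []
    for path in paths:
        try:
            os.remove(path)
        except OSError as exc:
            warnings.append(f"{path}: {exc}")
    return deleted, warnings
```

Removing files only after the commit means a rolled-back transaction never leaves a document row pointing at a missing file. The opposite failure, an orphaned file after a successful commit, is the accepted trade-off.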
|
||||
|
||||
### 5. Bulk tags implementation
|
||||
|
||||
Two distinct operations:
|
||||
|
||||
**`POST /api/v1/bulk/tags`** — Add and/or remove tags:
|
||||
```json
{
  "add": ["reviewed", "approved"],
  "remove": ["draft"],
  ...selection filters...
}
```
|
||||
|
||||
**`POST /api/v1/bulk/set-tags`** — Replace all tags:
|
||||
```json
{
  "new_tags": ["final", "approved"],
  ...selection filters...
}
```
|
||||
|
||||
The `set-tags` operation removes all existing tags from matched documents, then applies the new set. This is useful for cleaning up tag clutter or migrating tagging schemes.
|
||||
|
||||
Both update `updated_at` on affected documents.
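To make the difference between the two operations concrete, here is a minimal sketch of both. It assumes a `document_tags(document_id, tag)` table with a uniqueness constraint on the pair and an `updated_at` column on `documents`; the function names are hypothetical and the real tag storage may be normalized differently.

```python
import sqlite3
from typing import Iterable, Sequence


def bulk_update_tags(
    conn: sqlite3.Connection,
    doc_ids: Iterable[int],
    add: Sequence[str] = (),
    remove: Sequence[str] = (),
) -> None:
    """Add and/or remove individual tags; other tags are left untouched."""
    with conn:  # single transaction for all documents
        for doc_id in doc_ids:
            conn.executemany(
                "DELETE FROM document_tags WHERE document_id = ? AND tag = ?",
                [(doc_id, tag) for tag in remove],
            )
            # OR IGNORE makes re-adding an existing tag a no-op.
            conn.executemany(
                "INSERT OR IGNORE INTO document_tags (document_id, tag) VALUES (?, ?)",
                [(doc_id, tag) for tag in add],
            )
            conn.execute(
                "UPDATE documents SET updated_at = CURRENT_TIMESTAMP WHERE id = ?",
                (doc_id,),
            )


def bulk_set_tags(
    conn: sqlite3.Connection, doc_ids: Iterable[int], new_tags: Sequence[str]
) -> None:
    """Replace the whole tag set: drop every existing tag, then apply new_tags."""
    with conn:
        for doc_id in doc_ids:
            conn.execute("DELETE FROM document_tags WHERE document_id = ?", (doc_id,))
            conn.executemany(
                "INSERT OR IGNORE INTO document_tags (document_id, tag) VALUES (?, ?)",
                [(doc_id, tag) for tag in new_tags],
            )
            conn.execute(
                "UPDATE documents SET updated_at = CURRENT_TIMESTAMP WHERE id = ?",
                (doc_id,),
            )
```

`bulk_update_tags` is incremental, while `bulk_set_tags` is destructive: any tag not listed in `new_tags` is gone afterwards.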
|
||||
|
||||
### 6. Remove collection abstraction from MCP
|
||||
|
||||
Remove from `mcp/server.py`:
|
||||
- Constants: `COLLECTION_TAG_PREFIX`, `DEFAULT_COLLECTION`
|
||||
- Functions: `_collection_tag`, `_strip_collection_tags`, `_process_document`, `_process_search_results`, `_ensure_exclusive_collection`
|
||||
- Tool: `kb_set_collection` (entire tool removed)
|
||||
- Parameters: `collection` from `kb_search`, `kb_addnote`, `kb_upload_start`
|
||||
|
||||
The `_process_document` and `_process_search_results` calls in remaining tools are removed — documents are returned as-is from the engine, with all tags visible.
|
||||
|
||||
Users/agents that need namespace isolation use a tag convention (e.g. `agent:claude-code`) communicated via system prompt or tool instructions.
|
||||
|
||||
### 7. Engine bulk route module
|
||||
|
||||
New file: `engine/kb/routes/bulk.py`
|
||||
|
||||
Three endpoints sharing common infrastructure:
|
||||
- `_resolve_selection(conn, filters)` → list of document IDs + count
|
||||
- `_check_safety_threshold(matched, total, force)` → raises HTTPException if exceeded
|
||||
- `_log_bulk_job(conn, job_type, filters, matched, succeeded, failed, errors)` → job_id
|
||||
|
||||
### 8. MCP bulk tools
|
||||
|
||||
Three new tools in `mcp/server.py`, thin wrappers calling new `engine.py` methods:
|
||||
|
||||
- `kb_bulk_delete(document_ids?, tags?, doc_type?, from_id?, to_id?, force?)` → str (JSON)
|
||||
- `kb_bulk_tags(document_ids?, tags?, doc_type?, from_id?, to_id?, add?, remove?, force?)` → str (JSON)
|
||||
- `kb_bulk_set_tags(document_ids?, tags?, doc_type?, from_id?, to_id?, new_tags?, force?)` → str (JSON)
|
||||
|
||||
Note: The `tags` parameter on bulk tools serves as a **selection filter** (which documents to target), while `add`/`remove` (on bulk_tags) and `new_tags` (on bulk_set_tags) are the **operation** (what to do to the tags). Tool descriptions must make this distinction clear.
|
||||
|
||||
### 9. CLI bulk commands
|
||||
|
||||
Three new commands under `client/cmd/`:
|
||||
|
||||
```
kb bulk-remove --tags "draft,old" --type note --force --yes
kb bulk-tag --tags "agent:mybot" --add "reviewed" --remove "pending" --yes
kb bulk-set-tags --ids "1,5,12" --set "clean,final" --yes
```
|
||||
|
||||
Filter flags (shared): `--tags`, `--type`, `--ids` (comma-separated), `--from-id`, `--to-id`, `--force`
|
||||
Confirmation: `--yes` / `-y` to skip interactive prompt.
|
||||
|
||||
Without `--yes`, the CLI first shows the match count and asks for confirmation:
|
||||
|
||||
```
This will delete 47 documents matching: tags=[draft,old] type=note
Proceed? [y/N]
```
|
||||
|
||||
### 10. Engine config for safety threshold
|
||||
|
||||
New env var: `KB_BULK_SAFETY_PERCENT` (integer, default 70). Added to `engine/kb/config.py`.
|
||||
|
||||
## Risks / Trade-offs
|
||||
|
||||
- **[Bulk delete is irreversible]** → Safety threshold mitigates accidental mass deletion. CLI requires interactive confirmation. No undo mechanism — this is deliberate to keep the system simple.
|
||||
- **[Naming collision: `tags` as filter vs operation]** → The `tags` parameter in bulk_tags selects documents, while `add`/`remove` specifies the tag changes. Clear naming and tool descriptions mitigate confusion. Engine request model uses the same field name as the existing list/search filter.
|
||||
- **[SQLite lock during large bulk ops]** → A single transaction deleting 5000 documents will hold a write lock. With WAL mode, readers are not blocked. The lock duration should be under a few seconds for typical workloads.
|
||||
- **[Breaking change: collection removal]** → Any MCP client relying on `collection` parameters will break. Since collections were only recently added and are not widely deployed, this is acceptable. Existing `collection:*` tags in the database remain as regular tags — they still work as filters, just without special treatment.
|
||||
- **[Jobs table overload]** → Bulk operations add a new job type to a table designed for ingestion jobs. The schema change is minimal (one new column) and the audit trail value outweighs the mixing of concerns.
|
||||
@@ -0,0 +1,91 @@
|
||||
## Why
|
||||
|
||||
Bulk operations on documents (delete, tag, retag) currently require one API/MCP call per document. When an LLM manages hundreds or thousands of documents, this means hundreds of tool calls — burning tokens, adding latency, and creating fragile multi-step flows that can fail partway through.
|
||||
|
||||
Additionally, the "collection" abstraction in the MCP server adds complexity without real benefit. Collections are implemented as `collection:`-prefixed tags, but this convention is only enforced in the MCP layer — the CLI and engine don't know about it. This creates inconsistency and extra code. Tags alone, with a naming convention communicated via system prompt or configuration, achieve the same namespace isolation more simply and uniformly.
|
||||
|
||||
## What Changes
|
||||
|
||||
### 1. Remove collections from MCP server
|
||||
|
||||
Strip all collection logic from `mcp/server.py`:
|
||||
- Remove `COLLECTION_TAG_PREFIX`, `DEFAULT_COLLECTION`, and all collection helper functions
|
||||
- Remove `collection` parameter from `kb_search`, `kb_addnote`, `kb_upload_start`
|
||||
- Remove `kb_set_collection` tool entirely
|
||||
- Remove `_process_document` / `_process_search_results` collection-tag stripping
|
||||
- Update MCP server instructions to explain tag-based namespace convention
|
||||
|
||||
### 2. Add bulk engine endpoints
|
||||
|
||||
Three new endpoints in the engine API:
|
||||
|
||||
- **POST /api/v1/bulk/delete** — Delete multiple documents matching a filter
|
||||
- **POST /api/v1/bulk/tags** — Add/remove tags on multiple documents matching a filter
|
||||
- **POST /api/v1/bulk/set-tags** — Replace all tags on multiple documents matching a filter
|
||||
|
||||
All accept a common **selection filter** (combinable with AND logic):
|
||||
- `document_ids` — explicit list of IDs
|
||||
- `tags` — documents matching ALL specified tags
|
||||
- `doc_type` — documents of this type
|
||||
- `from_id` / `to_id` — ID range (inclusive)
|
||||
|
||||
At least one selection criterion is required.
|
||||
|
||||
**Safety threshold**: If the operation would affect more than N% of all documents (default 70%, configurable via `KB_BULK_SAFETY_PERCENT` env var), the request is rejected with a 409 response showing what would be affected. The caller must re-send with `force: true` to proceed.
|
||||
|
||||
**Response model**: Synchronous execution with summary response. The operation is logged to the jobs table for audit trail:
|
||||
|
||||
```json
{
  "job_id": 42,
  "status": "partial_failure",
  "matched": 750,
  "succeeded": 748,
  "failed": 2,
  "errors": [
    {"document_id": 42, "error": "file locked"},
    {"document_id": 99, "error": "not found"}
  ]
}
```
|
||||
|
||||
### 3. Add bulk MCP tools
|
||||
|
||||
Expose the bulk engine endpoints as MCP tools:
|
||||
- `kb_bulk_delete` — bulk delete with filter selection
|
||||
- `kb_bulk_tags` — bulk add/remove tags with filter selection
|
||||
- `kb_bulk_set_tags` — bulk replace tags with filter selection
|
||||
|
||||
These are thin wrappers around the engine bulk endpoints — no collection translation, no special logic.
|
||||
|
||||
### 4. Add bulk CLI commands
|
||||
|
||||
- `kb bulk-remove` — bulk delete with `--tags`, `--type`, `--ids`, `--from-id`, `--to-id`, `--force` flags
|
||||
- `kb bulk-tag` — bulk tag/untag with `--add`, `--remove`, and the same filter flags
|
||||
- `kb bulk-set-tags`: bulk replace tags with `--set` (the replacement tags) and the same filter flags, where `--tags` remains a selection filter
|
||||
|
||||
All show a confirmation prompt with match count before executing (unless `--yes`).
|
||||
|
||||
## Capabilities
|
||||
|
||||
### New Capabilities
|
||||
|
||||
- `bulk-operations`: Engine endpoints, MCP tools, and CLI commands for bulk delete, tag, and set-tags operations with filter-based selection and safety threshold.
|
||||
|
||||
### Modified Capabilities
|
||||
|
||||
- `mcp-document-management`: Remove `kb_set_collection` tool. Remove `collection` parameter from all tools.
|
||||
|
||||
### Removed Capabilities
|
||||
|
||||
- `mcp-collections`: The collection abstraction (collection helpers, collection parameters, collection tag stripping) is removed from the MCP server entirely.
|
||||
|
||||
## Impact
|
||||
|
||||
- **Engine API** (`engine/kb/routes/`): New `bulk.py` route module with 3 endpoints. New `job_type` column in the jobs table, with three bulk job types (`bulk_delete`, `bulk_tags`, `bulk_set_tags`).
|
||||
- **Engine database** (`engine/kb/database.py`): Helper functions for bulk selection queries and bulk delete/tag operations.
|
||||
- **MCP server** (`mcp/server.py`): Remove ~70 lines of collection logic. Add 3 bulk tool definitions. Remove `collection` param from `kb_search`, `kb_addnote`, `kb_upload_start`. Remove `kb_set_collection`.
|
||||
- **MCP engine client** (`mcp/engine.py`): Add bulk operation methods. Remove code that is no longer needed.
|
||||
- **CLI** (`client/cmd/`): New `bulk_remove.go`, `bulk_tag.go`, `bulk_set_tags.go` command files.
|
||||
- **CLI API client** (`client/internal/api/`): Add `Post` with JSON body support if not present.
|
||||
- **Breaking changes**: `kb_set_collection` MCP tool removed. `collection` parameter removed from `kb_search`, `kb_addnote`, `kb_upload_start` MCP tools. Any MCP clients using collections will need to switch to tags.
|
||||
@@ -0,0 +1,230 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Common selection filter
|
||||
|
||||
All bulk engine endpoints SHALL accept a JSON body with the following optional selection fields, combined with AND logic:
|
||||
|
||||
- `document_ids` (list of int) — match documents with these specific IDs
|
||||
- `tags` (list of str) — match documents that have ALL specified tags
|
||||
- `doc_type` (str) — match documents with this document type
|
||||
- `from_id` (int) — match documents with id >= this value
|
||||
- `to_id` (int) — match documents with id <= this value
|
||||
|
||||
At least one selection field MUST be present. If no selection fields are provided, the endpoint SHALL return 400 Bad Request.
|
||||
|
||||
#### Scenario: Filter by tags and doc_type
|
||||
|
||||
- **WHEN** a bulk endpoint receives `{"tags": ["draft"], "doc_type": "note"}`
|
||||
- **THEN** it SHALL match only documents that have the tag "draft" AND have doc_type "note"
|
||||
|
||||
#### Scenario: Filter by ID range
|
||||
|
||||
- **WHEN** a bulk endpoint receives `{"from_id": 10, "to_id": 50}`
|
||||
- **THEN** it SHALL match documents with id >= 10 AND id <= 50
|
||||
|
||||
#### Scenario: Filter by explicit IDs
|
||||
|
||||
- **WHEN** a bulk endpoint receives `{"document_ids": [1, 5, 12]}`
|
||||
- **THEN** it SHALL match only documents with those specific IDs
|
||||
|
||||
#### Scenario: Combined filters
|
||||
|
||||
- **WHEN** a bulk endpoint receives `{"tags": ["agent:mybot"], "doc_type": "note", "from_id": 100}`
|
||||
- **THEN** it SHALL match documents satisfying ALL three criteria
|
||||
|
||||
#### Scenario: No selection fields provided
|
||||
|
||||
- **WHEN** a bulk endpoint receives `{}` or `{"force": true}` with no selection fields
|
||||
- **THEN** it SHALL return 400 Bad Request
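The "at least one selection field" rule reduces to a small predicate. The sketch below is illustrative: `has_selection` is a hypothetical name, and treating a field set to `null` the same as an absent field is an assumption the requirement does not spell out.

```python
from typing import Any, Mapping

# The five selection fields named in the requirement above.
SELECTION_FIELDS = ("document_ids", "tags", "doc_type", "from_id", "to_id")


def has_selection(body: Mapping[str, Any]) -> bool:
    """True if the request body carries at least one selection field.

    Non-selection fields such as "force" do not count. A field that is
    present but null is treated as absent (an assumption of this sketch).
    """
    return any(body.get(field) is not None for field in SELECTION_FIELDS)
```

An endpoint would return 400 Bad Request whenever this predicate is false, before counting or touching any documents.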
|
||||
|
||||
### Requirement: Safety threshold
|
||||
|
||||
All bulk endpoints SHALL enforce a safety threshold. Before executing, the engine SHALL count the matched documents and the total documents in the database. If `matched / total * 100` exceeds the configured threshold, the request SHALL be rejected with 409 Conflict.
|
||||
|
||||
The response SHALL include: `error` ("safety_threshold_exceeded"), `message` (human-readable), `matched` (int), `total` (int), `percent` (float), and `threshold` (int).
|
||||
|
||||
The threshold SHALL default to 70 and be configurable via the `KB_BULK_SAFETY_PERCENT` environment variable (integer 0-100). A value of 0 disables the check.
|
||||
|
||||
The caller MAY override the threshold by including `"force": true` in the request body.
|
||||
|
||||
#### Scenario: Threshold exceeded
|
||||
|
||||
- **GIVEN** 1000 total documents and `KB_BULK_SAFETY_PERCENT` is 70
|
||||
- **WHEN** a bulk endpoint matches 750 documents (75%) without `force: true`
|
||||
- **THEN** it SHALL return 409 with `matched: 750`, `total: 1000`, `percent: 75.0`, `threshold: 70`
|
||||
|
||||
#### Scenario: Threshold not exceeded
|
||||
|
||||
- **GIVEN** 1000 total documents and `KB_BULK_SAFETY_PERCENT` is 70
|
||||
- **WHEN** a bulk endpoint matches 500 documents (50%) without `force: true`
|
||||
- **THEN** the operation SHALL proceed normally
|
||||
|
||||
#### Scenario: Force override
|
||||
|
||||
- **GIVEN** 1000 total documents and a match of 900 (90%)
|
||||
- **WHEN** the request includes `"force": true`
|
||||
- **THEN** the operation SHALL proceed regardless of threshold
|
||||
|
||||
#### Scenario: Zero threshold
|
||||
|
||||
- **GIVEN** `KB_BULK_SAFETY_PERCENT` is 0
|
||||
- **THEN** the safety check SHALL be effectively disabled for all operations
|
||||
|
||||
### Requirement: Synchronous response with audit log
|
||||
|
||||
All bulk endpoints SHALL execute synchronously and return a JSON response with:
|
||||
|
||||
- `job_id` (int) — ID of the audit log entry in the jobs table
|
||||
- `status` (str) — "done" or "partial_failure"
|
||||
- `matched` (int) — number of documents that matched the selection
|
||||
- `succeeded` (int) — number of documents successfully processed
|
||||
- `failed` (int) — number of documents that failed
|
||||
- `errors` (list) — array of `{"document_id": int, "error": str}` for each failure (empty on full success)
|
||||
|
||||
A job record SHALL be created in the jobs table with `job_type` set to the operation type. The `filename` field SHALL store a JSON representation of the selection filter. The `error` field SHALL store a JSON array of individual errors if any occurred.
|
||||
|
||||
#### Scenario: Full success
|
||||
|
||||
- **WHEN** a bulk operation matches 50 documents and all succeed
|
||||
- **THEN** the response SHALL have `status: "done"`, `matched: 50`, `succeeded: 50`, `failed: 0`, `errors: []`
|
||||
|
||||
#### Scenario: Partial failure
|
||||
|
||||
- **WHEN** a bulk operation matches 50 documents but 2 fail
|
||||
- **THEN** the response SHALL have `status: "partial_failure"`, `matched: 50`, `succeeded: 48`, `failed: 2`, and `errors` listing the 2 failures
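The two scenarios above pin down how `status`, `succeeded`, and `failed` relate to the error list. A minimal sketch of that derivation follows. The helper name `build_bulk_response` is hypothetical, and it assumes every matched document either succeeds or appears exactly once in `errors`.

```python
from typing import Dict, List


def build_bulk_response(job_id: int, matched: int, errors: List[Dict]) -> Dict:
    """Assemble the summary response from the match count and per-document errors."""
    failed = len(errors)
    return {
        "job_id": job_id,
        # Any failure at all downgrades the status.
        "status": "partial_failure" if failed else "done",
        "matched": matched,
        "succeeded": matched - failed,
        "failed": failed,
        "errors": list(errors),
    }
```

Deriving the counters from one error list keeps `matched == succeeded + failed` true by construction.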
|
||||
|
||||
### Requirement: Bulk delete endpoint
|
||||
|
||||
The engine SHALL expose `POST /api/v1/bulk/delete` which permanently deletes all documents matching the selection filter. For each matched document, it SHALL delete embeddings from `chunks_vec`, delete the document row (cascading to chunks and document_tags), and delete any stored file from disk.
|
||||
|
||||
Database deletions SHALL be performed within a single transaction. File deletions SHALL occur after the transaction commits and SHALL be best-effort (failures logged but not counted as document failures).
|
||||
|
||||
#### Scenario: Bulk delete by tag
|
||||
|
||||
- **WHEN** `POST /api/v1/bulk/delete` receives `{"tags": ["old", "draft"]}`
|
||||
- **THEN** all documents with both tags "old" and "draft" SHALL be deleted
|
||||
- **AND** their chunks, embeddings, tag associations, and stored files SHALL be removed
|
||||
|
||||
#### Scenario: Bulk delete with no matches
|
||||
|
||||
- **WHEN** `POST /api/v1/bulk/delete` receives a filter that matches 0 documents
|
||||
- **THEN** the response SHALL have `matched: 0`, `succeeded: 0`, `failed: 0`
|
||||
|
||||
### Requirement: Bulk tags endpoint
|
||||
|
||||
The engine SHALL expose `POST /api/v1/bulk/tags` which adds and/or removes tags on all documents matching the selection filter. The request body SHALL include the selection filter plus:
|
||||
|
||||
- `add` (list of str, optional) — tags to add
|
||||
- `remove` (list of str, optional) — tags to remove
|
||||
|
||||
At least one of `add` or `remove` MUST be present. The endpoint SHALL return 400 if neither is provided.
|
||||
|
||||
The endpoint SHALL update `updated_at` on all affected documents.
|
||||
|
||||
#### Scenario: Add and remove tags in one call
|
||||
|
||||
- **WHEN** `POST /api/v1/bulk/tags` receives `{"tags": ["agent:mybot"], "add": ["reviewed"], "remove": ["pending"]}`
|
||||
- **THEN** all documents tagged "agent:mybot" SHALL have "reviewed" added and "pending" removed
|
||||
|
||||
### Requirement: Bulk set-tags endpoint
|
||||
|
||||
The engine SHALL expose `POST /api/v1/bulk/set-tags` which replaces all tags on matched documents with a new set. The request body SHALL include the selection filter plus:
|
||||
|
||||
- `new_tags` (list of str) — the replacement tag set
|
||||
|
||||
The endpoint SHALL remove all existing tag associations from matched documents, then apply the new set. It SHALL update `updated_at` on all affected documents.
|
||||
|
||||
#### Scenario: Replace all tags
|
||||
|
||||
- **WHEN** `POST /api/v1/bulk/set-tags` receives `{"doc_type": "note", "new_tags": ["clean", "final"]}`
|
||||
- **THEN** all notes SHALL have their existing tags removed and replaced with "clean" and "final"
|
||||
|
||||
### Requirement: Jobs table extension
|
||||
|
||||
The jobs table SHALL be extended with a `job_type` column (TEXT, default "ingest") to distinguish ingestion jobs from bulk operation audit entries. Valid values: "ingest", "bulk_delete", "bulk_tags", "bulk_set_tags".
|
||||
|
||||
Existing jobs SHALL default to `job_type = "ingest"`. The existing jobs list endpoint and CLI `kb jobs` command SHALL continue to work unchanged.
|
||||
|
||||
#### Scenario: Migration adds column
|
||||
|
||||
- **GIVEN** an existing database without the `job_type` column
|
||||
- **WHEN** the engine starts
|
||||
- **THEN** the column SHALL be added with default value "ingest"
|
||||
|
||||
### Requirement: Engine config for safety threshold
|
||||
|
||||
The engine `Config` class SHALL read `KB_BULK_SAFETY_PERCENT` from the environment as an integer (default 70, range 0-100). This value SHALL be used as the default safety threshold for all bulk endpoints.
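Reading and validating the variable might look like the following sketch. The function name and the decision to raise `ValueError` on bad input are assumptions; the requirement only fixes the variable name, the default, and the range.

```python
import os
from typing import Mapping, Optional


def read_bulk_safety_percent(env: Optional[Mapping[str, str]] = None) -> int:
    """Read KB_BULK_SAFETY_PERCENT as an integer in 0..100 (default 70)."""
    source = os.environ if env is None else env
    raw = source.get("KB_BULK_SAFETY_PERCENT", "70")
    try:
        value = int(raw)
    except ValueError:
        # Rejecting bad input is this sketch's choice, not part of the spec.
        raise ValueError(f"KB_BULK_SAFETY_PERCENT must be an integer, got {raw!r}")
    if not 0 <= value <= 100:
        raise ValueError(f"KB_BULK_SAFETY_PERCENT must be in 0..100, got {value}")
    return value
```

Accepting an explicit mapping makes the function easy to test without mutating the process environment.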
|
||||
|
||||
### Requirement: MCP bulk delete tool
|
||||
|
||||
The MCP server SHALL expose a `kb_bulk_delete` tool with parameters: `document_ids` (optional list of int), `tags` (optional list of str), `doc_type` (optional str), `from_id` (optional int), `to_id` (optional int), `force` (optional bool).
|
||||
|
||||
The tool SHALL call `POST /api/v1/bulk/delete` on the engine via the engine client and return the JSON response.
|
||||
|
||||
The tool description SHALL clearly state that `tags` is a selection filter (which documents to delete), not tags to delete.
|
||||
|
||||
#### Scenario: MCP bulk delete by tag
|
||||
|
||||
- **WHEN** `kb_bulk_delete(tags=["old"])` is called
|
||||
- **THEN** the engine client SHALL send `POST /api/v1/bulk/delete` with `{"tags": ["old"]}`
|
||||
- **AND** the tool SHALL return the engine's JSON response
|
||||
|
||||
### Requirement: MCP bulk tags tool
|
||||
|
||||
The MCP server SHALL expose a `kb_bulk_tags` tool with parameters: `document_ids`, `tags`, `doc_type`, `from_id`, `to_id` (selection filters), plus `add` (optional list of str), `remove` (optional list of str), and `force` (optional bool).
|
||||
|
||||
The tool description SHALL clearly distinguish `tags` (selection filter) from `add`/`remove` (tag changes to apply).
|
||||
|
||||
#### Scenario: MCP bulk tag update
|
||||
|
||||
- **WHEN** `kb_bulk_tags(tags=["agent:mybot"], add=["reviewed"], remove=["draft"])` is called
|
||||
- **THEN** the engine client SHALL send the appropriate `POST /api/v1/bulk/tags` request
|
||||
|
||||
### Requirement: MCP bulk set-tags tool
|
||||
|
||||
The MCP server SHALL expose a `kb_bulk_set_tags` tool with parameters: `document_ids`, `tags`, `doc_type`, `from_id`, `to_id` (selection filters), plus `new_tags` (list of str) and `force` (optional bool).
|
||||
|
||||
#### Scenario: MCP bulk set tags
|
||||
|
||||
- **WHEN** `kb_bulk_set_tags(doc_type="note", new_tags=["clean"])` is called
|
||||
- **THEN** the engine client SHALL send `POST /api/v1/bulk/set-tags` with `{"doc_type": "note", "new_tags": ["clean"]}`
|
||||
|
||||
### Requirement: MCP engine client bulk methods
|
||||
|
||||
The MCP engine client (`mcp/engine.py`) SHALL provide three new methods:
|
||||
|
||||
- `bulk_delete(document_ids?, tags?, doc_type?, from_id?, to_id?, force?)` → dict
|
||||
- `bulk_tags(document_ids?, tags?, doc_type?, from_id?, to_id?, add?, remove?, force?)` → dict
|
||||
- `bulk_set_tags(document_ids?, tags?, doc_type?, from_id?, to_id?, new_tags?, force?)` → dict
|
||||
|
||||
Each SHALL send a POST request to the corresponding `/api/v1/bulk/*` endpoint with the parameters as a JSON body. Each SHALL raise on non-2xx status codes, consistent with existing methods.
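Since all three methods share the same selection parameters, the JSON body could come from one helper that drops unset values. This is a sketch of that idea only: `build_bulk_payload` is a hypothetical name, and the HTTP call itself is left out because the existing client's request code is not shown here.

```python
from typing import Any, Dict, List, Optional


def build_bulk_payload(
    document_ids: Optional[List[int]] = None,
    tags: Optional[List[str]] = None,
    doc_type: Optional[str] = None,
    from_id: Optional[int] = None,
    to_id: Optional[int] = None,
    force: Optional[bool] = None,
    **operation: Any,
) -> Dict[str, Any]:
    """Build the JSON body for a /api/v1/bulk/* request.

    `operation` carries the endpoint-specific fields (add, remove, new_tags).
    Parameters left as None are omitted from the body entirely.
    """
    fields: Dict[str, Any] = {
        "document_ids": document_ids,
        "tags": tags,
        "doc_type": doc_type,
        "from_id": from_id,
        "to_id": to_id,
        "force": force,
        **operation,
    }
    return {key: value for key, value in fields.items() if value is not None}
```

Omitting unset parameters keeps the body minimal, so `kb_bulk_delete(tags=["old"])` produces exactly `{"tags": ["old"]}` as in the MCP scenario.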
|
||||
|
||||
### Requirement: CLI bulk-remove command
|
||||
|
||||
The CLI SHALL expose a `kb bulk-remove` command with flags: `--tags` (comma-separated), `--type`, `--ids` (comma-separated), `--from-id`, `--to-id`, `--force`/`-f`, `--yes`/`-y`.
|
||||
|
||||
Without `--yes`, the CLI SHALL first display the match count and ask for interactive confirmation before proceeding.
|
||||
|
||||
The command SHALL call `POST /api/v1/bulk/delete` with the constructed filter.
|
||||
|
||||
#### Scenario: CLI bulk remove with confirmation
|
||||
|
||||
- **WHEN** `kb bulk-remove --tags "draft,old" --type note` is run without `--yes`
|
||||
- **THEN** the CLI SHALL display "This will delete N documents matching: tags=[draft,old] type=note" and prompt "Proceed? [y/N]"
|
||||
|
||||
#### Scenario: CLI bulk remove with --yes
|
||||
|
||||
- **WHEN** `kb bulk-remove --tags "draft" --yes` is run
|
||||
- **THEN** the CLI SHALL proceed without prompting
|
||||
|
||||
### Requirement: CLI bulk-tag command
|
||||
|
||||
The CLI SHALL expose a `kb bulk-tag` command with the same filter flags as `bulk-remove`, plus `--add` and `--remove` (comma-separated tag lists).
|
||||
|
||||
The command SHALL call `POST /api/v1/bulk/tags` with the constructed filter and tag changes.
|
||||
|
||||
### Requirement: CLI bulk-set-tags command
|
||||
|
||||
The CLI SHALL expose a `kb bulk-set-tags` command with the filter flags, plus `--set` (comma-separated list of replacement tags).
|
||||
|
||||
The command SHALL call `POST /api/v1/bulk/set-tags` with the constructed filter and `new_tags`.
|
||||
@@ -0,0 +1,55 @@
|
||||
## REMOVED Requirements
|
||||
|
||||
### Requirement: Collection abstraction in MCP server
|
||||
|
||||
The MCP server SHALL NOT maintain any collection abstraction. The following SHALL be removed:
|
||||
|
||||
- Constants: `COLLECTION_TAG_PREFIX`, `DEFAULT_COLLECTION`
|
||||
- Functions: `_collection_tag`, `_strip_collection_tags`, `_process_document`, `_process_search_results`, `_ensure_exclusive_collection`
|
||||
- Tool: `kb_set_collection` (entire tool)
|
||||
- Parameters: `collection` from `kb_search`, `kb_addnote`, `kb_upload_start`
|
||||
|
||||
Documents SHALL be returned as-is from the engine with all tags visible. No tag stripping or collection field injection SHALL occur.
|
||||
|
||||
#### Scenario: Search results show all tags
|
||||
|
||||
- **WHEN** `kb_search` is called and a result has tags `["agent:mybot", "collection:documents", "draft"]`
|
||||
- **THEN** all three tags SHALL be returned as-is — no stripping of `collection:*` tags
|
||||
|
||||
#### Scenario: kb_set_collection no longer exists
|
||||
|
||||
- **WHEN** an MCP client attempts to call `kb_set_collection`
|
||||
- **THEN** the call SHALL fail because the tool no longer exists
|
||||
|
||||
## MODIFIED Requirements
|
||||
|
||||
### Requirement: kb_search without collection parameter
|
||||
|
||||
The `kb_search` MCP tool SHALL accept `tags` (optional list of str) for filtering but SHALL NOT accept a `collection` parameter. Callers that previously used `collection="memory"` SHALL instead use `tags=["collection:memory"]` or whatever tag convention they prefer.
|
||||
|
||||
#### Scenario: Filter by tag instead of collection
|
||||
|
||||
- **WHEN** `kb_search(query="test", tags=["agent:mybot"])` is called
|
||||
- **THEN** results SHALL be filtered to documents tagged "agent:mybot"
|
||||
- **AND** no collection field SHALL be present in the response
### Requirement: kb_addnote without collection parameter

The `kb_addnote` MCP tool SHALL accept `tags` (optional list of str) but SHALL NOT accept a `collection` parameter. The tool SHALL NOT automatically apply any default collection tag; only explicitly provided tags are applied.

#### Scenario: Add note with explicit tags

- **WHEN** `kb_addnote(text="hello", tags=["agent:mybot", "memory"])` is called
- **THEN** the note SHALL be created with exactly those two tags, with no `collection:documents` tag added

### Requirement: kb_upload_start without collection parameter

The `kb_upload_start` MCP tool SHALL accept `tags` (optional list of str) but SHALL NOT accept a `collection` parameter. The tool SHALL NOT automatically apply any default collection tag.

### Requirement: kb_update_note without collection processing

The `kb_update_note` MCP tool SHALL return the document as-is from the engine without passing it through `_process_document`. All tags SHALL be visible in the response.

### Requirement: kb_get without collection processing

The `kb_get` MCP tool SHALL return documents as-is from the engine without passing through `_process_document`. All tags SHALL be visible in the response. No `collection` field SHALL be injected.
@@ -0,0 +1,45 @@
## 1. Remove collections from MCP server

- [x] 1.1 Remove collection constants and helper functions from `mcp/server.py` (`COLLECTION_TAG_PREFIX`, `DEFAULT_COLLECTION`, `_collection_tag`, `_strip_collection_tags`, `_process_document`, `_process_search_results`, `_ensure_exclusive_collection`)
- [x] 1.2 Remove `collection` parameter from `kb_search`, `kb_addnote`, `kb_upload_start` tools
- [x] 1.3 Remove `kb_set_collection` tool entirely
- [x] 1.4 Remove `_process_document` / `_process_search_results` calls from `kb_get`, `kb_update_note`, `kb_search`
- [x] 1.5 Update MCP server instructions text to reflect tags-only approach

## 2. Engine bulk infrastructure

- [x] 2.1 Add `bulk_safety_percent` to `Config` class in `engine/kb/config.py` (env var `KB_BULK_SAFETY_PERCENT`, default 70)
- [x] 2.2 Add `job_type` column migration to `database.py` `init_schema` (TEXT, default "ingest")
- [x] 2.3 Add `resolve_bulk_selection(conn, document_ids, tags, doc_type, from_id, to_id)` helper to `database.py`; returns list of matching document IDs
- [x] 2.4 Add `create_bulk_job(conn, job_type, filters_json, matched, succeeded, failed, errors_json)` helper to `database.py`
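As an illustration of task 2.3, a selection resolver along these lines could work. This is a minimal sketch, not the actual `database.py` code. The table layout (`documents` with `id` and `doc_type`, plus a `document_tags(document_id, tag)` table) is assumed for the example, as are three behaviors: filters combine with AND, a document must carry every listed tag, and a call with no filters selects nothing.

```python
import sqlite3
from typing import List, Optional, Sequence


def resolve_bulk_selection(
    conn: sqlite3.Connection,
    document_ids: Optional[Sequence[int]] = None,
    tags: Optional[Sequence[str]] = None,
    doc_type: Optional[str] = None,
    from_id: Optional[int] = None,
    to_id: Optional[int] = None,
) -> List[int]:
    """Return IDs of documents matching all supplied filters (assumed semantics)."""
    clauses: List[str] = []
    params: list = []
    if document_ids:
        placeholders = ",".join("?" for _ in document_ids)
        clauses.append(f"d.id IN ({placeholders})")
        params.extend(document_ids)
    if doc_type is not None:
        clauses.append("d.doc_type = ?")
        params.append(doc_type)
    if from_id is not None:
        clauses.append("d.id >= ?")
        params.append(from_id)
    if to_id is not None:
        clauses.append("d.id <= ?")
        params.append(to_id)
    for tag in tags or []:
        # One EXISTS clause per tag: the document must have every listed tag.
        clauses.append(
            "EXISTS (SELECT 1 FROM document_tags t "
            "WHERE t.document_id = d.id AND t.tag = ?)"
        )
        params.append(tag)
    if not clauses:
        return []  # assumption: no filters means no selection, never "everything"
    sql = "SELECT d.id FROM documents d WHERE " + " AND ".join(clauses) + " ORDER BY d.id"
    return [row[0] for row in conn.execute(sql, params)]


# Demo against an in-memory database using the assumed schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, doc_type TEXT);
CREATE TABLE document_tags (document_id INTEGER, tag TEXT);
INSERT INTO documents VALUES (1, 'note'), (2, 'note'), (3, 'pdf'), (4, 'note');
INSERT INTO document_tags VALUES (1, 'draft'), (2, 'draft'), (2, 'agent:mybot'), (3, 'draft');
""")
print(resolve_bulk_selection(conn, tags=["draft"], doc_type="note"))  # [1, 2]
print(resolve_bulk_selection(conn, from_id=2, to_id=4))               # [2, 3, 4]
```

All values are bound as parameters rather than interpolated, so tag and type strings cannot alter the query.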
## 3. Engine bulk endpoints

- [x] 3.1 Create `engine/kb/routes/bulk.py` with shared Pydantic request model (`BulkSelectionRequest` with selection fields + `force` bool)
- [x] 3.2 Add `_check_safety_threshold` helper that returns HTTP 409 if the safety threshold is exceeded and `force` is not set
- [x] 3.3 Implement `POST /api/v1/bulk/delete`: resolve selection, check threshold, delete documents in transaction, clean up files, log job, return summary
- [x] 3.4 Implement `POST /api/v1/bulk/tags`: resolve selection, check threshold, add/remove tags on matched docs, log job, return summary
- [x] 3.5 Implement `POST /api/v1/bulk/set-tags`: resolve selection, check threshold, clear and replace tags on matched docs, log job, return summary
- [x] 3.6 Import bulk routes in engine app startup (add to `engine/kb/routes/__init__.py` or `main.py`)
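The safety check from tasks 2.1 and 3.2 can be sketched as a pure function. The names `load_bulk_safety_percent` and `exceeds_safety_threshold` are hypothetical, and the real `_check_safety_threshold` may differ. Two assumptions are made here: "exceeded" means strictly greater than the configured percentage of all documents, and an empty database never triggers the check.

```python
import os
from typing import Mapping


def load_bulk_safety_percent(env: Mapping[str, str] = os.environ) -> int:
    """Read KB_BULK_SAFETY_PERCENT, falling back to the documented default of 70."""
    return int(env.get("KB_BULK_SAFETY_PERCENT", "70"))


def exceeds_safety_threshold(matched: int, total: int,
                             safety_percent: int, force: bool = False) -> bool:
    """Return True when a bulk operation should be rejected (e.g. with HTTP 409)."""
    if force or total == 0:
        return False
    # Integer comparison avoids floating-point rounding at the boundary.
    return matched * 100 > safety_percent * total


percent = load_bulk_safety_percent({})                   # default applies: 70
print(exceeds_safety_threshold(8, 10, percent))              # True: 80% > 70%
print(exceeds_safety_threshold(7, 10, percent))              # False: exactly 70%
print(exceeds_safety_threshold(10, 10, percent, force=True))  # False: force overrides
```

Keeping the decision separate from the HTTP layer makes the boundary cases easy to unit-test before the endpoint maps a `True` result to a 409 response.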
## 4. MCP bulk tools

- [x] 4.1 Add `bulk_delete`, `bulk_tags`, `bulk_set_tags` methods to `mcp/engine.py`
- [x] 4.2 Add `kb_bulk_delete` tool to `mcp/server.py`
- [x] 4.3 Add `kb_bulk_tags` tool to `mcp/server.py`
- [x] 4.4 Add `kb_bulk_set_tags` tool to `mcp/server.py`
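A client method such as `bulk_delete` has to turn its arguments into a request body for the engine. The sketch below shows one way to build the selection part of that body. `build_bulk_selection` is a hypothetical helper, the field names are assumed to mirror the `resolve_bulk_selection` parameters plus `force`, and rejecting an empty selection on the client side is likewise an assumption.

```python
import json
from typing import Any, Dict, List, Optional


def build_bulk_selection(
    document_ids: Optional[List[int]] = None,
    tags: Optional[List[str]] = None,
    doc_type: Optional[str] = None,
    from_id: Optional[int] = None,
    to_id: Optional[int] = None,
    force: bool = False,
) -> Dict[str, Any]:
    """Build a JSON-serializable selection body, omitting unset filters."""
    candidate = {
        "document_ids": document_ids,
        "tags": tags,
        "doc_type": doc_type,
        "from_id": from_id,
        "to_id": to_id,
    }
    payload: Dict[str, Any] = {k: v for k, v in candidate.items() if v is not None}
    if not payload:
        # Assumption: a request with no filters is treated as a caller mistake.
        raise ValueError("at least one selection filter is required")
    payload["force"] = force
    return payload


# Body that would be sent to e.g. POST /api/v1/bulk/delete
print(json.dumps(build_bulk_selection(tags=["draft"], from_id=10, to_id=20), sort_keys=True))
```

Omitting unset filters keeps the request minimal and leaves defaulting to the engine's request model.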
## 5. CLI bulk commands

- [x] 5.1 Create `client/cmd/bulk_remove.go`: `kb bulk-remove` with filter flags, confirmation prompt, JSON output support
- [x] 5.2 Create `client/cmd/bulk_tag.go`: `kb bulk-tag` with filter flags + `--add`/`--remove`, confirmation prompt
- [x] 5.3 Create `client/cmd/bulk_set_tags.go`: `kb bulk-set-tags` with filter flags + `--set`, confirmation prompt

## 6. Verification

- [x] 6.1 Test collection removal: verify `kb_search`, `kb_addnote`, `kb_get`, `kb_update_note`, `kb_upload_start` work without collection params
- [x] 6.2 Test bulk delete via engine API: filter by tags, by IDs, by range, safety threshold trigger and force override
- [x] 6.3 Test bulk tags and bulk set-tags via engine API
- [x] 6.4 Test MCP bulk tools against running engine
- [x] 6.5 Test CLI bulk commands against running engine
- [x] 6.6 Test audit trail: verify bulk jobs appear in `kb jobs` output
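When writing the checks for task 6.3, a small reference model of the expected tag outcomes can help. The functions below are hypothetical and encode assumed semantics, not the engine's code: `bulk/tags` removes first and then adds without duplicating, while `bulk/set-tags` discards the current tags and keeps only the new list.

```python
from typing import Iterable, List


def apply_bulk_tags(current: Iterable[str],
                    add: Iterable[str] = (),
                    remove: Iterable[str] = ()) -> List[str]:
    """Model of add/remove: drop `remove` tags, then append missing `add` tags."""
    removed = set(remove)
    result = [t for t in current if t not in removed]
    for tag in add:
        if tag not in result:
            result.append(tag)
    return result


def apply_bulk_set_tags(current: Iterable[str], new_tags: Iterable[str]) -> List[str]:
    """Model of set-tags: existing tags are cleared and replaced by `new_tags`."""
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(dict.fromkeys(new_tags))


print(apply_bulk_tags(["draft", "agent:mybot"], add=["reviewed"], remove=["draft"]))
print(apply_bulk_set_tags(["draft", "agent:mybot"], ["archived"]))
```

A test can compare the tags returned by the engine for each matched document against these expected lists, treating them as sets if the engine does not guarantee order.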