Add bulk operations and remove collections abstraction

- Add bulk delete, bulk tags, and bulk set-tags engine endpoints
  (POST /api/v1/bulk/delete, /bulk/tags, /bulk/set-tags)
- Filter-based selection: by tags, doc_type, ID list, ID range
- Safety threshold (KB_BULK_SAFETY_PERCENT, default 70%) prevents
  accidental mass operations unless force=true
- Synchronous execution with audit trail via jobs table
- Add kb_bulk_delete, kb_bulk_tags, kb_bulk_set_tags MCP tools
- Add kb bulk-remove, bulk-tag, bulk-set-tags CLI commands
- Remove collection abstraction from MCP server (use tags instead)
- Remove kb_set_collection MCP tool
- Update SKILL.md, MCP.md, README.md documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 22:34:47 +01:00
parent 0c124c4ab7
commit b5a203d2aa
21 changed files with 1619 additions and 112 deletions
@@ -0,0 +1,230 @@
## ADDED Requirements
### Requirement: Common selection filter
All bulk engine endpoints SHALL accept a JSON body with the following optional selection fields, combined with AND logic:
- `document_ids` (list of int) — match documents with these specific IDs
- `tags` (list of str) — match documents that have ALL specified tags
- `doc_type` (str) — match documents with this document type
- `from_id` (int) — match documents with id >= this value
- `to_id` (int) — match documents with id <= this value
At least one selection field MUST be present. If no selection fields are provided, the endpoint SHALL return 400 Bad Request.
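A minimal sketch of how the AND-combined filter could be translated into a parameterized SQL `WHERE` clause. The table and column names (`documents d`, `document_tags`, `tags.name`), the helper names, and the choice to treat empty lists as "not provided" are illustrative assumptions, not part of this spec.

```python
from typing import Any

SELECTION_FIELDS = ("document_ids", "tags", "doc_type", "from_id", "to_id")


def _provided(body: dict[str, Any], field: str) -> bool:
    # Assumption: None and empty lists both count as "not provided".
    value = body.get(field)
    return value is not None and value != []


def build_where(body: dict[str, Any]) -> tuple[str, list[Any]]:
    """Return (where_clause, params) for a selection filter.

    Raises ValueError when no selection field is present; an endpoint
    would map that to 400 Bad Request.
    """
    if not any(_provided(body, f) for f in SELECTION_FIELDS):
        raise ValueError("at least one selection field is required")

    clauses: list[str] = []
    params: list[Any] = []

    if _provided(body, "document_ids"):
        ids = body["document_ids"]
        clauses.append("d.id IN (" + ", ".join("?" for _ in ids) + ")")
        params.extend(ids)
    if _provided(body, "tags"):
        # One EXISTS per tag, so a document must carry ALL listed tags.
        for tag in body["tags"]:
            clauses.append(
                "EXISTS (SELECT 1 FROM document_tags dt "
                "JOIN tags t ON t.id = dt.tag_id "
                "WHERE dt.document_id = d.id AND t.name = ?)"
            )
            params.append(tag)
    if _provided(body, "doc_type"):
        clauses.append("d.doc_type = ?")
        params.append(body["doc_type"])
    if _provided(body, "from_id"):
        clauses.append("d.id >= ?")
        params.append(body["from_id"])
    if _provided(body, "to_id"):
        clauses.append("d.id <= ?")
        params.append(body["to_id"])

    return " AND ".join(clauses), params
```

Non-filter keys such as `force` are ignored by the builder, so `{"force": true}` alone is still rejected.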
#### Scenario: Filter by tags and doc_type
- **WHEN** a bulk endpoint receives `{"tags": ["draft"], "doc_type": "note"}`
- **THEN** it SHALL match only documents that have the tag "draft" AND have doc_type "note"
#### Scenario: Filter by ID range
- **WHEN** a bulk endpoint receives `{"from_id": 10, "to_id": 50}`
- **THEN** it SHALL match documents with id >= 10 AND id <= 50
#### Scenario: Filter by explicit IDs
- **WHEN** a bulk endpoint receives `{"document_ids": [1, 5, 12]}`
- **THEN** it SHALL match only documents with those specific IDs
#### Scenario: Combined filters
- **WHEN** a bulk endpoint receives `{"tags": ["agent:mybot"], "doc_type": "note", "from_id": 100}`
- **THEN** it SHALL match documents satisfying ALL three criteria
#### Scenario: No selection fields provided
- **WHEN** a bulk endpoint receives `{}` or `{"force": true}` with no selection fields
- **THEN** it SHALL return 400 Bad Request
### Requirement: Safety threshold
All bulk endpoints SHALL enforce a safety threshold. Before executing, the engine SHALL count the matched documents and the total documents in the database. If `matched / total * 100` exceeds the configured threshold, the request SHALL be rejected with 409 Conflict.
The response SHALL include: `error` ("safety_threshold_exceeded"), `message` (human-readable), `matched` (int), `total` (int), `percent` (float), and `threshold` (int).
The threshold SHALL default to 70 and be configurable via the `KB_BULK_SAFETY_PERCENT` environment variable (integer 0-100). A value of 0 disables the check.
The caller MAY override the threshold by including `"force": true` in the request body.
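The threshold rule above can be sketched as a pure function. The function name, the message wording, and the decision to skip the check when the database is empty (`total == 0`) are assumptions made for illustration; the spec does not define the empty-database case.

```python
from typing import Optional


def check_safety_threshold(
    matched: int, total: int, threshold: int = 70, force: bool = False
) -> Optional[dict]:
    """Return a 409-style error payload if the request is blocked, else None."""
    # force=true overrides the check; threshold 0 disables it.
    # Assumption: an empty database (total == 0) has nothing to protect.
    if force or threshold == 0 or total == 0:
        return None

    # Same quantity as matched / total * 100, ordered to limit rounding error.
    percent = matched * 100 / total
    if percent <= threshold:
        return None

    return {
        "error": "safety_threshold_exceeded",
        "message": (
            f"Operation matches {matched} of {total} documents "
            f"({percent:.1f}%), above the {threshold}% safety threshold"
        ),
        "matched": matched,
        "total": total,
        "percent": percent,
        "threshold": threshold,
    }
```

Note that "exceeds" is read as strictly greater than, so a match of exactly 70% passes with the default threshold.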
#### Scenario: Threshold exceeded
- **GIVEN** 1000 total documents and `KB_BULK_SAFETY_PERCENT` is 70
- **WHEN** a bulk endpoint matches 750 documents (75%) without `force: true`
- **THEN** it SHALL return 409 with `matched: 750`, `total: 1000`, `percent: 75.0`, `threshold: 70`
#### Scenario: Threshold not exceeded
- **GIVEN** 1000 total documents and `KB_BULK_SAFETY_PERCENT` is 70
- **WHEN** a bulk endpoint matches 500 documents (50%) without `force: true`
- **THEN** the operation SHALL proceed normally
#### Scenario: Force override
- **GIVEN** 1000 total documents and a match of 900 (90%)
- **WHEN** the request includes `"force": true`
- **THEN** the operation SHALL proceed regardless of threshold
#### Scenario: Zero threshold
- **GIVEN** `KB_BULK_SAFETY_PERCENT` is 0
- **THEN** the safety check SHALL be effectively disabled for all operations
### Requirement: Synchronous response with audit log
All bulk endpoints SHALL execute synchronously and return a JSON response with:
- `job_id` (int) — ID of the audit log entry in the jobs table
- `status` (str) — "done" or "partial_failure"
- `matched` (int) — number of documents that matched the selection
- `succeeded` (int) — number of documents successfully processed
- `failed` (int) — number of documents that failed
- `errors` (list) — array of `{"document_id": int, "error": str}` for each failure (empty on full success)
A job record SHALL be created in the jobs table with `job_type` set to the operation type. The `filename` field SHALL store a JSON representation of the selection filter. The `error` field SHALL store a JSON array of individual errors if any occurred.
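As a rough illustration, the response body can be derived from the match count and the collected per-document errors. The helper name is hypothetical, and the sketch assumes each failed document contributes exactly one entry to `errors`.

```python
def summarize_bulk_result(job_id: int, matched: int, errors: list[dict]) -> dict:
    """Build the synchronous response from per-document outcomes."""
    failed = len(errors)  # assumption: one error entry per failed document
    return {
        "job_id": job_id,
        "status": "done" if failed == 0 else "partial_failure",
        "matched": matched,
        "succeeded": matched - failed,
        "failed": failed,
        "errors": errors,
    }
```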
#### Scenario: Full success
- **WHEN** a bulk operation matches 50 documents and all succeed
- **THEN** the response SHALL have `status: "done"`, `matched: 50`, `succeeded: 50`, `failed: 0`, `errors: []`
#### Scenario: Partial failure
- **WHEN** a bulk operation matches 50 documents but 2 fail
- **THEN** the response SHALL have `status: "partial_failure"`, `matched: 50`, `succeeded: 48`, `failed: 2`, and `errors` listing the 2 failures
### Requirement: Bulk delete endpoint
The engine SHALL expose `POST /api/v1/bulk/delete` which permanently deletes all documents matching the selection filter. For each matched document, it SHALL delete embeddings from `chunks_vec`, delete the document row (cascading to chunks and document_tags), and delete any stored file from disk.
Database deletions SHALL be performed within a single transaction. File deletions SHALL occur after the transaction commits and SHALL be best-effort (failures logged but not counted as document failures).
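The ordering described above (database work in one transaction, file cleanup afterwards) might look roughly like the following with Python's `sqlite3` module. This is a sketch under several assumptions: the store is SQLite, `documents.file_path` and `chunks_vec.chunk_id` are hypothetical column names, `chunks` declares `ON DELETE CASCADE`, and the connection has `PRAGMA foreign_keys = ON`.

```python
import logging
import sqlite3
from pathlib import Path

log = logging.getLogger(__name__)


def bulk_delete(conn: sqlite3.Connection, doc_ids: list[int]) -> int:
    """Delete the given documents; return the number of document rows removed."""
    if not doc_ids:
        return 0
    marks = ", ".join("?" for _ in doc_ids)

    # Collect stored file paths first; the rows are gone after the commit.
    paths = [
        row[0]
        for row in conn.execute(
            f"SELECT file_path FROM documents "
            f"WHERE id IN ({marks}) AND file_path IS NOT NULL",
            doc_ids,
        )
    ]

    # `with conn` commits on success and rolls back if anything raises.
    with conn:
        conn.execute(
            f"DELETE FROM chunks_vec WHERE chunk_id IN "
            f"(SELECT id FROM chunks WHERE document_id IN ({marks}))",
            doc_ids,
        )
        deleted = conn.execute(
            f"DELETE FROM documents WHERE id IN ({marks})", doc_ids
        ).rowcount

    # Best-effort cleanup after the commit: log failures, do not raise.
    for path in paths:
        try:
            Path(path).unlink(missing_ok=True)
        except OSError as exc:
            log.warning("could not remove %s: %s", path, exc)
    return deleted
```

Embeddings are removed before the document rows because the cascade is assumed to cover only `chunks` and `document_tags`, as the requirement states.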
#### Scenario: Bulk delete by tag
- **WHEN** `POST /api/v1/bulk/delete` receives `{"tags": ["old", "draft"]}`
- **THEN** all documents with both tags "old" and "draft" SHALL be deleted
- **AND** their chunks, embeddings, tag associations, and stored files SHALL be removed
#### Scenario: Bulk delete with no matches
- **WHEN** `POST /api/v1/bulk/delete` receives a filter that matches 0 documents
- **THEN** the response SHALL have `matched: 0`, `succeeded: 0`, `failed: 0`
### Requirement: Bulk tags endpoint
The engine SHALL expose `POST /api/v1/bulk/tags` which adds and/or removes tags on all documents matching the selection filter. The request body SHALL include the selection filter plus:
- `add` (list of str, optional) — tags to add
- `remove` (list of str, optional) — tags to remove
At least one of `add` or `remove` MUST be present. The endpoint SHALL return 400 if neither is provided.
The endpoint SHALL update `updated_at` on all affected documents.
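The per-document effect of `add` and `remove` can be illustrated on a plain set of tag names. The helper below is hypothetical. It assumes an empty list counts as "not provided" and that removals are applied before additions, so a tag listed in both ends up present; the spec does not define that overlap case.

```python
from typing import Iterable, Optional


def apply_tag_changes(
    current: Iterable[str],
    add: Optional[Iterable[str]] = None,
    remove: Optional[Iterable[str]] = None,
) -> set[str]:
    """Return the tag set of one document after a bulk tags call."""
    add, remove = list(add or ()), list(remove or ())
    if not add and not remove:
        # The endpoint would answer 400 Bad Request here.
        raise ValueError("at least one of 'add' or 'remove' is required")
    # Assumption: remove first, then add.
    return (set(current) - set(remove)) | set(add)
```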
#### Scenario: Add and remove tags in one call
- **WHEN** `POST /api/v1/bulk/tags` receives `{"tags": ["agent:mybot"], "add": ["reviewed"], "remove": ["pending"]}`
- **THEN** all documents tagged "agent:mybot" SHALL have "reviewed" added and "pending" removed
### Requirement: Bulk set-tags endpoint
The engine SHALL expose `POST /api/v1/bulk/set-tags` which replaces all tags on matched documents with a new set. The request body SHALL include the selection filter plus:
- `new_tags` (list of str) — the replacement tag set
The endpoint SHALL remove all existing tag associations from matched documents, then apply the new set. It SHALL update `updated_at` on all affected documents.
#### Scenario: Replace all tags
- **WHEN** `POST /api/v1/bulk/set-tags` receives `{"doc_type": "note", "new_tags": ["clean", "final"]}`
- **THEN** all notes SHALL have their existing tags removed and replaced with "clean" and "final"
### Requirement: Jobs table extension
The jobs table SHALL be extended with a `job_type` column (TEXT, default "ingest") to distinguish ingestion jobs from bulk operation audit entries. Valid values: "ingest", "bulk_delete", "bulk_tags", "bulk_set_tags".
Existing jobs SHALL default to `job_type = "ingest"`. The existing jobs list endpoint and CLI `kb jobs` command SHALL continue to work unchanged.
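An idempotent startup migration for this column could be sketched as follows, assuming the jobs table lives in SQLite; the function name is hypothetical. Existing rows report the default without a backfill, because SQLite applies the column default to rows written before the column existed.

```python
import sqlite3


def ensure_job_type_column(conn: sqlite3.Connection) -> bool:
    """Add jobs.job_type if it is missing; return True when it was added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(jobs)")}
    if "job_type" in columns:
        return False
    conn.execute("ALTER TABLE jobs ADD COLUMN job_type TEXT DEFAULT 'ingest'")
    conn.commit()
    return True
```

Checking for the column first keeps the migration safe to run on every engine start.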
#### Scenario: Migration adds column
- **GIVEN** an existing database without the `job_type` column
- **WHEN** the engine starts
- **THEN** the column SHALL be added with default value "ingest"
### Requirement: Engine config for safety threshold
The engine `Config` class SHALL read `KB_BULK_SAFETY_PERCENT` from the environment as an integer (default 70, range 0-100). This value SHALL be used as the default safety threshold for all bulk endpoints.
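One way to read and validate the variable is shown below. The function name is hypothetical, and rejecting malformed or out-of-range values with an error (rather than clamping or falling back to 70) is an assumption; the spec only fixes the default and the range.

```python
import os
from typing import Mapping, Optional


def read_bulk_safety_percent(env: Optional[Mapping[str, str]] = None) -> int:
    """Read KB_BULK_SAFETY_PERCENT as an int in 0..100 (default 70)."""
    env = os.environ if env is None else env
    raw = env.get("KB_BULK_SAFETY_PERCENT", "70")
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(
            f"KB_BULK_SAFETY_PERCENT must be an integer, got {raw!r}"
        ) from None
    if not 0 <= value <= 100:
        raise ValueError(f"KB_BULK_SAFETY_PERCENT must be in 0..100, got {value}")
    return value
```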
### Requirement: MCP bulk delete tool
The MCP server SHALL expose a `kb_bulk_delete` tool with parameters: `document_ids` (optional list of int), `tags` (optional list of str), `doc_type` (optional str), `from_id` (optional int), `to_id` (optional int), `force` (optional bool).
The tool SHALL call `POST /api/v1/bulk/delete` on the engine via the engine client and return the JSON response.
The tool description SHALL clearly state that `tags` is a selection filter (which documents to delete), not tags to delete.
#### Scenario: MCP bulk delete by tag
- **WHEN** `kb_bulk_delete(tags=["old"])` is called
- **THEN** the engine client SHALL send `POST /api/v1/bulk/delete` with `{"tags": ["old"]}`
- **AND** the tool SHALL return the engine's JSON response
### Requirement: MCP bulk tags tool
The MCP server SHALL expose a `kb_bulk_tags` tool with parameters: `document_ids`, `tags`, `doc_type`, `from_id`, `to_id` (selection filters), plus `add` (optional list of str), `remove` (optional list of str), and `force` (optional bool).
The tool description SHALL clearly distinguish `tags` (selection filter) from `add`/`remove` (tag changes to apply).
#### Scenario: MCP bulk tag update
- **WHEN** `kb_bulk_tags(tags=["agent:mybot"], add=["reviewed"], remove=["draft"])` is called
- **THEN** the engine client SHALL send the appropriate `POST /api/v1/bulk/tags` request
### Requirement: MCP bulk set-tags tool
The MCP server SHALL expose a `kb_bulk_set_tags` tool with parameters: `document_ids`, `tags`, `doc_type`, `from_id`, `to_id` (selection filters), plus `new_tags` (list of str) and `force` (optional bool).
#### Scenario: MCP bulk set tags
- **WHEN** `kb_bulk_set_tags(doc_type="note", new_tags=["clean"])` is called
- **THEN** the engine client SHALL send `POST /api/v1/bulk/set-tags` with `{"doc_type": "note", "new_tags": ["clean"]}`
### Requirement: MCP engine client bulk methods
The MCP engine client (`mcp/engine.py`) SHALL provide three new methods:
- `bulk_delete(document_ids?, tags?, doc_type?, from_id?, to_id?, force?)` → dict
- `bulk_tags(document_ids?, tags?, doc_type?, from_id?, to_id?, add?, remove?, force?)` → dict
- `bulk_set_tags(document_ids?, tags?, doc_type?, from_id?, to_id?, new_tags, force?)` → dict
Each SHALL send a POST request to the corresponding `/api/v1/bulk/*` endpoint with the parameters as a JSON body. Each SHALL raise on non-2xx status codes, consistent with existing methods.
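A sketch of two of these methods using only the standard library. The class name, constructor, and `_post` helper are assumptions, since the existing client's HTTP layer is not shown here. Parameters left as `None` are omitted from the JSON body, which matches the MCP scenarios where `kb_bulk_delete(tags=["old"])` sends exactly `{"tags": ["old"]}`.

```python
import json
import urllib.request
from typing import Any, Optional


class EngineClient:
    """Hypothetical minimal client; only the bulk methods are sketched."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url.rstrip("/")

    def _post(self, path: str, body: dict[str, Any]) -> dict:
        request = urllib.request.Request(
            self.base_url + path,
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
        with urllib.request.urlopen(request) as response:
            return json.load(response)

    @staticmethod
    def _body(**params: Any) -> dict[str, Any]:
        # Drop unset parameters so the engine only sees what the caller passed.
        return {key: value for key, value in params.items() if value is not None}

    def bulk_delete(
        self,
        document_ids: Optional[list[int]] = None,
        tags: Optional[list[str]] = None,
        doc_type: Optional[str] = None,
        from_id: Optional[int] = None,
        to_id: Optional[int] = None,
        force: Optional[bool] = None,
    ) -> dict:
        body = self._body(
            document_ids=document_ids, tags=tags, doc_type=doc_type,
            from_id=from_id, to_id=to_id, force=force,
        )
        return self._post("/api/v1/bulk/delete", body)

    def bulk_set_tags(self, new_tags: list[str], **selection: Any) -> dict:
        # `selection` carries the same optional filter fields plus `force`.
        return self._post(
            "/api/v1/bulk/set-tags", self._body(new_tags=new_tags, **selection)
        )
```

`bulk_tags` would follow the same pattern with `add` and `remove` in place of `new_tags`.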
### Requirement: CLI bulk-remove command
The CLI SHALL expose a `kb bulk-remove` command with flags: `--tags` (comma-separated), `--type`, `--ids` (comma-separated), `--from-id`, `--to-id`, `--force`/`-f`, `--yes`/`-y`.
Without `--yes`, the CLI SHALL first display the match count and ask for interactive confirmation before proceeding.
The command SHALL call `POST /api/v1/bulk/delete` with the constructed filter.
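The mapping from flags to the request body can be illustrated with `argparse`. This is a sketch only: the real CLI may use a different framework, and the helper names are hypothetical. Comma-separated values are split and trimmed, and unset flags are left out of the body.

```python
import argparse
from typing import Any


def _csv(value: str) -> list[str]:
    return [item.strip() for item in value.split(",") if item.strip()]


def _csv_ints(value: str) -> list[int]:
    return [int(item) for item in _csv(value)]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="kb bulk-remove")
    parser.add_argument("--tags", type=_csv)
    parser.add_argument("--type", dest="doc_type")
    parser.add_argument("--ids", dest="document_ids", type=_csv_ints)
    parser.add_argument("--from-id", type=int)
    parser.add_argument("--to-id", type=int)
    parser.add_argument("--force", "-f", action="store_true")
    parser.add_argument("--yes", "-y", action="store_true")
    return parser


def build_request(args: argparse.Namespace) -> dict[str, Any]:
    """Build the JSON body for POST /api/v1/bulk/delete from parsed flags."""
    fields = ("document_ids", "tags", "doc_type", "from_id", "to_id")
    body = {
        name: getattr(args, name)
        for name in fields
        if getattr(args, name) is not None
    }
    if args.force:
        body["force"] = True
    return body
```

`--yes` only controls the interactive prompt, so it never reaches the request body.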
#### Scenario: CLI bulk remove with confirmation
- **WHEN** `kb bulk-remove --tags "draft,old" --type note` is run without `--yes`
- **THEN** the CLI SHALL display "This will delete N documents matching: tags=[draft,old] type=note" and prompt "Proceed? [y/N]"
#### Scenario: CLI bulk remove with --yes
- **WHEN** `kb bulk-remove --tags "draft" --yes` is run
- **THEN** the CLI SHALL proceed without prompting
### Requirement: CLI bulk-tag command
The CLI SHALL expose a `kb bulk-tag` command with the same filter flags as `bulk-remove`, plus `--add` and `--remove` (comma-separated tag lists).
The command SHALL call `POST /api/v1/bulk/tags` with the constructed filter and tag changes.
### Requirement: CLI bulk-set-tags command
The CLI SHALL expose a `kb bulk-set-tags` command with the same filter flags as `bulk-remove`, plus `--set` (comma-separated list of replacement tags).
The command SHALL call `POST /api/v1/bulk/set-tags` with the constructed filter and `new_tags`.