## ADDED Requirements

### Requirement: Common selection filter

All bulk engine endpoints SHALL accept a JSON body with the following optional selection fields, combined with AND logic:

- `document_ids` (list of int): match documents with these specific IDs
- `tags` (list of str): match documents that have ALL specified tags
- `doc_type` (str): match documents with this document type
- `from_id` (int): match documents with id >= this value
- `to_id` (int): match documents with id <= this value

At least one selection field MUST be present. If no selection fields are provided, the endpoint SHALL return 400 Bad Request.

#### Scenario: Filter by tags and doc_type

- **WHEN** a bulk endpoint receives `{"tags": ["draft"], "doc_type": "note"}`
- **THEN** it SHALL match only documents that have the tag "draft" AND have doc_type "note"

#### Scenario: Filter by ID range

- **WHEN** a bulk endpoint receives `{"from_id": 10, "to_id": 50}`
- **THEN** it SHALL match documents with id >= 10 AND id <= 50

#### Scenario: Filter by explicit IDs

- **WHEN** a bulk endpoint receives `{"document_ids": [1, 5, 12]}`
- **THEN** it SHALL match only documents with those specific IDs

#### Scenario: Combined filters

- **WHEN** a bulk endpoint receives `{"tags": ["agent:mybot"], "doc_type": "note", "from_id": 100}`
- **THEN** it SHALL match documents satisfying ALL three criteria

#### Scenario: No selection fields provided

- **WHEN** a bulk endpoint receives `{}` or `{"force": true}` with no selection fields
- **THEN** it SHALL return 400 Bad Request

### Requirement: Safety threshold

All bulk endpoints SHALL enforce a safety threshold. Before executing, the engine SHALL count the matched documents and the total documents in the database. If `matched / total * 100` exceeds the configured threshold, the request SHALL be rejected with 409 Conflict.
The response SHALL include: `error` ("safety_threshold_exceeded"), `message` (human-readable), `matched` (int), `total` (int), `percent` (float), and `threshold` (int).

The threshold SHALL default to 70 and be configurable via the `KB_BULK_SAFETY_PERCENT` environment variable (integer 0-100). A value of 0 disables the check.

The caller MAY override the threshold by including `"force": true` in the request body.

#### Scenario: Threshold exceeded

- **GIVEN** 1000 total documents and `KB_BULK_SAFETY_PERCENT` is 70
- **WHEN** a bulk endpoint matches 750 documents (75%) without `force: true`
- **THEN** it SHALL return 409 with `matched: 750`, `total: 1000`, `percent: 75.0`, `threshold: 70`

#### Scenario: Threshold not exceeded

- **GIVEN** 1000 total documents and `KB_BULK_SAFETY_PERCENT` is 70
- **WHEN** a bulk endpoint matches 500 documents (50%) without `force: true`
- **THEN** the operation SHALL proceed normally

#### Scenario: Force override

- **GIVEN** 1000 total documents and a match of 900 (90%)
- **WHEN** the request includes `"force": true`
- **THEN** the operation SHALL proceed regardless of threshold

#### Scenario: Zero threshold

- **GIVEN** `KB_BULK_SAFETY_PERCENT` is 0
- **THEN** the safety check SHALL be effectively disabled for all operations

### Requirement: Synchronous response with audit log

All bulk endpoints SHALL execute synchronously and return a JSON response with:

- `job_id` (int): ID of the audit log entry in the jobs table
- `status` (str): "done" or "partial_failure"
- `matched` (int): number of documents that matched the selection
- `succeeded` (int): number of documents successfully processed
- `failed` (int): number of documents that failed
- `errors` (list): array of `{"document_id": int, "error": str}` for each failure (empty on full success)

A job record SHALL be created in the jobs table with `job_type` set to the operation type. The `filename` field SHALL store a JSON representation of the selection filter.
The `error` field SHALL store a JSON array of individual errors if any occurred.

#### Scenario: Full success

- **WHEN** a bulk operation matches 50 documents and all succeed
- **THEN** the response SHALL have `status: "done"`, `matched: 50`, `succeeded: 50`, `failed: 0`, `errors: []`

#### Scenario: Partial failure

- **WHEN** a bulk operation matches 50 documents but 2 fail
- **THEN** the response SHALL have `status: "partial_failure"`, `matched: 50`, `succeeded: 48`, `failed: 2`, and `errors` listing the 2 failures

### Requirement: Bulk delete endpoint

The engine SHALL expose `POST /api/v1/bulk/delete` which permanently deletes all documents matching the selection filter. For each matched document, it SHALL delete embeddings from `chunks_vec`, delete the document row (cascading to chunks and document_tags), and delete any stored file from disk.

Database deletions SHALL be performed within a single transaction. File deletions SHALL occur after the transaction commits and SHALL be best-effort (failures logged but not counted as document failures).

#### Scenario: Bulk delete by tag

- **WHEN** `POST /api/v1/bulk/delete` receives `{"tags": ["old", "draft"]}`
- **THEN** all documents with both tags "old" and "draft" SHALL be deleted
- **AND** their chunks, embeddings, tag associations, and stored files SHALL be removed

#### Scenario: Bulk delete with no matches

- **WHEN** `POST /api/v1/bulk/delete` receives a filter that matches 0 documents
- **THEN** the response SHALL have `matched: 0`, `succeeded: 0`, `failed: 0`

### Requirement: Bulk tags endpoint

The engine SHALL expose `POST /api/v1/bulk/tags` which adds and/or removes tags on all documents matching the selection filter. The request body SHALL include the selection filter plus:

- `add` (list of str, optional): tags to add
- `remove` (list of str, optional): tags to remove

At least one of `add` or `remove` MUST be present. The endpoint SHALL return 400 if neither is provided.
The endpoint SHALL update `updated_at` on all affected documents.

#### Scenario: Add and remove tags in one call

- **WHEN** `POST /api/v1/bulk/tags` receives `{"tags": ["agent:mybot"], "add": ["reviewed"], "remove": ["pending"]}`
- **THEN** all documents tagged "agent:mybot" SHALL have "reviewed" added and "pending" removed

### Requirement: Bulk set-tags endpoint

The engine SHALL expose `POST /api/v1/bulk/set-tags` which replaces all tags on matched documents with a new set. The request body SHALL include the selection filter plus:

- `new_tags` (list of str): the replacement tag set

The endpoint SHALL remove all existing tag associations from matched documents, then apply the new set. It SHALL update `updated_at` on all affected documents.

#### Scenario: Replace all tags

- **WHEN** `POST /api/v1/bulk/set-tags` receives `{"doc_type": "note", "new_tags": ["clean", "final"]}`
- **THEN** all notes SHALL have their existing tags removed and replaced with "clean" and "final"

### Requirement: Jobs table extension

The jobs table SHALL be extended with a `job_type` column (TEXT, default "ingest") to distinguish ingestion jobs from bulk operation audit entries. Valid values: "ingest", "bulk_delete", "bulk_tags", "bulk_set_tags".

Existing jobs SHALL default to `job_type = "ingest"`. The existing jobs list endpoint and CLI `kb jobs` command SHALL continue to work unchanged.

#### Scenario: Migration adds column

- **GIVEN** an existing database without the `job_type` column
- **WHEN** the engine starts
- **THEN** the column SHALL be added with default value "ingest"

### Requirement: Engine config for safety threshold

The engine `Config` class SHALL read `KB_BULK_SAFETY_PERCENT` from the environment as an integer (default 70, range 0-100). This value SHALL be used as the default safety threshold for all bulk endpoints.
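The config requirement above and the safety threshold rule it feeds can be illustrated together. The following is a minimal sketch, not the engine's actual `Config` class: the function names `load_safety_percent` and `check_safety` are hypothetical, and two behaviours are assumptions the spec does not state, namely raising `ValueError` for an out-of-range or non-integer value and skipping the check when the database is empty.

```python
import os
from typing import Mapping, Optional

DEFAULT_SAFETY_PERCENT = 70  # default stated in the spec


def load_safety_percent(env: Mapping[str, str] = os.environ) -> int:
    """Read KB_BULK_SAFETY_PERCENT as an integer in the range 0-100."""
    raw = env.get("KB_BULK_SAFETY_PERCENT")
    if raw is None:
        return DEFAULT_SAFETY_PERCENT
    value = int(raw)  # assumption: a non-integer value raises ValueError
    if not 0 <= value <= 100:
        # Assumption: the spec gives the range but not the failure mode.
        raise ValueError("KB_BULK_SAFETY_PERCENT must be between 0 and 100")
    return value


def check_safety(
    matched: int, total: int, threshold: int, force: bool = False
) -> Optional[dict]:
    """Return a 409 response body if the threshold is exceeded, else None."""
    # force=True overrides the check; a threshold of 0 disables it.
    # Assumption: an empty database (total == 0) never trips the check.
    if force or threshold == 0 or total == 0:
        return None
    percent = matched / total * 100
    if percent <= threshold:  # only values strictly above the threshold fail
        return None
    return {
        "error": "safety_threshold_exceeded",
        "message": (
            f"{matched} of {total} documents ({percent:.1f}%) matched, "
            f"above the {threshold}% safety threshold"
        ),
        "matched": matched,
        "total": total,
        "percent": percent,
        "threshold": threshold,
    }
```

With 1000 total documents and a threshold of 70, `check_safety(750, 1000, 70)` yields a body with `percent` 75.0, while `check_safety(500, 1000, 70)` and `check_safety(900, 1000, 70, force=True)` both return `None`, matching the scenarios in the safety threshold requirement.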
### Requirement: MCP bulk delete tool

The MCP server SHALL expose a `kb_bulk_delete` tool with parameters: `document_ids` (optional list of int), `tags` (optional list of str), `doc_type` (optional str), `from_id` (optional int), `to_id` (optional int), `force` (optional bool). The tool SHALL call `POST /api/v1/bulk/delete` on the engine via the engine client and return the JSON response.

The tool description SHALL clearly state that `tags` is a selection filter (which documents to delete), not tags to delete.

#### Scenario: MCP bulk delete by tag

- **WHEN** `kb_bulk_delete(tags=["old"])` is called
- **THEN** the engine client SHALL send `POST /api/v1/bulk/delete` with `{"tags": ["old"]}`
- **AND** the tool SHALL return the engine's JSON response

### Requirement: MCP bulk tags tool

The MCP server SHALL expose a `kb_bulk_tags` tool with parameters: `document_ids`, `tags`, `doc_type`, `from_id`, `to_id` (selection filters), plus `add` (optional list of str), `remove` (optional list of str), and `force` (optional bool).

The tool description SHALL clearly distinguish `tags` (selection filter) from `add`/`remove` (tag changes to apply).

#### Scenario: MCP bulk tag update

- **WHEN** `kb_bulk_tags(tags=["agent:mybot"], add=["reviewed"], remove=["draft"])` is called
- **THEN** the engine client SHALL send the appropriate `POST /api/v1/bulk/tags` request

### Requirement: MCP bulk set-tags tool

The MCP server SHALL expose a `kb_bulk_set_tags` tool with parameters: `document_ids`, `tags`, `doc_type`, `from_id`, `to_id` (selection filters), plus `new_tags` (list of str) and `force` (optional bool).
#### Scenario: MCP bulk set tags

- **WHEN** `kb_bulk_set_tags(doc_type="note", new_tags=["clean"])` is called
- **THEN** the engine client SHALL send `POST /api/v1/bulk/set-tags` with `{"doc_type": "note", "new_tags": ["clean"]}`

### Requirement: MCP engine client bulk methods

The MCP engine client (`mcp/engine.py`) SHALL provide three new methods:

- `bulk_delete(document_ids?, tags?, doc_type?, from_id?, to_id?, force?)` → dict
- `bulk_tags(document_ids?, tags?, doc_type?, from_id?, to_id?, add?, remove?, force?)` → dict
- `bulk_set_tags(document_ids?, tags?, doc_type?, from_id?, to_id?, new_tags?, force?)` → dict

Each SHALL send a POST request to the corresponding `/api/v1/bulk/*` endpoint with the parameters as a JSON body. Each SHALL raise on non-2xx status codes, consistent with existing methods.

### Requirement: CLI bulk-remove command

The CLI SHALL expose a `kb bulk-remove` command with flags: `--tags` (comma-separated), `--type`, `--ids` (comma-separated), `--from-id`, `--to-id`, `--force`/`-f`, `--yes`/`-y`.

Without `--yes`, the CLI SHALL first display the match count and ask for interactive confirmation before proceeding. The command SHALL call `POST /api/v1/bulk/delete` with the constructed filter.

#### Scenario: CLI bulk remove with confirmation

- **WHEN** `kb bulk-remove --tags "draft,old" --type note` is run without `--yes`
- **THEN** the CLI SHALL display "This will delete N documents matching: tags=[draft,old] type=note" and prompt "Proceed? [y/N]"

#### Scenario: CLI bulk remove with --yes

- **WHEN** `kb bulk-remove --tags "draft" --yes` is run
- **THEN** the CLI SHALL proceed without prompting

### Requirement: CLI bulk-tag command

The CLI SHALL expose a `kb bulk-tag` command with the same filter flags as `bulk-remove`, plus `--add` and `--remove` (comma-separated tag lists). The command SHALL call `POST /api/v1/bulk/tags` with the constructed filter and tag changes.
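How the CLI might turn its comma-separated flags into the JSON body for `POST /api/v1/bulk/tags` can be sketched as follows. This is only an illustration under assumptions: the helper names `split_csv`, `build_filter`, and `build_bulk_tags_body` are hypothetical, whitespace around items is assumed to be stripped, and unset flags are assumed to be omitted from the body rather than sent as null.

```python
from typing import Any, Dict, List, Optional


def split_csv(value: Optional[str]) -> List[str]:
    """Split a comma-separated flag value, dropping blanks (assumed behaviour)."""
    if not value:
        return []
    return [item.strip() for item in value.split(",") if item.strip()]


def build_filter(
    tags: Optional[str] = None,      # --tags
    doc_type: Optional[str] = None,  # --type
    ids: Optional[str] = None,       # --ids
    from_id: Optional[int] = None,   # --from-id
    to_id: Optional[int] = None,     # --to-id
    force: bool = False,             # --force / -f
) -> Dict[str, Any]:
    """Map the shared filter flags onto the selection-filter field names."""
    body: Dict[str, Any] = {}
    if split_csv(tags):
        body["tags"] = split_csv(tags)
    if doc_type:
        body["doc_type"] = doc_type
    if split_csv(ids):
        body["document_ids"] = [int(item) for item in split_csv(ids)]
    if from_id is not None:
        body["from_id"] = from_id
    if to_id is not None:
        body["to_id"] = to_id
    if force:
        body["force"] = True
    return body


def build_bulk_tags_body(
    add: Optional[str] = None,     # --add
    remove: Optional[str] = None,  # --remove
    **filter_flags: Any,
) -> Dict[str, Any]:
    """Combine the selection filter with the tag changes for bulk-tag."""
    body = build_filter(**filter_flags)
    if split_csv(add):
        body["add"] = split_csv(add)
    if split_csv(remove):
        body["remove"] = split_csv(remove)
    return body
```

For example, `build_bulk_tags_body(tags="agent:mybot", add="reviewed", remove="pending")` produces the same body as the "Add and remove tags in one call" scenario. Validation is deliberately left to the engine here, since the spec assigns the 400 responses to the endpoints.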
### Requirement: CLI bulk-set-tags command

The CLI SHALL expose a `kb bulk-set-tags` command with the filter flags, plus `--set` (comma-separated list of replacement tags). The command SHALL call `POST /api/v1/bulk/set-tags` with the constructed filter and `new_tags`.
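Every endpoint, tool, and command in this spec relies on the common selection filter, so its AND semantics are worth seeing in isolation. The following is a minimal in-memory sketch, not the engine's query logic, which would presumably run against the database. The names `has_selection`, `matches`, and `select` and the dict-shaped documents are hypothetical, and treating a null field as absent is an assumption.

```python
from typing import Any, Dict, Iterable, List

SELECTION_FIELDS = ("document_ids", "tags", "doc_type", "from_id", "to_id")


def has_selection(body: Dict[str, Any]) -> bool:
    """True if at least one selection field is present (null counts as absent)."""
    return any(body.get(field) is not None for field in SELECTION_FIELDS)


def matches(doc: Dict[str, Any], body: Dict[str, Any]) -> bool:
    """Apply every supplied selection field; all of them must hold (AND logic)."""
    if body.get("document_ids") is not None and doc["id"] not in body["document_ids"]:
        return False
    # The document must carry ALL requested tags, not just one of them.
    if body.get("tags") is not None and not set(body["tags"]) <= set(doc["tags"]):
        return False
    if body.get("doc_type") is not None and doc["doc_type"] != body["doc_type"]:
        return False
    if body.get("from_id") is not None and doc["id"] < body["from_id"]:
        return False
    if body.get("to_id") is not None and doc["id"] > body["to_id"]:
        return False
    return True


def select(docs: Iterable[Dict[str, Any]], body: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the matching documents, refusing an empty filter."""
    if not has_selection(body):
        # The engine would answer 400 Bad Request at this point.
        raise ValueError("at least one selection field is required")
    return [doc for doc in docs if matches(doc, body)]
```

Note that `force` is not a selection field, so a body of `{"force": true}` alone is rejected in the same way as `{}`.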