## ADDED Requirements

### Requirement: Common selection filter

All bulk engine endpoints SHALL accept a JSON body with the following optional selection fields, combined with AND logic:

- `document_ids` (list of int) — match documents with these specific IDs
- `tags` (list of str) — match documents that have ALL specified tags
- `doc_type` (str) — match documents with this document type
- `from_id` (int) — match documents with id >= this value
- `to_id` (int) — match documents with id <= this value

At least one selection field MUST be present. If no selection fields are provided, the endpoint SHALL return 400 Bad Request.

#### Scenario: Filter by tags and doc_type

- **WHEN** a bulk endpoint receives `{"tags": ["draft"], "doc_type": "note"}`
- **THEN** it SHALL match only documents that have the tag "draft" AND have doc_type "note"

#### Scenario: Filter by ID range

- **WHEN** a bulk endpoint receives `{"from_id": 10, "to_id": 50}`
- **THEN** it SHALL match documents with id >= 10 AND id <= 50

#### Scenario: Filter by explicit IDs

- **WHEN** a bulk endpoint receives `{"document_ids": [1, 5, 12]}`
- **THEN** it SHALL match only documents with those specific IDs

#### Scenario: Combined filters

- **WHEN** a bulk endpoint receives `{"tags": ["agent:mybot"], "doc_type": "note", "from_id": 100}`
- **THEN** it SHALL match documents satisfying ALL three criteria

#### Scenario: No selection fields provided

- **WHEN** a bulk endpoint receives `{}` or `{"force": true}` with no selection fields
- **THEN** it SHALL return 400 Bad Request

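The selection rules above can be sketched in memory as follows. This is an illustration only: the `Document` dataclass, `parse_selection`, `matches`, and `NoSelectionError` are hypothetical names, and a real engine would presumably translate the filter into a SQL query instead. How an empty list (for example `"tags": []`) should be treated is not stated in this spec; the sketch counts it as "present".

```python
from dataclasses import dataclass, field
from typing import Any

SELECTION_FIELDS = ("document_ids", "tags", "doc_type", "from_id", "to_id")


class NoSelectionError(ValueError):
    """No selection field in the body; an endpoint would map this to 400."""


@dataclass
class Document:
    # Hypothetical in-memory stand-in for a stored document row.
    id: int
    doc_type: str
    tags: set[str] = field(default_factory=set)


def parse_selection(body: dict[str, Any]) -> dict[str, Any]:
    """Keep only the selection fields; reject bodies that contain none."""
    selection = {k: body[k] for k in SELECTION_FIELDS if body.get(k) is not None}
    if not selection:
        raise NoSelectionError("at least one selection field is required")
    return selection


def matches(doc: Document, selection: dict[str, Any]) -> bool:
    """True only if the document satisfies every supplied field (AND logic)."""
    if "document_ids" in selection and doc.id not in selection["document_ids"]:
        return False
    # The document must carry ALL requested tags (subset test).
    if "tags" in selection and not set(selection["tags"]) <= doc.tags:
        return False
    if "doc_type" in selection and doc.doc_type != selection["doc_type"]:
        return False
    if "from_id" in selection and doc.id < selection["from_id"]:
        return False
    if "to_id" in selection and doc.id > selection["to_id"]:
        return False
    return True
```

Non-selection keys such as `force` are ignored by `parse_selection`, so `{"force": true}` alone still counts as "no selection fields".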
### Requirement: Safety threshold

All bulk endpoints SHALL enforce a safety threshold. Before executing, the engine SHALL count the matched documents and the total documents in the database. If `matched / total * 100` exceeds the configured threshold, the request SHALL be rejected with 409 Conflict.

The 409 response body SHALL include: `error` ("safety_threshold_exceeded"), `message` (human-readable), `matched` (int), `total` (int), `percent` (float), and `threshold` (int).

The threshold SHALL default to 70 and be configurable via the `KB_BULK_SAFETY_PERCENT` environment variable (integer 0-100). A value of 0 disables the check.

The caller MAY override the threshold by including `"force": true` in the request body.

#### Scenario: Threshold exceeded

- **GIVEN** 1000 total documents and `KB_BULK_SAFETY_PERCENT` is 70
- **WHEN** a bulk endpoint matches 750 documents (75%) without `force: true`
- **THEN** it SHALL return 409 with `matched: 750`, `total: 1000`, `percent: 75.0`, `threshold: 70`

#### Scenario: Threshold not exceeded

- **GIVEN** 1000 total documents and `KB_BULK_SAFETY_PERCENT` is 70
- **WHEN** a bulk endpoint matches 500 documents (50%) without `force: true`
- **THEN** the operation SHALL proceed normally

#### Scenario: Force override

- **GIVEN** 1000 total documents and a match of 900 (90%)
- **WHEN** the request includes `"force": true`
- **THEN** the operation SHALL proceed regardless of threshold

#### Scenario: Zero threshold

- **GIVEN** `KB_BULK_SAFETY_PERCENT` is 0
- **WHEN** any bulk endpoint is called, with or without `force: true`
- **THEN** the safety check SHALL be disabled and the operation SHALL proceed regardless of the match percentage

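A minimal sketch of the threshold check described above. The function name `check_safety_threshold` is hypothetical. Two behaviours here are assumptions the spec does not state: an empty database (`total == 0`) skips the check, and the reported `percent` is rounded to one decimal place. The decision itself uses integer arithmetic so that a match of exactly the threshold percentage is never rejected because of floating-point rounding.

```python
from typing import Optional


def check_safety_threshold(
    matched: int, total: int, threshold: int, force: bool = False
) -> Optional[dict]:
    """Return None if the operation may proceed, else a 409 response body."""
    # threshold == 0 disables the check; force overrides it.
    # Skipping the check on an empty database is an assumption.
    if force or threshold == 0 or total == 0:
        return None
    # Same test as matched / total * 100 > threshold, without float error.
    if matched * 100 <= threshold * total:
        return None
    percent = round(matched / total * 100, 1)  # rounding is an assumption
    return {
        "error": "safety_threshold_exceeded",
        "message": (
            f"Operation matches {matched} of {total} documents "
            f"({percent}%), above the {threshold}% safety threshold"
        ),
        "matched": matched,
        "total": total,
        "percent": percent,
        "threshold": threshold,
    }
```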
### Requirement: Synchronous response with audit log

All bulk endpoints SHALL execute synchronously and return a JSON response with:

- `job_id` (int) — ID of the audit log entry in the jobs table
- `status` (str) — "done" or "partial_failure"
- `matched` (int) — number of documents that matched the selection
- `succeeded` (int) — number of documents successfully processed
- `failed` (int) — number of documents that failed
- `errors` (list) — array of `{"document_id": int, "error": str}` for each failure (empty on full success)

A job record SHALL be created in the jobs table with `job_type` set to the operation type. The `filename` field SHALL store a JSON representation of the selection filter. The `error` field SHALL store a JSON array of individual errors if any occurred.

#### Scenario: Full success

- **WHEN** a bulk operation matches 50 documents and all succeed
- **THEN** the response SHALL have `status: "done"`, `matched: 50`, `succeeded: 50`, `failed: 0`, `errors: []`

#### Scenario: Partial failure

- **WHEN** a bulk operation matches 50 documents but 2 fail
- **THEN** the response SHALL have `status: "partial_failure"`, `matched: 50`, `succeeded: 48`, `failed: 2`, and `errors` listing the 2 failures

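The response shape above can be derived from per-document outcomes roughly as shown here. The helper name `build_bulk_response` and the `results` mapping (document ID to an error message, or `None` on success) are hypothetical conveniences for this sketch, not part of the spec.

```python
from typing import Optional


def build_bulk_response(job_id: int, results: dict[int, Optional[str]]) -> dict:
    """Summarise per-document outcomes into the bulk response body."""
    errors = [
        {"document_id": doc_id, "error": message}
        for doc_id, message in results.items()
        if message is not None
    ]
    failed = len(errors)
    return {
        "job_id": job_id,
        "status": "partial_failure" if failed else "done",
        "matched": len(results),
        "succeeded": len(results) - failed,
        "failed": failed,
        "errors": errors,  # empty list on full success
    }
```

The same `errors` list, serialised with `json.dumps`, is what the job record's `error` field would hold.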
### Requirement: Bulk delete endpoint

The engine SHALL expose `POST /api/v1/bulk/delete` which permanently deletes all documents matching the selection filter. For each matched document, it SHALL delete embeddings from `chunks_vec`, delete the document row (cascading to chunks and document_tags), and delete any stored file from disk.

Database deletions SHALL be performed within a single transaction. File deletions SHALL occur after the transaction commits and SHALL be best-effort (failures logged but not counted as document failures).

#### Scenario: Bulk delete by tag

- **WHEN** `POST /api/v1/bulk/delete` receives `{"tags": ["old", "draft"]}`
- **THEN** all documents with both tags "old" and "draft" SHALL be deleted
- **AND** their chunks, embeddings, tag associations, and stored files SHALL be removed

#### Scenario: Bulk delete with no matches

- **WHEN** `POST /api/v1/bulk/delete` receives a filter that matches 0 documents
- **THEN** the response SHALL have `matched: 0`, `succeeded: 0`, `failed: 0`

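The ordering described above (database work in one transaction, file removal afterwards and best-effort) might look like the following with the standard `sqlite3` module. The schema is an assumption made for the sketch: the column names `file_path`, `document_id`, and `chunk_id` are hypothetical, and `chunks_vec` is modelled as an ordinary table keyed by chunk ID, whereas the real one may be a vector-extension virtual table.

```python
import logging
import os
import sqlite3

log = logging.getLogger("bulk_delete")

# Hypothetical minimal schema, used only to make the sketch runnable.
SCHEMA = """
CREATE TABLE documents (id INTEGER PRIMARY KEY, file_path TEXT);
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    document_id INTEGER REFERENCES documents(id) ON DELETE CASCADE
);
CREATE TABLE chunks_vec (chunk_id INTEGER PRIMARY KEY);
"""


def bulk_delete(conn: sqlite3.Connection, doc_ids: list[int]) -> int:
    """Delete documents in one transaction, then remove files best-effort."""
    paths: list[str] = []
    deleted = 0
    with conn:  # commits on success, rolls back if any statement raises
        for doc_id in doc_ids:
            row = conn.execute(
                "SELECT file_path FROM documents WHERE id = ?", (doc_id,)
            ).fetchone()
            if row is None:
                continue
            # Embeddings first: nothing cascades into chunks_vec here.
            conn.execute(
                "DELETE FROM chunks_vec WHERE chunk_id IN "
                "(SELECT id FROM chunks WHERE document_id = ?)",
                (doc_id,),
            )
            # Cascades to chunks when PRAGMA foreign_keys is ON.
            conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
            deleted += 1
            if row[0]:
                paths.append(row[0])
    # Only after the commit: a failure here is logged, never raised.
    for path in paths:
        try:
            os.remove(path)
        except OSError as exc:
            log.warning("could not remove %s: %s", path, exc)
    return deleted
```

Collecting the file paths inside the transaction and removing them only after the commit means a rollback never leaves a database row pointing at a file that is already gone.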
### Requirement: Bulk tags endpoint

The engine SHALL expose `POST /api/v1/bulk/tags` which adds and/or removes tags on all documents matching the selection filter. The request body SHALL include the selection filter plus:

- `add` (list of str, optional) — tags to add
- `remove` (list of str, optional) — tags to remove

At least one of `add` or `remove` MUST be present. The endpoint SHALL return 400 if neither is provided.

The endpoint SHALL update `updated_at` on all affected documents.

#### Scenario: Add and remove tags in one call

- **WHEN** `POST /api/v1/bulk/tags` receives `{"tags": ["agent:mybot"], "add": ["reviewed"], "remove": ["pending"]}`
- **THEN** all documents tagged "agent:mybot" SHALL have "reviewed" added and "pending" removed

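The per-document effect of `add` and `remove` can be illustrated with plain set operations. The function `apply_tag_changes` is a hypothetical helper. The spec does not say what happens when the same tag appears in both lists; this sketch assumes removal wins.

```python
from typing import Iterable, Optional


def apply_tag_changes(
    current: set[str],
    add: Optional[Iterable[str]] = None,
    remove: Optional[Iterable[str]] = None,
) -> set[str]:
    """Return the document's tag set after applying add and remove."""
    if add is None and remove is None:
        # An endpoint would translate this into 400 Bad Request.
        raise ValueError("at least one of 'add' or 'remove' is required")
    # Additions first, then removals, so a tag in both lists ends up removed
    # (an assumption; the spec leaves this case open).
    return (current | set(add or ())) - set(remove or ())
```

For the scenario above, a document tagged `{"agent:mybot", "pending"}` ends up with `{"agent:mybot", "reviewed"}`.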
### Requirement: Bulk set-tags endpoint

The engine SHALL expose `POST /api/v1/bulk/set-tags` which replaces all tags on matched documents with a new set. The request body SHALL include the selection filter plus:

- `new_tags` (list of str) — the replacement tag set

The endpoint SHALL remove all existing tag associations from matched documents, then apply the new set. It SHALL update `updated_at` on all affected documents.

#### Scenario: Replace all tags

- **WHEN** `POST /api/v1/bulk/set-tags` receives `{"doc_type": "note", "new_tags": ["clean", "final"]}`
- **THEN** all notes SHALL have their existing tags removed and replaced with "clean" and "final"

### Requirement: Jobs table extension

The jobs table SHALL be extended with a `job_type` column (TEXT, default "ingest") to distinguish ingestion jobs from bulk operation audit entries. Valid values: "ingest", "bulk_delete", "bulk_tags", "bulk_set_tags".

Existing jobs SHALL default to `job_type = "ingest"`. The existing jobs list endpoint and CLI `kb jobs` command SHALL continue to work unchanged.

#### Scenario: Migration adds column

- **GIVEN** an existing database without the `job_type` column
- **WHEN** the engine starts
- **THEN** the column SHALL be added with default value "ingest"

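An idempotent startup migration for this column could be sketched as below, assuming the engine stores jobs in SQLite and accesses it through the standard `sqlite3` module (the reference to `chunks_vec` elsewhere in this spec suggests SQLite, but that is an inference). The function name `ensure_job_type_column` is hypothetical.

```python
import sqlite3


def ensure_job_type_column(conn: sqlite3.Connection) -> None:
    """Add jobs.job_type with default 'ingest' if it is not there yet."""
    # PRAGMA table_info yields one row per column; index 1 is the name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(jobs)")}
    if "job_type" not in columns:
        conn.execute("ALTER TABLE jobs ADD COLUMN job_type TEXT DEFAULT 'ingest'")
        conn.commit()
```

Because SQLite applies a constant column default to rows that existed before the `ALTER TABLE`, existing jobs read back as `"ingest"` without a separate backfill, and calling the function on every start is safe.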
### Requirement: Engine config for safety threshold

The engine `Config` class SHALL read `KB_BULK_SAFETY_PERCENT` from the environment as an integer (default 70, range 0-100). This value SHALL be used as the default safety threshold for all bulk endpoints.

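Reading and validating the variable might look like this. The helper `read_bulk_safety_percent` is a hypothetical free function rather than the actual `Config` class, and rejecting out-of-range or non-integer values with `ValueError` is an assumption: the spec gives the range but not the failure mode.

```python
import os
from typing import Mapping, Optional

DEFAULT_BULK_SAFETY_PERCENT = 70


def read_bulk_safety_percent(env: Optional[Mapping[str, str]] = None) -> int:
    """Read KB_BULK_SAFETY_PERCENT as an integer in the range 0-100."""
    env = os.environ if env is None else env
    raw = env.get("KB_BULK_SAFETY_PERCENT")
    if raw is None or raw.strip() == "":
        return DEFAULT_BULK_SAFETY_PERCENT
    value = int(raw)  # raises ValueError for non-integer input
    if not 0 <= value <= 100:
        raise ValueError(f"KB_BULK_SAFETY_PERCENT must be 0-100, got {value}")
    return value
```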
### Requirement: MCP bulk delete tool

The MCP server SHALL expose a `kb_bulk_delete` tool with parameters: `document_ids` (optional list of int), `tags` (optional list of str), `doc_type` (optional str), `from_id` (optional int), `to_id` (optional int), `force` (optional bool).

The tool SHALL call `POST /api/v1/bulk/delete` on the engine via the engine client and return the JSON response.

The tool description SHALL clearly state that `tags` is a selection filter (which documents to delete), not tags to delete.

#### Scenario: MCP bulk delete by tag

- **WHEN** `kb_bulk_delete(tags=["old"])` is called
- **THEN** the engine client SHALL send `POST /api/v1/bulk/delete` with `{"tags": ["old"]}`
- **AND** the tool SHALL return the engine's JSON response

### Requirement: MCP bulk tags tool

The MCP server SHALL expose a `kb_bulk_tags` tool with parameters: `document_ids`, `tags`, `doc_type`, `from_id`, `to_id` (selection filters), plus `add` (optional list of str), `remove` (optional list of str), and `force` (optional bool).

The tool description SHALL clearly distinguish `tags` (selection filter) from `add`/`remove` (tag changes to apply).

#### Scenario: MCP bulk tag update

- **WHEN** `kb_bulk_tags(tags=["agent:mybot"], add=["reviewed"], remove=["draft"])` is called
- **THEN** the engine client SHALL send the appropriate `POST /api/v1/bulk/tags` request

### Requirement: MCP bulk set-tags tool

The MCP server SHALL expose a `kb_bulk_set_tags` tool with parameters: `document_ids`, `tags`, `doc_type`, `from_id`, `to_id` (selection filters), plus `new_tags` (list of str) and `force` (optional bool).

#### Scenario: MCP bulk set tags

- **WHEN** `kb_bulk_set_tags(doc_type="note", new_tags=["clean"])` is called
- **THEN** the engine client SHALL send `POST /api/v1/bulk/set-tags` with `{"doc_type": "note", "new_tags": ["clean"]}`

### Requirement: MCP engine client bulk methods

The MCP engine client (`mcp/engine.py`) SHALL provide three new methods:

- `bulk_delete(document_ids?, tags?, doc_type?, from_id?, to_id?, force?)` → dict
- `bulk_tags(document_ids?, tags?, doc_type?, from_id?, to_id?, add?, remove?, force?)` → dict
- `bulk_set_tags(document_ids?, tags?, doc_type?, from_id?, to_id?, new_tags?, force?)` → dict

Each SHALL send a POST request to the corresponding `/api/v1/bulk/*` endpoint with the parameters as a JSON body. Each SHALL raise on non-2xx status codes, consistent with existing methods.

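One possible shape for these methods is sketched below. It is not the existing client: the `EngineClient` class, its `_post` helper, and the use of the standard library's `urllib.request` are all assumptions made to keep the example self-contained, and the real `mcp/engine.py` may use a different HTTP library. Parameters left as `None` are omitted from the JSON body, which matches the scenario where `kb_bulk_delete(tags=["old"])` sends exactly `{"tags": ["old"]}`.

```python
import json
import urllib.request
from typing import Any


def build_bulk_payload(**params: Any) -> dict:
    """Drop parameters the caller did not supply (None) from the body."""
    return {key: value for key, value in params.items() if value is not None}


class EngineClient:
    # Hypothetical stand-in for the client in mcp/engine.py.
    def __init__(self, base_url: str) -> None:
        self.base_url = base_url.rstrip("/")

    def _post(self, path: str, payload: dict) -> dict:
        request = urllib.request.Request(
            self.base_url + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        # urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
        with urllib.request.urlopen(request) as response:
            return json.load(response)

    def bulk_delete(self, document_ids=None, tags=None, doc_type=None,
                    from_id=None, to_id=None, force=None) -> dict:
        return self._post("/api/v1/bulk/delete", build_bulk_payload(
            document_ids=document_ids, tags=tags, doc_type=doc_type,
            from_id=from_id, to_id=to_id, force=force))

    def bulk_tags(self, document_ids=None, tags=None, doc_type=None,
                  from_id=None, to_id=None, add=None, remove=None,
                  force=None) -> dict:
        return self._post("/api/v1/bulk/tags", build_bulk_payload(
            document_ids=document_ids, tags=tags, doc_type=doc_type,
            from_id=from_id, to_id=to_id, add=add, remove=remove, force=force))

    def bulk_set_tags(self, document_ids=None, tags=None, doc_type=None,
                      from_id=None, to_id=None, new_tags=None,
                      force=None) -> dict:
        return self._post("/api/v1/bulk/set-tags", build_bulk_payload(
            document_ids=document_ids, tags=tags, doc_type=doc_type,
            from_id=from_id, to_id=to_id, new_tags=new_tags, force=force))
```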
### Requirement: CLI bulk-remove command

The CLI SHALL expose a `kb bulk-remove` command with flags: `--tags` (comma-separated), `--type`, `--ids` (comma-separated), `--from-id`, `--to-id`, `--force`/`-f`, `--yes`/`-y`.

Without `--yes`, the CLI SHALL first display the match count and ask for interactive confirmation before proceeding.

The command SHALL call `POST /api/v1/bulk/delete` with the constructed filter.

#### Scenario: CLI bulk remove with confirmation

- **WHEN** `kb bulk-remove --tags "draft,old" --type note` is run without `--yes`
- **THEN** the CLI SHALL display "This will delete N documents matching: tags=[draft,old] type=note" and prompt "Proceed? [y/N]"

#### Scenario: CLI bulk remove with --yes

- **WHEN** `kb bulk-remove --tags "draft" --yes` is run
- **THEN** the CLI SHALL proceed without prompting

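The flag parsing and confirmation flow for `kb bulk-remove` could be sketched with `argparse`, as below. The actual CLI framework is not named in this spec, so `argparse` and the helper names (`split_csv`, `build_parser`, `build_filter`, `confirm`) are assumptions. How the CLI obtains the match count before deleting is also not specified here, so `confirm` simply takes it as an argument.

```python
import argparse
from typing import Callable


def split_csv(value: str) -> list[str]:
    """Split a comma-separated flag value, dropping empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="kb bulk-remove")
    parser.add_argument("--tags", type=split_csv)
    parser.add_argument("--type", dest="doc_type")
    parser.add_argument("--ids", type=lambda v: [int(i) for i in split_csv(v)])
    parser.add_argument("--from-id", type=int)
    parser.add_argument("--to-id", type=int)
    parser.add_argument("--force", "-f", action="store_true")
    parser.add_argument("--yes", "-y", action="store_true")
    return parser


def build_filter(args: argparse.Namespace) -> dict:
    """Build the JSON body for POST /api/v1/bulk/delete from parsed flags."""
    body = {
        "tags": args.tags,
        "doc_type": args.doc_type,
        "document_ids": args.ids,
        "from_id": args.from_id,
        "to_id": args.to_id,
    }
    body = {key: value for key, value in body.items() if value is not None}
    if args.force:
        body["force"] = True
    return body


def confirm(count: int, description: str, yes: bool,
            ask: Callable[[str], str] = input) -> bool:
    """Return True if the deletion should go ahead."""
    if yes:
        return True  # --yes skips the prompt entirely
    print(f"This will delete {count} documents matching: {description}")
    return ask("Proceed? [y/N] ").strip().lower() in ("y", "yes")
```

Passing `ask` as a parameter keeps the prompt testable without a terminal; anything other than `y` or `yes` is treated as a refusal, in line with the `[y/N]` default.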
### Requirement: CLI bulk-tag command

The CLI SHALL expose a `kb bulk-tag` command with the same filter flags as `bulk-remove`, plus `--add` and `--remove` (comma-separated tag lists).

The command SHALL call `POST /api/v1/bulk/tags` with the constructed filter and tag changes.

### Requirement: CLI bulk-set-tags command

The CLI SHALL expose a `kb bulk-set-tags` command with the filter flags, plus `--set` (comma-separated list of replacement tags).

The command SHALL call `POST /api/v1/bulk/set-tags` with the constructed filter and `new_tags`.