kb/openspec/specs/bulk-operations/spec.md
2026-04-04 22:50:19 +01:00

ADDED Requirements

Requirement: Common selection filter

All bulk engine endpoints SHALL accept a JSON body with the following optional selection fields, combined with AND logic:

  • document_ids (list of int) — match documents with these specific IDs
  • tags (list of str) — match documents that have ALL specified tags
  • doc_type (str) — match documents with this document type
  • from_id (int) — match documents with id >= this value
  • to_id (int) — match documents with id <= this value

At least one selection field MUST be present. If no selection fields are provided, the endpoint SHALL return 400 Bad Request.
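As an illustration of the AND semantics above, here is a minimal sketch that turns a selection filter into a SQL WHERE clause. The function name build_where, the table alias d, and the document_tags(document_id, tag) layout are hypothetical stand-ins, not the engine's actual schema. Treating an empty list as "not provided" is also an assumption; the spec does not say.

```python
from typing import Any

SELECTION_FIELDS = ("document_ids", "tags", "doc_type", "from_id", "to_id")


def build_where(body: dict[str, Any]) -> tuple[str, list[Any]]:
    """Translate a selection filter into a WHERE clause plus parameters.

    Raises ValueError when no selection field is present; an endpoint
    would map that to 400 Bad Request.
    """
    # Assumption: None and empty lists both count as "not provided".
    present = {f for f in SELECTION_FIELDS if body.get(f) not in (None, [])}
    if not present:
        raise ValueError("at least one selection field is required")

    clauses: list[str] = []
    params: list[Any] = []
    if "document_ids" in present:
        marks = ", ".join("?" for _ in body["document_ids"])
        clauses.append(f"d.id IN ({marks})")
        params.extend(body["document_ids"])
    if "tags" in present:
        # One EXISTS per tag: the document must carry ALL listed tags.
        for tag in body["tags"]:
            clauses.append(
                "EXISTS (SELECT 1 FROM document_tags dt "
                "WHERE dt.document_id = d.id AND dt.tag = ?)"
            )
            params.append(tag)
    if "doc_type" in present:
        clauses.append("d.doc_type = ?")
        params.append(body["doc_type"])
    if "from_id" in present:
        clauses.append("d.id >= ?")
        params.append(body["from_id"])
    if "to_id" in present:
        clauses.append("d.id <= ?")
        params.append(body["to_id"])
    return " AND ".join(clauses), params
```

The clause would be used as `SELECT d.id FROM documents d WHERE <clause>`. Note that force is not a selection field, so a body of {"force": true} alone still raises.
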

Scenario: Filter by tags and doc_type

  • WHEN a bulk endpoint receives {"tags": ["draft"], "doc_type": "note"}
  • THEN it SHALL match only documents that have the tag "draft" AND have doc_type "note"

Scenario: Filter by ID range

  • WHEN a bulk endpoint receives {"from_id": 10, "to_id": 50}
  • THEN it SHALL match documents with id >= 10 AND id <= 50

Scenario: Filter by explicit IDs

  • WHEN a bulk endpoint receives {"document_ids": [1, 5, 12]}
  • THEN it SHALL match only documents with those specific IDs

Scenario: Combined filters

  • WHEN a bulk endpoint receives {"tags": ["agent:mybot"], "doc_type": "note", "from_id": 100}
  • THEN it SHALL match documents satisfying ALL three criteria

Scenario: No selection fields provided

  • WHEN a bulk endpoint receives {} or {"force": true} with no selection fields
  • THEN it SHALL return 400 Bad Request

Requirement: Safety threshold

All bulk endpoints SHALL enforce a safety threshold. Before executing, the engine SHALL count the matched documents and the total documents in the database. If the matched percentage (matched / total * 100) is strictly greater than the configured threshold, the request SHALL be rejected with 409 Conflict.

The response SHALL include: error ("safety_threshold_exceeded"), message (human-readable), matched (int), total (int), percent (float), and threshold (int).

The threshold SHALL default to 70 and be configurable via the KB_BULK_SAFETY_PERCENT environment variable (integer 0-100). A value of 0 disables the check.

The caller MAY override the threshold by including "force": true in the request body.
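The check described above could be sketched as follows. The function name check_safety is hypothetical. Treating an empty database (total of 0) as "nothing to check" and rounding percent to two decimals are assumptions, since the spec does not cover either case.

```python
from typing import Optional


def check_safety(
    matched: int, total: int, threshold: int, force: bool = False
) -> Optional[dict]:
    """Return the 409 response body if the threshold is exceeded, else None."""
    # force overrides the check; a threshold of 0 disables it.
    # Assumption: an empty database (total == 0) never trips the check.
    if force or threshold == 0 or total == 0:
        return None
    # Integer form of "matched / total * 100 > threshold", which avoids
    # floating-point surprises exactly at the boundary.
    if matched * 100 <= threshold * total:
        return None
    percent = round(matched / total * 100, 2)
    return {
        "error": "safety_threshold_exceeded",
        "message": (
            f"{matched} of {total} documents ({percent}%) matched, "
            f"which exceeds the {threshold}% safety threshold"
        ),
        "matched": matched,
        "total": total,
        "percent": percent,
        "threshold": threshold,
    }
```

A match of exactly 70% with a threshold of 70 passes, because the rule is "exceeds", not "reaches".
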

Scenario: Threshold exceeded

  • GIVEN 1000 total documents and KB_BULK_SAFETY_PERCENT is 70
  • WHEN a bulk endpoint matches 750 documents (75%) without force: true
  • THEN it SHALL return 409 with matched: 750, total: 1000, percent: 75.0, threshold: 70

Scenario: Threshold not exceeded

  • GIVEN 1000 total documents and KB_BULK_SAFETY_PERCENT is 70
  • WHEN a bulk endpoint matches 500 documents (50%) without force: true
  • THEN the operation SHALL proceed normally

Scenario: Force override

  • GIVEN 1000 total documents and a match of 900 (90%)
  • WHEN the request includes "force": true
  • THEN the operation SHALL proceed regardless of threshold

Scenario: Zero threshold

  • GIVEN KB_BULK_SAFETY_PERCENT is 0
  • THEN the safety check SHALL be disabled for all operations, regardless of how many documents match

Requirement: Synchronous response with audit log

All bulk endpoints SHALL execute synchronously and return a JSON response with:

  • job_id (int) — ID of the audit log entry in the jobs table
  • status (str) — "done" or "partial_failure"
  • matched (int) — number of documents that matched the selection
  • succeeded (int) — number of documents successfully processed
  • failed (int) — number of documents that failed
  • errors (list) — array of {"document_id": int, "error": str} for each failure (empty on full success)

A job record SHALL be created in the jobs table with job_type set to the operation type. The filename field SHALL store a JSON representation of the selection filter. The error field SHALL store a JSON array of individual errors if any occurred.
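To make the response and audit record shapes concrete, here is a small sketch. The helper names build_response and build_job_row are hypothetical, and storing NULL in the error field when nothing failed is an assumption.

```python
import json
from typing import Any, Optional


def build_response(job_id: int, matched: int, errors: list[dict]) -> dict[str, Any]:
    """Assemble the synchronous JSON response for a bulk operation."""
    failed = len(errors)
    return {
        "job_id": job_id,
        "status": "partial_failure" if failed else "done",
        "matched": matched,
        "succeeded": matched - failed,
        "failed": failed,
        "errors": errors,
    }


def build_job_row(
    job_type: str, selection: dict, errors: list[dict]
) -> dict[str, Optional[str]]:
    """Assemble the values written to the jobs table for the audit entry."""
    return {
        "job_type": job_type,
        # The filename column is reused to hold the selection filter as JSON.
        "filename": json.dumps(selection, sort_keys=True),
        # Assumption: NULL when no individual errors occurred.
        "error": json.dumps(errors) if errors else None,
    }
```
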

Scenario: Full success

  • WHEN a bulk operation matches 50 documents and all succeed
  • THEN the response SHALL have status: "done", matched: 50, succeeded: 50, failed: 0, errors: []

Scenario: Partial failure

  • WHEN a bulk operation matches 50 documents but 2 fail
  • THEN the response SHALL have status: "partial_failure", matched: 50, succeeded: 48, failed: 2, and errors listing the 2 failures

Requirement: Bulk delete endpoint

The engine SHALL expose POST /api/v1/bulk/delete which permanently deletes all documents matching the selection filter. For each matched document, it SHALL delete embeddings from chunks_vec, delete the document row (cascading to chunks and document_tags), and delete any stored file from disk.

Database deletions SHALL be performed within a single transaction. File deletions SHALL occur after the transaction commits and SHALL be best-effort (failures logged but not counted as document failures).
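The ordering above (one transaction for the database, then best-effort file cleanup) might look like the sketch below. It assumes SQLite via Python's sqlite3 module, a file_path column on documents, and a chunk_id column on chunks_vec; all three are illustrative assumptions, and the real chunks_vec table may be keyed differently. Cascades to chunks rely on foreign keys being enabled on the connection.

```python
import logging
import os
import sqlite3

log = logging.getLogger("kb.bulk")


def bulk_delete(conn: sqlite3.Connection, doc_ids: list[int]) -> int:
    """Delete documents in one transaction, then remove their files."""
    marks = ", ".join("?" for _ in doc_ids)
    # Collect file paths first; the rows are gone after the delete.
    paths = [
        row[0]
        for row in conn.execute(
            f"SELECT file_path FROM documents WHERE id IN ({marks})", doc_ids
        )
        if row[0]
    ]
    with conn:  # commits on success, rolls back if an exception escapes
        # Embeddings are removed explicitly before the document rows.
        conn.execute(
            "DELETE FROM chunks_vec WHERE chunk_id IN "
            f"(SELECT id FROM chunks WHERE document_id IN ({marks}))",
            doc_ids,
        )
        deleted = conn.execute(
            f"DELETE FROM documents WHERE id IN ({marks})", doc_ids
        ).rowcount
    # Best effort: a failed file removal is logged, not counted as a failure.
    for path in paths:
        try:
            os.remove(path)
        except OSError as exc:
            log.warning("could not remove %s: %s", path, exc)
    return deleted
```

Deleting files only after the commit means a rolled-back transaction never leaves documents pointing at missing files; the cost is that a crash between commit and cleanup can leave orphaned files on disk.
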

Scenario: Bulk delete by tag

  • WHEN POST /api/v1/bulk/delete receives {"tags": ["old", "draft"]}
  • THEN all documents with both tags "old" and "draft" SHALL be deleted
  • AND their chunks, embeddings, tag associations, and stored files SHALL be removed

Scenario: Bulk delete with no matches

  • WHEN POST /api/v1/bulk/delete receives a filter that matches 0 documents
  • THEN the response SHALL have matched: 0, succeeded: 0, failed: 0

Requirement: Bulk tags endpoint

The engine SHALL expose POST /api/v1/bulk/tags which adds and/or removes tags on all documents matching the selection filter. The request body SHALL include the selection filter plus:

  • add (list of str, optional) — tags to add
  • remove (list of str, optional) — tags to remove

At least one of add or remove MUST be present. The endpoint SHALL return 400 if neither is provided.

The endpoint SHALL update updated_at on all affected documents.
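The per-document effect of add and remove can be sketched with plain sets. The function name apply_tag_changes is hypothetical. Applying removals before additions (so a tag listed in both ends up present) is an assumption; the spec does not define that case.

```python
from typing import Iterable, Optional


def apply_tag_changes(
    current: set[str],
    add: Optional[Iterable[str]] = None,
    remove: Optional[Iterable[str]] = None,
) -> set[str]:
    """Return the tag set after applying remove, then add."""
    if add is None and remove is None:
        # An endpoint would map this to 400 Bad Request.
        raise ValueError("at least one of add or remove is required")
    # Assumption: removals are applied first, so add wins on overlap.
    return (current - set(remove or ())) | set(add or ())
```
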

Scenario: Add and remove tags in one call

  • WHEN POST /api/v1/bulk/tags receives {"tags": ["agent:mybot"], "add": ["reviewed"], "remove": ["pending"]}
  • THEN all documents tagged "agent:mybot" SHALL have "reviewed" added and "pending" removed

Requirement: Bulk set-tags endpoint

The engine SHALL expose POST /api/v1/bulk/set-tags which replaces all tags on matched documents with a new set. The request body SHALL include the selection filter plus:

  • new_tags (list of str) — the replacement tag set

The endpoint SHALL remove all existing tag associations from matched documents, then apply the new set. It SHALL update updated_at on all affected documents.
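A rough sketch of the replace step, assuming SQLite and a simplified document_tags(document_id, tag) table. Both are illustrative assumptions; the real schema may route tags through a separate tags table.

```python
import sqlite3


def set_tags(conn: sqlite3.Connection, doc_ids: list[int], new_tags: list[str]) -> None:
    """Replace every tag on the given documents with new_tags."""
    with conn:  # one transaction for the whole batch
        for doc_id in doc_ids:
            # Drop all existing associations, then apply the new set.
            conn.execute(
                "DELETE FROM document_tags WHERE document_id = ?", (doc_id,)
            )
            conn.executemany(
                "INSERT INTO document_tags (document_id, tag) VALUES (?, ?)",
                [(doc_id, tag) for tag in new_tags],
            )
            conn.execute(
                "UPDATE documents SET updated_at = CURRENT_TIMESTAMP WHERE id = ?",
                (doc_id,),
            )
```
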

Scenario: Replace all tags

  • WHEN POST /api/v1/bulk/set-tags receives {"doc_type": "note", "new_tags": ["clean", "final"]}
  • THEN all notes SHALL have their existing tags removed and replaced with "clean" and "final"

Requirement: Jobs table extension

The jobs table SHALL be extended with a job_type column (TEXT, default "ingest") to distinguish ingestion jobs from bulk operation audit entries. Valid values: "ingest", "bulk_delete", "bulk_tags", "bulk_set_tags".

Existing jobs SHALL default to job_type = "ingest". The existing jobs list endpoint and CLI kb jobs command SHALL continue to work unchanged.

Scenario: Migration adds column

  • GIVEN an existing database without the job_type column
  • WHEN the engine starts
  • THEN the column SHALL be added with default value "ingest"
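The migration scenario above could be implemented as an idempotent startup check. This sketch assumes the database is SQLite and is accessed through Python's sqlite3 module; the function name ensure_job_type_column is hypothetical.

```python
import sqlite3


def ensure_job_type_column(conn: sqlite3.Connection) -> bool:
    """Add jobs.job_type if it is missing. Returns True when it was added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(jobs)")}
    if "job_type" in columns:
        return False
    # Existing rows read back the default, so old jobs become "ingest".
    conn.execute("ALTER TABLE jobs ADD COLUMN job_type TEXT DEFAULT 'ingest'")
    conn.commit()
    return True
```

Because the function checks for the column first, calling it on every engine start is safe.
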

Requirement: Engine config for safety threshold

The engine Config class SHALL read KB_BULK_SAFETY_PERCENT from the environment as an integer (default 70, range 0-100). This value SHALL be used as the default safety threshold for all bulk endpoints.
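One way the Config class might read this value is sketched below. The function name read_bulk_safety_percent is hypothetical, and rejecting out-of-range or non-integer values with ValueError is an assumption; the spec does not say whether invalid values should fail, be clamped, or fall back to the default.

```python
import os
from typing import Mapping, Optional


def read_bulk_safety_percent(environ: Optional[Mapping[str, str]] = None) -> int:
    """Read KB_BULK_SAFETY_PERCENT as an integer in 0..100 (default 70)."""
    env = os.environ if environ is None else environ
    raw = env.get("KB_BULK_SAFETY_PERCENT", "70")
    value = int(raw)  # raises ValueError for non-integer input
    # Assumption: out-of-range values are a configuration error.
    if not 0 <= value <= 100:
        raise ValueError(f"KB_BULK_SAFETY_PERCENT must be 0-100, got {value}")
    return value
```
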

Requirement: MCP bulk delete tool

The MCP server SHALL expose a kb_bulk_delete tool with parameters: document_ids (optional list of int), tags (optional list of str), doc_type (optional str), from_id (optional int), to_id (optional int), force (optional bool).

The tool SHALL call POST /api/v1/bulk/delete on the engine via the engine client and return the JSON response.

The tool description SHALL clearly state that tags is a selection filter (which documents to delete), not tags to delete.

Scenario: MCP bulk delete by tag

  • WHEN kb_bulk_delete(tags=["old"]) is called
  • THEN the engine client SHALL send POST /api/v1/bulk/delete with {"tags": ["old"]}
  • AND the tool SHALL return the engine's JSON response

Requirement: MCP bulk tags tool

The MCP server SHALL expose a kb_bulk_tags tool with parameters: document_ids, tags, doc_type, from_id, to_id (selection filters), plus add (optional list of str), remove (optional list of str), and force (optional bool).

The tool description SHALL clearly distinguish tags (selection filter) from add/remove (tag changes to apply).

Scenario: MCP bulk tag update

  • WHEN kb_bulk_tags(tags=["agent:mybot"], add=["reviewed"], remove=["draft"]) is called
  • THEN the engine client SHALL send the appropriate POST /api/v1/bulk/tags request

Requirement: MCP bulk set-tags tool

The MCP server SHALL expose a kb_bulk_set_tags tool with parameters: document_ids, tags, doc_type, from_id, to_id (selection filters), plus new_tags (list of str) and force (optional bool).

Scenario: MCP bulk set tags

  • WHEN kb_bulk_set_tags(doc_type="note", new_tags=["clean"]) is called
  • THEN the engine client SHALL send POST /api/v1/bulk/set-tags with {"doc_type": "note", "new_tags": ["clean"]}

Requirement: MCP engine client bulk methods

The MCP engine client (mcp/engine.py) SHALL provide three new methods:

  • bulk_delete(document_ids?, tags?, doc_type?, from_id?, to_id?, force?) → dict
  • bulk_tags(document_ids?, tags?, doc_type?, from_id?, to_id?, add?, remove?, force?) → dict
  • bulk_set_tags(document_ids?, tags?, doc_type?, from_id?, to_id?, new_tags, force?) → dict

Each SHALL send a POST request to the corresponding /api/v1/bulk/* endpoint with the parameters as a JSON body. Each SHALL raise on non-2xx status codes, consistent with existing methods.
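A sketch of how one of these methods could build its request body. The post parameter is a hypothetical stand-in for whatever HTTP helper the existing client methods use, and it is assumed to raise on non-2xx responses. Omitting unset parameters from the body matches the MCP scenario in which kb_bulk_delete(tags=["old"]) sends {"tags": ["old"]}.

```python
from typing import Any, Callable, Optional


def build_payload(**params: Any) -> dict[str, Any]:
    """Keep only the parameters the caller actually supplied."""
    return {key: value for key, value in params.items() if value is not None}


def bulk_delete(
    post: Callable[[str, dict], dict],  # hypothetical HTTP helper
    document_ids: Optional[list[int]] = None,
    tags: Optional[list[str]] = None,
    doc_type: Optional[str] = None,
    from_id: Optional[int] = None,
    to_id: Optional[int] = None,
    force: Optional[bool] = None,
) -> dict:
    """POST the selection filter to the bulk delete endpoint."""
    payload = build_payload(
        document_ids=document_ids, tags=tags, doc_type=doc_type,
        from_id=from_id, to_id=to_id, force=force,
    )
    return post("/api/v1/bulk/delete", payload)
```
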

Requirement: CLI bulk-remove command

The CLI SHALL expose a kb bulk-remove command with flags: --tags (comma-separated), --type, --ids (comma-separated), --from-id, --to-id, --force/-f, --yes/-y.

Without --yes, the CLI SHALL first display the match count and ask for interactive confirmation before proceeding.

The command SHALL call POST /api/v1/bulk/delete with the constructed filter.
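The mapping from flags to the request body might be sketched as below. Using argparse is an assumption, as the spec does not name the CLI framework; the helper names are hypothetical, and the confirmation prompt is left out.

```python
import argparse
from typing import Any


def split_csv(value: str) -> list[str]:
    """Split a comma-separated flag value, dropping empty items."""
    return [part.strip() for part in value.split(",") if part.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="kb bulk-remove")
    parser.add_argument("--tags", type=split_csv)
    parser.add_argument("--type", dest="doc_type")
    parser.add_argument("--ids", type=lambda v: [int(x) for x in split_csv(v)])
    parser.add_argument("--from-id", type=int)
    parser.add_argument("--to-id", type=int)
    parser.add_argument("--force", "-f", action="store_true")
    parser.add_argument("--yes", "-y", action="store_true")
    return parser


def build_filter(args: argparse.Namespace) -> dict[str, Any]:
    """Build the JSON body for POST /api/v1/bulk/delete from parsed flags."""
    body = {
        "tags": args.tags,
        "doc_type": args.doc_type,
        "document_ids": args.ids,
        "from_id": args.from_id,
        "to_id": args.to_id,
    }
    body = {key: value for key, value in body.items() if value is not None}
    if args.force:
        body["force"] = True
    return body
```

The --yes flag only controls the interactive prompt, so it is deliberately not part of the request body.
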

Scenario: CLI bulk remove with confirmation

  • WHEN kb bulk-remove --tags "draft,old" --type note is run without --yes
  • THEN the CLI SHALL display "This will delete N documents matching: tags=[draft,old] type=note" and prompt "Proceed? [y/N]"

Scenario: CLI bulk remove with --yes

  • WHEN kb bulk-remove --tags "draft" --yes is run
  • THEN the CLI SHALL proceed without prompting

Requirement: CLI bulk-tag command

The CLI SHALL expose a kb bulk-tag command with the same filter flags as bulk-remove, plus --add and --remove (comma-separated tag lists).

The command SHALL call POST /api/v1/bulk/tags with the constructed filter and tag changes.

Requirement: CLI bulk-set-tags command

The CLI SHALL expose a kb bulk-set-tags command with the same filter flags as bulk-remove, plus --set (comma-separated list of replacement tags).

The command SHALL call POST /api/v1/bulk/set-tags with the constructed filter and new_tags.