kb/openspec/changes/archive/2026-03-26-upload-time-dedup-check/specs/engine-api/spec.md
steve 6fec627503 Upload-time duplicate detection, FTS5 query sanitization, release guard
- Reject duplicate uploads at the API boundary (HTTP 409) instead of
  silently skipping in the background worker. Checks both ingested
  documents and in-flight jobs via content_hash on the jobs table.
- Go client handles 409 with distinct messages for already-imported
  documents vs already-queued jobs.
- Sanitize FTS5 search queries by quoting each token to prevent syntax
  errors from special characters like ?, *, ", (), AND, OR, NOT.
- Add try/except safety net around FTS5 execute for edge cases.
- Add main branch guard to release.sh to prevent releasing from
  feature branches.
- Update specs and README to reflect new behaviour.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 23:05:07 +00:00

MODIFIED Requirements

Requirement: Async ingestion via job queue

The engine SHALL accept file uploads and text notes for ingestion asynchronously. Before staging, the engine SHALL compute a SHA256 hash of the uploaded content and reject duplicates immediately with HTTP 409. Content that passes the duplicate check SHALL be written to a staging area and a job record created in the database, and the engine SHALL then return HTTP 202 without waiting for ingestion to complete. A background worker SHALL process queued jobs sequentially.

Scenario: Upload a PDF file

  • WHEN a client sends POST /api/v1/jobs with a multipart form containing a PDF file and optional fields (tags, doc_type)
  • THEN the engine SHALL compute the SHA256 hash of the file bytes, verify that no ingested document and no queued or processing job has the same hash, write the file to the staging directory, create a job record with status queued, and return HTTP 202 with {"job_id": "<id>", "status": "queued", "filename": "report.pdf"}

Scenario: Upload a text note

  • WHEN a client sends POST /api/v1/jobs with a multipart form containing a note text field and optional title field
  • THEN the engine SHALL compute the SHA256 hash of the note text (UTF-8 encoded), verify that no ingested document and no queued or processing job has the same hash, write the note content to a staging file, create a job record with status queued, and return HTTP 202 with the job ID
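
The hashing step in the two upload scenarios above can be sketched as follows. This is a minimal illustration using Python's standard `hashlib`; the function names `content_hash_for_file` and `content_hash_for_note` are hypothetical and not taken from the engine's code. The hex-digest representation is also an assumption, since the spec does not say how the hash is stored.

```python
import hashlib


def content_hash_for_file(data: bytes) -> str:
    """Return the SHA256 hex digest of raw file bytes (hypothetical helper)."""
    return hashlib.sha256(data).hexdigest()


def content_hash_for_note(text: str) -> str:
    """Return the SHA256 hex digest of note text, encoded as UTF-8 first."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Because notes are encoded as UTF-8 before hashing, a note and a file with byte-identical content produce the same hash under this sketch.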

Scenario: Upload multiple files in sequence

  • WHEN a client sends multiple POST /api/v1/jobs requests in quick succession
  • THEN the engine SHALL queue each job independently and the background worker SHALL process them in FIFO order
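
One way to get the FIFO ordering described above is to claim the queued job with the lowest auto-incrementing ID. The sketch below assumes a SQLite `jobs` table with `id`, `filename`, and `status` columns; the schema, the helper names, and the single-worker assumption are illustrative rather than taken from the engine.

```python
import sqlite3
from typing import Optional, Tuple


def create_jobs_table(conn: sqlite3.Connection) -> None:
    # Assumed schema: AUTOINCREMENT ids give a stable arrival order.
    conn.execute(
        "CREATE TABLE jobs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " filename TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'queued')"
    )


def enqueue(conn: sqlite3.Connection, filename: str) -> int:
    """Insert a queued job and return its id."""
    cur = conn.execute("INSERT INTO jobs (filename) VALUES (?)", (filename,))
    conn.commit()
    return cur.lastrowid


def claim_next_job(conn: sqlite3.Connection) -> Optional[Tuple[int, str]]:
    """Mark the oldest queued job as processing and return (id, filename)."""
    row = conn.execute(
        "SELECT id, filename FROM jobs WHERE status = 'queued' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None  # nothing left to process
    conn.execute("UPDATE jobs SET status = 'processing' WHERE id = ?", (row[0],))
    conn.commit()
    return row
```

With a single sequential worker, as the requirement describes, the select-then-update pair needs no extra locking; multiple workers would need an atomic claim instead.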

Scenario: Duplicate file detected at upload time (already ingested)

  • WHEN a client uploads a file whose SHA256 content hash matches an already-ingested document
  • THEN the engine SHALL NOT stage the file or create a job record, and SHALL return HTTP 409 with {"error": "duplicate", "document_id": <id>, "title": "<title>"}

Scenario: Duplicate file detected at upload time (in-flight job)

  • WHEN a client uploads a file whose SHA256 content hash matches a queued or processing job
  • THEN the engine SHALL NOT stage the file or create a job record, and SHALL return HTTP 409 with {"error": "duplicate", "job_id": <id>, "title": "<filename>"}

Scenario: Duplicate note detected at upload time (already ingested)

  • WHEN a client submits a note whose SHA256 content hash matches an already-ingested document
  • THEN the engine SHALL NOT stage the note or create a job record, and SHALL return HTTP 409 with {"error": "duplicate", "document_id": <id>, "title": "<title>"}

Scenario: Duplicate note detected at upload time (in-flight job)

  • WHEN a client submits a note whose SHA256 content hash matches a queued or processing job
  • THEN the engine SHALL NOT stage the note or create a job record, and SHALL return HTTP 409 with {"error": "duplicate", "job_id": <id>, "title": "<filename>"}
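
The four duplicate scenarios above share one lookup: check ingested documents first, then in-flight jobs, and build the HTTP 409 body from whichever matches. A minimal sketch, assuming SQLite tables named `documents` and `jobs` that each carry a `content_hash` column (the commit message confirms this only for the jobs table; everything else in the schema and the name `find_duplicate` are hypothetical):

```python
import sqlite3
from typing import Optional

# Assumed schema for illustration only.
SCHEMA = """
CREATE TABLE documents (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    content_hash TEXT NOT NULL UNIQUE
);
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filename TEXT NOT NULL,
    status TEXT NOT NULL,
    content_hash TEXT NOT NULL
);
"""


def find_duplicate(conn: sqlite3.Connection, content_hash: str) -> Optional[dict]:
    """Return the HTTP 409 response body for a duplicate, or None if unique."""
    doc = conn.execute(
        "SELECT id, title FROM documents WHERE content_hash = ?",
        (content_hash,),
    ).fetchone()
    if doc is not None:
        return {"error": "duplicate", "document_id": doc[0], "title": doc[1]}

    # Only queued or processing jobs count as in-flight.
    job = conn.execute(
        "SELECT id, filename FROM jobs"
        " WHERE content_hash = ? AND status IN ('queued', 'processing')"
        " ORDER BY id LIMIT 1",
        (content_hash,),
    ).fetchone()
    if job is not None:
        return {"error": "duplicate", "job_id": job[0], "title": job[1]}
    return None
```

A caller would run this before staging and, on a non-None result, return HTTP 409 with that body without writing the file or creating a job record.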

Scenario: Duplicate uploaded during concurrent request handling

  • WHEN two identical files are uploaded at effectively the same time, so that both pass the API hash check before either job is committed
  • THEN both jobs SHALL be queued, and the background worker SHALL process the first normally and mark the second as skipped. The worker-side hash_exists() check and the UNIQUE constraint act as the safety net for this race.
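
The worker-side safety net for this race might look like the sketch below. `hash_exists()` and the `skipped` status are named in the spec; the table layout, the placement of the UNIQUE constraint on `documents.content_hash`, the `done` status, and the `ingest_job` helper are assumptions made for illustration.

```python
import sqlite3

# Assumed schema: the UNIQUE constraint is the last line of defence.
WORKER_SCHEMA = """
CREATE TABLE documents (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    content_hash TEXT NOT NULL UNIQUE
);
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    status TEXT NOT NULL DEFAULT 'queued'
);
"""


def hash_exists(conn: sqlite3.Connection, content_hash: str) -> bool:
    """True if a document with this content hash is already ingested."""
    row = conn.execute(
        "SELECT 1 FROM documents WHERE content_hash = ? LIMIT 1",
        (content_hash,),
    ).fetchone()
    return row is not None


def ingest_job(conn: sqlite3.Connection, job_id: int, title: str, content_hash: str) -> str:
    """Ingest one job, or mark it skipped if its content already exists."""
    if hash_exists(conn, content_hash):
        status = "skipped"
    else:
        try:
            conn.execute(
                "INSERT INTO documents (title, content_hash) VALUES (?, ?)",
                (title, content_hash),
            )
            status = "done"  # hypothetical terminal status name
        except sqlite3.IntegrityError:
            # UNIQUE constraint caught a duplicate the check above missed.
            status = "skipped"
    conn.execute("UPDATE jobs SET status = ? WHERE id = ?", (status, job_id))
    conn.commit()
    return status
```

The explicit check handles the common case cheaply, while the constraint guarantees that no duplicate row is ever stored even if the check is bypassed.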

Scenario: Upload failure due to unsupported file type

  • WHEN a client uploads a file with an unsupported extension
  • THEN the engine SHALL return HTTP 422 with an error message listing supported types
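
The extension check could be sketched as below. The spec does not list the supported types, so `SUPPORTED_EXTENSIONS` is a placeholder set, and the helper name and error-body shape are likewise hypothetical.

```python
from pathlib import Path
from typing import Optional, Tuple

# Placeholder list: the real set of supported types is not given in the spec.
SUPPORTED_EXTENSIONS = (".md", ".pdf", ".txt")


def unsupported_type_error(filename: str) -> Optional[Tuple[int, dict]]:
    """Return (422, body) for an unsupported extension, or None if accepted."""
    ext = Path(filename).suffix.lower()  # case-insensitive comparison
    if ext in SUPPORTED_EXTENSIONS:
        return None
    supported = ", ".join(SUPPORTED_EXTENSIONS)
    message = f"unsupported file type '{ext}'; supported types: {supported}"
    return 422, {"error": message}
```

Running this check before hashing and staging keeps unsupported uploads from touching the staging area at all.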