steve 6fec627503 Upload-time duplicate detection, FTS5 query sanitization, release guard
- Reject duplicate uploads at the API boundary (HTTP 409) instead of
  silently skipping in the background worker. Checks both ingested
  documents and in-flight jobs via content_hash on the jobs table.
- Go client handles 409 with distinct messages for already-imported
  documents vs already-queued jobs.
- Sanitize FTS5 search queries by quoting each token to prevent syntax
  errors from special characters like ?, *, ", (), AND, OR, NOT.
- Add try/except safety net around FTS5 execute for edge cases.
- Add main branch guard to release.sh to prevent releasing from
  feature branches.
- Update specs and README to reflect new behaviour.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 23:05:07 +00:00

Why

Duplicate document detection currently happens in the background worker: the upload endpoint always returns HTTP 202, and the user only discovers a duplicate later, when the job status changes to skipped. This wastes staging I/O, adds noise to the job list, and delays feedback to the user. Moving the SHA-256 content hash check to the upload endpoint lets the API reject a duplicate immediately with a clear error, before any unnecessary work is done.
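A minimal sketch of the upload-time check, assuming a SQLite database with a documents table that has id and content_hash columns. The function name handle_upload, the response shape, and the schema details are hypothetical and only illustrate the idea; the real endpoint may be structured differently.

```python
import hashlib
import sqlite3


def handle_upload(conn: sqlite3.Connection, data: bytes) -> tuple[int, dict]:
    """Return (HTTP status, response body) for an uploaded file.

    Hypothetical helper: the schema and response shape are assumptions.
    """
    # Hash the raw bytes before anything is staged.
    content_hash = hashlib.sha256(data).hexdigest()
    row = conn.execute(
        "SELECT id FROM documents WHERE content_hash = ?", (content_hash,)
    ).fetchone()
    if row is not None:
        # Duplicate: reject before staging the file or creating a job.
        return 409, {"error": "duplicate", "document_id": row[0]}
    # Not a duplicate: a real endpoint would stage the file and create a job here.
    return 202, {"content_hash": content_hash}


# Small in-memory demo with one already-ingested document.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, content_hash TEXT UNIQUE)"
)
conn.execute(
    "INSERT INTO documents (content_hash) VALUES (?)",
    (hashlib.sha256(b"hello").hexdigest(),),
)

dup_status, dup_body = handle_upload(conn, b"hello")
new_status, new_body = handle_upload(conn, b"something else")
```

In this sketch the duplicate upload yields status 409 along with the existing document ID, while new content falls through to the normal 202 path.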

What Changes

  • Compute content hash at upload time: The POST /api/v1/jobs endpoint will compute a SHA-256 hash of the uploaded file bytes before staging and check it against documents.content_hash
  • Reject duplicates immediately: Return HTTP 409 Conflict with the existing document ID when a duplicate is detected, instead of accepting and later skipping
  • No job created for duplicates: Duplicate uploads will not create a job record or stage a file
  • Keep worker-side dedup as a safety net: The background worker's hash_exists() check becomes redundant in the normal flow, but should be retained to guard against race conditions (e.g., two identical uploads arriving at the same time)
  • Update Go client: Surface the 409 response with a clear message (e.g., "Already imported: