6fec627503
- Reject duplicate uploads at the API boundary (HTTP 409) instead of silently skipping in the background worker. Checks both ingested documents and in-flight jobs via content_hash on the jobs table.
- Go client handles 409 with distinct messages for already-imported documents vs already-queued jobs.
- Sanitize FTS5 search queries by quoting each token to prevent syntax errors from special characters like ?, *, ", (), AND, OR, NOT.
- Add try/except safety net around FTS5 execute for edge cases.
- Add main branch guard to release.sh to prevent releasing from feature branches.
- Update specs and README to reflect new behaviour.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Why
Duplicate document detection currently happens in the background worker — the upload endpoint always returns HTTP 202, and the user only discovers a duplicate later when the job status is skipped. This wastes staging I/O, creates noise in the job list, and gives poor user feedback. Moving the SHA256 content hash check to the upload endpoint allows immediate rejection with a clear error, preventing unnecessary work and giving the user instant feedback.
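As a rough illustration of the upload-time check, the sketch below hashes the uploaded bytes and looks the digest up before any job is created. It is not the project's implementation: the function `check_upload`, the helper `make_demo_db`, the table layouts, the job status values, and the response field names are all assumptions made for the example, with SQLite standing in for whatever store the service actually uses.

```python
import hashlib
import sqlite3


def check_upload(conn: sqlite3.Connection, data: bytes) -> tuple[int, dict]:
    """Return a hypothetical (HTTP status, response body) for an uploaded file."""
    content_hash = hashlib.sha256(data).hexdigest()

    # Already ingested? Reject with the existing document ID.
    row = conn.execute(
        "SELECT id FROM documents WHERE content_hash = ?", (content_hash,)
    ).fetchone()
    if row is not None:
        return 409, {"error": "duplicate_document", "document_id": row[0]}

    # Same content already in flight? Status names here are assumed.
    row = conn.execute(
        "SELECT id FROM jobs WHERE content_hash = ? "
        "AND status IN ('queued', 'processing')",
        (content_hash,),
    ).fetchone()
    if row is not None:
        return 409, {"error": "duplicate_job", "job_id": row[0]}

    # Not a duplicate: only now is a job record created (staging would follow).
    cur = conn.execute(
        "INSERT INTO jobs (content_hash, status) VALUES (?, 'queued')",
        (content_hash,),
    )
    conn.commit()
    return 202, {"job_id": cur.lastrowid}


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with the assumed minimal schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, content_hash TEXT UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, content_hash TEXT, status TEXT)"
    )
    return conn
```

Two concurrent uploads of the same file could both pass these lookups before either inserts a row, which is one reason a worker-side check can remain useful as a backstop.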
What Changes
- Compute content hash at upload time: The POST /api/v1/jobs endpoint will SHA256-hash the uploaded file bytes before staging and check against documents.content_hash
- Reject duplicates immediately: Return HTTP 409 Conflict with the existing document ID when a duplicate is detected, instead of accepting and later skipping
- No job created for duplicates: Duplicate uploads will not create a job record or stage a file
- Remove worker-side dedup: The background worker's hash_exists() check becomes redundant for the normal flow but should be retained as a safety net (race condition guard)
- Update Go client: Surface the 409 response with a clear message (e.g., "Already imported: