- Reject duplicate uploads at the API boundary (HTTP 409) instead of silently skipping in the background worker. Checks both ingested documents and in-flight jobs via content_hash on the jobs table.
- Go client handles 409 with distinct messages for already-imported documents vs already-queued jobs.
- Sanitize FTS5 search queries by quoting each token to prevent syntax errors from special characters and operator keywords such as ?, *, ", (), AND, OR, NOT.
- Add try/except safety net around FTS5 execute for edge cases.
- Add main branch guard to release.sh to prevent releasing from feature branches.
- Update specs and README to reflect new behaviour.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Context
The engine currently accepts all uploads with HTTP 202, stages the file, creates a job record, and relies on the background worker to detect duplicates via SHA256 content hash. When a duplicate is found, the worker marks the job as skipped — but the user has already received a success response and must poll job status to discover the duplicate. This creates unnecessary I/O (staging), pollutes the job list, and provides poor UX.
The documents table already has a content_hash TEXT UNIQUE column, and database.hash_exists() already exists. The infrastructure for dedup is in place — it just runs too late in the pipeline.
Goals / Non-Goals
Goals:
- Reject duplicate uploads at the API boundary with HTTP 409 and useful context (existing document ID/title)
- Avoid staging files or creating job records for duplicates
- Apply to both file uploads and note submissions
- Keep the worker-side hash check as a race condition safety net
- Update the Go client to handle 409 and display a clear message
Non-Goals:
- Fuzzy/near-duplicate detection (e.g., same PDF with different metadata) — byte-identical only
- Changing the hash algorithm (SHA256 is fine)
- Adding a "force re-import" override flag (can be added later if needed)
- Dedup across different file formats with identical content (e.g., .md and .pdf of same text)
Decisions
1. Hash in the upload endpoint, before staging
Compute SHA256 from the uploaded bytes in submit_job() before calling stage_file(). This avoids writing to disk or creating a DB job record for duplicates.
Alternative considered: Hash after staging but before job creation. Rejected because it still wastes disk I/O for the staging write.
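The ordering in this decision can be sketched as follows. This is a minimal illustration, not the engine's actual code: `handle_upload`, `known_hashes`, and `staged` are hypothetical stand-ins for `submit_job()`, the database lookup, and `stage_file()`.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the hex-encoded SHA256 digest of the uploaded bytes."""
    return hashlib.sha256(data).hexdigest()


def handle_upload(data: bytes, known_hashes: set[str], staged: list[bytes]) -> int:
    """Hypothetical stand-in for submit_job(); returns an HTTP status code.

    known_hashes stands in for the database lookup and staged for the
    staging directory. Both are assumptions made for illustration.
    """
    digest = content_hash(data)
    if digest in known_hashes:
        # Duplicate: return before anything is staged or any job is created.
        return 409
    staged.append(data)       # stands in for stage_file()
    known_hashes.add(digest)  # stands in for the job insert recording the hash
    return 202
```

The point of the sketch is only that the duplicate branch returns before any write happens, so a rejected upload costs one hash computation and one lookup.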
2. Return HTTP 409 Conflict with context-dependent metadata
The 409 response includes {"error": "duplicate", ...} with a distinct shape depending on where the duplicate was found:
- Already-ingested document: {"error": "duplicate", "document_id": <id>, "title": "<title>"}
- In-flight job (queued/processing): {"error": "duplicate", "job_id": <id>, "title": "<filename>"}
This allows clients to distinguish between "this document already exists" and "this document is already being processed" and display appropriate messages.
Alternative considered: Return 200 with a "status": "duplicate" field. Rejected because 409 is the semantically correct status code and allows clients to distinguish duplicates from successful uploads without parsing the body.
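As a rough illustration of the two response shapes, the sketch below builds the 409 body on the server side and branches on it on the client side. The function names and message wording are hypothetical, and the real client is written in Go; Python is used here only to keep the examples in one language.

```python
import json


def duplicate_response(dup: dict) -> tuple[int, str]:
    """Build the (status, JSON body) pair for a detected duplicate.

    dup is assumed to carry "title" plus either "document_id" or "job_id".
    """
    body = {"error": "duplicate", "title": dup["title"]}
    key = "document_id" if "document_id" in dup else "job_id"
    body[key] = dup[key]
    return 409, json.dumps(body)


def client_message(status: int, raw_body: str) -> str:
    """Hypothetical client-side handling; the message text is illustrative."""
    if status != 409:
        return "upload accepted"
    body = json.loads(raw_body)
    if "document_id" in body:
        return f"already imported as document {body['document_id']}: {body['title']}"
    return f"already queued as job {body['job_id']}: {body['title']}"
```

The client can decide from the status code alone that the upload was not accepted; it only parses the body to choose between the two messages.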
3. New database helper: get_document_by_hash()
Returns a dict with duplicate info for a given hash, or None. Checks both the documents table (already ingested) and the jobs table (queued/processing), returning document_id or job_id accordingly. The content_hash column on the jobs table is populated at upload time to support this check. The boolean hash_exists() is retained for the worker safety net.
Alternative considered: Modify hash_exists() to return the document row. Rejected to avoid changing the worker's existing interface — keep changes minimal.
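A possible shape for this helper is sketched below, assuming SQLite (suggested by the FTS5 usage) and a minimal schema. Only `content_hash`, its UNIQUE constraint on `documents`, and the queued/processing/skipped states come from the text above; the other column names and the status spellings are assumptions.

```python
import sqlite3
from typing import Optional

# Assumed minimal schema, for illustration only.
SCHEMA = """
CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, content_hash TEXT UNIQUE);
CREATE TABLE jobs (id INTEGER PRIMARY KEY, filename TEXT, status TEXT, content_hash TEXT);
"""


def get_document_by_hash(conn: sqlite3.Connection, content_hash: str) -> Optional[dict]:
    """Return duplicate info for content_hash, or None if the hash is unknown."""
    # Already-ingested documents take precedence over in-flight jobs.
    row = conn.execute(
        "SELECT id, title FROM documents WHERE content_hash = ?",
        (content_hash,),
    ).fetchone()
    if row is not None:
        return {"document_id": row[0], "title": row[1]}

    # Only jobs that are still queued or processing count as duplicates.
    row = conn.execute(
        "SELECT id, filename FROM jobs "
        "WHERE content_hash = ? AND status IN ('queued', 'processing') "
        "ORDER BY id LIMIT 1",
        (content_hash,),
    ).fetchone()
    if row is not None:
        return {"job_id": row[0], "title": row[1]}
    return None
```

Filtering on job status means a job that already finished as skipped or failed does not block a later upload of the same bytes; whether that matches the intended behaviour is a design choice the sketch assumes rather than confirms.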
4. Retain worker-side dedup as safety net
The worker's hash_exists() check stays. In theory, two identical uploads could arrive in the same instant — both pass the API hash check before either commits. The jobs-table check closes most of this window (the hash is written at job creation), but a narrow race remains between the API check and the job insert. The UNIQUE constraint on documents.content_hash is the final backstop.
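The two layers of the safety net might look like the sketch below, again assuming SQLite. `insert_document`, `ingest`, and the "ingested" return value are hypothetical names; `hash_exists()` and the skipped outcome are taken from the text, though the real signature may differ.

```python
import sqlite3


def hash_exists(conn: sqlite3.Connection, content_hash: str) -> bool:
    """Boolean check; the real database.hash_exists() signature is an assumption."""
    row = conn.execute(
        "SELECT 1 FROM documents WHERE content_hash = ?", (content_hash,)
    ).fetchone()
    return row is not None


def insert_document(conn: sqlite3.Connection, title: str, content_hash: str) -> bool:
    """Insert a document; return False if the UNIQUE constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO documents (title, content_hash) VALUES (?, ?)",
                (title, content_hash),
            )
    except sqlite3.IntegrityError:
        return False
    return True


def ingest(conn: sqlite3.Connection, title: str, content_hash: str) -> str:
    """Hypothetical worker step: cheap pre-check first, constraint as backstop."""
    if hash_exists(conn, content_hash):
        return "skipped"
    if not insert_document(conn, title, content_hash):
        # Another worker committed the same hash between the check and the insert.
        return "skipped"
    return "ingested"
```

Catching the constraint violation turns the remaining race into an ordinary skipped job instead of an unhandled exception.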
5. Note dedup: hash the text content
For notes submitted via the note field, SHA256-hash the UTF-8 encoded text. This catches identical note resubmissions.
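A minimal sketch of the note hash follows; `note_hash` is a hypothetical name. Because the hash is computed over the encoded bytes, two notes that look the same but differ in trailing whitespace or Unicode normalization form are treated as different notes, which is consistent with the byte-identical-only scope stated in the non-goals.

```python
import hashlib
import unicodedata


def note_hash(text: str) -> str:
    """SHA256 over the UTF-8 encoding of the note text (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Visually identical strings with different code point sequences.
composed = "caf\u00e9"                                # precomposed e-acute
decomposed = unicodedata.normalize("NFD", composed)  # "e" plus combining accent

same_text_resubmitted = note_hash("meeting notes") == note_hash("meeting notes")
different_normalization = note_hash(composed) != note_hash(decomposed)
trailing_newline_differs = note_hash("meeting notes") != note_hash("meeting notes\n")
```

All three flags evaluate to True: an exact resubmission matches, while the normalization and trailing-newline variants do not.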
Risks / Trade-offs
- [Race condition window] Two identical files uploaded in the same millisecond could both pass the API hash check. → Mitigated by the worker-side hash_exists() check and the UNIQUE constraint. The second job would be skipped, not a crash.
- [Blocking I/O in async endpoint] SHA256 hashing is CPU-bound but fast (~5ms for 10MB). → Acceptable for the upload endpoint which already reads the full file into memory. No need for run_in_executor.
- [Client compatibility] Older clients not expecting 409 will see an error. → This is correct behavior: they'll see an HTTP error rather than silently accepting a duplicate. The Go client will be updated to handle it gracefully.