6fec627503
- Reject duplicate uploads at the API boundary (HTTP 409) instead of silently skipping in the background worker. Checks both ingested documents and in-flight jobs via content_hash on the jobs table. - Go client handles 409 with distinct messages for already-imported documents vs already-queued jobs. - Sanitize FTS5 search queries by quoting each token to prevent syntax errors from special characters like ?, *, ", (), AND, OR, NOT. - Add try/except safety net around FTS5 execute for edge cases. - Add main branch guard to release.sh to prevent releasing from feature branches. - Update specs and README to reflect new behaviour. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
59 lines
4.3 KiB
Markdown
## Context

The engine currently accepts every upload with HTTP 202, stages the file, creates a job record, and relies on the background worker to detect duplicates via the SHA256 content hash. When the worker finds a duplicate, it marks the job as `skipped`, but by then the user has already received a success response and must poll job status to discover the duplicate. This wastes I/O on staging, pollutes the job list, and makes for poor UX.

The `documents` table already has a `content_hash TEXT UNIQUE` column, and `database.hash_exists()` already exists. The infrastructure for dedup is in place; it just runs too late in the pipeline.
## Goals / Non-Goals

**Goals:**
- Reject duplicate uploads at the API boundary with HTTP 409 and useful context (ID and title of the existing document or in-flight job)
- Avoid staging files or creating job records for duplicates
- Apply to both file uploads and note submissions
- Keep the worker-side hash check as a race condition safety net
- Update the Go client to handle 409 and display a clear message

**Non-Goals:**
- Fuzzy/near-duplicate detection (e.g., same PDF with different metadata); only byte-identical content is detected
- Changing the hash algorithm (SHA256 is fine)
- Adding a "force re-import" override flag (can be added later if needed)
- Dedup across different file formats with identical content (e.g., .md and .pdf of the same text)
## Decisions

### 1. Hash in the upload endpoint, before staging

Compute SHA256 from the uploaded bytes in `submit_job()` before calling `stage_file()`. This avoids writing to disk or creating a DB job record for duplicates.

**Alternative considered**: Hash after staging but before job creation. Rejected because it still wastes disk I/O for the staging write.
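The ordering described in this decision can be sketched as follows. This is a minimal illustration, not the engine's actual `submit_job()`: the function name `submit_upload`, the `known_hashes` set, and the `stage_file` callable are hypothetical stand-ins for the real endpoint, database lookup, and staging helper.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA256 digest of the raw uploaded bytes."""
    return hashlib.sha256(data).hexdigest()


def submit_upload(data, filename, known_hashes, stage_file):
    """Hypothetical stand-in for submit_job(): hash first, stage only if new.

    known_hashes and stage_file are placeholders for the real database
    lookup and staging helper.
    """
    digest = sha256_hex(data)
    if digest in known_hashes:
        # Duplicate: return before any disk write or job record.
        return 409
    stage_file(filename, data)
    known_hashes.add(digest)
    return 202
```

Because the duplicate branch returns before `stage_file` is called, a rejected upload costs one in-memory hash and one lookup.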
### 2. Return HTTP 409 Conflict with context-dependent metadata

The 409 response includes `{"error": "duplicate", ...}` with a distinct shape depending on where the duplicate was found:
- **Already-ingested document**: `{"error": "duplicate", "document_id": <id>, "title": "<title>"}`
- **In-flight job (queued/processing)**: `{"error": "duplicate", "job_id": <id>, "title": "<filename>"}`

This allows clients to distinguish between "this document already exists" and "this document is already being processed" and display appropriate messages.

**Alternative considered**: Return 200 with a `"status": "duplicate"` field. Rejected because 409 is the semantically correct status code and allows clients to distinguish duplicates from successful uploads without parsing the body.
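A rough sketch of the two response shapes, and of how a client might pick a message from them, is shown here. Both function names are hypothetical, the message wording is illustrative only, and the real client is written in Go; Python is used purely to show the branching.

```python
def duplicate_response(match):
    """Build the (status, body) pair for a duplicate upload.

    `match` is assumed to carry either "document_id" or "job_id",
    plus "title", mirroring the two shapes described above.
    """
    body = {"error": "duplicate", "title": match["title"]}
    if "document_id" in match:
        body["document_id"] = match["document_id"]
    else:
        body["job_id"] = match["job_id"]
    return 409, body


def describe_duplicate(body):
    """Hypothetical client-side message selection based on the body shape."""
    if "document_id" in body:
        return f"already imported as document {body['document_id']}: {body['title']}"
    return f"already queued as job {body['job_id']}: {body['title']}"
```

The presence of `document_id` versus `job_id` is the only thing the client needs to inspect to choose between the two messages.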
### 3. New database helper: `get_document_by_hash()`

Returns a dict with duplicate info for a given hash, or `None` when the hash is unknown. Checks both the `documents` table (already ingested) and the `jobs` table (queued/processing), returning `document_id` or `job_id` accordingly. The `content_hash` column on the `jobs` table is populated at upload time to support this check. The boolean `hash_exists()` is retained for the worker safety net.

**Alternative considered**: Modify `hash_exists()` to return the document row. Rejected to avoid changing the worker's existing interface and to keep changes minimal.
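One possible shape for this helper is sketched below using the standard-library `sqlite3` module. The schema is an assumption made for the example: only `content_hash` on both tables and the `queued`/`processing` states come from this document, while the `id`, `title`, `filename`, and `status` column names are guesses.

```python
import sqlite3

# Assumed minimal schema; the real tables have more columns.
SCHEMA = """
CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, content_hash TEXT UNIQUE);
CREATE TABLE jobs (id INTEGER PRIMARY KEY, filename TEXT, status TEXT, content_hash TEXT);
"""


def get_document_by_hash(conn, content_hash):
    """Return duplicate info for a hash, or None if it is unknown."""
    # Already-ingested documents take precedence over in-flight jobs.
    row = conn.execute(
        "SELECT id, title FROM documents WHERE content_hash = ?",
        (content_hash,),
    ).fetchone()
    if row is not None:
        return {"document_id": row[0], "title": row[1]}
    # Only queued or processing jobs count as in-flight duplicates.
    row = conn.execute(
        "SELECT id, filename FROM jobs "
        "WHERE content_hash = ? AND status IN ('queued', 'processing')",
        (content_hash,),
    ).fetchone()
    if row is not None:
        return {"job_id": row[0], "title": row[1]}
    return None
```

Jobs in any other state (for example `skipped`) are ignored here, so a previously skipped upload does not block a later one on its own.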
### 4. Retain worker-side dedup as safety net

The worker's `hash_exists()` check stays. In theory, two identical uploads could arrive at the same instant, and both could pass the API hash check before either commits. The jobs-table check closes most of this window (the hash is written at job creation), but a narrow race remains between the API check and the job insert. The UNIQUE constraint on `documents.content_hash` is the final backstop.
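The final backstop can be illustrated with a small `sqlite3` sketch. It assumes a `documents` table whose `content_hash` column is declared `UNIQUE`, as described in the Context section; the function name `insert_document` and the returned status strings are hypothetical.

```python
import sqlite3


def insert_document(conn, title, content_hash):
    """Insert a document; report "skipped" if the hash is already stored."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO documents (title, content_hash) VALUES (?, ?)",
                (title, content_hash),
            )
    except sqlite3.IntegrityError:
        # Raised by the UNIQUE constraint when another upload won the race.
        return "skipped"
    return "ingested"
```

Even if both the API check and the worker check are bypassed by timing, the second insert fails cleanly and can be treated as a skip.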
### 5. Note dedup: hash the text content

For notes submitted via the `note` field, SHA256-hash the UTF-8 encoded text. This catches identical note resubmissions.
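A minimal sketch of the note hash, assuming the note arrives as a Python `str`; the function name `note_hash` is hypothetical.

```python
import hashlib


def note_hash(note: str) -> str:
    """SHA256 hex digest of the note's UTF-8 encoded text."""
    return hashlib.sha256(note.encode("utf-8")).hexdigest()
```

Since the hash covers the exact bytes, a note that differs only by a trailing newline or other whitespace produces a different digest and is not treated as a duplicate, in line with the byte-identical-only non-goal.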
## Risks / Trade-offs

- **[Race condition window]** Two identical files uploaded in the same millisecond could both pass the API hash check. → Mitigated by the worker-side `hash_exists()` check and the UNIQUE constraint. The second job would be `skipped`, not a crash.
- **[Blocking work in async endpoint]** SHA256 hashing is CPU-bound but fast (~5ms for 10MB). → Acceptable for the upload endpoint, which already reads the full file into memory. No need for `run_in_executor`.
- **[Client compatibility]** Older clients not expecting 409 will see an error. → This is correct behavior: they'll see an HTTP error rather than having a duplicate silently accepted. The Go client will be updated to handle it gracefully.
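The hashing-cost estimate in the risks above depends on hardware, so it may be worth measuring on the target machine. The following is a rough timing sketch rather than a rigorous benchmark; the function name `time_sha256_ms` and the choice of random input data are assumptions for illustration.

```python
import hashlib
import os
import time


def time_sha256_ms(size_bytes, repeats=5):
    """Best-of-N wall-clock time, in milliseconds, to SHA256 a random buffer."""
    data = os.urandom(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        hashlib.sha256(data).hexdigest()
        best = min(best, time.perf_counter() - start)
    return best * 1000.0
```

Calling `time_sha256_ms(10 * 1024 * 1024)` gives a figure to compare against the estimate for a 10MB upload.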