6fec627503
- Reject duplicate uploads at the API boundary (HTTP 409) instead of silently skipping in the background worker. Checks both ingested documents and in-flight jobs via content_hash on the jobs table.
- Go client handles 409 with distinct messages for already-imported documents vs already-queued jobs.
- Sanitize FTS5 search queries by quoting each token to prevent syntax errors from special characters like ?, *, ", (), AND, OR, NOT.
- Add try/except safety net around FTS5 execute for edge cases.
- Add main branch guard to release.sh to prevent releasing from feature branches.
- Update specs and README to reflect new behaviour.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Why
Duplicate document detection currently happens in the background worker: the upload endpoint always returns HTTP 202, and the user only discovers a duplicate later, when the job status is `skipped`. This wastes staging I/O, creates noise in the job list, and delays feedback. Moving the SHA-256 content hash check to the upload endpoint lets the API reject a duplicate immediately with a clear error, before any file is staged or any job is created.
## What Changes
- **Compute content hash at upload time**: The `POST /api/v1/jobs` endpoint will SHA256-hash the uploaded file bytes before staging and check against `documents.content_hash`
- **Reject duplicates immediately**: Return HTTP 409 Conflict with the existing document ID when a duplicate is detected, instead of accepting and later skipping
- **No job created for duplicates**: Duplicate uploads will not create a job record or stage a file
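- **Keep worker-side dedup as a safety net**: The background worker's `hash_exists()` check becomes redundant for the normal flow but should be retained as a race-condition guard (for example, two identical uploads that both pass the upload-time check before either has been ingested)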
- **Update Go client**: Surface the 409 response with a clear message (e.g., `Already imported: <title> (doc ID: <id>)`)
- **Note dedup**: Apply the same check to notes by hashing the note's text content
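
The upload-time flow described in the list above can be sketched as follows. This is a minimal illustration, not the engine's actual `submit_job()`: the `ExistingDoc` type, the `find_by_hash` and `create_job` callables, and the response field names (`detail`, `document_id`, `title`, `job_id`) are all hypothetical stand-ins chosen for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class ExistingDoc:
    """Hypothetical shape of a previously ingested document."""
    doc_id: int
    title: str


def content_hash(data: bytes) -> str:
    """Return the hex SHA-256 digest of the uploaded bytes."""
    return hashlib.sha256(data).hexdigest()


def submit_job(
    data: bytes,
    find_by_hash: Callable[[str], Optional[ExistingDoc]],
    create_job: Callable[[bytes, str], int],
) -> Tuple[int, Dict[str, object]]:
    """Return (HTTP status, JSON-like body) for an upload."""
    digest = content_hash(data)
    existing = find_by_hash(digest)
    if existing is not None:
        # Duplicate: reject before staging, so no job record is created.
        return 409, {
            "detail": "duplicate",
            "document_id": existing.doc_id,
            "title": existing.title,
        }
    # New content: stage the file and enqueue a job as before.
    return 202, {"job_id": create_job(data, digest)}


# Demo with in-memory stand-ins for the database and the job queue.
known = {content_hash(b"hello"): ExistingDoc(doc_id=1, title="Greeting")}
created: List[str] = []


def fake_create_job(data: bytes, digest: str) -> int:
    created.append(digest)
    return len(created)


print(submit_job(b"hello", known.get, fake_create_job))          # 409, nothing queued
print(submit_job(b"something new", known.get, fake_create_job))  # 202, one job queued
```

For notes, the same function would apply to the encoded text, for example `content_hash(note_text.encode("utf-8"))`, assuming notes are hashed as UTF-8 bytes.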
## Capabilities
### New Capabilities
_(none — this modifies existing capabilities)_
### Modified Capabilities
- `engine-api`: The "Async ingestion via job queue" requirement changes — duplicate content is now rejected at upload time (HTTP 409) instead of accepted and later skipped by the worker. The "Duplicate content detection" scenario moves from background to synchronous.
- `go-client`: The "Add command" requirement changes — the client must handle HTTP 409 responses and display the duplicate document info to the user.
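
To make the client-side change concrete, the branching the "Add command" needs can be sketched as below. The real client is written in Go; this sketch uses Python only to illustrate the logic, and the JSON field names (`job_id`, `title`, `document_id`) are assumptions about the response body, not a confirmed contract.

```python
import json


def describe_upload_result(status: int, body: str) -> str:
    """Map an upload response to the message shown to the user."""
    if status == 202:
        # Accepted: the upload was queued for background ingestion.
        job = json.loads(body)
        return f"Queued job {job['job_id']}"
    if status == 409:
        # Duplicate: report the existing document instead of failing opaquely.
        dup = json.loads(body)
        return f"Already imported: {dup['title']} (doc ID: {dup['document_id']})"
    return f"Upload failed with HTTP {status}"


print(describe_upload_result(409, json.dumps({"title": "Q3 report", "document_id": 17})))
```

Treating 409 as a distinct, informative outcome rather than a generic error keeps the command's output useful when a user re-adds a file by accident.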
## Impact
- **Engine API** (`engine/kb/routes/jobs.py`): `submit_job()` gains hash computation and DB lookup before staging/job creation
- **Engine database** (`engine/kb/database.py`): Needs a query that returns the existing document's ID and title for a given hash, not just a boolean existence check
- **Engine worker** (`engine/kb/worker.py`): Dedup check retained as safety net but no longer the primary guard
- **Go client** (`client/cmd/add.go`): Handle 409 response, display duplicate info
- **API contract**: New HTTP 409 response on `POST /api/v1/jobs` — this is additive, not breaking, since no consumer expects 409 today
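
The database change listed above might look roughly like this. It is a sketch under assumptions: it uses SQLite through Python's standard `sqlite3` module, and apart from `documents.content_hash` and `hash_exists()`, which the proposal names, the function name `find_document_by_hash` and the `id` and `title` columns are hypothetical.

```python
import sqlite3
from typing import Optional, Tuple


def find_document_by_hash(
    conn: sqlite3.Connection, content_hash: str
) -> Optional[Tuple[int, str]]:
    """Return (id, title) of the document with this hash, or None."""
    row = conn.execute(
        "SELECT id, title FROM documents WHERE content_hash = ? LIMIT 1",
        (content_hash,),
    ).fetchone()
    return (row[0], row[1]) if row else None


def hash_exists(conn: sqlite3.Connection, content_hash: str) -> bool:
    """Boolean check the worker can keep using as its safety net."""
    return find_document_by_hash(conn, content_hash) is not None


# Demo against an in-memory database with an assumed minimal schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, title TEXT NOT NULL, content_hash TEXT NOT NULL)"
)
conn.execute(
    "INSERT INTO documents (title, content_hash) VALUES (?, ?)",
    ("Q3 report", "abc123"),
)

print(find_document_by_hash(conn, "abc123"))
print(hash_exists(conn, "missing"))
```

Building the boolean check on top of the richer lookup keeps a single query to maintain while the endpoint and the worker share the same definition of a duplicate.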