steve 6fec627503 Upload-time duplicate detection, FTS5 query sanitization, release guard
- Reject duplicate uploads at the API boundary (HTTP 409) instead of
  silently skipping in the background worker. Checks both ingested
  documents and in-flight jobs via content_hash on the jobs table.
- Go client handles 409 with distinct messages for already-imported
  documents vs already-queued jobs.
- Sanitize FTS5 search queries by quoting each token to prevent syntax
  errors from special characters like ?, *, ", (), AND, OR, NOT.
- Add try/except safety net around FTS5 execute for edge cases.
- Add main branch guard to release.sh to prevent releasing from
  feature branches.
- Update specs and README to reflect new behaviour.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 23:05:07 +00:00

## Why
Duplicate document detection currently happens in the background worker: the upload endpoint always returns HTTP 202, and the user only discovers a duplicate later, when the job status becomes `skipped`. This wastes staging I/O, adds noise to the job list, and delays feedback. Moving the SHA-256 content hash check to the upload endpoint lets the API reject a duplicate immediately with a clear error, before any file is staged or job created.
## What Changes
- **Compute content hash at upload time**: The `POST /api/v1/jobs` endpoint will SHA256-hash the uploaded file bytes before staging and check against `documents.content_hash`
- **Reject duplicates immediately**: Return HTTP 409 Conflict with the existing document ID when a duplicate is detected, instead of accepting and later skipping
- **No job created for duplicates**: Duplicate uploads will not create a job record or stage a file
- **Demote worker-side dedup to a safety net**: The background worker's `hash_exists()` check becomes redundant in the normal flow but is retained as a race condition guard (for example, two identical uploads that both pass the upload-time check before either is ingested)
- **Update Go client**: Surface the 409 response with a clear message (e.g., `Already imported: <title> (doc ID: <id>)`)
- **Note dedup**: Apply the same check to notes — hash the note text content
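
The upload-time flow described above can be sketched as follows. This is a minimal illustration, not the actual `submit_job()` implementation: the in-memory `DOCUMENTS` mapping stands in for the `documents` table, and the function names, response body fields, and UTF-8 encoding of note text are all assumptions.

```python
import hashlib

# Hypothetical in-memory stand-in for the documents table:
# content_hash -> (document ID, title).
DOCUMENTS: dict[str, tuple[int, str]] = {}


def content_hash(data: bytes) -> str:
    """Return the hex SHA-256 digest of the raw upload bytes."""
    return hashlib.sha256(data).hexdigest()


def submit_upload(data: bytes) -> tuple[int, dict]:
    """Decide the HTTP status and response body for an upload (sketch only)."""
    digest = content_hash(data)
    existing = DOCUMENTS.get(digest)
    if existing is not None:
        doc_id, title = existing
        # Duplicate: reject before staging, so no job record or file is created.
        return 409, {"error": "duplicate", "document_id": doc_id, "title": title}
    # Not a duplicate: the real endpoint would stage the file and create a job here.
    return 202, {"status": "queued", "content_hash": digest}


def submit_note(text: str) -> tuple[int, dict]:
    """Notes take the same path; encoding the text as UTF-8 is an assumption."""
    return submit_upload(text.encode("utf-8"))
```

Hashing before staging is what keeps the duplicate path free of side effects: the 409 branch returns without touching the job list or the staging area.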
## Capabilities
### New Capabilities
_(none — this modifies existing capabilities)_
### Modified Capabilities
- `engine-api`: The "Async ingestion via job queue" requirement changes — duplicate content is now rejected at upload time (HTTP 409) instead of accepted and later skipped by the worker. The "Duplicate content detection" scenario moves from background to synchronous.
- `go-client`: The "Add command" requirement changes — the client must handle HTTP 409 responses and display the duplicate document info to the user.
## Impact
- **Engine API** (`engine/kb/routes/jobs.py`): `submit_job()` gains hash computation and DB lookup before staging/job creation
- **Engine database** (`engine/kb/database.py`): Needs a query that returns the existing document ID and title for a given hash (not just a boolean existence check)
- **Engine worker** (`engine/kb/worker.py`): Dedup check retained as safety net but no longer the primary guard
- **Go client** (`client/cmd/add.go`): Handle 409 response, display duplicate info
- **API contract**: New HTTP 409 response on `POST /api/v1/jobs` — this is additive, not breaking, since no consumer expects 409 today
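
The database change listed above could look roughly like this. It is a sketch using the standard-library `sqlite3` module: only the `documents.content_hash` column comes from this proposal, while the `id` and `title` column names, the `find_document_by_hash` name, and the `hash_exists()` signature shown here are assumptions.

```python
import sqlite3
from typing import Optional


def find_document_by_hash(
    conn: sqlite3.Connection, digest: str
) -> Optional[tuple[int, str]]:
    """Return (id, title) of the document with this content hash, or None."""
    row = conn.execute(
        # Column names id and title are assumed for illustration.
        "SELECT id, title FROM documents WHERE content_hash = ? LIMIT 1",
        (digest,),
    ).fetchone()
    return (row[0], row[1]) if row is not None else None


def hash_exists(conn: sqlite3.Connection, digest: str) -> bool:
    """Boolean check kept for the worker's safety net (signature assumed)."""
    return find_document_by_hash(conn, digest) is not None
```

Expressing the boolean check in terms of the richer lookup lets the endpoint and the worker share one query, so the two guards cannot drift apart.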