Upload-time duplicate detection, FTS5 query sanitization, release guard

- Reject duplicate uploads at the API boundary (HTTP 409) instead of
  silently skipping in the background worker. Checks both ingested
  documents and in-flight jobs via content_hash on the jobs table.
- Go client handles 409 with distinct messages for already-imported
  documents vs already-queued jobs.
- Sanitize FTS5 search queries by quoting each token to prevent syntax
  errors from special characters and operators such as ?, *, ", (),
  AND, OR, NOT.
- Add try/except safety net around FTS5 execute for edge cases.
- Add main branch guard to release.sh to prevent releasing from
  feature branches.
- Update specs and README to reflect new behaviour.
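
A minimal sketch of the token-quoting approach described above, not the
code from this commit. The function names and the `docs_fts` table name
are hypothetical; the quoting rule (double-quoted strings are literal,
embedded quotes are doubled) is standard FTS5 syntax.

```python
import sqlite3


def sanitize_fts5_query(raw: str) -> str:
    """Wrap each whitespace-separated token in double quotes.

    FTS5 reads a double-quoted string as literal text, so characters such as
    ? * ( ) and the keywords AND, OR, NOT lose their query-syntax meaning.
    A double quote inside a token is escaped by doubling it.
    """
    tokens = raw.split()
    return " ".join('"' + tok.replace('"', '""') + '"' for tok in tokens)


def search(conn: sqlite3.Connection, raw: str) -> list:
    """Run a sanitized MATCH query; return no rows instead of raising."""
    query = sanitize_fts5_query(raw)
    if not query:
        return []
    try:
        # `docs_fts` is a hypothetical FTS5 table name (assumption).
        return conn.execute(
            "SELECT rowid FROM docs_fts WHERE docs_fts MATCH ?", (query,)
        ).fetchall()
    except sqlite3.OperationalError:
        # Safety net for edge cases the quoting does not cover.
        return []
```

Whitespace-separated quoted tokens are combined with an implicit AND, so
a multi-word search still requires every word to match.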

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 23:05:07 +00:00
parent 63654a59b8
commit 6fec627503
20 changed files with 536 additions and 30 deletions
@@ -0,0 +1,22 @@
## 1. Database Layer
- [x] 1.1 Add `get_document_by_hash(conn, content_hash)` function to `engine/kb/database.py` that returns `(document_id, title)` or `None`
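
One way task 1.1 could look, as a sketch only. The `documents` table and its `id`, `title`, and `content_hash` columns are assumptions; the real schema in `engine/kb/database.py` may differ.

```python
import sqlite3
from typing import Optional, Tuple


def get_document_by_hash(
    conn: sqlite3.Connection, content_hash: str
) -> Optional[Tuple[int, str]]:
    """Return (document_id, title) for a matching hash, or None."""
    # Table and column names below are hypothetical.
    row = conn.execute(
        "SELECT id, title FROM documents WHERE content_hash = ? LIMIT 1",
        (content_hash,),
    ).fetchone()
    if row is None:
        return None
    return (row[0], row[1])
```

An index on the hash column would keep this lookup cheap as the document count grows.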
## 2. Upload Endpoint
- [x] 2.1 Update `submit_job()` in `engine/kb/routes/jobs.py` to compute the SHA-256 hash of the uploaded file bytes before staging
- [x] 2.2 Add duplicate check: call `get_document_by_hash()` and return HTTP 409 with `{"error": "duplicate", "document_id": <id>, "title": "<title>"}` if a match is found
- [x] 2.3 Apply the same hash check to note submissions (hash the UTF-8 encoded note text)
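
The steps in this section can be sketched independently of the web framework. This is an illustration under assumptions: the helper names are hypothetical, the hash is assumed to be stored as a hex digest, and `lookup` stands in for a call to `get_document_by_hash()` with the connection already bound.

```python
import hashlib
from typing import Callable, Optional, Tuple

# Maps a content hash to (document_id, title), or None when unknown.
Lookup = Callable[[str], Optional[Tuple[int, str]]]


def content_hash(data: bytes) -> str:
    """SHA-256 of the raw bytes, as a hex string (hex form is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def note_hash(text: str) -> str:
    """Notes are hashed over their UTF-8 encoding, as in task 2.3."""
    return content_hash(text.encode("utf-8"))


def duplicate_response(data: bytes, lookup: Lookup) -> Optional[Tuple[int, dict]]:
    """Return (409, body) when the content is already known, else None."""
    match = lookup(content_hash(data))
    if match is None:
        return None  # caller proceeds to stage the job as usual
    document_id, title = match
    return 409, {"error": "duplicate", "document_id": document_id, "title": title}
```

Computing the hash before staging means a duplicate is rejected without writing the file to disk or creating a job row.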
## 3. Go Client
- [x] 3.1 Update `uploadFile()` in `client/cmd/add.go` to handle HTTP 409 responses — parse the JSON body and print "Already imported: <title> (doc ID: <id>)"
- [x] 3.2 Update recursive directory upload to continue on 409, track duplicate count, and include in summary output
- [x] 3.3 Handle 409 in JSON output mode — pass through the raw engine response
## 4. Testing
- [x] 4.1 Test: upload a file, then upload the same file again — verify 409 with correct document_id and title
- [x] 4.2 Test: upload a note, then upload the same note text — verify 409
- [x] 4.3 Test: upload a file, then upload a different file — verify 202 as normal
- [x] 4.4 Test: verify the worker-side `hash_exists()` safety net still works (direct job insertion bypassing API)