Upload-time duplicate detection, FTS5 query sanitization, release guard

- Reject duplicate uploads at the API boundary (HTTP 409) instead of
  silently skipping in the background worker. Checks both ingested
  documents and in-flight jobs via content_hash on the jobs table.
- Go client handles 409 with distinct messages for already-imported
  documents vs already-queued jobs.
- Sanitize FTS5 search queries by quoting each token to prevent syntax
  errors from special characters like ?, *, ", (), AND, OR, NOT.
- Add try/except safety net around FTS5 execute for edge cases.
- Add main branch guard to release.sh to prevent releasing from
  feature branches.
- Update specs and README to reflect new behaviour.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 23:05:07 +00:00
parent 63654a59b8
commit 6fec627503
20 changed files with 536 additions and 30 deletions
+39 -7
@@ -26,7 +26,7 @@ The engine SHALL load the embedding model eagerly at startup before accepting HT
### Requirement: Hybrid search
The engine SHALL provide hybrid search combining BM25 full-text search (via FTS5) and vector similarity search (via sqlite-vec), merged using Reciprocal Rank Fusion. Search SHALL complete in under 100ms when the model is warm.
The engine SHALL provide hybrid search combining BM25 full-text search (via FTS5) and vector similarity search (via sqlite-vec), merged using Reciprocal Rank Fusion. Search SHALL complete in under 100ms when the model is warm. The engine SHALL sanitize user query strings to prevent FTS5 syntax errors for any input.
#### Scenario: Hybrid search with results
- **WHEN** a client sends `POST /api/v1/search` with body `{"query": "how to change oil", "top": 5}`
@@ -44,27 +44,59 @@ The engine SHALL provide hybrid search combining BM25 full-text search (via FTS5
- **WHEN** a client searches against an empty database
- **THEN** the engine SHALL return HTTP 200 with `{"query": "...", "results": [], "total_matches": 0}`
#### Scenario: Search with special characters
- **WHEN** a client sends `POST /api/v1/search` with body `{"query": "what color is grass?"}`
- **THEN** the engine SHALL sanitize the query for FTS5, execute the search successfully, and return results (not a 500 error)
#### Scenario: Search with FTS5 operators in query
- **WHEN** a client sends `POST /api/v1/search` with body `{"query": "NOT something OR (other)"}`
- **THEN** the engine SHALL treat the input as literal search terms, not FTS5 operators, and return matching results
#### Scenario: Search with only special characters
- **WHEN** a client sends `POST /api/v1/search` with body `{"query": "??!@#"}`
- **THEN** the engine SHALL return HTTP 200 with an empty result set (not a 500 error)
#### Scenario: Search with quotes in query
- **WHEN** a client sends `POST /api/v1/search` with body `{"query": "the \"quick\" fox"}`
- **THEN** the engine SHALL sanitize embedded quotes and return results normally
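The four scenarios above can be satisfied by quoting every token before it reaches FTS5. The following is a minimal sketch, not the engine's actual code: the function names and the `chunks_fts` table are hypothetical, and it assumes that tokens containing no letters or digits can be dropped rather than searched.

```python
import re
import sqlite3


def sanitize_fts5_query(raw: str) -> str:
    """Turn arbitrary user input into a MATCH expression of quoted literals."""
    quoted = []
    for token in raw.split():
        # Assumption: a token with no letter or digit (e.g. "??!@#")
        # carries nothing searchable, so it is dropped.
        if not re.search(r"[^\W_]", token):
            continue
        # Inside an FTS5 string, an embedded double quote is escaped by doubling it.
        quoted.append('"' + token.replace('"', '""') + '"')
    # Whitespace-separated phrases are implicitly ANDed by FTS5.
    return " ".join(quoted)


def search_fts(conn: sqlite3.Connection, raw_query: str, limit: int = 5) -> list:
    """Run the sanitized query; `chunks_fts` is a hypothetical FTS5 table."""
    match_expr = sanitize_fts5_query(raw_query)
    if not match_expr:
        return []  # nothing searchable: empty result set, not an error
    try:
        return conn.execute(
            "SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH ? "
            "ORDER BY rank LIMIT ?",
            (match_expr, limit),
        ).fetchall()
    except sqlite3.OperationalError:
        # Safety net for any edge case the sanitizer misses.
        return []


print(sanitize_fts5_query("NOT something OR (other)"))
# "NOT" "something" "OR" "(other)"
```

Because `NOT` and `OR` end up inside double quotes, FTS5 reads them as ordinary terms, and the parentheses are discarded by the tokenizer rather than parsed as grouping.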
---
### Requirement: Async ingestion via job queue
The engine SHALL accept file uploads and text notes for ingestion asynchronously. Uploaded content SHALL be written to a staging area and a job record created in the database. The engine SHALL return HTTP 202 immediately. A background worker SHALL process queued jobs sequentially.
The engine SHALL accept file uploads and text notes for ingestion asynchronously. Uploaded content SHALL be written to a staging area and a job record created in the database. The engine SHALL return HTTP 202 immediately. A background worker SHALL process queued jobs sequentially. Before staging, the engine SHALL compute a SHA256 hash of the uploaded content and reject duplicates immediately.
#### Scenario: Upload a PDF file
- **WHEN** a client sends `POST /api/v1/jobs` with a multipart form containing a PDF file and optional fields (tags, doc_type)
- **THEN** the engine SHALL write the file to the staging directory, create a job record with status `queued`, and return HTTP 202 with `{"job_id": "<id>", "status": "queued", "filename": "report.pdf"}`
- **THEN** the engine SHALL compute the SHA256 hash of the file bytes, verify no existing document has the same hash, write the file to the staging directory, create a job record with status `queued`, and return HTTP 202 with `{"job_id": "<id>", "status": "queued", "filename": "report.pdf"}`
#### Scenario: Upload a text note
- **WHEN** a client sends `POST /api/v1/jobs` with a multipart form containing a `note` text field and optional `title` field
- **THEN** the engine SHALL write the note content to a staging file, create a job record with status `queued`, and return HTTP 202 with the job ID
- **THEN** the engine SHALL compute the SHA256 hash of the note text (UTF-8 encoded), verify no existing document has the same hash, write the note content to a staging file, create a job record with status `queued`, and return HTTP 202 with the job ID
#### Scenario: Upload multiple files in sequence
- **WHEN** a client sends multiple `POST /api/v1/jobs` requests in quick succession
- **THEN** the engine SHALL queue each job independently and the background worker SHALL process them in FIFO order
#### Scenario: Duplicate content detection
- **WHEN** a client uploads a file whose content hash matches an already-ingested document
- **THEN** the engine SHALL return HTTP 202 but the background worker SHALL mark the job as `skipped` with reason `duplicate`
#### Scenario: Duplicate file detected at upload time (already ingested)
- **WHEN** a client uploads a file whose SHA256 content hash matches an already-ingested document
- **THEN** the engine SHALL NOT stage the file or create a job record, and SHALL return HTTP 409 with `{"error": "duplicate", "document_id": <id>, "title": "<title>"}`
#### Scenario: Duplicate file detected at upload time (in-flight job)
- **WHEN** a client uploads a file whose SHA256 content hash matches a queued or processing job
- **THEN** the engine SHALL NOT stage the file or create a job record, and SHALL return HTTP 409 with `{"error": "duplicate", "job_id": <id>, "title": "<filename>"}`
#### Scenario: Duplicate note detected at upload time (already ingested)
- **WHEN** a client submits a note whose SHA256 content hash matches an already-ingested document
- **THEN** the engine SHALL NOT stage the note or create a job record, and SHALL return HTTP 409 with `{"error": "duplicate", "document_id": <id>, "title": "<title>"}`
#### Scenario: Duplicate note detected at upload time (in-flight job)
- **WHEN** a client submits a note whose SHA256 content hash matches a queued or processing job
- **THEN** the engine SHALL NOT stage the note or create a job record, and SHALL return HTTP 409 with `{"error": "duplicate", "job_id": <id>, "title": "<filename>"}`
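The upload-time check described in the duplicate scenarios above might look roughly like this. It is a sketch under stated assumptions: the `documents` and `jobs` column names, the integer IDs, and the helper names are illustrative, with only `content_hash` on the jobs table taken from the commit message.

```python
import hashlib
import sqlite3
from typing import Optional


def content_hash(data: bytes) -> str:
    """SHA256 hex digest of the raw upload (for notes, pass text.encode('utf-8'))."""
    return hashlib.sha256(data).hexdigest()


def check_duplicate(conn: sqlite3.Connection, digest: str) -> Optional[dict]:
    """Return the HTTP 409 body for a duplicate, or None if the upload is new."""
    # 1. Already-ingested documents.
    row = conn.execute(
        "SELECT id, title FROM documents WHERE content_hash = ?", (digest,)
    ).fetchone()
    if row:
        return {"error": "duplicate", "document_id": row[0], "title": row[1]}
    # 2. In-flight jobs (queued or processing).
    row = conn.execute(
        "SELECT id, filename FROM jobs "
        "WHERE content_hash = ? AND status IN ('queued', 'processing')",
        (digest,),
    ).fetchone()
    if row:
        return {"error": "duplicate", "job_id": row[0], "title": row[1]}
    return None


def make_demo_db() -> sqlite3.Connection:
    """In-memory schema used only for this illustration (hypothetical columns)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE documents (
            id INTEGER PRIMARY KEY, title TEXT, content_hash TEXT UNIQUE);
        CREATE TABLE jobs (
            id INTEGER PRIMARY KEY, filename TEXT, content_hash TEXT, status TEXT);
        """
    )
    return conn
```

A request handler would call `check_duplicate()` before staging anything and respond with 409 and the returned body when it is not `None`; otherwise it would stage the content, insert the job row with its `content_hash`, and respond with 202.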
#### Scenario: Duplicate uploaded during concurrent request handling
- **WHEN** two identical files are uploaded in the same instant, both passing the API hash check before either job is committed
- **THEN** both jobs SHALL be queued, and the background worker SHALL process the first normally and mark the second as `skipped` (worker-side safety net via `hash_exists()` and UNIQUE constraint)
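The worker-side safety net in the concurrent scenario can be illustrated as follows. This is a hedged sketch: `hash_exists()` is named in the spec, but its signature here is assumed, and `process_job()`, the schema, and the `done` status are hypothetical.

```python
import sqlite3


def hash_exists(conn: sqlite3.Connection, digest: str) -> bool:
    """True if a document with this content hash is already ingested."""
    row = conn.execute(
        "SELECT 1 FROM documents WHERE content_hash = ?", (digest,)
    ).fetchone()
    return row is not None


def process_job(conn: sqlite3.Connection, job_id: int) -> str:
    """Ingest one job, or mark it skipped when its content already exists."""
    filename, digest = conn.execute(
        "SELECT filename, content_hash FROM jobs WHERE id = ?", (job_id,)
    ).fetchone()
    if hash_exists(conn, digest):
        status = "skipped"
    else:
        try:
            conn.execute(
                "INSERT INTO documents (title, content_hash) VALUES (?, ?)",
                (filename, digest),
            )
            status = "done"  # assumed name for the success status
        except sqlite3.IntegrityError:
            # The UNIQUE constraint is the last line of defence.
            status = "skipped"
    conn.execute("UPDATE jobs SET status = ? WHERE id = ?", (status, job_id))
    conn.commit()
    return status


def make_demo_db() -> sqlite3.Connection:
    """In-memory schema for the illustration; only documents enforce uniqueness."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE documents (
            id INTEGER PRIMARY KEY, title TEXT, content_hash TEXT UNIQUE);
        CREATE TABLE jobs (
            id INTEGER PRIMARY KEY, filename TEXT, content_hash TEXT, status TEXT);
        """
    )
    return conn
```

Because the worker handles jobs sequentially, the explicit `hash_exists()` call normally catches the second job; the `IntegrityError` branch only matters if a row with the same hash appears between the check and the insert.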
#### Scenario: Upload failure due to unsupported file type
- **WHEN** a client uploads a file with an unsupported extension