Upload-time duplicate detection, FTS5 query sanitization, release guard
- Reject duplicate uploads at the API boundary (HTTP 409) instead of silently skipping in the background worker. Checks both ingested documents and in-flight jobs via content_hash on the jobs table. - Go client handles 409 with distinct messages for already-imported documents vs already-queued jobs. - Sanitize FTS5 search queries by quoting each token to prevent syntax errors from special characters like ?, *, ", (), AND, OR, NOT. - Add try/except safety net around FTS5 execute for edge cases. - Add main branch guard to release.sh to prevent releasing from feature branches. - Update specs and README to reflect new behaviour. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,2 @@
|
||||
schema: spec-driven
|
||||
created: 2026-03-26
|
||||
@@ -0,0 +1,58 @@
|
||||
## Context
|
||||
|
||||
The engine currently accepts all uploads with HTTP 202, stages the file, creates a job record, and relies on the background worker to detect duplicates via SHA256 content hash. When a duplicate is found, the worker marks the job as `skipped` — but the user has already received a success response and must poll job status to discover the duplicate. This creates unnecessary I/O (staging), pollutes the job list, and provides poor UX.
|
||||
|
||||
The `documents` table already has a `content_hash TEXT UNIQUE` column, and `database.hash_exists()` already exists. The infrastructure for dedup is in place — it just runs too late in the pipeline.
|
||||
|
||||
## Goals / Non-Goals
|
||||
|
||||
**Goals:**
|
||||
- Reject duplicate uploads at the API boundary with HTTP 409 and useful context (existing document ID/title)
|
||||
- Avoid staging files or creating job records for duplicates
|
||||
- Apply to both file uploads and note submissions
|
||||
- Keep the worker-side hash check as a race condition safety net
|
||||
- Update the Go client to handle 409 and display a clear message
|
||||
|
||||
**Non-Goals:**
|
||||
- Fuzzy/near-duplicate detection (e.g., same PDF with different metadata) — byte-identical only
|
||||
- Changing the hash algorithm (SHA256 is fine)
|
||||
- Adding a "force re-import" override flag (can be added later if needed)
|
||||
- Dedup across different file formats with identical content (e.g., .md and .pdf of same text)
|
||||
|
||||
## Decisions
|
||||
|
||||
### 1. Hash in the upload endpoint, before staging
|
||||
|
||||
Compute SHA256 from the uploaded bytes in `submit_job()` before calling `stage_file()`. This avoids writing to disk or creating a DB job record for duplicates.
|
||||
|
||||
**Alternative considered**: Hash after staging but before job creation. Rejected because it still wastes disk I/O for the staging write.
|
||||
|
||||
### 2. Return HTTP 409 Conflict with context-dependent metadata
|
||||
|
||||
The 409 response includes `{"error": "duplicate", ...}` with a distinct shape depending on where the duplicate was found:
|
||||
- **Already-ingested document**: `{"error": "duplicate", "document_id": <id>, "title": "<title>"}`
|
||||
- **In-flight job (queued/processing)**: `{"error": "duplicate", "job_id": <id>, "title": "<filename>"}`
|
||||
|
||||
This allows clients to distinguish between "this document already exists" and "this document is already being processed" and display appropriate messages.
|
||||
|
||||
**Alternative considered**: Return 200 with a `"status": "duplicate"` field. Rejected because 409 is the semantically correct status code and allows clients to distinguish duplicates from successful uploads without parsing the body.
|
||||
|
||||
### 3. New database helper: `get_document_by_hash()`
|
||||
|
||||
Returns a dict with duplicate info for a given hash, or `None`. Checks both the `documents` table (already ingested) and the `jobs` table (queued/processing), returning `document_id` or `job_id` accordingly. The `content_hash` column on the `jobs` table is populated at upload time to support this check. The boolean `hash_exists()` is retained for the worker safety net.
|
||||
|
||||
**Alternative considered**: Modify `hash_exists()` to return the document row. Rejected to avoid changing the worker's existing interface — keep changes minimal.
|
||||
|
||||
### 4. Retain worker-side dedup as safety net
|
||||
|
||||
The worker's `hash_exists()` check stays. In theory, two identical uploads could arrive in the same instant — both pass the API hash check before either commits. The jobs-table check closes most of this window (the hash is written at job creation), but a narrow race remains between the API check and the job insert. The UNIQUE constraint on `documents.content_hash` is the final backstop.
|
||||
|
||||
### 5. Note dedup: hash the text content
|
||||
|
||||
For notes submitted via the `note` field, SHA256-hash the UTF-8 encoded text. This catches identical note resubmissions.
|
||||
|
||||
## Risks / Trade-offs
|
||||
|
||||
- **[Race condition window]** Two identical files uploaded in the same millisecond could both pass the API hash check. → Mitigated by the worker-side `hash_exists()` check and the UNIQUE constraint. The second job would be `skipped`, not a crash.
|
||||
- **[Blocking I/O in async endpoint]** SHA256 hashing is CPU-bound but fast (~5ms for 10MB). → Acceptable for the upload endpoint which already reads the full file into memory. No need for `run_in_executor`.
|
||||
- **[Client compatibility]** Older clients not expecting 409 will see an error. → This is correct behavior — they'll see an HTTP error rather than silently accepting a duplicate. The Go client will be updated to handle it gracefully.
|
||||
@@ -0,0 +1,31 @@
|
||||
## Why
|
||||
|
||||
Duplicate document detection currently happens in the background worker — the upload endpoint always returns HTTP 202, and the user only discovers a duplicate later when the job status is `skipped`. This wastes staging I/O, creates noise in the job list, and gives poor user feedback. Moving the SHA256 content hash check to the upload endpoint allows immediate rejection with a clear error, preventing unnecessary work and giving the user instant feedback.
|
||||
|
||||
## What Changes
|
||||
|
||||
- **Compute content hash at upload time**: The `POST /api/v1/jobs` endpoint will SHA256-hash the uploaded file bytes before staging and check against `documents.content_hash`
|
||||
- **Reject duplicates immediately**: Return HTTP 409 Conflict with the existing document ID when a duplicate is detected, instead of accepting and later skipping
|
||||
- **No job created for duplicates**: Duplicate uploads will not create a job record or stage a file
|
||||
- **Remove worker-side dedup**: The background worker's `hash_exists()` check becomes redundant for the normal flow but should be retained as a safety net (race condition guard)
|
||||
- **Update Go client**: Surface the 409 response with a clear message (e.g., "Already imported: <title> (doc ID: <id>)")
|
||||
- **Note dedup**: Apply the same check to notes — hash the note text content
|
||||
|
||||
## Capabilities
|
||||
|
||||
### New Capabilities
|
||||
|
||||
_(none — this modifies existing capabilities)_
|
||||
|
||||
### Modified Capabilities
|
||||
|
||||
- `engine-api`: The "Async ingestion via job queue" requirement changes — duplicate content is now rejected at upload time (HTTP 409) instead of accepted and later skipped by the worker. The "Duplicate content detection" scenario moves from background to synchronous.
|
||||
- `go-client`: The "Add command" requirement changes — the client must handle HTTP 409 responses and display the duplicate document info to the user.
|
||||
|
||||
## Impact
|
||||
|
||||
- **Engine API** (`engine/kb/routes/jobs.py`): `submit_job()` gains hash computation and DB lookup before staging/job creation
|
||||
- **Engine database** (`engine/kb/database.py`): Need a query to return the existing document ID/title for a given hash (not just boolean exists check)
|
||||
- **Engine worker** (`engine/kb/worker.py`): Dedup check retained as safety net but no longer the primary guard
|
||||
- **Go client** (`client/cmd/add.go`): Handle 409 response, display duplicate info
|
||||
- **API contract**: New HTTP 409 response on `POST /api/v1/jobs` — this is additive, not breaking, since no consumer expects 409 today
|
||||
@@ -0,0 +1,41 @@
|
||||
## MODIFIED Requirements
|
||||
|
||||
### Requirement: Async ingestion via job queue
|
||||
|
||||
The engine SHALL accept file uploads and text notes for ingestion asynchronously. Uploaded content SHALL be written to a staging area and a job record created in the database. The engine SHALL return HTTP 202 immediately. A background worker SHALL process queued jobs sequentially. Before staging, the engine SHALL compute a SHA256 hash of the uploaded content and reject duplicates immediately.
|
||||
|
||||
#### Scenario: Upload a PDF file
|
||||
- **WHEN** a client sends `POST /api/v1/jobs` with a multipart form containing a PDF file and optional fields (tags, doc_type)
|
||||
- **THEN** the engine SHALL compute the SHA256 hash of the file bytes, verify no existing document has the same hash, write the file to the staging directory, create a job record with status `queued`, and return HTTP 202 with `{"job_id": "<id>", "status": "queued", "filename": "report.pdf"}`
|
||||
|
||||
#### Scenario: Upload a text note
|
||||
- **WHEN** a client sends `POST /api/v1/jobs` with a multipart form containing a `note` text field and optional `title` field
|
||||
- **THEN** the engine SHALL compute the SHA256 hash of the note text (UTF-8 encoded), verify no existing document has the same hash, write the note content to a staging file, create a job record with status `queued`, and return HTTP 202 with the job ID
|
||||
|
||||
#### Scenario: Upload multiple files in sequence
|
||||
- **WHEN** a client sends multiple `POST /api/v1/jobs` requests in quick succession
|
||||
- **THEN** the engine SHALL queue each job independently and the background worker SHALL process them in FIFO order
|
||||
|
||||
#### Scenario: Duplicate file detected at upload time (already ingested)
|
||||
- **WHEN** a client uploads a file whose SHA256 content hash matches an already-ingested document
|
||||
- **THEN** the engine SHALL NOT stage the file or create a job record, and SHALL return HTTP 409 with `{"error": "duplicate", "document_id": <id>, "title": "<title>"}`
|
||||
|
||||
#### Scenario: Duplicate file detected at upload time (in-flight job)
|
||||
- **WHEN** a client uploads a file whose SHA256 content hash matches a queued or processing job
|
||||
- **THEN** the engine SHALL NOT stage the file or create a job record, and SHALL return HTTP 409 with `{"error": "duplicate", "job_id": <id>, "title": "<filename>"}`
|
||||
|
||||
#### Scenario: Duplicate note detected at upload time (already ingested)
|
||||
- **WHEN** a client submits a note whose SHA256 content hash matches an already-ingested document
|
||||
- **THEN** the engine SHALL NOT stage the note or create a job record, and SHALL return HTTP 409 with `{"error": "duplicate", "document_id": <id>, "title": "<title>"}`
|
||||
|
||||
#### Scenario: Duplicate note detected at upload time (in-flight job)
|
||||
- **WHEN** a client submits a note whose SHA256 content hash matches a queued or processing job
|
||||
- **THEN** the engine SHALL NOT stage the note or create a job record, and SHALL return HTTP 409 with `{"error": "duplicate", "job_id": <id>, "title": "<filename>"}`
|
||||
|
||||
#### Scenario: Duplicate uploaded during concurrent request handling
|
||||
- **WHEN** two identical files are uploaded in the same instant, both passing the API hash check before either job is committed
|
||||
- **THEN** both jobs SHALL be queued, and the background worker SHALL process the first normally and mark the second as `skipped` (worker-side safety net via `hash_exists()` and UNIQUE constraint)
|
||||
|
||||
#### Scenario: Upload failure due to unsupported file type
|
||||
- **WHEN** a client uploads a file with an unsupported extension
|
||||
- **THEN** the engine SHALL return HTTP 422 with an error message listing supported types
|
||||
@@ -0,0 +1,49 @@
|
||||
## MODIFIED Requirements
|
||||
|
||||
### Requirement: Add command (file and note ingestion)
|
||||
|
||||
The client SHALL provide a `kb add` command that uploads files or notes to the engine for async ingestion. The client SHALL exit immediately after a successful upload. The client SHALL handle duplicate rejection (HTTP 409) and display the existing document information.
|
||||
|
||||
#### Scenario: Add a single file
|
||||
- **WHEN** the user runs `kb add report.pdf`
|
||||
- **THEN** the client SHALL upload the file via `POST /api/v1/jobs` (multipart), print "Queued: report.pdf", and exit
|
||||
|
||||
#### Scenario: Add a file with tags
|
||||
- **WHEN** the user runs `kb add manual.pdf --tags car,maintenance`
|
||||
- **THEN** the client SHALL include the tags in the multipart upload metadata
|
||||
|
||||
#### Scenario: Add a directory recursively
|
||||
- **WHEN** the user runs `kb add ~/documents/ --recursive`
|
||||
- **THEN** the client SHALL discover all supported files in the directory tree, upload each one sequentially, and print "Queued: N files"
|
||||
|
||||
#### Scenario: Add a text note
|
||||
- **WHEN** the user runs `kb add --note "The server room is in building 3, floor 2"`
|
||||
- **THEN** the client SHALL submit the note text via `POST /api/v1/jobs` (multipart with note field), print "Queued: note", and exit
|
||||
|
||||
#### Scenario: Duplicate file rejected (already ingested)
|
||||
- **WHEN** the user runs `kb add report.pdf` and the engine returns HTTP 409 with `{"error": "duplicate", "document_id": 42, "title": "report.pdf"}`
|
||||
- **THEN** the client SHALL print "Already imported: report.pdf (doc ID: 42)" and exit with code 0
|
||||
|
||||
#### Scenario: Duplicate file rejected (in-flight job)
|
||||
- **WHEN** the user runs `kb add report.pdf` and the engine returns HTTP 409 with `{"error": "duplicate", "job_id": 7, "title": "report.pdf"}`
|
||||
- **THEN** the client SHALL print "Already queued: report.pdf (job ID: 7)" and exit with code 0
|
||||
|
||||
#### Scenario: Duplicate file in recursive add
|
||||
- **WHEN** the user runs `kb add ~/documents/ --recursive` and some files are rejected as duplicates
|
||||
- **THEN** the client SHALL print the duplicate message for each rejected file (distinguishing "Already imported" from "Already queued"), continue uploading remaining files, and include a summary (e.g., "Queued: 5 files, 2 duplicates skipped")
|
||||
|
||||
#### Scenario: Duplicate with JSON output
|
||||
- **WHEN** the user runs `kb add report.pdf --format json` and the engine returns HTTP 409
|
||||
- **THEN** the client SHALL output the raw JSON response from the engine including the document_id and title
|
||||
|
||||
#### Scenario: Add with JSON output
|
||||
- **WHEN** the user runs `kb add report.pdf --format json`
|
||||
- **THEN** the client SHALL output the JSON response from the engine including the job_id
|
||||
|
||||
#### Scenario: File not found
|
||||
- **WHEN** the user runs `kb add nonexistent.pdf`
|
||||
- **THEN** the client SHALL print an error and exit with a non-zero code without making any API call
|
||||
|
||||
#### Scenario: Upload failure
|
||||
- **WHEN** the upload fails (network error, engine returns 4xx/5xx other than 409)
|
||||
- **THEN** the client SHALL print the error and exit with a non-zero code
|
||||
@@ -0,0 +1,22 @@
|
||||
## 1. Database Layer
|
||||
|
||||
- [x] 1.1 Add `get_document_by_hash(conn, content_hash)` function to `engine/kb/database.py` that returns `(document_id, title)` or `None`
|
||||
|
||||
## 2. Upload Endpoint
|
||||
|
||||
- [x] 2.1 Update `submit_job()` in `engine/kb/routes/jobs.py` to compute SHA256 hash of uploaded file bytes before staging
|
||||
- [x] 2.2 Add duplicate check: call `get_document_by_hash()` and return HTTP 409 with `{"error": "duplicate", "document_id": <id>, "title": "<title>"}` if match found
|
||||
- [x] 2.3 Apply same hash check for note submissions (hash the UTF-8 encoded note text)
|
||||
|
||||
## 3. Go Client
|
||||
|
||||
- [x] 3.1 Update `uploadFile()` in `client/cmd/add.go` to handle HTTP 409 responses — parse the JSON body and print "Already imported: <title> (doc ID: <id>)"
|
||||
- [x] 3.2 Update recursive directory upload to continue on 409, track duplicate count, and include in summary output
|
||||
- [x] 3.3 Handle 409 in JSON output mode — pass through the raw engine response
|
||||
|
||||
## 4. Testing
|
||||
|
||||
- [x] 4.1 Test: upload a file, then upload the same file again — verify 409 with correct document_id and title
|
||||
- [x] 4.2 Test: upload a note, then upload the same note text — verify 409
|
||||
- [x] 4.3 Test: upload a file, then upload a different file — verify 202 as normal
|
||||
- [x] 4.4 Test: verify the worker-side `hash_exists()` safety net still works (direct job insertion bypassing API)
|
||||
Reference in New Issue
Block a user