Upload-time duplicate detection, FTS5 query sanitization, release guard
- Reject duplicate uploads at the API boundary (HTTP 409) instead of silently skipping in the background worker. Checks both ingested documents and in-flight jobs via content_hash on the jobs table. - Go client handles 409 with distinct messages for already-imported documents vs already-queued jobs. - Sanitize FTS5 search queries by quoting each token to prevent syntax errors from special characters like ?, *, ", (), AND, OR, NOT. - Add try/except safety net around FTS5 execute for edge cases. - Add main branch guard to release.sh to prevent releasing from feature branches. - Update specs and README to reflect new behaviour. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,41 @@
|
||||
## MODIFIED Requirements
|
||||
|
||||
### Requirement: Async ingestion via job queue
|
||||
|
||||
The engine SHALL accept file uploads and text notes for ingestion asynchronously. Uploaded content SHALL be written to a staging area and a job record created in the database. The engine SHALL return HTTP 202 immediately. A background worker SHALL process queued jobs sequentially. Before staging, the engine SHALL compute a SHA256 hash of the uploaded content and reject duplicates immediately.
|
||||
|
||||
#### Scenario: Upload a PDF file
|
||||
- **WHEN** a client sends `POST /api/v1/jobs` with a multipart form containing a PDF file and optional fields (tags, doc_type)
|
||||
- **THEN** the engine SHALL compute the SHA256 hash of the file bytes, verify no existing document has the same hash, write the file to the staging directory, create a job record with status `queued`, and return HTTP 202 with `{"job_id": "<id>", "status": "queued", "filename": "report.pdf"}`
|
||||
|
||||
#### Scenario: Upload a text note
|
||||
- **WHEN** a client sends `POST /api/v1/jobs` with a multipart form containing a `note` text field and optional `title` field
|
||||
- **THEN** the engine SHALL compute the SHA256 hash of the note text (UTF-8 encoded), verify no existing document has the same hash, write the note content to a staging file, create a job record with status `queued`, and return HTTP 202 with the job ID
|
||||
|
||||
#### Scenario: Upload multiple files in sequence
|
||||
- **WHEN** a client sends multiple `POST /api/v1/jobs` requests in quick succession
|
||||
- **THEN** the engine SHALL queue each job independently and the background worker SHALL process them in FIFO order
|
||||
|
||||
#### Scenario: Duplicate file detected at upload time (already ingested)
|
||||
- **WHEN** a client uploads a file whose SHA256 content hash matches an already-ingested document
|
||||
- **THEN** the engine SHALL NOT stage the file or create a job record, and SHALL return HTTP 409 with `{"error": "duplicate", "document_id": <id>, "title": "<title>"}`
|
||||
|
||||
#### Scenario: Duplicate file detected at upload time (in-flight job)
|
||||
- **WHEN** a client uploads a file whose SHA256 content hash matches a queued or processing job
|
||||
- **THEN** the engine SHALL NOT stage the file or create a job record, and SHALL return HTTP 409 with `{"error": "duplicate", "job_id": <id>, "title": "<filename>"}`
|
||||
|
||||
#### Scenario: Duplicate note detected at upload time (already ingested)
|
||||
- **WHEN** a client submits a note whose SHA256 content hash matches an already-ingested document
|
||||
- **THEN** the engine SHALL NOT stage the note or create a job record, and SHALL return HTTP 409 with `{"error": "duplicate", "document_id": <id>, "title": "<title>"}`
|
||||
|
||||
#### Scenario: Duplicate note detected at upload time (in-flight job)
|
||||
- **WHEN** a client submits a note whose SHA256 content hash matches a queued or processing job
|
||||
- **THEN** the engine SHALL NOT stage the note or create a job record, and SHALL return HTTP 409 with `{"error": "duplicate", "job_id": <id>, "title": "<filename>"}`
|
||||
|
||||
#### Scenario: Duplicate uploaded during concurrent request handling
|
||||
- **WHEN** two identical files are uploaded in the same instant, both passing the API hash check before either job is committed
|
||||
- **THEN** both jobs SHALL be queued, and the background worker SHALL process the first normally and mark the second as `skipped` (worker-side safety net via `hash_exists()` and UNIQUE constraint)
|
||||
|
||||
#### Scenario: Upload failure due to unsupported file type
|
||||
- **WHEN** a client uploads a file with an unsupported extension
|
||||
- **THEN** the engine SHALL return HTTP 422 with an error message listing supported types
|
||||
@@ -0,0 +1,49 @@
|
||||
## MODIFIED Requirements
|
||||
|
||||
### Requirement: Add command (file and note ingestion)
|
||||
|
||||
The client SHALL provide a `kb add` command that uploads files or notes to the engine for async ingestion. The client SHALL exit immediately after a successful upload. The client SHALL handle duplicate rejection (HTTP 409) and display the existing document information.
|
||||
|
||||
#### Scenario: Add a single file
|
||||
- **WHEN** the user runs `kb add report.pdf`
|
||||
- **THEN** the client SHALL upload the file via `POST /api/v1/jobs` (multipart), print "Queued: report.pdf", and exit
|
||||
|
||||
#### Scenario: Add a file with tags
|
||||
- **WHEN** the user runs `kb add manual.pdf --tags car,maintenance`
|
||||
- **THEN** the client SHALL include the tags in the multipart upload metadata
|
||||
|
||||
#### Scenario: Add a directory recursively
|
||||
- **WHEN** the user runs `kb add ~/documents/ --recursive`
|
||||
- **THEN** the client SHALL discover all supported files in the directory tree, upload each one sequentially, and print "Queued: N files"
|
||||
|
||||
#### Scenario: Add a text note
|
||||
- **WHEN** the user runs `kb add --note "The server room is in building 3, floor 2"`
|
||||
- **THEN** the client SHALL submit the note text via `POST /api/v1/jobs` (multipart with note field), print "Queued: note", and exit
|
||||
|
||||
#### Scenario: Duplicate file rejected (already ingested)
|
||||
- **WHEN** the user runs `kb add report.pdf` and the engine returns HTTP 409 with `{"error": "duplicate", "document_id": 42, "title": "report.pdf"}`
|
||||
- **THEN** the client SHALL print "Already imported: report.pdf (doc ID: 42)" and exit with code 0
|
||||
|
||||
#### Scenario: Duplicate file rejected (in-flight job)
|
||||
- **WHEN** the user runs `kb add report.pdf` and the engine returns HTTP 409 with `{"error": "duplicate", "job_id": 7, "title": "report.pdf"}`
|
||||
- **THEN** the client SHALL print "Already queued: report.pdf (job ID: 7)" and exit with code 0
|
||||
|
||||
#### Scenario: Duplicate file in recursive add
|
||||
- **WHEN** the user runs `kb add ~/documents/ --recursive` and some files are rejected as duplicates
|
||||
- **THEN** the client SHALL print the duplicate message for each rejected file (distinguishing "Already imported" from "Already queued"), continue uploading remaining files, and include a summary (e.g., "Queued: 5 files, 2 duplicates skipped")
|
||||
|
||||
#### Scenario: Duplicate with JSON output
|
||||
- **WHEN** the user runs `kb add report.pdf --format json` and the engine returns HTTP 409
|
||||
- **THEN** the client SHALL output the raw JSON response from the engine including the document_id and title
|
||||
|
||||
#### Scenario: Add with JSON output
|
||||
- **WHEN** the user runs `kb add report.pdf --format json`
|
||||
- **THEN** the client SHALL output the JSON response from the engine including the job_id
|
||||
|
||||
#### Scenario: File not found
|
||||
- **WHEN** the user runs `kb add nonexistent.pdf`
|
||||
- **THEN** the client SHALL print an error and exit with a non-zero code without making any API call
|
||||
|
||||
#### Scenario: Upload failure
|
||||
- **WHEN** the upload fails (network error, engine returns 4xx/5xx other than 409)
|
||||
- **THEN** the client SHALL print the error and exit with a non-zero code
|
||||
Reference in New Issue
Block a user