Upload-time duplicate detection, FTS5 query sanitization, release guard

- Reject duplicate uploads at the API boundary (HTTP 409) instead of
  silently skipping in the background worker. Checks both ingested
  documents and in-flight jobs via content_hash on the jobs table.
- Go client handles 409 with distinct messages for already-imported
  documents vs already-queued jobs.
- Sanitize FTS5 search queries by quoting each token to prevent syntax
  errors from special characters (?, *, ", parentheses) and keywords
  (AND, OR, NOT).
- Add try/except safety net around FTS5 execute for edge cases.
- Add main branch guard to release.sh to prevent releasing from
  feature branches.
- Update specs and README to reflect new behaviour.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 23:05:07 +00:00
parent 63654a59b8
commit 6fec627503
20 changed files with 536 additions and 30 deletions
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-03-26
@@ -0,0 +1,40 @@
## Context
FTS5 has its own query syntax. Characters like `?`, `*`, `"`, `(`, `)`, `+`, `-`, `^` are either operators or not allowed in unquoted terms, and keywords like `AND`, `OR`, `NOT`, `NEAR` have special meaning. The current code passes the raw user query to `chunks_fts MATCH ?`. Parameter binding makes this safe from SQL injection, but FTS5 still parses the bound string as a query expression, so it is not safe from FTS5 syntax errors.
The fix point is `_fts_search()` in `engine/kb/search.py:92` where `params: list = [query]`.
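The failure mode can be reproduced outside the engine. The following is a minimal sketch, assuming a Python build whose `sqlite3` module includes FTS5; the single-column `chunks_fts` table and the `raw_match` helper are hypothetical stand-ins, not the engine's actual schema or code.

```python
import sqlite3

# Hypothetical in-memory stand-in for the engine's chunks_fts table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE chunks_fts USING fts5(content)")
conn.execute("INSERT INTO chunks_fts (content) VALUES ('grass is green')")


def raw_match(query):
    """Pass the raw user string to MATCH, as the pre-fix code does."""
    try:
        return conn.execute(
            "SELECT content FROM chunks_fts WHERE chunks_fts MATCH ?",
            [query],
        ).fetchall()
    except sqlite3.OperationalError as exc:
        # Binding the parameter does not stop FTS5 from parsing its contents.
        return f"OperationalError: {exc}"


print(raw_match("grass"))
print(raw_match("what color is grass?"))
```

The first call returns the matching row; the second is rejected by the FTS5 query parser even though the value was bound as a parameter.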
## Goals / Non-Goals
**Goals:**
- Any user input to the search endpoint produces either valid results or an empty result set — never a 500 error
- Preserve the user's search intent as much as possible (don't over-strip)
**Non-Goals:**
- Exposing FTS5 advanced syntax to users (they can't use AND/OR/NEAR operators intentionally)
- Changing vector search (it already handles arbitrary strings via the embedding model)
## Decisions
### 1. Quote each token individually
Split the query on whitespace, wrap each token in double quotes (`"token"`), and join with spaces. FTS5 interprets double-quoted strings as literal phrases, disabling all operator parsing within them. Any embedded double quotes in a token are stripped.
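Example: `what color is grass?` becomes `"what" "color" "is" "grass?"`. Inside quotes, FTS5 no longer parses `?` as query syntax; the quoted text is handed to the tokenizer, which decides how to treat it (the default `unicode61` tokenizer treats punctuation as a separator).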
**Alternative considered**: Strip all non-alphanumeric characters. Rejected because it would break searches for terms containing hyphens, dots, or other meaningful punctuation (e.g., searching for "v2.0" or "self-hosted").
**Alternative considered**: Use a try/except to catch FTS5 errors and fall back. Rejected as a primary strategy because it silently degrades — but we'll add it as a safety net.
### 2. Handle empty/whitespace-only queries
If no tokens remain after sanitization, skip FTS search entirely and return empty results. This prevents sending an empty string to MATCH, which would also raise a syntax error.
### 3. Try/except safety net
Wrap the FTS5 execute call in a try/except for `sqlite3.OperationalError`. If an edge case still slips through, return empty FTS results and log a warning rather than crashing with a 500.
## Risks / Trade-offs
- **[Reduced FTS expressiveness]** Users cannot use FTS5 operators such as `AND` and `OR`, or multi-word phrase matching. → Acceptable trade-off for a personal knowledge base tool where natural language queries are the norm. The hybrid search (vector + FTS) compensates.
- **[Edge cases]** Some Unicode or control characters might still cause issues. → The try/except safety net handles these.
@@ -0,0 +1,24 @@
## Why
Searching with natural language queries containing characters like `?`, `"`, `*`, `(`, `)`, `-`, or FTS5 keywords (`AND`, `OR`, `NOT`, `NEAR`) causes a 500 error because the raw query string is passed directly to `chunks_fts MATCH ?` without escaping. Users should be able to type anything into a search query without triggering syntax errors.
## What Changes
- **Sanitize FTS5 query input**: Neutralize FTS5 special characters and keywords in the user's query (by quoting or stripping them) before passing it to the MATCH operator
- **Graceful fallback**: If the sanitized query produces no valid FTS5 terms, return empty results from FTS instead of erroring
## Capabilities
### New Capabilities
_(none)_
### Modified Capabilities
- `engine-api`: The "Hybrid search" requirement changes — the engine must sanitize user queries to prevent FTS5 syntax errors for any input
## Impact
- **Engine search** (`engine/kb/search.py`): `_fts_search()` needs query sanitization before the MATCH parameter
- **No client changes**: The client already displays results or errors correctly
- **No schema changes**: No database modifications needed
@@ -0,0 +1,21 @@
## MODIFIED Requirements
### Requirement: Hybrid search
The engine SHALL provide hybrid search combining BM25 full-text search (via FTS5) and vector similarity search (via sqlite-vec), merged using Reciprocal Rank Fusion. Search SHALL complete in under 100ms when the model is warm. The engine SHALL sanitize user query strings to prevent FTS5 syntax errors for any input.
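For orientation, Reciprocal Rank Fusion scores each result as the sum of `1 / (k + rank)` over the rankings it appears in. The sketch below is illustrative only: the function name, the list-of-IDs input shape, and `k = 60` (a commonly used constant) are assumptions, not values taken from this spec or the engine's code.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked ID lists; a higher fused score means a better result.

    `rankings` is a list of lists of IDs, each ordered best-first.
    `k` dampens the influence of top ranks; 60 is an assumed default.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


fts_ranking = [1, 2, 3]     # hypothetical BM25 order
vector_ranking = [3, 1, 4]  # hypothetical vector-similarity order
print(reciprocal_rank_fusion([fts_ranking, vector_ranking]))
print(reciprocal_rank_fusion([[], vector_ranking]))
```

The second call shows why an empty FTS result set is a safe fallback: the fused ranking simply follows the vector ranking.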
#### Scenario: Search with special characters
- **WHEN** a client sends `POST /api/v1/search` with body `{"query": "what color is grass?"}`
- **THEN** the engine SHALL sanitize the query for FTS5, execute the search successfully, and return results (not a 500 error)
#### Scenario: Search with FTS5 operators in query
- **WHEN** a client sends `POST /api/v1/search` with body `{"query": "NOT something OR (other)"}`
- **THEN** the engine SHALL treat the input as literal search terms, not FTS5 operators, and return matching results
#### Scenario: Search with only special characters
- **WHEN** a client sends `POST /api/v1/search` with body `{"query": "??!@#"}`
- **THEN** the engine SHALL return HTTP 200 with an empty result set (not a 500 error)
#### Scenario: Search with quotes in query
- **WHEN** a client sends `POST /api/v1/search` with body `{"query": "the \"quick\" fox"}`
- **THEN** the engine SHALL sanitize embedded quotes and return results normally
@@ -0,0 +1,15 @@
## 1. Query Sanitization
- [x] 1.1 Add `_sanitize_fts_query(query)` function to `engine/kb/search.py` that splits on whitespace, strips double quotes from each token, wraps each token in double quotes, and joins with spaces
- [x] 1.2 Handle edge case: if no valid tokens remain after sanitization, return empty dict from `_fts_search` without executing the query
## 2. Integration
- [x] 2.1 Call `_sanitize_fts_query()` in `_fts_search()` before adding the query to params (line 92)
- [x] 2.2 Add try/except `sqlite3.OperationalError` around the FTS5 execute call — log a warning and return empty results on error
## 3. Testing
- [x] 3.1 Test: `kb search "what color is grass?"` returns results, not a 500 error
- [x] 3.2 Test: `kb search "NOT something OR (other)"` returns results, treating input as literal terms
- [x] 3.3 Test: query with only special characters returns empty results, not an error