Upload-time duplicate detection, FTS5 query sanitization, release guard

- Reject duplicate uploads at the API boundary (HTTP 409) instead of silently skipping in the background worker. Checks both ingested documents and in-flight jobs via content_hash on the jobs table.
- Go client handles 409 with distinct messages for already-imported documents vs already-queued jobs.
- Sanitize FTS5 search queries by quoting each token to prevent syntax errors from special characters like ?, *, ", (), AND, OR, NOT.
- Add try/except safety net around FTS5 execute for edge cases.
- Add main branch guard to release.sh to prevent releasing from feature branches.
- Update specs and README to reflect new behaviour.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,40 @@
## Context
FTS5 has its own query syntax. Characters such as `?`, `*`, `"`, `(`, `)`, `+`, `-`, `^` and keywords such as `AND`, `OR`, `NOT`, `NEAR` are either operators or are not valid in an unquoted term. The current code passes the raw user query to `chunks_fts MATCH ?`. The query is parameterized, so it is safe from SQL injection, but it is not safe from FTS5 syntax errors.

The fix point is `_fts_search()` in `engine/kb/search.py:92`, where `params: list = [query]`.
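To make the failure mode concrete, here is a minimal reproduction sketch, assuming the local SQLite build has FTS5 enabled. Only the table name `chunks_fts` comes from the text above; the single `content` column, the `match_error` helper, and the sample row are made up for illustration.

```python
import sqlite3
from typing import Optional

# Throwaway in-memory index. The "content" column is an assumption,
# not the project's real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE chunks_fts USING fts5(content)")
conn.execute("INSERT INTO chunks_fts (content) VALUES ('grass is green')")

SQL = "SELECT content FROM chunks_fts WHERE chunks_fts MATCH ?"


def match_error(query: str) -> Optional[str]:
    """Run a MATCH query; return SQLite's error message, or None if it ran."""
    try:
        conn.execute(SQL, (query,)).fetchall()
    except sqlite3.OperationalError as exc:
        return str(exc)
    return None


# Binding the value as a parameter does not help: the string is still
# parsed as an FTS5 query expression, and the bare "?" is rejected.
raw_error = match_error("what color is grass?")

# The same words, each wrapped in double quotes, parse without error.
quoted_error = match_error('"what" "color" "is" "grass?"')

print("raw query error:", raw_error)
print("quoted query error:", quoted_error)
```

The raw query fails inside SQLite even though it was bound as a parameter, which is why the fix has to rewrite the query text rather than change how it is passed.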
## Goals / Non-Goals
**Goals:**
- Any user input to the search endpoint produces either valid results or an empty result set — never a 500 error
- Preserve the user's search intent as much as possible (don't over-strip)

**Non-Goals:**
- Exposing FTS5 advanced syntax to users (they can't use AND/OR/NEAR operators intentionally)
- Changing vector search (it already handles arbitrary strings via the embedding model)
## Decisions
### 1. Quote each token individually
Split the query on whitespace, wrap each token in double quotes (`"token"`), and join the tokens with spaces. FTS5 treats a double-quoted string as a literal phrase, so no operator parsing happens inside it. Any double quotes embedded in a token are stripped first.

Example: `what color is grass?` becomes `"what" "color" "is" "grass?"`. Inside quotes, `?` is no longer parsed as query syntax; it is handed to the tokenizer with the rest of the token (FTS5's default `unicode61` tokenizer treats punctuation as a separator, so `"grass?"` matches the term `grass`). FTS5 implicitly ANDs space-separated phrases, so all four must match.

**Alternative considered**: Strip all non-alphanumeric characters. Rejected because it would break searches for terms containing hyphens, dots, or other meaningful punctuation (e.g., searching for "v2.0" or "self-hosted").

**Alternative considered**: Use a try/except to catch FTS5 errors and fall back. Rejected as a primary strategy because it silently degrades, but we'll add it as a safety net (see Decision 3).
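The quoting rule in Decision 1 can be sketched as a small pure function. The name `sanitize_fts_query` is hypothetical; the actual helper in `engine/kb/search.py` may be named and structured differently.

```python
def sanitize_fts_query(query: str) -> str:
    """Quote each whitespace-separated token so FTS5 reads it as a literal phrase."""
    quoted = []
    for token in query.split():
        # Strip embedded double quotes so a token cannot close its own phrase early.
        cleaned = token.replace('"', "")
        if cleaned:  # a token made only of quotes leaves nothing to search for
            quoted.append(f'"{cleaned}"')
    return " ".join(quoted)


print(sanitize_fts_query("what color is grass?"))
print(sanitize_fts_query('say "hello" AND (goodbye OR NOT)'))
```

Operator keywords and parentheses end up inside quotes like any other token, so `AND` is searched for as a word and is no longer read as an operator. An input with no usable tokens yields an empty string, which the caller can use as the signal to skip the FTS query.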
### 2. Handle empty/whitespace-only queries
If no tokens remain after sanitization, skip the FTS search entirely and return empty results. This avoids sending an empty string to `MATCH`, which would also raise an error.
### 3. Try/except safety net
Wrap the FTS5 execute call in a try/except for `sqlite3.OperationalError`. If an edge case still slips through, return empty FTS results and log a warning rather than crashing with a 500.
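Decisions 2 and 3 together could look roughly like the sketch below. It is not the real `_fts_search()`: the function name `guarded_fts_search`, the `quote_tokens` helper, the `content` column, and the `limit` parameter are assumptions for illustration, and the demo table requires an SQLite build with FTS5.

```python
import logging
import sqlite3

logger = logging.getLogger(__name__)


def quote_tokens(query: str) -> str:
    # Same quoting rule as Decision 1, inlined to keep this block self-contained.
    tokens = (t.replace('"', "") for t in query.split())
    return " ".join(f'"{t}"' for t in tokens if t)


def guarded_fts_search(conn: sqlite3.Connection, query: str, limit: int = 10) -> list:
    match_expr = quote_tokens(query)
    if not match_expr:
        # Decision 2: nothing left to search for, so never reach MATCH.
        return []
    try:
        cur = conn.execute(
            "SELECT rowid, content FROM chunks_fts WHERE chunks_fts MATCH ? LIMIT ?",
            (match_expr, limit),
        )
        # fetchall() stays inside the try so errors raised while stepping
        # through rows are caught as well.
        return cur.fetchall()
    except sqlite3.OperationalError as exc:
        # Decision 3: log and degrade to "no FTS hits" instead of a 500.
        logger.warning("FTS5 query failed for %r: %s", match_expr, exc)
        return []


# Demo index with an assumed single-column schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE chunks_fts USING fts5(content)")
conn.execute("INSERT INTO chunks_fts (content) VALUES ('what color is grass? green')")

print(guarded_fts_search(conn, "what color is grass?"))
print(guarded_fts_search(conn, "   "))
```

One consequence of catching `sqlite3.OperationalError` this broadly is that unrelated failures, such as a missing table, are also reduced to an empty result. That makes the warning log the only place such problems show up.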
## Risks / Trade-offs
- **[Reduced FTS expressiveness]** Users cannot use FTS5 operators such as `AND` and `OR`, or multi-word phrase matching. → Acceptable trade-off for a personal knowledge base tool where natural language queries are the norm. The hybrid search (vector + FTS) compensates.
- **[Edge cases]** Some Unicode or control characters might still cause issues. → The try/except safety net handles these.