Upload-time duplicate detection, FTS5 query sanitization, release guard

- Reject duplicate uploads at the API boundary (HTTP 409) instead of
  silently skipping in the background worker. Checks both ingested
  documents and in-flight jobs via content_hash on the jobs table.
- Go client handles 409 with distinct messages for already-imported
  documents vs already-queued jobs.
- Sanitize FTS5 search queries by quoting each token to prevent syntax
  errors from special characters like ?, *, ", (), AND, OR, NOT.
- Add try/except safety net around FTS5 execute for edge cases.
- Add main branch guard to release.sh to prevent releasing from
  feature branches.
- Update specs and README to reflect new behaviour.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-03-26 23:05:07 +00:00
parent 63654a59b8
commit 6fec627503
20 changed files with 536 additions and 30 deletions
@@ -0,0 +1,40 @@
## Context
FTS5 has its own query syntax. Characters like `?`, `*`, `"`, `(`, `)`, `+`, `-`, `^` and keywords like `AND`, `OR`, `NOT`, `NEAR` have special meaning. The current code passes the raw user query to `chunks_fts MATCH ?` — parameterized (safe from SQL injection) but not safe from FTS5 syntax errors.
The fix point is `_fts_search()` in `engine/kb/search.py:92` where `params: list = [query]`.
## Goals / Non-Goals
**Goals:**
- Any user input to the search endpoint produces either valid results or an empty result set — never a 500 error
- Preserve the user's search intent as much as possible (don't over-strip)
**Non-Goals:**
- Exposing FTS5 advanced syntax to users (they can't use AND/OR/NEAR operators intentionally)
- Changing vector search (it already handles arbitrary strings via the embedding model)
## Decisions
### 1. Quote each token individually
Split the query on whitespace, wrap each token in double quotes (`"token"`), and join with spaces. FTS5 interprets double-quoted strings as literal phrases, disabling all operator parsing within them. Any embedded double quotes in a token are stripped.
Example: `what color is grass?` becomes `"what" "color" "is" "grass?"` — FTS5 treats `?` as a literal character inside quotes.
**Alternative considered**: Strip all non-alphanumeric characters. Rejected because it would break searches for terms containing hyphens, dots, or other meaningful punctuation (e.g., searching for "v2.0" or "self-hosted").
**Alternative considered**: Use a try/except to catch FTS5 errors and fall back. Rejected as a primary strategy because it silently degrades — but we'll add it as a safety net.
### 2. Handle empty/whitespace-only queries
If after sanitization no tokens remain, skip FTS search entirely and return empty results. This prevents sending an empty string to MATCH which would also error.
### 3. Try/except safety net
Wrap the FTS5 execute call in a try/except for `sqlite3.OperationalError`. If an edge case still slips through, return empty FTS results and log a warning rather than crashing with a 500.
## Risks / Trade-offs
- **[Reduced FTS expressiveness]** Users cannot use FTS5 operators like `AND`, `OR`, phrase matching. → Acceptable trade-off for a personal knowledge base tool where natural language queries are the norm. The hybrid search (vector + FTS) compensates.
- **[Edge cases]** Some Unicode or control characters might still cause issues. → The try/except safety net handles these.