kb/openspec/changes/archive/2026-03-26-upload-time-dedup-check/design.md at c5191df9c03f94483208f2acc81eb08ad5a7c203

steve 6fec627503 Upload-time duplicate detection, FTS5 query sanitization, release guard

- Reject duplicate uploads at the API boundary (HTTP 409) instead of
  silently skipping in the background worker. Checks both ingested
  documents and in-flight jobs via content_hash on the jobs table.
- Go client handles 409 with distinct messages for already-imported
  documents vs already-queued jobs.
- Sanitize FTS5 search queries by quoting each token to prevent syntax
  errors from special characters like ?, *, ", (), AND, OR, NOT.
- Add try/except safety net around FTS5 execute for edge cases.
- Add main branch guard to release.sh to prevent releasing from
  feature branches.
- Update specs and README to reflect new behaviour.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-26 23:05:07 +00:00

Context

The engine currently accepts all uploads with HTTP 202, stages the file, creates a job record, and relies on the background worker to detect duplicates via SHA256 content hash. When a duplicate is found, the worker marks the job as skipped — but the user has already received a success response and must poll job status to discover the duplicate. This creates unnecessary I/O (staging), pollutes the job list, and provides poor UX.

The documents table already has a content_hash TEXT UNIQUE column, and database.hash_exists() already exists. The infrastructure for dedup is in place — it just runs too late in the pipeline.

Goals / Non-Goals

Goals:

Reject duplicate uploads at the API boundary with HTTP 409 and useful context (existing document ID/title)
Avoid staging files or creating job records for duplicates
Apply to both file uploads and note submissions
Keep the worker-side hash check as a race condition safety net
Update the Go client to handle 409 and display a clear message

Non-Goals:

Fuzzy/near-duplicate detection (e.g., same PDF with different metadata) — byte-identical only
Changing the hash algorithm (SHA256 is fine)
Adding a "force re-import" override flag (can be added later if needed)
Dedup across different file formats with identical content (e.g., .md and .pdf of same text)

Decisions

1. Hash in the upload endpoint, before staging

Compute SHA256 from the uploaded bytes in submit_job() before calling stage_file(). This avoids writing to disk or creating a DB job record for duplicates.

Alternative considered: Hash after staging but before job creation. Rejected because it still wastes disk I/O for the staging write.
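The ordering in this decision can be sketched as follows. This is a simplified, hypothetical stand-in for the real submit_job(): the `known_hashes` set and `staged` list are illustration-only placeholders for the database lookup and stage_file(), and the response bodies are assumptions.

```python
import hashlib


def submit_job(data: bytes, filename: str, known_hashes: set, staged: list) -> tuple:
    """Hypothetical, simplified upload handler: hash first, stage second."""
    # Hash the bytes that are already in memory; nothing is written yet.
    content_hash = hashlib.sha256(data).hexdigest()

    # Duplicate: return before any staging or job creation happens.
    if content_hash in known_hashes:
        return 409, {"error": "duplicate"}

    # Not a duplicate: only now stage the file and record the hash.
    staged.append(filename)          # stands in for stage_file()
    known_hashes.add(content_hash)   # stands in for the job insert
    return 202, {"status": "queued"}
```

The point of the sketch is only the order of operations: the duplicate branch returns before anything touches disk or the jobs table.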

2. Return HTTP 409 Conflict with context-dependent metadata

The 409 response includes {"error": "duplicate", ...} with a distinct shape depending on where the duplicate was found:

Already-ingested document: {"error": "duplicate", "document_id": <id>, "title": "<title>"}
In-flight job (queued/processing): {"error": "duplicate", "job_id": <id>, "title": "<filename>"}

This allows clients to distinguish between "this document already exists" and "this document is already being processed" and display appropriate messages.

Alternative considered: Return 200 with a "status": "duplicate" field. Rejected because 409 is the semantically correct status code and allows clients to distinguish duplicates from successful uploads without parsing the body.
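As an illustration of the two response shapes, the sketch below builds the 409 body from a lookup result and shows how a client might turn it into a message. The function names and the message wording are hypothetical. The real client is written in Go, so this is only a Python rendering of the same branching.

```python
def duplicate_response(match: dict) -> tuple:
    """Build the (status, body) pair for a duplicate; `match` is an assumed lookup result."""
    body = {"error": "duplicate", "title": match["title"]}
    if "document_id" in match:
        body["document_id"] = match["document_id"]   # already ingested
    else:
        body["job_id"] = match["job_id"]             # queued or processing
    return 409, body


def describe_duplicate(body: dict) -> str:
    """Client-side view: pick a message based on which ID field is present."""
    if "document_id" in body:
        return f"already imported as document {body['document_id']} ({body['title']})"
    if "job_id" in body:
        return f"already queued as job {body['job_id']} ({body['title']})"
    return "duplicate upload"
```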

3. New database helper: `get_document_by_hash()`

Returns a dict with duplicate info for a given hash, or None. Checks both the documents table (already ingested) and the jobs table (queued/processing), returning document_id or job_id accordingly. The content_hash column on the jobs table is populated at upload time to support this check. The boolean hash_exists() is retained for the worker safety net.

Alternative considered: Modify hash_exists() to return the document row. Rejected to avoid changing the worker's existing interface — keep changes minimal.
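A minimal sketch of the helper, assuming a SQLite backend and a simplified schema. Only the content_hash columns and the queued/processing states come from this design. The other column names (`id`, `title`, `filename`, `status`) and the `make_demo_db()` helper are assumptions for illustration.

```python
import sqlite3
from typing import Optional


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with an assumed, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE documents (
            id INTEGER PRIMARY KEY,
            title TEXT,
            content_hash TEXT UNIQUE
        );
        CREATE TABLE jobs (
            id INTEGER PRIMARY KEY,
            filename TEXT,
            status TEXT,
            content_hash TEXT
        );
        """
    )
    return conn


def get_document_by_hash(conn: sqlite3.Connection, content_hash: str) -> Optional[dict]:
    """Return duplicate info for a hash, or None if the content is new."""
    # Already-ingested documents take priority.
    row = conn.execute(
        "SELECT id, title FROM documents WHERE content_hash = ?",
        (content_hash,),
    ).fetchone()
    if row is not None:
        return {"document_id": row[0], "title": row[1]}

    # Otherwise look for an in-flight job carrying the same hash.
    row = conn.execute(
        "SELECT id, filename FROM jobs "
        "WHERE content_hash = ? AND status IN ('queued', 'processing')",
        (content_hash,),
    ).fetchone()
    if row is not None:
        return {"job_id": row[0], "title": row[1]}
    return None
```

Finished or skipped jobs are deliberately ignored in the second query, so a previously skipped upload does not block a later retry on its own.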

4. Retain worker-side dedup as safety net

The worker's hash_exists() check stays. In theory, two identical uploads could arrive at the same instant and both pass the API hash check before either commits. The jobs-table check closes most of this window (the hash is written at job creation), but a narrow race remains between the API check and the job insert. The UNIQUE constraint on documents.content_hash is the final backstop.
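The final backstop can be illustrated with SQLite directly: a second insert carrying the same content_hash raises sqlite3.IntegrityError, which a worker can treat as "skip" rather than as a failure. The table layout and the `insert_document()` helper below are assumptions for the sketch, not the engine's actual code.

```python
import sqlite3


def open_demo_db() -> sqlite3.Connection:
    """In-memory database with an assumed documents table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        "id INTEGER PRIMARY KEY, title TEXT, content_hash TEXT UNIQUE)"
    )
    return conn


def insert_document(conn: sqlite3.Connection, title: str, content_hash: str) -> bool:
    """Insert a document; return False if the hash already exists."""
    try:
        # The connection context manager commits on success, rolls back on error.
        with conn:
            conn.execute(
                "INSERT INTO documents (title, content_hash) VALUES (?, ?)",
                (title, content_hash),
            )
        return True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint caught a duplicate that slipped past earlier checks.
        return False
```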

5. Note dedup: hash the text content

For notes submitted via the note field, SHA256-hash the UTF-8 encoded text. This catches identical note resubmissions.
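For illustration, hashing a note reduces to encoding the text and hashing the bytes. The function name `note_hash` is hypothetical.

```python
import hashlib


def note_hash(text: str) -> str:
    """SHA256 hex digest of the note's UTF-8 bytes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Because the comparison is byte-for-byte, two notes that differ only in trailing whitespace or a final newline produce different hashes and are not treated as duplicates, consistent with the byte-identical-only scope stated in the non-goals.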

Risks / Trade-offs

[Race condition window] Two identical files uploaded in the same millisecond could both pass the API hash check. → Mitigated by the worker-side hash_exists() check and the UNIQUE constraint. The second job would be marked as skipped rather than crash the worker.
[Blocking work in async endpoint] SHA256 hashing is CPU-bound and blocks the event loop while it runs, but it is fast (~5ms for 10MB). → Acceptable for the upload endpoint, which already reads the full file into memory. No need for run_in_executor.
[Client compatibility] Older clients not expecting 409 will see an error. → This is correct behavior — they'll see an HTTP error rather than silently accepting a duplicate. The Go client will be updated to handle it gracefully.