kb/openspec/changes/kb-search/specs/hybrid-search/spec.md
2026-03-23 20:38:42 +00:00

ADDED Requirements

Requirement: Full-text search via FTS5

The system SHALL maintain an FTS5 virtual table synchronised with the chunks table via triggers. FTS5 SHALL use the porter unicode61 tokenizer for stemming and Unicode support. Queries SHALL be passed to FTS5 with special characters escaped.

  • WHEN user runs kb search "install git"
  • THEN FTS5 returns chunks containing "install" and/or "git" (including stemmed variants like "installation"), ranked by BM25 score

Scenario: FTS-only mode

  • WHEN user runs kb search "install git" --fts-only
  • THEN only FTS5 results are returned, no vector search is performed
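The FTS5 requirement above can be sketched with Python's built-in sqlite3 module, assuming the local SQLite build has FTS5 compiled in. The table name chunks_fts, the column names, the trigger names, and the helpers open_index, escape_fts5 and fts_search are hypothetical and are not taken from the spec. Joining the quoted terms with OR is also an assumption, chosen to match the "and/or" wording of the scenario.

```python
import sqlite3

# Hypothetical schema: an external-content FTS5 table kept in sync by triggers.
SCHEMA = """
CREATE TABLE chunks (id INTEGER PRIMARY KEY, content TEXT NOT NULL);
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    content, content='chunks', content_rowid='id', tokenize='porter unicode61'
);
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, content) VALUES (new.id, new.content);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, content)
    VALUES ('delete', old.id, old.content);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, content)
    VALUES ('delete', old.id, old.content);
    INSERT INTO chunks_fts(rowid, content) VALUES (new.id, new.content);
END;
"""


def open_index(path=":memory:"):
    """Open a database and create the assumed schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def escape_fts5(query):
    """Quote every term so FTS5 operators and punctuation are read literally."""
    terms = ['"' + term.replace('"', '""') + '"' for term in query.split()]
    return " OR ".join(terms)


def fts_search(conn, query, limit=10):
    """Return (chunk_id, bm25_score) pairs, best match first."""
    match = escape_fts5(query)
    if not match:
        return []
    sql = """
        SELECT rowid, bm25(chunks_fts) AS score
        FROM chunks_fts
        WHERE chunks_fts MATCH ?
        ORDER BY score
        LIMIT ?
    """
    return conn.execute(sql, (match, limit)).fetchall()
```

FTS5's bm25() returns smaller (more negative) values for better matches, which is why the sketch sorts in ascending order. A call such as fts_search(conn, "install git") would then return the matching chunk ids with the strongest match first.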

Requirement: Vector similarity search via sqlite-vec

The system SHALL embed the query using the same model that was used to embed stored chunks. The embedded query SHALL be compared against all chunk embeddings in chunks_vec using cosine similarity. The system SHALL retrieve 3× the requested result count as candidates for RRF merging.

  • WHEN user runs kb search "how to set up version control"
  • THEN the query is embedded and compared against stored vectors, returning semantically similar chunks even if they don't contain the exact words "version control"

Scenario: Vector-only mode

  • WHEN user runs kb search "how to set up version control" --vec-only
  • THEN only vector similarity results are returned, no FTS search is performed
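The candidate retrieval described in this requirement can be illustrated without sqlite-vec. The sketch below is a plain-Python stand-in, not the sqlite-vec API: it scores every stored vector by cosine similarity and keeps 3× the requested count. The function names and the oversample parameter are hypothetical, and the in-memory dict stands in for the chunks_vec table.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def vector_candidates(query_vec, chunk_vecs, top=10, oversample=3):
    """Return up to top * oversample (chunk_id, similarity) pairs, best first.

    chunk_vecs maps chunk id -> embedding; it stands in for chunks_vec.
    """
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[: top * oversample]
```

Fetching more candidates than the final result count gives the later merge step room to promote chunks that rank moderately in both searches.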

Requirement: Reciprocal Rank Fusion merging

The system SHALL merge FTS5 and vector search results using Reciprocal Rank Fusion (RRF). The RRF formula SHALL be: score(d) = Σ 1/(k + rank_i(d)), summed over each result list i in which chunk d appears, where rank_i(d) is the 1-based rank of d in list i and k is configurable (default: 60). Results SHALL be sorted by descending RRF score.

Scenario: Hybrid search combines both signals

  • WHEN user runs kb search "install git" (default hybrid mode)
  • THEN the system runs both FTS5 and vector searches, merges results via RRF, and returns results sorted by combined score

Scenario: Document appears in both result sets

  • WHEN a chunk ranks #2 in FTS5 and #5 in vector search
  • THEN its RRF score SHALL be 1/(60+2) + 1/(60+5) ≈ 0.0161 + 0.0154 = 0.0315, higher than any chunk appearing in only one result set (which scores at most 1/(60+1) ≈ 0.0164)
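The merge step can be sketched in a few lines of Python. The function name rrf_merge and the input shape (one list of chunk ids per search, best first) are assumptions made for illustration. Ranks are taken as 1-based, which is what the worked scenario above implies.

```python
from collections import defaultdict


def rrf_merge(ranked_lists, k=60):
    """Merge ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each chunk contributes 1 / (k + rank) for every list it appears in.
    Returns (chunk_id, score) pairs sorted by descending score.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
```

For example, with fts = ["a", "x", "c"] and vec = ["b", "d", "e", "f", "x"], chunk "x" sits at rank 2 and rank 5, so rrf_merge([fts, vec]) places it first with a score of 1/62 + 1/65.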

Requirement: Tag-based filtering

The system SHALL support filtering search results by one or more tags. When multiple tags are specified, the filter SHALL use AND logic (document must have ALL specified tags). Tag filtering SHALL be applied in the SQL query via JOIN for efficiency.

Scenario: Filter by single tag

  • WHEN user runs kb search "deploy" --tags ops
  • THEN only chunks from documents tagged with "ops" are included in results

Scenario: Filter by multiple tags

  • WHEN user runs kb search "deploy" --tags ops,production
  • THEN only chunks from documents tagged with BOTH "ops" AND "production" are included
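One way to express the AND semantics in SQL is a JOIN followed by GROUP BY and a HAVING clause that counts distinct matched tags. The sketch below assumes a hypothetical schema (a document_tags table and a document_id column on chunks); neither name comes from the spec.

```python
import sqlite3

# Hypothetical schema used only for this illustration.
SCHEMA = """
CREATE TABLE chunks (id INTEGER PRIMARY KEY, document_id INTEGER NOT NULL);
CREATE TABLE document_tags (document_id INTEGER NOT NULL, tag TEXT NOT NULL);
"""


def chunks_with_all_tags(conn, tags):
    """Return ids of chunks whose document carries every tag in `tags`."""
    wanted = sorted(set(tags))
    if not wanted:
        # No tag filter requested: every chunk qualifies.
        return [row[0] for row in conn.execute("SELECT id FROM chunks ORDER BY id")]
    placeholders = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT c.id
        FROM chunks AS c
        JOIN document_tags AS t ON t.document_id = c.document_id
        WHERE t.tag IN ({placeholders})
        GROUP BY c.id
        HAVING COUNT(DISTINCT t.tag) = ?
        ORDER BY c.id
    """
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]
```

A value such as "ops,production" from --tags could be split on commas and passed in as ["ops", "production"]. Only the placeholder list is interpolated into the SQL text; the tag values themselves are bound as parameters.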

Requirement: Type-based filtering

The system SHALL support filtering search results by document type. Valid types: pdf, markdown, code, note.

Scenario: Filter by type

  • WHEN user runs kb search "deploy" --type code
  • THEN only chunks from code documents are included in results
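A small sketch of how the --type value might be validated against the four types listed above before filtering. The function names and the result shape (dicts with a "type" key) are hypothetical, and rejecting unknown types with an error is an assumption, since the spec does not say how invalid values are handled.

```python
VALID_TYPES = ("pdf", "markdown", "code", "note")


def validate_doc_type(value):
    """Return the type unchanged, or raise if it is not one of the valid types."""
    if value not in VALID_TYPES:
        raise ValueError(
            f"unknown type {value!r}; expected one of {', '.join(VALID_TYPES)}"
        )
    return value


def filter_by_type(results, doc_type):
    """Keep only results whose document type equals doc_type."""
    wanted = validate_doc_type(doc_type)
    return [result for result in results if result["type"] == wanted]
```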

Requirement: Score threshold

The system SHALL support a minimum score cutoff. Results with an RRF score below the threshold SHALL be excluded from output.

Scenario: Apply score threshold

  • WHEN user runs kb search "deploy" --threshold 0.02
  • THEN only results with RRF score >= 0.02 are returned
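The cutoff itself is a simple inclusive comparison, sketched below with a hypothetical apply_threshold helper that takes (chunk_id, score) pairs. One consequence is worth keeping in mind: with the default k = 60, a chunk found by only one search scores at most 1/61 ≈ 0.0164, so a threshold of 0.02 keeps only chunks that appear in both result sets.

```python
def apply_threshold(scored, threshold):
    """Keep (chunk_id, score) pairs whose score is at least `threshold`."""
    return [(chunk_id, score) for chunk_id, score in scored if score >= threshold]
```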

Requirement: Result count control

The system SHALL return at most a configurable number of results (default: 10), set via the --top flag or search.default_top in config.

Scenario: Request specific number of results

  • WHEN user runs kb search "deploy" --top 5
  • THEN at most 5 results are returned

Scenario: Fewer matches than requested

  • WHEN user searches and only 3 chunks match
  • THEN the system returns 3 results without error, with returned: 3 in the output
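The two scenarios above can be sketched as a truncation step that also reports how many results were actually returned. The helper name limit_results and the dict shape are hypothetical; only the returned field is named in the spec, and rejecting a negative --top value is an assumption.

```python
def limit_results(scored, top=10):
    """Truncate an already sorted result list and report the returned count."""
    if top < 0:
        raise ValueError("top must not be negative")
    results = list(scored[:top])
    return {"returned": len(results), "results": results}
```

When fewer chunks match than were requested, slicing simply yields the shorter list, so no special case is needed to avoid an error.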