kb/openspec/changes/kb-search/specs/hybrid-search/spec.md
2026-03-23 20:38:42 +00:00

## ADDED Requirements
### Requirement: Full-text search via FTS5
The system SHALL maintain an FTS5 virtual table synchronised with the chunks table via triggers. FTS5 SHALL use the `porter unicode61` tokenizer for stemming and unicode support. Queries SHALL be passed to FTS5 with special characters escaped.
#### Scenario: Keyword search
- **WHEN** user runs `kb search "install git"`
- **THEN** FTS5 returns chunks containing "install" and/or "git" (including stemmed variants like "installation"), ranked by BM25 score
#### Scenario: FTS-only mode
- **WHEN** user runs `kb search "install git" --fts-only`
- **THEN** only FTS5 results are returned, no vector search is performed
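
The FTS5 requirement could be sketched as follows. This is a minimal illustration, not the project's actual schema: the table name `chunks_fts`, the `content` column, the trigger names, and the helper `escape_fts5_query` are all assumptions. The trigger layout follows the external-content pattern described in the SQLite FTS5 documentation, and the escaping helper joins terms with `OR` so that chunks containing either word can match.

```python
# Hypothetical schema: the spec names the chunks table but not the FTS table
# or its columns, so chunks_fts and content are illustrative only.
SCHEMA_SQL = """
CREATE TABLE chunks (id INTEGER PRIMARY KEY, content TEXT NOT NULL);

CREATE VIRTUAL TABLE chunks_fts USING fts5(
    content,
    content='chunks',
    content_rowid='id',
    tokenize='porter unicode61'
);

-- Triggers keep the external-content index in step with chunks.
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, content) VALUES (new.id, new.content);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, content)
    VALUES ('delete', old.id, old.content);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, content)
    VALUES ('delete', old.id, old.content);
    INSERT INTO chunks_fts(rowid, content) VALUES (new.id, new.content);
END;
"""

# bm25() returns lower values for better matches, so ascending order
# puts the best match first.
SEARCH_SQL = """
SELECT rowid AS chunk_id, bm25(chunks_fts) AS score
FROM chunks_fts
WHERE chunks_fts MATCH ?
ORDER BY score
LIMIT ?
"""


def escape_fts5_query(raw: str) -> str:
    """Turn free text into an FTS5 query with every term quoted.

    Each whitespace-separated term becomes an FTS5 string literal, with
    embedded double quotes doubled, so characters such as * : ( ) and the
    keywords AND / OR / NOT are treated as plain text. Returns an empty
    string for blank input; the caller should skip the MATCH in that case.
    """
    terms = raw.split()
    quoted = ['"' + term.replace('"', '""') + '"' for term in terms]
    return " OR ".join(quoted)
```

With this helper, `kb search "install git"` would bind `"install" OR "git"` as the `MATCH` parameter of `SEARCH_SQL`.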
### Requirement: Vector similarity search via sqlite-vec
The system SHALL embed the query using the same model that was used to embed stored chunks. The embedded query SHALL be compared against all chunk embeddings in `chunks_vec` using cosine similarity. The system SHALL retrieve 3× the requested result count as candidates for RRF merging.
#### Scenario: Semantic search
- **WHEN** user runs `kb search "how to set up version control"`
- **THEN** the query is embedded and compared against stored vectors, returning semantically similar chunks even if they don't contain the exact words "version control"
#### Scenario: Vector-only mode
- **WHEN** user runs `kb search "how to set up version control" --vec-only`
- **THEN** only vector similarity results are returned, no FTS search is performed
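
To make the candidate-retrieval rule concrete, here is a brute-force sketch in plain Python. It is a stand-in for whatever sqlite-vec query the implementation uses and does not call the sqlite-vec API; the function names and the `oversample` parameter are hypothetical, and embeddings are assumed to be plain lists of floats.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def vector_candidates(query_vec, chunk_vecs, top, oversample=3):
    """Return up to top * oversample (chunk_id, similarity) pairs, best first.

    chunk_vecs maps chunk id -> embedding. The oversample factor of 3
    mirrors the "3x the requested result count" rule in the requirement.
    """
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[: top * oversample]
```

For a request with `--top 10`, this would hand 30 vector candidates to the RRF merge step.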
### Requirement: Reciprocal Rank Fusion merging
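The system SHALL merge FTS5 and vector search results using Reciprocal Rank Fusion (RRF). The RRF formula SHALL be: `score(d) = Σ 1/(k + rank_i(d))`, summed over each result list `i` (FTS5, vector) in which chunk `d` appears, where `rank_i(d)` is the 1-based rank of `d` in list `i` and `k` is configurable (default: 60). Results SHALL be sorted by descending RRF score.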
#### Scenario: Hybrid search combines both signals
- **WHEN** user runs `kb search "install git"` (default hybrid mode)
- **THEN** the system runs both FTS5 and vector searches, merges results via RRF, and returns results sorted by combined score
#### Scenario: Document appears in both result sets
- **WHEN** a chunk ranks #2 in FTS5 and #5 in vector search
- **THEN** its RRF score (with the default `k = 60`) SHALL be `1/(60+2) + 1/(60+5) ≈ 0.0161 + 0.0154 = 0.0315`, higher than the `1/(60+1) ≈ 0.0164` maximum available to a chunk appearing in only one result set
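
The merge step can be sketched as below. The function name `rrf_merge` and the input shape (one list of chunk ids per search, best first) are assumptions for illustration; ranks are taken as 1-based, which matches the worked scenario above.

```python
def rrf_merge(rankings, k=60):
    """Merge ranked lists of chunk ids with Reciprocal Rank Fusion.

    rankings is an iterable of lists, each ordered best-first. A chunk's
    score is the sum of 1 / (k + rank) over every list it appears in,
    with rank starting at 1. Returns (chunk_id, score) pairs sorted by
    descending score.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Chunk "x" ranks #2 in the FTS5 list and #5 in the vector list.
fts_ranking = ["a", "x", "c", "d", "e"]
vec_ranking = ["f", "g", "h", "i", "x"]
merged = rrf_merge([fts_ranking, vec_ranking])
```

Here `merged[0]` is chunk `"x"` with a score of about 0.0315, ahead of `"a"` and `"f"`, which each score `1/61` from a single list.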
### Requirement: Tag-based filtering
The system SHALL support filtering search results by one or more tags. When multiple tags are specified, the filter SHALL use AND logic (document must have ALL specified tags). Tag filtering SHALL be applied in the SQL query via JOIN for efficiency.
#### Scenario: Filter by single tag
- **WHEN** user runs `kb search "deploy" --tags ops`
- **THEN** only chunks from documents tagged with "ops" are included in results
#### Scenario: Filter by multiple tags
- **WHEN** user runs `kb search "deploy" --tags ops,production`
- **THEN** only chunks from documents tagged with BOTH "ops" AND "production" are included
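
One way to express the AND semantics in SQL is a JOIN on the tag table with a `HAVING COUNT(DISTINCT ...)` check. The sketch below assumes a hypothetical `document_tags(document_id, tag)` table and a `document_id` column on `chunks`; the spec does not define these, and the demo data exists only to make the example runnable.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical tagging schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE document_tags (document_id INTEGER, tag TEXT);
        CREATE TABLE chunks (
            id INTEGER PRIMARY KEY, document_id INTEGER, content TEXT
        );
        INSERT INTO documents VALUES (1, 'runbook'), (2, 'oncall'), (3, 'release');
        INSERT INTO document_tags VALUES
            (1, 'ops'), (1, 'production'), (2, 'ops'), (3, 'production');
        INSERT INTO chunks VALUES
            (1, 1, 'deploy a'), (2, 1, 'deploy b'),
            (3, 2, 'deploy c'), (4, 3, 'deploy d');
    """)
    return conn


def chunks_with_all_tags(conn, tags):
    """Return ids of chunks whose document carries every tag in tags."""
    tags = list(dict.fromkeys(tags))  # drop duplicates, keep order
    if not tags:
        return [row[0] for row in conn.execute("SELECT id FROM chunks ORDER BY id")]
    placeholders = ",".join("?" for _ in tags)
    sql = f"""
        SELECT c.id
        FROM chunks AS c
        JOIN document_tags AS dt ON dt.document_id = c.document_id
        WHERE dt.tag IN ({placeholders})
        GROUP BY c.id
        HAVING COUNT(DISTINCT dt.tag) = ?
        ORDER BY c.id
    """
    return [row[0] for row in conn.execute(sql, [*tags, len(tags)])]
```

In the demo data, `--tags ops` would keep chunks 1, 2 and 3, while `--tags ops,production` would keep only chunks 1 and 2, which belong to the one document carrying both tags.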
### Requirement: Type-based filtering
The system SHALL support filtering search results by document type. Valid types: `pdf`, `markdown`, `code`, `note`.
#### Scenario: Filter by type
- **WHEN** user runs `kb search "deploy" --type code`
- **THEN** only chunks from code documents are included in results
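
A small sketch of the type filter follows. The spec lists the valid types but does not say how an invalid `--type` value should be handled, so raising an error here is an assumption, as are the function names and the dict-based chunk shape.

```python
VALID_TYPES = frozenset({"pdf", "markdown", "code", "note"})


def validate_type(value):
    """Return value if it is a known document type, else raise ValueError."""
    if value not in VALID_TYPES:
        allowed = ", ".join(sorted(VALID_TYPES))
        raise ValueError(f"unknown type {value!r}; expected one of: {allowed}")
    return value


def filter_by_type(chunks, doc_type):
    """Keep only chunks whose parent document has the given type.

    chunks is a list of dicts with a "type" key (hypothetical shape).
    """
    doc_type = validate_type(doc_type)
    return [chunk for chunk in chunks if chunk["type"] == doc_type]
```

In a real query the same check would more likely be a `WHERE` clause on the documents table; the in-memory version only illustrates the behaviour.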
### Requirement: Score threshold
The system SHALL support a minimum score cutoff. Results with an RRF score below the threshold SHALL be excluded from output.
#### Scenario: Apply score threshold
- **WHEN** user runs `kb search "deploy" --threshold 0.02`
- **THEN** only results with RRF score >= 0.02 are returned
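
The cutoff itself is a one-line filter, sketched below with hypothetical function names. The second helper shows a consequence of the RRF formula worth keeping in mind when choosing a threshold: a chunk that appears in only one result list can score at most `1/(k+1)`, about 0.0164 with the default `k = 60`, so a threshold of 0.02 can only be met by chunks found by both searches.

```python
def apply_threshold(results, threshold):
    """Keep (chunk_id, score) pairs whose score is at least threshold."""
    return [(chunk_id, score) for chunk_id, score in results if score >= threshold]


def max_rrf_score(num_lists, k=60):
    """Highest RRF score possible for a chunk ranked #1 in num_lists lists."""
    return num_lists / (k + 1)
```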
### Requirement: Result count control
The system SHALL return at most N results, where N defaults to 10 and can be set via the `--top` flag or `search.default_top` in config.
#### Scenario: Request specific number of results
- **WHEN** user runs `kb search "deploy" --top 5`
- **THEN** at most 5 results are returned
#### Scenario: Fewer matches than requested
- **WHEN** user searches and only 3 chunks match
- **THEN** the system returns 3 results without error, with `returned: 3` in the output
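
Result limiting could look like the sketch below. Only the `returned` field comes from the spec; the function name, the `results` key, and the rule that an explicit `--top` value takes precedence over the configured default are assumptions.

```python
def limit_results(results, top=None, default_top=10):
    """Truncate results to the requested count and report how many remain.

    top models the --top flag (None when not given); default_top models
    search.default_top from config.
    """
    limit = default_top if top is None else top
    if limit < 1:
        raise ValueError("result count must be at least 1")
    selected = list(results[:limit])
    return {"returned": len(selected), "results": selected}
```

When only 3 chunks match, slicing returns all 3 without error and `returned` is 3, matching the last scenario.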