ADDED Requirements
Requirement: Full-text search via FTS5
The system SHALL maintain an FTS5 virtual table synchronised with the chunks table via triggers. FTS5 SHALL use the porter unicode61 tokenizer for stemming and unicode support. Queries SHALL be passed to FTS5 with special characters escaped.
Scenario: Keyword search
- WHEN user runs kb search "install git"
- THEN FTS5 returns chunks containing "install" and/or "git" (including stemmed variants like "installation"), ranked by BM25 score
Scenario: FTS-only mode
- WHEN user runs kb search "install git" --fts-only
- THEN only FTS5 results are returned, no vector search is performed
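The FTS5 requirement above can be sketched in Python as follows. This is a minimal illustration, not the project's implementation: the `chunks` column names, the `chunks_fts` table name, the trigger names, and the helpers `escape_fts_query` and `fts_search` are all assumptions, and the escaping strategy shown (quote every whitespace-separated term, then join with OR) is only one way to satisfy "special characters escaped". It assumes the Python `sqlite3` module is built with FTS5.

```python
import sqlite3

# Hypothetical schema: an external-content FTS5 table kept in sync with
# the chunks table by triggers, using the porter unicode61 tokenizer.
SCHEMA = """
CREATE TABLE chunks (id INTEGER PRIMARY KEY, content TEXT NOT NULL);
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    content,
    content='chunks',
    content_rowid='id',
    tokenize='porter unicode61'
);
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, content) VALUES (new.id, new.content);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, content)
    VALUES ('delete', old.id, old.content);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, content)
    VALUES ('delete', old.id, old.content);
    INSERT INTO chunks_fts(rowid, content) VALUES (new.id, new.content);
END;
"""


def escape_fts_query(query: str) -> str:
    """Quote each term so FTS5 operators and punctuation are treated as text."""
    terms = ['"' + term.replace('"', '""') + '"' for term in query.split()]
    return " OR ".join(terms)


def fts_search(conn: sqlite3.Connection, query: str, limit: int = 10):
    """Return (chunk_id, bm25_score) rows, best match first."""
    match = escape_fts_query(query)
    if not match:
        return []
    # bm25() returns lower (more negative) values for better matches,
    # so ascending order puts the best match first.
    rows = conn.execute(
        "SELECT rowid, bm25(chunks_fts) FROM chunks_fts "
        "WHERE chunks_fts MATCH ? ORDER BY bm25(chunks_fts) LIMIT ?",
        (match, limit),
    )
    return rows.fetchall()
```

Because both the indexed text and the quoted query terms pass through the porter stemmer, a search for "install" can match a chunk containing "Installation" in this sketch.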
Requirement: Vector similarity search via sqlite-vec
The system SHALL embed the query using the same model that was used to embed stored chunks. The embedded query SHALL be compared against all chunk embeddings in chunks_vec using cosine similarity. The system SHALL retrieve 3× the requested result count as candidates for RRF merging.
Scenario: Semantic search
- WHEN user runs kb search "how to set up version control"
- THEN the query is embedded and compared against stored vectors, returning semantically similar chunks even if they don't contain the exact words "version control"
Scenario: Vector-only mode
- WHEN user runs kb search "how to set up version control" --vec-only
- THEN only vector similarity results are returned, no FTS search is performed
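The candidate-selection step of the vector requirement can be illustrated in plain Python. This sketch deliberately does not use the sqlite-vec API: it stands in an in-memory dictionary for chunks_vec, and the names `cosine_similarity`, `vector_candidates`, and `oversample` are hypothetical. It only shows cosine ranking and the 3× over-fetch that feeds RRF merging.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_candidates(query_vec, chunk_vecs, top=10, oversample=3):
    """Return chunk ids ordered by similarity, keeping top * oversample of them.

    chunk_vecs maps chunk id -> embedding and stands in for chunks_vec.
    """
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[: top * oversample]]
```

With `top=10`, this sketch keeps up to 30 candidates, so the later merge has more than the final result count to choose from.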
Requirement: Reciprocal Rank Fusion merging
The system SHALL merge FTS5 and vector search results using Reciprocal Rank Fusion (RRF). The RRF formula SHALL be: score(d) = Σ 1/(k + rank(d)), summed over each result set in which d appears, where rank(d) is the 1-based position of d in that result set and k is configurable (default: 60). Results SHALL be sorted by descending RRF score.
Scenario: Hybrid search combines both signals
- WHEN user runs kb search "install git" (default hybrid mode)
- THEN the system runs both FTS5 and vector searches, merges results via RRF, and returns results sorted by combined score
Scenario: Document appears in both result sets
- WHEN a chunk ranks #2 in FTS5 and #5 in vector search
- THEN its RRF score SHALL be 1/(60+2) + 1/(60+5) = 0.0161 + 0.0154 = 0.0315, higher than a chunk appearing in only one result set
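The RRF merge and the worked scenario above can be checked with a short Python sketch. The function name `rrf_merge` and the chunk ids are hypothetical; ranks are assumed to start at 1, which matches the 1/(60+2) term in the scenario.

```python
def rrf_merge(result_lists, k=60):
    """Merge ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each list is ordered best-first. A chunk's score is the sum of
    1 / (k + rank) over every list it appears in (rank starts at 1).
    """
    scores = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Chunk "c" ranks #2 in the FTS5 list and #5 in the vector list.
fts_results = ["f1", "c", "f3"]
vec_results = ["v1", "v2", "v3", "v4", "c"]
merged = rrf_merge([fts_results, vec_results])
```

Here `merged[0]` is chunk "c" with a score of about 0.0315. A chunk that appears in only one list can score at most 1/(60+1) ≈ 0.0164, so it sorts below "c".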
Requirement: Tag-based filtering
The system SHALL support filtering search results by one or more tags. When multiple tags are specified, the filter SHALL use AND logic (document must have ALL specified tags). Tag filtering SHALL be applied in the SQL query via JOIN for efficiency.
Scenario: Filter by single tag
- WHEN user runs kb search "deploy" --tags ops
- THEN only chunks from documents tagged with "ops" are included in results
Scenario: Filter by multiple tags
- WHEN user runs kb search "deploy" --tags ops,production
- THEN only chunks from documents tagged with BOTH "ops" AND "production" are included
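One way to express the AND tag filter as a SQL JOIN is to group by chunk and require the number of distinct matching tags to equal the number of requested tags. The sketch below assumes a hypothetical schema (`documents`, `document_tags`, and a `document_id` column on `chunks`) and hypothetical helper names; the real schema may differ.

```python
import sqlite3

# Hypothetical schema, for illustration only.
TAG_SCHEMA = """
CREATE TABLE documents (id INTEGER PRIMARY KEY, path TEXT);
CREATE TABLE document_tags (document_id INTEGER, tag TEXT);
CREATE TABLE chunks (id INTEGER PRIMARY KEY, document_id INTEGER, content TEXT);
"""


def parse_tags(value: str):
    """Split a --tags value such as 'ops,production' into a list of tags."""
    return [tag.strip() for tag in value.split(",") if tag.strip()]


def chunk_ids_with_all_tags(conn: sqlite3.Connection, tags):
    """Return ids of chunks whose document carries every requested tag."""
    wanted = sorted(set(tags))
    if not wanted:
        return [row[0] for row in conn.execute("SELECT id FROM chunks ORDER BY id")]
    placeholders = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT c.id
        FROM chunks AS c
        JOIN document_tags AS t ON t.document_id = c.document_id
        WHERE t.tag IN ({placeholders})
        GROUP BY c.id
        HAVING COUNT(DISTINCT t.tag) = ?
        ORDER BY c.id
    """
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]
```

Only the placeholders are interpolated into the SQL string; the tag values themselves are bound as parameters.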
Requirement: Type-based filtering
The system SHALL support filtering search results by document type. Valid types: pdf, markdown, code, note.
Scenario: Filter by type
- WHEN user runs kb search "deploy" --type code
- THEN only chunks from code documents are included in results
Requirement: Score threshold
The system SHALL support a minimum score cutoff. Results with an RRF score below the threshold SHALL be excluded from output.
Scenario: Apply score threshold
- WHEN user runs kb search "deploy" --threshold 0.02
- THEN only results with RRF score >= 0.02 are returned
Requirement: Result count control
The system SHALL return a configurable number of results (default: 10, configurable via --top flag or search.default_top in config).
Scenario: Request specific number of results
- WHEN user runs kb search "deploy" --top 5
- THEN at most 5 results are returned
Scenario: Fewer matches than requested
- WHEN user searches and only 3 chunks match
- THEN the system returns 3 results without error, with returned: 3 in the output
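The score threshold and result count requirements can be combined in one post-processing step. The sketch below is an assumption about how that step might look: the function name `finalize_results` and the shape of the returned dictionary are hypothetical, and only the `returned` field is taken from the scenario above.

```python
def finalize_results(scored, top=10, threshold=None):
    """Apply the minimum-score cutoff, then keep at most `top` results.

    `scored` is a list of (chunk_id, rrf_score) pairs in any order.
    """
    kept = [
        (chunk_id, score)
        for chunk_id, score in scored
        if threshold is None or score >= threshold
    ]
    kept.sort(key=lambda item: item[1], reverse=True)
    results = kept[:top]
    # Fewer matches than requested is not an error; report the real count.
    return {"returned": len(results), "results": results}
```

With k = 60, a chunk found by only one search scores at most 1/61 ≈ 0.0164, so a threshold of 0.02 in this sketch keeps only chunks that appear in both result sets.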