## ADDED Requirements
### Requirement: Full-text search via FTS5
The system SHALL maintain an FTS5 virtual table synchronised with the chunks table via triggers. FTS5 SHALL use the `porter unicode61` tokenizer for stemming and unicode support. Queries SHALL be passed to FTS5 with special characters escaped.
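
A minimal sketch of how this requirement might be wired up with Python's built-in `sqlite3` module, assuming the underlying SQLite build includes FTS5. The table and trigger names (`chunks_fts`, `chunks_ai`, `chunks_ad`, `chunks_au`), the `content` column, and the helpers `escape_fts5` and `search_fts` are illustrative assumptions, not names defined by this spec. Joining terms with `OR` is likewise only one possible reading of the keyword scenario.

```python
import sqlite3
from typing import List

# Hypothetical schema: an external-content FTS5 table kept in sync by triggers.
SCHEMA = """
CREATE TABLE chunks (id INTEGER PRIMARY KEY, content TEXT NOT NULL);
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    content, content='chunks', content_rowid='id',
    tokenize='porter unicode61'
);
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, content) VALUES (new.id, new.content);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, content)
    VALUES ('delete', old.id, old.content);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, content)
    VALUES ('delete', old.id, old.content);
    INSERT INTO chunks_fts(rowid, content) VALUES (new.id, new.content);
END;
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def escape_fts5(query: str) -> str:
    """Quote each term so FTS5 operators and punctuation are treated literally."""
    # Drop terms with no letters or digits; they would tokenize to nothing.
    terms = [t for t in query.split() if any(ch.isalnum() for ch in t)]
    # Inside an FTS5 string, a double quote is escaped by doubling it.
    return " OR ".join('"' + t.replace('"', '""') + '"' for t in terms)


def search_fts(conn: sqlite3.Connection, query: str, limit: int = 10) -> List[int]:
    """Return chunk ids ordered by BM25 (best match first)."""
    match = escape_fts5(query)
    if not match:
        return []
    rows = conn.execute(
        "SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH ? "
        "ORDER BY bm25(chunks_fts) LIMIT ?",
        (match, limit),
    ).fetchall()
    return [row[0] for row in rows]
```

Because the triggers mirror every insert, update, and delete, callers only ever write to `chunks`; the index follows along.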
#### Scenario: Keyword search
- **WHEN** user runs `kb search "install git"`
- **THEN** FTS5 returns chunks containing "install" and/or "git" (including stemmed variants like "installation"), ranked by BM25 score
#### Scenario: FTS-only mode
- **WHEN** user runs `kb search "install git" --fts-only`
- **THEN** only FTS5 results are returned; no vector search is performed
### Requirement: Vector similarity search via sqlite-vec
The system SHALL embed the query using the same model that was used to embed stored chunks. The embedded query SHALL be compared against all chunk embeddings in `chunks_vec` using cosine similarity. The system SHALL retrieve 3× the requested result count as candidates for RRF merging.
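
The candidate-retrieval step can be illustrated with a brute-force scan in plain Python. This is a stand-in for the sqlite-vec query, not its API: the in-memory `chunk_vecs` mapping and the names `cosine_similarity`, `vector_candidates`, and `overfetch` are assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def vector_candidates(
    query_vec: Sequence[float],
    chunk_vecs: Dict[int, Sequence[float]],
    top: int = 10,
    overfetch: int = 3,
) -> List[Tuple[int, float]]:
    """Score every chunk and keep `overfetch` times `top` candidates for RRF."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[: top * overfetch]
```

Over-fetching gives the later merge step room to promote chunks that rank moderately in both result lists.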
#### Scenario: Semantic search
- **WHEN** user runs `kb search "how to set up version control"`
- **THEN** the query is embedded and compared against stored vectors, returning semantically similar chunks even if they don't contain the exact words "version control"
#### Scenario: Vector-only mode
- **WHEN** user runs `kb search "how to set up version control" --vec-only`
- **THEN** only vector similarity results are returned; no FTS search is performed
### Requirement: Reciprocal Rank Fusion merging
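The system SHALL merge FTS5 and vector search results using Reciprocal Rank Fusion (RRF). The RRF formula SHALL be: `score(d) = Σ 1/(k + rank(d))`, summed over each result list in which chunk `d` appears, where `rank(d)` is the chunk's 1-based position in that list and `k` is configurable (default: 60). Results SHALL be sorted by descending RRF score.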
#### Scenario: Hybrid search combines both signals
- **WHEN** user runs `kb search "install git"` (default hybrid mode)
- **THEN** the system runs both FTS5 and vector searches, merges results via RRF, and returns results sorted by combined score
#### Scenario: Document appears in both result sets
- **WHEN** a chunk ranks #2 in FTS5 and #5 in vector search
- **THEN** its RRF score SHALL be `1/(60+2) + 1/(60+5) = 0.0161 + 0.0154 = 0.0315`, higher than a chunk appearing in only one result set
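
The merge and the arithmetic in the scenario above can be sketched in a few lines of Python. The function name `rrf_merge` and the list-of-ids input shape are assumptions for illustration; ranks are taken to be 1-based, as the worked example implies.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def rrf_merge(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Tuple[Hashable, float]]:
    """Fuse ranked id lists; each list contributes 1/(k + rank) per id."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Chunk "x" is #2 in the FTS5 list and #5 in the vector list.
fts_ranking = ["a", "x", "c"]
vec_ranking = ["d", "e", "f", "g", "x"]
merged = rrf_merge([fts_ranking, vec_ranking])
```

With the default `k = 60`, the best score a chunk can reach from a single list is `1/61 ≈ 0.0164`, so a chunk present in both lists at these ranks sorts ahead of every single-list chunk.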
### Requirement: Tag-based filtering
The system SHALL support filtering search results by one or more tags. When multiple tags are specified, the filter SHALL use AND logic (document must have ALL specified tags). Tag filtering SHALL be applied in the SQL query via JOIN for efficiency.
#### Scenario: Filter by single tag
- **WHEN** user runs `kb search "deploy" --tags ops`
- **THEN** only chunks from documents tagged with "ops" are included in results
#### Scenario: Filter by multiple tags
- **WHEN** user runs `kb search "deploy" --tags ops,production`
- **THEN** only chunks from documents tagged with BOTH "ops" AND "production" are included
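
One way to express AND-style tag filtering with a JOIN is to group by chunk and require that the number of distinct matching tags equals the number of requested tags. The sketch below uses Python's `sqlite3`; the `documents`, `chunks`, and `document_tags` tables, their columns, and the helper names are hypothetical and not taken from this spec.

```python
import sqlite3
from typing import List, Sequence


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory database with a hypothetical tagging schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE documents (id INTEGER PRIMARY KEY);
        CREATE TABLE chunks (id INTEGER PRIMARY KEY, document_id INTEGER);
        CREATE TABLE document_tags (document_id INTEGER, tag TEXT);
        INSERT INTO documents (id) VALUES (1), (2);
        INSERT INTO chunks (id, document_id) VALUES (10, 1), (20, 2);
        INSERT INTO document_tags VALUES (1, 'ops'), (1, 'production'), (2, 'ops');
        """
    )
    return conn


def parse_tags(raw: str) -> List[str]:
    """Split a comma-separated --tags value, ignoring empty entries."""
    return [t.strip() for t in raw.split(",") if t.strip()]


def chunks_with_all_tags(conn: sqlite3.Connection, tags: Sequence[str]) -> List[int]:
    """Return ids of chunks whose document carries every requested tag."""
    wanted = sorted(set(tags))
    if not wanted:
        rows = conn.execute("SELECT id FROM chunks ORDER BY id")
        return [row[0] for row in rows]
    placeholders = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT c.id
        FROM chunks AS c
        JOIN document_tags AS t ON t.document_id = c.document_id
        WHERE t.tag IN ({placeholders})
        GROUP BY c.id
        HAVING COUNT(DISTINCT t.tag) = ?
        ORDER BY c.id
    """
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]
```

Only the placeholder list is interpolated into the SQL string; the tag values themselves are always bound as parameters.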
### Requirement: Type-based filtering
The system SHALL support filtering search results by document type. Valid types: `pdf`, `markdown`, `code`, `note`.
#### Scenario: Filter by type
- **WHEN** user runs `kb search "deploy" --type code`
- **THEN** only chunks from code documents are included in results
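
Restricting `--type` to the four valid values could be handled at argument-parsing time. The sketch below uses Python's standard `argparse`; the parser layout and the `build_parser` name are assumptions, since the spec does not say how the `kb` CLI is implemented.

```python
import argparse

# The four document types listed in the requirement.
VALID_TYPES = ("pdf", "markdown", "code", "note")


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for `kb search QUERY [--type TYPE]`."""
    parser = argparse.ArgumentParser(prog="kb")
    subcommands = parser.add_subparsers(dest="command", required=True)
    search = subcommands.add_parser("search")
    search.add_argument("query")
    # argparse rejects any value outside `choices` with a usage error.
    search.add_argument("--type", choices=VALID_TYPES, default=None)
    return parser


args = build_parser().parse_args(["search", "deploy", "--type", "code"])
```

Leaving the default as `None` lets the search layer skip the type filter entirely when the flag is absent.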
### Requirement: Score threshold
The system SHALL support a minimum score cutoff. Results with an RRF score below the threshold SHALL be excluded from output.
#### Scenario: Apply score threshold
- **WHEN** user runs `kb search "deploy" --threshold 0.02`
- **THEN** only results with RRF score >= 0.02 are returned
### Requirement: Result count control
The system SHALL return a configurable number of results (default: 10, configurable via `--top` flag or `search.default_top` in config).
#### Scenario: Request specific number of results
- **WHEN** user runs `kb search "deploy" --top 5`
- **THEN** at most 5 results are returned
#### Scenario: Fewer matches than requested
- **WHEN** user searches and only 3 chunks match
- **THEN** the system returns 3 results without error, with `returned: 3` in the output
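
The score threshold and result count requirements can be combined in one final post-processing step. The following Python sketch assumes the merged results arrive as `(chunk_id, rrf_score)` pairs; the `finalize_results` name and the `results` key are illustrative, and only the `returned` field comes from the scenario above.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def finalize_results(
    scored: Sequence[Tuple[int, float]],
    top: int = 10,
    threshold: Optional[float] = None,
) -> Dict[str, object]:
    """Apply the score cutoff, keep at most `top` results, and report the count."""
    kept: List[Tuple[int, float]] = [
        (chunk_id, score)
        for chunk_id, score in scored
        if threshold is None or score >= threshold
    ]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    results = kept[:top]
    # Fewer matches than `top` is not an error; the count simply reflects it.
    return {"returned": len(results), "results": results}
```

Filtering before truncating means a high threshold can leave fewer than `top` results, which is consistent with the "fewer matches than requested" scenario.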