Initial MVP

2026-03-23 20:38:42 +00:00
commit f245c24928
57 changed files with 6812 additions and 0 deletions
+152
@@ -0,0 +1,152 @@
---
name: "OPSX: Apply"
description: Implement tasks from an OpenSpec change (Experimental)
category: Workflow
tags: [workflow, artifacts, experimental]
---
Implement tasks from an OpenSpec change.
**Input**: Optionally specify a change name (e.g., `/opsx:apply add-auth`). If omitted, check whether it can be inferred from conversation context. If the request is vague or ambiguous, you MUST prompt the user to choose from the available changes.
**Steps**
1. **Select the change**
If a name is provided, use it. Otherwise:
- Infer from conversation context if the user mentioned a change
- Auto-select if only one active change exists
- If ambiguous, run `openspec list --json` to get available changes and use the **AskUserQuestion tool** to let the user select
Always announce: "Using change: <name>" and how to override (e.g., `/opsx:apply <other>`).
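The selection order above can be sketched in Python. This is a minimal illustration, not OpenSpec code: `select_change` and its parameters are hypothetical names, and `active` stands for change names already extracted from `openspec list --json`, whose exact output shape is not specified here.

```python
from typing import Optional, Sequence


def select_change(provided: Optional[str],
                  mentioned: Optional[str],
                  active: Sequence[str]) -> Optional[str]:
    """Return the change to use, or None when the user must be asked."""
    # 1. An explicitly provided name always wins.
    if provided:
        return provided
    # 2. Otherwise use a change mentioned in conversation, if it is active.
    if mentioned and mentioned in active:
        return mentioned
    # 3. Auto-select when exactly one active change exists.
    if len(active) == 1:
        return active[0]
    # 4. Ambiguous: the caller should prompt the user.
    return None
```

A `None` result corresponds to the ambiguous case, where the available changes are listed and the user chooses.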
2. **Check status to understand the schema**
```bash
openspec status --change "<name>" --json
```
Parse the JSON to understand:
- `schemaName`: The workflow being used (e.g., "spec-driven")
- Which artifact contains the tasks (typically "tasks" for spec-driven, check status for others)
3. **Get apply instructions**
```bash
openspec instructions apply --change "<name>" --json
```
This returns:
- Context file paths (varies by schema)
- Progress (total, complete, remaining)
- Task list with status
- Dynamic instruction based on current state
**Handle states:**
- If `state: "blocked"` (missing artifacts): show message, suggest using `/opsx:continue`
- If `state: "all_done"`: congratulate, suggest archive
- Otherwise: proceed to implementation
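A sketch of the state handling above, assuming the apply instructions JSON has already been parsed into a dict. The function name and the returned action labels are illustrative only; the `state` values are the ones listed above.

```python
def next_action(instructions: dict) -> str:
    """Map the `state` field of the apply instructions to a next action."""
    state = instructions.get("state")
    if state == "blocked":
        # Missing artifacts: show the message and point at /opsx:continue.
        return "suggest /opsx:continue"
    if state == "all_done":
        # Nothing left to implement: congratulate and suggest archiving.
        return "suggest /opsx:archive"
    # Any other state means there are tasks to work on.
    return "implement"
```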
4. **Read context files**
Read the files listed in `contextFiles` from the apply instructions output.
The files depend on the schema being used:
- **spec-driven**: proposal, specs, design, tasks
- Other schemas: follow the contextFiles from CLI output
5. **Show current progress**
Display:
- Schema being used
- Progress: "N/M tasks complete"
- Remaining tasks overview
- Dynamic instruction from CLI
6. **Implement tasks (loop until done or blocked)**
For each pending task:
- Show which task is being worked on
- Make the code changes required
- Keep changes minimal and focused
- Mark task complete in the tasks file: `- [ ]` → `- [x]`
- Continue to next task
**Pause if:**
- Task is unclear → ask for clarification
- Implementation reveals a design issue → suggest updating artifacts
- Error or blocker encountered → report and wait for guidance
- User interrupts
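The checkbox update in the loop above (`- [ ]` → `- [x]`) can be done as a targeted text edit. The sketch below is one possible approach, assuming tasks are matched by their exact description text; `mark_task_done` is a hypothetical helper, not part of the OpenSpec CLI.

```python
def mark_task_done(tasks_md: str, description: str) -> str:
    """Flip the first unchecked task whose text equals `description`."""
    prefix = "- [ ] "
    lines = tasks_md.splitlines()
    for i, line in enumerate(lines):
        stripped = line.lstrip()
        if stripped.startswith(prefix) and stripped[len(prefix):].strip() == description:
            indent = line[: len(line) - len(stripped)]  # keep nesting intact
            lines[i] = f"{indent}- [x] {stripped[len(prefix):]}"
            break  # only one task is marked per call
    result = "\n".join(lines)
    # splitlines() drops a trailing newline, so restore it if it was there.
    return result + "\n" if tasks_md.endswith("\n") else result
```

If no task matches, the text is returned unchanged, which makes a missed match easy to detect by comparing input and output.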
7. **On completion or pause, show status**
Display:
- Tasks completed this session
- Overall progress: "N/M tasks complete"
- If all done: suggest archive
- If paused: explain why and wait for guidance
**Output During Implementation**
```
## Implementing: <change-name> (schema: <schema-name>)
Working on task 3/7: <task description>
[...implementation happening...]
✓ Task complete
Working on task 4/7: <task description>
[...implementation happening...]
✓ Task complete
```
**Output On Completion**
```
## Implementation Complete
**Change:** <change-name>
**Schema:** <schema-name>
**Progress:** 7/7 tasks complete ✓
### Completed This Session
- [x] Task 1
- [x] Task 2
...
All tasks complete! You can archive this change with `/opsx:archive`.
```
**Output On Pause (Issue Encountered)**
```
## Implementation Paused
**Change:** <change-name>
**Schema:** <schema-name>
**Progress:** 4/7 tasks complete
### Issue Encountered
<description of the issue>
**Options:**
1. <option 1>
2. <option 2>
3. Other approach
What would you like to do?
```
**Guardrails**
- Keep going through tasks until done or blocked
- Always read context files before starting (from the apply instructions output)
- If task is ambiguous, pause and ask before implementing
- If implementation reveals issues, pause and suggest artifact updates
- Keep code changes minimal and scoped to each task
- Update task checkbox immediately after completing each task
- Pause on errors, blockers, or unclear requirements - don't guess
- Use contextFiles from CLI output, don't assume specific file names
**Fluid Workflow Integration**
This skill supports the "actions on a change" model:
- **Can be invoked anytime**: Before all artifacts are done (if tasks exist), after partial implementation, interleaved with other actions
- **Allows artifact updates**: If implementation reveals design issues, suggest updating artifacts. The workflow is not phase-locked, so work fluidly.
+157
@@ -0,0 +1,157 @@
---
name: "OPSX: Archive"
description: Archive a completed change in the experimental workflow
category: Workflow
tags: [workflow, archive, experimental]
---
Archive a completed change in the experimental workflow.
**Input**: Optionally specify a change name after `/opsx:archive` (e.g., `/opsx:archive add-auth`). If omitted, check whether it can be inferred from conversation context. If the request is vague or ambiguous, you MUST prompt the user to choose from the available changes.
**Steps**
1. **If no change name provided, prompt for selection**
Run `openspec list --json` to get available changes. Use the **AskUserQuestion tool** to let the user select.
Show only active changes (not already archived).
Include the schema used for each change if available.
**IMPORTANT**: Do NOT guess or auto-select a change. Always let the user choose.
2. **Check artifact completion status**
Run `openspec status --change "<name>" --json` to check artifact completion.
Parse the JSON to understand:
- `schemaName`: The workflow being used
- `artifacts`: List of artifacts with their status (`done` or other)
**If any artifacts are not `done`:**
- Display warning listing incomplete artifacts
- Prompt user for confirmation to continue
- Proceed if user confirms
3. **Check task completion status**
Read the tasks file (typically `tasks.md`) to check for incomplete tasks.
Count tasks marked with `- [ ]` (incomplete) vs `- [x]` (complete).
**If incomplete tasks found:**
- Display warning showing count of incomplete tasks
- Prompt user for confirmation to continue
- Proceed if user confirms
**If no tasks file exists:** Proceed without task-related warning.
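Counting the checkboxes can be sketched as below. This assumes the tasks file uses the `- [ ]` / `- [x]` markers described above, possibly indented; accepting an uppercase `X` is an extra assumption, and `count_tasks` is a hypothetical helper name.

```python
import re

# Checkbox markers at the start of a line, allowing leading indentation.
INCOMPLETE = re.compile(r"^[ \t]*- \[ \]", re.MULTILINE)
COMPLETE = re.compile(r"^[ \t]*- \[[xX]\]", re.MULTILINE)


def count_tasks(tasks_md: str) -> tuple:
    """Return (complete, incomplete) task counts for a tasks file."""
    return len(COMPLETE.findall(tasks_md)), len(INCOMPLETE.findall(tasks_md))
```

A non-zero second value is the count to show in the incomplete-tasks warning.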
4. **Assess delta spec sync state**
Check for delta specs at `openspec/changes/<name>/specs/`. If none exist, proceed without sync prompt.
**If delta specs exist:**
- Compare each delta spec with its corresponding main spec at `openspec/specs/<capability>/spec.md`
- Determine what changes would be applied (adds, modifications, removals, renames)
- Show a combined summary before prompting
**Prompt options:**
- If changes needed: "Sync now (recommended)", "Archive without syncing"
- If already synced: "Archive now", "Sync anyway", "Cancel"
If the user chooses sync, use the Task tool (subagent_type: "general-purpose", prompt: "Use Skill tool to invoke openspec-sync-specs for change '<name>'. Delta spec analysis: <include the analyzed delta spec summary>"). Unless the user chose "Cancel", proceed to archive whether or not a sync was run.
5. **Perform the archive**
Create the archive directory if it doesn't exist:
```bash
mkdir -p openspec/changes/archive
```
Generate target name using current date: `YYYY-MM-DD-<change-name>`
**Check if target already exists:**
- If yes: Fail with error, suggest renaming existing archive or using different date
- If no: Move the change directory to archive
```bash
mv openspec/changes/<name> openspec/changes/archive/YYYY-MM-DD-<name>
```
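The same archive step, sketched in Python for cases where the existence check and the move should happen together. `archive_change` and its `root` and `today` parameters are hypothetical; the paths and the `YYYY-MM-DD-<name>` naming are taken from the step above.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Optional


def archive_change(root: Path, name: str, today: Optional[date] = None) -> Path:
    """Move openspec/changes/<name> to openspec/changes/archive/YYYY-MM-DD-<name>."""
    changes = root / "openspec" / "changes"
    archive_dir = changes / "archive"
    stamp = (today or date.today()).isoformat()  # YYYY-MM-DD
    target = archive_dir / f"{stamp}-{name}"
    if target.exists():
        # Fail rather than overwrite or merge into an existing archive.
        raise FileExistsError(f"Target archive directory already exists: {target}")
    archive_dir.mkdir(parents=True, exist_ok=True)
    # The whole directory moves, so .openspec.yaml travels with it.
    shutil.move(str(changes / name), str(target))
    return target
```

Passing `today` explicitly is only there to make the sketch easy to test; by default the current date is used.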
6. **Display summary**
Show archive completion summary including:
- Change name
- Schema that was used
- Archive location
- Spec sync status (synced / sync skipped / no delta specs)
- Note about any warnings (incomplete artifacts/tasks)
**Output On Success**
```
## Archive Complete
**Change:** <change-name>
**Schema:** <schema-name>
**Archived to:** openspec/changes/archive/YYYY-MM-DD-<name>/
**Specs:** ✓ Synced to main specs
All artifacts complete. All tasks complete.
```
**Output On Success (No Delta Specs)**
```
## Archive Complete
**Change:** <change-name>
**Schema:** <schema-name>
**Archived to:** openspec/changes/archive/YYYY-MM-DD-<name>/
**Specs:** No delta specs
All artifacts complete. All tasks complete.
```
**Output On Success With Warnings**
```
## Archive Complete (with warnings)
**Change:** <change-name>
**Schema:** <schema-name>
**Archived to:** openspec/changes/archive/YYYY-MM-DD-<name>/
**Specs:** Sync skipped (user chose to skip)
**Warnings:**
- Archived with 2 incomplete artifacts
- Archived with 3 incomplete tasks
- Delta spec sync was skipped (user chose to skip)
Review the archive if this was not intentional.
```
**Output On Error (Archive Exists)**
```
## Archive Failed
**Change:** <change-name>
**Target:** openspec/changes/archive/YYYY-MM-DD-<name>/
Target archive directory already exists.
**Options:**
1. Rename the existing archive
2. Delete the existing archive if it's a duplicate
3. Wait until a different date to archive
```
**Guardrails**
- Always prompt for change selection if not provided
- Use the artifact graph (`openspec status --change "<name>" --json`) for completion checking
- Don't block archive on warnings - just inform and confirm
- Preserve `.openspec.yaml` when moving to archive (it moves with the directory)
- Show clear summary of what happened
- If sync is requested, use the Skill tool to invoke `openspec-sync-specs` (agent-driven)
- If delta specs exist, always run the sync assessment and show the combined summary before prompting
+114
@@ -0,0 +1,114 @@
---
name: "OPSX: Continue"
description: Continue working on a change - create the next artifact (Experimental)
category: Workflow
tags: [workflow, artifacts, experimental]
---
Continue working on a change by creating the next artifact.
**Input**: Optionally specify a change name after `/opsx:continue` (e.g., `/opsx:continue add-auth`). If omitted, check whether it can be inferred from conversation context. If the request is vague or ambiguous, you MUST prompt the user to choose from the available changes.
**Steps**
1. **If no change name provided, prompt for selection**
Run `openspec list --json` to get available changes sorted by most recently modified. Then use the **AskUserQuestion tool** to let the user select which change to work on.
Present the top 3-4 most recently modified changes as options, showing:
- Change name
- Schema (from `schema` field if present, otherwise "spec-driven")
- Status (e.g., "0/5 tasks", "complete", "no tasks")
- How recently it was modified (from `lastModified` field)
Mark the most recently modified change as "(Recommended)" since it's likely what the user wants to continue.
**IMPORTANT**: Do NOT guess or auto-select a change. Always let the user choose.
2. **Check current status**
```bash
openspec status --change "<name>" --json
```
Parse the JSON to understand current state. The response includes:
- `schemaName`: The workflow schema being used (e.g., "spec-driven")
- `artifacts`: Array of artifacts with their status ("done", "ready", "blocked")
- `isComplete`: Boolean indicating if all artifacts are complete
3. **Act based on status**:
---
**If all artifacts are complete (`isComplete: true`)**:
- Congratulate the user
- Show final status including the schema used
- Suggest: "All artifacts created! You can now implement this change with `/opsx:apply` or archive it with `/opsx:archive`."
- STOP
---
**If artifacts are ready to create** (status shows artifacts with `status: "ready"`):
- Pick the FIRST artifact with `status: "ready"` from the status output
- Get its instructions:
```bash
openspec instructions <artifact-id> --change "<name>" --json
```
- Parse the JSON. The key fields are:
- `context`: Project background (constraints for you - do NOT include in output)
- `rules`: Artifact-specific rules (constraints for you - do NOT include in output)
- `template`: The structure to use for your output file
- `instruction`: Schema-specific guidance
- `outputPath`: Where to write the artifact
- `dependencies`: Completed artifacts to read for context
- **Create the artifact file**:
- Read any completed dependency files for context
- Use `template` as the structure - fill in its sections
- Apply `context` and `rules` as constraints when writing - but do NOT copy them into the file
- Write to the output path specified in instructions
- Show what was created and what's now unlocked
- STOP after creating ONE artifact
---
**If no artifacts are ready (all blocked)**:
- This shouldn't happen with a valid schema
- Show status and suggest checking for issues
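The three branches above amount to a small decision over the parsed status JSON. The sketch below uses the `isComplete`, `artifacts`, and `status` fields named in step 2; the `id` key on each artifact and the returned labels are assumptions made for illustration.

```python
from typing import Optional, Tuple


def next_step(status: dict) -> Tuple[str, Optional[str]]:
    """Decide what /opsx:continue should do for a parsed status payload."""
    if status.get("isComplete"):
        return ("complete", None)
    for artifact in status.get("artifacts", []):
        if artifact.get("status") == "ready":
            # Only the FIRST ready artifact is created per invocation.
            return ("create", artifact.get("id"))
    # Nothing ready and not complete: everything is blocked.
    return ("blocked", None)
```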
4. **After creating an artifact, show progress**
```bash
openspec status --change "<name>"
```
**Output**
After each invocation, show:
- Which artifact was created
- Schema workflow being used
- Current progress (N/M complete)
- What artifacts are now unlocked
- Prompt: "Run `/opsx:continue` to create the next artifact"
**Artifact Creation Guidelines**
The artifact types and their purpose depend on the schema. Use the `instruction` field from the instructions output to understand what to create.
Common artifact patterns:
**spec-driven schema** (proposal → specs → design → tasks):
- **proposal.md**: Ask user about the change if not clear. Fill in Why, What Changes, Capabilities, Impact.
- The Capabilities section is critical - each capability listed will need a spec file.
- **specs/<capability>/spec.md**: Create one spec per capability listed in the proposal's Capabilities section (use the capability name, not the change name).
- **design.md**: Document technical decisions, architecture, and implementation approach.
- **tasks.md**: Break down implementation into checkboxed tasks.
For other schemas, follow the `instruction` field from the CLI output.
**Guardrails**
- Create ONE artifact per invocation
- Always read dependency artifacts before creating a new one
- Never skip artifacts or create out of order
- If context is unclear, ask the user before creating
- Verify the artifact file exists after writing before marking progress
- Use the schema's artifact sequence, don't assume specific artifact names
- **IMPORTANT**: `context` and `rules` are constraints for YOU, not content for the file
- Do NOT copy `<context>`, `<rules>`, `<project_context>` blocks into the artifact
- These guide what you write, but should never appear in the output
+173
@@ -0,0 +1,173 @@
---
name: "OPSX: Explore"
description: "Enter explore mode - think through ideas, investigate problems, clarify requirements"
category: Workflow
tags: [workflow, explore, experimental, thinking]
---
Enter explore mode. Think deeply. Visualize freely. Follow the conversation wherever it goes.
**IMPORTANT: Explore mode is for thinking, not implementing.** You may read files, search code, and investigate the codebase, but you must NEVER write code or implement features. If the user asks you to implement something, remind them to exit explore mode first and create a change proposal. You MAY create OpenSpec artifacts (proposals, designs, specs) if the user asks—that's capturing thinking, not implementing.
**This is a stance, not a workflow.** There are no fixed steps, no required sequence, no mandatory outputs. You're a thinking partner helping the user explore.
**Input**: The argument after `/opsx:explore` is whatever the user wants to think about. Could be:
- A vague idea: "real-time collaboration"
- A specific problem: "the auth system is getting unwieldy"
- A change name: "add-dark-mode" (to explore in context of that change)
- A comparison: "postgres vs sqlite for this"
- Nothing (just enter explore mode)
---
## The Stance
- **Curious, not prescriptive** - Ask questions that emerge naturally, don't follow a script
- **Open threads, not interrogations** - Surface multiple interesting directions and let the user follow what resonates. Don't funnel them through a single path of questions.
- **Visual** - Use ASCII diagrams liberally when they'd help clarify thinking
- **Adaptive** - Follow interesting threads, pivot when new information emerges
- **Patient** - Don't rush to conclusions, let the shape of the problem emerge
- **Grounded** - Explore the actual codebase when relevant, don't just theorize
---
## What You Might Do
Depending on what the user brings, you might:
**Explore the problem space**
- Ask clarifying questions that emerge from what they said
- Challenge assumptions
- Reframe the problem
- Find analogies
**Investigate the codebase**
- Map existing architecture relevant to the discussion
- Find integration points
- Identify patterns already in use
- Surface hidden complexity
**Compare options**
- Brainstorm multiple approaches
- Build comparison tables
- Sketch tradeoffs
- Recommend a path (if asked)
**Visualize**
```
┌─────────────────────────────────────────┐
│ Use ASCII diagrams liberally │
├─────────────────────────────────────────┤
│ │
│ ┌────────┐ ┌────────┐ │
│ │ State │────────▶│ State │ │
│ │ A │ │ B │ │
│ └────────┘ └────────┘ │
│ │
│ System diagrams, state machines, │
│ data flows, architecture sketches, │
│ dependency graphs, comparison tables │
│ │
└─────────────────────────────────────────┘
```
**Surface risks and unknowns**
- Identify what could go wrong
- Find gaps in understanding
- Suggest spikes or investigations
---
## OpenSpec Awareness
You have full context of the OpenSpec system. Use it naturally, don't force it.
### Check for context
At the start, quickly check what exists:
```bash
openspec list --json
```
This tells you:
- If there are active changes
- Their names, schemas, and status
- What the user might be working on
If the user mentioned a specific change name, read its artifacts for context.
### When no change exists
Think freely. When insights crystallize, you might offer:
- "This feels solid enough to start a change. Want me to create a proposal?"
- Or keep exploring - no pressure to formalize
### When a change exists
If the user mentions a change or you detect one is relevant:
1. **Read existing artifacts for context**
- `openspec/changes/<name>/proposal.md`
- `openspec/changes/<name>/design.md`
- `openspec/changes/<name>/tasks.md`
- etc.
2. **Reference them naturally in conversation**
- "Your design mentions using Redis, but we just realized SQLite fits better..."
- "The proposal scopes this to premium users, but we're now thinking everyone..."
3. **Offer to capture when decisions are made**
| Insight Type | Where to Capture |
|--------------|------------------|
| New requirement discovered | `specs/<capability>/spec.md` |
| Requirement changed | `specs/<capability>/spec.md` |
| Design decision made | `design.md` |
| Scope changed | `proposal.md` |
| New work identified | `tasks.md` |
| Assumption invalidated | Relevant artifact |
Example offers:
- "That's a design decision. Capture it in design.md?"
- "This is a new requirement. Add it to specs?"
- "This changes scope. Update the proposal?"
4. **The user decides** - Offer and move on. Don't pressure. Don't auto-capture.
---
## What You Don't Have To Do
- Follow a script
- Ask the same questions every time
- Produce a specific artifact
- Reach a conclusion
- Stay on topic if a tangent is valuable
- Be brief (this is thinking time)
---
## Ending Discovery
There's no required ending. Discovery might:
- **Flow into a proposal**: "Ready to start? I can create a change proposal."
- **Result in artifact updates**: "Updated design.md with these decisions"
- **Just provide clarity**: User has what they need, moves on
- **Continue later**: "We can pick this up anytime"
When things crystallize, you might offer a summary - but it's optional. Sometimes the thinking IS the value.
---
## Guardrails
- **Don't implement** - Never write code or implement features. Creating OpenSpec artifacts is fine, writing application code is not.
- **Don't fake understanding** - If something is unclear, dig deeper
- **Don't rush** - Discovery is thinking time, not task time
- **Don't force structure** - Let patterns emerge naturally
- **Don't auto-capture** - Offer to save insights, don't just do it
- **Do visualize** - A good diagram is worth many paragraphs
- **Do explore the codebase** - Ground discussions in reality
- **Do question assumptions** - Including the user's and your own
+97
@@ -0,0 +1,97 @@
---
name: "OPSX: Fast Forward"
description: Create a change and generate all artifacts needed for implementation in one go
category: Workflow
tags: [workflow, artifacts, experimental]
---
Fast-forward through artifact creation - generate everything needed to start implementation.
**Input**: The argument after `/opsx:ff` is the change name (kebab-case), OR a description of what the user wants to build.
**Steps**
1. **If no input provided, ask what they want to build**
Use the **AskUserQuestion tool** (open-ended, no preset options) to ask:
> "What change do you want to work on? Describe what you want to build or fix."
From their description, derive a kebab-case name (e.g., "add user authentication" → `add-user-auth`).
**IMPORTANT**: Do NOT proceed without understanding what the user wants to build.
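A rough sketch of kebab-case handling for the derived name. Both helpers are hypothetical, and the exact naming rules the CLI enforces are not stated here, so the pattern below (lowercase letters and digits separated by single hyphens) is an assumption. Note that the mechanical conversion does not shorten words the way the example above does ("authentication" → "auth"); that part remains a judgment call.

```python
import re

# Assumed rule: lowercase alphanumeric words joined by single hyphens.
KEBAB = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def is_kebab_case(name: str) -> bool:
    """Check whether a change name is already kebab-case."""
    return bool(KEBAB.match(name))


def to_kebab(description: str) -> str:
    """Lowercase the text and collapse every non-alphanumeric run to a hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
```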
2. **Create the change directory**
```bash
openspec new change "<name>"
```
This creates a scaffolded change at `openspec/changes/<name>/`.
3. **Get the artifact build order**
```bash
openspec status --change "<name>" --json
```
Parse the JSON to get:
- `applyRequires`: array of artifact IDs needed before implementation (e.g., `["tasks"]`)
- `artifacts`: list of all artifacts with their status and dependencies
4. **Create artifacts in sequence until apply-ready**
Use the **TodoWrite tool** to track progress through the artifacts.
Loop through artifacts in dependency order (artifacts with no pending dependencies first):
a. **For each artifact that is `ready` (dependencies satisfied)**:
- Get instructions:
```bash
openspec instructions <artifact-id> --change "<name>" --json
```
- The instructions JSON includes:
- `context`: Project background (constraints for you - do NOT include in output)
- `rules`: Artifact-specific rules (constraints for you - do NOT include in output)
- `template`: The structure to use for your output file
- `instruction`: Schema-specific guidance for this artifact type
- `outputPath`: Where to write the artifact
- `dependencies`: Completed artifacts to read for context
- Read any completed dependency files for context
- Create the artifact file using `template` as the structure
- Apply `context` and `rules` as constraints - but do NOT copy them into the file
- Show brief progress: "✓ Created <artifact-id>"
b. **Continue until all `applyRequires` artifacts are complete**
- After creating each artifact, re-run `openspec status --change "<name>" --json`
- Check if every artifact ID in `applyRequires` has `status: "done"` in the artifacts array
- Stop when all `applyRequires` artifacts are done
c. **If an artifact requires user input** (unclear context):
- Use **AskUserQuestion tool** to clarify
- Then continue with creation
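The stop condition in step 4b can be checked mechanically after each status re-run. This sketch relies on the `applyRequires` and `artifacts` fields described in step 3; the `id` key on each artifact is an assumption, and `pending_requirements` is a hypothetical helper.

```python
def pending_requirements(status: dict) -> list:
    """Return the applyRequires artifact IDs that are not yet done."""
    done = {
        artifact.get("id")
        for artifact in status.get("artifacts", [])
        if artifact.get("status") == "done"
    }
    return [req for req in status.get("applyRequires", []) if req not in done]


def apply_ready(status: dict) -> bool:
    """True once every artifact listed in applyRequires is done."""
    return not pending_requirements(status)
```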
5. **Show final status**
```bash
openspec status --change "<name>"
```
**Output**
After completing all artifacts, summarize:
- Change name and location
- List of artifacts created with brief descriptions
- What's ready: "All artifacts created! Ready for implementation."
- Prompt: "Run `/opsx:apply` to start implementing."
**Artifact Creation Guidelines**
- Follow the `instruction` field from `openspec instructions` for each artifact type
- The schema defines what each artifact should contain - follow it
- Read dependency artifacts for context before creating new ones
- Use `template` as the structure for your output file - fill in its sections
- **IMPORTANT**: `context` and `rules` are constraints for YOU, not content for the file
- Do NOT copy `<context>`, `<rules>`, `<project_context>` blocks into the artifact
- These guide what you write, but should never appear in the output
**Guardrails**
- Create ALL artifacts needed for implementation (as defined by schema's `apply.requires`)
- Always read dependency artifacts before creating a new one
- If context is critically unclear, ask the user - but prefer making reasonable decisions to keep momentum
- If a change with that name already exists, ask if user wants to continue it or create a new one
- Verify each artifact file exists after writing before proceeding to next
+69
@@ -0,0 +1,69 @@
---
name: "OPSX: New"
description: Start a new change using the experimental artifact workflow (OPSX)
category: Workflow
tags: [workflow, artifacts, experimental]
---
Start a new change using the experimental artifact-driven approach.
**Input**: The argument after `/opsx:new` is the change name (kebab-case), OR a description of what the user wants to build.
**Steps**
1. **If no input provided, ask what they want to build**
Use the **AskUserQuestion tool** (open-ended, no preset options) to ask:
> "What change do you want to work on? Describe what you want to build or fix."
From their description, derive a kebab-case name (e.g., "add user authentication" → `add-user-auth`).
**IMPORTANT**: Do NOT proceed without understanding what the user wants to build.
2. **Determine the workflow schema**
Use the default schema (omit `--schema`) unless the user explicitly requests a different workflow.
**Use a different schema only if the user mentions:**
- A specific schema name → use `--schema <name>`
- "show workflows" or "what workflows" → run `openspec schemas --json` and let them choose
**Otherwise**: Omit `--schema` to use the default.
3. **Create the change directory**
```bash
openspec new change "<name>"
```
Add `--schema <name>` only if the user requested a specific workflow.
This creates a scaffolded change at `openspec/changes/<name>/` with the selected schema.
4. **Show the artifact status**
```bash
openspec status --change "<name>"
```
This shows which artifacts need to be created and which are ready (dependencies satisfied).
5. **Get instructions for the first artifact**
The first artifact depends on the schema. Check the status output to find the first artifact with status "ready".
```bash
openspec instructions <first-artifact-id> --change "<name>"
```
This outputs the template and context for creating the first artifact.
6. **STOP and wait for user direction**
**Output**
After completing the steps, summarize:
- Change name and location
- Schema/workflow being used and its artifact sequence
- Current status (0/N artifacts complete)
- The template for the first artifact
- Prompt: "Ready to create the first artifact? Run `/opsx:continue` or just describe what this change is about and I'll draft it."
**Guardrails**
- Do NOT create any artifacts yet - just show the instructions
- Do NOT advance beyond showing the first artifact template
- If the name is invalid (not kebab-case), ask for a valid name
- If a change with that name already exists, suggest using `/opsx:continue` instead
- Pass `--schema` if using a non-default workflow
+134
@@ -0,0 +1,134 @@
---
name: "OPSX: Sync"
description: Sync delta specs from a change to main specs
category: Workflow
tags: [workflow, specs, experimental]
---
Sync delta specs from a change to main specs.
This is an **agent-driven** operation - you will read delta specs and directly edit main specs to apply the changes. This allows intelligent merging (e.g., adding a scenario without copying the entire requirement).
**Input**: Optionally specify a change name after `/opsx:sync` (e.g., `/opsx:sync add-auth`). If omitted, check whether it can be inferred from conversation context. If the request is vague or ambiguous, you MUST prompt the user to choose from the available changes.
**Steps**
1. **If no change name provided, prompt for selection**
Run `openspec list --json` to get available changes. Use the **AskUserQuestion tool** to let the user select.
Show changes that have delta specs (under `specs/` directory).
**IMPORTANT**: Do NOT guess or auto-select a change. Always let the user choose.
2. **Find delta specs**
Look for delta spec files in `openspec/changes/<name>/specs/*/spec.md`.
Each delta spec file contains sections like:
- `## ADDED Requirements` - New requirements to add
- `## MODIFIED Requirements` - Changes to existing requirements
- `## REMOVED Requirements` - Requirements to remove
- `## RENAMED Requirements` - Requirements to rename (FROM:/TO: format)
If no delta specs found, inform user and stop.
3. **For each delta spec, apply changes to main specs**
For each capability with a delta spec at `openspec/changes/<name>/specs/<capability>/spec.md`:
a. **Read the delta spec** to understand the intended changes
b. **Read the main spec** at `openspec/specs/<capability>/spec.md` (may not exist yet)
c. **Apply changes intelligently**:
**ADDED Requirements:**
- If requirement doesn't exist in main spec → add it
- If requirement already exists → update it to match (treat as implicit MODIFIED)
**MODIFIED Requirements:**
- Find the requirement in main spec
- Apply the changes - this can be:
- Adding new scenarios (don't need to copy existing ones)
- Modifying existing scenarios
- Changing the requirement description
- Preserve scenarios/content not mentioned in the delta
**REMOVED Requirements:**
- Remove the entire requirement block from main spec
**RENAMED Requirements:**
- Find the FROM requirement, rename to TO
d. **Create new main spec** if capability doesn't exist yet:
- Create `openspec/specs/<capability>/spec.md`
- Add Purpose section (can be brief, mark as TBD)
- Add Requirements section with the ADDED requirements
4. **Show summary**
After applying all changes, summarize:
- Which capabilities were updated
- What changes were made (requirements added/modified/removed/renamed)
**Delta Spec Format Reference**
```markdown
## ADDED Requirements
### Requirement: New Feature
The system SHALL do something new.
#### Scenario: Basic case
- **WHEN** user does X
- **THEN** system does Y
## MODIFIED Requirements
### Requirement: Existing Feature
#### Scenario: New scenario to add
- **WHEN** user does A
- **THEN** system does B
## REMOVED Requirements
### Requirement: Deprecated Feature
## RENAMED Requirements
- FROM: `### Requirement: Old Name`
- TO: `### Requirement: New Name`
```
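To make the format concrete, here is a small parser sketch that indexes a delta spec by operation. It only recognizes the headings and `FROM:`/`TO:` lines shown in the reference above; `parse_delta` is a hypothetical helper, and the actual merge remains agent-driven as described.

```python
import re

SECTION = re.compile(r"^## (ADDED|MODIFIED|REMOVED|RENAMED) Requirements\s*$")
REQUIREMENT = re.compile(r"^### Requirement:\s*(.+?)\s*$")
RENAME = re.compile(r"^- (FROM|TO):\s*`### Requirement:\s*(.+?)`\s*$")


def parse_delta(text: str) -> dict:
    """Map each operation to requirement names (or (old, new) pairs for RENAMED)."""
    ops = {"ADDED": [], "MODIFIED": [], "REMOVED": [], "RENAMED": []}
    section = None
    pending_from = None
    for line in text.splitlines():
        header = SECTION.match(line)
        if header:
            section = header.group(1)
            continue
        if section == "RENAMED":
            rename = RENAME.match(line)
            if rename and rename.group(1) == "FROM":
                pending_from = rename.group(2)
            elif rename and pending_from is not None:
                ops["RENAMED"].append((pending_from, rename.group(2)))
                pending_from = None
        elif section is not None:
            requirement = REQUIREMENT.match(line)
            if requirement:
                ops[section].append(requirement.group(1))
    return ops
```

The result gives a quick summary of intent per capability before the main spec is edited; scenario bodies are deliberately left for the agent to read in full.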
**Key Principle: Intelligent Merging**
Unlike programmatic merging, you can apply **partial updates**:
- To add a scenario, just include that scenario under MODIFIED - don't copy existing scenarios
- The delta represents *intent*, not a wholesale replacement
- Use your judgment to merge changes sensibly
**Output On Success**
```
## Specs Synced: <change-name>
Updated main specs:
**<capability-1>**:
- Added requirement: "New Feature"
- Modified requirement: "Existing Feature" (added 1 scenario)
**<capability-2>**:
- Created new spec file
- Added requirement: "Another Feature"
Main specs are now updated. The change remains active - archive when implementation is complete.
```
**Guardrails**
- Read both delta and main specs before making changes
- Preserve existing content not mentioned in delta
- If something is unclear, ask for clarification
- Show what you're changing as you go
- The operation should be idempotent - running it twice should give the same result
+164
@@ -0,0 +1,164 @@
---
name: "OPSX: Verify"
description: Verify implementation matches change artifacts before archiving
category: Workflow
tags: [workflow, verify, experimental]
---
Verify that an implementation matches the change artifacts (specs, tasks, design).
**Input**: Optionally specify a change name after `/opsx:verify` (e.g., `/opsx:verify add-auth`). If omitted, check whether it can be inferred from conversation context. If the request is vague or ambiguous, you MUST prompt the user to choose from the available changes.
**Steps**
1. **If no change name provided, prompt for selection**
Run `openspec list --json` to get available changes. Use the **AskUserQuestion tool** to let the user select.
Show changes that have implementation tasks (tasks artifact exists).
Include the schema used for each change if available.
Mark changes with incomplete tasks as "(In Progress)".
**IMPORTANT**: Do NOT guess or auto-select a change. Always let the user choose.
2. **Check status to understand the schema**
```bash
openspec status --change "<name>" --json
```
Parse the JSON to understand:
- `schemaName`: The workflow being used (e.g., "spec-driven")
- Which artifacts exist for this change
3. **Get the change directory and load artifacts**
```bash
openspec instructions apply --change "<name>" --json
```
This returns the change directory and context files. Read all available artifacts from `contextFiles`.
4. **Initialize verification report structure**
Create a report structure with three dimensions:
- **Completeness**: Track tasks and spec coverage
- **Correctness**: Track requirement implementation and scenario coverage
- **Coherence**: Track design adherence and pattern consistency
Each dimension can have CRITICAL, WARNING, or SUGGESTION issues.
5. **Verify Completeness**
**Task Completion**:
- If tasks.md exists in contextFiles, read it
- Parse checkboxes: `- [ ]` (incomplete) vs `- [x]` (complete)
- Count complete vs total tasks
- If incomplete tasks exist:
- Add CRITICAL issue for each incomplete task
- Recommendation: "Complete task: <description>" or "Mark as done if already implemented"
**Spec Coverage**:
- If delta specs exist in `openspec/changes/<name>/specs/`:
- Extract all requirements (marked with "### Requirement:")
- For each requirement:
- Search codebase for keywords related to the requirement
- Assess if implementation likely exists
- If requirements appear unimplemented:
- Add CRITICAL issue: "Requirement not found: <requirement name>"
- Recommendation: "Implement requirement X: <description>"
6. **Verify Correctness**
**Requirement Implementation Mapping**:
- For each requirement from delta specs:
- Search codebase for implementation evidence
- If found, note file paths and line ranges
- Assess if implementation matches requirement intent
- If divergence detected:
- Add WARNING: "Implementation may diverge from spec: <details>"
- Recommendation: "Review <file>:<lines> against requirement X"
**Scenario Coverage**:
- For each scenario in delta specs (marked with "#### Scenario:"):
- Check if conditions are handled in code
- Check if tests exist covering the scenario
- If scenario appears uncovered:
- Add WARNING: "Scenario not covered: <scenario name>"
- Recommendation: "Add test or implementation for scenario: <description>"
7. **Verify Coherence**
**Design Adherence**:
- If design.md exists in contextFiles:
- Extract key decisions (look for sections like "Decision:", "Approach:", "Architecture:")
- Verify implementation follows those decisions
- If contradiction detected:
- Add WARNING: "Design decision not followed: <decision>"
- Recommendation: "Update implementation or revise design.md to match reality"
- If no design.md: Skip design adherence check, note "No design.md to verify against"
**Code Pattern Consistency**:
- Review new code for consistency with project patterns
- Check file naming, directory structure, coding style
- If significant deviations found:
- Add SUGGESTION: "Code pattern deviation: <details>"
- Recommendation: "Consider following project pattern: <example>"
8. **Generate Verification Report**
**Summary Scorecard**:
```
## Verification Report: <change-name>
### Summary
| Dimension    | Status            |
|--------------|-------------------|
| Completeness | X/Y tasks, N reqs |
| Correctness  | M/N reqs covered  |
| Coherence    | Followed/Issues   |
```
**Issues by Priority**:
1. **CRITICAL** (Must fix before archive):
- Incomplete tasks
- Missing requirement implementations
- Each with specific, actionable recommendation
2. **WARNING** (Should fix):
- Spec/design divergences
- Missing scenario coverage
- Each with specific recommendation
3. **SUGGESTION** (Nice to fix):
- Pattern inconsistencies
- Minor improvements
- Each with specific recommendation
**Final Assessment**:
- If CRITICAL issues: "X critical issue(s) found. Fix before archiving."
- If only warnings: "No critical issues. Y warning(s) to consider. Ready for archive (with noted improvements)."
- If all clear: "All checks passed. Ready for archive."
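As a rough illustration of the task-completion check and the final assessment rules above, the following is a minimal sketch rather than part of the OpenSpec CLI. The function names `parse_tasks` and `final_assessment` are hypothetical, and accepting an uppercase `[X]` is an assumption beyond the `- [ ]` / `- [x]` forms named in the steps.

```python
import re

# Matches markdown task checkboxes such as "- [ ] Do thing" or "- [x] Done thing".
# Accepting an uppercase "X" is an assumption, not something the steps require.
CHECKBOX = re.compile(r"^\s*-\s+\[([ xX])\]\s+(.*)$")


def parse_tasks(markdown: str) -> list[tuple[bool, str]]:
    """Return (is_complete, description) for every checkbox line."""
    tasks = []
    for line in markdown.splitlines():
        match = CHECKBOX.match(line)
        if match:
            tasks.append((match.group(1).lower() == "x", match.group(2).strip()))
    return tasks


def final_assessment(critical: int, warnings: int) -> str:
    """Build the closing line of the report from the issue counts."""
    if critical:
        return f"{critical} critical issue(s) found. Fix before archiving."
    if warnings:
        return (
            f"No critical issues. {warnings} warning(s) to consider. "
            "Ready for archive (with noted improvements)."
        )
    return "All checks passed. Ready for archive."
```

Each incomplete entry returned by `parse_tasks` would map to one CRITICAL issue, and the resulting counts feed `final_assessment`.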
**Verification Heuristics**
- **Completeness**: Focus on objective checklist items (checkboxes, requirements list)
- **Correctness**: Use keyword search, file path analysis, reasonable inference - don't require perfect certainty
- **Coherence**: Look for glaring inconsistencies, don't nitpick style
- **False Positives**: When uncertain, prefer SUGGESTION over WARNING, WARNING over CRITICAL
- **Actionability**: Every issue must have a specific recommendation with file/line references where applicable
**Graceful Degradation**
- If only tasks.md exists: verify task completion only, skip spec/design checks
- If tasks + specs exist: verify completeness and correctness, skip design
- If full artifacts: verify all three dimensions
- Always note which checks were skipped and why
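The degradation rules above can be sketched as a small planning function. This is only an approximation under stated assumptions: the artifact names `tasks`, `specs`, and `design` follow the spec-driven schema, `plan_checks` is a hypothetical helper, and each dimension is tied to a single artifact even though the real checks overlap (spec coverage, for example, also counts toward completeness).

```python
def plan_checks(artifacts: set[str]) -> tuple[list[str], list[str]]:
    """Decide which verification dimensions to run and which to skip.

    `artifacts` holds the artifact names that exist for the change
    (assumed here to be "tasks", "specs", and "design").
    Returns (dimensions_to_run, skip_notes).
    """
    # Simplified one-to-one mapping between a dimension and the artifact it needs.
    requirements = [
        ("completeness", "tasks"),
        ("correctness", "specs"),
        ("coherence", "design"),
    ]
    run, skipped = [], []
    for dimension, needed in requirements:
        if needed in artifacts:
            run.append(dimension)
        else:
            # Always record why a check was skipped so the report can say so.
            skipped.append(f"{dimension}: skipped, no {needed} artifact")
    return run, skipped
```

Returning the skip notes alongside the plan keeps the "always note which checks were skipped and why" rule easy to satisfy when the report is rendered.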
**Output Format**
Use clear markdown with:
- Table for summary scorecard
- Grouped lists for issues (CRITICAL/WARNING/SUGGESTION)
- Code references in format: `file.ts:123`
- Specific, actionable recommendations
- No vague suggestions like "consider reviewing"
@@ -0,0 +1,156 @@
---
name: openspec-apply-change
description: Implement tasks from an OpenSpec change. Use when the user wants to start implementing, continue implementation, or work through tasks.
license: MIT
compatibility: Requires openspec CLI.
metadata:
author: openspec
version: "1.0"
generatedBy: "1.2.0"
---
Implement tasks from an OpenSpec change.
**Input**: Optionally specify a change name. If omitted, check if it can be inferred from conversation context. If vague or ambiguous you MUST prompt for available changes.
**Steps**
1. **Select the change**
If a name is provided, use it. Otherwise:
- Infer from conversation context if the user mentioned a change
- Auto-select if only one active change exists
- If ambiguous, run `openspec list --json` to get available changes and use the **AskUserQuestion tool** to let the user select
Always announce: "Using change: <name>" and how to override (e.g., `/opsx:apply <other>`).
2. **Check status to understand the schema**
```bash
openspec status --change "<name>" --json
```
Parse the JSON to understand:
- `schemaName`: The workflow being used (e.g., "spec-driven")
- Which artifact contains the tasks (typically "tasks" for spec-driven, check status for others)
3. **Get apply instructions**
```bash
openspec instructions apply --change "<name>" --json
```
This returns:
- Context file paths (varies by schema - could be proposal/specs/design/tasks or spec/tests/implementation/docs)
- Progress (total, complete, remaining)
- Task list with status
- Dynamic instruction based on current state
**Handle states:**
- If `state: "blocked"` (missing artifacts): show message, suggest using openspec-continue-change
- If `state: "all_done"`: congratulate, suggest archive
- Otherwise: proceed to implementation
4. **Read context files**
Read the files listed in `contextFiles` from the apply instructions output.
The files depend on the schema being used:
- **spec-driven**: proposal, specs, design, tasks
- Other schemas: follow the contextFiles from CLI output
5. **Show current progress**
Display:
- Schema being used
- Progress: "N/M tasks complete"
- Remaining tasks overview
- Dynamic instruction from CLI
6. **Implement tasks (loop until done or blocked)**
For each pending task:
- Show which task is being worked on
- Make the code changes required
- Keep changes minimal and focused
- Mark task complete in the tasks file: `- [ ]` → `- [x]`
- Continue to next task
**Pause if:**
- Task is unclear → ask for clarification
- Implementation reveals a design issue → suggest updating artifacts
- Error or blocker encountered → report and wait for guidance
- User interrupts
7. **On completion or pause, show status**
Display:
- Tasks completed this session
- Overall progress: "N/M tasks complete"
- If all done: suggest archive
- If paused: explain why and wait for guidance
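To make the checkbox update in step 6 and the "N/M tasks complete" line concrete, here is a minimal sketch. It assumes the tasks file uses plain `- [ ]` / `- [x]` markdown checkboxes; `complete_next_task` and `progress` are hypothetical helper names, not OpenSpec functions.

```python
import re

# A pending checkbox at the start of a line, e.g. "- [ ] Implement OAuth flow".
PENDING = re.compile(r"^([ \t]*-[ \t]+)\[ \]", re.MULTILINE)
# A completed checkbox, e.g. "- [x] Add schema".
DONE = re.compile(r"^[ \t]*-[ \t]+\[[xX]\]", re.MULTILINE)


def complete_next_task(markdown: str) -> str:
    """Flip the first `- [ ]` checkbox to `- [x]`, leaving the rest untouched."""
    return PENDING.sub(r"\1[x]", markdown, count=1)


def progress(markdown: str) -> str:
    """Render the progress line shown to the user."""
    done = len(DONE.findall(markdown))
    pending = len(PENDING.findall(markdown))
    return f"{done}/{done + pending} tasks complete"
```

Updating one checkbox at a time mirrors the guardrail about marking each task immediately after it is finished, rather than batching the updates at the end of a session.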
**Output During Implementation**
```
## Implementing: <change-name> (schema: <schema-name>)
Working on task 3/7: <task description>
[...implementation happening...]
✓ Task complete
Working on task 4/7: <task description>
[...implementation happening...]
✓ Task complete
```
**Output On Completion**
```
## Implementation Complete
**Change:** <change-name>
**Schema:** <schema-name>
**Progress:** 7/7 tasks complete ✓
### Completed This Session
- [x] Task 1
- [x] Task 2
...
All tasks complete! Ready to archive this change.
```
**Output On Pause (Issue Encountered)**
```
## Implementation Paused
**Change:** <change-name>
**Schema:** <schema-name>
**Progress:** 4/7 tasks complete
### Issue Encountered
<description of the issue>
**Options:**
1. <option 1>
2. <option 2>
3. Other approach
What would you like to do?
```
**Guardrails**
- Keep going through tasks until done or blocked
- Always read context files before starting (from the apply instructions output)
- If task is ambiguous, pause and ask before implementing
- If implementation reveals issues, pause and suggest artifact updates
- Keep code changes minimal and scoped to each task
- Update task checkbox immediately after completing each task
- Pause on errors, blockers, or unclear requirements - don't guess
- Use contextFiles from CLI output, don't assume specific file names
**Fluid Workflow Integration**
This skill supports the "actions on a change" model:
- **Can be invoked anytime**: Before all artifacts are done (if tasks exist), after partial implementation, interleaved with other actions
- **Allows artifact updates**: If implementation reveals design issues, suggest updating artifacts - not phase-locked, work fluidly
@@ -0,0 +1,114 @@
---
name: openspec-archive-change
description: Archive a completed change in the experimental workflow. Use when the user wants to finalize and archive a change after implementation is complete.
license: MIT
compatibility: Requires openspec CLI.
metadata:
author: openspec
version: "1.0"
generatedBy: "1.2.0"
---
Archive a completed change in the experimental workflow.
**Input**: Optionally specify a change name. If omitted, check if it can be inferred from conversation context. If vague or ambiguous you MUST prompt for available changes.
**Steps**
1. **If no change name provided, prompt for selection**
Run `openspec list --json` to get available changes. Use the **AskUserQuestion tool** to let the user select.
Show only active changes (not already archived).
Include the schema used for each change if available.
**IMPORTANT**: Do NOT guess or auto-select a change. Always let the user choose.
2. **Check artifact completion status**
Run `openspec status --change "<name>" --json` to check artifact completion.
Parse the JSON to understand:
- `schemaName`: The workflow being used
- `artifacts`: List of artifacts with their status (`done` or other)
**If any artifacts are not `done`:**
- Display warning listing incomplete artifacts
- Use **AskUserQuestion tool** to confirm user wants to proceed
- Proceed if user confirms
3. **Check task completion status**
Read the tasks file (typically `tasks.md`) to check for incomplete tasks.
Count tasks marked with `- [ ]` (incomplete) vs `- [x]` (complete).
**If incomplete tasks found:**
- Display warning showing count of incomplete tasks
- Use **AskUserQuestion tool** to confirm user wants to proceed
- Proceed if user confirms
**If no tasks file exists:** Proceed without task-related warning.
4. **Assess delta spec sync state**
Check for delta specs at `openspec/changes/<name>/specs/`. If none exist, proceed without sync prompt.
**If delta specs exist:**
- Compare each delta spec with its corresponding main spec at `openspec/specs/<capability>/spec.md`
- Determine what changes would be applied (adds, modifications, removals, renames)
- Show a combined summary before prompting
**Prompt options:**
- If changes needed: "Sync now (recommended)", "Archive without syncing"
- If already synced: "Archive now", "Sync anyway", "Cancel"
If user chooses sync, use Task tool (subagent_type: "general-purpose", prompt: "Use Skill tool to invoke openspec-sync-specs for change '<name>'. Delta spec analysis: <include the analyzed delta spec summary>"). Proceed to archive whether or not the user chose to sync; stop only if they chose "Cancel".
5. **Perform the archive**
Create the archive directory if it doesn't exist:
```bash
mkdir -p openspec/changes/archive
```
Generate target name using current date: `YYYY-MM-DD-<change-name>`
**Check if target already exists:**
- If yes: Fail with an error and suggest renaming the existing archive or using a different date
- If no: Move the change directory to archive
```bash
mv openspec/changes/<name> openspec/changes/archive/YYYY-MM-DD-<name>
```
6. **Display summary**
Show archive completion summary including:
- Change name
- Schema that was used
- Archive location
- Whether specs were synced (if applicable)
- Note about any warnings (incomplete artifacts/tasks)
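The archive move in step 5 can be sketched as follows. This is an illustration under assumptions, not the CLI's implementation: `archive_change` is a hypothetical helper, `root` is assumed to be the project directory that contains `openspec/`, and the date is passed in explicitly so the behaviour is easy to test.

```python
import shutil
from datetime import date
from pathlib import Path


def archive_change(root: Path, name: str, today: date) -> Path:
    """Move openspec/changes/<name> to openspec/changes/archive/YYYY-MM-DD-<name>."""
    source = root / "openspec" / "changes" / name
    archive_dir = root / "openspec" / "changes" / "archive"
    # date.isoformat() yields the YYYY-MM-DD form used in the target name.
    target = archive_dir / f"{today.isoformat()}-{name}"

    if target.exists():
        # Fail rather than overwrite or merge into an existing archive.
        raise FileExistsError(
            f"{target} already exists; rename the existing archive or use a different date"
        )

    archive_dir.mkdir(parents=True, exist_ok=True)
    # Moving the whole directory carries .openspec.yaml along with it.
    shutil.move(str(source), str(target))
    return target
```

Checking for the target before creating anything means a name collision leaves the change directory untouched.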
**Output On Success**
```
## Archive Complete
**Change:** <change-name>
**Schema:** <schema-name>
**Archived to:** openspec/changes/archive/YYYY-MM-DD-<name>/
**Specs:** ✓ Synced to main specs (or "No delta specs" or "Sync skipped")
All artifacts complete. All tasks complete.
```
**Guardrails**
- Always prompt for change selection if not provided
- Use artifact graph (openspec status --json) for completion checking
- Don't block archive on warnings - just inform and confirm
- Preserve .openspec.yaml when moving to archive (it moves with the directory)
- Show clear summary of what happened
- If sync is requested, use openspec-sync-specs approach (agent-driven)
- If delta specs exist, always run the sync assessment and show the combined summary before prompting
@@ -0,0 +1,118 @@
---
name: openspec-continue-change
description: Continue working on an OpenSpec change by creating the next artifact. Use when the user wants to progress their change, create the next artifact, or continue their workflow.
license: MIT
compatibility: Requires openspec CLI.
metadata:
author: openspec
version: "1.0"
generatedBy: "1.2.0"
---
Continue working on a change by creating the next artifact.
**Input**: Optionally specify a change name. If omitted, check if it can be inferred from conversation context. If vague or ambiguous you MUST prompt for available changes.
**Steps**
1. **If no change name provided, prompt for selection**
Run `openspec list --json` to get available changes sorted by most recently modified. Then use the **AskUserQuestion tool** to let the user select which change to work on.
Present the top 3-4 most recently modified changes as options, showing:
- Change name
- Schema (from `schema` field if present, otherwise "spec-driven")
- Status (e.g., "0/5 tasks", "complete", "no tasks")
- How recently it was modified (from `lastModified` field)
Mark the most recently modified change as "(Recommended)" since it's likely what the user wants to continue.
**IMPORTANT**: Do NOT guess or auto-select a change. Always let the user choose.
2. **Check current status**
```bash
openspec status --change "<name>" --json
```
Parse the JSON to understand current state. The response includes:
- `schemaName`: The workflow schema being used (e.g., "spec-driven")
- `artifacts`: Array of artifacts with their status ("done", "ready", "blocked")
- `isComplete`: Boolean indicating if all artifacts are complete
3. **Act based on status**:
---
**If all artifacts are complete (`isComplete: true`)**:
- Congratulate the user
- Show final status including the schema used
- Suggest: "All artifacts created! You can now implement this change or archive it."
- STOP
---
**If artifacts are ready to create** (status shows artifacts with `status: "ready"`):
- Pick the FIRST artifact with `status: "ready"` from the status output
- Get its instructions:
```bash
openspec instructions <artifact-id> --change "<name>" --json
```
- Parse the JSON. The key fields are:
- `context`: Project background (constraints for you - do NOT include in output)
- `rules`: Artifact-specific rules (constraints for you - do NOT include in output)
- `template`: The structure to use for your output file
- `instruction`: Schema-specific guidance
- `outputPath`: Where to write the artifact
- `dependencies`: Completed artifacts to read for context
- **Create the artifact file**:
- Read any completed dependency files for context
- Use `template` as the structure - fill in its sections
- Apply `context` and `rules` as constraints when writing - but do NOT copy them into the file
- Write to the output path specified in instructions
- Show what was created and what's now unlocked
- STOP after creating ONE artifact
---
**If no artifacts are ready (all blocked)**:
- This shouldn't happen with a valid schema
- Show status and suggest checking for issues
4. **After creating an artifact, show progress**
```bash
openspec status --change "<name>"
```
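The branching in step 3 can be sketched as a small selector over the status JSON. The fields `isComplete`, `artifacts`, and `status` come from the description above; the `id` key used to name each artifact is an assumption, as is the helper name `next_ready_artifact`.

```python
import json
from typing import Optional


def next_ready_artifact(status_json: str) -> Optional[str]:
    """Return the first artifact whose status is "ready", or None.

    None covers both "everything is complete" and "everything is blocked";
    the caller distinguishes the two by checking `isComplete`.
    """
    status = json.loads(status_json)
    if status.get("isComplete"):
        return None
    for artifact in status.get("artifacts", []):
        if artifact.get("status") == "ready":
            # "id" is an assumed field name for the artifact identifier.
            return artifact.get("id")
    return None
```

Returning only the first match reflects the rule of creating one artifact per invocation.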
**Output**
After each invocation, show:
- Which artifact was created
- Schema workflow being used
- Current progress (N/M complete)
- What artifacts are now unlocked
- Prompt: "Want to continue? Just ask me to continue or tell me what to do next."
**Artifact Creation Guidelines**
The artifact types and their purpose depend on the schema. Use the `instruction` field from the instructions output to understand what to create.
Common artifact patterns:
**spec-driven schema** (proposal → specs → design → tasks):
- **proposal.md**: Ask user about the change if not clear. Fill in Why, What Changes, Capabilities, Impact.
- The Capabilities section is critical - each capability listed will need a spec file.
- **specs/<capability>/spec.md**: Create one spec per capability listed in the proposal's Capabilities section (use the capability name, not the change name).
- **design.md**: Document technical decisions, architecture, and implementation approach.
- **tasks.md**: Break down implementation into checkboxed tasks.
For other schemas, follow the `instruction` field from the CLI output.
**Guardrails**
- Create ONE artifact per invocation
- Always read dependency artifacts before creating a new one
- Never skip artifacts or create out of order
- If context is unclear, ask the user before creating
- Verify the artifact file exists after writing before marking progress
- Use the schema's artifact sequence, don't assume specific artifact names
- **IMPORTANT**: `context` and `rules` are constraints for YOU, not content for the file
- Do NOT copy `<context>`, `<rules>`, `<project_context>` blocks into the artifact
- These guide what you write, but should never appear in the output
+288
@@ -0,0 +1,288 @@
---
name: openspec-explore
description: Enter explore mode - a thinking partner for exploring ideas, investigating problems, and clarifying requirements. Use when the user wants to think through something before or during a change.
license: MIT
compatibility: Requires openspec CLI.
metadata:
author: openspec
version: "1.0"
generatedBy: "1.2.0"
---
Enter explore mode. Think deeply. Visualize freely. Follow the conversation wherever it goes.
**IMPORTANT: Explore mode is for thinking, not implementing.** You may read files, search code, and investigate the codebase, but you must NEVER write code or implement features. If the user asks you to implement something, remind them to exit explore mode first and create a change proposal. You MAY create OpenSpec artifacts (proposals, designs, specs) if the user asks—that's capturing thinking, not implementing.
**This is a stance, not a workflow.** There are no fixed steps, no required sequence, no mandatory outputs. You're a thinking partner helping the user explore.
---
## The Stance
- **Curious, not prescriptive** - Ask questions that emerge naturally, don't follow a script
- **Open threads, not interrogations** - Surface multiple interesting directions and let the user follow what resonates. Don't funnel them through a single path of questions.
- **Visual** - Use ASCII diagrams liberally when they'd help clarify thinking
- **Adaptive** - Follow interesting threads, pivot when new information emerges
- **Patient** - Don't rush to conclusions, let the shape of the problem emerge
- **Grounded** - Explore the actual codebase when relevant, don't just theorize
---
## What You Might Do
Depending on what the user brings, you might:
**Explore the problem space**
- Ask clarifying questions that emerge from what they said
- Challenge assumptions
- Reframe the problem
- Find analogies
**Investigate the codebase**
- Map existing architecture relevant to the discussion
- Find integration points
- Identify patterns already in use
- Surface hidden complexity
**Compare options**
- Brainstorm multiple approaches
- Build comparison tables
- Sketch tradeoffs
- Recommend a path (if asked)
**Visualize**
```
┌─────────────────────────────────────────┐
│ Use ASCII diagrams liberally │
├─────────────────────────────────────────┤
│ │
│ ┌────────┐ ┌────────┐ │
│ │ State │────────▶│ State │ │
│ │ A │ │ B │ │
│ └────────┘ └────────┘ │
│ │
│ System diagrams, state machines, │
│ data flows, architecture sketches, │
│ dependency graphs, comparison tables │
│ │
└─────────────────────────────────────────┘
```
**Surface risks and unknowns**
- Identify what could go wrong
- Find gaps in understanding
- Suggest spikes or investigations
---
## OpenSpec Awareness
You have full context of the OpenSpec system. Use it naturally, don't force it.
### Check for context
At the start, quickly check what exists:
```bash
openspec list --json
```
This tells you:
- If there are active changes
- Their names, schemas, and status
- What the user might be working on
### When no change exists
Think freely. When insights crystallize, you might offer:
- "This feels solid enough to start a change. Want me to create a proposal?"
- Or keep exploring - no pressure to formalize
### When a change exists
If the user mentions a change or you detect one is relevant:
1. **Read existing artifacts for context**
- `openspec/changes/<name>/proposal.md`
- `openspec/changes/<name>/design.md`
- `openspec/changes/<name>/tasks.md`
- etc.
2. **Reference them naturally in conversation**
- "Your design mentions using Redis, but we just realized SQLite fits better..."
- "The proposal scopes this to premium users, but we're now thinking everyone..."
3. **Offer to capture when decisions are made**
| Insight Type | Where to Capture |
|--------------|------------------|
| New requirement discovered | `specs/<capability>/spec.md` |
| Requirement changed | `specs/<capability>/spec.md` |
| Design decision made | `design.md` |
| Scope changed | `proposal.md` |
| New work identified | `tasks.md` |
| Assumption invalidated | Relevant artifact |
Example offers:
- "That's a design decision. Capture it in design.md?"
- "This is a new requirement. Add it to specs?"
- "This changes scope. Update the proposal?"
4. **The user decides** - Offer and move on. Don't pressure. Don't auto-capture.
---
## What You Don't Have To Do
- Follow a script
- Ask the same questions every time
- Produce a specific artifact
- Reach a conclusion
- Stay on topic if a tangent is valuable
- Be brief (this is thinking time)
---
## Handling Different Entry Points
**User brings a vague idea:**
```
User: I'm thinking about adding real-time collaboration
You: Real-time collab is a big space. Let me think about this...
COLLABORATION SPECTRUM
════════════════════════════════════════════
Awareness Coordination Sync
│ │ │
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐
│Presence│ │Cursors │ │ CRDT │
│ "3 │ │ Multi │ │Conflict│
│online" │ │ select │ │ free │
└────────┘ └────────┘ └────────┘
│ │ │
trivial moderate complex
Where's your head at?
```
**User brings a specific problem:**
```
User: The auth system is a mess
You: [reads codebase]
┌─────────────────────────────────────────────┐
│ CURRENT AUTH FLOW │
└─────────────────────────────────────────────┘
┌─────────────┼─────────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Google │ │ GitHub │ │ Email │
│ OAuth │ │ OAuth │ │ Magic │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
└─────────────┼─────────────┘
┌───────────┐
│ Session │
└─────┬─────┘
┌───────────┐
│ Perms │
└───────────┘
I see three tangles. Which one's burning?
```
**User is stuck mid-implementation:**
```
User: /opsx:explore add-auth-system
The OAuth integration is more complex than expected
You: [reads change artifacts]
You're on task 4: "Implement OAuth flow"
Let me trace what's involved...
[draws diagram, explores options, suggests paths]
Want to update the design to reflect this?
Or add a spike task to investigate?
```
**User wants to compare options:**
```
User: Should we use Postgres or SQLite?
You: Generic answer is boring. What's the context?
User: A CLI tool that tracks local dev environments
You: That changes everything.
┌─────────────────────────────────────────────────┐
│ CLI TOOL DATA STORAGE │
└─────────────────────────────────────────────────┘
Key constraints:
• No daemon running
• Must work offline
• Single user
SQLite Postgres
Deployment embedded ✓ needs server ✗
Offline yes ✓ no ✗
Single file yes ✓ no ✗
SQLite. Not even close.
Unless... is there a sync component?
```
---
## Ending Discovery
There's no required ending. Discovery might:
- **Flow into a proposal**: "Ready to start? I can create a change proposal."
- **Result in artifact updates**: "Updated design.md with these decisions"
- **Just provide clarity**: User has what they need, moves on
- **Continue later**: "We can pick this up anytime"
When it feels like things are crystallizing, you might summarize:
```
## What We Figured Out
**The problem**: [crystallized understanding]
**The approach**: [if one emerged]
**Open questions**: [if any remain]
**Next steps** (if ready):
- Create a change proposal
- Keep exploring: just keep talking
```
But this summary is optional. Sometimes the thinking IS the value.
---
## Guardrails
- **Don't implement** - Never write code or implement features. Creating OpenSpec artifacts is fine, writing application code is not.
- **Don't fake understanding** - If something is unclear, dig deeper
- **Don't rush** - Discovery is thinking time, not task time
- **Don't force structure** - Let patterns emerge naturally
- **Don't auto-capture** - Offer to save insights, don't just do it
- **Do visualize** - A good diagram is worth many paragraphs
- **Do explore the codebase** - Ground discussions in reality
- **Do question assumptions** - Including the user's and your own
+101
@@ -0,0 +1,101 @@
---
name: openspec-ff-change
description: Fast-forward through OpenSpec artifact creation. Use when the user wants to quickly create all artifacts needed for implementation without stepping through each one individually.
license: MIT
compatibility: Requires openspec CLI.
metadata:
author: openspec
version: "1.0"
generatedBy: "1.2.0"
---
Fast-forward through artifact creation - generate everything needed to start implementation in one go.
**Input**: The user's request should include a change name (kebab-case) OR a description of what they want to build.
**Steps**
1. **If no clear input provided, ask what they want to build**
Use the **AskUserQuestion tool** (open-ended, no preset options) to ask:
> "What change do you want to work on? Describe what you want to build or fix."
From their description, derive a kebab-case name (e.g., "add user authentication" → `add-user-auth`).
**IMPORTANT**: Do NOT proceed without understanding what the user wants to build.
2. **Create the change directory**
```bash
openspec new change "<name>"
```
This creates a scaffolded change at `openspec/changes/<name>/`.
3. **Get the artifact build order**
```bash
openspec status --change "<name>" --json
```
Parse the JSON to get:
- `applyRequires`: array of artifact IDs needed before implementation (e.g., `["tasks"]`)
- `artifacts`: list of all artifacts with their status and dependencies
4. **Create artifacts in sequence until apply-ready**
Use the **TodoWrite tool** to track progress through the artifacts.
Loop through artifacts in dependency order (artifacts with no pending dependencies first):
a. **For each artifact that is `ready` (dependencies satisfied)**:
- Get instructions:
```bash
openspec instructions <artifact-id> --change "<name>" --json
```
- The instructions JSON includes:
- `context`: Project background (constraints for you - do NOT include in output)
- `rules`: Artifact-specific rules (constraints for you - do NOT include in output)
- `template`: The structure to use for your output file
- `instruction`: Schema-specific guidance for this artifact type
- `outputPath`: Where to write the artifact
- `dependencies`: Completed artifacts to read for context
- Read any completed dependency files for context
- Create the artifact file using `template` as the structure
- Apply `context` and `rules` as constraints - but do NOT copy them into the file
- Show brief progress: "✓ Created <artifact-id>"
b. **Continue until all `applyRequires` artifacts are complete**
- After creating each artifact, re-run `openspec status --change "<name>" --json`
- Check if every artifact ID in `applyRequires` has `status: "done"` in the artifacts array
- Stop when all `applyRequires` artifacts are done
c. **If an artifact requires user input** (unclear context):
- Use **AskUserQuestion tool** to clarify
- Then continue with creation
5. **Show final status**
```bash
openspec status --change "<name>"
```
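The stop condition in step 4b can be expressed as a check over the parsed status output. This is a sketch under assumptions: `applyRequires`, `artifacts`, and `status` are the fields named above, while the per-artifact `id` key and the helper name `missing_for_apply` are hypothetical.

```python
def missing_for_apply(status: dict) -> list[str]:
    """List the applyRequires artifact IDs that are not yet "done".

    An empty result means the change is ready for implementation.
    """
    done = {
        artifact.get("id")  # "id" is an assumed field name
        for artifact in status.get("artifacts", [])
        if artifact.get("status") == "done"
    }
    return [req for req in status.get("applyRequires", []) if req not in done]
```

Re-running the status command after each artifact and looping until this list is empty matches the "continue until all `applyRequires` artifacts are complete" rule. An empty `applyRequires` array is treated here as immediately apply-ready.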
**Output**
After completing all artifacts, summarize:
- Change name and location
- List of artifacts created with brief descriptions
- What's ready: "All artifacts created! Ready for implementation."
- Prompt: "Run `/opsx:apply` or ask me to implement to start working on the tasks."
**Artifact Creation Guidelines**
- Follow the `instruction` field from `openspec instructions` for each artifact type
- The schema defines what each artifact should contain - follow it
- Read dependency artifacts for context before creating new ones
- Use `template` as the structure for your output file - fill in its sections
- **IMPORTANT**: `context` and `rules` are constraints for YOU, not content for the file
- Do NOT copy `<context>`, `<rules>`, `<project_context>` blocks into the artifact
- These guide what you write, but should never appear in the output
**Guardrails**
- Create ALL artifacts needed for implementation (as defined by schema's `apply.requires`)
- Always read dependency artifacts before creating a new one
- If context is critically unclear, ask the user - but prefer making reasonable decisions to keep momentum
- If a change with that name already exists, suggest continuing that change instead
- Verify each artifact file exists after writing before proceeding to next
@@ -0,0 +1,74 @@
---
name: openspec-new-change
description: Start a new OpenSpec change using the experimental artifact workflow. Use when the user wants to create a new feature, fix, or modification with a structured step-by-step approach.
license: MIT
compatibility: Requires openspec CLI.
metadata:
author: openspec
version: "1.0"
generatedBy: "1.2.0"
---
Start a new change using the experimental artifact-driven approach.
**Input**: The user's request should include a change name (kebab-case) OR a description of what they want to build.
**Steps**
1. **If no clear input provided, ask what they want to build**
Use the **AskUserQuestion tool** (open-ended, no preset options) to ask:
> "What change do you want to work on? Describe what you want to build or fix."
From their description, derive a kebab-case name (e.g., "add user authentication" → `add-user-auth`).
**IMPORTANT**: Do NOT proceed without understanding what the user wants to build.
2. **Determine the workflow schema**
Use the default schema (omit `--schema`) unless the user explicitly requests a different workflow.
**Use a different schema only if the user mentions:**
- A specific schema name → use `--schema <name>`
- "show workflows" or "what workflows" → run `openspec schemas --json` and let them choose
**Otherwise**: Omit `--schema` to use the default.
3. **Create the change directory**
```bash
openspec new change "<name>"
```
Add `--schema <name>` only if the user requested a specific workflow.
This creates a scaffolded change at `openspec/changes/<name>/` with the selected schema.
4. **Show the artifact status**
```bash
openspec status --change "<name>"
```
This shows which artifacts need to be created and which are ready (dependencies satisfied).
5. **Get instructions for the first artifact**
The first artifact depends on the schema (e.g., `proposal` for spec-driven).
Check the status output to find the first artifact with status "ready".
```bash
openspec instructions <first-artifact-id> --change "<name>"
```
This outputs the template and context for creating the first artifact.
6. **STOP and wait for user direction**
**Output**
After completing the steps, summarize:
- Change name and location
- Schema/workflow being used and its artifact sequence
- Current status (0/N artifacts complete)
- The template for the first artifact
- Prompt: "Ready to create the first artifact? Just describe what this change is about and I'll draft it, or ask me to continue."
**Guardrails**
- Do NOT create any artifacts yet - just show the instructions
- Do NOT advance beyond showing the first artifact template
- If the name is invalid (not kebab-case), ask for a valid name
- If a change with that name already exists, suggest continuing that change instead
- Pass `--schema` if using a non-default workflow
@@ -0,0 +1,138 @@
---
name: openspec-sync-specs
description: Sync delta specs from a change to main specs. Use when the user wants to update main specs with changes from a delta spec, without archiving the change.
license: MIT
compatibility: Requires openspec CLI.
metadata:
author: openspec
version: "1.0"
generatedBy: "1.2.0"
---
Sync delta specs from a change to main specs.
This is an **agent-driven** operation - you will read delta specs and directly edit main specs to apply the changes. This allows intelligent merging (e.g., adding a scenario without copying the entire requirement).
**Input**: Optionally specify a change name. If omitted, check if it can be inferred from conversation context. If vague or ambiguous you MUST prompt for available changes.
**Steps**
1. **If no change name provided, prompt for selection**
Run `openspec list --json` to get available changes. Use the **AskUserQuestion tool** to let the user select.
Show changes that have delta specs (under `specs/` directory).
**IMPORTANT**: Do NOT guess or auto-select a change. Always let the user choose.
2. **Find delta specs**
Look for delta spec files in `openspec/changes/<name>/specs/*/spec.md`.
Each delta spec file contains sections like:
- `## ADDED Requirements` - New requirements to add
- `## MODIFIED Requirements` - Changes to existing requirements
- `## REMOVED Requirements` - Requirements to remove
- `## RENAMED Requirements` - Requirements to rename (FROM:/TO: format)
If no delta specs found, inform user and stop.
3. **For each delta spec, apply changes to main specs**
For each capability with a delta spec at `openspec/changes/<name>/specs/<capability>/spec.md`:
a. **Read the delta spec** to understand the intended changes
b. **Read the main spec** at `openspec/specs/<capability>/spec.md` (may not exist yet)
c. **Apply changes intelligently**:
**ADDED Requirements:**
- If requirement doesn't exist in main spec → add it
- If requirement already exists → update it to match (treat as implicit MODIFIED)
**MODIFIED Requirements:**
- Find the requirement in main spec
- Apply the changes - this can be:
- Adding new scenarios (don't need to copy existing ones)
- Modifying existing scenarios
- Changing the requirement description
- Preserve scenarios/content not mentioned in the delta
**REMOVED Requirements:**
- Remove the entire requirement block from main spec
**RENAMED Requirements:**
- Find the FROM requirement, rename to TO
d. **Create new main spec** if capability doesn't exist yet:
- Create `openspec/specs/<capability>/spec.md`
- Add Purpose section (can be brief, mark as TBD)
- Add Requirements section with the ADDED requirements
4. **Show summary**
After applying all changes, summarize:
- Which capabilities were updated
- What changes were made (requirements added/modified/removed/renamed)
**Delta Spec Format Reference**
```markdown
## ADDED Requirements
### Requirement: New Feature
The system SHALL do something new.
#### Scenario: Basic case
- **WHEN** user does X
- **THEN** system does Y
## MODIFIED Requirements
### Requirement: Existing Feature
#### Scenario: New scenario to add
- **WHEN** user does A
- **THEN** system does B
## REMOVED Requirements
### Requirement: Deprecated Feature
## RENAMED Requirements
- FROM: `### Requirement: Old Name`
- TO: `### Requirement: New Name`
```
**Key Principle: Intelligent Merging**
Unlike programmatic merging, you can apply **partial updates**:
- To add a scenario, just include that scenario under MODIFIED - don't copy existing scenarios
- The delta represents *intent*, not a wholesale replacement
- Use your judgment to merge changes sensibly
**Output On Success**
```
## Specs Synced: <change-name>
Updated main specs:
**<capability-1>**:
- Added requirement: "New Feature"
- Modified requirement: "Existing Feature" (added 1 scenario)
**<capability-2>**:
- Created new spec file
- Added requirement: "Another Feature"
Main specs are now updated. The change remains active - archive when implementation is complete.
```
**Guardrails**
- Read both delta and main specs before making changes
- Preserve existing content not mentioned in delta
- If something is unclear, ask for clarification
- Show what you're changing as you go
- The operation should be idempotent - running it twice should give the same result
@@ -0,0 +1,168 @@
---
name: openspec-verify-change
description: Verify implementation matches change artifacts. Use when the user wants to validate that implementation is complete, correct, and coherent before archiving.
license: MIT
compatibility: Requires openspec CLI.
metadata:
author: openspec
version: "1.0"
generatedBy: "1.2.0"
---
Verify that an implementation matches the change artifacts (specs, tasks, design).
**Input**: Optionally specify a change name. If omitted, check if it can be inferred from conversation context. If vague or ambiguous you MUST prompt for available changes.
**Steps**
1. **If no change name provided, prompt for selection**
Run `openspec list --json` to get available changes. Use the **AskUserQuestion tool** to let the user select.
Show changes that have implementation tasks (tasks artifact exists).
Include the schema used for each change if available.
Mark changes with incomplete tasks as "(In Progress)".
**IMPORTANT**: Do NOT guess or auto-select a change. Always let the user choose.
2. **Check status to understand the schema**
```bash
openspec status --change "<name>" --json
```
Parse the JSON to understand:
- `schemaName`: The workflow being used (e.g., "spec-driven")
- Which artifacts exist for this change
3. **Get the change directory and load artifacts**
```bash
openspec instructions apply --change "<name>" --json
```
This returns the change directory and context files. Read all available artifacts from `contextFiles`.
4. **Initialize verification report structure**
Create a report structure with three dimensions:
- **Completeness**: Track tasks and spec coverage
- **Correctness**: Track requirement implementation and scenario coverage
- **Coherence**: Track design adherence and pattern consistency
Each dimension can have CRITICAL, WARNING, or SUGGESTION issues.
5. **Verify Completeness**
**Task Completion**:
- If tasks.md exists in contextFiles, read it
- Parse checkboxes: `- [ ]` (incomplete) vs `- [x]` (complete)
- Count complete vs total tasks
- If incomplete tasks exist:
- Add CRITICAL issue for each incomplete task
- Recommendation: "Complete task: <description>" or "Mark as done if already implemented"
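The checkbox count above is the most mechanical part of the check. A minimal sketch follows; the function name `summarize_tasks` is hypothetical, and accepting `*` bullets and an uppercase `X` is an assumption beyond the `- [ ]` / `- [x]` forms named above.

```python
import re

# Matches "- [ ]" (incomplete) and "- [x]" (complete) task lines.
# Assumption: "*" bullets and an uppercase "X" are tolerated as well.
TASK_RE = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s+(.*)$")


def summarize_tasks(markdown: str) -> dict:
    """Count complete vs. total checkbox tasks and list the incomplete ones."""
    complete = 0
    incomplete = []
    for line in markdown.splitlines():
        match = TASK_RE.match(line)
        if not match:
            continue  # headings and prose are ignored
        if match.group(1) == " ":
            incomplete.append(match.group(2).strip())
        else:
            complete += 1
    return {
        "complete": complete,
        "total": complete + len(incomplete),
        "incomplete": incomplete,
    }
```

Each entry in `incomplete` would map to one CRITICAL issue in the report.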
**Spec Coverage**:
- If delta specs exist in `openspec/changes/<name>/specs/`:
- Extract all requirements (marked with "### Requirement:")
- For each requirement:
- Search codebase for keywords related to the requirement
- Assess if implementation likely exists
- If requirements appear unimplemented:
- Add CRITICAL issue: "Requirement not found: <requirement name>"
- Recommendation: "Implement requirement X: <description>"
6. **Verify Correctness**
**Requirement Implementation Mapping**:
- For each requirement from delta specs:
- Search codebase for implementation evidence
- If found, note file paths and line ranges
- Assess if implementation matches requirement intent
- If divergence detected:
- Add WARNING: "Implementation may diverge from spec: <details>"
- Recommendation: "Review <file>:<lines> against requirement X"
**Scenario Coverage**:
- For each scenario in delta specs (marked with "#### Scenario:"):
- Check if conditions are handled in code
- Check if tests exist covering the scenario
- If scenario appears uncovered:
- Add WARNING: "Scenario not covered: <scenario name>"
- Recommendation: "Add test or implementation for scenario: <description>"
7. **Verify Coherence**
**Design Adherence**:
- If design.md exists in contextFiles:
- Extract key decisions (look for sections like "Decision:", "Approach:", "Architecture:")
- Verify implementation follows those decisions
- If contradiction detected:
- Add WARNING: "Design decision not followed: <decision>"
- Recommendation: "Update implementation or revise design.md to match reality"
- If no design.md: Skip design adherence check, note "No design.md to verify against"
**Code Pattern Consistency**:
- Review new code for consistency with project patterns
- Check file naming, directory structure, coding style
- If significant deviations found:
- Add SUGGESTION: "Code pattern deviation: <details>"
- Recommendation: "Consider following project pattern: <example>"
8. **Generate Verification Report**
**Summary Scorecard**:
```
## Verification Report: <change-name>
### Summary
| Dimension | Status |
|--------------|------------------|
| Completeness | X/Y tasks, N reqs|
| Correctness | M/N reqs covered |
| Coherence | Followed/Issues |
```
**Issues by Priority**:
1. **CRITICAL** (Must fix before archive):
- Incomplete tasks
- Missing requirement implementations
- Each with specific, actionable recommendation
2. **WARNING** (Should fix):
- Spec/design divergences
- Missing scenario coverage
- Each with specific recommendation
3. **SUGGESTION** (Nice to fix):
- Pattern inconsistencies
- Minor improvements
- Each with specific recommendation
**Final Assessment**:
- If CRITICAL issues: "X critical issue(s) found. Fix before archiving."
- If only warnings: "No critical issues. Y warning(s) to consider. Ready for archive (with noted improvements)."
- If all clear: "All checks passed. Ready for archive."
**Verification Heuristics**
- **Completeness**: Focus on objective checklist items (checkboxes, requirements list)
- **Correctness**: Use keyword search, file path analysis, reasonable inference - don't require perfect certainty
- **Coherence**: Look for glaring inconsistencies, don't nitpick style
- **False Positives**: When uncertain, prefer SUGGESTION over WARNING, WARNING over CRITICAL
- **Actionability**: Every issue must have a specific recommendation with file/line references where applicable
**Graceful Degradation**
- If only tasks.md exists: verify task completion only, skip spec/design checks
- If tasks + specs exist: verify completeness and correctness, skip design
- If full artifacts: verify all three dimensions
- Always note which checks were skipped and why
**Output Format**
Use clear markdown with:
- Table for summary scorecard
- Grouped lists for issues (CRITICAL/WARNING/SUGGESTION)
- Code references in format: `file.ts:123`
- Specific, actionable recommendations
- No vague suggestions like "consider reviewing"
@@ -0,0 +1,7 @@
.venv/
__pycache__/
*.egg-info/
*.pyc
dist/
build/
.eggs/
@@ -0,0 +1,53 @@
# kb-search
CLI knowledge base with hybrid search (full-text + semantic vector search).
## Install
```bash
pipx install kb-search
```
## Quickstart
```bash
# Initialise (downloads embedding model ~90MB)
kb init
# Add documents
kb add ~/docs/manual.pdf --tags admin
kb add ~/notes/ --recursive
kb add --note "Always restart nginx after config changes" --tags ops
# Search
kb search "how to install git"
kb search "deploy process" --tags ops --type pdf
kb search "authentication" --format human
# Manage
kb list --format human
kb tags
kb status
```
## How it works
- **Ingestion**: Documents are chunked (PDFs via Docling, markdown by headers, code by AST/functions) and embedded locally
- **Storage**: Everything in a single SQLite database (`~/.kb/kb.db`) using FTS5 for keyword search and sqlite-vec for vector search
- **Search**: Hybrid retrieval combining BM25 keyword scoring and vector similarity via Reciprocal Rank Fusion
- **Output**: JSON (for LLM tool use) or human-readable terminal format
## Configuration
Optional YAML config at `~/.kb/config.yaml`. Works with zero configuration.
```bash
kb config # View current config
kb config set chunking.pdf.max_tokens 2048 # Change a value
```
ENV overrides: `KB_DATA_DIR`, `KB_MODEL`, `KB_DEFAULT_TOP`, `KB_DEFAULT_FORMAT`
## Claude Code Skill
This tool is designed to be wrapped as a Claude Code skill. See `SKILL.md` for the skill definition.
@@ -0,0 +1,110 @@
# kb-search skill
Search the user's personal knowledge base containing PDFs, markdown documents, code snippets, and text notes.
## When to use
- User asks a question that might be answered by their stored documents, notes, or code
- User explicitly says "check my notes", "search kb", "look in my knowledge base", "what do my docs say about..."
- User references documents or notes they've previously stored
- User asks "how do I..." style questions that their knowledge base likely covers
## Available commands
### Search (primary)
```bash
kb search "<query>" --top 10 --format json
```
Returns JSON with ranked results combining full-text and semantic search.
**Flags:**
- `--top N` — number of results (default: 10)
- `--tags tag1,tag2` — filter by tags (AND logic)
- `--type pdf|markdown|code|note` — filter by document type
- `--format json|human` — output format (always use json)
- `--fts-only` — keyword search only (skip semantic)
- `--vec-only` — semantic search only (skip keyword)
- `--threshold FLOAT` — minimum score cutoff
### Other useful commands
```bash
kb list --format json # List all documents
kb list --type pdf --format json # List only PDFs
kb tags --format json # List tags with counts
kb info <doc_id> --format json # Document details
kb status --format json # DB stats
```
## Output format (search)
```json
{
"query": "how to install git",
"results": [
{
"chunk_id": 1423,
"score": 0.031,
"score_breakdown": {"fts": 0.016, "vector": 0.015},
"text": "To install the latest version of git from source...",
"source": {
"document_id": 42,
"title": "Git Admin Guide",
"path": "/home/user/docs/git-admin.pdf",
"type": "pdf",
"page": 12,
"chunk_index": 3,
"total_chunks": 28,
"tags": ["git", "admin"]
}
}
],
"total_matches": 47,
"returned": 10
}
```
## How to answer
1. Run `kb search "<query>" --top 10 --format json`
2. Read the returned chunks
3. Synthesise a natural language answer from the top results
4. **ALWAYS cite sources**: "According to [title] (p.X)..." or "From [title], section [header]..."
5. If results have low scores (all below 0.01) or `returned: 0`, tell the user: "I couldn't find anything in your knowledge base about this"
6. If initial results seem off-target, try refining the query and searching again
## Multi-query strategy
For complex questions, search multiple times with different queries:
- Decompose the question into sub-queries
- Run each query separately
- Combine and deduplicate results across queries
- Synthesise a unified answer citing all relevant sources
Example:
```
User: "What's the difference between git rebase and merge?"
Query 1: kb search "git rebase explanation" --top 5 --format json
Query 2: kb search "git merge explanation" --top 5 --format json
Query 3: kb search "git rebase vs merge" --top 5 --format json
```
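The "combine and deduplicate" step can be sketched as below, assuming each response has the JSON shape shown under "Output format (search)". The helper name `merge_results` is hypothetical. Because `score` is relative to its own result set, keeping the highest-scoring occurrence across queries is only a rough heuristic.

```python
def merge_results(responses: list[dict], limit: int = 10) -> list[dict]:
    """Merge several `kb search` JSON responses, keeping one entry per chunk_id."""
    best: dict[int, dict] = {}
    for response in responses:
        for result in response.get("results", []):
            chunk_id = result["chunk_id"]
            # Keep the highest-scoring occurrence of each chunk.
            if chunk_id not in best or result["score"] > best[chunk_id]["score"]:
                best[chunk_id] = result
    ranked = sorted(best.values(), key=lambda r: r["score"], reverse=True)
    return ranked[:limit]
```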
## Filtering
Use filters when the question implies a specific domain:
- Code question → `--type code`
- From a specific topic → `--tags <topic>`
- Check available tags first: `kb tags --format json`
## Important notes
- Always use `--format json` for machine parsing
- The `score` field is relative, not absolute — compare scores within a result set
- `source.page` is only present for PDF documents
- `source.section_header` is only present for markdown documents with headers
- Results are already ranked by relevance (hybrid FTS + vector search)
Binary file not shown.
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-03-22
@@ -0,0 +1,396 @@
## Context
This is a greenfield Python CLI project. No existing codebase, no migration concerns. The tool will live at `~/.kb/` on the user's machine and be installed via `pipx install kb-search`. It must work entirely offline after initial model download.
Primary consumer is Claude Code (or similar LLM tools) via a skill wrapper that calls `kb search` and feeds JSON results to the LLM for synthesis. Secondary consumer is the user directly in a terminal. This dual-consumer constraint means output must be machine-parseable first, human-readable second.
The document corpus is ~3,000 items (2,000 PDFs of varying complexity, 500 markdown/text notes, 500 code snippets) producing ~22,000 chunks. This is small enough that brute-force vector search is viable and SQLite is more than sufficient.
## Goals / Non-Goals
**Goals:**
- Single-command install (`pipx install kb-search`) with `kb init` for model setup
- Ingest heterogeneous documents with format-appropriate chunking
- Hybrid search (keyword + semantic) with a single command
- JSON output contract stable enough for skill integration
- Configurable but works with zero configuration
- All state in one SQLite file for easy backup/portability
**Non-Goals:**
- LLM-based answer synthesis (the calling skill handles this)
- Multi-user or networked access
- Real-time / streaming ingestion
- Web UI or TUI dashboard
- Support for every possible document format (start with PDF, markdown, code, notes)
- Clustering, deduplication, or automatic organisation of documents
## Decisions
### 1. Package Structure
```
kb-search/
├── pyproject.toml
├── src/
│ └── kb_search/
│ ├── __init__.py
│ ├── cli.py # Click CLI entry point
│ ├── config.py # YAML config loading + ENV overrides
│ ├── database.py # SQLite schema, migrations, connection
│ ├── embeddings.py # Model download, loading, inference
│ ├── search.py # Hybrid search + RRF merging
│ ├── ingest/
│ │ ├── __init__.py
│ │ ├── detector.py # File type detection + routing
│ │ ├── docling.py # Docling pipeline (PDF, DOCX, HTML, images)
│ │ ├── markdown.py # Header-based markdown splitting
│ │ ├── code.py # AST/regex code splitting
│ │ └── note.py # Whole-document note handler
│ └── output.py # JSON + human-readable formatters
├── tests/
└── SKILL.md # Claude Code skill definition
```
**Why this structure:** Flat enough to navigate easily, but the `ingest/` subpackage isolates format-specific logic. Each ingestion module exports the same interface (`ingest(path, config) -> list[Chunk]`), making it easy to add formats later. Using `src/` layout per Python packaging best practices.
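A minimal sketch of that shared interface, under stated assumptions: the `Chunk` fields and the `Ingestor` protocol name are illustrative guesses, not the project's actual definitions. The note handler is shown because it is the simplest case.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Protocol


@dataclass
class Chunk:
    """One indexable unit of text (field names are assumptions)."""
    text: str
    chunk_index: int
    metadata: dict[str, Any] = field(default_factory=dict)


class Ingestor(Protocol):
    """Signature every ingestion module is expected to satisfy."""
    def __call__(self, path: Path, config: dict[str, Any]) -> list[Chunk]: ...


def ingest_note(path: Path, config: dict[str, Any]) -> list[Chunk]:
    """Whole-document handler: a note becomes exactly one chunk."""
    return [Chunk(text=path.read_text(encoding="utf-8"), chunk_index=0)]
```

With a common signature, the detector could route on file type through a plain mapping of type name to handler.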
### 2. SQLite as Sole Storage Backend
All data lives in `~/.kb/kb.db`:
```sql
-- Documents
CREATE TABLE documents (
id INTEGER PRIMARY KEY AUTOINCREMENT,
title TEXT NOT NULL,
source_path TEXT,
content_hash TEXT NOT NULL, -- SHA-256 for dedup/change detection
doc_type TEXT NOT NULL CHECK(doc_type IN ('pdf','markdown','code','note')),
language TEXT, -- for code: 'python','bash','go'
created_at TEXT DEFAULT (datetime('now')),
metadata TEXT DEFAULT '{}' -- JSON: page_count, author, etc.
);
-- Chunks
CREATE TABLE chunks (
id INTEGER PRIMARY KEY AUTOINCREMENT,
document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
chunk_index INTEGER NOT NULL,
text TEXT NOT NULL,
token_count INTEGER,
metadata TEXT DEFAULT '{}', -- JSON: page, section_header, symbol_name
created_at TEXT DEFAULT (datetime('now'))
);
-- FTS5 index (content-sync with chunks table)
CREATE VIRTUAL TABLE chunks_fts USING fts5(
text,
content='chunks',
content_rowid='id',
tokenize='porter unicode61'
);
-- Triggers to keep FTS in sync
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;
-- Vector storage (sqlite-vec)
CREATE VIRTUAL TABLE chunks_vec USING vec0(
chunk_id INTEGER PRIMARY KEY,
embedding FLOAT[384] -- dimension matches model
);
-- Tags
CREATE TABLE tags (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT UNIQUE NOT NULL
);
CREATE TABLE document_tags (
document_id INTEGER REFERENCES documents(id) ON DELETE CASCADE,
tag_id INTEGER REFERENCES tags(id) ON DELETE CASCADE,
PRIMARY KEY (document_id, tag_id)
);
-- Config stored in DB (model binding)
CREATE TABLE config (
key TEXT PRIMARY KEY,
value TEXT NOT NULL
);
-- Keys: schema_version, model_name, embedding_dim, model_max_tokens
```
**Why SQLite for everything:** At ~22,000 chunks, SQLite handles FTS, vector search, and relational data without breaking a sweat. One file = trivial backup (`cp kb.db kb.db.bak`), no server process, no port conflicts. FTS5 is built into SQLite. sqlite-vec is a single loadable extension.
**Why store config in DB _and_ YAML:** The YAML file holds user preferences (chunking params, model choice). The DB `config` table records what the DB was _actually built with_ (model name, dimension). This separation lets us detect mismatches: "config says use nomic-embed-text but DB was built with all-MiniLM-L6-v2."
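The mismatch check could look roughly like this. It reads the `model_name` key from the `config` table defined above; the function and exception names are hypothetical.

```python
import sqlite3


class ModelMismatchError(RuntimeError):
    """Configured model differs from the one the DB was built with."""


def check_model_binding(conn: sqlite3.Connection, configured_model: str) -> None:
    """Raise if the YAML/ENV model choice disagrees with the DB's recorded model."""
    row = conn.execute(
        "SELECT value FROM config WHERE key = 'model_name'"
    ).fetchone()
    if row is None:
        return  # fresh database: nothing bound yet
    if row[0] != configured_model:
        raise ModelMismatchError(
            f"config says use {configured_model} but DB was built with {row[0]}; "
            "run `kb reindex` to switch models"
        )
```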
**Alternatives considered:**
- ChromaDB/Qdrant: External services, overkill for this scale, breaks single-file story
- DuckDB: Good at analytics, but FTS support is weaker than SQLite FTS5
- LanceDB: Interesting but less mature, no FTS built in
### 3. Docling for Complex Document Ingestion
Docling handles PDF, DOCX, HTML, and image files through a unified pipeline with ML-based layout detection and table reconstruction.
**Why Docling over simpler extractors:** The 2,000 PDFs are "many and varied" — simple text extraction (pymupdf, pdfplumber) works for clean PDFs but silently produces garbage for complex layouts, tables, or multi-column documents. Docling's layout model correctly identifies structural elements, and its table reconstruction preserves data that would otherwise be lost. The quality difference matters because bad chunks → bad search results → useless tool.
**Docling configuration for this project:**
- Use `pypdfium2` backend (default, fast for text-based PDFs)
- Enable OCR only when needed (detect pages with no extractable text)
- Use hierarchy-aware chunking (respects section/paragraph boundaries)
- Disable image extraction (we're indexing text, not images)
- Run with multiple workers for batch ingestion
**Model download:** Docling models (~1.5 GB) download on first use or via `kb init`. They are stored in HuggingFace's default cache (`~/.cache/huggingface/`), alongside the embedding model.
**Alternatives considered:**
- pymupdf4llm: Fast, lightweight, but poor table/layout handling
- Unstructured: Heavier than Docling, commercial focus, less predictable output
- LlamaParse: Cloud-only, violates local-first constraint
### 4. Per-Type Chunking Strategy
Each document type gets a purpose-built chunker with configurable parameters:
**PDF (Docling):** Hierarchy-aware chunking. Docling's `HierarchicalChunker` splits at section/paragraph boundaries respecting the document's logical structure. Falls back to fixed-size if hierarchy detection fails.
**Markdown:** Header-based splitting. Split at `##` and `###` boundaries. Preserve parent header chain as context (so a chunk under "## Config > ### Advanced" carries that path). Merge small sections (< `min_tokens`) with their neighbor. Configurable: `min_tokens`, `max_tokens`.
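A simplified sketch of the header-based split, assuming plain ATX headers. It omits the `min_tokens` merge and the `max_tokens` cap, and it would mistake a `##` line inside a fenced code block for a header. The function name and the dict keys are illustrative.

```python
import re

HEADER_RE = re.compile(r"^(#{2,3})\s+(.*)$")


def split_markdown(text: str) -> list[dict]:
    """Split at `##`/`###` headers; each chunk carries its parent header chain."""
    chunks: list[dict] = []
    chain: dict[int, str] = {}  # header level -> current title
    body: list[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            path = " > ".join(chain[level] for level in sorted(chain))
            chunks.append({"section_header": path, "text": content})
        body.clear()

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            chain[level] = match.group(2).strip()
            # A new `##` invalidates any `###` recorded under the previous one.
            for deeper in [lvl for lvl in chain if lvl > level]:
                del chain[deeper]
        else:
            body.append(line)
    flush()
    return chunks
```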
**Code (Python):** Use stdlib `ast` module. Each function and class becomes a chunk. Class methods include the class docstring for context. Top-level code between definitions becomes its own chunk.
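The core of the `ast` approach might look like the sketch below. It is deliberately reduced: it handles only top-level functions and classes, does not split class methods or attach docstring context, skips code between definitions, and `ast.get_source_segment` leaves out decorators. The function name and dict keys are assumptions.

```python
import ast


def chunk_python(source: str) -> list[dict]:
    """One chunk per top-level function or class, using the stdlib `ast` module."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "symbol_name": node.name,
                "kind": "class" if isinstance(node, ast.ClassDef) else "function",
                # Exact source text of the definition (decorators excluded).
                "text": ast.get_source_segment(source, node),
            })
    return chunks
```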
**Code (Bash):** Regex-based. Split on `function name() {` and `name() {` patterns with brace-depth counting. Comment blocks preceding a function attach to that function's chunk. Fall back to fixed-size windowed chunks if no functions detected.
**Code (Go):** Regex-based. Split on `func ` declarations. Type definitions with methods are grouped. Fall back to fixed-size if no recognisable boundaries.
**Notes:** Whole document = one chunk. Notes are small by definition.
**Configurable defaults (in `~/.kb/config.yaml`):**
```yaml
chunking:
defaults:
max_tokens: 512
overlap_tokens: 50
pdf:
strategy: hierarchy # hierarchy | fixed
max_tokens: 1024 # for fixed strategy fallback
markdown:
strategy: header # header | fixed
min_tokens: 50 # merge sections smaller than this
max_tokens: 1024
code:
strategy: ast # ast | fixed
include_context: true # include class/module docstring with methods
max_tokens: 1024
note:
strategy: whole
```
### 5. Embedding Model Management
**Default model:** `all-MiniLM-L6-v2` (384 dimensions, 90 MB, good quality/speed tradeoff for CPU).
**Model loading:** Use `sentence-transformers` library which provides a unified API across models. Models stored in HuggingFace's default cache (`~/.cache/huggingface/`), shared with other tools that use HF models. No custom cache directory override.
**Model binding:** On `kb init`, the chosen model's name and dimension are written to the DB `config` table. Every subsequent `kb add` checks the loaded model matches the DB. Mismatch = hard error with clear message.
**Model switching (`kb reindex`):**
1. Download new model
2. Read all chunks from DB
3. Recreate `chunks_vec` table if dimension changed
4. Re-embed in batches (with progress bar)
5. Replace all vectors in `chunks_vec`
6. Update DB config (model_name, embedding_dim)
**ONNX Runtime for inference:** Use `sentence-transformers` with ONNX backend (`model = SentenceTransformer(model_name, backend="onnx")`). This gives us sentence-transformers' correct tokenization/pooling/normalization while using ONNX Runtime (~30 MB) instead of PyTorch (~200 MB) for inference. Models are automatically exported to ONNX format on first load. This keeps the install lightweight without sacrificing the convenience of the sentence-transformers API.
**Model compatibility:** All models on HuggingFace that work with `sentence-transformers` are supported. The only per-model differences handled in code:
- Dimension (read from model config)
- Max sequence length (read from model config, used to cap chunk size)
- Query/passage prefixes (configurable in YAML, empty by default)
```yaml
embedding:
model: all-MiniLM-L6-v2
query_prefix: "" # some models need "search_query: "
passage_prefix: "" # some models need "search_document: "
```
### 6. Hybrid Search with Reciprocal Rank Fusion
**Search flow:**
```
Query: "how to install git"
├──▶ FTS5 query ──▶ BM25-ranked results (chunk_id, fts_score)
└──▶ Embed query ──▶ vec similarity search ──▶ (chunk_id, vec_score)
(cosine distance, top-K)
Reciprocal Rank Fusion (RRF)
score(d) = Σ 1/(k + rank_in_list) where k=60 (standard)
Merged results, sorted by RRF score
Apply filters (tags, doc_type) ──▶ Top-N results
```
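The fusion step in the flow above can be sketched as follows. The input is assumed to be one ranked list of chunk IDs per retriever; the function name and the result dict shape are illustrative rather than the project's actual code.

```python
from collections import defaultdict


def rrf_merge(rankings: dict[str, list[int]], k: int = 60) -> list[dict]:
    """Fuse ranked chunk-id lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[int, float] = defaultdict(float)
    breakdown: dict[int, dict[str, float]] = defaultdict(dict)
    for source, chunk_ids in rankings.items():
        for rank, chunk_id in enumerate(chunk_ids, start=1):  # ranks are 1-based
            contribution = 1.0 / (k + rank)
            scores[chunk_id] += contribution
            breakdown[chunk_id][source] = contribution
    merged = [
        {"chunk_id": cid, "score": score, "score_breakdown": breakdown[cid]}
        for cid, score in scores.items()
    ]
    return sorted(merged, key=lambda r: r["score"], reverse=True)
```

With k=60, a chunk ranked first in both lists scores 2/61 ≈ 0.033, which is the upper bound for a two-list fusion. RRF scores are therefore small numbers that are only meaningful relative to each other.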
**Why RRF over learned re-ranking:** RRF is simple, has a single constant to tune (k=60 is the standard choice), and performs surprisingly well. A learned re-ranker (e.g., cross-encoder) would add another model download and slow down queries, and the marginal quality improvement isn't worth it at this scale. RRF can be swapped out later if needed.
**FTS5 query construction:** Pass the raw query string to FTS5. FTS5's porter stemmer handles basic normalisation. For queries with special characters, escape them. No query expansion or synonym handling — keep it simple.
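One way to do the escaping, offered as an assumption rather than the chosen implementation: wrap every whitespace-separated term in double quotes so FTS5 reads it as a string literal instead of query syntax. FTS5 escapes a double quote inside a string by doubling it, and it combines adjacent quoted terms with an implicit AND. The function name is hypothetical.

```python
def to_fts5_query(raw: str) -> str:
    """Quote each whitespace-separated term so FTS5 treats it as a literal string."""
    terms = raw.split()
    # Inside an FTS5 string, a double quote is escaped by doubling it.
    return " ".join('"' + term.replace('"', '""') + '"' for term in terms)
```

An empty or whitespace-only query yields an empty string, which the caller should reject before running `MATCH`.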
**Vector search:** Embed the query with the same model used for chunks. Retrieve top-K (K = 3× requested results, to give RRF enough candidates). sqlite-vec does brute-force cosine similarity over all vectors — at 22K vectors this is ~2-5ms.
**Filter application:** Tag and type filters are applied as SQL WHERE clauses _before_ search where possible (for FTS5 via JOIN), or as post-filters on the merged results. This is a design choice per filter type:
- Type filter: Applied in the SQL query (efficient)
- Tag filter: Applied in the SQL query via JOIN (efficient)
- Score threshold: Applied post-RRF as a cutoff
### 7. Output Format (Skill Contract)
**JSON output (`--format json`, default):**
```json
{
"query": "how to install git",
"results": [
{
"chunk_id": 1423,
"score": 0.031,
"score_breakdown": {"fts": 0.016, "vector": 0.015},
"text": "To install the latest version of git from source...",
"source": {
"document_id": 42,
"title": "Git Admin Guide",
"path": "/home/user/docs/git-admin.pdf",
"type": "pdf",
"page": 12,
"chunk_index": 3,
"total_chunks": 28,
"tags": ["git", "admin"]
}
}
],
"total_matches": 47,
"returned": 10
}
```
**Human output (`--format human`):**
```
Search: "how to install git" (47 matches, showing top 10)
1. [0.031] Git Admin Guide (p.12) [pdf] [git, admin]
To install the latest version of git from source...
2. [0.024] setup-notes.md §Installation [markdown] [git]
First, add the PPA repository for the latest git...
```
**Stability commitment:** The JSON schema is the contract with the skill. Fields may be _added_ but not removed or renamed once the skill is built.
### 8. Configuration Architecture
```
Precedence (highest to lowest):
1. CLI flags (--top, --tags, --format)
2. Environment variables (KB_MODEL, KB_DATA_DIR, KB_DEFAULT_TOP)
3. ~/.kb/config.yaml
4. Built-in defaults
ENV variable naming: KB_ prefix + UPPER_SNAKE_CASE
KB_DATA_DIR → ~/.kb/
KB_MODEL → all-MiniLM-L6-v2
KB_DEFAULT_TOP → 10
```
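The precedence order can be sketched as a layered merge. This is a simplified illustration: the keys are flattened (the real YAML is nested), and `resolve_settings`, `DEFAULTS`, and `ENV_MAP` are hypothetical names.

```python
import os

DEFAULTS = {"model": "all-MiniLM-L6-v2", "data_dir": "~/.kb", "default_top": 10}
ENV_MAP = {"KB_MODEL": "model", "KB_DATA_DIR": "data_dir", "KB_DEFAULT_TOP": "default_top"}


def resolve_settings(yaml_values: dict, cli_flags: dict, environ=None) -> dict:
    """Apply the four layers from lowest to highest precedence."""
    environ = os.environ if environ is None else environ
    settings = dict(DEFAULTS)              # 4. built-in defaults
    settings.update(yaml_values)           # 3. ~/.kb/config.yaml
    for env_name, key in ENV_MAP.items():  # 2. environment variables
        if env_name in environ:
            value = environ[env_name]
            settings[key] = int(value) if key == "default_top" else value
    # 1. CLI flags win, but only the ones the user actually passed (not None).
    settings.update({k: v for k, v in cli_flags.items() if v is not None})
    return settings
```

Treating `None` as "flag not given" keeps an unset CLI option from masking a value set in YAML or the environment.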
**Full default config.yaml:**
```yaml
# ~/.kb/config.yaml
data_dir: ~/.kb
embedding:
model: all-MiniLM-L6-v2
query_prefix: ""
passage_prefix: ""
search:
default_top: 10
default_format: json
rrf_k: 60
chunking:
defaults:
max_tokens: 512
overlap_tokens: 50
pdf:
strategy: hierarchy
max_tokens: 1024
markdown:
strategy: header
min_tokens: 50
max_tokens: 1024
code:
strategy: ast
include_context: true
max_tokens: 1024
note:
strategy: whole
ingestion:
workers: 4 # parallel Docling workers
batch_size: 50 # commit to DB every N documents
enable_ocr: auto # auto | always | never
```
### 9. CLI Framework: Click
**Why Click:** Mature, well-documented, supports nested command groups, automatic `--help` generation, parameter validation, and progress bars (via `click.progressbar`). The alternative (Typer) adds type-hint magic but less control. argparse is too verbose for this many commands.
### 10. Error Handling and Resumability
**Batch ingestion must be resumable.** When adding a directory of 2,000 PDFs:
- Each document is processed independently
- On success: document + chunks inserted in a single transaction
- On failure: error logged, document skipped, processing continues
- `content_hash` (SHA-256 of file contents) enables skip-if-already-indexed
- Progress shown via `click.progressbar` or `rich.progress`
- Summary at end: "Added 1,847 documents. 12 failed. 141 skipped (already indexed)."
Failed documents are logged to `~/.kb/ingest-errors.log` with the file path and error for later investigation.
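The resumable loop described above might be structured like this sketch. `process` is a hypothetical callable standing in for the per-type chunker, embeddings are left out, and errors are collected in memory instead of being written to the log file.

```python
import hashlib
import sqlite3
from pathlib import Path
from typing import Callable, Iterable


def file_hash(path: Path) -> str:
    """SHA-256 of the file contents, read in 1 MiB blocks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def ingest_batch(
    conn: sqlite3.Connection,
    paths: Iterable[Path],
    process: Callable[[Path], list[str]],  # hypothetical: returns chunk texts
    doc_type: str,
) -> dict:
    summary = {"added": 0, "failed": 0, "skipped": 0, "errors": []}
    for path in paths:
        try:
            content_hash = file_hash(path)
            seen = conn.execute(
                "SELECT 1 FROM documents WHERE content_hash = ?", (content_hash,)
            ).fetchone()
            if seen:
                summary["skipped"] += 1  # already indexed: safe to re-run
                continue
            chunks = process(path)
            with conn:  # commits on success, rolls back if an insert raises
                cursor = conn.execute(
                    "INSERT INTO documents (title, source_path, content_hash, doc_type) "
                    "VALUES (?, ?, ?, ?)",
                    (path.name, str(path), content_hash, doc_type),
                )
                conn.executemany(
                    "INSERT INTO chunks (document_id, chunk_index, text) VALUES (?, ?, ?)",
                    [(cursor.lastrowid, i, text) for i, text in enumerate(chunks)],
                )
            summary["added"] += 1
        except Exception as exc:  # record the failure and continue
            summary["failed"] += 1
            summary["errors"].append((str(path), str(exc)))
    return summary
```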
## Risks / Trade-offs
**[Docling model size] → Mitigation:** ~1.5 GB download on first init. Clear progress indication during download. Models are kept in HuggingFace's default cache (`~/.cache/huggingface/`), so the download happens once. Document this in `kb init` output and SKILL.md.
**[Docling ingestion speed on CPU] → Mitigation:** ~17 hours for 2,000 PDFs on CPU. Support parallel workers (configurable). Show per-document progress. Resumable by design (skip already-indexed). Suggest GPU if available. This is a one-time cost.
**[ONNX model export on first load] → Mitigation:** First time a model is loaded, sentence-transformers exports it to ONNX format. This takes 10-30 seconds and is cached for subsequent runs. Users see a one-time delay on first `kb add` or `kb search` after init. Show a clear message: "Optimising model for ONNX inference (one-time)..."
**[sqlite-vec maturity] → Mitigation:** sqlite-vec is relatively new. At 22K vectors we use brute-force search, so we do not rely on its ANN indexing. If sqlite-vec has issues, swapping to numpy cosine similarity over a stored blob column is straightforward: same DB, different query path.
**[FTS5 trigger sync] → Mitigation:** FTS5 content-sync triggers add write overhead. At our scale (inserts during ingestion, not real-time) this is negligible. If it becomes an issue, switch to manual sync with `INSERT INTO chunks_fts(chunks_fts) VALUES('rebuild')` after batch operations.
**[Model lock-in] → Mitigation:** Changing embedding models requires full reindex (~22K embeddings, ~10-30 minutes on CPU). `kb reindex` with progress bar makes this manageable. Model name stored in DB prevents silent mixing.
## Resolved Questions
1. **ONNX for inference from day one.** Use sentence-transformers with ONNX backend. Smaller install (~30 MB vs ~200 MB for PyTorch), faster CPU inference. No PyTorch dependency.
2. **HuggingFace default cache for models.** Both embedding and Docling models use `~/.cache/huggingface/`. Shared with other HF tools — no duplicate downloads if the user already has models cached.
3. **Manual schema migrations.** Version number in `config` table. `database.py` checks version on open and runs ALTER TABLE scripts sequentially. Simple enough for this project's schema complexity.
@@ -0,0 +1,39 @@
## Why
There is no simple, local-first CLI tool for building a personal knowledge base across heterogeneous document types (PDFs, markdown, code snippets, text notes) with hybrid search that combines keyword matching and semantic understanding. Existing tools either require cloud services, lack semantic search, or can't handle the variety of document formats. This tool fills the gap — a retrieval engine that can be used standalone from the terminal or wrapped as an AI skill (e.g. Claude Code) where the LLM layer provides natural language synthesis over retrieved results.
## What Changes
- New Python CLI tool (`kb`) distributed via pipx (PyPI package: `kb-search`)
- Ingestion pipeline with per-format handling:
- **PDFs/DOCX/HTML/images**: Docling (layout-aware, table reconstruction, optional OCR)
- **Markdown/text**: Header-based semantic splitting
- **Code (Python, Bash, Go)**: AST/regex-based splitting at function/class boundaries
- **Notes**: Inline text stored as whole-document chunks
- Hybrid search combining SQLite FTS5 (BM25 keyword scoring) and sqlite-vec (vector similarity), merged via Reciprocal Rank Fusion
- Local embedding models downloaded from HuggingFace on first run (`kb init`), with multi-model support and full reindex capability when switching models
- Document tagging system for manual categorisation and filtered search
- Structured JSON output designed for LLM skill consumption, plus human-readable terminal output
- Configurable chunking parameters per document type with sensible defaults
- All state in a single SQLite database (`~/.kb/kb.db`)
- Configuration via YAML (`~/.kb/config.yaml`) with ENV variable overrides
## Capabilities
### New Capabilities
- `document-ingestion`: Ingest PDFs, markdown, code, and text notes into chunked, embedded, searchable storage. Handles format detection, per-type chunking strategies, Docling pipeline for complex documents, and resumable batch imports.
- `hybrid-search`: Hybrid retrieval combining FTS5 full-text search and sqlite-vec vector similarity via Reciprocal Rank Fusion. Supports tag/type filtering, configurable result counts, score thresholds, and JSON/human output formats.
- `embedding-management`: Local embedding model lifecycle — download on init, bind model to database, detect mismatches, and full re-embedding via reindex when switching models.
- `document-management`: CRUD operations on the document store — list, inspect, remove documents. Tag management (add/remove tags, filter by tags, list tags with counts).
- `configuration`: YAML-based configuration with per-document-type chunking parameters, model selection, and ENV variable overrides. Sensible defaults that work without any config file.
- `skill-interface`: Structured JSON output contract designed for LLM skill consumption — chunks with scores, source metadata, and provenance for citation.
### Modified Capabilities
_(none — greenfield project)_
## Impact
- **Dependencies**: Docling (~1.5 GB models), sentence-transformers with ONNX Runtime backend, sqlite-vec, Click
- **Storage**: `~/.kb/` directory containing the SQLite database and config file (the database grows with content); downloaded models (~1.6 GB on init) are stored in the HuggingFace default cache (`~/.cache/huggingface/`)
- **First-run experience**: `kb init` required before use to download models. Batch ingestion of 2,000 PDFs estimated at ~17 hours CPU / ~3 hours GPU (one-time cost, resumable)
- **External integration**: Designed to be wrapped as a Claude Code skill — the skill definition (SKILL.md) is a deliverable alongside the code
@@ -0,0 +1,72 @@
## ADDED Requirements
### Requirement: YAML configuration file
The system SHALL read configuration from `~/.kb/config.yaml`. If the file does not exist, the system SHALL use built-in defaults. The configuration file SHALL be optional — the tool MUST work with zero configuration.
#### Scenario: No config file
- **WHEN** `~/.kb/config.yaml` does not exist
- **THEN** the system uses built-in defaults for all settings and operates normally
#### Scenario: Partial config file
- **WHEN** `~/.kb/config.yaml` exists but only specifies `chunking.pdf.max_tokens: 2048`
- **THEN** the system uses built-in defaults for all other settings, overriding only `chunking.pdf.max_tokens`
#### Scenario: Invalid config file
- **WHEN** `~/.kb/config.yaml` contains invalid YAML
- **THEN** the system prints a clear error message identifying the YAML syntax issue and exits with non-zero status
### Requirement: Environment variable overrides
The system SHALL support environment variable overrides with the prefix `KB_`. ENV variables SHALL take precedence over the YAML config file. Supported variables: `KB_DATA_DIR`, `KB_MODEL`, `KB_DEFAULT_TOP`, `KB_DEFAULT_FORMAT`.
#### Scenario: Override data directory
- **WHEN** `KB_DATA_DIR=/tmp/test-kb` is set
- **THEN** the system uses `/tmp/test-kb/` instead of `~/.kb/` for the database and config
#### Scenario: Override model
- **WHEN** `KB_MODEL=nomic-embed-text` is set
- **THEN** the system uses `nomic-embed-text` as the embedding model, overriding the YAML config
#### Scenario: ENV overrides YAML
- **WHEN** YAML config has `search.default_top: 10` and `KB_DEFAULT_TOP=20` is set
- **THEN** the default top value is 20
### Requirement: Configuration precedence
The system SHALL apply configuration in this order (highest to lowest precedence): CLI flags, environment variables, YAML config file, built-in defaults.
#### Scenario: CLI flag overrides everything
- **WHEN** YAML config has `search.default_top: 10`, ENV has `KB_DEFAULT_TOP=20`, and user runs `kb search "test" --top 5`
- **THEN** 5 results are returned
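The precedence order can be sketched as a layered merge. This is an illustration under assumptions: `resolve_config`, `ENV_MAP`, and the tuple-keyed `cli` argument are hypothetical names, the defaults are abbreviated, and YAML parsing is left to the caller.

```python
import os
from copy import deepcopy

# Built-in defaults, abbreviated from the default config.yaml.
DEFAULTS = {
    "data_dir": "~/.kb",
    "embedding": {"model": "all-MiniLM-L6-v2"},
    "search": {"default_top": 10, "default_format": "json"},
}

# Hypothetical mapping: ENV variable -> (config path, type converter).
ENV_MAP = {
    "KB_DATA_DIR": (("data_dir",), str),
    "KB_MODEL": (("embedding", "model"), str),
    "KB_DEFAULT_TOP": (("search", "default_top"), int),
    "KB_DEFAULT_FORMAT": (("search", "default_format"), str),
}


def deep_merge(base, override):
    """Return a copy of base with override applied recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def set_path(config, path, value):
    """Set a nested key such as ("search", "default_top")."""
    node = config
    for key in path[:-1]:
        node = node.setdefault(key, {})
    node[path[-1]] = value


def resolve_config(yaml_config=None, env=None, cli=None):
    """Apply layers from lowest to highest precedence: defaults, YAML, ENV, CLI."""
    env = os.environ if env is None else env
    config = deep_merge(DEFAULTS, yaml_config or {})
    for name, (path, convert) in ENV_MAP.items():
        if name in env:
            set_path(config, path, convert(env[name]))
    for path, value in (cli or {}).items():
        if value is not None:  # flags the user did not pass stay unset
            set_path(config, path, value)
    return config
```

For the scenario above, `resolve_config({"search": {"default_top": 10}}, env={"KB_DEFAULT_TOP": "20"}, cli={("search", "default_top"): 5})` resolves `search.default_top` to 5.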
### Requirement: View and set configuration
The system SHALL support viewing the current effective configuration via `kb config` and setting individual values via `kb config set <key> <value>`.
#### Scenario: View configuration
- **WHEN** user runs `kb config`
- **THEN** the system displays the fully resolved configuration (defaults merged with YAML merged with ENV), indicating the source of each value
#### Scenario: Set a config value
- **WHEN** user runs `kb config set chunking.pdf.max_tokens 2048`
- **THEN** the value is written to `~/.kb/config.yaml`, creating the file if necessary
### Requirement: Configurable chunking parameters
The system SHALL support per-document-type chunking configuration with sensible defaults.
#### Scenario: Default chunking for PDF
- **WHEN** no chunking config is specified for PDF
- **THEN** the system uses `strategy: hierarchy, max_tokens: 1024`
#### Scenario: Default chunking for markdown
- **WHEN** no chunking config is specified for markdown
- **THEN** the system uses `strategy: header, min_tokens: 50, max_tokens: 1024`
#### Scenario: Default chunking for code
- **WHEN** no chunking config is specified for code
- **THEN** the system uses `strategy: ast, include_context: true, max_tokens: 1024`
#### Scenario: Default chunking for notes
- **WHEN** no chunking config is specified for notes
- **THEN** the system uses `strategy: whole`
#### Scenario: Custom chunking overrides
- **WHEN** YAML config specifies `chunking.pdf.strategy: fixed` and `chunking.pdf.max_tokens: 512`
- **THEN** PDFs are chunked with fixed-size windows of 512 tokens instead of hierarchy-aware chunking
@@ -0,0 +1,125 @@
## ADDED Requirements
### Requirement: File type detection and routing
The system SHALL detect the type of a file being ingested and route it to the appropriate ingestion pipeline. Detection SHALL be based on file extension. Supported types: PDF (`.pdf`), DOCX (`.docx`), HTML (`.html`, `.htm`), Markdown (`.md`, `.markdown`, `.txt`), Code (`.py`, `.sh`, `.bash`, `.go`), and image files (`.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`, `.webp`). The user MAY override detection with `--type` and `--language` flags.
#### Scenario: Auto-detect PDF file
- **WHEN** user runs `kb add report.pdf`
- **THEN** the file is routed to the Docling ingestion pipeline
#### Scenario: Auto-detect Python code
- **WHEN** user runs `kb add script.py`
- **THEN** the file is routed to the code ingestion pipeline with language set to `python`
#### Scenario: Override type detection
- **WHEN** user runs `kb add data.txt --type code --language bash`
- **THEN** the file is routed to the code pipeline as Bash, regardless of the `.txt` extension
#### Scenario: Unsupported file type
- **WHEN** user runs `kb add archive.zip`
- **THEN** the system SHALL print an error message listing supported formats and exit with non-zero status
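Extension-based routing could look roughly like the sketch below. The function name `route`, the pipeline labels (`"docling"`, `"markdown"`, `"code"`), and the exception class are assumptions made for illustration; only the extension lists come from the requirement.

```python
from pathlib import Path

DOCLING_EXTENSIONS = {
    ".pdf", ".docx", ".html", ".htm",
    ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp",
}
MARKDOWN_EXTENSIONS = {".md", ".markdown", ".txt"}
CODE_LANGUAGES = {".py": "python", ".sh": "bash", ".bash": "bash", ".go": "go"}


class UnsupportedFileType(ValueError):
    """Raised when no pipeline handles the file extension."""


def route(path, type_override=None, language_override=None):
    """Return (pipeline, language) for a file, honouring --type/--language."""
    suffix = Path(path).suffix.lower()
    if type_override is not None:
        language = None
        if type_override == "code":
            language = language_override or CODE_LANGUAGES.get(suffix)
        return type_override, language
    if suffix in DOCLING_EXTENSIONS:
        return "docling", None
    if suffix in MARKDOWN_EXTENSIONS:
        return "markdown", None
    if suffix in CODE_LANGUAGES:
        return "code", language_override or CODE_LANGUAGES[suffix]
    supported = sorted(DOCLING_EXTENSIONS | MARKDOWN_EXTENSIONS | set(CODE_LANGUAGES))
    raise UnsupportedFileType(
        f"Unsupported file type '{suffix}'. Supported: {', '.join(supported)}"
    )
```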
### Requirement: Docling pipeline for complex documents
The system SHALL use Docling to ingest PDF, DOCX, HTML, and image files. The pipeline SHALL use the `pypdfium2` backend for PDFs, enable layout model for structural detection, and enable table reconstruction. OCR SHALL be configurable: `auto` (detect pages with no extractable text and OCR those), `always`, or `never`.
#### Scenario: Ingest a text-based PDF
- **WHEN** user runs `kb add manual.pdf`
- **THEN** the system extracts text using Docling with layout detection, produces hierarchy-aware chunks preserving section structure, embeds each chunk, and stores the document with all chunks in the database
#### Scenario: Ingest a PDF with tables
- **WHEN** user ingests a PDF containing data tables
- **THEN** Docling's table reconstruction SHALL produce chunks where table content is represented as structured text (markdown table format) rather than garbled column fragments
#### Scenario: Ingest a scanned PDF with OCR auto mode
- **WHEN** user ingests a PDF where some pages contain only images (no extractable text) and OCR is set to `auto`
- **THEN** the system SHALL detect the image-only pages and apply OCR to those pages only, leaving text-extractable pages processed normally
#### Scenario: Ingest an image file
- **WHEN** user runs `kb add diagram.png`
- **THEN** the system SHALL process it through Docling with OCR enabled, extracting any text content from the image
### Requirement: Markdown ingestion with header-based splitting
The system SHALL split markdown and text files at header boundaries (`##`, `###`). Each chunk SHALL include its parent header chain as context. Sections smaller than `min_tokens` SHALL be merged with the following section. Sections larger than `max_tokens` SHALL be split at paragraph boundaries with configurable overlap.
#### Scenario: Split markdown at headers
- **WHEN** user runs `kb add guide.md` and the file contains multiple `##` sections
- **THEN** each section becomes a separate chunk, with the header text included in the chunk
#### Scenario: Preserve header hierarchy
- **WHEN** a markdown file has nested headers like `## Config` > `### Advanced Options`
- **THEN** the chunk for "Advanced Options" SHALL include context indicating it falls under "Config > Advanced Options"
#### Scenario: Merge small sections
- **WHEN** a markdown section contains fewer tokens than `min_tokens` (default: 50)
- **THEN** it SHALL be merged with the next section into a single chunk
#### Scenario: Plain text file without headers
- **WHEN** user runs `kb add notes.txt` and the file has no markdown headers
- **THEN** the system SHALL fall back to fixed-size chunking with configurable `max_tokens` and `overlap_tokens`
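A minimal sketch of header-based splitting with a parent header chain follows. It is illustrative only: `split_markdown` is a hypothetical name, and the sketch leaves out `min_tokens` merging, `max_tokens` splitting, the plain-text fallback, and `#` characters inside fenced code blocks.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def split_markdown(text):
    """Split at headers; each chunk carries its header chain as context."""
    chunks = []
    chain = []  # (level, title) pairs for the current header path
    body = []   # lines of the section being collected

    def flush():
        content = "\n".join(body).strip()
        if content:
            context = " > ".join(title for _, title in chain)
            chunks.append({"context": context, "text": content})
        body.clear()

    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            flush()  # close the previous section under its own chain
            level = len(match.group(1))
            # Drop headers at the same or a deeper level before pushing this one.
            while chain and chain[-1][0] >= level:
                chain.pop()
            chain.append((level, match.group(2)))
        body.append(line)
    flush()
    return chunks
```

With a `## Config` section containing `### Advanced Options`, the second chunk's context is `Config > Advanced Options`.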
### Requirement: Code ingestion with AST/regex splitting
The system SHALL split code files at function and class boundaries. Python files SHALL use the `ast` module. Bash and Go files SHALL use regex-based splitting. When `include_context` is enabled (default), class methods SHALL include the class docstring/signature for context. Files with no recognisable function/class boundaries SHALL fall back to fixed-size chunking.
#### Scenario: Python file with functions and classes
- **WHEN** user runs `kb add auth.py` and the file contains a class with methods
- **THEN** each method becomes a chunk, and each chunk includes the class name and docstring as context
#### Scenario: Bash script with functions
- **WHEN** user runs `kb add deploy.sh` and the file contains `function deploy() {` blocks
- **THEN** each function becomes a separate chunk, including any preceding comment block
#### Scenario: Go file with functions
- **WHEN** user runs `kb add main.go` and the file contains `func` declarations
- **THEN** each function becomes a separate chunk
#### Scenario: Code file with no functions
- **WHEN** user runs `kb add script.sh` and the file has no function declarations
- **THEN** the system SHALL fall back to fixed-size chunking with `max_tokens` and `overlap_tokens`
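For Python files, splitting with the standard `ast` module might look like the sketch below. The name `split_python` and the chunk dictionary layout are assumptions. Nested functions, decorators, and module-level statements are ignored here, and an empty result signals that the caller should fall back to fixed-size chunking.

```python
import ast

FUNCTION_NODES = (ast.FunctionDef, ast.AsyncFunctionDef)


def split_python(source):
    """Return one chunk per top-level function and per class method."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, FUNCTION_NODES):
            chunks.append({
                "name": node.name,
                "context": "",
                "text": ast.get_source_segment(source, node),
            })
        elif isinstance(node, ast.ClassDef):
            # include_context: prefix each method with class name and docstring.
            docstring = ast.get_docstring(node)
            context = f"class {node.name}"
            if docstring:
                context += f": {docstring}"
            for child in node.body:
                if isinstance(child, FUNCTION_NODES):
                    chunks.append({
                        "name": f"{node.name}.{child.name}",
                        "context": context,
                        "text": ast.get_source_segment(source, child),
                    })
    return chunks
```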
### Requirement: Inline note ingestion
The system SHALL support adding text notes directly from the command line via `kb add --note "text"`. Notes SHALL be stored as a single chunk (no splitting). Notes MAY have an optional `--title` for display purposes.
#### Scenario: Add an inline note
- **WHEN** user runs `kb add --note "Always restart nginx after config changes" --title "nginx reminder"`
- **THEN** a document of type `note` is created with the title "nginx reminder", and the full text becomes a single chunk
#### Scenario: Add a note without title
- **WHEN** user runs `kb add --note "some text"`
- **THEN** the system SHALL use the first 80 characters of the text (truncated at a word boundary) as the title
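One way to derive the fallback title is sketched here. `default_title` is a hypothetical helper. Collapsing whitespace and hard-cutting a single over-long word are assumptions the requirement does not spell out.

```python
def default_title(text, limit=80):
    """Use up to `limit` characters of the note, cut at a word boundary."""
    collapsed = " ".join(text.split())
    if len(collapsed) <= limit:
        return collapsed
    # Look one character past the limit so a word ending exactly at the limit survives.
    window = collapsed[: limit + 1]
    if " " in window:
        return window.rsplit(" ", 1)[0]
    return collapsed[:limit]  # single very long word: hard cut
```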
### Requirement: Deduplication via content hash
The system SHALL compute a SHA-256 hash of each file's contents before ingestion. If a document with the same `content_hash` already exists in the database, the file SHALL be skipped with a message indicating it is already indexed.
#### Scenario: Add a file that is already indexed
- **WHEN** user runs `kb add report.pdf` and the file's SHA-256 matches an existing document
- **THEN** the system SHALL print "Skipped: report.pdf (already indexed)" and not create a duplicate
#### Scenario: Add a modified version of an existing file
- **WHEN** user runs `kb add report.pdf` and the file has changed since last indexed (different hash)
- **THEN** the system SHALL ingest it as a new document (the old version remains unless manually removed)
### Requirement: Batch ingestion with progress and resumability
The system SHALL support ingesting entire directories via `kb add <dir> --recursive`. Processing SHALL be resumable — files already indexed (by content hash) are skipped. Failed files SHALL be logged and skipped without aborting the batch. A summary SHALL be displayed at completion.
#### Scenario: Ingest a directory
- **WHEN** user runs `kb add ~/docs/ --recursive`
- **THEN** the system recursively finds all supported files, processes each one, skips duplicates, logs failures, and displays a summary: "Added X documents. Y failed. Z skipped (already indexed)."
#### Scenario: Resume after interruption
- **WHEN** a batch ingestion is interrupted (Ctrl+C, crash) and user reruns the same command
- **THEN** already-indexed files are skipped via content hash, and processing continues with remaining files
#### Scenario: Failed file during batch
- **WHEN** a single file fails to process (corrupt PDF, encoding error)
- **THEN** the error is logged to `~/.kb/ingest-errors.log` with the file path and error message, and processing continues with the next file
### Requirement: Parallel ingestion workers
The system SHALL support parallel document processing via configurable worker count (default: 4). Docling's `DocumentConverter` SHALL be used with multiple workers for PDF/DOCX/HTML ingestion. Database writes SHALL be serialised to avoid SQLite locking issues.
#### Scenario: Parallel PDF ingestion
- **WHEN** user runs `kb add ~/pdfs/ --recursive` with `workers: 4` in config
- **THEN** up to 4 documents are processed concurrently through Docling, with chunks written to the database sequentially
#### Scenario: Override worker count
- **WHEN** user runs `kb add ~/pdfs/ --recursive --workers 1`
- **THEN** documents are processed sequentially with a single worker
@@ -0,0 +1,80 @@
## ADDED Requirements
### Requirement: List documents
The system SHALL list all indexed documents via `kb list`. Results SHALL include document ID, title, type, tags, chunk count, and creation date. Output SHALL support `--format json` and `--format human`.
#### Scenario: List all documents
- **WHEN** user runs `kb list`
- **THEN** all documents are listed with their ID, title, type, tags, chunk count, and creation date
#### Scenario: Filter by type
- **WHEN** user runs `kb list --type pdf`
- **THEN** only PDF documents are listed
#### Scenario: Filter by tags
- **WHEN** user runs `kb list --tags admin,ops`
- **THEN** only documents tagged with BOTH "admin" AND "ops" are listed
#### Scenario: Empty database
- **WHEN** user runs `kb list` with no documents indexed
- **THEN** the system prints "No documents indexed. Run `kb add` to get started." and exits with zero status
### Requirement: Document info
The system SHALL display detailed information about a single document via `kb info <doc_id>`, including all metadata, tags, chunk count, and chunk previews (first 100 characters of each chunk).
#### Scenario: View document info
- **WHEN** user runs `kb info 42`
- **THEN** the system displays: title, source path, type, language (if code), content hash, creation date, tags, total chunks, and a preview of each chunk
#### Scenario: Invalid document ID
- **WHEN** user runs `kb info 9999` and no document with ID 9999 exists
- **THEN** the system prints "Document not found: 9999" and exits with non-zero status
### Requirement: Remove document
The system SHALL remove a document and all its associated chunks, embeddings, and tag associations via `kb remove <doc_id>`. The system SHALL ask for confirmation before deletion unless `--yes` is passed.
#### Scenario: Remove with confirmation
- **WHEN** user runs `kb remove 42`
- **THEN** the system displays the document title and asks "Remove 'Git Admin Guide' and its 28 chunks? [y/N]". On confirmation, the document, its chunks, FTS entries, vector embeddings, and tag associations are deleted.
#### Scenario: Remove with --yes flag
- **WHEN** user runs `kb remove 42 --yes`
- **THEN** the document is removed without confirmation prompt
#### Scenario: Cascading delete
- **WHEN** a document is removed
- **THEN** all rows in `chunks`, `chunks_fts`, `chunks_vec`, and `document_tags` referencing that document SHALL be deleted
### Requirement: Tag management
The system SHALL support adding and removing tags on documents via `kb tag <doc_id> --add tag1,tag2` and `kb tag <doc_id> --remove tag1`. Tags are case-insensitive and stored lowercase. The system SHALL list all tags with document counts via `kb tags`.
#### Scenario: Add tags to a document
- **WHEN** user runs `kb tag 42 --add git,admin`
- **THEN** the tags "git" and "admin" are associated with document 42. Tags are created if they don't exist.
#### Scenario: Remove a tag from a document
- **WHEN** user runs `kb tag 42 --remove admin`
- **THEN** the "admin" tag association is removed from document 42. The tag itself remains in the tags table if other documents use it.
#### Scenario: List all tags
- **WHEN** user runs `kb tags`
- **THEN** the system lists all tags with the count of documents using each tag, sorted by count descending
#### Scenario: Tag on ingestion
- **WHEN** user runs `kb add report.pdf --tags compliance,q1`
- **THEN** the document is ingested and immediately tagged with "compliance" and "q1"
#### Scenario: Tags in JSON format
- **WHEN** user runs `kb tags --format json`
- **THEN** output is a JSON array of objects: `[{"name": "git", "count": 15}, ...]`
### Requirement: Database status
The system SHALL report database statistics via `kb status`, including: total documents (by type), total chunks, database file size, active model name and dimension, and schema version.
#### Scenario: Show status
- **WHEN** user runs `kb status`
- **THEN** the system displays: document counts by type, total chunks, DB file size, model name, embedding dimension, and schema version
#### Scenario: Status before init
- **WHEN** user runs `kb status` before `kb init`
- **THEN** the system prints "Knowledge base not initialised. Run `kb init` first." and exits with non-zero status
@@ -0,0 +1,57 @@
## ADDED Requirements
### Requirement: Model initialisation
The system SHALL download the embedding model on `kb init`. The default model SHALL be `all-MiniLM-L6-v2`. The user MAY specify a different model via `kb init --model <name>`. The model SHALL be downloaded via sentence-transformers to the HuggingFace default cache (`~/.cache/huggingface/`). On first load, the model SHALL be exported to ONNX format for inference.
#### Scenario: Default init
- **WHEN** user runs `kb init`
- **THEN** the system downloads `all-MiniLM-L6-v2`, creates `~/.kb/kb.db` with the schema, and records `model_name=all-MiniLM-L6-v2` and `embedding_dim=384` in the DB config table
#### Scenario: Init with custom model
- **WHEN** user runs `kb init --model nomic-embed-text`
- **THEN** the system downloads `nomic-embed-text`, creates the database, and records the model name and its dimension in the DB config table
#### Scenario: Init status check
- **WHEN** user runs `kb init --status`
- **THEN** the system reports: whether `~/.kb/` exists, whether the DB is initialised, which model is configured, whether the model is downloaded, and Docling model status
#### Scenario: ONNX export on first load
- **WHEN** the embedding model is loaded for the first time after download
- **THEN** the system SHALL display "Optimising model for ONNX inference (one-time)..." and export the model to ONNX format. Subsequent loads SHALL use the cached ONNX export.
### Requirement: Model-database binding
The system SHALL store the active model name and embedding dimension in the database `config` table. Every operation that uses the embedding model (add, search, reindex) SHALL verify that the loaded model matches the DB record. A mismatch SHALL be a hard error.
#### Scenario: Model mismatch on add
- **WHEN** user runs `kb add doc.pdf` but the config YAML specifies a different model than what the DB was initialised with
- **THEN** the system SHALL print an error: "Model mismatch: DB uses 'all-MiniLM-L6-v2' (384 dim) but config specifies 'nomic-embed-text'. Run `kb reindex --model nomic-embed-text` to switch models." and exit with non-zero status
#### Scenario: Model match on add
- **WHEN** user runs `kb add doc.pdf` and the config model matches the DB model
- **THEN** ingestion proceeds normally
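The binding check itself is small. In this sketch `verify_model` and `ModelMismatchError` are hypothetical names, and the caller is assumed to have read `model_name` and `embedding_dim` from the DB `config` table already.

```python
class ModelMismatchError(Exception):
    """Configured embedding model differs from the one bound to the DB."""


def verify_model(db_model, db_dim, configured_model):
    """Raise a hard error when the configured model is not the DB's model."""
    if configured_model != db_model:
        raise ModelMismatchError(
            f"Model mismatch: DB uses '{db_model}' ({db_dim} dim) but config "
            f"specifies '{configured_model}'. Run `kb reindex --model "
            f"{configured_model}` to switch models."
        )
```

A CLI entry point would typically catch this exception, print the message, and exit with a non-zero status.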
### Requirement: Full reindex with model switching
The system SHALL support re-embedding all chunks via `kb reindex`. If `--model` is specified, the system SHALL download the new model, re-embed all chunks, replace all vectors, and update the DB config. A progress bar SHALL be displayed. The operation SHALL be atomic — if interrupted, the old embeddings remain intact.
#### Scenario: Reindex with same model
- **WHEN** user runs `kb reindex`
- **THEN** all chunks are re-embedded with the current model and vectors are replaced. Useful if the model's ONNX export was corrupted or chunks were modified.
#### Scenario: Reindex with new model
- **WHEN** user runs `kb reindex --model bge-small-en-v1.5`
- **THEN** the system downloads the new model, re-embeds all chunks (showing progress), replaces all vectors in `chunks_vec` (recreating the table if dimension changed), and updates `model_name` and `embedding_dim` in the DB config table
#### Scenario: Interrupted reindex
- **WHEN** a reindex is interrupted partway through
- **THEN** the old embeddings remain intact (the vector table is only replaced on successful completion of all embeddings). The user can rerun `kb reindex` to retry.
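One way to get this behaviour is to build the new vectors in a staging table and swap it in within a single transaction. The sketch makes several assumptions: plain tables stand in for the sqlite-vec virtual table (whose rename behaviour is not verified here), the `config` table has `key`/`value` columns, `embed` is a hypothetical callable returning bytes, and the connection was opened with `isolation_level=None` so transactions are controlled explicitly.

```python
import sqlite3


def reindex(conn, embed, model_name, dim):
    """Re-embed all chunks into a staging table, then swap atomically."""
    conn.execute("DROP TABLE IF EXISTS chunks_vec_staging")
    conn.execute(
        "CREATE TABLE chunks_vec_staging "
        "(chunk_id INTEGER PRIMARY KEY, embedding BLOB)"
    )
    rows = conn.execute("SELECT id, text FROM chunks").fetchall()
    for chunk_id, text in rows:
        # If embed() raises or the process dies here, chunks_vec is untouched.
        conn.execute(
            "INSERT INTO chunks_vec_staging VALUES (?, ?)", (chunk_id, embed(text))
        )

    conn.execute("BEGIN")
    try:
        conn.execute("DROP TABLE chunks_vec")
        conn.execute("ALTER TABLE chunks_vec_staging RENAME TO chunks_vec")
        conn.execute(
            "UPDATE config SET value = ? WHERE key = 'model_name'", (model_name,)
        )
        conn.execute(
            "UPDATE config SET value = ? WHERE key = 'embedding_dim'", (str(dim),)
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

SQLite treats `DROP TABLE` and `ALTER TABLE` as transactional, so a failure during the swap rolls back to the old vectors and the old config values together.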
### Requirement: Embedding model inference via ONNX
The system SHALL use `sentence-transformers` with the ONNX backend for all embedding inference. This avoids a PyTorch dependency. The ONNX Runtime (`onnxruntime`) SHALL be the inference engine.
#### Scenario: Embed a chunk
- **WHEN** a chunk of text needs to be embedded during ingestion
- **THEN** the system uses the sentence-transformers ONNX backend to produce a float vector of the correct dimension for the active model
#### Scenario: Embed a query
- **WHEN** a search query needs to be embedded
- **THEN** the system applies the configured `query_prefix` (if any) to the query text before embedding, and uses the same ONNX model used for chunk embeddings
@@ -0,0 +1,70 @@
## ADDED Requirements
### Requirement: Full-text search via FTS5
The system SHALL maintain an FTS5 virtual table synchronised with the chunks table via triggers. FTS5 SHALL use the `porter unicode61` tokenizer for stemming and unicode support. Queries SHALL be passed to FTS5 with special characters escaped.
#### Scenario: Keyword search
- **WHEN** user runs `kb search "install git"`
- **THEN** FTS5 returns chunks containing "install" and/or "git" (including stemmed variants like "installation"), ranked by BM25 score
#### Scenario: FTS-only mode
- **WHEN** user runs `kb search "install git" --fts-only`
- **THEN** only FTS5 results are returned, no vector search is performed
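The trigger-synchronised table and query escaping can be sketched with the standard `sqlite3` module, assuming the bundled SQLite was compiled with FTS5. The reduced `chunks` schema, the trigger names, and the choice to join terms with `OR` are assumptions for illustration. The trigger pattern follows the external-content approach described in the SQLite FTS5 documentation.

```python
import sqlite3

SCHEMA = """
CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT NOT NULL);
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    text, content='chunks', content_rowid='id', tokenize='porter unicode61'
);
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, text)
    VALUES ('delete', old.id, old.text);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, text)
    VALUES ('delete', old.id, old.text);
    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;
"""


def escape_query(query):
    """Quote each term so FTS5 operators and punctuation are treated as text."""
    terms = ['"' + term.replace('"', '""') + '"' for term in query.split()]
    return " OR ".join(terms)


def fts_search(conn, query, limit=10):
    """Return (chunk_id, bm25) rows; lower bm25 values rank first in FTS5."""
    sql = (
        "SELECT rowid, bm25(chunks_fts) AS score FROM chunks_fts "
        "WHERE chunks_fts MATCH ? ORDER BY score LIMIT ?"
    )
    return conn.execute(sql, (escape_query(query), limit)).fetchall()
```

An empty query produces an empty MATCH expression, which FTS5 rejects, so a caller would need to validate the query before calling `fts_search`.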
### Requirement: Vector similarity search via sqlite-vec
The system SHALL embed the query using the same model that was used to embed stored chunks. The embedded query SHALL be compared against all chunk embeddings in `chunks_vec` using cosine similarity. The system SHALL retrieve 3× the requested result count as candidates for RRF merging.
#### Scenario: Semantic search
- **WHEN** user runs `kb search "how to set up version control"`
- **THEN** the query is embedded and compared against stored vectors, returning semantically similar chunks even if they don't contain the exact words "version control"
#### Scenario: Vector-only mode
- **WHEN** user runs `kb search "how to set up version control" --vec-only`
- **THEN** only vector similarity results are returned, no FTS search is performed
### Requirement: Reciprocal Rank Fusion merging
The system SHALL merge FTS5 and vector search results using Reciprocal Rank Fusion (RRF). The RRF formula SHALL be: `score(d) = Σ 1/(k + rank)` where `k` is configurable (default: 60). Results SHALL be sorted by descending RRF score.
#### Scenario: Hybrid search combines both signals
- **WHEN** user runs `kb search "install git"` (default hybrid mode)
- **THEN** the system runs both FTS5 and vector searches, merges results via RRF, and returns results sorted by combined score
#### Scenario: Document appears in both result sets
- **WHEN** a chunk ranks #2 in FTS5 and #5 in vector search
- **THEN** its RRF score SHALL be `1/(60+2) + 1/(60+5) = 0.0161 + 0.0154 = 0.0315`, higher than a chunk appearing in only one result set
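The formula translates directly into code. In this sketch `rrf_merge` is a hypothetical name, and each input ranking is assumed to be a list of chunk IDs ordered best-first, with ranks starting at 1.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60):
    """Merge ranked ID lists; score(d) is the sum of 1 / (k + rank) per list."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A chunk ranked #2 in one list and #5 in the other scores `1/62 + 1/65`, which rounds to 0.0315 and exceeds the best possible single-list score of `1/61`.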
### Requirement: Tag-based filtering
The system SHALL support filtering search results by one or more tags. When multiple tags are specified, the filter SHALL use AND logic (document must have ALL specified tags). Tag filtering SHALL be applied in the SQL query via JOIN for efficiency.
#### Scenario: Filter by single tag
- **WHEN** user runs `kb search "deploy" --tags ops`
- **THEN** only chunks from documents tagged with "ops" are included in results
#### Scenario: Filter by multiple tags
- **WHEN** user runs `kb search "deploy" --tags ops,production`
- **THEN** only chunks from documents tagged with BOTH "ops" AND "production" are included
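AND semantics over tags can be expressed in SQL with a `GROUP BY` and a `HAVING` count. The sketch assumes a schema with `documents(id)`, `tags(id, name)`, and `document_tags(document_id, tag_id)`. These column names are illustrative, as is the helper `documents_with_all_tags`.

```python
import sqlite3


def documents_with_all_tags(conn, tags):
    """Return IDs of documents that carry every tag in `tags`."""
    wanted = sorted({tag.strip().lower() for tag in tags if tag.strip()})
    if not wanted:
        rows = conn.execute("SELECT id FROM documents ORDER BY id")
        return [row[0] for row in rows]
    placeholders = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT dt.document_id
        FROM document_tags AS dt
        JOIN tags AS t ON t.id = dt.tag_id
        WHERE t.name IN ({placeholders})
        GROUP BY dt.document_id
        HAVING COUNT(DISTINCT t.name) = ?
        ORDER BY dt.document_id
    """
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]
```

Only placeholders are interpolated into the SQL string. The tag values themselves are bound as parameters.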
### Requirement: Type-based filtering
The system SHALL support filtering search results by document type. Valid types: `pdf`, `markdown`, `code`, `note`.
#### Scenario: Filter by type
- **WHEN** user runs `kb search "deploy" --type code`
- **THEN** only chunks from code documents are included in results
### Requirement: Score threshold
The system SHALL support a minimum score cutoff. Results with an RRF score below the threshold SHALL be excluded from output.
#### Scenario: Apply score threshold
- **WHEN** user runs `kb search "deploy" --threshold 0.02`
- **THEN** only results with RRF score >= 0.02 are returned
### Requirement: Result count control
The system SHALL return a configurable number of results (default: 10, configurable via `--top` flag or `search.default_top` in config).
#### Scenario: Request specific number of results
- **WHEN** user runs `kb search "deploy" --top 5`
- **THEN** at most 5 results are returned
#### Scenario: Fewer matches than requested
- **WHEN** user searches and only 3 chunks match
- **THEN** the system returns 3 results without error, with `returned: 3` in the output
@@ -0,0 +1,101 @@
## ADDED Requirements
### Requirement: JSON output format for search
The system SHALL output search results as JSON when `--format json` is used (this is the default). The JSON schema SHALL include: `query`, `results` array, `total_matches`, and `returned` count. Each result SHALL include: `chunk_id`, `score`, `score_breakdown` (with `fts` and `vector` sub-scores), `text`, and `source` object.
#### Scenario: JSON search output
- **WHEN** user runs `kb search "install git" --format json`
- **THEN** the output is valid JSON matching this structure:
```json
{
"query": "install git",
"results": [
{
"chunk_id": 1423,
"score": 0.031,
"score_breakdown": {"fts": 0.016, "vector": 0.015},
"text": "To install the latest version...",
"source": {
"document_id": 42,
"title": "Git Admin Guide",
"path": "/home/user/docs/git-admin.pdf",
"type": "pdf",
"page": 12,
"chunk_index": 3,
"total_chunks": 28,
"tags": ["git", "admin"]
}
}
],
"total_matches": 47,
"returned": 10
}
```
#### Scenario: Score breakdown in FTS-only mode
- **WHEN** user runs `kb search "test" --fts-only --format json`
- **THEN** `score_breakdown` contains `{"fts": <score>, "vector": null}`
#### Scenario: Score breakdown in vector-only mode
- **WHEN** user runs `kb search "test" --vec-only --format json`
- **THEN** `score_breakdown` contains `{"fts": null, "vector": <score>}`
### Requirement: Human-readable output format
The system SHALL support human-readable output via `--format human`. This format SHALL show: query, match count, and for each result: rank, score, title, page/section (if applicable), type, tags, and a text preview.
#### Scenario: Human-readable search output
- **WHEN** user runs `kb search "install git" --format human`
- **THEN** output is formatted for terminal reading:
```
Search: "install git" (47 matches, showing top 10)
1. [0.031] Git Admin Guide (p.12) [pdf] [git, admin]
To install the latest version of git from source...
2. [0.025] setup-notes.md §Installation [markdown] [git]
First, add the PPA repository for the latest git...
```
### Requirement: JSON output for list and tags commands
The system SHALL support `--format json` for `kb list`, `kb tags`, `kb info`, and `kb status` commands. JSON output SHALL be valid and parseable by the skill wrapper.
#### Scenario: List documents as JSON
- **WHEN** user runs `kb list --format json`
- **THEN** output is a JSON array of document objects with `id`, `title`, `type`, `tags`, `chunk_count`, `created_at`
#### Scenario: Tags as JSON
- **WHEN** user runs `kb tags --format json`
- **THEN** output is a JSON array: `[{"name": "git", "count": 15}, ...]`
#### Scenario: Status as JSON
- **WHEN** user runs `kb status --format json`
- **THEN** output is a JSON object with `documents` (counts by type), `total_chunks`, `db_size_bytes`, `model_name`, `embedding_dim`, `schema_version`
### Requirement: JSON schema stability
The JSON output schema SHALL be treated as a public contract. Fields MAY be added to JSON objects in future versions. Fields SHALL NOT be removed or renamed. The skill wrapper MUST be able to rely on the presence and type of all documented fields.
#### Scenario: Forward compatibility
- **WHEN** a future version adds a `language` field to search results
- **THEN** all existing fields remain present and unchanged, the new field is additive only
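On the consumer side, additive-only evolution works best when the parser ignores fields it does not know. The sketch below illustrates that idea; `SearchHit` and `parse_hit` are hypothetical names, and only three of the documented result fields are modelled to keep it short.

```python
from dataclasses import dataclass, fields


@dataclass
class SearchHit:
    # A subset of the documented result fields, chosen for illustration.
    chunk_id: int
    score: float
    text: str


def parse_hit(raw: dict) -> SearchHit:
    """Build a SearchHit from a result object, ignoring fields added later."""
    known = {f.name for f in fields(SearchHit)}
    # Unknown keys (e.g. a future `language` field) are dropped, not rejected.
    return SearchHit(**{k: v for k, v in raw.items() if k in known})
```

A missing documented field still fails loudly (a `TypeError` from the dataclass constructor), which matches the contract that documented fields are always present.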
### Requirement: Exit codes
The system SHALL use consistent exit codes: 0 for success, 1 for user errors (bad arguments, missing files), 2 for system errors (database corruption, model failure). JSON error output SHALL include an `error` field with a human-readable message.
#### Scenario: Successful operation
- **WHEN** any command completes successfully
- **THEN** exit code is 0
#### Scenario: User error with JSON output
- **WHEN** user runs `kb search` with no query argument
- **THEN** exit code is 1 and stderr contains a clear error message
#### Scenario: System error
- **WHEN** the SQLite database is corrupted
- **THEN** exit code is 2 and stderr contains the error details
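A caller could map these exit codes to outcomes along the following lines. This is a sketch only: the `interpret` helper and the shape of the dictionary it returns are assumptions, not part of the spec.

```python
import json


def interpret(returncode: int, stdout: str, stderr: str) -> dict:
    """Classify the outcome of a `kb` invocation by its exit code."""
    if returncode == 0:
        # Success: stdout is expected to hold the JSON payload.
        return {"ok": True, "data": json.loads(stdout)}
    # 1 = user error, 2 = system error, per the requirement above.
    kind = {1: "user_error", 2: "system_error"}.get(returncode, "unknown")
    return {"ok": False, "kind": kind, "error": stderr.strip()}
```

One point worth checking against the implementation: Click's own usage errors (for example a missing required argument) exit with status 2 by default, so the exit-code-1 scenario above may need explicit handling in the CLI.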
### Requirement: Skill definition file
The project SHALL include a `SKILL.md` file that defines how an LLM tool (e.g. Claude Code) should invoke and interpret `kb` commands. The skill file SHALL document: when to use the tool, available commands, output format, how to cite sources, and how to handle low-confidence results.
#### Scenario: Skill file exists
- **WHEN** the project is built
- **THEN** a `SKILL.md` file exists at the project root describing the skill interface for LLM consumption
+115
View File
@@ -0,0 +1,115 @@
## 1. Project Scaffolding
- [x] 1.1 Create Python virtual environment (`python3 -m venv .venv`) and add `.venv/` to `.gitignore`. All development and testing MUST run inside this venv.
- [x] 1.2 Create `pyproject.toml` with project metadata, dependencies (`click`, `sqlite-vec`, `pyyaml`, `sentence-transformers`, `onnxruntime`, `docling`), dev dependencies (`pytest`, `pytest-cov`), and `[project.scripts] kb = "kb_search.cli:main"` entry point
- [x] 1.3 Install the project in editable mode inside the venv: `.venv/bin/pip install -e ".[dev]"`
- [x] 1.4 Create `src/kb_search/` package directory with `__init__.py`
- [x] 1.5 Create `src/kb_search/cli.py` with Click group and stub subcommands (`init`, `add`, `search`, `list`, `info`, `remove`, `tags`, `tag`, `status`, `reindex`, `config`)
- [x] 1.6 Verify `.venv/bin/kb --help` shows all commands
## 2. Configuration
- [x] 2.1 Create `src/kb_search/config.py` — load YAML from `~/.kb/config.yaml` with deep-merge against built-in defaults. Handle missing file gracefully.
- [x] 2.2 Implement ENV variable overrides (`KB_DATA_DIR`, `KB_MODEL`, `KB_DEFAULT_TOP`, `KB_DEFAULT_FORMAT`) with precedence: CLI flags > ENV > YAML > defaults
- [x] 2.3 Implement `kb config` command — display fully resolved config with source indicators
- [x] 2.4 Implement `kb config set <key> <value>` — write to `~/.kb/config.yaml`, creating file if needed
- [x] 2.5 Write tests for config loading, merging, ENV overrides, and precedence
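The precedence in 2.2 (CLI flags > ENV > YAML > defaults) can be sketched as layered merges. The names below (`resolve`, `set_dotted`, the trimmed `DEFAULTS`) are illustrative assumptions and do not claim to mirror `config.py`.

```python
from copy import deepcopy

# Trimmed, illustrative defaults; the real set is larger.
DEFAULTS = {"data_dir": "~/.kb", "search": {"default_top": 10, "default_format": "json"}}

# ENV name -> (dotted config key, type coercion)
ENV_MAP = {
    "KB_DEFAULT_TOP": ("search.default_top", int),
    "KB_DEFAULT_FORMAT": ("search.default_format", str),
}


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict with `override` merged into `base`, recursing into dicts."""
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(result.get(key), dict) and isinstance(value, dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


def set_dotted(cfg: dict, dotted: str, value) -> None:
    """Set cfg["a"]["b"] = value for the dotted key "a.b"."""
    *parents, leaf = dotted.split(".")
    node = cfg
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value


def resolve(yaml_cfg: dict, env: dict, cli_flags: dict) -> dict:
    """Apply layers from lowest to highest precedence."""
    cfg = deep_merge(DEFAULTS, yaml_cfg)          # YAML over defaults
    for name, (dotted, cast) in ENV_MAP.items():  # ENV over YAML
        if name in env:
            set_dotted(cfg, dotted, cast(env[name]))
    for dotted, value in cli_flags.items():       # CLI flags over everything
        if value is not None:                     # None means "flag not given"
            set_dotted(cfg, dotted, value)
    return cfg
```

Treating `None` as "flag not given" lets Click options default to `None` without clobbering lower layers.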
## 3. Database Layer
- [x] 3.1 Create `src/kb_search/database.py` — SQLite connection management with sqlite-vec extension loading
- [x] 3.2 Implement schema creation: `documents`, `chunks`, `tags`, `document_tags`, `config` tables per design.md
- [x] 3.3 Implement FTS5 virtual table (`chunks_fts`) with `porter unicode61` tokenizer and sync triggers (INSERT, UPDATE, DELETE)
- [x] 3.4 Implement `chunks_vec` virtual table via sqlite-vec
- [x] 3.5 Implement schema versioning: store `schema_version` in `config` table, check on open, run migrations sequentially
- [x] 3.6 Implement DB config helpers: `get_config(key)`, `set_config(key, value)` for model binding
- [x] 3.7 Write tests for schema creation, migrations, FTS sync triggers, and config helpers
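For 3.3, one common way to keep an FTS5 index in step with a content table is the external-content pattern with three triggers. The sketch below assumes a simplified `chunks(id, text)` table and trigger names of my own choosing; the real schema lives in design.md, which is not shown here. It also assumes the local SQLite build includes FTS5.

```python
import sqlite3

SCHEMA = """
CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT NOT NULL);

-- External-content FTS5 table: the index stores tokens, the text stays in `chunks`.
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    text, content='chunks', content_rowid='id', tokenize='porter unicode61'
);

CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;

CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES ('delete', old.id, old.text);
END;

CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES ('delete', old.id, old.text);
    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the table, index, and sync triggers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def fts_ids(conn: sqlite3.Connection, query: str) -> list[int]:
    """Return matching chunk ids, best match first."""
    rows = conn.execute(
        "SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]
```

With the `porter` tokenizer, a query for `install` also matches a chunk containing `Installing`, since both reduce to the same stem.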
## 4. Embedding Management
- [x] 4.1 Create `src/kb_search/embeddings.py` — model download, ONNX export, and loading via `SentenceTransformer(model_name, backend="onnx")`
- [x] 4.2 Implement model-database binding: on init, write model_name + embedding_dim to DB config; on load, verify match and hard-error on mismatch
- [x] 4.3 Implement `embed_texts(texts: list[str]) -> list[list[float]]` with configurable query/passage prefix support
- [x] 4.4 Implement `kb init` command — create `~/.kb/`, init DB schema, download model, record binding. Support `--model` flag and `--status` check.
- [x] 4.5 Implement `kb reindex` command — download new model if `--model` specified, re-embed all chunks with progress bar, replace vectors atomically, update DB config
- [x] 4.6 Write tests for embedding, model binding verification, and mismatch detection
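The binding in 4.2 amounts to recording the model once and refusing to proceed if a later run disagrees. A minimal sketch, assuming a key/value `config` table; `bind_model`, `verify_binding`, and `ModelMismatchError` are hypothetical names, not the project's API.

```python
import sqlite3


class ModelMismatchError(RuntimeError):
    """Raised when the configured model differs from the one the DB was built with."""


def bind_model(conn: sqlite3.Connection, model_name: str, dim: int) -> None:
    """Record the embedding model and its dimension in the database."""
    conn.execute("CREATE TABLE IF NOT EXISTS config (key TEXT PRIMARY KEY, value TEXT)")
    conn.executemany(
        "INSERT OR REPLACE INTO config (key, value) VALUES (?, ?)",
        [("model_name", model_name), ("embedding_dim", str(dim))],
    )


def verify_binding(conn: sqlite3.Connection, model_name: str, dim: int) -> None:
    """Hard-error if the requested model does not match the stored binding."""
    stored = dict(conn.execute("SELECT key, value FROM config"))
    if stored.get("model_name") != model_name or stored.get("embedding_dim") != str(dim):
        raise ModelMismatchError(
            f"database is bound to {stored.get('model_name')} "
            f"({stored.get('embedding_dim')} dim), got {model_name} ({dim} dim)"
        )
```

Failing hard here is what keeps vectors from two different models from ever sharing one index; switching models goes through `kb reindex` instead.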
## 5. Document Ingestion — Core
- [x] 5.1 Create `src/kb_search/ingest/__init__.py` and `src/kb_search/ingest/detector.py` — file type detection by extension, routing to correct pipeline, `--type`/`--language` override support
- [x] 5.2 Implement deduplication: SHA-256 content hash, skip-if-exists check against `documents.content_hash`
- [x] 5.3 Implement `kb add <file>` command — detect type, route to pipeline, store document + chunks + embeddings + tags in a single transaction
- [x] 5.4 Implement `kb add --note "text"` — create note document with whole-text chunk, optional `--title`, auto-title from first 80 chars
- [x] 5.5 Implement `kb add <dir> --recursive` — walk directory, filter supported extensions, process each file, skip dupes, log failures to `~/.kb/ingest-errors.log`, display summary
- [x] 5.6 Implement parallel ingestion with configurable `--workers` (default: 4), serialised DB writes
- [x] 5.7 Write tests for type detection, dedup, note creation, and batch processing
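The deduplication step in 5.2 can be illustrated with a content hash and a lookup before insert. The table layout and the `add_if_new` helper below are simplified assumptions made for the sketch.

```python
import hashlib
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    # Simplified stand-in for the real `documents` table.
    conn.execute(
        "CREATE TABLE documents ("
        "id INTEGER PRIMARY KEY, title TEXT, content_hash TEXT UNIQUE)"
    )


def add_if_new(conn: sqlite3.Connection, title: str, data: bytes) -> bool:
    """Insert a document unless identical content is already indexed."""
    digest = hashlib.sha256(data).hexdigest()
    exists = conn.execute(
        "SELECT 1 FROM documents WHERE content_hash = ?", (digest,)
    ).fetchone()
    if exists:
        return False  # same bytes already indexed, possibly under another name
    conn.execute(
        "INSERT INTO documents (title, content_hash) VALUES (?, ?)", (title, digest)
    )
    return True
```

Because the hash covers content only, a renamed or moved copy of a file is skipped, while any edit to the file produces a new document.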
## 6. Document Ingestion — Docling Pipeline
- [x] 6.1 Create `src/kb_search/ingest/docling.py` — Docling `DocumentConverter` setup with `pypdfium2` backend, layout model enabled, table reconstruction enabled
- [x] 6.2 Implement OCR configuration (`auto`/`always`/`never`) per config.yaml `ingestion.enable_ocr`
- [x] 6.3 Implement hierarchy-aware chunking via Docling's `HierarchicalChunker`, with fallback to fixed-size chunking when hierarchy detection fails
- [x] 6.4 Extract and preserve chunk metadata: page number, section headers, table markers
- [x] 6.5 Wire Docling models to download on `kb init` (using HuggingFace default cache)
- [x] 6.6 Write tests with sample PDFs (text-based, table-heavy, mixed layout)
## 7. Document Ingestion — Markdown Pipeline
- [x] 7.1 Create `src/kb_search/ingest/markdown.py` — split at `##`/`###` header boundaries
- [x] 7.2 Implement parent header chain context (e.g. "Config > Advanced Options" prefix on nested chunks)
- [x] 7.3 Implement small section merging (sections below `min_tokens` merged with next section)
- [x] 7.4 Implement large section splitting at paragraph boundaries with overlap
- [x] 7.5 Implement fallback to fixed-size chunking for plain text files without headers
- [x] 7.6 Write tests for header splitting, merging, hierarchy context, and plain text fallback
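Tasks 7.1 and 7.2 can be sketched with a header stack that tracks the current parent chain. This is an illustration under simplifying assumptions: it splits at every header level rather than only `##`/`###`, does no merging or size-based splitting, and does not special-case `#` lines inside fenced code. `split_markdown` is a hypothetical name.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown(text: str) -> list[dict]:
    """Split markdown at headers, attaching the parent header chain to each chunk."""
    chunks = []
    stack = []  # (level, title) pairs for the headers currently in scope
    body = []   # lines collected since the last header

    def flush():
        content = "\n".join(body).strip()
        if content:
            context = " > ".join(title for _, title in stack)
            chunks.append({"context": context, "text": content})
        body.clear()

    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Leave any section at the same or deeper level before entering this one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks
```

Prefixing each chunk with its chain (for example "Config > Advanced Options") gives a short nested section enough context to be retrievable on its own.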
## 8. Document Ingestion — Code Pipeline
- [x] 8.1 Create `src/kb_search/ingest/code.py` — language detection from extension (`.py`, `.sh`, `.bash`, `.go`)
- [x] 8.2 Implement Python AST splitting using stdlib `ast` module — function and class boundaries, class docstring context on methods
- [x] 8.3 Implement Bash regex splitting — `function name()` and `name()` patterns with preceding comment blocks
- [x] 8.4 Implement Go regex splitting — `func` declarations with type grouping
- [x] 8.5 Implement fallback to fixed-size chunking when no function/class boundaries detected
- [x] 8.6 Write tests for each language parser and fallback behaviour
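For 8.2, the stdlib `ast` module exposes enough position information to cut a file at function and method boundaries. A minimal sketch, with `split_python` as a hypothetical name; it handles only top-level functions and the direct methods of top-level classes, and leaves out module-level code.

```python
import ast


def split_python(source: str) -> list[dict]:
    """Split Python source into one chunk per function or method."""
    funcs = (ast.FunctionDef, ast.AsyncFunctionDef)
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, funcs):
            chunks.append({
                "name": node.name,
                "context": None,
                "text": ast.get_source_segment(source, node),
            })
        elif isinstance(node, ast.ClassDef):
            class_doc = ast.get_docstring(node)
            for member in node.body:
                if isinstance(member, funcs):
                    chunks.append({
                        "name": f"{node.name}.{member.name}",
                        # The class docstring travels with each method as context.
                        "context": class_doc,
                        "text": ast.get_source_segment(source, member),
                    })
    return chunks
```

A source file that fails to parse raises `SyntaxError`, which is a natural trigger for the fixed-size fallback in 8.5.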
## 9. Hybrid Search
- [x] 9.1 Create `src/kb_search/search.py` — FTS5 query execution with BM25 scoring, special character escaping
- [x] 9.2 Implement vector similarity search: embed query, query `chunks_vec` for top-K (3× requested), cosine similarity
- [x] 9.3 Implement Reciprocal Rank Fusion: merge FTS and vector results with `score(d) = Σ 1/(k + rank)`, configurable `k` (default: 60)
- [x] 9.4 Implement `--fts-only` and `--vec-only` modes
- [x] 9.5 Implement tag filtering via SQL JOIN and type filtering via WHERE clause
- [x] 9.6 Implement `--threshold` score cutoff (post-RRF)
- [x] 9.7 Implement `--top` result count control (default from config)
- [x] 9.8 Wire up `kb search` command with all flags: `--top`, `--tags`, `--type`, `--format`, `--fts-only`, `--vec-only`, `--threshold`
- [x] 9.9 Write tests for FTS, vector search, RRF merging, filtering, and edge cases (empty results, fewer matches than requested)
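The fusion step in 9.3 can be sketched directly from the formula `score(d) = Σ 1/(k + rank)`. The function below is illustrative rather than the contents of `search.py`; it takes two ranked lists of chunk ids, and it assumes a chunk missing from one list gets a `null` sub-score for that list.

```python
def rrf_merge(fts_ids: list[int], vec_ids: list[int], k: int = 60, top: int = 10) -> list[dict]:
    """Merge two ranked id lists with Reciprocal Rank Fusion."""
    breakdowns: dict[int, dict] = {}
    for source, ids in (("fts", fts_ids), ("vector", vec_ids)):
        for rank, chunk_id in enumerate(ids, start=1):  # ranks are 1-based
            parts = breakdowns.setdefault(chunk_id, {"fts": None, "vector": None})
            parts[source] = 1.0 / (k + rank)
    merged = [
        {
            "chunk_id": chunk_id,
            "score": sum(v for v in parts.values() if v is not None),
            "score_breakdown": parts,
        }
        for chunk_id, parts in breakdowns.items()
    ]
    # Highest fused score first; chunk id breaks ties deterministically.
    merged.sort(key=lambda r: (-r["score"], r["chunk_id"]))
    return merged[:top]
```

Because only ranks enter the formula, BM25 scores and cosine similarities never need to be put on a common scale. A chunk ranked highly by both retrievers outranks one that tops a single list.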
## 10. Output Formatting
- [x] 10.1 Create `src/kb_search/output.py` — JSON formatter for search results matching the schema in skill-interface spec
- [x] 10.2 Implement human-readable formatter for search results (rank, score, title, page/section, type, tags, text preview)
- [x] 10.3 Implement JSON formatters for `list`, `tags`, `info`, and `status` commands
- [x] 10.4 Implement human-readable formatters for `list`, `tags`, `info`, and `status` commands
- [x] 10.5 Implement consistent exit codes: 0 success, 1 user error, 2 system error
- [x] 10.6 Write tests for JSON output schema validation and exit codes
## 11. Document Management Commands
- [x] 11.1 Implement `kb list` — query documents with optional `--type` and `--tags` filters, `--format` output
- [x] 11.2 Implement `kb info <doc_id>` — document details with chunk previews
- [x] 11.3 Implement `kb remove <doc_id>` — cascading delete with confirmation prompt, `--yes` flag
- [x] 11.4 Implement `kb tags` — list all tags with document counts, `--format` support
- [x] 11.5 Implement `kb tag <doc_id> --add/--remove` — tag management, case-insensitive storage
- [x] 11.6 Implement `kb status` — DB stats, model info, storage size, schema version
- [x] 11.7 Write tests for each management command
## 12. Skill Definition
- [x] 12.1 Write `SKILL.md` — when to use, available commands, output format, how to cite sources, handling low-confidence results, multi-query guidance
- [x] 12.2 Test the skill end-to-end: ingest sample documents, run searches via the skill prompt, verify Claude Code can parse and cite results
## 13. Packaging and Distribution
- [x] 13.1 Verify `pipx install kb-search` works from a clean environment
- [x] 13.2 Verify `kb init` downloads both embedding model and Docling models successfully
- [x] 13.3 Add a README with quickstart: install, init, add, search
- [x] 13.4 Add `py.typed` marker and basic type annotations on public interfaces
+10
View File
@@ -0,0 +1,10 @@
schema: spec-driven
context: |
  Tech stack: Python 3.11+, Click (CLI), SQLite (FTS5 + sqlite-vec), Docling, sentence-transformers
  Distribution: pipx (PyPI package name: kb-search, CLI entry point: kb)
  Storage: Single SQLite database at ~/.kb/kb.db
  Config: YAML at ~/.kb/config.yaml with ENV overrides
  Domain: CLI knowledge base / retrieval engine for personal document search
  Primary consumer: Claude Code skills (JSON output), secondary: human terminal use
  Local-first: no cloud dependencies, embedding models downloaded from HuggingFace on init
+32
View File
@@ -0,0 +1,32 @@
[build-system]
# setuptools>=77 is needed for the SPDX string form of `license` used below.
requires = ["setuptools>=77.0", "setuptools-scm>=8.0"]
build-backend = "setuptools.build_meta"
[project]
name = "kb-search"
version = "0.1.0"
description = "CLI knowledge base with hybrid search (FTS + vector)"
requires-python = ">=3.11"
license = "MIT"
dependencies = [
"click>=8.1",
"pyyaml>=6.0",
"sentence-transformers[onnx]>=3.0",
"sqlite-vec>=0.1.1",
"docling>=2.0",
]
[project.optional-dependencies]
dev = [
"pytest>=8.0",
"pytest-cov>=5.0",
]
[project.scripts]
kb = "kb_search.cli:main"
[tool.setuptools.packages.find]
where = ["src"]
[tool.pytest.ini_options]
testpaths = ["tests"]
+3
View File
@@ -0,0 +1,3 @@
"""kb-search: CLI knowledge base with hybrid search."""
__version__ = "0.1.0"
+616
View File
@@ -0,0 +1,616 @@
"""CLI entry point for kb-search."""
import click
@click.group()
@click.version_option(package_name="kb-search")
def main():
"""Personal knowledge base with hybrid search."""
@main.command()
@click.option("--model", default=None, help="Embedding model name (HuggingFace).")
@click.option("--status", is_flag=True, help="Show initialisation status.")
def init(model, status):
"""Initialise the knowledge base and download models."""
from kb_search.config import get_data_dir, get_db_path, load_config
from kb_search.database import get_connection, get_db_config, init_schema, run_migrations, set_db_config
from kb_search.embeddings import download_model, get_model_dim
cfg = load_config()
data_dir = get_data_dir(cfg)
db_path = get_db_path(cfg)
model_name = model or cfg["embedding"]["model"]
if status:
click.echo(f"Data directory: {data_dir} ({'exists' if data_dir.exists() else 'not created'})")
click.echo(f"Database: {db_path} ({'exists' if db_path.exists() else 'not created'})")
if db_path.exists():
conn = get_connection(db_path)
db_model = get_db_config(conn, "model_name", "not set")
db_dim = get_db_config(conn, "embedding_dim", "not set")
click.echo(f"Model: {db_model} ({db_dim} dim)")
conn.close()
else:
click.echo(f"Model: {model_name} (not yet initialised)")
return
# Create data directory
data_dir.mkdir(parents=True, exist_ok=True)
# Download model and get dimension
download_model(model_name)
dim = get_model_dim(model_name)
# Initialise database
conn = get_connection(db_path)
init_schema(conn, embedding_dim=dim)
run_migrations(conn)
set_db_config(conn, "model_name", model_name)
set_db_config(conn, "embedding_dim", str(dim))
conn.close()
click.echo(f"Knowledge base initialised at {data_dir}")
click.echo(f"Model: {model_name} ({dim} dimensions)")
click.echo("Ready! Add documents with `kb add`.")
@main.command()
@click.argument("path", required=False)
@click.option("--note", default=None, help="Add an inline text note.")
@click.option("--title", default=None, help="Title for the note.")
@click.option("--tags", default=None, help="Comma-separated tags.")
@click.option("--type", "doc_type", default=None, type=click.Choice(["pdf", "markdown", "code", "note"]), help="Force document type.")
@click.option("--language", default=None, type=click.Choice(["python", "bash", "go"]), help="Force code language.")
@click.option("--recursive", is_flag=True, help="Recurse into directories.")
@click.option("--workers", default=None, type=int, help="Number of parallel workers.")
def add(path, note, title, tags, doc_type, language, recursive, workers):
"""Add documents to the knowledge base."""
import hashlib
from pathlib import Path as P
from kb_search.config import get_db_path, load_config
from kb_search.database import (
get_connection, hash_exists, insert_chunk, insert_document,
insert_embedding, tag_document,
)
from kb_search.embeddings import check_model_binding, embed_texts
from kb_search.ingest.detector import detect_type, is_supported
from kb_search.ingest.note import auto_title, chunk_note
cfg = load_config()
db_path = get_db_path(cfg)
if not db_path.exists():
raise click.ClickException("Knowledge base not initialised. Run `kb init` first.")
conn = get_connection(db_path)
check_model_binding(conn, cfg)
model_name = cfg["embedding"]["model"]
tag_list = [t.strip() for t in tags.split(",")] if tags else []
if note:
# Inline note
content_hash = hashlib.sha256(note.encode()).hexdigest()
if hash_exists(conn, content_hash):
click.echo("Skipped: note (already indexed)")
conn.close()
return
note_title = title or auto_title(note)
chunks = chunk_note(note)
doc_id = insert_document(conn, note_title, None, content_hash, "note")
for c in chunks:
chunk_id = insert_chunk(conn, doc_id, c["chunk_index"], c["text"], metadata=c["metadata"])
emb = embed_texts(model_name, [c["text"]], prefix=cfg["embedding"].get("passage_prefix", ""))
insert_embedding(conn, chunk_id, emb[0])
if tag_list:
tag_document(conn, doc_id, tag_list)
conn.commit()
conn.close()
click.echo(f"Added note: {note_title}")
return
if not path:
raise click.ClickException("Provide a file/directory path or use --note.")
file_path = P(path).expanduser().resolve()
if file_path.is_dir():
_add_directory(conn, file_path, cfg, model_name, tag_list, doc_type, language,
recursive, workers)
elif file_path.is_file():
result = _add_single_file(conn, file_path, cfg, model_name, tag_list, doc_type, language)
click.echo(result)
else:
raise click.ClickException(f"Path not found: {file_path}")
conn.close()
def _add_single_file(conn, file_path, cfg, model_name, tag_list, force_type, force_language):
"""Add a single file. Returns a status message."""
import hashlib
from kb_search.database import (
hash_exists, insert_chunk, insert_document, insert_embedding, tag_document,
)
from kb_search.embeddings import embed_texts
from kb_search.ingest.detector import detect_type
# Dedup check
content_hash = hashlib.sha256(file_path.read_bytes()).hexdigest()
if hash_exists(conn, content_hash):
return f"Skipped: {file_path.name} (already indexed)"
doc_type, language = detect_type(file_path, force_type, force_language)
chunks = _get_chunks(file_path, doc_type, language, cfg)
if not chunks:
return f"Skipped: {file_path.name} (no content extracted)"
title = file_path.stem
doc_id = insert_document(conn, title, str(file_path), content_hash, doc_type,
language=language)
# Embed all chunks in one batch
texts = [c["text"] for c in chunks]
prefix = cfg["embedding"].get("passage_prefix", "")
embeddings = embed_texts(model_name, texts, prefix=prefix)
for c, emb in zip(chunks, embeddings):
chunk_id = insert_chunk(conn, doc_id, c["chunk_index"], c["text"],
token_count=c.get("token_count"),
metadata=c.get("metadata", {}))
insert_embedding(conn, chunk_id, emb)
if tag_list:
tag_document(conn, doc_id, tag_list)
conn.commit()
return f"Added: {file_path.name} ({len(chunks)} chunks)"
def _get_chunks(file_path, doc_type, language, cfg):
"""Route to the correct chunking pipeline."""
if doc_type == "pdf":
from kb_search.ingest.docling import chunk_document
return chunk_document(file_path, cfg)
elif doc_type == "markdown":
from kb_search.ingest.markdown import chunk_markdown
text = file_path.read_text(errors="replace")
return chunk_markdown(text, cfg)
elif doc_type == "code":
from kb_search.ingest.code import chunk_code
text = file_path.read_text(errors="replace")
return chunk_code(text, language, cfg)
elif doc_type == "note":
from kb_search.ingest.note import chunk_note
text = file_path.read_text(errors="replace")
return chunk_note(text)
return []
def _add_directory(conn, dir_path, cfg, model_name, tag_list, force_type, force_language,
                   recursive, workers):
    """Add all supported files in a directory."""
    from kb_search.config import get_data_dir
    from kb_search.ingest.detector import is_supported

    # NOTE: `workers` is accepted but not used yet; files are ingested sequentially.
    pattern = "**/*" if recursive else "*"
    files = sorted(f for f in dir_path.glob(pattern) if f.is_file() and is_supported(f))
    if not files:
        click.echo(f"No supported files found in {dir_path}")
        return

    added = 0
    skipped = 0
    failed = 0
    error_log_path = get_data_dir(cfg) / "ingest-errors.log"

    with click.progressbar(files, label="Ingesting", show_pos=True) as bar:
        for f in bar:
            try:
                result = _add_single_file(conn, f, cfg, model_name, tag_list,
                                          force_type, force_language)
                # Match the status prefix, not a substring, so a file named
                # e.g. "Skipped.md" is still counted as added.
                if result.startswith("Skipped"):
                    skipped += 1
                else:
                    added += 1
            except Exception as e:
                failed += 1
                # Discard partial inserts so the next file's commit cannot persist them.
                conn.rollback()
                with open(error_log_path, "a") as log:
                    log.write(f"{f}: {e}\n")

    click.echo(f"\nAdded {added} documents. {failed} failed. {skipped} skipped (already indexed).")
@main.command()
@click.argument("query")
@click.option("--top", default=None, type=int, help="Number of results.")
@click.option("--tags", default=None, help="Filter by tags (comma-separated).")
@click.option("--type", "doc_type", default=None, type=click.Choice(["pdf", "markdown", "code", "note"]), help="Filter by document type.")
@click.option("--format", "fmt", default=None, type=click.Choice(["json", "human"]), help="Output format.")
@click.option("--fts-only", is_flag=True, help="Full-text search only.")
@click.option("--vec-only", is_flag=True, help="Vector search only.")
@click.option("--threshold", default=None, type=float, help="Minimum score cutoff.")
def search(query, top, tags, doc_type, fmt, fts_only, vec_only, threshold):
"""Search the knowledge base."""
from kb_search.config import get_db_path, load_config
from kb_search.database import get_connection, get_db_config
from kb_search.embeddings import check_model_binding
from kb_search.search import hybrid_search
from kb_search.output import format_search_results
cfg = load_config()
db_path = get_db_path(cfg)
if not db_path.exists():
raise click.ClickException("Knowledge base not initialised. Run `kb init` first.")
conn = get_connection(db_path)
check_model_binding(conn, cfg)
model_name = get_db_config(conn, "model_name") or cfg["embedding"]["model"]
top = top or cfg["search"]["default_top"]
fmt = fmt or cfg["search"]["default_format"]
tag_list = [t.strip() for t in tags.split(",")] if tags else None
results = hybrid_search(
conn, query, model_name, cfg,
top=top, tags=tag_list, doc_type=doc_type,
fts_only=fts_only, vec_only=vec_only, threshold=threshold,
)
conn.close()
click.echo(format_search_results(results, fmt))
@main.command("list")
@click.option("--type", "doc_type", default=None, type=click.Choice(["pdf", "markdown", "code", "note"]), help="Filter by document type.")
@click.option("--tags", default=None, help="Filter by tags (comma-separated).")
@click.option("--format", "fmt", default=None, type=click.Choice(["json", "human"]), help="Output format.")
def list_docs(doc_type, tags, fmt):
"""List indexed documents."""
from kb_search.config import get_db_path, load_config
from kb_search.database import get_connection
from kb_search.output import format_document_list
cfg = load_config()
db_path = get_db_path(cfg)
if not db_path.exists():
raise click.ClickException("Knowledge base not initialised. Run `kb init` first.")
conn = get_connection(db_path)
fmt = fmt or cfg["search"]["default_format"]
sql = """
SELECT d.id, d.title, d.doc_type as type, d.created_at,
COUNT(c.id) as chunk_count
FROM documents d
LEFT JOIN chunks c ON d.id = c.document_id
"""
joins = []
where = []
params = []
if doc_type:
where.append("d.doc_type = ?")
params.append(doc_type)
tag_list = [t.strip().lower() for t in tags.split(",")] if tags else []
for i, tag in enumerate(tag_list):
joins.append(f"JOIN document_tags dt{i} ON d.id = dt{i}.document_id")
joins.append(f"JOIN tags t{i} ON dt{i}.tag_id = t{i}.id")
where.append(f"t{i}.name = ?")
params.append(tag)
sql += " " + " ".join(joins)
if where:
sql += " WHERE " + " AND ".join(where)
sql += " GROUP BY d.id ORDER BY d.created_at DESC"
rows = conn.execute(sql, params).fetchall()
docs = []
for row in rows:
tag_rows = conn.execute("""
SELECT t.name FROM tags t
JOIN document_tags dt ON t.id = dt.tag_id
WHERE dt.document_id = ?
ORDER BY t.name
""", (row["id"],)).fetchall()
docs.append({
"id": row["id"],
"title": row["title"],
"type": row["type"],
"tags": [r["name"] for r in tag_rows],
"chunk_count": row["chunk_count"],
"created_at": row["created_at"],
})
conn.close()
click.echo(format_document_list(docs, fmt))
@main.command()
@click.argument("doc_id", type=int)
@click.option("--format", "fmt", default=None, type=click.Choice(["json", "human"]), help="Output format.")
def info(doc_id, fmt):
"""Show document details."""
from kb_search.config import get_db_path, load_config
from kb_search.database import get_connection
from kb_search.output import format_doc_info
cfg = load_config()
db_path = get_db_path(cfg)
if not db_path.exists():
raise click.ClickException("Knowledge base not initialised. Run `kb init` first.")
conn = get_connection(db_path)
fmt = fmt or cfg["search"]["default_format"]
row = conn.execute("SELECT * FROM documents WHERE id = ?", (doc_id,)).fetchone()
if not row:
raise click.ClickException(f"Document not found: {doc_id}")
chunks = conn.execute(
"SELECT chunk_index, text FROM chunks WHERE document_id = ? ORDER BY chunk_index",
(doc_id,),
).fetchall()
tag_rows = conn.execute("""
SELECT t.name FROM tags t
JOIN document_tags dt ON t.id = dt.tag_id
WHERE dt.document_id = ?
ORDER BY t.name
""", (doc_id,)).fetchall()
info_data = {
"id": row["id"],
"title": row["title"],
"type": row["doc_type"],
"language": row["language"],
"path": row["source_path"],
"content_hash": row["content_hash"],
"created_at": row["created_at"],
"tags": [r["name"] for r in tag_rows],
"chunk_count": len(chunks),
"chunks": [{"chunk_index": c["chunk_index"], "text": c["text"]} for c in chunks],
}
conn.close()
click.echo(format_doc_info(info_data, fmt))
@main.command()
@click.argument("doc_id", type=int)
@click.option("--yes", is_flag=True, help="Skip confirmation prompt.")
def remove(doc_id, yes):
"""Remove a document from the knowledge base."""
from kb_search.config import get_db_path, load_config
from kb_search.database import get_connection
cfg = load_config()
db_path = get_db_path(cfg)
if not db_path.exists():
raise click.ClickException("Knowledge base not initialised. Run `kb init` first.")
conn = get_connection(db_path)
row = conn.execute("SELECT id, title FROM documents WHERE id = ?", (doc_id,)).fetchone()
if not row:
raise click.ClickException(f"Document not found: {doc_id}")
chunk_count = conn.execute(
"SELECT COUNT(*) FROM chunks WHERE document_id = ?", (doc_id,)
).fetchone()[0]
if not yes:
if not click.confirm(f"Remove '{row['title']}' and its {chunk_count} chunks?"):
click.echo("Cancelled.")
conn.close()
return
# Delete vectors for this document's chunks
conn.execute("""
DELETE FROM chunks_vec WHERE chunk_id IN (
SELECT id FROM chunks WHERE document_id = ?
)
""", (doc_id,))
# Cascade handles chunks, document_tags
conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
conn.commit()
conn.close()
click.echo(f"Removed '{row['title']}' ({chunk_count} chunks).")
@main.command("tags")
@click.option("--format", "fmt", default=None, type=click.Choice(["json", "human"]), help="Output format.")
def list_tags(fmt):
"""List all tags with document counts."""
from kb_search.config import get_db_path, load_config
from kb_search.database import get_connection
from kb_search.output import format_tags
cfg = load_config()
db_path = get_db_path(cfg)
if not db_path.exists():
raise click.ClickException("Knowledge base not initialised. Run `kb init` first.")
conn = get_connection(db_path)
fmt = fmt or cfg["search"]["default_format"]
rows = conn.execute("""
SELECT t.name, COUNT(dt.document_id) as count
FROM tags t
LEFT JOIN document_tags dt ON t.id = dt.tag_id
GROUP BY t.id
ORDER BY count DESC, t.name
""").fetchall()
tags = [{"name": r["name"], "count": r["count"]} for r in rows]
conn.close()
click.echo(format_tags(tags, fmt))
@main.command()
@click.argument("doc_id", type=int)
@click.option("--add", "add_tags", default=None, help="Tags to add (comma-separated).")
@click.option("--remove", "remove_tags", default=None, help="Tags to remove (comma-separated).")
def tag(doc_id, add_tags, remove_tags):
    """Manage tags on a document."""
    from kb_search.config import get_db_path, load_config
    from kb_search.database import get_connection, tag_document, untag_document

    if not add_tags and not remove_tags:
        # Without this check the command would exit silently having done nothing.
        raise click.ClickException("Nothing to do: pass --add and/or --remove.")

    cfg = load_config()
    db_path = get_db_path(cfg)
    if not db_path.exists():
        raise click.ClickException("Knowledge base not initialised. Run `kb init` first.")

    conn = get_connection(db_path)
    row = conn.execute("SELECT id, title FROM documents WHERE id = ?", (doc_id,)).fetchone()
    if not row:
        conn.close()
        raise click.ClickException(f"Document not found: {doc_id}")

    if add_tags:
        tags = [t.strip() for t in add_tags.split(",")]
        tag_document(conn, doc_id, tags)
        conn.commit()
        click.echo(f"Added tags [{', '.join(tags)}] to '{row['title']}'")
    if remove_tags:
        tags = [t.strip() for t in remove_tags.split(",")]
        untag_document(conn, doc_id, tags)
        conn.commit()
        click.echo(f"Removed tags [{', '.join(tags)}] from '{row['title']}'")
    conn.close()
@main.command()
@click.option("--format", "fmt", default=None, type=click.Choice(["json", "human"]), help="Output format.")
def status(fmt):
"""Show knowledge base status and statistics."""
from kb_search.config import get_db_path, load_config
from kb_search.database import get_connection, get_db_config
from kb_search.output import format_status
cfg = load_config()
db_path = get_db_path(cfg)
if not db_path.exists():
raise click.ClickException("Knowledge base not initialised. Run `kb init` first.")
conn = get_connection(db_path)
fmt = fmt or cfg["search"]["default_format"]
doc_counts = {}
for row in conn.execute("SELECT doc_type, COUNT(*) as cnt FROM documents GROUP BY doc_type").fetchall():
doc_counts[row["doc_type"]] = row["cnt"]
total_docs = sum(doc_counts.values())
total_chunks = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
db_size = db_path.stat().st_size
status_data = {
"model_name": get_db_config(conn, "model_name", "not set"),
"embedding_dim": get_db_config(conn, "embedding_dim", "not set"),
"schema_version": get_db_config(conn, "schema_version", "not set"),
"db_size_bytes": db_size,
"documents": doc_counts,
"total_documents": total_docs,
"total_chunks": total_chunks,
}
conn.close()
click.echo(format_status(status_data, fmt))
@main.command()
@click.option("--model", default=None, help="Switch to a different embedding model.")
def reindex(model):
"""Re-embed all chunks (optionally with a new model)."""
from kb_search.config import get_db_path, load_config
from kb_search.database import (
get_connection, get_db_config, insert_embedding,
recreate_vec_table, set_db_config,
)
from kb_search.embeddings import download_model, embed_texts, get_model_dim
cfg = load_config()
db_path = get_db_path(cfg)
if not db_path.exists():
raise click.ClickException("Knowledge base not initialised. Run `kb init` first.")
conn = get_connection(db_path)
model_name = model or get_db_config(conn, "model_name") or cfg["embedding"]["model"]
# Download model if switching
if model:
download_model(model_name)
dim = get_model_dim(model_name)
# Get all chunks
rows = conn.execute("SELECT id, text FROM chunks ORDER BY id").fetchall()
if not rows:
click.echo("No chunks to re-embed.")
conn.close()
return
click.echo(f"Re-embedding {len(rows)} chunks with '{model_name}' ({dim} dim)...")
# Embed in batches
batch_size = 256
all_ids = [r["id"] for r in rows]
all_texts = [r["text"] for r in rows]
prefix = cfg["embedding"].get("passage_prefix", "")
all_embeddings = []
with click.progressbar(range(0, len(all_texts), batch_size), label="Embedding") as bar:
for i in bar:
batch = all_texts[i:i + batch_size]
batch_embs = embed_texts(model_name, batch, prefix=prefix)
all_embeddings.extend(batch_embs)
# Atomically replace vectors
recreate_vec_table(conn, dim)
for chunk_id, emb in zip(all_ids, all_embeddings):
insert_embedding(conn, chunk_id, emb)
set_db_config(conn, "model_name", model_name)
set_db_config(conn, "embedding_dim", str(dim))
conn.commit()
conn.close()
click.echo(f"Reindex complete. {len(rows)} chunks embedded with '{model_name}'.")
@main.group(invoke_without_command=True)
@click.pass_context
def config(ctx):
"""View or modify configuration."""
if ctx.invoked_subcommand is None:
from kb_search.config import config_with_sources
entries = config_with_sources()
max_key = max(len(k) for k, _, _ in entries)
max_val = max(len(v) for _, v, _ in entries)
for key, value, source in entries:
click.echo(f" {key:<{max_key}} {value:<{max_val}} ({source})")
@config.command("set")
@click.argument("key")
@click.argument("value")
def config_set(key, value):
"""Set a configuration value."""
from kb_search.config import get_config_path, load_config, save_config_value
cfg = load_config()
path = get_config_path(cfg)
save_config_value(path, key, value)
click.echo(f"Set {key} = {value} in {path}")
+195
@@ -0,0 +1,195 @@
"""Configuration loading with YAML + ENV + defaults."""
import os
from copy import deepcopy
from pathlib import Path
import yaml
DEFAULTS = {
"data_dir": "~/.kb",
"embedding": {
"model": "all-MiniLM-L6-v2",
"query_prefix": "",
"passage_prefix": "",
},
"search": {
"default_top": 10,
"default_format": "json",
"rrf_k": 60,
},
"chunking": {
"defaults": {
"max_tokens": 512,
"overlap_tokens": 50,
},
"pdf": {
"strategy": "hierarchy",
"max_tokens": 1024,
},
"markdown": {
"strategy": "header",
"min_tokens": 50,
"max_tokens": 1024,
},
"code": {
"strategy": "ast",
"include_context": True,
"max_tokens": 1024,
},
"note": {
"strategy": "whole",
},
},
"ingestion": {
"workers": 4,
"batch_size": 50,
"enable_ocr": "auto",
},
}
# ENV variable mapping: ENV_NAME -> config dotted key
ENV_MAP = {
"KB_DATA_DIR": "data_dir",
"KB_MODEL": "embedding.model",
"KB_DEFAULT_TOP": "search.default_top",
"KB_DEFAULT_FORMAT": "search.default_format",
}
# Type coercions for ENV values
ENV_TYPES = {
"search.default_top": int,
}
def _deep_merge(base: dict, override: dict) -> dict:
"""Deep merge override into base, returning a new dict."""
result = deepcopy(base)
for key, value in override.items():
if key in result and isinstance(result[key], dict) and isinstance(value, dict):
result[key] = _deep_merge(result[key], value)
else:
result[key] = deepcopy(value)
return result
def _set_nested(d: dict, dotted_key: str, value):
"""Set a value in a nested dict using a dotted key path."""
keys = dotted_key.split(".")
for key in keys[:-1]:
d = d.setdefault(key, {})
d[keys[-1]] = value
def _get_nested(d: dict, dotted_key: str, default=None):
"""Get a value from a nested dict using a dotted key path."""
keys = dotted_key.split(".")
for key in keys:
if not isinstance(d, dict) or key not in d:
return default
d = d[key]
return d
def get_data_dir(cfg: dict) -> Path:
"""Resolve the data directory from config."""
return Path(cfg["data_dir"]).expanduser()
def get_config_path(cfg: dict) -> Path:
"""Path to the YAML config file."""
return get_data_dir(cfg) / "config.yaml"
def get_db_path(cfg: dict) -> Path:
"""Path to the SQLite database."""
return get_data_dir(cfg) / "kb.db"
def load_config(config_path: Path | None = None) -> dict:
"""Load config with precedence: ENV > YAML > defaults.
CLI flags are applied by the caller after this returns.
"""
cfg = deepcopy(DEFAULTS)
# Determine config file path (ENV can override data_dir which affects path)
if config_path is None:
data_dir = os.environ.get("KB_DATA_DIR", DEFAULTS["data_dir"])
config_path = Path(data_dir).expanduser() / "config.yaml"
# Load YAML if it exists
if config_path.is_file():
with open(config_path) as f:
yaml_cfg = yaml.safe_load(f) or {}
cfg = _deep_merge(cfg, yaml_cfg)
# Apply ENV overrides
for env_name, dotted_key in ENV_MAP.items():
env_val = os.environ.get(env_name)
if env_val is not None:
coerce = ENV_TYPES.get(dotted_key, str)
_set_nested(cfg, dotted_key, coerce(env_val))
return cfg
def save_config_value(config_path: Path, dotted_key: str, value: str):
"""Set a single value in the YAML config file."""
config_path.parent.mkdir(parents=True, exist_ok=True)
existing = {}
if config_path.is_file():
with open(config_path) as f:
existing = yaml.safe_load(f) or {}
# Try numeric coercion
try:
value = int(value)
except ValueError:
try:
value = float(value)
except ValueError:
if value.lower() in ("true", "false"):
value = value.lower() == "true"
_set_nested(existing, dotted_key, value)
with open(config_path, "w") as f:
yaml.dump(existing, f, default_flow_style=False, sort_keys=False)
def config_with_sources(config_path: Path | None = None) -> list[tuple[str, str, str]]:
"""Return a flat list of (dotted_key, value, source) tuples for display."""
if config_path is None:
data_dir = os.environ.get("KB_DATA_DIR", DEFAULTS["data_dir"])
config_path = Path(data_dir).expanduser() / "config.yaml"
yaml_cfg = {}
if config_path.is_file():
with open(config_path) as f:
yaml_cfg = yaml.safe_load(f) or {}
# Build reverse ENV map for source detection
env_keys = {v: k for k, v in ENV_MAP.items()}
def _flatten(d, prefix=""):
items = []
for k, v in d.items():
key = f"{prefix}.{k}" if prefix else k
if isinstance(v, dict):
items.extend(_flatten(v, key))
else:
# Determine source
env_name = env_keys.get(key)
if env_name and os.environ.get(env_name) is not None:
source = f"env ({env_name})"
elif _get_nested(yaml_cfg, key) is not None:
source = "config.yaml"
else:
source = "default"
items.append((key, str(v), source))
return items
cfg = load_config(config_path)
return _flatten(cfg)
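The precedence that `load_config` documents (ENV over YAML over defaults) can be illustrated with a small self-contained sketch. The helper names `deep_merge` and `resolve`, the sample values, and passing the environment as a plain dict are assumptions made for illustration; this is not the module's own API.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict with `override` merged into `base`, recursing into nested dicts."""
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(result.get(key), dict) and isinstance(value, dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


def resolve(defaults: dict, yaml_cfg: dict, env: dict, env_map: dict, env_types: dict) -> dict:
    """Layer YAML over defaults, then apply environment overrides by dotted key."""
    cfg = deep_merge(defaults, yaml_cfg)
    for env_name, dotted_key in env_map.items():
        if env_name in env:
            *parents, leaf = dotted_key.split(".")
            node = cfg
            for part in parents:
                node = node.setdefault(part, {})
            # Environment values arrive as strings, so coerce where a type is registered.
            node[leaf] = env_types.get(dotted_key, str)(env[env_name])
    return cfg


# Hypothetical sample layers, chosen only to show which one wins.
defaults = {"search": {"default_top": 10, "rrf_k": 60}}
yaml_cfg = {"search": {"default_top": 20}}
env = {"KB_DEFAULT_TOP": "5"}

cfg = resolve(
    defaults, yaml_cfg, env,
    env_map={"KB_DEFAULT_TOP": "search.default_top"},
    env_types={"search.default_top": int},
)
print(cfg)
```

The environment value wins over the YAML value, and keys that no layer overrides (here `rrf_k`) keep their defaults.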
+229
@@ -0,0 +1,229 @@
"""SQLite database management with FTS5 and sqlite-vec."""
import json
import sqlite3
from pathlib import Path
import sqlite_vec
SCHEMA_VERSION = 1
def get_connection(db_path: Path) -> sqlite3.Connection:
"""Open a SQLite connection with sqlite-vec loaded."""
conn = sqlite3.connect(str(db_path))
conn.enable_load_extension(True)
sqlite_vec.load(conn)
conn.enable_load_extension(False)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA foreign_keys=ON")
conn.row_factory = sqlite3.Row
return conn
def init_schema(conn: sqlite3.Connection, embedding_dim: int):
"""Create all tables, FTS, vector index, and triggers."""
conn.executescript(f"""
CREATE TABLE IF NOT EXISTS documents (
id INTEGER PRIMARY KEY AUTOINCREMENT,
title TEXT NOT NULL,
source_path TEXT,
content_hash TEXT NOT NULL,
doc_type TEXT NOT NULL CHECK(doc_type IN ('pdf','markdown','code','note')),
language TEXT,
created_at TEXT DEFAULT (datetime('now')),
metadata TEXT DEFAULT '{{}}'
);
CREATE TABLE IF NOT EXISTS chunks (
id INTEGER PRIMARY KEY AUTOINCREMENT,
document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
chunk_index INTEGER NOT NULL,
text TEXT NOT NULL,
token_count INTEGER,
metadata TEXT DEFAULT '{{}}',
created_at TEXT DEFAULT (datetime('now'))
);
CREATE TABLE IF NOT EXISTS tags (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS document_tags (
document_id INTEGER REFERENCES documents(id) ON DELETE CASCADE,
tag_id INTEGER REFERENCES tags(id) ON DELETE CASCADE,
PRIMARY KEY (document_id, tag_id)
);
CREATE TABLE IF NOT EXISTS config (
key TEXT PRIMARY KEY,
value TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_chunks_document_id ON chunks(document_id);
CREATE INDEX IF NOT EXISTS idx_documents_content_hash ON documents(content_hash);
""")
# FTS5 virtual table (content-sync with chunks)
conn.execute("""
CREATE VIRTUAL TABLE IF NOT EXISTS chunks_fts USING fts5(
text,
content='chunks',
content_rowid='id',
tokenize='porter unicode61'
)
""")
# FTS sync triggers
conn.executescript("""
CREATE TRIGGER IF NOT EXISTS chunks_ai AFTER INSERT ON chunks BEGIN
INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;
CREATE TRIGGER IF NOT EXISTS chunks_ad AFTER DELETE ON chunks BEGIN
INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
END;
CREATE TRIGGER IF NOT EXISTS chunks_au AFTER UPDATE ON chunks BEGIN
INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;
""")
# Vector table
conn.execute(f"""
CREATE VIRTUAL TABLE IF NOT EXISTS chunks_vec USING vec0(
chunk_id INTEGER PRIMARY KEY,
embedding FLOAT[{embedding_dim}]
)
""")
conn.commit()
def get_db_config(conn: sqlite3.Connection, key: str, default: str | None = None) -> str | None:
"""Get a value from the config table."""
row = conn.execute("SELECT value FROM config WHERE key = ?", (key,)).fetchone()
return row["value"] if row else default
def set_db_config(conn: sqlite3.Connection, key: str, value: str):
"""Set a value in the config table."""
conn.execute(
"INSERT INTO config (key, value) VALUES (?, ?) ON CONFLICT(key) DO UPDATE SET value = ?",
(key, value, value),
)
conn.commit()
def check_schema_version(conn: sqlite3.Connection) -> int | None:
    """Return the stored schema version (0 if unset), or None if the config table cannot be read."""
try:
return int(get_db_config(conn, "schema_version", "0"))
except Exception:
return None
def run_migrations(conn: sqlite3.Connection):
"""Run pending schema migrations."""
current = check_schema_version(conn) or 0
# Migration registry: version -> callable
migrations: dict[int, callable] = {
# Future migrations go here:
# 2: _migrate_v2,
}
for version in sorted(migrations.keys()):
if current < version:
migrations[version](conn)
set_db_config(conn, "schema_version", str(version))
if current < SCHEMA_VERSION:
set_db_config(conn, "schema_version", str(SCHEMA_VERSION))
def recreate_vec_table(conn: sqlite3.Connection, embedding_dim: int):
"""Drop and recreate the vector table with a new dimension."""
conn.execute("DROP TABLE IF EXISTS chunks_vec")
conn.execute(f"""
CREATE VIRTUAL TABLE chunks_vec USING vec0(
chunk_id INTEGER PRIMARY KEY,
embedding FLOAT[{embedding_dim}]
)
""")
conn.commit()
def insert_document(conn: sqlite3.Connection, title: str, source_path: str | None,
content_hash: str, doc_type: str, language: str | None = None,
metadata: dict | None = None) -> int:
"""Insert a document and return its ID."""
cur = conn.execute(
"INSERT INTO documents (title, source_path, content_hash, doc_type, language, metadata) "
"VALUES (?, ?, ?, ?, ?, ?)",
(title, source_path, content_hash, doc_type, language, json.dumps(metadata or {})),
)
return cur.lastrowid
def insert_chunk(conn: sqlite3.Connection, document_id: int, chunk_index: int,
text: str, token_count: int | None = None,
metadata: dict | None = None) -> int:
"""Insert a chunk and return its ID."""
cur = conn.execute(
"INSERT INTO chunks (document_id, chunk_index, text, token_count, metadata) "
"VALUES (?, ?, ?, ?, ?)",
(document_id, chunk_index, text, token_count, json.dumps(metadata or {})),
)
return cur.lastrowid
def insert_embedding(conn: sqlite3.Connection, chunk_id: int, embedding: list[float]):
"""Insert a chunk embedding into the vector table."""
import struct
blob = struct.pack(f"{len(embedding)}f", *embedding)
conn.execute(
"INSERT INTO chunks_vec (chunk_id, embedding) VALUES (?, ?)",
(chunk_id, blob),
)
def hash_exists(conn: sqlite3.Connection, content_hash: str) -> bool:
"""Check if a document with this content hash already exists."""
row = conn.execute(
"SELECT 1 FROM documents WHERE content_hash = ? LIMIT 1", (content_hash,)
).fetchone()
return row is not None
def get_or_create_tag(conn: sqlite3.Connection, name: str) -> int:
"""Get or create a tag, return its ID. Tags are stored lowercase."""
name = name.strip().lower()
row = conn.execute("SELECT id FROM tags WHERE name = ?", (name,)).fetchone()
if row:
return row["id"]
cur = conn.execute("INSERT INTO tags (name) VALUES (?)", (name,))
return cur.lastrowid
def tag_document(conn: sqlite3.Connection, document_id: int, tag_names: list[str]):
"""Associate tags with a document."""
for name in tag_names:
tag_id = get_or_create_tag(conn, name)
conn.execute(
"INSERT OR IGNORE INTO document_tags (document_id, tag_id) VALUES (?, ?)",
(document_id, tag_id),
)
def untag_document(conn: sqlite3.Connection, document_id: int, tag_names: list[str]):
"""Remove tag associations from a document."""
for name in tag_names:
name = name.strip().lower()
conn.execute(
"DELETE FROM document_tags WHERE document_id = ? AND tag_id = "
"(SELECT id FROM tags WHERE name = ?)",
(document_id, name),
)
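`insert_embedding` stores each vector as a packed blob of 32-bit floats. The round trip below is a minimal sketch of that encoding; `pack_f32` and `unpack_f32` are hypothetical helper names, not functions from this module.

```python
import struct


def pack_f32(vector: list[float]) -> bytes:
    """Pack floats as consecutive 32-bit values, 4 bytes each, in native byte order."""
    return struct.pack(f"{len(vector)}f", *vector)


def unpack_f32(blob: bytes) -> list[float]:
    """Inverse of pack_f32: recover the float list from the blob."""
    return list(struct.unpack(f"{len(blob) // 4}f", blob))


vec = [0.5, -1.25, 2.0]  # values chosen because float32 represents them exactly
blob = pack_f32(vec)
print(len(blob), unpack_f32(blob))
```

Values that float32 cannot represent exactly come back slightly rounded, which is expected for embeddings stored at this precision.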
+67
@@ -0,0 +1,67 @@
"""Embedding model management — download, load, and inference via ONNX."""
import click
from pathlib import Path
_model_instance = None
_model_name = None
def load_model(model_name: str):
    """Load a sentence-transformers model with ONNX backend. Caches in-process."""
    global _model_instance, _model_name
    if _model_instance is not None and _model_name == model_name:
        return _model_instance
    from sentence_transformers import SentenceTransformer
    click.echo(f"Loading model '{model_name}'...")
    try:
        _model_instance = SentenceTransformer(model_name, backend="onnx")
    except Exception:
        # Fallback: some models may not ship a usable pre-exported ONNX file.
        # Retrying the identical call would fail the same way, so ask
        # sentence-transformers to export the model explicitly.
        click.echo("Optimising model for ONNX inference (one-time)...")
        _model_instance = SentenceTransformer(
            model_name, backend="onnx", model_kwargs={"export": True}
        )
    _model_name = model_name
    return _model_instance
def get_model_dim(model_name: str) -> int:
"""Get the embedding dimension for a model."""
model = load_model(model_name)
return model.get_sentence_embedding_dimension()
def embed_texts(model_name: str, texts: list[str],
prefix: str = "", show_progress: bool = False) -> list[list[float]]:
"""Embed a list of texts, returning float vectors."""
model = load_model(model_name)
if prefix:
texts = [prefix + t for t in texts]
embeddings = model.encode(texts, show_progress_bar=show_progress, convert_to_numpy=True)
return [e.tolist() for e in embeddings]
def download_model(model_name: str):
"""Pre-download a model (for kb init)."""
click.echo(f"Downloading embedding model '{model_name}'...")
load_model(model_name)
click.echo("Embedding model ready.")
def check_model_binding(conn, cfg: dict):
"""Verify the loaded model matches what the DB expects. Raises on mismatch."""
from kb_search.database import get_db_config
db_model = get_db_config(conn, "model_name")
if db_model is None:
return # Not yet initialised
config_model = cfg["embedding"]["model"]
if db_model != config_model:
db_dim = get_db_config(conn, "embedding_dim", "?")
raise click.ClickException(
f"Model mismatch: DB uses '{db_model}' ({db_dim} dim) but config specifies "
f"'{config_model}'. Run `kb reindex --model {config_model}` to switch models."
)
+244
@@ -0,0 +1,244 @@
"""Code ingestion — AST/regex-based splitting for Python, Bash, Go."""
import ast
import re
def chunk_code(text: str, language: str | None, cfg: dict) -> list[dict]:
"""Split code at function/class boundaries."""
chunking_cfg = cfg.get("chunking", {}).get("code", {})
strategy = chunking_cfg.get("strategy", "ast")
include_context = chunking_cfg.get("include_context", True)
if strategy == "fixed":
return _fixed_chunk(text, chunking_cfg)
if language == "python":
chunks = _chunk_python(text, include_context)
elif language in ("bash", "sh"):
chunks = _chunk_bash(text, include_context)
elif language == "go":
chunks = _chunk_go(text, include_context)
else:
chunks = []
if not chunks:
return _fixed_chunk(text, chunking_cfg)
for i, c in enumerate(chunks):
c["chunk_index"] = i
return chunks
def _chunk_python(text: str, include_context: bool) -> list[dict]:
"""Split Python using stdlib ast module."""
try:
tree = ast.parse(text)
except SyntaxError:
return []
lines = text.splitlines(keepends=True)
chunks = []
for node in ast.iter_child_nodes(tree):
if isinstance(node, ast.ClassDef):
class_lines = _get_node_source(lines, node)
class_docstring = ast.get_docstring(node) or ""
# Each method becomes a chunk
methods = [n for n in ast.iter_child_nodes(node) if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
if methods:
for method in methods:
method_src = _get_node_source(lines, method)
if include_context and class_docstring:
context = f"class {node.name}:\n \"\"\"{class_docstring}\"\"\"\n\n"
chunk_text = context + method_src
elif include_context:
chunk_text = f"class {node.name}:\n\n" + method_src
else:
chunk_text = method_src
chunks.append({
"text": chunk_text,
"metadata": {
"symbol_name": f"{node.name}.{method.name}",
"line_start": method.lineno,
"line_end": method.end_lineno,
},
})
else:
# Class with no methods — single chunk
chunks.append({
"text": class_lines,
"metadata": {
"symbol_name": node.name,
"line_start": node.lineno,
"line_end": node.end_lineno,
},
})
elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
func_src = _get_node_source(lines, node)
chunks.append({
"text": func_src,
"metadata": {
"symbol_name": node.name,
"line_start": node.lineno,
"line_end": node.end_lineno,
},
})
return chunks
def _get_node_source(lines: list[str], node) -> str:
"""Extract source code for an AST node, including decorators."""
start = node.lineno - 1
# Include decorators
if hasattr(node, "decorator_list") and node.decorator_list:
start = node.decorator_list[0].lineno - 1
end = node.end_lineno
return "".join(lines[start:end]).rstrip()
def _chunk_bash(text: str, include_context: bool) -> list[dict]:
"""Split Bash at function boundaries using regex."""
# Match: function name() { or name() {
func_pattern = re.compile(
r"^((?:#[^\n]*\n)*)?" # Optional preceding comment block
r"(?:function\s+(\w+)\s*\(\s*\)\s*\{|(\w+)\s*\(\s*\)\s*\{)",
re.MULTILINE,
)
chunks = []
matches = list(func_pattern.finditer(text))
if not matches:
return []
for i, match in enumerate(matches):
start = match.start()
# Find end: next function or end of file
if i + 1 < len(matches):
end = matches[i + 1].start()
else:
end = len(text)
func_name = match.group(2) or match.group(3)
chunk_text = text[start:end].rstrip()
chunks.append({
"text": chunk_text,
"metadata": {
"symbol_name": func_name,
},
})
return chunks
def _comment_block_start(text: str, pos: int) -> int:
    """Return the offset where the `//` comment block directly above `pos` begins.

    `pos` must be the start of a line. Returns `pos` if no comment precedes it.
    """
    start = pos
    while start > 0:
        # Locate the line that ends just before `start`.
        prev_start = text.rfind("\n", 0, start - 1) + 1
        if not text[prev_start:start].strip().startswith("//"):
            break
        start = prev_start
    return start
def _chunk_go(text: str, include_context: bool) -> list[dict]:
    """Split Go at func declarations using regex."""
    func_pattern = re.compile(
        r"^func\s+(?:\([^)]*\)\s+)?(\w+)\s*\(",
        re.MULTILINE,
    )
    matches = list(func_pattern.finditer(text))
    if not matches:
        return []
    # Each chunk starts at the doc comment directly above its func, so the
    # previous chunk ends there too and never swallows the next func's comment.
    # Offsets are used instead of searching for line text, which fails on
    # blank lines and on repeated comment lines.
    starts = [_comment_block_start(text, m.start()) for m in matches]
    chunks = []
    for i, match in enumerate(matches):
        end = starts[i + 1] if i + 1 < len(matches) else len(text)
        chunks.append({
            "text": text[starts[i]:end].rstrip(),
            "metadata": {
                "symbol_name": match.group(1),
            },
        })
    return chunks
def _fixed_chunk(text: str, chunking_cfg: dict) -> list[dict]:
"""Fixed-size fallback for code without recognisable boundaries."""
max_tokens = chunking_cfg.get("max_tokens", 1024)
overlap_tokens = chunking_cfg.get("overlap_tokens", 50)
lines = text.splitlines()
if not lines:
return []
# Approximate tokens as words
chunks = []
current_lines = []
current_tokens = 0
idx = 0
for line in lines:
line_tokens = len(line.split())
if current_tokens + line_tokens > max_tokens and current_lines:
chunks.append({
"text": "\n".join(current_lines),
"chunk_index": idx,
"metadata": {},
})
idx += 1
# Keep some overlap
overlap_lines = []
overlap_count = 0
for l in reversed(current_lines):
l_tokens = len(l.split())
if overlap_count + l_tokens > overlap_tokens:
break
overlap_lines.insert(0, l)
overlap_count += l_tokens
current_lines = overlap_lines
current_tokens = overlap_count
current_lines.append(line)
current_tokens += line_tokens
if current_lines:
chunks.append({
"text": "\n".join(current_lines),
"chunk_index": idx,
"metadata": {},
})
return chunks
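The Python path of `chunk_code` relies on the stdlib `ast` module to find function and method boundaries. The sketch below shows only that lookup step on a made-up source string; `list_symbols` and `SOURCE` are hypothetical names introduced for illustration and are not part of this module.

```python
import ast

SOURCE = '''\
def top():
    return 1


class Greeter:
    def hello(self):
        return "hi"

    async def later(self):
        return "soon"
'''


def list_symbols(source: str) -> list[tuple[str, int, int]]:
    """Return (qualified_name, first_line, last_line) for top-level functions and class methods."""
    funcs = (ast.FunctionDef, ast.AsyncFunctionDef)
    symbols = []
    for node in ast.iter_child_nodes(ast.parse(source)):
        if isinstance(node, funcs):
            symbols.append((node.name, node.lineno, node.end_lineno))
        elif isinstance(node, ast.ClassDef):
            # Methods are reported with the class name as a prefix.
            for child in ast.iter_child_nodes(node):
                if isinstance(child, funcs):
                    symbols.append((f"{node.name}.{child.name}", child.lineno, child.end_lineno))
    return symbols


for symbol in list_symbols(SOURCE):
    print(symbol)
```

Line numbers from `ast` are 1-based and `end_lineno` is inclusive, which is why slicing a list of source lines uses `lineno - 1` as the start and `end_lineno` as the stop.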
+54
@@ -0,0 +1,54 @@
"""File type detection and routing."""
from pathlib import Path
EXTENSION_MAP = {
# Docling-handled formats
".pdf": ("pdf", None),
".docx": ("pdf", None), # Docling handles DOCX too
".html": ("pdf", None),
".htm": ("pdf", None),
".png": ("pdf", None),
".jpg": ("pdf", None),
".jpeg": ("pdf", None),
".tiff": ("pdf", None),
".bmp": ("pdf", None),
".webp": ("pdf", None),
# Markdown / text
".md": ("markdown", None),
".markdown": ("markdown", None),
".txt": ("markdown", None),
# Code
".py": ("code", "python"),
".sh": ("code", "bash"),
".bash": ("code", "bash"),
".go": ("code", "go"),
}
SUPPORTED_EXTENSIONS = set(EXTENSION_MAP.keys())
def detect_type(path: Path, force_type: str | None = None,
force_language: str | None = None) -> tuple[str, str | None]:
"""Detect document type and language from file extension.
Returns (doc_type, language) tuple.
Raises ValueError for unsupported file types.
"""
if force_type:
return force_type, force_language
ext = path.suffix.lower()
if ext not in EXTENSION_MAP:
supported = ", ".join(sorted(SUPPORTED_EXTENSIONS))
raise ValueError(f"Unsupported file type '{ext}'. Supported: {supported}")
doc_type, language = EXTENSION_MAP[ext]
if force_language:
language = force_language
return doc_type, language
def is_supported(path: Path) -> bool:
"""Check if a file has a supported extension."""
return path.suffix.lower() in SUPPORTED_EXTENSIONS
+123
@@ -0,0 +1,123 @@
"""Docling-based ingestion for PDFs, DOCX, HTML, and images."""
import logging
from pathlib import Path
# Suppress noisy Docling/RapidOCR logging
logging.getLogger("RapidOCR").setLevel(logging.ERROR)
logging.getLogger("docling.models.stages.ocr.rapid_ocr_model").setLevel(logging.ERROR)
logging.getLogger("docling").setLevel(logging.WARNING)
def chunk_document(file_path: Path, cfg: dict) -> list[dict]:
"""Ingest a document using Docling and return chunks."""
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions
# Configure PDF pipeline
ocr_setting = cfg.get("ingestion", {}).get("enable_ocr", "auto")
pdf_opts = PdfPipelineOptions()
if ocr_setting == "never":
pdf_opts.do_ocr = False
elif ocr_setting == "always":
pdf_opts.do_ocr = True
pdf_opts.ocr_options = RapidOcrOptions(force_full_page_ocr=True)
else:
# "auto" — enable OCR but only trigger on pages with significant bitmap content
pdf_opts.do_ocr = True
pdf_opts.ocr_options = RapidOcrOptions(bitmap_area_threshold=0.25)
converter = DocumentConverter(
format_options={
InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_opts),
}
)
# Convert
result = converter.convert(str(file_path))
doc = result.document
# Chunk using hierarchy-aware chunker
chunking_cfg = cfg.get("chunking", {}).get("pdf", {})
strategy = chunking_cfg.get("strategy", "hierarchy")
if strategy == "hierarchy":
chunks = _hierarchy_chunk(doc)
else:
chunks = _fixed_chunk(doc, chunking_cfg)
if not chunks:
# Fallback: try extracting raw text
text = doc.export_to_markdown()
if text and text.strip():
chunks = _fixed_chunk_text(text, chunking_cfg)
return chunks
def _hierarchy_chunk(doc) -> list[dict]:
    """Use Docling's HierarchicalChunker."""
    from docling_core.transforms.chunker import HierarchicalChunker
    chunker = HierarchicalChunker()
    chunks = []
    for i, chunk in enumerate(chunker.chunk(doc)):
        meta = {}
        if hasattr(chunk, "meta") and chunk.meta:
            # Record the first page the chunk appears on. The generator stops at
            # the first provenance entry, so later items cannot overwrite it.
            page = next(
                (
                    prov.page_no
                    for item in getattr(chunk.meta, "doc_items", None) or []
                    for prov in getattr(item, "prov", None) or []
                    if hasattr(prov, "page_no")
                ),
                None,
            )
            if page is not None:
                meta["page"] = page
            # Section headers
            if hasattr(chunk.meta, "headings") and chunk.meta.headings:
                meta["section_header"] = " > ".join(chunk.meta.headings)
        chunks.append({
            "text": chunk.text,
            "chunk_index": i,
            "metadata": meta,
        })
    return chunks
def _fixed_chunk(doc, chunking_cfg: dict) -> list[dict]:
"""Fixed-size chunking from Docling document."""
text = doc.export_to_markdown()
return _fixed_chunk_text(text, chunking_cfg)
def _fixed_chunk_text(text: str, chunking_cfg: dict) -> list[dict]:
    """Fixed-size chunking from plain text."""
    max_tokens = chunking_cfg.get("max_tokens", 1024)
    overlap = chunking_cfg.get("overlap_tokens", 50)
    # Approximate: 1 token ~= 4 chars
    max_chars = max_tokens * 4
    # Keep the overlap smaller than the window so every step moves forward.
    overlap_chars = min(overlap * 4, max_chars - 1)
    chunks = []
    start = 0
    idx = 0
    while start < len(text):
        end = start + max_chars
        chunk_text = text[start:end].strip()
        if chunk_text:
            chunks.append({
                "text": chunk_text,
                "chunk_index": idx,
                "metadata": {},
            })
            idx += 1
        if end >= len(text):
            # The window reached the end; another pass would only repeat the tail.
            break
        start = end - overlap_chars
    return chunks
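Fixed-size chunking with overlap, as `_fixed_chunk_text` does over characters, comes down to stepping a window forward by `size - overlap`. The sketch below computes only the window offsets; `window_spans` is a hypothetical helper, and rejecting an overlap that is not smaller than the window is an assumption about how one might guard against a loop that never advances.

```python
def window_spans(length: int, size: int, overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) offsets of fixed-size windows that share `overlap` units."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    spans = []
    start = 0
    while start < length:
        end = min(start + size, length)
        spans.append((start, end))
        if end == length:
            break  # the last window already reaches the end of the input
        start = end - overlap
    return spans


print(window_spans(100, 60, 20))
```

Stopping as soon as a window touches the end of the input avoids emitting an extra chunk that contains nothing but the overlapping tail.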
+210
@@ -0,0 +1,210 @@
"""Markdown ingestion — header-based splitting."""
import re
def chunk_markdown(text: str, cfg: dict) -> list[dict]:
"""Split markdown at header boundaries with hierarchy context."""
chunking_cfg = cfg.get("chunking", {}).get("markdown", {})
strategy = chunking_cfg.get("strategy", "header")
if strategy == "fixed" or not _has_headers(text):
return _fixed_chunk(text, chunking_cfg)
return _header_chunk(text, chunking_cfg)
def _has_headers(text: str) -> bool:
"""Check if text contains markdown headers."""
return bool(re.search(r"^#{1,6}\s+", text, re.MULTILINE))
def _header_chunk(text: str, chunking_cfg: dict) -> list[dict]:
    """Split at header boundaries (levels # through ######) with hierarchy context."""
min_tokens = chunking_cfg.get("min_tokens", 50)
max_tokens = chunking_cfg.get("max_tokens", 1024)
sections = _split_at_headers(text)
if not sections:
return _fixed_chunk(text, chunking_cfg)
# Merge small sections
sections = _merge_small_sections(sections, min_tokens)
# Split large sections
chunks = []
for section in sections:
content = section["content"].strip()
if not content:
continue
# Add hierarchy context
if section["header_chain"]:
context = " > ".join(section["header_chain"])
full_text = f"{context}\n\n{content}"
else:
full_text = content
approx_tokens = len(full_text.split())
if approx_tokens > max_tokens:
sub_chunks = _split_large_section(full_text, max_tokens, chunking_cfg)
chunks.extend(sub_chunks)
else:
chunks.append({"text": full_text, "metadata": {
"section_header": section["header_chain"][-1] if section["header_chain"] else None,
}})
# Assign chunk indices
for i, c in enumerate(chunks):
c["chunk_index"] = i
return chunks
def _split_at_headers(text: str) -> list[dict]:
"""Split text into sections at header boundaries."""
header_pattern = re.compile(r"^(#{1,6})\s+(.*?)$", re.MULTILINE)
sections = []
header_stack = [] # Stack of (level, title)
last_end = 0
for match in header_pattern.finditer(text):
# Capture content before this header
if last_end < match.start():
content = text[last_end:match.start()].strip()
if content and sections:
sections[-1]["content"] += "\n\n" + content
elif content:
sections.append({
"header_chain": [],
"content": content,
})
level = len(match.group(1))
title = match.group(2).strip()
# Update header stack
while header_stack and header_stack[-1][0] >= level:
header_stack.pop()
header_stack.append((level, title))
chain = [h[1] for h in header_stack]
sections.append({
"header_chain": chain,
"content": "",
})
last_end = match.end()
# Capture trailing content
if last_end < len(text):
trailing = text[last_end:].strip()
if trailing and sections:
sections[-1]["content"] += "\n\n" + trailing
elif trailing:
sections.append({"header_chain": [], "content": trailing})
return sections
def _merge_small_sections(sections: list[dict], min_tokens: int) -> list[dict]:
"""Merge sections smaller than min_tokens with next section."""
if not sections:
return sections
merged = []
pending = None
for section in sections:
if pending is not None:
# Merge pending into this section
section["content"] = pending["content"] + "\n\n" + section["content"]
if not section["header_chain"] and pending["header_chain"]:
section["header_chain"] = pending["header_chain"]
pending = None
approx_tokens = len(section["content"].split())
if approx_tokens < min_tokens:
pending = section
else:
merged.append(section)
if pending is not None:
if merged:
merged[-1]["content"] += "\n\n" + pending["content"]
else:
merged.append(pending)
return merged
def _split_large_section(text: str, max_tokens: int, chunking_cfg: dict) -> list[dict]:
"""Split a large section at paragraph boundaries with overlap."""
overlap_tokens = chunking_cfg.get("overlap_tokens",
cfg_defaults().get("overlap_tokens", 50))
paragraphs = re.split(r"\n\n+", text)
chunks = []
current_paras = []
current_tokens = 0
for para in paragraphs:
para_tokens = len(para.split())
if current_tokens + para_tokens > max_tokens and current_paras:
chunks.append({"text": "\n\n".join(current_paras), "metadata": {}})
# Keep overlap
overlap_paras = []
overlap_count = 0
for p in reversed(current_paras):
p_tokens = len(p.split())
if overlap_count + p_tokens > overlap_tokens:
break
overlap_paras.insert(0, p)
overlap_count += p_tokens
current_paras = overlap_paras
current_tokens = overlap_count
current_paras.append(para)
current_tokens += para_tokens
if current_paras:
chunks.append({"text": "\n\n".join(current_paras), "metadata": {}})
return chunks
def cfg_defaults():
"""Return default chunking config."""
return {"max_tokens": 1024, "overlap_tokens": 50, "min_tokens": 50}
def _fixed_chunk(text: str, chunking_cfg: dict) -> list[dict]:
"""Fixed-size fallback for plain text without headers."""
max_tokens = chunking_cfg.get("max_tokens", 512)
overlap_tokens = chunking_cfg.get("overlap_tokens", 50)
words = text.split()
if not words:
return []
chunks = []
start = 0
idx = 0
while start < len(words):
end = min(start + max_tokens, len(words))
chunk_text = " ".join(words[start:end]).strip()
if chunk_text:
chunks.append({
"text": chunk_text,
"chunk_index": idx,
"metadata": {},
})
idx += 1
start = end - overlap_tokens
if start >= len(words) or end == len(words):
break
return chunks
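The hierarchy context that `_split_at_headers` attaches to each section comes from a stack of open headers. A minimal sketch of that stack logic follows; `header_chains` and the sample document are hypothetical and serve only to show how the chain changes as header levels rise and fall.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.*?)$", re.MULTILINE)


def header_chains(markdown: str) -> list[list[str]]:
    """Return the ancestor chain (outermost first) for every header in the text."""
    stack: list[tuple[int, str]] = []
    chains = []
    for match in HEADER.finditer(markdown):
        level = len(match.group(1))
        # Pop siblings and deeper headers so only true ancestors remain.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, match.group(2).strip()))
        chains.append([title for _, title in stack])
    return chains


doc = "# Guide\n## Install\n### Linux\n## Usage\n"
print(header_chains(doc))
```

When `## Usage` arrives, both `### Linux` and `## Install` are popped, so the chain for that section is `Guide > Usage` rather than inheriting the previous subsection.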
+19
@@ -0,0 +1,19 @@
"""Note ingestion — whole-document chunks."""
def chunk_note(text: str) -> list[dict]:
"""Return note text as a single chunk."""
return [{"text": text, "metadata": {}, "chunk_index": 0}]
def auto_title(text: str, max_len: int = 80) -> str:
"""Generate a title from the first line of text, truncated at word boundary."""
first_line = text.strip().split("\n")[0].strip()
if len(first_line) <= max_len:
return first_line
truncated = first_line[:max_len]
# Truncate at last space
last_space = truncated.rfind(" ")
if last_space > 0:
truncated = truncated[:last_space]
return truncated + "..."
+144
@@ -0,0 +1,144 @@
"""Output formatters — JSON and human-readable."""
import json
import sys
def format_search_results(data: dict, fmt: str = "json") -> str:
"""Format search results for output."""
if fmt == "json":
return json.dumps(data, indent=2, ensure_ascii=False)
return _human_search(data)
def _human_search(data: dict) -> str:
"""Human-readable search output."""
lines = []
total = data["total_matches"]
returned = data["returned"]
lines.append(f'Search: "{data["query"]}" ({total} matches, showing top {returned})')
lines.append("")
for i, r in enumerate(data["results"], 1):
src = r["source"]
score = r["score"]
# Title with page/section
location = ""
if src.get("page"):
location = f" (p.{src['page']})"
elif src.get("section_header"):
location = f" \u00a7{src['section_header']}"
# Tags
tag_str = ""
if src.get("tags"):
tag_str = " [" + ", ".join(src["tags"]) + "]"
lines.append(f" {i:2d}. [{score:.3f}] {src['title']}{location} [{src['type']}]{tag_str}")
# Text preview (first 200 chars)
preview = r["text"][:200].replace("\n", " ").strip()
if len(r["text"]) > 200:
preview += "..."
lines.append(f" {preview}")
lines.append("")
return "\n".join(lines)
def format_document_list(docs: list[dict], fmt: str = "json") -> str:
"""Format document list."""
if fmt == "json":
return json.dumps(docs, indent=2, ensure_ascii=False)
return _human_doc_list(docs)
def _human_doc_list(docs: list[dict]) -> str:
"""Human-readable document list."""
if not docs:
return "No documents indexed. Run `kb add` to get started."
lines = [f"{'ID':>5} {'Type':<10} {'Chunks':>6} {'Title':<40} {'Tags'}"]
lines.append("-" * 80)
for d in docs:
tags = ", ".join(d.get("tags", []))
title = d["title"][:40]
lines.append(f"{d['id']:>5} {d['type']:<10} {d['chunk_count']:>6} {title:<40} {tags}")
return "\n".join(lines)
def format_tags(tags: list[dict], fmt: str = "json") -> str:
"""Format tag list."""
if fmt == "json":
return json.dumps(tags, indent=2, ensure_ascii=False)
if not tags:
return "No tags. Use `kb add --tags` or `kb tag` to add tags."
lines = [f"{'Tag':<30} {'Documents':>10}"]
lines.append("-" * 42)
for t in tags:
lines.append(f"{t['name']:<30} {t['count']:>10}")
return "\n".join(lines)
def format_doc_info(info: dict, fmt: str = "json") -> str:
"""Format document info."""
if fmt == "json":
return json.dumps(info, indent=2, ensure_ascii=False)
lines = []
lines.append(f"Document #{info['id']}: {info['title']}")
lines.append(f" Type: {info['type']}")
if info.get("language"):
lines.append(f" Language: {info['language']}")
if info.get("path"):
lines.append(f" Path: {info['path']}")
lines.append(f" Hash: {info['content_hash']}")
lines.append(f" Created: {info['created_at']}")
if info.get("tags"):
lines.append(f" Tags: {', '.join(info['tags'])}")
lines.append(f" Chunks: {info['chunk_count']}")
lines.append("")
for chunk in info.get("chunks", []):
preview = chunk["text"][:100].replace("\n", " ").strip()
if len(chunk["text"]) > 100:
preview += "..."
lines.append(f" [{chunk['chunk_index']}] {preview}")
return "\n".join(lines)
def format_status(status: dict, fmt: str = "json") -> str:
"""Format status output."""
if fmt == "json":
return json.dumps(status, indent=2, ensure_ascii=False)
lines = []
lines.append("Knowledge Base Status")
lines.append("=" * 40)
lines.append(f" Model: {status['model_name']}")
lines.append(f" Embedding dim: {status['embedding_dim']}")
lines.append(f" Schema version: {status['schema_version']}")
lines.append(f" DB size: {_human_size(status['db_size_bytes'])}")
lines.append("")
lines.append(" Documents:")
for dtype, count in status.get("documents", {}).items():
lines.append(f" {dtype:<12} {count:>5}")
lines.append(f" {'total':<12} {status['total_documents']:>5}")
lines.append(f" Total chunks: {status['total_chunks']}")
return "\n".join(lines)
def _human_size(size_bytes: int) -> str:
"""Format bytes as human-readable."""
for unit in ("B", "KB", "MB", "GB"):
if size_bytes < 1024:
return f"{size_bytes:.1f} {unit}"
size_bytes /= 1024
return f"{size_bytes:.1f} TB"
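The size formatting at the end of this file can be tried on its own. The sketch below follows the same loop under the hypothetical name `human_size`; it copies the value into a local float instead of reassigning the parameter, which is a stylistic choice rather than something the source does.

```python
def human_size(size_bytes: int) -> str:
    """Format a byte count with binary (1024-based) units.

    Hypothetical stand-alone version of the helper shown above.
    """
    size = float(size_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    # Anything that survives four divisions is reported in terabytes.
    return f"{size:.1f} TB"


examples = [human_size(512), human_size(1536), human_size(2 * 1024**3)]
```

Note that plain byte counts are also rendered with one decimal place, which is how the loop above behaves as written.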
+261
View File
@@ -0,0 +1,261 @@
"""Hybrid search — FTS5 + vector with Reciprocal Rank Fusion."""
import json
import re
import struct
import sqlite3
def hybrid_search(conn: sqlite3.Connection, query: str, model_name: str, cfg: dict,
top: int = 10, tags: list[str] | None = None,
doc_type: str | None = None, fts_only: bool = False,
vec_only: bool = False, threshold: float | None = None) -> dict:
"""Run hybrid search and return merged results."""
candidate_count = top * 3 # Fetch more candidates for RRF
fts_results = {}
vec_results = {}
if not vec_only:
fts_results = _fts_search(conn, query, candidate_count, tags, doc_type)
if not fts_only:
vec_results = _vector_search(conn, query, model_name, cfg, candidate_count, tags, doc_type)
# Merge via RRF
rrf_k = cfg.get("search", {}).get("rrf_k", 60)
if fts_only:
merged = _single_source_results(fts_results, "fts")
elif vec_only:
merged = _single_source_results(vec_results, "vector")
else:
merged = _rrf_merge(fts_results, vec_results, rrf_k)
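# Apply threshold (compared against fused RRF scores in hybrid mode, raw BM25 or similarity scores in single-source modes)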
if threshold is not None:
merged = [r for r in merged if r["score"] >= threshold]
# Sort and limit
merged.sort(key=lambda x: x["score"], reverse=True)
total = len(merged)
merged = merged[:top]
# Enrich with document metadata
results = []
for r in merged:
chunk_id = r["chunk_id"]
row = conn.execute("""
SELECT c.id, c.text, c.chunk_index, c.metadata as chunk_meta,
d.id as doc_id, d.title, d.source_path, d.doc_type,
d.language, d.metadata as doc_meta
FROM chunks c
JOIN documents d ON c.document_id = d.id
WHERE c.id = ?
""", (chunk_id,)).fetchone()
if not row:
continue
chunk_meta = json.loads(row["chunk_meta"]) if row["chunk_meta"] else {}
# Get tags for this document
tag_rows = conn.execute("""
SELECT t.name FROM tags t
JOIN document_tags dt ON t.id = dt.tag_id
WHERE dt.document_id = ?
ORDER BY t.name
""", (row["doc_id"],)).fetchall()
# Count total chunks for this document
total_chunks = conn.execute(
"SELECT COUNT(*) FROM chunks WHERE document_id = ?", (row["doc_id"],)
).fetchone()[0]
results.append({
"chunk_id": row["id"],
"score": round(r["score"], 6),
"score_breakdown": r["score_breakdown"],
"text": row["text"],
"source": {
"document_id": row["doc_id"],
"title": row["title"],
"path": row["source_path"],
"type": row["doc_type"],
"page": chunk_meta.get("page"),
"section_header": chunk_meta.get("section_header"),
"chunk_index": row["chunk_index"],
"total_chunks": total_chunks,
"tags": [r["name"] for r in tag_rows],
},
})
return {
"query": query,
"results": results,
"total_matches": total,
"returned": len(results),
}
def _fts_search(conn: sqlite3.Connection, query: str, limit: int,
tags: list[str] | None, doc_type: str | None) -> dict[int, float]:
"""Run FTS5 search, return {chunk_id: bm25_score}."""
escaped = _escape_fts_query(query)
if not escaped.strip():
return {}
sql = """
SELECT f.rowid as chunk_id, bm25(chunks_fts) as score
FROM chunks_fts f
"""
joins = []
where = ["chunks_fts MATCH ?"]
params = [escaped]
if tags or doc_type:
joins.append("JOIN chunks c ON f.rowid = c.id")
joins.append("JOIN documents d ON c.document_id = d.id")
if doc_type:
where.append("d.doc_type = ?")
params.append(doc_type)
if tags:
for i, tag in enumerate(tags):
joins.append(f"JOIN document_tags dt{i} ON d.id = dt{i}.document_id")
joins.append(f"JOIN tags t{i} ON dt{i}.tag_id = t{i}.id")
where.append(f"t{i}.name = ?")
params.append(tag.strip().lower())
sql += " " + " ".join(joins)
sql += " WHERE " + " AND ".join(where)
sql += " ORDER BY score LIMIT ?"
params.append(limit)
rows = conn.execute(sql, params).fetchall()
# BM25 scores are negative (lower = better), normalise to positive
results = {}
for row in rows:
results[row["chunk_id"]] = -row["score"] # Negate so higher = better
return results
def _vector_search(conn: sqlite3.Connection, query: str, model_name: str,
cfg: dict, limit: int, tags: list[str] | None,
doc_type: str | None) -> dict[int, float]:
"""Run vector similarity search, return {chunk_id: similarity_score}."""
from kb_search.embeddings import embed_texts
prefix = cfg.get("embedding", {}).get("query_prefix", "")
query_emb = embed_texts(model_name, [query], prefix=prefix)[0]
blob = struct.pack(f"{len(query_emb)}f", *query_emb)
# sqlite-vec returns results ordered by distance (lower = more similar)
rows = conn.execute("""
SELECT chunk_id, distance
FROM chunks_vec
WHERE embedding MATCH ?
ORDER BY distance
LIMIT ?
""", (blob, limit)).fetchall()
results = {}
for row in rows:
# Convert distance to similarity (1 - distance; meaningful only if chunks_vec uses a cosine distance metric)
similarity = max(0, 1 - row["distance"])
chunk_id = row["chunk_id"]
# Apply filters post-hoc for vector search
if tags or doc_type:
check = conn.execute("""
SELECT 1 FROM chunks c
JOIN documents d ON c.document_id = d.id
WHERE c.id = ?
""" + (" AND d.doc_type = ?" if doc_type else ""),
(chunk_id,) + ((doc_type,) if doc_type else ())
).fetchone()
if not check:
continue
if tags:
# De-duplicate the requested tags so the count comparison below is exact
wanted = sorted({t.strip().lower() for t in tags})
tag_count = conn.execute("""
SELECT COUNT(DISTINCT t.name) FROM chunks c
JOIN documents d ON c.document_id = d.id
JOIN document_tags dt ON d.id = dt.document_id
JOIN tags t ON dt.tag_id = t.id
WHERE c.id = ? AND t.name IN ({})
""".format(",".join("?" * len(wanted))),
(chunk_id, *wanted)
).fetchone()[0]
if tag_count < len(wanted):
continue
results[chunk_id] = similarity
return results
def _rrf_merge(fts_results: dict[int, float], vec_results: dict[int, float],
k: int = 60) -> list[dict]:
"""Merge two result sets using Reciprocal Rank Fusion."""
# Rank each result set
fts_ranked = _rank_results(fts_results)
vec_ranked = _rank_results(vec_results)
all_ids = set(fts_ranked.keys()) | set(vec_ranked.keys())
merged = []
for chunk_id in all_ids:
fts_rank = fts_ranked.get(chunk_id)
vec_rank = vec_ranked.get(chunk_id)
score = 0
if fts_rank is not None:
score += 1 / (k + fts_rank)
if vec_rank is not None:
score += 1 / (k + vec_rank)
fts_score = round(1 / (k + fts_rank), 6) if fts_rank is not None else None
vec_score = round(1 / (k + vec_rank), 6) if vec_rank is not None else None
merged.append({
"chunk_id": chunk_id,
"score": score,
"score_breakdown": {"fts": fts_score, "vector": vec_score},
})
return merged
def _single_source_results(results: dict[int, float], source: str) -> list[dict]:
"""Convert single-source results to merged format."""
ranked = _rank_results(results)
merged = []
for chunk_id, rank in ranked.items():
score = results[chunk_id]
breakdown = {"fts": None, "vector": None}
breakdown[source] = round(score, 6)
merged.append({
"chunk_id": chunk_id,
"score": score,
"score_breakdown": breakdown,
})
return merged
def _rank_results(results: dict[int, float]) -> dict[int, int]:
"""Rank results by score (1-indexed, higher score = lower rank number)."""
sorted_ids = sorted(results.keys(), key=lambda x: results[x], reverse=True)
return {chunk_id: rank + 1 for rank, chunk_id in enumerate(sorted_ids)}
def _escape_fts_query(query: str) -> str:
"""Strip FTS5 special characters from a query (they are removed, not escaped)."""
# Remove FTS5 operators that could cause syntax errors
query = re.sub(r'["\(\)\*\:\^]', " ", query)
# Collapse multiple spaces
query = re.sub(r"\s+", " ", query).strip()
return query
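Reciprocal Rank Fusion, as used by the merge step in this file, only looks at each result's rank within its own list, so BM25 scores and vector similarities never have to share a scale. A minimal sketch with made-up chunk ids and scores (the names `rank_results` and `rrf_merge` here are illustrative stand-ins, and `k=60` mirrors the default read from the config above):

```python
def rank_results(scores: dict) -> dict:
    """Map each id to its 1-indexed rank, best score first."""
    ordered = sorted(scores, key=lambda cid: scores[cid], reverse=True)
    return {cid: pos + 1 for pos, cid in enumerate(ordered)}


def rrf_merge(fts: dict, vec: dict, k: int = 60) -> list:
    """Fuse two score maps; each list contributes 1 / (k + rank) per id."""
    fused = {}
    for ranks in (rank_results(fts), rank_results(vec)):
        for cid, rank in ranks.items():
            fused[cid] = fused.get(cid, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Chunk 2 appears in both lists, so it outranks chunks seen by one source only.
merged = rrf_merge({1: 5.0, 2: 3.0}, {2: 0.9, 3: 0.8})
order = [cid for cid, _ in merged]
```

With `k=60` the fused score of a chunk found by both sources cannot exceed `2/61` (about 0.033), which is worth keeping in mind when choosing a `threshold` for hybrid mode.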
+131
View File
@@ -0,0 +1,131 @@
"""Tests for configuration loading, merging, and ENV overrides."""
import os
from pathlib import Path
import pytest
import yaml
from kb_search.config import (
DEFAULTS,
_deep_merge,
_get_nested,
_set_nested,
config_with_sources,
load_config,
save_config_value,
)
def test_deep_merge_basic():
base = {"a": 1, "b": {"c": 2, "d": 3}}
override = {"b": {"c": 99}}
result = _deep_merge(base, override)
assert result == {"a": 1, "b": {"c": 99, "d": 3}}
def test_deep_merge_new_keys():
base = {"a": 1}
override = {"b": 2}
result = _deep_merge(base, override)
assert result == {"a": 1, "b": 2}
def test_deep_merge_does_not_mutate():
base = {"a": {"b": 1}}
override = {"a": {"b": 2}}
_deep_merge(base, override)
assert base["a"]["b"] == 1
def test_set_nested():
d = {}
_set_nested(d, "a.b.c", 42)
assert d == {"a": {"b": {"c": 42}}}
def test_get_nested():
d = {"a": {"b": {"c": 42}}}
assert _get_nested(d, "a.b.c") == 42
assert _get_nested(d, "a.b.x", "missing") == "missing"
assert _get_nested(d, "x.y.z") is None
def test_load_config_defaults(tmp_path):
"""With no config file, returns defaults."""
cfg = load_config(tmp_path / "nonexistent.yaml")
assert cfg["embedding"]["model"] == "all-MiniLM-L6-v2"
assert cfg["search"]["default_top"] == 10
assert cfg["chunking"]["pdf"]["strategy"] == "hierarchy"
def test_load_config_yaml_override(tmp_path):
"""YAML values override defaults."""
config_path = tmp_path / "config.yaml"
config_path.write_text(yaml.dump({"embedding": {"model": "nomic-embed-text"}}))
cfg = load_config(config_path)
assert cfg["embedding"]["model"] == "nomic-embed-text"
# Other defaults preserved
assert cfg["search"]["default_top"] == 10
def test_load_config_env_override(tmp_path, monkeypatch):
"""ENV overrides both YAML and defaults."""
config_path = tmp_path / "config.yaml"
config_path.write_text(yaml.dump({"search": {"default_top": 20}}))
monkeypatch.setenv("KB_DEFAULT_TOP", "50")
cfg = load_config(config_path)
assert cfg["search"]["default_top"] == 50
def test_load_config_env_model(tmp_path, monkeypatch):
monkeypatch.setenv("KB_MODEL", "bge-small-en-v1.5")
cfg = load_config(tmp_path / "nonexistent.yaml")
assert cfg["embedding"]["model"] == "bge-small-en-v1.5"
def test_save_config_value(tmp_path):
config_path = tmp_path / "config.yaml"
save_config_value(config_path, "chunking.pdf.max_tokens", "2048")
with open(config_path) as f:
data = yaml.safe_load(f)
assert data["chunking"]["pdf"]["max_tokens"] == 2048
def test_save_config_value_bool(tmp_path):
config_path = tmp_path / "config.yaml"
save_config_value(config_path, "chunking.code.include_context", "false")
with open(config_path) as f:
data = yaml.safe_load(f)
assert data["chunking"]["code"]["include_context"] is False
def test_save_config_preserves_existing(tmp_path):
config_path = tmp_path / "config.yaml"
config_path.write_text(yaml.dump({"embedding": {"model": "custom"}}))
save_config_value(config_path, "search.default_top", "20")
with open(config_path) as f:
data = yaml.safe_load(f)
assert data["embedding"]["model"] == "custom"
assert data["search"]["default_top"] == 20
def test_config_with_sources_defaults(tmp_path, monkeypatch):
monkeypatch.delenv("KB_MODEL", raising=False)  # isolate from the caller's environment
entries = config_with_sources(tmp_path / "nonexistent.yaml")
sources = {k: s for k, _, s in entries}
assert sources["embedding.model"] == "default"
def test_config_with_sources_yaml(tmp_path):
config_path = tmp_path / "config.yaml"
config_path.write_text(yaml.dump({"embedding": {"model": "custom"}}))
entries = config_with_sources(config_path)
sources = {k: s for k, _, s in entries}
assert sources["embedding.model"] == "config.yaml"
def test_config_with_sources_env(tmp_path, monkeypatch):
monkeypatch.setenv("KB_MODEL", "from-env")
entries = config_with_sources(tmp_path / "nonexistent.yaml")
sources = {k: s for k, _, s in entries}
assert sources["embedding.model"] == "env (KB_MODEL)"
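The merge behaviour these tests pin down (nested override, new keys, no mutation of the base) can be sketched as below. This is an assumed implementation under the hypothetical name `deep_merge`; the real `_deep_merge` in `kb_search.config` is not shown in this part of the diff and may differ.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where override wins, merging nested dicts key by key."""
    result = copy.deepcopy(base)  # never touch the caller's dict
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# Layering in precedence order: defaults, then YAML, then environment.
defaults = {"embedding": {"model": "all-MiniLM-L6-v2"}, "search": {"default_top": 10}}
from_yaml = {"search": {"default_top": 20}}
from_env = {"search": {"default_top": 50}}
cfg = deep_merge(deep_merge(defaults, from_yaml), from_env)
```

Applying the layers from lowest to highest precedence gives the "ENV overrides both YAML and defaults" behaviour that the tests expect.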
+206
View File
@@ -0,0 +1,206 @@
"""Tests for database schema, FTS triggers, and config helpers."""
import struct
from pathlib import Path
import pytest
from kb_search.database import (
SCHEMA_VERSION,
check_schema_version,
get_connection,
get_db_config,
get_or_create_tag,
hash_exists,
init_schema,
insert_chunk,
insert_document,
insert_embedding,
recreate_vec_table,
run_migrations,
set_db_config,
tag_document,
untag_document,
)
@pytest.fixture
def db(tmp_path):
"""Provide an initialised SQLite DB stored in a per-test temporary directory."""
db_path = tmp_path / "test.db"
conn = get_connection(db_path)
init_schema(conn, embedding_dim=384)
set_db_config(conn, "schema_version", str(SCHEMA_VERSION))
yield conn
conn.close()
def test_schema_creation(db):
tables = [r[0] for r in db.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()]
assert "documents" in tables
assert "chunks" in tables
assert "tags" in tables
assert "document_tags" in tables
assert "config" in tables
def test_fts_table_exists(db):
tables = [r[0] for r in db.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()]
assert "chunks_fts" in tables
def test_vec_table_exists(db):
tables = [r[0] for r in db.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()]
assert "chunks_vec" in tables
def test_config_get_set(db):
set_db_config(db, "test_key", "test_value")
assert get_db_config(db, "test_key") == "test_value"
def test_config_get_default(db):
assert get_db_config(db, "nonexistent", "fallback") == "fallback"
def test_config_upsert(db):
set_db_config(db, "key", "v1")
set_db_config(db, "key", "v2")
assert get_db_config(db, "key") == "v2"
def test_schema_version(db):
assert check_schema_version(db) == SCHEMA_VERSION
def test_insert_document(db):
doc_id = insert_document(db, "Test Doc", "/path/test.pdf", "abc123", "pdf")
db.commit()
row = db.execute("SELECT * FROM documents WHERE id = ?", (doc_id,)).fetchone()
assert row["title"] == "Test Doc"
assert row["doc_type"] == "pdf"
assert row["content_hash"] == "abc123"
def test_insert_chunk_with_fts_sync(db):
doc_id = insert_document(db, "Doc", None, "hash1", "note")
chunk_id = insert_chunk(db, doc_id, 0, "This is searchable text about Python programming")
db.commit()
# FTS should find it
rows = db.execute(
"SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH 'python'"
).fetchall()
assert len(rows) == 1
assert rows[0][0] == chunk_id
def test_fts_delete_trigger(db):
doc_id = insert_document(db, "Doc", None, "hash2", "note")
chunk_id = insert_chunk(db, doc_id, 0, "unique_keyword_xyz")
db.commit()
db.execute("DELETE FROM chunks WHERE id = ?", (chunk_id,))
db.commit()
rows = db.execute(
"SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH 'unique_keyword_xyz'"
).fetchall()
assert len(rows) == 0
def test_fts_update_trigger(db):
doc_id = insert_document(db, "Doc", None, "hash3", "note")
chunk_id = insert_chunk(db, doc_id, 0, "old_content_abc")
db.commit()
db.execute("UPDATE chunks SET text = 'new_content_def' WHERE id = ?", (chunk_id,))
db.commit()
old = db.execute("SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH 'old_content_abc'").fetchall()
new = db.execute("SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH 'new_content_def'").fetchall()
assert len(old) == 0
assert len(new) == 1
def test_insert_embedding(db):
doc_id = insert_document(db, "Doc", None, "hash4", "note")
chunk_id = insert_chunk(db, doc_id, 0, "text")
db.commit()
embedding = [0.1] * 384
insert_embedding(db, chunk_id, embedding)
db.commit()
row = db.execute("SELECT * FROM chunks_vec WHERE chunk_id = ?", (chunk_id,)).fetchone()
assert row is not None
def test_hash_exists(db):
assert not hash_exists(db, "newhash")
insert_document(db, "Doc", None, "newhash", "note")
db.commit()
assert hash_exists(db, "newhash")
def test_tag_management(db):
doc_id = insert_document(db, "Doc", None, "hash5", "pdf")
db.commit()
tag_document(db, doc_id, ["git", "admin"])
db.commit()
rows = db.execute(
"SELECT t.name FROM tags t JOIN document_tags dt ON t.id = dt.tag_id "
"WHERE dt.document_id = ? ORDER BY t.name",
(doc_id,),
).fetchall()
assert [r["name"] for r in rows] == ["admin", "git"]
def test_untag_document(db):
doc_id = insert_document(db, "Doc", None, "hash6", "pdf")
tag_document(db, doc_id, ["a", "b", "c"])
db.commit()
untag_document(db, doc_id, ["b"])
db.commit()
rows = db.execute(
"SELECT t.name FROM tags t JOIN document_tags dt ON t.id = dt.tag_id "
"WHERE dt.document_id = ? ORDER BY t.name",
(doc_id,),
).fetchall()
assert [r["name"] for r in rows] == ["a", "c"]
def test_tags_are_lowercase(db):
tag_id = get_or_create_tag(db, "MyTag")
db.commit()
row = db.execute("SELECT name FROM tags WHERE id = ?", (tag_id,)).fetchone()
assert row["name"] == "mytag"
def test_recreate_vec_table(db):
doc_id = insert_document(db, "Doc", None, "hash7", "note")
chunk_id = insert_chunk(db, doc_id, 0, "text")
insert_embedding(db, chunk_id, [0.1] * 384)
db.commit()
recreate_vec_table(db, 768)
# Old data gone, new dimension
rows = db.execute("SELECT * FROM chunks_vec").fetchall()
assert len(rows) == 0
def test_cascade_delete(db):
doc_id = insert_document(db, "Doc", None, "hash8", "pdf")
insert_chunk(db, doc_id, 0, "chunk text")
tag_document(db, doc_id, ["test"])
db.commit()
db.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
db.commit()
assert db.execute("SELECT COUNT(*) FROM chunks WHERE document_id = ?", (doc_id,)).fetchone()[0] == 0
assert db.execute("SELECT COUNT(*) FROM document_tags WHERE document_id = ?", (doc_id,)).fetchone()[0] == 0
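The cascade test above relies on SQLite foreign-key enforcement, which is a per-connection setting and is normally off unless enabled. The sketch below uses a simplified, hypothetical two-table schema (not the project's actual schema) to show the difference the pragma makes.

```python
import sqlite3


def chunks_left_after_delete(enforce: bool) -> int:
    """Delete a document and report how many of its chunks survive."""
    conn = sqlite3.connect(":memory:")
    # Foreign keys must be switched on per connection for ON DELETE CASCADE to fire.
    conn.execute(f"PRAGMA foreign_keys = {'ON' if enforce else 'OFF'}")
    conn.executescript("""
        CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
        CREATE TABLE chunks (
            id INTEGER PRIMARY KEY,
            document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
            text TEXT NOT NULL
        );
    """)
    doc_id = conn.execute("INSERT INTO documents (title) VALUES (?)", ("Doc",)).lastrowid
    conn.execute("INSERT INTO chunks (document_id, text) VALUES (?, ?)", (doc_id, "chunk text"))
    conn.commit()
    conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
    conn.commit()
    remaining = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
    conn.close()
    return remaining


with_cascade = chunks_left_after_delete(enforce=True)
without_cascade = chunks_left_after_delete(enforce=False)
```

If `get_connection` did not enable the pragma, orphaned chunks would remain after a document is removed, which is presumably why the test checks both `chunks` and `document_tags`.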
+50
View File
@@ -0,0 +1,50 @@
"""Tests for embedding model management."""
from unittest.mock import MagicMock, patch
import click
import pytest
from kb_search.embeddings import check_model_binding
@pytest.fixture
def mock_conn():
"""Mock DB connection with config values."""
def make_conn(config_values=None):
config_values = config_values or {}
conn = MagicMock()
def mock_execute(sql, params=None):
if "SELECT value FROM config" in sql and params:
key = params[0]
val = config_values.get(key)
row = MagicMock()
row.__getitem__ = lambda self, k: val
result = MagicMock()
result.fetchone.return_value = row if val else None
return result
return MagicMock()
conn.execute = mock_execute
return conn
return make_conn
def test_model_binding_match(mock_conn):
conn = mock_conn({"model_name": "all-MiniLM-L6-v2", "embedding_dim": "384"})
cfg = {"embedding": {"model": "all-MiniLM-L6-v2"}}
# Should not raise
check_model_binding(conn, cfg)
def test_model_binding_mismatch(mock_conn):
conn = mock_conn({"model_name": "all-MiniLM-L6-v2", "embedding_dim": "384"})
cfg = {"embedding": {"model": "nomic-embed-text"}}
with pytest.raises(click.ClickException, match="Model mismatch"):
check_model_binding(conn, cfg)
def test_model_binding_no_db_model(mock_conn):
conn = mock_conn({})
cfg = {"embedding": {"model": "anything"}}
# Should not raise when DB not yet initialised
check_model_binding(conn, cfg)
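The three cases tested above (match, mismatch, no model stored yet) reduce to a small guard. The sketch below is an assumption about the shape of that check, not the code in `kb_search.embeddings`: it takes the stored model name directly instead of a DB connection, and raises `ValueError` where the tests show the real function raising `click.ClickException`.

```python
from typing import Optional


def check_binding(stored_model: Optional[str], configured_model: str) -> None:
    """Refuse to mix embeddings produced by different models in one database."""
    if stored_model is None:
        # Fresh database: nothing is bound yet, so any model is acceptable.
        return
    if stored_model != configured_model:
        raise ValueError(
            f"Model mismatch: database was built with {stored_model!r}, "
            f"but the config requests {configured_model!r}"
        )


check_binding(None, "anything")                        # passes: not initialised
check_binding("all-MiniLM-L6-v2", "all-MiniLM-L6-v2")  # passes: same model
```

The guard matters because vectors from different models are not comparable, and often do not even share a dimension.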
+172
View File
@@ -0,0 +1,172 @@
"""Tests for code chunking — Python, Bash, Go."""
from kb_search.ingest.code import chunk_code, _chunk_python, _chunk_bash, _chunk_go, _fixed_chunk
CFG = {"chunking": {"code": {"strategy": "ast", "include_context": True, "max_tokens": 1024}}}
class TestPythonChunking:
def test_functions(self):
code = '''
def hello():
"""Say hello."""
print("hello")
def goodbye():
"""Say goodbye."""
print("bye")
'''
chunks = _chunk_python(code, include_context=True)
assert len(chunks) == 2
assert chunks[0]["metadata"]["symbol_name"] == "hello"
assert chunks[1]["metadata"]["symbol_name"] == "goodbye"
def test_class_with_methods(self):
code = '''
class MyClass:
"""A test class."""
def method_a(self):
pass
def method_b(self):
pass
'''
chunks = _chunk_python(code, include_context=True)
assert len(chunks) == 2
assert chunks[0]["metadata"]["symbol_name"] == "MyClass.method_a"
assert chunks[1]["metadata"]["symbol_name"] == "MyClass.method_b"
# Context should include class docstring
assert "A test class" in chunks[0]["text"]
def test_class_without_methods(self):
code = '''
class Config:
"""Configuration."""
DEBUG = True
PORT = 8080
'''
chunks = _chunk_python(code, include_context=True)
assert len(chunks) == 1
assert chunks[0]["metadata"]["symbol_name"] == "Config"
def test_syntax_error_returns_empty(self):
chunks = _chunk_python("def broken(:\n pass", include_context=True)
assert chunks == []
def test_no_context(self):
code = '''
class Foo:
"""Docstring."""
def bar(self):
pass
'''
chunks = _chunk_python(code, include_context=False)
assert len(chunks) == 1
assert "Docstring" not in chunks[0]["text"]
class TestBashChunking:
def test_function_keyword(self):
code = '''#!/bin/bash
function deploy() {
echo "deploying"
}
function rollback() {
echo "rolling back"
}
'''
chunks = _chunk_bash(code, include_context=True)
assert len(chunks) == 2
assert chunks[0]["metadata"]["symbol_name"] == "deploy"
assert chunks[1]["metadata"]["symbol_name"] == "rollback"
def test_shorthand_syntax(self):
code = '''
setup() {
echo "setup"
}
cleanup() {
echo "cleanup"
}
'''
chunks = _chunk_bash(code, include_context=True)
assert len(chunks) == 2
def test_no_functions(self):
code = "#!/bin/bash\necho hello\nexit 0"
chunks = _chunk_bash(code, include_context=True)
assert chunks == []
def test_with_preceding_comments(self):
code = '''
# Deploy to production
# Requires valid credentials
function deploy() {
echo "deploying"
}
'''
chunks = _chunk_bash(code, include_context=True)
assert len(chunks) == 1
assert "Deploy to production" in chunks[0]["text"]
class TestGoChunking:
def test_basic_funcs(self):
code = '''package main
func main() {
fmt.Println("hello")
}
func helper() string {
return "help"
}
'''
chunks = _chunk_go(code, include_context=True)
assert len(chunks) == 2
assert chunks[0]["metadata"]["symbol_name"] == "main"
assert chunks[1]["metadata"]["symbol_name"] == "helper"
def test_method_receiver(self):
code = '''
func (s *Server) Start() error {
return nil
}
func (s *Server) Stop() {
}
'''
chunks = _chunk_go(code, include_context=True)
assert len(chunks) == 2
assert chunks[0]["metadata"]["symbol_name"] == "Start"
def test_no_funcs(self):
code = "package main\n\nvar x = 1"
chunks = _chunk_go(code, include_context=True)
assert chunks == []
class TestFallback:
def test_unknown_language_uses_fixed(self):
code = "line1\nline2\nline3"
chunks = chunk_code(code, "ruby", CFG)
assert len(chunks) >= 1
def test_python_no_functions_uses_fixed(self):
code = "x = 1\ny = 2\nprint(x + y)"
chunks = chunk_code(code, "python", CFG)
assert len(chunks) >= 1
def test_fixed_strategy_config(self):
cfg = {"chunking": {"code": {"strategy": "fixed", "max_tokens": 10}}}
code = "\n".join(f"x_{i} = {i}" for i in range(50))
chunks = chunk_code(code, "python", cfg)
assert len(chunks) > 1
def test_empty_code(self):
chunks = chunk_code("", "python", CFG)
assert len(chunks) == 0
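For readers unfamiliar with AST-based chunking, the following sketch shows the core idea for top-level Python functions only, using the standard `ast` module. It is a simplified illustration under a hypothetical name; the real `_chunk_python` also handles classes, methods, and context, none of which is reproduced here.

```python
import ast


def chunk_top_level_functions(source: str) -> list:
    """Return one chunk per top-level function, or [] if the source does not parse."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return []
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append({
                # Exact source text of the function, taken from line/column offsets.
                "text": ast.get_source_segment(source, node),
                "metadata": {"symbol_name": node.name},
            })
    return chunks


code = 'def hello():\n    print("hello")\n\ndef goodbye():\n    print("bye")\n'
names = [c["metadata"]["symbol_name"] for c in chunk_top_level_functions(code)]
```

Returning an empty list on a syntax error lets a caller fall back to fixed-size chunking, which matches the fallback behaviour the tests above describe.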
+81
View File
@@ -0,0 +1,81 @@
"""Tests for file type detection and note creation (chunking and auto-titles)."""
from pathlib import Path
import pytest
from kb_search.ingest.detector import detect_type, is_supported
from kb_search.ingest.note import auto_title, chunk_note
class TestDetector:
def test_pdf(self, tmp_path):
assert detect_type(tmp_path / "doc.pdf") == ("pdf", None)
def test_markdown(self, tmp_path):
assert detect_type(tmp_path / "notes.md") == ("markdown", None)
def test_txt(self, tmp_path):
assert detect_type(tmp_path / "notes.txt") == ("markdown", None)
def test_python(self, tmp_path):
assert detect_type(tmp_path / "main.py") == ("code", "python")
def test_bash(self, tmp_path):
assert detect_type(tmp_path / "deploy.sh") == ("code", "bash")
def test_go(self, tmp_path):
assert detect_type(tmp_path / "main.go") == ("code", "go")
def test_unsupported(self, tmp_path):
with pytest.raises(ValueError, match="Unsupported"):
detect_type(tmp_path / "archive.zip")
def test_force_type(self, tmp_path):
assert detect_type(tmp_path / "data.txt", force_type="code", force_language="bash") == ("code", "bash")
def test_force_language_only(self, tmp_path):
doc_type, lang = detect_type(tmp_path / "script.py", force_language="go")
assert doc_type == "code"
assert lang == "go"
def test_is_supported(self, tmp_path):
assert is_supported(tmp_path / "test.pdf")
assert is_supported(tmp_path / "test.py")
assert not is_supported(tmp_path / "test.zip")
def test_case_insensitive(self, tmp_path):
assert detect_type(tmp_path / "DOC.PDF") == ("pdf", None)
def test_image_files(self, tmp_path):
assert detect_type(tmp_path / "scan.png") == ("pdf", None)
assert detect_type(tmp_path / "photo.jpg") == ("pdf", None)
def test_docx(self, tmp_path):
assert detect_type(tmp_path / "report.docx") == ("pdf", None)
class TestNote:
def test_chunk_note(self):
chunks = chunk_note("Hello world")
assert len(chunks) == 1
assert chunks[0]["text"] == "Hello world"
assert chunks[0]["chunk_index"] == 0
def test_auto_title_short(self):
assert auto_title("Short note") == "Short note"
def test_auto_title_long(self):
long_text = "This is a very long note that exceeds the maximum title length and should be truncated at a word boundary"
result = auto_title(long_text, max_len=50)
assert len(result) <= 53  # at most 50 chars + "..."
assert result.endswith("...")
def test_auto_title_multiline(self):
text = "First line\nSecond line\nThird line"
assert auto_title(text) == "First line"
def test_auto_title_no_space(self):
text = "a" * 100
result = auto_title(text, max_len=80)
assert result.endswith("...")
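The detector tests imply a lookup from file extension to a `(doc_type, language)` pair. A minimal sketch follows; the table is inferred from the assertions above, while the name `guess_type` and the handling of `force_language` (treating the file as code) are assumptions rather than the behaviour of the real `detect_type`.

```python
from pathlib import Path

# Inferred from the tests: images and .docx go through the same pipeline as PDFs.
EXTENSION_MAP = {
    ".pdf": ("pdf", None),
    ".docx": ("pdf", None),
    ".png": ("pdf", None),
    ".jpg": ("pdf", None),
    ".md": ("markdown", None),
    ".txt": ("markdown", None),
    ".py": ("code", "python"),
    ".sh": ("code", "bash"),
    ".go": ("code", "go"),
}


def guess_type(path, force_type=None, force_language=None):
    """Return (doc_type, language) for a path, honouring explicit overrides."""
    if force_type is not None:
        return force_type, force_language
    suffix = Path(path).suffix.lower()  # case-insensitive match
    if suffix not in EXTENSION_MAP:
        raise ValueError(f"Unsupported file type: {suffix or path}")
    if force_language is not None:
        # Assumption: naming a language implies the file is code.
        return "code", force_language
    return EXTENSION_MAP[suffix]


detected = guess_type("DOC.PDF")
overridden = guess_type("script.py", force_language="go")
```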
+33
View File
@@ -0,0 +1,33 @@
"""Tests for Docling ingestion (fixed-size chunking logic only; Docling itself is not invoked)."""
from kb_search.ingest.docling import _fixed_chunk_text
class TestFixedChunkText:
def test_short_text_single_chunk(self):
chunks = _fixed_chunk_text("Hello world", {})
assert len(chunks) == 1
assert chunks[0]["text"] == "Hello world"
assert chunks[0]["chunk_index"] == 0
def test_long_text_multiple_chunks(self):
text = "word " * 2000 # ~10000 chars
chunks = _fixed_chunk_text(text, {"max_tokens": 512, "overlap_tokens": 50})
assert len(chunks) > 1
# Chunk indices should be sequential
for i, c in enumerate(chunks):
assert c["chunk_index"] == i
def test_empty_text(self):
chunks = _fixed_chunk_text("", {})
assert len(chunks) == 0
def test_whitespace_only(self):
chunks = _fixed_chunk_text(" \n\n ", {})
assert len(chunks) == 0
def test_custom_max_tokens(self):
text = "a " * 500
chunks = _fixed_chunk_text(text, {"max_tokens": 100})
# 100 tokens * 4 chars = 400 chars window, 1000 chars total
assert len(chunks) > 1
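The comment in the last test suggests a heuristic of roughly 4 characters per token. Under that assumption, fixed-size chunking with overlap can be sketched as a sliding character window. The function name, the default values, and the exact stepping below are assumptions for illustration; `_fixed_chunk_text` itself is not shown in this part of the diff.

```python
def fixed_chunks(text: str, max_tokens: int = 512, overlap_tokens: int = 50,
                 chars_per_token: int = 4) -> list:
    """Split text into overlapping character windows sized from a token budget."""
    text = text.strip()
    if not text:
        return []
    window = max_tokens * chars_per_token
    # Each new window starts `overlap` characters before the previous one ended.
    step = max(1, window - overlap_tokens * chars_per_token)
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"text": text[start:start + window], "chunk_index": len(chunks)})
        if start + window >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = ("a " * 500).strip()                      # 999 characters
pieces = fixed_chunks(sample, max_tokens=100)      # 400-char windows, 200-char overlap
```

The overlap means the tail of one chunk reappears at the head of the next, so a sentence that straddles a boundary is still retrievable as a whole from at least one chunk.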
+121
View File
@@ -0,0 +1,121 @@
"""Tests for markdown header-based splitting."""
from kb_search.ingest.markdown import (
_fixed_chunk,
_has_headers,
_merge_small_sections,
_split_at_headers,
chunk_markdown,
)
def make_cfg(**overrides):
cfg = {"chunking": {"markdown": {"strategy": "header", "min_tokens": 50, "max_tokens": 1024}}}
cfg["chunking"]["markdown"].update(overrides)
return cfg
class TestHasHeaders:
def test_with_headers(self):
assert _has_headers("## Title\nContent")
def test_without_headers(self):
assert not _has_headers("Just plain text\nNo headers here")
def test_h3(self):
assert _has_headers("### Subsection\nStuff")
class TestSplitAtHeaders:
def test_basic_split(self):
text = "## Section 1\nContent one\n\n## Section 2\nContent two"
sections = _split_at_headers(text)
assert len(sections) == 2
assert sections[0]["header_chain"] == ["Section 1"]
assert "Content one" in sections[0]["content"]
assert sections[1]["header_chain"] == ["Section 2"]
def test_nested_headers(self):
text = "## Config\nIntro\n\n### Advanced Options\nDetails"
sections = _split_at_headers(text)
assert len(sections) == 2
# The ### should have full chain
assert sections[1]["header_chain"] == ["Config", "Advanced Options"]
def test_leading_content(self):
text = "Preamble text\n\n## First Section\nContent"
sections = _split_at_headers(text)
assert len(sections) == 2
assert sections[0]["header_chain"] == []
assert "Preamble" in sections[0]["content"]
def test_header_level_reset(self):
text = "## A\n\n### B\n\n## C\n\n### D"
sections = _split_at_headers(text)
assert sections[2]["header_chain"] == ["C"]
assert sections[3]["header_chain"] == ["C", "D"]
class TestMergeSmallSections:
def test_merge_tiny_into_next(self):
sections = [
{"header_chain": ["A"], "content": "tiny"},
{"header_chain": ["B"], "content": "This is a much longer section with plenty of words " * 5},
]
merged = _merge_small_sections(sections, min_tokens=10)
assert len(merged) == 1
assert "tiny" in merged[0]["content"]
def test_no_merge_when_large_enough(self):
sections = [
{"header_chain": ["A"], "content": "word " * 100},
{"header_chain": ["B"], "content": "word " * 100},
]
merged = _merge_small_sections(sections, min_tokens=10)
assert len(merged) == 2
class TestChunkMarkdown:
def test_header_strategy(self):
text = "## Intro\nSome intro text with enough words to avoid merging. " * 5
text += "\n\n## Details\nDetailed content follows here with sufficient length. " * 5
cfg = make_cfg(min_tokens=5)
chunks = chunk_markdown(text, cfg)
assert len(chunks) >= 2
# Verify chunk_index assigned
for i, c in enumerate(chunks):
assert c["chunk_index"] == i
def test_hierarchy_context(self):
text = "## Config\nIntro\n\n### Advanced\n" + "Details " * 60
cfg = make_cfg(min_tokens=5)
chunks = chunk_markdown(text, cfg)
# Find the Advanced chunk
advanced = [c for c in chunks if "Advanced" in c["text"]]
assert len(advanced) > 0
assert "Config > Advanced" in advanced[0]["text"]
def test_plain_text_fallback(self):
text = "No headers here, just plain text. " * 200
cfg = make_cfg()
chunks = chunk_markdown(text, cfg)
assert len(chunks) >= 1
def test_empty_text(self):
chunks = chunk_markdown("", make_cfg())
assert len(chunks) == 0
class TestFixedChunk:
def test_basic(self):
text = "word " * 200
chunks = _fixed_chunk(text, {"max_tokens": 50, "overlap_tokens": 10})
assert len(chunks) > 1
def test_empty(self):
chunks = _fixed_chunk("", {})
assert len(chunks) == 0
def test_short_text(self):
chunks = _fixed_chunk("hello world", {"max_tokens": 512})
assert len(chunks) == 1
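The `header_chain` behaviour tested above (nested headers inherit their parents, and a new `##` resets the chain) can be implemented with a stack of open headers. The sketch below is one possible approach under a hypothetical name, not the code in `kb_search.ingest.markdown`; it does not try to skip `#` lines inside fenced code blocks.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_at_headers(text: str) -> list:
    """Split markdown into sections, each tagged with its chain of parent headers."""
    sections = []
    stack = []   # open headers as (level, title), outermost first
    chain = []   # titles of the headers currently in effect
    lines = []   # body lines collected for the current section

    def flush():
        content = "\n".join(lines).strip()
        if chain or content:  # skip an empty preamble
            sections.append({"header_chain": list(chain), "content": content})

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match is None:
            lines.append(line)
            continue
        flush()
        level, title = len(match.group(1)), match.group(2)
        # A header closes every open header at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        chain = [t for _, t in stack]
        lines = []
    flush()
    return sections


chains = [s["header_chain"] for s in split_at_headers("## A\n\n### B\n\n## C\n\n### D")]
```

Keeping the full chain per section is what allows a chunk to be prefixed with context such as "Config > Advanced", as the hierarchy test above expects.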
+156
View File
@@ -0,0 +1,156 @@
"""Tests for document management commands via Click test runner."""
import json
import pytest
from click.testing import CliRunner
from kb_search.cli import main
from kb_search.database import (
SCHEMA_VERSION,
get_connection,
init_schema,
insert_chunk,
insert_document,
insert_embedding,
set_db_config,
tag_document,
)


@pytest.fixture
def kb_env(tmp_path, monkeypatch):
    """Set up a test KB environment."""
    data_dir = tmp_path / ".kb"
    data_dir.mkdir()
    db_path = data_dir / "kb.db"

    conn = get_connection(db_path)
    init_schema(conn, 384)
    set_db_config(conn, "schema_version", str(SCHEMA_VERSION))
    set_db_config(conn, "model_name", "all-MiniLM-L6-v2")
    set_db_config(conn, "embedding_dim", "384")

    # Add a test document
    doc_id = insert_document(conn, "Test Doc", "/tmp/test.pdf", "abc123", "pdf")
    insert_chunk(conn, doc_id, 0, "This is chunk zero about Python")
    insert_chunk(conn, doc_id, 1, "This is chunk one about testing")
    tag_document(conn, doc_id, ["test", "pdf"])
    conn.commit()
    conn.close()

    monkeypatch.setenv("KB_DATA_DIR", str(data_dir))
    return data_dir


class TestList:
    def test_json_output(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["list", "--format", "json"])
        assert result.exit_code == 0
        data = json.loads(result.output)
        assert len(data) == 1
        assert data[0]["title"] == "Test Doc"
        assert data[0]["type"] == "pdf"

    def test_human_output(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["list", "--format", "human"])
        assert result.exit_code == 0
        assert "Test Doc" in result.output

    def test_filter_type(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["list", "--type", "markdown", "--format", "json"])
        assert result.exit_code == 0
        data = json.loads(result.output)
        assert len(data) == 0

    def test_filter_tags(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["list", "--tags", "test", "--format", "json"])
        assert result.exit_code == 0
        data = json.loads(result.output)
        assert len(data) == 1


class TestInfo:
    def test_json_output(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["info", "1", "--format", "json"])
        assert result.exit_code == 0
        data = json.loads(result.output)
        assert data["title"] == "Test Doc"
        assert data["chunk_count"] == 2
        assert "test" in data["tags"]

    def test_not_found(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["info", "999"])
        assert result.exit_code != 0
        assert "not found" in result.output.lower()


class TestRemove:
    def test_remove_with_yes(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["remove", "1", "--yes"])
        assert result.exit_code == 0
        assert "Removed" in result.output
        # Verify gone
        result = runner.invoke(main, ["list", "--format", "json"])
        data = json.loads(result.output)
        assert len(data) == 0

    def test_remove_not_found(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["remove", "999", "--yes"])
        assert result.exit_code != 0


class TestTags:
    def test_list_tags(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["tags", "--format", "json"])
        assert result.exit_code == 0
        data = json.loads(result.output)
        names = [t["name"] for t in data]
        assert "test" in names
        assert "pdf" in names

    def test_add_tag(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["tag", "1", "--add", "new"])
        assert result.exit_code == 0
        assert "Added" in result.output

    def test_remove_tag(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["tag", "1", "--remove", "test"])
        assert result.exit_code == 0
        assert "Removed" in result.output


class TestStatus:
    def test_json_output(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["status", "--format", "json"])
        assert result.exit_code == 0
        data = json.loads(result.output)
        assert data["model_name"] == "all-MiniLM-L6-v2"
        assert data["total_documents"] == 1
        assert data["total_chunks"] == 2

    def test_human_output(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["status", "--format", "human"])
        assert result.exit_code == 0
        assert "all-MiniLM-L6-v2" in result.output

    def test_not_initialised(self, tmp_path, monkeypatch):
        monkeypatch.setenv("KB_DATA_DIR", str(tmp_path / "nonexistent"))
        runner = CliRunner()
        result = runner.invoke(main, ["status"])
        assert result.exit_code != 0
        assert "not initialised" in result.output.lower()
+120
@@ -0,0 +1,120 @@
"""Tests for output formatters."""

import json

from kb_search.output import (
    _human_size,
    format_doc_info,
    format_document_list,
    format_search_results,
    format_status,
    format_tags,
)

SAMPLE_SEARCH = {
    "query": "install git",
    "results": [
        {
            "chunk_id": 1,
            "score": 0.031,
            "score_breakdown": {"fts": 0.016, "vector": 0.015},
            "text": "To install git from source...",
            "source": {
                "document_id": 42,
                "title": "Git Admin Guide",
                "path": "/docs/git.pdf",
                "type": "pdf",
                "page": 12,
                "section_header": None,
                "chunk_index": 3,
                "total_chunks": 28,
                "tags": ["git", "admin"],
            },
        }
    ],
    "total_matches": 47,
    "returned": 1,
}


class TestSearchOutput:
    def test_json_format(self):
        output = format_search_results(SAMPLE_SEARCH, "json")
        parsed = json.loads(output)
        assert parsed["query"] == "install git"
        assert len(parsed["results"]) == 1
        assert parsed["results"][0]["chunk_id"] == 1
        assert "fts" in parsed["results"][0]["score_breakdown"]
        assert "vector" in parsed["results"][0]["score_breakdown"]

    def test_json_schema_fields(self):
        output = format_search_results(SAMPLE_SEARCH, "json")
        parsed = json.loads(output)
        r = parsed["results"][0]
        assert "chunk_id" in r
        assert "score" in r
        assert "text" in r
        assert "source" in r
        src = r["source"]
        assert "document_id" in src
        assert "title" in src
        assert "type" in src
        assert "tags" in src

    def test_human_format(self):
        output = format_search_results(SAMPLE_SEARCH, "human")
        assert "install git" in output
        assert "Git Admin Guide" in output
        assert "p.12" in output
        assert "0.031" in output


class TestDocList:
    def test_json(self):
        docs = [{"id": 1, "title": "Test", "type": "pdf", "tags": ["a"], "chunk_count": 5, "created_at": "2024-01-01"}]
        parsed = json.loads(format_document_list(docs, "json"))
        assert len(parsed) == 1

    def test_human_empty(self):
        assert "No documents" in format_document_list([], "human")

    def test_human(self):
        docs = [{"id": 1, "title": "Test", "type": "pdf", "tags": ["a"], "chunk_count": 5}]
        output = format_document_list(docs, "human")
        assert "Test" in output


class TestTags:
    def test_json(self):
        tags = [{"name": "git", "count": 15}]
        parsed = json.loads(format_tags(tags, "json"))
        assert parsed[0]["name"] == "git"

    def test_human_empty(self):
        assert "No tags" in format_tags([], "human")


class TestStatus:
    def test_json(self):
        status = {"model_name": "test", "embedding_dim": 384, "schema_version": 1,
                  "db_size_bytes": 1024, "documents": {"pdf": 5}, "total_documents": 5, "total_chunks": 50}
        parsed = json.loads(format_status(status, "json"))
        assert parsed["model_name"] == "test"

    def test_human(self):
        status = {"model_name": "test", "embedding_dim": 384, "schema_version": 1,
                  "db_size_bytes": 1024000, "documents": {"pdf": 5}, "total_documents": 5, "total_chunks": 50}
        output = format_status(status, "human")
        assert "test" in output
        assert "384" in output


class TestHumanSize:
    def test_bytes(self):
        assert _human_size(512) == "512.0 B"

    def test_kb(self):
        assert _human_size(2048) == "2.0 KB"

    def test_mb(self):
        assert _human_size(5 * 1024 * 1024) == "5.0 MB"
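
The three `TestHumanSize` cases imply a formatter that divides by 1024 per unit and always prints one decimal place. One way to get exactly those strings is sketched below. The name `human_size_sketch` and the unit list beyond MB are assumptions; the real `_human_size` in `kb_search.output` may be written differently.

```python
def human_size_sketch(num_bytes):
    """Format a byte count with one decimal place and a binary unit.

    Hypothetical stand-in for _human_size; units above MB are assumed.
    """
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024  # move up to the next unit
    return f"{size:.1f} TB"
```

Converting to `float` first is what makes a plain byte count render as `512.0 B` rather than `512 B`, which matches the first assertion above.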
+91
@@ -0,0 +1,91 @@
"""Tests for hybrid search, RRF merging, and filtering."""

import pytest

from kb_search.search import (
    _escape_fts_query,
    _rank_results,
    _rrf_merge,
    _single_source_results,
)


class TestEscapeFtsQuery:
    def test_plain_query(self):
        assert _escape_fts_query("install git") == "install git"

    def test_special_chars(self):
        result = _escape_fts_query('install "git" (latest)')
        assert '"' not in result
        assert "(" not in result
        assert ")" not in result

    def test_collapses_spaces(self):
        # Repeated and surrounding spaces should collapse to single spaces
        assert _escape_fts_query("  too   many   spaces  ") == "too many spaces"

    def test_empty(self):
        assert _escape_fts_query("") == ""
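
The `TestEscapeFtsQuery` cases describe a sanitiser that leaves plain words alone, drops quotes and parentheses, and normalises whitespace. A rough sketch that meets those four assertions is shown below. The name `escape_fts_query_sketch` is hypothetical, and stripping every non-word character is an assumption; the real `_escape_fts_query` may keep some punctuation or quote terms instead.

```python
import re


def escape_fts_query_sketch(query):
    """Remove punctuation that full-text query syntax could misread.

    Hypothetical stand-in for _escape_fts_query. Assumption: anything
    that is not a word character or whitespace is replaced by a space.
    """
    cleaned = re.sub(r"[^\w\s]", " ", query)
    # split() with no arguments drops leading, trailing and repeated whitespace.
    return " ".join(cleaned.split())
```

Replacing punctuation with a space rather than deleting it keeps adjacent words from being fused together, for example in `foo(bar)`.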


class TestRankResults:
    def test_basic_ranking(self):
        results = {1: 0.9, 2: 0.5, 3: 0.7}
        ranked = _rank_results(results)
        assert ranked[1] == 1  # highest score = rank 1
        assert ranked[3] == 2
        assert ranked[2] == 3

    def test_empty(self):
        assert _rank_results({}) == {}


class TestRRFMerge:
    def test_basic_merge(self):
        fts = {1: 0.9, 2: 0.5}
        vec = {1: 0.8, 3: 0.7}
        merged = _rrf_merge(fts, vec, k=60)
        scores = {r["chunk_id"]: r["score"] for r in merged}
        # Chunk 1 appears in both sources, so it should have the highest score
        assert scores[1] > scores[2]
        assert scores[1] > scores[3]

    def test_no_overlap(self):
        fts = {1: 0.9}
        vec = {2: 0.8}
        merged = _rrf_merge(fts, vec, k=60)
        assert len(merged) == 2

    def test_score_breakdown(self):
        fts = {1: 0.9}
        vec = {1: 0.8}
        merged = _rrf_merge(fts, vec, k=60)
        assert len(merged) == 1
        assert merged[0]["score_breakdown"]["fts"] is not None
        assert merged[0]["score_breakdown"]["vector"] is not None

    def test_single_source_fts(self):
        fts = {1: 0.9, 2: 0.5}
        merged = _rrf_merge(fts, {}, k=60)
        for r in merged:
            assert r["score_breakdown"]["vector"] is None
            assert r["score_breakdown"]["fts"] is not None

    def test_empty_both(self):
        merged = _rrf_merge({}, {}, k=60)
        assert merged == []


class TestSingleSourceResults:
    def test_fts_only(self):
        results = _single_source_results({1: 0.9, 2: 0.5}, "fts")
        assert len(results) == 2
        for r in results:
            assert r["score_breakdown"]["vector"] is None
            assert r["score_breakdown"]["fts"] is not None

    def test_vec_only(self):
        results = _single_source_results({1: 0.8}, "vector")
        assert len(results) == 1
        assert results[0]["score_breakdown"]["fts"] is None
        assert results[0]["score_breakdown"]["vector"] is not None
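
The ranking and merge tests above only check relative ordering and the shape of `score_breakdown`. For readers unfamiliar with reciprocal rank fusion (RRF), the sketch below shows the standard formulation, in which each source contributes `1 / (k + rank)` and the contributions are summed. The function names `rank_results_sketch` and `rrf_merge_sketch` are hypothetical, and whether `kb_search.search` uses exactly this formula is an assumption; only the dict keys `chunk_id`, `score`, and `score_breakdown` are taken from the tests.

```python
def rank_results_sketch(scores):
    """Map each chunk id to its 1-based rank, highest score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {chunk_id: rank for rank, chunk_id in enumerate(ordered, start=1)}


def rrf_merge_sketch(fts, vec, k=60):
    """Fuse two {chunk_id: score} dicts with reciprocal rank fusion.

    Assumption: each source contributes 1 / (k + rank); a chunk missing
    from a source gets None in the breakdown and contributes nothing.
    """
    fts_ranks = rank_results_sketch(fts)
    vec_ranks = rank_results_sketch(vec)
    merged = []
    for chunk_id in sorted(set(fts_ranks) | set(vec_ranks)):
        fts_part = 1.0 / (k + fts_ranks[chunk_id]) if chunk_id in fts_ranks else None
        vec_part = 1.0 / (k + vec_ranks[chunk_id]) if chunk_id in vec_ranks else None
        merged.append({
            "chunk_id": chunk_id,
            "score": (fts_part or 0.0) + (vec_part or 0.0),
            "score_breakdown": {"fts": fts_part, "vector": vec_part},
        })
    # Highest fused score first.
    merged.sort(key=lambda r: r["score"], reverse=True)
    return merged
```

Because RRF works on ranks rather than raw scores, the full-text and vector scores never need to be on a comparable scale, which is the usual reason for choosing it in hybrid search.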