Initial MVP
@@ -0,0 +1,152 @@
---
name: "OPSX: Apply"
description: Implement tasks from an OpenSpec change (Experimental)
category: Workflow
tags: [workflow, artifacts, experimental]
---

Implement tasks from an OpenSpec change.

**Input**: Optionally specify a change name (e.g., `/opsx:apply add-auth`). If omitted, check if it can be inferred from conversation context. If vague or ambiguous you MUST prompt for available changes.

**Steps**

1. **Select the change**

If a name is provided, use it. Otherwise:
- Infer from conversation context if the user mentioned a change
- Auto-select if only one active change exists
- If ambiguous, run `openspec list --json` to get available changes and use the **AskUserQuestion tool** to let the user select

Always announce: "Using change: <name>" and how to override (e.g., `/opsx:apply <other>`).

2. **Check status to understand the schema**
```bash
openspec status --change "<name>" --json
```
Parse the JSON to understand:
- `schemaName`: The workflow being used (e.g., "spec-driven")
- Which artifact contains the tasks (typically "tasks" for spec-driven, check status for others)

3. **Get apply instructions**

```bash
openspec instructions apply --change "<name>" --json
```

This returns:
- Context file paths (varies by schema)
- Progress (total, complete, remaining)
- Task list with status
- Dynamic instruction based on current state

**Handle states:**
- If `state: "blocked"` (missing artifacts): show message, suggest using `/opsx:continue`
- If `state: "all_done"`: congratulate, suggest archive
- Otherwise: proceed to implementation

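The state handling above can be sketched as a small dispatch. This is a minimal illustration, assuming the apply instructions JSON has already been parsed into a dict with a top-level `state` key as described above; `next_action` and the strings it returns are hypothetical names, not part of the OpenSpec CLI.

```python
def next_action(apply_info: dict) -> str:
    """Map the `state` field of the parsed apply instructions to a next step.

    `apply_info` is assumed to be the parsed JSON output of
    `openspec instructions apply --change "<name>" --json`.
    """
    state = apply_info.get("state")
    if state == "blocked":
        # Missing artifacts: show the message and point at /opsx:continue.
        return "suggest /opsx:continue"
    if state == "all_done":
        # Nothing left to implement: congratulate and suggest archiving.
        return "suggest archive"
    # Any other state means there is work to do.
    return "proceed to implementation"
```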
4. **Read context files**

Read the files listed in `contextFiles` from the apply instructions output.
The files depend on the schema being used:
- **spec-driven**: proposal, specs, design, tasks
- Other schemas: follow the contextFiles from CLI output

5. **Show current progress**

Display:
- Schema being used
- Progress: "N/M tasks complete"
- Remaining tasks overview
- Dynamic instruction from CLI

6. **Implement tasks (loop until done or blocked)**

For each pending task:
- Show which task is being worked on
- Make the code changes required
- Keep changes minimal and focused
- Mark task complete in the tasks file: `- [ ]` → `- [x]`
- Continue to next task

**Pause if:**
- Task is unclear → ask for clarification
- Implementation reveals a design issue → suggest updating artifacts
- Error or blocker encountered → report and wait for guidance
- User interrupts

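Marking a task complete (`- [ ]` → `- [x]`) is a one-line text edit in the tasks file. The sketch below shows one way to do it on the file's text; `mark_task_done` is a hypothetical helper, and matching a task by its exact text is an assumption made for illustration.

```python
def mark_task_done(tasks_md: str, task_text: str) -> str:
    """Flip the first unchecked checkbox whose text equals `task_text`.

    Only the first matching line is changed; everything else is kept as-is.
    Note that `splitlines()` drops a trailing newline, if any.
    """
    lines = tasks_md.splitlines()
    for i, line in enumerate(lines):
        stripped = line.lstrip()
        if stripped.startswith("- [ ] ") and stripped[6:].strip() == task_text:
            # Replace only the checkbox, preserving indentation and text.
            lines[i] = line.replace("- [ ]", "- [x]", 1)
            break
    return "\n".join(lines)
```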
7. **On completion or pause, show status**

Display:
- Tasks completed this session
- Overall progress: "N/M tasks complete"
- If all done: suggest archive
- If paused: explain why and wait for guidance

**Output During Implementation**

```
## Implementing: <change-name> (schema: <schema-name>)

Working on task 3/7: <task description>
[...implementation happening...]
✓ Task complete

Working on task 4/7: <task description>
[...implementation happening...]
✓ Task complete
```

**Output On Completion**

```
## Implementation Complete

**Change:** <change-name>
**Schema:** <schema-name>
**Progress:** 7/7 tasks complete ✓

### Completed This Session
- [x] Task 1
- [x] Task 2
...

All tasks complete! You can archive this change with `/opsx:archive`.
```

**Output On Pause (Issue Encountered)**

```
## Implementation Paused

**Change:** <change-name>
**Schema:** <schema-name>
**Progress:** 4/7 tasks complete

### Issue Encountered
<description of the issue>

**Options:**
1. <option 1>
2. <option 2>
3. Other approach

What would you like to do?
```

**Guardrails**
- Keep going through tasks until done or blocked
- Always read context files before starting (from the apply instructions output)
- If task is ambiguous, pause and ask before implementing
- If implementation reveals issues, pause and suggest artifact updates
- Keep code changes minimal and scoped to each task
- Update task checkbox immediately after completing each task
- Pause on errors, blockers, or unclear requirements - don't guess
- Use contextFiles from CLI output, don't assume specific file names

**Fluid Workflow Integration**

This skill supports the "actions on a change" model:

- **Can be invoked anytime**: Before all artifacts are done (if tasks exist), after partial implementation, interleaved with other actions
- **Allows artifact updates**: If implementation reveals design issues, suggest updating artifacts - not phase-locked, work fluidly
@@ -0,0 +1,157 @@
---
name: "OPSX: Archive"
description: Archive a completed change in the experimental workflow
category: Workflow
tags: [workflow, archive, experimental]
---

Archive a completed change in the experimental workflow.

**Input**: Optionally specify a change name after `/opsx:archive` (e.g., `/opsx:archive add-auth`). If omitted, check if it can be inferred from conversation context. If vague or ambiguous you MUST prompt for available changes.

**Steps**

1. **If no change name provided, prompt for selection**

Run `openspec list --json` to get available changes. Use the **AskUserQuestion tool** to let the user select.

Show only active changes (not already archived).
Include the schema used for each change if available.

**IMPORTANT**: Do NOT guess or auto-select a change. Always let the user choose.

2. **Check artifact completion status**

Run `openspec status --change "<name>" --json` to check artifact completion.

Parse the JSON to understand:
- `schemaName`: The workflow being used
- `artifacts`: List of artifacts with their status (`done` or other)

**If any artifacts are not `done`:**
- Display warning listing incomplete artifacts
- Prompt user for confirmation to continue
- Proceed if user confirms

3. **Check task completion status**

Read the tasks file (typically `tasks.md`) to check for incomplete tasks.

Count tasks marked with `- [ ]` (incomplete) vs `- [x]` (complete).

**If incomplete tasks found:**
- Display warning showing count of incomplete tasks
- Prompt user for confirmation to continue
- Proceed if user confirms

**If no tasks file exists:** Proceed without task-related warning.

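The checkbox count in step 3 can be sketched with two regular expressions. This is an illustrative helper only (`count_tasks` is a hypothetical name); it assumes tasks are Markdown list items that begin with `- [ ]` or `- [x]`, optionally indented.

```python
import re

# Checkbox markers at the start of a line, allowing leading indentation.
UNCHECKED = re.compile(r"^[ \t]*- \[ \]", re.MULTILINE)
CHECKED = re.compile(r"^[ \t]*- \[x\]", re.MULTILINE)


def count_tasks(tasks_md: str) -> tuple:
    """Return (complete, incomplete) checkbox counts for a tasks file."""
    complete = len(CHECKED.findall(tasks_md))
    incomplete = len(UNCHECKED.findall(tasks_md))
    return complete, incomplete
```

A non-zero second value is the count to show in the incomplete-tasks warning.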
4. **Assess delta spec sync state**

Check for delta specs at `openspec/changes/<name>/specs/`. If none exist, proceed without sync prompt.

**If delta specs exist:**
- Compare each delta spec with its corresponding main spec at `openspec/specs/<capability>/spec.md`
- Determine what changes would be applied (adds, modifications, removals, renames)
- Show a combined summary before prompting

**Prompt options:**
- If changes needed: "Sync now (recommended)", "Archive without syncing"
- If already synced: "Archive now", "Sync anyway", "Cancel"

If the user chooses sync, use the Task tool (subagent_type: "general-purpose", prompt: "Use Skill tool to invoke openspec-sync-specs for change '<name>'. Delta spec analysis: <include the analyzed delta spec summary>"). Unless the user chose "Cancel", proceed to archive whether or not a sync was run.

5. **Perform the archive**

Create the archive directory if it doesn't exist:
```bash
mkdir -p openspec/changes/archive
```

Generate target name using current date: `YYYY-MM-DD-<change-name>`

**Check if target already exists:**
- If yes: Fail with error, suggest renaming existing archive or using different date
- If no: Move the change directory to archive

```bash
mv openspec/changes/<name> openspec/changes/archive/YYYY-MM-DD-<name>
```

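The archive step (create the directory, build the dated target name, refuse to overwrite, then move) can be sketched as follows. This is a minimal illustration, not how the command is actually implemented; `archive_change` and its `root` and `today` parameters are hypothetical, and only the directory layout comes from the steps above.

```python
import shutil
from datetime import date
from pathlib import Path


def archive_change(root: Path, name: str, today: date) -> Path:
    """Move openspec/changes/<name> to openspec/changes/archive/YYYY-MM-DD-<name>."""
    changes = root / "openspec" / "changes"
    archive_dir = changes / "archive"
    archive_dir.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`

    # date.isoformat() yields YYYY-MM-DD.
    target = archive_dir / f"{today.isoformat()}-{name}"
    if target.exists():
        # Fail rather than overwrite an existing archive.
        raise FileExistsError(f"Target archive directory already exists: {target}")

    shutil.move(str(changes / name), str(target))
    return target
```

Because the whole directory is moved, files inside it such as `.openspec.yaml` travel with it.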
6. **Display summary**

Show archive completion summary including:
- Change name
- Schema that was used
- Archive location
- Spec sync status (synced / sync skipped / no delta specs)
- Note about any warnings (incomplete artifacts/tasks)

**Output On Success**

```
## Archive Complete

**Change:** <change-name>
**Schema:** <schema-name>
**Archived to:** openspec/changes/archive/YYYY-MM-DD-<name>/
**Specs:** ✓ Synced to main specs

All artifacts complete. All tasks complete.
```

**Output On Success (No Delta Specs)**

```
## Archive Complete

**Change:** <change-name>
**Schema:** <schema-name>
**Archived to:** openspec/changes/archive/YYYY-MM-DD-<name>/
**Specs:** No delta specs

All artifacts complete. All tasks complete.
```

**Output On Success With Warnings**

```
## Archive Complete (with warnings)

**Change:** <change-name>
**Schema:** <schema-name>
**Archived to:** openspec/changes/archive/YYYY-MM-DD-<name>/
**Specs:** Sync skipped (user chose to skip)

**Warnings:**
- Archived with 2 incomplete artifacts
- Archived with 3 incomplete tasks
- Delta spec sync was skipped (user chose to skip)

Review the archive if this was not intentional.
```

**Output On Error (Archive Exists)**

```
## Archive Failed

**Change:** <change-name>
**Target:** openspec/changes/archive/YYYY-MM-DD-<name>/

Target archive directory already exists.

**Options:**
1. Rename the existing archive
2. Delete the existing archive if it's a duplicate
3. Wait until a different date to archive
```

**Guardrails**
- Always prompt for change selection if not provided
- Use artifact graph (openspec status --json) for completion checking
- Don't block archive on warnings - just inform and confirm
- Preserve .openspec.yaml when moving to archive (it moves with the directory)
- Show clear summary of what happened
- If sync is requested, use the Skill tool to invoke `openspec-sync-specs` (agent-driven)
- If delta specs exist, always run the sync assessment and show the combined summary before prompting
@@ -0,0 +1,114 @@
---
name: "OPSX: Continue"
description: Continue working on a change - create the next artifact (Experimental)
category: Workflow
tags: [workflow, artifacts, experimental]
---

Continue working on a change by creating the next artifact.

**Input**: Optionally specify a change name after `/opsx:continue` (e.g., `/opsx:continue add-auth`). If omitted, check if it can be inferred from conversation context. If vague or ambiguous you MUST prompt for available changes.

**Steps**

1. **If no change name provided, prompt for selection**

Run `openspec list --json` to get available changes sorted by most recently modified. Then use the **AskUserQuestion tool** to let the user select which change to work on.

Present the top 3-4 most recently modified changes as options, showing:
- Change name
- Schema (from `schema` field if present, otherwise "spec-driven")
- Status (e.g., "0/5 tasks", "complete", "no tasks")
- How recently it was modified (from `lastModified` field)

Mark the most recently modified change as "(Recommended)" since it's likely what the user wants to continue.

**IMPORTANT**: Do NOT guess or auto-select a change. Always let the user choose.

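Building the option list for the selection prompt might look like the sketch below. It is illustrative only: `recent_options` is a hypothetical helper, the `name` key is an assumption about the list output, and `lastModified` is assumed to hold ISO 8601 timestamps in one consistent format so that string comparison orders them correctly. The `schema` and `lastModified` field names and the "spec-driven" fallback come from the step above.

```python
def recent_options(changes: list, limit: int = 4) -> list:
    """Return up to `limit` options, newest first, tagging the newest one."""
    # Re-sort defensively; the CLI output is described as already sorted.
    ordered = sorted(changes, key=lambda c: c.get("lastModified", ""), reverse=True)
    options = []
    for index, change in enumerate(ordered[:limit]):
        label = change["name"]
        if index == 0:
            label += " (Recommended)"
        options.append({
            "label": label,
            # Fall back to "spec-driven" when no schema field is present.
            "schema": change.get("schema", "spec-driven"),
        })
    return options
```

The result is only a list of choices to present; the user still makes the selection.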
2. **Check current status**
```bash
openspec status --change "<name>" --json
```
Parse the JSON to understand current state. The response includes:
- `schemaName`: The workflow schema being used (e.g., "spec-driven")
- `artifacts`: Array of artifacts with their status ("done", "ready", "blocked")
- `isComplete`: Boolean indicating if all artifacts are complete

3. **Act based on status**:

---

**If all artifacts are complete (`isComplete: true`)**:
- Congratulate the user
- Show final status including the schema used
- Suggest: "All artifacts created! You can now implement this change with `/opsx:apply` or archive it with `/opsx:archive`."
- STOP

---

**If artifacts are ready to create** (status shows artifacts with `status: "ready"`):
- Pick the FIRST artifact with `status: "ready"` from the status output
- Get its instructions:
```bash
openspec instructions <artifact-id> --change "<name>" --json
```
- Parse the JSON. The key fields are:
  - `context`: Project background (constraints for you - do NOT include in output)
  - `rules`: Artifact-specific rules (constraints for you - do NOT include in output)
  - `template`: The structure to use for your output file
  - `instruction`: Schema-specific guidance
  - `outputPath`: Where to write the artifact
  - `dependencies`: Completed artifacts to read for context
- **Create the artifact file**:
  - Read any completed dependency files for context
  - Use `template` as the structure - fill in its sections
  - Apply `context` and `rules` as constraints when writing - but do NOT copy them into the file
  - Write to the output path specified in instructions
- Show what was created and what's now unlocked
- STOP after creating ONE artifact

---

**If no artifacts are ready (all blocked)**:
- This shouldn't happen with a valid schema
- Show status and suggest checking for issues

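The three branches of step 3 can be sketched as one decision function over the parsed status JSON. This is a minimal illustration: `decide_next_step` is a hypothetical name, and the `id` key on each artifact entry is an assumption (the source only names `status`, `artifacts`, and `isComplete`).

```python
from typing import Optional, Tuple


def decide_next_step(status: dict) -> Tuple[str, Optional[str]]:
    """Return (action, artifact_id) for the parsed `openspec status` JSON."""
    if status.get("isComplete"):
        # Everything is done: congratulate and stop.
        return ("complete", None)
    for artifact in status.get("artifacts", []):
        if artifact.get("status") == "ready":
            # Take the FIRST ready artifact, in the order the CLI lists them.
            return ("create", artifact.get("id"))
    # Nothing is ready and the change is not complete.
    return ("blocked", None)
```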
4. **After creating an artifact, show progress**
```bash
openspec status --change "<name>"
```

**Output**

After each invocation, show:
- Which artifact was created
- Schema workflow being used
- Current progress (N/M complete)
- What artifacts are now unlocked
- Prompt: "Run `/opsx:continue` to create the next artifact"

**Artifact Creation Guidelines**

The artifact types and their purpose depend on the schema. Use the `instruction` field from the instructions output to understand what to create.

Common artifact patterns:

**spec-driven schema** (proposal → specs → design → tasks):
- **proposal.md**: Ask user about the change if not clear. Fill in Why, What Changes, Capabilities, Impact.
  - The Capabilities section is critical - each capability listed will need a spec file.
- **specs/<capability>/spec.md**: Create one spec per capability listed in the proposal's Capabilities section (use the capability name, not the change name).
- **design.md**: Document technical decisions, architecture, and implementation approach.
- **tasks.md**: Break down implementation into checkboxed tasks.

For other schemas, follow the `instruction` field from the CLI output.

**Guardrails**
- Create ONE artifact per invocation
- Always read dependency artifacts before creating a new one
- Never skip artifacts or create out of order
- If context is unclear, ask the user before creating
- Verify the artifact file exists after writing before marking progress
- Use the schema's artifact sequence, don't assume specific artifact names
- **IMPORTANT**: `context` and `rules` are constraints for YOU, not content for the file
  - Do NOT copy `<context>`, `<rules>`, `<project_context>` blocks into the artifact
  - These guide what you write, but should never appear in the output
@@ -0,0 +1,173 @@
---
name: "OPSX: Explore"
description: "Enter explore mode - think through ideas, investigate problems, clarify requirements"
category: Workflow
tags: [workflow, explore, experimental, thinking]
---

Enter explore mode. Think deeply. Visualize freely. Follow the conversation wherever it goes.

**IMPORTANT: Explore mode is for thinking, not implementing.** You may read files, search code, and investigate the codebase, but you must NEVER write code or implement features. If the user asks you to implement something, remind them to exit explore mode first and create a change proposal. You MAY create OpenSpec artifacts (proposals, designs, specs) if the user asks - that's capturing thinking, not implementing.

**This is a stance, not a workflow.** There are no fixed steps, no required sequence, no mandatory outputs. You're a thinking partner helping the user explore.

**Input**: The argument after `/opsx:explore` is whatever the user wants to think about. Could be:
- A vague idea: "real-time collaboration"
- A specific problem: "the auth system is getting unwieldy"
- A change name: "add-dark-mode" (to explore in context of that change)
- A comparison: "postgres vs sqlite for this"
- Nothing (just enter explore mode)

---

## The Stance

- **Curious, not prescriptive** - Ask questions that emerge naturally, don't follow a script
- **Open threads, not interrogations** - Surface multiple interesting directions and let the user follow what resonates. Don't funnel them through a single path of questions.
- **Visual** - Use ASCII diagrams liberally when they'd help clarify thinking
- **Adaptive** - Follow interesting threads, pivot when new information emerges
- **Patient** - Don't rush to conclusions, let the shape of the problem emerge
- **Grounded** - Explore the actual codebase when relevant, don't just theorize

---

## What You Might Do

Depending on what the user brings, you might:

**Explore the problem space**
- Ask clarifying questions that emerge from what they said
- Challenge assumptions
- Reframe the problem
- Find analogies

**Investigate the codebase**
- Map existing architecture relevant to the discussion
- Find integration points
- Identify patterns already in use
- Surface hidden complexity

**Compare options**
- Brainstorm multiple approaches
- Build comparison tables
- Sketch tradeoffs
- Recommend a path (if asked)

**Visualize**
```
┌─────────────────────────────────────────┐
│     Use ASCII diagrams liberally        │
├─────────────────────────────────────────┤
│                                         │
│   ┌────────┐         ┌────────┐         │
│   │ State  │────────▶│ State  │         │
│   │   A    │         │   B    │         │
│   └────────┘         └────────┘         │
│                                         │
│   System diagrams, state machines,      │
│   data flows, architecture sketches,    │
│   dependency graphs, comparison tables  │
│                                         │
└─────────────────────────────────────────┘
```

**Surface risks and unknowns**
- Identify what could go wrong
- Find gaps in understanding
- Suggest spikes or investigations

---

## OpenSpec Awareness

You have full context of the OpenSpec system. Use it naturally, don't force it.

### Check for context

At the start, quickly check what exists:
```bash
openspec list --json
```

This tells you:
- If there are active changes
- Their names, schemas, and status
- What the user might be working on

If the user mentioned a specific change name, read its artifacts for context.

### When no change exists

Think freely. When insights crystallize, you might offer:

- "This feels solid enough to start a change. Want me to create a proposal?"
- Or keep exploring - no pressure to formalize

### When a change exists

If the user mentions a change or you detect one is relevant:

1. **Read existing artifacts for context**
- `openspec/changes/<name>/proposal.md`
- `openspec/changes/<name>/design.md`
- `openspec/changes/<name>/tasks.md`
- etc.

2. **Reference them naturally in conversation**
- "Your design mentions using Redis, but we just realized SQLite fits better..."
- "The proposal scopes this to premium users, but we're now thinking everyone..."

3. **Offer to capture when decisions are made**

| Insight Type | Where to Capture |
|--------------|------------------|
| New requirement discovered | `specs/<capability>/spec.md` |
| Requirement changed | `specs/<capability>/spec.md` |
| Design decision made | `design.md` |
| Scope changed | `proposal.md` |
| New work identified | `tasks.md` |
| Assumption invalidated | Relevant artifact |

Example offers:
- "That's a design decision. Capture it in design.md?"
- "This is a new requirement. Add it to specs?"
- "This changes scope. Update the proposal?"

4. **The user decides** - Offer and move on. Don't pressure. Don't auto-capture.

---

## What You Don't Have To Do

- Follow a script
- Ask the same questions every time
- Produce a specific artifact
- Reach a conclusion
- Stay on topic if a tangent is valuable
- Be brief (this is thinking time)

---

## Ending Discovery

There's no required ending. Discovery might:

- **Flow into a proposal**: "Ready to start? I can create a change proposal."
- **Result in artifact updates**: "Updated design.md with these decisions"
- **Just provide clarity**: User has what they need, moves on
- **Continue later**: "We can pick this up anytime"

When things crystallize, you might offer a summary - but it's optional. Sometimes the thinking IS the value.

---

## Guardrails

- **Don't implement** - Never write code or implement features. Creating OpenSpec artifacts is fine, writing application code is not.
- **Don't fake understanding** - If something is unclear, dig deeper
- **Don't rush** - Discovery is thinking time, not task time
- **Don't force structure** - Let patterns emerge naturally
- **Don't auto-capture** - Offer to save insights, don't just do it
- **Do visualize** - A good diagram is worth many paragraphs
- **Do explore the codebase** - Ground discussions in reality
- **Do question assumptions** - Including the user's and your own
@@ -0,0 +1,97 @@
---
name: "OPSX: Fast Forward"
description: Create a change and generate all artifacts needed for implementation in one go
category: Workflow
tags: [workflow, artifacts, experimental]
---

Fast-forward through artifact creation - generate everything needed to start implementation.

**Input**: The argument after `/opsx:ff` is the change name (kebab-case), OR a description of what the user wants to build.

**Steps**

1. **If no input provided, ask what they want to build**

Use the **AskUserQuestion tool** (open-ended, no preset options) to ask:
> "What change do you want to work on? Describe what you want to build or fix."

From their description, derive a kebab-case name (e.g., "add user authentication" → `add-user-auth`).

**IMPORTANT**: Do NOT proceed without understanding what the user wants to build.

2. **Create the change directory**
```bash
openspec new change "<name>"
```
This creates a scaffolded change at `openspec/changes/<name>/`.

3. **Get the artifact build order**
```bash
openspec status --change "<name>" --json
```
Parse the JSON to get:
- `applyRequires`: array of artifact IDs needed before implementation (e.g., `["tasks"]`)
- `artifacts`: list of all artifacts with their status and dependencies

4. **Create artifacts in sequence until apply-ready**

Use the **TodoWrite tool** to track progress through the artifacts.

Loop through artifacts in dependency order (artifacts with no pending dependencies first):

a. **For each artifact that is `ready` (dependencies satisfied)**:
- Get instructions:
```bash
openspec instructions <artifact-id> --change "<name>" --json
```
- The instructions JSON includes:
  - `context`: Project background (constraints for you - do NOT include in output)
  - `rules`: Artifact-specific rules (constraints for you - do NOT include in output)
  - `template`: The structure to use for your output file
  - `instruction`: Schema-specific guidance for this artifact type
  - `outputPath`: Where to write the artifact
  - `dependencies`: Completed artifacts to read for context
- Read any completed dependency files for context
- Create the artifact file using `template` as the structure
- Apply `context` and `rules` as constraints - but do NOT copy them into the file
- Show brief progress: "✓ Created <artifact-id>"

b. **Continue until all `applyRequires` artifacts are complete**
- After creating each artifact, re-run `openspec status --change "<name>" --json`
- Check if every artifact ID in `applyRequires` has `status: "done"` in the artifacts array
- Stop when all `applyRequires` artifacts are done

c. **If an artifact requires user input** (unclear context):
- Use **AskUserQuestion tool** to clarify
- Then continue with creation

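The stop condition in step 4b (every ID in `applyRequires` has `status: "done"`) can be sketched as a set check over the parsed status JSON. This is an illustration only: `is_apply_ready` is a hypothetical helper, and the `id` key on each artifact entry is an assumption about the JSON shape.

```python
def is_apply_ready(status: dict) -> bool:
    """True when every artifact ID in `applyRequires` is marked done."""
    done_ids = {
        artifact.get("id")
        for artifact in status.get("artifacts", [])
        if artifact.get("status") == "done"
    }
    # With an empty `applyRequires` list this is vacuously True.
    return all(required in done_ids for required in status.get("applyRequires", []))
```

Re-running the status command after each artifact and feeding the result to a check like this gives the loop its exit test.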
5. **Show final status**
```bash
openspec status --change "<name>"
```

**Output**

After completing all artifacts, summarize:
- Change name and location
- List of artifacts created with brief descriptions
- What's ready: "All artifacts created! Ready for implementation."
- Prompt: "Run `/opsx:apply` to start implementing."

**Artifact Creation Guidelines**

- Follow the `instruction` field from `openspec instructions` for each artifact type
- The schema defines what each artifact should contain - follow it
- Read dependency artifacts for context before creating new ones
- Use `template` as the structure for your output file - fill in its sections
- **IMPORTANT**: `context` and `rules` are constraints for YOU, not content for the file
  - Do NOT copy `<context>`, `<rules>`, `<project_context>` blocks into the artifact
  - These guide what you write, but should never appear in the output

**Guardrails**
- Create ALL artifacts needed for implementation (as defined by schema's `apply.requires`)
- Always read dependency artifacts before creating a new one
- If context is critically unclear, ask the user - but prefer making reasonable decisions to keep momentum
- If a change with that name already exists, ask if user wants to continue it or create a new one
- Verify each artifact file exists after writing before proceeding to next
@@ -0,0 +1,69 @@
---
name: "OPSX: New"
description: Start a new change using the experimental artifact workflow (OPSX)
category: Workflow
tags: [workflow, artifacts, experimental]
---

Start a new change using the experimental artifact-driven approach.

**Input**: The argument after `/opsx:new` is the change name (kebab-case), OR a description of what the user wants to build.

**Steps**

1. **If no input provided, ask what they want to build**

Use the **AskUserQuestion tool** (open-ended, no preset options) to ask:
> "What change do you want to work on? Describe what you want to build or fix."

From their description, derive a kebab-case name (e.g., "add user authentication" → `add-user-auth`).

**IMPORTANT**: Do NOT proceed without understanding what the user wants to build.

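Whether a proposed name is kebab-case can be checked mechanically before it is passed to the CLI. The sketch below uses a common definition (lowercase letters and digits in groups separated by single hyphens); the exact rule the OpenSpec CLI enforces is not stated here, so treat the pattern and the `is_kebab_case` helper as assumptions.

```python
import re

# Lowercase alphanumeric groups joined by single hyphens, e.g. "add-user-auth".
KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_kebab_case(name: str) -> bool:
    """Return True if `name` matches the assumed kebab-case pattern."""
    return KEBAB_CASE.fullmatch(name) is not None
```

Deriving a short name such as `add-user-auth` from a free-text description still takes judgment; this only validates the result.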
2. **Determine the workflow schema**

Use the default schema (omit `--schema`) unless the user explicitly requests a different workflow.

**Use a different schema only if the user mentions:**
- A specific schema name → use `--schema <name>`
- "show workflows" or "what workflows" → run `openspec schemas --json` and let them choose

**Otherwise**: Omit `--schema` to use the default.

3. **Create the change directory**
```bash
openspec new change "<name>"
```
Add `--schema <name>` only if the user requested a specific workflow.
This creates a scaffolded change at `openspec/changes/<name>/` with the selected schema.

4. **Show the artifact status**
```bash
openspec status --change "<name>"
```
This shows which artifacts need to be created and which are ready (dependencies satisfied).

5. **Get instructions for the first artifact**
The first artifact depends on the schema. Check the status output to find the first artifact with status "ready".
```bash
openspec instructions <first-artifact-id> --change "<name>"
```
This outputs the template and context for creating the first artifact.

6. **STOP and wait for user direction**

**Output**

After completing the steps, summarize:
- Change name and location
- Schema/workflow being used and its artifact sequence
- Current status (0/N artifacts complete)
- The template for the first artifact
- Prompt: "Ready to create the first artifact? Run `/opsx:continue` or just describe what this change is about and I'll draft it."

**Guardrails**
- Do NOT create any artifacts yet - just show the instructions
- Do NOT advance beyond showing the first artifact template
- If the name is invalid (not kebab-case), ask for a valid name
- If a change with that name already exists, suggest using `/opsx:continue` instead
- Pass --schema if using a non-default workflow
@@ -0,0 +1,134 @@
|
||||
---
|
||||
name: "OPSX: Sync"
|
||||
description: Sync delta specs from a change to main specs
|
||||
category: Workflow
|
||||
tags: [workflow, specs, experimental]
|
||||
---
|
||||
|
||||
Sync delta specs from a change to main specs.
|
||||
|
||||
This is an **agent-driven** operation - you will read delta specs and directly edit main specs to apply the changes. This allows intelligent merging (e.g., adding a scenario without copying the entire requirement).
|
||||
|
||||
**Input**: Optionally specify a change name after `/opsx:sync` (e.g., `/opsx:sync add-auth`). If omitted, check if it can be inferred from conversation context. If vague or ambiguous you MUST prompt for available changes.
|
||||
|
||||
**Steps**
|
||||
|
||||
1. **If no change name provided, prompt for selection**
|
||||
|
||||
Run `openspec list --json` to get available changes. Use the **AskUserQuestion tool** to let the user select.
|
||||
|
||||
Show only changes that have delta specs (under their `specs/` directory).
|
||||
|
||||
**IMPORTANT**: Do NOT guess or auto-select a change. Always let the user choose.
|
||||
|
||||
2. **Find delta specs**
|
||||
|
||||
Look for delta spec files in `openspec/changes/<name>/specs/*/spec.md`.
|
||||
|
||||
Each delta spec file contains sections like:
|
||||
- `## ADDED Requirements` - New requirements to add
|
||||
- `## MODIFIED Requirements` - Changes to existing requirements
|
||||
- `## REMOVED Requirements` - Requirements to remove
|
||||
- `## RENAMED Requirements` - Requirements to rename (FROM:/TO: format)
|
||||
|
||||
If no delta specs are found, inform the user and stop.
|
||||
|
||||
3. **For each delta spec, apply changes to main specs**
|
||||
|
||||
For each capability with a delta spec at `openspec/changes/<name>/specs/<capability>/spec.md`:
|
||||
|
||||
a. **Read the delta spec** to understand the intended changes
|
||||
|
||||
b. **Read the main spec** at `openspec/specs/<capability>/spec.md` (may not exist yet)
|
||||
|
||||
c. **Apply changes intelligently**:
|
||||
|
||||
**ADDED Requirements:**
|
||||
- If requirement doesn't exist in main spec → add it
|
||||
- If requirement already exists → update it to match (treat as implicit MODIFIED)
|
||||
|
||||
**MODIFIED Requirements:**
|
||||
- Find the requirement in main spec
|
||||
- Apply the changes - this can be:
|
||||
- Adding new scenarios (don't need to copy existing ones)
|
||||
- Modifying existing scenarios
|
||||
- Changing the requirement description
|
||||
- Preserve scenarios/content not mentioned in the delta
|
||||
|
||||
**REMOVED Requirements:**
|
||||
- Remove the entire requirement block from main spec
|
||||
|
||||
**RENAMED Requirements:**
|
||||
- Find the FROM requirement, rename to TO
|
||||
|
||||
d. **Create new main spec** if capability doesn't exist yet:
|
||||
- Create `openspec/specs/<capability>/spec.md`
|
||||
- Add Purpose section (can be brief, mark as TBD)
|
||||
- Add Requirements section with the ADDED requirements
|
||||
|
||||
4. **Show summary**
|
||||
|
||||
After applying all changes, summarize:
|
||||
- Which capabilities were updated
|
||||
- What changes were made (requirements added/modified/removed/renamed)
|
||||
|
||||
**Delta Spec Format Reference**
|
||||
|
||||
```markdown
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: New Feature
|
||||
The system SHALL do something new.
|
||||
|
||||
#### Scenario: Basic case
|
||||
- **WHEN** user does X
|
||||
- **THEN** system does Y
|
||||
|
||||
## MODIFIED Requirements
|
||||
|
||||
### Requirement: Existing Feature
|
||||
#### Scenario: New scenario to add
|
||||
- **WHEN** user does A
|
||||
- **THEN** system does B
|
||||
|
||||
## REMOVED Requirements
|
||||
|
||||
### Requirement: Deprecated Feature
|
||||
|
||||
## RENAMED Requirements
|
||||
|
||||
- FROM: `### Requirement: Old Name`
|
||||
- TO: `### Requirement: New Name`
|
||||
```
|
||||
|
||||
**Key Principle: Intelligent Merging**
|
||||
|
||||
Unlike programmatic merging, you can apply **partial updates**:
|
||||
- To add a scenario, just include that scenario under MODIFIED - don't copy existing scenarios
|
||||
- The delta represents *intent*, not a wholesale replacement
|
||||
- Use your judgment to merge changes sensibly
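The partial-update idea can be illustrated with a toy model. The sketch below assumes a requirement's scenarios are held as a mapping from scenario name to scenario text; that data shape and the name `merge_scenarios` are assumptions made for illustration. Real merging works on Markdown and relies on the agent's judgment.

```python
def merge_scenarios(main: dict, delta: dict) -> dict:
    """Merge delta scenarios into main: replace matches, append new, keep the rest."""
    merged = dict(main)    # copy so the caller's main-spec data is not mutated
    merged.update(delta)   # same-name scenarios are replaced, new ones are appended
    return merged
```

Applying the same delta a second time leaves the result unchanged, which is the behavior the idempotency guardrail asks for.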
|
||||
|
||||
**Output On Success**
|
||||
|
||||
```
|
||||
## Specs Synced: <change-name>
|
||||
|
||||
Updated main specs:
|
||||
|
||||
**<capability-1>**:
|
||||
- Added requirement: "New Feature"
|
||||
- Modified requirement: "Existing Feature" (added 1 scenario)
|
||||
|
||||
**<capability-2>**:
|
||||
- Created new spec file
|
||||
- Added requirement: "Another Feature"
|
||||
|
||||
Main specs are now updated. The change remains active - archive when implementation is complete.
|
||||
```
|
||||
|
||||
**Guardrails**
|
||||
- Read both delta and main specs before making changes
|
||||
- Preserve existing content not mentioned in delta
|
||||
- If something is unclear, ask for clarification
|
||||
- Show what you're changing as you go
|
||||
- The operation should be idempotent - running it twice should give the same result
|
||||
@@ -0,0 +1,164 @@
|
||||
---
|
||||
name: "OPSX: Verify"
|
||||
description: Verify implementation matches change artifacts before archiving
|
||||
category: Workflow
|
||||
tags: [workflow, verify, experimental]
|
||||
---
|
||||
|
||||
Verify that an implementation matches the change artifacts (specs, tasks, design).
|
||||
|
||||
**Input**: Optionally specify a change name after `/opsx:verify` (e.g., `/opsx:verify add-auth`). If omitted, check if it can be inferred from conversation context. If vague or ambiguous you MUST prompt for available changes.
|
||||
|
||||
**Steps**
|
||||
|
||||
1. **If no change name provided, prompt for selection**
|
||||
|
||||
Run `openspec list --json` to get available changes. Use the **AskUserQuestion tool** to let the user select.
|
||||
|
||||
Show changes that have implementation tasks (tasks artifact exists).
|
||||
Include the schema used for each change if available.
|
||||
Mark changes with incomplete tasks as "(In Progress)".
|
||||
|
||||
**IMPORTANT**: Do NOT guess or auto-select a change. Always let the user choose.
|
||||
|
||||
2. **Check status to understand the schema**
|
||||
```bash
|
||||
openspec status --change "<name>" --json
|
||||
```
|
||||
Parse the JSON to understand:
|
||||
- `schemaName`: The workflow being used (e.g., "spec-driven")
|
||||
- Which artifacts exist for this change
|
||||
|
||||
3. **Get the change directory and load artifacts**
|
||||
|
||||
```bash
|
||||
openspec instructions apply --change "<name>" --json
|
||||
```
|
||||
|
||||
This returns the change directory and context files. Read all available artifacts from `contextFiles`.
|
||||
|
||||
4. **Initialize verification report structure**
|
||||
|
||||
Create a report structure with three dimensions:
|
||||
- **Completeness**: Track tasks and spec coverage
|
||||
- **Correctness**: Track requirement implementation and scenario coverage
|
||||
- **Coherence**: Track design adherence and pattern consistency
|
||||
|
||||
Each dimension can have CRITICAL, WARNING, or SUGGESTION issues.
|
||||
|
||||
5. **Verify Completeness**
|
||||
|
||||
**Task Completion**:
|
||||
- If tasks.md exists in contextFiles, read it
|
||||
- Parse checkboxes: `- [ ]` (incomplete) vs `- [x]` (complete)
|
||||
- Count complete vs total tasks
|
||||
- If incomplete tasks exist:
|
||||
- Add CRITICAL issue for each incomplete task
|
||||
- Recommendation: "Complete task: <description>" or "Mark as done if already implemented"
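The checkbox parsing described above could look roughly like this. It is a sketch under stated assumptions: tasks are Markdown list items of the form `- [ ] ...` or `- [x] ...`, possibly indented, and an uppercase `X` is also accepted as complete (the source only mentions lowercase). The name `summarize_tasks` is hypothetical.

```python
import re

# A task line is a Markdown checkbox item: "- [ ] ..." or "- [x] ...".
# Accepting "X" as well is an assumption, not something the source states.
TASK = re.compile(r"^\s*- \[( |x|X)\] (.+)$")


def summarize_tasks(markdown: str):
    """Return (complete, total, incomplete_descriptions) for a tasks file."""
    complete = 0
    incomplete = []
    for line in markdown.splitlines():
        m = TASK.match(line)
        if not m:
            continue  # headings and prose are not tasks
        if m.group(1) == " ":
            incomplete.append(m.group(2).strip())
        else:
            complete += 1
    return complete, complete + len(incomplete), incomplete
```

Each entry in the returned list of incomplete descriptions would map to one CRITICAL issue in the report.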
|
||||
|
||||
**Spec Coverage**:
|
||||
- If delta specs exist in `openspec/changes/<name>/specs/`:
|
||||
- Extract all requirements (marked with "### Requirement:")
|
||||
- For each requirement:
|
||||
- Search codebase for keywords related to the requirement
|
||||
- Assess if implementation likely exists
|
||||
- If requirements appear unimplemented:
|
||||
- Add CRITICAL issue: "Requirement not found: <requirement name>"
|
||||
- Recommendation: "Implement requirement X: <description>"
|
||||
|
||||
6. **Verify Correctness**
|
||||
|
||||
**Requirement Implementation Mapping**:
|
||||
- For each requirement from delta specs:
|
||||
- Search codebase for implementation evidence
|
||||
- If found, note file paths and line ranges
|
||||
- Assess if implementation matches requirement intent
|
||||
- If divergence detected:
|
||||
- Add WARNING: "Implementation may diverge from spec: <details>"
|
||||
- Recommendation: "Review <file>:<lines> against requirement X"
|
||||
|
||||
**Scenario Coverage**:
|
||||
- For each scenario in delta specs (marked with "#### Scenario:"):
|
||||
- Check if conditions are handled in code
|
||||
- Check if tests exist covering the scenario
|
||||
- If scenario appears uncovered:
|
||||
- Add WARNING: "Scenario not covered: <scenario name>"
|
||||
- Recommendation: "Add test or implementation for scenario: <description>"
|
||||
|
||||
7. **Verify Coherence**
|
||||
|
||||
**Design Adherence**:
|
||||
- If design.md exists in contextFiles:
|
||||
- Extract key decisions (look for sections like "Decision:", "Approach:", "Architecture:")
|
||||
- Verify implementation follows those decisions
|
||||
- If contradiction detected:
|
||||
- Add WARNING: "Design decision not followed: <decision>"
|
||||
- Recommendation: "Update implementation or revise design.md to match reality"
|
||||
- If no design.md: Skip design adherence check, note "No design.md to verify against"
|
||||
|
||||
**Code Pattern Consistency**:
|
||||
- Review new code for consistency with project patterns
|
||||
- Check file naming, directory structure, coding style
|
||||
- If significant deviations found:
|
||||
- Add SUGGESTION: "Code pattern deviation: <details>"
|
||||
- Recommendation: "Consider following project pattern: <example>"
|
||||
|
||||
8. **Generate Verification Report**
|
||||
|
||||
**Summary Scorecard**:
|
||||
```
|
||||
## Verification Report: <change-name>
|
||||
|
||||
### Summary
|
||||
| Dimension | Status |
|
||||
|--------------|------------------|
|
||||
| Completeness | X/Y tasks, N reqs |
|
||||
| Correctness | M/N reqs covered |
|
||||
| Coherence | Followed/Issues |
|
||||
```
|
||||
|
||||
**Issues by Priority**:
|
||||
|
||||
1. **CRITICAL** (Must fix before archive):
|
||||
- Incomplete tasks
|
||||
- Missing requirement implementations
|
||||
- Each with specific, actionable recommendation
|
||||
|
||||
2. **WARNING** (Should fix):
|
||||
- Spec/design divergences
|
||||
- Missing scenario coverage
|
||||
- Each with specific recommendation
|
||||
|
||||
3. **SUGGESTION** (Nice to fix):
|
||||
- Pattern inconsistencies
|
||||
- Minor improvements
|
||||
- Each with specific recommendation
|
||||
|
||||
**Final Assessment**:
|
||||
- If CRITICAL issues: "X critical issue(s) found. Fix before archiving."
|
||||
- If only warnings: "No critical issues. Y warning(s) to consider. Ready for archive (with noted improvements)."
|
||||
- If all clear: "All checks passed. Ready for archive."
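The three-way final assessment can be written as a small function. This is a sketch; the name `final_assessment` is hypothetical, and treating a report with only SUGGESTION issues as "all clear" is an assumption, since the source does not spell out that case.

```python
def final_assessment(critical: int, warnings: int) -> str:
    """Pick the closing message from the issue counts."""
    if critical > 0:
        return f"{critical} critical issue(s) found. Fix before archiving."
    if warnings > 0:
        return (
            f"No critical issues. {warnings} warning(s) to consider. "
            "Ready for archive (with noted improvements)."
        )
    # Assumption: suggestions alone do not change the verdict.
    return "All checks passed. Ready for archive."
```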
|
||||
|
||||
**Verification Heuristics**
|
||||
|
||||
- **Completeness**: Focus on objective checklist items (checkboxes, requirements list)
|
||||
- **Correctness**: Use keyword search, file path analysis, reasonable inference - don't require perfect certainty
|
||||
- **Coherence**: Look for glaring inconsistencies, don't nitpick style
|
||||
- **False Positives**: When uncertain, prefer SUGGESTION over WARNING, WARNING over CRITICAL
|
||||
- **Actionability**: Every issue must have a specific recommendation with file/line references where applicable
|
||||
|
||||
**Graceful Degradation**
|
||||
|
||||
- If only tasks.md exists: verify task completion only, skip spec/design checks
|
||||
- If tasks + specs exist: verify completeness and correctness, skip design
|
||||
- If full artifacts: verify all three dimensions
|
||||
- Always note which checks were skipped and why
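One possible way to express the degradation rules is to derive the list of checks from the artifacts that exist. The sketch below assumes the spec-driven artifact names `tasks`, `specs`, and `design`; the function name `plan_checks` and the skip-reason wording (apart from the design message quoted earlier) are illustrative.

```python
def plan_checks(artifacts: set):
    """Return (checked, skipped), where skipped maps each check to its reason."""
    checked = []
    skipped = {}

    if "tasks" in artifacts:
        checked.append("task completion")
    else:
        skipped["task completion"] = "No tasks artifact found"

    if "specs" in artifacts:
        checked.extend(["spec coverage", "correctness"])
    else:
        skipped["spec coverage"] = "No delta specs found"
        skipped["correctness"] = "No delta specs found"

    if "design" in artifacts:
        checked.append("design adherence")
    else:
        skipped["design adherence"] = "No design.md to verify against"

    return checked, skipped
```

The `skipped` mapping gives the report the material it needs to note which checks were skipped and why.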
|
||||
|
||||
**Output Format**
|
||||
|
||||
Use clear markdown with:
|
||||
- Table for summary scorecard
|
||||
- Grouped lists for issues (CRITICAL/WARNING/SUGGESTION)
|
||||
- Code references in format: `file.ts:123`
|
||||
- Specific, actionable recommendations
|
||||
- No vague suggestions like "consider reviewing"
|
||||
@@ -0,0 +1,156 @@
|
||||
---
|
||||
name: openspec-apply-change
|
||||
description: Implement tasks from an OpenSpec change. Use when the user wants to start implementing, continue implementation, or work through tasks.
|
||||
license: MIT
|
||||
compatibility: Requires openspec CLI.
|
||||
metadata:
|
||||
author: openspec
|
||||
version: "1.0"
|
||||
generatedBy: "1.2.0"
|
||||
---
|
||||
|
||||
Implement tasks from an OpenSpec change.
|
||||
|
||||
**Input**: Optionally specify a change name. If omitted, check if it can be inferred from conversation context. If vague or ambiguous you MUST prompt for available changes.
|
||||
|
||||
**Steps**
|
||||
|
||||
1. **Select the change**
|
||||
|
||||
If a name is provided, use it. Otherwise:
|
||||
- Infer from conversation context if the user mentioned a change
|
||||
- Auto-select if only one active change exists
|
||||
- If ambiguous, run `openspec list --json` to get available changes and use the **AskUserQuestion tool** to let the user select
|
||||
|
||||
Always announce: "Using change: <name>" and how to override (e.g., `/opsx:apply <other>`).
|
||||
|
||||
2. **Check status to understand the schema**
|
||||
```bash
|
||||
openspec status --change "<name>" --json
|
||||
```
|
||||
Parse the JSON to understand:
|
||||
- `schemaName`: The workflow being used (e.g., "spec-driven")
|
||||
- Which artifact contains the tasks (typically "tasks" for spec-driven, check status for others)
|
||||
|
||||
3. **Get apply instructions**
|
||||
|
||||
```bash
|
||||
openspec instructions apply --change "<name>" --json
|
||||
```
|
||||
|
||||
This returns:
|
||||
- Context file paths (varies by schema - could be proposal/specs/design/tasks or spec/tests/implementation/docs)
|
||||
- Progress (total, complete, remaining)
|
||||
- Task list with status
|
||||
- Dynamic instruction based on current state
|
||||
|
||||
**Handle states:**
|
||||
- If `state: "blocked"` (missing artifacts): show message, suggest using openspec-continue-change
|
||||
- If `state: "all_done"`: congratulate, suggest archive
|
||||
- Otherwise: proceed to implementation
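The state handling above amounts to a small dispatch on the `state` field. A minimal sketch, assuming the CLI prints a single JSON object with a top-level `state` key as described; the function name `next_step` and the returned strings are illustrative.

```python
import json


def next_step(cli_output: str) -> str:
    """Map the `state` field of the apply-instructions JSON to the next action."""
    state = json.loads(cli_output).get("state")
    if state == "blocked":
        return "show message and suggest openspec-continue-change"
    if state == "all_done":
        return "congratulate and suggest archive"
    # Any other state means there are tasks to work on.
    return "proceed to implementation"
```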
|
||||
|
||||
4. **Read context files**
|
||||
|
||||
Read the files listed in `contextFiles` from the apply instructions output.
|
||||
The files depend on the schema being used:
|
||||
- **spec-driven**: proposal, specs, design, tasks
|
||||
- Other schemas: follow the contextFiles from CLI output
|
||||
|
||||
5. **Show current progress**
|
||||
|
||||
Display:
|
||||
- Schema being used
|
||||
- Progress: "N/M tasks complete"
|
||||
- Remaining tasks overview
|
||||
- Dynamic instruction from CLI
|
||||
|
||||
6. **Implement tasks (loop until done or blocked)**
|
||||
|
||||
For each pending task:
|
||||
- Show which task is being worked on
|
||||
- Make the code changes required
|
||||
- Keep changes minimal and focused
|
||||
- Mark task complete in the tasks file: `- [ ]` → `- [x]`
|
||||
- Continue to next task
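Marking a task complete is a one-line edit to the tasks file. The sketch below shows one way to do it on the file's text, assuming each task sits on its own line as `- [ ] <description>`; the helper name `mark_task_complete` is hypothetical.

```python
def mark_task_complete(markdown: str, description: str) -> str:
    """Flip the first unchecked task whose text equals `description` to checked."""
    lines = markdown.splitlines(keepends=True)
    target = f"- [ ] {description}"
    for i, line in enumerate(lines):
        # strip() ignores indentation and the trailing newline when comparing
        if line.strip() == target:
            lines[i] = line.replace("- [ ]", "- [x]", 1)
            return "".join(lines)
    raise ValueError(f"No unchecked task found: {description!r}")
```

Raising when no matching unchecked task exists keeps the checkbox state honest: a task that cannot be found is a reason to pause, not to guess.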
|
||||
|
||||
**Pause if:**
|
||||
- Task is unclear → ask for clarification
|
||||
- Implementation reveals a design issue → suggest updating artifacts
|
||||
- Error or blocker encountered → report and wait for guidance
|
||||
- User interrupts
|
||||
|
||||
7. **On completion or pause, show status**
|
||||
|
||||
Display:
|
||||
- Tasks completed this session
|
||||
- Overall progress: "N/M tasks complete"
|
||||
- If all done: suggest archive
|
||||
- If paused: explain why and wait for guidance
|
||||
|
||||
**Output During Implementation**
|
||||
|
||||
```
|
||||
## Implementing: <change-name> (schema: <schema-name>)
|
||||
|
||||
Working on task 3/7: <task description>
|
||||
[...implementation happening...]
|
||||
✓ Task complete
|
||||
|
||||
Working on task 4/7: <task description>
|
||||
[...implementation happening...]
|
||||
✓ Task complete
|
||||
```
|
||||
|
||||
**Output On Completion**
|
||||
|
||||
```
|
||||
## Implementation Complete
|
||||
|
||||
**Change:** <change-name>
|
||||
**Schema:** <schema-name>
|
||||
**Progress:** 7/7 tasks complete ✓
|
||||
|
||||
### Completed This Session
|
||||
- [x] Task 1
|
||||
- [x] Task 2
|
||||
...
|
||||
|
||||
All tasks complete! Ready to archive this change.
|
||||
```
|
||||
|
||||
**Output On Pause (Issue Encountered)**
|
||||
|
||||
```
|
||||
## Implementation Paused
|
||||
|
||||
**Change:** <change-name>
|
||||
**Schema:** <schema-name>
|
||||
**Progress:** 4/7 tasks complete
|
||||
|
||||
### Issue Encountered
|
||||
<description of the issue>
|
||||
|
||||
**Options:**
|
||||
1. <option 1>
|
||||
2. <option 2>
|
||||
3. Other approach
|
||||
|
||||
What would you like to do?
|
||||
```
|
||||
|
||||
**Guardrails**
|
||||
- Keep going through tasks until done or blocked
|
||||
- Always read context files before starting (from the apply instructions output)
|
||||
- If task is ambiguous, pause and ask before implementing
|
||||
- If implementation reveals issues, pause and suggest artifact updates
|
||||
- Keep code changes minimal and scoped to each task
|
||||
- Update task checkbox immediately after completing each task
|
||||
- Pause on errors, blockers, or unclear requirements - don't guess
|
||||
- Use contextFiles from CLI output, don't assume specific file names
|
||||
|
||||
**Fluid Workflow Integration**
|
||||
|
||||
This skill supports the "actions on a change" model:
|
||||
|
||||
- **Can be invoked anytime**: Before all artifacts are done (if tasks exist), after partial implementation, interleaved with other actions
|
||||
- **Allows artifact updates**: If implementation reveals design issues, suggest updating artifacts - not phase-locked, work fluidly
|
||||
@@ -0,0 +1,114 @@
|
||||
---
|
||||
name: openspec-archive-change
|
||||
description: Archive a completed change in the experimental workflow. Use when the user wants to finalize and archive a change after implementation is complete.
|
||||
license: MIT
|
||||
compatibility: Requires openspec CLI.
|
||||
metadata:
|
||||
author: openspec
|
||||
version: "1.0"
|
||||
generatedBy: "1.2.0"
|
||||
---
|
||||
|
||||
Archive a completed change in the experimental workflow.
|
||||
|
||||
**Input**: Optionally specify a change name. If omitted, check if it can be inferred from conversation context. If vague or ambiguous you MUST prompt for available changes.
|
||||
|
||||
**Steps**
|
||||
|
||||
1. **If no change name provided, prompt for selection**
|
||||
|
||||
Run `openspec list --json` to get available changes. Use the **AskUserQuestion tool** to let the user select.
|
||||
|
||||
Show only active changes (not already archived).
|
||||
Include the schema used for each change if available.
|
||||
|
||||
**IMPORTANT**: Do NOT guess or auto-select a change. Always let the user choose.
|
||||
|
||||
2. **Check artifact completion status**
|
||||
|
||||
Run `openspec status --change "<name>" --json` to check artifact completion.
|
||||
|
||||
Parse the JSON to understand:
|
||||
- `schemaName`: The workflow being used
|
||||
- `artifacts`: List of artifacts with their status (`done` or other)
|
||||
|
||||
**If any artifacts are not `done`:**
|
||||
- Display warning listing incomplete artifacts
|
||||
- Use **AskUserQuestion tool** to confirm user wants to proceed
|
||||
- Proceed if user confirms
|
||||
|
||||
3. **Check task completion status**
|
||||
|
||||
Read the tasks file (typically `tasks.md`) to check for incomplete tasks.
|
||||
|
||||
Count tasks marked with `- [ ]` (incomplete) vs `- [x]` (complete).
|
||||
|
||||
**If incomplete tasks found:**
|
||||
- Display warning showing count of incomplete tasks
|
||||
- Use **AskUserQuestion tool** to confirm user wants to proceed
|
||||
- Proceed if user confirms
|
||||
|
||||
**If no tasks file exists:** Proceed without task-related warning.
|
||||
|
||||
4. **Assess delta spec sync state**
|
||||
|
||||
Check for delta specs at `openspec/changes/<name>/specs/`. If none exist, proceed without sync prompt.
|
||||
|
||||
**If delta specs exist:**
|
||||
- Compare each delta spec with its corresponding main spec at `openspec/specs/<capability>/spec.md`
|
||||
- Determine what changes would be applied (adds, modifications, removals, renames)
|
||||
- Show a combined summary before prompting
|
||||
|
||||
**Prompt options:**
|
||||
- If changes needed: "Sync now (recommended)", "Archive without syncing"
|
||||
- If already synced: "Archive now", "Sync anyway", "Cancel"
|
||||
|
||||
If the user chooses sync, use the Task tool (subagent_type: "general-purpose", prompt: "Use Skill tool to invoke openspec-sync-specs for change '<name>'. Delta spec analysis: <include the analyzed delta spec summary>"). Unless the user cancels, proceed to archive regardless of the sync choice.
|
||||
|
||||
5. **Perform the archive**
|
||||
|
||||
Create the archive directory if it doesn't exist:
|
||||
```bash
|
||||
mkdir -p openspec/changes/archive
|
||||
```
|
||||
|
||||
Generate target name using current date: `YYYY-MM-DD-<change-name>`
|
||||
|
||||
**Check if target already exists:**
|
||||
- If yes: Fail with an error and suggest renaming the existing archive or using a different date
|
||||
- If no: Move the change directory to archive
|
||||
|
||||
```bash
|
||||
mv openspec/changes/<name> openspec/changes/archive/YYYY-MM-DD-<name>
|
||||
```
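The archive step (create the directory, build the dated name, refuse to overwrite, then move) can be sketched with the standard library. The function name `archive_change` and its `today` parameter are illustrative assumptions; the directory layout follows the paths shown above.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Optional


def archive_change(changes_dir: Path, name: str, today: Optional[date] = None) -> Path:
    """Move changes_dir/<name> to changes_dir/archive/YYYY-MM-DD-<name>."""
    today = today or date.today()
    archive_dir = changes_dir / "archive"
    archive_dir.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`

    # date.isoformat() yields YYYY-MM-DD
    target = archive_dir / f"{today.isoformat()}-{name}"
    if target.exists():
        raise FileExistsError(f"Archive target already exists: {target}")

    # Moving the whole directory carries every file inside it along.
    shutil.move(str(changes_dir / name), str(target))
    return target
```

Because the directory is moved as a unit, files such as `.openspec.yaml` travel with it.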
|
||||
|
||||
6. **Display summary**
|
||||
|
||||
Show archive completion summary including:
|
||||
- Change name
|
||||
- Schema that was used
|
||||
- Archive location
|
||||
- Whether specs were synced (if applicable)
|
||||
- Note about any warnings (incomplete artifacts/tasks)
|
||||
|
||||
**Output On Success**
|
||||
|
||||
```
|
||||
## Archive Complete
|
||||
|
||||
**Change:** <change-name>
|
||||
**Schema:** <schema-name>
|
||||
**Archived to:** openspec/changes/archive/YYYY-MM-DD-<name>/
|
||||
**Specs:** ✓ Synced to main specs (or "No delta specs" or "Sync skipped")
|
||||
|
||||
All artifacts complete. All tasks complete.
|
||||
```
|
||||
|
||||
**Guardrails**
|
||||
- Always prompt for change selection if not provided
|
||||
- Use artifact graph (`openspec status --json`) for completion checking
|
||||
- Don't block archive on warnings - just inform and confirm
|
||||
- Preserve `.openspec.yaml` when moving to archive (it moves with the directory)
|
||||
- Show clear summary of what happened
|
||||
- If sync is requested, use openspec-sync-specs approach (agent-driven)
|
||||
- If delta specs exist, always run the sync assessment and show the combined summary before prompting
|
||||
@@ -0,0 +1,118 @@
|
||||
---
|
||||
name: openspec-continue-change
|
||||
description: Continue working on an OpenSpec change by creating the next artifact. Use when the user wants to progress their change, create the next artifact, or continue their workflow.
|
||||
license: MIT
|
||||
compatibility: Requires openspec CLI.
|
||||
metadata:
|
||||
author: openspec
|
||||
version: "1.0"
|
||||
generatedBy: "1.2.0"
|
||||
---
|
||||
|
||||
Continue working on a change by creating the next artifact.
|
||||
|
||||
**Input**: Optionally specify a change name. If omitted, check if it can be inferred from conversation context. If vague or ambiguous you MUST prompt for available changes.
|
||||
|
||||
**Steps**
|
||||
|
||||
1. **If no change name provided, prompt for selection**
|
||||
|
||||
Run `openspec list --json` to get available changes sorted by most recently modified. Then use the **AskUserQuestion tool** to let the user select which change to work on.
|
||||
|
||||
Present the top 3-4 most recently modified changes as options, showing:
|
||||
- Change name
|
||||
- Schema (from `schema` field if present, otherwise "spec-driven")
|
||||
- Status (e.g., "0/5 tasks", "complete", "no tasks")
|
||||
- How recently it was modified (from `lastModified` field)
|
||||
|
||||
Mark the most recently modified change as "(Recommended)" since it's likely what the user wants to continue.
|
||||
|
||||
**IMPORTANT**: Do NOT guess or auto-select a change. Always let the user choose.
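Building the option list for the selection prompt might look like the sketch below. It assumes each entry from `openspec list --json` is an object with a `name` key (an assumption) plus the `schema` and `lastModified` fields mentioned above, and that `lastModified` values are ISO 8601 timestamps in one timezone, so they sort correctly as strings. The function only prepares labels; it never selects a change on the user's behalf.

```python
def recent_change_options(changes: list, limit: int = 4) -> list:
    """Return up to `limit` prompt options, most recently modified first."""
    # Re-sorting is defensive; the CLI output is described as already sorted.
    ordered = sorted(changes, key=lambda c: c["lastModified"], reverse=True)
    options = []
    for index, change in enumerate(ordered[:limit]):
        label = change["name"]
        if index == 0:
            label += " (Recommended)"
        options.append(
            {
                "label": label,
                # Fall back to the default schema when the field is absent.
                "schema": change.get("schema", "spec-driven"),
                "lastModified": change["lastModified"],
            }
        )
    return options
```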
|
||||
|
||||
2. **Check current status**
|
||||
```bash
|
||||
openspec status --change "<name>" --json
|
||||
```
|
||||
Parse the JSON to understand current state. The response includes:
|
||||
- `schemaName`: The workflow schema being used (e.g., "spec-driven")
|
||||
- `artifacts`: Array of artifacts with their status ("done", "ready", "blocked")
|
||||
- `isComplete`: Boolean indicating if all artifacts are complete
|
||||
|
||||
3. **Act based on status**:
|
||||
|
||||
---
|
||||
|
||||
**If all artifacts are complete (`isComplete: true`)**:
|
||||
- Congratulate the user
|
||||
- Show final status including the schema used
|
||||
- Suggest: "All artifacts created! You can now implement this change or archive it."
|
||||
- STOP
|
||||
|
||||
---
|
||||
|
||||
**If artifacts are ready to create** (status shows artifacts with `status: "ready"`):
|
||||
- Pick the FIRST artifact with `status: "ready"` from the status output
|
||||
- Get its instructions:
|
||||
```bash
|
||||
openspec instructions <artifact-id> --change "<name>" --json
|
||||
```
|
||||
- Parse the JSON. The key fields are:
|
||||
- `context`: Project background (constraints for you - do NOT include in output)
|
||||
- `rules`: Artifact-specific rules (constraints for you - do NOT include in output)
|
||||
- `template`: The structure to use for your output file
|
||||
- `instruction`: Schema-specific guidance
|
||||
- `outputPath`: Where to write the artifact
|
||||
- `dependencies`: Completed artifacts to read for context
|
||||
- **Create the artifact file**:
|
||||
- Read any completed dependency files for context
|
||||
- Use `template` as the structure - fill in its sections
|
||||
- Apply `context` and `rules` as constraints when writing - but do NOT copy them into the file
|
||||
- Write to the output path specified in instructions
|
||||
- Show what was created and what's now unlocked
|
||||
- STOP after creating ONE artifact
|
||||
|
||||
---
|
||||
|
||||
**If no artifacts are ready (all blocked)**:
|
||||
- This shouldn't happen with a valid schema
|
||||
- Show status and suggest checking for issues
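The three branches of step 3 can be summarized as a decision over the status JSON. This is a sketch under assumptions: `isComplete`, `artifacts`, and each artifact's `status` come from the description above, while the `id` key used to identify an artifact is hypothetical, as is the function name `next_action`.

```python
def next_action(status: dict):
    """Return (action, artifact_id) for a parsed `openspec status --json` result."""
    if status.get("isComplete"):
        return ("complete", None)
    for artifact in status.get("artifacts", []):
        # Pick the FIRST ready artifact, in the order the CLI reports them.
        if artifact.get("status") == "ready":
            return ("create", artifact.get("id"))
    # Nothing is ready and the change is not complete: everything is blocked.
    return ("blocked", None)
```

A `("create", <id>)` result corresponds to running `openspec instructions <artifact-id> --change "<name>" --json` and writing exactly one artifact before stopping.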
|
||||
|
||||
4. **After creating an artifact, show progress**
|
||||
```bash
|
||||
openspec status --change "<name>"
|
||||
```
|
||||
|
||||
**Output**
|
||||
|
||||
After each invocation, show:
|
||||
- Which artifact was created
|
||||
- Schema workflow being used
|
||||
- Current progress (N/M complete)
|
||||
- What artifacts are now unlocked
|
||||
- Prompt: "Want to continue? Just ask me to continue or tell me what to do next."
|
||||
|
||||
**Artifact Creation Guidelines**
|
||||
|
||||
The artifact types and their purpose depend on the schema. Use the `instruction` field from the instructions output to understand what to create.
|
||||
|
||||
Common artifact patterns:
|
||||
|
||||
**spec-driven schema** (proposal → specs → design → tasks):
|
||||
- **proposal.md**: Ask user about the change if not clear. Fill in Why, What Changes, Capabilities, Impact.
|
||||
- The Capabilities section is critical - each capability listed will need a spec file.
|
||||
- **specs/<capability>/spec.md**: Create one spec per capability listed in the proposal's Capabilities section (use the capability name, not the change name).
|
||||
- **design.md**: Document technical decisions, architecture, and implementation approach.
|
||||
- **tasks.md**: Break down implementation into checkboxed tasks.
|
||||
|
||||
For other schemas, follow the `instruction` field from the CLI output.
|
||||
|
||||
**Guardrails**
|
||||
- Create ONE artifact per invocation
|
||||
- Always read dependency artifacts before creating a new one
|
||||
- Never skip artifacts or create out of order
|
||||
- If context is unclear, ask the user before creating
|
||||
- Verify the artifact file exists after writing before marking progress
|
||||
- Use the schema's artifact sequence, don't assume specific artifact names
|
||||
- **IMPORTANT**: `context` and `rules` are constraints for YOU, not content for the file
|
||||
- Do NOT copy `<context>`, `<rules>`, `<project_context>` blocks into the artifact
|
||||
- These guide what you write, but should never appear in the output
|
||||
@@ -0,0 +1,288 @@
|
||||
---
|
||||
name: openspec-explore
|
||||
description: Enter explore mode - a thinking partner for exploring ideas, investigating problems, and clarifying requirements. Use when the user wants to think through something before or during a change.
|
||||
license: MIT
|
||||
compatibility: Requires openspec CLI.
|
||||
metadata:
|
||||
author: openspec
|
||||
version: "1.0"
|
||||
generatedBy: "1.2.0"
|
||||
---
|
||||
|
||||
Enter explore mode. Think deeply. Visualize freely. Follow the conversation wherever it goes.
|
||||
|
||||
**IMPORTANT: Explore mode is for thinking, not implementing.** You may read files, search code, and investigate the codebase, but you must NEVER write code or implement features. If the user asks you to implement something, remind them to exit explore mode first and create a change proposal. You MAY create OpenSpec artifacts (proposals, designs, specs) if the user asks—that's capturing thinking, not implementing.
|
||||
|
||||
**This is a stance, not a workflow.** There are no fixed steps, no required sequence, no mandatory outputs. You're a thinking partner helping the user explore.
|
||||
|
||||
---
|
||||
|
||||
## The Stance
|
||||
|
||||
- **Curious, not prescriptive** - Ask questions that emerge naturally, don't follow a script
|
||||
- **Open threads, not interrogations** - Surface multiple interesting directions and let the user follow what resonates. Don't funnel them through a single path of questions.
|
||||
- **Visual** - Use ASCII diagrams liberally when they'd help clarify thinking
|
||||
- **Adaptive** - Follow interesting threads, pivot when new information emerges
|
||||
- **Patient** - Don't rush to conclusions, let the shape of the problem emerge
|
||||
- **Grounded** - Explore the actual codebase when relevant, don't just theorize
|
||||
|
||||
---
|
||||
|
||||
## What You Might Do
|
||||
|
||||
Depending on what the user brings, you might:
|
||||
|
||||
**Explore the problem space**
|
||||
- Ask clarifying questions that emerge from what they said
|
||||
- Challenge assumptions
|
||||
- Reframe the problem
|
||||
- Find analogies
|
||||
|
||||
**Investigate the codebase**
|
||||
- Map existing architecture relevant to the discussion
|
||||
- Find integration points
|
||||
- Identify patterns already in use
|
||||
- Surface hidden complexity
|
||||
|
||||
**Compare options**
|
||||
- Brainstorm multiple approaches
|
||||
- Build comparison tables
|
||||
- Sketch tradeoffs
|
||||
- Recommend a path (if asked)
|
||||
|
||||
**Visualize**
|
||||
```
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Use ASCII diagrams liberally │
|
||||
├─────────────────────────────────────────┤
|
||||
│ │
|
||||
│ ┌────────┐ ┌────────┐ │
|
||||
│ │ State │────────▶│ State │ │
|
||||
│ │ A │ │ B │ │
|
||||
│ └────────┘ └────────┘ │
|
||||
│ │
|
||||
│ System diagrams, state machines, │
|
||||
│ data flows, architecture sketches, │
|
||||
│ dependency graphs, comparison tables │
|
||||
│ │
|
||||
└─────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
**Surface risks and unknowns**
|
||||
- Identify what could go wrong
|
||||
- Find gaps in understanding
|
||||
- Suggest spikes or investigations
|
||||
|
||||
---
|
||||
|
||||
## OpenSpec Awareness
|
||||
|
||||
You have full context of the OpenSpec system. Use it naturally, don't force it.
|
||||
|
||||
### Check for context
|
||||
|
||||
At the start, quickly check what exists:
|
||||
```bash
|
||||
openspec list --json
|
||||
```
|
||||
|
||||
This tells you:
|
||||
- If there are active changes
|
||||
- Their names, schemas, and status
|
||||
- What the user might be working on
|
||||
|
||||
### When no change exists
|
||||
|
||||
Think freely. When insights crystallize, you might offer:
|
||||
|
||||
- "This feels solid enough to start a change. Want me to create a proposal?"
|
||||
- Or keep exploring - no pressure to formalize
|
||||
|
||||
### When a change exists
|
||||
|
||||
If the user mentions a change or you detect one is relevant:
|
||||
|
||||
1. **Read existing artifacts for context**
|
||||
- `openspec/changes/<name>/proposal.md`
|
||||
- `openspec/changes/<name>/design.md`
|
||||
- `openspec/changes/<name>/tasks.md`
|
||||
- etc.
|
||||
|
||||
2. **Reference them naturally in conversation**
|
||||
- "Your design mentions using Redis, but we just realized SQLite fits better..."
|
||||
- "The proposal scopes this to premium users, but we're now thinking everyone..."
|
||||
|
||||
3. **Offer to capture when decisions are made**
|
||||
|
||||
| Insight Type | Where to Capture |
|--------------|------------------|
| New requirement discovered | `specs/<capability>/spec.md` |
| Requirement changed | `specs/<capability>/spec.md` |
| Design decision made | `design.md` |
| Scope changed | `proposal.md` |
| New work identified | `tasks.md` |
| Assumption invalidated | Relevant artifact |
|
||||
|
||||
Example offers:
|
||||
- "That's a design decision. Capture it in design.md?"
|
||||
- "This is a new requirement. Add it to specs?"
|
||||
- "This changes scope. Update the proposal?"
|
||||
|
||||
4. **The user decides** - Offer and move on. Don't pressure. Don't auto-capture.
|
||||
|
||||
---
|
||||
|
||||
## What You Don't Have To Do
|
||||
|
||||
- Follow a script
|
||||
- Ask the same questions every time
|
||||
- Produce a specific artifact
|
||||
- Reach a conclusion
|
||||
- Stay on topic if a tangent is valuable
|
||||
- Be brief (this is thinking time)
|
||||
|
||||
---
|
||||
|
||||
## Handling Different Entry Points
|
||||
|
||||
**User brings a vague idea:**
|
||||
```
User: I'm thinking about adding real-time collaboration

You: Real-time collab is a big space. Let me think about this...

      COLLABORATION SPECTRUM
      ════════════════════════════════════════════

      Awareness        Coordination           Sync
          │                  │                  │
          ▼                  ▼                  ▼
      ┌────────┐         ┌────────┐         ┌────────┐
      │Presence│         │Cursors │         │  CRDT  │
      │  "3    │         │ Multi  │         │Conflict│
      │online" │         │ select │         │  free  │
      └────────┘         └────────┘         └────────┘
          │                  │                  │
       trivial           moderate            complex

      Where's your head at?
```
|
||||
|
||||
**User brings a specific problem:**
|
||||
```
User: The auth system is a mess

You: [reads codebase]

     ┌─────────────────────────────────────────────┐
     │              CURRENT AUTH FLOW              │
     └─────────────────────────────────────────────┘
                            │
              ┌─────────────┼─────────────┐
              ▼             ▼             ▼
         ┌─────────┐   ┌─────────┐   ┌─────────┐
         │ Google  │   │ GitHub  │   │  Email  │
         │  OAuth  │   │  OAuth  │   │  Magic  │
         └────┬────┘   └────┬────┘   └────┬────┘
              │             │             │
              └─────────────┼─────────────┘
                            ▼
                      ┌───────────┐
                      │  Session  │
                      └─────┬─────┘
                            │
                            ▼
                      ┌───────────┐
                      │   Perms   │
                      └───────────┘

     I see three tangles. Which one's burning?
```
|
||||
|
||||
**User is stuck mid-implementation:**
|
||||
```
|
||||
User: /opsx:explore add-auth-system
|
||||
The OAuth integration is more complex than expected
|
||||
|
||||
You: [reads change artifacts]
|
||||
|
||||
You're on task 4: "Implement OAuth flow"
|
||||
|
||||
Let me trace what's involved...
|
||||
|
||||
[draws diagram, explores options, suggests paths]
|
||||
|
||||
Want to update the design to reflect this?
|
||||
Or add a spike task to investigate?
|
||||
```
|
||||
|
||||
**User wants to compare options:**
|
||||
```
User: Should we use Postgres or SQLite?

You: Generic answer is boring. What's the context?

User: A CLI tool that tracks local dev environments

You: That changes everything.

     ┌─────────────────────────────────────────────────┐
     │              CLI TOOL DATA STORAGE              │
     └─────────────────────────────────────────────────┘

     Key constraints:
     • No daemon running
     • Must work offline
     • Single user

                  SQLite          Postgres
     Deployment   embedded ✓      needs server ✗
     Offline      yes ✓           no ✗
     Single file  yes ✓           no ✗

     SQLite. Not even close.

     Unless... is there a sync component?
```
|
||||
|
||||
---
|
||||
|
||||
## Ending Discovery
|
||||
|
||||
There's no required ending. Discovery might:
|
||||
|
||||
- **Flow into a proposal**: "Ready to start? I can create a change proposal."
|
||||
- **Result in artifact updates**: "Updated design.md with these decisions"
|
||||
- **Just provide clarity**: User has what they need, moves on
|
||||
- **Continue later**: "We can pick this up anytime"
|
||||
|
||||
When it feels like things are crystallizing, you might summarize:
|
||||
|
||||
```
|
||||
## What We Figured Out
|
||||
|
||||
**The problem**: [crystallized understanding]
|
||||
|
||||
**The approach**: [if one emerged]
|
||||
|
||||
**Open questions**: [if any remain]
|
||||
|
||||
**Next steps** (if ready):
|
||||
- Create a change proposal
|
||||
- Keep exploring: just keep talking
|
||||
```
|
||||
|
||||
But this summary is optional. Sometimes the thinking IS the value.
|
||||
|
||||
---
|
||||
|
||||
## Guardrails
|
||||
|
||||
- **Don't implement** - Never write code or implement features. Creating OpenSpec artifacts is fine; writing application code is not.
|
||||
- **Don't fake understanding** - If something is unclear, dig deeper
|
||||
- **Don't rush** - Discovery is thinking time, not task time
|
||||
- **Don't force structure** - Let patterns emerge naturally
|
||||
- **Don't auto-capture** - Offer to save insights, don't just do it
|
||||
- **Do visualize** - A good diagram is worth many paragraphs
|
||||
- **Do explore the codebase** - Ground discussions in reality
|
||||
- **Do question assumptions** - Including the user's and your own
|
||||
@@ -0,0 +1,101 @@
|
||||
---
|
||||
name: openspec-ff-change
|
||||
description: Fast-forward through OpenSpec artifact creation. Use when the user wants to quickly create all artifacts needed for implementation without stepping through each one individually.
|
||||
license: MIT
|
||||
compatibility: Requires openspec CLI.
|
||||
metadata:
|
||||
author: openspec
|
||||
version: "1.0"
|
||||
generatedBy: "1.2.0"
|
||||
---
|
||||
|
||||
Fast-forward through artifact creation - generate everything needed to start implementation in one go.
|
||||
|
||||
**Input**: The user's request should include a change name (kebab-case) OR a description of what they want to build.
|
||||
|
||||
**Steps**
|
||||
|
||||
1. **If no clear input provided, ask what they want to build**
|
||||
|
||||
Use the **AskUserQuestion tool** (open-ended, no preset options) to ask:
|
||||
> "What change do you want to work on? Describe what you want to build or fix."
|
||||
|
||||
From their description, derive a kebab-case name (e.g., "add user authentication" → `add-user-auth`).
|
||||
|
||||
**IMPORTANT**: Do NOT proceed without understanding what the user wants to build.
|
||||
|
||||
2. **Create the change directory**
|
||||
```bash
|
||||
openspec new change "<name>"
|
||||
```
|
||||
This creates a scaffolded change at `openspec/changes/<name>/`.
|
||||
|
||||
3. **Get the artifact build order**
|
||||
```bash
|
||||
openspec status --change "<name>" --json
|
||||
```
|
||||
Parse the JSON to get:
|
||||
- `applyRequires`: array of artifact IDs needed before implementation (e.g., `["tasks"]`)
|
||||
- `artifacts`: list of all artifacts with their status and dependencies
|
||||
|
||||
4. **Create artifacts in sequence until apply-ready**
|
||||
|
||||
Use the **TodoWrite tool** to track progress through the artifacts.
|
||||
|
||||
Loop through artifacts in dependency order (artifacts with no pending dependencies first):
|
||||
|
||||
a. **For each artifact that is `ready` (dependencies satisfied)**:
|
||||
- Get instructions:
|
||||
```bash
|
||||
openspec instructions <artifact-id> --change "<name>" --json
|
||||
```
|
||||
- The instructions JSON includes:
|
||||
- `context`: Project background (constraints for you - do NOT include in output)
|
||||
- `rules`: Artifact-specific rules (constraints for you - do NOT include in output)
|
||||
- `template`: The structure to use for your output file
|
||||
- `instruction`: Schema-specific guidance for this artifact type
|
||||
- `outputPath`: Where to write the artifact
|
||||
- `dependencies`: Completed artifacts to read for context
|
||||
- Read any completed dependency files for context
|
||||
- Create the artifact file using `template` as the structure
|
||||
- Apply `context` and `rules` as constraints - but do NOT copy them into the file
|
||||
- Show brief progress: "✓ Created <artifact-id>"
|
||||
|
||||
b. **Continue until all `applyRequires` artifacts are complete**
|
||||
- After creating each artifact, re-run `openspec status --change "<name>" --json`
|
||||
- Check if every artifact ID in `applyRequires` has `status: "done"` in the artifacts array
|
||||
- Stop when all `applyRequires` artifacts are done
|
||||
|
||||
c. **If an artifact requires user input** (unclear context):
|
||||
- Use **AskUserQuestion tool** to clarify
|
||||
- Then continue with creation
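
The stop condition in step 4b can be sketched in a few lines of Python. This is only an illustration of the check described above, not part of the openspec CLI: the `applyRequires` and `artifacts` keys and the `ready`/`done` values come from the steps above, while the per-artifact `id` key, the `blocked` value, and the sample payload are assumptions made for the example.

```python
import json


def apply_ready(status: dict) -> bool:
    """True when every artifact ID in applyRequires has status "done"."""
    done = {a["id"] for a in status.get("artifacts", []) if a.get("status") == "done"}
    return all(required in done for required in status.get("applyRequires", []))


def ready_artifacts(status: dict) -> list[str]:
    """IDs of artifacts whose dependencies are satisfied and can be created next."""
    return [a["id"] for a in status.get("artifacts", []) if a.get("status") == "ready"]


# Hypothetical payload shaped like the fields named in step 3;
# the real `openspec status --json` output may carry more fields.
sample_status = json.loads("""
{
  "applyRequires": ["tasks"],
  "artifacts": [
    {"id": "proposal", "status": "done"},
    {"id": "design", "status": "ready"},
    {"id": "tasks", "status": "blocked"}
  ]
}
""")

print(ready_artifacts(sample_status))  # artifacts to create in this pass
print(apply_ready(sample_status))      # whether the loop can stop
```

Re-parsing the status after each artifact, rather than tracking state locally, keeps the CLI as the single source of truth for what is done.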
|
||||
|
||||
5. **Show final status**
|
||||
```bash
|
||||
openspec status --change "<name>"
|
||||
```
|
||||
|
||||
**Output**
|
||||
|
||||
After completing all artifacts, summarize:
|
||||
- Change name and location
|
||||
- List of artifacts created with brief descriptions
|
||||
- What's ready: "All artifacts created! Ready for implementation."
|
||||
- Prompt: "Run `/opsx:apply` or ask me to implement to start working on the tasks."
|
||||
|
||||
**Artifact Creation Guidelines**
|
||||
|
||||
- Follow the `instruction` field from `openspec instructions` for each artifact type
|
||||
- The schema defines what each artifact should contain - follow it
|
||||
- Read dependency artifacts for context before creating new ones
|
||||
- Use `template` as the structure for your output file - fill in its sections
|
||||
- **IMPORTANT**: `context` and `rules` are constraints for YOU, not content for the file
|
||||
- Do NOT copy `<context>`, `<rules>`, `<project_context>` blocks into the artifact
|
||||
- These guide what you write, but should never appear in the output
|
||||
|
||||
**Guardrails**
|
||||
- Create ALL artifacts needed for implementation (as defined by schema's `apply.requires`)
|
||||
- Always read dependency artifacts before creating a new one
|
||||
- If context is critically unclear, ask the user - but prefer making reasonable decisions to keep momentum
|
||||
- If a change with that name already exists, suggest continuing that change instead
|
||||
- Verify each artifact file exists after writing, before proceeding to the next one
|
||||
@@ -0,0 +1,74 @@
|
||||
---
|
||||
name: openspec-new-change
|
||||
description: Start a new OpenSpec change using the experimental artifact workflow. Use when the user wants to create a new feature, fix, or modification with a structured step-by-step approach.
|
||||
license: MIT
|
||||
compatibility: Requires openspec CLI.
|
||||
metadata:
|
||||
author: openspec
|
||||
version: "1.0"
|
||||
generatedBy: "1.2.0"
|
||||
---
|
||||
|
||||
Start a new change using the experimental artifact-driven approach.
|
||||
|
||||
**Input**: The user's request should include a change name (kebab-case) OR a description of what they want to build.
|
||||
|
||||
**Steps**
|
||||
|
||||
1. **If no clear input provided, ask what they want to build**
|
||||
|
||||
Use the **AskUserQuestion tool** (open-ended, no preset options) to ask:
|
||||
> "What change do you want to work on? Describe what you want to build or fix."
|
||||
|
||||
From their description, derive a kebab-case name (e.g., "add user authentication" → `add-user-auth`).
|
||||
|
||||
**IMPORTANT**: Do NOT proceed without understanding what the user wants to build.
|
||||
|
||||
2. **Determine the workflow schema**
|
||||
|
||||
Use the default schema (omit `--schema`) unless the user explicitly requests a different workflow.
|
||||
|
||||
**Use a different schema only if the user mentions:**
|
||||
- A specific schema name → use `--schema <name>`
|
||||
- "show workflows" or "what workflows" → run `openspec schemas --json` and let them choose
|
||||
|
||||
**Otherwise**: Omit `--schema` to use the default.
|
||||
|
||||
3. **Create the change directory**
|
||||
```bash
|
||||
openspec new change "<name>"
|
||||
```
|
||||
Add `--schema <name>` only if the user requested a specific workflow.
|
||||
This creates a scaffolded change at `openspec/changes/<name>/` with the selected schema.
|
||||
|
||||
4. **Show the artifact status**
|
||||
```bash
|
||||
openspec status --change "<name>"
|
||||
```
|
||||
This shows which artifacts need to be created and which are ready (dependencies satisfied).
|
||||
|
||||
5. **Get instructions for the first artifact**
|
||||
The first artifact depends on the schema (e.g., `proposal` for spec-driven).
|
||||
Check the status output to find the first artifact with status "ready".
|
||||
```bash
|
||||
openspec instructions <first-artifact-id> --change "<name>"
|
||||
```
|
||||
This outputs the template and context for creating the first artifact.
|
||||
|
||||
6. **STOP and wait for user direction**
|
||||
|
||||
**Output**
|
||||
|
||||
After completing the steps, summarize:
|
||||
- Change name and location
|
||||
- Schema/workflow being used and its artifact sequence
|
||||
- Current status (0/N artifacts complete)
|
||||
- The template for the first artifact
|
||||
- Prompt: "Ready to create the first artifact? Just describe what this change is about and I'll draft it, or ask me to continue."
|
||||
|
||||
**Guardrails**
|
||||
- Do NOT create any artifacts yet - just show the instructions
|
||||
- Do NOT advance beyond showing the first artifact template
|
||||
- If the name is invalid (not kebab-case), ask for a valid name
|
||||
- If a change with that name already exists, suggest continuing that change instead
|
||||
- Pass `--schema` if using a non-default workflow
|
||||
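
The kebab-case guardrail above can be checked before calling `openspec new change`. The sketch below uses a common definition of kebab-case (lowercase letters and digits, joined by single hyphens). The exact rule the CLI enforces is not stated here, so treat the pattern and the `is_kebab_case` helper as assumptions.

```python
import re

# Assumed definition: lowercase alphanumeric words joined by single hyphens.
KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_kebab_case(name: str) -> bool:
    """Return True if the whole name matches the assumed kebab-case pattern."""
    return KEBAB_CASE.fullmatch(name) is not None


for candidate in ["add-user-auth", "Add_User_Auth", "add--auth", "fix-bug-42"]:
    print(candidate, is_kebab_case(candidate))
```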
@@ -0,0 +1,138 @@
|
||||
---
|
||||
name: openspec-sync-specs
|
||||
description: Sync delta specs from a change to main specs. Use when the user wants to update main specs with changes from a delta spec, without archiving the change.
|
||||
license: MIT
|
||||
compatibility: Requires openspec CLI.
|
||||
metadata:
|
||||
author: openspec
|
||||
version: "1.0"
|
||||
generatedBy: "1.2.0"
|
||||
---
|
||||
|
||||
Sync delta specs from a change to main specs.
|
||||
|
||||
This is an **agent-driven** operation - you will read delta specs and directly edit main specs to apply the changes. This allows intelligent merging (e.g., adding a scenario without copying the entire requirement).
|
||||
|
||||
**Input**: Optionally specify a change name. If omitted, check if it can be inferred from conversation context. If vague or ambiguous you MUST prompt for available changes.
|
||||
|
||||
**Steps**
|
||||
|
||||
1. **If no change name provided, prompt for selection**
|
||||
|
||||
Run `openspec list --json` to get available changes. Use the **AskUserQuestion tool** to let the user select.
|
||||
|
||||
Show only changes that have delta specs (under the change's `specs/` directory).
|
||||
|
||||
**IMPORTANT**: Do NOT guess or auto-select a change. Always let the user choose.
|
||||
|
||||
2. **Find delta specs**
|
||||
|
||||
Look for delta spec files in `openspec/changes/<name>/specs/*/spec.md`.
|
||||
|
||||
Each delta spec file contains sections like:
|
||||
- `## ADDED Requirements` - New requirements to add
|
||||
- `## MODIFIED Requirements` - Changes to existing requirements
|
||||
- `## REMOVED Requirements` - Requirements to remove
|
||||
- `## RENAMED Requirements` - Requirements to rename (FROM:/TO: format)
|
||||
|
||||
If no delta specs are found, inform the user and stop.
|
||||
|
||||
3. **For each delta spec, apply changes to main specs**
|
||||
|
||||
For each capability with a delta spec at `openspec/changes/<name>/specs/<capability>/spec.md`:
|
||||
|
||||
a. **Read the delta spec** to understand the intended changes
|
||||
|
||||
b. **Read the main spec** at `openspec/specs/<capability>/spec.md` (may not exist yet)
|
||||
|
||||
c. **Apply changes intelligently**:
|
||||
|
||||
**ADDED Requirements:**
|
||||
- If requirement doesn't exist in main spec → add it
|
||||
- If requirement already exists → update it to match (treat as implicit MODIFIED)
|
||||
|
||||
**MODIFIED Requirements:**
|
||||
- Find the requirement in main spec
|
||||
- Apply the changes - this can be:
|
||||
- Adding new scenarios (don't need to copy existing ones)
|
||||
- Modifying existing scenarios
|
||||
- Changing the requirement description
|
||||
- Preserve scenarios/content not mentioned in the delta
|
||||
|
||||
**REMOVED Requirements:**
|
||||
- Remove the entire requirement block from main spec
|
||||
|
||||
**RENAMED Requirements:**
|
||||
- Find the FROM requirement, rename to TO
|
||||
|
||||
d. **Create new main spec** if capability doesn't exist yet:
|
||||
- Create `openspec/specs/<capability>/spec.md`
|
||||
- Add Purpose section (can be brief, mark as TBD)
|
||||
- Add Requirements section with the ADDED requirements
|
||||
|
||||
4. **Show summary**
|
||||
|
||||
After applying all changes, summarize:
|
||||
- Which capabilities were updated
|
||||
- What changes were made (requirements added/modified/removed/renamed)
|
||||
|
||||
**Delta Spec Format Reference**
|
||||
|
||||
```markdown
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: New Feature
|
||||
The system SHALL do something new.
|
||||
|
||||
#### Scenario: Basic case
|
||||
- **WHEN** user does X
|
||||
- **THEN** system does Y
|
||||
|
||||
## MODIFIED Requirements
|
||||
|
||||
### Requirement: Existing Feature
|
||||
#### Scenario: New scenario to add
|
||||
- **WHEN** user does A
|
||||
- **THEN** system does B
|
||||
|
||||
## REMOVED Requirements
|
||||
|
||||
### Requirement: Deprecated Feature
|
||||
|
||||
## RENAMED Requirements
|
||||
|
||||
- FROM: `### Requirement: Old Name`
|
||||
- TO: `### Requirement: New Name`
|
||||
```
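
To get an overview of a delta spec before merging, the format above can be scanned for its operations. The following is a rough sketch, assuming the headings appear exactly as in the reference; `parse_delta` is a hypothetical helper, and it only lists requirement names. The merge itself remains a judgment call, as described in step 3.

```python
import re

SECTION = re.compile(r"^## (ADDED|MODIFIED|REMOVED|RENAMED) Requirements\s*$")
REQUIREMENT = re.compile(r"^### Requirement: (.+?)\s*$")
RENAME = re.compile(r"^- (FROM|TO): `### Requirement: (.+?)`\s*$")


def parse_delta(text: str) -> dict:
    """Group requirement names by delta operation; RENAMED holds (old, new) pairs."""
    ops = {"ADDED": [], "MODIFIED": [], "REMOVED": [], "RENAMED": []}
    section = None
    pending_from = None
    for line in text.splitlines():
        header = SECTION.match(line)
        if header:
            section = header.group(1)
            continue
        if section is None:
            continue
        if section == "RENAMED":
            rename = RENAME.match(line)
            if rename and rename.group(1) == "FROM":
                pending_from = rename.group(2)
            elif rename and pending_from is not None:
                ops["RENAMED"].append((pending_from, rename.group(2)))
                pending_from = None
        else:
            requirement = REQUIREMENT.match(line)
            if requirement:
                ops[section].append(requirement.group(1))
    return ops


sample_delta = """\
## ADDED Requirements

### Requirement: New Feature
The system SHALL do something new.

## MODIFIED Requirements

### Requirement: Existing Feature
#### Scenario: New scenario to add

## REMOVED Requirements

### Requirement: Deprecated Feature

## RENAMED Requirements

- FROM: `### Requirement: Old Name`
- TO: `### Requirement: New Name`
"""

print(parse_delta(sample_delta))
```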
|
||||
|
||||
**Key Principle: Intelligent Merging**
|
||||
|
||||
Unlike programmatic merging, you can apply **partial updates**:
|
||||
- To add a scenario, just include that scenario under MODIFIED - don't copy existing scenarios
|
||||
- The delta represents *intent*, not a wholesale replacement
|
||||
- Use your judgment to merge changes sensibly
|
||||
|
||||
**Output On Success**
|
||||
|
||||
```
|
||||
## Specs Synced: <change-name>
|
||||
|
||||
Updated main specs:
|
||||
|
||||
**<capability-1>**:
|
||||
- Added requirement: "New Feature"
|
||||
- Modified requirement: "Existing Feature" (added 1 scenario)
|
||||
|
||||
**<capability-2>**:
|
||||
- Created new spec file
|
||||
- Added requirement: "Another Feature"
|
||||
|
||||
Main specs are now updated. The change remains active - archive when implementation is complete.
|
||||
```
|
||||
|
||||
**Guardrails**
|
||||
- Read both delta and main specs before making changes
|
||||
- Preserve existing content not mentioned in delta
|
||||
- If something is unclear, ask for clarification
|
||||
- Show what you're changing as you go
|
||||
- The operation should be idempotent - running it twice should give the same result
|
||||
@@ -0,0 +1,168 @@
|
||||
---
|
||||
name: openspec-verify-change
|
||||
description: Verify implementation matches change artifacts. Use when the user wants to validate that implementation is complete, correct, and coherent before archiving.
|
||||
license: MIT
|
||||
compatibility: Requires openspec CLI.
|
||||
metadata:
|
||||
author: openspec
|
||||
version: "1.0"
|
||||
generatedBy: "1.2.0"
|
||||
---
|
||||
|
||||
Verify that an implementation matches the change artifacts (specs, tasks, design).
|
||||
|
||||
**Input**: Optionally specify a change name. If omitted, check if it can be inferred from conversation context. If vague or ambiguous you MUST prompt for available changes.
|
||||
|
||||
**Steps**
|
||||
|
||||
1. **If no change name provided, prompt for selection**
|
||||
|
||||
Run `openspec list --json` to get available changes. Use the **AskUserQuestion tool** to let the user select.
|
||||
|
||||
Show changes that have implementation tasks (tasks artifact exists).
|
||||
Include the schema used for each change if available.
|
||||
Mark changes with incomplete tasks as "(In Progress)".
|
||||
|
||||
**IMPORTANT**: Do NOT guess or auto-select a change. Always let the user choose.
|
||||
|
||||
2. **Check status to understand the schema**
|
||||
```bash
|
||||
openspec status --change "<name>" --json
|
||||
```
|
||||
Parse the JSON to understand:
|
||||
- `schemaName`: The workflow being used (e.g., "spec-driven")
|
||||
- Which artifacts exist for this change
|
||||
|
||||
3. **Get the change directory and load artifacts**
|
||||
|
||||
```bash
|
||||
openspec instructions apply --change "<name>" --json
|
||||
```
|
||||
|
||||
This returns the change directory and context files. Read all available artifacts from `contextFiles`.
|
||||
|
||||
4. **Initialize verification report structure**
|
||||
|
||||
Create a report structure with three dimensions:
|
||||
- **Completeness**: Track tasks and spec coverage
|
||||
- **Correctness**: Track requirement implementation and scenario coverage
|
||||
- **Coherence**: Track design adherence and pattern consistency
|
||||
|
||||
Each dimension can have CRITICAL, WARNING, or SUGGESTION issues.
|
||||
|
||||
5. **Verify Completeness**
|
||||
|
||||
**Task Completion**:
|
||||
- If tasks.md exists in contextFiles, read it
|
||||
- Parse checkboxes: `- [ ]` (incomplete) vs `- [x]` (complete)
|
||||
- Count complete vs total tasks
|
||||
- If incomplete tasks exist:
|
||||
- Add CRITICAL issue for each incomplete task
|
||||
- Recommendation: "Complete task: <description>" or "Mark as done if already implemented"
|
||||
|
||||
**Spec Coverage**:
|
||||
- If delta specs exist in `openspec/changes/<name>/specs/`:
|
||||
- Extract all requirements (marked with "### Requirement:")
|
||||
- For each requirement:
|
||||
- Search codebase for keywords related to the requirement
|
||||
- Assess if implementation likely exists
|
||||
- If requirements appear unimplemented:
|
||||
- Add CRITICAL issue: "Requirement not found: <requirement name>"
|
||||
- Recommendation: "Implement requirement X: <description>"
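
The Task Completion check above is mechanical enough to sketch in code. This is a minimal illustration, assuming tasks are plain markdown checkboxes. The `task_progress` and `completeness_issues` helpers and the issue dictionary layout are hypothetical, and accepting an uppercase `X` is an extra leniency not stated above.

```python
import re

# Matches "- [ ] text" and "- [x] text", optionally indented.
CHECKBOX = re.compile(r"^\s*- \[( |x|X)\]\s+(.*)$")


def task_progress(tasks_md: str) -> tuple[list[str], list[str]]:
    """Split checkbox lines into (complete, incomplete) task descriptions."""
    complete, incomplete = [], []
    for line in tasks_md.splitlines():
        match = CHECKBOX.match(line)
        if not match:
            continue
        target = incomplete if match.group(1) == " " else complete
        target.append(match.group(2).strip())
    return complete, incomplete


def completeness_issues(tasks_md: str) -> list[dict]:
    """One CRITICAL issue per incomplete task, with the recommendation text from step 5."""
    _, incomplete = task_progress(tasks_md)
    return [
        {
            "severity": "CRITICAL",
            "message": f"Incomplete task: {description}",
            "recommendation": f"Complete task: {description}",
        }
        for description in incomplete
    ]


sample_tasks = """\
## 1. Setup
- [x] 1.1 Create project skeleton
- [ ] 1.2 Add configuration loader
- [x] 1.3 Wire up CLI entry point
"""

done, todo = task_progress(sample_tasks)
print(f"{len(done)}/{len(done) + len(todo)} tasks complete")
for issue in completeness_issues(sample_tasks):
    print(issue["severity"], issue["recommendation"])
```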
|
||||
|
||||
6. **Verify Correctness**
|
||||
|
||||
**Requirement Implementation Mapping**:
|
||||
- For each requirement from delta specs:
|
||||
- Search codebase for implementation evidence
|
||||
- If found, note file paths and line ranges
|
||||
- Assess if implementation matches requirement intent
|
||||
- If divergence detected:
|
||||
- Add WARNING: "Implementation may diverge from spec: <details>"
|
||||
- Recommendation: "Review <file>:<lines> against requirement X"
|
||||
|
||||
**Scenario Coverage**:
|
||||
- For each scenario in delta specs (marked with "#### Scenario:"):
|
||||
- Check if conditions are handled in code
|
||||
- Check if tests exist covering the scenario
|
||||
- If scenario appears uncovered:
|
||||
- Add WARNING: "Scenario not covered: <scenario name>"
|
||||
- Recommendation: "Add test or implementation for scenario: <description>"
|
||||
|
||||
7. **Verify Coherence**
|
||||
|
||||
**Design Adherence**:
|
||||
- If design.md exists in contextFiles:
|
||||
- Extract key decisions (look for sections like "Decision:", "Approach:", "Architecture:")
|
||||
- Verify implementation follows those decisions
|
||||
- If contradiction detected:
|
||||
- Add WARNING: "Design decision not followed: <decision>"
|
||||
- Recommendation: "Update implementation or revise design.md to match reality"
|
||||
- If no design.md: Skip design adherence check, note "No design.md to verify against"
|
||||
|
||||
**Code Pattern Consistency**:
|
||||
- Review new code for consistency with project patterns
|
||||
- Check file naming, directory structure, coding style
|
||||
- If significant deviations found:
|
||||
- Add SUGGESTION: "Code pattern deviation: <details>"
|
||||
- Recommendation: "Consider following project pattern: <example>"
|
||||
|
||||
8. **Generate Verification Report**
|
||||
|
||||
**Summary Scorecard**:
|
||||
```
## Verification Report: <change-name>

### Summary
| Dimension    | Status            |
|--------------|-------------------|
| Completeness | X/Y tasks, N reqs |
| Correctness  | M/N reqs covered  |
| Coherence    | Followed/Issues   |
```
|
||||
|
||||
**Issues by Priority**:
|
||||
|
||||
1. **CRITICAL** (Must fix before archive):
|
||||
- Incomplete tasks
|
||||
- Missing requirement implementations
|
||||
- Each with specific, actionable recommendation
|
||||
|
||||
2. **WARNING** (Should fix):
|
||||
- Spec/design divergences
|
||||
- Missing scenario coverage
|
||||
- Each with specific recommendation
|
||||
|
||||
3. **SUGGESTION** (Nice to fix):
|
||||
- Pattern inconsistencies
|
||||
- Minor improvements
|
||||
- Each with specific recommendation
|
||||
|
||||
**Final Assessment**:
|
||||
- If CRITICAL issues: "X critical issue(s) found. Fix before archiving."
|
||||
- If only warnings: "No critical issues. Y warning(s) to consider. Ready for archive (with noted improvements)."
|
||||
- If all clear: "All checks passed. Ready for archive."
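
The three Final Assessment cases can be expressed as a small function over the issue counts. This is a sketch only: `final_assessment` is a hypothetical helper, and it assumes a report with nothing but SUGGESTION items counts as "all clear", which the rules above do not state either way.

```python
def final_assessment(critical: int, warnings: int) -> str:
    """Map issue counts to the final assessment wording used in the report."""
    if critical > 0:
        return f"{critical} critical issue(s) found. Fix before archiving."
    if warnings > 0:
        return (
            f"No critical issues. {warnings} warning(s) to consider. "
            "Ready for archive (with noted improvements)."
        )
    # Assumption: suggestions alone do not block the "all clear" message.
    return "All checks passed. Ready for archive."


print(final_assessment(critical=2, warnings=1))
print(final_assessment(critical=0, warnings=3))
print(final_assessment(critical=0, warnings=0))
```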
|
||||
|
||||
**Verification Heuristics**
|
||||
|
||||
- **Completeness**: Focus on objective checklist items (checkboxes, requirements list)
|
||||
- **Correctness**: Use keyword search, file path analysis, reasonable inference - don't require perfect certainty
|
||||
- **Coherence**: Look for glaring inconsistencies, don't nitpick style
|
||||
- **False Positives**: When uncertain, prefer SUGGESTION over WARNING, WARNING over CRITICAL
|
||||
- **Actionability**: Every issue must have a specific recommendation with file/line references where applicable
|
||||
|
||||
**Graceful Degradation**
|
||||
|
||||
- If only tasks.md exists: verify task completion only, skip spec/design checks
|
||||
- If tasks + specs exist: verify completeness and correctness, skip design
|
||||
- If full artifacts: verify all three dimensions
|
||||
- Always note which checks were skipped and why
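
One way to keep track of which checks run and which are skipped is to derive both from the set of artifacts that exist. The sketch below is an assumption-laden illustration: the artifact labels (`tasks`, `specs`, `design`), the check names, and the `plan_checks` helper are all chosen for the example rather than taken from the CLI.

```python
def plan_checks(artifacts: set[str]) -> tuple[list[str], dict[str, str]]:
    """Return (checks to run, {skipped check: reason}) for the available artifacts."""
    checks: list[str] = []
    skipped: dict[str, str] = {}

    if "tasks" in artifacts:
        checks.append("task completion")
    else:
        skipped["task completion"] = "no tasks artifact"

    if "specs" in artifacts:
        checks.extend(["spec coverage", "correctness"])
    else:
        skipped["spec coverage"] = "no delta specs"
        skipped["correctness"] = "no delta specs"

    if "design" in artifacts:
        checks.append("design adherence")
    else:
        skipped["design adherence"] = "no design.md"

    return checks, skipped


checks, skipped = plan_checks({"tasks", "specs"})
print("Run:", checks)
print("Skipped:", skipped)
```

Returning the skip reasons alongside the plan makes it easy to satisfy the last rule above: the report can list every skipped check and why.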
|
||||
|
||||
**Output Format**
|
||||
|
||||
Use clear markdown with:
|
||||
- Table for summary scorecard
|
||||
- Grouped lists for issues (CRITICAL/WARNING/SUGGESTION)
|
||||
- Code references in format: `file.ts:123`
|
||||
- Specific, actionable recommendations
|
||||
- No vague suggestions like "consider reviewing"
|
||||
@@ -0,0 +1,7 @@
|
||||
.venv/
|
||||
__pycache__/
|
||||
*.egg-info/
|
||||
*.pyc
|
||||
dist/
|
||||
build/
|
||||
.eggs/
|
||||
@@ -0,0 +1,53 @@
|
||||
# kb-search
|
||||
|
||||
CLI knowledge base with hybrid search (full-text + semantic vector search).
|
||||
|
||||
## Install
|
||||
|
||||
```bash
|
||||
pipx install kb-search
|
||||
```
|
||||
|
||||
## Quickstart
|
||||
|
||||
```bash
|
||||
# Initialise (downloads embedding model ~90MB)
|
||||
kb init
|
||||
|
||||
# Add documents
|
||||
kb add ~/docs/manual.pdf --tags admin
|
||||
kb add ~/notes/ --recursive
|
||||
kb add --note "Always restart nginx after config changes" --tags ops
|
||||
|
||||
# Search
|
||||
kb search "how to install git"
|
||||
kb search "deploy process" --tags ops --type pdf
|
||||
kb search "authentication" --format human
|
||||
|
||||
# Manage
|
||||
kb list --format human
|
||||
kb tags
|
||||
kb status
|
||||
```
|
||||
|
||||
## How it works
|
||||
|
||||
- **Ingestion**: Documents are chunked (PDFs via Docling, markdown by headers, code by AST/functions) and embedded locally
|
||||
- **Storage**: Everything in a single SQLite database (`~/.kb/kb.db`) using FTS5 for keyword search and sqlite-vec for vector search
|
||||
- **Search**: Hybrid retrieval combining BM25 keyword scoring and vector similarity via Reciprocal Rank Fusion
|
||||
- **Output**: JSON (for LLM tool use) or human-readable terminal format
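
For readers unfamiliar with Reciprocal Rank Fusion, here is a minimal sketch of the general technique: each result earns `1 / (k + rank)` from every ranking it appears in, and the sums are sorted. The constant `k = 60` is the value commonly used in the RRF literature and is an assumption here, as are the function name and input shape. This is not necessarily how kb-search implements it.

```python
def reciprocal_rank_fusion(
    rankings: dict[str, list[int]], k: int = 60
) -> list[tuple[int, float]]:
    """Fuse ranked lists of chunk IDs into one list sorted by summed RRF score."""
    scores: dict[int, float] = {}
    for ranked_ids in rankings.values():
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical chunk IDs, best match first in each list.
fused = reciprocal_rank_fusion(
    {
        "fts": [1, 2, 3],     # keyword (BM25) ranking
        "vector": [3, 1, 4],  # semantic similarity ranking
    }
)
for chunk_id, score in fused:
    print(chunk_id, round(score, 4))
```

Because only ranks are used, the BM25 and vector scores never need to be normalised against each other, which is the main appeal of this fusion method.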
|
||||
|
||||
## Configuration
|
||||
|
||||
Optional YAML config at `~/.kb/config.yaml`. Works with zero configuration.
|
||||
|
||||
```bash
|
||||
kb config # View current config
|
||||
kb config set chunking.pdf.max_tokens 2048 # Change a value
|
||||
```
|
||||
|
||||
ENV overrides: `KB_DATA_DIR`, `KB_MODEL`, `KB_DEFAULT_TOP`, `KB_DEFAULT_FORMAT`
|
||||
|
||||
## Claude Code Skill
|
||||
|
||||
This tool is designed to be wrapped as a Claude Code skill. See `SKILL.md` for the skill definition.
|
||||
@@ -0,0 +1,110 @@
|
||||
# kb-search skill
|
||||
|
||||
Search the user's personal knowledge base containing PDFs, markdown documents, code snippets, and text notes.
|
||||
|
||||
## When to use
|
||||
|
||||
- User asks a question that might be answered by their stored documents, notes, or code
|
||||
- User explicitly says "check my notes", "search kb", "look in my knowledge base", "what do my docs say about..."
|
||||
- User references documents or notes they've previously stored
|
||||
- User asks "how do I..." style questions that their knowledge base likely covers
|
||||
|
||||
## Available commands
|
||||
|
||||
### Search (primary)
|
||||
|
||||
```bash
|
||||
kb search "<query>" --top 10 --format json
|
||||
```
|
||||
|
||||
Returns JSON with ranked results combining full-text and semantic search.
|
||||
|
||||
**Flags:**
|
||||
- `--top N` — number of results (default: 10)
|
||||
- `--tags tag1,tag2` — filter by tags (AND logic)
|
||||
- `--type pdf|markdown|code|note` — filter by document type
|
||||
- `--format json|human` — output format (always use json)
|
||||
- `--fts-only` — keyword search only (skip semantic)
|
||||
- `--vec-only` — semantic search only (skip keyword)
|
||||
- `--threshold FLOAT` — minimum score cutoff
|
||||
|
||||
### Other useful commands
|
||||
|
||||
```bash
|
||||
kb list --format json # List all documents
|
||||
kb list --type pdf --format json # List only PDFs
|
||||
kb tags --format json # List tags with counts
|
||||
kb info <doc_id> --format json # Document details
|
||||
kb status --format json # DB stats
|
||||
```
|
||||
|
||||
## Output format (search)
|
||||
|
||||
```json
{
  "query": "how to install git",
  "results": [
    {
      "chunk_id": 1423,
      "score": 0.031,
      "score_breakdown": {"fts": 0.016, "vector": 0.015},
      "text": "To install the latest version of git from source...",
      "source": {
        "document_id": 42,
        "title": "Git Admin Guide",
        "path": "/home/user/docs/git-admin.pdf",
        "type": "pdf",
        "page": 12,
        "chunk_index": 3,
        "total_chunks": 28,
        "tags": ["git", "admin"]
      }
    }
  ],
  "total_matches": 47,
  "returned": 10
}
```
|
||||
|
||||
## How to answer
|
||||
|
||||
1. Run `kb search "<query>" --top 10 --format json`
|
||||
2. Read the returned chunks
|
||||
3. Synthesise a natural language answer from the top results
|
||||
4. **ALWAYS cite sources**: "According to [title] (p.X)..." or "From [title], section [header]..."
|
||||
5. If results have low scores (all below 0.01) or `returned: 0`, tell the user: "I couldn't find anything in your knowledge base about this"
|
||||
6. If initial results seem off-target, try refining the query and searching again
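
Steps 4 and 5 above can be sketched against the parsed search JSON. This is an illustrative helper only: `cite` and `has_useful_results` are hypothetical names, the field names follow the output format shown earlier, and the 0.01 cutoff is the one named in step 5.

```python
LOW_SCORE = 0.01  # cutoff from step 5; scores are relative, so treat it as a rough guide


def cite(result: dict) -> str:
    """Build a citation prefix from a result's source metadata."""
    source = result["source"]
    if "page" in source:
        return f"According to {source['title']} (p.{source['page']})"
    if "section_header" in source:
        return f"From {source['title']}, section {source['section_header']}"
    return f"From {source['title']}"


def has_useful_results(payload: dict, threshold: float = LOW_SCORE) -> bool:
    """False when nothing was returned or every score is below the threshold."""
    return any(r["score"] >= threshold for r in payload.get("results", []))


# Trimmed-down sample based on the output format above.
sample_payload = {
    "query": "how to install git",
    "results": [
        {
            "chunk_id": 1423,
            "score": 0.031,
            "text": "To install the latest version of git from source...",
            "source": {"title": "Git Admin Guide", "type": "pdf", "page": 12},
        }
    ],
    "returned": 1,
}

if has_useful_results(sample_payload):
    for result in sample_payload["results"]:
        print(cite(result) + ": " + result["text"])
else:
    print("I couldn't find anything in your knowledge base about this")
```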
|
||||
|
||||
## Multi-query strategy
|
||||
|
||||
For complex questions, search multiple times with different queries:
|
||||
|
||||
- Decompose the question into sub-queries
|
||||
- Run each query separately
|
||||
- Combine and deduplicate results across queries
|
||||
- Synthesise a unified answer citing all relevant sources
|
||||
|
||||
Example:
|
||||
```
|
||||
User: "What's the difference between git rebase and merge?"
|
||||
|
||||
Query 1: kb search "git rebase explanation" --top 5 --format json
|
||||
Query 2: kb search "git merge explanation" --top 5 --format json
|
||||
Query 3: kb search "git rebase vs merge" --top 5 --format json
|
||||
```
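
The combine-and-deduplicate step can be sketched in Python as follows. This is a minimal illustration, not part of the `kb` CLI: `merge_results` is a hypothetical helper, and it assumes each payload follows the search output format above (a `results` list whose items carry `chunk_id` and `score`).

```python
def merge_results(result_sets):
    """Merge several `kb search` JSON payloads, keeping one hit per chunk."""
    best = {}
    for payload in result_sets:
        for hit in payload.get("results", []):
            chunk_id = hit["chunk_id"]
            # Keep whichever copy of the chunk scored highest in its own query.
            if chunk_id not in best or hit["score"] > best[chunk_id]["score"]:
                best[chunk_id] = hit
    # Highest score first; scores come from different queries, so treat
    # this ordering as a rough guide rather than a strict ranking.
    return sorted(best.values(), key=lambda hit: hit["score"], reverse=True)
```

Because `score` is relative to each result set, the main benefit of merging is that every chunk is read and cited once, not that the combined order is exact.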
|
||||
|
||||
## Filtering
|
||||
|
||||
Use filters when the question implies a specific domain:
|
||||
|
||||
- Code question → `--type code`
|
||||
- From a specific topic → `--tags <topic>`
|
||||
- Check available tags first: `kb tags --format json`
|
||||
|
||||
## Important notes
|
||||
|
||||
- Always use `--format json` for machine parsing
|
||||
- The `score` field is relative, not absolute — compare scores within a result set
|
||||
- `source.page` is only present for PDF documents
|
||||
- `source.section_header` is only present for markdown documents with headers
|
||||
- Results are already ranked by relevance (hybrid FTS + vector search)
|
||||
Binary file not shown.
@@ -0,0 +1,2 @@
|
||||
schema: spec-driven
|
||||
created: 2026-03-22
|
||||
@@ -0,0 +1,396 @@
|
||||
## Context
|
||||
|
||||
This is a greenfield Python CLI project. No existing codebase, no migration concerns. The tool will live at `~/.kb/` on the user's machine and be installed via `pipx install kb-search`. It must work entirely offline after initial model download.
|
||||
|
||||
Primary consumer is Claude Code (or similar LLM tools) via a skill wrapper that calls `kb search` and feeds JSON results to the LLM for synthesis. Secondary consumer is the user directly in a terminal. This dual-consumer constraint means output must be machine-parseable first, human-readable second.
|
||||
|
||||
The document corpus is ~3,000 items (2,000 PDFs of varying complexity, 500 markdown/text notes, 500 code snippets) producing ~22,000 chunks. This is small enough that brute-force vector search is viable and SQLite is more than sufficient.
|
||||
|
||||
## Goals / Non-Goals
|
||||
|
||||
**Goals:**
|
||||
- Single-command install (`pipx install kb-search`) with `kb init` for model setup
|
||||
- Ingest heterogeneous documents with format-appropriate chunking
|
||||
- Hybrid search (keyword + semantic) with a single command
|
||||
- JSON output contract stable enough for skill integration
|
||||
- Configurable but works with zero configuration
|
||||
- All state in one SQLite file for easy backup/portability
|
||||
|
||||
**Non-Goals:**
|
||||
- LLM-based answer synthesis (the calling skill handles this)
|
||||
- Multi-user or networked access
|
||||
- Real-time / streaming ingestion
|
||||
- Web UI or TUI dashboard
|
||||
- Support for every possible document format (start with PDF, markdown, code, notes)
|
||||
- Clustering, deduplication, or automatic organisation of documents
|
||||
|
||||
## Decisions
|
||||
|
||||
### 1. Package Structure
|
||||
|
||||
```
|
||||
kb-search/
|
||||
├── pyproject.toml
|
||||
├── src/
|
||||
│ └── kb_search/
|
||||
│ ├── __init__.py
|
||||
│ ├── cli.py # Click CLI entry point
|
||||
│ ├── config.py # YAML config loading + ENV overrides
|
||||
│ ├── database.py # SQLite schema, migrations, connection
|
||||
│ ├── embeddings.py # Model download, loading, inference
|
||||
│ ├── search.py # Hybrid search + RRF merging
|
||||
│ ├── ingest/
|
||||
│ │ ├── __init__.py
|
||||
│ │ ├── detector.py # File type detection + routing
|
||||
│ │ ├── docling.py # Docling pipeline (PDF, DOCX, HTML, images)
|
||||
│ │ ├── markdown.py # Header-based markdown splitting
|
||||
│ │ ├── code.py # AST/regex code splitting
|
||||
│ │ └── note.py # Whole-document note handler
|
||||
│ └── output.py # JSON + human-readable formatters
|
||||
├── tests/
|
||||
└── SKILL.md # Claude Code skill definition
|
||||
```
|
||||
|
||||
**Why this structure:** Flat enough to navigate easily, but the `ingest/` subpackage isolates format-specific logic. Each ingestion module exports the same interface (`ingest(path, config) -> list[Chunk]`), making it easy to add formats later. Using `src/` layout per Python packaging best practices.
|
||||
|
||||
### 2. SQLite as Sole Storage Backend
|
||||
|
||||
All data lives in `~/.kb/kb.db`:
|
||||
|
||||
```sql
|
||||
-- Documents
|
||||
CREATE TABLE documents (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
title TEXT NOT NULL,
|
||||
source_path TEXT,
|
||||
content_hash TEXT NOT NULL, -- SHA-256 for dedup/change detection
|
||||
doc_type TEXT NOT NULL CHECK(doc_type IN ('pdf','markdown','code','note')),
|
||||
language TEXT, -- for code: 'python','bash','go'
|
||||
created_at TEXT DEFAULT (datetime('now')),
|
||||
metadata TEXT DEFAULT '{}' -- JSON: page_count, author, etc.
|
||||
);
|
||||
|
||||
-- Chunks
|
||||
CREATE TABLE chunks (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
|
||||
chunk_index INTEGER NOT NULL,
|
||||
text TEXT NOT NULL,
|
||||
token_count INTEGER,
|
||||
metadata TEXT DEFAULT '{}', -- JSON: page, section_header, symbol_name
|
||||
created_at TEXT DEFAULT (datetime('now'))
|
||||
);
|
||||
|
||||
-- FTS5 index (content-sync with chunks table)
|
||||
CREATE VIRTUAL TABLE chunks_fts USING fts5(
|
||||
text,
|
||||
content='chunks',
|
||||
content_rowid='id',
|
||||
tokenize='porter unicode61'
|
||||
);
|
||||
|
||||
-- Triggers to keep FTS in sync
|
||||
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
|
||||
INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
|
||||
END;
|
||||
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
|
||||
INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
|
||||
END;
|
||||
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
|
||||
INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
|
||||
INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
|
||||
END;
|
||||
|
||||
-- Vector storage (sqlite-vec)
|
||||
CREATE VIRTUAL TABLE chunks_vec USING vec0(
|
||||
chunk_id INTEGER PRIMARY KEY,
|
||||
embedding FLOAT[384] -- dimension matches model
|
||||
);
|
||||
|
||||
-- Tags
|
||||
CREATE TABLE tags (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
name TEXT UNIQUE NOT NULL
|
||||
);
|
||||
|
||||
CREATE TABLE document_tags (
|
||||
document_id INTEGER REFERENCES documents(id) ON DELETE CASCADE,
|
||||
tag_id INTEGER REFERENCES tags(id) ON DELETE CASCADE,
|
||||
PRIMARY KEY (document_id, tag_id)
|
||||
);
|
||||
|
||||
-- Config stored in DB (model binding)
|
||||
CREATE TABLE config (
|
||||
key TEXT PRIMARY KEY,
|
||||
value TEXT NOT NULL
|
||||
);
|
||||
-- Keys: schema_version, model_name, embedding_dim, model_max_tokens
|
||||
```
|
||||
|
||||
**Why SQLite for everything:** At ~22,000 chunks, SQLite handles FTS, vector search, and relational data without breaking a sweat. One file = trivial backup (`cp kb.db kb.db.bak`), no server process, no port conflicts. FTS5 is built into SQLite. sqlite-vec is a single loadable extension.
|
||||
|
||||
**Why store config in DB _and_ YAML:** The YAML file holds user preferences (chunking params, model choice). The DB `config` table records what the DB was _actually built with_ (model name, dimension). This separation lets us detect mismatches: "config says use nomic-embed-text but DB was built with all-MiniLM-L6-v2."
|
||||
|
||||
**Alternatives considered:**
|
||||
- ChromaDB/Qdrant: External services, overkill for this scale, breaks single-file story
|
||||
- DuckDB: Good at analytics, but FTS support is weaker than SQLite FTS5
|
||||
- LanceDB: Interesting but less mature, no FTS built in
|
||||
|
||||
### 3. Docling for Complex Document Ingestion
|
||||
|
||||
Docling handles PDF, DOCX, HTML, and image files through a unified pipeline with ML-based layout detection and table reconstruction.
|
||||
|
||||
**Why Docling over simpler extractors:** The 2,000 PDFs are "many and varied" — simple text extraction (pymupdf, pdfplumber) works for clean PDFs but silently produces garbage for complex layouts, tables, or multi-column documents. Docling's layout model correctly identifies structural elements, and its table reconstruction preserves data that would otherwise be lost. The quality difference matters because bad chunks → bad search results → useless tool.
|
||||
|
||||
**Docling configuration for this project:**
|
||||
- Use `pypdfium2` backend (default, fast for text-based PDFs)
|
||||
- Enable OCR only when needed (detect pages with no extractable text)
|
||||
- Use hierarchy-aware chunking (respects section/paragraph boundaries)
|
||||
- Disable image extraction (we're indexing text, not images)
|
||||
- Run with multiple workers for batch ingestion
|
||||
|
||||
**Model download:** Docling models (~1.5 GB) download on first use or via `kb init`. They are stored in HuggingFace's default cache (`~/.cache/huggingface/`), alongside the embedding model.
|
||||
|
||||
**Alternatives considered:**
|
||||
- pymupdf4llm: Fast, lightweight, but poor table/layout handling
|
||||
- Unstructured: Heavier than Docling, commercial focus, less predictable output
|
||||
- LlamaParse: Cloud-only, violates local-first constraint
|
||||
|
||||
### 4. Per-Type Chunking Strategy
|
||||
|
||||
Each document type gets a purpose-built chunker with configurable parameters:
|
||||
|
||||
**PDF (Docling):** Hierarchy-aware chunking. Docling's `HierarchicalChunker` splits at section/paragraph boundaries respecting the document's logical structure. Falls back to fixed-size if hierarchy detection fails.
|
||||
|
||||
**Markdown:** Header-based splitting. Split at `##` and `###` boundaries. Preserve parent header chain as context (so a chunk under "## Config > ### Advanced" carries that path). Merge small sections (< `min_tokens`) with their neighbor. Configurable: `min_tokens`, `max_tokens`.
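
As a rough sketch of the header-based splitting described above: the function below walks a markdown file line by line and emits one chunk per `##` or `###` section, carrying the parent header chain. The name `split_markdown` and the `(header_path, body)` return shape are assumptions for illustration; merging of sections below `min_tokens` and splitting above `max_tokens` are left out.

```python
import re

# Matches only level-2 and level-3 ATX headers ("## Title", "### Title").
HEADER = re.compile(r"^(#{2,3})\s+(.*\S)\s*$")

def split_markdown(text):
    """Split at ## / ### headers and return (header_path, body) pairs."""
    chunks, chain, body = [], [], []

    def flush():
        content = "\n".join(body).strip()
        if content:
            chunks.append((" > ".join(chain), content))
        body.clear()

    for line in text.splitlines():
        match = HEADER.match(line)
        if match is None:
            body.append(line)
            continue
        flush()                           # close the section that just ended
        depth = len(match.group(1)) - 2   # 0 for ##, 1 for ###
        del chain[depth:]                 # drop headers at this depth or deeper
        chain.append(match.group(2))
    flush()
    return chunks
```

Note that this naive version would also split on `##` lines inside fenced code blocks; a real implementation would need to track fence state.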
|
||||
|
||||
**Code (Python):** Use stdlib `ast` module. Each function and class becomes a chunk. Class methods include the class docstring for context. Top-level code between definitions becomes its own chunk.
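
A minimal sketch of the `ast`-based approach, assuming only top-level definitions are chunked. The helper name `chunk_python` is hypothetical; attaching class docstrings to individual methods and collecting the code between definitions are not shown.

```python
import ast

def chunk_python(source):
    """Return (symbol_name, code) pairs for top-level functions and classes."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment slices the original text using the node's
            # start and end positions (available from Python 3.8).
            chunks.append((node.name, ast.get_source_segment(source, node)))
    return chunks
```

As written, the segment starts at the `def` or `class` line, so decorator lines are not included; a fuller version would extend the span upward to cover decorators and leading comments.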
|
||||
|
||||
**Code (Bash):** Regex-based. Split on `function name() {` and `name() {` patterns with brace-depth counting. Comment blocks preceding a function attach to that function's chunk. Fall back to fixed-size windowed chunks if no functions detected.
|
||||
|
||||
**Code (Go):** Regex-based. Split on `func ` declarations. Type definitions with methods are grouped. Fall back to fixed-size if no recognisable boundaries.
|
||||
|
||||
**Notes:** Whole document = one chunk. Notes are small by definition.
|
||||
|
||||
**Configurable defaults (in `~/.kb/config.yaml`):**
|
||||
```yaml
|
||||
chunking:
|
||||
defaults:
|
||||
max_tokens: 512
|
||||
overlap_tokens: 50
|
||||
pdf:
|
||||
strategy: hierarchy # hierarchy | fixed
|
||||
max_tokens: 1024 # for fixed strategy fallback
|
||||
markdown:
|
||||
strategy: header # header | fixed
|
||||
min_tokens: 50 # merge sections smaller than this
|
||||
max_tokens: 1024
|
||||
code:
|
||||
strategy: ast # ast | fixed
|
||||
include_context: true # include class/module docstring with methods
|
||||
max_tokens: 1024
|
||||
note:
|
||||
strategy: whole
|
||||
```
|
||||
|
||||
### 5. Embedding Model Management
|
||||
|
||||
**Default model:** `all-MiniLM-L6-v2` (384 dimensions, 90 MB, good quality/speed tradeoff for CPU).
|
||||
|
||||
**Model loading:** Use `sentence-transformers` library which provides a unified API across models. Models stored in HuggingFace's default cache (`~/.cache/huggingface/`), shared with other tools that use HF models. No custom cache directory override.
|
||||
|
||||
**Model binding:** On `kb init`, the chosen model's name and dimension are written to the DB `config` table. Every subsequent `kb add` checks the loaded model matches the DB. Mismatch = hard error with clear message.
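
The binding check could look roughly like this. It is a sketch under the assumption that the `config` table holds `model_name` and `embedding_dim` rows as listed in the schema above; `check_model_binding` and `ModelMismatchError` are hypothetical names.

```python
import sqlite3

class ModelMismatchError(RuntimeError):
    """Raised when the loaded model differs from the one the DB was built with."""

def check_model_binding(conn, model_name, embedding_dim):
    rows = dict(conn.execute(
        "SELECT key, value FROM config WHERE key IN ('model_name', 'embedding_dim')"
    ))
    bound_name = rows.get("model_name")
    bound_dim = rows.get("embedding_dim")
    if bound_name is None or bound_dim is None:
        raise ModelMismatchError("database has no model binding; run `kb init` first")
    # Config values are stored as TEXT, so the dimension is parsed before comparing.
    if bound_name != model_name or int(bound_dim) != embedding_dim:
        raise ModelMismatchError(
            f"config says use {model_name} ({embedding_dim} dims) but DB was built "
            f"with {bound_name} ({bound_dim} dims); run `kb reindex` to switch models"
        )
```

Failing loudly here is what prevents vectors from two different models being mixed in `chunks_vec`.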
|
||||
|
||||
**Model switching (`kb reindex`):**
|
||||
1. Download new model
|
||||
2. Read all chunks from DB
|
||||
3. Re-embed in batches (with progress bar)
|
||||
4. Recreate `chunks_vec` table if dimension changed
5. Replace all vectors in `chunks_vec`
6. Update DB config (model_name, embedding_dim)
|
||||
|
||||
**ONNX Runtime for inference:** Use `sentence-transformers` with ONNX backend (`model = SentenceTransformer(model_name, backend="onnx")`). This gives us sentence-transformers' correct tokenization/pooling/normalization while using ONNX Runtime (~30 MB) instead of PyTorch (~200 MB) for inference. Models are automatically exported to ONNX format on first load. This keeps the install lightweight without sacrificing the convenience of the sentence-transformers API.
|
||||
|
||||
**Model compatibility:** All models on HuggingFace that work with `sentence-transformers` are supported. The only per-model differences handled in code:
|
||||
- Dimension (read from model config)
|
||||
- Max sequence length (read from model config, used to cap chunk size)
|
||||
- Query/passage prefixes (configurable in YAML, empty by default)
|
||||
|
||||
```yaml
|
||||
embedding:
|
||||
model: all-MiniLM-L6-v2
|
||||
query_prefix: "" # some models need "search_query: "
|
||||
passage_prefix: "" # some models need "search_document: "
|
||||
```
|
||||
|
||||
### 6. Hybrid Search with Reciprocal Rank Fusion
|
||||
|
||||
**Search flow:**
|
||||
|
||||
```
|
||||
Query: "how to install git"
|
||||
│
|
||||
├──▶ FTS5 query ──▶ BM25-ranked results (chunk_id, fts_score)
|
||||
│
|
||||
└──▶ Embed query ──▶ vec similarity search ──▶ (chunk_id, vec_score)
|
||||
(cosine distance, top-K)
|
||||
│
|
||||
▼
|
||||
Reciprocal Rank Fusion (RRF)
|
||||
score(d) = Σ 1/(k + rank_in_list) where k=60 (standard)
|
||||
│
|
||||
▼
|
||||
Merged results, sorted by RRF score
|
||||
│
|
||||
▼
|
||||
Apply filters (tags, doc_type) ──▶ Top-N results
|
||||
```
|
||||
|
||||
**Why RRF over learned re-ranking:** RRF is simple, has a single constant to set (k=60 is the conventional value), and performs well in practice. A learned re-ranker (e.g., a cross-encoder) would add another model download and slow down queries, and the marginal quality improvement is not worth that cost at this scale. RRF can be swapped out later if needed.
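
The fusion step itself is small. The sketch below implements the formula from the search flow, `score(d) = Σ 1/(k + rank)`, over plain lists of chunk ids ordered best-first; the function name `rrf_merge` and the input shape are assumptions for illustration.

```python
from collections import defaultdict

def rrf_merge(ranked_lists, k=60):
    """Fuse best-first lists of chunk ids with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            # A chunk missing from a list simply contributes nothing for it.
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

fts_ranking = [1423, 88, 17]   # hypothetical BM25 order
vec_ranking = [88, 501, 1423]  # hypothetical vector-similarity order
merged = rrf_merge([fts_ranking, vec_ranking])
```

With two input lists and k=60, no fused score can exceed 2/61 ≈ 0.033, so RRF scores are small numbers that are only meaningful relative to each other.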
|
||||
|
||||
**FTS5 query construction:** Pass the raw query string to FTS5. FTS5's porter stemmer handles basic normalisation. For queries with special characters, escape them. No query expansion or synonym handling — keep it simple.
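
One way to handle the escaping, sketched under the assumption that every whitespace-separated term should be matched literally: wrap each term in double quotes, which FTS5 treats as a string, and double any embedded quote. The helper name `to_fts5_query` is hypothetical.

```python
def to_fts5_query(raw):
    """Quote each whitespace-separated term so FTS5 reads it as a literal string."""
    terms = raw.split()
    # Inside an FTS5 string, a double quote is escaped by doubling it.
    return " ".join('"' + term.replace('"', '""') + '"' for term in terms)
```

Quoted terms still pass through the table's tokenizer, so stemming continues to apply. The trade-off is that FTS5 operators typed by the user (`OR`, `NEAR`, prefix `*`) lose their special meaning.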
|
||||
|
||||
**Vector search:** Embed the query with the same model used for chunks. Retrieve top-K (K = 3× requested results, to give RRF enough candidates). sqlite-vec does brute-force cosine similarity over all vectors — at 22K vectors this is ~2-5ms.
|
||||
|
||||
**Filter application:** Tag and type filters are applied as SQL WHERE clauses _before_ search where possible (for FTS5 via JOIN), or as post-filters on the merged results. This is a design choice per filter type:
|
||||
- Type filter: Applied in the SQL query (efficient)
|
||||
- Tag filter: Applied in the SQL query via JOIN (efficient)
|
||||
- Score threshold: Applied post-RRF as a cutoff
|
||||
|
||||
### 7. Output Format (Skill Contract)
|
||||
|
||||
**JSON output (`--format json`, default):**
|
||||
|
||||
```json
{
  "query": "how to install git",
  "results": [
    {
      "chunk_id": 1423,
      "score": 0.031,
      "score_breakdown": {"fts": 0.016, "vector": 0.015},
      "text": "To install the latest version of git from source...",
      "source": {
        "document_id": 42,
        "title": "Git Admin Guide",
        "path": "/home/user/docs/git-admin.pdf",
        "type": "pdf",
        "page": 12,
        "chunk_index": 3,
        "total_chunks": 28,
        "tags": ["git", "admin"]
      }
    }
  ],
  "total_matches": 47,
  "returned": 10
}
```

**Human output (`--format human`):**

```
Search: "how to install git" (47 matches, showing top 10)

1. [0.031] Git Admin Guide (p.12) [pdf] [git, admin]
   To install the latest version of git from source...

2. [0.016] setup-notes.md §Installation [markdown] [git]
   First, add the PPA repository for the latest git...
```
|
||||
|
||||
**Stability commitment:** The JSON schema is the contract with the skill. Fields may be _added_ but not removed or renamed once the skill is built.
|
||||
|
||||
### 8. Configuration Architecture
|
||||
|
||||
```
|
||||
Precedence (highest to lowest):
|
||||
1. CLI flags (--top, --tags, --format)
|
||||
2. Environment variables (KB_MODEL, KB_DATA_DIR, KB_DEFAULT_TOP)
|
||||
3. ~/.kb/config.yaml
|
||||
4. Built-in defaults
|
||||
|
||||
ENV variable naming: KB_ prefix + UPPER_SNAKE_CASE
|
||||
KB_DATA_DIR → ~/.kb/
|
||||
KB_MODEL → all-MiniLM-L6-v2
|
||||
KB_DEFAULT_TOP → 10
|
||||
```
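
To make the precedence concrete, here is a small sketch that resolves a single setting (the default result count) and reports where the value came from. The function name and the `(value, source)` return shape are assumptions; the same lookup order would apply to the other settings.

```python
import os

def resolve_default_top(cli_top=None, environ=None, yaml_cfg=None):
    """Resolve the result count: CLI flag > KB_DEFAULT_TOP > YAML > built-in."""
    environ = os.environ if environ is None else environ
    if cli_top is not None:
        return cli_top, "cli"
    if "KB_DEFAULT_TOP" in environ:
        # Environment values are always strings, so convert before use.
        return int(environ["KB_DEFAULT_TOP"]), "env"
    yaml_top = (yaml_cfg or {}).get("search", {}).get("default_top")
    if yaml_top is not None:
        return yaml_top, "yaml"
    return 10, "default"
```

Returning the source alongside the value makes it straightforward to show where each effective setting originated.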
|
||||
|
||||
**Full default config.yaml:**
|
||||
|
||||
```yaml
|
||||
# ~/.kb/config.yaml
|
||||
|
||||
data_dir: ~/.kb
|
||||
|
||||
embedding:
|
||||
model: all-MiniLM-L6-v2
|
||||
query_prefix: ""
|
||||
passage_prefix: ""
|
||||
|
||||
search:
|
||||
default_top: 10
|
||||
default_format: json
|
||||
rrf_k: 60
|
||||
|
||||
chunking:
|
||||
defaults:
|
||||
max_tokens: 512
|
||||
overlap_tokens: 50
|
||||
pdf:
|
||||
strategy: hierarchy
|
||||
max_tokens: 1024
|
||||
markdown:
|
||||
strategy: header
|
||||
min_tokens: 50
|
||||
max_tokens: 1024
|
||||
code:
|
||||
strategy: ast
|
||||
include_context: true
|
||||
max_tokens: 1024
|
||||
note:
|
||||
strategy: whole
|
||||
|
||||
ingestion:
|
||||
workers: 4 # parallel Docling workers
|
||||
batch_size: 50 # commit to DB every N documents
|
||||
enable_ocr: auto # auto | always | never
|
||||
```
|
||||
|
||||
### 9. CLI Framework: Click
|
||||
|
||||
**Why Click:** Mature, well-documented, supports nested command groups, automatic `--help` generation, parameter validation, and progress bars (via `click.progressbar`). Typer adds type-hint magic but offers less control, and argparse is too verbose for this many commands.
|
||||
|
||||
### 10. Error Handling and Resumability
|
||||
|
||||
**Batch ingestion must be resumable.** When adding a directory of 2,000 PDFs:
|
||||
- Each document is processed independently
|
||||
- On success: document + chunks inserted in a single transaction
|
||||
- On failure: error logged, document skipped, processing continues
|
||||
- `content_hash` (SHA-256 of file contents) enables skip-if-already-indexed
|
||||
- Progress shown via `click.progressbar` or `rich.progress`
|
||||
- Summary at end: "Added 1,847 documents. 12 failed. 141 skipped (already indexed)."
|
||||
|
||||
Failed documents are logged to `~/.kb/ingest-errors.log` with the file path and error for later investigation.
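
The skip-if-already-indexed check might be sketched like this, assuming the `documents.content_hash` column from the schema above. Both helper names are hypothetical.

```python
import hashlib
import sqlite3

def file_sha256(path, block_size=1 << 20):
    """Hash a file in blocks so large PDFs are never loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

def already_indexed(conn, content_hash):
    """True if a document with this content hash is already in the database."""
    row = conn.execute(
        "SELECT 1 FROM documents WHERE content_hash = ? LIMIT 1", (content_hash,)
    ).fetchone()
    return row is not None
```

Hashing the contents instead of the path means a renamed or moved file is still recognised, while an edited file is treated as new. An index on `content_hash` would keep the lookup cheap as the corpus grows.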
|
||||
|
||||
## Risks / Trade-offs
|
||||
|
||||
**[Docling model size] → Mitigation:** ~1.5 GB download on first init. Clear progress indication during download. Models are cached permanently in HuggingFace's default cache (`~/.cache/huggingface/`). Document this in `kb init` output and SKILL.md.
|
||||
|
||||
**[Docling ingestion speed on CPU] → Mitigation:** ~17 hours for 2,000 PDFs on CPU. Support parallel workers (configurable). Show per-document progress. Resumable by design (skip already-indexed). Suggest GPU if available. This is a one-time cost.
|
||||
|
||||
**[ONNX model export on first load] → Mitigation:** First time a model is loaded, sentence-transformers exports it to ONNX format. This takes 10-30 seconds and is cached for subsequent runs. Users see a one-time delay on first `kb add` or `kb search` after init. Show a clear message: "Optimising model for ONNX inference (one-time)..."
|
||||
|
||||
**[sqlite-vec maturity] → Mitigation:** sqlite-vec is relatively new. At 22K vectors, brute-force search means we're not relying on its ANN indexing. If sqlite-vec has issues, swapping to numpy cosine similarity over a stored blob column is straightforward — same DB, different query path.
|
||||
|
||||
**[FTS5 trigger sync] → Mitigation:** FTS5 content-sync triggers add write overhead. At our scale (inserts during ingestion, not real-time) this is negligible. If it becomes an issue, switch to manual sync with `INSERT INTO chunks_fts(chunks_fts) VALUES('rebuild')` after batch operations.
|
||||
|
||||
**[Model lock-in] → Mitigation:** Changing embedding models requires full reindex (~22K embeddings, ~10-30 minutes on CPU). `kb reindex` with progress bar makes this manageable. Model name stored in DB prevents silent mixing.
|
||||
|
||||
## Resolved Questions
|
||||
|
||||
1. **ONNX for inference from day one.** Use sentence-transformers with ONNX backend. Smaller install (~30 MB vs ~200 MB for PyTorch), faster CPU inference. No PyTorch dependency.
|
||||
|
||||
2. **HuggingFace default cache for models.** Both embedding and Docling models use `~/.cache/huggingface/`. Shared with other HF tools — no duplicate downloads if the user already has models cached.
|
||||
|
||||
3. **Manual schema migrations.** Version number in `config` table. `database.py` checks version on open and runs ALTER TABLE scripts sequentially. Simple enough for this project's schema complexity.
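
A sketch of what the sequential migration runner could look like. The `MIGRATIONS` table and its single entry are invented for illustration (no version 2 is defined in this design), and `migrate` is a hypothetical name; only the `config` table and the `schema_version` key come from the schema above.

```python
import sqlite3

# Hypothetical example: each version maps to the statements that reach it.
MIGRATIONS = {
    2: ["ALTER TABLE documents ADD COLUMN updated_at TEXT"],
}

def migrate(conn, target_version):
    """Apply pending migrations in order, recording each completed version."""
    row = conn.execute(
        "SELECT value FROM config WHERE key = 'schema_version'"
    ).fetchone()
    current = int(row[0]) if row else 1
    for version in range(current + 1, target_version + 1):
        conn.execute("BEGIN")  # explicit transaction so DDL and version bump commit together
        try:
            for statement in MIGRATIONS.get(version, []):
                conn.execute(statement)
            conn.execute(
                "INSERT OR REPLACE INTO config (key, value) VALUES ('schema_version', ?)",
                (str(version),),
            )
            conn.commit()
        except Exception:
            conn.rollback()
            raise
    return max(current, target_version)
```

Running each version step in its own transaction means an interrupted upgrade resumes from the last recorded version.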
|
||||
@@ -0,0 +1,39 @@
|
||||
## Why
|
||||
|
||||
There is no simple, local-first CLI tool for building a personal knowledge base across heterogeneous document types (PDFs, markdown, code snippets, text notes) with hybrid search that combines keyword matching and semantic understanding. Existing tools either require cloud services, lack semantic search, or can't handle the variety of document formats. This tool fills the gap — a retrieval engine that can be used standalone from the terminal or wrapped as an AI skill (e.g. Claude Code) where the LLM layer provides natural language synthesis over retrieved results.
|
||||
|
||||
## What Changes
|
||||
|
||||
- New Python CLI tool (`kb`) distributed via pipx (PyPI package: `kb-search`)
|
||||
- Ingestion pipeline with per-format handling:
|
||||
- **PDFs/DOCX/HTML/images**: Docling (layout-aware, table reconstruction, optional OCR)
|
||||
- **Markdown/text**: Header-based semantic splitting
|
||||
- **Code (Python, Bash, Go)**: AST/regex-based splitting at function/class boundaries
|
||||
- **Notes**: Inline text stored as whole-document chunks
|
||||
- Hybrid search combining SQLite FTS5 (BM25 keyword scoring) and sqlite-vec (vector similarity), merged via Reciprocal Rank Fusion
|
||||
- Local embedding models downloaded from HuggingFace on first run (`kb init`), with multi-model support and full reindex capability when switching models
|
||||
- Document tagging system for manual categorisation and filtered search
|
||||
- Structured JSON output designed for LLM skill consumption, plus human-readable terminal output
|
||||
- Configurable chunking parameters per document type with sensible defaults
|
||||
- All state in a single SQLite database (`~/.kb/kb.db`)
|
||||
- Configuration via YAML (`~/.kb/config.yaml`) with ENV variable overrides
|
||||
|
||||
## Capabilities
|
||||
|
||||
### New Capabilities
|
||||
- `document-ingestion`: Ingest PDFs, markdown, code, and text notes into chunked, embedded, searchable storage. Handles format detection, per-type chunking strategies, Docling pipeline for complex documents, and resumable batch imports.
|
||||
- `hybrid-search`: Hybrid retrieval combining FTS5 full-text search and sqlite-vec vector similarity via Reciprocal Rank Fusion. Supports tag/type filtering, configurable result counts, score thresholds, and JSON/human output formats.
|
||||
- `embedding-management`: Local embedding model lifecycle — download on init, bind model to database, detect mismatches, and full re-embedding via reindex when switching models.
|
||||
- `document-management`: CRUD operations on the document store — list, inspect, remove documents. Tag management (add/remove tags, filter by tags, list tags with counts).
|
||||
- `configuration`: YAML-based configuration with per-document-type chunking parameters, model selection, and ENV variable overrides. Sensible defaults that work without any config file.
|
||||
- `skill-interface`: Structured JSON output contract designed for LLM skill consumption — chunks with scores, source metadata, and provenance for citation.
|
||||
|
||||
### Modified Capabilities
|
||||
_(none — greenfield project)_
|
||||
|
||||
## Impact
|
||||
|
||||
- **Dependencies**: Docling (~1.5 GB models), sentence-transformers with ONNX Runtime backend, sqlite-vec, Click
|
||||
- **Storage**: `~/.kb/` directory containing the SQLite database and config file; downloaded models (~1.6 GB on init) live in HuggingFace's default cache (`~/.cache/huggingface/`). The database grows with content.
|
||||
- **First-run experience**: `kb init` required before use to download models. Batch ingestion of 2,000 PDFs estimated at ~17 hours CPU / ~3 hours GPU (one-time cost, resumable)
|
||||
- **External integration**: Designed to be wrapped as a Claude Code skill — the skill definition (SKILL.md) is a deliverable alongside the code
|
||||
@@ -0,0 +1,72 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: YAML configuration file
|
||||
The system SHALL read configuration from `~/.kb/config.yaml`. If the file does not exist, the system SHALL use built-in defaults. The configuration file SHALL be optional — the tool MUST work with zero configuration.
|
||||
|
||||
#### Scenario: No config file
|
||||
- **WHEN** `~/.kb/config.yaml` does not exist
|
||||
- **THEN** the system uses built-in defaults for all settings and operates normally
|
||||
|
||||
#### Scenario: Partial config file
|
||||
- **WHEN** `~/.kb/config.yaml` exists but only specifies `chunking.pdf.max_tokens: 2048`
|
||||
- **THEN** the system uses built-in defaults for all other settings, overriding only `chunking.pdf.max_tokens`
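
The partial-config behaviour amounts to a recursive merge of the user's YAML over the built-in defaults. A minimal sketch, assuming both are already parsed into nested dictionaries (`deep_merge` is a hypothetical helper name):

```python
import copy

def deep_merge(defaults, overrides):
    """Return defaults with overrides applied, recursing into nested dicts."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge key by key rather than replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```

A shallow `dict.update` would not be enough here, because overriding `chunking.pdf.max_tokens` alone would otherwise discard the rest of the `chunking` section.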
|
||||
|
||||
#### Scenario: Invalid config file
|
||||
- **WHEN** `~/.kb/config.yaml` contains invalid YAML
|
||||
- **THEN** the system prints a clear error message identifying the YAML syntax issue and exits with non-zero status
|
||||
|
||||
### Requirement: Environment variable overrides
|
||||
The system SHALL support environment variable overrides with the prefix `KB_`. ENV variables SHALL take precedence over the YAML config file. Supported variables: `KB_DATA_DIR`, `KB_MODEL`, `KB_DEFAULT_TOP`, `KB_DEFAULT_FORMAT`.
|
||||
|
||||
#### Scenario: Override data directory
|
||||
- **WHEN** `KB_DATA_DIR=/tmp/test-kb` is set
|
||||
- **THEN** the system uses `/tmp/test-kb/` instead of `~/.kb/` for the database and config
|
||||
|
||||
#### Scenario: Override model
|
||||
- **WHEN** `KB_MODEL=nomic-embed-text` is set
|
||||
- **THEN** the system uses `nomic-embed-text` as the embedding model, overriding the YAML config
|
||||
|
||||
#### Scenario: ENV overrides YAML
|
||||
- **WHEN** YAML config has `search.default_top: 10` and `KB_DEFAULT_TOP=20` is set
|
||||
- **THEN** the default top value is 20
|
||||
|
||||
### Requirement: Configuration precedence
|
||||
The system SHALL apply configuration in this order (highest to lowest precedence): CLI flags, environment variables, YAML config file, built-in defaults.
|
||||
|
||||
#### Scenario: CLI flag overrides everything
|
||||
- **WHEN** YAML config has `search.default_top: 10`, ENV has `KB_DEFAULT_TOP=20`, and user runs `kb search "test" --top 5`
|
||||
- **THEN** 5 results are returned
|
||||
|
||||
### Requirement: View and set configuration
|
||||
The system SHALL support viewing the current effective configuration via `kb config` and setting individual values via `kb config set <key> <value>`.
|
||||
|
||||
#### Scenario: View configuration
|
||||
- **WHEN** user runs `kb config`
|
||||
- **THEN** the system displays the fully resolved configuration (defaults merged with YAML merged with ENV), indicating the source of each value
|
||||
|
||||
#### Scenario: Set a config value
|
||||
- **WHEN** user runs `kb config set chunking.pdf.max_tokens 2048`
|
||||
- **THEN** the value is written to `~/.kb/config.yaml`, creating the file if necessary
|
||||
|
||||
### Requirement: Configurable chunking parameters
|
||||
The system SHALL support per-document-type chunking configuration with sensible defaults.
|
||||
|
||||
#### Scenario: Default chunking for PDF
|
||||
- **WHEN** no chunking config is specified for PDF
|
||||
- **THEN** the system uses `strategy: hierarchy, max_tokens: 1024`
|
||||
|
||||
#### Scenario: Default chunking for markdown
|
||||
- **WHEN** no chunking config is specified for markdown
|
||||
- **THEN** the system uses `strategy: header, min_tokens: 50, max_tokens: 1024`
|
||||
|
||||
#### Scenario: Default chunking for code
|
||||
- **WHEN** no chunking config is specified for code
|
||||
- **THEN** the system uses `strategy: ast, include_context: true, max_tokens: 1024`
|
||||
|
||||
#### Scenario: Default chunking for notes
|
||||
- **WHEN** no chunking config is specified for notes
|
||||
- **THEN** the system uses `strategy: whole`
|
||||
|
||||
#### Scenario: Custom chunking overrides
|
||||
- **WHEN** YAML config specifies `chunking.pdf.strategy: fixed` and `chunking.pdf.max_tokens: 512`
|
||||
- **THEN** PDFs are chunked with fixed-size windows of 512 tokens instead of hierarchy-aware chunking
|
||||
@@ -0,0 +1,125 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: File type detection and routing
|
||||
The system SHALL detect the type of a file being ingested and route it to the appropriate ingestion pipeline. Detection SHALL be based on file extension. Supported types: PDF (`.pdf`), DOCX (`.docx`), HTML (`.html`, `.htm`), Markdown (`.md`, `.markdown`, `.txt`), Code (`.py`, `.sh`, `.bash`, `.go`), and image files (`.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`, `.webp`). The user MAY override detection with `--type` and `--language` flags.
|
||||
|
||||
#### Scenario: Auto-detect PDF file
|
||||
- **WHEN** user runs `kb add report.pdf`
|
||||
- **THEN** the file is routed to the Docling ingestion pipeline
|
||||
|
||||
#### Scenario: Auto-detect Python code
|
||||
- **WHEN** user runs `kb add script.py`
|
||||
- **THEN** the file is routed to the code ingestion pipeline with language set to `python`
|
||||
|
||||
#### Scenario: Override type detection
|
||||
- **WHEN** user runs `kb add data.txt --type code --language bash`
|
||||
- **THEN** the file is routed to the code pipeline as Bash, regardless of the `.txt` extension
|
||||
|
||||
#### Scenario: Unsupported file type
|
||||
- **WHEN** user runs `kb add archive.zip`
|
||||
- **THEN** the system SHALL print an error message listing supported formats and exit with non-zero status
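
Extension-based routing with overrides could be sketched as below. The pipeline labels (`"docling"`, `"markdown"`, `"code"`), the function name `route`, and the exception class are assumptions made for illustration; the extension lists are taken from the requirement above.

```python
from pathlib import Path

EXTENSION_ROUTES = {
    ".pdf": "docling", ".docx": "docling", ".html": "docling", ".htm": "docling",
    ".png": "docling", ".jpg": "docling", ".jpeg": "docling",
    ".tiff": "docling", ".bmp": "docling", ".webp": "docling",
    ".md": "markdown", ".markdown": "markdown", ".txt": "markdown",
    ".py": "code", ".sh": "code", ".bash": "code", ".go": "code",
}
CODE_LANGUAGES = {".py": "python", ".sh": "bash", ".bash": "bash", ".go": "go"}

class UnsupportedFileType(ValueError):
    """Raised when no pipeline is known for a file and no override is given."""

def route(path, type_override=None, language_override=None):
    """Return (pipeline, language) for a file, honouring explicit overrides."""
    suffix = Path(path).suffix.lower()
    pipeline = type_override or EXTENSION_ROUTES.get(suffix)
    if pipeline is None:
        supported = ", ".join(sorted(EXTENSION_ROUTES))
        raise UnsupportedFileType(
            f"unsupported file type {suffix!r}; supported: {supported}"
        )
    language = None
    if pipeline == "code":
        language = language_override or CODE_LANGUAGES.get(suffix)
    return pipeline, language
```

Keeping the mapping in a single dictionary also gives the error path its list of supported formats for free.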
|
||||
|
||||
### Requirement: Docling pipeline for complex documents
|
||||
The system SHALL use Docling to ingest PDF, DOCX, HTML, and image files. The pipeline SHALL use the `pypdfium2` backend for PDFs, enable layout model for structural detection, and enable table reconstruction. OCR SHALL be configurable: `auto` (detect pages with no extractable text and OCR those), `always`, or `never`.
|
||||
|
||||
#### Scenario: Ingest a text-based PDF
|
||||
- **WHEN** user runs `kb add manual.pdf`
|
||||
- **THEN** the system extracts text using Docling with layout detection, produces hierarchy-aware chunks preserving section structure, embeds each chunk, and stores the document with all chunks in the database
|
||||
|
||||
#### Scenario: Ingest a PDF with tables
|
||||
- **WHEN** user ingests a PDF containing data tables
|
||||
- **THEN** Docling's table reconstruction SHALL produce chunks where table content is represented as structured text (markdown table format) rather than garbled column fragments
|
||||
|
||||
#### Scenario: Ingest a scanned PDF with OCR auto mode
|
||||
- **WHEN** user ingests a PDF where some pages contain only images (no extractable text) and OCR is set to `auto`
|
||||
- **THEN** the system SHALL detect the image-only pages and apply OCR to those pages only, leaving text-extractable pages processed normally
|
||||
|
||||
#### Scenario: Ingest an image file
|
||||
- **WHEN** user runs `kb add diagram.png`
|
||||
- **THEN** the system SHALL process it through Docling with OCR enabled, extracting any text content from the image
|
||||
|
||||
### Requirement: Markdown ingestion with header-based splitting
|
||||
The system SHALL split markdown and text files at header boundaries (`##`, `###`). Each chunk SHALL include its parent header chain as context. Sections smaller than `min_tokens` SHALL be merged with the following section. Sections larger than `max_tokens` SHALL be split at paragraph boundaries with configurable overlap.
|
||||
|
||||
#### Scenario: Split markdown at headers
|
||||
- **WHEN** user runs `kb add guide.md` and the file contains multiple `##` sections
|
||||
- **THEN** each section becomes a separate chunk, with the header text included in the chunk
|
||||
|
||||
#### Scenario: Preserve header hierarchy
|
||||
- **WHEN** a markdown file has nested headers like `## Config` > `### Advanced Options`
|
||||
- **THEN** the chunk for "Advanced Options" SHALL include context indicating it falls under "Config > Advanced Options"
|
||||
|
||||
#### Scenario: Merge small sections
|
||||
- **WHEN** a markdown section contains fewer tokens than `min_tokens` (default: 50)
|
||||
- **THEN** it SHALL be merged with the next section into a single chunk
|
||||
|
||||
#### Scenario: Plain text file without headers
|
||||
- **WHEN** user runs `kb add notes.txt` and the file has no markdown headers
|
||||
- **THEN** the system SHALL fall back to fixed-size chunking with configurable `max_tokens` and `overlap_tokens`
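
Header splitting with a parent header chain can be sketched as below. This is an illustration under simplifying assumptions: it ignores `min_tokens` merging and `max_tokens` splitting, does not special-case `#` lines inside fenced code, and the chunk dictionary shape (`context`, `text`) is invented for the example.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_markdown(text: str, split_levels=(2, 3)) -> list[dict]:
    """Split at ## and ### headers, recording each chunk's header chain."""
    chunks: list[dict] = []
    chain: list[tuple[int, str]] = []  # (level, title) of the open headers
    body: list[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            context = " > ".join(title for _, title in chain)
            chunks.append({"context": context, "text": content})
        body.clear()

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match and len(match.group(1)) in split_levels:
            flush()  # close the previous section under its own chain
            level = len(match.group(1))
            # A new header closes every open header at the same or deeper level.
            while chain and chain[-1][0] >= level:
                chain.pop()
            chain.append((level, match.group(2)))
        body.append(line)  # the header line stays inside its chunk
    flush()
    return chunks
```

For a file with `## Config` followed by `### Advanced Options`, the second chunk carries the context string `Config > Advanced Options`. A file with no matching headers comes back as a single chunk, which a caller could treat as the signal to fall back to fixed-size chunking.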
|
||||
|
||||
### Requirement: Code ingestion with AST/regex splitting
|
||||
The system SHALL split code files at function and class boundaries. Python files SHALL use the `ast` module. Bash and Go files SHALL use regex-based splitting. When `include_context` is enabled (default), class methods SHALL include the class docstring/signature for context. Files with no recognisable function/class boundaries SHALL fall back to fixed-size chunking.
|
||||
|
||||
#### Scenario: Python file with functions and classes
|
||||
- **WHEN** user runs `kb add auth.py` and the file contains a class with methods
|
||||
- **THEN** each method becomes a chunk, and each chunk includes the class name and docstring as context
|
||||
|
||||
#### Scenario: Bash script with functions
|
||||
- **WHEN** user runs `kb add deploy.sh` and the file contains `function deploy() {` blocks
|
||||
- **THEN** each function becomes a separate chunk, including any preceding comment block
|
||||
|
||||
#### Scenario: Go file with functions
|
||||
- **WHEN** user runs `kb add main.go` and the file contains `func` declarations
|
||||
- **THEN** each function becomes a separate chunk
|
||||
|
||||
#### Scenario: Code file with no functions
|
||||
- **WHEN** user runs `kb add script.sh` and the file has no function declarations
|
||||
- **THEN** the system SHALL fall back to fixed-size chunking with `max_tokens` and `overlap_tokens`
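
For the Python case, the standard-library `ast` module is enough to find function and method boundaries. The sketch below is one possible shape, not the project's parser: the name `split_python`, the chunk dictionary, and the format of the context string are assumptions.

```python
import ast


def split_python(source: str) -> list[dict]:
    """Return one chunk per top-level function and per class method."""
    tree = ast.parse(source)
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    chunks = []
    for node in tree.body:
        if isinstance(node, func_types):
            chunks.append({
                "name": node.name,
                "context": "",
                "text": ast.get_source_segment(source, node),
            })
        elif isinstance(node, ast.ClassDef):
            doc = ast.get_docstring(node) or ""
            # Class name and docstring travel with every method chunk.
            context = f"class {node.name}" + (f": {doc}" if doc else "")
            for item in node.body:
                if isinstance(item, func_types):
                    chunks.append({
                        "name": f"{node.name}.{item.name}",
                        "context": context,
                        "text": ast.get_source_segment(source, item, padded=True),
                    })
    # An empty list tells the caller to fall back to fixed-size chunking.
    return chunks
```

A file that contains only module-level statements produces an empty list, which matches the fallback scenario above.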
|
||||
|
||||
### Requirement: Inline note ingestion
|
||||
The system SHALL support adding text notes directly from the command line via `kb add --note "text"`. Notes SHALL be stored as a single chunk (no splitting). Notes MAY have an optional `--title` for display purposes.
|
||||
|
||||
#### Scenario: Add an inline note
|
||||
- **WHEN** user runs `kb add --note "Always restart nginx after config changes" --title "nginx reminder"`
|
||||
- **THEN** a document of type `note` is created with the title "nginx reminder", and the full text becomes a single chunk
|
||||
|
||||
#### Scenario: Add a note without title
|
||||
- **WHEN** user runs `kb add --note "some text"`
|
||||
- **THEN** the system SHALL use the first 80 characters of the text (truncated at a word boundary) as the title
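
One way to derive the automatic title is shown below. The function name `auto_title` is hypothetical, and collapsing runs of whitespace before truncating is an assumption the spec does not state.

```python
def auto_title(text: str, limit: int = 80) -> str:
    """Derive a title from note text, cutting at a word boundary."""
    collapsed = " ".join(text.split())  # assumption: normalise whitespace first
    if len(collapsed) <= limit:
        return collapsed
    cut = collapsed[:limit]
    # If the cut lands mid-word, back up to the previous space when there is one.
    if collapsed[limit] != " " and " " in cut:
        cut = cut[:cut.rfind(" ")]
    return cut.rstrip()
```

Text made of a single word longer than the limit has no boundary to back up to, so this sketch falls back to a hard cut at `limit` characters.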
|
||||
|
||||
### Requirement: Deduplication via content hash
|
||||
The system SHALL compute a SHA-256 hash of each file's contents before ingestion. If a document with the same `content_hash` already exists in the database, the file SHALL be skipped with a message indicating it is already indexed.
|
||||
|
||||
#### Scenario: Add a file that is already indexed
|
||||
- **WHEN** user runs `kb add report.pdf` and the file's SHA-256 matches an existing document
|
||||
- **THEN** the system SHALL print "Skipped: report.pdf (already indexed)" and not create a duplicate
|
||||
|
||||
#### Scenario: Add a modified version of an existing file
|
||||
- **WHEN** user runs `kb add report.pdf` and the file has changed since last indexed (different hash)
|
||||
- **THEN** the system SHALL ingest it as a new document (the old version remains unless manually removed)
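
The hash check can be sketched with `hashlib` and `sqlite3`. The `documents` table here has only the columns the example needs, the `add_file` helper is hypothetical, and the "Added:" message is invented for illustration (only the "Skipped:" message comes from the scenario above).

```python
import hashlib
import sqlite3


def add_file(conn: sqlite3.Connection, name: str, data: bytes) -> str:
    """Insert a document unless its SHA-256 is already present."""
    digest = hashlib.sha256(data).hexdigest()
    exists = conn.execute(
        "SELECT 1 FROM documents WHERE content_hash = ? LIMIT 1", (digest,)
    ).fetchone()
    if exists:
        return f"Skipped: {name} (already indexed)"
    with conn:  # commit on success, roll back on error
        conn.execute(
            "INSERT INTO documents (title, content_hash) VALUES (?, ?)",
            (name, digest),
        )
    return f"Added: {name}"
```

Because the lookup is keyed on content rather than path, a modified `report.pdf` gets a new row and the earlier version stays in place, as the second scenario requires.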
|
||||
|
||||
### Requirement: Batch ingestion with progress and resumability
|
||||
The system SHALL support ingesting entire directories via `kb add <dir> --recursive`. Processing SHALL be resumable — files already indexed (by content hash) are skipped. Failed files SHALL be logged and skipped without aborting the batch. A summary SHALL be displayed at completion.
|
||||
|
||||
#### Scenario: Ingest a directory
|
||||
- **WHEN** user runs `kb add ~/docs/ --recursive`
|
||||
- **THEN** the system recursively finds all supported files, processes each one, skips duplicates, logs failures, and displays a summary: "Added X documents. Y failed. Z skipped (already indexed)."
|
||||
|
||||
#### Scenario: Resume after interruption
|
||||
- **WHEN** a batch ingestion is interrupted (Ctrl+C, crash) and user reruns the same command
|
||||
- **THEN** already-indexed files are skipped via content hash, and processing continues with remaining files
|
||||
|
||||
#### Scenario: Failed file during batch
|
||||
- **WHEN** a single file fails to process (corrupt PDF, encoding error)
|
||||
- **THEN** the error is logged to `~/.kb/ingest-errors.log` with the file path and error message, and processing continues with the next file
|
||||
|
||||
### Requirement: Parallel ingestion workers
|
||||
The system SHALL support parallel document processing via configurable worker count (default: 4). Docling's `DocumentConverter` SHALL be used with multiple workers for PDF/DOCX/HTML ingestion. Database writes SHALL be serialised to avoid SQLite locking issues.
|
||||
|
||||
#### Scenario: Parallel PDF ingestion
|
||||
- **WHEN** user runs `kb add ~/pdfs/ --recursive` with `workers: 4` in config
|
||||
- **THEN** up to 4 documents are processed concurrently through Docling, with chunks written to the database sequentially
|
||||
|
||||
#### Scenario: Override worker count
|
||||
- **WHEN** user runs `kb add ~/pdfs/ --recursive --workers 1`
|
||||
- **THEN** documents are processed sequentially with a single worker
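
A common way to combine concurrent processing with serialised writes is to let workers only produce chunks and keep every database write on the thread that owns the connection. The sketch below uses a thread pool and a stand-in `process` callable; whether the real pipeline uses threads or processes, and the `chunks(path, text)` table shape, are assumptions.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor, as_completed


def ingest_parallel(paths, process, conn: sqlite3.Connection, workers: int = 4):
    """Process files concurrently; write results one document at a time."""
    added = failed = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                chunks = future.result()
            except Exception:
                failed += 1  # a real CLI would also log path and error
                continue
            # Only this thread touches the connection, so writes never overlap.
            with conn:
                conn.executemany(
                    "INSERT INTO chunks (path, text) VALUES (?, ?)",
                    [(path, chunk) for chunk in chunks],
                )
            added += 1
    return added, failed
```

Passing `workers=1` gives the sequential behaviour of the override scenario without a separate code path.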
|
||||
@@ -0,0 +1,80 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: List documents
|
||||
The system SHALL list all indexed documents via `kb list`. Results SHALL include document ID, title, type, tags, chunk count, and creation date. Output SHALL support `--format json` and `--format human`.
|
||||
|
||||
#### Scenario: List all documents
|
||||
- **WHEN** user runs `kb list`
|
||||
- **THEN** all documents are listed with their ID, title, type, tags, chunk count, and creation date
|
||||
|
||||
#### Scenario: Filter by type
|
||||
- **WHEN** user runs `kb list --type pdf`
|
||||
- **THEN** only PDF documents are listed
|
||||
|
||||
#### Scenario: Filter by tags
|
||||
- **WHEN** user runs `kb list --tags admin,ops`
|
||||
- **THEN** only documents tagged with BOTH "admin" AND "ops" are listed
|
||||
|
||||
#### Scenario: Empty database
|
||||
- **WHEN** user runs `kb list` with no documents indexed
|
||||
- **THEN** the system prints "No documents indexed. Run `kb add` to get started." and exits with zero status
|
||||
|
||||
### Requirement: Document info
|
||||
The system SHALL display detailed information about a single document via `kb info <doc_id>`, including all metadata, tags, chunk count, and chunk previews (first 100 characters of each chunk).
|
||||
|
||||
#### Scenario: View document info
|
||||
- **WHEN** user runs `kb info 42`
|
||||
- **THEN** the system displays: title, source path, type, language (if code), content hash, creation date, tags, total chunks, and a preview of each chunk
|
||||
|
||||
#### Scenario: Invalid document ID
|
||||
- **WHEN** user runs `kb info 9999` and no document with ID 9999 exists
|
||||
- **THEN** the system prints "Document not found: 9999" and exits with non-zero status
|
||||
|
||||
### Requirement: Remove document
|
||||
The system SHALL remove a document and all its associated chunks, embeddings, and tag associations via `kb remove <doc_id>`. The system SHALL ask for confirmation before deletion unless `--yes` is passed.
|
||||
|
||||
#### Scenario: Remove with confirmation
|
||||
- **WHEN** user runs `kb remove 42`
|
||||
- **THEN** the system displays the document title and asks "Remove 'Git Admin Guide' and its 28 chunks? [y/N]". On confirmation, the document, its chunks, FTS entries, vector embeddings, and tag associations are deleted.
|
||||
|
||||
#### Scenario: Remove with --yes flag
|
||||
- **WHEN** user runs `kb remove 42 --yes`
|
||||
- **THEN** the document is removed without confirmation prompt
|
||||
|
||||
#### Scenario: Cascading delete
|
||||
- **WHEN** a document is removed
|
||||
- **THEN** all rows in `chunks`, `chunks_fts`, `chunks_vec`, and `document_tags` referencing that document SHALL be deleted
|
||||
|
||||
### Requirement: Tag management
|
||||
The system SHALL support adding and removing tags on documents via `kb tag <doc_id> --add tag1,tag2` and `kb tag <doc_id> --remove tag1`. Tags are case-insensitive and stored lowercase. The system SHALL list all tags with document counts via `kb tags`.
|
||||
|
||||
#### Scenario: Add tags to a document
|
||||
- **WHEN** user runs `kb tag 42 --add git,admin`
|
||||
- **THEN** the tags "git" and "admin" are associated with document 42. Tags are created if they don't exist.
|
||||
|
||||
#### Scenario: Remove a tag from a document
|
||||
- **WHEN** user runs `kb tag 42 --remove admin`
|
||||
- **THEN** the "admin" tag association is removed from document 42. The tag itself remains in the tags table if other documents use it.
|
||||
|
||||
#### Scenario: List all tags
|
||||
- **WHEN** user runs `kb tags`
|
||||
- **THEN** the system lists all tags with the count of documents using each tag, sorted by count descending
|
||||
|
||||
#### Scenario: Tag on ingestion
|
||||
- **WHEN** user runs `kb add report.pdf --tags compliance,q1`
|
||||
- **THEN** the document is ingested and immediately tagged with "compliance" and "q1"
|
||||
|
||||
#### Scenario: Tags in JSON format
|
||||
- **WHEN** user runs `kb tags --format json`
|
||||
- **THEN** output is a JSON array of objects: `[{"name": "git", "count": 15}, ...]`
|
||||
|
||||
### Requirement: Database status
|
||||
The system SHALL report database statistics via `kb status`, including: total documents (by type), total chunks, database file size, active model name and dimension, and schema version.
|
||||
|
||||
#### Scenario: Show status
|
||||
- **WHEN** user runs `kb status`
|
||||
- **THEN** the system displays: document counts by type, total chunks, DB file size, model name, embedding dimension, and schema version
|
||||
|
||||
#### Scenario: Status before init
|
||||
- **WHEN** user runs `kb status` before `kb init`
|
||||
- **THEN** the system prints "Knowledge base not initialised. Run `kb init` first." and exits with non-zero status
|
||||
@@ -0,0 +1,57 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Model initialisation
|
||||
The system SHALL download the embedding model on `kb init`. The default model SHALL be `all-MiniLM-L6-v2`. The user MAY specify a different model via `kb init --model <name>`. The model SHALL be downloaded via sentence-transformers to the HuggingFace default cache (`~/.cache/huggingface/`). On first load, the model SHALL be exported to ONNX format for inference.
|
||||
|
||||
#### Scenario: Default init
|
||||
- **WHEN** user runs `kb init`
|
||||
- **THEN** the system downloads `all-MiniLM-L6-v2`, creates `~/.kb/kb.db` with the schema, and records `model_name=all-MiniLM-L6-v2` and `embedding_dim=384` in the DB config table
|
||||
|
||||
#### Scenario: Init with custom model
|
||||
- **WHEN** user runs `kb init --model nomic-embed-text`
|
||||
- **THEN** the system downloads `nomic-embed-text`, creates the database, and records the model name and its dimension in the DB config table
|
||||
|
||||
#### Scenario: Init status check
|
||||
- **WHEN** user runs `kb init --status`
|
||||
- **THEN** the system reports: whether `~/.kb/` exists, whether the DB is initialised, which model is configured, whether the model is downloaded, and Docling model status
|
||||
|
||||
#### Scenario: ONNX export on first load
|
||||
- **WHEN** the embedding model is loaded for the first time after download
|
||||
- **THEN** the system SHALL display "Optimising model for ONNX inference (one-time)..." and export the model to ONNX format. Subsequent loads SHALL use the cached ONNX export.
|
||||
|
||||
### Requirement: Model-database binding
|
||||
The system SHALL store the active model name and embedding dimension in the database `config` table. Every operation that uses the embedding model (add, search, reindex) SHALL verify that the loaded model matches the DB record. A mismatch SHALL be a hard error.
|
||||
|
||||
#### Scenario: Model mismatch on add
|
||||
- **WHEN** user runs `kb add doc.pdf` but the config YAML specifies a different model than what the DB was initialised with
|
||||
- **THEN** the system SHALL print an error: "Model mismatch: DB uses 'all-MiniLM-L6-v2' (384 dim) but config specifies 'nomic-embed-text'. Run `kb reindex --model nomic-embed-text` to switch models." and exit with non-zero status
|
||||
|
||||
#### Scenario: Model match on add
|
||||
- **WHEN** user runs `kb add doc.pdf` and the config model matches the DB model
|
||||
- **THEN** ingestion proceeds normally
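
The binding check amounts to comparing the configured model against two rows of the `config` table. The sketch below assumes a simple key/value layout for that table; `ModelMismatchError` and `verify_model_binding` are hypothetical names.

```python
import sqlite3


class ModelMismatchError(RuntimeError):
    """Hard error: the configured model differs from the one bound to the DB."""


def verify_model_binding(conn: sqlite3.Connection, configured_model: str) -> None:
    rows = dict(conn.execute(
        "SELECT key, value FROM config WHERE key IN ('model_name', 'embedding_dim')"
    ))
    db_model, dim = rows["model_name"], rows["embedding_dim"]
    if db_model != configured_model:
        raise ModelMismatchError(
            f"Model mismatch: DB uses '{db_model}' ({dim} dim) but config specifies "
            f"'{configured_model}'. Run `kb reindex --model {configured_model}` "
            "to switch models."
        )
```

Calling this at the start of `add`, `search`, and `reindex` would keep vectors from two different models out of the same table.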
|
||||
|
||||
### Requirement: Full reindex with model switching
|
||||
The system SHALL support re-embedding all chunks via `kb reindex`. If `--model` is specified, the system SHALL download the new model, re-embed all chunks, replace all vectors, and update the DB config. A progress bar SHALL be displayed. The operation SHALL be atomic — if interrupted, the old embeddings remain intact.
|
||||
|
||||
#### Scenario: Reindex with same model
|
||||
- **WHEN** user runs `kb reindex`
|
||||
- **THEN** all chunks are re-embedded with the current model and vectors are replaced. This is useful if the model's ONNX export was corrupted or chunks were modified.
|
||||
|
||||
#### Scenario: Reindex with new model
|
||||
- **WHEN** user runs `kb reindex --model bge-small-en-v1.5`
|
||||
- **THEN** the system downloads the new model, re-embeds all chunks (showing progress), replaces all vectors in `chunks_vec` (recreating the table if dimension changed), and updates `model_name` and `embedding_dim` in the DB config table
|
||||
|
||||
#### Scenario: Interrupted reindex
|
||||
- **WHEN** a reindex is interrupted partway through
|
||||
- **THEN** the old embeddings remain intact (the vector table is only replaced on successful completion of all embeddings). The user can rerun `kb reindex` to retry.
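
One way to get this behaviour is to build the new vectors in a staging table and swap it in within a single transaction. The sketch below uses ordinary SQLite tables with JSON-encoded vectors as a stand-in for the sqlite-vec virtual table, so the swap step for a real `chunks_vec` may need to differ; the staging table name and the `embed` callable are assumptions.

```python
import json
import sqlite3


def reindex(conn: sqlite3.Connection, embed, model_name: str) -> None:
    """Re-embed every chunk; the live table changes only after all succeed.

    Expects a connection opened with isolation_level=None so that the
    BEGIN/COMMIT statements below control the transaction explicitly.
    """
    conn.execute("DROP TABLE IF EXISTS chunks_vec_staging")
    conn.execute(
        "CREATE TABLE chunks_vec_staging (chunk_id INTEGER PRIMARY KEY, embedding TEXT)"
    )
    dim = 0
    for chunk_id, text in conn.execute("SELECT id, text FROM chunks").fetchall():
        vector = embed(text)  # an interruption here leaves chunks_vec untouched
        dim = len(vector)
        conn.execute(
            "INSERT INTO chunks_vec_staging VALUES (?, ?)",
            (chunk_id, json.dumps(vector)),
        )
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute("DROP TABLE chunks_vec")
        conn.execute("ALTER TABLE chunks_vec_staging RENAME TO chunks_vec")
        conn.execute("UPDATE config SET value = ? WHERE key = 'model_name'", (model_name,))
        conn.execute("UPDATE config SET value = ? WHERE key = 'embedding_dim'", (str(dim),))
        conn.execute("COMMIT")
    except BaseException:
        conn.execute("ROLLBACK")
        raise
```

A leftover staging table from an interrupted run is dropped at the start of the next one, so rerunning `kb reindex` simply starts over.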
|
||||
|
||||
### Requirement: Embedding model inference via ONNX
|
||||
The system SHALL use `sentence-transformers` with the ONNX backend for all embedding inference. The ONNX Runtime (`onnxruntime`) SHALL be the inference engine. This keeps PyTorch out of the inference path, although `sentence-transformers` itself still lists `torch` as an install dependency.
|
||||
|
||||
#### Scenario: Embed a chunk
|
||||
- **WHEN** a chunk of text needs to be embedded during ingestion
|
||||
- **THEN** the system uses the sentence-transformers ONNX backend to produce a float vector of the correct dimension for the active model
|
||||
|
||||
#### Scenario: Embed a query
|
||||
- **WHEN** a search query needs to be embedded
|
||||
- **THEN** the system applies the configured `query_prefix` (if any) to the query text before embedding, and uses the same ONNX model used for chunk embeddings
|
||||
@@ -0,0 +1,70 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: Full-text search via FTS5
|
||||
The system SHALL maintain an FTS5 virtual table synchronised with the chunks table via triggers. FTS5 SHALL use the `porter unicode61` tokenizer for stemming and unicode support. Queries SHALL be passed to FTS5 with special characters escaped.
|
||||
|
||||
#### Scenario: Keyword search
|
||||
- **WHEN** user runs `kb search "install git"`
|
||||
- **THEN** FTS5 returns chunks containing "install" and/or "git" (including stemmed variants like "installation"), ranked by BM25 score
|
||||
|
||||
#### Scenario: FTS-only mode
|
||||
- **WHEN** user runs `kb search "install git" --fts-only`
|
||||
- **THEN** only FTS5 results are returned, no vector search is performed
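
Escaping for FTS5 can be done by turning every user token into a quoted string, with embedded double quotes doubled. The sketch below is one possible approach; the helper name `to_fts5_query` is hypothetical, and joining tokens with `OR` is an assumption based on the "and/or" wording of the keyword search scenario.

```python
def to_fts5_query(user_query: str, operator: str = "OR") -> str:
    """Quote each whitespace-separated token so FTS5 treats it as a literal."""
    tokens = user_query.split()
    # Inside an FTS5 string, a double quote is escaped by doubling it.
    quoted = ['"' + token.replace('"', '""') + '"' for token in tokens]
    return f" {operator} ".join(quoted)
```

The result is meant to be bound as a parameter, for example `... WHERE chunks_fts MATCH ?`. An empty query produces an empty string, which the caller should reject before it reaches FTS5.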
|
||||
|
||||
### Requirement: Vector similarity search via sqlite-vec
|
||||
The system SHALL embed the query using the same model that was used to embed stored chunks. The embedded query SHALL be compared against all chunk embeddings in `chunks_vec` using cosine similarity. The system SHALL retrieve 3× the requested result count as candidates for RRF merging.
|
||||
|
||||
#### Scenario: Semantic search
|
||||
- **WHEN** user runs `kb search "how to set up version control"`
|
||||
- **THEN** the query is embedded and compared against stored vectors, returning semantically similar chunks even if they don't contain the exact words "version control"
|
||||
|
||||
#### Scenario: Vector-only mode
|
||||
- **WHEN** user runs `kb search "how to set up version control" --vec-only`
|
||||
- **THEN** only vector similarity results are returned, no FTS search is performed
|
||||
|
||||
### Requirement: Reciprocal Rank Fusion merging
|
||||
The system SHALL merge FTS5 and vector search results using Reciprocal Rank Fusion (RRF). The RRF formula SHALL be: `score(d) = Σ 1/(k + rank)` where `k` is configurable (default: 60). Results SHALL be sorted by descending RRF score.
|
||||
|
||||
#### Scenario: Hybrid search combines both signals
|
||||
- **WHEN** user runs `kb search "install git"` (default hybrid mode)
|
||||
- **THEN** the system runs both FTS5 and vector searches, merges results via RRF, and returns results sorted by combined score
|
||||
|
||||
#### Scenario: Document appears in both result sets
|
||||
- **WHEN** a chunk ranks #2 in FTS5 and #5 in vector search
|
||||
- **THEN** its RRF score SHALL be `1/(60+2) + 1/(60+5) = 0.0161 + 0.0154 = 0.0315`, higher than a chunk appearing in only one result set
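
The formula translates directly into a few lines of code. The sketch below takes two ranked lists of chunk IDs, best first; the function name `rrf_merge` and the list-of-IDs input shape are assumptions.

```python
from collections import defaultdict


def rrf_merge(fts_ids: list[int], vec_ids: list[int], k: int = 60) -> list[tuple[int, float]]:
    """Merge two rankings with score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[int, float] = defaultdict(float)
    for ranking in (fts_ids, vec_ids):
        for rank, chunk_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With `fts_ids = [10, 7, 3]` and `vec_ids = [4, 5, 6, 8, 7]`, chunk 7 sits at rank 2 and rank 5, scores `1/62 + 1/65`, and ends up first because every other chunk appears in only one list.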
|
||||
|
||||
### Requirement: Tag-based filtering
|
||||
The system SHALL support filtering search results by one or more tags. When multiple tags are specified, the filter SHALL use AND logic (document must have ALL specified tags). Tag filtering SHALL be applied in the SQL query via JOIN for efficiency.
|
||||
|
||||
#### Scenario: Filter by single tag
|
||||
- **WHEN** user runs `kb search "deploy" --tags ops`
|
||||
- **THEN** only chunks from documents tagged with "ops" are included in results
|
||||
|
||||
#### Scenario: Filter by multiple tags
|
||||
- **WHEN** user runs `kb search "deploy" --tags ops,production`
|
||||
- **THEN** only chunks from documents tagged with BOTH "ops" AND "production" are included
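
AND semantics over a many-to-many tag table are commonly expressed with `GROUP BY` and `HAVING COUNT`. The query below is a sketch against an assumed layout, `tags(id, name)` and `document_tags(document_id, tag_id)`; the column names are not taken from the design document.

```python
import sqlite3


def documents_with_all_tags(conn: sqlite3.Connection, tags: list[str]) -> list[int]:
    """Return IDs of documents that carry every one of the given tags."""
    wanted = sorted({tag.strip().lower() for tag in tags if tag.strip()})
    if not wanted:
        return []
    placeholders = ",".join("?" for _ in wanted)
    sql = f"""
        SELECT dt.document_id
        FROM document_tags AS dt
        JOIN tags AS t ON t.id = dt.tag_id
        WHERE t.name IN ({placeholders})
        GROUP BY dt.document_id
        HAVING COUNT(DISTINCT t.name) = ?
        ORDER BY dt.document_id
    """
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]
```

Only the placeholders are interpolated into the SQL text; the tag values themselves are bound as parameters. The resulting ID set can then be joined against `chunks` to restrict search candidates.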
|
||||
|
||||
### Requirement: Type-based filtering
|
||||
The system SHALL support filtering search results by document type. Valid types: `pdf`, `markdown`, `code`, `note`.
|
||||
|
||||
#### Scenario: Filter by type
|
||||
- **WHEN** user runs `kb search "deploy" --type code`
|
||||
- **THEN** only chunks from code documents are included in results
|
||||
|
||||
### Requirement: Score threshold
|
||||
The system SHALL support a minimum score cutoff. Results with an RRF score below the threshold SHALL be excluded from output.
|
||||
|
||||
#### Scenario: Apply score threshold
|
||||
- **WHEN** user runs `kb search "deploy" --threshold 0.02`
|
||||
- **THEN** only results with RRF score >= 0.02 are returned
|
||||
|
||||
### Requirement: Result count control
|
||||
The system SHALL return a configurable number of results (default: 10, configurable via `--top` flag or `search.default_top` in config).
|
||||
|
||||
#### Scenario: Request specific number of results
|
||||
- **WHEN** user runs `kb search "deploy" --top 5`
|
||||
- **THEN** at most 5 results are returned
|
||||
|
||||
#### Scenario: Fewer matches than requested
|
||||
- **WHEN** user searches and only 3 chunks match
|
||||
- **THEN** the system returns 3 results without error, with `returned: 3` in the output
|
||||
@@ -0,0 +1,101 @@
|
||||
## ADDED Requirements
|
||||
|
||||
### Requirement: JSON output format for search
|
||||
The system SHALL output search results as JSON when `--format json` is used (this is the default). The JSON schema SHALL include: `query`, `results` array, `total_matches`, and `returned` count. Each result SHALL include: `chunk_id`, `score`, `score_breakdown` (with `fts` and `vector` sub-scores), `text`, and `source` object.
|
||||
|
||||
#### Scenario: JSON search output
|
||||
- **WHEN** user runs `kb search "install git" --format json`
|
||||
- **THEN** the output is valid JSON matching this structure:
|
||||
```json
|
||||
{
|
||||
"query": "install git",
|
||||
"results": [
|
||||
{
|
||||
"chunk_id": 1423,
|
||||
"score": 0.031,
|
||||
"score_breakdown": {"fts": 0.016, "vector": 0.015},
|
||||
"text": "To install the latest version...",
|
||||
"source": {
|
||||
"document_id": 42,
|
||||
"title": "Git Admin Guide",
|
||||
"path": "/home/user/docs/git-admin.pdf",
|
||||
"type": "pdf",
|
||||
"page": 12,
|
||||
"chunk_index": 3,
|
||||
"total_chunks": 28,
|
||||
"tags": ["git", "admin"]
|
||||
}
|
||||
}
|
||||
],
|
||||
"total_matches": 47,
|
||||
"returned": 10
|
||||
}
|
||||
```
|
||||
|
||||
#### Scenario: Score breakdown in FTS-only mode
|
||||
- **WHEN** user runs `kb search "test" --fts-only --format json`
|
||||
- **THEN** `score_breakdown` contains `{"fts": <score>, "vector": null}`
|
||||
|
||||
#### Scenario: Score breakdown in vector-only mode
|
||||
- **WHEN** user runs `kb search "test" --vec-only --format json`
|
||||
- **THEN** `score_breakdown` contains `{"fts": null, "vector": <score>}`
|
||||
|
||||
### Requirement: Human-readable output format
|
||||
The system SHALL support human-readable output via `--format human`. This format SHALL show: query, match count, and for each result: rank, score, title, page/section (if applicable), type, tags, and a text preview.
|
||||
|
||||
#### Scenario: Human-readable search output
|
||||
- **WHEN** user runs `kb search "install git" --format human`
|
||||
- **THEN** output is formatted for terminal reading:
|
||||
```
|
||||
Search: "install git" (47 matches, showing top 10)
|
||||
|
||||
1. [0.031] Git Admin Guide (p.12) [pdf] [git, admin]
|
||||
To install the latest version of git from source...
|
||||
|
||||
2. [0.025] setup-notes.md §Installation [markdown] [git]
|
||||
First, add the PPA repository for the latest git...
|
||||
```
|
||||
|
||||
### Requirement: JSON output for list and tags commands
|
||||
The system SHALL support `--format json` for `kb list`, `kb tags`, `kb info`, and `kb status` commands. JSON output SHALL be valid and parseable by the skill wrapper.
|
||||
|
||||
#### Scenario: List documents as JSON
|
||||
- **WHEN** user runs `kb list --format json`
|
||||
- **THEN** output is a JSON array of document objects with `id`, `title`, `type`, `tags`, `chunk_count`, `created_at`
|
||||
|
||||
#### Scenario: Tags as JSON
|
||||
- **WHEN** user runs `kb tags --format json`
|
||||
- **THEN** output is a JSON array: `[{"name": "git", "count": 15}, ...]`
|
||||
|
||||
#### Scenario: Status as JSON
|
||||
- **WHEN** user runs `kb status --format json`
|
||||
- **THEN** output is a JSON object with `documents` (counts by type), `total_chunks`, `db_size_bytes`, `model_name`, `embedding_dim`, `schema_version`
|
||||
|
||||
### Requirement: JSON schema stability
|
||||
The JSON output schema SHALL be treated as a public contract. Fields MAY be added to JSON objects in future versions. Fields SHALL NOT be removed or renamed. The skill wrapper MUST be able to rely on the presence and type of all documented fields.
|
||||
|
||||
#### Scenario: Forward compatibility
|
||||
- **WHEN** a future version adds a `language` field to search results
|
||||
- **THEN** all existing fields remain present and unchanged, the new field is additive only
|
||||
|
||||
### Requirement: Exit codes
|
||||
The system SHALL use consistent exit codes: 0 for success, 1 for user errors (bad arguments, missing files), 2 for system errors (database corruption, model failure). JSON error output SHALL include an `error` field with a human-readable message.
|
||||
|
||||
#### Scenario: Successful operation
|
||||
- **WHEN** any command completes successfully
|
||||
- **THEN** exit code is 0
|
||||
|
||||
#### Scenario: User error with JSON output
|
||||
- **WHEN** user runs `kb search` with no query argument
|
||||
- **THEN** exit code is 1 and stderr contains a clear error message; when JSON output is active, the error output includes an `error` field carrying that message
|
||||
|
||||
#### Scenario: System error
|
||||
- **WHEN** the SQLite database is corrupted
|
||||
- **THEN** exit code is 2 and stderr contains the error details
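
The exit-code contract can be centralised in one wrapper so individual commands only raise exceptions. This is a sketch: the exception classes, the `run_command` helper, and the plain-text `Error:` prefix are assumptions; only the codes 0, 1, 2 and the `error` field come from the requirement.

```python
import json
from enum import IntEnum


class ExitCode(IntEnum):
    SUCCESS = 0
    USER_ERROR = 1    # bad arguments, missing files
    SYSTEM_ERROR = 2  # database corruption, model failure


class UserError(Exception):
    """Hypothetical exception for mistakes the user can fix."""


class SystemFailure(Exception):
    """Hypothetical exception for failures outside the user's control."""


def run_command(command, as_json: bool = False) -> tuple[int, str]:
    """Run command() and return (exit_code, text_for_stderr)."""
    try:
        command()
    except UserError as exc:
        code, message = ExitCode.USER_ERROR, str(exc)
    except SystemFailure as exc:
        code, message = ExitCode.SYSTEM_ERROR, str(exc)
    else:
        return ExitCode.SUCCESS, ""
    stderr = json.dumps({"error": message}) if as_json else f"Error: {message}"
    return code, stderr
```

A CLI entry point would print the second element to stderr and pass the first to `sys.exit`.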
|
||||
|
||||
### Requirement: Skill definition file
|
||||
The project SHALL include a `SKILL.md` file that defines how an LLM tool (e.g. Claude Code) should invoke and interpret `kb` commands. The skill file SHALL document: when to use the tool, available commands, output format, how to cite sources, and how to handle low-confidence results.
|
||||
|
||||
#### Scenario: Skill file exists
|
||||
- **WHEN** the project is built
|
||||
- **THEN** a `SKILL.md` file exists at the project root describing the skill interface for LLM consumption
|
||||
@@ -0,0 +1,115 @@
|
||||
## 1. Project Scaffolding
|
||||
|
||||
- [x] 1.1 Create Python virtual environment (`python3 -m venv .venv`) and add `.venv/` to `.gitignore`. All development and testing MUST run inside this venv.
|
||||
- [x] 1.2 Create `pyproject.toml` with project metadata, dependencies (`click`, `sqlite-vec`, `pyyaml`, `sentence-transformers`, `onnxruntime`, `docling`), dev dependencies (`pytest`, `pytest-cov`), and `[project.scripts] kb = "kb_search.cli:main"` entry point
|
||||
- [x] 1.3 Install the project in editable mode inside the venv: `.venv/bin/pip install -e ".[dev]"`
|
||||
- [x] 1.4 Create `src/kb_search/` package directory with `__init__.py`
|
||||
- [x] 1.5 Create `src/kb_search/cli.py` with Click group and stub subcommands (`init`, `add`, `search`, `list`, `info`, `remove`, `tags`, `tag`, `status`, `reindex`, `config`)
|
||||
- [x] 1.6 Verify `.venv/bin/kb --help` shows all commands
|
||||
|
||||
## 2. Configuration
|
||||
|
||||
- [x] 2.1 Create `src/kb_search/config.py` — load YAML from `~/.kb/config.yaml` with deep-merge against built-in defaults. Handle missing file gracefully.
|
||||
- [x] 2.2 Implement ENV variable overrides (`KB_DATA_DIR`, `KB_MODEL`, `KB_DEFAULT_TOP`, `KB_DEFAULT_FORMAT`) with precedence: CLI flags > ENV > YAML > defaults
|
||||
- [x] 2.3 Implement `kb config` command — display fully resolved config with source indicators
|
||||
- [x] 2.4 Implement `kb config set <key> <value>` — write to `~/.kb/config.yaml`, creating file if needed
|
||||
- [x] 2.5 Write tests for config loading, merging, ENV overrides, and precedence
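
The precedence chain in 2.1 and 2.2 can be pictured as a fold of deep merges, lowest priority first. The sketch below assumes each layer has already been parsed into a nested dictionary; apart from `search.default_top`, the key names in the usage note are made up for the example.

```python
from functools import reduce


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where override wins, merging nested dicts key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_config(defaults: dict, yaml_cfg: dict, env_cfg: dict, cli_cfg: dict) -> dict:
    """Apply layers so that CLI flags > ENV > YAML > defaults."""
    return reduce(deep_merge, (yaml_cfg, env_cfg, cli_cfg), defaults)
```

For example, a YAML file setting `search.default_top` to 20 and a CLI flag setting it to 5 resolve to 5, while sibling keys under `search` that only the defaults define are preserved.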
|
||||
|
||||
## 3. Database Layer
|
||||
|
||||
- [x] 3.1 Create `src/kb_search/database.py` — SQLite connection management with sqlite-vec extension loading
|
||||
- [x] 3.2 Implement schema creation: `documents`, `chunks`, `tags`, `document_tags`, `config` tables per design.md
|
||||
- [x] 3.3 Implement FTS5 virtual table (`chunks_fts`) with `porter unicode61` tokenizer and sync triggers (INSERT, UPDATE, DELETE)
|
||||
- [x] 3.4 Implement `chunks_vec` virtual table via sqlite-vec
|
||||
- [x] 3.5 Implement schema versioning: store `schema_version` in `config` table, check on open, run migrations sequentially
|
||||
- [x] 3.6 Implement DB config helpers: `get_config(key)`, `set_config(key, value)` for model binding
|
||||
- [x] 3.7 Write tests for schema creation, migrations, FTS sync triggers, and config helpers
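
Schema versioning as described in 3.5 might look like the sketch below. The two migrations and the key/value layout of the `config` table are placeholders; the actual schema is defined in design.md and is not reproduced here.

```python
import sqlite3

# Hypothetical migrations keyed by the version they produce.
MIGRATIONS = {
    1: "CREATE TABLE IF NOT EXISTS documents (id INTEGER PRIMARY KEY, title TEXT)",
    2: "ALTER TABLE documents ADD COLUMN content_hash TEXT",
}


def migrate(conn: sqlite3.Connection) -> int:
    """Apply pending migrations in order and return the resulting version."""
    conn.execute("CREATE TABLE IF NOT EXISTS config (key TEXT PRIMARY KEY, value TEXT)")
    row = conn.execute(
        "SELECT value FROM config WHERE key = 'schema_version'"
    ).fetchone()
    current = int(row[0]) if row else 0
    for version in sorted(MIGRATIONS):
        if version <= current:
            continue  # already applied on an earlier open
        conn.execute(MIGRATIONS[version])
        conn.execute(
            "INSERT OR REPLACE INTO config (key, value) VALUES ('schema_version', ?)",
            (str(version),),
        )
        conn.commit()
        current = version
    return current
```

Running `migrate` on every open is safe because versions at or below the stored `schema_version` are skipped.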
|
||||
|
||||
## 4. Embedding Management
|
||||
|
||||
- [x] 4.1 Create `src/kb_search/embeddings.py` — model download, ONNX export, and loading via `SentenceTransformer(model_name, backend="onnx")`
|
||||
- [x] 4.2 Implement model-database binding: on init, write model_name + embedding_dim to DB config; on load, verify match and hard-error on mismatch
|
||||
- [x] 4.3 Implement `embed_texts(texts: list[str]) -> list[list[float]]` with configurable query/passage prefix support
|
||||
- [x] 4.4 Implement `kb init` command — create `~/.kb/`, init DB schema, download model, record binding. Support `--model` flag and `--status` check.
|
||||
- [x] 4.5 Implement `kb reindex` command — download new model if `--model` specified, re-embed all chunks with progress bar, replace vectors atomically, update DB config
|
||||
- [x] 4.6 Write tests for embedding, model binding verification, and mismatch detection
|
||||
|
||||
## 5. Document Ingestion — Core
|
||||
|
||||
- [x] 5.1 Create `src/kb_search/ingest/__init__.py` and `src/kb_search/ingest/detector.py` — file type detection by extension, routing to correct pipeline, `--type`/`--language` override support
|
||||
- [x] 5.2 Implement deduplication: SHA-256 content hash, skip-if-exists check against `documents.content_hash`
|
||||
- [x] 5.3 Implement `kb add <file>` command — detect type, route to pipeline, store document + chunks + embeddings + tags in a single transaction
|
||||
- [x] 5.4 Implement `kb add --note "text"` — create note document with whole-text chunk, optional `--title`, auto-title from first 80 chars
|
||||
- [x] 5.5 Implement `kb add <dir> --recursive` — walk directory, filter supported extensions, process each file, skip dupes, log failures to `~/.kb/ingest-errors.log`, display summary
|
||||
- [x] 5.6 Implement parallel ingestion with configurable `--workers` (default: 4), serialised DB writes
|
||||
- [x] 5.7 Write tests for type detection, dedup, note creation, and batch processing
|
||||
|
||||
## 6. Document Ingestion — Docling Pipeline
|
||||
|
||||
- [x] 6.1 Create `src/kb_search/ingest/docling.py` — Docling `DocumentConverter` setup with `pypdfium2` backend, layout model enabled, table reconstruction enabled
|
||||
- [x] 6.2 Implement OCR configuration (`auto`/`always`/`never`) per config.yaml `ingestion.enable_ocr`
|
||||
- [x] 6.3 Implement hierarchy-aware chunking via Docling's `HierarchicalChunker`, with fallback to fixed-size chunking when hierarchy detection fails
|
||||
- [x] 6.4 Extract and preserve chunk metadata: page number, section headers, table markers
|
||||
- [x] 6.5 Wire Docling models to download on `kb init` (using HuggingFace default cache)
|
||||
- [x] 6.6 Write tests with sample PDFs (text-based, table-heavy, mixed layout)
|
||||
|
||||
## 7. Document Ingestion — Markdown Pipeline
|
||||
|
||||
- [x] 7.1 Create `src/kb_search/ingest/markdown.py` — split at `##`/`###` header boundaries
|
||||
- [x] 7.2 Implement parent header chain context (e.g. "Config > Advanced Options" prefix on nested chunks)
|
||||
- [x] 7.3 Implement small section merging (sections below `min_tokens` merged with next section)
|
||||
- [x] 7.4 Implement large section splitting at paragraph boundaries with overlap
|
||||
- [x] 7.5 Implement fallback to fixed-size chunking for plain text files without headers
|
||||
- [x] 7.6 Write tests for header splitting, merging, hierarchy context, and plain text fallback

## 8. Document Ingestion — Code Pipeline

- [x] 8.1 Create `src/kb_search/ingest/code.py` — language detection from extension (`.py`, `.sh`, `.bash`, `.go`)
- [x] 8.2 Implement Python AST splitting using stdlib `ast` module — function and class boundaries, class docstring context on methods
- [x] 8.3 Implement Bash regex splitting — `function name()` and `name()` patterns with preceding comment blocks
- [x] 8.4 Implement Go regex splitting — `func` declarations with type grouping
- [x] 8.5 Implement fallback to fixed-size chunking when no function/class boundaries detected
- [x] 8.6 Write tests for each language parser and fallback behaviour
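
Task 8.2 can be illustrated with the stdlib `ast` module. A minimal sketch, assuming top-level functions and class methods are the chunk boundaries; `split_python` and the chunk dict keys are hypothetical, and an empty result is where a caller would fall back to fixed-size chunking (task 8.5):

```python
import ast

FUNC_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef)


def split_python(source: str) -> list[dict]:
    """Split Python source into one chunk per function or method."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, FUNC_TYPES):
            chunks.append({
                "name": node.name,
                "context": "",
                "text": ast.get_source_segment(source, node),
            })
        elif isinstance(node, ast.ClassDef):
            # Attach the class docstring as context to each method chunk.
            class_doc = ast.get_docstring(node) or ""
            for item in node.body:
                if isinstance(item, FUNC_TYPES):
                    chunks.append({
                        "name": f"{node.name}.{item.name}",
                        "context": class_doc,
                        "text": ast.get_source_segment(source, item),
                    })
    return chunks


sample = (
    "def top():\n"
    "    return 1\n"
    "\n"
    "\n"
    "class Store:\n"
    '    """Keeps things."""\n'
    "\n"
    "    def get(self):\n"
    "        return 2\n"
)
chunks = split_python(sample)
```

`ast.get_source_segment` returns the exact source text of a node, so each chunk keeps its original formatting and comments.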

## 9. Hybrid Search

- [x] 9.1 Create `src/kb_search/search.py` — FTS5 query execution with BM25 scoring, special character escaping
- [x] 9.2 Implement vector similarity search: embed query, query `chunks_vec` for top-K (3× requested), cosine similarity
- [x] 9.3 Implement Reciprocal Rank Fusion: merge FTS and vector results with `score(d) = Σ 1/(k + rank)`, configurable `k` (default: 60)
- [x] 9.4 Implement `--fts-only` and `--vec-only` modes
- [x] 9.5 Implement tag filtering via SQL JOIN and type filtering via WHERE clause
- [x] 9.6 Implement `--threshold` score cutoff (post-RRF)
- [x] 9.7 Implement `--top` result count control (default from config)
- [x] 9.8 Wire up `kb search` command with all flags: `--top`, `--tags`, `--type`, `--format`, `--fts-only`, `--vec-only`, `--threshold`
- [x] 9.9 Write tests for FTS, vector search, RRF merging, filtering, and edge cases (empty results, fewer matches than requested)
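
The fusion formula in task 9.3 can be sketched in a few lines. This is an illustration under the assumption that each search backend returns chunk IDs ordered best-first; `rrf_merge` is a hypothetical helper, not the function in `search.py`:

```python
from collections import defaultdict


def rrf_merge(rankings: list[list[int]], k: int = 60, top: int = 10) -> list[tuple[int, float]]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores: dict[int, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the best hit in a list contributes 1 / (k + 1).
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top]


fts_hits = [11, 42, 7]   # hypothetical chunk IDs from full-text search
vec_hits = [42, 99, 11]  # hypothetical chunk IDs from vector search
merged = rrf_merge([fts_hits, vec_hits])
```

Chunk 42 ranks first because it appears near the top of both lists. RRF uses only ranks, so BM25 scores and cosine similarities never need to be normalised against each other.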

## 10. Output Formatting

- [x] 10.1 Create `src/kb_search/output.py` — JSON formatter for search results matching the schema in skill-interface spec
- [x] 10.2 Implement human-readable formatter for search results (rank, score, title, page/section, type, tags, text preview)
- [x] 10.3 Implement JSON formatters for `list`, `tags`, `info`, and `status` commands
- [x] 10.4 Implement human-readable formatters for `list`, `tags`, `info`, and `status` commands
- [x] 10.5 Implement consistent exit codes: 0 success, 1 user error, 2 system error
- [x] 10.6 Write tests for JSON output schema validation and exit codes
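
The exit-code convention in task 10.5 could look like the sketch below. It assumes a dedicated exception type marks user errors; `UserError` and `run` are hypothetical names, and the real CLI may rely on Click's own exception handling instead:

```python
import sys

EXIT_OK, EXIT_USER_ERROR, EXIT_SYSTEM_ERROR = 0, 1, 2


class UserError(Exception):
    """Hypothetical marker for mistakes the user can fix (bad path, unknown ID)."""


def run(action) -> int:
    """Call action() and map its outcome onto the 0/1/2 exit-code scheme."""
    try:
        action()
    except UserError as exc:
        print(f"Error: {exc}", file=sys.stderr)
        return EXIT_USER_ERROR
    except Exception as exc:  # anything unexpected counts as a system error
        print(f"Internal error: {exc}", file=sys.stderr)
        return EXIT_SYSTEM_ERROR
    return EXIT_OK


def missing_document():
    raise UserError("Document not found: 7")


def broken_disk():
    raise OSError("disk full")
```

A caller would typically finish with `sys.exit(run(some_command))`, which lets a consuming skill branch on the exit code without parsing stderr.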

## 11. Document Management Commands

- [x] 11.1 Implement `kb list` — query documents with optional `--type` and `--tags` filters, `--format` output
- [x] 11.2 Implement `kb info <doc_id>` — document details with chunk previews
- [x] 11.3 Implement `kb remove <doc_id>` — cascading delete with confirmation prompt, `--yes` flag
- [x] 11.4 Implement `kb tags` — list all tags with document counts, `--format` support
- [x] 11.5 Implement `kb tag <doc_id> --add/--remove` — tag management, case-insensitive storage
- [x] 11.6 Implement `kb status` — DB stats, model info, storage size, schema version
- [x] 11.7 Write tests for each management command

## 12. Skill Definition

- [x] 12.1 Write `SKILL.md` — when to use, available commands, output format, how to cite sources, handling low-confidence results, multi-query guidance
- [x] 12.2 Test the skill end-to-end: ingest sample documents, run searches via the skill prompt, verify Claude Code can parse and cite results

## 13. Packaging and Distribution

- [x] 13.1 Verify `pipx install kb-search` works from a clean environment
- [x] 13.2 Verify `kb init` downloads both embedding model and Docling models successfully
- [x] 13.3 Add a README with quickstart: install, init, add, search
- [x] 13.4 Add `py.typed` marker and basic type annotations on public interfaces
@@ -0,0 +1,10 @@
schema: spec-driven

context: |
  Tech stack: Python 3.11+, Click (CLI), SQLite (FTS5 + sqlite-vec), Docling, sentence-transformers
  Distribution: pipx (PyPI package name: kb-search, CLI entry point: kb)
  Storage: Single SQLite database at ~/.kb/kb.db
  Config: YAML at ~/.kb/config.yaml with ENV overrides
  Domain: CLI knowledge base / retrieval engine for personal document search
  Primary consumer: Claude Code skills (JSON output), secondary: human terminal use
  Local-first: no cloud dependencies, embedding models downloaded from HuggingFace on init
@@ -0,0 +1,32 @@
[build-system]
requires = ["setuptools>=68.0", "setuptools-scm>=8.0"]
build-backend = "setuptools.build_meta"

[project]
name = "kb-search"
version = "0.1.0"
description = "CLI knowledge base with hybrid search (FTS + vector)"
requires-python = ">=3.11"
license = "MIT"
dependencies = [
    "click>=8.1",
    "pyyaml>=6.0",
    "sentence-transformers[onnx]>=3.0",
    "sqlite-vec>=0.1.1",
    "docling>=2.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0",
    "pytest-cov>=5.0",
]

[project.scripts]
kb = "kb_search.cli:main"

[tool.setuptools.packages.find]
where = ["src"]

[tool.pytest.ini_options]
testpaths = ["tests"]
@@ -0,0 +1,3 @@
"""kb-search: CLI knowledge base with hybrid search."""

__version__ = "0.1.0"
@@ -0,0 +1,616 @@
"""CLI entry point for kb-search."""

import click


@click.group()
@click.version_option(package_name="kb-search")
def main():
    """Personal knowledge base with hybrid search."""


@main.command()
@click.option("--model", default=None, help="Embedding model name (HuggingFace).")
@click.option("--status", is_flag=True, help="Show initialisation status.")
def init(model, status):
    """Initialise the knowledge base and download models."""
    from kb_search.config import get_data_dir, get_db_path, load_config
    from kb_search.database import get_connection, get_db_config, init_schema, run_migrations, set_db_config
    from kb_search.embeddings import download_model, get_model_dim

    cfg = load_config()
    data_dir = get_data_dir(cfg)
    db_path = get_db_path(cfg)
    model_name = model or cfg["embedding"]["model"]

    if status:
        click.echo(f"Data directory: {data_dir} ({'exists' if data_dir.exists() else 'not created'})")
        click.echo(f"Database: {db_path} ({'exists' if db_path.exists() else 'not created'})")
        if db_path.exists():
            conn = get_connection(db_path)
            db_model = get_db_config(conn, "model_name", "not set")
            db_dim = get_db_config(conn, "embedding_dim", "not set")
            click.echo(f"Model: {db_model} ({db_dim} dim)")
            conn.close()
        else:
            click.echo(f"Model: {model_name} (not yet initialised)")
        return

    # Create data directory
    data_dir.mkdir(parents=True, exist_ok=True)

    # Download model and get dimension
    download_model(model_name)
    dim = get_model_dim(model_name)

    # Initialise database
    conn = get_connection(db_path)
    init_schema(conn, embedding_dim=dim)
    run_migrations(conn)
    set_db_config(conn, "model_name", model_name)
    set_db_config(conn, "embedding_dim", str(dim))
    conn.close()

    click.echo(f"Knowledge base initialised at {data_dir}")
    click.echo(f"Model: {model_name} ({dim} dimensions)")
    click.echo("Ready! Add documents with `kb add`.")


@main.command()
@click.argument("path", required=False)
@click.option("--note", default=None, help="Add an inline text note.")
@click.option("--title", default=None, help="Title for the note.")
@click.option("--tags", default=None, help="Comma-separated tags.")
@click.option("--type", "doc_type", default=None, type=click.Choice(["pdf", "markdown", "code", "note"]), help="Force document type.")
@click.option("--language", default=None, type=click.Choice(["python", "bash", "go"]), help="Force code language.")
@click.option("--recursive", is_flag=True, help="Recurse into directories.")
@click.option("--workers", default=None, type=int, help="Number of parallel workers.")
def add(path, note, title, tags, doc_type, language, recursive, workers):
    """Add documents to the knowledge base."""
    import hashlib
    from pathlib import Path as P
    from kb_search.config import get_db_path, load_config
    from kb_search.database import (
        get_connection, hash_exists, insert_chunk, insert_document,
        insert_embedding, tag_document,
    )
    from kb_search.embeddings import check_model_binding, embed_texts
    from kb_search.ingest.detector import detect_type, is_supported
    from kb_search.ingest.note import auto_title, chunk_note

    cfg = load_config()
    db_path = get_db_path(cfg)
    if not db_path.exists():
        raise click.ClickException("Knowledge base not initialised. Run `kb init` first.")

    conn = get_connection(db_path)
    check_model_binding(conn, cfg)
    model_name = cfg["embedding"]["model"]
    tag_list = [t.strip() for t in tags.split(",")] if tags else []

    if note:
        # Inline note
        content_hash = hashlib.sha256(note.encode()).hexdigest()
        if hash_exists(conn, content_hash):
            click.echo("Skipped: note (already indexed)")
            conn.close()
            return
        note_title = title or auto_title(note)
        chunks = chunk_note(note)
        doc_id = insert_document(conn, note_title, None, content_hash, "note")
        for c in chunks:
            chunk_id = insert_chunk(conn, doc_id, c["chunk_index"], c["text"], metadata=c["metadata"])
            emb = embed_texts(model_name, [c["text"]], prefix=cfg["embedding"].get("passage_prefix", ""))
            insert_embedding(conn, chunk_id, emb[0])
        if tag_list:
            tag_document(conn, doc_id, tag_list)
        conn.commit()
        conn.close()
        click.echo(f"Added note: {note_title}")
        return

    if not path:
        raise click.ClickException("Provide a file/directory path or use --note.")

    file_path = P(path).expanduser().resolve()

    if file_path.is_dir():
        _add_directory(conn, file_path, cfg, model_name, tag_list, doc_type, language,
                       recursive, workers)
    elif file_path.is_file():
        result = _add_single_file(conn, file_path, cfg, model_name, tag_list, doc_type, language)
        click.echo(result)
    else:
        raise click.ClickException(f"Path not found: {file_path}")

    conn.close()


def _add_single_file(conn, file_path, cfg, model_name, tag_list, force_type, force_language):
    """Add a single file. Returns a status message."""
    import hashlib
    from kb_search.database import (
        hash_exists, insert_chunk, insert_document, insert_embedding, tag_document,
    )
    from kb_search.embeddings import embed_texts
    from kb_search.ingest.detector import detect_type

    # Dedup check
    content_hash = hashlib.sha256(file_path.read_bytes()).hexdigest()
    if hash_exists(conn, content_hash):
        return f"Skipped: {file_path.name} (already indexed)"

    doc_type, language = detect_type(file_path, force_type, force_language)
    chunks = _get_chunks(file_path, doc_type, language, cfg)

    if not chunks:
        return f"Skipped: {file_path.name} (no content extracted)"

    title = file_path.stem
    doc_id = insert_document(conn, title, str(file_path), content_hash, doc_type,
                             language=language)

    # Embed all chunks in one batch
    texts = [c["text"] for c in chunks]
    prefix = cfg["embedding"].get("passage_prefix", "")
    embeddings = embed_texts(model_name, texts, prefix=prefix)

    for c, emb in zip(chunks, embeddings):
        chunk_id = insert_chunk(conn, doc_id, c["chunk_index"], c["text"],
                                token_count=c.get("token_count"),
                                metadata=c.get("metadata", {}))
        insert_embedding(conn, chunk_id, emb)

    if tag_list:
        tag_document(conn, doc_id, tag_list)

    conn.commit()
    return f"Added: {file_path.name} ({len(chunks)} chunks)"


def _get_chunks(file_path, doc_type, language, cfg):
    """Route to the correct chunking pipeline."""
    if doc_type == "pdf":
        from kb_search.ingest.docling import chunk_document
        return chunk_document(file_path, cfg)
    elif doc_type == "markdown":
        from kb_search.ingest.markdown import chunk_markdown
        text = file_path.read_text(errors="replace")
        return chunk_markdown(text, cfg)
    elif doc_type == "code":
        from kb_search.ingest.code import chunk_code
        text = file_path.read_text(errors="replace")
        return chunk_code(text, language, cfg)
    elif doc_type == "note":
        from kb_search.ingest.note import chunk_note
        text = file_path.read_text(errors="replace")
        return chunk_note(text)
    return []


def _add_directory(conn, dir_path, cfg, model_name, tag_list, force_type, force_language,
                   recursive, workers):
    """Add all supported files in a directory."""
    from pathlib import Path as P
    from kb_search.ingest.detector import is_supported

    pattern = "**/*" if recursive else "*"
    files = sorted(f for f in dir_path.glob(pattern) if f.is_file() and is_supported(f))

    if not files:
        click.echo(f"No supported files found in {dir_path}")
        return

    added = 0
    skipped = 0
    failed = 0
    error_log = cfg.get("data_dir", "~/.kb")
    from kb_search.config import get_data_dir
    error_log_path = get_data_dir(cfg) / "ingest-errors.log"

    with click.progressbar(files, label="Ingesting", show_pos=True) as bar:
        for f in bar:
            try:
                result = _add_single_file(conn, f, cfg, model_name, tag_list,
                                          force_type, force_language)
                if result.startswith("Skipped"):  # match the status prefix, not the file name
                    skipped += 1
                else:
                    added += 1
            except Exception as e:
                failed += 1
                with open(error_log_path, "a") as log:
                    log.write(f"{f}: {e}\n")

    click.echo(f"\nAdded {added} documents. {failed} failed. {skipped} skipped (already indexed).")


@main.command()
@click.argument("query")
@click.option("--top", default=None, type=int, help="Number of results.")
@click.option("--tags", default=None, help="Filter by tags (comma-separated).")
@click.option("--type", "doc_type", default=None, type=click.Choice(["pdf", "markdown", "code", "note"]), help="Filter by document type.")
@click.option("--format", "fmt", default=None, type=click.Choice(["json", "human"]), help="Output format.")
@click.option("--fts-only", is_flag=True, help="Full-text search only.")
@click.option("--vec-only", is_flag=True, help="Vector search only.")
@click.option("--threshold", default=None, type=float, help="Minimum score cutoff.")
def search(query, top, tags, doc_type, fmt, fts_only, vec_only, threshold):
    """Search the knowledge base."""
    from kb_search.config import get_db_path, load_config
    from kb_search.database import get_connection, get_db_config
    from kb_search.embeddings import check_model_binding
    from kb_search.search import hybrid_search
    from kb_search.output import format_search_results

    cfg = load_config()
    db_path = get_db_path(cfg)
    if not db_path.exists():
        raise click.ClickException("Knowledge base not initialised. Run `kb init` first.")

    conn = get_connection(db_path)
    check_model_binding(conn, cfg)

    model_name = get_db_config(conn, "model_name") or cfg["embedding"]["model"]
    top = top or cfg["search"]["default_top"]
    fmt = fmt or cfg["search"]["default_format"]
    tag_list = [t.strip() for t in tags.split(",")] if tags else None

    results = hybrid_search(
        conn, query, model_name, cfg,
        top=top, tags=tag_list, doc_type=doc_type,
        fts_only=fts_only, vec_only=vec_only, threshold=threshold,
    )
    conn.close()

    click.echo(format_search_results(results, fmt))


@main.command("list")
@click.option("--type", "doc_type", default=None, type=click.Choice(["pdf", "markdown", "code", "note"]), help="Filter by document type.")
@click.option("--tags", default=None, help="Filter by tags (comma-separated).")
@click.option("--format", "fmt", default=None, type=click.Choice(["json", "human"]), help="Output format.")
def list_docs(doc_type, tags, fmt):
    """List indexed documents."""
    from kb_search.config import get_db_path, load_config
    from kb_search.database import get_connection
    from kb_search.output import format_document_list

    cfg = load_config()
    db_path = get_db_path(cfg)
    if not db_path.exists():
        raise click.ClickException("Knowledge base not initialised. Run `kb init` first.")

    conn = get_connection(db_path)
    fmt = fmt or cfg["search"]["default_format"]

    sql = """
        SELECT d.id, d.title, d.doc_type as type, d.created_at,
               COUNT(c.id) as chunk_count
        FROM documents d
        LEFT JOIN chunks c ON d.id = c.document_id
    """
    joins = []
    where = []
    params = []

    if doc_type:
        where.append("d.doc_type = ?")
        params.append(doc_type)

    tag_list = [t.strip().lower() for t in tags.split(",")] if tags else []
    for i, tag in enumerate(tag_list):
        joins.append(f"JOIN document_tags dt{i} ON d.id = dt{i}.document_id")
        joins.append(f"JOIN tags t{i} ON dt{i}.tag_id = t{i}.id")
        where.append(f"t{i}.name = ?")
        params.append(tag)

    sql += " " + " ".join(joins)
    if where:
        sql += " WHERE " + " AND ".join(where)
    sql += " GROUP BY d.id ORDER BY d.created_at DESC"

    rows = conn.execute(sql, params).fetchall()

    docs = []
    for row in rows:
        tag_rows = conn.execute("""
            SELECT t.name FROM tags t
            JOIN document_tags dt ON t.id = dt.tag_id
            WHERE dt.document_id = ?
            ORDER BY t.name
        """, (row["id"],)).fetchall()
        docs.append({
            "id": row["id"],
            "title": row["title"],
            "type": row["type"],
            "tags": [r["name"] for r in tag_rows],
            "chunk_count": row["chunk_count"],
            "created_at": row["created_at"],
        })

    conn.close()
    click.echo(format_document_list(docs, fmt))


@main.command()
@click.argument("doc_id", type=int)
@click.option("--format", "fmt", default=None, type=click.Choice(["json", "human"]), help="Output format.")
def info(doc_id, fmt):
    """Show document details."""
    import json as jsonlib
    from kb_search.config import get_db_path, load_config
    from kb_search.database import get_connection
    from kb_search.output import format_doc_info

    cfg = load_config()
    db_path = get_db_path(cfg)
    if not db_path.exists():
        raise click.ClickException("Knowledge base not initialised. Run `kb init` first.")

    conn = get_connection(db_path)
    fmt = fmt or cfg["search"]["default_format"]

    row = conn.execute("SELECT * FROM documents WHERE id = ?", (doc_id,)).fetchone()
    if not row:
        raise click.ClickException(f"Document not found: {doc_id}")

    chunks = conn.execute(
        "SELECT chunk_index, text FROM chunks WHERE document_id = ? ORDER BY chunk_index",
        (doc_id,),
    ).fetchall()

    tag_rows = conn.execute("""
        SELECT t.name FROM tags t
        JOIN document_tags dt ON t.id = dt.tag_id
        WHERE dt.document_id = ?
        ORDER BY t.name
    """, (doc_id,)).fetchall()

    info_data = {
        "id": row["id"],
        "title": row["title"],
        "type": row["doc_type"],
        "language": row["language"],
        "path": row["source_path"],
        "content_hash": row["content_hash"],
        "created_at": row["created_at"],
        "tags": [r["name"] for r in tag_rows],
        "chunk_count": len(chunks),
        "chunks": [{"chunk_index": c["chunk_index"], "text": c["text"]} for c in chunks],
    }

    conn.close()
    click.echo(format_doc_info(info_data, fmt))


@main.command()
@click.argument("doc_id", type=int)
@click.option("--yes", is_flag=True, help="Skip confirmation prompt.")
def remove(doc_id, yes):
    """Remove a document from the knowledge base."""
    from kb_search.config import get_db_path, load_config
    from kb_search.database import get_connection

    cfg = load_config()
    db_path = get_db_path(cfg)
    if not db_path.exists():
        raise click.ClickException("Knowledge base not initialised. Run `kb init` first.")

    conn = get_connection(db_path)
    row = conn.execute("SELECT id, title FROM documents WHERE id = ?", (doc_id,)).fetchone()
    if not row:
        raise click.ClickException(f"Document not found: {doc_id}")

    chunk_count = conn.execute(
        "SELECT COUNT(*) FROM chunks WHERE document_id = ?", (doc_id,)
    ).fetchone()[0]

    if not yes:
        if not click.confirm(f"Remove '{row['title']}' and its {chunk_count} chunks?"):
            click.echo("Cancelled.")
            conn.close()
            return

    # Delete vectors for this document's chunks
    conn.execute("""
        DELETE FROM chunks_vec WHERE chunk_id IN (
            SELECT id FROM chunks WHERE document_id = ?
        )
    """, (doc_id,))
    # Cascade handles chunks, document_tags
    conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
    conn.commit()
    conn.close()
    click.echo(f"Removed '{row['title']}' ({chunk_count} chunks).")


@main.command("tags")
@click.option("--format", "fmt", default=None, type=click.Choice(["json", "human"]), help="Output format.")
def list_tags(fmt):
    """List all tags with document counts."""
    from kb_search.config import get_db_path, load_config
    from kb_search.database import get_connection
    from kb_search.output import format_tags

    cfg = load_config()
    db_path = get_db_path(cfg)
    if not db_path.exists():
        raise click.ClickException("Knowledge base not initialised. Run `kb init` first.")

    conn = get_connection(db_path)
    fmt = fmt or cfg["search"]["default_format"]

    rows = conn.execute("""
        SELECT t.name, COUNT(dt.document_id) as count
        FROM tags t
        LEFT JOIN document_tags dt ON t.id = dt.tag_id
        GROUP BY t.id
        ORDER BY count DESC, t.name
    """).fetchall()

    tags = [{"name": r["name"], "count": r["count"]} for r in rows]
    conn.close()
    click.echo(format_tags(tags, fmt))


@main.command()
@click.argument("doc_id", type=int)
@click.option("--add", "add_tags", default=None, help="Tags to add (comma-separated).")
@click.option("--remove", "remove_tags", default=None, help="Tags to remove (comma-separated).")
def tag(doc_id, add_tags, remove_tags):
    """Manage tags on a document."""
    from kb_search.config import get_db_path, load_config
    from kb_search.database import get_connection, tag_document, untag_document

    cfg = load_config()
    db_path = get_db_path(cfg)
    if not db_path.exists():
        raise click.ClickException("Knowledge base not initialised. Run `kb init` first.")

    conn = get_connection(db_path)
    row = conn.execute("SELECT id, title FROM documents WHERE id = ?", (doc_id,)).fetchone()
    if not row:
        raise click.ClickException(f"Document not found: {doc_id}")

    if add_tags:
        tags = [t.strip() for t in add_tags.split(",")]
        tag_document(conn, doc_id, tags)
        conn.commit()
        click.echo(f"Added tags [{', '.join(tags)}] to '{row['title']}'")

    if remove_tags:
        tags = [t.strip() for t in remove_tags.split(",")]
        untag_document(conn, doc_id, tags)
        conn.commit()
        click.echo(f"Removed tags [{', '.join(tags)}] from '{row['title']}'")

    conn.close()


@main.command()
@click.option("--format", "fmt", default=None, type=click.Choice(["json", "human"]), help="Output format.")
def status(fmt):
    """Show knowledge base status and statistics."""
    from kb_search.config import get_db_path, load_config
    from kb_search.database import get_connection, get_db_config
    from kb_search.output import format_status

    cfg = load_config()
    db_path = get_db_path(cfg)
    if not db_path.exists():
        raise click.ClickException("Knowledge base not initialised. Run `kb init` first.")

    conn = get_connection(db_path)
    fmt = fmt or cfg["search"]["default_format"]

    doc_counts = {}
    for row in conn.execute("SELECT doc_type, COUNT(*) as cnt FROM documents GROUP BY doc_type").fetchall():
        doc_counts[row["doc_type"]] = row["cnt"]

    total_docs = sum(doc_counts.values())
    total_chunks = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
    db_size = db_path.stat().st_size

    status_data = {
        "model_name": get_db_config(conn, "model_name", "not set"),
        "embedding_dim": get_db_config(conn, "embedding_dim", "not set"),
        "schema_version": get_db_config(conn, "schema_version", "not set"),
        "db_size_bytes": db_size,
        "documents": doc_counts,
        "total_documents": total_docs,
        "total_chunks": total_chunks,
    }

    conn.close()
    click.echo(format_status(status_data, fmt))


@main.command()
@click.option("--model", default=None, help="Switch to a different embedding model.")
def reindex(model):
    """Re-embed all chunks (optionally with a new model)."""
    import struct
    from kb_search.config import get_db_path, load_config
    from kb_search.database import (
        get_connection, get_db_config, insert_embedding,
        recreate_vec_table, set_db_config,
    )
    from kb_search.embeddings import download_model, embed_texts, get_model_dim

    cfg = load_config()
    db_path = get_db_path(cfg)
    if not db_path.exists():
        raise click.ClickException("Knowledge base not initialised. Run `kb init` first.")

    conn = get_connection(db_path)
    model_name = model or get_db_config(conn, "model_name") or cfg["embedding"]["model"]

    # Download model if switching
    if model:
        download_model(model_name)

    dim = get_model_dim(model_name)

    # Get all chunks
    rows = conn.execute("SELECT id, text FROM chunks ORDER BY id").fetchall()
    if not rows:
        click.echo("No chunks to re-embed.")
        conn.close()
        return

    click.echo(f"Re-embedding {len(rows)} chunks with '{model_name}' ({dim} dim)...")

    # Embed in batches
    batch_size = 256
    all_ids = [r["id"] for r in rows]
    all_texts = [r["text"] for r in rows]

    prefix = cfg["embedding"].get("passage_prefix", "")
    all_embeddings = []
    with click.progressbar(range(0, len(all_texts), batch_size), label="Embedding") as bar:
        for i in bar:
            batch = all_texts[i:i + batch_size]
            batch_embs = embed_texts(model_name, batch, prefix=prefix)
            all_embeddings.extend(batch_embs)

    # Atomically replace vectors
    recreate_vec_table(conn, dim)
    for chunk_id, emb in zip(all_ids, all_embeddings):
        insert_embedding(conn, chunk_id, emb)

    set_db_config(conn, "model_name", model_name)
    set_db_config(conn, "embedding_dim", str(dim))
    conn.commit()
    conn.close()

    click.echo(f"Reindex complete. {len(rows)} chunks embedded with '{model_name}'.")


@main.group(invoke_without_command=True)
@click.pass_context
def config(ctx):
    """View or modify configuration."""
    if ctx.invoked_subcommand is None:
        from kb_search.config import config_with_sources

        entries = config_with_sources()
        max_key = max(len(k) for k, _, _ in entries)
        max_val = max(len(v) for _, v, _ in entries)
        for key, value, source in entries:
            click.echo(f" {key:<{max_key}} {value:<{max_val}} ({source})")


@config.command("set")
@click.argument("key")
@click.argument("value")
def config_set(key, value):
    """Set a configuration value."""
    from kb_search.config import get_config_path, load_config, save_config_value

    cfg = load_config()
    path = get_config_path(cfg)
    save_config_value(path, key, value)
    click.echo(f"Set {key} = {value} in {path}")


main.add_command(config)
@@ -0,0 +1,195 @@
"""Configuration loading with YAML + ENV + defaults."""

import os
from copy import deepcopy
from pathlib import Path

import yaml

DEFAULTS = {
    "data_dir": "~/.kb",
    "embedding": {
        "model": "all-MiniLM-L6-v2",
        "query_prefix": "",
        "passage_prefix": "",
    },
    "search": {
        "default_top": 10,
        "default_format": "json",
        "rrf_k": 60,
    },
    "chunking": {
        "defaults": {
            "max_tokens": 512,
            "overlap_tokens": 50,
        },
        "pdf": {
            "strategy": "hierarchy",
            "max_tokens": 1024,
        },
        "markdown": {
            "strategy": "header",
            "min_tokens": 50,
            "max_tokens": 1024,
        },
        "code": {
            "strategy": "ast",
            "include_context": True,
            "max_tokens": 1024,
        },
        "note": {
            "strategy": "whole",
        },
    },
    "ingestion": {
        "workers": 4,
        "batch_size": 50,
        "enable_ocr": "auto",
    },
}

# ENV variable mapping: ENV_NAME -> config dotted key
ENV_MAP = {
    "KB_DATA_DIR": "data_dir",
    "KB_MODEL": "embedding.model",
    "KB_DEFAULT_TOP": "search.default_top",
    "KB_DEFAULT_FORMAT": "search.default_format",
}

# Type coercions for ENV values
ENV_TYPES = {
    "search.default_top": int,
}


def _deep_merge(base: dict, override: dict) -> dict:
    """Deep merge override into base, returning a new dict."""
    result = deepcopy(base)
    for key, value in override.items():
        if key in result and isinstance(result[key], dict) and isinstance(value, dict):
            result[key] = _deep_merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


def _set_nested(d: dict, dotted_key: str, value):
    """Set a value in a nested dict using a dotted key path."""
    keys = dotted_key.split(".")
    for key in keys[:-1]:
        d = d.setdefault(key, {})
    d[keys[-1]] = value


def _get_nested(d: dict, dotted_key: str, default=None):
    """Get a value from a nested dict using a dotted key path."""
    keys = dotted_key.split(".")
    for key in keys:
        if not isinstance(d, dict) or key not in d:
            return default
        d = d[key]
    return d


def get_data_dir(cfg: dict) -> Path:
    """Resolve the data directory from config."""
    return Path(cfg["data_dir"]).expanduser()


def get_config_path(cfg: dict) -> Path:
    """Path to the YAML config file."""
    return get_data_dir(cfg) / "config.yaml"


def get_db_path(cfg: dict) -> Path:
    """Path to the SQLite database."""
    return get_data_dir(cfg) / "kb.db"


def load_config(config_path: Path | None = None) -> dict:
|
||||
"""Load config with precedence: ENV > YAML > defaults.
|
||||
|
||||
CLI flags are applied by the caller after this returns.
|
||||
"""
|
||||
cfg = deepcopy(DEFAULTS)
|
||||
|
||||
# Determine config file path (ENV can override data_dir which affects path)
|
||||
if config_path is None:
|
||||
data_dir = os.environ.get("KB_DATA_DIR", DEFAULTS["data_dir"])
|
||||
config_path = Path(data_dir).expanduser() / "config.yaml"
|
||||
|
||||
# Load YAML if it exists
|
||||
if config_path.is_file():
|
||||
with open(config_path) as f:
|
||||
yaml_cfg = yaml.safe_load(f) or {}
|
||||
cfg = _deep_merge(cfg, yaml_cfg)
|
||||
|
||||
# Apply ENV overrides
|
||||
for env_name, dotted_key in ENV_MAP.items():
|
||||
env_val = os.environ.get(env_name)
|
||||
if env_val is not None:
|
||||
coerce = ENV_TYPES.get(dotted_key, str)
|
||||
_set_nested(cfg, dotted_key, coerce(env_val))
|
||||
|
||||
return cfg
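The precedence that `load_config` describes (ENV > YAML > defaults) can be sketched without touching the filesystem. This is a minimal illustration, not the module's code: the `DEFAULTS` values are made up, the YAML layer is replaced by a plain dict, and `resolve` is a hypothetical name.

```python
from copy import deepcopy

# Hypothetical values; the real DEFAULTS are defined earlier in this module.
DEFAULTS = {"search": {"default_top": 5, "default_format": "text"}}
ENV_MAP = {"KB_DEFAULT_TOP": "search.default_top"}
ENV_TYPES = {"search.default_top": int}


def resolve(file_cfg: dict, environ: dict) -> dict:
    """Layer defaults, then file values, then environment overrides."""
    cfg = deepcopy(DEFAULTS)
    # File layer: a one-level section merge is enough for this two-level sketch.
    for section, value in file_cfg.items():
        if isinstance(value, dict) and isinstance(cfg.get(section), dict):
            cfg[section] = {**cfg[section], **value}
        else:
            cfg[section] = value
    # ENV layer: applied last, so it wins over both other layers.
    for env_name, dotted_key in ENV_MAP.items():
        raw = environ.get(env_name)
        if raw is None:
            continue
        *parents, leaf = dotted_key.split(".")
        node = cfg
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = ENV_TYPES.get(dotted_key, str)(raw)
    return cfg


file_cfg = {"search": {"default_top": 10}}
print(resolve({}, {})["search"]["default_top"])                              # 5
print(resolve(file_cfg, {})["search"]["default_top"])                        # 10
print(resolve(file_cfg, {"KB_DEFAULT_TOP": "20"})["search"]["default_top"])  # 20
```

Passing the environment in as a dict, rather than reading `os.environ` directly, keeps the sketch easy to test.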
|
||||
|
||||
|
||||
def save_config_value(config_path: Path, dotted_key: str, value: str):
|
||||
"""Set a single value in the YAML config file."""
|
||||
config_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
existing = {}
|
||||
if config_path.is_file():
|
||||
with open(config_path) as f:
|
||||
existing = yaml.safe_load(f) or {}
|
||||
|
||||
# Try numeric coercion
|
||||
try:
|
||||
value = int(value)
|
||||
except ValueError:
|
||||
try:
|
||||
value = float(value)
|
||||
except ValueError:
|
||||
if value.lower() in ("true", "false"):
|
||||
value = value.lower() == "true"
|
||||
|
||||
_set_nested(existing, dotted_key, value)
|
||||
|
||||
with open(config_path, "w") as f:
|
||||
yaml.dump(existing, f, default_flow_style=False, sort_keys=False)
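The coercion order used by `save_config_value` (int, then float, then bool, otherwise the raw string) can be isolated as a small helper. `coerce_scalar` is a hypothetical name used only for this sketch.

```python
def coerce_scalar(raw: str):
    """Interpret a CLI string as int, float, or bool; otherwise keep it as str."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    if raw.lower() in ("true", "false"):
        return raw.lower() == "true"
    return raw


for sample in ("42", "0.5", "True", "some-model"):
    print(repr(sample), "->", repr(coerce_scalar(sample)))
```

One consequence of this order is that strings such as `"1e3"`, `"nan"`, or `"inf"` are stored as floats, which may or may not be what a user setting a string-valued key expects.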
|
||||
|
||||
|
||||
def config_with_sources(config_path: Path | None = None) -> list[tuple[str, str, str]]:
|
||||
"""Return a flat list of (dotted_key, value, source) tuples for display."""
|
||||
if config_path is None:
|
||||
data_dir = os.environ.get("KB_DATA_DIR", DEFAULTS["data_dir"])
|
||||
config_path = Path(data_dir).expanduser() / "config.yaml"
|
||||
|
||||
yaml_cfg = {}
|
||||
if config_path.is_file():
|
||||
with open(config_path) as f:
|
||||
yaml_cfg = yaml.safe_load(f) or {}
|
||||
|
||||
# Build reverse ENV map for source detection
|
||||
env_keys = {v: k for k, v in ENV_MAP.items()}
|
||||
|
||||
def _flatten(d, prefix=""):
|
||||
items = []
|
||||
for k, v in d.items():
|
||||
key = f"{prefix}.{k}" if prefix else k
|
||||
if isinstance(v, dict):
|
||||
items.extend(_flatten(v, key))
|
||||
else:
|
||||
# Determine source
|
||||
env_name = env_keys.get(key)
|
||||
if env_name and os.environ.get(env_name) is not None:
|
||||
source = f"env ({env_name})"
|
||||
elif _get_nested(yaml_cfg, key) is not None:
|
||||
source = "config.yaml"
|
||||
else:
|
||||
source = "default"
|
||||
items.append((key, str(v), source))
|
||||
return items
|
||||
|
||||
cfg = load_config(config_path)
|
||||
return _flatten(cfg)
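The flattening step inside `config_with_sources` turns a nested dict into dotted keys. A stripped-down sketch of just that step, without the source detection, might look like this (`flatten` is an illustrative name):

```python
def flatten(d: dict, prefix: str = "") -> list[tuple[str, object]]:
    """Walk a nested dict depth-first and emit (dotted_key, value) pairs."""
    items = []
    for k, v in d.items():
        key = f"{prefix}.{k}" if prefix else k
        if isinstance(v, dict):
            items.extend(flatten(v, key))
        else:
            items.append((key, v))
    return items


print(flatten({"a": 1, "b": {"c": 2, "d": {"e": 3}}}))
# [('a', 1), ('b.c', 2), ('b.d.e', 3)]
```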
|
||||
@@ -0,0 +1,229 @@
|
||||
"""SQLite database management with FTS5 and sqlite-vec."""
|
||||
|
||||
import json
|
||||
import sqlite3
|
||||
from pathlib import Path
|
||||
|
||||
import sqlite_vec
|
||||
|
||||
SCHEMA_VERSION = 1
|
||||
|
||||
|
||||
def get_connection(db_path: Path) -> sqlite3.Connection:
|
||||
"""Open a SQLite connection with sqlite-vec loaded."""
|
||||
conn = sqlite3.connect(str(db_path))
|
||||
conn.enable_load_extension(True)
|
||||
sqlite_vec.load(conn)
|
||||
conn.enable_load_extension(False)
|
||||
conn.execute("PRAGMA journal_mode=WAL")
|
||||
conn.execute("PRAGMA foreign_keys=ON")
|
||||
conn.row_factory = sqlite3.Row
|
||||
return conn
|
||||
|
||||
|
||||
def init_schema(conn: sqlite3.Connection, embedding_dim: int):
|
||||
"""Create all tables, FTS, vector index, and triggers."""
|
||||
conn.executescript(f"""
|
||||
CREATE TABLE IF NOT EXISTS documents (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
title TEXT NOT NULL,
|
||||
source_path TEXT,
|
||||
content_hash TEXT NOT NULL,
|
||||
doc_type TEXT NOT NULL CHECK(doc_type IN ('pdf','markdown','code','note')),
|
||||
language TEXT,
|
||||
created_at TEXT DEFAULT (datetime('now')),
|
||||
metadata TEXT DEFAULT '{{}}'
|
||||
);
|
||||
|
||||
CREATE TABLE IF NOT EXISTS chunks (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
|
||||
chunk_index INTEGER NOT NULL,
|
||||
text TEXT NOT NULL,
|
||||
token_count INTEGER,
|
||||
metadata TEXT DEFAULT '{{}}',
|
||||
created_at TEXT DEFAULT (datetime('now'))
|
||||
);
|
||||
|
||||
CREATE TABLE IF NOT EXISTS tags (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
name TEXT UNIQUE NOT NULL
|
||||
);
|
||||
|
||||
CREATE TABLE IF NOT EXISTS document_tags (
|
||||
document_id INTEGER REFERENCES documents(id) ON DELETE CASCADE,
|
||||
tag_id INTEGER REFERENCES tags(id) ON DELETE CASCADE,
|
||||
PRIMARY KEY (document_id, tag_id)
|
||||
);
|
||||
|
||||
CREATE TABLE IF NOT EXISTS config (
|
||||
key TEXT PRIMARY KEY,
|
||||
value TEXT NOT NULL
|
||||
);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_chunks_document_id ON chunks(document_id);
|
||||
CREATE INDEX IF NOT EXISTS idx_documents_content_hash ON documents(content_hash);
|
||||
""")
|
||||
|
||||
# FTS5 virtual table (content-sync with chunks)
|
||||
conn.execute("""
|
||||
CREATE VIRTUAL TABLE IF NOT EXISTS chunks_fts USING fts5(
|
||||
text,
|
||||
content='chunks',
|
||||
content_rowid='id',
|
||||
tokenize='porter unicode61'
|
||||
)
|
||||
""")
|
||||
|
||||
# FTS sync triggers
|
||||
conn.executescript("""
|
||||
CREATE TRIGGER IF NOT EXISTS chunks_ai AFTER INSERT ON chunks BEGIN
|
||||
INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
|
||||
END;
|
||||
|
||||
CREATE TRIGGER IF NOT EXISTS chunks_ad AFTER DELETE ON chunks BEGIN
|
||||
INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
|
||||
END;
|
||||
|
||||
CREATE TRIGGER IF NOT EXISTS chunks_au AFTER UPDATE ON chunks BEGIN
|
||||
INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
|
||||
INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
|
||||
END;
|
||||
""")
|
||||
|
||||
# Vector table
|
||||
conn.execute(f"""
|
||||
CREATE VIRTUAL TABLE IF NOT EXISTS chunks_vec USING vec0(
|
||||
chunk_id INTEGER PRIMARY KEY,
|
||||
embedding FLOAT[{embedding_dim}]
|
||||
)
|
||||
""")
|
||||
|
||||
conn.commit()
|
||||
|
||||
|
||||
def get_db_config(conn: sqlite3.Connection, key: str, default: str | None = None) -> str | None:
|
||||
"""Get a value from the config table."""
|
||||
row = conn.execute("SELECT value FROM config WHERE key = ?", (key,)).fetchone()
|
||||
return row["value"] if row else default
|
||||
|
||||
|
||||
def set_db_config(conn: sqlite3.Connection, key: str, value: str):
|
||||
"""Set a value in the config table."""
|
||||
conn.execute(
|
||||
"INSERT INTO config (key, value) VALUES (?, ?) ON CONFLICT(key) DO UPDATE SET value = ?",
|
||||
(key, value, value),
|
||||
)
|
||||
conn.commit()
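The upsert in `set_db_config` can also be written with SQLite's `excluded` pseudo-table, which avoids binding the value twice. The sketch below runs against an in-memory database; `set_value` and `get_value` are illustrative names, and `ON CONFLICT ... DO UPDATE` assumes SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE config (key TEXT PRIMARY KEY, value TEXT NOT NULL)")


def set_value(conn: sqlite3.Connection, key: str, value: str) -> None:
    """Insert the key, or overwrite its value if the key already exists."""
    conn.execute(
        "INSERT INTO config (key, value) VALUES (?, ?) "
        "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
        (key, value),
    )


def get_value(conn: sqlite3.Connection, key: str, default=None):
    row = conn.execute("SELECT value FROM config WHERE key = ?", (key,)).fetchone()
    return row[0] if row else default


set_value(conn, "model_name", "model-a")
set_value(conn, "model_name", "model-b")  # second write replaces the first
print(get_value(conn, "model_name"))      # model-b
```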
|
||||
|
||||
|
||||
def check_schema_version(conn: sqlite3.Connection) -> int | None:
    """Return the stored schema version.

    Returns 0 if the config table exists but has no schema_version row, and
    None if the value cannot be read (e.g. the config table does not exist yet).
    """
    try:
        return int(get_db_config(conn, "schema_version", "0"))
    except Exception:
        return None
|
||||
|
||||
|
||||
def run_migrations(conn: sqlite3.Connection):
|
||||
"""Run pending schema migrations."""
|
||||
current = check_schema_version(conn) or 0
|
||||
|
||||
# Migration registry: version -> callable
|
||||
migrations: dict[int, callable] = {
|
||||
# Future migrations go here:
|
||||
# 2: _migrate_v2,
|
||||
}
|
||||
|
||||
for version in sorted(migrations.keys()):
|
||||
if current < version:
|
||||
migrations[version](conn)
|
||||
set_db_config(conn, "schema_version", str(version))
|
||||
|
||||
if current < SCHEMA_VERSION:
|
||||
set_db_config(conn, "schema_version", str(SCHEMA_VERSION))
|
||||
|
||||
|
||||
def recreate_vec_table(conn: sqlite3.Connection, embedding_dim: int):
|
||||
"""Drop and recreate the vector table with a new dimension."""
|
||||
conn.execute("DROP TABLE IF EXISTS chunks_vec")
|
||||
conn.execute(f"""
|
||||
CREATE VIRTUAL TABLE chunks_vec USING vec0(
|
||||
chunk_id INTEGER PRIMARY KEY,
|
||||
embedding FLOAT[{embedding_dim}]
|
||||
)
|
||||
""")
|
||||
conn.commit()
|
||||
|
||||
|
||||
def insert_document(conn: sqlite3.Connection, title: str, source_path: str | None,
|
||||
content_hash: str, doc_type: str, language: str | None = None,
|
||||
metadata: dict | None = None) -> int:
|
||||
"""Insert a document and return its ID."""
|
||||
cur = conn.execute(
|
||||
"INSERT INTO documents (title, source_path, content_hash, doc_type, language, metadata) "
|
||||
"VALUES (?, ?, ?, ?, ?, ?)",
|
||||
(title, source_path, content_hash, doc_type, language, json.dumps(metadata or {})),
|
||||
)
|
||||
return cur.lastrowid
|
||||
|
||||
|
||||
def insert_chunk(conn: sqlite3.Connection, document_id: int, chunk_index: int,
|
||||
text: str, token_count: int | None = None,
|
||||
metadata: dict | None = None) -> int:
|
||||
"""Insert a chunk and return its ID."""
|
||||
cur = conn.execute(
|
||||
"INSERT INTO chunks (document_id, chunk_index, text, token_count, metadata) "
|
||||
"VALUES (?, ?, ?, ?, ?)",
|
||||
(document_id, chunk_index, text, token_count, json.dumps(metadata or {})),
|
||||
)
|
||||
return cur.lastrowid
|
||||
|
||||
|
||||
def insert_embedding(conn: sqlite3.Connection, chunk_id: int, embedding: list[float]):
|
||||
"""Insert a chunk embedding into the vector table."""
|
||||
import struct
|
||||
blob = struct.pack(f"{len(embedding)}f", *embedding)
|
||||
conn.execute(
|
||||
"INSERT INTO chunks_vec (chunk_id, embedding) VALUES (?, ?)",
|
||||
(chunk_id, blob),
|
||||
)
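`insert_embedding` stores each vector as a packed float32 blob. The round trip below shows what that blob looks like; `pack_f32` and `unpack_f32` are hypothetical helper names. The `sqlite_vec` package also ships a `serialize_float32` helper that could be used instead of hand-rolled packing.

```python
import struct


def pack_f32(vec: list[float]) -> bytes:
    """Pack floats as consecutive 4-byte float32 values (native byte order)."""
    return struct.pack(f"{len(vec)}f", *vec)


def unpack_f32(blob: bytes) -> list[float]:
    """Inverse of pack_f32: every 4 bytes become one float."""
    return list(struct.unpack(f"{len(blob) // 4}f", blob))


blob = pack_f32([0.5, 1.0, -2.0])
print(len(blob))         # 12
print(unpack_f32(blob))  # [0.5, 1.0, -2.0]
```

Values that are not exactly representable in float32 come back slightly rounded, so stored embeddings should be compared with a tolerance rather than for equality.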
|
||||
|
||||
|
||||
def hash_exists(conn: sqlite3.Connection, content_hash: str) -> bool:
|
||||
"""Check if a document with this content hash already exists."""
|
||||
row = conn.execute(
|
||||
"SELECT 1 FROM documents WHERE content_hash = ? LIMIT 1", (content_hash,)
|
||||
).fetchone()
|
||||
return row is not None
|
||||
|
||||
|
||||
def get_or_create_tag(conn: sqlite3.Connection, name: str) -> int:
|
||||
"""Get or create a tag, return its ID. Tags are stored lowercase."""
|
||||
name = name.strip().lower()
|
||||
row = conn.execute("SELECT id FROM tags WHERE name = ?", (name,)).fetchone()
|
||||
if row:
|
||||
return row["id"]
|
||||
cur = conn.execute("INSERT INTO tags (name) VALUES (?)", (name,))
|
||||
return cur.lastrowid
|
||||
|
||||
|
||||
def tag_document(conn: sqlite3.Connection, document_id: int, tag_names: list[str]):
|
||||
"""Associate tags with a document."""
|
||||
for name in tag_names:
|
||||
tag_id = get_or_create_tag(conn, name)
|
||||
conn.execute(
|
||||
"INSERT OR IGNORE INTO document_tags (document_id, tag_id) VALUES (?, ?)",
|
||||
(document_id, tag_id),
|
||||
)
|
||||
|
||||
|
||||
def untag_document(conn: sqlite3.Connection, document_id: int, tag_names: list[str]):
|
||||
"""Remove tag associations from a document."""
|
||||
for name in tag_names:
|
||||
name = name.strip().lower()
|
||||
conn.execute(
|
||||
"DELETE FROM document_tags WHERE document_id = ? AND tag_id = "
|
||||
"(SELECT id FROM tags WHERE name = ?)",
|
||||
(document_id, name),
|
||||
)
|
||||
@@ -0,0 +1,67 @@
|
||||
"""Embedding model management — download, load, and inference via ONNX."""
|
||||
|
||||
import click
|
||||
from pathlib import Path
|
||||
|
||||
_model_instance = None
|
||||
_model_name = None
|
||||
|
||||
|
||||
def load_model(model_name: str):
    """Load a sentence-transformers model with ONNX backend. Caches in-process."""
    global _model_instance, _model_name
    if _model_instance is not None and _model_name == model_name:
        return _model_instance

    from sentence_transformers import SentenceTransformer

    click.echo(f"Loading model '{model_name}'...")
    try:
        _model_instance = SentenceTransformer(model_name, backend="onnx")
    except Exception:
        # Fallback: some models may not have pre-exported ONNX, so request an explicit export.
        # Assumes the installed sentence-transformers version accepts `export` in model_kwargs.
        click.echo("Optimising model for ONNX inference (one-time)...")
        _model_instance = SentenceTransformer(
            model_name, backend="onnx", model_kwargs={"export": True}
        )

    _model_name = model_name
    return _model_instance
|
||||
|
||||
|
||||
def get_model_dim(model_name: str) -> int:
|
||||
"""Get the embedding dimension for a model."""
|
||||
model = load_model(model_name)
|
||||
return model.get_sentence_embedding_dimension()
|
||||
|
||||
|
||||
def embed_texts(model_name: str, texts: list[str],
|
||||
prefix: str = "", show_progress: bool = False) -> list[list[float]]:
|
||||
"""Embed a list of texts, returning float vectors."""
|
||||
model = load_model(model_name)
|
||||
if prefix:
|
||||
texts = [prefix + t for t in texts]
|
||||
embeddings = model.encode(texts, show_progress_bar=show_progress, convert_to_numpy=True)
|
||||
return [e.tolist() for e in embeddings]
|
||||
|
||||
|
||||
def download_model(model_name: str):
|
||||
"""Pre-download a model (for kb init)."""
|
||||
click.echo(f"Downloading embedding model '{model_name}'...")
|
||||
load_model(model_name)
|
||||
click.echo("Embedding model ready.")
|
||||
|
||||
|
||||
def check_model_binding(conn, cfg: dict):
|
||||
"""Verify the loaded model matches what the DB expects. Raises on mismatch."""
|
||||
from kb_search.database import get_db_config
|
||||
|
||||
db_model = get_db_config(conn, "model_name")
|
||||
if db_model is None:
|
||||
return # Not yet initialised
|
||||
|
||||
config_model = cfg["embedding"]["model"]
|
||||
if db_model != config_model:
|
||||
db_dim = get_db_config(conn, "embedding_dim", "?")
|
||||
raise click.ClickException(
|
||||
f"Model mismatch: DB uses '{db_model}' ({db_dim} dim) but config specifies "
|
||||
f"'{config_model}'. Run `kb reindex --model {config_model}` to switch models."
|
||||
)
|
||||
@@ -0,0 +1,244 @@
|
||||
"""Code ingestion — AST/regex-based splitting for Python, Bash, Go."""
|
||||
|
||||
import ast
|
||||
import re
|
||||
|
||||
|
||||
def chunk_code(text: str, language: str | None, cfg: dict) -> list[dict]:
|
||||
"""Split code at function/class boundaries."""
|
||||
chunking_cfg = cfg.get("chunking", {}).get("code", {})
|
||||
strategy = chunking_cfg.get("strategy", "ast")
|
||||
include_context = chunking_cfg.get("include_context", True)
|
||||
|
||||
if strategy == "fixed":
|
||||
return _fixed_chunk(text, chunking_cfg)
|
||||
|
||||
if language == "python":
|
||||
chunks = _chunk_python(text, include_context)
|
||||
elif language in ("bash", "sh"):
|
||||
chunks = _chunk_bash(text, include_context)
|
||||
elif language == "go":
|
||||
chunks = _chunk_go(text, include_context)
|
||||
else:
|
||||
chunks = []
|
||||
|
||||
if not chunks:
|
||||
return _fixed_chunk(text, chunking_cfg)
|
||||
|
||||
for i, c in enumerate(chunks):
|
||||
c["chunk_index"] = i
|
||||
|
||||
return chunks
|
||||
|
||||
|
||||
def _chunk_python(text: str, include_context: bool) -> list[dict]:
|
||||
"""Split Python using stdlib ast module."""
|
||||
try:
|
||||
tree = ast.parse(text)
|
||||
except SyntaxError:
|
||||
return []
|
||||
|
||||
lines = text.splitlines(keepends=True)
|
||||
chunks = []
|
||||
|
||||
for node in ast.iter_child_nodes(tree):
|
||||
if isinstance(node, ast.ClassDef):
|
||||
class_lines = _get_node_source(lines, node)
|
||||
class_docstring = ast.get_docstring(node) or ""
|
||||
|
||||
# Each method becomes a chunk
|
||||
methods = [n for n in ast.iter_child_nodes(node) if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
|
||||
|
||||
if methods:
|
||||
for method in methods:
|
||||
method_src = _get_node_source(lines, method)
|
||||
if include_context and class_docstring:
|
||||
context = f"class {node.name}:\n \"\"\"{class_docstring}\"\"\"\n\n"
|
||||
chunk_text = context + method_src
|
||||
elif include_context:
|
||||
chunk_text = f"class {node.name}:\n\n" + method_src
|
||||
else:
|
||||
chunk_text = method_src
|
||||
|
||||
chunks.append({
|
||||
"text": chunk_text,
|
||||
"metadata": {
|
||||
"symbol_name": f"{node.name}.{method.name}",
|
||||
"line_start": method.lineno,
|
||||
"line_end": method.end_lineno,
|
||||
},
|
||||
})
|
||||
else:
|
||||
# Class with no methods — single chunk
|
||||
chunks.append({
|
||||
"text": class_lines,
|
||||
"metadata": {
|
||||
"symbol_name": node.name,
|
||||
"line_start": node.lineno,
|
||||
"line_end": node.end_lineno,
|
||||
},
|
||||
})
|
||||
|
||||
elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
|
||||
func_src = _get_node_source(lines, node)
|
||||
chunks.append({
|
||||
"text": func_src,
|
||||
"metadata": {
|
||||
"symbol_name": node.name,
|
||||
"line_start": node.lineno,
|
||||
"line_end": node.end_lineno,
|
||||
},
|
||||
})
|
||||
|
||||
return chunks
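The core of `_chunk_python` is a walk over the module's top-level nodes, descending one level into classes. A reduced sketch that only reports symbol names and line ranges (no chunk text) is shown below; `SOURCE` and `top_level_symbols` are made up for the example.

```python
import ast

SOURCE = '''\
class Greeter:
    """Says hello."""

    def greet(self, name):
        return "hello " + name


def main():
    print(Greeter().greet("world"))
'''


def top_level_symbols(source: str) -> list[tuple[str, int, int]]:
    """Return (symbol_name, line_start, line_end) for functions and methods."""
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    symbols = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef):
            # Methods are reported as Class.method, mirroring the chunk metadata.
            for item in node.body:
                if isinstance(item, func_types):
                    symbols.append((f"{node.name}.{item.name}", item.lineno, item.end_lineno))
        elif isinstance(node, func_types):
            symbols.append((node.name, node.lineno, node.end_lineno))
    return symbols


print(top_level_symbols(SOURCE))
# [('Greeter.greet', 4, 5), ('main', 8, 9)]
```

As in the original function, module-level statements outside functions and classes (imports, constants) do not appear in any chunk.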
|
||||
|
||||
|
||||
def _get_node_source(lines: list[str], node) -> str:
|
||||
"""Extract source code for an AST node, including decorators."""
|
||||
start = node.lineno - 1
|
||||
# Include decorators
|
||||
if hasattr(node, "decorator_list") and node.decorator_list:
|
||||
start = node.decorator_list[0].lineno - 1
|
||||
end = node.end_lineno
|
||||
return "".join(lines[start:end]).rstrip()
|
||||
|
||||
|
||||
def _chunk_bash(text: str, include_context: bool) -> list[dict]:
|
||||
"""Split Bash at function boundaries using regex."""
|
||||
# Match: function name() { or name() {
|
||||
func_pattern = re.compile(
|
||||
r"^((?:#[^\n]*\n)*)?" # Optional preceding comment block
|
||||
r"(?:function\s+(\w+)\s*\(\s*\)\s*\{|(\w+)\s*\(\s*\)\s*\{)",
|
||||
re.MULTILINE,
|
||||
)
|
||||
|
||||
chunks = []
|
||||
matches = list(func_pattern.finditer(text))
|
||||
|
||||
if not matches:
|
||||
return []
|
||||
|
||||
for i, match in enumerate(matches):
|
||||
start = match.start()
|
||||
# Find end: next function or end of file
|
||||
if i + 1 < len(matches):
|
||||
end = matches[i + 1].start()
|
||||
else:
|
||||
end = len(text)
|
||||
|
||||
func_name = match.group(2) or match.group(3)
|
||||
chunk_text = text[start:end].rstrip()
|
||||
|
||||
chunks.append({
|
||||
"text": chunk_text,
|
||||
"metadata": {
|
||||
"symbol_name": func_name,
|
||||
},
|
||||
})
|
||||
|
||||
return chunks
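The Bash splitter relies on a single regex to find function headers and then slices the text between consecutive matches. The sketch below uses the same two header forms without the optional comment group; `SCRIPT` is an invented sample. A header written as `function name {` (no parentheses) is not matched by this pattern.

```python
import re

# Matches `function name() {` or `name() {` at the start of a line.
FUNC_RE = re.compile(
    r"^(?:function\s+(\w+)\s*\(\s*\)\s*\{|(\w+)\s*\(\s*\)\s*\{)",
    re.MULTILINE,
)

SCRIPT = """\
#!/bin/bash
function build() {
  make all
}

deploy() {
  scp out host:/srv
}
"""

matches = list(FUNC_RE.finditer(SCRIPT))
names = [m.group(1) or m.group(2) for m in matches]

# Each chunk runs from one header to the next header (or to the end of the file).
bounds = [m.start() for m in matches] + [len(SCRIPT)]
chunks = [SCRIPT[bounds[i]:bounds[i + 1]].rstrip() for i in range(len(matches))]

print(names)  # ['build', 'deploy']
```

Anything before the first function header, such as the shebang line here, falls outside every chunk.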
|
||||
|
||||
|
||||
def _chunk_go(text: str, include_context: bool) -> list[dict]:
    """Split Go at func declarations using regex."""
    func_pattern = re.compile(
        r"^func\s+(?:\([^)]*\)\s+)?(\w+)\s*\(",
        re.MULTILINE,
    )

    matches = list(func_pattern.finditer(text))
    if not matches:
        return []

    def _comment_start(pos: int) -> int:
        """Move pos back over the contiguous // comment lines directly above it."""
        start = pos
        while start > 0:
            prev_line_start = text.rfind("\n", 0, start - 1) + 1
            if not text[prev_line_start:start].strip().startswith("//"):
                break
            start = prev_line_start
        return start

    # Each chunk starts at its doc comment (if any) and ends where the next chunk starts,
    # so a comment block is attached to exactly one function.
    starts = [_comment_start(m.start()) for m in matches]

    chunks = []
    for i, match in enumerate(matches):
        end = starts[i + 1] if i + 1 < len(matches) else len(text)
        chunks.append({
            "text": text[starts[i]:end].rstrip(),
            "metadata": {
                "symbol_name": match.group(1),
            },
        })

    return chunks
|
||||
|
||||
|
||||
def _fixed_chunk(text: str, chunking_cfg: dict) -> list[dict]:
|
||||
"""Fixed-size fallback for code without recognisable boundaries."""
|
||||
max_tokens = chunking_cfg.get("max_tokens", 1024)
|
||||
overlap_tokens = chunking_cfg.get("overlap_tokens", 50)
|
||||
|
||||
lines = text.splitlines()
|
||||
if not lines:
|
||||
return []
|
||||
|
||||
# Approximate tokens as words
|
||||
chunks = []
|
||||
current_lines = []
|
||||
current_tokens = 0
|
||||
idx = 0
|
||||
|
||||
for line in lines:
|
||||
line_tokens = len(line.split())
|
||||
if current_tokens + line_tokens > max_tokens and current_lines:
|
||||
chunks.append({
|
||||
"text": "\n".join(current_lines),
|
||||
"chunk_index": idx,
|
||||
"metadata": {},
|
||||
})
|
||||
idx += 1
|
||||
# Keep some overlap
|
||||
overlap_lines = []
|
||||
overlap_count = 0
|
||||
for l in reversed(current_lines):
|
||||
l_tokens = len(l.split())
|
||||
if overlap_count + l_tokens > overlap_tokens:
|
||||
break
|
||||
overlap_lines.insert(0, l)
|
||||
overlap_count += l_tokens
|
||||
current_lines = overlap_lines
|
||||
current_tokens = overlap_count
|
||||
|
||||
current_lines.append(line)
|
||||
current_tokens += line_tokens
|
||||
|
||||
if current_lines:
|
||||
chunks.append({
|
||||
"text": "\n".join(current_lines),
|
||||
"chunk_index": idx,
|
||||
"metadata": {},
|
||||
})
|
||||
|
||||
return chunks
|
||||
@@ -0,0 +1,54 @@
|
||||
"""File type detection and routing."""
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
EXTENSION_MAP = {
|
||||
# Docling-handled formats
|
||||
".pdf": ("pdf", None),
|
||||
".docx": ("pdf", None), # Docling handles DOCX too
|
||||
".html": ("pdf", None),
|
||||
".htm": ("pdf", None),
|
||||
".png": ("pdf", None),
|
||||
".jpg": ("pdf", None),
|
||||
".jpeg": ("pdf", None),
|
||||
".tiff": ("pdf", None),
|
||||
".bmp": ("pdf", None),
|
||||
".webp": ("pdf", None),
|
||||
# Markdown / text
|
||||
".md": ("markdown", None),
|
||||
".markdown": ("markdown", None),
|
||||
".txt": ("markdown", None),
|
||||
# Code
|
||||
".py": ("code", "python"),
|
||||
".sh": ("code", "bash"),
|
||||
".bash": ("code", "bash"),
|
||||
".go": ("code", "go"),
|
||||
}
|
||||
|
||||
SUPPORTED_EXTENSIONS = set(EXTENSION_MAP.keys())
|
||||
|
||||
|
||||
def detect_type(path: Path, force_type: str | None = None,
|
||||
force_language: str | None = None) -> tuple[str, str | None]:
|
||||
"""Detect document type and language from file extension.
|
||||
|
||||
Returns (doc_type, language) tuple.
|
||||
Raises ValueError for unsupported file types.
|
||||
"""
|
||||
if force_type:
|
||||
return force_type, force_language
|
||||
|
||||
ext = path.suffix.lower()
|
||||
if ext not in EXTENSION_MAP:
|
||||
supported = ", ".join(sorted(SUPPORTED_EXTENSIONS))
|
||||
raise ValueError(f"Unsupported file type '{ext}'. Supported: {supported}")
|
||||
|
||||
doc_type, language = EXTENSION_MAP[ext]
|
||||
if force_language:
|
||||
language = force_language
|
||||
return doc_type, language
|
||||
|
||||
|
||||
def is_supported(path: Path) -> bool:
|
||||
"""Check if a file has a supported extension."""
|
||||
return path.suffix.lower() in SUPPORTED_EXTENSIONS
|
||||
@@ -0,0 +1,123 @@
|
||||
"""Docling-based ingestion for PDFs, DOCX, HTML, and images."""
|
||||
|
||||
import logging
|
||||
from pathlib import Path
|
||||
|
||||
# Suppress noisy Docling/RapidOCR logging
|
||||
logging.getLogger("RapidOCR").setLevel(logging.ERROR)
|
||||
logging.getLogger("docling.models.stages.ocr.rapid_ocr_model").setLevel(logging.ERROR)
|
||||
logging.getLogger("docling").setLevel(logging.WARNING)
|
||||
|
||||
|
||||
def chunk_document(file_path: Path, cfg: dict) -> list[dict]:
|
||||
"""Ingest a document using Docling and return chunks."""
|
||||
from docling.document_converter import DocumentConverter, PdfFormatOption
|
||||
from docling.datamodel.base_models import InputFormat
|
||||
from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions
|
||||
|
||||
# Configure PDF pipeline
|
||||
ocr_setting = cfg.get("ingestion", {}).get("enable_ocr", "auto")
|
||||
pdf_opts = PdfPipelineOptions()
|
||||
|
||||
if ocr_setting == "never":
|
||||
pdf_opts.do_ocr = False
|
||||
elif ocr_setting == "always":
|
||||
pdf_opts.do_ocr = True
|
||||
pdf_opts.ocr_options = RapidOcrOptions(force_full_page_ocr=True)
|
||||
else:
|
||||
# "auto" — enable OCR but only trigger on pages with significant bitmap content
|
||||
pdf_opts.do_ocr = True
|
||||
pdf_opts.ocr_options = RapidOcrOptions(bitmap_area_threshold=0.25)
|
||||
|
||||
converter = DocumentConverter(
|
||||
format_options={
|
||||
InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_opts),
|
||||
}
|
||||
)
|
||||
|
||||
# Convert
|
||||
result = converter.convert(str(file_path))
|
||||
doc = result.document
|
||||
|
||||
# Chunk using hierarchy-aware chunker
|
||||
chunking_cfg = cfg.get("chunking", {}).get("pdf", {})
|
||||
strategy = chunking_cfg.get("strategy", "hierarchy")
|
||||
|
||||
if strategy == "hierarchy":
|
||||
chunks = _hierarchy_chunk(doc)
|
||||
else:
|
||||
chunks = _fixed_chunk(doc, chunking_cfg)
|
||||
|
||||
if not chunks:
|
||||
# Fallback: try extracting raw text
|
||||
text = doc.export_to_markdown()
|
||||
if text and text.strip():
|
||||
chunks = _fixed_chunk_text(text, chunking_cfg)
|
||||
|
||||
return chunks
|
||||
|
||||
|
||||
def _hierarchy_chunk(doc) -> list[dict]:
    """Use Docling's HierarchicalChunker."""
    from docling_core.transforms.chunker import HierarchicalChunker

    chunker = HierarchicalChunker()
    chunks = []

    for i, chunk in enumerate(chunker.chunk(doc)):
        meta = {}

        if hasattr(chunk, "meta") and chunk.meta:
            # Record the page of the first item that carries provenance, then stop
            for item in getattr(chunk.meta, "doc_items", None) or []:
                for prov in getattr(item, "prov", None) or []:
                    if hasattr(prov, "page_no"):
                        meta["page"] = prov.page_no
                        break
                if "page" in meta:
                    break

            # Section headers
            if hasattr(chunk.meta, "headings") and chunk.meta.headings:
                meta["section_header"] = " > ".join(chunk.meta.headings)

        chunks.append({
            "text": chunk.text,
            "chunk_index": i,
            "metadata": meta,
        })

    return chunks
|
||||
|
||||
|
||||
def _fixed_chunk(doc, chunking_cfg: dict) -> list[dict]:
|
||||
"""Fixed-size chunking from Docling document."""
|
||||
text = doc.export_to_markdown()
|
||||
return _fixed_chunk_text(text, chunking_cfg)
|
||||
|
||||
|
||||
def _fixed_chunk_text(text: str, chunking_cfg: dict) -> list[dict]:
    """Fixed-size chunking from plain text."""
    max_tokens = chunking_cfg.get("max_tokens", 1024)
    overlap = chunking_cfg.get("overlap_tokens", 50)

    # Approximate: 1 token ~= 4 chars
    max_chars = max_tokens * 4
    overlap_chars = overlap * 4
    # Always advance by at least one character, even if overlap >= max_tokens
    step = max(max_chars - overlap_chars, 1)

    chunks = []
    start = 0
    idx = 0
    while start < len(text):
        end = start + max_chars
        chunk_text = text[start:end].strip()
        if chunk_text:
            chunks.append({
                "text": chunk_text,
                "chunk_index": idx,
                "metadata": {},
            })
            idx += 1
        if end >= len(text):
            # Last window reached; stop so the overlap tail is not emitted as its own chunk
            break
        start += step

    return chunks
|
||||
@@ -0,0 +1,210 @@
|
||||
"""Markdown ingestion — header-based splitting."""
|
||||
|
||||
import re
|
||||
|
||||
|
||||
def chunk_markdown(text: str, cfg: dict) -> list[dict]:
|
||||
"""Split markdown at header boundaries with hierarchy context."""
|
||||
chunking_cfg = cfg.get("chunking", {}).get("markdown", {})
|
||||
strategy = chunking_cfg.get("strategy", "header")
|
||||
|
||||
if strategy == "fixed" or not _has_headers(text):
|
||||
return _fixed_chunk(text, chunking_cfg)
|
||||
|
||||
return _header_chunk(text, chunking_cfg)
|
||||
|
||||
|
||||
def _has_headers(text: str) -> bool:
|
||||
"""Check if text contains markdown headers."""
|
||||
return bool(re.search(r"^#{1,6}\s+", text, re.MULTILINE))
|
||||
|
||||
|
||||
def _header_chunk(text: str, chunking_cfg: dict) -> list[dict]:
|
||||
"""Split at ## and ### boundaries with hierarchy context."""
|
||||
min_tokens = chunking_cfg.get("min_tokens", 50)
|
||||
max_tokens = chunking_cfg.get("max_tokens", 1024)
|
||||
|
||||
sections = _split_at_headers(text)
|
||||
|
||||
if not sections:
|
||||
return _fixed_chunk(text, chunking_cfg)
|
||||
|
||||
# Merge small sections
|
||||
sections = _merge_small_sections(sections, min_tokens)
|
||||
|
||||
# Split large sections
|
||||
chunks = []
|
||||
for section in sections:
|
||||
content = section["content"].strip()
|
||||
if not content:
|
||||
continue
|
||||
|
||||
# Add hierarchy context
|
||||
if section["header_chain"]:
|
||||
context = " > ".join(section["header_chain"])
|
||||
full_text = f"{context}\n\n{content}"
|
||||
else:
|
||||
full_text = content
|
||||
|
||||
approx_tokens = len(full_text.split())
|
||||
if approx_tokens > max_tokens:
|
||||
sub_chunks = _split_large_section(full_text, max_tokens, chunking_cfg)
|
||||
chunks.extend(sub_chunks)
|
||||
else:
|
||||
chunks.append({"text": full_text, "metadata": {
|
||||
"section_header": section["header_chain"][-1] if section["header_chain"] else None,
|
||||
}})
|
||||
|
||||
# Assign chunk indices
|
||||
for i, c in enumerate(chunks):
|
||||
c["chunk_index"] = i
|
||||
|
||||
return chunks
|
||||
|
||||
|
||||
def _split_at_headers(text: str) -> list[dict]:
|
||||
"""Split text into sections at header boundaries."""
|
||||
header_pattern = re.compile(r"^(#{1,6})\s+(.*?)$", re.MULTILINE)
|
||||
|
||||
sections = []
|
||||
header_stack = [] # Stack of (level, title)
|
||||
last_end = 0
|
||||
|
||||
for match in header_pattern.finditer(text):
|
||||
# Capture content before this header
|
||||
if last_end < match.start():
|
||||
content = text[last_end:match.start()].strip()
|
||||
if content and sections:
|
||||
sections[-1]["content"] += "\n\n" + content
|
||||
elif content:
|
||||
sections.append({
|
||||
"header_chain": [],
|
||||
"content": content,
|
||||
})
|
||||
|
||||
level = len(match.group(1))
|
||||
title = match.group(2).strip()
|
||||
|
||||
# Update header stack
|
||||
while header_stack and header_stack[-1][0] >= level:
|
||||
header_stack.pop()
|
||||
header_stack.append((level, title))
|
||||
|
||||
chain = [h[1] for h in header_stack]
|
||||
|
||||
sections.append({
|
||||
"header_chain": chain,
|
||||
"content": "",
|
||||
})
|
||||
last_end = match.end()
|
||||
|
||||
# Capture trailing content
|
||||
if last_end < len(text):
|
||||
trailing = text[last_end:].strip()
|
||||
if trailing and sections:
|
||||
sections[-1]["content"] += "\n\n" + trailing
|
||||
elif trailing:
|
||||
sections.append({"header_chain": [], "content": trailing})
|
||||
|
||||
return sections
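The header stack in `_split_at_headers` is what produces the `A > B > C` context lines: a new header pops every entry at the same or a deeper level before being pushed. A minimal sketch of only that mechanism, with an invented `DOC` sample and a hypothetical `header_chains` helper:

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*?)$", re.MULTILINE)


def header_chains(markdown: str) -> list[str]:
    """Return the ancestor chain for every header, in document order."""
    stack: list[tuple[int, str]] = []  # (level, title)
    chains = []
    for match in HEADER_RE.finditer(markdown):
        level = len(match.group(1))
        # Drop siblings and deeper headers so only true ancestors remain.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, match.group(2).strip()))
        chains.append(" > ".join(title for _, title in stack))
    return chains


DOC = "# Guide\n\n## Install\n\n### Linux\n\n## Usage\n"
print(header_chains(DOC))
# ['Guide', 'Guide > Install', 'Guide > Install > Linux', 'Guide > Usage']
```

Because the pattern is applied to raw text, a line starting with `#` inside a fenced code block is also treated as a header.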
|
||||
|
||||
|
||||
def _merge_small_sections(sections: list[dict], min_tokens: int) -> list[dict]:
|
||||
"""Merge sections smaller than min_tokens with next section."""
|
||||
if not sections:
|
||||
return sections
|
||||
|
||||
merged = []
|
||||
pending = None
|
||||
|
||||
for section in sections:
|
||||
if pending is not None:
|
||||
# Merge pending into this section
|
||||
section["content"] = pending["content"] + "\n\n" + section["content"]
|
||||
if not section["header_chain"] and pending["header_chain"]:
|
||||
section["header_chain"] = pending["header_chain"]
|
||||
pending = None
|
||||
|
||||
approx_tokens = len(section["content"].split())
|
||||
if approx_tokens < min_tokens:
|
||||
pending = section
|
||||
else:
|
||||
merged.append(section)
|
||||
|
||||
if pending is not None:
|
||||
if merged:
|
||||
merged[-1]["content"] += "\n\n" + pending["content"]
|
||||
else:
|
||||
merged.append(pending)
|
||||
|
||||
return merged
|
||||
|
||||
|
||||
def _split_large_section(text: str, max_tokens: int, chunking_cfg: dict) -> list[dict]:
|
||||
"""Split a large section at paragraph boundaries with overlap."""
|
||||
overlap_tokens = chunking_cfg.get("overlap_tokens",
|
||||
cfg_defaults().get("overlap_tokens", 50))
|
||||
paragraphs = re.split(r"\n\n+", text)
|
||||
|
||||
chunks = []
|
||||
current_paras = []
|
||||
current_tokens = 0
|
||||
|
||||
for para in paragraphs:
|
||||
para_tokens = len(para.split())
|
||||
if current_tokens + para_tokens > max_tokens and current_paras:
|
||||
chunks.append({"text": "\n\n".join(current_paras), "metadata": {}})
|
||||
# Keep overlap
|
||||
overlap_paras = []
|
||||
overlap_count = 0
|
||||
for p in reversed(current_paras):
|
||||
p_tokens = len(p.split())
|
||||
if overlap_count + p_tokens > overlap_tokens:
|
||||
break
|
||||
overlap_paras.insert(0, p)
|
||||
overlap_count += p_tokens
|
||||
current_paras = overlap_paras
|
||||
current_tokens = overlap_count
|
||||
|
||||
current_paras.append(para)
|
||||
current_tokens += para_tokens
|
||||
|
||||
if current_paras:
|
||||
chunks.append({"text": "\n\n".join(current_paras), "metadata": {}})
|
||||
|
||||
return chunks
|
||||
|
||||
|
||||
def cfg_defaults():
|
||||
"""Return default chunking config."""
|
||||
return {"max_tokens": 1024, "overlap_tokens": 50, "min_tokens": 50}
|
||||
|
||||
|
||||
def _fixed_chunk(text: str, chunking_cfg: dict) -> list[dict]:
    """Fixed-size fallback for plain text without headers."""
    max_tokens = chunking_cfg.get("max_tokens", 512)
    overlap_tokens = chunking_cfg.get("overlap_tokens", 50)

    # Tokens are approximated by whitespace-separated words.
    words = text.split()
    if not words:
        return []

    chunks = []
    start = 0
    idx = 0

    while start < len(words):
        end = min(start + max_tokens, len(words))
        chunk_text = " ".join(words[start:end]).strip()
        if chunk_text:
            chunks.append({
                "text": chunk_text,
                "chunk_index": idx,
                "metadata": {},
            })
            idx += 1
        if end == len(words):
            break
        # Step back by the overlap, but always advance by at least one word so
        # that overlap_tokens >= max_tokens cannot loop forever.
        start = max(end - overlap_tokens, start + 1)

    return chunks
|
||||
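To make the fixed-size fallback above concrete, here is a minimal, self-contained sketch of word-window chunking with overlap. The name `window_chunks` and the toy parameters are illustrative assumptions and not part of this commit; as in the code above, whitespace-separated words stand in for tokens.

```python
def window_chunks(words: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split words into windows of `size` that share `overlap` words with the previous window."""
    if size <= 0:
        raise ValueError("size must be positive")
    # Clamp the step so the window always moves forward, even if overlap >= size.
    step = max(1, size - overlap)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            # The current window already reaches the end of the text.
            break
    return chunks


words = "a b c d e f g h".split()
chunks = window_chunks(words, size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

With `size=4` and `overlap=1`, each window starts on the last word of the previous one, so neighbouring chunks share exactly one word.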
@@ -0,0 +1,19 @@
|
||||
"""Note ingestion — whole-document chunks."""
|
||||
|
||||
|
||||
def chunk_note(text: str) -> list[dict]:
|
||||
"""Return note text as a single chunk."""
|
||||
return [{"text": text, "metadata": {}, "chunk_index": 0}]
|
||||
|
||||
|
||||
def auto_title(text: str, max_len: int = 80) -> str:
|
||||
"""Generate a title from the first line of text, truncated at word boundary."""
|
||||
first_line = text.strip().split("\n")[0].strip()
|
||||
if len(first_line) <= max_len:
|
||||
return first_line
|
||||
truncated = first_line[:max_len]
|
||||
# Truncate at last space
|
||||
last_space = truncated.rfind(" ")
|
||||
if last_space > 0:
|
||||
truncated = truncated[:last_space]
|
||||
return truncated + "..."
|
||||
@@ -0,0 +1,144 @@
|
||||
"""Output formatters — JSON and human-readable."""
|
||||
|
||||
import json
|
||||
import sys
|
||||
|
||||
|
||||
def format_search_results(data: dict, fmt: str = "json") -> str:
|
||||
"""Format search results for output."""
|
||||
if fmt == "json":
|
||||
return json.dumps(data, indent=2, ensure_ascii=False)
|
||||
return _human_search(data)
|
||||
|
||||
|
||||
def _human_search(data: dict) -> str:
|
||||
"""Human-readable search output."""
|
||||
lines = []
|
||||
total = data["total_matches"]
|
||||
returned = data["returned"]
|
||||
lines.append(f'Search: "{data["query"]}" ({total} matches, showing top {returned})')
|
||||
lines.append("")
|
||||
|
||||
for i, r in enumerate(data["results"], 1):
|
||||
src = r["source"]
|
||||
score = r["score"]
|
||||
|
||||
# Title with page/section
|
||||
location = ""
|
||||
if src.get("page"):
|
||||
location = f" (p.{src['page']})"
|
||||
elif src.get("section_header"):
|
||||
location = f" \u00a7{src['section_header']}"
|
||||
|
||||
# Tags
|
||||
tag_str = ""
|
||||
if src.get("tags"):
|
||||
tag_str = " [" + ", ".join(src["tags"]) + "]"
|
||||
|
||||
lines.append(f" {i:2d}. [{score:.3f}] {src['title']}{location} [{src['type']}]{tag_str}")
|
||||
|
||||
# Text preview (first 200 chars)
|
||||
preview = r["text"][:200].replace("\n", " ").strip()
|
||||
if len(r["text"]) > 200:
|
||||
preview += "..."
|
||||
lines.append(f" {preview}")
|
||||
lines.append("")
|
||||
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def format_document_list(docs: list[dict], fmt: str = "json") -> str:
|
||||
"""Format document list."""
|
||||
if fmt == "json":
|
||||
return json.dumps(docs, indent=2, ensure_ascii=False)
|
||||
return _human_doc_list(docs)
|
||||
|
||||
|
||||
def _human_doc_list(docs: list[dict]) -> str:
|
||||
"""Human-readable document list."""
|
||||
if not docs:
|
||||
return "No documents indexed. Run `kb add` to get started."
|
||||
|
||||
lines = [f"{'ID':>5} {'Type':<10} {'Chunks':>6} {'Title':<40} {'Tags'}"]
|
||||
lines.append("-" * 80)
|
||||
|
||||
for d in docs:
|
||||
tags = ", ".join(d.get("tags", []))
|
||||
title = d["title"][:40]
|
||||
lines.append(f"{d['id']:>5} {d['type']:<10} {d['chunk_count']:>6} {title:<40} {tags}")
|
||||
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def format_tags(tags: list[dict], fmt: str = "json") -> str:
|
||||
"""Format tag list."""
|
||||
if fmt == "json":
|
||||
return json.dumps(tags, indent=2, ensure_ascii=False)
|
||||
|
||||
if not tags:
|
||||
return "No tags. Use `kb add --tags` or `kb tag` to add tags."
|
||||
|
||||
lines = [f"{'Tag':<30} {'Documents':>10}"]
|
||||
lines.append("-" * 42)
|
||||
for t in tags:
|
||||
lines.append(f"{t['name']:<30} {t['count']:>10}")
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def format_doc_info(info: dict, fmt: str = "json") -> str:
|
||||
"""Format document info."""
|
||||
if fmt == "json":
|
||||
return json.dumps(info, indent=2, ensure_ascii=False)
|
||||
|
||||
lines = []
|
||||
lines.append(f"Document #{info['id']}: {info['title']}")
|
||||
lines.append(f" Type: {info['type']}")
|
||||
if info.get("language"):
|
||||
lines.append(f" Language: {info['language']}")
|
||||
if info.get("path"):
|
||||
lines.append(f" Path: {info['path']}")
|
||||
lines.append(f" Hash: {info['content_hash']}")
|
||||
lines.append(f" Created: {info['created_at']}")
|
||||
if info.get("tags"):
|
||||
lines.append(f" Tags: {', '.join(info['tags'])}")
|
||||
lines.append(f" Chunks: {info['chunk_count']}")
|
||||
lines.append("")
|
||||
|
||||
for chunk in info.get("chunks", []):
|
||||
preview = chunk["text"][:100].replace("\n", " ").strip()
|
||||
if len(chunk["text"]) > 100:
|
||||
preview += "..."
|
||||
lines.append(f" [{chunk['chunk_index']}] {preview}")
|
||||
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def format_status(status: dict, fmt: str = "json") -> str:
|
||||
"""Format status output."""
|
||||
if fmt == "json":
|
||||
return json.dumps(status, indent=2, ensure_ascii=False)
|
||||
|
||||
lines = []
|
||||
lines.append("Knowledge Base Status")
|
||||
lines.append("=" * 40)
|
||||
lines.append(f" Model: {status['model_name']}")
|
||||
lines.append(f" Embedding dim: {status['embedding_dim']}")
|
||||
lines.append(f" Schema version: {status['schema_version']}")
|
||||
lines.append(f" DB size: {_human_size(status['db_size_bytes'])}")
|
||||
lines.append("")
|
||||
lines.append(" Documents:")
|
||||
for dtype, count in status.get("documents", {}).items():
|
||||
lines.append(f" {dtype:<12} {count:>5}")
|
||||
lines.append(f" {'total':<12} {status['total_documents']:>5}")
|
||||
lines.append(f" Total chunks: {status['total_chunks']}")
|
||||
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def _human_size(size_bytes: int) -> str:
|
||||
"""Format bytes as human-readable."""
|
||||
for unit in ("B", "KB", "MB", "GB"):
|
||||
if size_bytes < 1024:
|
||||
return f"{size_bytes:.1f} {unit}"
|
||||
size_bytes /= 1024
|
||||
return f"{size_bytes:.1f} TB"
|
||||
@@ -0,0 +1,261 @@
|
||||
"""Hybrid search — FTS5 + vector with Reciprocal Rank Fusion."""
|
||||
|
||||
import json
|
||||
import re
|
||||
import struct
|
||||
import sqlite3
|
||||
|
||||
|
||||
def hybrid_search(conn: sqlite3.Connection, query: str, model_name: str, cfg: dict,
|
||||
top: int = 10, tags: list[str] | None = None,
|
||||
doc_type: str | None = None, fts_only: bool = False,
|
||||
vec_only: bool = False, threshold: float | None = None) -> dict:
|
||||
"""Run hybrid search and return merged results."""
|
||||
candidate_count = top * 3 # Fetch more candidates for RRF
|
||||
|
||||
fts_results = {}
|
||||
vec_results = {}
|
||||
|
||||
if not vec_only:
|
||||
fts_results = _fts_search(conn, query, candidate_count, tags, doc_type)
|
||||
|
||||
if not fts_only:
|
||||
vec_results = _vector_search(conn, query, model_name, cfg, candidate_count, tags, doc_type)
|
||||
|
||||
# Merge via RRF
|
||||
rrf_k = cfg.get("search", {}).get("rrf_k", 60)
|
||||
|
||||
if fts_only:
|
||||
merged = _single_source_results(fts_results, "fts")
|
||||
elif vec_only:
|
||||
merged = _single_source_results(vec_results, "vector")
|
||||
else:
|
||||
merged = _rrf_merge(fts_results, vec_results, rrf_k)
|
||||
|
||||
# Apply threshold
|
||||
if threshold is not None:
|
||||
merged = [r for r in merged if r["score"] >= threshold]
|
||||
|
||||
# Sort and limit
|
||||
merged.sort(key=lambda x: x["score"], reverse=True)
|
||||
total = len(merged)
|
||||
merged = merged[:top]
|
||||
|
||||
# Enrich with document metadata
|
||||
results = []
|
||||
for r in merged:
|
||||
chunk_id = r["chunk_id"]
|
||||
row = conn.execute("""
|
||||
SELECT c.id, c.text, c.chunk_index, c.metadata as chunk_meta,
|
||||
d.id as doc_id, d.title, d.source_path, d.doc_type,
|
||||
d.language, d.metadata as doc_meta
|
||||
FROM chunks c
|
||||
JOIN documents d ON c.document_id = d.id
|
||||
WHERE c.id = ?
|
||||
""", (chunk_id,)).fetchone()
|
||||
|
||||
if not row:
|
||||
continue
|
||||
|
||||
chunk_meta = json.loads(row["chunk_meta"]) if row["chunk_meta"] else {}
|
||||
|
||||
# Get tags for this document
|
||||
tag_rows = conn.execute("""
|
||||
SELECT t.name FROM tags t
|
||||
JOIN document_tags dt ON t.id = dt.tag_id
|
||||
WHERE dt.document_id = ?
|
||||
ORDER BY t.name
|
||||
""", (row["doc_id"],)).fetchall()
|
||||
|
||||
# Count total chunks for this document
|
||||
total_chunks = conn.execute(
|
||||
"SELECT COUNT(*) FROM chunks WHERE document_id = ?", (row["doc_id"],)
|
||||
).fetchone()[0]
|
||||
|
||||
results.append({
|
||||
"chunk_id": row["id"],
|
||||
"score": round(r["score"], 6),
|
||||
"score_breakdown": r["score_breakdown"],
|
||||
"text": row["text"],
|
||||
"source": {
|
||||
"document_id": row["doc_id"],
|
||||
"title": row["title"],
|
||||
"path": row["source_path"],
|
||||
"type": row["doc_type"],
|
||||
"page": chunk_meta.get("page"),
|
||||
"section_header": chunk_meta.get("section_header"),
|
||||
"chunk_index": row["chunk_index"],
|
||||
"total_chunks": total_chunks,
|
||||
"tags": [t["name"] for t in tag_rows],
|
||||
},
|
||||
})
|
||||
|
||||
return {
|
||||
"query": query,
|
||||
"results": results,
|
||||
"total_matches": total,
|
||||
"returned": len(results),
|
||||
}
|
||||
|
||||
|
||||
def _fts_search(conn: sqlite3.Connection, query: str, limit: int,
|
||||
tags: list[str] | None, doc_type: str | None) -> dict[int, float]:
|
||||
"""Run FTS5 search, return {chunk_id: bm25_score}."""
|
||||
escaped = _escape_fts_query(query)
|
||||
if not escaped.strip():
|
||||
return {}
|
||||
|
||||
sql = """
|
||||
SELECT f.rowid as chunk_id, bm25(chunks_fts) as score
|
||||
FROM chunks_fts f
|
||||
"""
|
||||
joins = []
|
||||
where = ["chunks_fts MATCH ?"]
|
||||
params = [escaped]
|
||||
|
||||
if tags or doc_type:
|
||||
joins.append("JOIN chunks c ON f.rowid = c.id")
|
||||
joins.append("JOIN documents d ON c.document_id = d.id")
|
||||
|
||||
if doc_type:
|
||||
where.append("d.doc_type = ?")
|
||||
params.append(doc_type)
|
||||
|
||||
if tags:
|
||||
for i, tag in enumerate(tags):
|
||||
joins.append(f"JOIN document_tags dt{i} ON d.id = dt{i}.document_id")
|
||||
joins.append(f"JOIN tags t{i} ON dt{i}.tag_id = t{i}.id")
|
||||
where.append(f"t{i}.name = ?")
|
||||
params.append(tag.strip().lower())
|
||||
|
||||
sql += " " + " ".join(joins)
|
||||
sql += " WHERE " + " AND ".join(where)
|
||||
sql += " ORDER BY score LIMIT ?"
|
||||
params.append(limit)
|
||||
|
||||
rows = conn.execute(sql, params).fetchall()
|
||||
|
||||
# BM25 scores are negative (lower = better), normalise to positive
|
||||
results = {}
|
||||
for row in rows:
|
||||
results[row["chunk_id"]] = -row["score"] # Negate so higher = better
|
||||
|
||||
return results
|
||||
|
||||
|
||||
def _vector_search(conn: sqlite3.Connection, query: str, model_name: str,
|
||||
cfg: dict, limit: int, tags: list[str] | None,
|
||||
doc_type: str | None) -> dict[int, float]:
|
||||
"""Run vector similarity search, return {chunk_id: similarity_score}."""
|
||||
from kb_search.embeddings import embed_texts
|
||||
|
||||
prefix = cfg.get("embedding", {}).get("query_prefix", "")
|
||||
query_emb = embed_texts(model_name, [query], prefix=prefix)[0]
|
||||
blob = struct.pack(f"{len(query_emb)}f", *query_emb)
|
||||
|
||||
# sqlite-vec returns results ordered by distance (lower = more similar)
|
||||
rows = conn.execute("""
|
||||
SELECT chunk_id, distance
|
||||
FROM chunks_vec
|
||||
WHERE embedding MATCH ?
|
||||
ORDER BY distance
|
||||
LIMIT ?
|
||||
""", (blob, limit)).fetchall()
|
||||
|
||||
results = {}
|
||||
for row in rows:
|
||||
# Convert distance to similarity (1 - distance for cosine)
|
||||
similarity = max(0, 1 - row["distance"])
|
||||
chunk_id = row["chunk_id"]
|
||||
|
||||
# Apply filters post-hoc for vector search
|
||||
if tags or doc_type:
|
||||
check = conn.execute("""
|
||||
SELECT 1 FROM chunks c
|
||||
JOIN documents d ON c.document_id = d.id
|
||||
WHERE c.id = ?
|
||||
""" + (" AND d.doc_type = ?" if doc_type else ""),
|
||||
(chunk_id,) + ((doc_type,) if doc_type else ())
|
||||
).fetchone()
|
||||
if not check:
|
||||
continue
|
||||
|
||||
if tags:
|
||||
tag_count = conn.execute("""
|
||||
SELECT COUNT(*) FROM chunks c
|
||||
JOIN documents d ON c.document_id = d.id
|
||||
JOIN document_tags dt ON d.id = dt.document_id
|
||||
JOIN tags t ON dt.tag_id = t.id
|
||||
WHERE c.id = ? AND t.name IN ({})
|
||||
""".format(",".join("?" * len(tags))),
|
||||
(chunk_id, *[t.strip().lower() for t in tags])
|
||||
).fetchone()[0]
|
||||
if tag_count < len(tags):
|
||||
continue
|
||||
|
||||
results[chunk_id] = similarity
|
||||
|
||||
return results
|
||||
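The query embedding above is handed to the vector table as a packed blob of floats. The following is a small sketch of that serialisation and its inverse, assuming 32-bit floats. The helper names `pack_f32` and `unpack_f32` are hypothetical, and the explicit little-endian `<` prefix is a choice made for this example (the code above uses the platform's native byte order).

```python
import struct


def pack_f32(vec: list[float]) -> bytes:
    """Serialise floats as consecutive little-endian 32-bit values."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_f32(blob: bytes) -> list[float]:
    """Inverse of pack_f32; each float occupies 4 bytes."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


# Values chosen to be exactly representable as 32-bit floats.
blob = pack_f32([0.5, -1.0, 0.25])
roundtrip = unpack_f32(blob)
print(len(blob), roundtrip)
```

Values that are not exactly representable in 32 bits (for example `0.1`) come back slightly changed, so comparisons on unpacked embeddings should generally use a tolerance.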
|
||||
|
||||
def _rrf_merge(fts_results: dict[int, float], vec_results: dict[int, float],
|
||||
k: int = 60) -> list[dict]:
|
||||
"""Merge two result sets using Reciprocal Rank Fusion."""
|
||||
# Rank each result set
|
||||
fts_ranked = _rank_results(fts_results)
|
||||
vec_ranked = _rank_results(vec_results)
|
||||
|
||||
all_ids = set(fts_ranked.keys()) | set(vec_ranked.keys())
|
||||
|
||||
merged = []
|
||||
for chunk_id in all_ids:
|
||||
fts_rank = fts_ranked.get(chunk_id)
|
||||
vec_rank = vec_ranked.get(chunk_id)
|
||||
|
||||
score = 0
|
||||
if fts_rank is not None:
|
||||
score += 1 / (k + fts_rank)
|
||||
if vec_rank is not None:
|
||||
score += 1 / (k + vec_rank)
|
||||
|
||||
fts_score = round(1 / (k + fts_rank), 6) if fts_rank is not None else None
|
||||
vec_score = round(1 / (k + vec_rank), 6) if vec_rank is not None else None
|
||||
|
||||
merged.append({
|
||||
"chunk_id": chunk_id,
|
||||
"score": score,
|
||||
"score_breakdown": {"fts": fts_score, "vector": vec_score},
|
||||
})
|
||||
|
||||
return merged
|
||||
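As a worked illustration of the Reciprocal Rank Fusion step above: each chunk receives `1 / (k + rank)` from every ranking in which it appears, and the contributions are summed. The sketch below is standalone; the function name `rrf_scores` and the toy chunk ids are assumptions made for the example, and it takes ranked id lists directly rather than the score dicts used in the module.

```python
def rrf_scores(rankings: list[list[int]], k: int = 60) -> dict[int, float]:
    """Sum 1 / (k + rank) over every ranking in which an id appears (ranks are 1-indexed)."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return scores


fts_ranking = [7, 3, 9]  # best FTS hit first
vec_ranking = [3, 5, 7]  # best vector hit first
scores = rrf_scores([fts_ranking, vec_ranking])
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Chunks 3 and 7 appear in both lists and therefore outrank chunks 5 and 9, which each appear in only one. A larger `k` flattens the difference between adjacent ranks.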
|
||||
|
||||
def _single_source_results(results: dict[int, float], source: str) -> list[dict]:
|
||||
"""Convert single-source results to merged format."""
|
||||
ranked = _rank_results(results)
|
||||
merged = []
|
||||
for chunk_id, rank in ranked.items():
|
||||
score = results[chunk_id]
|
||||
breakdown = {"fts": None, "vector": None}
|
||||
breakdown[source] = round(score, 6)
|
||||
merged.append({
|
||||
"chunk_id": chunk_id,
|
||||
"score": score,
|
||||
"score_breakdown": breakdown,
|
||||
})
|
||||
return merged
|
||||
|
||||
|
||||
def _rank_results(results: dict[int, float]) -> dict[int, int]:
|
||||
"""Rank results by score (1-indexed, higher score = lower rank number)."""
|
||||
sorted_ids = sorted(results.keys(), key=lambda x: results[x], reverse=True)
|
||||
return {chunk_id: rank + 1 for rank, chunk_id in enumerate(sorted_ids)}
|
||||
|
||||
|
||||
def _escape_fts_query(query: str) -> str:
|
||||
"""Escape special FTS5 characters in a query."""
|
||||
# Remove FTS5 operators that could cause syntax errors
|
||||
query = re.sub(r'["\(\)\*\:\^]', " ", query)
|
||||
# Collapse multiple spaces
|
||||
query = re.sub(r"\s+", " ", query).strip()
|
||||
return query
|
||||
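The escaping above strips a fixed set of operator characters. A query such as `foo-bar` or one containing the bare words `AND`, `OR`, `NOT` or `NEAR` may still be interpreted as FTS5 syntax. One more defensive alternative, sketched here under the assumption that plain term matching is all that is wanted, is to wrap every token in double quotes so FTS5 treats it as a string. The function name `quote_fts5_tokens` is hypothetical and not part of this commit.

```python
def quote_fts5_tokens(query: str) -> str:
    """Wrap each whitespace-separated token in double quotes for use in an FTS5 MATCH."""
    tokens = query.split()
    # Inside an FTS5 string, a literal double quote is written as two double quotes.
    return " ".join('"' + tok.replace('"', '""') + '"' for tok in tokens)


print(quote_fts5_tokens("foo-bar NOT baz"))
```

Quoted tokens are still combined with FTS5's implicit AND, so the matching semantics stay close to the original function. The trade-off is that users can no longer use prefix queries or boolean operators on purpose.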
@@ -0,0 +1,131 @@
|
||||
"""Tests for configuration loading, merging, and ENV overrides."""
|
||||
|
||||
import os
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
import yaml
|
||||
|
||||
from kb_search.config import (
|
||||
DEFAULTS,
|
||||
_deep_merge,
|
||||
_get_nested,
|
||||
_set_nested,
|
||||
config_with_sources,
|
||||
load_config,
|
||||
save_config_value,
|
||||
)
|
||||
|
||||
|
||||
def test_deep_merge_basic():
|
||||
base = {"a": 1, "b": {"c": 2, "d": 3}}
|
||||
override = {"b": {"c": 99}}
|
||||
result = _deep_merge(base, override)
|
||||
assert result == {"a": 1, "b": {"c": 99, "d": 3}}
|
||||
|
||||
|
||||
def test_deep_merge_new_keys():
|
||||
base = {"a": 1}
|
||||
override = {"b": 2}
|
||||
result = _deep_merge(base, override)
|
||||
assert result == {"a": 1, "b": 2}
|
||||
|
||||
|
||||
def test_deep_merge_does_not_mutate():
|
||||
base = {"a": {"b": 1}}
|
||||
override = {"a": {"b": 2}}
|
||||
_deep_merge(base, override)
|
||||
assert base["a"]["b"] == 1
|
||||
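The three tests above pin down the expected behaviour of a deep merge: nested dicts are merged key by key, new keys are added, and the base dict is left untouched. The real `_deep_merge` lives in `kb_search.config` and is not shown here, so the following is only a sketch of one implementation that would satisfy those tests. The name `deep_merge` is hypothetical.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict in which override wins and nested dicts are merged recursively."""
    # Copy first so neither input is mutated.
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


base = {"a": 1, "b": {"c": 2, "d": 3}}
merged = deep_merge(base, {"b": {"c": 99}, "e": 5})
print(merged)
```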
|
||||
|
||||
def test_set_nested():
|
||||
d = {}
|
||||
_set_nested(d, "a.b.c", 42)
|
||||
assert d == {"a": {"b": {"c": 42}}}
|
||||
|
||||
|
||||
def test_get_nested():
|
||||
d = {"a": {"b": {"c": 42}}}
|
||||
assert _get_nested(d, "a.b.c") == 42
|
||||
assert _get_nested(d, "a.b.x", "missing") == "missing"
|
||||
assert _get_nested(d, "x.y.z") is None
|
||||
|
||||
|
||||
def test_load_config_defaults(tmp_path):
|
||||
"""With no config file, returns defaults."""
|
||||
cfg = load_config(tmp_path / "nonexistent.yaml")
|
||||
assert cfg["embedding"]["model"] == "all-MiniLM-L6-v2"
|
||||
assert cfg["search"]["default_top"] == 10
|
||||
assert cfg["chunking"]["pdf"]["strategy"] == "hierarchy"
|
||||
|
||||
|
||||
def test_load_config_yaml_override(tmp_path):
|
||||
"""YAML values override defaults."""
|
||||
config_path = tmp_path / "config.yaml"
|
||||
config_path.write_text(yaml.dump({"embedding": {"model": "nomic-embed-text"}}))
|
||||
cfg = load_config(config_path)
|
||||
assert cfg["embedding"]["model"] == "nomic-embed-text"
|
||||
# Other defaults preserved
|
||||
assert cfg["search"]["default_top"] == 10
|
||||
|
||||
|
||||
def test_load_config_env_override(tmp_path, monkeypatch):
|
||||
"""ENV overrides both YAML and defaults."""
|
||||
config_path = tmp_path / "config.yaml"
|
||||
config_path.write_text(yaml.dump({"search": {"default_top": 20}}))
|
||||
monkeypatch.setenv("KB_DEFAULT_TOP", "50")
|
||||
cfg = load_config(config_path)
|
||||
assert cfg["search"]["default_top"] == 50
|
||||
|
||||
|
||||
def test_load_config_env_model(tmp_path, monkeypatch):
|
||||
monkeypatch.setenv("KB_MODEL", "bge-small-en-v1.5")
|
||||
cfg = load_config(tmp_path / "nonexistent.yaml")
|
||||
assert cfg["embedding"]["model"] == "bge-small-en-v1.5"
|
||||
|
||||
|
||||
def test_save_config_value(tmp_path):
|
||||
config_path = tmp_path / "config.yaml"
|
||||
save_config_value(config_path, "chunking.pdf.max_tokens", "2048")
|
||||
with open(config_path) as f:
|
||||
data = yaml.safe_load(f)
|
||||
assert data["chunking"]["pdf"]["max_tokens"] == 2048
|
||||
|
||||
|
||||
def test_save_config_value_bool(tmp_path):
|
||||
config_path = tmp_path / "config.yaml"
|
||||
save_config_value(config_path, "chunking.code.include_context", "false")
|
||||
with open(config_path) as f:
|
||||
data = yaml.safe_load(f)
|
||||
assert data["chunking"]["code"]["include_context"] is False
|
||||
|
||||
|
||||
def test_save_config_preserves_existing(tmp_path):
|
||||
config_path = tmp_path / "config.yaml"
|
||||
config_path.write_text(yaml.dump({"embedding": {"model": "custom"}}))
|
||||
save_config_value(config_path, "search.default_top", "20")
|
||||
with open(config_path) as f:
|
||||
data = yaml.safe_load(f)
|
||||
assert data["embedding"]["model"] == "custom"
|
||||
assert data["search"]["default_top"] == 20
|
||||
|
||||
|
||||
def test_config_with_sources_defaults(tmp_path, monkeypatch):
    # Make sure an ambient KB_MODEL cannot mask the default source.
    monkeypatch.delenv("KB_MODEL", raising=False)
    entries = config_with_sources(tmp_path / "nonexistent.yaml")
    sources = {k: s for k, _, s in entries}
    assert sources["embedding.model"] == "default"
|
||||
|
||||
|
||||
def test_config_with_sources_yaml(tmp_path):
|
||||
config_path = tmp_path / "config.yaml"
|
||||
config_path.write_text(yaml.dump({"embedding": {"model": "custom"}}))
|
||||
entries = config_with_sources(config_path)
|
||||
sources = {k: s for k, _, s in entries}
|
||||
assert sources["embedding.model"] == "config.yaml"
|
||||
|
||||
|
||||
def test_config_with_sources_env(tmp_path, monkeypatch):
|
||||
monkeypatch.setenv("KB_MODEL", "from-env")
|
||||
entries = config_with_sources(tmp_path / "nonexistent.yaml")
|
||||
sources = {k: s for k, _, s in entries}
|
||||
assert sources["embedding.model"] == "env (KB_MODEL)"
|
||||
@@ -0,0 +1,206 @@
|
||||
"""Tests for database schema, FTS triggers, and config helpers."""
|
||||
|
||||
import struct
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
from kb_search.database import (
|
||||
SCHEMA_VERSION,
|
||||
check_schema_version,
|
||||
get_connection,
|
||||
get_db_config,
|
||||
get_or_create_tag,
|
||||
hash_exists,
|
||||
init_schema,
|
||||
insert_chunk,
|
||||
insert_document,
|
||||
insert_embedding,
|
||||
recreate_vec_table,
|
||||
run_migrations,
|
||||
set_db_config,
|
||||
tag_document,
|
||||
untag_document,
|
||||
)
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def db(tmp_path):
|
||||
"""Provide an initialised temporary on-disk DB."""
|
||||
db_path = tmp_path / "test.db"
|
||||
conn = get_connection(db_path)
|
||||
init_schema(conn, embedding_dim=384)
|
||||
set_db_config(conn, "schema_version", str(SCHEMA_VERSION))
|
||||
yield conn
|
||||
conn.close()
|
||||
|
||||
|
||||
def test_schema_creation(db):
|
||||
tables = [r[0] for r in db.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()]
|
||||
assert "documents" in tables
|
||||
assert "chunks" in tables
|
||||
assert "tags" in tables
|
||||
assert "document_tags" in tables
|
||||
assert "config" in tables
|
||||
|
||||
|
||||
def test_fts_table_exists(db):
|
||||
tables = [r[0] for r in db.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()]
|
||||
assert "chunks_fts" in tables
|
||||
|
||||
|
||||
def test_vec_table_exists(db):
|
||||
tables = [r[0] for r in db.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()]
|
||||
assert "chunks_vec" in tables
|
||||
|
||||
|
||||
def test_config_get_set(db):
|
||||
set_db_config(db, "test_key", "test_value")
|
||||
assert get_db_config(db, "test_key") == "test_value"
|
||||
|
||||
|
||||
def test_config_get_default(db):
|
||||
assert get_db_config(db, "nonexistent", "fallback") == "fallback"
|
||||
|
||||
|
||||
def test_config_upsert(db):
|
||||
set_db_config(db, "key", "v1")
|
||||
set_db_config(db, "key", "v2")
|
||||
assert get_db_config(db, "key") == "v2"
|
||||
|
||||
|
||||
def test_schema_version(db):
|
||||
assert check_schema_version(db) == SCHEMA_VERSION
|
||||
|
||||
|
||||
def test_insert_document(db):
|
||||
doc_id = insert_document(db, "Test Doc", "/path/test.pdf", "abc123", "pdf")
|
||||
db.commit()
|
||||
row = db.execute("SELECT * FROM documents WHERE id = ?", (doc_id,)).fetchone()
|
||||
assert row["title"] == "Test Doc"
|
||||
assert row["doc_type"] == "pdf"
|
||||
assert row["content_hash"] == "abc123"
|
||||
|
||||
|
||||
def test_insert_chunk_with_fts_sync(db):
|
||||
doc_id = insert_document(db, "Doc", None, "hash1", "note")
|
||||
chunk_id = insert_chunk(db, doc_id, 0, "This is searchable text about Python programming")
|
||||
db.commit()
|
||||
|
||||
# FTS should find it
|
||||
rows = db.execute(
|
||||
"SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH 'python'"
|
||||
).fetchall()
|
||||
assert len(rows) == 1
|
||||
assert rows[0][0] == chunk_id
|
||||
|
||||
|
||||
def test_fts_delete_trigger(db):
|
||||
doc_id = insert_document(db, "Doc", None, "hash2", "note")
|
||||
chunk_id = insert_chunk(db, doc_id, 0, "unique_keyword_xyz")
|
||||
db.commit()
|
||||
|
||||
db.execute("DELETE FROM chunks WHERE id = ?", (chunk_id,))
|
||||
db.commit()
|
||||
|
||||
rows = db.execute(
|
||||
"SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH 'unique_keyword_xyz'"
|
||||
).fetchall()
|
||||
assert len(rows) == 0
|
||||
|
||||
|
||||
def test_fts_update_trigger(db):
|
||||
doc_id = insert_document(db, "Doc", None, "hash3", "note")
|
||||
chunk_id = insert_chunk(db, doc_id, 0, "old_content_abc")
|
||||
db.commit()
|
||||
|
||||
db.execute("UPDATE chunks SET text = 'new_content_def' WHERE id = ?", (chunk_id,))
|
||||
db.commit()
|
||||
|
||||
old = db.execute("SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH 'old_content_abc'").fetchall()
|
||||
new = db.execute("SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH 'new_content_def'").fetchall()
|
||||
assert len(old) == 0
|
||||
assert len(new) == 1
|
||||
|
||||
|
||||
def test_insert_embedding(db):
|
||||
doc_id = insert_document(db, "Doc", None, "hash4", "note")
|
||||
chunk_id = insert_chunk(db, doc_id, 0, "text")
|
||||
db.commit()
|
||||
|
||||
embedding = [0.1] * 384
|
||||
insert_embedding(db, chunk_id, embedding)
|
||||
db.commit()
|
||||
|
||||
row = db.execute("SELECT * FROM chunks_vec WHERE chunk_id = ?", (chunk_id,)).fetchone()
|
||||
assert row is not None
|
||||
|
||||
|
||||
def test_hash_exists(db):
|
||||
assert not hash_exists(db, "newhash")
|
||||
insert_document(db, "Doc", None, "newhash", "note")
|
||||
db.commit()
|
||||
assert hash_exists(db, "newhash")
|
||||
|
||||
|
||||
def test_tag_management(db):
|
||||
doc_id = insert_document(db, "Doc", None, "hash5", "pdf")
|
||||
db.commit()
|
||||
|
||||
tag_document(db, doc_id, ["git", "admin"])
|
||||
db.commit()
|
||||
|
||||
rows = db.execute(
|
||||
"SELECT t.name FROM tags t JOIN document_tags dt ON t.id = dt.tag_id "
|
||||
"WHERE dt.document_id = ? ORDER BY t.name",
|
||||
(doc_id,),
|
||||
).fetchall()
|
||||
assert [r["name"] for r in rows] == ["admin", "git"]
|
||||
|
||||
|
||||
def test_untag_document(db):
|
||||
doc_id = insert_document(db, "Doc", None, "hash6", "pdf")
|
||||
tag_document(db, doc_id, ["a", "b", "c"])
|
||||
db.commit()
|
||||
|
||||
untag_document(db, doc_id, ["b"])
|
||||
db.commit()
|
||||
|
||||
rows = db.execute(
|
||||
"SELECT t.name FROM tags t JOIN document_tags dt ON t.id = dt.tag_id "
|
||||
"WHERE dt.document_id = ? ORDER BY t.name",
|
||||
(doc_id,),
|
||||
).fetchall()
|
||||
assert [r["name"] for r in rows] == ["a", "c"]
|
||||
|
||||
|
||||
def test_tags_are_lowercase(db):
|
||||
tag_id = get_or_create_tag(db, "MyTag")
|
||||
db.commit()
|
||||
row = db.execute("SELECT name FROM tags WHERE id = ?", (tag_id,)).fetchone()
|
||||
assert row["name"] == "mytag"
|
||||
|
||||
|
||||
def test_recreate_vec_table(db):
|
||||
doc_id = insert_document(db, "Doc", None, "hash7", "note")
|
||||
chunk_id = insert_chunk(db, doc_id, 0, "text")
|
||||
insert_embedding(db, chunk_id, [0.1] * 384)
|
||||
db.commit()
|
||||
|
||||
recreate_vec_table(db, 768)
|
||||
# Old data gone, new dimension
|
||||
rows = db.execute("SELECT * FROM chunks_vec").fetchall()
|
||||
assert len(rows) == 0
|
||||
|
||||
|
||||
def test_cascade_delete(db):
|
||||
doc_id = insert_document(db, "Doc", None, "hash8", "pdf")
|
||||
insert_chunk(db, doc_id, 0, "chunk text")
|
||||
tag_document(db, doc_id, ["test"])
|
||||
db.commit()
|
||||
|
||||
db.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
|
||||
db.commit()
|
||||
|
||||
assert db.execute("SELECT COUNT(*) FROM chunks WHERE document_id = ?", (doc_id,)).fetchone()[0] == 0
|
||||
assert db.execute("SELECT COUNT(*) FROM document_tags WHERE document_id = ?", (doc_id,)).fetchone()[0] == 0
|
||||
@@ -0,0 +1,50 @@
|
||||
"""Tests for embedding model management."""
|
||||
|
||||
from unittest.mock import MagicMock, patch
|
||||
|
||||
import click
|
||||
import pytest
|
||||
|
||||
from kb_search.embeddings import check_model_binding
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def mock_conn():
|
||||
"""Mock DB connection with config values."""
|
||||
def make_conn(config_values=None):
|
||||
config_values = config_values or {}
|
||||
conn = MagicMock()
|
||||
def mock_execute(sql, params=None):
|
||||
if "SELECT value FROM config" in sql and params:
|
||||
key = params[0]
|
||||
val = config_values.get(key)
|
||||
row = MagicMock()
|
||||
row.__getitem__ = lambda self, k: val
|
||||
result = MagicMock()
|
||||
result.fetchone.return_value = row if val else None
|
||||
return result
|
||||
return MagicMock()
|
||||
conn.execute = mock_execute
|
||||
return conn
|
||||
return make_conn
|
||||
|
||||
|
||||
def test_model_binding_match(mock_conn):
|
||||
conn = mock_conn({"model_name": "all-MiniLM-L6-v2", "embedding_dim": "384"})
|
||||
cfg = {"embedding": {"model": "all-MiniLM-L6-v2"}}
|
||||
# Should not raise
|
||||
check_model_binding(conn, cfg)
|
||||
|
||||
|
||||
def test_model_binding_mismatch(mock_conn):
|
||||
conn = mock_conn({"model_name": "all-MiniLM-L6-v2", "embedding_dim": "384"})
|
||||
cfg = {"embedding": {"model": "nomic-embed-text"}}
|
||||
with pytest.raises(click.ClickException, match="Model mismatch"):
|
||||
check_model_binding(conn, cfg)
|
||||
|
||||
|
||||
def test_model_binding_no_db_model(mock_conn):
|
||||
conn = mock_conn({})
|
||||
cfg = {"embedding": {"model": "anything"}}
|
||||
# Should not raise when DB not yet initialised
|
||||
check_model_binding(conn, cfg)
|
||||
@@ -0,0 +1,172 @@
"""Tests for code chunking: Python, Bash, Go."""

from kb_search.ingest.code import chunk_code, _chunk_python, _chunk_bash, _chunk_go, _fixed_chunk

CFG = {"chunking": {"code": {"strategy": "ast", "include_context": True, "max_tokens": 1024}}}


class TestPythonChunking:
    def test_functions(self):
        code = '''
def hello():
    """Say hello."""
    print("hello")

def goodbye():
    """Say goodbye."""
    print("bye")
'''
        chunks = _chunk_python(code, include_context=True)
        assert len(chunks) == 2
        assert chunks[0]["metadata"]["symbol_name"] == "hello"
        assert chunks[1]["metadata"]["symbol_name"] == "goodbye"

    def test_class_with_methods(self):
        code = '''
class MyClass:
    """A test class."""

    def method_a(self):
        pass

    def method_b(self):
        pass
'''
        chunks = _chunk_python(code, include_context=True)
        assert len(chunks) == 2
        assert chunks[0]["metadata"]["symbol_name"] == "MyClass.method_a"
        assert chunks[1]["metadata"]["symbol_name"] == "MyClass.method_b"
        # Context should include class docstring
        assert "A test class" in chunks[0]["text"]

    def test_class_without_methods(self):
        code = '''
class Config:
    """Configuration."""
    DEBUG = True
    PORT = 8080
'''
        chunks = _chunk_python(code, include_context=True)
        assert len(chunks) == 1
        assert chunks[0]["metadata"]["symbol_name"] == "Config"

    def test_syntax_error_returns_empty(self):
        chunks = _chunk_python("def broken(:\n pass", include_context=True)
        assert chunks == []

    def test_no_context(self):
        code = '''
class Foo:
    """Docstring."""
    def bar(self):
        pass
'''
        chunks = _chunk_python(code, include_context=False)
        assert len(chunks) == 1
        assert "Docstring" not in chunks[0]["text"]


class TestBashChunking:
    def test_function_keyword(self):
        code = '''#!/bin/bash

function deploy() {
    echo "deploying"
}

function rollback() {
    echo "rolling back"
}
'''
        chunks = _chunk_bash(code, include_context=True)
        assert len(chunks) == 2
        assert chunks[0]["metadata"]["symbol_name"] == "deploy"
        assert chunks[1]["metadata"]["symbol_name"] == "rollback"

    def test_shorthand_syntax(self):
        code = '''
setup() {
    echo "setup"
}

cleanup() {
    echo "cleanup"
}
'''
        chunks = _chunk_bash(code, include_context=True)
        assert len(chunks) == 2

    def test_no_functions(self):
        code = "#!/bin/bash\necho hello\nexit 0"
        chunks = _chunk_bash(code, include_context=True)
        assert chunks == []

    def test_with_preceding_comments(self):
        code = '''
# Deploy to production
# Requires valid credentials
function deploy() {
    echo "deploying"
}
'''
        chunks = _chunk_bash(code, include_context=True)
        assert len(chunks) == 1
        assert "Deploy to production" in chunks[0]["text"]


class TestGoChunking:
    def test_basic_funcs(self):
        code = '''package main

func main() {
    fmt.Println("hello")
}

func helper() string {
    return "help"
}
'''
        chunks = _chunk_go(code, include_context=True)
        assert len(chunks) == 2
        assert chunks[0]["metadata"]["symbol_name"] == "main"
        assert chunks[1]["metadata"]["symbol_name"] == "helper"

    def test_method_receiver(self):
        code = '''
func (s *Server) Start() error {
    return nil
}

func (s *Server) Stop() {
}
'''
        chunks = _chunk_go(code, include_context=True)
        assert len(chunks) == 2
        assert chunks[0]["metadata"]["symbol_name"] == "Start"

    def test_no_funcs(self):
        code = "package main\n\nvar x = 1"
        chunks = _chunk_go(code, include_context=True)
        assert chunks == []


class TestFallback:
    def test_unknown_language_uses_fixed(self):
        code = "line1\nline2\nline3"
        chunks = chunk_code(code, "ruby", CFG)
        assert len(chunks) >= 1

    def test_python_no_functions_uses_fixed(self):
        code = "x = 1\ny = 2\nprint(x + y)"
        chunks = chunk_code(code, "python", CFG)
        assert len(chunks) >= 1

    def test_fixed_strategy_config(self):
        cfg = {"chunking": {"code": {"strategy": "fixed", "max_tokens": 10}}}
        code = "\n".join(f"x_{i} = {i}" for i in range(50))
        chunks = chunk_code(code, "python", cfg)
        assert len(chunks) > 1

    def test_empty_code(self):
        chunks = chunk_code("", "python", CFG)
        assert len(chunks) == 0
@@ -0,0 +1,81 @@
"""Tests for file type detection, dedup, note creation."""

from pathlib import Path

import pytest

from kb_search.ingest.detector import detect_type, is_supported
from kb_search.ingest.note import auto_title, chunk_note


class TestDetector:
    def test_pdf(self, tmp_path):
        assert detect_type(tmp_path / "doc.pdf") == ("pdf", None)

    def test_markdown(self, tmp_path):
        assert detect_type(tmp_path / "notes.md") == ("markdown", None)

    def test_txt(self, tmp_path):
        assert detect_type(tmp_path / "notes.txt") == ("markdown", None)

    def test_python(self, tmp_path):
        assert detect_type(tmp_path / "main.py") == ("code", "python")

    def test_bash(self, tmp_path):
        assert detect_type(tmp_path / "deploy.sh") == ("code", "bash")

    def test_go(self, tmp_path):
        assert detect_type(tmp_path / "main.go") == ("code", "go")

    def test_unsupported(self, tmp_path):
        with pytest.raises(ValueError, match="Unsupported"):
            detect_type(tmp_path / "archive.zip")

    def test_force_type(self, tmp_path):
        assert detect_type(tmp_path / "data.txt", force_type="code", force_language="bash") == ("code", "bash")

    def test_force_language_only(self, tmp_path):
        doc_type, lang = detect_type(tmp_path / "script.py", force_language="go")
        assert doc_type == "code"
        assert lang == "go"

    def test_is_supported(self, tmp_path):
        assert is_supported(tmp_path / "test.pdf")
        assert is_supported(tmp_path / "test.py")
        assert not is_supported(tmp_path / "test.zip")

    def test_case_insensitive(self, tmp_path):
        assert detect_type(tmp_path / "DOC.PDF") == ("pdf", None)

    def test_image_files(self, tmp_path):
        assert detect_type(tmp_path / "scan.png") == ("pdf", None)
        assert detect_type(tmp_path / "photo.jpg") == ("pdf", None)

    def test_docx(self, tmp_path):
        assert detect_type(tmp_path / "report.docx") == ("pdf", None)


class TestNote:
    def test_chunk_note(self):
        chunks = chunk_note("Hello world")
        assert len(chunks) == 1
        assert chunks[0]["text"] == "Hello world"
        assert chunks[0]["chunk_index"] == 0

    def test_auto_title_short(self):
        assert auto_title("Short note") == "Short note"

    def test_auto_title_long(self):
        long_text = "This is a very long note that exceeds the maximum title length and should be truncated at a word boundary"
        result = auto_title(long_text, max_len=50)
        assert len(result) <= 54  # 50 + "..."
        assert result.endswith("...")

    def test_auto_title_multiline(self):
        text = "First line\nSecond line\nThird line"
        assert auto_title(text) == "First line"

    def test_auto_title_no_space(self):
        text = "a" * 100
        result = auto_title(text, max_len=80)
        assert result.endswith("...")
@@ -0,0 +1,33 @@
"""Tests for Docling ingestion (fixed-size chunking logic, mocked Docling)."""

from kb_search.ingest.docling import _fixed_chunk_text


class TestFixedChunkText:
    def test_short_text_single_chunk(self):
        chunks = _fixed_chunk_text("Hello world", {})
        assert len(chunks) == 1
        assert chunks[0]["text"] == "Hello world"
        assert chunks[0]["chunk_index"] == 0

    def test_long_text_multiple_chunks(self):
        text = "word " * 2000  # ~10000 chars
        chunks = _fixed_chunk_text(text, {"max_tokens": 512, "overlap_tokens": 50})
        assert len(chunks) > 1
        # Chunk indices should be sequential
        for i, c in enumerate(chunks):
            assert c["chunk_index"] == i

    def test_empty_text(self):
        chunks = _fixed_chunk_text("", {})
        assert len(chunks) == 0

    def test_whitespace_only(self):
        chunks = _fixed_chunk_text(" \n\n ", {})
        assert len(chunks) == 0

    def test_custom_max_tokens(self):
        text = "a " * 500
        chunks = _fixed_chunk_text(text, {"max_tokens": 100})
        # 100 tokens * 4 chars = 400 chars window, 1000 chars total
        assert len(chunks) > 1
@@ -0,0 +1,121 @@
"""Tests for markdown header-based splitting."""

from kb_search.ingest.markdown import (
    _fixed_chunk,
    _has_headers,
    _merge_small_sections,
    _split_at_headers,
    chunk_markdown,
)


def make_cfg(**overrides):
    cfg = {"chunking": {"markdown": {"strategy": "header", "min_tokens": 50, "max_tokens": 1024}}}
    cfg["chunking"]["markdown"].update(overrides)
    return cfg


class TestHasHeaders:
    def test_with_headers(self):
        assert _has_headers("## Title\nContent")

    def test_without_headers(self):
        assert not _has_headers("Just plain text\nNo headers here")

    def test_h3(self):
        assert _has_headers("### Subsection\nStuff")


class TestSplitAtHeaders:
    def test_basic_split(self):
        text = "## Section 1\nContent one\n\n## Section 2\nContent two"
        sections = _split_at_headers(text)
        assert len(sections) == 2
        assert sections[0]["header_chain"] == ["Section 1"]
        assert "Content one" in sections[0]["content"]
        assert sections[1]["header_chain"] == ["Section 2"]

    def test_nested_headers(self):
        text = "## Config\nIntro\n\n### Advanced Options\nDetails"
        sections = _split_at_headers(text)
        assert len(sections) == 2
        # The ### should have full chain
        assert sections[1]["header_chain"] == ["Config", "Advanced Options"]

    def test_leading_content(self):
        text = "Preamble text\n\n## First Section\nContent"
        sections = _split_at_headers(text)
        assert len(sections) == 2
        assert sections[0]["header_chain"] == []
        assert "Preamble" in sections[0]["content"]

    def test_header_level_reset(self):
        text = "## A\n\n### B\n\n## C\n\n### D"
        sections = _split_at_headers(text)
        assert sections[2]["header_chain"] == ["C"]
        assert sections[3]["header_chain"] == ["C", "D"]


class TestMergeSmallSections:
    def test_merge_tiny_into_next(self):
        sections = [
            {"header_chain": ["A"], "content": "tiny"},
            {"header_chain": ["B"], "content": "This is a much longer section with plenty of words " * 5},
        ]
        merged = _merge_small_sections(sections, min_tokens=10)
        assert len(merged) == 1
        assert "tiny" in merged[0]["content"]

    def test_no_merge_when_large_enough(self):
        sections = [
            {"header_chain": ["A"], "content": "word " * 100},
            {"header_chain": ["B"], "content": "word " * 100},
        ]
        merged = _merge_small_sections(sections, min_tokens=10)
        assert len(merged) == 2


class TestChunkMarkdown:
    def test_header_strategy(self):
        text = "## Intro\nSome intro text with enough words to avoid merging. " * 5
        text += "\n\n## Details\nDetailed content follows here with sufficient length. " * 5
        cfg = make_cfg(min_tokens=5)
        chunks = chunk_markdown(text, cfg)
        assert len(chunks) >= 2
        # Verify chunk_index assigned
        for i, c in enumerate(chunks):
            assert c["chunk_index"] == i

    def test_hierarchy_context(self):
        text = "## Config\nIntro\n\n### Advanced\n" + "Details " * 60
        cfg = make_cfg(min_tokens=5)
        chunks = chunk_markdown(text, cfg)
        # Find the Advanced chunk
        advanced = [c for c in chunks if "Advanced" in c["text"]]
        assert len(advanced) > 0
        assert "Config > Advanced" in advanced[0]["text"]

    def test_plain_text_fallback(self):
        text = "No headers here, just plain text. " * 200
        cfg = make_cfg()
        chunks = chunk_markdown(text, cfg)
        assert len(chunks) >= 1

    def test_empty_text(self):
        chunks = chunk_markdown("", make_cfg())
        assert len(chunks) == 0


class TestFixedChunk:
    def test_basic(self):
        text = "word " * 200
        chunks = _fixed_chunk(text, {"max_tokens": 50, "overlap_tokens": 10})
        assert len(chunks) > 1

    def test_empty(self):
        chunks = _fixed_chunk("", {})
        assert len(chunks) == 0

    def test_short_text(self):
        chunks = _fixed_chunk("hello world", {"max_tokens": 512})
        assert len(chunks) == 1
@@ -0,0 +1,156 @@
"""Tests for document management commands via Click test runner."""

import json

import pytest
from click.testing import CliRunner

from kb_search.cli import main
from kb_search.database import (
    SCHEMA_VERSION,
    get_connection,
    init_schema,
    insert_chunk,
    insert_document,
    insert_embedding,
    set_db_config,
    tag_document,
)


@pytest.fixture
def kb_env(tmp_path, monkeypatch):
    """Set up a test KB environment."""
    data_dir = tmp_path / ".kb"
    data_dir.mkdir()
    db_path = data_dir / "kb.db"

    conn = get_connection(db_path)
    init_schema(conn, 384)
    set_db_config(conn, "schema_version", str(SCHEMA_VERSION))
    set_db_config(conn, "model_name", "all-MiniLM-L6-v2")
    set_db_config(conn, "embedding_dim", "384")

    # Add a test document
    doc_id = insert_document(conn, "Test Doc", "/tmp/test.pdf", "abc123", "pdf")
    insert_chunk(conn, doc_id, 0, "This is chunk zero about Python")
    insert_chunk(conn, doc_id, 1, "This is chunk one about testing")
    tag_document(conn, doc_id, ["test", "pdf"])
    conn.commit()
    conn.close()

    monkeypatch.setenv("KB_DATA_DIR", str(data_dir))
    return data_dir


class TestList:
    def test_json_output(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["list", "--format", "json"])
        assert result.exit_code == 0
        data = json.loads(result.output)
        assert len(data) == 1
        assert data[0]["title"] == "Test Doc"
        assert data[0]["type"] == "pdf"

    def test_human_output(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["list", "--format", "human"])
        assert result.exit_code == 0
        assert "Test Doc" in result.output

    def test_filter_type(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["list", "--type", "markdown", "--format", "json"])
        assert result.exit_code == 0
        data = json.loads(result.output)
        assert len(data) == 0

    def test_filter_tags(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["list", "--tags", "test", "--format", "json"])
        assert result.exit_code == 0
        data = json.loads(result.output)
        assert len(data) == 1


class TestInfo:
    def test_json_output(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["info", "1", "--format", "json"])
        assert result.exit_code == 0
        data = json.loads(result.output)
        assert data["title"] == "Test Doc"
        assert data["chunk_count"] == 2
        assert "test" in data["tags"]

    def test_not_found(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["info", "999"])
        assert result.exit_code != 0
        assert "not found" in result.output.lower()


class TestRemove:
    def test_remove_with_yes(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["remove", "1", "--yes"])
        assert result.exit_code == 0
        assert "Removed" in result.output

        # Verify gone
        result = runner.invoke(main, ["list", "--format", "json"])
        data = json.loads(result.output)
        assert len(data) == 0

    def test_remove_not_found(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["remove", "999", "--yes"])
        assert result.exit_code != 0


class TestTags:
    def test_list_tags(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["tags", "--format", "json"])
        assert result.exit_code == 0
        data = json.loads(result.output)
        names = [t["name"] for t in data]
        assert "test" in names
        assert "pdf" in names

    def test_add_tag(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["tag", "1", "--add", "new"])
        assert result.exit_code == 0
        assert "Added" in result.output

    def test_remove_tag(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["tag", "1", "--remove", "test"])
        assert result.exit_code == 0
        assert "Removed" in result.output


class TestStatus:
    def test_json_output(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["status", "--format", "json"])
        assert result.exit_code == 0
        data = json.loads(result.output)
        assert data["model_name"] == "all-MiniLM-L6-v2"
        assert data["total_documents"] == 1
        assert data["total_chunks"] == 2

    def test_human_output(self, kb_env):
        runner = CliRunner()
        result = runner.invoke(main, ["status", "--format", "human"])
        assert result.exit_code == 0
        assert "all-MiniLM-L6-v2" in result.output

    def test_not_initialised(self, tmp_path, monkeypatch):
        monkeypatch.setenv("KB_DATA_DIR", str(tmp_path / "nonexistent"))
        runner = CliRunner()
        result = runner.invoke(main, ["status"])
        assert result.exit_code != 0
        assert "not initialised" in result.output.lower()
@@ -0,0 +1,120 @@
"""Tests for output formatters."""

import json

from kb_search.output import (
    _human_size,
    format_doc_info,
    format_document_list,
    format_search_results,
    format_status,
    format_tags,
)

SAMPLE_SEARCH = {
    "query": "install git",
    "results": [
        {
            "chunk_id": 1,
            "score": 0.031,
            "score_breakdown": {"fts": 0.016, "vector": 0.015},
            "text": "To install git from source...",
            "source": {
                "document_id": 42,
                "title": "Git Admin Guide",
                "path": "/docs/git.pdf",
                "type": "pdf",
                "page": 12,
                "section_header": None,
                "chunk_index": 3,
                "total_chunks": 28,
                "tags": ["git", "admin"],
            },
        }
    ],
    "total_matches": 47,
    "returned": 1,
}


class TestSearchOutput:
    def test_json_format(self):
        output = format_search_results(SAMPLE_SEARCH, "json")
        parsed = json.loads(output)
        assert parsed["query"] == "install git"
        assert len(parsed["results"]) == 1
        assert parsed["results"][0]["chunk_id"] == 1
        assert "fts" in parsed["results"][0]["score_breakdown"]
        assert "vector" in parsed["results"][0]["score_breakdown"]

    def test_json_schema_fields(self):
        output = format_search_results(SAMPLE_SEARCH, "json")
        parsed = json.loads(output)
        r = parsed["results"][0]
        assert "chunk_id" in r
        assert "score" in r
        assert "text" in r
        assert "source" in r
        src = r["source"]
        assert "document_id" in src
        assert "title" in src
        assert "type" in src
        assert "tags" in src

    def test_human_format(self):
        output = format_search_results(SAMPLE_SEARCH, "human")
        assert "install git" in output
        assert "Git Admin Guide" in output
        assert "p.12" in output
        assert "0.031" in output


class TestDocList:
    def test_json(self):
        docs = [{"id": 1, "title": "Test", "type": "pdf", "tags": ["a"], "chunk_count": 5, "created_at": "2024-01-01"}]
        parsed = json.loads(format_document_list(docs, "json"))
        assert len(parsed) == 1

    def test_human_empty(self):
        assert "No documents" in format_document_list([], "human")

    def test_human(self):
        docs = [{"id": 1, "title": "Test", "type": "pdf", "tags": ["a"], "chunk_count": 5}]
        output = format_document_list(docs, "human")
        assert "Test" in output


class TestTags:
    def test_json(self):
        tags = [{"name": "git", "count": 15}]
        parsed = json.loads(format_tags(tags, "json"))
        assert parsed[0]["name"] == "git"

    def test_human_empty(self):
        assert "No tags" in format_tags([], "human")


class TestStatus:
    def test_json(self):
        status = {"model_name": "test", "embedding_dim": 384, "schema_version": 1,
                  "db_size_bytes": 1024, "documents": {"pdf": 5}, "total_documents": 5, "total_chunks": 50}
        parsed = json.loads(format_status(status, "json"))
        assert parsed["model_name"] == "test"

    def test_human(self):
        status = {"model_name": "test", "embedding_dim": 384, "schema_version": 1,
                  "db_size_bytes": 1024000, "documents": {"pdf": 5}, "total_documents": 5, "total_chunks": 50}
        output = format_status(status, "human")
        assert "test" in output
        assert "384" in output


class TestHumanSize:
    def test_bytes(self):
        assert _human_size(512) == "512.0 B"

    def test_kb(self):
        assert _human_size(2048) == "2.0 KB"

    def test_mb(self):
        assert _human_size(5 * 1024 * 1024) == "5.0 MB"
@@ -0,0 +1,91 @@
"""Tests for hybrid search, RRF merging, and filtering."""

import pytest

from kb_search.search import (
    _escape_fts_query,
    _rank_results,
    _rrf_merge,
    _single_source_results,
)


class TestEscapeFtsQuery:
    def test_plain_query(self):
        assert _escape_fts_query("install git") == "install git"

    def test_special_chars(self):
        result = _escape_fts_query('install "git" (latest)')
        assert '"' not in result
        assert "(" not in result
        assert ")" not in result

    def test_collapses_spaces(self):
        assert _escape_fts_query(" too many spaces ") == "too many spaces"

    def test_empty(self):
        assert _escape_fts_query("") == ""


class TestRankResults:
    def test_basic_ranking(self):
        results = {1: 0.9, 2: 0.5, 3: 0.7}
        ranked = _rank_results(results)
        assert ranked[1] == 1  # highest score = rank 1
        assert ranked[3] == 2
        assert ranked[2] == 3

    def test_empty(self):
        assert _rank_results({}) == {}


class TestRRFMerge:
    def test_basic_merge(self):
        fts = {1: 0.9, 2: 0.5}
        vec = {1: 0.8, 3: 0.7}
        merged = _rrf_merge(fts, vec, k=60)

        scores = {r["chunk_id"]: r["score"] for r in merged}
        # Chunk 1 appears in both - should have highest score
        assert scores[1] > scores[2]
        assert scores[1] > scores[3]

    def test_no_overlap(self):
        fts = {1: 0.9}
        vec = {2: 0.8}
        merged = _rrf_merge(fts, vec, k=60)
        assert len(merged) == 2

    def test_score_breakdown(self):
        fts = {1: 0.9}
        vec = {1: 0.8}
        merged = _rrf_merge(fts, vec, k=60)
        assert len(merged) == 1
        assert merged[0]["score_breakdown"]["fts"] is not None
        assert merged[0]["score_breakdown"]["vector"] is not None

    def test_single_source_fts(self):
        fts = {1: 0.9, 2: 0.5}
        merged = _rrf_merge(fts, {}, k=60)
        for r in merged:
            assert r["score_breakdown"]["vector"] is None
            assert r["score_breakdown"]["fts"] is not None

    def test_empty_both(self):
        merged = _rrf_merge({}, {}, k=60)
        assert merged == []


class TestSingleSourceResults:
    def test_fts_only(self):
        results = _single_source_results({1: 0.9, 2: 0.5}, "fts")
        assert len(results) == 2
        for r in results:
            assert r["score_breakdown"]["vector"] is None
            assert r["score_breakdown"]["fts"] is not None

    def test_vec_only(self):
        results = _single_source_results({1: 0.8}, "vector")
        assert len(results) == 1
        assert results[0]["score_breakdown"]["fts"] is None
        assert results[0]["score_breakdown"]["vector"] is not None