Add next-steps.md with UX improvement ideas for kb CLI

Captures pain points found while trying to locate an uploaded PDF: kb list silently ignores positional args, kb search results lack document_id, kb info dumps all chunks with no summary mode, and scan-heavy PDFs produce noisy single-char chunk hits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,58 @@
# kb — Next Steps

UX improvements to make documents easier to find and inspect, prompted by a session where searching for an uploaded PDF (`M38T_PHEV_RHD_OM_EN_UK_20251209.pdf`, doc id 2077, 1801 chunks) surfaced lots of chunk hits but no obvious path back to the original document.

## Problems observed

### 1. `kb list` silently ignores positional arguments

```
kb list --type pdf "M38T_PHEV_RHD_OM_EN_UK_20251209"
```

The quoted term is dropped without warning; the user gets the default newest-first listing and assumes the document is missing. `kb list` currently supports only the `--tags` and `--type` filters.
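
One way a CLI could refuse stray positionals instead of dropping them is sketched below. This is an illustration only: these notes do not say how `kb` parses its arguments, so the parser, the `parse_list_args` helper, and the error wording are all hypothetical. Python's `argparse` is used because its `parse_known_args` call hands back the leftovers explicitly.

```python
import argparse


def build_list_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for a `list` subcommand with the two filters noted above."""
    parser = argparse.ArgumentParser(prog="kb list")
    parser.add_argument("--tags", help="filter by tag (assumed to take one value)")
    parser.add_argument("--type", dest="doc_type", help="filter by document type, e.g. pdf")
    return parser


def parse_list_args(argv: list[str]) -> argparse.Namespace:
    """Parse argv, rejecting leftover arguments rather than ignoring them."""
    parser = build_list_parser()
    args, extras = parser.parse_known_args(argv)
    if extras:
        # parser.error() prints usage and the message to stderr, then exits with status 2.
        parser.error(
            f"unexpected argument(s): {' '.join(extras)}; "
            "kb list only filters by --tags and --type"
        )
    return args
```

With this shape, `parse_list_args(["--type", "pdf", "M38T_PHEV_RHD_OM_EN_UK_20251209"])` exits with an error instead of returning the default listing, which removes the "document is missing" false impression.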

### 2. `kb search` returns chunks with no `document_id`

Result objects expose `chunk_id`, `title`, `source_path`, and `tags`, but not `document_id`. To get from a search hit back to the owning document you have to title-match against `kb list` output or call an undocumented endpoint. The skill docs even claim a `source.document_id` field that isn't actually present in the CLI output.

### 3. `kb info` dumps every chunk with no summary mode

`kb info 2077` returns ~1801 chunk objects. The document-level metadata (`id`, `title`, `original_filename`, `source_path`, `stored_path`, `doc_type`, `language`, `content_hash`, `has_file`, `tags`, `created_at`, `updated_at`) **is** present at the top level of the JSON, but in practice it's invisible: the human format presumably dumps the chunk list, so the user sees only chunks.

There's no way to ask for "just tell me about this document."
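
A summary view could be as small as the following sketch, which takes the parsed JSON of a document and returns everything except the chunk list, plus a count. The `chunks` key is taken from the JSON shape described above; the `summarize_document` name and the `chunk_count` field are hypothetical, not part of the current CLI.

```python
from typing import Any


def summarize_document(doc: dict[str, Any]) -> dict[str, Any]:
    """Return document-level metadata plus a chunk count, without the chunk list."""
    # Copy every top-level field except the (potentially huge) chunk list.
    summary = {key: value for key, value in doc.items() if key != "chunks"}
    # chunk_count is a hypothetical field name; a missing or null list counts as 0.
    summary["chunk_count"] = len(doc.get("chunks") or [])
    return summary
```

The input dictionary is left untouched, so the same parsed payload can still feed a full chunk dump when the user asks for one.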

### 4. Search hits can look like noise on image-heavy PDFs

Top chunks for the M38T search were single characters (`"1"`, `"B"`, `"\""`). Almost certainly an FTS artefact on short tokens from a scan/image-heavy PDF, but it makes the result set look broken. Worth considering a minimum-text-length filter on indexed chunks, or down-weighting very short chunks in ranking.
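
The down-weighting option might look like the sketch below. It assumes each hit carries a `text` field and a numeric `score` where higher means more relevant; neither is confirmed by the CLI output described in these notes, and some FTS engines report scores where lower is better, in which case the sign would need flipping. The 20-character threshold is an arbitrary placeholder.

```python
def length_weight(text: str, min_chars: int = 20) -> float:
    """Scale factor in [0, 1]: 0 with no alphanumeric text, 1 at or above min_chars."""
    alnum = sum(ch.isalnum() for ch in text)
    return min(alnum / min_chars, 1.0)


def rerank(hits: list[dict], min_chars: int = 20) -> list[dict]:
    """Sort hits by score * length weight, best first (assumes higher score = better)."""
    return sorted(
        hits,
        key=lambda hit: hit["score"] * length_weight(hit["text"], min_chars),
        reverse=True,
    )
```

A chunk such as `"1"` keeps only a twentieth of its score under these defaults, while a chunk consisting solely of punctuation drops to zero, so single-character hits sink without being removed from the index.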

## Proposed changes

### Small / high-value

- **`kb info --no-chunks`** (or make `--chunks` opt-in): default to metadata + chunk count, only include chunks when asked. Human format should always lead with the metadata block.
- **`kb list --title <substring>`** (or accept a positional query) for filename / title search. At minimum, error or warn when positional args are passed and ignored.
- **Include `document_id` in `kb search` result objects.** Either at the top of each result or under `source.document_id` (matching the skill docs).

### Medium

- **`kb find <query>`** as a doc-level search that aggregates chunk hits per document and returns ranked *documents* (with hit count, top chunk preview). This is what users usually want when they say "find my PDF about X."
- **Update the `kb` skill docs** to match actual CLI output shape, and to steer users toward `kb list | jq` for filename lookups until proper filtering lands.
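
The aggregation behind a `kb find`-style command could be sketched as follows. It assumes search hits already carry `document_id` (the fix proposed above) along with `title`, `text`, and a higher-is-better `score`; the function name, the output field names, and the tie-breaking rule are all illustrative choices rather than a settled design.

```python
from collections import defaultdict


def rank_documents(hits: list[dict], preview_chars: int = 80) -> list[dict]:
    """Group chunk hits by document and rank documents by hit count, then best score."""
    grouped: dict[int, list[dict]] = defaultdict(list)
    for hit in hits:
        grouped[hit["document_id"]].append(hit)

    documents = []
    for document_id, doc_hits in grouped.items():
        # The strongest chunk supplies the title and the preview text.
        best = max(doc_hits, key=lambda hit: hit["score"])
        documents.append(
            {
                "document_id": document_id,
                "title": best["title"],
                "hit_count": len(doc_hits),
                "top_score": best["score"],
                "preview": best["text"][:preview_chars],
            }
        )
    # Most hits first; ties broken by the strongest single chunk.
    documents.sort(key=lambda doc: (doc["hit_count"], doc["top_score"]), reverse=True)
    return documents
```

Ranking by raw hit count favours long documents, so a real implementation might normalise by chunk count or lead with the best score instead; the sketch keeps the simplest rule that matches "hit count, top chunk preview".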

### Larger

- **Quality filter for short chunks** during ingestion (e.g. drop chunks with < N alphanumeric chars, or fold them into neighbours). Stops scanned/image-heavy PDFs from polluting search.
- **OCR path for scan-heavy PDFs.** The M38T manual extracted enough real text to be useful, but other "scan" docs likely don't. Detect low text density per page and route through OCR.
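
For the short-chunk filter, the "fold them into neighbours" variant could be sketched like this. It treats chunks as plain strings in reading order, which is an assumption about the ingestion pipeline, and the threshold of 20 alphanumeric characters stands in for the unspecified N.

```python
def alnum_count(text: str) -> int:
    """Number of alphanumeric characters in text."""
    return sum(ch.isalnum() for ch in text)


def fold_short_chunks(chunks: list[str], min_alnum: int = 20) -> list[str]:
    """Merge chunks below min_alnum alphanumeric chars into a neighbouring chunk.

    Short chunks are appended to the previous kept chunk; short chunks that come
    before any long chunk are prepended to the first long one.
    """
    kept: list[str] = []
    pending: list[str] = []  # short chunks seen before the first long chunk
    for chunk in chunks:
        if alnum_count(chunk) >= min_alnum:
            if pending:
                chunk = " ".join(pending + [chunk])
                pending = []
            kept.append(chunk)
        elif kept:
            kept[-1] = f"{kept[-1]} {chunk}"
        else:
            pending.append(chunk)
    if pending:
        # No chunk reached the threshold: keep the text as one combined chunk.
        kept.append(" ".join(pending))
    return kept
```

Folding preserves every extracted character, so nothing is lost if a short fragment turns out to matter; dropping such chunks outright would be the simpler alternative when that text is known to be noise.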

## Quick reference (current workarounds)

```bash
# Find a doc by filename
kb list --type pdf --format json | jq '.[] | select(.title | contains("M38T"))'

# Get just metadata for a doc
kb info 2077 --format json | jq 'del(.chunks)'

# Download the original
kb export 2077 -o manual.pdf
```