steve/kb


steve 9eccc527ae Add next-steps.md with UX improvement ideas for kb CLI

Captures pain points found while trying to locate an uploaded PDF: kb
list silently ignores positional args, kb search results lack
document_id, kb info dumps all chunks with no summary mode, and
scan-heavy PDFs produce noisy single-char chunk hits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-13 20:34:55 +01:00

3.4 KiB


kb — Next Steps

UX improvements to make documents easier to find and inspect, prompted by a session where searching for an uploaded PDF (M38T_PHEV_RHD_OM_EN_UK_20251209.pdf, doc id 2077, 1801 chunks) surfaced lots of chunk hits but no obvious path back to the original document.

Problems observed

1. `kb list` silently ignores positional arguments

kb list --type pdf "M38T_PHEV_RHD_OM_EN_UK_20251209"

The quoted term is dropped without warning; user gets the default newest-first listing and assumes the document is missing. kb list currently only supports --tags and --type filters.
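One way to avoid the silent drop, sketched with Python's `argparse`. This makes no claim about how kb actually parses its arguments: `build_list_parser`, `filter_by_title`, the `--title` flag, and the sample documents are all hypothetical.

```python
import argparse


def build_list_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for `kb list`; not kb's real argument handling."""
    parser = argparse.ArgumentParser(prog="kb list")
    parser.add_argument("--type", dest="doc_type")
    parser.add_argument("--tags")
    # Assumed new flag: case-insensitive substring match on the title.
    parser.add_argument("--title")
    # Accept a stray positional instead of dropping it, and treat it as a title query.
    parser.add_argument("query", nargs="?")
    return parser


def filter_by_title(documents, needle):
    """Keep documents whose title contains `needle`, ignoring case."""
    if not needle:
        return list(documents)
    needle = needle.lower()
    return [d for d in documents if needle in (d.get("title") or "").lower()]


# Illustrative data only; the real `kb list` output shape may differ.
docs = [
    {"id": 1, "title": "notes.md"},
    {"id": 2077, "title": "M38T_PHEV_RHD_OM_EN_UK_20251209.pdf"},
]
args = build_list_parser().parse_args(["--type", "pdf", "M38T_PHEV"])
matches = filter_by_title(docs, args.title or args.query)
print([d["id"] for d in matches])
```

If a positional query is not wanted, leaving the `query` argument out makes `argparse` exit with an "unrecognized arguments" error, which is still better than a silently unfiltered listing.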

2. `kb search` returns chunks with no `document_id`

Result objects expose chunk_id, title, source_path, tags — but not document_id. To get from a search hit back to the owning document you have to title-match against kb list output or call an undocumented endpoint. The skill docs even claim a source.document_id field that isn't actually present in the CLI output.
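Until `document_id` is exposed, the title-matching workaround can be scripted. The sketch below assumes `kb list --format json` yields objects with `id`, `title`, and `source_path` (only `title` is confirmed by the workaround at the end of this file); `resolve_document_id` and the sample data are hypothetical.

```python
def resolve_document_id(hit, documents):
    """Return the id of the document that owns a search hit, or None.

    Tries source_path first, then title. A key is used only when it
    identifies exactly one document, so ambiguous matches are skipped.
    """
    for key in ("source_path", "title"):
        value = hit.get(key)
        if not value:
            continue
        ids = {d["id"] for d in documents if d.get(key) == value}
        if len(ids) == 1:
            return ids.pop()
    return None


# Illustrative data only.
documents = [
    {"id": 10, "title": "manual", "source_path": "/docs/a/manual.pdf"},
    {"id": 11, "title": "manual", "source_path": "/docs/b/manual.pdf"},
]
hit = {"chunk_id": 501, "title": "manual", "source_path": "/docs/b/manual.pdf", "tags": []}
print(resolve_document_id(hit, documents))
```

Matching on `source_path` before `title` matters here because two documents can share a title, as in the sample data.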

3. `kb info` dumps every chunk with no summary mode

kb info 2077 returns ~1801 chunk objects. The document-level metadata (id, title, original_filename, source_path, stored_path, doc_type, language, content_hash, has_file, tags, created_at, updated_at) is present at the top level of the JSON, but in practice it is buried: the human format presumably prints the whole chunk list, so the user sees only chunks.

There's no way to ask for "just tell me about this document."
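A summary view could be as small as the following sketch. It assumes the chunk list sits under a top-level `chunks` key, as the `jq 'del(.chunks)'` workaround implies; `summarize_info` and the `chunk_count` field are hypothetical names.

```python
def summarize_info(info):
    """Return the document-level metadata plus a chunk count, without the chunks."""
    summary = {key: value for key, value in info.items() if key != "chunks"}
    summary["chunk_count"] = len(info.get("chunks") or [])
    return summary


# Illustrative data only; real output carries more metadata fields.
info = {
    "id": 2077,
    "title": "M38T_PHEV_RHD_OM_EN_UK_20251209.pdf",
    "doc_type": "pdf",
    "chunks": [{"chunk_id": 1}, {"chunk_id": 2}, {"chunk_id": 3}],
}
print(summarize_info(info))
```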

4. Search hits can look like noise on image-heavy PDFs

Top chunks for the M38T search were single characters ("1", "B", "\""). Almost certainly an FTS artefact on short tokens from a scan/image-heavy PDF — but it makes the result set look broken. Worth considering a minimum-text-length filter on indexed chunks, or down-weighting very short chunks in ranking.
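The down-weighting idea might look like the sketch below. It assumes each hit carries its chunk `text` and a `score` where higher is better; neither field is confirmed for kb search results, and `rerank` and `min_chars` are hypothetical.

```python
def alnum_length(text):
    """Count the alphanumeric characters in a chunk."""
    return sum(ch.isalnum() for ch in text)


def rerank(hits, min_chars=20):
    """Sort hits by score, scaling down chunks shorter than min_chars alphanumerics."""
    def adjusted(hit):
        weight = min(1.0, alnum_length(hit["text"]) / min_chars)
        return hit["score"] * weight

    return sorted(hits, key=adjusted, reverse=True)


# Illustrative data only: a single-character chunk with a high raw score.
hits = [
    {"chunk_id": 1, "text": "1", "score": 9.0},
    {"chunk_id": 2, "text": "A full sentence of extracted body text.", "score": 5.0},
]
print([h["chunk_id"] for h in rerank(hits)])
```

A linear weight keeps very short chunks findable when nothing else matches, which a hard cutoff would not.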

Proposed changes

Small / high-value

kb info --no-chunks (or make --chunks opt-in): default to metadata + chunk count, only include chunks when asked. Human format should always lead with the metadata block.
kb list --title <substring> (or accept a positional query) for filename / title search. At minimum, error or warn when positional args are passed and ignored.
Include document_id in kb search result objects. Either at the top of each result or under source.document_id (matching the skill docs).

Medium

kb find <query> as a doc-level search that aggregates chunk hits per document and returns ranked documents (with hit count, top chunk preview). This is what users usually want when they say "find my PDF about X."
Update the kb skill docs to match actual CLI output shape, and to steer users toward kb list | jq for filename lookups until proper filtering lands.
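The aggregation behind a `kb find` command could be sketched as follows. It assumes search hits already carry `document_id`, plus `text` and a higher-is-better `score`; all three are assumptions, and `rank_documents` is a hypothetical name.

```python
from collections import defaultdict


def rank_documents(hits, preview_chars=80):
    """Group chunk hits by document and rank documents by best score, then hit count."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit["document_id"]].append(hit)

    ranked = []
    for document_id, doc_hits in groups.items():
        best = max(doc_hits, key=lambda h: h["score"])
        ranked.append({
            "document_id": document_id,
            "title": best["title"],
            "hit_count": len(doc_hits),
            "top_score": best["score"],
            "preview": best["text"][:preview_chars],
        })
    ranked.sort(key=lambda d: (d["top_score"], d["hit_count"]), reverse=True)
    return ranked


# Illustrative data only.
hits = [
    {"document_id": 7, "title": "doc A", "text": "first match", "score": 2.0},
    {"document_id": 9, "title": "doc B", "text": "best match", "score": 4.0},
    {"document_id": 7, "title": "doc A", "text": "second match", "score": 3.0},
]
for doc in rank_documents(hits):
    print(doc["document_id"], doc["hit_count"], doc["preview"])
```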

Larger

Quality filter for short chunks during ingestion (e.g. drop chunks with < N alphanumeric chars, or fold them into neighbours). Stops scanned/image-heavy PDFs from polluting search.
OCR path for scan-heavy PDFs. The M38T manual extracted enough real text to be useful, but other "scan" docs likely don't. Detect low text density per page and route through OCR.
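Both ingestion ideas can be sketched on plain strings. The thresholds, the function names `fold_short_chunks` and `pages_needing_ocr`, and the sample text are hypothetical; kb's real chunker and page model are not shown in this file.

```python
def alnum_length(text):
    """Count the alphanumeric characters in a piece of text."""
    return sum(ch.isalnum() for ch in text)


def fold_short_chunks(chunks, min_alnum=20):
    """Fold chunks with fewer than min_alnum alphanumerics into a neighbour.

    A short chunk is appended to the previous kept chunk. Short chunks that
    come before any kept chunk are prepended to the next one. If every chunk
    is short, they are joined into a single chunk.
    """
    folded = []
    pending = ""  # short text seen before any chunk has been kept
    for text in chunks:
        if alnum_length(text) >= min_alnum:
            folded.append(f"{pending} {text}" if pending else text)
            pending = ""
        elif folded:
            folded[-1] = f"{folded[-1]} {text}"
        else:
            pending = f"{pending} {text}".strip()
    if pending:
        folded.append(pending)
    return folded


def pages_needing_ocr(page_texts, min_alnum_per_page=200):
    """Return 1-based page numbers whose extracted text looks too sparse."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if alnum_length(text) < min_alnum_per_page
    ]


# Illustrative data only.
chunks = ["1", "This is a normal paragraph of extracted text.", "B"]
print(fold_short_chunks(chunks))
print(pages_needing_ocr(["", "x" * 300, "12"]))
```

Folding rather than dropping keeps every extracted character searchable, at the cost of slightly noisier chunk text.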

Quick reference (current workarounds)

# Find a doc by filename
kb list --type pdf --format json | jq '.[] | select(.title | contains("M38T"))'

# Get just metadata for a doc
kb info 2077 --format json | jq 'del(.chunks)'

# Download the original
kb export 2077 -o manual.pdf
