kb — Next Steps
UX improvements to make documents easier to find and inspect, prompted by a session where searching for an uploaded PDF (M38T_PHEV_RHD_OM_EN_UK_20251209.pdf, doc id 2077, 1801 chunks) surfaced lots of chunk hits but no obvious path back to the original document.
Problems observed
1. kb list silently ignores positional arguments
kb list --type pdf "M38T_PHEV_RHD_OM_EN_UK_20251209"
The quoted term is dropped without warning; user gets the default newest-first listing and assumes the document is missing. kb list currently only supports --tags and --type filters.
2. kb search returns chunks with no document_id
Result objects expose chunk_id, title, source_path, tags — but not document_id. To get from a search hit back to the owning document you have to title-match against kb list output or call an undocumented endpoint. The skill docs even claim a source.document_id field that isn't actually present in the CLI output.
3. kb info dumps every chunk with no summary mode
kb info 2077 returns ~1801 chunk objects. The document-level metadata (id, title, original_filename, source_path, stored_path, doc_type, language, content_hash, has_file, tags, created_at, updated_at) is present at the top level of the JSON, but in practice it's invisible: the human format presumably prints the whole chunk list as well, so the metadata is buried and the user sees only chunks.
There's no way to ask for "just tell me about this document."
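A summary mode could be as small as the following Python sketch. It assumes the JSON from kb info keeps its chunks under a top-level "chunks" key (which the del(.chunks) workaround in the quick reference suggests); the function name and the chunk_count field are hypothetical, not part of the current CLI.

```python
from typing import Any


def summarize_document(info: dict[str, Any]) -> dict[str, Any]:
    """Return document-level metadata plus a chunk count, without the chunks."""
    # Assumption: chunks live under a top-level "chunks" key.
    summary = {key: value for key, value in info.items() if key != "chunks"}
    summary["chunk_count"] = len(info.get("chunks") or [])
    return summary


# Hypothetical sample payload; a real one carries more metadata fields.
sample = {
    "id": 2077,
    "doc_type": "pdf",
    "original_filename": "M38T_PHEV_RHD_OM_EN_UK_20251209.pdf",
    "chunks": [{"chunk_id": 1, "text": "1"}, {"chunk_id": 2, "text": "B"}],
}
print(summarize_document(sample))
```

The input dict is left untouched, so the same payload can still feed a full chunk listing when one is asked for.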
4. Search hits can look like noise on image-heavy PDFs
Top chunks for the M38T search were single characters ("1", "B", "\""). Almost certainly an FTS artefact on short tokens from a scan/image-heavy PDF — but it makes the result set look broken. Worth considering a minimum-text-length filter on indexed chunks, or down-weighting very short chunks in ranking.
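As a rough illustration of the down-weighting idea, the Python sketch below multiplies each hit's score by a factor that grows with the amount of alphanumeric text in the chunk. It assumes hits carry "text" and "score" fields and that a higher score is better; neither is confirmed for kb search, and the threshold of 40 characters is an arbitrary placeholder.

```python
def alnum_len(text: str) -> int:
    """Count alphanumeric characters, ignoring punctuation and whitespace."""
    return sum(ch.isalnum() for ch in text)


def length_weight(text: str, full_weight_at: int = 40) -> float:
    """Scale linearly from 0.0 up to 1.0 at `full_weight_at` alphanumeric chars."""
    return min(1.0, alnum_len(text) / full_weight_at)


def rerank(hits: list[dict], full_weight_at: int = 40) -> list[dict]:
    """Sort hits by score multiplied by the length weight, best first."""
    # Assumption: "score" is higher-is-better; invert it first if the
    # search backend reports lower-is-better ranks.
    return sorted(
        hits,
        key=lambda h: h["score"] * length_weight(h["text"], full_weight_at),
        reverse=True,
    )


# Hypothetical hits shaped like the noisy result set described above.
hits = [
    {"chunk_id": 1, "text": "1", "score": 9.0},
    {"chunk_id": 2, "text": "B", "score": 8.5},
    {"chunk_id": 3, "text": "A longer passage of real extracted text with enough characters", "score": 4.0},
]
print([h["chunk_id"] for h in rerank(hits)])  # [3, 1, 2]
```

Single-character chunks stay in the result set but sink below any hit with substantial text, which may be preferable to dropping them outright.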
Proposed changes
Small / high-value
- kb info --no-chunks (or make --chunks opt-in): default to metadata + chunk count, only include chunks when asked. Human format should always lead with the metadata block.
- kb list --title <substring> (or accept a positional query) for filename / title search. At minimum, error or warn when positional args are passed and ignored.
- Include document_id in kb search result objects. Either at the top of each result or under source.document_id (matching the skill docs).
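The title filter and positional query could look roughly like this in Python. This is a sketch only: it assumes an argparse-style parser, which may not be how kb is actually built, and --title, the positional query, and the helper names are all hypothetical.

```python
import argparse
from typing import Optional


def build_list_parser() -> argparse.ArgumentParser:
    """Parser for a hypothetical `kb list` that accepts a title filter."""
    parser = argparse.ArgumentParser(prog="kb list")
    parser.add_argument("--type", dest="doc_type")
    parser.add_argument("--tags")
    # Hypothetical flag proposed in this document; not in the current CLI.
    parser.add_argument("--title", help="case-insensitive title substring")
    # Optional positional treated as shorthand for --title.
    parser.add_argument("query", nargs="?", help="shorthand for --title")
    return parser


def title_filter(args: argparse.Namespace) -> Optional[str]:
    """Prefer the explicit flag, fall back to the positional query."""
    return args.title or args.query


def filter_by_title(docs: list[dict], needle: Optional[str]) -> list[dict]:
    """Keep documents whose title contains `needle`, ignoring case."""
    if not needle:
        return docs
    needle = needle.casefold()
    return [d for d in docs if needle in (d.get("title") or "").casefold()]


args = build_list_parser().parse_args(["--type", "pdf", "m38t_phev"])
# Hypothetical listing; the title of doc 2077 is assumed to be the filename stem.
docs = [
    {"id": 2077, "title": "M38T_PHEV_RHD_OM_EN_UK_20251209"},
    {"id": 12, "title": "Unrelated notes"},
]
print(filter_by_title(docs, title_filter(args)))
```

If positional queries are not wanted, leaving out the query argument makes argparse reject a stray term with an "unrecognized arguments" error rather than dropping it silently.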
Medium
- kb find <query> as a doc-level search that aggregates chunk hits per document and returns ranked documents (with hit count, top chunk preview). This is what users usually want when they say "find my PDF about X."
- Update the kb skill docs to match actual CLI output shape, and to steer users toward kb list | jq for filename lookups until proper filtering lands.
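The per-document aggregation behind kb find might be sketched in Python as follows. It assumes each chunk hit already carries document_id (which depends on the search-result change proposed above), plus "title", "text", and a higher-is-better "score"; ranking by best chunk score with hit count as a tie-breaker is one possible choice, not a settled design.

```python
from collections import defaultdict


def aggregate_hits(hits: list[dict], limit: int = 10) -> list[dict]:
    """Group chunk hits by document and rank documents by their best chunk."""
    by_doc: dict[int, list[dict]] = defaultdict(list)
    for hit in hits:
        by_doc[hit["document_id"]].append(hit)

    documents = []
    for document_id, doc_hits in by_doc.items():
        best = max(doc_hits, key=lambda h: h["score"])
        documents.append({
            "document_id": document_id,
            "title": best["title"],
            "hit_count": len(doc_hits),
            "top_score": best["score"],
            "preview": best["text"][:80],  # short preview of the top chunk
        })

    # Best chunk score first; more hits wins a tie.
    documents.sort(key=lambda d: (d["top_score"], d["hit_count"]), reverse=True)
    return documents[:limit]


# Hypothetical chunk hits across two documents.
hits = [
    {"document_id": 2077, "title": "Owner manual", "text": "first hit", "score": 3.0},
    {"document_id": 12, "title": "Other doc", "text": "second hit", "score": 4.0},
    {"document_id": 2077, "title": "Owner manual", "text": "third hit", "score": 5.0},
]
for doc in aggregate_hits(hits):
    print(doc["document_id"], doc["hit_count"], doc["preview"])
```

Summing scores instead of taking the maximum would favour long documents with many weak hits, so the choice of aggregate is worth deciding deliberately.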
Larger
- Quality filter for short chunks during ingestion (e.g. drop chunks with < N alphanumeric chars, or fold them into neighbours). Stops scanned/image-heavy PDFs from polluting search.
- OCR path for scan-heavy PDFs. The M38T manual extracted enough real text to be useful, but other "scan" docs likely don't. Detect low text density per page and route through OCR.
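Both ingestion ideas can be sketched with the same character-counting helper. The Python below folds chunks with too few alphanumeric characters into a neighbour and flags pages whose extracted text is too sparse, using a per-page character count as a stand-in for text density. The thresholds (20 and 50) and all function names are hypothetical, and the sketch assumes chunks and pages are available as plain strings in document order.

```python
def alnum_len(text: str) -> int:
    """Count alphanumeric characters, ignoring punctuation and whitespace."""
    return sum(ch.isalnum() for ch in text)


def fold_short_chunks(chunks: list[str], min_alnum: int = 20) -> list[str]:
    """Merge chunks with fewer than `min_alnum` alphanumeric chars into a neighbour."""
    folded: list[str] = []
    pending = ""  # short text waiting to be prepended to the next chunk
    for chunk in chunks:
        text = f"{pending} {chunk}".strip() if pending else chunk
        if alnum_len(text) < min_alnum:
            pending = text
            continue
        folded.append(text)
        pending = ""
    if pending:
        if folded:
            # Trailing short text has no following chunk; attach it backwards.
            folded[-1] = f"{folded[-1]} {pending}"
        elif alnum_len(pending) > 0:
            # The whole input was short; keep it rather than lose the text.
            folded.append(pending)
    return folded


def pages_needing_ocr(page_texts: list[str], min_alnum_per_page: int = 50) -> list[int]:
    """Return 0-based indexes of pages whose extracted text is too sparse."""
    return [
        index
        for index, text in enumerate(page_texts)
        if alnum_len(text) < min_alnum_per_page
    ]


chunks = ["1", "B", "Long text with plenty of characters here", '"', "Another long passage of extracted text"]
print(fold_short_chunks(chunks))
print(pages_needing_ocr(["", "x" * 60, "12"]))  # [0, 2]
```

Folding keeps the short fragments searchable as part of a larger chunk, whereas dropping them is simpler but discards text such as figure labels.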
Quick reference (current workarounds)
# Find a doc by filename
kb list --type pdf --format json | jq '.[] | select(.title | contains("M38T"))'
# Get just metadata for a doc
kb info 2077 --format json | jq 'del(.chunks)'
# Download the original
kb export 2077 -o manual.pdf