All major components: hotkey listener (rdev), audio capture (cpal), resampling (rubato), VAD (Silero ONNX), Parakeet v3 TDT transcription (ort), overlay window (winit+softbuffer), paste simulation (enigo+arboard), audio feedback (rodio), YAML config, CLI with clap, HuggingFace model download. ~2400 lines of Rust across 16 source files. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2.8 KiB
CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
Mouth is a single-binary, offline speech-to-text tool. Press a global hotkey, speak, and transcribed text is pasted at your cursor. Configured via YAML, no UI. Primary target is Windows; Linux/macOS supported where possible.
Uses Parakeet TDT 0.6B v3 (ONNX, from istupakov/parakeet-tdt-0.6b-v3-onnx) for transcription, Silero VAD v4 for voice activity detection.
Build & Run
cargo build # debug build
cargo build --release # release build
cargo run # run daemon (default command)
cargo run -- config --show # show current config
cargo run -- config # interactive config TUI
cargo run -- config --reset # reset to defaults
cargo run -- models # list models
cargo run -- models --download # download configured model
cargo run -- status # daemon status
Architecture
Single-binary Rust application. Core pipeline: hotkey capture (rdev) → audio recording (cpal) → resampling to 16kHz (rubato) → VAD (Silero ONNX) → mel spectrogram → transcription (Parakeet v3 TDT decoder via ort) → clipboard/paste (arboard + enigo). Minimal native overlay window (winit + softbuffer).
Threading model: Main thread owns the overlay window event loop (required by winit). Background threads: hotkey listener (rdev::listen is blocking), audio recorder (cpal stream), coordinator (state machine). All communicate via std::sync::mpsc channels.
Coordinator state machine: Idle → Recording → Transcribing → (Pasting) → Idle. Cancel from Recording returns to Idle.
Parakeet v3 inference: Two-stage ONNX model — encoder (FastConformer) produces features, decoder+joint (TDT transducer) greedily decodes tokens with duration predictions. Audio preprocessing: pre-emphasis → STFT → 128-band log-mel → per-utterance CMVN. Vocab is SentencePiece BPE with ▁ as word boundary marker.
ort crate (v2.0.0-rc.12) notes: Session::run needs &mut self. Input values must be converted to Value::into_dyn() before passing. Use SessionInputValue::Owned(value.into_dyn()) pattern. try_extract_tensor returns (&Shape, &[T]) tuple. from_shape_vec needs [usize; N] not Vec<usize>.
Config lives at ~/.config/mouth/config.yaml (Linux/macOS) or %APPDATA%\mouth\config.yaml (Windows). Models cached via HuggingFace Hub standard cache (~/.cache/huggingface/hub/).
Cross-Compilation
Developing on Ubuntu 24.04, targeting Windows:
cargo build --target x86_64-pc-windows-gnu
System Dependencies (Ubuntu)
sudo apt-get install libssl-dev libasound2-dev libpulse-dev libx11-dev libxcb-shape0-dev libxcb-xfixes0-dev libxkbcommon-dev libwayland-dev libgtk-3-dev libxtst-dev libxdo-dev cmake