steve 9b0bf7d9e3 Implement core speech-to-text pipeline
All major components: hotkey listener (rdev), audio capture (cpal),
resampling (rubato), VAD (Silero ONNX), Parakeet v3 TDT transcription
(ort), overlay window (winit+softbuffer), paste simulation (enigo+arboard),
audio feedback (rodio), YAML config, CLI with clap, HuggingFace model
download. ~2400 lines of Rust across 16 source files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 16:47:46 +01:00


CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Mouth is a single-binary, offline speech-to-text tool. Press a global hotkey, speak, and transcribed text is pasted at your cursor. Configured via YAML, no UI. Primary target is Windows; Linux/macOS supported where possible.

Uses Parakeet TDT 0.6B v3 (ONNX, from istupakov/parakeet-tdt-0.6b-v3-onnx) for transcription and Silero VAD v4 for voice activity detection.

Build & Run

cargo build                    # debug build
cargo build --release          # release build
cargo run                      # run daemon (default command)
cargo run -- config --show     # show current config
cargo run -- config            # interactive config TUI
cargo run -- config --reset    # reset to defaults
cargo run -- models            # list models
cargo run -- models --download # download configured model
cargo run -- status            # daemon status

Architecture

Single-binary Rust application. Core pipeline: hotkey capture (rdev) → audio recording (cpal) → resampling to 16kHz (rubato) → VAD (Silero ONNX) → mel spectrogram → transcription (Parakeet v3 TDT decoder via ort) → clipboard/paste (arboard + enigo). Minimal native overlay window (winit + softbuffer).

Threading model: Main thread owns the overlay window event loop (required by winit). Background threads: hotkey listener (rdev::listen is blocking), audio recorder (cpal stream), coordinator (state machine). All communicate via std::sync::mpsc channels.
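As a rough illustration of the channel layout described above, here is a minimal std-only sketch. The `Msg` enum, its variants, and `run_demo` are hypothetical names chosen for this example; the real producers (rdev, cpal) are replaced by stand-in threads, and the overlay/main-thread side is omitted.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical message type shared by all producer threads.
#[derive(Debug, PartialEq)]
pub enum Msg {
    HotkeyDown,
    HotkeyUp,
    AudioChunk(Vec<f32>),
}

/// Spawns two stand-in producers and a coordinator thread, then returns
/// the total number of audio samples the coordinator received.
pub fn run_demo() -> usize {
    let (tx, rx) = mpsc::channel::<Msg>();

    // Stand-in for the hotkey listener thread (rdev::listen blocks in the real app).
    let hotkey_tx = tx.clone();
    let hotkey = thread::spawn(move || {
        hotkey_tx.send(Msg::HotkeyDown).unwrap();
        hotkey_tx.send(Msg::HotkeyUp).unwrap();
    });

    // Stand-in for the audio recorder thread (a cpal stream callback in the real app).
    let audio_tx = tx.clone();
    let audio = thread::spawn(move || {
        for _ in 0..3 {
            audio_tx.send(Msg::AudioChunk(vec![0.0; 160])).unwrap();
        }
    });

    // Drop the original sender so the receiver loop ends once both producers finish.
    drop(tx);

    // Coordinator: the single consumer that drains the channel.
    let coordinator = thread::spawn(move || {
        let mut samples = 0;
        for msg in rx {
            if let Msg::AudioChunk(chunk) = msg {
                samples += chunk.len();
            }
        }
        samples
    });

    hotkey.join().unwrap();
    audio.join().unwrap();
    coordinator.join().unwrap()
}
```

A single multi-producer channel keeps the coordinator as the only place where ordering decisions are made; whether the actual code uses one channel or several is not stated here.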

Coordinator state machine: Idle → Recording → Transcribing → (Pasting) → Idle. Cancel from Recording returns to Idle.
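The transitions above could be modelled as a pure function over two enums. This is only a sketch: the `Event` names are hypothetical, and it assumes the parenthesised Pasting step is skipped when the transcript is empty, which the description does not spell out.

```rust
/// Coordinator states, named after the diagram above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Idle,
    Recording,
    Transcribing,
    Pasting,
}

/// Hypothetical events; the real coordinator's message names may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    Start,
    Stop,
    Cancel,
    /// Transcription finished; `has_text` is false for an empty result.
    TranscriptReady { has_text: bool },
    PasteDone,
}

/// Returns the next state. Events that make no sense in the current
/// state are ignored and leave the state unchanged.
pub fn next_state(state: State, event: Event) -> State {
    use Event::*;
    use State::*;
    match (state, event) {
        (Idle, Start) => Recording,
        (Recording, Stop) => Transcribing,
        // Cancel from Recording returns to Idle.
        (Recording, Cancel) => Idle,
        (Transcribing, TranscriptReady { has_text: true }) => Pasting,
        // Assumption: nothing to paste, so skip straight back to Idle.
        (Transcribing, TranscriptReady { has_text: false }) => Idle,
        (Pasting, PasteDone) => Idle,
        (s, _) => s,
    }
}
```

Keeping the transition logic free of I/O makes it easy to unit-test independently of the audio and hotkey threads.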

Parakeet v3 inference: Two-stage ONNX model. The encoder (FastConformer) produces features; the decoder+joint (TDT transducer) greedily decodes tokens with duration predictions. Audio preprocessing: pre-emphasis → STFT → 128-band log-mel → per-utterance CMVN. Vocab is SentencePiece BPE with ▁ (U+2581) as the word boundary marker.
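Three of the simpler steps in this description (pre-emphasis, per-utterance CMVN, and joining SentencePiece pieces on the ▁ marker) can be sketched in plain Rust. The function names are hypothetical, and the pre-emphasis coefficient, the variance estimator (population variance here), and the epsilon are assumptions rather than values taken from the model; the STFT and mel filterbank are omitted.

```rust
/// Pre-emphasis: y[0] = x[0], y[n] = x[n] - coeff * x[n-1].
/// `coeff` is an assumption left to the caller (0.97 is a common choice).
pub fn pre_emphasis(samples: &[f32], coeff: f32) -> Vec<f32> {
    let mut out = Vec::with_capacity(samples.len());
    let mut prev = 0.0f32;
    for (i, &x) in samples.iter().enumerate() {
        out.push(if i == 0 { x } else { x - coeff * prev });
        prev = x;
    }
    out
}

/// Per-utterance CMVN: normalizes each mel band to zero mean and unit
/// variance across the frames of one utterance. `frames` is indexed as
/// [time][band]; every frame is assumed to have the same number of bands.
pub fn cmvn(frames: &mut [Vec<f32>]) {
    if frames.is_empty() {
        return;
    }
    let n = frames.len() as f32;
    let bands = frames[0].len();
    for b in 0..bands {
        let mean = frames.iter().map(|f| f[b]).sum::<f32>() / n;
        let var = frames.iter().map(|f| (f[b] - mean).powi(2)).sum::<f32>() / n;
        // Small epsilon (an assumption) guards against division by zero.
        let std = var.sqrt() + 1e-5;
        for f in frames.iter_mut() {
            f[b] = (f[b] - mean) / std;
        }
    }
}

/// Joins SentencePiece pieces into text: U+2581 marks the start of a
/// word, so it becomes a space and the leading space is trimmed.
pub fn detokenize(pieces: &[&str]) -> String {
    let joined: String = pieces.concat();
    joined.replace('\u{2581}', " ").trim_start().to_string()
}
```

For example, the pieces `▁hello`, `▁wor`, `ld` join to `hello world`.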

ort crate (v2.0.0-rc.12) notes: Session::run needs &mut self. Input values must be converted with into_dyn() before passing; use the SessionInputValue::Owned(value.into_dyn()) pattern. try_extract_tensor returns a (&Shape, &[T]) tuple. from_shape_vec needs [usize; N], not Vec<usize>.

Config lives at ~/.config/mouth/config.yaml (Linux/macOS) or %APPDATA%\mouth\config.yaml (Windows). Models cached via HuggingFace Hub standard cache (~/.cache/huggingface/hub/).
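The config location rule above might be resolved along these lines. This is a sketch under assumptions: the function names are hypothetical, it reads only the HOME and APPDATA environment variables, and the actual code may well use a directories crate instead.

```rust
use std::env;
use std::path::PathBuf;

/// Pure helper: builds the config path from already-looked-up values so
/// it can be tested without touching the process environment.
/// Returns None when the required base directory is unknown.
pub fn resolve_config_path(
    windows: bool,
    appdata: Option<&str>,
    home: Option<&str>,
) -> Option<PathBuf> {
    let base = if windows {
        // %APPDATA%\mouth\config.yaml
        PathBuf::from(appdata?)
    } else {
        // ~/.config/mouth/config.yaml
        PathBuf::from(home?).join(".config")
    };
    Some(base.join("mouth").join("config.yaml"))
}

/// Convenience wrapper that reads the environment of the running process.
pub fn config_path() -> Option<PathBuf> {
    let appdata = env::var("APPDATA").ok();
    let home = env::var("HOME").ok();
    resolve_config_path(cfg!(windows), appdata.as_deref(), home.as_deref())
}
```

Splitting the lookup from the path construction keeps the platform branch testable on any host.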

Cross-Compilation

Developing on Ubuntu 24.04, targeting Windows:

cargo build --target x86_64-pc-windows-gnu

System Dependencies (Ubuntu)

sudo apt-get install libssl-dev libasound2-dev libpulse-dev libx11-dev libxcb-shape0-dev libxcb-xfixes0-dev libxkbcommon-dev libwayland-dev libgtk-3-dev libxtst-dev libxdo-dev cmake