Implement core speech-to-text pipeline

All major components: hotkey listener (rdev), audio capture (cpal), resampling (rubato), VAD (Silero ONNX), Parakeet v3 TDT transcription (ort), overlay window (winit+softbuffer), paste simulation (enigo+arboard), audio feedback (rodio), YAML config, CLI with clap, HuggingFace model download. ~2400 lines of Rust across 16 source files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 16:47:46 +01:00
parent 6b737f92fe
commit 9b0bf7d9e3
22 changed files with 7750 additions and 0 deletions
@@ -0,0 +1,50 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+Mouth is a single-binary, offline speech-to-text tool. Press a global hotkey, speak, and transcribed text is pasted at your cursor. Configured via YAML, no UI. Primary target is Windows; Linux/macOS supported where possible.
+
+Uses Parakeet TDT 0.6B v3 (ONNX, from `istupakov/parakeet-tdt-0.6b-v3-onnx`) for transcription and Silero VAD v4 for voice activity detection.
+
+## Build & Run
+
+```bash
+cargo build                    # debug build
+cargo build --release          # release build
+cargo run                      # run daemon (default command)
+cargo run -- config --show     # show current config
+cargo run -- config            # interactive config TUI
+cargo run -- config --reset    # reset to defaults
+cargo run -- models            # list models
+cargo run -- models --download # download configured model
+cargo run -- status            # daemon status
+```
+
+## Architecture
+
+Single-binary Rust application. Core pipeline: hotkey capture (rdev) → audio recording (cpal) → resampling to 16 kHz (rubato) → VAD (Silero ONNX) → mel spectrogram → transcription (Parakeet v3 TDT decoder via ort) → clipboard/paste (arboard + enigo). Minimal native overlay window (winit + softbuffer).
+
+**Threading model:** Main thread owns the overlay window event loop (required by winit). Background threads: hotkey listener (rdev::listen is blocking), audio recorder (cpal stream), coordinator (state machine). All communicate via `std::sync::mpsc` channels.
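The threading model above can be sketched with only the standard library. This is a minimal illustration, not the repository's code: the `Event` enum and `run_demo` function are hypothetical names, and the two producer threads merely stand in for the blocking `rdev::listen` loop and the cpal stream callback.

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::thread;

/// Hypothetical message type; the real message types are not shown in this file.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    HotkeyPressed,
    HotkeyReleased,
    AudioChunk(Vec<f32>),
}

/// Spawns stand-ins for the hotkey and recorder threads, then collects
/// everything they send the way a coordinator thread might.
pub fn run_demo() -> Vec<Event> {
    let (tx, rx): (Sender<Event>, Receiver<Event>) = mpsc::channel();

    // Stand-in for the hotkey listener thread (rdev::listen blocks in the real app).
    let hotkey_tx = tx.clone();
    let hotkey = thread::spawn(move || {
        hotkey_tx.send(Event::HotkeyPressed).unwrap();
        hotkey_tx.send(Event::HotkeyReleased).unwrap();
    });

    // Stand-in for the audio recorder thread (a cpal stream in the real app).
    let audio_tx = tx.clone();
    let audio = thread::spawn(move || {
        audio_tx.send(Event::AudioChunk(vec![0.0; 4])).unwrap();
    });

    // Drop the original sender so the receiver's iterator ends once
    // both producer threads have finished.
    drop(tx);

    // Coordinator: drains the channel until every sender is gone.
    let coordinator = thread::spawn(move || rx.iter().collect::<Vec<Event>>());

    hotkey.join().unwrap();
    audio.join().unwrap();
    coordinator.join().unwrap()
}
```

Messages from a single sender arrive in the order they were sent, so `HotkeyPressed` always precedes `HotkeyReleased`; the relative order of messages from different threads is not guaranteed.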
+
+**Coordinator state machine:** Idle → Recording → Transcribing → (Pasting) → Idle. Cancel from Recording returns to Idle.
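One way to picture the coordinator transitions listed above is a pure transition function. The `State`, `Input`, and `step` names are hypothetical, and the rule that an empty transcript skips the optional Pasting state is an assumption made for illustration; the source only says that Pasting is optional.

```rust
/// States named in the coordinator description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Idle,
    Recording,
    Transcribing,
    Pasting,
}

/// Hypothetical inputs that drive the transitions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Input {
    HotkeyDown,
    HotkeyUp,
    Cancel,
    /// `empty` is an assumed flag: an empty transcript skips Pasting.
    TranscriptReady { empty: bool },
    PasteDone,
}

/// Returns the next state; inputs that do not apply leave the state unchanged.
pub fn step(state: State, input: Input) -> State {
    match (state, input) {
        (State::Idle, Input::HotkeyDown) => State::Recording,
        (State::Recording, Input::HotkeyUp) => State::Transcribing,
        (State::Recording, Input::Cancel) => State::Idle,
        (State::Transcribing, Input::TranscriptReady { empty: false }) => State::Pasting,
        (State::Transcribing, Input::TranscriptReady { empty: true }) => State::Idle,
        (State::Pasting, Input::PasteDone) => State::Idle,
        (unchanged, _) => unchanged,
    }
}
```

Keeping the transition logic free of I/O makes it straightforward to unit-test separately from the channel plumbing.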
+
+**Parakeet v3 inference:** Two-stage ONNX model: the encoder (FastConformer) produces features, and the decoder+joint network (TDT transducer) greedily decodes tokens with duration predictions. Audio preprocessing: pre-emphasis → STFT → 128-band log-mel → per-utterance CMVN. Vocab is SentencePiece BPE with `▁` as the word-boundary marker.
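The first and last preprocessing steps named above, pre-emphasis and per-utterance CMVN, are simple enough to sketch without any crates. The function names, the `[frame][band]` layout, and the `1e-5` variance floor are assumptions for illustration; the STFT and mel filterbank stages in between are omitted, and the pre-emphasis coefficient is left as a parameter because the source does not state it.

```rust
/// Pre-emphasis filter: y[n] = x[n] - coeff * x[n-1], with x[-1] taken as 0.
pub fn pre_emphasis(samples: &[f32], coeff: f32) -> Vec<f32> {
    let mut out = Vec::with_capacity(samples.len());
    let mut prev = 0.0f32;
    for &x in samples {
        out.push(x - coeff * prev);
        prev = x;
    }
    out
}

/// Per-utterance cepstral mean and variance normalisation (CMVN).
/// `frames` is laid out as [frame][band]; each band is normalised to
/// zero mean and unit variance across all frames of the utterance.
pub fn cmvn(frames: &mut [Vec<f32>]) {
    if frames.is_empty() {
        return;
    }
    let n_bands = frames[0].len();
    let n_frames = frames.len() as f32;
    for band in 0..n_bands {
        let mean = frames.iter().map(|f| f[band]).sum::<f32>() / n_frames;
        let var = frames
            .iter()
            .map(|f| (f[band] - mean).powi(2))
            .sum::<f32>()
            / n_frames;
        // Assumed floor to avoid dividing by zero on constant bands.
        let std_dev = var.sqrt().max(1e-5);
        for frame in frames.iter_mut() {
            frame[band] = (frame[band] - mean) / std_dev;
        }
    }
}
```

In a pipeline like the one described, `pre_emphasis` would run on the 16 kHz waveform and `cmvn` on the 128-band log-mel frames of one utterance.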
+
+**ort crate (v2.0.0-rc.12) notes:** `Session::run` needs `&mut self`. Input values must be converted with `Value::into_dyn()` before being passed; use the `SessionInputValue::Owned(value.into_dyn())` pattern. `try_extract_tensor` returns a `(&Shape, &[T])` tuple. `from_shape_vec` needs `[usize; N]`, not `Vec<usize>`.
+
+Config lives at `~/.config/mouth/config.yaml` (Linux/macOS) or `%APPDATA%\mouth\config.yaml` (Windows). Models cached via HuggingFace Hub standard cache (`~/.cache/huggingface/hub/`).
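The config locations above could be resolved roughly as follows. This is a sketch only: `config_path` and `default_config_path` are hypothetical names, and reading `HOME` and `APPDATA` directly from the environment is an assumption; the project may use a helper crate instead.

```rust
use std::path::PathBuf;

/// Builds the config path from explicit inputs so the logic is testable.
/// `home` and `appdata` mirror the HOME and APPDATA environment variables.
pub fn config_path(
    is_windows: bool,
    home: Option<&str>,
    appdata: Option<&str>,
) -> Option<PathBuf> {
    if is_windows {
        // %APPDATA%\mouth\config.yaml
        appdata.map(|base| PathBuf::from(base).join("mouth").join("config.yaml"))
    } else {
        // ~/.config/mouth/config.yaml
        home.map(|base| {
            PathBuf::from(base)
                .join(".config")
                .join("mouth")
                .join("config.yaml")
        })
    }
}

/// Resolves the path for the current platform from the process environment.
/// Returns None when the relevant variable is unset.
pub fn default_config_path() -> Option<PathBuf> {
    let home = std::env::var("HOME").ok();
    let appdata = std::env::var("APPDATA").ok();
    config_path(cfg!(windows), home.as_deref(), appdata.as_deref())
}
```

Passing the platform flag and base directories as arguments keeps the path logic testable on any host, which helps when developing on Linux for a Windows target.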
+
+## Cross-Compilation
+
+Developing on Ubuntu 24.04, targeting Windows:
+```bash
+cargo build --target x86_64-pc-windows-gnu
+```
+
+## System Dependencies (Ubuntu)
+
+```bash
+sudo apt-get install libssl-dev libasound2-dev libpulse-dev libx11-dev libxcb-shape0-dev libxcb-xfixes0-dev libxkbcommon-dev libwayland-dev libgtk-3-dev libxtst-dev libxdo-dev cmake
+```