Mouth — Implementation Plan
Overview
Mouth is a single-binary, offline speech-to-text tool for Windows (with Linux/macOS support where possible). Press a hotkey, speak, and transcribed text is pasted at your cursor. Configured entirely via YAML.
Architecture
┌─────────────┐     ┌───────────┐     ┌─────────────┐     ┌────────────┐
│   Hotkey    │────▶│ Recorder  │────▶│ Transcriber │────▶│   Paste    │
│  Listener   │     │  (cpal)   │     │ (ort/ONNX)  │     │  (enigo)   │
│   (rdev)    │     │           │     │             │     │            │
└─────────────┘     └───────────┘     └─────────────┘     └────────────┘
       │                  │                  │                  │
       │                  ▼                  │                  │
       │            ┌───────────┐            │                  │
       │            │    VAD    │            │                  │
       │            │ (silero)  │            │                  │
       │            └───────────┘            │                  │
       │                                     │                  │
       ▼                                     ▼                  ▼
┌──────────────────────────────────────────────────────────────────────┐
│                           Overlay (winit)                            │
│            State: idle → recording → transcribing → done             │
└──────────────────────────────────────────────────────────────────────┘
Component Communication
All components communicate via channels (std::sync::mpsc or tokio::sync). The main thread owns the overlay window (required by most windowing systems). A coordinator task receives events from hotkey/recorder/transcriber and drives state transitions.
HotkeyEvent(Pressed/Released) ──┐
AudioReady(Vec<f32>) ───────────┼──▶ Coordinator ──▶ OverlayState
TranscriptionDone(String) ──────┘                ──▶ PasteAction
CancelRequested ────────────────┘
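A minimal sketch of this fan-in, using only `std::sync::mpsc`. The `Event` variants follow the diagram above; `KeyAction`, `run_coordinator`, and the reduced `OverlayState` enum are hypothetical names chosen for illustration, and the mapping from event to overlay state is a simplification of the full state machine in Phase 9.

```rust
use std::sync::mpsc::Receiver;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum KeyAction {
    Pressed,
    Released,
}

/// Messages sent to the coordinator (variant names follow the diagram).
#[derive(Debug)]
enum Event {
    Hotkey(KeyAction),
    AudioReady(Vec<f32>),
    TranscriptionDone(String),
    CancelRequested,
}

/// What the overlay should show (hypothetical, simplified).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OverlayState {
    Idle,
    Recording,
    Transcribing,
    Done,
}

/// Drains events until every sender has been dropped. Returns the overlay
/// states visited and the texts that would be handed to the paste step.
fn run_coordinator(events: Receiver<Event>) -> (Vec<OverlayState>, Vec<String>) {
    let mut overlay_states = vec![OverlayState::Idle];
    let mut paste_actions = Vec::new();
    for event in events {
        let next = match event {
            Event::Hotkey(KeyAction::Pressed) => OverlayState::Recording,
            Event::Hotkey(KeyAction::Released) | Event::AudioReady(_) => {
                OverlayState::Transcribing
            }
            Event::TranscriptionDone(text) => {
                paste_actions.push(text);
                OverlayState::Done
            }
            Event::CancelRequested => OverlayState::Idle,
        };
        // Only record a change when the overlay state actually differs.
        if overlay_states.last() != Some(&next) {
            overlay_states.push(next);
        }
    }
    (overlay_states, paste_actions)
}
```

Each producer (hotkey listener, recorder, transcriber) would hold its own clone of the `Sender<Event>`, so the coordinator needs only the single receiving end.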
Crate Dependencies
| Crate | Purpose | Notes |
|---|---|---|
| `rdev` | Global hotkey capture | Cross-platform key events, no focus required |
| `cpal` | Audio capture | Cross-platform mic input |
| `rubato` | Audio resampling | Resample to 16kHz for Parakeet |
| `ort` | ONNX Runtime | Run Parakeet v3 + Silero VAD |
| `hf-hub` | Model download | Download from HuggingFace, standard cache dir |
| `enigo` | Keyboard simulation | Simulate Ctrl+V, Shift+Insert, etc. |
| `arboard` | Clipboard access | Read/write clipboard, save/restore |
| `winit` | Windowing | Minimal overlay window |
| `softbuffer` | Pixel rendering | Draw coloured overlay (no GPU needed for overlay) |
| `serde` + `serde_yaml` | Config | Deserialize YAML config |
| `clap` | CLI | Subcommands: run, config, models |
| `dialoguer` | Interactive TUI | `mouth config` interactive setup |
| `rodio` | Audio playback | Blip up/down sounds |
| `indicatif` | Progress bars | Model download progress |
| `dirs` | Platform dirs | Config/cache paths |
| `tracing` | Logging | Structured logging |
Config File
Location: ~/.config/mouth/config.yaml (Linux/macOS), %APPDATA%\mouth\config.yaml (Windows)
# Hotkey to activate recording
hotkey: "ctrl+space"
# Recording mode: push_to_talk or toggle
mode: push_to_talk
# Cancel hotkey (only active while recording)
cancel_key: "escape"
# Speech-to-text model
model: "parakeet-tdt-0.6b-v3"
# Inference accelerator: auto, cpu, cuda, directml
accelerator: auto
# GPU device index (only used when accelerator is cuda/directml)
gpu_device: 0
# How to paste text
paste_method: ctrl_v # ctrl_v | shift_insert | ctrl_shift_v | clipboard_only
# Also keep transcribed text on clipboard after pasting
copy_to_clipboard: true
# Overlay position on screen
overlay_position: top # top | bottom | none
# Audio feedback
audio_feedback: true
# Audio input device (null = system default)
input_device: null
# VAD: trim silence from audio before transcription
vad_enabled: true
# Language (for model hint, if supported)
language: en
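A rough sketch of how this file could map onto a Rust struct. It assumes the values shown above are the defaults (the plan does not say so explicitly), keeps enum-like fields as strings for brevity, and omits the serde derives so that it compiles without dependencies. `Config::validate` and `check` are hypothetical helpers.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Config {
    hotkey: String,
    mode: String,
    cancel_key: String,
    model: String,
    accelerator: String,
    gpu_device: u32,
    paste_method: String,
    copy_to_clipboard: bool,
    overlay_position: String,
    audio_feedback: bool,
    input_device: Option<String>, // None = system default
    vad_enabled: bool,
    language: String,
}

impl Default for Config {
    // Assumption: the example config above lists the default values.
    fn default() -> Self {
        Self {
            hotkey: "ctrl+space".into(),
            mode: "push_to_talk".into(),
            cancel_key: "escape".into(),
            model: "parakeet-tdt-0.6b-v3".into(),
            accelerator: "auto".into(),
            gpu_device: 0,
            paste_method: "ctrl_v".into(),
            copy_to_clipboard: true,
            overlay_position: "top".into(),
            audio_feedback: true,
            input_device: None,
            vad_enabled: true,
            language: "en".into(),
        }
    }
}

/// Rejects a value that is not in the allowed set for its field.
fn check(field: &str, value: &str, allowed: &[&str]) -> Result<(), String> {
    if allowed.contains(&value) {
        Ok(())
    } else {
        Err(format!("{field}: {value:?} is not one of {allowed:?}"))
    }
}

impl Config {
    /// Validates the enum-like fields against the options listed in the
    /// comments of the example config.
    fn validate(&self) -> Result<(), String> {
        check("mode", &self.mode, &["push_to_talk", "toggle"])?;
        check("accelerator", &self.accelerator, &["auto", "cpu", "cuda", "directml"])?;
        check(
            "paste_method",
            &self.paste_method,
            &["ctrl_v", "shift_insert", "ctrl_shift_v", "clipboard_only"],
        )?;
        check("overlay_position", &self.overlay_position, &["top", "bottom", "none"])
    }
}
```

In the real crate these fields would more likely be serde-deserialized enums, which would make `validate` unnecessary.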
CLI Interface
mouth run               # Start the daemon (default if no subcommand)
mouth config            # Interactive TUI to edit config
mouth config --show     # Print current config to stdout
mouth config --reset    # Reset config to defaults
mouth models            # List available/downloaded models
mouth models download   # Download configured model (if not cached)
mouth status            # Show daemon status, loaded model, app version
Implementation Phases
Phase 1: Project Skeleton + Config
- Cargo.toml with all dependencies
- Config struct with serde, defaults, load/save
- CLI with clap (run, config, models subcommands)
- `mouth config` interactive TUI with dialoguer
- Platform-aware config/cache directory resolution
Phase 2: Hotkey Listener
- Global hotkey capture using rdev
- Support configurable key combinations (parse from string like "ctrl+space")
- Push-to-talk mode: record on press, stop on release
- Toggle mode: start on first press, stop on second press
- Cancel on Escape while recording
- Debounce rapid key events (~30ms)
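Parsing the configured key combination could look roughly like the sketch below. The `Hotkey` struct, `parse_hotkey`, and the accepted modifier aliases are assumptions made for illustration; mapping the resulting key name onto rdev's key type is left out.

```rust
/// Modifier flags plus one main key, parsed from a string such as "ctrl+space".
#[derive(Debug, Default, Clone, PartialEq, Eq)]
struct Hotkey {
    ctrl: bool,
    shift: bool,
    alt: bool,
    meta: bool,
    key: String,
}

/// Splits on '+', treats known names as modifiers, and requires exactly one
/// non-modifier key. Matching is case-insensitive and ignores whitespace.
fn parse_hotkey(spec: &str) -> Result<Hotkey, String> {
    let mut hotkey = Hotkey::default();
    for part in spec.split('+') {
        let part = part.trim().to_ascii_lowercase();
        match part.as_str() {
            "" => return Err(format!("empty key name in {spec:?}")),
            "ctrl" | "control" => hotkey.ctrl = true,
            "shift" => hotkey.shift = true,
            "alt" => hotkey.alt = true,
            "meta" | "super" | "win" | "cmd" => hotkey.meta = true,
            key if hotkey.key.is_empty() => hotkey.key = key.to_string(),
            key => {
                return Err(format!(
                    "more than one non-modifier key: {:?} and {key:?}",
                    hotkey.key
                ))
            }
        }
    }
    if hotkey.key.is_empty() {
        return Err(format!("no non-modifier key in {spec:?}"));
    }
    Ok(hotkey)
}
```

The same parser would also cover `cancel_key`, since a bare `"escape"` parses as a hotkey with no modifiers.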
Phase 3: Audio Capture + VAD
- Open mic input via cpal (default device or configured)
- Convert to f32 mono
- Resample to 16kHz via rubato
- Buffer audio chunks during recording
- Run Silero VAD to trim leading/trailing silence
- Produce final `Vec<f32>` of clean speech at 16kHz
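Two of these steps can be sketched without external crates: the mono downmix, and silence trimming. Note that the trimming below uses a plain RMS energy threshold as a stand-in for Silero VAD, which the plan actually calls for. The function names, window size, and threshold are assumptions for illustration only; resampling via rubato is not shown.

```rust
/// Averages interleaved frames (e.g. L R L R ...) down to a single channel.
fn downmix_to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be non-zero");
    interleaved
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

/// Root-mean-square level of one window of samples.
fn rms(window: &[f32]) -> f32 {
    (window.iter().map(|s| s * s).sum::<f32>() / window.len() as f32).sqrt()
}

/// Drops leading and trailing windows whose RMS is below `threshold`.
/// Energy-based stand-in for a real VAD; returns an empty slice when no
/// window reaches the threshold.
fn trim_silence(samples: &[f32], window: usize, threshold: f32) -> &[f32] {
    assert!(window > 0, "window size must be non-zero");
    let first = samples.chunks(window).position(|w| rms(w) >= threshold);
    let last = samples.chunks(window).rposition(|w| rms(w) >= threshold);
    match (first, last) {
        (Some(first), Some(last)) => {
            let start = first * window;
            let end = ((last + 1) * window).min(samples.len());
            &samples[start..end]
        }
        _ => &samples[0..0],
    }
}
```

Silence inside the utterance is kept on purpose: only the leading and trailing windows are removed, matching the "trim leading/trailing silence" step.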
Phase 4: Model Management
- Use hf-hub to download Parakeet v3 ONNX model from HuggingFace
- Store in standard HF cache (`~/.cache/huggingface/hub/`)
- Show download progress with indicatif
- `mouth models` command to list/download models
- Auto-download on first run if model not cached
Phase 5: Transcription
- Load Parakeet v3 ONNX model via ort
- Auto-detect GPU (DirectML on Windows, CUDA if available, CPU fallback)
- Respect accelerator override from config
- Run inference on captured audio
- Return transcribed text string
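The accelerator selection can be expressed as a small pure function, sketched below. The preference order under `auto` (DirectML, then CUDA, then CPU) and the choice to return an error when an explicitly requested backend is missing are assumptions; the plan only lists the candidates. `Available` is a hypothetical probe result, and registering the chosen execution provider with ort is not shown.

```rust
/// Value of the `accelerator` config key.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Accelerator {
    Auto,
    Cpu,
    Cuda,
    DirectMl,
}

/// Backend that inference will actually run on.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Cpu,
    Cuda,
    DirectMl,
}

/// Which GPU backends were detected on this machine (hypothetical probe).
#[derive(Debug, Clone, Copy)]
struct Available {
    cuda: bool,
    directml: bool,
}

fn resolve_backend(requested: Accelerator, available: Available) -> Result<Backend, String> {
    match requested {
        Accelerator::Cpu => Ok(Backend::Cpu),
        Accelerator::Cuda if available.cuda => Ok(Backend::Cuda),
        Accelerator::Cuda => Err("accelerator 'cuda' requested but not available".into()),
        Accelerator::DirectMl if available.directml => Ok(Backend::DirectMl),
        Accelerator::DirectMl => Err("accelerator 'directml' requested but not available".into()),
        // Assumed preference order for auto-detection; CPU is the fallback.
        Accelerator::Auto => Ok(if available.directml {
            Backend::DirectMl
        } else if available.cuda {
            Backend::Cuda
        } else {
            Backend::Cpu
        }),
    }
}
```

Keeping the decision separate from the runtime setup makes it testable without a GPU.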
Phase 6: Overlay
- Create a small always-on-top window using winit
- Render with softbuffer (simple coloured rectangle + text)
- States and colours:
  - Recording: red pulsing indicator
  - Transcribing: amber/yellow
  - Done: brief green flash, then hide
  - Error: brief red flash with error hint
- Window flags (Windows): `WS_EX_TOPMOST | WS_EX_TOOLWINDOW | WS_EX_NOACTIVATE`
- Position: centered horizontally at top or bottom of current monitor
- No focus steal, no taskbar entry
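The positioning rule reduces to a little arithmetic on the monitor rectangle. In the sketch below, `Rect`, `overlay_origin`, and the `margin` parameter are hypothetical; obtaining the current monitor's geometry from winit is not shown.

```rust
/// Value of the `overlay_position` config key ("none" maps to Hidden).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OverlayPosition {
    Top,
    Bottom,
    Hidden,
}

/// Monitor rectangle in virtual-desktop pixel coordinates.
#[derive(Debug, Clone, Copy)]
struct Rect {
    x: i32,
    y: i32,
    width: u32,
    height: u32,
}

/// Top-left corner for the overlay: centered horizontally on the monitor,
/// `margin` pixels from its top or bottom edge. None means "do not show".
fn overlay_origin(
    monitor: Rect,
    overlay_width: u32,
    overlay_height: u32,
    position: OverlayPosition,
    margin: i32,
) -> Option<(i32, i32)> {
    let x = monitor.x + (monitor.width as i32 - overlay_width as i32) / 2;
    let y = match position {
        OverlayPosition::Top => monitor.y + margin,
        OverlayPosition::Bottom => {
            monitor.y + monitor.height as i32 - overlay_height as i32 - margin
        }
        OverlayPosition::Hidden => return None,
    };
    Some((x, y))
}
```

Including the monitor's own `x`/`y` offset matters on multi-monitor setups, where the current monitor may not start at the origin.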
Phase 7: Paste System
- Save current clipboard content (if preserving)
- Set transcribed text to clipboard via arboard
- Simulate keypress via enigo based on paste_method:
  - `ctrl_v`: Ctrl+V (Cmd+V on macOS)
  - `shift_insert`: Shift+Insert
  - `ctrl_shift_v`: Ctrl+Shift+V
  - `clipboard_only`: no keypress, just clipboard
- Restore previous clipboard content (unless copy_to_clipboard is true)
- Small delay between clipboard set and paste simulation (~50ms)
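The mapping from `paste_method` to a key chord might be sketched as follows. `Key` here is a hypothetical stand-in enum, not enigo's key type, and `from_config` and `chord` are illustrative names; the clipboard handling and the actual key simulation are not shown.

```rust
/// Value of the `paste_method` config key.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PasteMethod {
    CtrlV,
    ShiftInsert,
    CtrlShiftV,
    ClipboardOnly,
}

/// Stand-in for the key type of the keyboard simulation library.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Key {
    Control,
    Shift,
    Meta,
    Insert,
    V,
}

impl PasteMethod {
    /// Parses the config string; unknown values yield None.
    fn from_config(value: &str) -> Option<Self> {
        match value {
            "ctrl_v" => Some(Self::CtrlV),
            "shift_insert" => Some(Self::ShiftInsert),
            "ctrl_shift_v" => Some(Self::CtrlShiftV),
            "clipboard_only" => Some(Self::ClipboardOnly),
            _ => None,
        }
    }

    /// Keys to press in order, modifiers first. Empty for clipboard_only.
    fn chord(self, is_macos: bool) -> Vec<Key> {
        match self {
            // ctrl_v becomes Cmd+V on macOS, as noted in the list above.
            Self::CtrlV if is_macos => vec![Key::Meta, Key::V],
            Self::CtrlV => vec![Key::Control, Key::V],
            Self::ShiftInsert => vec![Key::Shift, Key::Insert],
            Self::CtrlShiftV => vec![Key::Control, Key::Shift, Key::V],
            Self::ClipboardOnly => Vec::new(),
        }
    }
}
```

A caller would press the returned keys in order and release them in reverse, passing `cfg!(target_os = "macos")` for `is_macos`.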
Phase 8: Audio Feedback
- Bundle two short PCM blip sounds in the binary (via `include_bytes!`)
- "Blip up" on recording start
- "Blip down" on recording stop / transcription complete
- Play via rodio on a separate thread (non-blocking)
- Respect audio_feedback config flag
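Before playback, the bundled bytes have to be turned into samples. The sketch below assumes the `.pcm` files hold raw signed 16-bit little-endian mono samples; the plan does not specify the sample format, so this is a guess, and `pcm16le_to_f32` is a hypothetical helper. Handing the samples to rodio is not shown.

```rust
/// Decodes raw signed 16-bit little-endian PCM into f32 samples in [-1.0, 1.0).
/// A trailing odd byte, if any, is ignored.
fn pcm16le_to_f32(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(2)
        .map(|pair| i16::from_le_bytes([pair[0], pair[1]]) as f32 / 32768.0)
        .collect()
}

// In the real binary the input would come from something like
// `include_bytes!("../resources/blip_up.pcm")`, with the path depending on
// where the calling source file lives.
```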
Phase 9: Coordinator + Integration
- Wire all components together with channel-based message passing
- Main thread: overlay window event loop (winit requires this)
- Spawned threads/tasks: hotkey listener, audio recorder, transcriber
- Coordinator receives events, drives state machine:
  Idle ──[hotkey press]──▶ Recording
  Recording ──[hotkey release/press]──▶ Transcribing
  Recording ──[cancel]──▶ Idle
  Transcribing ──[result]──▶ Pasting ──▶ Idle
  Transcribing ──[error]──▶ Error ──▶ Idle
- Graceful shutdown on SIGINT / tray quit
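The state machine in this phase can be sketched as a pure transition function. The `Trigger` names are assumptions; in particular `Finished` labels the unlabelled arrows back to Idle, and "release/press" is read as release in push-to-talk mode and a second press in toggle mode. Unlisted combinations leave the state unchanged.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Idle,
    Recording,
    Transcribing,
    Pasting,
    Error,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Trigger {
    HotkeyPress,
    HotkeyRelease,
    Cancel,
    Transcribed, // transcription result arrived
    Failure,     // transcription failed
    Finished,    // paste done, or error flash has been shown
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    PushToTalk,
    Toggle,
}

/// Returns the next state; triggers that do not apply are ignored.
fn next_state(state: State, trigger: Trigger, mode: Mode) -> State {
    match (state, trigger, mode) {
        (State::Idle, Trigger::HotkeyPress, _) => State::Recording,
        (State::Recording, Trigger::HotkeyRelease, Mode::PushToTalk) => State::Transcribing,
        (State::Recording, Trigger::HotkeyPress, Mode::Toggle) => State::Transcribing,
        (State::Recording, Trigger::Cancel, _) => State::Idle,
        (State::Transcribing, Trigger::Transcribed, _) => State::Pasting,
        (State::Transcribing, Trigger::Failure, _) => State::Error,
        (State::Pasting | State::Error, Trigger::Finished, _) => State::Idle,
        (current, _, _) => current,
    }
}
```

Keeping the transition table free of I/O lets the coordinator be unit-tested without a microphone, model, or window.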
Phase 10: Daemon IPC + Status
- The running daemon listens on a local Unix domain socket (Linux/macOS) or named pipe (Windows) for status queries
- Socket/pipe path: `/tmp/mouth.sock` (Linux/macOS), `\\.\pipe\mouth` (Windows)
- `mouth status` connects and requests current state; daemon responds with JSON:
  { "version": "0.1.0", "state": "idle", "model": "parakeet-tdt-0.6b-v3", "accelerator": "directml", "uptime_secs": 3420 }
- If the daemon is not running, `mouth status` reports "Mouth is not running" and exits with code 1
- Also used internally to prevent launching a second daemon instance (lock check)
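A dependency-free sketch of the status payload and the endpoint path. The JSON is hand-rolled here only to keep the example self-contained; a real implementation would more likely use a JSON library. `Status`, `escape_json`, and `ipc_path` are hypothetical names, and the socket or pipe I/O itself is not shown.

```rust
/// Fields of the status response, as listed in the example JSON above.
struct Status {
    version: String,
    state: String,
    model: String,
    accelerator: String,
    uptime_secs: u64,
}

/// Minimal JSON string escaping: quotes, backslashes, and control characters.
fn escape_json(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    for c in raw.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

impl Status {
    /// Serializes to compact JSON with the keys in the order shown above.
    fn to_json(&self) -> String {
        format!(
            r#"{{"version":"{}","state":"{}","model":"{}","accelerator":"{}","uptime_secs":{}}}"#,
            escape_json(&self.version),
            escape_json(&self.state),
            escape_json(&self.model),
            escape_json(&self.accelerator),
            self.uptime_secs
        )
    }
}

/// Endpoint path from the plan: named pipe on Windows, socket file elsewhere.
fn ipc_path() -> &'static str {
    if cfg!(windows) {
        r"\\.\pipe\mouth"
    } else {
        "/tmp/mouth.sock"
    }
}
```

Because the same endpoint doubles as the single-instance lock, a failed connection to `ipc_path()` would be the signal that no daemon is running.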
Phase 11: Polish + Distribution
- Error handling: user-friendly messages for common failures (no mic, model not found, etc.)
- Windows installer via `cargo-wix` or distribute as standalone .exe
- Test on Windows 10/11 primarily
- Test on Linux (X11 + Wayland) and macOS as secondary
- Update CLAUDE.md with build/run/test instructions
- Write user-facing README with setup instructions
Risks & Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Parakeet v3 ONNX model compatibility with ort | Blocks core functionality | Test early in Phase 5; Parakeet v2 as fallback |
| rdev hotkey reliability on Windows | Broken UX | Test early in Phase 2; fallback to Win32 RegisterHotKey |
| Overlay focus stealing | Annoying | Use proper window flags; test with various foreground apps |
| Audio resampling quality | Poor transcription | Use rubato SincInterpolation (high quality) |
| Binary size with bundled ONNX Runtime | Large download | ONNX Runtime is ~20-40MB; acceptable for a single-binary tool |
| winit event loop blocking | Unresponsive | All heavy work on background threads; overlay is lightweight |
File Structure
mouth/
├── Cargo.toml
├── CLAUDE.md
├── README.md
├── plan.md
├── config.yaml.example
├── resources/
│   ├── blip_up.pcm           # bundled audio feedback
│   └── blip_down.pcm
└── src/
    ├── main.rs               # CLI entry, clap setup
    ├── config.rs             # Config struct, YAML load/save, defaults
    ├── hotkey.rs             # Global hotkey listener (rdev)
    ├── recorder.rs           # Audio capture (cpal + rubato + VAD)
    ├── vad.rs                # Silero VAD wrapper
    ├── transcriber.rs        # ONNX inference, model loading, GPU detection
    ├── model_cache.rs        # HuggingFace download, cache management
    ├── overlay.rs            # Minimal overlay window (winit + softbuffer)
    ├── paste.rs              # Clipboard + paste simulation
    ├── audio_feedback.rs     # Blip sounds via rodio
    ├── coordinator.rs        # State machine, channel hub
    └── cli/
        ├── mod.rs
        ├── run.rs            # `mouth run` handler
        ├── config_cmd.rs     # `mouth config` TUI
        ├── models_cmd.rs     # `mouth models` handler
        └── status_cmd.rs     # `mouth status` handler
Not In Scope (v1)
- LLM post-processing of transcriptions
- Transcription history / database
- Multiple model support (v1 is Parakeet v3 only, architecture supports adding more later)
- Auto-submit (Enter after paste)
- Multi-language UI
- Tray icon / system tray integration
- Translate-to-English mode