# Mouth — Implementation Plan

## Overview

Mouth is a single-binary, offline speech-to-text tool for Windows (with Linux/macOS support where possible). Press a hotkey, speak, and the transcribed text is pasted at your cursor. Configured entirely via YAML.

## Architecture

```
┌─────────────┐     ┌───────────┐     ┌─────────────┐     ┌────────────┐
│   Hotkey    │────▶│ Recorder  │────▶│ Transcriber │────▶│   Paste    │
│  Listener   │     │  (cpal)   │     │ (ort/ONNX)  │     │  (enigo)   │
│   (rdev)    │     │           │     │             │     │            │
└─────────────┘     └───────────┘     └─────────────┘     └────────────┘
       │                  │                  │                  │
       │                  ▼                  │                  │
       │            ┌───────────┐            │                  │
       │            │    VAD    │            │                  │
       │            │ (silero)  │            │                  │
       │            └───────────┘            │                  │
       │                                     │                  │
       ▼                                     ▼                  ▼
┌──────────────────────────────────────────────────────────────────────┐
│                           Overlay (winit)                            │
│            State: idle → recording → transcribing → done             │
└──────────────────────────────────────────────────────────────────────┘
```

### Component Communication

All components communicate via channels (`std::sync::mpsc` or `tokio::sync`). The main thread owns the overlay window (required by most windowing systems). A coordinator task receives events from the hotkey listener, recorder, and transcriber, and drives state transitions.

```
HotkeyEvent(Pressed/Released) ──┐
AudioReady(Vec<f32>) ───────────┼──▶ Coordinator ──▶ OverlayState
TranscriptionDone(String) ──────┤                ──▶ PasteAction
CancelRequested ────────────────┘
```

## Crate Dependencies

| Crate | Purpose | Notes |
|-------|---------|-------|
| `rdev` | Global hotkey capture | Cross-platform key events, no focus required |
| `cpal` | Audio capture | Cross-platform mic input |
| `rubato` | Audio resampling | Resample to 16kHz for Parakeet |
| `ort` | ONNX Runtime | Run Parakeet v3 + Silero VAD |
| `hf-hub` | Model download | Download from HuggingFace, standard cache dir |
| `enigo` | Keyboard simulation | Simulate Ctrl+V, Shift+Insert, etc. |
| `arboard` | Clipboard access | Read/write clipboard, save/restore |
| `winit` | Windowing | Minimal overlay window |
| `softbuffer` | Pixel rendering | Draw coloured overlay (no GPU needed for overlay) |
| `serde` + `serde_yaml` | Config | Deserialize YAML config |
| `clap` | CLI | Subcommands: `run`, `config`, `models` |
| `dialoguer` | Interactive TUI | `mouth config` interactive setup |
| `rodio` | Audio playback | Blip up/down sounds |
| `indicatif` | Progress bars | Model download progress |
| `dirs` | Platform dirs | Config/cache paths |
| `tracing` | Logging | Structured logging |

## Config File

Location: `~/.config/mouth/config.yaml` (Linux/macOS), `%APPDATA%\mouth\config.yaml` (Windows)

```yaml
# Hotkey to activate recording
hotkey: "ctrl+space"

# Recording mode: push_to_talk or toggle
mode: push_to_talk

# Cancel hotkey (only active while recording)
cancel_key: "escape"

# Speech-to-text model
model: "parakeet-tdt-0.6b-v3"

# Inference accelerator: auto, cpu, cuda, directml
accelerator: auto

# GPU device index (only used when accelerator is cuda/directml)
gpu_device: 0

# How to paste text
paste_method: ctrl_v  # ctrl_v | shift_insert | ctrl_shift_v | clipboard_only

# Also keep transcribed text on clipboard after pasting
copy_to_clipboard: true

# Overlay position on screen
overlay_position: top  # top | bottom | none

# Audio feedback
audio_feedback: true

# Audio input device (null = system default)
input_device: null

# VAD: trim silence from audio before transcription
vad_enabled: true

# Language (for model hint, if supported)
language: en
```

## CLI Interface

```
mouth run               # Start the daemon (default if no subcommand)
mouth config            # Interactive TUI to edit config
mouth config --show     # Print current config to stdout
mouth config --reset    # Reset config to defaults
mouth models            # List available/downloaded models
mouth models download   # Download configured model (if not cached)
mouth status            # Show daemon status, loaded model, app version
```

## Implementation Phases

### Phase 1: Project Skeleton + Config

- Cargo.toml with all dependencies
- Config struct with serde, defaults, load/save
- CLI with clap (run, config, models subcommands)
- `mouth config` interactive TUI with dialoguer
- Platform-aware config/cache directory resolution

### Phase 2: Hotkey Listener

- Global hotkey capture using rdev
- Support configurable key combinations (parse from string like "ctrl+space")
- Push-to-talk mode: record on press, stop on release
- Toggle mode: start on first press, stop on second press
- Cancel on the configured `cancel_key` (Escape by default) while recording
- Debounce rapid key events (~30ms)

### Phase 3: Audio Capture + VAD

- Open mic input via cpal (default device or configured)
- Convert to f32 mono
- Resample to 16kHz via rubato
- Buffer audio chunks during recording
- Run Silero VAD to trim leading/trailing silence
- Produce final `Vec<f32>` of clean speech at 16kHz

### Phase 4: Model Management

- Use hf-hub to download Parakeet v3 ONNX model from HuggingFace
- Store in standard HF cache (`~/.cache/huggingface/hub/`)
- Show download progress with indicatif
- `mouth models` command to list/download models
- Auto-download on first run if model not cached

### Phase 5: Transcription

- Load Parakeet v3 ONNX model via ort
- Auto-detect GPU (DirectML on Windows, CUDA if available, CPU fallback)
- Respect accelerator override from config
- Run inference on captured audio
- Return transcribed text string

### Phase 6: Overlay

- Create a small always-on-top window using winit
- Render with softbuffer (simple coloured rectangle + text)
- States and colours:
  - Recording: red pulsing indicator
  - Transcribing: amber/yellow
  - Done: brief green flash, then hide
  - Error: brief red flash with error hint
- Window flags (Windows): `WS_EX_TOPMOST | WS_EX_TOOLWINDOW | WS_EX_NOACTIVATE`
- Position: centered horizontally at top or bottom of current monitor
- No focus steal, no taskbar entry

### Phase 7: Paste System

- Save current clipboard content (when `copy_to_clipboard` is false, so it can be restored afterwards)
- Set transcribed text to clipboard via arboard
- Simulate keypress via enigo based on paste_method:
  - `ctrl_v`: Ctrl+V (Cmd+V on macOS)
  - `shift_insert`: Shift+Insert
  - `ctrl_shift_v`: Ctrl+Shift+V
  - `clipboard_only`: no keypress, just clipboard
- Restore previous clipboard content (unless copy_to_clipboard is true)
- Small delay between clipboard set and paste simulation (~50ms)

### Phase 8: Audio Feedback

- Bundle two short PCM blip sounds in the binary (via `include_bytes!`)
- "Blip up" on recording start
- "Blip down" on recording stop / transcription complete
- Play via rodio on a separate thread (non-blocking)
- Respect audio_feedback config flag

### Phase 9: Coordinator + Integration

- Wire all components together with channel-based message passing
- Main thread: overlay window event loop (winit requires this)
- Spawned threads/tasks: hotkey listener, audio recorder, transcriber
- Coordinator receives events, drives state machine:

  ```
  Idle ──[hotkey press]──▶ Recording
  Recording ──[hotkey release/press]──▶ Transcribing
  Recording ──[cancel]──▶ Idle
  Transcribing ──[result]──▶ Pasting ──▶ Idle
  Transcribing ──[error]──▶ Error ──▶ Idle
  ```

- Graceful shutdown on SIGINT (a tray quit action is out of scope for v1)

### Phase 10: Daemon IPC + Status

- The running daemon listens on a local Unix domain socket (Linux/macOS) or named pipe (Windows) for status queries
- Socket/pipe path: `/tmp/mouth.sock` (Linux/macOS), `\\.\pipe\mouth` (Windows)
- `mouth status` connects and requests current state; daemon responds with JSON:

  ```json
  {
    "version": "0.1.0",
    "state": "idle",
    "model": "parakeet-tdt-0.6b-v3",
    "accelerator": "directml",
    "uptime_secs": 3420
  }
  ```

- If the daemon is not running, `mouth status` reports "Mouth is not running" and exits with code 1
- Also used internally to prevent launching a second daemon instance (lock check)

### Phase 11: Polish + Distribution

- Error handling: user-friendly messages for common failures (no mic, model not found, etc.)
- Windows installer via `cargo-wix` or distribute as standalone .exe
- Test on Windows 10/11 primarily
- Test on Linux (X11 + Wayland) and macOS as secondary
- Update CLAUDE.md with build/run/test instructions
- Write user-facing README with setup instructions

## Risks & Mitigations

| Risk | Impact | Mitigation |
|------|--------|------------|
| Parakeet v3 ONNX model compatibility with `ort` | Blocks core functionality | Test early in Phase 5; Parakeet v2 as fallback |
| `rdev` hotkey reliability on Windows | Broken UX | Test early in Phase 2; fallback to Win32 `RegisterHotKey` |
| Overlay focus stealing | Annoying | Use proper window flags; test with various foreground apps |
| Audio resampling quality | Poor transcription | Use rubato's sinc-interpolation resampler (high quality) |
| Binary size with bundled ONNX Runtime | Large download | ONNX Runtime is ~20-40MB; acceptable for a single-binary tool |
| winit event loop blocking | Unresponsive | All heavy work on background threads; overlay is lightweight |

## File Structure

```
mouth/
├── Cargo.toml
├── CLAUDE.md
├── README.md
├── plan.md
├── config.yaml.example
├── resources/
│   ├── blip_up.pcm           # bundled audio feedback
│   └── blip_down.pcm
└── src/
    ├── main.rs               # CLI entry, clap setup
    ├── config.rs             # Config struct, YAML load/save, defaults
    ├── hotkey.rs             # Global hotkey listener (rdev)
    ├── recorder.rs           # Audio capture (cpal + rubato + VAD)
    ├── vad.rs                # Silero VAD wrapper
    ├── transcriber.rs        # ONNX inference, model loading, GPU detection
    ├── model_cache.rs        # HuggingFace download, cache management
    ├── overlay.rs            # Minimal overlay window (winit + softbuffer)
    ├── paste.rs              # Clipboard + paste simulation
    ├── audio_feedback.rs     # Blip sounds via rodio
    ├── coordinator.rs        # State machine, channel hub
    └── cli/
        ├── mod.rs
        ├── run.rs            # `mouth run` handler
        ├── config_cmd.rs     # `mouth config` TUI
        ├── models_cmd.rs     # `mouth models` handler
        └── status_cmd.rs     # `mouth status` handler
```

## Not In Scope (v1)

- LLM post-processing of transcriptions
- Transcription history / database
- Multiple model support (v1 is Parakeet v3 only; the architecture supports adding more later)
- Auto-submit (Enter after paste)
- Multi-language UI
- Tray icon / system tray integration
- Translate-to-English mode
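
## Appendix: Coordinator State Machine Sketch

The Phase 9 state transitions could be written as a pure function, which would let the coordinator be unit-tested without any audio, windowing, or clipboard code. What follows is a minimal sketch under stated assumptions, not the planned implementation. The names `Mode`, `Event`, `State`, and `next_state` are hypothetical. The `Finished` event (signalling that pasting or the error flash has completed) is an assumption that does not appear in the plan. Events that do not apply to the current state are assumed to be ignored.

```rust
/// Recording mode, mirroring the `mode` key in the config file.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    PushToTalk,
    Toggle,
}

/// Events the coordinator receives over its channels.
/// `Finished` is an assumed extra event: paste or error flash completed.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Event {
    HotkeyPressed,
    HotkeyReleased,
    CancelRequested,
    TranscriptionDone(String),
    TranscriptionFailed(String),
    Finished,
}

/// Coordinator states, following the Phase 9 transition diagram.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum State {
    Idle,
    Recording,
    Transcribing,
    Pasting(String),
    Error(String),
}

/// Returns the state that follows `state` when `event` arrives.
/// Any (state, event) pair not listed in the plan leaves the state unchanged.
pub fn next_state(state: State, event: Event, mode: Mode) -> State {
    match (state, event, mode) {
        // Idle: a hotkey press starts recording in both modes.
        (State::Idle, Event::HotkeyPressed, _) => State::Recording,

        // Push-to-talk stops on release; toggle stops on the second press.
        (State::Recording, Event::HotkeyReleased, Mode::PushToTalk) => State::Transcribing,
        (State::Recording, Event::HotkeyPressed, Mode::Toggle) => State::Transcribing,

        // Cancel discards the recording.
        (State::Recording, Event::CancelRequested, _) => State::Idle,

        // Transcription either yields text to paste or an error to show.
        (State::Transcribing, Event::TranscriptionDone(text), _) => State::Pasting(text),
        (State::Transcribing, Event::TranscriptionFailed(msg), _) => State::Error(msg),

        // Both terminal display states fall back to Idle once finished.
        (State::Pasting(_), Event::Finished, _) => State::Idle,
        (State::Error(_), Event::Finished, _) => State::Idle,

        // Everything else (e.g. key auto-repeat while recording) is ignored.
        (state, _, _) => state,
    }
}
```

With this shape, the coordinator loop would only need to receive an `Event` from a channel, call `next_state`, and forward the result to the overlay and paste components. For instance, `next_state(State::Recording, Event::HotkeyReleased, Mode::Toggle)` stays in `State::Recording`, because toggle mode waits for a second press.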