# Mouth — Implementation Plan

## Overview

Mouth is a single-binary, offline speech-to-text tool for Windows (with Linux/macOS support where possible). Press a hotkey, speak, and transcribed text is pasted at your cursor. Configured entirely via YAML.

## Architecture

```
┌─────────────┐     ┌───────────┐     ┌─────────────┐     ┌────────────┐
│   Hotkey    │────▶│ Recorder  │────▶│ Transcriber │────▶│   Paste    │
│  Listener   │     │  (cpal)   │     │ (ort/ONNX)  │     │  (enigo)   │
│   (rdev)    │     │           │     │             │     │            │
└─────────────┘     └───────────┘     └─────────────┘     └────────────┘
       │                  │                  │                  │
       │                  ▼                  │                  │
       │            ┌───────────┐            │                  │
       │            │    VAD    │            │                  │
       │            │ (silero)  │            │                  │
       │            └───────────┘            │                  │
       │                                     │                  │
       ▼                                     ▼                  ▼
┌──────────────────────────────────────────────────────────────────────┐
│                           Overlay (winit)                            │
│  State: idle → recording → transcribing → done                       │
└──────────────────────────────────────────────────────────────────────┘
```

### Component Communication

All components communicate via channels (`std::sync::mpsc` or `tokio::sync`). The main thread owns the overlay window (required by most windowing systems). A coordinator task receives events from the hotkey listener, recorder, and transcriber, and drives state transitions.

```
HotkeyEvent(Pressed/Released) ──┐
AudioReady(Vec<f32>) ───────────┼──▶ Coordinator ──▶ OverlayState
TranscriptionDone(String) ──────┤                ──▶ PasteAction
CancelRequested ────────────────┘
```

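
The message flow above could be sketched with `std::sync::mpsc` as follows. This is a minimal illustration, not the planned implementation: the `Event` variants mirror the diagram, while `Action`, `run_coordinator`, and `demo` are hypothetical names introduced here, and the overlay states are plain strings for brevity.

```rust
use std::sync::mpsc;
use std::thread;

/// Events flowing into the coordinator (variants mirror the diagram).
#[derive(Debug)]
pub enum Event {
    HotkeyPressed,
    HotkeyReleased,
    AudioReady(Vec<f32>),
    TranscriptionDone(String),
    CancelRequested,
}

/// Hypothetical outputs: an overlay state change or a paste request.
#[derive(Debug, PartialEq)]
pub enum Action {
    Overlay(&'static str),
    Paste(String),
}

/// Drains the event channel until every sender is dropped.
pub fn run_coordinator(events: mpsc::Receiver<Event>, actions: mpsc::Sender<Action>) {
    for event in events {
        let action = match event {
            Event::HotkeyPressed => Action::Overlay("recording"),
            Event::HotkeyReleased => Action::Overlay("transcribing"),
            // A real coordinator would forward the audio to the transcriber.
            Event::AudioReady(_) => continue,
            Event::TranscriptionDone(text) => Action::Paste(text),
            Event::CancelRequested => Action::Overlay("idle"),
        };
        if actions.send(action).is_err() {
            break; // the receiving side has gone away
        }
    }
}

/// Runs the coordinator on a background thread and collects its output.
pub fn demo() -> Vec<Action> {
    let (event_tx, event_rx) = mpsc::channel();
    let (action_tx, action_rx) = mpsc::channel();
    let worker = thread::spawn(move || run_coordinator(event_rx, action_tx));

    event_tx.send(Event::HotkeyPressed).unwrap();
    event_tx.send(Event::HotkeyReleased).unwrap();
    event_tx.send(Event::TranscriptionDone("hello".to_string())).unwrap();
    drop(event_tx); // closing the channel lets the coordinator exit

    worker.join().unwrap();
    action_rx.iter().collect()
}
```

Keeping the coordinator off the main thread leaves the main thread free for the overlay event loop.
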
## Crate Dependencies

| Crate | Purpose | Notes |
|-------|---------|-------|
| `rdev` | Global hotkey capture | Cross-platform key events, no focus required |
| `cpal` | Audio capture | Cross-platform mic input |
| `rubato` | Audio resampling | Resample to 16 kHz for Parakeet |
| `ort` | ONNX Runtime | Run Parakeet v3 + Silero VAD |
| `hf-hub` | Model download | Download from HuggingFace, standard cache dir |
| `enigo` | Keyboard simulation | Simulate Ctrl+V, Shift+Insert, etc. |
| `arboard` | Clipboard access | Read/write clipboard, save/restore |
| `winit` | Windowing | Minimal overlay window |
| `softbuffer` | Pixel rendering | Draw coloured overlay (no GPU needed for overlay) |
| `serde` + `serde_yaml` | Config | Deserialize YAML config |
| `clap` | CLI | Subcommands: `run`, `config`, `models`, `status` |
| `dialoguer` | Interactive TUI | `mouth config` interactive setup |
| `rodio` | Audio playback | Blip up/down sounds |
| `indicatif` | Progress bars | Model download progress |
| `dirs` | Platform dirs | Config/cache paths |
| `tracing` | Logging | Structured logging |

## Config File

Location: `~/.config/mouth/config.yaml` (Linux/macOS), `%APPDATA%\mouth\config.yaml` (Windows)

```yaml
# Hotkey to activate recording
hotkey: "ctrl+space"

# Recording mode: push_to_talk or toggle
mode: push_to_talk

# Cancel hotkey (only active while recording)
cancel_key: "escape"

# Speech-to-text model
model: "parakeet-tdt-0.6b-v3"

# Inference accelerator: auto, cpu, cuda, directml
accelerator: auto

# GPU device index (only used when accelerator is cuda/directml)
gpu_device: 0

# How to paste text
paste_method: ctrl_v  # ctrl_v | shift_insert | ctrl_shift_v | clipboard_only

# Also keep transcribed text on clipboard after pasting
copy_to_clipboard: true

# Overlay position on screen
overlay_position: top  # top | bottom | none

# Audio feedback
audio_feedback: true

# Audio input device (null = system default)
input_device: null

# VAD: trim silence from audio before transcription
vad_enabled: true

# Language (for model hint, if supported)
language: en
```

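
As a rough sketch, the YAML above could map onto a Rust struct whose `Default` impl carries the same values. The type and variant names here are assumptions for illustration; in the real `config.rs` the types would presumably also derive serde's `Deserialize`/`Serialize`, which is left out to keep the example stdlib-only.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Mode { PushToTalk, Toggle }

#[derive(Debug, Clone, PartialEq)]
pub enum Accelerator { Auto, Cpu, Cuda, DirectMl }

#[derive(Debug, Clone, PartialEq)]
pub enum PasteMethod { CtrlV, ShiftInsert, CtrlShiftV, ClipboardOnly }

/// `Hidden` stands for the YAML value `none`.
#[derive(Debug, Clone, PartialEq)]
pub enum OverlayPosition { Top, Bottom, Hidden }

/// One field per key in config.yaml.
#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    pub hotkey: String,
    pub mode: Mode,
    pub cancel_key: String,
    pub model: String,
    pub accelerator: Accelerator,
    pub gpu_device: u32,
    pub paste_method: PasteMethod,
    pub copy_to_clipboard: bool,
    pub overlay_position: OverlayPosition,
    pub audio_feedback: bool,
    /// `None` means the system default input device (`null` in YAML).
    pub input_device: Option<String>,
    pub vad_enabled: bool,
    pub language: String,
}

impl Default for Config {
    /// Defaults copied from the example config above.
    fn default() -> Self {
        Config {
            hotkey: "ctrl+space".to_string(),
            mode: Mode::PushToTalk,
            cancel_key: "escape".to_string(),
            model: "parakeet-tdt-0.6b-v3".to_string(),
            accelerator: Accelerator::Auto,
            gpu_device: 0,
            paste_method: PasteMethod::CtrlV,
            copy_to_clipboard: true,
            overlay_position: OverlayPosition::Top,
            audio_feedback: true,
            input_device: None,
            vad_enabled: true,
            language: "en".to_string(),
        }
    }
}
```

Modelling the enumerated options as enums rather than strings means an invalid value can be rejected once, when the config is loaded.
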
## CLI Interface

```
mouth run                  # Start the daemon (default if no subcommand)
mouth config               # Interactive TUI to edit config
mouth config --show        # Print current config to stdout
mouth config --reset       # Reset config to defaults
mouth models               # List available/downloaded models
mouth models download      # Download configured model (if not cached)
mouth status               # Show daemon status, loaded model, app version
```

## Implementation Phases

### Phase 1: Project Skeleton + Config

- Cargo.toml with all dependencies
- Config struct with serde, defaults, load/save
- CLI with clap (run, config, models subcommands)
- `mouth config` interactive TUI with dialoguer
- Platform-aware config/cache directory resolution

### Phase 2: Hotkey Listener

- Global hotkey capture using rdev
- Support configurable key combinations (parse from string like "ctrl+space")
- Push-to-talk mode: record on press, stop on release
- Toggle mode: start on first press, stop on second press
- Cancel on Escape while recording
- Debounce rapid key events (~30ms)

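
A minimal sketch of the two pure pieces of this phase, hotkey-string parsing and debouncing, might look like the following. `Hotkey`, `parse_hotkey`, and `Debouncer` are hypothetical names, the accepted modifier spellings are an assumption, and mapping the parsed key name onto rdev key codes is deliberately left out.

```rust
use std::time::{Duration, Instant};

/// A parsed combination such as "ctrl+space".
#[derive(Debug, Default, PartialEq)]
pub struct Hotkey {
    pub ctrl: bool,
    pub shift: bool,
    pub alt: bool,
    pub meta: bool,
    pub key: String,
}

/// Parses "mod+mod+key". Exactly one non-modifier key is required.
pub fn parse_hotkey(spec: &str) -> Result<Hotkey, String> {
    let mut hotkey = Hotkey::default();
    for part in spec.split('+') {
        let part = part.trim().to_ascii_lowercase();
        match part.as_str() {
            "" => return Err(format!("empty key name in {spec:?}")),
            "ctrl" | "control" => hotkey.ctrl = true,
            "shift" => hotkey.shift = true,
            "alt" => hotkey.alt = true,
            "meta" | "super" | "win" | "cmd" => hotkey.meta = true,
            key => {
                if !hotkey.key.is_empty() {
                    return Err(format!("more than one non-modifier key in {spec:?}"));
                }
                hotkey.key = key.to_string();
            }
        }
    }
    if hotkey.key.is_empty() {
        return Err(format!("no non-modifier key in {spec:?}"));
    }
    Ok(hotkey)
}

/// Drops events that arrive within `window` of the last accepted one.
pub struct Debouncer {
    window: Duration,
    last_accepted: Option<Instant>,
}

impl Debouncer {
    pub fn new(window: Duration) -> Self {
        Debouncer { window, last_accepted: None }
    }

    /// Returns true if the event at `now` should be processed.
    pub fn accept(&mut self, now: Instant) -> bool {
        if let Some(last) = self.last_accepted {
            if now.duration_since(last) < self.window {
                return false;
            }
        }
        self.last_accepted = Some(now);
        true
    }
}
```

In push-to-talk mode a press and its release can legitimately arrive close together, so one `Debouncer` per event kind is probably safer than a single shared one.
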
### Phase 3: Audio Capture + VAD

- Open mic input via cpal (default device or configured)
- Convert to f32 mono
- Resample to 16kHz via rubato
- Buffer audio chunks during recording
- Run Silero VAD to trim leading/trailing silence
- Produce final `Vec<f32>` of clean speech at 16kHz

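
The mono conversion and silence-trimming steps can be illustrated without any audio crates. In this sketch the per-frame speech flags are an assumed stand-in for thresholded Silero VAD output, resampling is omitted, and both function names are hypothetical.

```rust
/// Averages interleaved multi-channel samples into one mono channel.
pub fn downmix_to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be non-zero");
    interleaved
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

/// Trims leading and trailing silence.
///
/// `frame_is_speech[i]` describes samples `[i * frame_len, (i + 1) * frame_len)`.
/// Silence between the first and last speech frame is kept.
pub fn trim_silence<'a>(
    samples: &'a [f32],
    frame_is_speech: &[bool],
    frame_len: usize,
) -> &'a [f32] {
    let first = match frame_is_speech.iter().position(|&speech| speech) {
        Some(index) => index,
        None => return &samples[..0], // no speech detected at all
    };
    // A speech frame exists, so searching from the back cannot fail.
    let last = frame_is_speech.iter().rposition(|&speech| speech).unwrap();

    let start = (first * frame_len).min(samples.len());
    let end = ((last + 1) * frame_len).min(samples.len());
    &samples[start..end]
}
```

Returning a sub-slice avoids copying the recording; the caller can call `.to_vec()` when the final `Vec<f32>` is needed.
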
### Phase 4: Model Management

- Use hf-hub to download Parakeet v3 ONNX model from HuggingFace
- Store in standard HF cache (`~/.cache/huggingface/hub/`)
- Show download progress with indicatif
- `mouth models` command to list/download models
- Auto-download on first run if model not cached

### Phase 5: Transcription

- Load Parakeet v3 ONNX model via ort
- Auto-detect GPU (DirectML on Windows, CUDA if available, CPU fallback)
- Respect accelerator override from config
- Run inference on captured audio
- Return transcribed text string

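
The accelerator selection logic could be kept separate from `ort` as a small pure function. The sketch below assumes a hypothetical `Available` struct filled in by some probing step, prefers DirectML over CUDA in `auto` mode, and silently falls back to CPU when an explicit override is unavailable; the plan does not fix either of those choices, and reporting an error instead would be equally reasonable.

```rust
/// Mirrors the `accelerator` config values.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Accelerator {
    Auto,
    Cpu,
    Cuda,
    DirectMl,
}

/// What a (hypothetical) probe found on this machine.
pub struct Available {
    pub cuda: bool,
    pub directml: bool,
}

/// Resolves the configured value to a concrete backend (never `Auto`).
pub fn resolve(requested: Accelerator, available: &Available) -> Accelerator {
    match requested {
        Accelerator::Auto => {
            if available.directml {
                Accelerator::DirectMl
            } else if available.cuda {
                Accelerator::Cuda
            } else {
                Accelerator::Cpu
            }
        }
        // Explicit overrides are honoured only if the backend is present.
        Accelerator::Cuda if available.cuda => Accelerator::Cuda,
        Accelerator::DirectMl if available.directml => Accelerator::DirectMl,
        // Explicit `cpu`, or an unavailable override: assumed CPU fallback.
        _ => Accelerator::Cpu,
    }
}
```

The resolved value is also what `mouth status` would report in its `accelerator` field.
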
### Phase 6: Overlay

- Create a small always-on-top window using winit
- Render with softbuffer (simple coloured rectangle + text)
- States and colours:
  - Recording: red pulsing indicator
  - Transcribing: amber/yellow
  - Done: brief green flash, then hide
  - Error: brief red flash with error hint
- Window flags (Windows): `WS_EX_TOPMOST | WS_EX_TOOLWINDOW | WS_EX_NOACTIVATE`
- Position: centered horizontally at top or bottom of current monitor
- No focus steal, no taskbar entry

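
The positioning rule is simple arithmetic and can be sketched independently of winit. `Rect`, `overlay_origin`, and the `margin` parameter are assumptions made for this example; the plan does not specify a margin.

```rust
/// Mirrors `overlay_position`; `Hidden` stands for the YAML value `none`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum OverlayPosition {
    Top,
    Bottom,
    Hidden,
}

/// Monitor bounds in physical pixels (origin may be non-zero on multi-monitor setups).
#[derive(Debug, Clone, Copy)]
pub struct Rect {
    pub x: i32,
    pub y: i32,
    pub width: u32,
    pub height: u32,
}

/// Returns the overlay's top-left corner, or `None` when the overlay is disabled.
pub fn overlay_origin(
    position: OverlayPosition,
    monitor: Rect,
    overlay_width: u32,
    overlay_height: u32,
    margin: i32,
) -> Option<(i32, i32)> {
    // Centre horizontally within the monitor.
    let x = monitor.x + (monitor.width as i32 - overlay_width as i32) / 2;
    let y = match position {
        OverlayPosition::Top => monitor.y + margin,
        OverlayPosition::Bottom => {
            monitor.y + monitor.height as i32 - overlay_height as i32 - margin
        }
        OverlayPosition::Hidden => return None,
    };
    Some((x, y))
}
```

For a 1920x1080 monitor at the origin, a 200x40 overlay with a 20 px margin lands at `(860, 20)` for `top` and `(860, 1020)` for `bottom`.
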
### Phase 7: Paste System

- Save current clipboard content (if preserving)
- Set transcribed text to clipboard via arboard
- Simulate keypress via enigo based on paste_method:
  - `ctrl_v`: Ctrl+V (Cmd+V on macOS)
  - `shift_insert`: Shift+Insert
  - `ctrl_shift_v`: Ctrl+Shift+V
  - `clipboard_only`: no keypress, just clipboard
- Restore previous clipboard content (unless copy_to_clipboard is true)
- Small delay between clipboard set and paste simulation (~50ms)

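
The mapping from `paste_method` to a key chord can be expressed as plain data before any enigo call is made. In this sketch `Key`, `paste_chord`, and `should_restore_clipboard` are hypothetical names, and treating `clipboard_only` as "never restore" is an assumption (restoring would discard the transcription).

```rust
/// Mirrors the `paste_method` config values.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PasteMethod {
    CtrlV,
    ShiftInsert,
    CtrlShiftV,
    ClipboardOnly,
}

/// Abstract keys; a real implementation would translate these to enigo keys.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Key {
    Ctrl,
    Cmd,
    Shift,
    Insert,
    V,
}

/// Parses the YAML value, e.g. "shift_insert".
pub fn parse_paste_method(value: &str) -> Option<PasteMethod> {
    match value {
        "ctrl_v" => Some(PasteMethod::CtrlV),
        "shift_insert" => Some(PasteMethod::ShiftInsert),
        "ctrl_shift_v" => Some(PasteMethod::CtrlShiftV),
        "clipboard_only" => Some(PasteMethod::ClipboardOnly),
        _ => None,
    }
}

/// Keys to hold down, in order; empty means "no keypress".
pub fn paste_chord(method: PasteMethod, is_macos: bool) -> Vec<Key> {
    match method {
        PasteMethod::CtrlV if is_macos => vec![Key::Cmd, Key::V],
        PasteMethod::CtrlV => vec![Key::Ctrl, Key::V],
        PasteMethod::ShiftInsert => vec![Key::Shift, Key::Insert],
        PasteMethod::CtrlShiftV => vec![Key::Ctrl, Key::Shift, Key::V],
        PasteMethod::ClipboardOnly => Vec::new(),
    }
}

/// Whether the previous clipboard content should be put back after pasting.
pub fn should_restore_clipboard(method: PasteMethod, copy_to_clipboard: bool) -> bool {
    method != PasteMethod::ClipboardOnly && !copy_to_clipboard
}
```

Keeping the chord as data makes the platform differences testable without a display server or a focused window.
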
### Phase 8: Audio Feedback

- Bundle two short PCM blip sounds in the binary (via `include_bytes!`)
- "Blip up" on recording start
- "Blip down" on recording stop / transcription complete
- Play via rodio on a separate thread (non-blocking)
- Respect audio_feedback config flag

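
The plan does not state the encoding of the bundled `.pcm` files. Assuming headerless 16-bit signed little-endian samples, the embedded bytes could be decoded to `f32` as sketched below; `pcm16le_to_f32` is a hypothetical helper, and the `include_bytes!` line is shown only as a comment because it needs the resource file at compile time.

```rust
// In the real crate the bytes would be embedded at compile time, e.g.:
// static BLIP_UP: &[u8] = include_bytes!("../resources/blip_up.pcm");

/// Decodes headerless 16-bit signed little-endian PCM into samples in [-1.0, 1.0).
/// A trailing odd byte, if any, is ignored.
pub fn pcm16le_to_f32(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(2)
        .map(|pair| i16::from_le_bytes([pair[0], pair[1]]) as f32 / 32768.0)
        .collect()
}
```

The decoded samples, together with the channel count and sample rate the files were recorded at, are what a playback buffer would need.
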
### Phase 9: Coordinator + Integration

- Wire all components together with channel-based message passing
- Main thread: overlay window event loop (winit requires this)
- Spawned threads/tasks: hotkey listener, audio recorder, transcriber
- Coordinator receives events, drives state machine:
  ```
  Idle ──[hotkey press]──▶ Recording
  Recording ──[hotkey release/press]──▶ Transcribing
  Recording ──[cancel]──▶ Idle
  Transcribing ──[result]──▶ Pasting ──▶ Idle
  Transcribing ──[error]──▶ Error ──▶ Idle
  ```
- Graceful shutdown on SIGINT (Ctrl+C); tray quit is deferred, since tray integration is out of scope for v1

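
The state machine above translates fairly directly into a pure transition function. This is a sketch under assumptions: `Trigger::Finished` is a hypothetical event marking the end of the Pasting and Error states, and "release/press" is read as release in push-to-talk mode and a second press in toggle mode.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum State {
    Idle,
    Recording,
    Transcribing,
    Pasting,
    Error,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Trigger {
    HotkeyPress,
    HotkeyRelease,
    Cancel,
    TranscriptReady,
    TranscriptFailed,
    /// Hypothetical: pasting finished or the error flash timed out.
    Finished,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Mode {
    PushToTalk,
    Toggle,
}

/// Returns the next state; unlisted (state, trigger) pairs leave the state unchanged.
pub fn next_state(state: State, trigger: Trigger, mode: Mode) -> State {
    use State as S;
    use Trigger as T;
    match (state, trigger) {
        (S::Idle, T::HotkeyPress) => S::Recording,
        (S::Recording, T::HotkeyRelease) if mode == Mode::PushToTalk => S::Transcribing,
        (S::Recording, T::HotkeyPress) if mode == Mode::Toggle => S::Transcribing,
        (S::Recording, T::Cancel) => S::Idle,
        (S::Transcribing, T::TranscriptReady) => S::Pasting,
        (S::Transcribing, T::TranscriptFailed) => S::Error,
        (S::Pasting, T::Finished) | (S::Error, T::Finished) => S::Idle,
        _ => state,
    }
}
```

Ignoring unexpected triggers, rather than panicking, means a stray hotkey press during transcription is harmless.
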
### Phase 10: Daemon IPC + Status

- The running daemon listens on a local Unix domain socket (Linux/macOS) or named pipe (Windows) for status queries
- Socket/pipe path: `/tmp/mouth.sock` (Linux/macOS), `\\.\pipe\mouth` (Windows)
- `mouth status` connects and requests current state; daemon responds with JSON:
  ```json
  {
    "version": "0.1.0",
    "state": "idle",
    "model": "parakeet-tdt-0.6b-v3",
    "accelerator": "directml",
    "uptime_secs": 3420
  }
  ```
- If the daemon is not running, `mouth status` reports "Mouth is not running" and exits with code 1
- Also used internally to prevent launching a second daemon instance (lock check)

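
A minimal, dependency-free sketch of the status payload and endpoint selection follows. `Status`, `to_json`, and `ipc_endpoint` are hypothetical names; the JSON is hand-formatted here purely to keep the example stdlib-only, and a JSON serialization crate would be the more likely choice in practice.

```rust
/// Fields of the status response shown above.
pub struct Status {
    pub version: String,
    pub state: String,
    pub model: String,
    pub accelerator: String,
    pub uptime_secs: u64,
}

/// Escapes quotes, backslashes, and control characters for a JSON string.
fn escape_json(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

impl Status {
    /// Serializes to a compact single-line JSON object.
    pub fn to_json(&self) -> String {
        format!(
            "{{\"version\":\"{}\",\"state\":\"{}\",\"model\":\"{}\",\"accelerator\":\"{}\",\"uptime_secs\":{}}}",
            escape_json(&self.version),
            escape_json(&self.state),
            escape_json(&self.model),
            escape_json(&self.accelerator),
            self.uptime_secs
        )
    }
}

/// Socket or pipe path for the current platform, as listed in the plan.
pub fn ipc_endpoint() -> &'static str {
    if cfg!(windows) {
        r"\\.\pipe\mouth"
    } else {
        "/tmp/mouth.sock"
    }
}
```

Because only one process can own the endpoint, a failed bind at startup doubles as the "already running" check mentioned above.
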
### Phase 11: Polish + Distribution

- Error handling: user-friendly messages for common failures (no mic, model not found, etc.)
- Windows installer via `cargo-wix` or distribute as standalone .exe
- Test on Windows 10/11 primarily
- Test on Linux (X11 + Wayland) and macOS as secondary
- Update CLAUDE.md with build/run/test instructions
- Write user-facing README with setup instructions

## Risks & Mitigations

| Risk | Impact | Mitigation |
|------|--------|------------|
| Parakeet v3 ONNX model compatibility with `ort` | Blocks core functionality | Test early in Phase 5; Parakeet v2 as fallback |
| `rdev` hotkey reliability on Windows | Broken UX | Test early in Phase 2; fallback to Win32 `RegisterHotKey` |
| Overlay focus stealing | Annoying | Use proper window flags; test with various foreground apps |
| Audio resampling quality | Poor transcription | Use rubato's sinc interpolation (high quality) |
| Binary size with bundled ONNX Runtime | Large download | ONNX Runtime is ~20-40 MB; acceptable for a single-binary tool |
| winit event loop blocking | Unresponsive | All heavy work on background threads; overlay is lightweight |

## File Structure

```
mouth/
├── Cargo.toml
├── CLAUDE.md
├── README.md
├── plan.md
├── config.yaml.example
├── resources/
│   ├── blip_up.pcm        # bundled audio feedback
│   └── blip_down.pcm
└── src/
    ├── main.rs            # CLI entry, clap setup
    ├── config.rs          # Config struct, YAML load/save, defaults
    ├── hotkey.rs          # Global hotkey listener (rdev)
    ├── recorder.rs        # Audio capture (cpal + rubato + VAD)
    ├── vad.rs             # Silero VAD wrapper
    ├── transcriber.rs     # ONNX inference, model loading, GPU detection
    ├── model_cache.rs     # HuggingFace download, cache management
    ├── overlay.rs         # Minimal overlay window (winit + softbuffer)
    ├── paste.rs           # Clipboard + paste simulation
    ├── audio_feedback.rs  # Blip sounds via rodio
    ├── coordinator.rs     # State machine, channel hub
    └── cli/
        ├── mod.rs
        ├── run.rs         # `mouth run` handler
        ├── config_cmd.rs  # `mouth config` TUI
        ├── models_cmd.rs  # `mouth models` handler
        └── status_cmd.rs  # `mouth status` handler
```

## Not In Scope (v1)

- LLM post-processing of transcriptions
- Transcription history / database
- Multiple model support (v1 is Parakeet v3 only, architecture supports adding more later)
- Auto-submit (Enter after paste)
- Multi-language UI
- Tray icon / system tray integration
- Translate-to-English mode