Implement core speech-to-text pipeline

All major components: hotkey listener (rdev), audio capture (cpal), resampling (rubato), VAD (Silero ONNX), Parakeet v3 TDT transcription (ort), overlay window (winit+softbuffer), paste simulation (enigo+arboard), audio feedback (rodio), YAML config, CLI with clap, HuggingFace model download. ~2400 lines of Rust across 16 source files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 16:47:46 +01:00
parent 6b737f92fe
commit 9b0bf7d9e3
22 changed files with 7750 additions and 0 deletions
@@ -0,0 +1,287 @@
+# Mouth — Implementation Plan
+
+## Overview
+
+Mouth is a single-binary, offline speech-to-text tool for Windows (with Linux/macOS support where possible). Press a hotkey, speak, and transcribed text is pasted at your cursor. Configured entirely via YAML.
+
+## Architecture
+
+```
+┌─────────────┐     ┌───────────┐     ┌─────────────┐     ┌────────────┐
+│  Hotkey      │────▶│  Recorder │────▶│ Transcriber  │────▶│   Paste    │
+│  Listener    │     │  (cpal)   │     │  (ort/ONNX)  │     │  (enigo)   │
+│  (rdev)      │     │           │     │              │     │            │
+└─────────────┘     └───────────┘     └─────────────┘     └────────────┘
+       │                  │                  │                    │
+       │                  ▼                  │                    │
+       │            ┌───────────┐            │                    │
+       │            │    VAD    │            │                    │
+       │            │ (silero)  │            │                    │
+       │            └───────────┘            │                    │
+       │                                     │                    │
+       ▼                                     ▼                    ▼
+┌──────────────────────────────────────────────────────────────────────┐
+│                        Overlay (winit)                               │
+│                  State: idle → recording → transcribing → done       │
+└──────────────────────────────────────────────────────────────────────┘
+```
+
+### Component Communication
+
+All components communicate via channels (`std::sync::mpsc` or `tokio::sync`). The main thread owns the overlay window (required by most windowing systems). A coordinator task receives events from hotkey/recorder/transcriber and drives state transitions.
+
+```
+HotkeyEvent(Pressed/Released) ──┐
+AudioReady(Vec<f32>) ───────────┼──▶ Coordinator ──▶ OverlayState
+TranscriptionDone(String) ──────┤                 ──▶ PasteAction
+CancelRequested ────────────────┘
+```
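
Illustrative sketch (not code from this commit): the fan-in in the diagram above can be modelled with one `std::sync::mpsc` channel whose sender is cloned per producer. The `Event` enum and `collect_events` function are hypothetical names chosen to mirror the diagram, and the two threads are stand-ins for the real hotkey listener and recorder.

```rust
use std::sync::mpsc;
use std::thread;

/// Events flowing into the coordinator (hypothetical names mirroring the diagram).
#[derive(Debug, PartialEq)]
pub enum Event {
    HotkeyPressed,
    HotkeyReleased,
    AudioReady(Vec<f32>),
    TranscriptionDone(String),
    CancelRequested,
}

/// Runs two stand-in producers and returns everything the coordinator side received.
pub fn collect_events() -> Vec<Event> {
    let (tx, rx) = mpsc::channel::<Event>();

    // Stand-in for the hotkey listener thread.
    let hotkey_tx = tx.clone();
    let hotkey = thread::spawn(move || {
        hotkey_tx.send(Event::HotkeyPressed).unwrap();
        hotkey_tx.send(Event::HotkeyReleased).unwrap();
    });
    hotkey.join().unwrap();

    // Stand-in for the recorder thread; it takes ownership of the last sender.
    let recorder = thread::spawn(move || {
        tx.send(Event::AudioReady(vec![0.0; 4])).unwrap();
    });
    recorder.join().unwrap();

    // Every sender has been dropped, so the iterator ends after the queued events.
    rx.into_iter().collect()
}
```

In a long-running daemon the receiving loop would presumably block on `rx.recv()` in the coordinator thread rather than collect into a `Vec`; the collection here only makes the ordering easy to inspect.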
+
+## Crate Dependencies
+
+| Crate | Purpose | Notes |
+|-------|---------|-------|
+| `rdev` | Global hotkey capture | Cross-platform key events, no focus required |
+| `cpal` | Audio capture | Cross-platform mic input |
+| `rubato` | Audio resampling | Resample to 16kHz for Parakeet |
+| `ort` | ONNX Runtime | Run Parakeet v3 + Silero VAD |
+| `hf-hub` | Model download | Download from HuggingFace, standard cache dir |
+| `enigo` | Keyboard simulation | Simulate Ctrl+V, Shift+Insert, etc. |
+| `arboard` | Clipboard access | Read/write clipboard, save/restore |
+| `winit` | Windowing | Minimal overlay window |
+| `softbuffer` | Pixel rendering | Draw coloured overlay on the CPU (no GPU needed) |
+| `serde` + `serde_yaml` | Config | Deserialize YAML config |
+| `clap` | CLI | Subcommands: `run`, `config`, `models`, `status` |
+| `dialoguer` | Interactive TUI | `mouth config` interactive setup |
+| `rodio` | Audio playback | Blip up/down sounds |
+| `indicatif` | Progress bars | Model download progress |
+| `dirs` | Platform dirs | Config/cache paths |
+| `tracing` | Logging | Structured logging |
+
+## Config File
+
+Location: `~/.config/mouth/config.yaml` (Linux/macOS), `%APPDATA%\mouth\config.yaml` (Windows)
+
+```yaml
+# Hotkey to activate recording
+hotkey: "ctrl+space"
+
+# Recording mode: push_to_talk or toggle
+mode: push_to_talk
+
+# Cancel hotkey (only active while recording)
+cancel_key: "escape"
+
+# Speech-to-text model
+model: "parakeet-tdt-0.6b-v3"
+
+# Inference accelerator: auto, cpu, cuda, directml
+accelerator: auto
+
+# GPU device index (only used when accelerator is cuda/directml)
+gpu_device: 0
+
+# How to paste text
+paste_method: ctrl_v  # ctrl_v | shift_insert | ctrl_shift_v | clipboard_only
+
+# Also keep transcribed text on clipboard after pasting
+copy_to_clipboard: true
+
+# Overlay position on screen
+overlay_position: top  # top | bottom | none
+
+# Audio feedback
+audio_feedback: true
+
+# Audio input device (null = system default)
+input_device: null
+
+# VAD: trim silence from audio before transcription
+vad_enabled: true
+
+# Language (for model hint, if supported)
+language: en
+```
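
Illustrative sketch (not code from this commit): one way the YAML above could map onto a typed config with matching defaults. The type and variant names are assumptions. The real `config.rs` presumably derives serde's `Serialize`/`Deserialize`, which is left out here so the sketch builds with the standard library alone.

```rust
// Hypothetical types mirroring the documented YAML keys and allowed values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode { PushToTalk, Toggle }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Accelerator { Auto, Cpu, Cuda, DirectMl }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PasteMethod { CtrlV, ShiftInsert, CtrlShiftV, ClipboardOnly }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OverlayPosition { Top, Bottom, None }

#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    pub hotkey: String,
    pub mode: Mode,
    pub cancel_key: String,
    pub model: String,
    pub accelerator: Accelerator,
    pub gpu_device: u32,
    pub paste_method: PasteMethod,
    pub copy_to_clipboard: bool,
    pub overlay_position: OverlayPosition,
    pub audio_feedback: bool,
    /// `None` corresponds to `input_device: null` (system default device).
    pub input_device: Option<String>,
    pub vad_enabled: bool,
    pub language: String,
}

impl Default for Config {
    /// Defaults copied from the example YAML in the plan.
    fn default() -> Self {
        Config {
            hotkey: "ctrl+space".to_string(),
            mode: Mode::PushToTalk,
            cancel_key: "escape".to_string(),
            model: "parakeet-tdt-0.6b-v3".to_string(),
            accelerator: Accelerator::Auto,
            gpu_device: 0,
            paste_method: PasteMethod::CtrlV,
            copy_to_clipboard: true,
            overlay_position: OverlayPosition::Top,
            audio_feedback: true,
            input_device: None,
            vad_enabled: true,
            language: "en".to_string(),
        }
    }
}
```

Keeping the enumerated options as enums rather than strings means an unknown value such as `paste_method: ctrl_x` would be rejected at load time instead of at paste time.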
+
+## CLI Interface
+
+```
+mouth run              # Start the daemon (default if no subcommand)
+mouth config           # Interactive TUI to edit config
+mouth config --show    # Print current config to stdout
+mouth config --reset   # Reset config to defaults
+mouth models           # List available/downloaded models
+mouth models download  # Download configured model (if not cached)
+mouth status           # Show daemon status, loaded model, app version
+```
+
+## Implementation Phases
+
+### Phase 1: Project Skeleton + Config
+
+- Cargo.toml with all dependencies
+- Config struct with serde, defaults, load/save
+- CLI with clap (run, config, models subcommands)
+- `mouth config` interactive TUI with dialoguer
+- Platform-aware config/cache directory resolution
+
+### Phase 2: Hotkey Listener
+
+- Global hotkey capture using rdev
+- Support configurable key combinations (parse from string like "ctrl+space")
+- Push-to-talk mode: record on press, stop on release
+- Toggle mode: start on first press, stop on second press
+- Cancel on Escape while recording
+- Debounce rapid key events (~30ms)
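
Illustrative sketch (not code from this commit): parsing a combination string such as "ctrl+space" into modifiers plus one main key. `Hotkey` and `parse_hotkey` are hypothetical names, and the accepted modifier aliases are assumptions. Mapping the resulting key name onto an `rdev` key value is deliberately left out.

```rust
/// Parsed hotkey: modifier flags plus exactly one non-modifier key name.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Hotkey {
    pub ctrl: bool,
    pub shift: bool,
    pub alt: bool,
    pub meta: bool,
    pub key: String,
}

/// Parses strings like "ctrl+space" or "Ctrl + Shift + F9" (case-insensitive).
pub fn parse_hotkey(spec: &str) -> Result<Hotkey, String> {
    let mut hotkey = Hotkey::default();
    for part in spec.split('+') {
        let token = part.trim().to_ascii_lowercase();
        match token.as_str() {
            "" => return Err(format!("empty key name in {spec:?}")),
            "ctrl" | "control" => hotkey.ctrl = true,
            "shift" => hotkey.shift = true,
            "alt" => hotkey.alt = true,
            "meta" | "super" | "win" | "cmd" => hotkey.meta = true,
            other => {
                // Only one non-modifier key is allowed per combination.
                if !hotkey.key.is_empty() {
                    return Err(format!("more than one non-modifier key in {spec:?}"));
                }
                hotkey.key = other.to_string();
            }
        }
    }
    if hotkey.key.is_empty() {
        return Err(format!("no non-modifier key in {spec:?}"));
    }
    Ok(hotkey)
}
```

Rejecting modifier-only strings up front avoids a listener that can never fire, since a combination like "ctrl+shift" has no key whose press or release would mark the start and stop of a recording.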
+
+### Phase 3: Audio Capture + VAD
+
+- Open mic input via cpal (default device or configured)
+- Convert to f32 mono
+- Resample to 16kHz via rubato
+- Buffer audio chunks during recording
+- Run Silero VAD to trim leading/trailing silence
+- Produce final `Vec<f32>` of clean speech at 16kHz
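
Illustrative sketch (not code from this commit): the mono downmix and silence-trimming steps, using only the standard library. The trimming here is a plain RMS energy threshold swapped in for Silero VAD so the example stays self-contained, and it is far cruder than a neural VAD. Function names, the frame length, and the threshold are assumptions. Resampling is not shown.

```rust
/// Averages interleaved multi-channel samples down to one channel.
pub fn to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be non-zero");
    interleaved
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

/// Drops leading and trailing frames whose RMS is below `threshold`.
/// Energy-based stand-in for the Silero VAD step, not the real detector.
pub fn trim_silence(samples: &[f32], frame_len: usize, threshold: f32) -> &[f32] {
    assert!(frame_len > 0, "frame length must be non-zero");
    let is_speech = |frame: &[f32]| {
        let mean_square = frame.iter().map(|s| s * s).sum::<f32>() / frame.len() as f32;
        mean_square.sqrt() >= threshold
    };
    let frames: Vec<&[f32]> = samples.chunks(frame_len).collect();
    let first = frames.iter().position(|f| is_speech(*f));
    let last = frames.iter().rposition(|f| is_speech(*f));
    match (first, last) {
        (Some(first), Some(last)) => {
            let start = first * frame_len;
            // The final frame may be shorter than `frame_len`, so clamp the end.
            let end = ((last + 1) * frame_len).min(samples.len());
            &samples[start..end]
        }
        // No frame reached the threshold: treat the whole clip as silence.
        _ => &[],
    }
}
```

Silence inside the utterance is kept on purpose, because only the leading and trailing edges are trimmed in the plan.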
+
+### Phase 4: Model Management
+
+- Use hf-hub to download Parakeet v3 ONNX model from HuggingFace
+- Store in standard HF cache (`~/.cache/huggingface/hub/`)
+- Show download progress with indicatif
+- `mouth models` command to list/download models
+- Auto-download on first run if model not cached
+
+### Phase 5: Transcription
+
+- Load Parakeet v3 ONNX model via ort
+- Auto-detect GPU (DirectML on Windows, CUDA if available, CPU fallback)
+- Respect accelerator override from config
+- Run inference on captured audio
+- Return transcribed text string
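
Illustrative sketch (not code from this commit): resolving the `accelerator` setting to a concrete backend. The preference order for `auto` (DirectML, then CUDA, then CPU) is one reading of the bullet above, and falling back to CPU when an explicitly requested backend is unavailable is an assumption. The `Available` struct is hypothetical; how availability is probed through `ort` is not shown.

```rust
/// Value of the `accelerator` config key.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Accelerator { Auto, Cpu, Cuda, DirectMl }

/// Backend actually used for inference.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend { Cpu, Cuda, DirectMl }

/// Hypothetical probe result: which GPU backends can be initialised on this machine.
#[derive(Debug, Clone, Copy)]
pub struct Available {
    pub cuda: bool,
    pub directml: bool,
}

/// Picks a backend, honouring an explicit override when it is usable.
pub fn resolve_backend(requested: Accelerator, available: Available) -> Backend {
    match requested {
        Accelerator::Cpu => Backend::Cpu,
        Accelerator::Cuda if available.cuda => Backend::Cuda,
        Accelerator::DirectMl if available.directml => Backend::DirectMl,
        // Assumed auto order: DirectML first, then CUDA.
        Accelerator::Auto if available.directml => Backend::DirectMl,
        Accelerator::Auto if available.cuda => Backend::Cuda,
        // Nothing usable (or the requested backend is missing): fall back to CPU.
        _ => Backend::Cpu,
    }
}
```

A stricter design could return an error when an explicit `cuda` or `directml` request cannot be met, so the user notices the misconfiguration instead of silently running on CPU.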
+
+### Phase 6: Overlay
+
+- Create a small always-on-top window using winit
+- Render with softbuffer (simple coloured rectangle + text)
+- States and colours:
+  - Recording: red pulsing indicator
+  - Transcribing: amber/yellow
+  - Done: brief green flash, then hide
+  - Error: brief red flash with error hint
+- Window flags (Windows): `WS_EX_TOPMOST | WS_EX_TOOLWINDOW | WS_EX_NOACTIVATE`
+- Position: centered horizontally at top or bottom of current monitor
+- No focus steal, no taskbar entry
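
Illustrative sketch (not code from this commit): computing the overlay's top-left corner so it is centred horizontally at the top or bottom of a monitor. `MonitorRect`, `overlay_origin`, and the `margin` parameter are hypothetical. In the real overlay the monitor geometry would come from winit.

```rust
/// Value of the `overlay_position` config key.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OverlayPosition { Top, Bottom, None }

/// Monitor bounds in physical pixels (origin may be non-zero on multi-monitor setups).
#[derive(Debug, Clone, Copy)]
pub struct MonitorRect {
    pub x: i32,
    pub y: i32,
    pub width: u32,
    pub height: u32,
}

/// Returns the overlay's top-left corner, or `None` when the overlay is disabled.
pub fn overlay_origin(
    position: OverlayPosition,
    monitor: MonitorRect,
    overlay_width: u32,
    overlay_height: u32,
    margin: i32,
) -> Option<(i32, i32)> {
    // Centre horizontally within the monitor.
    let x = monitor.x + (monitor.width as i32 - overlay_width as i32) / 2;
    let y = match position {
        OverlayPosition::Top => monitor.y + margin,
        OverlayPosition::Bottom => {
            monitor.y + monitor.height as i32 - overlay_height as i32 - margin
        }
        OverlayPosition::None => return None,
    };
    Some((x, y))
}
```

For a 1920x1080 monitor at the origin, a 200x40 overlay with a 20 pixel margin lands at (860, 20) for `top` and (860, 1020) for `bottom`.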
+
+### Phase 7: Paste System
+
+- Save current clipboard content (only when `copy_to_clipboard` is false, so it can be restored afterwards)
+- Set transcribed text to clipboard via arboard
+- Wait briefly (~50ms) between setting the clipboard and simulating the paste
+- Simulate keypress via enigo based on `paste_method`:
+  - `ctrl_v`: Ctrl+V (Cmd+V on macOS)
+  - `shift_insert`: Shift+Insert
+  - `ctrl_shift_v`: Ctrl+Shift+V
+  - `clipboard_only`: no keypress, just clipboard
+- Restore previous clipboard content (unless `copy_to_clipboard` is true)
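
Illustrative sketch (not code from this commit): the paste sequence expressed as a list of steps, so the ordering can be checked without touching the clipboard or keyboard. `Step`, `paste_chord`, and `plan_paste` are hypothetical names. The second 50 ms wait before restoring the clipboard and the choice never to restore in `clipboard_only` mode are assumptions, not something the plan states. Executing the steps with arboard and enigo is not shown.

```rust
/// Value of the `paste_method` config key.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PasteMethod { CtrlV, ShiftInsert, CtrlShiftV, ClipboardOnly }

/// One action in the paste sequence.
#[derive(Debug, PartialEq, Eq)]
pub enum Step {
    SaveClipboard,
    SetClipboard,
    WaitMs(u64),
    Press(Vec<&'static str>),
    RestoreClipboard,
}

/// Keys to press for a paste method; `None` means no key simulation.
pub fn paste_chord(method: PasteMethod, is_macos: bool) -> Option<Vec<&'static str>> {
    match method {
        PasteMethod::CtrlV if is_macos => Some(vec!["cmd", "v"]),
        PasteMethod::CtrlV => Some(vec!["ctrl", "v"]),
        PasteMethod::ShiftInsert => Some(vec!["shift", "insert"]),
        PasteMethod::CtrlShiftV => Some(vec!["ctrl", "shift", "v"]),
        PasteMethod::ClipboardOnly => None,
    }
}

/// Builds the ordered steps for one paste.
pub fn plan_paste(method: PasteMethod, copy_to_clipboard: bool, is_macos: bool) -> Vec<Step> {
    let chord = paste_chord(method, is_macos);
    // Assumption: restoring only makes sense when a paste keypress happened.
    let restore = chord.is_some() && !copy_to_clipboard;

    let mut steps = Vec::new();
    if restore {
        steps.push(Step::SaveClipboard);
    }
    steps.push(Step::SetClipboard);
    if let Some(keys) = chord {
        steps.push(Step::WaitMs(50)); // let the clipboard settle before pasting
        steps.push(Step::Press(keys));
    }
    if restore {
        steps.push(Step::WaitMs(50)); // assumed: give the target app time to read
        steps.push(Step::RestoreClipboard);
    }
    steps
}
```

Separating the plan from its execution keeps the platform-specific parts (clipboard and key simulation) thin and makes the ordering rules testable on any machine.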
+
+### Phase 8: Audio Feedback
+
+- Bundle two short PCM blip sounds in the binary (via `include_bytes!`)
+- "Blip up" on recording start
+- "Blip down" on recording stop / transcription complete
+- Play via rodio on a separate thread (non-blocking)
+- Respect audio_feedback config flag
+
+### Phase 9: Coordinator + Integration
+
+- Wire all components together with channel-based message passing
+- Main thread: overlay window event loop (winit requires this)
+- Spawned threads/tasks: hotkey listener, audio recorder, transcriber
+- Coordinator receives events, drives state machine:
+  ```
+  Idle ──[hotkey press]──▶ Recording
+  Recording ──[hotkey release (push_to_talk) or second press (toggle)]──▶ Transcribing
+  Recording ──[cancel]──▶ Idle
+  Transcribing ──[result]──▶ Pasting ──▶ Idle
+  Transcribing ──[error]──▶ Error ──▶ Idle
+  ```
+- Graceful shutdown on SIGINT (Ctrl+C); tray-based quit is out of scope for v1
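
Illustrative sketch (not code from this commit): the state machine above as a pure transition function. The enum and variant names are hypothetical, and `Finished` is an assumed event marking the end of the paste or the error flash. Events that do not apply in the current state are ignored here, which is also an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State { Idle, Recording, Transcribing, Pasting, Error }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    HotkeyPressed,
    HotkeyReleased,
    Cancel,
    TranscriptionOk,
    TranscriptionFailed,
    /// Assumed event: paste finished or the error flash timed out.
    Finished,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode { PushToTalk, Toggle }

/// Returns the next state; events that do not apply leave the state unchanged.
pub fn next_state(state: State, event: Event, mode: Mode) -> State {
    match (state, event) {
        (State::Idle, Event::HotkeyPressed) => State::Recording,
        // Push-to-talk stops on release; toggle stops on the second press.
        (State::Recording, Event::HotkeyReleased) if mode == Mode::PushToTalk => {
            State::Transcribing
        }
        (State::Recording, Event::HotkeyPressed) if mode == Mode::Toggle => State::Transcribing,
        (State::Recording, Event::Cancel) => State::Idle,
        (State::Transcribing, Event::TranscriptionOk) => State::Pasting,
        (State::Transcribing, Event::TranscriptionFailed) => State::Error,
        (State::Pasting | State::Error, Event::Finished) => State::Idle,
        (current, _) => current,
    }
}
```

Because the function has no side effects, the coordinator could call it on every incoming message and then react only when the returned state differs from the current one.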
+
+### Phase 10: Daemon IPC + Status
+
+- The running daemon listens on a local Unix domain socket (Linux/macOS) or named pipe (Windows) for status queries
+- Socket/pipe path: `/tmp/mouth.sock` (Linux/macOS), `\\.\pipe\mouth` (Windows)
+- `mouth status` connects and requests current state; daemon responds with JSON:
+  ```json
+  {
+    "version": "0.1.0",
+    "state": "idle",
+    "model": "parakeet-tdt-0.6b-v3",
+    "accelerator": "directml",
+    "uptime_secs": 3420
+  }
+  ```
+- If the daemon is not running, `mouth status` reports "Mouth is not running" and exits with code 1
+- Also used internally to prevent launching a second daemon instance (lock check)
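
Illustrative sketch (not code from this commit): rendering the status reply shown above. The `Status` struct and `render_status` function are hypothetical, and the JSON is written by hand only to keep the example free of external crates. A real implementation would more plausibly use a JSON library, which the dependency table does not currently list. The socket or named-pipe transport is not shown.

```rust
/// Snapshot the daemon would report to `mouth status` (hypothetical type).
pub struct Status<'a> {
    pub version: &'a str,
    pub state: &'a str,
    pub model: &'a str,
    pub accelerator: &'a str,
    pub uptime_secs: u64,
}

/// Escapes quotes, backslashes, and control characters for a JSON string value.
fn escape(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for c in value.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Renders the status as a single-line JSON object with the documented keys.
pub fn render_status(status: &Status) -> String {
    format!(
        "{{\"version\":\"{}\",\"state\":\"{}\",\"model\":\"{}\",\"accelerator\":\"{}\",\"uptime_secs\":{}}}",
        escape(status.version),
        escape(status.state),
        escape(status.model),
        escape(status.accelerator),
        status.uptime_secs
    )
}
```

Sending the reply as a single line would let the client read until the first newline or end of stream, which keeps the protocol simple on both Unix sockets and named pipes.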
+
+### Phase 11: Polish + Distribution
+
+- Error handling: user-friendly messages for common failures (no mic, model not found, etc.)
+- Windows installer via `cargo-wix` or distribute as standalone .exe
+- Test on Windows 10/11 primarily
+- Test on Linux (X11 + Wayland) and macOS as secondary
+- Update CLAUDE.md with build/run/test instructions
+- Write user-facing README with setup instructions
+
+## Risks & Mitigations
+
+| Risk | Impact | Mitigation |
+|------|--------|------------|
+| Parakeet v3 ONNX model compatibility with `ort` | Blocks core functionality | Test early in Phase 5; Parakeet v2 as fallback |
+| `rdev` hotkey reliability on Windows | Broken UX | Test early in Phase 2; fallback to Win32 `RegisterHotKey` |
+| Overlay focus stealing | Annoying | Use proper window flags; test with various foreground apps |
+| Audio resampling quality | Poor transcription | Use rubato's sinc-interpolation resampler (high quality) |
+| Binary size with bundled ONNX Runtime | Large download | ONNX Runtime is ~20-40MB; acceptable for a single-binary tool |
+| winit event loop blocking | Unresponsive | All heavy work on background threads; overlay is lightweight |
+
+## File Structure
+
+```
+mouth/
+├── Cargo.toml
+├── CLAUDE.md
+├── README.md
+├── plan.md
+├── config.yaml.example
+├── resources/
+│   ├── blip_up.pcm          # bundled audio feedback
+│   └── blip_down.pcm
+└── src/
+    ├── main.rs               # CLI entry, clap setup
+    ├── config.rs             # Config struct, YAML load/save, defaults
+    ├── hotkey.rs             # Global hotkey listener (rdev)
+    ├── recorder.rs           # Audio capture (cpal + rubato + VAD)
+    ├── vad.rs                # Silero VAD wrapper
+    ├── transcriber.rs        # ONNX inference, model loading, GPU detection
+    ├── model_cache.rs        # HuggingFace download, cache management
+    ├── overlay.rs            # Minimal overlay window (winit + softbuffer)
+    ├── paste.rs              # Clipboard + paste simulation
+    ├── audio_feedback.rs     # Blip sounds via rodio
+    ├── coordinator.rs        # State machine, channel hub
+    └── cli/
+        ├── mod.rs
+        ├── run.rs            # `mouth run` handler
+        ├── config_cmd.rs     # `mouth config` TUI
+        ├── models_cmd.rs     # `mouth models` handler
+        └── status_cmd.rs     # `mouth status` handler
+```
+
+## Not In Scope (v1)
+
+- LLM post-processing of transcriptions
+- Transcription history / database
+- Multiple model support (v1 is Parakeet v3 only, architecture supports adding more later)
+- Auto-submit (Enter after paste)
+- Multi-language UI
+- Tray icon / system tray integration
+- Translate-to-English mode