steve 9b0bf7d9e3 Implement core speech-to-text pipeline
All major components: hotkey listener (rdev), audio capture (cpal),
resampling (rubato), VAD (Silero ONNX), Parakeet v3 TDT transcription
(ort), overlay window (winit+softbuffer), paste simulation (enigo+arboard),
audio feedback (rodio), YAML config, CLI with clap, HuggingFace model
download. ~2400 lines of Rust across 16 source files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 16:47:46 +01:00


Mouth — Implementation Plan

Overview

Mouth is a single-binary, offline speech-to-text tool for Windows (with Linux/macOS support where possible). Press a hotkey, speak, and transcribed text is pasted at your cursor. Configured entirely via YAML.

Architecture

┌─────────────┐     ┌───────────┐     ┌─────────────┐     ┌────────────┐
│  Hotkey     │────▶│  Recorder │────▶│ Transcriber │────▶│   Paste    │
│  Listener   │     │  (cpal)   │     │ (ort/ONNX)  │     │  (enigo)   │
│  (rdev)     │     │           │     │             │     │            │
└─────────────┘     └───────────┘     └─────────────┘     └────────────┘
       │                  │                  │                    │
       │                  ▼                  │                    │
       │            ┌───────────┐            │                    │
       │            │    VAD    │            │                    │
       │            │ (silero)  │            │                    │
       │            └───────────┘            │                    │
       │                                     │                    │
       ▼                                     ▼                    ▼
┌──────────────────────────────────────────────────────────────────────┐
│                        Overlay (winit)                               │
│                  State: idle → recording → transcribing → done       │
└──────────────────────────────────────────────────────────────────────┘

Component Communication

All components communicate via channels (std::sync::mpsc or tokio::sync). The main thread owns the overlay window because some platforms, notably macOS, require the window event loop to run on the main thread. A coordinator task receives events from the hotkey listener, recorder, and transcriber and drives state transitions.

HotkeyEvent(Pressed/Released) ──┐
AudioReady(Vec<f32>) ───────────┼──▶ Coordinator ──▶ OverlayState
TranscriptionDone(String) ──────┤                 ──▶ PasteAction
CancelRequested ────────────────┘
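
The fan-in above could be sketched with std::sync::mpsc as follows. This is a minimal illustration, not the project's actual coordinator: the `Event` enum and the `collect_events` helper are hypothetical names, and the producer threads stand in for the hotkey listener, recorder, and transcriber.

```rust
use std::sync::mpsc;
use std::thread;

/// Events sent to the coordinator; variant names follow the diagram,
/// but the enum itself is an assumption of this sketch.
#[derive(Debug, PartialEq)]
pub enum Event {
    HotkeyPressed,
    HotkeyReleased,
    AudioReady(Vec<f32>),
    TranscriptionDone(String),
    CancelRequested,
}

/// Spawns one producer thread per batch and returns everything the
/// coordinator side of the channel receives.
pub fn collect_events(batches: Vec<Vec<Event>>) -> Vec<Event> {
    let (tx, rx) = mpsc::channel::<Event>();
    let mut handles = Vec::new();
    for batch in batches {
        // Each producer gets its own clone of the sender.
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            for ev in batch {
                tx.send(ev).expect("coordinator hung up");
            }
        }));
    }
    // Drop the original sender so the receiver ends once all producers finish.
    drop(tx);
    let received: Vec<Event> = rx.iter().collect();
    for handle in handles {
        handle.join().expect("producer panicked");
    }
    received
}
```

Events from a single producer arrive in the order they were sent; ordering between different producers is not guaranteed, which is why the coordinator should be driven by a state machine rather than by assumptions about arrival order.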

Crate Dependencies

| Crate | Purpose | Notes |
|---|---|---|
| rdev | Global hotkey capture | Cross-platform key events, no focus required |
| cpal | Audio capture | Cross-platform mic input |
| rubato | Audio resampling | Resample to 16kHz for Parakeet |
| ort | ONNX Runtime | Run Parakeet v3 + Silero VAD |
| hf-hub | Model download | Download from HuggingFace, standard cache dir |
| enigo | Keyboard simulation | Simulate Ctrl+V, Shift+Insert, etc. |
| arboard | Clipboard access | Read/write clipboard, save/restore |
| winit | Windowing | Minimal overlay window |
| softbuffer | Pixel rendering | Draw coloured overlay (no GPU needed for overlay) |
| serde + serde_yaml | Config | Deserialize YAML config |
| clap | CLI | Subcommands: run, config, models |
| dialoguer | Interactive TUI | mouth config interactive setup |
| rodio | Audio playback | Blip up/down sounds |
| indicatif | Progress bars | Model download progress |
| dirs | Platform dirs | Config/cache paths |
| tracing | Logging | Structured logging |

Config File

Location: ~/.config/mouth/config.yaml (Linux/macOS), %APPDATA%\mouth\config.yaml (Windows)

# Hotkey to activate recording
hotkey: "ctrl+space"

# Recording mode: push_to_talk or toggle
mode: push_to_talk

# Cancel hotkey (only active while recording)
cancel_key: "escape"

# Speech-to-text model
model: "parakeet-tdt-0.6b-v3"

# Inference accelerator: auto, cpu, cuda, directml
accelerator: auto

# GPU device index (only used when accelerator is cuda/directml)
gpu_device: 0

# How to paste text
paste_method: ctrl_v  # ctrl_v | shift_insert | ctrl_shift_v | clipboard_only

# Also keep transcribed text on clipboard after pasting
copy_to_clipboard: true

# Overlay position on screen
overlay_position: top  # top | bottom | none

# Audio feedback
audio_feedback: true

# Audio input device (null = system default)
input_device: null

# VAD: trim silence from audio before transcription
vad_enabled: true

# Language (for model hint, if supported)
language: en
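
The defaults above might map onto a Rust struct roughly like this. It is a std-only sketch: the type and field names are assumptions chosen to mirror the YAML keys, and the real project would presumably derive serde traits instead of hand-writing the string parsing shown for `paste_method`.

```rust
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode { PushToTalk, Toggle }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PasteMethod { CtrlV, ShiftInsert, CtrlShiftV, ClipboardOnly }

impl FromStr for PasteMethod {
    type Err = String;
    /// Accepts exactly the four spellings listed in the config comment.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "ctrl_v" => Ok(Self::CtrlV),
            "shift_insert" => Ok(Self::ShiftInsert),
            "ctrl_shift_v" => Ok(Self::CtrlShiftV),
            "clipboard_only" => Ok(Self::ClipboardOnly),
            other => Err(format!("unknown paste_method: {other}")),
        }
    }
}

/// Hypothetical in-memory form of config.yaml.
#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    pub hotkey: String,
    pub mode: Mode,
    pub cancel_key: String,
    pub model: String,
    pub accelerator: String,
    pub gpu_device: u32,
    pub paste_method: PasteMethod,
    pub copy_to_clipboard: bool,
    pub overlay_position: String,
    pub audio_feedback: bool,
    pub input_device: Option<String>, // None = system default
    pub vad_enabled: bool,
    pub language: String,
}

impl Default for Config {
    /// Mirrors the values shown in the example config file.
    fn default() -> Self {
        Self {
            hotkey: "ctrl+space".into(),
            mode: Mode::PushToTalk,
            cancel_key: "escape".into(),
            model: "parakeet-tdt-0.6b-v3".into(),
            accelerator: "auto".into(),
            gpu_device: 0,
            paste_method: PasteMethod::CtrlV,
            copy_to_clipboard: true,
            overlay_position: "top".into(),
            audio_feedback: true,
            input_device: None,
            vad_enabled: true,
            language: "en".into(),
        }
    }
}
```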

CLI Interface

mouth run              # Start the daemon (default if no subcommand)
mouth config           # Interactive TUI to edit config
mouth config --show    # Print current config to stdout
mouth config --reset   # Reset config to defaults
mouth models           # List available/downloaded models
mouth models download  # Download configured model (if not cached)
mouth status           # Show daemon status, loaded model, app version

Implementation Phases

Phase 1: Project Skeleton + Config

  • Cargo.toml with all dependencies
  • Config struct with serde, defaults, load/save
  • CLI with clap (run, config, models subcommands)
  • mouth config interactive TUI with dialoguer
  • Platform-aware config/cache directory resolution

Phase 2: Hotkey Listener

  • Global hotkey capture using rdev
  • Support configurable key combinations (parse from string like "ctrl+space")
  • Push-to-talk mode: record on press, stop on release
  • Toggle mode: start on first press, stop on second press
  • Cancel on Escape while recording
  • Debounce rapid key events (~30ms)
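
Parsing a combination string such as "ctrl+space" could look like the sketch below. The `Hotkey` struct, the `parse_hotkey` function, and the accepted modifier spellings are assumptions for illustration; mapping the resulting key name onto rdev key codes is not shown.

```rust
/// Parsed form of a hotkey string; a hypothetical type for this sketch.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct Hotkey {
    pub ctrl: bool,
    pub shift: bool,
    pub alt: bool,
    pub meta: bool,
    /// The single non-modifier key, lowercased (e.g. "space").
    pub key: String,
}

/// Splits on '+', treats known names as modifiers, and requires
/// exactly one remaining non-modifier key.
pub fn parse_hotkey(spec: &str) -> Result<Hotkey, String> {
    let mut hk = Hotkey::default();
    let mut key: Option<String> = None;
    for part in spec.split('+') {
        let part = part.trim().to_ascii_lowercase();
        match part.as_str() {
            "" => return Err(format!("empty key name in {spec:?}")),
            "ctrl" | "control" => hk.ctrl = true,
            "shift" => hk.shift = true,
            "alt" => hk.alt = true,
            "meta" | "super" | "win" | "cmd" => hk.meta = true,
            other => {
                // `replace` returns the previous value, so Some means a duplicate.
                if key.replace(other.to_string()).is_some() {
                    return Err(format!("more than one non-modifier key in {spec:?}"));
                }
            }
        }
    }
    hk.key = key.ok_or_else(|| format!("no non-modifier key in {spec:?}"))?;
    Ok(hk)
}
```

Rejecting modifier-only combinations up front keeps the listener from having to guess when a "press" has happened.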

Phase 3: Audio Capture + VAD

  • Open mic input via cpal (default device or configured)
  • Convert to f32 mono
  • Resample to 16kHz via rubato
  • Buffer audio chunks during recording
  • Run Silero VAD to trim leading/trailing silence
  • Produce final Vec<f32> of clean speech at 16kHz
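
Two of these steps, the mono downmix and the silence trim, can be sketched without any audio crates. The function names are hypothetical, and `trim_silence` assumes the VAD has already produced one speech probability per fixed-size window; running Silero itself and resampling with rubato are not shown.

```rust
/// Averages interleaved frames down to a single channel.
pub fn to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be non-zero");
    interleaved
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

/// Keeps the span from the first to the last window whose speech
/// probability reaches `threshold`. `probs[i]` is assumed to cover
/// samples `i * window .. (i + 1) * window`.
pub fn trim_silence<'a>(
    samples: &'a [f32],
    probs: &[f32],
    window: usize,
    threshold: f32,
) -> &'a [f32] {
    let first = probs.iter().position(|&p| p >= threshold);
    let last = probs.iter().rposition(|&p| p >= threshold);
    match (first, last) {
        (Some(f), Some(l)) => {
            // Clamp to the buffer in case the last window is partial.
            let start = (f * window).min(samples.len());
            let end = ((l + 1) * window).min(samples.len());
            &samples[start..end]
        }
        // No window contained speech: return an empty slice.
        _ => &samples[..0],
    }
}
```

Only leading and trailing silence is removed here; pauses inside the utterance are kept, which matches the "trim leading/trailing silence" wording of the plan.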

Phase 4: Model Management

  • Use hf-hub to download Parakeet v3 ONNX model from HuggingFace
  • Store in standard HF cache (~/.cache/huggingface/hub/)
  • Show download progress with indicatif
  • mouth models command to list/download models
  • Auto-download on first run if model not cached

Phase 5: Transcription

  • Load Parakeet v3 ONNX model via ort
  • Auto-detect GPU (DirectML on Windows, CUDA if available, CPU fallback)
  • Respect accelerator override from config
  • Run inference on captured audio
  • Return transcribed text string
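
The backend choice could be isolated in a small pure function, which makes it easy to test without a GPU. This sketch rests on assumptions: the type names are hypothetical, probing for available backends is not shown, `auto` is taken to prefer DirectML over CUDA, and an explicit override that cannot be satisfied is treated as an error rather than a silent CPU fallback.

```rust
/// Value of the `accelerator` config key.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Accelerator { Auto, Cpu, Cuda, DirectMl }

/// The backend actually used for inference.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend { Cpu, Cuda, DirectMl }

/// Result of a (not shown) probe for usable GPU backends.
#[derive(Debug, Clone, Copy)]
pub struct Available {
    pub cuda: bool,
    pub directml: bool,
}

pub fn resolve_backend(requested: Accelerator, avail: Available) -> Result<Backend, String> {
    match requested {
        Accelerator::Cpu => Ok(Backend::Cpu),
        Accelerator::Cuda if avail.cuda => Ok(Backend::Cuda),
        Accelerator::Cuda => Err("cuda requested but not available".to_string()),
        Accelerator::DirectMl if avail.directml => Ok(Backend::DirectMl),
        Accelerator::DirectMl => Err("directml requested but not available".to_string()),
        // Auto: try GPU backends in order, then fall back to CPU.
        Accelerator::Auto => Ok(if avail.directml {
            Backend::DirectMl
        } else if avail.cuda {
            Backend::Cuda
        } else {
            Backend::Cpu
        }),
    }
}
```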

Phase 6: Overlay

  • Create a small always-on-top window using winit
  • Render with softbuffer (simple coloured rectangle + text)
  • States and colours:
    • Recording: red pulsing indicator
    • Transcribing: amber/yellow
    • Done: brief green flash, then hide
    • Error: brief red flash with error hint
  • Window flags (Windows): WS_EX_TOPMOST | WS_EX_TOOLWINDOW | WS_EX_NOACTIVATE
  • Position: centered horizontally at top or bottom of current monitor
  • No focus steal, no taskbar entry
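
The positioning rule can be expressed as plain arithmetic on the monitor rectangle. The names below are hypothetical, and the `margin` parameter is an assumption of this sketch; the plan does not specify a gap from the screen edge.

```rust
/// Value of the `overlay_position` config key ("none" maps to Hidden).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OverlayPosition { Top, Bottom, Hidden }

/// Monitor rectangle in virtual-desktop pixel coordinates.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Rect {
    pub x: i32,
    pub y: i32,
    pub width: u32,
    pub height: u32,
}

/// Returns the top-left corner for the overlay window, or None when
/// the overlay is disabled.
pub fn overlay_origin(
    monitor: Rect,
    overlay_w: u32,
    overlay_h: u32,
    pos: OverlayPosition,
    margin: u32,
) -> Option<(i32, i32)> {
    // Centre horizontally within the monitor, honouring its x offset.
    let x = monitor.x + (monitor.width as i32 - overlay_w as i32) / 2;
    let y = match pos {
        OverlayPosition::Top => monitor.y + margin as i32,
        OverlayPosition::Bottom => {
            monitor.y + monitor.height as i32 - overlay_h as i32 - margin as i32
        }
        OverlayPosition::Hidden => return None,
    };
    Some((x, y))
}
```

Including the monitor's own offset matters on multi-monitor setups, where the current monitor may not start at (0, 0).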

Phase 7: Paste System

  • Save current clipboard content (if preserving)
  • Set transcribed text to clipboard via arboard
  • Simulate keypress via enigo based on paste_method:
    • ctrl_v: Ctrl+V (Cmd+V on macOS)
    • shift_insert: Shift+Insert
    • ctrl_shift_v: Ctrl+Shift+V
    • clipboard_only: no keypress, just clipboard
  • Restore previous clipboard content (unless copy_to_clipboard is true)
  • Small delay between clipboard set and paste simulation (~50ms)
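
The mapping from paste_method to a key chord, and the decision whether to restore the clipboard, could be kept as pure functions separate from the enigo and arboard calls. Everything named here is hypothetical; in particular, the `Key` enum is a stand-in for enigo's key type, and skipping the restore for clipboard_only is an assumption (restoring would undo the only thing that mode does).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PasteMethod { CtrlV, ShiftInsert, CtrlShiftV, ClipboardOnly }

/// Stand-in for the keyboard library's key type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Key { Control, Command, Shift, Insert, V }

/// Keys to hold down together, in press order; None means no keypress.
pub fn paste_chord(method: PasteMethod, is_macos: bool) -> Option<Vec<Key>> {
    match method {
        // Cmd+V replaces Ctrl+V on macOS, as stated in the plan.
        PasteMethod::CtrlV if is_macos => Some(vec![Key::Command, Key::V]),
        PasteMethod::CtrlV => Some(vec![Key::Control, Key::V]),
        PasteMethod::ShiftInsert => Some(vec![Key::Shift, Key::Insert]),
        PasteMethod::CtrlShiftV => Some(vec![Key::Control, Key::Shift, Key::V]),
        PasteMethod::ClipboardOnly => None,
    }
}

/// Whether the saved clipboard content should be put back after pasting.
pub fn should_restore_clipboard(
    method: PasteMethod,
    copy_to_clipboard: bool,
    had_previous: bool,
) -> bool {
    method != PasteMethod::ClipboardOnly && !copy_to_clipboard && had_previous
}
```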

Phase 8: Audio Feedback

  • Bundle two short PCM blip sounds in the binary (via include_bytes!)
  • "Blip up" on recording start
  • "Blip down" on recording stop / transcription complete
  • Play via rodio on a separate thread (non-blocking)
  • Respect audio_feedback config flag
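
Decoding a bundled blip into samples might look like the sketch below. The plan does not state the PCM layout, so this assumes signed 16-bit little-endian mono; `decode_pcm_s16le` is a hypothetical helper, and handing the samples to rodio is not shown.

```rust
/// Converts raw signed 16-bit little-endian PCM bytes to f32 samples
/// in the range [-1.0, 1.0). A trailing odd byte is ignored.
pub fn decode_pcm_s16le(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(2)
        .map(|pair| i16::from_le_bytes([pair[0], pair[1]]) as f32 / 32768.0)
        .collect()
}

// In the binary, the input would come from something like
// `include_bytes!("../resources/blip_up.pcm")`, which embeds the file
// at compile time as a `&'static [u8; N]`.
```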

Phase 9: Coordinator + Integration

  • Wire all components together with channel-based message passing
  • Main thread: overlay window event loop (winit requires this)
  • Spawned threads/tasks: hotkey listener, audio recorder, transcriber
  • Coordinator receives events, drives state machine:
    Idle ──[hotkey press]──▶ Recording
    Recording ──[hotkey release/press]──▶ Transcribing
    Recording ──[cancel]──▶ Idle
    Transcribing ──[result]──▶ Pasting ──▶ Idle
    Transcribing ──[error]──▶ Error ──▶ Idle
    
  • Graceful shutdown on SIGINT (a tray "quit" action would hook in here later; tray integration is out of scope for v1)
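
The transitions listed above can be written as a pure function, which keeps the state machine testable without threads or windows. The enum and variant names are assumptions; `Done` is a hypothetical input that marks the end of the Pasting and Error states, which the diagram shows only as an unlabelled arrow back to Idle.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State { Idle, Recording, Transcribing, Pasting, Error }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode { PushToTalk, Toggle }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Input { HotkeyPress, HotkeyRelease, Cancel, TranscriptOk, TranscriptErr, Done }

/// Returns the next state; inputs that do not apply leave the state unchanged.
pub fn next_state(state: State, input: Input, mode: Mode) -> State {
    match (state, input, mode) {
        (State::Idle, Input::HotkeyPress, _) => State::Recording,
        // Push-to-talk stops on release; toggle stops on the second press.
        (State::Recording, Input::HotkeyRelease, Mode::PushToTalk) => State::Transcribing,
        (State::Recording, Input::HotkeyPress, Mode::Toggle) => State::Transcribing,
        (State::Recording, Input::Cancel, _) => State::Idle,
        (State::Transcribing, Input::TranscriptOk, _) => State::Pasting,
        (State::Transcribing, Input::TranscriptErr, _) => State::Error,
        (State::Pasting, Input::Done, _) | (State::Error, Input::Done, _) => State::Idle,
        (unchanged, _, _) => unchanged,
    }
}
```

Ignoring inapplicable inputs, rather than panicking, means a stray key repeat or a late cancel cannot wedge the daemon.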

Phase 10: Daemon IPC + Status

  • The running daemon listens on a local Unix domain socket (Linux/macOS) or named pipe (Windows) for status queries
  • Socket/pipe path: /tmp/mouth.sock (Linux/macOS), \\.\pipe\mouth (Windows)
  • mouth status connects and requests current state; daemon responds with JSON:
    {
      "version": "0.1.0",
      "state": "idle",
      "model": "parakeet-tdt-0.6b-v3",
      "accelerator": "directml",
      "uptime_secs": 3420
    }
    
  • If the daemon is not running, mouth status reports "Mouth is not running" and exits with code 1
  • Also used internally to prevent launching a second daemon instance (lock check)
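
The status payload could be modelled as a small struct that serialises to the JSON shape shown above. This is a std-only sketch with hypothetical names; the hand-written escaping is there only to keep the example dependency-free, and a JSON library would be the more natural choice in the real daemon.

```rust
/// Snapshot returned to `mouth status`; a hypothetical type.
#[derive(Debug, Clone, PartialEq)]
pub struct Status {
    pub version: String,
    pub state: String,
    pub model: String,
    pub accelerator: String,
    pub uptime_secs: u64,
}

/// Escapes quotes, backslashes, and control characters for a JSON string.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

impl Status {
    /// Serialises to a compact single-line JSON object.
    pub fn to_json(&self) -> String {
        format!(
            "{{\"version\":\"{}\",\"state\":\"{}\",\"model\":\"{}\",\"accelerator\":\"{}\",\"uptime_secs\":{}}}",
            json_escape(&self.version),
            json_escape(&self.state),
            json_escape(&self.model),
            json_escape(&self.accelerator),
            self.uptime_secs
        )
    }
}
```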

Phase 11: Polish + Distribution

  • Error handling: user-friendly messages for common failures (no mic, model not found, etc.)
  • Windows installer via cargo-wix or distribute as standalone .exe
  • Test on Windows 10/11 primarily
  • Test on Linux (X11 + Wayland) and macOS as secondary
  • Update CLAUDE.md with build/run/test instructions
  • Write user-facing README with setup instructions

Risks & Mitigations

| Risk | Impact | Mitigation |
|---|---|---|
| Parakeet v3 ONNX model compatibility with ort | Blocks core functionality | Test early in Phase 5; Parakeet v2 as fallback |
| rdev hotkey reliability on Windows | Broken UX | Test early in Phase 2; fallback to Win32 RegisterHotKey |
| Overlay focus stealing | Annoying | Use proper window flags; test with various foreground apps |
| Audio resampling quality | Poor transcription | Use rubato SincInterpolation (high quality) |
| Binary size with bundled ONNX Runtime | Large download | ONNX Runtime is ~20-40MB; acceptable for a single-binary tool |
| winit event loop blocking | Unresponsive | All heavy work on background threads; overlay is lightweight |

File Structure

mouth/
├── Cargo.toml
├── CLAUDE.md
├── README.md
├── plan.md
├── config.yaml.example
├── resources/
│   ├── blip_up.pcm          # bundled audio feedback
│   └── blip_down.pcm
└── src/
    ├── main.rs               # CLI entry, clap setup
    ├── config.rs             # Config struct, YAML load/save, defaults
    ├── hotkey.rs             # Global hotkey listener (rdev)
    ├── recorder.rs           # Audio capture (cpal + rubato + VAD)
    ├── vad.rs                # Silero VAD wrapper
    ├── transcriber.rs        # ONNX inference, model loading, GPU detection
    ├── model_cache.rs        # HuggingFace download, cache management
    ├── overlay.rs            # Minimal overlay window (winit + softbuffer)
    ├── paste.rs              # Clipboard + paste simulation
    ├── audio_feedback.rs     # Blip sounds via rodio
    ├── coordinator.rs        # State machine, channel hub
    └── cli/
        ├── mod.rs
        ├── run.rs            # `mouth run` handler
        ├── config_cmd.rs     # `mouth config` TUI
        ├── models_cmd.rs     # `mouth models` handler
        └── status_cmd.rs     # `mouth status` handler

Not In Scope (v1)

  • LLM post-processing of transcriptions
  • Transcription history / database
  • Multiple model support (v1 is Parakeet v3 only; the architecture supports adding more later)
  • Auto-submit (Enter after paste)
  • Multi-language UI
  • Tray icon / system tray integration
  • Translate-to-English mode