The best voice-to-text app for me

For the last few weeks, I have been using a voice-to-text setup built by Claude Code in a session that lasted about an hour. It has:

  1. Three speed modes. A fast one that returns in under a second, a medium one with a quick LLM-based correction step, and a slow one with near-perfect transcription and enhancements applied.
  2. Integration with Zellij (a terminal workspace manager like tmux). Transcription goes to the right tab/pane even if I have switched focus, and it handles multiple concurrent transcriptions.
  3. Custom post-processing prompts for different contexts, triggered with different shortcuts. Post-processing is the step where the raw transcription goes through an LLM that fixes errors and reformats the text.
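
The three modes can be thought of as pipelines that share steps. This is a minimal sketch of that idea, not the app's actual code: the function names, mode keys, and the stub behaviour are all assumptions for illustration.

```python
# Sketch of three-mode dispatch: each mode is a pipeline of steps keyed
# by name. The stubs below stand in for real transcription/correction
# calls and just tag their input so the composition order is visible.

def transcribe_local(audio):          # fast path: on-device model only
    return f"local({audio})"

def correct_fast_llm(text):           # medium path: quick LLM correction pass
    return f"fast_fix({text})"

def transcribe_cloud(audio):          # slow path: high-accuracy cloud model
    return f"cloud({audio})"

def correct_slow_llm(text):           # slow path: thorough LLM correction pass
    return f"slow_fix({text})"

# Fast is local-only, medium adds a quick correction step, and slow swaps
# in the cloud transcriber plus a stronger corrector.
MODES = {
    "fast":   [transcribe_local],
    "medium": [transcribe_local, correct_fast_llm],
    "slow":   [transcribe_cloud, correct_slow_llm],
}

def run(mode, audio):
    result = audio
    for step in MODES[mode]:
        result = step(result)
    return result
```

The appeal of this shape is that adding a fourth mode, or swapping one model for another, is a one-line change to the table rather than a new code path.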

My main use case is coding agents. I usually have several Claude Code instances running in parallel, and the way I engage with each one varies. Sometimes I am firing off a quick command, basic English, nothing specialized, and the fast local mode is fine. Sometimes I am dictating a detailed prompt with a lot of context and domain-specific vocabulary. In those cases I use the slow mode, speak, switch to another task/agent, and fifteen seconds later the transcription lands in the right pane and is auto-submitted.
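
The "lands in the right pane even after I switch focus" behaviour could be built on Zellij's CLI. The sketch below uses two real Zellij actions, `go-to-tab` and `write-chars`; everything else (remembering the tab index at recording start, the function shape) is my guess at one way to wire it, not how the app necessarily does it.

```python
import subprocess

def deliver(text, tab_index, run=subprocess.run):
    """Refocus the Zellij tab that was active when recording started,
    then type the finished transcription into its focused pane.

    `zellij action go-to-tab` and `zellij action write-chars` are real
    Zellij CLI actions; capturing `tab_index` when recording begins is
    an assumption about how the bookkeeping might work.
    """
    commands = [
        ["zellij", "action", "go-to-tab", str(tab_index)],
        ["zellij", "action", "write-chars", text],
    ]
    for cmd in commands:
        run(cmd, check=True)
    return commands
```

Injecting `run` keeps the routing logic testable without a live Zellij session; auto-submitting would just mean appending a newline (or a separate `write 13` action for carriage return) after the text.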

There has been a great proliferation of voice-to-text apps. Many are proper businesses with millions in revenue. I started with SuperWhisper (~$10/month), which worked fairly well (apart from the rare cases where I lost a long recording and had to repeat myself), but switched to VoiceInk ($25) for its one-time payment model. I also tried Handy.computer, which is free and open source but limited to local models. Each one was good enough for a while, but I kept running into tiny inconveniences that added up over time.

This particular implementation started as an experiment to see if it could be done (and I had spare usage left before my weekly Claude Code quota reset). I came in with specific opinions about which models and services to use; Claude Code would likely have made different decisions on its own. But once I specified what I wanted, it built the whole thing and delivered most of the features on my wishlist (mostly one-shot, though there was some struggle to get the paste behaviour exactly right).

Under the hood

Local audio capture and processing are handled by FluidAudio (something I explicitly asked for in the initial prompt). The library does a lot of the heavy lifting here, and it is also used by many other voice-to-text apps. The fast mode runs Nvidia's Parakeet v3 locally, which is good enough for short commands and everyday vocabulary.

The medium mode still uses Parakeet for the initial transcription but then sends the raw text to GPT OSS 120B running on Cerebras for corrections. The correction prompt knows the context I work in, so it can fix things like "cloud code" back to "Claude Code." Cerebras inference is fast enough that the whole round trip still feels close to instant.
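
Cerebras exposes an OpenAI-compatible chat API, so the correction step can be an ordinary chat-completions call. In this sketch the endpoint shape follows their public docs, but the model id, the glossary, and the prompt wording are my assumptions, not the app's actual prompt.

```python
import json
import urllib.request

# Domain terms the corrector should know about; the real app's list and
# prompt wording are unknown, so this is illustrative only.
GLOSSARY = ["Claude Code", "Zellij", "FluidAudio", "Parakeet"]

def build_messages(raw):
    """Wrap the raw transcription in a context-aware correction prompt."""
    system = (
        "Fix speech-to-text errors in the user's message. Preserve the "
        "meaning and phrasing; only correct misheard words. Domain terms "
        "that often get mangled: " + ", ".join(GLOSSARY) + "."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": raw},
    ]

def correct(raw, api_key):
    # Assumed endpoint and model id for Cerebras' OpenAI-compatible API;
    # verify both against the current Cerebras documentation.
    body = json.dumps({
        "model": "gpt-oss-120b",
        "messages": build_messages(raw),
    }).encode()
    req = urllib.request.Request(
        "https://api.cerebras.ai/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

With a glossary in the system prompt, a raw transcription like "cloud code" gives the model enough context to emit "Claude Code" instead.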

The slow mode uses ElevenLabs Scribe V2 for the transcription itself (to my knowledge, the best openly available transcription model) and Opus 4.6 for the corrections. This takes longer, but the results are noticeably better for complex, detailed prompts. The ElevenLabs basic plan is ~$5/month, less than what most mainstream applications charge.
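
The slow-mode transcription leg is a single upload to ElevenLabs' speech-to-text endpoint. The URL and `xi-api-key` header below match ElevenLabs' public API; the Scribe V2 model id is an assumption (check their docs for the exact string), and a real client would stream the audio file rather than pass a path.

```python
def build_scribe_request(audio_path, api_key):
    """Assemble the pieces of an ElevenLabs speech-to-text call.

    Returns (url, headers, form_fields, files) for an HTTP client to send
    as a multipart POST. The model id is an assumed placeholder.
    """
    url = "https://api.elevenlabs.io/v1/speech-to-text"
    headers = {"xi-api-key": api_key}
    form = {"model_id": "scribe_v2"}     # assumed id; verify against docs
    files = {"file": audio_path}         # path to the recorded audio clip
    return url, headers, form, files
```

The returned text would then be handed to the same correction step as the medium mode, just with a stronger model doing the fixing.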

It fits how I work instead of the other way around.