Vibe coding with vibevoice: local speech-to-text for any app
The Rise of Vibe Coding
Recently, Andrej Karpathy introduced a fascinating concept called “vibe coding” - a new programming paradigm where you “fully give in to the vibes, embrace exponentials, and forget that the code even exists.” This approach has emerged due to the increasing capabilities of Large Language Models (LLMs) and voice-to-text technologies.
As Karpathy describes it:
“I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like ‘decrease the padding on the sidebar by half’ because I’m too lazy to find it.”
This resonated deeply with me. The idea of coding through natural conversation, letting the tools handle the technical details while you focus on the creative flow, seemed like the future of programming. But there was a catch - SuperWhisper is Mac-only, and I'm on Ubuntu.
(Video: a quick demonstration of vibe coding in action using vibevoice.)
Enter Vibevoice
That’s where Vibevoice comes in. Inspired by Vlad’s whisper-keyboard project and Karpathy’s vibe coding approach, I created Vibevoice - a tool that brings the power of local voice-to-text to any application on your system.
How It Works
Vibevoice combines several powerful technologies:
- Faster Whisper: an optimized implementation of OpenAI’s Whisper model that runs locally on your GPU. No API keys, no cloud dependencies - just fast, accurate transcription.
- Local Server Architecture: when you start Vibevoice, it launches a local FastAPI server that loads the Whisper model into GPU memory. This server handles all transcription requests with minimal latency (see the server sketch after this list).
- Global Keyboard Integration: using pynput, Vibevoice listens globally for a trigger key (right Ctrl by default), so it works in any application. Hold it down, speak your thoughts, and release to see your words appear wherever your cursor is.
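To make that concrete, here is a minimal sketch of what such a server can look like, assuming the faster-whisper and FastAPI packages. The model size, route name, and response shape are illustrative assumptions, not necessarily what vibevoice itself uses:

```python
# Minimal server sketch: a FastAPI app that keeps a faster-whisper model
# resident on the GPU and transcribes uploaded audio files.
import tempfile

from fastapi import FastAPI, File, UploadFile
from faster_whisper import WhisperModel

app = FastAPI()

# Load the model once at startup so each request only pays inference cost.
# "base.en" is an illustrative choice; larger models trade speed for accuracy.
model = WhisperModel("base.en", device="cuda", compute_type="float16")

@app.post("/transcribe")
async def transcribe(file: UploadFile = File(...)):
    # Persist the upload to a temporary file that faster-whisper can read.
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        tmp.write(await file.read())
        tmp.flush()
        segments, _info = model.transcribe(tmp.name)
        text = " ".join(segment.text.strip() for segment in segments)
    return {"text": text}
```

Run it with, for example, uvicorn server:app (assuming the file is saved as server.py), and the model stays resident in GPU memory between requests, which is what keeps per-request latency low.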
The technical flow is simple (a client-side sketch follows this list):
- Hold the trigger key → Start recording audio
- Release the key → Save audio and send to local server
- Server transcribes using Faster Whisper → Returns text
- Text is automatically typed at your cursor position
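Below is an illustrative sketch of that client loop, assuming sounddevice for audio capture, requests for the HTTP call, and the hypothetical http://localhost:8000/transcribe endpoint from the server sketch above; vibevoice's actual code may differ in the details:

```python
# Minimal client sketch: hold right Ctrl to record, release to transcribe
# and type. The endpoint URL and library choices are assumptions.
import io
import wave

import numpy as np
import requests
import sounddevice as sd
from pynput import keyboard

SAMPLE_RATE = 16000
SERVER_URL = "http://localhost:8000/transcribe"  # hypothetical endpoint

frames = []    # raw audio chunks collected while the key is held
stream = None  # active sounddevice input stream, if any
typer = keyboard.Controller()

def audio_callback(indata, frame_count, time_info, status):
    frames.append(indata.copy())

def on_press(key):
    global stream
    if key == keyboard.Key.ctrl_r and stream is None:
        frames.clear()
        stream = sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                                dtype="int16", callback=audio_callback)
        stream.start()

def on_release(key):
    global stream
    if key == keyboard.Key.ctrl_r and stream is not None:
        stream.stop()
        stream.close()
        stream = None
        if not frames:
            return
        # Pack the recorded chunks into an in-memory WAV file.
        buf = io.BytesIO()
        with wave.open(buf, "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2)  # 16-bit samples
            wav.setframerate(SAMPLE_RATE)
            wav.writeframes(np.concatenate(frames).tobytes())
        buf.seek(0)
        # Ask the local server for a transcription, then type it at the cursor.
        response = requests.post(
            SERVER_URL, files={"file": ("audio.wav", buf, "audio/wav")})
        typer.type(response.json()["text"])

# Listen for the trigger key globally, in any application.
with keyboard.Listener(on_press=on_press, on_release=on_release) as listener:
    listener.join()
```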
Vibe Coding in Practice
Here’s how a typical vibe coding session might look with Vibevoice:
- Open your favorite IDE (I use Windsurf.ai most of the time these days)
- Start Vibevoice in a terminal
- When you want to code something:
  - Hold right Ctrl, say “add a function to calculate the Fibonacci sequence”
  - Release, and watch your words appear
  - Let the AI suggest the implementation
  - Hold right Ctrl again, say “add memoization to make it more efficient”
  - Release, and let the AI handle the optimization (a sketch of the resulting code follows below)
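For illustration only, the assistant's answer to those two prompts might end up looking something like this (a sketch, not actual output from any model):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fibonacci(n: int) -> int:
    """Return the n-th Fibonacci number, memoized so each n is computed once."""
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)
```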
The beauty of this approach is its seamless integration into any workflow. Whether you’re:
- Writing code with AI
- Describing bugs in GitHub issues
- Commenting your code
- Chatting on Discord
- Writing documentation
Just hold, speak, release - your thoughts become text instantly.
The Future of Programming
Karpathy’s vision of vibe coding represents a fundamental shift in how we interact with computers. It’s not just about writing code faster; it’s about making the process more natural, more human. As he puts it:
“I’m building a project or webapp, but it’s not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.”
Vibevoice is my contribution to this future - a bridge between natural human communication and the digital world. By removing the friction of typing, it helps maintain the flow state that’s so crucial for creative work.
Try It Yourself
Want to experience vibe coding? Vibevoice is open source and easy to set up. You’ll need:
- A CUDA-capable GPU
- Python 3.12
- A microphone
Then clone the repository, install the dependencies, and start the script - the README on GitHub has the exact commands.
Hold right Ctrl, speak your mind, and welcome to the future of programming.
Want to contribute or learn more? Check out Vibevoice on GitHub or visit my blog at paepper.com/blog.