KittenTTS Just Proved You Do Not Need a GPU for Production-Quality Text-to-Speech — Here Is How to Set It Up in Under Ten Minutes

I was up at 1:30 AM on a Wednesday — the kind of hour where you tell yourself "just one more tab" and then suddenly you've cloned three repos. My buddy Priya had Slacked me a link to a Hacker News post with 480 upvotes, and the title alone made me close Spotify: "Three new Kitten TTS models — smallest less than 25MB."

Twenty-five megabytes. That is smaller than most podcast intros. And it does text-to-speech. On a CPU. Without sounding like a GPS from 2009.

I had to try it.

What Exactly Is KittenTTS?

KittenTTS is an open-source, lightweight text-to-speech library built on ONNX Runtime. It comes in three sizes:

  • Kitten-15M — 25 MB on disk (int8 quantized). The one that broke Hacker News.
  • Kitten-40M — about 45 MB. The sweet spot, if you ask me.
  • Kitten-80M — roughly 80 MB. The "I want studio quality and I have the disk space" option.

All three run entirely on CPU. No CUDA drivers, no ROCm nightmares, no cloud GPU rental. You pip install it and go.

It ships with eight built-in voices: Bella, Jasper, Luna, Bruno, Rosie, Hugo, Kiki, and Leo. I have opinions about all of them, and I will share those opinions whether you want them or not.

Why I Was Skeptical (And Why You Should Be Too)

Look, I have been burned before. In 2024, I spent an entire Saturday setting up Coqui TTS, and by the time I got it producing audio that did not sound like a microwave reading poetry, I had consumed four cups of coffee and lost the will to live. Then Coqui went through their... situation... and I moved to Piper TTS, which is genuinely good but feels like it was designed for people who enjoy compiling things.

So when someone tells me "25MB model, CPU-only, sounds great," my immediate reaction is "sure, and my landlord says he will fix the dishwasher this weekend."

But I tried it anyway.

Step-by-Step: Installing KittenTTS (For Real, Ten Minutes)

Prerequisites

You need Python 3.9 or newer. That is genuinely it. If you are reading a tech blog in 2026 and do not have Python installed, we need to have a different conversation.

Step 1: Install the Package

Open your terminal and run:

pip install kitten-tts

This pulls in the ONNX runtime and the core library. On my M2 MacBook Air, it took about 40 seconds. On my ancient ThinkPad running Ubuntu, maybe 90 seconds. Either way, faster than making toast.

Step 2: Download a Model

KittenTTS uses Hugging Face for model hosting. The 15M model downloads automatically on first use, but if you want to pre-download:

python -c "from kitten_tts import KittenTTS; tts = KittenTTS(model='kitten-15m')"

That is 25 MB. I timed it on my home Wi-Fi (which my ISP swears is 200 Mbps but is more like 80 on a good day): four seconds.

Step 3: Generate Your First Audio

from kitten_tts import KittenTTS

tts = KittenTTS(model="kitten-15m")
tts.speak("Hello world. This is KittenTTS running on a CPU.", voice="bella", output="hello.wav")

Run it. Wait approximately 1.5 seconds. Open hello.wav.

I actually said "what" out loud when I heard it. The Bella voice is... good. Not "good for a 25MB model" — actually good. Clear, natural cadence, proper emphasis on punctuation. My colleague Derek walked by my desk, heard the output, and asked if I was listening to a podcast.
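Before wiring that output into anything else, it is worth sanity-checking the WAV file itself. Here is a stdlib-only checker; so the snippet runs even without KittenTTS installed, it writes a short silent clip as a stand-in for the hello.wav you just generated (the helper name wav_summary is mine):

```python
import wave

def wav_summary(path):
    """Return (sample_rate, channels, duration_seconds) for a WAV file."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        frames = w.getnframes()
        return rate, w.getnchannels(), frames / rate

# Stand-in clip: half a second of mono 16-bit silence at 22050 Hz.
# In practice, point wav_summary at the hello.wav KittenTTS produced.
with wave.open("hello.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(22050)
    w.writeframes(b"\x00\x00" * 11025)

rate, channels, seconds = wav_summary("hello.wav")
print(rate, channels, round(seconds, 2))  # 22050 1 0.5
```

A check like this catches the classic failure mode where generation "succeeds" but produces a zero-length or wrong-rate file.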

Step 4: Try All Eight Voices

voices = ["bella", "jasper", "luna", "bruno", "rosie", "hugo", "kiki", "leo"]
for voice in voices:
    tts.speak(
        "The quick brown fox jumps over the lazy dog.",
        voice=voice,
        output=f"sample_{voice}.wav"
    )

Here is my totally subjective, absolutely biased ranking:

  1. Luna — warm, slightly British, the voice I would want reading me a bedtime story
  2. Bella — professional, clean, perfect for product demos
  3. Hugo — the guy who narrates nature documentaries on streaming services you have never heard of
  4. Jasper — friendly, approachable, "your cool professor" energy
  5. Leo — deep, authoritative, the "movie trailer" voice
  6. Rosie — bubbly, great for assistant-style applications
  7. Bruno — serviceable, a bit flat if I am honest
  8. Kiki — I wanted to like Kiki more, but something about the pacing feels off

Adjusting Speed and Other Parameters

You can control playback speed with the speed parameter:

tts.speak("This is faster.", voice="luna", speed=1.3, output="fast.wav")
tts.speak("This is slower.", voice="luna", speed=0.8, output="slow.wav")

Speed 1.0 is the default and sounds natural. Going above 1.4 starts to sound robotic. Below 0.7 sounds like someone waking up from anesthesia. I learned both of these limits empirically at 2 AM, so you do not have to.
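Those limits suggest a guard worth putting in front of any user-facing speed control. The bounds below are my empirical ones from the 2 AM testing, not anything the project documents, and safe_speed is my own helper name:

```python
# Empirical usable band from my testing: above ~1.4 turns robotic,
# below ~0.7 turns sluggish. These are NOT official limits.
MIN_SPEED, MAX_SPEED = 0.7, 1.4

def safe_speed(requested: float) -> float:
    """Clamp a user-supplied speed into the band that still sounds natural."""
    return max(MIN_SPEED, min(MAX_SPEED, requested))

print(safe_speed(2.0))   # 1.4
print(safe_speed(0.5))   # 0.7
print(safe_speed(1.0))   # 1.0
```

Then pass the result through: tts.speak(text, voice="luna", speed=safe_speed(user_speed), output="out.wav").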

KittenTTS vs Piper TTS: The Comparison Nobody Asked For

I ran both on the same machine (M2 MacBook Air, 8GB RAM) with the same input text — a 500-word paragraph from a Reuters article.

| Metric | KittenTTS (15M) | KittenTTS (40M) | Piper TTS (medium) |
|---|---|---|---|
| Model size | 25 MB | 45 MB | ~75 MB |
| Generation time (500 words) | 3.2 sec | 5.1 sec | 4.8 sec |
| RAM usage | ~120 MB | ~210 MB | ~350 MB |
| Voice naturalness (my opinion) | 7.5/10 | 8.5/10 | 8/10 |
| Setup difficulty | Easy | Easy | Medium |
| Voice cloning | Not yet | Not yet | No |

The 15M model is faster and smaller than Piper, with slightly less naturalness. The 40M model trades a bit of speed for audio quality that I genuinely think edges out Piper's medium model. And the setup? Not even close. KittenTTS is a pip install. Piper requires downloading voice files, figuring out which JSON config matches which ONNX model, and possibly sacrificing a goat to the phoneme gods.
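If you want to reproduce the timing column on your own machine, a minimal harness is all it takes. The synthesize argument here is a stand-in for whichever engine you are timing; the fake engine at the bottom just lets the sketch run anywhere:

```python
import time

def benchmark(synthesize, text, runs=3):
    """Time a TTS call a few times and report the best wall-clock run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        synthesize(text)
        timings.append(time.perf_counter() - start)
    return min(timings)  # best-of-N smooths out OS scheduling noise

# Stand-in engine; replace with something like
#   lambda t: tts.speak(t, voice="bella", output="bench.wav")
fake_engine = lambda text: sum(len(word) for word in text.split())

best = benchmark(fake_engine, "The quick brown fox jumps over the lazy dog.")
print(f"best of 3: {best:.4f}s")
```

Best-of-N rather than average is deliberate: on a laptop, the slowest run usually measures your browser, not the model.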

Where KittenTTS Actually Makes Sense

Edge Devices and IoT

A Raspberry Pi 4 can run the 15M model comfortably. I tested it on a Pi 4 (4GB model) that I normally use as a Pi-hole (if you are curious about the full cost breakdown for a local voice setup, I wrote about the real cost of running a local voice assistant in 2026), and generation took about 6 seconds for a paragraph. That is viable for smart home announcements, custom voice assistants, or that robot I keep telling my partner I am going to build "someday."
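For smart home announcements specifically, the pattern that works on a Pi is a single worker that serializes requests, so the 15M model never runs two generations at once on those four cores. This is a sketch with the synthesis call stubbed out; announcement_worker and the queue wiring are my own naming, not part of KittenTTS:

```python
import queue
import threading

def announcement_worker(jobs, synthesize):
    """Drain announcement texts one at a time so generations never overlap."""
    while True:
        text = jobs.get()
        if text is None:          # sentinel: shut the worker down
            break
        synthesize(text)
        jobs.task_done()

jobs = queue.Queue()
spoken = []
# Stand-in for something like
#   lambda t: tts.speak(t, voice="bella", output="announce.wav")
worker = threading.Thread(target=announcement_worker, args=(jobs, spoken.append))
worker.start()

jobs.put("Laundry is done.")
jobs.put("Front door opened.")
jobs.put(None)
worker.join()
print(spoken)  # ['Laundry is done.', 'Front door opened.']
```

Any part of the house can jobs.put() a message without caring whether the Pi is mid-generation.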

Privacy-Sensitive Applications

Everything runs locally. No audio leaves your machine. No API keys. No per-character billing. If you are building something for healthcare, education, or any context where sending audio to a cloud API makes your compliance officer break out in hives, this is your answer.

Offline Applications

Download the model once. It works on an airplane, in a bunker, or in my apartment when Comcast decides it is maintenance time again (every other Tuesday, apparently).

What KittenTTS Cannot Do (Yet)

I want to be honest here because review sites that only say nice things are about as useful as a screen door on a submarine:

  • No voice cloning. You cannot feed it a sample of your voice and get output. Coqui TTS did this (when it was alive), and some commercial APIs offer it. KittenTTS is not there yet.
  • No SSML support. You cannot mark up pauses, emphasis, or pronunciation hints. You get what the model gives you.
  • Limited languages. English only right now. Their roadmap mentions multilingual support, but no timeline.
  • APIs may change. They are explicit about this being a "developer preview." Do not build your production pipeline around the current function signatures.
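Given that last caveat, I keep the library behind a thin adapter so a signature change only hits one file. This is a sketch: VoiceEngine is my own wrapper, and the inner call simply mirrors the current speak signature shown earlier:

```python
class VoiceEngine:
    """Thin wrapper so the rest of the codebase never imports kitten_tts directly.

    If the developer-preview API changes, only this class needs updating.
    """

    def __init__(self, backend, default_voice="luna"):
        # backend is anything with the current speak(text, voice=..., output=...)
        # shape: a KittenTTS instance in production, a stub in tests.
        self._backend = backend
        self._voice = default_voice

    def say(self, text, output_path):
        self._backend.speak(text, voice=self._voice, output=output_path)
        return output_path

# Stub backend so the sketch runs without the library installed.
class _StubBackend:
    def __init__(self):
        self.calls = []
    def speak(self, text, voice, output):
        self.calls.append((text, voice, output))

stub = _StubBackend()
engine = VoiceEngine(stub)
engine.say("Hello world.", "hello.wav")
print(stub.calls)  # [('Hello world.', 'luna', 'hello.wav')]
```

The stub doubles as a test seam: your CI can assert the right text and voice were requested without ever generating audio.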

The Bigger Picture: Why Tiny Models Matter

Here is what I keep thinking about. In 2023, decent TTS meant either paying per character to a cloud API or running a 2GB+ model on a GPU. In 2024, Piper brought it down to ~75MB but required some setup overhead. Now it is 2026, and a 25MB model that fits on a microcontroller sounds better than what Fortune 500 companies were paying thousands a month for three years ago.

That trajectory is wild. And it is not just TTS — we are seeing the same compression across image generation, translation, and speech recognition. My colleague even showed that a $700 MacBook can beat a cloud server on data processing now. The "you need a data center" era for inference is ending faster than anyone predicted.

If I were building a product today that needed voice output — an accessibility tool, a language learning app, a game, whatever — I would start with KittenTTS and only move to a cloud API if I hit a wall. The cost is zero, the latency is local, and the quality has crossed the "good enough" threshold for most use cases.

Getting Started Checklist

  • ✅ Python 3.9+
  • ✅ pip install kitten-tts
  • ✅ Pick a model: 15M for speed, 40M for quality, 80M for "I want the best"
  • ✅ Pick a voice: start with Luna or Bella
  • ✅ Adjust speed if needed (1.0 is fine for most)
  • ✅ Check the GitHub repo for updates — this project is moving fast

And if you are the kind of person who clones repos at 1:30 AM because a Hacker News link looked interesting — yeah, you are going to like this one.

Found this helpful?

Subscribe to our newsletter for more in-depth reviews and comparisons delivered to your inbox.