Hacker News

Show HN: Three new Kitten TTS models – smallest less than 25MB

I created a CLI wrapper for Kitten TTS: https://github.com/newptcai/purr

BTW, it seems that kitten (the Python package) has the following chain of dependencies: kittentts → misaki[en] → spacy-curated-transformers

So if you install it directly via uv, it will pull torch and NVIDIA CUDA packages (several GB), which are not needed to run kitten.
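One unverified workaround, assuming the ONNX models really only need a lightweight runtime (check the package's actual imports first; the exact dependency list below is a guess based on the pip log elsewhere in this thread):

```shell
# Unverified sketch: install kittentts without resolving its full
# dependency tree, then hand-pick the lightweight runtime pieces.
# The package list may be incomplete; verify against its real imports.
uv pip install --no-deps kittentts
uv pip install onnxruntime num2words espeakng_loader
```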

by dawdler-purge1773965079
What I love about OpenClaw is that I was able to send it a message on Discord with just this github URL and it started sending me voice messages using it within a few minutes. It also gave me a bunch of different benchmarks and sample audio.

I'm impressed with the quality given the size. I don't love the voices, but it's not bad. Running on an Intel 9700 CPU, it's about 1.5x realtime with the 80M model. It wasn't any faster on a 3080 GPU, though.

by kevin421773938441
I created a demo running in the browser, on your device: https://next-voice.vercel.app
by g588928811774024275
I was playing around a bit, and for its size it's very impressive. It just has issues pronouncing numbers. I tried to have it generate "Startup finished in 135 ms."

I didn't expect it to pronounce 'ms' correctly, but the number sounded just like noise. Eventually I got an acceptable result for the string "Startup finished in one hundred and thirty five seconds."
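Pre-normalizing numerals before they reach the model sidesteps this; the pip log elsewhere in this thread shows kittentts already depends on num2words for exactly that. Here's a self-contained stand-in (my own sketch, numbers under 1000 only) showing the idea:

```python
import re

# Minimal digit-to-words spell-out (0-999), a stand-in for the
# num2words package that kittentts already depends on.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen",
        "fourteen", "fifteen", "sixteen", "seventeen", "eighteen",
        "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
        "seventy", "eighty", "ninety"]

def spell(n: int) -> str:
    if n < 20:
        return ONES[n]
    if n < 100:
        return TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else "")
    if n < 1000:
        head = ONES[n // 100] + " hundred"
        rest = n % 100
        return head + (" and " + spell(rest) if rest else "")
    return str(n)  # leave larger numbers untouched in this sketch

def normalize_numbers(text: str) -> str:
    # Replace each run of digits so the TTS front end never sees numerals.
    return re.sub(r"\d+", lambda m: spell(int(m.group())), text)

print(normalize_numbers("Startup finished in 135 ms."))
# Startup finished in one hundred and thirty-five ms.
```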

by __fst__1773959388
A very clear improvement from the first set of models you released some time ago. I'm really impressed. Thanks for sharing it all.
by daneel_w1773951162
Very cool :) Looking forward to trying it out.

Maybe a dumb and slightly tangential question (I don't mean this as a criticism!), but why not release a command-line executable?

Even the API looks like what you'd see in a manpage.

I get that it wouldn't be too much work for a user to build something like that; I'm just curious what the thought process is.

by geokon1773989812
You should post examples comparing the four models you released: the same text spoken by each.
by ks20481773938816
I'd love to see a monolingual Japanese model sometime in the future. Qwen3-TTS works for Japanese in general, but from time to time it mixes in some Mandarin, making it unusable.
by _hzw1773964176
Good on-device TTS is an amazing accessibility tool. Thank you for building this. Way too many devices that need it rely on online services; this is much preferred.
by jacquesm1773974187
They sound like cartoon voices... but I really like them. I could listen to a book with those.
by nsnzjznzbx1773953230
I ran the install instructions and it took 7.1GB of deps; what do you mean, "tiny"?
by PunchyHamster1773959831
The size/quality tradeoff here is interesting. 25MB for a TTS model that's usable is a real achievement, but the practical bottleneck for most edge deployments isn't model size -- it's the inference latency on low-power hardware and the audio streaming architecture around it. Curious how this performs on something like a Raspberry Pi 4 for real-time synthesis. The voice quality tradeoff at that size usually shows up most in prosody and sentence-final intonation rather than phoneme accuracy.
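For anyone benchmarking this on a Pi: real-time factor is just audio seconds produced per wall-clock second. A generic helper for measuring it (the `synthesize` callable and the 24 kHz sample rate are placeholders, not the package's real API):

```python
import time

def real_time_factor(synthesize, text, sample_rate=24000):
    # RTF = audio duration / synthesis wall-clock time; values above 1.0
    # mean the engine runs faster than real time. `synthesize` stands in
    # for whatever text->samples call the engine exposes; the 24 kHz
    # sample rate is an assumption, not a documented value.
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return (len(samples) / sample_rate) / elapsed
```

Run it over a few sentence lengths on the target hardware; prosody aside, anything consistently above ~1.0 is viable for offline use, while interactive use also needs low first-chunk latency.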
by bobokaytop1773994299
One of the core features I look for is expressive control.

Either in the form of API pitch/speed/volume parameters, for more deterministic control.

Or in expressive tags such as [coughs], [urgently], or [laughs in melodic ascending and descending arpeggiated gibberish babbles].

The 25MB model is amazingly good for being 25MB. How does it handle expressive tags?

by altruios1773938383
To the folks here and the Kitten team: I'm working on TTS for an application, and I'm looking for the best model on the latency/cost tradeoff at inference time. I'm currently settling for Gemini TTS, which allows for a lot of expressiveness, but at around 150ms per word it starts to hurt when the content is a few sentences.

My current best approach is wrapping Gemini Flash native audio and having the model speak the text I send it, which gets me end-to-end latency under a second.

Are there other models at this or better pricing I could be looking at?

by anilgulecha1774028114
There are a number of recent, good-quality, small TTS models.

If the author doesn't describe any details about the data, the training, a novel architecture, etc., I can only assume they took another model, did a little finetuning, and repackaged it as a new product.

by ks20481773939160
The GitHub readme doesn't list this: what data was this trained on? Was it the voices of the creators, or was it trained on data scraped from the internet or other archives?
by jamamp1773964575
The dependency chain issue is a real barrier for edge deployment. I've been running TTS models on a Raspberry Pi for a home automation project, and anything that pulls torch + CUDA makes the whole thing a non-starter. 25MB is genuinely exciting for that use case.

Curious about the latency characteristics, though. 1.5x realtime on a 9700 is fine for batch processing, but for interactive use you need first-chunk latency under 200ms or the conversation feels broken. Does anyone know if it supports streaming output, or is it full-utterance only?

The phoneme-based approach should help with pronunciation consistency too. The models I've tried that work on raw text tend to mispronounce technical terms unpredictably: the same word pronounced differently across runs.
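On the streaming question: even if the engine turns out to be full-utterance only, chunking at sentence boundaries brings first-audio latency down to one sentence's synthesis time. A minimal sketch (the `synthesize` callable is a placeholder for the engine's actual text-to-audio call, which I haven't verified):

```python
import re

def stream_sentences(text, synthesize):
    # Yield audio chunk-by-chunk at sentence boundaries so playback can
    # begin after the first sentence rather than the whole utterance.
    # `synthesize` is a stand-in for the engine's text->audio call.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if sentence:
            yield synthesize(sentence)
```

Fed into a playback queue, this hides the latency of every chunk except the first.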

by baibai0089891774004773
Great stuff. Is your team interested in the STT problem?
by boutell1773951175
Fingers crossed for a normal-sounding voice this time around. The cute Kitten voices are nice, but I want something I can take seriously when I'm listening to an audiobook.
by arcanemachiner1773962280
This is awesome, well done. I've been doing a lot of work with voice assistants; if you can replicate Qwen3-TTS-style voice cloning in this small form factor, you will be absolute legends!
by armcat1773945724
The example.py file says "it will run blazing fast on any GPU. But this example will run on CPU."

I couldn't locate how to run it on a GPU anywhere in the repo.
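Since the models are ONNX, GPU execution normally comes down to which execution providers the InferenceSession is created with. A sketch (this assumes onnxruntime-gpu is installed; the model path below is a placeholder, not the repo's actual file layout):

```python
import os

def pick_providers(prefer_gpu):
    # ONNX Runtime tries providers left to right, so CPU stays as the
    # fallback when CUDA is unavailable.
    providers = ["CUDAExecutionProvider"] if prefer_gpu else []
    return providers + ["CPUExecutionProvider"]

MODEL_PATH = "kitten_tts.onnx"  # placeholder; point at the real downloaded file
if os.path.exists(MODEL_PATH):
    import onnxruntime as ort
    session = ort.InferenceSession(MODEL_PATH,
                                   providers=pick_providers(prefer_gpu=True))
    print(session.get_providers())
```

Whether the package exposes a hook for passing providers through is a separate question; it may require constructing the session yourself.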

by pumanoir1773945579
How did you make a very small AI model (14M) sound more natural and expressive than even bigger models?
by swaminarayan1773974327
A lot of good small TTS models have appeared recently. Most seem to struggle hard on prosody, though.

Kokoro TTS, for example, has a very good Norwegian voice, but the rhythm and emphasis are often so out of whack that the generated speech is almost incomprehensible.

I haven't had time to check this model out yet; how does it fare here? And what's needed to improve models in this area, now that the voice part is more or less solved?

by magicalhippo1773941479
Did they train this on @lauriewired's voice? The demo video sounds exactly like her at 0:18
by stbtrax1773970736
A lot of these models struggle with short text strings, like "next button", which screen readers speak constantly.
by devinprater1773941920
How much work would it be to use the ONNX Runtime C++ API with this instead of Python? Is it a Claude-able amount of work?

The iOS version is Swift-based.

by fwsgonzo1773939060
Would an Android app of this be able to replace the built in tts?
by vezycash1773946378
Nice, but it's weird that neither "language" nor "English" is mentioned anywhere on the GitHub page; only from the "Release multilingual TTS" roadmap item could I guess it's probably English-only for now.
by spyder1774018859
I thought they were going to make kitten sounds instead of speech
by agnishom1773974689
Thanks for open sourcing this.

Is there any way to do a custom voice as a DIY, or do we need to go through you? If the latter, would you consider making a pricing page for purchasing a license/alternative voice? All but one of the voices are unusable in a business context.

by ilaksh1773938140
Only American voices? For some reason I'm only interested in Irish, British, or Welsh accents; American is a no.
by tim-projects1773990087
How long until I can buy this as a chip for my Arduino projects?
by amelius1773956453
I found they struggle with numbers. Give them a random four-digit number in a sentence and they fumble.
by Stevvo1773992612
Is this open-source or open-weights ML?
by pabs31773985705
This would be great as a JS package: 25MB is small enough that I think it'd be worth it (in-browser TTS is still pretty bad and varies by browser).
by DavidTompkins1773942489
Thanks for working on this!

Is there any way to get those running on iPhone ? I would love to have the ability for it to read articles to me like a podcast.

by great_psy1773938014
It's based on ONNX, so can I use it with transformers.js in the browser?
by sroussey1773982388
I'm still looking for the "perfect" setup to clone my voice and use it locally to send voice replies in Telegram via openclaw. Does anyone have such a setup?

I want to be my own personal assistant...

EDIT: I can provide it an RTX 3080 Ti.

by sschueller1773944348
Really cool to see innovation in terms of quality of tiny models. Great work!
by schopra9091773945219
Are there plans to output text alignment?
by gabrielcsapo1773947197
The <25MB figure is what stands out. Been wanting to add TTS to a few Next.js projects for offline/edge scenarios but model sizes have always made it impractical to ship.

At 25MB you can actually bundle it with the app. Going to test whether this works in a Vercel Edge Function context -- if latency is acceptable there it opens up a lot of use cases that currently require a round-trip to a hosted API.

by rsmtjohn1773991994
How noticeable is the difference in quality between the 4M model and the 80M model?
by erkoo1773999678
What's the actual install size for a working example? As with similar "tiny" projects, do these models actually require installing 1GB+ of dependencies?
by janice19991773945662
I'm thinking of giving a "voice" to my virtual pets (think Pokemon, but fewer than a dozen). The pets are made-up animals based on real ones, like Mouseier from Mouse (something like that). Is this possible?

TL;DR: generate a human-like voice based on an animal sound. Maybe it doesn't make sense.

by wiradikusuma1773941467
Is it English only?
by Tacite1773938200
This is great. Demo looks awesome.
by whitepaper271773946760
So, one thing I noticed, and this could easily be user error: if I set the text & voice in the example to:

  text ="""
  Hello world. This is Kitten TTS.
  Look, it's working!
  """

  voice = 'Luna'
On macOS, I get "Kitten TTS", but on Linux I get "Kit… TTS". Both OSes generate the same phonemes,

  Phonemes: ðɪs ɪz kˈɪʔn ̩ tˌiːtˌiːˈɛs ,
which makes me really confused as to where it's going off the rails on Linux, since from there it should just be invoking the model.

edit: it really helps to use the same model, facepalm. It's the 80M model, and it happens on both OSes. Wildly, the nano model gets it better? I'm going to join the Discord lol.

by deathanatos1773970632
What's the training data for this?
by pabs31773985571
Sounds amazing! Does it stream? Or is it so fast you don't need it to?
by exe341773952609
Wow, what an amazing feat. Congratulations!
by moralestapia1773953531
This is something I've been looking for (the <50MB models in particular). Unfortunately my feedback is as follows:

      Downloading https://github.com/KittenML/KittenTTS/releases/download/0.8.1/kittentts-0.8.1-py3-none-any.whl (22 kB)
    Collecting num2words (from kittentts==0.8.1)
      Using cached num2words-0.5.14-py3-none-any.whl.metadata (13 kB)
    Collecting spacy (from kittentts==0.8.1)
      Using cached spacy-3.8.11-cp314-cp314-win_amd64.whl.metadata (28 kB)
    Collecting espeakng_loader (from kittentts==0.8.1)
      Using cached espeakng_loader-0.2.4-py3-none-win_amd64.whl.metadata (1.3 kB)
    INFO: pip is looking at multiple versions of kittentts to determine which version is compatible with other requirements. This could take a while.
    ERROR: Ignored the following versions that require a different python version: 0.7.10 Requires-Python >=3.8,<3.13; 0.7.11 Requires-Python >=3.8,<3.13; 0.7.12 Requires-Python >=3.8,<3.13; 0.7.13 Requires-Python >=3.8,<3.13; 0.7.14 Requires-Python >=3.8,<3.13; 0.7.15 Requires-Python >=3.8,<3.13; 0.7.16 Requires-Python >=3.8,<3.13; 0.7.17 Requires-Python >=3.8,<3.13; 0.7.5 Requires-Python >=3.8,<3.13; 0.7.6 Requires-Python >=3.8,<3.13; 0.7.7 Requires-Python >=3.8,<3.13; 0.7.8 Requires-Python >=3.8,<3.13; 0.7.9 Requires-Python >=3.8,<3.13; 0.8.0 Requires-Python >=3.8,<3.13; 0.8.1 Requires-Python >=3.8,<3.13; 0.8.2 Requires-Python >=3.8,<3.13; 0.8.3 Requires-Python >=3.8,<3.13; 0.8.4 Requires-Python >=3.8,<3.13; 0.9.0 Requires-Python >=3.8,<3.13; 0.9.2 Requires-Python >=3.8,<3.13; 0.9.3 Requires-Python >=3.8,<3.13; 0.9.4 Requires-Python >=3.8,<3.13; 3.8.3 Requires-Python >=3.9,<3.13; 3.8.5 Requires-Python >=3.9,<3.13; 3.8.6 Requires-Python >=3.9,<3.13; 3.8.7 Requires-Python >=3.9,<3.14; 3.8.8 Requires-Python >=3.9,<3.14; 3.8.9 Requires-Python >=3.9,<3.14
    ERROR: Could not find a version that satisfies the requirement misaki>=0.9.4 (from kittentts) (from versions: 0.1.0, 0.3.0, 0.3.5, 0.3.9, 0.4.0, 0.4.4, 0.4.5, 0.4.6, 0.4.7, 0.4.8, 0.4.9, 0.5.0, 0.5.1, 0.5.2, 0.5.3, 0.5.4, 0.5.5, 0.5.6, 0.5.7, 0.5.8, 0.5.9, 0.6.0, 0.6.1, 0.6.2, 0.6.3, 0.6.4, 0.6.5, 0.6.6, 0.6.7, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.7.4)
    ERROR: No matching distribution found for misaki>=0.9.4

I realize that I can run multiple versions of Python on my system and use venv to manage them (or whatever equivalent is now trendy), but as I near retirement age, all these deep dependency nets required by modern software really depress me. Have you ever tried to build a Node app that hasn't been updated in 18 months? It can't be done. Old man yelling at cloud, I guess. shrugs
by tredre31773963207
25MB is impressive. What's the tradeoff vs the 80M model — is it mainly voice quality or does it also affect pronunciation accuracy on less common words?
by Remi_Etien1773942576