BTW, it seems that kitten (the Python package) has the following chain of dependencies: kittentts → misaki[en] → spacy-curated-transformers
So if you install it directly via uv, it will pull torch and NVIDIA CUDA packages (several GB), which are not needed to run kitten.
I'm impressed with the quality given the size. I don't love the voices, but it's not bad. Running on an intel 9700 CPU, it's about 1.5x realtime using the 80M model. It wasn't any faster running on a 3080 GPU though.
I didn't expect it to pronounciate 'ms' correctly, but the number sounded just like noise. Eventually I got an acceptable result for the string "Startup finished in one hundred and thirty five seconds.
Maybe a dumb and slightly tangential question, (I don't mean this as a criticism!) but why not release a command line executable?
Even the API looks like what you'd see in a manpage.
I get it wouldn't be too much work for a user to actually make something like that, I'm just curious what the thought process is
Either in the form of the api via pitch/speed/volume controls, for more deterministic controls.
Or in expressive tags such as [coughs], [urgently], or [laughs in melodic ascending and descending arpeggiated gibberish babbles].
the 25MB model is amazingly good for being 25MB. How does it handle expressive tags?
my current best approach is wrapping around gemini-flash native, and the model speaking the text i send it, which allows me end to end latency under a second.
are there other models at this or better pricing i can be looking at.
If the author doesn't describe some detail about the data, training, or a novel architecture, etc, I only assume they just took another one, do a little finetuning, and repackage as a new product.
curious about the latency characteristics though. 1.5x realtime on a 9700 is fine for batch processing but for interactive use you need first-chunk latency under 200ms or the conversation feels broken. does anyone know if it supports streaming output or is it full-utterance only?
the phoneme-based approach should help with pronunciation consistency too. the models i've tried that work on raw text tend to mispronounce technical terms unpredictably — same word pronounced differently across runs.
I couldn't locate how to run it on a GPU anywhere in the repo.
Kokoro TTS for example has a very good Norwegian voice but the rhythm and emphasizing is often so out of whack the generated speech is almost incomprehensible.
Haven't had time to check this model out yet, how does it fare here? What's needed to improve the models in this area now that the voice part is more or less solved?
The iOS version is Swift-based.
Is there any way to do a custom voice as a DIY? Or we need to go through you? If so, would you consider making a pricing page for purchasing a license/alternative voice? All but one of the voices are unusable in a business context.
Is there any way to get those running on iPhone ? I would love to have the ability for it to read articles to me like a podcast.
I want to be my own personal assistant...
EDIT: I can provide it a RTX 3080ti.
At 25MB you can actually bundle it with the app. Going to test whether this works in a Vercel Edge Function context -- if latency is acceptable there it opens up a lot of use cases that currently require a round-trip to a hosted API.
Tldr: generate human-like voice based on animal sound. Anyway maybe it doesn't make sense.
text ="""
Hello world. This is Kitten TTS.
Look, it's working!
"""
voice = 'Luna'
On macOS, I get "Kitten TTS", but on Linux, I get "Kit… TTS". Both OSes generate the same phonemes of, Phonemes: ðɪs ɪz kˈɪʔn ̩ tˌiːtˌiːˈɛs ,
which makes me really confused as to where it's going off the rails on Linux, since from there it should just be invoking the model.edit: it really helps to use the same model facepalm. It's the 80M model, and it happens on both OS. Wildly the nano gets it better? I'm going to join the Discord lol.
Downloading https://github.com/KittenML/KittenTTS/releases/download/0.8.1/kittentts-0.8.1-py3-none-any.whl (22 kB)
Collecting num2words (from kittentts==0.8.1)
Using cached num2words-0.5.14-py3-none-any.whl.metadata (13 kB)
Collecting spacy (from kittentts==0.8.1)
Using cached spacy-3.8.11-cp314-cp314-win_amd64.whl.metadata (28 kB)
Collecting espeakng_loader (from kittentts==0.8.1)
Using cached espeakng_loader-0.2.4-py3-none-win_amd64.whl.metadata (1.3 kB)
INFO: pip is looking at multiple versions of kittentts to determine which version is compatible with other requirements. This could take a while.
ERROR: Ignored the following versions that require a different python version: 0.7.10 Requires-Python >=3.8,<3.13; 0.7.11 Requires-Python >=3.8,<3.13; 0.7.12 Requires-Python >=3.8,<3.13; 0.7.13 Requires-Python >=3.8,<3.13; 0.7.14 Requires-Python >=3.8,<3.13; 0.7.15 Requires-Python >=3.8,<3.13; 0.7.16 Requires-Python >=3.8,<3.13; 0.7.17 Requires-Python >=3.8,<3.13; 0.7.5 Requires-Python >=3.8,<3.13; 0.7.6 Requires-Python >=3.8,<3.13; 0.7.7 Requires-Python >=3.8,<3.13; 0.7.8 Requires-Python >=3.8,<3.13; 0.7.9 Requires-Python >=3.8,<3.13; 0.8.0 Requires-Python >=3.8,<3.13; 0.8.1 Requires-Python >=3.8,<3.13; 0.8.2 Requires-Python >=3.8,<3.13; 0.8.3 Requires-Python >=3.8,<3.13; 0.8.4 Requires-Python >=3.8,<3.13; 0.9.0 Requires-Python >=3.8,<3.13; 0.9.2 Requires-Python >=3.8,<3.13; 0.9.3 Requires-Python >=3.8,<3.13; 0.9.4 Requires-Python >=3.8,<3.13; 3.8.3 Requires-Python >=3.9,<3.13; 3.8.5 Requires-Python >=3.9,<3.13; 3.8.6 Requires-Python >=3.9,<3.13; 3.8.7 Requires-Python >=3.9,<3.14; 3.8.8 Requires-Python >=3.9,<3.14; 3.8.9 Requires-Python >=3.9,<3.14
ERROR: Could not find a version that satisfies the requirement misaki>=0.9.4 (from kittentts) (from versions: 0.1.0, 0.3.0, 0.3.5, 0.3.9, 0.4.0, 0.4.4, 0.4.5, 0.4.6, 0.4.7, 0.4.8, 0.4.9, 0.5.0, 0.5.1, 0.5.2, 0.5.3, 0.5.4, 0.5.5, 0.5.6, 0.5.7, 0.5.8, 0.5.9, 0.6.0, 0.6.1, 0.6.2, 0.6.3, 0.6.4, 0.6.5, 0.6.6, 0.6.7, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.7.4)
ERROR: No matching distribution found for misaki>=0.9.4
I realize that I can run a multiple versions of python on my system, and use venv to managed them (or whatever equivalent is now trendy), but as I near retirement age all those deep dependencies nets required by modern software is really depressing me. Have you ever tried to build a node app that hasn't been updated in 18 months? It can't be done. Old man yelling at cloud I guess shrugs.