[HN Gopher] Show HN: Kitten TTS - 25MB CPU-Only, Open-Source TTS...
___________________________________________________________________
Show HN: Kitten TTS - 25MB CPU-Only, Open-Source TTS Model
Kitten TTS is an open-source series of tiny and expressive text-to-
speech models for on-device applications. We are excited to launch
a preview of our smallest model, which is less than 25 MB. This
model has 15M parameters. This release supports English text-to-
speech applications in eight voices: four male and four female. The
model is quantized to int8 + fp16, and it uses ONNX for runtime.
The model is designed to run literally anywhere, e.g. Raspberry Pi,
low-end smartphones, wearables, browsers, etc. No GPU required!
We're releasing this to give early users a sense of the latency and
voices that will be available in our next release (hopefully next
week). We'd love your feedback! Just FYI, this model is an early
checkpoint trained on less than 10% of our total data. We started
working on this because existing expressive OSS models require big
GPUs to run them on-device and the cloud alternatives are too
expensive for high frequency use. We think there's a need for
frontier open-source models that are tiny enough to run on edge
devices!
Author : divamgupta
Score : 916 points
Date : 2025-08-06 05:04 UTC (1 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
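The post's figures (15M parameters, under 25 MB, int8 + fp16) are roughly self-consistent. As a back-of-envelope sketch, assuming weight storage dominates the file size and ignoring container overhead, the size is about the parameter count times the bytes per parameter. The helper name below is made up for illustration:

```python
def model_size_mb(num_params: int, bytes_per_param: float) -> float:
    """Approximate weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


PARAMS = 15_000_000  # parameter count stated in the post

int8_mb = model_size_mb(PARAMS, 1)  # every weight stored as int8
fp16_mb = model_size_mb(PARAMS, 2)  # every weight stored as fp16
fp32_mb = model_size_mb(PARAMS, 4)  # unquantized single precision

print(int8_mb, fp16_mb, fp32_mb)  # 15.0 30.0 60.0
```

A file just under 25 MB sits between the all-int8 and all-fp16 estimates, which is what one would expect from a mix of the two formats.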
| GaggiX wrote:
| https://huggingface.co/KittenML/kitten-tts-nano-0.1
|
| https://github.com/KittenML/KittenTTS
|
| This is the model and Github page, this blog post looks very much
| AI generated.
| nine_k wrote:
| I hope this is the future. Offline, small ML models, running
| inference on ubiquitous, inexpensive hardware. Models that are
| easy to integrate into other things, into devices and apps, and
| even to drive from other models maybe.
| rohan_joshi wrote:
| yeah totally. the quality of these tiny models is only going
| to go up.
| divamgupta wrote:
| That is our vision too!
| WhyNotHugo wrote:
| Dedicated single-purpose hardware with models would be even
| less energy-intensive. It's theoretically possible to design
| chips which run neural networks and the like using just resistors
| (rather than transistors).
|
| Such hardware is not general-purpose, and upgrading the model
| would not be possible, but there's plenty of use-cases where
| this is reasonable.
| amelius wrote:
| But resistors are, even in theory, heat dissipating devices.
| Unlike transistors, which can in theory be perfectly on or
| off (in both cases not dissipating heat).
| divamgupta wrote:
| The thing is that the new models keep coming every day. So
| it's economically not feasible to make chips for a single
| model
| regularfry wrote:
| It's theoretically possible but physical "neurons" are a
| terrible idea. The number of connections between two layers
| of an FF net is the product of the number of neurons in each,
| so routing makes every other problem a rounding error.
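To put numbers on the routing argument: in a fully connected feed-forward net, each unit in one layer connects to every unit in the next, so the connection count for a pair of layers is the product of their unit counts. A minimal sketch follows; the layer widths are hypothetical and not taken from any model in this thread:

```python
def dense_connections(units_in: int, units_out: int) -> int:
    """Connections (one weight each) between two fully connected layers."""
    return units_in * units_out


# Hypothetical layer widths, chosen only for illustration.
layers = [256, 512, 512, 80]

per_pair = [dense_connections(a, b) for a, b in zip(layers, layers[1:])]
print(per_pair)       # [131072, 262144, 40960]
print(sum(per_pair))  # 434176 physical wires for even this small network
```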
| theshrike79 wrote:
| This is what Apple is envisioning with their SLMs, like having
| a model specifically for managing calendar events. It doesn't
| need to have the full knowledge of all humanity in it - just
| what it needs to manage the calendar.
| koolala wrote:
| Issue is they're envisioning everyone only using Apple
| products.
| theshrike79 wrote:
| Just like Google wants everyone to use their products.
| That's how companies work.
|
| The tech is still public and the research is available
| throwaway28733 wrote:
| Apple's hardware is notoriously overpriced, so I don't think
| they're envisioning that at all.
| DrBenCarson wrote:
| Is it? The base $600 Mac and $150 Apple TV are easily two
| of the best deals in their market
| depingus wrote:
| Hmm. A pay once (or not at all) model that can run on anything?
| Or a subscription model that locks you in, and requires
| hardware that only the richest megacorps can afford? I wonder
| which one will win out.
| tracker1 wrote:
| The popular one.
| divamgupta wrote:
| This is our goal too.
| mayli wrote:
| Is this english only?
| g7r wrote:
| Yes. The FAQ says that multilingual capabilities are in the
| works.
| a2128 wrote:
| If you're looking for other languages, Piper has been around in
| this scene for much longer and they have open-source training
| code and a lot of models (they're ~60MB instead of 25MB but
| whatever...)
| https://huggingface.co/rhasspy/piper-voices/tree/main
| evgpbfhnr wrote:
| I tried on some Japanese for the kicks of it, it reads...
| "Chinese letter chinese letter japanese letter chinese
| letter..." :D
|
| But yeah, if it's like any of the others we'll likely see a
| different "model" per language down the line based on the same
| techniques
| riedel wrote:
| Actually I found it irritating that the readme does not mention
| the language at all. I think it is not good practice to deduce
| it from the language of the readme itself. I would not like to
| have German language tts models with only a German readme...
| numpad0 wrote:
| TTS is generally not multilingual. One might think well-
| annotated phonetic descriptions of voices would suffice, but
| that's not quite how languages work nor how TTS works.
|
| (but somehow LLMs handle multilingual input perfectly fine!
| that's a bit strange, if you think about that)
| toisanji wrote:
| Wow, amazing and good work, I hope to see more amazing models
| running on CPUs!
| rohan_joshi wrote:
| thanks, we're going to release many more models in the future,
| that can run on just CPUs.
| onair4you wrote:
| Okay, lots of detailed information and example code, great. But
| skimming through I didn't see any audio samples to judge the
| quality?
| TheAceOfHearts wrote:
| They posted a demo on reddit[0]. It sounds amazing given the
| tiny size.
|
| [0]
| https://old.reddit.com/r/LocalLLaMA/comments/1mhyzp7/kitten_...
| onair4you wrote:
| Thanks! Yeah. It definitely isn't the absolute best in
| quality but it trounces the default TTS options on macOS (as
| third party developers are locked out of the Siri voices).
| And for less than the size of many modern web pages...
| blopker wrote:
| Web version: https://clowerweb.github.io/kitten-tts-web-demo/
|
| It sounds ok, but impressive for the size.
| nine_k wrote:
| Does anybody find it funny that sci-fi movies have to heavily
| distort "robot voices" to make them sound "convincingly
| robotic"? A robotic, _explicitly_ non-natural voice would be
| perfectly acceptable, and even desirable, in many situations. I
| don't expect a smart toaster to talk like a BBC host; it'd be
| enough if the speech is easy to recognize.
| roywiggins wrote:
| This one is at least an interesting idea:
| https://genderlessvoice.com/
| cosmojg wrote:
| The voice sounds great! I find it quite aesthetically
| pleasing, but it's far from genderless.
| degamad wrote:
| Interesting concept, but why is that site filled with Top X
| blogspam?
| pbronez wrote:
| The YouTube video [1] was published in 2019. The Blog
| spam posts range from Nov 2022 to July 2023.
|
| Other than the video, the only relevant content is on the
| about page [2]. It says the voice is a collaboration
| between 5 different entities, including advocacy groups,
| marketing firms and a music producer.
|
| The video is the only example of the voice in use. There
| is no API, weights, SDK, etc.
|
| I suspect this was a one-off marketing stunt sponsored by
| Copenhagen pride before the pandemic. The initial
| reaction was strong enough that a couple of years later they were
| still getting a small but steady flow of traffic. One of
| the involved marketing firms decided to monetize the
| asset and defaced it with blog spam.
|
| [1] https://www.youtube.com/watch?v=lvv6zYOQqm0
|
| [2] https://genderlessvoice.com/about/
| dang wrote:
| _Meet Q, a Genderless Voice_ -
| https://news.ycombinator.com/item?id=19505835 - March 2019
| (235 comments)
| cyberax wrote:
| It doesn't sound genderless.
| pbronez wrote:
| Huh. Sounds perfectly intelligible and definitively
| artificial. Feels weakly feminine to me, but only because I
| was primed to think about gender from the branding.
|
| It's a good choice for a robot voice. It's easier to
| understand than the formant synths or deliberately
| distorted human voices. The genderless aspect is alien
| enough to avoid the uncanny valley. You intuitively know
| you're dealing with something a little different.
| qmr wrote:
| Thanks, I hate it.
| userbinator wrote:
| _A robotic, explicitly non-natural voice would be perfectly
| acceptable, and even desirable, in many situations[...]it'd
| be enough if the speech is easy to recognize._
|
| We've had formant synths for several decades, and they're
| perfectly understandable and require a _tiny_ amount of
| computing power, but people tend not to want to listen to
| them:
|
| https://en.wikipedia.org/wiki/Software_Automatic_Mouth
|
| https://simulationcorner.net/index.php?page=sam (try it
| yourself to hear what it sounds like)
| saretup wrote:
| Well, this one is a bit too jarring to the ears.
| rixed wrote:
| But there is no latency, as opposed to KittenTTS, so it
| certainly has its applications too.
| cess11 wrote:
| Try this demo, which has more knobs:
|
| https://discordier.github.io/sam/
| actionfromafar wrote:
| I think it's charming
| miki123211 wrote:
| SAM and the way it works is not what people typically
| associate with the term "formant synthesizer."
|
| DECtalk[1,2] would be a much better example, that's as
| formant as you get.
|
| [1] https://en.wikipedia.org/wiki/DECtalk [2]
| https://webspeak.terminal.ink
| tapper wrote:
| Yeah blind people love eloquence
| boobsbr wrote:
| Huh, now I know what Airdorf used in Faith: Unholy Trinity.
| Twirrim wrote:
| > I don't expect a smart toaster to talk like a BBC host;
|
| Well sure, the BBC have already established that it's
| supposed to sound like a brit doing an impersonation of an
| American: https://www.youtube.com/watch?v=LRq_SAuQDec
| incone123 wrote:
| Depends on the movie. Ash and Bishop in the Alien franchise
| sound human until there's a dramatic reason to sound more
| 'robotic'.
|
| I agree with your wider point. I use Google TTS with
| Moon+Reader all the time (I tried audio books read by real
| humans but I prefer the consistency of TTS)
| regularfry wrote:
| Slightly different there because it's important in both
| cases that Ripley (and we) can't tell they're androids
| until it's explicitly uncovered. The whole point is that
| they're _not_ presented as artificial. Same in Blade
| Runner: "more human than human". You don't have a film
| without the ambiguity there.
| incone123 wrote:
| You're right. I should have used Marvin from Hitchhiker's
| Guide as an example instead. There's very light
| processing on his speech.
| looperhacks wrote:
| I remember that the novelization of the fifth element
| describes that the cops are taught to speak as robotic as
| possible when using speakers for some reason. Always found
| the idea weird that someone would _want_ that
| addandsubtract wrote:
| If you're on a Mac, you can type "say [thing to say]" into
| your terminal.
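For scripting the same thing, here is a small sketch that wraps the macOS `say` command from Python. It assumes the `-v` (voice) and `-r` (words per minute) options of `say`; the function names are made up for illustration, and `speak` does nothing on other platforms:

```python
import platform
import shutil
import subprocess
from typing import List, Optional


def build_say_command(text: str, voice: Optional[str] = None,
                      rate_wpm: Optional[int] = None) -> List[str]:
    """Build the argument list for macOS `say` without running it."""
    cmd = ["say"]
    if voice:
        cmd += ["-v", voice]          # pick a specific installed voice
    if rate_wpm:
        cmd += ["-r", str(rate_wpm)]  # speaking rate in words per minute
    cmd.append(text)
    return cmd


def speak(text: str, **options) -> bool:
    """Speak text via `say`; return False when `say` is unavailable."""
    if platform.system() != "Darwin" or shutil.which("say") is None:
        return False
    subprocess.run(build_say_command(text, **options), check=True)
    return True
```

Passing the text as a separate list element (rather than through a shell string) avoids quoting problems with apostrophes and other punctuation.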
| msgodel wrote:
| I personally prefer the older synthetic voices for TTS when
| the text is coming from software or a language model.
| mfro wrote:
| In the Culture novels, Iain Banks imagines that we would
| become uncomfortable with the uncanny realism of transmitted
| voices / holograms, and intentionally include some level of
| distortion to indicate you're speaking to an image
| quantummagic wrote:
| Doesn't work here. Backend module returns 404 :
|
| https://clowerweb.github.io/node_modules/onnxruntime-web/dis...
| Retr0id wrote:
| Looks like this commit 15 minutes ago broke it
| https://github.com/clowerweb/kitten-tts-web-demo/commit/6b5c...
|
| (seems reverted now)
| Retr0id wrote:
| I tried to replicate their demo text but it doesn't sound as
| good for some reason.
|
| If anyone else wants to try:
|
| > Kitten TTS is an open-source series of tiny and expressive
| text-to-speech models for on-device applications. Our smallest
| model is less than 25 megabytes.
| cortesoft wrote:
| Is the demo using the not smallest model?
| Retr0id wrote:
| Perhaps, but the 25MB model is the only thing they've
| released
| itake wrote:
| > Error generating speech: failed to call OrtRun(). ERROR_CODE:
| 2, ERROR_MESSAGE: Non-zero status code returned while running
| Expand node. Name:'/bert/Expand' Status Message: invalid expand
| shape
|
| Doesn't seem to work with thai.
| jainilprajapati wrote:
| You can also try on
| https://clowerweb.github.io/node_modules/onnxruntime-web/dis...
| nxnsxnbx wrote:
| Thanks, I was looking for that. While the reddit demo sounds
| ok, even though on a level we reached a couple of years ago,
| all TTS samples I tried were barely understandable at all
| divamgupta wrote:
| This is just an early checkpoint. We hope that the quality
| will improve in the future.
| bkyan wrote:
| I got an error when I tried the demo with 6 sentences, but it
| worked great when I reduced the text to 3 sentences. Is the
| length limit due to the model or just a limitation for the
| demo?
| cess11 wrote:
| Perhaps a length limit? I tried this:
|
| "This first Book proposes, first in brief, the whole Subject,
| Mans disobedience, and the loss thereupon of Paradise wherein
| he was plac't: Then touches the prime cause of his fall, the
| Serpent, or rather Satan in the Serpent; who revolting from
| God, and drawing to his side many Legions of Angels, was by
| the command of God driven out of Heaven with all his Crew
| into the great Deep."
|
| It takes a while until it starts generating sound on my i7
| cores but it kind of works.
|
| This also works:
|
| "blah. bleh. blih. bloh. blyh. bluh."
|
| So I don't think it's a limit on punctuation. Voice quality
| is quite bad though, not as far from the old school C64 SAM
| (https://discordier.github.io/sam/) of the eighties as I
| expected.
| divamgupta wrote:
| Currently we don't have chunking enabled yet. We will add it
| soon. That will remove the length limitations.
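Until chunking lands upstream, a caller can work around the length limit by splitting the input before synthesis. The sketch below is one possible approach and not the project's implementation: it greedily packs whole sentences into chunks of at most `max_chars` characters and falls back to word boundaries for over-long sentences. The 200-character default is an arbitrary assumption, since the thread does not state the actual limit.

```python
import re
from typing import List


def chunk_text(text: str, max_chars: int = 200) -> List[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # The sentence does not fit next to the previous text: repack it
        # word by word, starting from an empty chunk.
        current = ""
        for word in sentence.split():
            candidate = f"{current} {word}" if current else word
            if len(candidate) <= max_chars or not current:
                current = candidate
            else:
                chunks.append(current)
                current = word
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("blah. bleh. blih. bloh. blyh. bluh.", max_chars=12))
# ['blah. bleh.', 'blih. bloh.', 'blyh. bluh.']
```

Each chunk would then be synthesized separately and the resulting audio concatenated, ideally with a short pause between chunks so sentence boundaries stay audible.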
| belchiorb wrote:
| This doesn't seem to work on Safari. Works great on Chrome,
| though
| divamgupta wrote:
| Hmm, we will look into it.
| tapper wrote:
| You should post on the NVDA email list.
| https://nvda.groups.io/g/nvda Or the Screen reader list:
| https://winaccess.groups.io/g/winaccess FYI blind people do
| not like any lag when reading; that is why so many still
| use eloquence and espeak.
| rohan_joshi wrote:
| yeah, this is just a preview model from an early checkpoint.
| the full model release will be next week which includes a 15M
| model and an 80M model, both of which will have much higher
| quality than this preview.
| Aardwolf wrote:
| On PC it's a python dependency hell but someone managed to
| package it in self contained JS code that works offline once it
| loaded the model? How is that done?
| a2128 wrote:
| ONNXRuntime makes it fairly easy, you just need to provide a
| path to the ONNX file, give it inputs in the correct format,
| and use the outputs. The ONNXRuntime library handles the
| rest. You can see this in the main.js file:
| https://github.com/clowerweb/kitten-tts-web-demo/blob/main/m...
|
| Plus, Python software is dependency hell in general, while
| webpages have to be self-contained by their nature (thank god
| we no longer have Silverlight and Java applets...)
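The same three steps (open the ONNX file, feed inputs in the expected format, read the outputs) look much the same in Python with the `onnxruntime` package. This is a generic sketch and not the demo's code: the model path and the input names in the usage comment are placeholders, and the real names should be read from `session.get_inputs()`.

```python
from typing import Any, Dict


def load_session(model_path: str):
    """Open an ONNX model on the CPU (requires `pip install onnxruntime`)."""
    import onnxruntime as ort  # imported lazily so run_model works without it
    return ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])


def run_model(session, feeds: Dict[str, Any]) -> Dict[str, Any]:
    """Check feeds against the model's declared inputs, then run inference."""
    expected = [node.name for node in session.get_inputs()]
    missing = [name for name in expected if name not in feeds]
    if missing:
        raise ValueError(f"missing model inputs: {missing}")
    output_names = [node.name for node in session.get_outputs()]
    outputs = session.run(output_names, {name: feeds[name] for name in expected})
    return dict(zip(output_names, outputs))


# Usage (placeholder path and input name, not the real model's):
#   session = load_session("model.onnx")
#   print([(n.name, n.shape, n.type) for n in session.get_inputs()])
#   result = run_model(session, {"tokens": token_array})
```

Inspecting `get_inputs()` first is the practical way to learn what "the correct format" is for a given model, since names, shapes, and dtypes differ per export.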
| scotty79 wrote:
| It feels like it doesn't handle punctuation well. I don't hear
| sentence boundaries and commas. It sounds like continuous
| stream of words.
| Jotalea wrote:
| Using male voice 2 at 48kHz at 0.5x speed sounds a lot like
| Madeline's voice lines in Celeste. Seemed funny to me.
| mlboss wrote:
| Reddit post with generated audio sample:
| https://www.reddit.com/r/LocalLLaMA/comments/1mhyzp7/kitten_...
| tapper wrote:
| Sounds slow and like something from an anime
| ricardobeat wrote:
| Speech speed is always a tunable parameter and not something
| intrinsic to the model.
|
| The comparison to make is expressiveness and correct
| intonation for long sentences vs something like espeak. It
| actually sounds amazing for the size. The closest thing is
| probably KokoroTTS at 82M params and ~300MB.
| dvh wrote:
| I think he meant overacting typical for English dubs.
| Telemakhos wrote:
| The voices sound artificial and a bit grating. The male
| voices especially are lacking, especially in depth: only
| the ultimate voice has any depth at all, while the others
| sound like teenagers who haven't finished puberty. None
| of the voices sound quite human, but they're all very
| annoying, and part of that is that they sound like
| they're acting.
| avisser wrote:
| I heard a little DVa from Overwatch.
| numpad0 wrote:
| The only real questions are which Chinese gacha game they
| ripped data from and whether they used Claude Code or Gemini
| CLI for Python code. I bet one can get a formant match from
| output this much overfit to whatever data. This isn't going
| to stay up for long.
| smusamashah wrote:
| The reddit video is awesome. I don't understand how people are
| calling it an OK model. Under 25MB and cpu only for this
| quality is amazing.
| Retr0id wrote:
| The people calling it "OK" probably tried it for themselves.
| Whatever model is being demoed in that video is not the same
| as the 25MB model they released.
| darkwater wrote:
| Nope, looks like the default voice is the worst and it's
| not in the demo. A Reddit user generated these as well
| https://limewire.com/d/28CRw#UPuRLynIi7
| bouchard wrote:
| Never thought I'd see the name LimeWire again, wow
| divamgupta wrote:
| Haha interesting pivot!
| iab wrote:
| Local quality is very bad
| fortyseven wrote:
| It did say this was a preview release, so I'll reserve
| judgement until that's out the door.
| sergiotapia wrote:
| https://vocaroo.com/1njz1UwwVHCF
|
| It doesn't sound so good. Excellent technical achievement and
| it may just improve more and more! But for now I can't use it
| for consumer facing applications.
| divamgupta wrote:
| We are still training the model. We expect the quality to
| go up in the next release. This is just a preview release
| :)
| KaiserPro wrote:
| was it cross trained on futurama voices?
| junon wrote:
| That would be a feature!
| archon810 wrote:
| Sounds like Mort from Family Guy.
| divamgupta wrote:
| Lol
| divamgupta wrote:
| It was not
| Zardoz84 wrote:
| Sounds very clear. For a non native english speaker like me,
| it's easy to understand.
| Aachen wrote:
| Impressive technical achievement, but in terms of whether I'd
| use it: oof, that male voice is like one of these fake-excited
| newsreaders. Like they're always at the edge of their breath.
| The female one is better but still someone reading out an
| advertisement for a product they were told they must act extra
| excited for. I assume this is what the majority of training
| data was like and not an intentional setting for the demo.
| Unsure whether I could get used to that
|
| I use TTS on my phone regularly and recently also tried this
| new project on F-Droid called SherpaTTS, which grabs some
| models from Huggingface. They're super heavy (the phone
| suspends other apps to disk while this runs) and sound good,
| but in the first news article there were already one or two
| mispronunciations because it's guessing how to say uncommon or
| new words and it's not based on logical rules anymore to turn
| text into speech
|
| Google and Samsung have each a TTS engine pre-installed on my
| device and those sound and work fine. A tad monotonous but it
| seems to always pronounce things the same way so you can always
| work out what the text said
|
| Espeak (or -ng) is the absolute worst, but after 30 seconds of
| listening closely you get used to it and can understand
| everything fine. I don't know if it's the best open source
| option (probably there are others that I should be trying) but
| it's at least the most reliable where you'll always get what is
| happening and you can install it on any device without
| licensing issues
| willwade wrote:
| If anyone else wants to try sherpaOnnx, you can try this:
| https://github.com/willwade/tts-wrapper We recently added
| the kokoro models, which should sound a lot better. There are
| a LOT of models to choose from. I have a feeling the Droid
| app isn't handling cold starts very well.
| spookie wrote:
| If anyone wants to test ready to install android apks:
| https://k2-fsa.github.io/sherpa/onnx/tts/apk.html
| bornfreddy wrote:
| RHvoice is pretty good, imho.
| divamgupta wrote:
| Thanks a lot for the detailed feedback. We are working on
| some models which do not use a phonemizer
| seligman99 wrote:
| And a quick video with all of the different voices:
|
| https://www.youtube.com/watch?v=60Dy3zKBGQg
| Eduard wrote:
| thank you!
| tracker1 wrote:
| Cool, thanks... aside: the last male voice sounds high/drunk.
| pkaye wrote:
| Where does the training data for the models come from? Is there
| an openly available dataset that people use?
| wewewedxfgdf wrote:
| say is only 193K on macOS:
|
| $ ls -lah /usr/bin/say
| -rwxr-xr-x 1 root wheel 193K 15 Nov 2024 /usr/bin/say
|
| Usage:
|
| M1-Mac-mini ~ % say "hello world this is the kitten TTS model speaking"
| dented42 wrote:
| That's not a fair comparison. Say just calls the speech
| synthesis APIs that have been around since at least Mac OS 8.
|
| That being said, the 'classical' (pre-AI) speech synthesisers
| are much smaller than kitten, so you're not wrong per se, just
| for the wrong reason.
| deathanatos wrote:
| The linked repository at the top-level here has several
| gigabytes of dependencies, too.
| wnoise wrote:
| And what dynamic libraries is it linked to? And what other data
| are they pulling in?
| satvikpendem wrote:
| `say` sounds terrible compared to modern neural network based
| text to speech engines.
| wewewedxfgdf wrote:
| Sounds about the same as Kitten TTS.
| satvikpendem wrote:
| To me it sounds worse, especially on the construction of
| certain more complex sentences or words.
| selcuka wrote:
| SAM on Commodore 64 was only 6K:
|
| https://project64.c64.org/Software/SAM10.TXT
|
| Obviously it's not fair to compare these with ML models.
| tonypapousek wrote:
| Tried that on 26 beta, and the default voice sounds a lot
| smoother than it used to.
|
| Running `man say` reveals that "this tool uses the Speech
| Synthesis manager", so I'm guessing the Apple Intelligence
| stuff is kicking in.
| dented42 wrote:
| Nothing to do with Apple Intelligence. The speech synthesiser
| manager (the term manager was used for OS components in
| Classic Mac OS) has been around since the mid 90s or so. The
| change you're hearing is probably a new/modified default
| voice.
| RobKohr wrote:
| What's a good one in reverse; speech to text?
| jasonjmcghee wrote:
| Whisper and the many variants. Here's a good implementation.
|
| https://github.com/ggml-org/whisper.cpp
| wenc wrote:
| This one is a whisper-based Python package
|
| https://github.com/primaprashant/hns
| wkat4242 wrote:
| Hmm the quality is not so impressive. I'm looking for a really
| naturally sounding model. Not very happy with piper/kokoro, XTTS
| was a bit complex to set up.
|
| For STT whisper is really amazing. But I miss a good TTS. And I
| don't mind throwing GPU power at it. But anyway. this isn't it
| either, this sounds worse than kokoro.
| kenarsa wrote:
| Try https://github.com/Picovoice/orca
| wkat4242 wrote:
| Thanks!
| echelon wrote:
| > Hmm the quality is not so impressive. [...] And I don't mind
| throwing GPU power at it.
|
| This isn't for you, then. You should evaluate quality here
| based on the fact you _don't_ need a GPU.
|
| Back in the pre-Tacotron2 days, I was running slim TTS and
| vocoder models like GlowTTS and MelGAN on Digital Ocean
| droplets. No GPU to speak of. It cost next to nothing to run.
|
| Since then, the trend has been to scale up. We need more models
| to scale down.
|
| In the future we'll see small models living on-device. Embedded
| within toys and tools that don't need or want a network
| connection. Deployed with Raspberry Pi.
|
| Edge AI will be huge for robotics, toys and consumer products,
| and gaming (ie. world models).
| wkat4242 wrote:
| > This isn't for you, then. You should evaluate quality here
| based on the fact you don't need a GPU.
|
| I know but it was more of a general comment. A really good
| TTS just isn't around yet in the OSS sphere. I looked at some
| of the other suggestions here but they have too many quirks.
| Dia sounds great but messages must have certain lengths etc
| and it picks a random voice every time. I'd love to have
| something self hosted that's as good as openai.
| kamranjon wrote:
| The best open one I've found so far is Dia -
| https://github.com/nari-labs/dia - it has some limitations, but
| i think it's really impressive and I can run it on my laptop.
| wkat4242 wrote:
| Thanks I'll try! I like how it sounds, the quality is really
| good. But the limitations are really severe (shorter than 5
| seconds is not ok, > 30 seconds is not ok, it will play a
| random voice every time, those make it pretty much unusable
| for an assistant to be honest).
|
| But it might be worth setting it up and seeing if it improves
| over time.
| kamranjon wrote:
| You can get consistent voice by providing a sample - and
| yea the timing stuff is what you have to work around - have
| to basically chunk your inputs.
| guskel wrote:
| Chatterbox is also worth a try.
| jainilprajapati wrote:
| You should give https://pinokio.co/ a try
| wkat4242 wrote:
| Thanks I'll try!
| gnulinux wrote:
| Imho chatterbox is the current open weight SOTA model in terms
| of quality: https://huggingface.co/ResembleAI/chatterbox
| wkat4242 wrote:
| Thank you, I hadn't heard of it. Will have a look! The
| samples sound excellent indeed.
| andai wrote:
| Can you run it in reverse for speech recognition?
| gromgull wrote:
| no, but whisper has a 39M model:
| https://github.com/openai/whisper
| divamgupta wrote:
| We will release an STT model as well.
| keyle wrote:
| I don't mind so much the size in MB, the fact that it's pure CPU
| and the quality, what I do mind however is the latency. I hope
| it's fast.
|
| Aside: Are there any models for understanding voice to text,
| fully offline, without training?
|
| I will be very impressed when we will be able to have a
| conversation with an AI at a natural rate and not "probe, space,
| response"
| Teever wrote:
| Any idea what factors play into latency in TTS models?
| divamgupta wrote:
| Mostly model size, and input size. Some models which use
| attention are O(N^2)
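To make the O(N^2) remark concrete: full self-attention compares every input position with every other, so one score matrix has N x N entries. A tiny sketch, with sequence lengths chosen arbitrarily:

```python
def attention_score_entries(seq_len: int) -> int:
    """Entries in one full self-attention score matrix (seq_len x seq_len)."""
    return seq_len * seq_len


for n in (50, 100, 200):
    print(n, attention_score_entries(n))
# 50 2500
# 100 10000
# 200 40000
```

Doubling the input length quadruples this part of the work, which is one reason long inputs cost disproportionately more than short ones in attention-based models.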
| blensor wrote:
| "The brown fox jumps over the lazy dog.."
|
| Average duration per generation: 1.28 seconds
|
| Characters processed per second: 30.35
|
| --
|
| "Um"
|
| Average duration per generation: 0.22 seconds
|
| Characters processed per second: 9.23
|
| --
|
| "The brown fox jumps over the lazy dog.. The brown fox jumps
| over the lazy dog.."
|
| Average duration per generation: 2.25 seconds
|
| Characters processed per second: 35.04
|
| --
|
| processor : 0
|
| vendor_id : AuthenticAMD
|
| cpu family : 25
|
| model : 80
|
| model name : AMD Ryzen 7 5800H with Radeon Graphics
|
| stepping : 0
|
| microcode : 0xa50000c
|
| cpu MHz : 1397.397
|
| cache size : 512 KB
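The script behind these numbers is not shown. A measurement of this kind can be sketched as below, where `synthesize` is a stand-in for whatever TTS call is being timed (a hypothetical callable, not a KittenTTS API). The reported throughput is close to plain characters divided by average duration: the first sentence has 39 characters, and 39 / 1.28 s is about 30.5 characters per second.

```python
import time
from typing import Callable, Tuple


def benchmark(synthesize: Callable[[str], object], text: str,
              runs: int = 5) -> Tuple[float, float]:
    """Return (average seconds per generation, characters per second)."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        synthesize(text)  # the call being measured
        durations.append(time.perf_counter() - start)
    average = sum(durations) / len(durations)
    chars_per_second = len(text) / average if average > 0 else float("inf")
    return average, chars_per_second


sentence = "The brown fox jumps over the lazy dog.."
print(len(sentence))                   # 39
print(round(len(sentence) / 1.28, 1))  # 30.5
```

Note how the short input ("Um") shows far lower throughput in the figures above: a fixed per-call overhead dominates when there is little text to amortize it over.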
| keyle wrote:
| assuming most answers will be more than a sentence, 2.25
| seconds is already long enough if you factor the token
| generation in between... and imagine with reasoning!... We're
| not there yet.
| moffkalast wrote:
| Hmm that actually seems extremely slow, Piper can crank out a
| sentence almost instantly on a Pi 4 which is like a sloth
| compared to that Ryzen and the speech quality seems about the
| same at first glance.
|
| I suppose it would make sense if you want to include it on
| top of an LLM that's already occupying most of a GPU and this
| could run in the limited VRAM that's left.
| colechristensen wrote:
| >Aside: Are there any models for understanding voice to text,
| fully offline, without training?
|
| OpenAI's whisper is a few years old and pretty solid.
|
| https://github.com/openai/whisper
| Hackbraten wrote:
| Whisper tends to fill silence with random garbage from its
| training set. [0] [1] [2]
|
| [0]: https://github.com/openai/whisper/discussions/679 [1]:
| https://github.com/openai/whisper/discussions/928 [2]:
| https://github.com/openai/whisper/discussions/2608
| jiehong wrote:
| Voice to text fully offline can be done with whisper. A few
| apps offer it for dictation or transcription.
| Dayshine wrote:
| Nvidia's parakeet https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2
| appears to be state of the art for English: 10x faster than Whisper.
|
| My mid-range AMD CPU is multiple times faster than realtime
| with parakeet.
| sandreas wrote:
| Cool.
|
| While I think this is indeed impressive and has a specific use
| case (e.g. in the embedded sector), I'm not totally convinced
| that the quality is good enough to replace bigger models.
|
| With fish-speech[1] and f5-tts[2] there are at least 2 open
| source models pushing the quality limits of offline text-to-
| speech. I tested F5-TTS with an old NVidia 1660 (6GB VRAM) and it
| worked ok-ish, so running it on a little more modern hardware
| will not cost you a fortune and produce MUCH higher quality with
| multi-language and zero-shot support.
|
| For Android there is SherpaTTS[3], which plays pretty well with
| most TTS Applications.
|
| 1: https://github.com/fishaudio/fish-speech
|
| 2: https://github.com/SWivid/F5-TTS
|
| 3: https://github.com/woheller69/ttsengine
| divamgupta wrote:
| We have released just a preview of the model. We hope to get
| the model much better in the future releases.
| nickpsecurity wrote:
| Fish Speech says its weights are for non-commercial use.
|
| Also, what are the two's VRAM requirements? This model has 15
| million parameters which might run on low-power, sub-$100
| computers with up-to-date software. Your hardware was an out-
| of-date 6GB GPU.
| jainilprajapati wrote:
| maxloh wrote:
| Hi. Will the training and fine-tuning code also be released?
|
| It would be great if the training data were released too!
| MutedEstate45 wrote:
| The headline feature isn't the 25 MB footprint alone. It's that
| KittenTTS is Apache-2.0. That combo means you can embed a fully
| offline voice in Pi Zero-class hardware or even battery-powered
| toys without worrying about GPUs, cloud calls, or restrictive
| licenses. In one stroke it turns voice everywhere from a
| hardware/licensing problem into a packaging problem. Quality
| tweaks can come later; unlocking that deployment tier is the real
| game-changer.
| defanor wrote:
| Festival's English model, festvox-kallpc16k, is about 6 MB,
| and it is a large model; festvox-kallpc8k is about 3.5 MB.
|
| eSpeak NG's data files take about 12 MB (multi-lingual).
|
| I guess this one may generate more natural-sounding speech, but
| older or lower-end computers were capable of decent speech
| synthesis previously as well.
| Joel_Mckay wrote:
| Custom voices could be added, but the speed was more
| important to some users.
|
| $ ls -lh /usr/bin/flite
|
| Listed as 27K last I checked.
|
| I recall some Blind users were able to decode Gordon 8-bit
| dialogue at speeds most people found incomprehensible. =3
| anthk wrote:
| I'm not blind, but spoken English is far more difficult to
| grasp than written English (I'm a non-native speaker), and
| Flite runs on n270 netbooks at crazy speeds with really
| good enough voices.
| rohan_joshi wrote:
| yeah, we are super excited to build tiny ai models that are
| super high quality. local voice interfaces are inevitable and
| we want to power those in the future. btw, this model is just a
| preview, and the full release next week will be of much higher
| quality, along w another ~80M model ;)
| phh wrote:
| It depends on espeak-ng which is GPLv3
| pjc50 wrote:
| > KittenTTS is Apache-2.0
|
| What about the training data? Is everyone 100% confident that
| models are not a derived work of the training inputs now, even
| if they can reproduce input exactly?
| entropie wrote:
| I'm playing around with an NVIDIA Jetson Orin Nano Super right
| now and it's actually pretty usable with gemma3:4b and quite fast -
| even image processing is done in like 10-20 seconds, but this is
| with GPU support. When something is not working and ollama is
| not using the GPU, these calls take _ages_ because the CPU is
| just bad.
|
| I am curious how fast this is with CPU only.
| woadwarrior01 wrote:
| > It's that KittenTTS is Apache-2.0
|
| Have you seen the code[1] in the repo? It uses phonemizer[2]
| which is GPL-3.0 licensed. In its current state, it's
| effectively GPL licensed.
|
| [1]:
| https://github.com/KittenML/KittenTTS/blob/main/kittentts/on...
|
| [2]: https://github.com/bootphon/phonemizer
|
| Edit: It looks like I replied to an LLM generated comment.
| jacereda wrote:
| https://github.com/KittenML/KittenTTS/issues/17
| dspillett wrote:
| _> IANAL, but AFAICS this leaves 2 options, switching the
| license or removing that dependency._
|
| There is a third option: asking the project for an
| exception.
|
| Though that is unlikely to be granted[1], leaving you back
| with just the other two options.
|
| And of course a fourth choice: just ignore the license. This
| is the option taken by companies like Onyx, whose products
| I might otherwise be interested in...
|
| ----
|
| [1] Those of us who pick GPL3 or AGPL generally do so to
| keep things _definite_ and an exception would muddy the
| waters, also it might not even be possible if the project
| has many maintainers as relicensing would require agreement
| from all who have provided code that is in the current
| release. Furthermore, if it has inherited the license from
| one of _its_ dependencies, an exception is even less
| practical.
| woadwarrior01 wrote:
| > There is a third option: asking the project for an
| exception.
|
| IIUC, the project isn't at the liberty to grant such an
| exception because it inherits its GPL license from
| espeak-ng.
| dspillett wrote:
| Ah, yes, good catch, I didn't look deeper into the
| dependency tree at all. I'll update my footnote to
| include that as one of the reasons an exception may be
| impossible (or at least highly impractical).
| wongarsu wrote:
| A fourth option would be a kind of dual-licensing: the
| project as-is is available under GPL-3.0, but the source
| code in this repository excluding any dependencies is
| also available under Apache 2.0
|
| Any user would still effectively be bound by the GPL-3.0,
| but if someone can remove the GPL dependencies they could
| use the project under Apache
| dspillett wrote:
| That is an option for the publisher of the library, not
| the consumer of it. If it isn't already done then asking
| for it to be done is the same as asking for an exception
| otherwise (option three).
| wongarsu wrote:
| The use of the library is four lines. Three set up the
| library (`phonemizer.backend.EspeakBackend(language="en-
| us", preserve_punctuation=True, with_stress=True)`), the
| other calls it (`phonemes_list =
| self.phonemizer.phonemize([text])`). Plus I guess the
| import statements. Even ignoring Google vs Oracle I don't
| think those lines by themselves meet any threshold of
| originality.
|
| Obviously you can't run them (with the original library)
| without complying with the GPL. But I don't see why I
| couldn't independently of that also give you this text
| file under Apache 2.0 to do with as you want (which for
| the record still doesn't allow you to run them with the
| original library without complying with the GPL, but
| that'd be phonemizer forcing you to do that, not this
| project)
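|
| A minimal Python sketch of the two calls quoted above. The
| `EspeakBackend(...)` arguments and `phonemize([text])` come from the
| comment; `Phonemizer`, `make_espeak_backend`, `text_to_phonemes` and
| `FakeBackend` are hypothetical names added for illustration, and the
| real backend is imported lazily because it needs the GPL-3.0
| phonemizer package plus espeak-ng installed.

```python
from typing import List, Protocol


class Phonemizer(Protocol):
    """Anything exposing the phonemize(list_of_texts) call quoted above."""

    def phonemize(self, text: List[str]) -> List[str]: ...


def make_espeak_backend() -> Phonemizer:
    # Lazy import: requires the GPL-3.0 `phonemizer` package and espeak-ng.
    from phonemizer.backend import EspeakBackend

    return EspeakBackend(
        language="en-us", preserve_punctuation=True, with_stress=True
    )


def text_to_phonemes(text: str, backend: Phonemizer) -> str:
    # One-element list in, list of phoneme strings out; take the first.
    phonemes_list = backend.phonemize([text])
    return phonemes_list[0]


class FakeBackend:
    """Hypothetical stand-in so the sketch runs without espeak-ng."""

    def phonemize(self, text: List[str]) -> List[str]:
        # Not real phonemes: just lowercases, to exercise the call shape.
        return [t.lower() for t in text]
```

| Passing the backend in as an argument keeps the GPL-licensed
| dependency behind a single seam, which is roughly the separation the
| dual-licensing idea relies on.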
|
| You would have to be very specific about the dual-
| licensing to avoid confusion about what you are allowed
| to do under Apache conditions though. You can't just say
| "it's dual-licensed"
| joshuaissac wrote:
| You could even extract out the parts that do not call the
| GPL library into an upstream project under the Apache 2.0
| licence, and pull in both that and the GPL library in the
| downstream project, relying on Apache 2.0 -> GPL 3.0
| compatibility instead of explicit dual licensing to allow
| the combined work to be distributed under GPLv3.
| ape4 wrote:
| Once the license issues are resolved it would be nice if you
| could install it on a distro with the normal package
| manager.
| keyKeeper wrote:
| Okay, what's stopping you from feeding the code into an LLM,
| rewriting it and making it yours? You can even add extra
| steps like make it analyze the code block by block then
| supervise it as it is rewriting it. Bam. AI age IP freedom.
|
| Morals may stop you but other than that? IMHO all open source
| code is public domain code if anyone is willing to spend some
| AI tokens.
| woadwarrior01 wrote:
| Tell me you haven't used LLMs on large, non-trivial
| codebases without telling me... :)
| keyKeeper wrote:
| Tell me you don't know how to use LLMs properly without
| telling me.
|
| You don't give the whole codebase to an LLM and expect it
| to have one shot output. Instead, you break it down and
| write the code block by block. Then the size of the
| codebase doesn't matter. You use the LLM as a tool, it is
| not supposed to replace you. You don't try to become
| George from Jetsons who is just pressing a button and
| doesn't touch anything, instead you are on top of it as
| the LLM does the coding. You test the code on every step
| to see if the implementation behaves as expected. Do
| enough of this and you have proper, full "bespoke"
| software.
| akx wrote:
| I'll help you along - this is the core function that
| Kitten ends up calling. Good luck!
|
| https://github.com/espeak-ng/espeak-
| ng/blob/a4ca101c99de3534...
| Twirrim wrote:
| That would be a derivative work, and still be subject to
| the license terms and conditions, at best.
|
| There are standard ways to approach this called clean room
| engineering.
|
| https://en.m.wikipedia.org/wiki/Clean-room_design
|
| One person reads the code and produces a detailed technical
| specification. Someone reviews it to ensure that there is
| nothing in there that could be classified as copyrighted
| material, then a third person (who has never seen the
| original code) implements the spec.
|
| You could use an LLM at both stages, but you'd have to be
| able to prove that the LLM that does the implementation had
| no prior knowledge of the code in question... Which given
| how LLMs have been trained seems to me to be very dubious
| territory for now until that legal situation gets resolved.
| K0balt wrote:
| AI is useful in Chinese walling code, but it's not as easy
| as you make it sound. To stay out of legal trouble, you
| probably should refactor the code into a different
| language, then back into the target language. In the end,
| it turns into a process of being forced to understand the
| codebase and supervising its rewriting. I've translated
| libraries into another language using LLMs, I'd say that
| process was 1/2 the labor of writing it myself. So in the
| end, going 2 ways, you may as well rewrite the code
| yourself... but working with the LLM will make you familiar
| with the subject matter so you -could- rewrite the code, so
| I guess you could think of it as a sort of buggy tutorial
| process?
| graemep wrote:
| I am not sure even that is enough. You would really need
| to do a clean room reimplementation to be safe - for
| exactly the same reasons that people writing code write
| clean room reimplementations.
| K0balt wrote:
| Yeah, the algorithms and program flow would have to be
| materially distinct to be really safe. Maybe switching
| language paradigms would get that for you in most cases?
| Js->haskell->js? Sounds like a nightmare lol.
| gorgoiler wrote:
| This would only apply if they were distributing the GPL
| licensed code alongside their own code.
|
| If my MIT-licensed one-line Python library has this line of
| code...
|
|     run(["bash", "-c", "echo hello"])
|
| ...I'm not suddenly subject to bash's licensing. For anyone
| wanting to run my stuff though, they're going to need to make
| sure they themselves have bash installed.
|
| (But, to argue against my own point, if an OS vendor ships my
| library alongside a copy of bash, do they have to now
| relicense my library as GPL?)
| calvinmorrison wrote:
| GPL is for boomers at this point. Floppy disks?
| Distribution? You can use a tool but you can't change it? A
| DLL call means you need to redistribute your code but
| forking doesn't?
|
| Silliness
| dboreham wrote:
| GPL post-dates network software distribution (we got our
| first gcc via ftp).
| calvinmorrison wrote:
| Yes, but if you use open source libraries for your closed
| source SaaS - that's fine. People get their software
| _over_ the network delivered to them in a VM (your
| browser).
| r4indeer wrote:
| > This would only apply if they were distributing the GPL
| licensed code alongside their own code.
|
| As far as I understand the FSF's interpretation of their
| license, that's not true. Even if you only dynamically link
| to GPL-licensed code, you create a combined work which has
| to be licensed, as a whole, under the GPL.
|
| I don't believe that this extends to calling an external
| program via its CLI, but that's not what the code in
| question seems to be doing.
|
| (This is not an endorsement, but merely my understanding on
| how the GPL is supposed to work.)
| woadwarrior01 wrote:
| This is a false analogy. It's quite straightforward.
|
| Running bash (via exec()/fork()/spawn()/etc) isn't the same
| as (statically or dynamically) linking with its codebase.
| If your MIT-licensed one-liner links to code that's GPL
| licensed, then it gets infected by the GPL license.
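|
| The mechanical difference can be sketched in Python, assuming only
| the standard library. `run_external` and `call_in_process` are
| hypothetical names, and `json` merely stands in for "some library
| loaded into your process". This shows what the two call styles look
| like, not how any license treats them.

```python
import subprocess
import sys
from typing import List


def run_external(argv: List[str]) -> str:
    """Run a separate program and return its stdout.

    The other program lives in its own process; only bytes cross the
    boundary (arguments in, stdout out).
    """
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


def call_in_process() -> str:
    """Load library code into this process and call it directly."""
    import json  # stand-in for any imported library

    return json.dumps({"hello": "world"})


if __name__ == "__main__":
    # Uses the current interpreter instead of bash so it runs anywhere.
    print(run_external([sys.executable, "-c", "print('hello')"]).strip())
    print(call_in_process())
```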
| themerone wrote:
| I've seen people use IPC to workaround the GPL, but I've
| also seen the FSF interpretations claiming that is still
| a derived work.
|
| I don't know if this has ever been tested in court.
| woadwarrior01 wrote:
| My interpretation of their FAQ[1] on it is that shelling
| out and IPC are fine, while linking is not. As you say,
| it's ultimately up to the courts to decide on.
|
| [1]: https://www.gnu.org/licenses/gpl-
| faq.html#MereAggregation
| sim7c00 wrote:
| you are correct. it's about linking as in what ld does, not
| conceptual linking.
| ApolloFortyNine wrote:
| The FSF thinks it counts as a derivative work and you have
| to use the LGPL to allow linking.
|
| However, this has never actually been proven in court, and
| there are many good arguments that linking doesn't count as a
| derivative work.
|
| Old post by a lawyer someone else found (version 3 wouldn't
| affect this) [1]
|
| For me personally I don't really understand how, if dynamic
| linking was viral, using linux to run code isn't viral.
| Surely at some level what linux does to run your code calls
| GPLed code.
|
| It doesn't really matter though, since the FSF stance is
| enough to scare companies away from using it, and any
| individual is highly unlikely to be sued.
|
| [1] https://www.linuxjournal.com/article/6366
| JoshTriplett wrote:
| > For me personally I don't really understand how, if
| dynamic linking was viral, using linux to run code isn't
| viral. Surely at some level what linux does to run your
| code calls GPLed code.
|
| The Linux kernel has an _explicit_ exception for
| userspace software:
|
| > NOTE! This copyright does _not_ cover user programs
| that use kernel services by normal system calls
| jcelerier wrote:
| And the GPL also has an explicit exception for "system"
| software such as kernel, platform libraries etc.:
|
| > The "System Libraries" of an executable work include
| anything, other than the work as a whole, that (a) is
| included in the normal form of packaging a Major
| Component, but which is not part of that Major Component,
| and (b) serves only to enable use of the work with that
| Major Component, or to implement a Standard Interface for
| which an implementation is available to the public in
| source code form. A "Major Component", in this context,
| means a major essential component (kernel, window system,
| and so on) of the specific operating system (if any) on
| which the executable work runs, or a compiler used to
| produce the work, or an object code interpreter used to
| run it.
|
| > The "Corresponding Source" for a work in object code
| form means all the source code needed to generate,
| install, and (for an executable work) run the object code
| and to modify the work, including scripts to control
| those activities. However, it does not include the work's
| System Libraries, or general-purpose tools or generally
| available free programs which are used unmodified in
| performing those activities but which are not part of the
| work.
| oezi wrote:
| The issue is even bigger: phonemizer is using espeak-ng,
| which isn't very good at turning graphemes into phonemes. In
| other TTS which rely on phonemes (e.g. Zonos) it turned out
| to be one of the key issues which cause bad generations.
|
| And it isn't something you can fix, because the model was
| trained on bad phonemes (everyone uses Whisper + then
| phonemizes the text transcript).
| Hackbraten wrote:
| Given that the FSF considers Apache-2.0 to be compatible with
| GPL-3.0 [0], how could the fact that phonemizer is GPL-3.0
| possibly be an issue?
|
| [0]: https://www.gnu.org/licenses/license-list.html#apache2
| adastra22 wrote:
| Compatible means they can be linked together, BUT the
| result is GPL-3.
| bscphil wrote:
| > the result is GPL-3
|
| The result _can only be distributed under the terms of_
| the GPL-3. That's actually a crucial difference: there's
| nothing preventing Kitten TTS from being Apache licensed,
| soliciting technical contributions under that license,
| and parts of its code being re-used in other software
| under that license. Yes, for the time being, this limits
| what you can do with Kitten TTS if you want to use the
| software as a whole (e.g. by embedding it into your
| product), but the license itself is still Apache and that
| can have value.
| CyberDildonics wrote:
| The github just has a few KB of python that looks like an
| install script. How is this used from C++ ?
| Narishma wrote:
| But Pi Zero has a GPU, so why not make use of it?
| ethan_smith wrote:
| This opens up voice interfaces for medical devices, offline
| language learning tools, and accessibility gadgets for the
| visually impaired - all markets where cloud dependency and
| proprietary licenses were showstoppers.
| vahid4m wrote:
| amazing! can't wait to integrate it into
| https://desktop.with.audio I'm already using KokorosTTS without a
| GPU. It works fairly well on Apple Silicon.
|
| Foundational tools like this open up the possibility of one-time
| payment or even free tools.
| rohan_joshi wrote:
| would love to see how that turns out. the full model release
| next week will be more expressive and higher quality than this
| one so we're excited to see you try that out.
| glietu wrote:
| Kudos guys!
| divamgupta wrote:
| Thanks
| wewewedxfgdf wrote:
| Chrome does TTS too.
|
| https://codepen.io/logicalmadboy/pen/RwpqMRV
| dang wrote:
| Most of these comments were originally posted to a different
| thread (https://news.ycombinator.com/item?id=44806543). I've
| moved them hither because on HN we always prefer to give the
| project creators credit for their work.
|
| (it does however explain how many of these comments are older
| than the thread they are now children of)
| righthand wrote:
| The sample rate does more than change the quality.
| indigodaddy wrote:
| Can Coqui run on CPU only?
| palmfacehn wrote:
| Yes, XTTS2 has been reasonably performant for me and the
| cloning is acceptable.
| mg wrote:
| Good TTS feels like it is something that should be natively built
| into every consumer device. So the user can decide if they want
| to read or listen to the text at hand.
|
| I'm surprised that phone manufacturers do not include good TTS
| models in their browser APIs for example. So that websites can
| build good audio interfaces.
|
| I for one would love to build a text editor that the user can use
| completely via audio. Text input might already be feasible via
| the "speak to type" feature that both Android and iOS offer.
|
| But there seems to be no good way to output spoken text without
| doing round-trips to a server and generate the audio there.
|
| The interface I would like would offer a way to talk to write and
| then commands like "Ok editor, read the last paragraph" or "Ok
| editor, delete the last sentence".
|
| It could be cool to do writing this way while walking. Just with
| a headset connected to a phone that sits in one's pocket.
| jiehong wrote:
| On Mac OS you can "speak" a text in almost every app, using
| built in voice (like the Siri voice or some older voices). All
| offline, and even from the terminal with "say".
| Fluorescence wrote:
| I tried it a few months ago to narrate an epub in Apple Books
| and it was very broken in a weird way. It starts out decent
| but after a few pages, it starts slurring, skipping words,
| trailing off not finishing sentences and then goes silent.
|
| (I've just tried it again without seeing that issue within a
| few pages)
|
| > Siri voice or some older voices
|
| You can choose "Enhanced" and "Premium" versions of voices
| which are larger and sound nice and modern to me. The "Serena
| Premium" voice I was using is over 200 MB and far better than
| this Show HN. It's very natural but kind of ruined by
| diabolical pronunciation of anything slightly non-standard
| which sadly seems to cover everything I read e.g.
| people/place names, technical/scientific terms or any
| neologisms in scifi/fantasy.
|
| It's so wildly incomprehensible for e.g. Tibetan names in a
| mountaineering book, that you have to check the text. If the
| word being butchered is frequently repeated e.g. main
| character's name, then it's just too painful to use.
| pjc50 wrote:
| Can't most people read faster than they can hear? Isn't this
| why phone menus are so awful?
|
| > But there seems to be no good way to output spoken text
| without doing round-trips to a server and generate the audio
| there
|
| As people have been pointing out, we've had mediocre TTS since
| the 80s. If it was a real benefit people would be using even
| the inadequate version.
| babycommando wrote:
| Someone please port this to ONNX so we don't need to do all this
| ass tooling
| victorbjorklund wrote:
| It is not the best TTS but it is freaking amazing it can be done
| by such a small model and it is good enough for so many use
| cases.
| rohan_joshi wrote:
| thanks, but keep in mind that this model is just a preview
| checkpoint that is only 10% trained. the full release next week
| will be of much higher quality and it will include a 15M model
| and an 80M model.
| android521 wrote:
| it would be great if there is typescript support in the future
| divamgupta wrote:
| Yup it runs on the web browser.
| https://clowerweb.github.io/kitten-tts-web-demo/
| khanan wrote:
| "please join our DISCORD!"...
| klipklop wrote:
| I tried it. Not bad for the size (of the model) and speed. Once
| you install all the massive number of libraries and things needed
| we are a far cry away from 25MB though. Cool project nonetheless.
| Dayshine wrote:
| It mentions ONNX, so I imagine an ONNX model is or will be
| available.
|
| ONNX runtime is a single library, with C#'s package being
| ~115MB compressed.
|
| Not tiny, but usually only a few lines to actually run and only
| a single dependency.
| divamgupta wrote:
| We will try to get rid of dependencies.
| wongarsu wrote:
| The repository already runs an ONNX model. But the onnx model
| doesn't get English text as input, it gets tokenized
| phonemes. The preprocessing for that is where most of the
| dependencies come from.
|
| Which is completely reasonable imho, but obviously comes with
| tradeoffs.
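|
| To make "tokenized phonemes" concrete, here is a minimal sketch of
| that last preprocessing step. The symbol table, the `$` padding
| symbol and the choice to pad both ends are assumptions for
| illustration only; the actual model defines its own vocabulary.

```python
from typing import Dict, List

# Hypothetical symbol table; a real model ships its own, larger one.
PAD = "$"
SYMBOLS: List[str] = [PAD, " ", ".", ",", "h", "ə", "l", "o", "ʊ", "ˈ"]
SYMBOL_TO_ID: Dict[str, int] = {s: i for i, s in enumerate(SYMBOLS)}


def tokenize(phonemes: str) -> List[int]:
    """Map each phoneme character to an integer id.

    Characters missing from the table are dropped, and the sequence is
    wrapped in padding ids (an assumed convention).
    """
    ids = [SYMBOL_TO_ID[ch] for ch in phonemes if ch in SYMBOL_TO_ID]
    return [SYMBOL_TO_ID[PAD], *ids, SYMBOL_TO_ID[PAD]]
```

| This step is only a table lookup; the heavy dependency is the
| grapheme-to-phoneme conversion that produces the input string.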
| pbronez wrote:
| For space sensitive applications like embedded systems,
| could you shift the preprocessing to compile time?
|
| You would need to constrain the vocabulary to see any
| benefits, but that could be reasonable. For example, an
| enumeration of numbers, units and metric names could handle
| dynamic time, temperature and other dashboard items.
|
| For something more complex like offline navigation, you
| already need to store a map. You could store street names
| as tokens instead of text. Add a few turn commands, and you
| have offline spoken directions without on device pre-
| processing.
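|
| A rough sketch of that idea, assuming a fixed phrase vocabulary. All
| names here (`toy_tokenize`, `build_phrase_table`, `render`) are
| hypothetical, and `toy_tokenize` is a placeholder for the real
| phonemize-then-tokenize pipeline, which would only run at build time.

```python
from typing import Callable, Dict, List


def toy_tokenize(text: str) -> List[int]:
    # Placeholder for the expensive build-time pipeline.
    return [ord(c) for c in text]


def build_phrase_table(
    phrases: List[str], tokenize_fn: Callable[[str], List[int]]
) -> Dict[str, List[int]]:
    """Build-time step: preprocess every allowed phrase exactly once."""
    return {p: tokenize_fn(p) for p in phrases}


def render(
    parts: List[str], table: Dict[str, List[int]], sep: List[int]
) -> List[int]:
    """Device-side step: join precomputed ids, no phonemizer required."""
    out: List[int] = []
    for i, part in enumerate(parts):
        if i:
            out.extend(sep)
        # A KeyError here means the phrase is outside the fixed vocabulary.
        out.extend(table[part])
    return out
```

| The device then only ships the table and the model. One caveat:
| joining phrases that were phonemized separately may lose some
| cross-word prosody.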
| WhyNotHugo wrote:
| Usually pulling in lots of libraries helps develop/iterate
| faster. They can then be removed later once the whole thing
| starts to take shape.
| zelphirkalt wrote:
| This case might be different, but ... usually that "later"
| never happens.
| devnen wrote:
| That's a great point about the dependencies.
|
| To make the setup easier and add a few features people are
| asking for here (like GPU support and long text handling), I
| built a self-hosted server for this model:
| https://github.com/devnen/Kitten-TTS-Server
|
| The goal was a setup that "just works" using a standard Python
| virtual environment to avoid dependency conflicts.
|
| The setup is just the standard git clone, pip install in a
| venv, and python server.py.
| k4rnaj1k wrote:
| Oh wow, really impressive. How long did this take you to
| make?
| devnen wrote:
| It didn't take too long. I already have two similar
| projects for Dia and Chatterbox tts models so I just needed
| to convert a few files.
| antisol wrote:
| > System Requirements: Works literally everywhere
|
| Haha, on one of my machines my python version is too old, and the
| package/dependencies don't want to install.
|
| On another machine the python version is too new, and the
| package/dependencies don't want to install.
| divamgupta wrote:
| We are working to fix that. Thanks
| raybb wrote:
| Have you considered offering a uvx command to run to get
| people going quickly?
| zelphirkalt wrote:
| Though I think you would still need to have the Python
| build dependencies installed for that to work.
| pjc50 wrote:
| If you restrict your dependencies to only those for which
| wheels are available, then uv should just be able to
| handle them for you.
| IshKebab wrote:
| I think it can install Python itself too. Though I have
| had issues with that - especially with SSL certificate
| locations, which is one of Linux's other clusterfucks.
| pjc50 wrote:
| "Fixing python packaging" is somewhat harder than AGI.
| dlcarrier wrote:
| I was commiserating with my brother over how difficult it
| is to set up an environment to run one LLM or diffusion
| model, let alone multiple or a combination. It's 5 percent
| CUDA/ROCm difficulties and 95% Python difficulties. We have
| a theory that anyone working with generative AI has to
| tolerate output that is only 90% right, and is totally fine
| working with a language and environment that only 90%
| works.
|
| Why is Python so bad at that? It's less kludgy than Bash
| scripts, but even those are easier to get working.
| 77pt77 wrote:
| This is a generic problem.
|
| JS/TS/npm is just as bad with probably more build
| tools/frameworks.
|
| Rust is a mess.
|
| Go, well.
|
| Even perl was quite complicated.
| dlcarrier wrote:
| Yeah, but it's easily solved, with directives, headers,
| or make files that specify which language standard it
| follows. Better yet, you can use different syntax with
| different language standards, so it's clear which to
| follow. If a compiler can automatically figure whether
| I'm compiling C or C++, why can't a Python interpreter
| figure out if I'm running version two or three, of the
| same language?
| com2kid wrote:
| > JS/TS/npm is just as bad with probably more build
| tools/frameworks.
|
| This is flat out wrong. NPM packages by default are local
| to a directory. And I haven't seen a package rely on a
| specific minor version of node in literally years. Node's
| back compat is also great, there was one hiccup 5 or 6
| years ago where a super popular native package was
| deprecated, but that's been about it.
|
| I can take current LTS node and run just about any
| package from the NPM repo written within the last 4 or 5
| years and it will just work. Meanwhile plenty of python
| packages somehow need specific point releases. What the
| unholy hell.
|
| Node version manager does exist, and it can be set up to
| work per directory, which is super cool, but I haven't
| needed NVM in literal years.
| qingcharles wrote:
| This is how we'll know ASI has arrived.
| flanked-evergl wrote:
| Just point people to uv/uvx.
| wongarsu wrote:
| The project is like 80% there by having a pyproject file
| that should work with uv and poetry. There just aren't any
| package versions specified and the python version is
| incredibly lax, and no lock file is provided.
| flanked-evergl wrote:
| in this context uv works perfectly fine with poetry, if
| you publish a wheel from poetry uv can use it. You don't
| have to switch anything in your project to make it work.
| superkuh wrote:
| A tool that was only released, what, a year or two ago? It
| simply won't be present in nearly all OS/distros. Only
| modern or rolling will have it (maybe). It's funny when the
| recommended python dependency manager managers are just as
| hard to install and use as the scripts themselves. Very
| python.
| hahn-kev wrote:
| Python man
| baobun wrote:
| man python
|
| There you go.
| wizzwizz4 wrote:
|     PYTHON(1)          General Commands Manual          PYTHON(1)
|
|     NAME
|            python - an object-oriented programming language
|
|     SYNOPSIS
|            python [ -c command | script | - ] [ arguments ]
|
|     DESCRIPTION
|            Python is the standard programming language.
|
| Computer scientists love Python, not just because
| whitespace comes first ASCIIbetically, but because it's the
| standard. Everyone else loves Python because it's PYTHON!
| rebolek wrote:
| Python is used not because it's good but because it's
| good enough just like Windows and plastics.
| wizzwizz4 wrote:
| I thought we were doing https://www.gnu.org/fun/jokes/ed-
| msg.html.
| sigmoid10 wrote:
| There are still people who use machine wide python installs
| instead of environments? Python dependency hell was already bad
| years ago, but today it's completely impractical to do it this
| way. Even on raspberries.
| lynx97 wrote:
| Debian pretty much "solved" this by making pip refuse to
| install packages if you are not in an venv.
| ChickeNES wrote:
| Ditto OpenSUSE, at least on Tumbleweed
| gm678 wrote:
| It needed distro buy in and implementation, but this is
| from the Python side: https://peps.python.org/pep-0668/
| auscompgeek wrote:
| IIRC that's actually a change in upstream pip.
| 77pt77 wrote:
| Well, with my python 3.13.5 not even that works!
|
| Pretty impressive but this seems to be a staple of most
| AI/ML projects.
|
| "Works on my machine" or "just use docker", although here
| the latter doesn't even seem to be an option.
| superkuh wrote:
| Yep. Python stopped being Python a decade ago. Now there are
| just innumerable Pythons. Perl... on the other hand, you can
| still run any perl script from any time on any _system_ perl
| interpreter and it works! Granted, perl is unpopular and not
| getting constant new features re: hardcore math/computation
| libs.
|
| Anyway, I think I'll stick with Festival 1.96 for TTS. It's
| super fast even on my core2duo and I have exactly zero chance
| of getting this Python 3'ish script to run on any machine
| with an OS older than a handful of years.
| m-s-y wrote:
| It breaks my heart that Perl fell out of favor. Perl "6"
| didn't help in the slightest.
| yjftsjthsd-h wrote:
| Using venv won't save you from having the wrong version of
| the actual Python interpreter installed.
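|
| One cheap mitigation is to fail early with a clear message when the
| interpreter is outside the range a package was tested on. A minimal
| sketch follows; the bounds below are made up for illustration and
| are not taken from the KittenTTS project.

```python
import sys
from typing import Tuple

# Hypothetical bounds; a real project would state its own tested range.
MIN_VERSION: Tuple[int, int] = (3, 8)
MAX_VERSION_EXCLUSIVE: Tuple[int, int] = (3, 13)


def is_supported(version: Tuple[int, int]) -> bool:
    """True if (major, minor) lies in [MIN_VERSION, MAX_VERSION_EXCLUSIVE)."""
    return MIN_VERSION <= version < MAX_VERSION_EXCLUSIVE


def check_interpreter() -> None:
    """Raise a readable error instead of failing deep inside an install."""
    current = (sys.version_info.major, sys.version_info.minor)
    if not is_supported(current):
        raise RuntimeError(
            f"Python {current[0]}.{current[1]} is outside the tested range "
            f"{MIN_VERSION[0]}.{MIN_VERSION[1]} to "
            f"<{MAX_VERSION_EXCLUSIVE[0]}.{MAX_VERSION_EXCLUSIVE[1]}"
        )
```

| A check like this does not fix anything by itself, but it turns a
| confusing dependency-resolution failure into a one-line explanation.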
| VagabundoP wrote:
| Install it with uvx that should solve the python issues.
|
| https://docs.astral.sh/uv/guides/tools/
|
| uv installation:
|
| https://docs.astral.sh/uv/getting-started/installation/
| dzogchen wrote:
| Such an ignorant thing to say for something that requires 25MB
| RAM.
| Bilal_io wrote:
| Not sure what the size has to do with anything.
|
| I send you a 500kb Windows .exe file and claim it runs
| literally everywhere.
|
| Would it be ignorant to say anything against it because of
| its size?
| asadm wrote:
| we all know runs anywhere in this context means compute
| wise. It's dumb to blame author for your dev setup issues.
| Hackbraten wrote:
| I didn't realize that that's what it meant until you
| mentioned it.
| dlcarrier wrote:
| It reminds me of the costs and benefits of RollerCoaster
| Tycoon being written in assembly language. Because it was so
| light on resources, it could run on any privately owned
| computer, or at least anything x86, which was pretty much
| everything at the time.
|
| Now, RISC architectures are much more common, so instead of
| the rare 68K Apple/Amiga/etc computer that existed at the
| time, it's super common to want to run software on an ARM or
| occasionally RISC-V processor, so writing in x86 assembly
| language would require emulation, making for worse
| performance than a compiled language.
| IshKebab wrote:
| Yeah some people have a problem and think "I'll use Python".
| Now they have like fifty problems.
| exe34 wrote:
| system python is for system applications that are known to work
| together. If you need a python install for something else,
| there's venv or conda and then pip install stuff.
| xena wrote:
| It doesn't work on Fedora because g++ isn't the right
| version.
| trostaft wrote:
| Not sure if they've fixed between then and now, but I just
| had it working locally on Fedora.
|
|     > g++ --version
|     g++ (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2)
|     Copyright (C) 2025 Free Software Foundation, Inc.
| akx wrote:
| I opened a couple of PRs to fix this situation:
|
| https://github.com/KittenML/KittenTTS/pull/21
| https://github.com/KittenML/KittenTTS/pull/24
| https://github.com/KittenML/KittenTTS/pull/25
|
| If you have `uv` installed, you can try my merged ref that has
| all of these PRs (and #22, a fix for short generation being
| trimmed unnecessarily) with
|
|     uvx --from git+https://github.com/akx/KittenTTS.git@pr-21-22-24-25 \
|       kittentts --output output.wav \
|       --text "This high quality TTS model works without a GPU"
| tetris11 wrote:
| Thanks for the quick intro into UV, it looks like docker
| layers for python
|
| I found the TTS a bit slow so I piped the output into ffplay
| with 1.2x speedup to make it sound a bit better
|
|     uvx --from git+https://github.com/akx/KittenTTS.git@pr-21-22-24-25 \
|       kittentts --text "I serve 12 different beers at my restaurant for over 1000000 customers" \
|       --voice expr-voice-3-m --output - | ffplay -af "atempo=1.2" -f wav -
| akx wrote:
| Ah, yeah, good catch - I added the model-native speed
| multiplier to the CLI too (`--speed=1.2` for instance).
| tetris11 wrote:
| https://github.com/KittenML/KittenTTS/pull/21/commits/0aa
| cfc...
|
| Nice one, thanks!
| miellaby wrote:
| You're supposed to use venv for everything but the python
| scripts distributed with your os
| turnsout wrote:
| You're getting a lot of comments along the lines of "Why don't
| you just ____," which only shows how Stockholmed the entire
| Python community is.
|
| With no other language are you expected to maintain several
| entirely different versions of the language, each of which is a
| relatively large installation. Can you imagine if we all had
| five different llvms or gccs just to compile five different
| modern C projects?
|
| I'm going to get downvoted to oblivion, but it doesn't change
| the reality that Python in 2025 is unnecessarily fragile.
| jhurliman wrote:
| That's exactly what I have. The C++ codebases I work on build
| against a specific pinned version of LLVM with many warnings
| (as errors) enabled, and building with a different version
| entails a nonzero amount of effort. Ubuntu will happily
| install several versions of LLVM side by side or compilation
| can be done in a Docker container with the correct compiler.
| Similarly, the TypeScript codebases I work with test against
| specific versions of node.js in CI and the engine field in
| package.json is specified. The different versions are managed
| via nvm. Python is the same via uv and pyproject.toml.
| turnsout wrote:
| I don't doubt it, but I don't think that situation is
| accepted as the default in C/C++ development. For the most
| part, I expect OSS to compile with my own clang.
| debugnik wrote:
| I agree with your point, but
|
| > if we all had five different llvms or gccs
|
| Oof, those are poor examples. Most compilers using LLVM other
| than clang do ship with their own LLVM patches, and cross-
| compiling with GCC does require installing a toolchain for
| each target.
| turnsout wrote:
| Cross-compiling is a totally different subject... I'm
| trying to make an apples-to-apples comparison. If you
| compile a lot of OSS C projects for the host architecture,
| you typically do not need multiple LLVMs or GCCs. Usually,
| the makefile detects various things about the platform and
| compiler and then fails with an inscrutable error. But that
| is a separate issue! haha
| 77pt77 wrote:
| > Can you imagine if we all had five different llvms or gccs
| just to compile five different modern C projects?
|
| Yes, because all I have to do is look at the real world.
| 77pt77 wrote:
| I had the too new.
|
| This package is the epitome of dependency hell.
|
| Seriously, stick with piper-tts.
|
| Easy to install, 50MB gives you excellent results and 100MB
| gives you good results with hundreds of voices.
| countfeng wrote:
| Very good model, thanks for the open source
| rohan_joshi wrote:
| thanks a lot, this model is just a preview checkpoint. the full
| release next week will be of much higher quality.
| tapper wrote:
| I am blind and use NVDA with a synth. How is this news? I don't
| get it! My synth is called Eloquence and is 4089 KB.
| mwcampbell wrote:
| Does your Eloquence installation include multiple languages?
| The one I have is only 1876 KB for US English only. And classic
| DECtalk is even smaller; I have here a version that's only 638
| KB (again, US English only).
| Perz1val wrote:
| Is the name a joke on "If the emperor had a tts device"? It's
| funny
| killerstorm wrote:
| I'm curious why smallish TTS models have metallic voice quality.
|
| The pronunciation sounds about right - I thought that was the hard
| part. And the model does it well. But voice timbre should be
| simpler to fix? Like, a simple FIR might improve it?
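| For readers unfamiliar with the term: a FIR (finite impulse
| response) filter computes each output sample as a weighted sum of
| the most recent input samples. A minimal pure-Python sketch is
| below. The function name and the 3-tap moving-average coefficients
| are illustrative assumptions, and nothing here shows that such a
| filter would actually remove the metallic timbre.

```python
def fir_filter(samples, taps):
    """Apply a FIR filter: y[n] = sum over k of taps[k] * x[n - k].

    Samples before the start of the signal are treated as zero.
    """
    output = []
    for n in range(len(samples)):
        acc = 0.0
        for k, coeff in enumerate(taps):
            if n - k >= 0:
                acc += coeff * samples[n - k]
        output.append(acc)
    return output


# Illustrative 3-tap moving average (a crude low-pass filter).
MOVING_AVERAGE_TAPS = [1 / 3, 1 / 3, 1 / 3]

if __name__ == "__main__":
    # Smooth a short, made-up signal with one sharp spike in it.
    print(fir_filter([0.0, 0.0, 3.0, 0.0, 0.0], MOVING_AVERAGE_TAPS))
```

| A real experiment would run this over the decoded PCM samples of
| the generated WAV and compare by ear.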
| codedokode wrote:
| Probably "metallicity" is due to lack of details and cannot be
| fixed that easily.
| nickpsecurity wrote:
| We change our tone based on personal style, emotion, context,
| and other factors. An accurate generator might need to encode
| all that information in the model. It will be larger than a
| model that doesn't do all of that.
| dr_kiszonka wrote:
| Microsoft's and some of Google's TTS models make the simplest
| mistakes. For instance, they sometimes read "i.e." as "for
| example." This is a problem if you have low vision and use TTS
| for, say, proofreading your emails.
|
| Why does it happen? I'm genuinely curious.
| lynx97 wrote:
| Well, speech synthesizers are pretty much famous for speaking
| all sorts of things wrong. But what I find very concerning
| about LLM based TTS is that some of them can't really speak
| numbers greater than 100. They try, but fail a lot. At least
| tts-1-hd was pretty much doing this for almost every 3 or 4
| digit number. Especially noticeable when it is supposed to read
| a year number.
| jpc0 wrote:
| Not entirely related but humans have the same problem.
|
| For scriptwriting when doing voice overs we always explicitly
| write out everything. So instead of 1 000 000 we would write
| one million or a million. This is a trivial example but if
| the number was 1 548 736 you will almost never be able to
| just read that off. However one million, five hundred and
| forty eight thousand, seven hundred and thirty six can just
| be read without parsing.
|
| Same with urls, W W W dot Google dot com.
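| The spelling-out step described above can also be done in code
| before text reaches a TTS engine. A minimal sketch, assuming
| non-negative integers below one billion and the "hundred and"
| phrasing used in the example above; the function names are made
| up for illustration and are not part of any TTS library.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven",
    "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
    "fifteen", "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
SCALES = [(1_000_000, "million"), (1_000, "thousand")]


def _below_thousand(n):
    """Spell out an integer in the range 1..999."""
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest:
        if rest < 20:
            word = ONES[rest]
        else:
            tens, ones = divmod(rest, 10)
            word = TENS[tens] + (" " + ONES[ones] if ones else "")
        parts.append(word)
    return " and ".join(parts)


def number_to_words(n):
    """Spell out an integer in the range 0..999,999,999."""
    if not 0 <= n < 1_000_000_000:
        raise ValueError("only 0..999,999,999 is supported in this sketch")
    if n == 0:
        return "zero"
    groups = []
    for value, name in SCALES:
        count, n = divmod(n, value)
        if count:
            groups.append(_below_thousand(count) + " " + name)
    if n:
        groups.append(_below_thousand(n))
    return ", ".join(groups)


if __name__ == "__main__":
    print(number_to_words(1_548_736))
```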
| lynx97 wrote:
| Regarding humans, yes and no. If a human constantly had
| problems with 3 and 4 digit numbers like tts-1-hd does, I'd
| ask myself if they were neurodivergent in some way.
|
| And yes, I added instructions along the lines of what you
| describe to my prompt. It's just sad that we have to. After
| all, LLM TTS has solved a bunch of real problems, like
| switching languages in a text, or foreign words. The
| pronunciation is better than anything we ever had. But it
| fails to read short numbers. I feel like that small issue
| could probably have been solved by doing some fine tuning.
| But I actually don't really understand the tech for it,
| so...
| wongarsu wrote:
| From the web demo this model is really good at numbers. It
| rushes through them, slurs them a bit together, but they are
| all correct, even 7 digit numbers (didn't test further).
|
| Looks like they are sidestepping these kinds of issues by
| generating the phonemes with the preprocessing stage of
| traditional speech synthesizers, and using the LLM only to
| turn those phonemes into natural-ish sounding speech. That
| limits how natural the model can become, but it should be
| able to correctly pronounce anything the preprocessing can
| pronounce
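| A toy illustration of that split: a deterministic front end
| normalizes the text and looks up phonemes, and only the phoneme
| sequence would be handed to the model. Everything here is a
| hypothetical stand-in (the abbreviation table, the two-word
| lexicon with ARPAbet-style symbols, the function names); it is
| not the preprocessing KittenTTS actually uses.

```python
import re

# Hypothetical expansion table; "e.g." and "i.e." stay distinct on purpose.
ABBREVIATIONS = {"e.g.": "for example", "i.e.": "that is"}

# Hypothetical toy lexicon with ARPAbet-style phoneme symbols.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def normalize(text):
    """Lowercase, expand known abbreviations, and split into words."""
    text = text.lower()
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    return re.findall(r"[a-z']+", text)


def to_phonemes(words):
    """Look up each word; unknown words are marked rather than guessed."""
    phonemes = []
    for word in words:
        phonemes.extend(LEXICON.get(word, ["<unk:" + word + ">"]))
    return phonemes


if __name__ == "__main__":
    print(to_phonemes(normalize("Hello, world")))
```

| A real front end would also handle numbers, dates and
| out-of-vocabulary words, which is where rule-based synthesizers
| have had decades of tuning.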
| 3rd3 wrote:
| You probably mean "e.g." as "for example", not "i.e."?
|
| This might be on purpose and part of the training data because
| "for example" just sounds much better than "e.g.". Presumably
| for most purposes, linguistic naturalness is more important
| than fidelity.
| layer8 wrote:
| Sometimes I use "for example" and "e.g." in consecutive
| sentences to not sound repetitive, or possibly even within
| the same sentence (e.g. in parentheses). In that case,
| speaking both as "for example" would degrade it
| linguistically.
|
| In any case, I'd like TTS to not take that kind of artistic
| freedom.
| Retr0id wrote:
| They're often trained from video subtitles, and humans writing
| subtitles make that kind of mistake too.
| BenGosub wrote:
| I wonder what it would take to extend it with a custom voice?
| junon wrote:
| This feels different. This feels like a genuinely monumental
| release. Holy cow.
|
| Very well done. The quality is excellent and the technical
| parameters are, simply, unbelievable. Makes me want to try to
| embed this on a board just to see if it's possible.
| ricardobeat wrote:
| The samples featured elsewhere seem to be from a larger model?
|
| After testing this locally, it still sounds quite mechanical, and
| fails catastrophically for simple phrases with numbers ("easy as
| 1-2-3"). If the 80M model can improve on this and keep the
| expressiveness seen in the reddit post, that looks promising.
| tecleandor wrote:
| Not bad for the size (with my very limited knowledge of this
| field) !
|
| In a couple tests, the "Male 2" voice sounds reasonable, but I've
| found it has problems with some groups of words, especially when
| played with little context. I think it's small sentences.
|
| For example, if you try to do just "Hey gang!", it will sound
| something like "Chay yang". But if you add an additional sentence
| after that, it will sound a bit different (but still weird).
| rishav_sharan wrote:
| Question for the experts here: what would be a SOTA TTS that can
| run on an average laptop (32GB RAM, 4GB VRAM)? I just want to
| attach a TTS to my SLM output, and get the highest possible voice
| quality and human resemblance.
| kroaton wrote:
| Try Unmute by Kyutai - https://unmute.sh/
| yahoozoo wrote:
| Is there a paper describing the architecture of the model?
| zelphirkalt wrote:
| What I am still looking for is a way to clone voice locally. I
| have OK hardware. For example I can use Mistral Small 3.1 or what
| it is called locally. Premade voices can be interesting too, but
| I am looking for custom voice. Perhaps by providing audio and the
| corresponding transcript to the model, training it, and then give
| it a new text and let it speak that.
| alexnewman wrote:
| I'm so confused on how the model is actually made. It doesn't
| seem to be in the code or this stuff is way simpler than I
| thought. It seems to use a fancy library from Japan, not sure how
| much it's just that
| anthk wrote:
| Atom n270 running flite with a good voice -slt- vs this... would
| it be fast enough to play a MUD? Flite is almost realtime
| fast...
| bashkiddie wrote:
| TL;DR: If you are interested in TTS, you should explore
| alternatives
|
| I tried to use it...
|
| Its python venv has grown to 6 GBytes in size. The demo sentence
|
| > "This high quality TTS model works without a GPU"
|
| works, it takes 3s to render the audio. Audio sounds like a voice
| in a tin can.
|
| I tried to have a news article read aloud and failed with
|
| > [E:onnxruntime:, sequential_executor.cc:572 ExecuteKernel]
| > Non-zero status code returned while running Expand node.
| > Name:'/bert/Expand' Status Message: invalid expand shape
|
| If you are interested in TTS, you should explore alternatives
| MrGilbert wrote:
| A localized version of this, and I could finally build my tiny
| Amazon Echo replacement. I would love to see all speech synthesis
| performed on a local device.
| varenc wrote:
| I'm doing this now with Home Assistant voice. All the TTS, STT,
| and LLMs involved run locally on my network. It's absurdly
| superior to every other voice assistant product. (Would be nice
| if it was just a pure multi-modal model though)
| binary132 wrote:
| I'm new to TTS models but is this something I can plug into my
| own engine like with LLMs, or does it require the Python stack it
| ships with?
| imprezagx2 wrote:
| BEAT THIS! Commodore C64 has the same feature called SAM -
| speaker synthesizer, speaks English and Polish. 48 kB of RAM
|
| BEAT THIS!
| spapas82 wrote:
| This is great for English, but is there something similar for other
| languages? Could this be trained somehow to support other
| languages?
| dirkc wrote:
| Have you considered adding some 'rendered' examples of what the
| model sounds like?
|
| I'm curious, but right now I don't want to install the package
| and run some code.
| C-Loftus wrote:
| Awesome work! Often times in the TTS space, human-similarity is
| given way too much emphasis at the expense of hurting user
| access. Frankly as long as a voice is clear and you listen to it
| for a while, the brain filters out most quirks you would perceive
| on the first pass. Hence why many blind folks still are perfectly
| fine using espeak-ng. The other properties like speed of
| generation and size make it worth it.
|
| I've been using a custom AI audiobook generation program [0] with
| piper for quite a while now and am very excited to look at
| integrating kitten. Historically piper has been the only good
| option for a free CPU-only local model so I am super happy to see
| more competition in the space. Easy installation is a big deal,
| since piper historically has had issues with that. (Hence why I
| had to add auto installation support in [0])
|
| [0] https://github.com/C-Loftus/QuickPiperAudiobook
| thedangler wrote:
| Elixir folks. How would I use this with Elixir? I'm new to Elixir
| and could use this in about 15 days.
| bglusman wrote:
| It looks like it's Python, so it might be possible to use via
| https://github.com/livebook-dev/pythonx ? But the parallel
| huggingface/bumblebee idea was also good; I hadn't seen or
| thought of it, and it definitely works for a lot of other models.
| Curious whether you get it working! There's some chance I'll play
| with this myself in a few months, so feel free to report back
| here or DM me!
| bglusman wrote:
| I just decided to try this quickly and hit some issues on my
| Mac, FYI. It might work better on Linux, but I hit a
| compilation issue with `curated-tokenizers`, possibly from a
| typo in setup.py or pyproject.toml in curated-tokenizers,
| spotted by AI: `-Wno-sign-compare-Wno-strict-prototypes` should
| be `-Wno-sign-compare -Wno-strict-prototypes`, so this could
| perhaps be fixed with a PR to curated-tokenizers or by forking
| it...
|
| There might well be other issues behind that, and it's unclear
| whether it needs other dependencies that kitten doesn't rely on
| directly, like torch or torchaudio. So it's not 5 minutes easy,
| but it looks like the issues could be worked through...
|
| For reference this is all I was trying basically:
|     Mix.install([:pythonx])
|     Pythonx.uv_init("""
|     [project]
|     name = "project"
|     version = "0.0.0"
|     requires-python = ">=3.8"
|     dependencies = [
|       "kittentts @ https://github.com/KittenML/KittenTTS/releases/download/0.1/kittentts-0.1.0-py3-none-any.whl"
|     ]
|     """)
|
| to get the above error.
| dorian-graph wrote:
| It's not possible so far via Bumblebee, unfortunately[1].
|
| [1] https://github.com/elixir-nx/bumblebee/issues/209
| akx wrote:
| This is a fun model for circuit-bending, because the voice style
| vectors are pretty small.
|
| For instance, try adding `np.random.shuffle(ref_s[0])` after the
| line `ref_s = self.voices[voice]`...
|
| EDIT: be careful with your system volume settings if you do this.
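| To see what that one-liner does without touching the package, here
| is a standalone sketch on a stand-in array. The (1, 256) shape and
| the `shuffled_style` helper are assumptions for illustration; the
| real `ref_s` comes from the model's voice file. Shuffling keeps
| every value and only changes which dimension each value sits in.

```python
import numpy as np


def shuffled_style(ref_s, seed=None):
    """Return a copy of a (1, N) style array with row 0 permuted."""
    rng = np.random.default_rng(seed)
    scrambled = np.array(ref_s, copy=True)
    # Shuffles the row view in place, which modifies `scrambled` itself.
    rng.shuffle(scrambled[0])
    return scrambled


if __name__ == "__main__":
    # Stand-in for a voice style vector; real values would differ.
    ref_s = np.linspace(-1.0, 1.0, 256, dtype=np.float32).reshape(1, 256)
    out = shuffled_style(ref_s, seed=0)
    print(out.shape)
    print(np.array_equal(np.sort(out[0]), np.sort(ref_s[0])))
```

| Unlike the in-place call quoted above, this version copies first,
| so the original voice stays usable for comparison.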
| the_arun wrote:
| I like the direction we are heading. Build models that can run on
| CPUs & AI can become even more mainstream.
| butz wrote:
| How does one build a similar model, but for different languages? I
| was under the impression that being open source, there would be some
| instructions on how to build everything on your own.
| peanut_merchant wrote:
| I ran some quick benchmarks.
|
| Ubuntu 24, Razer Blade 16, Intel Core i9-14900HX
|
| Performance Results:
|
| Initial Latency: ~315ms for short text
|
| Audio Generation Speed (seconds of audio per second of
| processing):
| - Short text (12 chars): 3.35x realtime
| - Medium text (100 chars): 5.34x realtime
| - Long text (225 chars): 5.46x realtime
| - Very Long text (306 chars): 5.50x realtime
|
| Findings:
| - Model loads in ~710ms
| - Generates audio at ~5x realtime speed (excluding initial
|   latency)
| - Performance is consistent across different voices (4.63x -
|   5.28x realtime)
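| For anyone wanting to reproduce numbers like these, the "x
| realtime" figure is seconds of audio produced divided by seconds
| spent producing it. A minimal timing sketch follows; `synthesize`
| is a hypothetical callable that returns a sequence of audio
| samples, and the sample rate is whatever your engine outputs.
| This is not the script used for the results above.

```python
import time


def realtime_factor(audio_seconds, processing_seconds):
    """Seconds of audio generated per second of processing."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def benchmark(synthesize, text, sample_rate):
    """Time one call to `synthesize` and return its realtime factor."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return realtime_factor(len(samples) / sample_rate, elapsed)


if __name__ == "__main__":
    # Fake engine for demonstration: returns one second of silence.
    def fake_synthesize(text):
        time.sleep(0.05)
        return [0.0] * 24000

    print(f"{benchmark(fake_synthesize, 'hello', 24000):.2f}x realtime")
```

| Values above 1.0 mean faster than realtime. Running each text
| several times and excluding the first call keeps model loading
| out of the measurement.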
| divamgupta wrote:
| Thanks for running the benchmarks. Currently the models are not
| optimized yet. We will optimize loading etc when we release an
| SDK meant for production :)
| don-bright wrote:
| on my Intel(R) Celeron(R) N4020 CPU @ 1.10GHz it takes 6
| seconds to import/load and text generation is roughly 1x
| realtime on various lengths of text.
| Jotalea wrote:
| thanks for testing on the same hardware as mine, before me.
| yunusabd wrote:
| Impressive, might use this for https://hnup.date
| theshrike79 wrote:
| Love the idea, but the text it produces is way too flowery for
| my taste
|
| "A new tool is stirring up excitement and debate in the
| programming community"
|
| Just give me the facts without American style embellishments.
| You're not trying to sell me anything =)
| mattfrommars wrote:
| Can this work on an Intel NPU?
| m00dy wrote:
| I think one of the female voices belongs to Elizabeth Warren.
| gunalx wrote:
| Would love to see something like this trained for multilingual
| purposes. It seems kinda like the same tier as piper, but a bit
| faster.
| 77pt77 wrote:
| How does this compare to say piper-tts?
|
| I ask because their models are pretty small. Some sound awesome
| and there is no dependency hell like I'm seeing here.
|
| Example: https://rhasspy.github.io/piper-samples/#en_US-ryan-high
| moomoo11 wrote:
| Are there any speech-to-text models (opposite direction) that I can
| load in a mobile app?
| system2 wrote:
| One thing any GitHub project never has. A few-second demo.
| mrfakename wrote:
| Cool, it looks like this model is pretty similar to StyleTTS 2?
| Would it be possible to confirm?
| pjcodes wrote:
| This looks pretty awesome. I will definitely give it a try and let
| you know the results
| marcobambini wrote:
| Is there any way to get a .gguf version?
| alexwang123 wrote:
| This is really great.
| ghm2180 wrote:
| Just amazing
| OrangeMusic wrote:
| It's just so annoying and idiotic that there aren't a few samples
| on the home page. It didn't occur to you that it's the very first
| thing people would want to hear?
| Piraty wrote:
| 25M ? lol . the venv is 6.9G
| akrymski wrote:
| Now if only we could get LLMs to this sort of size! I don't know
| much about how TTS works under the hood, why is it so much
| easier?
___________________________________________________________________
(page generated 2025-08-07 23:01 UTC)