[HN Gopher] Chatterbox TTS
___________________________________________________________________
Chatterbox TTS
Author : pinter69
Score : 107 points
Date : 2025-06-11 20:23 UTC (2 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| gardnr wrote:
| Previously, on Hacker News:
|
| https://news.ycombinator.com/item?id=44120204
|
| https://news.ycombinator.com/item?id=44144155
|
| https://news.ycombinator.com/item?id=44195105
|
| https://news.ycombinator.com/item?id=44230867
|
| https://news.ycombinator.com/item?id=44172134
|
| https://news.ycombinator.com/item?id=44221910
|
| https://news.ycombinator.com/item?id=44145564
| pinter69 wrote:
| I did a quick Google search before posting and only found a
| reference in a comment. But I searched for the link to the
| GitHub repo.
| tomhow wrote:
| Thanks for posting this but it's conventional to only post
| links to past submissions if they had significant discussion,
| which none of these did.
| xnx wrote:
| You can run it for free here:
| https://huggingface.co/spaces/ResembleAI/Chatterbox
| abraxas wrote:
| Are these things good enough to narrate a book convincingly,
| or does the voice lose coherence after a few spoken
| paragraphs?
| pinter69 wrote:
| I consult for a company in the space (not Resemble) and I can
| definitely say it can narrate a book.
| raincole wrote:
| Once it's good enough Audible will be flooded with AI-narrated
| books so we'll know soon. (The only question is whether Amazon
| would disclose it, ofc)
| landl0rd wrote:
| The flip side is that being able to auto-generate an audiobook
| for a book that doesn't have one (or use an existing ebook
| rather than paying Audible $30 for their version), if it's
| "good enough", is a legit improvement. AI-generated isn't as
| good, but it's better than nothing. Also, being able to
| interrupt and ask for more detail/context would be pretty
| nice. Like when I'm reading some Pynchon and have to stop to
| look up a reference to some product nobody knows anymore,
| stuff like that.
| fatesblind wrote:
| It's watermarked.
| wsintra2022 wrote:
| A year ago, for fun, I gave a friend a Carl Rogers therapy
| audiobook with an Attenborough-esque reading. It was pretty
| good over a year ago, so it should be better now.
| vunderba wrote:
| Most of these TTS systems tend to fall apart the longer the
| text gets - it's a good idea to wrap any long-form text into
| separate paragraph-segmented batches and then stitch them back
| together again at the end (see the sketch below).
|
| I've also found that if your one-shot sample wave isn't really
| clean, Chatterbox sometimes produces random unholy whooshing
| sounds at the end of the generated audio, which is an added
| bonus if you're recording Dante's Inferno.
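|
| A minimal sketch of that batching approach, assuming the
| Python API shown in the Chatterbox README (a
| ChatterboxTTS.from_pretrained loader and a generate(text)
| method that returns a waveform tensor); exact names may
| differ:
|
|     import torch
|     import torchaudio as ta
|     from chatterbox.tts import ChatterboxTTS
|
|     def narrate_long_text(text, out_path="narration.wav"):
|         # Load the model (API per the repo README; assumed).
|         model = ChatterboxTTS.from_pretrained(device="cuda")
|
|         # Split long-form text into paragraph-sized batches so
|         # quality doesn't drift over very long inputs.
|         paragraphs = [p.strip() for p in text.split("\n\n")
|                       if p.strip()]
|
|         # Generate each paragraph separately, then stitch the
|         # audio back together at the end.
|         chunks = [model.generate(p) for p in paragraphs]
|         ta.save(out_path, torch.cat(chunks, dim=-1), model.sr)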
| elektor wrote:
| Yes, I've generated an audiobook of an epub using this tool
| and the result was passable: https://github.com/santinic/audiblez
| Mizza wrote:
| Demos here: https://resemble-ai.github.io/chatterbox_demopage/
| (not mine)
|
| This is a good release if they're not too cherry-picked!
|
| I say this every time it comes up, and it's not as sexy to work
| on, but in my experiments voice AI is really held back by
| transcription, not TTS. Unless that's changed recently.
| pinter69 wrote:
| Right you are. I've used Speechmatics; they do a decent job
| with transcription.
| theyinwhy wrote:
| 1 error every 78 characters?
| ianbicking wrote:
| FWIW in my recent experience I've found LLMs are very good at
| reading through the transcription errors
|
| (I've yet to experiment with giving the LLM alternate
| transcriptions or confidence levels, but I bet they could make
| good use of that too)
| mikepurvis wrote:
| I was going to say, ideally you'd be able to funnel
| alternates to the LLM, because it would be vastly better
| equipped to judge what is a reasonable next word than a
| purely phonetic model.
| vunderba wrote:
| Pairing speech recognition with an LLM acting as a post-
| processor is a pretty good approach.
|
| I put together a script a while back which converts any
| passed audio file (wav, mp3, etc.), normalizes the audio,
| passes it to ggerganov whisper for transcription, and then
| forwards to an LLM to clean the text. I've used it with a
| pretty high rate of success on some of my very old and poorly
| recorded voice dictation recordings from over a decade ago.
|
| Public gist in case anyone finds it useful:
|
| https://gist.github.com/scpedicini/455409fe7656d3cca8959c123.
| ..
| Tokumei-no-hito wrote:
| thanks for sharing. are some local models better than
| others? can small models work well or do you want 8B+?
| vunderba wrote:
| So in my experience smaller models tend to produce worse
| results, _BUT_ I actually got really good transcription
| cleanup with CoT (chain-of-thought) models like Qwen, even
| quantized down to 8b.
| throwawaymaths wrote:
| Do you know if any current locally hostable public
| transcribers are good at diarization? For some tasks, having
| even crude diarization would improve QOL by a huge factor. I
| was looking at a whisper diarization Python package for a bit,
| but it was a bitch to deploy.
| iainmerrick wrote:
| Deepgram does it.
| throwawaymaths wrote:
| Sorry, I meant locally hostable public. I'll edit the parent.
| causal wrote:
| Played with the Huggingface demo, and I'm guessing this page
| is a little cherry-picked? In particular, I am not getting
| that kind of emotion in my responses.
| lukax wrote:
| Have you tried Soniox? It's a multilingual STT that supports
| real-time transcription and translation. It's actually real-
| time, not like some other providers claim. It also supports
| endpoint detection so you can easily integrate it into agents.
| And also really cheap, $0.12 per hour.
|
| https://soniox.com
| j2kun wrote:
| They should put the meaning of "TTS" in the readme somewhere,
| probably near the top. Or their website.
| byteknight wrote:
| TTS is a very common initialism for Text-to-Speech going back
| to at least the 90s.
| j2kun wrote:
| So? Acronym soup is bad communication.
| aquariusDue wrote:
| I miss glossaries.
| dylan604 wrote:
| Good writing rules can still apply to repo READMEs: the
| first time an acronym is used, spell it out so readers know
| what it means. Too many assumptions are made that everyone
| is going to know it. Sometimes the author is too inside
| baseball and assumes anyone reading their README will
| already know about the subject. Not all devs are literature
| majors and probably never think about these things.
| sdenton4 wrote:
| Tabletop Simulator.
|
| It's obviously an AI for playing wargames without having to
| bother painting all the miniatures, or finding someone with the
| same weird interest in Balkan engagements during the Napoleonic
| era.
| decide1000 wrote:
| How does it perform on multi-lingual tasks?
| yjftsjthsd-h wrote:
| The README says it only supports English.
| pryelluw wrote:
| Silly question: what's the lowest-spec hardware this will run on?
| bityard wrote:
| Not a silly question, I came here to ask too. Curious to know
| whether I need a GPU costing 4 digits or if it will run on my
| 12-year-old thinkpad shitbox. Or something in between.
| nmstoker wrote:
| I've found it excellent with really common accents, but with
| other accents (that are pretty common too) it can easily get
| stuck picking a different accent. For instance, several
| Scottish recordings ended up Australian, and likewise a fairly
| mild Yorkshire accent.
| az226 wrote:
| How does one train a TTS model with an LLM backbone? Practically,
| how does this work?
| cyanf wrote:
| You use a neural audio codec to encode audio into codebooks.
|
| Then you can treat the codebook entries as tokens and treat
| audio generation as a next-token prediction task.
|
| You then take the generated codebook entries, run them through
| the codec's decoder, and get audio out (rough sketch below).
|
| It works surprisingly well.
|
| Speech-text models (a TTS model with an LLM as the backbone)
| are the current meta.
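|
| A toy illustration of that encode -> predict -> decode loop,
| using EnCodec from Hugging Face transformers as the neural
| codec (the LLM prediction step is only described in comments):
|
|     from transformers import AutoProcessor, EncodecModel
|
|     codec = EncodecModel.from_pretrained(
|         "facebook/encodec_24khz")
|     processor = AutoProcessor.from_pretrained(
|         "facebook/encodec_24khz")
|
|     def audio_to_tokens(waveform):
|         # Encode the waveform into discrete codebook indices.
|         inputs = processor(raw_audio=waveform,
|                            sampling_rate=processor.sampling_rate,
|                            return_tensors="pt")
|         enc = codec.encode(inputs["input_values"],
|                            inputs["padding_mask"])
|         # These indices are the "audio tokens" an LLM backbone
|         # learns to predict one at a time, conditioned on the
|         # input text (and previously generated audio tokens).
|         return enc.audio_codes, enc.audio_scales, inputs
|
|     def tokens_to_audio(audio_codes, audio_scales, inputs):
|         # Decode codebook indices back into a waveform.
|         return codec.decode(audio_codes, audio_scales,
|                             inputs["padding_mask"])[0]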
| teraflop wrote:
| > Every audio file generated by Chatterbox includes Resemble AI's
| Perth (Perceptual Threshold) Watermarker - imperceptible neural
| watermarks that survive MP3 compression, audio editing, and
| common manipulations while maintaining nearly 100% detection
| accuracy.
|
| Am I misunderstanding, or can you trivially disable the watermark
| by simply commenting out the call to the apply_watermark function
| in tts.py? https://github.com/resemble-
| ai/chatterbox/blob/master/src/ch...
|
| I thought the point of this sort of watermark was that it was
| embedded somehow in the model weights, so that it couldn't easily
| be separated out. If you're going to release an open-source model
| that adds a watermark as a separate post-processing step, then
| why bother with the watermark at all?
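|
| For what it's worth, the watermark does appear to be a plain
| post-processing call on the finished waveform. A hedged sketch
| of that step, assuming the resemble-perth package's
| PerthImplicitWatermarker API (names taken from the project's
| docs; treat them as assumptions):
|
|     import librosa
|     import perth  # pip install resemble-perth (assumed)
|
|     wav, sr = librosa.load("generated.wav", sr=None)
|
|     # The watermark is applied after synthesis, to the output
|     # waveform itself...
|     watermarker = perth.PerthImplicitWatermarker()
|     marked = watermarker.apply_watermark(wav, sample_rate=sr)
|
|     # ...and detected the same way. Skipping the apply step is
|     # all it takes to ship unwatermarked audio.
|     print(watermarker.get_watermark(marked, sample_rate=sr))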
| jchw wrote:
| Possibly a sort of CYA gesture, kinda like how original Stable
| Diffusion had a content filter IIRC. Could also just be to
| prevent people from accidentally getting peanut butter in the
| toothpaste WRT training data, too.
| vunderba wrote:
| Yeah, there's even a `--no-watermark` flag in the parser to
| turn it off. I assumed they added it for downstream users
| pulling it in as a "feature" for their larger product.
| andy_xor_andrew wrote:
| In my experience, TTS has been a "pick two" situation:
|
| - fast / cheap to run
|
| - can clone voices
|
| - sounds super realistic
|
| From what I can tell, Chatterbox is the first that apparently
| lets you pick all three! (I have not tried it myself yet;
| this is just what I can deduce.)
___________________________________________________________________
(page generated 2025-06-11 23:00 UTC)