[HN Gopher] Chatterbox TTS
       ___________________________________________________________________
        
       Chatterbox TTS
        
       Author : pinter69
       Score  : 107 points
       Date   : 2025-06-11 20:23 UTC (2 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | gardnr wrote:
       | Previously, on Hacker News:
       | 
       | https://news.ycombinator.com/item?id=44120204
       | 
       | https://news.ycombinator.com/item?id=44144155
       | 
       | https://news.ycombinator.com/item?id=44195105
       | 
       | https://news.ycombinator.com/item?id=44230867
       | 
       | https://news.ycombinator.com/item?id=44172134
       | 
       | https://news.ycombinator.com/item?id=44221910
       | 
       | https://news.ycombinator.com/item?id=44145564
        
         | pinter69 wrote:
          | I did a quick Google search before posting and only found a
          | reference in a comment. But I did search for the link to
          | the GitHub repo.
        
         | tomhow wrote:
         | Thanks for posting this but it's conventional to only post
         | links to past submissions if they had significant discussion,
         | which none of these did.
        
       | xnx wrote:
       | You can run it for free here:
       | https://huggingface.co/spaces/ResembleAI/Chatterbox
        
       | abraxas wrote:
       | Are these things good enough to narrate a book convincingly or
       | does the voice lose coherence after a few paragraphs being
       | spoken?
        
         | pinter69 wrote:
          | I consult for a company in the space (not Resemble) and I
          | can definitely say it can narrate a book
        
         | raincole wrote:
         | Once it's good enough Audible will be flooded with AI-narrated
         | books so we'll know soon. (The only question is whether Amazon
         | would disclose it, ofc)
        
           | landl0rd wrote:
           | Flip side is a solution where I can have a book without an
           | audiobook auto-generated (or use an existing ebook rather
           | than paying audible $30 for their version) and it's "good
           | enough" is a legit improvement. AI generated isn't as good
           | but it's better than nothing. Also, being able to interrupt
           | and ask for more detail/context would be pretty nice. Like
           | I'm reading some Pynchon and I have to stop sometimes and
           | look up the name of a reference to some product nobody knows
           | now, stuff like that.
        
           | fatesblind wrote:
            | It's watermarked
        
         | wsintra2022 wrote:
          | A year ago, for fun, I made a friend a Carl Rogers therapy
          | audiobook with an Attenborough-esque reading. It was pretty
          | good over a year ago, so it should be even better now.
        
         | vunderba wrote:
         | Most of these TTS systems tend to fall apart the longer the
         | text - it's a good idea to just wrap any longform text into
         | separate paragraph segmented batches and then stitch them back
         | together again at the end.
         | 
         | I've also found that if your one-shot sample wave isn't really
         | clean that sometimes Chatterbox produces random unholy
         | whooshing sounds at the end of the generated audio which is an
         | added bonus if you're recording Dante's Inferno.
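A minimal sketch of the batching approach described above. The `generate_fn` here is a stand-in for whatever TTS call you use (Chatterbox's actual API may differ); only the chunking-and-stitching logic is concrete:

```python
def chunk_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Group paragraphs into batches no longer than max_chars each."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    batches, current = [], ""
    for p in paragraphs:
        # Start a new batch when adding this paragraph would overflow.
        if current and len(current) + len(p) + 2 > max_chars:
            batches.append(current)
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        batches.append(current)
    return batches

def narrate(text: str, generate_fn) -> list:
    # One clip per batch; concatenate the resulting waveforms at the
    # end (e.g. torch.cat on the audio tensors, or ffmpeg concat).
    return [generate_fn(chunk) for chunk in chunk_paragraphs(text)]
```

Keeping each batch under a per-request budget is what avoids the long-text degradation; the stitch step is just concatenation.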
        
         | elektor wrote:
          | Yes, I've generated an audiobook from an epub using this
          | tool and
         | the result was passable: https://github.com/santinic/audiblez
        
       | Mizza wrote:
       | Demos here: https://resemble-ai.github.io/chatterbox_demopage/
       | (not mine)
       | 
       | This is a good release if they're not too cherry picked!
       | 
       | I say this every time it comes up, and it's not as sexy to work
       | on, but in my experiments voice AI is really held back by
       | transcription, not TTS. Unless that's changed recently.
        
         | pinter69 wrote:
         | Right you are. I've used speechmatics, they do a decent jon
         | with transcription
        
           | theyinwhy wrote:
           | 1 error every 78 characters?
        
         | ianbicking wrote:
         | FWIW in my recent experience I've found LLMs are very good at
         | reading through the transcription errors
         | 
         | (I've yet to experiment with giving the LLM alternate
         | transcriptions or confidence levels, but I bet they could make
         | good use of that too)
        
           | mikepurvis wrote:
           | I was going to say, ideally you'd be able to funnel
           | alternates to the LLM, because it would be vastly better
           | equipped to judge what is a reasonable next word than a
           | purely phonetic model.
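A hypothetical sketch of funneling n-best alternates to an LLM. The segment/alternate format below is invented for illustration; real STT APIs expose alternates and confidences differently:

```python
def build_correction_prompt(segments: list[dict]) -> str:
    """segments: [{"alternates": [(text, confidence), ...]}, ...]"""
    lines = []
    for i, seg in enumerate(segments, 1):
        # Show every candidate reading with its confidence score so
        # the LLM can weigh phonetic plausibility against context.
        options = " | ".join(
            f"{text!r} ({conf:.2f})" for text, conf in seg["alternates"]
        )
        lines.append(f"Segment {i}: {options}")
    return (
        "Choose the most plausible transcription for each segment, "
        "using context from neighbouring segments:\n" + "\n".join(lines)
    )
```

The prompt string would then go to whatever chat model you use; the LLM effectively acts as a language-model rescorer over the lattice.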
        
           | vunderba wrote:
           | Pairing speech recognition with a LLM acting as a post-
           | processor is a pretty good approach.
           | 
           | I put together a script a while back which converts any
           | passed audio file (wav, mp3, etc.), normalizes the audio,
           | passes it to ggerganov whisper for transcription, and then
           | forwards to an LLM to clean the text. I've used it with a
           | pretty high rate of success on some of my very old and poorly
           | recorded voice dictation recordings from over a decade ago.
           | 
           | Public gist in case anyone finds it useful:
           | 
           | https://gist.github.com/scpedicini/455409fe7656d3cca8959c123.
           | ..
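The rough shape of such a pipeline, sketched below. The tool names (ffmpeg, whisper.cpp's CLI) are the usual choices, but the flags and the `transcription_pipeline` helper are illustrative, not taken from the linked gist:

```python
import subprocess
from pathlib import Path

def transcription_pipeline(audio_in: str,
                           model: str = "ggml-base.en.bin"):
    """Build the normalize + transcribe commands for an audio file."""
    wav = Path(audio_in).with_suffix(".norm.wav")
    # 1. Normalize: resample to 16 kHz mono WAV, which whisper.cpp
    #    expects as input.
    normalize = ["ffmpeg", "-y", "-i", audio_in,
                 "-ar", "16000", "-ac", "1", str(wav)]
    # 2. Transcribe with whisper.cpp, writing a plain-text transcript.
    transcribe = ["whisper-cli", "-m", model, "-f", str(wav), "-otxt"]
    # 3. (Not shown) feed the transcript to an LLM with a cleanup
    #    prompt and keep its corrected output.
    return normalize, transcribe

# Each stage would be run with subprocess.run(cmd, check=True).
```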
        
             | Tokumei-no-hito wrote:
             | thanks for sharing. are some local models better than
             | others? can small models work well or do you want 8B+?
        
               | vunderba wrote:
                | So in my experience smaller models tend to produce
                | worse results, _BUT_ I actually got really good
                | transcription cleanup with CoT (chain-of-thought)
                | models like Qwen, even quantized down to 8b.
        
           | throwawaymaths wrote:
           | do you know if any current locally hostable public
           | transcribers are good at diarization? for some tasks having
           | even crude diarization would improve QOL by a huge factor. i
           | was looking at a whisper diarization python package for a bit
           | but it was a bitch to deploy.
        
             | iainmerrick wrote:
             | Deepgram does it.
        
               | throwawaymaths wrote:
               | sorry i meant locally hostable public. ill edit parent.
        
         | causal wrote:
          | Playing with the Huggingface demo, I'm guessing this page
          | is a little cherry-picked? In particular, I am not getting
          | that kind of emotion in my responses.
        
         | lukax wrote:
          | Have you tried Soniox? It's a multilingual STT that
          | supports real-time transcription and translation. It's
          | actually real-time, unlike what some other providers claim.
          | It also supports endpoint detection, so you can easily
          | integrate it into agents. And it's really cheap: $0.12 per
          | hour.
         | 
         | https://soniox.com
        
       | j2kun wrote:
       | They should put the meaning of "TTS" in the readme somewhere,
       | probably near the top. Or their website.
        
         | byteknight wrote:
         | TTS is a very common initialism for Text-to-Speech going back
         | to at least the 90s.
        
           | j2kun wrote:
           | So? Acronym soup is bad communication.
        
             | aquariusDue wrote:
             | I miss glossaries.
        
               | dylan604 wrote:
                | Good writing rules can still be applied to repo
                | READMEs: the first time an acronym is used, it is
                | spelled out to show what it means. Too many
                | assumptions are being made that everyone will know
                | it. Sometimes the author is too inside baseball and
                | assumes anyone reading their README already knows the
                | subject. Not all devs are literature majors, and many
                | probably never think about these things.
        
         | sdenton4 wrote:
         | Table Top Simulator.
         | 
         | It's obviously an AI for playing wargames without having to
         | bother painting all the miniatures, or finding someone with the
         | same weird interest in Balkan engagements during the Napoleonic
         | era.
        
       | decide1000 wrote:
       | How does it perform on multi-lingual tasks?
        
         | yjftsjthsd-h wrote:
         | The readme says it only supports English
        
       | pryelluw wrote:
        | Silly question: what's the lowest-spec hardware this will run
        | on?
        
         | bityard wrote:
         | Not a silly question, I came here to ask too. Curious to know
         | whether I need a GPU costing 4 digits or if it will run on my
         | 12-year-old thinkpad shitbox. Or something in between.
        
       | nmstoker wrote:
        | I've found it excellent with really common accents, but with
        | other accents (that are pretty common too) it can easily end
        | up picking a different accent. For instance, several Scottish
        | recordings ended up Australian, likewise a fairly mild
        | Yorkshire accent.
        
       | az226 wrote:
       | How does one train a TTS model with an LLM backbone? Practically,
       | how does this work?
        
         | cyanf wrote:
          | You use a neural audio codec to encode audio into
          | codebooks.
          | 
          | Then you can treat the codebook entries as tokens and treat
          | audio generation as a next-token prediction task.
          | 
          | You then take the generated codebook entries, run them
          | through the codec's decoder, and get audio out.
          | 
          | It works surprisingly well.
          | 
          | Speech-text models (TTS models with an LLM as the backbone)
          | are the current meta.
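A toy illustration of the codec idea: quantizing samples against a tiny fixed codebook yields integer "tokens" that round-trip back to audio. Real neural codecs learn their codebooks (often with residual vector quantization) rather than using a fixed grid, but the token interface is the same:

```python
# 5 "codewords" -> a vocabulary of 5 token ids.
CODEBOOK = [-1.0, -0.5, 0.0, 0.5, 1.0]

def encode(samples: list[float]) -> list[int]:
    # Each sample becomes the index of its nearest codeword (a token
    # id), just like a tokenizer maps text to integer ids.
    return [min(range(len(CODEBOOK)),
                key=lambda i: abs(CODEBOOK[i] - s)) for s in samples]

def decode(tokens: list[int]) -> list[float]:
    # The decoder maps token ids back to waveform values.
    return [CODEBOOK[t] for t in tokens]

# An LLM backbone would be trained to predict the next token id in
# such a sequence, exactly like next-word prediction over text.
tokens = encode([0.9, -0.4, 0.1])
```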
        
       | teraflop wrote:
       | > Every audio file generated by Chatterbox includes Resemble AI's
       | Perth (Perceptual Threshold) Watermarker - imperceptible neural
       | watermarks that survive MP3 compression, audio editing, and
       | common manipulations while maintaining nearly 100% detection
       | accuracy.
       | 
       | Am I misunderstanding, or can you trivially disable the watermark
       | by simply commenting out the call to the apply_watermark function
       | in tts.py? https://github.com/resemble-
       | ai/chatterbox/blob/master/src/ch...
       | 
       | I thought the point of this sort of watermark was that it was
       | embedded somehow in the model weights, so that it couldn't easily
       | be separated out. If you're going to release an open-source model
       | that adds a watermark as a separate post-processing step, then
       | why bother with the watermark at all?
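Schematically, the structure being described looks like this; the function names and toy implementations below are invented to mirror the idea, not taken from the actual tts.py:

```python
def synthesize(text: str) -> list[float]:
    # Stand-in for model inference producing a waveform.
    return [0.0] * len(text)

def watermark(wav: list[float]) -> list[float]:
    # Toy "watermark": nudge every sample by an inaudible epsilon.
    return [s + 1e-6 for s in wav]

def generate_speech(text: str, apply_watermark: bool = True):
    wav = synthesize(text)        # the model's actual output
    if apply_watermark:
        wav = watermark(wav)      # post-hoc step, trivially skippable
    return wav
```

Because the watermark is applied after inference rather than being baked into the weights, removing that one call yields unwatermarked audio.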
        
         | jchw wrote:
         | Possibly a sort of CYA gesture, kinda like how original Stable
         | Diffusion had a content filter IIRC. Could also just be to
         | prevent people from accidentally getting peanut butter in the
         | toothpaste WRT training data, too.
        
         | vunderba wrote:
         | Yeah, there's even a flag to turn it off in the parser `--no-
         | watermark`. I assumed they added it for downstream users
         | pulling it in as a "feature" for their larger product.
        
       | andy_xor_andrew wrote:
       | in my experience, TTS has been a "pick two" situation:
       | 
       | - fast / cheap to run
       | 
       | - can clone voices
       | 
       | - sounds super realistic
       | 
        | from what I can tell, Chatterbox is the first that apparently
        | lets you pick all three! (I haven't tried it myself yet; this
        | is just what I can deduce)
        
       ___________________________________________________________________
       (page generated 2025-06-11 23:00 UTC)