[HN Gopher] A CC-By Open-Source TTS Model with Voice Cloning
       ___________________________________________________________________
        
       A CC-By Open-Source TTS Model with Voice Cloning
        
       Author : amrrs
       Score  : 117 points
       Date   : 2024-11-04 17:36 UTC (5 days ago)
        
 (HTM) web link (huggingface.co)
        
       | asaddhamani wrote:
       | From a quick try results aren't good. Sounds bland, and the text
       | I type isn't exactly equal to the text that is spoken. Didn't try
       | with voice cloning though.
       | 
       | Why is good TTS so expensive and why are there no good open
       | source options? Is it just from the need for high quality
       | training data? I don't imagine these models are more expensive to
       | run compared to SOTA LLMs, yet they cost so much more.
        
         | modeless wrote:
         | There are a lot of options. StyleTTS2 is pretty good, XTTSv2 is
         | pretty good, the new E2 TTS and F5 TTS also seem decent.
        
         | amrrs wrote:
         | A commercially available, high-quality training dataset is the
         | key. Open source libraries don't get the luxury of working with
         | voice actors to record voices.
        
           | Aeolun wrote:
           | Would it be hard to create such a training dataset? Seems
           | like you'd just need a lot of people to say a bunch of stuff
           | for you?
        
             | wahnfrieden wrote:
             | needs a crowdsourced model
        
               | huggingmouth wrote:
               | Ideally, Mozilla would step up here given their mission
               | statement, but they won't, probably because their CEO
               | needs another bonus.
        
               | IshKebab wrote:
               | Yeah there's no chance Mozilla would do anything like
               | this:
               | 
               | https://commonvoice.mozilla.org/
        
               | mgkimsal wrote:
               | That's the first thing I thought of! I wonder how used
               | these are. Are there any sources or data points
               | indicating that this commonvoice data is being used, and
               | if so, where/how? I think I may have contributed to this
               | a few times years ago. Nice to see it's still going,
               | would be better to know it's being used.
        
         | em-bee wrote:
         | a few weeks ago i used piper to create an acceptable
         | translation of a book. i didn't listen to it all, but the
         | result sounded better than anything i was able to listen to
         | before. good enough to listen to a book if a human-read one is
         | not available. just a few years ago, this was not the case.
         | 
         | in other words, while FOSS TTS lags behind commercial options,
         | it does get better and i expect within a few years it will
         | produce results that are at least as good as the commercial
         | options today if not fully caught up.
        
           | asaddhamani wrote:
           | Piper seems roughly equivalent to old-school TTS output that
           | sounds flat and jumpy, as with the concatenative approach.
           | Listen to this first example I tried:
           | 
           | https://rhasspy.github.io/piper-samples/samples/en/en_GB/ala...
           | 
           | Of all the TTS APIs I have tried, I like OpenAI voices the
           | best. Haven't considered things like elevenlabs because I
           | find them ridiculously expensive.
           | 
           | I love voice to voice interfaces, but only when they sound
           | natural to my ears, and the current pricing for good ones is
           | prohibitive for a huge number of use cases.
        
             | em-bee wrote:
             | well, i was comparing it to the free tools available a few
             | years ago, and against that, this example is a marked
             | improvement. it's the first that i could actually bear to
             | listen to over a longer period of time. i expect just
             | another few years and this will actually be good.
        
         | sjnair96 wrote:
         | Have you tried VoiceCraft?
        
           | asaddhamani wrote:
           | Yeah, all these seem hyper-focused on "voice cloning". On
           | Replicate, VoiceCraft doesn't even let you try normal TTS
           | unless you provide a reference voice, so I noped out.
        
         | miki123211 wrote:
         | From what I'm seeing, most of the open source TTS models are
         | trained on the same few voices, mostly at 16 kHz, mostly from
         | Librivox books I think.
         | 
         | Eleven Labs is most likely trained on stolen audiobooks;
         | they've published a few YouTube videos in Polish, now taken
         | down, of AI renditions of famous Polish audiobook narrators.
         | This was all before they became popular, and before their voice
         | cloning models were publicly available I think.
        
           | generalizations wrote:
           | > mostly from Librivox books
           | 
           | That probably explains a lot. I've tried listening to some of
           | those audiobooks - very hit and miss, mostly miss. Definitely
           | amateur hour and mostly bad quality.
        
         | sandreas wrote:
         | I had pretty good results with coqui-tts and a VITS model that
         | I trained myself, first with an open dataset and later with one
         | I extracted from audiobooks / EPUBs, which I therefore can't
         | publish (German).
         | 
         | The dataset and video tutorials are all available and linked
         | here (also in English):
         | 
         | https://www.thorsten-voice.de/en/motivation-vision/
        
       | dmezzetti wrote:
       | Good quality and easy-to-use open TTS models are hard to find.
       | SpeechT5, while a bit old, made it relatively easy to clone
       | voices using the Transformers library.
       | 
       | I've also found a couple of the ESPNet TTS models are decent.
       | I've exported those models to ONNX to make them easier to use.
       | 
       | For what it's worth, here is a list of models that cover what
       | I've worked on in the "Open models" TTS space.
       | 
       | https://huggingface.co/collections/NeuML/text-to-speech-tts-...
        
       | sandreas wrote:
       | BTW I was really impressed by the results of F5-TTS. The thing I
       | liked best was the "Tagged" TTS, where you can specify a tag to
       | use different tones of your own voice, like
       | 
       |     {Angry}What have you done?
       |     {Surprised}Me, I did nothing?
       |     {Shouting}Who else do you think I'm talking to?
       |     {Sad}Why are you always shouting at me?
       | 
       | I wonder if this would also work for "Character" tags, like
       | 
       |     {Susan}How was your day?
       |     {Peter}I had a great day.
       | 
       | That would open great new ways of having audio books read by
       | cloned voices - switching between characters with the same
       | voice, as real narrators often do.
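       | 
       | As a rough illustration of the tag idea, here is a minimal
       | sketch that splits a tagged script into (tag, text) pairs. It
       | is not F5-TTS's own parser; the function name, the regex and
       | the "Default" fallback tag are assumptions made up for this
       | example.

```python
import re
from typing import List, Tuple

# Assumption: a tag is any run of non-brace characters inside {...}.
TAG_PATTERN = re.compile(r"\{([^{}]+)\}")


def split_tagged_text(script: str, default_tag: str = "Default") -> List[Tuple[str, str]]:
    """Split '{Tag}text {Tag}text' into ordered (tag, text) pairs.

    Text that appears before the first tag is assigned to default_tag.
    """
    segments: List[Tuple[str, str]] = []
    current_tag = default_tag
    position = 0
    for match in TAG_PATTERN.finditer(script):
        # Text between the previous tag and this one belongs to the previous tag.
        text = script[position:match.start()].strip()
        if text:
            segments.append((current_tag, text))
        current_tag = match.group(1).strip()
        position = match.end()
    # Whatever follows the last tag belongs to that tag.
    tail = script[position:].strip()
    if tail:
        segments.append((current_tag, tail))
    return segments


if __name__ == "__main__":
    for tag, text in split_tagged_text("{Susan}How was your day? {Peter}I had a great day."):
        print(f"{tag}: {text}")
```

       | Each pair could then be synthesized with whatever reference
       | audio is mapped to that tag, whether the tag names a mood or a
       | character. That mapping step depends on the TTS system and is
       | left out here.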
        
         | throwaway89201 wrote:
         | This feature also greatly interests me, although I'm looking
         | for a system that would let me slightly alter the
         | pronunciation of individual words. Is anyone aware of such a
         | system?
         | 
         | Especially with TTS in a language other than English (but also
         | with English), the pronunciation of certain words is sometimes
         | jarringly wrong. Until TTS systems can compensate for this
         | themselves, it would be great if it were possible for humans to
         | use such tags to hint the system to pronounce better. Even if
         | you can't specify the exact correction and the TTS just
         | generates a 'different' sound, that could help.
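         | 
         | One workaround that works with any TTS system is to respell
         | problem words before synthesis. This is only a sketch under
         | that assumption, not a feature of any model named in this
         | thread; the function name and the example respellings are
         | made up for illustration.

```python
import re
from typing import Dict


def apply_respellings(text: str, respellings: Dict[str, str]) -> str:
    """Replace whole words (case-insensitively) with hand-written respellings.

    The respelled text is what would be handed to the TTS system.
    """
    if not respellings:
        return text
    lookup = {word.lower(): spoken for word, spoken in respellings.items()}
    # Longest words first, so a longer entry wins over one that is its prefix.
    alternatives = "|".join(
        re.escape(word) for word in sorted(lookup, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda match: lookup[match.group(1).lower()], text)


if __name__ == "__main__":
    # Hypothetical respellings; tune them by ear for the voice in use.
    fixes = {"epub": "ee-pub", "Linux": "linnucks"}
    print(apply_respellings("I read the epub on Linux.", fixes))
```

         | Respelling only nudges the model toward a different sound;
         | it cannot specify an exact pronunciation.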
        
           | sandreas wrote:
           | Features like artificial breathing, slightly different
           | pronunciation and other "features" are only available in
           | commercial systems... unfortunately I don't remember the name
           | or the video I saw about these, because I'm not interested in
           | non FOSS stuff for my personal projects.
        
       | DrPhish wrote:
       | I've had great luck so far with GPT-SoVITS. With a custom trained
       | Japanese model and clean reference audio the quality is
       | outstanding. It is quite finicky to set up and use though.
       | 
       | https://github.com/RVC-Boss/GPT-SoVITS
        
       | xrd wrote:
       | I have been having fun with this as well:
       | 
       | https://github.com/neonbjb/tortoise-tts
       | 
       | It supports voice cloning, but I am indeed having trouble getting
       | the Docker container working and the command line docs are not
       | perfect:
       | 
       | https://github.com/neonbjb/tortoise-tts/blob/1e061bc6752f05b...
        
       ___________________________________________________________________
       (page generated 2024-11-09 23:01 UTC)