[HN Gopher] A CC-By Open-Source TTS Model with Voice Cloning
___________________________________________________________________
A CC-By Open-Source TTS Model with Voice Cloning
Author : amrrs
Score : 117 points
Date : 2024-11-04 17:36 UTC (5 days ago)
(HTM) web link (huggingface.co)
(TXT) w3m dump (huggingface.co)
| asaddhamani wrote:
| From a quick try results aren't good. Sounds bland, and the text
| I type isn't exactly equal to the text that is spoken. Didn't try
| with voice cloning though.
|
| Why is good TTS so expensive and why are there no good open
| source options? Is it just from the need for high quality
| training data? I don't imagine these models are more expensive to
| run compared to SOTA LLMs, yet they cost so much more.
| modeless wrote:
| There are a lot of options. StyleTTS2 is pretty good, XTTSv2 is
| pretty good, the new E2 TTS and F5 TTS also seem decent.
| amrrs wrote:
| A commercially available, high-quality training dataset is the
| key. Open source libraries don't get the luxury of working with
| voice actors to record voices.
| Aeolun wrote:
| Would it be hard to create such a training dataset? Seems
| like you'd just need a lot of people to say a bunch of stuff
| for you?
| wahnfrieden wrote:
| needs a crowdsourced model
| huggingmouth wrote:
| Ideally, Mozilla would step up here given their mission
| statement, but they won't, probably because their CEO
| needs another bonus.
| IshKebab wrote:
| Yeah there's no chance Mozilla would do anything like
| this:
|
| https://commonvoice.mozilla.org/
| mgkimsal wrote:
| That's the first thing I thought of! I wonder how widely used
| it is. Are there any sources or data points indicating that
| this Common Voice data is being used, and if so, where/how? I
| think I may have contributed to this a few times years ago.
| Nice to see it's still going; it would be better to know it's
| being used.
| em-bee wrote:
| a few weeks ago i used piper to create an acceptable
| translation of a book. i didn't listen to it all, but the
| result sounded better than anything i was able to listen to
| before. good enough to listen to a book if a human-read one is
| not available. just a few years ago, this was not the case.
|
| in other words, while FOSS TTS lags behind commercial options,
| it does get better and i expect within a few years it will
| produce results that are at least as good as the commercial
| options today if not fully caught up.
| asaddhamani wrote:
| Piper seems roughly equivalent to old-school TTS output: flat
| and jumpy, like the concatenative approach. Listen to this
| first example I tried:
|
| https://rhasspy.github.io/piper-samples/samples/en/en_GB/ala...
|
| Of all the TTS APIs I have tried, I like OpenAI voices the
| best. Haven't considered things like elevenlabs because I
| find them ridiculously expensive.
|
| I love voice to voice interfaces, but only when they sound
| natural to my ears, and the current pricing for good ones is
| prohibitive for a huge number of use cases.
| em-bee wrote:
| well, i was comparing it to the free tools available a few
| years ago, and against that, this example is a marked
| improvement. it's the first that i could actually bear to
| listen to over a longer period of time. i expect just
| another few years and this will actually be good.
| sjnair96 wrote:
| Have you tried VoiceCraft?
| asaddhamani wrote:
| Yeah, all these seem hyper-focused on "voice cloning". On
| Replicate, VoiceCraft doesn't even let you try normal TTS
| unless you provide a reference voice, so I noped out.
| miki123211 wrote:
| From what I'm seeing, most of the open source TTS models are
| trained on the same few voices, mostly at 16 kHz, mostly from
| Librivox books I think.
|
| Eleven Labs is most likely trained on stolen audiobooks. They
| published a few YouTube videos in Polish, now taken down, of
| AI renditions of famous Polish audiobook narrators.
| This was all before they became popular, and before their voice
| cloning models were publicly available I think.
| generalizations wrote:
| > mostly from Librivox books
|
| That probably explains a lot. I've tried listening to some of
| those audiobooks - very hit and miss, mostly miss. Definitely
| amateur hour and mostly bad quality.
| sandreas wrote:
| I had pretty good results with coqui-tts and a VITS model I
| trained myself, first with an open dataset and later with one I
| extracted from audiobooks / EPUBs and therefore can't publish
| (German).
|
| The open dataset and video tutorials are all available and
| linked here (also in English):
|
| https://www.thorsten-voice.de/en/motivation-vision/
| dmezzetti wrote:
| Good quality and easy-to-use open TTS models are hard to find.
| SpeechT5, while a bit old, made it relatively easy to clone
| voices using the Transformers library.
|
| I've also found a couple of the ESPNet TTS models are decent.
| I've exported those models to ONNX to make them easier to use.
|
| For what it's worth, here is a list of models that cover what
| I've worked on in the "Open models" TTS space.
|
| https://huggingface.co/collections/NeuML/text-to-speech-tts-...
| sandreas wrote:
| BTW I was really impressed by the results of F5-TTS. The thing I
| liked best was the "Tagged" TTS, where you can specify a tag to
| use different tones of your own voice, like
| {Angry}What have you done? {Surprised}Me, I did nothing?
| {Shouting}Who else do you think I'm talking to? {Sad}Why
| are you always shouting at me?
|
| I wonder if this would also work for "Character" tags, like
| {Susan}How was your day? {Peter}I had a great day.
|
| That would open up great new ways of having audiobooks read by
| cloned voices, switching between characters with the same voice
| as real narrators often do.
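
A minimal sketch of how text tagged this way could be split into
(tag, text) segments before each segment is sent to a TTS model with
the matching reference voice or style. The {Tag} syntax follows the
example in the comment above; the function name, the "Default"
fallback tag, and the idea of one reference clip per tag are
assumptions for illustration, not F5-TTS's actual API.

```python
import re

# Assumed tag syntax: a word in curly braces, e.g. {Angry} or {Susan}.
TAG_PATTERN = re.compile(r"\{(\w+)\}")


def split_tagged_text(text, default_tag="Default"):
    """Split '{Tag}some text {Other}more text' into (tag, text) pairs.

    Text that appears before the first tag is assigned default_tag.
    Empty segments (e.g. two tags in a row) are skipped.
    """
    segments = []
    tag = default_tag
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        chunk = text[pos:match.start()].strip()
        if chunk:
            segments.append((tag, chunk))
        tag = match.group(1)
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append((tag, tail))
    return segments


if __name__ == "__main__":
    line = "{Susan}How was your day? {Peter}I had a great day."
    for tag, chunk in split_tagged_text(line):
        # A real pipeline would look up a reference clip for `tag` here
        # and synthesize `chunk` with it; this sketch only prints.
        print(f"{tag}: {chunk}")
```

Whether a tag names an emotion or a character makes no difference to
the splitting step; what changes is only which reference audio each
tag is mapped to.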
| throwaway89201 wrote:
| This feature also greatly interests me, although I'm looking
| for a system that would let me slightly alter the pronunciation
| of individual words. Is anyone aware of such a system?
|
| Especially with TTS in a language other than English (but also
| with English), the pronunciation of certain words is sometimes
| jarringly wrong. Until TTS systems can compensate for this
| themselves, it would be great if it were possible for humans to
| use such tags to hint at a better pronunciation. Even if you
| couldn't specify the exact correction and the TTS just
| generated a 'different' sound, that could help.
| sandreas wrote:
| Features like artificial breathing, slightly different
| pronunciation and other "features" are only available in
| commercial systems... unfortunately I don't remember the name
| or the video I saw about these, because I'm not interested in
| non-FOSS stuff for my personal projects.
| DrPhish wrote:
| I've had great luck so far with GPT-SoVITS. With a custom-trained
| Japanese model and clean reference audio, the quality is
| outstanding. It is quite finicky to set up and use, though.
|
| https://github.com/RVC-Boss/GPT-SoVITS
| xrd wrote:
| I have been having fun with this as well:
|
| https://github.com/neonbjb/tortoise-tts
|
| It supports voice cloning, but I am indeed having trouble getting
| the Docker container working, and the command line docs are not
| perfect:
|
| https://github.com/neonbjb/tortoise-tts/blob/1e061bc6752f05b...
___________________________________________________________________
(page generated 2024-11-09 23:01 UTC)