[HN Gopher] BASE TTS: The largest text-to-speech model to-date
___________________________________________________________________
BASE TTS: The largest text-to-speech model to-date
Author : jcuenod
Score : 77 points
Date : 2024-02-14 19:09 UTC (3 hours ago)
(HTM) web link (amazon-ltts-paper.com)
(TXT) w3m dump (amazon-ltts-paper.com)
| unsupp0rted wrote:
| > Echoing the widely-reported "emergent abilities" of Large
| Language Models when trained on increasing volume of data, we
| show that BASE TTS variants built with 10k+ hours start to
| exhibit advanced understanding of texts that enable contextually
| appropriate prosody.
| IronWolve wrote:
| A while ago, when Amazon's neural TTS was character-limited but
| otherwise free to use, I was converting an ebook to an
| audiobook, and it was amazing how lifelike the voice and its
| inflections sounded. Amazing.
|
| Amazon really had the best-sounding TTS I've heard, hands down
| better than the paid Microsoft and Google offerings. But open-
| source technology is getting better; I'd expect that in a year
| or two, home use will be on par in quality with paid services.
|
| I can't wait for real-time video translation, so shows with
| non-English subs can be translated into English speech. You can
| do it now with some services: upload a video, and the language,
| voice, and mouth movements will be converted to any target
| language.
| revenga99 wrote:
| Wow. I could see this threatening audiobook narrators. However,
| I would still prefer a real narrator to this in its current
| state. I think what it might be missing is different
| voices/accents for different characters.
| swashboon wrote:
| Audible doesn't allow AI narration or much public-domain stuff
| at the moment. The only thing keeping it from happening is the
| marketplaces trying to hold back a flood of low-effort content
| from overtaking and diluting the better-crafted options and
| really annoying consumers.
| TOMDM wrote:
| Let's be honest: the moment Amazon thinks their TTS is good
| enough, they'll be offering AI Audible deals to every author
| on their platform.
| swashboon wrote:
| Yeah, hard to say, because the obvious implementation would
| be to just have it built into phones once the model is
| portable enough - I see this happening sooner as a more
| general TTS feature, much like what Google is doing with
| 'subtitles anywhere', aka Live Caption. Paired with
| translation, we may be pretty close to universal-translator-
| type functionality. I could see end users being able to
| customize their voice assistant even more, or maybe having
| multiple voices based on whether it's talking for you or to
| you.
|
| Anyway, the problem with this is that it makes the 'AI
| audiobook' product basically worthless: why not just buy the
| ebook and have my personalized translator turn it into an
| audiobook? Then you just have market differentiation between
| cheap ebook + AI narrator vs. expensive ebook + professional
| narration.
|
| Though narration costs are already pretty low - they really
| do not factor into the cost of publishing an audiobook that
| much unless it's really a bottom-of-the-barrel book.
| TOMDM wrote:
| I'm looking forward to my on device TTS, but Amazon has a
| decent moat with the DRM on their Kindles.
|
| At least they'll have to remain somewhat competitive once
| consumers decide they want the AI audiobooks and the
| like.
| swashboon wrote:
| Thinking about this more - the copyright implications
| become much more interesting once it's no longer a
| recording. Does it count as a private performance if you
| have headphones on? Is it a public performance if you
| listen to live TTS through your speakers in public?
| dshpala wrote:
| I think Google's product has that:
| https://play.google.com/books/publish/autonarrated/
| minimaxir wrote:
| The emotion examples are interesting. One of the most obvious
| current indicators of AI-generated voices/voice cloning is a
| lack of emotion and range, which makes them objectively worse
| than professional voice actors, unless a lack of emotion and
| range is the desired voice direction.
|
| But if you listen to the emotion examples, the range is
| essentially what you'd get from an audiobook narrator, not
| more traditional voice acting.
| tsumnia wrote:
| Sadly it's not my forte, but I expect that in the near future
| we'll see an additional "emotion" embedding or something
| similar. Actors regularly use 'action words' (verbs) [1] to
| help add context to lines. A model could then study a text,
| determine an appropriate verb/emotion range to work from, then
| produce the audio with that additional context.
|
| [1] https://indietips.com/subtext-action-verb/
| minimaxir wrote:
| The bottleneck is the annotations: there's no easy way to
| annotate "emotions" on the scale of data needed to have the
| model learn the necessary verbal tics.
|
| In contrast, image data on the intent for image generation
| models is very highly annotated in most cases.
| biomcgary wrote:
| Just run an LLM in sentiment analysis mode to annotate.
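A minimal sketch of that weak-labeling idea. `query_llm`, the label set, and the prompt wording are all illustrative assumptions, not a specific product's API; the LLM call is stubbed out here.

```python
# Sketch: using an LLM as a weak labeler to tag utterances with
# emotions for TTS training data. `query_llm` is a placeholder for
# whatever chat-completion API you have; here it is stubbed out.

EMOTIONS = ["neutral", "happy", "sad", "angry", "fearful", "surprised"]

def build_prompt(utterance: str, context: str) -> str:
    # Ask for exactly one label so the answer is easy to parse.
    return (
        f"Pick the single best emotion label from {EMOTIONS} for the "
        f"utterance below.\nContext: {context}\nUtterance: {utterance}\n"
        "Answer with one word."
    )

def annotate(utterance: str, context: str, query_llm) -> str:
    raw = query_llm(build_prompt(utterance, context)).strip().lower()
    # Fall back to "neutral" if the model answers off-list.
    return raw if raw in EMOTIONS else "neutral"

# Stub LLM for illustration; swap in a real client in practice.
fake_llm = lambda prompt: " Angry "
print(annotate("Get out of my house!", "A heated argument.", fake_llm))
# → angry
```

Constraining the answer to a fixed label set keeps parsing trivial, at the cost of the nuance the thread discusses below.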
| tsumnia wrote:
| Oh yeah, the annotations are lacking compared to images.
| Again from the academic side, I think one solution could be
| to recruit theater majors just learning about 'verbing
| their lines' and have a collaboration between CS and
| Theater to produce a proof-of-concept dataset (since an
| acting class won't have more than 20-30 students in it).
| You'd need significantly more annotations, but you'd now
| have some labels to ascribe to texts with context, since
| it's a dialogue involving one or more individuals.
| candiodari wrote:
| This already exists. These are transformers. Things like
| <laugh> work in a lot of models, for example. And you can
| vary it: _sigh_ and _uh_ work too. I don't think all of
| these were programmed in.
| tsumnia wrote:
| I've seen a few; there was even one posted to HN some time
| ago, though I don't recall the exact name. They were
| working on adding emotion to audio generation, but it was
| still a bit wonky. Emotion is a tricky concept, and one of
| the reasons (I think) we haven't seen a Paul Ekman
| microexpression detector yet. That's where my suggestion
| about using action words comes into play: those are more
| tangible and offer direction, without trying to identify
| various levels of emotional valence.
| qwertox wrote:
| They are simply amazing. I see a future where computers will
| be able to mess with our brains by abusing our empathy.
|
| Imagine a computer sobbing at a child because the child wants
| to terminate a chat session.
|
| This feels far more impactful than any visuals or text we're
| getting today.
| maxglute wrote:
| Are there any decent TTS models that can be run locally and
| plug into existing software like SAPI without too much lag?
| dvt wrote:
| Bark and Tortoise work fairly well. Bark does super fast
| inference[1] on my M1.
|
| [1] https://github.com/SaladTechnologies/bark
| turnsout wrote:
| @dvt Is this just a containerized version of Bark? Wondering
| if this repo has M1-specific improvements.
| dvt wrote:
| > Is this just a containerized version of Bark
|
| I think so.
| turnsout wrote:
| I'm finding M1 generation quite slow (CPU-only) on the
| stock Bark--any tips on speeding it up?
| dvt wrote:
| Sorry, haven't messed around too much with optimizations.
| I thought it was quite fast compared to Tortoise for
| example (where generation speed was at a 3:1 ratio).
| modeless wrote:
| XTTS has a streaming mode with ~300ms latency and sounds good,
| though it has hallucination issues. StyleTTS2 sounds good and
| doesn't hallucinate as much. It doesn't support streaming but
| it's fast so it can still respond quickly. But neither of them
| sound as good as Eleven Labs or OpenAI or this one.
| Nouser76 wrote:
| I've used coqui.ai's TTS models[0] and library[1] to great
| success. I was able to get a cloned voice rendered in about
| 80% of the audio clip's length, and I believe you can also
| stream the response. Do note the model license for XTTS: it
| is one they wrote themselves, and it has some restrictions.
|
| [0] https://huggingface.co/coqui/XTTS-v2
|
| [1] https://github.com/coqui-ai/TTS
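For reference, a minimal sketch of the cloning flow the comment describes, using Coqui's published `TTS` Python API. The file paths are illustrative, and model weights download on first use.

```python
# Sketch of voice cloning with Coqui XTTS-v2 via the TTS library.
# Assumes `pip install TTS`; the import is kept lazy because the
# dependency (and the model download) is heavy.

def clone_to_file(text: str, speaker_wav: str,
                  out_path: str = "out.wav", language: str = "en") -> str:
    from TTS.api import TTS
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    # speaker_wav: a few seconds of reference audio of the target voice
    tts.tts_to_file(text=text, speaker_wav=speaker_wav,
                    language=language, file_path=out_path)
    return out_path

# e.g. clone_to_file("Hello there.", "reference.wav")
```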
| SparkyMcUnicorn wrote:
| > ... capable of mimicking speaker characteristics with just a
| few seconds of reference audio ... we have decided against open-
| sourcing this model as a precautionary measure.
|
| Disappointed yet again.
| someplaceguy wrote:
| Someone should send the developers this audio recording I have
| of Jeff Bezos saying that he changed his mind and wants the
| model to be released as open-source.
| mrfakename wrote:
| Sadly they didn't release the code or models
| CamperBob2 wrote:
| It's for Your Own Good, don't you know
| chankstein38 wrote:
| I'm so glad they are all so protective of my safety! Lord
| knows I'm a child incapable of controlling myself or having
| my own morals! /s
| chankstein38 wrote:
| Agreed. It hardly feels worth even reading through the paper
| since, from my perspective, it may as well just be made up. I
| can also write "Hey guys I made a good TTS it's really cool and
| great and the voices sound really natural" and put some samples
| together. If I never release any code or models or anything, it
| may as well have not been published.
| echelon wrote:
| The value of this stuff is going to zero. Don't worry about it.
|
| Product over model.
|
| Models and weights are a race to the bottom. Everyone is doing
| it and competing on data efficiency, methodology, MOS, etc.
| Groups all over are releasing their data and weights. It
| doesn't matter if Amazon doesn't, other labs will do it to get
| ahead and to get attention.
|
| This is going to be entirely pedestrian within a year.
|
| ElevenLabs is not a unicorn. It's an early-forming bubble.
| LarsDu88 wrote:
| Sounds about as good as ElevenLabs.io. Hopefully, if this
| ships on AWS, it will support SSML tags. I used ElevenLabs.io
| for all the voices in my VR game (https://roguestargun.com),
| but it's still lacking on the emotion front, which is all
| one-shot.
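For context on what SSML support looks like on AWS today, a hedged sketch against Amazon Polly's existing `synthesize_speech` API. The voice and markup are illustrative, AWS credentials are assumed, and note that Polly's neural voices support only a subset of SSML (e.g. `<prosody pitch>` is not available on them).

```python
# Sketch: sending SSML to Amazon Polly with boto3. AWS credentials
# and `pip install boto3` are assumed; the import is lazy so the
# helper is cheap to define.

SSML = (
    '<speak>I can pause <break time="500ms"/> and '
    '<prosody rate="slow">slow down for emphasis</prosody>.</speak>'
)

def speak_ssml(ssml: str, voice_id: str = "Joanna",
               out_path: str = "line.mp3") -> str:
    import boto3
    polly = boto3.client("polly")
    resp = polly.synthesize_speech(
        Text=ssml, TextType="ssml",
        OutputFormat="mp3", VoiceId=voice_id, Engine="neural",
    )
    with open(out_path, "wb") as f:
        f.write(resp["AudioStream"].read())
    return out_path

# e.g. speak_ssml(SSML)
```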
| ghostbrainalpha wrote:
| Game looks great. Are you supporting Flight Sticks?
| JanSt wrote:
| I would love an API for this... any information on
| availability?
| qwertox wrote:
| Interesting. Just a couple of hours ago I came across
| MetaVoice-1B [0] (Demo [1]) and was amazed by the quality of
| their TTS in English (sadly no other languages available).
|
| If this year becomes the year when high quality Open Source TTS
| and ASR models appear that can run in real-time on an Nvidia RTX
| 40x0 or 30x0, then that would be great. On CPU even better.
|
| Also note the Ethical Statement on BASE TTS:
|
| > An application of this model can be to create synthetic voices
| of people who have lost the ability to speak due to accidents or
| illnesses, subject to informed consent and rigorous data privacy
| reviews. However, due to the potential misuse of this capability,
| we have decided against open-sourcing this model as a
| precautionary measure.
|
| [0] https://github.com/metavoiceio/metavoice-src
|
| [1] https://ttsdemo.themetavoice.xyz/
| nshm wrote:
| Metavoice is one of a dozen GPT-based TTS systems around,
| starting from Tortoise. And not that great, honestly. You can
| clearly hear "glass scratches" in their sound, because they
| trained on MP3-compressed data.
|
| There are much cleaner-sounding systems around. You can
| listen to StyleTTS2 to compare.
| qwertox wrote:
| I had forgotten about StyleTTS2, and it was discussed here on
| HN a couple of months ago. Maybe that's what made me feel
| that there's something going on.
| m2024 wrote:
| Check out `whisper` and `whisper-cpp` for ASR.
|
| I am running the smaller models in near real-time on a 3rd gen
| i7, even using my terrible built-in laptop mic from a distance.
| The medium and large models are impressively accurate for
| technical language.
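The comment's setup can be reproduced with the `openai-whisper` package; a minimal hedged wrapper (the model names are real Whisper sizes, the audio path is illustrative, and ffmpeg must be on PATH):

```python
# Sketch: local ASR with openai-whisper. The import is lazy because
# loading torch and the model weights is expensive.
from typing import Optional

def transcribe(audio_path: str, model_name: str = "medium",
               language: Optional[str] = None) -> str:
    import whisper
    model = whisper.load_model(model_name)  # "tiny" ... "large"
    result = model.transcribe(audio_path, language=language)
    return result["text"]

# e.g. transcribe("notes.wav", model_name="large", language="de")
```

Passing `language` explicitly skips autodetection, which matters for non-English recordings like the German notes mentioned below.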
| qwertox wrote:
| I'm using Whisper to transcribe notes I record with a
| lavalier mic during my bike rides (wind is no problem), but
| am using OpenAI's service. When it was released I tested it
| on a Ryzen 5950x and it was too slow and memory hungry for my
| taste. Using large was necessary for that use case (also, I'm
| recording in German).
| GaggiX wrote:
| With Whisper, you can find many smaller models that are
| fine-tuned for a particular language, so even smaller
| models can perform adequately.
| nshm wrote:
| Err, I deeply respect the Amazon TTS team, but this paper and
| synthesis is..... You publish the paper in 2024 and include
| YourTTS in your baselines to look better. Come on! There is
| XTTS2 around!
|
| The voice sounds robotic and flat. Most likely there are a
| lot of audiobooks in the training data and less
| conversational speech. And dropping diffusion was not a great
| idea; the voice is not crystal clear anymore, it's more like
| a telephone recording.
| oersted wrote:
| The Spanish voice has an interesting accent: 85% Castilian
| (from Spain) pronunciation, with a few unexpected Latin
| American tonalities and phonemes (especially "s") sprinkled
| in.
|
| I guess it's what you'd expect from averaging a large amount
| of public-domain recordings. I think there's a bias towards
| Spain vs. Latin America for socioeconomic reasons, even
| though Spain's population is obviously much smaller.
___________________________________________________________________
(page generated 2024-02-14 23:00 UTC)