[HN Gopher] BASE TTS: The largest text-to-speech model to-date
       ___________________________________________________________________
        
       BASE TTS: The largest text-to-speech model to-date
        
       Author : jcuenod
       Score  : 77 points
       Date   : 2024-02-14 19:09 UTC (3 hours ago)
        
 (HTM) web link (amazon-ltts-paper.com)
 (TXT) w3m dump (amazon-ltts-paper.com)
        
       | unsupp0rted wrote:
       | > Echoing the widely-reported "emergent abilities" of Large
       | Language Models when trained on increasing volume of data, we
       | show that BASE TTS variants built with 10k+ hours start to
       | exhibit advanced understanding of texts that enable contextually
       | appropriate prosody.
        
       | IronWolve wrote:
        | A while ago, when Amazon had its neural TTS free to use
        | (with a limit on text length), I converted an ebook to an
        | audiobook. It was amazing how lifelike the voice and its
        | inflections sounded.
        | 
        | Amazon really had the best-sounding TTS I've heard, hands
        | down better than the paid Microsoft and Google offerings.
        | But open-source technology is getting better too; I'd
        | expect that in a year or two, home use will be on par in
        | quality with paid services.
        | 
        | I can't wait for real-time video translation, so shows
        | with non-English subs can be rendered as English speech.
        | You can do it now with some services: upload a video, and
        | the language, voice, and mouth movements will be converted
        | to any target language.
        
       | revenga99 wrote:
       | Wow. I could see this as threatening audio book narrators.
       | However I would still prefer a real narrator to this in its
       | current state. I think what it might be missing is different
       | voices/accents for different characters.
        
         | swashboon wrote:
          | Audible doesn't allow AI narration or much public-domain
          | stuff at the moment. The only thing keeping it from
          | happening is the marketplace trying to hold back a flood
          | of crap from overtaking / drowning / diluting the more
          | well-crafted options and really annoying consumers.
        
           | TOMDM wrote:
            | Let's be honest: the moment Amazon thinks their TTS is
            | good enough, they'll be offering AI Audible deals to
            | every author on their platform.
        
             | swashboon wrote:
              | Yeah, hard to say, because the obvious
              | implementation would be to build it into phones once
              | the model is portable enough - I see this happening
              | sooner as a more general TTS feature, much like what
              | Google is doing with 'subtitles anywhere', aka Live
              | Caption. Paired with translation, we may be pretty
              | close to universal-translator-type functionality. I
              | could see end users customizing their voice
              | assistant even more, or maybe having multiple voices
              | depending on whether it's talking for you or to you.
              | 
              | Anyway, the problem with this is that it makes the
              | product 'AI audiobook' basically worthless: why not
              | just buy the ebook and have my personalized
              | translator turn it into an audiobook? Then you just
              | have market differentiation between a cheap ebook +
              | AI narrator and an expensive one + professional
              | narration.
              | 
              | Though narration costs are already pretty low - they
              | really don't factor into the cost of publishing an
              | audiobook that much unless it's really a
              | bottom-of-the-barrel book.
        
               | TOMDM wrote:
               | I'm looking forward to my on device TTS, but Amazon has a
               | decent moat with the DRM on their Kindles.
               | 
               | At least they'll have to remain somewhat competitive once
               | consumers decide they want the AI audiobooks and the
               | like.
        
               | swashboon wrote:
                | Thinking about this more - the copyright
                | implications become much more interesting once
                | it's no longer a recording. Does it count as a
                | private performance if you have headphones on? Is
                | it a public performance if you listen to live TTS
                | through your speakers in public?
        
         | dshpala wrote:
         | I think Google's product has that:
         | https://play.google.com/books/publish/autonarrated/
        
       | minimaxir wrote:
        | The emotion examples are interesting. One of the most
        | obvious current tells of AI-generated voices/voice
        | cloning is a lack of emotion and range, which makes them
        | objectively worse than professional voice actors - unless
        | a lack of emotion and range is the desired voice
        | direction.
        | 
        | But if you listen to the emotion examples, the range is
        | essentially what you'd get from an audiobook narrator,
        | not more traditional voice acting.
        
         | tsumnia wrote:
          | Sadly it's not my forte, but I expect that in the near
          | future we'll see an additional "emotion" embedding or
          | something similar. Actors regularly use 'action words'
          | (verbs) [1] to add context to their lines. A model could
          | then study a text, determine an appropriate verb/emotion
          | range to work from, and produce the audio with that
          | additional context.
         | 
         | [1] https://indietips.com/subtext-action-verb/
        
           | minimaxir wrote:
            | The bottleneck is the annotations: there's no easy
            | way to annotate "emotions" at the scale of data
            | needed for the model to learn the necessary verbal
            | tics.
            | 
            | In contrast, the training data for image generation
            | models is, in most cases, very richly annotated with
            | intent.
        
             | biomcgary wrote:
             | Just run an LLM in sentiment analysis mode to annotate.
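              | 
              | A minimal sketch of that idea, with the LLM call
              | stubbed out as an injected function (any real API
              | would slot in; `ask_llm` is hypothetical):

```python
# Bulk-annotate transcript lines with emotion labels.
# `ask_llm` stands in for a real LLM API call (hypothetical).
EMOTIONS = ["neutral", "happy", "sad", "angry", "fearful"]

def annotate(lines, ask_llm):
    """Return (line, emotion) pairs, falling back to 'neutral'
    when the model's answer isn't a known label."""
    annotated = []
    for line in lines:
        prompt = ("Classify the emotion of this line of dialogue "
                  "as one of " + ", ".join(EMOTIONS) + ":\n" + line)
        label = ask_llm(prompt).strip().lower()
        annotated.append((line, label if label in EMOTIONS else "neutral"))
    return annotated

# Toy stand-in so the sketch runs without an API key:
def fake_llm(prompt):
    return "Sad" if "cried" in prompt else "neutral"

pairs = annotate(["She cried all night.", "Pass the salt."], fake_llm)
# → [('She cried all night.', 'sad'), ('Pass the salt.', 'neutral')]
```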
        
             | tsumnia wrote:
              | Oh yeah, the annotations are lacking compared to
              | images. Again from the academic side, I think one
              | solution could be to recruit theater majors just
              | learning about 'verbing their lines' and have a
              | collaboration between CS and Theater to produce a
              | proof-of-concept dataset (since an acting class
              | won't have more than 20-30 students in it). You'd
              | need significantly more annotations, but you'd now
              | have some labels to ascribe to texts with context,
              | since it's a dialogue involving one or more
              | individuals.
        
           | candiodari wrote:
            | This already exists. These are transformers. Things
            | like <laugh> work in a lot of models, for example.
            | And you can vary it: things like _sigh_ and _uh_ work
            | too. I don't think all of these were programmed in.
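            | 
            | A small sketch of splicing such cue tokens into a
            | prompt; the spellings below (Bark-style "[laughs]")
            | are illustrative, not any model's official set:

```python
# Map abstract cues to model-specific non-speech tokens.
# These spellings are illustrative; check your model's docs.
CUES = {"laugh": "[laughs]", "sigh": "[sighs]", "hesitate": "uh..."}

def with_cue(text, cue, position=0):
    """Insert a non-speech cue token before the given word index."""
    words = text.split()
    words.insert(position, CUES[cue])
    return " ".join(words)

prompt = with_cue("that was not supposed to happen", "laugh")
# → "[laughs] that was not supposed to happen"
```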
        
             | tsumnia wrote:
              | I've seen a few; there was even one posted to HN
              | some time ago, though I don't recall the exact name.
              | They were working on adding emotion to audio
              | generation, but it was still a bit wonky. Emotion is
              | a tricky concept, and one of the reasons (I think)
              | we haven't seen a Paul Ekman microexpression
              | detector yet. That's where my suggestion about
              | action words comes into play: they're more tangible
              | and offer direction without trying to identify
              | various levels of emotional valence.
        
         | qwertox wrote:
          | They are simply amazing. I see a future where computers
          | will be able to mess with our brains by abusing our
          | empathy.
          | 
          | Imagine a computer sobbing at a child because it wants
          | to terminate a chat session.
          | 
          | This feels far more impactful than any visuals or text
          | we're getting today.
        
       | maxglute wrote:
        | Are there any decent TTS models that can be run locally
        | and plug into existing software like SAPI without too
        | much lag?
        
         | dvt wrote:
         | Bark and Tortoise work fairly well. Bark does super fast
         | inference[1] on my M1.
         | 
         | [1] https://github.com/SaladTechnologies/bark
        
           | turnsout wrote:
           | @dvt Is this just a containerized version of Bark? Wondering
           | if this repo has M1-specific improvements.
        
             | dvt wrote:
             | > Is this just a containerized version of Bark
             | 
             | I think so.
        
               | turnsout wrote:
               | I'm finding M1 generation quite slow (CPU-only) on the
               | stock Bark--any tips on speeding it up?
        
               | dvt wrote:
               | Sorry, haven't messed around too much with optimizations.
               | I thought it was quite fast compared to Tortoise for
               | example (where generation speed was at a 3:1 ratio).
        
         | modeless wrote:
         | XTTS has a streaming mode with ~300ms latency and sounds good,
         | though it has hallucination issues. StyleTTS2 sounds good and
         | doesn't hallucinate as much. It doesn't support streaming but
         | it's fast so it can still respond quickly. But neither of them
         | sound as good as Eleven Labs or OpenAI or this one.
        
         | Nouser76 wrote:
          | I've used coqui.ai's TTS models[0] and library[1] to
          | great success. I was able to get a cloned voice
          | rendered in about 80% of the audio clip's length, and I
          | believe you can also stream the response. Do note the
          | model license for XTTS: it's one they wrote themselves
          | and carries some restrictions.
         | 
         | [0] https://huggingface.co/coqui/XTTS-v2
         | 
         | [1] https://github.com/coqui-ai/TTS
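          | 
          | "About 80% of the audio clip length" is a real-time
          | factor (RTF) of roughly 0.8; a quick sketch of the
          | metric (my framing, not coqui's API):

```python
# Real-time factor: synthesis wall-time divided by the duration
# of the audio produced. RTF < 1.0 means synthesis outpaces
# playback, so buffered streaming can keep up.
def real_time_factor(synthesis_seconds, audio_seconds):
    return synthesis_seconds / audio_seconds

rtf = real_time_factor(8.0, 10.0)  # 8 s to render a 10 s clip
# → 0.8, i.e. the "80% of the clip length" figure above
```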
        
       | SparkyMcUnicorn wrote:
       | > ... capable of mimicking speaker characteristics with just a
       | few seconds of reference audio ... we have decided against open-
       | sourcing this model as a precautionary measure.
       | 
       | Disappointed yet again.
        
         | someplaceguy wrote:
         | Someone should send the developers this audio recording I have
         | of Jeff Bezos saying that he changed his mind and wants the
         | model to be released as open-source.
        
       | mrfakename wrote:
       | Sadly they didn't release the code or models
        
         | CamperBob2 wrote:
         | It's for Your Own Good, don't you know
        
           | chankstein38 wrote:
           | I'm so glad they are all so protective of my safety! Lord
           | knows I'm a child incapable of controlling myself or having
           | my own morals! /s
        
         | chankstein38 wrote:
         | Agreed. It hardly feels worth even reading through the paper
         | since, from my perspective, it may as well just be made up. I
         | can also write "Hey guys I made a good TTS it's really cool and
         | great and the voices sound really natural" and put some samples
         | together. If I never release any code or models or anything, it
         | may as well have not been published.
        
         | echelon wrote:
         | The value of this stuff is going to zero. Don't worry about it.
         | 
         | Product over model.
         | 
        | Models and weights are a race to the bottom. Everyone is
        | doing it and competing on data efficiency, methodology,
        | MOS, etc. Groups all over are releasing their data and
        | weights. It doesn't matter if Amazon won't; other labs
        | will, to get ahead and to get attention.
         | 
         | This is going to be entirely pedestrian within a year.
         | 
         | ElevenLabs is not a unicorn. It's an early-forming bubble.
        
       | LarsDu88 wrote:
        | Sounds about as good as ElevenLabs.io. Hopefully, if this
        | ships on AWS, it will support SSML tags. I used
        | ElevenLabs.io for all the voices in my VR game
        | (https://roguestargun.com), but it's still lacking on the
        | emotion front, which is all one-shot.
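        | 
        | For reference, the kind of SSML markup Amazon Polly
        | accepts today - whether BASE TTS would honor the same
        | tags is an open question:

```xml
<!-- Illustrative SSML fragment (Polly-style tags) -->
<speak>
  Incoming fighters detected.
  <break time="300ms"/>
  <prosody rate="fast" pitch="+10%">All hands to battle stations!</prosody>
</speak>
```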
        
         | ghostbrainalpha wrote:
         | Game looks great. Are you supporting Flight Sticks?
        
       | JanSt wrote:
        | I would love an API for this... any information on
        | availability?
        
       | qwertox wrote:
       | Interesting. Just a couple of hours ago I came across
       | MetaVoice-1B [0] (Demo [1]) and was amazed by the quality of
       | their TTS in English (sadly no other languages available).
       | 
       | If this year becomes the year when high quality Open Source TTS
       | and ASR models appear that can run in real-time on an Nvidia RTX
       | 40x0 or 30x0, then that would be great. On CPU even better.
       | 
       | Also note the Ethical Statement on BASE TTS:
       | 
       | > An application of this model can be to create synthetic voices
       | of people who have lost the ability to speak due to accidents or
       | illnesses, subject to informed consent and rigorous data privacy
       | reviews. However, due to the potential misuse of this capability,
       | we have decided against open-sourcing this model as a
       | precautionary measure.
       | 
       | [0] https://github.com/metavoiceio/metavoice-src
       | 
       | [1] https://ttsdemo.themetavoice.xyz/
        
         | nshm wrote:
          | MetaVoice is one of a dozen GPT-based TTS systems
          | around, starting from Tortoise. And honestly, it's not
          | that great. You can clearly hear "glass scratches" in
          | its sound; that's because they trained on MP3-compressed
          | data.
          | 
          | There are much cleaner-sounding systems around. You can
          | listen to StyleTTS2 to compare.
        
           | qwertox wrote:
           | I had forgotten about StyleTTS2, and it was discussed here on
           | HN a couple of months ago. Maybe that's what made me feel
           | that there's something going on.
        
         | m2024 wrote:
         | Check out `whisper` and `whisper-cpp` for ASR.
         | 
         | I am running the smaller models in near real-time on a 3rd gen
         | i7, even using my terrible built-in laptop mic from a distance.
         | The medium and large models are impressively accurate for
         | technical language.
        
           | qwertox wrote:
           | I'm using Whisper to transcribe notes I record with a
           | lavalier mic during my bike rides (wind is no problem), but
           | am using OpenAI's service. When it was released I tested it
           | on a Ryzen 5950x and it was too slow and memory hungry for my
           | taste. Using large was necessary for that use case (also, I'm
           | recording in German).
        
             | GaggiX wrote:
             | With Whisper, you can find many smaller models that are
             | fine-tuned for a particular language, so even smaller
             | models can perform adequately.
        
       | nshm wrote:
        | Err, I deeply respect the Amazon TTS team, but this paper
        | and synthesis is..... You publish a paper in 2024 and
        | include YourTTS in your baselines to look better. Come
        | on! There is XTTS2 around!
        | 
        | The voice sounds robotic and plain - most likely a lot of
        | audiobooks in the training data and less conversational
        | speech. And dropping diffusion was not a great idea: the
        | voice is no longer crystal clear; it sounds more like a
        | telephony recording.
        
       | oersted wrote:
        | The Spanish voice has an interesting accent: 85%
        | Castilian (from Spain) pronunciation, with a few
        | unexpected Latin American tonalities and phonemes
        | (especially "s") sprinkled in.
        | 
        | I guess that's what you'd expect from averaging a large
        | amount of public-domain recordings. I think there's a
        | bias towards Spain over Latin America for socioeconomic
        | reasons, even though Spain's population is obviously much
        | smaller.
        
       ___________________________________________________________________
       (page generated 2024-02-14 23:00 UTC)