[HN Gopher] Pushing the frontiers of audio generation
       ___________________________________________________________________
        
       Pushing the frontiers of audio generation
        
       Author : meetpateltech
       Score  : 148 points
       Date   : 2024-10-30 15:02 UTC (7 hours ago)
        
 (HTM) web link (deepmind.google)
 (TXT) w3m dump (deepmind.google)
        
       | jameszhao00 wrote:
       | Try it out in the demo
       | https://cloud.google.com/text-to-speech/?hl=en
       | and in the API
       | https://cloud.google.com/text-to-speech/docs/create-dialogue...
        
         | deskr wrote:
         | If I change the language in the demo, it removes all my text
         | and replaces it with a template text. That's bad.
        
       | tmjdev wrote:
       | While it is impressive and I like to follow the advancements in
       | this field, it is incredibly frustrating to listen to. I can't
       | put my finger on why exactly. It's definitely closer to human-
       | sounding, but the uncanny valley is so deep here that I find
       | myself thinking "I just want the point, not the fake personality
       | that is coming with it". I can't make it through a 30s demo.
        
         | xnx wrote:
         | Agreed. To be fair, I also get annoyed by fake/exaggerated
         | expression from human podcasters.
        
         | iNic wrote:
         | It sounds like every sentence is an ad read.
        
           | JoblessWonder wrote:
           | Yeah... It isn't that it doesn't sound like human speech...
           | it just sounds like how humans speak when they are
           | uncomfortable or reading prepared and they aren't good at it.
        
         | semitones wrote:
         | I suppose it doesn't matter if it is a human, or a bot
         | delivering the message, if the message is boring
        
         | rob wrote:
         | Probably because you're expecting it and looking at a demo
         | page. Put these voices behind a real video or advertisement and
         | I would imagine most people wouldn't be able to tell that it's
         | AI generated at all.
        
           | Veen wrote:
           | It'd be annoying to me whether it was AI or human. The faux-
           | excitement and pseudo-bonhomie is grating. They should focus
           | on how people actually talk, not on copying the vocal
           | intonation of coked-up public radio presenters just back from
           | a positive affirmation seminar.
        
         | beoberha wrote:
         | Totally agree. Maybe it's just the clips they chose, but it
         | feels overfit on the weird conversational elements that make it
         | impressive? Like the "oh yeahs" from the other person when
         | someone is speaking. It is cool to see that natural flow in a
         | conversation generated by a model, but there's waaaay too much
         | of it in these examples to sound natural.
         | 
         | And I say all that completely slackjawed that this is possible.
        
           | amelius wrote:
           | > Like the "oh yeahs" from the other person when someone is
           | speaking.
           | 
           | I bet that if you select a British accent you will get fewer
           | of them.
        
             | kelseyfrog wrote:
             | Right mate
        
             | bryanrasmussen wrote:
             | I'm hoping it will be a lot of Ok Guv'ner and right you
             | ares in the style of Dick Van Dyke.
        
             | mindcrime wrote:
             | Gor blimey lad, that's the problem now innit???
        
             | Dilettante_ wrote:
             | Cheeky bugger, you are
        
             | KineticLensman wrote:
             | > a British accent
             | 
             | Hmm.... Scottish, Welsh, Irish (Nor'n) or English? If
             | English, North or South? If North, which city? Brummie?
             | Scouse? If South, London? Cockney or Multicultural London
             | English [0]?
             | 
             | [0]
             | https://en.wikipedia.org/wiki/Multicultural_London_English
        
               | beAbU wrote:
               | Need to increase your granularity a bit. I live in
               | Wexford Town, Ireland, and the other day I was chatting
               | to a person that told me their old schoolmates from
               | Castlebridge are making fun of their accent changing
               | since moving from their hometown.
               | 
               | Castlebridge is 10 minutes away by car. Madness!
        
               | KineticLensman wrote:
               | Yeah, totally agree. Here's a useful link for non-Brits,
               | that goes into a bit more detail:
               | 
               | https://accentbiasbritain.org/accents-in-britain/
               | 
               | Also, we have yet to define precisely what is
               | meant by 'British'. This probably needs a "20 falsehoods
               | people believe about..."-type article.
        
           | echelon wrote:
           | I love the technology, but I really don't want AI to sound
           | like this.
           | 
           | Imagine being stuck on a call with this.
           | 
           | > "Hey, so like, is there anything I can help you with
           | today?"
           | 
           | > "Talk to a person."
           | 
           | > "Oh wow, right. (chuckle) You got it. Well, before I
           | connect you, can you maybe tell me a little bit more about
           | what problem you're having? For example, maybe it's something
           | to do with..."
        
             | cmehdy wrote:
             | That's how the DJ feature of Spotify talks and it's pretty
             | jarring.
             | 
             | "How's it going. We're gonna start by taking you back to
             | your 2022 favorites, starting with the sweet sounds of
             | XYZ". There's very little you can tweak about it, the
             | suggestions kinda suck, but you're getting a fake friend to
             | introduce them to you. Yay, I guess..
        
           | kelseyfrog wrote:
           | I'd love to see stats on disfluency rate in conversation,
           | podcasts, and this sample to get an idea of where it lies. It
           | seems like they could have cranked it up, but there's also
           | the chance that it's just the frequency illusion because we
           | were primed to pay attention to it.
        
         | onion2k wrote:
         | That could just be the context though. Listening to a clip
         | that's a demo of what the model can produce is very different
         | to listening to a YouTube video that's using the model to
         | generate speech about something you'd actually want to watch a
         | video of.
        
         | kaibee wrote:
         | > Example of a multi-speaker dialogue generated by NotebookLM
         | Audio Overview, based on a few potato-related documents.
         | 
         | Listening to this on 1.75x speed is excellent. I think the
         | generated speaking speed is slow for audio quality, bc it'd be
         | much harder to slow-down the generated audio while retaining
         | quality than vice versa.
        
         | moralestapia wrote:
         | It's due to the histrionic mental epidemic that we are going
         | through.
         | 
         | A lot of people are just like that IRL.
         | 
         | They cannot just say "the food was fine", it's usually some
         | crap like "What on earth! These are the best cheese sticks I've
         | had IN MY EN TI R E LIFE!".
        
           | shermantanktop wrote:
           | "I'm OBSESSED with the dipping sauce. So good."
        
         | hyperific wrote:
         | It's like their training set was made up entirely of awkward
         | podcaster banter.
        
           | ukuina wrote:
           | At least 83% Leo Laporte.
        
         | yapyap wrote:
         | they all sound like valley-people, complete with the raspy
         | voice and everything
        
         | swatcoder wrote:
         | We're used to hearing some kind of _identity_ behind voices --
         | we unconsciously sense clusters of vocabulary, intonation
         | patterns, tics, frequent interruption vs quiet patience,
         | silence tolerance, response patterns to various triggers, etc
         | that communicate a coherent _person_ of some kind.
         | 
         | We may not _know_ that a given speaker is a GenX Methodist from
         | Wisconsin that grew up at skate parks in the suburbs, but we
         | hear clusters of speech behavior that let our brain go "yeah,
         | I'm used to things fitting together in this way sometimes"
         | 
         | These don't have that.
         | 
         | Instead, they seem to mostly smudge together behaviors that are
         | just generally common in aggregate across the training data.
         | The speakers all voice interrupting acknowledgements eagerly,
         | they all use bright and enunciated podcaster tone, they all
         | draw on similar word choice, etc -- they distinguish gender and
         | each have a stable overall vocal tone, but no identity.
         | 
         | I don't doubt that this'll improve quickly though, by training
         | specific "AI celebrity" voices narrowed to sound more coherent,
         | natural, identifiable, and consistent. (And then, probably,
         | leasing out those voices for $$$.)
         | 
         | As a _tech demo_ for "render some vague sense of life behind
         | this generated dialog" this is pretty good, though.
        
           | lancesells wrote:
           | Agreed. To me it sounds like bad voice-over actors reading
           | from a script. So the natural parts of a conversation where
           | you might say the wrong thing and step back to correct
           | yourself are all gone. Impressive for sure.
        
             | htrp wrote:
             | every step of technological advancement builds on top of
             | the previous one.
             | 
             | now it's bad voice actors, in 2 years it'll be great ones
        
           | TimTheTinker wrote:
           | Whether this stops at the uncanny valley or progresses to
           | specific "AI celebrity" voices, I'm left thinking the
           | engineers involved in this never stopped to think carefully
           | about whether this _ought_ to be done in the first place.
        
             | jsheard wrote:
             | "Surely _my_ genAI product won't be used to spam zero-
             | effort slop all over the internet!"
             | 
             | - guy whose genAI product will definitely be used to spam
             | zero-effort slop all over the internet.
        
               | _DeadFred_ wrote:
               | I think their main target is corporate creative jobs.
               | Background music to ads/videos/etc. And just like with
               | all AI, they will eat the jobs that support the rest of
               | the system, making it a one and done. It will give a one
               | time boost, and then be stuck at that level because
               | creatives won't have the jobs that allowed them to add to
               | the domain. In this case new music styles. New
               | techniques. It's literally eating the seed corn where the
               | sprouts are the creatives working in the boring
               | commercial jobs that allow them to practice/become
               | experts in the tools/etc that they then build it all up
               | with. Their goal is to cut the jobs that create their training
               | data and the ecosystem that builds up/expands the domain.
               | Everywhere AI touches will basically be 'stuck using
               | Cobol' because AI will be frozen at the point in time
               | where the energy infusing 'sprouts' all had their jobs
               | replaced by AI and without them creating new output for
               | AI to train on it's all ossified.
               | 
               | We are witnessing in real time the answer to why 'The
               | Matrix' was set when it was. Once AI takes over there is
               | no future culture.
        
           | adamhartenz wrote:
           | To be fair, the majority of podcasts are from a group of
           | generic white guys, and they almost sound identical to these
           | AI generated ones. The AI actually seems to do a better
           | job too.
        
             | freestyle24147 wrote:
             | Citation absolutely needed. You call this fair?
             | 
             | > the majority of podcasts are from a group of generic
             | white guys
        
               | sangnoir wrote:
               | https://podcastcharts.byspotify.com/ keep the Pareto
               | distribution in mind
        
               | neom wrote:
               | I did the best fast research I could given not wanting to
               | spend more than 20 minutes on it and came to this result
               | (approx.):
               | 
               | - Mixed/Diverse: 48.0%
               | - White Men: 35.0%
               | - Women: 8.0%
               | - Non-White: 6.0%
               | - White Woman: 2.0%
               | - Non-White Woman: 1.0%
        
         | pvarangot wrote:
         | It's because it's probably trained with "professional audio",
         | ads, movies, audiobooks, and not "normal people talking". Like
         | the effect when diffusion was mostly trained with stock photos.
        
         | gwbas1c wrote:
         | I get the feeling that this is useful for something that
         | someone half-listens to.
        
       | nilsherzig wrote:
       | The voices are impressive (I can't tell the difference as a non-
       | native speaker) but their "personality" sounds extremely annoying
       | lmao
        
         | xanderlewis wrote:
         | I know. Can they do anything other than obnoxious Californian?
         | The vocal fry is off the charts.
        
       | mg wrote:
       | Is there a free (ad supported?) online tool without login that
       | reads text that you paste into it?
       | 
       | I often would like to listen to a blog post instead of reading
       | it, but haven't found an easy, quick solution yet.
       | 
       | I tried piping text through OpenAI's tts-1-hd model, and it is
       | the first one I ever found that is human-like enough for me to
       | like listening to it. So I could write a tool for my own use case
       | that pipes the text to tts-1-hd and plays the audio. But maybe
       | there is already something with a public web interface out there?
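
A minimal sketch of the kind of tool described above, assuming the
official `openai` Python package is installed and `OPENAI_API_KEY` is
set. The names `chunk_text` and `synthesize`, the "alloy" voice, the
output file names, and the 4096-character per-request limit are
illustrative assumptions, not something stated in the thread; check the
current API documentation before relying on them. Long posts are split
on sentence boundaries so each request stays under the assumed limit.

```python
import re

MAX_CHARS = 4096  # assumed per-request input limit; verify against the API docs


def chunk_text(text, limit=MAX_CHARS):
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def synthesize(text, out_prefix="post", voice="alloy"):
    """Send each chunk to tts-1-hd and write one MP3 file per chunk."""
    from openai import OpenAI  # third-party; imported lazily so chunk_text works without it

    client = OpenAI()
    paths = []
    for i, chunk in enumerate(chunk_text(text)):
        path = f"{out_prefix}_{i:03d}.mp3"  # hypothetical naming scheme
        with client.audio.speech.with_streaming_response.create(
            model="tts-1-hd", voice=voice, input=chunk
        ) as response:
            response.stream_to_file(path)
        paths.append(path)
    return paths
```

Calling `synthesize(blog_post_text)` would return the list of MP3 paths,
which can then be queued in any audio player. Playback is left out of
the sketch on purpose, since it depends on the platform.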
        
         | jasonjmcghee wrote:
         | There is on iOS. No ads. "Reader" by Eleven Labs. I haven't
         | used it that much but have listened to some white papers and
         | blogs (some of which were like 45 minutes) and it "just
         | worked". It even lets you click text you want to jump to.
         | 
         | And it's Eleven Labs quality, which unless I've fallen behind
         | the times is the highest quality TTS by a margin.
        
           | jangxx wrote:
           | There's also the built-in "Speak Selection" feature you can
           | enable in the accessibility settings.
        
           | ukuina wrote:
           | Reader is on a pretty good path to a monthly subscription
           | model. Great audio quality, large selection of voices, and
           | support for long-form input text.
        
         | infinita740 wrote:
         | I use MS Edge for this exact use case. Works well enough on any
         | platform
        
         | Jaxan wrote:
         | Both Windows and macOS (the operating systems) have this built-
         | in under accessibility. It's worth a try and I use it sometimes
         | when I want to read something while cooking.
        
         | beAbU wrote:
         | Good old Microsoft Sam? It'll sound like Stephen Hawking is
         | reading it to you!
        
       | 101008 wrote:
       | It looks like lately a lot of progress has been made in audio
       | generation / audio understanding (everything related to speech, I
       | mean).
       | 
       | Is this related to LLMs, or is this a completely different branch
       | of AI, and is it just a coincidence? I am curious.
        
       | ruffrey wrote:
       | > This means it generates audio over 40-times faster than real
       | time.
       | 
       | Astounding
        
       | lrkehab wrote:
       | YouTube videos are already infested with insufferable AI elevator
       | background "music". Even some channels that were previously good
       | are using it.
       | 
       | On the bright side, you can stop watching these channels and have
       | more time for serious things.
        
         | ipsum2 wrote:
         | > AI elevator background "music".
         | 
         | What are some examples? I haven't encountered this.
        
           | xanderlewis wrote:
           | Just search 'jazz' on YouTube.
           | 
           | Almost all of the results will not consist of 'jazz' in any
           | real sense, but instead a collection of uncanny melodies and
           | chord progressions that wander around going nowhere,
           | traditionally accompanied by an obscenely eye-offending
           | diffusion model-generated mishmash of seasonal tropes and
           | incongruent interior design choices. Often, it's MIDI bossa
           | nova presumably written by either a machine or someone who's
           | only ever heard a few bars of music at a time and has no idea
           | that 'feel' or 'soul' are a thing.
        
       | henning wrote:
       | To paraphrase the great Bertram Gilfoyle, computers don't need to
       | produce fake vocal tics.
        
       | seydor wrote:
       | But what's the end goal and audience here? I don't believe people
       | will resonate with robots making "ums" and "ohs" because people
       | usually resonate with an artist, a producer, a writer, a singer
       | etc. A human layer with which people can empathize is essential.
       | This can work as long as people are deceived and don't know there
       | is no human behind it. If however i find out that a video is
       | AI-generated i instantly lose interest in it. There are e.g. a lot
       | of AI-generated architecture videos on youtube at the moment, i
       | have never wanted to listen to one, because i know the emotions
       | will be fake.
        
       | corry wrote:
       | I think I put my finger on exactly why it sounds a bit uncanny-
       | valley: it sounds like humans who are reading from a prepared
       | 'bit' or 'script'.
       | 
       | We've all been on those webinars where it's clear -- despite the
       | infusions (on cue) of "enthusiasm" from the speaker attempting to
       | make it sound more natural and off-the-cuff -- that they are
       | reading from a script.
       | 
       | It's a difficult-to-mask phenomenon for humans.
       | 
       | That all said, I actually have more grace for an AI sounding like
       | this than I do for a human presenter reading from a script. Like,
       | if I'm here "live" and paying attention to what you're saying, at
       | least do me the service of truly being "here" with me and
       | authentically communicating vs. simply reading something.
       | 
       | If you're going to simply read something, then just send it to me
       | to read too - don't pretend it's a spontaneously synchronous
       | communication.
        
       | ironlake wrote:
       | Is this another fake like the Google bot that made reservations
       | at a restaurant?
        
       | jchanimal wrote:
       | We've been using this at work to get inside of our customer's
       | perspective. It's helpful to throw eg a bunch of point-of-sale
       | data sync challenges into Notebook LM and eg pass a 10 minute
       | audio to the team so they can understand where our work fits in.
        
       ___________________________________________________________________
       (page generated 2024-10-30 23:00 UTC)