[HN Gopher] Pushing the frontiers of audio generation
___________________________________________________________________
Pushing the frontiers of audio generation
Author : meetpateltech
Score : 148 points
Date : 2024-10-30 15:02 UTC (7 hours ago)
(HTM) web link (deepmind.google)
(TXT) w3m dump (deepmind.google)
| jameszhao00 wrote:
| Try it out in the demo
| https://cloud.google.com/text-to-speech/?hl=en and in the API
| https://cloud.google.com/text-to-speech/docs/create-dialogue...
| deskr wrote:
| If I change the language in the demo, it removes all my text
| and replaces it with a template text. That's bad.
| tmjdev wrote:
| While it is impressive and I like to follow the advancements in
| this field, it is incredibly frustrating to listen to. I can't
| put my finger on why exactly. It's definitely closer to human-
| sounding, but the uncanny valley is so deep here that I find
| myself thinking "I just want the point, not the fake personality
| that is coming with it". I can't make it through a 30s demo.
| xnx wrote:
| Agreed. To be fair, I also get annoyed by fake/exaggerated
| expression from human podcasters.
| iNic wrote:
| It sounds like every sentence is an ad read.
| JoblessWonder wrote:
| Yeah... It isn't that it doesn't sound like human speech...
| it just sounds like how humans speak when they are
| uncomfortable or reading prepared and they aren't good at it.
| semitones wrote:
| I suppose it doesn't matter if it is a human, or a bot
| delivering the message, if the message is boring
| rob wrote:
| Probably because you're expecting it and looking at a demo
| page. Put these voices behind a real video or advertisement and
| I would imagine most people wouldn't be able to tell that it's
| AI generated at all.
| Veen wrote:
| It'd be annoying to me whether it was AI or human. The faux-
| excitement and pseudo-bonhomie is grating. They should focus
| on how people actually talk, not on copying the vocal
| intonation of coked-up public radio presenters just back from
| a positive affirmation seminar.
| beoberha wrote:
| Totally agree. Maybe it's just the clips they chose, but it
| feels overfit on the weird conversational elements that make it
| impressive? Like the "oh yeahs" from the other person when
| someone is speaking. It is cool to see that natural flow in a
| conversation generated by a model, but there's waaaay too much
| of it in these examples to sound natural.
|
| And I say all that completely slackjawed that this is possible.
| amelius wrote:
| > Like the "oh yeahs" from the other person when someone is
| speaking.
|
| I bet that if you select a British accent you will get fewer
| of them.
| kelseyfrog wrote:
| Right mate
| bryanrasmussen wrote:
| I'm hoping it will be a lot of Ok Guv'ner and right you
| ares in the style of Dick Van Dyke.
| mindcrime wrote:
| Gor blimey lad, that's the problem now innit???
| Dilettante_ wrote:
| Cheeky bugger, you are
| KineticLensman wrote:
| > a British accent
|
| Hmm.... Scottish, Welsh, Irish (Nor'n) or English? If
| English, North or South? If North, which city? Brummie?
| Scouse? If South, London? Cockney or Multicultural London
| English [0]?
|
| [0]
| https://en.wikipedia.org/wiki/Multicultural_London_English
| beAbU wrote:
| Need to increase your granularity a bit. I live in
| Wexford Town, Ireland, and the other day I was chatting
| to a person that told me their old schoolmates from
| Castlebridge are making fun of their accent changing
| since moving from their hometown.
|
| Castlebridge is 10 minutes away by car. Madness!
| KineticLensman wrote:
| Yeah, totally agree. Here's a useful link for non-Brits,
| that goes into a bit more detail:
|
| https://accentbiasbritain.org/accents-in-britain/
|
| Also, we have yet to precisely define what is
| meant by 'British'. This probably needs a "20 falsehoods
| people believe about..."-type article.
| echelon wrote:
| I love the technology, but I really don't want AI to sound
| like this.
|
| Imagine being stuck on a call with this.
|
| > "Hey, so like, is there anything I can help you with
| today?"
|
| > "Talk to a person."
|
| > "Oh wow, right. (chuckle) You got it. Well, before I
| connect you, can you maybe tell me a little bit more about
| what problem you're having? For example, maybe it's something
| to do with..."
| cmehdy wrote:
| That's how the DJ feature of Spotify talks and it's pretty
| jarring.
|
| "How's it going. We're gonna start by taking you back to
| your 2022 favorites, starting with the sweet sounds of
| XYZ". There's very little you can tweak about it, the
| suggestions kinda suck, but you're getting a fake friend to
| introduce them to you. Yay, I guess..
| kelseyfrog wrote:
| I'd love to see stats on disfluency rate in conversation,
| podcasts, and this sample to get an idea of where it lies. It
| seems like they could have cranked it up, but there's also
| the chance that it's just the frequency illusion because we
| were primed to pay attention to it.
| onion2k wrote:
| That could just be the context though. Listening to a clip
| that's a demo of what the model can produce is very different
| to listening to a YouTube video that's using the model to
| generate speech about something you'd actually want to watch a
| video of.
| kaibee wrote:
| > Example of a multi-speaker dialogue generated by NotebookLM
| Audio Overview, based on a few potato-related documents.
|
| Listening to this on 1.75x speed is excellent. I think the
| generated speaking speed is slow for audio quality, bc it'd be
| much harder to slow-down the generated audio while retaining
| quality than vice versa.
| moralestapia wrote:
| It's due to the histrionic mental epidemic that we are going
| through.
|
| A lot of people are just like that IRL.
|
| They cannot just say "the food was fine", it's usually some
| crap like "What on earth! These are the best cheese sticks I've
| had IN MY EN TI R E LIFE!".
| shermantanktop wrote:
| "I'm OBSESSED with the dipping sauce. So good."
| hyperific wrote:
| It's like their training set was made up entirely of awkward
| podcaster banter.
| ukuina wrote:
| At least 83% Leo Laporte.
| yapyap wrote:
| they all sound like valley-people, complete with the raspy
| voice and everything
| swatcoder wrote:
| We're used to hearing some kind of _identity_ behind voices --
| we unconsciously sense clusters of vocabulary, intonation
| patterns, tics, frequent interruption vs quiet patience,
| silence tolerance, response patterns to various triggers, etc
| that communicate a coherent _person_ of some kind.
|
| We may not _know_ that a given speaker is a GenX Methodist from
| Wisconsin that grew up at skate parks in the suburbs, but we
| hear clusters of speech behavior that let our brain go "yeah,
| I'm used to things fitting together in this way sometimes"
|
| These don't have that.
|
| Instead, they seem to mostly smudge together behaviors that are
| just generally common in aggregate across the training data.
| The speakers all voice interrupting acknowledgements eagerly,
| they all use bright and enunciated podcaster tone, they all
| draw on similar word choice, etc -- they distinguish gender and
| each have a stable overall vocal tone, but no identity.
|
| I don't doubt that this'll improve quickly though, by training
| specific "AI celebrity" voices narrowed to sound more coherent,
| natural, identifiable, and consistent. (And then, probably,
| leasing out those voices for $$$.)
|
| As a _tech demo_ for "render some vague sense of life behind
| this generated dialog" this is pretty good, though.
| lancesells wrote:
| Agreed. To me it sounds like bad voice-over actors reading
| from a script. So the natural parts of a conversation where
| you might say the wrong thing and step back to correct
| yourself are all gone. Impressive for sure.
| htrp wrote:
| every step of technological advancement builds on top of
| the previous one.
|
| now it's bad voice actors, in 2 years it'll be great ones
| TimTheTinker wrote:
| Whether this stops at the uncanny valley or progresses to
| specific "AI celebrity" voices, I'm left thinking the
| engineers involved in this never stopped to think carefully
| about whether this _ought_ to be done in the first place.
| jsheard wrote:
| "Surely _my_ genAI product won't be used to spam zero-
| effort slop all over the internet!"
|
| - guy whose genAI product will definitely be used to spam
| zero-effort slop all over the internet.
| _DeadFred_ wrote:
| I think their main target is corporate creative jobs.
| Background music to ads/videos/etc. And just like with
| all AI, they will eat the jobs that support the rest of
| the system, making it a one and done. It will give a one
| time boost, and then be stuck at that level because
| creatives won't have the jobs that allowed them to add to
| the domain. In this case new music styles. New
| techniques. It's literally eating the seed corn where the
| sprouts are the creatives working in the boring
| commercial jobs that allow them to practice/become
| experts in the tools/etc that they then use to build it all up.
| Their goal is to cut the jobs that create their training
| data and the ecosystem that builds up/expands the domain.
| Everywhere AI touches will basically be 'stuck using
| Cobol' because AI will be frozen at the point in time
| where the energy infusing 'sprouts' all had their jobs
| replaced by AI and without them creating new output for
| AI to train on it's all ossified.
|
| We are witnessing in real time the answer to why 'The
| Matrix' was set when it was. Once AI takes over there is
| no future culture.
| adamhartenz wrote:
| To be fair, the majority of podcasts are from a group of
| generic white guys, and they almost sound identical to these
| AI-generated ones. The AI actually seems to do a better
| job too.
| freestyle24147 wrote:
| Citation absolutely needed. You call this fair?
|
| > the majority of podcasts are from a group of generic
| white guys
| sangnoir wrote:
| https://podcastcharts.byspotify.com/ keep the Pareto
| distribution in mind
| neom wrote:
| I did the best fast research I could given not wanting to
| spend more than 20 minutes on it and came to this result
| (approx.):
| - Mixed/Diverse: 48.0%
| - White Men: 35.0%
| - Women: 8.0%
| - Non-White: 6.0%
| - White Woman: 2.0%
| - Non-White Woman: 1.0%
| pvarangot wrote:
| It's because it's probably trained with "professional audio",
| ads, movies, audiobooks, and not "normal people talking". Like
| the effect when diffusion was mostly trained with stock photos.
| gwbas1c wrote:
| I get the feeling that this is useful for something that
| someone half-listens to.
| nilsherzig wrote:
| The voices are impressive (I can't tell the difference as a
| non-native speaker) but their "personality" sounds extremely annoying
| lmao
| xanderlewis wrote:
| I know. Can they do anything other than obnoxious Californian?
| The vocal fry is off the charts.
| mg wrote:
| Is there a free (ad supported?) online tool without login that
| reads text that you paste into it?
|
| I often would like to listen to a blog post instead of reading
| it, but haven't found an easy, quick solution yet.
|
| I tried piping text through OpenAI's tts-1-hd model, and it is
| the first one I ever found that is human-like enough for me to
| like listening to it. So I could write a tool for my own use case
| that pipes the text to tts-1-hd and plays the audio. But maybe
| there is already something with a public web interface out there?
| jasonjmcghee wrote:
| There is on iOS. No ads. "Reader" by Eleven Labs. I haven't
| used it that much but have listened to some white papers and
| blogs (some of which were like 45 minutes) and it "just
| worked". Even lets you click text you want to jump to.
|
| And it's Eleven Labs quality, which unless I've fallen behind
| the times is the highest quality TTS by a margin.
| jangxx wrote:
| There's also the built-in "Speak Selection" feature you can
| enable in the accessibility settings.
| ukuina wrote:
| Reader is on a pretty good path to a monthly subscription
| model. Great audio quality, large selection of voices, and
| support for long-form input text.
| infinita740 wrote:
| I use ms edge for this exact use case. Works well enough on any
| platform
| Jaxan wrote:
| Both Windows and macOS (the operating systems) have this built-
| in under accessibility. It's worth a try and I use it sometimes
| when I want to read something while cooking.
| beAbU wrote:
| Good old Microsoft Sam? It'll sound like Stephen Hawking is
| reading it to you!
| 101008 wrote:
| It looks like lately a lot of progress has been made in audio
| generation / audio understanding (everything related to speech, I
| mean).
|
| Is this related to LLMs, or is this a completely different branch
| of AI, and is it just a coincidence? I am curious.
| ruffrey wrote:
| > This means it generates audio over 40-times faster than real
| time.
|
| Astounding
| lrkehab wrote:
| YouTube videos are already infested with insufferable AI elevator
| background "music". Even some channels that were previously good
| are using it.
|
| On the bright side, you can stop watching these channels and have
| more time for serious things.
| ipsum2 wrote:
| > AI elevator background "music".
|
| What are some examples? I haven't encountered this.
| xanderlewis wrote:
| Just search 'jazz' on YouTube.
|
| Almost all of the results will not consist of 'jazz' in any
| real sense, but instead a collection of uncanny melodies and
| chord progressions that wander around going nowhere,
| traditionally accompanied by an obscenely eye-offending
| diffusion model-generated mishmash of seasonal tropes and
| incongruent interior design choices. Often, it's MIDI bossa
| nova presumably written by either a machine or someone who's
| only ever heard a few bars of music at a time and has no idea
| that 'feel' or 'soul' are a thing.
| henning wrote:
| To paraphrase the great Bertram Gilfoyle, computers don't need to
| produce fake vocal tics.
| seydor wrote:
| But what's the end goal and audience here? I don't believe people
| will resonate with robots making "um" and "ohs" because people
| usually resonate with an artist, a producer, a writer, a singer
| etc. A human layer with which people can empathize is essential.
| This can work as long as people are deceived and don't know there
| is no human behind it. If, however, I find out that a video is
| AI-generated, I instantly lose interest in it. There are, e.g., a
| lot of AI-generated architecture videos on YouTube at the moment; I
| have never wanted to listen to one, because I know the emotions
| will be fake.
| corry wrote:
| I think I put my finger on exactly why it sounds a bit uncanny-
| valley: it sounds like humans who are reading from a prepared
| 'bit' or 'script'.
|
| We've all been on those webinars where it's clear -- despite the
| infusions (on cue) of "enthusiasm" from the speaker attempting to
| make it sound more natural and off-the-cuff -- that they are
| reading from a script.
|
| It's a difficult-to-mask phenomenon for humans.
|
| That all said, I actually have more grace for an AI sounding like
| this than I do for a human presenter reading from a script. Like,
| if I'm here "live" and paying attention to what you're saying, at
| least do me the service of truly being "here" with me and
| authentically communicating vs. simply reading something.
|
| If you're going to simply read something, then just send it to me
| to read too - don't pretend it's a spontaneously synchronous
| communication.
| ironlake wrote:
| Is this another fake like the Google bot that made reservations
| at a restaurant?
| jchanimal wrote:
| We've been using this at work to get inside our customer's
| perspective. It's helpful to throw, e.g., a bunch of point-of-sale
| data sync challenges into NotebookLM and pass a 10-minute
| audio to the team so they can understand where our work fits in.
___________________________________________________________________
(page generated 2024-10-30 23:00 UTC)