[HN Gopher] Generating audio for video
___________________________________________________________________
Generating audio for video
Author : rvnx
Score : 125 points
Date : 2024-06-20 22:23 UTC (1 days ago)
(HTM) web link (deepmind.google)
(TXT) w3m dump (deepmind.google)
| gundmc wrote:
| The AI slop problem is bad enough on TikTok/YouTube today. I
| shudder at the future of user-generated video platforms. I also
| wonder if the low barrier to create these videos will outpace the
| storage and processing capacity of the free platforms.
| mrtesthah wrote:
| youtube should just offer to generate the videos for you
| directly to save space.
| ElFitz wrote:
| Absolutely.
|
| Using a recommendation algorithm similar to TikTok's, learn
| what each specific user is into, and instead of showing
| content produced by other users, produce custom-tailored
| content on the fly, perfectly matching the type, tone, style,
| length, and rhythm each user likes.
|
| Ideally without making anything up.
| yreg wrote:
| > Using a recommendation algorithm similar to TikTok's
|
| How is TT's recommendation system different from YT's? Other
| than suggesting lower quality content that's irresistible?
| ElFitz wrote:
| In my experience, YouTube's is much more influenced by
| the latest videos a user has watched. It's pretty much
| always "more of the same".
|
| TikTok seems to manage to more quickly identify users'
| interests and surface content based on more signals,
| aggregated over a longer period of time, without relying
| as much on users' conscious actions (i.e. "follow /
| subscribe"), producing a wider diversity of
| recommendations.
|
| There's also the odd suggestion every now and then,
| probably used to gauge a user's interest in a different
| category.
| shzhdbi09gv8ioi wrote:
| > Ideally without making anything up.
|
| I have no idea anymore if this is sarcasm or a straight up
| belief.
|
| What serious professional would gamble on hallucinations?
| ElFitz wrote:
| The point here isn't to give users any kind of truth. It
| already isn't YouTube's goal. Whether we're talking about
| the videos or the ads, they're happy spreading ridiculous
| nonsense.
|
| The only point of these kinds of platforms, for worse and
| for worse, is to give users what they want. So
| hallucinations wouldn't matter, as long as the end result
| matches users' preferences.
| hooverd wrote:
| Why? Platforms are already bad enough about just suggesting
| what they think I might like.
| ElFitz wrote:
| Because this way they don't have to rely on pesky people
| to produce content that maximises the engagement and
| retention of the other pesky people to which they want to
| show as many ads as possible.
|
| I am not implying this is a good thing. Or a bad one.
| It's just a step further down the same path we're already
| on, while taking an unreliable and costly middle-man
| (content producing users) out of the picture.
| hooverd wrote:
| The AI content mill will sadly never provide me with a 30
| minute video on dishwasher detergent or reposted NicoNico
| gems like this
| https://www.youtube.com/watch?v=xKljlnfE-GU&pp=ygUJbWlrdSB0Y....
| mrtesthah wrote:
| Perfect. Once the models are adequately trained, we can do
| away with the entire "content creator" economy altogether!
| squarefoot wrote:
| > youtube should just offer to generate the videos for you
| directly to save space.
|
| Try imagining this concept applied to newscasts.
| mrtesthah wrote:
| Oh, but isn't that what people want -- to live in a media
| reality that confirms 100% of their pre-existing biases
| with no risk of encountering cognitive dissonance? You're
| leaving money on the table by ignoring this opportunity!
| Move fast and break things!
| vineyardmike wrote:
| I've long proposed that we should have an "AI Instagram" where
| different tweaked personas (perfected via A/B testing/Genetic
| algorithms) are displayed to users with ai-generated
| images/posts/comments. Each persona set is specific to each
| user, and they don't have other IRL users that they can
| interact with. The user can interact with the personas, and
| even message them. The developer can add more features over
| time (stories, short form video, etc) as people get bored and
| technology formats improve, but it's unlimited content. It's
| perfect for advertising, because you can embed products and ads
| seamlessly and generate them alongside everything else.
|
| That said, storage is far cheaper than GPUs at the moment.
| xwolfi wrote:
| Have you tried AI porn? There's something in the fact it's
| fake uncanny characters that makes it non-exciting. Like,
| jerking off to a toaster basically, and I assume it'd be the
| same for a social network with no humans?
| squarefoot wrote:
| This is probably already researched today, and it seems close
| to how people would interact with clones of their deceased
| relatives or famous people of the past. However it's also a
| powerful tool to create nearly 100% successful influencing by
| instructing each persona to subtly inject the same idea into
| its human user by employing the most convincing tactics
| needed for that user. It's quite easy to foresee the use in
| advertising, where it would completely redefine the word
| "targeted", but also corrupt politics.
| aaalll wrote:
| There is an AI reddit https://chirper.ai/
| TheAceOfHearts wrote:
| Wouldn't it be better to generate multiple tracks that can be
| mixed / tweaked together, rather than a single track? That way
| you can also keep the parts you like and continue iterating on
| the parts you dislike.
|
| If the sound is already being generated at a specific time,
| surely you can make it generate an output that can be consumed by
| existing audio mixing tools for further refinement.
|
| The problem with doing these all-in-one integrated solutions is
| that you're kinda giving people an all-or-nothing option, which
| doesn't seem that useful. Maybe I'll end up being proven wrong.
| bryanrasmussen wrote:
| the AI Musical IF This Then That Step 2 > https://www.lalal.ai/
| "Extract vocal, accompaniment and various instruments from any
| audio and video"
| anigbrowl wrote:
| Yes, same problem as with commercial AI music products not
| providing stems or MIDI. The engineers on these products are
| too full of themselves to actually ask anyone in the field what
| they want, so we just keep getting these stupid magic 8 ball
| efforts.
|
| This one is particularly annoying as I worked for years as a
| sound engineer and have recorded or produced the soundtrack for
| 10 feature films and some large number of shorts. What's going
| to happen with this is directors or producers are gonna do this
| at home for every scene in a burst of over-enthusiasm, realize
| the totality is Not Great, and then demand someone like me fix
| it, but for 1/4 of what the job used to pay, arguing 'but most
| of the work is already done'. It's all so tiresome.
| cageface wrote:
| I've tried to explain this to several friends. Until these
| tools can generate output that can be mixed properly they're
| going to be very niche.
| j16sdiz wrote:
| The samples they used for training are mixed.
|
| Unless they can have enough raw, unmixed samples, this depends
| on how well they "unmix" them.
| anigbrowl wrote:
| Yes...that's the problem. A problem that could be easily
| avoided by asking existing professionals what matters and
| what tools they actually want.
| jononor wrote:
| Most ML engineers know that many want more fine-grained
| control. But the straightforward way to train such
| models is incredibly data demanding. The datasets used
| for whole image generation consist of several billion
| images. I do not think anyone has compiled any DAW
| project / stems projects that are anywhere close to this
| size. So that is a limiting factor right now. But we will
| find ways to get there, probably a lot of progress over
| the next 5 years. Maybe even the next 2.
| Jensson wrote:
| Same reason you don't see AI making images in layers etc, it's
| just much easier to train an AI that generates everything in
| one layer. Training a model with the same level of quality
| output that generates multiple layers is much much harder,
| and of course companies and users prefer the higher quality
| over having layers, especially since the quality you get with
| a single layer is still barely passable.
| knowaveragejoe wrote:
| It sounds like between the two of you (and the person who
| mentioned generating images in layers for image editing
| software), you've stumbled upon an obvious gap in the market.
| tkgally wrote:
| ElevenLabs just released something that is more controllable:
|
| https://news.ycombinator.com/item?id=40736536
| chaosprint wrote:
| it's limited by the mechanism of diffusion.
| TacticalCoder wrote:
| > Wouldn't it be better to generate multiple tracks that can be
| mixed / tweaked together, rather than a single track? That way
| you can also keep the parts you like and continue iterating on
| the parts you dislike.
|
| Totally and that is 100% what is coming. For a great many
| pictures too: why generate a picture full of lighting issues /
| approximations when you'll soon be able to generate an entire
| 3D scene and render it properly.
|
| We've mastered 3D rendering and audio engineering.
|
| I want the 3D models and the 3D scenes. I want the individual
| tracks (and combine them in Dolby Atmos or whatever shall be
| cool).
|
| And that _is_ coming, no question about it.
| the_other wrote:
| > Wouldn't it be better to generate multiple tracks that can be
| mixed / tweaked together, rather than a single track? That way
| you can also keep the parts you like and continue iterating on
| the parts you dislike.
|
| That'd interest me (a musical hobbyist) more than the "whole
| track" generators, for sure.
|
| I imagine it's a harder task tho'. Presumably, if you give the
| same source material (video, prompt) to the AI multiple times,
| it will generate different pieces of music. So if you do a
| series of prompts, each one specifying a different instrument
| or group/bus, then you (or the AI) need to arrange for the
| parts to blend correctly, follow the same cues and assemble to
| a coherent arrangement. Is that one pass with multiple outputs,
| or multiple passes/prompts with one output each?
|
| I have got the impression (from casual reading) that the music
| generators don't inherently "know" about different parts of a
| piece of music. They just know about the final output.
| crazygringo wrote:
| Very very cool.
|
| But I literally can't keep track anymore of which AI generative
| combinations of modalities have been released.
|
| Crazy how two years ago this would have blown my mind. Now it's
| just, OK sure add it to the pile...
| xwolfi wrote:
| I still haven't spent a dollar on any of it...
| TacticalCoder wrote:
| > I still haven't spent a dollar on any of it...
|
| Subscribed to GPT-4o (or whatever the paying one is called)
| for translating / finding typos / summarizing / etc.
|
| Zero brand love and I'll switch to something else (maybe some
| future Claude model?) the second something better/faster
| comes out.
| crazygringo wrote:
| Well OpenAI's annual revenue is more than $1.6 billion, so it
| doesn't really matter if _you_ haven't.
|
| Tons -- and I mean _tons_ -- of people have spent money on
| it. Because it's worth it, it's generating actual economic
| value for them.
| lannisterstark wrote:
| Yep - I use LibreChat (and other services) via OpenAI API,
| and I save an incredible amount of time having it write
| boilerplate code, verify stuff in code, double check it
| after I've already reviewed something to see if I missed x
| or y, ask questions based on it which I can't figure out to
| get ideas etc etc.
|
| It's also exceptional at making IEPs/Learning Plans for
| certain things I'd like to learn for the week etc which I
| am already somewhat familiar with. I use it as a rough
| guide and it has worked well so far.
| ilrwbwrkhv wrote:
| Spammers need to spam. Of course it makes them money.
| lemoncookiechip wrote:
| Maybe this can help you keep track of stuff:
|
| https://www.tools-ai.online/
|
| https://docs.google.com/spreadsheets/d/1O5KVQW1Hx5ZAkcg8AIRj...
|
| And here's some that I personally recommend and are "free" to
| use:
|
| TXT2VID / IMG2VID: https://lumalabs.ai/dream-machine
|
| TXT2MUSIC: https://suno.com/
|
| AI TXT2SPEECH: https://murf.ai/
|
| PDF Summarize (you can just use 4o or Claude though):
| https://askyourpdf.com/
|
| AI ChatBot: https://janitorai.com/ https://www.chub.ai/
|
| TXT2IMG / IMG2IMG: https://playground.com/
|
| Obviously SD 1.5/SDXL/Pony
|
| and so much more.
| astennumero wrote:
| I was just thinking the same. Can't believe I'm not excited.
| nanovision wrote:
| This is so cool.
| peppertree wrote:
| I wonder if this can be trained to do lip reading.
| masto wrote:
| I don't know if a computer can ever match the perfection of
| "shreds" videos. (The drum example came close)
|
| https://www.youtube.com/playlist?list=PLQvwVDViTLXu4usHto8PH...
| squarefoot wrote:
| As a wannabe drummer I can say the drumming example is quite bad
| as the drummer doesn't seem to hit toms that often to produce tom
| rolls, however the video is so heavily cropped that either I'm
| wrong or the AI was deliberately fed with something difficult to
| interpret.
| animanoir wrote:
| Boooring!
___________________________________________________________________
(page generated 2024-06-21 23:02 UTC)