[HN Gopher] Generating audio for video
       ___________________________________________________________________
        
       Generating audio for video
        
       Author : rvnx
       Score  : 125 points
       Date   : 2024-06-20 22:23 UTC (1 days ago)
        
 (HTM) web link (deepmind.google)
 (TXT) w3m dump (deepmind.google)
        
       | gundmc wrote:
       | The AI slop problem is bad enough on TikTok/YouTube today. I
       | shudder at the future of user-generated video platforms. I also
       | wonder if the low barrier to create these videos will outpace the
       | storage and processing capacity of the free platforms.
        
         | mrtesthah wrote:
         | youtube should just offer to generate the videos for you
         | directly to save space.
        
           | ElFitz wrote:
           | Absolutely.
           | 
           | Using a recommendation algorithm similar to TikTok's, learn
           | what each specific user is into, and instead of showing
           | content produced by other users, produce custom-tailored
           | content on the fly, perfectly matching the type, tone, style,
           | length, and rhythm each user likes.
           | 
           | Ideally without making anything up.
        
             | yreg wrote:
             | > Using a recommendation algorithm similar to TikTok's
             | 
             | How is TT's recommendation system different from YT's? Other
             | than suggesting lower quality content that's irresistible?
        
               | ElFitz wrote:
               | In my experience, YouTube's is much more influenced by
               | the latest videos a user has watched. It's pretty much
               | always "more of the same".
               | 
               | TikTok seems to manage to more quickly identify users'
               | interests and surface content based on more signals,
               | aggregated over a longer period of time, without relying
               | as much on users' conscious actions (i.e. "follow /
               | subscribe"), producing a wider diversity of
               | recommendations.
               | 
               | There's also the odd suggestion every now and then,
               | probably used to gauge a user's interest in a different
               | category.
        
             | shzhdbi09gv8ioi wrote:
             | > Ideally without making anything up.
             | 
             | I have no idea anymore if this is sarcasm or a straight up
             | belief.
             | 
             | What serious professional would gamble on hallucinations?
        
               | ElFitz wrote:
               | The point here isn't to give users any kind of truth. It
               | already isn't YouTube's goal. Whether we're talking about
               | the videos or the ads, they're happy spreading ridiculous
               | nonsense.
               | 
               | The only point of these kinds of platforms, for worse and
               | for worse, is to give users what they want. So
               | hallucinations wouldn't matter, as long as the end result
               | matches users' preferences.
        
             | hooverd wrote:
             | Why? Platforms are already bad enough about just suggesting
             | what they think I might like.
        
               | ElFitz wrote:
               | Because this way they don't have to rely on pesky people
               | to produce content that maximises the engagement and
               | retention of the other pesky people to which they want to
               | show as many ads as possible.
               | 
               | I am not implying this is a good thing. Or a bad one.
               | It's just a step further down the same path we're already
               | on, while taking an unreliable and costly middle-man
               | (content producing users) out of the picture.
        
               | hooverd wrote:
               | The AI content mill will sadly never provide me with a 30
               | minute video on dishwasher detergent or reposted NicoNico
               | gems like this
               | https://www.youtube.com/watch?v=xKljlnfE-GU&pp=ygUJbWlrdSB0Y....
        
             | mrtesthah wrote:
             | Perfect. Once the models are adequately trained, we can do
             | away with the entire "content creator" economy altogether!
        
           | squarefoot wrote:
           | > youtube should just offer to generate the videos for you
           | directly to save space.
           | 
           | Try imagining this concept applied to newscasts.
        
             | mrtesthah wrote:
             | Oh, but isn't that what people want -- to live in a media
             | reality that confirms 100% of their pre-existing biases
             | with no risk of encountering cognitive dissonance? You're
             | leaving money on the table by ignoring this opportunity!
             | Move fast and break things!
        
         | vineyardmike wrote:
         | I've long proposed that we should have an "AI Instagram" where
         | different tweaked personas (perfected via A/B testing/genetic
         | algorithms) are displayed to users with AI-generated
         | images/posts/comments. Each persona set is specific to each
         | user, and they don't have other IRL users that they can
         | interact with. The user can interact with the personas, and
         | even message them. The developer can add more features over
         | time (stories, short form video, etc) as people get bored and
         | technology formats improve, but it's unlimited content. It's
         | perfect for advertising, because you can embed products and ads
         | seamlessly and generate them alongside everything else.
         | 
         | That said, storage is far cheaper than GPUs at the moment.
        
           | xwolfi wrote:
           | Have you tried AI porn? There's something in the fact it's
           | fake uncanny characters that makes it non-exciting. Like,
           | jerking off to a toaster basically, and I assume it'd be the
           | same for a social network with no humans?
        
           | squarefoot wrote:
           | This is probably already researched today, and it seems close
           | to how people would interact with clones of their deceased
           | relatives or famous people of the past. However it's also a
           | powerful tool to create nearly 100% successful influencing by
           | instructing each persona to subtly inject the same idea into
           | its human user by employing the most convincing tactics
           | needed for that user. It's quite easy to foresee the use in
           | advertising, where it would completely redefine the word
           | "targeted", but also corrupt politics.
        
           | aaalll wrote:
           | There is an AI reddit https://chirper.ai/
        
       | TheAceOfHearts wrote:
       | Wouldn't it be better to generate multiple tracks that can be
       | mixed / tweaked together, rather than a single track? That way
       | you can also keep the parts you like and continue iterating on
       | the parts you dislike.
       | 
       | If the sound is already being generated at a specific time,
       | surely you can make it generate an output that can be consumed by
       | existing audio mixing tools for further refinement.
       | 
       | The problem with doing these all-in-one integrated solutions is
       | that you're kinda giving people an all-or-nothing option, which
       | doesn't seem that useful. Maybe I'll end up being proven wrong.
        
         | bryanrasmussen wrote:
         | the AI Musical IF This Then That Step 2 > https://www.lalal.ai/
         | "Extract vocal, accompaniment and various instruments from any
         | audio and video"
        
         | anigbrowl wrote:
         | Yes, same problem as with commercial AI music products not
         | providing stems or MIDI. The engineers on these products are
         | too full of themselves to actually ask anyone in the field what
         | they want, so we just keep getting these stupid magic 8 ball
         | efforts.
         | 
         | This one is particularly annoying as I worked for years as a
         | sound engineer and have recorded or produced the soundtrack for
         | 10 feature films and some large number of shorts. What's going
         | to happen with this is directors or producers are gonna do this
         | at home for every scene in a burst of over-enthusiasm, realize
         | the totality is Not Great, and then demand someone like me fix
         | it, but for 1/4 of what the job used to pay, arguing 'but most
         | of the work is already done'. It's all so tiresome.
        
           | cageface wrote:
           | I've tried to explain this to several friends. Until these
           | tools can generate output that can be mixed properly they're
           | going to be very niche.
        
           | j16sdiz wrote:
           | The samples they used for training are mixed.
           | 
           | Unless they can have enough raw, unmixed samples, this depends
           | on how well they "unmix" them.
        
             | anigbrowl wrote:
             | Yes...that's the problem. A problem that could be easily
             | avoided by asking existing professionals what matters and
             | what tools they actually want.
        
               | jononor wrote:
               | Most ML engineers know that many want more fine-grained
               | control. But the straightforward way to train such
               | models is incredibly data demanding. The datasets used
               | for whole image generation consist of several billion
               | images. I do not think anyone has compiled any DAW
               | project / stems projects that are anywhere close to this
               | size. So that is a limiting factor right now. But we will
               | find ways to get there, probably a lot of progress over
               | the next 5 years. Maybe even the next 2.
        
           | Jensson wrote:
           | Same reason you don't see AI making images in layers etc, it's
           | just much easier to train an AI that generates everything in
           | one layer. Training a model with the same level of quality
           | output that generates multiple layers is much much harder,
           | and of course companies and users prefer the higher quality
           | over having layers, especially since the quality you get with
           | a single layer is still barely passable.
        
           | knowaveragejoe wrote:
           | It sounds like between the two of you (and the person who
           | mentioned generating images in layers for image editing
           | software), you've stumbled upon an obvious gap in the market.
        
         | tkgally wrote:
         | ElevenLabs just released something that is more controllable:
         | 
         | https://news.ycombinator.com/item?id=40736536
        
         | chaosprint wrote:
         | it's limited by the mechanism of diffusion.
        
         | TacticalCoder wrote:
         | > Wouldn't it be better to generate multiple tracks that can be
         | mixed / tweaked together, rather than a single track? That way
         | you can also keep the parts you like and continue iterating on
         | the parts you dislike.
         | 
         | Totally and that is 100% what is coming. For a great many
         | pictures too: why generate a picture full of lighting issues /
         | approximations when you'll soon be able to generate an entire
         | 3D scene and render it properly.
         | 
         | We've mastered 3D rendering and audio engineering.
         | 
         | I want the 3D models and the 3D scenes. I want the individual
         | tracks (and combine them in Dolby Atmos or whatever shall be
         | cool).
         | 
         | And that _is_ coming, no question about it.
        
         | the_other wrote:
         | > Wouldn't it be better to generate multiple tracks that can be
         | mixed / tweaked together, rather than a single track? That way
         | you can also keep the parts you like and continue iterating on
         | the parts you dislike.
         | 
         | That'd interest me (a musical hobbyist) more than the "whole
         | track" generators, for sure.
         | 
         | I imagine it's a harder task tho'. Presumably, if you give the
         | same source material (video, prompt) to the AI multiple times,
         | it will generate different pieces of music. So if you do a
         | series of prompts, each one specifying a different instrument
         | or group/bus, then you (or the AI) need to arrange for the
         | parts to blend correctly, follow the same cues and assemble to
         | a coherent arrangement. Is that one pass with multiple outputs,
         | or multiple passes/prompts with one output each?
         | 
         | I have got the impression (from casual reading) that the music
         | generators don't inherently "know" about different parts of a
         | piece of music. They just know about the final output.
        
       | crazygringo wrote:
       | Very very cool.
       | 
       | But I literally can't keep track anymore of which AI generative
       | combinations of modalities have been released.
       | 
       | Crazy how two years ago this would have blown my mind. Now it's
       | just, OK sure add it to the pile...
        
         | xwolfi wrote:
         | I still haven't spent a dollar on any of it...
        
           | TacticalCoder wrote:
           | > I still haven't spent a dollar on any of it...
           | 
           | Subscribed to GPT-4o (or whatever the paying one is called)
           | for translating / finding typos / summarizing / etc.
           | 
           | Zero brand love and I'll switch to something else (maybe some
           | future Claude model?) the second something better/faster
           | comes out.
        
           | crazygringo wrote:
           | Well OpenAI's annual revenue is more than $1.6 billion, so it
           | doesn't really matter if _you_ haven't.
           | 
           | Tons -- and I mean _tons_ -- of people have spent money on
           | it. Because it's worth it, it's generating actual economic
           | value for them.
        
             | lannisterstark wrote:
             | Yep - I use LibreChat (and other services) via OpenAI API,
             | and I save an incredible amount of time having it write
             | boilerplate code, verify stuff in code, double check it
             | after I've already reviewed something to see if I missed x
             | or y, ask questions based on it which I can't figure out to
             | get ideas etc etc.
             | 
             | It's also exceptional at making IEPs/Learning Plans for
             | certain things I'd like to learn for the week etc which I
             | am already somewhat familiar with. I use it as a rough
             | guide and it has worked well so far.
        
             | ilrwbwrkhv wrote:
             | Spammers need to spam. Of course it makes them money.
        
         | lemoncookiechip wrote:
         | Maybe this can help you keep track of stuff:
         | 
         | https://www.tools-ai.online/
         | 
         | https://docs.google.com/spreadsheets/d/1O5KVQW1Hx5ZAkcg8AIRj...
         | 
         | And here's some that I personally recommend and are "free" to
         | use:
         | 
         | TXT2VID / IMG2VID: https://lumalabs.ai/dream-machine
         | 
         | TXT2MUSIC: https://suno.com/
         | 
         | AI TXT2SPEECH: https://murf.ai/
         | 
         | PDF Summarize (you can just use 4o or Claude though):
         | https://askyourpdf.com/
         | 
         | AI ChatBot: https://janitorai.com/ https://www.chub.ai/
         | 
         | TXT2IMG / IMG2IMG: https://playground.com/
         | 
         | Obviously SD 1.5/SDXL/Pony
         | 
         | and so much more.
        
         | astennumero wrote:
         | I was just thinking the same. Can't believe I'm not excited.
        
       | nanovision wrote:
       | This is so cool.
        
       | peppertree wrote:
       | I wonder if this can be trained to do lip reading.
        
       | masto wrote:
       | I don't know if a computer can ever match the perfection of
       | "shreds" videos. (The drum example came close)
       | 
       | https://www.youtube.com/playlist?list=PLQvwVDViTLXu4usHto8PH...
        
       | squarefoot wrote:
       | As a wannabe drummer I can say the drumming example is quite bad
       | as the drummer doesn't seem to hit toms that often to produce tom
       | rolls, however the video is so heavily cropped that either I'm
       | wrong or the AI was deliberately fed with something difficult to
       | interpret.
        
       | animanoir wrote:
       | Boooring!
        
       ___________________________________________________________________
       (page generated 2024-06-21 23:02 UTC)