[HN Gopher] AudioX: Diffusion Transformer for Anything-to-Audio ...
       ___________________________________________________________________
        
       AudioX: Diffusion Transformer for Anything-to-Audio Generation
        
       Author : gnabgib
       Score  : 92 points
       Date   : 2025-04-14 17:35 UTC (5 hours ago)
        
 (HTM) web link (zeyuet.github.io)
 (TXT) w3m dump (zeyuet.github.io)
        
       | Fauntleroy wrote:
        | The video-to-audio examples are really impressive! The video
       | featuring the band showcases some of the obvious shortcomings of
       | this method (humans will have very precise expectations about the
       | kinds of sounds 5 trombones will make)--but the tennis example
       | shows its strengths (decent timing of hit sounds, eerily accurate
       | acoustics for the large internal space). I'm very excited to see
       | how this improves a few more papers down the line!
        
         | pcthrowaway wrote:
         | There were a lot of shortcomings.
         | 
          | - In the clip of the woman playing what I think was an
          | erhu[1], the generated audio seemed to imitate traditional
          | music for that instrument, but really badly (it sounded much
          | more like a human voice than the actual instrument does).
          | Also, I'm not even sure whether it could tell which instrument
          | it was, or whether it was picking up on other cues from the
          | video (which could be problematic, e.g. if it profiles people
          | based on their race and attire)
         | 
          | - Most of the sound lagged noticeably behind the visual cues.
          | Not sure why
         | 
         | - The nature sounds were pretty muddy
         | 
         | - (I realize this is from video to music, but) the video with
         | pumping upbeat music set to the text "Maddox White witnessed
         | his father getting butchered by the Capo of the Italian mob"
         | was almost comically out of touch with the source
         | 
          | Nevertheless, it's an interesting demo, and it highlights more
          | applications for AI that I expect will see massive
          | improvements over the next few years! So despite the
          | shortcomings I agree it's still quite impressive.
         | 
         | [1] https://en.wikipedia.org/wiki/Erhu
        
       | oezi wrote:
       | Audio, but not Speech, right?
        
       | gigel82 wrote:
       | That "pseudo-human laughter" gave me some real chills; didn't
       | realize uncanny valley for audio is a real thing but damn...
        
         | BizarroLand wrote:
         | Sometimes when I lie awake at night I wonder what it is about
         | things that are "almost human" that terrifies so many of us so
         | deeply.
         | 
          | It's like the markings on the backs of tigers' heads that
          | simulate eyes to deter predators from attacking them. I'm
          | sure there was once some threat this defense protected tigers
          | from, often enough for it to survive encoding into their DNA,
          | right?
         | 
         | So, what was it that encoded this fear response into us?
        
           | causality0 wrote:
           | Other hominids as well as visibly diseased humans.
        
           | observationist wrote:
            | Dead things, and behaviors that don't align with our
            | predictive models, shift the context to one of threat - if
            | something shaped like a thing you understand starts behaving
            | in a way you no longer understand, you'll become
            | progressively more concerned. If a pencil started rolling
            | around aggressively chasing you, it'd evoke fear, even
            | though you'd probably be able to defend yourself fairly
            | capably.
           | 
           | If enough predictive models are broken, people feel like
           | they've gone crazy - various drugs and experiments
           | demonstrate a lot of these factors.
           | 
            | The interesting thing about the uncanny valley is that the
            | stimuli sit right at a threshold, and humans are really good
            | at picking up tiny violations of those expectations, which
            | translates to unease or fear.
        
       | darkwater wrote:
       | The toilet flushing one is full of weird, unrelated noises.
       | 
        | The tennis video, as others commented, is good, but there is a
        | noticeable delay between the action and the sound. And the
        | "loving couple holding AI hands and then dancing" one - well,
        | the input is already cringe enough.
       | 
        | With all these diffusion models, it looks like we're 90% of the
        | way there; now we just need the final 90%.
        
       | kristopolous wrote:
        | Really, the next big leap is something that gives me more
        | meaningful artistic control over these systems.
       | 
        | It's usually "generate a few; one of them is not terrible, none
        | are exactly what I wanted", then modify the prompt, wait an
        | hour or so...
       | 
        | The workflow reminds me of programming 30 years ago - you did
        | something, waited for the compile, saw if it worked, tried
        | something else...
       | 
       | All you've got are a few crude tools and a bit of grit and
       | patience.
       | 
        | On the i2v tools, I've found that if I modify the input to make
        | the contrast sharper, the shapes more discrete, and the objects
        | easier to segment, then I get better results. I wonder if there
        | are hacks like that here.
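        | 
        | For illustration, here's the kind of input pass I mean - a
        | minimal sketch using Pillow, where the function name and the
        | enhancement values are placeholders I made up, not anything
        | from AudioX or any specific i2v tool:
        | 
        |     # Sketch: raise contrast and edge sharpness before feeding
        |     # an image to an i2v model. All values are illustrative.
        |     from PIL import Image, ImageEnhance, ImageFilter
        | 
        |     def preprocess_for_i2v(path, out_path):
        |         img = Image.open(path).convert("RGB")
        |         # Higher contrast separates shapes from the background.
        |         img = ImageEnhance.Contrast(img).enhance(1.4)
        |         # An unsharp mask makes object edges more discrete and
        |         # easier to segment.
        |         img = img.filter(ImageFilter.UnsharpMask(radius=2,
        |                                                  percent=120))
        |         img.save(out_path)
        | 
        |     preprocess_for_i2v("input.png", "input_sharpened.png")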
        
         | vunderba wrote:
          | > The workflow reminds me of programming 30 years ago - you
          | did something, waited for the compile, saw if it worked, tried
          | something else...
         | 
         | Well sure... if your compiler was the equivalent of the
         | Infinite Improbability Drive.
         | 
         | I assume you're referring to the classic positive/negative
         | prompts that you had to attach to older SD 1.5 workflows. From
         | the examples in the repo as well as the paper, it seems like
          | AudioX was trained to accept relatively natural English using
          | Qwen2.
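          | 
          | For contrast, roughly (both prompt styles below are
          | illustrative strings I made up, not taken from the paper or
          | the repo):
          | 
          |     # Old SD 1.5-style tag prompting, with paired
          |     # positive/negative lists:
          |     positive = "brass band, concert hall, upbeat, high quality"
          |     negative = "muffled, distorted, off-key, low quality"
          | 
          |     # Natural-language prompting of the kind the AudioX demos
          |     # appear to use:
          |     prompt = "A brass band plays an upbeat march in a hall."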
        
       ___________________________________________________________________
       (page generated 2025-04-14 23:00 UTC)