[HN Gopher] AudioX: Diffusion Transformer for Anything-to-Audio ...
___________________________________________________________________
AudioX: Diffusion Transformer for Anything-to-Audio Generation
Author : gnabgib
Score : 92 points
Date : 2025-04-14 17:35 UTC (5 hours ago)
(HTM) web link (zeyuet.github.io)
(TXT) w3m dump (zeyuet.github.io)
| Fauntleroy wrote:
| The video-to-audio examples are really impressive! The video
| featuring the band showcases some of the obvious shortcomings of
| this method (humans have very precise expectations about the
| kinds of sounds five trombones will make), but the tennis example
| shows its strengths (decent timing of hit sounds, eerily accurate
| acoustics for the large indoor space). I'm very excited to see
| how this improves a few more papers down the line!
| pcthrowaway wrote:
| There were a lot of shortcomings.
|
| - The woman playing what I think was an Erhu [1] seemed to be
| imitating traditional music for that instrument, but really badly
| (it sounded much more like a human voice than the actual
| instrument does). I'm also not sure whether it could tell which
| instrument it was, or whether it was picking up on other cues
| from the video (which could be problematic, e.g. if it profiles
| people based on their race and attire).
|
| - Most of the audio was noticeably delayed relative to the visual
| cues; not sure why.
|
| - The nature sounds were pretty muddy.
|
| - (I realize this one is from the video-to-music task, but) the
| video with pumping upbeat music set to the text "Maddox White
| witnessed his father getting butchered by the Capo of the Italian
| mob" was almost comically out of touch with the source.
|
| Nevertheless, it's an interesting demo and highlights more
| applications for AI in which I expect we'll see massive
| improvements over the next few years. So despite the
| shortcomings, I agree it's still quite impressive.
|
| [1] https://en.wikipedia.org/wiki/Erhu
| oezi wrote:
| Audio, but not Speech, right?
| gigel82 wrote:
| That "pseudo-human laughter" gave me some real chills; didn't
| realize uncanny valley for audio is a real thing but damn...
| BizarroLand wrote:
| Sometimes when I lie awake at night I wonder what it is about
| things that are "almost human" that terrifies so many of us so
| deeply.
|
| It's like the eye-shaped markings on the backs of tigers' heads
| that deter predators from attacking. I'm sure there was once
| something tigers benefited from having this defense against,
| enough for it to survive encoding into their DNA, right?
|
| So, what was it that encoded this fear response into us?
| causality0 wrote:
| Other hominids as well as visibly diseased humans.
| observationist wrote:
| Dead things, and behaviors that don't align with our predictive
| models, shift the context to one of threat: if something shaped
| like something you understand starts behaving in a way you no
| longer understand, you become progressively more concerned. If a
| pencil started rolling around aggressively chasing you, it'd
| evoke fear, even though you'd probably be able to defend yourself
| fairly capably.
|
| If enough predictive models are broken, people feel like they've
| gone crazy - various drugs and experiments demonstrate a lot of
| these factors.
|
| The interesting thing about the uncanny valley is that the
| stimuli sit right on a threshold, and humans are really good at
| picking up tiny violations of those expectations, which
| translates to unease or fear.
| darkwater wrote:
| The toilet-flushing one is full of weird, unrelated noises.
|
| The tennis video, as others have commented, is good, but there is
| a noticeable delay between the action and the sound. And as for
| the "loving couple holding AI hands and then dancing", well, the
| input is already cringe enough.
|
| For all these diffusion models, it looks like we're 90% there;
| now we just need the final 90%.
| kristopolous wrote:
| Really, the next big leap is something that gives me more
| meaningful artistic control over these systems.
|
| It's usually "generate a few; one of them is not terrible, but
| none are exactly what I wanted", then modify the prompt, wait an
| hour or so...
|
| The workflow reminds me of programming 30 years ago - you did
| something, waited for the compile, saw if it worked, tried
| something else...
|
| All you've got are a few crude tools and a bit of grit and
| patience.
|
| On the i2v tools I've found that if I modify the input to make
| the contrast sharper, the shapes more discrete, and the objects
| easier to segment, then I get better results (a sketch of that
| kind of preprocessing follows below). I wonder if there are
| hacks like that here.
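|
| A minimal sketch of that kind of input preprocessing, using
| Pillow. The helper name and enhancement factors here are my own
| assumptions, and whether this actually helps will depend on the
| particular i2v model:
|
|   # Boost contrast and sharpen edges before handing a still
|   # image to an image-to-video model; factors are guesses to tune.
|   from PIL import Image, ImageEnhance, ImageFilter
|
|   def preprocess_for_i2v(in_path, out_path):
|       img = Image.open(in_path).convert("RGB")
|       # Raise contrast so shapes separate from the background.
|       img = ImageEnhance.Contrast(img).enhance(1.4)
|       # Unsharp mask makes object boundaries easier to segment.
|       img = img.filter(ImageFilter.UnsharpMask(radius=2,
|                                                percent=150))
|       img.save(out_path)
|
|   preprocess_for_i2v("input.png", "input_preprocessed.png")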
| vunderba wrote:
| > The workflow reminds me of programming 30 years ago - you did
| something, then waited for the compile, see if it worked, tried
| something else...
|
| Well sure... if your compiler was the equivalent of the
| Infinite Improbability Drive.
|
| I assume you're referring to the classic positive/negative
| prompts that you had to attach to older SD 1.5 workflows. From
| the examples in the repo as well as the paper, it seems like
| AudioX was trained to accept relatively natural English, using
| Qwen2.
___________________________________________________________________
(page generated 2025-04-14 23:00 UTC)