[HN Gopher] Researchers use AI to turn sound recordings into str...
___________________________________________________________________
Researchers use AI to turn sound recordings into street images
Author : giuliomagnifico
Score : 113 points
Date : 2024-12-01 16:40 UTC (7 days ago)
(HTM) web link (news.utexas.edu)
(TXT) w3m dump (news.utexas.edu)
| ptx wrote:
| This is not my area of expertise, but if I understand the article
| correctly, they created a model that matches pre-existing audio
| clips to pre-existing images. But instead of returning the
| matching image, the LLM generates a distorted fake image which is
| vaguely similar to the real image.
|
| So it doesn't really, as the title claims, turn recordings into
| images (it already has the images) and the distorted fake images
| it creates are only "accurate" in that they broadly slot into the
| right category in terms of urban/rural setting, amount of
| greenery and amount of sky shown.
|
| It sounds like the matching is the useful part and the
| "generative" part is just a huge disadvantage. The paper doesn't
| seem to say if the LLM is any better than other types of models
| at the matching part.
| jebarker wrote:
| I think you are misunderstanding. I don't think the network
| matches the audio to a ground truth image and generates an
| image. It just takes in audio and predicts an image. They just
| use the ground truth images for training the model and for
| evaluation purposes.
|
| The generated images are only vaguely similar in detail to the
| originals, but the fact that they can estimate the macro
| structure from audio alone is surprising. I wonder if there's
| some kind of leakage between the training and test data, e.g.
| sampling frames from the same videos, because the idea that you
| could get the time of day right (dusk in a city) just from audio
| seems improbable.
|
| EDIT: also a minor correction: it's not an LLM, it's a diffusion
| model. EDIT2: my mistake, there is an LLM too!
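|
| For anyone unfamiliar with the setup, here's a very rough
| sketch of what audio-conditioned diffusion sampling looks like
| in general. This is just the generic DDPM recipe with toy
| stand-in networks, my guess at the shape of the approach, not
| the paper's actual architecture:
|
|     import torch
|
|     # toy stand-ins; the real system would use a trained audio
|     # encoder and a trained U-Net noise predictor
|     audio_encoder = lambda a: a.mean(dim=-1, keepdim=True)
|     denoiser = lambda x, t, cond: torch.zeros_like(x)
|
|     def sample_image_from_audio(audio, steps=50,
|                                 shape=(1, 3, 64, 64)):
|         betas = torch.linspace(1e-4, 0.02, steps)
|         alphas = 1.0 - betas
|         alpha_bars = torch.cumprod(alphas, dim=0)
|         cond = audio_encoder(audio)     # audio -> embedding
|         x = torch.randn(shape)          # start from pure noise
|         for t in reversed(range(steps)):
|             # predict the noise, conditioned on the audio
|             eps = denoiser(x, torch.tensor([t]), cond)
|             # standard DDPM posterior mean
|             x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t])
|                  * eps) / torch.sqrt(alphas[t])
|             if t > 0:   # re-inject noise except at the last step
|                 x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
|         return x  # the "image" predicted from audio alone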
| ptx wrote:
| It certainly looks like some amount of image matching is
| going on. Can the model really hear the white/green sign to
| the left in the first example in figure 3? Can it hear the
| green sign to the right and red things to the left in the
| last example?
| jebarker wrote:
| That would be explained by data leakage too, e.g. sampling
| frames in the train and test data from the same video
| sequences. There's nothing in the writeup that suggests the
| model is explicitly matching audio to ground truth images.
| notahacker wrote:
| The researchers' suggestion that certain architectural
| features might have been encoded in the sound [which is
| at least superficially plausible] is rather undermined by
| the data leakage in the model also leading it to generate
| the right _colour_ signage in the right part of multiple
| images. _The fidelity of the sound clearly isn't enough
| for the model to register key aspects of the sign's
| geometry like it only being a few feet from the observer,
| but it has somehow managed to pick up that it's green and
| x pixels from the left of the image..._
| mewpmewp2 wrote:
| I don't know if data leakage is the right word, but maybe
| overfitting if they took a 1-hour clip from the same place
| and used 90 percent for training and 10 percent for
| eval/test?
|
| It is still a decent way to start, I think, but it needs to
| get more varied data after that and use different
| geographical locations for eval and test.
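|
| One way to do that is a group-aware split, so that frames from
| the same source video (or the same location) never end up on
| both sides. A minimal sketch with scikit-learn; the IDs here
| are made up for illustration:
|
|     import numpy as np
|     from sklearn.model_selection import GroupShuffleSplit
|
|     n_pairs = 1000                  # (audio, image) pairs
|     pair_idx = np.arange(n_pairs)
|     video_ids = pair_idx // 50      # pretend ~50 frames per video
|
|     splitter = GroupShuffleSplit(n_splits=1, test_size=0.1,
|                                  random_state=0)
|     train_idx, test_idx = next(
|         splitter.split(pair_idx, groups=video_ids))
|
|     # no video contributes frames to both sides, so the model
|     # can't score well just by recognising a street it already
|     # saw during training
|     assert set(video_ids[train_idx]).isdisjoint(
|         set(video_ids[test_idx]))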
| sdenton4 wrote:
| Yeah, I also saw that sign and thought - 'yeah, this is
| bullshit.' It's got exactly the same placement in the frame
| - which would require some next-level beamforming
| capability - and also has the same color, which is
| impossible. There's some serious data leakage going on
| here.
|
| [edit] The bottom right image is even more suspect. There's
| a vertical green sign in the same place on the right side
| of the image, but also some curious red striping in the
| distance in both images. One could argue 'street signs are
| green' but the red striping seems pretty unique, and not
| something where one would just guess the right color.
| ptx wrote:
| In response to the correction: The paper says that "we
| propose a Soundscape-to-Image Diffusion model, a generative
| Artificial Intelligence (AI) model supported by Large
| Language Models (LLMs)" so there's an LLM involved somewhere
| presumably?
| jebarker wrote:
| I'm sorry, you're correct, I missed that. I'll edit my
| edit!
| swid wrote:
| I've heard clips of hot water being poured vs cold water, and
| if you heard the examples, you would probably guess right
| too.
|
| Time of day seems almost easy. Are there animal noises?
| Those won't sound the same all day. And traffic too. Even
| things like the sound of wind may generally be different in
| the morning vs night.
|
| This is not to suggest the researchers are not leaking data,
| or that the examples were cherry-picked; it seems probable
| they are doing one or the other. But it is to say that if a
| model were trained on a particular intersection and heard a
| sample from it, it could probably predict the time of day
| reasonably well.
| notum wrote:
| Thank you. I was about to just write "confirmation bias" as a
| comment.
| lifeisstillgood wrote:
| I am not by any stretch a mathematician but AI research like this
| reminds me of things that excite mathematicians - it's like
| people spent three hundred years playing with prime numbers and
| all of a sudden "oh yeah, silicon, fibre optics, ahah secure
| encryption"
|
| There are going to be real useful tools - but we need to play for
| another century before we have that aha moment. Probably :-)
| simonw wrote:
| The word "accurate" in that headline is doing a LOT of work.
|
| Here's how the results were scored:
|
| "Computer evaluations compared the relative proportions of
| greenery, building and sky between source and generated images,
| whereas human judges were asked to correctly match one of three
| generated images to an audio sample."
|
| So this is very impressive and a cool piece of research, but
| unsurprisingly not recreating the space "accurately" if you
| assume that means anything more than "has the right amount of sky
| and buildings and greenery".
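|
| To make that concrete, the "computer evaluation" amounts to
| comparing class proportions from a semantic segmentation of
| the real and generated images. A toy sketch of my reading of
| that description (not the paper's actual code):
|
|     import numpy as np
|
|     CLASSES = ("greenery", "building", "sky")
|
|     def class_proportions(seg):
|         # seg: 2-D array of per-pixel class-name strings
|         return {c: float(np.mean(seg == c)) for c in CLASSES}
|
|     def proportion_gap(real_seg, generated_seg):
|         # mean absolute difference in greenery/building/sky share
|         real = class_proportions(real_seg)
|         fake = class_proportions(generated_seg)
|         return sum(abs(real[c] - fake[c])
|                    for c in CLASSES) / len(CLASSES)
|
| And the human test is a one-of-three forced choice, so chance
| performance is already about 33%.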
| aqme28 wrote:
| You're correct, but I'm also curious how you could measure
| accuracy here. There isn't any easy way that I can think of.
| DougMerritt wrote:
| That's not an unreasonable question; however, the larger point
| is that this sort of thing _cannot_ be done perfectly
| accurately.
|
| This was established mathematically, answering the 1966
| question posed by the mathematician Mark Kac ("Can one hear
| the shape of a drum?") in the negative -- there isn't a unique
| answer even when you're allowed to use arbitrary test sounds.
|
| Wikipedia:
| https://en.wikipedia.org/wiki/Hearing_the_shape_of_a_drum
|
| Article in American Scientist 1996 Jan-Feb:
| https://www2.math.upenn.edu/~kazdan/425S11/Drum-Gordon-
| Webb....
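|
| For reference, the precise setup (as I understand it): the
| "sound" of a drum with shape Omega is the set of eigenvalues
| lambda_n of the Dirichlet Laplacian,
|
|     -\Delta u = \lambda u  in Omega,    u = 0 on the boundary,
|
| and Gordon, Webb and Wolpert (1992) exhibited two non-congruent
| planar regions with exactly the same list of eigenvalues, so
| the spectrum does not determine the shape.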
| gyrovagueGeist wrote:
| I love this paper, but something I think is often missed
| when it comes up is that you CAN hear the shape of many
| drums if you restrict the shape space, for example with a
| prior of "what a drum should look like": Zelditch proved
| spectral uniqueness for convex, fully connected drums with
| some symmetry.
| DougMerritt wrote:
| Aha, I had missed that! Thanks.
| pfortuny wrote:
| You are totally right but "convex" is a very strong
| assumption (so strong that the shape is determined by its
| harmonics). Very strong.
| scarmig wrote:
| If you add multiple hearing points, you massively constrain
| the space of possible drums. The question then becomes
| something like "can you see the shape of a drum?"
|
| Proof of concept: echolocation.
| BeefWellington wrote:
| If you can't measure the results then the entire thing needs
| a rethink.
| delichon wrote:
| That confuses accurate with detailed. It can be accurate even
| if it only reports one bit, like greenery proportion "low" or
| "high".
| ghostly_s wrote:
| But it's not reporting one bit, it's generating a detailed
| (color!) photo. 100s of kilobits of made up garbage plus one
| accurate bit cannot reasonably be described as an "accurate"
| result.
| conception wrote:
| I wonder if there are some precise-vs-accurate wording
| shenanigans going on here...
| dang wrote:
| Ok, we've made the title less 'accurate' (and more accurate?)
| above. Thanks!
| lowercased wrote:
| Am I missing something or is there no way to see those generated
| images except in postage-stamp sizes?
|
| tldr:
|
| you can view the image directly at https://news.utexas.edu/wp-
| content/uploads/2024/11/AI-street...
|
| Still not overly useful.
| amaurose wrote:
| I'd be very interested in the reverse: A background sound
| generator for still images. Would be nice to have for advanced
| picture frames...
| jebarker wrote:
| I'd like this too, especially if it generated binaural audio.
| galleywest200 wrote:
| Some of these are not very accurate. That "countryside" image
| has the entirely wrong foliage color (fall colors vs. spring
| colors). It also appears to place buildings where the "ground
| truth" image shows a small stream.
|
| I would not rely on this tool for any meaningful data collection.
| DougMerritt wrote:
| You really cannot expect audio processing to yield color
| information.
|
| Beyond that, you are correct that the 3D shapes themselves
| cannot be derived perfectly accurately (see my other post).
| joshdavham wrote:
| This is interesting. Sorta reminds me of how bats use sonar to
| sense their surroundings.
| estebarb wrote:
| Except that it is not. If you want to use this for tasks like
| navigation, then it is useless. Practically, it is the same as
| using ChatGPT to generate realistic scenery based on a text
| description.
| IshKebab wrote:
| Researchers use AI to turn sound recordings into _plausible_
| street images.
| amelius wrote:
| You can train DL models on anything. If you get accurate results
| then that is maybe publish-worthy.
|
| In this particular case, it is not.
| mewpmewp2 wrote:
| How do you determine if something is publish-worthy? If someone
| puts in a lot of effort experimenting with something that
| fails, it can still seem publish-worthy so others can learn
| what works and what doesn't. It should be more about the level
| of effort, I think. Otherwise the incentives become all wrong too.
| amelius wrote:
| In this case the algorithm can determine broad classes like
| "rural" or "city", and aside from those classes the generated
| images have little connection with the audio. I think most DL
| researchers would agree that this is low-effort stuff, and
| therefore not publish-worthy. In addition to this the word
| "accurate" in the title is misleading.
| wigster wrote:
| How on earth does the first example recreate the blue-white logo
| on the building? b***t
| dmje wrote:
| Probably the thing you really want from an article with this
| topic focus is to be able to see the images at larger than
| postage-stamp size. And, even more irritating - the images are
| actually there, at a reasonable size, just not linked...
|
| https://news.utexas.edu/wp-content/uploads/2024/11/AI-street...
| https://news.utexas.edu/wp-content/uploads/2024/11/AI-street...
| harrall wrote:
| I think this is cool but it's more of a statistical correlation
| than an AI-related paper.
|
| What I'm saying is that if you were to replace 'AI' with "ask
| humans to draw an image based on these sounds," you'd probably
| get somewhat similar results.
|
| Which is still interesting either way.
___________________________________________________________________
(page generated 2024-12-08 23:00 UTC)