[HN Gopher] Researchers use AI to turn sound recordings into str...
       ___________________________________________________________________
        
       Researchers use AI to turn sound recordings into street images
        
       Author : giuliomagnifico
       Score  : 113 points
       Date   : 2024-12-01 16:40 UTC (7 days ago)
        
 (HTM) web link (news.utexas.edu)
 (TXT) w3m dump (news.utexas.edu)
        
       | ptx wrote:
       | This is not my area of expertise, but if I understand the article
       | correctly, they created a model that matches pre-existing audio
       | clips to pre-existing images. But instead of returning the
       | matching image, the LLM generates a distorted fake image which is
       | vaguely similar to the real image.
       | 
       | So it doesn't really, as the title claims, turn recordings into
       | images (it already has the images) and the distorted fake images
       | it creates are only "accurate" in that they broadly slot into the
       | right category in terms of urban/rural setting, amount of
       | greenery and amount of sky shown.
       | 
       | It sounds like the matching is the useful part and the
       | "generative" part is just a huge disadvantage. The paper doesn't
       | seem to say if the LLM is any better than other types of models
       | at the matching part.
        
         | jebarker wrote:
         | I think you are misunderstanding. I don't think the network
         | matches the audio to a ground truth image and generates an
         | image. It just takes in audio and predicts an image. They just
         | use the ground truth images for training the model and for
         | evaluation purposes.
         | 
         | The generated images are only vaguely similar in detail to the
         | originals, but the fact that they can estimate the macro
         | structure from audio alone is surprising. I wonder if there's
         | some kind of leakage between the training and test data, e.g.
         | sampling frames from the same videos, because the idea you
         | could get time of day right (dusk in a city) just from audio
         | seems improbable.
         | 
         | EDIT: also minor correction, it's not an LLM it's a diffusion
         | model. EDIT2: my mistake, there is an LLM too!
        
           | ptx wrote:
           | It certainly looks like some amount of image matching is
           | going on. Can the model really hear the white/green sign to
           | the left in the first example in figure 3? Can it hear the
           | green sign to the right and red things to the left in the
           | last example?
        
             | jebarker wrote:
             | That would be explained by data leakage too, e.g. sampling
             | frames in the train and test data from the same video
             | sequences. There's nothing in the writeup that suggests
             | the model is explicitly matching audio to ground truth
             | images.
        
               | notahacker wrote:
               | The researchers' suggestion that certain architectural
               | features might have been encoded in the sound [which is
               | at least superficially plausible] is rather undermined by
               | the data leakage in the model also leading it to generate
               | the right _colour_ signage in the right part of multiple
               | images. _The fidelity of the sound clearly isn't enough
               | for the model to register key aspects of the sign's
               | geometry like it only being a few feet from the observer,
               | but it has somehow managed to pick up that it's green and
               | x pixels from the left of the image..._
        
               | mewpmewp2 wrote:
               | I don't know if data leakage is the right word, but maybe
               | overfitting if they took a 1 hour clip from same place
               | and used 90 percent for training and 10 percent for
               | eval/test?
               | 
               | It is still a decent way to start I think, but it needs to
               | get more varied data after that and use different
               | geographical locations for eval and test.
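
A minimal sketch of the location-held-out split suggested above, so
that no recording site contributes clips to both training and
evaluation. The function name, the (clip_id, location_id) layout and
the 20% default are illustrative assumptions; nothing here is taken
from the paper's actual pipeline.

```python
import random
from collections import defaultdict


def split_by_location(clips, test_fraction=0.2, seed=0):
    """Split clips so that every location lands entirely in train or test.

    clips: list of (clip_id, location_id) pairs (hypothetical layout).
    Returns (train_ids, test_ids, held_out_locations).
    """
    # Group clip ids by the place they were recorded.
    by_location = defaultdict(list)
    for clip_id, location_id in clips:
        by_location[location_id].append(clip_id)

    # Shuffle whole locations, not individual clips.
    locations = sorted(by_location)
    random.Random(seed).shuffle(locations)

    # Hold out at least one location for testing.
    n_test = max(1, round(len(locations) * test_fraction))
    held_out = set(locations[:n_test])

    train = [c for loc in locations if loc not in held_out
             for c in by_location[loc]]
    test = [c for loc in locations if loc in held_out
            for c in by_location[loc]]
    return train, test, held_out


if __name__ == "__main__":
    demo = [(f"clip{l}_{i}", f"loc{l}") for l in range(5) for i in range(10)]
    train_ids, test_ids, held = split_by_location(demo)
    print(len(train_ids), len(test_ids), sorted(held))
```

Splitting a single long recording 90/10 by time would instead put
near-identical frames on both sides, which is the leakage concern
raised in this subthread.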
        
             | sdenton4 wrote:
             | Yeah, I also saw that sign and thought - 'yeah, this is
             | bullshit.' It's got exactly the same placement in the frame
             | - which would require some next-level beamforming
             | capability - and also has the same color, which is
             | impossible. There's some serious data leakage going on
             | here.
             | 
             | [edit] The bottom right image is even more suspect. There's
             | a vertical green sign in the same place on the right side
             | of the image, but also some curious red striping in the
             | distance in both images. One could argue 'street signs are
             | green' but the red striping seems pretty unique, and not
             | something where one would just guess the right color.
        
           | ptx wrote:
           | In response to the correction: The paper says that "we
           | propose a Soundscape-to-Image Diffusion model, a generative
           | Artificial Intelligence (AI) model supported by Large
           | Language Models (LLMs)" so there's an LLM involved somewhere
           | presumably?
        
             | jebarker wrote:
             | I'm sorry, you're correct, I missed that. I'll edit my
             | edit!
        
           | swid wrote:
           | I've heard clips of hot water being poured vs cold water, and
           | if you heard the examples, you would probably guess right
           | too.
           | 
           | Time of day seems almost easy. Are there animal noises?
           | Those won't sound the same all day. And traffic too. Even
           | things like the sound of wind may generally be different in
           | the morning vs night.
           | 
           | This is not to suggest the researchers are not leaking data
           | or cherry-picking examples; it seems probable they are doing
           | one or the other. But it is to say, if you were trained on a
           | particular intersection, and heard a sample from it, you
           | could probably train a model to predict time of day
           | reasonably well.
        
         | notum wrote:
         | Thank you. I was about to just write "confirmation bias" as a
         | comment.
        
       | lifeisstillgood wrote:
       | I am not by any stretch a mathematician but AI research like this
       | reminds me of things that excite mathematicians - it's like
       | people spent three hundred years playing with Prime numbers and
       | all of a sudden "oh yeah, silicon, fibre optics, ahah secure
       | encryption"
       | 
       | There are going to be real useful tools - but we need to play for
       | another century before we have that aha moment. Probably :-)
        
       | simonw wrote:
       | The word "accurate" in that headline is doing a LOT of work.
       | 
       | Here's how the results were scored:
       | 
       | "Computer evaluations compared the relative proportions of
       | greenery, building and sky between source and generated images,
       | whereas human judges were asked to correctly match one of three
       | generated images to an audio sample."
       | 
       | So this is very impressive and a cool piece of research, but
       | unsurprisingly not recreating the space "accurately" if you
       | assume that means anything more than "has the right amount of sky
       | and buildings and greenery".
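
A rough sketch of what a proportion-based score like the one quoted
above might look like, assuming each image has already been segmented
into per-pixel class labels. The label names, the nested-list layout
and the mean-absolute-difference metric are assumptions for
illustration; the thread does not say which segmentation model or
error measure the researchers used.

```python
CLASSES = ("greenery", "building", "sky")


def class_proportions(label_map, classes=CLASSES):
    """Fraction of pixels per class in a 2D list of class-name labels."""
    total = sum(len(row) for row in label_map)
    counts = {c: 0 for c in classes}
    for row in label_map:
        for label in row:
            if label in counts:
                counts[label] += 1
    return {c: counts[c] / total for c in classes}


def proportion_error(truth_map, generated_map, classes=CLASSES):
    """Mean absolute difference in class proportions (0.0 = identical mix)."""
    p = class_proportions(truth_map, classes)
    q = class_proportions(generated_map, classes)
    return sum(abs(p[c] - q[c]) for c in classes) / len(classes)


if __name__ == "__main__":
    # Same mix of sky/building/greenery, completely different layout.
    truth = [["sky", "sky"], ["building", "greenery"]]
    generated = [["greenery", "building"], ["sky", "sky"]]
    print(proportion_error(truth, generated))
```

In the demo the generated image is upside down relative to the ground
truth, yet the score is perfect, which is the limitation being pointed
out: a proportion match says nothing about where things are.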
        
         | aqme28 wrote:
         | You're correct, but I'm also curious how you could measure
         | accuracy here. There isn't any easy way that I can think of.
        
           | DougMerritt wrote:
           | That's not an unreasonable question, however the larger point
           | is that this sort of thing _cannot_ be done perfectly
           | accurately.
           | 
           | This was established mathematically, answering an old 1966
           | question from famous mathematician Mark Kac: "You can't hear
           | the shape of a drum" -- there isn't a unique answer even when
           | allowed to use arbitrary test sounds.
           | 
           | Wikipedia:
           | https://en.wikipedia.org/wiki/Hearing_the_shape_of_a_drum
           | 
           | Article in American Scientist 1996 Jan-Feb:
           | https://www2.math.upenn.edu/~kazdan/425S11/Drum-Gordon-Webb....
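
For readers unfamiliar with the drum problem, here is the usual
formulation, written as a sketch with notation chosen here (D for the
drumhead's planar region, u_n for its vibration modes, lambda_n for
the eigenvalues; the audible frequencies are proportional to the
square roots of the lambda_n). Kac asked whether the full list of
lambda_n determines D up to congruence; the linked article describes
pairs of different shapes that share the same list.

```latex
% Dirichlet eigenvalue problem for a drumhead occupying a planar domain D
\[
\begin{aligned}
  -\Delta u_n &= \lambda_n u_n && \text{in } D, \\
  u_n &= 0 && \text{on } \partial D, \\
  0 < \lambda_1 &\le \lambda_2 \le \lambda_3 \le \cdots
\end{aligned}
\]
% Weyl's law: the growth rate of the eigenvalue count reveals the area of D
\[
  N(\lambda) = \#\{\, n : \lambda_n \le \lambda \,\}
  \sim \frac{\operatorname{Area}(D)}{4\pi}\,\lambda
  \qquad (\lambda \to \infty)
\]
```

So some coarse properties, such as area, can be recovered from the
spectrum even though the exact shape in general cannot, which loosely
parallels the "right amount of sky and greenery" result discussed in
this thread.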
        
             | gyrovagueGeist wrote:
             | I love this paper, but something I think is often missed
             | when it comes up is that you CAN hear the shape of many
             | drums if you restrict the shape space, for example with a
             | prior of "what a drum should look like". Zelditch proved
             | spectral uniqueness for convex, fully connected, drums with
             | some symmetry.
        
               | DougMerritt wrote:
               | Aha, I had missed that! Thanks.
        
               | pfortuny wrote:
               | You are totally right but "convex" is a very strong
               | assumption (so strong that the shape is determined by its
               | harmonics). Very strong.
        
             | scarmig wrote:
             | If you add multiple hearing points, you massively constrain
             | the space of possible drums. The question then becomes
             | something like "can you see the shape of a drum?"
             | 
             | Proof of concept: echolocation.
        
           | BeefWellington wrote:
           | If you can't measure the results then the entire thing needs
           | a rethink.
        
         | delichon wrote:
         | That confuses accurate with detailed. It can be accurate even
         | if it only reports one bit, like greenery proportion "low" or
         | "high".
        
           | ghostly_s wrote:
           | But it's not reporting one bit, it's generating a detailed
           | (color!) photo. 100s of kilobits of made up garbage plus one
           | accurate bit cannot reasonably be described as an "accurate"
           | result.
        
         | conception wrote:
         | I wonder if there are some precise vs. accurate wording
         | shenanigans going on here...
        
         | dang wrote:
         | Ok, we've made the title less 'accurate' (and more accurate?)
         | above. Thanks!
        
       | lowercased wrote:
       | Am I missing something or is there no way to see those generated
       | images except in postage-stamp sizes?
       | 
       | tldr:
       | 
       | you can view the image directly at
       | https://news.utexas.edu/wp-content/uploads/2024/11/AI-street...
       | 
       | Still not overly useful.
        
       | amaurose wrote:
       | I'd be very interested in the reverse: A background sound
       | generator for still images. Would be nice to have for advanced
       | picture frames...
        
         | jebarker wrote:
         | I'd like this too, especially if it generated binaural audio.
        
       | galleywest200 wrote:
       | Some of these are not very accurate. That "country side" image
       | has the entirely wrong foliage color (fall colors vs. spring
       | colors). It also appears to place buildings when the "ground
       | truth" image is by a small stream.
       | 
       | I would not rely on this tool for any meaningful data collection.
        
         | DougMerritt wrote:
         | You really cannot expect audio processing to yield color
         | information.
         | 
         | Beyond that, you are correct that the 3D shapes themselves
         | cannot be derived perfectly accurately (see my other post)
        
       | joshdavham wrote:
       | This is interesting. Sorta reminds me of how bats use sonar for
       | their surroundings.
        
         | estebarb wrote:
         | Except that it is not. If you want to use this for tasks like
         | navigation then it is useless. Practically, it is the same as
         | using chatgpt to generate a realistic scenery based on a text
         | description.
        
       | IshKebab wrote:
       | Researchers use AI to turn sound recordings into _plausible_
       | street images.
        
       | amelius wrote:
       | You can train DL models on anything. If you get accurate results
       | then that is maybe publish-worthy.
       | 
       | In this particular case, it is not.
        
         | mewpmewp2 wrote:
         | How do you determine if something is publish worthy? If someone
         | puts in a lot of effort experimenting with something that
         | fails, it can still seem publish worthy so others can learn
         | what works and what doesn't. It should be more about level of
         | effort I think. Otherwise the incentives become all wrong too.
        
           | amelius wrote:
           | In this case the algorithm can determine broad classes like
           | "rural" or "city", and aside from those classes the generated
           | images have little connection with the audio. I think most DL
           | researchers would agree that this is low-effort stuff, and
           | therefore not publish-worthy. In addition to this the word
           | "accurate" in the title is misleading.
        
       | wigster wrote:
       | how on earth does the first example recreate the blue-white logo
       | on the building? b***t
        
       | dmje wrote:
       | Probably the thing you really want from an article with this
       | topic focus is to be able to see the images bigger than a postage
       | stamp size. And, even more irritating - the images are actually
       | there, in reasonable size, just not linked...
       | 
       | https://news.utexas.edu/wp-content/uploads/2024/11/AI-street...
       | https://news.utexas.edu/wp-content/uploads/2024/11/AI-street...
        
       | harrall wrote:
       | I think this is cool but it's more of a statistical correlation
       | than an AI-related paper.
       | 
       | What I'm saying is that if you were to replace 'AI' with "ask
       | humans to draw an image based on these sounds," you'll probably
       | get somewhat similar results.
       | 
       | Which is still interesting either way.
        
       ___________________________________________________________________
       (page generated 2024-12-08 23:00 UTC)