[HN Gopher] Visual Reasoning Is Coming Soon
       ___________________________________________________________________
        
       Visual Reasoning Is Coming Soon
        
       Author : softwaredoug
       Score  : 92 points
       Date   : 2025-04-09 15:58 UTC (7 hours ago)
        
 (HTM) web link (arcturus-labs.com)
 (TXT) w3m dump (arcturus-labs.com)
        
       | thierrydamiba wrote:
        | Excellent write-up.
       | 
       | The example you used to demonstrate is well done.
       | 
       | Such a simple way to describe the current issues.
        
       | nkingsy wrote:
       | The example of the cat and detective hat shows that even with the
       | latest update, it isn't "editing" the image. The generated cat is
       | younger, with bigger, brighter eyes, more "perfect" ears.
       | 
       | I found that when editing images of myself, the result looked
       | weird, like a funky version of me. For the cat, it looks "more
       | attractive" I guess, but for humans (and I'd imagine for a cat
       | looking at the edited cat with a keen eye for cat faces), the
       | features often don't work together when changed slightly.
        
         | porphyra wrote:
          | ChatGPT 4o's advanced image generation seems to have a low-
         | resolution autoregressive part that generates tokens directly,
         | and an image upscaling decoding step that turns the (perhaps
         | 100 px wide) token-image into the actual 1024 px wide final
         | result. The former step is able to almost nail things
         | perfectly, but the latter step will always change things
         | slightly. That's why it is so good at, say, generating large
         | text but still struggles with fine text, and will always
         | introduce subtle variations when you ask it to edit an existing
         | image.
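          | 
          | A toy sketch of that hypothesis (purely illustrative, not
          | OpenAI's actual code; both stages are stand-ins):
          | 
          |     import numpy as np
          |     rng = np.random.default_rng(0)
          | 
          |     def autoregressive_stage(prompt, size=100):
          |         # stand-in for the token-by-token generator:
          |         # returns a coarse "token image"
          |         return rng.random((size, size))
          | 
          |     def upscale_decoder(tokens, out=1024):
          |         # nearest-neighbour upscale plus decoder noise;
          |         # fine detail is re-invented rather than copied,
          |         # which is why edits always drift slightly
          |         idx = np.arange(out) * tokens.shape[0] // out
          |         coarse = tokens[np.ix_(idx, idx)]
          |         noise = 0.02 * rng.standard_normal((out, out))
          |         return coarse + noise
          | 
          |     img = upscale_decoder(
          |         autoregressive_stage("cat in a detective hat"))
          |     print(img.shape)  # (1024, 1024)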
        
           | BriggyDwiggs42 wrote:
            | Has anyone tried adding a model that selects the editing
            | region before generation? Training data would probably be
            | hard to come by, but maybe existing image recognition tech
            | that draws bounding rectangles would be a start.
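            | 
            | Something like the usual crop-edit-paste recipe, sketched
            | below with stand-ins for the detector and the generator
            | (hypothetical, not any particular product's pipeline):
            | 
            |     import numpy as np
            | 
            |     def detect_region(img):
            |         # stand-in for an off-the-shelf detector that
            |         # returns a box (y0, y1, x0, x1) to edit
            |         h, w, _ = img.shape
            |         return 0, h // 3, w // 4, 3 * w // 4
            | 
            |     def generate_patch(patch, prompt):
            |         # stand-in for the image generator, asked to
            |         # redraw only the cropped region
            |         return patch.copy()
            | 
            |     def masked_edit(img, prompt):
            |         y0, y1, x0, x1 = detect_region(img)
            |         out = img.copy()
            |         out[y0:y1, x0:x1] = generate_patch(
            |             img[y0:y1, x0:x1], prompt)
            |         return out  # pixels outside the box untouched
            | 
            |     cat = np.zeros((1024, 768, 3), dtype=np.uint8)
            |     edited = masked_edit(cat, "add a detective hat")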
        
       | CSMastermind wrote:
       | What's interesting to me is how many of these advancements are
       | just obvious next steps for these tools. Chain of thought, tree
        | of thought, mixture of experts, etc. are things you'd come up with
       | in the first 10 minutes of thinking about improving LLMs.
       | 
       | Of course the devil's always in the details and there have been
       | real non-obvious advancements at the same time.
        
       | AIPedant wrote:
       | This seems to ignore the mixed record of video generation models:
       | For visual reasoning practice, we can do supervised fine-tuning
       | on sequences similar to the marble example above. For instance,
       | to understand more about the physical world, we can show the
       | model sequential pictures of Slinkys going down stairs, or
       | basketball players shooting 3-pointers, or people hammering
        | birdhouses together....
        | 
        | But where will we get all
       | this training data? For spatial and physical reasoning tasks, we
       | can leverage computer graphics to generate synthetic data. This
       | approach is particularly valuable because simulations provide a
       | controlled environment where we can create scenarios with known
       | outcomes, making it easy to verify the model's predictions. But
       | we'll also need real-world examples. Fortunately, there's an
       | abundance of video content online that we can tap into. While
       | initial datasets might require human annotation, soon models
       | themselves will be able to process videos and their transcripts
       | to extract training examples automatically.
       | 
       | Almost every video generator makes constant "folk physics" errors
        | and doesn't understand object permanence. DeepMind's Veo 2 is very
       | impressive but still struggles with object permanence and
       | qualitatively nonsensical physics:
       | https://x.com/Norod78/status/1894438169061269750
       | 
       | Humans do not learn these things by pure observation (newborns
        | understand object permanence; I suspect this is the case for all
       | vertebrates). I doubt transformers are capable of learning it as
       | robustly, even if trained on all of YouTube. There will always be
       | "out of distribution" physical nonsense involving mistakes humans
       | (or lizards) would never make, even if they've never seen the
       | specific objects.
        
         | throwanem wrote:
         | > newborns understand object permanence
         | 
         | Is that why the peekaboo game is funny for babies? The violated
         | expectation at the soul of the comedy?
        
           | broof wrote:
            | Yeah, I had thought that newborns famously _didn't_ understand
            | object permanence and that it developed sometime during their
            | first year. And that was why peekaboo is fun: you're
            | essentially popping in and out of existence.
        
             | AIPedant wrote:
             | This is a case where early 20th century psychology is
             | wrong, yet still propagates as false folk knowledge:
             | 
             | https://en.wikipedia.org/wiki/Object_permanence#Contradicti
             | n...
        
               | smusamashah wrote:
                | My kid used to drop things and look down at them from
                | his chair. I didn't understand what he was trying to
                | do. I learned that it was his way of trying to
                | understand how the world works: whether, if he dropped
                | something, it would remain there or disappear.
               | 
               | That contradicts this contradiction, unless there is
               | another explanation.
        
               | simplify wrote:
                | A child doesn't always have a mental "why" behind their
                | actions. Sometimes kids just behave in coarse, playful
               | ways, and those ways happen to be very useful for mental
               | development.
        
               | sepositus wrote:
               | My (much) older kid still does ridiculous things that
               | defy reason (and usually end up with something broken). I
               | don't think it's fair to say that every action they take
               | has some deeper meaning behind it.
               | 
               | "Why were you throwing the baseball at the running fan?"
               | "I don't know...I was bored."
        
               | crispycas12 wrote:
               | That's odd. This was content that was still on the MCAT
                | when I took it last year. I even remember keeping the
                | formation of object permanence (occurring at ~0-2 years
                | of age) on my flashcards.
        
               | throwanem wrote:
               | Have you checked lately, though?
        
             | andoando wrote:
              | Babies pretty much laugh if you're laughing and being silly.
        
               | Tostino wrote:
               | Can confirm. Is quite fun.
        
               | throwanem wrote:
               | All passed me by, sad to say, hence the guessing. For
               | what I had to start with I've done pretty well, and I
               | think no one ever really sees their each and every last
               | hope come true. Maybe next time.
        
         | dinfinity wrote:
         | You provide no actual arguments as to why LLMs are
         | fundamentally unable to learn this. Your doubt is as valuable
         | as my confidence.
        
           | AIPedant wrote:
           | Well, it's a good thing I didn't say "fundamentally unable to
           | learn this"!
           | 
           | I said that learning visual reasoning from video is probably
           | not enough: if you claim it is enough, you have to reconcile
           | that with failures in Sora, Veo 2, etc. Veo 2's problems are
           | especially serious since it was trained on an all-DeepMind-
           | can-eat diet of YouTube videos. It seems like they need a
           | stronger algorithm, not more Red Dead Redemption 2 footage.
        
             | dinfinity wrote:
             | > I said that learning visual reasoning from video is
             | probably not enough
             | 
             | Fair enough; you did indeed say that.
             | 
             | > if you claim it is enough, you have to reconcile that
             | with failures in Sora, Veo 2, etc.
             | 
              | This is flawed reasoning, though. The current state of
              | video-generating AI and the completeness of the training
              | set do not reliably prove that the network used to
              | perform the generation is incapable of physical modeling
             | and/or object permanence. Those things are ultimately (the
             | modeling of) relations between past and present tokens, so
             | the transformer architecture does fit.
             | 
             | It might just be a matter of compute/network size (modeling
              | four-dimensional physical relations in high resolution is
             | pretty hard, yo). If you look at the scaling results from
             | the early Sora blogs, the natural increase of physical
             | accuracy with more compute is visible:
             | https://openai.com/index/video-generation-models-as-world-
             | si...
             | 
             | It also might be a matter of fine-tuning training on (and
              | optimizing for) four-dimensional/physical accuracy rather
             | than on "does this generated frame look like the actual
             | frame?"
        
               | zveyaeyv3sfye wrote:
               | > This is flawed reasoning, though.
               | 
               | Says you, then continues to lay out a series of dreamy
               | speculations, glowing with AI fairy magic.
        
           | viccis wrote:
           | Because the nature of their operation (learning a probability
           | distribution over a corpus of observed data) is not the same
           | as creating synthetic a priori knowledge (object permanence
            | is a case of cause and effect, which is synthetic a priori
           | knowledge). All LLM knowledge is by definition a posteriori.
        
         | nonameiguess wrote:
         | "All of YouTube" brings the same problem as training on all of
         | the text on the Internet. Much of that text is not factual,
         | which is why RLHF and various other fine-tuning efforts need to
         | happen in addition to just reading all the text on the
         | Internet. All videos on YouTube are not unedited footage of the
         | real world faithfully reproducing the same physics you'd get by
         | watching the real world instead of YouTube.
         | 
         | As for object permanence, I don't know jack about animal
         | cognitive development, but it seems important that all animals
         | are themselves also objects. Whether or not they can see at
         | all, they can feel their bodies and sense in some way or other
          | their relation to the larger world of other objects. They know
         | they don't blink in and out of existence or teleport, which
         | seems like it would create a strong bias toward believing
         | nothing else can do that, either. The same holds true with
         | physics. As physical objects existing in the physical world, we
         | are ourselves subject to physics and learn a model that is
         | largely correct within the realm of energy densities and speeds
         | we can directly experience. If we had to learn physics entirely
         | from watching videos, I'm afraid Roadrunner cartoons and Fast
         | and the Furious movies would muddy the waters a bit.
        
       | porphyra wrote:
       | I think that one reason that humans are so good at understanding
       | images is that our eyes see video rather than still images. Video
       | lets us see "cause and effect" by seeing what happens after
       | something. It also allows us to grasp the 3D structure of things
       | since we will almost always see everything from multiple angles.
       | So long as we just feed a big bunch of stills into training these
        | models, they will struggle to understand how things affect one
       | another.
        
         | throwanem wrote:
         | I have some bad news for you about how every digital video
         | you've ever seen in your life is encoded.
        
       | KTibow wrote:
       | I've seen some speculate that o3 is already using visual
       | reasoning and that's what made it a breakthrough model.
        
       | District5524 wrote:
       | The first caption of the cat picture may be a bit misleading for
       | those who are not sure of how this works: "The best a traditional
       | LLM can do when asked to give it a detective hat and monocle."
        | The role of the traditional LLM in creating a picture is quite
        | minimal (if an LLM is used at all); it might just tweak the
        | prompt a bit for the diffusion model. It was definitely not the
        | LLM that created the picture:
        | https://platform.openai.com/docs/guides/image-generation
        | 
        | 4o image generation is surely a bit different, but I don't
        | really have more precise technical information (there must
        | indeed be a specialized transformer model linking tokens to
        | pixels: https://openai.com/index/introducing-4o-image-
        | generation/)
        
       | uaas wrote:
       | > Rather watch than read? Hey, I get it - sometimes you just want
       | to kick back and watch! Check out this quick video where I walk
       | through everything in this post
       | 
       | Hm, no, I've never had this thought.
        
       | rel_ic wrote:
       | The inconsistency of an optimistic blog post ending with a
       | picture of a terminator robot makes me think this author isn't
       | taking themself seriously enough. Or - the author is the
       | terminator robot?
        
       | Tiberium wrote:
        | It's sad that they used 4o's image generation feature for the
        | cat example, which does some diffusion or something else and
        | results in the whole image changing. They should've instead used
        | Gemini 2.0 Flash's image generation feature (or at least
        | mentioned it!), which, even if far lower in quality and
        | resolution (max of 1024x1024, but Gemini will try to match the
        | resolution of the original image, so you can get something like
        | 681x1024), is much, much better at leaving the untouched parts
        | of the image actually "untouched".
       | 
       | Here's the best out of a few attempts for a really similar
        | prompt, made more detailed since Flash is a much smaller model:
        | "Give the cat a detective hat and a monocle over his right eye,
       | properly integrate them into the photo.". You can see how the
       | rest of the image is practically untouched to the naked human
       | eye: https://ibb.co/zVgDbqV3
       | 
       | Honestly Google has been really good at catching up in the LLM
        | race, and their modern models like 2.0 Flash and 2.5 Pro are
        | among the best (or the best) in their respective areas. I hope
        | that they'll
       | scale up their image generation feature to base it on 2.5 Pro (or
       | maybe 3 Pro by the time they do it) for higher quality and prompt
       | adherence.
       | 
       | If you want, you can give 2.0 Flash image gen a try for free
       | (with generous limits) on
       | https://aistudio.google.com/prompts/new_chat, just select it in
       | the model selector on the right.
        
         | blixt wrote:
         | I'm not sure I see the behavior in the Gemini 2.0 Flash model's
         | image output as a strength. It seems to me it has multiple
         | output modes, one indeed being masked edits. But it also seems
         | to have convolutional matrix edits (e.g. "make this image
         | grayscale" looks practically like it's applying a Photoshop
         | filter) and true latent space edits ("show me this scene 1
         | minute later" or "move the camera so it is above this scene,
         | pointing down"). And it almost seems to me these are actually
          | distinct modes, which seems like it's been a bit too
          | hand-engineered.
         | 
         | On the other hand, OpenAI's model, while it does seem to have
         | some upscaling magic happening (which makes the outputs look a
         | lot nicer than the ones from Gemini FWIW), also seems to
         | perform all its edits entirely in latent space (hence it's easy
         | to see things degrade at a conceptual level such as texture,
          | rotation, position, etc.). But this is a sign that its latent
         | space mode is solid enough to always use, while with Gemini 2.0
          | Flash I get the feeling that, when it is used, it's just not
          | performing as well.
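          | 
          | For concreteness, the "filter-like" mode would just be a
          | deterministic per-pixel operation, something like this
          | (illustrative only, not what Gemini actually does):
          | 
          |     import numpy as np
          | 
          |     def to_grayscale(img):
          |         # standard luma weights; each output pixel is a
          |         # fixed function of its input pixel, so nothing
          |         # else in the image can drift
          |         w = np.array([0.299, 0.587, 0.114])
          |         return (img.astype(float) @ w).astype(img.dtype)
          | 
          | whereas a latent-space edit re-encodes and regenerates the
          | whole image, which is where the subtle changes creep in.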
        
       ___________________________________________________________________
       (page generated 2025-04-09 23:01 UTC)