[HN Gopher] Visual Reasoning Is Coming Soon
___________________________________________________________________
Visual Reasoning Is Coming Soon
Author : softwaredoug
Score : 92 points
Date : 2025-04-09 15:58 UTC (7 hours ago)
(HTM) web link (arcturus-labs.com)
(TXT) w3m dump (arcturus-labs.com)
| thierrydamiba wrote:
| Excellent write up.
|
| The example you used to demonstrate is well done.
|
| Such a simple way to describe the current issues.
| nkingsy wrote:
| The example of the cat and detective hat shows that even with the
| latest update, it isn't "editing" the image. The generated cat is
| younger, with bigger, brighter eyes, more "perfect" ears.
|
| I found that when editing images of myself, the result looked
| weird, like a funky version of me. For the cat, it looks "more
| attractive" I guess, but for humans (and I'd imagine for a cat
| looking at the edited cat with a keen eye for cat faces), the
| features often don't work together when changed slightly.
| porphyra wrote:
| ChatGPT 4o's advanced image generation seems to have a low-
| resolution autoregressive part that generates tokens directly,
| and an image upscaling decoding step that turns the (perhaps
| 100 px wide) token-image into the actual 1024 px wide final
| result. The former step is able to almost nail things
| perfectly, but the latter step will always change things
| slightly. That's why it is so good at, say, generating large
| text but still struggles with fine text, and will always
| introduce subtle variations when you ask it to edit an existing
| image.
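|
| A rough sketch of the two-stage pipeline I'm imagining (all
| names hypothetical; this is speculation about 4o's internals,
| not a real API):
|
|   def generate_image(prompt, autoregressive_model, upscaler):
|       # Stage 1 (hypothetical): autoregressively generate a
|       # coarse grid of image tokens (~100 px equivalent),
|       # conditioned on the prompt. This is the step that
|       # nails layout and large text almost perfectly.
|       coarse_tokens = autoregressive_model.generate(prompt)
|       # Stage 2 (hypothetical): a decoder upscales the token
|       # grid to the final 1024 px image. It must invent fine
|       # detail the tokens don't encode, which is where the
|       # subtle variations creep in on every edit.
|       return upscaler.decode(coarse_tokens,
|                              target_size=(1024, 1024))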
| BriggyDwiggs42 wrote:
| Has anyone tried putting in a model that selects the editing
| region prior to the process? Training data would probably be
| hard, but maybe existing image recognition tech that draws
| rectangles would be a start.
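|
| Something close to this can be sketched today with detector-
| guided inpainting: take a box from an off-the-shelf detector,
| mask everything else, and let an inpainting model repaint only
| the box. A minimal sketch (the checkpoint is just an example,
| and the box is hardcoded where the detector would plug in):
|
|   from PIL import Image, ImageDraw
|   from diffusers import StableDiffusionInpaintPipeline
|
|   def edit_region(image, box, prompt, pipe):
|       # White inside the box, black elsewhere: the model is
|       # only allowed to repaint the white region.
|       mask = Image.new("L", image.size, 0)
|       ImageDraw.Draw(mask).rectangle(box, fill=255)
|       # Pixels outside the mask pass through unchanged.
|       return pipe(prompt=prompt, image=image,
|                   mask_image=mask).images[0]
|
|   pipe = StableDiffusionInpaintPipeline.from_pretrained(
|       "runwayml/stable-diffusion-inpainting")
|   cat = Image.open("cat.png").convert("RGB").resize((512, 512))
|   out = edit_region(cat, (140, 40, 380, 200),
|                     "a cat wearing a detective hat and monocle")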
| CSMastermind wrote:
| What's interesting to me is how many of these advancements are
| just obvious next steps for these tools. Chain of thought, tree
| of thought, mixture of experts etc. are things you'd come up with
| in the first 10 minutes of thinking about improving LLMs.
|
| Of course the devil's always in the details and there have been
| real non-obvious advancements at the same time.
| AIPedant wrote:
| This seems to ignore the mixed record of video generation models:
| > For visual reasoning practice, we can do supervised fine-tuning
| on sequences similar to the marble example above. For instance,
| to understand more about the physical world, we can show the
| model sequential pictures of Slinkys going down stairs, or
| basketball players shooting 3-pointers, or people hammering
| birdhouses together.... But where will we get all
| this training data? For spatial and physical reasoning tasks, we
| can leverage computer graphics to generate synthetic data. This
| approach is particularly valuable because simulations provide a
| controlled environment where we can create scenarios with known
| outcomes, making it easy to verify the model's predictions. But
| we'll also need real-world examples. Fortunately, there's an
| abundance of video content online that we can tap into. While
| initial datasets might require human annotation, soon models
| themselves will be able to process videos and their transcripts
| to extract training examples automatically.
|
| Almost every video generator makes constant "folk physics" errors
| and doesn't understand object permanence. DeepMind's Veo2 is very
| impressive but still struggles with object permanence and
| qualitatively nonsensical physics:
| https://x.com/Norod78/status/1894438169061269750
|
| Humans do not learn these things by pure observation (newborns
| understand object permanence; I suspect this is the case for all
| vertebrates). I doubt transformers are capable of learning it as
| robustly, even if trained on all of YouTube. There will always be
| "out of distribution" physical nonsense involving mistakes humans
| (or lizards) would never make, even if they've never seen the
| specific objects.
| throwanem wrote:
| > newborns understand object permanence
|
| Is that why the peekaboo game is funny for babies? The violated
| expectation at the soul of the comedy?
| broof wrote:
| Yeah, I had thought that newborns famously _didn't_ understand
| object permanence and that it developed sometime during their
| first year. And that was why peekaboo is fun: you're
| essentially popping in and out of existence.
| AIPedant wrote:
| This is a case where early 20th century psychology is
| wrong, yet still propagates as false folk knowledge:
|
| https://en.wikipedia.org/wiki/Object_permanence#Contradicti
| n...
| smusamashah wrote:
| My kid used to drop things and look down at them from his
| chair. I didn't understand what he was trying to do. I
| learned that it was his way of trying to understand how the
| world works: if he dropped something, would it remain
| there or disappear?
|
| That contradicts this contradiction, unless there is
| another explanation.
| simplify wrote:
| A child doesn't always have a mental "why" behind their
| actions. Sometimes kids just behave in coarse, playful
| ways, and those ways happen to be very useful for mental
| development.
| sepositus wrote:
| My (much) older kid still does ridiculous things that
| defy reason (and usually end up with something broken). I
| don't think it's fair to say that every action they take
| has some deeper meaning behind it.
|
| "Why were you throwing the baseball at the running fan?"
| "I don't know...I was bored."
| crispycas12 wrote:
| That's odd. This was content that was still on the MCAT
| when I took it last year. I even remember keeping the
| formation of object permanence occurring at ~0-2 years of
| age on my flashcards.
| throwanem wrote:
| Have you checked lately, though?
| andoando wrote:
| Babies pretty much laugh if you're laughing and being silly.
| Tostino wrote:
| Can confirm. Is quite fun.
| throwanem wrote:
| All passed me by, sad to say, hence the guessing. For
| what I had to start with I've done pretty well, and I
| think no one ever really sees their each and every last
| hope come true. Maybe next time.
| dinfinity wrote:
| You provide no actual arguments as to why LLMs are
| fundamentally unable to learn this. Your doubt is as valuable
| as my confidence.
| AIPedant wrote:
| Well, it's a good thing I didn't say "fundamentally unable to
| learn this"!
|
| I said that learning visual reasoning from video is probably
| not enough: if you claim it is enough, you have to reconcile
| that with failures in Sora, Veo 2, etc. Veo 2's problems are
| especially serious since it was trained on an all-DeepMind-
| can-eat diet of YouTube videos. It seems like they need a
| stronger algorithm, not more Red Dead Redemption 2 footage.
| dinfinity wrote:
| > I said that learning visual reasoning from video is
| probably not enough
|
| Fair enough; you did indeed say that.
|
| > if you claim it is enough, you have to reconcile that
| with failures in Sora, Veo 2, etc.
|
| This is flawed reasoning, though. The current state of
| video-generating AI and the completeness of the training
| set do not reliably prove that the network used to
| perform the generation is incapable of physical modeling
| and/or object permanence. Those things are ultimately (the
| modeling of) relations between past and present tokens, so
| the transformer architecture does fit.
|
| It might just be a matter of compute/network size (modeling
| four dimensional physical relations in high resolution is
| pretty hard, yo). If you look at the scaling results from
| the early Sora blogs, the natural increase of physical
| accuracy with more compute is visible:
| https://openai.com/index/video-generation-models-as-world-
| si...
|
| It also might be a matter of fine-tuning training on (and
| optimizing for) four dimensional/physical accuracy rather
| than on "does this generated frame look like the actual
| frame?"
| zveyaeyv3sfye wrote:
| > This is flawed reasoning, though.
|
| Says you, then continues to lay out a series of dreamy
| speculations, glowing with AI fairy magic.
| viccis wrote:
| Because the nature of their operation (learning a probability
| distribution over a corpus of observed data) is not the same
| as creating synthetic a priori knowledge (object permanence
| is a case of cause and effect, which is synthetic a priori
| knowledge). All LLM knowledge is by definition a posteriori.
| nonameiguess wrote:
| "All of YouTube" brings the same problem as training on all of
| the text on the Internet. Much of that text is not factual,
| which is why RLHF and various other fine-tuning efforts need to
| happen in addition to just reading all the text on the
| Internet. Not all videos on YouTube are unedited footage of
| the real world faithfully reproducing the same physics you'd
| get by watching the real world instead of YouTube.
|
| As for object permanence, I don't know jack about animal
| cognitive development, but it seems important that all animals
| are themselves also objects. Whether or not they can see at
| all, they can feel their bodies and sense in some way or other
| its relation to the larger world of other objects. They know
| they don't blink in and out of existence or teleport, which
| seems like it would create a strong bias toward believing
| nothing else can do that, either. The same holds true with
| physics. As physical objects existing in the physical world, we
| are ourselves subject to physics and learn a model that is
| largely correct within the realm of energy densities and speeds
| we can directly experience. If we had to learn physics entirely
| from watching videos, I'm afraid Roadrunner cartoons and Fast
| and the Furious movies would muddy the waters a bit.
| porphyra wrote:
| I think that one reason that humans are so good at understanding
| images is that our eyes see video rather than still images. Video
| lets us see "cause and effect" by seeing what happens after
| something. It also allows us to grasp the 3D structure of things
| since we will almost always see everything from multiple angles.
| So long as we just feed a big bunch of stills into training
| these models, they will struggle to understand how things
| affect one another.
| throwanem wrote:
| I have some bad news for you about how every digital video
| you've ever seen in your life is encoded.
| KTibow wrote:
| I've seen some speculate that o3 is already using visual
| reasoning and that's what made it a breakthrough model.
| District5524 wrote:
| The first caption of the cat picture may be a bit misleading
| for those who are not sure how this works: "The best a
| traditional LLM can do when asked to give it a detective hat
| and monocle." The role of the traditional LLM in creating a
| picture is quite minimal (if an LLM is used at all); it might
| just tweak the prompt a bit for the diffusion model. It was
| definitely not the LLM that created the picture:
| https://platform.openai.com/docs/guides/image-generation
| 4o image generation is surely a bit different, but I don't
| really have more precise technical information (there must
| indeed be a specialized transformer model used, linking
| tokens to pixels:
| https://openai.com/index/introducing-4o-image-generation/)
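|
| In other words, the traditional flow is roughly this (a
| hypothetical sketch of the division of labor, not OpenAI's
| actual code):
|
|   def generate_picture(user_request, llm, diffusion_model):
|       # The LLM's whole contribution: rewrite the request
|       # into a detailed prompt for the image model.
|       image_prompt = llm.complete(
|           "Rewrite as a detailed image prompt: " + user_request)
|       # The diffusion model does all the actual picture-
|       # making; the LLM never touches pixels or image tokens.
|       return diffusion_model.generate(image_prompt)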
| uaas wrote:
| > Rather watch than read? Hey, I get it - sometimes you just want
| to kick back and watch! Check out this quick video where I walk
| through everything in this post
|
| Hm, no, I've never had this thought.
| rel_ic wrote:
| The inconsistency of an optimistic blog post ending with a
| picture of a terminator robot makes me think this author isn't
| taking themself seriously enough. Or - the author is the
| terminator robot?
| Tiberium wrote:
| It's sad that they used 4o's image generation feature for the
| cat example, which does some diffusion or something else and
| results in the whole image changing. They should've instead
| used Gemini 2.0 Flash's image generation feature (or at least
| mentioned it!), which, even if far lower in quality and
| resolution (max of 1024x1024, but Gemini will try to match the
| resolution of the original image, so you can get something
| like 681x1024), is much, much better at leaving the untouched
| parts of the image actually "untouched".
|
| Here's the best out of a few attempts for a really similar
| prompt, more detailed since Flash is a much smaller model:
| "Give the cat a detective hat and a monocle over his right
| eye, properly integrate them into the photo." You can see how the
| rest of the image is practically untouched to the naked human
| eye: https://ibb.co/zVgDbqV3
|
| Honestly Google has been really good at catching up in the LLM
| race, and their modern models like 2.0 Flash and 2.5 Pro are
| among the best (or the best) in their respective areas. I
| hope that they'll
| scale up their image generation feature to base it on 2.5 Pro (or
| maybe 3 Pro by the time they do it) for higher quality and prompt
| adherence.
|
| If you want, you can give 2.0 Flash image gen a try for free
| (with generous limits) on
| https://aistudio.google.com/prompts/new_chat, just select it in
| the model selector on the right.
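|
| For the API route, the same edit looks roughly like this with
| the google-genai Python SDK (model name and config are from
| the 2.0 Flash experimental release and may have changed, so
| treat this as a sketch):
|
|   from google import genai
|   from google.genai import types
|   from PIL import Image
|
|   client = genai.Client(api_key="YOUR_API_KEY")
|   response = client.models.generate_content(
|       model="gemini-2.0-flash-exp",
|       contents=[
|           "Give the cat a detective hat and a monocle over "
|           "his right eye, properly integrate them into the "
|           "photo.",
|           Image.open("cat.png"),
|       ],
|       # Ask for an image back, not just text.
|       config=types.GenerateContentConfig(
|           response_modalities=["TEXT", "IMAGE"]),
|   )
|   for part in response.candidates[0].content.parts:
|       if part.inline_data is not None:  # the edited image
|           with open("cat_detective.png", "wb") as f:
|               f.write(part.inline_data.data)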
| blixt wrote:
| I'm not sure I see the behavior in the Gemini 2.0 Flash model's
| image output as a strength. It seems to me it has multiple
| output modes, one indeed being masked edits. But it also seems
| to have convolutional matrix edits (e.g. "make this image
| grayscale" looks practically like it's applying a Photoshop
| filter) and true latent space edits ("show me this scene 1
| minute later" or "move the camera so it is above this scene,
| pointing down"). And it almost seems to me these are actually
| distinct modes, which seems like it's been a bit too hand-
| engineered.
|
| On the other hand, OpenAI's model, while it does seem to have
| some upscaling magic happening (which makes the outputs look a
| lot nicer than the ones from Gemini FWIW), also seems to
| perform all its edits entirely in latent space (hence it's easy
| to see things degrade at a conceptual level such as texture,
| rotation, position, etc.) But this is a sign that its latent
| space mode is solid enough to always use, while with Gemini 2.0
| Flash I get the feeling that when it is used, it's just not
| performing as well.
___________________________________________________________________
(page generated 2025-04-09 23:01 UTC)