[HN Gopher] Image Diffusion Models Exhibit Emergent Temporal Pro...
       ___________________________________________________________________
        
       Image Diffusion Models Exhibit Emergent Temporal Propagation in
       Videos
        
       Author : 50kIters
       Score  : 110 points
       Date   : 2025-11-26 07:55 UTC (15 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | onesandofgrain wrote:
       | Can someone smarter than me explain what this is about?
        
         | Kalabint wrote:
         | > Can someone smarter than me explain what this is about?
         | 
         | I think you can find the answer under point 3:
         | 
         | > In this work, our primary goal is to show that pretrained
         | text-to-image diffusion models can be repurposed as object
         | trackers without task-specific finetuning.
         | 
         | Meaning that you can track Objects in Videos without using
         | specialised ML Models for Video Object Tracking.
        
           | echelon wrote:
            | All of these emergent properties of image and video models
            | lead me to believe that the evolution of animal intelligence
            | around motility and visual understanding of the physical
            | environment might be "easy" relative to other "hard steps".
           | 
            | The more complex an eye gets, the more the brain evolves
            | around not just the physics and chemistry of optics, but
            | also rich feature sets: predator/prey labels, tracking,
            | movement, self-localization, distance, etc.
           | 
           | These might not be separate things. These things might just
           | come "for free".
        
             | fxtentacle wrote:
             | I wouldn't call these properties "emergent".
             | 
             | If you train a system to memorize A-B pairs and then you
             | normally use it to find B when given A, then it's not
             | surprising that finding A when given B also works, because
             | you trained it in an almost symmetrical fashion on A-B
             | pairs, which are, obviously, also B-A pairs.
        
             | jacquesm wrote:
             | There is a massive amount of pre-processing already done in
             | the retina itself and in the LGN:
             | 
             | https://en.wikipedia.org/wiki/Lateral_geniculate_nucleus
             | 
              | So the brain does not necessarily receive 'raw' images to
              | process to begin with; a lot of high-level data, such as
              | optical flow for detecting moving objects, has already
              | been extracted at that point.
        
               | Mkengin wrote:
               | Interesting. So similar to the vision encoder + projector
               | in VLMs?
        
               | DrierCycle wrote:
                | And the occipital cortex is developed around
                | extraordinary levels of image separation, broken down
                | into tiny areas of the input, scattered and woven for
                | details of motion, gradient, contrast, etc.
        
         | magicalhippo wrote:
          | Skimming the paper, here's my take.
         | 
          | Someone previously found that the cross-attention layers in
          | text-to-image diffusion models capture the correlation between
          | the input text tokens and the corresponding image regions, so
          | one can use this to segment the image, picking out the pixels
          | containing "cat", for example. However, this segmentation was
          | rather coarse. The authors of this paper found that also using
          | the self-attention layers leads to a much more detailed
          | segmentation.
         | 
         | They then extend this to video by using the self-attention
         | between two consecutive frames to determine how the
         | segmentation changes from one frame to the next.
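          | 
          | Very roughly, and with names that are mine rather than the
          | paper's, that propagation step could be sketched like this,
          | using per-pixel diffusion features as a stand-in for the
          | model's self-attention keys/queries:
          | 
          |     import torch
          |     import torch.nn.functional as F
          |     
          |     def propagate_mask(feat_prev, feat_next, mask_prev, tau=0.07):
          |         """feat_*: (C, H, W), mask_prev: (H, W) in [0, 1]."""
          |         C, H, W = feat_prev.shape
          |         fp = F.normalize(feat_prev.reshape(C, -1), dim=0)  # (C, HW)
          |         fn = F.normalize(feat_next.reshape(C, -1), dim=0)  # (C, HW)
          |         affinity = (fn.T @ fp) / tau                   # (HW, HW)
          |         weights = affinity.softmax(dim=-1)             # rows sum to 1
          |         mask_next = weights @ mask_prev.reshape(-1, 1) # label propagation
          |         return mask_next.reshape(H, W)
          |     
          |     # toy usage with random tensors standing in for real features
          |     f0, f1 = torch.randn(64, 32, 32), torch.randn(64, 32, 32)
          |     m0 = (torch.rand(32, 32) > 0.5).float()
          |     m1 = propagate_mask(f0, f1, m0)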
         | 
         | Now, text-to-image diffusion models require a text input to
         | generate the image to begin with. From what I can gather they
         | limit themselves to semi-supervised video segmentation, so that
          | the first frame is already segmented by, say, a human or some
          | other process.
         | 
          | They then run an "inversion" procedure which tries to generate
         | text that causes the text-to-image diffusion model to segment
         | the first frame as closely as possible to the provided
         | segmentation.
         | 
         | With the text in hand, they can then run the earlier
         | segmentation propagation steps to track the segmented object
         | throughout the video.
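          | 
          | The inversion step, as I understand it, could be sketched with
          | a toy stand-in for the frozen diffusion model (my own
          | illustration, not the paper's code): optimise a free "text"
          | embedding so that the mask it yields on the first frame
          | matches the provided segmentation.
          | 
          |     import torch
          |     import torch.nn.functional as F
          |     
          |     feat0 = torch.randn(64, 32, 32)              # first-frame features (stand-in)
          |     target = (torch.rand(32, 32) > 0.5).float()  # provided first-frame mask
          |     emb = torch.zeros(64, requires_grad=True)    # learnable pseudo-token
          |     
          |     def mask_logits(emb, feat):
          |         # cross-attention-style response of each pixel to the token
          |         return torch.einsum('c,chw->hw', emb, feat)
          |     
          |     opt = torch.optim.Adam([emb], lr=1e-2)
          |     for _ in range(200):
          |         opt.zero_grad()
          |         loss = F.binary_cross_entropy_with_logits(
          |             mask_logits(emb, feat0), target)
          |         loss.backward()
          |         opt.step()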
         | 
         | The key here is that the text-to-image diffusion model is
         | pretrained, and not fine-tuned for this task.
         | 
         | That said, I'm no expert.
        
           | jacquesm wrote:
           | For a 'not an expert' explanation you did a better job than
           | the original paper.
        
           | nicolailolansen wrote:
           | Bravo!
        
       | ttul wrote:
       | This is a cool result. Deep learning image models are trained on
       | enormous amounts of data and the information recorded in their
       | weights continues to astonish me. Over in the Stable Diffusion
       | space, hobbyists (as opposed to professional researchers) are
       | continuing to find new ways to squeeze intelligence out of models
       | that were trained in 2022 and are considerably out of date
       | compared with the latest "flow matching" models like Qwen Image
       | and Flux.
       | 
       | Makes you wonder what intelligence is lurking in a 10T parameter
       | model like Gemini 3 that we may not discover for some years
       | yet...
        
         | smerrill25 wrote:
          | Hey, do you know how you found out about this information? I
         | would be super curious to keep track of current ad-hoc ways of
         | pushing older models to do cooler things. LMK
        
           | ttul wrote:
           | 1) Reading papers. 2) Reading "Deep Learning: Foundations and
           | Concepts". 3) Taking Jeremy Howard's Fast.ai course
        
       | tpoacher wrote:
        | If the authors are reading: I notice you used a "Soft IoU" for
       | validation.
       | 
        | A large part of my 2017 PhD thesis [0] is dedicated to exploring
       | the formulation and utility of soft validation operators,
       | including this soft IoU, and the extent to which they are
       | "better" / "more reliable" than thresholding (whether this occurs
        | in isolation, or even when marginalised out, as with the AUC).
       | Long story short, soft operators are at least an order of
       | magnitude more reliable than their thresholding counterparts [1],
       | despite the fact that thresholding still seems to be the
       | industry/academia standard. This is the case for any set-
       | operation-based operator, such as the Dice coefficient (a.k.a.
       | F1-score), not just for the IoU. Recently, influential groups
        | have proposed the Matthews correlation coefficient as a "better
        | operator", but still treat it in binary / thresholding terms,
        | which means it's still unreliable by an order of magnitude. I
       | suspect this insight goes beyond images (e.g. the F1-score is
       | often used in ML problems more generally, in situations where
       | probabilistic outputs are thresholded to compare against binary
       | ground truth labels), but I haven't tested that hypothesis
       | explicitly beyond the image domain (yet).
       | 
        | In this work you effectively used the "Goedel" (i.e. min/max)
       | fuzzy operator to define fuzzy intersection and union, for the
       | purposes of using it in an IoU operator. There are other fuzzy
       | norms with interesting properties that you can also explore.
        | Other classical ones include product and Lukasiewicz. I show in
       | [0] and [1] that these have "best case scenario sub-pixel
       | overlap", "average case" and "worst-case scenario" underlying
       | semantics. (In other words, min/max should not be a random choice
       | of T-norm, but a conscious choice which should match your
       | problem, and what the operator is intended to validate
        | specifically). In my own work, I then proceeded to show that if
        | you take gradient direction at the boundary into account, you can
       | come up with a fuzzy intersection/union pair which has
       | directional semantics, and is even more reliable an operator when
       | used to define a soft IoU.
       | 
       | Having said that, in your case you're comparing against a binary
       | ground truth. This collapses all the different T-norms to the
       | same value. I wonder if this is the reason you chose a binary
       | ground truth. If yes, you might want to consider my work, and use
       | original 'soft' ground truths instead, for higher reliability, as
        | well as the ability to define intersection semantics.
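        | 
        | To make the T-norm point concrete, here is a quick toy
        | illustration (my own numbers, not from the thesis): soft IoU
        | under three classical T-norm / T-conorm pairs, and the collapse
        | to a single value once the ground truth is binary.
        | 
        |     import numpy as np
        |     
        |     tnorms = {
        |         'goedel':      (np.minimum, np.maximum),
        |         'product':     (lambda a, b: a * b, lambda a, b: a + b - a * b),
        |         'lukasiewicz': (lambda a, b: np.maximum(0, a + b - 1),
        |                         lambda a, b: np.minimum(1, a + b)),
        |     }
        |     
        |     def soft_iou(pred, gt, tnorm, tconorm):
        |         return tnorm(pred, gt).sum() / tconorm(pred, gt).sum()
        |     
        |     rng = np.random.default_rng(0)
        |     pred = rng.random((64, 64))             # soft prediction in [0, 1]
        |     gt_soft = rng.random((64, 64))          # soft ground truth
        |     gt_bin = (gt_soft > 0.5).astype(float)  # binarised ground truth
        |     
        |     for name, (T, S) in tnorms.items():
        |         print(name,
        |               soft_iou(pred, gt_soft, T, S),  # differs per T-norm
        |               soft_iou(pred, gt_bin, T, S))   # identical for all three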
       | 
       | I hope the above is of interest / use to you :) (and, if you were
       | to decide to cite my work, it wouldn't be the eeeeeend of the
       | world, I gueeeeesss xD )
       | 
       | [0]
       | https://ora.ox.ac.uk/objects/uuid:dc352697-c804-4257-8aec-08...
       | 
       | [1]
       | https://repository.essex.ac.uk/24856/1/Papastylianou.etal201...
        
       ___________________________________________________________________
       (page generated 2025-11-26 23:00 UTC)