[HN Gopher] Image Diffusion Models Exhibit Emergent Temporal Pro...
___________________________________________________________________
Image Diffusion Models Exhibit Emergent Temporal Propagation in
Videos
Author : 50kIters
Score : 110 points
Date : 2025-11-26 07:55 UTC (15 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| onesandofgrain wrote:
| Can someone smarter than me explain what this is about?
| Kalabint wrote:
| > Can someone smarter than me explain what this is about?
|
| I think you can find the answer under point 3:
|
| > In this work, our primary goal is to show that pretrained
| text-to-image diffusion models can be repurposed as object
| trackers without task-specific finetuning.
|
| Meaning that you can track objects in videos without using
| specialised ML models for video object tracking.
| echelon wrote:
| All of these emergent properties of image and video models
| lead me to believe that the evolution of animal intelligence
| around motility and visually understanding the physical
| environment might be "easy" relative to other "hard steps".
|
| The more complex an eye gets, the more the brain evolves not
| just the physics and chemistry of optics, but also rich
| feature sets for predator/prey labels, tracking, movement,
| self-localization, distance, etc.
|
| These might not be separate things. These things might just
| come "for free".
| fxtentacle wrote:
| I wouldn't call these properties "emergent".
|
| If you train a system to memorize A-B pairs and then you
| normally use it to find B when given A, then it's not
| surprising that finding A when given B also works, because
| you trained it in an almost symmetrical fashion on A-B
| pairs, which are, obviously, also B-A pairs.
| jacquesm wrote:
| There is a massive amount of pre-processing already done in
| the retina itself and in the LGN:
|
| https://en.wikipedia.org/wiki/Lateral_geniculate_nucleus
|
| So the brain does not necessarily receive 'raw' images to
| process to begin with; a lot of high-level data, such as
| optical flow for detecting moving objects, has already been
| extracted at that point.
| Mkengin wrote:
| Interesting. So similar to the vision encoder + projector
| in VLMs?
| DrierCycle wrote:
| And the occipital cortex is developed around extraordinary
| levels of image separation, broken down into tiny areas of
| the input, scattered and woven for details of motion,
| gradient, contrast, etc.
| magicalhippo wrote:
| Skimming through the paper, here's my take.
|
| Someone previously found that the cross-attention layers in
| text-to-image diffusion models capture the correlation between
| the input text tokens and the corresponding image regions, so
| one can use this to segment the image, e.g. the pixels
| containing "cat". However, this segmentation was rather coarse.
| The authors of this paper found that also using the
| self-attention layers leads to a much more detailed
| segmentation.
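|
| A minimal numpy sketch of that readout, assuming you have
| already hooked the UNet and averaged out a cross-attention map
| cross_attn (image positions x text tokens) and a self-attention
| map self_attn (image positions x image positions); the names
| are mine, not the paper's:
|
|     import numpy as np
|
|     def token_mask(cross_attn, self_attn, token_idx, n_iter=4):
|         # Coarse mask: how strongly each image position attends
|         # to the chosen text token ("cat", say).
|         mask = cross_attn[:, token_idx]
|         mask = mask / (mask.max() + 1e-8)
|         # Refinement: self-attention acts as a pixel-to-pixel
|         # affinity, so averaging scores over similar positions
|         # sharpens the coarse mask.
|         for _ in range(n_iter):
|             mask = self_attn @ mask
|             mask = mask / (mask.max() + 1e-8)
|         return mask  # reshape to (H, W), threshold if needed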
|
| They then extend this to video by using the self-attention
| between two consecutive frames to determine how the
| segmentation changes from one frame to the next.
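|
| As I read it, that step amounts to label propagation through a
| cross-frame affinity matrix. A toy sketch (my interpretation,
| not the authors' code), assuming feat_prev and feat_cur are
| per-pixel diffusion features for two consecutive frames:
|
|     import numpy as np
|
|     def propagate_mask(feat_prev, feat_cur, mask_prev, temp=0.07):
|         # Cosine-similarity affinity between current-frame and
|         # previous-frame positions.
|         a = feat_cur / np.linalg.norm(feat_cur, axis=1, keepdims=True)
|         b = feat_prev / np.linalg.norm(feat_prev, axis=1, keepdims=True)
|         affinity = a @ b.T
|         # Softmax over previous-frame positions = attention weights.
|         w = np.exp(affinity / temp)
|         w = w / w.sum(axis=1, keepdims=True)
|         # Each current pixel takes a weighted vote from the
|         # previous frame's mask values.
|         return w @ mask_prev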
|
| Now, text-to-image diffusion models require a text input to
| generate the image to begin with. From what I can gather they
| limit themselves to semi-supervised video segmentation, so that
| the first frame is already segmented by, say, a human or some
| other process.
|
| They then run a "inversion" procedure which tries to generate
| text that causes the text-to-image diffusion model to segment
| the first frame as closely as possible to the provided
| segmentation.
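|
| In spirit this is like textual inversion: the model stays
| frozen and only a small text embedding is optimised so that
| the segmentation it induces matches the given first-frame
| mask. A hypothetical PyTorch-style loop (not the paper's
| actual code; segment_with_embedding stands in for the
| attention readout sketched above):
|
|     import torch
|
|     def invert_prompt(model, frame, gt_mask, steps=200, lr=1e-2):
|         # One learnable pseudo-token embedding; text_dim is an
|         # assumed attribute of the frozen diffusion model wrapper.
|         emb = torch.randn(1, model.text_dim, requires_grad=True)
|         opt = torch.optim.Adam([emb], lr=lr)
|         for _ in range(steps):
|             # Assumed helper returning a per-pixel probability mask.
|             pred = segment_with_embedding(model, frame, emb)
|             loss = torch.nn.functional.binary_cross_entropy(
|                 pred, gt_mask)
|             opt.zero_grad()
|             loss.backward()
|             opt.step()
|         # Reuse this embedding to track the object in later frames.
|         return emb.detach()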
|
| With the text in hand, they can then run the earlier
| segmentation propagation steps to track the segmented object
| throughout the video.
|
| The key here is that the text-to-image diffusion model is
| pretrained, and not fine-tuned for this task.
|
| That said, I'm no expert.
| jacquesm wrote:
| For a 'not an expert' explanation you did a better job than
| the original paper.
| nicolailolansen wrote:
| Bravo!
| ttul wrote:
| This is a cool result. Deep learning image models are trained on
| enormous amounts of data and the information recorded in their
| weights continues to astonish me. Over in the Stable Diffusion
| space, hobbyists (as opposed to professional researchers) are
| continuing to find new ways to squeeze intelligence out of models
| that were trained in 2022 and are considerably out of date
| compared with the latest "flow matching" models like Qwen Image
| and Flux.
|
| Makes you wonder what intelligence is lurking in a 10T parameter
| model like Gemini 3 that we may not discover for some years
| yet...
| smerrill25 wrote:
| Hey, how did you find out about this? I would be super
| curious to keep track of current ad-hoc ways of pushing older
| models to do cooler things. LMK
| ttul wrote:
| 1) Reading papers. 2) Reading "Deep Learning: Foundations and
| Concepts". 3) Taking Jeremy Howard's Fast.ai course
| tpoacher wrote:
| If the authors are reading: I notice you used a "Soft IoU" for
| validation.
|
| A large part of my 2017 PhD thesis [0] is dedicated to exploring
| the formulation and utility of soft validation operators,
| including this soft IoU, and the extent to which they are
| "better" / "more reliable" than thresholding (whether this occurs
| in isolation, or even when marginalised out, as with the AUC).
| Long story short, soft operators are at least an order of
| magnitude more reliable than their thresholding counterparts [1],
| despite the fact that thresholding still seems to be the
| industry/academia standard. This is the case for any set-
| operation-based operator, such as the Dice coefficient (a.k.a.
| F1-score), not just for the IoU. Recently, influential groups
| have proposed the Matthews correlation coefficient as a "better
| operator", but still treat it in binary / thresholding terms,
| which means it's still unreliable to an order of magnitude. I
| suspect this insight goes beyond images (e.g. the F1-score is
| often used in ML problems more generally, in situations where
| probabilistic outputs are thresholded to compare against binary
| ground truth labels), but I haven't tested that hypothesis
| explicitly beyond the image domain (yet).
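|
| For readers who haven't met the distinction, here is a small
| illustration (my own, not code from the thesis) of a soft
| versus a thresholded IoU on a probabilistic prediction:
|
|     import numpy as np
|
|     def soft_iou(pred, gt):
|         # Min/max soft IoU: inputs stay in [0, 1], no threshold.
|         return np.minimum(pred, gt).sum() / np.maximum(pred, gt).sum()
|
|     def hard_iou(pred, gt, thr=0.5):
|         # Conventional IoU after binarising the prediction.
|         p, g = pred >= thr, gt >= 0.5
|         return np.logical_and(p, g).sum() / np.logical_or(p, g).sum()
|
|     pred = np.array([0.9, 0.6, 0.4, 0.1])  # probabilistic output
|     gt = np.array([1.0, 1.0, 0.0, 0.0])    # binary ground truth
|     print(soft_iou(pred, gt), hard_iou(pred, gt))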
|
| In this work you effectively used the "Goedel" (i.e. min/max)
| fuzzy operator to define fuzzy intersection and union, for the
| purposes of using it in an IoU operator. There are other fuzzy
| norms with interesting properties that you can also explore.
| Other classical ones include product and Lukasiewicz. I show in
| [0] and [1] that these have "best case scenario sub-pixel
| overlap", "average case" and "worst-case scenario" underlying
| semantics. (In other words, min/max should not be a random choice
| of T-norm, but a conscious choice which should match your
| problem, and what the operator is intended to validate
| specifically). In my own work, I then proceeded to show that if
| take gradient direction at the boundary into account, you can
| come up with a fuzzy intersection/union pair which has
| directional semantics, and is even more reliable an operator when
| used to define a soft IoU.
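|
| For concreteness, the three classical T-norm / T-conorm pairs
| mentioned above, used as fuzzy intersection and union inside a
| soft IoU (standard definitions, shown only as an illustration):
|
|     import numpy as np
|
|     # Elementwise fuzzy intersection / union pairs on [0, 1].
|     T_NORMS = {
|         "goedel": (lambda a, b: np.minimum(a, b),
|                    lambda a, b: np.maximum(a, b)),
|         "product": (lambda a, b: a * b,
|                     lambda a, b: a + b - a * b),
|         "lukasiewicz": (lambda a, b: np.maximum(a + b - 1.0, 0.0),
|                         lambda a, b: np.minimum(a + b, 1.0)),
|     }
|
|     def soft_iou(pred, gt, norm="goedel"):
|         t, s = T_NORMS[norm]
|         return t(pred, gt).sum() / s(pred, gt).sum()
|
| With a binary ground truth all three pairs give the same
| number, which is exactly the collapse discussed next.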
|
| Having said that, in your case you're comparing against a binary
| ground truth. This collapses all the different T-norms to the
| same value. I wonder if this is the reason you chose a binary
| ground truth. If yes, you might want to consider my work, and use
| original 'soft' ground truths instead, for higher reliability, as
| well as the ability to define intersection semantics.
|
| I hope the above is of interest / use to you :) (and, if you were
| to decide to cite my work, it wouldn't be the eeeeeend of the
| world, I gueeeeesss xD )
|
| [0]
| https://ora.ox.ac.uk/objects/uuid:dc352697-c804-4257-8aec-08...
|
| [1]
| https://repository.essex.ac.uk/24856/1/Papastylianou.etal201...
___________________________________________________________________
(page generated 2025-11-26 23:00 UTC)