[HN Gopher] Beyond Diffusion: Inductive Moment Matching
___________________________________________________________________
Beyond Diffusion: Inductive Moment Matching
Author : outrun86
Score : 182 points
Date : 2025-03-12 03:05 UTC (19 hours ago)
(HTM) web link (lumalabs.ai)
(TXT) w3m dump (lumalabs.ai)
| goldemerald wrote:
| I've been lightly following this type of research for a few
| years. I immediately recognized the broad idea as stemming from
| the lab of the ridiculously prolific Stefano Ermon. He's always
| taken a unique angle for generative models since the before times
| of GenAI. I was fortunate to get lunch with him in grad school
| after a talk he gave. Seeing the work from his lab in these
| modern days is compelling, I always figured his style of research
| would break out into the mainstream eventually. I'm hopeful that
| the future of ML improvements will come from clever test-time
| algorithms like the one this article shows. I'm looking forward to when
| you can train a high quality generative model without needing a
| super cluster or webscale data.
| imjonse wrote:
| Some of their research has already broken into the mainstream;
| DDIM at least was their paper, and probably others in the
| diffusion domain too.
| programjames wrote:
| Anyone willing to give an intuitive summary of what they did
| mathwise? The math in the paper is super ugly to churn through.
| oofbey wrote:
| In normal diffusion you train a model to take lots of tiny
| steps, all the same small size. e.g. "You're gonna take 20
| steps, at times [1.0, 0.95, 0.90, 0.85...]" and each time the
| model takes that small fixed-size step to make the image look
| better.
|
| Here they train a model to say "I'm gonna ask you to take a
| step from time B to A - might be a small step, might be a big
| step - but whatever size it is, make the image that much
| better." You might ask the model to improve the image from
| t=1.0 to t=0.25 and be almost done. It gets a side variable
| telling it how much improvement to make in each step.
|
| I'm not sure this is right, but that's what I got out of it by
| skimming the blog & paper.
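|
| Roughly, in code (a sketch with illustrative names, not the
| paper's actual API):
|
|     # Standard diffusion sampling: many small, fixed-size hops.
|     # The network only sees the current time t; the update rule
|     # handles the hop down to s.
|     def sample_fixed(eps_net, ddim_update, x, n_steps=20):
|         ts = [1.0 - i / n_steps for i in range(n_steps + 1)]
|         for t, s in zip(ts[:-1], ts[1:]):
|             x = ddim_update(x, eps_net(x, t), t, s)
|         return x
|
|     # IMM-style sampling: the network is also told the target
|     # time s, so a single call can jump as far as you ask.
|     def sample_imm(net, x, schedule=(1.0, 0.25, 0.0)):
|         for t, s in zip(schedule[:-1], schedule[1:]):
|             x = net(x, t, s)
|         return x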
| kadushka wrote:
| No, we typically train any diffusion model on a single step
| (randomly chosen).
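|
| In code, a minimal sketch of that standard setup (the schedule
| and names here are illustrative):
|
|     import torch
|
|     def diffusion_training_step(model, x0, T=1000):
|         # One random timestep per image, not a trajectory.
|         t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
|         # Illustrative cosine schedule: alpha_bar in (0, 1].
|         a = (torch.cos(0.5 * torch.pi * t / T) ** 2).view(-1, 1, 1, 1)
|         # Corrupt x0 with Gaussian noise at level t ...
|         eps = torch.randn_like(x0)
|         x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps
|         # ... and train the network to predict that noise.
|         return ((model(x_t, t) - eps) ** 2).mean()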
| bearobear wrote:
| Last author here (I also did the DDIM paper,
| https://arxiv.org/abs/2010.02502). I know this is going to be
| very tricky math-wise (and in the paper we just wrote the most
| general thing to make reviewers happy), so I tried to explain
| the idea more simply in the blog post
| (https://lumalabs.ai/news/inductive-moment-matching).
|
| If you look at how a single step of the DDIM sampler interacts
| with the target timestep, it is actually just a linear
| function. This is obviously quite inflexible if we want to use
| it to represent a flexible function where we can choose any
| target timestep. So we just add the target timestep as an argument
| to the neural network and train it with a moment matching objective.
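|
| Concretely (a minimal sketch in standard DDIM notation, where
| a_t and a_s are the cumulative alphas at the current and target
| timesteps; illustrative code, not our actual implementation):
|
|     def ddim_step(eps_net, x_t, a_t, a_s):
|         # The network only sees (x_t, t); its output is fixed
|         # with respect to the target time s.
|         eps = eps_net(x_t)
|         x0_hat = (x_t - (1 - a_t) ** 0.5 * eps) / a_t ** 0.5
|         # The update is linear in the s-dependent coefficients
|         # sqrt(a_s) and sqrt(1 - a_s); nothing the network
|         # computes can adapt to how far we jump.
|         return a_s ** 0.5 * x0_hat + (1 - a_s) ** 0.5 * eps
|
|     # The fix, schematically: feed s to the network too, so
|     # x_s = net(x_t, t, s) can depend on s nonlinearly, and
|     # train it with a moment matching objective.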
|
| In general, I feel that analyzing a method's inference-time
| properties before training can be helpful not only for
| diffusion models, but also for LLMs, including various recent
| diffusion LLMs, which prompted me to write a position paper in
| the hopes that others develop cool new ideas
| (https://arxiv.org/abs/2503.07154).
| niemandhier wrote:
| Just as a counter perspective: I think your paper is great!
|
| Please don't let people ever discourage you from writing
| proper papers. Ever since Meta etc. started asking for "2
| papers in relevant fields" we have seen a flood of papers
| that should have been tweets.
| qumpis wrote:
| What happens if we don't add any moment matching objective,
| e.g. at train time just fit a diffusion model that predicts
| the target given any pair of timesteps (t, t')? Why is moment
| matching critical here?
|
| Also, regarding linearity, why is it inflexible? It seems
| quite convenient that a simple linear interpolation is used
| for reconstruction. Besides, even in DDIM, the direction
| towards the final target changes at each step as the images
| become less noisy. In standard diffusion models, or even flow
| matching, denoising is always equal to the prediction of the
| original data plus a direction from the current timestep to
| the timestep t'. To be clear, it is intuitive that such
| models are inferior at few-step generation since they don't
| optimise for test-time efficiency (in terms of the tradeoff
| of quality vs. compute), but it's unclear what inflexibility
| exists beyond this limitation.
|
| Clearly there's no expected quality benefit if all
| timesteps are used in denoising?
| littlestymaar wrote:
| Stupid question, what's a "timestep" in that context?
| nmca wrote:
| The author's own summary from the position paper is:
|
| In particular, we examine the one-step iterative process of
| DDIM [39, 19, 21] and show that it has limited capacity with
| respect to the target timestep under the current denoising
| network design. This can be addressed by adding the target
| timestep to the inputs of the denoising network [15].
|
| Interestingly, this one fix, plus a proper moment matching
| objective [5] leads to a stable, single-stage algorithm that
| surpasses diffusion models in sample quality while being over
| an order of magnitude more efficient at inference [50].
| Notably, these ideas do not rely on denoising score matching
| [46] or the score-based stochastic differential equations [41]
| on which the foundations of diffusion models are built.
| hyperbovine wrote:
| The math is totally standard if you've read recent important
| papers on score matching and flow matching. If you haven't,
| well, I can't see how you could possibly hope to understand
| this work at a technical level anyways.
| bbminner wrote:
| Can anyone share insight into how this differs from
| consistency models? The idea seems quite similar?
| bearobear wrote:
| Consistency models are a special case of IMM where you do moment
| matching with 1 sample from each distribution (i.e., you cannot
| match the distributions properly). See Fig. 5 for an ablation
| study; of course, adding more samples when you are doing moment
| matching makes training more stable :)
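|
| For intuition, a toy sketch of the distribution-matching idea
| (a plain biased squared-MMD estimate with an RBF kernel; not
| our exact objective):
|
|     import torch
|
|     def mmd2(x, y, bandwidth=1.0):
|         # Squared MMD between two sample sets: small only when
|         # the *distributions* match, not just single points.
|         def k(a, b):
|             d2 = torch.cdist(a, b) ** 2
|             return torch.exp(-d2 / (2 * bandwidth ** 2))
|         return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
|
|     # With one sample per side, k(x, x) and k(y, y) are constant
|     # and the loss collapses to a pointwise distance between two
|     # single samples; roughly the consistency-model regime.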
| throwaway2562 wrote:
| I'm trying to understand what the 'spectral' interpretation
| of IMM is, but perhaps I shouldn't:
|
| https://sander.ai/2024/09/02/spectral-autoregression.html
| bbminner wrote:
| Makes sense. How can you even approximately estimate higher
| order differences in conditional moments in such a high dim
| space? Seems statistically impossible to get a reasonable
| estimate for a gradient. Moment matching in sample space has
| always been very hard.
| lukasb wrote:
| "Inference can generally be scaled along two dimensions:
| extending sequence length (in autoregressive models), and
| augmenting the number of refinement steps (in diffusion models)."
|
| Does this mean that diffusion models for text could scale
| inference compute to improve quality for a fixed-length output?
| svachalek wrote:
| Yes, although so far it seems the main advantage of text
| diffusion models is that they're really, really fast.
| Iterations reach an asymptote very quickly.
| lukasb wrote:
| Yeah, I guess progressive refinement is limited in quality by
| how good the first N iterations are, the ones that establish
| the broad outlines.
| vessenes wrote:
| FWIW I don't think we've seen nearly all the ideas for text
| diffusion yet -- why not 'jiggle the text around a bit'
| when things have stabilized, or add space to fill, or have
| a separate judging module identify space that needs more
| tokens? Lots of super interesting possibilities.
| kadushka wrote:
| I don't know which text diffusion models you're talking
| about; the latest and greatest is this one:
| https://arxiv.org/abs/2502.09992, and it's extremely slow - a
| couple of orders of magnitude slower than a regular LLM,
| mainly because it does not support KV caching and requires
| many full-sequence processing steps per token.
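|
| A back-of-the-envelope sketch of why that hurts (a toy count
| of attention score computations only; all constants ignored):
|
|     def ar_attention_cost(n):
|         # Autoregressive decoding with a KV cache: token i
|         # attends to i cached keys, so n tokens cost ~ n^2 / 2.
|         return sum(range(1, n + 1))
|
|     def diffusion_attention_cost(n, steps):
|         # Each refinement step re-runs full attention over all
|         # n positions, with no cache reused across steps.
|         return steps * n * n
|
|     # e.g. n = 1024 tokens, one full-sequence step per token:
|     # AR ~ 0.5M ops, diffusion ~ 1B ops in this toy count.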
| janalsncm wrote:
| I'm not familiar with that paper but it would probably be
| best to compare speeds with an unoptimized transformer
| decoder. The Vaswani paper came out 8 years ago so
| implementations will be pretty highly optimized at this
| point.
|
| On the other hand if there was a theoretical reason why
| text diffusion models could never be faster than
| autoregressive transformers it would be notable.
| kadushka wrote:
| There's not enough improvement over regular LLMs to
| motivate optimization effort. Recall that the original
| transformer was well received because it was fast and
| scalable compared to RNNs.
| brcmthrowaway wrote:
| This is a gamechanger
| echelon wrote:
| Does this mean high quality images and video will be possible in
| one or a few sampling steps?
|
| Fast, real time video generation? (One second of compute per one
| second of output.)
|
| Does this mean more efficient and more generalizable training and
| fine tuning?
| richard___ wrote:
| Reminds me of the Kevin Frans shortcut networks paper?
| xela79 wrote:
| This went over my head quickly; I read through it a few times,
| then asked GPT for a summary at my level of understanding,
| which does clear up the overall idea for me personally:
|
| Alright, imagine you have a big box of LEGO bricks, and you're
| trying to build a really cool spaceship. There are two main ways
| people usually build things like this:
|
| Step-by-step (Autoregressive Models) - Imagine you put one LEGO
| brick down at a time, making sure each piece fits perfectly
| before adding the next. It works, but it takes a long time.
|
| Fix and refine (Diffusion Models) - Imagine you start by dumping
| all the LEGO bricks in a messy pile. Then, you slowly move pieces
| around, fixing mistakes until you get a spaceship. This is faster
| than the first method, but it still takes a lot of tiny
| adjustments.
|
| What's the Problem?
|
| People have been using these two ways for a long time, and
| they've gotten really good at them. But no matter how big or
| smart your LEGO-building robot gets, these methods don't get
| that much better. They're kind of stuck.
|
| The New Way: Inductive Moment Matching (IMM)
|
| IMM is like a magical LEGO helper that doesn't just follow the
| usual slow steps. Instead, it looks at what the final spaceship
| should look like ahead of time and figures out how to jump
| closer to the final result in fewer steps.
|
| Instead of moving one LEGO brick at a time or slowly fixing a
| messy pile, it's like the helper knows where each piece should go
| ahead of time and moves big sections all at once. That makes it
| way faster and still super accurate!
|
| Why is This Cool?
|
| - Faster: it builds things much more quickly than the old
|   methods.
| - More efficient: it doesn't waste as much time adjusting tiny
|   details.
| - Works with all kinds of problems: this method can be used for
|   pictures, videos, and maybe even other things like 3D models.
|
| Real-World Example: Imagine drawing a picture of a dog. Old
| way: you draw one tiny detail at a time, or you start with a
| blurry dog and keep fixing it. New way (IMM): you already kind
| of know what the dog should look like, so you make big strokes
| to get there quickly!
|
| So basically, IMM is a super smart way to skip unnecessary
| steps and get amazing results much faster.
| b2w wrote:
| So like intuitive photographic memory?
| Climatebamb wrote:
| More like "Oh, I remember roughly what you want, I remember
| the basic steps of reaching it, just not the details, let's
| generate the details" vs. "learning x steps from noise to
| image".
|
| You make the way of reaching your target faster.
| azinman2 wrote:
| Thank you, this is helpful framing. Obviously all the details
| are missing, but the blog post was impenetrable for me, and I'm
| quite technical.
| 33a wrote:
| Reminds me of
| https://ggx-research.github.io/publication/2023/05/10/public...
___________________________________________________________________
(page generated 2025-03-12 23:01 UTC)