[HN Gopher] Beyond Diffusion: Inductive Moment Matching
       ___________________________________________________________________
        
       Beyond Diffusion: Inductive Moment Matching
        
       Author : outrun86
       Score  : 182 points
       Date   : 2025-03-12 03:05 UTC (19 hours ago)
        
 (HTM) web link (lumalabs.ai)
 (TXT) w3m dump (lumalabs.ai)
        
       | goldemerald wrote:
       | I've been lightly following this type of research for a few
       | years. I immediately recognized the broad idea as stemming from
       | the lab of the ridiculously prolific Stefano Ermon. He's always
       | taken a unique angle for generative models since the before times
       | of GenAI. I was fortunate to get lunch with him in grad school
       | after a talk he gave. Seeing the work from his lab in these
       | modern days is compelling, I always figured his style of research
        | would break out into the mainstream eventually. I'm hopeful that
        | the future of ML improvements will come from clever test-time
        | algorithms like the one this article describes. I'm looking
        | forward to when
       | you can train a high quality generative model without needing a
       | super cluster or webscale data.
        
         | imjonse wrote:
          | Some of their research has already broken into the mainstream;
          | DDIM at least was their paper, and probably others in the
          | diffusion domain too.
        
       | programjames wrote:
       | Anyone willing to give an intuitive summary of what they did
       | mathwise? The math in the paper is super ugly to churn through.
        
         | oofbey wrote:
         | In normal diffusion you train a model to take lots of tiny
         | steps, all the same small size. e.g. "You're gonna take 20
         | steps, at times [1.0, 0.95, 0.90, 0.85...]" and each time the
         | model takes that small fixed-size step to make the image look
         | better.
         | 
         | Here they train a model to say "I'm gonna ask you to take a
         | step from time B to A - might be a small step, might be a big
         | step - but whatever size it is, make the image that much
         | better." You you might ask the model to improve the image from
         | t=1.0 to t=0.25 and be almost done. It gets a side variable
         | telling it how much improvement to make in each step.
         | 
          | I'm not sure this is right, but that's what I got out of it by
         | skimming the blog & paper.
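          | 
          | A minimal sketch of that contrast in Python (step_fn and
          | jump_fn are hypothetical denoiser calls, not the authors'
          | API):
          | 
          |     # Standard diffusion sampler: many small, fixed-size steps.
          |     def sample_diffusion(x, step_fn, n_steps=20):
          |         ts = [i / n_steps for i in range(n_steps, 0, -1)]
          |         for t in ts:                # 1.0, 0.95, ..., 0.05
          |             x = step_fn(x, t)       # model only sees the current time
          |         return x
          | 
          |     # IMM-style sampler: the target time s is an explicit input,
          |     # so one call can jump from t=1.0 straight to s=0.25.
          |     def sample_imm(x, jump_fn, schedule=(1.0, 0.25, 0.0)):
          |         for t, s in zip(schedule, schedule[1:]):
          |             x = jump_fn(x, t, s)
          |         return x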
        
           | kadushka wrote:
           | No, we typically train any diffusion model on a single step
           | (randomly chosen).
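            | 
            | E.g., a minimal sketch of such a training step (eps_model and
            | alpha_bar are hypothetical stand-ins, not from the paper):
            | 
            |     import torch
            | 
            |     def train_step(x0, eps_model, alpha_bar):
            |         # ONE randomly chosen timestep per batch element
            |         t = torch.rand(x0.shape[0])
            |         a = alpha_bar(t).view(-1, *([1] * (x0.dim() - 1)))
            |         eps = torch.randn_like(x0)
            |         x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps  # noising
            |         # noise-prediction (denoising) loss
            |         return ((eps_model(x_t, t) - eps) ** 2).mean()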
        
         | bearobear wrote:
         | Last author here (I also did the DDIM paper,
         | https://arxiv.org/abs/2010.02502). I know this is going to be
         | very tricky math-wise (and in the paper we just wrote the most
         | general thing to make reviewers happy), so I tried to explain
         | the idea more easily under the blog post
         | (https://lumalabs.ai/news/inductive-moment-matching).
         | 
         | If you look at how a single step of the DDIM sampler interacts
         | with the target timestep, it is actually just a linear
         | function. This is obviously quite inflexible if we want to use
         | it to represent a flexible function where we can choose any
         | target timestep. So just add this as an argument to the neural
         | network and then train it with a moment matching objective.
         | 
          | In general, I feel that analyzing a method's inference-time
          | properties before training it can be helpful not only for
          | diffusion models but also for LLMs, including various recent
         | diffusion LLMs, which prompted me to write a position paper in
         | the hopes that others develop cool new ideas
         | (https://arxiv.org/abs/2503.07154).
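          | 
          | To make the linearity concrete, here is a DDIM step written to
          | expose its dependence on the target time s (a sketch with
          | hypothetical eps_model/alpha_bar helpers):
          | 
          |     def ddim_step(x_t, t, s, eps_model, alpha_bar):
          |         eps = eps_model(x_t, t)     # the network never sees s
          |         x0_hat = (x_t - (1 - alpha_bar(t)) ** 0.5 * eps) \
          |                  / alpha_bar(t) ** 0.5
          |         # The dependence on s enters only through two scalar
          |         # coefficients -- a fixed linear combination of x0_hat
          |         # and eps. IMM instead feeds s to the network itself,
          |         # e.g. eps = eps_model(x_t, t, s).
          |         return (alpha_bar(s) ** 0.5 * x0_hat
          |                 + (1 - alpha_bar(s)) ** 0.5 * eps)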
        
           | niemandhier wrote:
           | Just as a counter perspective: I think your paper is great!
           | 
           | Please don't let people ever discourage you from writing
            | proper papers. Ever since Meta etc. started asking for "2
            | papers in relevant fields" we've seen a flood of papers that
           | should be tweets.
        
           | qumpis wrote:
            | What happens if we don't add any moment matching objective?
           | e.g. at train time just fit a diffusion model that predicts
           | the target given any pair of timesteps (t, t')? Why is moment
           | matching critical here?
           | 
            | Also, regarding linearity, why is it inflexible? It seems
            | quite convenient that a simple linear interpolation is used
            | for reconstruction; besides, even in DDIM, the direction
            | towards the final target changes at each step as the images
            | become less noisy. In standard diffusion models or even flow
           | matching, denoising is always equal to the prediction of the
           | original data + direction from current timestep to the
           | timestep t'. Just to be clear, it is intuitive that such
           | models are inferior in few-step generations since they don't
           | optimise for test time efficiency (in terms of the tradeoff
           | of quality vs compute), but it's unclear what inflexibility
           | exists there beyond this limitation.
           | 
           | Clearly there's no expected benefit in quality if all
           | timesteps are used in denoising?
        
           | littlestymaar wrote:
           | Stupid question, what's a "timestep" in that context?
        
         | nmca wrote:
          | The author's own summary from the position paper is:
         | 
         | In particular, we examine the one-step iterative process of
         | DDIM [39, 19, 21] and show that it has limited capacity with
         | respect to the target timestep under the current denoising
         | network design. This can be addressed by adding the target
         | timestep to the inputs of the denoising network [15].
         | 
         | Interestingly, this one fix, plus a proper moment matching
         | objective [5] leads to a stable, single-stage algorithm that
         | surpasses diffusion models in sample quality while being over
         | an order of magnitude more efficient at inference [50].
         | Notably, these ideas do not rely on denoising score matching
         | [46] or the score-based stochastic differential equations [41]
         | on which the foundations of diffusion models are built.
        
         | hyperbovine wrote:
         | The math is totally standard if you've read recent important
         | papers on score matching and flow matching. If you haven't,
         | well, I can't see how you could possibly hope to understand
         | this work at a technical level anyways.
        
       | bbminner wrote:
       | Can anyone share insight into how this is different from
       | consistency models? The insight seems quite similar?
        
         | bearobear wrote:
          | Consistency models are a special case of IMM where you do
          | moment matching with 1 sample from each distribution (i.e., you
          | cannot match distributions properly). See Fig 5 for an ablation
          | study; of course, adding more samples when you are doing moment
          | matching makes it more stable during training :)
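          | 
          | For intuition, a toy kernel moment-matching (MMD) estimate in
          | Python -- an illustrative stand-in, not the paper's exact loss:
          | 
          |     import numpy as np
          | 
          |     def rbf_gram(x, y, sigma=1.0):
          |         # Gaussian-kernel Gram matrix between two sample sets
          |         d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
          |         return np.exp(-d2 / (2 * sigma ** 2))
          | 
          |     def mmd2(xs, ys, sigma=1.0):
          |         # Squared MMD between two sample sets. With one sample
          |         # per side, the xs-xs and ys-ys terms are constants and
          |         # this collapses to a pointwise distance, as in
          |         # consistency training.
          |         return (rbf_gram(xs, xs, sigma).mean()
          |                 + rbf_gram(ys, ys, sigma).mean()
          |                 - 2 * rbf_gram(xs, ys, sigma).mean())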
        
           | throwaway2562 wrote:
           | I'm trying to understand what the 'spectral' interpretation
           | of IMM is: but perhaps I shouldn't
           | 
           | https://sander.ai/2024/09/02/spectral-autoregression.html
        
           | bbminner wrote:
           | Makes sense. How can you even approximately estimate higher
           | order differences in conditional moments in such a high dim
           | space? Seems statistically impossible to get a reasonable
           | estimate for a gradient. Moment matching in sample space has
           | always been very hard.
        
       | lukasb wrote:
       | "Inference can generally be scaled along two dimensions:
       | extending sequence length (in autoregressive models), and
       | augmenting the number of refinement steps (in diffusion models)."
       | 
       | Does this mean that diffusion models for text could scale
       | inference compute to improve quality for a fixed-length output?
        
         | svachalek wrote:
         | Yes, although so far it seems the main advantage of text
         | diffusion models is that they're really, really fast.
         | Iterations reach an asymptote very quickly.
        
           | lukasb wrote:
           | Yeah I guess progressive refinement is limited in quality by
           | how good the first N iterations are that establish the broad
           | outlines.
        
             | vessenes wrote:
              | FWIW I don't think we've seen nearly all the ideas for text
             | diffusion yet -- why not 'jiggle the text around a bit'
             | when things have stabilized, or add space to fill, or have
             | a separate judging module identify space that needs more
             | tokens? Lots of super interesting possibilities.
        
           | kadushka wrote:
            | I don't know which text diffusion models you're talking
            | about; the latest and greatest is this one:
            | https://arxiv.org/abs/2502.09992 and it's extremely slow - a
            | couple of orders of magnitude slower than a regular LLM,
            | mainly because it does not support KV caching and requires
            | many full-sequence processing steps per token.
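            | 
            | A toy cost model of why that hurts (illustrative only, not
            | measurements of that paper's model):
            | 
            |     # With a KV cache, autoregressive decoding runs one
            |     # forward per new token, attending to i cached positions
            |     # at step i; a diffusion LM reruns full-sequence
            |     # attention at every refinement step.
            |     def ar_attention_cost(n_tokens):
            |         return sum(range(1, n_tokens + 1))   # ~ n^2 / 2
            | 
            |     def diffusion_attention_cost(n_tokens, n_steps):
            |         return n_steps * n_tokens ** 2       # n x n per step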
        
             | janalsncm wrote:
             | I'm not familiar with that paper but it would probably be
             | best to compare speeds with an unoptimized transformer
             | decoder. The Vaswani paper came out 8 years ago so
             | implementations will be pretty highly optimized at this
             | point.
             | 
             | On the other hand if there was a theoretical reason why
             | text diffusion models could never be faster than
             | autoregressive transformers it would be notable.
        
               | kadushka wrote:
               | There's not enough improvement over regular LLMs to
               | motivate optimization effort. Recall that the original
               | transformer was well received because it was fast and
               | scalable compared to RNNs.
        
       | brcmthrowaway wrote:
       | This is a gamechanger
        
       | echelon wrote:
       | Does this mean high quality images and video will be possible in
       | one or a few sampling steps?
       | 
       | Fast, real time video generation? (One second of compute per one
       | second of output.)
       | 
       | Does this mean more efficient and more generalizable training and
       | fine tuning?
        
       | richard___ wrote:
       | Reminds me of the Kevin Frans shortcut networks paper?
        
       | xela79 wrote:
        | this went over my head quickly; I read through it a few times,
        | then asked GPT for a summary at my level of understanding, which
        | does clear it up for me, personally, enough to grasp the overall
        | idea:
       | 
       | Alright, imagine you have a big box of LEGO bricks, and you're
       | trying to build a really cool spaceship. There are two main ways
       | people usually build things like this:
       | 
       | Step-by-step (Autoregressive Models) - Imagine you put one LEGO
       | brick down at a time, making sure each piece fits perfectly
       | before adding the next. It works, but it takes a long time.
       | 
       | Fix and refine (Diffusion Models) - Imagine you start by dumping
       | all the LEGO bricks in a messy pile. Then, you slowly move pieces
       | around, fixing mistakes until you get a spaceship. This is faster
       | than the first method, but it still takes a lot of tiny
       | adjustments.
       | 
        | What's the Problem?
        | 
        | People have been using these two ways for a long time, and
        | they've gotten really good at them. But no matter how big or
        | smart your LEGO-building robot gets, these methods don't get
        | that much better. They're kind of stuck.
       | 
        | The New Way: Inductive Moment Matching (IMM)
        | 
        | IMM is like a magical LEGO helper that doesn't just follow the
        | usual slow steps. Instead, it looks at what the final spaceship
        | should look like ahead of time and figures out how to jump
        | closer to the final result in fewer steps.
       | 
       | Instead of moving one LEGO brick at a time or slowly fixing a
       | messy pile, it's like the helper knows where each piece should go
       | ahead of time and moves big sections all at once. That makes it
       | way faster and still super accurate!
       | 
        | Why is This Cool?
        | 
        | - Faster: It builds things much more quickly than the old
        |   methods.
        | - More efficient: It doesn't waste as much time adjusting tiny
        |   details.
        | - Works with all kinds of problems: This method can be used for
        |   pictures, videos, and maybe even other things like 3D models.
        | 
        | Real-World Example
        | 
        | Imagine drawing a picture of a dog. Old way: You draw one tiny
        | detail at a time, or you start with a blurry dog and keep
        | fixing it. New way (IMM): You already kind of know what the dog
        | should look like, so you make big strokes to get there quickly!
       | 
       | So basically, IMM is a super smart way to skip unnecessary steps
       | and get amazing results much faster.
        
         | b2w wrote:
         | So like intuitive photographic memory?
        
           | Climatebamb wrote:
           | More like "Oh i remember what you roughly want, i rememeber
           | basic steps of reaching it just not details, lets generate
           | the details" vs. "learning x steps from noise to image".
           | 
           | You make the way of reaching your target faster.
        
         | azinman2 wrote:
         | Thank you, this is helpful framing. Obviously all the details
         | are missing, but the blog post was impenetrable for me, and I'm
         | quite technical.
        
       | 33a wrote:
        | Reminds me of
        | https://ggx-research.github.io/publication/2023/05/10/public...
        
       ___________________________________________________________________
       (page generated 2025-03-12 23:01 UTC)