[HN Gopher] Block Diffusion: Interpolating between autoregressiv...
___________________________________________________________________
Block Diffusion: Interpolating between autoregressive and diffusion
models
Author : GaggiX
Score : 95 points
Date : 2025-03-14 14:58 UTC (8 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| hycpax wrote:
| Only starts approaching the PPL of an AR model at block size 4.
| May as well just use multi-token prediction with a standard AR
| model.
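|
| For reference, multi-token prediction in miniature (my own
| sketch, not from this paper): one shared trunk and K small
| heads, where head k predicts the token at offset k+1, so a
| single forward pass proposes K tokens at once.
|
|     import torch
|     import torch.nn as nn
|
|     # Sketch: K linear heads over a shared hidden state; head k
|     # is trained to predict the token k+1 positions ahead, so one
|     # forward pass yields K candidate tokens.
|     class MTPHeads(nn.Module):
|         def __init__(self, d_model, vocab, k=4):
|             super().__init__()
|             self.heads = nn.ModuleList(
|                 nn.Linear(d_model, vocab) for _ in range(k))
|
|         def forward(self, h):  # h: (batch, d_model)
|             return [head(h) for head in self.heads]
|
|     logits = MTPHeads(d_model=64, vocab=100, k=4)(torch.randn(2, 64))
|     print([tuple(l.shape) for l in logits])  # 4 x (2, 100)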
| GaggiX wrote:
| Memory bandwidth is the bottleneck when running local models.
| Because this model is parallelizable, even single-batch
| inference can balance the memory-bandwidth bottleneck against
| the compute bottleneck (i.e., much more speed).
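|
| Back-of-the-envelope, with made-up hardware numbers (not from
| the paper): single-batch AR decoding re-reads every weight per
| token, so it's bandwidth-bound; emitting k tokens per weight
| read amortizes that cost until compute becomes the limit.
|
|     # Toy roofline estimate. All numbers are illustrative.
|     params = 8e9             # 8B-parameter model
|     bytes_per_param = 2      # fp16 weights
|     mem_bw = 1.0e12          # 1 TB/s memory bandwidth
|     flops = 3.0e14           # 300 TFLOP/s fp16 compute
|
|     weight_bytes = params * bytes_per_param
|     flops_per_token = 2 * params  # ~2 FLOPs per parameter per token
|
|     for k in (1, 4, 16, 64):      # tokens produced per weight read
|         t_mem = weight_bytes / mem_bw            # one pass over weights
|         t_compute = k * flops_per_token / flops  # k tokens of math
|         print(f"k={k:3d}: ~{k / max(t_mem, t_compute):,.0f} tok/s")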
| kelseyfrog wrote:
| I wonder how sliding window would work over blocks.
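|
| One way it could look (my own sketch, not from the paper):
| tokens attend bidirectionally within their own block, as
| denoising requires, plus causally to a window of the previous
| few blocks rather than to the whole prefix.
|
|     import numpy as np
|
|     # Sketch of a block-level sliding-window attention mask.
|     # True = query token may attend to key token.
|     def block_sliding_mask(n_blocks, block, window):
|         n = n_blocks * block
|         mask = np.zeros((n, n), dtype=bool)
|         for q in range(n):
|             qb = q // block                   # query's block index
|             lo = max(0, qb - window) * block  # earliest visible token
|             hi = (qb + 1) * block             # end of query's own block
|             mask[q, lo:hi] = True
|         return mask
|
|     print(block_sliding_mask(n_blocks=4, block=2, window=1).astype(int))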
| sbussard wrote:
| Excellent question
| mountainriver wrote:
| This is cool but I feel like you lose the best part of language-
| diffusion models which is their ability to edit early tokens.
| drak0n1c wrote:
| Those early tokens aren't necessarily immutable, they still
| could be "edited" depending on UI. Human conversation and even
| internal compositional cogitation is full of "what I meant by
| that" or "on second thought" type clarifications and
| corrections. Sometimes these aren't verbosely disclaimed;
| there's body language involved. Likewise there could be
| occasional lookback parsing and later blocks could convey
| modifications. The UI can then highlight those revisions
| transparently by applying strikethrough styling, coloration,
| dotted underline with tooltip on hover, etc.
|
| Like we've seen with human interactions and media, this may be
| susceptible to misinterpretation by the reader or listener,
| especially via second-hand clips or screenshots lacking full
| context. But if the UX is clean and speedy it would be less
| likely.
| magicalhippo wrote:
| I'm reminded of the Physics of Language Models[1], where they
| showed a standard autoregressive LLM got a lot more accurate
| if the model got access to the backspace key, so to speak.
|
| [1]: https://physics.allen-zhu.com/home
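|
| A toy version of that "backspace key" (my own sketch, not the
| paper's setup): let the model emit a special token that deletes
| the previous token instead of appending, so an early mistake is
| recoverable at decode time.
|
|     BACK = "<back>"
|
|     # Replay an output stream, treating BACK as "undo last token".
|     def decode(stream):
|         out = []
|         for tok in stream:
|             if tok == BACK:
|                 if out:
|                     out.pop()
|             else:
|                 out.append(tok)
|         return out
|
|     print(decode(["No", BACK, "Yes", ",", "because", "..."]))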
| Voloskaya wrote:
| To be fair, it's not "obviously" better, but it opens a new
| point on the tradeoff curve. For a lot of use cases full
| autoregression is clearly better, and for some others full
| diffusion will still be better.
|
| Autoregression produces high-quality output but is fairly slow.
| Diffusion produces low-quality output but is quite fast.
|
| This allows you to go in the middle, not as high quality as
| full autoregression and not as fast as full diffusion, but a
| balance between both.
| impossiblefork wrote:
| Isn't this basically the diffusion-autoregressive sampling
| strategy from the LLaDA paper, maybe more carefully evaluated?
| volodia wrote:
| the LLaDA paper is a scaled-up version of this paper; they cite
| it as an anonymous ICLR submission
| impossiblefork wrote:
| Ah.
| m00x wrote:
| I'm not sure if this is what you mean, but LLaDA isn't block
| text diffusion. This is a mix between an autoregressive model
| and a diffusion model, which is brand new.
| transitivebs wrote:
| this animation really makes the difference hit home:
| https://x.com/_akhaliq/status/1900027075370586262
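|
| In code, the loop it depicts is roughly this (my paraphrase,
| with a dummy denoiser standing in for the network): blocks are
| generated left to right, autoregressive across blocks, but each
| block starts as all-MASK and is denoised over a few steps,
| conditioned on the finished prefix.
|
|     import random
|
|     MASK = "_"
|     VOCAB = list("abcdefgh")
|
|     # Stand-in for the model: unmask one position per step. A real
|     # denoiser would predict tokens from prefix + current block.
|     def denoise_step(prefix, block):
|         i = block.index(MASK)
|         block[i] = random.choice(VOCAB)
|         return block
|
|     def generate(n_blocks, block_size, steps_per_block):
|         prefix = []
|         for _ in range(n_blocks):
|             block = [MASK] * block_size
|             for _ in range(steps_per_block):
|                 block = denoise_step(prefix, block)
|             prefix += block  # frozen; later blocks condition on it
|         return "".join(prefix)
|
|     print(generate(n_blocks=4, block_size=3, steps_per_block=3))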
| bbminner wrote:
| This animation is a perfect abstract!
| sbussard wrote:
| I love when someone comes up with a good idea that becomes
| immediately obvious as soon as it is introduced
| 85392_school wrote:
| Based on the animation, I personally don't expect this to be very
| helpful. The main way diffusion models help is preventing answers
| like "No. [proceeds to explain why the answer is yes]", and since
| the blocks are so small, the LLM can't fully explain before it
| has to say yes or no.
| prophesi wrote:
| Could you expound on this? From what I'm reading, this sounds
| like an issue with diffusion models that their block diffusion
| model is purposefully designed to mitigate, by conditioning on
| previous blocks and allowing for larger blocks if that
| conditioning still doesn't help maintain coherence.
| 85392_school wrote:
| It's an issue that you run into as long as you're forced to
| start with a yes/no answer. It's a problem forward-only LLMs
| have and diffusion models don't, and normal block diffusion
| is closer to forward LLMs than diffusion models.
|
| You _could_ increase the block size to act more like a full
| diffusion model, but you would lose some of the benefits of
| block diffusion.
| throwaway314155 wrote:
| Interesting. Makes me want to play around with an open
| diffusion LM. Do you have any recommendations?
| jasonjmcghee wrote:
| My understanding here is block size can be arbitrarily large,
| under similar constraints as diffusion models. Is that not the
| case?
| bondarchuk wrote:
| Diffusion on images is easy to understand for me: you start with
| noise, the model denoises by shifting the pixels towards their
| final value. What is the equivalent operation for increasing or
| reducing noise in language here? Is the "noisy" sentence half-way
| through training or inference sort-of-correct but not really, and
| at 90% almost-correct but with slightly wrong words
| (semantically)? Is the noise somehow semantic at all or is it
| something else?
| littlestymaar wrote:
| Not familiar at all with diffusion LLM but I'd guess you'd have
| noisy logits.
| simne wrote:
| For LLMs, the standard method is to give the foundation model a
| sentence with one word removed (chosen at random, but already
| known to the model) and ask it to fill in that word.
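|
| Concretely, in the masked-diffusion framing (a sketch of the
| idea, not the paper's exact schedule): at noise level t in
| [0, 1], each token is independently replaced by [MASK] with
| probability t. A token is never semantically "half wrong"; it
| is either intact or masked, and the model learns to recover the
| masked ones.
|
|     import random
|
|     # Forward (noising) process for masked text diffusion.
|     def add_noise(tokens, t, mask="[MASK]"):
|         return [mask if random.random() < t else tok for tok in tokens]
|
|     sent = "the cat sat on the mat".split()
|     for t in (0.25, 0.5, 0.9):
|         print(t, add_noise(sent, t))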
| mike978 wrote:
| https://m-arriola.com/bd3lms/
|
| https://github.com/kuleshov-group/bd3lms
___________________________________________________________________
(page generated 2025-03-14 23:00 UTC)