[HN Gopher] Block Diffusion: Interpolating between autoregressiv...
       ___________________________________________________________________
        
       Block Diffusion: Interpolating between autoregressive and diffusion
       models
        
       Author : GaggiX
       Score  : 95 points
       Date   : 2025-03-14 14:58 UTC (8 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | hycpax wrote:
        | Only starts approaching the perplexity (PPL) of AR at block
        | size 4. May as well just use multi-token prediction with a
        | standard AR model.
        
       | GaggiX wrote:
        | Memory bandwidth is what limits the speed of running local
        | models. Because this model is parallelizable, even single-
        | batch inference can balance the memory bandwidth bottleneck
        | against the compute bottleneck (i.e., much more speed).
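        | 
        | A rough roofline-style sketch of that balance, using purely
        | illustrative, assumed numbers (not measurements of any real
        | system):
        | 
        |     # Assumed: local 8B model in fp16 on a 400 GB/s,
        |     # 100 TFLOP/s GPU. All numbers illustrative.
        |     model_bytes = 8e9 * 2        # weights read per pass
        |     mem_bw = 400e9               # bytes/s
        |     peak_flops = 100e12          # FLOP/s
        |     flops_per_token = 2 * 8e9    # ~2 * params per token
        | 
        |     for tokens_per_pass in (1, 4, 16, 64):
        |         t_mem = model_bytes / mem_bw   # stream weights once
        |         t_comp = (tokens_per_pass * flops_per_token
        |                   / peak_flops)
        |         t = max(t_mem, t_comp)         # slower side wins
        |         print(f"{tokens_per_pass:3d} tok/pass: "
        |               f"{tokens_per_pass / t:7.0f} tok/s")
        | 
        | At one token per pass the weight traffic dominates; decoding
        | a whole block per pass amortizes it until compute becomes the
        | limit.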
        
       | kelseyfrog wrote:
        | I wonder how a sliding window would work over blocks.
        
         | sbussard wrote:
         | Excellent question
        
       | mountainriver wrote:
        | This is cool, but I feel like you lose the best part of
        | language-diffusion models, which is their ability to edit
        | early tokens.
        
         | drak0n1c wrote:
         | Those early tokens aren't necessarily immutable, they still
         | could be "edited" depending on UI. Human conversation and even
         | internal compositional cogitation is full of "what I meant by
         | that" or "on second thought" type clarifications and
          | corrections. Sometimes these aren't verbosely disclaimed;
          | there's body language involved. Likewise there could be
         | occasional lookback parsing and later blocks could convey
         | modifications. The UI can then highlight those revisions
         | transparently by applying strikethrough styling, coloration,
         | dotted underline with tooltip on hover, etc.
         | 
         | Like we've seen with human interactions and media, this may be
         | susceptible to misinterpretation by the reader or listener,
         | especially via second-hand clips or screenshots lacking full
         | context. But if the UX is clean and speedy it would be less
         | likely.
        
           | magicalhippo wrote:
           | I'm reminded of the Physics of Language Models[1] where they
            | showed that a standard autoregressive LLM got a lot more
            | accurate if the model got access to the backspace key, so
            | to speak.
           | 
           | [1]: https://physics.allen-zhu.com/home
        
         | Voloskaya wrote:
         | To be fair, it's not "obviously" better, but it opens a new
         | point on the tradeoff curve. For a lot of use cases full
          | autoregression is clearly better, and for some others full
         | diffusion will still be better.
         | 
          | Autoregression gives high-quality output but is fairly
          | slow. Diffusion gives low-quality output but is quite fast.
         | 
         | This allows you to go in the middle, not as high quality as
         | full autoregression and not as fast as full diffusion, but a
         | balance between both.
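          | 
          | A toy sketch of that middle point: autoregressive across
          | blocks, diffusion-style refinement within each block. The
          | "denoiser" below is a random stand-in, not the paper's
          | model:
          | 
          |     import random
          | 
          |     VOCAB = ["yes", "no", "it", "is", "because", "."]
          |     MASK = "<mask>"
          | 
          |     def denoise_step(prefix, block):
          |         # Stand-in: unmask one position per step. A real
          |         # model would condition on prefix + partial block.
          |         i = random.choice(
          |             [j for j, t in enumerate(block) if t == MASK])
          |         block[i] = random.choice(VOCAB)
          |         return block
          | 
          |     def generate(num_blocks=3, block_size=4):
          |         out = []
          |         for _ in range(num_blocks):      # AR across blocks
          |             block = [MASK] * block_size  # fully noised
          |             for _ in range(block_size):  # refine in place
          |                 block = denoise_step(out, block)
          |             out.extend(block)            # commit the block
          |         return " ".join(out)
          | 
          |     print(generate())
          | 
          | Block size 1 degenerates to token-by-token AR decoding;
          | block size equal to the sequence length is full diffusion.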
        
       | impossiblefork wrote:
       | Isn't this basically the diffusion-autoregressive sampling
       | strategy from the LLaDA paper, maybe more carefully evaluated?
        
         | volodia wrote:
         | the LLaDA paper is a scaled-up version of this paper; they cite
         | it as an anonymous ICLR submission
        
           | impossiblefork wrote:
           | Ah.
        
           | m00x wrote:
           | I'm not sure if this is what you mean, but LLaDA isn't block
           | text diffusion. This is a mix between an autoregressive model
           | and a diffusion model, which is brand new.
        
       | transitivebs wrote:
       | this animation really makes the difference hit home:
       | https://x.com/_akhaliq/status/1900027075370586262
        
         | bbminner wrote:
         | This animation is a perfect abstract!
        
       | sbussard wrote:
       | I love when someone comes up with a good idea that becomes
       | immediately obvious as soon as it is introduced
        
       | 85392_school wrote:
       | Based on the animation, I personally don't expect this to be very
       | helpful. The main way diffusion models help is preventing answers
       | like "No. [proceeds to explain why the answer is yes]", and since
       | the blocks are so small, the LLM can't fully explain before it
       | has to say yes or no.
        
         | prophesi wrote:
         | Could you expound on this? From what I'm reading, this sounds
         | like an issue with diffusion models that their block diffusion
         | model is purposefully designed to mitigate, by conditioning on
         | previous blocks and allowing for larger blocks if that
         | conditioning still doesn't help maintain coherence.
        
           | 85392_school wrote:
           | It's an issue that you run into as long as you're forced to
           | start with a yes/no answer. It's a problem forward-only LLMs
           | have and diffusion models don't, and normal block diffusion
           | is closer to forward LLMs than diffusion models.
           | 
           | You _could_ increase the block size to act more like a full
           | diffusion model, but you would lose some of the benefits of
           | block diffusion.
        
             | throwaway314155 wrote:
             | Interesting. Makes me want to play around with an open
             | diffusion LM. Do you have any recommendations?
        
         | jasonjmcghee wrote:
         | My understanding here is block size can be arbitrarily large,
         | under similar constraints as diffusion models. Is that not the
         | case?
        
       | bondarchuk wrote:
        | Diffusion on images is easy for me to understand: you start
        | with noise, and the model denoises by shifting the pixels
        | towards their final values. What is the equivalent operation
        | for increasing or reducing noise in language here? Is the
        | "noisy" sentence half-way
       | through training or inference sort-of-correct but not really, and
       | at 90% almost-correct but with slightly wrong words
       | (semantically)? Is the noise somehow semantic at all or is it
       | something else?
        
         | littlestymaar wrote:
          | Not familiar at all with diffusion LLMs, but I'd guess
          | you'd have noisy logits.
        
         | simne wrote:
          | For LLMs, the standard method is to give the foundation
          | model a sentence with a word removed (chosen at random, but
          | already known to the model) and ask it to fill in that
          | word.
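          | 
          | Concretely, a minimal sketch of that masking-as-noise idea
          | (toy code for the masked-diffusion view, not any particular
          | paper's implementation):
          | 
          |     import random
          | 
          |     def add_mask_noise(tokens, t):
          |         # Forward "noising": replace each token with <mask>
          |         # independently with probability t. t=0 is the
          |         # clean sentence, t=1 is pure noise (all masks).
          |         return [tok if random.random() > t else "<mask>"
          |                 for tok in tokens]
          | 
          |     sentence = "the cat sat on the mat".split()
          |     for t in (0.25, 0.5, 0.9):
          |         print(t, " ".join(add_mask_noise(sentence, t)))
          | 
          | Training asks the model to recover the original tokens at
          | the masked positions; sampling runs this in reverse, from
          | all masks to text. So a half-denoised sentence isn't a
          | semantically "almost right" one, it's a partially
          | filled-in one.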
        
       | mike978 wrote:
       | https://m-arriola.com/bd3lms/
       | 
       | https://github.com/kuleshov-group/bd3lms
        
       ___________________________________________________________________
       (page generated 2025-03-14 23:00 UTC)