[HN Gopher] Strengths and limitations of diffusion language models
       ___________________________________________________________________
        
       Strengths and limitations of diffusion language models
        
       Author : rbanffy
       Score  : 52 points
       Date   : 2025-05-22 10:10 UTC (12 hours ago)
        
 (HTM) web link (www.seangoedecke.com)
 (TXT) w3m dump (www.seangoedecke.com)
        
       | cubefox wrote:
       | That's a nice explanation. I wonder whether autoregressive and
       | diffusion language models could be combined such that the model
       | only denoises the (most recent) end of a sequence of text, like a
       | paragraph, while the rest is unchangeable and allows for key-
       | value caching.
        
         | gfysfm wrote:
         | Hi, I wrote the post. Thank you!
         | 
         | That is how it works, but unfortunately denoising the last
         | paragraph requires computing attention scores for every token
         | in that paragraph, and each of those tokens has to be checked
         | against every token in the sequence. So it's still much less
         | cacheable than the equivalent autoregressive model.
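         | 
         | A rough way to see the gap is to count attention scores. The
         | sketch below is an illustration only: the function names, the
         | cost model (one score per query/key pair, prefix keys and
         | values cached, the block re-denoised for a fixed number of
         | steps) and the example sizes are assumptions, not figures from
         | the post.

```python
def ar_attention_scores(prefix_len: int, block_len: int) -> int:
    """Scores computed by a KV-cached autoregressive model (assumed cost model)."""
    # New token i (0-based) issues a single query against the cached prefix,
    # the i tokens generated before it, and itself.
    return sum(prefix_len + i + 1 for i in range(block_len))


def block_diffusion_attention_scores(prefix_len: int, block_len: int, steps: int) -> int:
    """Scores computed when only the final block is denoised (assumed cost model)."""
    # At every denoising step, each block token queries the cached prefix
    # plus the whole block, because the block's tokens changed since the last step.
    per_step = block_len * (prefix_len + block_len)
    return steps * per_step


if __name__ == "__main__":
    # Hypothetical sizes: 1000 frozen tokens, a 100-token block, 10 denoising steps.
    ar = ar_attention_scores(1000, 100)
    diff = block_diffusion_attention_scores(1000, 100, 10)
    print(ar, diff, round(diff / ar, 1))
```

         | With these assumed sizes the block denoiser computes roughly
         | ten times as many scores, because under this model nothing
         | inside the block is reused between steps.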
        
       | billconan wrote:
       | I'm curious: in image generation, flow matching is said to be
       | better than diffusion, so why do these language models still
       | start from diffusion instead of jumping directly to flow
       | matching?
        
         | gessha wrote:
         | This is just a guess, but I think it's because diffusion
         | training is more popular, so we've worked out more of the kinks
         | with those models. Flow matching models might follow once some
         | of their hyperparameters are figured out.
        
       | mountainriver wrote:
       | There was also a big discussion of this here:
       | https://news.ycombinator.com/item?id=44057820
       | 
       | There is quite a bit of evidence that diffusion models work
       | better at reasoning because they don't suffer from early token
       | bias.
       | 
       | https://github.com/HKUNLP/diffusion-vs-ar
       | https://arxiv.org/html/2410.14157v3
        
       | accrual wrote:
       | Great overview. I wonder if we'll start to see more text
       | diffusion models from other players, or maybe even a mixture of
       | diffusion and transformer models alternating roles behind a
       | single UI, depending on the context and request.
        
         | shrubhub wrote:
         | The diffusion models are (or can be) transformer models!
         | They're just not autoregressive.
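         | 
         | One way to picture the difference is the attention mask. This
         | is a minimal sketch, assuming the diffusion denoiser uses full
         | (bidirectional) attention; that assumption and the helper names
         | are illustrative and not taken from the article.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """Autoregressive mask: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> list[list[bool]]:
    """Assumed diffusion-style mask: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


if __name__ == "__main__":
    # Print both masks for a 4-token sequence; 'x' marks an allowed attention pair.
    for name, mask in (("causal", causal_mask(4)), ("bidirectional", bidirectional_mask(4))):
        print(name)
        for row in mask:
            print(" ".join("x" if allowed else "." for allowed in row))
```

         | The transformer layers can be the same in both cases; only the
         | mask and the training objective differ.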
        
       ___________________________________________________________________
       (page generated 2025-05-22 23:01 UTC)