[HN Gopher] Strengths and limitations of diffusion language models
___________________________________________________________________
Strengths and limitations of diffusion language models
Author : rbanffy
Score : 52 points
Date : 2025-05-22 10:10 UTC (12 hours ago)
(HTM) web link (www.seangoedecke.com)
| cubefox wrote:
| That's a nice explanation. I wonder whether autoregressive and
| diffusion language models could be combined such that the model
| only denoises the (most recent) end of a sequence of text, like a
| paragraph, while the rest is unchangeable and allows for key-
| value caching.
| gfysfm wrote:
| Hi, I wrote the post. Thank you!
|
| That's how it does work, but unfortunately denoising the last
| paragraph requires computing attention scores for every token
| in that paragraph, which requires checking those tokens against
| every token in the sequence. So it's still much less cacheable
| than the equivalent autoregressive model.
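
To make the caching argument above concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the post or from any real model. The cost model is an assumption: it counts one "attention score" per query-key pair, ignores layers and heads, and treats the frozen prefix as fully cacheable in both cases. The function names and parameters (`prefix_len`, `block_len`, `steps`) are hypothetical.

```python
def autoregressive_scores(prefix_len: int, block_len: int) -> int:
    """Scores computed when generating block_len tokens one at a time.

    With a key-value cache, each new token issues a single query against
    the cached prefix, the tokens generated so far, and itself.
    """
    total = 0
    for i in range(block_len):
        total += prefix_len + i + 1
    return total


def block_diffusion_scores(prefix_len: int, block_len: int, steps: int) -> int:
    """Scores computed when denoising a block of block_len tokens.

    Assumption: every denoising step may change any token in the block,
    so all block_len queries are recomputed against the whole sequence
    (prefix plus block) on every step. Only the prefix keys and values
    can be reused between steps.
    """
    return steps * block_len * (prefix_len + block_len)


if __name__ == "__main__":
    ar = autoregressive_scores(prefix_len=1000, block_len=100)
    diff = block_diffusion_scores(prefix_len=1000, block_len=100, steps=10)
    print(ar)    # 105050
    print(diff)  # 1100000
```

Under these assumptions the denoised block costs roughly ten times as many score computations as the autoregressive one. The sketch counts total work only, not latency: the diffusion model does its work in 10 parallel passes rather than 100 sequential ones.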
| billconan wrote:
| I'm curious, in image generation, flow matching is said to be
| better than diffusion, then why do these language models still
| start from diffusion, instead of jumping to flow matching
| directly?
| gessha wrote:
| This is just a guess, but I think it's because diffusion
| training is more popular, so we've figured out more of the kinks
| with those models. Flow matching models might follow once
| people figure out some of their hyperparameters.
| mountainriver wrote:
| A big discussion on this happened here as well
| https://news.ycombinator.com/item?id=44057820
|
| There is quite a bit of evidence that diffusion models work
| better at reasoning because they don't suffer from early token
| bias.
|
| https://github.com/HKUNLP/diffusion-vs-ar
| https://arxiv.org/html/2410.14157v3
| accrual wrote:
| Great overview. I wonder if we'll start to see more text
| diffusion models from other players, or maybe even a mixture of
| diffusion and transformer models alternating roles behind a
| single UI, depending on the context and request.
| shrubhub wrote:
| The diffusion models are (or can be) transformer models!
| They're just not autoregressive.
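
One way to picture the distinction above is through the attention mask, sketched here in plain Python. This is an illustration under an assumption, not a description of any specific model: it assumes the diffusion model uses full bidirectional attention, while real systems may use other schemes such as block-wise masks. The helper names are hypothetical.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """Autoregressive mask: position i may attend only to positions <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def full_mask(n: int) -> list[list[bool]]:
    """Bidirectional mask: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def visible_positions(mask: list[list[bool]], i: int) -> list[int]:
    """Positions that the token at index i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


if __name__ == "__main__":
    n = 4
    print(visible_positions(causal_mask(n), 1))  # [0, 1]
    print(visible_positions(full_mask(n), 1))    # [0, 1, 2, 3]
```

The same transformer layers can run under either mask. Under the causal mask, earlier positions never depend on later ones, which is what makes their keys and values safe to cache during token-by-token generation.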
___________________________________________________________________
(page generated 2025-05-22 23:01 UTC)