[HN Gopher] Diffusion language models are super data learners
___________________________________________________________________
Diffusion language models are super data learners
Author : babelfish
Score : 121 points
Date : 2025-08-10 16:04 UTC (6 hours ago)
(HTM) web link (jinjieni.notion.site)
(TXT) w3m dump (jinjieni.notion.site)
| woadwarrior01 wrote:
| > During inference, generating sequences ranging from 16 to 4096
| tokens incurs a 16x to 4700x increase in FLOPs compared to AR
| baselines.
|
| I wonder why the increase in FLOPs has such a wide spectrum?
| Naively, I'd have expected the FLOPs to increase linearly with
| the number of tokens. OTOH, it sort of makes sense, because
| diffusion models are not autoregressive, as their name
| suggests.
| ckjellqv wrote:
| My guess is that autoregressive models can use Key Value (KV)
| caching to eliminate most of the FLOPs inside the self-
| attention block. Can't use KV caching inside diffusion (because
| it's not a causal model) but they sell this as a win anyway
| because they believe it leads to better reasoning.
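|
| A rough back-of-the-envelope sketch of that guess (the step
| count and constants are my assumptions, not numbers from the
| article), counting only attention FLOPs:
|
|     # AR with a KV cache: step t attends over t cached positions.
|     def ar_attention_flops(L):
|         return sum(t for t in range(1, L + 1))   # ~ L^2 / 2
|
|     # Diffusion: assume ~L denoising steps, each re-running a
|     # full L x L attention over the whole sequence (no KV cache).
|     def diffusion_attention_flops(L, steps=None):
|         steps = steps if steps is not None else L
|         return steps * L * L
|
|     for L in (16, 4096):
|         r = diffusion_attention_flops(L) / ar_attention_flops(L)
|         print(L, round(r))   # ~30 at L=16, ~8190 at L=4096
|
| Under these assumptions the ratio grows roughly linearly with
| sequence length, which is why the overhead is a wide range rather
| than a single constant; the paper's 16x-4700x presumably reflects
| its own step counts and model sizes.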
| godelski wrote:
| This is interesting but I'm not sure some of the claims can be
| made without some more information. Terms like "downstream task",
| "in/out of distribution" are frequently used in the literature to
| mean many different things[0] and it is hard to know which one
| you mean from context. As a reader I _cannot know_ what is
| in-distribution or not if I have no notion of what the training
| data[1] is. Consequently, I also can't know what the downstream
| tasks are.
|
| Though I'm very confused by this:
|
| > This phenomenon persists for both in-domain and out-of-domain
| training data.
|
| What does it mean for _training data_ to be "out-of-domain"? The
| domain is any valid input into your function. Was this intended
| to be 'out-of-distribution'? I'd still be a bit confused by that
| because
| it makes it sound like you're talking about training and
| validation data, both of which are in distribution.
| > Is validation loss a good metric for AR and DLM
|
| In academic settings, does anyone seriously believe that the
| answer would be yes? I would be extremely concerned if people
| honestly believed that you could use loss as a strong indicator
| for comparing two different architectures[2]. These losses are
| not measuring the things we want to measure, they are proxies of
| them. The architectures themselves are a big part of forming that
| loss landscape. This would be a fine comparison if the metric
| were not a proxy, but since it is, it isn't reliable unless we
| know what the divergence is[3]. This is all fine, but to advance
| as a field we need to remember what we don't know.
|
| Overall, I'm still not sure what is meant by "Super Data
| Learners".
|
| It seems like this is counted by information per parameter? I do
| think there is good discussion in the "causal" attention vs the
| free-form attention of diffusion, but I think there are also some
| potential oversteps in the conclusions here. A lower triangular
| matrix is still full-rank, so there is high representation power
| here, though it is correct that the free form has _more_ (even
| when including the permutation and the untangling via the FFN
| layer in the transformer). I think if this part were highlighted
| more, and more time were spent explaining it, a much stronger
| case could be made. But I think some additional analysis
| is needed to determine if this is a diffusion vs transformer
| thing or triangular attention vs full rank attention thing. From
| a mathematical perspective the second question can be answered
| much more easily, but then there is a larger question about
| training these things because the problem of training free-form
| matrices is that they are... well... free-form. There's actually
| some good discussions about this in the Normalizing Flow
| literature as they work through a similar problem of
| representation power and training/computational efficiencies. I
| think this work has the potential to open up a larger discussion
| for talking about the representation power of different
| architectures. Which, IMO, is a really important topic that
| we need to discuss these days. Though I'm biased since I work on
| neural architectures.
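|
| To make the rank point a bit more concrete, here is a toy check
| (my own illustration, not from the post): a causal mask
| constrains attention to be lower triangular, and a lower-
| triangular matrix with a nonzero diagonal is still full rank;
| what the causal form gives up is degrees of freedom, not rank.
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     n = 8
|     scores = rng.normal(size=(n, n))
|
|     # Row-wise softmax restricted to the causal (lower-triangular)
|     # mask, roughly what masked attention weights look like.
|     masked = np.where(np.tril(np.ones((n, n))) > 0, scores, -np.inf)
|     causal = np.exp(masked) / np.exp(masked).sum(1, keepdims=True)
|     free = np.exp(scores) / np.exp(scores).sum(1, keepdims=True)
|
|     print(np.linalg.matrix_rank(causal),
|           np.linalg.matrix_rank(free))      # both n: full rank
|     print(n * (n + 1) // 2, n * n)          # 36 vs 64 free entries
|
| Same rank, fewer free entries; the open question above is whether
| that extra freedom or the diffusion objective itself is what does
| the work.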
|
| Just for fun ;) Reviewer 2:
| Rating: 4: Borderline accept
| Confidence: 4: You are confident in your assessment, but not
| absolutely certain.
| Limitations: I think this is a sufficient work, but with better
| clarity and some additional analysis (actually do theoretical
| mathematical analysis ;) I think it could be an excellent work
| and have much more impact than it has in its current form. There
| is much more to be said, but hey, we're on HN and this last part
| is being done half jokingly.
|
| [0] Let's say you train on wikipedia and reddit, just with a
| next-token entropy loss. Is coding out-of-distribution? Arguably
| it isn't, because there are code samples in both of those
| datasets.
| It is not even clear if this is OOD by task. It is even unclear
| if we strip out things we can identify as code as we aren't
| necessarily stripping out the discussion of code in natural
| language. We are, after all, talking about learning in extremely
| high dimensional spaces and so these 'little nuances' are rather
| critical in determining what is actually being done. This is
| deeply related to the 'black box' nature of all of this. As a
| clear counter, when training only on Shakespeare there is no
| ambiguity that coding tasks are OOD. I
| also think if you strip literal code from reddit and wiki we
| could say this task is at least not within the main distribution.
|
| [1] Am I understanding correctly that these are the same as the
| referenced [2,3]? Put that experimental-setting section earlier
| in the write-up. I want to look _backwards_ for this type of
| information, not _forward_. Because looking backwards I'll have a
| good idea of where I need to go and will probably have gotten
| some of that information before I start asking lots of
| questions.
|
| [2] I suspect many people do and I do have this extreme concern.
| So I actually appreciate this being included.
|
| [3] Which we can't calculate. After all, we're not using these
| proxy metrics for (just) computational efficiency, we are using
| them because we have no formal (mathematical) definition of our
| true objective metrics. We have no formal definition of "human
| like language" or "correct code given human language inputs".
| BarakWidawsky wrote:
| I wonder how much of this is due to diffusion models having less
| capacity for memorization than autoregressive models.
|
| The autoregressive models consistently show better loss for the
| same number of training tokens.
|
| I find a lot of the conclusions compelling but I would've loved
| to see more epochs of training of the 1B model on the 10B-token
| dataset, as that model _was_ showing epoch-over-epoch
| improvements.
| thesz wrote:
| > I wonder how much of this is due to Diffusion models having
| less capacity for memorization than auto regressive models
|
| Diffusion requires more computational resources than
| autoregressive models; the compute excess is proportional to the
| length of the sequence. Time-dilated RNNs and adaptive
| computation in image recognition hint that we can compute more
| with the same weights and achieve better results.
|
| Which, I believe, also hints at at least one flaw of the study
| under discussion - I did not see that they matched the DLM and AR
| models by compute; they matched them only by weights.
| heyitsguay wrote:
| Do you have references on adaptive methods for image
| recognition?
| godelski wrote:
| I don't have an exact reference, but there are a lot more
| hints that support the claim (compute more with the same
| weights). In fact, I wouldn't even call them hints, since
| they aren't subtle at all. For one, animal brains are
| perfect examples of this. But in the ML space, we could
| think of this purely from the mathematical perspective.
|
| I think it might be confusing because neurons are neurons
| right? And they can only hold so much memory, so what's the
| difference? Well, that difference is architecture and
| training.
|
| Let's think about signals for a moment and to help
| understand this, let's move to small dimensions[0]. Like 2D
| or 3D. (I'll use 3D, but you'll see why this can still ruin
| visualization.) We're talking about universal approximators,
| so we can think of these as finite-length strings with fixed
| end points. Our goal is then to untangle these
| strings. Oh no, this bundle has a knot! We can't actually
| untangle this string just by stretching. We also have a
| rule that we can't cut and glue things. We'd be stuck if we
| didn't have a trick up our sleeves. We can move into a
| higher dimension and untangle these strings there[1]. We'll
| need at least 2N dimensions. To the flatlander this will look
| like a cut, but it isn't.
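|
| The textbook toy version of this "lift to a higher dimension
| to untangle" move (my example, not anything from the study):
| XOR-labelled points can't be separated by a line in 2D, but
| adding one product feature makes them separable by a plane
| in 3D.
|
|     import numpy as np
|
|     # Four XOR points: no line in the plane separates the classes.
|     X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
|     y = np.array([0, 1, 1, 0])
|
|     # Lift to 3D by appending the product feature x1 * x2.
|     X3 = np.hstack([X, (X[:, 0] * X[:, 1])[:, None]])
|
|     # The plane x1 + x2 - 2*x3 = 0.5 separates the classes there.
|     w, b = np.array([1.0, 1.0, -2.0]), -0.5
|     print((X3 @ w + b > 0).astype(int))   # [0 1 1 0], matches y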
|
| The reason this needs to be understood is because we need
| to know where we get those dimensions. It is through
| architecture and training. But let's just think about that
| architecture. When we're learning these relationships we
| need to have the capacity to perform these higher
| dimensional movements, but once we have uncovered the
| relationships we don't necessarily need to. That depends on
| the dimensionality of the relationship itself, not the data.
|
| This is true for all models and is fundamentally why things
| like distillation even work. It is also why the FFN layer
| after attention in the transformer needs to project into a
| higher dimension before returning (the typical choice is 4x,
| and I
| think you can reason why that gives more flexibility than
| 2x). Also related to the latent manifold hypothesis.
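|
| Concretely, the standard block looks roughly like this (a
| generic sketch of common practice, not the paper's exact
| architecture):
|
|     import numpy as np
|
|     d_model, expansion = 512, 4   # 4x expansion is the usual default
|     rng = np.random.default_rng(0)
|     W1 = rng.normal(scale=0.02, size=(d_model, expansion * d_model))
|     W2 = rng.normal(scale=0.02, size=(expansion * d_model, d_model))
|
|     def ffn(x):
|         # lift to 4*d_model, apply the nonlinearity, project back
|         return np.maximum(x @ W1, 0.0) @ W2
|
|     x = rng.normal(size=(10, d_model))   # ten token embeddings
|     print(ffn(x).shape)                  # (10, 512): same width out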
|
| If you ever wondered if math is useful to machine learning,
| I hope this gives some motivation to learn more. You don't
| need math to build good models, but even a little math goes
| a long way to help make better models.
|
| [0] Note, we're doing a significant amount of
| simplification here. There's a lot of depth and complexity
| to all of this but I think this will be sufficient to point
| anyone in (mostly) the right direction.
|
| [1] Think about a Klein bottle. In 4D it can be embedded as
| a single surface without self-intersection. But the 3D
| projection of this shape makes it look like it is intersecting
| itself.
| really visualize the 4D version :(
| godelski wrote:
| > as that model was showing epoch over epoch improvements
|
| Both of them were showing improvements. I agree with you that
| I'd like to see more, but I'm not sure more would significantly
| change the argument (which is a lot about how metrics aren't
| straightforward). Especially given what the 96B-token experiment
| shows.
|
| IN FACT, those results are _so similar_ I had to open them up
| in GIMP to align and spot the differences. Now I'm actually
| not convinced there wasn't a mistake. There are differences,
| just very minor. It is harder to tell with the AR model because
| of the scale, but in the diffusion curves you can see a little
| bump in the second one right before the concavity change at the
| end. There are some more bumps in the AR model earlier on that
| help show differences too, but the fact that the envelopes are
| nearly identical is... suspicious. I'm not claiming
| maliciousness, because even if there is a mistake, these things
| are so easy to make that they are common. I'm not even convinced
| there is a mistake, but
| it warrants extra thinking.
|
| That said, money is finite and these are quite computationally
| heavy. The author looks to be a research fellow, so I'm assuming
| they are not backed by big tech.
| cma wrote:
| > The auto regressive models consistently show better loss for
| the same number of training tokens
|
| I thought bi-directional transformers (non-autoregressive)
| show lower loss than autoregressive ones for the same number of
| training tokens.
| semiinfinitely wrote:
| Results probably just indicate that the AR baseline is fucked.
___________________________________________________________________
(page generated 2025-08-10 23:00 UTC)