[HN Gopher] Diffusion language models are super data learners
       ___________________________________________________________________
        
       Diffusion language models are super data learners
        
       Author : babelfish
       Score  : 121 points
       Date   : 2025-08-10 16:04 UTC (6 hours ago)
        
 (HTM) web link (jinjieni.notion.site)
 (TXT) w3m dump (jinjieni.notion.site)
        
       | woadwarrior01 wrote:
       | > During inference, generating sequences ranging from 16 to 4096
       | tokens incurs a 16x to 4700x increase in FLOPs compared to AR
       | baselines.
       | 
        | I wonder why the increase in FLOPs spans such a wide range.
        | Naively, I'd have expected the FLOPs to increase linearly with
        | the number of tokens. OTOH, it sort of makes sense, because
        | diffusion models are not autoregressive, as their name
        | suggests.
        
         | ckjellqv wrote:
         | My guess is that autoregressive models can use Key Value (KV)
         | caching to eliminate most of the FLOPs inside the self-
         | attention block. Can't use KV caching inside diffusion (because
         | it's not a causal model) but they sell this as a win anyway
         | because they believe it leads to better reasoning.
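          | 
          | As a rough back-of-the-envelope sketch of why the gap also
          | widens with length (my own toy numbers, not the paper's
          | accounting):
          | 
          |     # Approximate attention FLOPs only, ignoring constants
          |     # and the FFN. AR with a KV cache attends each new token
          |     # to the i tokens already cached; a diffusion model
          |     # re-runs full L x L attention at every denoising step.
          |     def ar_attn_flops(L, d):
          |         return sum(i * d for i in range(1, L + 1))
          | 
          |     def dlm_attn_flops(L, d, steps):
          |         return steps * L * L * d
          | 
          |     L, d = 4096, 2048        # sequence length, hidden size
          |     steps = L                # e.g. unmask one token per step
          |     print(dlm_attn_flops(L, d, steps) / ar_attn_flops(L, d))
          |     # ~2 * steps: the ratio scales with the step count
          | 
          | If the number of denoising steps grows with the sequence
          | length, the ratio grows too, which would explain a wide spread
          | like 16x-4700x instead of a single constant factor.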
        
       | godelski wrote:
        | This is interesting, but I'm not sure some of the claims can be
        | made without more information. Terms like "downstream task" and
        | "in/out of distribution" are frequently used in the literature to
        | mean many different things[0], and it is hard to know which one
        | you mean from context. As a reader I _cannot know_ what is in-
        | distribution or not if I have no notion of what the training
        | data[1] is. Consequently, I also can't know what the downstream
        | tasks are.
       | 
        | Though I'm very confused by this:
        | 
        | > This phenomenon persists for both in-domain and out-of-domain
        | training data.
       | 
        | What does it mean for _training data_ to be "out-of-domain"? The
        | domain is any valid input into your function. Was this intended
        | to be "out-of-distribution"? I'd still be a bit confused by that,
        | because it makes it sound like you're talking about training and
        | validation data, both of which are in distribution.
        | 
        | > Is validation loss a good metric for AR and DLM
       | 
        | In academic settings, does anyone seriously believe that the
        | answer would be yes? I would be extremely concerned if people
        | honestly believed that you could use loss as a strong indicator
        | for comparing two different architectures[2]. These losses are
        | not measuring the things we want to measure; they are proxies for
        | them, and the architectures themselves are a big part of forming
        | that loss landscape. This would be a fine comparison if the
        | metric were not a proxy, but since it is, it isn't reliable
        | unless we know what the divergence is[3]. This is all fine, but
        | to advance as a field we need to remember what we don't know.
       | 
       | Overall, I'm still not sure what is meant by "Super Data
       | Learners".
       | 
        | It seems like this is measured by information per parameter? I do
        | think there is good discussion in the "causal" attention vs. the
        | free-form attention of diffusion, but I think there are also some
        | potential oversteps in the conclusions here. A lower triangular
        | matrix is still full-rank, so there is high representational
        | power there, though it is correct that the free-form version has
        | _more_ (even when including the permutation and the untangling
        | via the FFN layer in the transformer). If this part were
        | highlighted more and more time were spent explaining it, a much
        | stronger case could be made. But I think some additional analysis
        | is needed to determine whether this is a diffusion-vs-AR thing or
        | a triangular-attention-vs-full-attention thing. From a
        | mathematical perspective the second question can be answered much
        | more easily, but then there is a larger question about training
        | these things, because the problem with training free-form
        | matrices is that they are... well... free form. There's actually
        | some good discussion of this in the normalizing flow literature,
        | as they work through a similar problem of representational power
        | vs. training/computational efficiency. I think this work has the
        | potential to open up a larger discussion about the
        | representational power of different architectures, which, IMO, is
        | a really important topic these days. Though I'm biased, since I
        | work on neural architectures.
       | 
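        | As a quick sanity check on the full-rank claim above (a toy
        | numpy sketch of my own, nothing from the post itself):
        | 
        |     import numpy as np
        | 
        |     n = 8
        |     # Causal "attention pattern": lower triangular with a
        |     # nonzero diagonal, e.g. uniform weights over each prefix.
        |     A = np.tril(np.ones((n, n))) / np.arange(1, n + 1)[:, None]
        |     print(np.linalg.matrix_rank(A))  # prints n, i.e. full rank
        | 
        |     # A dense (bidirectional) attention matrix is also
        |     # generically full rank; the difference is in the *set* of
        |     # maps each family can express, not the rank of any single
        |     # member.
        | 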
        | Just for fun ;)
        | 
        |     Reviewer 2:
        |     Rating: 4: Borderline accept
        |     Confidence: 4: You are confident in your assessment, but
        |       not absolutely certain.
        |     Limitations: I think this is a sufficient work but with
        |       better clarity and some additional analysis (actually do
        |       theoretical mathematical analysis ;) I think it could be
        |       an excellent work and have much more impact than it has
        |       in its current form.
        | 
        | There is much more to be said, but hey, we're on HN and this last
        | part is being done half jokingly.
       | 
        | [0] Let's say you train on Wikipedia and Reddit with a plain
        | next-token objective. Is coding out-of-distribution? Arguably it
        | isn't, because there are code samples in both of those datasets.
        | It is not even clear that this is OOD by task. It is even unclear
        | if we strip out everything we can identify as code, since we
        | aren't necessarily stripping out the discussion of code in
        | natural language. We are, after all, talking about learning in
        | extremely high dimensional spaces, and so these 'little nuances'
        | are rather critical in determining what is actually being done.
        | This is deeply related to the 'black box' nature of all of this.
        | As a clear counter-example, I don't think there is any ambiguity
        | that coding tasks are OOD when training on Shakespeare. I also
        | think that if you strip literal code from Reddit and Wikipedia we
        | could say this task is at least not within the main distribution.
       | 
        | [1] Am I understanding correctly that these are the same as the
        | referenced [2,3]? Put that experimental-setting section earlier.
        | I want to look _backwards_ for this type of information, not
        | _forward_, because looking backwards I'll have a good idea of
        | where I need to go and will probably have picked up some of that
        | information before I start asking lots of questions.
       | 
       | [2] I suspect many people do and I do have this extreme concern.
       | So I actually appreciate this being included.
       | 
       | [3] Which we can't calculate. After all, we're not using these
       | proxy metrics for (just) computational efficiency, we are using
       | them because we have no formal (mathematical) definition of our
       | true objective metrics. We have no formal definition of "human
       | like language" or "correct code given human language inputs".
        
       | BarakWidawsky wrote:
        | I wonder how much of this is due to diffusion models having less
        | capacity for memorization than autoregressive models.
        | 
        | The autoregressive models consistently show better loss for the
        | same number of training tokens.
        | 
        | I find a lot of the conclusions compelling, but I would've loved
        | to see more epochs of training on the 1B model with the 10B-token
        | dataset, as that model _was_ showing epoch-over-epoch
        | improvements.
        
         | thesz wrote:
          | > I wonder how much of this is due to diffusion models having
          | less capacity for memorization than autoregressive models
          | 
          | Diffusion requires more computational resources than
          | autoregressive models; the compute excess is proportional to
          | the sequence length. Time-dilated RNNs and adaptive computation
          | in image recognition hint that we can compute more with the
          | same weights and achieve better results.
          | 
          | Which, I believe, also hints at at least one flaw of the TS
          | study: I did not see that they matched DLM and AR by compute;
          | they matched them only by weights.
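          | 
          | A crude way to see what compute-matching would mean (my own
          | sketch using the common ~2 * params FLOPs-per-token forward
          | estimate; none of these numbers come from the post):
          | 
          |     def ar_gen_flops(params, seq_len):
          |         # one forward pass per generated token (with a KV
          |         # cache), ignoring the quadratic attention term
          |         return 2 * params * seq_len
          | 
          |     def dlm_gen_flops(params, seq_len, steps):
          |         # each denoising step is a full pass over the sequence
          |         return 2 * params * seq_len * steps
          | 
          |     P, L = 1_000_000_000, 1024
          |     print(dlm_gen_flops(P, L, steps=L) / ar_gen_flops(P, L))
          |     # -> 1024: a weight-matched AR baseline gets far less
          |     # inference compute than the DLM it is compared against.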
        
           | heyitsguay wrote:
           | Do you have references on adaptive methods for image
           | recognition?
        
             | godelski wrote:
              | I don't have an exact reference, but there is a lot more
              | evidence for the claim (compute more with the same
              | weights). In fact, I wouldn't even call it hints, since
              | it isn't subtle at all. For one, animal brains are
              | perfect examples of this. But in the ML space, we can
              | think about this purely from the mathematical
              | perspective.
             | 
             | I think it might be confusing because neurons are neurons
             | right? And they can only hold so much memory, so what's the
             | difference? Well, that difference is architecture and
             | training.
             | 
              | Let's think about signals for a moment, and to help
              | understand this, let's move to small dimensions[0], like
              | 2D or 3D. (I'll use 3D, but you'll see why even this
              | breaks visualization.) We're talking about universal
              | approximators, so we can think of these as finite-length
              | strings with fixed end points. Our goal is then to
              | untangle these strings. Oh no, this bundle has a knot!
              | We can't actually untangle this string just by
              | stretching, and we also have a rule that we can't cut
              | and glue things. We'd be stuck if we didn't have a trick
              | up our sleeves: we can move into a higher dimension and
              | untangle the strings there[1]. We'll need at least 2N
              | dimensions. To the flatlander this will look like a cut,
              | but it isn't.
             | 
              | The reason this needs to be understood is that we need
              | to know where we get those dimensions: through
              | architecture and training. But let's just think about
              | the architecture. While we're learning these
              | relationships we need the capacity to perform these
              | higher-dimensional movements, but once we've uncovered
              | the relationships we don't necessarily need it anymore.
              | What matters is the dimensionality of the relationship
              | itself, not of the data.
             | 
              | This is true for all models and is fundamentally why
              | things like distillation even work. It is also why the
              | FFN layer after attention in the transformer projects
              | into a higher dimension before coming back down (the
              | typical expansion is 4x, and I think you can reason
              | about why that gives more flexibility than 2x). This is
              | also related to the manifold hypothesis.
             | 
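              | Concretely, the FFN expansion I mean looks roughly like
              | this (a generic PyTorch sketch of the standard block,
              | not any particular model's code):
              | 
              |     import torch.nn as nn
              | 
              |     class FFN(nn.Module):
              |         def __init__(self, d, mult=4):
              |             super().__init__()
              |             # widen, apply a nonlinearity, project back
              |             self.up = nn.Linear(d, mult * d)
              |             self.act = nn.GELU()
              |             self.down = nn.Linear(mult * d, d)
              | 
              |         def forward(self, x):
              |             return self.down(self.act(self.up(x)))
              | 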
             | If you ever wondered if math is useful to machine learning,
             | I hope this gives some motivation to learn more. You don't
             | need math to build good models, but even a little math goes
             | a long way to help make better models.
             | 
             | [0] Note, we're doing a significant amount of
             | simplification here. There's a lot of depth and complexity
             | to all of this but I think this will be sufficient to point
             | anyone in (mostly) the right direction.
             | 
             | [1] Think about a Klein bottle. In 4D it has a single
             | surface. But the 3D projection of this shape makes it look
             | like it is intersecting itself. Unfortunately we can't
             | really visualize the 4D version :(
        
         | godelski wrote:
          | > as that model was showing epoch-over-epoch improvements
          | 
          | Both of them were showing improvements. I agree with you that
          | I'd like to see more, but I'm not sure more would significantly
          | change the argument (which is largely about how metrics aren't
          | straightforward), especially given what the 96B-token
          | experiment shows.
         | 
          | IN FACT, those results are _so similar_ I had to open them up
          | in GIMP to align and spot the differences. Now I'm actually
          | not convinced there wasn't a mistake. There are differences,
          | just very minor ones. It's harder to tell with the AR model
          | because of the scale, but in the diffusion plot you can see a
          | little bump in the second one right before the concavity
          | change at the end. There are some more bumps in the AR model
          | earlier on that help show differences too, but the fact that
          | the envelopes are nearly identical is... suspicious. I'm not
          | claiming malice, because even if this is a mistake, these
          | things are so easy to make that they are common. I'm not even
          | convinced there is a mistake, but it warrants extra thinking.
         | 
          | That said, money is finite and these experiments are quite
          | computationally heavy. The author looks to be a research
          | fellow, so I'm assuming they're not backed by big tech.
        
         | cma wrote:
          | > The autoregressive models consistently show better loss for
          | the same number of training tokens
          | 
          | I thought bidirectional transformers (non-autoregressive) show
          | lower loss than autoregressive ones for the same number of
          | training tokens.
        
       | semiinfinitely wrote:
        | Results probably just indicate that the AR baseline is fucked.
        
       ___________________________________________________________________
       (page generated 2025-08-10 23:00 UTC)