[HN Gopher] Why I find diffusion models interesting?
___________________________________________________________________
Why I find diffusion models interesting?
Author : whoami_nr
Score : 186 points
Date : 2025-03-06 22:35 UTC (1 day ago)
(HTM) web link (rnikhil.com)
(TXT) w3m dump (rnikhil.com)
| mistrial9 wrote:
| this is the huggingface page
| https://huggingface.co/papers/2502.09992
| jacobn wrote:
| The animation on the page looks an awful lot like autoregressive
| inference in that virtually all of the tokens are predicted in
| order? But I guess it doesn't have to do that in the general
| case?
| creata wrote:
| The example in the linked demo[0] seems less left-to-right.
|
| Anyway, I think we'd expect it to usually be more-or-less left-
| to-right -- We usually decide what to write or speak left-to-
| right, too, and we don't seem to suffer much for it.
|
| (Unrelated: it's funny that the example generated code has a
| variable "my array" with a space in it.)
|
| [0]: https://ml-gsai.github.io/LLaDA-demo/
| whoami_nr wrote:
| Yeah, but you can backtrack your thinking. You also have an
| inner voice to plan out the next couple of words, reflect, and
| self-correct before uttering them.
| frotaur wrote:
| Very related : https://arxiv.org/abs/2401.17505
| whoami_nr wrote:
| So, in practice there are some limitations here. Chat
| interfaces force you to feed the entire context to the model
| every time you ping it. Multi-step tool calls have a similar
| thing going. So, yeah, we may effectively turn all of this
| into autoregressive models too.
| vinkelhake wrote:
| I don't get where the author is coming from with the idea that a
| diffusion-based LLM would hallucinate less.
|
| > dLLMs can generate certain important portions first, validate
| it, and then continue the rest of the generation.
|
| If you pause the animation in the linked tweet (not the one on
| the page), you can see that the intermediate versions are full
| of, well, baloney.
|
| (and anyone who has messed around with diffusion-based image
| generation knows the models are perfectly happy to hallucinate).
| whoami_nr wrote:
| The LLaDA paper (https://ml-gsai.github.io/LLaDA-demo/)
| implied strong bidirectional reasoning capabilities and
| improved performance on reversal tasks (where the model needs
| to reason backwards).
|
| I made a logical leap from there.
| gdiamos wrote:
| Bidirectional seq2seq models are usually more accurate than
| unidirectional models.
|
| However, autoregressive models that generate one token at a
| time are usually more accurate than parallel models that
| generate multiple tokens at a time.
|
| In diffusion LLMs, both of these effects interact. You can
| trade them off by determining how many tokens are generated at
| a time, and how many future tokens are used to predict the next
| set of tokens.
| markisus wrote:
| Regarding faulty intermediate versions, I think that's the
| point. The diffusion process can correct wrong tokens when the
| global state implies it.
| evrydayhustling wrote:
| I think the discussion here is confusing the algorithm for
| the output. It's true that diffusion can rewrite tokens
| during generation, but it is doing so for consistency with
| the evolving output -- not "accuracy". I'm unaware of any
| research which shows that the final product, when iteration
| stops, is less likely to contain hallucinations than with
| autoregression.
|
| With that said, I'm still excited about diffusion -- if it
| offers different cost points, and different interaction modes
| with generated text, it will be useful.
| Legend2440 wrote:
| Hallucination is probably a feature of statistical prediction
| as a whole, not any particular architecture of neural network.
| mitthrowaway2 wrote:
| I'm not sure about hallucination about facts, but it might be
| less prone to logically inconsistent statements of the form
| "the sky is red because[...] and that's why the sky is blue".
| gdiamos wrote:
| I think these models would get interesting at extreme scale.
| Generate a novel in 40 iterations on a rack of GPUs.
|
| At some point in the future, you will be able to autogen a 10M
| line codebase in a few seconds on a giant GPU cluster.
| gdiamos wrote:
| Diffusion LLMs also follow scaling laws -
| https://proceedings.neurips.cc/paper_files/paper/2023/file/3...
| esperent wrote:
| Is it possible that combining multiple AIs will be able to
| _somewhat_ bypass scaling laws, in a similar way that
| multicore CPUs can _somewhat_ bypass the limitations of a
| single CPU core?
| gdiamos wrote:
| I'm sure there are ways of bypassing scaling laws, but I
| think we need more research to discover and validate them
| impossiblefork wrote:
| Those aren't the modern type with discrete masking-based
| diffusion though.
|
| Of course, these too will have scaling laws.
| nthingtohide wrote:
| I read a Wikipedia article about a person who was very
| intelligent but also suffered from a mental illness. He told
| people around him that his next novel would be exactly N words
| and would end with the sentence P.
|
| I don't remember the article. I read it a decade ago. It's
| like he was doing diffusion in his mind, subconsciously
| perhaps.
| eru wrote:
| Seems pretty easy to achieve if you have text editing
| software that tells you the number of words written so far?
| Philpax wrote:
| I know the r-word is coming back in vogue, but it was still
| unpleasant to see it in the middle of an otherwise technical blog
| post. Ah well.
|
| Diffusion LMs are interesting and I'm looking forward to seeing
| how they develop, but from playing around with that model, it's
| GPT-2 level. I suspect it will need to be significantly scaled up
| before we can meaningfully compare it to the autoregressive
| paradigm.
| mountainriver wrote:
| Meta has one based on flow matching that is bigger; it
| performs pretty well.
| gsf_emergency_2 wrote:
| A possible detente between SCHMIDHUBER & the school of Yann
| LeCun?
|
| https://doi.org/10.1103/PhysRevLett.129.228004
| gsf_emergency_2 wrote:
| I've got a couple more related snowclones..
|
| _Sufficiently humorous sneering is indistinguishable from
| progress_
|
| _Sufficiently high social status is indistinguishable from
| wisdom_
| gsf_emergency_2 wrote:
| Sufficiently profane reasoning is indistinguishable from
| autoregression
|
| Sufficiently anti-regressive compression is indistinguishable
| from sentience (--maybe the SCHMIDHUBER)
|
| https://psycnet.apa.org/record/2007-12667-001
| IncreasePosts wrote:
| Retarded is too good of a word to go unused. It feels super
| wrong to call a mentally disabled person retarded or a retard.
| And we're told we can't call stupid things retarded. So who
| gets to use it? No one?
|
| With gay, on the other hand, gay people call each other gay and
| are usually okay being labeled as gay. So, it's still in use,
| and I think it's fine to push back against using it to mean
| "lame" or whatever.
|
| Finally, you should keep in mind that the author may not be
| American or familiar with American social trends. "Retarded"
| might be just fine in South Africa or Australia (I don't know).
| Similar to how very few Americans would bat an eye at someone
| using the phrase "spaz out", whereas it is viewed as very
| offensive in England.
| kazinator wrote:
| If you have a burning urge to use "retarded" with complete
| dick-o-matic immunity, try a sentence like, "the flame
| retardant chemical successfully retarded the spread of the
| fire". You may singe a few eyebrows, that's about it.
| billab995 wrote:
| Might seem like a descriptive word but the fact is, it's
| hurtful to people who are working harder to make their way in
| life than I'll ever have to. Even when just heard in passing.
|
| Why do things in life that will hurt someone who'll likely
| just retreat away rather than confront you? Be the good guy.
| mitthrowaway2 wrote:
| That's the euphemism treadmill though, isn't it? "Retard"
| literally means late or delayed (hence French: _en retard_
| ). Back when it was originally introduced to refer to a
| handicap, it was chosen for that reason to be a kind,
| polite, and indirect phrasing. That will also be the fate
| of any new terms that we choose. Hence for example in
| physics the term _retarded potential_
| (https://en.wikipedia.org/wiki/Retarded_potential) was
| chosen to refer to the delaying effect of the speed of
| light on electromagnetic fields, before the word had any
| association with mental disability.
|
| Words don't need to retain intrinsic hurtfulness; their
| hurtfulness comes from their usage, and the hurtful intent
| with which they are spoken. We don't need to yield those
| words to make them the property of 1990s schoolyard bullies
| in perpetual ownership.
|
| To that extent I'd still say this article's usage is not
| great.
| barrkel wrote:
| > Words don't need to retain intrinsic hurtfulness; their
| hurtfulness comes from their usage, and the hurtful
| intent with which they are spoken.
|
| Yes; and a rose by any other name would smell as sweet.
|
| Words don't need to retain intrinsic hurtfulness, but
| it's not quite right that the hurtfulness comes from the
| usage either. The hurtfulness comes from the actual
| referent, combined with intent.
|
| If I tell someone they are idiotic, imbecilic, moronic,
| mentally retarded, mentally handicapped, mentally
| challenged, I am merely iterating through a historical
| list of words and phrases used to describe the same real
| thing in the world. The hurt fundamentally comes from
| describing someone of sound mind as if they are not. We
| all know that we don't want to have a cognitive
| disability, given a choice, nor to be thought as if we
| had.
|
| The euphemism treadmill tries to pretend that the
| referent isn't an undignified position to be in. But
| because it fundamentally is, no matter what words are
| used, they can still be used to insult.
| t-3 wrote:
| Any word used to describe intellectual disability would be
| just as hurtful, at least when given enough time to enter
| the vernacular. That's just how language and society works.
| Children especially can call each other anything and make
| it offensive, because bullying and cliquish behavior is
| very natural and it's hard to train actual politeness and
| empathy into people in authoritarian environments like
| schools.
| billab995 wrote:
| You're right, it's the intent that matters. <any_word>,
| used to describe something stupid or negative while also
| being an outdated description for a specific group of
| people...
|
| The fact is, it's _that_ word that's evolved into
| something hurtful. So rather than be the guy who sticks
| up for the_word and try to convince everyone it shouldn't be
| hurtful, I just decided to stop using it. The reason why
| I stopped was seeing first hand how it affected someone
| with Down Syndrome who heard me saying it. Sometimes real
| life beats theoretical debate. It's something I still
| feel shame about nearly 20 years later.
|
| It wasn't a particularly onerous decision to stop using
| it, or one that opened the floodgate of other words to be
| 'banned'. And if someone uses it and hasn't realized
| that, then move on - just avoid using it next time. Not a
| big deal. It's the obnoxious, purposefully hurtful use of
| it that's not great (which doesn't seem to be the case
| here tbh). It's the intent that matters more.
| whoami_nr wrote:
| Yes, I am not American and I had no clue about the
| connotations.
| echelon wrote:
| > I know the r-word is coming back in vogue
|
| This is so utterly fascinating to watch.
|
| Three years ago this would have cost you your job. Now
| everybody's back at it again.
|
| What is happening?
| esperent wrote:
| For anyone else confused, this "r-word" is "retarded".
|
| They're not talking about a human. To me that makes it feel
| very different.
|
| However, there's also a large component coming from the
| current political situation. People feel more confident to
| push back against things like the policing of word usage.
| They're less likely to get "cancelled" now. They feel more
| confident that the zeitgeist is on their side now. They're
| probably right.
| bongodongobob wrote:
| Eh, I'm as left as they come and I'm tired of pretending
| that banning words solves anything. Who's offended? Why? Do
| you have a group of retarded friends you hang out with on
| the regular? Are they reading the article? No and no. Let's
| not pretend that changing the term to differently abled
| or whatever has any meaning. It doesn't. It's a handful of
| loud people (usually well off white women) on social media
| dictating what is and isn't ok. Phrases like "temporarily
| unhoused" rather than homeless is another good way to
| pretend to be taking action when you're doing less than
| nothing. Fight for policy, not changing words.
| esperent wrote:
| > I'm as left as they come and I'm tired of pretending
| that banning words solves anything. Who's offended? Why?
|
| I'm with you on this, also speaking as a strong leftist.
|
| I do think that "banning" , or at least strongly
| condemning, the use of words when the specific group
| being slurred are clear that they consider it a slur and
| want it to stop is reasonable. But not when it's social
| justice warriors getting offended on behalf of other
| people.
|
| However, I think it's absolutely ridiculous that even
| when discussing the banning of these words, we're not
| allowed to use them directly. We are supposed to say
| "n-word", "r-word" even when discussing in an academic
| sense. Utter nonsense, it's as if saying these words out
| loud would conjure a demon.
| imtringued wrote:
| The point of these meaningless dictionary changes isn't
| to solve anything. It's to give plausible deniability to
| asshole behaviour through virtue signalling.
|
| Crazy assholes will argue along the lines that it is an
| insignificant inconvenience and hence anyone who uses the
| old language must use it maliciously and on purpose,
| because they are ableist, racist or whatever.
|
| This then gives assholes the justification to behave like
| a bigot towards the allegedly ableist person. The goal
| is to dress up your own abusive bullying as virtuous,
| even though deep down you don't actually care about
| disabled people.
| esperent wrote:
| This is an interesting take, and I think it's not
| unreasonable to label the worst of the social justice
| warriors as assholes.
|
| However, most of them are well meaning. They're misguided
| rather than assholes. They really do want to take action
| for social improvement. It's just that real change is too
| hard and requires messy things like protesting on the
| street or getting involved in politics and law. So, they
| fall back on things like policing words, or calling out
| perceived bad actors, which they can do from the comfort
| of their homes via the internet.
|
| To be fair, some genuinely bad people have been
| "cancelled". The "me too" movement didn't happen without
| reason. It's just that it went too far, and started
| ignoring pesky things like evidence, or innocent until
| proven otherwise.
| bloomingkales wrote:
| _Do you have a group of retarded friends you hang out
| with on the regular?_
|
| I should not have laughed at this.
| Uehreka wrote:
| Yes and yes? I'm an AI enthusiast interested in the
| article and I'm offended by that word for pretty non-
| hypothetical reasons. When I was in middle school I was
| bullied a lot by people who would repeatedly call me the
| r-slur. That word reminds me of some of the most shameful
| and humiliating moments of my life. If I hear someone use
| it out of nowhere it makes me wince. Seeing it written
| down isn't as bad, but I definitely would prefer people
| phased it out of their repertoire.
| inverted_flag wrote:
| The zeitgeist is shifting away from "wokeness" and people are
| testing the waters trying to see what they can get away with
| saying now.
| exe34 wrote:
| Elon Musk made it cool again.
| kelseyfrog wrote:
| I'm personally happy to see effort in this space simply because I
| think it's an interesting set of tradeoffs (compute ↔ accuracy)
| - a departure from the fixed next token compute budget required
| now.
|
| It brings up interesting questions, like what's the equivalency
| between smaller diffusion models which consume more compute
| because they have a greater number of diffusion steps compared to
| larger traditional LLMs which essentially have a single step. How
| effective is decoupling the context window size from the diffusion
| window size? Is there an optimum ratio?
| machiaweliczny wrote:
| I actually think that diffusion LLMs will be best for code
| generation
| billab995 wrote:
| Stopped reading at the r word. Do better.
| mountainriver wrote:
| The most interesting thing about diffusion LMs, which tends to
| be missed, is their ability to edit early tokens.
|
| We know that the early tokens in an autoregressive sequence
| disproportionately bias the outcome. I would go as far as to say
| some of the magic of reasoning models is that they generate so
| much text that they can kinda get around this.
|
| However, diffusion seems like a much better way to solve this
| problem.
| kgeist wrote:
| But how can test-time compute be implemented for diffusion
| models if they already operate on the whole text at once? Say
| it gets stuck--how does it proceed further? Autoregressive
| reasoning models would simply backtrack and try other
| approaches. It feels like denoising the whole text further
| wouldn't lead to good results, but I may be wrong.
| eru wrote:
| Perhaps do a couple of independent runs, and then combine
| them afterwards?
| spwa4 wrote:
| Diffusion LLMs are still residual networks. You can Google
| that, but it means they don't produce the finished text in a
| single pass. Every layer generates corrections to be applied
| to the whole text at once.
|
| Think of it like writing a text by forcing your teacher to
| write it for you by submitting the assignment 100 times. You
| begin by generating completely inaccurate text, almost
| random, that perhaps leans a little bit towards the answer.
| Then you systematically begin to correct small parts of the
| text. The teacher sees the text and uses the red pen to
| correct a bunch of things. Then the corrected text is copied
| onto a fresh page and resubmitted to the teacher. And again.
| And again. And again. And again. 50 times. 100 times. That's
| how diffusion models work.
|
| Technically, it adds your corrections to the text, but that's
| mathematical addition, not adding at the end. Also
| technically every layer is a teacher that's slightly
| different from the previous one. And and and ... but this is
| the basic principle. The big advantage is that this makes
| neural networks slowly lean towards the answer. First they
| decide to have 3 sections, one each about X, Y, and Z, then
| they decide which sentences to include, then they start
| thinking about individual words, then they start worrying
| about things like grammar, and finally about spelling and
| pronouns and ...
|
| So to answer your question: diffusion networks can at any
| time decide to send out a correction that effectively erases
| the text (in several ways). So they can always start over by
| just correcting everything all at once back to randomness.
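|
| (A very rough sketch of that "red pen" loop, purely
| illustrative: propose_corrections() below is a made-up
| stand-in for the model, not the actual LLaDA procedure. The
| point is just that every pass may overwrite tokens written on
| earlier passes, so nothing is final until the last step.)
|
|     import random
|
|     MASK = "<mask>"
|
|     def propose_corrections(draft, step):
|         # Stand-in for a real denoiser. It fills some masked
|         # slots and occasionally revises a token it already
|         # wrote (the "red pen"); a real model can also re-mask.
|         edits = {}
|         for i, tok in enumerate(draft):
|             if tok == MASK and random.random() < 0.5:
|                 edits[i] = f"tok{step}_{i}"
|             elif tok != MASK and random.random() < 0.1:
|                 edits[i] = tok + "'"
|         return edits
|
|     def denoise(length, steps):
|         draft = [MASK] * length   # fully "noisy" starting point
|         for step in range(steps):
|             for i, tok in propose_corrections(draft, step).items():
|                 draft[i] = tok    # corrections overwrite old text
|         return draft
|
|     print(denoise(length=6, steps=8))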
| kgeist wrote:
| Yeah, but with autoregressive models, the state grows,
| whereas with diffusion models, it remains fixed. As a
| result, a diffusion model can't access its past thoughts
| (e.g., thoughts that rejected certain dead ends) and may
| start oscillating between the same subpar results if you
| keep denoising multiple times.
| ithkuil wrote:
| Yeah reasoning models are "self-doubt" models.
|
| The model is trained to encourage re-evaluating the soundness
| of tokens produced during the "thinking phase".
|
| The model state vector is kept in a state of open
| exploration, influenced by the already emitted tokens but
| less strongly so.
|
| The non-reasoning models were just trained with the goal of
| producing useful output on a first try and they did their best
| to maximize that fitness function.
| kazinator wrote:
| Interestingly, that animation at the end _mainly_ proceeds from
| left to right, with just some occasional exceptions.
|
| So I followed the link, and gave the model this bit of
| conversation starter:
|
| > _You still go mostly left to right._
|
| The denoising animation it generated went like this:
|
| > [Yes] [.] [MASK] [MASK] [MASK] ... [MASK]
|
| and proceeded by deletion of the mask elements on the right one
| by one, leaving just the "Yes.".
|
| :)
| DeathArrow wrote:
| That got me thinking that it would be nice to have something like
| ComfyUI to work with diffusion-based LLMs. Apply LoRAs, use
| multiple inputs, have multiple outputs.
|
| Something akin to ComfyUI but for LLMs would open up a world of
| possibilities.
| hdjrudni wrote:
| Maybe not even 'akin' but literally ComfyUI. Comfy already has
| a bunch of image-to-text nodes. I haven't seen txt2txt or LoRAs
| and such for them though. But I also haven't looked.
| Philpax wrote:
| It's complicated by the ComfyUI data model, which treats
| strings as immediate values/constants and not variables in
| their own right. This could ostensibly be fixed/worked
| around, but I imagine that it would come at a cost to
| backwards compatibility.
| dragonwriter wrote:
| ComfyUI already has nodes (mostly in extensions available
| through the built-in manager) for working with LLMs, both
| remote LLMs accessed through APIs and local ones running under
| Comfy itself, the same as it runs other models.
| terhechte wrote:
| Check out Floneum, it's basically ComfyUI for LLMs, extendable
| via plugins
|
| https://floneum.com/
|
| Scroll down a bit on the website to see a screenshot.
| DeathArrow wrote:
| Thank you!
| chw9e wrote:
| This was a very cool paper about using diffusion language models
| and beam search: https://arxiv.org/html/2405.20519v1
|
| Just looking at all of the amazing tools and workflows that
| people have made with ComfyUI and stuff makes me wonder what we
| could do with diffusion LMs. It seems diffusion models are much
| more easily hackable than LLMs.
| FailMore wrote:
| Thanks for the post, I'm interested in them too
| monroewalker wrote:
| See also this recent post about Mercury-Coder from Inception
| Labs. There's a "diffusion effect" toggle for their chat
| interface but I have no idea if that's an accurate representation
| of the model's diffusion process or just some randomly generated
| characters showing what the diffusion process looks like
|
| https://news.ycombinator.com/item?id=43187518
|
| https://www.inceptionlabs.ai/news
| alexmolas wrote:
| I guess the biggest limitation of this approach is that the max
| output length is fixed before generation starts. Unlike
| autoregressive LLMs, which can keep generating forever.
| gdiamos wrote:
| max output size is always limited by the inference framework in
| autoregressive LLMs
|
| eventually they run out of memory or patience
| antirez wrote:
| There is disproportionate skepticism about autoregressive models
| and disproportionate optimism about alternative paradigms because
| of the absolutely unverifiable idea that LLMs, when predicting
| the next token, don't already model, in the activation states,
| the gist of what they are going to say, similar to what humans
| do. That's funny, because many times it can be observed in the
| output of truly high-quality replies that the first tokens only
| make sense _in the perspective_ of what comes later.
| spr-alex wrote:
| Maybe I understand this a little differently; the argument I am
| most familiar with is this one from LeCun, where the error
| accumulation in the prediction is the concern with
| autoregression:
| https://drive.google.com/file/d/1BU5bV3X5w65DwSMapKcsr0ZvrMR...
| antirez wrote:
| The error accumulation thing is basically without any grounding,
| as autoregressive models correct what they are saying in the
| process of emitting tokens (trivial to test yourself: force a
| given continuation in the prompt and the LLMs will not follow
| at all). LeCun made an incredible number of wrong claims
| about LLMs, many of which he no longer stands by: like the
| stochastic parrot claim. Now the idea that there is just a
| statistical relationship in next-token prediction is considered
| laughable, but even when it was formulated there were obvious
| empirical hints.
| HeatrayEnjoyer wrote:
| >force a given continuation in the prompt and the LLMs will
| not follow at all
|
| They don't? That's not the case at all, unless I am
| misunderstanding.
| antirez wrote:
| I'm not talking about the fine-tuning that makes them side
| with the user even when they are wrong (this is less and
| less common now compared to the past, and anyway it's a
| different effect). I'm referring to the case where, in the
| template, you make the assistant reply start with wrong
| words / directions, and the LLM finds a way to say what it
| really meant, saying "wait, actually I was wrong" or other
| sentences that allow it to avoid following that line.
| spr-alex wrote:
| I think the opposite: the error accumulation thing is
| basically the daily experience of using LLMs.
|
| As for the premise that models can't self-correct, that's not
| an argument I've ever seen; transformers have global
| attention across the context window. It's that their
| prediction abilities get increasingly poor as generation
| goes on. Is anyone having a different experience than that?
|
| Everyone doing some form of "prompt engineering", whether
| with optimized ML tuning, with a human in the loop, or with
| some kind of agentic fine-tuning step, runs into perplexity
| errors that get worse with longer contexts, in my opinion.
|
| There's some "sweet spot" for how long a prompt to use for
| many use cases, for example. It's clear to me that less is
| more a lot of the time.
|
| Now, whether diffusion will fare significantly better on
| error is another question. Intuition would lead me to think
| more flexibility with token rewriting should enable much
| greater error-correction capabilities. Ultimately, as
| different approaches come online, we'll get PPL comparables
| and the data will speak for itself.
| flippyhead wrote:
| It's a pet peeve of mine to make a statement in the form of a
| question?
| ajkjk wrote:
| I don't know why (and am curious) but this particularly odd
| question phrasing seems to happen a lot among Indian immigrants
| I've met in America. Maybe it's considered grammatically
| correct in India or something?
| exe34 wrote:
| I've seen an explanation (that I don't fully buy), that
| school teachers end most sentences with a question because
| they're trying to get the children? the children? to
| complete? their sentence.
| beeforpork wrote:
| What it is interesting that the original title is not a question?
| beeforpork wrote:
| Sorry, this was redundant?
| prometheus76 wrote:
| Why did the person who posted this change the headline of the
| article ("Diffusion models are interesting") into a nonsensical
| question?
| amclennon wrote:
| Considering that the article links back to this post, the
| simplest explanation might be that the author changed the title
| at some point. If this were a larger publication, I would have
| probably assumed an A/B test
| whoami_nr wrote:
| Author here. I just messed up while posting.
| inverted_flag wrote:
| How do diffusion LLMs decide how long the output should be?
| Normal LLMs generate a stop token and then halt. Do diffusion
| LLMs just output a fixed block of tokens and truncate the output
| that comes after a stop token?
| bilsbie wrote:
| What if we combine the best of both worlds? What might that look
| like?
___________________________________________________________________
(page generated 2025-03-07 23:01 UTC)