[HN Gopher] Multimodal Diffusion Language Models for Thinking-Aware Editing and Generation
___________________________________________________________________
Multimodal Diffusion Language Models for Thinking-Aware Editing and
Generation
Author : lnyan
Score : 124 points
Date : 2025-11-19 09:27 UTC (13 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| Hard_Space wrote:
| Be aware that the project page has the wrong Arxiv link at the
| time of writing. This is the correct one:
|
| https://arxiv.org/abs/2511.09611
| NitpickLawyer wrote:
| > To resolve this, we propose a parallel multimodal diffusion
| framework, MMaDA-Parallel, that enables _continuous,
| bidirectional interaction_ between text and images throughout the
| entire denoising trajectory.
|
| > (ParaRL), a novel strategy that applies _semantic rewards along
| the trajectory_ to enforce cross-modal consistency.
|
| (emphasis mine)
|
| This sounds really cool. The fact that one generation "attends"
| to the other is really interesting. I'm curious if this would
| hold for other modalities. I'm thinking of coding-specific
| applications, where things can change once something is
| generated. My hunch is that coding would benefit a lot from this
| approach, because the "manual" way of writing code often
| resembles diffusion more than autoregressive (that is, we often
| edit something here, then because we did that we have to import
| something, then change something there, then that leads to
| further changes, etc).
|
| For now coding seems to benefit a lot from <thinking> -> <coding>
| -> <env_feedback> -> <reflexion> -> <thinking> -> <coding>, but
| this seems at a glance to be shoehorned in for autoregressive
| generation... GPT-5 in particular seems to be better at this,
| with multiple "tool calls" interleaved in its thinking sessions.
| I wonder if this would get better with the parallel denoising
| thing proposed here, where both thinking and coding are done in
| parallel, and one can "attend" to the other. Add some feedback
| (linters, compilers, LSPs, tests, etc.) and this can go places.
| If it works.
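|
| A minimal sketch of the kind of loop I mean, in PyTorch (all
| names and shapes are illustrative, not the paper's actual code):
| both streams start fully masked and are denoised together, so
| every unmasking decision in one modality can attend to the
| current state of the other. ParaRL, as I read it, would then
| score intermediate pairs along this loop rather than only the
| final output.
|
|     import torch
|     import torch.nn as nn
|
|     class JointDenoiser(nn.Module):
|         """Hypothetical joint denoiser over two token streams."""
|         def __init__(self, vocab=1024, dim=64, steps=8):
|             super().__init__()
|             self.embed = nn.Embedding(vocab + 1, dim)  # +1 for [MASK]
|             # Full bidirectional attention over the whole sequence,
|             # i.e. text positions can attend to image positions.
|             self.block = nn.TransformerEncoderLayer(
|                 d_model=dim, nhead=4, batch_first=True)
|             self.head = nn.Linear(dim, vocab)
|             self.mask_id = vocab
|             self.steps = steps
|
|         def forward(self, toks):
|             return self.head(self.block(self.embed(toks)))
|
|     @torch.no_grad()
|     def denoise(model, text_len=16, img_len=32):
|         # One sequence, two modalities, everything masked at t=0.
|         toks = torch.full((1, text_len + img_len), model.mask_id)
|         for step in range(model.steps):
|             conf, pred = model(toks).softmax(-1).max(-1)
|             masked = toks.eq(model.mask_id)
|             # Unmask the most confident positions this step, drawn
|             # from BOTH modalities at once -- text can firm up in
|             # response to image tokens and vice versa.
|             k = max(1, int(masked.sum()) // (model.steps - step))
|             conf = conf.masked_fill(~masked, -1.0)
|             idx = conf.topk(k, dim=-1).indices
|             toks.scatter_(1, idx, pred.gather(1, idx))
|             # A ParaRL-style semantic reward on the intermediate
|             # (text, image) pair could be computed here, i.e.
|             # along the trajectory rather than at the end.
|         return toks[:, :text_len], toks[:, text_len:]
|
|     text, image = denoise(JointDenoiser())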
| soulofmischief wrote:
| Diffusion text models aren't new; I've made them at home. Also,
| plenty of frontier models are good at tool calling; GPT-5 has
| just been trained on it more heavily, so it appears to do better
| at coding exercises with Codex/IDEs.
|
| If you haven't tried an agentic IDE such as Cursor yet, or at
| least an extension such as Copilot, I would recommend checking
| them out and trying out Anthropic's models as well.
| NitpickLawyer wrote:
| Do you have any examples / papers where they do the parallel
| thing proposed here? I've tried Google's diffusion coding
| model, but AFAICT they don't do parallel thinking & coding.
| It seems to just take a prompt and output code.
|
| What's cool about thinking & generation in parallel is that
| one can attend to the other. So you're not limited to "prompt
| influences code": the prompt can influence both thinking and
| code, code can influence thinking, and thinking can influence
| code.
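|
| In mask terms (a toy sketch; the prompt/thinking/code layout
| and lengths are my own assumption), the difference looks like:
|
|     import numpy as np
|
|     P, T, C = 4, 3, 5            # prompt / thinking / code lengths
|     n = P + T + C
|
|     # Autoregressive: strictly left-to-right, so early thinking
|     # tokens can never see the code that comes after them.
|     causal = np.tril(np.ones((n, n), dtype=int))
|
|     # Parallel diffusion: every position attends to every other
|     # on each denoising step, so thinking and code co-evolve.
|     parallel = np.ones((n, n), dtype=int)
|
|     first_thinking = P
|     print(causal[first_thinking, P + T:].sum())    # 0: sees no code
|     print(parallel[first_thinking, P + T:].sum())  # 5: sees all code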
| lossolo wrote:
| They use bidirectional attention between modalities, not
| within the same modality. This doesn't change much in the
| context you're referring to (coding). How do you think
| "thinking" works in current SOTA models like
| GPT-5-Thinking/Pro? When generating code, the model's
| "thinking" already attends to the code, and both influence
| each other during generation. "Reasoning" models modify the
| code as they generate it, they delete it, revise it, and
| adjust their internal reasoning based on the new tokens
| they produce during the "thinking" process. There are
| dozens of denoising models created for text; they are not
| good at it, and parallel sampling between modalities will
| not change that.
| ricardobeat wrote:
| They cannot "edit" the code though, like you can with
| diffusion. They must emit all tokens again, or a
| patch/diff which is not directly connected to the
| previous stream of tokens.
| lossolo wrote:
| LLMs can "edit" code, but as you say, they do it
| differently from diffusion models. They operate directly
| on long text sequences and use much more context, which
| is one reason they currently work better for coding.
| Diffusion models for code aren't a new idea, people have
| tried different designs, but so far they tend to
| underperform autoregressive LLMs, probably because
| denoising over discrete tokens is harder to make work
| than straightforward next token prediction.
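|
| To make the contrast concrete (a toy illustration, with
| integers standing in for code tokens and trivial lambdas
| standing in for the models):
|
|     MASK = -1
|     code = [10, 11, 12, 13, 14, 15]   # previously generated "code"
|
|     # Autoregressive edit: freeze the prefix, re-emit the suffix
|     # (or emit a diff); everything after the edit point is
|     # regenerated from scratch.
|     def ar_edit(toks, pos, regenerate):
|         return toks[:pos] + regenerate(toks[:pos])
|
|     # Diffusion-style edit: re-mask just the target span and
|     # denoise it in place, conditioning on context on BOTH sides
|     # of the hole.
|     def diffusion_edit(toks, lo, hi, denoise):
|         return denoise(toks[:lo] + [MASK] * (hi - lo) + toks[hi:])
|
|     print(ar_edit(code, 2, lambda prefix: [99, 99, 99, 99]))
|     # -> [10, 11, 99, 99, 99, 99]  (suffix re-emitted)
|     print(diffusion_edit(code, 2, 4,
|                          lambda t: [99 if x == MASK else x for x in t]))
|     # -> [10, 11, 99, 99, 14, 15]  (suffix preserved in place)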
| boriskourt wrote:
| Interesting approach and a very readable paper.
|
| > We provide two variants of MMaDA-Parallel with different
| tokenizers. MMaDA-Parallel-A is trained with tokenizer Amused-VQ,
| and MMaDA-Parallel-M is trained with tokenizer Magvitv2.
|
| tyfeld/MMaDA-Parallel-A:
| https://huggingface.co/tyfeld/MMaDA-Parallel-A/tree/main
|
| tyfeld/MMaDA-Parallel-M:
| https://huggingface.co/tyfeld/MMaDA-Parallel-M/tree/main
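|
| Both checkpoints pull down with the generic Hub API (this only
| fetches the files; actually loading the weights goes through
| the project's own code on GitHub):
|
|     from huggingface_hub import snapshot_download
|
|     # Local paths to the downloaded checkpoint directories.
|     path_a = snapshot_download(repo_id="tyfeld/MMaDA-Parallel-A")
|     path_m = snapshot_download(repo_id="tyfeld/MMaDA-Parallel-M")
|     print(path_a, path_m)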
| warthog wrote:
| This looks awesome, although from a UX perspective it might not
| be as good as streaming token by token for text-generation use
| cases. For image gen and editing, however - 100%.
| jasonjmcghee wrote:
| Out of curiosity, is it possible this suffers from the same
| issues Anthropic found where reasoning expressed by the model and
| actual internal reasoning differ?
| Lerc wrote:
| I think this is likely to happen in all models since their
| internal reasoning is not in the same form as the output. This
| is probably true also for humans.
|
| This may solve the additional clouding that comes from LLMs
| using what is effectively an iteration of instants to
| introspect the past. You cannot ask an autoregressive model what
| the thinking was behind the output because the only memory it
| has of the past is the output. It has to infer what it meant
| just the same as anyone else would.
|
| To some extent this probably also happens in humans. You have
| richer memories, but you still do a lot of post hoc
| rationalisation.
| observationist wrote:
| Native latent reasoning, along with latent-aware RL scaffolding
| and all the rest, will have to be built. If you use the direct
| text framework, you get confabulation / hallucination issues
| from the divergence between the tokens in the context and the
| rich activation representation that resulted in the output.
|
| There are all sorts of places where the text and output are at
| least one degree of separation from the underlying activation
| vectors or other representations handled by a model, from
| floating point precision all the way up to tokenization
| abstraction, and a lot of experiments get run as if the
| tokens and context and representations are all one unified
| data concept. You have to match data abstractions appropriately,
| or the weird edge cases will break things in unexpected ways.
| David-Henrry wrote:
| Multimodal diffusion language models that support thinking-aware
| editing and generation could significantly enhance AI creativity
| and precision across text and image tasks.
___________________________________________________________________
(page generated 2025-11-19 23:01 UTC)