[HN Gopher] Multimodal Diffusion Language Models for Thinking-Aware Editing and Generation
       ___________________________________________________________________
        
       Multimodal Diffusion Language Models for Thinking-Aware Editing and
       Generation
        
       Author : lnyan
       Score  : 124 points
       Date   : 2025-11-19 09:27 UTC (13 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | Hard_Space wrote:
        | Be aware that the project page has the wrong arXiv link at the
       | time of writing. This is the correct one:
       | 
       | https://arxiv.org/abs/2511.09611
        
       | NitpickLawyer wrote:
       | > To resolve this, we propose a parallel multimodal diffusion
       | framework, MMaDA-Parallel, that enables _continuous,
       | bidirectional interaction_ between text and images throughout the
       | entire denoising trajectory.
       | 
       | > (ParaRL), a novel strategy that applies _semantic rewards along
       | the trajectory_ to enforce cross-modal consistency.
       | 
       | (emphasis mine)
       | 
       | This sounds really cool. The fact that one generation "attends"
       | to the other is really interesting. I'm curious if this would
       | hold for other modalities. I'm thinking coding specific
       | applications, where things can change once something is
       | generated. My hunch is that coding would benefit a lot from this
       | approach, because the "manual" way of writing code often
       | resembles diffusion more than autoregressive (that is, we often
       | edit something here, then because we did that we have to import
       | something, then change something there, then that leads to
       | further changes, etc).
       | 
       | For now coding seems to benefit a lot from <thinking> -> <coding>
       | -> <env_feedback> -> <reflexion> -> <thinking> -> <coding>, but
       | this seems at a glance to be shoehorned in for autoregressive
        | generation... GPT-5 in particular seems to be better at this, with
        | multiple "tool calls" interleaved in its thinking sessions. I
        | wonder if this would get better with the parallel denoising thing
        | proposed here, where both thinking and coding are done in
        | parallel, and one can "attend" to the other. Add some feedback
       | (linters, compilers, LSPs, tests, etc.) and this can go places.
       | If it works.
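        | 
        | Roughly, I picture the decode loop like this: a toy sketch of
        | two-stream masked-diffusion decoding under my reading of the
        | paper, with the model call stubbed out (names and the step
        | schedule are made up, not the paper's actual code):
        | 
        |   # Toy sketch: two masked streams denoised in the same passes,
        |   # so each step's predictions condition on BOTH partial streams.
        |   import random
        |   
        |   MASK = "<mask>"
        |   T_STEPS = 4
        |   
        |   def denoiser(prompt, think, code):
        |       # Stand-in for the real model: predicts a token for every
        |       # position, conditioned on prompt + both partial streams.
        |       ctx = prompt + think + code  # full bidirectional context
        |       return (["th%d" % i for i in range(len(think))],
        |               ["co%d" % i for i in range(len(code))])
        |   
        |   def unmask(stream, preds, frac):
        |       # Reveal a fraction of the still-masked positions.
        |       masked = [i for i, tok in enumerate(stream) if tok == MASK]
        |       if not masked:
        |           return
        |       k = min(len(masked), max(1, int(len(masked) * frac)))
        |       for i in random.sample(masked, k):
        |           stream[i] = preds[i]
        |   
        |   prompt = ["write", "a", "parser"]
        |   think, code = [MASK] * 8, [MASK] * 8
        |   for t in range(T_STEPS):
        |       think_pred, code_pred = denoiser(prompt, think, code)
        |       # (ParaRL would score these intermediate states with a
        |       # semantic reward to keep the two streams consistent.)
        |       unmask(think, think_pred, 1.0 / (T_STEPS - t))  # both
        |       unmask(code, code_pred, 1.0 / (T_STEPS - t))    # advance
        |   print("thinking:", think)
        |   print("code:    ", code)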
        
         | soulofmischief wrote:
          | Diffusion text models aren't new; I've made them at home. Also,
          | plenty of frontier models are good at tool calling; GPT-5 has
          | just been trained to do it more, so it appears to do better
          | at coding exercises with codex/IDEs.
         | 
         | If you haven't tried an agentic IDE such as Cursor yet, or at
         | least an extension such as Copilot, I would recommend checking
         | them out and trying out Anthropic's models as well.
        
           | NitpickLawyer wrote:
           | Do you have any examples / papers where they do the parallel
            | thing proposed here? I've tried Google's diffusion coding
           | model, but AFAICT they don't do parallel thinking & coding.
           | It seems to just take a prompt and output code.
           | 
            | What's cool with this thinking & generation in parallel is
            | that one can attend to the other. So you're not limited to
            | "prompt influences code"; the prompt can influence both
            | thinking and code, the code can influence the thinking, and
            | the thinking can influence the code.
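            | 
            | Concretely, the attention-pattern difference I mean, as a
            | toy (made-up shapes, not from the paper):
            | 
            |   # Causal AR mask vs. the fully bidirectional mask a
            |   # parallel diffusion decoder could use across streams.
            |   import numpy as np
            |   
            |   n_think, n_code = 3, 3
            |   n = n_think + n_code
            |   
            |   # Autoregressive: code tokens (later) can attend to
            |   # thinking tokens (earlier), never the reverse.
            |   causal = np.tril(np.ones((n, n), dtype=int))
            |   
            |   # Parallel diffusion: thinking positions can attend to
            |   # code positions and vice versa at every denoising step.
            |   bidirectional = np.ones((n, n), dtype=int)
            |   
            |   print(causal)
            |   print(bidirectional)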
        
             | lossolo wrote:
             | They use bidirectional attention between modalities, not
             | within the same modality. This doesn't change much in the
             | context you're referring to (coding). How do you think
             | "thinking" works in current SOTA models like
             | GPT-5-Thinking/Pro? When generating code, the model's
             | "thinking" already attends to the code, and both influence
             | each other during generation. "Reasoning" models modify the
              | code as they generate it: they delete it, revise it, and
              | adjust their internal reasoning based on the new tokens
              | they produce during the "thinking" process. There are
              | dozens of denoising models created for text; they are not
              | good at it, and parallel sampling between modalities will
              | not change that.
        
               | ricardobeat wrote:
               | They cannot "edit" the code though, like you can with
               | diffusion. They must emit all tokens again, or a
               | patch/diff which is not directly connected to the
               | previous stream of tokens.
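                | 
                | i.e. the diffusion-style edit is "remask the span,
                | refill in place" (toy illustration, with the model's
                | refill stubbed out):
                | 
                |   MASK = "<mask>"
                |   code = "def f ( x ) : return x".split()
                |   code[6:8] = [MASK, MASK]   # remask just the body
                |   # a denoising pass would refill only these slots,
                |   # with bidirectional context from untouched tokens:
                |   code[6:8] = ["return", "x+1"]  # stand-in refill
                |   print(" ".join(code))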
        
               | lossolo wrote:
               | LLMs can "edit" code, but as you say, they do it
               | differently from diffusion models. They operate directly
               | on long text sequences and use much more context, which
               | is one reason they currently work better for coding.
               | Diffusion models for code aren't a new idea, people have
               | tried different designs, but so far they tend to
               | underperform autoregressive LLMs, probably because
               | denoising over discrete tokens is harder to make work
               | than straightforward next token prediction.
        
       | boriskourt wrote:
       | Interesting approach and a very readable paper.
       | 
        | > We provide two variants of MMaDA-Parallel with different
       | tokenizers. MMaDA-Parallel-A is trained with tokenizer Amused-VQ,
       | and MMaDA-Parallel-M is trained with tokenizer Magvitv2.
       | 
        | tyfeld/MMaDA-Parallel-A:
        | https://huggingface.co/tyfeld/MMaDA-Parallel-A/tree/main
        | 
        | tyfeld/MMaDA-Parallel-M:
        | https://huggingface.co/tyfeld/MMaDA-Parallel-M/tree/main
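        | 
        | If you just want the weights locally, the standard hub client
        | works (how to actually load and run them is model-specific; see
        | the repo README):
        | 
        |   # Download a checkpoint snapshot from the Hugging Face Hub.
        |   from huggingface_hub import snapshot_download
        |   
        |   path = snapshot_download("tyfeld/MMaDA-Parallel-A")  # or -M
        |   print("checkpoint files in:", path)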
        
       | warthog wrote:
        | This looks awesome, although from a UX perspective it might not
        | be as good as streaming token by token for text generation use
        | cases. For image gen and editing, however - 100%
        
       | jasonjmcghee wrote:
       | Out of curiosity, is it possible this suffers from the same
       | issues Anthropic found where reasoning expressed by the model and
       | actual internal reasoning differ?
        
         | Lerc wrote:
         | I think this is likely to happen in all models since their
         | internal reasoning is not in the same form as the output. This
         | is probably true also for humans.
         | 
          | This may solve the additional clouding that comes from LLMs
          | using what is effectively an iteration of instants to
          | introspect the past. You cannot ask an autoregressive model
          | what the thinking was behind the output, because the only
          | memory it has of the past is the output. It has to infer what
          | it meant just the same as anyone else would.
         | 
         | To some extent this probably also happens in humans. You have
         | richer memories, but you still do a lot of post hoc
         | rationalisation.
        
           | observationist wrote:
            | Native latent reasoning, with latent-aware RL scaffolding and
            | all the rest, will have to be built. If you use the direct
            | text framework, you get confabulation / hallucination issues
            | from the divergence between the tokens in the context and the
            | rich activation representation that resulted in the output.
           | 
            | There are all sorts of places where the text and output are
            | at least one degree of separation from the underlying
            | activation vectors or other representations handled by a
            | model, from floating-point precision all the way up to
            | tokenization abstraction, and a lot of experiments get run as
            | if the tokens, context, and representations were all one
            | unified data concept. You have to match data abstractions
            | appropriately, or the weird edge cases will break things in
            | unexpected ways.
        
       | David-Henrry wrote:
       | Multimodal diffusion language models that support thinking-aware
       | editing and generation could significantly enhance AI creativity
       | and precision across text and image tasks.
        
       ___________________________________________________________________
       (page generated 2025-11-19 23:01 UTC)