[HN Gopher] Mercury, the first commercial-scale diffusion langua...
       ___________________________________________________________________
        
       Mercury, the first commercial-scale diffusion language model
        
       Author : HyprMusic
       Score  : 75 points
       Date   : 2025-04-30 21:51 UTC (1 hour ago)
        
 (HTM) web link (www.inceptionlabs.ai)
 (TXT) w3m dump (www.inceptionlabs.ai)
        
       | g-mork wrote:
       | There are some open weight attempts at this around too:
       | https://old.reddit.com/r/LocalLLaMA/search?q=diffusion&restr...
       | 
        | Saw another on Twitter in the past few days that looked like a
        | better contender to Mercury; it doesn't look like it got posted
        | to LocalLLaMA, and I can't find it now. Very exciting stuff.
        
       | echelon wrote:
        | There are so many models. Every single day half a dozen new
        | models land, along with even more papers.
       | 
       | It feels like models are becoming fungible apart from the
       | hyperscaler frontier models from OpenAI, Google, Anthropic, et
       | al.
       | 
        | I suppose VCs won't be funding many more "labs"-type companies,
        | or companies whose core value prop is "we have a model" - unless
        | they have a tight application loop or are truly unique?
       | 
       | Disregarding the team composition, research background, and
       | specific problem domain - if you were starting an AI company
       | today, what part of the stack would you focus on? Foundation
       | models, AI/ML infra, tooling, application layer, ...?
       | 
       | Where does the value accrue? What are the most important problems
       | to work on?
        
       | byearthithatius wrote:
        | Interesting approach. However, I never thought of autoregression
        | as being _the_ current issue with language modeling. If
        | anything, it seems the community was generally surprised by just
        | how far next-"token" prediction took us. Remember back when we
        | did char-generating RNNs and were impressed they could make
        | almost coherent sentences?
       | 
        | Diffusion is an alternative, but I am having a hard time
        | understanding the whole "built-in error correction" claim, which
        | sounds like marketing BS. Both approaches replicate probability
        | distributions, which will be naturally error-prone because of
        | variance.
        
         | nullc wrote:
         | Consider the entropy of the distribution of token X in these
         | examples:
         | 
         | "Four X"
         | 
         | and
         | 
         | "Four X and seven years ago".
         | 
         | In the first case X could be pretty much anything, but in the
         | second case we both know the only likely completion.
         | 
          | So it seems like there would be a huge advantage in not having
          | to run autoregressively. But in practice it's less significant
          | than you might imagine, because the AR model can _internally_
          | model the probability of X conditioned on the stuff it hasn't
          | output yet. In fact, because (without reinforcement) training
          | causes it to converge on the target probability of the whole
          | output, the AR model _must_ do some form of lookahead
          | internally.
         | 
          | (That said, RLHF seems to break this product-of-probabilities
          | property pretty badly, so maybe it will be the case that
          | diffusion will suffer less intelligence loss ::shrugs::)
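          | 
          | Concretely, with toy distributions (made-up numbers, not real
          | model probabilities), the entropy gap looks like this:

```python
import math

def entropy_bits(p):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# "Four X": X could be almost anything -- model it as uniform over
# 1000 candidate tokens (an assumption for illustration).
wide = [1 / 1000] * 1000

# "Four X and seven years ago": one completion ("score") dominates.
peaked = [0.99] + [0.01 / 999] * 999

print(round(entropy_bits(wide), 2))    # ~9.97 bits: highly uncertain
print(round(entropy_bits(peaked), 2))  # ~0.18 bits: nearly determined
```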
        
       | jonplackett wrote:
        | Ok. My go-to puzzle is this:
       | 
       | You have 2 minutes to cool down a cup of coffee to the lowest
       | temp you can
       | 
       | You have two options:
       | 
       | 1. Add cold milk immediately, then let it sit for 2 mins.
       | 
       | 2. Let it sit for 2 mins, then add the cold milk.
       | 
       | Which one cools the coffee to the lowest temperature and why?
       | 
        | And Mercury gets this right - while, as of right now, ChatGPT 4o
        | gets it wrong.
       | 
       | So that's pretty impressive.
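        | 
        | A toy Newton's-law-of-cooling simulation (assumed temperatures,
        | milk fraction, and cooling constant) agrees that option 2 ends
        | up cooler:

```python
import math

def final_temp(ambient, coffee=90.0, milk=5.0, frac_milk=0.2,
               k=0.3, minutes=2.0, milk_first=True):
    """Final coffee temperature after `minutes` of Newton cooling,
    with the milk added either before or after the wait.
    All parameter values here are illustrative assumptions."""
    # Mass-weighted mixing; the milk is assumed to stay cold until added.
    mix = lambda t: (1 - frac_milk) * t + frac_milk * milk
    # Newton's law of cooling: T(t) = ambient + (T0 - ambient) * exp(-k t).
    cool = lambda t0: ambient + (t0 - ambient) * math.exp(-k * minutes)
    return cool(mix(coffee)) if milk_first else mix(cool(coffee))

opt1 = final_temp(ambient=20.0, milk_first=True)   # milk now, then sit
opt2 = final_temp(ambient=20.0, milk_first=False)  # sit, then milk
print(f"option 1: {opt1:.1f} C, option 2: {opt2:.1f} C")  # option 2 is lower
```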
        
         | byearthithatius wrote:
          | Token-based Gemini, after 4 seconds of thinking:
         | 
          | Okay, let's break this down using the principle of heat
          | transfer (specifically Newton's Law of Cooling): heat loss is
          | faster when the temperature difference is greater. A very hot
          | object loses heat to cool surroundings much faster than a warm
          | object.
          | 
          | Option 1: Add milk immediately. You instantly lower the
          | coffee's temperature by mixing it with cold milk. Now this
          | warmer (not hot) mixture sits for 2 minutes. Because the
          | temperature difference between the mixture and the room is
          | smaller, it cools down more slowly over those 2 minutes.
          | 
          | Option 2: Let it sit for 2 mins, then add milk. The very hot
          | coffee sits for 2 minutes. Because the temperature difference
          | between the hot coffee and the room is large, it loses heat
          | more quickly during these 2 minutes. After 2 minutes of rapid
          | cooling, you then add the cold milk, lowering the temperature
          | further.
          | 
          | Conclusion: to get the coffee to the lowest temperature, you
          | should choose Option 2: Let it sit for 2 mins, then add the
          | cold milk.
        
         | crazygringo wrote:
         | For me, ChatGPT (the free version, GPT-4o mini I believe?) gets
         | it right (unless there's a gotcha in your prompt that I'm
         | missing?):
         | 
          | **Option 2: Let the coffee sit for 2 minutes, then add the
          | cold milk.**
          | 
          | This results in a lower final temperature. Here's why:
          | 
          | ### **Why Option 2 is Better:**
          | 
          | - **Cooling rate depends on temperature difference:** The
          | hotter the coffee is compared to its environment, the faster
          | it cools down due to Newton's Law of Cooling.
          | 
          | - **If you add milk early (Option 1):** You reduce the
          | coffee's temperature right away, which **slows** the rate of
          | heat loss to the air over the 2 minutes.
          | 
          | - **If you wait to add milk (Option 2):** The coffee stays
          | hotter and thus loses more heat during those 2 minutes. Then
          | when you add cold milk at the end, it drops further, reaching
          | a **lower final temperature**.
          | 
          | ### **In Summary:**
          | 
          | Letting the coffee cool first **maximizes heat loss**, and
          | then the milk **further reduces** the temperature. This combo
          | gives the **lowest final temperature** after 2 minutes.
          | 
          | Would you like a quick visual explanation or simulation of
          | this?
        
         | maytc wrote:
         | That example is probably in the training data?
         | 
         | The puzzle assumes that the room temperature is greater than
         | the cold milk's temperature. When I added that the room
         | temperature is, say, -10 degC, Mercury fails to see the
         | difference.
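          | 
          | Under the same toy Newton-cooling model (assumed numbers), the
          | flip is real: once the room is colder than the milk, adding
          | the milk first gives the lower final temperature:

```python
import math

def final_temp(ambient, coffee=90.0, milk=5.0, frac_milk=0.2,
               k=0.3, minutes=2.0, milk_first=True):
    # Newton cooling toward `ambient` plus mass-weighted mixing;
    # all parameter values are illustrative assumptions.
    mix = lambda t: (1 - frac_milk) * t + frac_milk * milk
    cool = lambda t0: ambient + (t0 - ambient) * math.exp(-k * minutes)
    return cool(mix(coffee)) if milk_first else mix(cool(coffee))

# Room at -10 C is colder than the 5 C milk, so the usual answer flips:
opt1 = final_temp(ambient=-10.0, milk_first=True)   # milk first
opt2 = final_temp(ambient=-10.0, milk_first=False)  # milk last
print(f"option 1: {opt1:.1f} C, option 2: {opt2:.1f} C")  # option 1 is lower
```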
        
       | marcyb5st wrote:
        | Super happy to see something like this getting traction. As
        | someone who is trying to reduce my carbon footprint, I sometimes
        | feel bad about asking any model to do something trivial. With
        | something like this, perhaps the guilt will lessen.
        
         | whall6 wrote:
         | If you live in the U.S., marginal electricity demand during the
         | day is almost invariably met with solar or wind (solar
         | typically runs at a huge surplus on sunny days). Go forth and
         | AI in peace, marcyb5st.
        
           | marcyb5st wrote:
            | Thanks! That helps somewhat. However, it feels like that's
            | just part of the story.
            | 
            | If I remember correctly, the hyperscalers put their green
            | agendas on hold now that LLMs are around, and that makes me
            | believe there is a CO2 cost associated.
            | 
            | Still, any improvement is good news, and if diffusion models
            | replace autoregressive models, we can invest that energy
            | surplus in something else useful for the environment.
        
       | inerte wrote:
        | Not sure if I would trade off accuracy for speed.
       | 
        | Yes, it's incredibly boring to wait for the AI agents in IDEs to
        | finish their job. I get distracted and open YouTube. Once I gave
        | Cline a prompt so big and complex that it spent 2 straight hours
        | writing code.
       | 
        | But after those 2 hours I spent 16 more tweaking and fixing all
        | the stuff that wasn't working. I now realize I should have done
        | things incrementally, even when I have a pretty good idea of the
        | final picture.
       | 
        | I've been increasingly using only the "thinking" models - o3 in
        | ChatGPT, and Gemini / Claude in IDEs. They're slower, but they
        | usually get it right.
       | 
        | But at the same time, I am open to the idea that speed can
        | unlock new ways of using the tooling. It would still be awesome
        | to basically just have a conversation with my IDE while I am
        | manually testing the app. Or to combine really fast models like
        | this one with a "thinking" background model that runs for
        | seconds or minutes but tries to catch the bugs left behind.
       | 
        | I guess only giving it a try will tell.
        
         | kadushka wrote:
          | The AI field desperately needs smarter models - not faster
          | ones.
        
         | tyre wrote:
          | Check out RooCode if you haven't. There's an orchestrator mode
          | that can start with a big model to come up with a plan and
          | break it down, then spin out small tasks to smaller models for
          | scoped implementation.
        
       | parsimo2010 wrote:
        | This sounds like a neat idea, but it seems like bad timing.
        | OpenAI just released a token-based image model that beats the
        | best diffusion image generators. If diffusion isn't even the
        | best at generating images, I don't know if I'm going to spend a
        | lot of time evaluating it for text.
       | 
        | Speed is great, but it doesn't seem like other text-model
        | advances, like reasoning, will work out of the box. So you have
        | to get dLLMs up to the quality of a regular autoregressive LLM,
        | and then you need to innovate further to catch up to reasoning
        | models, just to match the current state of the art. It's
        | possible they'll get there, but I'm not optimistic.
        
         | orbital-decay wrote:
         | Does it beat them because it's a transformer, or because it's a
         | much larger end-to-end model with higher quality multimodal
         | training?
        
         | jonplackett wrote:
          | The reason image-1 is so good is that it's the same model
          | doing the talking and the image making.
          | 
          | I wonder if the same would be true for a multi-modal diffusion
          | model that could also speak?
        
       | pants2 wrote:
        | This is awesome for the future of autocomplete. Current models
        | aren't fast enough to give useful suggestions at the speed I
        | type, but this certainly is.
        | 
        | That said, token-based models are already fast enough for most
        | real-time chat applications, so I wonder what other use cases
        | there will be where speed is greatly prioritized over smarts.
        | Perhaps trading on Trump tweets?
        
       ___________________________________________________________________
       (page generated 2025-04-30 23:00 UTC)