[HN Gopher] Mercury, the first commercial-scale diffusion langua...
___________________________________________________________________
Mercury, the first commercial-scale diffusion language model
Author : HyprMusic
Score : 75 points
Date : 2025-04-30 21:51 UTC (1 hour ago)
(HTM) web link (www.inceptionlabs.ai)
(TXT) w3m dump (www.inceptionlabs.ai)
| g-mork wrote:
| There are some open weight attempts at this around too:
| https://old.reddit.com/r/LocalLLaMA/search?q=diffusion&restr...
|
| Saw another on Twitter in the past few days that looked like a
| better contender to Mercury; it doesn't look like it got posted
| to LocalLLaMA, and I can't find it now. Very exciting stuff
| echelon wrote:
| There are so many models. Every single day half a dozen new
| models land. And even more papers.
|
| It feels like models are becoming fungible apart from the
| hyperscaler frontier models from OpenAI, Google, Anthropic, et
| al.
|
| I suppose VCs won't be funding many more "labs"-type companies or
| "we have a model" as the core value prop companies? Unless it has
| a tight application loop or is truly unique?
|
| Disregarding the team composition, research background, and
| specific problem domain - if you were starting an AI company
| today, what part of the stack would you focus on? Foundation
| models, AI/ML infra, tooling, application layer, ...?
|
| Where does the value accrue? What are the most important problems
| to work on?
| byearthithatius wrote:
| Interesting approach. However, I never thought of autoregression
| as being _the_ current issue with language modeling. If anything
| it seems the community was generally surprised by just how far
| next-"token" prediction took us. Remember back when we did char-
| generating RNNs and were impressed they could make almost
| coherent sentences?
|
| Diffusion is an alternative, but I am having a hard time
| understanding the whole "built-in error correction" claim, which
| sounds like marketing BS. Both approaches replicate probability
| distributions, which will be naturally error-prone because of
| variance.
| nullc wrote:
| Consider the entropy of the distribution of token X in these
| examples:
|
| "Four X"
|
| and
|
| "Four X and seven years ago".
|
| In the first case X could be pretty much anything, but in the
| second case we both know the only likely completion.
|
| So it seems like there would be a huge advantage in not having
| to run autoregressively. But in practice it's less significant
| than you might imagine, because the AR model can _internally_
| model the probability of X conditioned on the stuff it hasn't
| output yet. In fact, because without reinforcement the training
| causes it to converge on the target probability of the whole
| output, the AR model _must_ do some form of lookahead
| internally.
|
| (That said, RLHF seems to break this product-of-probabilities
| property pretty badly, so maybe it will be the case that
| diffusion will suffer less intelligence loss ::shrugs::.)
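The entropy collapse described above can be sketched with toy numbers. The distributions below are invented purely for illustration; the point is only that conditioning on right-hand context can sharply reduce the Shannon entropy of the next-token distribution.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a {token: probability} map."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Toy next-token distributions for X (made-up numbers).
# With only the left context "Four X", many continuations are plausible:
left_only = {"score": 0.05, "hundred": 0.2, "years": 0.15,
             "people": 0.2, "more": 0.2, "days": 0.2}

# With context on both sides, "Four X and seven years ago",
# one completion dominates:
both_sides = {"score": 0.97, "hundred": 0.01, "years": 0.01,
              "people": 0.01}

print(entropy(left_only))   # high: spread over several tokens
print(entropy(both_sides))  # low: the distribution has collapsed
```

A non-autoregressive model gets to condition on both sides directly; the comment's point is that a well-trained AR model ends up modeling much of that low-entropy structure internally anyway.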
| jonplackett wrote:
| Ok. My go to puzzle is this:
|
| You have 2 minutes to cool down a cup of coffee to the lowest
| temp you can
|
| You have two options:
|
| 1. Add cold milk immediately, then let it sit for 2 mins.
|
| 2. Let it sit for 2 mins, then add the cold milk.
|
| Which one cools the coffee to the lowest temperature and why?
|
| And Mercury gets this right - while as of right now ChatGPT 4o
| gets it wrong.
|
| So that's pretty impressive.
| byearthithatius wrote:
| Token-based Gemini, after 4 seconds of thinking:
|
| Okay, let's break this down using the principle of heat
| transfer (specifically Newton's Law of Cooling): heat loss is
| faster when the temperature difference is greater. A very hot
| object loses heat to cool surroundings much faster than a warm
| object.
|
| Option 1: Add milk immediately. You instantly lower the
| coffee's temperature by mixing it with cold milk. Now this
| warmer (not hot) mixture sits for 2 minutes. Because the
| temperature difference between the mixture and the room is
| smaller, it cools down more slowly over those 2 minutes.
|
| Option 2: Let it sit for 2 mins, then add milk. The very hot
| coffee sits for 2 minutes. Because the temperature difference
| between the hot coffee and the room is large, it loses heat
| more quickly during these 2 minutes. After 2 minutes of rapid
| cooling, you then add the cold milk, lowering the temperature
| further.
|
| Conclusion: To get the coffee to the lowest temperature, you
| should choose Option 2: let it sit for 2 mins, then add the
| cold milk.
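The cooling argument is easy to sanity-check numerically. The sketch below uses Newton's law of cooling, T(t) = T_room + (T0 - T_room) * exp(-k*t); the rate constant, milk fraction, and temperatures are all made-up but plausible values, not anything from the thread.

```python
import math

T_room, k, t = 20.0, 0.02, 120.0   # 20 degC room, per-second rate, 2 min
coffee, milk = 90.0, 5.0           # starting temperatures in degC
frac = 0.8                         # final cup is 80% coffee, 20% milk

def cool(T0, seconds):
    """Newton's law of cooling toward the room temperature."""
    return T_room + (T0 - T_room) * math.exp(-k * seconds)

def mix(T_coffee):
    """Instantaneous mixing of coffee at T_coffee with cold milk."""
    return frac * T_coffee + (1 - frac) * milk

option1 = cool(mix(coffee), t)   # add milk first, then let it sit
option2 = mix(cool(coffee, t))   # let it sit, then add milk

print(f"option 1: {option1:.1f} degC, option 2: {option2:.1f} degC")
```

With any warm room, option 2 comes out cooler, matching the answer above: the undiluted coffee stays further from room temperature and so sheds heat faster during the wait.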
| crazygringo wrote:
| For me, ChatGPT (the free version, GPT-4o mini I believe?) gets
| it right (unless there's a gotcha in your prompt that I'm
| missing?):
|
| **Option 2: Let the coffee sit for 2 minutes, then add the
| cold milk.** This results in a lower final temperature.
| Here's why:
|
| ### **Why Option 2 is Better:**
|
| - **Cooling rate depends on temperature difference:** The
| hotter the coffee is compared to its environment, the faster
| it cools down due to Newton's Law of Cooling.
| - **If you add milk early (Option 1):** You reduce the
| coffee's temperature right away, which **slows** the rate of
| heat loss to the air over the 2 minutes.
| - **If you wait to add milk (Option 2):** The coffee stays
| hotter and thus loses more heat during those 2 minutes. Then
| when you add cold milk at the end, it drops further, reaching
| a **lower final temperature**.
|
| ### **In Summary:**
|
| Letting the coffee cool first **maximizes heat loss**, and
| then the milk **further reduces** the temperature. This combo
| gives the **lowest final temperature** after 2 minutes.
|
| Would you like a quick visual explanation or simulation of
| this?
| maytc wrote:
| That example is probably in the training data?
|
| The puzzle assumes that the room temperature is greater than
| the cold milk's temperature. When I added that the room
| temperature is, say, -10 degC, Mercury fails to see the
| difference.
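The gotcha has a clean closed form. Under Newton cooling with milk fraction (1 - a), the two options differ by (1 - a) * (1 - exp(-k*t)) * (T_room - T_milk), so waiting only wins while the room is warmer than the milk; in a -10 degC room the ordering flips. The sketch below checks both cases with made-up constants.

```python
import math

def final_temp(mix_first, T_room, T_coffee=90.0, T_milk=5.0,
               a=0.8, k=0.02, t=120.0):
    """Final cup temperature for either ordering of mix and wait."""
    cool = lambda T: T_room + (T - T_room) * math.exp(-k * t)  # Newton cooling
    mix = lambda T: a * T + (1 - a) * T_milk                   # instant mixing
    return cool(mix(T_coffee)) if mix_first else mix(cool(T_coffee))

for T_room in (20.0, -10.0):
    t1 = final_temp(True, T_room)    # option 1: milk first, then wait
    t2 = final_temp(False, T_room)   # option 2: wait, then milk
    print(f"room {T_room:+.0f} degC: milk-first {t1:.1f}, wait {t2:.1f}")
```

In the warm room, waiting wins; in the -10 degC room, the milk (at 5 degC) is warmer than the air, so diluting early, which slows the coffee's heat loss to the very cold room, gives the lower final temperature, which is the reversal Mercury reportedly misses.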
| marcyb5st wrote:
| Super happy to see something like this getting traction. As
| someone who is trying to reduce my carbon footprint, I
| sometimes feel bad about asking any model to do something
| trivial. With something like this, perhaps the guilt will
| lessen.
| whall6 wrote:
| If you live in the U.S., marginal electricity demand during the
| day is almost invariably met with solar or wind (solar
| typically runs at a huge surplus on sunny days). Go forth and
| AI in peace, marcyb5st.
| marcyb5st wrote:
| Thanks! That helps somewhat. However, it feels like that's
| just part of the story.
|
| If I remember correctly, hyperscalers have put their green
| agendas on hold now that LLMs are around, and that makes me
| believe there is a CO2 cost associated.
|
| Still, any improvement is good news, and if diffusion models
| replace autoregressive models we can invest that energy
| surplus in something else useful for the environment.
| inerte wrote:
| Not sure if I would trade off accuracy for speed.
|
| Yes, it's incredibly boring to wait for the AI agents in IDEs
| to finish their job. I get distracted and open YouTube. Once I
| gave a prompt so big and complex to Cline that it spent 2
| straight hours writing code.
|
| But after these 2 hours I spent 16 more tweaking and fixing all
| the stuff that wasn't working. I now realize I should have done
| things incrementally even when I have a pretty good idea of the
| final picture.
|
| I've been more and more only using the "thinking" models of o3 in
| ChatGPT, and Gemini / Claude in IDEs. They're slower, but usually
| get it right.
|
| But at the same time I am open to the idea that speed can unlock
| new ways of using the tooling. It would still be awesome to
| basically just have a conversation with my IDE while I am
| manually testing the app. Or combine really fast models like
| this one with a "thinking" background model that runs for
| seconds or minutes and tries to catch the bugs left behind.
|
| I guess only giving it a try will tell.
| kadushka wrote:
| The AI field desperately needs smarter models - not faster
| models.
| tyre wrote:
| Check out RooCode if you haven't. There's an orchestrator mode
| that can start with a big model to come up with a plan and
| break it down, then spin out small tasks to smaller models for
| scoped implementation.
| parsimo2010 wrote:
| This sounds like a neat idea, but it seems like bad timing.
| OpenAI just released token-based image generation that beats
| the best diffusion image generators. If diffusion isn't even
| the best at generating images, I don't know if I'm going to
| spend a lot of time evaluating it for text.
|
| Speed is great but it doesn't seem like other text-based model
| trends are going to work out of the box, like reasoning. So you
| have to get dLLMs up to the quality of a regular autoregressive
| LLM and then you need to innovate more to catch up to reasoning
| models, just to match the current state of the art. It's possible
| they'll get there, but I'm not optimistic.
| orbital-decay wrote:
| Does it beat them because it's a transformer, or because it's a
| much larger end-to-end model with higher quality multimodal
| training?
| jonplackett wrote:
| The reason image-1 is so good is because it's the same model
| doing the talking and the image making.
|
| I wonder if the same would be true for a multi-modal diffusion
| model that can now also speak?
| pants2 wrote:
| This is awesome for the future of autocomplete. Current models
| aren't fast enough to give useful suggestions at the speed that I
| type - but this certainly is.
|
| That said, token-based models are currently fast enough for most
| real-time chat applications, so I wonder what other use-cases
| there will be where speed is greatly prioritized over smarts.
| Perhaps trading on Trump tweets?
___________________________________________________________________
(page generated 2025-04-30 23:00 UTC)