[HN Gopher] Mercury: Ultra-fast language models based on diffusion
       ___________________________________________________________________
        
       Mercury: Ultra-fast language models based on diffusion
        
       Author : PaulHoule
       Score  : 353 points
       Date   : 2025-07-07 12:31 UTC (10 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | mynti wrote:
       | is there a kind of nanogpt for diffusion language models? i would
       | love to understand them better
        
         | nvtop wrote:
         | This video has a live coding part which implements a masked
         | diffusion generation process:
         | https://www.youtube.com/watch?v=oot4O9wMohw
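          | 
          | The core move is simple: start all-masked, then commit the
          | tokens the model is most confident about each step. A rough
          | PyTorch-style sketch of that idea (model and tokenizer are
          | stand-ins, not the video's exact code):
          | 
          |     import torch
          | 
          |     def generate(model, tokenizer, prompt_ids,
          |                  seq_len=64, steps=8):
          |         # Start from the prompt followed by [MASK]s.
          |         mask_id = tokenizer.mask_token_id
          |         ids = torch.full((1, seq_len), mask_id,
          |                          dtype=torch.long)
          |         ids[0, :len(prompt_ids)] = torch.tensor(prompt_ids)
          |         per_step = (seq_len - len(prompt_ids)) // steps
          |         for _ in range(steps):
          |             probs = model(ids).softmax(-1)
          |             conf, best = probs.max(-1)
          |             # Never overwrite already-committed tokens.
          |             conf[ids != mask_id] = -1.0
          |             top = conf[0].topk(per_step).indices
          |             ids[0, top] = best[0, top]
          |         return ids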
        
       | chc4 wrote:
       | Using the free playground link, and it is in fact extremely fast.
       | The "diffusion mode" toggle is also pretty neat as a
       | visualization, although I'm not sure how accurate it is - it
       | renders as line noise and then refines, while in reality
       | presumably those are tokens from an imprecise vector in some
       | state space that then become more precise until it's only a
       | definite word, right?
        
         | PaulHoule wrote:
         | It's insane how fast that thing is!
        
         | maelito wrote:
         | Link : https://chat.inceptionlabs.ai/
        
         | icyfox wrote:
         | Some text diffusion models use continuous latent space but they
          | historically haven't done that well. Most of the ones we're
          | seeing now are typically trained to predict actual token
          | output that's fed forward into the next timestep. The
          | diffusion property
         | comes from their ability to modify previous timesteps to
         | converge on the final output.
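          | 
          | To make "modify previous timesteps" concrete, a toy
          | refinement step might look like this (my own sketch, not
          | Mercury's actual algorithm): each round, the lowest-
          | confidence committed tokens are re-predicted in parallel.
          | 
          |     def refine(model, ids, rounds=4, frac=0.25):
          |         for _ in range(rounds):
          |             probs = model(ids).softmax(-1)
          |             conf, best = probs.max(-1)
          |             k = int(frac * ids.shape[1])
          |             # Re-open the least confident positions so
          |             # the model can revise its earlier output.
          |             worst = conf[0].topk(k, largest=False).indices
          |             ids[0, worst] = best[0, worst]
          |         return ids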
         | 
         | I have an explanation about one of these recent architectures
         | that seems similar to what Mercury is doing under the hood
         | here: https://pierce.dev/notes/how-text-diffusion-works/
        
           | chc4 wrote:
           | Oh neat, thanks! The OP is surprisingly light on details on
           | how it actually works and is mostly benchmarks, so this is
           | very appreciated :)
        
       | luckystarr wrote:
       | I'm kind of impressed by the speed of it. I told it to write a
       | MQTT topic pattern matcher based on a Trie and it spat out
        | something reasonable on the first try. It had a few
        | compilation issues though, but fair enough.
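        | 
        | For reference, the shape of what I asked for is roughly this
        | (a minimal sketch of my own, not the model's output): '+'
        | matches one topic level, '#' matches the remainder.
        | 
        |     class TopicTrie:
        |         def __init__(self):
        |             self.children, self.handlers = {}, []
        | 
        |         def insert(self, pattern, handler):
        |             node = self
        |             for level in pattern.split('/'):
        |                 node = node.children.setdefault(
        |                     level, TopicTrie())
        |             node.handlers.append(handler)
        | 
        |         def match(self, topic):
        |             return list(self._match(topic.split('/')))
        | 
        |         def _match(self, levels):
        |             if not levels:
        |                 yield from self.handlers
        |             else:
        |                 for key in (levels[0], '+'):
        |                     if key in self.children:
        |                         child = self.children[key]
        |                         yield from child._match(levels[1:])
        |             # '#' matches whatever remains at this level.
        |             if '#' in self.children:
        |                 yield from self.children['#'].handlers
        | 
        |     # t = TopicTrie(); t.insert("sensors/+/temp", on_temp)
        |     # t.match("sensors/kitchen/temp")  # -> [on_temp]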
        
       | earthnail wrote:
       | Tried it on some coding questions and it hallucinated a lot, but
       | the appearance (i.e. if you're not a domain expert) of the output
       | is impressive.
        
       | TechDebtDevin wrote:
       | Oddly fast, almost instantaneous.
        
       | mike_hearn wrote:
       | A good chance to bring up something I've been flagging to
       | colleagues for a while now: with LLM agents we are very quickly
       | going to become even more CPU bottlenecked on testing performance
       | than today, and every team I know of today was bottlenecked on CI
       | speed even before LLMs. There's no point having an agent that can
       | write code 100x faster than a human if every change takes an hour
       | to test.
       | 
       | Maybe I've just got unlucky in the past, but in most projects I
       | worked on a lot of developer time was wasted on waiting for PRs
       | to go green. Many runs end up bottlenecked on I/O or availability
       | of workers, and so changes can sit in queues for hours, or they
       | flake out and everything has to start again.
       | 
        | As they get better, coding agents are going to be assigned simple
       | tickets that they turn into green PRs, with the model reacting to
       | test failures and fixing them as they go. This will make the CI
       | bottleneck even worse.
       | 
       | It feels like there's a lot of low hanging fruit in most
        | projects' testing setups, but for some reason I've seen nearly no
       | progress here for years. It feels like we kinda collectively got
       | used to the idea that CI services are slow and expensive, then
       | stopped trying to improve things. If anything CI got a lot slower
       | over time as people tried to make builds fully hermetic (so no
       | inter-run caching), and move them from on-prem dedicated hardware
       | to expensive cloud VMs with slow IO, which haven't got much
       | faster over time.
       | 
        | Mercury is crazy fast and, in a few quick tests I did, created
       | good and correct code. How will we make test execution keep up
       | with it?
        
         | TechDebtDevin wrote:
         | LLM making a quick edit, <100 lines... Sure. Asking an LLM to
         | rubber-duck your code, sure. But integrating an LLM into your
          | CI is going to end up costing you 100s of hours of
          | productivity on any large project. That, or you spend half
          | the time you should be spending learning to write your own
          | code on dialing down context sizing and prompt accuracy.
         | 
          | I really really don't understand the hubris around LLM tooling,
         | and don't see it catching on outside of personal projects and
         | small web apps. These things don't handle complex systems well
         | at all, you would have to put a gun in my mouth to let one of
         | these things work on an important repo of mine without any
         | supervision... And if I'm supervising the LLM I might as well
         | do it myself, because I'm going to end up redoing 50% of its
         | work anyways..
        
           | mike_hearn wrote:
           | I've used Claude with a large, mature codebase and it did
           | fine. Not for every possible task, but for many.
           | 
           | Probably, Mercury isn't as good at coding as Claude is. But
           | even if it's not, there's lots of small tasks that LLMs can
           | do without needing senior engineer level skills. Adding test
           | coverage, fixing low priority bugs, adding nice animations to
           | the UI etc. Stuff that maybe isn't critical so if a PR turns
           | up and it's DOA you just close it, but which otherwise works.
           | 
           | Note that many projects already use this approach with bots
           | like Renovate. Such bots also consume a ton of CI time, but
           | it's generally worth it.
        
             | flir wrote:
             | Don't want to put words in the parent commenter's mouth,
             | but I think the key word is "unsupervised". Claude doesn't
             | know what it doesn't know, and will keep going round the
             | loop until the tests go green, or until the heat death of
             | the universe.
        
               | mike_hearn wrote:
               | Yes, but you can just impose timeouts to solve that. If
               | it's unsupervised the only cost is computation.
        
             | airstrike wrote:
             | IMHO LLMs are notoriously bad at test coverage. They
             | usually hard code a value to have the test pass, since they
             | lack the reasoning required to understand why the test
             | exists or the concept of assertion, really
        
               | wrs wrote:
               | I don't know, Claude is very good at writing that utterly
               | useless kind of unit test where every dependency is
               | mocked out and the test is just the inverted dual of the
               | original code. 100% coverage, nothing tested.
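                | 
                | You know the kind (an illustrative sketch;
                | UserService is a made-up stand-in):
                | 
                |     from unittest.mock import Mock
                | 
                |     def test_get_user():
                |         repo = Mock()
                |         repo.fetch.return_value = {"id": 1}
                |         svc = UserService(repo)  # hypothetical
                |         assert svc.get_user(1) == {"id": 1}
                |         # Asserts that the mock returns what we
                |         # told it to return. 100% coverage,
                |         # nothing tested.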
        
               | conradkay wrote:
               | Yeah and that's even worse because there's not an easy
               | metric you can have the agent work towards and get
               | feedback on.
               | 
               | I'm not that into "prompt engineering" but tests seem
               | like a big opportunity for improvement. Maybe something
               | like (but much more thorough):
               | 
               | 1. "Create a document describing all real-world actions
               | which could lead to the code being used. List all
               | methods/code which gets called before it (in order) along
               | with their exact parameters and return value. Enumerate
               | all potential edge cases and errors that could occur and
               | if it ends up influencing this task. After that, write a
               | high-level overview of what need to occur in this
               | implementation. Don't make it top down where you think
               | about what functions/classes/abstractions which are
               | created, just the raw steps that will need to occur" 2.
               | Have it write the tests 3. Have it write the code
               | 
                | Maybe TDD ends up worse, but I suspect the initial
                | plan, which is somewhat close to code, makes that not
                | the case.
               | 
               | Writing the initial doc yourself would definitely be
               | better, but I suspect just writing one really good one,
               | then giving it as an example in each subsequent prompt
                | captures a lot of the improvement.
        
               | astrange wrote:
               | This is why unit tests are the least useful kind of test
               | and regression tests are the most useful.
               | 
               | I think unit tests are best written /before/ the real
               | code and thrown out after. Of course, that's extremely
               | situational.
        
           | DSingularity wrote:
            | He is simply observing that if PR numbers and launch rates
            | increase dramatically, CI cost will become untenable.
        
           | kraftman wrote:
           | I keep seeing this argument over and over again, and I have
            | to wonder, at what point do you accept that maybe LLMs are
           | useful? Like how many people need to say that they find it
           | makes them more productive before you'll shift your
           | perspective?
        
             | candiddevmike wrote:
              | People say they are more productive using Visual Basic, but
             | that will never shift my perspective on it.
             | 
             | Code is a liability. Code you didn't write is a ticking
             | time bomb.
        
             | psychoslave wrote:
              | That's a tool, and it depends what you need to do. If it
              | fits someone's need and makes them more productive, or
              | even simply makes the activity more enjoyable, good.
              | 
              | Just because two people are fixing something on the wall
              | doesn't mean the same tool will hold fine. Gum, pushpin,
              | nail, screw, bolts?
              | 
              | The parent thread did mention they use LLMs successfully
              | in small side projects.
        
             | dragonwriter wrote:
             | > I keep seeing this argument over and over again, and I
             | have to wonder, at what point do you accept that maybe
              | LLMs are useful?
             | 
             | The post you are responding to literally acknowledges that
             | LLMs are useful in certain roles in coding in the first
             | sentence.
             | 
             | > Like how many people need to say that they find it makes
             | them more productive before you'll shift your perspective?
             | 
             |  _Argumentum ad populum_ is not a good way of establishing
             | fact claims beyond the fact of a belief being popular.
        
               | kraftman wrote:
                | ...and my comment clearly isn't talking about that, but
                | at the suggestion that it's useless to write code with
                | an LLM because you'll end up rewriting 50% of it.
               | 
                | If everyone has an opinion different to mine, I don't
               | instantly change my opinion, but I do try and investigate
               | the source of the difference, to find out what I'm
               | missing or what they are missing.
               | 
               | The polarisation between people that find LLMs useful or
               | not is very similar to the polarisation between people
               | that find automated testing useful or not, and I have a
               | suspicion they have the same underlying cause.
        
               | nwienert wrote:
                | You seem to think everyone shares your view; around me I
               | see a lot of people acknowledging they are useful to a
               | degree, but also clearly finding limits in a wide array
               | of cases, including that they really struggle with
               | logical code, architectural decisions, re-using the right
               | code patterns, larger scale changes that aren't copy
               | paste, etc.
               | 
               | So far what I see is that if I provide lots of context
                | and clear instructions to a mostly non-logical area of
                | code, I can speed myself up about 20-40%, but that only
                | works in about 30-50% of the problems I solve day to
                | day at my day job.
               | 
               | So basically - it's about a rough 20% improvement in my
                | productivity - because I spend most of my time on the
                | difficult things it can't do anyway.
               | 
               | Meanwhile these companies are raising billion dollar seed
               | rounds and telling us that all programming will be done
               | by AI by next year.
        
             | ninetyninenine wrote:
              | They say it's only effective for personal projects, but
              | there's literally evidence of LLMs being used for what he
              | says they can't be used for. Actual physical evidence.
             | 
             | It's self delusion. And also the pace of AI is so fast he
             | may not be aware of how fast LLMs are integrating into our
             | coding environments. Like 1 year ago what he said could be
             | somewhat true but right now what he said is clearly not
             | true at all.
        
             | MangoToupe wrote:
              | > at what point do you accept that maybe LLMs are useful?
             | 
             | LLMs _are_ useful, just not for every task and price point.
        
           | blitzar wrote:
           | Do the opposite - integrate your CI into your LLM.
           | 
           | Make it run tests after it changes your code and either
            | confirm it didn't break anything or go back and try again.
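            | 
            | A naive sketch of that loop (propose_patch and
            | apply_patch are hypothetical stand-ins for the model
            | calls, not a real API):
            | 
            |     import subprocess
            | 
            |     def edit_until_green(llm, task, max_attempts=5):
            |         feedback = ""
            |         for _ in range(max_attempts):
            |             # Ask the model for a patch and apply it
            |             # to the working tree (hypothetical).
            |             apply_patch(llm.propose_patch(task, feedback))
            |             r = subprocess.run(["pytest", "-x", "-q"],
            |                                capture_output=True,
            |                                text=True)
            |             if r.returncode == 0:
            |                 return True  # didn't break anything
            |             feedback = r.stdout + r.stderr
            |         return False  # give up; hand back to a human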
        
         | piva00 wrote:
         | I haven't worked in places using off-the-shelf/SaaS CI in more
         | than a decade so I feel my experience has been quite the
         | opposite from yours.
         | 
         | We always worked hard to make the CI/CD pipeline as fast as
          | possible. I personally worked on those kinds of projects at
          | 2 different employers as an SRE: a smaller 300-person shop
          | where I was responsible for all their infra needs (CI/CD,
          | live deployments; migrated later to k8s when it became
          | somewhat stable, at least enough for the workloads we ran,
          | but still in its beta days), then at a different employer
          | some 5k+ strong, working on improving the CI/CD setup, which
          | used Jenkins as a
         | backend but we developed a completely different shim on top for
         | developer experience while also working on a bespoke worker
         | scheduler/runner.
         | 
          | I haven't experienced a CI/CD setup that takes longer than
          | 10 minutes to run in many, many years, so I was quite
          | surprised reading your comment. Feeling spoiled, I haven't
          | felt this pain for more than a decade and didn't really
          | expect it was still an issue.
        
           | mike_hearn wrote:
           | I think the prevalence of teams having a "CI guy" who often
           | is developing custom glue, is a sign that CI is still not
           | really working as well as it should given the age of the
           | tech.
           | 
           | I've done a lot of work on systems software over the years so
           | there's often tests that are very I/O or computation heavy,
           | lots of cryptography, or compilation, things like that. But
           | probably there are places doing just ordinary CRUD web app
            | development where there are Playwright tests or similar that
           | are quite slow.
           | 
           | A lot of the problems are cultural. CI times are a commons,
           | so it can end in tragedy. If everyone is responsible for CI
           | times then nobody is. Eventually management gets sick of
           | pouring money into it and devs learn to juggle stacks of PRs
           | on top of each other. Sometimes you get a lot of pushback on
           | attempts to optimize CI because some devs will really scream
           | about any optimization that might potentially go wrong (e.g.
           | depending on your build system cache), even if caching
           | nothing causes an explosion in CI costs. Not their money,
           | after all.
        
         | kccqzy wrote:
         | > Maybe I've just got unlucky in the past, but in most projects
         | I worked on a lot of developer time was wasted on waiting for
         | PRs to go green.
         | 
         | I don't understand this. Developer time is so much more
         | expensive than machine time. Do companies not just double their
         | CI workers after hearing people complain? It's just a throw-
         | more-resources problem. When I was at Google, it was somewhat
         | common for me to debug non-deterministic bugs such as a missing
         | synchronization or fence causing flakiness; and it was common
         | to just launch 10000 copies of the same test on 10000 machines
         | to find perhaps a single digit number of failures. My current
         | employer has a clunkier implementation of the same thing (no
         | UI), but there's also a single command to launch 1000 test
         | workers to run all tests from your own checkout. The goal is to
         | finish testing a 1M loc codebase in no more than five minutes
         | so that you get quick feedback on your changes.
         | 
         | > make builds fully hermetic (so no inter-run caching)
         | 
         | These are orthogonal. You want maximum deterministic CI steps
         | so that you make builds fully hermetic and cache every single
         | thing.
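          | 
          | On a single machine the same idea looks something like
          | this (an illustrative sketch, not the internal tooling):
          | 
          |     import subprocess
          |     from concurrent.futures import ThreadPoolExecutor
          | 
          |     def hammer(cmd, runs=1000, workers=64):
          |         # Fan out many runs of a flaky test and count
          |         # how many fail.
          |         def one(_):
          |             return subprocess.run(
          |                 cmd, capture_output=True).returncode
          |         with ThreadPoolExecutor(workers) as pool:
          |             return sum(rc != 0 for rc
          |                        in pool.map(one, range(runs)))
          | 
          |     # hammer(["pytest", "tests/test_race.py", "-q"])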
        
           | mark_undoio wrote:
           | > I don't understand this. Developer time is so much more
           | expensive than machine time. Do companies not just double
           | their CI workers after hearing people complain? It's just a
           | throw-more-resources problem.
           | 
           | I'd personally agree. But this sounds like the kind of thing
           | that, at many companies, could be a real challenge.
           | 
            | Ultimately, you can _measure_ dollars spent on CI workers.
            | It's much harder and less direct to quantify the cost of not
           | having them (until, for instance, people start taking
           | shortcuts with testing and a regression escapes to
           | production).
           | 
           | That kind of asymmetry tends, unless somebody has a strong
           | overriding vision of where the value _really_ comes from, to
           | result in penny pinching on the wrong things.
        
             | mike_hearn wrote:
             | It's more than that. You can measure salaries too,
             | measurement isn't the issue.
             | 
              | The problem is that if you let people spend the company's
              | money without any checks or balances, they'll just blow
             | through unlimited amounts of it. That's why companies
             | always have lots of procedures and policies around expense
             | reporting. There's no upper limit to how much money
             | developers will spend on cloud hardware given the chance,
             | as the example above of casually running a test 10,000
             | times in parallel demonstrates nicely.
             | 
             | CI doesn't require you to fill out an expense report every
             | time you run a PR thank goodness, but there still has to be
             | a way to limit financial liability. Usually companies do
             | start out by doubling cluster sizes a few times, but each
             | time it buys a few months and then the complaints return.
             | After a few rounds of this managers realize that demand is
             | unlimited and start pushing back on always increasing the
             | budget. Devs get annoyed and spend an afternoon on
             | optimizations, suddenly times are good again.
             | 
             | The meme on HN is that developer time is always more
             | expensive than machine time, but I've been on both sides of
             | this and seen how the budgets work out. It's often not
             | true, especially if you use clouds like Azure which are
             | overloaded and expensive, or have plenty of junior devs,
             | and/or teams outside the US where salaries are lower.
              | There's often a lot of low-hanging fruit in test times,
              | so it can make sense to optimize; even so, huge waste is
              | still the order of the day.
        
           | mike_hearn wrote:
           | I was also at Google for years. Places like that are not even
           | close to representative. They can afford to just-throw-more-
           | resources, they get bulk discounts on hardware and they pay
           | top dollar for engineers.
           | 
           | In more common scenarios that represent 95% of the software
           | industry CI budgets are fixed, clusters are sized to be busy
           | most of the time, and you cannot simply launch 10,000 copies
           | of the same test on 10,000 machines. And even despite that
           | these CI clusters can easily burn through the equivalent of
           | several SWE salaries.
           | 
           |  _> These are orthogonal. You want maximum deterministic CI
           | steps so that you make builds fully hermetic and cache every
           | single thing._
           | 
           | Again, that's how companies like Google do it. In _normal_
            | companies, build caching isn't always perfectly reliable,
           | and if CI runs suffer flakes due to caching then eventually
           | some engineer is gonna get mad and convince someone else to
           | turn the caching off. Blaze goes to extreme lengths to ensure
           | this doesn't happen, and Google spends extreme sums of money
           | on helping it do that (e.g. porting third party libraries to
           | use Blaze instead of their own build system).
           | 
           | In companies without money printing machines, they sacrifice
           | caching to get determinism and everything ends up slow.
        
             | PaulHoule wrote:
             | Most of my experience writing concurrent/parallel code in
             | (mainly) Java has been rewriting half-baked stuff that
             | would need a lot of testing with straightforward reliable
             | and reasonably performant code that uses sound and easy-to-
             | use primitives such as Executors (watch out for teardown
             | though), database transactions, atomic database operations,
             | etc. Drink the Kool Aid and mess around with _synchronized_
              | or actors or Streams or something and you're looking at a
             | world of hurt.
             | 
             | I've written a limited number of systems that needed tests
             | that probe for race conditions by doing something like
             | having 3000 threads run a random workload for 40 seconds.
             | I'm proud of that "SuperHammer" test on a certain level but
             | boy did I hate having to run it with every build.
        
             | kridsdale1 wrote:
             | I'm at Google today and even with all the resources, I am
             | absolutely most bottlenecked by the Presubmit TAP and human
             | review latency. Making CLs in the editor takes me a few
             | hours. Getting them in the system takes days and sometimes
             | weeks.
        
               | simonw wrote:
               | Presumably the "days and sometimes weeks" thing is
               | entirely down to human review latency?
        
           | mystified5016 wrote:
           | IME it's less of a "throw more resources" problem and more of
           | a "stop using resources in literally the worst way possible"
           | 
           | CI caching is, apparently, extremely difficult. Why spend a
           | couple of hours learning about your CI caches when you can
           | just download and build the same pinned static library a
           | billion times? The server you're downloading from is (of
           | course) someone else's problem and you don't care about
           | wasting their resources either. The power you're burning by
            | running CI for three hours instead of one is also someone
           | else's problem. Compute time? Someone else's problem. Cloud
           | costs? You bet it's someone else's problem.
           | 
           | Sure, some things you don't want to cache. I _always_ do a
           | 100% clean build when cutting a release or merging to master.
           | But for intermediate commits on a feature branch? Literally
           | no reason not to cache builds the exact same way you do on
           | your local machine.
        
           | ronbenton wrote:
           | >Do companies not just double their CI workers after hearing
           | people complain?
           | 
           | They do not.
           | 
           | I don't know if it's a matter of justifying management
           | levels, but these discussions are often drawn out and
           | belabored in my experience. By the time you get approval, or
           | even worse, rejected, for asking for more compute (or
           | whatever the ask is), you've spent way more money on the
           | human resource time than you would ever spend on the
           | requested resources.
        
             | kccqzy wrote:
             | I have never once been refused by a manager or director
             | when I am explicitly asking for cost approval. The only
             | kind of long and drawn out discussions are unproductive
             | technical decision making. Example: the ask of "let's spend
             | an extra $50,000 worth of compute on CI" is quickly
             | approved but "let's locate the newly approved CI resource
             | to a different data center so that we have CI in multiple
             | DCs" solicits debates that can last weeks.
        
             | mysteria wrote:
             | This is exactly my experience with asking for more compute
             | at work. We have to prepare loads of written justification,
             | come up with alternatives or optimizations (which we
             | already know won't work), etc. and in the end we choose the
             | slow compute and reduced productivity over the bureaucracy.
             | 
             | And when we manage to make a proper request it ends up
             | being rejected anyways as many other teams are asking for
             | the same thing and "the company has limited resources".
             | Duh.
        
           | IshKebab wrote:
           | Developer time is more expensive than machine time, but at
           | most companies it isn't 10000x more expensive. Google is
           | likely an exception because it pays extremely well and has
           | access to very cheap machines.
           | 
           | Even then, there are other factors:
           | 
           | * You might need commercial licenses. It may be very cheap to
           | run open source code 10000x, but guess how much 10000 Questa
           | licenses cost.
           | 
            | * Moore's law is dead; Amdahl's law very much isn't. Not
           | everything is embarrassingly parallel.
           | 
           | * Some people care about the environment. I worked at a
           | company that spent 200 CPU hours on every single PR (even to
           | fix typos; I failed to convince them they were insane for not
           | using Bazel or similar). That's a not insignificant amount of
           | CO2.
        
             | underdeserver wrote:
              | That's solvable with modern cloud offerings - provision
             | spot instances for a few minutes and shut them down
             | afterwards. Let the cloud provider deal with demand
             | balancing.
             | 
             | I think the real issue is that developers waiting for PRs
             | to go green are taking a coffee break between tasks, not
             | sitting idly getting annoyed. If that's the case you're
             | cutting into rest time and won't get much value out of
             | optimizing this.
        
               | IshKebab wrote:
               | Both companies I've worked in recently have been too
               | paranoid about IP to use the cloud for CI.
               | 
               | Anyway I don't see how that solves any of the issues
               | except maybe cost to some degree (but maybe not; cloud is
               | expensive).
        
               | fragmede wrote:
               | Sorta. For CI/CD you can use spot instances and spin them
               | down outside of business hours, so they can end up being
               | cheaper than buying many really beefy machines and
               | amortizing them over the standard depreciation schedule.
        
               | simonw wrote:
               | Were they running CI on their own physical servers under
               | a desk or in a basement somewhere, or renting their own
               | racks in a data center just for CI?
        
               | jiggawatts wrote:
               | That's paranoid to the point of lunacy.
               | 
               | Azure for example has "confidential compute" that
               | encrypts even the memory contents of the VM such that
               | even their own engineers can't access the contents.
               | 
               | As long as you don't back up the disks and use HTTPS for
               | pulls, I don't see a realistic business risk.
               | 
               | If a cloud like Azure or AWS got caught stealing
               | competitor code they'd be sued _and_ immediately lose a
               | huge chunk of their customers.
               | 
               | It makes zero business sense to do so.
               | 
               | PS: Microsoft employees have made public comments saying
               | that they refuse to even _look_ at some open source
               | repository to avoid any risk of accidentally
               | "contaminating" their own code with something that has an
               | incompatible license.
        
               | kccqzy wrote:
               | I don't know about Azure's implementation of confidential
                | compute, but GCP's version essentially relies on AMD
                | SEV-SNP. Historically there have been vulnerabilities
               | that undermine the confidentiality guarantee.
        
             | hyperpape wrote:
              | > Moore's law is dead; Amdahl's law
             | 
             | Yes, but the OP specifically is talking about CI for large
             | numbers of pull requests, which should be very
             | parallelizable (I can imagine exceptions, but only with
             | anti-patterns, e.g. if your test pipeline makes some kind
             | of requests to something that itself isn't scalable).
        
               | vlovich123 wrote:
               | Actually, OP was talking about the throughput of running
               | on a large number of pull requests and the latency of
               | running on a single pull request. The latter is not
               | necessarily parallelizable.
        
           | physicsguy wrote:
            | Not really, in most small companies/departments, £100k a
           | month is considered a painful cloud bill and adding more EC2
           | instances to provide cloud runners can add 10% to that
           | easily.
        
           | wat10000 wrote:
           | Many companies are strangely reluctant to spend money on
           | hardware for developers. They might refuse to spend $1,000 on
           | a better laptop to be used for the next three years by an
           | employee, whose time costs them that much money in a single
           | afternoon.
        
             | kridsdale1 wrote:
             | I have faced this at each of the $50B in profit companies I
             | have worked at.
        
             | PaulHoule wrote:
             | That's been a pet peeve of mine for so long. (Glad my
             | current employer gets me the best 1.5 machine from Dell
             | every few years!)
             | 
             | On the other hand I've seen many overcapitalized pre-launch
             | startups go for months with a $20,000+ AWS bill without
             | thinking about it then suddenly panic about what they're
             | spending; they'd find tens of XXXXL instances spun up doing
             | nothing, S3 buckets full of hundreds of terabytes of temp
             | files that never got cleared out, etc. With basic due
             | diligence they could have gotten that down to $2k a month,
             | somebody obsessive about cost control could have done even
             | better.
        
           | wbl wrote:
           | No it is not. Senior management often has a barely disguised
           | contempt for engineering and spending money to do a better
            | job. They listen much more when sales complains.
        
             | kridsdale1 wrote:
             | That depends on the company.
        
           | MangoToupe wrote:
           | Writing testing infrastructure so that you _can_ just double
           | workers and get a corresponding doubling in productivity is
            | non-trivial. Certainly I've never seen anything like
           | Google's testing infrastructure anywhere else I've worked.
        
             | mike_hearn wrote:
             | Yeah Google's infrastructure is unique because Blaze is
             | tightly integrated with the remote execution workers and
             | can shard testing work across many machines automatically.
             | Most places can't do that so once you have enough hardware
             | that queue depth isn't too big you can't make anything go
             | faster by adding hardware, you can only try to scale
             | vertically or optimize. But if you're using hosted CI SaaS
              | it's often not easy to get bigger machines, or the
             | bigger machines are superlinear in cost.
        
           | socalgal2 wrote:
            | Even Google cannot buy more old Intel Macs or Pixel 6s or
            | Samsung S20s to increase their testing on those devices (as
            | an example).
            | 
            | Maybe that affects fewer devs who don't need to test on
            | actual hardware, but plenty of apps do. Pretty much
            | anything that
           | touches a GPU driver for example like a game.
        
           | anp wrote:
           | I'm currently at google (opinions not representative of my
           | employer's etc) and this is true for things that run in a
           | data center but it's a lot harder for things that need to be
           | tested on physical hardware like parts of Android or CrOS.
        
           | wavemode wrote:
           | You're confusing throughput and latency. Lengthy CI runs
           | increase the latency of developer output, but they don't
           | significantly reduce overall throughput, given a developer
           | will typically be working on multiple things at once, and can
           | just switch tasks while CI is running. The productivity cost
           | of CI is not zero, but it's way, way less than the raw
           | wallclock time spent per run.
           | 
           | Then also factor in that most developer tasks are not even
           | bottlenecked by CI. They are bottlenecked primarily by code
           | review, and secondarily by deployment.
        
         | mathiaspoint wrote:
         | Good God I hate CI. Just let me run the build automation myself
         | dammit! If you're worried about reproducibility make it
         | reproducible and hash the artifacts, make people include the
         | hash in the PR comment if you want to enforce it.
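          | 
          | Hashing the artifacts is nearly free, e.g. (a sketch; the
          | dist/ layout is just an example):
          | 
          |     import hashlib, pathlib
          | 
          |     def artifact_digest(dist_dir="dist"):
          |         # Stable digest over all build outputs, so the
          |         # hash in the PR comment can be checked against
          |         # a local rebuild.
          |         h = hashlib.sha256()
          |         root = pathlib.Path(dist_dir)
          |         for p in sorted(root.rglob("*")):
          |             if p.is_file():
          |                 h.update(p.name.encode())
          |                 h.update(p.read_bytes())
          |         return h.hexdigest()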
         | 
         | The amount of time people waste futzing around in eg Groovy is
         | INSANE and I'm honestly inclined to reject job offers from
         | companies that have any serious CI code at this point.
        
         | droopyEyelids wrote:
         | In most companies the CI/Dev Tools team is a career dead end.
          | There is no way to show a business impact; it's just a
         | money pit that leadership can't/won't understand (and if they
         | do start to understand it, then it becomes _their_ money pit,
         | which is a career dead end for them) So no one who has their
         | head on straight wants to spend time improving it.
         | 
          | And you can't even really say it's a short-sighted attitude. It
         | definitely is from a developer's perspective, and maybe it is
         | for the company if dev time is what decides the success of the
         | business overall.
        
           | MangoToupe wrote:
           | > it's just a money pit that leadership can't/won't
           | understand
           | 
           | In my experience it's the opposite: they want more automated
           | testing, but don't want to pay for the friction this causes
           | on productivity.
        
         | yieldcrv wrote:
         | then kill the CI/CD
         | 
         | these redundant processes are for human interoperability
        
         | blitzar wrote:
          | Yet, now that I have added an LLM workflow to my coding, the
          | value of my old and mostly useless workflows is 10x'd.
         | 
         | Git checkpoints, code linting and my naive suite of unit and
         | integration tests are now crucial to my LLM not wasting _too
         | much_ time generating total garbage.
        
         | vjerancrnjak wrote:
         | It's because people don't know how to write tests. All of the
         | "don't do N select queries in a for loop" comments made in PRs
         | are completely ignored in tests.
         | 
         | Each test can output many db queries. And then you create
         | multiple cases.
         | 
         | People don't even know how to write code that just deals with N
         | things at a time.
         | 
         | I am confident that tests run slowly because the code that is
         | tested completely sucks and is not written for batch mode.
         | 
          | Ignoring batch mode, tests are most of the time written in a
         | way where test cases are run sequentially. Yet attempts to run
         | them concurrently result in flaky tests, because the way you
         | write them and the way you design interfaces does not allow
         | concurrent execution at all.
         | 
          | Another comment: code done by the best AI model still sucks.
          | Anything simple, like a music player with a library of 10000
          | songs, is something it can't do. The first attempt will be
          | horrible.
         | No understanding of concurrent metadata parsing, lists showing
         | 10000 songs at once in UI being slow etc.
         | 
         | So AI is just another excuse for people writing horrible code
          | and horrible tests. If it's so smart, try to speed up your CI
         | with it.
        
         | rapind wrote:
         | > This will make the CI bottleneck even worse.
         | 
         | I agree. I think there are potentially multiple solutions to
         | this since there are multiple bottlenecks. The most obvious is
         | probably network overhead when talking to a database. Another
         | might be storage overhead if storage is being used.
         | 
         | Frankly another one is language. I suspect type-safe, compiled,
         | functional languages are going to see some big advantages here
         | over dynamic interpreted languages. I think this is the sweet
         | spot that grants you a ton of performance over dynamic
         | languages, gives you more confidence in the models changes, and
         | requires less testing.
         | 
         | Faster turn-around, even when you're leaning heavily on AI, is
         | a competitive advantage IMO.
        
           | mike_hearn wrote:
           | It could go either way. Depends very much on what kind of
           | errors LLMs make.
           | 
           | Type safe languages in theory should do well, because you get
           | feedback on hallucinated APIs very fast. But if the LLM
           | generally writes code that compiles, unless the compiler is
           | very fast you might get out-run by an LLM just spitting out
           | JavaScript at high speed, because it's faster to run the
           | tests than wait for the compile.
           | 
           | The sweet spot is probably JIT compiled type safe languages.
           | Java, Kotlin, TypeScript. The type systems can find enough
           | bugs to be worth it, but you don't have to wait too long to
           | get test results either.
        
         | rafaelmn wrote:
         | > If anything CI got a lot slower over time as people tried to
         | make builds fully hermetic (so no inter-run caching), and move
         | them from on-prem dedicated hardware to expensive cloud VMs
         | with slow IO, which haven't got much faster over time.
         | 
         | I am guesstimating (based on previous experience self-hosting
         | the runner for MacOS builds) that the project I am working on
         | could get like 2-5x pipeline performance at 1/2 cost just by
         | using self-hosted runners on bare metal rented machines like
         | Hetzner. Maybe I am naive, and I am not the person that would
         | be responsible for it - but having a few bare metal machines
          | you can use in the off hours to run regression tests, for
          | less than you are paying the existing CI runner just for
          | builds - machines that speed up everything massively - seems
          | like a pure win for relatively low effort. Like sure,
          | everyone already has stuff on their plate and would rather
          | pay an external service to do it - but TBH once you have
          | this kind of compute handy you will find uses for it anyway,
          | just by doing things efficiently. And knowing how to deal
          | with bare metal/utilize this kind of compute sounds like a
          | generally useful skill - but I rarely encounter people
          | enthusiastic about making this kind of move. It's usually -
          | hey, let's move to this other service that has slightly
          | cheaper
         | instances and a proprietary caching layer so that we can get
         | locked into their CI crap.
         | 
          | It's not like these services have 0 downtime/are bug-free/do not
         | require integration effort - I just don't see why going bare
         | metal is always such a taboo topic even for simple stuff like
         | builds.
        
           | azeirah wrote:
           | At the last place I worked at, which was just a small startup
           | with 5 developers, I calculated that a server workstation in
           | the office would be both cheaper and more performant than
           | renting a similar machine in the cloud.
           | 
           | Bare metal makes such a big difference for test and CI
            | scenarios. It even has an integrated GPU to speed up webdev
            | tests. Good luck finding an affordable machine in the cloud
            | that has a proper GPU for this kind of use case.
        
             | rafaelmn wrote:
              | Is it a startup or a small business? In my book a startup
             | expects to scale and hosting bare metal HW in an office
             | with 5 people means you have to figure everything out again
             | when you get 20/50/100 people - IMO not worth the effort
             | and hosting hardware has zero transferable skills to your
             | product.
             | 
             | Running on managed bare metal servers is theoretically the
             | same as running any other infra provider except you are on
             | the hook for a bit more maintenance, you scale to 20 people
             | you just rent a few more machines. I really do not see many
             | downsides for the build server/test runner scenario.
        
           | mike_hearn wrote:
           | Yep. For my own company I used a bare metal machine in
           | Hetzner running Linux and a Windows VM along with a bunch of
           | old MacBook Pros wired up in the home office for CI.
           | 
           | It works, and it's cheap. A full CI run still takes half an
           | hour on the Linux machine (the product [1] is a kind of build
           | system for shipping desktop apps cross platform, so there's
           | lots of file IO and cryptography involved). The Macs are by
           | far the fastest. The M1 Mac is embarrassingly fast. It can
           | complete the same run in five minutes despite the Hetzner box
           | having way more hardware. In fairness, it's running both a
           | Linux and Windows build simultaneously.
           | 
           | I'm convinced the quickest way to improve CI times in most
           | shops is to just build an in-office cluster of M4 Macs in an
           | air conditioned room. They don't have to be HA. The hardware
           | is more expensive but you don't rent per month, and CI is
           | often bottlenecked on serial execution speed so the higher
           | single threaded performance of Apple Silicon is worth it.
           | Also, pay for a decent CI system like TeamCity. It helps
           | reduce egregious waste from problems like not caching things
           | or not re-using checkout directories. In several years of
           | doing this I haven't had build caching related failures.
           | 
           | [1] https://hydraulic.dev/
        
           | adamcharnock wrote:
           | > 2-5x pipeline performance at 1/2 cost just by using self-
           | hosted runners on bare metal rented machines like Hetzner
           | 
            | This is absolutely the case. It's a combination of having
           | dedicated CPU cores, dedicated memory bandwidth, and (perhaps
           | most of all) dedicated local NVMe drives. We see a 2x speed
           | up running _within VMs_ on bare metal.
           | 
           | > And knowing how to deal with bare metal/utilize this kind
           | of compute sounds generally useful skill - but I rarely
           | encounter people enthusiastic about making this kind of move
           | 
           | We started our current company for this reason [0]. A lot of
           | people know this makes sense on some level, but not many
           | people want to do it. So we say we'll do it for you, give you
           | the engineering time needed to support it, and you'll still
           | save money.
           | 
           | > I just don't see why going bare metal is always such a
           | taboo topic even for simple stuff like builds.
           | 
           | It is decreasingly so from what I see. Enough people have
           | been variously burned by public cloud providers to know they
           | are not a panacea. But they just need a little assistance in
           | making the jump.
           | 
           | [0] - https://lithus.eu
        
         | TheDudeMan wrote:
         | This is because coders didn't spend enough time making their
         | tests efficient. Maybe LLM coding agents can help with that.
        
         | grogenaut wrote:
          | Before cars, people spent little on petroleum products or
          | motor oil or gasoline or mechanics. Now they do. That's how
          | systems work. You wanna go faster? Well, you need better
          | roads, traffic lights, on-ramps, etc. You're still going
          | faster.
         | 
          | Use AI to solve the CI bottlenecks or build more features that
          | earn more revenue that buys more CI boxes. Same as if you
          | added 10 devs, which you effectively are doing with AI, so
          | why wouldn't some of the dev support costs go up?
         | 
         | Are you not in a place where you can make an efficiency
          | argument to get more CI or optimize? What does a CI box cost?
        
         | daxfohl wrote:
         | There are a couple mitigating considerations
         | 
          | 1. As the implementation phase gets faster, the bottleneck could
         | actually switch to PM. In which case, changes will be more
         | serial, so a lot fewer conflicts to worry about.
         | 
         | 2. I think we could see a resurrection of specs like TLA+. Most
         | engineers don't bother with them, but I imagine code agents
         | could quickly create them, verify the code is consistent with
         | them, and then require fewer full integration tests.
         | 
         | 3. When background agents are cleaning up redundant code, they
         | can also clean up redundant tests.
         | 
         | 4. Unlike human engineering teams, I expect AIs to work more
         | efficiently on monoliths than with distributed microservices.
         | This could lead to better coverage on locally runnable tests,
         | reducing flakes and CI load.
         | 
         | 5. It's interesting that even as AI increases efficiency, that
         | increased velocity and sheer amount of code it'll write and
         | execute for new use cases will create its own problems that
         | we'll have to solve. I think we'll continue to have new
         | problems for human engineers to solve for quite some time.
        
         | SoftTalker wrote:
         | Wow, your story gives me flashbacks to the 1990s when I worked
         | in a mainframe environment. Compile jobs submitted by
         | developers were among the lowest priorities. I could make a
         | change to a program, submit a compile job, and wait literally
         | half a day for it to complete. Then I could run my testing,
         | which again might have to wait for hours. I generally had other
         | stuff I could work on during those delays but not always.
        
         | trhway wrote:
         | >There's no point having an agent that can write code 100x
         | faster than a human if every change takes an hour to test.
         | 
         | Testing every change incrementally is a vestige of the code
         | being done by humans (and thus of the current approach where AI
         | helps and/or replaces one given human), in small increments at
         | that, and of the failures being analyzed by individual humans
          | who can keep in their head only a limited number of
         | things/dependencies at once.
        
         | ASinclair wrote:
         | Call me a skeptic but I do not believe LLMs are significantly
         | altering the time between commits so much that CI is the
         | problem.
         | 
         | However, improving CI performance is valuable regardless.
        
         | gdiamos wrote:
         | This sounds like a strawman.
         | 
         | GPUs can do 1 million trillion instructions per second.
         | 
         | Are you saying it's impossible to write a test that finishes in
         | less than one second on that machine?
         | 
         | Is that a fundamental limitation or an incredibly inefficient
         | test?
        
           | nradclif wrote:
           | A million trillion operations per second is literally an
           | exaflop. That's one hell of a GPU you have.
        
             | gdiamos wrote:
             | Thanks, I missed a factor of 1000x, it should be a million
             | billion
        
         | mrkeen wrote:
         | > Maybe I've just got unlucky in the past, but in most projects
         | I worked on a lot of developer time was wasted on waiting for
         | PRs to go green. Many runs end up bottlenecked on I/O or
         | availability of workers
         | 
         | No, this is common. The devs just haven't grokked dependency
         | inversion. And I think the rate of new devs entering the
         | workforce will keep it that way forever.
         | 
         | Here's how to make it slow:
         | 
         | * Always refer to "the database". You're not just storing and
          | retrieving objects _from anywhere_ - you're always using the
         | database.
         | 
         | * Work with statements, not expressions. Instead of "the
         | balance is the sum of the transactions", execute several
         | transaction writes (to _the database_ ) and read back the
         | resulting balance. This will force you to sequentialise the
         | tests (simultaneous tests would otherwise race and cause
         | flakiness) plus you get to write a bunch of setup and teardown
          | and wipe state between tests. (Fast alternative sketched below.)
         | 
         | * If you've done the above, you'll probably need to wait for
         | state changes before running an assertion. Use a thread sleep,
         | and if the test is ever flaky, bump up the sleep time and
         | commit it if the test goes green again.
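          | 
          | For contrast, the fast, expression-style version of the
          | second point (a minimal sketch):
          | 
          |     from collections import namedtuple
          | 
          |     Tx = namedtuple("Tx", "amount")
          | 
          |     def balance(transactions):
          |         # The balance *is* the sum of the transactions:
          |         # pure, no setup/teardown, no shared state, so
          |         # tests parallelise trivially.
          |         return sum(t.amount for t in transactions)
          | 
          |     def test_balance():
          |         assert balance([Tx(100), Tx(-30)]) == 70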
        
         | pamelafox wrote:
         | For Python apps, I've gotten good CI speedups by moving over to
         | the astral.sh toolchain, using uv for the package installation
         | with caching. Once I move to their type-checker instead of
         | mypy, that'll speed the CI up even more. The playwright test
         | running will then probably be the slowest part, and that's only
         | in apps with frontends.
         | 
         | (Also, Hi Mike, pretty sure I worked with you at Google Maps
          | back in the early 2000s, you were my favorite SRE so I trust
          | your
         | opinion on this!)
        
       | fastball wrote:
       | ICYMI, DeepMind also has a Gemini model that is diffusion-
       | based[1]. I've tested it a bit and while (like with this model)
       | the speed is indeed impressive, the quality of responses was much
       | worse than other Gemini models in my testing.
       | 
       | [1] https://deepmind.google/models/gemini-diffusion/
        
         | tripplyons wrote:
         | Is the Gemini Diffusion demo free? I've been on the waitlist
         | for it for a few weeks now.
        
         | Powdering7082 wrote:
         | From my minor testing I agree that it's crazy fast and not that
         | good at being correct
        
       | thelastbender12 wrote:
       | The speed here is super impressive! I am curious - are there any
        | qualitative ways in which modeling text using diffusion
        | differs from modeling it autoregressively? The kinds of
        | problems it works better on, creativity, and similar.
        
         | orbital-decay wrote:
          | One works in the coarse-to-fine direction, the other works
          | start-to-end. Which means different directionality biases,
          | at least.
         | Difference in speed, generalization, etc. is less clear and
         | needs to be proven in practice, as fundamentally they are
         | closer than it seems. Diffusion models have some well-studied
         | shortcuts to trade speed for quality, but nothing stops you
         | from implementing the same for the other type.
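          | 
          | (A toy masked-diffusion decode loop makes the coarse-to-fine
          | direction concrete. This is an illustrative Python sketch with
          | a dummy scorer, not Mercury's actual algorithm:)
          | 
          |   import random
          | 
          |   VOCAB = ["the", "cat", "sat", "on", "mat", "."]
          |   MASK = "<mask>"
          | 
          |   def denoise(tokens):
          |       # stand-in for the model: (token, confidence) per position
          |       return [(random.choice(VOCAB), random.random()) for _ in tokens]
          | 
          |   def diffusion_decode(length=8, steps=4):
          |       tokens = [MASK] * length  # start fully masked ("noise")
          |       for _ in range(steps):
          |           proposals = denoise(tokens)
          |           masked = [i for i, t in enumerate(tokens) if t == MASK]
          |           # commit the most confident positions first, anywhere
          |           # in the sequence, not left-to-right
          |           masked.sort(key=lambda i: -proposals[i][1])
          |           for i in masked[: max(1, length // steps)]:
          |               tokens[i] = proposals[i][0]
          |       return tokens
          | 
          |   print(diffusion_decode())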
        
           | ekunazanu wrote:
           | I once read that diffusion is essentially just autoregression
           | in the frequency domain. Honestly, that comparison didn't
           | seem too far off.
        
       | JimDabell wrote:
       | Pricing:
       | 
       | US$0.000001 per output token ($1/M tokens)
       | 
       | US$0.00000025 per input token ($0.25/M tokens)
       | 
       | https://platform.inceptionlabs.ai/docs#models
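        | 
        | (Back-of-envelope in Python for a typical request at those
        | rates:)
        | 
        |   input_price, output_price = 0.25, 1.00  # USD per million tokens
        |   tokens_in, tokens_out = 10_000, 2_000
        |   cost = (tokens_in / 1e6 * input_price
        |           + tokens_out / 1e6 * output_price)
        |   print(f"${cost:.4f}")  # $0.0045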
        
         | asaddhamani wrote:
         | The pricing is a little on the higher side. Working on a
         | performance-sensitive application, I tried Mercury and Groq
         | (Llama 3.1 8b, Llama 4 Scout) and the performance was neck-and-
         | neck but the pricing was way better for Groq.
         | 
         | But I'll be following diffusion models closely, and I hope we
         | get some good open source ones soon. Excited about their
         | potential.
        
           | tripplyons wrote:
           | Good to know. I didn't realize how good the pricing is on
           | Groq!
        
             | tlack wrote:
              | If your application is price-sensitive, check out
              | DeepInfra.com - they have a variety of models in the
              | pennies-per-mil range. Not quite as fast as Mercury, Groq,
              | or SambaNova though.
             | 
             | (I have no affiliation with this company aside from being a
             | happy customer the last few years)
        
       | empiko wrote:
       | I strongly believe that this will be a really important technique
        | in the near future. The cost savings this might create are
        | mouth-watering.
        
         | NitpickLawyer wrote:
         | > I strongly believe that this will be a really important
         | technique in the near future.
         | 
         | I share the same belief, but regardless of cost. What excites
         | me is the ability to "go both ways", edit previous tokens after
         | others have been generated, using other signals as "guided
         | generation", and so on. Next token prediction works for
         | "stories", but diffusion matches better with "coding flows"
          | (i.e. going back and forth: adding something, coming back,
          | importing something, editing something, and so on).
         | 
         | It would also be very interesting to see how applying this at
         | different "abstraction layers" would work. Say you have one
         | layer working on ctags, one working on files, and one working
         | on "functions". And they all "talk" to each other, passing
         | context and "re-diffusing" their respective layers after each
          | change. No idea where the data for this would come from,
          | maybe IDEs?
        
           | sansseriff wrote:
           | I wonder if there's a way to do diffusion within some sort of
           | schema-defined or type constrained space.
           | 
           | A lot of people these days are asking for structured output
           | from LLMs so that a schema is followed. Even if you train on
           | schema-following with a transformer, you're still just
           | 'hoping' in the end that the generated json matches the
           | schema.
           | 
            | I'm not a diffusion expert, but maybe there's a way to
           | diffuse one value in the 'space' of numbers, and another
           | value in the 'space' of all strings, as required by a schema:
           | 
           | { "type": "object", "properties": { "amount": { "type":
           | "number" }, "description": { "type": "string" } },
           | "required": ["amount", "description"] }
           | 
            | I'm not sure how far this could lead. Could you diffuse more
            | complex schemas that generalize to an arbitrary syntax tree?
           | E.g. diffuse some code in a programming language that is
           | guaranteed to be type-safe?
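            | 
            | (One naive shape for that, sketched in Python: restrict each
            | position's proposals to the token space its schema slot
            | allows. The slot vocabularies here are made up:)
            | 
            |   import random
            | 
            |   # hypothetical per-slot token spaces derived from the schema
            |   SLOT_SPACE = {
            |       "amount": ["0", "1", "42", "100", "3.14"],    # number
            |       "description": ["rent", "coffee", "refund"],  # string
            |   }
            | 
            |   def constrained_denoise(slots, steps=3):
            |       values = {}
            |       for _ in range(steps):
            |           for name in slots:
            |               # a real model would renormalize its logits over
            |               # the allowed space; uniform sampling stands in
            |               values[name] = random.choice(SLOT_SPACE[name])
            |       return values
            | 
            |   print(constrained_denoise(["amount", "description"]))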
        
       | baalimago wrote:
       | I, for one, am willing to trade accuracy for speed. I'd rather
        | have 10 iterations of poor replies which force me to ask the
       | right question than 1 reply which takes 10 times as long and
       | _maybe_ is good, since it tries to reason about my poor question.
        
         | PaulHoule wrote:
         | Personally I like asking coding agents a question and getting
         | an answer back immediately. Systems like Junie that go off and
          | research a bunch of irrelevant things, then ask permission,
          | then do a lot more irrelevant research, ask for more
          | permission, and then 15 minutes later give you a mountain of
          | broken code are a waste of time if you ask me. (Even if you
          | give permission in advance.)
        
       | pmxi wrote:
       | This is cool. I think faster models can unlock entirely new usage
       | paradigms, like how faster search enables incremental search.
        
       | amelius wrote:
       | Damn, that is fast. But it is faster than I can read, so
        | hopefully they can use that speed to produce better-quality
        | output. Because otherwise, I honestly don't see the
       | advantage, in practical terms, over existing LLMs. It's like
       | having a TV with a 200Hz refresh rate, where 100Hz is just fine.
        
         | pmxi wrote:
         | There are plenty of LLM use cases where the output isn't meant
         | to be read by a human at all. e.g:
         | 
         | parsing unstructured text into structured formats like JSON
         | 
         | translating between natural or programming languages
         | 
         | serving as a reasoning step in agentic systems
         | 
         | So even if it's "too fast to read," that speed can still be
         | useful
        
           | amelius wrote:
           | Sure, but I was talking about the chat interface, sorry if
           | that was not clear.
        
           | martinald wrote:
            | You're missing another big advantage: cost. If you can do
            | 1000 tok/s on a $2/hr H100 vs 60 tok/s on the same hardware,
            | you can price it at roughly 1/16th of the price for the
            | same margin.
        
         | Legend2440 wrote:
         | This lets you do more (potentially a lot more) reasoning steps
         | and tool calls before answering.
        
       | irthomasthomas wrote:
        | I've used Mercury quite a bit in my commit message generator. I
       | noticed it would always produce the exact same response if you
       | ran it multiple times, and increasing temperature didn't affect
       | it. To get some variability I added a $(uuidgen) to the prompt.
       | Then I could run it again for a new response if I didn't like the
       | first.
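        | 
        | (Roughly like this in Python; the endpoint and model name are
        | placeholders, and any OpenAI-compatible client works:)
        | 
        |   import uuid
        |   from openai import OpenAI
        | 
        |   client = OpenAI(base_url="https://api.example.com/v1",
        |                   api_key="...")
        | 
        |   def commit_message(diff: str) -> str:
        |       # salt the prompt so repeated runs give fresh candidates
        |       prompt = (f"Write a commit message for this diff:\n{diff}\n"
        |                 f"Session: {uuid.uuid4()}")
        |       resp = client.chat.completions.create(
        |           model="mercury-coder",  # placeholder name
        |           messages=[{"role": "user", "content": prompt}],
        |       )
        |       return resp.choices[0].message.content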
        
         | everlier wrote:
         | Something like https://github.com/av/klmbr could also work
        
       | seydor wrote:
        | I wonder if diffusion LLMs solve the hallucination problem more
        | effectively. In the same way that image models learned to create
        | less absurd images, dLLMs can perhaps learn to create sensible
        | responses more predictably.
        
       | awaymazdacx5 wrote:
        | Token embeddings with diffusion models, as in the 16x16 patch
        | encoding of vision transformers: the image is tokenized before
        | the transformer processes it. Perhaps decomposed virtualization
        | modulates according to a diffusion model.
        
       | storus wrote:
       | Can Mercury use tools? I haven't seen it described anywhere. How
       | about streaming with tools?
        
       | nashashmi wrote:
       | I guess this makes specific language patterns cheaper and more
       | artistic language patterns more expensive. This could be a good
       | way to limit pirated and masqueraded materials submitted by
       | students.
        
       | true_blue wrote:
       | I tried the playground and got a strange response. I asked for a
       | regex pattern, and the model gave itself a little game-plan, then
       | it wrote the pattern and started to write tests for it. But it
       | never stopped writing tests. It continued to write tests of
       | increasing size until I guess it reached a context limit and the
        | answer was canceled. Also, for each test it wrote, it added a
        | comment about whether the test should pass or fail, but after
        | about the 30th test it started getting those wrong too,
       | saying that a test should fail when actually it should pass if
       | the pattern is correct. And after about the 120th test, the tests
       | started to not even make sense anymore. They were just nonsense
       | characters until the answer got cut off.
       | 
       | The pattern it made was also wrong, but I think the first issue
       | is more interesting.
        
         | fiatjaf wrote:
         | This is too funny to be true.
        
         | beders wrote:
         | I think that's a prime example showing that token prediction
         | simply isn't good enough for correctness. It never will be.
         | LLMs are not designed to reason about code.
        
         | ianbicking wrote:
         | FWIW, I remember regular models doing this not that long ago,
         | sometimes getting stuck in something like an infinite loop
         | where they keep producing output that is only a slight
         | variation on previous output.
        
           | data-ottawa wrote:
            | If you shrink the context window on most models you'll get
            | this type of behaviour. If you go too small you end up with
            | basically gibberish, even on modern models like Gemini 2.5.
           | 
           | Mercury has a 32k context window according to the paper,
           | which could be why it does that.
        
       | skybrian wrote:
       | Company blog post: https://www.inceptionlabs.ai/introducing-
       | mercury-our-general...
       | 
       | News coverage from February:
       | https://techcrunch.com/2025/02/26/inception-emerges-from-ste...
        
       | mtillman wrote:
        | Ton of performance upside in most GPU-adjacent code right now.
        | 
        |  _However_, is this what arXiv is for? It seems more like
        | marketing than research. Please correct me if I'm wrong/naive
        | on this topic.
        
         | ricopags wrote:
         | not wrong, per se, but it's far from the first time
        
       | eden-u4 wrote:
       | No open model/weights?
        
         | krasin wrote:
          | Not only do they not release models/weights, they don't even
          | tell you the size of the models!
          | 
          | The linked whitepaper is pretty useless, and I am saying this
          | as a big fan of the diffusion-transformers-for-not-just-
          | images-or-videos approach.
          | 
          | Also, Gemini Diffusion ([1]) is way better at coding than
          | Mercury's offering.
         | 
         | 1. https://deepmind.google/models/gemini-diffusion/
        
       | gdiamos wrote:
       | I think the LLM dev community is underestimating these models.
       | E.g. there is no LLM inference framework that supports them
       | today.
       | 
        | Yes, the diffusion foundation models have higher cross-entropy.
        | But diffusion LLMs can also be post-trained and aligned, which
        | cuts the gap.
       | 
        | IMO, investing in post-training and data is easier than forcing
       | GPU vendors to invest in DRAM to handle large batch sizes and
       | forcing users to figure out how to batch their requests by
       | 100-1000x. It is also purely in the hands of LLM providers.
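        | 
        | (Toy numbers for the batching point: autoregressive decode is
        | memory-bandwidth bound, so each step streams all the weights
        | once regardless of batch size:)
        | 
        |   weights_gb = 16         # e.g. an 8B-param model at fp16
        |   bandwidth_gbs = 3000    # ballpark HBM bandwidth
        |   steps_per_s = bandwidth_gbs / weights_gb  # ~187
        |   for batch in (1, 100, 1000):
        |       print(batch, "reqs ->", int(steps_per_s * batch), "tok/s")
        |   # at batch 1 the GPU is mostly idle; the economics only work
        |   # out at large batch sizes, hence the DRAM pressure above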
        
         | mathiaspoint wrote:
         | You can absolutely tune causal LLMs. In fact the original idea
          | with GPTs was that you _had_ to tune them before they'd be
          | useful for anything.
        
           | gdiamos wrote:
           | Yes I agree you can tune autoregressive LLMs
           | 
           | You can also tune diffusion LLMs
           | 
           | After doing so, the diffusion LLM will be able to generate
           | more tokens/sec during inference
        
       | KaranSohi wrote:
       | We have used their LLM in our company and it's great! From
        | accuracy to speed of response generation, this model seems very
       | promising!
        
       | ceroxylon wrote:
        | The output is very fast, but many steps backwards in all of my
        | personal benchmarks. Great tech, but not usable in production
        | when over 60% of the output is hallucination.
        
         | mike_hearn wrote:
         | That might just depend on how big it is/how much money was
         | spent on training. The neural architecture can clearly work.
          | Beyond that, catching up may just be a matter of effort.
        
       | mmaunder wrote:
       | Holy shit that is fast. Try the playground. You need to get that
       | visceral experience to truly appreciate what the future looks
       | like.
        
       | mmaunder wrote:
       | Code output is verifiable in multiple ways. Combine that with
       | this kind of speed (and far faster in future) and you can brute
       | force your way to a killer app in a few minutes.
        
         | OneOffAsk wrote:
         | Yes, exactly. The demo of Gemini's Diffusion model [0] was
         | really eye-opening to me in this regard. Since then, I've been
         | convinced the future of lots of software engineering is
         | basically UX and SQA: describe the desired states, have an LLM
         | fill in the gaps based on its understanding of human intent,
         | and unit test it to verify. Like most engineering fields, we'll
         | have an empirical understanding of systems as opposed to the
         | analytical understanding of code we have today. I'd argue most
         | complex software is already only approximately understood even
         | before LLMs. I doubt the quality of software will go up (in
         | fact the opposite), but I think this work will scale much
         | better and be much, much more boring.
         | 
         | [0] https://simonwillison.net/2025/May/21/gemini-diffusion/
        
       | jonplackett wrote:
       | Wow, this thing is really quite smart.
       | 
       | I was expecting really crappy performance but just chatting to
       | it, giving it some puzzles, it feels very smart and gets a lot of
       | things right that a lot of other models don't.
        
       | ahmedhawas123 wrote:
        | Reinforcement learning really helped Transformer-based LLMs
        | evolve in terms of quality and reasoning, as we saw when
        | DeepSeek launched. I am curious whether this is equivalent to an
        | early GPT-4o that has not yet reaped the benefits of the add-on
        | technologies that helped improve quality.
        
       | M4v3R wrote:
       | I am personally _very_ excited for this development. Recently I
       | AI-coded a simple game for a game jam and half the time was spent
        | waiting for the AI agent to finish its work so I could test it.
        | If instead of waiting 1-2 minutes for every prompt to be executed
        | and implemented I could wait 10 seconds, that would be literally
        | game-changing. I could test 5-10 different versions of the same
        | idea in the time it took me to test one with the current tech.
       | 
       | Of course this model is not as advanced yet for this to be
       | feasible, but so was Claude 3.0 just over a year ago. This will
       | only get better over time I'm sure. Exciting times ahead of us.
        
       | ianbicking wrote:
       | For something a little different than a coding task, I tried
       | using it in my game: https://www.playintra.win/ (in settings you
       | can select Mercury, the game uses OpenRouter)
       | 
       | At first it seemed pretty competent and of course very fast, but
       | it seemed to really fall apart as the context got longer. The
       | context in this case is a sequence of events and locations, and
       | it needs to understand how those events are ordered and therefore
        | what the current situation and environment are (though there are
        | also lots of hints in the prompts to keep it focused on the
        | present moment). It's challenging, but lots of smaller models can
       | pull it off.
       | 
        | But this is also a first release and a new architecture. Maybe
        | it just needs more time to bake (GPT-3.5 couldn't do these things
       | either). Though I also imagine it might just perform
       | _differently_ from other LLMs, not really on the same spectrum of
       | performance, and requiring different prompting.
        
       | armcat wrote:
       | I've been looking at the code on their chat playground,
       | https://chat.inceptionlabs.ai/, and they have a helper function
       | `const convertOpenAIMessages = (convo) => { ... }`, which also
       | contains `models: ['gpt-3.5-turbo']`. I also see in API response:
       | `"openai": true`. Is it actually using OpenAI, or is it actually
       | calling its dLLM? Does anyone know?
       | 
        | Also: you can turn on "Diffusion Effect" in the top-right corner,
        | but this just seems to be an "animation gimmick", right?
        
         | Alifatisk wrote:
          | The speed of the response is waaay too quick for OpenAI as a
          | backend, it's almost instant!
        
           | armcat wrote:
           | I've been asking bespoke questions and the timing is >2
           | seconds, and slower than what I get for the same questions to
            | ChatGPT (using gpt-4.1-mini). I am looking at their call
            | stack and this is what I see: "verifyOpenAIConnection()",
            | "generateOpenAIChatCompletion()", "getOpenAIModels()", etc.
            | Maybe it's just so it's compatible with the OpenAI API?
        
             | martinald wrote:
             | Check the bottom, I think it's just some off the shelf chat
             | UI that uses OpenAI compatible API behind the scenes.
        
               | armcat wrote:
                | Ah got it, it looks like it's a whole bunch of things so
                | it can also interface with Ollama and other APIs.
        
       | Alifatisk wrote:
        | Love the UI in the playground, it reminds me of Qwen chat.
       | 
        | We have reached a point where the bottlenecks in genAI are not
        | knowledge or accuracy; they are the context window and speed.
        | 
        | Luckily, Google (and Meta?) has pushed the limits of the context
        | window to about 1 million tokens, which is incredible. But I feel
        | like today's options are still stuck at about a ~128k-token
        | window per chat, and after that it starts to forget.
        | 
        | Another issue is the time it takes for inference AND reasoning.
        | dLLMs are an interesting approach to this. I know we have Groq's
        | hardware as well.
        | 
        | I do wonder, can this be combined with Groq's hardware? Would the
        | response be instant then?
        | 
        | How many tokens can each chat handle in the playground? I
        | couldn't find much info about it.
        | 
        | Which model is it using for inference?
        | 
        | Also, is the training the same for dLLMs as for standard
        | autoregressive LLMs? Or are the weights and models completely
        | different?
        
         | martinald wrote:
         | I agree entirely with you. While Claude Code is amazing, it is
         | also slow as hell and the context issue keeps coming up
         | (usually at what feels like the worst possible time for me).
         | 
          | Most LLMs honestly feel like dialup (apart from this!).
          | 
          | AFAIK with traditional models context size is very memory-
          | intensive (though I know there are a lot of things that are
         | trying to 'optimize' this). I believe memory usage grows at the
         | square of context length, so even 10xing context length
         | requires 100x the memory.
         | 
         | (Image) diffusion does not grow like that, it is much more
          | linear. But I have no idea (yet!) about text diffusion models,
          | if someone wants to chip in :).
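          | 
          | (For reference, a rough ballpark in Python with toy numbers:
          | the KV cache itself grows linearly with context, while the
          | quadratic term is the attention score computation:)
          | 
          |   layers, heads, head_dim, fp16 = 32, 32, 128, 2
          | 
          |   def kv_cache_gb(ctx):
          |       # K and V per layer per token: linear in context length
          |       return 2 * layers * heads * head_dim * fp16 * ctx / 1e9
          | 
          |   def attn_score_elems(ctx):
          |       # every query attends to every key: quadratic in context
          |       return ctx * ctx
          | 
          |   for ctx in (8_000, 80_000):
          |       print(ctx, round(kv_cache_gb(ctx), 1), "GB cache,",
          |             attn_score_elems(ctx) / 1e9, "G score elements")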
        
         | kadushka wrote:
         | _We have reached a point where the bottlenecks in genAI is not
         | the knowledge or accuracy, it is the context window and speed._
         | 
         | You're joking, right? I'm using o3 and it can't do half of the
         | coding tasks I tried.
        
       | mxs_ wrote:
       | In their tech report, they say this is based on:
       | 
       | > "Our methods extend [28] through careful modifications to the
       | data and computation to scale up learning."
       | 
       | [28] is Lou et al. (2023), the "Score Entropy Discrete Diffusion"
       | (SEDD) model (https://arxiv.org/abs/2310.16834).
       | 
       | I wrote the first (as far as I can tell) independent from-scratch
       | reimplementation of SEDD:
       | 
       | https://github.com/mstarodub/dllm
       | 
       | My goal was making it as clean and readable as possible. I also
        | implemented the more complex denoising strategy they described
        | (but didn't themselves implement).
       | 
       | It runs on a single GPU in a few hours on a toy dataset.
        
       | mseri wrote:
       | Sounds all cool and interesting, however:
       | 
       | > By submitting User Submissions through the Services, you hereby
       | do and shall grant Inception a worldwide, non-exclusive,
       | perpetual, royalty-free, fully paid, sublicensable and
       | transferable license to use, edit, modify, truncate, aggregate,
       | reproduce, distribute, prepare derivative works of, display,
       | perform, and otherwise fully exploit the User Submissions in
       | connection with this site, the Services and our (and our
       | successors' and assigns') businesses, including without
       | limitation for promoting and redistributing part or all of this
       | site or the Services (and derivative works thereof) in any media
       | formats and through any media channels (including, without
       | limitation, third party websites and feeds), and including after
       | your termination of your account or the Services. For clarity,
       | Inception may use User Submissions to train artificial
       | intelligence models. (However, we will not train models using
       | submissions from users accessing our Services via OpenRouter.)
        
       ___________________________________________________________________
       (page generated 2025-07-07 23:00 UTC)