[HN Gopher] Mercury: Ultra-fast language models based on diffusion
___________________________________________________________________
Mercury: Ultra-fast language models based on diffusion
Author : PaulHoule
Score : 353 points
Date : 2025-07-07 12:31 UTC (10 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| mynti wrote:
| Is there a kind of nanoGPT for diffusion language models? I
| would love to understand them better.
| nvtop wrote:
| This video has a live coding part which implements a masked
| diffusion generation process:
| https://www.youtube.com/watch?v=oot4O9wMohw
| chc4 wrote:
| Using the free playground link, it is in fact extremely fast.
| The "diffusion mode" toggle is also pretty neat as a
| visualization, although I'm not sure how accurate it is - it
| renders as line noise and then refines, while in reality
| presumably those are tokens from an imprecise vector in some
| state space that then become more precise until it's only a
| definite word, right?
| PaulHoule wrote:
| It's insane how fast that thing is!
| maelito wrote:
| Link : https://chat.inceptionlabs.ai/
| icyfox wrote:
| Some text diffusion models use a continuous latent space, but
| they historically haven't done that well. Most of the ones
| we're seeing now are typically trained to predict actual token
| output that's fed forward into the next timestep. The
| diffusion property comes from their ability to modify previous
| timesteps to converge on the final output.
|
| I have an explanation about one of these recent architectures
| that seems similar to what Mercury is doing under the hood
| here: https://pierce.dev/notes/how-text-diffusion-works/
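|
| As a rough sketch (a toy illustration of the masked-diffusion
| idea, not Mercury's actual decoder; `model` and `tok` are
| stand-ins for a trained network and tokenizer), generation
| looks something like this:
|
|   import torch
|
|   def generate(model, tok, prompt_ids, seq_len=64, steps=8):
|       # Start with every non-prompt position masked out.
|       ids = torch.full((1, seq_len), tok.mask_token_id)
|       ids[0, :len(prompt_ids)] = torch.tensor(prompt_ids)
|       for _ in range(steps):
|           logits = model(ids)  # predict all positions at once
|           conf, preds = logits.softmax(-1).max(-1)
|           masked = ids == tok.mask_token_id
|           if not masked.any():
|               break
|           # Commit the most confident remaining positions;
|           # the rest stay masked for the next refinement pass.
|           k = max(1, int(masked.sum()) // steps)
|           keep = (conf * masked).topk(k, dim=-1).indices
|           ids[0, keep[0]] = preds[0, keep[0]]
|       return ids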
| chc4 wrote:
| Oh neat, thanks! The OP is surprisingly light on details on
| how it actually works and is mostly benchmarks, so this is
| very appreciated :)
| luckystarr wrote:
| I'm kind of impressed by the speed of it. I told it to write
| an MQTT topic pattern matcher based on a trie and it spat out
| something reasonable on the first try. It had a few
| compilation issues, but fair enough.
| earthnail wrote:
| Tried it on some coding questions and it hallucinated a lot, but
| the appearance (i.e. if you're not a domain expert) of the output
| is impressive.
| TechDebtDevin wrote:
| Oddly fast, almost instantaneous.
| mike_hearn wrote:
| A good chance to bring up something I've been flagging to
| colleagues for a while now: with LLM agents we are very quickly
| going to become even more CPU bottlenecked on testing performance
| than today, and every team I know of today was bottlenecked on CI
| speed even before LLMs. There's no point having an agent that can
| write code 100x faster than a human if every change takes an hour
| to test.
|
| Maybe I've just got unlucky in the past, but in most projects I
| worked on a lot of developer time was wasted on waiting for PRs
| to go green. Many runs end up bottlenecked on I/O or availability
| of workers, and so changes can sit in queues for hours, or they
| flake out and everything has to start again.
|
| As they get better coding agents are going to be assigned simple
| tickets that they turn into green PRs, with the model reacting to
| test failures and fixing them as they go. This will make the CI
| bottleneck even worse.
|
| It feels like there's a lot of low-hanging fruit in most
| projects' testing setups, but for some reason I've seen nearly no
| progress here for years. It feels like we kinda collectively got
| used to the idea that CI services are slow and expensive, then
| stopped trying to improve things. If anything CI got a lot slower
| over time as people tried to make builds fully hermetic (so no
| inter-run caching), and move them from on-prem dedicated hardware
| to expensive cloud VMs with slow IO, which haven't got much
| faster over time.
|
| Mercury is crazy fast and in a few quick tests I did, created
| good and correct code. How will we make test execution keep up
| with it?
| TechDebtDevin wrote:
| An LLM making a quick edit, <100 lines... sure. Asking an LLM
| to rubber-duck your code, sure. But integrating an LLM into
| your CI is going to end up costing you hundreds of hours of
| productivity on any large project. That, or you'll spend half
| the time you should be spending learning to write your own
| code on dialing in context sizing and prompt accuracy.
|
| I really, really don't understand the hubris around LLM
| tooling, and I don't see it catching on outside of personal
| projects and small web apps. These things don't handle complex
| systems well at all. You would have to put a gun in my mouth
| to let one of these things work on an important repo of mine
| without any supervision... and if I'm supervising the LLM I
| might as well do it myself, because I'm going to end up
| redoing 50% of its work anyway.
| mike_hearn wrote:
| I've used Claude with a large, mature codebase and it did
| fine. Not for every possible task, but for many.
|
| Probably, Mercury isn't as good at coding as Claude is. But
| even if it's not, there's lots of small tasks that LLMs can
| do without needing senior engineer level skills. Adding test
| coverage, fixing low priority bugs, adding nice animations to
| the UI etc. Stuff that maybe isn't critical so if a PR turns
| up and it's DOA you just close it, but which otherwise works.
|
| Note that many projects already use this approach with bots
| like Renovate. Such bots also consume a ton of CI time, but
| it's generally worth it.
| flir wrote:
| Don't want to put words in the parent commenter's mouth,
| but I think the key word is "unsupervised". Claude doesn't
| know what it doesn't know, and will keep going round the
| loop until the tests go green, or until the heat death of
| the universe.
| mike_hearn wrote:
| Yes, but you can just impose timeouts to solve that. If
| it's unsupervised the only cost is computation.
| airstrike wrote:
| IMHO LLMs are notoriously bad at test coverage. They
| usually hard code a value to have the test pass, since they
| lack the reasoning required to understand why the test
| exists or the concept of assertion, really
| wrs wrote:
| I don't know, Claude is very good at writing that utterly
| useless kind of unit test where every dependency is
| mocked out and the test is just the inverted dual of the
| original code. 100% coverage, nothing tested.
| conradkay wrote:
| Yeah and that's even worse because there's not an easy
| metric you can have the agent work towards and get
| feedback on.
|
| I'm not that into "prompt engineering" but tests seem
| like a big opportunity for improvement. Maybe something
| like (but much more thorough):
|
| 1. "Create a document describing all real-world actions
| which could lead to the code being used. List all
| methods/code which gets called before it (in order) along
| with their exact parameters and return value. Enumerate
| all potential edge cases and errors that could occur and
| if it ends up influencing this task. After that, write a
| high-level overview of what need to occur in this
| implementation. Don't make it top down where you think
| about what functions/classes/abstractions which are
| created, just the raw steps that will need to occur" 2.
| Have it write the tests 3. Have it write the code
|
| Maybe TDD ends up worse but I suspect the initial plan
| which is somewhat close to code makes that not the case
|
| Writing the initial doc yourself would definitely be
| better, but I suspect just writing one really good one,
| then giving it as an example in each subsequent prompt
| captures a lot of the improvement
| astrange wrote:
| This is why unit tests are the least useful kind of test
| and regression tests are the most useful.
|
| I think unit tests are best written /before/ the real
| code and thrown out after. Of course, that's extremely
| situational.
| DSingularity wrote:
| He is simply observing that if PR numbers and launch rates
| increase dramatically, CI cost will become untenable.
| kraftman wrote:
| I keep seeing this argument over and over again, and I have
| to wonder, at what point do you accept that maybe LLM's are
| useful? Like how many people need to say that they find it
| makes them more productive before you'll shift your
| perspective?
| candiddevmike wrote:
| People say they are more productive using Visual Basic, but
| that will never shift my perspective on it.
|
| Code is a liability. Code you didn't write is a ticking
| time bomb.
| psychoslave wrote:
| That's a tool, and it depends what you need to do. If it
| fits someone need and make them more productive, or even
| simply enjoy more the activity, good.
|
| Just because two people are fixing something on the whole
| doesn't mean the same tool will hold fine. Gum, pushpin,
| nail, screw,bolts?
|
| The parent thread did mention they use LLM successfully in
| small side project.
| dragonwriter wrote:
| > I keep seeing this argument over and over again, and I
| have to wonder, at what point do you accept that maybe
| LLM's are useful?
|
| The post you are responding to literally acknowledges that
| LLMs are useful in certain roles in coding in the first
| sentence.
|
| > Like how many people need to say that they find it makes
| them more productive before you'll shift your perspective?
|
| _Argumentum ad populum_ is not a good way of establishing
| fact claims beyond the fact of a belief being popular.
| kraftman wrote:
| ...and my comment clearly isn't talking about that, but at
| the suggestion that it's useless to write code with an LLM
| because you'll end up rewriting 50% of it.
|
| If everyone has an opinion different to mine, I don't
| instantly change my opinion, but I do try and investigate
| the source of the difference, to find out what I'm
| missing or what they are missing.
|
| The polarisation between people that find LLMs useful or
| not is very similar to the polarisation between people
| that find automated testing useful or not, and I have a
| suspicion they have the same underlying cause.
| nwienert wrote:
| You seem to think everyone shares your view, around me I
| see a lot of people acknowledging they are useful to a
| degree, but also clearly finding limits in a wide array
| of cases, including that they really struggle with
| logical code, architectural decisions, re-using the right
| code patterns, larger scale changes that aren't copy
| paste, etc.
|
| So far what I see is that if I provide lots of context
| and clear instructions in a mostly non-logical area of
| code, I can speed myself up by about 20-40%, but it only
| works on about 30-50% of the problems I solve day to day
| at a day job.
|
| So basically - it's roughly a 20% improvement in my
| productivity - because I spend most of my time on the
| difficult things it can't do anyway.
|
| Meanwhile these companies are raising billion dollar seed
| rounds and telling us that all programming will be done
| by AI by next year.
| ninetyninenine wrote:
| They say it's only effective for personal projects, but
| there's literally evidence of LLMs being used for things he
| says they can't be used for. Actual physical evidence.
|
| It's self-delusion. And the pace of AI is so fast that he
| may not be aware of how quickly LLMs are integrating into our
| coding environments. A year ago what he said could have been
| somewhat true, but right now it is clearly not true at all.
| MangoToupe wrote:
| > at what point do you accept that maybe LLM's are useful?
|
| LLMs _are_ useful, just not for every task and price point.
| blitzar wrote:
| Do the opposite - integrate your CI into your LLM.
|
| Make it run tests after it changes your code and either
| confirm it didn't break anything or go back and try again.
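|
| A minimal sketch of that loop, assuming a hypothetical
| llm_edit() helper that applies a model-generated patch, with
| pytest as the test runner:
|
|   import subprocess
|
|   def edit_until_green(task, max_attempts=5):
|       feedback = ""
|       for _ in range(max_attempts):
|           llm_edit(task, feedback)  # hypothetical LLM patch step
|           r = subprocess.run(["pytest", "-q"],
|                              capture_output=True, text=True)
|           if r.returncode == 0:
|               return True  # tests pass, keep the change
|           feedback = r.stdout + r.stderr  # feed failures back
|       return False  # out of attempts, revert the change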
| piva00 wrote:
| I haven't worked in places using off-the-shelf/SaaS CI in more
| than a decade so I feel my experience has been quite the
| opposite from yours.
|
| We always worked hard to make the CI/CD pipeline as fast as
| possible. I personally worked on those kinds of projects as an
| SRE at two different employers: first a smaller 300-person
| shop where I was responsible for all their infra needs (CI/CD,
| live deployments, later migrating to k8s when it became
| somewhat stable, at least enough for the workloads we ran, but
| still in its beta days); then at a different employer, some
| 5k+ strong, improving a CI/CD setup that used Jenkins as a
| backend, where we developed a completely different shim on top
| for developer experience while also building a bespoke worker
| scheduler/runner.
|
| I haven't experienced a CI/CD setup that takes longer than 10
| minutes to run in many, many years. I was quite surprised
| reading your comment, and feel spoiled that I haven't felt
| this pain for more than a decade; I didn't really expect it
| was still an issue.
| mike_hearn wrote:
| I think the prevalence of teams having a "CI guy" who often
| is developing custom glue, is a sign that CI is still not
| really working as well as it should given the age of the
| tech.
|
| I've done a lot of work on systems software over the years so
| there's often tests that are very I/O or computation heavy,
| lots of cryptography, or compilation, things like that. But
| probably there are places doing just ordinary CRUD web app
| development where there's Playwright tests or similar that
| are quite slow.
|
| A lot of the problems are cultural. CI times are a commons,
| so it can end in tragedy. If everyone is responsible for CI
| times then nobody is. Eventually management gets sick of
| pouring money into it and devs learn to juggle stacks of PRs
| on top of each other. Sometimes you get a lot of pushback on
| attempts to optimize CI because some devs will really scream
| about any optimization that might potentially go wrong (e.g.
| depending on your build system cache), even if caching
| nothing causes an explosion in CI costs. Not their money,
| after all.
| kccqzy wrote:
| > Maybe I've just got unlucky in the past, but in most projects
| I worked on a lot of developer time was wasted on waiting for
| PRs to go green.
|
| I don't understand this. Developer time is so much more
| expensive than machine time. Do companies not just double their
| CI workers after hearing people complain? It's just a throw-
| more-resources problem. When I was at Google, it was somewhat
| common for me to debug non-deterministic bugs such as a missing
| synchronization or fence causing flakiness; and it was common
| to just launch 10000 copies of the same test on 10000 machines
| to find perhaps a single digit number of failures. My current
| employer has a clunkier implementation of the same thing (no
| UI), but there's also a single command to launch 1000 test
| workers to run all tests from your own checkout. The goal is to
| finish testing a 1M loc codebase in no more than five minutes
| so that you get quick feedback on your changes.
|
| > make builds fully hermetic (so no inter-run caching)
|
| These are orthogonal. You want maximum deterministic CI steps
| so that you make builds fully hermetic and cache every single
| thing.
| mark_undoio wrote:
| > I don't understand this. Developer time is so much more
| expensive than machine time. Do companies not just double
| their CI workers after hearing people complain? It's just a
| throw-more-resources problem.
|
| I'd personally agree. But this sounds like the kind of thing
| that, at many companies, could be a real challenge.
|
| Ultimately, you can _measure_ dollars spent on CI workers.
| It's much harder and less direct to quantify the cost of not
| having them (until, for instance, people start taking
| shortcuts with testing and a regression escapes to
| production).
|
| That kind of asymmetry tends, unless somebody has a strong
| overriding vision of where the value _really_ comes from, to
| result in penny pinching on the wrong things.
| mike_hearn wrote:
| It's more than that. You can measure salaries too,
| measurement isn't the issue.
|
| The problem is that if you let people spend the company's
| money without any checks or balances they'll just blow
| through unlimited amounts of it. That's why companies
| always have lots of procedures and policies around expense
| reporting. There's no upper limit to how much money
| developers will spend on cloud hardware given the chance,
| as the example above of casually running a test 10,000
| times in parallel demonstrates nicely.
|
| CI doesn't require you to fill out an expense report every
| time you run a PR thank goodness, but there still has to be
| a way to limit financial liability. Usually companies do
| start out by doubling cluster sizes a few times, but each
| time it buys a few months and then the complaints return.
| After a few rounds of this managers realize that demand is
| unlimited and start pushing back on always increasing the
| budget. Devs get annoyed and spend an afternoon on
| optimizations, suddenly times are good again.
|
| The meme on HN is that developer time is always more
| expensive than machine time, but I've been on both sides of
| this and seen how the budgets work out. It's often not
| true, especially if you use clouds like Azure which are
| overloaded and expensive, or have plenty of junior devs,
| and/or teams outside the US where salaries are lower.
| There's often a lot of low hanging fruit in test times so
| it can make sense to optimize, even so, huge waste is still
| the order of the day.
| mike_hearn wrote:
| I was also at Google for years. Places like that are not even
| close to representative. They can afford to just-throw-more-
| resources, they get bulk discounts on hardware and they pay
| top dollar for engineers.
|
| In more common scenarios that represent 95% of the software
| industry CI budgets are fixed, clusters are sized to be busy
| most of the time, and you cannot simply launch 10,000 copies
| of the same test on 10,000 machines. And even despite that
| these CI clusters can easily burn through the equivalent of
| several SWE salaries.
|
| _> These are orthogonal. You want maximum deterministic CI
| steps so that you make builds fully hermetic and cache every
| single thing._
|
| Again, that's how companies like Google do it. In _normal_
| companies, build caching isn't always perfectly reliable,
| and if CI runs suffer flakes due to caching then eventually
| some engineer is gonna get mad and convince someone else to
| turn the caching off. Blaze goes to extreme lengths to ensure
| this doesn't happen, and Google spends extreme sums of money
| on helping it do that (e.g. porting third party libraries to
| use Blaze instead of their own build system).
|
| In companies without money printing machines, they sacrifice
| caching to get determinism and everything ends up slow.
| PaulHoule wrote:
| Most of my experience writing concurrent/parallel code in
| (mainly) Java has been rewriting half-baked stuff that
| would need a lot of testing with straightforward reliable
| and reasonably performant code that uses sound and easy-to-
| use primitives such as Executors (watch out for teardown
| though), database transactions, atomic database operations,
| etc. Drink the Kool-Aid and mess around with _synchronized_
| or actors or Streams or something and you're looking at a
| world of hurt.
|
| I've written a limited number of systems that needed tests
| that probe for race conditions by doing something like
| having 3000 threads run a random workload for 40 seconds.
| I'm proud of that "SuperHammer" test on a certain level but
| boy did I hate having to run it with every build.
| kridsdale1 wrote:
| I'm at Google today and even with all the resources, I am
| absolutely most bottlenecked by the Presubmit TAP and human
| review latency. Making CLs in the editor takes me a few
| hours. Getting them in the system takes days and sometimes
| weeks.
| simonw wrote:
| Presumably the "days and sometimes weeks" thing is
| entirely down to human review latency?
| mystified5016 wrote:
| IME it's less of a "throw more resources" problem and more of
| a "stop using resources in literally the worst way possible"
| one.
|
| CI caching is, apparently, extremely difficult. Why spend a
| couple of hours learning about your CI caches when you can
| just download and build the same pinned static library a
| billion times? The server you're downloading from is (of
| course) someone else's problem and you don't care about
| wasting their resources either. The power you're burning by
| running CI for three hours instead of one is also someone
| else's problem. Compute time? Someone else's problem. Cloud
| costs? You bet it's someone else's problem.
|
| Sure, some things you don't want to cache. I _always_ do a
| 100% clean build when cutting a release or merging to master.
| But for intermediate commits on a feature branch? Literally
| no reason not to cache builds the exact same way you do on
| your local machine.
| ronbenton wrote:
| >Do companies not just double their CI workers after hearing
| people complain?
|
| They do not.
|
| I don't know if it's a matter of justifying management
| levels, but these discussions are often drawn out and
| belabored in my experience. By the time you get approval, or
| even worse, rejected, for asking for more compute (or
| whatever the ask is), you've spent way more money on the
| human resource time than you would ever spend on the
| requested resources.
| kccqzy wrote:
| I have never once been refused by a manager or director
| when I am explicitly asking for cost approval. The only
| kind of long and drawn out discussions are unproductive
| technical decision making. Example: the ask of "let's spend
| an extra $50,000 worth of compute on CI" is quickly
| approved but "let's locate the newly approved CI resource
| to a different data center so that we have CI in multiple
| DCs" solicits debates that can last weeks.
| mysteria wrote:
| This is exactly my experience with asking for more compute
| at work. We have to prepare loads of written justification,
| come up with alternatives or optimizations (which we
| already know won't work), etc. and in the end we choose the
| slow compute and reduced productivity over the bureaucracy.
|
| And when we manage to make a proper request it ends up
| being rejected anyways as many other teams are asking for
| the same thing and "the company has limited resources".
| Duh.
| IshKebab wrote:
| Developer time is more expensive than machine time, but at
| most companies it isn't 10000x more expensive. Google is
| likely an exception because it pays extremely well and has
| access to very cheap machines.
|
| Even then, there are other factors:
|
| * You might need commercial licenses. It may be very cheap to
| run open source code 10000x, but guess how much 10000 Questa
| licenses cost.
|
| * Moore's law is dead; Amdahl's law very much isn't. Not
| everything is embarrassingly parallel.
|
| * Some people care about the environment. I worked at a
| company that spent 200 CPU hours on every single PR (even to
| fix typos; I failed to convince them they were insane for not
| using Bazel or similar). That's a not insignificant amount of
| CO2.
| underdeserver wrote:
| That's solvable with modern cloud offerings - provision
| spot instances for a few minutes and shut them down
| afterwards. Let the cloud provider deal with demand
| balancing.
|
| I think the real issue is that developers waiting for PRs
| to go green are taking a coffee break between tasks, not
| sitting idly getting annoyed. If that's the case you're
| cutting into rest time and won't get much value out of
| optimizing this.
| IshKebab wrote:
| Both companies I've worked in recently have been too
| paranoid about IP to use the cloud for CI.
|
| Anyway I don't see how that solves any of the issues
| except maybe cost to some degree (but maybe not; cloud is
| expensive).
| fragmede wrote:
| Sorta. For CI/CD you can use spot instances and spin them
| down outside of business hours, so they can end up being
| cheaper than buying many really beefy machines and
| amortizing them over the standard depreciation schedule.
| simonw wrote:
| Were they running CI on their own physical servers under
| a desk or in a basement somewhere, or renting their own
| racks in a data center just for CI?
| jiggawatts wrote:
| That's paranoid to the point of lunacy.
|
| Azure for example has "confidential compute" that
| encrypts even the memory contents of the VM such that
| even their own engineers can't access the contents.
|
| As long as you don't back up the disks and use HTTPS for
| pulls, I don't see a realistic business risk.
|
| If a cloud like Azure or AWS got caught stealing
| competitor code they'd be sued _and_ immediately lose a
| huge chunk of their customers.
|
| It makes zero business sense to do so.
|
| PS: Microsoft employees have made public comments saying
| that they refuse to even _look_ at some open source
| repository to avoid any risk of accidentally
| "contaminating" their own code with something that has an
| incompatible license.
| kccqzy wrote:
| I don't know about Azure's implementation of confidential
| compute, but GCP's version essentially relies on AMD
| SEV-SNP. Historically there have been vulnerabilities
| that undermine the confidentiality guarantee.
| hyperpape wrote:
| > Moore's law is dead; Amdahl's law
|
| Yes, but the OP specifically is talking about CI for large
| numbers of pull requests, which should be very
| parallelizable (I can imagine exceptions, but only with
| anti-patterns, e.g. if your test pipeline makes some kind
| of requests to something that itself isn't scalable).
| vlovich123 wrote:
| Actually, OP was talking about the throughput of running
| on a large number of pull requests and the latency of
| running on a single pull request. The latter is not
| necessarily parallelizable.
| physicsguy wrote:
| Not really, in most small companies/departments, £100k a
| month is considered a painful cloud bill and adding more EC2
| instances to provide cloud runners can add 10% to that
| easily.
| wat10000 wrote:
| Many companies are strangely reluctant to spend money on
| hardware for developers. They might refuse to spend $1,000 on
| a better laptop to be used for the next three years by an
| employee, whose time costs them that much money in a single
| afternoon.
| kridsdale1 wrote:
| I have faced this at each of the $50B in profit companies I
| have worked at.
| PaulHoule wrote:
| That's been a pet peeve of mine for so long. (Glad my
| current employer gets me the best 1.5 machine from Dell
| every few years!)
|
| On the other hand I've seen many overcapitalized pre-launch
| startups go for months with a $20,000+ AWS bill without
| thinking about it then suddenly panic about what they're
| spending; they'd find tens of XXXXL instances spun up doing
| nothing, S3 buckets full of hundreds of terabytes of temp
| files that never got cleared out, etc. With basic due
| diligence they could have gotten that down to $2k a month,
| somebody obsessive about cost control could have done even
| better.
| wbl wrote:
| No it is not. Senior management often has a barely disguised
| contempt for engineering and for spending money to do a better
| job. They listen much more to sales complaining.
| kridsdale1 wrote:
| That depends on the company.
| MangoToupe wrote:
| Writing testing infrastructure so that you _can_ just double
| workers and get a corresponding doubling in productivity is
| non-trivial. Certainly I've never seen anything like
| Google's testing infrastructure anywhere else I've worked.
| mike_hearn wrote:
| Yeah Google's infrastructure is unique because Blaze is
| tightly integrated with the remote execution workers and
| can shard testing work across many machines automatically.
| Most places can't do that so once you have enough hardware
| that queue depth isn't too big you can't make anything go
| faster by adding hardware, you can only try to scale
| vertically or optimize. But if you're using hosted CI SaaS
| it's often not easy to get bigger machines, or the
| bigger machines are superlinear in cost.
| socalgal2 wrote:
| Even Google cannot buy more old Intel Macs or Pixel 6s or
| Samsung S20s to increase their testing on those devices (as
| an example).
|
| Maybe that affects fewer devs - those who don't need to test
| on actual hardware - but plenty of apps do. Pretty much
| anything that touches a GPU driver, like a game, for example.
| anp wrote:
| I'm currently at google (opinions not representative of my
| employer's etc) and this is true for things that run in a
| data center but it's a lot harder for things that need to be
| tested on physical hardware like parts of Android or CrOS.
| wavemode wrote:
| You're confusing throughput and latency. Lengthy CI runs
| increase the latency of developer output, but they don't
| significantly reduce overall throughput, given a developer
| will typically be working on multiple things at once, and can
| just switch tasks while CI is running. The productivity cost
| of CI is not zero, but it's way, way less than the raw
| wallclock time spent per run.
|
| Then also factor in that most developer tasks are not even
| bottlenecked by CI. They are bottlenecked primarily by code
| review, and secondarily by deployment.
| mathiaspoint wrote:
| Good God I hate CI. Just let me run the build automation myself
| dammit! If you're worried about reproducibility make it
| reproducible and hash the artifacts, make people include the
| hash in the PR comment if you want to enforce it.
|
| The amount of time people waste futzing around in e.g. Groovy
| is INSANE and I'm honestly inclined to reject job offers from
| companies that have any serious CI code at this point.
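|
| The enforcement half really is tiny if the build is
| reproducible - a sketch, with the artifact directory name
| being an assumption:
|
|   import hashlib, pathlib, sys
|
|   def artifact_hash(build_dir="dist"):
|       # Hash all artifacts in a stable order -> stable digest.
|       h = hashlib.sha256()
|       for p in sorted(pathlib.Path(build_dir).rglob("*")):
|           if p.is_file():
|               h.update(p.read_bytes())
|       return h.hexdigest()
|
|   if __name__ == "__main__":
|       claimed = sys.argv[1]  # the hash pasted into the PR
|       sys.exit(0 if claimed == artifact_hash() else 1)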
| droopyEyelids wrote:
| In most companies the CI/Dev Tools team is a career dead end.
| There is no possibility to show a business impact, it's just a
| money pit that leadership can't/won't understand (and if they
| do start to understand it, then it becomes _their_ money pit,
| which is a career dead end for them) So no one who has their
| head on straight wants to spend time improving it.
|
| And you can't even really say it's a short sighted attitude. It
| definitely is from a developer's perspective, and maybe it is
| for the company if dev time is what decides the success of the
| business overall.
| MangoToupe wrote:
| > it's just a money pit that leadership can't/won't
| understand
|
| In my experience it's the opposite: they want more automated
| testing, but don't want to pay for the friction this causes
| on productivity.
| yieldcrv wrote:
| then kill the CI/CD
|
| these redundant processes are for human interoperability
| blitzar wrote:
| Yet now that I have added an LLM workflow to my coding, the
| value of my old and mostly useless workflows has been 10x'd.
|
| Git checkpoints, code linting and my naive suite of unit and
| integration tests are now crucial to stopping my LLM from
| wasting _too much_ time generating total garbage.
| vjerancrnjak wrote:
| It's because people don't know how to write tests. All of the
| "don't do N select queries in a for loop" comments made in PRs
| are completely ignored in tests.
|
| Each test can output many db queries. And then you create
| multiple cases.
|
| People don't even know how to write code that just deals with N
| things at a time.
|
| I am confident that tests run slowly because the code that is
| tested completely sucks and is not written for batch mode.
|
| Ignoring batch mode, tests are most of the time written in a
| way where test cases run sequentially. Yet attempts to run
| them concurrently result in flaky tests, because the way you
| write them and the way you design interfaces does not allow
| concurrent execution at all.
|
| Another comment: code done by the best AI model still sucks.
| Anything simple, like a music player with a library of 10,000
| songs, is something it can't do. The first attempt will be
| horrible: no understanding of concurrent metadata parsing,
| lists showing 10,000 songs at once in the UI being slow, etc.
|
| So AI is just another excuse for people writing horrible code
| and horrible tests. If it's so smart, try to speed up your CI
| with it.
| rapind wrote:
| > This will make the CI bottleneck even worse.
|
| I agree. I think there are potentially multiple solutions to
| this since there are multiple bottlenecks. The most obvious is
| probably network overhead when talking to a database. Another
| might be storage overhead if storage is being used.
|
| Frankly another one is language. I suspect type-safe, compiled,
| functional languages are going to see some big advantages here
| over dynamic interpreted languages. I think this is the sweet
| spot that grants you a ton of performance over dynamic
| languages, gives you more confidence in the models changes, and
| requires less testing.
|
| Faster turn-around, even when you're leaning heavily on AI, is
| a competitive advantage IMO.
| mike_hearn wrote:
| It could go either way. Depends very much on what kind of
| errors LLMs make.
|
| Type safe languages in theory should do well, because you get
| feedback on hallucinated APIs very fast. But if the LLM
| generally writes code that compiles, unless the compiler is
| very fast you might get out-run by an LLM just spitting out
| JavaScript at high speed, because it's faster to run the
| tests than wait for the compile.
|
| The sweet spot is probably JIT compiled type safe languages.
| Java, Kotlin, TypeScript. The type systems can find enough
| bugs to be worth it, but you don't have to wait too long to
| get test results either.
| rafaelmn wrote:
| > If anything CI got a lot slower over time as people tried to
| make builds fully hermetic (so no inter-run caching), and move
| them from on-prem dedicated hardware to expensive cloud VMs
| with slow IO, which haven't got much faster over time.
|
| I am guesstimating (based on previous experience self-hosting
| the runner for MacOS builds) that the project I am working on
| could get like 2-5x pipeline performance at 1/2 cost just by
| using self-hosted runners on bare metal rented machines like
| Hetzner. Maybe I am naive, and I am not the person that would
| be responsible for it - but having a few bare metal machines
| you can use in the off hours to run regression tests, for less
| than you are paying the existing CI runner just for build, that
| speed up everything massively seems like a pure win for
| relatively low effort. Sure, everyone already has stuff on
| their plate and would rather pay an external service to do it -
| but TBH once you have this kind of compute handy you will find
| uses for it anyway, just by doing things efficiently. And
| knowing how to deal with bare metal/utilize this kind of
| compute sounds like a generally useful skill - but I rarely
| encounter people enthusiastic about making this kind of move.
| It's usually: hey, let's move to this other service that has
| slightly cheaper instances and a proprietary caching layer so
| that we can get locked into their CI crap.
|
| It's not like these services have zero downtime, are bug-free,
| or require no integration effort - I just don't see why going
| bare metal is always such a taboo topic, even for simple stuff
| like builds.
| azeirah wrote:
| At the last place I worked at, which was just a small startup
| with 5 developers, I calculated that a server workstation in
| the office would be both cheaper and more performant than
| renting a similar machine in the cloud.
|
| Bare metal makes such a big difference for test and CI
| scenarios. It even has an integrated GPU to speed up webdev
| tests. Good luck finding an affordable machine in the cloud
| that has a proper GPU for this kind of use case.
| rafaelmn wrote:
| Is it a startup or a small business? In my book a startup
| expects to scale, and hosting bare-metal HW in an office
| with 5 people means you have to figure everything out again
| when you get to 20/50/100 people - IMO not worth the effort,
| and hosting hardware has zero transferable skills for your
| product.
|
| Running on managed bare metal servers is theoretically the
| same as running any other infra provider except you are on
| the hook for a bit more maintenance, you scale to 20 people
| you just rent a few more machines. I really do not see many
| downsides for the build server/test runner scenario.
| mike_hearn wrote:
| Yep. For my own company I used a bare metal machine in
| Hetzner running Linux and a Windows VM along with a bunch of
| old MacBook Pros wired up in the home office for CI.
|
| It works, and it's cheap. A full CI run still takes half an
| hour on the Linux machine (the product [1] is a kind of build
| system for shipping desktop apps cross platform, so there's
| lots of file IO and cryptography involved). The Macs are by
| far the fastest. The M1 Mac is embarrassingly fast. It can
| complete the same run in five minutes despite the Hetzner box
| having way more hardware. In fairness, it's running both a
| Linux and Windows build simultaneously.
|
| I'm convinced the quickest way to improve CI times in most
| shops is to just build an in-office cluster of M4 Macs in an
| air conditioned room. They don't have to be HA. The hardware
| is more expensive but you don't rent per month, and CI is
| often bottlenecked on serial execution speed so the higher
| single threaded performance of Apple Silicon is worth it.
| Also, pay for a decent CI system like TeamCity. It helps
| reduce egregious waste from problems like not caching things
| or not re-using checkout directories. In several years of
| doing this I haven't had build caching related failures.
|
| [1] https://hydraulic.dev/
| adamcharnock wrote:
| > 2-5x pipeline performance at 1/2 cost just by using self-
| hosted runners on bare metal rented machines like Hetzner
|
| This is absolutely the case. Its a combination of having
| dedicated CPU cores, dedicated memory bandwidth, and (perhaps
| most of all) dedicated local NVMe drives. We see a 2x speed
| up running _within VMs_ on bare metal.
|
| > And knowing how to deal with bare metal/utilize this kind
| of compute sounds generally useful skill - but I rarely
| encounter people enthusiastic about making this kind of move
|
| We started our current company for this reason [0]. A lot of
| people know this makes sense on some level, but not many
| people want to do it. So we say we'll do it for you, give you
| the engineering time needed to support it, and you'll still
| save money.
|
| > I just don't see why going bare metal is always such a
| taboo topic even for simple stuff like builds.
|
| It is decreasingly so from what I see. Enough people have
| been variously burned by public cloud providers to know they
| are not a panacea. But they just need a little assistance in
| making the jump.
|
| [0] - https://lithus.eu
| TheDudeMan wrote:
| This is because coders didn't spend enough time making their
| tests efficient. Maybe LLM coding agents can help with that.
| grogenaut wrote:
| Before cars people spent little on petroleum products or motor
| oil or gasoline or mechanics. Now they do. That's how systems
| work. You wanna go faster well you need better roads, traffic
| lights, on ramps, etc. you're still going faster.
|
| Use AI to solve the CI bottlenecks, or build more features
| that earn more revenue to buy more CI boxes. It's the same as
| if you added 10 devs - which you effectively are with AI - so
| why wouldn't some of the dev support costs go up?
|
| Are you not in a place where you can make an efficiency
| argument to get more CI or optimize? What does a CI box cost?
| daxfohl wrote:
| There are a couple mitigating considerations
|
| 1. As implementation phase gets faster, the bottleneck could
| actually switch to PM. In which case, changes will be more
| serial, so a lot fewer conflicts to worry about.
|
| 2. I think we could see a resurrection of specs like TLA+. Most
| engineers don't bother with them, but I imagine code agents
| could quickly create them, verify the code is consistent with
| them, and then require fewer full integration tests.
|
| 3. When background agents are cleaning up redundant code, they
| can also clean up redundant tests.
|
| 4. Unlike human engineering teams, I expect AIs to work more
| efficiently on monoliths than with distributed microservices.
| This could lead to better coverage on locally runnable tests,
| reducing flakes and CI load.
|
| 5. It's interesting that even as AI increases efficiency, that
| increased velocity and sheer amount of code it'll write and
| execute for new use cases will create its own problems that
| we'll have to solve. I think we'll continue to have new
| problems for human engineers to solve for quite some time.
| SoftTalker wrote:
| Wow, your story gives me flashbacks to the 1990s when I worked
| in a mainframe environment. Compile jobs submitted by
| developers were among the lowest priorities. I could make a
| change to a program, submit a compile job, and wait literally
| half a day for it to complete. Then I could run my testing,
| which again might have to wait for hours. I generally had other
| stuff I could work on during those delays but not always.
| trhway wrote:
| >There's no point having an agent that can write code 100x
| faster than a human if every change takes an hour to test.
|
| Testing every change incrementally is a vestige of the code
| being done by humans (and thus of the current approach where AI
| helps and/or replaces one given human), in small increments at
| that, and of the failures being analyzed by individual humans
| who can keep in their head only a limited number of
| things/dependencies at once.
| ASinclair wrote:
| Call me a skeptic but I do not believe LLMs are significantly
| altering the time between commits so much that CI is the
| problem.
|
| However, improving CI performance is valuable regardless.
| gdiamos wrote:
| This sounds like a strawman.
|
| GPUs can do 1 million trillion instructions per second.
|
| Are you saying it's impossible to write a test that finishes in
| less than one second on that machine?
|
| Is that a fundamental limitation or an incredibly inefficient
| test?
| nradclif wrote:
| A million trillion operations per second is literally an
| exaflop. That's one hell of a GPU you have.
| gdiamos wrote:
| Thanks, I missed a factor of 1000x, it should be a million
| billion
| mrkeen wrote:
| > Maybe I've just got unlucky in the past, but in most projects
| I worked on a lot of developer time was wasted on waiting for
| PRs to go green. Many runs end up bottlenecked on I/O or
| availability of workers
|
| No, this is common. The devs just haven't grokked dependency
| inversion. And I think the rate of new devs entering the
| workforce will keep it that way forever.
|
| Here's how to make it slow:
|
| * Always refer to "the database". You're not just storing and
| retrieving objects _from anywhere_ - you're always using the
| database.
|
| * Work with statements, not expressions. Instead of "the
| balance is the sum of the transactions", execute several
| transaction writes (to _the database_) and read back the
| resulting balance. This will force you to sequentialise the
| tests (simultaneous tests would otherwise race and cause
| flakiness) plus you get to write a bunch of setup and teardown
| and wipe state between tests.
|
| * If you've done the above, you'll probably need to wait for
| state changes before running an assertion. Use a thread sleep,
| and if the test is ever flaky, bump up the sleep time and
| commit it if the test goes green again.
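|
| For contrast, a sketch of the fast version: keep the rule an
| expression and the test never touches a database, so there's
| nothing to sequentialise and nothing to sleep on.
|
|   def balance(transactions):
|       # "The balance is the sum of the transactions."
|       return sum(transactions)
|
|   def test_balance():
|       # Pure function: thousands of these can run in
|       # parallel with zero flakiness.
|       assert balance([100, -30, 5]) == 75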
| pamelafox wrote:
| For Python apps, I've gotten good CI speedups by moving over to
| the astral.sh toolchain, using uv for the package installation
| with caching. Once I move to their type-checker instead of
| mypy, that'll speed the CI up even more. The Playwright test
| run will then probably be the slowest part, and that's only
| in apps with frontends.
|
| (Also, Hi Mike, pretty sure I worked with you at Google Maps
| back in early 2000s, you were my favorite SRE so I trust your
| opinion on this!)
| fastball wrote:
| ICYMI, DeepMind also has a Gemini model that is diffusion-
| based[1]. I've tested it a bit and while (like with this model)
| the speed is indeed impressive, the quality of responses was much
| worse than other Gemini models in my testing.
|
| [1] https://deepmind.google/models/gemini-diffusion/
| tripplyons wrote:
| Is the Gemini Diffusion demo free? I've been on the waitlist
| for it for a few weeks now.
| Powdering7082 wrote:
| From my minor testing I agree that it's crazy fast and not that
| good at being correct
| thelastbender12 wrote:
| The speed here is super impressive! I am curious - are there any
| qualitative ways in which modeling text with diffusion differs
| from using autoregressive models? The kinds of problems it
| works better on, creativity, and similar.
| orbital-decay wrote:
| One works in the coarse-to-fine direction, another works start-
| to-end. Which means different directionality biases, at least.
| Difference in speed, generalization, etc. is less clear and
| needs to be proven in practice, as fundamentally they are
| closer than it seems. Diffusion models have some well-studied
| shortcuts to trade speed for quality, but nothing stops you
| from implementing the same for the other type.
| ekunazanu wrote:
| I once read that diffusion is essentially just autoregression
| in the frequency domain. Honestly, that comparison didn't
| seem too far off.
| JimDabell wrote:
| Pricing:
|
| US$0.000001 per output token ($1/M tokens)
|
| US$0.00000025 per input token ($0.25/M tokens)
|
| https://platform.inceptionlabs.ai/docs#models
| asaddhamani wrote:
| The pricing is a little on the higher side. Working on a
| performance-sensitive application, I tried Mercury and Groq
| (Llama 3.1 8b, Llama 4 Scout) and the performance was neck-and-
| neck but the pricing was way better for Groq.
|
| But I'll be following diffusion models closely, and I hope we
| get some good open source ones soon. Excited about their
| potential.
| tripplyons wrote:
| Good to know. I didn't realize how good the pricing is on
| Groq!
| tlack wrote:
| If your application is pricing sensitive, check out
| DeepInfra.com - they have a variety of models in the
| pennies-per-mil range. Not quite as fast as Mercury, Groq
| or Samba Nova though.
|
| (I have no affiliation with this company aside from being a
| happy customer the last few years)
| empiko wrote:
| I strongly believe that this will be a really important
| technique in the near future. The cost savings it might create
| are mouth-watering.
| NitpickLawyer wrote:
| > I strongly believe that this will be a really important
| technique in the near future.
|
| I share the same belief, but regardless of cost. What excites
| me is the ability to "go both ways", edit previous tokens after
| others have been generated, using other signals as "guided
| generation", and so on. Next token prediction works for
| "stories", but diffusion matches better with "coding flows"
| (i.e. going back and forth, add something, come back, import
| something, edit something, and so on).
|
| It would also be very interesting to see how applying this at
| different "abstraction layers" would work. Say you have one
| layer working on ctags, one working on files, and one working
| on "functions". And they all "talk" to each other, passing
| context and "re-diffusing" their respective layers after each
| change. No idea where the data for this would come from;
| maybe from IDEs?
| sansseriff wrote:
| I wonder if there's a way to do diffusion within some sort of
| schema-defined or type constrained space.
|
| A lot of people these days are asking for structured output
| from LLMs so that a schema is followed. Even if you train on
| schema-following with a transformer, you're still just
| 'hoping' in the end that the generated json matches the
| schema.
|
| I'm not a diffusion expert, but maybe there's a way to
| diffuse one value in the 'space' of numbers, and another
| value in the 'space' of all strings, as required by a schema:
|
|   {
|     "type": "object",
|     "properties": {
|       "amount": { "type": "number" },
|       "description": { "type": "string" }
|     },
|     "required": ["amount", "description"]
|   }
|
| I'm not sure how far this could lead. Could you diffuse more
| complex schemas that generalize to a arbitrary syntax tree?
| E.g. diffuse some code in a programming language that is
| guaranteed to be type-safe?
| baalimago wrote:
| I, for one, am willing to trade accuracy for speed. I'd rather
| have 10 iterations of poor replies which force me to ask the
| right question than 1 reply which takes 10 times as long and
| _maybe_ is good, since it tries to reason about my poor question.
| PaulHoule wrote:
| Personally I like asking coding agents a question and getting
| an answer back immediately. Systems like Junie that go off and
| research a bunch of irrelevant things, then ask permission,
| then do a lot more irrelevant research, ask more permission,
| and then 15 minutes later give you a mountain of broken code
| are a waste of time if you ask me. (Even if you give
| permission in advance.)
| pmxi wrote:
| This is cool. I think faster models can unlock entirely new usage
| paradigms, like how faster search enables incremental search.
| amelius wrote:
| Damn, that is fast. But it is faster than I can read, so
| hopefully they can use that speed and turn it into better quality
| of the output. Because otherwise, I honestly don't see the
| advantage, in practical terms, over existing LLMs. It's like
| having a TV with a 200Hz refresh rate, where 100Hz is just fine.
| pmxi wrote:
| There are plenty of LLM use cases where the output isn't meant
| to be read by a human at all. e.g:
|
| parsing unstructured text into structured formats like JSON
|
| translating between natural or programming languages
|
| serving as a reasoning step in agentic systems
|
| So even if it's "too fast to read," that speed can still be
| useful
| amelius wrote:
| Sure, but I was talking about the chat interface, sorry if
| that was not clear.
| martinald wrote:
| You're missing another big advantage: cost. If you can do
| 1000 tok/s on a $2/hr H100 vs 60 tok/s on the same hardware,
| you can price it at roughly 1/17th of the price for the same
| margin.
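|
| Back-of-envelope with those numbers:
|
|   rate = 2.00  # $/hr for the H100
|   for tok_s in (60, 1000):
|       usd_per_m = rate / (tok_s * 3600) * 1e6
|       print(f"{tok_s} tok/s -> ${usd_per_m:.2f} per 1M tokens")
|   # 60 tok/s   -> $9.26 per 1M tokens
|   # 1000 tok/s -> $0.56 per 1M tokens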
| Legend2440 wrote:
| This lets you do more (potentially a lot more) reasoning steps
| and tool calls before answering.
| irthomasthomas wrote:
| I've used Mercury quite a bit in my commit message generator.
| I noticed it would always produce the exact same response if
| you ran it multiple times, and increasing the temperature
| didn't affect that. To get some variability I added a
| $(uuidgen) to the prompt. Then I could run it again for a new
| response if I didn't like the first.
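|
| In Python terms the trick is just this (the nonce label is
| arbitrary):
|
|   import uuid
|
|   base = "Write a commit message for the staged diff."
|   # A fresh nonce per call defeats the deterministic output.
|   prompt = f"{base}\n[nonce: {uuid.uuid4()}]"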
| everlier wrote:
| Something like https://github.com/av/klmbr could also work
| seydor wrote:
| I wonder if diffusion LLMs solve the hallucination problem
| more effectively. In the same way that image models learned to
| create less absurd images, dLLMs can perhaps learn to create
| sensible responses more predictably.
| awaymazdacx5 wrote:
| Having token embeddings with diffusion models, for 16x16
| transformer encoding. Image is tokenized before transformers
| compile it. If decomposed virtualization modulates according to a
| diffusion model.
| storus wrote:
| Can Mercury use tools? I haven't seen it described anywhere. How
| about streaming with tools?
| nashashmi wrote:
| I guess this makes specific language patterns cheaper and more
| artistic language patterns more expensive. This could be a good
| way to limit pirated and masqueraded materials submitted by
| students.
| true_blue wrote:
| I tried the playground and got a strange response. I asked for a
| regex pattern, and the model gave itself a little game-plan, then
| it wrote the pattern and started to write tests for it. But it
| never stopped writing tests. It continued to write tests of
| increasing size until I guess it reached a context limit and the
| answer was canceled. Also, for each test it wrote, it added a
| comment about if the test should pass or fail, but after about
| the 30th test, it started giving the wrong answer for those too,
| saying that a test should fail when actually it should pass if
| the pattern is correct. And after about the 120th test, the tests
| started to not even make sense anymore. They were just nonsense
| characters until the answer got cut off.
|
| The pattern it made was also wrong, but I think the first issue
| is more interesting.
| fiatjaf wrote:
| This is too funny to be true.
| beders wrote:
| I think that's a prime example showing that token prediction
| simply isn't good enough for correctness. It never will be.
| LLMs are not designed to reason about code.
| ianbicking wrote:
| FWIW, I remember regular models doing this not that long ago,
| sometimes getting stuck in something like an infinite loop
| where they keep producing output that is only a slight
| variation on previous output.
| data-ottawa wrote:
| If you shrink the context window on most models you'll get
| this type of behaviour. If you go too small you end up with
| basically gibberish even on modern models like Gemini 2.5.
|
| Mercury has a 32k context window according to the paper,
| which could be why it does that.
| skybrian wrote:
| Company blog post: https://www.inceptionlabs.ai/introducing-
| mercury-our-general...
|
| News coverage from February:
| https://techcrunch.com/2025/02/26/inception-emerges-from-ste...
| mtillman wrote:
| Ton of performance upside in most GPU adjacent code right now.
|
| _However_, is this what arXiv is for? It seems more like
| marketing their links than research. Please correct me if I'm
| wrong/naive on this topic.
| ricopags wrote:
| not wrong, per se, but it's far from the first time
| eden-u4 wrote:
| No open model/weights?
| krasin wrote:
| Not only do they not release models/weights, they don't even
| tell us the size of the models!
|
| The linked whitepaper is pretty useless, and I am saying this
| as a big fan of the
| diffusion-transformers-for-not-just-images-or-videos approach.
|
| Also, Gemini Diffusion ([1]) is way better at coding than
| Mercury's offering.
|
| 1. https://deepmind.google/models/gemini-diffusion/
| gdiamos wrote:
| I think the LLM dev community is underestimating these models.
| E.g. there is no LLM inference framework that supports them
| today.
|
| Yes, the diffusion foundation models have higher cross-entropy.
| But diffusion LLMs can also be post-trained and aligned, which
| cuts the gap.
|
| IMO, investing in post training and data is easier than forcing
| GPU vendors to invest in DRAM to handle large batch sizes and
| forcing users to figure out how to batch their requests by
| 100-1000x. It is also purely in the hands of LLM providers.
| mathiaspoint wrote:
| You can absolutely tune causal LLMs. In fact the original idea
| with GPTs was that you _had_ to tune them before they 'd be
| useful for anything.
| gdiamos wrote:
| Yes I agree you can tune autoregressive LLMs
|
| You can also tune diffusion LLMs
|
| After doing so, the diffusion LLM will be able to generate
| more tokens/sec during inference
| KaranSohi wrote:
| We have used their LLM in our company and it's great! From
| accuracy to speed of response generation, this model seems
| very promising!
| ceroxylon wrote:
| The output is very fast but many steps backwards in all of my
| personal benchmarks. Great tech but not usable in production when
| it is over 60% hallucinations.
| mike_hearn wrote:
| That might just depend on how big it is/how much money was
| spent on training. The neural architecture can clearly work.
| Beyond that catching up may be just a matter of effort.
| mmaunder wrote:
| Holy shit that is fast. Try the playground. You need to get that
| visceral experience to truly appreciate what the future looks
| like.
| mmaunder wrote:
| Code output is verifiable in multiple ways. Combine that with
| this kind of speed (and far faster in future) and you can brute
| force your way to a killer app in a few minutes.
| OneOffAsk wrote:
| Yes, exactly. The demo of Gemini's Diffusion model [0] was
| really eye-opening to me in this regard. Since then, I've been
| convinced the future of lots of software engineering is
| basically UX and SQA: describe the desired states, have an LLM
| fill in the gaps based on its understanding of human intent,
| and unit test it to verify. Like most engineering fields, we'll
| have an empirical understanding of systems as opposed to the
| analytical understanding of code we have today. I'd argue most
| complex software is already only approximately understood even
| before LLMs. I doubt the quality of software will go up (in
| fact the opposite), but I think this work will scale much
| better and be much, much more boring.
|
| [0] https://simonwillison.net/2025/May/21/gemini-diffusion/
| jonplackett wrote:
| Wow, this thing is really quite smart.
|
| I was expecting really crappy performance, but just chatting with
| it and giving it some puzzles, it feels very smart and gets a lot
| of things right that many other models don't.
| ahmedhawas123 wrote:
| Reinforcement learning really helped Transformer-based LLMs
| evolve in terms of quality and reasoning, as we saw when DeepSeek
| launched. I am curious whether this is the equivalent of an early
| GPT-4o that has not yet reaped the benefits of the add-on
| technologies that helped improve quality?
| M4v3R wrote:
| I am personally _very_ excited about this development. Recently I
| AI-coded a simple game for a game jam, and half the time was
| spent waiting for the AI agent to finish its work so I could test
| it. If instead of waiting 1-2 minutes for every prompt to be
| executed and implemented I could wait 10 seconds, that would be
| literally game-changing. I could test 5-10 different versions of
| the same idea in the time it takes me to test one with the
| current tech.
|
| Of course this model is not yet advanced enough for this to be
| feasible, but so was Claude 3.0 just over a year ago. This will
| only get better over time, I'm sure. Exciting times ahead of us.
| ianbicking wrote:
| For something a little different than a coding task, I tried
| using it in my game: https://www.playintra.win/ (in settings you
| can select Mercury, the game uses OpenRouter)
|
| At first it seemed pretty competent and of course very fast, but
| it seemed to really fall apart as the context got longer. The
| context in this case is a sequence of events and locations, and
| it needs to understand how those events are ordered and therefore
| what the current situation and environment are (though there's
| also lots of hints in the prompts to keep it focused on the
| present moment). It's challenging, but lots of smaller models can
| pull it off.
|
| But this is also a first release and a new architecture. Maybe it
| just needs more time to bake (GPT-3.5 couldn't do these things
| either). Though I also imagine it might just perform
| _differently_ from other LLMs, not really on the same spectrum of
| performance, and requiring different prompting.
| armcat wrote:
| I've been looking at the code on their chat playground,
| https://chat.inceptionlabs.ai/, and they have a helper function
| `const convertOpenAIMessages = (convo) => { ... }`, which also
| contains `models: ['gpt-3.5-turbo']`. I also see in API response:
| `"openai": true`. Is it actually using OpenAI, or is it actually
| calling its dLLM? Does anyone know?
|
| Also: you can turn on "Diffusion Effect" in the top-right corner,
| but this just seems to be an "animation gimmick", right?
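|
| Worth noting that "OpenAI" in the function names doesn't have to
| mean OpenAI models: many chat UIs speak the OpenAI wire format to
| whatever backend you point them at. A minimal sketch with the
| openai Python client (the base URL and model id below are
| placeholders, not Inception's documented endpoint):
|
|     from openai import OpenAI
|
|     # Any OpenAI-compatible server works here; the URL and model
|     # id are hypothetical.
|     client = OpenAI(base_url="https://api.example.com/v1",
|                     api_key="...")
|     resp = client.chat.completions.create(
|         model="mercury-coder",
|         messages=[{"role": "user", "content": "Hello!"}],
|     )
|     print(resp.choices[0].message.content)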
| Alifatisk wrote:
| The speed of the response is waaay too quick to be using OpenAI
| as the backend, it's almost instant!
| armcat wrote:
| I've been asking bespoke questions and the timing is >2
| seconds, slower than what I get for the same questions from
| ChatGPT (using gpt-4.1-mini). I am looking at their call
| stack and I see: "verifyOpenAIConnection()",
| "generateOpenAIChatCompletion()", "getOpenAIModels()", etc.
| Maybe it's just so it's compatible with the OpenAI API?
| martinald wrote:
| Check the bottom, I think it's just some off-the-shelf chat
| UI that uses an OpenAI-compatible API behind the scenes.
| armcat wrote:
| Ah got it, it looks like it supports a whole bunch of backends,
| so it can also interface with Ollama and other APIs.
| Alifatisk wrote:
| Love the UI in the playground, it reminds me of Qwen chat.
|
| We have reached a point where the bottlenecks in genAI are not
| knowledge or accuracy; they are the context window and speed.
|
| Luckily, Google (and Meta?) have pushed the limits of the context
| window to about 1 million tokens, which is incredible. But I feel
| like today's options are still stuck at about a ~128k-token
| window per chat, and after that it starts to forget.
|
| Another issue is the time it takes for inference AND
| reasoning. dLLMs are an interesting approach to this. I know we
| have Groq's hardware as well.
|
| I do wonder, can this be combined with Groq's hardware? Would the
| response be instant then?
|
| How many tokens can each chat handle in the playground? I
| couldn't find much info about it.
|
| Which model is it using for inference?
|
| Also, is the training the same for dLLMs as for the standard
| autoregressive LLMs? Or are the weights and models completely
| different?
| martinald wrote:
| I agree entirely with you. While Claude Code is amazing, it is
| also slow as hell, and the context issue keeps coming up
| (usually at what feels like the worst possible time for me).
|
| Most LLMs honestly feel like dialup (apart from this!).
|
| AFAIK, with traditional models context size is very memory
| intensive (though I know there are a lot of attempts to
| 'optimize' this). I believe memory usage grows with the square
| of context length, so even 10xing context length
| requires 100x the memory.
|
| (Image) diffusion does not grow like that, it is much more
| linear. But I have no idea (yet!) about text diffusion models
| if someone wants to chip in :).
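|
| For scaling intuition, a toy calculation (fp16, all model
| dimensions hypothetical): naive attention materializes an n x n
| score matrix, which is the quadratic part, while the KV cache
| grows linearly in n (and tricks like FlashAttention avoid
| materializing the score matrix at all):
|
|     # fp16 = 2 bytes/element; dimensions are made up.
|     def score_matrix_bytes(n, heads=32):
|         return 2 * heads * n * n      # n x n scores per head
|
|     def kv_cache_bytes(n, layers=32, heads=32, head_dim=128):
|         return 2 * 2 * layers * heads * head_dim * n  # K and V
|
|     for n in (8_000, 80_000):
|         print(n, score_matrix_bytes(n) / 1e9, "GB scores,",
|               kv_cache_bytes(n) / 1e9, "GB KV cache")
|     # 10x the context -> ~100x the score matrix, ~10x the KV
|     # cache.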
| kadushka wrote:
| _We have reached a point where the bottlenecks in genAI is not
| the knowledge or accuracy, it is the context window and speed._
|
| You're joking, right? I'm using o3 and it can't do half of the
| coding tasks I tried.
| mxs_ wrote:
| In their tech report, they say this is based on:
|
| > "Our methods extend [28] through careful modifications to the
| data and computation to scale up learning."
|
| [28] is Lou et al. (2023), the "Score Entropy Discrete Diffusion"
| (SEDD) model (https://arxiv.org/abs/2310.16834).
|
| I wrote the first (as far as I can tell) independent from-scratch
| reimplementation of SEDD:
|
| https://github.com/mstarodub/dllm
|
| My goal was making it as clean and readable as possible. I also
| implemented the more complex denoising strategy they described
| (but didn't implement themselves).
|
| It runs on a single GPU in a few hours on a toy dataset.
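|
| For intuition, the core sampling loop of a masked diffusion LM
| can be sketched in a few lines. This is a generic simplification
| (confidence-based unmasking), not the SEDD sampler or Mercury's
| actual method; `model` here stands for any network mapping a
| token grid to per-position logits:
|
|     import torch
|
|     def diffusion_generate(model, seq_len=64, steps=8, mask_id=0):
|         # Start fully masked; each step predicts every position
|         # in parallel and commits the most confident masked slots.
|         x = torch.full((1, seq_len), mask_id)
|         per_step = seq_len // steps
|         for _ in range(steps):
|             logits = model(x)                        # (1, L, V)
|             conf, pred = logits.softmax(-1).max(-1)  # per position
|             conf = conf.masked_fill(x != mask_id, -1.0)
|             idx = conf.topk(per_step, dim=-1).indices
|             x[0, idx[0]] = pred[0, idx[0]]
|         return x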
| mseri wrote:
| Sounds all cool and interesting, however:
|
| > By submitting User Submissions through the Services, you hereby
| do and shall grant Inception a worldwide, non-exclusive,
| perpetual, royalty-free, fully paid, sublicensable and
| transferable license to use, edit, modify, truncate, aggregate,
| reproduce, distribute, prepare derivative works of, display,
| perform, and otherwise fully exploit the User Submissions in
| connection with this site, the Services and our (and our
| successors' and assigns') businesses, including without
| limitation for promoting and redistributing part or all of this
| site or the Services (and derivative works thereof) in any media
| formats and through any media channels (including, without
| limitation, third party websites and feeds), and including after
| your termination of your account or the Services. For clarity,
| Inception may use User Submissions to train artificial
| intelligence models. (However, we will not train models using
| submissions from users accessing our Services via OpenRouter.)
___________________________________________________________________
(page generated 2025-07-07 23:00 UTC)