[HN Gopher] Running GPT-OSS-120B at 500 tokens per second on Nvi...
___________________________________________________________________
Running GPT-OSS-120B at 500 tokens per second on Nvidia GPUs
Author : philipkiely
Score : 231 points
Date : 2025-08-07 02:28 UTC (20 hours ago)
(HTM) web link (www.baseten.co)
(TXT) w3m dump (www.baseten.co)
| tmshapland wrote:
| Such a fascinating read. I didn't realize how much massaging
| needed to be done to get the models to perform well. I just sort
| of assumed they worked out of the box.
| acters wrote:
| Personally, I think bigger companies should be more proactive
| and work with some of the popular inference engine software
| devs on getting their special snowflake LLM to work before it
| gets released. I guess it is all very much experimental at the
| end of the day. Those devs are doing God's work for us to
| use on our budget-friendly hardware choices.
| eric-burel wrote:
| SMEs are starting to want local LLMs and it's a nightmare to
| figure out what hardware would work for what models. I am asking
| devs in my hometown to literally visit their installs to
| figure out combos that work.
| CMCDragonkai wrote:
| Are you installing them onsite?
| eric-burel wrote:
| Some are asking that yeah but I haven't run an install
| yet, I am documenting the process. This is a last resort,
| hosting on European cloud is more efficient but some
| companies don't even want to hear about cloud hosting.
| mutkach wrote:
| This is a good take, actually. GPT-OSS is not much of a
| snowflake (judging by the model's architecture card at least)
| but TRT-LLM treats every model like that - there is too much
| hardcoding - which makes it very difficult to just use it out-
| of-the-box for the hottest SotA thing.
| diggan wrote:
| > GPT-OSS is not much of a snowflake
|
| Yeah, according to the architecture it doesn't seem like a
| snowflake, but they also decided to invent a new
| prompting/conversation format
| (https://github.com/openai/harmony), which definitely makes
| it a bit of a snowflake today: you can't just use what worked
| a couple of days ago; everyone needs to add proper
| support for it.
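|
| Roughly, a harmony-formatted conversation looks like this
| (approximate, reconstructed from memory of the harmony docs, so
| the exact special tokens and channels may differ slightly):
|
|     <|start|>system<|message|>You are a helpful assistant.<|end|>
|     <|start|>user<|message|>What is 2+2?<|end|>
|     <|start|>assistant<|channel|>analysis<|message|>...reasoning...<|end|>
|     <|start|>assistant<|channel|>final<|message|>4<|return|>
|
| Reasoning and tool calls go on separate "channels", which is the
| part existing chat-template code has to learn to parse.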
| diggan wrote:
| This is literally what they did for GPT-OSS; it seems there was
| coordination with OpenAI to support it on day 1.
| magicalhippo wrote:
| Maybe I'm especially daft this morning but I don't get the point
| of the speculative decoding.
|
| How does the target model validate the draft tokens without
| running the inference as normal?
|
| Because if it is doing just that, I don't get the point as you
| can't trust the draft tokens before they are validated, so you're
| still stuck waiting for the target model.
| joliu wrote:
| It does run inference, but on the batch of tokens that were
| drafted, akin to the prefill phase.
|
| So your draft model can decode N new tokens, then the real
| model does one inference pass to score the N new drafted
| tokens.
|
| Prefill is computation bound whereas decode is bandwidth bound,
| so in practice doing one prefill over N tokens is cheaper than
| doing N decode passes.
| furyofantares wrote:
| Not an expert, but here's how I understand it. You know how
| input tokens are cheaper than output tokens? It's related to
| that.
|
| Say the model so far has "The capital of France". The small
| model generates "is Paris.", which let's say is 5 tokens.
|
| You feed the large model "The capital of France is Paris." to
| validate all 5 of those tokens in a single forward pass.
| ahmedfromtunis wrote:
| But what would happen if the small model's prediction was "is
| Rome."? Wouldn't that result in costlier inference if the
| small model is "wrong" more than it is correct.
|
| Also, if the small model would be sufficiently more "correct"
| than "wrong", wouldn't be more efficient to get rid of the
| large model at this point?
| cwyers wrote:
| So, the way speculative decoding works, the model begins
| predicting at the first wrong token, so you still get 'is'
| for free.
| acters wrote:
| I believe that is exactly the downside of using speculative
| decoding, which is why it is very important to have the
| models properly sized between each other by making sure the
| small use is big enough to be mostly correct while also
| being exceptionally faster than the larger one. However the
| larger one has to be fast enough that catching flaws won't
| introduce too manyrandom delays. Also, if the small one is
| incorrect then the larger one correcting the mistake is
| miles better than leaving in incorrect output.
|
| It is about improving quality while allowing for faster
| speed most of the time. The tradeoff is that you consume
| more memory from having two models loaded vs one of them
| exclusively.
|
| If you just focus on one then it would make sense to reduce
| memory usage by just running the smaller model.
| acters wrote:
| Another caveat with this method is that both larger and
| smaller models need to behave very similar because a lot
| of the savings come from generating the necessary fluff
| around each detail such as grammar, formatting and
| words/letters that transition between each other.
|
| Unsurprisingly gpt-oss has both larger and smaller models
| that work very similarly! Both model sizes are so similar
| that even if getting a few wrong would not be slowing
| down the performance enough to equal the speed of the
| larger model(which is the worst case with this setup). We
| want the speed of the smaller model as much as possible.
| That is all
| imtringued wrote:
| You're forgetting that some sequences are more predictable
| than others, hence the name "speculative" decoding. Let's
| say your token encoding has 128k tokens. That means the
| model has to pick the right token out of 128k. Some of
| those tokens are incredibly rare, while others are super
| common. The big model has seen the rare tokens many more
| times than the small model. This means that the small model
| will be able to do things like produce grammatically
| correct English, but not know anything about a specific JS
| framework.
|
| The post-training fine-tuning costs (low thousands of dollars)
| are the main reason why speculative decoding is relatively
| unpopular. The most effective speculative decoding strategy
| requires you to train multiple prediction heads a la Medusa (or
| whatever succeeded it). If you don't do any fine-tuning, then the
| probability of the small model being useful is slim. Using a
| random model as your draft model will probably give you very
| disappointing results.
| isoprophlex wrote:
| but... do you get any validation during the forward pass? the
| small model could just as well have generated "is Berlin." or
| whatever. do these models somehow give you a likelihood for
| the next token when you're prefilling, that you can compare
| against? if so why not just... use that always?
|
| or is this a scenario where computation is expensive but
| validation is cheap?
|
| EDIT: thanks, people, for educating me! very insightful :)
| sanxiyn wrote:
| Yes, models give likelihoods you can compare against. No,
| you can't do that without drafting, because likelihood of
| token N+2 depends on token N+1. That is, you get P(is, The
| capital of France) and P(Berlin, The capital of France is),
| but for the later you need to give "is" as input, you can't
| do P(Berlin, The Capital of France _).
| shikon7 wrote:
| Yes, the forward pass does a next token prediction on all
| input tokens (so we know exactly how many tokens from the
| small model matched). The expensive thing is not the
| computation, but the memory bandwidth, as each pass needs
| to load the model from memory.
|
| If the small model predicts some tokens correctly, you save
| some passes, at the expense of doing some extra
| computations when the tokens were not correct.
|
| In any case, each forward pass will give at least one new
| token.
| pama wrote:
| If you want to go down the rabbit hole of the state of the
| art, I recommend the EAGLE3 paper:
| https://arxiv.org/abs/2503.01840
| cristoperb wrote:
| My simplified understanding: The target model can validate the
| draft tokens all at once, in a single forward pass. The output
| of that forward pass is a list of probabilities for each draft
| token which are compared to the probabilities produced by the
| draft model. If the target model's probabilities are the same
| or greater than the draft model's, the tokens are accepted. Worst
| case, none of the draft tokens are accepted and instead the
| target model selects the single next token as usual.
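|
| A minimal sketch of that accept/reject loop (simplified from the
| standard speculative-decoding papers; `target` and `draft` are
| hypothetical functions returning per-position next-token
| probability distributions for a whole sequence in one forward
| pass):
|
|     import numpy as np
|
|     def speculative_step(target, draft, prefix, k=4):
|         # 1. Draft k tokens autoregressively with the cheap model.
|         drafted, q_dists = [], []
|         for _ in range(k):
|             q = draft(prefix + drafted)[-1]
|             tok = int(np.argmax(q))            # or sample
|             drafted.append(tok)
|             q_dists.append(q)
|         # 2. Score all k drafted tokens with ONE pass of the big model.
|         p_all = target(prefix + drafted)
|         out = []
|         for i, tok in enumerate(drafted):
|             p = p_all[len(prefix) + i - 1]     # big model's dist at this position
|             if np.random.rand() < min(1.0, p[tok] / q_dists[i][tok]):
|                 out.append(tok)                # accepted
|             else:
|                 resid = np.maximum(p - q_dists[i], 0)
|                 out.append(int(np.argmax(resid)))  # corrected token, then stop
|                 break
|         return out                             # always at least one new token
|
| (Real implementations sample instead of taking the argmax and
| squeeze out one bonus token when every draft is accepted, but the
| shape is the same.)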
| robrenaud wrote:
| I think your core misunderstanding is that you are assuming K
| calls to generate 1 token each are as expensive as 1 call to
| generate K tokens. It is actually much more expensive to generate
| serially than even in small batches.
| porridgeraisin wrote:
| Let's say I want to run f2(f1(x)) where f1 and f2 are both a
| single pass through GPT4.
|
| This takes 2 seconds time, assuming 1 second for every pass.
|
| What I instead do is kick off f1(x) in another thread, and then
| run f2(g1(x)) where g1 is one pass through GPT-nano.
|
| This takes 1 + 0.1 seconds, assuming gpt nano takes 0.1s for
| every pass. In this 1.1 seconds, the f1(x) that we kicked off
| in the 2nd thread would have finished (it takes 1 second).
|
| So in 1.1 seconds we have available to us f1(x), f2(g1(x)), and
| we store the intermediate g1(x) as well
|
| We compare g1(x) and f1(x)
|
| If they were equal, i.e g1(x) = f1(x), then we have our answer
| = f2(g1(x)) in just 1.1s.
|
| If they were not, we compute f2(output of f1(x) from 2nd
| thread) which takes 1 further second, bringing our total to
| 2.1s.
|
| If the small model is equalling the big model in say 2/3 of
| cases, you will spend 2/3 * 1.1 + 1/3 * 2.1 = 1.433s on average
| for this computation. Without speculative decoding, it is
| always 2s.
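|
| The same arithmetic as a tiny function (the 1 s and 0.1 s pass
| times are the hypothetical numbers from above):
|
|     def expected_latency(p_accept, t_small=0.1, t_big=1.0):
|         hit = t_small + t_big          # draft accepted: 1.1 s
|         miss = t_small + 2 * t_big     # draft rejected: 2.1 s
|         return p_accept * hit + (1 - p_accept) * miss
|
|     print(expected_latency(2 / 3))     # ~1.43 s, vs a flat 2.0 s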
| arkmm wrote:
| This is a really great explanation.
| magicalhippo wrote:
| Thanks, very nice explanation, that makes perfect sense. I
| guess their graphics confused me for some reason and had me
| thinking all wrong.
|
| Now I see they tried to point out the obvious thing which is
| to predict multiple tokens ahead, not just two as in your
| example.
| bhaney wrote:
| > How does the target model validate the draft tokens without
| running the inference as normal?
|
| It does run the inference as normal, just in parallel with the
| other inferences
|
| > if it is doing just that, I don't get the point
|
| Running inferences in parallel allows you to read the
| model weights out of memory only once for N parallel
| inferences, as opposed to reading them out of memory N times
| for N serial inferences. Inference is massively bottlenecked by
| memory bandwidth to the tune of one or two orders of magnitude
| compared to compute, so this helps a lot.
| littlestymaar wrote:
| > Inference is massively bottlenecked by memory bandwidth to
| the tune of one or two orders of magnitude compared to
| compute, so this helps a lot.
|
| Nitpick: it's only bottlenecked by memory bandwidth if the
| batch size is too low (that is: if you don't have many users
| calling the same model in parallel).
|
| Speculative decoding is just a way of running a single query
| as if it was parallel queries.
| jlebar wrote:
| Just want to suggest: Ask an LLM about it! If you have access
| to a reasoning model like o3, I've found it to be very helpful.
|
| I think this answer is as good as any of the human-generated
| ones in the thread so far, but the real power is that you can
| ask it follow-up questions.
| https://chatgpt.com/share/6894504f-4458-8008-a8c9-f371588259...
| modeless wrote:
| What's the best speed people have gotten on 4090s?
| ActorNightly wrote:
| You can't fit the model into a 4090 without quantization; it's
| like 64 gigs.
|
| For home use, Gemma 27B QAT is king. It's almost as good as
| Deepseek R1.
| modeless wrote:
| The 20B one fits.
| steinvakt2 wrote:
| Does it fit on a 5080 (16gb)?
| jwitthuhn wrote:
| Haven't tried myself but it looks like it probably does.
| The weight files total 13.8 GB which gives you a little
| left over to hold your context.
| northern-lights wrote:
| It fits on a 5070TI, so should fit on a 5080 as well.
| SirMaster wrote:
| You don't really need it all to fit in VRAM, thanks to the
| efficient MoE architecture and llama.cpp.
|
| The 120B is running at 20 tokens/sec on my 5060Ti 16GB with
| 64GB of system ram. Now personally I find 20 tokens/sec quite
| usable, but for some maybe it's not enough.
| asabla wrote:
| I'm on a 5090 so it's not apples to apples comparison. But I'm
| getting ~150t/s for the 20B version using ~16000 context size.
| modeless wrote:
| Cool, what software?
| asabla wrote:
| Initial testing has only been done with ollama. Plan on
| testing out llama.cpp and vllm when there is enough time
| steinvakt2 wrote:
| And flash attention doesn't work on 5090 yet, right? So
| currently 4090 is probably faster, or?
| PeterStuer wrote:
| I don't think the 4090 has native 4bit support, which will
| probably have a significant impact.
| diggan wrote:
| > And flash attention doesn't work on 5090 yet, right?
|
| Flash attention works with GPT-OSS + llama.cpp (tested on
| 1d72c8418) and other Blackwell card (RTX Pro 6000) so I
| think it should work on 5090 as well, it's the same
| architecture after all.
| littlestymaar wrote:
| Very fast "Sorry I can't help with that" generator.
| jeffhuys wrote:
| Just "liberate" it
| sarthaksoni wrote:
| Reading this made me realize how easy it is to set up GPT-OSS 20B
| in comparison. I had it running on my Mac in five minutes, thanks
| to Llama.
| DrPhish wrote:
| It's also easy to do 120b on CPU if you have the resources. I
| had 120b running on my home LLM CPU inference box in about as
| long as it took to download the GGUFs, git pull and rebuild
| llama-server. I had it running at 40t/s with zero effort and
| 50t/s with a bit of tweaking. It's just too bad that even the
| 120b isn't really worth running compared to the other models
| that are out there.
|
| It really is amazing what ggerganov and the llama.cpp team have
| done to democratize LLMs for individuals that can't afford a
| massive GPU farm worth more than the average annual salary.
| wkat4242 wrote:
| What hardware do you have? 50tk/s is really impressive for
| cpu.
| DrPhish wrote:
| 2xEPYC Genoa w/768GB of DDR5-4800 and an A5000 24GB card. I
| built it in January 2024 for about $6k and have thoroughly
| enjoyed running every new model as it gets released. Some
| of the best money I've ever spent.
| wkat4242 wrote:
| Wow nice!! That's a really good deal for that much
| hardware.
|
| How many tokens/s do you get for DeepSeek-R1?
| DrPhish wrote:
| Thanks, it was a bit of a gamble at the time (lots of
| dodgy ebay parts), but it paid off.
|
| R1 starts at about 10t/s on an empty context but quickly
| falls off. I'd say the majority of my tokens are
| generating around 6t/s.
|
| Some of the other big MoE models can be quite a bit
| faster.
|
| I'm mostly using QwenCoder 480b at Q8 these days for 9t/s
| average. I've found I get better real-world results out
| of it than K2, R1 or GLM4.5.
| testaburger wrote:
| Which specific EPYC models? And if it's not too much to
| ask, which motherboard and power supply? I'm really
| interested in building something similar.
| smartbit wrote:
| Looking at
| https://news.ycombinator.com/submitted?id=DrPhish it's
| probably this machine https://rentry.co/miqumaxx
| * Gigabyte MZ73-LM1 with two AMD EPYC GENOA 9334 QS 64c/128t
| * 24 sticks of M321R4GA3BB6-CQK 32GB DDR5-4800 RDIMM PC5-38400R
| * 24GB A5000
|
| Note that the RAM price almost doubled since Jan 2024
| ekianjo wrote:
| thats a r/localllama user right there
| fouc wrote:
| I've seen some mentions of pure-cpu setups being
| successful for large models using old epyc/xeon
| workstations off ebay with 40+ cores. Interesting
| approach!
| SirMaster wrote:
| I'm getting 20 tokens/sec on the 120B model with a 5060Ti
| 16GB and a regular desktop Ryzen 7800x3d with 64GB of
| DDR5-6000.
| wkat4242 wrote:
| Wow that's not bad. It's strange, for me it is much much
| slower on a Radeon Pro VII (also 16GB, with a memory
| bandwidth of 1TB/s!) and a Ryzen 5 5600 with also 64GB.
| It's basically unworkably slow. Also, I only get 100% CPU
| when I check ollama ps, the GPU is not being used at all
| :( It's also counterproductive because the model is just
| too large for 64GB.
|
| I wonder what makes it work so well on yours! My CPU
| isn't much slower and my GPU probably faster.
| magicalhippo wrote:
| AMD basically decided they wanted to focus on HPC and
| data center customers rather than consumers, and so GPGPU
| driver support for consumer cards has been non-existing
| or terrible[1].
|
| [1]: https://github.com/ROCm/ROCm/discussions/3893
| exe34 wrote:
| I imagine the gguf is quantised stuff?
| DrPhish wrote:
| No, I'm running the unquantized 120b
| amelius wrote:
| Why is it hard to set up llms? You can just ask an llm to do it
| for you, no? If this relatively simple task is already too much
| for llms then what good are they?
| diggan wrote:
| In the case of the GPT-OSS models, the worst (most time-consuming)
| part of supporting it is the new format they've trained the
| model with, "OpenAI harmony". In my own clients I couldn't
| just replace the model and call it a day, and I'm still working
| on getting them to work correctly with tool calling...
| CraigRood wrote:
| I was playing with it yesterday and every single session gave
| me factually incorrect information.
|
| Speed and ease of use is one thing, but it shouldn't be at the
| cost of accuracy.
| OliverGuy wrote:
| If you are trying to get facts out of an LLM you are using it
| wrong. If you want a fact, it should use a tool (e.g. web search,
| RAG etc) to get the information that contains the fact (Wikipedia
| page, documentation etc) and then parse that document for the
| fact and return it to you.
| LoganDark wrote:
| 120B is pretty easy to run too, if you have enough memory.
| eric-burel wrote:
| "Encourage Open-Source and Open-Weight AI" is the part just after
| "Ensure that Frontier AI Protects Free Speech and American
| Values" in America's AI Action Plan. I know this is not rational
| but OpenAI OSS models kinda give me chills as I am reading the
| Plan in parallel. Anyway I like seeing oss model providers
| talking about hardware, because that's a limiting point for most
| developers that are not familiar with this layer.
| geertj wrote:
| > Ensure that Frontier AI Protects Free Speech and American
| Values
|
| I am in the early phases of collecting my thoughts on this
| topic so bear with me, but is this a bad thing?
|
| AI models will have a world view. I think I prefer them having
| a western world view, as that has built our modern society and
| has proven to be most successful in making the lives of people
| better.
|
| At the very minimum I would want a model to document its world
| view, and be aligned to it so that it does not try to socially
| engineer me to surreptitiously change mine.
| petesergeant wrote:
| > but is this a bad thing?
|
| I think the worry is that there's no fixed definitions here,
| so the executive can use this to exert partisan or
| ideological pressure on model providers.
|
| Every four years the models get RLHF'd to switch between
| thinking guns are amazing vs thinking guns are terrible.
| exe34 wrote:
| > I think I prefer them having a western world view,
|
| What worries me is that the current "western world view" of
| America is not the same as the western world view we've
| shared with them since the cold war. The trend is towards the
| same kind of values and behaviour we see in the Islamic
| Republic and the Russian Federation. If that sort of "western
| world view" gets baked into the intelligent infrastructure,
| it may be very hard to change course in the future. For
| example dissidence and wrongthink is going to get harder and
| harder.
| AesopAerial wrote:
| > I think I prefer them having a western world view, as that
| has built our modern society and has proven to be most
| successful in making the lives of people better.
|
| Highly debatable, and most people anywhere would probably say
| the same thing about whatever world view they hold.
| ben_w wrote:
| "Western" != "American": I grew up in a country where even
| the police are not, and do not wish to be, routinely armed.
|
| Even then, there is an important difference between de-facto
| and de-jure rules. Fun fact: even North Korea has a
| constitutional guarantee of freedom of speech and the right to
| vote*. They don't _do_ these things as we would understand
| any of those words, but they have those things right there in
| the constitution.
|
| So: does the USA, as it exists today, represent the values
| you want? Can you honestly say, hand on heart, that Alligator
| Alcatraz should be a thing your AI has been trained to
| support? Or that it's fine for Qatar to donate a 747 that
| becomes part of the library of the current president, not the
| office of the president, when his term in office comes to an
| end?
|
| I won't list everything, this isn't the place for that, but
| even if we wind the clock back a few years, do you (/we) want
| an AI aligned with a political circus of kayfabe that
| distracts us from the real political machinations?
|
| Of course, this is still USA-focused.
|
| I'd say that what really made a difference to our quality of
| life wasn't even the American political system: there were
| massive improvements to human existence starting with the
| first industrial revolution in the UK in the 1760s, but the
| social and political nature of the world back then was so
| bleak that communism got invented a century later and
| introduced what were at the time controversial ideas like
| "women are not property" and "universal free education is
| good", and the USA's systems changed substantially several
| times since then (at a minimum Civil War, New Deal, and the
| Civil Rights movement).
|
| The "meta system" that allows change can be considered good,
| but not uniquely so if you compare this to the Russian
| Revolution getting rid of the Tzars and 40 years later they
| were in orbit (and this _despite_ the Holodomor and WW2) and
| then threw off these shackles with Glasnost and the fall of
| the USSR (and note there that in Russia specifically, not all
| the former soviet countries but specifically Russia, the
| freedom gained _failed_ to bring material improvements and
| the lives of those living through it were, in aggregate, made
| worse despite that freedom), and similar stories with the
| Chinese starting with dangerous incompetence (Four Pests
| campaign) and now in a position where "which is more
| powerful, them or the USA?" is a matter of which measure you
| use rather than it being obvious.
|
| * https://en.wikipedia.org/wiki/Constitution_of_North_Korea#C
| h...
| eric-burel wrote:
| Yeah I mean you'd want to take a look at the plan to get a bigger
| picture; it reflects a specific set of values which are not
| universally shared. This should lead to the development of
| European models, but it feels inefficient to duplicate the work
| in each country/region just because open source models are
| planned to be used as trojan horses for values.
| hsaliak wrote:
| TLDR: tensorrt
| mutkach wrote:
| > Inspired by GPUs, we parallelized this effort across multiple
| engineers. One engineer tried vLLM, another SGLang, and a third
| worked on TensorRT-LLM. We were able to quickly get TensorRT-LLM
| working, which was fortunate as it is usually the most performant
| inference framework for LLMs.
|
| > TensorRT-LLM
|
| It is usually the hardest to set up correctly and is often out of
| date regarding the relevant architectures. It also requires
| compiling the model on the exact same hardware-drivers-libraries
| stack as your production environment, which is a great pain in
| the rear end to say the least. Multimodal setups have also been a
| disaster - at least for a while - when it was near-impossible to
| make them work even for mainstream models, like the multimodal
| Llamas. The big question is whether it's worth it, since running
| GPT-OSS-120B on an H100 using vLLM is flawless in comparison -
| and the throughput stays at 130-140 t/s for a single H100. (It's
| also somewhat of a clickbait title - I was expecting to see
| 500 t/s for a single GPU, when in fact it's a tensor-parallel
| setup.)
|
| It's also funny that they went for a separate release of TRT-LLM
| just to make sure that gpt-oss will work correctly. TRT-LLM is a
| mess.
| philipkiely wrote:
| TRT-LLM has its challenges from a DX perspective and yeah for
| Multi-modal we still use vLLM pretty often.
|
| But for the kind of traffic we are trying to serve -- high
| volume and latency sensitive -- it consistently wins head-to-
| head in our benchmarking and we have invested a ton of dev work
| in the tooling around it.
| wcallahan wrote:
| I just used GPT-OSS-120B on a cross Atlantic flight on my MacBook
| Pro (M4, 128GB RAM).
|
| A few things I noticed:
|
| - it's only fast with small context windows and small total token
| counts; once past ~10k tokens you're basically queueing
| everything for a long time
|
| - MCPs/web search/URL fetch have already become a very important
| part of interacting with LLMs; when they're not available the
| LLM's utility is greatly diminished
|
| - a lot of CLI/TUI coding tools (e.g., opencode) were not working
| reliably offline at this time with the model, despite being set
| up prior to going offline
|
| That's in addition to the other quirks others have noted with the
| OSS models.
| MoonObserver wrote:
| M2 Max processor. I saw 60+ tok/s on short conversations, but
| it degraded to 30 tok/s as the conversation got longer. Do you
| know what actually accounts for this slowdown? I don't believe
| it was thermal throttling.
| summarity wrote:
| Physics: You always have the same memory bandwidth. The
| longer the context, the more bits will need to pass through
| the same pipe. Context is cumulative.
| VierScar wrote:
| No I don't think it's the bits. I would say it's the
| computation. Inference requires performing a lot of matmul,
| and with more tokens the number of computation operations
| increases exponentially - O(n^2) at least. So increasing
| your context/conversation will quickly degrade performance
|
| I seriously doubt it's the throughput of memory during
| inference that's the bottleneck here.
| zozbot234 wrote:
| Typically, the token generation phase is memory-bound for
| LLM inference in general, and this becomes especially
| clear as context length increases (since the model's
| parameters are a fixed quantity.) If it was pure compute
| bound there would be huge gains to be had by shifting
| some of the load to the NPU (ANE) but AIUI it's just not
| so.
| summarity wrote:
| It literally is. LLM inference is almost entirely memory
| bound. In fact for naive inference (no batching), you can
| calculate the token throughput just based on the model
| size, context size and memory bandwidth.
| zozbot234 wrote:
| Prompt pre-processing (before the first token is output)
| is raw compute-bound. That's why it would be nice if we
| could direct llama.cpp/ollama to run that phase only on
| iGPU/NPU (for systems without a separate dGPU, obviously)
| and shift the whole thing over to CPU inference for the
| latter token-generation phase.
|
| (A memory-bound workload like token gen wouldn't usually
| run into the CPU's thermal or power limits, so there
| would be little or no gain from offloading work to the
| iGPU/NPU in that phase.)
| MereInterest wrote:
| Nitpick: O(n^2) is quadratic, not exponential. For it to
| "increase exponentially", n would need to be in the
| exponent, such as O(2^n).
| esafak wrote:
| To contrast with exponential, the term is _power law_.
| torginus wrote:
| Inference takes quadratic amount of time wrt context size
| conradev wrote:
| Are you using Ollama or LMStudio/llama.cpp?
| https://x.com/ggerganov/status/1953088008816619637
| diggan wrote:
| > LMStudio/llama.cpp
|
| Even though LM Studio uses llama.cpp as a runtime, the
| performance differs between them. With LM Studio 0.3.22 Build
| 2 with CUDA Llama.cpp (Linux) v1.45.0 runtime I get ~86 tok/s
| on a RTX Pro 6000, while with llama.cpp compiled from
| 1d72c841888 (Aug 7 10:53:21 2025) I get ~180 tok/s, almost
| 100 more per second, both running lmstudio-community/gpt-
| oss-120b-GGUF.
| esafak wrote:
| Is it always like this or does it depend on the model?
| diggan wrote:
| Depends on the model. Each runner needs to implement support when
| there are new architectures, and they all seemingly focus on
| different things. As far as I've gathered so far, vLLM focuses on
| inference speed, SGLang on parallelizing across multiple GPUs,
| Ollama on being as fast out the door with their implementation as
| possible, sometimes cutting corners, and llama.cpp sits somewhere
| in between Ollama and vLLM. Then LM Studio seems to lag slightly
| behind with their llama.cpp usage, so I'm guessing that's the
| difference between LM Studio and building llama.cpp from source
| today.
| XCSme wrote:
| I know there was a downloadable version of Wikipedia (not that
| large). Maybe soon we'll have a lot of data stored locally and
| expose it via MCP, then the AIs can do "web search" locally.
|
| I think 99% of web searches lead to the same 100-1k websites. I
| assume it's only a few GBs to have a copy of those locally,
| though this raises copyright concerns.
| Aurornis wrote:
| The mostly static knowledge content from sites like Wikipedia
| is already well represented in LLMs.
|
| LLMs call out to external websites when something isn't
| commonly represented in training data, like specific project
| documentation or news events.
| XCSme wrote:
| That's true, but the data is only approximately represented
| in the weights.
|
| Maybe it's better to have the AI only "reason", and somehow
| instantly access precise data.
| adsharma wrote:
| What use cases will gain from this architecture?
| gigatexal wrote:
| M3 Max 128GB here and it's mad impressive.
|
| I'm spec'ing out a Mac Studio with 512GB ram because I can
| window shop and wish, but I think the trend for local LLMs is
| getting really good.
|
| Do we know WHY OpenAI even released them?
| diggan wrote:
| > Do we know WHY openAI even released them?
|
| Regulations and trying to earn the goodwill of developers using
| local LLMs, something that was slowly eroding since it was a
| while ago (GPT-2, 2019) that they last released weights to the
| public.
| Epa095 wrote:
| If the new gpt 5 is actually better, then this oss version is
| not really a threat to Openai's income stream, but it can be
| a threat to their competitors.
| zackify wrote:
| You didn't even mention how it'll be on fire unless you use low
| power mode.
|
| Yes all this has been known since the M4 came out. The memory
| bandwidth is too low.
|
| Try using it with real tasks like Cline or opencode, and the
| context gets too long and slow to be practical.
| Aurornis wrote:
| > Yes all this has been known since the M4 came out. The
| memory bandwidth is too low.
|
| The M4 Max with 128GB of RAM (the part used in the comment)
| has over 500GB/sec of memory bandwidth.
| radarsat1 wrote:
| How long did your battery last?!
| woleium wrote:
| planes have power sockets now, but i do wonder how much jet
| fuel a whole plane of gpus would consume in electricity
| (assuming the system can handle it, which seems unlikely) and
| air conditioning.
| TimBurman wrote:
| That's an interesting question. According to Rich and
| Greg's Airplane Page[1], the A320 has three generators
| rated for 90kVA continuous each, one per engine and a third
| in the auxiliary power unit that isn't normally deployed.
| Cruising demand is around 140 kVA of the 180 kVA supplied
| by the engines, leaving 40 kVA to spare. The A380 has six
| similar generators, two in reserve. They give the
| percentages so you could calculate how much fuel each
| system is consuming.
|
| [1] https://alverstokeaviation.blogspot.com/2016/03/
|
| This page also has a rendered image of the generator:
|
| https://aviation.stackexchange.com/questions/43490/how-
| much-...
| mich5632 wrote:
| I think this is the difference between compute-bound pre-fill (a
| CPU has a high bandwidth/compute ratio) vs decode. The time to
| first token is below 0.5s - even for a 10k context.
| fouc wrote:
| What was your iogpu.wired_limit_mb set to? By default only ~70%
| or ~90GB of your RAM will be available to your GPU cores unless
| you change your wired limit setting.
| blitzar wrote:
| > widely-available H100 GPUs
|
| Just looked in the parts drawer at home and dont seem to have a
| $25,000 GPU for some inexplicable reason.
| KolmogorovComp wrote:
| available != cheap
| blitzar wrote:
| available /əˈveɪləbl/
|
| adjective: available
|
| able to be used or obtained; at someone's disposal
| swexbe wrote:
| You can rent one from most cloud providers for a few bucks
| an hour.
| koakuma-chan wrote:
| Might as well just use openai api
| ekianjo wrote:
| thats not the same thing at all
| poly2it wrote:
| That depends on your intentions.
| Kurtz79 wrote:
| Does it even make sense calling them 'GPUs' (I just checked
| NVIDIA product page for the H100 and it is indeed so)?
|
| There should be a quicker way to differentiate between
| 'consumer-grade hardware that is mainly meant to be used for
| gaming and can also run LLMs inference in a limited way' and
| 'business-grade hardware whose main purpose is AI training or
| running inference for LLMs".
| amelius wrote:
| Well, does it come with graphics connectors?
| OliverGuy wrote:
| Nope, doesn't have any of the required hardware to even
| process graphics iirc
| diggan wrote:
| Although the RTX Pro 6000 is not consumer-grade, it does
| come with graphics ports (four Displayports) and does
| render graphics like a consumer card :) So seems the
| difference between the segments is becoming smaller, not
| bigger.
| simpleintheory wrote:
| That's because it's intended as a workstation GPU not one
| used in servers
| diggan wrote:
| Sure, but it still sits in the 'business-grade hardware
| whose main purpose is AI training or running inference
| for LLMs' segment the parent mentioned, yet it has graphics
| connectors, so just looking at that won't help you understand
| what segment the GPU goes into.
| blitzar wrote:
| We are fast approaching the return of the _math coprocessor_.
| In fashion they say that trends tend to reappear roughly
| every two decades, its overdue.
| egorfine wrote:
| Yeah I would love for Nvidia to introduce faster update
| cycle to their hardware, so that we'll have models like
| "H201", "H220", etc.
|
| I think it will also make sense to replace "H" with a brand
| number, sort of like they already do for consumer GPUs.
|
| So then maybe one day we'll have a math coprocessor called
| "Nvidia 80287".
| beAbU wrote:
| I remember building high-end workstations for a summer
| job in the 2000s, where I had to fit Tesla cards in the
| machines. I don't remember what their device names were; we
| just called them Tesla cards.
|
| "Accelerator card" makes a lot of sense to me.
| WithinReason wrote:
| It's called a tensorcore and it's in most GPUs
| genewitch wrote:
| "GPGPU" was something from over a decade ago; for general
| purpose GPU computing
| hnuser123456 wrote:
| Yeah, Crysis came out in 2007 and could run physics on the
| GPU.
| addandsubtract wrote:
| We could call the consumer ones GFX cards, and keep GPU for
| the matrix multiplying ones.
| beAbU wrote:
| GPU stands for "graphics processing unit" so I'm not sure
| how your suggestion solves it.
|
| Maybe renaming the device to an MPU, where the M stands for
| "matrix/math/mips" would make it more semantically correct?
| rebolek wrote:
| I think that G was changed to "general", so now it's
| "general processing unit".
| rpdillon wrote:
| This doesn't seem to be true at all. It's a highly
| specialized chip for doing highly parallel operations.
| There's nothing general about it.
|
| I looked around briefly and could find no evidence that
| it's been renamed. Do you have a source?
| fouc wrote:
| CPU is already the general (computing) processing unit so
| that wouldn't make sense
| codedokode wrote:
| By the way I wonder, what has more performance, a $25 000
| professional GPU or a bunch of cheaper consumer GPUs costing
| $25 000 in total?
| omneity wrote:
| Consumer GPUs in theory and by a large margin (10 5090s
| will eat an H100's lunch with 6 times the bandwidth, 3x VRAM
| and a relatively similar compute ratio), but your
| bottleneck is the interconnect, and that is intentionally
| crippled to avoid Beowulf GPU clusters eating into their
| datacenter market.
|
| Last consumer GPU with NVLink was the RTX 3090. Even the
| workstation-grade GPUs lost it.
|
| https://forums.developer.nvidia.com/t/rtx-a6000-ada-no-
| more-...
| sigbottle wrote:
| H100s also have custom async WGMMA instructions among
| other things. From what I understand, at least the async
| instructions formalize the notion of pipelining, which
| engineers were already implicitly using because to
| optimize memory accesses you're effectively trying to
| overlap them in that kind of optimal parallel manner.
| AlphaSite wrote:
| I think apple calls them NPUs and Broadcom calls them XPUs.
| Given they're basically the number 2 and 3 accelerator
| manufacturers one of those probably works.
| washadjeffmad wrote:
| I just specify SXM (node) when I want to differentiate from
| PCIe. We have H100s in both.
| lopuhin wrote:
| You can rent them for less than $2/h in a lot of places (maybe
| not in the drawer).
| dougSF70 wrote:
| With Ollama i got the 20B model running on 8 TitanX cards
| (2015). Ollama distributed the model so that the 15GB of vram
| required was split evenly across the 8 cards. The tok/s were
| faster than reading speed.
| Aurornis wrote:
| For the price of 8 decade old Titan X cards, someone could
| pick up a single modern GPU with 16GB or more of RAM.
| philipkiely wrote:
| This comment made my day ty! Yeah definitely speaking from a
| datacenter perspective -- fastest piece of hardware I have in
| the parts drawer is probably my old iPhone 8.
| Aurornis wrote:
| They're widely available to rent.
|
| Unless you're running it 24/7 for multiple years, it's not
| going to be cost effective to buy the GPU instead of renting a
| hosted one.
|
| For personal use you wouldn't get a recent generation data
| center card anyway. You'd get something like a Mac Studio or
| Strix Halo and deal with the slower speed.
| varispeed wrote:
| I rented H100 for training a couple of times and I found that
| they couldn't do training at all. Same code worked fine on
| Mac M1 or RTX 5080, but on H100 I was getting completely
| different results.
|
| So I wonder what I could be doing wrong. In the end I just
| use RTX 5080 as my models fit neatly in the available RAM.
|
| * by not working at all, I mean the scripts worked, but
| results were wrong. As if H100 couldn't do maths properly.
| blueboo wrote:
| You might find $2.50 in change to use one for an hour though
| vonneumannstan wrote:
| >Just looked in the parts drawer at home and dont seem to have
| a $25,000 GPU for some inexplicable reason.
|
| It just means you CAN buy one if you want, as in they're in
| stock and "available", not that you can necessarily afford one.
| smcleod wrote:
| TensorRT-LLM is a right nightmare to set up and maintain. Good on
| them for getting it to work for them - but it's not for everyone.
| philipkiely wrote:
| We have built a ton of tooling on top of TRT-LLM and use it not
| just for LLMs but also for TTS models (Orpheus), STT models
| (Whisper), and embedding models.
| nektro wrote:
| > we were the clear leader running on NVIDIA GPUs for both
| latency and throughput per public data from real-world use on
| OpenRouter.
|
| Baseten: 592.6 tps
| Groq: 784.6 tps
| Cerebras: 4,245 tps
|
| still impressive work
| philipkiely wrote:
| Yeah the custom hardware providers are super good at TPS. Kudos
| to their teams for sure, and the demos of instant reasoning are
| incredibly impressive.
|
| That said, we are serving the model at its full 131K context
| window, and they are serving 33K max, which could matter for
| some edge case prompts.
|
| Additionally, NVIDIA hardware is much more widely available if
| you are scaling a high-traffic application.
| lagrange77 wrote:
| While you're here..
|
| Do you guys know a website that clearly shows which open-source
| LLM models run on / fit into a specific GPU (setup)?
|
| The best heuristic I could find for the necessary VRAM is Number
| of Parameters x (Precision / 8) x 1.2, from here [0].
|
| [0] https://medium.com/@lmpo/a-guide-to-estimating-vram-for-
| llms...
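|
| That heuristic as a throwaway function (parameter count in
| billions gives GB directly; the 1.2 overhead factor is the
| guide's, not a hard rule - bump it toward 2 for long contexts or
| concurrent requests):
|
|     def estimate_vram_gb(params_b, bits_per_weight=16, overhead=1.2):
|         return params_b * (bits_per_weight / 8) * overhead
|
|     print(estimate_vram_gb(120, bits_per_weight=4))  # ~72 GB
|     print(estimate_vram_gb(20, bits_per_weight=4))   # ~12 GB
|
| Mixed-precision MoE models (like gpt-oss with its MXFP4 experts)
| don't fit the formula neatly, which is part of why such
| calculators so often disagree with reality.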
| diggan wrote:
| Maybe I'm spoiled by having great internet connection, but I
| usually download the weights and try to run them via various
| tools (llama.cpp, LM Studio, vLLM and SGLang typically) and see
| what works. There seems to be so many variables involved
| (runners, architectures, implementations, hardware and so on)
| that none of the calculators I've tried so far been accurate,
| both in the way that they've over-estimated and under-estimated
| what I could run.
|
| So in the end, trying to actually run them seems to be the only
| fool-proof way of knowing for sure :)
| reactordev wrote:
| huggingface has this built in if you care to fill out your
| software and hardware profile here:
|
| https://huggingface.co/settings/local-apps
|
| Then on the model pages, it will show you whether you can use
| it.
| diggan wrote:
| Interesting, never knew about that! I filled out my details,
| then went to https://huggingface.co/openai/gpt-oss-120b but
| I'm not sure if I see any difference? Where is it supposed to
| show if I can run it or not?
| reactordev wrote:
| You'll see green check next to models you can use on the
| model card.
|
| https://huggingface.co/unsloth/gpt-oss-20b-GGUF
| diggan wrote:
| Ah, it only works for GGUF, not for .safetensors (which is
| the format HuggingFace themselves came up with :P)? I
| see the checks at https://huggingface.co/unsloth/gpt-
| oss-20b-GGUF but nothing at
| https://huggingface.co/openai/gpt-oss-120b, seems a bit
| backwards.
| philipkiely wrote:
| Yeah, we have tried to build calculators before; it just depends
| so much.
|
| Your equation is roughly correct, but I tend to multiply by a
| factor of 2 not 1.2 to allow for highly concurrent traffic.
| lagrange77 wrote:
| Thanks for your answers!
|
| While it is seemingly hard to calculate it, maybe one should
| just make a database website that tracks specific setups
| (model, exact variant / quantisation, runner, hardware) where
| users can report which combinations they got running (or not),
| along with metrics like tokens/s.
|
| Visitors could then specify their runner and hardware and
| filter for a list of models that would run on that.
| diggan wrote:
| Yeah, what you're suggesting sounds like it could be more
| useful than the "generalized calculators" people are
| currently publishing and using.
| philipkiely wrote:
| Went to bed with 2 votes, woke up to this. Thank you so much HN!
| radarsat1 wrote:
| Would love to try fully local agentic coding. Is it feasible yet?
| I have a laptop with a 3050 but that's not nearly enough VRAM, I
| guess. Still, would be interested to know what's possible today
| on reasonable consumer hardware.
| zackangelo wrote:
| GPT-OSS will run even faster on Blackwell chips because of its
| hardware support for fp4.
|
| If anyone is working on training or inference in Rust, I'm
| currently working on adding fp8 and fp4 support to cudarc[0] and
| candle[1]. This is being done so I can support these models in
| our inference engine for Mixlayer[2].
|
| [0] https://github.com/coreylowman/cudarc/pull/449 [1]
| https://github.com/huggingface/candle/pull/2989 [2]
| https://mixlayer.com
| diggan wrote:
| Ah, interesting. As someone with a RTX Pro 6000, is it ready
| today to be able to run gpt-oss-120b inference, or are there
| still missing pieces? Both linked PRs seems merged already, so
| unsure if it's ready to be played around with or not.
| mikewarot wrote:
| You know what's actually hard to find in all this? The actual
| dimensions of the arrays in the model GPT-OSS-120B. At least with
| statically typed languages, you know how big your arrays are at a
| glance. I'm trying to find it in the GitHub repo[1], and I'm not
| seeing it.
|
| I'm just trying to figure out how wide the datastream through
| this is, in particular, the actual data (not the weights) that
| flow through all of it. The width of the output stream. Just how
| big is a token at the output, prior to reducing it with
| "temperature" to a few bytes?
|
| Assume infinitely fast compute in a magic black box, but you have
| to send the output through gigabit ethernet... what's the maximum
| number of tokens per second?
|
| [1] https://github.com/openai/gpt-oss/tree/main/gpt_oss
| amluto wrote:
| What's the application where you want to stream out the logits
| for each consecutive token while still sampling each token
| according to the usual rule? Keep in mind that, if you are
| doing the usual clever tricks like restricting the next token
| sampled to something that satisfies a grammar, you need to
| process the logits and _sample them and return a token_ before
| running the next round of inference.
| mikewarot wrote:
| I know the actual output of the model is wider than a
| token.... but I can't find it (the actual width, or number of
| bytes) in the source. Perhaps it's my very casual familiarity
| with Python that's limiting me, but I don't see any actual
| declarations of array sizes anywhere in the code.
|
| I'm just trying to calculate the actual bandwidth required
| for the full output of the model, not just a token to be
| handed off to the user.
|
| I need this so I can compute just what bandwidth a fully FPGA
| (later ASIC) based implementation of the model would result
| in.
|
| Edit/Append: I asked GPT-5, and it estimated:
| Total bytes = 50,000 tokens x 4 bytes/token = 200,000 bytes
|
| Which sounds about right to me. This yields a maximum of
| about 500 logits/second on Gigabit ethernet.
|
| The actual compute of the model is peanuts compared to just
| shuffling the data around.
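|
| Sanity-checking that estimate with the comment's own assumptions
| (50k-entry vocabulary, fp32 logits; gpt-oss's o200k tokenizer is
| around 200k entries, which would cut the rate by roughly another
| 4x):
|
|     vocab, bytes_per_logit = 50_000, 4
|     bytes_per_position = vocab * bytes_per_logit     # 200 kB
|     gigabit_bytes_per_s = 125e6                      # line rate, no overhead
|     print(gigabit_bytes_per_s / bytes_per_position)  # ~625/s, ~500 with overhead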
| steeve wrote:
| According to https://huggingface.co/openai/gpt-
| oss-120b/blob/main/config....
|
| That's 2880 values (so multiply by dtype)
| OldfieldFund wrote:
| laughs in Cerebras
| adsharma wrote:
| What's the best number on vLLM and SGlang so far on H100?
|
| It's sad that MLPerf takes a long time to catch up to SOTA
| models.
___________________________________________________________________
(page generated 2025-08-07 23:01 UTC)