[HN Gopher] Running GPT-OSS-120B at 500 tokens per second on Nvi...
       ___________________________________________________________________
        
       Running GPT-OSS-120B at 500 tokens per second on Nvidia GPUs
        
       Author : philipkiely
       Score  : 231 points
       Date   : 2025-08-07 02:28 UTC (20 hours ago)
        
 (HTM) web link (www.baseten.co)
 (TXT) w3m dump (www.baseten.co)
        
       | tmshapland wrote:
       | Such a fascinating read. I didn't realize how much massaging
       | needed to be done to get the models to perform well. I just sort
       | of assumed they worked out of the box.
        
         | acters wrote:
          | Personally, I think bigger companies should be more proactive
          | and work with the popular inference engine devs to get their
          | special snowflake LLM working before it gets released. I guess
          | it is all very much experimental at the end of the day. Those
          | devs are doing God's work so we can run these models on our
          | budget-friendly hardware.
        
           | eric-burel wrote:
            | SMEs are starting to want local LLMs and it's a nightmare to
            | figure out what hardware works for which models. I am asking
            | devs in my hometown to literally visit their installs to
            | find combos that work.
        
             | CMCDragonkai wrote:
             | Are you installing them onsite?
        
               | eric-burel wrote:
                | Some are asking for that, yeah, but I haven't run an
                | install yet; I am documenting the process. This is a
                | last resort: hosting on a European cloud is more
                | efficient, but some companies don't even want to hear
                | about cloud hosting.
        
           | mutkach wrote:
            | This is a good take, actually. GPT-OSS is not much of a
            | snowflake (judging by the model's architecture card at
            | least), but TRT-LLM treats every model like one - there is
            | too much hardcoded, model-specific logic - which makes it
            | very difficult to just use out-of-the-box for the hottest
            | SotA thing.
        
             | diggan wrote:
             | > GPT-OSS is not much of a snowflake
             | 
              | Yeah, according to the architecture it doesn't seem like a
              | snowflake, but they also decided to invent a new
              | prompting/conversation format
              | (https://github.com/openai/harmony), which definitely makes
              | it a bit of a snowflake today: you can't just use what
              | worked a couple of days ago; everyone needs to add proper
              | support for it.
        
           | diggan wrote:
            | This is literally what they did for GPT-OSS; it seems there
            | was coordination with OpenAI to get day-1 support in the
            | inference engines.
        
       | magicalhippo wrote:
       | Maybe I'm especially daft this morning but I don't get the point
       | of the speculative decoding.
       | 
       | How does the target model validate the draft tokens without
       | running the inference as normal?
       | 
       | Because if it is doing just that, I don't get the point as you
       | can't trust the draft tokens before they are validated, so you're
       | still stuck waiting for the target model.
        
         | joliu wrote:
         | It does run inference, but on the batch of tokens that were
         | drafted, akin to the prefill phase.
         | 
         | So your draft model can decode N new tokens, then the real
         | model does one inference pass to score the N new drafted
         | tokens.
         | 
         | Prefill is computation bound whereas decode is bandwidth bound,
         | so in practice doing one prefill over N tokens is cheaper than
         | doing N decode passes.
        
         | furyofantares wrote:
         | Not an expert, but here's how I understand it. You know how
         | input tokens are cheaper than output tokens? It's related to
         | that.
         | 
         | Say the model so far has "The capital of France". The small
         | model generates "is Paris.", which let's say is 5 tokens.
         | 
         | You feed the large model "The capital of France is Paris." to
         | validate all 5 of those tokens in a single forward pass.
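          | 
          | Roughly, the greedy version of that check looks like this (a
          | minimal sketch; big/small objects with a logits(tokens) method
          | are made-up stand-ins, not any real library's API):
          | 
          |   def speculative_step(big, small, prompt, k=5):
          |       # 1. draft k tokens cheaply, one at a time
          |       draft = list(prompt)
          |       for _ in range(k):
          |           draft.append(small.logits(draft)[-1].argmax())
          |       # 2. one big-model pass scores every position at once
          |       preds = big.logits(draft).argmax(axis=-1)
          |       # 3. keep drafted tokens until the first mismatch
          |       out = list(prompt)
          |       for i in range(len(prompt), len(draft)):
          |           if draft[i] != preds[i - 1]:
          |               out.append(preds[i - 1])   # corrected, free
          |               return out
          |           out.append(draft[i])
          |       out.append(preds[-1])    # all accepted: bonus token
          |       return out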
        
           | ahmedfromtunis wrote:
            | But what would happen if the small model's prediction was "is
            | Rome."? Wouldn't that result in costlier inference if the
            | small model is "wrong" more often than it is correct?
            | 
            | Also, if the small model were sufficiently more "correct"
            | than "wrong", wouldn't it be more efficient to get rid of the
            | large model at that point?
        
             | cwyers wrote:
              | So, the way speculative decoding works, the large model
              | takes over at the first wrong token, so you still get 'is'
              | for free.
        
             | acters wrote:
              | I believe that is exactly the downside of using speculative
              | decoding, which is why it is very important to size the two
              | models properly relative to each other: the small one has
              | to be big enough to be mostly correct while also being
              | dramatically faster than the larger one. The larger one
              | also has to be fast enough that catching mistakes won't
              | introduce too many random delays. And if the small one is
              | incorrect, having the larger one correct the mistake is
              | miles better than leaving incorrect output in.
              | 
              | It is about getting better quality than the small model
              | alone while keeping most of its speed. The tradeoff is that
              | you consume more memory from having two models loaded
              | instead of just one.
              | 
              | If memory were the only concern, it would make more sense
              | to just run the smaller model on its own.
        
               | acters wrote:
                | Another caveat with this method is that the larger and
                | smaller models need to behave very similarly, because a
                | lot of the savings come from the draft model generating
                | the predictable fluff around each detail: grammar,
                | formatting and the connective words between ideas.
                | 
                | Unsurprisingly, gpt-oss ships larger and smaller models
                | that behave very similarly! They are close enough that
                | getting a few draft tokens wrong doesn't drag performance
                | down to the speed of the larger model alone (the worst
                | case with this setup). The goal is to run at the smaller
                | model's speed as much as possible.
        
             | imtringued wrote:
             | You're forgetting that some sequences are more predictable
             | than others, hence the name "speculative" decoding. Let's
             | say your token encoding has 128k tokens. That means the
             | model has to pick the right token out of 128k. Some of
             | those tokens are incredibly rare, while others are super
             | common. The big model has seen the rare tokens many more
             | times than the small model. This means that the small model
             | will be able to do things like produce grammatically
             | correct English, but not know anything about a specific JS
             | framework.
             | 
              | The post-training fine-tuning costs (low thousands of
              | dollars) are the main reason why speculative decoding is
              | relatively unpopular. The most effective speculative
              | decoding strategies require you to train multiple
              | prediction heads a la Medusa (or whatever succeeded it). If
              | you don't do any fine-tuning, then the probability of the
              | small model being useful is slim. Using a random model as
              | your draft model will probably give you very disappointing
              | results.
        
           | isoprophlex wrote:
           | but... do you get any validation during the forward pass? the
           | small model could just as well have generated "is Berlin." or
           | whatever. do these models somehow give you a likelihood for
           | the next token when you're prefilling, that you can compare
           | against? if so why not just... use that always?
           | 
           | or is this a scenario where computation is expensive but
           | validation is cheap?
           | 
           | EDIT: thanks, people, for educating me! very insightful :)
        
             | sanxiyn wrote:
              | Yes, models give likelihoods you can compare against. No,
              | you can't do that without drafting, because the likelihood
              | of token N+2 depends on token N+1. That is, you get
              | P(is | The capital of France) and
              | P(Berlin | The capital of France is), but for the latter
              | you need to give "is" as input; you can't do
              | P(Berlin | The capital of France _).
        
             | shikon7 wrote:
             | Yes, the forward pass does a next token prediction on all
             | input tokens (so we know exactly how many tokens from the
             | small model matched). The expensive thing is not the
             | computation, but the memory bandwidth, as each pass needs
             | to load the model from memory.
             | 
             | If the small model predicts some tokens correctly, you save
             | some passes, at the expense of doing some extra
             | computations when the tokens were not correct.
             | 
             | In any case, each forward pass will give at least one new
             | token.
        
             | pama wrote:
             | If you want to go down the rabbit hole of the state of the
             | art, I recommend the EAGLE3 paper:
             | https://arxiv.org/abs/2503.01840
        
         | cristoperb wrote:
         | My simplified understanding: The target model can validate the
         | draft tokens all at once, in a single forward pass. The output
         | of that forward pass is a list of probabilities for each draft
         | token which are compared to the probabilities produced by the
         | draft model. If the target model's probabilities are the same
         | or greater than the draft model, the tokens are accepted. Worst
         | case none of the draft tokens are accepted and instead the
         | target model selects the single next token as usual.
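          | 
          | For completeness, the acceptance rule from the speculative
          | sampling papers keeps the target model's distribution exact:
          | accept drafted token x with probability
          | min(1, p_target[x] / p_draft[x]), else resample from the
          | leftover probability mass. Rough sketch:
          | 
          |   import numpy as np
          | 
          |   def accept_or_resample(p_target, p_draft, x, rng):
          |       # p_target, p_draft: probability vectors over the vocab
          |       if rng.random() < min(1.0, p_target[x] / p_draft[x]):
          |           return x, True              # keep the draft token
          |       resid = np.maximum(p_target - p_draft, 0.0)
          |       resid /= resid.sum()            # renormalize leftovers
          |       return rng.choice(len(resid), p=resid), False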
        
         | robrenaud wrote:
          | I think your core misunderstanding is that you are assuming K
          | calls generating 1 token each cost the same as 1 call
          | processing K tokens. It is actually much more expensive to
          | generate serially than even in small batches.
        
         | porridgeraisin wrote:
         | Let's say I want to run f2(f1(x)) where f1 and f2 are both a
         | single pass through GPT4.
         | 
          | This takes 2 seconds, assuming 1 second per pass.
         | 
         | What I instead do is kick off f1(x) in another thread, and then
         | run f2(g1(x)) where g1 is one pass through GPT-nano.
         | 
         | This takes 1 + 0.1 seconds, assuming gpt nano takes 0.1s for
         | every pass. In this 1.1 seconds, the f1(x) that we kicked off
         | in the 2nd thread would have finished (it takes 1 second).
         | 
         | So in 1.1 seconds we have available to us f1(x), f2(g1(x)), and
         | we store the intermediate g1(x) as well
         | 
         | We compare g1(x) and f1(x)
         | 
         | If they were equal, i.e g1(x) = f1(x), then we have our answer
         | = f2(g1(x)) in just 1.1s.
         | 
         | If they were not, we compute f2(output of f1(x) from 2nd
         | thread) which takes 1 further second, bringing our total to
         | 2.1s.
         | 
          | If the small model matches the big model in, say, 2/3 of
          | cases, you will spend 2/3 * 1.1 + 1/3 * 2.1 = 1.433s on average
         | for this computation. Without speculative decoding, it is
         | always 2s.
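          | 
          | As a function of the acceptance rate (same assumptions:
          | 1s per big-model pass, 0.1s per small-model pass):
          | 
          |   def expected_latency(p_accept, t_big=1.0, t_small=0.1):
          |       hit = t_small + t_big        # draft accepted: 1.1s
          |       miss = t_small + 2 * t_big   # draft rejected: 2.1s
          |       return p_accept * hit + (1 - p_accept) * miss
          | 
          |   expected_latency(2 / 3)   # ~1.43s vs a flat 2.0s without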
        
           | arkmm wrote:
           | This is a really great explanation.
        
           | magicalhippo wrote:
           | Thanks, very nice explanation, that makes perfect sense. I
           | guess their graphics confused me for some reason and had me
           | thinking all wrong.
           | 
           | Now I see they tried to point out the obvious thing which is
           | to predict multiple tokens ahead, not just two as in your
           | example.
        
         | bhaney wrote:
         | > How does the target model validate the draft tokens without
         | running the inference as normal?
         | 
         | It does run the inference as normal, just in parallel with the
         | other inferences
         | 
         | > if it is doing just that, I don't get the point
         | 
          | Running inferences in parallel allows you to read the model
          | weights out of memory only once for N parallel inferences, as
          | opposed to reading them out of memory N times for N serial
          | inferences. Inference is massively bottlenecked by memory
          | bandwidth, to the tune of one or two orders of magnitude
          | compared to compute, so this helps a lot.
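          | 
          | Back-of-envelope (made-up but plausible numbers: ~3 TB/s of
          | HBM on an H100, and call it W GB of weights touched per
          | decode step):
          | 
          |   hbm_gb_s = 3000
          |   w_gb = 60                    # dense worst case
          |   t_read = w_gb / hbm_gb_s     # ~0.02s per full pass
          |   # 5 serial decode steps: ~5 * t_read of weight traffic
          |   # verifying 5 drafted tokens in one pass: ~1 * t_read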
        
           | littlestymaar wrote:
           | > Inference is massively bottlenecked by memory bandwidth to
           | the tune of one or two orders of magnitude compared to
           | compute, so this helps a lot.
           | 
           | Nitpick: it's only bottlenecked by memory bandwidth if the
           | batch size is too low (that is: if you don't have many users
           | calling the same model in parallel).
           | 
           | Speculative decoding is just a way of running a single query
           | as if it was parallel queries.
        
         | jlebar wrote:
         | Just want to suggest: Ask an LLM about it! If you have access
         | to a reasoning model like o3, I've found it to be very helpful.
         | 
         | I think this answer is as good as any of the human-generated
         | ones in the thread so far, but the real power is that you can
         | ask it follow-up questions.
         | https://chatgpt.com/share/6894504f-4458-8008-a8c9-f371588259...
        
       | modeless wrote:
       | What's the best speed people have gotten on 4090s?
        
         | ActorNightly wrote:
          | You can't fit the model into a 4090 without quantization; it's
          | like 64 gigs.
          | 
          | For home use, Gemma 27B QAT is king. It's almost as good as
          | DeepSeek R1.
        
           | modeless wrote:
           | The 20B one fits.
        
             | steinvakt2 wrote:
             | Does it fit on a 5080 (16gb)?
        
               | jwitthuhn wrote:
               | Haven't tried myself but it looks like it probably does.
               | The weight files total 13.8 GB which gives you a little
               | left over to hold your context.
        
               | northern-lights wrote:
               | It fits on a 5070TI, so should fit on a 5080 as well.
        
           | SirMaster wrote:
           | You don't really need it to fit all in VRAM due to the
           | efficient MoE architecture and with llama.cpp
           | 
           | The 120B is running at 20 tokens/sec on my 5060Ti 16GB with
           | 64GB of system ram. Now personally I find 20 tokens/sec quite
           | usable, but for some maybe it's not enough.
        
         | asabla wrote:
          | I'm on a 5090 so it's not an apples-to-apples comparison. But
          | I'm getting ~150 t/s for the 20B version using ~16000 context
          | size.
        
           | modeless wrote:
           | Cool, what software?
        
             | asabla wrote:
             | Initial testing has only been done with ollama. Plan on
             | testing out llama.cpp and vllm when there is enough time
        
           | steinvakt2 wrote:
           | And flash attention doesn't work on 5090 yet, right? So
           | currently 4090 is probably faster, or?
        
             | PeterStuer wrote:
             | I don't think the 4090 has native 4bit support, which will
             | probably have a significant impact.
        
             | diggan wrote:
             | > And flash attention doesn't work on 5090 yet, right?
             | 
             | Flash attention works with GPT-OSS + llama.cpp (tested on
             | 1d72c8418) and other Blackwell card (RTX Pro 6000) so I
             | think it should work on 5090 as well, it's the same
             | architecture after all.
        
       | littlestymaar wrote:
       | Very fast "Sorry I can't help with that" generator.
        
         | jeffhuys wrote:
         | Just "liberate" it
        
       | sarthaksoni wrote:
       | Reading this made me realize how easy it is to set up GPT-OSS 20B
       | in comparison. I had it running on my Mac in five minutes, thanks
       | to Llama.
        
         | DrPhish wrote:
          | It's also easy to do 120b on CPU if you have the resources. I
          | had 120b running on my home LLM CPU inference box in about as
          | long as it took to download the GGUFs, git pull and rebuild
          | llama-server. I had it running at 40 t/s with zero effort and
          | 50 t/s with a bit of tweaking. It's just too bad that even the
          | 120b isn't really worth running compared to the other models
          | that are out there.
         | 
         | It really is amazing what ggerganov and the llama.cpp team have
         | done to democratize LLMs for individuals that can't afford a
         | massive GPU farm worth more than the average annual salary.
        
           | wkat4242 wrote:
            | What hardware do you have? 50 tok/s is really impressive for
            | CPU.
        
             | DrPhish wrote:
             | 2xEPYC Genoa w/768GB of DDR5-4800 and an A5000 24GB card. I
             | built it in January 2024 for about $6k and have thoroughly
             | enjoyed running every new model as it gets released. Some
             | of the best money I've ever spent.
        
               | wkat4242 wrote:
               | Wow nice!! That's a really good deal for that much
               | hardware.
               | 
               | How many tokens/s do you get for DeepSeek-R1?
        
               | DrPhish wrote:
               | Thanks, it was a bit of a gamble at the time (lots of
               | dodgy ebay parts), but it paid off.
               | 
               | R1 starts at about 10t/s on an empty context but quickly
               | falls off. I'd say the majority of my tokens are
               | generating around 6t/s.
               | 
               | Some of the other big MoE models can be quite a bit
               | faster.
               | 
               | I'm mostly using QwenCoder 480b at Q8 these days for 9t/s
               | average. I've found I get better real-world results out
               | of it than K2, R1 or GLM4.5.
        
               | testaburger wrote:
                | Which specific EPYC models? And if it's not too much to
                | ask, which motherboard and power supply? I'm really
                | interested in building something similar.
        
               | smartbit wrote:
               | Looking at
               | https://news.ycombinator.com/submitted?id=DrPhish it's
               | probably this machine https://rentry.co/miqumaxx
                | * Gigabyte MZ73-LM1 with two AMD EPYC Genoa 9334 QS
                |   64c/128t
                | * 24 sticks of M321R4GA3BB6-CQK 32GB DDR5-4800 RDIMM
                |   PC5-38400R
                | * 24GB A5000
               | 
               | Note that the RAM price almost doubled since Jan 2024
        
               | ekianjo wrote:
               | thats a r/localllama user right there
        
               | fouc wrote:
               | I've seen some mentions of pure-cpu setups being
               | successful for large models using old epyc/xeon
               | workstations off ebay with 40+ cpus. Interesting
               | approach!
        
             | SirMaster wrote:
             | I'm getting 20 tokens/sec on the 120B model with a 5060Ti
             | 16GB and a regular desktop Ryzen 7800x3d with 64GB of
             | DDR5-6000.
        
               | wkat4242 wrote:
               | Wow that's not bad. It's strange, for me it is much much
               | slower on a Radeon Pro VII (also 16GB, with a memory
               | bandwidth of 1TB/s!) and a Ryzen 5 5600 with also 64GB.
               | It's basically unworkably slow. Also, I only get 100% CPU
               | when I check ollama ps, the GPU is not being used at all
               | :( It's also counterproductive because the model is just
               | too large for 64GB.
               | 
               | I wonder what makes it work so well on yours! My CPU
               | isn't much slower and my GPU probably faster.
        
               | magicalhippo wrote:
               | AMD basically decided they wanted to focus on HPC and
               | data center customers rather than consumers, and so GPGPU
               | driver support for consumer cards has been non-existing
               | or terrible[1].
               | 
               | [1]: https://github.com/ROCm/ROCm/discussions/3893
        
           | exe34 wrote:
           | I imagine the gguf is quantised stuff?
        
             | DrPhish wrote:
             | No, I'm running the unquantized 120b
        
         | amelius wrote:
         | Why is it hard to set up llms? You can just ask an llm to do it
         | for you, no? If this relatively simple task is already too much
         | for llms then what good are they?
        
           | diggan wrote:
            | In the case of the GPT-OSS models, the worst (most time
            | consuming) part of supporting them is the new format they've
            | trained the model with, "OpenAI harmony". In my own clients I
            | couldn't just replace the model and call it a day; I'm still
            | working on getting them to work correctly with tool
            | calling...
        
         | CraigRood wrote:
         | I was playing with it yesterday and every single session gave
         | me factually incorrect information.
         | 
         | Speed and ease of use is one thing, but it shouldn't be at the
         | cost of accuracy.
        
           | OliverGuy wrote:
            | If you are trying to get facts out of an LLM you are using it
            | wrong. If you want a fact, it should use a tool (e.g. web
            | search, RAG, etc.) to get the information that contains the
            | fact (Wikipedia page, documentation, etc.) and then parse
            | that document for the fact and return it to you.
        
         | LoganDark wrote:
         | 120B is pretty easy to run too, if you have enough memory.
        
       | eric-burel wrote:
       | "Encourage Open-Source and Open-Weight AI" is the part just after
       | "Ensure that Frontier AI Protects Free Speech and American
       | Values" in America's AI Action Plan. I know this is not rational
       | but OpenAI OSS models kinda give me chills as I am reading the
       | Plan in parallel. Anyway I like seeing oss model providers
       | talking about hardware, because that's a limiting point for most
       | developers that are not familiar with this layer.
        
         | geertj wrote:
         | > Ensure that Frontier AI Protects Free Speech and American
         | Values
         | 
          | I am in the early phases of collecting my thoughts on this
          | topic so bear with me, but is this a bad thing?
         | 
         | AI models will have a world view. I think I prefer them having
         | a western world view, as that has built our modern society and
         | has proven to be most successful in making the lives of people
         | better.
         | 
         | At the very minimum I would want a model to document its world
         | view, and be aligned to it so that it does not try to socially
         | engineer me to surreptitiously change mine.
        
           | petesergeant wrote:
            | > but is this a bad thing?
           | 
           | I think the worry is that there's no fixed definitions here,
           | so the executive can use this to exert partisan or
           | ideological pressure on model providers.
           | 
           | Every four years the models get RLHF'd to switch between
           | thinking guns are amazing vs thinking guns are terrible.
        
           | exe34 wrote:
           | > I think I prefer them having a western world view,
           | 
           | What worries me is that the current "western world view" of
           | America is not the same as the western world view we've
           | shared with them since the cold war. The trend is towards the
           | same kind of values and behaviour we see in the Islamic
           | Republic and the Russian Federation. If that sort of "western
           | world view" gets baked into the intelligent infrastructure,
            | it may be very hard to change course in the future. For
            | example, dissidence and wrongthink are going to get harder
            | and harder.
        
           | AesopAerial wrote:
           | > I think I prefer them having a western world view, as that
           | has built our modern society and has proven to be most
           | successful in making the lives of people better.
           | 
           | Highly debatable, and most people anywhere would probably say
           | the same thing about whatever world view they hold.
        
           | ben_w wrote:
           | "Western" != "American": I grew up in a country where even
           | the police are not, and do not wish to be, routinely armed.
           | 
           | Even then, there is an important difference between de-facto
           | and de-jure rules. Fun fact: even North Korea has a
            | constitutional guarantee of freedom of speech and the right
            | to vote*. They don't _do_ these things as we would understand
           | any of those words, but they have those things right there in
           | the constitution.
           | 
           | So: does the USA, as it exists today, represent the values
           | you want? Can you honestly say, hand on heart, that Alligator
           | Alcatraz should be a thing your AI has been trained to
           | support? Or that it's fine for Qatar to donate a 747 that
           | becomes part of the library of the current president, not the
           | office of the president, when his term in office comes to an
           | end?
           | 
           | I won't list everything, this isn't the place for that, but
           | even if we wind the clock back a few years, do you (/we) want
           | an AI aligned with a political circus of kayfabe that
           | distracts us from the real political machinations?
           | 
           | Of course, this is still USA-focused.
           | 
           | I'd say that what really made a difference to our quality of
           | life wasn't even the American political system: there were
           | massive improvements to human existence starting with the
           | first industrial revolution in the UK in the 1760s, but the
           | social and political nature of the world back then was so
           | bleak that communism got invented a century later and
            | introduced what were at the time controversial ideas like
           | "women are not property" and "universal free education is
           | good", and the USA's systems changed substantially several
           | times since then (at a minimum Civil War, New Deal, and the
           | Civil Rights movement).
           | 
           | The "meta system" that allows change can be considered good,
           | but not uniquely so if you compare this to the Russian
            | Revolution getting rid of the Tsars, and 40 years later they
           | were in orbit (and this _despite_ the Holodomor and WW2) and
           | then threw off these shackles with Glasnost and the fall of
           | the USSR (and note there that in Russia specifically, not all
           | the former soviet countries but specifically Russia, the
           | freedom gained _failed_ to bring material improvements and
           | the lives of those living through it were, in aggregate, made
           | worse despite that freedom), and similar stories with the
           | Chinese starting with dangerous incompetence (Four Pests
           | campaign) and now in a position where  "which is more
           | powerful, them or the USA?" is a matter of which measure you
           | use rather than it being obvious.
           | 
           | * https://en.wikipedia.org/wiki/Constitution_of_North_Korea#C
           | h...
        
           | eric-burel wrote:
           | Yeah I mean you'd want to take a look at the plan to get a
           | bigger picture, it reflects a specific set of values which
            | are not universally shared. This should lead to the
           | development of European models, but it feels inefficient to
           | duplicate the work in each country/region just because open
           | source models are planned to be used as trojan horses for
           | values.
        
       | hsaliak wrote:
       | TLDR: tensorrt
        
       | mutkach wrote:
       | > Inspired by GPUs, we parallelized this effort across multiple
       | engineers. One engineer tried vLLM, another SGLang, and a third
       | worked on TensorRT-LLM. We were able to quickly get TensorRT-LLM
       | working, which was fortunate as it is usually the most performant
       | inference framework for LLMs.
       | 
       | > TensorRT-LLM
       | 
        | It is usually the hardest to set up correctly and is often out
        | of date with respect to the relevant architectures. It also
        | requires compiling the model on the exact same hardware-drivers-
        | libraries stack as your production environment, which is a great
        | pain in the rear end to say the least. Multimodal setups have
        | also been a disaster - at least for a while - when it was near-
        | impossible to make them work even for mainstream models, like
        | the multimodal Llamas. The big question is whether it's worth
        | it, since running GPT-OSS-120B on an H100 using vLLM is flawless
        | in comparison - and the throughput stays at 130-140 t/s for a
        | single H100. (It's also somewhat of a clickbait title - I was
        | expecting to see 500 t/s for a single GPU, when in fact it's a
        | tensor-parallel setup.)
        | 
        | It's also funny that they went for a separate release of TRT-LLM
        | just to make sure that gpt-oss would work correctly. TRT-LLM is
        | a mess.
        
         | philipkiely wrote:
         | TRT-LLM has its challenges from a DX perspective and yeah for
         | Multi-modal we still use vLLM pretty often.
         | 
         | But for the kind of traffic we are trying to serve -- high
         | volume and latency sensitive -- it consistently wins head-to-
         | head in our benchmarking and we have invested a ton of dev work
         | in the tooling around it.
        
       | wcallahan wrote:
       | I just used GPT-OSS-120B on a cross Atlantic flight on my MacBook
       | Pro (M4, 128GB RAM).
       | 
        | A few things I noticed:
        | 
        | - it's only fast with small context windows and small total
        |   token counts; once you're past ~10k tokens you're basically
        |   queueing everything for a long time
        | - MCPs/web search/URL fetch have already become a very important
        |   part of interacting with LLMs; when they're not available the
        |   LLM's utility is greatly diminished
        | - a lot of CLI/TUI coding tools (e.g., opencode) were not
        |   working reliably offline at this time with the model, despite
        |   being set up prior to going offline
       | 
       | That's in addition to the other quirks others have noted with the
       | OSS models.
        
         | MoonObserver wrote:
         | M2 Max processor. I saw 60+ tok/s on short conversations, but
         | it degraded to 30 tok/s as the conversation got longer. Do you
         | know what actually accounts for this slowdown? I don't believe
         | it was thermal throttling.
        
           | summarity wrote:
           | Physics: You always have the same memory bandwidth. The
           | longer the context, the more bits will need to pass through
           | the same pipe. Context is cumulative.
        
             | VierScar wrote:
             | No I don't think it's the bits. I would say it's the
             | computation. Inference requires performing a lot of matmul,
             | and with more tokens the number of computation operations
             | increases exponentially - O(n^2) at least. So increasing
             | your context/conversation will quickly degrade performance
             | 
             | I seriously doubt it's the throughput of memory during
             | inference that's the bottleneck here.
        
               | zozbot234 wrote:
               | Typically, the token generation phase is memory-bound for
               | LLM inference in general, and this becomes especially
               | clear as context length increases (since the model's
               | parameters are a fixed quantity.) If it was pure compute
               | bound there would be huge gains to be had by shifting
               | some of the load to the NPU (ANE) but AIUI it's just not
               | so.
        
               | summarity wrote:
               | It literally is. LLM inference is almost entirely memory
               | bound. In fact for naive inference (no batching), you can
               | calculate the token throughput just based on the model
               | size, context size and memory bandwidth.
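                | 
                | Very roughly (this is just the memory-bandwidth
                | ceiling; it ignores compute and overhead, and the
                | KV-cache term is what grows with context):
                | 
                |   def ceiling_tok_s(active_gb, kv_gb, bw_gb_s):
                |       # bytes streamed per generated token:
                |       # active weights + KV cache read once
                |       return bw_gb_s / (active_gb + kv_gb)
                | 
                |   # illustrative numbers only, e.g. ~3 GB of
                |   # active weights on ~400 GB/s unified memory:
                |   ceiling_tok_s(3.0, 0.5, 400)   # ~114 tok/s
                |   ceiling_tok_s(3.0, 3.0, 400)   # ~67 tok/s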
        
               | zozbot234 wrote:
               | Prompt pre-processing (before the first token is output)
               | is raw compute-bound. That's why it would be nice if we
               | could direct llama.cpp/ollama to run that phase only on
               | iGPU/NPU (for systems without a separate dGPU, obviously)
               | and shift the whole thing over to CPU inference for the
               | latter token-generation phase.
               | 
               | (A memory-bound workload like token gen wouldn't usually
               | run into the CPU's thermal or power limits, so there
               | would be little or no gain from offloading work to the
               | iGPU/NPU in that phase.)
        
               | MereInterest wrote:
               | Nitpick: O(n^2) is quadratic, not exponential. For it to
               | "increase exponentially", n would need to be in the
               | exponent, such as O(2^n).
        
               | esafak wrote:
               | To contrast with exponential, the term is _power law_.
        
           | torginus wrote:
           | Inference takes quadratic amount of time wrt context size
        
         | conradev wrote:
         | Are you using Ollama or LMStudio/llama.cpp?
         | https://x.com/ggerganov/status/1953088008816619637
        
           | diggan wrote:
           | > LMStudio/llama.cpp
           | 
           | Even though LM Studio uses llama.cpp as a runtime, the
           | performance differs between them. With LM Studio 0.3.22 Build
           | 2 with CUDA Llama.cpp (Linux) v1.45.0 runtime I get ~86 tok/s
           | on a RTX Pro 6000, while with llama.cpp compiled from
           | 1d72c841888 (Aug 7 10:53:21 2025) I get ~180 tok/s, almost
           | 100 more per second, both running lmstudio-community/gpt-
           | oss-120b-GGUF.
        
             | esafak wrote:
             | Is it always like this or does it depend on the model?
        
               | diggan wrote:
                | Depends on the model. Each runner needs to implement
                | support when there are new architectures, and they all
                | seemingly focus on different things. As far as I've
                | gathered so far, vLLM focuses on inference speed, SGLang
                | on parallelizing across multiple GPUs, and Ollama on
                | being as fast out the door with their implementation as
                | possible, sometimes cutting corners; llama.cpp sits
                | somewhere in between Ollama and vLLM. Then LM Studio
                | seems to lag slightly behind with their llama.cpp usage,
                | so I'm guessing that's the difference between LM Studio
                | and building llama.cpp from source today.
        
         | XCSme wrote:
         | I know there was a downloadable version of Wikipedia (not that
         | large). Maybe soon we'll have a lot of data stored locally and
         | expose it via MCP, then the AIs can do "web search" locally.
         | 
         | I think 99% of web searches lead to the same 100-1k websites. I
         | assume it's only a few GBs to have a copy of those locally,
         | thus this raises copyright concerns.
        
           | Aurornis wrote:
           | The mostly static knowledge content from sites like Wikipedia
           | is already well represented in LLMs.
           | 
           | LLMs call out to external websites when something isn't
           | commonly represented in training data, like specific project
           | documentation or news events.
        
             | XCSme wrote:
             | That's true, but the data is only approximately represented
             | in the weights.
             | 
             | Maybe it's better to have the AI only "reason", and somehow
             | instantly access precise data.
        
               | adsharma wrote:
               | What use cases will gain from this architecture?
        
         | gigatexal wrote:
         | M3 Max 128GB here and it's mad impressive.
         | 
          | I'm spec'ing out a Mac Studio with 512GB of RAM because I can
          | window shop and wish, but I think the trend for local LLMs is
          | getting really good.
         | 
         | Do we know WHY openAI even released them?
        
           | diggan wrote:
           | > Do we know WHY openAI even released them?
           | 
            | Regulations, and trying to earn the goodwill of developers
            | using local LLMs, something that had been slowly eroding
            | since it was a while ago (GPT-2, 2019) that they last
            | released weights to the public.
        
           | Epa095 wrote:
            | If the new GPT-5 is actually better, then this OSS version is
            | not really a threat to OpenAI's income stream, but it can be
            | a threat to their competitors.
        
         | zackify wrote:
         | You didn't even mention how it'll be on fire unless you use low
         | power mode.
         | 
         | Yes all this has been known since the M4 came out. The memory
         | bandwidth is too low.
         | 
         | Try using it with real tasks like cline or opencode and the
         | context length is too long and slow to be practical
        
           | Aurornis wrote:
           | > Yes all this has been known since the M4 came out. The
           | memory bandwidth is too low.
           | 
           | The M4 Max with 128GB of RAM (the part used in the comment)
           | has over 500GB/sec of memory bandwidth.
        
         | radarsat1 wrote:
         | How long did your battery last?!
        
           | woleium wrote:
           | planes have power sockets now, but i do wonder how much jet
           | fuel a whole plane of gpus would consume in electricity
           | (assuming the system can handle it, which seems unlikely) and
           | air conditioning.
        
             | TimBurman wrote:
             | That's an interesting question. According to Rich and
             | Greg's Airplane Page[1], the A320 has three generators
             | rated for 90kVA continuous each, one per engine and a third
              | in the auxiliary power unit that isn't normally deployed.
             | Cruising demand is around 140 kVA of the 180 kVA supplied
             | by the engines, leaving 40 kVA to spare. The A380 has six
             | similar generators, two in reserve. They give the
             | percentages so you could calculate how much fuel each
             | system is consuming.
             | 
             | [1] https://alverstokeaviation.blogspot.com/2016/03/
             | 
             | This page also has a rendered image of the generator:
             | 
             | https://aviation.stackexchange.com/questions/43490/how-
             | much-...
        
         | mich5632 wrote:
          | I think this is the difference between compute-bound prefill (a
          | CPU has a high bandwidth/compute ratio) and decode. The time to
          | first token is below 0.5s - even for a 10k context.
        
         | fouc wrote:
         | What was your iogpu.wired_limit_mb set to? By default only ~70%
         | or ~90GB of your RAM will be available to your GPU cores unless
         | you change your wired limit setting.
        
       | blitzar wrote:
       | > widely-available H100 GPUs
       | 
       | Just looked in the parts drawer at home and dont seem to have a
       | $25,000 GPU for some inexplicable reason.
        
         | KolmogorovComp wrote:
         | available != cheap
        
           | blitzar wrote:
            | available /əˈveɪləbl/
           | 
           | adjective: available
           | 
           | able to be used or obtained; at someone's disposal
        
             | swexbe wrote:
             | You can rent one from most cloud providers for a few bucks
             | an hour.
        
               | koakuma-chan wrote:
               | Might as well just use openai api
        
               | ekianjo wrote:
               | thats not the same thing at all
        
               | poly2it wrote:
               | That depends on your intentions.
        
         | Kurtz79 wrote:
         | Does it even make sense calling them 'GPUs' (I just checked
         | NVIDIA product page for the H100 and it is indeed so)?
         | 
         | There should be a quicker way to differentiate between
         | 'consumer-grade hardware that is mainly meant to be used for
         | gaming and can also run LLMs inference in a limited way' and
         | 'business-grade hardware whose main purpose is AI training or
         | running inference for LLMs".
        
           | amelius wrote:
           | Well, does it come with graphics connectors?
        
             | OliverGuy wrote:
             | Nope, doesn't have any of the required hardware to even
             | process graphics iirc
        
               | diggan wrote:
               | Although the RTX Pro 6000 is not consumer-grade, it does
               | come with graphics ports (four Displayports) and does
               | render graphics like a consumer card :) So seems the
               | difference between the segments is becoming smaller, not
               | bigger.
        
               | simpleintheory wrote:
               | That's because it's intended as a workstation GPU not one
               | used in servers
        
               | diggan wrote:
                | Sure, but it still sits in the "business-grade hardware
                | whose main purpose is AI training or running inference
                | for LLMs" segment the parent mentioned, yet it has
                | graphics connectors, so all I'm saying is that just
                | looking at that won't help you understand what segment
                | the GPU goes into.
        
           | blitzar wrote:
            | We are fast approaching the return of the _math coprocessor_.
            | In fashion they say that trends tend to reappear roughly
            | every two decades; it's overdue.
        
             | egorfine wrote:
             | Yeah I would love for Nvidia to introduce faster update
             | cycle to their hardware, so that we'll have models like
             | "H201", "H220", etc.
             | 
             | I think it will also make sense to replace "H" with a brand
             | number, sort of like they already do for customer GPUs.
             | 
             | So then maybe one day we'll have a math coprocessor called
             | "Nvidia 80287".
        
             | beAbU wrote:
              | I remember building high-end workstations for a summer
              | job in the 2000s, where I had to fit Tesla cards in the
              | machines. I don't remember what their device names were; we
              | just called them Tesla cards.
             | 
             | "Accelerator card" makes a lot of sense to me.
        
             | WithinReason wrote:
             | It's called a tensorcore and it's in most GPUs
        
           | genewitch wrote:
           | "GPGPU" was something from over a decade ago; for general
           | purpose GPU computing
        
             | hnuser123456 wrote:
             | Yeah, Crysis came out in 2007 and could run physics on the
             | GPU.
        
           | addandsubtract wrote:
           | We could call the consumer ones GFX cards, and keep GPU for
           | the matrix multiplying ones.
        
             | beAbU wrote:
             | GPU stands for "graphics processing unit" so I'm not sure
             | how your suggestion solves it.
             | 
             | Maybe renaming the device to an MPU, where the M stands for
             | "matrix/math/mips" would make it more semantically correct?
        
               | rebolek wrote:
               | I think that G was changed to "general", so now it's
               | "general processing unit".
        
               | rpdillon wrote:
               | This doesn't seem to be true at all. It's a highly
               | specialized chip for doing highly parallel operations.
               | There's nothing general about it.
               | 
               | I looked around briefly and could find no evidence that
               | it's been renamed. Do you have a source?
        
               | fouc wrote:
               | CPU is already the general (computing) processing unit so
               | that wouldn't make sense
        
           | codedokode wrote:
           | By the way I wonder, what has more performance, a $25 000
           | professional GPU or a bunch of cheaper consumer GPUs costing
           | $25 000 in total?
        
             | omneity wrote:
              | Consumer GPUs in theory, and by a large margin (10 5090s
              | will eat an H100's lunch with 6 times the bandwidth, 3x the
              | VRAM and a relatively similar compute ratio), but your
              | bottleneck is the interconnect, and that is intentionally
              | crippled to avoid Beowulf GPU clusters eating into their
              | datacenter market.
             | 
             | Last consumer GPU with NVLink was the RTX 3090. Even the
             | workstation-grade GPUs lost it.
             | 
             | https://forums.developer.nvidia.com/t/rtx-a6000-ada-no-
             | more-...
        
               | sigbottle wrote:
                | H100s also have custom async WGMMA instructions, among
                | other things. From what I understand, the async
                | instructions at least formalize the notion of
                | pipelining, which engineers were already using
                | implicitly, because to optimize memory accesses you're
                | effectively trying to overlap them in that kind of
                | optimal parallel manner.
        
           | AlphaSite wrote:
           | I think apple calls them NPUs and Broadcom calls them XPUs.
           | Given they're basically the number 2 and 3 accelerator
           | manufacturers one of those probably works.
        
           | washadjeffmad wrote:
           | I just specify SXM (node) when I want to differentiate from
           | PCIe. We have H100s in both.
        
         | lopuhin wrote:
          | You can rent them for less than $2/h in a lot of places (maybe
          | not in the drawer).
        
         | dougSF70 wrote:
          | With Ollama I got the 20B model running on 8 Titan X cards
          | (2015). Ollama distributed the model so that the 15GB of VRAM
          | required was split evenly across the 8 cards. The tok/s were
          | faster than reading speed.
        
           | Aurornis wrote:
           | For the price of 8 decade old Titan X cards, someone could
           | pick up a single modern GPU with 16GB or more of RAM.
        
         | philipkiely wrote:
         | This comment made my day ty! Yeah definitely speaking from a
         | datacenter perspective -- fastest piece of hardware I have in
         | the parts drawer is probably my old iPhone 8.
        
         | Aurornis wrote:
         | They're widely available to rent.
         | 
         | Unless you're running it 24/7 for multiple years, it's not
         | going to be cost effective to buy the GPU instead of renting a
         | hosted one.
         | 
         | For personal use you wouldn't get a recent generation data
         | center card anyway. You'd get something like a Mac Studio or
         | Strix Halo and deal with the slower speed.
        
           | varispeed wrote:
           | I rented H100 for training a couple of times and I found that
           | they couldn't do training at all. Same code worked fine on
           | Mac M1 or RTX 5080, but on H100 I was getting completely
           | different results.
           | 
           | So I wonder what I could be doing wrong. In the end I just
           | use RTX 5080 as my models fit neatly in the available RAM.
           | 
           | * by not working at all, I mean the scripts worked, but
           | results were wrong. As if H100 couldn't do maths properly.
        
         | blueboo wrote:
         | You might find $2.50 in change to use one for an hour though
        
         | vonneumannstan wrote:
         | >Just looked in the parts drawer at home and dont seem to have
         | a $25,000 GPU for some inexplicable reason.
         | 
         | It just means you CAN buy one if you want, as in they're in
         | stock and "available", not that you can necessarily afford one.
        
       | smcleod wrote:
        | TensorRT-LLM is a right nightmare to set up and maintain. Good on
        | them for getting it to work for them - but it's not for everyone.
        
         | philipkiely wrote:
         | We have built a ton of tooling on top of TRT-LLM and use it not
         | just for LLMs but also for TTS models (Orpheus), STT models
         | (Whisper), and embedding models.
        
       | nektro wrote:
       | > we were the clear leader running on NVIDIA GPUs for both
       | latency and throughput per public data from real-world use on
       | OpenRouter.
       | 
        | Baseten: 592.6 tps
        | Groq: 784.6 tps
        | Cerebras: 4,245 tps
       | 
       | still impressive work
        
         | philipkiely wrote:
         | Yeah the custom hardware providers are super good at TPS. Kudos
         | to their teams for sure, and the demos of instant reasoning are
         | incredibly impressive.
         | 
         | That said, we are serving the model at its full 131K context
         | window, and they are serving 33K max, which could matter for
         | some edge case prompts.
         | 
         | Additionally, NVIDIA hardware is much more widely available if
         | you are scaling a high-traffic application.
        
       | lagrange77 wrote:
       | While you're here..
       | 
        | Do you guys know a website that clearly shows which open-source
        | LLMs run on / fit into a specific GPU (setup)?
        | 
        | The best heuristic I could find for the necessary VRAM is Number
        | of Parameters x (Precision / 8) x 1.2, from here [0].
       | 
       | [0] https://medium.com/@lmpo/a-guide-to-estimating-vram-for-
       | llms...
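        | 
        | That heuristic in code (just the rule of thumb from [0], not a
        | guarantee; real usage depends on context length, runner and
        | concurrency):
        | 
        |   def est_vram_gb(params_b, bits_per_weight, overhead=1.2):
        |       # weights + ~20% for KV cache, activations, runtime
        |       return params_b * (bits_per_weight / 8) * overhead
        | 
        |   est_vram_gb(120, 4)   # gpt-oss-120b at 4-bit: ~72 GB
        |   est_vram_gb(20, 4)    # gpt-oss-20b at 4-bit:  ~12 GB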
        
         | diggan wrote:
         | Maybe I'm spoiled by having great internet connection, but I
         | usually download the weights and try to run them via various
         | tools (llama.cpp, LM Studio, vLLM and SGLang typically) and see
          | what works. There seem to be so many variables involved
          | (runners, architectures, implementations, hardware and so on)
          | that none of the calculators I've tried so far have been
          | accurate; they've both over-estimated and under-estimated
          | what I could run.
         | 
         | So in the end, trying to actually run them seems to be the only
         | fool-proof way of knowing for sure :)
        
         | reactordev wrote:
         | huggingface has this built in if you care to fill out your
         | software and hardware profile here:
         | 
         | https://huggingface.co/settings/local-apps
         | 
         | Then on the model pages, it will show you whether you can use
         | it.
        
           | diggan wrote:
           | Interesting, never knew about that! I filled out my details,
           | then went to https://huggingface.co/openai/gpt-oss-120b but
           | I'm not sure if I see any difference? Where is it supposed to
           | show if I can run it or not?
        
             | reactordev wrote:
             | You'll see green check next to models you can use on the
             | model card.
             | 
             | https://huggingface.co/unsloth/gpt-oss-20b-GGUF
        
               | diggan wrote:
                | Ah, it only works for GGUF, not for .safetensors (which
                | is the format HuggingFace themselves came up with :P)? I
                | see the checks at https://huggingface.co/unsloth/gpt-
                | oss-20b-GGUF but nothing at
                | https://huggingface.co/openai/gpt-oss-120b, which seems a
                | bit backwards.
        
         | philipkiely wrote:
          | Yeah, we have tried to build calculators before; it just
          | depends on so much.
         | 
         | Your equation is roughly correct, but I tend to multiply by a
         | factor of 2 not 1.2 to allow for highly concurrent traffic.
        
         | lagrange77 wrote:
         | Thanks for your answers!
         | 
          | While it is seemingly hard to calculate, maybe one should
          | just make a database website that tracks specific setups
          | (model, exact variant / quantisation, runner, hardware) where
          | users can report which combinations they got running (or not),
          | along with metrics like tokens/s.
         | 
         | Visitors could then specify their runner and hardware and
         | filter for a list of models that would run on that.
        
           | diggan wrote:
           | Yeah, what you're suggesting sounds like it could be more
           | useful than the "generalized calculators" people are
           | currently publishing and using.
        
       | philipkiely wrote:
       | Went to bed with 2 votes, woke up to this. Thank you so much HN!
        
       | radarsat1 wrote:
       | Would love to try fully local agentic coding. Is it feasible yet?
       | I have a laptop with a 3050 but that's not nearly enough VRAM, I
       | guess. Still, would be interested to know what's possible today
       | on reasonable consumer hardware.
        
       | zackangelo wrote:
       | GPT-OSS will run even faster on Blackwell chips because of its
       | hardware support for fp4.
       | 
       | If anyone is working on training or inference in Rust, I'm
       | currently working on adding fp8 and fp4 support to cudarc[0] and
       | candle[1]. This is being done so I can support these models in
       | our inference engine for Mixlayer[2].
       | 
       | [0] https://github.com/coreylowman/cudarc/pull/449 [1]
       | https://github.com/huggingface/candle/pull/2989 [2]
       | https://mixlayer.com
        
         | diggan wrote:
         | Ah, interesting. As someone with a RTX Pro 6000, is it ready
         | today to be able to run gpt-oss-120b inference, or are there
         | still missing pieces? Both linked PRs seems merged already, so
         | unsure if it's ready to be played around with or not.
        
       | mikewarot wrote:
       | You know what's actually hard to find in all this? The actual
       | dimensions of the arrays in the model GPT-OSS-120B. At least with
       | statically typed languages, you know how big your arrays are at a
       | glance. I'm trying to find it in the GitHub repo[1], and I'm not
       | seeing it.
       | 
       | I'm just trying to figure out how wide the datastream through
       | this is, in particular, the actual data (not the weights) that
       | flow through all of it. The width of the output stream. Just how
       | big is a token at the output, prior to reducing it with
       | "temperature" to a few bytes?
       | 
       | Assume infinitely fast compute in a magic black box, but you have
       | to send the output through gigabit ethernet... what's the maximum
       | number of tokens per second?
       | 
       | [1] https://github.com/openai/gpt-oss/tree/main/gpt_oss
        
         | amluto wrote:
         | What's the application where you want to stream out the logits
         | for each consecutive token while still sampling each token
         | according to the usual rule? Keep in mind that, if you are
         | doing the usual clever tricks like restricting the next token
         | sampled to something that satisfies a grammar, you need to
         | process the logits and _sample them and return a token_ before
         | running the next round of inference.
        
           | mikewarot wrote:
           | I know the actual output of the model is wider than a
           | token.... but I can't find it (the actual width, or number of
           | bytes) in the source. Perhaps it's my very casual familiarity
           | with Python that's limiting me, but I don't see any actual
           | declarations of array sizes anywhere in the code.
           | 
           | I'm just trying to calculate the actual bandwidth required
           | for the full output of the model, not just a token to be
           | handed off to the user.
           | 
           | I need this so I can compute just what bandwidth a fully FPGA
           | (later ASIC) based implementation of the model would result
           | in.
           | 
           | Edit/Append: I asked GPT-5, and it estimated:
           | Total bytes = 50,000 tokens x 4 bytes/token = 200,000 bytes
           | 
           | Which sounds about right to me. This yields a maximum of
           | about 500 logits/second on Gigabit ethernet.
           | 
           | The actual compute of the model is peanuts compared to just
           | shuffling the data around.
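            | 
            | Sanity-checking that estimate (50k is GPT-5's guessed vocab
            | size above; gpt-oss's tokenizer is actually around 200k
            | entries, which would cut the rate by roughly 4x):
            | 
            |   vocab = 50_000
            |   bytes_per_vec = vocab * 4     # fp32 logits
            |   gige_bytes_s = 125e6          # 1 Gbit/s
            |   gige_bytes_s / bytes_per_vec  # ~625 vectors/s raw,
            |                                 # ~500 after overhead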
        
         | steeve wrote:
         | According to https://huggingface.co/openai/gpt-
         | oss-120b/blob/main/config....
         | 
         | That's 2880 values (so multiply by dtype)
        
       | OldfieldFund wrote:
       | laughs in Cerebras
        
       | adsharma wrote:
       | What's the best number on vLLM and SGlang so far on H100?
       | 
       | It's sad that MLPerf takes a long time to catch up to SOTA
       | models.
        
       ___________________________________________________________________
       (page generated 2025-08-07 23:01 UTC)